Using Evolutionary Trees for the Colorectal Cancer Prognosis Prediction

: The measurement of tree similarity based on structure comparison has been long used in diverse fields. We applied the evolutionary tree method to study the cancer genome. Cancer evolutionary trees, representing cancer diversity, provide information on the clonal evolution and the clinical outcome of cancer patients. This study considered 107 colorectal cancer (CRC) patients who received deep targeted sequencing of cancer tissues. The evolutionary trees of individual cancer patients were reconstructed from genome sequencing data based on variant allele frequencies (VAFs) of point mutations and small insertions or deletions (indels). The main purpose of this study was to predict cancer recurrence. We mapped the structure of a cancer evolutionary tree to a rooted tree and developed a canonical-form transformation for solving tree isomorphism to ensure that each patient has a unique tree structure. We proposed an algorithm for comparing tree similarity in terms of cost calculation between evolutionary structure trees. The cost was calculated using the node position, tree size (or number of nodes), tree height, node depth, number of descendants of the node (the size of the subtree with the node as a root), and relationship of the node with other nodes. After tree similarity comparison, the cancer patients were clustered into two groups through k -means clustering. The clustering information indicated that the evolutionary structure trees were associated with gender and tumor invasion stage. Several machine-learning strategies including random forest, support vector machine (SVM), bagging, and boosting were used to predict cancer recurrence in these patients. Our results revealed that combining the clustering information of evolutionary structure trees increased the prediction performance compared with using clinical information alone, and tree similarity comparison can help in the prognostic analysis of cancer patients.


Introduction
Cancer is a disease caused by the accumulation of somatic mutation. The cancer clonal theory states that most tumors originate in a single cell and that tumor progression and clonal expansions are caused by genetic variability; thus, a tumor is the result of clonal evolution [1]. Understanding the evolutionary history of tumor growth may provide valuable insights into tumor cell proliferation and survival [2]. Many studies have recently integrated evolutionary theory and genomic data into modern techniques for studying tumor growth and progression [3]. In this study, we used the evolutionary tree method to understand tumor progression. The tumor composition and evolutionary structure can be determined through somatic variant calling by using variant allele frequencies (VAFs) according to the read counts of the tumor and matched normal cell sequences in each patient. The VAFs of somatic variants can be used to reconstruct the cancer evolutionary histories as a cancer evolutionary tree, which reflects the somatic variant accumulation in each patient [4,5].
In recent years, many methods have been developed to construct cancer evolutionary trees from single-nucleotide variants in bulk sequencing data [6][7][8] and single-cell sequencing data [9][10][11]. Cancer evolutionary trees are of two types: sample tree and sub-clonal tree. Several methods are used for the construction of a sample tree [12][13][14], including maximum parsimony [15,16], distance matrix method [17], maximum likelihood estimation (expectation-maximization (EM) algorithm) [18], and Bayesian sampling (Markov chain Monte Carlo (MCMC)) [19]. A sub-clonal tree clusters driver mutations into several subclones with a common frequency and reconstructs the lineage according to the following two assumptions [20][21][22]: (1) a mutation cannot recur during tumor evolution and (2) each mutation is acquired once and is never lost. In these trees, the root represents a normal cell and the subsequent node of the root represents a founder cell. The nodes below the founder cell represent descendant subclones, and the edge lengths indicate the number of driver mutations that are newly accumulated in the descendant nodes [23,24]. In this study, rooted cancer evolutionary trees were constructed with a sub-clonal architecture.
Studies have indicated that the cancer evolutionary tree patterns differ between clear cell renal cell carcinomas and non-small-cell lung cancer [25,26]. The branching patterns of cancer evolutionary trees and the somatic mutation fraction of subclones are crucial for identification of the type or even the prognosis of cancer.
Studies have measured the similarity between two trees by using edit cost, which refers to the number of insertions or deletions of nodes required for transforming one tree into the other [27][28][29].
Computational methods for quantifying the similarity between two cancer evolutionary trees based on tree structure have not attracted much research attention [24]. In this study, we mapped the cancer evolutionary tree structure to a rooted tree and developed an algorithm for effective comparison among trees through the exchange of operations between the subtrees. This procedure is referred to as canonical-form transformation and tree similarity comparison. After tree comparison for each patient, we clustered the patients into two groups and developed several machine learning models to predict cancer prognosis. Figure. 1 presents the entire experimental process. In this study, 107 patients with stage III colorectal cancer (CRC) were recruited from National Cheng Kung University Hospital (NCKUH) between January 2014 and January 2019. All CRC patients received the standard surgical resection followed by adjuvant chemotherapy with the mFOLFOX6 (5-fluorouracil, leucovorin, and oxaliplatin) regimen. Clinical information of the patients was obtained from medical records. Tumor tissue and blood samples were collected at the time of enrollment. This study was approved by the Institutional Review Board of NCKUH (A-ER-103-395 and A-ER-104-153) and was performed in accordance with the guidelines of the Declaration of Helsinki. All participants provided written informed consent.

Evolution from Cancer Genome
To elucidate tumor development and progression in each patient, we reconstructed the cancer evolutionary tree by using genome sequencing data based on VAFs of point mutations and small insertions or deletions (indels) obtained from variant call format (VCF) files. The VCF files contain information regarding the variants' chromosomes, positions, bases, reference genes and some data retrieved from the Single Nucleotide Polymorphism Database (dbSNP). In this study, we obtained tumor target sequencing and germline whole genome sequencing data of 107 CRC patients. Information regarding somatic mutations was obtained using the tumor-normal single-nucleotide variant (SNV) calling tool deepSNV [30]. We used the ANNOVAR [31] tool to annotate somatic mutations and filter out indel variants not reported in the 1000 Genomes Project, dbSNP, or Exome Aggregation Consortium (ExAc). After collecting the point mutations and indels of each patient, we reconstructed their cancer evolutionary tree ( Figure. 2) by using PhyloWGS [14].

Cancer Evolutionary Tree
Figure. 2(a) presents an example of cancer evolution in our study. Tumor evolves from a normal cell (clone A) to a cancerous cell. A founder cell (founder clone B) is established after a normal cell acquires several passenger mutations and driver mutations (founder somatic mutations). Subclones (C and D) evolve by acquiring subsequent somatic mutations. A root (A) and its adjacent node (B) represent the normal and founder cell, respectively. The following nodes indicate subclones (C and D), and edge lengths indicate the number of somatic mutations acquired in the subclones ( Figure. 2(b)).

Tree Structure Mapping for the Cancer Evolutionary Tree
To compare cancer evolutionary trees, the constructed cancer evolutionary tree is mapped to the field of use in graph theory.
The cancer evolutionary tree contains the cancer evolutionary history, which denotes the evolution of tumors over time, and the number of nodes in the tree indicates the mutation accumulation. Here we focus on the tree structure. The structure of the cancer evolutionary tree only refers to the height of the tree, depth of the nodes, and relationship among nodes and does not refer to the node size, nodeinformation, and edge lengths. We mapped the cancer evolutionary tree structure to a directed rooted tree, which we referred to as the evolutionary structure tree ( Figure. 3). For example, Figure. 3 is an evolutionary structure tree, which is mapped from the structure of the cancer evolutionary tree in Figure. 2(b), with node A as the root.
In Figure. 3, node A is the root node as well as the parent node of node B, node B is the child node of node A, and nodes C and D are siblings or leaf nodes.  Figure. 2

Evolutionary Structure Tree
An evolutionary structure tree ( Figure. 4) is a three-tuple , , , where is a set of nodes including the root; is a set of edges satisfying ∈ , which denotes an orientation away from the root; and denotes the root of the tree . If , ∈ , then a relationship is established such that node points to node . Therefore, is referred to as a child node of node , and is referred to as a parent node of node , respectively denoted as and . For each node in tree , the set of all children of node is represented as , which also means that node is linked to these children nodes; the set of all descendants of node (including its direct children and indirect descendants) is represented as ; the height of the subtree rooted at node is represented as ; and the number of nodes in the subtree rooted at node is represented as . We also define three attributes for each node , , . The height of tree denoted as " is the number of edges between the tree's root node and its farthest leaf; the height of a tree with only root node is 0. The depth of node denoted as refers to the number of edges from node to the tree's root node, with the depth of the root node being 0.
The set of all children of node is $ % , & , . . . , ( ). A sequence of n child subtrees 〈 + , , , . . . , -〉 of node in tree and the child subtrees are placed from left to right in the ascending order of the index.

Canonical-Form Transformation for Unique Tree Structure
Some evolutionary structure trees are similar in structureor isomorphic. We hope that these trees can have a unique tree structure. Isomorphic trees that do not have an identical structure will be clustered into different groups because of the differences, resulting in the subsequent clustering problem. Two trees are considered isomorphic when each node in one tree has a corresponding node in the other tree and vice versa; that is, two evolutionary structure trees % % , % , have a bijection between the node sets /: % → & such that ∀ , ∈ % , , ∈ % ⇔ / , / ∈ & and / % & . Obviously, both trees will have identical number of nodes.
We established a series of steps to solve the tree isomorphism problem and compare subtrees. The canonical form of an evolutionary structure tree can be obtained by performing a series of subtree exchanges, where the order of the siblings from left to right reflects the comparison results.
We also performed the canonical-form transformation for modularization to identify a unique canonical form of each evolutionary structure tree. The canonical-form transformation involves three steps by which layer-by-layer exchange operations are executed from root to leaf until all the nodes are searched. The layer-by-layer search helps avoid the inconsistency in the information of subtrees caused by an irregular search and exchange. In this process, the exchange operations are first executed between all sibling nodes in the identical layer, followed by a search of child nodes in the next layer until all leaf nodes have been it has searched. The output of the canonical-form transformation is a unique canonical form of an evolutionary structure tree.

Exchange Operation
The exchange operation involves selection of two subtrees, + and , , under node in the rooted tree , execution of the following three steps, and placement of the larger subtree on the right.

Steps of the Canonical-Form Transformation
Step 1. All subtrees under the nodes are placed from left to right according to height, such that the larger subtree is placed on the right (Figure. 5). Subtrees with identical heights are sorted using Step 2.
As shown in Figure. 5, in Step 1, the height of the left and right subtrees is compared first, irrespective of the number of nodes.
Step 2. When subtrees with identical height are encountered under the nodes, they are placed from left to right based on the number of nodes, such that the larger subtree is placed on the right ( Figure. 6).
Step 2 involves counting the number of nodes in each subtree. Therefore, this step can be time consuming when the size of the evolutionary structure tree is large. In addition, some subtrees may have an identical number of nodes. This problem can be solved using Step 3. & are children of node , + is the subtree that has node % as a root, and , is the subtree that has node & as a root. Because the height of + is 3 and the height of , is 2, the larger subtree + is placed on the right.

Figure 5. Exchange of the child subtrees of the root in Step 2. Nodes % and
& are children of node , + is the subtree that has node % as a root, and , is the subtree that has node & as a root. Because the number of nodes in + is 3 and the number of nodes in , is 2, the larger subtree + is placed on the right.

Figure 6. Exchange of the child subtrees of the root in
Step 3. Node % and & are children of node , + is the subtree that has node % as a root, and , is the subtree that has node & as a root. After calculation of the number of nodes in the child subtrees + and , , the list for + is 435 and that for , is 42, 15. Therefore, the larger subtree + is placed on the right.

Canonical-Form Transformation Algorithm
The tree structures of two trees can be compared using the three steps described in Section 2.6.2 in the sequential order.
We assume that the structure of tree 1 is greater than that of tree 2 1 < 2 , the structure of tree 1 is smaller than that of tree 2 1 = 2 , and the structure of tree 1 is equal to that of tree 2 1 2 . Figure 10. Exchange the child subtrees of node 2 in Step 1. Because the height is 1, 1, 0 from left to right, the subtree order is exchanged. The Canonical-Form algorithm (Algorithm 1) can be used to achieve a unique canonical form of the tree.
The Link-Child algorithm (Algorithm 2) provides all the child nodes of the node (not including indirect offspring) and an iterative search idea to the Canonical-Form algorithm to enable it to search nodes layer by layer from root to leaf until all the nodes are searched.  The LGO algorithm (Algorithm 5) is used by the Trees-Sort algorithm to output a list for all subtrees in lexicographical order sorted from large to small.
Because all child subtrees of each node are placed from left to right, we propose the Trees-Sort algorithm. The Trees-Sort algorithm (Algorithm 6) compares the subtree structures from small to large, and the output of the Trees-Sort algorithm is used to sort and order the subtrees.
The Compare-Merge algorithm (Algorithm 7) used merge sort to sort the subtrees for comparison in the Trees-Sort algorithm.

Tree Similarity Comparison and Cost of Tree Similarity Comparison
Here we present the three operations involved in the tree similarity comparison method and the calculation method for comparing the differences in tree structure costs [32].

Operations in Tree Similarity Comparison
According to the previous canonical-form transformation, each evolutionary structure tree has a unique structure. The degree of structural similarity between two evolutionary structure trees can be compared by performing three operations, following which a cost is determined, which is considered the difference in similarity between the two evolutionary structure trees. The cost is a real number. The main idea of tree similarity comparison is transforming one tree into another tree ' for comparison by performing the three operations.
Operation 1. Deleting node (denoted as ). In this operation, the position of the deleted node is identified in the original tree and then the position of the parent node of the original node is changed to point to the new child nodes.
In Figure 18, node is the node to be deleted. Therefore, node is first deleted, and then, the position of node (the parent node of the original node ), which originally pointed to node , is changed to point to nodes B, , and .
Operation 2. Inserting node under node (denoted as C ). Before performing this operation, the following actions must be performed: compare whether the children of node in the original tree 1 intersect with the children of node in tree 2 . If an intersection exists, node points directly to this set. This method can help determine the relationship of the node to other nodes and reduce redundant operations, thereby reducing costs (the insertion operation without and with the aforementioned method is presented in Figure 19 and 20, respectively).    We have ′ $ ), ' @ $ , A |A ∈ ∩ ′ ) $ , ) $ , A |A ∈ ∩ ′ ).
In Figure 19, the parent node of nodes , B, and is directly changed. However, in the method shown in Figure  20, tree & is compared with tree % to verify whether an intersection exists. It is found that node a is only present in tree % and not in tree & , indicating no intersection. Therefore, during the insertion operation, the state of node need not be changed. This method helps reduce the cost.
Operation 3. Moving node to be positioned under node (denoted as 6 7 ). To move the position of a node, the node is first deleted and then inserted at the required position. Thus, both delete and insert operations are used ( Figure 21).
Moreover, details such as the position of nodes in the tree and the relationship of the node with other nodes should be considered in the operations.

Cost of Tree Similarity Comparison
A similarity comparison between two trees, % and & , yields a cost, which is a real number, after every operation, E . The cost of each operation is denoted as F E .
If EG $E % , E & , . . . , E ( ) is an operation sequence, then the cost of a sequence of operations can be represented as F EG .
If EG is an operation sequence used for transforming tree % into tree & , then the cost from % to & is defined as F % → & . Moreover, the cost of tree similarity comparison between two trees, % and & , can be defined as

Computing the Cost of Tree Similarity Comparison
The cost of tree similarity comparison is influenced by the position of the node, size of the tree (or the number of nodes), height of the tree, depth of the node, number of descendants of the node (the size of the subtree having the node as a root), and relationship of the node to other nodes. These factors should be considered when performing tree similarity comparison between two trees. For example, first, an operation performed on a node at a shallow depth will incur a higher cost than that performed on a node at a deeper region; moreover, the cost will be lower when the node is near the bottom of the tree. Second, a node that has more descendants will incur a higher cost than a node with fewer descendants. In general, the node with more descendants will be located at a shallow depth, with some exceptions. The tree structure should be evaluated before conducting the operation. Third, performing operations on a larger tree will incur a lower cost than performing operations on a smaller tree. Finally, two trees may differ only in the position of some nodes, and the operation of moving the node will incur a lower cost than the operation of deleting or inserting the node.
The cost of the three operations can be calculated using the following formulas.
1) Cost of deletion operation ( Figure 22) where is a nonroot node, " is the height of tree , is the depth of node , | | is the number of descendants of node (direct and indirect offspring), and is the original node set before the deletion. In general, 1, and < 1 if is a nonroot node. If node to be deleted is a leaf node, ∅ and | | 0. This means that no extra cost is incurred in the deletion of a leaf node. When is the lowest leaf node in the tree " , deleting will incur a minimal where is a nonroot node, " is the height of tree , is the depth of node , | | is the number of descendants of node (direct and indirect offspring), and is the original node set before the deletion. In general, 1, and < 1 if is a nonroot node. If node to be deleted is a leaf node, ∅ and | | 0. This means that no extra cost is incurred in the deletion of a leaf node. When is the lowest leaf node in the tree " , deleting will incur a minimal cost. The final cost is F 1/| |. 2) Cost of insertion operation ( Figure 23) where " is the height of tree , is the depth of node , | | is the number of descendants that obtains after it is inserted (direct and indirect offspring) and is the original node set before the insertion. The insertion operation differs from the deletion operation in some aspect. Node cannot be inserted at the root position because the root cannot be changed arbitrarily. Therefore, the node can only be inserted under the root. The calculation of the depth of inserting node also differs from that in the deletion operation. Because before doing the operation, we only know to insert the node under the node , so the will be used. Figure 23. An example of cost of insertion: " 2, 1, | | 0, | | 4 (before inserting node v, the size of the tree T is 4). Therefore, the insertion cost is given by 2 @ 1 1 0 /4 0.5.

3) Cost of moving operation (Figure 24)
where | | < 2 (the tree has a root and at least two nonroot nodes) and ?
. Note that C 7 will be performed on a tree without node . In this formula, both the deletion and insertion operations are considered because the operation of moving involves deleting the node first and then inserting it. Some factors are used for balancing this operation. The factor 1/2 adjusts the cost of operation because the node is not actually deleted from and inserted back to the tree. Another factor | | @ 2 /| | adjusts the cost to ensure that in an extreme case where is the only node other than the root, its moving cost is zero because it will obtain the identical result after doing the operation of moving the node and actually it can't be moved. Moreover, as the number of nodes in the tree increases, the effect of the factor | | @ 2 /| | on the operation of moving the node becomes weaker.  Figure 25 presents an example of calculating the cost of tree similarity comparison between two trees.

Patient Clustering
We used tree similarity comparison (mentioned in Section 2.7) to compare the evolutionary structure trees to obtain 107 × 107 matrix data. Because patients exhibit two cancer types, nonrecurrent and recurrent, we divided the obtained matrix data into two clusters through k-means clustering. We also used the average silhouette method to measure the quality of the clusters (Figure 26) [33]. The average silhouette method combines two factors, cohesion and separation. It can be used to evaluate the impact of different algorithms or different operating modes of algorithms on the clustering results based on the same original data. The optimal number of clusters is two ( Figure 26). Now, each patient obtains new information, which is an attribute classified as cluster 1 or cluster 2.

Cancer Recurrence Rate Prediction
We collected clinical data of each patient, including age, gender, tumor site (site), tumor staging (stage), tumor invasion stage (T), and tumor nodal stage (N). Age comprises two attributes, Young and Old, where Young refers to the age group 0-65 years and Old refers to the age group 66 years and above. Gender comprises two attributes, Female and Male. Site comprises four attributes, Left, Right, Right and Left, and Unknown. Stage comprises two attributes, 2 and 3, where 2 indicates cancer stage 1 and cancer stage 2, and 3 indicates cancer stage 3 and above. The feature of T comprises two attributes, Early and Late, where Early means T1 and T2 and Late means T3 and T4 (in TNM staging system). The feature of N comprises two attributes, 1 and 2, where 1 means N0 and N1 and 2 means N2 and N3 (in TNM staging system). We demonstrated the importance of tree similarity comparison in recurrent prediction. We established two groups, a control group and a test group, with one more clustering information, for final prediction and comparison. We used these two groups as the input to the Cancer Recurrence Rate Prediction (as shown in Figure 27.) and assessed whether the prediction of the recurrence rate improved on addition of one clustering data set (presented in Section 3). We then combined the clinical data of each patient and their respective clustering data. Figure 28 presents the combined distribution of all patient data. In the Cancer Recurrence Rate Prediction, we first divided the data into two types: training set and testing set. Of the data for 107 patients, data of 77 patients were included in the training set (80%) and those of 30 patients were included in the testing set (20%). Now, the training set contained 62 nonrecurrent types and 15 recurrent types, and the testing set contained 15 nonrecurrent types and 15 recurrent types. To ensure the training set is balanced (1:1), we performed data augmentation. We used Synthetic Minority Over-sampling Technique (SMOTE) to generate new samples [34]. The basic principle of SMOTE is to analyze for minority samples and artificially synthesize new samples based on the minority samples and add them to the data set; thus, the recurrent type in the training set was increased to 62. Then, we used random forest to select important features and the first five features (Figure 29.) [35]. Finally, we use random forest, support vector machine (SVM), bagging [36], and boosting [37] to train the model and used the testing set to predict the results.

Experimental Results and Discussion
In Figure 28, the left half presents the heatmap distribution of all 107 patients. This denotes the score distribution obtained by comparing tree structures with each other. The upper left corner is cluster 1, and the other is cluster 2. During comparison with other groups based on cluster analysis, a set of objects is clustered into the group with certain defined identical properties. We compared the differences in the tree structures using k-means clustering. Therefore, when the evolutionary structure trees in one group are compared with those in another group, the color distribution is significantly different. This means that the evolutionary structure trees that are not in the identical group do not have close scores, so the evolutionary structure trees in these two groups exhibit huge differences in the tree structure, such as tree height and number of nodes. In the right half of Figure 28, the clinical data are presented by a clustered sorting. This means that the clinical data are distributed according to the clustering data, which makes it convenient to observe the status distribution of each group. These clinical data are recurrent status, survival status, age, gender, tumor invasion stage (T), tumor nodal stage (N), tumor site (site), and tumor staging (stage).  20 11 We also used the clustering information and clinical data for analysis. TABLE 1 indicates that the structure of the cancer evolutionary tree was associated with gender and tumor invasion stage. Early-stage (T1 and T2) patients tend to have a simpler structure of the cancer evolutionary tree than late-stage (T3 and T4) patients. For early-stage patients, the height of the cancer evolutionary tree is relatively small, the depth of the node in the tree is relatively low, and the number of nodes in the tree is relatively few. Cluster 2 patients tend to have a more complicated structure of the cancer evolutionary tree than cluster 1 patients. For cluster 2 patients, the height of the cancer evolutionary tree is relatively large, the depth of the node in the tree is relatively large, and the number of nodes in the tree is relatively high. Among our 107 CRC patients, the ratio of male to female is approximately 1:1. CRC with obvious gender differences affects males more than females, and males not only develop cancer more often but are also more likely to die from their disease [38][39][40]. Male patients tend to have a more complicated structure of the cancer evolutionary tree than female patients. We divided the patients into two clusters according to the clustering information. Figure 30 presents the distribution of the number of SNVs between cluster 1 and cluster 2. The mean of cluster 1 is 162, and the mean of cluster 2 is 135.
Finally, we used random forest, SVM, bagging, and boosting to predict the recurrence of these CRC patients. The optimal accuracy of 0.667 was obtained using the boosting model with one more clustering information (as shown in TABLE 2).

Conclusion and Future Work
In this study, we combined genetic data with clinical data and integrated them into cancer evolutionary trees for cancer recurrence prediction. Through tree similarity comparison, we obtained clustering information of all patients. Our results revealed that integrating the clustering information can improve the performance compared with using only clinical information. An accuracy rate of 0.667 was obtained for our model. If patients with new diagnosis of CRC receive therapy in NCKUH in the future, our model can be used for prediction according to their reconstructed cancer evolutionary tree and their clinical data. Finally, we can improve the model prediction accuracy based on whether recurrence is noted in these CRC patients in the future.
Although the results indicate that the prediction of the cancer recurrence rate can be improved, many areas for improvement remain. In our study, we only collected data of 107 patients, and the data types were unbalanced. Consequently, the amount of data as input for the machine learning model was insufficient. In the future, the amount of patient data and other information in the clinical data should be increased to help train machine-learning models. Another research direction is to apply the developed models for prediction in patients with other cancer types.