Using Genetic Algorithms and Sparse Logistic Regression to Find Gene Signatures for Chemosensitivity Prediction in Breast Cancer
Department of Computer Science, Houghton College, Houghton, New York, USA
To cite this article:
Wei Hu. Using Genetic Algorithms and Sparse Logistic Regression to Find Gene Signatures for Chemosensitivity Prediction in Breast Cancer. American Journal of Bioscience and Bioengineering. Vol. 4, No. 2, 2015, pp. 26-33. doi: 10.11648/j.bio.20160402.12
Received: December 15, 2015; Accepted: January 5, 2016; Published: May 4, 2016
Abstract: Various gene signatures of chemosensitivity in breast cancer have been identified. When used to build predictors of have chemosensitivity, many of them have their prediction accuracy around 80%. Identifying gene signatures to build high accuracy such predictors is a prerequisite for their clinical tests and applications. To elucidate the importance of each individual gene in a signature is another pressing need before such signature could be tested in clinical settings. In this study, Genetic Algorithms (GAs) and Sparse Logistic Regression (SLR) were employed to identify two signatures. The first had 28 probe sets selected by GA from the top 65 probe sets that were highly overexpressed between pathologic compete response (pCR) and residual disease (RD) and was used to build a SLR predictor of pCR (SLR-28). The second had 86 probe sets (Notch-86) selected by GA from Notch signaling pathway and was used to develop a SLR predictor of pCR (SLR-Notch-86). These two predictors tested on a training set (n=81) and validation set (n=52) had very precise predictions measured by accuracy, specificity, sensitivity, positive predictive value and negative predictive value with their corresponding P value all zero. Furthermore, these two predictors discovered 12 important genes in the 28 probe set signature and 14 important genes in the Notch-86 signature. Our two signatures produced superior performance over a signature in a previous study, demonstrating the potential of GA and SLR in identifying robust gene signatures in chemo response prediction in breast cancer.
Keywords: Genetic Algorithm, Gene Signature, Breast Cancer, Sparse Logistic Regression, Predictor, Chemosensitivity
Breast cancer is a complex disease of different molecular subtypes with distinct genetic alterations and clinical outcomes. In current practice, chemotherapy is applied empirically, and not all patients benefit equally, illustrating the imperative needs for a more personalized approach in cancer treatment. The ability to predict whether an individual patient will benefit from a specific therapy is of great clinical significance. The estrogen receptor status can be used to guide the decisions on hormonal therapy. The gene expression data that reflect subtle differences in tumors can be utilized to build a predictor of response to cancer drugs.
Single clinical or molecular parameters, such as tumor size, histology, hormone receptor or human epidermal growth factor receptor 2 (HER2) expression, and tumor grade, does not always give reliable predictions of response. With microarray data, researchers are able to identify gene expression patterns that are predictive of chemotherapy response.
In , t-test for unequal-variance was employed to find a signature of 31 probe sets (27 genes) with highest differentially expressed values between pRC and RD. Based on this signature, a 30-probe set Diagnal Linear Discriminant Analysis (DLDA-30) classifier was constructed to predict pathological response to preoperative paclitaxel/FAC chemotherapy. The value of this type of predictor is the ability to identify those patients most likely to benefit from a particular treatment, the neoadjuvant chemotherapy, in this case. This predictor is able to recognize not all responsive patients but exclusively those that will benefit the most, as defined by attaining a pCR. Other clinical studies also identified gene signatures that predict response to neoadjuvant therapy of breast cancer [2-15].
As a single variable technique, t-test processes one gene at a time and might miss the interactions between genes. We believed the signature identified by t-test could be optimized with the help of a multivariable technique such as GA. We aimed to search for novel signatures and to use them to develop predictors of pCR that can achieve much better predictions than the DLDA-30. Genetic algorithms have the capacity to explore multiple solutions concurrently, which we used to find interacting and informative genes in this study. After identifying a signature, SLR was employed to further explore the importance of each individual gene’s contribution to the prediction of pCR.
2. Patients and Methods
2.1. Patient Cohorts and Clinical Information
One breast cancer patient cohort was obtained from a previous publication  (n=133). Needle-biopsy samples were collected from 133 patients with stage I, II, or III breast cancer who received preoperative weekly paclitaxel and a combination of fluorouracil, doxorubicin, and cyclophosphamide (T/FAC). These 133 patients were divided into two subsets, one training set of size 81 and one validation set of size 52. These data contain clinical information including patient age, gender, race, histological classification, stage, nuclear grade, ER (estrogen receptor), PR (progesterone receptor), and HER2 (human epidermal growth factor 2) status, pathologic complete response, and residual disease. These data also contain each patient’s genome-scale gene expression profiles generated using Affymetrix U133A chip (Santa Clara, CA). pCR was defined as no residual invasive cancer in the breast or lymph nodes. pCR is presently accepted as a reasonable early indicator for long-term survival.
2.2. Sparse Logistic Regression
A standard least squares linear regression solves the following problem:
Given data , find such that
The LASSO regression  deals with the following problem:
with , (2)
where t controls the norm of . This constraint on produces a sparse model, i.e., many components of can be zero. Following the idea of LASSO, Shevade at el.  studied the following problem for sparse logistic regression, when
with , (3)
where g() = log(1 + ), which is the negative log-likelihood function associated with the probability model
Cawley GC et al.  utilized a novel technique to solve this sparse logistic regression problem efficiently. In our study, we used +1 to label those cases of RD status, and used -1 to label those cases of pCR status.
2.3. Notch Signaling Pathway
Notch genes encode highly conserved cell surface receptors. The Notch signaling pathway consists of Notch receptors, ligands, negative and positive modifiers, and transcription factors. It plays a key role in the normal development of many tissues and cell types, through diverse effects on cell regulation, proliferation, and differentiation. Aberrant Notch signaling has been observed in several human cancers, including acute T-cell lymphoblastic leukemia, cervical cancer, and breast cancer [19-21]. The Oligo GEArray Human Notch Signaling Pathway Microarray  was designed for profiling expression of 113 genes (Table 1) involved in Notch signaling. One of the two signatures identified in this study, Notch-86, was selected from these 113 genes.
|Notch Signaling Pathway:|
|Notch Binding: DLL1 (DELTA1), DTX1, JAG1, JAG2.|
|Notch Receptor Processing: ADAM10, PSEN1, PSEN2, PSENEN (PEN2).|
|Notch Signaling Pathway Target Genes:|
|Apoptosis Genes: CDKN1A, CFLAR (CASH), IL2RA, NFKB1.|
|Cell Cycle Regulators: CCND1 (Cyclin D1), CDKN1A (P21), IL2RA.|
|Cell Proliferation: CDKN1A (P21), ERBB2, FOSL1, IL2RA.|
|Genes Regulating Cell Differentiation: DTX1, PPARG.|
|Neurogenesis: HES1, HEY1.|
|Regulation of Transcription: DTX1, FOS, FOSL1, HES1, HEY1, NFKB1, NFKB2, NR4A2, PPARG, STAT6.|
|Other Target Genes with Unspecified Functions: CD44, CHUK, IFNG, IL17B, KRT1, LOR, MAP2K7, PDPK1, PTCRA.|
|Other Genes Involved in the Notch Signaling Pathway:|
|Apoptosis Genes: AXIN1, EP300, HDAC1, NOTCH2, PSEN1, PSEN2.|
|Cell Cycle Regulators: AXIN1, CCNE1, CDC16, EP300, FIGF, JAG2, NOTCH2, PCAF.|
|Cell Proliferation: CDC16, FIGF, FZD3, JAG1, JAG2, LRP5, NOTCH2, PCAF, STIL (SIL).|
|Genes Regulating Cell Differentiation: DLL1, JAG1, JAG2, NOTCH1, NOTCH2, NOTCH3, NOTCH4, PAX5, SHH.|
|Neurogenesis: DLL1, EP300, HEYL, JAG1, NEURL, NOTCH2, PAX5, RFNG, ZIC2 (HPE5).|
|Regulation of Transcription: AES, CBL, CTNNB1, EP300, GLI1, HDAC1, HEYL, HOXB4, HR, MYCL1, NCOR2, NOTCH1, NOTCH2, NOTCH3, NOTCH4, PAX5, PCAF, POFUT1, RUNX1, SNW1 (SKIIP), SUFU, TEAD1, TLE1.|
|Others Genes with Unspecified Functions: ADAM17, GBP2, LFNG, LMO2, MFNG, MMP7, NOTCH2NL, NUMB, SEL1L, SH2D1A.|
|Other Signaling Pathways that Crosstalk with the Notch Signaling Pathway:|
|Sonic Hedgehog (Shh) Pathway: GLI1, GSK3B, SHH, SMO, SUFU.|
|Wnt Receptor Signaling Pathway: AES, AXIN1, CTNNB1, FZD1, FZD2, FZD3, FZD4, FZD6, FZD7, GSK3B, LRP5, TLE1, WISP1, WNT11.|
|Other Genes Involved in the Immune Response: CXCL9, FAS (TNFRSF6), G1P2, GBP1, IFNG, IL2RA, IL2RG, IL4, IL4R, IL6ST, IRF1, ISGF3G, OAS1, OSM, STAT5A, STUB1.|
2.4. Top 65 Probe Sets
T-tests for unequal variances for all the probe sets on the Affymetrix U133A chip were carried out to find the genes that were significantly differentially expressed in either the pCR cases or the RD cases. We chose the 60 probe sets with the smallest t-test P values (False Discovery Rate=1%) and 5 probe sets with the most negative t-test statistics in the remaining probe sets to be our first signature (top 65-probe set signature) as presented in Table 1. The top 31 probe set signature in  had a FDR=0.5%.
2.5. Genetic Algorithms
Genetic Algorithms (GAs), a particular class of evolutionary algorithms, are search algorithms that adopt some common processes in genetics such as selection, mutation, and inheritance. The GAs outperform other traditional search algorithms in various applications.
The outline of a genetic algorithm is as follows:
Generate an initial population of individuals
Evaluate initial population
Apply genetic operations such as mutation and crossover to generate a new generation of individuals
Evaluate individuals in the population
Until some stopping criteria is satisfied
2.6. Prediction Accuracy Evaluation
In order to evaluate the significance of our predictions, we need to compare them with random predictions. For each dataset, a random-label permutation was conducted while keeping the number of instances in each group fixed. The matches between the permuted labels and the original ones were recorded. The standard P value was the percentage of 1000 random predictions with higher accuracy than the calculated predictions.
|Rank by P value||t-Test||P value||Higher Expression in||Probe Set ID||Gene Symbol||Gene Name|
|1||6.215265||2.20E-08||RD||203930_s_at*||MAPT||microtubule-associated protein tau|
|2||6.36741||2.31E-08||RD||203929_s_at*||MAPT||microtubule-associated protein tau|
|3||6.212778||2.56E-08||RD||212207_at*||THRAP2||thyroid hormone receptor associated protein 2|
|4||5.804489||1.25E-07||RD||212745_s_at*||BBS4||Bardet-Biedl syndrome 4|
|5||5.847627||1.42E-07||RD||203928_x_at*||MAPT||microtubule-associated protein tau|
|6||5.763819||1.67E-07||RD||208945_s_at*||BECN1||beclin 1 (coiled-coil, myosin-like BCL2 interacting protein)|
|7||5.704523||2.50E-07||RD||206401_s_at*||MAPT||microtubule-associated protein tau|
|9||5.555817||3.65E-07||RD||219741_x_at*||ZNF552||zinc finger protein 552|
|10||5.523853||4.08E-07||RD||215304_at*||---||Clone 23948 mRNA sequence|
|11||5.449088||5.45E-07||RD||209173_at||AGR2||anterior gradient 2 homolog (Xenopus laevis)|
|12||5.391683||6.89E-07||RD||201508_at*||IGFBP4||insulin-like growth factor binding protein 4|
|13||5.357545||8.43E-07||RD||217542_at*||MDM2||Mdm2, transformed 3T3 cell double minute 2, p53 binding protein (mouse)|
|14||5.312123||1.30E-06||RD||219044_at*||FLJ10916||hypothetical protein FLJ10916|
|15||5.26737||1.41E-06||RD||215616_s_at*||JMJD2B||jumonji domain containing 2B|
|16||5.215414||1.41E-06||RD||204509_at*||CA12||carbonic anhydrase XII|
|17||5.221534||1.42E-06||RD||202204_s_at*||AMFR||autocrine motility factor receptor|
|18||5.215809||1.70E-06||RD||214124_x_at*||FGFR1OP||FGFR1 oncogene partner|
|19||5.210207||1.71E-06||RD||219051_x_at*||METRN||meteorin, glial cell differentiation regulator|
|23||5.054632||2.95E-06||RD||205074_at||SLC22A5||solute carrier family 22 (organic cation transporter), member 5|
|24||5.139071||3.06E-06||RD||213623_at*||KIF3A||kinesin family member 3A|
|25||5.017933||3.38E-06||RD||201413_at||HSD17B4||hydroxysteroid (17-beta) dehydrogenase 4|
|26||4.908014||5.26E-06||RD||205225_at||ESR1||estrogen receptor 1|
|28||4.80725||7.18E-06||RD||214053_at*||---||CDNA FLJ44318 fis, clone TRACH3000780|
|29||4.888899||7.30E-06||RD||213527_s_at||ZNF688||zinc finger protein 688|
|30||4.819068||7.44E-06||RD||203009_at||LU||Lutheran blood group (Auberger b antigen included)|
|31||4.865888||9.07E-06||RD||212046_x_at||MAPK3||mitogen-activated protein kinase 3|
|34||4.700102||1.07E-05||RD||203071_at||SEMA3B||sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3B|
|35||4.710655||1.07E-05||RD||210129_s_at||TTLL3||tubulin tyrosine ligase-like family, member 3|
|36||4.671287||1.20E-05||RD||218671_s_at||ATPIF1||ATPase inhibitory factor 1|
|37||4.689638||1.23E-05||RD||209339_at||SIAH2||seven in absentia homolog 2 (Drosophila)|
|38||4.629403||1.44E-05||RD||218976_at||DNAJC12||DnaJ (Hsp40) homolog, subfamily C, member 12|
|39||4.649829||1.44E-05||RD||205734_s_at||AFF3||AF4/FMR2 family, member 3|
|40||4.634054||1.65E-05||RD||202641_at||ARL3||ADP-ribosylation factor-like 3|
|42||4.590716||1.71E-05||RD||220540_at||KCNK15||potassium channel, subfamily K, member 15|
|43||4.578743||1.71E-05||RD||210831_s_at||PTGER3||prostaglandin E receptor 3 (subtype EP3)|
|44||4.608731||1.77E-05||RD||218769_s_at||ANKRA2||ankyrin repeat, family A (RFXANK-like), 2|
|45||4.587999||1.81E-05||RD||218394_at||FLJ22386||leucine zipper domain protein|
|46||4.568723||1.82E-05||RD||216835_s_at||DOK1||docking protein 1, 62kDa (downstream of tyrosine kinase 1)|
|47||4.606517||1.98E-05||RD||221728_x_at||XIST||X (inactive)-specific transcript|
|49||4.531619||2.06E-05||RD||212239_at||PIK3R1||phosphoinositide-3-kinase, regulatory subunit 1 (p85 alpha)|
|50||4.521411||2.13E-05||RD||212209_at||THRAP2||thyroid hormone receptor associated protein 2|
|51||4.509765||2.22E-05||RD||204792_s_at||WDTC2||WD and tetratricopeptide repeats 2|
|52||4.593663||2.45E-05||RD||204862_s_at||NME3||non-metastatic cells 3, protein expressed in|
|53||4.478137||2.49E-05||RD||206418_at||NOX1||NADPH oxidase 1|
|55||4.463108||2.74E-05||RD||210958_s_at||MAST4||microtubule associated serine/threonine kinase family member 4|
|56||4.501318||2.76E-05||RD||202228_s_at||SDFR1||stromal cell derived factor receptor 1|
|57||4.539226||2.83E-05||RD||212660_at||PHF15||PHD finger protein 15|
|58||-5.01605||2.96E-05||pCR||213134_x_at*||BTG3||BTG family, member 3|
|59||4.427751||2.98E-05||RD||203789_s_at||SEMA3C||sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3C|
|60||4.484732||3.00E-05||RD||216109_at||THRAP2||Thyroid hormone receptor associated protein 2|
|61||-5.01538||3.31E-05||pCR||205548_s_at*||BTG3||BTG family, member 3|
|62||-4.53199||0.000122||PCR||204825_at*||MELK||maternal embryonic leucine zipper kinase|
|63||-4.00315||0.000496||pCR||205339_at||SIL||TAL1 (SCL) interrupting locus|
|64||-3.97777||0.000442||PCR||203693_s_at*||E2F3||E2F transcription factor 3|
|65||-3.94634||0.000361||PCR||216237_s_at||MCM5||MCM5 minichromosome maintenance deficient 5, cell division cycle 46 (S. cerevisiae)|
3.1. Two Signatures: The 28 Probe Sets and the Notch-86
To search and select a subset of the 65 probe sets, we represented our solution, referred to as an individual in GA terms, as a binary vector of size 65 to indicate the presence (1) or absence (0) of each probe set in the 65 probe sets. We ran the GA algorithm with population size 200, individual size 65, and 100 generations. In each generation, the top 50% of the individuals with highest fitness values were selected as parents to produce the next generation with crossover and a point mutation was applied to each individual randomly at six genes. Our fitness value was the prediction accuracy of SLR based on the training set. In each generation of GA, we divided the training set (n=82) into five equal subsets and used four subsets as a training set for SLR and one subset as a test set to get the accuracy of SLR on this test set. At the same time, in each generation, we calculated the prediction accuracy of SLR on the validation set (n=51). Our goal was to choose an individual that has similar high accuracy on the training and validation sets. We found an individual of this quality and its binary representation has 28 ones, which was our first signature (Table 2). A subset of the Notch signature of such quality, Notch-86, was found the same way. Our second signature had 86 probe sets. Table 3 displays the most important probe sets for response prediction in this second signature.
3.2. Two Predictors: SLR-28 and SLR-Notch-86
In  the DLDA-30 was selected as the best predictor after a thorough search of different predictors based on Support Vector Machine (SVM), Diagonal Linear Discrimmant Analysis (DLDA), and K-nearest neighbor (KNN). We developed two SLR based predictors of high precision. One used the 28 probe sets (SLR-28) and one used the Notch-86 (SLR-Notch-86). The evaluation of prediction performance was conducted on both the training set (n=82) and the validation set (n=51). Since the DLDA-30 was evaluated on the training set with five-fold cross validation, we performed five-fold evaluation too for our two predictors. Further, we repeated the five-fold cross validation 10 times and the averaged results were reported in Table 4.
On the training set, the two predictors, SLR-28 and SLR-Notch-86, produced much better predictions across all five measurements than DLDA-30 (Table 5). In , authors calculated the P values of their DLDA-30 predictions on the training set in five-fold cross validation and they were all zero. Since our two predictors had higher values in all the five measurements, we concluded that they also have P value zero in all these five measurements.
On the validation set, the two predictors trained on the training set had a much higher accuracy, a much higher specificity, and a much higher PPV than DLDA-30 (Table 6). Our two predictors had P values zero in all five measurements, whereas DLDA-30 had three P values larger than 0.05, especially those for accuracy and specificity. SLR-28 correctly identified all but two who achieved pCR and all but two who achieved RD, and SLR-Notch-86 correctly identified all but three who achieved pCR and all but two who achieved RD (Table 6). Tables 5 and 6 together show that our two predictors can predicts pCR with great precision on the training set and the validation set.
These two predictors also identified the important genes in each signature that had nonzero SLR weights (Figure 1). The genes with zero weight did not contribute to the prediction. The genes with positive weight contribute positively to the RD prediction and those with negative weight contribute positively to pCR prediction. In most cases, genes with positive t-test statistic had positive weight like the three genes in Figure 1, BGT3, MELK, and MCM5. However, there were some exceptions. Two genes, STUB1 and PDPK1, had positive t-test statistic, but negative weight in Table 3, demonstrating that SLR as a multivariable technique can capture some interactions between genes whereas t-test may not. Figure 1 showed that the most discriminative genes measured by SLR as a group were not necessarily those with the smallest P values by t-test as individual genes.
|SLR Weight||t-Test||P value||Higher Expression in||Probe set ID||Gene symbol||Gene Name|
|0.29955||2.585026||0.011947||RD||202221_s_at||EP300||E1A binding protein p300|
|-0.4945||-0.57563||0.56903||pCR||203393_at||HES1||hairy and enhancer of split 1|
|-0.02104||-2.73059||0.012681||pCR||203915_at||CXCL9||chemokine (C-X-C motif) ligand 9|
|-0.27652||-1.92262||0.067834||pCR||204152_s_at||MFNG||Manic fringe homolog|
|0.50511||2.208649||0.031467||RD||205552_s_at||OAS1||oligoadenylate synthetase 1|
|-0.71933||-4.00315||0.000496||pCR||205746_s_at||ADAM17||ADAM metallopeptidase domain 17|
|0.60822||3.119622||0.002969||RD||207760_s_at||NCOR2||nuclear receptor co-repressor 2|
|-0.49017||-1.3125||0.199077||pCR||211179_at||RUNX1||runt-related transcription factor 1|
|0.6048||1.873074||0.066922||RD||211209_x_at||SH2D1A||SH2 domain protein 1A|
|-0.11683||0.382658||0.704546||RD||217934_x_at||STUB1||STIP1 homology and U-box containing protein 1|
|0.41896||3.252073||0.001808||RD||218665_at||FZD4||frizzled homolog 4|
|-0.07142||-0.66774||0.50776||pCR||219683_at||FZD3||frizzled homolog 3|
|-0.2956||0.212779||0.832697||RD||32029_at||PDPK1||phosphoinositide dependent protein kinase-1|
One of the important genes in Figure 1 is CA12, which is a membrane zinc metalloenzyme that is present in different normal tissues but is overexpressed in some cancers such as renal cell and breast cancers. Two studies found that increased CA IX expression is associated with poor relapse free and overall survival in invasive breast cancer [23,24]. Another study found that CA12 is regulated by estrogen receptor (ERα) in breastcancer, and that this regulation involves a distal estrogen-responsiveenhancer region .
The mitochondrial ATPase inhibitory factor 1(ATPIF1) is another important gene in Figure 1. Several studies discovered a close link between metabolic and genetic changes observed during malignant growth [26,27]. The large positive weight of this gene in Figure 1 is also in agreement with this observation.
One study in  revealed that MAPT is the best single gene discriminator of pCR to preoperative chemotherapy with paclitaxel, 5-fluorouracil, doxorubicin, and cyclophosphamide. There are four probe sets of MAPT selected in Table 2 and MAPT is one of the 16 informative genes in the top 65 probe sets in Figure 1.
SDFR1 encodes a cell surface protein of the immunglobulin superfamily that regulates cell adhesion and process outgrowth. BTG3 is tumor suppressor [29-31] and its large negative weight implies that its presence enhances chemo sensitivity. The functions of the important genes in Notch-86 signature in Figure 1 can be found in Table 3.
|Measure||DLDA-30||P value||SLR-28||P value||SLR-Notch-86||P value|
|DLDA-30||Predicted as pCR||Predicted as RD||SLR-28||Predicted as pCR||Predicted as RD||SLR-Notch-86||Predicted as pCR||Predicted as RD|
|Observed pCR||n=12||n=1||Observed pCR||n=11||n=2||Observed pCR||n=10||n=3|
|Observed RD||n=11||n=27||Observed RD||n=2||n=36||Observed RD||n=2||n=36|
In this study, we intended to uncover gene signatures for developing predictors that have a much higher accuracy than DLDA-30. With the ability to account for multiple gene interactions, the multivariable techniques, such as genetic algorithms and SLR, have demonstrated their potential utility in identifying robust gene signatures of clinical relevance.
Currently there is an urgent need to develop knowledge to identify groups of patients who will derive benefit from the different chemotherapy agents. Molecular profiling of individual tumors will help to predict the most appropriate therapy for each cancer patient. One study already suggests that responses to neoadjuvant chemotherapy correlate with gene expression profile, with tumors displaying the ER-positive gene signatures being less likely to respond than other types of breast cancer .