Feature Selection and Classification of Leukemia Cancer Using Machine Learning Techniques

Leukemia is one of the leading and most detrimental cancers worldwide. A large number of genes are implicated in cancer, so it is necessary to identify the most informative genes for leukemia. The main objectives of this study are to: (i) identify the most informative genes using five feature selection techniques (FST) and (ii) adopt six classifiers to classify the disease and compare their performance. The leukemia cancer data were taken from the Kent Ridge biomedical data repository, USA. The dataset comprises 7129 genes and 72 patients, of whom 47 are cancer patients and 25 are controls. We used five FST: t-test, Wilcoxon sign rank sum (WCSRS) test, random forest (RF), Boruta, and least absolute shrinkage and selection operator (LASSO). We also used six classifiers: AdaBoost (AB), classification and regression tree (CART), artificial neural network (ANN), random forest (RF), linear discriminant analysis (LDA), and naive Bayes (NB). The performance of these classifiers is evaluated by accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), and F-measure (FM). We used a simulated dataset to check the validity of the proposed method. The results indicate that the combination of LASSO-based FST and the NB classifier gives the highest classification accuracy of 99.95%. On the basis of these results, we conclude that the combination of LASSO-based FST and the NB classifier predicts leukemia cancer more accurately than any other combination of FST and classifier used in this study.


Introduction
Cancer is currently one of the most important health burdens in the world. It arises when cell division becomes uncontrolled [1]. According to the World Health Organization (WHO), there were about 18.10 million new cases and 9.6 million deaths due to cancer worldwide in 2018 [2]. Leukemia, a group of blood cancers, is one of the leading and most detrimental cancers. It begins in the bone marrow and spreads via the blood cells [3]. In 2015, about 2.3 million people were suffering from leukemia and 353,500 died of it [4]. Curing cancer is therefore essential for human survival, and cancer research is now one of the most prominent areas of medical research. To provide better treatment to patients, it is important to predict the different types of cancer precisely. Clinical and morphological prediction has been used to detect cancer early [5]. An approach based on global gene expression was proposed to address the problem of cancer classification [6][7][8]. Microarray technology has enabled the simultaneous monitoring of genes and cancer classification, and the early results obtained were promising. With the development of DNA microarray technology, it is possible to monitor the expression level of a huge number of genes and generate gene data [9]. Gene expression datasets differ from other datasets in three basic ways: high dimensionality (thousands of genes), sample sizes that may be small or large and that contain noisy data, and many genes that are irrelevant to cancer distinction. Classification techniques were unable to handle this kind of data effectively [10]. To obtain promising results, many researchers suggested selecting the most significant genes before performing classification [11]. This helps to reduce the computation time as well as the data size. The classification accuracy is increased by removing a huge number of irrelevant genes [12].
Many previous studies have examined feature selection techniques (FST) for microarray gene selection data [13,14]. However, the combination of a large number of FST has not been well studied, and the Boruta and LASSO feature selection techniques have not yet been used for gene expression data. Machine learning (ML) techniques can achieve the best classification accuracy by selecting the most informative genes. ML-based systems include AdaBoost (AB), classification and regression tree (CART) and artificial neural network (ANN). Principal component analysis (PCA) was used as an FST for different gene expression cancer datasets, and a quadratic discriminant analysis (QDA) classifier provided the highest classification accuracy of 97.40% [15]. The partial least squares (PLS) method was also used to extract the most significant genes from a blue cell dataset, with linear discriminant analysis (LDA) as the classifier. The combination yielded 98.50% classification accuracy [16]. A problem with previous studies is that none gave a fully satisfactory result for cancer classification, because even 1% misclassification can be a serious issue. The question addressed by our research is which combination of FST and classifier provides the highest classification accuracy.
This research rests on a two-stage system. In the first stage, the most significant genes are identified using five FST, namely: t-test, Wilcoxon sign rank sum (WCSRS) test, random forest (RF), the Boruta package and least absolute shrinkage and selection operator (LASSO). The two statistical tests use the p-value to identify the cancerous genes. RF, Boruta and LASSO use mean decrease error (MDE), maximum Z score among shadow attributes (MZSA) and a tuning parameter, respectively, to identify cancer-relevant genes. In the second stage, the ML step selects the most suitable classifier from six candidates, namely: AdaBoost (AB), classification and regression tree (CART), artificial neural network (ANN), random forest (RF), linear discriminant analysis (LDA) and naive Bayes (NB). The performance of these techniques is evaluated using accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV) and F-measure. To validate our best classification, we used a simulated dataset.

Data Sources
In this study we used the leukemia cancer gene expression dataset provided by the Kent Ridge biomedical data repository, USA, which is publicly available [17]. The dataset contains 72 patients and 7129 genes. Of the 72 patients, 25 are controls and 47 are cancer patients. The data matrix of the gene expression data is presented in Table 1.

Overview of the Proposed Computational Method
The first step is to normalize the leukemia cancer data. In the second step, we extract the most informative genes using five FST: t-test, WCSRS test, RF, Boruta, and LASSO. The next step is to divide the dataset into two groups: a training set (70%) and a test set (30%). Then six classifiers (AB, CART, ANN, RF, LDA, and NB) are adopted to classify the patients as cancer vs. control. For each classifier, we estimate the model parameters from the training set. These parameters are then used on the test set to predict leukemia cancer. To obtain more reliable results, we repeat this process 1000 times for each classifier and report the mean values of the final results. An overview of this study is presented in Figure 1.
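The repeated 70%/30% evaluation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the simple (unstratified) random split, and the majority-class stand-in classifier are all assumptions made for the example.

```python
import random
from statistics import mean

def repeated_holdout(X, y, fit, predict, n_repeats=1000, train_frac=0.7, seed=0):
    """Mean test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(y)
    n_train = int(round(train_frac * n))
    accuracies = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)  # new random split on every repetition
        train, test = idx[:n_train], idx[n_train:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return mean(accuracies)

# Hypothetical stand-in classifier: always predicts the majority training class.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model
```

Any of the six classifiers could be plugged in through the `fit` and `predict` arguments, so the same loop would serve every FST and classifier combination.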

Data Normalization
Data normalization is needed to avoid bias in gene expression data [18]. In this study, we normalized the leukemia cancer dataset using the standardization equation below:
$$Z = \frac{X - \mu}{\sigma}$$
where X is the variable to be normalized, µ and σ are the mean and standard deviation of that variable, and Z is the normalized variable, which has mean 0 and standard deviation 1.
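As a small illustration of this standardization, the sketch below subtracts the mean and divides by the standard deviation for one gene. The function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which was used.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return (x - mu) / sigma for every value of one gene."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A constant gene carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

Applied gene by gene, this puts all 7129 expression profiles on a common scale before feature selection.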

Feature Selection Techniques
Feature selection is a critical and challenging task in statistical analysis. It helps us to choose the high-risk genes for cancer. Because microarray gene expression data are high-dimensional, extracting the important features is essential. In this study, we used five FST: t-test, WCSRS test, RF, Boruta, and LASSO.

T-test
The t-test is a simple and standard statistical approach to variable selection. It has been extensively studied in machine learning and bioinformatics as a way to measure the difference in means between two groups (cancer vs. control) [19]. The t-statistic is written as follows:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
where $\bar{X}_1$ and $\bar{X}_2$ are the means of the cancer and control groups, respectively, $s_1^2$ and $s_2^2$ are the corresponding variances, and $n_1$ and $n_2$ are the numbers of cancer and control samples, respectively. The t-statistic follows a t-distribution with $(n_1 + n_2 - 2)$ degrees of freedom. In this study, we used three p-value cutoffs (0.01, 0.001, and 0.0001) for selecting the most significant genes.
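For a single gene, the statistic can be computed as in the sketch below. It assumes the pooled-variance (equal-variance) form of the two-sample t-test, which is the form consistent with the degrees of freedom stated above; the function name is hypothetical. Obtaining the p-value needs the t-distribution, for which a library routine such as `scipy.stats.ttest_ind` could be used instead.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(cancer, control):
    """Two-sample t-statistic with pooled variance, plus its degrees of freedom."""
    n1, n2 = len(cancer), len(control)
    s1_sq, s2_sq = variance(cancer), variance(control)  # sample variances
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / df
    t = (mean(cancer) - mean(control)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df
```

A gene would be retained when the p-value associated with `t` falls below the chosen cutoff.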

Wilcoxon Sign Rank Sum Test
The Wilcoxon sign rank sum (WCSRS) test is a nonparametric approach that can be used as a feature selection technique [20]. It has been noted to be a powerful technique for gene selection [21,22]. It is used to compare two matched samples. Let $x_{1i}$ and $x_{2i}$ (i = 1, 2, ..., 7129) be the two sets of measurements. First, we calculate the absolute difference $|x_{1i} - x_{2i}|$ between the two measurements and omit the pairs whose absolute difference is zero. Then we rank the remaining absolute differences, giving ranks $R_i$, and determine the sign of each difference, $\operatorname{sgn}(x_{1i} - x_{2i})$. The test statistic can be written as:
$$W = \sum_{i} \operatorname{sgn}(x_{1i} - x_{2i})\, R_i$$
where the sum runs over the retained pairs. The p-value of the WCSRS test statistic (W) is compared with a chosen cutoff. We used three different cutoffs (<0.01, <0.001, <0.0001) for selecting the significant genes.
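The steps described above (drop zero differences, rank the absolute differences, sum the signed ranks) can be sketched as follows. The function name is hypothetical, and assigning average ranks to tied absolute differences is an assumption, since the text does not say how ties are handled.

```python
def signed_rank_statistic(x1, x2):
    """Sum of signed ranks of the non-zero paired differences."""
    diffs = [a - b for a, b in zip(x1, x2) if a != b]  # omit zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return sum((1 if d > 0 else -1) * r for d, r in zip(diffs, ranks))
```

For example, the pairs (5, 4), (3, 5), (8, 8), (6, 2) give differences 1, -2 and 4 after the zero is dropped, ranks 1, 2 and 3, and a statistic of 1 - 2 + 3 = 2.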

Random Forest
Random forest (RF) is one of the most popular techniques for feature selection [23]. Permutation importance, or Mean Decrease in Accuracy (MDA), is evaluated for each feature by breaking the association between that feature and the target [24]. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. The influence of correlated features is also removed.
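The permutation idea can be illustrated independently of the forest itself. The sketch below is a simplification: it takes any fitted prediction function (a hypothetical `predict` callable) and measures the average increase in error after shuffling one feature column, whereas an actual random forest computes this on the out-of-bag samples of each tree.

```python
import random

def permutation_importance(predict, X, y, feature, n_repeats=10, seed=0):
    """Average increase in error rate after permuting one feature column."""
    rng = random.Random(seed)

    def error(rows):
        return sum(predict(r) != t for r, t in zip(rows, y)) / len(y)

    baseline = error(X)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the link between this feature and the target
        permuted = [row[:feature] + [v] + row[feature + 1:]
                    for row, v in zip(X, column)]
        increases.append(error(permuted) - baseline)
    return sum(increases) / n_repeats
```

A feature the model never uses receives an importance of exactly zero, because shuffling it cannot change any prediction.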

Boruta Package
The Boruta package takes a wrapper approach built around RF. Its algorithm determines the relevant features by comparing the relevance of the real features with that of random probes [23,25]. In the Boruta algorithm, the Z-score alone cannot be used to measure importance. So, for each attribute we create a corresponding 'shadow' attribute, whose values are obtained by shuffling the values of the original attribute across objects. Then we compute the importance of all attributes and finally select the variables based on their importance.
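The shadow-attribute idea can be sketched in a single pass as below. This is only an illustration under stated assumptions: the actual Boruta algorithm uses random forest importance Z-scores and repeats the comparison over many iterations with a statistical test, whereas here a toy importance measure (`mean_gap`, a hypothetical stand-in) is used and the comparison is made once. Labels are assumed to be coded 0 and 1.

```python
import random

def mean_gap(X, y, j):
    """Toy importance: absolute difference between the class means of feature j."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def shadow_selection(X, y, importance=mean_gap, seed=0):
    """Keep real features whose importance exceeds the best shadow importance."""
    rng = random.Random(seed)
    n_features = len(X[0])
    shadow_columns = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # shadow attribute: same values, shuffled across objects
        shadow_columns.append(column)
    extended = [
        list(row) + [shadow_columns[j][i] for j in range(n_features)]
        for i, row in enumerate(X)
    ]
    scores = [importance(extended, y, j) for j in range(2 * n_features)]
    threshold = max(scores[n_features:])  # maximum importance among shadows
    return [j for j in range(n_features) if scores[j] > threshold]
```

Because the shadows are shuffled copies, they have the same marginal distribution as the real features but no systematic relation to the class label, which is what makes them a useful reference level.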

Least Absolute Shrinkage and Selection Operator
Least absolute shrinkage and selection operator (LASSO) was first introduced by Tibshirani [26]. LASSO is a powerful method that performs two main tasks: regularization and feature selection. LASSO sets up a linear regression model and penalizes the regression coefficients with the L1 norm [26]. Most of the coefficients are shrunk to zero, and the inputs with non-zero coefficients are the ones selected. Shrinking and removing coefficients in this way can reduce the variance without a significant increase in the bias [27]. LASSO can therefore provide very good prediction accuracy, which is especially useful when a dataset has a small number of observations and a large number of features.
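For reference, the L1-penalized least squares problem described above can be written as follows. The notation is introduced here for illustration and is not taken from the original text: n is the number of patients, p the number of genes, $y_i$ the class label, $x_{ij}$ the expression of gene j in patient i, and λ ≥ 0 the tuning parameter. The 1/(2n) scaling is one common convention and is an assumption.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}
\left\{
  \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Bigr)^{2}
  + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
\right\}
```

Larger values of λ drive more coefficients exactly to zero, so the selected genes are those with $\hat{\beta}_j \neq 0$ at the chosen λ.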

Classification Techniques
In this study, six important and widely available classifiers are adopted for their simplicity and popularity: AdaBoost (AB), classification and regression tree (CART), artificial neural network (ANN), random forest (RF), linear discriminant analysis (LDA) and naive Bayes (NB).

AdaBoost
AdaBoost (AB) is short for Adaptive Boosting. AB is one of the most widely used algorithms for constructing a strong classifier in machine learning, and it was developed for binary classification [28]. AB uses short decision trees as weak classifiers. After the first tree is created, its performance on each training instance is used to weight how much attention the next tree should pay to that instance, and each subsequent tree is built in the same way. Generally, AB uses the weighted average of the weak classifiers to predict [29].
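The weighting scheme can be made concrete with the standard discrete AdaBoost formulas, given here as a reference sketch rather than as the exact variant used in this study. The notation is introduced for illustration: labels are coded $y_i \in \{-1, +1\}$, $h_t$ is the t-th weak classifier, $\epsilon_t$ is its weighted training error, $w_i$ are the instance weights and T is the number of rounds.

```latex
\alpha_t = \frac{1}{2} \ln\!\left( \frac{1 - \epsilon_t}{\epsilon_t} \right),
\qquad
w_i \leftarrow \frac{w_i \exp\bigl( -\alpha_t\, y_i\, h_t(x_i) \bigr)}{\sum_{k} w_k \exp\bigl( -\alpha_t\, y_k\, h_t(x_k) \bigr)},
\qquad
H(x) = \operatorname{sign}\!\left( \sum_{t=1}^{T} \alpha_t\, h_t(x) \right)
```

Misclassified instances gain weight after each round, so the next tree concentrates on them, and more accurate trees receive a larger vote $\alpha_t$ in the final prediction.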

Classification and Regression Tree
Classification and regression tree (CART) is a nonparametric decision tree learning technique proposed by Breiman for constructing binary trees [30]. Binary means that each node in the decision tree can only be split into two groups. CART can handle numerical and categorical values as well as missing attribute values. It is widely used for both regression and classification in machine learning [31].

Artificial Neural Network
The artificial neural network (ANN) was proposed by McCulloch and Pitts (1943) to simulate the behavior of a biological system composed of neurons [32]. The human brain is made up of millions of neurons [33]. ANN was developed on the model of the animal central nervous system. It is used not only in machine learning but also in pattern recognition. An ANN consists of a large number of connected processing units that work together to process information. A neural network contains three layers. First, the input layer represents the raw information that is fed into the network. Second, the hidden layer determines the activity of each hidden unit. Finally, the output layer produces the output, whose behavior depends on the activity of the hidden units.

Random Forest
As well as for feature selection, the random forest (RF) model can be used as a machine learning classifier. RF is a tree-based regression and classification technique, and it is suitable for both parametric and nonparametric cases [23,34]. The first algorithm for random decision forests was created by Ho (1995) using the random subspace method [35]. The RF classifier grows each tree using either randomly selected features or a combination of features at each node. The gain ratio criterion [36] and the Gini index [37] are used as attribute selection measures in decision trees. In the RF-based model, the Gini index is used as the attribute selection measure; it measures the impurity of an attribute with respect to the class.
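As a small illustration of the impurity measure mentioned above, the Gini index of a set of class labels is one minus the sum of the squared class proportions. The function name below is hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())
```

A pure node has impurity 0, and a node with the class balance of the full dataset (47 cancer, 25 control) has impurity of about 0.453. A split is preferred when it lowers the weighted impurity of the child nodes.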

Linear Discriminant Analysis
Linear discriminant analysis (LDA) is used to overcome the drawbacks of the logistic regression classifier [38]. Assuming that the data are Gaussian and that each attribute has the same variance, LDA estimates the mean and variance from the dataset. LDA makes predictions by estimating the probability that a new set of inputs belongs to each class.
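In its usual textbook form, given here as a reference sketch with notation introduced for illustration, LDA assigns a new input x to the class k with the largest linear discriminant score, where $\mu_k$ is the class mean, $\Sigma$ the covariance matrix shared by all classes and $\pi_k$ the prior probability of class k:

```latex
\delta_k(x) = x^{\top} \Sigma^{-1} \mu_k
            - \frac{1}{2}\, \mu_k^{\top} \Sigma^{-1} \mu_k
            + \log \pi_k,
\qquad
\hat{y}(x) = \arg\max_{k}\, \delta_k(x)
```

The shared covariance is what makes the decision boundary between two classes linear in x.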

Naive Bayes
Naive Bayes (NB) is a simple technique for constructing classifiers and has been studied extensively since the 1960s. NB classifiers are highly scalable: in a learning problem, they require a number of parameters that is linear in the number of variables [39]. The central assumption of NB is that the value of a particular feature is independent of the value of any other feature, given the class variable [40].
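The independence assumption means that the class score is the log prior plus a sum of per-feature log likelihoods. The sketch below assumes Gaussian likelihoods for each gene, which the text does not specify; the function names and the small variance floor `eps` are also assumptions made for the example.

```python
from math import log, pi
from statistics import mean, pvariance

def fit_gaussian_nb(X, y, eps=1e-9):
    """Per class: log prior, per-feature means and per-feature variances."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cols = list(zip(*rows))
        model[c] = (
            log(len(rows) / len(y)),
            [mean(col) for col in cols],
            [pvariance(col) + eps for col in cols],  # eps avoids zero variance
        )
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log posterior (up to a constant)."""
    def log_posterior(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(
            -0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
    return max(model, key=log_posterior)
```

Each class stores one mean and one variance per feature, which is why the parameter count grows only linearly with the number of selected genes.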

Statistical Performance Evaluation
Accuracy (ACC), sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV) and F-measure (FM) were used to measure the performance of the different classifiers. These measures are calculated from the numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The details of these measures are described by Maniruzzaman et al. [19].
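The sketch below computes the six measures from the four counts using their standard definitions. The F-measure is taken to be the harmonic mean of PPV and SE, which is an assumption because the exact definition is deferred to [19]; the function name is hypothetical and no guard against empty denominators is included.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard performance measures from confusion-matrix counts."""
    se = tp / (tp + fn)    # sensitivity
    sp = tn / (tn + fp)    # specificity
    ppv = tp / (tp + fp)   # positive predictive value
    npv = tn / (tn + fn)   # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)
    fm = 2 * ppv * se / (ppv + se)  # assumed: harmonic mean of PPV and SE
    return {"ACC": acc, "SE": se, "SP": sp, "PPV": ppv, "NPV": npv, "FM": fm}
```

For instance, hypothetical counts of TP = 45, TN = 24, FP = 1 and FN = 2 would give an accuracy of 69/72 and a specificity of 0.96.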

Identify Best Feature Selection and Classification Technique
One of the main objectives of this experiment was to find the most significant genes for leukemia cancer. Figure 2 shows a clustered bar diagram in which the vertical axis presents the number of significant genes and the horizontal axis the different FST. The other statistical performance evaluation parameters (SE, SP, PPV, NPV and FM) are described in the appendix (Appendix 1: Table A1) for the leukemia cancer dataset.

Validation of Proposed Method
To validate the proposed method, we generated 72 observations for 7129 genes from the normal distribution, using the mean and variance of the corresponding 7129 genes of the leukemia dataset. Of these, 47 patients are cancer and 25 are controls. The validation of the proposed computational method is presented in Table 3. The results show that LASSO selects only fifty-nine genes and NB gives a classification accuracy of 100%, whereas CART provides the lowest classification accuracy (48.30%). Therefore, the combination of LASSO FST with the NB-based classifier gives the highest classification accuracy. The other statistical performance evaluation parameters (SP, SE, PPV, NPV and FM) are described in the appendix (Appendix 2: Table A2) for the simulated dataset.
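The data generation step can be sketched as below. This is a minimal illustration with a hypothetical function name. It assumes that each gene is drawn independently from a normal distribution and that the mean and standard deviation are computed over whichever rows of the real matrix are passed in; the text does not say whether the parameters were estimated per class or over all patients.

```python
import random
from statistics import mean, stdev

def simulate_expression(real_matrix, n_samples, seed=0):
    """Draw n_samples rows, each gene ~ Normal(mean, sd) of the real gene."""
    rng = random.Random(seed)
    genes = list(zip(*real_matrix))  # one tuple of observed values per gene
    params = [(mean(g), stdev(g)) for g in genes]
    return [[rng.gauss(mu, sd) for mu, sd in params] for _ in range(n_samples)]
```

Calling the function separately on the cancer rows (47 samples) and the control rows (25 samples) would be one way to retain class-specific means in the simulated data.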

Discussion
A total of ninety combination systems were designed by cross-combining five FST (t-test, WCSRS test, RF, Boruta and LASSO) and six classifiers (AB, CART, ANN, RF, LDA and NB). Classification accuracy was evaluated for each combination of FST and classifier. In the first stage, we selected the most informative genes with two statistical tests (t-test and WCSRS test) at p-values below 0.01, 0.001 and 0.0001. Then the other FST available in R (RF, Boruta and LASSO) were applied to obtain the most significant genes. The accuracy of each classifier with each FST was evaluated for leukemia cancer, and the classifiers were further compared on the basis of SE, SP, PPV, NPV and FM. From the performance of the different FST and classification techniques, we conclude that LASSO-based FST combined with the NB-based classifier performed better than all other techniques. A benchmarking of the proposed system against previous work is presented in Table 4, which sets out the key differences between our current study and previously published studies.
Comparison Between Our Current Study and Previous Studies.
A novel method was developed to analyze gene expression data from cancer tissue, in which the signal-to-noise ratio was used to extract the most important genes, those whose expression levels differed strongly from other tissue types [41]. In another study, a support vector machine (SVM) was used to classify leukemia and colon cancer patients, and a genetic algorithm was proposed to identify the subset of predictive genes. SVM gave 94.10% accuracy for the leukemia cancer dataset and 90.30% accuracy for the colon cancer dataset [42]. A novel procedure for predicting gene samples based on microarray gene expression was developed by Nguyen and Rocke (2002) [15]. They used two FST, PLS and PCA, to reduce the dimension of tumor gene data, along with two classifiers, logistic discriminant (LD) and QDA. The results showed that the combination of PLS and LD gave the highest classification accuracy (94.20%). Dev et al. (2012) focused on the BPN, FLANN and PSO-FLANN classifiers for breast cancer using a signature composing method [43]. The integrated approach of FLANN and PSO (92.36% accuracy) appeared to predict the disease well. Student and Fujarewicz (2012) proposed a multiclass gene selection method based on PLS with SVM, multiclass SVM and LDA classifiers [16]. The authors focused on the effective identification of informative genes, and new subsets of genes for lung, leukemia and blue cell datasets were derived. LDA was the more reliable classifier, with the highest accuracy (98.50%). Sharma and Paliwal (2012) applied a new algorithm to leukemia, lung and breast cancer data to extract a subset of crucial genes [44]. Compared with existing techniques, their approach gave more promising results for both the lung and breast cancer datasets. A Bayesian classification approach using the selected important genes provided high classification accuracy: the lung and breast cancer datasets gave 100% accuracy, whereas the leukemia cancer dataset gave only 96.30% accuracy. Several gene selection and classification methods were applied by Bhola and Tiwari (2015) to different types of cancer datasets [45]. The study found that the AB classifier gave 98% accuracy for prostate cancer using the FCFB gene selection method [45]. A recent study on colon cancer classification using four gene selection methods and ten classifiers, conducted by Maniruzzaman et al. (2019), found that the WCSRS test with the RF classifier provided the highest classification accuracy of 99.81% [19].
In this study, the 7129 genes of the leukemia cancer dataset were first used for extracting the important genes. Next, we evaluated the classification accuracy using six classifiers. LASSO-based FST gives the best accuracy when the naive Bayes classifier is applied. This research therefore offers a new insight into the analysis of microarray gene expression data for leukemia cancer.
One of our main objectives was to compare the performance of the combinations of five FST with six classifiers on both the leukemia and the simulated dataset. The simulated dataset also supports our evaluation on the leukemia cancer dataset. Table 2 and Table 3 show the mean accuracy of all FST for the leukemia and simulated datasets, respectively. The NB classifier gives the highest accuracy for both the original data (99.95%) and the simulated data (100.00%) when the feature variables are derived from the LASSO method. Among all statistical tests and feature selection methods, the best performance was obtained by the LASSO method, followed by the t-test, WCSRS test, RF and the Boruta package. Finally, we conclude that the combination of LASSO and the NB-based classifier performs better than the other combinations, a finding that is validated by the same result on both the leukemia and the simulated dataset.

Strength and Extension of the Study
This research presents a high-risk stratification system to accurately predict leukemia cancer. Our study showed that LASSO FST with the NB-based classifier gives the best classification accuracy, along with better values of the other statistical performance measures. To seek further improvement, other FST such as the F-test and KW test, as well as other classifiers such as SVM and KNN, may be applied. The study can also be extended by applying deep learning (DL) to microarray gene expression data and comparing the results with our current study.

Conclusion
This study presented a comprehensive evaluation of the classification of leukemia cancer gene expression data in two major stages. First, the high-risk differential genes were identified using different FST. Then different classifiers were compared to find the best classifier for predicting leukemia cancer. Five FST, namely t-test, WCSRS test, RF, the Boruta package and LASSO, were used to identify the high-risk differential genes. Further, six classification methods (AB, CART, ANN, RF, LDA and NB) were applied and their accuracy was evaluated. The highest classification accuracy of 99.95% was obtained by the combination of LASSO FST and the NB-based classifier. LASSO-based FST with the NB classifier was therefore the best performer for leukemia cancer classification.

Appendix
Appendix 1 and Appendix 2 report the statistical performance of the six classifiers under the different feature selection techniques (t-test, WCSRS, RF, Boruta and LASSO) for the leukemia and simulated datasets, respectively. The six classifiers were AdaBoost (AB), classification and regression tree (CART), artificial neural network (ANN), random forest (RF), linear discriminant analysis (LDA) and naive Bayes (NB). The performance of these classifiers is evaluated using sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV) and F-measure (FM).

Ethics Approval
No ethical approval is required for this dataset.

Funding
No funding was received for this project.