High Accuracy Classification of Populations with Breast Cancer: SVM Approach

Abstract: Breast cancer is one of the most common cancers diagnosed in the United States. Breast cancer can occur in both men and women. The number of deaths associated with this disease is steadily declining, largely due to factors such as earlier detection and a new personalized approach to treatment. In this article, we offer a highly accurate and reliable classification approach based on feature engineering and an improved support vector machine (SVM) classifier. We examine a dataset with 30 features and use in-depth data analytics and visualization to pinpoint the top nine features that have a significant impact on classification accuracy. The SVM classifier outperformed other classifiers, including kernel extensions, with a high accuracy of 99.12%. The study stresses the value of machine learning in medical diagnosis, notably in the early detection of breast cancer.


Background
Breast cancer is the most common cancer affecting women and the most diagnosed cancer worldwide. Breast cancer is also the most common cancer in women in the United States, except for skin cancers. It accounts for about 30% (or 1 in 3) of all new female cancers each year.
The American Cancer Society's estimates for breast cancer in the United States for 2023 are as follows: (1) about 297,790 new cases of invasive breast cancer will be diagnosed in women; (2) about 55,720 new cases of ductal carcinoma in situ (DCIS) will be diagnosed; (3) about 43,700 women will die from breast cancer. Breast cancer mainly occurs in middle-aged and older women. The median age at the time of breast cancer diagnosis is 62, meaning that half of the women who develop breast cancer are 62 years of age or younger when diagnosed. An exceedingly small number of women diagnosed with breast cancer are younger than 40-45.
One of the most important factors in breast cancer treatment is timely detection. Detecting cancer at an early stage could significantly reduce breast cancer death rates; identifying early-stage cancer cells is the most critical point for the best prognosis.
Computer science, and machine learning in particular, has emerged as a valuable tool for detecting various medical conditions with greater accuracy than other approaches. Machine learning involves the creation of algorithms that classify patients as having cancer or being cancer-free.
Why have advanced algorithms moved to the forefront of cancer research? Screening for breast cancer is a very sensitive matter: aggressive screening strategies maximize the benefits of early detection, whereas less frequent screening reduces false positives, anxiety, and costs for those who will never develop breast cancer. The technology presented in this and other papers may significantly reduce the number of unnecessary screenings, saving considerable money for the nation's healthcare system.
In addition, current clinical guidelines use risk models to determine which patients should be recommended for supplemental imaging and MRI. Some guidelines use risk models based on age alone to determine if, and how often, a woman should be screened; others combine multiple factors related to age, hormones, genetics, and breast density to determine further testing. Despite decades of effort, the risk models used in clinical practice remain largely inaccurate.
We propose a novel, highly accurate, and robust classification algorithm based on an optimized support vector machine classifier combined with random-forest-based feature engineering.

Introduction
Cancer is an uncontrolled growth of cells in the body that can rapidly spread to any organ; 90% of cancer patients die from metastasis. Numerous types of cancer exist, but lung cancer, breast cancer (BC), and skin cancer are among the most prevalent. According to World Health Organization (WHO) reports, the annual cancer death toll is as high as 9.2 million, with lung cancer accounting for roughly 1.7 million deaths and breast cancer for 627,000.
Breast cancer is considered a multifactorial disease and is the most common cancer in women worldwide: about 1.5 million women are diagnosed with breast cancer each year, and on average 500,000 women die from the disease worldwide. Over the past 30 years the incidence of this disease has increased, while the death rate has decreased thanks to mammography screening.
Several image-guided deep-learning models have been developed for the prediction of cancer. Along these lines, several machine-learning algorithms have been used to distinguish benign from malignant cells based on histopathological reports (a classification problem). Histopathological reports are stored in electronic health records (EHRs) from the time of diagnosis until discharge; text-mining frameworks extract meaningful information from these medical records or text documents and feed it to machine-learning algorithms for cancer prediction.
Diagnostic mammography can assess abnormal breast tissue in patients with subtle signs of malignancy. Because of the large number of images, however, this method cannot be used effectively to assess cancer-suspected areas: approximately 50% of breast cancers are missed in screenings, while a quarter of women with breast cancer receive a false-negative diagnosis.
Based on these data, a stage-specific interpretation system was designed, and this information serves as the primary resource for guiding patients' treatment. Following confirmation of the disease's stage and subtype, the healthcare provider initiates chemotherapy to mitigate the growth of cancer cells; this can be done by modifying the expression of several genes. Text mining has helped to find biologically relevant alternative therapeutic candidates, although drug development remains a lengthy and expensive procedure.
Most mammography-based breast cancer screenings are performed at regular intervals, usually annually or every two years. Experts suggest, however, that considering other risk factors along with mammography screening can enable a more accurate diagnosis of women at risk. Moreover, effective risk prediction through modeling can help radiologists set up personalized screening for patients, encourage them to participate in early-detection programs, and identify high-risk patients.
Machine learning and data mining are modeling approaches that discover hidden relationships to predict different diseases. A major challenge in predicting breast cancer is creating a model that addresses all known risk factors influencing disease progression. Unfortunately, current prediction models focus only on the analysis of mammographic images, leaving out other critical factors such as lifestyle, laboratory data, and patient biopsies.
Combining multiple risk factors in breast cancer prediction modeling could support early diagnosis of the disease with the necessary care plans. The collection, storage, and management of diverse data, together with intelligent multifactorial prediction systems, are effective tools in disease management.
Therefore, multifactorial models with many risk features can be effective in assessing the risk of breast cancer. The current study aimed to predict breast cancer using different machine-learning approaches considering various factors in modeling.
The Support Vector Machine (SVM) was introduced by Vapnik [1-3]. SVM is a supervised learning technique used for regression and classification. The goal of SVM is to identify the hyperplane with the maximum margin that separates the classes linearly. SVMs are particularly well suited to problems with limited training data, where conventional large-sample statistical methods cannot guarantee an optimal solution.
Osareh and Shadgar combined support vector machine, K-nearest neighbour, and probabilistic neural network classifiers with signal-to-noise-ratio feature ranking, sequential feature selection, and principal-component-analysis feature extraction to distinguish between benign and malignant breast tumors [4]. The best overall accuracies for breast cancer diagnosis, 98.80% and 96.33%, were achieved using support vector machine classifier models on two widely used breast cancer benchmark datasets. Further related studies are reported by Fan and others [9-13]. The breast cancer datasets used in this study may be found at the ACM SIGKDD Cup 2008 and the UCI machine learning repository; the latter is a dataset of relatively small size, consisting of 577 data samples, each of which includes 32 distinct features.
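As a concrete reference point, a copy of the UCI-style breast cancer data ships with scikit-learn; a minimal loading sketch follows. Note that scikit-learn's copy contains 569 samples with 30 numeric features, so counts may differ slightly from other distributions of the data.

```python
# Minimal sketch: loading the UCI-style breast cancer data bundled with
# scikit-learn (569 samples, 30 numeric features in this distribution).
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target   # y: 0 = malignant, 1 = benign
print(X.shape)                  # (569, 30)
```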

Support Vector Machine: Original and Dual Formulation
Support Vector Machine (SVM) classifies the data into two categories (cancer-yes and cancer-no) with the help of a multi-dimensional boundary that differentiates the outcomes. Let us consider a linear separator that can be expressed as

$w \cdot x + b = 0,$

where $w$ defines the slope (normal) of the hyperplane, $b$ is the intercept (bias), and $x$ is a coordinate vector. In 2-D, with $x = \mathrm{col}(x, y)$ and $w = (m, -1)$, this becomes $y - mx - b = 0$, the equation of a line in the 2-D Euclidean plane. Support vectors are the points closest to the hyperplane; the separating line is defined with the help of these data points. The distance between the hyperplane and the support vectors is called the margin. In the 2-D notation:

Positive hyperplane: $y - mx - b = 1$
Negative hyperplane: $y - mx - b = -1$
Decision hyperplane: $y - mx - b = 0$

The distance from a point $x_0$ in Euclidean space to the hyperplane is given by

$d = \frac{|w \cdot x_0 + b|}{\|w\|},$

where $\|w\|$ is the Euclidean norm of $w$. Let us now consider the optimization problem. We would like to maximize the distance between the two hyperplanes $w \cdot x + b = 1$ and $w \cdot x + b = -1$, which can be expressed as

$d = \frac{2}{\|w\|}.$

Maximizing the distance between the two hyperplanes is equivalent to minimizing the inverse expression $\frac{1}{2}\|w\|^2$ under the constraints $w \cdot x_i + b \ge 1$ for all "positive" points and $w \cdot x_i + b \le -1$ for all "negative" points. Let us introduce the classifier labels $y_i \in \{+1, -1\}$ that define positive and negative points.
So the optimization problem can be presented as finding the optimal $w^*$ and $b^*$ by minimizing

$\min_{w, b} \; \frac{1}{2}\|w\|^2$

subject to the constraints

$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N.$

This constraint is very important as it requires that all the training points are correctly classified. The constrained problem can be solved via the introduction of Lagrange multipliers. The bottom line is that the method of Lagrange multipliers is really just an algorithm that finds where the gradient of a function points in the same direction as the gradients of its constraints, while also satisfying those constraints. For soft bounds the cost function changes to

$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i, \quad \xi_i \ge 0,$

with constraints $y_i (w \cdot x_i + b) \ge 1 - \xi_i$. The idea is: for every vector $x_i$ we introduce a slack variable $\xi_i$. Its value is the distance of $x_i$ from the corresponding class's margin if $x_i$ is on the wrong side of the margin, and zero otherwise. Thus points that are far away from the margin on the wrong side receive more penalty.
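The hard- and soft-bound objectives above can be sketched with scikit-learn's SVC, whose parameter C is precisely the slack-penalty weight in the soft-bound cost; the two-class data here are synthetic, not the study's dataset.

```python
# Sketch: soft-bound linear SVM. C weights the slack term in
# (1/2)||w||^2 + C * sum(xi_i); a very large C approximates the hard bound.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)),   # "negative" cluster
               rng.normal(2.0, 1.0, (50, 2))])   # "positive" cluster
y = np.array([-1] * 50 + [1] * 50)

soft = SVC(kernel="linear", C=1.0).fit(X, y)     # soft bound
hard = SVC(kernel="linear", C=1e6).fit(X, y)     # near-hard bound
print(soft.score(X, y), hard.score(X, y))        # training accuracies
```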

Support Vector Machine: Fenchel Transform and Lagrange Multipliers
Suppose we have a convex cost function

$f(w) = \frac{1}{2}\|w\|^2$

and constraints

$g_i(w, b) = y_i (w \cdot x_i + b) - 1 \ge 0.$

We can see that there is only one point where the two gradients point in the same direction: it is the minimum of the objective function under the constraint. Here, the left-hand side $y_i (w \cdot x_i + b)$ can be thought of as the confidence of classification. A confidence score ≥ 1 suggests that the classifier has classified the point correctly; a confidence score < 1 means the classifier did not classify the point correctly, incurring a linear penalty $\xi_i$. By definition, a Lagrange multiplier is a parameter $\alpha_i$ that relates the gradient of the cost function to the gradients of the constraints:

$\nabla f(w) = \sum_i \alpha_i \, \nabla g_i(w, b).$

Let us form the Lagrangian:

$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right].$

The constrained minimum is attained where the gradient of the cost function is collinear with the gradients of the constraint functions, i.e. $\nabla L(w, b, \alpha) = 0$, which is equivalent to the set of equations

$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0.$

Substituting these back yields the dual problem

$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j), \quad \alpha_i \ge 0.$

To solve this problem we do not require the actual data points, only the dot product between every pair of vectors. To calculate the bias constant $b$ we likewise require only dot products: $b = y_k - \sum_i \alpha_i y_i (x_i \cdot x_k)$ for any support vector $x_k$. The major advantage of the dual form of the SVM over the primal Lagrange formulation is that it depends only on the multipliers $\alpha$ and these dot products, which is what makes the kernel extension possible.
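As a sanity check on the dual form, the sketch below uses scikit-learn's SVC (whose attributes dual_coef_ and support_vectors_ store the products α_i·y_i and the support vectors) to recover w = Σ α_i y_i x_i from the dual solution and confirm that it matches the primal weight vector; the data are synthetic.

```python
# Sketch: recovering the primal w from the dual solution. In scikit-learn,
# dual_coef_ holds alpha_i * y_i for the support vectors only, so the sum
# w = sum_i alpha_i y_i x_i runs over the support vectors alone.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, (40, 2)),
               rng.normal(2.0, 1.0, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.dual_coef_ @ clf.support_vectors_   # shape (1, 2)
print(np.allclose(w, clf.coef_))            # True: same hyperplane
```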

Data Analytics and Visualization
In this section, we display a few characteristics of the data features. We have calculated the pairwise correlations between the features from the correlation matrix: for example, mean concavity has a 92% correlation with mean concave points, and concavity error has a 77% correlation with concave points error. In the statistical analysis, the bar-graph plot shows the class distribution, where 0 stands for 'malignant' and 1 stands for 'benign'.
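The correlations quoted above can be reproduced with a short pandas sketch; the feature names follow scikit-learn's copy of the data, and the exact values may differ slightly from the dataset used in this study.

```python
# Sketch: feature-feature correlations from the correlation matrix.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
corr = df.corr()

print(round(corr.loc["mean concavity", "mean concave points"], 2))
print(round(corr.loc["concavity error", "concave points error"], 2))
```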
Malignant is a term used to describe cancer: malignant cells grow in an uncontrolled way and can invade nearby tissues and spread to other parts of the body through the blood and lymph systems. Benign means not cancer: benign tumors may grow larger but do not spread to other parts of the body (also called nonmalignant).

Support Vector Machine and Feature Engineering
The classification was done using a support vector machine. The decision boundary was found using Lagrange multipliers and the Fenchel transform, which maps the original problem to a dual problem with a convex cost function.

Data training
We found that the results depend on how we split the data into training and test sets.

Kernel Trick on SVM

We have two classes of observations, malignant tumors and benign tumors: the blue points and the purple points. There are numerous ways to separate these two classes, as shown in Figure 15. However, we want to find the "best" hyperplane, the one that maximizes the margin between the two classes, meaning that the distance between the hyperplane and the nearest data points on each side is largest. Depending on which side of the hyperplane a new data point falls, we can assign a class to the new observation.
However, there are a few caveats: not all data are linearly separable. In fact, in the real world almost all data are randomly distributed, which makes it hard to separate different classes linearly [14,16]. As one can see in Figure 16, if we find a way to map the data from 2-dimensional space to 3-dimensional space, we will be able to find a decision surface that clearly divides the different classes. The first thought may be to map all the data points to a higher dimension (in this case, 3 dimensions), find the boundary, and make the classification.
However, when there are more and more dimensions, computations become more and more expensive. This is when the kernel trick comes in.
It allows us to operate in the original feature space without computing the coordinates of the data in higher dimensions.
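A tiny worked example (hypothetical 2-D vectors, degree-2 polynomial kernel) makes the trick concrete: the kernel value K(x, z) = (x·z)² equals the dot product of the explicit 3-D feature maps, without ever computing those maps.

```python
# Sketch: kernel trick for K(x, z) = (x . z)^2 in 2-D. The explicit map is
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2); K computes phi(x) . phi(z) while
# staying in the original 2-D space.
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2.0) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 1.0])

explicit = phi(x) @ phi(z)   # dot product in the mapped 3-D space
kernel = (x @ z) ** 2        # same value, computed in 2-D
print(explicit, kernel)      # both are 25 (up to rounding)
```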
Let's take a look at the following examples. We applied the kernel trick to the SVM on the best nine features to try to improve accuracy. The RBF (Gaussian) kernel resulted in 0.94 accuracy and the polynomial kernel in 0.95, while the linear kernel on the nine features resulted in 0.99 accuracy. Since the nine features cannot be visualized simultaneously, we visualized the features two by two based on the SVM with a linear kernel.
Since correlation does not necessarily determine the accuracy of the model for SVM, we tried the features pair by pair; some results are visualized below.
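A sketch of the kernel comparison follows. The nine selected features are not listed explicitly in this section, so the first nine columns of scikit-learn's copy of the data serve as an illustrative stand-in, and features are standardized first; accuracies will therefore differ from the 0.99/0.95/0.94 values reported above. As noted earlier, results also depend on the train/test split, so random_state is fixed to make a run reproducible.

```python
# Sketch: comparing SVM kernels on a nine-feature subset (illustrative
# stand-in for the paper's selected features, which are not listed here).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data = load_breast_cancer()
X, y = data.data[:, :9], data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_tr)          # fit scaling on training data only
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for kernel in ("linear", "poly", "rbf"):
    results[kernel] = SVC(kernel=kernel).fit(X_tr, y_tr).score(X_te, y_te)
    print(kernel, round(results[kernel], 3))
```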

Conclusion
Different machine-learning techniques can be used for the prediction of breast cancer. The challenge is to build accurate and computationally efficient breast-data classifiers. In this study, we analyzed a breast cancer dataset using a Support Vector Machine to classify the binary outcome (malignant or benign tumor) [15,17]. We used the Kaggle breast cancer dataset. The novelty of our approach is that we analyzed two algorithms: optimization of the cost function using hard-bound and soft-bound classification expressed in the dual (Lagrange multiplier) form. Moreover, we compared these algorithms with the kernel extension. We compared the performance, efficiency, and effectiveness of the models in terms of accuracy, precision, recall, and specificity to find the best classification accuracy. The results show that the SVM reaches an accuracy of 99.12% and thus outperforms the other classifiers. Our studies show that out of 30 features only nine substantially impact the classification results. Our current research focuses on investigating the performance of a deep-learning architecture for breast cancer classification.

Biography
Philip de Melo is a data scientist and academic. His research focuses on the development and implementation of new IT technologies including artificial intelligence, machine learning, big data analytics, fast data interoperability, etc. in public health and health care. He was on the faculty of Columbia University (NYC) and Georgia Tech (Atlanta, GA). He served as a PI and Co-PI for a number of projects sponsored by ONR, NSF, AFOSR, ONC, and the industrial project MIDAS.
Mane Davtyan is a bachelor's student at the American University of Armenia focusing on Machine Learning. She is taking an internship class at Armenian Code Academy in the Advanced Machine Learning department and K-Telecom CJSC in the Business Data Analysis department. Despite being new to the field, Mane has already had an opportunity to work with programming languages, like Python and R, and build databases in SQL. Her GitHub profile shows projects about various Machine Learning and Deep Learning models.