Explore the Characteristics of Age, BMI and Blood Composition of Breast Cancer Patients Based on Multivariate Statistical Analysis

This paper analyzes breast cancer screening data to study the statistical regularities of multiple correlated indicators measured on multiple subjects. First, univariate and multivariate diagnoses were performed on the data. When studying the correlation between variables, HOMA was found to have a clear positive linear correlation with blood insulin content. Notably, some patients with breast cancer show a high degree of insulin resistance and high blood insulin content, a feature not found in samples without breast cancer. Then, one-way multivariate analysis of variance (MANOVA) indicated significant differences in blood test results, age, and BMI between samples with different health conditions. Next, principal component analysis was used to reduce the dimension of the data, yielding two principal components. In this study, the differences in age, BMI, and blood component content between the two groups with different health conditions can be summarized by these two components. The absolute value of the MCP-1 (monocyte chemoattractant protein 1) coefficient in principal component 1 is large, reflecting the characteristics of the blood components of the sample; the loadings of glucose and leptin in principal component 2 are large, likewise reflecting the blood components. Then, assuming an m = 3 factor model and using both the maximum likelihood method and the principal component method, the original data and the rotated factor solutions were analyzed, so that the variables are reduced to 3 factors. For the rotated maximum likelihood solution, the first factor reflects insulin resistance (insulin and HOMA), the second reflects body fatness (BMI and leptin), and the third reflects blood glucose content.
Finally, by setting different misclassification costs for discriminant analysis, the obtained APER is 0.1638 and E(AER) is 0.1872. The probability of misclassifying a patient with breast cancer as not having breast cancer is 0.09375, a low misclassification rate, which suggests that the model established in this paper is effective.


Introduction
Breast cancer is becoming a leading cause of death among women worldwide, and early detection and accurate diagnosis of the disease have been shown to prolong patient survival [1]. Breast cancer incidence and mortality among Chinese women have increased rapidly over the past 10 years, especially in rural areas, although they remain low by international standards. The distribution of breast cancer incidence and mortality among Chinese women also shows significant differences by age and district [2]. Yang Ling et al. estimated and predicted the incidence and mortality of breast cancer in China using a log-linear model, and concluded that, owing to the combined effects of risk factors, population growth and aging, breast cancer will be one of the fastest-growing malignant tumors in China [3].
Therefore, accurately diagnosing individuals who are screened for breast cancer on the basis of relevant factors allows patients to be identified as early as possible, so that treatment can start as soon as possible. M. Eskelinen et al. compared the diagnostic power of 7 tumor markers, including CEA, AFP, CA15-3, TPS and NEU, in the diagnosis of breast cancer [4]. Moreover, Na Liu et al. established a novel intelligent classification model for breast cancer diagnosis, which employed an information gain directed simulated annealing genetic algorithm wrapper (IGSAGAW) for feature selection [5], while a rough set (RS) based support vector machine classifier (RS_SVM) has also been proposed for breast cancer diagnosis [1].
However, the studies mentioned above build their models on indicators that are difficult to obtain, and cannot be used for preliminary screening of breast cancer under more general conditions. An easier approach is therefore needed. The data collected by the University Hospital Centre of Coimbra [6] record a set of more accessible indicators, and the samples are divided into those with breast cancer and those without.
Therefore, the main purposes of this article are to:
1) Explore the relationship between the content of seven blood components and the age and BMI indicators, and test whether the sample comes from a multivariate normal population;
2) Check, by multivariate analysis, whether the values of the variables differ significantly between breast cancer patients and non-breast cancer patients;
3) Reduce the dimension of the data to obtain several principal components, and explore whether the differences in age, BMI, and blood component content between the two groups with different health conditions can be summarized by these principal components;
4) Summarize the 9 continuous variables into several types of indicators, so as to reflect the relationships between variables more intuitively;
5) Re-classify each examinee as a breast cancer patient or not according to the measured variables, and calculate the misclassification rate.

Data Sources
The blood composition, age, health status, and BMI data of the 116 examinees analyzed in this paper are test results from the University Hospital Centre of Coimbra [6].

Variable Introduction
The data selected in this paper contain 9 continuous variables and one binary variable. The continuous variables are treated as the dependent variables, and the binary variable is the corresponding classification (health status) variable. The names, units and descriptions of these variables are shown in Table 1 below.

Data Diagnosis
After understanding the basic situation of breast cancer, it is necessary to understand the basic structure of the blood test data used for breast cancer screening. In practice, the data we encounter are often raw data, which are generally not clean or complete. For example, there may be missing values or non-unique sample IDs, so the data cannot be used directly for modeling. Data diagnosis helps us understand the defects of the data so that they can be cleaned and integrated. The data diagnosis in this paper considers completeness, accuracy, and rationality, and consists of two parts: the diagnosis of single variables and the diagnosis of the relationships between multiple variables.

Univariate Diagnosis
The first issue is the completeness of each variable. In some samples, some indicators may be missing. This may be because the value was lost during data collection, or because the indicator was not recorded for that sample during the blood test. These two situations have different meanings and need to be distinguished. If there are missing values in the modeling data, they must be imputed, or the records with missing values must be deleted, before modeling can proceed. The results of the univariate diagnosis are summarized in Table 2 and Table 3.
Remark (1): The HOMA variable is an evaluation index of islet β cells. Its maximum value is 25.05, but its mean is only 2.69 and its standard deviation is very small. It can be inferred that there may be outliers or data accuracy problems that need further testing. Since HOMA is an evaluation index of islet β cells, it can also be checked whether it is related to the insulin index.
For each variable, the number of missing values is 0, so the data are complete, but whether the data contain outliers needs further testing.
Then histograms of the variables are plotted, and the results are shown in Figure 1 below. The results show that blood glucose content is concentrated at 75-125 mg/dL, with a very small number of particularly large values, such as 201 mg/dL. The blood insulin, adiponectin, resistin and MCP-1 contents and the HOMA indicator behave similarly: their distributions are right-skewed, that is, most values are small and a few are particularly large. The distributions of age and BMI are relatively uniform. The distribution of insulin content is similar to that of HOMA, and the correlation coefficient between the two is large, which shows that they are strongly correlated.

Multivariate Diagnosis
The correlation coefficients between the nine continuous variables are calculated, and the results are shown in Table 4.
It can be seen from Figure 1 that the distribution of insulin content is similar to the distribution of HOMA. Homeostatic model assessment (HOMA) is a method for assessing β-cell function and insulin resistance (IR) from basal (fasting) glucose and insulin or C-peptide concentrations [9].
Together with Table 4, this shows that the correlation coefficient between the two is large, indicating a strong correlation. Figure 2 confirms the correlation between insulin content and the HOMA indicator. Moreover, since Figure 2 plots the Healthy samples and the Patients samples separately, we can see that the samples with a higher HOMA index (≥8) and higher blood insulin content (≥30 µU/mL) are all from patients. The correlation coefficients in Table 4 also show that the correlation between the HOMA index and blood glucose content is 0.696, so the correlation between these two variables is also considered relatively strong.
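The near-linear relationship between HOMA and insulin is easier to see from how the index is usually computed. The sketch below assumes the common HOMA-IR approximation, glucose (mg/dL) × insulin (µU/mL) / 405; whether the Coimbra data set computed HOMA in exactly this way is an assumption, and the sample values and the function names `homa_ir` and `pearson_r` are hypothetical illustrations, not values from Table 4.

```python
from math import sqrt

def homa_ir(glucose_mg_dl, insulin_uU_ml):
    """Common HOMA-IR approximation (assumed here, not confirmed by the paper)."""
    return glucose_mg_dl * insulin_uU_ml / 405.0

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical fasting measurements for six subjects.
glucose = [70.0, 92.0, 91.0, 77.0, 103.0, 201.0]   # mg/dL
insulin = [2.7, 3.1, 4.5, 3.2, 30.0, 41.0]         # µU/mL

homa = [homa_ir(g, i) for g, i in zip(glucose, insulin)]
r_insulin_homa = pearson_r(insulin, homa)
print(f"correlation(insulin, HOMA) = {r_insulin_homa:.3f}")
```

Because insulin enters the index as a multiplicative factor and varies proportionally far more than glucose, a large insulin-HOMA correlation is expected under this formula.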

Normality Test
For a data set of continuous variables, first check whether the marginal distribution of each continuous variable is normal. Methods for testing whether a variable follows a univariate normal distribution include the Q-Q plot and the Shapiro-Wilk test. The Q-Q plot is often used to assess visually whether a sample comes from a normal population. Its construction begins by sorting the samples $x_1, \dots, x_n$ from smallest to largest to obtain the order statistics $x_{(1)} \le \dots \le x_{(n)}$, which are then plotted against the corresponding standard normal quantiles. It can be seen that, except for the age and BMI variables, the Q-Q plots of the remaining seven variables all curve upward at the right end, which indicates that the sample distributions are right-skewed with heavy tails. This conclusion is consistent with the variable distributions plotted in Figure 1.
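As an illustration of the Q-Q construction, the following minimal sketch pairs each ordered observation with a standard normal quantile. It assumes the probability levels (j − 1/2)/n, which is one common convention and may differ from the one used for the figures; the function name `normal_qq_pairs` and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_pairs(sample):
    """Return (theoretical quantile, ordered observation) pairs for a normal Q-Q plot."""
    ordered = sorted(sample)                 # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    std_normal = NormalDist()                # standard normal, mean 0 and sd 1
    pairs = []
    for j, x in enumerate(ordered, start=1):
        level = (j - 0.5) / n                # assumed probability level for x_(j)
        pairs.append((std_normal.inv_cdf(level), x))
    return pairs

# Hypothetical right-skewed sample: the largest value sits far above the rest.
sample = [4.2, 2.9, 3.5, 25.1, 5.0, 3.1, 6.8]
for q, x in normal_qq_pairs(sample):
    print(f"{q:7.3f}  {x:6.1f}")
```

If the sample were normal, the printed pairs would fall close to a straight line; a point far above the line at the right end is the pattern described for the seven skewed variables.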
The Shapiro-Wilk test [10] results are shown in Table 4 below. For every variable the p-value is less than α = 0.05, so $H_0$ (normality) is rejected, and the nine continuous variables are considered not to follow univariate normal distributions. To test whether the sample comes from a multivariate normal population, a chi-square plot is used. The chi-square Q-Q plot of the data set is shown in Figure 4 below. The plotted points are obviously not on a straight line, so the original data set is considered not to follow a multivariate normal distribution. When a sample does not satisfy the normality assumption, it can be transformed so that the transformed sample does. The transformation proposed by Box and Cox [11] is $x^{(\lambda)} = (x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $x^{(\lambda)} = \ln x$ for $\lambda = 0$. The powerTransform(object, family = "bcPower") function in the R package car is used to estimate the λ value for the transformation, and the BoxCox(x, lambda) function in the forecast package is then applied with the resulting λ to perform the Box-Cox transformation. For each health condition, the chi-square Q-Q plot of the transformed data is shown in Figure 5. It can be seen from Figure 5 that, for both health conditions, the plotted points lie almost on a straight line. Therefore, for the transformed data, the nine continuous variables of both the Healthy samples and the Patients samples can be considered to follow a multivariate normal distribution.
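A minimal sketch of the Box-Cox power transformation for a given λ is shown below. It is written in Python for illustration only (the analysis itself used R), it does not estimate λ, and the function name `box_cox`, the sample values and the choice λ = 0 are hypothetical.

```python
from math import log

def box_cox(x, lam):
    """Box-Cox power transform of one strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return log(x)                  # limiting case as lambda tends to 0
    return (x ** lam - 1.0) / lam

# Hypothetical right-skewed measurements; lambda = 0 is the log transform.
values = [2.4, 3.1, 4.5, 6.0, 58.5]
transformed = [box_cox(v, 0.0) for v in values]
print([round(t, 3) for t in transformed])
```

The transform is monotone, so the ordering of the observations is preserved while the long right tail is compressed.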

One-Way MANOVA
To study whether blood test results, age, and BMI differ between samples with different health conditions, one-way multivariate analysis of variance (one-way MANOVA) is required. The theoretical derivation follows Dai's paper [12], from which the MANOVA model and test statistic are taken. The null hypothesis of no difference between groups is rejected when the test statistic exceeds $\chi^2_{p(g-1)}(\alpha)$, the upper α-quantile of the chi-square distribution with p(g-1) degrees of freedom, where g = 2 and p = 9. The results of the MANOVA are shown in Table 5. Table 5 shows that p-value = 1.811× < 0.05 = α, so the null hypothesis is rejected, and the age, BMI, and blood test indicators such as insulin are considered to differ significantly between samples with different health status. It is therefore worthwhile to carry out the subsequent analyses.
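For reference, one standard textbook form of the one-way MANOVA model and the large-sample test based on Wilks' lambda is sketched below. It is assumed, not confirmed, that this matches the derivation in [12]; the symbols $\tau_\ell$, $W$, $B$ and $\Lambda^*$ are introduced here for illustration.

```latex
% Model for observation j in group \ell (g groups, p variables, n = \sum_\ell n_\ell)
X_{\ell j} = \mu + \tau_\ell + e_{\ell j}, \qquad
e_{\ell j} \sim N_p(0, \Sigma), \quad j = 1,\dots,n_\ell, \ \ell = 1,\dots,g

% Null hypothesis: no group effect
H_0 : \tau_1 = \tau_2 = \dots = \tau_g = 0

% Within-group and between-group sums of squares and cross products
W = \sum_{\ell=1}^{g} \sum_{j=1}^{n_\ell}
    (x_{\ell j} - \bar{x}_\ell)(x_{\ell j} - \bar{x}_\ell)', \qquad
B = \sum_{\ell=1}^{g} n_\ell (\bar{x}_\ell - \bar{x})(\bar{x}_\ell - \bar{x})'

% Wilks' lambda and Bartlett's chi-square approximation: reject H_0 when
\Lambda^* = \frac{|W|}{|B + W|}, \qquad
-\left(n - 1 - \frac{p + g}{2}\right) \ln \Lambda^* > \chi^2_{p(g-1)}(\alpha)
```

With g = 2 and p = 9 the reference distribution has 9 degrees of freedom.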

Principal Component Analysis
According to Guo, PCA is a dimensionality reduction method that preserves the maximum overall dispersion. Its advantage is that it uses a smaller number of dimensions to reflect the structural relationships between samples [13]. In short, principal component analysis uses a few principal components to reveal the internal structure of multiple variables. Principal component analysis is performed on the data set of 9 variables to obtain the contribution rate of each component, and the two components that contribute the most are extracted. The principal component coefficients, eigenvalues and cumulative variance contribution rates are shown in Table 6. The number of principal components can be chosen with reference to the scree plot of the cumulative variance contribution rate, shown in Figure 6 below. The first principal component explains 98.8% of the total sample variance, and the first two principal components together account for 99.4%. If the required proportion of explained variance is set to 99%, the first two principal components must be selected, giving a cumulative variance contribution rate of 99.4%. The sample variation can therefore be summarized well by two principal components, and it is reasonable to reduce the data from 116 observations on 9 variables to 116 observations on 2 principal components. From the principal component coefficients (Table 6), the first principal component essentially represents MCP-1 (monocyte chemoattractant protein 1), and the second is essentially a weighted sum of glucose and leptin.
It can be seen from Table 6 that the absolute value of the MCP-1 coefficient in principal component 1 is large, reflecting the characteristics of the blood components of the sample; the loadings of glucose and leptin in principal component 2 are large, which also reflects the characteristics of the blood components. In this study, the 9 variables were combined into 2 principal components by principal component analysis, with a cumulative contribution rate of 99.4%, which meets the requirement of at least 99%. This indicates that the differences in age, BMI, and blood component levels between the two groups with different health status can be summarized by these two components.
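The rule used to choose the number of components can be sketched as follows. The eigenvalues below are hypothetical and chosen only to mirror the reported proportions (98.8% and 99.4%); they are not the values in Table 6, and the function name `components_needed` is an illustration.

```python
def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative variance share reaches threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    # Components are taken in order of decreasing eigenvalue (explained variance).
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k, cumulative
    return len(eigenvalues), cumulative

# Hypothetical eigenvalues, scaled so that they sum to about 100.
eigenvalues = [98.8, 0.6, 0.3, 0.2, 0.1]
k, share = components_needed(eigenvalues, threshold=0.99)
print(k, round(share, 3))
```

A first component that carries almost all of the variance and is dominated by a single variable is typical of PCA run on the covariance matrix of unstandardized variables, where the variable with the largest variance dominates; whether that was the setting here is not stated in the text.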

Factor Analysis
Principal component analysis is commonly used for the comprehensive evaluation of multiple indicators, but the evaluation results can be unreasonable or even wrong when its application conditions are not considered [14]. Factor analysis is therefore also carried out. Assuming an m = 3 factor model, the data are re-analyzed using both the maximum likelihood method and the principal component method. Table 7 lists the estimated factor loadings, communalities, specific variances, and the proportion of the total sample variance explained by each factor, obtained by the principal component method and the maximum likelihood method from the original data and after rotation. The factor scores obtained by the principal component (PCA) and maximum likelihood (MLE) methods are shown in Table 8. The proportion of the total variance explained by the three-factor solution obtained by applying the principal component method to the original data is considerably larger than that of the two-factor solution. However, for m = 3, the values reproduced by the fitted model are usually greater than the sample correlation coefficients. This is especially true for $r_{69}$. On the first factor $F_1$, most variables have very high and approximately equal loadings, with the exception of adiponectin content, so $F_1$ can be regarded as a factor reflecting the difference between blood adiponectin content and the other variables. The second factor contrasts age, BMI, and some blood indicators with the remaining blood indicators; on this factor BMI has a relatively large negative loading, while adiponectin has a large positive loading. The third factor $F_3$ mainly reflects the relationship between leptin and MCP-1. Similar conclusions can be drawn from the solution obtained from the original data using the maximum likelihood method.
After rotation, the two estimation methods give somewhat different results. If we focus on the principal component method and the cumulative proportion of the total sample variance, a three-factor solution is clearly necessary, because the third factor explains a large amount of additional sample variation. The first factor is roughly a pancreatic function factor determined by blood glucose, insulin and the HOMA indicator. The second factor mainly reflects the contrast between adiponectin and resistin content in the blood, which can be attributed to factors that explain obesity and diabetes. The third factor mainly reflects the contrast between BMI and blood leptin content, which can be attributed to body fatness. The rotated maximum likelihood loadings are similar to those of the principal component method for the first factor, but differ for factors 2 and 3. For the maximum likelihood method, the second factor can also be attributed to body fatness, and the third factor mainly reflects the effect of blood glucose content.
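The communalities and specific variances reported alongside factor loadings follow directly from the loadings. The sketch below assumes standardized variables, so that each variable has unit variance and the specific variance is one minus the communality; the loading values are hypothetical and are not taken from Table 7.

```python
def communalities(loadings):
    """Communality of each variable: sum of squared loadings over the m factors."""
    return {name: sum(l * l for l in row) for name, row in loadings.items()}

def specific_variances(loadings):
    """Specific variance of each standardized variable: 1 minus its communality."""
    return {name: 1.0 - h2 for name, h2 in communalities(loadings).items()}

# Hypothetical rotated loadings on m = 3 factors.
loadings = {
    "Insulin": [0.9, 0.3, 0.1],
    "HOMA":    [0.9, 0.2, 0.3],
    "BMI":     [0.1, 0.8, 0.1],
    "Leptin":  [0.2, 0.7, 0.0],
    "Glucose": [0.3, 0.1, 0.8],
}

h2 = communalities(loadings)
psi = specific_variances(loadings)
for name in loadings:
    print(f"{name:8s} communality={h2[name]:.2f} specific variance={psi[name]:.2f}")
```

A variable with a small communality is poorly explained by the common factors, which is one way to judge whether m = 3 is adequate.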

Discriminant Analysis
Discriminant analysis establishes discriminant functions based on the variables of the research objects, uses them to classify the groups, and predicts the group membership of new samples. Because the main purpose of this article is to infer whether a sample has breast cancer from the information given in the data set, discriminant analysis is suitable for classifying new samples. Classification rules are assessed by the expected cost of misclassification (ECM) and the total probability of misclassification (TPM). The data are transformed using the Box-Cox method, so that both categories can be considered to come from multivariate normal populations, whose density functions are $f_i(x) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\left[-\tfrac{1}{2}(x - \mu_i)'\Sigma_i^{-1}(x - \mu_i)\right]$, i = 1, 2. For the transformed data, first examine the pairwise scatter plots of the variables. The pairwise scatter plots, drawn using the ggplot2 package in R, are presented in the Appendix. Most of these scatter plots are relatively disordered, scattered and uniform, with no obvious elliptical region, except for the scatter plot of insulin and the HOMA indicator, shown in Figure 7 below. The dark dots represent the Healthy status, that is, the group without breast cancer, and the light dots represent Patients, the group with breast cancer. The data in Figure 7 appear to form a fairly elliptical shape, so for these two variables the assumption of multivariate normality does not seem appropriate. However, because there is no obvious correlation between them and the other variables, this problem is ignored for now, and the data after the Box-Cox transformation are considered to come from a multivariate normal population.
An observation $x_0$ is assigned to category 1 ($\pi_1$), and the sample's health status is considered "Healthy", that is, the sample does not have breast cancer, when its discriminant score satisfies the classification inequality; otherwise $x_0$ is assigned to category 2 ($\pi_2$), and the sample's health status is "Patients", which means that the sample has breast cancer. The inequality involves a constant k determined by $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$. These are unknown population parameters, so in practice the corresponding sample statistics are used instead. The apparent error rate (APER) and the expected actual error rate, E(AER), can also be calculated from the model fitted to the sample data. The formulas are as follows: $\mathrm{APER} = \frac{n_{1M} + n_{2M}}{n_1 + n_2}$ and $\hat{E}(\mathrm{AER}) = \frac{n_{1M}^{(H)} + n_{2M}^{(H)}}{n_1 + n_2}$. Here $n_1$ and $n_2$ are the sizes of the two categories, $n_{1M}$ is the number of samples belonging to category 1 that are misclassified into category 2, and $n_{2M}$ is the number of samples belonging to category 2 that are misclassified into category 1; $n_{1M}^{(H)}$ and $n_{2M}^{(H)}$ are the corresponding counts under Lachenbruch's "holdout" procedure [15].
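For reference, the standard textbook forms of the misclassification criteria and of the minimum-ECM rule for two normal populations are sketched below. Because the text refers to separate $\Sigma_1$ and $\Sigma_2$ and to a constant k, the quadratic form of the rule is assumed; this may not be the exact form used in the paper, and the prior probabilities $p_1$, $p_2$ and the classification regions $R_1$, $R_2$ are symbols introduced here.

```latex
% Expected cost and total probability of misclassification
\mathrm{ECM} = c(2|1)\,P(2|1)\,p_1 + c(1|2)\,P(1|2)\,p_2, \qquad
\mathrm{TPM} = p_1 \int_{R_2} f_1(x)\,dx + p_2 \int_{R_1} f_2(x)\,dx

% Minimum-ECM rule: assign x_0 to \pi_1 if
-\tfrac{1}{2}\, x_0' \left(\Sigma_1^{-1} - \Sigma_2^{-1}\right) x_0
+ \left(\mu_1' \Sigma_1^{-1} - \mu_2' \Sigma_2^{-1}\right) x_0 - k
\;\ge\; \ln\!\left[\frac{c(1|2)}{c(2|1)} \cdot \frac{p_2}{p_1}\right]
% and assign x_0 to \pi_2 otherwise, where
k = \tfrac{1}{2} \ln\!\left(\frac{|\Sigma_1|}{|\Sigma_2|}\right)
  + \tfrac{1}{2} \left(\mu_1' \Sigma_1^{-1} \mu_1 - \mu_2' \Sigma_2^{-1} \mu_2\right)
```

When $\Sigma_1 = \Sigma_2$ the quadratic term vanishes and the rule reduces to a linear discriminant function. Raising c(1|2) relative to c(2|1) raises the threshold on the right-hand side, so fewer observations are assigned to $\pi_1$ (Healthy).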
In this study, classifying a sample with breast cancer as not having breast cancer would cause the patient to miss timely treatment, and the consequences are serious. Conversely, if a sample without breast cancer is classified as having breast cancer, further screening will increase the probability of it being correctly classified, so the consequences of this misclassification are lighter. Therefore, the misclassification costs c(1|2) and c(2|1) are set to 0.75 and 0.25, respectively. The calculated linear discriminant function is complicated, so it is not listed here. The resulting confusion matrix is shown in Table 9.
Table 9. Confusion matrix
True category    Classified as π1    Classified as π2
π1               39                  13
π2               6                   58
After calculation, APER = 0.1638 and Ê(AER) = 0.1872. The probability that a patient with breast cancer is classified as not having breast cancer is 0.09375, a low misclassification rate that is in line with the expected effect of adjusting the misclassification costs.
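The reported APER and the missed-patient rate can be checked directly from the counts in Table 9. The sketch below uses only those four counts; the variable names are illustrative, and the holdout estimate E(AER) is not recomputed because it requires the original data.

```python
# Counts from Table 9, keyed by (true category, classified category).
confusion = {
    ("pi1", "pi1"): 39, ("pi1", "pi2"): 13,   # pi1 = Healthy
    ("pi2", "pi1"): 6,  ("pi2", "pi2"): 58,   # pi2 = Patients
}

total = sum(confusion.values())
misclassified = confusion[("pi1", "pi2")] + confusion[("pi2", "pi1")]
aper = misclassified / total                  # apparent error rate

n_patients = confusion[("pi2", "pi1")] + confusion[("pi2", "pi2")]
missed_patient_rate = confusion[("pi2", "pi1")] / n_patients

print(f"APER = {aper:.4f}")
print(f"P(classified Healthy | Patient) = {missed_patient_rate:.5f}")
```

The asymmetry between the two error counts (13 versus 6) reflects the higher cost assigned to missing a patient.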

Research Conclusion
In this paper, through a series of analyses and tests of breast cancer screening data, the main conclusions are:
1) HOMA is one of the indicators used to evaluate insulin resistance, and it has a clear linear correlation with blood insulin content, as shown in Figure 2. This shows that the stronger the insulin resistance, the less effectively the insulin in the blood is used, so its content in the blood is correspondingly higher. Notably, some patients with breast cancer showed a high degree of insulin resistance and high blood insulin content, a feature that the samples without breast cancer do not have.
2) One-way MANOVA shows that there are significant differences in the blood test results, age, and BMI of samples with different health conditions.
3) Principal component analysis can be used to reduce the dimension of the data to two principal components. The differences in age, BMI and blood component content between the two groups with different health conditions in this study can therefore be summarized by these two components. The absolute value of the MCP-1 (monocyte chemoattractant protein 1) coefficient in principal component 1 is very large, reflecting the characteristics of the blood components of the sample; the loadings of glucose and leptin in principal component 2 are large, which likewise reflects the characteristics of the blood components.
4) Assuming an m = 3 factor model and using both the maximum likelihood method and the principal component method, the original data and the rotated factor solutions are analyzed, and the variables are reduced to 3 factors. For the rotated maximum likelihood solution, the first factor reflects insulin resistance (insulin and the HOMA indicator), the second factor reflects body fatness (BMI and leptin), and the third factor reflects blood glucose content.
5) By setting different misclassification costs for discriminant analysis, the obtained APER is 0.1638 and E(AER) is 0.1872. The probability of classifying a patient with breast cancer as not having breast cancer is 0.09375, a low misclassification rate.

Inadequacies in Research
Due to the limitation of objective conditions, this study still has deficiencies in the following two aspects:
1) Since no data on tumor marker protein content in the blood were collected, the analysis is based only on the content of conventional blood components, age, and BMI, so there is still room to reduce the overall misclassification rate.
2) The samples all come from the University Hospital Centre of Coimbra, so there are geographical limitations, and they cannot represent the general population.