Robust Minimal Spanning Tree Using Intuitionistic Fuzzy C-means Clustering Algorithm for Breast Cancer Detection

Breast cancer is the most common cause of death in women and the second leading cause of cancer deaths worldwide. Primary prevention in the early stages of the disease is difficult because the causes remain almost unknown. However, some typical signatures of this disease, such as lumps and microcalcifications appearing on mammograms, can be used to improve early diagnostic techniques, which is critical for women's quality of life. X-ray mammography is the main test used for screening and early diagnosis, and its analysis and processing are the keys to improving breast cancer prognosis. In this paper


Introduction
Breast cancer is characterized by uncontrolled growth of epithelial cells with an acquired ability for local invasion and distant metastatic dissemination. The morphology and clinical presentation of breast cancer are highly diverse among patients because of heterogeneity arising from distinct mutations, diverse subpopulations of stem cells, and heterotypic signaling between parenchymal and stromal cells within the tumor microenvironment. One of the biggest problems in medical science is the diagnosis of the disease, since the cause of breast cancer is unknown, although scientists know some of the risk factors, such as ageing, genetic risk factors, family history, menstrual periods, not having children, obesity or overweight, alcohol, etc. [1][2][4]. Symptoms of cancer include a lump in the breast or underarm that persists after the menstrual cycle, swelling in the armpit, pain or tenderness in the breast, any change in the size, contour, texture, or temperature of the breast, and a marble-like area under the skin. Many cancers cluster within families, and the immediate relatives of cancer patients often have an increased risk of cancer. Some of the characteristics of malignant tumors are clustered calcification, isolated ducts, poorly defined mass, etc. [3]. A good amount of research on breast cancer datasets is found in the literature. Much of it reports good classification accuracy or introduces a new computerized tool for the detection of cancer. Saheb Basha and Satya Prasad suggested a novel approach to automatically detect the breast cancer mass in mammograms using morphological operators and the fuzzy c-means clustering algorithm [4]. Carlos and Moshe introduced a new neural pattern recognition model that combines two methodologies, fuzzy systems and evolutionary algorithms, with a reported success rate of 97% [5]. Kovalerchuk et al. proposed several applications of fuzzy systems and algorithms to the detection of early-phase tumors [6].
Mammography is an expensive screening mechanism practiced for the detection of breast cancer. The World Health Organization (WHO) recommends mammography testing as a vital part of early diagnostic procedures to reduce the mortality rate. A threefold decrease in the mortality rate of breast cancer has been reported in developed countries that practice mammography for early detection of cancerous lumps in the breast [7]. The high mortality rate of breast cancer in Pakistan is due to poverty, lack of awareness about cancer and its detection methods, and the high cost as well as fear of mammography testing and other diagnostic procedures [8][9]. Atanassov studied the theory and applications of intuitionistic fuzzy sets [10]. Zhang and Chen suggested a clustering approach in which an intuitionistic fuzzy similarity matrix is transformed into an interval-valued fuzzy matrix [11]. Chaira recently proposed a novel intuitionistic fuzzy c-means (IFCM) algorithm using intuitionistic fuzzy set theory [12]. IFCM has two serious shortcomings. First, it easily falls into local minima. Second, the number of clusters must be specified in advance and the algorithm is very sensitive to the initial centers [13][14]. The graph data structure is considered a suitable mathematical tool for modeling the inherent relationships among data. Reddy proposed an MST-based cluster initialization for k-means, which bridges the k-means and the MST-based clustering algorithms [15]. Huang et al. used the Kruskal algorithm to generate the MST of all data points and then deleted k-1 edges according to the order of their weights [16]. In summary, selecting proper initial cluster centers is an NP problem, and the numerous improved methods have not yet been widely applied [11]. Therefore, the selection of initial cluster centers requires further research.
There are currently four main methods used to distinguish benign lumps from malignant ones when diagnosing breast cancer: surgical biopsy, mammography, magnetic resonance imaging, and fine needle aspiration with visual interpretation. Fine needle aspiration of breast masses is a nontraumatic, largely non-invasive diagnostic test that obtains the information needed for evaluating malignancy. The objective of the current study is to provide insight into better diagnosis of breast cancer through statistical evaluation of the sensitivity, specificity, predictive accuracy and probability of mammography-based breast tumor detection.
The minimum spanning tree (MST) is a useful graph for detecting clusters in a given set of data points. The MST is well suited to clustering in pattern recognition, image processing and computational biology. This paper presents a novel approach to automatically detect breast cancer. The proposed approach uses an MST-based initialization method to compute the initial cluster centers for the intuitionistic fuzzy c-means clustering algorithm, so that abnormalities in mammography images can be identified clearly. We summarize the mammography results and evaluate the accuracy of mammography through its sensitivity, specificity, positive likelihood ratio and negative likelihood ratio. In addition to these performance measures, the predictive probability of mammography screening is evaluated through Pearson chi-square analysis.

Quantitative Analysis of Mammograms
Mammograms were obtained for 60 highly suspicious cases, the area of the lump was highlighted, and the specificity and sensitivity parameters were calculated. The calculations use the total numbers of true positive (TP), true negative (TN), false negative (FN) and false positive (FP) cases. Among these four categories, true negative patients were those women in whom no lump was identified and whose symptoms were due to the normal breast cycle or to clotting of fatty tissue. False positives (FP) were those with benign tumors, while TP were cases in which malignant or invasive breast cancer was detected. False negative cases were those who developed malignant breast cancer during the period of screening (12 months). The age at presentation of disease symptoms and at mammography screening was also recorded. The predictive probability of breast cancer detection based on mammography screening was examined using the chi-square ($\chi^2$) test at the ≤ 0.05 significance level. In addition, the percentage distribution of the 60 selected cases by obvious clinical symptoms of breast cancer was calculated, as shown in Table 11.

Sensitivity
Sensitivity is the ratio of the number of true positives to the sum of the true positives and false negatives. It measures how reliably a diagnostic system identifies the cases in which the disease is present. For the system under study, sensitivity was calculated as $\text{Sensitivity} = TP / (TP + FN)$.

Specificity
Specificity is the ratio of the number of true negatives to the sum of the false positives and true negatives, $\text{Specificity} = TN / (TN + FP)$. This value is the probability that a screening test identifies a true negative case as negative.

Positive and Negative Likelihood Ratio Calculation
In the next step, the sensitivity and specificity values are used to calculate the positive likelihood ratio and the negative likelihood ratio. These calculations further measure the accuracy of mammography-based breast cancer detection. The formulas used for the positive and negative likelihood ratios of our study sample are $LR^{+} = \text{Sensitivity} / (1 - \text{Specificity})$ and $LR^{-} = (1 - \text{Sensitivity}) / \text{Specificity}$.
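The definitions of sensitivity, specificity and the likelihood ratios can be checked numerically. The following is a minimal Python sketch, assuming the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity, and using the counts reported for the study sample in the results section (TP = 32, FN = 4, FP = 1, TN = 23). The function name `diagnostic_metrics` is illustrative and not part of the original study.

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Return (sensitivity, specificity, LR+, LR-) from confusion counts.

    Assumes the standard textbook definitions; specificity must be
    strictly between 0 and 1 for both likelihood ratios to be defined.
    """
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative


# Counts reported for the 60-case sample in the results section.
sens, spec, lr_pos, lr_neg = diagnostic_metrics(tp=32, fn=4, fp=1, tn=23)
print(f"sensitivity = {100 * sens:.2f}%")
print(f"specificity = {100 * spec:.2f}%")
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

Computed without intermediate rounding, LR+ comes out near 21.33; the reported value of 21.32 is consistent with dividing the rounded percentages (0.8889 / 0.0417).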

Predictive Probability
The predictive probability of first-screen mammography in accurately detecting breast lumps is calculated through the chi-square ($\chi^2$) test at the < 0.05 level of significance. The chi-square statistic is $\chi^2 = \sum (O - E)^2 / E$, where O denotes the observed values of the TP, TN, FP and FN cases given in Table 12 and E denotes the expected values calculated in Table 14.
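As an illustration of the Pearson chi-square computation, the sketch below evaluates the statistic for a 2 x 2 table. It assumes that the rows are disease present / absent, the columns are test positive / negative, that each expected value is (row total x column total) / N, and that no continuity correction is applied; the layout of Tables 12 and 14 is not reproduced here, so this arrangement is an assumption. The counts are those reported in the results section.

```python
def chi_square_2x2(tp, fn, fp, tn):
    """Pearson chi-square statistic for a 2 x 2 table (no Yates correction)."""
    observed = [[tp, fn],   # disease present: test positive, test negative
                [fp, tn]]   # disease absent:  test positive, test negative
    n = tp + fn + fp + tn
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


chi2 = chi_square_2x2(tp=32, fn=4, fp=1, tn=23)
CRITICAL_005_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05
print(f"chi-square = {chi2:.3f}, significant at 0.05: {chi2 > CRITICAL_005_DF1}")
```

A statistic far above the critical value corresponds to a very small p-value, in line with the highly significant result reported for the study sample.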

Minimal Spanning Tree Algorithm
In this section, we propose the Canberra distance measure for constructing the minimal spanning tree.

Canberra Distance Measure MST
Given the grayscale point set D, the hierarchical method starts by constructing a minimal spanning tree (MST) from the points in D. The weight of each edge is the Canberra distance between its two end points, where K is the number of non-zero pairs.
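A minimal sketch of this construction is given below. It assumes the usual Canberra distance, the sum of |x_i - y_i| / (|x_i| + |y_i|) over coordinates, with 0/0 terms skipped. The optional division by K (taken here to mean the number of coordinate pairs that are not both zero) is an interpretation of the remark about K and is off by default. Prim's algorithm is used for the MST; the names `canberra` and `prim_mst` and the sample points are illustrative.

```python
def canberra(x, y, normalize=False):
    """Canberra distance; pairs where both coordinates are zero are skipped."""
    total, k = 0.0, 0
    for a, b in zip(x, y):
        denom = abs(a) + abs(b)
        if denom == 0:
            continue            # 0/0 term contributes nothing
        total += abs(a - b) / denom
        k += 1
    if normalize and k > 0:
        return total / k        # assumed reading of "K non-zero pairs"
    return total


def prim_mst(points, dist=canberra):
    """Return the MST of the complete graph as a list of (u, v, weight)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge into each node
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = dist(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges


# Hypothetical grayscale feature points, for illustration only.
points = [(10, 12), (11, 13), (50, 52), (52, 50), (200, 210)]
tree = prim_mst(points)
for u, v, w in tree:
    print(u, v, round(w, 4))
```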

Cluster Separation (CS)
The cluster separation (CS) between cluster centers is defined as $CS = E_{min} / E_{max}$, where $E_{max}$ is the length of the longest edge in the MST, which joins the two centroids that are at maximum separation, and $E_{min}$ is the length of the shortest edge in the MST, which joins the two centroids that are nearest to each other. The CS therefore represents the relative separation of the centroids, and its value ranges from 0 to 1. A low value of CS means that two centroids are too close to each other and the corresponding MST separation is not valid. A high CS value means that the MST separation of the data is even and valid. If the CS is greater than the threshold, the MST partition of the dataset is valid; we then increase the number of clusters by one and test the CS again. This process continues until the CS is smaller than the threshold. The threshold for the CS is set empirically and depends on the dataset. The higher the threshold, the smaller the number of clusters; generally the threshold is taken to be 0.8 or above.
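The cluster separation can be checked against the edge lengths quoted in the results section, where a single CMST2 edge of 0.359 gives CS = 1 and edges ranging from 0.265 to 0.358 give CS = 0.7402. The short sketch below assumes CS is the ratio of the shortest to the longest CMST2 edge; the function name and the 0.8 threshold constant are illustrative.

```python
def cluster_separation(edge_weights):
    """Ratio of the shortest to the longest edge of the MST over the centers."""
    return min(edge_weights) / max(edge_weights)


THRESHOLD = 0.8  # value used in the experiments of this paper

# Two centers: CMST2 has a single edge, so min and max coincide.
cs_two = cluster_separation([0.359])
# Four centers: only the extreme edge lengths are quoted in the text;
# any remaining edge lies between them and does not change the ratio.
cs_four = cluster_separation([0.265, 0.358])

print(cs_two, cs_two >= THRESHOLD)
print(round(cs_four, 4), cs_four >= THRESHOLD)
```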

Algorithm for Determining the Initial Cluster Centers
Algorithm: GMST
Input: Data points
Output: Optimal number of cluster centers
Let e1 be an edge in the CMST1 constructed from the data points.
Let e2 be an edge in the CMST2 constructed from C.
Let $S_T$ be the set of disjoint subtrees of CMST1.
1. Create a node v for each data point.
2. Compute the edge weight using equation (1).
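Only the first steps of the algorithm are listed above. The following Python sketch is one possible reading of the complete procedure as it is described in the text: build CMST1 with Canberra weights, repeatedly remove the longest remaining edge, take the mean of each subtree as its center, build CMST2 over the centers, and stop once the cluster separation falls below the threshold. All function names, the unnormalized Canberra distance and the sample data are assumptions made for illustration.

```python
def canberra(x, y):
    return sum(abs(a - b) / (abs(a) + abs(b))
               for a, b in zip(x, y) if abs(a) + abs(b) > 0)


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (u, v, weight) edges."""
    n = len(points)
    in_tree, parent = [False] * n, [-1] * n
    best = [float("inf")] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                w = canberra(points[u], points[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    return edges


def components(n, edges):
    """Group node indices into the subtrees left by the kept edges."""
    root = list(range(n))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for u, v, _ in edges:
        root[find(u)] = find(v)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def initial_centers(points, threshold=0.8):
    """Return the centers of the last partition whose CS passes the threshold."""
    tree = sorted(mst_edges(points), key=lambda e: e[2], reverse=True)
    accepted = [centroid(points)]
    for cut in range(1, len(tree) + 1):
        groups = components(len(points), tree[cut:])   # drop the `cut` longest edges
        centers = [centroid([points[i] for i in g]) for g in groups]
        weights = [w for _, _, w in mst_edges(centers)]   # CMST2 edge lengths
        if max(weights) == 0 or min(weights) / max(weights) < threshold:
            break
        accepted = centers
    return accepted


# Hypothetical data: three tight groups of points.
data = [(10, 10), (11, 10), (10, 11),
        (100, 100), (104, 100), (100, 104),
        (10, 100), (11, 100), (10, 104)]
centers = initial_centers(data)
print(len(centers), sorted(centers))
```

With two centers CMST2 has a single edge, so CS equals 1 and a two-cluster split is always accepted under this reading.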

Intuitionistic Fuzzy C-means (IFCM) Algorithm
The intuitionistic fuzzy set introduced by Atanassov [3] considers both the membership degree $\mu_A(x)$ and the non-membership degree $\nu_A(x)$. An intuitionistic fuzzy set A in X is written as $A = \{(x, \mu_A(x), \nu_A(x)) \mid x \in X\}$, where $\mu_A(x) \in [0, 1]$ and $\nu_A(x) \in [0, 1]$ are the membership and non-membership degrees of an element x in the set A, with the condition $0 \le \mu_A(x) + \nu_A(x) \le 1$. When $\nu_A(x) = 1 - \mu_A(x)$ for every x in the set A, the set A becomes a fuzzy set. Atanassov also indicated a hesitation degree $\pi_A(x)$, which arises from a lack of knowledge in defining the membership degree of each element x in the set A and is given by $\pi_A(x) = 1 - \mu_A(x) - \nu_A(x)$. The intuitionistic fuzzy c-means algorithm of [5] minimizes an objective function $J_{IFCM}$ in which $u^*_{ik}$ denotes the intuitionistic fuzzy membership. The iteration stops when the change in $u^*_{ik}$ between successive iterations falls below $\epsilon$, where $\epsilon$ is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of $J_{IFCM}$.

Kernel Function Induced IFCM Algorithm
The function $K(x, y)$ is called a kernel function, and we take this known function to be the Gaussian radial basis function. This paper proposes an efficient weighted MST-based IFCM that introduces a kernel function so that the clustering of objects becomes more reasonable. In the modified objective function, $\phi$ stands for the map into the feature space, and the distance function can be expressed in the inner product space as $\|\phi(x_k) - \phi(v_i)\|^2 = K(x_k, x_k) + K(v_i, v_i) - 2K(x_k, v_i)$. To obtain the kernel-induced IFCM based on the Gaussian function, the distance function is modified accordingly: $G(x_k, v_i)$, between pixel $x_k$ and center $v_i$, is expressed as the product of a feature similarity term and a spatial proximity term, where $\sigma$ is a parameter that can be adjusted by users.
Using the above expression we obtain $K(x, x) = 1$, so the distance function can be rewritten as $\|\phi(x_k) - \phi(v_i)\|^2 = 2(1 - K(x_k, v_i))$. Substituting this into the objective function gives the kernel-induced MST-based IFCM.
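The kernel-induced distance can be illustrated with a few lines of Python. The sketch assumes the Gaussian kernel in the form K(x, v) = exp(-||x - v||^2 / sigma^2); the exact scaling of sigma used in this paper is not recoverable from the text, so that form is an assumption. Under any Gaussian kernel K(x, x) = 1, so the feature-space distance K(x, x) + K(v, v) - 2K(x, v) reduces to 2(1 - K(x, v)).

```python
import math


def gaussian_kernel(x, v, sigma=1.0):
    """Gaussian RBF kernel; the sigma**2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, v))
    return math.exp(-sq_dist / sigma ** 2)


def kernel_distance_sq(x, v, sigma=1.0):
    """Squared distance in feature space, using K(x, x) = K(v, v) = 1."""
    return 2.0 * (1.0 - gaussian_kernel(x, v, sigma))


pixel, center = (0.2, 0.4), (0.5, 0.1)   # hypothetical feature vectors
full = (gaussian_kernel(pixel, pixel) + gaussian_kernel(center, center)
        - 2.0 * gaussian_kernel(pixel, center))
print(kernel_distance_sq(pixel, center), full)
```

Unlike the Euclidean distance, this quantity is bounded above by 2, which limits the influence of outlying points on the cluster centers.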

Obtaining Membership
To obtain the equation for calculating the membership, we minimize the objective function subject to the constraint $\sum_{i=1}^{c} u^*_{ik} = 1$ for every k. The objective function (6) can therefore be minimized using one Lagrangian multiplier $\lambda$. To adjust $u_{ik}$ and $v_i$ for minimum $J_m$, we set the derivatives of the Lagrangian to zero. To calculate $\lambda$, we substitute the resulting $u^*_{ik}$ into the identity constraint for all values of k, and obtain the corresponding relation by setting the partial derivative of $J_{IFCM}$ equal to zero.
From these relations, $v_i$ and $u^*_{ik}$ can be calculated. The MST-based FCM algorithm iteratively optimizes $J_{IFCM}$ by continuously updating $u^*_{ik}$ and $v_i$ until the difference between successive $u^*_{ik}$ values is very small ($\le \epsilon$), where $\epsilon$ is a small positive value between 0 and 1.
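The update equations themselves did not survive in the text, so the sketch below is not the authors' exact formulation. It assumes the widely used kernel fuzzy c-means updates, with membership proportional to (1 - K(x_k, v_i))^(-1/(m-1)) and centers given by a kernel-weighted mean, combined with a Yager-type complement for the hesitation degree, pi = 1 - u - (1 - u^alpha)^(1/alpha), which is a common choice in the IFCM literature. The values of `m`, `sigma`, `alpha` and `eps`, and all names, are illustrative.

```python
import math


def gaussian_kernel(x, v, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, v)) / sigma ** 2)


def kifcm(points, centers, m=2.0, sigma=2.0, alpha=0.85, eps=1e-4, max_iter=100):
    """Kernel-induced IFCM sketch; returns (u_star, centers, iterations)."""
    centers = [tuple(c) for c in centers]
    previous = None
    for iteration in range(1, max_iter + 1):
        u_star = []
        for x in points:
            # Kernel distance 1 - K (the constant factor 2 cancels out).
            d = [max(1.0 - gaussian_kernel(x, v, sigma), 1e-12) for v in centers]
            inv = [dk ** (-1.0 / (m - 1.0)) for dk in d]
            total = sum(inv)
            row = []
            for w in inv:
                u = w / total
                # Assumed Yager-type hesitation degree.
                pi = 1.0 - u - max(0.0, 1.0 - u ** alpha) ** (1.0 / alpha)
                row.append(u + pi)          # intuitionistic membership u*
            u_star.append(row)
        new_centers = []
        for i, v in enumerate(centers):
            weights = [u_star[k][i] ** m * gaussian_kernel(x, v, sigma)
                       for k, x in enumerate(points)]
            s = sum(weights)
            new_centers.append(tuple(
                sum(w * x[j] for w, x in zip(weights, points)) / s
                for j in range(len(v))))
        centers = new_centers
        if previous is not None:
            change = max(abs(a - b) for new, old in zip(u_star, previous)
                         for a, b in zip(new, old))
            if change <= eps:               # stop when u* barely changes
                break
        previous = u_star
    return u_star, centers, iteration


# Hypothetical data and initial centers (for example, from an MST-based step).
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.9, 5.3)]
u_star, final_centers, n_iter = kifcm(pts, [(0.5, 0.5), (4.5, 4.5)])
print(n_iter, final_centers)
```

Starting from centers that are already close to the true groups, as an MST-based initialization is intended to provide, the loop needs only a few iterations on this toy data.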

Efficient KFCM Algorithm
is the iteration count, and $\epsilon$ is a small number that can be set by the user.

Validation Function Based on Feature Structures
Two representative functions for the fuzzy partition, namely the partition coefficient $V_{pc}$ and the validation function $V_p$, are used to evaluate the validity of the clustering [18][19].
The proposed efficient weighted MST supplies the cluster centers; the KIFCM algorithm then iteratively updates the memberships and centroids starting from these values. When the efficient KIFCM algorithm has converged, a defuzzification step converts the fuzzy partition matrix into a crisp partition matrix, which gives the segmentation.
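The partition coefficient and the defuzzification step can be sketched as follows. The sketch assumes Bezdek's partition coefficient, the mean over data points of the sum of squared memberships, and maximum-membership defuzzification. The validation function $V_p$ is not reconstructed here because its definition is not given in the text. The membership matrix is hypothetical, with one row per data point and one column per cluster.

```python
def partition_coefficient(u):
    """Bezdek's partition coefficient: mean of the squared memberships per point."""
    return sum(value ** 2 for row in u for value in row) / len(u)


def defuzzify(u):
    """Assign each point to the cluster with the largest membership."""
    return [max(range(len(row)), key=row.__getitem__) for row in u]


# Hypothetical fuzzy partition matrix: rows are points, columns are clusters.
memberships = [[0.9, 0.1],
               [0.2, 0.8],
               [0.5, 0.5]]
v_pc = partition_coefficient(memberships)
labels = defuzzify(memberships)
print(round(v_pc, 4), labels)
```

For memberships that sum to one per point, the coefficient lies between 1/c and 1, and values closer to 1 indicate a crisper partition. A tie, as in the last row, is resolved in favor of the lower cluster index.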

Results and Discussion
This section describes experimental results on random data corrupted with noise, to show the segmentation performance of the proposed method. Figure 1 shows a typical example of CMST1 constructed from a point set (from the dissimilarity matrix), in which inconsistent edges are removed to create subtrees (clusters or regions). Our algorithm finds the center of each cluster, which is useful in many applications. In most clustering algorithms the data points can be represented by a dissimilarity matrix, which contains the distance values between the data points as a lower or upper triangular matrix. Our Canberra distance based minimal spanning tree algorithm constructs CMST1 from the dissimilarity matrix, as shown in Figure 1.
The first step is to identify the longest edge in CMST1 in order to generate subtrees (clusters). In Table 3, the longest edge, with weight 0.327 and connecting data points 15 and 6, is found to be inconsistent. Removing this edge from CMST1 partitions the data points into two subtrees or clusters, $T_1$ and $T_2$. The center of each subtree is computed as the average of its points; these centers are connected and another minimal spanning tree, CMST2, is constructed. The minimum edge of CMST2 is $E_{min} = 0.359$ and the maximum edge is $E_{max} = 0.359$, so the cluster separation value is 1. Since the CS is greater than 0.8, we conclude that the subtrees or clusters created are well separated. The next longest edge in Table 3, with weight 0.254 and connecting data points 9 and 11, is also found to be inconsistent. Removing it from CMST1 partitions the data points into three subtrees or clusters, $T_1$, $T_2$ and $T_3$. Their centers are again computed as the average of the points and connected, and CMST2 is constructed again. Continuing this process, the next longest edge in Table 3, with weight 0.226 and connecting data points 3 and 9, is found to be inconsistent. Removing it from CMST1 partitions the data points into four subtrees or clusters, $T_1$, $T_2$, $T_3$ and $T_4$. Their centers are computed as the average of the points and connected, and CMST2 is constructed once more. The minimum edge of CMST2 is $E_{min} = 0.265$ and the maximum edge is $E_{max} = 0.358$, so the cluster separation value is 0.7402. Since the CS is less than 0.8, we conclude that these subtrees or clusters are not valid. The Canberra minimal spanning tree algorithm therefore produces three cluster centers for the given data points.
The cluster centers and the convergence of the standard FCM and IFCM are then determined over successive iterations on the data points. With the standard FCM algorithm, whose objective function uses the Euclidean distance, the number of center updates is high, so more iterations are needed to reach the termination value of the algorithm. With the new objective function based on the kernel distance measure, the termination value is reached in far fewer iterations and with much better membership values (Table 8) than with the standard FCM. Table 9 gives the number of iterations needed by the standard FCM and by KIFCM to cluster the data points. It is clear from the final clusters, the memberships (Table 8) and the scatter diagram (Figure 2) that the proposed KIFCM is much faster than the standard FCM and reaches the termination condition in less run time. To test the effectiveness of KIFCM, the centers produced by the weighted minimal spanning tree are used as its initial centers, in order to find the fuzzy memberships and the appropriate number of clusters. We thus conclude that the optimal number of clusters is 3. The algorithm also reduces the number of iterations. The best result corresponds to the maximum of the fuzzy partition coefficient $V_{pc}$ and the minimum of the validation function $V_p$ (Table 10).
The membership values obtained by the KIFCM clustering algorithm are given in Table 8.

Statistical Evaluation of Diagnostic Performance of Mammography for Breast Cancer
Our study of the random sample, in terms of reported breast cancer associated symptoms and patient age group, reveals that 10% of the patients had dense calcification, 10% had watery discharge from the breast, 40% complained of a lump, 30% had pain in the breast tissue, 5% had both a lump and pain, and 5% had pain as well as discharge from the breast tissue, as given in Table 1. The patient data are categorized into two age groups: 25% of the cases belong to the age group of 30-40 years, while the majority (75%) belong to the age group of 41-50 years. The mammography details revealed that 33% had a benign tumor, malignancies were reported in 50% of the cases, and 17% of the cases were diagnosed as normal, as shown in Table 11.
To evaluate the performance of the diagnostic procedure for primary screening of breast cancer, specificity and sensitivity were calculated first. In the randomly selected sample of 60 patients subjected to mammography for detection of lumps in mammary tissue, analysis of the diagnosis reports revealed 32 TP cases, in which the disease was present and detected, while 4 false negative cases were observed in which the disease was present but its clinical presentation could not be evaluated through mammography. Likewise, 1 false positive case was reported through mammography, and 23 true negative cases were identified in which no indication of disease was observed. All the cases, in terms of true positive (TP), true negative (TN), false positive (FP) and false negative (FN), are summarized in Table 12.

Sensitivity Percentage Result
The true positive rate, which defines the sensitivity of mammography in accurately detecting breast cancer, was 88.89% in the currently reported data (Table 13). Of the 60 cases, only four diseased cases were reported as false negatives in the mammogram evaluation, while the majority of the diseased cases were correctly reported as true positives. A high sensitivity percentage corresponds to accurate detection of the patients who have breast cancer.

Specificity Percentage Result
The true negative rate, which defines the specificity of mammography in identifying non-diseased cases, was 95.83% in our study sample (Table 13). In our study sample of 60 patients, 23 non-diseased cases were accurately identified as true negatives through mammography. A high specificity percentage corresponds to accurate identification of the actual negative cases; this value also indicates that mammography diagnosis is particularly well suited to the detection of breast lumps in patients.

Positive Likelihood Ratio and Negative Likelihood Ratio of Diagnostic Mammography
The positive likelihood ratio is the ratio of the probability of a positive test result when a lump is present to the probability of a positive test result when a lump is absent. For our study sample, the positive likelihood ratio is 21.32 (Table 13). This value indicates how well the diagnostic system can differentiate between true positive and false positive results. The negative likelihood ratio is the ratio of the probability of a negative test result in a diseased case to the probability of a negative test result when the lump in the breast is absent. The negative likelihood ratio calculated for our study sample is 0.12 (Table 13), which demonstrates that the system identifies true negative cases well and produces few false negative results.

Pearson Chi-square ($\chi^2$) Test Results
To evaluate the diagnostic accuracy of mammographic detection of breast cancer, the Pearson chi-square ($\chi^2$) test was performed to calculate the predictive probability. The highly significant p-value (< 0.00001) indicates that mammography-based initial screening reliably diagnosed breast cancer in our study sample (Table 14). A highly significant correlation between mammography performance and the clinical symptoms of breast cancer was observed in our study sample.

Conclusion
Breast cancer is one of the major causes of death among women. Early diagnosis through regular screening and timely treatment has been demonstrated to be the best prevention method for cancer. This article introduces a new alternative approach for diagnosing breast cancer and classifying benign and malignant cases, using an MST-initialized intuitionistic fuzzy c-means clustering algorithm so that abnormalities in mammography images can be identified clearly. We summarized the mammography results and evaluated the accuracy of mammography: a sensitivity of 88.89%, a specificity of 95.83%, a positive likelihood ratio of 21.32 and a negative likelihood ratio of 0.12 were calculated. These results would help health professionals make timely decisions for disease management in breast cancer patients. In future research, this method can be extended to real mammography images using Matlab, the R language and SPSS software.