Comparative Study of K-Means, Partitioning Around Medoids, Agglomerative Hierarchical, and DIANA Clustering Algorithms by Using Cancer Datasets

Clustering plays a fundamental role in exploring data, making predictions, and overcoming anomalies in the data. Clustering algorithms iteratively group observations with similar characteristics in a dataset. As the volume of real-world data grows day by day, the challenge of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements, is compounded by the large number of genes and the complexity of biological networks. Clustering algorithms are a first step toward addressing this challenge. In this study, we present a comparative study of four popular clustering algorithms: K-Means, Partitioning Around Medoids (PAM), Agglomerative Hierarchical, and DIANA. The algorithms are evaluated on eight real cancer gene expression datasets (four Affymetrix and four cDNA) and one simulated dataset. The comparison is based on seven popular cluster validity indices: Average Silhouette Index, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma. We find that, among these four clustering algorithms, PAM performs best on the Affymetrix datasets and DIANA performs best on the cDNA datasets. This study provides a practical evaluation framework for assessing clustering results on gene expression cancer datasets.


Introduction
Microarray technology can concurrently measure the expression levels of thousands of genes within a particular mRNA biological sample and across collections of related samples [1]. Such technology can be used to compare levels of gene expression in order to identify diagnostic or prognostic genes, classify genes, and monitor the response to therapy. For these reasons, microarray technology is considered an important tool for discovery in the medical community. The large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements [2]. A first step toward addressing this challenge is the use of clustering techniques, which are essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data.
In data mining, there are two learning approaches: supervised and unsupervised learning. Clustering is a form of unsupervised learning, defined as the task of grouping a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering techniques have contributed extensively to various fields, including artificial intelligence, pattern recognition, bioinformatics, segmentation, and machine learning. An appropriate clustering algorithm is needed to extract hidden information from co-expression analysis of enormous genome data [3]. A common task, therefore, is to compare clustering algorithms on gene expression datasets.
Gene expression microarray technology generally exists on two types of platforms: single-channel microarrays (Affymetrix) and double-channel microarrays (cDNA). Datasets from both platforms can be used to cluster genes as well as samples [4,5], that is, for gene-based clustering and sample-based clustering. Only sample-based clustering is conducted in this study. In sample-based clustering, genes are treated as features and samples are treated as objects, and the samples are partitioned into homogeneous groups.
Numerous widely used clustering algorithms have been developed to capture the overall structure of high-dimensional datasets. Among them, K-Means [6], Partitioning Around Medoids (PAM) [7], Agglomerative Hierarchical methods [8], and Divisive Analysis (DIANA) [9] are the most popular. This paper therefore performs a comparative analysis of these four clustering algorithms. The performance of these clustering algorithms is compared in terms of accuracy and efficiency through seven validity indices [10].

The Gap Statistic
The gap statistic [11] is used to find the optimal number of clusters (K) in a dataset. The idea behind the approach is to standardize the comparison of $\log(W_k)$, where $W_k$ is the pooled within-cluster dispersion for k clusters, with its expectation under a null reference distribution of the data, i.e. a distribution with no obvious clustering. The estimate of the optimal number of clusters k is the value for which $\log(W_k)$ falls the farthest below this reference curve. The formula for the gap statistic is:
$$\mathrm{Gap}_n(k) = E_n^{*}\{\log(W_k)\} - \log(W_k),$$
where $E_n^{*}$ denotes the expectation under a sample of size n from the reference distribution. The estimate $\hat{k}$ is the value maximizing $\mathrm{Gap}_n(k)$ after taking the sampling distribution into account.
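The computation of the gap statistic can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: the reference distribution is taken to be uniform over the bounding box of the data, $W_k$ is the within-cluster sum of squares obtained from a small k-means routine, and "taking the sampling distribution into account" is implemented as the common rule of choosing the smallest k with Gap(k) >= Gap(k+1) - s(k+1). The function names (kmeans_labels, log_wk, gap_statistic) and the parameters B, n_starts and k_max are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=50):
    """Plain Lloyd iterations started from k randomly chosen observations."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

def log_wk(X, k, rng, n_starts=10):
    """log of the within-cluster sum of squares, best of n_starts runs."""
    best = np.inf
    for _ in range(n_starts):
        labels = kmeans_labels(X, k, rng)
        w = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                for j in np.unique(labels))
        best = min(best, w)
    return np.log(best)

def gap_statistic(X, k_max=5, B=10, seed=0):
    """Return (estimated k, array of Gap(k) for k = 1..k_max)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, s = [], []
    for k in range(1, k_max + 1):
        # B reference datasets drawn uniformly over the bounding box (assumed null model).
        ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape), k, rng)
                        for _ in range(B)])
        gaps.append(ref.mean() - log_wk(X, k, rng))
        s.append(ref.std() * np.sqrt(1 + 1 / B))
    gaps, s = np.array(gaps), np.array(s)
    # Smallest k such that Gap(k) >= Gap(k+1) - s(k+1).
    for k in range(1, k_max):
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k, gaps
    return k_max, gaps

# Toy data: three well-separated groups in two dimensions.
rng = np.random.default_rng(42)
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in group_centers])
k_hat, gaps = gap_statistic(X)
```

For data with three clearly separated groups, Gap(3) is expected to lie well above Gap(1) and Gap(2), so the estimate would typically be 3.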

The K-Means Algorithm (KM)
The k-means algorithm [6] is one of the simplest unsupervised learning algorithms. It partitions a given dataset into a certain number of clusters (assume k clusters) fixed a priori. Because the result depends on the initial cluster centers, the algorithm can be run multiple times to reduce this effect. The steps of the algorithm are shown in Figure 1.
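As an illustration of the k-means procedure, the following is a minimal sketch of the standard Lloyd iteration (assign each object to its nearest center, then recompute each center as the mean of its members). The function name kmeans, the random initialization from k observations, and the toy data are assumptions made for this example and are not taken from Figure 1.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each observation joins its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged: the centers no longer move
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of 20 observations each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])
labels, centers = kmeans(X, k=2)
```

On this toy input the two groups end up in different clusters; on harder data, several runs with different seeds would usually be compared.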

The Partitioning Around Medoids (PAM) Algorithm
The k-means algorithm is sensitive to outliers, because an object with an exceptionally large value may substantially distort the distribution of the data. Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. The partitioning can then still be performed on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. This forms the basis of the k-medoids method, called Partitioning Around Medoids [7]. The basic strategy of the PAM clustering algorithm is to find k clusters in n objects by first arbitrarily selecting a representative object (the medoid) for each cluster, as shown in Figure 2.
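The medoid-based strategy can be sketched as follows. This is a simplified variant for illustration only: it starts from randomly chosen medoids and accepts any medoid/non-medoid swap that lowers the total dissimilarity (first improvement), rather than reproducing the exact build and swap phases of the original PAM. The function name pam and the toy data are hypothetical.

```python
import numpy as np

def pam(D, k, seed=0, max_iter=100):
    """Simplified PAM on a precomputed n x n dissimilarity matrix D."""
    n = D.shape[0]
    rng = np.random.default_rng(seed)
    medoids = [int(m) for m in rng.choice(n, size=k, replace=False)]

    def total_cost(meds):
        # Sum of dissimilarities from each object to its nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = total_cost(medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h  # try swapping medoid i with object h
                cost = total_cost(candidate)
                if cost < best - 1e-12:
                    medoids, best, improved = candidate, cost, True
        if not improved:
            break  # no single swap lowers the cost any further
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels, float(best)

# Toy data: two groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
medoids, labels, cost = pam(D, k=2)
```

Because the medoids are actual objects of the dataset, the method only needs a dissimilarity matrix, which is one reason it is less affected by extreme values than the mean-based k-means.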

Agglomerative Hierarchical Clustering Algorithm (AHC)
Agglomerative hierarchical clustering, or bottom-up clustering, starts with each object forming its own cluster and then gradually merges these clusters into larger ones [8,12]. At each successive iteration, it agglomerates (merges) the closest pair of clusters according to some similarity criterion, until all of the data is in one cluster, as illustrated in Figure 3.
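The bottom-up merging can be sketched with a naive implementation. Average linkage is assumed here as the similarity criterion purely for illustration (the linkage used in this study is not stated in this section), and the function name agglomerative is hypothetical. Setting n_clusters=1 builds the full hierarchy; a larger value stops early and yields a flat partition.

```python
import numpy as np

def agglomerative(D, n_clusters=1):
    """Naive average-linkage agglomeration on a dissimilarity matrix D."""
    clusters = [[i] for i in range(D.shape[0])]  # every object starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise dissimilarity between the two clusters.
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Toy data: two groups on a line, cut at two clusters.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
clusters, merges = agglomerative(D, n_clusters=2)
```

The list of merges, with the dissimilarity at which each merge happened, is the information a dendrogram displays.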

Divisive Analysis Clustering (DIANA)
Divisive Analysis Clustering [9] is a hierarchical clustering technique that constructs the hierarchy in the inverse order; it is the reverse of agglomerative hierarchical clustering. It starts with one large cluster consisting of all n objects and splits clusters successively until every cluster consists of a single object, as illustrated in Figure 4.
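The top-down splitting can be sketched as follows. The sketch follows the usual description of DIANA: the cluster with the largest diameter is split by seeding a "splinter group" with the object that is on average most dissimilar to the rest, then moving over every object that is closer on average to the splinter group than to the remaining objects. The function names split_cluster and diana are hypothetical, and the code assumes distinct objects and n_clusters no larger than the number of objects.

```python
import numpy as np

def split_cluster(D, members):
    """Split one cluster into (remaining objects, splinter group)."""
    rest = list(members)
    # Seed the splinter group with the object that is, on average,
    # most dissimilar to the other objects in the cluster.
    avg = [D[i, [j for j in rest if j != i]].mean() for i in rest]
    splinter = [rest.pop(int(np.argmax(avg)))]
    while len(rest) > 1:
        # Positive gain: the object is closer to the splinter group
        # than to the objects that would remain.
        gains = [
            D[i, [j for j in rest if j != i]].mean() - D[i, splinter].mean()
            for i in rest
        ]
        best = int(np.argmax(gains))
        if gains[best] <= 0:
            break
        splinter.append(rest.pop(best))
    return rest, splinter

def diana(D, n_clusters):
    """Divide until n_clusters clusters remain (assumes n_clusters <= n)."""
    clusters = [list(range(D.shape[0]))]
    while len(clusters) < n_clusters:
        # Always split the cluster with the largest diameter.
        diameters = [D[np.ix_(c, c)].max() for c in clusters]
        target = clusters.pop(int(np.argmax(diameters)))
        clusters.extend(split_cluster(D, target))
    return clusters

# Toy data: two groups on a line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
clusters = diana(D, n_clusters=2)
```

Continuing the loop until n_clusters equals the number of objects would produce the complete divisive hierarchy down to single objects.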

Clustering Validity Indices
Cluster validity indices [10] are functions that help a user answer the question of whether a particular clustering of the data is better than an alternative clustering. For unsupervised clustering, where partitions are made without reference to external classes, these cluster validity metrics must rely only on internal measures of the data. Several such validity metrics exist, such as within-cluster distances (should be low) and between-cluster distances (should be high). Several cluster validity indices are briefly discussed in Table 1.
Table 1 (fragment): the index takes values within [−1, 1]; the optimal value is the highest.
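To make the notions of within-cluster and between-cluster distances concrete, the following sketch computes two of the internal indices named in this study, the Average Silhouette Index and the Dunn Index, from a dissimilarity matrix. The definitions used are the standard ones (silhouette: (b - a) / max(a, b) per object, with 0 for singleton clusters; Dunn: smallest between-cluster distance divided by the largest cluster diameter); the function names and the toy data are assumptions for this example, and both functions require at least two clusters.

```python
import numpy as np

def average_silhouette(D, labels):
    """Mean silhouette width; values lie in [-1, 1], higher is better."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    scores = []
    for i in range(len(labels)):
        own = labels == labels[i]
        if own.sum() == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # a: mean dissimilarity to the other members of the own cluster
        # (D[i, i] = 0 is included in the sum, hence the "- 1").
        a = D[i, own].sum() / (own.sum() - 1)
        # b: smallest mean dissimilarity to any other cluster.
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

def dunn_index(D, labels):
    """Smallest between-cluster distance over largest diameter; higher is better."""
    labels = np.asarray(labels)
    ids = np.unique(labels)
    min_between = min(D[np.ix_(labels == p, labels == q)].min()
                      for p in ids for q in ids if p < q)
    max_diameter = max(D[np.ix_(labels == p, labels == p)].max() for p in ids)
    return float(min_between / max_diameter)

# Toy data: compare a sensible partition with an interleaved one.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
sil_good, sil_bad = average_silhouette(D, good), average_silhouette(D, bad)
dunn_good, dunn_bad = dunn_index(D, good), dunn_index(D, bad)
```

Both indices rank the sensible partition above the interleaved one, which is the kind of comparison used to judge one clustering against another.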

Data Sets
The datasets differ in the type of microarray chip (second column), tissue type (third column), number of samples (fourth column), number of classes (fifth column), number of samples within each class (sixth column), dimensionality (seventh column), and dimensionality after feature selection (last column). A short description of these datasets is presented in Table 2.

Simulated Data Analysis
To check the performance of the clustering methods, we introduce a simulated dataset with 150 rows as genes and 8 columns as samples. Genes 1-50 are highly expressed, genes 51-100 are moderately expressed, and genes 101-150 are lowly expressed in terms of intensity level. The simulated data are generated from the normal distribution N (5,12). We therefore introduce three clusters as three main effects. Figure 5 shows the gap statistic, which attains its optimal value when the number of clusters is 3. We may therefore conclude that three clusters are present in the simulated data. The results of the analysis of the simulated data are presented in Table 3; the largest number of validity indices is satisfied by K-Means, followed by DIANA. We can therefore say that K-Means and DIANA perform better than PAM and the Agglomerative Hierarchical algorithm on the simulated data.
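A dataset with the layout described above (150 genes by 8 samples, in three blocks of 50 genes) could be generated along the following lines. The group means and the standard deviation below are hypothetical values chosen only to produce three distinct expression levels; they are not the parameters used in this study, and the stated distribution N (5,12) is not interpreted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_samples = 50, 8

# Hypothetical mean intensities for the three groups (assumed, not from the study).
group_means = {"high": 9.0, "medium": 5.0, "low": 1.0}
sd = 1.0  # assumed common standard deviation

# Rows 1-50: high, rows 51-100: medium, rows 101-150: low expression.
data = np.vstack([
    rng.normal(mean, sd, size=(n_per_group, n_samples))
    for mean in group_means.values()
])
# Known group membership, usable later as the external reference partition.
truth = np.repeat(np.arange(3), n_per_group)
```

Keeping the known membership alongside the data is what allows external indices such as the Corrected Rand Index to be computed for each algorithm.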

Comparative Results of K-means, PAM, AHC and DIANA for Affymetrix Datasets
We applied all four clustering methods to the four real Affymetrix datasets and checked their accuracy through several validity indices; the results are given in Table 4 and shown graphically in Figure 6. Table 4 presents the comparative analysis of the four clustering algorithms across the validity indices. For K-Means and Agglomerative Hierarchical clustering, only one index attains its optimal value. For DIANA, two indices attain their optimal values, and for PAM, three do. The largest number of validity indices is thus satisfied by PAM. Figure 6 also presents the comparison and shows that PAM attains the largest number of optimal indices among the four algorithms. We may therefore conclude that PAM is the best algorithm for the Affymetrix datasets, followed by DIANA, K-Means, and the Agglomerative Hierarchical method.

Comparative Results of K-means, PAM, AHC and DIANA for cDNA Datasets
We applied all four clustering methods to the four real cDNA datasets and checked their accuracy through several validity indices; the results are given in Table 5 and shown graphically in Figure 7. Table 5 presents the comparative analysis of the four clustering algorithms across the validity indices. We observed that DIANA attains the largest number of optimal validity indices, whereas each of the other clustering algorithms attains only one. Figure 7 also presents the comparison and shows that DIANA attains the largest number of optimal indices among the four algorithms. We may therefore conclude that DIANA is the best of the four algorithms for the cDNA datasets.

Conclusions
The cluster analysis problem has long interested scientists, as it deals with the grouping of objects that share common properties, and it serves as a first step in summarizing data and grouping genes in microarray gene expression data analysis. Here we have presented a comparative study of four clustering algorithms applied to simulated data and eight clinical cancer gene expression datasets. Our results reveal that the K-Means and DIANA clustering methods perform well on the simulated data. PAM gives the best performance on the Affymetrix datasets. For the cDNA datasets, DIANA exhibited the best performance in terms of recovering the true structure of the datasets. To the best of our knowledge, a comparative study of K-Means, PAM, Agglomerative Hierarchical clustering, and DIANA with several validity indices, namely Average Silhouette Width, Corrected Rand Index, Variation of Information, Dunn Index, Calinski-Harabasz Index, Separation Index, and Pearson Gamma, is poorly documented in the literature.