Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study

Abstract: Microarray technology is a well-established technique for understanding cellular functions by profiling transcriptomic data. Various analytical and statistical approaches have been developed to capture the overall structure of the high-dimensional datasets that microarrays produce. Agglomerative Hierarchical Clustering (AHC) is one of the most widely used methods for the cluster analysis of gene expression data; however, little work has been done to compare the performance of clustering methods on gene expression data.


Introduction
Cluster analysis is routinely run as a first step in microarray data analysis to summarize the data and group genes. Many clustering methods exist; hierarchical clustering methods, for example, can be classified into agglomerative and divisive methods [28,18]. The Agglomerative Hierarchical Clustering (AHC) process starts with each observation in its own cluster and progressively combines pairs of clusters, forming a smaller number of clusters that contain more observations [17,29]. Several AHC methods are well established [5,11]. It is essential to know which clustering method is best for which type of microarray (cancer) gene expression data. DNA microarray technology makes it possible to quantitatively and simultaneously monitor the expression levels of thousands of genes under different conditions, during important biological processes, and across collections of related samples [1,3].
Gene expression microarray technology is generally available on two types of platforms: single-channel microarrays (Affymetrix) and double-channel microarrays (cDNA) [2,4]. A characteristic of gene expression data is that it is meaningful to cluster both genes and samples [13,27]. There are therefore two types of gene expression data clustering: gene-based clustering and sample-based clustering.
In sample-based clustering, samples are treated as objects and genes as features, and the samples are partitioned into homogeneous groups [12,19]. The goal of sample-based clustering is to identify the phenotype structures or substructures of the samples. This study conducts only sample-based clustering.
Only a small number of analyses in the literature evaluate the performance of different clustering methods applied to gene expression data. Three AHC methods (Single Linkage, Complete Linkage and Average Linkage) were used to assess clustering performance on gene expression data [7,8,16,25]. Four AHC methods (Single Linkage, Complete Linkage, Average Linkage and Centroid Linkage) were used to evaluate clustering performance in other analyses [6,9,10]. Five AHC methods were also compared to identify the better clustering methods for the datasets considered [14,15]. However, all of these authors found complete linkage to be the better method for gene expression data in their analyses.

Table 1: Distance and similarity measures for gene expression data
Function: Description
Euclidean: The square root of the sum of squared differences between corresponding elements of the two vectors, $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$.
Pearson's Correlation: Measures the similarity between the shapes of two expression patterns (profiles).
Spearman Correlation: Measures the degree of a monotonic relationship between two variables, without making any assumptions about the frequency distribution of the variables.
Cosine Correlation: A measure of similarity between two non-binary vectors.

Distance and Similarity Measures for Gene Expression Data
Distances and similarities play an important role in cluster analysis [23,26]. In this section, we introduce some distance and similarity measures for gene expression data, listed in Table 1. To discuss them briefly, we start with some notation. Let $x = (x_1, x_2, \ldots, x_m)$ and $y = (y_1, y_2, \ldots, y_m)$ be two numerical vectors that denote two gene expression data objects, where the objects can be either genes or samples and $m$ is the number of features [20,21,22].
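As an illustration of the measures named in Table 1, the following is a minimal Python sketch using only the standard library. It is not the implementation used in this study, which relied on R packages; the function names and the conversion of a similarity to a dissimilarity as one minus the similarity are assumptions made for the example.

```python
import math

def euclidean(x, y):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: compares the shapes of two expression profiles.
    Undefined (division by zero) when either profile is constant."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cosine(x, y):
    """Cosine of the angle between two non-binary vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def as_dissimilarity(similarity):
    """Assumed convention for this sketch: dissimilarity = 1 - similarity."""
    return 1.0 - similarity
```

For two profiles with the same shape but different scale, such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, the Euclidean distance is nonzero while the three correlation-type measures all report maximal similarity, which is why the choice of proximity measure can change the clustering.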

Checking Validity of Clusters
Clustering is an unsupervised process in data mining and pattern recognition, and most clustering methods are very sensitive to their input parameters. It is therefore very important to evaluate the results of clustering methods. Because it is difficult to define when a clustering result is acceptable, several cluster validity techniques have been developed. This study uses one of the most commonly used validity techniques, the Corrected Rand Index.

Corrected Rand (cR) Index
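To measure how well the AHC methods recover the true partition of the datasets, we use the corrected Rand index [23,24]. The corrected Rand index takes values from -1 to 1, with 1 indicating perfect agreement between the partitions and values near 0 or negative corresponding to cluster agreement found by chance. Unlike the majority of other indices, the corrected Rand index is not biased towards a given method or number of clusters in the partition [4,23]. Given a set $S$ of $n$ elements and two groupings (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_R\}$ and $Y = \{Y_1, Y_2, \ldots, Y_S\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$, where each entry $n_{ij}$ denotes the number of objects common to $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$. The corrected form of the Rand index is cR, given as
$$cR = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[\sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_{i} \binom{n_{i\cdot}}{2} + \sum_{j} \binom{n_{\cdot j}}{2}\right] - \left[\sum_{i} \binom{n_{i\cdot}}{2} \sum_{j} \binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2}}$$
where $n_{ij}$, $n_{i\cdot}$, $n_{\cdot j}$ and $n$ are values from the contingency table, with $n_{i\cdot} = \sum_{j} n_{ij}$ and $n_{\cdot j} = \sum_{i} n_{ij}$ denoting its row and column sums.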
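The corrected Rand index can be computed directly from the contingency counts. The following Python sketch is one way to do so with the standard library; the function name `corrected_rand` and the choice to return 1.0 when both partitions are trivial (so that the denominator vanishes) are assumptions of this example, not part of the study.

```python
from collections import Counter
from math import comb

def corrected_rand(labels_true, labels_pred):
    """Corrected (adjusted) Rand index between two labelings of the same objects."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same objects")
    if n < 2:
        raise ValueError("at least two objects are required")

    # Contingency table entries n_ij and its row and column sums.
    n_ij = Counter(zip(labels_true, labels_pred))
    row_sums = Counter(labels_true)
    col_sums = Counter(labels_pred)

    sum_ij = sum(comb(c, 2) for c in n_ij.values())
    sum_rows = sum(comb(c, 2) for c in row_sums.values())
    sum_cols = sum(comb(c, 2) for c in col_sums.values())

    expected = sum_rows * sum_cols / comb(n, 2)   # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)         # largest attainable agreement

    if maximum == expected:
        # Assumed convention: both partitions are trivial (one cluster, or all
        # singletons), so they are treated as being in perfect agreement.
        return 1.0
    return (sum_ij - expected) / (maximum - expected)
```

The index depends only on how objects are grouped, not on the label values, so `corrected_rand([0, 0, 1, 1], [1, 1, 0, 0])` gives 1.0, while the crossed partition `[0, 1, 0, 1]` against `[0, 0, 1, 1]` gives -0.5.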
Motivated by this problem, it is important to consider all of the methods for gene expression data and to assess which method is comparatively best. This paper compares seven AHC methods in both Affymetrix and cDNA datasets: single linkage (cluster separation measured as the distance between the two nearest objects), complete linkage (as before, but between the two furthest objects), average linkage (average distance between all pairs), centroid (distance between the centroids of the clusters), Ward's method (minimizes the ANOVA sum of squared errors between two clusters), median (similarity based on the distance between the two medians) and mcquitty (averages the distances from both parts of the new cluster). It also provides a detailed graphical and analytical comparison of the seven AHC methods and five proximity measures. We used bar diagrams as well as box-and-whisker plots of the Corrected Rand Index to identify the most suitable AHC method for clustering. The AHC algorithm with the different linkages and proximity measures was implemented in the R programming language (version 3.0.2) with the mclust and proxy packages. Microsoft Excel and Microsoft Word were used for auxiliary calculations and for typing.
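The seven linkage rules can all be expressed through the Lance-Williams update, which recomputes the dissimilarity between a newly merged cluster and every remaining cluster. The Python sketch below is a naive illustration under stated assumptions, not the R implementation used in this study: the function names are hypothetical, and the centroid, median and Ward updates are exact only when the input matrix holds squared Euclidean distances.

```python
def lance_williams(method, ni, nj, nk):
    """Coefficients (alpha_i, alpha_j, beta, gamma) for merging clusters i and j,
    as seen from another cluster k; ni, nj, nk are the cluster sizes."""
    if method == "single":
        return 0.5, 0.5, 0.0, -0.5
    if method == "complete":
        return 0.5, 0.5, 0.0, 0.5
    if method == "average":
        return ni / (ni + nj), nj / (ni + nj), 0.0, 0.0
    if method == "mcquitty":
        return 0.5, 0.5, 0.0, 0.0
    if method == "centroid":
        return ni / (ni + nj), nj / (ni + nj), -ni * nj / (ni + nj) ** 2, 0.0
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "ward":
        t = ni + nj + nk
        return (ni + nk) / t, (nj + nk) / t, -nk / t, 0.0
    raise ValueError("unknown method: " + method)

def agglomerate(dist, k, method="average"):
    """Merge the closest pair of clusters until k clusters remain.
    dist is a symmetric n x n list of lists; returns one label per object."""
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of objects")
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    members = {i: [i] for i in range(n)}
    next_id = n
    while len(members) > k:
        i, j = min(d, key=d.get)              # closest pair of clusters
        d_ij = d[(i, j)]
        ni, nj = len(members[i]), len(members[j])
        updated = {}
        for c in members:
            if c in (i, j):
                continue
            d_ci = d[(min(c, i), max(c, i))]
            d_cj = d[(min(c, j), max(c, j))]
            ai, aj, b, g = lance_williams(method, ni, nj, len(members[c]))
            updated[c] = ai * d_ci + aj * d_cj + b * d_ij + g * abs(d_ci - d_cj)
        # Drop every entry involving the merged clusters, then add the new one.
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        for c, v in updated.items():
            d[(c, next_id)] = v               # new ids are always the largest
        members[next_id] = members.pop(i) + members.pop(j)
        next_id += 1
    labels = [0] * n
    for label, objects in enumerate(members.values()):
        for obj in objects:
            labels[obj] = label
    return labels

def squared_euclidean_matrix(points):
    """Helper for the example: pairwise squared Euclidean distances."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]
```

For instance, `agglomerate(squared_euclidean_matrix([[0], [1], [10], [11], [50]]), 3, "ward")` groups the first two points together, the next two together, and leaves the last point on its own. The loop is cubic in the number of objects, which is acceptable for sample-based clustering with tens or hundreds of samples but not for clustering thousands of genes.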

Experiments and Results
Thirty-three publicly available microarray datasets are included in our analysis [25]. These datasets were obtained using two microarray technologies: single-channel Affymetrix chips (21 sets) and double-channel cDNA (12 sets). We compare seven clustering methods with regard to five proximity measures. Gene expression data are typically very noisy and mix the expression patterns of down-regulated and up-regulated genes, so preprocessing is necessary before differential expression analysis. To adjust the data for technical variation, as opposed to biological differences between the samples, we preprocessed only the Affymetrix data, using a standardization technique. The cDNA datasets were already preprocessed. The experimental datasets are given in Table 2.
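The text does not specify which standardization was applied to the Affymetrix data. A common choice, assumed here purely for illustration, is to rescale each gene to zero mean and unit sample standard deviation across the samples; the function names and the samples-by-genes layout of the matrix are likewise assumptions of this Python sketch.

```python
import math

def standardize(values):
    """Z-score a list of values: subtract the mean, divide by the sample SD."""
    m = len(values)
    if m < 2:
        return [0.0] * m
    mean = sum(values) / m
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (m - 1))
    if sd == 0:
        # A gene with constant expression carries no information; map it to zeros.
        return [0.0] * m
    return [(v - mean) / sd for v in values]

def standardize_genes(matrix):
    """matrix is assumed to be samples x genes; each gene (column) is z-scored."""
    columns = [standardize(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]
```

After this step every gene contributes on a comparable scale to distance-based measures such as the Euclidean distance, so highly expressed genes no longer dominate the comparison between samples.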
We first present some graphical displays for both types of gene expression datasets. For each of the seven AHC methods, we present the results using bar diagrams and box-and-whisker plots to compare which AHC method is best; the graphs are given in Figure 1, Figure 2 and Figure 3.
The mean values of the corrected Rand (cR) index for the experiments with the 21 Affymetrix datasets are presented in Figure 1. The Ward method obtained the highest values across the proximity measures when compared to those achieved by the other methods, whereas the second best method, complete linkage, which is one of the most traditionally used methods, obtained the lowest values in comparison to all the other methods. Figure 2 illustrates the mean values of the corrected Rand index for the experiments performed with the 12 cDNA datasets. The Ward method again achieved the highest values across the proximity measures in comparison to all the other methods. The median and centroid methods attained the lowest values in comparison to all the other methods.
In another analysis, we investigated the performance of the proximity measures across the AHC methods. The mean values of the corrected Rand index for the experiments performed with the Affymetrix and cDNA datasets are presented in Figure 3. Based on Figure 3, we found that the cosine measure achieved the highest values compared to the other measures. For both the Affymetrix and cDNA datasets, Table 3 shows the average corrected Rand index values of the seven Agglomerative Hierarchical Clustering methods with respect to the proximity measures. We observed that the Ward method performed better than any other method, achieving the highest cR values. Furthermore, the cosine proximity measure showed the highest cR values in comparison to all the other proximity measures.

Conclusion
Cluster analysis of gene expression microarray data is of increasing interest in the field of functional genomics. One reason is the need for molecular-based refinement of broadly defined biological classes, with implications for cancer diagnosis, prognosis and treatment. For this reason, we revisited two types of microarray datasets: Affymetrix and cDNA. This paper presents a comparative study of seven AHC methods with respect to five proximity measures, applied to a large collection of datasets. The corrected Rand (cR) index was used to measure the accuracy of the clustering. We found that the Ward method performs better than all other methods for both types of datasets. We also found that the cosine measure performs better than all other proximity measures for both types of datasets. We recommend using the Ward method with cosine distance to analyze Affymetrix and cDNA gene expression datasets.