On the Selection of Appropriate Proximity Measurement for Gene Expression Data

Gene expression profiles have become a useful biological resource in recent years and play an important role in a broad range of biological research. However, the large number of genes and the complexity of biological networks greatly increase the difficulty of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. In the computational analysis of gene expression data, a central task is finding co-expressed genes, and the outcome depends on the proximity (similarity or dissimilarity) measure used by the clustering method. Several proximity measures have been applied to gene expression data, but most of these studies emphasize the biological results and offer no critical assessment of the suitability of the proximity measures themselves. This paper therefore investigates which proximity measure is appropriate for gene expression data. As a case study, we consider six real datasets and provide a comparative study of five proximity measures: Euclidean distance, Manhattan distance, Pearson correlation, Spearman correlation, and Cosine distance. We use the Adjusted Rand Index and the Silhouette Index to assess the quality and reliability of the clustering results. Our results reveal that the Cosine distance method with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Adjusted Rand Index, and that the Spearman correlation measure with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Silhouette Index.


Introduction
Microarray technology quantitatively and simultaneously measures the expression levels of thousands of genes in a gene expression profiling experiment under different conditions [1]. DNA microarray technology has made it possible to monitor the expression levels of thousands of genes simultaneously during important biological processes and across collections of related samples. An appropriate proximity measure is therefore needed to extract hidden information from the co-expression analysis of such enormous genome data, and a common task is to compare proximity measures for gene expression datasets.
There are several widely used proximity measures, such as Euclidean distance, Manhattan distance, Cosine distance, Pearson correlation, Spearman correlation, the Jaccard coefficient, and the Kendall tau correlation coefficient. In addition, various analytical and statistical approaches have been developed to capture the overall features of high-dimensional datasets. Hierarchical clustering is one of them; it is divided into agglomerative and divisive methods, of which Agglomerative Hierarchical Clustering (AHC) is the more popular. Several AHC methods are well established [2,3]. Gene expression microarray technology is available on two types of platforms, single-channel microarrays (Affymetrix) and double-channel microarrays (cDNA), and for these datasets it is meaningful to cluster both genes and samples [4,5,6]. Such datasets are therefore commonly used for both gene-based and sample-based clustering. This study performs only sample-based clustering, whose goal is to identify the phenotype structures or substructures of the samples. In sample-based clustering, genes are treated as features and samples as objects, and the samples are partitioned into homogeneous groups.
In this study, five proximity measures (Euclidean Distance, Manhattan Distance, Cosine Distance, Pearson Correlation, and Spearman Correlation) are used to assess clustering performance on gene expression data [7,8], in combination with the four AHC methods (Single Linkage, Complete Linkage, Average Linkage and Centroid Linkage) discussed in [8,9,10].
These four AHC methods were also used to evaluate clustering performance in earlier analyses [11,12,13]. Most of those authors found the cosine correlation method to perform better, while the remaining authors found Euclidean distance to be the better measure for evaluating microarray gene expression data.

Proximity Measures of Gene Expression Data
Proximity measures (distances and similarities) support the analysis of gene expression data, as analysed by two groups of authors [14,15]. For this reason we introduce some proximity measures (distances and similarities) here. Let x and y denote two numerical vectors representing gene expression data objects with m features, where the objects can be either genes or samples, as detailed in [16,17,18]. The five measures (Euclidean Distance, Manhattan Distance, Cosine Distance, Pearson Correlation, and Spearman Correlation), as expressed in [19,20,21,22], are given below.

Euclidean Distance
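The Euclidean distance between x and y is the square root of the sum of squared differences between corresponding elements of the two vectors. It can be defined as
$$d_{E}(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
where x_i and y_i denote the i-th elements of x and y.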

Manhattan Distance
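The Manhattan distance between x and y is measured along axes at right angles, that is, as the sum of the absolute differences between corresponding elements, and it is defined as
$$d_{M}(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$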
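The two distances described above can be computed directly from their verbal definitions. The following is a minimal Python sketch for illustration only (the study itself reports using R); the function names `euclidean_distance` and `manhattan_distance` and the two toy sample vectors are assumptions introduced here, not part of the original analysis.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences between paired elements."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan_distance(x, y):
    """Sum of absolute differences, i.e. distance measured along the axes."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy expression profiles of two samples over three genes (hypothetical values).
sample_a = [2.0, 4.0, 1.0]
sample_b = [5.0, 0.0, 1.0]

print(euclidean_distance(sample_a, sample_b))  # 5.0
print(manhattan_distance(sample_a, sample_b))  # 7.0
```

Both functions are sensitive to the absolute magnitude of the expression values, which is one reason such distances are often compared against correlation-based measures.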

Cosine Similarity
Cosine similarity is a widely used similarity measure for text documents, for example in numerous information retrieval applications, and for clustering as well. It is popular because it is efficient to evaluate, especially for sparse vectors, since only the non-zero dimensions need to be considered. Independence from document length is an important property of cosine similarity. Like the Jaccard measure, cosine similarity also ignores 0-0 matches. It is defined by the following equation.
$$\cos(x, y) = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2}\,\sqrt{\sum_{i=1}^{m} y_i^2}}$$
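The length independence mentioned above can be seen in a small Python sketch. The helper names are hypothetical, and turning the similarity into a "Cosine distance" as 1 minus the similarity is a common convention assumed here; the source does not state which conversion was used.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Assumed convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(x, y)


profile = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]  # same direction, ten times the magnitude

# Rescaling a vector leaves the similarity (essentially) unchanged.
print(cosine_similarity(profile, scaled))
# Orthogonal vectors share no non-zero dimension and get similarity 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Only dimensions where both vectors are non-zero contribute to the numerator, which is why 0-0 matches have no effect on the result.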

Spearman Correlation
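Spearman correlation measures the degree of a monotonic relationship between two variables, without making any assumptions about the frequency distribution of the variables. In practice, when there are no tied ranks, a simple formula is normally used to calculate the Spearman correlation:
$$\rho(x, y) = 1 - \frac{6 \sum_{i=1}^{m} d_i^2}{m(m^2 - 1)}$$
where d_i is the difference between the rank of x_i among the elements of x and the rank of y_i among the elements of y.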
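As an illustration of the rank-based calculation, the sketch below applies the usual simplified formula, 1 - 6 Σ d² / (m(m² - 1)), where d is the difference between paired ranks. The function names are hypothetical, and the formula is exact only when there are no tied values; ties are given average ranks here, in which case the result is an approximation.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def spearman_correlation(x, y):
    """Simplified Spearman formula based on squared rank differences."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("x and y must have the same length of at least 2")
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * sum_d2 / (m * (m * m - 1))


# A monotonic but non-linear relationship still gives a perfect correlation.
print(spearman_correlation([10, 20, 30, 40, 50], [1, 4, 9, 16, 25]))  # 1.0
# Two swapped neighbouring pairs lower the value to about 0.8.
print(spearman_correlation([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))
```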

Pearson Correlation
The Pearson correlation coefficient is widely used and has proven effective as a similarity measure for gene expression data. Pearson correlation is defined by the following equation.
$$r(x, y) = \frac{COV(x, y)}{SD(x)\,SD(y)}$$
where COV(x, y) is the covariance between x and y, and SD(x) and SD(y) are the standard deviations of x and y.
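A minimal Python sketch of the covariance-over-standard-deviations definition is given below. The function name is an assumption made for illustration; the normalising factors of the covariance and the standard deviations cancel, so the code works with plain sums of centred products.

```python
import math


def pearson_correlation(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    m = len(x)
    if m != len(y) or m < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    # The 1/(m-1) factors in COV and SD cancel, so raw sums are sufficient.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


# A linear relationship gives a value of (essentially) 1.
print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
# A monotonic but curved relationship gives a value below 1.
print(pearson_correlation([1.0, 2.0, 3.0, 4.0], [1.0, 4.0, 9.0, 16.0]))
```

Unlike a rank-based measure, the coefficient drops below 1 for the curved example because it captures linear rather than merely monotonic association.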

Checking Validity
When selecting an appropriate proximity measure, it is common to evaluate the measures through the clustering results they produce. However, clustering is an unsupervised process in data mining and pattern recognition, and most clustering methods are very sensitive to their input parameters. It is therefore important to evaluate the results of the clustering methods. Because it is difficult to characterize when a clustering result is acceptable, several clustering validity techniques have been developed. This study uses two of the most common validity techniques: the Adjusted Rand Index and the Silhouette Index.

Adjusted Rand Index (ARI)
When a reference partition is available, external validation measures can be employed to assess the quality of the results. We choose the Adjusted Rand Index, discussed by Bipul [19], for the evaluation of clustering results because of its correction that takes into account agreements between partitions that arise by chance [23]. Given a partition U and a reference partition V, let (a) be the total number of object pairs belonging to the same cluster in both U and V; (b) the total number of object pairs in the same cluster in U and in different clusters in V; (c) the total number of object pairs in different clusters in U and in the same cluster in V; and (d) the total number of object pairs in different clusters in both U and V. The index is then defined as
$$ARI = \frac{a - \dfrac{(a + c)(a + b)}{a + b + c + d}}{\dfrac{(a + c) + (a + b)}{2} - \dfrac{(a + c)(a + b)}{a + b + c + d}}$$
The greater its value, the greater the resemblance between the two partitions under comparison, with values close to 0 representing agreement found by chance.
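The pair counts (a), (b), (c) and (d) can be obtained by enumerating all object pairs. The Python sketch below does this and then applies the standard pair-count form of the Adjusted Rand Index, (a - E) / (M - E), with E = (a+b)(a+c)/(a+b+c+d) and M = ((a+b)+(a+c))/2. The function names are hypothetical, and returning 1.0 when both partitions are trivially identical (a 0/0 case) is a convention assumed here.

```python
from itertools import combinations


def pair_counts(u, v):
    """Count object pairs by co-membership in partition u and reference v."""
    if len(u) != len(v):
        raise ValueError("both partitions must label the same objects")
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u = u[i] == u[j]
        same_v = v[i] == v[j]
        if same_u and same_v:
            a += 1  # together in both partitions
        elif same_u:
            b += 1  # together in u, apart in v
        elif same_v:
            c += 1  # apart in u, together in v
        else:
            d += 1  # apart in both partitions
    return a, b, c, d


def adjusted_rand_index(u, v):
    """Chance-corrected agreement between two partitions of the same objects."""
    a, b, c, d = pair_counts(u, v)
    total = a + b + c + d
    if total == 0:
        raise ValueError("at least two objects are required")
    expected = (a + b) * (a + c) / total
    maximum = ((a + b) + (a + c)) / 2.0
    if maximum == expected:
        return 1.0  # assumed convention: both partitions are trivially identical
    return (a - expected) / (maximum - expected)


# Cluster labels are arbitrary: a relabelled but identical partition scores 1.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# A partition that cuts across the reference scores below 0 (about -0.5 here).
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```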

Silhouette Index (SI)
To estimate the number of clusters, a relative index that compares partitions is also employed. The Silhouette index is defined by considering a partition of m objects into k disjoint clusters. For the i-th object x, the average distance between x and all the remaining objects of its own cluster is denoted by u(i). In addition, the average distance between x and all the objects of another given cluster is computed. This process is repeated for all k-1 clusters other than the cluster to which x belongs, and the lowest average value found is assigned to v(i). In other words, v(i) is the mean distance between x and its closest neighbouring cluster. The silhouette of the i-th object and the Silhouette Index of the partition are then
$$s(i) = \frac{v(i) - u(i)}{\max\{u(i), v(i)\}}, \qquad SI = \frac{1}{m} \sum_{i=1}^{m} s(i)$$
The Silhouette, which is a maximization measure, takes values within [−1, 1].
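The procedure for u(i) and v(i) described above can be sketched in Python as follows. The function name, the precomputed distance-matrix input and the convention of scoring a singleton cluster as 0 are assumptions made for this illustration; each object's silhouette is taken as (v(i) - u(i)) / max(u(i), v(i)) and the index is the mean over all objects.

```python
def silhouette_index(distance, labels):
    """Mean silhouette over all objects, given a square distance matrix."""
    m = len(labels)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    scores = []
    for i in range(m):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # u(i): average distance to the other members of the object's cluster.
        u = sum(distance[i][j] for j in own) / len(own)
        # v(i): lowest average distance to the members of any other cluster.
        v = min(
            sum(distance[i][j] for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(u, v)
        scores.append((v - u) / denom if denom > 0 else 0.0)
    return sum(scores) / m


# Four objects on a line, forming two well-separated groups (toy data).
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]

print(silhouette_index(dist, [0, 0, 1, 1]))  # close to 0.9: compact clusters
print(silhouette_index(dist, [0, 1, 0, 1]))  # negative: poorly matched labels
```

Because only a distance matrix and the labels are needed, the same function can be reused with any of the proximity measures once they are expressed as distances.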
We choose the Silhouette because of its superior results in comparison to other relative criteria [24]. We also note that the Silhouette has already been employed successfully to estimate the number of clusters for gene expression data.
Finally, by using the SI one can simulate a real application in which the user has no a priori information about the number of clusters present in the data. It is important to make clear that the use of comparative indexes (such as the Silhouette) is just one part of the more general procedure that comprises the entire clustering analysis.
Motivated by this problem, it is important to examine all of the methods on gene expression data in a standardized way to determine which perform relatively best. This paper compares five proximity measures on both Affymetrix and cDNA datasets and provides a detailed graphical as well as analytical comparison. We use bar diagrams together with the ARI and SI to identify the suitable proximity measures for clustering. The AHC algorithm with the several proximity measures was implemented in the R programming language, version 3.0.2. MS Excel and MS Word were used at times for calculation and typesetting.

Experiments and Results
Six publicly available microarray datasets from [9] are used in our analysis. These datasets can be classified as single-channel Affymetrix chips (3 sets) and double-channel cDNA (3 sets). We compare five proximity measures with four different clustering methods. Gene expression data are generally very noisy, with co-occurring expression patterns and both down-regulation and up-regulation, so preprocessing is essential before differential analysis. To adjust the data for technical, as opposed to biological, differences between the samples, we preprocessed only the Affymetrix data using a standardized procedure; the cDNA datasets had already been preprocessed. The empirical datasets are described in Table 1, where n is the number of samples, m is the number of features (genes), and d is the number of features after filtering.
We first present graphical displays for both types of gene expression datasets. For each of the five proximity measures combined with the four AHC methods, we present the results as bar diagrams to compare which proximity measure is most meaningful; the results are given in Figure 1, Figure 2, Figure 3 and Figure 4. The mean values of the Adjusted Rand Index for the experiments with Affymetrix datasets are presented in Figure 1. The cosine method with complete linkage obtained the maximum value among the AHC methods when compared with the values achieved by the other methods. The mean values of the Silhouette Index for the experiments with Affymetrix datasets are presented in Figure 3. The Spearman method obtained the maximum value among the AHC methods when compared with the values achieved by the other methods.
Figure 4 illustrates the mean values of the Silhouette Index for the experiments performed with the cDNA datasets. The Spearman method with complete linkage achieved the highest value among the proximity measures and gives the best result in comparison to all the other methods. Table 2 shows the mean values of the Adjusted Rand Index, used to check the performance of the proximity measures along with the AHC methods. For the Affymetrix datasets, cosine with complete linkage gives the best result, and for the cDNA datasets cosine gives the partially best result. Overall, the Cosine distance method with complete linkage exhibited the highest result according to the Adjusted Rand Index.
The mean values of the Silhouette Index of the 5 proximity measures with the 4 clustering methods for both Affymetrix and cDNA datasets are presented in Table 3, to identify the best proximity measure with respect to the clustering methods. The Spearman correlation method with complete linkage shows, on average, the highest values according to the Silhouette Index for both types of datasets.
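To make the experimental setting more concrete, the following Python sketch shows one possible form of a single run: build a sample-by-sample distance matrix with a chosen proximity measure, then apply complete-linkage agglomerative clustering until k clusters remain. This is an illustrative reimplementation under stated assumptions, not the authors' R code; the function names, the naive merging loop and the toy data are all hypothetical.

```python
def manhattan_distance(x, y):
    """Sum of absolute differences between paired features."""
    return sum(abs(a - b) for a, b in zip(x, y))


def distance_matrix(samples, distance_fn):
    """Pairwise distances between samples (rows), with genes as features."""
    return [[distance_fn(p, q) for q in samples] for p in samples]


def complete_linkage_labels(distance, k):
    """Naive agglomerative clustering: merge the closest clusters until k remain."""
    n = len(distance)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of objects")
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                link = max(distance[i][j] for i in clusters[p] for j in clusters[q])
                if best is None or link < best[0]:
                    best = (link, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    labels = [0] * n
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels


# Four toy samples measured on two genes (hypothetical values).
samples = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]]
dist = distance_matrix(samples, manhattan_distance)
print(complete_linkage_labels(dist, 2))  # [0, 0, 1, 1]
```

Swapping `manhattan_distance` for another distance function, or the `max` in the linkage step for `min` (single linkage), gives other combinations of proximity measure and AHC method; the resulting labels can then be scored against a reference partition or with an internal index.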

Conclusion
We have presented a comparative study of five proximity measures with four clustering algorithms applied to six clinical cancer gene expression datasets. Our results reveal that the Cosine distance method with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Adjusted Rand Index. The analysis also shows that the Spearman correlation method with complete linkage exhibited the best performance for both Affymetrix and cDNA datasets according to the Silhouette Index. Additionally, among the clustering methods, complete linkage gives the best result according to both ARI and SI for both types of datasets. To the best of our knowledge, comparative studies of proximity measures using the Adjusted Rand Index and the Silhouette Index as validity indices are poorly documented in the literature.

Figure 1. Bar plot of the mean Adjusted Rand Index for the Affymetrix data.

Figure 2. Bar plot of the mean Adjusted Rand Index for the cDNA data.

Figure 2 illustrates the mean values of the Adjusted Rand Index for the experiments performed with the cDNA datasets.

Figure 3. Bar plot of the mean values of the Silhouette Index for the Affymetrix data.

Figure 4. Bar plot of the mean values of the Silhouette Index for the cDNA data.

Table 1. Description of the Affymetrix and cDNA datasets.

Table 2. Mean Adjusted Rand Index values for the Affymetrix and cDNA datasets.

Table 3. Mean Silhouette Index values for the Affymetrix and cDNA datasets.