Mixture Model Clustering Using Variable Data Segmentation and Model Selection: A Case Study of Genetic Algorithm

Abstract: A genetic algorithm for mixture model clustering using variable data segmentation and model selection is proposed in this study. The principle of the method is demonstrated on mixture model clustering of the Ruspini data set. The segment numbers of the variables in the data set were determined and the variables were converted into categorical variables. It is shown that variable data segmentation determines the number and structure of cluster centers in the data. A genetic algorithm was used to determine the number of finite mixture models. The total number of mixture models, and the possible candidate mixture models among them, are calculated using the cluster centers formed by variable data segmentation. A mixture of normal distributions is used in mixture model clustering. Maximum likelihood, AIC and BIC values were obtained for each candidate mixture model using the parameters estimated from the data. Candidate mixture models are established, to determine the number and structure of clusters, using sample means and variance-covariance matrices for the data set. The best mixture model for model-based clustering is selected among the candidate mixture models according to information criteria. The number of components in the best mixture model corresponds to the number of clusters, and the components of the best mixture model correspond to the structure of clusters in the data set.


Introduction
Analysis of clusters by means of mixture distributions is called mixture model cluster analysis [1]. Mixture model based clustering is one of the clustering methods for partitioning p-dimensional multivariate data into meaningful subgroups [2]. Each component in a mixture model of multivariate normal densities corresponds to a cluster in the multivariate data. The number of components in the mixture model determines the number of clusters, and the structure of the components forms the structure of the clusters in the multivariate data [3].
A mixture model of multivariate normal densities is defined as

f(x) = Σ_{i=1}^{k} π_i φ(x; μ_i, Σ_i),

where the π_i are mixing proportions with Σ_{i=1}^{k} π_i = 1 and φ(x; μ_i, Σ_i) is the multivariate normal density with mean vector μ_i and variance-covariance matrix Σ_i. Bozdogan [4] proposed a method for choosing the number of clusters, subset selection of variables, and outlier detection in the standard mixture model cluster analysis. Bozdogan [5] developed a method for mixture model cluster analysis using model selection criteria and defined a new informational measure of complexity. Soffritti [6] identified multiple cluster structures in a data matrix. Bozdogan [7] proposed a computationally feasible intelligent data mining and knowledge discovery technique that addresses the potentially daunting statistical and combinatorial problems presented by subset regression models. McLachlan and Chang [8] studied mixture modelling for cluster analysis. In their approach to clustering, the data can be partitioned into a specified number of clusters k by first fitting a mixture model with k components.
Galimberti and Soffritti [9] used model based clustering methods to identify multiple cluster structures in a multivariate data set. Durio and Isaia [10] developed a method for model selection in mixture of normal densities. Scrucca [11] used information on the dimension reduction subspace obtained from the variation on group means and, depending on the estimated mixture model, on the variation on group covariances. His method aims at reducing the dimensionality by identifying a set of linear combinations, ordered by importance as quantified by the associated eigenvalues, of the original features which capture most of the cluster structure contained in the data.
Seo and Kim [12] developed a root selection method for identifying the underlying group structure in the data using finite mixtures of normal densities. Fraley et al. [13] defined a method of normal mixture modeling for model-based clustering, classification, and density estimation. A model selection algorithm for mixture model clustering is defined by Erol [14]. Huang et al. [15] studied model selection for Gaussian mixture models. Their method is statistically consistent in determining the number of components. They used a modified EM algorithm [16] and applied it to simultaneously select the number of components and estimate the mixing weights.
Galimberti and Soffritti [17] studied conditional independence for parsimonious model-based Gaussian clustering. They assumed that the variables can be partitioned into groups that are conditionally independent within components, thus producing component-specific variance matrices with a block diagonal structure. McLachlan and Rathnayake [18] studied the number of components in terms of density estimation. Wei and McNicholas [19] used mixture model averaging for clustering. Model-based clustering of high-dimensional data was studied by Bouveyrona and Brunet-Saumardb [20].
A new data mining method with a new genetic algorithm using variable data segmentation and model selection for mixture model clustering of multivariate data is proposed in this study. The genetic algorithm has six steps: (i) variable data segmentation, (ii) determining the total number of cluster centers, (iii) computing the total number of mixture models and candidate models, (iv) obtaining candidate mixture models as binary string representations, (v) estimating the parameters of the possible (candidate) mixture models from the sample, and (vi) selecting the best model among the candidate mixture models. The proposed mixture model clustering based on variable data segmentation and model selection will be explained on a data set known as the Ruspini data set [21]. Akogul and Erisoglu [28] proposed a new approach for determining the number of clusters in a model-based clustering analysis. Akogul and Erisoglu [29] used information criteria to determine the number of clusters correctly and effectively. Celeux et al. [30] proposed an approach to determining the number G of components in a mixture distribution in model-based clustering. Gogebakan and Erol [31] used model-based clustering of normal mixture distributions in the semi-supervised classification of clusters in the mixture model. The multivariate data set consists of two real-valued variables, each containing four partitions, so the variables are heterogeneous.

The Method
The proposed data mining clustering method with a genetic algorithm for mixture model clustering of multivariate data, based on model selection using variable data segmentation, will be explained on the Ruspini data set [21] in the following sections.

Determination of Heterogeneous Variables in Multivariate Data for Variable Data Segmentation
A heterogeneous variable is a variable whose values contain at least two subgroups; otherwise it is considered a homogeneous variable. Each of the two variables X1 and X2 in the Ruspini data set [21] is heterogeneous, with four segments each. Variable data segmentation is the first step of the genetic algorithm for the proposed mixture model clustering based on model selection. The number of partitions of each variable can be obtained by fitting a mixture of univariate normal distributions to each variable in the data set. The mixture of univariate normal distributions is of the form

f(x) = Σ_{i=1}^{k} π_i φ(x; μ_i, σ_i²),   (4)

where μ_i and σ_i denote the mean and standard deviation of the component probability density functions, respectively. In order to reveal the partitions in each variable, the log-likelihood, Akaike Information Criterion (AIC) [22] and Bayesian Information Criterion (BIC) [23] values are examined for the mixtures of univariate normal distributions. The number of components in the selected univariate mixture model for each variable corresponds to the number of partitions of that variable. Evaluating the results in Table 1 and Table 2, one can see that the optimal number of components is 4 for the mixture models of both X1 and X2. Let k_i be the number of partitions in X_i. In addition, graphical methods such as histograms and cumulative distribution plots can be used to determine the segments of each variable [24]. Probability plots and histograms showing the variable data partitions for X1 and X2 are illustrated in Figure 1 and Figure 2. These partitions form sixteen cluster centers in the Ruspini data set [21], as illustrated in Figure 3.
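The selection of the number of univariate segments by information criteria can be sketched as follows. For a univariate k-component normal mixture, d = 3k − 1 free parameters (k − 1 weights, k means, k variances). The log-likelihood values below are hypothetical placeholders, not the values in Table 1 or Table 2:

```python
# Sketch: choosing the number of univariate segments by AIC/BIC.
# The log-likelihood values are hypothetical, not the paper's results.
import math

n = 75  # Ruspini data set size
loglik = {1: -420.0, 2: -395.0, 3: -372.0, 4: -350.0, 5: -349.5}  # placeholders

def aic(ll, k):
    d = 3 * k - 1           # (k-1) weights + k means + k variances
    return -2 * ll + 2 * d

def bic(ll, k, n):
    d = 3 * k - 1
    return -2 * ll + d * math.log(n)

best_by_bic = min(loglik, key=lambda k: bic(loglik[k], k, n))
print(best_by_bic)  # 4 under these placeholder values
```

Note how BIC penalizes the small log-likelihood gain from k = 4 to k = 5, so the four-segment model is retained.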

Computations for Total Number of Cluster Centers
The assumption of the proposed method is that each column and each row in Figure 3 must contain at least one cluster center. The method proposed by Servi and Erol [24] can be used to compute the minimum and maximum numbers of cluster centers, denoted by C_min and C_max:

C_min = max(k_1, ..., k_p),   C_max = Π_{s=1}^{p} k_s,   (6)

where p denotes the number of variables and k_s denotes the number of partitions in variable X_s. Since k_1 = k_2 = 4 for the Ruspini data set [21], the minimum number of cluster centers is 4 and the maximum number of cluster centers is 16. Partitions of the variables and the cluster centers are illustrated in Figure 3.
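The two bounds can be checked with a few lines (variable names are illustrative):

```python
# Minimum and maximum number of cluster centers from per-variable
# segment counts (k1 = k2 = 4 for the Ruspini data set).
from math import prod

segment_counts = [4, 4]          # k_s for each of the p variables
c_min = max(segment_counts)      # every row and column must hold a center
c_max = prod(segment_counts)     # the full grid of segment combinations
print(c_min, c_max)  # 4 16
```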
Observations can be assigned to the partitions of each variable using clustering algorithms such as the k-means algorithm [25]. Thus variable data segmentations are obtained from both graphical methods, such as probability plots and histograms, and computational methods, such as mixtures of univariate normal distributions and k-means. Variable data partitions and their sizes for variables X1 and X2 in the Ruspini data set [21] are given in Table 3. Mean vectors and variance-covariance matrices of the candidate cluster centers are obtained for the construction of mixture models using the variable data segmentations. The general form of the mean vectors in the component probability density functions, that is, the bivariate normal probability density functions, corresponding to each candidate cluster center is

μ_(i,j) = (μ_{1i}, μ_{2j})^T,   i = 1, ..., k_1,   j = 1, ..., k_2,

where μ_{1i} is the mean of the i-th partition of X1 and μ_{2j} is the mean of the j-th partition of X2. These mean vectors and variance-covariance matrices are used in the construction of mixture models for mixture model clustering using variable data segmentation and model selection.
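The grid of candidate center mean vectors can be sketched as below; the per-variable segment means are illustrative placeholders, not values estimated from the Ruspini data:

```python
# Constructing the 16 candidate cluster-center mean vectors as the grid
# of per-variable segment means (placeholder means, not Ruspini estimates).
from itertools import product

means_x1 = [20.0, 50.0, 70.0, 100.0]   # hypothetical segment means of X1
means_x2 = [25.0, 60.0, 100.0, 145.0]  # hypothetical segment means of X2

centers = [(m1, m2) for m1, m2 in product(means_x1, means_x2)]
print(len(centers))  # 16 candidate cluster centers
```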

Computations for Total and Possible Number of Mixture Models Using Cluster Centers
The total number of mixture models for the cluster centers obtained from variable data segmentation, denoted by M_Total, can be computed for the Ruspini data set [21] by the relation proposed by Erol [14]:

M_Total = 2^{C_max} − 1,

where C_max is as in (6); the minus-one term eliminates the case with no cluster center. Thus M_Total = 2^16 − 1 = 65535 for the Ruspini data set [21]. The number of cluster centers, the number of total mixture models, the number of possible mixture models and the number of free parameters in the mixture models are given in Table 4. Some mixture models do not satisfy the assumption that each column and each row contains at least one cluster center, so they are eliminated. The remaining mixture models are called candidate mixture models. The number of possible (candidate) mixture models can be computed using the relation proposed by Cheballah et al. [26]:

M_Possible = Σ_{i=0}^{n} Σ_{j=0}^{m} (−1)^{i+j} C(n, i) C(m, j) 2^{(n−i)(m−j)},

where n and m correspond to the numbers of partitions in variables X1 and X2, respectively, and the indices i and j run over the numbers of excluded rows and columns. For n = m = 4 this gives 41503 candidate mixture models.
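The two counts, 65535 total models and 41503 candidate models, can be reproduced with a short inclusion-exclusion computation over the 4×4 center grid:

```python
# Counting mixture models over the 4x4 center grid: all non-empty
# center subsets, and the subsets covering every row and column
# (inclusion-exclusion over excluded rows i and columns j).
from math import comb

def total_models(c_max):
    return 2 ** c_max - 1

def candidate_models(n, m):
    # 0-1 matrices of size n x m with no empty row and no empty column
    return sum((-1) ** (i + j) * comb(n, i) * comb(m, j)
               * 2 ** ((n - i) * (m - j))
               for i in range(n + 1) for j in range(m + 1))

print(total_models(16), candidate_models(4, 4))  # 65535 41503
```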

Binary String Representation Of Possible Mixture Models Using Cluster Centers
Mixture model clustering using variable data segmentation and model selection uses a genetic algorithm. The genetic algorithm is used to calculate the information criteria of each candidate mixture model. The string representation of each candidate model consists of the digits 1 and 0. In Table 5, the ones and zeros indicate whether or not the corresponding cluster centers are used in the construction of the mixture model, and each binary string corresponds to one of the 41503 possible mixture models. For instance, the binary string representation of the saturated mixture model, which uses all sixteen cluster centers, is given in Table 5.

List Of Possible Mixture Models Using Cluster Centers
Each binary string representation of a candidate mixture model corresponds to one of the 41503 possible mixture models.
The general form of a mixture model with k (4 ≤ k ≤ 16) components having a given binary string representation is

f(x) = Σ_{j=1}^{k} π_j φ(x; μ_j, Σ_j),   (11)

where each component density φ(x; μ_j, Σ_j) is a bivariate normal probability density function with the mean vector μ_j and variance-covariance matrix Σ_j of a selected cluster center.
There are 24 possible mixture models with four components (k = 4) of the form (11); their parameters are given in (12), (13) and (14). Similarly, there are 7480 possible mixture models with ten components (k = 10) of the form (11), with parameters defined analogously.
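The per-k counts (24 models for k = 4, 7480 for k = 10, and 41503 in total) can be verified by brute force: enumerate all 2^16 subsets of the 4×4 center grid and keep those covering every row and column:

```python
# Brute-force check of the per-k candidate-model counts over the 4x4 grid.
from itertools import product

counts = {}
for bits in product((0, 1), repeat=16):
    grid = [bits[r * 4:(r + 1) * 4] for r in range(4)]
    rows_ok = all(any(row) for row in grid)
    cols_ok = all(any(grid[r][c] for r in range(4)) for c in range(4))
    if rows_ok and cols_ok:
        k = sum(bits)                      # number of components
        counts[k] = counts.get(k, 0) + 1

print(counts[4], counts[10], sum(counts.values()))  # 24 7480 41503
```

The 24 models with k = 4 are exactly the permutation-like placements: one center per row and per column.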

Estimation of Parameters for Possible Mixture Models Using Cluster Centers
Mixture model clustering using variable data segmentation and model selection, proposed in this study, is a data mining method. The method has its own genetic algorithm, explained in the previous sections. Since variable data segmentation is applied to each variable in the data set, the mean vectors, variance-covariance matrices and mixing proportions for each component of the possible mixture models can be estimated directly from the sample. The complexity of mixture model clustering using variable data segmentation and model selection is therefore lower than that of other clustering methods. Each binary string representation, as in Table 5, corresponds to one of the 41503 possible mixture models of the form (11), where k denotes the number of components. The estimates of the mixing proportions, mean vectors and variance-covariance matrices for the component density functions are given in (15), (16) and (17).
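A minimal sketch of this estimation step: for each component of a candidate model, the mixing proportion is the fraction of observations falling in that center's cell, and the mean vector and variance-covariance matrix are the sample estimates over those observations. The tiny cell below is illustrative, not Ruspini data:

```python
# Sketch of step (v): sample estimates of (pi_j, mu_j, Sigma_j) for one
# component from the points assigned to its cell (illustrative data).

def estimate_component(points, n_total):
    n_j = len(points)
    pi_j = n_j / n_total                                  # mixing proportion
    mean = [sum(x[d] for x in points) / n_j for d in range(2)]
    cov = [[sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in points) / (n_j - 1)
            for b in range(2)] for a in range(2)]         # sample covariance
    return pi_j, mean, cov

cell = [(4.0, 5.0), (6.0, 7.0), (5.0, 6.0)]
pi_j, mean, cov = estimate_component(cell, n_total=12)
print(pi_j, mean)  # 0.25 [5.0, 6.0]
```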

Computation of Information Criteria for Possible Mixture Models
The likelihood function for the mixture of multivariate normal densities is defined as

L(π, μ, Σ) = Π_{i=1}^{n} Σ_{j=1}^{k} π_j φ(x_i; μ_j, Σ_j),

and the log-likelihood function is computed as

log L(π, μ, Σ) = Σ_{i=1}^{n} log( Σ_{j=1}^{k} π_j φ(x_i; μ_j, Σ_j) ).

The maximum likelihood estimation method is used in mixture distributions to obtain the parameters from the data set [27]. Log-likelihood values for the possible mixtures of bivariate normal densities are computed using the estimated parameter values for the Ruspini data set [21].
Akaike's information criterion (AIC) can be computed by

AIC = −2 log L(π, μ, Σ) + 2d,

and the Bayesian information criterion (BIC) by

BIC = −2 log L(π, μ, Σ) + d log(n),

where log L(π, μ, Σ) is the value of the log-likelihood function for the possible mixture of multivariate normal densities, d is the number of free parameters and n is the number of observations. The number of free parameters d can be computed by

d = (k − 1) + kp + k p(p + 1)/2,

where k is the number of components and p is the number of variables, that is, the dimension of the mixture model [5]. The log-likelihood, AIC and BIC values are computed from the partitions of the variables using the mean vectors and variance-covariance matrices, and are used as criteria for selecting the best mixture model of bivariate normal densities. All calculations are performed using MATLAB.

Selection of The Best Model In a Set of Possible Mixture Models
Selection of the best mixture model among the possible mixtures of bivariate normal densities for the Ruspini data set [21] according to the information criteria is performed using the values of the log-likelihood function, AIC and BIC. The mixture model having the maximum log-likelihood value and the minimum AIC and BIC values is selected as the best mixture model among the 41503 possible mixture models. The string representation of the best mixture model is given in Table 6. The number of components and the log-likelihood, AIC and BIC values of the best mixture model are given in Table 7. The best mixture model is selected as the mixture of four-component bivariate normal densities for the Ruspini data set [21]; it is the 12th mixture model among the 41503 possible mixture models. The scatter plot and the surface plot of the best mixture model are illustrated in Figure 4.
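The selection step itself reduces to an argmin over the candidate models. The (model, log-likelihood, k) triples below are hypothetical placeholders, not the values in Table 7:

```python
# Step (vi) sketch: pick the candidate with the smallest BIC
# (placeholder candidates, not the paper's fitted models).
import math

n, p = 75, 2  # Ruspini sample size and dimension
candidates = [("model_a", -290.0, 4), ("model_b", -287.0, 6),
              ("model_c", -300.0, 4)]

def bic(ll, k):
    d = (k - 1) + k * p + k * p * (p + 1) // 2
    return -2 * ll + d * math.log(n)

best = min(candidates, key=lambda c: bic(c[1], c[2]))
print(best[0])  # model_a under these placeholder values
```

Note that the six-component candidate is rejected despite its higher log-likelihood, because BIC charges it for the extra free parameters.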

Conclusions
In this study, a new data mining method using a genetic algorithm for mixture model clustering based on variable data segmentation and model selection was developed and applied to the Ruspini data set. In the developed genetic algorithm, we calculated the number and structure of the candidate cluster centers resulting from the segmentation of the heterogeneous variables. The number of all mixture models that can be formed from these candidate cluster centers and the number of possible mixture models satisfying the assumption were calculated. Possible mixture models corresponding to the candidate cluster centers were generated using the genetic algorithm, and a string representation of each possible mixture model was obtained so that the models could be computed. The unknown parameters of the possible mixtures of bivariate normal distributions were estimated from the sample. The computational complexity of the proposed mixture model clustering is lower than that of other clustering methods, which is why algorithms such as Expectation-Maximization (EM) are not needed for parameter estimation. According to the calculated values, that is, the log-likelihood, AIC and BIC, the best mixture model matching the clustering structure of the Ruspini data set was selected.
It can be heuristically stated that the partitions in the heterogeneous variable data affect and determine the number and structure of clusters in the data set, regardless of the number of variables. The clustering method proposed in this study is developed especially for model-based clustering of big data.
As future work, the proposed method will be applied to human brain studies. The study will cover the number of human brain function centers, the magnitude of these centers, the correlations between them, and the construction of mixture models for these brain function centers of human behaviours and activity movements. Furthermore, the method can be applied to robotics, artificial intelligence and logic circuit design for decision-making applications.