Evaluation of Error Rate Estimators in Discriminant Analysis with Multivariate Binary Variables

Classification problems often suffer from small samples in conjunction with a large number of features, which makes error estimation problematic. When a sample is small, there are insufficient data to split the sample, and the same data are used for both classifier design and error estimation. Error estimation can suffer from high variance, bias or both. The problem of choosing a suitable error estimator is exacerbated by the fact that estimation performance depends on the rule used to design the classifier, the feature-label distribution to which the classifier is to be applied and the sample size. This paper is concerned with the evaluation of error rate estimators in two-group discriminant analysis with multivariate binary variables. The behaviour of eight of the most commonly used estimators is compared and contrasted by means of Monte Carlo simulation. The criterion used for comparing the error rate estimators is the sum of squared errors (SSE). Four experimental factors are considered in the simulation, namely the number of variables, the sample size relative to the number of variables, the prior probability and the correlation between the variables in the populations. From the analysis carried out, the estimators can be ranked as follows: DS, O, OS, U, R, JK, P and D.


Introduction
It is common to use the estimated error rate to evaluate the performance of a classifier. In the nonparametric framework, the leave-one-out method (also referred to as cross-validation or the U method) proposed by [16] has been shown to have a much smaller bias than the resubstitution method [17], and has become a popular nonparametric error estimator in small-sample situations. However, [18] has shown that the leave-one-out method can have a much larger variance than competing estimators. In some cases, this variance is sufficiently large that competitors with slightly larger bias but smaller variance will outperform the leave-one-out estimator. Error estimation is critical to classification because the validity of the resulting classifier model, composed of the classifier and its error estimate, is based on the accuracy of the error estimation procedure [19, 20, 21, 22]. Given a large set of sample data, the data can be split between training and test data, with a classifier being designed on the training data and its error being estimated on the test data.
The downside of splitting the data is that fewer data are available for design, thereby hurting the design process. This negative impact is negligible when there is an abundance of data but can be significant when samples are small [22, 23, 24, 25]. In this paper our focus is on using the same data for training and testing. Since it is impossible to know the accuracy of a particular error estimate for a specific sample, estimation quality is judged on the properties of the estimation procedure. Performance can be judged in various ways. We consider error-estimation performance relative to accuracy, correlation with the true error, regression between the true and estimated errors, conditional bounds on the true error, the number of variables, the sample size relative to the number of variables and the prior probability.
In this paper, the problem of estimating the error rate in two-group discriminant analysis is considered. Given the existence of two groups of individuals, one wants to find a classification rule for allocating new individuals or observations to one of the two existing groups. Corresponding to each classification rule, there is a probability of misclassification if that rule is used to classify new individuals (observations) into one of the two groups. The best classification rule is the one that leads to the smallest probability of misclassification, which is also called the error rate [23, 24, 25]. The error rate considered in this paper is the conditional error rate. Here the word conditional refers to conditioning on the training samples from which the classification rule is constructed. One may also think of this as the probability that the given classification rule would incorrectly classify a future observation. It should also be noted that the conditional error rate is the error rate that is important to an experimenter who has already determined the classification rule. This conditional error rate is also referred to as the actual error rate or the true error rate by many authors. Hence, in this paper we concentrate only on the actual error rate and its estimation. The rest of the paper is organized as follows: the classification rule used in this study is described in section 2, the error rates of the discriminant rules in section 3, the simulation study plan is given in section 4, and the results and conclusion are given in section 5.

Classification Rule
The classification rule considered in the current study is the maximum likelihood rule (ML rule), which can be described as follows. The maximum likelihood discriminant rule for allocating an observation $x$ to one of the populations $\pi_1, \ldots, \pi_k$ is to allocate $x$ to the population which gives the largest likelihood to $x$. In the two-group case, classify $x$ in $\pi_1$ if $P(\pi_1 \mid x) > P(\pi_2 \mid x)$, or in $\pi_2$ if $P(\pi_2 \mid x) > P(\pi_1 \mid x)$, where $P(\pi_i \mid x)$ is the posterior probability, which can be found by Bayes' rule. But this is the same as: classify $x$ in $\pi_1$ if $f(x \mid \pi_1)\,P(\pi_1) > f(x \mid \pi_2)\,P(\pi_2)$, where $f(x \mid \pi_i)$ is the class-conditional probability density function and $P(\pi_i)$ is the prior probability.
Denoting the classes by $\pi_1, \ldots, \pi_k$, the maximum likelihood classifier is based on the assumed multivariate normal probability density function for each class, given by
$$f(x \mid \pi_i) = (2\pi)^{-p/2}\,|\hat{\Sigma}_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \hat{\mu}_i)^T \hat{\Sigma}_i^{-1} (x - \hat{\mu}_i)\right\},$$
where $\hat{\mu}_i$ is the estimated mean vector for class $i$, $\hat{\Sigma}_i$ is the estimated variance-covariance matrix for class $i$ and $p$ is the number of characteristics measured (i.e. the length of each vector $x$); the factor $2\pi$ involves the numerical constant, not a population. To allocate $x$ to one of the classes, the density function $f(x \mid \pi_i)$ is evaluated for each of the $k$ classes and $x$ is assigned to $\pi_i$ if (assuming equal costs of misclassification and equal a priori probabilities) one has $f(x \mid \pi_i) > f(x \mid \pi_j)$ for all $j \neq i$.
We assumed that the data can be modeled adequately by a multivariate normal distribution. If the class-conditional probability density function $f(x \mid \pi_i)$ is estimated by using the frequency of occurrence of the measurement vectors in the training data, the resulting classifier is non-parametric. An important advantage of the non-parametric classifier is that any pattern, however irregular it may be, can be characterized exactly. This advantage is generally outweighed by two difficulties with the non-parametric approach.
(i). It is difficult to obtain a large enough training sample to adequately characterize the probability distribution of a multi-band data set.
(ii). Specification of a meaningful n-dimensional probability density function requires a massive amount of memory or very clever programming.
In real situations it is reasonable to consider some important factors such as the prior probabilities of observing individuals from the two populations and the cost due to misclassification. However, in this paper, only the case with equal prior probabilities and equal costs of misclassification is considered.
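As a minimal sketch of the allocation rule described in this section, the snippet below assigns a binary observation to the population with the largest prior-weighted likelihood. For illustration only, it assumes the binary variables are independent within each population (the populations studied in this paper may be correlated); the function names and the parameter values `p1` and `p2` are hypothetical.

```python
import math


def bernoulli_log_likelihood(x, probs):
    """Log-likelihood of binary vector x under independent Bernoulli(probs[j]).

    Assumes every probability lies strictly between 0 and 1.
    """
    return sum(math.log(p if xj == 1 else 1.0 - p) for xj, p in zip(x, probs))


def ml_classify(x, class_probs, priors=None):
    """Return the index of the class maximising prior * likelihood for x."""
    k = len(class_probs)
    if priors is None:
        priors = [1.0 / k] * k  # equal priors, the case considered in the text
    scores = [
        math.log(priors[i]) + bernoulli_log_likelihood(x, class_probs[i])
        for i in range(k)
    ]
    return max(range(k), key=lambda i: scores[i])


# Hypothetical parameters for two populations with r = 3 binary variables.
p1 = [0.8, 0.7, 0.6]
p2 = [0.3, 0.4, 0.2]
print(ml_classify([1, 1, 1], [p1, p2]))  # index 0 stands for the first population
print(ml_classify([0, 0, 0], [p1, p2]))  # index 1 stands for the second population
```

With equal priors the comparison reduces to a likelihood ratio, so the prior term only matters when unequal priors are supplied.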

Type of Error Rate of the Discriminant Rules
One of the objectives of evaluating a discriminant function is to determine its performance in the classification of future observations. There are several types of error rates associated with discriminant rules.

The Optimum Error Rate
This is the error rate that would hold if the parameters of the distributions were known. Let $\alpha_i(\zeta)$, $i = 1, 2$, be defined as the probability that a random member of $\pi_i$ is misallocated when the rule $\zeta$ is used.
These are known as the optimum error rates; they are the error rates that would occur if $\theta$ were known. Since $\pi_1$ and $\pi_2$ are labeled arbitrarily, it is necessary only to consider $\alpha_1(\zeta)$. To study $\alpha_2(\zeta)$, the labels of the populations are simply interchanged. Therefore, subsequently, any unknown observation $X$ is assumed to come from $\pi_1$, the subscript on $\alpha$ is dropped and $\alpha(\zeta) = \alpha_1(\zeta)$. The optimum error rate is now given by
$$\alpha(\zeta) = \Pr(\zeta \text{ allocates } X \text{ to } \pi_2 \mid X \in \pi_1).$$

The Conditional Actual Error Rate
The conditional actual error rate, denoted $\alpha(\hat{\zeta})$, is defined as the probability that a random observation from $\pi_1$ is misallocated when the rule $\hat{\zeta}$ is used.
Note that this error rate is conditional on the estimated parameters which in turn are determined by the training samples.

Expected Actual Error Rate
This is the probability that randomly chosen training samples yield a decision rule which misclassifies a randomly chosen member of $\pi_1$. If the expected value operator is defined with respect to all possible training samples, then the expected actual error rate is written as $E[\alpha(\hat{\zeta})]$. Note the hierarchy associated with these error rates: the optimum error rate is a function only of the distributions of $X$ for the two populations, the expected actual error rate is a function of the distributions of $X$ and the training sample sizes, while the conditional actual error rate is a function of the distributions of $X$ and the particular training samples selected.
In order to compare error rate estimators it is necessary to specify the error rate being estimated. Assuming $\theta$ is unknown, estimates of the optimum error rate and the expected actual error rate are valuable for deciding whether or not a discriminant analysis should be performed, for comparing possible discriminant rules and for determining the advantages of increasing the size of the training samples. However, an experimenter is most likely to be concerned with the performance of his or her discriminant rule after the training samples have been selected. Although the performance of the rule can vary greatly with the choice of the training samples, the optimum error rate and the expected actual error rate are independent of that choice. Therefore, once a discriminant rule $\hat{\zeta}$ has been determined, it is the conditional error rate, $\alpha(\hat{\zeta})$, which is of interest.

Expressions for $\alpha(\zeta)$, $\alpha(\hat{\zeta})$ and $E[\alpha(\hat{\zeta})]$ Under Normality
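Throughout this work the costs of misclassification are assumed to be equal; this may be done without loss of generality, since this assumption does not restrict the range of the constant $k$. Now consider the situation where $\pi_1$ and $\pi_2$ refer to $r$-variate normal parent distributions with unknown means $\mu_1$ and $\mu_2$, respectively, and a common covariance matrix $\Sigma$, which may be known or unknown, and let $\Delta$, where
$$\Delta^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2),$$
be the Mahalanobis distance between the populations. Also assume equal prior probabilities, so that $k = 1$. Now let $\hat{\mu}_1 = \bar{x}_1$, $\hat{\mu}_2 = \bar{x}_2$ and $\hat{\Sigma} = S$ be the minimum variance unbiased estimates of $\mu_1$, $\mu_2$ and $\Sigma$ based on the training samples [1]. Note that $\hat{\Sigma}$ refers to a random variable and $S$ to a realization of that random variable.
In this situation, the linear discriminant function, or Anderson's W statistic, is defined as
$$W(X) = \left[X - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]^T S^{-1} (\bar{x}_1 - \bar{x}_2),$$
and the decision rule $\hat{\zeta}$ reduces to: allocate $X$ to $\pi_1$ if $W(X) > 0$, and to $\pi_2$ otherwise. The optimum error rate is simply
$$\alpha(\zeta) = \Phi(-\Delta/2),$$
where $\Phi$ is the standard normal distribution function. Conditional on the training samples (and therefore on $\bar{x}_1$, $\bar{x}_2$ and $S$), $W(X)$ has a univariate normal distribution; for $X$ from $\pi_1$,
$$W(X) \sim N\left(\left[\mu_1 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]^T S^{-1}(\bar{x}_1 - \bar{x}_2),\; (\bar{x}_1 - \bar{x}_2)^T S^{-1} \Sigma S^{-1} (\bar{x}_1 - \bar{x}_2)\right).$$
The conditional actual error rate is the probability that $W(X)$ is less than or equal to zero and hence can be given as
$$\alpha(\hat{\zeta}) = \Phi\left(-\frac{\left[\mu_1 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]^T S^{-1}(\bar{x}_1 - \bar{x}_2)}{\sqrt{(\bar{x}_1 - \bar{x}_2)^T S^{-1} \Sigma S^{-1} (\bar{x}_1 - \bar{x}_2)}}\right).$$
The expected actual error rate is more complicated. For $\Sigma$ unknown, an asymptotic distribution of $E[\alpha(\hat{\zeta})]$ was given by [8]; [9] and [14] used numerical integration to tabulate values of $E[\alpha(\hat{\zeta})]$ for $r = 1, \ldots, 4$ and $n_1 = n_2 = 25, 50, 100$. These results were compared and were found to be in close agreement.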
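To make the difference between the optimum and the conditional actual error rate concrete, the following sketch evaluates both in the univariate case ($r = 1$). It assumes the standard forms $\Phi(-\Delta/2)$ for the optimum rate and $\Pr(W(X) \le 0)$ for the conditional rate with $X$ drawn from the first population; the function names and the numerical values are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def optimum_error(mu1, mu2, sigma):
    """Optimum error rate Phi(-Delta/2) with Delta = |mu1 - mu2| / sigma."""
    delta = abs(mu1 - mu2) / sigma
    return norm_cdf(-delta / 2.0)


def conditional_error(mu1, sigma, xbar1, xbar2, s2):
    """Pr(W(X) <= 0) for X from the first population, univariate case.

    W(x) = (x - (xbar1 + xbar2) / 2) * (xbar1 - xbar2) / s2.
    Requires xbar1 != xbar2 so that W is not identically zero.
    """
    slope = (xbar1 - xbar2) / s2                # S^{-1} (xbar1 - xbar2)
    mean_w = (mu1 - 0.5 * (xbar1 + xbar2)) * slope
    sd_w = abs(slope) * sigma                   # sqrt(slope * sigma^2 * slope)
    return norm_cdf(-mean_w / sd_w)


# Hypothetical populations N(0, 1) and N(2, 1).
print(optimum_error(0.0, 2.0, 1.0))
# Sample estimates equal to the true parameters reproduce the optimum rate.
print(conditional_error(0.0, 1.0, 0.0, 2.0, 1.0))
# Sample means that miss the true means change the conditional rate.
print(conditional_error(0.0, 1.0, 0.5, 2.0, 1.0))
```

In one dimension the sample variance `s2` cancels out of the conditional rate, which is why only the position of the midpoint between the two sample means matters here.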

Criteria for Comparing Error Rate Estimators
Let $\hat{\alpha}$ represent an arbitrary estimate of the conditional actual error rate, $\alpha(\hat{\zeta})$, based on the training samples. The most reasonable criterion for comparing estimators is felt to be
$$\mathrm{UMSE} = E\left\{[\hat{\alpha} - \alpha(\hat{\zeta})]^2\right\},$$
called the unconditional mean square error (UMSE) by [15]. Two other possible criteria are the conditional mean square error,
$$E\left\{[\hat{\alpha} - \alpha(\hat{\zeta})]^2 \mid \hat{\theta}\right\},$$
and the mean absolute error,
$$E\left|\hat{\alpha} - \alpha(\hat{\zeta})\right|.$$
The results obtained using the criterion of conditional mean square error are functions of $\hat{\theta}$; this criterion could be used if it were desirable to have the choice of the error rate estimator depend on the training sample. However, the goal of this study is to compare estimators chosen independently of the training samples. Therefore, UMSE, which is the expected value of the conditional mean square error over the distribution of $\hat{\theta}$, is the preferred criterion. The mean absolute error is also felt to be a reasonable criterion, but it is not considered further because it is not as sensitive to the variability of the error as the unconditional mean square error.
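In a Monte Carlo study these criteria are approximated by averages over replications. The sketch below assumes that each replication yields one estimated error rate and the corresponding true conditional error rate; the function names and the three illustrative pairs are hypothetical, not results from this study.

```python
def mean_square_error(estimates, true_errors):
    """Monte Carlo approximation of the unconditional mean square error."""
    pairs = list(zip(estimates, true_errors))
    return sum((e - t) ** 2 for e, t in pairs) / len(pairs)


def mean_absolute_error(estimates, true_errors):
    """Monte Carlo approximation of the mean absolute error."""
    pairs = list(zip(estimates, true_errors))
    return sum(abs(e - t) for e, t in pairs) / len(pairs)


# Hypothetical (estimate, true error) pairs from three replications.
est = [0.10, 0.20, 0.15]
true = [0.12, 0.18, 0.15]
print(mean_square_error(est, true))
print(mean_absolute_error(est, true))
```

Squaring the deviations weights large misses more heavily than the absolute error does, which is the sensitivity to variability referred to above.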

Error Rate Estimators
In this paper, we consider eight error rate estimators: the plug-in estimator (D method), the resubstitution estimator or apparent error rate (R method), the leave-one-out estimator (U method), the jackknife estimator (JK), the DS, O and OS estimators, and the posterior probability estimator (P).

Plug-in Estimator
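This is the earliest error rate estimator, proposed by [3]. Let
$$\Delta^2 = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2).$$
The plug-in estimate is defined by $\hat{\alpha}_D = \Phi(-D/2)$, where $D$ is the sample counterpart of $\Delta$. The probability of misclassification $P_1$ is given by
$$P_1 = \Pr\left(T \le -\frac{\left[\mu_1 - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\right]^T S^{-1}(\bar{x}_1 - \bar{x}_2)}{\sqrt{(\bar{x}_1 - \bar{x}_2)^T S^{-1} \Sigma S^{-1} (\bar{x}_1 - \bar{x}_2)}}\right),$$
where $T$ is a standard normal deviate. If we replace $\mu_1$ and $\Sigma$ by $\bar{x}_1$ and $S$, we have that, for normally distributed variables, the estimate of $P_1$ is
$$\hat{P}_1 = \Phi(-D/2).$$
Also, if we replace $\mu_2$ and $\Sigma$ by $\bar{x}_2$ and $S$ in the case of $P_2$, then the estimate of $P_2$ is
$$\hat{P}_2 = \Phi(-D/2),$$
where $D^2 = (\bar{x}_1 - \bar{x}_2)^T S^{-1} (\bar{x}_1 - \bar{x}_2)$ is the sample Mahalanobis distance. These estimates are good if the degrees of freedom are large, since $D^2$ is consistent for $\Delta^2$. If the degrees of freedom are not large, the estimate may be badly biased and give much too favourable an impression of the probability of error. Another way to derive this estimate is that, since $P_1 = \Phi(-\Delta/2)$ when the parameters are known, estimating the parameters $\mu_1$, $\mu_2$ and $\Sigma$ by $\bar{x}_1$, $\bar{x}_2$ and $S$ should give reasonable results.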
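The plug-in estimate can be sketched as follows, assuming the pooled covariance matrix with divisor $n_1 + n_2 - 2$ is used for $S$ and that $S$ is non-singular (with binary data it can be singular, in which case the linear solve below fails). All function names and the two small samples are hypothetical, and the standard library is used throughout so that the sketch is self-contained.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_covariance(sample1, sample2):
    """Pooled within-group covariance with divisor n1 + n2 - 2."""
    r = len(sample1[0])
    s = [[0.0] * r for _ in range(r)]
    for rows in (sample1, sample2):
        m = mean_vector(rows)
        for x in rows:
            d = [xj - mj for xj, mj in zip(x, m)]
            for a in range(r):
                for b in range(r):
                    s[a][b] += d[a] * d[b]
    dof = len(sample1) + len(sample2) - 2
    return [[v / dof for v in row] for row in s]


def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            f = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= f * aug[col][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def mahalanobis_sq(sample1, sample2):
    """Sample squared Mahalanobis distance D^2."""
    diff = [a - b for a, b in zip(mean_vector(sample1), mean_vector(sample2))]
    z = solve(pooled_covariance(sample1, sample2), diff)
    return sum(di * zi for di, zi in zip(diff, z))


def plug_in_error(sample1, sample2):
    """Plug-in (D method) estimate Phi(-D/2)."""
    return norm_cdf(-math.sqrt(mahalanobis_sq(sample1, sample2)) / 2.0)


# Hypothetical two-variable training samples.
s1 = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
s2 = [[3.0, 3.0], [5.0, 3.0], [3.0, 5.0], [5.0, 5.0]]
print(mahalanobis_sq(s1, s2), plug_in_error(s1, s2))
```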

Resubstitution Estimator
The other commonly used error rate estimator is called the resubstitution estimator, apparent error rate or R method. This is the proportion of the observations in the training sample from $\pi_i$ which are misclassified by the discriminant rule. In this method, the sample used to compute the discriminant function is reused to estimate the error rate. This means that if samples of sizes $n_1$ and $n_2$ are drawn from populations $\pi_1$ and $\pi_2$ respectively, then the same $n_1$ and $n_2$ observations are used to compute the discriminant function. If the numbers of misclassifications in $\pi_1$ and $\pi_2$ are $m_1$ and $m_2$, then the estimates of the error rates $P_1$ and $P_2$ are $\hat{P}_1 = m_1/n_1$ and $\hat{P}_2 = m_2/n_2$.
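The resubstitution estimate amounts to counting misclassified training observations in each group. In the sketch below the rule is passed in as a function; in practice it would have been constructed from the same two samples. The threshold rule and the samples shown are hypothetical.

```python
def resubstitution_error(rule, sample, true_label):
    """Proportion of a training sample that the rule misclassifies."""
    misclassified = sum(1 for x in sample if rule(x) != true_label)
    return misclassified / len(sample)


def rule(x):
    """Hypothetical rule: allocate to group 1 when at least two variables equal 1."""
    return 1 if sum(x) >= 2 else 2


sample1 = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1]]  # training sample, group 1
sample2 = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]  # training sample, group 2

p1_hat = resubstitution_error(rule, sample1, 1)  # m1 / n1
p2_hat = resubstitution_error(rule, sample2, 2)  # m2 / n2
print(p1_hat, p2_hat)
```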

Leave-One-Out Estimator
In the leave-one-out estimator or procedure, all but one observation are used to compute the classification rule, and this rule is then used to classify the omitted observation. We repeat this procedure for each observation, so that in a sample of size $N = \sum_i n_i$, each observation is classified by a function based on the other $N - 1$ observations. Leave-one-out is a special case of $g$-fold cross-validation. When $g = 2$, that is, two-fold cross-validation, this is the rotation method. When $g = n$, it is the $n$-fold cross-validation error estimator, R(cv), attributed to [6], where in the case of two populations $R(cv) = \sum_i \sum_j n_{ij}/n_i$. This method is also known as the "leave-one-out" or U estimate. Studies undertaken by numerous authors, including [2], have shown that $n$-fold cross-validation has large variance. Thus, although R(cv) may be an unbiased estimate, the confidence with which the user can expect R(cv) for his or her sample to approach R(T) is not great.
The main advantage of this method is felt to be that it obtains an unbiased estimate of the expected actual error rate for a discrimination problem with training samples of sizes $n_1 - 1$ and $n_2$ [6]. However, this does not mean that the leave-one-out estimator has small bias with respect to the conditional actual error rate, which is the error rate of interest here. One disadvantage of this estimator is that it requires more computation than the resubstitution estimator. However, ways have been found to reduce this problem. Another disadvantage of the leave-one-out estimate is its large variance. The main consideration of most investigators when comparing estimators has been the bias, but the variance is also an important factor.
[4] performed a sampling experiment in order to demonstrate the importance of the variance. In the univariate normal case, he found that the bias with respect to $E[\alpha(\hat{\zeta})]$ is very small for the leave-one-out estimator, larger for the plug-in estimator and largest for the resubstitution estimator, as expected. However, he also compared the variances of the estimators and found that the leave-one-out estimator had a much larger variance than the resubstitution estimator, which in turn had a larger variance than the plug-in estimator. Unfortunately, Glick did not consider the mean square error and hence left unanswered whether the resubstitution estimator performs better than the leave-one-out estimator.
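The leave-one-out procedure can be sketched generically: refit the rule with one observation removed, classify that observation, and repeat. The sketch below returns the overall proportion misclassified across both groups rather than the per-group rates. The one-dimensional nearest-mean rule used to illustrate it is a hypothetical stand-in for the discriminant rule, as are the data.

```python
def leave_one_out_error(fit, sample1, sample2):
    """Overall leave-one-out error; fit(s1, s2) returns a rule x -> 1 or 2."""
    errors = 0
    for i in range(len(sample1)):
        rule = fit(sample1[:i] + sample1[i + 1:], sample2)
        errors += rule(sample1[i]) != 1
    for i in range(len(sample2)):
        rule = fit(sample1, sample2[:i] + sample2[i + 1:])
        errors += rule(sample2[i]) != 2
    return errors / (len(sample1) + len(sample2))


def fit_nearest_mean(sample1, sample2):
    """Hypothetical 1-D rule: allocate x to the group with the closer mean."""
    m1 = sum(sample1) / len(sample1)
    m2 = sum(sample2) / len(sample2)
    return lambda x: 1 if abs(x - m1) <= abs(x - m2) else 2


group1 = [1.0, 2.0, 3.0, 6.0]
group2 = [5.0, 7.0, 8.0, 9.0]
loo = leave_one_out_error(fit_nearest_mean, group1, group2)
print(loo)
```

The rule is refitted once per observation, which is the extra computation relative to resubstitution mentioned above.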

Jackknife Error Rate Estimator
This method is due to [13]. It involves omitting each observation in turn from the learning sample and obtaining the apparent error rate for the learning sample with the $j$th observation omitted, $R_j^{*}(A)$; the jackknife estimate is then formed from these values.

The DS Method Estimator
The DS method is based on the plug-in estimator, which assumes multivariate normality, and contains a bias correction. When $\Sigma$ is unknown, $D^2$ is a biased estimator of $\Delta^2$. [7] described a consistent estimator of $\Delta^2$ which has less bias than $D^2$. This estimator of $\Delta^2$ is
$$DS^2 = \frac{n_1 + n_2 - r - 3}{n_1 + n_2 - 2}\, D^2,$$
and hence the estimator of $\alpha(\hat{\zeta})$, called the DS method, is
$$\hat{\alpha}_{DS} = \Phi(-DS/2).$$
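A short sketch of the DS correction is given below. It assumes the commonly cited form in which the sample squared distance is shrunk by the factor $(n_1 + n_2 - r - 3)/(n_1 + n_2 - 2)$ before being plugged into $\Phi(-\cdot/2)$; that form, the function names and the numerical inputs are assumptions made for illustration.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ds_error(d2, n1, n2, r):
    """DS estimate from the sample squared distance d2 (assumed shrinkage form)."""
    shrink = (n1 + n2 - r - 3) / (n1 + n2 - 2)  # needs n1 + n2 > r + 3
    ds = math.sqrt(shrink * d2)
    return norm_cdf(-ds / 2.0)


# Hypothetical inputs: D^2 = 4 from two samples of 10 on r = 3 variables.
print(norm_cdf(-math.sqrt(4.0) / 2.0))  # plug-in (D method) estimate
print(ds_error(4.0, 10, 10, 3))         # DS estimate, less optimistic
```

Because the shrinkage factor is below one, the DS estimate is never smaller than the plug-in estimate, and the two agree as the sample sizes grow.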

The O and OS Estimators
The distribution of Anderson's W statistic is very complicated and is not known exactly. [11] provided an asymptotic expansion for $\Pr[W(x) < \tfrac{1}{2}\Delta^2 + a\Delta]$, where $a$ is a real constant. Since $\alpha(\hat{\zeta}) = \Pr[W(x) < 0]$, one could substitute an estimate of $\Delta$ into Okamoto's expansion in order to estimate $\alpha(\hat{\zeta})$. [7] suggested two such estimators: the O method is obtained by replacing $\Delta$ with $D$, and the OS method is obtained by replacing $D$ with $DS$. These estimators were explicitly obtained in the univariate case with $\delta$ known by [15].

Posterior Probability Estimator
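This estimator was described by [10]. Assuming equal prior probabilities, if $\theta$ is known and the discriminant rule is $\zeta$, the posterior probability of misclassification for an observation $x$ is
$$\min\{P(\pi_1 \mid x, \theta),\; P(\pi_2 \mid x, \theta)\}.$$
When $\theta$ is estimated, the posterior probability of misclassification by the rule $\hat{\zeta}$, given $x_i$, is estimated by
$$\min\{P(\pi_1 \mid x_i, \hat{\theta}),\; P(\pi_2 \mid x_i, \hat{\theta})\}.$$
This function is evaluated for each of the $x_i$ and the mean is the estimator of $\alpha(\hat{\zeta})$.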
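The averaging step can be sketched as follows. The sketch takes the posterior probability of misclassification at a point to be the smaller of the two estimated posterior probabilities, which is what a rule that allocates to the larger posterior implies. The independent Bernoulli model used to supply the posterior, and its parameters, are hypothetical.

```python
def posterior_probability_estimate(posterior1, sample):
    """Mean over the sample of min(P(pi_1 | x), P(pi_2 | x))."""
    risks = [min(p, 1.0 - p) for p in (posterior1(x) for x in sample)]
    return sum(risks) / len(risks)


def make_posterior(q1, q2):
    """Posterior P(pi_1 | x) under equal priors and independent Bernoulli variables."""
    def likelihood(x, q):
        out = 1.0
        for xj, qj in zip(x, q):
            out *= qj if xj == 1 else 1.0 - qj
        return out

    def posterior1(x):
        l1, l2 = likelihood(x, q1), likelihood(x, q2)
        return l1 / (l1 + l2)

    return posterior1


# Hypothetical estimated parameters and training observations.
posterior1 = make_posterior([0.8, 0.7], [0.3, 0.4])
observations = [[1, 1], [0, 0], [1, 0]]
p_hat = posterior_probability_estimate(posterior1, observations)
print(p_hat)
```

Unlike the counting estimators, this estimate does not use the group labels of the observations, only the fitted posterior probabilities.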

The Simulation Experiments and Results
In this comparative study, some existing estimators are compared using Monte Carlo Simulations. The usefulness of a Monte Carlo assessment is that the population parameters and the true distribution from which the training data are obtained are known. Thus, the true error rates can always be computed. Hence, the estimated error rates can be compared with the true error rate for choosing the best estimator.
The eight estimation procedures are evaluated at each of the 118 configurations of n, r and d. These configurations are all possible combinations of n = 40, 60, 80, 100, 200, 300, 400, 600, 700, 800, 900, 1000, r = 3, 4, 5 and d = 0.1, 0.2, 0.3 and 0.4. A simulation experiment which generates the data and evaluates the procedures is now described.
(i). A training data set of size $n$ is generated via an R program, where $n_1 = n/2$ observations are sampled from $\pi_1$, which has a multivariate Bernoulli distribution with input parameters $p_{1j}$, and $n_2 = n/2$ observations are sampled from $\pi_2$, which is multivariate Bernoulli with input parameters $p_{2j}$, $j = 1, \ldots, r$. These samples are used to construct the various estimators.
(ii). The likelihood ratios are used to define the classification rule. The estimators of the error rates are determined for each of the methods.
(iii). Steps (i) and (ii) are repeated 1000 times and the mean error rates and variances for the 1000 trials are recorded.
The following table contains a display of one of the results obtained. Tables 1 and 2 present the mean error rates and the sums of squared error rates for the estimators under different parameter values. The mean error rates increase with increasing sample size, and the sum of squared errors decreases with increasing sample size. From the analysis, DS is ranked first, followed by O, OS, U, R, JK and P, with D last.
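Steps (i) to (iii) can be sketched in Python as below (the study itself used an R program). This is a simplified illustration under stated assumptions: the binary variables are generated independently within each population, the marginal probabilities are estimated with add-one smoothing to avoid zero likelihoods, only the resubstitution estimator is evaluated, and the true conditional error is obtained by enumerating all $2^r$ response patterns. The function name and parameter values are hypothetical.

```python
import itertools
import random


def simulate(n, p1, p2, trials=1000, seed=0):
    """Return (mean resubstitution estimate, sum of squared errors) over trials."""
    rng = random.Random(seed)
    patterns = list(itertools.product([0, 1], repeat=len(p1)))

    def draw(p, size):
        # Independent Bernoulli draws: an assumption, not the correlated design.
        return [tuple(int(rng.random() < pj) for pj in p) for _ in range(size)]

    def prob(x, p):
        out = 1.0
        for xj, pj in zip(x, p):
            out *= pj if xj else 1.0 - pj
        return out

    def fit(sample):
        # Add-one smoothing keeps every estimated probability inside (0, 1).
        return [(sum(col) + 1.0) / (len(sample) + 2.0) for col in zip(*sample)]

    estimates, true_errors = [], []
    for _ in range(trials):
        s1, s2 = draw(p1, n // 2), draw(p2, n // 2)   # step (i)
        q1, q2 = fit(s1), fit(s2)

        def rule(x):                                   # step (ii): likelihood ratio
            return 1 if prob(x, q1) >= prob(x, q2) else 2

        wrong = sum(rule(x) != 1 for x in s1) + sum(rule(x) != 2 for x in s2)
        estimates.append(wrong / (len(s1) + len(s2)))
        # True conditional error of this rule under equal prior probabilities.
        e1 = sum(prob(x, p1) for x in patterns if rule(x) != 1)
        e2 = sum(prob(x, p2) for x in patterns if rule(x) != 2)
        true_errors.append(0.5 * (e1 + e2))

    mean_estimate = sum(estimates) / trials            # step (iii)
    sse = sum((e - t) ** 2 for e, t in zip(estimates, true_errors))
    return mean_estimate, sse


# Hypothetical configuration: n = 40 and r = 3 binary variables.
print(simulate(40, [0.8, 0.7, 0.6], [0.3, 0.4, 0.2], trials=200, seed=1))
```

Other estimators could be compared by computing them inside the same loop against the same true conditional error.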

Conclusion
We obtained two major results from this study. Firstly, using the simulation experiments, we ranked the estimators as follows: DS, O, OS, U, R, JK, P and D. The best method was the DS estimator. Secondly, we concluded that it is better to increase the number of variables, because accuracy increases with an increasing number of variables. Also, the general trend for the estimators was an increase in error rate as the sample size decreases, while decreasing the distance between the populations generally increases the error rate. The DS estimator was the most consistent, and thus the most reliable, over all combinations of probability pattern and sample size.