Discriminant Analysis Procedures Under Non-optimal Conditions for Binary Variables

The performance of four discriminant analysis procedures for the classification of observations from unknown populations was examined by Monte Carlo methods. The procedures examined were Fisher's linear discriminant function, the quadratic discriminant function, a polynomial discriminant function, and the A-B linear procedure, which is designed for use in situations where the covariance matrices are unequal. Each procedure was observed under conditions of equal and unequal sample sizes, equal and unequal covariance matrices, and samples drawn from populations with and without a multivariate normal distribution. When the population covariance matrices were equal, or not greatly different, the quadratic discriminant function performed similarly to, or only marginally differently from, the linear procedures. In all cases the polynomial discriminant function performed poorest, and the linear discriminant function performed much better than the other procedures. All of the procedures were strongly affected by non-normality and tended to make many more errors in classifying one group than the other, suggesting that data be standardized when non-normality is suspected.


Introduction
Many practical problems can be reduced to the assignment of objects to different classes. For example, in medical diagnosis it is a question of recognizing the pathology of a given patient: the objects correspond to the patients and the classes to the various pathologies. In finance, a bank wants to know whether a customer applying for a loan will be a good or a bad customer, based on several variables such as age, profession, past fidelity and the credit requested. A review of these applications appears in [16]. In assignment problems in biomedical research, one or more of these techniques is often used. The assumptions underlying these techniques are not always evident to the user, nor are the consequences of their violation. The assumptions include multivariate normality, common covariance matrices and correct assignment of the initial groups [17], [18] and [19]. While a good deal is known about the two-group situation, the robustness of these procedures under non-optimal conditions for binary variables is essentially unknown. The purpose of this paper is to compare and delineate these problems systematically and to suggest useful areas of research.
The problem of classifying an individual into one of two groups (called populations) arises in many areas, typically in anthropology, education, psychology, medical diagnosis, biology and engineering. An anthropometrician may wish to assign ancient human remains to one of two racial groups, or to one of two time periods, by measuring certain skull characters [2]. A plant breeder discriminates a desired from an undesirable species by observing some heritable characters [14]. A company hires or rejects an applicant frequently on the basis of certain measurements; similarly, a college accepts or denies a prospective student usually on the basis of entrance examination scores, and in a hospital a patient may be diagnosed and classified into a potential disease group by a battery of tests. It is usually assumed that there are two populations, say $\pi_1$ and $\pi_2$, and that the individual to be classified comes from either $\pi_1$ or $\pi_2$; furthermore, it is assumed that from previous experiments or records we have in our possession the characteristic measurements of $n_1$ individuals known to belong to $\pi_1$ and of $n_2$ individuals known to belong to $\pi_2$. Based on the data obtained from these $n_1 + n_2$ individuals and the corresponding characteristic measurements of a new individual, we would like to classify the new individual into either $\pi_1$ or $\pi_2$ by some criterion. The case of more than two populations will not be considered in this paper. In this inferential setting the researcher can commit one of the following errors: an object from $\pi_1$ may be misclassified into $\pi_2$, and likewise an object from $\pi_2$ may be misclassified into $\pi_1$. If misclassification occurs, a loss is suffered. Let $C(i \mid j)$ be the cost of misclassifying into $\pi_i$ an object that is from $\pi_j$. For the two-population setting, $C(2 \mid 1)$ is the cost of misclassifying an object into $\pi_2$ given that it is from $\pi_1$, and $C(1 \mid 2)$ is the cost of misclassifying an object into $\pi_1$ given that it is from $\pi_2$. The relative magnitude of the loss $L(j, i) = C(i \mid j)$ depends on the case in question: for example, failure to detect an early cancer in a patient is costlier than stating that a patient has cancer and discovering otherwise.

Fisher's Linear Discriminant Function (FLDF Rule)
The linear discriminant function for discrete variables is given by
$$L(x) = \sum_{k=1}^{r} \sum_{j=1}^{r} s^{kj} (\bar p_{1j} - \bar p_{2j})\, x_k, \qquad (1)$$
where the $s^{kj}$ are the elements of the inverse of the pooled sample covariance matrix and $\bar p_{1j}$ and $\bar p_{2j}$ are the elements of the sample mean vectors in $\pi_1$ and $\pi_2$ respectively. The classification rule obtained from these estimates is: classify an item with response pattern $x$ into $\pi_1$ if
$$L(x) \ge \frac{1}{2} \sum_{k=1}^{r} \sum_{j=1}^{r} s^{kj} (\bar p_{1j} - \bar p_{2j})(\bar p_{1k} + \bar p_{2k}), \qquad (2)$$
and into $\pi_2$ otherwise.
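As an illustration, the estimation of this rule from two binary training samples can be sketched as follows (a minimal sketch; the Bernoulli parameters, function and variable names are illustrative, not from the paper):

```python
import numpy as np

def fisher_ldf(X1, X2):
    """Fit Fisher's linear discriminant from two training samples.

    X1, X2: (n_i, r) arrays of 0/1 responses from pi_1 and pi_2.
    Returns the coefficient vector b and cutoff c of the rule
    'classify into pi_1 if b @ x >= c'.
    """
    n1, n2 = len(X1), len(X2)
    p1, p2 = X1.mean(axis=0), X2.mean(axis=0)  # sample mean vectors (proportions)
    # pooled sample covariance matrix
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    b = np.linalg.solve(S, p1 - p2)            # S^{-1}(p1 - p2)
    c = 0.5 * b @ (p1 + p2)                    # midpoint cutoff of Equation (2)
    return b, c

rng = np.random.default_rng(0)
X1 = (rng.random((100, 3)) < [0.7, 0.6, 0.8]).astype(float)
X2 = (rng.random((100, 3)) < [0.3, 0.4, 0.2]).astype(float)
b, c = fisher_ldf(X1, X2)
assign_to_pi1 = X1 @ b >= c  # classify the pi_1 training sample with the fitted rule
```

With well-separated Bernoulli parameters, most of the $\pi_1$ training items fall on the $\pi_1$ side of the cutoff.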

The Quadratic Discriminant Function
When an observation vector $x$ is drawn from a MVN distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$, the MVN density function $f_i(x)$ can be expressed as
$$f_i(x) = (2\pi)^{-p/2} |\Sigma_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x-\mu_i)'\Sigma_i^{-1}(x-\mu_i)\right\}. \qquad (3)$$
In the case of two groups, an individual is classified as belonging to population 1 if
$$q_1 f_1(x) \ge q_2 f_2(x), \qquad (4)$$
where $q_1$ and $q_2$ are the proportions of individuals from the two groups in the population [7]. When the two groups have a common covariance matrix $\Sigma$ and mean vectors $\mu_1$ and $\mu_2$, substituting the densities and taking logarithms, the rule is to assign an individual to population 1 if
$$(\mu_1-\mu_2)'\Sigma^{-1}x \ge \tfrac{1}{2}(\mu_1-\mu_2)'\Sigma^{-1}(\mu_1+\mu_2) + \ln(q_2/q_1), \qquad (5)$$
and to group 2 otherwise. The sample analogue of the above equation replaces $\mu_1$, $\mu_2$ and $\Sigma$ by their sample estimates, and the coefficients are seen to be identical to Fisher's result for the LDF.
When the covariance matrices are unequal and cannot be pooled, but the population distributions are multivariate normal, the classification rule has the form: assign an individual to population 1 if
$$-\tfrac{1}{2}\, x'(\Sigma_1^{-1}-\Sigma_2^{-1})x + (\mu_1'\Sigma_1^{-1}-\mu_2'\Sigma_2^{-1})x - k \ge \ln(q_2/q_1), \qquad (6)$$
where
$$k = \tfrac{1}{2}\ln\frac{|\Sigma_1|}{|\Sigma_2|} + \tfrac{1}{2}\left(\mu_1'\Sigma_1^{-1}\mu_1 - \mu_2'\Sigma_2^{-1}\mu_2\right).$$
In this case the discriminant function is quadratic, since the term $x'(\Sigma_1^{-1}-\Sigma_2^{-1})x$ is still present [7]. The sample rule is obtained from the above with $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$ estimated by their respective sample mean vectors and covariance matrices. In each of the conditions of the present study the proportions of the two groups in the population were assumed to be equal to each other, and not proportional to sample size, since the true proportions are not usually known in most areas of psychological research. When the population proportions are equal, the quadratic decision rule is then to classify an individual into population 1 if the left-hand side of the rule above is non-negative.
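A minimal numerical sketch of this quadratic rule with equal priors (the parameter values below are illustrative, not from the study):

```python
import numpy as np

def qdf_score(x, m1, S1, m2, S2, q1=0.5, q2=0.5):
    """Quadratic discriminant score: positive => classify x into population 1.

    Computes log[q1 f1(x)] - log[q2 f2(x)] for multivariate normal densities
    with unequal covariance matrices S1, S2; constants common to both
    log-densities cancel, so only the determinant and quadratic terms matter.
    """
    def log_gauss(x, m, S):
        d = x - m
        _, logdet = np.linalg.slogdet(S)
        return -0.5 * (logdet + d @ np.linalg.solve(S, d))
    return np.log(q1) + log_gauss(x, m1, S1) - np.log(q2) - log_gauss(x, m2, S2)

m1, m2 = np.array([1.0, 1.0]), np.array([-1.0, -1.0])
S1 = np.array([[1.0, 0.3], [0.3, 1.0]])
S2 = np.array([[2.0, 0.0], [0.0, 2.0]])
score = qdf_score(np.array([1.0, 1.0]), m1, S1, m2, S2)  # point at the mean of group 1
```

A point lying at the mean of population 1 yields a positive score and is assigned to population 1, as Equation (6) requires.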

The A-B Discriminant Function
This is a linear procedure, due to [1], for the case of unequal covariance matrices: an observation $x$ is classified into $\pi_1$ if $b'x \ge c$, where the vector $b$ and the scalar $c$ are suitably determined. With this procedure, the misclassification probabilities are
$$P(2 \mid 1) = \Phi(-y_1), \qquad P(1 \mid 2) = \Phi(-y_2),$$
where $\Phi$ is the cumulative distribution function of a standard normal variable. The $y_1$ and $y_2$ are determined by
$$y_1 = \frac{b'\mu_1 - c}{(b'\Sigma_1 b)^{1/2}}, \qquad y_2 = \frac{c - b'\mu_2}{(b'\Sigma_2 b)^{1/2}},$$
where $\mu_1$ and $\mu_2$ are the means of population 1 and population 2. Now $y_1$ can be expressed as
$$y_1 = \frac{b'(\mu_1-\mu_2) - y_2\,(b'\Sigma_2 b)^{1/2}}{(b'\Sigma_1 b)^{1/2}}.$$
The $b$ is then chosen which maximizes $y_1$ for a given $y_2$. By differentiating $y_1$ with respect to $b$, it can be shown that the solution consists of solving the following equation in $b$ and a scalar $t$:
$$b = \left[t\,\Sigma_1 + (1-t)\,\Sigma_2\right]^{-1}(\mu_1 - \mu_2). \qquad (13)$$
The solution is obtained by a trial-and-error procedure, and $c$ is then obtained from
$$c = b'\mu_1 - y_1\,(b'\Sigma_1 b)^{1/2}. \qquad (14)$$
[1] also considered an alternative method for the case where the two misclassification probabilities are equal, i.e. $y_1 = y_2$; in this case $b$ and $t$ are found from the same equations together with the condition $y_1 = y_2$. The determination of the value of $t$ was accomplished by using the result due to [3], in which $\Sigma_1$ and $\Sigma_2$ were simultaneously reduced, so that the admissible values are governed by the characteristic roots $\lambda$ of $|\Sigma_1 - \lambda\Sigma_2| = 0$: the optimum must lie between the minimum and maximum roots of this characteristic equation.
In the present study the optimal value of $t$ was approximated by evaluating $t$ at the minimum and maximum characteristic roots and computing the vector $b$ in each case from Equation (13). The value of $c$ was then calculated from Equation (14) and the observations were classified. In this manner the interval $[\lambda_{\min}, \lambda_{\max}]$ was successively bisected, and for each value of $t$ the proportion of correct classifications was calculated. The interval was bisected a maximum of five times, or until classification did not improve. The resultant discriminant function was then applied to the cross-validation sample, and the proportion of correct classifications was calculated.
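The search over $t$ for the equal-misclassification-probability variant can be sketched as follows; a coarse grid search stands in for the root-finding/bisection scheme described above, and all parameter values and names are illustrative:

```python
import numpy as np

def ab_linear_rule(m1, S1, m2, S2, n_grid=50):
    """Sketch of the A-B linear rule with equal misclassification probabilities.

    For each t in (0,1), b(t) = [t*S1 + (1-t)*S2]^{-1} (m1 - m2).  Setting
    y1 = y2 = y in Equation (14)'s relations gives
        y = b'(m1 - m2) / (sqrt(b'S1 b) + sqrt(b'S2 b)),
    and the cutoff c that equalizes the two error probabilities.  The grid
    search keeps the t with the largest common y.
    """
    best = (-np.inf, None, None)
    for t in np.linspace(0.01, 0.99, n_grid):
        b = np.linalg.solve(t * S1 + (1 - t) * S2, m1 - m2)
        s1, s2 = np.sqrt(b @ S1 @ b), np.sqrt(b @ S2 @ b)
        y = b @ (m1 - m2) / (s1 + s2)          # common value y1 = y2
        if y > best[0]:
            c = (s2 * (b @ m1) + s1 * (b @ m2)) / (s1 + s2)
            best = (y, b, c)
    return best  # (y, b, c): classify into pi_1 if b @ x >= c

y, b, c = ab_linear_rule(np.array([1.0, 0.0]), np.eye(2),
                         np.array([-1.0, 0.0]), 2 * np.eye(2))
```

In this two-dimensional example the separation lies along the first axis, so the common value $y$ reduces analytically to $2/(1+\sqrt{2})$, which the grid search recovers.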

The Polynomial Discriminant Function
In this case the discriminant function was constructed by estimating the probability density function of each sample directly from the observed data, as described in [15]. This was accomplished by expanding the estimate $\hat p_i(x)$ of $p_i(x)$, the probability density function of the $i$th population, in a series of orthogonal functions. Tou and Gonzalez show that if the estimate of the probability density function is required to minimize a mean-square error function defined as
$$R = \int w(x)\left[\hat p(x) - p(x)\right]^2 dx,$$
where $w(x)$ is a weighting function, then $\hat p(x)$ may be expanded in the series
$$\hat p(x) = \sum_j c_j\, \varphi_j(x),$$
where the $c_j$ are coefficients to be determined and the $\varphi_j(x)$ are orthogonal basis functions built from Hermite polynomials. The first few Hermite polynomials are
$$H_0(x) = 1, \quad H_1(x) = 2x, \quad H_2(x) = 4x^2 - 2, \quad H_3(x) = 8x^3 - 12x, \quad H_4(x) = 16x^4 - 48x^2 + 12.$$
Substituting the expansion of $\hat p(x)$ into the mean-square error function and minimizing $R$ with respect to the coefficients yields the $c_j$, and the resultant density estimate is obtained from the series above. By using Bayes' formula, with $P(\omega_i)$ the prior probability of the $i$th population, the discriminant functions for this problem are given by $d_i(x) = \hat p_i(x)\,P(\omega_i)$, and the decision boundary is $d_1(x) - d_2(x) = 0$. In the present study a two-dimensional set of orthogonal functions was obtained by forming pairwise products of the one-dimensional functions; six terms were used to approximate the density function. The set of orthogonal functions for the six-variable case was constructed in the same manner as for the bivariate case, by forming products of one-dimensional Hermite polynomials; in order for the estimates of the density functions to be polynomials of degree two in all the variables, 28 terms were constructed. The vector of coefficients $c$ was then computed for each sample from Equation (25), and the polynomial estimates of the density functions were constructed as in Equation (26). The two estimates of the density functions were then subtracted to form the polynomial discriminant function, which was applied to the observations in each of the original and cross-validation samples. Finally, the proportion of correct classifications was calculated.
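A one-dimensional sketch of this kind of Hermite-series density estimate follows (the sampling distribution, normalization and term count are illustrative assumptions; the paper's bivariate and six-variable constructions extend this by forming products of one-dimensional terms):

```python
import math
import numpy as np
from numpy.polynomial.hermite import hermval  # physicists' Hermite H_j

def hermite_density_estimate(sample, n_terms=5):
    """Hermite-series density estimate in one dimension.

    p_hat(x) = sum_j c_j H_j(x) exp(-x^2), with the coefficients estimated
    from the sample as
        c_j = mean of H_j over the sample / (2^j j! sqrt(pi)),
    which is the least-squares solution for this orthogonal system.
    """
    c = np.zeros(n_terms)
    for j in range(n_terms):
        coef = np.zeros(j + 1)
        coef[j] = 1.0                            # select H_j
        c[j] = hermval(sample, coef).mean() / (2**j * math.factorial(j) * math.sqrt(math.pi))

    def p_hat(x):
        return sum(c[j] * hermval(x, np.eye(n_terms)[j])
                   for j in range(n_terms)) * np.exp(-x**2)
    return p_hat

rng = np.random.default_rng(1)
sample = rng.normal(0.0, 1 / math.sqrt(2), 2000)  # true density is exp(-x^2)/sqrt(pi)
p_hat = hermite_density_estimate(sample)
value_at_0 = p_hat(0.0)  # true value is 1/sqrt(pi) ~ 0.564
```

The subtraction of two such estimates, one per population, gives the polynomial discriminant function described above.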

Testing Adequacy of Discriminant Coefficient
Consider the discrimination problem between two multinomial populations. A test of the adequacy of the discriminant coefficients has been proposed by [12]; this test uses the following statistic:

Evaluation of Classification Functions
One important way of judging the performance of any classification procedure is to calculate the error rates, or misclassification probabilities [13]. When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease. Because the parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification functions. Once a classification function is constructed, a measure of its performance in future samples is of interest. The total probability of misclassification (TPM) is given by
$$\mathrm{TPM} = q_1 \sum_{x \in R_2} f_1(x) + q_2 \sum_{x \in R_1} f_2(x).$$
The smallest value of this quantity, attained by a judicious choice of $R_1$ and $R_2$, is called the optimum error rate (OER):
$$\mathrm{OER} = \min_{R_1,\,R_2} \mathrm{TPM}.$$
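When the population probability mass functions are known, the TPM and OER can be computed directly; a small sketch with made-up three-cell pmfs (all values illustrative):

```python
import numpy as np

def total_prob_misclass(f1, f2, q1=0.5, q2=0.5):
    """TPM for known pmfs on a common finite support, using the Bayes regions.

    f1, f2: arrays of cell probabilities for pi_1 and pi_2.  The optimal
    region R1 collects the cells where q1*f1 >= q2*f2; with that choice
    the returned TPM is the optimum error rate (OER).
    """
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    in_R1 = q1 * f1 >= q2 * f2                      # Bayes rule cell by cell
    return q1 * f1[~in_R1].sum() + q2 * f2[in_R1].sum()

oer = total_prob_misclass([0.5, 0.3, 0.2], [0.1, 0.3, 0.6])  # = 0.3
```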

Probability of Misclassification
In constructing a classification procedure, it is desired to minimize, on the average, the bad effects of misclassification [10], [13] and [11]. Suppose we have an item with response pattern $x$ from either $\pi_1$ or $\pi_2$. We think of an item as a point in an $r$-dimensional space. We partition the space $R$ into mutually exclusive regions $R_1$ and $R_2$. If the item falls in $R_1$, we classify it as coming from $\pi_1$; if it falls in $R_2$, we classify it as coming from $\pi_2$. In following a given classification procedure, the researcher can make two kinds of errors: an item actually from $\pi_1$ may be classified as coming from $\pi_2$, and an item from $\pi_2$ may be classified as coming from $\pi_1$. We need to know the relative undesirability of these two kinds of errors. Let the prior probability that an observation comes from $\pi_1$ be $q_1$, and from $\pi_2$ be $q_2$. Let the probability mass function of $\pi_1$ be $f_1(x)$ and that of $\pi_2$ be $f_2(x)$, and let the region of classification into $\pi_1$ be $R_1$. Then the probability of correctly classifying an observation that is actually from $\pi_1$ is
$$P(1 \mid 1) = \sum_{x \in R_1} f_1(x),$$
and similarly the probability of correctly classifying an observation from $\pi_2$ is
$$P(2 \mid 2) = \sum_{x \in R_2} f_2(x).$$
The probability of misclassifying an item from $\pi_1$ is $P(2 \mid 1) = \sum_{x \in R_2} f_1(x)$, and that of misclassifying an item from $\pi_2$ is $P(1 \mid 2) = \sum_{x \in R_1} f_2(x)$. The total probability of misclassification using the rule is
$$\mathrm{TPM} = q_1\, P(2 \mid 1) + q_2\, P(1 \mid 2).$$
In order to determine the performance of a classification rule $R$ in the classification of future items, we compute the total probability of misclassification, known as the error rate. [7] defined the following types of error rates.
i. Optimum error rate: the error rate for the optimum classification rule $R_{opt}$, i.e. the rule, available when the parameters of the distributions are known, which is optimum for those distributions.
ii. Actual error rate: the error rate for the classification rule as it will perform in future samples.
iii. Expected actual error rate: the expected error rate for classification rules based on samples of sizes $n_1$ from $\pi_1$ and $n_2$ from $\pi_2$.
iv. Plug-in estimate of the error rate: the estimate obtained by using the estimated parameters for $\pi_1$ and $\pi_2$.
v. The apparent error rate: this is defined as the fraction of items in the initial samples which is misclassified by the classification rule. If $n_{ij}$ denotes the number of items from $\pi_j$ assigned to $\pi_i$, the resulting table of counts is called the confusion matrix, and the apparent error rate is given by
$$\hat P(mc) = \frac{n_{12} + n_{21}}{n}. \qquad (39)$$
[6] called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. [9] proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by [5], who tabulated it under various combinations of the sample sizes $n_1$ and $n_2$, the number of multinomial cells, and the cell probabilities. Their results demonstrated that the bound described above is generally loose.
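Equation (39) amounts to dividing the off-diagonal counts of the confusion matrix by the total sample size; a minimal sketch with illustrative counts:

```python
import numpy as np

def apparent_error_rate(confusion):
    """Apparent error rate from a 2x2 confusion matrix.

    confusion[i][j] = number of items from population j+1 assigned to
    population i+1, so the off-diagonal entries n12 and n21 count the
    misclassifications.
    """
    confusion = np.asarray(confusion, dtype=float)
    n12, n21 = confusion[0, 1], confusion[1, 0]
    return (n12 + n21) / confusion.sum()

rate = apparent_error_rate([[45, 5], [8, 42]])  # (5 + 8) / 100 = 0.13
```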

The Simulation Experiments and Results
(i) Random samples of sizes $n_1$ and $n_2$ are generated from $\pi_1$ and $\pi_2$, each of which is multivariate Bernoulli with input parameters $p_{ij}$, $i = 1, 2$; $j = 1, \ldots, r$.
These samples are used to construct the classification rule for each procedure, and the probability of misclassification for each procedure is estimated by the plug-in rule, or by the confusion matrix in the sense of the full multinomial.
(ii) The likelihood ratios are used to define classification rules. The plug-in estimates of error rates are determined for each of the classification rules.
(iii) Steps (i) and (ii) are repeated 1000 times, and the mean plug-in error and variance over the 1000 trials are recorded. The method of estimation used here is called the resubstitution method.
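Steps (i)–(iii) can be sketched for the Fisher LDF as follows (illustrative Bernoulli parameters and a reduced number of trials; the study used 1000 trials and all four procedures):

```python
import numpy as np

rng = np.random.default_rng(2024)
p1, p2 = np.array([0.7, 0.6, 0.8]), np.array([0.3, 0.4, 0.2])  # illustrative parameters
n1 = n2 = 50
n_trials = 200   # 1000 in the study; reduced here

rates = np.empty(n_trials)
for t in range(n_trials):
    # (i) draw multivariate Bernoulli samples from pi_1 and pi_2
    X1 = (rng.random((n1, 3)) < p1).astype(float)
    X2 = (rng.random((n2, 3)) < p2).astype(float)
    # (ii) build the Fisher LDF rule from the pooled sample
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    b = np.linalg.solve(S, X1.mean(0) - X2.mean(0))
    c = 0.5 * b @ (X1.mean(0) + X2.mean(0))
    # resubstitution: classify the training data with the fitted rule
    err1 = np.mean(X1 @ b < c)    # pi_1 items sent to pi_2
    err2 = np.mean(X2 @ b >= c)   # pi_2 items sent to pi_1
    rates[t] = 0.5 * (err1 + err2)

# (iii) mean and variance of the plug-in error over all trials
mean_rate, var_rate = rates.mean(), rates.var()
```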
The following tables display the results obtained. Tables 2(a) and (b) present the mean apparent error rates and standard deviations (actual error rates) for the classification rules under different parameter values. The mean apparent error rates increase with increasing sample size, while the actual error rates decrease with increasing sample size. From the analysis, the linear discriminant function is ranked first, followed by the A-B discriminant function and the quadratic discriminant function, with the polynomial discriminant function last.

Discussion and Conclusion
The results in Table 3.1b indicate that, in general, with samples drawn from MVN populations with equal covariance matrices, the Fisher LDF, the A-B procedure, the quadratic discriminant function (QDF) and the polynomial discriminant function (PDF) performed similarly, but as the degree of heterogeneity increased (not shown in the table), the QDF outperformed the other procedures. These results are consistent with those of [8] and [4]: the Fisher LDF performed well relative to the QDF for mild departures from homogeneity of the covariance matrices, but as the degree of heterogeneity increased, the QDF outperformed the Fisher LDF, the A-B procedure and the polynomial discriminant function.
However, we obtained two major results from this study.
Firstly, using the simulation experiments we ranked the procedures as follows: linear discriminant function, A-B discriminant function, quadratic discriminant function, and polynomial discriminant function; the best method was the linear discriminant procedure. Secondly, we concluded that it is better to increase the number of variables, because accuracy increases with the number of variables. Moreover, our study showed that the linear discriminant function is more flexible, in that it allows the analyst to incorporate a priori information in the model. Nevertheless, this does not exclude the use of the other statistical techniques once the required hypotheses are satisfied.