Discriminant Analysis Procedures Under Nonoptimal Conditions for Binary Variables
I. Egbo
Department of Mathematics, Alvan Ikoku University of Education, Owerri, Nigeria
To cite this article:
I. Egbo. Discriminant Analysis Procedures Under Nonoptimal Conditions for Binary Variables. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 6, 2015, pp. 602-609. doi: 10.11648/j.ajtas.20150406.32
Abstract: The performance of four discriminant analysis procedures for the classification of observations from unknown populations was examined by Monte Carlo methods. The procedures examined were the Fisher linear discriminant function, the quadratic discriminant function, a polynomial discriminant function and the AB linear procedure designed for use in situations where covariance matrices are equal. Each procedure was observed under conditions of equal sample sizes and equal covariance matrices, and in conditions where the samples were drawn from populations that have a multivariate normal distribution. When the population covariance matrices were equal, or not greatly different, the quadratic discriminant function performed similarly to the linear procedures. In all cases the polynomial discriminant function performed the poorest, while the linear discriminant function performed much better than the other procedures. All of the procedures were greatly affected by nonnormality and tended to make many more errors in the classification of one group than the other, suggesting that data be standardized when nonnormality is suspected.
Keywords: Apparent Error Rates, Fisher’s Linear Discriminant, Quadratic Discriminant Function, AB Discriminant Function, Polynomial Discriminant Function
1. Introduction
Many practical problems can be reduced to the assignment of objects to different classes. In medical diagnosis, for example, the task is to recognize the pathology of a given patient: the objects correspond to the patients and the classes to the various pathologies. In economics, a bank wants to know whether a customer applying for a loan is a good or a bad customer, based on several variables such as age, profession, past loyalty and the credit requested. A review of the techniques used for such problems appears in [16]. In assignment problems in biomedical research, one or more of these techniques is often used. The assumptions underlying these techniques are not always evident to the user, nor are the consequences of their violation. The assumptions include multivariate normality, common covariance matrices and correct assignment of the initial groups [17], [18] and [19]. While a good deal is known in the two-group situation, the robustness of these procedures under nonoptimal conditions for binary variables is essentially unknown. The purpose of this paper is to compare the procedures, to delineate these problems systematically and to suggest useful areas of research.
The problem of classifying an individual into one of two groups (called populations) arises in many areas, typically in anthropology, education, psychology, medical diagnosis, biology and engineering. An anthropometrician may wish to assign ancient human remains to one of two racial groups or one of two time periods by measuring certain skull characters [2]. A plant breeder discriminates a desired from an undesirable species by observing some heritable characters [14]. A company frequently hires or rejects an applicant based on a certain measurement. Similarly, a college accepts or denies a prospective student, usually based on entrance examination scores. In a hospital, a patient may be diagnosed and classified into a certain potential disease group by a battery of tests. Usually it is assumed that there are two populations, say $\pi_1$ and $\pi_2$, and that the individual to be classified comes from either $\pi_1$ or $\pi_2$; furthermore, it is assumed that from previous experiments or records we have in our possession the characteristic measurements of $n_1$ individuals who were known to belong to $\pi_1$, and of $n_2$ individuals who were known to belong to $\pi_2$. Based on the data obtained from the previous $n_1 + n_2$ individuals and the corresponding characteristic measurements of a new individual, we would like to classify the new individual into either $\pi_1$ or $\pi_2$ by using a certain criterion. The case of more than two populations will not be considered in this paper.
In this inferential setting, the researcher can commit one of two errors: an object from $\pi_1$ may be misclassified into $\pi_2$; likewise, an object from $\pi_2$ may be misclassified into $\pi_1$. If misclassification occurs, a loss is suffered. Let $C(i|j)$ be the cost of misclassifying an object from $\pi_j$ into $\pi_i$. For the two-population setting, $C(2|1)$ is the cost of misclassifying an object into $\pi_2$ given that it is from $\pi_1$, and $C(1|2)$ is the cost of misclassifying an object into $\pi_1$ given that it is from $\pi_2$. The relative magnitude of the two losses depends on the case in question: for example, failure to detect an early cancer in a patient is costlier than stating that a patient has cancer and discovering otherwise.
2. Classification Procedures
2.1. Fisher’s Linear Discriminant Function (FLDF Rule)
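The linear discriminant function for discrete variables is given by
$$L(x) = \sum_{j}\sum_{k} (\bar{x}_{1k} - \bar{x}_{2k})\, s^{kj}\, x_j - \frac{1}{2}\sum_{j}\sum_{k} (\bar{x}_{1k} - \bar{x}_{2k})\, s^{kj}\, (\bar{x}_{1j} + \bar{x}_{2j}) \tag{1}$$
where $s^{kj}$ are the elements of the inverse of the pooled sample covariance matrix and $\bar{x}_{1j}$ and $\bar{x}_{2j}$ are the elements of the sample means in $\pi_1$ and $\pi_2$ respectively. The classification rule obtained using this estimation is: classify an item with response pattern $x$ into $\pi_1$ if
$$L(x) \ge 0 \tag{2}$$
and into $\pi_2$ otherwise.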
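The linear rule above can be illustrated with a short sketch. It assumes the usual sample form of Fisher's rule (coefficients obtained by solving the pooled covariance system for the difference of the sample means, with the cutoff at the midpoint of the two mean scores); the function names and the toy binary samples are hypothetical and are not taken from the study.

```python
def mean_vector(rows):
    """Column means of a list of equal-length observations."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, mean):
    """Sum of squares and cross-products about the mean."""
    r = len(mean)
    s = [[0.0] * r for _ in range(r)]
    for x in rows:
        d = [xi - mi for xi, mi in zip(x, mean)]
        for j in range(r):
            for k in range(r):
                s[j][k] += d[j] * d[k]
    return s

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda k: abs(m[k][i]))
        m[i], m[p] = m[p], m[i]
        for k in range(i + 1, n):
            f = m[k][i] / m[i][i]
            for j in range(i, n + 1):
                m[k][j] -= f * m[i][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z

def fit_ldf(sample1, sample2):
    """Return (coefficients, constant) of the sample linear discriminant function."""
    m1, m2 = mean_vector(sample1), mean_vector(sample2)
    s1, s2 = scatter(sample1, m1), scatter(sample2, m2)
    dof = len(sample1) + len(sample2) - 2
    pooled = [[(a + b) / dof for a, b in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    diff = [a - b for a, b in zip(m1, m2)]
    coef = solve(pooled, diff)  # pooled covariance inverse times mean difference
    const = 0.5 * sum(c * (a + b) for c, a, b in zip(coef, m1, m2))
    return coef, const

def classify(x, coef, const):
    """Assign x to population 1 when the discriminant score is non-negative."""
    score = sum(c * xi for c, xi in zip(coef, x)) - const
    return 1 if score >= 0 else 2

# Hypothetical binary training samples with two variables.
group1 = [(1, 1), (1, 0), (1, 1), (0, 1)]
group2 = [(0, 0), (0, 1), (0, 0), (1, 0)]
coef, const = fit_ldf(group1, group2)
```

With binary variables the pooled covariance matrix can be singular when a variable is constant in both samples; the sketch does not guard against that case.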
2.2. The Quadratic Discriminant Function
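When an observation vector $x$ of $r$ variables is drawn from a multivariate normal (MVN) distribution with mean vector $\mu_i$ and covariance matrix $\Sigma_i$, the MVN density function $f_i(x)$ can be expressed as:
$$f_i(x) = (2\pi)^{-r/2}\, |\Sigma_i|^{-1/2} \exp\left\{-\tfrac{1}{2}(x-\mu_i)'\Sigma_i^{-1}(x-\mu_i)\right\} \tag{3}$$
In the case of two groups, an individual is classified as belonging to population 1 if $q_1 f_1(x) > q_2 f_2(x)$, that is, if $f_1(x)/f_2(x) > q_2/q_1$. Alternatively, an individual is assigned to population 2 if $q_1 f_1(x) < q_2 f_2(x)$, that is, if $f_1(x)/f_2(x) < q_2/q_1$, where $q_1$ and $q_2$ are the proportions of individuals from the two groups in the populations [7]. When the two groups have a common covariance matrix $\Sigma$ and mean vectors $\mu_1$ and $\mu_2$, the above rule becomes
$$\frac{f_1(x)}{f_2(x)} = \frac{\exp\left\{-\tfrac{1}{2}(x-\mu_1)'\Sigma^{-1}(x-\mu_1)\right\}}{\exp\left\{-\tfrac{1}{2}(x-\mu_2)'\Sigma^{-1}(x-\mu_2)\right\}} \tag{4}$$
$$= \exp\left\{x'\Sigma^{-1}(\mu_1-\mu_2) - \tfrac{1}{2}(\mu_1+\mu_2)'\Sigma^{-1}(\mu_1-\mu_2)\right\} > \frac{q_2}{q_1}$$
Taking logarithms, the rule is to assign an individual to population 1 if
$$x'\Sigma^{-1}(\mu_1-\mu_2) - \tfrac{1}{2}(\mu_1+\mu_2)'\Sigma^{-1}(\mu_1-\mu_2) > \ln\frac{q_2}{q_1} \tag{5}$$
and to population 2 otherwise. The sample analogue of the above inequality is
$$x'S^{-1}(\bar{x}_1-\bar{x}_2) - \tfrac{1}{2}(\bar{x}_1+\bar{x}_2)'S^{-1}(\bar{x}_1-\bar{x}_2) > \ln\frac{q_2}{q_1} \tag{6}$$
where $S$ is the pooled sample covariance matrix, and the coefficients are seen to be identical to Fisher’s result for the LDF.
When the covariance matrices are unequal and cannot be pooled, but the population distributions are multivariate normal, the classification rule has the form
$$\frac{f_1(x)}{f_2(x)} = \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} \exp\left\{-\tfrac{1}{2}\left[(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) - (x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)\right]\right\} > \frac{q_2}{q_1} \tag{7}$$
or, taking logarithms,
$$Q(x) = \tfrac{1}{2}\ln\frac{|\Sigma_2|}{|\Sigma_1|} - \tfrac{1}{2}\left[(x-\mu_1)'\Sigma_1^{-1}(x-\mu_1) - (x-\mu_2)'\Sigma_2^{-1}(x-\mu_2)\right] > \ln\frac{q_2}{q_1} \tag{8}$$
In these cases, the discriminant function is quadratic, since the term in $x'(\Sigma_1^{-1}-\Sigma_2^{-1})x$ is still present [7]. With $\mu_1$, $\mu_2$, $\Sigma_1$ and $\Sigma_2$ estimated by the respective sample mean vectors and covariance matrices $\bar{x}_1$, $\bar{x}_2$, $S_1$ and $S_2$, the sample analogue of $Q(x)$ is
$$Q(x) = \tfrac{1}{2}\ln\frac{|S_2|}{|S_1|} - \tfrac{1}{2}\left[(x-\bar{x}_1)'S_1^{-1}(x-\bar{x}_1) - (x-\bar{x}_2)'S_2^{-1}(x-\bar{x}_2)\right] \tag{9}$$
In each of the conditions of the present study, the proportions of each group in the population were assumed to be equal to each other and not proportional to sample size, since the true proportions are not usually known in most areas of psychological research. When the population proportions are equal, the quadratic decision rule is then to classify an individual into population 1 if $Q(x) > 0$ or into population 2 if $Q(x) < 0$, since $\ln(q_2/q_1) = 0$.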
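As an illustration of the quadratic rule, the following sketch evaluates the standard sample quadratic discriminant score for two variables, where the 2x2 inverse and determinant have closed forms. The restriction to two variables, the function names and the example means and covariance matrices are assumptions made for brevity; they are not part of the study.

```python
import math

def quad_form_2x2(d, cov):
    """Return (d' cov^{-1} d, det(cov)) for a 2x2 covariance matrix."""
    (a, b), (c, e) = cov
    det = a * e - b * c
    inv = [[e / det, -b / det], [-c / det, a / det]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return q, det

def qdf_score(x, mean1, cov1, mean2, cov2):
    """Log density ratio of group 1 to group 2 under normality (priors excluded)."""
    d1 = [xi - mi for xi, mi in zip(x, mean1)]
    d2 = [xi - mi for xi, mi in zip(x, mean2)]
    q1, det1 = quad_form_2x2(d1, cov1)
    q2, det2 = quad_form_2x2(d2, cov2)
    return 0.5 * math.log(det2 / det1) - 0.5 * (q1 - q2)

def classify_qdf(x, mean1, cov1, mean2, cov2, prior1=0.5, prior2=0.5):
    """Assign x to population 1 when the score exceeds ln(prior2 / prior1)."""
    threshold = math.log(prior2 / prior1)
    return 1 if qdf_score(x, mean1, cov1, mean2, cov2) > threshold else 2

# Hypothetical parameters: group 2 is more dispersed than group 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
wide = [[4.0, 0.0], [0.0, 4.0]]
label_near = classify_qdf((0.0, 0.0), (0.0, 0.0), identity, (2.0, 2.0), wide)
label_far = classify_qdf((-5.0, -5.0), (0.0, 0.0), identity, (2.0, 2.0), wide)
```

The second point lies on the far side of group 1 from group 2 and is still assigned to group 2, because the more dispersed group dominates in the tails; a linear rule cannot produce this behaviour.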
2.3. The AB Discriminant Function
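[1] proposed a linear discriminant function of the form $b'x$, with $b$ chosen so that $x$ is classified as from population 1 if $b'x > c$ and from population 2 if $b'x \le c$, where $c$ is also suitably determined. With this procedure, the misclassification probabilities are:
$$P(2|1) = 1 - F(y_1)$$
and
$$P(1|2) = 1 - F(y_2) \tag{10}$$
where $F$ is the cumulative distribution function of a standard normal variable. The $y_1$ and $y_2$ are determined by
$$y_1 = \frac{b'\mu_1 - c}{(b'\Sigma_1 b)^{1/2}} \quad\text{and}\quad y_2 = \frac{c - b'\mu_2}{(b'\Sigma_2 b)^{1/2}} \tag{11}$$
where $\mu_1$ and $\mu_2$ are the means, and $\Sigma_1$ and $\Sigma_2$ the covariance matrices, of population 1 and population 2. Now $y_2$ can be expressed as
$$y_2 = \frac{b'(\mu_1-\mu_2) - y_1 (b'\Sigma_1 b)^{1/2}}{(b'\Sigma_2 b)^{1/2}} \tag{12}$$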
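The vector $b$ is then chosen to maximize $y_2$ for a given $y_1$. By differentiating $y_2$ with respect to $b$, it can be shown that the solution consists of solving the following equations in $b$ and a scalar $t$:
$$b = [t\Sigma_1 + (1-t)\Sigma_2]^{-1}(\mu_1-\mu_2) \quad\text{and}\quad y_1 = t\,(b'\Sigma_1 b)^{1/2} \tag{13}$$
The solution to these equations is obtained by a trial-and-error procedure, and $c$ is then obtained by:
$$c = b'\mu_1 - t\, b'\Sigma_1 b \tag{14}$$
Now $y_2$ can be obtained from
$$y_2 = (1-t)\,(b'\Sigma_2 b)^{1/2} \tag{15}$$
[1] also considered an alternative method when the two misclassification probabilities are equal, i.e. $y_1 = y_2$. In this case, $b$ and $t$ are found from:
$$b = [t\Sigma_1 + (1-t)\Sigma_2]^{-1}(\mu_1-\mu_2) \quad\text{and}\quad t^2\, b'\Sigma_1 b = (1-t)^2\, b'\Sigma_2 b \tag{16}$$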
The determination of the value of t was accomplished by using the result due to [3], in which were expressed as:
, and , (17)
Where diag are the roots of the determinantal equation, then must lie between the minimum and maximum roots of the above characteristic equation.
In the present study the optimal value of t was approximated by evaluating t for equal to the minimum and maximum characteristic roots and computing the vector $b$ in each case from Equation (13). The value of c was then calculated from Equation (14), and the observation was assigned to population 1 if $b'x > c$ or to population 2 if $b'x \le c$. In this manner, the interval was successively bisected, and for each value of t, the proportion of correct classifications was calculated. The interval was bisected a maximum of five times or until classification did not improve. The resultant discriminant function was then applied to the cross-validation sample, and the proportion of correct classifications was calculated.
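A rough sketch of a search over t is given below. It assumes the standard Anderson-Bahadur form of the solution, $b = [t\Sigma_1 + (1-t)\Sigma_2]^{-1}(\mu_1-\mu_2)$ with $c = b'\mu_1 - t\,b'\Sigma_1 b$, and it bisects on the equal-error condition $y_1 = y_2$ rather than on the proportion of correct classifications used in the study. The two-variable restriction, the function names and the example parameters are illustrative assumptions.

```python
import math

def inv_2x2(m):
    """Inverse of a 2x2 matrix in closed form."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def ab_rule(t, mu1, cov1, mu2, cov2):
    """Return (b, c, y1, y2) for a given t strictly between 0 and 1."""
    mix = [[t * a + (1 - t) * b for a, b in zip(r1, r2)]
           for r1, r2 in zip(cov1, cov2)]
    delta = [a - b for a, b in zip(mu1, mu2)]
    b = mat_vec(inv_2x2(mix), delta)
    v1 = dot(b, mat_vec(cov1, b))  # variance of b'x in population 1
    v2 = dot(b, mat_vec(cov2, b))  # variance of b'x in population 2
    c = dot(b, mu1) - t * v1
    return b, c, t * math.sqrt(v1), (1 - t) * math.sqrt(v2)

def equal_error_t(mu1, cov1, mu2, cov2, steps=60):
    """Bisect on t until y1 = y2 (equal misclassification probabilities)."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        _, _, y1, y2 = ab_rule(mid, mu1, cov1, mu2, cov2)
        if y1 < y2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical populations with unequal covariance matrices.
mu1, cov1 = (2.0, 0.0), [[1.0, 0.0], [0.0, 1.0]]
mu2, cov2 = (0.0, 0.0), [[4.0, 0.0], [0.0, 4.0]]
t_star = equal_error_t(mu1, cov1, mu2, cov2)
b_star, c_star, y1_star, y2_star = ab_rule(t_star, mu1, cov1, mu2, cov2)
```

For these parameters the search settles at t = 2/3, where the cutoff lies closer to the mean of the less dispersed population.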
2.4. The Polynomial Discriminant Function
In this case, the discriminant function was constructed by estimating the probability density function for each sample directly from the observed data, as described in [15]. This was accomplished by expanding the estimate $\hat{p}(x)$ of the population probability density function $p(x)$ in a series. Tou and Gonzalez show that if it is required that the estimate of the probability density function minimize a mean-square error function defined as:
$$R = \int w(x)\,[p(x) - \hat{p}(x)]^2\, dx \tag{18}$$
where $w(x)$ is a weighting function, then $\hat{p}(x)$ may be expanded in the series
$$\hat{p}(x) = \sum_{j=1}^{m} c_j\, \varphi_j(x) \tag{19}$$
where the $c_j$ are coefficients to be determined, and the $\varphi_j(x)$ are a set of specified basis functions.
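A set of univariate basis functions associated with the normal distribution, from which multivariate basis functions can be obtained, are the Hermite polynomials, generated by the recursive relation
$$H_{k+1}(x) - 2x\,H_k(x) + 2k\,H_{k-1}(x) = 0, \quad k \ge 1 \tag{20}$$
where $H_0(x) = 1$ and $H_1(x) = 2x$. The first few Hermite polynomials are:
$$H_0(x) = 1,\quad H_1(x) = 2x,\quad H_2(x) = 4x^2 - 2,\quad H_3(x) = 8x^3 - 12x,\quad H_4(x) = 16x^4 - 48x^2 + 12 \tag{21}$$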
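The recursion can be checked numerically with a few lines of code. The sketch assumes the convention in which the recursion is $H_{k+1}(x) = 2x\,H_k(x) - 2k\,H_{k-1}(x)$ with $H_0(x) = 1$ and $H_1(x) = 2x$; the function name is hypothetical.

```python
def hermite(k, x):
    """Evaluate H_k(x) from H_{j+1} = 2x H_j - 2j H_{j-1}, H_0 = 1, H_1 = 2x."""
    h_prev, h = 1.0, 2.0 * x
    if k == 0:
        return h_prev
    for j in range(1, k):
        # Advance one step of the three-term recursion.
        h_prev, h = h, 2.0 * x * h - 2.0 * j * h_prev
    return h

# Values at x = 1 for the first five polynomials.
values_at_one = [hermite(k, 1.0) for k in range(5)]
```

Under this convention the values at x = 1 agree with the closed forms $4x^2 - 2$, $8x^3 - 12x$ and $16x^4 - 48x^2 + 12$.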
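Substituting the expansion of $\hat{p}(x)$ into the mean-square error function yields
$$R = \int w(x)\left[p(x) - \sum_{j=1}^{m} c_j\,\varphi_j(x)\right]^2 dx \tag{22}$$
and minimizing $R$ with respect to the coefficients $c_k$ yields
$$\sum_{j=1}^{m} c_j \int w(x)\,\varphi_j(x)\,\varphi_k(x)\,dx = \int w(x)\,\varphi_k(x)\,p(x)\,dx, \quad k = 1, \ldots, m \tag{23}$$
The right side of this equation is the definition of the expected value of the function $w(x)\varphi_k(x)$ and may be approximated by the sample average over the observations $x_1, \ldots, x_n$:
$$\int w(x)\,\varphi_k(x)\,p(x)\,dx \approx \frac{1}{n}\sum_{i=1}^{n} w(x_i)\,\varphi_k(x_i) \tag{24}$$
Since the basis functions are chosen to be orthonormal with respect to the weighting function, the coefficients may be determined from
$$c_k = \frac{1}{n}\sum_{i=1}^{n} w(x_i)\,\varphi_k(x_i) \tag{25}$$
and the resultant density may be obtained from
$$\hat{p}(x) = \sum_{k=1}^{m} c_k\,\varphi_k(x) \tag{26}$$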
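By using Bayes’ formula
$$p(\pi_i \mid x) = \frac{q_i\, p(x \mid \pi_i)}{p(x)} \tag{27}$$
where $q_i$ is the probability of the $i$-th population, the discriminant functions for this problem are then given by:
$$d_1(x) = q_1\,\hat{p}(x \mid \pi_1) \quad\text{and}\quad d_2(x) = q_2\,\hat{p}(x \mid \pi_2) \tag{28}$$
$$d(x) = d_1(x) - d_2(x) \tag{29}$$
and if $q_1 = q_2$, the decision boundary is given by $\hat{p}(x \mid \pi_1) - \hat{p}(x \mid \pi_2) = 0$.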
In the present study a two-dimensional set of orthogonal functions was obtained by forming pairwise combinations of the one-dimensional functions. Six terms were used to approximate the density function and were constructed as follows:
$$\begin{aligned}
\varphi_1(x) &= H_0(x_1)H_0(x_2) = 1, & \varphi_2(x) &= H_1(x_1)H_0(x_2) = 2x_1, & \varphi_3(x) &= H_0(x_1)H_1(x_2) = 2x_2,\\
\varphi_4(x) &= H_1(x_1)H_1(x_2) = 4x_1x_2, & \varphi_5(x) &= H_2(x_1)H_0(x_2) = 4x_1^2 - 2, & \varphi_6(x) &= H_0(x_1)H_2(x_2) = 4x_2^2 - 2
\end{aligned} \tag{30}$$
The set of orthogonal functions for the six-variable case was constructed in the same manner as for the bivariate case, by forming products of one-dimensional Hermite polynomials. In order for the estimates of the density functions to be polynomials of degree two in all the variables, 28 terms were constructed as follows:
$$\{1\} \;\cup\; \{H_1(x_j)\}_{j=1}^{6} \;\cup\; \{H_1(x_j)H_1(x_k)\}_{1 \le j < k \le 6} \;\cup\; \{H_2(x_j)\}_{j=1}^{6}, \qquad 1 + 6 + 15 + 6 = 28 \tag{31}$$
The vector of coefficients $c$ was then computed for each sample from Equation (25), and the polynomial estimates of the density functions were constructed as in Equation (26). The two estimates of the density functions were then subtracted to form the polynomial discriminant function, which was then applied to the observations in each of the original and cross-validation samples. Finally, the proportion of correct classifications was calculated.
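The term counts quoted above (six terms for two variables, 28 terms for six variables) can be reproduced by enumerating all products of one-dimensional Hermite polynomials whose total degree does not exceed two. The sketch below assumes that reading of the construction and the Hermite convention $H_0 = 1$, $H_1 = 2x$; the function names are hypothetical, and the ordering of the terms is simply the one produced by the enumeration.

```python
from itertools import product

def hermite(k, x):
    """Evaluate H_k(x) from H_{j+1} = 2x H_j - 2j H_{j-1}, H_0 = 1, H_1 = 2x."""
    h_prev, h = 1.0, 2.0 * x
    if k == 0:
        return h_prev
    for j in range(1, k):
        h_prev, h = h, 2.0 * x * h - 2.0 * j * h_prev
    return h

def degree_indices(n_vars, max_degree=2):
    """All tuples of per-variable degrees with total degree <= max_degree."""
    return [idx for idx in product(range(max_degree + 1), repeat=n_vars)
            if sum(idx) <= max_degree]

def basis_values(x, max_degree=2):
    """Evaluate every product basis function at the point x."""
    values = []
    for idx in degree_indices(len(x), max_degree):
        term = 1.0
        for k, xi in zip(idx, x):
            term *= hermite(k, xi)
        values.append(term)
    return values

n_terms_two = len(degree_indices(2))
n_terms_six = len(degree_indices(6))
example = basis_values((1.0, 0.0))
```

Evaluating `basis_values` on every observation of a sample and averaging the results (weighted by the chosen weighting function) would give the coefficient vector for that sample.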
2.5. Testing Adequacy of Discriminant Coefficient
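Consider the discrimination problem between two multinormal populations with means $\mu_1$, $\mu_2$ and common covariance matrix $\Sigma$. The coefficients of the maximum likelihood discriminant function (MLDF) are given by $\alpha = \Sigma^{-1}(\mu_1 - \mu_2)$. In practice, of course, the parameters are estimated by
$$\bar{x}_1,\quad \bar{x}_2 \quad\text{and}\quad S_u = m^{-1}W, \qquad m = n_1 + n_2 - 2 \tag{32}$$
where $W$ is the within-groups sum of squares and products matrix. Letting $d = \bar{x}_1 - \bar{x}_2$, the coefficients of the sample MLDF are given by $a = mW^{-1}d$.
A test of the hypothesis $H_0: \alpha_2 = 0$, that is, that the coefficients of the last $r - k$ of the $r$ variables are zero, using the sample Mahalanobis distances $D_r^2 = m\,d'W^{-1}d$ and $D_k^2 = m\,d_1'W_{11}^{-1}d_1$, has been proposed by [12]. This test uses the statistic:
$$\frac{m - r + 1}{r - k}\cdot\frac{c^2\,(D_r^2 - D_k^2)}{m + c^2 D_k^2} \tag{33}$$
where $c^2 = n_1 n_2/(n_1 + n_2)$. Under the null hypothesis the statistic has the $F_{r-k,\,m-r+1}$ distribution, and we reject $H_0$ for large values of this statistic.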
2.6. Evaluation of Classification Functions
One important way of judging the performance of any classification procedure is to calculate its error rates, or misclassification probabilities [13]. When the forms of the parent populations are known completely, misclassification probabilities can be calculated with relative ease. Because parent populations are rarely known, we shall concentrate on the error rates associated with the sample classification functions. Once a classification function is constructed, a measure of its performance in future samples is of interest. The total probability of misclassification (TPM) is given as:
$$\mathrm{TPM} = q_1 \int_{R_2} f_1(x)\,dx + q_2 \int_{R_1} f_2(x)\,dx \tag{34}$$
where $q_i$ is the prior probability of $\pi_i$, $f_i$ is its density and $R_i$ is the region in which items are assigned to $\pi_i$. The smallest value of this quantity, obtained by a judicious choice of $R_1$ and $R_2$, is called the optimum error rate (OER):
OER = minimum TPM
2.7. Probability of Misclassification
In constructing a classification procedure, it is desired to minimize, on the average, the bad effects of misclassification [10], [13] and [11]. Suppose we have an item with response pattern $x$ from either $\pi_1$ or $\pi_2$. We think of an item as a point in an $r$-dimensional space. We partition the space $R$ into two mutually exclusive regions $R_1$ and $R_2$. If the item falls in $R_1$, we classify it as coming from $\pi_1$, and if it falls in $R_2$, we classify it as coming from $\pi_2$. In following a given classification procedure, the researcher can make two kinds of errors: an item that is actually from $\pi_1$ can be classified as coming from $\pi_2$, and an item from $\pi_2$ can be classified as coming from $\pi_1$. We need to know the relative undesirability of these two kinds of errors. Let the prior probability that an observation comes from $\pi_1$ be $q_1$, and from $\pi_2$ be $q_2$. Let the probability mass function of $\pi_1$ be $f_1(x)$ and that of $\pi_2$ be $f_2(x)$. Let the region of classifying into $\pi_1$ be $R_1$ and into $\pi_2$ be $R_2$. Then the probability of correctly classifying an observation that is actually from $\pi_1$ into $\pi_1$ is
$$P(1|1, R) = \sum_{x \in R_1} f_1(x) \tag{35}$$
and the probability of misclassifying an observation from $\pi_1$ into $\pi_2$ is
$$P(2|1, R) = \sum_{x \in R_2} f_1(x) \tag{36}$$
Similarly, the probability of correctly classifying an observation from $\pi_2$ into $\pi_2$ is $P(2|2, R) = \sum_{x \in R_2} f_2(x)$, and the probability of misclassifying an item from $\pi_2$ into $\pi_1$ is
$$P(1|2, R) = \sum_{x \in R_1} f_2(x) \tag{37}$$
The total probability of misclassification using the rule is
$$\mathrm{TPM}(R) = q_1\,P(2|1, R) + q_2\,P(1|2, R) \tag{38}$$
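For binary response patterns the total probability of misclassification can be evaluated exactly by enumerating all $2^r$ patterns. The sketch below does this for the rule that assigns each pattern to the population with the larger prior-weighted probability, which minimizes the total. It assumes, purely for illustration, that the components of the response pattern are independent Bernoulli variables; the function names and parameter values are hypothetical.

```python
from itertools import product

def bernoulli_pmf(x, p):
    """Probability of binary pattern x under independent Bernoulli components."""
    prob = 1.0
    for xi, pi in zip(x, p):
        prob *= pi if xi == 1 else 1.0 - pi
    return prob

def optimum_tpm(p1, p2, q1=0.5, q2=0.5):
    """Return (TPM, R1) for the rule assigning x to pi_1 when q1 f1(x) >= q2 f2(x)."""
    tpm = 0.0
    region1 = []
    for x in product((0, 1), repeat=len(p1)):
        w1 = q1 * bernoulli_pmf(x, p1)
        w2 = q2 * bernoulli_pmf(x, p2)
        if w1 >= w2:
            region1.append(x)
            tpm += w2  # x is assigned to pi_1, so items from pi_2 are misclassified
        else:
            tpm += w1  # x is assigned to pi_2, so items from pi_1 are misclassified
    return tpm, region1

tpm, region1 = optimum_tpm((0.8, 0.8), (0.2, 0.2))
```

With these parameters the patterns (0, 1) and (1, 0) are equally likely under both populations, so the total does not depend on how the ties are resolved.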
In order to determine the performance of a classification rule R in the classification of future items, we compute the total probability of misclassification, known as the error rate. [7] defined the following types of error rates.
i. Error rate for the optimum classification rule: when the parameters of the distributions are known, the error rate is the one that is optimum for these distributions.
ii. Actual error rate: the error rate for the classification rule as it will perform in future samples.
iii. Expected actual error rate: the expected error rate for classification rules based on samples of size $n_1$ from $\pi_1$ and $n_2$ from $\pi_2$.
iv. The plug-in estimate of the error rate, obtained by using the estimated parameters for $\pi_1$ and $\pi_2$.
v. The apparent error rate: the fraction of items in the initial sample which are misclassified by the classification rule.

| Actual membership | Classified into $\pi_1$ | Classified into $\pi_2$ | Total |
| --- | --- | --- | --- |
| $\pi_1$ | $n_{1C}$ | $n_{1M} = n_1 - n_{1C}$ | $n_1$ |
| $\pi_2$ | $n_{2M} = n_2 - n_{2C}$ | $n_{2C}$ | $n_2$ |

Here $n_{iC}$ is the number of items from $\pi_i$ that are correctly classified and $n_{iM}$ the number that are misclassified. The table above is called the confusion matrix and the apparent error rate is given by
$$\mathrm{APER} = \frac{n_{1M} + n_{2M}}{n_1 + n_2} \tag{39}$$
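The apparent error rate is simply the share of off-diagonal counts in the confusion matrix. A minimal sketch, with hypothetical function names and made-up labels, is:

```python
def confusion_matrix(actual, predicted):
    """Count items by (actual group, predicted group) for groups 1 and 2."""
    counts = {(a, p): 0 for a in (1, 2) for p in (1, 2)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def apparent_error_rate(counts):
    """Fraction of training items whose predicted group differs from the actual one."""
    wrong = counts[(1, 2)] + counts[(2, 1)]
    return wrong / sum(counts.values())

# Made-up labels: three items from group 1 and five from group 2.
actual = [1, 1, 1, 2, 2, 2, 2, 2]
predicted = [1, 1, 2, 2, 2, 2, 1, 1]
counts = confusion_matrix(actual, predicted)
aper = apparent_error_rate(counts)
```

Because the same items are used to build and to evaluate the rule, this estimate tends to be optimistic.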
[6] called the second error rate the actual error rate and the third the expected actual error rate. Hills showed that the actual error rate is greater than the optimum error rate, which in turn is greater than the expectation of the plug-in estimate of the error rate. [9] proved a similar inequality. An algebraic expression for the exact bias of the apparent error rate of the sample multinomial discriminant rule was obtained by [5], who tabulated it under various combinations of the sample sizes, the number of multinomial cells and the cell probabilities. Their results demonstrated that the bound described above is generally loose.
3. The Simulation Experiments and Results
The four classification procedures are evaluated at each of the 118 configurations of n, r and d. The 118 configurations of n, r and d are all possible combinations of n = 40, 60, 80, 100, 200, r = 3, 4, 5 and d = 0.1, 0.2, 0.3, and 0.4. A simulation experiment which generates the data and evaluates the procedures is now described.
(i) A training data set of size n is generated via an R program, where $n_1$ observations are sampled from $\pi_1$, which has a multivariate Bernoulli distribution with input parameter $p_1$, and $n_2$ observations are sampled from $\pi_2$, which is multivariate Bernoulli with input parameter $p_2$. These samples are used to construct the rule for each procedure, and an estimate of the probability of misclassification for each procedure is obtained by the plug-in rule or the confusion matrix in the sense of the full multinomial.
(ii) The likelihood ratios are used to define classification rules. The plug-in estimates of error rates are determined for each of the classification rules.
(iii) Steps (i) and (ii) are repeated 1000 times and the mean plug-in error and variances for the 1000 trials are recorded. The method of estimation used here is called the resubstitution method.
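The three steps can be sketched as follows. The study used an R program; this Python sketch is only an illustration under several assumptions that are not stated in the text: the components of each multivariate Bernoulli vector are independent, the two samples each have n/2 observations, and the rule is the full multinomial rule that assigns each observed pattern to the group in which its sample relative frequency is larger. All names and parameter values are hypothetical.

```python
import random
import statistics
from collections import Counter

def draw_sample(rng, n, p):
    """Draw n binary vectors with independent Bernoulli(p_j) components."""
    return [tuple(1 if rng.random() < pj else 0 for pj in p) for _ in range(n)]

def resubstitution_error(sample1, sample2):
    """Apparent error of the full multinomial rule on its own training data."""
    c1, c2 = Counter(sample1), Counter(sample2)
    n1, n2 = len(sample1), len(sample2)
    errors = 0
    for pattern in set(c1) | set(c2):
        if c1[pattern] / n1 >= c2[pattern] / n2:
            errors += c2[pattern]  # pattern assigned to pi_1
        else:
            errors += c1[pattern]  # pattern assigned to pi_2
    return errors / (n1 + n2)

def simulate(p1, p2, n, replications=1000, seed=0):
    """Mean and standard deviation of the resubstitution error over replications."""
    rng = random.Random(seed)
    rates = []
    for _ in range(replications):
        s1 = draw_sample(rng, n // 2, p1)
        s2 = draw_sample(rng, n // 2, p2)
        rates.append(resubstitution_error(s1, s2))
    return statistics.mean(rates), statistics.stdev(rates)

mean_rate, sd_rate = simulate((0.7, 0.7, 0.7), (0.3, 0.3, 0.3), n=40,
                              replications=200, seed=1)
```

Replacing `resubstitution_error` with the error of another rule (linear, quadratic, and so on) would give the corresponding column of a results table.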
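The following tables display one set of the results obtained.

Table 2(a). Mean apparent error rates

| Sample size | AB | Polynomial | LDA | Quadratic |
| --- | --- | --- | --- | --- |
| 40 | 0.157125 | 0.110074 | 0.110787 | 0.204512 |
| 60 | 0.161900 | 0.127855 | 0.127958 | 0.207491 |
| 100 | 0.163290 | 0.143526 | 0.143680 | 0.209940 |
| 140 | 0.162967 | 0.149837 | 0.150407 | 0.209826 |
| 200 | 0.162565 | 0.156384 | 0.155280 | 0.211542 |

Table 2(b). Standard deviations (actual error rates)

| Sample size | AB | Polynomial | LDA | Quadratic |
| --- | --- | --- | --- | --- |
| 40 | 0.040271 | 0.052706 | 0.037112 | 0.041686 |
| 60 | 0.032751 | 0.042691 | 0.031487 | 0.033007 |
| 100 | 0.027786 | 0.037015 | 0.026152 | 0.027125 |
| 140 | 0.022462 | 0.031623 | 0.022112 | 0.024082 |
| 200 | 0.017981 | 0.026657 | 0.018218 | 0.019071 |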
Tables 2(a) and (b) present the mean apparent error rates and the standard deviations (actual error rates) for the classification rules under different parameter values. The mean apparent error rates generally increase with sample size, while the standard deviations decrease with sample size. From the analysis, the linear discriminant function is ranked first, followed by the AB discriminant function and the quadratic function, with the polynomial discriminant function last.
| Classification rule | Performance rank |
| --- | --- |
| Linear discriminant function | 1 |
| AB discriminant function | 2 |
| Quadratic function | 3 |
| Polynomial discriminant function | 4 |
4. Discussion and Conclusion
The results in Table 3.1b indicate that, in general, with samples drawn from MVN populations with equal covariance matrices, the Fisher LDF, the AB procedure, the quadratic discriminant function (QDF) and the polynomial discriminant function (PDF) performed similarly, but as the degree of heterogeneity increased (not shown in the table), the QDF outperformed the other procedures. These results are consistent with those of [8] and [4]: the Fisher LDF performed well relative to the QDF for mild departures from homogeneity of covariance matrices, but as the degree of heterogeneity increased, the QDF outperformed the Fisher LDF, the AB procedure and the polynomial discriminant function.
We obtained two major results from this study. Firstly, using the simulation experiments we ranked the procedures as follows: linear discriminant function, AB discriminant function, quadratic discriminant function and polynomial discriminant function. The best method was the linear discriminant procedure. Secondly, we concluded that it is better to increase the number of variables, because accuracy increases with an increasing number of variables. Moreover, our study showed that the linear discriminant function is flexible enough to allow the analyst to incorporate some a priori information in the models. Nevertheless, this does not exclude the use of other statistical techniques once the required hypotheses are satisfied.
References