Errors of Misclassification Associated with Edgeworth Series Distribution (ESD)

This study investigates the errors of misclassification associated with the Edgeworth Series Distribution (ESD) with a view to assessing the effects of sampling from non-normality. The effects of applying a normal classificatory rule when the underlying distribution is actually a persistent non-normal distribution were examined. This was achieved by comparing the errors of misclassification for the ESD with those for the Normal Distribution (ND) using small sample sizes at every level of the skewness factor. The simulation procedure for the study was implemented in R using a numerical inverse interpolation method applied to uniformly distributed random variates. A configuration size of 1000 was obtained for the two training samples drawn at every level of the skewness factor (λ3) in the range (0.00625, 0.4). This was repeated for different small sample sizes, comparing the errors of misclassification of the ESD with those of the ND. The simulation results showed that, as the skewness factor (λ3) increases, one of the optimum probabilities of misclassification by the ESD (E12E) decreases while the other (E21E) increases, and the optimum total probability of misclassification remains stable. The probability of misclassification satisfies E12E ≥ E12N and E21E ≤ E21N at every level of λ3. Thus, the total probabilities of misclassification are not greatly affected by the skewness factor. This asserts that the normal classification procedure is robust against departure from normality.


Background to the Study
The study of discrimination and classification problems, with a view to assessing the effects of departure from the usual assumption of normality, cannot be overemphasized. In discrimination, we are concerned with the existence of two or more groups and a sample of observations from each group. We are therefore required to design a rule, based on measurements from these observations, for assigning a new observation to the correct population when we do not know from which of the two populations it emanates [1,22].
Classification is concerned with the prediction or allocation of observations into groups for which a sample of observations is also given. The problem is to classify the observations into groups which are as distinct as possible [16].
A classification problem occurs when a researcher makes a number of measurements on observations and wishes to classify the observations into one of several groups on the basis of the measurements. The observations cannot be identified with a group directly without recourse to the measurements. Fisher [8], illustrating this concept, classified iris flowers from an unknown group (species) into one of three known species (Iris Setosa, Iris Versicolour, and Iris Virginica) on the basis of their attributes (sepal length in cm, sepal width in cm, petal length in cm and petal width in cm).
The general procedure for classifying an observation X with p observed characteristics employs a classification function. Because the observations are random and the parameters for determining this function are often unknown, the procedure can result in two types of errors, called errors of misclassification. Errors of misclassification occur when the criterion selected is not suitable for classification [10].
The observation X may be classified as belonging to population π1 when it actually comes from population π2, or vice versa. These errors are of serious concern in the choice of the procedure and, as such, one is required to reduce the errors or, more appropriately, make their probabilities as small as possible.
Let f1(x) and f2(x) be the probability density functions associated with X for populations π1 and π2 respectively. Let the prior probabilities for π1 and π2 be P1 and P2 respectively, and let Ri (i = 1, 2) be the region in which an observation is classified into πi. Then the probability of misclassifying an observation from π1 into π2 is Pr(2|1) = ∫_{R2} f1(x) dx, and similarly Pr(1|2) = ∫_{R1} f2(x) dx, while Pr(i|i) = ∫_{Ri} fi(x) dx is the probability that an object is correctly classified into πi.

In constructing a classification procedure, it is needful to minimize, on the average, the bad effects of misclassification, since a good classification procedure results in few misclassifications [18]. The Linear Discriminant Function (LDF) is a statistical procedure that assigns a p-dimensional observation vector X into one of the two populations πi (i = 1, 2), and it is employed as an assignment rule when: (a) the density functions of observations from populations π1 and π2 are multivariate normal; (b) the variance-covariance matrix Σ1 in population π1 is the same as Σ2 in population π2; (c) the prior probabilities of observations coming from π1 and π2 are known; (d) the parameters of the density functions in (a) are known. Suppose the assumptions specified above are satisfied; then the LDF provides the optimal assignment rule in that it cannot be improved upon, and the errors of misclassification are minimized. However, when some or all of the assumptions are violated, it is of interest to determine the effects of the violation on procedures that use the LDF. If the parameters in (a) are estimated from samples, two problems may arise at the estimation stage: the data may contain missing values, and the initial sample may not be properly assigned due to inaccuracy in the initial assignment.
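As a minimal sketch of the univariate case of this assignment rule (a Python illustration, not the paper's R code; the function and parameter names are ours, and equal priors with a common variance are assumed unless supplied):

```python
import math

def lda_classify(x, mu1, mu2, sigma2, p1=0.5, p2=0.5):
    """Univariate normal likelihood-ratio rule under assumptions (a)-(d).
    Returns 1 or 2, the population the observation x is assigned to."""
    # Anderson's W statistic for two normals with common variance sigma2
    w = (mu1 - mu2) * (x - 0.5 * (mu1 + mu2)) / sigma2
    # Assign to pi_1 when W >= ln(p2/p1); with equal priors the
    # cut-off is simply the midpoint of the two means.
    return 1 if w >= math.log(p2 / p1) else 2
```

With equal priors, an observation closer to μ1 than to μ2 is assigned to π1, which is the optimality property claimed for the LDF above.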

Statement of the Problem
An experimenter who does not recognize that observations are non-normal will proceed to use the normal regions for classification. The question that emanates is: "how does this failure to transform to normality, prior to classification, affect the probability of misclassification?" This problem was investigated by comparing the errors of misclassification associated with Johnson's system of distributions in the appropriate transformable non-normal case with those of the normal distribution [6]. Errors of misclassification associated with the gamma distribution were also examined by [15]. Considerable work has been done by researchers in connection with errors of misclassification when the underlying distribution is a transformable non-normal distribution, but the errors of misclassification associated with persistent non-normal distributions remain unresolved [12].
For any classification rule, the associated error rates are often used as criteria for evaluating classification performance. These error rates are easily calculated when the population parameters are known. However, when these parameters are unknown and must be estimated from samples, the exact overall expected error rate for Fisher's Linear Discriminant Function becomes virtually intractable. There is also a loss of information which affects the estimation of the probabilities of misclassification, in that they may be underestimated or overestimated. In order to rectify this problem, we derive the asymptotic distribution of the expected probability of misclassification for the distribution under consideration [17].
The aim of this study is to investigate errors of misclassification associated with Edgeworth Series Distribution (ESD). The research work seeks to achieve the following objectives: i. To examine the effect of applying the normal classificatory rule when the distribution is ESD by comparing the errors of misclassification using the Normal Distribution (ND) and ESD classification rules. ii. To use simulated data to validate the established results of the study.

Literature Review
The problem of estimating probabilities of misclassification has received remarkable attention in the literature ever since [8] introduced the Linear Discriminant Function. An extensive bibliography on this subject has been published by [21] since the probabilities of misclassification provide a way of evaluating the performance of the classification procedure.
Several investigations have also been conducted on the effects of non-normality on classification rules. The robustness of the LDF and QDF with respect to certain types of non-normality was studied by [4], who considered Johnson's system of distributions, which are transformable to normality. They examined three members of the family, among them the lognormal transformation. Sampling studies were conducted in order to examine the behavior of the errors of misclassification, and they found that the total error of misclassification is greatly increased, as the individual errors are distorted, for all transformations in the case of the LDF. Approximate minimax rules were investigated because of this distortion in the errors and were found to reduce the errors of misclassification greatly. The effect of non-normality on the QDF was investigated by [13]. They assumed that the data were transformable to normality and drew random samples from non-normal distributions in order to study the effect of non-normality on the QDF. Their results indicated that the actual error rates were considerably larger than the optimal rates in the case of zero mean difference.
The robustness of the Linear Discriminant Function (LDF) to non-normality using three distributions from Johnson's system was examined by [6]. Though their work was restricted to distributions transformable to normality, they suggested that further work be carried out on a persistently non-normal distribution, with attention to robustness for small sample sizes.
The effect of applying the normal classificatory rule, with focus on non-normality, was also examined by [12], who obtained the asymptotic distribution of the errors of misclassification in the non-normal case using Johnson's system of distributions. The distribution function G1(z) and the expected value of the conditional distribution E(e12) were evaluated theoretically for various values of the given parameters using Johnson's system of distributions. It was observed that the presence of outliers in one sample does not affect the behavior of the error rates in general.
Errors of misclassification for classification problems with two classes of the univariate gamma distribution were studied by [15], with gamma density functions f_i(x) = [θ^{λ_i} / Γ(λ_i)] x^{λ_i − 1} e^{−θx}, x > 0 (i = 1, 2). The effects of applying the normal classificatory rule to the non-normal transformable gamma distribution were assessed by comparing probabilities of misclassification (optimum and conditional), based on the Linear Discriminant Function (LDF) for normality and the likelihood ratio rule (LR) for gamma populations, for various combinations of λ1, λ2 and θ. They concluded that for small values of θ, n1 and n2, the distribution functions do not become large fast enough, indicating that, with high probability, the errors of misclassification are likely to be large.
In this study, we examine a persistent, non-transformable, non-normal distribution by investigating the effects of applying the normal classificatory rule when the distribution is ESD, using an empirical approach. We also develop the expected probability of misclassification for the ESD and its asymptotic distribution.

Method
The effects of non-normality in a two-population discrimination problem on the errors of misclassification are examined when Anderson's statistic (W), defined by means of the Edgeworth Series Distribution (ESD), is used for classifying an observation as emanating from population π1 or π2. The effects are studied for varying values of the skewness factor based on the boundary of the unimodal region for the Edgeworth Series Distribution.
Optimum probabilities of misclassification for ESD are computed from known parameters and subsequently, the apparent probabilities of misclassification in respect of ESD for known and estimated parameters are generated.

Edgeworth Series Distribution (ESD)
The Edgeworth Series Distribution (ESD) is based on a series expansion that approximates a probability distribution in terms of its cumulants and the Hermite polynomials. It relates the probability density function to that of a standard normal distribution [19].
The use of the ESD is expedient because approximations to the distributions of sample statistics of order higher than n^{−1/2} are of interest in the asymptotic theory of statistics. The ESD provides an important tool for evaluating such refinements. Its expansions incorporate a method of using information about higher-order moments to increase the accuracy of approximations [17].
Let F(x) be the distribution to be approximated, {κk} its cumulants, γk the cumulants of the standard normal distribution, and D the differential operator with respect to x. Also, let Φ and φ be the standard normal distribution function and standard normal density function respectively. Then, formally, F(x) = exp{ Σk (κk − γk)(−D)^k / k! } Φ(x). This is identical with the expansion in Hermite orthogonal functions for a probability density function, where the Hn(x) are Hermite polynomials. By considering the standardized sum of n independent and identically distributed random variables, the Edgeworth series is obtained by collecting the terms in equation (10) according to powers of n [11].
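Retaining only the first correction term of this expansion gives the first-order Edgeworth density used in the sequel, which can be sketched as follows (a Python illustration under the assumption that only the skewness term λ3 is kept; H3(x) = x³ − 3x):

```python
import math

def esd_density(x, lam3):
    """First-order Edgeworth density:
    f(x) = phi(x) * (1 + (lam3 / 6) * H3(x)),  H3(x) = x**3 - 3*x,
    where phi is the standard normal density."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    h3 = x ** 3 - 3.0 * x
    return phi * (1.0 + (lam3 / 6.0) * h3)
```

At x = 0 the correction vanishes (H3(0) = 0), so the density equals the standard normal density there regardless of λ3.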
If θ̂n is a statistic constructed from a sample of size n, its distribution admits an expansion of this form. The parameters μi (i = 1, 2) and λ3 satisfy the conditions required for the expansion to be a valid probability density, and λ3 is the skewness factor [4].
Let Xij (i = 1, 2; j = 1, 2, ..., ni) be independent samples of sizes n1, n2 from populations π1, π2. To estimate the apparent probabilities of misclassification, we define γj = 1 if X1j is classified as belonging to π2 and γj = 0 if X1j is classified as belonging to π1, j = 1, 2, ..., n1. The sample X11, ..., X1n1 is taken from π1 and each observation is classified in accordance with the rule in equation (35). Similarly, we define δj = 1 if X2j is classified as belonging to π1 and δj = 0 if X2j is classified as belonging to π2, j = 1, 2, ..., n2. The sample X21, ..., X2n2 is taken from π2 and each observation is classified in accordance with the rules in equations (35) and (36).
The notation E12E and E21E represents the apparent probabilities of misclassification when observations from populations π1 and π2, respectively, are misclassified by the ESD rule.
For the purpose of comparison, the classification rule in equation (37) is successively applied to X1j and X2j, and the proportions misclassified are estimated by the same procedure; E12N and E21N represent the two corresponding errors of misclassification.
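The apparent error rates defined by the indicators γj and δj are simply the misclassified proportions of each training sample, which can be sketched as follows (a Python illustration; the helper name and its `classify` argument are ours, not the paper's):

```python
def apparent_error_rates(sample1, sample2, classify):
    """Apparent probabilities of misclassification: the proportions of
    the two training samples that a rule `classify` (returning 1 or 2)
    sends to the wrong population."""
    gamma = [1 if classify(x) == 2 else 0 for x in sample1]  # from pi_1
    delta = [1 if classify(x) == 1 else 0 for x in sample2]  # from pi_2
    e12 = sum(gamma) / len(sample1)   # estimate of E12
    e21 = sum(delta) / len(sample2)   # estimate of E21
    return e12, e21
```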

Classification Rules for Normal Distribution
Let the probability density function of X in πi (i = 1, 2) be the univariate normal density with mean μi and common variance. With θ denoting the mean of the observation X, the likelihood ratio leads to equation (19), which is Anderson's discriminant function (W) when the distributions in the two populations are univariate normal with the same variance but different means [20]. We reject H0 if L < K, where K is a constant. From equation (19) and this decision rule, the classification rule of equation (20) follows. When the parameters μ1, μ2 are unknown and are estimated by the sample means X̄1, X̄2 from samples of sizes n1 and n2 respectively, the classification rule becomes the plug-in version of equation (20).
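The plug-in step can be sketched as follows (a Python illustration with names of our choosing, assuming equal priors, a common variance, and X̄1 > X̄2 so that large values go to π1):

```python
def plugin_normal_rule(train1, train2):
    """Plug-in normal classification rule: the unknown means mu1, mu2
    are replaced by the training-sample means X1bar, X2bar.
    Returns a function mapping an observation x to population 1 or 2."""
    x1bar = sum(train1) / len(train1)
    x2bar = sum(train2) / len(train2)
    cutoff = 0.5 * (x1bar + x2bar)   # midpoint of the estimated means
    return lambda x: 1 if x >= cutoff else 2
```

Because the cut-off now depends on the random sample means, the resulting error rates are the apparent (estimated) rather than optimum probabilities of misclassification.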

Classification Rule for Edgeworth Series Distribution (ESD)
Let the pdf of X in πi be fi(x) = φ(x − μi)[1 + (λ3/6) H3(x − μi)], i = 1, 2. When μ1 < μ2, the likelihood ratio yields equation (30), and from equation (30) the classification rule takes the form of equation (34). When the parameters μ1, μ2 are unknown, they are estimated by X̄1, X̄2 respectively and plugged into equation (34) before classification begins.
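The likelihood-ratio rule for the ESD can be sketched directly as a comparison of the two first-order Edgeworth densities (a Python illustration assuming equal priors; when λ3 = 0 it reduces to the normal rule):

```python
import math

def esd_rule(x, mu1, mu2, lam3):
    """ESD likelihood-ratio rule: assign x to pi_1 when
    f1(x) >= f2(x), with f_i(x) = phi(x - mu_i)*(1 + (lam3/6)*H3(x - mu_i))."""
    def f(z):
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        return phi * (1.0 + (lam3 / 6.0) * (z ** 3 - 3.0 * z))
    return 1 if f(x - mu1) >= f(x - mu2) else 2
```

The skewness term shifts the cut-off away from the midpoint of the means, which is precisely why the ESD and ND rules can disagree on observations near the boundary.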
In comparing the errors of misclassification under the ESD and ND classification rules, with data generated from the ESD, the effect of applying the normal classification rule (likelihood ratio) when the distribution is ESD is investigated by an empirical method. Thus, the classification rule for the ESD is given in terms of P and Q, which remain as defined in equations (35) and (36).

Comparison of Errors of Misclassification
We estimate the errors of misclassification with focus on the small sample sizes. This is based on the fact that the asymptotic expansion of the errors does not indicate the behaviour of the error for small sample sizes [5,9].
Estimation of the optimum probability of misclassification in the ESD when the skewness factor is in the range (0.00625, 0.4) is considered.
The apparent error rate for the Normal Distribution and ESD classification rules are examined using simulated data from ESD. The classification rules for the two distributions are also derived using likelihood criterion. The form of the estimators and the choice of values for skewness factor are also presented. The errors of misclassification are subsequently compared using the likelihood ratio rules for the Normal Distribution and ESD.
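The data-generation step for such simulations can be sketched by numerical inversion of the ESD's distribution function on a grid (a Python illustration in the spirit of the procedure described here; the paper's own simulations were written in R, and the grid bounds, grid size, and clipping of tiny negative tail values are assumptions of this sketch):

```python
import bisect
import math
import random

def esd_inverse_sampler(lam3, mu=0.0, grid_n=2001, lo=-4.0, hi=4.0):
    """Return a sampler for the first-order Edgeworth density with mean
    shift mu, built by tabulating the CDF and inverting it by linear
    interpolation."""
    step = (hi - lo) / (grid_n - 1)
    xs = [lo + i * step for i in range(grid_n)]

    def dens(x):
        z = x - mu
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        # clip tiny negative tail values of the Edgeworth density
        return max(phi * (1.0 + (lam3 / 6.0) * (z ** 3 - 3.0 * z)), 0.0)

    ys = [dens(x) for x in xs]
    cdf = [0.0]
    for i in range(1, grid_n):          # cumulative trapezoid rule
        cdf.append(cdf[-1] + 0.5 * (ys[i - 1] + ys[i]) * step)
    total = cdf[-1]
    cdf = [c / total for c in cdf]      # normalise the truncated CDF

    def sample():
        u = random.random()
        j = min(max(bisect.bisect_left(cdf, u), 1), grid_n - 1)
        c0, c1 = cdf[j - 1], cdf[j]
        t = 0.0 if c1 == c0 else (u - c0) / (c1 - c0)
        return xs[j - 1] + t * step     # linear inverse interpolation
    return sample
```

Drawing two training samples of the configuration size at each λ3 and feeding them to the two classification rules reproduces the structure of the experiment described above.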

Choice of Skewness Factor Values
The choice of values for the skewness factor (λ3) is anchored on the boundary of the positive unimodal region for the ESD, where its probability density function is valid. Thus, the skewness factor is chosen to lie within the range (0.00625, 0.4), as suggested by [3,7].

Optimum Probability of Misclassification of ESD
When all the parameters of the distributions in the populations are known, the probability of misclassification is optimal in the sense that we cannot improve upon it. When an observation from π1 is misclassified, the optimum probability of misclassification is expressed through Hermite polynomials: if φ(x) denotes the standard normal density function, the Hermite polynomial Hr(x) is the polynomial of degree r defined by the identity (−D)^r φ(x) = Hr(x) φ(x); see [11]. Using the result in equation (38), E12N and E21N denote the probabilities that observations from π1 and π2, respectively, are misclassified by the ND classification rule.
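Numerically, the optimum probability of misclassifying an observation from π1 is the mass of its ESD density falling on the π2 side of the cut-off, which can be sketched as follows (a Python illustration using the trapezoid rule, not the paper's closed form; it assumes μ1 exceeds μ2, so the π2 region is x below the cut-off):

```python
import math

def esd_optimum_error(mu1, lam3, cutoff, lo=-8.0, n=4000):
    """Optimum E12: integral of the pi_1 Edgeworth density
    f1(x) = phi(x - mu1)*(1 + (lam3/6)*H3(x - mu1)) over x < cutoff,
    computed by the trapezoid rule on [lo, cutoff]."""
    def f1(x):
        z = x - mu1
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        return phi * (1.0 + (lam3 / 6.0) * (z ** 3 - 3.0 * z))
    step = (cutoff - lo) / n
    vals = [f1(lo + i * step) for i in range(n + 1)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

With λ3 = 0 this reduces to the normal tail probability Φ(cutoff − μ1), which provides a convenient check.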
The simulation experiments have been implemented using R programs and all the simulation results are obtained and displayed along with the total probabilities of misclassification in Tables 1-6.
The total probability of misclassification is also stable (constant) as λ3 increases. From Table 3, E12E is either equal to or greater than E12N, and E21E is either equal to, less than, or greater than E21N at every level of the skewness factor; equality occurs when the skewness factor λ3 is very small. From Table 4, E12E is either equal to or greater than E12N, and E21E is equal to or less than E21N at every level of λ3. From Table 5, E12E is either equal to or greater than E12N, and E21E is either equal to or less than E21N at every level of λ3. From Table 6, E12E is equal to or greater than E12N, and E21E is either equal to or greater than E21N at every level of λ3.

Discussion
From the simulation results, it is evident that the total probability of misclassification at every value of λ3 is either under- or overestimated when small samples are employed to estimate μ1 and μ2. The differences that arise from using small samples lead to equation (52). The parabola of equation (52) faces upwards and indicates that when −0.457 < X < 1.457, we have P < Q. Since P < Q, it follows that ln P − ln Q < 0.
With this, the cut-off point of the ESD classification rule is higher than that of the ND classification rule, which results in E12E being greater than E12N.
Also, if an observation from π2 is wrongly classified, the cut-off point of the ESD classification rule is lower than that of the ND classification rule. Hence, E21E < E21N.

Conclusion
We have investigated the effect of sampling from a persistent non-normal distribution by examining the normal classificatory rule when the underlying distribution is actually an Edgeworth Series Distribution (ESD). From the results obtained in this study, it is asserted that the normal procedure is robust against departures from normality.
Thus, the skewness factor (λ3) has very little effect on the total probability of misclassification, which implies that this probability is not affected by departures from normality. Nevertheless, the skewness factor does produce increases or decreases in the individual errors of misclassification. Moreover, when small sample sizes are used to estimate the means, the optimum probability of misclassification is underestimated or overestimated. These conclusions are based on the data generated and are strictly restricted to this work.