On the Application of Linear Discriminant Function to Evaluate Data on Diabetic Patients at the University of Port Harcourt Teaching Hospital, Rivers, Nigeria

Many real life events involves several interacting variables, hence multivariate statistical tool is necessary for appropriate analysis and interpretation. Discriminant analysis (DA) is one of the commonly used multivariate method in various fields of study including education, finance, environment, medicine etc., where complex data analysis and interpretation is required. This paper demonstrates and illustrate approaches in presenting how the discriminant analysis can be carried out on 335 (40 diabetics and 295 non-diabetic) patients and how the output can be interpreted using the Fisher’s linear Discriminant function (FLDF). The performance of FLDF was adjudged based on the percentage of correct reclassification of the original observation to yield the discriminant scores from the functions. Up to 65.4% correct classification was achieved, and similarly 62.7% percent of the cross-validated grouped cases were correctly classified into either being a Diabetic or nondiabetic patient. Patient’s age and gender were found to be the two most important contributing variables in classifying a patient between the two groups.


Introduction
Multivariate analysis is a statistical technique for the analysis of two or more variables observed from one or more sample objects [5]. It consists of methods that are suitable for application when several measurements are taken on each unit in one or more samples. The quality and validity of information obtained about the population of study increases with the number of variables measured on each sampled unit, and adequately modeled to explain variations in the response variable. The discriminant function procedure introduced by Fisher [9] is one of the linear models used for the analysis of multivariate data; it gives the best linear function for discriminating the explanatory variables. Discriminant Analysis is a statistical tool widely used primarily for two objectives; finding a predictive equation for classifying new individuals or interpreting the predictive equation to better understand the relationships that may exist among the variables. It can also be used in explanatory or predictive frameworks where in both cases, some group assignments must be known before carrying out the Discriminant Analysis. Discriminant function analysis is very similar to logistic regression, and both can be used to answer the same research questions. Logistic regression does not have as many assumptions and restrictions as discriminant analysis. However, when discriminant analysis' assumptions are met, it is more powerful than logistic regression. Discriminant analysis can also be used with small sample sizes. It has been shown that when sample sizes are equal, and homogeneity of variance/covariance holds, discriminant analysis is more accurate [2]. While discriminant analysis and multiple regression analysis are similar, the main difference between these two techniques is that regression analysis deals with a continuous dependent variable, while discriminant analysis must have a discrete dependent variable. The method attempts to express one dependent variable as a linear Diabetic Patients at the University of Port Harcourt Teaching Hospital, Rivers, Nigeria combination of other features or measurements.
Discriminant analysis determines which variables discriminate between two or more naturally occurring distinct groups ≥ 2 . It can be used to determine which predictor variables are related to the dependent variable and to predict the value of the dependent variable given certain values of the predictor variables [4]. Discriminant Analysis has found application in numerous fields of study, for example; in health related research, it may be of interest to determine which amongst; religious, demographic, cultural and socio-economic variables discriminate between patients deciding to go for surgery after diagnosis of certain aliment and those that decline.
Here, relevant information on many variables are measured from each patient before diagnosis is made. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of the patient's final decision. Other areas where discriminant analysis is commonly used are in; ecology, prediction of financial risk, education, health, market analysis and consumer choice preference, analysis of corporate performance, livestock farming, credit worthiness, urban residential mobility, demographic studies, prediction of degree classification etc., ([2, 8, 12, 13, 1]). The application of classical linear discriminant analysis in medicine for diagnosis as to whether or not an individual patient has the disease he/she presented with some symptoms is of great significance and usually requires time, resources, equipment and experts/specialists. When such requirements are not all available, diagnosis are mostly based on assumptions which may or may not be accurate. This study seeks to apply linear discriminant analysis to classify diabetic patients based on some socio-economic variables. Diabetes is a common illness prevalent among older individuals; correct diagnosis of patients with traits of diabetes depends to a larger extent on the availability and adequacy of facilities and professionals.

Methodology
Discriminant analysis focuses on the association between multiple independent variables and a categorical dependent variable to create rules for separating distinct groups as much as possible, and for assigning an observation of unknown origin to one of ≥ 2 distinct preexisting groups [11]. Fisher's linear discriminant function (FLDF) is frequently used for discrimination, classification and prediction purposes under the usual basic statistical assumptions of; multivariate normality of the independent variables, equality of variance and covariance matrices, and relative equality of groups sample sizes [11]. According to [7], the linear discriminant analysis (LDA) is suitable for supervised classification when the number of observations, is larger than the number of variables , .
> . However, with the development of new technologies, there has been increase in complex problems with high-dimensionality, a situation where the number of variables (the dimension of the data vectors) is much larger than the number of observations (sample size), that is < in many disciplines such as medicine and epidemiology, genetics, biology, metrology etc. [10]. There are many proposed methods for the analysis highdimensional data where < such as K-D tree and R-tree [6], however, their performance and efficiency decrease as the dimensionality increases because the methods are designed to operate with small dimensionality.
When discriminating between two groups, the analysis assumes the two samples or populations have the same covariance matrix Σ but distinct mean vectors with variables, where the discriminant function that maximizes the separation of the groups is the linear combination of the variables [14]. Discriminant analysis focuses on the association between multiple independent variables and a categorical dependent variable by forming a composite of the independent variables. Fisher's approach transforms the − observations to univariate such that the ′ obtained from the two populations were adequately separated. The objective is to select linear combination of that maximize the ratio of squared distance between sample means of to its variance. This study uses the method of discriminant analysis on health-related data collected from the University of Port where: % * = %& or the predicted score ( * = the discriminant coefficient or weight for variable -* = the independent score for independent variable ( $ = a constant = the number of independent variables The standardized DF coefficients are of immense logical importance; they indicate the importance of each independent variable. The sign on the coefficients (±) specifies the type of relationship, that is whether the variable is making a positive or negative influence, and coefficients with large absolute values associated with variables have more excellent differentiating capability [3].
When using multivariate data for discriminant analysis, measurements are made on variables for the n observations. With the classical LDA, it is assumed that > and the observations are divided into ≥ 2 predefined distinct groups and that the i 01 group is denoted by π 3 , i = 1, 2, ⋯ , k. The data matrix -6×, represent the measurements of all observations and the values for the p variables are; X , X , ⋯ , X : for each observation. Fisher's linear discriminant analysis looks for the linear function ( which maximized the ratio of the between-groups sum of squares to the within-groups sum of squares; that is, let be a linear combination of the columns of -. Then has total sum of squares.
which can be partitioned as; sum of the within-groups sum of squares, and the between-groups sum of squares, where is the mean of JK sub-vector of y, is the centering matrix, D =.
The discrimination rule is to

Results and Discussion
The Fisher's linear discriminant analysis technique was used to fit a predictive equation based on the measured variables for classifying new individuals, and to re-classify the original data to enable the interpretation of the predictive equation for better understanding of the relationships that may exist among the variables. The results of the analysis are presented as follows: In Table 1, 16 (40% of diabetic patients) and 102 (35.3% of non-diabetic patients) observation were wrongly classified in group I and II respectively using re-substitution. While 21 (52.5% of diabetic patient) and 110 (37.3% of non-diabetic patients) observation were wrongly classified in group I and II respectively using cross-validation. In general, the Fisher's linear discriminant function correctly classified 65.4% of the total observation using re-substitution and 62.7% of the total observation using cross-validating method. The Wilks' lambda in Table 2 shows the measure of how well each function separates cases into groups, smaller values of Wilks' lambda indicate greater discriminatory ability of the function. The more important independent variables that contribute significantly to the prediction by the discriminant function as indicated by the smaller values of the Wilks's lambda are, Age , Gender .  A discriminant score can be calculated based on the weighted combination of the independent variables in Table 3, where % * the predicted is or discriminant score.

Conclusion
This study applied the Fisher's linear Discriminant Analysis (FLDF) to health data on diabetic patients from the University of Port Harcourt Teaching Hospital, Rivers, Nigeria and created a predictive discriminant model that classifies patients into one of two groups (Diabetic and Non-Diabetic) based on demographic and socio-cultural information from each patient. Using the structure matrix, the results show that age and gender are the most useful variables for segmenting the patient base. The Fisher's linear discriminant function correctly reclassify 65.4% of the total observation using re-substitution and 62.7% of the total observation using cross-validating method. The fitted model also predicts group membership of new patients adequately.