Discriminant Analysis for the Eigenvalues of Variance Covariance Matrix of FFT Scaling of DNA Sequences: An Empirical Study of Some Organisms

Abstract: Many studies have discussed different numerical representations of DNA sequences. One naive approach for exploring the nature of a DNA sequence is to assign numerical values (or scales) to the nucleotides and then proceed with standard time series methods; the analysis then depends on the particular assignment of numerical values. Discriminant analysis examines the dependence of one qualitative (classification) variable on several quantitative variables, and it can be carried out for two or more groups, according to the number of categories of the qualitative variable. Its essential task is to obtain optimal assignment rules that minimize the probability of misclassifying elements. In this paper, we discuss the discriminant analysis of the first, second, third and fourth eigenvalues of the variance-covariance matrix of the Fast Fourier Transform (FFT) of the numerical representation of DNA sequences of five organisms: Human, E. coli, Rat, Wheat and Grasshopper. The analysis is based on three methods of discrimination (All Variables, Forward Selection and Backward Selection). Functions are obtained by which the organisms under consideration can be discriminated. Empirical studies are conducted to show the value of this point of view and of the applications based on it. We therefore recommend that further empirical studies be carried out for other organisms and other statistical methods using the point of view adopted here, and that the aspects stated here be applied in practice to the discrimination of DNA sequences.


Introduction
Discriminant analysis is a multivariate statistical method that serves to set up a model for predicting group membership. The model consists of discriminant functions, formed as linear combinations of the predictive variables, that provide the best discrimination between groups. These functions are derived from a sample whose group memberships are known. Afterward, they can be applied to new individuals or units with measurements on the same variables and unknown group memberships. Although discriminant analysis is not frequently used in the behavioral sciences because its assumptions are not always easy to meet, it is a conceptually and mathematically powerful multivariate statistical method. Therefore, a description and illustration of the discriminant analysis method may help increase its use [1].
In different areas of application the term "discriminant analysis" has come to imply distinct meanings, uses and roles. In the fields of learning, psychology, guidance, and others, it has been used for prediction [2][3][4]; in the study of classroom instruction it has been used as a variable reduction technique [5]; and in various fields it has been used as an adjunct to MANOVA [6]. In this sense, discriminant analysis as a general research technique can be very useful in the investigation of various aspects of a multivariate research problem. Tatsuoka and Tiedeman [7] emphasized the multiphasic character of discriminant analysis in the early 1950s: (a) the establishment of significant group differences, (b) the study and 'explanation' of these differences, and finally (c) the utilization of multivariate information from the samples studied in classifying a future individual known to belong to one of the groups represented. Essentially the same three problems remain central to discriminant analysis.
Originally developed in 1936 by R. A. Fisher [8,9], discriminant analysis is a classic method of classification that has stood the test of time. It often produces models whose accuracy approaches (and occasionally exceeds) that of more complex modern methods. Discriminant analysis can be used only for classification (i.e., with a categorical target variable), not for regression. The target variable may have two or more categories.
Discriminant analysis is a powerful statistical pattern recognition method that has been applied to many DNA sequence motif finding problems. In other words, discriminant analysis is widely used in biological analyses, including DNA analysis. Some of the relevant scientific literature is reviewed below.
Solovyev and Salamov [10] introduced a complex of new programs for the identification of promoters, 3'-processing sites, splice sites, coding exons and gene structure in the genomic DNA of several model species. The human gene structure prediction program FGENEH, the exon prediction program FEXH and the splice site prediction program HSPL were modified for the sequence analysis of Drosophila (FGENED, FEXD and DSPL), C. elegans (FGENEN, FEXN and NSPL), Yeast (FEXY and YSPL) and Plant (FGENEA, FEXA and ASPL) genomic sequences. They recomputed all frequency and discriminant function parameters for these organisms and adjusted the organism-specific minimal intron lengths. The accuracy of coding region prediction for these programs is similar to the observed accuracy of FEXH and FGENEH. They developed the FEXHB and FGENEHB programs, which combine pattern recognition features with information about the similarity of predicted exons to known sequences in protein databases. These programs have approximately 10% higher average accuracy of coding region recognition. Two new programs for human promoter site prediction (TSSG and TSSW) were developed, which use the functional motif databases of Ghosh [11] and Wingender [12], respectively. The POLYAH program was designed for the prediction of 3'-processing regions in human genes, and the CDSB program was developed for bacterial gene prediction. They also developed a new approach to predicting multiple genes based on double dynamic programming, which is very important for the analysis of the long genomic DNA fragments generated by genome sequencing projects.
Since the identification of functional motifs in a DNA sequence is fundamentally a statistical pattern recognition problem, discriminant analysis is widely used for solving such problems. Zhang [13] described two basic parametric methods: LDA (linear discriminant analysis) and QDA (quadratic discriminant analysis). He demonstrated their usage in the recognition of splice sites and exons in the human genome.
Dudoit et al. [14] compared the performance of different discrimination methods for the classification of tumors based on gene expression data. These methods included nearest neighbor classifiers, linear discriminant analysis, and classification trees. They also considered recent machine learning approaches such as bagging and boosting, and investigated the use of prediction votes to assess the confidence of each prediction. The methods were applied to datasets from three recently published cancer gene expression studies.
Kwon et al. [15] investigated the causal relationship between several tumors and gene-expression data by sequentially using the stepwise discriminant analysis method (SDA) and Bayesian decision theory (BDT). Eighty-five samples containing four tumor classes were used in this study. The classes are neuroblastoma (NB), rhabdomyosarcoma (RMS), non-Hodgkin lymphoma (BL) and the Ewing family of tumors (EWS). SDA was used to select the genes critical for accurate classification of the 4 tumors from the original 2308 genes. With the selected genes, a Bayesian classifier was built that minimizes the misclassification rate. As a result, the classification performance increased to 100%, and 9 new genes related to the development of the tumors were additionally found.
Liu et al. [16] analyzed various functional regions of the human genome based on sequence features, including word frequency, dinucleotide relative abundance, and base-base correlation. They analyzed human chromosome 22 and classified the upstream, exon, intron, downstream, and intergenic regions by principal component analysis and discriminant analysis of these features. The results show that the functional regions of the genome can be classified on the basis of sequence features and discriminant analysis.
Guo et al. [17], in the same year, presented a modified version of linear discriminant analysis called "shrunken centroids regularized discriminant analysis" (SCRDA). The SCRDA method is specially designed for classification problems in high dimension, low sample size situations, for example, microarray data. Using both simulated data and real-life data, they showed that this method performs very well in multivariate classification problems, often outperforms the PAM method and can be as competitive as SVM classifiers. It is also suitable for feature elimination and can be used as a gene selection method.
Jombart et al. [18] proposed the discriminant analysis of principal components (DAPC), a multivariate method designed to identify and describe clusters of genetically related individuals. When group priors are lacking, DAPC uses sequential K-means and model selection to infer genetic clusters. They evaluated the performance of their method using simulated data, which were also analyzed using STRUCTURE as a benchmark. Additionally, they illustrated the method by analyzing microsatellite polymorphism in worldwide human populations and hemagglutinin gene sequence variation in seasonal influenza.
Salah Hamza Abid and Jinan Hamza Farhood: Discriminant Analysis for the Eigenvalues of Variance Covariance Matrix of FFT Scaling of DNA Sequences: An Empirical Study of Some Organisms
It is well known that outliers are present in virtually every data set in any application domain, and classical discriminant analysis methods (including linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA)) do not work well if the data set has outliers. In order to overcome this difficulty, Jin and An [19] used a robust statistical method. They chose four different coding characters as discriminant variables and reported a satisfactory result obtained by robust discriminant analysis.
Libbrecht et al. [20], provided an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. They introduced considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. They provided general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.
Corvelo et al. [21], introduced taxMaps, a highly efficient, sensitive, and fully scalable taxonomic classification tool. Using a combination of simulated and real metagenomics data sets, they demonstrate that taxMaps is more sensitive and more precise than widely used taxonomic classifiers and is capable of delivering classification accuracy comparable to that of BLASTN, but at up to three orders of magnitude less computational cost.

DNA Sequence
In the process of developing the technology, many possible interesting adaptations became apparent. One of the most interesting directions was the use of the technology in the analysis of long DNA sequences. A benefit of the technique was that it combined rigorous statistical analysis with modern computer power to quickly search for diagnostic patterns within long DNA sequences. Briefly, a DNA strand can be viewed as a long string of linked nucleotides. Each nucleotide is composed of a nitrogenous base, a five-carbon sugar, and a phosphate group. There are four different bases that can be grouped by size: the pyrimidines, thymine (T) and cytosine (C), and the purines, adenine (A) and guanine (G). The nucleotides are linked together by a backbone of alternating sugar and phosphate groups, with the 5' carbon of one sugar linked to the 3' carbon of the next, giving the string direction. DNA molecules occur naturally as a double helix composed of polynucleotide strands with the bases facing inward. The two strands are complementary, so it is sufficient to represent a DNA molecule by a sequence of bases on a single strand (refer to [22]). A common problem in analyzing long DNA sequence data is identifying coding sequences (CDS) that are dispersed throughout the sequence and separated by regions of noncoding DNA (which makes up most of the DNA). Another problem of interest that we will address here is that of matching two DNA sequences, say $X_{1t}$ and $X_{2t}$. The background behind the problem is discussed in detail in the study by Waterman and Vingron [23]. For example, every new DNA or protein sequence is compared with one or more sequence databases to find similar or homologous sequences that have already been studied, and there are numerous examples of important discoveries resulting from these database searches. One naive approach for exploring the nature of a DNA sequence is to assign numerical values (or scales) to the nucleotides and then proceed with standard time series methods.
It is clear, however, that the analysis will depend on the particular assignment of numerical values. Consider the artificial sequence ACGTACGTACGT... Then, setting A = G = 0 and C = T = 1 yields the numerical sequence 010101010101..., or one cycle every two base pairs (i.e., a frequency of oscillation of $\omega = 1/2$ cycle/bp, or a period of oscillation of length $1/\omega = 2$ bp/cycle). Another interesting scaling is A = 1, C = 2, G = 3, and T = 4, which results in the sequence 123412341234..., or one cycle every four bp ($\omega = 1/4$). In this example, both scalings of the nucleotides are interesting and bring out different properties of the sequence. It is clear, then, that one does not want to focus on only one scaling. Instead, the focus should be on finding all possible scalings that bring out interesting features of the data. Rather than choose values arbitrarily, the spectral envelope approach selects scales that help emphasize any periodic feature that exists in a DNA sequence of virtually any length in a quick and automated fashion. In addition, the technique can determine whether a sequence is merely a random assignment of letters [22].
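The dependence on the scaling can be checked numerically. The following is a minimal sketch in plain Python, assuming a 40 bp repetition of ACGT; the helper names `periodogram` and `dominant_frequency` are illustrative and do not come from the paper.

```python
import cmath

def periodogram(x):
    """Squared modulus of the DFT (divided by n) at frequencies j/n, j = 0..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # remove the mean so frequency 0 does not dominate
    values = []
    for j in range(n // 2 + 1):
        d = sum(v * cmath.exp(-2j * cmath.pi * j * t / n) for t, v in enumerate(centered))
        values.append(abs(d) ** 2 / n)
    return values

def dominant_frequency(seq, scale):
    """Frequency (cycles per bp) at which the scaled sequence has its largest periodogram value."""
    x = [scale[base] for base in seq]
    p = periodogram(x)
    j = max(range(1, len(p)), key=lambda i: p[i])
    return j / len(x)

seq = "ACGT" * 10  # artificial sequence ACGTACGT... (assumed length: 40 bp)
f_binary = dominant_frequency(seq, {"A": 0, "G": 0, "C": 1, "T": 1})
f_1234 = dominant_frequency(seq, {"A": 1, "C": 2, "G": 3, "T": 4})
print(f_binary, f_1234)  # 0.5 0.25
```

The same sequence therefore shows a peak at 1/2 cycle/bp under the first scaling and at 1/4 cycle/bp under the second, which is the point made in the text.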
Fourier analysis has been applied successfully in DNA analysis; McLachlan and Stewart [24] and Eisenberg et al. [25] studied the periodicity in proteins using Fourier analysis.
Stoffer et al. [26] proposed the spectral envelope as a general technique for analyzing categorical-valued time series in the frequency domain. The basic technique is similar to the methods established by Tavaré and Giddings [27] and Viari et al. [28]; however, there are some differences. The main difference is that the spectral envelope methodology is developed in a statistical setting to allow the investigator to distinguish between significant results and those results that can be attributed to chance.
The article authored by Marhon and Kremer [29] partitions the identification of protein-coding regions into four discrete steps. Based on this partitioning, digital signal processing (DSP) techniques can be easily described and compared based on their unique implementations of the processing steps. They compared the approaches and discussed the strengths and weaknesses of each in the context of different applications. Their work provides an accessible introduction and comparative review of DSP methods for the identification of protein-coding regions. Additionally, by breaking down the approaches into four steps, they suggested new combinations that may be worthy of future studies.
A new methodology for the analysis of DNA/RNA and protein sequences is presented by Bajic [30]. It is based on a combined application of spectral analysis and artificial neural networks for the extraction of a common spectral characterization of a group of sequences that have the same or similar biological functions. The method does not rely on homology comparison and provides a novel insight into the inherent structural features of a functional group of biological sequences. The nature of the method allows possible applications to a number of relevant problems, such as recognizing the membership of a particular sequence in a specific functional group or localizing an unknown sequence of a specific functional group within a longer sequence. The results are of a general nature and represent an attempt to introduce a new methodology to the field of biocomputing.
Fourier transform infrared (FTIR) spectroscopy was considered by Han et al. [31] as a powerful tool for analysing the characteristics of DNA sequences. Their work investigated the key factors in FTIR spectroscopic analysis of DNA and explored the influence of FTIR acquisition parameters, including FTIR sampling techniques, pretreatment temperature, and sample concentration, on calf thymus DNA. The results showed that the FTIR sampling techniques had a significant influence on the spectral characteristics, spectral quality, and sampling efficiency.
Ruiz et al. [32] proposed a novel approach for performing cluster analysis of DNA sequences that is based on the use of genomic signal processing (GSP) methods and the K-means algorithm. They also proposed a visualization method that facilitates the inspection and analysis of the results and of possible hidden behaviors. Their results support the feasibility of employing the proposed method to find and easily visualize interesting features of sets of DNA data.
A novel clustering method was proposed by Hoang et al. [33] to classify genes and genomes. For a given DNA sequence, a binary indicator sequence of each nucleotide is constructed, and the Discrete Fourier Transform is applied to these four sequences to obtain their respective power spectra. Mathematical moments are built from these spectra, and multidimensional vectors of real numbers are constructed from these moments. Cluster analysis is then performed in order to determine the evolutionary relationship between DNA sequences. The novelty of this method is that sequences with different lengths can be compared easily via the use of power spectra and moments. Experimental results on various datasets show that the proposed method provides an efficient tool to classify genes and genomes. It not only gives comparable results but is also remarkably faster than other multiple sequence alignment and alignment-free methods.
One challenge of GSP is how to minimize the error of detection of the protein coding region in a specified DNA sequence with a minimum processing time. Since the type of numerical representation of a DNA sequence strongly affects the prediction accuracy and precision, Mabrouk [34] compared different DNA numerical representations by measuring the sensitivity, specificity, correlation coefficient (CC) and processing time for protein coding region detection. The proposed technique, based on digital filters, was used to read out the period-3 components and to eliminate the unwanted noise from the DNA sequence. This method, applied to 20 human genes, demonstrated that the maximum accuracy and minimum processing time are obtained with the 2-bit binary representation compared with the other representation methods used. The results suggest that using the 2-bit binary representation significantly enhanced the accuracy of detection and the efficiency of the prediction of coding regions using digital filters.
Identification and analysis of hidden features of coding and non-coding regions of a DNA sequence is a challenging problem in the area of genomics. The objective of the paper authored by Roy and Barman [35] is to estimate and compare the spectral content of coding and non-coding segments of a DNA sequence by both parametric and nonparametric methods. An attempt was thus made to bring to light some hidden internal properties of the DNA sequence in order to distinguish coding regions from non-coding ones. In this approach, DNA sequences from various Homo sapiens genes were selected as test samples and assigned numerical values based on weak-strong hydrogen bonding (WSHB) before the application of digital signal analysis techniques. The statistical methodology applied for the computation of the spectral content is simple, and the spectrum plots obtained show satisfactory results.
Spectral analysis can be applied to study base-base correlation in DNA sequences. A key role is played by the mapping between nucleotides and real/complex numbers. In 2006, Galleani and Garello [36] presented a new approach where the mapping is not kept fixed: it is allowed to vary aiming to minimize the spectrum entropy, thus detecting the main hidden periodicities. The new technique is first introduced and discussed through a number of case studies, then extended to encompass time-frequency analysis.
For analyzing periodicities in categorical-valued time series, the concept of the spectral envelope was introduced by Stoffer et al. [37] as a computationally simple and general statistical methodology for the harmonic analysis and scaling of non-numeric sequences. The spectral envelope methodology is computationally fast and simple because it is based on the fast Fourier transform and is nonparametric (i.e., it is model independent). This makes the methodology ideal for the analysis of long DNA sequences. Fourier analysis has been used in the analysis of correlated data (time series) since the turn of the century. Of fundamental interest in the use of Fourier techniques is the discovery of hidden periodicities or regularities in the data. Although Fourier analysis and related signal processing are well established in the physical sciences and engineering, they have only recently been applied in molecular biology. Since a DNA sequence can be regarded as a categorical-valued time series, it is of interest to discover ways in which time series methodologies based on Fourier (or spectral) analysis can be applied to discover patterns in a long DNA sequence or similar patterns in two long sequences. In effect, the spectral envelope is an extension of spectral analysis to categorical-valued data such as DNA sequences.
An algorithm for estimating the spectral envelope and the optimal scalings, given a particular DNA sequence over a finite alphabet, ends as follows: the optimal sample scaling is determined by $v(\omega)$, the eigenvector obtained in the previous step.
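As an illustration only, a spectral envelope estimate of this kind could be sketched as follows. This is a sketch under stated assumptions rather than the algorithm of the cited authors: the last letter of the alphabet is dropped as a reference category, the raw (unsmoothed) periodogram matrix is used, the normalizing constant is chosen for convenience, and the function name `spectral_envelope` and the test sequence are hypothetical.

```python
import numpy as np

def spectral_envelope(seq, alphabet="ACGT"):
    """Raw spectral envelope estimate at frequencies j/n, j = 1..n//2 (assumed normalization)."""
    n = len(seq)
    # Indicator vectors for all letters except the last one (reference category).
    Y = np.array([[1.0 if base == a else 0.0 for a in alphabet[:-1]] for base in seq])
    Y = Y - Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)              # sample covariance of the indicators
    d = np.fft.fft(Y, axis=0) / np.sqrt(n)   # normalized DFT, one column per letter
    m = n // 2
    d = d[1:m + 1]
    # Real part of the periodogram matrix d d^* at each frequency.
    I_re = np.real(np.einsum("ja,jb->jab", d, d.conj()))
    # Symmetric inverse square root of S.
    w, V = np.linalg.eigh(S)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    # Largest eigenvalue of S^{-1/2} I_re S^{-1/2} at each frequency.
    envelope = np.array([np.linalg.eigvalsh(S_inv_half @ f @ S_inv_half)[-1] for f in I_re])
    freqs = np.arange(1, m + 1) / n
    return freqs, envelope

# Hypothetical test sequence: G at every third position, random bases elsewhere.
rng = np.random.default_rng(1)
seq = "".join("G" if t % 3 == 0 else rng.choice(list("ACGT")) for t in range(300))
freqs, envelope = spectral_envelope(seq)
peak_frequency = freqs[np.argmax(envelope)]
print(peak_frequency)
```

With a built-in period of three bases, the largest value of the envelope is expected near one cycle every three bp; the eigenvector belonging to that eigenvalue would give the corresponding scaling.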
In this paper, we discuss the discriminant analysis of the first, second, third and fourth eigenvalues of the variance-covariance matrix of the Fast Fourier Transform (FFT) of the numerical representation of DNA sequences of five organisms: Human, E. coli, Rat, Wheat and Grasshopper. The analysis is based on three methods of discrimination (All Variables, Forward Selection and Backward Selection). It should be noted that this is the first time that the eigenvalues of the variance-covariance matrix of the FFT of the numerical representation of DNA sequences have been used in an analysis of this kind and in related analyses.

Discriminant Analysis
Discriminant analysis aims to examine the dependence of one qualitative (classification) variable on several quantitative variables; according to the number of categories of the qualitative variable, we distinguish discriminant analysis for two groups and for more than two groups [38]. The essential task of discriminant analysis is to obtain the optimal assignment rules that minimize the probability of misclassifying elements. Every element is characterized by certain features that reflect its properties. This means that, in terms of the measured characteristics, the examined elements are realizations of the random vector $X = (X_1, X_2, \ldots, X_n)$. The process starts with the analysis of a group of elements (the training set) for which both the group membership and the values of the random variables are known. The result of the analysis of the training set is a discriminant function that determines the likelihood of classifying a new, unclassified element into a certain group according to the measured values $x = (x_1, x_2, \ldots, x_n)$ of its characteristics [39]. Two basic aims of discriminant analysis are stated by Stankovičová and Vojtková [38]. The first aim is to find an appropriate statistical way to distinguish between groups (descriptive or analytical). The second aim is to assign a new statistical unit (object), described by a vector of k features, to one of the established groups (classification).

Discriminant Analysis: Aims and Assumptions
The aims of discriminant analysis are given by Meloun et al. [40]:
(1) Determine whether there are statistically significant differences among the profiles of the average scores of the discriminators for two or more pre-defined classes.
(2) Determine which of the discriminators contribute most to the differences in the average score profiles of two or more classes.
(3) Define procedures for assigning objects to classes according to their scores on the set of discriminators.
(4) Determine the number and composition of the dimensions of discrimination among classes formed by the set of discriminators.
The assumptions of the discrimination model are as follows.
(1) Multivariate normal distribution. This assumption is needed in order to conduct tests of significance of the individual discriminatory variables and of the discriminatory functions. If the data are not distributed as multivariate normal, the results of classification may be inaccurate. However, the total classification error need not be affected by a violation of the normality assumption, because the classification error in one group may be overestimated and in another group underestimated [39].
(2) There must be at least two groups, with each case belonging to only one group, so that the groups are mutually exclusive and collectively exhaustive.
(3) Each group must be well defined and clearly distinguished from every other group.
(4) The groups should be defined before the data are collected [41].
(5) Equality of variance-covariance within groups.
(6) The covariance matrices within the groups should be equal. An equality test of covariance matrices can be used to verify this. When in doubt, try re-running the analysis using the quadratic method, adding more observations, or excluding one or two groups.
(7) Low multicollinearity of the variables. When high multicollinearity among two or more variables is present, the discriminant function coefficients will not reliably predict group membership. The pooled within-groups correlation matrix can be used to detect multicollinearity [42].

Practical and Computational Steps for Discrimination and Classification
In this section, we introduce discrimination and classification from the practical and computational points of view.

Discrimination Among Several Populations
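Suppose that we have $p$ populations, and that from the $i$-th population a sample $X_{i1}, X_{i2}, \ldots, X_{in_i}$ of size $n_i$ is available, $i = 1, \ldots, p$. Let $\bar X_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij}$ denote the sample mean vector of the $i$-th population and $\bar X = \sum_{i=1}^{p} n_i \bar X_i \big/ \sum_{i=1}^{p} n_i$ the overall mean vector. Then the sample between-groups matrix is
$$B = \sum_{i=1}^{p} n_i\,(\bar X_i - \bar X)(\bar X_i - \bar X)'.$$
The sample within-groups matrix is
$$W = \sum_{i=1}^{p}\sum_{j=1}^{n_i} (X_{ij} - \bar X_i)(X_{ij} - \bar X_i)'.$$
Thus, for a linear combination $a'X$, the ratio of between-groups to within-groups variability is
$$\frac{a'Ba}{a'Wa} = \frac{\sum_{i=1}^{p} n_i\,\big(a'\bar X_i - a'\bar X\big)^2}{\sum_{i=1}^{p}\sum_{j=1}^{n_i}\big(a'X_{ij} - a'\bar X_i\big)^2},$$
and the sample discriminants $\hat a_1'X, \hat a_2'X, \ldots, \hat a_s'X$ are the linear combinations that successively maximize this ratio; the vectors $\hat a_k$ are eigenvectors of $W^{-1}B$ belonging to its $s$ nonzero eigenvalues, taken in decreasing order. The pooled estimate of the common covariance matrix based on the $p$ samples is
$$S_{pooled} = \frac{W}{n_1 + n_2 + \cdots + n_p - p}.$$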
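The quantities of this subsection can be computed directly. The following NumPy sketch assumes the standard Fisher construction (between-groups matrix weighted by the group sizes, eigenvectors of $W^{-1}B$ scaled so that $\hat a' S_{pooled}\, \hat a = 1$); the function name `fisher_discriminants` and the simulated groups are illustrative and are not the data of this study.

```python
import numpy as np

def fisher_discriminants(groups):
    """groups: list of (n_i x q) arrays, one per population.
    Returns B, W, S_pooled, the leading eigenvalues of W^{-1}B and the discriminant vectors."""
    grand_mean = np.vstack(groups).mean(axis=0)
    q = grand_mean.size
    B = np.zeros((q, q))
    W = np.zeros((q, q))
    for X in groups:
        mean_i = X.mean(axis=0)
        diff = (mean_i - grand_mean)[:, None]
        B += X.shape[0] * (diff @ diff.T)     # between-groups contribution
        centered = X - mean_i
        W += centered.T @ centered            # within-groups contribution
    n_total = sum(X.shape[0] for X in groups)
    S_pooled = W / (n_total - len(groups))
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    s = min(len(groups) - 1, q)               # number of nonzero eigenvalues
    eigvals = eigvals.real[order][:s]
    A = eigvecs.real[:, order][:, :s]
    # Scale each column a so that a' S_pooled a = 1.
    A = A / np.sqrt(np.einsum("ij,ik,kj->j", A, S_pooled, A))
    return B, W, S_pooled, eigvals, A

# Simulated example: three groups of 30 observations on 4 variables (hypothetical data).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=(30, 4)) for m in (0.0, 1.0, 2.0)]
B, W, S_pooled, eigvals, A = fisher_discriminants(groups)
```

A quick consistency check is that $B + W$ equals the total sum-of-squares matrix computed around the overall mean.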

Classification for Several Populations
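Fisher's classification procedure based on the first $r \le s$ sample discriminants $\hat a_1'X, \ldots, \hat a_r'X$ is to assign the observation $X_0$ to the first population if
$$\sum_{j=1}^{r}\big[\hat a_j'(X_0 - \bar X_1)\big]^2 \le \sum_{j=1}^{r}\big[\hat a_j'(X_0 - \bar X_i)\big]^2 \quad \text{for all } i \ne 1,$$
where $\bar X_i$ is the sample mean vector of the $i$-th population.
The intuition of Fisher's method is as follows. The inequality above implies that the total distance between the transformed $X_0$ and the transformed mean of the first population is smaller than the distance between the transformed $X_0$ and the transformed mean of any other population. In other words, $X_0$ is closer to the first population than to the other populations. Therefore, $X_0$ is assigned to the first population.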
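The rule amounts to a nearest-mean assignment in the space of the discriminants. A minimal sketch, assuming the discriminant vectors are stored as the columns of a matrix `A` and using toy means (both hypothetical):

```python
import numpy as np

def fisher_classify(x0, group_means, A):
    """Assign x0 to the population whose transformed mean is closest.
    A holds the first r discriminant vectors as columns; returns the 0-based group index."""
    distances = [float(np.sum((A.T @ (x0 - mean)) ** 2)) for mean in group_means]
    return int(np.argmin(distances)), distances

# Toy illustration with two variables, two populations and A = identity (hypothetical values).
A_toy = np.eye(2)
means_toy = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
label_a, dist_a = fisher_classify(np.array([1.0, 0.0]), means_toy, A_toy)
label_b, dist_b = fisher_classify(np.array([4.0, 6.0]), means_toy, A_toy)
print(label_a, label_b)  # 0 1
```

Index 0 corresponds to the first population in the notation of the text.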

Stepwise Discriminant Analysis
In stepwise discriminant analysis, a large number of variables is considered and, through a series of steps, the variables that discriminate best are selected and the discriminant function is built from them. The variables can be chosen according to the following procedures (Stankovičová and Vojtková (2007) [38]).
(1) Forward selection: Variables enter the discriminant function one at a time, and at each step the variable that brings the greatest gain in discrimination is chosen. If this gain is not statistically significant, no new variable enters the function.
(2) Backward selection: All variables are first entered into the discriminant function, and those whose removal does not cause a statistically significant decrease in discrimination are gradually removed. The process ends when any further removal would cause a significant decrease in the discrimination between groups.
(3) Stepwise selection: This procedure is a combination of the two previous ones. New variables enter the discriminant function one at a time, always choosing the one that contributes most to discrimination, while at every step it is checked whether a variable already in the function can be eliminated, that is, whether its removal would not significantly decrease the discrimination [39].
These procedures lead to the same results provided that the input variables are mutually uncorrelated. If the correlation between the input variables is significant, it is appropriate to use stepwise selection, in which a variable selected initially may be excluded in later steps because it is merely a correlate of other variables in the model. The following statistic is used as the criterion for deciding whether a variable enters the model or is eliminated from it [40].
Wilks Lambda (λ)
The Wilks λ statistic is the ratio of the within-group variability to the total variability. At every step, the variable that gives the minimum value of this statistic is chosen. The significance of the change in the Wilks criterion after a discriminator is entered into or removed from the model is assessed by an F test. The value of F for the change in the Wilks criterion when a discriminator is added to a model that includes $p$ discriminators is calculated as follows,
$$F = \frac{n - g - p}{g - 1}\cdot\frac{1 - \lambda_{p+1}/\lambda_p}{\lambda_{p+1}/\lambda_p},$$
where $p$ represents the number of discriminators in the model, $n$ the total number of objects, $g$ the number of classes, and $\lambda_p$ and $\lambda_{p+1}$ the Wilks criterion before and after adding the discriminator to the model, respectively.
Härdle and Simar (2012) [43] derived Wilks lambda as follows,
$$\Psi = \frac{|W|}{|W + B|},$$
where $W$ and $B$ are the within-group and between-group sum-of-squares matrices. The smaller the value of $\Psi$, the more doubt is cast upon the null hypothesis of no difference between the groups.
The proportion of variance in the grouping variable that is explained by the predictor variables is obtained by subtracting $\Psi$ from one [41].
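A small numerical sketch of these quantities may help. It assumes Wilks lambda is computed as the determinant ratio $|W| / |W + B|$ and that the F statistic for the change in lambda has the usual partial-F form with $(g - 1,\; n - g - p)$ degrees of freedom; the matrices and lambda values below are toy numbers, not results from this study.

```python
import numpy as np

def wilks_lambda(W, B):
    """Ratio of within-group to total generalized variance: |W| / |W + B|."""
    return np.linalg.det(W) / np.linalg.det(W + B)

def partial_f(lambda_before, lambda_after, n, g, p):
    """F statistic for the change in Wilks lambda when one discriminator is added
    to a model that already contains p discriminators (assumed standard form)."""
    ratio = lambda_after / lambda_before
    f_stat = ((n - g - p) / (g - 1)) * (1.0 - ratio) / ratio
    return f_stat, (g - 1, n - g - p)

# Toy matrices (hypothetical): |W| = 8 and |W + B| = 32.
W_toy = np.array([[4.0, 0.0], [0.0, 2.0]])
B_toy = np.array([[4.0, 0.0], [0.0, 2.0]])
psi = wilks_lambda(W_toy, B_toy)   # 0.25
explained = 1.0 - psi              # 0.75, share of variance attributed to the predictors

# Toy step: lambda drops from 0.8 to 0.4 with n = 105 objects, g = 5 classes, p = 0.
f_stat, dof = partial_f(0.8, 0.4, n=105, g=5, p=0)
print(psi, explained, f_stat, dof)
```

In a forward step, such an F value would be computed for every candidate variable, and the candidate with the smallest resulting lambda would enter only if its F value is significant.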

An Empirical Study
The following algorithm steps are performed to achieve our aims.
(1) Generate the DNA sequences for the five organisms, Human, E. coli, Rat, Wheat and Grasshopper, with the corresponding information in Table 1. The sequence size is n = 500 and the run size is k = 205.
(2) Transform each DNA sequence to numerical values by setting one for the base that appears and zero for the other bases.
(3) Transform the sequence of numerical values to the corresponding FFT values.
(4) Calculate the eigenvalues of the variance-covariance matrix for the results of each run, which gives 205 four-dimensional vectors of eigenvalues for each organism. Each vector contains the four eigenvalues, ranked from the largest to the smallest.
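The steps above can be sketched as follows. This is one possible reading, not the authors' code: the sequences are simulated from hypothetical base probabilities (the values of Table 1 are not reproduced here), the FFT is applied to each of the four indicator columns, and the 4 x 4 covariance matrix of the complex FFT values is taken across frequencies, which makes it Hermitian with real eigenvalues.

```python
import numpy as np

BASES = "ACGT"

def fft_cov_eigenvalues(seq):
    """Eigenvalues (largest first) of the covariance matrix of the FFT of the base indicators."""
    # One column per base: 1 where the base appears, 0 otherwise.
    indicators = np.array([[1.0 if base == a else 0.0 for a in BASES] for base in seq])
    spectrum = np.fft.fft(indicators, axis=0)   # FFT of each indicator column
    cov = np.cov(spectrum, rowvar=False)        # 4 x 4 Hermitian covariance matrix
    return np.linalg.eigvalsh(cov)[::-1]        # real eigenvalues, descending order

rng = np.random.default_rng(0)
base_probs = [0.3, 0.2, 0.2, 0.3]   # hypothetical base composition for one organism
n, k = 500, 205                     # sequence size and run size stated in the text

runs = np.array([
    fft_cov_eigenvalues("".join(rng.choice(list(BASES), size=n, p=base_probs)))
    for _ in range(k)
])
print(runs.shape)  # (205, 4)
```

Repeating this for each organism and stacking the results with an organism label would give the data matrix to which the discriminant methods are applied.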
The All Variables, Forward Selection and Backward Selection methods of discrimination were applied to the first, second, third and fourth eigenvalues of the variance-covariance matrix of the Fast Fourier Transform (FFT) of the numerical representation of DNA sequences of the five organisms, Human, E. coli, Rat, Wheat and Grasshopper. It should be noted that this is the first time that the eigenvalues of the variance-covariance matrix of the FFT of the numerical representation of DNA sequences have been used in an analysis of this kind and in related analyses.
For convenience, in the following discussion we refer to each organism by the first letter of its name. The three methods of discrimination (All Variables, Forward Selection and Backward Selection) are designed to develop a set of discriminating functions which can help predict the classification factor cf from the values of the other quantitative variables. A total of 1017 cases were used to develop a model to discriminate among the 5 levels of cf. Using a stepwise selection algorithm, it was determined that 4 variables were significant predictors of cf; that is, 4 predictor variables were entered. The 3 discriminating functions with P-values less than 0.05 are statistically significant at the 95.0% confidence level. These functions are used to predict the level of cf to which new observations belong. The classification function coefficients for cf in Table 4 show the functions used to classify observations. There is a function for each of the 5 levels of cf. For example, the function used for the first level of cf is
$$-263305 + 1054.3\,E_1 + 1051.03\,E_2 + 1052.79\,E_3 + 1054.87\,E_4,$$
where $E_1, E_2, E_3, E_4$ denote the first to fourth eigenvalues.
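Evaluating a classification function of this form is a simple linear computation, and an observation is assigned to the level whose function gives the highest score. The sketch below uses the coefficients quoted in the text for the first level only; the helper names, the two-function dictionary and the input values are hypothetical, since the remaining functions are those of Table 4.

```python
def classification_score(constant, coefficients, eigenvalues):
    """Linear classification function: constant + sum of coefficient * eigenvalue."""
    return constant + sum(c * e for c, e in zip(coefficients, eigenvalues))

def classify(functions, eigenvalues):
    """Return the label with the highest score, together with all scores."""
    scores = {label: classification_score(c0, coefs, eigenvalues)
              for label, (c0, coefs) in functions.items()}
    return max(scores, key=scores.get), scores

# Coefficients of the function for the first level of cf, as quoted in the text.
FIRST_LEVEL = (-263305.0, (1054.3, 1051.03, 1052.79, 1054.87))

# Score of the first-level function at a hypothetical eigenvalue vector.
score_first = classification_score(*FIRST_LEVEL, (1.0, 0.0, 0.0, 0.0))

# Toy two-level illustration of the assignment rule (hypothetical functions).
toy_functions = {"a": (0.0, (1.0,)), "b": (1.0, (0.0,))}
toy_label, toy_scores = classify(toy_functions, (2.0,))
print(toy_label)  # a
```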

Results and Discussion
As noted, the classification functions are used to predict the level of cf to which new observations belong. The relative magnitudes of the coefficients in the above equation indicate how the independent variables are used to discriminate among the groups.
Classification Table 6 shows the results of using the derived discriminant functions to classify observations. It lists the two highest scores among the classification functions for each of the 1017 observations used to fit the model, as well as for any new observations. For example, row 1 scored highest for cf = e and second highest for cf = w; in fact, the true value of cf was e. Among the 1017 observations used to fit the model, 583, or 57.3255%, were correctly classified. Additional observations can be predicted by adding new rows to the current data file, filling in values for each of the independent variables but leaving the cell for cf blank. The group centroids for cf in Table 7 show the average values of each of the 4 discriminant functions for each of the 5 values of cf.
The summary statistics by group in Table 8 show the averages and standard deviations of each independent variable for each level of cf. In addition, the pooled within-group statistics for cf in Table 9 show the estimated correlations between the independent variables within each group, where the within-group information from all of the groups has been pooled.
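The reported rate of correct classification follows directly from the two counts given above. The short check below also shows how the same rate would be obtained from lists of true and predicted labels; the function name and the toy labels are illustrative.

```python
def percent_correct(true_labels, predicted_labels):
    """Percentage of observations whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)

# Counts reported in the text: 583 of 1017 observations correctly classified.
reported_rate = round(100.0 * 583 / 1017, 4)
print(reported_rate)  # 57.3255

# Toy labels (hypothetical) using the first letters of the organism names.
toy_rate = percent_correct(["e", "h", "r", "w"], ["e", "w", "r", "w"])
print(toy_rate)  # 75.0
```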

Summary
Functions have been obtained by which the organisms are discriminated according to the eigenvalues of the variance-covariance matrix of the FFT of the numerical representation of DNA sequences, and by which any other observation can be classified into the organism to which it belongs.
The methods used here aim to discriminate among different organisms from another point of view. This point of view is based on the eigenvalues of the variance-covariance matrix of the FFT of the numerical representation of DNA sequences. It should be noted that this is the first time this point of view has been used to achieve aims like ours.
Empirical studies were conducted to show the value of this point of view and of the applications based on it. We therefore recommend that: 1. Other empirical studies should be carried out for other organisms and other statistical methods using the point of view adopted here. 2. The aspects stated here should be applied in practice to the discrimination of DNA sequences.