A Topological Approach of Principal Component Analysis

Large datasets are increasingly widespread in many disciplines. The exponential growth of data calls for new data analysis methods that process information more efficiently. To better visualize data, many methods, such as Principal Component Analysis (PCA) and MultiDimensional Scaling (MDS), extract a low-dimensional structure from a high-dimensional dataset. The proposed approach, called Topological Principal Component Analysis (TPCA), is a multidimensional descriptive method which studies a homogeneous set of continuous variables defined on the same set of individuals. It is a topological method of data analysis that consists of comparing and classifying proximity measures chosen from among the most widely used proximity measures for continuous data. Proximity measures play an important role in many areas of data analysis, and the results strongly depend on the measure chosen. So, among the many existing measures, which one is most useful? Are they all equivalent? How can we identify the one most appropriate for analyzing the correlation structure of a set of quantitative variables? TPCA proposes an appropriate adjacency matrix, associated with an unknown proximity measure, according to the data under consideration, then analyzes and visualizes, with graphic representations, the relationship structure of the variables, as in the well-known PCA problem. It uses the concept of neighborhood graphs and compares a set of proximity measures for continuous data, which can be more or less equivalent. A topological equivalence criterion between two proximity measures is defined and statistically tested according to the topological correlation between the variables considered. An example on real data illustrates the proposed approach.


Introduction
Choosing a proximity measure from among the many available greatly influences the results of any data analysis method. Moreover, these measures are more or less equivalent according to the neighborhood graph structure used.
A topological equivalence criterion is defined between proximity measures from the topological structure induced by each measure.
Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) [16,10,5,18] is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing information loss. It is an exploratory tool for continuous data.
PCA is an adaptive technique for continuous data, and variants have been developed and tailored to various data types and structures. To interpret large datasets, methods are required that drastically reduce their dimensionality in an interpretable way. Many techniques have been developed for this purpose, but PCA is one of the oldest and most widely used. Statistically, PCA is both a multivariate method for dimension reduction and a technique for representing data. Its aim is to find common factors, the so-called principal components, in the form of linear combinations of the variables under investigation. It gives an idea of the correlation structure of the set of variables, as well as of possible similarities of behavior between individuals.
In the context of artificial intelligence, we often compare situations represented by a set of objects; for this, we must choose and specify the proximity measure between objects. The study context, the data type and other factors can help us choose a suitable proximity measure. However, the number of possible measures can still be quite large. Moreover, are the remaining candidate measures all equivalent? Is one measure more specific or better suited than another for the study considered? In information retrieval, the choice of proximity measure is an essential issue on which the results depend.
The present study proposes a new framework for comparing proximity measures in order to identify those that are similar, so that we no longer need to try all of them.
These comparisons are made through a proximity measure, which evaluates the similarity or dissimilarity between two objects within a set. A proximity measure has well-specified mathematical properties and axioms.
The best measure is selected according to the correlation structure of the set of quantitative variables to be synthesized; the aim is to establish a topological PCA. The results of TPCA differ according to the selected proximity measure.
Several authors have studied the topological equivalence of proximity measures in a general framework [4,17,13,24], in the context of discriminant analysis [3] and in that of correspondence analysis [2,1], but none in the context of PCA. In this paper, we therefore show how to build the appropriate adjacency matrix, induced by an unknown proximity measure but taking into account the correlation structure of the variables that we want to describe topologically.
In this article, we compare different proximity measures with the aim of synthesizing the relationships among a set of continuous variables in a topological context. Comparison of these measures shows that the results differ depending on the proximity measure chosen. The rest of the paper is organized as follows. In Section 2, we discuss topological equivalence between two proximity measures and show how to build an adjacency matrix associated with a proximity measure, how to compare and statistically test the degree of topological equivalence between proximity measures, and how to select the best measure to describe topologically the correlation structure of the variables. Section 3 presents an illustrative example, surveys existing proximity measures for continuous data and compares them. This comparison helps researchers decide quickly which measure to use for the data considered. A conclusion is given in Section 4. Table 8 in the Appendix summarizes some classic proximity measures used for continuous data [23]; we give the definition of 15 of them on $\mathbb{R}^n$.
We assume that we have at our disposal $\{x_k;\ k=1, \dots, p\}$, a set of $p$ homogeneous quantitative variables measured on $n$ individuals. The aim is to analyze the topological structure of all these variables.

Topological Correlation
The notion of topological equivalence between two proximity measures is based on the concept of the neighborhood graph. Two measures are said to be topologically equivalent if the graphs they induce on the set of objects are identical. Measuring the similarity between two proximity measures amounts to measuring the similarity of their neighborhood graphs.
Consider a set $E=\{x_1, x_2, \dots, x_k, \dots, x_p\}$ of $p$ objects in $\mathbb{R}^n$, associated with the $p$ quantitative variables.
Given a proximity measure denoted $u$, we can define a binary neighborhood relationship on $E \times E$, denoted $V_u$. Thus, we can build a neighborhood graph on a set of objects-variables, where the vertices are the variables and the edges are defined by the neighborhood relationship. This graph is represented by a binary symmetric matrix.
Many graph definitions are possible to build this binary matrix. One can choose the Minimal Spanning Tree (MST) [11], the Gabriel Graph (GG) [15] or, as is the case here, the Relative Neighborhood Graph (RNG) [21].
So, given a proximity measure $u$, we can associate the adjacency matrix $V_u$ of order $p$, where all pairs $(x_k, x_l)$ of neighboring variables satisfy the following RNG expression:
$$V_u(x_k, x_l) = \begin{cases} 1 & \text{if } u(x_k, x_l) \le \max\left[u(x_k, x_t),\, u(x_t, x_l)\right] \quad \forall x_t \in E,\ x_t \ne x_k,\ x_t \ne x_l \\ 0 & \text{otherwise} \end{cases}$$
This means that two variables $x_k$ and $x_l$ which satisfy the RNG property are connected by an edge: the vertices $x_k$ and $x_l$ are neighbors.
Thus, for any given proximity measure $u$, we can associate an adjacency matrix $V_u$, binary and symmetric, of order $p$. Figure 1 illustrates an example of RNG in $\mathbb{R}^2$ for a set of $p=8$ objects-variables.
For example, for the first and fourth variables, $V_u(x_1, x_4)=1$ means that, in the geometrical plane, the hyper-lunula (the intersection of the two hyperspheres centered on the two variables $x_1$ and $x_4$) is empty.
For a given neighborhood property (MST, GG or RNG), each measure $u$ generates a topological structure on the objects in $E$, which is totally described by the binary adjacency matrix $V_u$. In this paper, we chose to use the Relative Neighborhood Graph (RNG).
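As an illustration of the RNG property, the following minimal Python sketch builds the binary adjacency matrix for a small set of objects. The function name `rng_adjacency`, the toy points and the use of the Euclidean distance as proximity measure $u$ are assumptions made for this example only. The diagonal is set to 1 by the same rule, since $u(x_k, x_k)=0$.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def rng_adjacency(points, u=dist):
    """Binary, symmetric RNG adjacency matrix of the objects under measure u."""
    p = len(points)
    # Pairwise dissimilarities u(x_k, x_l).
    d = [[u(points[k], points[l]) for l in range(p)] for k in range(p)]
    v = [[0] * p for _ in range(p)]
    for k in range(p):
        for l in range(p):
            # x_k and x_l are neighbours when no third object x_t is
            # strictly closer to both of them than they are to each other.
            v[k][l] = int(all(
                d[k][l] <= max(d[k][t], d[t][l])
                for t in range(p) if t != k and t != l
            ))
    return v


# Hypothetical toy data: four objects in the plane.
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1), (5.0, 5.0)]
V = rng_adjacency(points)
for row in V:
    print(row)
```

Any other proximity measure can be passed as `u`, provided it is a dissimilarity (smaller values mean closer objects); a similarity would require reversing the inequality.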

Comparison and Selection of Proximity Measures
First we compare different proximity measures according to their topological similarity in order to regroup them and to better visualize their resemblances.
To measure the topological equivalence between two proximity measures $u_i$ and $u_j$, we propose to test whether the associated adjacency matrices $V_{u_i}$ and $V_{u_j}$ are different or not. The degree of topological equivalence between two proximity measures is measured by a concordance index between their adjacency matrices, denoted $S(V_{u_i}, V_{u_j})$, which gives the percentage of similarity between the two matrices. In our case, we want to compare these different proximity measures according to their topological equivalence in a context of correlation. So we define a criterion for measuring the deviation from the independence position.
The data can arise from several different sampling frameworks, and the interpretation of the hypothesis of no association depends on the framework. The question of interest is whether there is correlation between the two variables.
We construct the adjacency matrix denoted by $V_{u^*}$, which corresponds to the correlation matrix.
Thus, to examine the correlation structure between the variables, we examine the significance of their linear correlation coefficients. This adjacency matrix can be written as follows, using the t-test of the Bravais-Pearson linear correlation coefficient $\rho$. The adjacency matrix $V_{u^*}$ associated with the reference measure $u^*$ satisfies the following expression:
$$V_{u^*}(x_k, x_l) = \begin{cases} 1 & \text{if } p\text{-value} = P\left[\,|T_{n-2}| > t\text{-value}\,\right] \le \alpha, \quad \forall k = 1, \dots, p,\ \forall l = 1, \dots, p \\ 0 & \text{otherwise} \end{cases}$$
where the p-value is that of the two-sided significance test of the correlation coefficient, with null and alternative hypotheses $H_0: \rho(x_k, x_l)=0$ vs. $H_1: \rho(x_k, x_l) \ne 0$.
The p-value measures the evidence against the null hypothesis. The smaller the p-value, the stronger the evidence for rejecting the null hypothesis, which states that there is no correlation between the variables $x_k$ and $x_l$ in the population.
The Student t statistic for the significance of the correlation is $t = r\sqrt{n-2}\,/\sqrt{1-r^2}$, with $\nu = n-2$ degrees of freedom (d.f.), where $r = r(x_k, x_l)$ is the linear correlation coefficient observed between the variables $x_k$ and $x_l$.
Let $T_{n-2}$ be a Student t-distributed random variable with $\nu = n-2$ d.f. The null hypothesis is rejected when the p-value is less than or equal to a chosen significance level $\alpha$, for example $\alpha = 5\%$. In the linear correlation test, a very small p-value means that data as extreme as those observed would be very unlikely if the null hypothesis were true, and consequently we can reject it. Statistical significance is achieved when the p-value is less than the significance level $\alpha$. The p-value is the probability, computed assuming that the null hypothesis is true, of obtaining results at least as extreme as those observed.
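The construction of the reference adjacency matrix can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `reference_adjacency`, the toy data and the convention of setting the diagonal to 1 are assumptions, and the code uses the usual statistic $t = r\sqrt{(n-2)/(1-r^2)}$ with a two-sided p-value from the Student distribution with $n-2$ d.f.

```python
import numpy as np
from scipy import stats


def reference_adjacency(X, alpha=0.05):
    """Adjacency matrix with 1 where the linear correlation is significant.

    X is an (n, p) array: n individuals in rows, p variables in columns.
    """
    n, p = X.shape
    r = np.corrcoef(X, rowvar=False)      # p x p Pearson correlations
    V = np.eye(p, dtype=int)              # assumption: each variable is its own neighbour
    for k in range(p):
        for l in range(k + 1, p):
            # Student t statistic with n - 2 degrees of freedom.
            t = r[k, l] * np.sqrt((n - 2) / (1.0 - r[k, l] ** 2))
            # Two-sided p-value P[|T_{n-2}| > |t|].
            p_value = 2.0 * stats.t.sf(abs(t), df=n - 2)
            V[k, l] = V[l, k] = int(p_value <= alpha)
    return V


# Hypothetical toy data: n = 20 individuals, p = 3 variables.
x = np.arange(20.0)
z = np.tile([1.0, -1.0], 10)   # alternating pattern, weakly correlated with x
y = x + 0.5 * z                # strongly correlated with x, uncorrelated with z
X_demo = np.column_stack([x, y, z])
V_ref = reference_adjacency(X_demo)
print(V_ref)
```

Rerunning the function with several values of `alpha` is one simple way to explore how sensitive the resulting matrix is to the chosen error risk.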
The robustness of the results to the $\alpha$ error risk chosen for the null hypothesis of no linear correlation can be studied by setting a minimum threshold and analyzing the sensitivity of the results. The numerical results will certainly change, but probably not their interpretation.
The binary and symmetric adjacency matrix $V_{u^*}$ thus built is associated with an unknown proximity measure denoted $u^*$ and called the reference measure. With this reference proximity measure, we can establish $S(V_{u_i}, V_{u^*})$, the topological equivalence between the two proximity measures $u_i$ and $u^*$, by measuring the percentage of similarity between the adjacency matrix $V_{u_i}$ and the reference adjacency matrix $V_{u^*}$.
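A minimal sketch of this percentage of similarity is given below. It assumes that $S$ is the share of the $p \times p$ entries on which the two adjacency matrices agree, which is one natural reading of "percentage of similarity"; the function name `concordance` and the two toy matrices are hypothetical.

```python
def concordance(v_i, v_j):
    """Share of entries on which two p x p adjacency matrices agree."""
    p = len(v_i)
    matches = sum(
        v_i[k][l] == v_j[k][l] for k in range(p) for l in range(p)
    )
    return matches / (p * p)


# Hypothetical adjacency matrices for p = 3 variables.
V_a = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
V_b = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]

s_same = concordance(V_a, V_a)   # identical matrices: perfect equivalence
s_diff = concordance(V_a, V_b)   # the matrices differ on 4 of the 9 entries
print(s_same, s_diff)
```

With this definition, $S=1$ corresponds to perfect topological equivalence, and $1-S$ can serve as a dissimilarity between proximity measures.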
In order to graphically describe the similarities between proximity measures, we can, for example, apply the notion of themascope [12], a methodological sequence in which a clustering method is applied to the results of a factorial method. In this case, a Principal Component Analysis (PCA) followed by a Hierarchical Ascendant Classification (HAC) was performed on the 15-component dissimilarity matrix defined by $[D]_{ij} = D(V_{u_i}, V_{u_j}) = 1 - S(V_{u_i}, V_{u_j})$, to partition the measures into homogeneous groups and to view their similarities.
We can use any classic visualization technique to achieve this. For example, we can build a dendrogram of the hierarchical clustering of the proximity measures. We can also use multidimensional scaling or any other technique, such as Laplacian projection, to map the 15 proximity measures into a two-dimensional space.
Finally, in order to evaluate and determine the class of proximity measures closest to the reference measure $u^*$, we project the latter as a supplementary element into the two data analysis methods, positioned by the dissimilarity vector with 15 components $[D]_{*i} = 1 - S(V_{u^*}, V_{u_i})$.
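The clustering step can be sketched as follows, under stated simplifications. The sketch takes $S$ as the share of matching entries (an assumption), positions each measure by its row of $D$, and applies Ward linkage directly to those rows; the intermediate PCA and the projection of $u^*$ as a supplementary element are omitted. The function name `dissimilarity_matrix` and the toy adjacency matrices are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def dissimilarity_matrix(adjacencies):
    """[D]_ij = 1 - S(V_ui, V_uj) for a list of p x p adjacency matrices."""
    A = np.asarray(adjacencies)
    m = A.shape[0]
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            # S is taken here as the share of matching entries (assumption).
            D[i, j] = 1.0 - np.mean(A[i] == A[j])
    return D


# Hypothetical adjacency matrices of four proximity measures (p = 3 variables).
V_a = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
V_b = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]
D = dissimilarity_matrix([V_a, V_a, V_b, V_b])

# Each measure is described by its row of D; Ward linkage on these rows.
Z = linkage(D, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Measures with identical adjacency matrices have identical rows in `D` and therefore fall into the same class, whatever the number of classes requested.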

Statistical Comparisons Between Proximity Measures
In this section, we use the Fisher's Exact Test [9] which is an alternative to the Chi-square test when the samples are small. The principle of this test is to determine if the configuration observed in the contingency table is an extreme situation compared to the possible situations taking into account the marginal distributions. Fisher's exact test is an exact statistical test used for the analysis of contingency tables. It is a test qualified as exact because the probabilities can be calculated exactly rather than relying on an approximation which becomes correct only asymptotically as for the chi-square test used in the contingency tables.
It is not based on a test statistic whose law is known when n is large enough but it calculates, as its name suggests, the exact p-value directly. To test statistically the topological equivalence between two proximity measures, this non parametric test compares these measures based on their associated adjacency matrices. Two proximity measures are statistically in topological equivalence if the null hypothesis H 0 of independence is rejected.
The comparison between indices of proximity measures has also been studied by Demsar [7] and Schneider & Borlund [19,20] from a statistical perspective. The authors proposed an approach that compares similarity matrices obtained by each proximity measure, using Mantel's test [14], in a pairwise manner.
Fisher's exact test is the statistical test best suited to comparing matched binary data. Cohen's Kappa test [6] is also suitable, but it is in general an asymptotic test, whereas the Kendall and Spearman coefficients compare matched continuous data. In this context, Fisher's exact test makes it possible to measure the agreement, or concordance, of the binary values of two adjacency matrices associated with two proximity measures. Fisher's exact test between two adjacency matrices thus evaluates the topological equivalence between their proximity measures.
Let $V_{u_i}$ and $V_{u_j}$ be the adjacency matrices associated with two proximity measures $u_i$ and $u_j$. To compare the degree of topological equivalence between these two measures, we propose to test whether the associated adjacency matrices are statistically different or not, using a non-parametric test for paired data. These binary and symmetric matrices of order $p$ are unfolded into two matched vectors of $p(p+1)/2$ values each: the $p$ diagonal values and the $p(p-1)/2$ values above or below the diagonal.
The degree of topological equivalence between two proximity measures is evaluated from Fisher's exact test, computed on the 2 × 2 contingency table formed by the two binary vectors of length $p(p+1)/2$.
We also test the topological equivalence between each proximity measure $u_i$, $i = 1, \dots, 15$, and the reference measure $u^*$ by comparing the adjacency matrices $V_{u_i}$ and $V_{u^*}$.
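The unfolding and the test can be sketched in plain Python as follows. The helper names (`unfold`, `contingency_table`, `fisher_exact_two_sided`) and the toy matrices are hypothetical, and the two-sided p-value uses the common convention of summing the probabilities of all tables that are at most as likely as the observed one; a library routine such as `scipy.stats.fisher_exact` could be used instead.

```python
from math import comb


def unfold(v):
    """Upper triangle, diagonal included, of a symmetric matrix as a flat list."""
    p = len(v)
    return [v[k][l] for k in range(p) for l in range(k, p)]


def contingency_table(v_i, v_j):
    """2 x 2 counts of paired entries; rows and columns ordered as (1, 0)."""
    table = [[0, 0], [0, 0]]
    for s, t in zip(unfold(v_i), unfold(v_j)):
        table[1 - s][1 - t] += 1
    return table


def fisher_exact_two_sided(table):
    """Two-sided Fisher exact p-value for a 2 x 2 table with fixed margins."""
    (a, b), (c, d) = table
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)
    # Sum over all tables at most as likely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical adjacency matrices for p = 4 variables (10 unfolded values).
V_block = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
V_ident = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

table = contingency_table(V_block, V_ident)
p_value = fisher_exact_two_sided(table)
print(table, p_value)
```

Because the diagonal of an adjacency matrix always agrees with the diagonal of any other, it contributes $p$ concordant pairs to the table.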

Graphical Representations -Variables & Individuals
In order to represent graphically the possible topological links between the p quantitative variables, we use MultiDimensional Scaling (MDS) which makes it possible to find, for any distance matrix (similarity or dissimilarity) of size p × p, a set of p points identified by their Euclidean coordinates whose distance matrix is equal to or very close to the given distance matrix.
We carry out the classical MDS [5], namely a factorial analysis on the similarity table $V_{u^*}$ or the dissimilarity table $D_{u^*} = U - V_{u^*}$, where $U = 1_p 1_p^{T}$ is the $p \times p$ matrix of 1s and $1_p$ denotes the vector of $p$ 1s.
The TPCA approach consists of performing the standardized PCA of the triple $\{V_{u^*}; M; D_p\}$, where $V_{u^*}$ is the adjacency matrix associated with the proximity measure $u^*$, the most appropriate measure for the data considered, $M = I_p$ is the identity matrix of order $p$ and $D_p = \frac{1}{p} I_p$ is the diagonal matrix of variable weights.
TPCA can be performed from any adjacency matrix $V_{u_i}$ associated with each of the 15 proximity measures $u_i$ considered. The aids for interpreting TPCA results are those of PCA. Graphical representations on factorial planes make it possible to visualize and identify the topological structure of the variables. As in PCA, for representations of variables, we consider the most significant variables on the axes, that is, the variables highly correlated with the factors, having a strong contribution and a good quality of representation, measured by the squared cosine of the angle between the principal axes and the initial axes.
In representations of individuals, the active individuals are projected as illustrative elements. The quality of representation of these individuals on the factorial axes is measured by their squared cosine.
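The standardized PCA of the triple $\{V_{u^*}; I_p; \frac{1}{p} I_p\}$ can be sketched with NumPy as follows. This is an illustrative sketch under assumptions: the function name `tpca` and the toy adjacency matrix are hypothetical, constant columns are left unscaled to avoid a division by zero, and the projection of the individuals as illustrative elements is not shown.

```python
import numpy as np


def tpca(V):
    """Standardized PCA of a p x p adjacency matrix with uniform weights 1/p."""
    V = np.asarray(V, dtype=float)
    p = V.shape[0]
    std = V.std(axis=0)                        # population std (weights 1/p)
    # Centre and scale each column; constant columns are only centred.
    Z = (V - V.mean(axis=0)) / np.where(std > 0, std, 1.0)
    R = Z.T @ Z / p                            # Z' D_p Z with D_p = I_p / p
    eigvals, eigvecs = np.linalg.eigh(R)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # sort by decreasing inertia
    eigvals = np.clip(eigvals[order], 0.0, None)  # drop tiny negative round-off
    eigvecs = eigvecs[:, order]
    scores = Z @ eigvecs                       # coordinates on the factorial axes
    explained = eigvals / eigvals.sum()        # share of inertia per axis
    return eigvals, scores, explained


# Hypothetical adjacency matrix: two groups of mutually linked variables.
V_block = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
eigvals, scores, explained = tpca(V_block)
print(np.round(explained, 4))
```

In this toy case the two groups of variables are exact opposites after standardization, so a single axis carries all the inertia; real adjacency matrices generally spread it over several axes.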

Illustrative Example and Empirical Results
To illustrate TPCA, we use Eurostat data [8] on the government finance of the 28 European Union (EU) countries in 2017. We examine how key government finance statistics have developed in the EU-28. Specifically, we consider general government gross debt, deficit/surplus, total revenue and total expenditure. Simple statistics of the variables considered are displayed in Table 1. In a classical metric context, we would simply apply a standardized PCA to the homogeneous set of the 4 characteristics of the government finance of the EU-28.
In a topological context, the main results of the proposed method are presented in the following tables and graphs, which allow us to visualize the proximity measures that are close to each other and to select the one that best describes and synthesizes the government finance of the EU-28.
The objective here is to give a topological synthesis of the public finances of the EU countries in 2017.
An HAC algorithm based on the Ward criterion [22], which aggregates classes by minimal loss of inertia, was used to characterize classes of proximity measures according to their similarities. The reference measure $u^*$ is projected as a supplementary element. The dendrogram of Figure 2 represents the hierarchical tree of the 15 proximity measures considered. Table 2 describes the final composition of each class of proximity measures, that is, the chosen partition into three homogeneous classes obtained by cutting the hierarchical tree of Figure 2. Moreover, in view of the results in Table 2, the reference measure $u^*$ is closest to the third class, consisting of the Normalized Euclidean, Canberra and Weighted Euclidean measures, for which there is a strong topological association between the government finance variables of the EU-28, among the 15 proximity measures considered.
It was shown in [24], by means of a series of experiments, that the choice of proximity measure has an impact on the results of a supervised or unsupervised classification.
In a topological framework, Table 3 summarizes all the results of Table 8. The similarities in pairs between the 15 proximity measures differ somewhat: some are closer than others, and some measures are in perfect topological equivalence, $S(V_{u_i}, V_{u_j})=1$, with a significant Fisher's exact test (p-value < 5%); these are therefore identical for the data considered, as is the case with the measures in each cluster of the partition presented in Table 2. Table 4 shows the 2 × 2 contingency tables between the measures of each cluster (Euclidean, Tchebytchev and Canberra) and the reference measure $u^*$, used for the calculation of Fisher's exact test.
Only the topological equivalence between the reference proximity measure and the Canberra proximity measure is significant (p-value = 0.0034 < α = 5%): the null hypothesis $H_0$ of independence is rejected. The adjacency matrix $V_{u^*}$, associated with the proximity measure $u^*$ adapted to the data considered, is built from the correlation matrix in Table 5. Figure 5 shows, on the first TPCA plane, the topological correlation between the government finance variables.
The corresponding representation for individuals is given in Figure 4. It is thus possible to suggest which government finance variables are responsible for the proximities between the individuals, the 28 EU countries.
The main numerical and graphical results of the proposed TPCA are given in the following tables and figures, and are compared to those of the classical PCA. Figure 5 presents, for comparison on the first factorial plane, the correlations between the principal components (factors) and the original variables. We can see that these graphical representations of the variables are slightly different. Indeed, the percentage of inertia explained on the first principal plane of the topological PCA is greater than that of classical PCA, and the significant variable-factor correlations are also different. Table 6 shows that the first two factors of TPCA explain 68.96% and 25.00% respectively, accounting for 93.96% of the total variation in the dataset, while the first two factors of classical PCA account for only 84.88%.
Since the first two factors provide an adequate summary of the data, i.e. of the government finance of the EU-28 countries, we restrict the comparison of the graphical representations to the first factorial plane. The correlation tables show that the original variables strongly correlated with the factors are those that contribute the most to the construction of the corresponding principal component.
While the first PCA factor (55.61%) is strongly correlated with three of the original variables, expenditures, revenues and debt, the first TPCA factor (68.96%) opposes these three variables to the deficit. As for the second PCA (29.27%) and TPCA (25.00%) factors, they oppose the debt to revenues.
The representations of the countries presented in Figures 4 and 6 are of course slightly different. For example, France, which contributes to the construction of the first TPCA axis, is characterized by high debt, high expenditure, high revenue and a low deficit. France also contributes to the first PCA axis, where it is characterized by high debt, high expenditure and high revenue, but the deficit does not characterize the first factorial axis of the PCA.
We can represent the topological analysis for each of the 15 proximity measures considered; see, for example, the Euclidean TPCA in Figure 7. Figure 8 moreover gives the graphical representation associated with a perfect absence of correlation between variables, obtained from the identity adjacency matrix.

Conclusion
This research work proposes a new approach for synthesizing and describing the correlation structure of a set of quantitative variables in a topological context. Like PCA, the proposed TPCA is a multidimensional topological exploratory method that can be useful for reducing the dimension and the information redundancy of a dataset; it enriches the conventional methods of quantitative data analysis. Future work involves extending this topological approach in three directions: to synthesize the relations existing within a set of mixed qualitative and quantitative variables, between two sets of continuous variables in the context of canonical analysis, and between several multidimensional data tables in the context of evolutionary data analysis.
Where $p$ is the dimension of the space, $x=(x_j)_{j=1,\dots,p}$ and $y=(y_j)_{j=1,\dots,p}$ are two points in $\mathbb{R}^p$, $\bar{x}_j$ is the mean, $\sigma_j$ the standard deviation, $\alpha_j = 1/\sigma_j^2$ and $\nu > 0$.