Scale Independent Principal Component Analysis and Factor Analysis with Preserved Inherent Variability of the Indicators

Principal Component Analysis (PCA) and Factor Analysis (FA) are common multivariate techniques used for dimensionality reduction. These techniques are expected to identify the actual number of dimensions while accounting for almost all of the observed variability. Standard PCA is based on either the correlation matrix (CORM) or the covariance matrix (COVM). When it is based on the CORM, scale dependency is removed but inherent variability is not preserved. Conversely, when PCA is based on the COVM, inherent variability is preserved but scale dependency is not removed. As a solution to this issue, this paper suggests scaling each indicator by its mean, which results in a new mean equal to 1 and a standard deviation equal to the coefficient of variation (CV). This leads to PCs that are scale independent while retaining the observed variability. The computation of PCs and factors under the suggested method is derived in the study. The procedure is illustrated using census data at the lowest administrative division level for the Western province of Sri Lanka.


Introduction
Principal component analysis (PCA) and factor analysis (FA) are common multivariate techniques used for dimensionality reduction for many purposes. Multivariate techniques have been used in many diverse fields in the construction of composite indices (CIs) [1], which is one of the most important applications of PCA and FA. In constructing CIs, it is important to identify a small number of transformed indicators out of the considered set of indicators.

Principal Component Analysis (PCA)
PCA involves a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated variables. Its goal is to extract the important information from the data table and to express this information as a set of new orthogonal variables called principal components (PCs) [10]. These principal components are linear combinations of the original variables. Hence the results of PCA depend on the scales that the variables are measured on.

Factor Analysis (FA)
Factor analysis is also a variable reduction technique and is similar to PCA. It is a useful tool for investigating variable relationships for complex concepts such as socio-economic status, dietary patterns, or psychological scales [11]. It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable, uncorrelated underlying factors. In factor analysis, a factor is a latent (unmeasured) variable that expresses itself through its relationship with other measured variables. In contrast to PCA, the FA model assumes that the data are based on the underlying factors of the model, and that the data variance can be decomposed into parts accounted for by common and unique factors [12]. Performing FA based on PCA is one of the commonly used methods.

Composite Index (CI)
A CI measures multi-dimensional aspects which cannot be captured properly by a single variable. A CI should be based on a theoretical framework or definition, which allows individual indicators or variables to be selected, combined and weighted in a manner which reflects the dimensions or structure of the phenomena being measured [2]. With CIs, decision makers should be able to gain a better understanding of complex, multi-dimensional realities, as a CI is easier to interpret than a set of separate indicators. The most important property of a CI is its ability to reduce the visible size of a set of indicators without dropping the underlying information base. Farrugia [7] pointed out that in the context of policy analysis, CIs are useful in identifying trends and drawing attention to particular issues, and they can also be helpful in setting policy priorities and in benchmarking or monitoring performance. However, if a CI is constructed in a manner which does not reflect the real situation, or if the construction process lacks proper statistical or conceptual principles, the CI may give misleading information for policy decisions. Therefore more attention should be paid to the construction of CIs.

Issues
PCA is performed on a relationship (or association) matrix, which captures the interrelationships between variables. The correlation matrix (CORM) or the covariance matrix (COVM) is mainly used as the relationship matrix, and the results of the PCA differ depending on which matrix is used. Jolliffe [3] says that when performing a PCA, a major argument for using the CORM rather than the COVM is that the results of analyses for different sets of random variables are more directly comparable, because PCA based on the COVM is sensitive to the units of measurement used for each variable. Therefore, in the CORM approach, PCA operates on standardized data, that is, data centred and scaled by their standard deviations. All the variables then become scale-free with zero means and unit variances. On the other hand, Jolliffe [3] argues that if there are large differences between the variances of the variables, then the variables whose variances are largest will tend to dominate the first few PCs. In that situation, this inherent variability cannot be captured by performing PCA with standardized data, and drawing conclusions about the dominance of variation for the actual, unstandardized data tends to be misleading. Hence, the COVM approach may be entirely appropriate for a set of variables with different variances that are measured on the same scale. Another disadvantage of PCs derived using the CORM is that they give coefficients for standardized variables and are therefore less easy to interpret directly [3]. Therefore this problem has to be addressed in constructing scale-independent composite indices, while preserving the inherent variability of the variables.
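The contrast between the two relationship matrices can be illustrated with a small numerical sketch. The data are synthetic and the helper name `explained_ratio` is hypothetical, introduced here only for illustration; it is not taken from the study. The sketch expresses one indicator in a unit 1000 times smaller and compares the share of variance carried by each PC before and after.

```python
import numpy as np

def explained_ratio(matrix):
    """Share of total variance carried by each PC of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(matrix)[::-1]  # descending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
# Hypothetical data: two positively correlated indicators, 200 observations.
x1 = rng.normal(50.0, 5.0, size=200)
x2 = 0.5 * x1 + rng.normal(0.0, 5.0, size=200)
X = np.column_stack([x1, x2])

# Same data with the first indicator expressed in a unit 1000 times smaller.
X_rescaled = X * np.array([1000.0, 1.0])

covm_before = explained_ratio(np.cov(X, rowvar=False))
covm_after = explained_ratio(np.cov(X_rescaled, rowvar=False))
corm_before = explained_ratio(np.corrcoef(X, rowvar=False))
corm_after = explained_ratio(np.corrcoef(X_rescaled, rowvar=False))

print("COVM:", covm_before.round(3), "->", covm_after.round(3))
print("CORM:", corm_before.round(3), "->", corm_after.round(3))
```

Under these assumptions the COVM-based shares change with the unit of measurement, with the rescaled indicator taking over the first PC, while the CORM-based shares stay the same. The CORM achieves this by giving every variable unit variance, which is exactly the loss of inherent variability discussed above.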

Objective
The objective of this study is to find a way to remove the scale dependency of PCA without standardizing the variables, so that the information on the inherent variability of the variables is preserved.

Proposed Method
As a solution to the issues mentioned in section 1, the data of each variable are scaled by its mean. The new mean is then equal to 1 and the standard deviation is equal to the CV. Consequently, a new scale-independent set of variables is obtained while the inherent variability is preserved.
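Suppose the original variables are $X_1, X_2, \ldots, X_m$ with means $\mu_i$ and variances $\sigma_i^2$, where $i = 1, 2, \ldots, m$.
Divide each variable by its mean and denote the transformed new set of variables by $Y_i$. Then,
$$Y_i = \frac{X_i}{\mu_i}$$
Here, the $Y_i$ are independent of the scale. Their means and variances are
$$E(Y_i) = 1, \qquad \operatorname{Var}(Y_i) = \frac{\sigma_i^2}{\mu_i^2} = CV_i^2$$
Then, the standard deviation of $Y_i$ is $CV_i$. Unlike the standardized variables, the $Y_i$ have different variances.
The data matrix is
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$
where n = number of observations and m = number of variables. Then the matrix after the transformation is
$$Y = \begin{pmatrix} x_{11}/\mu_1 & x_{12}/\mu_2 & \cdots & x_{1m}/\mu_m \\ x_{21}/\mu_1 & x_{22}/\mu_2 & \cdots & x_{2m}/\mu_m \\ \vdots & \vdots & & \vdots \\ x_{n1}/\mu_1 & x_{n2}/\mu_2 & \cdots & x_{nm}/\mu_m \end{pmatrix}$$
The covariance between two transformed variables, $Y_i$ and $Y_j$, is given by
$$\operatorname{Cov}(Y_i, Y_j) = \frac{\operatorname{Cov}(X_i, X_j)}{\mu_i \mu_j}$$
If the Pearson correlation coefficient of $X_i$ and $X_j$ is $\rho_{ij}$, which is equal to $\operatorname{Cov}(X_i, X_j) / (\sigma_i \sigma_j)$, then substituting into the covariance expression gives
$$\operatorname{Cov}(Y_i, Y_j) = \rho_{ij} \, \frac{\sigma_i}{\mu_i} \, \frac{\sigma_j}{\mu_j} = \rho_{ij} \, CV_i \, CV_j$$
Then the variance-covariance matrix of Y is
$$\operatorname{Cov}(Y) = \begin{pmatrix} CV_1^2 & \rho_{12} CV_1 CV_2 & \cdots & \rho_{1m} CV_1 CV_m \\ \rho_{21} CV_2 CV_1 & CV_2^2 & \cdots & \rho_{2m} CV_2 CV_m \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{m1} CV_m CV_1 & \rho_{m2} CV_m CV_2 & \cdots & CV_m^2 \end{pmatrix} \qquad (8)$$
FA was performed followed by the PCA using the variance-covariance matrix (8).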
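The mean-scaling transformation and the resulting covariance structure can be checked numerically. The following is a minimal sketch on synthetic data, assuming all indicator means are nonzero and using sample means and standard deviations in place of the population quantities; the function names `mean_scale` and `cv_covariance` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def mean_scale(X):
    """Divide every column (indicator) by its own mean: Y_i = X_i / mean_i."""
    return X / X.mean(axis=0)

def cv_covariance(X):
    """Matrix with entries rho_ij * CV_i * CV_j, computed from the raw data."""
    cv = X.std(axis=0, ddof=1) / X.mean(axis=0)
    rho = np.corrcoef(X, rowvar=False)
    return rho * np.outer(cv, cv)

rng = np.random.default_rng(1)
# Hypothetical positive-valued indicators on very different scales.
X = rng.gamma(shape=[2.0, 5.0, 9.0], scale=[1000.0, 1.0, 0.01], size=(500, 3))

Y = mean_scale(X)
cov_Y = np.cov(Y, rowvar=False)

# Changing the measurement units of X leaves the covariance of Y unchanged.
X_other_units = X * np.array([0.001, 60.0, 100.0])
cov_Y_other_units = np.cov(mean_scale(X_other_units), rowvar=False)

print("column means of Y:", Y.mean(axis=0))
print("max |Cov(Y) - rho*CV*CV|:", np.abs(cov_Y - cv_covariance(X)).max())
print("max change after unit change:", np.abs(cov_Y - cov_Y_other_units).max())
```

The diagonal of `cov_Y` holds the squared CVs, so the transformed variables keep different variances, unlike standardized variables.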

Validation
In order to validate the proposed method, an analysis was performed using a dataset relevant to the problem. For this purpose, a set of variables with different scales and different variances had to be identified.

Data
In Sri Lanka, urban / rural classification is not based on a proper statistical methodology. Urban areas are defined on the basis of the administrative boundaries of local authorities (LAs). There are three types of LAs in the country at present, namely Municipal Council (MC), Urban Council (UC) and Pradeshiya Sabha (PS). MCs and UCs are considered urban LAs while PSs are considered rural LAs. However, some areas with urban characteristics lie in PS divisions, while some areas with rural characteristics lie in MCs and UCs, because the variability of those attributes within an LA is considerable. Therefore, the classification needs to be carried out at the lowest administrative level within an LA, so that the variability of the considered variables within the LAs can be taken into account. In the Sri Lankan context, the Grama Niladhari (GN) division, being the smallest administrative unit, is the most appropriate level to be considered.

Variables
Considering the following variables, data were collected at the GN division level in the Western province of Sri Lanka. All the variables were adjusted so that higher values indicate a higher degree of urban nature.

Results
All the considered variables were measured on different scales. Their suitability for this study was assessed using descriptive statistics.

Descriptive Statistics
Descriptive statistics of the considered variables are given in Table 1 to identify the nature of their variability. Table 1 clearly indicates that the considered variables were on different scales and were highly dispersed. This dispersion was due not only to the magnitude of the numbers but also to an inherent property of each variable, which is captured by the CV given in the fifth column of Table 1. As an example, the highest standard deviation, 4580.556 persons per km², was recorded for the variable "Pop-Dencty" (population density), but its CV was not the highest. The variable "Recreation_HH" (number of recreation centers per housing unit in a GN division) recorded the highest CV of 3.640, although its standard deviation was very low (0.009 per housing unit). Hence, the set of variables given in Table 1 was suitable for validating the proposed method.
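The distinction between standard deviation and CV drawn above can be reproduced on a toy example. The values below are hypothetical and are not the census data; the column names `density_like` and `rate_like` and the helper `describe` are likewise assumptions made for illustration. The sketch shows that the indicator with the largest standard deviation need not be the one with the largest CV.

```python
import numpy as np

def describe(X, names):
    """Return (name, mean, standard deviation, CV) for each column of X."""
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    return [(n, m, s, s / m) for n, m, s in zip(names, mean, sd)]

# Hypothetical values: a large-scale, moderately dispersed indicator and a
# tiny-scale, highly dispersed one.
names = ["density_like", "rate_like"]
X = np.array([
    [1200.0, 0.000],
    [3400.0, 0.000],
    [5600.0, 0.001],
    [7800.0, 0.000],
    [9900.0, 0.012],
])

rows = describe(X, names)
for name, mean, sd, cv in rows:
    print(f"{name:>12}  mean={mean:10.4f}  sd={sd:10.4f}  cv={cv:6.3f}")

largest_sd = max(rows, key=lambda r: r[2])[0]
largest_cv = max(rows, key=lambda r: r[3])[0]
print("largest SD:", largest_sd, "| largest CV:", largest_cv)
```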

Application of PCA
PCA was performed to identify the smallest number of linear combinations of the considered variables that explain a high proportion of the original variation in the data. The proposed method was applied, followed by the conventional approaches based on the CORM and the COVM. In the CORM approach the data were standardized, whereas in the COVM approach they were not. Under the proposed method, the variables were transformed by dividing them by their means. The scale dependency problem was thereby solved and the inherent variability of the variables was also taken into account. The variances of the new set of variables are the squared CVs of the original variables. PCA was then performed on the COVM of the transformed data set under the proposed method, and the results are included in Table 2.
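The computation under the proposed method can be sketched as follows. This is an illustrative outline on synthetic data, not the code used in the study: the function name `pca_mean_scaled` and the simulated indicators are assumptions, and sample means and variances stand in for the population values.

```python
import numpy as np

def pca_mean_scaled(X):
    """PCA on the covariance matrix of mean-scaled data (Y_i = X_i / mean_i)."""
    Y = X / X.mean(axis=0)
    cov_Y = np.cov(Y, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov_Y)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = (Y - Y.mean(axis=0)) @ eigvecs      # PC scores of centred Y
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return eigvals, eigvecs, scores, cumulative

rng = np.random.default_rng(2)
# Hypothetical indicators: two share a latent component, one is unrelated.
latent = rng.gamma(4.0, 1.0, size=300)
X = np.column_stack([
    2000.0 * latent + rng.gamma(2.0, 500.0, size=300),
    0.01 * latent + rng.gamma(2.0, 0.005, size=300),
    rng.gamma(1.0, 3.0, size=300),
])

eigvals, eigvecs, scores, cumulative = pca_mean_scaled(X)
print("eigenvalues:", eigvals.round(4))
print("cumulative share of variance:", cumulative.round(4))
```

In this sketch the eigenvalues sum to the sum of the squared CVs of the original indicators, so the total variance being partitioned among the PCs reflects their relative dispersion rather than their units.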

Conventional Method
Considering the results of PCA under the CORM approach, the first two PCs, whose eigenvalues are above 1, explained only 66 percent of the total variation. This was not regarded as a good approach for two reasons. First, a larger number of PCs would have to be selected to explain a reasonably high share of the total variation, although the objective is to reduce the number of variables as far as possible while explaining most of the variability. Second, the inherent variability is neglected because the variables are standardized. The possible alternative was therefore to perform PCA using the covariance matrix approach with unstandardized data. In that case, however, the whole variability was dominated by one PC, owing to the variable with the highest variance (Table 1). Moreover, this approach cannot be applied because the variables were measured on different scales.

Proposed Method
Under the proposed method, 75.71 percent of the total variance was explained by the first two PCs, whereas under the conventional method with the CORM approach it was 66.02 percent. This is an improvement of more than 9 percentage points, which can be considered sufficient.

Application of FA
FA was performed using the principal component method. Accordingly, two factors were considered under the CORM approach of the conventional method and under the proposed method. Owing to the dominance of one PC under the COVM approach, only one factor could be considered there. Under both methods (except the COVM approach), a few variables contributed significantly to both factors, which is undesirable. Therefore Varimax rotation was used to overcome that issue. In the proposed method and in the CORM approach under the conventional method, all the variables could be adequately explained by two factors. Because of the large variance of the variable Pop_dencty, only that variable made a highly significant contribution to the single factor identified under the COVM approach.
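The principal component method of factor extraction and a varimax rotation can be sketched for the two-factor case. This is a simplified illustration under stated assumptions: the loading pattern is hypothetical, the function names are introduced here, and the rotation angle is found by a plain grid search over a quarter turn rather than by the iterative algorithm used in statistical packages, so it applies to two factors only.

```python
import numpy as np

def pc_loadings(matrix, n_factors):
    """Principal component loadings: eigenvectors times sqrt(eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    order = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

def varimax_criterion(loadings):
    """Raw varimax criterion: summed variance of squared loadings per factor."""
    return np.var(loadings ** 2, axis=0).sum()

def rotation_matrix(angle):
    """2 x 2 orthogonal rotation by the given angle (radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def varimax_two_factors(loadings, n_angles=9001):
    """Rotate a two-factor loading matrix by the angle maximizing varimax."""
    best_angle, best_value = 0.0, varimax_criterion(loadings)
    # The criterion repeats every quarter turn, so [0, pi/2] is sufficient.
    for angle in np.linspace(0.0, np.pi / 2, n_angles):
        value = varimax_criterion(loadings @ rotation_matrix(angle))
        if value > best_value:
            best_angle, best_value = angle, value
    rotation = rotation_matrix(best_angle)
    return loadings @ rotation, rotation

# Hypothetical relationship matrix built from a two-factor pattern.
pattern = np.array([[0.9, 0.0], [0.8, 0.0], [0.0, 0.7], [0.0, 0.6]])
S = pattern @ pattern.T + np.diag(1.0 - (pattern ** 2).sum(axis=1))

loadings = pc_loadings(S, n_factors=2)
rotated, rotation = varimax_two_factors(loadings)
print("unrotated loadings:\n", loadings.round(3))
print("rotated loadings:\n", rotated.round(3))
```

Because the rotation is orthogonal, the communality of each variable (the row sum of squared loadings) is unchanged; only the distribution of each variable's contribution across the two factors changes.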
In PCA, the application of the proposed method gave a clear improvement over the conventional method. In FA under the proposed method, the contribution of the variables to the factors was not dominated by a few variables, as it was under the COVM approach of the conventional method.

Conclusions
Conducting PCA and FA as variable reduction techniques with the CORM approach is not always acceptable, because it ignores the inherent variability of the variables. The COVM approach is a good solution to this problem, but it has the drawback of scale dependency. To obtain a scale-independent set of indicators, all the indicators were converted into a new set by dividing the data of the original indicators by their means. The means of the new variables are equal to one, while their standard deviations are the CVs of the original indicators. Hence the inherent variability of the original indicators is preserved under the proposed method. Therefore, in applications of PCA and FA, scaling the indicators by their means can generate meaningful information.