Credit Risk Assessment Utilizing Data Reduction Technique for Individual Loaning in Financial Institutes (Case Study: Tejarat Bank, Rasht, Iran)

Financial and economic activities inherently carry a degree of risk, so banks routinely face many kinds of risk, including operational, marketing and interest rate risk. Because credit risk strongly affects banks' lending profits, this research investigates the repayment risk of individual loans. Two well-known regression models, Probit and Logistic, were developed from nine factors that are examined when loans are granted, with respect to the possibility of late repayment or non-repayment. To minimize inter-correlation and extract highly independent factors, Principal Component Analysis (PCA), a statistical data reduction technique, was applied, and three of the nine factors were omitted. A branch of Tejarat Bank in the northern Iranian province of Guilan was selected as the case study to gather experimental data for assessing the credit risk of individual bank investors. Model validation showed that applying PCA can improve the accuracy of the models' outputs and that the Probit regression model gives better results than the Logit model.


Introduction
Risk is part of human and organizational life, so financial managers must weigh the degree of risk when making decisions. Since risk cannot be removed completely, a scientific approach to risk aims to evaluate and manage it. Because banks are a major pillar of economies around the world, and because of the nature of their lending, they face many kinds of risk, including operational, marketing, interest rate, credit and liquidity risk [1]. Lending is one of the most important activities in the banking system, and credit risk stems from the fact that one party to the loan contract cannot or will not perform its obligations. Non-repayment risk caused by individual investors is still considered a major cause of bank failure and, according to published research, is still growing [2]. It is therefore important for banks to grant loans to individuals or organizations that meet the lending standards and are able to repay their loans on time. Hence it is necessary to assess the credit risk of repayment when individual investors receive loans from banks.
Accordingly, the main aim of this research is to assess the risk of loan non-repayment by individual investors, considering the effective factors that have been screened using the data reduction technique of principal component analysis (PCA). This paper is organized into seven sections. After the introduction, the scientific background is discussed, followed by the development of two regression models, Logit and Probit, for assessing risk. The case study and data are then discussed in the fourth section together with the computational results, and the conclusion is given in the last section.

Literature Review
Many studies on assessing the credit risk of borrowers have been conducted and published. For example, a decision tree model for bank credit risk evaluation used data-extraction techniques to identify the factors affecting customers' credit risk and divided customers into four groups (past due, suspected of receipt, outstanding and ongoing); it showed that the probability of assigning a loan repayment to each group differs from customer to customer [3]. The model suggests that new criteria used in analyzing loan applications influence loan repayment differently, and its evaluation showed that the proposed model can reduce non-performing loans to less than 5%, so that banks can be classified in a well-performing grade [3]. Using three credit scoring factors, namely repayment period (in months), loan amount and customer's age, Dong et al. [4] applied logistic regression with random coefficients to the credit scoring of individual customers. They divided each of these factors into four intervals and allocated ranking points from 1 to 4 to each interval. In terms of prediction accuracy, their proposed method performed much better than logistic regression with constant coefficients.
Probit analysis has also been used to estimate the probability of credit risk [5]. Influencing factors, including personal and financial attributes, were investigated to estimate credit risk for Swedish credit customers, covering both rejected and approved applicants. Applying a fuzzy analytical hierarchy process and computing Value at Risk (VaR) showed that efficient selection of loan applicants can reduce credit risk by 80% [5]. Using a meta-heuristic approach to financial risk assessment, Che et al. [6] stated that, while competition between Taiwanese financial institutions and banks is strong, a high proportion of outstanding deposits can be observed across the lending systems, particularly for small and medium companies in Taiwan. Many warning systems designed for commercial loan risk combine an artificial neural network with the concept of traditional warning. Baon et al. [7] used an artificial neural network (ANN) to propose a warning system for the risk assessment of bank loans. Long-term and short-term repayment ability, performance and profitability, and finally the degree of warning required were used as evaluation criteria in developing the ANN model. The outputs of the system are more understandable and practical than those of conventional warning systems and provide effective decision-making tools for companies that deal with loan risk assessment. Different views on assessing credit risk have also been put forward for evaluating the credit risk of companies and international firms (corporate customers), banks and the other institutions lending to these companies [8].
A company's international history, its international relations and the market in the countries concerned were identified as factors affecting the credit risk of companies, while the regression model developed shows that the credit risk of international companies can arise from the companies' experience and current business activities [8]. Data envelopment analysis (DEA) is another technique that offers appropriate approaches to credit scoring [9]. In contrast to regression analysis and neural network models, which need additional data to calculate a credit score, such models use only the historical records of credit information. Collecting financial data on 1061 foreign companies from the Korea Credit Guarantee and using financial ratios, the authors [9] combined the overall performance of each company and calculated its credit rating.
As shown above, many studies have addressed the assessment of credit risk, but lending to individual investors, which is more common in developing countries, should be carefully investigated for late repayment or non-repayment, since many data fields may be unavailable. Therefore, the main aim of this research is to study the risk of loan non-repayment by individual investors, and to use the efficient statistical technique of principal component analysis to omit factors that are inter-correlated with other influencing factors, so that as little data as necessary has to be gathered for such studies.

Principal Component Analysis
Principal component analysis (PCA) is a statistical technique used to determine the most influential variables. PCA discards redundant data in order to reduce inter-correlation and improve the independence of the variables. A key stage of PCA is the Kaiser-Meyer-Olkin (KMO) test, in which the KMO index checks whether the factors are sufficiently related to the original variables [10]. The starting point is the correlation matrix. The variables are assumed to be more or less correlated, but the correlation between two variables can be influenced by the others, so the partial correlation is used to measure the relation between two variables after removing the effects of the remaining variables. The KMO index compares the correlations between variables with their partial correlations. If KMO is less than 0.5, the data are not appropriate for factor analysis: the variables share no significant correlation, and weakly correlated or uncorrelated variables can be used in the statistical analysis as they are. With this method, the $p$ primary variables $X_1, X_2, \ldots, X_p$ are combined into at most $p$ independent components $PC_1, PC_2, \ldots, PC_p$. Each component is determined in sequence as a weighted linear combination of the primary variables, $PC_i = W_{i1} X_1 + W_{i2} X_2 + \cdots + W_{ip} X_p$, as in equation (1) [10].
Here, $PC_i$ represents the newly generated variables, known as factors, $W_{ij}$ is the weighting coefficient of the $j$th primary variable in component $i$, and $X_j$ is the $j$th primary variable. The weights $W_{ij}$ are estimated from the variance of the variables, so that the first component captures the maximum variance and the second component captures the maximum of the remaining variance not explained by the first. In addition, two constraints, given as equations (2) and (3), are applied to obtain independence.
The PCA is carried out in the following steps: (1) Standardization of the input variables: In the first stage, the input data are standardized so that each variable has zero mean and unit standard deviation. The matrix Z of standardized values is obtained from equation (4), $z_{ij} = (x_{ij} - \bar{x}_j)/s_j$, where $\bar{x}_j$ is the sample mean of variable $j$ and $s_j$ is the corresponding sample standard deviation [11].
(2) Calculating the KMO index: KMO lies in the range from zero to one. The index is derived from equation (5), $KMO = \sum_{i \neq j} r_{ij}^2 \big/ \left( \sum_{i \neq j} r_{ij}^2 + \sum_{i \neq j} a_{ij}^2 \right)$, where $r_{ij}$ and $a_{ij}$ are respectively the correlation coefficient and the partial correlation coefficient between variables $i$ and $j$ [12].
(3) Calculating the correlation (variance) matrix of the primary variables: The matrix R shows the correlation between each pair of primary variables [11]. In equation (6), Z denotes the standardized form of the primary variables.
(4) Calculating the eigenvalues λ and the corresponding eigenvectors of the correlation matrix: The eigenvalues and the eigenvector belonging to each eigenvalue are calculated using equations (7) and (8) [11]. The eigenvectors obtained form the coefficients of all primary variables in the respective components. The eigenvalues $\lambda_n$ are calculated by solving equation (7), $|R - \lambda I| = 0$, where I is the identity matrix. The variance of each principal component is calculated by equation (8).
(5) Determining the number of factors: The eigenvalue criterion, the percentage of variance criterion and the scree test criterion [11] are the most important criteria used to decide how many factors to extract.
(6) Applying a suitable rotation to the component coefficient matrix: At this stage, the variables that have high coefficients in the extracted principal components are selected as important variables to enter the modeling [13], and the rest are omitted.
(7) Finally, after the rotation, the main variables are those for which at least one of the coefficients used to form the relevant factor is relatively high; the others are deleted.
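Steps (1), (3), (4) and (5) can be sketched in a few lines of Python. This is a minimal illustration that assumes NumPy is available; the function names and the toy data are hypothetical and are not the bank data or the SPSS procedure used in this study. The eigenvalue criterion is taken here as "keep components with eigenvalue greater than one", which is one common reading of that criterion.

```python
import numpy as np


def principal_components(X):
    """Eigenvalues, eigenvectors and variance shares of the correlation
    matrix of X (rows are samples, columns are primary variables)."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    # Step 1: standardize each column to zero mean and unit sample std.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 3: correlation matrix of the primary variables.
    R = (Z.T @ Z) / (n_samples - 1)
    # Step 4: eigen-decomposition (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # largest variance first
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    explained = eigvals / eigvals.sum()  # share of total variance
    return eigvals, eigvecs, explained


def kaiser_count(eigvals):
    """Step 5, eigenvalue criterion (assumed form): eigenvalue > 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))


# Hypothetical toy data: 6 samples, 3 variables (age, amount, months).
X_demo = np.array([
    [25, 100, 12],
    [32, 150, 24],
    [47, 300, 36],
    [51, 280, 36],
    [38, 200, 24],
    [29, 120, 12],
])
eigvals, eigvecs, explained = principal_components(X_demo)
n_keep = kaiser_count(eigvals)
```

Because the correlation matrix has ones on its diagonal, the eigenvalues sum to the number of variables, which is why each eigenvalue divided by that sum can be read as a share of the total variance.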

Probit Regression Model
The Probit regression method, first proposed in 1930 [14], is applied to two types of dependent variable: binomial (two-state) variables and rating variables. Binomial Probit regression uses quantitative and qualitative measures to explain and predict a two-state (binary) dependent variable from a set of independent variables. The rating form of Probit regression is used when the dependent variable is measured on a rating scale. The general form of the dependent variable in the two-state Probit model is defined as a function, equation (9), that takes the value zero or one. In equation (10), X is the vector of independent variables and β is the vector of parameters, which must be estimated from the available data by minimizing the squared errors between the observed and estimated values of the dependent variable [15].
According to figure 1, the binomially distributed dependent variable is assumed to arise from an underlying normally distributed variable that a threshold or boundary (t) converts into two states [14]. That is, the underlying response is assumed to be continuous and to follow a normal distribution. The Probit link function in equation (11) then uses the cumulative normal distribution function to model and explain the probability of success, P(Y = 1). In equation (11), the symbol ϕ represents the cumulative normal distribution function, which is tabulated in standard statistics textbooks [15].

Binomial Logistic Regression (Logit)
Binomial logistic regression (Logit) models the probability that a given situation occurs or does not occur as a function of a number of independent variables. Since the dependent variable is binary (accept or reject; one or zero), linear regression cannot be used. When the dependent variable is binary and the independent variables are nominal, ordinal, interval or a combination of these, Logit regression can be used to estimate the probability of success for the dependent variable Y. In Logit regression, the dependent variable (Y) follows a non-normal, binomial distribution. The independent variables ($X_i$) can be a combination of qualitative and quantitative variables. The link function in Logit regression is the natural logarithm (ln) of the odds. The odds express the likelihood of success versus failure, P(Y = 1) / (1 - P(Y = 1)) [14], formulated as equation (12). As a result, the binomial logistic regression equation is written as equation (13).
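The relation between the odds, their logarithm and the probability of success can be checked numerically. The sketch below assumes the standard logistic form, in which the log-odds equal an intercept plus the weighted sum of the independent variables; all names and numbers are hypothetical.

```python
import math


def logit_probability(x, beta, intercept=0.0):
    """P(Y = 1) under a Logit link: logistic function of the linear predictor."""
    linear = intercept + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-linear))


def odds(p):
    """Odds of success versus failure for a probability p in (0, 1)."""
    return p / (1.0 - p)


# Hypothetical coefficients and inputs, for illustration only.
beta_demo = [0.8, 0.4]
x_demo = [1.0, -0.5]
p_demo = logit_probability(x_demo, beta_demo, intercept=-0.2)
# The natural logarithm of the odds recovers the linear predictor.
log_odds = math.log(odds(p_demo))
```

Taking the logarithm of the odds returns the linear predictor, which is what makes the coefficients interpretable as changes in log-odds per unit change of a variable.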

Developing Risk Assessment Models
Since the dependent variable in the present study is binary (non-creditworthy or creditworthy customer), two methods, the Probit and Logit regression models, were used to calculate the probability of credit risk for individual customers. Principal Component Analysis (PCA) was also used to ensure the independence of the input variables, and the impact of its use on the Probit and Logit models was investigated. The estimated probabilities of creditworthiness were finally compared before and after applying PCA to check whether PCA improves the accuracy of the proposed models.

Assessment Criteria for Credit Risk
(1) A review of the literature shows that there are 32 effective factors for evaluating the credit of bank customers. Nine of them were selected after interviews with managers at the Tejarat Bank branch in Rasht, Guilan, Iran. Data on the nine selected criteria for creditworthy and non-creditworthy customers were collected from March 21, 2014 to March 20, 2015 (the year 1393 in the Iranian solar calendar). The sample has a size of 45 and contains detailed information on 20 non-creditworthy and 25 creditworthy customers.
(2) Dependent variable: The repayment status is taken as the dependent variable. It is a discrete response variable that describes the customer's repayment status. Customers who have not defaulted on their repayments are creditworthy and take the value (0); non-creditworthy customers, who have defaulted on their repayments, take the value (1). The output of the proposed model is therefore the probability of defaulting on loan repayment, which makes lending banks aware of non-creditworthy customers.
(3) Independent variables: The independent variables are the set of criteria affecting loan repayment described above. They are as follows:
Gender: Gender is a discrete variable coded as male (0) and female (1).
Age: Age is a continuous variable measured in years.
Job status: Borrowers are divided into two discrete categories by employment status: having an income (0) and having no income (1).
Amount of loan: The amount of the loan received by the borrower. This is a continuous variable.
Repayment period: The period (in months) over which the principal and interest of the loan are to be repaid.
Type of collateral: In the lending process, customers offer collateral to guarantee repayment. Collateral takes different forms, but in this research it is divided into two types: property collateral (0) and non-property collateral (1).
Term of continuous transactions: The period (in years) for which the customer has held an account with financial transactions at the bank.
Interest rate: The interest rate is set by the bank and ranges from 4% to 29%.
Number of supporters (guarantors): Under the bank's rules, one or more guarantors are required to guarantee repayment to the bank if the borrower fails to repay. Otherwise, the client's collateral must be property.
The one dependent variable (credit status) and the nine independent variables, together with their symbols, are shown in table 1. The first is the dependent variable, denoted CS, and the others are independent.

Table 1. Variables and their symbols
No.   Variable                       Symbol
1     Credit status                  CS
2     Gender                         G
3     Age                            A
4     Job status                     JS
5     Amount of loan                 AL
6     Repayment period               TL
7     Type of collateral             TC
8     Term relationship with bank    TR
9     Interest rate                  IR
10    Number of guarantors           Gu
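The coding of the variables described above can be captured in a small record type. The sketch below is only one possible representation; the class name, field names and the validation rule are hypothetical and chosen for illustration, while the 0/1 codes follow the definitions given in this section.

```python
from dataclasses import dataclass


@dataclass
class Borrower:
    gender: int                # G: 0 = male, 1 = female
    age: float                 # A: years
    job_status: int            # JS: 0 = has income, 1 = no income
    loan_amount: float         # AL: amount of loan received
    repayment_months: int      # TL: repayment period in months
    collateral: int            # TC: 0 = property, 1 = non-property
    relationship_years: float  # TR: years of transactions with the bank
    interest_rate: float       # IR: percent
    guarantors: int            # Gu: number of guarantors
    credit_status: int         # CS: 0 = creditworthy, 1 = non-creditworthy

    def __post_init__(self):
        # Binary variables must use the 0/1 coding described in the text.
        for name in ("gender", "job_status", "collateral", "credit_status"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be coded 0 or 1")

    def features(self):
        """Independent variables in the order G, A, JS, AL, TL, TC, TR, IR, Gu."""
        return [
            self.gender, self.age, self.job_status, self.loan_amount,
            self.repayment_months, self.collateral, self.relationship_years,
            self.interest_rate, self.guarantors,
        ]


# Hypothetical customer, for illustration only.
example = Borrower(
    gender=0, age=40, job_status=0, loan_amount=100.0, repayment_months=24,
    collateral=1, relationship_years=3, interest_rate=18.0, guarantors=2,
    credit_status=0,
)
row = example.features()
```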

Binomial Probit Regression Model
Binomial (two-state) Probit regression was run on all variables using the well-known statistical software SPSS. As shown in table 2, the outputs indicate that the estimated Probit model performs acceptably in predicting the dependent variable, defined as the probability of being non-creditworthy. Performance is considered acceptable when the significance level of the likelihood ratio is less than 0.05, corresponding to a 95 percent confidence level. The observed significance level is close to zero and below 0.05. The parameters, their coefficients and the corresponding significance levels for both the Probit and Logit regression models are tabulated in table 2. The first column gives the notation of the parameters (independent variables): one constant, three continuous variables and six discrete variables. For each level of the qualitative variables except the reference level (the last level, selected by default), the Probit and Logit models estimate a coefficient and its significance level. As shown in table 2, the binomial Probit regression shows that four independent variables, employment status (JS), age (A), term of relationship with the bank (TR) and interest rate (IR), have significant effects on the probability of being non-creditworthy, because their significance levels are less than 0.05. Using the experimental data, the proposed Probit model for assessing customers' credit risk at the Rasht branch of Tejarat Bank is formulated as equation (16). In other words, the probability that an account is non-creditworthy and fails to repay its loan follows equation (16), in which each variable enters through a nonlinear formulation.

Logit Regression Model
Applying the Logit regression method to the experimental data indicates that it also performs acceptably in predicting the dependent variable (significance level less than 0.05 at the 95% confidence level). The coefficients and their significance levels are tabulated in table 2. The binomial Logit regression shows that three variables, employment status (JS) with coefficient 5.924, age (A) with coefficient -0.965 and term of relationship with the bank (TR) with coefficient 1.905, have significance levels below 0.05, as shown in the fourth column of table 2. It is therefore concluded that these independent variables have significant effects on the probability of being non-creditworthy. The proposed Logit model for assessing credit risk is formulated as equation (17).
The two models were then applied to the 20 samples of non-creditworthy customers, and the results are summarized in table 3. The results show that the two-state Probit model is valid with an accuracy of 0.3. In other words, an acceptable probability (more than 0.50), shown in parentheses, was obtained for 6 of the 20 non-creditworthy customers.
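The accuracy figure used in this validation is the share of known non-creditworthy customers whose predicted probability exceeds 0.50. A minimal sketch of that calculation follows; the function name is hypothetical and the probabilities are made-up values chosen so that 6 of 20 exceed the threshold, not the values of table 3.

```python
def detection_rate(probabilities, threshold=0.5):
    """Share of customers whose predicted probability exceeds the threshold."""
    flagged = sum(1 for p in probabilities if p > threshold)
    return flagged / len(probabilities)


# Hypothetical predicted probabilities for 20 non-creditworthy customers.
demo_probabilities = [
    0.62, 0.71, 0.55, 0.83, 0.58, 0.91,              # above the threshold
    0.10, 0.22, 0.35, 0.41, 0.48, 0.05, 0.30, 0.27,  # below the threshold
    0.44, 0.19, 0.12, 0.38, 0.49, 0.33,
]
rate = detection_rate(demo_probabilities)  # 6 of 20 customers flagged
```

Since only non-creditworthy customers are scored, this measure reflects how many defaulters the model recognizes and says nothing about creditworthy customers wrongly flagged.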

Performing the Principal Component Analysis
Since the parameters in decision-making models are not necessarily measured on the same scale, normalization or data standardization is commonly used to bring the independent variables to a uniform scale. Assuming a normal distribution, the data are standardized by equation (18), $X_{new} = (X_{old} - \bar{X})/s$, where $X_{old}$ is the initial value of the parameter, $X_{new}$ is its standardized value, and $\bar{X}$ and $s$ are the mean and standard deviation of the parameter. This equation transforms the input and output parameters to the standard normal scale, with a mean of zero and a standard deviation of one.
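The standardization can be illustrated for a single variable with the Python standard library. The sketch assumes the sample standard deviation is used, in line with the description of equation (4); the function name and the ages are hypothetical.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]


# Hypothetical ages in years, for illustration only.
ages = [25, 32, 47, 51, 38, 29]
z_ages = standardize(ages)
```

After the transformation the values have a mean of zero and a sample standard deviation of one, so variables measured in years, months or currency units become directly comparable.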
After the data were standardized, the KMO index calculated for the standardized variables was 0.497. Since this is less than 0.5, principal component analysis cannot be performed on the full set of variables. Therefore, the variable that is most correlated with the other variables must be eliminated. To identify it, the covariance matrix was calculated and the sum of each row, that is, the total correlation of each variable with all other variables, was examined. The variable most strongly correlated with the others is removed and the KMO index is recalculated. This process showed that the type of collateral (TC) has the highest correlation with the other input variables. It was removed, and the recalculated KMO was 0.523. The results thus confirmed that the correlation between the input variables is sufficient to perform principal component analysis.
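The KMO calculation and one elimination step can be sketched as follows. The sketch assumes NumPy is available, obtains the partial correlations from the inverse of the correlation matrix, and uses the sum of absolute correlations in each row as the measure of total correlation, which is an assumption about how the row totals are formed. The function names and the demonstration matrix are hypothetical, and a singular correlation matrix is not handled.

```python
import numpy as np


def kmo(R):
    """Overall KMO index of a correlation matrix R."""
    R = np.asarray(R, dtype=float)
    inv = np.linalg.inv(R)
    diag = np.diag(inv)
    # Partial correlation a_ij = -inv_ij / sqrt(inv_ii * inv_jj).
    partial = -inv / np.sqrt(np.outer(diag, diag))
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.sum(R[off_diag] ** 2)
    a2 = np.sum(partial[off_diag] ** 2)
    return float(r2 / (r2 + a2))


def most_correlated(R):
    """Index of the variable with the largest total absolute correlation
    with all other variables (row sum without the diagonal)."""
    A = np.abs(np.asarray(R, dtype=float))
    totals = A.sum(axis=1) - np.diag(A)
    return int(np.argmax(totals))


def drop_variable(R, k):
    """Correlation matrix with row and column k removed."""
    R = np.asarray(R, dtype=float)
    return np.delete(np.delete(R, k, axis=0), k, axis=1)


# Hypothetical correlation matrix of four variables (not the bank data).
R_demo = np.array([
    [1.0, 0.6, 0.5, 0.4],
    [0.6, 1.0, 0.3, 0.2],
    [0.5, 0.3, 1.0, 0.1],
    [0.4, 0.2, 0.1, 1.0],
])
kmo_before = kmo(R_demo)
k = most_correlated(R_demo)
kmo_after = kmo(drop_variable(R_demo, k))
```

In practice the removal would be repeated only while the index stays below 0.5, since each removed variable also discards information.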
The eigenvectors were calculated using equations (7) and (8), and the rotation was applied to the coefficient matrix. Three leading factors were extracted, denoted ($LF_1$, $LF_2$, $LF_3$) and shown in table 4. The main variables are those for which at least one of the coefficients used to form the relevant factor is relatively high; the threshold for this criterion is set to 0.7. As shown in table 4, the two variables gender (G) and age (A) have no estimated coefficient above 0.7. They are therefore regarded as the least influential variables for non-repayment and are deleted. The other variables have coefficients above 0.7, as shown in table 4. The Probit and Logit regression models are therefore continued with the six remaining independent variables and the credit status dependent variable. Note that the symbol Z denotes the standardized form of the corresponding variable; for instance, Z(G) is the standardized form of the input variable Gender. After the correlated variables were removed using the PCA method, the Probit and Logit models were developed with the six remaining variables (JS, AL, TL, TR, IR, Gu). The outputs show that both models perform acceptably in predicting the dependent variable, with a significance level close to zero. As shown in table 5, both models show that the loan repayment period (TL), the interest rate (IR) and the number of guarantors (Gu) have the most significant impact on the probability of repayment, with significance levels below 0.05 at the 95% confidence level (values for the Probit model are shown in brackets).
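The selection rule based on the 0.7 threshold can be written compactly. In the sketch below the function name is hypothetical, the loadings are made-up numbers rather than the values of table 4, and the comparison uses the absolute value of each coefficient, which is an assumption about how negative coefficients are treated.

```python
def select_variables(loadings, threshold=0.7):
    """Keep variables with at least one rotated coefficient whose
    absolute value exceeds the threshold."""
    return [
        name for name, row in loadings.items()
        if max(abs(value) for value in row) > threshold
    ]


# Hypothetical rotated coefficients on three leading factors.
demo_loadings = {
    "G":  [0.31, 0.22, -0.45],
    "A":  [0.52, -0.18, 0.40],
    "JS": [0.81, 0.10, 0.05],
    "AL": [0.12, 0.77, 0.20],
    "TL": [0.08, 0.85, -0.11],
    "TR": [-0.74, 0.15, 0.22],
    "IR": [0.20, 0.09, 0.88],
    "Gu": [0.15, -0.79, 0.30],
}
kept = select_variables(demo_loadings)
```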
By extracting and examining the impact coefficients and significance levels, the Probit and Logit regression models were developed as equations (19) and (20), respectively. The validation results of these models on the 20 samples of non-creditworthy customers are tabulated in table 6. The calculated probabilities of risk for non-creditworthy investors that exceed 0.5 are shown in parentheses. They indicate that the Probit model, validated with an accuracy of 0.7 (14 out of 20), performs better than the Logit model, with an accuracy of 0.6 (12 out of 20).

Conclusion
Risk assessment is one of the main concerns of financial institutions, so this research focuses on risk assessment for lending to individual investors. A number of influential attributes were considered in developing both Probit and Logit models for evaluating the risk of bank lending, using 45 samples of data on creditworthy and non-creditworthy customers. To check internal correlation, the well-known statistical technique of principal component analysis (PCA) was applied, and the results were compared before and after its use. The results show that using PCA improves the accuracy of the prediction models in estimating the probability that a customer is non-creditworthy.
The experimental data were gathered from a Tejarat Bank branch in Rasht, the capital of the northern Iranian province of Guilan. The two proposed models show that the probability of an individual customer being non-creditworthy is mainly affected by three variables: the loan repayment period, the interest rate and the number of guarantors. Applied to the 20 samples of non-creditworthy customers, the models show an accuracy of 0.7 for the Probit model and 0.6 for the Logit model in recognizing non-creditworthy customers. This research examined bank lending risk assessment for individual customers using experimental data, so the developed models can be extended using more effective and detailed data. Because the loan amounts received by private customers span a wide range, it is recommended to divide customers into two categories, small loans and large loans. Finally, in application, the two-state Probit model is suggested, as it may achieve better results than the Logit model.