Modelling of Credit Risk: Random Forests versus Cox Proportional Hazard Regression
Dyana Kwamboka Mageto, Samuel Musili Mwalili, Anthony Gichuhi Waititu
Jomo Kenyatta University of Agriculture and Technology, Department of Statistics and Actuarial Science, Nairobi, Kenya
To cite this article:
Dyana Kwamboka Mageto, Samuel Musili Mwalili, Anthony Gichuhi Waititu. Modelling of Credit Risk: Random Forests versus Cox Proportional Hazard Regression. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 4, 2015, pp. 247-253. doi: 10.11648/j.ajtas.20150404.13
Abstract: In survival analysis several regression modeling strategies can be applied to predict the risk of future events. Often, however, the default choice of analysis tends to rely on Cox regression modeling due to its convenience. Extensions of the random forest approach to survival analysis provide an alternative way to build a risk prediction model. This paper discusses the two approaches in reference to credit management and compares the impact and results of both methods. The Cox Proportional Hazard model displayed a better performance than that of Random Survival Forest when estimating credit risk.
Keywords: Credit Risk, Random Forests, Survival Models
1. Introduction
Credit scoring is one of the most important aspects of a business. It is a system that aids the decision maker in determining whether or not to grant a loan to an applicant (Thomas et al. 1999). Credit risk refers to the probability that a borrower will default on any type of debt by failing to make required payments (Basel, 2000). Traditionally, credit assessment relied on subjective judgment of the creditworthiness of a corporate borrower. However, this approach was found to be very time-consuming, cumbersome and expensive.
In the past, a great number of the world's largest banks have developed sophisticated systems to try to model the credit risk arising from a business (Wekesa, 2012). However, despite the increase in knowledge, some institutions fail to make full use of the information at hand. In Zimbabwe, between 2003 and 2004, a number of banks were forced to close down in what was termed the Zimbabwean Banking Crisis, the main cause being poor credit risk management (Njanike, 2009). The US 2008 financial crisis was a very clear and painful illustration of the effects of an inadequate risk management system. In fact, at the end of 2008, the federal government pledged more money to bail out the financial industry than it spent on the Korean War, the race to the moon, the Vietnam War, Operation Iraqi Freedom and NASA's lifetime budget combined (Politico, 2008). Africa was not spared either, as the rapid growth it had long enjoyed was interrupted in 2009 by the crisis. In the beginning, many economists underestimated its likely impact in Africa. However, by early 2009 it became evident that the crisis had a profound effect throughout the continent. South Africa, for one, already experienced "sudden stops" of capital flows in 2008.
The 2008 global financial crisis did not spare Kenya either. Its impact was both direct and indirect. The indirect effects included the slowdown of the tourism industry. Exports also fell sharply, which in turn affected foreign exchange earnings.
The 2008 financial crisis was a wake-up call to most, if not all, Micro-Financial Institutions (MFIs). It is thus very important that they put measures in place to curb the credit crisis.
Credit risk has thus become a subject of considerable research interest in banking and finance, and has also recently drawn the attention of statistical researchers (Zhang, 2009).
A lot has been done in developing default models to deal with credit risk. In most circumstances the default choice of analysis tends to rely on Cox regression modeling due to its convenience.
The main aim of this paper is to introduce Random Survival Forests (RSF) as an alternative approach for modeling credit risk, and to compare it with Cox Proportional Hazard regression.
2. Review of Previous Research
The use of survival analysis for building time-to-default models was first introduced by Narain (1992) and was further developed by Thomas et al. (1999). Narain (1992) applied the accelerated life exponential model to 24 months of loan data. He illustrated that the proposed model estimated the number of failures at each failure time. The author then built a scorecard using multiple regression, showing that a better credit-granting decision could be made if the score was supported by the estimated survival times. Thomas et al. (1999), on the other hand, compared the performance of exponential, Weibull and Cox’s nonparametric models with logistic regression and concluded that survival-analysis methods are competitive with, and sometimes superior to, the traditional logistic-regression approach.
Wekesa (2012) reviewed the modeling of credit risk for personal loans using the Product-Limit Estimator. The results demonstrated that there is no significant difference between male and female applicants in terms of their survival times and hazard rates. Creamer (2012), however, took a different approach and compared the predictive ability of Random Forests and logistic regression on Latin American banks. The RSF approach indicated that the most important variables affecting banks were size, number of efficient systems and number of deposits. The analysis also revealed that the RSF approach had better predictive capacity than logistic regression. Zhou and Wang (2012) used the RSF approach on loan data. They improved the original random forests approach by allocating weights to decision trees. Their experiments concluded that the weighted approach to tree aggregation improves the overall accuracy and performance of the model.
It is evident that most researchers have resorted to Cox PH and Artificial Neural Networks (ANN) as the form of analysis, not only for loan data but also for other survival data. Very little has been done on the use of the Random Forest approach. The main aim of this paper is therefore to introduce Random Survival Forests (RSF) as an alternative approach for modeling credit risk, and to compare it with Cox Proportional Hazard regression.
3. Methodology
3.1. Random Forests
A Random Forest (RF) is a non-parametric machine learning method that can be applied to survival prediction. In survival settings, the predictor is an ensemble formed by combining the results of many survival trees (Ulla, Hemant and Thomas, 2012). According to Breiman (1999), it is an ensemble method that uses random selection of variables and bootstrap samples.
3.1.1. Bootstrapping in Random Survival Forest
Randomization in RSF is introduced in two ways. First, a randomly selected bootstrap sample (containing approximately 63% of the distinct observations in the original data), called the "in-bag" data, is used for growing each tree. Each bootstrap sample excludes about 37% of the data, called the out-of-bag (OOB) data. The bootstrap sample can be viewed as the root of the tree. Second, the root is split into two daughter nodes by applying a splitting rule to covariates drawn from a randomly selected subset of candidates. The best split is the one that maximizes the survival difference between the daughter nodes. As the number of nodes increases with every split and dissimilar cases become separated, each node in the tree becomes more homogeneous and is populated by cases with similar survival. The tree reaches saturation when each terminal node (the most extreme node in a saturated tree) contains no fewer than a prespecified minimum number of deaths (at least one) with unique survival times.
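The in-bag and out-of-bag proportions follow from sampling n observations with replacement: the chance that a given observation is never drawn is (1 − 1/n)^n, which approaches 1/e ≈ 0.37. The following is a minimal sketch of that calculation, not the procedure used by any particular RSF package; the function name `bootstrap_split`, the seed and the number of repetitions are illustrative assumptions.

```python
import random


def bootstrap_split(n, rng):
    """Draw n indices with replacement.

    Returns the multiplicity of each observation in the bootstrap
    sample and the list of out-of-bag (never drawn) indices.
    """
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    oob = [i for i, c in enumerate(counts) if c == 0]
    return counts, oob


rng = random.Random(0)   # fixed seed, chosen only for reproducibility
n = 500                  # sample size reported in this study
n_trees = 200            # illustrative number of bootstrap samples

oob_fractions = []
for _ in range(n_trees):
    counts, oob = bootstrap_split(n, rng)
    oob_fractions.append(len(oob) / n)

mean_oob = sum(oob_fractions) / n_trees
expected_oob = (1 - 1 / n) ** n   # close to exp(-1), about 0.37

print(f"average OOB fraction:  {mean_oob:.3f}")
print(f"theoretical value:     {expected_oob:.3f}")
print(f"average in-bag share:  {1 - mean_oob:.3f}")
```

The multiplicities in `counts` play the role of the number of times a subject occurs in a bootstrap sample, which is used when the tree-level estimators are built.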
3.1.2. Developing the Random Survival Forest Model
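First, the conditional cumulative hazard function (CHF) is estimated using the Nelson-Aalen estimator for those subjects that are in the bootstrap sample, that is, the "in-bag" data. To illustrate the risk prediction for random forests, for the $b$-th bootstrap sample ($b = 1, \dots, B$) we denote by $T_b(x)$ the terminal node of the survival tree in which a subject with predictor values $x$ ends up. It is vital to note that, because the bootstrap samples are drawn with replacement, some subjects from the original data set may occur a number of times. Therefore we denote by $c_{ib}$ the number of times subject $i$ occurs in the $b$-th bootstrap sample. In a case where subject $i$ is not in the bootstrap sample, $c_{ib} = 0$.
We also introduce the counting process notation of Andersen, Borgan, Gill, and Keiding (1993), where $\tilde{T}_i$ is the observed time of subject $i$, $\Delta_i$ is the default indicator ($\Delta_i = 0$ if censored) and $X_i$ is the vector of predictor values of subject $i$:
$$\tilde{N}_i(s) = I(\tilde{T}_i \le s,\ \Delta_i = 1) \qquad (1)$$
$$\tilde{Y}_i(s) = I(\tilde{T}_i \ge s) \qquad (2)$$
$$\tilde{N}_b^{*}(s, x) = \sum_{i=1}^{n} c_{ib}\, I(X_i \in T_b(x))\, \tilde{N}_i(s) \qquad (3)$$
$$\tilde{Y}_b^{*}(s, x) = \sum_{i=1}^{n} c_{ib}\, I(X_i \in T_b(x))\, \tilde{Y}_i(s) \qquad (4)$$
In RSF the ensemble is then constructed by aggregating tree-based Nelson-Aalen estimators. In other words, in each terminal node the CHF is estimated from the subjects that are in the bootstrap sample, using the Nelson-Aalen estimator (Ishwaran, 2008):
$$\hat{H}_b(t \mid x) = \int_0^t \frac{\tilde{N}_b^{*}(ds, x)}{\tilde{Y}_b^{*}(s, x)} \qquad (5)$$
The survival prediction from the random survival forest at $x$ is then obtained as:
$$\hat{S}^{rsf}(t \mid x) = \exp\left(-\frac{1}{B} \sum_{b=1}^{B} \hat{H}_b(t \mid x)\right) \qquad (6)$$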
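As an illustration of the two steps described in this subsection, the sketch below computes a bootstrap-weighted Nelson-Aalen cumulative hazard for the subjects in one terminal node, and then turns the average of several tree-level cumulative hazards into a survival probability. It is a minimal sketch under stated assumptions: the function names, the toy data and the second tree's hazard value are hypothetical, and subjects are taken to be at risk at time s if their observed time is at least s.

```python
import math


def nelson_aalen(times, events, weights, t):
    """Weighted Nelson-Aalen cumulative hazard at time t.

    times   : observed times (default or censoring) of subjects in the node
    events  : 1 if default was observed, 0 if censored
    weights : bootstrap multiplicities (0 means the subject is out-of-bag)
    """
    event_times = sorted({
        s for s, d, w in zip(times, events, weights)
        if d == 1 and w > 0 and s <= t
    })
    cum_hazard = 0.0
    for s in event_times:
        # weighted number of defaults observed exactly at time s
        defaults = sum(w for u, d, w in zip(times, events, weights)
                       if u == s and d == 1)
        # weighted number of subjects still at risk just before s
        at_risk = sum(w for u, w in zip(times, weights) if u >= s)
        cum_hazard += defaults / at_risk
    return cum_hazard


def ensemble_survival(tree_cum_hazards):
    """Survival probability from the average of tree-level cumulative hazards."""
    mean_hazard = sum(tree_cum_hazards) / len(tree_cum_hazards)
    return math.exp(-mean_hazard)


# Toy terminal node (hypothetical data, months to default or censoring)
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
weights = [1, 2, 1, 0, 1]   # the fourth subject is out-of-bag for this tree

h_tree1 = nelson_aalen(times, events, weights, t=6)   # 1/5 + 2/4
h_tree2 = 0.5                                         # assumed value from another tree
s_forest = ensemble_survival([h_tree1, h_tree2])

print(f"tree-level cumulative hazard at t = 6: {h_tree1:.3f}")
print(f"ensemble survival at t = 6:            {s_forest:.3f}")
```

Averaging is done on the cumulative hazard scale before exponentiating, which mirrors the ensemble construction described above rather than averaging survival curves directly.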
3.2. Cox Proportional Hazard Regression
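The Cox PH model is the most widely used regression model in survival analysis, because it makes no assumptions concerning the nature or shape of the underlying (baseline) survival distribution. In Cox regression the CHF depends on the vector of predictor variables $x$:
$$H(t \mid x) = H_0(t) \exp(x^{\top}\beta) \qquad (7)$$
The Cox model can then be written in terms of the hazard as:
$$h(t \mid x) = h_0(t) \exp(x^{\top}\beta) \qquad (8)$$
Here $h_0(t)$ describes the baseline hazard function, in our case the risk of a client defaulting on payment when all covariates are zero, and $H_0(t) = \int_0^t h_0(s)\,ds$ is the corresponding baseline CHF. The parameter $\beta$ is the vector of regression coefficients, which describe how the hazard varies in response to the model's covariates. The survival prediction for predictor values $x$ is then obtained by:
$$\hat{S}^{cox}(t \mid x) = \exp\left(-\hat{H}_0(t) \exp(x^{\top}\hat{\beta})\right) \qquad (9)$$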
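The way a fitted Cox model turns a baseline cumulative hazard and a set of coefficients into a survival probability can be sketched as follows. This is an illustration only: the baseline cumulative hazard value, the coefficient vector and the covariate coding are hypothetical and are not estimates from this study.

```python
import math


def cox_survival(baseline_cum_hazard, beta, x):
    """Survival probability exp(-H0(t) * exp(x'beta)) for one subject."""
    linear_predictor = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(-baseline_cum_hazard * math.exp(linear_predictor))


# Hypothetical inputs, chosen only for illustration
h0_at_t = 0.10        # assumed baseline cumulative hazard at some time t
beta = [0.7, 0.4]     # assumed coefficients for two binary covariates

s_reference = cox_survival(h0_at_t, beta, [0, 0])   # all covariates zero
s_exposed = cox_survival(h0_at_t, beta, [1, 0])     # first covariate present

# Proportional hazards: the ratio of cumulative hazards equals exp(beta_1)
hazard_ratio = math.log(s_exposed) / math.log(s_reference)

print(f"survival, reference subject: {s_reference:.4f}")
print(f"survival, exposed subject:   {s_exposed:.4f}")
print(f"implied hazard ratio:        {hazard_ratio:.4f}")
```

The implied hazard ratio does not depend on the time point chosen, which is the proportional hazards property that gives the model its name.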
In this study the Cox model is built in the R statistical package (version 3.1.2). We use the model to check which covariates are significant in credit default analysis.
3.3. Performance Measure
The two models under study will be compared on the basis of their predictive ability. In this study prediction error is measured by Harrell’s concordance index (Harrell et al., 1982). Unlike other measures of survival performance, Harrell’s C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals (May et al., 2004). According to Kattan et al. (1998), the method has quickly become popular in the literature as a means of assessing prediction performance in survival analysis.
The error rate is Error = 1 − C. Note that 0≤Error≤1 and that Error = 0.5 corresponds to a procedure doing no better than random guessing, whereas Error = 0 indicates perfect accuracy.
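To make the definition concrete, the sketch below computes Harrell's C-index and the corresponding error rate for a small set of hypothetical loans. It is a simplified illustration: a pair is counted only when the account with the shorter time has an observed default, pairs with tied times are ignored, and ties in predicted risk count as one half. The function name and the data are assumptions made for this example.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is usable when subject i defaulted (events[i] == 1)
    strictly before subject j's observed time. The pair is concordant
    when the subject who defaulted earlier has the higher predicted risk.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs: C-index is undefined")
    return concordant / usable


# Hypothetical loans: months observed, default indicator, predicted risk
times = [5, 10, 15, 20]
events = [1, 0, 1, 0]
risk = [0.9, 0.4, 0.6, 0.1]

c_good = harrell_c(times, events, risk)
c_reversed = harrell_c(times, events, [-r for r in risk])
c_constant = harrell_c(times, events, [0.5] * 4)

print("perfect ordering:  C =", c_good, " Error =", 1 - c_good)
print("reversed ordering: C =", c_reversed, " Error =", 1 - c_reversed)
print("constant score:    C =", c_constant, " Error =", 1 - c_constant)
```

A constant risk score yields C = 0.5 and therefore Error = 0.5, which is the random-guessing benchmark referred to above.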
4. Data Exploration
4.1. Data Structure
The data used in this study were secondary data obtained from leading commercial banks in Kenya. The loan applicants in the study were randomly picked from the banks' database comprising 70 branches. The sample obtained was based on a portfolio of personal loans whose maturity was 45 months. The study thus included loans taken from January 2004 to September 2008. The sample obtained included 250 male applicants and 250 female applicants.
4.2. Variable Description
Each account is observed from the month it was opened until it becomes ‘bad’, implying it is closed, or until the end of the observation period. An account is considered bad if payment is not made for two consecutive months, in accordance with industry practice. If the account does not miss two consecutive payments and is closed or survives beyond the observation period, it is considered to be censored. The study also assumes that accounts with early payment or settlement were censored.
The variables under study are listed below.
Variable | Measurement |
Marital Status | Married, Not Married |
Gender | Male, Female |
Age | Varied |
Status | Default, Non Default |
Time of Payment | Varied |
Employment | Employed, Unemployed |
Homeownership | With Home, Without Home |
Education Level | Secondary and above, Below secondary |
5. Results and Discussion
5.1. Data Presentation
The dominant characteristics in this study were being married, being unemployed, not owning a home and not having studied beyond secondary school. As for status, most of the applicants did not default. This is illustrated in the table below.
Marital Status | Sex | Employment | Home Ownership | Education Level |
Married: 300 | Male: 250 | Employed: 201 | Home: 92 | Post Secondary: 48 |
Unmarried: 200 | Female: 250 | Unemployed: 299 | No Home: 408 | Secondary or Below: 352 |
As for age, the youngest applicant was 22 years old while the oldest was 55. The shortest loan payment period was 12 months and the longest 36 months.
5.2. Random Forest Model
The Random Survival Forest package used in this study produces an ensemble estimate of the cumulative hazard function. This is a machine learning algorithm consisting of many trees used in classification and analysis. In our study we focus only on the applications of this model that are relevant for our analysis.
First, the basic composition of the model is illustrated in the table below.
Sample size | 500 |
Number of deaths | 108 |
Number of trees | 2000 |
Minimum terminal node size | 3 |
Average no. of terminal nodes | 76.083 |
No. of variables tried at each split | 3 |
Total no. of variables | 7 |
Analysis | RSF |
Family | Surv |
Splitting rule | Logrank |
Error rate | 43.78% |
From this we can observe that, out of the 500 applicants sampled, 108 defaulted on payment. The forest of family "surv" was built with 2000 trees, with 3 variables tried at each split. In our study we use the default splitting criterion, i.e. the logrank test statistic. For the performance evaluation, the out-of-bag (OOB) estimate of the error rate was calculated. This "unbiased" estimate of error was 0.4378, which is smaller than 0.5, implying that the model predicts better than random guessing and that the predictors carry information about the probability of default. This suggests that it is a fairly good model.
5.2.1. Error Estimate Against Number of Trees
The figure below represents the OOB error estimates against the number of trees in the forest.
This figure illustrates that the OOB error rate stabilizes after about 1000 trees. This plot is a good guide as to how many decision trees one requires when creating a random forest model. It is important to note that, to ensure each variable is included in the forest, it is better to grow a large random survival forest.
5.2.2. Prediction of Survival Estimates
This is done by extracting the OOB estimates from the random forest. The figure below shows the predicted survival of our RSF model. Blue lines represent the observations who defaulted while the red lines represent those who did not default.
This figure also shows the median predicted survival, with a 95% confidence band, for each status group over time.
5.2.3. Variable Importance According to RSF
The most important variables according to RSF were Marital Status, Employment, Home Ownership and Education Level, while the least important were Sex and Age.
5.3. Cox Proportional Hazard Model
The Cox PH model was fitted by regressing time and status on the other variables, and the following results were obtained.
Variable | coef | exp(coef) | lower .95 | upper .95 |
Marital | 1.111953 | 3.040292 | 1.3738 | 6.728 |
Sex | -0.26114 | 0.770175 | 0.5238 | 1.132 |
Age | 0.003961 | 1.003924 | 0.918 | 1.098 |
Employment | 0.43173 | 1.53992 | 1.0237 | 2.317 |
Home | 0.729073 | 2.073159 | 1.1317 | 3.798 |
Education | 0.072468 | 1.075158 | 0.6999 | 1.652 |
The table above gives a portion of the analysis done on the variables. The "coef" column gives the estimated coefficient for each variable. The exponentiated coefficients in the second column are the multiplicative effects on the hazard. For instance, holding the other covariates constant, an additional year of age multiplies the hazard of default by a factor of about 1.004 on average, i.e. increases it by roughly 0.4%. The lower .95 and upper .95 columns give the 95% confidence interval of exp(coef) for each variable.
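The relationship between the coef and exp(coef) columns can be checked directly by exponentiating the reported coefficients. The short sketch below does this for the values in the table; the dictionary names are chosen for illustration, and the coefficients are copied from the table as printed.

```python
import math

# Coefficients as reported in the Cox model output table
coefs = {
    "Marital": 1.111953,
    "Sex": -0.26114,
    "Age": 0.003961,
    "Employment": 0.43173,
    "Home": 0.729073,
    "Education": 0.072468,
}

# exp(coef) is the hazard ratio for a one-unit increase in the covariate
hazard_ratios = {name: math.exp(c) for name, c in coefs.items()}

# Percentage change in the hazard per one-unit increase
percent_change = {name: 100 * (hr - 1) for name, hr in hazard_ratios.items()}

for name in coefs:
    print(f"{name:<11} HR = {hazard_ratios[name]:.4f}  "
          f"change in hazard = {percent_change[name]:+.1f}%")
```

A positive coefficient gives a hazard ratio above 1 (higher default hazard), while a negative coefficient, as for Sex, gives a hazard ratio below 1.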
5.3.1. Variable Importance According to Cox-Model
The covariates marital status, employment and home ownership are significant at the 5% level, since their 95% confidence intervals for the hazard ratio exclude 1, with marital status being the most significant. On the other hand, education level and age are the least important variables.
5.3.2. Error Estimate
The R-square for this model is given as 0.924 which is very close to 1 indicating that the model predicts the probability of default very well.
The likelihood-ratio, Wald, and score chi-square statistics at the bottom of the output are asymptotically equivalent tests of the null hypothesis that all the coefficients are zero, that is, that the variables are not important. In this study the three statistics are in close agreement and all three p-values are below 0.001, so we reject the null hypothesis and conclude that the variables are jointly significant in the model.
The results discussed are shown below.
Likelihood ratio test | Wald test | Score (logrank) test |
30.96 | 28.56 | 29.83 |
on 9 df | on 9 df | on 9 df |
p = 0.0003005 | p = 0.0007681 | p = 0.0004691 |
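As a consistency check, the reported p-values can be reproduced from the three test statistics and the 9 degrees of freedom given in the table. The sketch below uses the closed-form upper-tail probability of the chi-square distribution for odd degrees of freedom, so that only the Python standard library is needed; the function name is an assumption made for this example.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with odd df.

    Uses the closed form
        erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_j x^(j - 1/2) / (2j - 1)!!
    with j running from 1 to (df - 1) / 2.
    """
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form only applies to odd df >= 1")
    m = (df - 1) // 2
    series = 0.0
    double_factorial = 1.0
    for j in range(1, m + 1):
        double_factorial *= 2 * j - 1
        series += x ** (j - 0.5) / double_factorial
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * series)


# Test statistics as reported in the table, each on 9 degrees of freedom
statistics = {
    "Likelihood ratio": 30.96,
    "Wald": 28.56,
    "Score (logrank)": 29.83,
}

p_values = {name: chi2_sf_odd_df(stat, 9) for name, stat in statistics.items()}

for name, p in p_values.items():
    print(f"{name:<17} p = {p:.7f}")
```

Because the statistics are rounded to two decimals in the table, the recomputed values can differ from the printed p-values in the last digit or two.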
5.3.3. Predicted Survival Probability
It is often of interest to examine the distribution of predicted survival times, that is, the predicted survival probability at each time point (in months). This is illustrated below.
6. Model Diagnostic
The two models used in this study were the Random Survival Forest model and the Cox Proportional Hazard model. The section below looks at the performance evaluation of the two models. To measure the performance we used Harrell's concordance index (C-index).
The error rate (1 − C) for RSF was obtained as 0.4378, while that of the Cox model was obtained as 0.3376, corresponding to C-index values of 0.5622 and 0.6624 respectively. From this it is evident that the Cox model has a lower error rate, and hence a higher C-index, than RSF. Hence, according to Harrell's concordance index, the Cox model displays a better performance than RSF.
7. Conclusion and Recommendations
The Cox-PH model was found to be a better model for predicting the probability of default than RSF. In both models marital status, employment and home ownership were found to be the common important variables. However, the RSF model identified education level as an important variable as well. It was also found that sex and age were not important in predicting the probability of default.
We therefore recommend the use of other methods to model credit risk, such as accelerated failure-time models and Kaplan-Meier models, to see how these models would behave.
References