Naïve Bayesian Classifier and Classification Trees for the Predictive Accuracy of Probability of Default Credit Card Clients

Abstract: A decision tree is a decision-support tool that uses a tree-like graph to model decisions and their possible outcomes. The naïve Bayesian classifier is a simple probabilistic classifier that, in the binary case, assigns a yes/no label to each observation in a dataset. Both algorithms can be used as predictive models in machine learning and data mining. Here, a comparative analysis of these two machine learning algorithms is carried out. The data are used to classify whether or not a client will default on a credit card payment. From the perspective of risk management, the result can be used to classify clients as credible or non-credible.


Introduction
Many statistical methods, including discriminant analysis, logistic regression, Bayes classifiers, and nearest neighbors, have been used to develop models of risk prediction [1]. With the evolution of artificial intelligence and machine learning, artificial neural networks and classification trees have also been employed to forecast credit risk [2]. Credit risk here means the probability of a delay in the repayment of the credit granted. At the same time, many cardholders, irrespective of their repayment ability, overused credit cards for consumption and accumulated heavy credit- and cash-card debts. The resulting crisis dealt a blow to confidence in consumer finance and is a big challenge for both banks and cardholders. In a well-developed financial system, crisis management is downstream and risk prediction is upstream. The major purpose of risk prediction is to use financial information, such as business financial statements, customer transactions, and repayment records, to predict business performance or an individual customer's credit risk and to reduce damage and uncertainty. From the perspective of risk control, estimating the probability of default is more meaningful than classifying customers into the binary classes risky and non-risky. Therefore, whether the estimated probability of default produced by data mining methods represents the real probability of default is an important question. Forecasting the probability of default is a challenge facing practitioners and researchers, and it needs more study [1,3-5].

Literature Review
Data mining techniques Data mining is now an indispensable tool in decision support systems and plays a key role in market segmentation, customer service, fraud detection, credit and behavior scoring, and benchmarking. In the era of information explosion, individual companies produce and collect huge volumes of data every day. Discovering useful knowledge in these databases and transforming information into actionable results is a major challenge facing companies. Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules [6]. The pros and cons of the naïve Bayesian classifier and classification trees employed in our study are reviewed as follows [7-10].
Naïve Bayesian classifier (NB) The naïve Bayesian classifier is based on Bayes' theorem and assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. One reason Bayesian classifiers are useful is that they provide a theoretical justification for other classifiers that do not explicitly use Bayes' theorem. The major weakness of NB is that its predictive accuracy depends heavily on the class conditional independence assumption. The assumption simplifies computation, but in practice dependences can exist between variables.
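The class conditional independence assumption can be made concrete in code. The following is a minimal sketch (pure Python, on a hypothetical toy dataset rather than the credit card data) of a Bernoulli naïve Bayesian classifier with Laplace smoothing; the per-feature product inside `predict_nb` is exactly the independence assumption described above:

```python
from collections import defaultdict
import math

def train_nb(X, y, alpha=1.0):
    """Fit a Bernoulli naive Bayes model with Laplace smoothing.
    X: list of binary feature vectors; y: list of class labels."""
    classes = set(y)
    n_features = len(X[0])
    priors, likelihoods = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        # Smoothed estimate of P(feature_j = 1 | class c); alpha keeps
        # every probability strictly between 0 and 1.
        likelihoods[c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return priors, likelihoods

def predict_nb(model, x):
    """Pick the class maximizing log P(c) + sum_j log P(x_j | c).
    Summing per-feature log terms IS the independence assumption."""
    priors, likelihoods = model
    best_class, best_score = None, -math.inf
    for c in priors:
        score = math.log(priors[c])
        for j, v in enumerate(x):
            p = likelihoods[c][j]
            score += math.log(p if v else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy binary attributes, e.g. "payment delayed" and "high balance".
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = ["default", "default", "default", "ok", "ok", "ok"]
model = train_nb(X, y)
print(predict_nb(model, [1, 1]))  # -> "default"
```

If two attributes are strongly correlated, their evidence is double-counted by the product, which is the weakness noted above.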
Classification trees (CTs) In a classification tree structure, the top-most node is the root node, each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class. CTs are applied when the response variable is qualitative or discrete quantitative. Classification trees classify observations on the basis of all explanatory variables, supervised by the presence of the response variable. The segmentation process is typically carried out using only one explanatory variable at a time. CTs are based on minimizing impurity, a measure of the variability of the response values of the observations. CTs can result in simple classification rules and can handle the nonlinear and interactive effects of explanatory variables. However, it is difficult to take a tree structure designed for one context and generalize it to other contexts, and their sequential nature and algorithmic complexity make them dependent on the observed data: even a small change might alter the structure of the tree.
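Impurity minimization and one-variable-at-a-time segmentation can be illustrated with the Gini index. This sketch (pure Python, toy data; not the paper's implementation) scores a split on each binary attribute in turn and keeps the one with the lowest weighted impurity:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions.
    Zero means the node is pure (a single class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Try each explanatory variable one at a time and return
    (feature index, weighted impurity) of the best binary split."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        left = [y[i] for i in range(n) if X[i][j] == 0]
        right = [y[i] for i in range(n) if X[i][j] == 1]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if best is None or weighted < best[1]:
            best = (j, weighted)
    return best

# Feature 0 perfectly separates the two classes; feature 1 does not.
X = [[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]]
y = ["default", "default", "default", "ok", "ok", "ok"]
feature, impurity = best_split(X, y)
print(feature, impurity)  # -> 0 0.0
```

Growing a full tree just repeats this search recursively on each child node, which is why a small change in the data near one split can reshape the whole subtree below it.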

Related Works
Credit scoring is the term used to describe formal statistical methods for classifying applicants for credit into "good" and "bad" risk classes [1]. Such methods have become increasingly important with the dramatic growth in consumer credit in recent years. A wide range of statistical methods has been applied, though the literature available to the public is limited for reasons of commercial confidentiality. Many static and dynamic models have been used to assist decision making in the area of consumer and commercial credit. The decisions of interest include whether to extend credit, how much credit to extend, when collection of delinquent accounts should be initiated, and what action should then be taken. Surveys cover the use of discriminant analysis, classification trees, and expert systems for static decisions, and dynamic programming, linear programming, and Markov chains for dynamic decision models. Bayesian methods, coupled with Markov chain Monte Carlo computational techniques, can be successfully employed in the analysis of high-dimensional complex datasets, such as those in credit scoring and benchmarking. Paolo employs conditional independence graphs to localize model specification and inference, allowing a considerable gain in flexibility of modeling and efficiency of computation. Based on eight real-life credit scoring data sets, it was found that both LS-SVM and neural network classifiers yield very good performance, but also that simple classifiers such as logistic regression and linear discriminant analysis perform very well for credit scoring [4]. The performance of credit scoring was also explored by integrating back-propagation neural networks with the traditional discriminant analysis approach [11]. The proposed hybrid approach converges much faster than the conventional neural network model.
Moreover, credit scoring accuracy increases under the proposed methodology, and the hybrid approach outperforms both traditional discriminant analysis and logistic regression.

Experiment: The Data
1. Fix what appears to be a typo in the field header PAY_0.
2. Change codes to values for sex, education, and marriage. Any observations associated with undocumented code values will be removed.
5. With the dataset in long format, create some derived fields.
6. Using the dataset just created and stored in credit_data_individual, create an aggregate dataset for the different group combinations of Sex, Age Range, Marital Status, and Education.

Visualize the groups using the data.tree package.
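Steps 1, 2, and 6 above can be sketched as follows. The original analysis appears to use R; this is a pure-Python equivalent on a few made-up rows. The column names follow the public default-of-credit-card-clients dataset, and the code-to-value mappings are the ones commonly documented for it; both should be treated as assumptions here, not the paper's exact recoding:

```python
# Assumed code-to-value mappings (commonly documented for this dataset).
SEX = {1: "Male", 2: "Female"}
EDUCATION = {1: "Graduate School", 2: "University", 3: "High School"}
MARRIAGE = {1: "Married", 2: "Single"}

def clean(rows):
    """Steps 1-2: fix the PAY_0 header typo, translate codes to values,
    and drop observations carrying undocumented codes."""
    out = []
    for r in rows:
        r = dict(r)
        if "PAY_0" in r:                  # step 1: PAY_0 -> PAY_1
            r["PAY_1"] = r.pop("PAY_0")
        try:                              # step 2: codes -> values
            r["SEX"] = SEX[r["SEX"]]
            r["EDUCATION"] = EDUCATION[r["EDUCATION"]]
            r["MARRIAGE"] = MARRIAGE[r["MARRIAGE"]]
        except KeyError:
            continue                      # undocumented code: remove row
        out.append(r)
    return out

def aggregate(rows):
    """Step 6 (simplified): average credit limit per group combination."""
    groups = {}
    for r in rows:
        key = (r["SEX"], r["EDUCATION"], r["MARRIAGE"])
        groups.setdefault(key, []).append(r["LIMIT_BAL"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

raw = [
    {"SEX": 2, "EDUCATION": 2, "MARRIAGE": 1, "LIMIT_BAL": 20000, "PAY_0": 2},
    {"SEX": 1, "EDUCATION": 1, "MARRIAGE": 2, "LIMIT_BAL": 120000, "PAY_0": 0},
    {"SEX": 1, "EDUCATION": 6, "MARRIAGE": 2, "LIMIT_BAL": 50000, "PAY_0": 0},
]
cleaned = clean(raw)   # third row has an undocumented education code
print(aggregate(cleaned))
```

The age-range derivation and long-format reshaping of steps 5-6 would follow the same pattern, grouping on the additional keys.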

Result
Which group has the highest average credit limit?
Which group has the lowest average credit limit?
Which group has the highest percentage of people with a balance-to-limit ratio less than or equal to 30%?
Which group has the lowest utilization (balance-to-limit) ratio?
Which group is the most likely to be predicted to default?
Which group has the highest amount of debt, is the most likely to default, and is the most likely to miss a payment?
Which group has the lowest amount of debt, is the least likely to be predicted to default, and is the least likely to miss a payment?
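Each of these questions reduces to taking the extreme of a per-group aggregate. A minimal sketch (pure Python; the group labels and numbers are illustrative, not the paper's results) for the highest- and lowest-average-credit-limit questions:

```python
def extreme_group(metric_by_group, highest=True):
    """Return the group key whose aggregated metric is the highest
    (or lowest) -- the shape of every question in this section."""
    pick = max if highest else min
    return pick(metric_by_group, key=metric_by_group.get)

# Hypothetical per-group average credit limits.
avg_limit = {
    ("Female", "Graduate School", "Single"): 180000.0,
    ("Male", "High School", "Married"): 60000.0,
    ("Female", "University", "Married"): 120000.0,
}
print(extreme_group(avg_limit, highest=True))
print(extreme_group(avg_limit, highest=False))
```

Swapping in a different aggregate (share of members at or under 30% utilization, predicted default rate, total debt) answers the remaining questions with the same one-liner.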

Conclusion
This paper examines two major classification techniques, the naïve Bayesian classifier and classification trees, in terms of classification predictive accuracy. Comparing the classification accuracy of the two data mining techniques, the results show little difference in error rates between the two methods, but relatively large differences in area ratio. The naïve Bayesian classifier performs classification more accurately than classification trees. It can therefore be concluded that the choice of classifier is the most important factor in the classification accuracy of predictive models.
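For reference, the two quantities compared in this conclusion can be computed as follows. This sketch (pure Python, made-up predictions) uses a rank-based ROC area and takes the area ratio to be 2·AUC − 1, one common definition (the Gini coefficient); the paper's exact definition of area ratio may differ:

```python
def error_rate(y_true, y_pred):
    """Fraction of misclassified observations."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Rank-based ROC area: probability that a randomly chosen
    positive outscores a randomly chosen negative (ties count half)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up predicted probabilities of default for eight clients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(error_rate(y_true, y_pred))    # share of wrong yes/no calls
print(2 * auc(y_true, scores) - 1)   # area ratio (Gini coefficient)
```

Two models can share an error rate at the 0.5 threshold yet rank clients very differently, which is why the area ratio can separate methods that the error rate cannot.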