A Hybrid Ensemble Model for Corporate Bankruptcy Prediction Based on Feature Engineering Method

The bankruptcy of manufacturing corporates is an important factor affecting economic stability, and predicting corporate bankruptcy from financial data has become an active research topic. With the development of data science and artificial intelligence, machine learning helps researchers improve the accuracy and robustness of classification models. Ensemble learning, with its strong predictive power and robustness, plays an important role in machine learning and binary classification. In this study, we propose a bankruptcy classification model that combines feature engineering with ensemble learning. The Synthetic Minority Oversampling Technique (SMOTE), an imbalanced-data learning algorithm, is applied to generate a balanced dataset; a multi-interval discretization filter is applied to enhance the interpretability of the features; and an ensemble learning method is applied to obtain an accurate and objective prediction. To demonstrate the validity and performance of the proposed model, we conducted comparative experiments with ten baseline classifiers, which show that both SMOTE and feature engineering with multi-interval discretization are effective. The comparative results also show that the ensemble learning method improves the performance of the proposed model. Overall, the proposed model achieves better performance and robustness than the baseline classifiers in terms of classification accuracy, F-measure and Area under Curve (AUC).


Introduction
Predicting the bankruptcy of a country's corporates, especially its manufacturing corporates, is of great significance for guiding the national economy. Manufacturing corporates constitute the cornerstone of a country's economic strength and have a significant impact on its overall national strength. Economic decisions about manufacturing corporates also affect countless jobs, suppliers and government tax revenues. Identifying the patterns behind the bankruptcy of manufacturing corporates, and ultimately predicting corporate distress, has therefore become an active research topic.
As always, researchers apply the most advanced analytical tools available to financial statements in order to find the key factors behind corporate bankruptcy. With the development of data mining and artificial intelligence methods, the data-science approach has entered various research fields, including corporate bankruptcy prediction, and has become an effective tool for supporting decision making.
The main purpose of this research is to explore the effect of different sampling methods on prediction precision for highly imbalanced datasets. We propose an ensemble machine learning model based on the LogitBoost [1] algorithm to predict corporate bankruptcy. We obtained the dataset from the Machine Learning Repository at the University of California Irvine (UCI), http://archive.ics.uci.edu/ml/index.php. The dataset is used to validate the model after the data balancing and feature processing steps. The prediction accuracy of the proposed model is compared with that of Naive Bayes (NB) [2], Logistic Regression (LR) [3], Multilayer Perceptron (MLP) [4], J48 decision tree [5], Random Forests (RF) [6], AdaBoost.M1 [7], Bagging [8], K-Nearest Neighbors (K-NN) [9] and Voted Perceptron (VP) [10] to demonstrate the advantage of the proposed model.
The remainder of the paper is organized as follows. In Section 2, we review related work on data mining methods and corporate bankruptcy prediction. In Section 3, an improved ensemble model is proposed for corporate bankruptcy prediction. In Section 4, the experimental results are presented to show that the imbalanced data processing algorithm and the feature processing steps are effective, and that the proposed model performs better than the baseline models. Section 5 summarizes and analyzes the empirical results and discusses future work.

Related Work
Data mining technology has been widely used in various fields and has achieved remarkable performance. In the field of corporate bankruptcy prediction, large numbers of corporate financial datasets are collected for analysis and prediction, providing support for corporate operation and decision-making. In 2004, Foster & Stine applied least-squares regression to predict personal bankruptcy [11]. Alfaro & Elizondo (2008) proposed a new approach to predicting corporate failure that uses the AdaBoost learning algorithm to optimize neural network errors, reducing the generalization error by approximately 30% [12]. Nanni & Lumini (2009) [13] studied the performance of bankruptcy prediction and credit scoring systems based on classifier ensembles. Li & Miu (2010) proposed a bankruptcy prediction model with dynamic loadings for both the accounting ratio-based z-score and the market-based DD variable [14]. Charitou et al. (2011) simplified the Black-Scholes-Merton (BSM) bankruptcy model [15]. Wang et al. (2014) proposed FS-Boosting, an improved boosting method, to predict corporate bankruptcy [16]. Zieba et al. (2016) proposed a novel approach for bankruptcy prediction that utilizes Extreme Gradient Boosting for learning an ensemble of decision trees [17]. The adverse effect of imbalanced datasets on prediction performance is another important problem. Kim et al. (2016) examined the effectiveness of a hybrid method that uses a clustering technique and genetic algorithms, based on an artificial neural network model, to balance the proportion between the minority class and the majority class [18]. Barboza et al. (2017) tested machine learning models for corporate bankruptcy on an unusually large sample and achieved better prediction accuracy and stability than traditional methods such as Naive Bayes [19]. Jardin (2017) combined a segmentation method with ensemble-based models [20]. Wang et al. (2017) proposed a new parameter tuning strategy for the Kernel Extreme Learning Machine (KELM) [21]. Kliestik et al. (2017) discussed the moral and economic responsibility of corporate bankruptcy and developed a new bankruptcy prediction tool, superior to traditional models, to predict the failure of Slovak corporates [22]. However, few studies have combined imbalanced data processing, feature processing and ensemble learning, and most works do not validate their robustness from multiple aspects.

Data Preprocessing and Modeling
In this paper, we propose an ensemble machine learning model to predict corporate bankruptcy. The dataset on bankruptcy prediction of Polish corporates was collected from the Emerging Markets Information Service (EMIS), a database containing information on emerging markets around the world. The dataset, published by Zieba et al. (2016) [17], is available from the Machine Learning Repository at the University of California Irvine (UCI) (http://archive.ics.uci.edu/ml/index.php). After processing the dataset with the imbalanced learning algorithm and feature processing, we ran comparative experiments with ten other classifiers. All experiments were implemented with the Weka data mining tool for Java.

Data Exploration
The dataset contains financial rates from the 1st year of the forecasting period and the corresponding class label, which indicates bankruptcy status after 5 years. The data contain 7027 instances (financial statements), with 65 attributes for each instance. The 65 features and their detailed information are shown in Table 1. In the forecasting period of this dataset, 271 corporates went bankrupt and 6756 did not, which gives an imbalance ratio of 24.93 (6756/271).
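The class counts above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the variable names are illustrative and are not taken from the original experiments.

```python
# Class counts reported above for the first-year Polish bankruptcy data.
n_bankrupt = 271      # minority class (bankrupt corporates)
n_operating = 6756    # majority class (corporates that did not go bankrupt)

n_total = n_bankrupt + n_operating
imbalance_ratio = n_operating / n_bankrupt   # majority count / minority count
minority_share = n_bankrupt / n_total        # fraction of bankrupt corporates

print(n_total)                       # 7027
print(round(imbalance_ratio, 2))     # 24.93
print(round(100 * minority_share, 2))
```

The last line prints the minority share in percent, which is just under 4% of all instances.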

Data Preprocessing
To evaluate the quality of the process, we obtained a dataset of Polish corporates. The selection process involves choosing departments, databases, research phases, the number of firms and the number of financial indicators to be analyzed. The study samples include both bankrupt and still-operating corporates (imbalanced samples).
The quality of data preprocessing directly affects the precision of the model, and data preprocessing must not contaminate the data source. The raw dataset we obtained is severely imbalanced and contains missing values. We therefore replaced each missing value with 0.
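The zero-filling step can be sketched as follows. This is an illustrative Python version and not the Weka procedure used in the experiments; the function name `fill_missing` is hypothetical, and it assumes that missing entries appear as `None`, NaN or the ARFF marker `?`, and that the rows hold feature values only (no class label).

```python
import math

def fill_missing(rows, fill_value=0.0):
    """Return a copy of rows in which missing entries are replaced by fill_value."""
    def is_missing(v):
        # Assumed encodings of a missing value: None, the ARFF "?" marker, or NaN.
        return v is None or v == "?" or (isinstance(v, float) and math.isnan(v))

    return [[fill_value if is_missing(v) else float(v) for v in row]
            for row in rows]

# Small usage example with made-up values.
raw = [[0.2, None, 1.5], ["?", 0.7, float("nan")]]
clean = fill_missing(raw)
print(clean)
```

Replacing missing values with a constant is simple but treats "unknown" as a real measurement of 0, which is one reason the handling of missing values is listed among the limitations in the conclusion.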

Modeling
In this section, we use eleven classification models to predict the bankruptcy status of corporates after 5 years. A flow diagram of our proposed model is provided in Figure 1.
As shown in Figure 1, the model is divided into three parts:
(1) Using the Synthetic Minority Oversampling Technique (SMOTE) to deal with the imbalanced dataset.
(2) Processing the high-dimensional dataset by applying the multi-interval discretization filter.
(3) Adopting the LogitBoost classifier for dataset learning and performance evaluation.
The ensemble model is then used to classify the testing dataset, and its performance is compared with that of the other ten classifiers.

SMOTE Imbalanced Learning Algorithm
In classification, imbalance of the training data refers to a large difference in the number of samples belonging to the different classes. Imbalanced datasets are in fact common and reasonable. In the case of corporate bankruptcy, most corporates survive, and only a small part of them fail due to poor management and other reasons. Chawla et al. (2002) proposed the Synthetic Minority Oversampling Technique (SMOTE) as an important technique for dealing with imbalanced datasets [23]. It is an improved scheme based on the random oversampling algorithm. The main idea of SMOTE is to analyze the minority samples and synthesize new samples from their features, typically by interpolating between a minority sample and one of its nearest minority-class neighbors, and then to add these new samples to the dataset.
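The interpolation idea behind SMOTE can be illustrated with a short sketch. This is a simplified Python version written for illustration, not the Weka filter used in the experiments: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the choice to pick base samples at random are all assumptions, and at least two minority samples are required.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority feature vectors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])   # one of the k nearest neighbours
        gap = rng.random()                   # interpolation factor in [0, 1)
        # The new point lies on the segment between x and the chosen neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Usage: four minority points on the unit square, ten synthetic points.
corners = [[0, 0], [1, 0], [0, 1], [1, 1]]
new_points = smote(corners, n_new=10, k=2)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing minority samples, the new samples stay inside the region already occupied by the minority class rather than being exact duplicates, which is the main difference from random oversampling.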

Feature Processing
Because high-dimensional features affect the prediction accuracy of the final model, the features need to be processed first. Multi-interval discretization, proposed by Fayyad (1993) [24], is applied here as an instance filter; it converts the numerical attributes in the dataset into nominal attributes by dividing their range into intervals. Multi-interval discretization can build a better decision tree on the same dataset, and it improves the efficiency of the cut-point selection heuristic without changing the final result of the algorithm.
A conceptually simple method is to determine the segmentation points by maximizing the purity of the intervals. In practice, however, this method may require the interval purity and the minimum interval size to be set manually. To solve this problem, statistical methods are used to separate the value range of each attribute. Discretization is applied to attributes that are used in classification or association analysis. In general, the effect of discretization depends on the algorithm as well as on the other attributes used. The transformation of a numerical attribute into a nominal one involves two sub-tasks: determining how many nominal values are required, and determining how to map the numerical attribute values to these nominal values. In the first step, after the numerical attribute values are sorted, they are divided into n intervals by specifying the segmentation points. In the second step, all values in an interval are mapped to the same nominal value. Therefore, the problem of discretization is to decide how many segmentation points to choose and where to place them.
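The two sub-tasks described above can be sketched for a single attribute. The code below is an illustrative Python version of entropy-based cut-point selection with the minimum description length (MDL) stopping rule of Fayyad and Irani, followed by the mapping of values to interval indices. It is a sketch under stated assumptions, not the Weka filter used in the experiments; the names `entropy`, `mdl_cut_points` and `discretize` are hypothetical.

```python
import bisect
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def _split(pairs):
    """Recursively choose cut points for (value, label) pairs sorted by value."""
    n = len(pairs)
    labels = [label for _, label in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        e = (i * entropy(labels[:i]) + (n - i) * entropy(labels[i:])) / n
        if best is None or e < best[0]:
            best = (e, i)
    if best is None:
        return []
    e, i = best
    left, right = labels[:i], labels[i:]
    gain = base - e
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    if gain <= (math.log2(n - 1) + delta) / n:
        return []  # MDL criterion: the split is not worth its description cost
    cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return _split(pairs[:i]) + [cut] + _split(pairs[i:])

def mdl_cut_points(values, labels):
    """Step 1: decide how many segmentation points to use and where they go."""
    return _split(sorted(zip(values, labels)))

def discretize(values, cuts):
    """Step 2: map each value to the index of its interval (values <= cut go low)."""
    return [bisect.bisect_left(cuts, v) for v in values]

values = [1, 2, 3, 4, 10, 11, 12, 13]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
cuts = mdl_cut_points(values, labels)
print(cuts, discretize(values, cuts))
```

In this example the attribute separates the two classes perfectly, so a single cut point between 4 and 10 is kept. When the labels alternate with no relation to the attribute, the stopping rule rejects every candidate and the attribute is left as one interval.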

LogitBoost
The fundamental idea of the LogitBoost classifier [1] is to integrate multiple simple weak classifiers into a stronger classifier with higher prediction accuracy and better performance. The LogitBoost classifier is derived by maximizing the log-likelihood function, which is optimized by Newton iterations.
The performance of many classification algorithms can often be improved significantly by boosting, that is, by sequentially applying the algorithm to reweighted versions of the input data and taking a weighted majority vote of the resulting sequence of classifiers. Friedman et al. showed that this phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood [1]. For two-class problems, boosting can be regarded as an approximation to additive modeling on the logistic scale, with the maximum Bernoulli likelihood used as the criterion. Friedman et al. developed more direct approximations and showed that their results were almost the same as those of boosting [1]. The direct multiclass generalizations based on multinomial likelihood perform comparably to other recently proposed multiclass generalizations of boosting in most cases, and are far superior in some. The method is also computationally faster, making it more suitable for large-scale data mining applications.
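The Newton-step view of two-class LogitBoost can be made concrete with a small sketch. The Python code below follows the update described by Friedman et al. [1]: working responses and weights are computed from the current probabilities, a weak regressor is fitted by weighted least squares, and half of its output is added to the additive model. Several choices are assumptions made for illustration and are not stated in this paper: the weak learner is a one-split regression stump, labels are coded as 0 and 1, the working response is clipped to [-4, 4] as a numerical safeguard, and the function names are hypothetical.

```python
import math

def fit_stump(X, z, w):
    """Weighted least-squares regression stump: (feature, threshold, left, right)."""
    mean = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    best = (float("inf"), 0, float("inf"), mean, mean)  # fallback: constant fit
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:     # candidate thresholds
            left = [(zi, wi) for row, zi, wi in zip(X, z, w) if row[j] <= t]
            right = [(zi, wi) for row, zi, wi in zip(X, z, w) if row[j] > t]
            ml = sum(zi * wi for zi, wi in left) / sum(wi for _, wi in left)
            mr = sum(zi * wi for zi, wi in right) / sum(wi for _, wi in right)
            err = (sum(wi * (zi - ml) ** 2 for zi, wi in left)
                   + sum(wi * (zi - mr) ** 2 for zi, wi in right))
            if err < best[0]:
                best = (err, j, t, ml, mr)
    return best[1:]

def predict_stump(stump, row):
    j, t, left, right = stump
    return left if row[j] <= t else right

def logitboost(X, y, n_rounds=20):
    """Fit a two-class LogitBoost model; y holds 0/1 labels."""
    F = [0.0] * len(X)   # additive model, starts at 0
    p = [0.5] * len(X)   # current probability estimates P(y = 1 | x)
    stumps = []
    for _ in range(n_rounds):
        w = [max(pi * (1 - pi), 1e-10) for pi in p]              # Newton weights
        z = [max(-4.0, min(4.0, (yi - pi) / wi))                 # working response
             for yi, pi, wi in zip(y, p, w)]
        stump = fit_stump(X, z, w)
        stumps.append(stump)
        F = [Fi + 0.5 * predict_stump(stump, row) for Fi, row in zip(F, X)]
        p = [1 / (1 + math.exp(-2 * Fi)) for Fi in F]            # e^F / (e^F + e^-F)
    return stumps

def predict_proba(stumps, row):
    F = sum(0.5 * predict_stump(s, row) for s in stumps)
    return 1 / (1 + math.exp(-2 * F))

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
model = logitboost(X, y, n_rounds=10)
print(predict_proba(model, [0.0]), predict_proba(model, [3.0]))
```

Each round is one approximate Newton step on the Bernoulli log-likelihood, which is why the weights p(1 - p) concentrate on samples whose class is still uncertain.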

Experimental Results
In this section, we introduce the experimental setup and compare the experimental results of the proposed model. The data were randomly divided into training and testing sets at a ratio of 0.8:0.2. The training data were pre-processed and then balanced with the SMOTE imbalanced learning algorithm. After multi-interval discretization, the LogitBoost classifier was trained on the training data and finally validated on the testing dataset. The experiment was repeated five times, and the average performance of the proposed model was recorded.
All the experiments were run on a graphics workstation with a 3.2 GHz Intel Core i5 processor and 12 GB of RAM, running the Windows 7 Professional 64-bit operating system.
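The evaluation protocol described above (a random 0.8:0.2 split, repeated five times and averaged) can be sketched as follows. This is an illustrative Python outline, not the Weka setup used in the experiments; `split_train_test`, `repeated_holdout` and the `evaluate` callback are hypothetical names. The callback is where balancing, discretization and training would be applied to the training part only, with the testing part used purely for scoring.

```python
import random
from statistics import mean

def split_train_test(rows, train_frac=0.8, rng=None):
    """Shuffle a copy of rows and split it into training and testing parts."""
    if rng is None:
        rng = random.Random()
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = int(round(train_frac * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]

def repeated_holdout(rows, evaluate, n_repeats=5, train_frac=0.8, seed=0):
    """Average evaluate(train, test) over n_repeats random partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        train, test = split_train_test(rows, train_frac, rng)
        # Any resampling (for example SMOTE) belongs inside evaluate and
        # should touch the training part only.
        scores.append(evaluate(train, test))
    return mean(scores)

# Usage with a dummy scorer that reports the share of rows held out for testing.
rows = list(range(100))
train, test = split_train_test(rows)
score = repeated_holdout(rows, lambda tr, te: len(te) / (len(tr) + len(te)))
print(len(train), len(test), score)
```

Keeping the synthetic samples out of the testing part means the reported scores reflect the original class distribution.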

Metrics of Model Performance
The purpose of classification is to construct a classification function or model (that is, a classifier) that maps data objects to one of a given set of categories. Here the classifier has only two target categories, namely bankruptcy and non-bankruptcy. In this paper, we define bankruptcy as positive and non-bankruptcy as negative. Combining the actual and predicted classes gives four possible situations (true positive, false positive, true negative and false negative), which are represented by the confusion matrix shown in Table 2.
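The four confusion-matrix counts, and the precision, recall and F-measure derived from them, can be computed as in the sketch below. This is an illustrative Python version for the positive (bankruptcy) class; the function names and the toy labels are assumptions, and whether the reported tables use per-class or averaged values is not specified here.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) with `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean of the two)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy example: 1 = bankruptcy (positive), 0 = non-bankruptcy (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, tn, fn)
```

In the toy example, two of the three bankrupt corporates are detected and one healthy corporate is flagged by mistake, so precision, recall and F-measure all equal 2/3.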
The Receiver Operating Characteristic (ROC) curve, also called the sensitivity curve, is a composite indicator that reveals the relationship between sensitivity and specificity for a continuous variable. The x-axis of the ROC curve represents the false-positive rate (FPR), and the y-axis represents the true-positive rate (TPR). We used three metrics in this experiment. Among them, the Area under Curve (AUC) is a common summary statistic for the quality of a classifier in a binary classification task. Generally, the AUC takes a value between 0.5 and 1.0. The ROC curve has the property that it remains unchanged when the distribution of positive and negative samples in the dataset changes. In practice, datasets are frequently imbalanced, with many more positive samples than negative samples (or vice versa), and the distribution of positive and negative samples in the testing dataset may also change over time. Therefore, the AUC value is less sensitive to imbalanced data.
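The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties count as one half). The sketch below is an illustrative Python implementation of this pairwise form; the function name `auc` and the toy scores are assumptions.

```python
def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
base = auc(y_true, scores)

# Duplicating every negative sample changes the class ratio but not the AUC.
y_more_neg = [0, 0, 0, 0, 1, 1]
scores_more_neg = [0.1, 0.4, 0.1, 0.4, 0.35, 0.8]
shifted = auc(y_more_neg, scores_more_neg)
print(base, shifted)
```

Both calls return 0.75, which illustrates the property stated above: changing the proportion of positive and negative samples leaves the AUC unchanged as long as the score distribution within each class stays the same.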

Performance of Baseline Classifier
In this section, in order to make a comparison with our proposed model, we present the performance of the ten baseline classifiers in Table 3. The performance of the adopted LogitBoost classifier is compared with that of Naive Bayes (NB) [2], Logistic Regression (LR) [3], Multilayer Perceptron (MLP) [4], J48 decision tree [5], Random Forests (RF) [6], AdaBoost.M1 [7], Bagging [8], K-Nearest Neighbors (K-NN) [9] and Voted Perceptron (VP) [10]. We also display the results for the dataset to which the imbalanced learning algorithm was not applied; these results are marked with the "Non-SMOTE" label. In addition, the results obtained without discrete feature processing are shown in Table 3 and are marked with the "Non-Discrete" label. To demonstrate how imbalanced data learning and discrete feature processing improve the models, the results on the unprocessed dataset are also presented in Table 3 as the experimental control group. The performance results of the ten baseline classifiers were obtained over five random partitions of the dataset.
From Table 3, we found that the MLP and AdaBoost.M1 models perform well regardless of whether the SMOTE imbalanced learning technique and multi-interval discretization feature processing are applied.

The Performance and Analysis of the Proposed Model
In this section, we test the performance of our proposed model and compare it with the best baseline classifier. As with the baselines, we randomly split the dataset five times to obtain the performance results of the proposed ensemble model.
In Table 3, we compared model performance on the raw dataset, the dataset processed by the SMOTE imbalanced learning technique, the dataset processed by multi-interval discretization, and the dataset processed by both SMOTE and multi-interval discretization.
The best results of the ten baseline classifiers are recorded in Table 3 and compared with the performance of our proposed model. The dataset was processed with the same imbalanced learning technique and discrete feature processing, and the results were compared with those obtained without discretization and without imbalanced data learning. The performance and comparison results of our proposed model are shown in Table 4.
We found that the AUC values of both the baseline classifier and the proposed model improved after SMOTE treatment. A higher AUC value indicates that the classifier separates the two classes better. In addition, the performance of the proposed model improved across all metrics after the raw dataset was processed by multi-interval discretization. For the dataset already processed by the SMOTE imbalanced learning technique, applying multi-interval discretization for feature processing further improved the performance of the model across all metrics. However, for the dataset processed by discretization only, adding SMOTE improved only the AUC value, while the precision and F-measure values decreased slightly.
The two best-performing baseline classifiers are selected for comparison with our proposed ensemble learning classifier. The results of the comparison are shown in Figure 2.
Figure 2 (a) shows the performance of the three classifiers on the dataset processed by both the SMOTE imbalanced learning technique and multi-interval discretization. Figure 2 (b) shows the performance of the three classifiers on the dataset processed by multi-interval discretization only. Figure 2 (c) shows the performance of the three classifiers on the dataset processed by the SMOTE imbalanced learning technique only, where the AUC value is improved. Figure 2 (d) shows the performance of the three classifiers on the dataset without any data preprocessing or feature processing. Compared with the dataset without any processing, the performance of all three classifiers improves after multi-interval discretization. Overall, the performance of the three classifiers improves to varying degrees after processing, and the comprehensive performance of the proposed LogitBoost classifier is better than that of the two baseline classifiers.

Conclusion
Bankruptcy is a difficult problem for the global business community. Manufacturing corporates produce discrete products, which are closely related to production activities and people's lives. Nowadays, manufacturing corporates are not only the key link through which scientific discovery and technological invention are transformed into productivity at scale, but are also closely connected to a large number of related industries. How to predict bankruptcy from the financial statement data of corporates and avoid operational risk has become an urgent problem. With the rise and development of data science, analyzing and predicting corporate bankruptcy by data mining has become an active research topic. Machine learning, as a mainstream method of artificial intelligence, is playing an increasingly important role in today's research.
In this paper, we propose a model based on ensemble learning to predict the bankruptcy of corporates. We obtained a dataset on the bankruptcy of Polish manufacturing corporates and used the ensemble learning model for prediction after processing the data with the SMOTE imbalanced learning algorithm and the multi-interval discretization method. To verify the validity of the LogitBoost classifier, we used ten other models as baseline classifiers for comparative experiments. Compared with the baseline classifiers, our proposed model performed better in terms of precision, F-measure and AUC value. We further compared the results of the proposed model with those of the model trained without the SMOTE imbalanced learning algorithm, demonstrating the necessity of the balancing algorithm for severely imbalanced datasets. In addition, we compared the proposed model with the model trained without multi-interval discretization, and the results show that the performance of the proposed model is improved by applying multi-interval discretization.
However, our proposed model has some shortcomings. First, a corporate's financial data may contain a large number of missing values and abnormal information, which may affect the performance of the classifiers. Second, because financial statement data are high-dimensional, the feature processing and selection method should be improved. For example, principal component analysis or linear discriminant analysis could be used first to reduce the dimension of the features, followed by feature selection. Finally, the model learning process may suffer from local optima and over-fitting; these problems will be investigated and addressed in future work.