Issues of Class Imbalance in Classification of Binary Data: A Review

Handling classification problems with class-imbalanced data has gained the attention of researchers in the last few years. The class imbalance problem arises when one of two classes has more samples than the other. The class with more samples is called the major class, while the other is referred to as the minor class. Most classification or prediction models focus on classifying or predicting the major class correctly and ignore the minor class. In this paper, various data pre-processing approaches for improving model accuracy are reviewed, with an application to terminated pregnancy data. The data were extracted from the 2013 Nigeria Demographic and Health Survey (NDHS). The response variable is “terminated pregnancy” (women of reproductive age were asked whether they had ever experienced a terminated pregnancy), which has two possible classes (“YES” or “NO”) that exhibit class imbalance. The major class (“NO”), representing Nigerian women aged 15 – 49 years who had never experienced a terminated pregnancy, makes up 86.82% of the sample, while the minor class (“YES”) makes up 13.18%. Hence, different resampling techniques were used to handle the problem and to improve model performance. The Synthetic Minority Oversampling Technique (SMOTE) improved the model most among the resampling techniques considered. The following socio-demographic factors were significantly associated with having a terminated pregnancy in Nigeria: age, age at first birth, residential area, region and education level of women.


Introduction
In real-life applications, more often than not, large amounts of data are generated with skewed distributions. Classification and prediction usually suffer when the data exhibit class imbalance: predictions favour the major class while the minor class suffers. A data set is said to be imbalanced if the number of samples from one class is significantly higher than that from the other class [1,2]. In this case, the class with more observations is referred to as the major class, while the class with fewer observations is referred to as the minor class [3,2].
Class imbalance arises in applications such as medical diagnosis, where predicting an uncommon but significant and unignorable disease is more crucial than regular treatment; detecting fraud in banking operations and detecting network intrusions [4]; managing risk and predicting failures of technical equipment [3]; text classification [5]; and credit scoring [6]. In such situations most classifiers are biased towards the major class and hence show very poor classification rates on the minor class [3]. It is not uncommon for a model to predict everything as the major class and ignore the minor class outright. For such problems, it is necessary to build models with reasonable performance on the minority class.
Different methods have been proposed to resolve the issues associated with class imbalance [7]. They are divided into three basic categories: the algorithmic approach, the data pre-processing approach and the feature selection approach [3]. In the data pre-processing approach, resampling is applied to the data: over-sampling, under-sampling or both. Adding new samples to the existing minor class is known as over-sampling, while removing samples from the major class is known as under-sampling. The second method for solving the class imbalance problem is to create or modify an algorithm [2]. Applying an algorithm alone is not a good idea because the size of the data and the class imbalance ratio may be high; hence a combination of a sampling method with an algorithm is used [8].
Longadge et al. [3] noted that, in classification, algorithms generally give more importance to correctly classifying the major class samples. In many applications, misclassifying a rare event (minor class) can result in a more serious problem than misclassifying a common event [9]. "For example in medical diagnosis in case of cancerous cell detection, misclassifying non-cancerous cells may leads to some additional clinical testing but misclassifying cancerous cells leads to very serious health risks. However in classification problems with imbalanced data, the minor class examples are more likely to be misclassified than the major class examples, due to their design principles, most of the machine learning algorithms optimizes the overall classification accuracy which results in misclassification of minor classes" [3].
The paper is organized as follows: section two presents the logistic regression model. Section three reviews data pre-processing approaches for improving a model's performance. Applications of the techniques to the terminated pregnancy data are presented in section four, while the discussion of the results and the conclusion are presented in section five.

Logistic Regression
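In machine learning, generalized linear models have always been among the most popular learning methods. They are intuitively easy to explain and straightforward to implement. One of the most common classification models is logistic regression, which is presented as [10]:

$$P(Y = 1 \mid X) = \pi(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)} \qquad (1)$$

Note that,

$$1 - \pi(X) = \frac{1}{1 + \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p)}$$

Obviously, equation (1) can be expressed as:

$$\log\left(\frac{\pi(X)}{1 - \pi(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \qquad (2)$$

where $\log\left(\pi(X) / (1 - \pi(X))\right)$ is the log odds or simply the logit, and $\beta_0, \beta_1, \dots, \beta_p$ are the parameters of the model to be estimated. The logistic regression model in equation (2) does not assume normality of the error terms, nor does it assume constant error variance. The result in equation (2) can be re-presented as:

$$\text{odds} = \frac{\pi(X)}{1 - \pi(X)} = \exp(\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p) \qquad (3)$$

$$\log(\text{odds}) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \qquad (4)$$

where odds refers to the odds of Y being equal to 1. From equations (2) and (4), it is clear that the logistic regression model has log-odds (the left-hand side of the equation) that are linear in X.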
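The relationship between the probability, the odds and the logit described above can be checked numerically. The sketch below uses only the Python standard library; the coefficient and covariate values are hypothetical and chosen purely for illustration, not estimates from this study.

```python
import math

def linear_predictor(beta0, betas, x):
    """Return beta0 + sum_j beta_j * x_j."""
    return beta0 + sum(b * xj for b, xj in zip(betas, x))

def probability(eta):
    """Logistic function: P(Y = 1 | X) for a linear predictor eta."""
    return 1.0 / (1.0 + math.exp(-eta))

def odds(p):
    """Odds of Y = 1 given the probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log odds of Y = 1 given the probability p."""
    return math.log(odds(p))

# Hypothetical coefficients and covariate values (illustration only).
beta0, betas = -2.0, [0.5, -0.25]
x = [1.0, 2.0]

eta = linear_predictor(beta0, betas, x)  # -2.0 + 0.5 - 0.5 = -2.0
p = probability(eta)

print(round(p, 4))                 # probability of class 1
print(abs(logit(p) - eta) < 1e-9)  # the logit recovers the linear predictor
```

Applying the logit to the fitted probability returns the linear predictor, which is the sense in which the log-odds are linear in X.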
The usual procedure for fitting the logistic regression model is maximum likelihood estimation using the probability function defined in equation (1), which is implemented in most statistical software packages such as R, STATA and SPSS.
The accuracy of the model can be obtained as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP is the number of instances of class 1 (that is, "YES") correctly predicted, TN is the number of instances of class 0 (that is, "NO") correctly predicted, FP is the number of instances of class 0 classified as class 1, and FN is the number of instances of class 1 classified as class 0. From the confusion matrix, Specificity and Sensitivity can be derived as illustrated below:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$

Precision measures the accuracy of the predictions for a single class, that is, the proportion of instances predicted as that class that truly belong to it, whereas Recall measures the proportion of the actual instances of that class that are correctly predicted. Specificity and Sensitivity play a crucial role in deriving the Receiver Operating Characteristic (ROC) curve. The accuracy of classification can also be measured by calculating the area under the ROC curve (AUC). According to Fawcett [11], the ROC curve illustrates classification performance in two dimensions. AUC values range from 0 to 1; an AUC value near 1 implies that the classification accuracy of the model is high [12].
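The measures described above can be computed directly from the four confusion-matrix counts. The following is a minimal sketch in plain Python on toy data (the labels and scores are invented for illustration). It assumes both classes are present in the data, and it computes the AUC as the proportion of (class 1, class 0) pairs in which the class 1 instance receives the higher score, counting ties as one half.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, TN, FP, FN) for a binary outcome."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity (both classes assumed present)."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

def auc(actual, scores, positive=1):
    """AUC via pairwise comparison of class 1 and class 0 scores."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum(1.0 if sp > sn else 0.5 if sp == sn else 0.0
               for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 3 instances of class 1 and 7 of class 0 (illustration only).
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

# A classifier that ignores the minor class and predicts 0 for everyone.
predicted_all_no = [0] * len(actual)
m = metrics(*confusion_counts(actual, predicted_all_no))
print(m)  # high accuracy, perfect specificity, zero sensitivity

# Hypothetical predicted probabilities of class 1.
scores = [0.9, 0.4, 0.6, 0.5, 0.1, 0.2, 0.3, 0.2, 0.1, 0.3]
print(auc(actual, scores))
```

The all-"NO" classifier illustrates why accuracy alone is misleading under class imbalance: it is correct for every major class instance while detecting none of the minor class.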

Data Pre-Processing Approach to Improve Model's Performance
The literature suggests many algorithms and techniques that address the problem of imbalanced class distribution. Of these approaches, resampling methods are discussed in this study. Sampling techniques address problems with the distribution of a data set by artificially resampling it; this is also known as the data pre-processing method [3]. Sampling can be achieved by under-sampling the major class, over-sampling the minor class, or combining over- and under-sampling techniques.
Under-sampling: The most important under-sampling method is random under-sampling, which tries to balance the class distribution by randomly removing major class samples. The problem with this method is the loss of valuable information [3].
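Random under-sampling as described above can be sketched in a few lines. This is an illustrative implementation in plain Python, not the routine from the R packages used later in the paper; the function name `random_undersample`, the toy data and the choice to reduce the major class to exactly the size of the minor class are assumptions made for the example.

```python
import random

def random_undersample(rows, labels, major, seed=0):
    """Randomly drop major-class rows until both classes have equal size.

    Assumes the major class has at least as many rows as the minor class.
    """
    rng = random.Random(seed)
    major_idx = [i for i, y in enumerate(labels) if y == major]
    minor_idx = [i for i, y in enumerate(labels) if y != major]
    kept_major = rng.sample(major_idx, len(minor_idx))  # without replacement
    keep = sorted(kept_major + minor_idx)
    return [rows[i] for i in keep], [labels[i] for i in keep]

# Toy data with roughly the imbalance reported in this paper (87% vs 13%).
labels = ["NO"] * 87 + ["YES"] * 13
rows = list(range(100))  # row identifiers standing in for covariates

new_rows, new_labels = random_undersample(rows, labels, major="NO")
print(new_labels.count("NO"), new_labels.count("YES"))  # 13 13
```

In this toy example 74 of the 87 major class rows are discarded, which makes the information loss noted above concrete.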
Over-sampling: Random over-sampling methods also help to achieve a balanced class distribution, by replicating minor class samples. There is no need to add extra information, since the method reuses the data [8]. However, replicating the same observations can lead to over-fitting; this problem can be addressed by generating new synthetic minor class samples. Chawla et al. [5] proposed a powerful over-sampling approach called "SMOTE", which stands for Synthetic Minority Oversampling Technique. SMOTE generates synthetic minority examples to over-sample the minor class. In this method the learning process consumes more time because the original data set contains a very small number of minority samples [3]. For illustration, consider an example with a data set created artificially from the IRIS data, available in the "DMwR" package in R [13]. Figure 1 gives a visual check of the new data created with the "SMOTE" function. The first panel, tagged "Original Data", shows the raw data before resampling, while the panel on the right-hand side shows the resampled data obtained using the "SMOTE" function in the "DMwR" package in R version 3.5.2 [14].
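The difference between replicating minor class samples and synthesising new ones can be illustrated with a simplified sketch. The code below is plain Python and is not the "SMOTE" function of the "DMwR" package: it assumes purely numeric features, uses Euclidean distance, and creates each synthetic point by interpolating between a randomly chosen minor class sample and one of its k nearest minor class neighbours. The function names and the toy data are hypothetical.

```python
import random

def random_oversample(minor_rows, n_new, seed=0):
    """Replicate existing minor-class rows (sampling with replacement)."""
    rng = random.Random(seed)
    return [list(rng.choice(minor_rows)) for _ in range(n_new)]

def smote_sketch(minor_rows, n_new, k=5, seed=0):
    """Generate n_new synthetic minor-class rows by interpolation.

    Simplified illustration: numeric features only, needs >= 2 minor rows.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minor_rows))
        base = minor_rows[i]
        # Rank the other minor-class rows by squared Euclidean distance.
        others = [r for j, r in enumerate(minor_rows) if j != i]
        others.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()                  # position along the segment
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy minor class with two numeric features (illustration only).
minor = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]]

replicated = random_oversample(minor, n_new=6)
synthetic = smote_sketch(minor, n_new=6, k=2)

print(all(r in minor for r in replicated))  # exact copies of existing rows
print(len(synthetic))                       # 6 new interpolated rows
```

Random over-sampling can only return copies of rows already in the data, whereas the interpolated rows generally fall between existing minor class points. Categorical covariates, such as region or education level, would need a different treatment from this numeric interpolation.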

Application to Pregnancy Terminated Data
The data from the 2013 Nigeria Demographic and Health Survey (NDHS) [15] were used in this study. The data comprise 27,440 observations with eight variables. A summary of the variables is presented in Table 2. Female respondents aged 15 – 49 years were interviewed. Of the 27,440 respondents, 3,616 (13.18%) reported "Yes" and 23,824 (86.82%) reported "No" to ever having terminated a pregnancy. Thus the "Yes" class is the minor class while the "No" class is the major class, and the two classes of the outcome variable are heavily imbalanced. The socio-demographic factors of the respondents (age, age at first birth, region, etc.) are the explanatory variables. All statistical analyses were performed using R version 3.5.2 [14], applying the logistic regression model. Some resampling techniques were applied using the 'ROSE', 'DMwR' and 'caret' packages in R [16,13,17] to improve the model's sensitivity and precision, which were initially zero.
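The class proportions quoted above follow directly from the reported counts, as the short check below shows. The last line is an added illustration rather than a result from the paper: it gives the accuracy that a classifier predicting "No" for every respondent would obtain on these counts.

```python
# Counts reported for the outcome variable.
n_yes, n_no = 3616, 23824
n = n_yes + n_no

pct_yes = round(100 * n_yes / n, 2)
pct_no = round(100 * n_no / n, 2)
imbalance_ratio = round(n_no / n_yes, 2)  # major-class rows per minor-class row

print(n)                 # 27440
print(pct_yes, pct_no)   # 13.18 86.82
print(imbalance_ratio)   # 6.59

# Accuracy of a classifier that always predicts "No" (sensitivity would be 0).
print(round(n_no / n, 4))  # 0.8682
```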

Discussion and Conclusion
We used 80% of the data for training, as shown in Table 3. Based on the model's "Accuracy", "Precision" and "Specificity", the results show that, of all the resampling methods used to improve the model's performance, SMOTE yields better results than the random resampling techniques considered. The model's "Sensitivity" and "Recall" values improved from 0% to 50% and 32.43%, respectively, after correcting for the imbalance in the data using SMOTE. It is worth noting, however, that all the resampling methods improved the performance of the model relative to fitting the raw data. The ROC curve is presented in Figure 2 and the AUC values are given in Table 3.
Moreover, the results in Table 4 show that, keeping other factors constant, a unit increase in age at first birth decreases the odds of having a terminated pregnancy by 3.6%, while a unit increase in age increases the odds by 1.3%. Women living in urban areas had 50% odds of experiencing pregnancy termination compared to women living in rural areas. Women from the North East (NE), North West (NW), South East (SE), South South (SS) and South West (SW) were 83.5%, 22.5%, 43.2%, 25.3% and 25.9%, respectively, more likely to experience pregnancy termination than women from the North Central (NC). Women with primary, secondary and higher levels of education were 39.2%, 50.1% and 44%, respectively, more likely to experience pregnancy termination compared to women with no education. Wealth status was a partially insignificant factor.
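The percentage changes in odds quoted above correspond to exponentiated logistic regression coefficients: for a coefficient β, a unit increase in the covariate multiplies the odds by exp(β), a change of (exp(β) − 1) × 100%. The sketch below illustrates the conversion. The odds ratios 0.964 and 1.013 are back-calculated from the 3.6% decrease and 1.3% increase stated in the text; they are assumptions for the example and are not read from Table 4.

```python
import math

def percent_change_in_odds(beta):
    """Percentage change in the odds for a unit increase in the covariate."""
    return (math.exp(beta) - 1.0) * 100.0

# Coefficients implied by the percentages in the text (assumed, not from Table 4).
beta_age_first_birth = math.log(0.964)  # odds ratio below 1: odds decrease
beta_age = math.log(1.013)              # odds ratio above 1: odds increase

print(round(percent_change_in_odds(beta_age_first_birth), 1))  # -3.6
print(round(percent_change_in_odds(beta_age), 1))              # 1.3
```

A coefficient of zero gives an odds ratio of 1, that is, no change in the odds.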

Conflicts of Interest
There is no conflict of interest.