Comparative Study of Various Methods of Handling Missing Data

The scientific literature lacks a straightforward answer as to which of the existing methods of missing data imputation is most suitable in terms of simplicity, accuracy and ease of use. This paper explores various methods of data imputation and then proposes a robust one. It uses simulated data sets generated from various distributions. A regression function is fitted to each simulated data set and its residual standard error is obtained. Data are then removed at random from the set of independent variables to create artificial non-response, and suitable methods are used to impute the missing data. Mean, regression, hot and cold deck, multiple and median imputation, list wise deletion, the EM algorithm and the nearest neighbour method are considered. The paper investigates the three most common traditional methods of handling missing data to establish the optimal one. The most suitable method is the one whose imputed data sample characteristics do not vary considerably from those of the original data set before imputation. This variation is measured using the regression intercept and the residual standard error. The R statistical package was used for most of the regressions. Microsoft Excel was used to determine the correlation of columns in the hot decking method because it is readily available as a component of the Microsoft package. The data analysis yielded intercept and R-squared values for median imputation that closely mirror those of the original data sets, suggesting that median imputation is the better method among the conventional ones. This finding is important from a research point of view, given how often data are missing in scientific research. Finding and using the median is simple, so most researchers have a ready tool at hand for handling missing data.


Introduction
Research is the driving force behind the development of any nation. Any endeavour in this area therefore requires that those concerned arm themselves with the right tools to obtain accurate and relevant information from the survey being undertaken. Missing data is a major challenge in many areas of research, especially social research. Many researchers feel helpless when confronted with it. Some resort to non-scientific ways of addressing the challenge, while others compromise the reliability of their findings by using procedures that cannot guarantee accuracy. Non-response is of great concern to researchers because it pervades almost all survey research [24].
In any research work, the ultimate goal of the researcher is to conduct the most accurate analysis of the data so as to make valid and efficient inferences about a population and to guide users of statistical results and researchers alike [31]. The most challenging part is therefore to obtain all the relevant information about the case under investigation. In most cases this fails to materialize, either because subjects decide to hold back certain information for personal reasons or because some do not have ready answers at the time they are interviewed. When part of the data is missing from a survey and the missing data are ignored by using only the available sample, the result may not be representative of the population under study, since some of its characteristics are missing. Missing data are ignored when there is a widespread failure to understand the significance of the problem or a lack of awareness of the solutions to it [17]. The higher the non-response rate, the greater the bias if the characteristic under study differs markedly between respondents and non-respondents. According to [14], other causes of missing data are errors on the part of the researcher, of those collecting or entering the data, and of the participants.
Compared on the basis of ease of use, efficiency and robustness, median imputation performs better than the other methods. For a detailed review of these approaches, see [25].

Various Methods of Handling Missing Data
This section reviews existing methods that have been used to address the issue of missing data in surveys.

EM Algorithm
This method has been described as "archaic" [11]. Despite this, it is still the quickest; where accuracy is key, however, it must be used cautiously. The method is largely suitable for data that are missing completely at random (MCAR), and it may be suitable when only a small number of cases have missing values, a view also supported by [11].

List Wise Deletion
According to [16], list wise deletion is regarded as the most common and easiest method of dealing with missing data; it is also called complete case analysis [11]. After list wise deletion, there is no guarantee that the remaining data are still representative of the original population under study. The approach reduces the sample size, which in turn reduces statistical power and brings into question how representative the remaining sample is of the population being studied [11]. According to [6], this systematic loss of data increases the risk of bias, a risk that is lessened only when the data are MCAR. Some researchers have characterized list wise deletion as the least desirable data imputation method because of these biases and have warned against its use [11].
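The loss of sample size described above can be illustrated with a minimal Python sketch. It is not the procedure used in this paper (which relied on R); the function name `listwise_delete` and the small data set are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Sequence


def listwise_delete(rows: Sequence[Sequence[Optional[float]]]) -> List[List[float]]:
    """Keep only the complete cases, i.e. rows in which every value is observed."""
    return [list(row) for row in rows if all(v is not None for v in row)]


# Hypothetical survey records; None marks a missing response.
rows = [
    (1.0, 2.0, 3.0),
    (2.0, None, 4.0),
    (3.0, 5.0, None),
    (4.0, 6.0, 7.0),
    (5.0, 8.0, 9.0),
]

complete = listwise_delete(rows)
retained = len(complete) / len(rows)  # share of the sample that survives deletion
print(len(complete), retained)
```

A single missing value is enough to discard an entire record, which is why the retained share can fall quickly as the number of variables grows.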

Mean Substitution
The third method in this study is mean substitution. This method is "archaic" [12] but is still considered. The mean of the total sample for a variable is substituted for all the missing values in that variable. Mean substitution is a quick and easy way to recover cases [20]. However, the estimates of the standard deviation and variance used in calculating other parametric tests are reduced, resulting in biased standard errors [36]. There is debate about using this method because of the inherent bias that results [31]. The method would be appropriate only if a small number of cases have missing values. Its serious disadvantage is that it can distort the distribution, leading to underestimation of the variance and covariance [36]. According to [28], the drawbacks of mean imputation include: (a) the sample size is overestimated; (b) the variance is underestimated; (c) the correlation is negatively biased; and (d) the distribution of the new values is an incorrect representation of the population values, because the shape of the distribution is distorted by adding values equal to the mean.
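The underestimation of variance can be seen in a small Python sketch. The helper `mean_impute` and the numbers are hypothetical; the sketch only illustrates the mechanism and is not the analysis carried out in this paper.

```python
import statistics
from typing import List, Optional


def mean_impute(values: List[Optional[float]]) -> List[float]:
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing entries.
with_gaps = [2.0, None, 6.0, 8.0, None, 12.0]
imputed = mean_impute(with_gaps)

sd_observed = statistics.stdev([v for v in with_gaps if v is not None])
sd_imputed = statistics.stdev(imputed)

# The mean is preserved, but the spread shrinks because the filled-in
# values add nothing to the sum of squared deviations.
print(statistics.mean(imputed), sd_observed, sd_imputed)
```

The imputed values sit exactly at the mean, so the sum of squared deviations is unchanged while the divisor grows with the apparent sample size.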

Regression Imputation
According to [15], the best predictors (that is, those with the highest correlations) are selected and used as independent variables in a regression equation in which the variable with missing data is the dependent variable. The predicted values from the last round are the ones used to replace the missing values [15]. Statistical software is the better solution here, but even this comes at a cost: the user must invest time in learning the software, in addition to meeting its financial cost [33]. Clearly, then, a better method of data imputation needs to be sought.
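A minimal Python sketch of regression imputation with a single predictor is given here for illustration. It assumes one fully observed predictor and a simple least-squares line; the function names `fit_line` and `regression_impute` and the data are hypothetical, and the multi-round predictor selection described in [15] is not reproduced.

```python
from typing import List, Optional, Tuple


def fit_line(x: List[float], y: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_impute(x: List[float], y: List[Optional[float]]) -> List[float]:
    """Fit the line on complete cases, then predict y wherever it is missing."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope = fit_line([a for a, _ in pairs], [b for _, b in pairs])
    return [intercept + slope * a if b is None else b for a, b in zip(x, y)]


# Hypothetical data following y = 1 + 2x, with one missing response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, None, 9.0, 11.0]
filled = regression_impute(x, y)
print(filled)
```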

Multiple Imputations
Multiple imputation is essentially a way to solve the modelling problem by simulating the distribution of the missing data [30]. Users are free to ignore the imputations, since all imputed values are tagged. The assumptions of the method are not satisfied if the variables that determine non-response are not included as conditioning variables [32]. This has been demonstrated in simulation studies [7]. Furthermore, using simulated and real data sets from different scientific fields and with varying rates of item non-response, existing research emphasizes the robustness of multiple imputation to the specific imputation model chosen, provided that appropriate conditioning variables are available in the data set [2]. Multiple imputation creates several imputed data sets. If automatic variable selection is then run on each of these data sets separately, the set of variables entering the model can vary across them, which makes the results hard to combine [33].

Hot Decking
Hot decking works well when the variable used to sort the data is highly predictive of the variable with the missing values and when the sample is large, so that a similar case is easily identified [35]. According to [32], one advantage of hot decking over mean substitution is that the standard deviation of the variable with the inserted values better approximates the standard deviation of the variable without the substituted values. However, standard deviations are still likely to be lower overall [35]. The method may not work when there is no correlation between the variables. Another drawback is that it is difficult to implement; programming it requires considerable time and labour [39].
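One common variant, the sequential hot deck, can be sketched in Python as follows. This is an assumption about the variant, since the text does not specify one: records are sorted by the predictive variable and each missing value is taken from the closest preceding donor in that order (or the first following donor when none precedes). The function name and data are hypothetical.

```python
from typing import List, Optional


def hot_deck_impute(sort_var: List[float], target: List[Optional[float]]) -> List[Optional[float]]:
    """Sequential hot deck: fill each gap from the adjacent donor in sorted order."""
    order = sorted(range(len(target)), key=lambda i: sort_var[i])
    filled = list(target)
    last_donor = None
    pending = []  # gaps seen before the first donor in sorted order
    for i in order:
        if target[i] is not None:
            last_donor = target[i]
            for j in pending:
                filled[j] = last_donor
            pending = []
        elif last_donor is not None:
            filled[i] = last_donor
        else:
            pending.append(i)
    return filled


# Hypothetical records: age is the sorting variable, income has one gap.
age = [25.0, 40.0, 31.0, 58.0, 45.0]
income = [30.0, None, 35.0, 60.0, 50.0]
print(hot_deck_impute(age, income))
```

In this example the respondent aged 40 receives the income of the respondent aged 31, the nearest donor below in the sorted order. The quality of the result depends entirely on how well the sorting variable predicts the target.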

Median Imputation
According to [1], the mean is affected by the presence of outliers, and it seems natural to use the median instead to ensure robustness. The existence of other features in the data set with similar information (high correlation) or similar predictive power can make the missing data imputation useless, or even harmful [1].
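The robustness argument can be illustrated with a short Python sketch that contrasts the median and mean fill values in the presence of an outlier. The helper `median_impute` and the figures are hypothetical.

```python
import statistics
from typing import List, Optional


def median_impute(values: List[Optional[float]]) -> List[float]:
    """Replace every missing value (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with one missing entry and one extreme outlier.
incomes = [21.0, 23.0, None, 25.0, 27.0, 400.0]
observed = [v for v in incomes if v is not None]

median_filled = median_impute(incomes)
mean_fill = statistics.mean(observed)  # value that mean substitution would insert

print(median_filled[2], mean_fill)
```

The median fill stays among the typical observations, whereas the mean fill is pulled towards the outlier and lies above every non-outlying value.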

The Nearest Neighbor (NN) Imputation Method
According to [38], this is one commonly used imputation method for item non-response. The missing value is imputed using the auxiliary variable: the value imputed is the one whose auxiliary value is closest to that of the unit with the missing value, found by considering the previous value and the next value. A single-value NN imputation is carried out as follows. Consider a population $U = \{1, 2, 3, \dots, N\}$. Associated with the $k$th unit of the population are two variables $(x_k, y_k)$, $k = 1, 2, \dots, N$, where $x_k > 0$ and $y_k > 0$. The variable $y$ is the variable under study and is not known for every unit, while $x$ is the covariate, assumed to be known for all units of the population. Suppose that in the sample $m$ units respond to an item and the rest do not. Then, for a non-responding unit $i$, the value $y_j$ of the responding unit $j$ whose $x_j$ is nearest to $x_i$ is imputed for the missing value $y_i$ [38]. According to [26], the bias of the population mean is known to be small if the relationship between $x$ and $y$ is linear. When the relationship is not linear, a serious challenge will therefore arise [26].
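A minimal Python sketch of single-value nearest neighbour imputation follows. It assumes that distance is measured as the absolute difference in the covariate x and that ties are broken in favour of the first donor encountered; both choices, the function name `nn_impute` and the data are illustrative assumptions rather than details taken from [38].

```python
from typing import List, Optional


def nn_impute(x: List[float], y: List[Optional[float]]) -> List[float]:
    """Fill each missing y with the y of the respondent whose x is closest."""
    donors = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    filled = []
    for xi, yi in zip(x, y):
        if yi is None:
            # min() returns the first donor with the smallest |x_j - x_i|
            _, yi = min(donors, key=lambda d: abs(d[0] - xi))
        filled.append(yi)
    return filled


# Hypothetical sample: x is known for all units, y is missing for one unit.
x = [1.0, 2.0, 3.2, 5.0, 8.0]
y = [10.0, 12.0, None, 20.0, 31.0]
print(nn_impute(x, y))
```

Here the unit with x = 3.2 is closest to the respondent with x = 2.0, so it receives that respondent's value of y.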

Methods
Three traditional methods of data imputation are considered in this research. These methods involve substituting a single value for each missing value [33]; usually the imputed value is the mean or median of the variable concerned [33]. The dependent variable $Y \in \mathbb{R}^{n}$ and the independent variables, taken as a linear combination of vectors, are considered. A quantitative response is assumed, and multiple regression, the most common method of statistical adjustment, is proposed in the form of an additive model. Let $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$ (1), where, for $i = 1, 2, \dots, n$ and $j = 1, \dots, p$, the $x_{ij}$ are normal iid random variables. A linear model for the distribution can be written as $Y = X\beta + \varepsilon$ (2). Equation (1) is used as a linear model with a logistic error $\varepsilon$.

The Model
Let us denote each independent variable by $X_i$, and let $X_i$ depend on several factors $x_{ij}$. Each $x_{ij}$ is therefore a vector belonging to a vector of random variables.

Regression for Complete Data Set
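Regression for a complete data set is proposed to have the form $y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i$, where $i = 1, \dots, n$ and $j = 1, \dots, p$. This regression was obtained for the original sample data set for each distribution.
The regression for the data set with median imputation has the same form and is obtained from the data set after the missing values have been replaced by the median; finally, the regression with list wise deletion is obtained in the same way from the complete cases only. The error $\varepsilon_i$ is assumed independent and identically distributed with mean zero and unit scale.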

Parameter Estimation
The R statistical package was used to estimate the coefficients $\hat{\beta}$ and the error term $\hat{\varepsilon}$. The imputation method that gives values of $\hat{\beta}$ most closely mirroring those from the complete data set is deemed the most robust method of data imputation. The value of the intercept $\hat{\beta}_0$ for the complete data set is compared with the corresponding sample estimates obtained under mean imputation, median imputation and list wise deletion respectively. The procedure is repeated for five different distributions (Gamma, Weibull, binomial, Poisson and normal), with 10% and 30% of the data missing.
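The comparison procedure can be sketched in Python as follows. This is not the R code used in the study: it assumes a single Gamma-distributed predictor, a normal error term, a true intercept of 1 and slope of 0.5, and values removed completely at random. All of these choices, together with the names `fit_ols` and `compare_methods`, are hypothetical and serve only to show the structure of the comparison.

```python
import math
import random
import statistics
from typing import Dict, List, Tuple

Fit = Tuple[float, float, float]  # (intercept, slope, residual standard error)


def fit_ols(x: List[float], y: List[float]) -> Fit:
    """Least-squares fit of y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, math.sqrt(rss / (n - 2))


def compare_methods(n: int = 500, missing_rate: float = 0.3, seed: int = 1) -> Dict[str, Fit]:
    """Fit the regression on complete data and under three missing-data treatments."""
    rng = random.Random(seed)
    x = [rng.gammavariate(2.0, 2.0) for _ in range(n)]  # assumed skewed predictor
    y = [1.0 + 0.5 * a + rng.gauss(0.0, 1.0) for a in x]  # assumed true model

    # Delete predictor values completely at random to create non-response.
    missing = set(rng.sample(range(n), round(missing_rate * n)))
    keep = [i for i in range(n) if i not in missing]
    observed = [x[i] for i in keep]

    results = {"complete": fit_ols(x, y)}
    fills = {"mean": statistics.mean(observed), "median": statistics.median(observed)}
    for name, fill in fills.items():
        x_imputed = [fill if i in missing else x[i] for i in range(n)]
        results[name] = fit_ols(x_imputed, y)
    results["listwise"] = fit_ols(observed, [y[i] for i in keep])
    return results


results = compare_methods()
for name, (intercept, slope, rse) in results.items():
    print(f"{name:9s} intercept={intercept:.3f} slope={slope:.3f} rse={rse:.3f}")
```

Under the criterion described above, the treatment whose intercept and residual standard error lie closest to the "complete" row would be preferred. A single run with one seed is only indicative, and the ranking may differ under other distributions, missingness rates or seeds.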

Results
The main quantities compared to identify the optimal imputation method are the intercepts, residual standard errors, R-squared values and intercept standard errors. For the Poisson distribution, Table 1 shows that median imputation does better with 30% of the data missing. For the binomial distribution, Table 2 shows that median imputation outperforms the other methods in terms of intercept standard error and R-squared with 30% of the data missing. The table for the normal distribution data set with 30% non-response reveals that median imputation gives far better intercept values; its R-squared value also mirrors that of the original data set more closely than those of the other two methods. With 10% missingness for the normal distribution (Table 4), there is clearly no major difference between mean substitution and list wise deletion, perhaps a clear pointer that when non-response is small it can be ignored without much effect on the sample size and results. From the figures and Tables 5 and 6, for both the Weibull and Poisson distributions with 10% missingness, median imputation does better. List wise deletion is not a good method, as can be seen from the values of the intercept and standard error. With 10% missingness for the normal distribution (figure and Table 6), there is again no major difference between mean substitution and list wise deletion, perhaps a clear pointer that when non-response is small and the distribution is normal, it can be ignored without much effect on the sample size and results.

Conclusion
The aim was to establish the most reliable method of imputing missing data among the conventional methods. Analysis of the simulated data sets shows clearly that when data are MAR, median imputation performs consistently better than list wise deletion. Median imputation also did better than mean imputation at both 30% and 10% non-response, for the skewed distributions and for the normal distribution.
Median imputation is proposed as the optimal method of imputing missing data for non-response of up to 30%. Using the median for data imputation gives researchers a ready tool, easy to obtain, for handling data non-response.