Comparison of Methods for Processing Missing Values in Large Sample Survey Data

Abstract: Missing data occur in every field, and most researchers choose a simple approach to deal with them. However, such approaches may introduce bias and lead to inaccurate results. In this study, we explore methods suitable for large samples and multivariate missing data patterns: using a large youth health risk behavior survey data set, we simulate missing data under MCAR at proportions from 5% to 50% and compare complete case analysis, single imputation, and multiple imputation.


Introduction
Missing data are widespread in survey studies. Incomplete data sets may be caused by nonresponse, dropout, measurement error, and miscommunication [1]. Complete Case Analysis (CCA) is the traditional statistical method, in which every incomplete observation is deleted and only complete cases are kept in the data set. CCA works for all types of data and is the default in many statistical packages. In most cases, its significant disadvantage is the potential loss of a large part of the original observations, and hence of available information [2][3]. However, a recent study found that CCA may actually yield accurate estimates of the parameters of interest in some situations [4][5]. Other methods, such as hot-deck imputation and predictive mean matching [6][7], are forms of single imputation (SI) and are also very popular in practice. Hot-deck imputation replaces a missing value with an appropriate value taken from a similar unit or "neighbor". Predictive mean matching replaces a missing value with an observed value from a donor whose predicted mean is close to that of the incomplete case. These nonparametric approaches ensure that the imputed values fall within the observed range [8].
Recently, several papers have summarized the complications and limitations of methods for processing incomplete data in epidemiological studies and proposed possible solutions, in particular multiple imputation (MI) [9][10].
MI is a simulation-based approach developed to process incomplete data, and it is applicable to complete data as well [11][12][13][14]. MI comprises three main steps. First, in the imputation step, m (usually 5) copies of the data set are created by fitting an imputation model and replacing the missing values in each copy with independent random draws from the predictive distribution of the missing values under that model. Second, in the analysis step, each of the m completed data sets is analyzed with the complete-data statistical method of interest (a logistic model in this study) to obtain parameter estimates. Finally, a single statistical inference is obtained by combining these estimates using Rubin's rules.
In this paper, Multivariate Imputation by Chained Equations (MICE), also called "fully conditional specification" (FCS), was used; it can handle a variety of variable types (e.g., continuous or binary) and can be applied in a wide range of settings [14][15]. The FCS procedure is exceptionally flexible because each variable is modeled by its own conditional density, such as a logistic regression model for binary variables and a linear regression model for continuous variables [14]. MICE procedures are implemented in a variety of software packages (e.g., S-Plus, R, Stata) [16][17][18].
Missing data mechanisms are generally classified into three main categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [13]. The mechanism has implications for the choice of method for handling missing data. MI provides unbiased estimates of the regression parameters of interest when the missing data are MAR or MCAR. A recent study found that when the missing data are MCAR or MAR, CCA can also perform well (e.g., unbiased risk difference, 95% coverage) [5], although other papers indicate that the method can result in substantial bias [3,9,19]. SI and related weighting methods rest on the same assumptions as CCA and MI; for example, inverse probability weighting (IPW) is typically implemented assuming MAR [20][21]. The implications of these methods under MNAR are discussed in [22].
The purpose of our research is to investigate, via a simulation study using real data under MCAR: (1) whether MI provides unbiased estimates when processing large sample data, compared with CCA and SI; (2) whether the change in the fraction of missing information (FMI) can be used as an indicator of MI's performance in processing missing data; and (3) what the proper scope of application of MI is in terms of the proportion of missing data.

Data Source
Data from the youth health risk behavior survey launched by the Beijing Centers for Diseases Control and Prevention, Beijing, China, in 2016 were used in our study. The survey was conducted with a self-administered anonymous questionnaire containing 106 questions related to diet, myopia, internet use, sleep, injury, smoking, drinking, and other behaviors. In this paper, we used the subset of injury-related behaviors.

Methods
The variables Y1, X1, X4, and X6 were set as non-response variables and the others as complete variables. Because CCA, SI, and MI are all based on the assumption that missing data are MCAR or MAR, we simulated missing data under MCAR using R, constructing simulated incomplete data sets from the complete data set K at missing proportions of 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, and 50%. Each incomplete data set was then processed with CCA, SI, and MI, yielding 30 completed data sets in total. The mice package in R was used to perform all statistical analyses.
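As an illustration of the amputation step (the study itself used R and mice), the following Python sketch shows one way to impose MCAR missingness on selected columns; the function name and the toy data are hypothetical stand-ins, not the survey variables.

```python
import numpy as np

def ampute_mcar(data, cols, rate, seed=0):
    """Set entries of the given columns to NaN completely at random.

    Under MCAR the probability that a value is missing is the same for
    every unit and does not depend on any observed or unobserved value.
    """
    rng = np.random.default_rng(seed)
    out = data.astype(float).copy()      # float copy so NaN is representable
    for j in cols:
        mask = rng.random(out.shape[0]) < rate
        out[mask, j] = np.nan
    return out

# Example: 1000 rows, 4 binary variables; make columns 0 and 2
# (stand-ins for two non-response variables) 20% missing.
rng = np.random.default_rng(1)
K = rng.integers(0, 2, size=(1000, 4))
K_miss = ampute_mcar(K, cols=[0, 2], rate=0.20)
```

Repeating this at rates 0.05 through 0.50 reproduces the ten simulation scenarios described above.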

Algorithms of MI [13]
FCS is a flexible method that does not require a strict assumption such as multivariate normality. FCS imputations are generated sequentially by specifying an imputation model for each variable. Assume that Y is a partially observed random sample, containing p variables, from a multivariate distribution P(Y | θ), and let Y-j denote all variables in the data except Yj (j = 1, ..., p). We assume that the multivariate distribution of Y is completely specified by θ, a vector of unknown parameters, so that once the posterior distribution of θ is known, values for imputation can be drawn from it. The FCS algorithm obtains the posterior distribution of θ by sampling iteratively from conditional distributions of the form

P(Y1 | Y-1, θ1), ..., P(Yp | Y-p, θp).

The parameters θ1, ..., θp are specific to the respective conditional densities and are not necessarily the product of a factorization of the 'true' joint distribution. Starting from an initial imputation, imputations are drawn by iterating over the conditional densities and sequentially filling in the current draws of each variable. The t-th iteration of chained equations is a Gibbs sampler that successively draws

θ1*(t) ~ P(θ1 | Y1obs, Y2(t-1), ..., Yp(t-1)),
Y1*(t) ~ P(Y1 | Y1obs, Y2(t-1), ..., Yp(t-1), θ1*(t)),
...,
θp*(t) ~ P(θp | Ypobs, Y1(t), ..., Yp-1(t)),
Yp*(t) ~ P(Yp | Ypobs, Y1(t), ..., Yp-1(t), θp*(t)),

where Yj*(t) is the jth imputed variable at the tth iteration. Once the chain reaches convergence, the current draws are taken as one set of imputed values, and the cycle is repeated until the desired number of imputations has been obtained. Convergence is typically fast, and 10-20 iterations are generally sufficient.
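The chained-equations cycle can be sketched in code. The following minimal Python example is a simplified stand-in for the R mice package: it handles numeric variables only and adds Gaussian noise to regression predictions rather than drawing the regression parameters from their full posterior, so it illustrates the iteration scheme rather than the complete algorithm.

```python
import numpy as np

def chained_equations(data, n_iter=10, seed=0):
    """Minimal chained-equations (FCS) sketch for numeric data.

    Each incomplete variable Y_j is repeatedly re-imputed from a linear
    regression on all other variables Y_{-j}, with noise added to the
    predictions so the draws approximate the conditional distribution
    rather than collapsing to its mean.
    """
    rng = np.random.default_rng(seed)
    X = data.copy().astype(float)
    miss = np.isnan(X)
    # Step 0: initialize every missing entry with its column mean.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # Gibbs-style cycling over the incomplete variables.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            beta, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            sigma = (X[obs, j] - A[obs] @ beta).std()
            pred = A[miss[:, j]] @ beta
            X[miss[:, j], j] = pred + rng.normal(0, sigma, pred.shape)
    return X

# Example: y is strongly predicted by x; mask the first 50 values of y.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)
D = np.column_stack([x, y])
D[:50, 1] = np.nan
filled = chained_equations(D)
```

Because the imputation model borrows strength from x, the imputed y values land close to the masked truth, which is the behavior the FCS algorithm above formalizes.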

Analysis Models
After obtaining the m (in this paper, m=5) imputed data sets from the imputation step, the analysis stage is the simplest stage: the analysis model is the same statistical model that would be applied to a complete data set. Many studies indicate that the imputation model should contain all variables in the analysis model, as well as any auxiliary variables related to the outcome that are likely to be used in the subsequent analyses [19,23]. For each of the 30 simulated data sets and the data set K, Y1 was taken as the dependent variable and the others as covariates in the following logistic regression model:

logit(P(Y1=1)) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6,

where the β's are the regression coefficients of the variables.
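As an illustration of the analysis step, the following self-contained Python sketch fits a logistic model by Newton-Raphson and reports the AIC used below to compare the three methods. The data and coefficient values are synthetic stand-ins, not the survey variables.

```python
import numpy as np

def logit_fit(X, y, n_iter=25):
    """Fit logit(P(y=1)) = b0 + b1*x1 + ... by Newton-Raphson.

    Returns the coefficient vector and the model AIC = 2k - 2*logL.
    """
    A = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (y - p)                     # score vector
        H = A.T @ (A * (p * (1 - p))[:, None])   # observed information
        beta = beta + np.linalg.solve(H, grad)
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    aic = 2 * len(beta) - 2 * loglik
    return beta, aic

# Example with known coefficients (b0=-1, b1=2):
rng = np.random.default_rng(3)
x = rng.normal(size=5000)
p_true = 1 / (1 + np.exp(-(-1 + 2 * x)))
y = (rng.random(5000) < p_true).astype(float)
beta_hat, aic = logit_fit(x[:, None], y)
```

In the study, this kind of fit is repeated on each of the m completed data sets, and the AIC of each completed-data fit is compared against that of the original data set K.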

Imputation Diagnostics
For all scenarios, after analyzing the completed data sets we obtained m=5 sets of point estimates and their associated variances, which were combined into one final result. We evaluated the advantages and disadvantages of the three methods, and the applicability of MI, through indicators such as Akaike's Information Criterion (AIC) [24], the significance of the regression coefficients (β), and the fraction of missing information (FMI). The FMI quantifies the loss of information due to missingness: it is the proportion of the overall uncertainty attributable to the missing data and ranges between 0 and 1. A large value indicates high between-imputation variability, meaning that the observed data in the imputation model provide little information about the missing values.
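The combination of the m results, and the FMI diagnostic, can be sketched as follows. This uses the simple large-sample form FMI = (1 + 1/m)B / T; note that mice additionally applies a small-sample degrees-of-freedom correction, so the numbers here are illustrative.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m sets of estimates with Rubin's rules.

    Returns the pooled point estimate, its total variance, and the
    large-sample fraction of missing information (FMI).
    """
    est = np.asarray(estimates, dtype=float)   # shape (m, n_params)
    var = np.asarray(variances, dtype=float)
    m = est.shape[0]
    qbar = est.mean(axis=0)                    # pooled point estimate
    W = var.mean(axis=0)                       # within-imputation variance
    B = est.var(axis=0, ddof=1)                # between-imputation variance
    T = W + (1 + 1 / m) * B                    # total variance
    fmi = (1 + 1 / m) * B / T                  # proportion of uncertainty
    return qbar, T, fmi                        #   due to missing data

# Example: m = 5 imputed analyses of a single coefficient
# (hypothetical numbers, not results from the study).
ests = np.array([0.52, 0.48, 0.50, 0.55, 0.45])[:, None]
vars_ = np.array([0.010, 0.011, 0.009, 0.010, 0.010])[:, None]
qbar, T, fmi = pool_rubin(ests, vars_)
```

When the imputations disagree strongly (large B relative to W), the FMI approaches 1, which is exactly the pattern reported below as the missing-data proportion passes 30%.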

Results of the Complete Data Set K
The data set K is the original complete data set. Logistic regression was conducted to analyze the relationship between the variable 'student's injury status (yes or no) in the past one year' (Y1) and the other variables; the results are given in Table 1. The regression coefficients of all variables are significant (P<0.05) except for 'Area' and 'Father's educational background', where the P values of 'Suburb', 'Junior high school', and 'High school' are 0.09699, 0.07579, and 0.05157 respectively (all >0.05).

Comparison Results of Thirty Complete Data Sets Processed by Three Methods
Results for the thirty completed data sets are given in Figure 1. The AIC of the data sets processed by CCA gradually decreases from 27000 to 2100 as the proportion of missing data increases; compared with the AIC of the complete data set K (33198), the relative error gradually increases. The AIC of the data sets processed by SI gradually decreases from 32000 to 21700 as the proportion of missing data increases, and the relative error likewise grows. The AIC of the data sets processed by FCS fluctuates between 33200 and 33700, and its relative error varies only slightly. At comparable proportions of missing data, comparing the relative error in AIC of each completed data set against the result for data set K, FCS performs best and CCA performs worst, indicating that the distribution of the data imputed with FCS is approximately equal to that of the original complete data set K. In general, the performance of every method declines as the proportion of missing data increases, and FCS is a better measure for dealing with missing data than SI or CCA.

The Result of FCS
The results of the logistic regression model on data completed by FCS at different proportions of missing data are summarized in Tables 2 and 3. At low proportions of missing data the imputation effect of FCS is good, while the bias increases gradually as the proportion of missing data grows; this is consistent with the results in Figure 1. For example, when the proportion of missing data is 10%, the regression coefficients of all variables except 'Area' and 'Father's educational background' are significant and the standard errors of the variables are relatively small. As the proportion of missing data increases, especially beyond 30%, the β estimates of some variables, such as 'Family type' and 'Age', become meaningless. Figure 2 shows the FMI values of the variables at different proportions of missing data. The FMI of a given variable increases with the proportion of missing data; for example, the FMI of 'Poor region' and '15~Y' rises from approximately 0.1 to 0.7 and from 0.1 to 0.6, respectively. This reflects that, as the percentage of missing data grows, the observed data in the imputation model provide less and less information about the missing values. In addition, when the proportion of missing data is below 30% the FMI of most variables is small, but above 30% the FMI increases to about 0.5 or even higher.

Discussion
Under MCAR, we analyzed the performance of the three methods for processing survey data with missing values at different missing rates. The results show that FCS performs well compared with CCA and SI, consistent with other papers [2][3]. The AIC results show substantial differences between CCA and the original data set K: the higher the percentage of missing data, the larger the relative error in AIC. Because of its simplicity, CCA has become the most commonly used method for processing missing data in scientific research; however, it often yields bias because of the potential loss of information [2,11], so directly deleting samples with missing values should be avoided. Alternatively, SI based on substituting the missing values with the mode retains the integrity of the data as far as possible. However, this approach may distort the distributional relationships among the variables, and it cannot provide measures of the uncertainty introduced by the imputation process; it performs acceptably when the proportion of missing data is low (such as 5%), but as more information is missing it leads to biased variance estimates for the parameters. In our study, MI was the preferred method for dealing with missing data; in both simulation research and practical applications it has shown good ability to process missing data [1,10,11,25]. MI preserves the observed data while also properly reflecting the uncertainty of the imputed data. Hence, FCS was used to estimate missing values and obtain unbiased estimates. This study uses FCS because it does not require a strict model assumption such as multivariate normality (MVN) [26], and it is a very flexible method for dealing with missing data as long as the imputation model is correctly specified [10,27].
The value of FMI can also be used as an indicator to guide the application of MI [17]. The FMI is the fraction of missing information as defined by Rubin [13], that is, the proportion of the overall uncertainty due to the missing data; ideally it should be as small as possible. When the proportion of missing data is low, the FMI is small, indicating that the MI imputation is effective. As the proportion of missing data increases, the FMI gradually increases and the imputation gradually becomes worse. An explanation is that the more data are missing, the less usable the data set and the information it carries. The change in FMI therefore reflects the effectiveness of MI in processing missing values.
Researchers in a variety of fields are often concerned with the proper scope of application of MI in terms of the proportion of missing data [23]. In this paper we found 30% to be that limit, because when the proportion of missing data is below 30% the imputation effect of FCS is relatively stable. For example, when the missing rate is 10%, the β estimates of all variables except 'Area' and 'Father's educational background' are significant and their standard errors are small, indicating that both the deviation and the mean square error between FCS and the original data set K are small and that the FCS model results are consistent with the original model results. When the proportion exceeds 30%, the β estimates of some variables become meaningless and more variables have FMI values of 0.5 or above. Although some studies suggest that MI can achieve a good imputation effect when the proportion of missing data is 40% or even 90% [23], another finding of our study is that the bias of the estimates keeps growing as the proportion of missing data increases.
Compared with prior studies, our study has two advantages. First, more complex data structures were used and more variables were tested, including multivariate patterns [5][6][23]. Our simulation involves a multivariate missing data set with binary, unordered, and ordered categorical variables, which is closer to the actual situation of missing data in survey data than a single-variable missing pattern [17]. Second, more methods, namely MI, CCA, and SI, were compared together [1,10,23], which makes the results more reliable.
In this study, all variables were included in the imputation model, but in an actual study careful selection is needed to determine which variables and covariates related to the missing variables should be included. If too little information is included in the imputation model, the analysis will have high standard errors; conversely, if the observed data are highly predictive of the missing values, the imputations will yield smaller bias in the results [28][29]. Sample size also influences the results: as the sample size increases, the bias of the estimates gradually decreases [23,30]. In this study large sample data were used, so the results are more efficient than those based on small samples.
However, this study also has some limitations. We only studied the methods under the MCAR mechanism and did not study the MNAR scenario. In practical research, the missing data mechanism of the data set must be identified and a sensitivity analysis conducted [11,31]. This is what we plan to do next, and we hope more scholars will study it in the future.

Conclusions
Missing data are a pervasive problem that should be dealt with appropriately. In this paper, the performance of three methods for processing incomplete data under MCAR was evaluated. First, across different proportions of missing data, MI performs well compared with CCA and SI. Second, the change in FMI can be used as an indicator of MI's performance in processing missing data. Third, MI is suitable for processing large sample data, with a missing-data proportion of no more than 30% as its proper scope of application. These findings are expected to provide a methodological reference for processing missing values in similar survey data, to mitigate the loss of statistical power caused by missing data, and to provide experience for researchers.