Estimation of Missing Data Using Convoluted Weighted Method in Nigeria Household Survey

The analysis of survey data becomes difficult in the presence of missing data. Estimators for the parameters of interest can be obtained with the Least Squares and Stein-Rule methods. In this study, a proposed convoluted weighted Least Squares and Stein-Rule (LSSR) method is compared with some existing techniques when the data are missing completely at random (MCAR). The results show that other techniques are occasionally useful in estimating most of the parameters, but the proposed LSSR technique performs better regardless of the percentage of missing data under the MCAR assumption.


Introduction
Missing data are an inherent feature of all surveys and one of the greatest threats to the precision of most survey estimates during design and analysis. They can impair the quality of survey statistics by threatening the ability to draw valid inferences from the sample to the target population of the survey. The problem of missing data occurs when some or all of the responses are not collected for a sampled element, or when some responses are deleted because they fail to satisfy edit constraints. It is common practice to distinguish between unit (or total) nonresponse, when none of the survey responses are available for a sampled element, and item nonresponse, when some but not all of the responses are available. Total nonresponse arises because of refusals, inability to participate, not-at-homes, closed units, absence on vacation, vacant or demolished units, and untraced units. Item nonresponse arises because of item refusals, "don't know" answers, omissions, and answers deleted in editing.
Over the years, attempts with varying degrees of success have been made in the literature to solve the problem of missing data. The success of a particular technique depends on the complexity of the problem, and no technique is robust for all purposes of estimation, yet techniques are often used indiscriminately.
This paper presents a robust technique for handling missingness and compares its performance with that of some existing techniques in terms of the effect on the mean, variance, correlation coefficient, skewness, and kurtosis under MCAR, with different amounts of missing data, using a Nigerian household survey.

Missing Completely at Random (MCAR)
The distribution of the response indicator R is assumed to be independent of both the target variable Y and the auxiliary variable X. Thus
$$P(R \mid X, Y) = P(R).$$

Missing at Random (MAR)
In general, MAR occurs when there is no direct relation between the target variable Y and the response behavior R, while at the same time there is a relation between the auxiliary variable X and the response behavior R. This is expressed as:
$$P(R \mid X, Y) = P(R \mid X).$$

Missing Not at Random (MNAR)
A missing data mechanism in which missingness is assumed to be related to the unobserved values of the dependent variable Y, in addition to the remaining observed values, is called Missing Not at Random (MNAR). This is expressed as:
$$P(R \mid X, Y) \neq P(R \mid X).$$

Least Square (Yates) Procedure
Yates (1933) proposed a technique that first estimates the parameters of the model from the complete observations alone and then obtains predicted values for the missing observations. It applies when linearity and unbiasedness of the estimates are of interest [18]. The predicted value of the study variable is given by expression (4). The resulting variance estimator makes use of only the observed cases but has $T - K$ degrees of freedom rather than the number of observed cases minus $K$, with the predicted values taken from expression (4).
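The two steps of the Yates procedure (fit on the complete cases, then predict the missing responses) can be illustrated with a minimal sketch. The function name `yates_impute`, the added intercept column and the synthetic noise-free data are assumptions made for illustration only; they are not taken from the survey or from expression (4).

```python
import numpy as np

def yates_impute(X, y):
    """Impute NaN entries of y with least squares predictions.

    The coefficients are fitted on the complete cases only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    # Adding an intercept column is an assumption of this sketch.
    design = np.column_stack([np.ones(len(y)), X])
    # Step 1: least squares fit using the complete cases alone.
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    # Step 2: predicted values replace the missing responses.
    y_filled = y.copy()
    y_filled[~observed] = design[~observed] @ beta
    return y_filled, beta

# Synthetic, noise-free toy data (not the survey data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y_true = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]
y_obs = y_true.copy()
y_obs[[3, 7, 11]] = np.nan  # pretend three responses are missing
y_filled, beta = yates_impute(X, y_obs)
```

Because the toy response is an exact linear function of the regressors, the imputed values coincide with the deleted ones. With real data the imputed values lie on the fitted plane and carry no residual variation.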

Stein-Rule Strategy
A second method, the Stein-rule, was proposed by James and Stein (1961) and provides the predictions in expression (7). There, the residual sum of squares has the usual quadratic form $(Y - Xb)'(Y - Xb)$, with $b$ the least squares estimate, and $k$ is a positive non-stochastic scalar [18]. The variance of the Stein-rule procedure is obtained with the predicted values taken from expression (7).
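As a rough illustration of Stein-rule prediction, the sketch below uses one common textbook form of the Stein-rule estimator, in which the least squares coefficients are shrunk by the factor $1 - \frac{k}{n_c - p + 2}\cdot\frac{\text{RSS}}{b'X_c'X_c b}$. Whether this matches expression (7) term by term is an assumption. The names `stein_rule_impute`, `n_c` (observed cases) and `p` (columns of the design matrix), and the synthetic data, are likewise introduced here for illustration.

```python
import numpy as np

def stein_rule_impute(X, y, k=1.0):
    """Impute NaN entries of y with shrunken least squares predictions.

    Assumed shrinkage factor (a common textbook form):
        1 - k / (n_c - p + 2) * RSS / (b' Xc' Xc b)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    Xc, yc = X[observed], y[observed]
    n_c, p = Xc.shape
    b, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ b
    rss = float(resid @ resid)        # residual sum of squares
    fitted = Xc @ b
    fitted_ss = float(fitted @ fitted)  # b' Xc' Xc b
    shrink = 1.0 - (k / (n_c - p + 2)) * rss / fitted_ss
    y_filled = y.copy()
    y_filled[~observed] = shrink * (X[~observed] @ b)
    return y_filled, shrink

# Synthetic noisy data (not the survey data).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y_obs = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=30)
y_obs[[0, 5, 9]] = np.nan
y_filled, shrink = stein_rule_impute(X, y_obs, k=1.0)
```

With `k = 0` the factor equals one and the predictions reduce to the least squares (Yates) predictions, which makes the relation between the two procedures easy to see.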

Proposed Convoluted Weighted Method
Let $\hat{Y}^*_1 = \phi_1(x)$ and $\hat{Y}^*_2 = \phi_2(x)$ be two different functions (models) for estimating the missing values of the study variable.
Let us define our target model as
$$\hat{Y}^* = \lambda_1 \hat{Y}^*_1 + \lambda_2 \hat{Y}^*_2,$$
which is a linear combination of the two existing models.
Here $\lambda_1 + \lambda_2 = 1$ and $\lambda_1$ is a non-stochastic scalar between 0 and 1; see [18]. The value of $\lambda_1$ reflects the weight given to the prediction of the first model relative to the prediction of the second model.
The quantities entering expression (13) are the number of observed cases and $K$, the number of explanatory variables (a positive non-stochastic scalar). Taking the partial derivative of expression (13) with respect to $\lambda_1$ gives expression (14). At the turning point this derivative equals zero, so setting (14) to zero gives (15), and solving (15) for $\lambda_1$ gives the required weight. The predicted values of the study variable under the proposed model are then
$$\hat{Y}^* = \lambda_1 \hat{Y}^*_1 + \lambda_2 \hat{Y}^*_2, \qquad (17)$$
where $\hat{Y}^*_1$ and $\hat{Y}^*_2$ are as shown in (4) and (7), respectively, and $\lambda_2 = 1 - \lambda_1$. Thus, the proposed weighted convoluted model is (17) evaluated at the weight obtained from (15).
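The idea of choosing the weight by setting a derivative to zero can be sketched generically. The code below does not reproduce expression (13). It assumes the textbook variance of a convex combination of two predictors, $V(w) = w^2 V_1 + (1-w)^2 V_2 + 2w(1-w)C$, whose derivative vanishes at $w = (V_2 - C)/(V_1 + V_2 - 2C)$. The names `optimal_weight`, `combine`, `var1`, `var2` and `cov12` are hypothetical.

```python
import numpy as np

def optimal_weight(var1, var2, cov12=0.0):
    """Weight on predictor 1 minimising the variance of
    w * pred1 + (1 - w) * pred2 (generic result, not expression (13))."""
    denom = var1 + var2 - 2.0 * cov12
    if denom <= 0:
        raise ValueError("var1 + var2 - 2*cov12 must be positive")
    w = (var2 - cov12) / denom
    # Keep the weight inside [0, 1], as required of a convex combination.
    return float(np.clip(w, 0.0, 1.0))

def combine(pred1, pred2, w):
    """Convex combination of two prediction vectors."""
    pred1 = np.asarray(pred1, dtype=float)
    pred2 = np.asarray(pred2, dtype=float)
    return w * pred1 + (1.0 - w) * pred2

# Hypothetical variances: predictor 1 is three times more precise.
w = optimal_weight(var1=1.0, var2=3.0)
blended = combine([1.0, 2.0], [3.0, 4.0], w)
```

In this toy case the more precise predictor receives weight 0.75, and the combined variance (0.75) is below that of either predictor alone, which is the motivation for blending two imputation models.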

Efficiency Comparison
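If the data are complete, then
$$s^2 = \frac{1}{T - K}\sum_{t=1}^{T} (Y_t - \hat{Y}_t)^2$$
is the corresponding estimator of the variance $\sigma^2$. If some of the $T$ cases are incomplete, that is, observations are missing in the model, then the variance can be estimated using the complete-case estimator, which divides the residual sum of squares of the $n_c$ observed cases by $n_c - K$:
$$\hat{\sigma}^2_c = \frac{1}{n_c - K}\sum_{t \in \text{obs}} (Y_t - \hat{Y}_t)^2.$$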

Using Least Square Method
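If the missing data are imputed using the Least Squares (Yates) method, then we have the estimator
$$\hat{\sigma}^2_{\text{Yates}} = \frac{1}{T - K}\sum_{t \in \text{obs}} (Y_t - \hat{Y}_t)^2,$$
which makes use of the $n_c$ observed cases but has $T - K$ instead of $n_c - K$ degrees of freedom. As the imputed cases contribute zero residuals while the divisor grows, this estimator is smaller than the complete-case estimator whenever observations are missing.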
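The degrees-of-freedom effect described here can be checked numerically. The sketch below is illustrative only: the function name `variance_estimates`, the symbol `n_c` for the number of observed cases, the convention that `K` counts every column of the design matrix (intercept included), and the synthetic data are all assumptions.

```python
import numpy as np

def variance_estimates(X, y):
    """Complete-case and Yates-imputed variance estimates.

    Imputed cases lie exactly on the fitted plane, so they add nothing
    to the residual sum of squares; only the divisor changes.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)
    T, K = X.shape
    n_c = int(observed.sum())
    b, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    rss = float(np.sum((y[observed] - X[observed] @ b) ** 2))
    sigma2_cc = rss / (n_c - K)    # complete-case estimator
    sigma2_yates = rss / (T - K)   # after least squares imputation
    return sigma2_cc, sigma2_yates, n_c, T, K

# Synthetic data (not the survey data): 40 cases, 10 responses missing.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
y_obs = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=40)
y_obs[:10] = np.nan
s2_cc, s2_yates, n_c, T, K = variance_estimates(X, y_obs)
```

The two estimates differ exactly by the factor `(n_c - K) / (T - K)`, so the imputed-data estimate understates the variance unless it is rescaled.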

Using Stein Rule Method
If the missing data are imputed using the Stein-rule approach, then the variance estimate is formed with the Stein-rule predictions from expression (7) filling the missing cases.

Efficiency Comparison: Least Squares Versus Stein Rule Techniques
The following three possible conditions will hold iff * >0:

Efficiency Comparison: Proposed Technique Versus Least Squares and Stein Rules Techniques
If the missing data are imputed using the proposed weighted Least Squares and Stein-Rule (LSSR) method, then the variance estimator is formed with the predictions from expression (17).

Comparison of variance estimates from the Least Squares, Stein-Rule and proposed techniques.

Performance Criteria for the Techniques
The performance criteria include the root mean square error (RMSE). The technique with the minimum RMSE is adjudged the best.
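A minimal sketch of the RMSE selection rule follows. The technique labels and the numbers in `estimates_by_technique` are hypothetical placeholders, not results from this study.

```python
import numpy as np

def rmse(estimates, true_value):
    """Root mean square error of estimates around a known true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - true_value) ** 2)))

# Hypothetical estimates of a parameter whose true value is 2.0.
true_value = 2.0
estimates_by_technique = {
    "technique_A": [2.1, 1.9],
    "technique_B": [2.5, 1.5],
}
scores = {name: rmse(vals, true_value)
          for name, vals in estimates_by_technique.items()}
# The technique with the smallest RMSE is preferred.
best = min(scores, key=scores.get)
```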

Numerical Illustration for the Proposed Technique
A simple random sample of n = 100 households was selected from the records of survey data on "household income" from Iju/Ita-Ogbolu, Akure North Local Government, Ondo State, Nigeria, to compare the performance of the proposed model with some existing techniques for handling missing data under MCAR at different percentages of missing data.
Three demographic variables were considered: income (Y, in N'000), age ($X_2$) and years of schooling ($X_1$). The Y variable was generated as a combination of the explanatory variables with added random components. Then differing amounts were deleted at random, producing MCAR data with 0, 5, 12, 23 and 44% missing data.

Remark: Although other techniques are occasionally useful in estimating most of the true parameters, the proposed LSSR technique performs better regardless of the percentage of missing data under the MCAR assumption, judging by the selection criteria. For mean imputation (MI) and listwise deletion (LW), however, there is a higher percentage of discrepancy from the true values of most of the parameters. Hence, the proposed technique preserves most of the parameter structure within the data: there is almost no change in the mean, variance, skewness, kurtosis, coefficient of variation and correlation coefficient under the MCAR assumption when the proposed imputation model is used.
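The deletion step and the behaviour of two baseline techniques can be sketched as follows. The data are synthetic and the coefficients, variable ranges and random seed are assumptions; they do not reproduce the survey sample or its results. Only the sample size (100) and the deletion percentages follow the description above.

```python
import numpy as np

def mcar_delete(y, fraction, rng):
    """Set a randomly chosen fraction of entries to NaN.

    The positions are drawn independently of X and Y, which is MCAR.
    """
    y = np.asarray(y, dtype=float).copy()
    n_missing = int(round(fraction * len(y)))
    idx = rng.choice(len(y), size=n_missing, replace=False)
    y[idx] = np.nan
    return y

rng = np.random.default_rng(42)
n = 100
x1 = rng.integers(0, 19, size=n)    # years of schooling (synthetic)
x2 = rng.integers(20, 66, size=n)   # age (synthetic)
# Assumed generating model: linear in x1 and x2 plus random noise.
y = 5.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0.0, 5.0, size=n)

for pct in (0, 5, 12, 23, 44):
    y_mis = mcar_delete(y, pct / 100, rng)
    # Listwise deletion: statistics from the observed cases only.
    lw_mean, lw_var = np.nanmean(y_mis), np.nanvar(y_mis, ddof=1)
    # Mean imputation: fill every gap with the observed mean.
    mean_imputed = np.where(np.isnan(y_mis), lw_mean, y_mis)
    mi_var = np.var(mean_imputed, ddof=1)
    print(f"{pct:2d}% missing: LW var={lw_var:.2f}, MI var={mi_var:.2f}")
```

Mean imputation leaves the mean of the observed cases unchanged but shrinks the variance as the missing percentage grows, which illustrates why it distorts the parameter structure of the data.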

Conclusion
Although other procedures are occasionally useful, the proposed LSSR technique performed better regardless of the percentage of missing data under the MCAR mechanism.
Judging by the selection criteria, the proposed model reduces the variability around the true parameter value without sacrificing the linearity and unbiasedness criteria.