Non-parametric Variance Estimation Using Donor Imputation Method

The main objective of this study is to investigate the relative performance of the donor imputation method in situations that are likely to occur in practice and to carry out a numerical comparative study of variance estimators based on Nadaraya-Watson kernel estimators and other estimators. The Nadaraya-Watson kernel estimator can be viewed as a nonparametric imputation method, as it leads to an imputed estimator with negligible bias without requiring the specification of a parametric imputation model. Simulation studies were carried out to investigate the performance of Nadaraya-Watson kernel estimators in terms of variance. The results show that the Nadaraya-Watson kernel estimator has negligible bias and a small variance. When compared with the naïve, jackknife and bootstrap estimators, the Nadaraya-Watson kernel estimator was found to perform better than the bootstrap estimator in both linear and non-linear populations.


Introduction
Donor imputation is a method in which the missing values for one or more variables of a non-responding unit (recipient) are replaced by the corresponding values of a responding unit (donor) with no missing values for these variables. Variance estimation under donor imputation can remain valid even in the presence of high sampling fractions [1]. However, very few variance estimation methods that take donor imputation into account have been developed. Donor imputation is convenient and has some interesting statistical properties. Although it may not be the most efficient method in any specific scenario, it is popular in surveys due to its practical advantages. Therefore, it remains useful to develop variance estimation methods that take donor imputation into account. In this study, a variance estimator after donor imputation has been investigated and compared with the naïve, jackknife and bootstrap estimators.
Variance estimation methods accounting for the effect of imputation have been studied by [11], [13] and [8], among others. Some methods of variance estimation that have been developed for use with imputed data include a model-assisted method [11], an adjusted jackknife method [11], and multiple imputation [8]. [2] considered Random Hot-Deck (RHD) imputation under more general sampling designs, assuming that a one-factor analysis of variance model holds. [9], [6] and [5] dealt with Nearest Neighbor Imputation (NNI). [3] considered NNI, an alternative to resampling variance estimation methods. [10] considered NNI under simple random sampling, assuming that a ratio imputation model holds. [1] dealt with general donor imputation methods, including NNI, possibly with post-imputation edit rules and hierarchical imputation classes, under general sampling designs and more general imputation models.
In this paper, non-parametric variance estimation using the donor imputation method is considered, with the model mean $\mu(x)$ and model variance $\sigma^2(x)$ estimated using the kernel method proposed by Nadaraya (1964) and Watson (1964).

Estimation Procedure
Consider a population of $N$ elements identified by a set of indices $U = \{1, 2, \ldots, N\}$. Associated with the $i$th unit in the population are two variables $(x_i, y_i)$, where $x_i > 0$ and $y_i > 0$. The variable $y$ has some unknown values and is the variable under study. The variable $x$ is the auxiliary variable, assumed to be known for all units of the population. A simple random sample without replacement (SRSWOR) of size $n$, denoted $s$, is drawn from the population. Suppose that $y_1, y_2, \ldots, y_r$ are observed (respondents) and $y_{r+1}, y_{r+2}, \ldots, y_n$ are missing (non-respondents). That is, $r$ units respond for $y$ and $m = n - r$ do not respond. Therefore $s = s_r \cup s_m$. Consider a unit $j \in s_m$. The NNI method imputes a missing $y_j$ by $y_i$, where $i \in s_r$ is the nearest neighbor of $j$ measured by the $x$ variable. That is, $i$ satisfies $|x_i - x_j| = \min_{k \in s_r} |x_k - x_j|$. If there are tied $x$ values, then there may be multiple nearest neighbors of $j$, and $i$ is randomly selected from them. Suppose that the minimum of $|x_k - x_j|$ occurs for $k = i$. Then the value $y_i$ is imputed for the missing $y_j$.
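The nearest neighbor rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `nearest_neighbor_impute`, the use of `None` to mark a missing value, and the small data set are all assumptions made for the example.

```python
import random

def nearest_neighbor_impute(x, y, rng=None):
    """Fill missing y values (None) with the y of the respondent nearest in x.

    Ties in |x_i - x_j| are broken at random, as described in the text.
    Returns the completed list and a dict mapping recipient -> donor index.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    donors = [i for i, v in enumerate(y) if v is not None]
    if not donors:
        raise ValueError("at least one respondent is required")
    completed = list(y)
    donor_of = {}
    for j, v in enumerate(y):
        if v is not None:
            continue
        best = min(abs(x[i] - x[j]) for i in donors)
        ties = [i for i in donors if abs(x[i] - x[j]) == best]
        i = rng.choice(ties)
        completed[j] = y[i]
        donor_of[j] = i
    return completed, donor_of

# Hypothetical data: units 1 and 3 are non-respondents.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [5.0, None, 7.0, None, 20.0]
completed, donor_of = nearest_neighbor_impute(x, y)
```

In this toy data set, unit 3 has a unique nearest respondent (unit 2), while unit 1 is equidistant from units 0 and 2, so its donor is chosen at random between them.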
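The completed data set is

$$y_{\bullet i} = \begin{cases} y_i, & i \in s_r \\ y_i^{*}, & i \in s_m \end{cases} \qquad (1)$$

where $y_i^{*}$ is the imputed value. If the survey has 100% response, then the population mean $\bar{Y}$ is estimated by the sample mean

$$\bar{y}_s = \frac{1}{n} \sum_{i \in s} y_i \qquad (2)$$

and its variance is estimated by

$$\hat{V}(\bar{y}_s) = \left(\frac{1}{n} - \frac{1}{N}\right) s_y^2 \qquad (3)$$

where $s_y^2 = \frac{1}{n-1} \sum_{i \in s} (y_i - \bar{y}_s)^2$. In the presence of non-response, the customary approach to point estimation is to take the formula for 100% response and calculate it on the completed data set. Thus from (2), the estimator of $\bar{Y}$ is

$$\bar{y}_I = \frac{1}{n} \left( \sum_{i \in s_r} y_i + \sum_{j \in s_m} y_j^{*} \right) = \frac{1}{n} \sum_{i \in s_r} (1 + d_i)\, y_i$$

where $d_i$ is the number of times the responding unit $i$ is used as a donor. For variance estimation, the naïve approach is to apply the ordinary variance estimator (3) to the data after imputation, i.e.

$$\hat{V}_{ORD} = \left(\frac{1}{n} - \frac{1}{N}\right) s_{y\bullet}^2, \qquad s_{y\bullet}^2 = \frac{1}{n-1} \sum_{i \in s} (y_{\bullet i} - \bar{y}_I)^2$$

where $y_{\bullet i}$ is defined by (1). This variance estimator can be biased.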
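As an illustration of the imputed mean and the naïve variance estimator, the following sketch computes both on a small completed data set. The function names and the numbers are hypothetical; the formulas are the full-response ones applied to completed data, with the finite population correction written as $(1/n - 1/N)$.

```python
def imputed_mean(completed):
    """Sample mean computed on the completed (observed + imputed) data."""
    return sum(completed) / len(completed)

def mean_from_donor_counts(y_resp, d, n):
    """Equivalent form: (1/n) * sum over respondents of (1 + d_i) * y_i."""
    return sum((1 + di) * yi for yi, di in zip(y_resp, d)) / n

def naive_variance(completed, N):
    """(1/n - 1/N) * s^2, treating imputed values as if they were reported."""
    n = len(completed)
    mean = sum(completed) / n
    s2 = sum((v - mean) ** 2 for v in completed) / (n - 1)
    return (1.0 / n - 1.0 / N) * s2

# Hypothetical example: respondents 5, 7, 20; the first two each donate once.
completed = [5.0, 5.0, 7.0, 7.0, 20.0]
y_bar_I = imputed_mean(completed)
y_bar_alt = mean_from_donor_counts([5.0, 7.0, 20.0], [1, 1, 0], n=5)
v_ord = naive_variance(completed, N=100)
```

Both expressions for the mean agree, which is the point of the donor-count form: each respondent's value is counted once for itself and once for every recipient it serves.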
Let $p(\cdot)$ denote the sampling design, that is, $p(s)$ is the known probability of obtaining a sample $s$. In our case, $p(s)$ denotes the SRSWOR design. Given $s$, denote the response mechanism by $q(\cdot \mid s)$, i.e. $q(s_r \mid s)$ is the unknown conditional probability that the response set $s_r$ is obtained. We assume that $q(\cdot \mid s)$ may depend on the auxiliary values $\{x_i : i \in s\}$ but not on the values $\{y_i : i \in s\}$. The total error (sum of sampling error and imputation error) of $\bar{y}_I$ can be broken down into sampling error and imputation error as follows:

$$\bar{y}_I - \bar{Y} = (\bar{y}_s - \bar{Y}) + (\bar{y}_I - \bar{y}_s) \qquad (4)$$

We note that $E_p(\bar{y}_s) = \bar{Y}$ and $V_p(\bar{y}_s) = \left(\frac{1}{n} - \frac{1}{N}\right) S_y^2$, where $S_y^2 = \frac{1}{N-1} \sum_{i \in U} (y_i - \bar{Y})^2$. Thus the bias of $\bar{y}_I$ is

$$B(\bar{y}_I) = E_p\left[ E_q\left( \bar{y}_I - \bar{y}_s \mid s \right) \right].$$

The variance of $\bar{y}_I$, denoted by $V$, includes the sampling variance $V_{SAM}$. The naïve variance estimator is a standard variance estimator that uses the imputed values as if they were reported values. [2] show that under the cell mean model and hot deck imputation, the bias of the naïve variance estimator as an estimator for $V_{SAM}$ is small when no respondent is used too often as a donor of an imputed value.
The jackknife variance estimator of $\bar{y}_s$ is obtained by deleting one sampled unit at a time and recomputing the estimator. In the presence of non-response to item $y$, the use of this estimator on the imputed data may lead to serious underestimation of the variance, especially if the non-response rate is high. [11] proposed an adjusted jackknife method that is calculated in a similar fashion, except that whenever a responding unit is deleted, the imputed values are adjusted. The imputed values are unchanged if a non-responding unit is deleted. Let $y_i^{(a)}(j)$ denote the adjusted imputed value for unit $i$ when unit $j$ is deleted. For mean imputation, the adjustment is based on $\bar{y}_r(j)$, the mean of the respondents excluding unit $j$, and the resulting estimator is the Rao-Shao jackknife variance estimator.
[3] proposed a rescaling bootstrap method to estimate the variance. Their method draws bootstrap samples of size $n^{*}$ with replacement from the rescaled sample. Note that $n^{*}$ may be different from $n$. The rescaling factor, denoted by $h$, is chosen so that the variance under re-sampling matches the usual variance estimator of the population mean.
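To make the jackknife discussion concrete, the sketch below uses the textbook delete-one form $\frac{n-1}{n}\sum_j (\bar{y}(j) - \bar{y})^2$ and, for mean imputation, the commonly cited adjustment $y_i^{(a)}(j) = y_i^{*} + \bar{y}_r(j) - \bar{y}_r$ when a respondent $j$ is deleted. These exact forms are assumptions here, since the formulas are not recoverable from the text, and the finite population correction is ignored. Function names and data are hypothetical.

```python
def jackknife_variance(values):
    """Delete-one jackknife variance of the mean, treating all values as observed."""
    n = len(values)
    total = sum(values)
    mean = total / n
    leave_one_out = [(total - v) / (n - 1) for v in values]
    return (n - 1) / n * sum((m - mean) ** 2 for m in leave_one_out)

def adjusted_jackknife_mean_imputation(y):
    """Adjusted jackknife under mean imputation; y uses None for non-respondents."""
    n = len(y)
    resp = [v for v in y if v is not None]
    r = len(resp)
    resp_total = sum(resp)
    ybar_r = resp_total / r
    full = ybar_r  # under mean imputation the imputed mean equals the respondent mean
    replicates = []
    for v in y:
        if v is None:
            # Non-respondent deleted: remaining imputed values are unchanged.
            rep = (resp_total + (n - r - 1) * ybar_r) / (n - 1)
        else:
            # Respondent deleted: shift each imputed value by ybar_r(j) - ybar_r.
            ybar_rj = (resp_total - v) / (r - 1)
            adjusted = ybar_r + (ybar_rj - ybar_r)
            rep = (resp_total - v + (n - r) * adjusted) / (n - 1)
        replicates.append(rep)
    return (n - 1) / n * sum((rep - full) ** 2 for rep in replicates)

# Hypothetical sample: three respondents, two non-respondents.
y = [2.0, 4.0, 6.0, None, None]
ybar_r = 4.0
completed = [v if v is not None else ybar_r for v in y]
v_naive = jackknife_variance(completed)
v_adjusted = adjusted_jackknife_mean_imputation(y)
```

On this toy sample the unadjusted jackknife applied to the completed data gives 0.4 while the adjusted version gives 1.6, which illustrates the underestimation that occurs when imputed values are treated as observed.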
The resulting estimator is the Rao-Wu bootstrap variance estimator. Applying the Rao-Wu bootstrap in the presence of missing responses, and treating the imputed values as true values, may lead to serious underestimation of the variance of the estimator. For imputed survey data, [12] proposed a bootstrap procedure that leads to the Shao-Sitter bootstrap variance estimator.
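The idea of accounting for imputation inside the bootstrap can be illustrated by re-imputing every bootstrap sample before computing its mean. The sketch below is a deliberately simplified version: it draws $n$ units with replacement and applies no rescaling, so it is neither the Rao-Wu nor the exact Shao-Sitter procedure. The function names are hypothetical, and ties are given to the first donor found rather than broken at random.

```python
import random

def nn_impute(x, y):
    """Nearest-neighbor imputation on x; ties go to the first donor found."""
    donors = [i for i, v in enumerate(y) if v is not None]
    return [v if v is not None
            else y[min(donors, key=lambda i: abs(x[i] - x[j]))]
            for j, v in enumerate(y)]

def bootstrap_variance_reimputed(x, y, n_boot=1000, seed=0):
    """Variance of bootstrap means, re-imputing within each bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    means = []
    while len(means) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        if all(v is None for v in yb):
            continue  # no donors in this replicate, so redraw
        completed = nn_impute(xb, yb)
        means.append(sum(completed) / n)
    grand = sum(means) / n_boot
    return sum((m - grand) ** 2 for m in means) / (n_boot - 1)

# Hypothetical sample with three non-respondents.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, None, 3.9, 5.2, None, 7.1, 8.0, None]
v_boot = bootstrap_variance_reimputed(x, y, n_boot=200, seed=1)
```

Because the response status travels with each resampled unit, the set of donors changes from replicate to replicate, so the variability due to imputation is reflected in the spread of the bootstrap means.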

Donor Imputation
A sample $s$ of size $n$ is drawn from the population $U$ according to a probability sampling design $p(s)$. In the absence of non-response, we assume SRSWOR, under which the population mean is estimated by the sample mean $\bar{y}_s$.
Variable $y$ is only observed for a subset $s_r$ of $s$ according to a response mechanism $q(s_r \mid s)$. This subset of size $r$ is called the set of respondents (or donors), while its complement $s_m = s - s_r$ of size $m = n - r$ is called the set of non-respondents (or recipients). To compensate for the missing values, donor imputation is performed. This leads to the imputed estimator of the mean given by

$$\bar{y}_I = \frac{1}{n} \left( \sum_{i \in s_r} y_i + \sum_{i \in s_m} y_{l(i)} \right)$$

where $l(i) \in s_r$ is the donor used to impute the recipient $i$. A variety of strategies can be considered in practice in order to find donors for imputing recipients. Usually, a vector $x$ of auxiliary variables, available for all the sample units $i \in s$, is used to determine a set $\{l(i) : i \in s_m\}$ of selected donors that are "close" to the corresponding recipients in $s_m$.

Approach to Inference
To evaluate properties of the imputed mean estimator $\bar{y}_I$ and to make inferences, the following imputation model is used:

$$E_m(y_i \mid X) = \mu(x_i), \qquad V_m(y_i \mid X) = \sigma^2(x_i), \qquad \mathrm{cov}_m(y_i, y_j \mid X) = 0 \ \ (i \neq j) \qquad (5)$$

where the subscript $m$ indicates that the expectation, variance, and covariance are evaluated with respect to the imputation model, $X$ is the $N$-row matrix containing $x_i'$ in its $i$th row, and $\mu$ and $\sigma^2$ are parametric or nonparametric smooth functions of $x$. Note that the subscript $m$ in quantities such as $s_m$ indicates missing values and should not be confused with the imputation model. The vector $x$ contains variables used at the imputation stage for the selection of donors. In principle, the imputer uses available variables that are associated with the $y$-variable. The vector $x$ may thus contain design variables (e.g., strata and cluster indicators, size measure), the domain of interest or other auxiliary variables. It is assumed in model (5) that the imputer has appropriately chosen the vector of auxiliary variables so that the design variables and the domain of interest do not further explain the $y$-variable after conditioning on $x$. This allows us to treat the design variables and the domain(s) of interest as being fixed under model (5).

Proposed Variance Estimator
Considering model (5), the total error of $\bar{y}_I$ can be broken down into sampling error and imputation error as shown in (4). The expectation appearing in the true variance component can be evaluated, leading to expressions which depend on the known $x$ values and on the unknown model parameters $\mu$ and $\sigma^2$. Therefore, to estimate the three components of the variance, all we need to provide are model-unbiased estimators of $\mu$ and $\sigma^2$. However, this will not completely lead to an explicit variance estimator, since we still have to obtain expectations of some terms with respect to the response mechanism.

Estimation of μ(x) and σ²(x)
One of the most common methods in non-parametric regression is the kernel method introduced by Nadaraya (1964) and Watson (1964), which is governed by a bandwidth parameter [7]. Kernel estimators with varying bandwidths are used especially to estimate the density of long-tailed and multi-modal distributions. Here a kernel estimate is introduced to obtain a non-parametric estimate of a regression function.
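
Smooth Linear Estimate of μ(x)
A smooth linear estimate of the function $\mu$, denoted by $\hat{\mu}$, can be written in general form as

$$\hat{\mu}(x) = \sum_{j \in s} w_k(x, x_j)\, y_j \qquad (6)$$

where $w_k(x, x_j)$ denotes a smoothing function with a bandwidth parameter $k$. This bandwidth parameter determines the amount of smoothing to be done. The estimates proposed by Nadaraya (1964) and Watson (1964), associated with kernel functions [7], will be considered:

$$\hat{\mu}(x) = \frac{\sum_{j \in s} K\!\left(\frac{x - x_j}{k}\right) y_j}{\sum_{j \in s} K\!\left(\frac{x - x_j}{k}\right)}$$

where $k$ denotes the bandwidth parameter. $K(\cdot)$ is called the kernel function; it is typically taken to be non-negative, symmetric about zero, and to integrate to one.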
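A Nadaraya-Watson estimate of this kind can be sketched as a kernel-weighted average. The code below is a minimal illustration: it assumes the sums run over units whose $y$ value is available, it uses the Epanechnikov kernel that the simulation section reports using, and the function names `epanechnikov` and `nw_estimate` are hypothetical.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nw_estimate(x0, xs, ys, bandwidth, kernel=epanechnikov):
    """Kernel-weighted average of ys, with weights K((x0 - x_j) / bandwidth)."""
    weights = [kernel((x0 - xj) / bandwidth) for xj in xs]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no observations within the bandwidth of x0")
    return sum(w * yj for w, yj in zip(weights, ys)) / total

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
mu_hat_at_2 = nw_estimate(2.0, xs, ys, bandwidth=1.5)
```

At an interior point with symmetric neighbors the estimate reproduces the linear trend; near the boundary of the data the weights become one-sided, which is a known source of bias for this estimator. A larger bandwidth averages over more neighbors and gives a smoother, flatter fit.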

Smooth Linear Estimate of σ²(x)
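Consider $y_i = \mu(x_i) + \varepsilon_i$, where $E(\varepsilon_i \mid x_i) = 0$ and $V(\varepsilon_i \mid x_i) = \sigma^2(x_i)$. The estimate of the residual term is given by $\hat{\varepsilon}_i = y_i - \hat{\mu}(x_i)$. The square of the estimated residual $\hat{\varepsilon}_i$, $i \in s$, is given by

$$\hat{\varepsilon}_i^2 = \left( y_i - \hat{\mu}(x_i) \right)^2.$$

To smooth $\hat{\varepsilon}_i^2$, we choose a smoothing function $w_h(x, x_j)$ with a bandwidth parameter $h$. Using (6), we get

$$\hat{\sigma}^2(x) = \sum_{j \in s} w_h(x, x_j) \left( y_j - \hat{\mu}(x_j) \right)^2,$$

which is a smooth estimate of $V(\varepsilon \mid x)$. A corresponding Nadaraya-Watson estimate of $\sigma^2$ is given by

$$\hat{\sigma}^2(x) = \frac{\sum_{j \in s} K\!\left(\frac{x - x_j}{h}\right) \left( y_j - \hat{\mu}(x_j) \right)^2}{\sum_{j \in s} K\!\left(\frac{x - x_j}{h}\right)}$$

where $h$ denotes the bandwidth parameter.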
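The two-stage construction (fit the mean, then smooth the squared residuals) can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the Epanechnikov kernel is assumed, and separate bandwidths `h_mean` and `h_var` are exposed because the mean and variance smoothers need not share a bandwidth.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def nw_smooth(x0, xs, values, h):
    """Nadaraya-Watson smoother of `values` at x0 (needs a point within h of x0)."""
    w = [epanechnikov((x0 - xj) / h) for xj in xs]
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)

def variance_function(x0, xs, ys, h_mean, h_var):
    """Estimate sigma^2(x0) by smoothing squared residuals (y_j - mu_hat(x_j))^2."""
    mu_hat = [nw_smooth(xj, xs, ys, h_mean) for xj in xs]
    sq_resid = [(yj - mj) ** 2 for yj, mj in zip(ys, mu_hat)]
    return nw_smooth(x0, xs, sq_resid, h_var)

# Hypothetical noisy data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 2.7, 5.4, 6.8, 9.1, 10.9]
sigma2_hat = variance_function(2.5, xs, ys, h_mean=1.5, h_var=2.0)
```

Because the squared residuals are non-negative and the kernel weights are non-negative, the resulting variance estimate cannot be negative, which is one practical attraction of smoothing squared residuals directly.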
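The proposed estimator of the variance is obtained by substituting $\hat{\mu}$ and $\hat{\sigma}^2$, as given above, for the unknown $\mu$ and $\sigma^2$ in the variance components.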

Simulation Studies
In our simulation study, the performance of the proposed donor estimator was compared empirically with the naïve, jackknife and bootstrap estimators. Two artificial population structures (linear and non-linear), one real population (linear) and two non-response mechanisms were considered. Performance was evaluated in terms of relative bias (RB) and variance.
The first population (linear population) was generated as follows: 100 data points were generated according to a linear homoscedastic model. A simple random sample with a sampling fraction of 0.225 was taken without replacement from each population structure. We considered two non-response mechanisms: random and non-random non-response.
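A population and sample of this kind could be generated along the following lines. Every numerical choice in this sketch is an assumption made for illustration: the coefficients `beta0` and `beta1`, the error standard deviation, the uniform range for $x$, and the truncation of the non-integer sample size $0.225 \times 100 = 22.5$ to 22 are not taken from the paper.

```python
import random

def linear_population(N=100, beta0=10.0, beta1=2.0, sigma=1.0, seed=1):
    """Generate (x, y) from y = beta0 + beta1 * x + e with constant error variance."""
    rng = random.Random(seed)
    x = [rng.uniform(1.0, 10.0) for _ in range(N)]
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    return x, y

def srswor(N, fraction, rng):
    """Indices of a simple random sample without replacement of size int(fraction * N)."""
    n = int(fraction * N)  # truncates a non-integer size; an assumption of this sketch
    return rng.sample(range(N), n)

x_pop, y_pop = linear_population()
sample_idx = srswor(len(x_pop), 0.225, random.Random(2))
x_s = [x_pop[i] for i in sample_idx]
y_s = [y_pop[i] for i in sample_idx]
```

Keeping the population fixed and redrawing only the sample and the response indicators in each replicate matches the design-based view taken in the estimation sections.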
For the random non-response mechanism, non-responses were generated using independent Bernoulli trials with a constant parameter of 0.3 representing the probability of non-response.
For the non-random non-response mechanism, the sample values were arranged in order of magnitude using the $x$ values, and the largest 30% of the values were then regarded as missing.
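The two non-response mechanisms can be sketched as functions that return a missingness flag per sampled unit. The function names are hypothetical, and ordering by the auxiliary variable $x$ in the non-random case is an assumption consistent with the response mechanism depending on $x$ only.

```python
import random

def random_nonresponse(n, p_missing=0.3, rng=None):
    """Independent Bernoulli trials: True marks a non-respondent."""
    if rng is None:
        rng = random.Random(0)
    return [rng.random() < p_missing for _ in range(n)]

def nonrandom_nonresponse(x, share=0.3):
    """Flag the units with the largest x values (a `share` of the sample) as missing."""
    n = len(x)
    k = round(share * n)
    order = sorted(range(n), key=lambda i: x[i], reverse=True)
    missing = set(order[:k])
    return [i in missing for i in range(n)]

flags_random = random_nonresponse(10000)
flags_nonrandom = nonrandom_nonresponse([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Under the non-random mechanism every recipient lies above all donors in $x$, so nearest neighbor imputation always borrows from the upper edge of the respondent set, which makes this a demanding case for any donor method.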
Non-responses were generated for each non-response mechanism. To compensate for the missing values, nearest neighbor imputation was performed. After imputation, the four variance estimates (jackknife, bootstrap, naïve $\hat{V}_{ORD}$ and donor $\hat{V}_{DON}$) were calculated. The experiment was repeated 1000 times independently and the average of each estimate was computed. For the bootstrap estimator, 1000 bootstrap iterations were used. For the donor estimator, we used the bandwidth parameter that minimized the mean squared error and satisfied Silverman's (1986) condition.
The Epanechnikov kernel function, $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$ and $0$ otherwise, was used since it is optimal in the mean squared error sense.
The performance of the estimators was assessed using two criteria: the relative bias and the variance. The relative bias of each estimator is calculated over the repeated experiments, where $\bar{y}_I^{(b)}$ is the value of $\bar{y}_I$ for the $b$th experiment and $\hat{V}^{(b)}$ represents the value of the variance estimator for the $b$th experiment.
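One common way to compute a Monte Carlo relative bias from such replicates is sketched below. The definition used here, the average variance estimate minus the empirical variance of the point estimates, divided by that empirical variance, is an assumption; the paper's exact formula is not recoverable from the text. The function name and the numbers are hypothetical.

```python
def relative_bias(var_estimates, point_estimates):
    """(mean of variance estimates - Monte Carlo variance) / Monte Carlo variance."""
    B = len(point_estimates)
    centre = sum(point_estimates) / B
    mc_var = sum((p - centre) ** 2 for p in point_estimates) / (B - 1)
    mean_est = sum(var_estimates) / len(var_estimates)
    return (mean_est - mc_var) / mc_var

# Hypothetical replicates: three imputed means and three variance estimates.
rb = relative_bias([1.2, 1.0, 1.4], [1.0, 2.0, 3.0])
```

A positive value indicates that the variance estimator overstates the Monte Carlo variance on average, and a negative value indicates understatement.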

Results
The results were tabulated to show the performance of the estimators in terms of relative bias and variance. Three populations were analyzed, each with two tables: one for the random non-response mechanism and one for the non-random non-response mechanism.
a) Case when the population is linear. From Table 1, the naïve estimator has the smallest variance, followed by the jackknife, while our proposed estimator performs better than the bootstrap. The proposed estimator has the highest relative bias, followed by the naïve estimator, while the jackknife and bootstrap seem to do well in terms of relative bias.
b) Case when the population is real. The results of Table 2 are similar to those of Table 1. This suggests that whether the population is real or artificial, as long as it is linear, the estimators behave in the same way.
c) Case when the population scatter is non-linear. According to Table 3, our proposed estimator performs better than the bootstrap estimator, while the naïve and jackknife estimators have the smallest variance. The bootstrap seems to be the best in terms of relative bias, while our proposed estimator has the highest relative bias.
Discussion of the Results
Considering the above three tables, which compare the estimators when the population is linear or non-linear, the naïve estimator seems to have the smallest variance, followed by the jackknife estimator, while our proposed estimator alternates with the bootstrap. In the non-linear population, our proposed estimator performs better than the bootstrap in terms of variance. It is also noted that the variance and relative bias of the four estimators have close numerical values, which suggests that they are all valid in these settings.
It is worth noting that donor imputation may not be the most efficient imputation method in any specific scenario. Nevertheless, it is quite a popular imputation method in surveys due to its practical advantages. Therefore it is useful to develop variance estimation methods that take donor imputation into account.

Conclusion
The simulation study examined the performance of four variance estimators. Linear and non-linear population structures and two non-response mechanisms were considered, and performance was evaluated in terms of relative bias (RB) and variance. The variance and the relative bias of the four estimators have very close numerical values, which suggests that all of them are valid and work well in the simulation study. We have proposed a variance estimation method for any type of donor imputation. In the simulations, the variance of the proposed estimator is small and its relative bias, although the highest of the four, remains close to that of the other estimators.
Thus, it is useful to develop a variance estimation method that takes donor imputation into account. The main drawback of the proposed method is that it depends on the validity of an imputation model. This is also a characteristic of the existing methods for NN imputation.
Two key issues with any variance estimation method that relies on an imputation model are the appropriate choice of auxiliary variables for donor selection and the estimation of the model mean $\mu(x)$ and variance $\sigma^2(x)$ given the chosen auxiliary variables. Auxiliary variables should be associated with the variable of interest so as to ensure that the conditional model bias remains small [1].