A Multiplicative Bias Corrected Nonparametric Estimator for a Finite Population Mean

Nonparametric regression has been widely exploited in survey sampling to construct estimators for the finite population mean and total. It offers greater flexibility with regard to model specification and is therefore applicable to a wide range of problems. A major drawback of estimators constructed under this framework is that they are generally biased due to the boundary problem and therefore require modification at the boundary points. In this study, a bias-robust estimator for the finite population mean based on the multiplicative bias reduction technique is proposed. A simulation study is performed to investigate the properties of this estimator and to assess its performance relative to other existing estimators. The asymptotic properties and coverage rates of the proposed estimator are better than those exhibited by the Nadaraya-Watson estimator and the ratio estimator.


Introduction
Sample surveys are intended to reduce the time and cost of collecting data while at the same time ensuring valid inference about population quantities. Extrapolation does not give accurate information in surveys since the sample is a subset of an entire population and therefore does not contain information on units that are not represented in the selected sample. The use of auxiliary information that is correlated to the characteristic under study has been very effective in predicting the information in the unobserved units.
Under the model based framework, a super-population model that describes the relationship between the auxiliary variable and the study variable is used to predict the non-sampled values. This has an overall effect of increasing the precision with which population quantities are estimated. Ratio and regression estimators are examples of estimators that are constructed under this framework.
One of the major challenges in using this approach lies in the selection of an optimal model. This presents a danger of model misspecification which, when it occurs, introduces substantial error in the estimates of the population parameters. A number of strategies have been proposed to solve the problems arising from model misspecification.
Nonparametric regression has been embraced as one of the ways of dealing with the problem of model misspecification. In this case, no restrictions are placed on the relationship between the auxiliary variable and the study variable of interest. This has an overall effect of improving the performance of the estimators.
A major problem encountered when using nonparametric kernel-based regression over a finite interval, as in the estimation of finite population quantities, is the bias at the boundary points. A number of techniques have been proposed in this regard and many of them have encountered various pitfalls. Our focus is to apply a multiplicative bias correction technique to the nonparametric estimation of the finite population mean and to study the asymptotic properties, coverage properties and conditional properties of the resulting estimator.

Outline of the Paper
The rest of the paper is organized as follows. In subsections 1.2, 1.3 and 1.4, we briefly review model-based estimation, the bias-variance trade-off and confidence intervals. A multiplicative bias corrected estimator for the finite population mean is proposed in section 2. The asymptotic properties of the proposed estimator are derived in section 3. An empirical study is given in section 4 and the conclusion of the paper is given in section 5.

Review of the Model-Based Approach to Survey Inference
The model-based approach was originally proposed by Ronald A. Fisher and comprehensively reviewed by, among others, Royall (1976, 1992), Royall and Cumberland (1981), and Chambers (1996). In this framework the survey measurements are assumed to be realized values of some random variables. It is also assumed that an auxiliary variable correlated with the variable under study is available for all units in the population. A model that describes the relationship between the study variable (survey measurements) and the auxiliary variable is then sought. The model and the sampled data are then used to predict the non-sampled values and hence the finite population mean or total.
One of its main weaknesses, and a major cause of criticism, is that it is susceptible to bias arising from model misspecification. In fact, when model assumptions are seriously violated this approach can yield estimates that are even worse than those constructed under the design-based framework. Consequently, the focus of most research on the prediction approach has been to develop strategies to counter the effects of model misspecification on inference.
More specifically, our focus is to advance the work of Dorfman (1992), who considered the similar problem of estimating the finite population total using nonparametric regression. He used the Nadaraya-Watson estimator of the mean function to predict the non-sampled values of the study variable and consequently to estimate the finite population total. He demonstrated that the resulting estimator was more efficient than rival design-based estimators.
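The prediction idea described above can be sketched in a few lines. The following is a minimal illustration, not Dorfman's own implementation: the Gaussian kernel, the bandwidth value, the function names `nw_smoother` and `nw_total`, and the simulated population are all assumptions made for the example.

```python
import numpy as np

def nw_smoother(x_eval, x_obs, y_obs, h):
    """Nadaraya-Watson estimate of the mean function at each point of x_eval."""
    x_eval = np.asarray(x_eval, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float)
    # Gaussian kernel (an assumption); rows = evaluation points, columns = sampled units.
    u = (x_eval[:, None] - x_obs[None, :]) / h
    k = np.exp(-0.5 * u**2)
    return (k @ np.asarray(y_obs, dtype=float)) / k.sum(axis=1)

def nw_total(x_pop, sample_idx, y_sample, h):
    """Prediction-form total: observed sample sum plus predicted non-sampled values."""
    x_pop = np.asarray(x_pop, dtype=float)
    y_sample = np.asarray(y_sample, dtype=float)
    non_sampled = np.ones(len(x_pop), dtype=bool)
    non_sampled[sample_idx] = False
    y_pred = nw_smoother(x_pop[non_sampled], x_pop[sample_idx], y_sample, h)
    return y_sample.sum() + y_pred.sum()

# Hypothetical population with a linear mean function.
rng = np.random.default_rng(0)
N, n = 1000, 100
x = rng.uniform(0, 1, N)
y = 1 + 2 * x + rng.normal(0, 0.1, N)
idx = rng.choice(N, size=n, replace=False)
total_hat = nw_total(x, idx, y[idx], h=0.1)
```

Only the sampled values of the study variable enter the computation; the auxiliary variable is needed for every population unit, which matches the census assumption on the auxiliary variable.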

Trade-Off Between Bias and Variance
In kernel smoothing there exists a fundamental trade-off between the bias and the variance of the estimate, which is governed by the smoothing parameter. Choosing a large bandwidth reduces the variance but simultaneously increases the bias of the estimate.
Conversely, choosing a small bandwidth reduces the bias but leads to an increase in the variance of the estimate. A natural way to balance this trade-off is to choose a bandwidth that minimizes the mean squared error of the estimate.
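The trade-off can be made concrete with a small Monte Carlo sketch. Everything specific here is an assumption chosen for illustration: the sine mean function, the evaluation point, the Gaussian kernel, the two bandwidths and the function names.

```python
import numpy as np

def nw_at_point(x0, x, y, h):
    """Nadaraya-Watson estimate of the mean function at a single point x0."""
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2)  # Gaussian kernel weights (an assumption)
    return np.sum(k * y) / np.sum(k)

def bias_variance(h, x0=0.25, n=200, reps=300, sigma=0.5, seed=1):
    """Monte Carlo bias and variance of the NW estimate at x0 for bandwidth h."""
    rng = np.random.default_rng(seed)

    def mu(t):
        # Hypothetical mean function used only for this illustration.
        return np.sin(2 * np.pi * t)

    est = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 1, n)
        y = mu(x) + rng.normal(0, sigma, n)
        est[r] = nw_at_point(x0, x, y, h)
    return est.mean() - mu(x0), est.var()

bias_small, var_small = bias_variance(h=0.02)  # small bandwidth
bias_large, var_large = bias_variance(h=0.30)  # large bandwidth
print(f"h=0.02: bias={bias_small:.3f}, variance={var_small:.4f}")
print(f"h=0.30: bias={bias_large:.3f}, variance={var_large:.4f}")
```

In this setting the larger bandwidth averages over most of a sine cycle and so flattens the peak at the evaluation point, while the smaller bandwidth uses only a handful of nearby observations and is therefore noisier.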

Review of Confidence Intervals in Survey Sampling
Results from sample-based surveys carry a level of uncertainty because they are based on a portion of the population (the sample) and not on the entire population. Confidence intervals are one of the statistical tools used to quantify this uncertainty, that is, how close the sample findings are likely to be to those that would be obtained if the entire population were used. In other words, they measure the 'confidence' in findings from a sample survey.
Constructing confidence intervals around point estimators provides survey statisticians with a properly scaled measure of the uncertainty associated with a particular estimator of interest. A major drawback of the conventional method is that it relies on the central limit theorem, which only holds for sufficiently large sample sizes. A challenge arises when modest sample sizes are encountered in practice.
As a result, previous research has been concerned with alternative approaches that address the limitation of the conventional method of constructing confidence intervals. One such strategy is the bootstrap method described in Efron (1982), which has seen considerable development over the past years. Rao & Wu (1988) explore the application of this technique under the design-based framework. Their findings are then extended to more complex survey designs by Sitter (1992a, 1992b).
R. Chambers & Dorfman (2003) describe an application of the bootstrap approach in the construction of confidence intervals under the model based approach to sample survey inference. In their work they focus on the ratio estimator as the estimator of interest. However, their empirical results obtained by using the beef population indicate that their objective of constructing sound confidence intervals for the finite population total was not attained.
Ouma & Wafula (2007) suggest the use of a general super-population model. Their methodology is simple to implement. In their study, they generated the values of the survey measurement, Y via simple random sampling with replacement. The results of their empirical study showed that their coverage rates were more satisfactory than those of R. Chambers & Dorfman (2003). Their findings are then extended to two stage cluster sampling by Onyango, Otieno, & Orwa (2010).
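For readers unfamiliar with the bootstrap idea referred to in this section, the following sketch shows the plain percentile bootstrap for a sample mean under independent resampling. It is not the design-based rescaling of Rao & Wu (1988) nor the model-based procedures of the other authors cited; the function name, the number of resamples and the simulated data are assumptions for illustration.

```python
import numpy as np

def bootstrap_percentile_ci(y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of y (plain i.i.d. resampling)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Each row holds the indices of one resample drawn with replacement.
    idx = rng.integers(0, len(y), size=(n_boot, len(y)))
    boot_means = y[idx].mean(axis=1)
    alpha = 1.0 - level
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Hypothetical modest-sized, skewed sample.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=30)
lower, upper = bootstrap_percentile_ci(sample)
```

The interval endpoints come directly from the empirical distribution of the resampled means, so no appeal to a normal approximation is made at this step.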

Proposed Estimator
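In this section, we present the proposed procedure for estimating the finite population mean. We consider a finite population $U = \{1, 2, 3, \ldots, N\}$ in which each of the sampling units is associated with a variable of interest $Y$. Further assume that an auxiliary variable $X$ is available for all elements in the finite population. We describe the population units using the prediction model $\xi$,
$$Y_i = \mu(X_i) + \varepsilon_i, \qquad E_\xi(\varepsilon_i \mid X_i) = 0, \qquad \operatorname{Var}_\xi(\varepsilon_i \mid X_i) = \sigma^2(X_i),$$
where $\mu(x)$ and $\sigma^2(x)$ are assumed to be smooth functions of the variable $X$. After obtaining sample information on the study variable $Y$ and a census on the auxiliary variable $X$, the unknown population mean that is to be estimated can be written as
$$\bar{Y} = \frac{1}{N}\Big[\sum_{i \in s} Y_i + \sum_{j \notin s} Y_j\Big] = f\,\bar{y}_s + (1 - f)\,\bar{Y}_r,$$
where $f = n/N$ is the sampling fraction, $\bar{y}_s$ is the sample mean and $\bar{Y}_r$ is the mean of the non-sampled units. In this case $i$ refers to the sample units and $j$ refers to the non-sampled units. Since the sample mean is known, the process of estimating the unknown population mean $\bar{Y}$ is equivalent to predicting the unsampled part of the population. The population mean can therefore be estimated as
$$\hat{\bar{Y}} = \frac{1}{N}\Big[\sum_{i \in s} Y_i + \sum_{j \notin s} \hat{Y}_j\Big]. \tag{4}$$
Under the model $\xi$, the problem of estimating the unsampled part of the population reduces to estimating $\mu(X_j)$, where $\mu$ is a smooth function. Therefore the estimator (4) becomes
$$\hat{\bar{Y}} = \frac{1}{N}\Big[\sum_{i \in s} Y_i + \sum_{j \notin s} \hat{\mu}(X_j)\Big]. \tag{5}$$
The task is to estimate the second part of equation (5). To do this, the multiplicative bias correction technique is employed, in which case the proposed estimator of the population mean is defined as
$$\hat{\bar{Y}}_{MBC} = \frac{1}{N}\Big[\sum_{i \in s} Y_i + \sum_{j \notin s} \hat{\mu}_n(X_j)\Big], \tag{6}$$
where $\hat{\mu}_n$ is as defined in equation (9). We define a pilot smoother of the regression function as
$$\tilde{\mu}_n(x) = \sum_{i \in s} w_i(x)\, Y_i, \tag{7}$$
where $w_i(x)$ denotes the Nadaraya-Watson kernel weight of sampled unit $i$ at the point $x$, computed with bandwidth $h$. Then the ratio $Y_i / \tilde{\mu}_n(X_i)$ is a noisy estimate of the inverse relative estimation error of the smoother $\tilde{\mu}_n$ at each of the observations, given by $\mu(X_i) / \tilde{\mu}_n(X_i)$. Smoothing this ratio gives
$$\hat{\alpha}_n(x) = \sum_{i \in s} w_i(x)\, \frac{Y_i}{\tilde{\mu}_n(X_i)}. \tag{8}$$
Equation (8) gives a better estimate of the inverse of the relative estimation error at each particular observation and can therefore be used as a multiplicative correction of the pilot smoother in equation (7). This yields the smoother
$$\hat{\mu}_n(x) = \hat{\alpha}_n(x)\, \tilde{\mu}_n(x). \tag{9}$$
Using equations (8) and (9) easily yields
$$\hat{\mu}_n(x) = \sum_{j \in s} w_j(x)\, \frac{\tilde{\mu}_n(x)}{\tilde{\mu}_n(X_j)}\, Y_j. \tag{10}$$
The ratio in equation (10) can be expressed as in equation (11). After a simplifying change of notation, equation (12) follows, where $r_j(x, X_j)$ is the remainder term that involves the terms $x$ and $X_j$.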
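One way to organize the expansion of the ratio in equation (10) that leads to a remainder term $r_j(x, X_j)$ is sketched below. The symbols $\bar{\mu}_n$ and $\Delta_n$ are assumptions introduced for this sketch, since the notation of equations (11) and (12) is not reproduced here; $j$ indexes the sampled observations entering the smoother, and $X$ denotes the observed auxiliary values.

```latex
% Assumed notation: conditional mean of the pilot smoother and its relative deviation.
\bar{\mu}_n(x) = E_\xi\!\left[\tilde{\mu}_n(x) \mid X\right],
\qquad
\Delta_n(x) = \frac{\tilde{\mu}_n(x) - \bar{\mu}_n(x)}{\bar{\mu}_n(x)} .

% Exact rewriting of the ratio in equation (10).
\frac{\tilde{\mu}_n(x)}{\tilde{\mu}_n(X_j)}
  = \frac{\bar{\mu}_n(x)}{\bar{\mu}_n(X_j)} \cdot \frac{1 + \Delta_n(x)}{1 + \Delta_n(X_j)}
  = \frac{\bar{\mu}_n(x)}{\bar{\mu}_n(X_j)}
    \left[\, 1 + \Delta_n(x) - \Delta_n(X_j) + r_j(x, X_j) \,\right],

% The remainder collects the second-order terms.
r_j(x, X_j) = \frac{\Delta_n(X_j)\left[\Delta_n(X_j) - \Delta_n(x)\right]}{1 + \Delta_n(X_j)} .
```

The second equality is an algebraic identity: writing $a = \Delta_n(x)$ and $b = \Delta_n(X_j)$, one has $(1+a)/(1+b) - (1 + a - b) = b(b-a)/(1+b)$. The remainder is therefore of second order in the relative deviations, which is why it becomes negligible when the pilot smoother is close to its conditional mean.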
Substituting equation (12) into equation (10), we obtain equation (13). Under the assumptions $n \to \infty$ and $nh \to \infty$, the remainder terms converge to zero in probability and equation (13) reduces to equation (14). The proposed estimator for the finite population mean can then be expressed as in equation (15).
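The construction in equations (6) to (9) can be sketched numerically as follows. This is a minimal illustration under stated assumptions rather than the authors' code: a Gaussian kernel, a single bandwidth shared by the pilot and correction steps, a mean function bounded away from zero (so that the ratio in the correction step is well defined), and the hypothetical names `nw_weights`, `mbc_smoother` and `mbc_mean`.

```python
import numpy as np

def nw_weights(x_eval, x_obs, h):
    """Row-normalised Gaussian kernel (Nadaraya-Watson) weights."""
    x_eval = np.asarray(x_eval, dtype=float)
    x_obs = np.asarray(x_obs, dtype=float)
    k = np.exp(-0.5 * ((x_eval[:, None] - x_obs[None, :]) / h) ** 2)
    return k / k.sum(axis=1, keepdims=True)

def mbc_smoother(x_eval, x_obs, y_obs, h):
    """Multiplicative bias corrected smoother: pilot fit times smoothed ratio."""
    y_obs = np.asarray(y_obs, dtype=float)
    pilot_obs = nw_weights(x_obs, x_obs, h) @ y_obs   # pilot smoother at sampled points
    ratio = y_obs / pilot_obs                         # assumes the pilot fit is nonzero
    w_eval = nw_weights(x_eval, x_obs, h)
    alpha = w_eval @ ratio                            # smoothed multiplicative correction
    pilot_eval = w_eval @ y_obs                       # pilot smoother at evaluation points
    return alpha * pilot_eval

def mbc_mean(x_pop, sample_idx, y_sample, h):
    """Sample sum plus MBC predictions for non-sampled units, divided by N."""
    x_pop = np.asarray(x_pop, dtype=float)
    y_sample = np.asarray(y_sample, dtype=float)
    non_sampled = np.ones(len(x_pop), dtype=bool)
    non_sampled[sample_idx] = False
    preds = mbc_smoother(x_pop[non_sampled], x_pop[sample_idx], y_sample, h)
    return (y_sample.sum() + preds.sum()) / len(x_pop)

# Hypothetical population with a positive linear mean function.
rng = np.random.default_rng(7)
N, n = 1000, 100
x_pop = rng.uniform(0, 1, N)
y_pop = 1 + 2 * x_pop + rng.normal(0, 0.1, N)
s_idx = rng.choice(N, size=n, replace=False)
mean_hat = mbc_mean(x_pop, s_idx, y_pop[s_idx], h=0.1)
```

Using one bandwidth for both steps keeps the sketch short; the two smoothing steps could equally be given separate bandwidths.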

The Asymptotic Bias of the Proposed Estimator
Under the model-based framework, the bias of the estimator $\hat{\bar{Y}}_{MBC}$ is defined as
$$\operatorname{Bias}(\hat{\bar{Y}}_{MBC}) = E_\xi\big(\hat{\bar{Y}}_{MBC} - \bar{Y}\big). \tag{16}$$
Next, the expected value of the proposed estimator for the finite population mean is given by equation (17). The expectation of the smoother $\hat{\mu}_n$ in equation (9) is obtained by analyzing the individual terms of its stochastic approximation, which are given by equation (14).
Analyzing the first term of the expression in equation (14) yields equation (18). An analysis of the second term of equation (14) gives equations (19) and (20), which reduce further. Lastly, the third term of equation (14) is analyzed in the same way. Putting together the results obtained for these three terms, equation (14) reduces to a simpler expression. Substituting the first two terms of the expansion given by equation (26), we obtain equation (28), from which the asymptotic bias of the proposed estimator follows. The bias of $\hat{\bar{Y}}_{MBC}$ thus converges to zero at a faster rate than that of the existing nonparametric estimators.

The Asymptotic Variance of the Proposed Estimator
We can express the estimator of the finite population mean as in equation (32). Under the assumption $nh \to \infty$, the remainder terms $r_j(x, X_j)$ converge to zero in probability and the expression reduces further. Truncating the binomial expansion at the first term yields equation (34). The variance of the estimator is then obtained from this expression and, using a Taylor series expansion, can be written in its asymptotic form. This implies that the proposed estimator is more efficient than the usual nonparametric regression estimator proposed by Dorfman (1992).

The Asymptotic Mean Squared Error of the Proposed Estimator
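The mean squared error of $\hat{\bar{Y}}_{MBC}$ is given by
$$\operatorname{MSE}(\hat{\bar{Y}}_{MBC}) = \operatorname{Var}(\hat{\bar{Y}}_{MBC}) + \big[\operatorname{Bias}(\hat{\bar{Y}}_{MBC})\big]^2.$$
Substituting the expressions for the variance and the bias in the above equation yields the asymptotic mean squared error. As $n \to \infty$ and $nh \to \infty$, the mean squared error tends to zero, indicating that the proposed estimator is statistically consistent.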
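The decomposition of the mean squared error into variance and squared bias can be checked numerically on any set of replicated estimates. The sketch below uses hypothetical draws in place of real estimates, and the function name `mse_decomposition` is an assumption.

```python
import numpy as np

def mse_decomposition(estimates, target):
    """Return (bias, variance, mse) of replicated estimates around a target value."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - target
    variance = est.var()  # divisor equal to the number of replicates, so the identity is exact
    mse = np.mean((est - target) ** 2)
    return bias, variance, mse

# Hypothetical replicated estimates of a mean whose true value is 2.0.
rng = np.random.default_rng(3)
draws = rng.normal(loc=2.1, scale=0.2, size=500)
bias, variance, mse = mse_decomposition(draws, target=2.0)
```

With the variance computed using the number of replicates as divisor, `mse` equals `variance + bias**2` up to floating point error.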

Empirical Study
We perform a simulation experiment in order to investigate the statistical properties of the proposed estimator and to compare its performance with that of the Nadaraya-Watson and ratio estimators. We consider a case where only one auxiliary variable is available and generate its values as independent and identically distributed uniform(0, 1) random variables. We examine five simulated populations generated from a regression model with linear, quadratic, jump, exponential and sine mean functions. The linear function $\mu_1(x)$ is the correct specification for the ratio estimator, and it is therefore expected that the ratio estimator will perform better than the other estimators under this model.
The errors are independent and identically distributed with zero mean and standard deviation $\sigma = 1$. Five hundred samples of size 500 were generated using simple random sampling without replacement. Sampling is carried out on the unit indices, so that the assumed relationship between the study variable and the auxiliary variable is preserved in each sample. We compare the performance of the proposed estimator, $\hat{\bar{Y}}_{MBC}$, with that of the Nadaraya-Watson estimator, $\hat{\bar{Y}}_{NW}$, and the ratio estimator, $\hat{\bar{Y}}_{RATIO}$. The accompanying diagrams show plots of the linear, quadratic, jump, exponential and sine populations.
The unconditional biases and mean squared errors are computed for each of the estimators. We also computed the 95% confidence interval lengths for each of the estimators under the different populations. Table 1 gives the unconditional biases and unconditional mean squared errors of the multiplicative bias corrected Nadaraya-Watson estimator $\hat{\bar{Y}}_{MBC}$, the Nadaraya-Watson estimator $\hat{\bar{Y}}_{NW}$ and the ratio estimator $\hat{\bar{Y}}_{RATIO}$, applied to finite population mean estimation for the different mean functions. The bias of the multiplicative bias corrected estimator is much lower than that of the Nadaraya-Watson estimator, and its MSE is also lower than that of the Nadaraya-Watson estimator for each of the mean functions.
Table 2 gives a comparison of the coverage probabilities of the three estimators for the different mean functions. The coverage probabilities for the multiplicative bias corrected estimator are closer to the nominal value than those for the Nadaraya-Watson estimator, so the coverage ability of $\hat{\bar{Y}}_{MBC}$ is better than that of $\hat{\bar{Y}}_{NW}$. The ratio estimator has the best coverage ability under the linear mean function, where it outperforms the other two estimators.
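The performance measures reported in the tables can be computed from the replicated estimates along the following lines. The formulas used in the study are not reproduced in the text, so this sketch assumes the usual Monte Carlo definitions and normal-theory 95% intervals built from a standard error per sample; the function name `performance` and the toy numbers are hypothetical.

```python
import numpy as np

def performance(estimates, std_errors, true_mean, z=1.96):
    """Unconditional bias, MSE, coverage rate and mean interval length over replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    lower, upper = est - z * se, est + z * se
    covered = (lower <= true_mean) & (true_mean <= upper)
    return {
        "bias": float(np.mean(est - true_mean)),
        "mse": float(np.mean((est - true_mean) ** 2)),
        "coverage": float(np.mean(covered)),
        "ci_length": float(np.mean(upper - lower)),
    }

# Toy replicates: four estimates of a mean whose true value is 2.0.
result = performance([1.9, 2.0, 2.1, 2.6], [0.1, 0.1, 0.1, 0.1], true_mean=2.0)
```

In the toy example three of the four intervals contain the true mean, so the coverage rate is 0.75.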
Table 3 gives a comparison of the 95% confidence interval lengths for the multiplicative bias corrected estimator, the Nadaraya-Watson estimator and the ratio estimator for the different mean functions. The confidence intervals generated by the multiplicative bias corrected estimator are much tighter than those generated by the Nadaraya-Watson estimator and the ratio estimator. The results indicate that the multiplicative bias corrected estimator outperforms the usual nonparametric regression estimator proposed by Dorfman (1992) at the 95% coverage rate.
To study the conditional performance of the selected estimators, the 500 samples were sorted by the value of the sample mean $\bar{x}$ and divided into 25 groups of 20 samples each. We then computed the empirical mean and bias of each estimator within each group. The plots of the conditional biases versus $(\bar{x} - \bar{X})$ obtained for the three estimators under the different mean functions all indicated similar results. We report the behavior of the conditional bias under the linear mean function.
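The grouping procedure used for the conditional analysis can be sketched as follows. The function name `conditional_bias` and the simulated inputs are hypothetical; the inputs stand in for the per-sample means of the auxiliary variable and the corresponding estimates.

```python
import numpy as np

def conditional_bias(xbar_samples, estimates, true_mean, group_size=20):
    """Sort replicates by sample mean of x, then average x-bar and bias within groups."""
    xbar = np.asarray(xbar_samples, dtype=float)
    est = np.asarray(estimates, dtype=float)
    order = np.argsort(xbar)
    n_groups = len(xbar) // group_size
    keep = order[: n_groups * group_size]  # drop any incomplete final group
    xbar_groups = xbar[keep].reshape(n_groups, group_size).mean(axis=1)
    bias_groups = (est[keep] - true_mean).reshape(n_groups, group_size).mean(axis=1)
    return xbar_groups, bias_groups

# Hypothetical replicates: 500 samples whose estimates drift with the sample mean of x.
rng = np.random.default_rng(11)
xbar_reps = rng.normal(0.5, 0.02, 500)
est_reps = 2.0 + 3.0 * (xbar_reps - 0.5) + rng.normal(0, 0.01, 500)
xbar_g, bias_g = conditional_bias(xbar_reps, est_reps, true_mean=2.0)
```

Plotting `bias_g` against `xbar_g` minus the population mean of the auxiliary variable gives the kind of conditional bias plot described above; an estimator that is approximately conditionally unbiased produces a flat band around zero.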

Conclusions and Recommendations
The aim of this study was to develop a bias-robust estimator for the finite population mean using the multiplicative bias correction approach to nonparametric regression. The study reveals that the derived estimator is more efficient than the Nadaraya-Watson estimator. The proposed estimator has smaller bias, lower mean squared error, better coverage ability and tighter confidence intervals than the Nadaraya-Watson estimator. It is also approximately conditionally unbiased. It has therefore proved to be effective in correcting the boundary problems associated with the existing nonparametric regression smoothers.