On Local Linear Regression Estimation of Finite Population Totals in Model Based Surveys

In this paper, nonparametric regression is employed to estimate unknown finite population totals. A robust estimator of finite population totals in model based inference is constructed using the procedure of local linear regression. In particular, robustness properties of the proposed estimator are derived, and its performance is briefly compared with that of some existing estimators in terms of bias, MSE and relative efficiency. Results indicate that the local linear regression estimator is more efficient and performs better than the Horvitz-Thompson and Dorfman estimators, regardless of whether the model is correctly specified or misspecified. The local linear regression estimator also outperforms the linear regression estimator in all the populations except when the population is linear. The confidence intervals generated by the model based local linear regression method are much tighter than those generated by the design based Horvitz-Thompson method. Generally, the model based approach outperforms the design based approach regardless of whether the underlying model is correctly specified or not, but that effect decreases as the model variance increases.


Introduction
Integrated systems of survey design and estimation for finite population inference have been considered by researchers in the past and categorised as the design based approach, the model assisted approach, the combined inference approach and the model based approach. Comparing these approaches in terms of efficiency and robustness to assumptions about the characteristics of the population, it has been concluded that although none of them delivers both efficiency and robustness, the model based approach seems to achieve the best compromise. Chambers [1] gives a brief discussion of these survey strategies. Kuo [2], Dorfman and Hall [3] and Kuk [4] apply nonparametric regression for estimating totals in finite populations.
There are two incompatible approaches for making inference from sample to population. In the traditional design-based approach, Horvitz and Thompson [5] use the probability structure of the procedure by which the sample is selected as the basis for inference in finite populations. In the model-based or predictive approach, Dorfman [6] uses a regression model of the response on the predictor to predict the non-sample values and, by consequence, their total. Kikechi et al. [7] employ a model based survey to estimate the unknown values of the survey variable using the local linear regression approach. In particular, the authors derive the properties of a local linear regression estimator and make variance comparisons between the derived estimator and the Nadaraya-Watson regression estimator, which show that the two estimators are asymptotically equally efficient.
Research by Dorfman and Hall [3] and Chambers et al. [8] has dwelt on estimating the regression function, assumed to be smooth. The expression for the asymptotic bias of this version of a nonparametric regression estimator of the total does not include division by the sampling density, and so we expect the bias of a local linear regression based estimator to be less sensitive to sparse regions in the sample data. We make use of the local linear regression technique to study the properties of the derived estimator and compare its performance with the existing estimators. Chambers and Dorfman [9] observe that the calibration estimator based on the columnar model does slightly better than the best linear unbiased estimator at high bandwidth.
The estimator generally appears robust to changes in bandwidth, and gives exact unbiasedness and minimal variance for a particular weighted balanced sample.
They further note that the estimators based on a nonparametric model give approximate unbiasedness with no condition on balance, and give approximate minimal variance under approximate weighted balance. Fan and Gijbels [10], however, explore methods more sophisticated than kernel regression, for example the variable bandwidth local linear regression approach in finite populations.
Zeng and Little [11] propose a model-based estimator that uses penalized spline regression, and Zeng and Little [12] extend this estimator to two-stage sampling designs.
A new type of model-assisted nonparametric regression estimator for the finite population total, based on local polynomial smoothing, which is a generalization of kernel regression, has also been proposed. Breidt and Opsomer [13] use the traditional local polynomial regression estimator of the unknown regression function for model assisted estimation of the finite population total. Sanchez et al. [14] estimate the regression function using a modified local constant estimator for the mixed variable case. Luc [15] derives asymptotic properties of a probability weighted nonparametric regression estimator under a combined inference framework for complex surveys. However, the nonparametric regression estimator considered there is the local constant estimator. Simulation studies showed that the bias of the modified nonparametric regression estimator had the same leading terms and order of probability as under the model based framework. He develops asymptotic properties under the combined inference approach and tests the performance of the estimator against the traditional model based local constant estimators. However, the use of the local linear regression procedure in a purely model based framework remains open and requires further study.

The Proposed Estimator
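The regression model for estimating the population total expresses the survey variable as a smooth mean function of the auxiliary variable plus a random error. Letting $x_j$ be any point in the non-sample, and as in Dorfman [6], the estimator proposed by Kikechi et al. [7] is adopted: the finite population total is estimated by the sum of the observed sample values and the local linear regression estimates of the mean function at the non-sample points $x_j$.
In Kikechi et al. [7], the local linear regression estimator is derived and defined as a kernel-weighted linear combination of the sampled responses.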
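For reference, the model and estimator described above can be written in a standard form. This is a sketch in assumed notation: the symbols $m$, $\sigma^2$, $s$, $K$, $h$, $w_i$, $s_1$ and $s_2$ are conventional choices from the local linear smoothing literature and are not necessarily the symbols used by Kikechi et al. [7].

```latex
% Assumed notation: s is the sample, K a kernel, h the bandwidth,
% and the errors e_i are taken to be independent.
\begin{align}
  Y_i &= m(x_i) + e_i, \qquad E(e_i) = 0, \quad
         \operatorname{Var}(e_i) = \sigma^2(x_i), \quad i = 1, \dots, N, \\
  \hat{T}_{LL} &= \sum_{i \in s} Y_i + \sum_{j \notin s} \hat{m}_{LL}(x_j), \\
  \hat{m}_{LL}(x_j) &= \frac{\sum_{i \in s} w_i(x_j)\, Y_i}{\sum_{i \in s} w_i(x_j)}, \\
  w_i(x_j) &= K\!\left(\frac{x_i - x_j}{h}\right)
              \bigl[ s_2(x_j) - (x_i - x_j)\, s_1(x_j) \bigr], \\
  s_l(x_j) &= \sum_{i \in s} K\!\left(\frac{x_i - x_j}{h}\right) (x_i - x_j)^l,
              \qquad l = 1, 2.
\end{align}
```

With these weights, $\sum_{i \in s} w_i(x_j)(x_i - x_j) = 0$, which is why a local linear fit reproduces an exactly linear mean function without bias.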

Properties of the Local Linear Regression Estimator
In this section, the fixed equally spaced design model is considered. The following assumptions, made in Ruppert and Wand [16], are used to derive the properties of the local linear regression estimator: (i) The variables $x_i$ lie in the interval $(0, 1)$. (v) The point $x_j$ at which the estimation is taking place satisfies $h < x_j < 1 - h$.
Fan [17] imposed some of these conditions only for convenience in the technical arguments, and they can thus be relaxed.
Using equation (2) as proposed by Kikechi et al. [7], the local linear estimator of the finite population total is computed by evaluating the local linear regression estimate at each non-sample point.
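The computation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `epanechnikov`, `local_linear` and `local_linear_total` are hypothetical, the Epanechnikov kernel is used because the simulation study adopts it, and a single fixed bandwidth `h` is assumed for all non-sample points.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear(x0, xs, ys, h):
    """Local linear estimate of the mean function at x0 from the sample (xs, ys)."""
    k = [epanechnikov((x - x0) / h) for x in xs]
    # Kernel-weighted moments of (x - x0) of order 1 and 2.
    s1 = sum(ki * (x - x0) for ki, x in zip(k, xs))
    s2 = sum(ki * (x - x0) ** 2 for ki, x in zip(k, xs))
    # Local linear weights; they sum to s0 * s2 - s1^2, which is non-negative.
    w = [ki * (s2 - (x - x0) * s1) for ki, x in zip(k, xs)]
    denom = sum(w)
    if denom <= 1e-12:
        # Fewer than two distinct sample points fall inside the window.
        raise ValueError("bandwidth too small for a local linear fit at x0")
    return sum(wi * y for wi, y in zip(w, ys)) / denom


def local_linear_total(sample_x, sample_y, nonsample_x, h):
    """Model based total: observed sample sum plus predicted non-sample values."""
    predicted = sum(local_linear(xj, sample_x, sample_y, h) for xj in nonsample_x)
    return sum(sample_y) + predicted
```

When the sampled responses lie exactly on a straight line, `local_linear` returns the value of that line at `x0`, so `local_linear_total` recovers the true total; this is a convenient sanity check on any implementation.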

The Expectation of the Local Linear Regression Estimator
The expectation of the estimator is derived using a Taylor series expansion of the mean function about the point of estimation, together with Theorem 3 in Fan and Gijbels [10], which holds under the conditions given in (i)-(v).

The Bias of the Local Linear Regression Estimator
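The bias of the local linear regression estimator is obtained as the difference between its expectation and the population total.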
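As a hedged illustration, the standard interior-point result for local linear smoothing (see Ruppert and Wand [16] and Fan and Gijbels [10]) gives the leading-order form below. The notation $\hat{T}_{LL}$, $\hat{m}_{LL}$ and $\mu_2(K)$, and the step of summing the pointwise biases over the non-sample, are assumptions made here for illustration and are not expressions taken from the paper.

```latex
% Pointwise bias at an interior point x_j, with h the bandwidth and
% m'' the second derivative of the mean function.
\begin{align}
  E\{\hat{m}_{LL}(x_j)\} - m(x_j)
    &= \tfrac{1}{2}\, h^2\, m''(x_j)\, \mu_2(K) + o(h^2),
    \qquad \mu_2(K) = \int u^2 K(u)\, du, \\
  \operatorname{Bias}(\hat{T}_{LL}) = E(\hat{T}_{LL} - T)
    &= \sum_{j \notin s} \bigl[ E\{\hat{m}_{LL}(x_j)\} - m(x_j) \bigr]
    \approx \frac{h^2}{2}\, \mu_2(K) \sum_{j \notin s} m''(x_j).
\end{align}
```

For the Epanechnikov kernel, $\mu_2(K) = 1/5$. The leading term contains no division by the sampling density and vanishes when the mean function is linear.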

The Variance of the Local Linear Regression Estimator
The variance of the local linear regression estimator is estimated using the variance of the error. Then, $\operatorname{Var}(\hat{T} - T)$ is taken as an estimator of the variance of the estimator. The asymptotic expression for this variance follows from the results derived in Kikechi et al. [7] for the variance of the local linear regression estimator of the mean function. The asymptotic expression for the MSE of the local linear regression estimator then combines this variance with the squared bias.
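The structure of these expressions can be sketched using standard results for local linear smoothing at an interior point. The symbols $R(K)$, $n$ (sample size) and $f$ (design density, equal to 1 for an equally spaced design on the unit interval) are assumed notation, and independence of the model errors is assumed; the exact expressions in Kikechi et al. [7] may differ in form.

```latex
\begin{align}
  \operatorname{Var}\{\hat{m}_{LL}(x_j)\}
    &= \frac{R(K)\, \sigma^2(x_j)}{n h f(x_j)} \{1 + o(1)\},
    \qquad R(K) = \int K(u)^2\, du, \\
  \operatorname{Var}(\hat{T}_{LL} - T)
    &= \operatorname{Var}\Bigl( \sum_{j \notin s} \hat{m}_{LL}(x_j) \Bigr)
       + \sum_{j \notin s} \sigma^2(x_j), \\
  \operatorname{MSE}(\hat{T}_{LL}) = E(\hat{T}_{LL} - T)^2
    &= \operatorname{Var}(\hat{T}_{LL} - T)
       + \bigl\{ E(\hat{T}_{LL} - T) \bigr\}^2 .
\end{align}
```

The second line uses the fact that the predictions depend only on the sampled errors, which are independent of the non-sample errors. For the Epanechnikov kernel, $R(K) = 3/5$.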

Simulation Study
In this section, the performances of various estimators, including the proposed local linear regression estimator, are studied. In particular, we consider the design-based estimator, the parametric model-based estimator and the nonparametric model-based estimators.
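The three comparison estimators can be sketched as follows. This is an illustration under assumptions rather than the exact code behind the reported results: the function names are hypothetical, the Horvitz-Thompson estimator is written for arbitrary inclusion probabilities supplied by the caller, the Dorfman estimator is taken to use Nadaraya-Watson (local constant) smoothing with an Epanechnikov kernel, and the parametric estimator is taken to be the prediction estimator under a simple linear working model fitted by ordinary least squares.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) for |u| <= 1, and 0 otherwise."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def horvitz_thompson_total(sample_y, inclusion_probs):
    """Design based total: each sampled value weighted by 1 / inclusion probability."""
    return sum(y / p for y, p in zip(sample_y, inclusion_probs))


def nadaraya_watson(x0, xs, ys, h):
    """Local constant (kernel-weighted average) estimate of the mean function at x0."""
    k = [epanechnikov((x - x0) / h) for x in xs]
    total_weight = sum(k)
    if total_weight <= 0.0:
        raise ValueError("no sample points inside the kernel window at x0")
    return sum(ki * y for ki, y in zip(k, ys)) / total_weight


def dorfman_total(sample_x, sample_y, nonsample_x, h):
    """Sample sum plus Nadaraya-Watson predictions for the non-sample units."""
    predicted = sum(nadaraya_watson(xj, sample_x, sample_y, h) for xj in nonsample_x)
    return sum(sample_y) + predicted


def linear_regression_total(sample_x, sample_y, nonsample_x):
    """Sample sum plus ordinary least squares predictions for the non-sample units."""
    n = len(sample_x)
    x_bar = sum(sample_x) / n
    y_bar = sum(sample_y) / n
    sxx = sum((x - x_bar) ** 2 for x in sample_x)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sample_x, sample_y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum(sample_y) + sum(intercept + slope * xj for xj in nonsample_x)
```

Under simple random sampling without replacement of `n` units from `N`, every inclusion probability equals `n / N`, so the Horvitz-Thompson total reduces to `N` times the sample mean.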

Population Description
In this study, four populations are considered, each generated from a regression model of the survey variable on an auxiliary variable. The auxiliary values $x_i$ are generated as independent and identically distributed (iid) uniform $(0, 1)$ random variables. Four mean functions are considered with $1 \le i \le 200$, namely:
Linear: $m_1(x_i) = 1 + 2(x_i - 0.5)$
Quadratic: $m_2(x_i) = 1 + 2(x_i - 0.5)^2$
Bump
Jump
The Epanechnikov kernel is used in this study for kernel smoothing on each of the populations because of its simplicity and ease of computation. This is given by $K(u) = \frac{3}{4}(1 - u^2)$ for $|u| \le 1$ and $K(u) = 0$ otherwise. Following Silverman [18], the search for the optimal bandwidth is done within a prescribed interval. The absolute bias (AB) is computed in order to analyze the performance of the proposed estimator versus some specified estimators using
$$\mathrm{AB}(\hat{T}) = \sum_{r=1}^{R} \left| \frac{\hat{T}_r - T}{R} \right|,$$
where $\hat{T}_r$ is the estimate obtained from the $r$-th of $R$ simulated samples and $T$ is the true population total. For the confidence intervals, the lower and upper confidence limits are the bounds within which we expect the true population total to lie with 95% confidence.
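A population of this kind can be generated with the short sketch below. Several details are assumptions and are labeled as such in the code: the bump and jump mean functions use forms that are common in the nonparametric survey literature, the errors are taken to be Gaussian with standard deviation `sigma = 0.1`, and the names `generate_population` and `absolute_bias` are hypothetical.

```python
import math
import random


def m_linear(x):
    return 1.0 + 2.0 * (x - 0.5)


def m_quadratic(x):
    return 1.0 + 2.0 * (x - 0.5) ** 2


def m_bump(x):
    # ASSUMED form: a linear trend plus a narrow Gaussian bump centred at 0.5.
    return 1.0 + 2.0 * (x - 0.5) + math.exp(-200.0 * (x - 0.5) ** 2)


def m_jump(x):
    # ASSUMED form: a linear trend up to x = 0.65, then a constant level of 0.65.
    return 1.0 + 2.0 * (x - 0.5) if x <= 0.65 else 0.65


def generate_population(mean_fn, size=200, sigma=0.1, seed=0):
    """Draw x ~ iid uniform(0, 1) and y = mean_fn(x) + Gaussian error (assumed)."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(size)]
    ys = [mean_fn(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def absolute_bias(estimates, true_total):
    """Average absolute deviation of the simulated estimates from the true total."""
    return sum(abs(t - true_total) for t in estimates) / len(estimates)
```

Fixing `seed` makes each population reproducible, so that every estimator is evaluated on the same data.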

Results
The results for the absolute biases, mean squared errors, relative efficiencies, confidence intervals and average lengths of confidence intervals for the various estimators are provided in Tables 3, 4, 5, 6 and 7 respectively.

Discussion of Results
In this section, results for the bias, the mean squared error (MSE), relative efficiency, confidence intervals and average length of confidence intervals are discussed. The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is the difference between the expected value of $\hat{\theta}$ and $\theta$; that is, $\operatorname{Bias}(\hat{\theta}) = E(\hat{\theta}) - \theta$. An estimator whose bias is identically equal to 0 is called an unbiased estimator and satisfies $E(\hat{\theta}) = \theta$ for all $\theta$. The larger the bias in absolute value, the poorer the estimator. The mean squared error (MSE) measures the average squared difference between the estimator $\hat{\theta}$ and the parameter $\theta$, which is a reasonable measure of performance for an estimator. The MSE of an estimator $\hat{\theta}$ of a parameter $\theta$ is the function of $\theta$ defined by $E(\hat{\theta} - \theta)^2$, denoted $\operatorname{MSE}(\hat{\theta})$, and it satisfies $\operatorname{MSE}(\hat{\theta}) = \operatorname{Var}(\hat{\theta}) + [\operatorname{Bias}(\hat{\theta})]^2$. Thus, MSE has two components, one that measures the variability of the estimator (precision) and the other that measures its bias (accuracy). An estimator that has good MSE properties has small combined variance and bias.
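The decomposition of the MSE into variance and squared bias can be checked numerically on a set of simulated estimates. The sketch below uses hypothetical function names and divides by the number of replications (rather than one less) in the variance, which is the choice that makes the identity hold exactly for the empirical quantities.

```python
def mean(values):
    return sum(values) / len(values)


def bias(estimates, truth):
    """Empirical bias: average estimate minus the true value."""
    return mean(estimates) - truth


def variance(estimates):
    """Empirical variance about the average estimate (divisor = number of estimates)."""
    centre = mean(estimates)
    return sum((e - centre) ** 2 for e in estimates) / len(estimates)


def mse(estimates, truth):
    """Empirical mean squared error about the true value."""
    return sum((e - truth) ** 2 for e in estimates) / len(estimates)


# Three illustrative estimates of a quantity whose true value is 10.
estimates = [9.0, 11.0, 13.0]
decomposed = variance(estimates) + bias(estimates, 10.0) ** 2
assert abs(mse(estimates, 10.0) - decomposed) < 1e-12
```

In this small example the bias is 1, the variance is 8/3 and the MSE is 11/3, so an estimator can have a sizeable MSE through either component.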
The relative efficiency of two estimators is the ratio of their efficiencies. If $\hat{\theta}_1$ and $\hat{\theta}_2$ are both unbiased estimators of $\theta$, then the efficiency of $\hat{\theta}_2$ relative to $\hat{\theta}_1$ is the ratio $\operatorname{Var}(\hat{\theta}_2) / \operatorname{Var}(\hat{\theta}_1)$. If this is less than 1, then $\operatorname{Var}(\hat{\theta}_2) < \operatorname{Var}(\hat{\theta}_1)$, so $\hat{\theta}_2$ has a smaller variance than $\hat{\theta}_1$ and is preferred. Finally, a confidence interval is a range of values that acts as a good estimate of the unknown population parameter. The best performing confidence interval is one whose coverage rate is close to the nominal level and whose length is small.
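These summary measures are simple to compute from simulation output. The sketch below is illustrative only: the function names are hypothetical, the ratio is taken with the candidate estimator in the numerator so that values below 1 favour it, and the interval is assumed to be the usual normal-approximation interval with `z = 1.96` for 95% confidence.

```python
import math


def relative_efficiency(mse_candidate, mse_reference):
    """Ratio below 1 means the candidate estimator is the more efficient one."""
    if mse_reference <= 0.0:
        raise ValueError("reference MSE must be positive")
    return mse_candidate / mse_reference


def confidence_interval(estimate, variance, z=1.96):
    """Normal-approximation interval (assumed form): estimate +/- z * sqrt(variance)."""
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width


def coverage_rate(intervals, true_total):
    """Fraction of intervals that contain the true total."""
    hits = sum(1 for lower, upper in intervals if lower <= true_total <= upper)
    return hits / len(intervals)


def average_length(intervals):
    """Mean width of the intervals; smaller is better at a given coverage."""
    return sum(upper - lower for lower, upper in intervals) / len(intervals)
```

Coverage and length should be read together: a short interval is only an advantage if its coverage stays near the nominal 95%.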

The Absolute Bias
The biases for the different estimators are summarised in Table 3. In all the populations considered, the Horvitz-Thompson estimator performed worst, with large biases compared to the other three finite population total estimators. The bias of the Local Linear estimator is generally much lower than those of the other three estimators. For all the biases computed, the Local Linear Regression estimator dominates the Horvitz-Thompson estimator and the Linear Regression estimator in all the populations. The Local Linear estimator also dominates the Dorfman estimator in all the populations except when the population is quadratic.

The Mean Squared Error (MSE)
The MSEs for the different estimators are summarised in Table 4. Generally, the estimator with the smallest MSE is regarded as the most efficient one. The Local Linear Regression estimator is more efficient and performs better than the Horvitz-Thompson and Dorfman estimators, regardless of whether the model is correctly specified or misspecified. The Local Linear estimator also outperforms the Linear Regression estimator in all the populations except when the population is linear. The Local Linear Regression estimator is not only superior to the popular Kernel Regression estimators, but it is also the best among all linear smoothers, including those produced by orthogonal series and spline methods. In general, Local Linear estimation removes a bias term from the kernel estimator, which gives it better behavior near the boundary of the $x_i$ values and a smaller MSE everywhere.
Table 5 examines the robustness of the various estimators, i.e. the Horvitz-Thompson (HT) estimator, the linear regression (REG) estimator and the Dorfman (DORF) estimator, versus the proposed Local Linear estimator. The results in the table show that the relative efficiency of the proposed Local Linear estimator to the Horvitz-Thompson estimator, the REG estimator and the Dorfman estimator is less than 1. This implies that the proposed Local Linear estimator has a smaller variance than the three estimators, and thus the three estimators are less efficient than the Local Linear estimator. Generally, the Local Linear estimator outperforms the HT estimator, the REG estimator and the DORF estimator in all the populations. The Local Linear estimator is therefore robust and the most efficient estimator.

The Confidence Intervals and Their Average Length
The confidence intervals and the average length of the intervals are also measured for each case. A smaller length is better because it implies that the true population total is captured within a smaller range and therefore the results are more precise. The confidence intervals generated by the model based Local Linear method are much tighter than those generated by the design based Horvitz-Thompson method, regardless of whether the model is correctly specified or misspecified. The confidence intervals also indicate that the Local Linear method dominates the REG and Dorfman methods when the model is incorrectly specified. Generally, the model based estimators are far better than the traditional design based estimators. The results show that the model based approach outperforms the design based approach at the 95% coverage rate. The biases under the model based approach are also much lower than those for the design based approach in the different populations.

Conclusion
In this paper, a model based estimator of the finite population total has been constructed using the procedure of Local Linear regression. The Local Linear regression estimator has been derived and its robustness properties studied. Results for the bias, mean squared error, relative efficiency, confidence intervals and average length of confidence intervals for the various estimators have been provided.
The bias results show that the Local Linear estimator dominates the Horvitz-Thompson estimator for the linear, quadratic, bump and jump populations. The MSE results show that the Local Linear estimator performs better than the Horvitz-Thompson and Dorfman estimators, irrespective of whether the model is correctly specified or misspecified. Results further indicate that the confidence intervals generated by the model based Local Linear procedure are much tighter than those generated by the design based Horvitz-Thompson method, regardless of whether the model is correctly specified or misspecified. It has been observed that the model based approach outperforms the design based approach at the 95% coverage rate.
Generally, the Local Linear Regression estimator is not only superior to the popular kernel regression estimators, but it is also the best among all linear smoothers, including those produced by orthogonal series and spline methods. The estimator adapts well to bias problems at boundaries and in regions of high curvature, and it does not require the smoothness and regularity conditions required by other methods such as boundary kernels. Simulation experiments comparing the proposed Local Linear regression estimator with some estimators that exist in the literature indicate that the proposed estimator is robust and is the most efficient of the estimators considered.