Exploring Data-Reflection Technique in Nonparametric Regression Estimation of Finite Population Total: An Empirical Study

In survey sampling, statisticians often estimate population parameters. This can be done using several approaches, including the design-based, model-based, model-assisted and randomization-assisted model-based approaches. In this paper, regression estimation under the model-based approach is studied. In regression estimation, researchers can opt to use a parametric or a nonparametric technique. Because of the challenges that arise from model misspecification in parametric regression, nonparametric regression has become popular in the recent past, and it is the type of regression estimation explored in this paper. Kernel estimation usually forms an integral part of this type of regression, and a number of kernel functions are available for such use. The goal of this study is to compare the performance of two nonparametric regression estimators (the finite population total estimator due to Dorfman (1992) and the proposed finite population total estimator that incorporates the reflection technique in modifying the kernel smoother), the ratio estimator and the design-based Horvitz-Thompson estimator. To achieve this, data were simulated using a number of commonly used models. From these data, the estimators were assessed using their conditional biases. Confidence intervals were also constructed with a view to determining the better estimator among those studied. The findings indicate that the proposed nonparametric estimator of the finite population total, which uses the data-reflection technique, performs better in the context of the analysis done.


Introduction
Many nonparametric techniques have been used in regression estimation in the recent past. They include the k-nearest neighbors, local polynomial regression, spline regression and orthogonal series [9,19]. In addition, many statisticians have endeavoured to modify the conventional Nadaraya-Watson estimator in an attempt to correct the boundary bias it induces. These modifications include the estimators of Gasser and Müller [13] and Priestley and Chao (1972). The drawback of these techniques is that they manage the bias component at the expense of higher variability. In the framework of the model-based approach, regression estimation is paramount in obtaining estimates for the non-sampled part of the population. The flexible nature of the nonparametric technique has made it an attractive option in statistical research [6]. The technique entails the use of kernel smoothers that assign weights to the observations used in estimation. In this paper we explore reflection as yet another way of modifying the kernel smoothers, with a view to minimizing the boundary bias, which is the shortcoming of the Nadaraya-Watson estimator.
This paper is organized as follows. In section 2, we give a brief review of the literature on nonparametric regression. In section 3, a new nonparametric regression estimator for the finite population total is proposed. The estimator, whose properties are stated, makes use of a modified kernel smoother obtained through the reflection of data technique. An empirical analysis is carried out in section 4 using artificially simulated datasets. A discussion of the results and the conclusion are given in section 5.

Literature Review
A model-based nonparametric model $\xi$ is conventionally of the form:
$$Y_i = m(X_i) + e_i,$$
where $Y_i$ is the variable of interest, $X_i$ is the auxiliary variable, $m$ is an unknown function to be determined using sample data, and $e_i$ is the error term, assumed to be $N(0, \sigma^2)$.
In nonparametric regression estimation, $m(X_i)$ is an unknown function and is therefore determined from the sampled data. Many estimators of it have been developed by statisticians. They include the famous Nadaraya-Watson estimator, which many have attempted to modify because of its weakness at the boundary. These modifications can be found in Eubank [11] and Gasser and Müller [13].
A simple kernel estimator at an arbitrary point x was presented by Priestley and Chao (1972). In it, h is the bandwidth, sometimes referred to as the tuning parameter or window width, and K(.) denotes a kernel function that is twice continuously differentiable, symmetric and supported on the bounded interval [-1, 1]. For the derivation of the asymptotic bias and variance terms, see Kyung-Joon and Schucany [15]. Because the bias increases with the bandwidth, a small bandwidth reduces it. Decreasing the bandwidth, however, increases the variance and makes the regression curve wiggly. An optimal bandwidth that minimizes the mean square error (MSE) is therefore necessary. Although such a bandwidth can be obtained using calculus, it has not provided a solution to the boundary problem.
Following this, Gasser and Müller [13] proposed optimal boundary kernels to address the problem. They suggested multiplying the truncated kernel at the boundary by a linear function. A generalized jackknife approach was proposed by Rice [16]. Eubank and Speckman [12] suggested the use of a "bias reduction theorem" to remove the boundary effects. Schuster [18] gave another technique for correcting the boundary bias by using the reflection of data method in density estimation. The same idea has also been reviewed by Albert and Karunamuni [1], among others, but notably within density estimation. This technique is examined further in this paper, but in the context of regression estimation. It is applied in estimating the finite population total, and its performance is analysed against other known estimators. One of these is the ratio estimator, given by:
$$\hat{T}_R = \frac{\sum_{i \in s} Y_i}{\sum_{i \in s} X_i} \sum_{i=1}^{N} X_i,$$
where s denotes the sample and N the population size. It is a Best Linear Unbiased Predictor (BLUP); see Cochran [7], Cox [8] and Brewer [5].
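For reference, the quantities discussed in this review can be written in their standard textbook forms. The block below is a sketch under stated assumptions, not a reproduction of the displayed equations of the cited papers: it assumes a fixed, ordered design $x_1 < \dots < x_n$ on $[0, 1]$ with $x_0 = 0$, an interior point $x$, errors with variance $\sigma^2$, and the symbols $\mu_2(K)$, $R(K)$ and $h_{opt}$, which are notation introduced here.

```latex
\begin{align*}
% Priestley-Chao estimator for ordered design points
\hat{m}(x) &= \frac{1}{h}\sum_{i=1}^{n}(x_i - x_{i-1})\,
              K\!\left(\frac{x - x_i}{h}\right)Y_i \\
% Equally spaced design, x_i - x_{i-1} = 1/n
\hat{m}(x) &= \frac{1}{nh}\sum_{i=1}^{n}
              K\!\left(\frac{x - x_i}{h}\right)Y_i \\
% Usual conditions on a symmetric kernel supported on [-1, 1]
\int_{-1}^{1}K(u)\,du &= 1, \qquad
\int_{-1}^{1}uK(u)\,du = 0, \qquad
\mu_2(K) = \int_{-1}^{1}u^{2}K(u)\,du \neq 0 \\
% Leading bias and variance terms at an interior point
\operatorname{Bias}\{\hat{m}(x)\} &\approx \frac{h^{2}}{2}\,m''(x)\,\mu_2(K), \qquad
\operatorname{Var}\{\hat{m}(x)\} \approx \frac{\sigma^{2}}{nh}\,R(K), \qquad
R(K) = \int_{-1}^{1}K^{2}(u)\,du \\
% Bandwidth minimising the leading MSE terms
h_{opt} &= \left(\frac{\sigma^{2}R(K)}{n\,\{m''(x)\}^{2}\,\mu_2(K)^{2}}\right)^{1/5}
\end{align*}
```

The last line follows from setting the derivative of $\frac{h^4}{4}\{m''(x)\}^2\mu_2(K)^2 + \frac{\sigma^2 R(K)}{nh}$ with respect to $h$ to zero. Near the boundary the kernel window is truncated, the first-moment condition no longer holds, and the bias becomes of order $h$, which is why an MSE-optimal bandwidth alone does not remove the boundary problem.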
Another approach to estimation is the design-based estimator suggested by Horvitz and Thompson [14], given by:
$$\hat{T}_{HT} = \sum_{i \in s} \frac{Y_i}{\pi_i},$$
where the sum is over the sampled units and $\pi_i$ is the inclusion probability of unit i. The nonparametric regression estimator proposed by Dorfman [10] for the finite population total is:
$$\hat{T}_{NW} = \sum_{i \in s} Y_i + \sum_{j \notin s} \hat{m}(x_j),$$
where $\hat{m}(x)$ is the Nadaraya-Watson estimator. As noted above, this estimator suffers from boundary effects. Even with that weakness, nonparametric techniques in regression estimation have been known to outperform their counterparts, the fully parametric and semiparametric techniques. Dorfman [10] compared the population total estimators constructed from the design-based Horvitz-Thompson estimator and from the Nadaraya-Watson nonparametric regression estimator, and found that the nonparametric regression estimator better reflects the structure of the data and hence yields greater efficiency. This regression estimator, however, suffered from the so-called boundary bias, besides facing bandwidth selection challenges. Breidt and Opsomer [3] did a similar study on nonparametric regression estimation of the finite population total under two-stage sampling. Their study also revealed that nonparametric regression using the local polynomial regression technique dominated the Horvitz-Thompson estimator and greatly improved on the Nadaraya-Watson estimator. Breidt et al. [4] estimated the finite population total under a two-stage sampling procedure, and their results also show that nonparametric regression estimation is superior to the standard parametric estimators when the model regression function is incorrectly specified, while being nearly as efficient when the parametric specification is correct.
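The three comparison estimators of the population total can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the code used in the study: the Gaussian kernel, the function names and the argument names are all hypothetical choices, and the bandwidth h is taken as given.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the smoothing kernel (an assumption)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def nw_smoother(x, x_s, y_s, h):
    """Nadaraya-Watson estimate of m(x) from the sample values (x_s, y_s)."""
    w = gaussian_kernel((x - np.asarray(x_s, dtype=float)) / h)
    return float(np.dot(w, y_s) / w.sum())

def dorfman_total(x_s, y_s, x_nonsample, h):
    """Observed sample total plus smoothed predictions for the non-sampled units."""
    predicted = sum(nw_smoother(x, x_s, y_s, h) for x in x_nonsample)
    return float(np.sum(y_s)) + predicted

def ratio_total(x_s, y_s, x_population_total):
    """Ratio estimator: sample ratio of totals scaled by the known total of x."""
    return float(np.sum(y_s) / np.sum(x_s) * x_population_total)

def horvitz_thompson_total(y_s, inclusion_prob):
    """Horvitz-Thompson estimator: sample values weighted by 1 / pi_i."""
    y_s = np.asarray(y_s, dtype=float)
    inclusion_prob = np.asarray(inclusion_prob, dtype=float)
    return float(np.sum(y_s / inclusion_prob))
```

With a compactly supported kernel the weights in `nw_smoother` can all be zero at a point far from the sample, so a practical version would need to guard against division by zero.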
We also propose an estimator under this nonparametric regression in the model-based framework.

Proposed Estimator
The proposed estimator of the finite population total takes the form:
$$\hat{T}_{ref} = \sum_{j \notin s} \hat{m}(x_j) + \sum_{i \in s} Y_i,$$
where the first term is the non-sample total, which is estimated nonparametrically using the reflection technique, and the second term is the total of the observed sample values. The data-reflection technique supplies data on the negative axis through reflection, thereby giving the kernel the information it requires in that region.

Data Reflection Procedure
The following simple steps describe how the reflection of data is done. Let $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$ be the set of n observations in the sample. The data are augmented by adding the reflections of all the points in the boundary, to give the set $\{(X_1, Y_1), (-X_1, Y_1), (X_2, Y_2), (-X_2, Y_2), \ldots, (X_n, Y_n), (-X_n, Y_n)\}$. If a kernel estimate $m^*(x)$ is constructed from this data set of size 2n, then an estimate based on the original data is obtained by putting
$$\hat{m}(x) = 2m^*(x) \quad \text{for } x \geq 0,$$
and zero otherwise. This gives the modified general weight function. It can be shown that the estimate will always have zero derivative at the boundary, provided the kernel is symmetric and differentiable. It is also shown, in the section on the properties of the data-reflected technique, that the estimate is a p.d.f. for a symmetric kernel. In practice it will not usually be necessary to reflect the whole data set: if $X_i/h$ is sufficiently large, the reflected point $-X_i/h$ will not be felt in the calculation of $m^*(x)$ for $x > 0$, and hence reflecting the points near 0 is all that is needed. Silverman [17], in his example, states that if K is the Gaussian kernel there is no practical need to reflect points beyond $X_i > 4h$.
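The reflection steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a boundary at 0, a Gaussian kernel, and kernel weights normalised to sum to one (Nadaraya-Watson style), in which case the factor 2 cancels between numerator and denominator. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density: symmetric and differentiable, as the text requires."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def reflect_sample(x_s, y_s):
    """Augment the n sample points with their reflections (-X_i, Y_i) about 0."""
    x_s = np.asarray(x_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    return np.concatenate([x_s, -x_s]), np.concatenate([y_s, y_s])

def reflected_smoother(x, x_s, y_s, h):
    """Kernel regression estimate at x >= 0 built from the 2n reflected points.

    The weights are normalised to sum to one, which is an assumption of this
    sketch rather than something stated in the text.
    """
    x_aug, y_aug = reflect_sample(x_s, y_s)
    w = gaussian_kernel((x - x_aug) / h)
    return float(np.dot(w, y_aug) / w.sum())

def plain_smoother(x, x_s, y_s, h):
    """Unreflected Nadaraya-Watson estimate, kept for comparison."""
    w = gaussian_kernel((x - np.asarray(x_s, dtype=float)) / h)
    return float(np.dot(w, y_s) / w.sum())
```

Away from the boundary the reflected points receive negligible weight, so `reflected_smoother` and `plain_smoother` agree there and differ only for x within a few bandwidths of 0. This is the practical point behind reflecting only the points near the boundary.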

Asymptotic Properties of the Proposed Estimator
The asymptotic bias and variance of the proposed estimator can be derived along the lines of Albers [2], who gives a similar derivation for density estimation.

Empirical Study
To examine the performance of the proposed estimator, data were simulated from various common models, and the estimators were compared on the basis of their confidence interval lengths and conditional biases. Table 1 gives the models used in the simulation.

Unconditional 95% C.I for the Respective Population Total Estimators
The 95% confidence interval of each of the estimators was also computed using the formula
$$T = \hat{T} \pm Z_{\alpha/2}\sqrt{\operatorname{Var}(\hat{T})},$$
and the interval length is therefore the difference between the upper limit and the lower limit. The results are presented in Table 2. Notice that the confidence interval lengths given by the proposed estimator in the first column are the shortest of all, except for the ratio estimator under the linear model.
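The interval and its length can be computed as in the short Python sketch below. It assumes that an estimate of the total and an estimate of its variance are already available; how the variance was obtained in the study is not restated here, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def normal_confidence_interval(estimate, variance, level=0.95):
    """Return (lower, upper) for estimate +/- z_{alpha/2} * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 when level = 0.95
    half_width = z * math.sqrt(variance)
    return estimate - half_width, estimate + half_width

def interval_length(lower, upper):
    """Length of the interval: upper limit minus lower limit."""
    return upper - lower
```

Because the length equals twice the half-width, comparing interval lengths across estimators at a fixed level amounts to comparing their estimated standard errors.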

Conditional Performance of the Respective Population Total Estimators
The conditional performance of the estimators was studied with respect to the sample means. The figures show that the proposed estimator is better placed than the other estimators examined, in that it posts a smaller conditional bias.
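One common way to examine conditional bias in simulation studies of this kind is to sort the simulated samples by their sample mean of the auxiliary variable, split them into groups, and average the estimation error within each group. The Python sketch below follows that general practice; whether the study used exactly this grouping is an assumption, and the function and argument names are hypothetical.

```python
import numpy as np

def conditional_bias(sample_means_x, estimates, true_total, n_groups):
    """Average estimation error within groups of samples ordered by sample mean of x.

    sample_means_x : one sample mean of the auxiliary variable per simulated sample
    estimates      : the estimated population total from each simulated sample
    true_total     : the known population total of the simulated population
    n_groups       : number of groups to form (a tuning choice, assumed here)
    """
    sample_means_x = np.asarray(sample_means_x, dtype=float)
    errors = np.asarray(estimates, dtype=float) - true_total
    order = np.argsort(sample_means_x)          # samples sorted by sample mean of x
    groups = np.array_split(order, n_groups)    # nearly equal-sized groups
    group_mean_x = np.array([sample_means_x[g].mean() for g in groups])
    group_bias = np.array([errors[g].mean() for g in groups])
    return group_mean_x, group_bias
```

Plotting `group_bias` against `group_mean_x` for each estimator gives a curve of the kind usually shown in such figures, and an estimator whose curve stays close to zero is close to conditionally unbiased.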

Conclusion
The proposed estimator of the finite population total that uses the reflection technique yields shorter confidence intervals than the other estimators considered in the study. Shorter 95% confidence intervals are characteristic of a more precise estimator.
Further, the graphs given in the figures above show that the proposed estimator outperforms the others and is almost conditionally unbiased.
It can therefore be concluded, based on the analysis done in this study, that the reflection technique can be of benefit in correcting the boundary bias usually experienced with the use of kernel estimators in regression estimation.