Non-parametric Estimator for a Finite Population Total Based on Edgeworth Expansion

In survey sampling, the main objective is to make inferences about population parameters using sample statistics. In this study, a nonparametric estimator of the finite population total is proposed and its coverage probabilities are explored using the Edgeworth expansion. Three properties of the proposed estimator are studied: unbiasedness, efficiency and the confidence interval. There is a large literature on the unbiasedness and efficiency of estimators of the finite population total, so this study focuses more on the confidence interval and coverage probability. The bias and MSE are studied partly analytically, followed by an empirical study of these two properties and of the confidence interval of the proposed estimator. Based on the empirical study with simulations in R, the proposed estimator resulted in smaller bias and MSE compared to the nonparametric estimator due to [6], the design-based Horvitz-Thompson estimator and the model-based ratio estimator. Further, the confidence interval of the proposed estimator is tighter than those of the other three estimators considered in this study, and it has higher coverage probabilities.


Introduction
In estimating a population parameter such as a mean or a variance, a measure of the precision of the estimate is essential. The most commonly reported measure of precision is a function of the variance (or its square root, the standard error). The variance of the estimator is therefore routinely estimated, since the precision of the estimator is measured by the inverse of its variance [9]. In the estimation of the finite population total, misspecification of the model can lead to serious errors in inference, especially with regard to the non-sampled part of the population. In the recent past, efforts have been made to explore alternative ways to attenuate these errors. These include the use of nonparametric regression to develop robust estimators in finite population sampling [11].
Nonparametric estimators have been found to be robust and more precise than their parametric counterparts. It is known, for instance, that a linear regression estimate will produce a large error for every sample size if the true underlying regression function is not linear and cannot be well approximated by linear functions [12].
The nonparametric regression estimator of a finite population total is a potent rival to familiar design-based estimators. It has the quality of automaticity associated with design-based estimators, but can better reflect the actual structure of the data, yielding greater efficiency [7]. It can be costly in computing power, and will probably not do as well as a parametric model-based estimator when the modelling process is done carefully. Further research is needed on how satisfactory the resulting confidence intervals of the estimator are [6].

Statement of the Problem
As long as populations are large, detail is expensive [4]. In most studies, the sample information is used to estimate the population characteristics. The choice of model can lead to misspecification, especially with regard to the use of auxiliary information for the non-sampled part of the population. A finite population total estimator that gives shorter confidence intervals and higher coverage probabilities, with the possibility of correcting errors due to skewness and kurtosis, remains unexplored.

Objectives of the Study
1. To propose a nonparametric estimator for a finite population total based on Edgeworth expansion.
2. To study the asymptotic properties of the proposed finite population total estimator.
3. To estimate the coverage probabilities for the proposed finite population total estimator.

Review of Nonparametric Estimation
Nonparametric regression has its origin in the exploration of data. Let $\{(X_i, Y_i)\}$, $i = 1, 2, \dots, n$, be a data set, which suggests a cloud of points. Nonparametric regression may basically mean drawing a line in the $X$-$Y$ plane through the cloud of points, showing the essential characteristics of the relationship between the variables Y and X. In survey sampling, there are four estimation approaches that can be used in statistical investigations: the design-based approach, the model-based approach, the model-assisted approach and the randomization-assisted approach [4].
The model-based approach has bridged the gap between finite population problems and the rest of statistics. Before the model-based approach, finite population sampling was an eccentric realm where many of the basic concepts and tools of statistics were curiously inapplicable. Statisticians skilled in designing experiments and in applying linear models to make inferences from experimental and observational data found that finite population problems were apparently beyond the scope of their techniques [5].
Although there were some familiar-looking formulas, such as the linear regression estimator, these statistics lacked the familiar rationale and properties. Not only was the linear regression estimator biased and therefore certainly not a Best Linear Unbiased Estimator (BLUE), it was not even linear, because the random choice of observation points turned the denominator of the estimated slope into a random variable.
In the model-based approach, the distribution is a structure that is defined by the population itself and is unknown but can be modelled. In this prediction approach, the expectations are over all possible realizations of a linear regression stochastic model linking a variable of interest Y with a set of auxiliary variables, X [1]. The values of the variable Y are believed to be random variables $Y_1, Y_2, \dots, Y_N$ generated by some model. The actual observations for the finite population, $y_1, y_2, \dots, y_N$, are one realization of the random variables. The auxiliary information links the units in the sample to those not in the sample.
The information obtained from the sample is used to predict the values of the non-sampled observations. In this study, it is assumed that Y is a function of X, hence a model of the form $Y_i = m(X_i) + e_i$ (1) is used. It is further assumed that the $e_i$ are error terms which are independently and identically normally distributed with $E(e_i) = 0$ and $\mathrm{Var}(e_i) = \sigma^2$. An appropriate model-based estimator of the finite population total is of the form $\hat{T} = \sum_{i \in s} Y_i + \sum_{i \notin s} \hat{Y}_i$. A related nonparametric model-assisted regression estimator was obtained by replacing local polynomial smoothing with penalized splines, and local polynomial nonparametric regression estimation was extended to two-stage sampling. In that work, simulation results indicate that the nonparametric estimator dominates standard parametric estimators when the model regression function is incorrectly specified, while being nearly as efficient when the parametric specification is correct [3].
The application of nonparametric regression to the estimation of the finite population error variance for a given sample drawn from the population was also considered [11]. The error variance obtained was a function of the $\sigma^2(x_i)$, which are unknown. By considering the squared residual and using some mild assumptions, the study showed that

Local Polynomial Regression
Local polynomial regression was also considered in the estimation of finite population totals. In this research, the model $Y_i = m(x_i) + e_i$ was considered and the technique of using a strip of data around the covariate was applied in order to fit a line through the set of data $(x_i, y_i)$. The estimator yielded better results in estimating the finite population total. Further, the estimator was found to be asymptotically unbiased, consistent and normally distributed when certain conditions were satisfied [12].
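The idea of fitting a line through a strip of data around each covariate value can be sketched as a local linear smoother plugged into a prediction-type total. This is a minimal illustration, not the estimator of [12]: the Gaussian kernel, the `bandwidth` argument and the function names are assumptions made here for the example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as an assumed kernel weight."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def local_linear_fit(x0, xs, ys, bandwidth):
    """Fit a kernel-weighted least-squares line around x0 and evaluate it at x0."""
    w = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    t0 = sum(wi * y for wi, y in zip(w, ys))
    t1 = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    denom = s0 * s2 - s1 * s1
    if abs(denom) < 1e-12:
        # Degenerate design near x0: fall back to a local constant fit.
        return t0 / s0
    # Intercept of the weighted line, i.e. the fitted value at x0.
    return (s2 * t0 - s1 * t1) / denom


def predicted_total(sample_x, sample_y, nonsample_x, bandwidth):
    """Observed sample sum plus smoothed predictions for the non-sampled units."""
    observed = sum(sample_y)
    predicted = sum(
        local_linear_fit(x0, sample_x, sample_y, bandwidth) for x0 in nonsample_x
    )
    return observed + predicted
```

A local linear fit reproduces an exactly linear relationship, which gives a simple sanity check; in practice the bandwidth would need to be chosen from the data.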

Use of Jackknife and Bootstraps in Estimation
The jackknife and the bootstrap are resampling-based estimation procedures. The jackknife, for example, can be used in many situations, since its bias is asymptotically smaller than that of the biased estimator it corrects. However, the method is inappropriate for correlated data or time series data: it assumes independent and identically distributed data points, and if that assumption is violated the results will be of no use. Another important condition is that the jackknife estimate is built from a linear operation (subtraction) and hence will only work properly for linear functions of the data and/or parameters, or for functions that are smooth enough to be treated as continuous without much difficulty [8].
The bootstrap does not reveal the skewness and is hence unable to correct the resulting errors. The bootstrap method is also computationally intensive and produces confidence intervals with lower coverage rates [13].
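The jackknife bias correction mentioned above can be illustrated with a small sketch. The function names are hypothetical, and the example statistic (the plug-in variance with divisor n) is chosen here only because its jackknife-corrected version is known to coincide with the usual unbiased sample variance.

```python
import statistics


def plug_in_variance(data):
    """Biased variance estimate that divides by n instead of n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)


def jackknife_bias_correct(data, statistic):
    """Return (bias-corrected estimate, jackknife bias estimate)."""
    n = len(data)
    theta_full = statistic(data)
    # Recompute the statistic with each observation left out in turn.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    theta_dot = sum(leave_one_out) / n
    bias = (n - 1) * (theta_dot - theta_full)
    return theta_full - bias, bias


values = [2.0, 4.0, 4.0, 5.0, 7.0]
corrected, bias = jackknife_bias_correct(values, plug_in_variance)
# For this statistic, corrected matches statistics.variance(values).
```

The leave-one-out step treats the observations as exchangeable, which is why the procedure is not suitable for correlated or time series data.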

Let $X_1, X_2, \dots, X_n$ be independent and identically distributed (iid) random variables with mean $\mu$ and variance $\sigma^2$. The characteristic function of their standardized sum is then obtained and expanded using the binomial expansion together with (4) and (6). By defining the inversion of a function and incorporating the characteristic function, a function S results.
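For orientation, the standard textbook form of the two-term Edgeworth expansion for a standardized mean of iid variables is given here. The notation ($S_n$, $\kappa_3$, $\kappa_4$, $p_1$, $p_2$) is the usual one and is an assumption of this note; it is not necessarily the notation or the exact form used in the derivation above.

```latex
\[
S_n = \frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma}, \qquad
P(S_n \le x) = \Phi(x) + n^{-1/2} p_1(x)\,\phi(x) + n^{-1} p_2(x)\,\phi(x) + o(n^{-1}),
\]
\[
p_1(x) = -\tfrac{1}{6}\,\kappa_3\,(x^2 - 1), \qquad
p_2(x) = -x\left\{ \tfrac{1}{24}\,\kappa_4\,(x^2 - 3)
        + \tfrac{1}{72}\,\kappa_3^2\,(x^4 - 10x^2 + 15) \right\},
\]
```

Here $\Phi$ and $\phi$ are the standard normal distribution and density functions, $\kappa_3$ is the skewness and $\kappa_4$ the excess kurtosis of $(X_i - \mu)/\sigma$. The $n^{-1/2}$ term corrects for skewness and the $n^{-1}$ term for kurtosis and squared skewness.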

The Proposed Estimator
Let T be the population total, defined as the sum of the values of all the population measurements. Let the random variable Y be the variable of interest and X an auxiliary variable associated with Y, assumed to be known for all the observable population units, such that $T = \sum_{i=1}^{N} Y_i$.
All the sampled units are observed and the task therefore is to estimate the non-sampled part of the population. The non-sampled part is estimated using the Edgeworth expansion.
Let $s$ be the sample from the population of $N$ units; then $T = \sum_{i \in s} Y_i + \sum_{i \notin s} Y_i$. For the sum $\sum_{i \notin s} Y_i$, consider the model $Y_i = m(X_i) + e_i$, where $m$ is an unknown smooth function that depends on the sample data and is estimated by $\hat{m}(X_i)$ for the non-sampled data points.
The nonparametric estimator of the finite population total is then proposed. Taking expectations on both sides of (10) gives the expected value of the estimator. To test for asymptotic normality, the behaviour of the estimator is considered as the sample size increases.

Simulation of Data
A population of size 1,500 was simulated for three study variables based on linear, quadratic and exponential mean functions.
The linear study variable is based on the linear model $Y_i = 1 + 2(X_i - 0.5) + e_i$. The second study variable was obtained using the quadratic mean function, $Y_i = 1 + 2(X_i - 0.5)^2 + e_i$. The third study variable was obtained from an exponential mean function. The auxiliary variable was assumed to be uniformly distributed on the interval [0, 1]. The error term is a standard normal variable, $e_i \sim N(0, 1)$.
A simple random sample of size 300 was selected from the simulated population index-wise, and the selection was replicated 1500 times, giving 1500 simple random samples. The proposed estimator was then compared to the nonparametric regression estimator due to [6], the design-based Horvitz-Thompson estimator and the ratio estimator using the amount of bias, the MSE and the coverage probabilities.
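The simulation design can be sketched as follows. The study itself used R; this Python sketch is only an illustration. It assumes the linear and quadratic mean functions are 1 + 2(x - 0.5) and 1 + 2(x - 0.5)^2, and it uses the textbook expansion form of the Horvitz-Thompson estimator under simple random sampling without replacement. All function names are hypothetical.

```python
import random


def linear_mean(x):
    return 1.0 + 2.0 * (x - 0.5)


def quadratic_mean(x):
    return 1.0 + 2.0 * (x - 0.5) ** 2


def simulate_population(size, mean_function, rng):
    """Uniform(0, 1) auxiliary values and study values with N(0, 1) errors."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(size)]
    ys = [mean_function(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def draw_samples(population_size, sample_size, replications, rng):
    """Index sets of repeated simple random samples without replacement."""
    return [rng.sample(range(population_size), sample_size)
            for _ in range(replications)]


def horvitz_thompson_total(sample_y, population_size):
    """HT estimator under SRSWOR, where every inclusion probability is n/N."""
    n = len(sample_y)
    return sum(y * population_size / n for y in sample_y)


def ratio_total(sample_x, sample_y, population_x_total):
    """Ratio estimator: sample ratio of totals times the known X total."""
    return sum(sample_y) / sum(sample_x) * population_x_total


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
pop_x, pop_y = simulate_population(1500, linear_mean, rng)
samples = draw_samples(1500, 300, 1500, rng)
first = samples[0]
ht = horvitz_thompson_total([pop_y[i] for i in first], 1500)
ratio = ratio_total([pop_x[i] for i in first], [pop_y[i] for i in first], sum(pop_x))
```

Sampling indices rather than values keeps each replicate's auxiliary and study values aligned, which matches the index-wise selection described above.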

Relative Bias of the Estimator
The relative bias of the estimator was computed from $T$, the actual population total, and $\hat{T}_r$, the estimate of the population total from the $r$-th sample, for $r = 1, 2, \dots, 1500$. From Table 1, the average relative biases are either negative or positive, indicating underestimation or overestimation respectively. For the linear function, the ratio estimator has the lowest bias, followed by the proposed estimator, showing that the model-based ratio estimator is the best. This is because the ratio estimator is the Best Linear Unbiased Estimator (BLUE). For the quadratic function, the proposed estimator outperforms all the other three estimators, and the same applies to the exponential function. It is also observed from the simulated data, particularly for the quadratic and exponential functions, that most of the estimates obtained using the Dorfman estimator and the ratio estimator had slightly larger biases in most of the data models.

Mean Squared Error (MSE) of the Estimator
The MSEs were computed for the three data sets as $\mathrm{MSE}(\hat{T}) = \frac{1}{1500}\sum_{r=1}^{1500} (\hat{T}_r - T)^2$ and then compared. The results are summarized in Table 2. From Table 2, for the linear mean function, the ratio estimator performed best, followed by the proposed estimator. This is because the ratio estimator is the Best Linear Unbiased Estimator (BLUE). For the quadratic function, the proposed estimator performed best, with the ratio estimator having the largest value, attributable to the fact that the ratio estimator, though BLUE, is unstable for other mean functions. For the exponential function, the design-based Horvitz-Thompson estimator and the model-based ratio estimator have larger values, showing that the proposed nonparametric regression estimator of the finite population total is the best of the four, followed by the nonparametric regression estimator by [6].
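The two summary measures can be computed from the replicated estimates as in the sketch below. The MSE follows the averaged squared error over replications; the relative bias is taken here as the average of (estimate - T)/T, which is an assumption of this illustration, as are the function names.

```python
def average_relative_bias(estimates, true_total):
    """Mean of (estimate - T) / T over the replications (assumed definition)."""
    return sum((t - true_total) / true_total for t in estimates) / len(estimates)


def mean_squared_error(estimates, true_total):
    """Mean of squared deviations of the replicated estimates from T."""
    return sum((t - true_total) ** 2 for t in estimates) / len(estimates)


toy_estimates = [90.0, 110.0, 105.0]  # made-up numbers, for illustration only
rb = average_relative_bias(toy_estimates, 100.0)
mse = mean_squared_error(toy_estimates, 100.0)
```

A negative average relative bias indicates underestimation on average and a positive one indicates overestimation.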

The 95% Confidence Interval Length
The uncertainty in using a point estimate is addressed by means of confidence intervals. Confidence intervals provide a range of values for the unknown population parameter along with the precision of the method.
The standard error is needed to construct the confidence interval, which gives a range expected to cover the parameter with a stated probability. A 95% confidence interval was therefore constructed. For the extent of coverage of the estimator, the coverage probability was explored more explicitly through an approximation in which $\Phi$ is the distribution function of the estimator, which depends on the characteristics of the variable and follows the standard normal distribution, and $O(\cdot)$ is an order term in the sample size $n$. The empirical results are tabulated in Table 3. In Table 3, for the linear function, the ratio estimator, being BLUE, has the shortest confidence interval, followed by the proposed estimator. The proposed nonparametric regression estimator of the finite population total has the shortest confidence interval length for the quadratic and exponential functions, showing that the proposed estimator outperforms the design-based Horvitz-Thompson and Dorfman's nonparametric estimators.
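To show how skewness and kurtosis enter a coverage approximation, the sketch below evaluates the textbook two-term Edgeworth approximation to the distribution of a standardized mean of iid variables with known variance. It is not the paper's estimator or its exact expansion; the function names and the arguments `skewness` and `excess_kurtosis` are assumptions of this illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def edgeworth_cdf(x, n, skewness, excess_kurtosis):
    """Two-term Edgeworth approximation to P(S_n <= x) for a standardized mean."""
    p1 = -skewness * (x ** 2 - 1.0) / 6.0
    p2 = -x * (
        excess_kurtosis * (x ** 2 - 3.0) / 24.0
        + skewness ** 2 * (x ** 4 - 10.0 * x ** 2 + 15.0) / 72.0
    )
    return STD_NORMAL.cdf(x) + STD_NORMAL.pdf(x) * (p1 / math.sqrt(n) + p2 / n)


def two_sided_coverage(z, n, skewness, excess_kurtosis):
    """Approximate P(-z <= S_n <= z); the n^(-1/2) skewness term cancels here."""
    upper = edgeworth_cdf(z, n, skewness, excess_kurtosis)
    lower = edgeworth_cdf(-z, n, skewness, excess_kurtosis)
    return upper - lower


# Example: a skewed parent distribution (skewness 2, excess kurtosis 6,
# as for an exponential distribution) with the study's sample size.
approx_coverage = two_sided_coverage(1.96, 300, 2.0, 6.0)
```

Because the first correction term is an even function of x, it cancels in a symmetric two-sided interval, so the coverage error of such an interval is driven by the order 1/n term, while one-sided coverage is affected at order n^(-1/2).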

Coverage Probabilities of the Estimator
The coverage probabilities of the proposed estimator were computed at the significance levels 0.01, 0.05 and 0.10, corresponding to the 99%, 95% and 90% confidence levels respectively. From Table 4, apart from the linear function, the proposed estimator has the highest conditional coverage probabilities for all the functions used in the study.
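An empirical coverage probability of this kind can be computed as the share of replications whose interval contains the true total. The sketch below uses normal-theory intervals for simplicity; the function name and the interval construction are assumptions of this illustration and do not reproduce the Edgeworth-based intervals of the study.

```python
from statistics import NormalDist


def empirical_coverage(estimates, standard_errors, true_total, alpha):
    """Fraction of replications whose (1 - alpha) interval covers the true total."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = sum(
        1
        for estimate, se in zip(estimates, standard_errors)
        if estimate - z * se <= true_total <= estimate + z * se
    )
    return hits / len(estimates)


# Made-up replicate results, for illustration only.
coverage_95 = empirical_coverage([100.0, 103.0, 90.0], [1.0, 1.0, 1.0], 100.0, 0.05)
```

Calling the function with alpha set to 0.01, 0.05 and 0.10 gives the empirical coverage at the 99%, 95% and 90% levels.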

Conditional Biases
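Since the estimation is model-based, the 1,500 simple random samples were grouped into groups of 50 so that there were 30 groups. For each group, the mean of the sample means, $\bar{\bar{x}} = \frac{1}{50}\sum_{j=1}^{50} \bar{x}_j$, was computed, and the group average of the estimates was also computed. The conditional bias for each group was computed as the difference between the group average of the estimates and $\bar{Y}$, where $\bar{Y}$ is the population mean of the survey measurements and $\bar{x}$ is the sample mean of the auxiliary variable.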
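The grouping step can be sketched as below. It assumes that the samples are first ordered by their sample mean of the auxiliary variable before being cut into consecutive groups, which is a common convention for conditional analyses but is not stated explicitly in the text; the function name is hypothetical.

```python
def conditional_bias_by_group(sample_x_means, estimates, true_value, group_size):
    """Return (mean of sample means, conditional bias) for each group of samples."""
    # Assumption: order the replications by their auxiliary sample mean.
    order = sorted(range(len(estimates)), key=lambda r: sample_x_means[r])
    groups = []
    for start in range(0, len(order), group_size):
        members = order[start:start + group_size]
        mean_of_means = sum(sample_x_means[r] for r in members) / len(members)
        mean_estimate = sum(estimates[r] for r in members) / len(members)
        groups.append((mean_of_means, mean_estimate - true_value))
    return groups


# Made-up values for four replications, grouped in pairs.
toy_groups = conditional_bias_by_group(
    [0.3, 0.1, 0.4, 0.2], [13.0, 11.0, 14.0, 12.0], 12.0, 2
)
```

With 1,500 replications and a group size of 50 this yields 30 points, which can be plotted with the mean of the sample means on the horizontal axis and the conditional bias on the vertical axis.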
Figures 1, 2 and 3 illustrate the behaviour of the conditional bias for each estimator under the three mean functions: Figure 1 for the linear mean function, Figure 2 for the quadratic mean function and Figure 3 for the exponential mean function. From Figure 1, the ratio estimator performed well when the linear mean function was used. This is attributed to the fact that the ratio estimator is the Best Linear Unbiased Estimator (BLUE). It can be observed that the biases to the left of the population mean of the auxiliary variable are large but gradually reduce towards the right. From Figure 2, where the quadratic mean function was used, the proposed estimator gives better estimates of the population total than the estimator proposed by [6], the ratio estimator and the design-based Horvitz-Thompson estimator. Again, the biases to the left of the population mean of the auxiliary variable are large but gradually reduce towards the right. From Figure 3, where the exponential mean function was used, the proposed estimator again gives better estimates of the population total than the estimator proposed by [6], the ratio estimator and the design-based Horvitz-Thompson estimator. As in Figures 1 and 2, the biases to the left of the population mean of the auxiliary variable are large but reduce gradually, almost symmetrically, towards the right.

Conditional MSEs
Just like the biases, conditional MSEs were determined in order to establish the robustness of the proposed estimator compared to the design-based, the ratio and the nonparametric Dorfman (Nadaraya-Watson) estimators. From Figure 4, the ratio estimator has the lowest MSE of all the estimators, which is attributed to the fact that the ratio estimator is BLUE. Although the nonparametric estimator proposed by Dorfman attains a minimum MSE at a mean of the sample means of around 0.49, the proposed estimator is the second-best estimator based on the MSE. From Figure 5, the proposed estimator outperformed the design-based Horvitz-Thompson, the model-based ratio and Dorfman's nonparametric estimators for both the quadratic and exponential functions.

Conditional Confidence Interval Lengths
The confidence intervals and coverage probabilities were the main asymptotic properties of the proposed estimator. Given that the proposed estimator is model-based, the conditional confidence interval lengths were also explored, as shown in Figures 6 and 7. From Figure 7, the proposed estimator using the Edgeworth expansion has the shortest confidence interval length, followed by the ratio estimator, with the design-based Horvitz-Thompson estimator having the longest confidence interval length. From both the unconditional and conditional confidence interval lengths, the proposed estimator is robust.

Conditional Coverage Properties
Based on the conditional confidence intervals, the coverage probabilities were computed for the 30 groups of samples. The coverage probability was based on the number of observations falling within the confidence interval relative to the total number of observations. The coverage properties of the estimators are shown in Figures 8-10. From Figures 8, 9 and 10, the proposed estimator outperformed all the other estimators except for the linear function. The ratio estimator, which is quite unstable for the quadratic function, performed best for the linear function, which could be attributed to the fact that it is BLUE.

Conclusion
A nonparametric estimator of the finite population total based on Edgeworth expansion is proposed. The proposed estimator gave a comparatively smaller bias and MSE and a confidence interval that was shorter and tighter than those of the other estimators considered in the study (the design-based Horvitz-Thompson estimator, the model-based ratio estimator and the nonparametric regression estimator due to [6]).
The application of Edgeworth expansion in computing coverage probabilities performed better than the traditional approach based on the central limit theorem and is therefore recommended for correcting errors that result from skewness and kurtosis.