Estimation of Population Total Using Spline Functions
Gladys Gakenia Njoroge
Department of Physical Sciences, Chuka University, Chuka, Kenya
To cite this article:
Gladys Gakenia Njoroge. Estimation of Population Total Using Spline Functions. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 5, 2015, pp. 396-403. doi: 10.11648/j.ajtas.20150405.20
Abstract: This study sought to estimate finite population total using spline functions. The emerging patterns from spline smoother were compared with those that were obtained from the model-based, the model-assisted and the non-parametric estimators. To measure the performance of each estimator, three aspects were considered: the average bias, the efficiency by use of the average mean square error and the robustness using the rate of change of efficiency. We used six populations: four natural and two simulated. The findings showed that the model-based estimator works very well in terms of efficiency while the model-assisted is almost unbiased when the model is linear and homoscedastic. However, the estimators break down when the underlying model assumptions are violated. The Kernel Estimator (Nadaraya-Watson) is found to be the most robust of the five estimators considered. Between the two spline functions that we considered, the periodic spline was found to perform better. The spline functions were found to provide good results whether or not the design points were uniformly spaced. We also found out that, under certain conditions, a smoothing spline estimator and a Kernel estimator are equivalent. The study recommends that both the ratio estimator and the local polynomial estimator should be used within the confines of a linear homoscedastic model. The Nadaraya-Watson and the periodic spline estimators, both of which are non-parametric, are highly robust. The Nadaraya-Watson however is even more robust than the periodic spline.
Keywords: Population Total, Estimator, Efficiency, Homoscedasticity, Robustness
1. Introduction
The name "spline function" was given by  to the piecewise polynomial functions known as univariate polynomial splines. This was because of their resemblance to the curves obtained by draftsmen using a mechanical spline: a thin flexible rod with a groove and a set of weights called "ducks", used to position the rod at the points through which it was desired to draw a smooth interpolating curve. The basic idea dates back at least to . More recent papers on the subject include [6, 12, 14], among others.
The available literature in statistics indicates that the approaches most often used in the estimation of a population total are the model-based, the design-based and the model-assisted approaches. The non-parametric approach has also gained ground, especially with works such as [5, 10] on Kernel estimation. Spline smoothing is another non-parametric approach to the estimation of a finite population total. However, little literature is available on this approach, and it has been applied to the estimation of population totals far less often than the other approaches. This study therefore sought to estimate the finite population total using spline functions, with the ratio estimator, the local polynomial estimator and Kernel functions serving as a numerical comparison to determine whether the spline estimates would be as accurate as those derived from the established approaches. To measure the performance of each estimator, we considered three aspects, namely: the bias, the efficiency by use of the average mean square error, and the robustness using the rate of change of efficiency.
2. The Estimators
2.1. Ratio Estimator (Model-Based)
The prediction approach is based on a model. Royall  summarizes the philosophy behind this approach. Suppose the number $N$ of units in the finite population is known and that with each unit $i$ is associated a number $y_i$. The general problem is to choose some of the units as a sample, observe the $y_i$'s for the sample units and then use those observations to estimate the value of some function of all the $y_i$'s in the population. The prediction approach treats the numbers $y_1, \dots, y_N$ as realized values of random variables $Y_1, \dots, Y_N$. After the sample has been observed, estimating entails predicting a function of the unobserved $y_i$'s. The relationships among the random variables, for both the auxiliary variable $x$ and the survey variable $y$, are expressed in a model. The general model is
$$Y_i = m(x_i) + e_i, \qquad i = 1, \dots, N, \qquad (1)$$
where $m(x_i)$ is the mean function and $e_i$ a random error term. After a sample has been selected and observed, the $y_i$'s for the sample units are known but the values for the non-sample units remain unknown. The ignorance of the non-sample values implies that some functions of those values must be mathematically predicted in order to have an estimator or predictor for the full population. Suppose the scatter diagram reveals that the sample points are clustered around a straight line passing through the origin. Then the ratios $y_i / x_i$ are more or less the same, and we may postulate the approximate relation
$$y_i \approx R\,x_i$$
for some constant $R$. Hence we can write $\bar{Y} \approx R\,\bar{X}$, from which we can suggest an estimator of the population mean $\bar{Y}$ as
$$\hat{\bar{Y}}_R = \frac{\bar{y}}{\bar{x}}\,\bar{X}, \qquad (3)$$
where $\bar{y}$ and $\bar{x}$ refer to the sample means of $y$ and $x$, respectively, and $\bar{X}$ is the population mean of $x$. The value of $\bar{X}$ is assumed to be known beforehand. This estimator in (3) is popularly known as the Ratio Estimator . The estimator of the population total using the model-based approach (prediction approach) thus becomes
$$\hat{T} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{y}_i, \qquad (4)$$
where $s$ denotes the sample and $\hat{y}_i$ is a predictor of the unobserved $y_i$.
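For the non-sample units we take $m(x_i) = \beta x_i$, where $m$ is linear and $\operatorname{Var}(e_i) = \sigma^2$, i.e. homoscedastic. Let $\hat{y}_i$ be the predictor of $y_i$ for the non-sample values, which is given as
$$\hat{y}_i = \frac{\bar{y}}{\bar{x}}\,x_i, \qquad (5)$$
with $\bar{y}$ and $\bar{x}$ the sample means of $y$ and $x$. Substituting equation (5) in (4) gives our estimate of the population total under Royall's prediction model,
$$\hat{T}_R = \sum_{i \in s} y_i + \frac{\bar{y}}{\bar{x}} \sum_{i \notin s} x_i,$$
where $s$ is the sample and $\hat{T}_R$ is the ratio estimator for the population total.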
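As a rough illustration of the prediction form of the ratio estimator, the sketch below adds the observed sample total to the predicted non-sample values. The function name `ratio_total` and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def ratio_total(y_sample, x_sample, x_nonsample):
    """Prediction-form ratio estimator of a population total.

    Observed sample total plus the predictions beta_hat * x_i
    for the non-sample units (assumes a positive sample total of x).
    """
    y_sample = np.asarray(y_sample, dtype=float)
    x_sample = np.asarray(x_sample, dtype=float)
    x_nonsample = np.asarray(x_nonsample, dtype=float)
    beta_hat = y_sample.sum() / x_sample.sum()  # equals ybar / xbar
    return y_sample.sum() + beta_hat * x_nonsample.sum()

# Toy population (hypothetical): y is exactly proportional to x,
# so the estimator should recover the true total.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 2.0 * x
in_sample = np.array([True, False, True, False, True, False])
t_hat = ratio_total(y[in_sample], x[in_sample], x[~in_sample])
true_total = y.sum()
```

When the data follow a line through the origin exactly, the estimate coincides with the true total; with noisy data the two differ by the prediction error on the non-sample units.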
2.2. The Local Polynomial Regression Estimator (Model-Assisted)
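Breidt and Opsomer  assumed that the population is generated by the super population model $y_i = m(x_i) + \varepsilon_i$, where $\{\varepsilon_i\}$ is an independent sequence of random variables with mean zero and variance $v(x_i)$, and $m(x)$ is a smooth function of $x$. They employed local polynomial smoothing techniques to obtain a model-assisted regression estimator for the finite population total. We consider a finite population of $N$ units with label set $U_N = \{1, \dots, N\}$. For each $i \in U_N$, an auxiliary variable $x_i$ is observed. A probability sample $s$ is drawn from $U_N$ according to a fixed-size sampling design $p_N(\cdot)$, where $p_N(s)$ is the probability of drawing the sample $s$. Let $n_N$ be the size of $s$. Assume
$$\pi_i = \Pr\{i \in s\} = \sum_{s \ni i} p_N(s) > 0 \quad \text{and} \quad \pi_{ij} = \Pr\{i, j \in s\} = \sum_{s \ni i, j} p_N(s) > 0 \quad \text{for all } i, j \in U_N.$$
The study variable $y_i$ is observed for each $i \in s$. The goal is to estimate the population total
$$t_y = \sum_{i \in U_N} y_i.$$
Let $I_i = 1$ if $i \in s$ and $I_i = 0$ otherwise. Then
$$E_p[I_i] = \pi_i,$$
where $E_p[\cdot]$ denotes expectation with respect to the sampling design, i.e. averaging over all possible samples from the finite population.
Using this notation, an estimator $\hat{t}_y$ of $t_y$ is said to be design-unbiased if $E_p[\hat{t}_y] = t_y$.
A well-known design-unbiased estimator of $t_y$ is the Horvitz-Thompson estimator,
$$\hat{t}_y = \sum_{i \in s} \frac{y_i}{\pi_i} = \sum_{i \in U_N} \frac{y_i I_i}{\pi_i}.$$
The variance of the Horvitz-Thompson estimator under the sampling design is
$$\operatorname{Var}_p(\hat{t}_y) = \sum_{i, j \in U_N} (\pi_{ij} - \pi_i \pi_j)\,\frac{y_i}{\pi_i}\,\frac{y_j}{\pi_j},$$
with the convention $\pi_{ii} = \pi_i$.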
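Design-unbiasedness can be checked numerically on a very small population by enumerating every possible sample. The sketch below does this under simple random sampling without replacement, the design used later in the empirical study; the four population values and the helper names are hypothetical.

```python
from itertools import combinations
import numpy as np

def ht_total(y_s, pi_s):
    """Horvitz-Thompson estimator: sampled y_i weighted by 1 / pi_i."""
    return float(np.sum(np.asarray(y_s, dtype=float) / np.asarray(pi_s, dtype=float)))

def ht_design_variance(y, pi, pi_joint):
    """Sum over i, j of (pi_ij - pi_i pi_j) (y_i / pi_i) (y_j / pi_j)."""
    z = np.asarray(y, dtype=float) / np.asarray(pi, dtype=float)
    delta = np.asarray(pi_joint, dtype=float) - np.outer(pi, pi)
    return float(z @ delta @ z)

# Hypothetical population of N = 4 units, samples of size n = 2.
y = np.array([3.0, 5.0, 8.0, 12.0])
N, n = len(y), 2
pi = np.full(N, n / N)                                   # first-order inclusion probabilities
pi_joint = np.full((N, N), n * (n - 1) / (N * (N - 1)))  # second-order probabilities
np.fill_diagonal(pi_joint, n / N)                        # convention pi_ii = pi_i

# Enumerate all equally likely samples and evaluate the estimator on each.
estimates = np.array([ht_total(y[list(s)], pi[list(s)])
                      for s in combinations(range(N), n)])
mean_over_samples = estimates.mean()   # design expectation of the estimator
var_over_samples = estimates.var()     # design variance by enumeration
var_formula = ht_design_variance(y, pi, pi_joint)
```

Averaging over all six samples reproduces the true total, and the enumerated variance agrees with the closed-form expression.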
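An estimator is proposed that is motivated by modelling the finite population of $y_i$'s, conditioned on the auxiliary variable $x_i$, as a realization from a super population $\xi$ in which $y_i = m(x_i) + \varepsilon_i$. Given $x_i$, $m(x_i) = E_\xi(y_i \mid x_i)$ is called the regression function, while $v(x_i) = \operatorname{Var}_\xi(y_i \mid x_i)$ is the variance function.
Let $K$ denote a continuous kernel function and let $h_N$ denote the bandwidth. We begin by defining the local polynomial Kernel estimator of degree $q$ based on the entire finite population. Let $y_U = [y_i]_{i \in U_N}$ be the $N$-vector of $y_i$'s in the finite population.
Define the $N \times (q+1)$ matrix $X_{Ui}$ as
$$X_{Ui} = \left[\, 1 \quad x_j - x_i \quad \cdots \quad (x_j - x_i)^q \,\right]_{j \in U_N}$$
and define the $N \times N$ matrix
$$W_{Ui} = \operatorname{diag}\left\{ \frac{1}{h_N} K\!\left( \frac{x_j - x_i}{h_N} \right) \right\}_{j \in U_N}$$
of Kernel weights, where $h_N$ is the smoothing parameter (bandwidth). Let $e_r$ represent a vector with a 1 in the $r$th position and 0 elsewhere. The local polynomial kernel estimator of the regression function at $x_i$, based on the entire finite population, is then given by
$$m_i = e_1' \left( X_{Ui}' W_{Ui} X_{Ui} \right)^{-1} X_{Ui}' W_{Ui}\, y_U,$$
which is well defined as long as $X_{Ui}' W_{Ui} X_{Ui}$ is invertible.
Since only the $y_i$ in $s$ are known, $m_i$ is replaced by a sample-based consistent estimator to make its calculation possible.
Let $y_s = [y_i]_{i \in s}$ be the $n$-vector of $y_i$'s obtained in the sample.
Define the $n \times (q+1)$ matrix
$$X_{si} = \left[\, 1 \quad x_j - x_i \quad \cdots \quad (x_j - x_i)^q \,\right]_{j \in s}$$
and define the $n \times n$ matrix
$$W_{si} = \operatorname{diag}\left\{ \frac{1}{\pi_j h_N} K\!\left( \frac{x_j - x_i}{h_N} \right) \right\}_{j \in s}.$$
A sample design-based estimator of $m_i$ is then given by
$$\hat{m}_i = e_1' \left( X_{si}' W_{si} X_{si} \right)^{-1} X_{si}' W_{si}\, y_s = w_{si}'\, y_s, \qquad (9)$$
as long as $X_{si}' W_{si} X_{si}$ is invertible, where $w_{si}$ is an $n$-vector of weights.
The above shows that local polynomial estimators are linear smoothers, of the form
$$\hat{m}_i = \sum_{j \in s} w_{si,j}\, y_j.$$
The coefficients of the linear combination depend on the degree of the polynomial approximation. We note that for $q = 0$, the estimator reduces to the Nadaraya-Watson estimator . Now, based on the proposed estimator in equation (6), and assuming that  throughout, due to mathematical complexity, the local polynomial regression estimator for the finite population total is given by
$$\tilde{t}_y = \sum_{i \in s} \frac{y_i - \hat{m}_i}{\pi_i} + \sum_{i \in U_N} \hat{m}_i, \qquad (11)$$
where $\hat{m}_i$ is the sample estimator for $m_i$. Substituting equation (9) in (11) above gives
$$\tilde{t}_y = \sum_{i \in s} \frac{y_i - w_{si}'\, y_s}{\pi_i} + \sum_{i \in U_N} w_{si}'\, y_s.$$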
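A minimal sketch of the local polynomial model-assisted estimator follows. It assumes the Epanechnikov kernel (the kernel named in the design of the study), a local linear fit ($q = 1$) as the default, and equal inclusion probabilities; these choices, the function names and the toy data are illustrative assumptions rather than the study's actual implementation.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_poly_fit(x0, x_s, y_s, pi_s, h, q=1):
    """Design-weighted local polynomial estimate of m(x0) from the sample."""
    d = x_s - x0
    X = np.vander(d, N=q + 1, increasing=True)   # columns 1, d, ..., d^q
    w = epanechnikov(d / h) / (pi_s * h)         # diagonal of the weight matrix
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y_s)   # fails if X'WX is singular
    return beta[0]                               # e_1' picks the intercept

def local_poly_total(x_pop, in_sample, y_s, pi, h, q=1):
    """Sum of fitted values over the population plus weighted sample residuals."""
    x_s, pi_s = x_pop[in_sample], pi[in_sample]
    m_hat = np.array([local_poly_fit(x0, x_s, y_s, pi_s, h, q) for x0 in x_pop])
    return np.sum((y_s - m_hat[in_sample]) / pi_s) + np.sum(m_hat)

# Toy example (hypothetical): an exactly linear population, every other unit sampled.
x_pop = np.linspace(0.0, 1.0, 20)
y_pop = 1.0 + 2.0 * x_pop
in_sample = np.arange(20) % 2 == 0
pi = np.full(20, 0.5)                            # equal inclusion probabilities n / N
t_tilde = local_poly_total(x_pop, in_sample, y_pop[in_sample], pi, h=0.3)
```

Because a local linear fit reproduces a straight line exactly, the residual term vanishes here and the estimate matches the true total; for curved or noisy populations the residual term supplies the design-based correction.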
2.3. Kernel Estimation
We consider the Nadaraya-Watson Kernel estimator. It is assumed that the auxiliary information is available for the entire population and that the auxiliary variable $x$ and the study variable $y$ are related in a more general way. The properties of the proposed estimator are studied conditionally on the available sample and non-sample values of the auxiliary variable. A conceptually simple approach to a representation of the weight sequences is to describe the shape of the weight function by a density function with a scale parameter that adjusts the size and the form of the weights near $x$. This function is commonly referred to as a Kernel $K$ . The Kernel is a continuous, bounded and symmetric function which integrates to one,
$$\int K(u)\,du = 1.$$
To estimate $m(x)$ in model (1), one method is to average the nearby values of $y_i$, where "nearby" is measured in terms of the distance $|x - x_i|$.
Let $K_h(u) = h^{-1} K(u/h)$ be the Kernel with bandwidth $h$.
The weight sequence for Kernel smoothers (for one-dimensional $x$) is given by
$$W_{hi}(x) = \frac{K_h(x - x_i)}{n^{-1} \sum_{j=1}^{n} K_h(x - x_j)}. \qquad (13)$$
The Nadaraya-Watson estimator of $m(x)$ in (1) is
$$\hat{m}_h(x) = n^{-1} \sum_{i=1}^{n} W_{hi}(x)\, y_i. \qquad (14)$$
On substituting (13) in (14) we get
$$\hat{m}_h(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{j=1}^{n} K_h(x - x_j)}. \qquad (15)$$
The shape of the Kernel weights is determined by $K$, while their size is controlled by the bandwidth $h$: the smaller $h$ is, the more concentrated the weights are around $x$.
Selection of the bandwidth is the most important part of the Kernel estimation method. When selecting the bandwidth we need to consider the error in our selection. This is the deeper reason why precision has to be measured in terms of the pointwise Mean Squared Error (MSE), the sum of the variance and the squared bias. The MSE is given by
$$\operatorname{MSE}[\hat{m}_h(x)] = E\left[ \{\hat{m}_h(x) - m(x)\}^2 \right] = \operatorname{Var}[\hat{m}_h(x)] + \{\operatorname{Bias}[\hat{m}_h(x)]\}^2,$$
which tends to zero for the Kernel estimator if $h \to 0$ and $nh \to \infty$.
The non-parametric regression-based estimator, $\hat{T}_{np}$, for the population total $T$ is given by
$$\hat{T}_{np} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{m}_h(x_i), \qquad (16)$$
where $\hat{m}_h$ is the Nadaraya-Watson estimator in (15).
Therefore the Nadaraya-Watson estimator of the population total is given by substituting (15) in (16), which gives
$$\hat{T}_{NW} = \sum_{i \in s} y_i + \sum_{i \notin s} \frac{\sum_{j \in s} K_h(x_i - x_j)\, y_j}{\sum_{j \in s} K_h(x_i - x_j)},$$
where $\hat{T}_{NW}$ represents the Nadaraya-Watson estimator of the population total.
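The Nadaraya-Watson estimator of the total can be sketched in a few lines: each non-sample value is predicted by a kernel-weighted average of the sampled $y$ values and added to the observed sample total. The Epanechnikov kernel is used because the study names it; the bandwidth, the function names and the toy data are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def nadaraya_watson(x0, x_s, y_s, h):
    """Kernel-weighted average of the sampled y values around x0.

    The factor 1/h cancels between numerator and denominator.
    Assumes at least one sampled x lies within distance h of x0.
    """
    w = epanechnikov((x0 - x_s) / h)
    return np.sum(w * y_s) / np.sum(w)

def nw_total(x_pop, in_sample, y_s, h):
    """Observed sample total plus Nadaraya-Watson predictions for non-sample units."""
    x_s = x_pop[in_sample]
    preds = np.array([nadaraya_watson(x0, x_s, y_s, h) for x0 in x_pop[~in_sample]])
    return np.sum(y_s) + np.sum(preds)

# Toy example (hypothetical): a smooth curve observed at every other point.
x_pop = np.linspace(0.0, 1.0, 20)
y_pop = np.sin(2.0 * np.pi * x_pop) + 3.0
in_sample = np.arange(20) % 2 == 0
t_nw = nw_total(x_pop, in_sample, y_pop[in_sample], h=0.3)
t_const = nw_total(x_pop, in_sample, np.full(10, 5.0), h=0.3)  # constant y
```

Since every prediction is a weighted average with non-negative weights, it always lies between the smallest and largest sampled $y$; a smaller `h` concentrates the weights on the nearest sampled points.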
2.4. The Spline Smoothing
A measure of the rapid local variation of a curve can be given by a roughness penalty such as the integrated squared second derivative. Various penalties have been suggested and used, but $\int \{m''(x)\}^2\,dx$ is most convenient for our purpose. Using this measure, we define the modified sum of squares as
$$S_\lambda(m) = \sum_{i=1}^{n} \{y_i - m(x_i)\}^2 + \lambda \int \{m''(x)\}^2\,dx.$$
The idea behind spline estimation, then, is to find the function $\hat{m}_\lambda$ that solves the minimization problem
$$\hat{m}_\lambda = \arg\min_{m} S_\lambda(m). \qquad (21)$$
The parameter $\lambda$ is a smoothing parameter which controls the trade-off between smoothness and goodness of fit to the data. If $\lambda \to \infty$, the minimization of (21) gives a linear fit, whereas letting $\lambda \to 0$ gives a wiggly function. The larger the value of $\lambda$, the more the data will be smoothed to produce the curve estimate. The basic underlying idea of penalising a measure of goodness of fit by one of roughness was described by . Equation (21) shows that the function to be minimized consists of two components: first, the deviation of the fitted function from the observed values should be minimized, which gives the goodness of the fit. Second, complex functions are penalised by the second term in (21), as measured by the second-order derivative. From  and from the quadratic nature of equation (21), the spline smoother is linear in the observations in the sense that there exists a weight function $W_\lambda(x, x_i)$ such that
$$\hat{m}_\lambda(x) = n^{-1} \sum_{i=1}^{n} W_\lambda(x, x_i)\, y_i. \qquad (22)$$
The weight function is approximately
$$W_\lambda(x, x_i) \approx \frac{1}{f(x_i)}\,\frac{1}{h(x_i)}\,K\!\left( \frac{x - x_i}{h(x_i)} \right),$$
with the Kernel function given by
$$K(u) = \tfrac{1}{2} \exp\!\left( -\frac{|u|}{\sqrt{2}} \right) \sin\!\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right)$$
and the local bandwidth satisfies
$$h(x_i) = \lambda^{1/4}\, n^{-1/4}\, f(x_i)^{-1/4}.$$
It has been assumed that $n$ is large and that the design points have local density $f(x)$, in that the proportion of the $x_i$ in an interval of length $dx$ near $x$ is approximately $f(x)\,dx$. This approximation applies for large $n$ provided $x$ is not too near the edge of the interval on which the data lie, and $\lambda$ is not too big or too small.
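The kernel form of the spline smoother can be explored numerically. The sketch below codes the equivalent kernel $\tfrac{1}{2}\exp(-|u|/\sqrt{2})\sin(|u|/\sqrt{2} + \pi/4)$ and the local bandwidth $\lambda^{1/4} n^{-1/4} f^{-1/4}$, then applies the resulting weights to data on a uniform grid. A uniform design density ($f = 1$ on the unit interval), the value of $\lambda$ and all function names are assumptions made for illustration only.

```python
import numpy as np

def spline_kernel(u):
    """Equivalent kernel of the smoothing spline."""
    a = np.abs(np.asarray(u, dtype=float)) / np.sqrt(2.0)
    return 0.5 * np.exp(-a) * np.sin(a + np.pi / 4.0)

def local_bandwidth(lam, n, density):
    """Local bandwidth lambda^(1/4) * n^(-1/4) * f^(-1/4)."""
    return lam**0.25 * n**-0.25 * density**-0.25

def spline_equivalent_smoother(x0, x_s, y_s, lam, density):
    """Approximate spline fit at x0 as a kernel-weighted mean of the sample.

    `density` holds the assumed design density f at each sampled point.
    """
    n = len(x_s)
    h = local_bandwidth(lam, n, density)
    weights = spline_kernel((x0 - x_s) / h) / (density * h)
    return np.mean(weights * y_s)

# The kernel integrates to one (checked with a simple Riemann sum).
du = 0.001
grid = np.arange(-40.0, 40.0, du)
kernel_integral = np.sum(spline_kernel(grid)) * du

# Hypothetical uniform design on [0, 1] with constant responses.
n = 200
x_s = np.linspace(0.0, 1.0, n)
y_s = np.full(n, 5.0)
lam = n * 0.05**4                 # chosen so that the bandwidth equals 0.05
fit_mid = spline_equivalent_smoother(0.5, x_s, y_s, lam, np.ones(n))
```

At an interior point the weights sum to roughly one, so a constant response is reproduced almost exactly; near the edges of the interval the approximation is not expected to hold, in line with the caveat above.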
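After obtaining the spline smoother in equation (22), we can substitute it in equation (16) to obtain the population total. From (16) and (22) we get the smoothing spline estimator of the population total as
$$\hat{T}_{SS} = \sum_{i \in s} y_i + \sum_{i \notin s} \hat{m}_\lambda(x_i), \qquad \hat{m}_\lambda(x_i) = n^{-1} \sum_{j \in s} W_\lambda(x_i, x_j)\, y_j.$$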
While the periodic Spline Estimator of the Population Total is obtained as
3. Empirical Study
We present the analysis and results of the five estimators i.e. the ratio, the local polynomial, the Nadaraya-Watson Kernel, the spline smoother and the periodic spline. We used four natural and two artificial populations in the study.
3.1. Description of the Study Populations
In artificial population I, we generated 100 data points according to the linear homoscedastic model:
In artificial population II, we again generated 100 data points according to the quadratic homoscedastic model:
We obtained the natural populations from the Kenya Central Bureau of Statistics, for the period between 2006 and 2014. The description of each of the populations is given in Table 3.1 below.
|Population||Size (N)||Survey variable (y)||Auxiliary variable (x)|
|I||100||Value (in millions) of road transport equipment imported.||Quantity (number) of road transport equipment imported.|
|II||126||Value (in thousands) of principal articles traded.||Quantity (units) of principal articles traded.|
|III||130||Total number of employees engaged per industry.||Total number of firms and establishments per industry.|
|IV||130||Total outputs per industry in the manufacturing sector.||Total inputs per industry in the manufacturing sector.|
Scatter plots drawn for each of the four natural populations (Population I-IV) were used to deduce the form of the population structures as below:
Population I: the structure of the population could be non-linear and heteroscedastic.
Population II: the structure of the population could be linear and heteroscedastic.
Population III: the structure of the population could be linear and heteroscedastic.
Population IV: the structure of the population could be linear and homoscedastic.
Populations V and VI were the artificial populations with known population structures:
Population V: follows a linear homoscedastic model passing through the origin.
Population VI: follows a quadratic homoscedastic model.
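For readers who wish to reproduce populations of the same structure, the following sketch generates 100 points from a linear homoscedastic model through the origin and 100 points from a quadratic homoscedastic model. The coefficients, the range of $x$, the error variance and the seed are all hypothetical; the values used in the study are not recoverable from the text.

```python
import numpy as np

rng = np.random.default_rng(2015)           # arbitrary seed, for reproducibility only
N = 100
x = rng.uniform(0.0, 10.0, size=N)          # assumed auxiliary values
e = rng.normal(0.0, 1.0, size=N)            # constant error variance (homoscedastic)

# Population V analogue: linear through the origin (slope 2.0 is hypothetical).
y_linear = 2.0 * x + e

# Population VI analogue: quadratic mean function (coefficients are hypothetical).
y_quadratic = 1.0 + 2.0 * x + 0.5 * x**2 + e

true_total_linear = y_linear.sum()
true_total_quadratic = y_quadratic.sum()
```

The same errors are reused for both populations here so that the two differ only in their mean function.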
3.2. Design of the Study
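For each of the six populations, 500 samples of size 50 were drawn by Simple Random Sampling without replacement. The Epanechnikov Kernel, defined as
$$K(u) = \begin{cases} \tfrac{3}{4}(1 - u^2), & |u| \le 1, \\ 0, & \text{otherwise,} \end{cases}$$
was used in the study for the Local Polynomial Estimator and the Nadaraya-Watson Kernel Estimator. An optimal bandwidth for the Nadaraya-Watson smoother was sought within an interval defined in terms of $\sigma$, the standard deviation of the $x_i$'s. The Kernel function used in the spline smoothing and periodic spline is $K(u) = \tfrac{1}{2} \exp(-|u|/\sqrt{2}) \sin(|u|/\sqrt{2} + \pi/4)$, with the local bandwidth satisfying $h(x) = \lambda^{1/4} n^{-1/4} f(x)^{-1/4}$, where $\lambda$ is the smoothing parameter and $f$ the local density of the design points.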
3.3. Description of the Computation Procedure
For each of the six populations, we computed the true population total $T = \sum_{i=1}^{N} y_i$, where $N$ is the number of units in each population. The estimator of the population total, $\hat{T}$, was then obtained for each population using the five different estimators as follows:
Ratio Estimator:Local polynomial:
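To compare the five estimators, the average biases and the Average Mean Square Errors (AMSE) for each population were calculated. For populations V and VI, the relative change in efficiency was calculated to measure the robustness of the estimators. The Average Bias for each estimator was calculated as
$$\text{Average Bias}(\hat{T}_j) = \frac{1}{500} \sum_{k=1}^{500} \left( \hat{T}_{j,k} - T \right),$$
where $j$ denotes the different estimators and $\hat{T}_{j,k}$ is the value of estimator $j$ in the $k$th sample.
The Average Mean Square Error for each estimator was obtained from
$$\text{AMSE}(\hat{T}_j) = \frac{1}{500} \sum_{k=1}^{500} \left( \hat{T}_{j,k} - T \right)^2.$$
The Relative Change in Efficiency (RCE) for each estimator was computed from its AMSE values in populations V and VI.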
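As an illustration of the computation procedure, the sketch below draws 500 simple random samples of size 50 without replacement and records the average bias and the average mean square error of one estimator. It assumes that the bias is the mean of the estimation errors and the AMSE the mean of their squares; the ratio estimator is used as the example, and the population is simulated with hypothetical parameters. The RCE is not computed because its formula is not recoverable from the text.

```python
import numpy as np

def ratio_total(x, y, sample_idx):
    """Prediction-form ratio estimator of the total (assumes positive x)."""
    mask = np.zeros(len(x), dtype=bool)
    mask[sample_idx] = True
    beta_hat = y[mask].sum() / x[mask].sum()
    return y[mask].sum() + beta_hat * x[~mask].sum()

def monte_carlo(estimator, x, y, n=50, reps=500, seed=0):
    """Average bias and average MSE of `estimator` over repeated SRSWOR samples."""
    rng = np.random.default_rng(seed)
    true_total = y.sum()
    errors = np.empty(reps)
    for k in range(reps):
        idx = rng.choice(len(x), size=n, replace=False)  # sampling without replacement
        errors[k] = estimator(x, y, idx) - true_total
    return errors.mean(), np.mean(errors**2)

# Hypothetical linear homoscedastic population through the origin.
rng = np.random.default_rng(1)
x = rng.uniform(1.0, 10.0, size=100)
y = 3.0 * x + rng.normal(0.0, 1.0, size=100)
avg_bias, amse = monte_carlo(ratio_total, x, y)
```

Any of the other estimators can be passed to `monte_carlo` in place of `ratio_total`, provided it accepts the same three arguments.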
The results of this study were summarized in Tables 3.2, 3.3, 3.4, 3.5 and 3.6 below:
|Pop 1||Pop 2||Pop 3||Pop 4||Pop 5||Pop 6|
|Pop 1||Pop 2||Pop 3||Pop 4||Pop 5||Pop 6|
|Estimator||Nadaraya-Watson||Smoothing spline||Local polynomial||Ratio Estimator||Periodic spline|
3.5. Discussion of the Results
For population I, which is approximately non-linear and heteroscedastic, the local polynomial estimator has the smallest bias, making it the best estimator for this population. The periodic spline has the smallest bias for population II, which is approximately linear and heteroscedastic. On the other hand, the Nadaraya-Watson estimator has the lowest bias for population III, which is also approximately linear and heteroscedastic. In population IV (approximately linear and homoscedastic), the periodic spline has the lowest bias, making it a good estimator for this population. Table 3.4 shows that, in general, all the estimators have lower biases in population V than in the rest of the populations. The lowest bias, however, is that of the local polynomial estimator, which makes it a good estimator for the linear homoscedastic model. We further notice that the Nadaraya-Watson estimator has the smallest bias in population VI, making it the best estimator for the non-linear homoscedastic model.
We next consider the performance of each estimator across the six populations in terms of average biases, as shown in Table 3.4. The Nadaraya-Watson estimator performed relatively well in all the populations. It did best, however, in populations III and VI, which are linear and heteroscedastic, and quadratic and homoscedastic, respectively. The smoothing spline, on the other hand, had the largest bias in all the populations. It had its best performance with a linear homoscedastic population. The local polynomial estimator had the lowest bias in population I, which is non-linear and heteroscedastic, and in population V, which is linear and homoscedastic. Its bias in population VI, which is quadratic and homoscedastic, is also relatively low. The ratio estimator generally performs poorly compared with the other estimators, but better than the smoothing spline. Its best performance is in population III, which is approximately linear and heteroscedastic.
We then turned to the Average Mean Square Error (AMSE) in Table 3.5. The smaller the AMSE, the higher the efficiency of the estimator for the given population. In population I, the lowest AMSE was given by the ratio estimator, while in population II it was given by the periodic spline. The Nadaraya-Watson estimator had the lowest AMSE in populations III and IV, while for population V it was the ratio estimator. The Nadaraya-Watson estimator was also the most efficient estimator for the non-linear homoscedastic population VI.
Finally, we compared the Relative Change in Efficiency (RCE) among the five estimators. We noticed from Table 3.6 that the Nadaraya-Watson estimator had the lowest RCE. The implication is that it is the least sensitive to a change in the structure of the population and hence the most robust of the five estimators. It was followed by the periodic spline, the ratio estimator and the local polynomial estimator. The smoothing spline was the least robust among them.
4. Summary, Conclusions and Recommendations
4.1. Summary of the Findings
The research set out to estimate the population total using spline functions. However, other estimators of the population total were also included for comparative purposes. In all six populations considered, the periodic spline had a smaller average bias and a lower AMSE, and was found to be more robust, than the smoothing spline. The Nadaraya-Watson estimator performed well in general in terms of average bias, efficiency and robustness. It had very small biases in both linear and non-linear homoscedastic models. Its bias in heteroscedastic models was also relatively low. Its efficiency was high in most of the populations and it also had the lowest RCE value of the five estimators considered.
The local polynomial estimator was found to be almost unbiased for a linear homoscedastic model. Its bias however goes up when a non-linear homoscedastic population is considered. In terms of efficiency, the estimator is far more efficient in a linear homoscedastic model than a non-linear one. It has a high RCE value.
We observed that the ratio estimator is relatively highly biased across the six populations considered. However, in terms of efficiency, it was the most efficient of the five estimators for a linear homoscedastic model. Its efficiency went down when a non-linear homoscedastic population was considered. Its RCE value is relatively high. We also observed that the periodic spline and the Nadaraya-Watson estimators gave results that were quite similar in terms of bias, efficiency and robustness.
4.2. Conclusions and Recommendations
We observed from this study that the two spline functions considered perform quite differently. The periodic spline performed better than the smoothing spline in all the aspects considered: bias, efficiency and robustness. We therefore concluded that the periodic spline is a better estimator than the smoothing spline, both under a linear homoscedastic model and when the model assumptions have been violated. It was also shown that the Nadaraya-Watson estimator performed well in the linear homoscedastic model and also when the conditions were violated. It had the lowest RCE value. We therefore concluded that the Nadaraya-Watson estimator was the most robust of the five estimators. The results also showed the periodic spline and the Nadaraya-Watson estimators to be quite similar. Thus, we concluded from both the theoretical results and the empirical study that spline smoothing corresponds approximately to smoothing by a Kernel method, concurring with the theoretical observation made by .
The local polynomial estimator was very sensitive to violation of the model assumptions and we therefore concluded that it is not robust. The results also indicated that the ratio estimator was the most efficient of the five estimators for a linear homoscedastic model. Nevertheless, when these conditions are violated, the estimator breaks down completely. We conclude that this estimator is not robust to violation of the linear and homoscedastic conditions.
From the findings of the study, we gave the following recommendations:
1. Both the ratio estimator (model-based) and the local polynomial estimator (model-assisted) should be used within the confines of a linear homoscedastic model. They are not appropriate for use when the model is unspecified or when the linear and homoscedastic assumptions are violated.
2. The Nadaraya-Watson and the periodic spline estimators, both of which are non-parametric, can be used under a linear homoscedastic model and also when the model assumptions are violated. Their sensitivity to a change in the structure of the population is relatively low and hence they are highly robust. The Nadaraya-Watson estimator, however, is even more robust than the periodic spline.