A Design Unbiased Variance Estimator of the Systematic Sample Means
Festus A. Were, George Orwa, Romanus Odhiambo
Jomo Kenyatta University of Agriculture and Technology, School of Mathematical Sciences, Nairobi, Kenya
To cite this article:
Festus A. Were, George Orwa, Romanus Odhiambo. A Design Unbiased Variance Estimator of the Systematic Sample Means. American Journal of Theoretical and Applied Statistics. Vol. 4, No. 3, 2015, pp. 201-210. doi: 10.11648/j.ajtas.20150403.27
Abstract: Systematic sampling is commonly used in surveys of finite populations because of its appealing simplicity and efficiency. When properly applied, it can reflect stratification in the population and thus can be more precise than simple random sampling (SRS). In systematic sampling, the sampling units are evenly spread over the whole population. The scheme is very sensitive to correlation between units in the population: a positive autocorrelation reduces the precision, while a negative autocorrelation improves the precision, compared to simple random sampling. The limitation of this sampling method is that it is not possible to obtain an unbiased estimate of the design variance from a single systematic sample. This study proposes an estimator for the design variance based on a non-parametric model for the population, using local polynomial regression as the estimation technique. The non-parametric model is flexible enough to hold in many practical situations. A simulation study is performed to compare the efficiency of the proposed estimator with that of existing ones. The performance measures used are the Relative Bias (RB) and the Mean Squared Error (MSE). The simulation results indicate that the local polynomial estimator based on the nonparametric model is consistent and design unbiased for the variance of the systematic sample mean: the proposed estimator gave smaller relative biases and mean squared errors than the existing estimators.
Keywords: Systematic Sampling, Local Polynomial Regression, Non-Parametric Model, Design Variance
1.1. Background of the Study
Systematic sampling is a probability sampling technique in which a sample is obtained by selecting every $k$-th element of the population, where $k$ is an integer greater than 1. The first unit of the sample must be selected randomly from within the first $k$ elements. The selection is done from an ordered list. It is a popular method of selection, especially when the units are many and are serially arranged from 1 to $N$. Suppose that the total number of units $N$ is a multiple of the required sample size $n$, so that $N = nk$ for an integer $k$, and that a random number $r$ is selected between 1 and $k$.
The sample then comprises the randomly selected first unit $r$ and every $k$-th unit thereafter, until the required sample size is obtained. The interval $k$ divides the population into $k$ groups (clusters) of $n$ units each. In this method we are selecting one cluster of units with probability $1/k$. Since the first number is drawn at random from 1 to $k$, each unit in the equally sized clusters has the same probability of selection, $1/k$.
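The selection rule described above can be sketched in a few lines of Python. This is only an illustration under the stated assumption that $N$ is an exact multiple of $n$; the function name `systematic_sample` and the toy population are hypothetical and not part of the study.

```python
import random

def systematic_sample(population, n, rng=None):
    """Draw a 1-in-k systematic sample of size n from an ordered list.

    Assumes len(population) is an exact multiple of n (N = n * k).
    Returns the random start r (1-based) and the selected units.
    """
    rng = rng or random.Random()
    N = len(population)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is a positive multiple of n")
    k = N // n                  # sampling interval
    r = rng.randint(1, k)       # random start, 1 <= r <= k, each with probability 1/k
    # 1-based labels r, r+k, ..., r+(n-1)k are 0-based indices r-1, r-1+k, ...
    sample = population[r - 1::k]
    return r, sample

units = list(range(1, 21))      # toy population: N = 20 units labelled 1..20
r, sample = systematic_sample(units, n=5, rng=random.Random(0))
print("random start:", r, "sample:", sample)
```

Because only the start $r$ is random, there are just $k$ possible samples, which is why the design behaves like the selection of a single cluster.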
Systematic sampling is widely used in surveys of finite populations, such as forests, where other sampling schemes cannot be easily applied. This is due to its appealing simplicity and efficiency. When properly applied, the method picks up any obvious or hidden stratification in the population and thus can be more precise than simple random sampling. Systematic sampling is also easy to implement, which reduces costs.
Since a systematic sample can be regarded as a random selection of one cluster, it is not possible to give an unbiased, or even consistent, design-based estimator of the variance, and this is the challenge faced by researchers who apply it in practice. Two approaches have been proposed to solve this problem. One is to postulate a superpopulation model characterizing the population structure and to obtain model-unbiased estimators of the variance (Cochran, 1977). The superpopulation model is used to describe the relationship between the auxiliary variable and the study variable. This approach may not yield satisfactory results, because the model assumption is usually hard to verify in practice and the unbiased estimators of variance can be sensitive to the model assumption. The second approach is to take additional observations (a supplementary sample), typically of smaller size than the first sample, via simple random sampling (Zinger, 1980) or systematic sampling.
Nonparametric regression is motivated by the fact that it provides a flexible way of studying the relationships between variables and also results in good estimators, which are more efficient than estimators obtained using design-based approaches.
In this framework, this study is concerned with the estimation of the variance of the systematic sample mean using a nonparametric approach (the local polynomial regression technique) with the aid of a superpopulation model. It also offers the methodology to study the convergence properties of the proposed estimator.
1.2. Literature Review
Zinger (1980) pursued an approach defined as partially systematic sampling, in which he obtained an unbiased estimator for the variance of the systematic sample mean. However, his proposed estimator could not be shown to be non-negative except for the case of . Wu (1984) suggested a difference estimator to tackle the problem faced by Zinger (1980), which was non-negative for all . Rana (1989), following the work of Zinger (1980), proposed a different estimator for the variance of the systematic sample mean that was unbiased and non-negative for all values of .
Wolter (2007) gives a more comprehensive review of eight biased variance estimators, together with guidelines for choosing among them. The above variance estimation procedures are conditional on the design. In other words, they are design-based, in that the finite population is treated as fixed.
There also exist some model-based variance estimators, in which the population is considered a random realization from a superpopulation model. Montanari and Bartolucci (1998) came up with a model-based variance estimator using OLS, which was approximately unbiased for the variance of the systematic sample mean under a linear superpopulation model. However, this estimator loses accuracy and efficiency, owing to a larger contribution of the bias, if the systematic component of the superpopulation model is not linear. Montanari and Bartolucci (2006) later proposed a new class of unbiased estimators of the variance of the systematic sample mean, which included some simple nonparametric estimators, under the assumption that the population follows a superpopulation model satisfying some mild assumptions. They showed that the estimator based on local polynomial regression as the estimation technique is unbiased under the assumption that the population follows a linear trend and the errors are homoscedastic and uncorrelated. Their simulation results showed that the LPR estimator performed better in terms of relative bias and mean squared error, both of which were small.
Li and Opsomer (2010) and Ayora (2014), also building on the work of Montanari and Bartolucci (2006), considered a broadly applicable model for the data, in which both the mean and the variance are left unspecified, subject only to smoothness assumptions. They then came up with a model-based nonparametric variance estimator, in which both the mean and the variance functions of the data are estimated nonparametrically using local polynomial regression as the smoothing technique. In their simulation experiment, this estimator performed better, giving smaller relative bias and mean squared error than the other classical estimators discussed in Wolter (2007).
This study considers a more applicable model in which the mean function is unspecified but the variance function is homoscedastic. The researcher proposes a model-based nonparametric estimator, using local polynomial regression as the smoothing technique, for the variance of the systematic sample mean. It is later shown that the proposed estimator is model consistent for the design variance of the survey estimator, subject to the population smoothness assumptions.
1.3. Statement of the Problem
Variance estimation for the systematic sample mean remains an issue that has not been fully addressed, as only estimation procedures that are not very robust have been proposed. In view of this, the construction of a robust estimator for the variance of the systematic sample mean or total remains an open area of research.
In this study, the researcher first reviews systematic sampling and the existing estimators of the variance of the systematic sample mean. The assumptions used in developing the proposed estimator are then reviewed, and an estimator based on local polynomial regression under a nonparametric superpopulation model is proposed. The study also provides the proof of the consistency of the proposed estimator and, lastly, assesses the performance of the proposed estimator through a simulation study.
In the current study, let $Y = (Y_1, \dots, Y_N)'$ be the finite population measurements of size $N$ representing some survey characteristic, and let $X$ be a vector of auxiliary variables, which is considered fixed. Let $k$ be the sampling interval and $1/k$ the probability of selection of each element of the population. The systematic sample with random start $r$ then consists of the observations $Y_r, Y_{r+k}, \dots, Y_{r+(n-1)k}$, where $n$ is the sample size. Let $\bar{Y}_N$ be the population mean and $\bar{Y}_S$ the systematic sample mean; the study is interested in estimating the variance of $\bar{Y}_S$, which is defined by equation (3). To estimate this variance, the study uses the local polynomial regression estimator discussed in Wand and Jones (1995), computed from the sample with kernel weights $K_h(u) = K(u/h)/h$, with $h$ being the smoothing parameter and $K$ a kernel function.
2.2. Review of Systematic Sampling
Suppose that the population consists of $N$ units with study variable $Y$. Then the population mean is given as

$$\bar{Y}_N = \frac{1}{N}\sum_{j=1}^{N} Y_j. \tag{1}$$
To draw a systematic sample, we first sort the population using some criterion. For example, we can sort by one of the auxiliary variables in $X$. If the study variable $Y$ and the auxiliary variable $X$ are related through a certain function, sorting by $X$ may provide a good spread of the $Y$'s, so that a systematic sample can pick up hidden structures in the population. If we sort the population by some criterion that is not related to $Y$ at all, for instance by a variable $Z$ that is independent of $Y$, then we have a random permutation of the population. In this case systematic sampling is equivalent to SRSWOR. After sorting the population, we randomly choose an element from the first $k$, say the $r$-th one; the systematic sample then consists of the observations $Y_r, Y_{r+k}, \dots, Y_{r+(n-1)k}$. Thus systematic sampling amounts to the selection of a single complex sampling unit that constitutes the whole sample: a systematic sample is a simple random sample of one cluster unit from a population of $k$ cluster units. Table 1 illustrates this procedure. Each column corresponds to a possible systematic sample. The interval $k$ divides the population into $n$ rows of $k$ elements each. One element from each row is selected, and each selected element has the same location in its row.
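The population mean is estimated by the sample mean, given as

$$\bar{Y}_S = \frac{1}{n}\sum_{j \in S} Y_j. \tag{2}$$

The design-based variance of this mean was first derived by Madow and Madow (1944) and is given by

$$\mathrm{Var}_p(\bar{Y}_S) = \frac{1}{k}\sum_{r=1}^{k}\left(\bar{Y}_{S_r} - \bar{Y}_N\right)^2, \tag{3}$$

where $\bar{Y}_{S_r}$ is the mean of the systematic sample with random start $r$.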
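When the whole population is known, as in a simulation, the design variance of the systematic sample mean can be computed exactly by enumerating the $k$ possible samples. The following is a minimal sketch assuming $N = nk$ and a population already sorted in sampling order; the function name `design_variance` is hypothetical.

```python
import numpy as np

def design_variance(y, n):
    """Exact design variance of the systematic sample mean.

    y is the sorted population (length N = n * k). Reshaping to n rows of
    k elements puts the systematic sample with start r in column r - 1.
    """
    y = np.asarray(y, dtype=float)
    N = y.size
    if N % n != 0:
        raise ValueError("this sketch assumes N is a multiple of n")
    k = N // n
    sample_means = y.reshape(n, k).mean(axis=0)   # the k possible sample means
    return np.mean((sample_means - y.mean()) ** 2)

y_pop = np.arange(1, 13)           # toy population 1..12
v = design_variance(y_pop, n=4)    # k = 3 possible samples
print(v)
```

For this toy population the three possible sample means are 5.5, 6.5 and 7.5, so the design variance is 2/3. In practice only one column is observed, which is why the quantity cannot be estimated without bias from a single systematic sample.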
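But there is no unbiased design-based estimator of $\mathrm{Var}_p(\bar{Y}_S)$ for a general variable $Y$. Among the eight estimators of $\mathrm{Var}_p(\bar{Y}_S)$ evaluated by Wolter (2007), we look at the three main ones that are used in practice. One approach is to treat the systematic sample as if it had been obtained by SRS. Writing $f = n/N$ for the sampling fraction and $y_1, \dots, y_n$ for the sample observations in their population order, this estimator is defined by

$$\hat{V}_{SRS} = \frac{1-f}{n}\,\frac{1}{n-1}\sum_{j=1}^{n}\left(y_j - \bar{Y}_S\right)^2. \tag{4}$$

The other two estimators are based on pairwise differences and are recommended in Wolter (2007) as the best general-purpose estimators of $\mathrm{Var}_p(\bar{Y}_S)$. The first is defined as

$$\hat{V}_{OL} = \frac{1-f}{n}\,\frac{1}{2(n-1)}\sum_{j=2}^{n}\left(y_j - y_{j-1}\right)^2, \tag{5}$$

which uses all successive pairwise differences, that is, overlapping differences (OL). The other estimator is defined by

$$\hat{V}_{NO} = \frac{1-f}{n}\,\frac{1}{n}\sum_{j=1}^{n/2}\left(y_{2j} - y_{2j-1}\right)^2, \tag{6}$$

which uses non-overlapping successive differences (NO). All three estimators are design biased for $\mathrm{Var}_p(\bar{Y}_S)$ in general.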
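As an illustration, the three classical estimators can be sketched as follows, using the forms given in Wolter (2007) with sampling fraction $f = n/N$. The function names `v_srs`, `v_ol` and `v_no` are hypothetical labels, and the sample values are assumed to be listed in their population order.

```python
import numpy as np

def v_srs(ys, N):
    """SRS-based estimator: (1 - f)/n times the sample variance."""
    ys = np.asarray(ys, dtype=float)
    n = ys.size
    return (1 - n / N) / n * ys.var(ddof=1)

def v_ol(ys, N):
    """Overlapping differences: all n - 1 successive differences."""
    ys = np.asarray(ys, dtype=float)
    n = ys.size
    d = np.diff(ys)
    return (1 - n / N) / n * np.sum(d ** 2) / (2 * (n - 1))

def v_no(ys, N):
    """Non-overlapping differences: pairs (y1, y2), (y3, y4), ...; n must be even."""
    ys = np.asarray(ys, dtype=float)
    n = ys.size
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even sample size")
    d = ys[1::2] - ys[0::2]
    return (1 - n / N) / n * np.sum(d ** 2) / n

ys = [2.0, 4.0, 6.0, 8.0]     # toy systematic sample from a population of N = 12
print(v_srs(ys, 12), v_ol(ys, 12), v_no(ys, 12))
```

On this toy sample with a pure linear trend, the difference-based estimators give 1/3 while the SRS-based estimator gives 10/9, which illustrates how the SRS form absorbs the trend into the variance.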
The first estimator, $\hat{V}_{SRS}$, is viewed as suitable when the ordering of the population is thought to have no effect on $\mathrm{Var}_p(\bar{Y}_S)$, or is considered a conservative estimator when the ordering is related to the variable $Y$. However, as discussed by Li and Opsomer (2010), the unbiasedness of $\hat{V}_{SRS}$ for an uninformative ordering only holds if one averages over samples and over orderings of the population, so it is not strict design unbiasedness.
The bias of $\hat{V}_{SRS}$ for a fixed ordering of the population can be large and either positive or negative. The last two estimators tended to have smaller bias in the simulation experiments discussed in Wolter (2007). To obtain an unbiased estimate of $\mathrm{Var}_p(\bar{Y}_S)$, one of the following four designs has to be considered.
1) Multiple systematic sampling, using a randomly determined starting position for each systematic sampling stage.
2) Systematic stratified sampling, where two or more systematic samples (each with a different random start position) are taken within each stratum.
3) Two-stage sampling, where the subsamples are collected according to a systematic sampling design.
4) Complementary systematic and random sampling, where a systematic sample is supplemented by a random sample drawn from the remaining population units.
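To illustrate why the first design in the list above permits unbiased variance estimation, the sketch below takes $t$ systematic samples with independent random starts (drawn with replacement) and uses the standard replicated-sample variance estimator. This is an assumed, textbook-style illustration rather than a procedure taken from this study, and the name `replicated_variance` is hypothetical. The toy check enumerates every pair of starts for $t = 2$.

```python
import itertools
import numpy as np

def replicated_variance(sample_means):
    """Variance estimator for the average of t independent systematic sample means."""
    m = np.asarray(sample_means, dtype=float)
    t = m.size
    return np.sum((m - m.mean()) ** 2) / (t * (t - 1))

# Toy population of N = 12 units in sampling order, interval k = 4 (n = 3 per sample).
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0, 5.0, 8.0])
k = 4
means = y.reshape(-1, k).mean(axis=0)      # the k possible systematic sample means

# Enumerate all k * k equally likely pairs of independent starts (t = 2).
pairs = list(itertools.product(range(k), repeat=2))
avg_estimate = np.mean([replicated_variance(means[list(p)]) for p in pairs])

# True variance of the average of two independent systematic sample means.
true_var = np.mean((means - means.mean()) ** 2) / 2
print(avg_estimate, true_var)
```

The average of the estimator over all pairs of starts coincides with the true variance of the combined mean, which is what design unbiasedness means here. Note that the quantity estimated is the variance of the average of several smaller systematic samples, not that of a single systematic sample of the same total size.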
2.3. Review of Local Polynomial Regression
Nonparametric regression has become a rapidly developing field of statistics. Nonparametric approaches to regression are flexible, data-analytic ways to estimate the regression function without specifying a parametric model, that is, they let the data find a suitable function that explains them well. Local modeling techniques with kernel weights provide a basic and easily understood nonparametric approach to regression. Local polynomial regression is a generalization of kernel regression: in kernel regression, the regression function at a point $x$ is estimated by a locally weighted average, which can be shown to correspond to fitting a polynomial of degree zero, that is, the Nadaraya-Watson estimator. Wand and Jones (1995) give a clear explanation of kernel smoothing, including local polynomial regression.
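A minimal sketch of a local polynomial fit at a single point may help fix ideas. It assumes an Epanechnikov kernel and a weighted least squares fit of a degree-$p$ polynomial centred at the evaluation point; the function names `epanechnikov` and `local_poly_fit` are hypothetical, and setting `p=0` gives the Nadaraya-Watson (local constant) estimator.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def local_poly_fit(x0, x, y, h, p=1):
    """Local polynomial estimate of the regression function at x0.

    Fits a degree-p polynomial in (x - x0) by kernel-weighted least squares
    and returns the intercept, which is the fitted value at x0.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = epanechnikov((x - x0) / h) / h
    X = np.vander(x - x0, N=p + 1, increasing=True)   # columns 1, (x-x0), ..., (x-x0)^p
    XtW = X.T * w                                      # X' W without forming W
    beta = np.linalg.solve(XtW @ X, XtW @ y)
    return beta[0]

# Noise-free linear data make the boundary behaviour easy to see.
x = np.linspace(0.0, 1.0, 101)
y = 2.0 + 3.0 * x
ll_boundary = local_poly_fit(0.0, x, y, h=0.2, p=1)   # local linear at the boundary
nw_boundary = local_poly_fit(0.0, x, y, h=0.2, p=0)   # local constant at the boundary
print(ll_boundary, nw_boundary)
```

On this example the local linear fit recovers the true value 2 at the left boundary, whereas the local constant fit is pulled upwards because all neighbouring points lie to the right.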
Local polynomial regression has several advantages over other nonparametric approaches. The method is readily adapted to highly clustered, random, fixed and close-to-uniform designs, both in the interior and at the boundaries. Local polynomial regression estimators do not suffer from boundary bias: they adapt automatically to the boundary, so no modification is needed to correct the large bias problem there.
Local polynomial estimators have high minimax efficiency among the class of linear smoothers, including those produced by kernel and spline techniques, both in the interior and at the boundary points. Fan discusses in detail the local linear fit in comparison with the local constant fit and shows that local linear regression smoothers have desirable mean squared error (MSE), the design adaptation property, no boundary effects, and high asymptotic minimax efficiency. Fan also showed that the local linear regression estimator adapts automatically to estimation at the boundary, and gave expressions for the conditional MSE and mean integrated squared error (MISE) of the estimator. Wand and Jones extend the results of Fan on asymptotic bias and variance to the case of local polynomial estimators. Fan and Gijbels emphasize methodology, with a particular focus on applications of local polynomial modeling techniques to various statistical problems, including survival analysis, least squares regression, nonlinear time series, robust regression and generalized linear models. Breidt and Opsomer apply local polynomial regression to model-assisted survey sampling.
One of the important issues in nonparametric regression is the choice of the smoothing parameter (bandwidth). The bandwidth is often selected subjectively by eye, but there are situations where it is necessary to have the bandwidth selected automatically from the data. Data-driven methods try to estimate the optimal bandwidth, which minimizes the mean squared error (MSE) at a point $x$ or over all values of $x$. Most bandwidth selection methods attempt to find the bandwidth that minimizes the mean integrated squared error (MISE), and are therefore called global bandwidth selection methods. Cross-validation (CV) is a well-known method of optimizing the bandwidth, using leave-one-out prediction. However, the smoothing parameter computed by CV is highly variable and tends to under-smooth in practice, that is, the chosen bandwidths tend to be very small. For linear smoothers, the CV criterion is easy to calculate, since the leave-one-out predictor is a linear function of the complete-data predictor. Another approach to bandwidth selection is to estimate the MISE directly from the data: the variance and the bias of the estimator are estimated, and the estimated MISE is minimized with respect to the bandwidth. This "plug-in" method is used mostly in kernel regression and local polynomial regression, and it gives more stable performance. The theory and the choice of a global variable bandwidth based on the plug-in procedure for local linear smoothers were discussed by Fan.
Wand and Jones developed a simple direct plug-in bandwidth selector for local linear regression that works well in practice for a wide variety of functions and has appealing theoretical and practical properties. Fan and Gijbels propose a data-driven variable bandwidth selection procedure based on a residual squares criterion and show that local polynomial fitting using the variable bandwidth has spatial adaptation properties.
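The leave-one-out shortcut for linear smoothers mentioned above can be sketched as follows. For a smoother matrix $S$ with fitted values $\hat{y} = Sy$, the leave-one-out residual at point $i$ equals $(y_i - \hat{y}_i)/(1 - S_{ii})$, so the CV score needs only one fit per bandwidth. The data, the bandwidth grid and the function names are hypothetical choices made for illustration only.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def smoother_matrix(x, h, p=1):
    """Rows are e1' (X'WX)^{-1} X'W for a local polynomial fit at each x[i]."""
    n = x.size
    S = np.empty((n, n))
    for i in range(n):
        w = epanechnikov((x - x[i]) / h)
        X = np.vander(x - x[i], N=p + 1, increasing=True)
        XtW = X.T * w
        S[i] = np.linalg.solve(XtW @ X, XtW)[0]
    return S

def loo_cv_score(x, y, h, p=1):
    """Leave-one-out CV score via the linear-smoother shortcut."""
    S = smoother_matrix(x, h, p)
    resid = (y - S @ y) / (1 - np.diag(S))
    return np.mean(resid ** 2)

# Hypothetical test data: a sine curve observed with noise on an even grid.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, x.size)

grid = [0.05, 0.1, 0.2, 0.4]                      # candidate bandwidths (assumed)
scores = {h: loo_cv_score(x, y, h) for h in grid}
h_cv = min(scores, key=scores.get)
print("CV-selected bandwidth:", h_cv)
```

A grid search is the simplest way to use the score; a plug-in selector would instead estimate the bias and variance terms of the MISE directly.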
2.4. Trade-Off Between Bias and Variance
The choice of the bandwidth $h$ is of crucial importance in local polynomial regression. A smaller bandwidth results in less smoothing, while a larger bandwidth oversmooths the curve. There is a trade-off between variance and bias. Large values of the bandwidth reduce the variance, since more points are included in the estimate. However, as the bandwidth increases, the average distance between the local points and $x$ increases, which can result in a larger bias in the estimator. A natural way to choose a bandwidth that balances this trade-off is to minimize the mean squared error (MSE), as discussed by Fan and Gijbels.
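As a hedged illustration of this trade-off, the standard leading-order expressions for the local linear estimator at an interior point (as given, for example, in Wand and Jones (1995)) can be written as follows. The symbols $\mu_2(K) = \int u^2 K(u)\,du$, $R(K) = \int K(u)^2\,du$ and the design density $f$ are notation introduced here for the illustration, and homoscedastic errors with variance $\sigma^2$ are assumed.

```latex
\begin{align*}
\mathrm{Bias}\{\hat m(x)\} &\approx \tfrac{1}{2}\, h^{2}\, m''(x)\, \mu_2(K), \\
\mathrm{Var}\{\hat m(x)\} &\approx \frac{R(K)\,\sigma^{2}}{n h f(x)}, \\
\mathrm{MSE}\{\hat m(x)\} &\approx \tfrac{1}{4}\, h^{4}\, m''(x)^{2}\, \mu_2(K)^{2}
  + \frac{R(K)\,\sigma^{2}}{n h f(x)}, \\
h_{\mathrm{opt}}(x) &= \left[\frac{R(K)\,\sigma^{2}}{\mu_2(K)^{2}\, m''(x)^{2}\, f(x)\, n}\right]^{1/5}.
\end{align*}
```

The squared bias grows like $h^4$ while the variance shrinks like $1/(nh)$, so setting the derivative of the approximate MSE to zero gives a minimizing bandwidth of order $n^{-1/5}$.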
In addition to selecting the optimal bandwidth, it is also important to select the appropriate order of the polynomial to fit. As when choosing a bandwidth, there is a trade-off between bias and variance. Higher-order polynomials allow for more precise fitting, meaning the bias will be small, but as the order increases so does the variance, although this increase is not steady. The asymptotic variance of the estimator only increases when the order goes from odd to even. There is no loss when going from p = 0 to p = 1, but going from p = 1 to p = 2 increases the asymptotic variance. This suggests considering only odd-order polynomials, since the gain in bias appears to come with no associated cost in variance (Fan and Gijbels; Wand and Jones).
Fan and Gijbels suggest an adaptive method of choosing the order of the polynomial based on a local factor, allowing p to vary for different points in the support of the data. The resulting estimator has the property of being robust to the bandwidth. This means that if the chosen bandwidth is too large, a higher-order polynomial is chosen to better model the boundaries of the data. If the chosen bandwidth is too small, a lower-order polynomial is chosen to help make the estimate numerically stable and reduce the variance. Therefore, one should select an appropriate bandwidth and order of the polynomial to balance the trade-off between bias and variance and so give an appropriate amount of smoothing.
2.5. Assumptions Used in Developing the Estimator in the Current Study
To prove the convergence property of the proposed estimator, the study adopts a theoretical framework in which the population size $N$, the sample size $n$ and the sampling interval $k$ all tend to infinity. A sample is then selected as described in Section 2.2.
We make the following additional assumptions on the study variable, the design and the smoothing parameter.
A1: The errors $\varepsilon_j$ are independent with mean zero, variance $\sigma^2$ and compact support, uniformly for all $N$.
A2: For each $N$, we consider the $x_j$ as fixed with respect to the superpopulation model. The $x_j$'s are independent and identically distributed.
, where is the density function with compact support and
A3: The sample size and the sampling interval are positive integers with . It is assumed that and allow or
A4: As , it is assumed and where
A5: The kernel function is a compactly supported, bounded, symmetric kernel with assume that
A6: The derivative of the mean function exists and is bounded on
2.6. The Proposed Estimator
This study employs a model-based approach in which a consistent variance estimator of the systematic sample mean is proposed under a nonparametric model, using local polynomial regression as the method of estimation. The bias correction term considered by Montanari and Bartolucci (1998, 2006) is not included in the estimator, and the variance function of the model is assumed to be homoscedastic. Let $x = (x_1, \dots, x_N)'$ be a vector of a univariate auxiliary variable; then the nonparametric superpopulation model is given by

$$Y_j = m(x_j) + \varepsilon_j, \qquad j = 1, \dots, N, \tag{7}$$

where $m(\cdot)$ is an unspecified smooth mean function and the errors $\varepsilon_j$ have mean zero and constant variance $\sigma^2$.
Now the design variance in equation (3) can be written as
with here is the Kronecker product and is a column vector of of length .
Let be a continuous and bounded function and define, it is assumed that are bounded and positive where
Under model (7), the expected value of is
To estimate , the following local polynomial regression estimator for variance of systematic sample means is proposed
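where $\hat{m}(x_j)$ is the local polynomial regression estimator obtained from the sample, which in the form given by Wand and Jones (1995) is

$$\hat{m}(x_j) = e_1'\left(X_{sj}' W_{sj} X_{sj}\right)^{-1} X_{sj}' W_{sj}\, Y_S,$$

where $e_1$ is the first column of the identity matrix of order $p+1$, having 1 in the first entry and 0 in all other entries, and $p$ denotes the degree of the local polynomial regression. Here $X_{sj}$ is the $n \times (p+1)$ matrix whose row for sample unit $i$ is $\left(1, (x_i - x_j), \dots, (x_i - x_j)^p\right)$, $Y_S$ is the vector of sample observations, and $W_{sj}$ is the diagonal matrix of kernel weights $K\{(x_i - x_j)/h\}/h$,
where $h$ is the smoothing parameter and $K$ the kernel function.
In developing the current estimator, reference is made to the Wand and Jones (1995) version of the local polynomial regression estimator.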
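The following sketch shows one way a plug-in estimator of this kind could be computed. It is not the study's exact expression: it assumes a local linear fit ($p = 1$), an Epanechnikov kernel, auxiliary values known for the whole sorted population, a simple mean-squared-residual estimate of $\sigma^2$, and the quadratic-form representation $\mathrm{Var}_p(\bar{Y}_S) = Y'DY/(kn^2)$ with $D = M'HM$, $M = 1_n' \otimes I_k$ and $H = I_k - 1_k 1_k'/k$. The names `local_linear`, `design_matrix_D` and `v_np` are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1, 0.75 * (1 - u ** 2), 0.0)

def local_linear(x_eval, x, y, h):
    """Local linear fits at each point of x_eval, using the data (x, y)."""
    fits = np.empty(len(x_eval))
    for i, x0 in enumerate(x_eval):
        w = epanechnikov((x - x0) / h)
        X = np.column_stack([np.ones_like(x), x - x0])
        XtW = X.T * w
        fits[i] = np.linalg.solve(XtW @ X, XtW @ y)[0]
    return fits

def design_matrix_D(n, k):
    """D = M' H M, so that Y' D Y / (k n^2) is the design variance."""
    M = np.kron(np.ones((1, n)), np.eye(k))     # k x N; row r sums the units of sample r
    H = np.eye(k) - np.ones((k, k)) / k         # centring matrix
    return M.T @ H @ M

def v_np(x_pop, sample_idx, y_sample, n, k, h):
    """Plug-in estimate of the model expectation of the design variance (assumed form)."""
    m_hat = local_linear(x_pop, x_pop[sample_idx], y_sample, h)
    resid = y_sample - m_hat[sample_idx]
    sigma2_hat = np.mean(resid ** 2)            # simple residual variance (an assumption)
    D = design_matrix_D(n, k)
    return (m_hat @ D @ m_hat + sigma2_hat * np.trace(D)) / (k * n ** 2)

# Hypothetical population: linear trend plus homoscedastic noise, sorted by x.
rng = np.random.default_rng(0)
n, k = 20, 10
N = n * k
x_pop = np.sort(rng.uniform(0.0, 1.0, N))
y_pop = 5.0 + 2.0 * x_pop + rng.normal(0.0, 0.5, N)

r = rng.integers(0, k)                          # 0-based random start
sample_idx = np.arange(r, N, k)
estimate = v_np(x_pop, sample_idx, y_pop[sample_idx], n, k, h=0.25)
print(estimate)
```

Under homoscedastic errors the trace term reduces to $\hat{\sigma}^2 n(k-1)$, so the estimate is the sum of a trend component and a noise component of the design variance.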
Under assumption A1-A6, the design variance is model consistent for the anticipated variance in the sense that
And the local polynomial variance estimator is model consistent for the anticipated variance for the design variance in the sense that
And the best bandwidth should satisfy the condition
which leads to the usual optimal rate for local polynomial regression. Bandwidth selection procedures such as plug-in or cross-validation methods can be used in this case. This study provides the proof of equation (11). For the proofs of equations (10) and (12), see Li (2006).
2.7. Proof of Equation (11)
From equation 11,
The first term on the right hand of equation 13 can be written as
Note that because they are both scalars.
By definition of matrix , can be written as
Here is the smoother matrix and where and are defined. In this case for simplicity we will use to denote . Now expanding the parentheses in the following expression is obtained
The right hand side of equation 14 has four terms; each part will be calculated one by one.
(i) first let us investigate . Using the technique similar to the one used by Wand and Jones(1995).
Then by Taylor theorem
And is a vector of Taylor series remainder terms, therefore,
Under assumptions A2 and A3, by a lemma of Breidt and Opsomer (2000), for a certain point there are at least points in the interval. So is invertible.
Lemma 1: Assume that the kernel function is bounded above, then
The proof of Lemma 1 is provided by Li (2006). Thus, if A4 holds, then by Lemma 1 we have
Note that the order of a matrix is the same as its inverse, therefore,
Now we compute in equation 14
Where is the variance covariance matrix of the model 7 and
By Lemma 1, Li (2006) shows that
Thirdly we now compute in 14 and using the results in 15 we get
the last term on the right hand side of 14 is .
Li (2006) shows that
Assumption A3 implies that and by 16, 17 and 19
Similarly, and is calculated under A3
Also note that and thus by 20, 21 and 22
this implies that
Next using a similar approach to that of A
Now let us calculate in 13
Li (2006) shows that
And by 24, 25 and 26, we have
Therefore by 23 and 26
hence the result.
2.8. Simulation Study
To further investigate the statistical properties of the above variance estimators, a simulation study was performed. For simplicity, the researcher considered the case where there was only one auxiliary variable $x$. It is also assumed that the errors are independently and normally distributed with homogeneous variances. Two superpopulation models are examined. One is the linear model
The other is the quadratic model
The bigger the $R^2$, the bigger the predictive power of the model. The two levels of $R^2$ considered correspond to a "precise" model and a "diffuse" model.
To draw a systematic sample, the population first needs to be sorted. Three ways are considered: (1) Sort by auxiliary variable $x$; (2) Sort by , where and . Choose to make (3) Sort by , where and . Choose to make . Populations of size are generated. To achieve this, values of the model variable $x$ from the uniform distribution on and values of the error $\varepsilon$ from were generated. The values of the response variable $y$ were then computed from models (28) and (29). Two systematic samples of size and , with corresponding sampling intervals and , are considered respectively. To draw a systematic sample, the data are first sorted, either by or , from the smallest to the largest; an observation is then chosen at random from the first $k$ observations, say the $r$-th one. The selected sample then consists of the observations with the following subscripts: $r, r+k, \dots, r+(n-1)k$.
For each simulation, the corresponding $\hat{V}_{NP}$, $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$ are calculated. $\hat{V}_{NP}$ is calculated using two bandwidth values: and , and each simulation setting is repeated B = 10 000 times. The researcher then compares the performance of the nonparametric variance estimator $\hat{V}_{NP}$ with the overlapping differences estimator $\hat{V}_{OL}$ and the non-overlapping differences estimator $\hat{V}_{NO}$, which are recommended by Wolter (2007), and with the simple random sampling estimator $\hat{V}_{SRS}$. The relative bias (RB) and the mean squared error (MSE) are calculated. Let $\hat{V}$ represent any of $\hat{V}_{NP}$, $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$; then
where denotes the expectation under the superpopulation model , and denotes the expectation under both the model and design.
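A scaled-down version of such a simulation can be sketched as follows. Everything specific here is an assumption made for illustration: the linear model coefficients, the error standard deviation, the sizes $N$, $n$ and $B$, and the working definition of relative bias as the average estimate minus the average true design variance, divided by the average true design variance. Only the SRS and overlapping-differences estimators are included to keep the sketch short.

```python
import numpy as np

def design_variance(y, n):
    """Exact design variance of the systematic sample mean (N = n * k)."""
    k = y.size // n
    means = y.reshape(n, k).mean(axis=0)
    return np.mean((means - y.mean()) ** 2)

def v_srs(ys, N):
    n = ys.size
    return (1 - n / N) / n * ys.var(ddof=1)

def v_ol(ys, N):
    n = ys.size
    return (1 - n / N) / n * np.sum(np.diff(ys) ** 2) / (2 * (n - 1))

def relative_bias_study(B=500, N=1000, n=50, sigma=0.5, seed=0):
    """Monte Carlo relative bias over model and design (assumed definition)."""
    rng = np.random.default_rng(seed)
    k = N // n
    x = np.sort(rng.uniform(0.0, 1.0, N))       # population sorted by x
    estimates = {"SRS": [], "OL": []}
    true_var = []
    for _ in range(B):
        y = 5.0 + 2.0 * x + rng.normal(0.0, sigma, N)   # hypothetical linear model
        r = rng.integers(0, k)                           # 0-based random start
        ys = y[r::k]                                     # systematic sample
        estimates["SRS"].append(v_srs(ys, N))
        estimates["OL"].append(v_ol(ys, N))
        true_var.append(design_variance(y, n))
    target = np.mean(true_var)
    return {name: (np.mean(v) - target) / target for name, v in estimates.items()}

rb = relative_bias_study()
print(rb)
```

With the population sorted by $x$ and a linear trend, the SRS-based estimator should show a large positive relative bias, while the overlapping-differences estimator should be close to unbiased; adding a nonparametric estimator to the dictionary would allow the same comparison for it.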
3. Simulation Results and Discussion
This section presents the results obtained through the simulation discussed in Section 2.8.
3.2. Results and Discussion
Table 2 gives the relative biases of $\hat{V}_{NP}$, $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$ for the sample of size n = 500, for different sorting variables, with homoscedastic errors. The relative biases of the nonparametric estimator are computed at different bandwidths. The results in the table show that, provided a proper bandwidth is chosen, the nonparametric estimator performs better overall than the other three estimators, giving smaller biases, most of which are negative. In particular, when the superpopulation model is linear, $\hat{V}_{NP}$ tends to favor a bigger bandwidth. This is because local linear regression was used in the calculation of $\hat{V}_{NP}$. A bigger bandwidth results in more points in the neighborhood of each point $x$; because the local linear fit is the correct specification for a population with a linear trend, having more points increases the precision of each local linear regression.
When the superpopulation model is quadratic, $\hat{V}_{NP}$ tends to favor a small bandwidth. This is because, as discussed above, a linear fit does not estimate a quadratic trend well. In other words, the wider the neighborhood, the more likely it is that a quadratic trend will be visible there, so a local linear regression on that neighborhood could fit poorly. When the bandwidth is small, the trend within each local interval is approximated well by a linear trend.
The estimator based on simple random sampling performed poorly, resulting in large biases in both cases as it overestimated the true variance.
It can also be seen that $\hat{V}_{NP}$, $\hat{V}_{OL}$ and $\hat{V}_{NO}$ have smaller biases under the linear and quadratic models when the population is sorted by $x$ before drawing a systematic sample. This is because $\hat{V}_{OL}$ and $\hat{V}_{NO}$ capture the population trend very well and are thus very efficient. When the sorting variable is not related to the population, that is, when sorting the population by and , the overlapping and non-overlapping difference estimators cannot capture the population trend well, resulting in large biases and MSE.
Table 3 gives the ratios of the MSE of $\hat{V}_{NP}$ evaluated at two bandwidth values (h = 0.25 and h = 0.5), and of the MSE of the other estimators, to the MSE of $\hat{V}_{NP}$ evaluated at bandwidth h = 0.1. The MSE combines the variance and the squared bias of an estimator, and smaller MSE values are desired. It can be seen from this study that $\hat{V}_{NP}$ performs better than the variance estimators $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$, as it has smaller MSE values in almost all the cases of the linear and quadratic models.
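Table 2. Relative bias (%) of $\hat{V}_{NP}$ with bandwidths h = 0.1, 0.25 and 0.5, and of $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$, with n = 500.
Table 3. MSE (%) of $\hat{V}_{NP}$ with bandwidths h = 0.25 and 0.5, and of $\hat{V}_{OL}$, $\hat{V}_{NO}$ and $\hat{V}_{SRS}$, with n = 500, divided by the MSE of $\hat{V}_{NP}$ with bandwidth h = 0.1, under homoscedastic errors.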
4. Conclusions and Recommendation
The aim of this study was to develop a design unbiased estimator of the variance of the systematic sample mean using local polynomial regression as the estimation technique. The study shows that the estimator based on the nonparametric model (7), using local polynomial regression as the estimation technique, is a consistent estimator of the design variance of the systematic sample mean. In comparison with the other estimators discussed in Wolter (2007), the local polynomial estimator performed better in all three cases. Therefore, this estimator has proved to be consistent and unbiased in estimating the design variance of the systematic sample mean.
Hence, in practice, this study recommends the use of the nonparametric estimator for estimating the variance of the systematic sample mean over the estimators discussed by Wolter (2007).
MSE - Mean Squared Error
RB - Relative Bias
NO - Non-Overlapping differences
OL - Overlapping differences
SRS - Simple Random Sampling
SRSWOR - Simple Random Sampling Without Replacement
NP - Non-Parametric model
I thank the almighty God for granting me the knowledge to carry out this study. My sincere gratitude goes to my colleagues for their company, their readiness to read this material and their corrections. I cannot overlook the love and support of my family and friends.
Special thanks go to my supervisor, Dr. G. Orwa, for his presence from the start of this work to the end. I highly appreciate the moral and academic support he gave me. I also thank my supervisor, Prof. R. Odhiambo, for his scholarly and professional assistance and his readiness to correct my work. I also express my gratitude to the staff of the statistics department for their friendly guidance throughout my study. May God bless you all.