Bootstrap Confidence Interval for Model Based Sampling

The bootstrap approach to statistical inference in sample surveys has seen considerable development in the recent past. In the model-based approach to sample survey theory, the main interest has been robustness under model misspecification, and some authors have suggested the bootstrap method under restrictive model specifications as a way of achieving this. In this study, bootstrap and conventional confidence intervals for the population total are constructed in model-based surveys under simple random sampling without replacement. The aim is to provide a better measure of the uncertainty associated with estimates of the population total than the corresponding rival confidence intervals under a restrictive model. To achieve this, bootstrap simulations of the population of interest are generated under an assumed general model. The bootstrap method is less cumbersome to apply and, in terms of coverage performance of the 95% confidence interval, it performs better than the corresponding conventional method. In terms of length, the confidence intervals generated by the bootstrap method are much shorter than their conventional counterparts. The best performing confidence interval is taken to be the one whose coverage rate is close to the nominal level and whose length is small. The study results provide insight into constructing better confidence intervals for estimators of the finite population total.


Introduction
Sample survey theory is concerned with methods of sampling from a finite population of N identifiable units and then making inferences about finite population quantities on the basis of the sample data. A method of sampling together with a method of estimation given the sample data is commonly known as a sampling strategy: a set of rules that defines how to obtain sample units from the finite population and how to use the resulting sample data to estimate the value of the population quantity. There are different approaches to specifying a sampling strategy, including the design-based, model-assisted and model-based approaches. In considering all these approaches, [1] has suggested that the model-based approach performs better than the other two, although no single approach gives both efficiency and robustness.

Statement of the Problem
The major concern in the model-based approach to statistical survey inference is finding estimators of the population parameters of interest that remain robust under model misspecification, so as to make robust inferences in finite populations. The authors of [2] used a restrictive super population model to construct a confidence interval for the population mean in the case of the ratio estimator. In this study, a more general super population model is used to construct a bootstrap confidence interval for the population total under simple random sampling without replacement.

Significance of the Study
The use of the general super population model, $Y_i = m(x_i) + e_i$ $(i = 1, 2, \ldots, N)$, gives rise to a confidence interval for the population total which is robust, since all the population values $Y_i$ $(i = 1, 2, \ldots, N)$ are consistent with the general model, as opposed to the restrictive model. The bootstrap confidence interval under the general model gives a better measure of the uncertainty associated with estimates of the population total than the corresponding rival confidence intervals under the restrictive model and the conventional ones.

General Objective
Construct bootstrap confidence intervals for the finite population total under a general super population model in simple random sampling without replacement.

Specific Objectives
(i) Construct confidence intervals for the population total based on simple random sampling without replacement. (ii) Simulate and determine the confidence intervals for finite population estimators using bootstrap and conventional methods. (iii) Compare the confidence intervals for bootstrap and conventional methods.

Study Hypothesis
The 95% bootstrap confidence interval performs better than the corresponding conventional confidence interval.

Basic Assumptions
Given a finite population P of size N, let Y denote the variable of interest with values $Y_i$ $(i = 1, 2, \ldots, N)$, and let X denote an auxiliary variable with corresponding population values $X_i$ $(i = 1, 2, \ldots, N)$. It is assumed that the $X_i$ values are all known but the values $Y_i$ are known only for a sample of $n \le N$ of the population elements. One way of characterizing the sample selection is to assume that for every unit i on the list of N units making up the finite population, also known as the frame, a new variable $s_i$ takes a value equal to the number of times that unit's Y value is observed. The distribution of these $s_i$ values defines the design of the sample survey. Once the sample has been chosen, the values $\{Y_j, j \in s\}$ are known. The problem is how to use the sample values together with the known values of X to make an inference about the unknown population total $T = \sum_{i=1}^{N} Y_i$. The three approaches to this problem, namely the design-based, model-assisted and model-based approaches, are discussed in the sections that follow.

Introduction
The model-based approach to statistical surveys has been described in [3] and [4]. The idea of non-parametric regression smoothing has been discussed in [5] and [6]. The use of non-parametric regression in estimating population parameters when data are missing has been discussed in [7]. The author of [8] considered a non-parametric regression model for estimating population totals in finite populations. In that study, non-parametric regression based estimators of the population total were considered and their performance was compared with that of the corresponding design-based and linear regression estimators.
The authors of [9] assumed a non-parametric regression model and developed a new class of model-assisted estimators for T based on local polynomial regression. In their simulation study, the estimator performs better than the Horvitz-Thompson estimator. In sample survey theory, it is important to construct a confidence interval for the population parameter under investigation. One way of achieving this is through the conventional method, which assumes that the sample size is large enough for the central limit theorem to be applicable. However, this is not always true, and as a consequence [2] proposed the bootstrap methodology as a way of addressing this problem.
The bootstrap approach to statistical inference is described in [10], where it is demonstrated how to apply the bootstrap in design-based survey sampling under different sampling designs, including stratified cluster sampling with replacement, stratified simple random sampling without replacement, unequal probability sampling without replacement, and two stage cluster sampling with equal probabilities and without replacement. The use of the bootstrap in model-based surveys was first suggested by [11] and developed by [2]. The latter work forms the basis of this research. The method of constructing confidence intervals suggested by [2] involved the use of the mean and the variance. The authors made use of a linear unbiased estimator obtained from the ratio between the means of the y-sample and the x-sample, together with the ratio estimator.
The model-based approach to the above problem is based on the assumption that the values of Y are realizations of random variables whose distribution, conditional on the known values of X, may be specified through a convenient probability model. The authors of [2] proposed modifications of their procedure to take account of misspecification in the working model. They noted that estimators obtained through successive model refinements using the bootstrap approach were more efficient than rival estimators obtained by other methods. However, the evidence of their extended simulation study showed that the research did not fully attain its goal. The recommended construction of confidence intervals using the bootstrap approach thus requires further investigation.

Design-Based Inference
The design-based approach to constructing confidence intervals first involves the choice of an appropriate design. This can be conceived as a procedure of repeatedly drawing samples of size n. To complete the inference, an estimator is defined for T and the distribution of this estimator over repeated samples is evaluated. However, a result in [12] shows that there can be no best estimator; one therefore requires only a criterion, such as unbiasedness or consistency, that defines a reasonable set of linear estimators. The inferences are then based on the limiting distributions resulting from the induced randomization. In probability sampling designs, it is assumed that each $Y_i$ $(i = 1, 2, \ldots, N)$ in the population has a definite probability of being included in the sample. This approach, however, requires an infinite sequence of sample values in order to apply the central limit theorem; consequently, it is best suited to large scale surveys.
The main concept under this approach is that of design unbiasedness: for any choice of sampling process S, the weighted average value of $\hat{T}$ over all possible samples generated under S is the actual value of T. This approach therefore restricts consideration to those weights W which ensure design unbiasedness irrespective of the sample selection process S chosen, for all values of X and Y. However, as noted by [1], no uniformly optimal sampling strategy exists under the design-based approach. For example, consider the population defined by $Y_i > 0$ and $Y_j > 0$, $i \neq 1$, and use the weighting scheme $W_1$ [...] such that $E(S_1 \mid X) = 1$. However, this restricted strategy is no longer optimal if we apply it to another population where $Y_i > 2$ and $Y_j = 0$, $j \neq 2$.

Model Based Inference
The model-based approach to statistical survey sampling has been described and developed in [3] and [4]. The idea of non-parametric regression goes back to [5] and [6]. The model-based approach to the above problem is based on the assumption that the values of Y are realizations of random variables whose distribution, conditional on the known values of X, may be specified through a convenient probability model. For example, consider a linear regression model in which
$$Y_i = \alpha + \beta x_i + \sigma(x_i) e_i,$$
where $\alpha$ and $\beta$ are unknown, $\sigma(x_i)$ is known and $\{e_i\}$ is a sequence of independently and identically distributed random variables with zero mean and unknown variance.
In estimating the population total T, the unknown $\alpha$ and $\beta$ are replaced by $\hat{\alpha}$ and $\hat{\beta}$, their best predictors respectively.
The problem in the model-based approach to survey sampling is finding robust estimators of the population parameters of interest. Suppose the population values of Y are assumed to be generated by a linear regression model through the origin, model (4), and that the ratio estimator is formed from $\bar{y}_s$ and $\bar{x}_s$, the means of the sample values of Y and X respectively, and $\bar{X}$, the population mean of X. Model (4) states that the regression line of Y on X passes through the origin and that the $Y_i$ are independent. Suppose some or all of these conditions are not true: will the ratio estimator still be unbiased? Will it still be optimal? Such problems are known as robustness problems; they point out the weaknesses of the estimator being applied. An estimator which is optimal under the assumed model and remains optimal, or approximately so, when there are errors in the model is commonly referred to as a robust estimator.
The non-parametric regression model for estimating population totals in finite populations was considered by [8]. A non-parametric regression based estimator of the population total was proposed, and in developing the estimator it was assumed that the population values are generated by the model
$$Y_i = m(x_i) + e_i, \quad i = 1, 2, \ldots, N,$$
where $m(\cdot)$ is a smooth function and $\{e_i\}$ is a sequence of independent random variables with mean zero and variance $\sigma^2(x_i)$. The non-parametric estimator of the population total due to [8] is given by
$$\hat{T}_D = \sum_{i \in s} Y_i + \sum_{j \notin s} \sum_{i \in s} w_i(x_j) Y_i,$$
where $w_i(x_j)$ is the weight associated with the i-th unit of the sample for the selected bandwidth h. The error variance of (6) is given by [8]. In the empirical study, [8] illustrates that the estimate $\hat{T}_D$ performs well compared to the corresponding design-based and linear regression estimators. The author of [7] also discussed the use of non-parametric regression in estimating population parameters when data are missing. However, the work by [8] has yet to be extended to more complex designs such as two stage cluster sampling.
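As an illustration of a non-parametric total estimator of this kind, the following minimal sketch sums the observed sample values and adds kernel-smoothed predictions for the non-sample units. The Nadaraya-Watson form of the weights, the function names and the equal-weight fallback are assumptions made for illustration; the exact weights used in [8] are not reproduced in the text.

```python
import math


def gaussian_kernel(u):
    """Standard normal density K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def nw_weights(x0, xs_sample, h):
    """Normalised kernel weights of the sample units at the point x0.

    Assumed form (Nadaraya-Watson): w_i(x0) = K((x0 - x_i)/h) / sum_k K((x0 - x_k)/h).
    """
    raw = [gaussian_kernel((x0 - xi) / h) for xi in xs_sample]
    total = sum(raw)
    if total == 0.0:
        # Every kernel value underflowed; equal weights are an illustrative fallback.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


def total_estimate(x_pop, sample_idx, y_sample, h):
    """Observed sample total plus smoothed predictions for the non-sample units."""
    in_sample = set(sample_idx)
    xs = [x_pop[i] for i in sample_idx]
    observed = sum(y_sample)
    predicted = 0.0
    for j, xj in enumerate(x_pop):
        if j in in_sample:
            continue  # Y is known for sampled units, so no prediction is needed
        w = nw_weights(xj, xs, h)
        predicted += sum(wi * yi for wi, yi in zip(w, y_sample))
    return observed + predicted
```

When every unit is sampled the function returns the plain sum of the observed values, and when all observed values are equal each prediction equals that common value, which gives two quick sanity checks.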

Model Assisted Inference
Consider a general linear estimator of T of the form
$$\hat{T} = \sum_{i=1}^{N} S_i W_i(S, X) Y_i.$$
In this expression $W_i(S, X)$ is the weight associated with population unit i when that unit is selected into the sample. Hence the sampling strategy consists of:
(i) Given X, choosing an appropriate distribution of S.
(ii) Given S and the distribution generated under (i), choosing an appropriate specification for W.
In order to estimate T, the model-assisted approach assumes that the resulting estimator $\hat{T}$ is design unbiased, or approximately so. A distribution for S is sought which minimizes the expected value of the design mean squared error when the Y values are consistent with the known values in X. The author of [1] argues that there are many other design unbiased strategies which also satisfy the average design unbiasedness condition; however, this condition is weak. The authors of [9] assumed model (6) and developed a new class of model-assisted estimators for T based on local polynomial regression. In their simulation study, the resulting estimator $\hat{T}_{LP}$ performs better than the Horvitz-Thompson estimator. The work by [9] was recently extended to two stage sampling by [13].
The author of [1] has considered a unified framework for survey design and estimation covering the design-based, model-based and model-assisted approaches. In contrasting them on the basis of their concepts of efficiency and robustness, given the assumptions made about the characteristics of the finite population, it is concluded that, although none of these approaches gives both efficiency and robustness, the model-based approach performs better than the design-based and model-assisted approaches. The authors of [2] proposed the bootstrap to overcome the above problem and assumed a restrictive nonparametric model to construct a bootstrap confidence interval even when the sample is not large enough for the central limit theorem to hold. In order to obtain a robust confidence interval, they subjected their model to multiple modifications. The empirical results showed that their objective, which was to construct a sound confidence interval, was not attained. In this study, the bootstrap method is applied to construct a confidence interval for the population total T, assuming a general super population model as the working model. This avoids the multiple modifications undertaken by [2].

Sample Size
The sample size for the finite population is obtained using the formula given in [14].
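The formula from [14] is not reproduced in the text. Purely as an assumption, the sketch below shows two commonly used finite-population sample size formulas, the Yamane form $n = N / (1 + N e^2)$ and the Cochran form with finite population correction. The margin of error e = 0.05, the 95% normal quantile z = 1.96 and the proportion p = 0.5 are illustrative defaults, not values stated in the study. Which formula [14] actually uses cannot be determined from the text.

```python
import math


def yamane_sample_size(N, e=0.05):
    """Assumed formula: n = N / (1 + N * e^2), rounded up."""
    return math.ceil(N / (1.0 + N * e * e))


def cochran_fpc_sample_size(N, e=0.05, z=1.96, p=0.5):
    """Assumed formula: n0 = z^2 p (1 - p) / e^2, then n = n0 / (1 + (n0 - 1) / N)."""
    n0 = (z * z * p * (1.0 - p)) / (e * e)
    return math.ceil(n0 / (1.0 + (n0 - 1.0) / N))


if __name__ == "__main__":
    print(yamane_sample_size(49), cochran_fpc_sample_size(49))
```

With N = 49 both assumed formulas give a sample size of 44, which agrees with the sample size used in the empirical procedure.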

Selection of Bandwidth for Gaussian Kernels
The optimal bandwidth is selected by comparing results from several techniques [15]: the rule of thumb for choosing the bandwidth of a Gaussian kernel density estimator (nrd0), the more common variation of that rule which uses the factor 1.06 (nrd), unbiased cross-validation (ucv), biased cross-validation (bcv), and the bandwidth that minimizes the estimation error of the finite population total (mcv).
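The two rule-of-thumb selectors can be sketched as follows. The formulas follow the usual definitions of the rules named nrd0 and nrd, namely a factor (0.9 or 1.06) times the smaller of the sample standard deviation and IQR/1.34, times $n^{-1/5}$. The function names and the handling of a zero interquartile range are simplifications assumed here, and the cross-validation selectors (ucv, bcv, mcv) are not shown.

```python
import statistics


def _spread(x):
    """min(sample sd, IQR / 1.34), falling back to the sd if the IQR is zero."""
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr_scaled = (q3 - q1) / 1.34
    return min(sd, iqr_scaled) if iqr_scaled > 0 else sd


def bw_nrd0(x):
    """Rule-of-thumb bandwidth with factor 0.9."""
    return 0.9 * _spread(x) * len(x) ** (-0.2)


def bw_nrd(x):
    """Variation of the rule of thumb with factor 1.06."""
    return 1.06 * _spread(x) * len(x) ** (-0.2)
```

Because the two rules differ only in the leading factor, nrd is always 1.06/0.9 times nrd0 for the same data.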

Confidence Intervals
Good statistical practice requires that a confidence interval be constructed around the point estimator in order to provide a properly scaled measure of the uncertainty associated with the estimator. Suppose $\hat{T}$ is an unbiased estimator of the population total T. The conventional method is to calculate a model unbiased point estimator $\hat{V}$ of the model variance of the estimation error $\hat{T} - T$. Researchers have suggested several model variance estimators, for instance the heteroscedasticity robust estimator investigated by [16] and [17] for the ratio estimator $\bar{Y}_R$ of the population mean $\bar{Y}$, which is expressed in terms of the population mean $\bar{X}$ and the sample mean $\bar{x}_s$ of X. The authors of [11] have considered application of the bootstrap to estimate the model variance. Their estimator under general conditions is expressed in terms of ridge weights $w_i$, a matrix $\Omega$, a diagonal matrix C, a ridge parameter $\lambda$ and $U = \mathrm{diag}(u_1, u_2, \ldots, u_b)$. Using the conventional method, the $100(1-\alpha)\%$ confidence interval for the population total T [18], [19] is given by
$$\hat{T} \pm t_{n-1,\,1-\alpha/2} \sqrt{\hat{V}},$$
where $t_{n-1,\,1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the t-distribution with n-1 degrees of freedom and $\hat{V}$ is the estimated variance of the estimation error $\hat{T} - T$. However, the conventional method is based on the assumption that the sample size is large enough for the central limit theorem to apply, and this is not always true in practice.
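The conventional interval amounts to the point estimate plus or minus a t quantile times the square root of the estimated error variance. A minimal sketch is given below; the function name is hypothetical, the t quantile is passed in by the caller rather than computed, and the numbers in the example are made up for illustration.

```python
import math


def conventional_ci(t_hat, v_hat, t_crit):
    """Return (lower, upper) = t_hat -/+ t_crit * sqrt(v_hat).

    t_crit is the (1 - alpha/2) quantile of the t-distribution with n - 1
    degrees of freedom, supplied by the caller (for example from tables).
    """
    if v_hat < 0:
        raise ValueError("variance estimate must be non-negative")
    half_width = t_crit * math.sqrt(v_hat)
    return t_hat - half_width, t_hat + half_width


if __name__ == "__main__":
    # Made-up values: estimate 1000, variance estimate 100, critical value 2.0.
    print(conventional_ci(1000.0, 100.0, 2.0))
```

Rejecting negative variance estimates outright is a design choice of this sketch, so that a negative value never silently produces an invalid interval.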

Bootstrap Confidence Intervals
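The model assumed is
$$Y_i = m(x_i) + \sigma(x_i) e_i,$$
where $m(\cdot)$ is a smooth function, the $e_i$ are independent random variables with mean zero and constant variance, and $\sigma(\cdot)$ is a smooth, non-negative function. Bootstrap simulations of the population values of Y are generated using this model. This is done N times to obtain the bootstrap values $Y_1^*, Y_2^*, \ldots, Y_N^*$. Then $\hat{T}_D$ given in (10) is calculated from the bootstrap values to obtain $\hat{T}_D^*$.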
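The exact expression used to generate the bootstrap values is not reproduced in the text. One common way to generate a bootstrap population from a fitted smooth model is a residual bootstrap, sketched below under stated assumptions: the mean function is fitted by a Gaussian-kernel Nadaraya-Watson smoother, the errors are treated as homoscedastic (the scale function is ignored), and centred sample residuals are resampled with replacement. All function names are hypothetical, and this is not claimed to be the authors' scheme.

```python
import math
import random


def nw_fit(x0, xs, ys, h):
    """Nadaraya-Watson estimate of m(x0) using a Gaussian kernel."""
    w = [math.exp(-0.5 * ((x0 - xi) / h) ** 2) for xi in xs]
    sw = sum(w)
    if sw == 0.0:
        return sum(ys) / len(ys)  # illustrative fallback when all weights underflow
    return sum(wi * yi for wi, yi in zip(w, ys)) / sw


def bootstrap_population(x_pop, xs, ys, h, rng):
    """One bootstrap population Y*_1..Y*_N: fitted mean plus a resampled residual.

    Assumption: residuals are exchangeable (constant error scale).
    """
    residuals = [yi - nw_fit(xi, xs, ys, h) for xi, yi in zip(xs, ys)]
    centre = sum(residuals) / len(residuals)
    residuals = [r - centre for r in residuals]  # centre so the residuals average to zero
    return [nw_fit(xj, xs, ys, h) + rng.choice(residuals) for xj in x_pop]


if __name__ == "__main__":
    rng = random.Random(1)
    y_star = bootstrap_population([1, 2, 3, 4, 5], [1, 3, 5], [2.0, 6.1, 9.8], 1.0, rng)
    print(len(y_star))
```

Passing the random generator in explicitly keeps the simulation reproducible from a chosen seed.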

Conventional Confidence Intervals
In obtaining the conventional confidence interval, the model unbiased point estimate of the variance, $\hat{\sigma}^2(x_j)$ $(j = 1, 2, \ldots, n)$, is calculated as given by [8] in estimator (12), where h is the bandwidth and a pilot estimator based on the scaling factor h is used. The estimator (12) can be negative; the suggestion by [8] is adopted, in which negative values are ignored since they have no significance in the study.

Introduction
In evaluating the performance of bootstrap confidence intervals, the empirical work is based on data provided in [21]. The data give the number of inhabitants in 49 selected states of the United States of America. The values for 1920 are taken as the X values and the values for 1930 as the Y values. The regression of Y on X is approximately linear, so the population of any state in 1930 depends largely on its size in 1920. This is shown in Table 1, based on a parametric linear regression test [22] of the hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$ at the $\alpha = 0.05$ level of significance. The test statistic t = 35.383, with a p-value that is effectively zero, indicates a linear relationship with $\beta_1 \neq 0$ at the $\alpha = 0.05$ level of significance. For each sample randomly selected from the population, both the bootstrap and the conventional confidence intervals are calculated and the results are tabulated. The coverage rates of the two methods are then obtained to compare their performance.
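The slope test can be sketched in a few lines: fit the least squares line, estimate the standard error of the slope from the residuals, and form the ratio. The function name is hypothetical and the data in the example are made up for illustration; they are not the figures from [21].

```python
import math


def slope_t_statistic(x, y):
    """Return (b1, t) for the simple linear regression of y on x.

    b1 is the least squares slope and t = b1 / se(b1), to be compared with a
    t-distribution on n - 2 degrees of freedom. Requires n > 2 and a fit that
    is not exact (non-zero residual sum of squares).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b1, b1 / se_b1


if __name__ == "__main__":
    # Made-up data with a strong linear trend.
    print(slope_t_statistic([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))
```

A large absolute value of t relative to the critical value leads to rejecting $H_0: \beta_1 = 0$.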

Procedure
The survey variable of interest $Y_i$ $(i = 1, 2, \ldots, 49)$ is considered, and its values are assumed known only for the sample, while the values of the auxiliary variable $X_i$ $(i = 1, 2, \ldots, 49)$ are known for the whole population. A total of 1,000 independent samples of size 44 are drawn from the population by simple random sampling without replacement, with the sample size determined by the relationship in [14]. Using the sample values of Y and the corresponding known values of X, the non-sample values are estimated so as to obtain $\hat{T}_D$. In doing so, the following assumptions are made:
(i) K(u) is the standard normal density function; since this function is symmetric, it meets the required criterion.
(ii) The optimal bandwidth h is the one that minimizes the mean average squared error of $\hat{T}_D - T$, where $\hat{T}_D$ is given in equation (10).
In obtaining the bootstrap simulation of the population values, samples are selected without replacement and, using the above assumptions, the bootstrap value $Y_i^*$ for each $Y_i$ is generated. Then $\hat{T}_D$ in (10) is calculated to obtain $\hat{T}_D^*$. This is repeated 1,000 times to obtain $T_1^*, T_2^*, \ldots, T_{1000}^*$ for each sample.
To obtain the 95% bootstrap confidence interval for the population total, the bootstrap values are arranged in order from smallest to largest, and the 2.5th and 97.5th percentiles of the bootstrap population totals are determined for each sample. This is repeated for 1,000 iterations to obtain 1,000 lower and upper confidence limits for the population total T. In evaluating the corresponding conventional confidence interval for the same sample, the model unbiased point estimate $\hat{\sigma}^2(x_i)$ given by (12) is used, and the lower and upper conventional confidence limits are calculated using the conventional interval formula given under Confidence Intervals. The above procedure is repeated for 1,000 iterations for subsequent samples drawn from the population without replacement. Finally, the number of times each method covers the true population total, T = 6,262, is counted in order to obtain the coverage rate of each method. That is,
$$\text{coverage rate} = \frac{\text{number of intervals containing } T}{1{,}000} \times 100\%.$$
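The percentile step and the coverage count described in this procedure can be sketched as follows. The function names are hypothetical, and the linear interpolation between order statistics is one of several percentile conventions, assumed here for simplicity.

```python
import math


def percentile(ordered, q):
    """Percentile of an ascending list by linear interpolation, with 0 <= q <= 1."""
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def percentile_interval(replicates, level=0.95):
    """Bootstrap percentile interval: the alpha/2 and 1 - alpha/2 percentiles."""
    ordered = sorted(replicates)
    alpha = 1.0 - level
    return percentile(ordered, alpha / 2.0), percentile(ordered, 1.0 - alpha / 2.0)


def coverage_rate(intervals, true_total):
    """Percentage of (lower, upper) intervals that contain the true total."""
    hits = sum(1 for lo, hi in intervals if lo <= true_total <= hi)
    return 100.0 * hits / len(intervals)


def mean_length(intervals):
    """Average interval length, the second comparison criterion."""
    return sum(hi - lo for lo, hi in intervals) / len(intervals)
```

In use, `percentile_interval` would be applied to the bootstrap totals of each sample, and `coverage_rate` and `mean_length` to the resulting collection of intervals for each method.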

Empirical Results
The methods of estimating the finite population total are compared using the mean squared error (MSE) criterion, and the results are presented in Table 2. The minimizing cross-validation (mcv) procedure has the smallest mean squared error, followed by the unbiased cross-validation (ucv) and biased cross-validation (bcv) procedures, while the rule of thumb (nrd0) using factor 1.34 and the rule of thumb using factor 1.06 (nrd) have higher mean squared errors. Procedures with smaller mean squared error are considered better at estimating the finite population total than those with larger values. Figure 1 shows the mean squared error of each technique for 20 sample groups: mcv, ucv and bcv show lower mean squared error in all twenty groups, while nrd0 and nrd record higher mean squared errors in all twenty groups. The corresponding confidence intervals for the population total are presented in Table 3. In terms of coverage performance of the 95% confidence interval, the conventional method has higher coverage than the family of nonparametric methods, as shown in Figure 2. In terms of length, the confidence intervals generated by the family of nonparametric methods are much shorter than their conventional counterpart. The best performing confidence interval is taken to be the one whose coverage rate is close to the nominal level and whose length is small. The study results provide insight into constructing confidence intervals for the population total using the family of nonparametric methods, among which the minimizing cross-validation, unbiased cross-validation and biased cross-validation techniques perform better than the rest of the bootstrap confidence interval methods, as shown in Figure 2.

Conclusion
The main objective of the study was to construct confidence intervals for the population total under simple random sampling without replacement. The investigation focused on the application of a general super population model. The evidence from the extended simulation study has shown that the bootstrap method achieves a greater coverage rate than the conventional method; thus, the result is consistent with the research objective. The results of this study could be applied to any bivariate data set on X and Y in which the values $Y_1, Y_2, \ldots, Y_N$ are independent, the values of the auxiliary variable $X_1, X_2, \ldots, X_N$ are all known, and the values of the survey variable Y are known only for the sample.

Recommendations
(i) This study is computer intensive, since simulations of the population values using 1,000 samples drawn by simple random sampling without replacement were considered. There is a need to extend the study to larger simulations, given an appropriate application package or computer program.
(ii) The study could be extended to two stage sampling and to the model-assisted approach.
(iii) Further research comparing these results with those obtained under a restrictive model could be considered as an extension of this study.