Nonparametric Penalized Spline Model Calibrated Estimator in Complex Survey with Known Auxiliary Information at Both Cluster and Element Levels

: The present study uses penalized splines (p- spline) to estimate the functional relationship between the survey variable and the auxiliary variable in a complex survey design; where a population divided into clusters is in turn subdivided into strata. This study has considered a case of auxiliary information present at two levels; at both cluster and element levels. The study further applied model calibration technique by penalty function to estimate the population total. The calibration problems at both levels have been treated as optimization problems and solved using penalty functions to derive the estimators for this study. The reasoning behind model calibration is that if the calibration constraints are satisfied by the auxiliary variable, the study expects that the variable of interest’s fitted values meets such constraints. This study runs a Monte Carlo simulation to assess the finite sample performance of the penalized spline model calibrated estimator under complex survey data. Simulation studies were conducted to compare the efficiency of p-spline model calibrated estimator with Horvitz Thompson estimator (HT) by mean squared error (MSE) criterion. This study shows that the p-spline model-based estimator is generally more efficient than the HT in terms of the mean squared error. The results have also shown that the estimator obtained is unbiased, consistent and very robust because it does not fail if the model is misspecified for the data.


Introduction
In recent years, Nonparametric estimation methods have gained considerable attention due to their flexibility. One of these methods is the penalized spline estimation. This method requires knowing the number and locations of knots, the degree of the polynomial, and the degrees of freedom. The degrees of freedom are known as the equivalent number of parameters. There are practical rules in the existing literature to determine the degree of the polynomial and the number and locations of knots. The degrees of freedom are established according to the user's experience.
Splines can be classified as regression splines, cubic splines, B-splines, penalized-splines, natural splines, thinplate splines, and smoothing splines, according to De Boor (2001), [3]. Nonparametric regression using splines has undergone extensive development in recent years. Smoothing splines (Eubank 1988;Wahba 1990) [6,13] use a knot at each distinct value except the boundary values of the Xvariable and control overfitting by applying a rough-ness penalty. Penalized splines (p-splines), formally introduced by Eilers and Marx (1996), [5], are in general computationally inexpensive and allow flexible knot selection yet yield sound performance. P-splines are also easy to implement: there is a close relationship which implies that they can be fitted using widely-available statistical software such as R software and S-Plus (Pinheiro and Bates, 2000), [9] function lme ().
The Horvitz-Thompson (HT) estimator (Horvitz and Thompson 1952), [7] is a standard design-unbiased estimator of population total and weights cases by the inverse of their inclusion probabilities (Zhen and little 2003), [15]. This estimator was used in this study as a baseline comparison with p-spline model calibrated estimator.
The concept of auxiliary variables in the present scholarship in statistics denotes independent or predictor variables in a regression analysis. As the name suggests, the variables offer additional information and may be used to improve the estimation of population parameters. Breidt and Opsomer (2000), [1] noted that micro-econometric research is frequently performed using data collected by surveys, which may contain auxiliary information for every unit of the population of interest. As can be expected, many of these surveys use complex sampling plans to reduce costs and increase the estimation efficiency for subgroups of the population. Complex sampling designs result in unequal sampling probabilities for the units in the sample and create data with correlations between observations, violating the assumption that the data are independently and identically distributed (iid) (Sayed, 2010), [12]. This means that the number of population units represented by a given sample unit is not uniform across all of the units in the sample. Although the word complex survey has been used mostly by researchers to refer to different combinations of sampling plans, however, in this research, complex survey refers to a mixture of both stratified and cluster sampling methods (Clair, 2016), [2].
In this study, Nthiwa Janiffer Mwende, et al. (2020), [8] work which considered auxiliary information at cluster level only is extended. This study extends this work to consider auxiliary information available at both the element and cluster levels. The extension is by treating the two levels of calibration problems in a cluster-Strata sampling, as constrained nonlinear optimization problems which is then converted into unconstrained optimization problems. The resulting problems were then solved using the penalty function method to obtain the weights at both cluster and cluster-strata element levels assigned to sample observations from some chi-square distance measures.
This study considered the generalized calibration procedure using model calibration as proposed by Wu and sitter (2001), [14]. They considered generalized linear and nonparametric regression models for the super population model ψ given in the equation below: where is a sequence of independent and identically distributed random variables with ( ) = 0 and ( ) = and ℎ( ) is a smooth function that can be estimated using nonparametric methods like kernel, neural network, and penalized splines. Given n pair of sample observations ( , ), … , ( , ) from a population of size , of interest, is the estimator ℎ ( ) of ℎ( ) = ( / ) E y x ψ . For model calibration, calibration is performed to the population mean of the fitted values ℎ ( ) (Wu and sitter, 2001), [14]. The study considers a model calibration estimator for the population total t Y given below.
Where a is a set of sampled units under a general sampling design while , i w s are design weights such that for a given metric, are as close as possible in an average sense to the = . These weights are obtained by minimizing a given distance measure between the , i w s and , i z s subject to some constraints. The chi-squared distance measure to be minimized is as provided in the equation below where i q 's are known positive constants uncorrelated with the i z 's, (Deville and Sarndal, 1992), [4] subject to two constraints equation given below Model calibration method is intended to provide good efficiency if the model is correctly specified but maintains desirable properties like design consistency if the model is misspecified (Sahar, 2012), [11]. The simulations in this study suggest that for estimation of the finite population total, p-spline model-based predictive estimators are, in general, more efficient than the HT estimator. In situations that favor the HT estimator, the nonparametric model-based estimators are only slightly less efficient.

Fitting of Missing Values
In this section, the study considered fitting missing values for a population divided into clusters which are then subdivided into strata. This section considered a case where there is auxiliary information known at both cluster and cluster element Levels.
In which a matrix O X is considered with rows for i G ε , q is the degree of the spline, and the l k are the knots, while The study considered the diagonal matrix of inverse inclusion probabilities as; and its sample submatrix in equation 8 given as; . As a result, the model calibrated estimator for the sampled cluster total was defined by; The optimal weights; ijk w in equation (10) was obtained by the penalty function method. They were obtained by minimizing the chi square distance below as discussed by Deville and Sarndal (1992), [4].

( )
subject to where 1 ijk ijk z π = is the inverse of inclusion probability and ijk q are some known positive constants, uncorrelated with ijk z . This study followed same optimization procedure by Rao, 1984, [10] and considered an optimization problem of the form Minimize where ˆi jk h is a nonparametric fit of the missing value ijk y .
The study then constructed an unconstrained problem as follows according to by Rao, 1984, [10].
( ) Differentiating equation (14) partially with respect to ijk w we get Equating equation (15) to zero and solving for ijk w the study obtained; ( ) Thus, a nonparametric estimator of the cluster total is given as; Secondly, the nonparametric fit of the cluster totals based on penalized splines in this study was defined as; where T psi J is as defined by Nthiwa Janiffer Mwende et al.
consists of the rows T ri X for which the cluster i c ε , while the diagonal matrix of inverse inclusion probabilities was defined the same way as Nthiwa Janiffer Mwende et al.
(2020), [8] as; 1 , The study then proposed a nonparametric penalized spline model calibrated population total estimator as; The which gave a penalty function of the form; ( ) (24) Following same penalty procedure as in estimating the optimal weights ijk w by penalty method above, the weight i w becomes; ( ) The weighted nonparametric estimator of population total based on p-splines when information is available at both cluster and element level as defined in equation (20) was obtained as;

Penalty Function Method of Obtaining the Weights
To obtain the within cluster weights, ( )  (14) as an unconstrained minimization problem. The research in this case started with some initial guess for ijk w and b r then iteratively improved on the initial values until optimal values are obtained. The present study, therefore, followed the Newton method of unconstrained optimization, according to [8] By neglecting the higher-order terms in the above equation where iu w J is a i n by i n matrix of second derivatives of the penalty function equation (15) The following iterative procedure is used to find the improved approximations of * The sequence of the points This study applied iterative procedure to obtain the cluster level weights

Empirical Analysis and Discussions
In the simulation study, a population of size 10,000 (200*50=10,000) was simulated from a population structure containing 200 clusters each of size 50. Each cluster had 5 strata of size 10 each. At stage one 10, 20, 30,…, 190 clusters were sampled from the 200 clusters by simple random sampling while at stage two, 5 elements where drawn from each stratum by proportional allocation. This gave sample of 25 elements from each of the sampled clusters. 10 replications per each sample size were generated. For penalized spline method, the number of knots and the Spline penalty were optimally generated.
Using R program, a population of independent and identically distributed variable x was simulated using uniform (0, 1). This study used penalized spline equation (18) to fit cluster totals and equation (7) to fit for elements within a cluster. Using the auxiliary variable x, five populations for the dependent random variable for cluster element, and cluster total; i t were generated with the five functions as: Linear function (Lin); On the other hand, the respective kth auxiliary variable in the jth stratum of ith cluster was given as; ( ) This study reports on the performance penalized spline model calibrated estimator and its efficiency in comparison with Horvitz Thompson estimator. The performance of the nonparametric estimator; ˆP S y was evaluated using its relative bias R B and relative efficiency R E . The relative bias was defined as

Results for Population Total Estimates
The results in tables 2, 3, 4, 5 and 6 below shows the actual total and the estimates of the penalized splines and the Horvitz Thompson for the respective mean functions with sample sizes; 50, 100 and 150. From the results the estimator PS y are seen to give estimates that are close the actual total and also to those of Horvitz Thompson design estimator for all the 5 population functions.

Results of Variances and Variance Ratios for Various Sample Size
This section presents both the variance and variance ratio of the two estimators; ˆP S y based on a penalized spline and ˆH T y based on Horvitz Thompson and their respective graphs.
The variance and variance ratios for different functions are summarized in table 7, and their comparative graphs in figures 1, 2, 3, 4 and 5 for the respective functions. The table 7

Relative Bias
The following table 8 shows the values of relative biases for the five population functions. The results from the table show that the relative biases for the two estimates are minimal, given that the population totals were in thousands, and this point to unbiasedness. It is also noted that the difference of the relative biases of the penalized spline estimator with its corresponding Horvitz Thompson estimators for all the five population functions is not very significant. Hence, the proposed model calibrated estimator is unbiased.

Tabular Results for MSE and MSE Ratios for Various Sample Size
The MSE and relative efficiency (MSE Ratios) for the five different functions are summarized in   Figure 6 up to figure 10 show graphical representation of relative efficiency for the respective population functions. The MSE ratios for the five functions are seen to mostly concentrated at a point slights below one. This implies that the nonparametric estimator; ˆP S y is more efficient thanˆH T y and that it's a robust estimator.

Conclusion
The results from this study show that, the nonparametric calibrated estimator; ˆP S y is slightly more efficient thanˆH T y .
The estimator; ˆP S y is also very robust because it is not failing for all the five functions compared with the design estimator ˆH T y which is known as a well-performing estimator. Generally, the results have also shown that the performance of the nonparametric estimator ˆP S y is indistinguishable from that of the design estimator in some instances, and that ˆP S y is a normal, unbiased and consistent estimator. Therefore, this study concludes that the nonparametric model calibrated estimators; it is a robust estimator since it does not fail under misspecification. Due to these good properties of nonparametric model calibrated estimators, this study, therefore, recommends the use of such model calibrated estimators in the estimation of population total in sampling. In the present real-world problem where there are missing variables at both cluster and cluster element levels, yet there is relevant auxiliary information about the variables, model calibrated estimators would be the estimators of choice. This study has further shown that in cases where some cluster and elements within clusters are missing but auxiliary information is available at both levels, then an advantage can be taken of this auxiliary information to fit both cluster totals and cluster elements, which are then calibrated in the estimation of population total. This provides the researcher with a normal, consistent, unbiased, robust and efficient estimator of population total.