Power of Simulation Extrapolation in Correction of Covariates Measured with Errors

Statistics is one of the most vibrant disciplines, and research within it is inevitable. Most research in statistics is concerned with measuring the values of variables in order to draw valid conclusions for decision making. Often, researchers do not work with the exact values of the variables, since it is difficult to establish these exact values during data collection. This study used simulation studies to ascertain the power of Simulation Extrapolation (SIMEX) in correcting the bias of the coefficients of a logistic regression model with one covariate measured with error. The corrected coefficient values of the model can then be used to predict the exact values of the explanatory variable. The root mean square error and the coverage probability were used to test the adequacy of the different models' estimates. The study showed that SIMEX with the quadratic fitting method gives significantly good estimates of the coefficients of the model's predictors. For further studies, the researcher recommends that the study be repeated using other models and with multiple covariates measured with error.


Introduction
Logistic regression is a widely used tool for analysing data in which the response variable is binary in nature (for example, the presence or absence of a disease). The response variable is explained by the different explanatory variables in the model. The coefficients of the explanatory variables are the gradients with respect to the variables they are associated with, and the role of regression analysis is to estimate these coefficients correctly. Correct estimates of the model coefficients can only be obtained when the explanatory variables are measured without error. However, the usual assumption that the explanatory variables carry no error does not always hold. When the explanatory variables are measured with error, the gradient estimates are biased: the gradient estimate for a covariate measured with error converges to $\lambda\beta$, where $\lambda = \sigma_x^2/(\sigma_x^2 + \sigma_u^2)$ is what Fuller refers to as the reliability ratio [1].
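The attenuation described above can be illustrated with a small simulation. This is a Python sketch; the sample size, true slope, and the two standard deviations are illustrative assumptions, not values from the study:

```python
import random

random.seed(1)
n = 20000
sigma_x, sigma_u = 2.0, 1.0        # SDs of the true covariate and of the error (assumed values)
beta = 0.8                         # true slope (assumed value)

x = [random.gauss(0, sigma_x) for _ in range(n)]
y = [beta * xi + random.gauss(0, 0.5) for xi in x]
w = [xi + random.gauss(0, sigma_u) for xi in x]    # error-prone covariate W = X + U

def ols_slope(pred, resp):
    """Ordinary least squares slope of resp on pred."""
    mp, mr = sum(pred) / len(pred), sum(resp) / len(resp)
    return (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
            / sum((p - mp) ** 2 for p in pred))

reliability = sigma_x ** 2 / (sigma_x ** 2 + sigma_u ** 2)   # lambda = 4/5 = 0.8
naive = ols_slope(w, y)            # close to reliability * beta = 0.64, not to beta
```

The naive slope estimated from the error-prone covariate settles near the true slope multiplied by the reliability ratio, which is the attenuation the paper describes.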
Various methods have been proposed by different researchers to correct the bias associated with measurement error in covariates. Cook and Stefanski introduced the simulation extrapolation (SIMEX) method, which is one such method [2]. This study used simulation: a true model was simulated, errors were introduced into one of the covariates to produce a naive model, and the naive model was then used in the simulation extrapolation procedure. The researcher used the R simex library for extrapolation.

Measurement Errors
Measurement error for continuous data is classified into either Berkson measurement error or classical measurement error. Freedman et al. claimed that the fundamental difference between the two kinds of measurement error lies in the distribution assumed for the errors [3].

Classical Error Model
According to Stefanski and Cook, the classical error model assumes a distribution for the observed values given the true values, $(W \mid X)$ [4]. The model also assumes that the measurement errors are independent of the true values and that the explanatory variable X is incorrectly recorded as W. Babanezhad expressed the classical model as $W = X + U$, where U is the measurement error and is assumed to be independent of X [5].

Berkson Error Model
The basic assumption of the Berkson error model is that it assumes a distribution for the true values given the observed values, $(X \mid W)$, and that the measurement error is always independent of the observed explanatory variable W. Babanezhad expressed the Berkson model as $X = W + U$, where U is independent of W [5]. Rudemo, Ruppert and Streibig suggested that the Berkson error model has proved to be very efficient in medical and agricultural studies [6].
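The two error models can be contrasted in a short simulation. Under the classical model the observed values are noisier than the truth; under the Berkson model the truth is noisier than the observed (e.g. assigned) values. All numerical settings below are illustrative assumptions:

```python
import random

random.seed(2)
n = 50000
sigma_u = 1.0                      # error SD (assumed value)

def var(v):
    """Sample variance."""
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v) / (len(v) - 1)

# Classical model: W = X + U, so the observed W is noisier than the truth X.
x = [random.gauss(0, 2.0) for _ in range(n)]
w_classical = [xi + random.gauss(0, sigma_u) for xi in x]

# Berkson model: X = W + U, so the truth X is noisier than the observed W.
w = [random.gauss(0, 2.0) for _ in range(n)]
x_berkson = [wi + random.gauss(0, sigma_u) for wi in w]
```

Comparing the variances makes the distinction concrete: `var(w_classical)` exceeds `var(x)` by roughly the error variance, while under the Berkson model the inequality runs the other way round.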

Measurement Error in a Logistic Regression Model
According to Stefanski, logistic regression is one of the non-linear models that are often affected by measurement error [7]. We consider a logistic regression model for the dependence of a binary response Y on a scalar predictor X in which $\Pr(Y = 1 \mid X) = H(\beta_0 + \beta_x X)$, where $H(t) = \{1 + \exp(-t)\}^{-1}$ is the logistic distribution function. Given the data set $(Y_i, X_i),\ i = 1, 2, \dots, n$, the maximum likelihood estimator requires numerical maximization. We suppose that the latent variable X is unobservable, but that the quantity $W = X + U$ is observed. Since the MLEs have no closed-form expression, the effect of replacing X with W in logistic regression is not easily determined, though Stefanski claims that the estimate of $\beta_x$ is attenuated, as in the case of linear regression [7].
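The attenuation of the logistic slope can be checked numerically. The sketch below fits the two-parameter logistic model by a hand-written Newton-Raphson routine, once on the true covariate and once on the error-prone surrogate; sample size, coefficients and error variance are illustrative assumptions:

```python
import math
import random

random.seed(5)
n = 5000
beta0, beta1 = 0.0, 1.0            # true coefficients (assumed values)
sigma_u = 1.0                      # measurement-error SD (assumed value)

x = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 1 / (1 + math.exp(-(beta0 + beta1 * xi))) else 0 for xi in x]
w = [xi + random.gauss(0, sigma_u) for xi in x]    # error-prone surrogate W = X + U

def fit_logistic(xs, ys, iters=30):
    """Two-parameter logistic regression fitted by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            g0 += yi - p; g1 += (yi - p) * xi        # score vector
            q = p * (1 - p)                          # Fisher information weights
            h00 += q; h01 += q * xi; h11 += q * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det            # Newton step
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

slope_true = fit_logistic(x, y)[1]     # close to the true beta1 = 1
slope_naive = fit_logistic(w, y)[1]    # attenuated toward zero
```

The naive slope comes out well below the slope fitted on the true covariate, matching Stefanski's attenuation claim.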
In logistic regression, estimation when the measurement error is normally distributed proceeds under the assumption that the variance of the error is known or is independently estimable, for example from replicate measurements.
Next, we consider the functional version of the logistic measurement error model with errors U that are normally distributed with known variance $\sigma_u^2$. Here the density function of $(Y, W)$ is given by $$f(y, w) = H(\beta_0 + \beta_x x)^{y}\,\{1 - H(\beta_0 + \beta_x x)\}^{1-y}\,\sigma_u^{-1}\,\phi\{(w - x)/\sigma_u\},$$ where $\phi(\cdot)$ is the standard normal density function.

Simulation Extrapolation (SIMEX)
Cook and Stefanski were the first to suggest the SIMEX method, and it was developed further by Stefanski and Cook and by Carroll and Küchenhoff [4,8]. Shang explains the simulation extrapolation (SIMEX) method as a technique for correcting measurement error through simulation [9]. According to Weeding, the method is used when the measurement error variance can be accurately estimated from replicate measurements or from validation data, or when the variance is already known [10]. The method further assumes that there exists an estimator which is consistent when all variables are measured without error. Such an estimator is referred to as the naive estimator when it is used despite the measurement error.
Küchenhoff, Mwalili, and Lesaffre note that SIMEX exploits the relationship between the measurement error variance $\sigma_u^2$ and the bias of the effect estimators obtained when the measurement error is disregarded [11]. The SIMEX estimator is obtained by adding additional measurement error to the observed data in a resampling stage, establishing the relation of the error-induced bias to the variance of the added measurement error, and extrapolating back to the case where no measurement error is present. We then define the function $G(\sigma_u^2) = \beta^*$, where $\beta^*$ is the limiting value of the naive estimator as the sample size increases to infinity. A consequence of consistency is that $G(0) = \beta$. Mwalili suggests that $G(\sigma_u^2)$ most often declines in absolute value as $\sigma_u^2$ increases [12]; $G(\sigma_u^2)$ corresponds to the attenuation of the estimated effect induced by the measurement error. The SIMEX method is built on a parametric approximation of this function, $G(\sigma_u^2) \approx G(\sigma_u^2; \Gamma)$, for instance the quadratic approximation $G(\sigma_u^2; \Gamma) = \gamma_0 + \gamma_1 \sigma_u^2 + \gamma_2 \sigma_u^4$.
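The quadratic approximation step amounts to an ordinary polynomial least-squares fit over the grid of added-error multipliers, followed by evaluation at the point representing no measurement error. The sketch below fits $\gamma_0 + \gamma_1\lambda + \gamma_2\lambda^2$ by solving the normal equations directly; the grid and the test function are illustrative assumptions:

```python
def fit_quadratic(lams, ests):
    """Least-squares coefficients (g0, g1, g2) of est ~ g0 + g1*lam + g2*lam**2."""
    # Build the 3x3 normal equations A @ g = b and solve by Gaussian elimination.
    pows = [sum(l ** k for l in lams) for k in range(5)]
    A = [[pows[i + j] for j in range(3)] for i in range(3)]
    b = [sum(e * l ** i for l, e in zip(lams, ests)) for i in range(3)]
    for col in range(3):                       # forward elimination with partial pivoting
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            A[r] = [ar - f * ac for ar, ac in zip(A[r], A[col])]
            b[r] -= f * b[col]
    g = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):                        # back substitution
        g[i] = (b[i] - sum(A[i][j] * g[j] for j in range(i + 1, 3))) / A[i][i]
    return g

# Illustrative check on exactly quadratic "naive estimates" G(lam) = 2 - 0.5*lam + 0.1*lam**2
lams = [0.0, 0.5, 1.0, 1.5, 2.0]
ests = [2 - 0.5 * l + 0.1 * l * l for l in lams]
g0, g1, g2 = fit_quadratic(lams, ests)
simex_est = g0 - g1 + g2                       # extrapolated to lam = -1
```

On exactly quadratic input the fit recovers the coefficients and the extrapolated value $2 + 0.5 + 0.1 = 2.6$ up to floating-point error.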

SIMEX in Simple Linear Regression
The SIMEX method is best illustrated with the simple linear measurement error regression model. For illustration purposes, we consider the model $Y = \beta_0 + \beta_x X + \varepsilon$. We also suppose that $W = X + U$ is observed instead of X, where U is normally distributed with mean zero and variance $\sigma_u^2$, and that the measurement error variance $\sigma_u^2$ is known. Babanezhad noted that the ordinary least squares regression of Y on W does not estimate $\beta_x$ but instead estimates $\beta_x^* = \kappa\beta_x$, where $\kappa = \sigma_x^2/(\sigma_x^2 + \sigma_u^2)$ is what Fuller refers to as the reliability ratio [1,5]; $\sigma_x^2$ denotes the variance of X. Now, consider adding by simulation additional error with mean zero and variance $\lambda\sigma_u^2$ to W, resulting in $W^*$, for fixed $\lambda \geq 0$, so that the variance of the error in $W^*$ is $\sigma_u^2 + \lambda\sigma_u^2 = (1 + \lambda)\sigma_u^2$. Then an ordinary least squares regression of Y on $W^*$ consistently estimates the quantity $$\beta_x^*(\lambda) = \frac{\sigma_x^2}{\sigma_x^2 + (1 + \lambda)\sigma_u^2}\,\beta_x.$$ We observe that at $\lambda = -1$, $\beta_x^*(-1) = \beta_x$, which represents a situation with no measurement error. Hence, the rule of thumb is to fit a regression model of $\beta_x^*(\lambda)$ against $\lambda$ and then extrapolate back to $\lambda = -1$.
Hasan et al. pointed out that, without loss of generality, for any set of data the SIMEX method uses simulation to add further measurement error with variance $\lambda\sigma_u^2$ to the error-prone variable [13]. As a result, the total measurement error variance becomes $(1 + \lambda)\sigma_u^2$, which leads to an estimator that converges to $\beta_x^*(\lambda)$.
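The whole procedure for the simple linear model can be sketched end to end: generate error-prone data, average the naive slope over resamples at each $\lambda$, fit a quadratic in $\lambda$, and read off the value at $\lambda = -1$. All numerical settings (sample size, number of resamples, true slope, error variance, $\lambda$ grid) are illustrative assumptions:

```python
import math
import random

random.seed(3)
n, B = 4000, 40                    # sample size and resamples per lambda (assumed values)
beta_x = 1.0                       # true slope (assumed value)
sigma_u2 = 0.5                     # known measurement-error variance (assumed value)

x = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + beta_x * xi + random.gauss(0, 0.5) for xi in x]
w = [xi + random.gauss(0, math.sqrt(sigma_u2)) for xi in x]   # observed W = X + U

def ols_slope(pred, resp):
    """Ordinary least squares slope of resp on pred."""
    mp, mr = sum(pred) / len(pred), sum(resp) / len(resp)
    return (sum((p - mp) * (r - mr) for p, r in zip(pred, resp))
            / sum((p - mp) ** 2 for p in pred))

# Resampling step: for each lambda, add extra error of variance lambda*sigma_u2
# and average the naive slope over B simulated datasets.
lams = [0.0, 0.5, 1.0, 1.5, 2.0]
means = []
for lam in lams:
    total = 0.0
    for _ in range(B):
        wb = [wi + random.gauss(0, math.sqrt(lam * sigma_u2)) for wi in w]
        total += ols_slope(wb, y)
    means.append(total / B)

# Extrapolation step: quadratic least-squares fit of slope(lambda), evaluated at -1.
S = [sum(l ** k for l in lams) for k in range(5)]
A = [[S[i + j] for j in range(3)] for i in range(3)]
rhs = [sum(m * l ** i for l, m in zip(lams, means)) for i in range(3)]
for col in range(3):                                  # Gaussian elimination
    piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
    A[col], A[piv], rhs[col], rhs[piv] = A[piv], A[col], rhs[piv], rhs[col]
    for r in range(col + 1, 3):
        f = A[r][col] / A[col][col]
        A[r] = [ar - f * ac for ar, ac in zip(A[r], A[col])]
        rhs[r] -= f * rhs[col]
g = [0.0, 0.0, 0.0]
for i in (2, 1, 0):                                   # back substitution
    g[i] = (rhs[i] - sum(A[i][j] * g[j] for j in range(i + 1, 3))) / A[i][i]

naive_slope = means[0]              # attenuated: near beta_x / (1 + sigma_u2) ~ 0.67
simex_slope = g[0] - g[1] + g[2]    # extrapolated to lambda = -1, closer to beta_x
```

The quadratic extrapolant does not remove the bias exactly (the true extrapolant here is not a quadratic), but it moves the estimate substantially from the naive value toward the true slope, which is the behaviour the paper reports.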

Jackknife Variance Estimation
One drawback of the SIMEX procedure is that the error variance must be known or be independently and consistently estimable, for instance from replicate measurements. A study by Tsamardinos et al. gives more insight into the jackknife variance estimation procedure, which is otherwise referred to as a leave-one-out method [14]. It is an alternative to other variance estimation procedures such as the bootstrap method and the delta method. According to Shao and Dongsheng, the idea in jackknife variance estimation is to sequentially delete one observation from the dataset and then calculate the estimator $\hat{\theta}_{(i)}$ n times [15]. This implies that for a sample of size n we have n jackknife estimates: we compute n estimates by sequentially omitting one observation from the dataset and estimating $\hat{\theta}_{(i)}$ on the $n - 1$ observations that remain. The building blocks of a jackknife variance estimate are the n differences $\hat{\theta}_{(i)} - \hat{\theta}$ [4], which yield the estimate $$\widehat{\operatorname{Var}}_{\text{jack}} = \frac{n - 1}{n} \sum_{i=1}^{n} \left( \hat{\theta}_{(i)} - \bar{\theta}_{(\cdot)} \right)^2, \qquad \bar{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)}.$$
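The leave-one-out computation can be sketched in a few lines. The estimator and data below are illustrative; for the sample mean the jackknife variance reduces exactly to the familiar $s^2/n$, which makes the sketch easy to check:

```python
def jackknife_variance(data, estimator):
    """Jackknife (leave-one-out) variance estimate of `estimator` on `data`."""
    n = len(data)
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]  # n leave-one-out estimates
    bar = sum(loo) / n                                            # mean of the loo estimates
    return (n - 1) / n * sum((t - bar) ** 2 for t in loo)

mean = lambda v: sum(v) / len(v)
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
jk = jackknife_variance(data, mean)       # equals sample variance / n for the mean
```

Any other estimator (a regression coefficient, a SIMEX estimate) can be passed in place of `mean`, at the cost of n refits.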

Data Simulation
To explore the power of SIMEX for error correction, the researcher simulated two variables (x.true, random exponential values, and z, normal random variables), each of size 200. These two variables were used to build a logistic regression with y denoting the binary response variable. A true model was then fitted using a generalized linear model (glm). To achieve the objective of the study, the researcher introduced errors with a standard deviation of 2 into the x.true variable to obtain x.measured (the error-prone covariate), while the variable z remained unchanged. A naive model was then fitted with predictor variables x.measured and z.
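In outline, the data-generating step described above might look as follows. This is a Python sketch standing in for the study's R code, which is not shown; the coefficient values are illustrative assumptions, not the values used in the study:

```python
import math
import random

random.seed(4)
n = 200                                    # sample size used in the study
b0, b1, b2 = 0.5, 1.0, -0.5                # illustrative coefficients (assumed values)

x_true = [random.expovariate(1.0) for _ in range(n)]    # exponential covariate
z = [random.gauss(0, 1) for _ in range(n)]              # error-free normal covariate

# Binary response drawn from the logistic model on the true covariates
y = [1 if random.random() < 1 / (1 + math.exp(-(b0 + b1 * xt + b2 * zi))) else 0
     for xt, zi in zip(x_true, z)]

# Error-prone version of x.true: classical error with standard deviation 2
x_measured = [xt + random.gauss(0, 2.0) for xt in x_true]
```

The true model would then be fitted on `(x_true, z)` and the naive model on `(x_measured, z)`.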

SIMEX Models
The naive model (the model having one covariate with error) was used to fit SIMEX models with the quadratic and the linear fitting method. The simulation was run three times, with the number of iterations for every lambda set to 500, 1000 and 2000 respectively. The model coefficients were stored for every iteration and averaged for every lambda. These coefficients were then used to compute model diagnostics such as the root mean square error and the coverage rate.
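The two diagnostics can be computed as follows; the numbers in the check are made up for illustration and are not the study's estimates:

```python
import math

def rmse(estimates, truth):
    """Root mean square error of repeated coefficient estimates about the true value."""
    return math.sqrt(sum((e - truth) ** 2 for e in estimates) / len(estimates))

def coverage_rate(intervals, truth):
    """Proportion of (lower, upper) confidence intervals that contain the true value."""
    return sum(1 for lo, hi in intervals if lo <= truth <= hi) / len(intervals)

# Illustrative values only: four simulated estimates of a coefficient whose true value is 1.0
ests = [0.9, 1.1, 1.05, 0.95]
err = rmse(ests, 1.0)
cov = coverage_rate([(0.8, 1.2), (1.1, 1.5), (0.5, 0.9)], 1.0)   # 1 of 3 intervals cover
```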

Results
In this study, the true model was used as the standard model for making comparisons and for generating the confidence intervals. The comparison was made with regard to the other three models (the naive model, the quadratic model and the linear model). The results of the various simulations are presented in Table 1 below. The study did not show a significant change in the estimates in spite of the increased number of iterations. The SIMEX model with the quadratic fitting method performed well, its RMSE being the smallest among all the models. In addition, the SIMEX model with the quadratic fitting method had the highest coverage rate among the three models. Figure 1 shows the performance of SIMEX with quadratic as the fitting method. The three graphs demonstrate a consistent trend, and extrapolation of the graphs to $\lambda = -1$ gives approximate true coefficients of 4.7661, 40.4160 and 40.0984, which are close to 5, 40.5 and 40.1. The naive model performed poorest, since it was the model into which the measurement error was initially introduced.

Conclusion and Recommendation
The study confirmed that the SIMEX method is effective in correcting the bias of covariates measured with error. Of the two SIMEX fitting methods considered, the quadratic fitting method proved to be the best, having the smallest RMSE among the models considered and the highest coverage probability. The high coverage probability means that many of the predicted values will fall within the 95% confidence interval. Consequently, the study demonstrated the power of simulation extrapolation as a method of error correction. Hence, for independent variables that are collected with measurement error, the researcher should consider the SIMEX method with the quadratic fitting method to correct the errors and obtain correct estimates of the model's coefficients, which will give better approximations of the response variable.
The study recommends the use of the SIMEX method with the quadratic fitting method for correcting covariates measured with error. Further studies can be done using other statistical models to reaffirm the claims of this study.