Sieve Estimation for Mixture Cure Rate Model with Informatively Interval-Censored Failure Time Data

Abstract: In biomedical and public health studies, interval-censored data arise when the failure time of interest is not observed exactly and is only known to lie within an interval. Furthermore, the failure time and the censoring time may be dependent. There may also exist a cured subgroup, meaning that a proportion of study subjects are not susceptible to the failure event of interest. Many authors have investigated inference procedures for interval-censored data. However, most existing methods either assume no cured subgroup or apply only to limited situations in which the failure time and the observation time are independent. To take both cured subgroups and informative censoring into consideration in regression analysis of interval-censored data, we employ a mixture cure model and propose a sieve maximum likelihood estimation approach using Bernstein polynomials. A novel expectation-maximization algorithm that uses subject-specific independent log-normal latent variables is developed to obtain the numerical solutions of the model. The robustness and finite-sample performance of the proposed method, in terms of estimation accuracy and predictive power, are evaluated in an extensive simulation study, which suggests that the proposed method works well in practical situations. In addition, we provide an illustrative example using NASA's hypobaric decompression sickness database.


Introduction
This paper discusses regression analysis of interval-censored data in the presence of informative censoring and a cured subpopulation. Interval-censored data occur naturally and frequently in randomized clinical trials, where the exact time of the event is unknown and is only known to lie within an interval. In standard survival analysis, it is usually assumed that every subject is susceptible to the failure event. However, there may exist a subpopulation that is cured of, or immune to, the failure event.
Another challenge in this setting is that the failure time and the censoring time may be correlated. Many authors have developed regression procedures to deal with informative censoring [2][3], [7][8][9]. Furthermore, [8][9] considered cure rate models with informatively right-censored data. It was also shown in [10] that ignoring the cured subpopulation could result in overestimation of the survival time, and the estimates could be seriously biased if informative censoring is not accounted for in the model [11]. However, there does not seem to exist an established inference procedure for interval-censored data that takes both a cured subgroup and informative censoring into account.
In this paper, we present a sieve estimation procedure for analyzing interval-censored data that addresses both cured subgroups and informative censoring using the mixture cure rate model. Cox's proportional hazards model is used for modeling both the failure time and the censoring time. A latent variable is introduced in order to directly characterize the dependence between the failure time and the censoring time. The remainder of the article is organized as follows. Section 2 introduces the notation, the underlying model, and the likelihood function for informatively interval-censored data. A sieve maximum likelihood estimation procedure is then described in Section 3, where an EM algorithm is developed and Bernstein polynomials are used to approximate the unknown functions. Section 4 presents results from an extensive simulation study conducted to assess the performance of the proposed methodology, and an illustrative example is provided in Section 5. Section 6 contains some discussion and concluding remarks.

Assumptions, Models and Likelihood Function
In a clinical study with a cured subpopulation, let T denote the failure time and assume that the failure event of each patient is only observed to occur within a time interval [L, R], so that we have interval-censored survival data, together with a vector of covariates for each patient. Define the cure indicator variable to be 0 if the subject is cured (nonsusceptible) and 1 otherwise, so that T equals T* for a susceptible subject and is infinite for a cured subject, where T* < ∞ denotes the failure time of a susceptible subject. The cure indicator is modeled by the logistic model [6]. The vector of covariates that may affect the cure indicator may be the same as, a part of, or different from the covariates for the failure time, and its effects are described by a vector of regression parameters. Now assume a clinical study that has independent subjects. For the i-th subject, let T_i denote the event time and let L_i and R_i be the left and right endpoints of the censoring interval.
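The mixture cure structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the intercept-first covariate layout, and the exponential survival function used in the usage note are assumptions made for illustration, and the sketch ignores the latent variable that links the failure and censoring times.

```python
import numpy as np

def uncured_probability(x, alpha):
    """Logistic model for P(susceptible | x).

    x: covariate vector (assumed here to include a leading 1 for the intercept).
    alpha: regression parameters of the logistic cure model (hypothetical name).
    """
    eta = np.asarray(x, dtype=float) @ np.asarray(alpha, dtype=float)
    return 1.0 / (1.0 + np.exp(-eta))

def population_survival(t, p_uncured, surv_uncured):
    """P(T > t) for the whole population under a mixture cure model.

    Cured subjects never fail, so they contribute (1 - p_uncured) at every t;
    susceptible subjects follow surv_uncured, the survival function of T*.
    """
    return (1.0 - p_uncured) + p_uncured * surv_uncured(t)

def interval_probability(left, right, p_uncured, surv_uncured):
    """P(left < T <= right); right = np.inf corresponds to right censoring."""
    return (population_survival(left, p_uncured, surv_uncured)
            - population_survival(right, p_uncured, surv_uncured))
```

For example, with an assumed exponential survival function `lambda t: np.exp(-0.5 * t)` for susceptible subjects, `population_survival(np.inf, p, S)` returns the cured fraction `1 - p`, which is the plateau that distinguishes a cure model from a standard survival model.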

Inference Procedure
To alleviate the computational burden, we propose to use a sieve method to approximate the unknown functions Λ. Because the likelihood obtained by integrating out the log-normal frailty is very complicated, the EM algorithm is used to compute the maximum likelihood estimates (MLEs) [6].
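As a rough illustration of a Bernstein polynomial sieve, the sketch below approximates a nondecreasing function such as a cumulative hazard on an interval [lower, upper]. The reparameterization through cumulative sums of exponentials is one common way to enforce monotonicity and is an assumption here, as are all function and parameter names; the source does not specify how the constraint is imposed.

```python
import numpy as np
from math import comb

def bernstein_basis(t, degree, lower, upper):
    """Bernstein basis polynomials of the given degree evaluated at t.

    Returns an array of shape (len(t), degree + 1); each row sums to 1.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    s = (t - lower) / (upper - lower)          # rescale to [0, 1]
    k = np.arange(degree + 1)
    coef = np.array([comb(degree, j) for j in k], dtype=float)
    return coef * s[:, None] ** k * (1.0 - s[:, None]) ** (degree - k)

def cumulative_hazard(t, gamma, lower, upper):
    """Monotone sieve approximation: sum_k phi_k * B_k(t).

    phi_k = exp(gamma_0) + ... + exp(gamma_k) is positive and nondecreasing
    in k, which makes the resulting polynomial nondecreasing in t.
    """
    phi = np.cumsum(np.exp(np.asarray(gamma, dtype=float)))
    return bernstein_basis(t, len(phi) - 1, lower, upper) @ phi
```

With this parameterization the unconstrained vector `gamma` can be passed to a generic optimizer, and the degree of the polynomial (here `len(gamma) - 1`) controls the flexibility of the sieve.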

A Simulation Study
In this section, a comprehensive simulation study is conducted to evaluate the finite-sample performance of the proposed estimator. The uniform and normal distributions are used to generate the covariates. To generate the data, we first generated the covariates and the subject-specific latent variables, and then the interval endpoints L_i and R_i from models (3) and (4). The simulation results indicate that the proposed estimator performs well: the bias is small and the estimated variance is close to the sample variance of the estimates across simulated data sets. In addition, the coverage probability is close to 95%, which suggests that the normal approximation to the distribution of the proposed estimator is appropriate. Moreover, as the sample size increases, both the bias and the estimated variance decrease, as expected.
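The summary measures discussed above (bias, sample variability of the estimates, average estimated standard error, and 95% coverage) can be computed from replicated estimates as in the following sketch. The function name, the dictionary keys, and the use of Wald-type intervals with the normal quantile 1.96 are assumptions for illustration.

```python
import numpy as np

def summarize_simulation(estimates, std_errors, true_value, z=1.96):
    """Summarize replicated estimates of a single parameter.

    estimates:  point estimates, one per simulated data set.
    std_errors: estimated standard errors, one per simulated data set.
    z:          normal quantile for the nominal coverage (1.96 for 95%).
    """
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    covered = np.abs(est - true_value) <= z * se   # Wald interval contains truth
    return {
        "bias": est.mean() - true_value,
        "sse": est.std(ddof=1),       # sample standard error of the estimates
        "ese": se.mean(),             # average of the estimated standard errors
        "coverage": covered.mean(),   # empirical coverage probability
    }
```

Agreement between `sse` and `ese`, together with coverage near the nominal level, is the pattern described in the text as evidence that the variance estimation and the normal approximation are adequate.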
The simulation results for the two-covariate scenario are displayed in Table 4, with the four parameter vectors set to (1, 1.5), (0.5, 1), (1, 1), and either (1, 1) or (1, −1), and the remaining scalar parameter set to 0.5. The two covariates for the cure model were generated from the standard normal distribution and the uniform distribution over (−1, 1), respectively, and the two covariates for the Cox model were generated from the Bernoulli distribution with success probability 0.5 and the standard normal distribution, respectively. The results indicate that our proposed method also works well in scenarios with multiple covariates. Furthermore, we tried different values for the degree of the Bernstein polynomials and found that they all gave similar simulation results. Thus, the results are robust to the choice of polynomial degree.
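A data-generating scheme of the kind described in this section can be sketched as follows. Only the covariate distributions are taken from the text. The exponential baseline hazards, the way the log-normal frailty enters both the failure and the examination processes, the use of exactly two examination times, and all names and parameter values are assumptions for illustration, not the authors' models (3) and (4).

```python
import numpy as np

def simulate_dataset(n, alpha, beta, gamma, sigma2, rng):
    """Generate informatively interval-censored data with a cured fraction."""
    # Cure-model covariates: standard normal and uniform(-1, 1).
    x = np.column_stack([rng.standard_normal(n), rng.uniform(-1.0, 1.0, n)])
    # Cox-model covariates: Bernoulli(0.5) and standard normal.
    z = np.column_stack([rng.binomial(1, 0.5, n), rng.standard_normal(n)])
    # Shared latent variable b; exp(b) is a log-normal frailty (assumed form).
    b = rng.normal(0.0, np.sqrt(sigma2), n)

    # Cure indicator from a logistic model (1 = susceptible, 0 = cured).
    p_uncured = 1.0 / (1.0 + np.exp(-(x @ alpha)))
    uncured = rng.binomial(1, p_uncured)

    # Failure time: exponential for susceptible subjects, infinite if cured.
    t_star = rng.exponential(1.0 / np.exp(z @ beta + b))
    t = np.where(uncured == 1, t_star, np.inf)

    # Two examination times whose gaps also depend on b, so that the
    # censoring mechanism is informative about the failure time.
    rate_c = np.exp(z @ gamma + b)
    first = rng.exponential(1.0 / rate_c)
    second = first + rng.exponential(1.0 / rate_c)

    # Interval (left, right] known to contain the failure time.
    left = np.where(t <= first, 0.0, np.where(t <= second, first, second))
    right = np.where(t <= first, first, np.where(t <= second, second, np.inf))
    return {"x": x, "z": z, "t": t, "uncured": uncured,
            "left": left, "right": right}
```

A call such as `simulate_dataset(200, np.array([1.0, 1.5]), np.array([0.5, 1.0]), np.array([1.0, 1.0]), 0.5, np.random.default_rng(1))` returns one simulated data set; in an actual study only `x`, `z`, `left`, and `right` would be treated as observed.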

An Application
In this section, we illustrate our methodology by applying it to NASA's Hypobaric Decompression Sickness Database (HDSD). There are 238 subjects aged between 20 and 54 in the study (177 male and 61 female). The subjects underwent a dehydrogenation process in a hypobaric environment. The response variable is the time to onset of Grade IV venous gas emboli (VGE). The goal of the study is to identify associations between VGE and potential risk factors (NOADYN, TR360, age, and gender). NOADYN is an indicator of the condition of the test subject (NOADYN=1 for ambulatory and NOADYN=0 for lower body adynamic). TR360 represents the tissue ratio at 360 degrees.
We have interval-censored data here since the failure event (Grade IV VGE) is only known to occur between two examination time points. Censoring is also informative, since subjects who develop Grade IV VGE are more likely to have been examined earlier. In addition, some subjects are immune to Grade IV VGE and will never develop any related symptom, so a cure rate model fits this scenario. It has been pointed out that only covariates related to the characteristics of the subject can affect immunity to the failure event [14]. Thus, we include only age and gender in the logistic model for the cure rate. The estimation results are given in Table 5. For comparison, we also include the results given by a naïve estimation procedure that ignores the dependence between the censoring time and the failure time. From Table 5, we can see that the estimation is robust with respect to the degree of the Bernstein polynomials. The proposed and naïve approaches give similar estimates for the coefficients of NOADYN, gender, and age. Moreover, the results show that subjects with higher TR360 had longer survival time in terms of Grade IV VGE. However, only the proposed approach detected a significant effect of TR360, whereas the naïve estimation procedure that ignores the informative censoring failed to do so. In addition, our proposed sieve estimation procedure detected significant effects of the risk factors age and NOADYN. Therefore, we conclude that younger subjects develop Grade IV VGE more quickly, and that ambulatory subjects develop Grade IV VGE more quickly than those who are lower body adynamic. The cure rate estimates suggest that older and male subjects are more likely to develop Grade IV VGE than younger and female subjects.

Discussions and Conclusions
In this article, we considered the analysis of informatively interval-censored data when there is a cured subpopulation. In order to deal with informative interval censoring and a cured subpopulation at the same time, we used a log-normal frailty variable to account for the dependence between the censoring time and the failure time. A mixture cure rate model was developed to account for the cured subpopulation. To estimate the model parameters, we proposed a sieve maximum likelihood estimation procedure, in which Bernstein polynomials are used as the sieve functions to estimate the nonparametric component of the model. Furthermore, we derived an EM algorithm to obtain the numerical solutions for the model parameters. The EM algorithm has the advantage of reducing the computational burden of the problem and providing efficient estimators. We also conducted an extensive simulation study, which showed that our method has an advantage over the traditional method that ignores the informative censoring and the cured subpopulation. In addition, the simulation results suggested that our method performs well in different practical scenarios.
There are several directions for future research on this topic. In this paper, we employed the Cox model under the proportional hazards assumption. In some practical situations, other survival models, such as the semiparametric transformation model or the proportional odds model, may be more appropriate. Therefore, it would be interesting to develop estimation procedures for these models. Moreover, we developed a mixture cure rate model for the problem. Several researchers have developed nonmixture cure models, which have the advantage of modeling the event time uniformly [15][16]. Another research direction is therefore to develop a sieve estimation procedure using the nonmixture cure model.
In general, when dealing with interval-censored data with informative censoring and a cured subgroup, we recommend the sieve maximum likelihood approach. However, this approach may be less reliable when the data are subject to measurement error. Note that in the application section above, the tissue ratio at 360 degrees (TR360) may be subject to measurement error. It is well known that ignoring measurement error could lead to biased estimates. In the future, it would be interesting to establish an estimation procedure that addresses measurement error and informative censoring at the same time.