Non-parametric Analysis of Interval-Censored Survival Data with Application to a Phase III Metastatic Colorectal Cancer Clinical Trial

In oncology clinical trials, the exact time of an event such as tumor progression is usually unknown, but the time interval within which the event occurs is known. The determination of such survival times can be subject to measurement error and is influenced by the timing of scheduled assessments. Ignoring interval censoring of survival times can lead to serious estimation bias. In addition, a crucial characteristic of interval-censored data is how frequently assessments are taken, which directly determines the efficiency of statistical inference. It is therefore highly desirable to find statistical methods that are robust to different assessment frequencies. We compare the conventional imputation-based approach with non-parametric approaches for handling interval-censored survival data, applying them both to hypothesis testing and to the estimation of hazard and survival functions. The empirical performance of these methods is assessed through extensive simulation studies with various sample sizes. A Phase III randomized clinical trial on metastatic colorectal cancer is analyzed using both the conventional approaches and the non-parametric interval-censored analysis approaches. Our findings suggest that the Phase III colorectal cancer clinical trial failed to show a clinical benefit of adding bevacizumab (B) to standard chemotherapy (CT), and that the non-parametric interval-censored analysis approaches outperform the conventional approach for routine analysis of interval-censored survival data in oncology clinical trials.


Introduction
Interval-censored time-to-event data occur naturally and frequently in randomized clinical trials, where the exact time of event occurrence is unknown but the time interval within which the event occurs is known. The left point of the time interval represents the last time the individual is known to be event-free, and the right point of the interval represents the earliest time at which the individual is recorded with an event. There are two important special cases of interval-censored data. The first is current status data, where only the observation time and whether or not the event has occurred by that time are known. The second is grouped time-to-event data, where the interval-censored time for each subject belongs to a collection of non-overlapping intervals, and a multinomial distribution can be used for the numbers of subjects in the given intervals. This paper focuses on case II interval-censored data, that is, the general situation in which each subject's event time is only known to lie in an observed interval (L_i, R_i].
In oncology clinical trials, progression-free survival is the time from the randomization date to the time of disease progression or death. Due to the latency of disease progression, the exact time of progression is never known. Most progression-free survival times are therefore interval-censored, since the determination of such survival times is always subject to measurement error and influenced by the timing of scheduled assessments. Several researchers have studied the bias caused by ignoring interval censoring of survival times. For example, Panageas et al. showed that ignoring the interval-censored data structure leads to overestimation of median progression-free survival [1]; Hess et al. noted that unscheduled assessments may lead to falsely concluding a significant treatment effect [2]; and Penson et al. observed that different measurement intervals between treatment arms can lead to estimation bias [3]. These researchers recommended using consistent and symmetric assessment intervals across treatment arms whenever possible, and applying interval-censoring analysis methodology to progression-free survival data to minimize potential bias. Following these recommendations, this paper aims to provide guidance on interval-censoring analysis of progression-free survival data.
Intuitively, compared with right-censored time-to-event data, interval-censored data suffer a loss of information. As a result, a crucial characteristic of interval-censored data analysis is how frequently assessments are taken, which directly determines the efficiency of statistical inference. Meanwhile, the assessment schedule is often predetermined by external factors such as evaluation cost, patient convenience, and clinical practice. It is therefore highly desirable that statistical methods for progression-free survival data be robust to different assessment frequencies and schedules.
Statistical methods for right-censored data are widely used in the pharmaceutical industry. For example, the Kaplan-Meier estimator provides a non-parametric estimate of the survival function, the log-rank test provides a non-parametric test of treatment effect, and the semi-parametric Cox proportional hazards model is used for treatment effect estimation. Corresponding statistical methods for interval-censored data have also been developed over the past three decades: for example, non-parametric estimation of the survival function by Turnbull [4], Gentleman and Geyer [5], and Titman [6]; comparison of survival functions by Zhao and Sun [7] and Sun et al. [8]; and non-parametric proportional hazards models by Finkelstein [9] and Withana [10]. A comprehensive review of statistical methods for interval-censored data can be found in Zhang and Sun [11]. However, very few of these methods have been directly compared with the right-point and mid-point imputations that are widely used as the conventional approaches to interval-censored data.
Recently, Sun and Chen [12] compared the conventional methods with Finkelstein's proportional hazards model [9] for analyzing interval-censored time-to-event data through Monte Carlo simulation studies, and argued that Finkelstein's method is superior to the conventional approaches. Their conclusions, however, are based on a limited set of scenarios and may not hold more generally. In addition, hypothesis testing methods for interval-censored data, which were not extensively studied by Sun and Chen, are also of interest here. We therefore conduct extensive Monte Carlo simulations under a variety of scenarios that may occur in clinical trials, comparing the conventional imputation-based approaches with Finkelstein's method for estimation, and comparing Finkelstein's score test with the generalized log-rank tests [7,8] for hypothesis testing.
The rest of the paper is organized as follows. In Section 2, we introduce notation and review the ideas behind the non-parametric approaches to interval-censored data analysis. In Section 3, we present results from an extensive simulation study and discuss the pros and cons of the different statistical methods. In Section 4, a Phase III randomized clinical trial on metastatic colorectal cancer is analyzed with the methods described in Section 2 and their performance is compared. Section 5 contains discussion and concluding remarks.

Conventional Approach
The goal of right-point and mid-point imputation is to transform interval-censored data into right-censored data. Right-point imputation uses the right point of the time interval as the event time, while mid-point imputation uses the average of the left and right points of the interval as the event time [13]. After either imputation, one can use standard statistical methods for right-censored data, such as the Kaplan-Meier estimator, the log-rank test, and the Cox proportional hazards model, for estimation, inference, and hypothesis testing. When the assessment intervals are symmetric between treatment groups, both imputations give the same ranks of event times; therefore, rank-based methods such as the log-rank test and the Cox proportional hazards model give similar results under right-point and mid-point imputation. Moreover, because assessment intervals usually produce heavily tied event times in most clinical trials, the method for handling ties should be chosen carefully. Among the available methods, which include Breslow's method, Efron's method, and the exact method [15], we recommend Efron's method [14]: it yields estimates that are fairly close to those from the exact method while being more computationally efficient.
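As a concrete illustration of the conventional approach, the minimal sketch below imputes event times from hypothetical intervals and fits a standard Cox model; the data values, column names, and use of the Python lifelines package (whose Cox fitter resolves tied event times with Efron's method) are our own assumptions for illustration, not part of the trial analysis.

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical interval-censored data in weeks: (left, right], with right = inf
# marking right-censoring. All values and column names are illustrative only.
df = pd.DataFrame({
    "left":  [0.0, 8.0, 16.0, 8.0, 24.0, 0.0, 16.0, 8.0],
    "right": [8.0, 16.0, float("inf"), 24.0, float("inf"), 16.0, 24.0, float("inf")],
    "arm":   [1, 1, 0, 0, 1, 0, 1, 0],   # 1 = treatment, 0 = control
})
df["event"] = (df["right"] != float("inf")).astype(int)

# Right-point imputation: the right end of the interval is treated as the event time;
# censored subjects keep their last event-free assessment time.
df["t_right"] = df["right"].where(df["event"] == 1, df["left"])

# Mid-point imputation: the middle of the interval is treated as the event time.
df["t_mid"] = ((df["left"] + df["right"]) / 2).where(df["event"] == 1, df["left"])

# Standard Cox proportional hazards model on the imputed right-censored data;
# lifelines' CoxPHFitter handles tied event times with Efron's method.
cph = CoxPHFitter()
cph.fit(df[["t_mid", "event", "arm"]], duration_col="t_mid", event_col="event")
cph.print_summary()
```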
Non-Parametric Maximum Likelihood Estimation
Let $0 = \tau_0 < \tau_1 < \cdots < \tau_m$ denote the ordered distinct endpoints of the observed intervals, and assume that within each treatment arm the hazard function and the survival function $S(t)$ are constant between consecutive assessment times. The likelihood contribution of the $i$-th patient is $\sum_{j=1}^{m} \alpha_{ij}\{S(\tau_{j-1}) - S(\tau_j)\}$, where $\alpha_{ij} = 1$ if $L_i < \tau_j \le R_i$ and $\alpha_{ij} = 0$ otherwise, and the full log-likelihood is the sum of the logarithms of these contributions over all patients. In the one-sample case, writing $p_j = S(\tau_{j-1}) - S(\tau_j)$, the problem of finding the non-parametric maximum likelihood estimator (NPMLE) of $S$ reduces to maximizing $\sum_{i=1}^{n} \log\big(\sum_{j=1}^{m} \alpha_{ij} p_j\big)$ under the constraints $\sum_{j=1}^{m} p_j = 1$ and $p_j \ge 0$. Different methods have been proposed to maximize this likelihood, for example the EM self-consistency algorithm of Turnbull [4] and of Gentleman and Geyer [5], and the iterative convex minorant (ICM) algorithm of Groeneboom and Wellner [6]. For regression, Finkelstein's proportional hazards model [9] specifies $S(t \mid Z_i) = S_0(t)^{\exp(\beta' Z_i)}$ and re-parameterizes the baseline survival function through $\gamma_j = \log\{-\log S_0(\tau_j)\}$. Maximum likelihood estimates of $\beta$ and $\gamma_1, \ldots, \gamma_m$ are obtained by maximizing the re-parameterized log-likelihood jointly with the Newton-Raphson algorithm under the constraint $\gamma_1 < \cdots < \gamma_m$, and the observed Fisher information matrix is used to obtain the standard errors of the estimators. This approach reduces the problem to a finite-dimensional parametric estimation problem; as a result, Finkelstein's maximum likelihood estimation becomes more computationally intensive as the number of distinct intervals grows.
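To make the one-sample NPMLE concrete, the following sketch implements the self-consistency (EM) update over a grid of candidate support points taken from the observed interval endpoints; for simplicity it skips Turnbull's reduction to innermost intervals, so it illustrates the iteration rather than reproducing any particular software implementation.

```python
import numpy as np

def npmle_self_consistency(left, right, tol=1e-8, max_iter=5000):
    """Self-consistency (EM) iteration for the NPMLE of the event-time
    distribution from interval-censored data (L_i, R_i].

    Simplified sketch: mass is placed on the distinct finite interval endpoints
    (plus a point at infinity for right-censored subjects) instead of Turnbull's
    innermost intervals.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)

    # Candidate support points: distinct positive finite endpoints, plus infinity
    # to absorb mass beyond the last assessment for right-censored observations.
    support = np.unique(np.concatenate([left, right[np.isfinite(right)]]))
    support = np.append(support[support > 0], np.inf)

    # alpha[i, j] = 1 if support point s_j falls inside (L_i, R_i].
    alpha = ((support[None, :] > left[:, None]) &
             (support[None, :] <= right[:, None])).astype(float)

    p = np.full(support.size, 1.0 / support.size)   # initial mass vector
    for _ in range(max_iter):
        denom = alpha @ p                            # P(T in (L_i, R_i]) for each subject
        # Redistribute each subject's unit mass over its own interval, then average.
        p_new = (alpha * p).T @ (1.0 / denom) / len(left)
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new

    survival = 1.0 - np.cumsum(p)                    # S(s_j) at the support points
    return support, p, survival

# Example with three hypothetical subjects: (0,8], (8,16], and right-censored at 8.
grid, mass, surv = npmle_self_consistency([0.0, 8.0, 8.0], [8.0, 16.0, np.inf])
```

In practice the ICM or hybrid EM-ICM algorithms mentioned above converge faster on large data sets; this sketch is only meant to show the self-consistency step itself.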

Nonparametric Comparisons of Survival Functions
Suppose there are K treatment arms in a clinical study, and let $S^{(k)}(t)$ denote the survival function of the $k$-th arm, $k = 1, \ldots, K$. The null hypothesis to test is $H_0: S^{(1)}(t) = S^{(2)}(t) = \cdots = S^{(K)}(t)$ for all $t$. Assuming each of the $n$ subjects receives one of the K treatments, the data for the K samples can be represented as $\{(L_i, R_i], Z_i : i = 1, \ldots, n\}$, where $Z_i$ is the $K \times 1$ vector of treatment indicators associated with subject $i$ with interval-censored time $(L_i, R_i]$; its $k$-th element is 1 if subject $i$ is from the $k$-th population and 0 otherwise.

Finkelstein's Score Test
For right-censored data, the log-rank test can be obtained as a score test from the proportional hazards regression model. Analogously, one way to compare survival functions for interval-censored data is to perform a score test on a regression model for interval-censored data. Here, the survival functions are compared by performing the score test of $H_0: \beta = 0$ under Finkelstein's proportional hazards model, where $\beta$ is the vector of regression coefficients for $Z_i$. The score statistic is the vector of partial derivatives of the log-likelihood with respect to $\beta$, evaluated at $\beta = 0$ with the baseline parameters fixed at their maximum likelihood estimates under the null.

Generalized Log-rank Test I
Zhao and Sun [7] proposed a rank-based approach that is a direct generalization of the log-rank test for right-censored data. Let $s_1 < \cdots < s_m$ denote the ordered distinct endpoints of all observed intervals. For each pair $(i, j)$, define $\alpha_{ij}$ to be the indicator of the event $s_j \in (L_i, R_i]$, $1 \le i \le n$, $1 \le j \le m$. For subject $i$, define $\delta_i = 0$ if the observation on $T_i$ is right-censored and $\delta_i = 1$ otherwise, and $\rho_{ij} = I(\delta_i = 0, L_i \ge s_j)$, which equals 1 if $T_i$ is right-censored and subject $i$ is still at risk just before $s_j$. The log-rank statistic $U = (U_1, \ldots, U_K)'$ is then defined component-wise for $k = 1, \ldots, K$ in terms of estimates, built from the $\alpha_{ij}$, the $\rho_{ij}$, and the NPMLE of the common survival function, of the observed failure and risk numbers at each time $s_j$ under $H_0$. It can be shown that when only right-censored data are available, the statistic $U$ reduces to the usual log-rank test statistic. Zhao and Sun [7] also proposed a multiple imputation approach to estimate the covariance matrix $\Sigma$ of $U$. The null hypothesis of homogeneity of the K populations can then be tested by comparing the statistic $U' \Sigma^{-} U$ to a $\chi^2$ distribution with $K-1$ degrees of freedom, where $\Sigma^{-}$ is a generalized inverse of $\Sigma$.

Generalized Log-rank Test II
Sun, Zhao and Zhao [8] proposed a class of K-sample tests for interval-censored data that includes Finkelstein's score test statistic [9] as a special case. Letting $\hat{G}$ denote the NPMLE of the common survival function under $H_0$, the K-sample test statistic is defined as
$$U_\xi = \sum_{i=1}^{n} Z_i\, \frac{\xi\{\hat{G}(L_i)\} - \xi\{\hat{G}(R_i)\}}{\hat{G}(L_i) - \hat{G}(R_i)},$$
where $\xi$ is a known function over $(0, 1)$. When $\xi(u) = u \log(u)$, this test statistic reduces to Finkelstein's score test statistic. Denote the first $K-1$ components of $U_\xi$ by $U_\xi^*$, let $\Sigma$ be the covariance matrix of $U_\xi$, and let $\Sigma^*$ be obtained by deleting the last row and column of $\Sigma$; its expression is provided by the authors. The null hypothesis of homogeneity of the K populations can be tested by comparing the statistic $U_\xi^{*\prime} (\Sigma^{*})^{-1} U_\xi^*$ to a $\chi^2$ distribution with $K-1$ degrees of freedom.
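Under the form of the statistic given above, a minimal sketch of the per-group components of $U_\xi$ with $\xi(u) = u \log(u)$ might look as follows; the pooled survival estimate `surv_fn` (e.g., built from the NPMLE sketch earlier) and the covariance estimation needed for the actual $\chi^2$ test are assumed to be available and are not shown.

```python
import numpy as np

def xi(u):
    """xi(u) = u * log(u), with the limit xi(0) = 0, as in Finkelstein's special case."""
    u = np.asarray(u, dtype=float)
    out = np.zeros_like(u)
    pos = u > 0
    out[pos] = u[pos] * np.log(u[pos])
    return out

def glr2_statistic(left, right, group, surv_fn, n_groups):
    """Per-group components of the generalized log-rank statistic U_xi.

    surv_fn is assumed to be a vectorized estimate of the pooled survival function
    under H0 that assigns positive mass to every observed interval; obtaining it,
    and the covariance matrix of U, is not shown here.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    group = np.asarray(group)

    g_left = np.where(left <= 0, 1.0, surv_fn(left))                 # G(0) = 1 by convention
    finite_r = np.where(np.isfinite(right), right, 0.0)              # placeholder where censored
    g_right = np.where(np.isfinite(right), surv_fn(finite_r), 0.0)   # G(inf) = 0 for right-censoring

    contrib = (xi(g_left) - xi(g_right)) / (g_left - g_right)        # per-subject contribution
    return np.array([contrib[group == k].sum() for k in range(n_groups)])
```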

Data Generation
We generate data to simulate a hypothetical oncology Phase III clinical trial with two arms under a 1:1 allocation ratio. The sample size is set to 200, 400 or 600 and the number of replications is 1000. The survival time in the control arm (C) follows an exponential distribution with median equal to 8 weeks, 12 weeks, or 24 weeks. The hazard ratio between the treatment arm (T) and the control arm (C) is assumed to be either 0.5 or 0.78. In the simulations, each exact failure time is censored into a pre-specified time interval to mimic non-informative interval censoring. We also report results from Cox regression and the log-rank test based on the exact event times in order to control for the random error of the simulations. The cut-off date is the same for all patients, and the overall study duration is chosen to give approximately an 80% or 60% event rate, respectively. Table 1 provides the overall study duration and the maximum number of assessments under the different assessment schedules. The maximum number of assessments ranges from 3 to 38, covering a wide range of scenarios encountered in practice. To keep the event rates at the desired proportions, an additional assessment is added at the end of the study.
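A minimal sketch of this data-generating mechanism is given below; the assessment schedule, seed, and parameter values shown are illustrative assumptions rather than the exact settings of the simulation study.

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_interval_censored(n=200, median_c=12.0, hazard_ratio=0.5,
                               assess_every=8.0, study_end=56.0):
    """Simulate a 1:1 two-arm trial with exponential event times that are only
    observed at scheduled assessments, yielding interval-censored data (L, R]."""
    arm = rng.integers(0, 2, size=n)                 # 0 = control, 1 = treatment
    lam_c = np.log(2) / median_c                     # exponential rate from the median
    lam = lam_c * np.where(arm == 1, hazard_ratio, 1.0)
    t = rng.exponential(1.0 / lam)                   # exact (latent) event times

    # Scheduled assessments every `assess_every` weeks, plus one at the study end.
    visits = np.append(np.arange(assess_every, study_end, assess_every), study_end)

    left = np.zeros(n)
    right = np.full(n, np.inf)                       # inf marks right-censoring at study end
    for i in range(n):
        after = visits[visits >= t[i]]
        if after.size:                               # event detected at the first visit after t
            right[i] = after[0]
            before = visits[visits < t[i]]
            left[i] = before[-1] if before.size else 0.0
        else:                                        # still event-free at the last assessment
            left[i] = visits[-1]
    return arm, left, right

arm, left, right = simulate_interval_censored()
print(f"event rate: {np.isfinite(right).mean():.2f}")
```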

Simulation Results
Tables 2 and 3 summarize point estimates under equal assessment schedules, with different median survival times in the control arm (8, 12, or 24 weeks) and hazard ratios between treatment and control arms (HR = 0.5 or 0.78), based on a total sample size of 200. Results from Cox regression of the exact failure times are used as benchmarks. We use relative bias, the difference between the average estimate and the true value divided by the true value, to allow fair comparisons across different true parameter values. As the results show, point estimates based on Finkelstein's method are almost unbiased under the different scenarios, while point estimates based on the conventional method are always negatively biased (away from the null) and over-estimate the treatment effect. The bias of the conventional method becomes worse as the assessment frequency decreases (e.g., an 8-week assessment interval), the right-censoring proportion increases (> 20%), and the treatment effect between the arms attenuates. The estimates based on Finkelstein's method are, in general, very robust to the assessment frequency and to censoring. Based on our extensive simulations (some results not shown here), when the maximum number of assessments is fewer than 5, the conventional method has at least about 10% negative bias on the log hazard ratio scale, whereas for Finkelstein's method the bias is at most 5%. The conventional methods and the Cox model with exact failure times yield similar standard deviations, whereas Finkelstein's method overestimates the standard deviations. The 95% coverage probabilities of both the conventional method and Finkelstein's method are fairly close to those of the Cox model with exact failure times.
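For reference, a short sketch of the relative-bias computation on the log hazard ratio scale, with purely illustrative replicate values:

```python
import numpy as np

def relative_bias(estimates, true_value):
    """Relative bias of replicate estimates (here, log hazard ratios) against the true value."""
    return (np.mean(estimates) - true_value) / true_value

# e.g. three illustrative replicate log-HR estimates against a true log(0.5)
print(relative_bias(np.log([0.48, 0.51, 0.47]), np.log(0.5)))
```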
Tables 4 and 5 summarize the empirical Type I error rates at α = 5% (two-sided) based on sample sizes of 200 and 400, respectively. Consistent with the point estimation results in Tables 2 and 3, the score test based on the conventional method tends to be conservative when the assessment frequency decreases and the censoring proportion increases. The Type I error rate of Finkelstein's score test increases as the number of assessments increases; on the other hand, the Type I error rate of Finkelstein's Wald test is very well controlled. The log-rank test with mid-point imputation also performs well under most scenarios. The generalized log-rank tests perform consistently well among all tests evaluated in our simulation when compared with the log-rank test based on exact times. In addition, the Type I error rates of all the interval-censored methods tend to be slightly inflated when the event rate is low.
Tables 6 and 7 summarize the empirical power at α = 5% (two-sided) based on sample sizes of 200 and 400; the findings are consistent with each other. Finkelstein's score test and Wald test have power comparable to that of the log-rank test based on the conventional method. When compared with the log-rank test based on exact failure times, it is clear that as the assessment frequency decreases and the censoring proportion increases, the tests based on interval-censored data become less powerful, since the information contained in the data decreases.
In conclusion, we find that the conventional approaches over-estimate the treatment effect in point estimation; in particular, when the assessment frequency is low, the conventional methods may give severely biased estimates. For hypothesis testing, when assessments are balanced between treatment arms, the conventional approaches and the interval-censoring methods are comparable; the score test based on the conventional approach, however, is more conservative in terms of Type I error and less powerful than the generalized log-rank tests.

An Application
We now apply the methodologies described in the previous sections to a Phase III colorectal cancer clinical trial (ITACa). The full analysis set includes 376 patients, with 176 patients in the treatment (CT+B) arm (arm A) and 194 patients in the control (CT alone) arm (arm B). At the time of the final analysis, 343 failure events had been observed (306 disease progressions and 37 deaths), 163 in arm A and 180 in arm B.
For overall survival, the unstratified log-rank test shows that the survival distributions of arms A and B do not differ significantly at level α = 0.05 (p-value 0.681), and the estimated hazard ratio of arm B versus arm A is 1.027 (p-value 0.772). For progression-free survival, the unstratified log-rank test is likewise not significant at level α = 0.05 (p-value 0.63), and the estimated hazard ratio of arm B versus arm A is 0.974 (p-value 0.541).
To illustrate the non-parametric interval-censored analysis methods, we compare the results from the conventional method with mid-point imputation, Finkelstein's method, and the generalized log-rank tests. When the interval-censored data structure is taken into account, the reported assessment date serves as the right point of the interval for events and as the left point for censored observations. The assessment date prior to the reported assessment date is used as the left point for events; if the recorded progression date is the first assessment date after randomization, the left point is set to 0.
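The interval construction rule just described can be sketched as follows; the function and variable names are ours, and assessment dates are assumed to have already been converted to weeks from randomization.

```python
def to_interval(assessment_times, event_time, has_event):
    """Build (L, R] from a subject's sorted assessment times (weeks from randomization).

    has_event: True if progression/death was reported at `event_time` (one of the
    assessment times); False if the subject is event-free at the last assessment.
    """
    if not has_event:
        # Censored: the last assessment date serves as the left point, no right point.
        return assessment_times[-1], float("inf")
    earlier = [t for t in assessment_times if t < event_time]
    # The assessment prior to the reported date is the left point; if the event is
    # reported at the first post-randomization assessment, the left point is 0.
    left = earlier[-1] if earlier else 0.0
    return left, event_time

# Example: assessments at weeks 8, 16, 24; progression reported at week 16 -> (8, 16].
print(to_interval([8.0, 16.0, 24.0], 16.0, True))
```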
For overall survival, the mid-point imputation log-rank test and generalized log-rank tests I and II are all non-significant at α = 0.05, with p-values 0.87, 0.653 and 0.591, respectively; the hazard ratio based on Finkelstein's method is 1.109. For progression-free survival, the mid-point imputation log-rank test and generalized log-rank tests I and II are likewise non-significant at α = 0.05, with p-values 0.591, 0.577 and 0.565, respectively; the hazard ratio based on Finkelstein's method is 0.991. We also compare the non-parametric estimates of the survival functions for the treatment and control groups based on right-point imputation, mid-point imputation, and the interval-censoring EM-ICM method. These results, together with the median overall survival and progression-free survival in the two arms, are shown in Figures 1 and 2. In conclusion, this Phase III colorectal cancer clinical trial failed to show a clinical benefit of adding bevacizumab (B) to standard chemotherapy (CT). A possible future research direction is to identify biomarkers that could predict sensitivity to anti-angiogenic drugs.

Discussions and Conclusions
In this paper, we compare various approaches for handling interval-censored survival data, namely the conventional imputation-based approaches and the non-parametric approaches. The performance of these methods is evaluated through an extensive simulation study.
We find that when the assessment intervals are exactly symmetric across treatment arms, the conventional approach with right-point imputation performs similarly to mid-point imputation, since the ranks of the event times are the same. We show that regular Cox regression can be severely biased when assessments are infrequent or the event proportion is low. Finkelstein's non-parametric maximum likelihood estimation method, a natural extension of the Cox proportional hazards model for right-censored data, performs uniformly better in the various scenarios we examined and is remarkably robust to different assessment schedules and event proportions. Both the Wald test and the score test based on Finkelstein's method, as well as the generalized log-rank tests, performed well, with generally acceptable Type I error rates and power relative to the conventional approach with the log-rank test at the sample sizes considered.
In conclusion, when analyzing interval-censored survival data, we recommend always considering and assessing the possibility of evaluation-time bias. In practice, we strongly recommend adopting consistent and symmetric assessment intervals across treatment arms whenever possible, and using sensitivity analyses to investigate the robustness of the results. Based on the Monte Carlo simulations we conducted, we conclude that interval-censoring methods, e.g., Finkelstein's method for point estimation and Finkelstein's score test and the generalized log-rank tests for hypothesis testing, are preferred when analyzing such data whenever feasible. However, interval-censoring methods may be less efficient when the sample size is small or moderate, and the corresponding computations may become too intensive when many events occur.