Modeling Survival Data by Using Cox Regression Model
Medhat Mohamed Ahmed Abdelaal*, Sally Hossam Eldin Ahmed Zakria
Statistics and Mathematics Department, Faculty of Commerce, Ain Shams University, Cairo, Egypt
To cite this article:
Medhat Mohamed Ahmed Abdelaal, Sally Hossam Eldin Ahmed Zakria. Modeling Survival Data by Using Cox Regression Model.American Journal of Theoretical and Applied Statistics.Vol.4, No. 6, 2015, pp. 504-512. doi: 10.11648/j.ajtas.20150406.21
Abstract: Survival analysis refers to the general set of statistical methods developed specifically to model the timing of events. A popular regression model for the analysis of survival data is the Cox proportional hazards regression model. The Cox regression model is a semi parametric model, making fewer assumptions than typical parametric methods but more assumptions than those nonparametric methods. The main objective of this paper is to construct Cox proportional hazards regression model for examining the covariate effects on the hazard function and to determine the risk factors affecting the outcome of liver transplantation operation for end-stage liver disease. This article will focus on a review of (a) the Cox model and interpretation of its results, (b) assessment of the validity of the PH assumption, and (c) accommodating non-proportional hazards using covariate stratification. Cox PH model showed that the variables: Recipient age,, Ln_Creatinine, and GRWR are statistically significant and selected as significant factors for risk of death after liver transplantation operation. Also the scaled Schoenfeld residual displayed non-proportionality for variable Recipient Age and this variable needed to be stratified. And the Cox-Snell residual showed the Cox PH model does not fit these data adequately. So the stratified Cox model could be more appropriate to the current study. The stratified Cox model with interaction and with no interaction were applied and showed that the no-interaction model is acceptable at 0.05 level of significance and the variables, Ln_Creatinine are statistically significant and selected as significant factors for risk of death after liver transplantation operation at 0.05 level of significance.
Keywords: Survival Analysis, Censoring, Cox Proportional Hazard Regression Model, Cox- Snell Residual,
Stratified Cox Regression Model
The most common approach to model covariate effects on survival is the Cox proportional hazard model, which can handle censored and/or truncated observations . Regression analysis is generally used for identifying the risk factors. But due to the presence of censoring in survival data, ordinary regression models cannot be used. Also simple logistic regression analysis has the limitation of only allowing a view of survival probability over the entire study period as a single time interval and it assume that every patient is at risk over the entire study period. This is not valid for studies with long follow up or other situations where patients have variable time at risk. For this purpose, in survival analysis, Cox’s regression model is widely applicable.
The distinguishing feature of Cox PH model is its ability to estimate the relationship between the hazard rate and explanatory variables without having to make any assumptions about the shape of the baseline hazard function. Hence the Cox model is sometimes referred to as a semi-parametric model.
The Cox regression model is a statistical theory of counting processes that unifies and extends nonparametric censored survival analysis. The approach integrates the benefits of nonparametric and parametric approaches to statistical inferences .
The Cox proportional hazards regression model relates covariates to the hazard function as follows:
Where is called the baseline hazard function, which is the hazard function for an individual for whom all the variables included in the model are zero,= is a parameter vector of regression coefficients, is the value of the vector of explanatory variables for a particular individual, and is a fixed, known scalar function .
This is a semi-parametric model where the baseline hazard is estimated non-parametrically, while the covariate effect is constrained by the parametric representation. Where, take an exponential form
Which assures that the hazard is non-negative and assumes that covariate effects on the hazard are multiplicative. So
The Cox model is called a proportional hazards model since the ratio of hazard rates of two individuals with covariate values and, at time t is:
The hazard ratio is time-independent as, the ratio does not depend on .
Since the hazard function at t given covariate x is =. The survival function, the cumulative hazard function and probability density function can be derived as follows:
2. Parametric PH Models
Parametric models need some special assumptions about , such as the exponential and Weibull distributions. But the advantage of Cox model is the fact that such assumptions can be avoided.
The Parametric PH model is the parametric versions of the Cox proportional hazards model. It is given in similar form to the Cox PH models. The main difference between the two kinds of models is that the baseline hazard function is assumed to follow a specific distribution when a fully parametric PH model is fitted to the data, while the Cox model has not such assumption. A number of different parametric PH models can be derived by choosing different hazard functions . The commonly used models are exponential, Weibull, or Gompertz models.
Weibull PH model:
The Weibull model allows for hazard rates to be non-constant but monotonic that either increase or decrease exponentially with time.
Under the Weibull PH model, the hazard function of a particular patient with covariates is given by:
Where is the scale parameterand is the shape parameter
Exponential PH model:
The hazard function under this model is to assume that it is constant over time. Under the exponential PH model, the hazard function of a particular patient is given by:
3. Likelihood Estimation for the Cox PH Model
Derivation of an estimator of cannot be based on an ordinary likelihood function since is not specified parametrically in the Cox model. Instead, partial likelihood function has been proposed by Cox  for the estimation of regression parameters which is a function depending on only.
Cox partial likelihood
Let ……. be the observed survival time for n individuals, and the ordered death time of r individuals be …. .The set of individuals who are at risk atis denoted by . So that is the group of individuals who are alive and uncensored at a time just prior to . The conditional probability that the individual dies at given that one individual from the risk set on dies at is:
By taking the product of these conditional probabilities over death times gives:
Then the partial likelihood function for the Cox PH model is given by:
Where is the risk set at time andis the event indicator which is zero if the survival time is right censored and unity otherwise. This is the partial likelihood defined by Cox. The Cox methodology uses the partial likelihood to yield estimates of that are consistent and efficient regardless of the form of . The partial likelihood is valid when there are no ties in the dataset.
4. The Score Function and Information Matrix
The regression coefficients are estimated with that maximize the partial likelihood. Assuming no ties, the log partial likelihood is:
Then the score function which is the first partial derivative:
For h = 1, 2 . . . . The maximum partial likelihood estimate can be obtained uniquely by solving the partial likelihood equation:
Whereas the second derivative of the partial likelihood is given by:
This matrix for = 1, 2 . . . . and h = 1, 2 . . . ., is a sum over i = 1, 2 . . . n. of weighted covariance matrices for the x vector in the populations at risk at the time .
The negative of the second partial derivatives provide the observed information matrix which estimates the covariance matrix of the estimated regression coefficients .
is defined as the observed information matrix .
The score vector evaluated at the true value of will be asymptotically distributed as a multivariate normal with mean vector zero and covariance matrix which can be unbiased estimated by .
The estimate will also be asymptoticly normal
5. Model Checking in Cox Regression Model
After the model has been fitted, the adequacy of the fitted model needs to be assessed which is usually performed using model residuals.
5.1. Cox-Snell Residuals
The Cox-Snell residual is given by Cox and Snell, which is used for assessing the fitness of PH model . The Cox-Snell residual for the ith individual is defined as:
Where is an estimate of the baseline cumulative hazard function at time . In practice the Nelson – Aalen estimate is generally used. . If the final PH model is correct and the are close to the true values of the, then should resemble a censored sample from a unit exponential distribution. Therefore, a plot of the Nelson-Aalen cumulative hazard estimate of residuals versus residuals should be a straight line through the origin with a slope of 1, if the fitted model is correct.
5.2. Proportional Hazard Assumption Checking
The main assumption of the Cox proportional hazards model is proportional hazards, which mean that the hazard ratio is constant over time. There are several methods for verifying that a model satisfies the assumption of proportionality (Graphical method, Scaled Schoenfeld residuals, Adding time dependent covariate) .
• Graphical method
According to Cox regression model the survival function for ith individual is given by:
Where x = is the values of the vector of explanatory variables for a particular individual. By taking the logarithm twice, we get:
Then the difference in log-log curves corresponding to two different individuals with variables = and which does not depend on t is given by:
This provides the basis for assessing the validity of PH assumption. By plotting estimated versus survival time for two groups we would see parallel curves if the hazards are proportional . This method does not work well for categorical predictors with many levels because the graph becomes cluttered.
• Scaled Schoenfeld residuals
Scaled Schoenfeld residuals are defined as the product of the inverse of the estimated variance-covariance matrix of the kth Schoenfeld residual and the kth Schoenfeld residual . The scaled Schoenfeld residualcan be used to assess time trends and lack of proportionality.
Where is the Scaled Schoenfeld residual and is the Schoenfeld residual.
Under the null hypothesis, we expect to see a constant function over time. When the proportional hazards assumption holds, straight horizontal line with zero slope is expected.
6. The Stratified Cox Regression Model
The stratified Cox regression model is a modification of the Cox regression model that allows for control by stratification of a covariate that does not satisfy the proportional hazards assumption. Covariates that are assumed to satisfy the proportional hazards assumption are included in the model, however the predictors being stratified is not included. There are interaction and no-interaction models defined in the stratified Cox regression model .
6.1. No-Interaction Model
In the stratified model with no interaction, the strata divide the individuals into disjoint groups, each having a distinct baseline hazard but a common value for the regression parameter which means that the coefficients are the same for each stratum. The hazard function for the failure time of an individual in stratum takes the form:
Where denotes the particular stratum , is a vector of unknown regression parameters, and are K unknown baseline hazard functions. The subscript in the equation indicates that each stratum has its own baseline hazard function while the are the same across strata.
Under the stratified model, it can be seen that individuals within the th stratum share the same baseline hazard functionwhich implies that the proportional hazards for two individuals in the same stratum still holds:
On the other hand, individuals from different groups can have non-proportional hazards as their baseline hazards functions may differ.
Since these functions are unrestricted, any relationship of this hazard ratio over time is possible.
The partial likelihood for the stratified Cox model is the product of partial likelihoods in each stratum:
6.2. Interaction Model
The data set can be stratified into strata according to the variable that does not satisfy the proportional hazards assumption; in this case, the interaction model is defined as follows:
In this interaction model, each regression coefficient has the subscript, which denotes the stratum and indicates that the regression coefficients are different for different strata. So if there is no interaction the stratified Cox regression model will contains regression coefficients that do not vary over the strata. If interaction is allowed for, different coefficients for each of the stratum are obtained. Likelihood ratio test statistics is used to examine the no-interaction assumption.
The likelihood ratio (LR) test compares log likelihood statistics for the interaction model and the no-interaction model.
7. Data Set
This part of the study includes shedding light on the case study and the collected data description. The study involved 308 patients performed liver transplantation operation consecutively admitted to Egypt Air hospital and Ain Shams University hospital from January 2007 to May 2013. Among 308 patients 4 were excluded due to their age were less than 18 years and 2 for re-transplantation. So the study included 302 patients. In the following analysis, the time after the operation till death is the endpoint of interest, this variable is measured in months. There was a 2-year follow-up period for the patients. Patients who were still alive at the end of the follow-up period were treated as censored observations. The complete data set consists of 302 observations, of which 81.45% are censored. The results of Cox PH regression model is obtained by using (STATA) statistical packages.
8. Description of the Variables
• Survival time: time to death or censoring time, measured in months.
• Death status: the event indicator, equal to 1 for those died during the period of the study and 0 for those who were not died or censored.
• Recipient Age: a dummy variable, equal to 0 if the patient age less than 50 and 1 if his age greater than or equal 50.
• Recipient Sex: a dummy variable, equal to 1 for male and 0 for female.
• Donor Age: a dummy variable, equal to 0 if the patient age less than 30 and 1 if his age greater than or equal 30.
• Donor Sex: a dummy variable, equal to 1 for male and 0 for female.
• BMI: Body Mass Index
• CTP score: stands for Child-Turcotte-Pugh score. A categorical variable, with codes 1 for class A, 2 for class B and 3 class C. Since the variable CTP has three levels, it is included in the model using the subgroup as the reference group.
• MELD Score: stands for Model for End stage Liver Disease. A categorical variable, with codes 1 for MELD score from , 2 for MELD score from and 3 for MELD score from. Since the variable MELD has three levels, it is included in the model using the subgroup as the reference group.
• HCC: stands for Hepatocellular Carcinoma .A dummy variable, equal to 1 for patients suffer from HCC and 0 if they did not.
• Ascites: a dummy variable, equal to 1 if the patients suffer from Ascites and 0 if they did not.
• Encephalopathy: a dummy variable, equal to 1 if the patients suffer from Encephalopathy and 0 if they did not.
• Ln-Total Bilirubin: mg/dl, this variable is transformed by taking the logarithm to decrease the influence of extreme values and to fit normal distribution.
• Ln-Creatinine: mg/dl, this variable is transformed by taking the logarithm to decrease the influence of extreme values.
• Albumin: a dummy variable, equal to 1 if the Albumin level is less than or equal 2.6 and 0 if the Albumin level is higher than 2.6 (
• Inverse-INR: this variable is transformed by taking the inverse to decrease the influence of extreme values.
• Sodium: a categorical variable, with codes 1 for Sodium level , 2 for Sodium level from and 3 for Sodium level from. Since the variable Sodium has three levels, it is included in the model using the subgroup as the reference group.
• Ln-Calcium: mg/dl, this variable is transformed by taking the logarithm to decrease the influence of extreme values.
• Ln-Potassium: mg/dl, this variable is transformed by taking the logarithm to decrease the influence of extreme values.
• GRWR: Graft to recipient weight ratio showed in percentage.
9. Analysis and Results
|Covariates||β||Hazard Ratio||95% CI LL||95% CI UL||p-value|
|Recipient Age <50||1|
|Recipient Age >=50||0.505||1.657||0.964||2.846||0.067*|
|Recipient Sex Female||1|
|Recipient Sex Male||-0.289||0.748||0.354||1.581||0.448|
|Donor Age < 30||1|
|Donor Age >= 30||0.049||1.0507||0.616||1.789||0.855|
|Donor Sex Female||1|
|Donor Sex Male||0.242||1.274||0.643||2.526||0.475|
|CTP Score A||1|
|CTP Score B||-0.094||0.909||0.117||7.04||0.928|
|CTP Score C||0.34||1.405||0.193||10.202||0.736|
|MELD1 (6 to 12)||1|
|MELD2 (13 to 18||1.256||3.513||0.823||14.986||0.089*|
|MELD3 (19 or higher)||2.218||9.1916||2.204||38.316||0.002*|
|Albumin < 2.6||1|
|Albumin >= 2.6||0.2505||1.284||0.751||2.195||0.300|
|Sodium1 (<= 130)||1|
|Sodium2 (131- 135)||-0.271||0.762||0.367||1.579||0.465|
To determine the variables to be included in the final model, the univariate Cox PH regression analysis is applied first to identify the impact of individual variable on time to event before proceeding more complicated model selection. Variables are identified as significant using a 0.2 significance level in the univariate analysis.
Table (1) presents the univariate Cox PH analysis .The first column is showing the coefficients the parameter estimate, in the second column the hazard ratio for a one unit change in the predictor, then the 95% confidence interval and finally the -value.
According to the univariate Cox PH analysis, that the covariates Recipient Age ( , (, (, Ln creatinine (, and the GRWR ( are statistically significant and selected as significant factors for risk of death after liver transplantation operation.
• Multivariate Cox PH regression analysis
We then conducted full multivariate Cox PH analysis (by using stepwise selection process) including all the potential risk factors that had a -value of less than or equal 0.2 in univariate Cox PH analysis.
To select the best subgroup of variables in our model, the approach of stepwise was applied. The stepwise selection process consists of a series of alternating forward selection and backward elimination steps. The former step adds variables into the model, and the latter step removes variables from the model. The threshold for variable selection into the model is setting with at 0.2 (SLENTRY = 0.2), while the threshold for variable removing from the model is setting with at 0.1 (SLSTAY = 0.1). It means only variables with less than 0.2 will be tested in the model, and to keep it in the model, its should be less than 0.1. The results from the stepwise proportional hazard regression are displayed as below.
|Covariates||β||Hazard Ratio||95% CL LL||95% CL UL||p-value|
|Recipient Age <50||1|
|Recipient Age >=50||0.748||2.113||1.163||3.841||0.014*|
|MELD1 (6 to 12)||1|
|MELD2 (13 to 18||1.159||3.189||0.739||13.75||0.101|
|MELD3 (19 or higher)||2. 057||7.827||1.851||33.08||0.005*|
To further optimize the Cox model, the variable with the highest and over threshold of significance are removed from the predictive model one by one until all the rest variables are shown significant impact on the prediction of hazard rate. From Table (2) the variable Sodium3 is the one with highest p-value= 0.532, so it is removed. The result is shown as below Table (3).
|Covariates||β||Hazard Ratio||95% CL LL||95% CL UL||p- value|
|Recipient Age <50||1|
|Recipient Age >=50||0.596||1.816||1.04||3.154||0.034*|
|MELD1 (6 to 12)||1|
|MELD2 (13 to 18||1.214||3.368||0.787||14.417||0.102|
|MELD3 (19 or higher)||2.18||8.853||2.11||37.129||0.003*|
And then, the same strategy is applied for the following analysis. The variable has highest, so removed in the following step as shown in Table (4).
|Covariates||β||Hazard Ratio||95% CL LL||95% CL UL||p- value|
|Recipient Age <50||1|
|Recipient Age >=50||0.604||1.831||1.0547||3.178||0.032*|
|MELD1 (6 to 12)||1|
|MELD3 (19 or higher)||1.160||3.190||1.834||5.548||0.0001*|
The final model is obtained as in Table (4), the data showed that most of the predictors are significant in the model with their p-value less than 0.05 except for the GRWR. After we built a multivariate model of main effects, we then check all the interactions between predictors. To test the interaction among variables, the list of all raw variables and all possible combinations of interactions are included for proportional hazard regression analysis however none of the interactions are significant. Eventually, the final model is generated including the variables Recipient age, MELD3, Ln_Creatinine, and GRWR.
The final multivariate Cox PH model is then given by:
The final multivariate Cox PH model concluded that:
• The higher: the Recipient age, the MELD score and the Creatinine level the more the risk of death after LDLT.
• The Larger the transplanted liver graft the lower the risk of death after LDLT.
After fitting Cox PH model, we can plot the survivor, cumulative hazard, and the estimated hazard functions, as shown in figure (1).
It is obvious from figure (1-c) that the hazard function is not monotonic, as it first very high during the early weeks after the LT operation, then decreases and tends to stabilize during the first year from LT, and after the 1st year it begins to increase slightly.
10. Model checking
Adequacy of a fitted model needs to be assessed after a model has been constructed.
10.1. The PH Assumption Checking
The final model is based on a major assumption that the hazards between groups are proportional. To test the assumption of proportionality, the scaled Schoenfeld residuals and log-log survival plot have been used.
• Log- Log Survival Plot
Figure (2) shows plot of the variable Recipient Age and . For Recipient Age the plotted lines are not parallel however for the plotted lines are parallel. Although using graphs to assess the validity of the assumption is subjective, it can be a helpful tool.
Figur (2). Log-Log Survival plot For Rec.Age and .
• Scaled Schoenfeld residuals
Scaled Schoenfeld residuals is based on the principle that, for a given regressor, the assumption restricts for all . This implies that a plot of versus time will have a slope of zero. The null hypothesis is having zero slope, which is equivalent to testing that the log hazard-ratio function is constant over time.
|Global test||5. 72||4||0.221|
Table (5) shows both covariate-specific and global tests. It is obvious that Recipient age variable violates the proportional-hazards assumption =0.017 which is less than 0.05.
Also figure (3) supports the results obtained before, variables Creatinine,, and GRWR , having zero slope and does not violate the proportional-hazards assumption . However, Recipient age variable violates the proportional-hazards assumption as it does not have a zero slope.
10.2. Cox-Snell Residuals
We assess goodness of fit by using Cox–Snell residual plot.
In Figure (4), the blue line is the estimation of Cox-Snell residuals while the red line is the origin with a slope equals to 1. The plot suggests that the Cox PH model does not fit the straight line adequately. There is some evidence of a systematic deviation from the straight line which gives us some concern about the adequacy of the fitted model. As the scaled Schoenfeld residuals and showed that the Cox PH model displayed non-proportionality for variable Recipient Age, and the Cox-Snell residuals suggests that the Cox PH model does not fit the data adequately, so the Stratified Cox regression model is more adequate to be used.
11. The Stratified Cox Regression Model
As Schoenfeld residuals showed that the Cox PH model displayed non-proportionality for variable Recipient Age, which means that there is an interaction between this variable and time, so the Stratified Cox regression model is more adequate to be used. Here we applied the stratified Cox regression with no interaction and with interaction model.
|Variables||No interaction model||Interaction model|
|Strata 1 Recipient Age||Strata 2 Recipient Age|
Log likelihood for no interaction model = -256.31462
Log likelihood for interaction model = -256.072857
It is clear that in the no interaction model there is different baseline hazard function for each stratum however the coefficients are the same. However in the interaction model different baseline hazard function and different coefficients are obtained for each stratum.
Table (6) shows the application of the stratified Cox regression with no interaction and with interaction model, to determine which model is more appropriate statistically the likelihood ratio test is used, which compares log-likelihood statistics for the interaction model and the no-interaction model.
Under the null hypothesis : the no interaction model is correct and statistically appropriate.
The Likelihood ratio statistics is calculated as follows:
This value (0.48348) is not significant at the 0.05 level of significance for 2 degrees of freedom.
Thus, it appears that despite the numerical difference between corresponding coefficients in and models, there is no statistically significant difference. We can therefore conclude for these data that the stratified Cox model with no-interaction model is acceptable (at 0.05 level of significance).
The Cox PH model was used to examine the covariate effects on the hazard function and to determine the risk factors affecting the outcome of liver transplantation operation for end-stage liver disease.
The final multivariate Cox PH model is then given by:
Cox PH model showed that the variables: Recipient age,, Ln_Creatinine, and GRWR are statistically significant and selected as significant factors for risk of death after liver transplantation operation.
The scaled Schoenfeld residuals showed that the Cox PH model displayed non-proportionality for variable Recipient Age and the Stratified Cox regression model is more adequate to be used.
Also the Cox-Snell residual showed that there is some evidence of a systematic deviation from the straight line which gives us some concern about the adequacy of the fitted model. So we concluded that the Cox PH model does not fit these data adequately. The stratified Cox model with interaction and with no interaction were applied and showed that the no-interaction model is acceptable at 0.05 level of significance and the variables , Ln_Creatinine are statistically significant and selected as significant factors for risk of death after liver transplantation operation at 0.05 level of significance.