Comparative Analysis of the Cox Semi-parametric and Weibull Parametric Models on Colorectal Cancer Data

The survival and hazard functions are key concepts in survival analysis for describing the distribution of event times. The survival function gives, for every time, the probability of surviving (that is, of not experiencing the event) up to that time. The hazard function gives the potential, per unit time, for the event to occur, given that the individual has survived up to the specified time. While these functions are often of direct interest, many other quantities of interest (e.g., median survival) can be estimated once either the hazard or the survival function is known. This research was a five-year retrospective study of records of colorectal cancer patients who received treatment from 2013 to 2017 in the Radiotherapy Department of Usmanu Danfodiyo University Teaching Hospital, Sokoto, which is one of the cancer registries in Nigeria. Nine covariates were selected to fit the colorectal cancer data using the Cox and Weibull regression models. The results indicate that the predictor variables significantly predict the survival of colorectal cancer patients under the Cox model. The results of the Weibull proportional hazards model likewise show that the model is adequate for predicting the survival of colorectal cancer patients. According to the Akaike Information Criterion (AIC), the semi-parametric Cox regression model performed better on our colorectal cancer data than the parametric Weibull proportional hazards model, providing an efficient and better fit to the study data.


Introduction
Colorectal cancer (CRC) is a tumour of the colon and rectum. Most cases of CRC are sporadic, meaning that there are no known hereditary (genetic) components, and they develop slowly over several years through adenomatous polyps (Brenner et al., [1]). Changes in bowel habits, blood in the stool, and anaemia are cardinal signs and symptoms of CRC. In later stages, fatigue, anorexia, weight loss, pain, jaundice, and other signs and symptoms of locally advanced and metastatic disease occur. CRC is traditionally diagnosed by sigmoidoscopy and colonoscopy with biopsy. There are several ways to treat colorectal cancer, depending on the cancer stage and where the tumour is localized. The main treatment is surgery; however, chemotherapy and radiation therapy can also be used (Potter & Hunter, [2]). Approximately 1.4 million new cases of colorectal cancer and almost 700,000 deaths occurred worldwide in 2012 (Arnold et al., [3]). Survival analysis is generally defined as a set of methods for analyzing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, occurrence of a disease, marriage, divorce, etc. The time to event, or survival time, can be measured in days, weeks, years, etc. For example, if the event of interest is death, then the survival time can be the time in years until a person dies (Hosmer, Lemeshow, and May, [4]).
According to Hosmer et al. [4], observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring. Censoring is an important issue in survival analysis, representing a particular type of missing data. Censoring that is random and non-informative is usually required in order to avoid bias in a survival analysis.
The survival and hazard functions are key concepts in survival analysis for describing the distribution of event times. The survival function gives, for every time, the probability of surviving (or not experiencing the event) up to that time. The hazard function gives the potential that the event will occur, per time unit, given that an individual has survived up to the specified time. While these are often of direct interest, many other quantities of interest (e.g., median survival) may subsequently be estimated from knowing either the hazard or survival function (Hosmer et al., [4]).
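The link between the two functions can be written out explicitly. The following is a brief sketch in standard notation; the symbols $T$ (event time), $S(t)$, $h(t)$ and $H(t)$ (cumulative hazard) are introduced here for illustration and are not taken from the study itself.

```latex
\begin{aligned}
S(t) &= P(T > t), \\
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = -\frac{d}{dt} \log S(t), \\
H(t) &= \int_0^t h(u)\, du, \qquad S(t) = \exp\{-H(t)\}, \\
t_{\mathrm{med}} &\ \text{solves}\ S(t_{\mathrm{med}}) = 0.5 .
\end{aligned}
```

As a hypothetical example, a constant hazard of $h(t) = 0.1$ per year gives $S(t) = \exp(-0.1t)$, so the median survival is $\log 2 / 0.1 \approx 6.93$ years.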
Many countries today have population-based cancer registries. Their task is to collect and store information on all cases of cancer in the country and to produce statistics on the incidence of cancer and the survival of cancer patients. They play an important role in analysing the impact of cancer in the community. In Nigeria, for example, there are ten (10) population-based cancer registries owned by the Federal Government and located at various tertiary hospitals across the country, according to the Nigerian National System of Cancer Registries (NSCR, [5]). In most parts of Africa, the cancer burden is under-reported because population statistics are lacking or inaccurate, which makes age-specific incidence rates impossible to compute or inaccurate (Abdulkareem, [6]).
The aim of this study was to carry out a population-based survival analysis of colorectal cancer using the Cox and Weibull regression models, in order to ascertain which of the two better fits colorectal cancer data in population-based research. The specific objectives were to: (i) describe the survival function using the Kaplan-Meier (K-M) approach and compare the survival curves using log-rank tests; (ii) fit the two models to the colorectal cancer data; (iii) test the Cox proportional hazards assumption using both a statistical test and a graphical method; (iv) estimate the survival and hazard functions using the Cox and Weibull proportional hazards models, together with the effects of the covariates on patients in the colorectal cancer data collected; and (v) ascertain the performance, efficiency and flexibility of the models using the AIC.
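The Kaplan-Meier step in the first objective can be illustrated with a minimal Python sketch. This is not the R code used in the study; the function names and the follow-up times below are hypothetical, and the log-rank comparison is not covered here.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times  : follow-up time for each patient
    events : 1 if the event was observed, 0 if right-censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)      # everyone who leaves the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d > 0:
            # Product-limit update: multiply by the conditional survival at t
            surv *= 1.0 - d / at_risk
            curve.append((t, surv))
        at_risk -= leaving[t]     # events and censored cases both leave
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None                   # median not reached within follow-up


# Hypothetical follow-up times in months (not data from the study)
times = [6, 7, 10, 15, 19, 25]
events = [1, 0, 1, 1, 0, 1]

km_curve = kaplan_meier(times, events)
km_median = median_survival(km_curve)
```

Censored patients contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is how the estimator uses incomplete observations without treating them as events.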
The origin of survival analysis has been traced back to World War II (Dickman and Hakulinen, [7]). Survival analysis is a series of procedures for analysing the timing of events (Kleinbaum and Klein, [8]). Dickman [9] listed some developments in statistical methods for population-based cancer survival analysis.
Adejumo and Ahmadu [10] reported that the performance of the Cox proportional hazards model does not depend on the shape parameter of the Weibull model. Quantin et al. [11] compared the performance of different regression models for censored survival data in modelling the impact of prognostic factors on all-cause mortality in colon cancer. Similarly, Ahmad et al. [12] treat the Cox model as a multivariate semi-parametric regression model in their work on colon cancer.
Abdulkabir et al. [13] point out that the shape parameter of the Weibull model does not affect the performance of the Cox proportional hazards model. They further assert that both models perform similarly when the distributional assumptions are not met, except when the sample size is small, and that the Weibull model outperforms the Cox model when the distributional assumptions are met and the shape parameter is known.
According to Wang et al. [14], the Weibull distribution is the best model for survival analysis of genetic associations in HNFIB of cancer patients when the Cox proportional hazards model and parametric models are compared.
Cancer is the leading cause of death and disability worldwide, affecting more than 14 million people annually (WHO, [15]). Knut et al. [16] consider colorectal cancer (CRC) a complex disease in which almost 40% of surgically cured patients experience cancer recurrence within 5 years. Cancer control refers to all actions taken to reduce the frequency and impact of cancer (Armstrong, [17]).
Zaki [18] found a general formula for generating survival data on the computer through the fundamental relation between the hazard rate and the survival function.
Yuan [19] assumed an exponential form for the baseline hazard function and combined it with Cox proportional hazards regression in a survival study of a group of lung cancer patients. The covariate coefficients in the hazard function are estimated by maximum likelihood following the proportional hazards regression analysis. Although the proportional hazards model does not give an explicit baseline hazard function, the baseline hazard function can be estimated by fitting the data with a nonlinear least squares technique.
Nigeria contributed 15% of the estimated 681,000 new cases of cancer that occurred in Africa in 2008 (Sylla, [20]). As in the rest of the developing world, a significant proportion of the increase in the incidence of cancer in Nigeria is due to increasing life expectancy, reduced risk of death from infectious diseases, increasing prevalence of smoking, physical inactivity and obesity, as well as changing dietary and lifestyle patterns (Sylla, [20]).

Material and Method
This research was a five-year retrospective study of records of colorectal cancer patients who received treatment from 2013 to 2017 in the Radiotherapy Department of Usmanu Danfodiyo University Teaching Hospital (UDUTH), Sokoto. Purposive sampling was used in selecting UDUTH, as it is one of the cancer registries in Nigeria.
The research was designed to follow two stages. The first stage was the discussion and formulation of the Cox proportional hazards regression model and the Weibull proportional hazards model. In the second stage, data from one of the cancer registries (Usmanu Danfodiyo University Teaching Hospital, Sokoto) were collected to obtain the following estimates: the survival function, hazard function and median survival, as well as the efficiency and performance of the fitted models using the AIC.
Software: The R programming language provides the packages required to carry out the research work, and SPSS was used for data entry and arrangement.

Cox Proportional-Hazards Regression
The Proportional Hazards Model, proposed by Cox [22], has been used primarily in medical testing analysis to model the effect of secondary variables on survival. Its strength lies in its ability to model and test many inferences about survival without making any specific assumptions about the form of the life distribution model.
Most interesting survival-analysis research examines the relationship between survival, typically in the form of the hazard function, and one or more explanatory variables (or covariates).
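The most common are linear-like models for the log hazard. For example, a parametric regression model based on the exponential distribution may be written as
$$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip},$$
or, equivalently,
$$h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) = h_0 \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}),$$
where $h_i(t)$ denotes the hazard function for observation $i$ at time $t$, $h_0 = e^{\alpha}$ is the baseline hazard, $\exp(\beta_j)$ represents the relative risk associated with the $j$-th covariate, and $x_{i1}, x_{i2}, \ldots, x_{ip}$ represent the covariates. This is therefore a linear model for the log-hazard or a multiplicative model for the hazard itself. The model is parametric because, once the regression parameters $\alpha, \beta_1, \ldots, \beta_p$ are specified, the hazard function $h_i(t)$ is fully characterized by the model. The regression constant $\alpha$ represents a kind of baseline hazard when all of the $x$'s are 0. Other parametric hazard regression models are based on other distributions commonly used in modelling survival data, such as the Weibull distribution.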
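Fully parametric hazard regression models have largely been superseded by the Cox model [22], which leaves the baseline hazard function $h_0(t)$ unspecified:
$$h_i(t) = h_0(t) \exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}).$$
The Cox model is termed semi-parametric because, while the baseline hazard can take any form, the covariates enter the model through the linear predictor
$$\eta_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}.$$
Notice that there is no constant term (intercept) in the linear predictor: the constant is absorbed in the baseline hazard. The Cox regression model is a proportional-hazards model. Consider two observations, $i$ and $i'$, that differ in their $x$-values, with respective linear predictors
$$\eta_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \quad \text{and} \quad \eta_{i'} = \beta_1 x_{i'1} + \cdots + \beta_p x_{i'p}.$$
The hazard ratio for these two observations is
$$\frac{h_i(t)}{h_{i'}(t)} = \frac{h_0(t)\, e^{\eta_i}}{h_0(t)\, e^{\eta_{i'}}} = \frac{e^{\eta_i}}{e^{\eta_{i'}}}.$$
This ratio is constant over time. In this initial formulation, the research assumed that the values of the covariates are constant over time.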
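The proportional-hazards property can be checked numerically. The sketch below is illustrative only: the coefficients, covariate values and baseline hazard are hypothetical, and the function names are not from any library. Whatever baseline is supplied, it cancels in the ratio.

```python
import math


def cox_hazard(t, x, beta, baseline_hazard):
    """Hazard h(t | x) = h0(t) * exp(beta . x) for one covariate vector x."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return baseline_hazard(t) * math.exp(eta)


def hazard_ratio(x_a, x_b, beta):
    """Hazard ratio of patient a to patient b; free of t and of h0."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_a, x_b)))


def baseline(t):
    # Arbitrary time-varying baseline hazard, chosen only for illustration
    return 0.01 + 0.002 * t


beta = [0.03, 0.5]        # hypothetical coefficients (age, binary indicator)
patient_a = [60, 1]
patient_b = [50, 0]

# The ratio of the two hazards evaluated at several follow-up times
ratios = [
    cox_hazard(t, patient_a, beta, baseline) / cox_hazard(t, patient_b, beta, baseline)
    for t in (1, 12, 36, 60)
]
hr = hazard_ratio(patient_a, patient_b, beta)   # exp(0.03 * 10 + 0.5) = exp(0.8)
```

Every entry of `ratios` agrees with `hr` up to floating-point error, even though the baseline hazard itself changes with time.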
The Cox model can also accommodate time-dependent covariates.

Parametric Model
The parametric proportional hazards model is the parametric version of the Cox proportional hazards model and has a similar form. The hazard function at time $t$ for a particular patient with a set of $p$ covariates $x = (x_1, x_2, \ldots, x_p)$ is given by
$$h(t \mid x) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) = h_0(t) \exp(\beta' x).$$
The key difference between the two kinds of models is that the baseline hazard function is assumed to follow a specific distribution when a fully parametric PH model is fitted to the data, whereas the Cox model has no such constraint. The coefficients are estimated by partial likelihood in the Cox model but by maximum likelihood in the parametric PH model. Other than this, the two types of models are equivalent: hazard ratios have the same interpretation and proportionality of hazards is still assumed. A number of different parametric PH models may be derived by choosing different baseline hazard functions. The commonly applied models are the exponential, Weibull and Gompertz models.
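To make the partial-likelihood idea concrete, here is a minimal Python sketch of the Cox partial log-likelihood. It assumes there are no tied event times, and the function name and toy data are hypothetical. The baseline hazard never appears, because it cancels within each risk set.

```python
import math


def cox_partial_loglik(beta, times, events, covariates):
    """Cox partial log-likelihood, assuming no tied event times.

    For each observed event i, add eta_i - log(sum of exp(eta_j) over the
    risk set), where the risk set holds everyone with time_j >= time_i.
    """
    eta = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    loglik = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue              # censored cases only appear in risk sets
        risk_sum = sum(
            math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i
        )
        loglik += eta[i] - math.log(risk_sum)
    return loglik


# Hypothetical toy data: four patients, one binary covariate
times = [5, 8, 12, 20]
events = [1, 1, 0, 1]
covariates = [[1.0], [0.0], [1.0], [0.0]]

ll_null = cox_partial_loglik([0.0], times, events, covariates)
ll_alt = cox_partial_loglik([0.5], times, events, covariates)
```

With all coefficients at zero, each event contributes minus the log of its risk-set size, so `ll_null` equals `-(log 4 + log 3 + log 1)`. A fitting routine would search for the coefficients that maximise this function, whereas a parametric PH model would instead maximise a full likelihood that also involves the baseline parameters.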

Weibull Proportional Hazard Model
Suppose that survival times are assumed to have a Weibull distribution with scale parameter $\lambda$ and shape parameter $\gamma$, so that the survival and hazard functions of a $W(\lambda, \gamma)$ distribution are given by
$$S(t) = \exp(-\lambda t^{\gamma}), \qquad h(t) = \lambda \gamma t^{\gamma - 1}, \tag{10}$$
with $\lambda, \gamma > 0$. The hazard rate increases over time when $\gamma > 1$ and decreases when $\gamma < 1$. When $\gamma = 1$, the hazard rate remains constant, which is the special exponential case.
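The effect of the shape parameter on the hazard can be seen in a short Python sketch. Here `lam` stands for the scale parameter and `gamma` for the shape parameter of the Weibull distribution described above; the numerical values are hypothetical.

```python
import math


def weibull_survival(t, lam, gamma):
    """S(t) = exp(-lam * t**gamma)."""
    return math.exp(-lam * t ** gamma)


def weibull_hazard(t, lam, gamma):
    """h(t) = lam * gamma * t**(gamma - 1)."""
    return lam * gamma * t ** (gamma - 1)


lam = 0.05                 # hypothetical scale parameter
time_points = (1, 2, 4)

hazards_increasing = [weibull_hazard(t, lam, 1.5) for t in time_points]  # shape > 1
hazards_constant = [weibull_hazard(t, lam, 1.0) for t in time_points]    # shape = 1
hazards_decreasing = [weibull_hazard(t, lam, 0.5) for t in time_points]  # shape < 1
```

The three lists rise, stay flat at `lam`, and fall respectively, matching the three cases for the shape parameter.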
Under the Weibull PH model, the hazard function of a particular patient with covariates $x = (x_1, x_2, \ldots, x_p)$ is given by
$$h(t \mid x) = \lambda \gamma t^{\gamma - 1} \exp(\beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) = \lambda \gamma t^{\gamma - 1} \exp(\beta' x). \tag{11}$$
We can see that the survival time of this patient has the Weibull distribution with scale parameter $\lambda \exp(\beta' x)$ and shape parameter $\gamma$. Therefore, the Weibull family with fixed $\gamma$ possesses the PH property. This shows that the effects of the explanatory variables in the model alter the scale parameter of the distribution, while the shape parameter remains constant. The corresponding survival function is given by
$$S(t \mid x) = \exp\{-\lambda \exp(\beta' x)\, t^{\gamma}\}. \tag{12}$$
After a transformation of the survival function for a Weibull distribution, we can obtain
$$\log\{-\log S(t)\} = \log \lambda + \gamma \log t. \tag{13}$$
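For completeness, the survival function and the median survival time under the Weibull PH model can be derived from the hazard. This is a short sketch in which $\lambda$ denotes the scale parameter, $\gamma$ the shape parameter, $\beta' x$ the linear predictor, and $H(t \mid x)$ the cumulative hazard; the symbol $t_{\mathrm{med}}$ is introduced here for illustration.

```latex
\begin{aligned}
H(t \mid x) &= \int_0^t \lambda \gamma u^{\gamma - 1} \exp(\beta' x)\, du
             = \lambda \exp(\beta' x)\, t^{\gamma}, \\
S(t \mid x) &= \exp\{-H(t \mid x)\}
             = \exp\{-\lambda \exp(\beta' x)\, t^{\gamma}\}, \\
S(t_{\mathrm{med}} \mid x) = 0.5
  \;&\Longrightarrow\;
  t_{\mathrm{med}} = \left\{ \frac{\log 2}{\lambda \exp(\beta' x)} \right\}^{1/\gamma}.
\end{aligned}
```

Taking logarithms twice in the second line gives $\log\{-\log S(t \mid x)\} = \log \lambda + \beta' x + \gamma \log t$, so under this model a plot of $\log\{-\log S\}$ against $\log t$ should be roughly linear with slope $\gamma$, with covariates shifting the line vertically.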

Model Evaluation Using Akaike Information Criterion (AIC)
To select the model that performs best among competing models, we use the Akaike Information Criterion (AIC) proposed by Akaike [23]. AIC is a measure of the goodness of fit of an estimated statistical model (Akaike, [23]). It is an operational way of trading off the complexity of an estimated model against how well the model fits the data. Given a set of data, several competing models may be ranked according to their corresponding AIC, and the one having the smallest AIC is the best. AIC was the first model selection criterion to gain widespread acceptance, and it is an extension of the maximum likelihood principle. AIC is given by the formula
$$\mathrm{AIC} = -2 \log(\text{Likelihood}) + 2K,$$
where Likelihood is the probability of the data under a given model and $K$ is the number of parameters in the model. For the number of parameters, $K = 2$ for the Cox and Weibull proportional hazards models (Klein and Moeschberger, [24]).
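The ranking rule can be sketched in a few lines of Python. The model names, log-likelihoods and parameter counts below are hypothetical and are not results from this study; they only show how the penalty term can reverse a ranking based on likelihood alone.

```python
def aic(log_likelihood, num_params):
    """AIC = -2 * log-likelihood + 2 * (number of parameters)."""
    return -2.0 * log_likelihood + 2.0 * num_params


def best_by_aic(models):
    """Return the name of the model with the smallest AIC.

    models maps a name to a (log_likelihood, num_params) pair.
    """
    return min(models, key=lambda name: aic(*models[name]))


# Hypothetical fitted models: (maximised log-likelihood, number of parameters)
models = {
    "model_a": (-150.0, 9),
    "model_b": (-148.5, 11),
}

aic_values = {name: aic(*fit) for name, fit in models.items()}
best = best_by_aic(models)
```

In this illustration `model_b` has the higher log-likelihood, but its two extra parameters cost more than the gain in fit, so `model_a` has the smaller AIC and is preferred.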
The Wald test statistic equals 17.390 with 9 degrees of freedom and is also significant. Looking at the p-values of the test results above, i.e. the concordance, the likelihood ratio, the log-rank and the Wald tests, we can say that the predictor variables significantly predict the survival of colorectal cancer patients. In addition, Age of the patient and Age at Diagnosis are the covariates most strongly related to the survival of patients, as displayed in the output of the R program. From the results in Table 3, the test is not statistically significant for any of the covariates, and the global test is also not statistically significant. Therefore, we can assume proportional hazards (that is, the proportional hazards assumption is met). In Figure 1, the solid line is a smoothing spline fit to the plot, with the dashed lines representing a standard-error band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards, since proportional hazards assumes that the estimates $\beta_1, \beta_2, \ldots, \beta_p$ do not vary much over time. From the graphical inspection, there is no pattern with time. The assumption of proportional hazards therefore appears to be supported for the covariates.

Results from Analysis of Weibull Proportional Hazard Model
From the results in Table 4, the chi-square value (log-rank test) is 52 with 9 degrees of freedom, with a reported p-value of 0.000 (that is, p < 0.001). Age of the patient and Age at Diagnosis are the most influential covariates, with p-values of 0.001, which are highly significant and related to the survival of patients. The maximum log-likelihood for the Weibull model is -139.890, and the overall p-value of 0.000 shows that the model is adequate for predicting the survival of the colorectal cancer patients.

Conclusion
The results of this study show that, for our colorectal cancer data, the semi-parametric Cox regression model could better determine the factors associated with colorectal cancer than the parametric Weibull proportional hazards model. In the present study, the Cox model provided an efficient and better fit to the study data than the Weibull model. Therefore, researchers in the health care field would do well to consider this model in their research on colorectal cancer, provided that the assumption of proportional hazards is fulfilled.