Estimation of Population Based Colorectal Cancer Survival Analysis Using Cox Proportional Hazards Model

Colorectal cancer (CRC) is a tumour of the colon and rectum. Most cases of CRC are sporadic, meaning there are no known hereditary (genetic) components, and the disease develops slowly over several years through adenomatous polyps. Changes in bowel habits, blood in the stool, and anaemia are cardinal symptoms and signs of CRC. In later stages, fatigue, anorexia, weight loss, pain, jaundice, and other signs and symptoms of locally advanced and metastatic disease occur. The aim of this study is to estimate population-based colorectal cancer survival using the Cox Proportional Hazards model, in order to fit colorectal cancer data in population-based research. This research was a five-year retrospective study of records of colorectal cancer patients who received treatment from 2013 to 2017 in the Radiotherapy Department of Usmanu Danfodiyo University Teaching Hospital, Sokoto, which is one of the cancer registries in Nigeria. Nine covariates were selected to fit the colorectal cancer data using Cox regression models. The median survival time over the five-year study period was found to be 121 days. From the results, it was concluded that the predictor variables could significantly predict the survival of colorectal cancer patients using the Cox proportional hazards model. The results also show that the data met the Cox Proportional Hazards assumptions.


Introduction
Colorectal cancer (CRC) is a tumour of the colon and rectum. Most cases of CRC are sporadic, meaning there are no known hereditary (genetic) components, and the disease develops slowly over several years through adenomatous polyps (Brenner et al., [1]). Changes in bowel habits, blood in the stool, and anaemia are cardinal symptoms and signs of CRC. In later stages, fatigue, anorexia, weight loss, pain, jaundice, and other signs and symptoms of locally advanced and metastatic disease occur. CRC is traditionally diagnosed by sigmoidoscopy and colonoscopy with biopsy. There are several ways to treat colorectal cancer, depending on the cancer stage and where the tumour is localized. The main treatment is surgery; however, chemotherapy and radiation therapy can also be used (Potter & Hunter, [2]). Approximately 1.4 million new cases of colorectal cancer and almost 700 000 deaths occurred worldwide in 2012 (Arnold et al., [3]). Survival analysis is generally defined as a set of methods for analysing data where the outcome variable is the time until the occurrence of an event of interest. The event can be death, occurrence of a disease, marriage, divorce, etc. The time to event, or survival time, can be measured in days, weeks, years, etc. For example, if the event of interest is death, then the survival time can be the time in years until a person dies (Hosmer et al., [4]).
According to Hosmer et al. [4] observations are called censored when the information about their survival time is incomplete; the most commonly encountered form is right censoring. Censoring is an important issue in survival analysis, representing a particular type of missing data. Censoring that is random and non-informative is usually required in order to avoid bias in a survival analysis.
The survival and hazard functions are key concepts in survival analysis for describing the distribution of event times. The survival function gives, for every time, the probability of surviving (or not experiencing the event) up to that time. The hazard function gives the potential that the event will occur, per time unit, given that an individual has survived up to the specified time. While these are often of direct interest, many other quantities of interest (e.g., median survival) may subsequently be estimated from knowing either the hazard or survival function (Hosmer et al., [4]).
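The link between the two functions can be made explicit. The following is a sketch in standard notation for a continuous event time; the symbols $T$ (the random event time), $f(t)$ (its density), $H(t)$ (the cumulative hazard) and $t_{0.5}$ (the median survival time) are assumptions introduced here for illustration and are not notation taken from the study.

```latex
\begin{align}
S(t) &= P(T > t), \\
h(t) &= \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}
      = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \\
H(t) &= \int_0^t h(u)\,du, \qquad S(t) = \exp\{-H(t)\}, \\
S(t_{0.5}) &= 0.5 .
\end{align}
```

Under these definitions, knowing either $h(t)$ or $S(t)$ determines the other, which is why quantities such as the median survival time can be derived from either function.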
Many countries today have population-based cancer registries. Their task is to collect and store information on all cases of cancer in the country and to produce statistics on the incidence of cancer and the survival of cancer patients. They play an important role in analysing the impact of cancer in the community. In Nigeria, for example, there are ten (10) population-based cancer registries owned by the Federal Government and located at various tertiary hospitals across the country, according to the Nigerian National System of Cancer Registries (NSCR, [5]). In most parts of Africa, the cancer burden is under-reported because population statistics are lacking or inaccurate, which makes age-specific incidence rates impossible to compute or unreliable (Abdulkareem, [6]).
The aim of this study was to estimate population-based colorectal cancer survival using the Cox proportional hazards model, in order to fit colorectal cancer data in population-based research.
Cancer is a leading cause of death and disability worldwide, affecting more than 14 million people annually (W. H. O., [15]). Knut et al. [16] describe colorectal cancer (CRC) as a complex disease in which almost 40% of surgically cured patients experience cancer recurrence within 5 years. Cancer control refers to all actions taken to reduce the frequency and impact of cancer (Armstrong, [17]).
Zaki [18] found a general formula for generating survival data on the computer through the fundamental relation between the hazard rate and the survival function. Methods for analysing survival data are among the areas of statistics that have developed rapidly in recent years.
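As an illustration of how the relation between the hazard rate and the survival function can be used to generate survival data, the sketch below applies inverse-transform sampling. It is not Zaki's formula itself: it assumes a constant (exponential) baseline hazard with a multiplicative covariate effect, and the function name, the parameter values `baseline_rate = 0.01` and `beta = 0.7`, and the binary covariate are all hypothetical.

```python
import math
import random


def simulate_survival_time(x, beta, baseline_rate, rng):
    """Draw one event time from the hazard h(t | x) = baseline_rate * exp(beta * x).

    Assumption: the baseline hazard is constant, so S(t | x) = exp(-rate * t).
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    rate = baseline_rate * math.exp(beta * x)
    # Setting S(T) = U and solving for T gives T = -log(U) / rate.
    return -math.log(u) / rate


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Hypothetical binary covariate: 0 = unexposed, 1 = exposed.
times_unexposed = [simulate_survival_time(0, 0.7, 0.01, rng) for _ in range(5000)]
times_exposed = [simulate_survival_time(1, 0.7, 0.01, rng) for _ in range(5000)]

mean_unexposed = sum(times_unexposed) / len(times_unexposed)
mean_exposed = sum(times_exposed) / len(times_exposed)
print(mean_unexposed, mean_exposed)
```

With these assumed values, the theoretical mean survival time is 1/0.01 = 100 time units for the unexposed group and 100/exp(0.7) for the exposed group, so the simulated means should be close to those values.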
Nigeria contributed 15% to the estimated 681,000 new cases of cancer that occurred in Africa in 2008 (Sylla, [19]). Similar to the situation in the rest of the developing world, a significant proportion of the increase in incidence of cancer in Nigeria is due to increasing life expectancy, reduced risk of death from infectious diseases, increasing prevalence of smoking, physical inactivity, obesity as well as changing dietary and lifestyle patterns (Sylla, [19]).

Material and Method
This research was a five-year retrospective study of records of colorectal cancer patients who received treatment from 2013 to 2017 in the Radiotherapy Department of Usmanu Danfodiyo University Teaching Hospital (UDUTH), Sokoto. Purposive sampling was used in selecting UDUTH, as it is one of the cancer registries in Nigeria.
The research followed two stages. The first stage was the discussion and formulation of the Cox Proportional-Hazards Model. In the second stage, data from one of the cancer registries (Usmanu Danfodiyo University Teaching Hospital, Sokoto) were collected to obtain the following estimates: Kaplan-Meier plots, log-rank tests comparing survival curves, and the survival, hazard and median survival functions.
Software: The R programming language was used for the analysis because it provides the packages required to carry out the research work, and SPSS was used for data entry and arrangement.

Kaplan Meier
In cancer trials, the Kaplan-Meier (K-M) method is one of the recommended techniques in survival analysis and is the most popular method for developing the survival function (Collett, [20]). The method is used to measure the fraction of subjects living for a certain period of time after treatment. It is applied by analysing the distribution of patients' survival times following their recruitment to a study. The analysis is expressed in terms of the proportion of patients still alive up to a given time following their recruitment. Graphically, a plot of the proportion of patients surviving against time has a characteristic decline; the steepness of the curve indicates the efficacy of the treatment being investigated, with a shallower curve showing a more effective treatment. In analysing survival data, two functions that depend on time are of particular interest: the survival function and the hazard function.
The survival function, denoted by S(t), is the probability of surviving at least to time t.
The hazard function, denoted by h(t), is the conditional probability of dying at time t, having survived to that time. The graph of S(t) against t is called the survival curve.
The Kaplan-Meier method can be used to estimate this curve from the observed survival times without assuming an underlying probability distribution. The method is based on the basic idea that the probability of surviving $p$ or more periods from entering the study is the product of the $p$ observed survival rates for each period, i.e. the cumulative proportion surviving, and is given by
$$S(p) = s_1 \times s_2 \times \cdots \times s_p$$
where $s_i$ denotes the proportion surviving the $i$th period, $i = 1, 2, \ldots, p$; for example, $s_2$ = the proportion surviving beyond the second period conditional on having survived up to the second period, and so on.
The proportion surviving period $i$, having survived up to period $i$, is given by
$$s_i = \frac{r_i - d_i}{r_i}$$
where $r_i$ = the numbers alive at the beginning of the $i$th period and $d_i$ = the number of deaths within the $i$th period.
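The product of conditional survival proportions described above can be sketched in a few lines of code. This is a minimal illustration rather than the routine used in the study (the analysis there was carried out in R); the function name `kaplan_meier` and the eight follow-up times are hypothetical, and an event indicator of 0 marks a right-censored observation.

```python
def kaplan_meier(times, events):
    """Return the Kaplan-Meier curve as a list of (time, S(time)) pairs.

    times  : follow-up time for each subject
    events : 1 if death was observed at that time, 0 if right-censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)  # numbers alive just before the current time
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the proportion surviving this period: (r - d) / r.
            survival *= (at_risk - deaths) / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored subjects both leave the risk set
    return curve


# Hypothetical follow-up data, not taken from the study.
times = [5, 8, 8, 12, 15, 20, 23, 30]
events = [1, 1, 0, 1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

For these hypothetical data the estimated survival drops to 0.875, 0.75, 0.6, 0.45 and 0.225 at times 5, 8, 12, 15 and 23. Censored subjects contribute to the numbers at risk up to their censoring time but never trigger a drop in the curve.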

Log-rank Test
A statistical hypothesis test called the log-rank test was used to compare two survival curves. It tests the null hypothesis that there is no difference between the survival curves, i.e. that the probability of an event occurring at any point in time is the same for each population.
The total expected number of events for a group is the sum of the expected number of events at the time of each event. The expected number of events at the time of an event can be calculated as the risk of death at that time multiplied by the numbers alive in the group. Under the null hypothesis, the risk of death, i.e. the number of deaths divided by the numbers alive, can be calculated from the combined data for both groups.

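$$E_2 = \sum_i \frac{d_i}{r_i}\, r_{2i}$$
where $d_i$ = the number of deaths at the time of event $i$ in both groups combined, $r_i$ = the numbers alive in both groups combined at that time, and $r_{2i}$ = the numbers alive from group 2 at the time of event $i$. $E_1$ is calculated as $n - E_2$, where $n$ = the total number of events. The test statistic is compared with a $\chi^2$ distribution with 1 degree of freedom.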
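The expected counts described above can be computed directly from two groups of follow-up times. The sketch below is illustrative only: it assumes the simple form of the statistic, (O1 - E1)^2/E1 + (O2 - E2)^2/E2, where O1 and O2 are the observed numbers of events in each group, which is one common approximation and may differ from the variance-based version reported by statistical software. The function name and the data are hypothetical.

```python
def log_rank_test(times1, events1, times2, events2):
    """Return (chi_square, expected1, expected2) for two groups.

    events: 1 if death was observed, 0 if right-censored.
    Assumption: the simple (O - E)^2 / E form of the statistic is used.
    """
    all_times = list(times1) + list(times2)
    all_events = list(events1) + list(events2)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e == 1})

    expected2 = 0.0
    for t in event_times:
        alive1 = sum(1 for u in times1 if u >= t)  # numbers alive in group 1
        alive2 = sum(1 for u in times2 if u >= t)  # numbers alive in group 2
        deaths = sum(1 for u, e in zip(all_times, all_events) if u == t and e == 1)
        # Risk of death from the combined data, times the numbers alive in group 2.
        expected2 += deaths / (alive1 + alive2) * alive2

    observed1 = sum(events1)
    observed2 = sum(events2)
    expected1 = (observed1 + observed2) - expected2  # E1 = n - E2
    chi_square = ((observed1 - expected1) ** 2 / expected1
                  + (observed2 - expected2) ** 2 / expected2)
    return chi_square, expected1, expected2


# Hypothetical follow-up data for two groups, not taken from the study.
group1_times, group1_events = [6, 7, 10, 15], [1, 1, 0, 1]
group2_times, group2_events = [9, 13, 20, 25], [1, 0, 1, 1]
chi_square, expected1, expected2 = log_rank_test(
    group1_times, group1_events, group2_times, group2_events)
print(chi_square, expected1, expected2)
```

The resulting statistic is compared with 3.841, the 5% critical value of the chi-square distribution with 1 degree of freedom; a value below this threshold gives no evidence against the null hypothesis of equal survival curves.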

Cox Proportional-hazards Regression
The Proportional Hazards Model, proposed by Cox [21], has been used primarily in medical testing analysis to model the effect of secondary variables on survival. Its strength lies in its ability to model and test many inferences about survival without making any specific assumptions about the form of the life distribution model.
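Most interesting survival-analysis research examines the relationship between survival, typically in the form of the hazard function, and one or more explanatory variables (or covariates).
The most common are linear-like models for the log hazard. For example, a parametric regression model based on the exponential distribution may be written as
$$\log h_i(t) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$$
or, equivalently,
$$h_i(t) = \exp(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})$$
where $i$ indexes the observations and the $x$'s are the covariates. This is therefore a linear model for the log-hazard or a multiplicative model for the hazard itself. The model is parametric because, once the regression parameters $\alpha, \beta_1, \ldots, \beta_k$ are specified, the hazard function $h_i(t)$ is fully characterized by the model. The regression constant $\alpha$ represents a kind of baseline hazard, since $h_i(t) = e^{\alpha}$ when all of the $x$'s are 0. Other parametric hazard regression models are based on other distributions commonly used in modelling survival data, such as the Weibull distribution.
The Cox model, in contrast, leaves the baseline hazard function $h_0(t)$ unspecified:
$$h_i(t) = h_0(t)\exp(\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})$$
Consider two observations $i$ and $i'$ that differ in their $x$-values, with linear predictors $\eta_i = \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$ and $\eta_{i'} = \beta_1 x_{i'1} + \cdots + \beta_k x_{i'k}$. The hazard ratio for these two observations is
$$\frac{h_i(t)}{h_{i'}(t)} = \frac{h_0(t)\,e^{\eta_i}}{h_0(t)\,e^{\eta_{i'}}} = e^{\eta_i - \eta_{i'}}$$
This ratio is constant over time. In this initial formulation, the research assumed that the values of the covariates $x_{ij}$ are constant over time.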
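The time-independence of the hazard ratio can be checked numerically. The sketch below assumes the usual Cox form, in which the hazard is a baseline hazard multiplied by the exponential of a linear combination of the covariates. The baseline function, the coefficients and the two patient profiles are hypothetical values chosen only for illustration; they are not estimates from the study.

```python
import math


def linear_predictor(x, beta):
    """Return beta_1 * x_1 + ... + beta_k * x_k."""
    return sum(b * v for b, v in zip(beta, x))


def cox_hazard(t, x, beta, baseline_hazard):
    """Hazard at time t under the assumed form h0(t) * exp(linear predictor)."""
    return baseline_hazard(t) * math.exp(linear_predictor(x, beta))


def baseline(t):
    """Hypothetical baseline hazard that increases with time."""
    return 0.002 * math.sqrt(t)


beta = [0.5, -0.02]   # hypothetical coefficients for two covariates
patient_a = [1, 60]   # hypothetical covariate values
patient_b = [0, 45]

# The ratio of the two hazards, evaluated at several time points.
ratios = [
    cox_hazard(t, patient_a, beta, baseline) / cox_hazard(t, patient_b, beta, baseline)
    for t in (10, 100, 365)
]
# Closed form: exp(eta_a - eta_b); the baseline hazard cancels.
closed_form = math.exp(linear_predictor(patient_a, beta) - linear_predictor(patient_b, beta))
print(ratios, closed_form)
```

Although the baseline hazard changes with time, every entry of `ratios` equals `closed_form` up to floating-point error, because the baseline cancels in the ratio. This is the property that the proportional hazards assumption requires of the data.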
The Cox model can also accommodate time-dependent covariates.

Results
Figure 1 shows that the overall median survival time is 121 days. This implies that 50% of the colorectal cancer patients survived 121 days or less and the other 50% survived longer than 121 days after being diagnosed with the disease. This is the survival time at which the cumulative survival function is equal to 0.5.
From Table 3, the Cox Proportional Hazards result shows that colorectal cancer patients receiving combined therapy (surgery and chemotherapy) have a higher risk of death than those receiving single therapy: increasing the covariate Type of Treatment by 1 unit increases the hazard ratio by 0.353. However, this effect is not significant, since the p-value = 0.394. The result for the covariate Sex is significant: increasing Sex by 1 unit increases the hazard ratio by 1.130, which indicates that female patients have a higher risk of death than male patients. Increasing the covariate Age by 1 unit changes the hazard ratio by -19.290 (a decrease), and this effect is highly significant.
From Table 4, the test is not statistically significant for any of the covariates, and the global test is also not statistically significant. Therefore, we can assume proportional hazards, i.e. the proportional hazards assumptions are met. In Figure 2, the solid lines are smoothing spline fits to the plots, with the dashed lines representing a standard-error band around each fit. From the graphical inspection, there is no pattern with time, so the assumption of proportional hazards appears to be supported for the covariates.
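The median survival time discussed above is read from the estimated survival curve as the earliest time at which the cumulative survival function falls to 0.5 or below. A minimal sketch of that rule follows; the function name and the step curve are hypothetical and do not reproduce the study's Figure 1.

```python
def median_survival(curve):
    """Return the first time at which S(t) <= 0.5, or None if never reached.

    curve: list of (time, S(time)) pairs sorted by time, describing a step
    function such as a Kaplan-Meier estimate.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # survival never drops to 0.5, so the median is not estimable


# Hypothetical step curve, not the study's data.
example_curve = [(5, 0.875), (8, 0.75), (12, 0.6), (15, 0.45), (23, 0.225)]
median = median_survival(example_curve)
print(median)  # 15
```

When heavy censoring keeps the curve above 0.5 for the whole follow-up period, the function returns `None`, reflecting that the median cannot be estimated from such data.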

Conclusions
The results of this study show that, for our colorectal cancer data, the semi-parametric Cox regression model could determine the factors associated with colorectal cancer survival. In the present study, the Cox model provided an efficient and good fit to the study data. Therefore, researchers in the health care field should consider this model in their research on colorectal cancer, provided the assumptions of proportional hazards are fulfilled.