Demonstrating the Performance of Accelerated Failure Time Model over Cox-PH Model of Survival Data Analysis with Application to HIV-Infected Patients under HAART

Background: Human Immunode�ciency Virus (HIV) is a virus that kills CD4 cells. These CD4 cells are white blood cells that �ght infection. CD4 count is like a snapshot of how well our immune system is functioning. Studying the way of CD4+ count over time provides an insight to the disease evolution. Methods: This study was considering the data of HIV/AIDS patients who were undergoing Antiretroviral Therapy in the ART clinic of Menellik II Referral Hospital, Addis Ababa, Ethiopia, during the period 1st January 2014 to 31st December 2017. The data was analyzed in separate survival models i.e non parametric, semi parametric (Cox PH) and parametric survival model (AFT models). For the purpose of model diagnosis cox-snail residual analysis were incorporated. Results: For separate survival model log-logistic model is more appropriate for the survival data than other parametric models. Therefore; functional status and regimen class are signi�cant covariates in determining the hazard function patients. . In the Log-logistic model, among the covariates we have included in the survival model: functional status (working subgroup) and regimen class (all subgroup) were signi�cant at 5% level of signi�cance. But, sex, age, baseCD4, marital status and WHO-clinical stage are not signi�cance at 5% signi�cance level. Using cox-snail residual shows proportionality not satis�ed for these WHO stage, regimen class and marital status. Conclusions: Log rank and Wilcoxon tests showed that the signi�cant difference in survival situation among the categorical variables selected for this study sex, marital status, functional status, WHO-clinical stages and regimen class subgroups. But, there was no signi�cant difference in the time-to-event between subgroups of sex, Marital Status and WHO clinical Stage, while, Regimen Class and Functional Status there was a signi�cant difference in the time-to-event between subgroups.

Sub-Saharan Africa carries a disproportionate burden of HIV, accounting for more than 70% of the global burden of infection. In 2013, sub-Saharan Africa was home for approximately 24.7 million people living with HIV, accounting for 71% of the global total with around 1.5 million new HIV infections and 1.1 million AIDS related deaths as cited in [3]. In sub-Saharan African countries, the prevalence rate among both female and male sex workers was still high (13%) [8].
Ethiopia is the second most populous nation in Africa, next to Nigeria, and one of the seriously affected countries in sub-Saharan Africa. Since rst evidence of the HIV epidemic was detected in Ethiopia in 1984, AIDS has claimed the lives of millions. It has been estimated that approximately 45,200 deaths were related to AIDS and that 793,700 people were living with HIV in 2013 as cited in [7].
In order to reduce the number of deaths caused by HIV, it is better to supply free Antiretroviral (ARV) drugs for patients in HIV. These drugs are taken in combination with three or more drugs; this kind of approach is commonly known as Highly Active Antiretroviral Therapy (HAART). After the introduction of ART the death of HIV patients decreases tremendously. Antiretroviral drugs (ARVs) are medications that slow or block the replication of Human Immuno De ciency Virus (HIV), help to boost the body's ability to ght of HIV infection (immune system) and decrease the viral load in the body.
This therapy positively in uences the quality of life and the survival of seropositive HIV carriers [7]. In Ethiopia, there were more than 222,000 patients on ART at the end of 2010 and ART has improved the survival of patients with HIV/AIDS and the quality of life [9]. It is quite common in clinical research that both the repeated measurement data and time-to-event data are simultaneously observed. For example, in HIV/AIDS studies, the trajectories of CD4 counts and time-to-death are collected. In such studies, the interest often lies in understanding the relationships between the longitudinal history of a process and its effect on the risk of an event [4].
The main objective of this study was to analyze the survival endpoints of HIV infected patients under HAART based on a data records in Menilik II Referral Hospital, Addis Ababa, Ethiopia. Speci cally this study was aimed to 1. Empirically investigate the major factors that determine the survival time among those patients. In the absence of facilities for CD4 T lymphocytes count, all patients with WHO stages 3 and 4 disease should start ART. Those with WHO stages 1 and 2 disease need supervision and should be monitored carefully, with a minimum of three-monthly clinical reviews and at any time if new symptoms develop. Initiation of ART is recommended for all patients with pulmonary TB or severe bacterial infections and CD4 counts < 350 cells/mm3. Initiation of ART is also recommended for all pregnant women with any stage 3 disease and a CD4 count < 350 cells/mm3. In response to successful ART, the CD4 T lymphocytes count typically increases by > 50 cells/µL within weeks after viral suppression, and then increases 50-100 cells/µL per year thereafter until a threshold is reached [20]. Antiretroviral therapy ART increases the CD4 lymphocyte counts and reduce risk of opportunistic infections and improve survival of HIV-infected people [Santos ACOD, Almeida AMR 2013] as cited in [9]. HIV affects the CD4 cell count in the human body, so it can be employed to make appropriate decisions for the initiation of HAART and proper management of the progression of the infection [19]. In addition to HIV, demographic variables are other factors affecting CD4 cell count changes. Such as age (older ages are predictors of lower count response to HAART [13]; [10], sex (females experience better CD4 count response to HAART compared to males [1], and residence area (rural patients who start ART with a deteriorated CD4 cell count at the initiation of HAART poorly respond to HAART) [9]. ART cannot cure HIV, but it can help the patient live a longer, healthier life and reduce their risk of HIV transmission [14].
The effectiveness of ART has been assessed by clinical observations, CD4 cell counts and determination of plasma viral load [Awoke T., et al., 2016]. In the national treatment guideline of Ethiopia 2010, the rstline ART contains four NRTIs backbone which are Stavudine (d4t), zidovudine (AZT), Abacavir (ABC) and Tenofovir (TDF) plus lamivudine (3TC) with additional two NNRTI drugs (efavirenz (EFV) or nevirapine (NVP) [16]. The combination regimens which have been used most frequently in Ethiopia are d4t-3TC-EFV, d4t-3TC-NVP, AZT-3TC-EFV, AZT-3TC-NVP, TDF-3TC-EFV, or TDF-3TC-NVP [WHO, 2004].Whereas, patients switch to second-line regimen when the rst-line regimen failed due to different reasons. A failure in treatment is measured in three ways: (1) clinical-when new or recurrent WHO stage 4 condition, (2) immunological-when persistent CD4 level below 100 or 50% fall from on-treatment peak value and (3) virological-when plasma viral load above 10,000 copies/ml in duplicates after six months on ART [2].

Methods
The method of estimators in this analysis: Kaplan-Meier estimator, Cox's proportional hazards regression model (Cox PH) and accelerated failure time (AFT) model will be discussed. The model comparison and model diagnosis techniques used in this data analysis are also included. Survival data analysis involves the analysis of time-to-event data that have a principal endpoint-the time until an event occurs (time-toevent data). Survival data are censored in the sense that they did not provide complete information for a variety of reasons; subjects may not have experienced the event of interest. The existence of variables that change over time is also a distinguishing feature in survival analysis.
Descriptive analysis of survival data utilizes non-parametric methods to compare the survival functions of two or more groups. Kaplan-Meier estimator (product-limit-estimator) of the survival function used for this purpose. The log-rank test is utilized to test whether observed differences in survival experience between/among the groups is signi cant or not. The multivariable model used is the semi-parametric regression model known as the proportional hazards regression (PHR) model. When a study involves multiple characteristics, appropriate statistical techniques must be used to select variables that have signi cant effects on survival and which are judged to be clinically meaningful for inclusion in a PHR model. The following sections present the method of non-parametric, semi-parametric and fullyparametric estimation statistical techniques.

Non-parametric Survival Model
Kaplan-Meier (KM) Estimator: the standard non-parametric technique to estimate the survival function is proposed by Kaplan and Meier (1958). Some said this product-limit estimator. Kaplan and Meier commonly called as Kaplan-Meier estimator. Let t 1 <t 2 < ··· < t D represent the observed failure times in a sample of size n and y i be the number of individuals at risk prior to time t i. This estimator is de ned as: The Product-limit estimator is a right continuous step function with jumps at the observed event times and it provides an e cient means of estimating the survival function for right-censored data. An alternative non-parametric estimator is suggested by Nelson (1972). It has better small-sample-size performance estimator based on the Product-limit estimator.

Cox's Proportional Hazard Regression Model
The Cox-PH regression model [Cox DR, 1972] is widely used in epidemiological research to examine the association between an exposure and a health outcome. In a typical approach to the analysis of epidemiological data with a continuous exposure variable, the exposure is transformed to an ordinal or nominal variable and relative risk (RR) is modeled as a step function of the exposure. The Cox-PH model is used to analyze censored data. Suppose the observed data are the triples (t i , z i , c i ) where ti is the possibly censored survival time, z i the scalar predictor variable, and ci the event indicator, taking values of 1 if the event of interest occurred and 0 if it did not. Survival analysis model that has fully parametric regression structure, but leaves its dependence on time unspeci ed is known as semi-parametric regression model. Cox  where h 0 (.) is an arbitrary unspeci ed hazard function and beta is the regression coe cient.
In other instances, when the covariate Z(t) may be thought of as a stress factor affecting the individuals under study at time t. With such time-dependent covariates, the Cox's proportional hazard model is of the form: [Due to technical limitations, this equation is only available as a download in the supplemental les section.] where h0(t) re ects how hazard function changes with survival time, and Z(t)′β characterizes how hazard function changes with covariates. Cox has proposed exponential function the function of covariate x, and parameter β, say it f(x, β), and the hazard function formulated as: [Due to technical limitations, this equation is only available as a download in the supplemental les section.] when x changes from x 0 to x 1, the hazard ratio is computed as [Due to technical limitations, this equation is only available as a download in the supplemental les section.] this model in literature is termed as Cox proportional hazard model Cox DR [6]. In this model researchers are mainly interested on the parameter β, interpreted as changing rate of hazard when the covariate changed by unit value of (x 1 − x 0 ). Remembering that the given baseline hazard function h0(t) remains known, because of this is that the model is called semi-parametric model [11].
In cox proportional hazard regression model there are basic concepts we have to know, these are: (i) the baseline hazard λ0(t) depends on t but not on the covariates x 1 , x 2 , …, x n; (ii) the hazard ratio exp(βX), depends on the covariates x 1 , x 2 , …, x n but not on time t; and also, (iii) the covariates x 1 , x 2 , …, x n do not depend on the time t. These concepts are assumptions of Cox PH model. To check these Cox PH assumptions we can develop diagnostic plots and formal tests. We have been developed diagnostic plots of selected graphical representations and used formal tests to test time-varying coe cients using proportionality test by including a covariate and time interaction terms (time-dependent covariates) associated with the graphical ones.
Allison et.al 1995) and Hosmer, Lemeshow (1999) presented two different things about the PH model assumptions. Allison et.al 1995) was presented that violation of PH assumptions is not that much serious. On the contrary, Hosmer, Lemeshow (1999) put their ideas as violation of PH assumptions should be taken into account and appropriate modi cation of the model should be used for the concise interpretation of the model and covariates. Partial likelihood used for the cox model allows using timedependent covariates. A covariate is time-dependent mean that if the covariate value is change over-time for an individual.

Accelerated Failure Time (AFT) Model
The AFT model is an alternative model when the proportional hazards assumption does not ful ll. For the past twenty years the Cox proportional hazards model has been used widely to study the covariate effects on the hazard function for the failure time variable. On the other hand, the AFT model, which simply regresses the logarithm of the survival time over the covariates, has rarely been utilized in the analysis of censored survival data. The AFT model has an intuitive physical interpretation and would be an alternative method to the Cox model in survival analysis.
The AFT model treats the logarithm of survival time as the response variable and includes an error term that is assumed to follow a particular some distribution. Equation below shows the log-linear representation of the AFT model for the i th individual, where log(T i ) is the log-transformed survival time, T i survival-time, X 1 , X 2 , …, X p are explanatory variables with coe cients β 1 , β 2 , …, β p, respectively, and ϵi represents residual or unexplained variation in the log-transformed survival times, while β 0 and σ are intercept and scale parameters, respectively [5]. The main reason we use logarithm of T i is that to consider the truth behind the survival times are always positive with probability of 1.
[Due to technical limitations, this equation is only available as a download in the supplemental les section.] In the absence of censored data, we can estimate the data by using ordinary least square (OLS) estimators. By assigning a new variable, Y = log (T), using the linear regression model with Y as the response variable. But, survival data have at least some censored observations and these are di cult to manage in OLS. Instead we have to use the MLE with different assumptions on ϵ. For each the distribution of ϵ, there should be a corresponding distribution of survival time, T.
In the presence of only a single explanatory variable, X 1 was de ned as a 0-1 indicator variable, and the deceleration factor (cˆ) was calculated from the coe cient estimate associated with X 1 (i.e, cˆ = exp(βˆ1)). When there are many covariates were available, additional terms β 2 x 2 … β p x p were included in the model, where the added variables represented factors such as gender, age, marital status, weight, WHO Stage, functional status, regimen class. In survival interest response variable that the length of time from HAART start date until the date of death or censor (measured in months). 608 (78.70%) and 165 (21.30%) of patients were alive on HAART (Censored) and died due to HIV/AIDS related death, respectively.

Exploring Univariate Survival Data Analysis
In any data analysis process it is always better to do some uni-variate analysis before directly goes to the more complicated models. In survival analysis it is recommended that visualizing the Kaplan-Meier curves for all the categorical predictors in the dataset. This was show insight to the shape of the survival function for each categorical group and gives an idea whether or not the groups are proportional (i.e., the survival functions are approximately parallel). We have also done the tests of equality across strata to explore whether or not the predictor included in the nal model, which is a non-parametric test.
The Kaplan-Meier estimator plots, which used to estimate the survival curves for categorical variables is presented. Plot of the Kaplan-Meier estimates for only four selected categorical variables; sex, functional status, marital status and Regimen Class was displayed (Figure 1). In Figure 1: (i) reveals that both male and female seems have equal survival time. But in some extent female are more survived than male. In functional status of patients in ambulatory are more survived than patients working and bedridden.
According to marital status of patients from plot (iii) also one can see that all strata have equal survival time, no more difference between them. Similarly, in plot (iv) patients who take another regimens other than the listed have some more survival time than patients who allocated in the three class. As a general, all the survival time in all variables are very delayed, no rapid decrements.
The results in the Log rank and Wilcoxon tests showed that the signi cant difference in survival situation among the categorical variables selected for this study sex, marital status, functional status, WHO-clinical stages and regimen class subgroups (Table 3). But, there was no signi cant difference in the time-toevent between subgroups of sex, Marital Status and WHO clinical Stage, while, Regimen Class and Functional Status there was a signi cant difference in the time-to-event between subgroups. The log-rank test of equality across strata for these predictors are insigni cance, indicate that no need to include them as potential candidate for the nal model. From the graph also we have seen the subgroups in each variable were not parallel and overlap for some part of the graph. This lack of parallelism could gain a problem when we include these predictors in the Cox PH model since one of the assumptions is proportionality of the predictors.

Cox's Proportional Hazard Model Analysis
In order to determine the hazard function in survival data analysis, it is better to t Cox's PH Regression Model. Therefore; to undertake the hazard function, we have tted the full model that contains all main effect variables. These may help to identify the effects of each main effects of the study interest. Therefore; the full cox regression model is examined using these seven main effect variables: Sex, functional status, age, regimen class, marital status, WHO-clinical stage and baseline CD4 that were under study. But before building the full model we have to check model with covariate or without covariate (null model) is effective.
From the Table 4: the value of -2LOG L is less, in model with covariates than model without covariates which indicate that the Log likelihood is larger in model with covariates. But the value of AIC and BIC contradict with concept. According to AIC and BIC value no need to include covariates in the Cox PH model. When testing proportional assumption in Cox PH model there are a lot of techniques, these makes di cult and they may contradict one to other. In this case we have to proof using further test. So, another test is global null hypothesis test.
The global null hypothesis test from Table 4 also reveals that all three tests (basically Likelihood Ratio, Score and Wald test) are signi cantly different from zero and yield similar conclusions, that is, the model without explanatory variables was more effective than the full model. This support the AIC and BIC conclusion what we seen in Table 4. Even the K-M plot in Figure 1: the covariates were intersected with in groups and this is an indication that the PH assumption violated for that variable. None of the variables is globally important in explaining in changes the survival time or hazard rate of the patients.
The cox PH regression model presented as: λ i (t) = λ 0 (t){β Am (Ambulatory2 i ) + β AZ (AZT + 3TC + NVPorEFV1 i )+ β TD (TDF + 3TC + NVPorEFV2 i ) + β d4 (d4t + 3TC + NVPorEFV3 i) } From the type III tests of the PH regression model Wald test p-value is 0.045, 0.047 for functional status and regimen class are seems statistically signi cant at 5% level of signi cance. In the cox PH regression model, model parameters interpreted in the form of exponential function of the parameter multiplied by the covariate and the results indicates the ratio of hazards.
HIV patients in Ambulatory status have the probability of event (death) to decrease by 34% than bedridden HIV patients. Likewise, HIV patients who are taking AZT+3TC+NVP or EFV, TDF+3TC+NVP or EFV and d4t+3TC+NVP or EFV increase the death opportunity 97%, doubling and 61% respectively.
Testing Proportionality: To check the validity of the PH model we have to test by examining the interaction between time-dependent covariates with time. The cox PH regression model just we have been conducted was assumed that the risks are proportional, which means the proportion is constant over time. To check this assumption of risks proportionality, we have used time-dependent covariates or variables and test whether they are signi cant or not. If they are not signi cant, it shows us time did not affect the relative risk, and we can conclude that the risks in our model are proportional. Here; we have done proportionality test for time-dependent variables.
The value of the proportionality test found was not signi cant (p-value = 0.3491). Therefore; none of two (age and CD4+) time-dependent variables individually or jointly were signi cant, when makes interaction with time. So, the risk is constant overtime. This result taken from Cox PH is satisfactory for detail conclusion and recommendations we have to invite AFT models.

Accelerated Failure Time (AFT) Model Analysis
All AFT models are named for the distribution of T rather than the distribution of ε or log T. The reason for allowing different distribution assumptions is that they have different implications for the shapes of hazard function. To t the AFT model we have applied six parametric survival models (exponential, weibull, normal, log-logistic, Gamma and log-normal) to the effects of covariates to HIV/AIDS patients on ART on patient's survival, the data under study. To compare the models, we have used AIC, BIC and -2 Log Likelihood smaller is better.
From the tted six parametric survival models, Log-logistic seems t the data better than the others. With the minimum AIC (961.107), BIC (1029.222) and -2 Log likelihood (931.107), Log-logistic model suggests appropriate for the data under study. Therefore; we have tted our model for parametric survival analysis using the assumed Log-logistic distribution survival function for survival time (T). Subsequent analysis of the survival data are based on this Log-logistic model. In the Log-logistic model, among the covariates we have included in the survival model: functional status (working subgroup) and regimen class (all subgroup) were signi cant at 5% level of signi cance. But, sex, age, baseCD4, marital status and WHO-clinical stage are not signi cance at 5% signi cance level. However; the AFT model do not undergoing on the signi cance of the p-value for these predictors. Instead the signi cance of the study variables is tested parametrically by the selected Log-logistic distribution. From the AFT model in Table 8 Among the class of ART treatment patients who are allocated in AZT+3TC+NVP or EFV and d4t+3TC+NVP or EFV treatment have (exp (-0.2461) = 0.78) 22% decreased in survival time than the patients allocated other than the listed treatment class. Also patients in TDF+3TC+NVP or EFV drug have (exp (-0.2853) = 0.75) 25% decreased in survival time than the patients who take another combination drugs.

Separate Survival Model Selection
The global null hypothesis test from Table 4  and Pr> ChiSq = 0.1743) were statistically insigni cant. But in this Cox PH model functional status (Ambulatory) and regimen class (AZT+3TC+NVP or EFV, TDF+3TC+NVP or EFV and d4t+3TC+NVP or EFV) were seems statistically signi cant. Likewise, Table 7 from the tted six parametric survival models (exponential, weibull, normal, log-logistic, Gamma and Log-normal), Log-logistic t the data better than the others with the minimum AIC (961.107), BIC (1029.222) and -2 Log likelihood (931.107). In this AFT model with Log-logistic distribution functional status (working), regimen class (AZT+3TC+NVP or EFV, TDF+3TC+NVP or EFV and d4t+3TC+NVP or EFV) and WHO clinical stage (stage I, stage II and stage III) were statistically signi cance at 5% level signi cance.

Separate Survival Model Diagnostics
0000Cox-Snell Residuals: Parallel lines indicated that hazards between groups were proportional and assumption of proportionality was satis ed. But in Figure 2: (i, ii and iii) bellow shows proportionality not satis ed for these WHO stage, regimen class and marital status respectively.

Conclusions
Separate analysis of time-to-event outcomes reveals 165 (21.30%) were event occurred and the remaining 608 (78.70%) are censored from the ART as well as the estimated mean survival is 32.27 (±SD 8.40) months. From a Log-logistic parametric model functional status (Working) and regimen class (AZT + 3TC + NVP or EFV, TDF + 3TC + NVP or EFV and d4t + 3TC + NVP or EFV) found to be a signi cant factor on time to death. This lead us to conclude HIV positive patients who can able to work day-today activities have an immunity in resisting the disease burden and delaying the disease event with treatment. Also the regimens that comprise AZT+3TC+NVP or EFV, TDF+3TC+NVP or EFV and d4t+3TC+NVP or EFV have a signi cant role in increasing the number of CD4+ for HIV-positive patients, which helps to live long than other regimen class ordered by clinician for patients.
As recommendations: This thesis shall be recommend that further studies of this nature include other important socioeconomic and clinical covariates that were not included in this study like opportunistic infections, body mass index, condom use, drug addiction, education level and many others which are recorded for long duration. Beside this health o cers and data clerks working with recording HIV-positive patients data either in chart or database need training to improve quality of medical data especially in keeping dataset from missingness. Ethics Approval and Consent to Participate I obtained a formal ethical clearance from Debre Berhan University Department of Statistics (Ref no. Stat42/9/18). Personal information will have to be kept con dentially without disclosing to others during data collection from patient cards. Afterward I got ethics approval from Menellik hospital in orally with oath. Since I am the only study participant stated in ethics committee I have no peer consents.

Consent for publication Not Applicable
Availability of data and material The data that I take from hospital is full secured because of ethical oath. If anyone can access this data set he/she have to license from hospital. The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.

Competing interests
Non-nancial competing interests       Cox-Snell Residuals

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.