Applying Survival Analysis to Telecom Churn Data

: In competitive telecommunication environment, it is imperative to maintain an effective customer retention strategy even while mobile service operators attracting new customers. Not only acquiring new customers costly process, but successful customer retention helps build brand loyalty and good business reputation. Motivated by real mobile service operator data set, we designed and proposed a solution to employ survival analysis technique that estimates customers’ survivals and hazards. We aim to examine the impact of: campaign, tariff, tenure, age, auto-payment on survival times and hazards. After hazard ratios and survival experiences determined for each predictor, results enable mobile service operator to target the right customers to incentivize so that they can stay with their current operator. Proactive actions triggered by the results of the survival model is key to customer retention.


Introduction
Customer churn also known as customer attrition is when a customer ends subscription with their current operator [1]. There are two types of churn, the first is involuntarily churn which involves the customer's account being cancelled by the mobile service operator as a result of death, relocation or non-payment reasons. The second is voluntarily churn which is when customer cancels their subscription due to poor voice quality, limited cellular coverage, network problems, high prices or billing issues [2]. In voluntarily churn, customer defects to a competitor. Only voluntary churn data was examined in this article.
The Prominence of Churn Models: Finding new customers for mobile service operators in a saturated market is very difficult and costly because of the marketing activities, incentives and campaigns to attract new customers. Competition is intense due to increasing varieties of customer-focused services, campaigns and tariffs offered from mobile service operators. As a result, cost of acquiring a new customer is five times higher than retaining a current customer.
For comparison, below is a list of average costs to acquire one customer for a variety of industries: 1. Travel: Priceline.com: $7 2. Telecom: Sprint PCS: $315 3. Retail: Barnesandnoble.com: $10 4. Financial: TD Waterhouse: $175 The telecom sector has the highest acquisition cost among other sectors. Due to high acquisition cost, mobile service operators are more focused on retention than customer acquisition. Thus, it is crucial to implement an effective retention strategy (ERS) to retain customers, as retention is far easier; less expensive and more achievable than acquisition. In addition, successfully retained customers acquire more new customers; they provide cash flow and profit and they are less sensitive to new pricing. In comparison with new customers, existing old customers may be more loyal, and churning may take a longer time [3][4][5][6].
One of the best practices to implement an ERS is to create a model based on customer behavior that will predict churn with high accuracy. The goal is to detect well in advance, which customers may churn, so that mobile service operators can react early to keep them.
Customer Relationship Management and Churn Models: Customer Relationship Management (CRM) is the whole set of processes and systems applied to establish long-lived and profitable relationships with specific customers to support business strategies [7]. In telecommunication companies, retention of customers is one of the key activities of the CRM.
The main objective of CRM is to establish a win-win relationship between the customer and the firm. CRM has four dimensions: 1. Customer Identification 2. Customer Attraction 3. Customer Retention 4. Customer Development Data mining techniques that are used in these four dimensions are: 1. Association 2. Classification 3. Clustering 4. Forecasting 5. Regression 6. Sequence discovery In CRM, churn models are classification techniques used for customer retention [8]. The goal of data mining techniques for churn analysis is to: 1. identify who is going to churn and when 2. identify why customers are churning This paper intends to illustrate how to apply survival analysis techniques to telecom churn data to identify the above items. The rest of the paper is structured as follows. Section 2 defines some basic concepts and terminology of survival analysis as well as describes the application of survival analyses to different fields. Section 3 introduces the most famous and widely used model for survival analyses, the Cox proportional hazard model. In Section 4, a telecom churn data set is described, and a Cox model is applied. Section 5 concludes the paper.

Survival Analysis
Survival Analysis (SA) encompasses a variety of statistical procedures used to analyze and quantify time to event data. SA aims to answer the question "How long is it before some event occur?". In practice, the event of interest may be: finding a job, failure of a component, default of an applicant for credit, corporate failure, divorce, academic promotion, etc. In SA, survival means the subject is free from the event of interest during the observation period [9][10][11][12].
The goal of SA is: 1. Estimate and interpret survival function 2. Compare survival functions 3. Asses the relationship between survival and one or more predictors (usually termed as explanatory variables or covariates) Characteristic of SA include: 1. Outcome variable is time until the event occurs 2. Its ability to manage incomplete data, resulting from the event not occurring during the follow up time (censored data) 3. Framework of SA: 4. Definition of event 5. Specify survival times scale (year, month, week, etc.) 6. Identify the start time SA models can be based on historical data or can be based on prospective studies where data is collected in real time. SA is also sometimes referred: event history analysis, failure time analysis, hazard analysis, transition analysis, and duration analysis.
History of SA: The origins of SA and its history spread far back to the early work on mortality by John Graunt, who published the book "Natural and Political Observations on the Bills of Mortality" in 1962. The concept of "Life Tables" was introduced in this book. A new, modern era in SA started during World War II in USA. Reliability of military equipment was tested using SA. After World War II, SA became popular and spread to other various disciplines [13].
The most influential papers on SA were published by Kaplan and Meier [14] and Cox [15], in which the Kaplan-Meier product limit estimator and Cox proportional hazard model were introduced, respectively.
Application of SA SA is successfully used in various fields, such as medical research, engineering (reliability analysis), finance, and economics (duration analysis). For example, in finance, a firm's vulnerability to global financial crisis are analyzed. There are several studies in this field ( [16][17][18][19]). These research papers aim to identify firms that are at risk of failure in the future. Prior notice can help some corporations to take necessary steps to survive. Additionally, the time to sell or buy stock within the stock market can be analyzed and predicted with SA [20].
SA is also used in banking such as banks' survival during financial crisis or measure their strength when they enter to a new country's market [21,22] and for credit risk analysis [23,24].
In the telecommunication sector, churn prediction is vital for mobile service operators. It has received a lot of attention and popularity from marketing and academia. There are several articles published that mostly employ machine learning algorithms [1][2][3][4]8]. Few articles have been published using SA techniques [25]. SA is also used to estimate customer life time value (CLTV) which is a measure of customer profitability over time [26,27].

Censoring
In theory, researchers would like to know the exact survival time for each subject who enters the study during the observation period. However, this is often impractical. Some survival times are below (or above) some bound. Because of time limitations of the study, researchers are not able to wait indefinitely to see if every subject has the event. For some subjects, the study ends before the event has occurred or some subjects are lost to follow up or drop out of the study. It is important to know the drop out reasons for these subjects. If the reasons are unrelated to the event of interest, then this is called non-informative censoring. When censored data is random and non-informative, there is no issue of bias [28].
Informative censoring means that a subject leaves the study for a reason that is related to the event. Imputation techniques for missing data methods can be applied to deal with informative censoring [29].
There are three general types of censoring: right-censoring, left-censoring and interval-censoring [30]. We will only focus on right censoring which is the most common type of censoring. Now, let's consider the mathematical definition below and figure 1 to help fully understand right-censoring. For the ith subject, i=1, 2,…, n (number of subjects), let: 1. denotes the survival time 2.
denote the censoring time 3. δ denote the event indicator where: 4. denote the observed response where: = min( , ) In other words, observed survival times will always be less than or equal to true survival times.

Model
SA has two main functions known as the survival and hazard function. Let's introduce both functions.
Survival Function: Let's assume for now that T is non-negative and a continuous random variable denoting the elapsed time until an event occurs with probability density function f(t) and cumulative distribution function F(t). F(t) is defined as ( ) = Pr{T < t} (1) S(t) is defined to be the complement of the c.d.f. [31], can be obtained from The survival function simply indicates the probability that the event of interest has not occurred by time t. Let's take the derivative of S(t) and express f(t) in terms of & / ( ): As t ranges from 0 to ∞, the three characteristics of survival function are [32]: 1. S(t) is non-increasing in t 2. At t=0, all the subjects are alive which means S(0)=1 3. At t=∞, all the subjects have the event which means S(∞)=0 Hazard Function: The hazard function is defined as [33]: The numerator of expression (7) is the probability of failure between time t and t+∆ , given that it has not occurred up to time t, while the denominator ∆ is the length of interval. Taking the limit of this expression as length of the interval goes to zero is called instantaneous rate of failure. Let's take a detailed look at the expression (7) [30,34].
The formula is derived for the hazard function. Let's use expression (6) and plug in (14) The result is Integrate from 0 to t and use S(0)=1 Expression in the parenthesis called the cumulative hazard function and defined as

Regression Versus Survival Analysis
In SA, researchers may be interested in modeling the risk of event any time, conditional on having survived up to that moment. Further, they may also be interested in what factors benefit the time to event. SA can also handle censored data.
Logistic regression, researchers may be interested in modeling the risk of event whether it occurs in some fixed period of time. Additionally, they may also be interested in what factors benefit the risk of event occurring. Although, it is not mathematically correct to use logistic regression with censored data, it still can be applied but it gives us limited information since it avoids censoring ( [33,36,37]). Properties and advantages of models are summarized in table 1 and table 2.

Kaplan-Meier
It is a non-parametric method and known as Product Limit Formula. In this method S(t), which is the probability of surviving beyond t, is computed for each t.
Let (E) < (T) < (U) < ⋯ < (W) denotes the observed times of event of the n subjects in the study, @ ( ) is number of deaths at each of ( ) , ( ) is number of subjects who are event free at time ( ) . S(t) can be estimated by We not only estimate & X ( ), but it is also used to make comparison between two or more groups. Log-Rank Test Survival curves are compared statistically using log-rank test. The purpose of this test is whether there is no statistical difference between the survival experience of two or more groups. Hypothesis of log-rank test defines as Where b is the largest time interval that at least one subject is at risk [31]. Table 3 is illustrated how to simply estimate hazard function and survival function.

The Cox Proportional Hazard Model
The Cox Proportional Hazard Model (CPHM) is a widely used statistical technique for quantifying the risk of observing the event of interest during the observation time. This technique is also used to assess simultaneously the effect of several covariates on survival time.
The hazard function, 1( |g), is defined as; Where g is a vector of covariates (explanatory or predictor variables), and e is a vector of regression coefficients. The baseline hazard, 1 5 ( ), is the hazard when all covariates are zero. It is not estimated and specified for the CPHM, it is consequently called semi-parametric [12].

Hazard Ratio (HR)
The CPHM estimates the ratio of hazard values between two levels, say g and g * , with different covariate values. It is estimated by Let us assume the covariate values of two different subjects differ on only one level. That is: The result for the ratio is Assumptions of Cox's Method ( [38,39]): 1. Non-Informative censoring 2. Proportional hazard 3. No tied events CPHM does not require any specific distribution for the survival times. However, it assumes a proportional hazard assumption (PHA) which means the hazard ratio, in expression (22) does not depend on t. The survival curves for two groups must have hazard functions that are proportional for all values of t and additionally the hazard ratio does not vary with time. Graphically if the hazard curves cross, PHA may be violated.
There are methods that can be applied to test proportionality: 1. Schoenfeld residuals are used to assess CPHM fit for survival data. They should be independent of time. One of the diagnostic test can be employed is to plot Schoenfeld residuals against time. If there is a random pattern in the plot, the PHA has not been violated [40].
Where X is the set of fixed covariates and X(t) is the time dependent covariates. However, fixed covariate form of hazard function is much simpler than the hazard function with time dependent covariates ( [41,42]). Further, time dependent Cox model does not have some characteristics of fixed covariate model. For example, survival curves cannot be estimated in time dependent Cox model. Time dependent covariates should be used in caution in order to avoid any bias [43]. Time dependency for some variables might be arose from long term study times [46]. An example of time dependent covariates such as body weight, income, marital status, marketing promotions, hypertension status and location. Examples of fixed covariates are: sex, race, income [44]. Time dependent variables can be handled both in SAS [41] and R [45].

Inference
The CPHM Partial likelihood function is given by Where › denotes the event indicator, (E) < (T) < (U) < ⋯ < (W) are the ordered, observed event times and } ( ) is the risk set for time ( ) , which is the set of all subjects surviving just before time ( ) [15]. Additionally, the numerator has uncensored data while the denominator has both censored and uncensored data. Expression (26) is called partial likelihood because the baseline hazard function is unspecified.
Below is a simple example. The observed event times are shown for four subjects, in which the third subject is censored: 2, 3, 4+, 5 (subject 1-4) The partial likelihood function is expressed as: Expression (26) has the assumption that data has no ties. Ties are said to be present in the data when two or more subjects share the same survival time. No ties can be observed if the time is measured on a perfect continuous scale. However, in real life applications, this is unrealistic, and time is measured as a discrete variable. For instance, if the time scale is month, more than one subject can fail in the same month. We usually record: day, week, month or year of the survival times, and as a result, multiple failures can occur at the same time. There are several useful methods introduced to estimate the likelihood with tied event times. Such methods are: 1

. Cox's Exact Discrete Method 2. Breslow Method 3. Efron Method
Let's describe and write the partial likelihood function for each method.

Cox's Exact Discrete Method
The Exact Discrete Method assumes that tie events occur from inaccurate measurement of a continuous time scale [15]. Consequently, Cox's partial likelihood function needs to be revised to handle ties. Let's introduce the revised partial likelihood function to handle ties.
Suppose time E < T < ⋯ < are ordered observed event times and i = 1, 2, …, m following as events occur at . All possible arrangements of ties should be considered ( [47,32]). Now, let's illustrate how to deal with ties with example data found in table 4: Where id is subject's number, T is time in weeks, event is whether failure occurs (1 as yes, 0 as no) and X is the covariate value. In week three and ten, there are ties with two failures. Which id fails first is unknown, so all possible arrangements should be considered.
If the number of ties is large, the Exact method can involve extensive calculations due to a large number of permutations for each tie.

Breslow Method
The Breslow approximation is based on the assumption that time is continuous, and the hazard rate is constant between the time interval , ;E [48]. Another assumption is that any subject, who has a censored time in the interval , ;E , is said to be censored at the beginning of the interval . The adjusted Cox's partial likelihood function can be obtained by: Where ¸ is a set with the collection of failure times called , at time , and } is a risk set containing the subjects who were at risk at time [49]. Using the same data from  (40) In comparison to the Exact method, the Breslow method can be much simpler to run.

Efron Method
The Efron method is used as the default method in R [50]. SAS, on the other hand, uses the Breslow approximation [51]. Even though Efron method is the better approach to estimating e′s than the Breslow method, nearly all statistical software uses the Breslow approximation. Implications of ties to the estimations are [52]: 1. When no ties data exist, three methods have the same results 2. When a few ties data exist, any method can be used 3. As the number of tied events increases in percent, the worse performance is obtained from Breslow 4. Efron estimations' are biased toward zero. 5. Efron method always estimates better than the Breslow method. The Efron method should be chosen if the Exact method is too time consuming. The adjusted Cox's partial likelihood function for the Efron approximation is: Where ¸ is a set with collection of failure times called , at time , } is a risk set contains the subjects who were at risk at time [53]. Again, using the data found in Table 4, the likelihood is: Š e = Š ET × Š¯× Š ns (42) Where each of Š ET , Š¯ and Š ns are estimated by: (Before defining Š ET , let us call) Á w = vw { + vw oe + vw ž + vw ¹ + vw ± (43) Newton-Rabson method can be used to optimize parameters of expressions 26, 28, 35 and 41 but it does not converge when there are collinearity problems in the design matrix [54].

Application to Telecom Data
Churn models use historical data to predict customers' behavior. The historical data set used in this study was acquired from a mobile service operator.
In this study, advertisement and marketing departments in mobile service operator are interested in exploring the impact of selected predictors' (see table 5) on survival times (time till customer churn). With this information, they can target specific customers at risk of churn with incentives, campaigns and tariffs. For this study, a sample of 10365 customers were randomly selected from the customers data. These customers were followed for one year from January 1st, 2015 to December 31st, 2015. Out of 10365, 2654 customers churned during this time period. It is therefore assumed that the remaining 7711 customers are right censored at the end of the time period. This churn number (2654 customers are 25.6% of total customers) is expected to have as an average monthly churn rate for top mobile service operators is 2% which indicates a loss of a quarter of their customers every year [3]. (Asian mobile service operators can lose up to 40% over a year).
In R, package called "Survival" was used to perform analysis in this data. Auto payment -1: Direct bank account or credit card authorization to pay monthly bills in January 2015 -0: Otherwise Figure 2 illustrates survival plots of Kaplan-Meier product limit estimates of customers. Figure 3 illustrates the pdf and cdf of the survival times. The distribution is left-skewed (non-normal) since most of the churns occurred later in the follow up and most of the non-churners have censored data at 12 months. Because of skewness, survival median time can be used as summary statistics which represents the time at which survival probability is 0.5. Since less than 50% of the customers churned, the survival median time cannot be estimated.    Figure 5 illustrates the survival probabilities of Kaplan-Meier product limit estimates of two groups of customers: one with campaign and one without the campaign. This graph and log rank test result enable mobile operators to assess whether the campaign encourages customers stay. Figure 5 shows that customers with the campaign have higher survival probabilities until 10 months. After 10 months, those without the campaign have a higher survival probability. Since two survival curves cross each other, the PHA is violated. Further analysis needs to be done with respect to the campaign variable to evaluate its effectiveness.  In order to quantify survival curves between two groups statistically, the log-rank test is still used 70% of the time even though PHA does not hold [55]. Because of the test in the table 6, we conclude that there is a difference between the groups. Chisq=6, on 1 degrees of freedom, p=0.011 Figure 6 illustrates the survival probabilities of the Kaplan-Meier product limit estimates of four groups of customers: with Tariff 1, Tariff 2, Tariff 3, Tariff 4. This graph and the log rank test results enable mobile service operator to assess how tariffs affect survival probabilities. Figure 6 shows tariff 1 has the best survival of all tariffs; Figure 6 also approximately informs us that PHA may be hold, since no curves cross each other. But a straightforward approach is to employ numerous tests to validate PHA.

Kaplan-Meier
As a result of the test in the table 7, we conclude that there is a difference between the groups.  Figure 7 illustrates the survival probabilities of Kaplan-Meier product limit estimates for two stratified groups of customers: one with tenure less than or equal to one year and one with tenure greater than one year. Figure 7 shows how tenure affects survival; customers whose tenure is less than or equal to one year have lower survival probabilities than those with more than one year. Additionally, PHA appears to be valid.
As a result of the test in the table 8, we conclude that there is a difference between the groups.   Tenure≤1  4640  2233  1046  1347  2305  Tenure>1  5725  421  1608  876  2305 Chisq=2305, on 1 degrees of freedom, p=0

Poportional Hazard Assumption
The package "Survival" in R contains Cox.zph function was used to test proportionality of all the covariates in the model. For each covariate, Cox.zph estimates the p-values by examining the correlation between the scaled Schoenfeld residuals and time [50,51,56]. P-values less than 0.05 indicates that PHA is not satisfied.
According to table 9, global p-value<0.05 implies evidence of non-proportionality of the churn data set since campaign gives strong evidence of non-proportionality. Subsection 4.2.3 detailed how to handle covariates that do not satisfy the PHA. However, in the following models, campaign was excluded from analyses [32].        In the figure 9, figure 10 and figure 11, the calculated Schoenfeld residuals for tariff are plotted against time. The slope of the line is zero which is the evidence of proportionality. Furthermore, figure 13 plots the −log(−log S(t)) against time for the tariff predictor variable. The curves look parallel and do not cross, which indicates the PHA might be valid. Table 10 illustrates the results of two Cox models. Due to a limited amount of computer resources and a large number of ties, R could not provide any results for exact method. Table  10 shows that some coefficients are different. Not only are coefficients different in both methods, but also the AIC is lower in Efron method [56,57].

Result Analysis
For dealing with ties, the default Efron approximation in R was used and the results are shown in All the predictor variables have profound effect on the risk of churn. For tenure, the hazard ratio is 0.95, which means a one month increase in tenure, holding all other variables constant, is associated with a 5% decrease in the hazard at any time over a year. For age, holding all other variables constant, a customer in category 2 is 11% less likely to churn than a customer in category 1 at any time over a year.
Clearly, the Cox model describes the relationship between variables and the risk of churn. The hazard ratio by different levels of each predictor provides valuable information to be used by a CRM department. According to table 12, tariff 1 has the lowest hazard ratio with respect to other tariffs. This demonstrates that customers with tariff 2, 3, 4 deserve special attention.
The tenure predictor also has a significant impact on churn. As customers' tenure increases, the more inclined the customer is to stay with the current operator.
For the age predictor, the older the customer's age, the less likely they are to churn. Researchers should analyze 'why' young customers decide to churn. If the reason is high monthly bills, then this enables operators to develop new tariffs with low monthly bills for young customers. Young customers may use more data than voice. Thus, a tariff or a campaign related to data usage could be more effective than voice usage. Operators should recognize needs and preferences for different age groups to increase survival [4].
For the auto payment predictor, the customers who pay their bill automatically from their bank account have higher survival probabilities. A CRM department may or may not take corresponding actions.
In this study variables were selected to ensure the proportional hazard assumption holds. Some telecom churn predictors can change over time and were not included in this study such as number of minutes, number of calls and data usage. However, we still analyzed Schoenfeld residuals to check whether the PHA holds for each variable, and only the campaign variable failed to satisfy this assumption. Intervals of time should be used to code campaign in [58]. Table 11 is an example. Table 11. Time dependent covariate coding in R. Time 1  Time 2  Churn  campaign  1  1  2  0  0  1  2  3  0  0  1  3  4  1  1 Time 1 is a start time and Time 2 is the end time for the first interval in month. Between month 1 and month 2, the customer did not churn and did not have the campaign. Since campaign depends on time, campaign should be coded for every time range during the study time.

Conclusion
In liberalized mobile telecom market, churn management is very crucial. In order to stay competitive in this market, possible churners should be identified from existing customers to implement selective or personalized marketing practices.
Usually conventional data mining methods such as Logistic Regression, Neural Networks and Decision Trees are applied to identify churners for telecom churn management. These models are applied to evaluate the risk of churn for each customer for fixed time period. Further, these models identify the factors effecting customer decisions.
In this research, we presented a survival analysis technique to build predictive models to identify churners. Unlike the propensity churn scores or odds ratio, Kaplan-Meier method and the Cox model are employed to quantify and describe the risk until the customer churns. Survival curves and hazard ratios are determined for each level of a predictor which enables mobile service providers to take effective retention actions.