Telecom Churn Analysis with RStudio
Management Report
Table of Contents

Introduction
Customer Profiling
INTRODUCTION
Subscriber Churn has been increasing, becoming a problem that telecommunications carriers
like ourselves must deal with in order to retain our high-value subscribers. With virtually
everyone now owning a mobile phone, and some owning two or more, the saturated market
means that there are next to no new customers to acquire. It is therefore of paramount importance
for us to identify our high-value subscribers early and target them with enticements to prevent
them from churning.
This report gives an overview of an RStudio application that was developed by our data
analytics team on a training dataset and tested on a verify dataset. The
findings from this report will provide Management with actionable insights to better manage
churn.
TECHNICAL REPORT
The training dataset (training) consists of 6000 entries with 21 variables and Churn as the
target. The tasks required were to determine which of the 21 variables (inputs/predictors) could
best predict the outcome of Churn and to develop rule sets that identify subscribers who
churn.
Data Pre-Processing
As most data mining algorithms perform poorly with datasets that are greatly imbalanced, the
distribution of Churn in training was checked as part of our preliminary data examination.
The distribution of Churn was found to be Yes = 26.4% and No = 73.6%, implying that
there is a sufficient proportion of Yes to carry out data mining without the need for data
balancing. For the purpose of this analysis, our team set the balancing threshold at
Yes = 20%: as long as the proportion of Yes is higher than 20%, data balancing is not
required. Future improvements could involve incorporating functions to rebalance the data
(for example towards a 40-60 split) whenever this threshold is breached.
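In outline, this balance check looks like the following minimal sketch; the toy data frame stands in for the real 6000-row training set, and the column name Churn is taken from the report:

```r
# Minimal sketch of the class-balance check; the data frame below is
# illustrative (the real training set has 6000 rows).
training <- data.frame(Churn = c(rep("Yes", 264), rep("No", 736)))

churn_dist <- prop.table(table(training$Churn))   # class proportions
needs_balancing <- churn_dist[["Yes"]] < 0.20     # the team's 20% threshold
print(round(churn_dist, 3))
```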
Following that, Churn was separated from training so that training could be rbind-ed
with verify to form combined. This was necessary to ensure consistency in the
pre-processing of both datasets.
The combined dataset was first audited for missing values, as these create complications in
data mining and analysis. MissingValuesCheck(combined) was used to identify the missing
values, which were all found in the TotalCharges column. A mean/mode substitution
method was not used to replace the missing values, as it reduces the variability of the data.
Instead, the 11 entries/rows (9 in training and 2 in verify) with missing values were deleted
for simplicity and comparability, and because there were sufficient remaining entries for the
analysis to remain representative.
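The deletion step can be sketched in base R as follows; the toy data frame is illustrative, and is.na stands in for the course-supplied MissingValuesCheck helper:

```r
# Sketch of the missing-value audit and row deletion; the data frame is
# illustrative (the real combined set contains 11 incomplete rows).
combined <- data.frame(customerID   = c("0001-A", "0002-B", "0003-C", "0004-D"),
                       TotalCharges = c(29.85, NA, 1889.50, NA))

missing_rows <- which(is.na(combined$TotalCharges))  # rows flagged by the audit
combined <- combined[-missing_rows, ]                # delete rather than impute
```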
The 9 rows with missing values from the training section were correspondingly removed
from the Churn column that was separated out earlier, and its Yes and No responses were
converted into 1 and 0. The result is defined as churn_col, which will be re-attached to the
pre-processed training data later. Pre-processing of combined included, among other steps,
1-hot-encoding of text entries with more than 2 unique values, converting them into 1-of-x
indicator columns.
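The 1-of-x encoding can be sketched with base R's model.matrix; the Contract column and its levels below are illustrative:

```r
# Sketch of 1-of-x (one-hot) encoding for a text field with more than 2 unique
# values; the column and levels are illustrative.
df <- data.frame(Contract = c("Month-to-month", "One year", "Two year",
                              "Month-to-month"))
onehot <- model.matrix(~ Contract - 1, data = df)  # one indicator column per level
```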
Rows 1-5991 of prepro_combined_s belonged to the original training set and are now cbind-
ed with churn_col to form a pre-processed training dataset that includes Churn, defined as
training_dataset_wChurn. The remaining rows belonged to the original verify
dataset, redefined as verifySet, which will be used to verify the results derived from the
chosen model.
NPREPROCESSING_splitdataset(training_dataset_wChurn) was then used to create a
training dataset for data mining (trainingSet) by randomising the records and splitting off
70% of them. A test dataset (testSet) was created with the remaining 30%.
trainingSet, testSet and verifySet are then run through a gsub function to clean their
column names. Churn, being the Target, is once again separated out and set as the Expected
(trainExpect, testExpect), while the remaining columns form the Variables (trainVariables,
testVariables).
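The randomised split and target separation can be sketched in base R as below; NPREPROCESSING_splitdataset is a course-supplied helper, so base R stands in for it here, on toy data:

```r
# Sketch of the randomised 70/30 split and target separation; the dataset is
# illustrative.
set.seed(42)
dataset <- data.frame(tenure = runif(100), Churn = rbinom(100, 1, 0.26))

idx         <- sample(nrow(dataset), size = round(0.70 * nrow(dataset)))
trainingSet <- dataset[idx, ]
testSet     <- dataset[-idx, ]

# Separate the Target (Expected) from the inputs (Variables)
trainExpect    <- trainingSet$Churn
trainVariables <- trainingSet[, setdiff(names(trainingSet), "Churn"), drop = FALSE]
```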
Data Mining
Random Forest
R's randomForest package provides the randomForest function, which was applied to
trainVariables and trainExpect (which must be converted into a factor) with an initial ntree
count of 500, and this model was named randf. The predict function was then used to
evaluate how randf fared with the testVariables, where the output (randf_predict) was
randf's predicted Churn.
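A minimal sketch of this step, assuming the randomForest package is installed; the toy inputs stand in for trainVariables, trainExpect and testVariables:

```r
# Sketch of fitting randf with ntree = 500; all data below is illustrative.
library(randomForest)
set.seed(1)
trainVariables <- data.frame(tenure = runif(200), MonthlyCharges = runif(200, 20, 120))
trainExpect    <- factor(rbinom(200, 1, 0.26))   # classification needs a factor
testVariables  <- data.frame(tenure = runif(50),  MonthlyCharges = runif(50, 20, 120))

randf         <- randomForest(trainVariables, trainExpect, ntree = 500)
randf_predict <- predict(randf, testVariables)   # predicted Churn classes
```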
C5.0
R's C50 package has the C5.0 function, which is applied in a similar way to randomForest,
except that instead of an ntree count, C5.0 uses a trials count. The model applied (C5dt)
requires trainExpect to be converted into a factor and was given a rules = TRUE input
so that it would produce a rule-based model. Similarly, predict was used to evaluate how
C5dt fared with the testVariables, where the output (C5dt_predict) was C5dt's predicted
Churn.
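The C5.0 step can be sketched in the same way, assuming the C50 package is installed; toy data again stands in for the real Variables and Expected objects:

```r
# Sketch of fitting the rule-based C5dt model; all data below is illustrative.
library(C50)
set.seed(1)
trainVariables <- data.frame(tenure = runif(200), MonthlyCharges = runif(200, 20, 120))
trainExpect    <- factor(rbinom(200, 1, 0.26))
testVariables  <- data.frame(tenure = runif(50),  MonthlyCharges = runif(50, 20, 120))

C5dt         <- C5.0(trainVariables, trainExpect, trials = 1, rules = TRUE)
C5dt_predict <- predict(C5dt, testVariables)     # predicted Churn classes
```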
Following that, the outputs (randf_predict and C5dt_predict) are each compared to testExpect
to produce a confusion matrix (ConfMat).
Using the results from the confusion matrix, an ROC curve was plotted using the plot.roc
function in R's pROC package to graphically represent the results. The Area Under Curve
(AUC) is then obtained by using the print function on the plot. The ConfMat results, True
Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), and the
AUC value are then tabulated into an Excel file (AlgorithmComparison.xls) for model comparison
purposes.
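The ROC/AUC step can be sketched as follows, assuming the pROC package is installed; the response vector and predicted probabilities are illustrative:

```r
# Sketch of the ROC curve and AUC; testExpect and the probabilities are
# illustrative values, not the report's actual outputs.
library(pROC)
testExpect  <- c(0, 0, 1, 1, 1, 0, 1, 0)
churn_probs <- c(0.1, 0.2, 0.8, 0.7, 0.4, 0.3, 0.9, 0.6)

roc_obj <- roc(testExpect, churn_probs)
plot.roc(roc_obj)          # graphical representation
auc_val <- auc(roc_obj)    # Area Under Curve
print(auc_val)
```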
The above steps were repeated with varying ntree and trials counts for the randomForest and
C5.0 functions respectively, to derive different outputs for the predicted Churn and hence
different evaluation results. 4 additional ntree counts (1000, 1500, 2000 and 2500) were used
for alternative randomForest models. Similarly, 4 additional trials counts (3, 5, 7, 10) were
used for alternative C5.0 models.
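The ntree variations can be sketched as a simple loop, again assuming the randomForest package is installed; the toy data keeps the run quick:

```r
# Sketch of fitting the alternative randomForest models; the data is
# illustrative.
library(randomForest)
set.seed(1)
X <- data.frame(tenure = runif(100), MonthlyCharges = runif(100, 20, 120))
y <- factor(rbinom(100, 1, 0.26))

ntree_counts <- c(500, 1000, 1500, 2000, 2500)
models <- lapply(ntree_counts, function(n) randomForest(X, y, ntree = n))
```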
Evaluation of Models
Results and outputs from all 10 models were tabulated in AlgorithmComparison.xls, with
additional columns created for measures such as Sensitivity, Specificity, Precision, Accuracy
and the Matthews Correlation Coefficient (MCC).
To decide which model was the best, the models were ranked by their AUC and MCC values.
Although the AUC statistic in itself is already a good measure of quality of classification
models such as Random Forest and C5.0, MCC adds value to the model assessment as it is a
balanced measure which takes into account all 4 ConfMat results. Since the aim of this
project is to correctly identify the positive cases of Churn, Precision was used as the third
measure. Using 3 measures for evaluation adds extra stringency when evaluating and choosing
the best model.
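Precision and MCC are both computed from the four ConfMat counts; the sketch below uses illustrative counts, not the report's actual figures:

```r
# Sketch of Precision and MCC from the four ConfMat counts (illustrative).
TP <- 120; TN <- 980; FP <- 115; FN <- 85

precision <- TP / (TP + FP)
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
```

Unlike Accuracy, the MCC numerator and denominator use all four counts, which is why it remains informative when the classes are imbalanced.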
The 10 models were ranked by the highest AUC and then by MCC, where the top 5 included
just one C5.0 model. Although this model (C5.0 with 10 trials) had the best AUC (0.8397), its
MCC is much lower than the rest and it has a Precision of only 39.4% compared to the other 4
Random Forests. Of the remaining models, the best is Random Forest with ntree = 2000, as it has the
highest AUC (0.828), MCC (0.4474) and the best Precision (50.8%).
Hence, the chosen machine learning algorithm to be applied to our company's Churn problem
is the Random Forest with ntree = 2000 (RANDF2000). The model's ConfMat results and
evaluation statistics are recorded in AlgorithmComparison.xls.
MANAGEMENT REPORT
Since the RANDF2000 model produced the best AUC, MCC and Precision statistics, it is
selected as the model to carry out our cost evaluation on our customers.
Cost evaluation is done with the following assumptions relating to costs (in US$):
1. Acquiring a new customer costs $750
2. LostCustCost is the revenue lost to the company if the customer Churns
3. RetCost, the cost of a retention offer, is 10% of LostCustCost
4. Customers are responsive to enticements and will not Churn when offered a retention package
The RANDF2000 model is applied to the verify dataset to derive the predicted Churn of those
customers. The output, containing the characteristics of each customer as well as their predicted
Churn (randfChurn: 1 for Churn, 0 for Don't Churn), is combined with important customer
information such as LostCustCost, RetCost and customerID. Figure 5 shows a small excerpt
of this table.
When the model predicts that there is more than a 50% chance that a customer will Churn
(randfChurnProb > 0.5), it expects that customer to Churn (randfChurn = 1); otherwise, it
expects that the customer would not Churn. Since acquiring a new customer ($750) is costlier
than enticing an at-risk customer with a retention offer, it makes business sense to retain the
customer instead.
In terms of customer retention efforts, we should target high-value customers expected to
Churn first. This is done by first sorting the Cost Evaluation table with highest-value (largest
LostCustCost value) at the top. Following that, we can zoom in and focus only on those
expected to Churn (randfChurn = 1). Figure 6 is a small excerpt of the resulting table, called
HighValueChurn.
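The construction of HighValueChurn can be sketched as below. The two customers other than 3685-YLCMQ are hypothetical, and the 10% RetCost ratio is implied by the report's figures ($621.96 on a $6,219.60 loss):

```r
# Sketch of building HighValueChurn: sort by LostCustCost (descending), then
# keep only predicted Churners. Rows other than 3685-YLCMQ are illustrative.
costeval <- data.frame(customerID   = c("3685-YLCMQ", "0000-AAAAA", "0000-BBBBB"),
                       LostCustCost = c(6219.60, 1200.00, 4800.00),
                       randfChurn   = c(1, 0, 1))
costeval$RetCost <- 0.10 * costeval$LostCustCost   # assumed 10% retention cost

sorted         <- costeval[order(-costeval$LostCustCost), ]
HighValueChurn <- sorted[sorted$randfChurn == 1, ]
```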
The RANDF2000 model has predicted that, out of the customers within the verify dataset, there
are 188 who are expected to churn (Figure 7). The highest-value Churner has the customerID:
3685-YLCMQ. If he/she churns, the impact to our business is a loss of $6219.60. Since RetCost
is $621.96 (< $750), this customer should be contacted and enticed with a retention offer of
up to $621.96.
A full cost evaluation of all 188 predicted Churners is run in R, producing the cost figures in
Figure 8.
In essence, the Total Expected Loss of $169,429.10 can be avoided if $16,942.91 is spent on
enticing the 188 customers that RANDF2000 predicts will Churn. This is clearly the better
option than losing the customers ($169,429.10) and having to spend an additional $141,000.00
to acquire the same number that was lost, since that would reduce our
company's profits by $310,429.10, with no certainty of the amount of revenue from the new
customers.
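The comparison above, reproduced as arithmetic using the figures from the report:

```r
# Cost comparison using the report's figures.
total_expected_loss <- 169429.10          # loss if all 188 predicted Churners leave
retention_spend     <- 0.10 * total_expected_loss  # enticing all 188 customers
acquisition_cost    <- 188 * 750          # replacing 188 lost customers
do_nothing_impact   <- total_expected_loss + acquisition_cost
print(do_nothing_impact)
```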
Customer Profiling
Beyond informing the decision to retain existing customers or acquire new customers, the
RANDF2000 model outputs also allow us to do customer profiling.
Figure 9 shows a bar-plot that is constructed based on the 31 key characteristics of Churners.
Longer bars mean that the particular attribute is more common among the Churners. For
example, 185 out of 188 Churners are on a Month-to-Month Contract. Other characteristics
common to most Churners are No Online Security and No Tech Support (both with 175). The
interpretation is that customers who share the characteristics with the longest bars fit the
profile of a Churner.
Customers can be profiled based on this and potential Churners can be identified when they
exhibit the same set of characteristics, i.e. if customers are on a Month-to-Month Contract,
have No Online Security and No Tech Support, we should expect them to Churn. We can then
refer back to the cost evaluation to determine if it is more economically viable to retain them
or to let them go.
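A bar-plot in the style of Figure 9 can be sketched as follows; only three of the 31 characteristics are shown, with counts taken from the report:

```r
# Sketch of a Figure 9 style bar-plot; three of the 31 characteristics shown.
counts <- c("Month-to-Month Contract" = 185,
            "No Online Security"      = 175,
            "No Tech Support"         = 175)
barplot(sort(counts), horiz = TRUE, las = 1,
        main = "Common characteristics of predicted Churners (n = 188)")
```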
The RANDF2000 model our team developed has analysed and learnt from the training dataset
such that, when it is tested on the verify dataset, we are able to profile Churners and do a
detailed cost evaluation on the potential loss to our company's revenue, as well as how we can
mitigate this loss by enticing at-risk customers. We are confident that this model, if applied in the future,
will allow us to reduce Churn among our valued customers and secure our revenue stream for
the long-term.
From a technical standpoint, our team did not account for the risk of overfitting in this
analysis and modelling, as we felt this should be addressed after we have
first derived a good working model. Since we now have a working model in RANDF2000, we
can deal with these risks more adequately in our subsequent analysis and modelling.
We have assumed in our cost evaluation that customers will react positively to our enticement
efforts and not churn after being offered a retention package. However, this may not be the
case all the time, since barriers to churn are being reduced tremendously by regulations
such as mobile number portability. Future analysis efforts could take into account
customers' responses to retention packages so that the predictive model can be improved upon.
As with most telecommunication subscriptions, family members very often all subscribe to the
same service. Subscribers could even influence their friends to join up with them on the same
service. However, when one of these subscribers Churns, it could result in a whole group of
subscribers Churning, leading to a loss of revenue that is greatly amplified. By using Social
Network Analysis, we will be able to identify connections between our different subscribers.
In doing so, we will be able to leverage our subscribers and their networks to prevent them
from churning. Taking this a step further, we can integrate this with our Customer Relationship
Management (CRM) systems to work on better packages to entice and retain our subscribers.
By collecting customer sentiments in collaboration with CRM, we might be able to gain insights
into why these characteristics affect their satisfaction rating of our service that lead them to
churn. This could help us identify the real reasons for churn, helping us deal with them swiftly.