
MANM354: Machine Learning and Visualisation

Coursework
Telecom Churn Analysis with RStudio
Management Report


Table of Contents

Introduction

Technical Report

Data Pre-Processing

Data Mining (Random Forest and C5.0 Decision Tree)

Evaluation of Models

Management Report

Cost Evaluation

Customer Profiling

Recommendations and Future Improvements


INTRODUCTION

Subscriber Churn has been increasing, becoming a problem that telecommunications carriers

like ourselves have to deal with in order to retain our high-value subscribers. With virtually

everyone owning mobile phones now, some even owning two or more, the saturated market

means that there are next to no new customers to acquire. Therefore, it is of paramount importance

for us to identify our high-value subscribers early and target them with enticements to prevent

them from churning.

This report serves to give an overview of an RStudio application that was developed by

our data analytics team using a training dataset and evaluated on a verify dataset. The

findings from this report will provide Management with actionable insights to better manage

churn.

TECHNICAL REPORT

The training dataset (training) consists of 6000 entries with 21 variables and Churn as the

target. The tasks required were to determine which of the 21 variables (inputs/predictors) could

best predict the outcome of Churn and to develop rule sets that identify subscribers who

churn.

Data Pre-Processing

As most data mining algorithms perform poorly with datasets that are greatly imbalanced, the

distribution of Churn in training was checked as part of our preliminary data examination.

The distribution of Churn was found to be Yes = 26.4% and No = 73.6%, implying that

there is a sufficient amount of Yes to carry out data mining without the need for data balancing.

Future improvements could involve incorporating functions to carry out a balancing using the

SMOTE or unbalanced packages, where the Yes:No proportion is brought up to

40:60. For the purpose of this analysis, our team decided that the balancing threshold

would be Yes = 20%, meaning that as long as the proportion of Yes is higher than 20%,

data balancing was NOT required.
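The balance check described above can be sketched as follows (assuming the target column is named Churn with Yes/No values; this is an illustration, not the team's exact code):

```r
# Check the class distribution of the target and decide whether balancing
# is needed against the team's 20% threshold.
churn_prop <- prop.table(table(training$Churn))
print(round(churn_prop * 100, 1))          # e.g. No = 73.6, Yes = 26.4

balance_threshold <- 0.20                  # the team's chosen threshold
needs_balancing   <- churn_prop["Yes"] < balance_threshold
```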

Following that, Churn was separated from training so that training could be rbind-ed

with verify to form combined. This was necessary to ensure consistency in the

pre-processing procedure without affecting the algorithms' prediction capability.

The combined dataset was first audited for missing values as they create complications in

data mining and analysis. MissingValuesCheck(combined) was used to identify the missing

values which were all found under the TotalCharges column. A mean/mode substitution

method was not used to replace the missing values as it reduces variability of the data. Instead,

the 11 entries/rows (9 in training and 2 in verify) with missing values were deleted for

simplicity, comparability and because there were sufficient remaining entries to ensure that

data integrity would not be compromised, producing combined_s.

The 9 entries corresponding to the deleted training rows were likewise removed

from the Churn column that was separated out earlier. The Yes and No responses were also

converted into 1 or 0. This is then defined as churn_col, which will be attached to the pre-

processed training dataset (without Churn) later.
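A minimal sketch of the missing-value handling and target preparation described above (object names such as combined and churn_raw are illustrative assumptions, not the team's exact code):

```r
# Flag and delete the 11 rows with missing TotalCharges values
na_rows    <- !complete.cases(combined)
combined_s <- combined[!na_rows, ]

# churn_raw holds the Yes/No target separated from training earlier; drop the
# 9 training rows that had missing values, then recode Yes/No as 1/0.
train_na  <- na_rows[seq_along(churn_raw)]   # training rows come first in combined
churn_col <- ifelse(churn_raw[!train_na] == "Yes", 1, 0)
```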

Npreprocessdataset(combined_s) was then used to carry out:

1. Conversion of variables into {1,0}

2. 1-hot-encoding of text entries with more than 2 unique values, converting them into 1-

of-x

3. Check for correlation of fields

4. Removal of correlated fields

creating the prepro_combined_s dataset with 35 variables.
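The report's custom Npreprocessdataset() function is not reproduced here, but the four steps it performs can be approximated with base R and the caret package (the 0.9 correlation cutoff is an assumption):

```r
library(caret)

# Steps 1-2: 1-hot-encode factors with more than 2 levels and binary-encode
# the rest; fullRank = TRUE avoids redundant dummy columns.
dmy     <- dummyVars(~ ., data = combined_s, fullRank = TRUE)
encoded <- data.frame(predict(dmy, newdata = combined_s))

# Steps 3-4: check for and remove highly correlated fields
corr_mat <- cor(encoded)
drop_idx <- findCorrelation(corr_mat, cutoff = 0.9)   # assumed cutoff
prepro_combined_s <- if (length(drop_idx)) encoded[, -drop_idx] else encoded
```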

Rows 1-5991 of prepro_combined_s belong to the original training set and were cbind-

ed with churn_col to form a pre-processed training dataset that includes Churn, defined as

training_dataset_wChurn.

The remaining data in rows 5992-6990 of prepro_combined_s is the pre-processed verify

dataset, redefined as verifySet, which will be used to verify the results derived from the

chosen machine learning algorithm later.

NPREPROCESSING_splitdataset(training_dataset_wChurn) was then used to create a

training dataset to be used for data mining (trainingSet) by randomising and splitting the

dataset using 70% of the records. A test dataset (testSet) was also created with the remaining

30% of the records.
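The custom NPREPROCESSING_splitdataset() function is not reproduced here, but the randomise-and-split step it performs amounts to the following sketch (the seed is an assumption for reproducibility):

```r
set.seed(123)                                   # assumed seed
n         <- nrow(training_dataset_wChurn)
train_idx <- sample(n, size = round(0.7 * n))   # randomise, take 70% of records

trainingSet <- training_dataset_wChurn[train_idx, ]
testSet     <- training_dataset_wChurn[-train_idx, ]   # remaining 30%
```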

trainingSet, testSet and verifySet are then run through a gsub function to clean their

column names of special characters.
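The name-cleaning step can be sketched as follows (the regular expression is an assumption about which characters count as special):

```r
# Strip anything that is not a letter or digit from the column names
names(trainingSet) <- gsub("[^A-Za-z0-9]", "", names(trainingSet))
names(testSet)     <- gsub("[^A-Za-z0-9]", "", names(testSet))
names(verifySet)   <- gsub("[^A-Za-z0-9]", "", names(verifySet))
```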

Churn being the Target is once again separated out and set as the Expected while the

rest are Variables, creating trainVariables, trainExpect, testVariables and testExpect,

which will all be used in the Data Mining process.

Data Mining

Random Forest

The randomForest function in R's randomForest package was applied to trainVariables

and trainExpect (which must be converted into a factor) with an initial ntree count of 500

and this model was named randf. The predict function was then used to evaluate how

randf fared with the testVariables, where the output (randf_predict) was randf's predicted

Churn.
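The Random Forest fit and prediction described above amount to the following sketch:

```r
library(randomForest)

# Initial model with an ntree count of 500; the target must be a factor
randf <- randomForest(x = trainVariables,
                      y = as.factor(trainExpect),
                      ntree = 500)

# Evaluate how randf fares on the held-out test variables
randf_predict <- predict(randf, testVariables)   # predicted Churn
```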

C5.0

R's C50 package has the C5.0 function that is applied in a similar way to randomForest,

except that instead of an ntree count, C5.0 uses a trials count. The model applied (C5dt)

requires trainExpect to be converted into a factor and was given a rules = TRUE input

4
so that it would a rule-based model. Similarly, predict was used to evaluate how C5dt fared

with the testVariables, where the output (C5dt_predict) was C5dts predicted Churn.
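The C5.0 fit described above can be sketched as:

```r
library(C50)

# rules = TRUE makes C5.0 produce a rule-based model rather than a tree;
# trials is the boosting count that is varied in later iterations.
C5dt <- C5.0(x = trainVariables,
             y = as.factor(trainExpect),
             trials = 1,
             rules = TRUE)

C5dt_predict <- predict(C5dt, testVariables)   # predicted Churn
```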

Confusion Matrix (ConfMat)

Following that, the outputs (randf_predict and C5dt_predict) are each compared to testExpect

individually to produce 2 separate ConfMat using a custom confusion_matrix.R function.

This was double-checked with R's confusionMatrix function in the caret package.
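The caret double-check can be sketched as follows (treating the "1" level as the positive class is an assumption; randf_predict and testExpect come from the earlier steps):

```r
library(caret)

# Compare predicted Churn against the expected values from the test set
cm_randf <- confusionMatrix(data      = randf_predict,
                            reference = as.factor(testExpect),
                            positive  = "1")    # assumed positive class
print(cm_randf$table)                           # the 2x2 ConfMat counts
```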

Receiver Operating Characteristic (ROC) Curve

Using the results from the confusion matrix, an ROC curve was plotted using R's plot.roc

function in the pROC package to graphically represent the results. The Area Under Curve

(AUC) is then obtained by using the print function on the plot. The ConfMat results, True

Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), and the AUC

value are then tabulated into an Excel file (AlgorithmComparison.xls) for model comparison

purposes.
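The ROC/AUC step can be sketched with pROC as follows; feeding in predicted probabilities (type = "prob") is an assumption, as the report does not state which scores were plotted:

```r
library(pROC)

# Class-1 probabilities from the fitted Random Forest
randf_prob <- predict(randf, testVariables, type = "prob")[, "1"]

# Plot the ROC curve and print the object, which reports the AUC
roc_obj <- plot.roc(as.numeric(as.character(testExpect)), randf_prob)
print(roc_obj)
```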

Repeating the Algorithms with variation

The above steps were repeated with varying ntree and trials count for the randomForest and

C5.0 functions respectively, to derive different outputs for the predicted Churn and hence

different ConfMat results for multiple iterations of the algorithms.

4 additional ntree counts (1000, 1500, 2000 and 2500) were used for alternative

randomForest models. Similarly, 4 additional trials counts (3, 5, 7, 10) were used for

alternative C5.0 models.

Evaluation of Models

Results and outputs from all 10 models were tabulated in AlgorithmComparison.xls with

additional columns created for measures such as Sensitivity, Specificity, Precision, Accuracy,

AUC and the Matthews Correlation Coefficient (MCC).

To decide which model was the best, the models were ranked by their AUC and MCC values.

Although the AUC statistic in itself is already a good measure of the quality of classification

models such as Random Forest and C5.0, MCC adds value to the model assessment as it is a

balanced measure which takes into account all 4 ConfMat results. Since the aim of this

project is to correctly identify the positive cases of Churn, Precision was used as the third

measure. Using 3 measures for evaluation adds extra stringency when evaluating and choosing

the best model.
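For reference, the MCC used above can be computed directly from the four ConfMat counts; a minimal sketch:

```r
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# Returns 0 when any marginal is zero, the conventional fallback.
mcc <- function(TP, TN, FP, FN) {
  num <- TP * TN - FP * FN
  den <- sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (den == 0) 0 else num / den
}
```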

The 10 models were ranked by the highest AUC and then by MCC, where the top 5 included

just one C5.0 model. Although this model (C5.0 with 10 trials) had the best AUC (0.8397), its

MCC is much lower than the rest and has a Precision of only 39.4% compared to the other 4

Random Forests. Of the remaining models, the best is the Random Forest with ntree = 2000, as it has the

highest AUC (0.828) and MCC (0.4474) among them, and the best Precision (50.8%).

Figure 1: Summarised results of models, ranked by AUC statistic

Hence, the chosen machine learning algorithm to be applied to our company's Churn problem

is the Random Forest with ntree = 2000 (RANDF2000). The model's ConfMat results and

ROC curve are shown below:

Figure 2: ConfMat results

Figure 3: ConfMat results from the caret package's confusionMatrix function

Figure 4: ROC curve

MANAGEMENT REPORT

Since the RANDF2000 model produced the best AUC, MCC and Precision statistics, it is

selected as the model to carry out our cost evaluation on our customers.

Cost evaluation is done with the following assumptions relating to costs (in US$):

1. Average Cost to acquire new customer = US$750

2. Cost of Customer Loss (LostCustCost) = TotalCharges (12-month revenue to our

company per customer)

3. Average Cost of enticements to retain customers (RetCost) = 10% of TotalCharges

4. Customers are responsive to enticements and will not Churn when offered enticements

The RANDF2000 model is applied to the verify dataset to derive the predicted Churn of those

customers. The output, containing the characteristics of each customer as well as their predicted

Churn (randfChurn: 1 for Churn, 0 for Don't Churn), is combined with important customer

information such as LostCustCost, RetCost and customerID. Figure 5 shows a small excerpt

with 11 customers of this Cost Evaluation table.

Figure 5: Cost Evaluation excerpt

When the model predicts that there is more than a 50% chance that a customer will Churn

(randfChurnProb > 0.5), it expects that customer to Churn (randfChurn = 1). This represents a

loss of customer revenue (e.g. LostCustCost = $2,497.20). However, if this customer

(customerID: 1597-LHYNC) is enticed with an offer (amounting to RetCost = $249.72), he/she

would not Churn. Since acquiring a new customer ($750) is costlier than enticing this customer

with a retention offer, it makes business sense to retain the customer instead.
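The retain-or-reacquire logic above can be expressed as a simple rule, using the figures from the customer 1597-LHYNC example:

```r
# Entice a predicted Churner whenever the retention offer (10% of
# TotalCharges) is cheaper than acquiring a new customer (US$750).
acquisition_cost <- 750
lost_cust_cost   <- 2497.20                 # this customer's TotalCharges
ret_cost         <- 0.10 * lost_cust_cost   # 249.72

action <- if (ret_cost < acquisition_cost) "retain" else "acquire"
```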

In terms of customer retention efforts, we should target high-value customers expected to

Churn first. This is done by first sorting the Cost Evaluation table with highest-value (largest

LostCustCost value) at the top. Following that, we can zoom in and focus only on those

expected to Churn (randfChurn = 1). Figure 6 is a small excerpt of the resulting table, called

HighValueChurn.
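Building HighValueChurn can be sketched as follows (cost_eval is an assumed name for the Cost Evaluation table):

```r
# Sort with the highest-value customers (largest LostCustCost) at the top,
# then keep only those the model expects to Churn
cost_eval      <- cost_eval[order(-cost_eval$LostCustCost), ]
HighValueChurn <- subset(cost_eval, randfChurn == 1)
```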

Figure 6: HighValueChurn excerpt

The RANDF2000 model has predicted that, out of the customers within the verify dataset, there

are 188 who are expected to churn (Figure 7). The highest-value Churner has the customerID:

3685-YLCMQ. If he/she churns, the impact to our business is a loss of $6,219.60. Since RetCost

is $621.96 (< $750), this customer should be contacted and enticed with a retention offer of

up to $621.96.

Figure 7: Number of Customers RANDF2000 predicts will Churn

A full cost evaluation of all 188 predicted Churners is run in R, producing the cost figures in

Figure 8.

Figure 8: Cost Figures for Cost Evaluation

In essence, the Total Expected Loss of $169,429.10 can be avoided if $16,942.91 is spent on

enticing the 188 customers that RANDF2000 predicts will Churn. This is clearly the better

option than losing the customers ($169,429.10) and having to spend an additional $141,000.00

to acquire the same number that was lost, since that would reduce our

company's profits by $310,429.10, with no certainty of the amount of revenue from new

customers.
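The arithmetic behind the cost figures in Figure 8 follows directly from the report's stated assumptions:

```r
# Total expected loss across the 188 predicted Churners, from Figure 8
total_expected_loss <- 169429.10

# Retention: 10% of TotalCharges per customer = 10% of the total loss
total_ret_cost <- 0.10 * total_expected_loss             # 16,942.91

# Alternative: losing all 188 and reacquiring the same number at $750 each
reacquisition_cost <- 188 * 750                          # 141,000.00
total_if_lost      <- total_expected_loss + reacquisition_cost   # 310,429.10
```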

Customer Profiling

In addition to enabling us to do a cost evaluation to determine whether to entice Churners or

acquire new customers, the RANDF2000 model outputs also allow us to do customer profiling.

Figure 9 shows a bar-plot that is constructed based on the 31 key characteristics of Churners.

Longer bars mean that the particular attribute is one that is more common among the

Churners. For example, 185 out of 188 of the Churners are on a Month-to-Month Contract.

Other characteristics that are common to most Churners are those with No Online Security

and No Tech Support (both with 175). The interpretation is that customers who share the

characteristics represented by the longer bars are the ones most likely to be classified as Churners.

Figure 9: Characteristics of Churners
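A bar-plot of this kind could be produced along the following lines (characteristic_cols, indexing the 31 binary attribute columns, is an assumption, not the team's exact code):

```r
# Count how many of the 188 Churners exhibit each binary attribute,
# then plot horizontally so longer bars mean more common characteristics
churner_counts <- sort(colSums(HighValueChurn[, characteristic_cols]))

barplot(churner_counts, horiz = TRUE, las = 1,
        main = "Characteristics of Churners",
        xlab = "Number of Churners with attribute")
```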

Customers can be profiled based on this and potential Churners can be identified when they

exhibit the same set of characteristics, i.e. if customers are on a Month-to-Month Contract,

have No Online Security and No Tech Support, we should expect them to Churn. We can then

refer back to the cost evaluation to determine if it is more economically viable to retain them

through enticements or acquire a new customer instead.

The RANDF2000 model our team developed has analysed and learnt from the training dataset

such that, when it is tested on the verify dataset, we are able to profile Churners and do a

detailed cost evaluation of the potential loss to our company's revenue, as well as how we can

mitigate this loss by enticing them. We are confident that this model, if applied in the future,

will allow us to reduce Churn among our valued customers and secure our revenue stream for

the long-term.

Recommendations and Future Improvements

From a technical standpoint, our team did not account for the risks of overfitting for this

analysis and modelling as we felt that this is something that should be worked on after we have

first derived a good working model. Since we now have a working model in RANDF2000, we

can deal with these risks more adequately in our subsequent analysis and modelling.

We have assumed in our cost evaluation that customers will react positively to our enticement

efforts and not churn after being offered a retention package. However, this may not be the

case all the time since barriers to churn are being reduced tremendously with regulations in

place such as mobile number portability. Future analysis efforts could take into account

customers' responses to retention packages so that the predictive model can be improved upon.

As with most telecommunication subscriptions, family members very often all subscribe to the

same service. Subscribers could even influence their friends to join up with them on the same

service. However, when one of these subscribers Churns, it could result in a whole group of

subscribers Churning, leading to a loss of revenue that is greatly amplified. By using Social

Network Analysis, we will be able to identify connections between our different subscribers.

In doing so, we will be able to leverage our subscribers and their networks to prevent them

from churning. Taking this a step further, we can integrate this with our Customer Relationship

Management (CRM) systems to work on better packages to entice and retain our subscribers.

In terms of Churner profiling, we could investigate these characteristics further. By

collecting customer sentiments in collaboration with CRM, we might be able to gain insights

into why these characteristics affect customers' satisfaction with our service and lead them to

churn. This could help us identify the real reasons for churn, helping us deal with them swiftly.

