
MANM354: Machine Learning and Visualisation

Coursework
Telecom Churn Analysis with RStudio
Management Report


Table of Contents

Introduction

Technical Report

Data Pre-Processing

Data Mining (Random Forest and C5.0 Decision Tree)

Evaluation of Models

Management Report

Cost Evaluation

Customer Profiling

Recommendations and Future Improvements


INTRODUCTION

Subscriber Churn has been increasing, becoming a problem that telecommunications carriers

like ourselves have to deal with in order to retain our high-value subscribers. With virtually

everyone owning mobile phones now, some even owning two or more, the saturated market

means that there are next to no new customers to acquire. Therefore, it is of paramount importance

for us to identify our high-value subscribers early and target them with enticements to prevent

them from churning.

This report serves to give an overview of an RStudio application that was developed by

our data analytics team using a training dataset and evaluated on a verify dataset. The

findings from this report will provide Management with actionable insights to better manage

churn.

TECHNICAL REPORT

The training dataset (training) consists of 6000 entries with 21 variables and Churn as the

target. The tasks required were to determine which of the 21 variables (inputs/predictors) could

best predict the outcome of Churn and to develop rule sets that identify subscribers who

churn.

Data Pre-Processing

As most data mining algorithms perform poorly with datasets that are greatly imbalanced, the

distribution of Churn in training was checked as part of our preliminary data examination.

The distribution of Churn was found to be Yes = 26.4% and No = 73.6%, implying that

there is a sufficient amount of Yes to carry out data mining without the need for data balancing.

Future improvements could involve incorporating functions to carry out a balancing using the

SMOTE or unbalanced packages, where the Yes:No proportion is brought up to

40:60. For the purpose of this analysis, our team decided that the balancing threshold

would be Yes = 20%, meaning that as long as the proportion of Yes is higher than 20%,

data balancing was NOT required.
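The balance check described above can be sketched as follows (assuming the target column is named Churn with Yes/No values; this is an illustration, not the team's exact code):

```r
# Check the class distribution of the target and decide whether balancing
# is needed against the team's 20% threshold.
churn_prop <- prop.table(table(training$Churn))
print(round(churn_prop * 100, 1))          # e.g. No = 73.6, Yes = 26.4

balance_threshold <- 0.20                  # the team's chosen threshold
needs_balancing   <- churn_prop["Yes"] < balance_threshold
```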

Following that, Churn was separated from training so that training could be rbind-ed

with verify to form combined. This was necessary to ensure consistency in the

pre-processing procedure without affecting the algorithms' prediction capability.

The combined dataset was first audited for missing values as they create complications in

data mining and analysis. MissingValuesCheck(combined) was used to identify the missing

values which were all found under the TotalCharges column. A mean/mode substitution

method was not used to replace the missing values as it reduces variability of the data. Instead,

the 11 entries/rows (9 in training and 2 in verify) with missing values were deleted for

simplicity, comparability and because there were sufficient remaining entries to ensure that

data integrity would not be compromised, producing combined_s.

The 9 entries corresponding to the deleted training rows were likewise removed

from the Churn column that was separated out earlier. The Yes and No responses were also

converted into 1 or 0. This is then defined as churn_col, which will be attached to the pre-

processed training dataset (without Churn) later.
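A minimal sketch of the missing-value handling and target preparation described above (object names such as combined and churn_raw are illustrative assumptions, not the team's exact code):

```r
# Flag and delete the 11 rows with missing TotalCharges values
na_rows    <- !complete.cases(combined)
combined_s <- combined[!na_rows, ]

# churn_raw holds the Yes/No target separated from training earlier; drop the
# 9 training rows that had missing values, then recode Yes/No as 1/0.
train_na  <- na_rows[seq_along(churn_raw)]   # training rows come first in combined
churn_col <- ifelse(churn_raw[!train_na] == "Yes", 1, 0)
```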

Npreprocessdataset(combined_s) was then used to carry out:

1. Conversion of variables into {1,0}

2. 1-hot-encoding of text entries with more than 2 unique values, converting them into 1-

of-x

3. Check for correlation of fields

4. Removal of correlated fields

creating the prepro_combined_s dataset with 35 variables.
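The report's custom Npreprocessdataset() function is not reproduced here, but the four steps it performs can be approximated with base R and the caret package (the 0.9 correlation cutoff is an assumption):

```r
library(caret)

# Steps 1-2: 1-hot-encode factors with more than 2 levels and binary-encode
# the rest; fullRank = TRUE avoids redundant dummy columns.
dmy     <- dummyVars(~ ., data = combined_s, fullRank = TRUE)
encoded <- data.frame(predict(dmy, newdata = combined_s))

# Steps 3-4: check for and remove highly correlated fields
corr_mat <- cor(encoded)
drop_idx <- findCorrelation(corr_mat, cutoff = 0.9)   # assumed cutoff
prepro_combined_s <- if (length(drop_idx)) encoded[, -drop_idx] else encoded
```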

Rows 1-5991 of prepro_combined_s belong to the original training set and were cbind-

ed with churn_col to form a pre-processed training dataset that includes Churn, defined as

training_dataset_wChurn.

The remaining data in rows 5992-6990 of prepro_combined_s is the pre-processed verify

dataset, redefined as verifySet, which will be used to verify the results derived from the

chosen machine learning algorithm later.

NPREPROCESSING_splitdataset(training_dataset_wChurn) was then used to create a

training dataset to be used for data mining (trainingSet) by randomising and splitting the

dataset using 70% of the records. A test dataset (testSet) was also created with the remaining

30% of the records.
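The custom NPREPROCESSING_splitdataset() function is not reproduced here, but the randomise-and-split step it performs amounts to the following sketch (the seed is an assumption for reproducibility):

```r
set.seed(123)                                   # assumed seed
n         <- nrow(training_dataset_wChurn)
train_idx <- sample(n, size = round(0.7 * n))   # randomise, take 70% of records

trainingSet <- training_dataset_wChurn[train_idx, ]
testSet     <- training_dataset_wChurn[-train_idx, ]   # remaining 30%
```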

trainingSet, testSet and verifySet are then run through a gsub function to clean their

column names of special characters.
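The name-cleaning step can be sketched as follows (the regular expression is an assumption about which characters count as special):

```r
# Strip anything that is not a letter or digit from the column names
names(trainingSet) <- gsub("[^A-Za-z0-9]", "", names(trainingSet))
names(testSet)     <- gsub("[^A-Za-z0-9]", "", names(testSet))
names(verifySet)   <- gsub("[^A-Za-z0-9]", "", names(verifySet))
```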

Churn being the Target is once again separated out and set as the Expected while the

rest are Variables, creating trainVariables, trainExpect, testVariables and testExpect,

which will all be used in the Data Mining process.

Data Mining

Random Forest

The randomForest function in R's randomForest package was applied to trainVariables

and trainExpect (which must be converted into a factor) with an initial ntree count of 500

and this model was named randf. The predict function was then used to evaluate how

randf fared with the testVariables, where the output (randf_predict) was randf's predicted

Churn.
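The Random Forest fit and prediction described above amount to the following sketch:

```r
library(randomForest)

# Initial model with an ntree count of 500; the target must be a factor
randf <- randomForest(x = trainVariables,
                      y = as.factor(trainExpect),
                      ntree = 500)

# Evaluate how randf fares on the held-out test variables
randf_predict <- predict(randf, testVariables)   # predicted Churn
```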

C5.0

R's C50 package has the C5.0 function that is applied in a similar way to randomForest,

except that instead of an ntree count, C5.0 uses a trials count. The model applied (C5dt)

requires trainExpect to be converted into a factor and was given a rules = TRUE input

4
so that it would a rule-based model. Similarly, predict was used to evaluate how C5dt fared

with the testVariables, where the output (C5dt_predict) was C5dts predicted Churn.
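The C5.0 fit described above can be sketched as:

```r
library(C50)

# rules = TRUE makes C5.0 produce a rule-based model rather than a tree;
# trials is the boosting count that is varied in later iterations.
C5dt <- C5.0(x = trainVariables,
             y = as.factor(trainExpect),
             trials = 1,
             rules = TRUE)

C5dt_predict <- predict(C5dt, testVariables)   # predicted Churn
```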

Confusion Matrix (ConfMat)

Following that, the outputs (randf_predict and C5dt_predict) are each compared to testExpect

individually to produce 2 separate ConfMat using a custom confusion_matrix.R function.

This was double-checked with R's confusionMatrix function in the caret package.
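The caret double-check can be sketched as follows (treating the "1" level as the positive class is an assumption; randf_predict and testExpect come from the earlier steps):

```r
library(caret)

# Compare predicted Churn against the expected values from the test set
cm_randf <- confusionMatrix(data      = randf_predict,
                            reference = as.factor(testExpect),
                            positive  = "1")    # assumed positive class
print(cm_randf$table)                           # the 2x2 ConfMat counts
```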

Receiver Operating Characteristic (ROC) Curve

Using the results from the confusion matrix, an ROC curve was plotted using R's plot.roc

function in the pROC package to graphically represent the results. The Area Under Curve

(AUC) is then obtained by using the print function on the plot. The ConfMat results, True

Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN), and the AUC

value are then tabulated into an Excel file (AlgorithmComparison.xls) for model comparison

purposes.
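The ROC/AUC step can be sketched with pROC as follows; feeding in predicted probabilities (type = "prob") is an assumption, as the report does not state which scores were plotted:

```r
library(pROC)

# Class-1 probabilities from the fitted Random Forest
randf_prob <- predict(randf, testVariables, type = "prob")[, "1"]

# Plot the ROC curve and print the object, which reports the AUC
roc_obj <- plot.roc(as.numeric(as.character(testExpect)), randf_prob)
print(roc_obj)
```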

Repeating the Algorithms with variation

The above steps were repeated with varying ntree and trials count for the randomForest and

C5.0 functions respectively, to derive different outputs for the predicted Churn and hence

different ConfMat results for multiple iterations of the algorithms.

4 additional ntree counts (1000, 1500, 2000 and 2500) were used for alternative

randomForest models. Similarly, 4 additional trials counts (3, 5, 7, 10) were used for

alternative C5.0 models.

Evaluation of Models

Results and outputs from all 10 models were tabulated in AlgorithmComparison.xls with

additional columns created for measures such as Sensitivity, Specificity, Precision, Accuracy,

AUC and the Matthews Correlation Coefficient (MCC).

To decide which model was the best, the models were ranked by their AUC and MCC values.

Although the AUC statistic in itself is already a good measure of the quality of classification

models such as Random Forest and C5.0, MCC adds value to the model assessment as it is a

balanced measure which takes into account all 4 ConfMat results. Since the aim of this

project is to correctly identify the positive cases of Churn, Precision was used as the third

measure. Using 3 measures for evaluation adds extra stringency when evaluating and choosing

the best model.
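For reference, the MCC used above can be computed directly from the four ConfMat counts; a minimal sketch:

```r
# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
# Returns 0 when any marginal is zero, the conventional fallback.
mcc <- function(TP, TN, FP, FN) {
  num <- TP * TN - FP * FN
  den <- sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
  if (den == 0) 0 else num / den
}
```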

The 10 models were ranked by the highest AUC and then by MCC, where the top 5 included

just one C5.0 model. Although this model (C5.0 with 10 trials) had the best AUC (0.8397), its

MCC is much lower than the rest and has a Precision of only 39.4% compared to the other 4

Random Forests. Of the remaining models, the best is the Random Forest with ntree = 2000, as it has the

highest AUC (0.828) and MCC (0.4474) among them, and the best Precision (50.8%).

Figure 1: Summarised results of models, ranked by AUC statistic

Hence, the chosen machine learning algorithm to be applied to our company's Churn problem

is the Random Forest with ntree = 2000 (RANDF2000). The model's ConfMat results and

ROC curve are shown below:

Figure 2: ConfMat results

Figure 3: ConfMat results from the caret package's confusionMatrix function

Figure 4: ROC curve

MANAGEMENT REPORT

Since the RANDF2000 model produced the best AUC, MCC and Precision statistics, it is

selected as the model to carry out our cost evaluation on our customers.

Cost evaluation is done with the following assumptions relating to costs (in US$):

1. Average Cost to acquire new customer = US$750

2. Cost of Customer Loss (LostCustCost) = TotalCharges (12-month revenue to our

company per customer)

3. Average Cost of enticements to retain customers (RetCost) = 10% of TotalCharges

4. Customers are responsive to enticements and will not Churn when offered enticements

The RANDF2000 model is applied to the verify dataset to derive the predicted Churn of those

customers. The output, containing the characteristics of each customer as well as their predicted

Churn (randfChurn: 1 for Churn, 0 for Don't Churn), is combined with important customer

information such as LostCustCost, RetCost and customerID. Figure 5 shows a small excerpt

with 11 customers of this Cost Evaluation table.

Figure 5: Cost Evaluation excerpt

When the model predicts that there is more than a 50% chance that a customer will Churn

(randfChurnProb > 0.5), it expects that customer to Churn (randfChurn = 1). This represents a

loss of customer revenue (e.g. LostCustCost = $2,497.20). However, if this customer

(customerID: 1597-LHYNC) is enticed with an offer (amounting to RetCost = $249.72), he/she

would not Churn. Since acquiring a new customer ($750) is costlier than enticing this customer

with a retention offer, it makes business sense to retain the customer instead.
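The retain-or-reacquire logic above can be expressed as a simple rule, using the figures from the customer 1597-LHYNC example:

```r
# Entice a predicted Churner whenever the retention offer (10% of
# TotalCharges) is cheaper than acquiring a new customer (US$750).
acquisition_cost <- 750
lost_cust_cost   <- 2497.20                 # this customer's TotalCharges
ret_cost         <- 0.10 * lost_cust_cost   # 249.72

action <- if (ret_cost < acquisition_cost) "retain" else "acquire"
```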

In terms of customer retention efforts, we should target high-value customers expected to

Churn first. This is done by first sorting the Cost Evaluation table with highest-value (largest

LostCustCost value) at the top. Following that, we can zoom in and focus only on those

expected to Churn (randfChurn = 1). Figure 6 is a small excerpt of the resulting table, called

HighValueChurn.
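Building HighValueChurn can be sketched as follows (cost_eval is an assumed name for the Cost Evaluation table):

```r
# Sort with the highest-value customers (largest LostCustCost) at the top,
# then keep only those the model expects to Churn
cost_eval      <- cost_eval[order(-cost_eval$LostCustCost), ]
HighValueChurn <- subset(cost_eval, randfChurn == 1)
```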

Figure 6: HighValueChurn excerpt

The RANDF2000 model has predicted that, out of the customers within the verify dataset, there

are 188 who are expected to churn (Figure 7). The highest-value Churner has the customerID:

3685-YLCMQ. If he/she churns, the impact to our business is a loss of $6,219.60. Since RetCost

is $621.96 (< $750), this customer should be contacted and enticed with a retention offer of

up to $621.96.

Figure 7: Number of Customers RANDF2000 predicts will Churn

A full cost evaluation of all 188 predicted Churners is run in R, producing the cost figures in

Figure 8.

Figure 8: Cost Figures for Cost Evaluation

In essence, the Total Expected Loss of $169,429.10 can be avoided if $16,942.91 is spent on

enticing the 188 customers that RANDF2000 predicts will Churn. This is clearly the better

option than losing the customers ($169,429.10) and having to spend an additional $141,000.00

to acquire the same number that was lost, since that would reduce our

company's profits by $310,429.10, with no certainty of the amount of revenue from new

customers.
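The arithmetic behind the cost figures in Figure 8 follows directly from the report's stated assumptions:

```r
# Total expected loss across the 188 predicted Churners, from Figure 8
total_expected_loss <- 169429.10

# Retention: 10% of TotalCharges per customer = 10% of the total loss
total_ret_cost <- 0.10 * total_expected_loss             # 16,942.91

# Alternative: losing all 188 and reacquiring the same number at $750 each
reacquisition_cost <- 188 * 750                          # 141,000.00
total_if_lost      <- total_expected_loss + reacquisition_cost   # 310,429.10
```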

Customer Profiling

In addition to enabling us to do a cost evaluation to determine whether to entice Churners or

acquire new customers, the RANDF2000 model outputs also allow us to do customer profiling.

Figure 9 shows a bar-plot that is constructed based on the 31 key characteristics of Churners.

Longer bars mean that the particular attribute is one that is more common among the

Churners. For example, 185 out of 188 of the Churners are on a Month-to-Month Contract.

Other characteristics that are common to most Churners are those with No Online Security

and No Tech Support (both with 175). The interpretation is that customers who share the

characteristics represented by the longer bars are the ones most likely to be classified as Churners.

Figure 9: Characteristics of Churners
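A bar-plot of this kind could be produced along the following lines (characteristic_cols, indexing the 31 binary attribute columns, is an assumption, not the team's exact code):

```r
# Count how many of the 188 Churners exhibit each binary attribute,
# then plot horizontally so longer bars mean more common characteristics
churner_counts <- sort(colSums(HighValueChurn[, characteristic_cols]))

barplot(churner_counts, horiz = TRUE, las = 1,
        main = "Characteristics of Churners",
        xlab = "Number of Churners with attribute")
```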

Customers can be profiled based on this and potential Churners can be identified when they

exhibit the same set of characteristics, i.e. if customers are on a Month-to-Month Contract,

have No Online Security and No Tech Support, we should expect them to Churn. We can then

refer back to the cost evaluation to determine if it is more economically viable to retain them

through enticements or acquire a new customer instead.

The RANDF2000 model our team developed has analysed and learnt from the training dataset

such that, when it is tested on the verify dataset, we are able to profile Churners and do a

detailed cost evaluation of the potential loss to our company's revenue, as well as how we can

mitigate this loss by enticing them. We are confident that this model, if applied in the future,

will allow us to reduce Churn among our valued customers and secure our revenue stream for

the long-term.

Recommendations and Future Improvements

From a technical standpoint, our team did not account for the risks of overfitting for this

analysis and modelling as we felt that this is something that should be worked on after we have

first derived a good working model. Since we now have a working model in RANDF2000, we

can deal with these risks more adequately in our subsequent analysis and modelling.

We have assumed in our cost evaluation that customers will react positively to our enticement

efforts and not churn after being offered a retention package. However, this may not be the

case all the time since barriers to churn are being reduced tremendously with regulations in

place such as mobile number portability. Future analysis efforts could take into account

customers' responses to retention packages so that the predictive model can be improved upon.

As with most telecommunication subscriptions, family members very often all subscribe to the

same service. Subscribers could even influence their friends to join up with them on the same

service. However, when one of these subscribers Churns, it could result in a whole group of

subscribers Churning, leading to a loss of revenue that is greatly amplified. By using Social

Network Analysis, we will be able to identify connections between our different subscribers.

In doing so, we will be able to leverage our subscribers and their networks to prevent them

from churning. Taking this a step further, we can integrate this with our Customer Relationship

Management (CRM) systems to work on better packages to entice and retain our subscribers.

In terms of Churner profiling, we could investigate these characteristics further. By

collecting customer sentiments in collaboration with CRM, we might be able to gain insights

into why these characteristics affect customers' satisfaction with our service and lead them to

churn. This could help us identify the real reasons for churn, helping us deal with them swiftly.

