R Venkataraman
Data Mining – Thera Bank
Project Description
This case is about a bank (Thera Bank) which has a growing customer base. The majority of these
customers are liability customers (depositors) with deposits of varying size. The number of
customers who are also borrowers (asset customers) is quite small, and the bank is interested
in expanding this base rapidly to bring in more loan business and in the process, earn more
through the interest on loans. In particular, the management wants to explore ways of
converting its liability customers to personal loan customers (while retaining them as
depositors). A campaign that the bank ran last year for liability customers showed a healthy
conversion rate of over 9%. This has encouraged the retail marketing department to
devise campaigns with better target marketing to increase the success ratio with a minimal
budget. The department wants to build a model that will help them identify the potential
customers who have a higher probability of purchasing the loan. This will increase the success
ratio while at the same time reduce the cost of the campaign. The dataset has data on 5000
customers. The data include customer demographic information (age, income, etc.), the
customer's relationship with the bank (mortgage, securities account, etc.), and the customer
response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
You are brought in as a consultant and your job is to build the best model which can classify the
right customers who have a higher probability of purchasing the loan.
Project Objective
Project Report
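library(dplyr)  # provides glimpse()
mythera <- read.csv("Thera Bank_data.csv", header = TRUE)
str(mythera)      # column types and first values
glimpse(mythera)  # compact column-wise preview
summary(mythera)  # per-column summary statistics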
With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial look at the data is as follows:
The dataset has 13 independent variables and 1 dependent variable ('Personal Loan').
Its 5000 rows can be split into train and test datasets for building the various
models.
Data Description:
ID Customer ID
Age Customer's age in years
Experience Years of professional experience
Income Annual income of the customer ($000)
ZIPCode Home Address ZIP code.
Family Family size of the customer
CCAvg Avg. spending on credit cards per month ($000)
Education Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage Value of house mortgage if any. ($000)
Personal Loan Did this customer accept the personal loan offered in the last campaign?
Securities Account Does the customer have a securities account with the bank?
CD Account Does the customer have a certificate of deposit (CD) account with the bank?
Online Does the customer use internet banking facilities?
CreditCard Does the customer use a credit card issued by the bank?
Data correction
The missing values in Family.members are imputed using mice (predictive mean matching), and a new dataset is created from the completed data:
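library(mice)
# 5 imputed datasets via predictive mean matching, up to 50 iterations
impute1 <- mice(data = mythera, m = 5, method = "pmm", maxit = 50, seed = 500)
impute1$imp$Family.members        # inspect the imputed values
mythera1 <- complete(impute1, 2)  # keep the second imputed dataset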
After imputation there are no missing values.
Experience (in years) also contains negative values, which are invalid because experience can only be zero or more. These are set to zero:
mythera1$Experience..in.years.[mythera1$Experience..in.years.<0]=0
mythera1$Education=as.factor(mythera1$Education)
mythera1$Family.members=as.factor(mythera1$Family.members)
mythera1$Personal.Loan=as.factor(mythera1$Personal.Loan)
mythera1$CD.Account=as.factor(mythera1$CD.Account)
mythera1$Online=as.factor(mythera1$Online)
mythera1$CreditCard=as.factor(mythera1$CreditCard)
mythera1$Securities.Account=as.factor(mythera1$Securities.Account)
Univariate Analysis
Factor Variables
Numeric Variables
The Mortgage variable has many outliers, which might affect the model: 291 records,
about 6% of the dataset. Removing them purely for the sake of removing outliers would
change the Personal Loan characteristics considerably (ref. below). They are also not data
entry or measurement errors; they are very much part of the population we are addressing.
So I am not removing the outliers.
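The report does not say which rule produced the count of 291. A minimal sketch, assuming the usual boxplot rule (points beyond 1.5 × IQR from the hinges) and using a synthetic stand-in for the Mortgage column, could look like this:

```r
# Synthetic stand-in for the Mortgage column ($000): mostly zeros plus a skewed
# tail. These values are illustrative only, not the Thera Bank data.
set.seed(42)
mortgage <- c(rep(0, 700), round(rexp(300, rate = 1 / 150)))

# boxplot.stats() returns, in $out, the points lying beyond 1.5 * IQR from the hinges.
outliers     <- boxplot.stats(mortgage)$out
n_outliers   <- length(outliers)
pct_outliers <- 100 * n_outliers / length(mortgage)

cat("Outliers:", n_outliers, sprintf("(%.1f%% of records)\n", pct_outliers))
```

On the real data, the same call on the mortgage column of mythera1 would give the count to compare against the 291 quoted above.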
Bivariate Analysis
We will analyze how the independent variables relate to the dependent variable
(Personal Loan).
There is a moderate positive correlation between credit card spending (CCAvg) and Income.
There is a high correlation between Age and Experience, which is expected.
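Correlations like these can be read off a correlation matrix of the numeric variables. The sketch below uses synthetic data built so that Experience tracks Age and CCAvg loosely tracks Income; the column names and generating formulas are assumptions for illustration, not the report's data.

```r
# Synthetic numeric columns (illustrative only).
set.seed(1)
n          <- 500
age        <- round(runif(n, 23, 67))
experience <- pmax(age - 25 + round(rnorm(n, 0, 1)), 0)  # closely tied to age
income     <- round(rlnorm(n, meanlog = 4, sdlog = 0.6))
cc_avg     <- round(0.02 * income + rexp(n, rate = 1), 1)  # loosely tied to income

num_vars <- data.frame(Age = age, Experience = experience,
                       Income = income, CCAvg = cc_avg)

# Pearson correlation matrix, rounded for readability.
corr <- round(cor(num_vars), 2)
print(corr)
```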
Model Building
A CART model is built first. The dataset has already been cleaned: the ID column is
removed, missing values are imputed, and so on.
Train and test datasets are created with a 70%/30% split.
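A 70/30 split can be sketched in base R as follows. The data frame here is a placeholder with the same row count and class balance as the case description (5000 rows, 480 acceptances); the seed and the use of sample() are assumptions, since the report does not show its split code.

```r
# Placeholder data frame: 5000 rows, 480 of them with Personal.Loan = 1.
df <- data.frame(
  id            = 1:5000,
  Personal.Loan = factor(rep(c(0, 1), times = c(4520, 480)))
)

set.seed(123)  # arbitrary seed, for reproducibility only
train_idx <- sample(seq_len(nrow(df)), size = round(0.7 * nrow(df)))

train <- df[train_idx, ]   # 70% of rows
test  <- df[-train_idx, ]  # remaining 30%

print(c(train = nrow(train), test = nrow(test)))
print(prop.table(table(train$Personal.Loan)))  # class balance in the train set
```

Because only about 9.6% of customers accepted the loan, it is worth checking that the class proportions in both parts stay close to the overall rate.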
The initial tree output, with CP, relative error, and cross-validated error, is as below:
Best practice is to prefer a small tree, pruned at the CP value with the lowest cross-validated error.
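The pruning step can be sketched with rpart as below. The example uses the kyphosis dataset that ships with rpart rather than the bank data, and the control settings (cp = 0, minsplit = 10) are assumptions chosen only to grow a tree large enough to prune.

```r
library(rpart)

set.seed(7)  # cross-validation inside rpart is random
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0, minsplit = 10))

# cptable has one row per candidate tree size, including the
# cross-validated error ("xerror").
cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error.
pruned <- prune(fit, cp = best_cp)
printcp(pruned)
```

For the bank data, the formula would instead model Personal.Loan on the remaining columns of the train dataset.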
Confusion Matrix and the concordance ratio of the train data set:
The pruned CART model is applied to the test dataset, and the predicted class and scores are
added to it.
AUC = 97.85%
GINI coefficient = 2 × AUC − 1 = 95.71%
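To make the relationship between the two measures concrete, the sketch below computes AUC from predicted scores with the rank (Mann-Whitney) formula and derives the Gini coefficient from it. The helper auc_rank and the toy scores are hypothetical; the report's own figures presumably come from an ROC package applied to the model scores.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case gets a higher score than a randomly chosen negative case.
# 'labels' must be coded 1 (positive) / 0 (negative).
auc_rank <- function(scores, labels) {
  r     <- rank(scores)  # average ranks, so ties count as half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example (illustrative values only).
labels <- c(0, 0, 0, 0, 1, 0, 1, 1)
scores <- c(0.10, 0.20, 0.30, 0.40, 0.35, 0.60, 0.70, 0.90)

auc  <- auc_rank(scores, labels)
gini <- 2 * auc - 1

cat(sprintf("AUC = %.4f, Gini = %.4f\n", auc, gini))
```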
Random Forest
The predicted class and score are added to the train dataset, and the output measures are as
below:
The RF model is applied to the test dataset, with the results as below:
Variable importance for the model is given below. It shows that two variables
(Income and Education) stand out, with high MeanDecreaseGini and MeanDecreaseAccuracy scores.
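The two importance measures come from the fitted forest itself. A minimal sketch with the randomForest package is shown below; it uses the built-in iris data, and ntree = 200 and the seed are arbitrary assumptions rather than the report's tuned settings.

```r
library(randomForest)

set.seed(99)
# importance = TRUE is needed to get the permutation-based
# MeanDecreaseAccuracy in addition to MeanDecreaseGini.
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

imp <- importance(rf)

# Rank predictors by MeanDecreaseGini, highest first.
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE),
              c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
print(ranked)
```

Calling varImpPlot(rf) draws the same two measures as dot charts.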
The same importance plot can be produced with another function, as below:
The ROC curve is plotted for this RF model, and the AUC and GINI coefficient are calculated.
AUC = 99.65%
GINI Coefficient = 99.31%
Model Comparison
We created prediction models using a CART decision tree and a tuned Random Forest.
Both models were trained on the train dataset and then applied to the test
dataset.
The predicted scores and classes were added to the test dataset to measure the model
performance parameters.
Parameters          CART model    RF model
Confusion Matrix
Total Accuracy      98.44%        98.53%
Random Accuracy     90.75%        92.09%
Sensitivity         98.93%        98.67%
Specificity         93.57%        97.30%
Balanced Accuracy   96.25%        97.98%
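The accuracy, sensitivity, specificity, and balanced accuracy figures all derive from the confusion matrix. The sketch below shows the arithmetic on a toy set of predictions; the vectors are illustrative, and treating class 1 as the positive class is an assumption, since the report does not state which class it used.

```r
# Toy actual and predicted classes (illustrative only).
actual    <- factor(c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0), levels = c(0, 1))

cm <- table(Predicted = predicted, Actual = actual)

# Cells of the confusion matrix, with class "1" taken as positive.
tp <- cm["1", "1"]; tn <- cm["0", "0"]
fp <- cm["1", "0"]; fn <- cm["0", "1"]

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)  # true positive rate
specificity <- tn / (tn + fp)  # true negative rate
balanced    <- (sensitivity + specificity) / 2

print(cm)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 4))
```

Balanced accuracy is the mean of sensitivity and specificity, which is consistent with the table: (98.93% + 93.57%) / 2 = 96.25% for CART.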
The values show that the tuned RF model performs better than the CART model overall: it has
higher total accuracy, specificity, and balanced accuracy, while CART is only marginally
ahead on sensitivity.
So the Random Forest model can be used to classify the potential loan
customers.
The RF model is applied to the whole dataset, and the parameters are given below:
AUC: 99.94%