Thera Bank

Loan Purchase Modelling
R Venkataraman
15th April 2020

PGP BABI
Group 5
Index
Project Description & Objective
Project Report
Reference
Data Mining – Thera Bank
Project Description
Thera Bank - Loan Purchase Modeling
This case is about a bank (Thera Bank) which has a growing customer base. The majority of these
customers are liability customers (depositors) with varying size of deposits. The number of
customers who are also borrowers (asset customers) is quite small, and the bank is interested
in expanding this base rapidly to bring in more loan business and in the process, earn more
through the interest on loans. In particular, the management wants to explore ways of
converting its liability customers to personal loan customers (while retaining them as
depositors). A campaign that the bank ran last year for liability customers showed a healthy
conversion rate of over 9%. This has encouraged the retail marketing department to
devise campaigns with better target marketing to increase the success ratio with a minimal
budget. The department wants to build a model that will help them identify the potential
customers who have a higher probability of purchasing the loan. This will increase the success
ratio while at the same time reducing the cost of the campaign. The dataset has data on 5000
customers. The data include customer demographic information (age, income, etc.), the
customer's relationship with the bank (mortgage, securities account, etc.), and the customer
response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.
You are brought in as a consultant and your job is to build the best model which can classify the
right customers who have a higher probability of purchasing the loan.
Project Objective
• EDA - Basic data summary, Univariate, Bivariate analysis, graphs
• Applying CART <plot the tree>
• Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>
• Applying Random Forests <plot the tree>
• Interpret the RF model output <with remarks, making it meaningful for everybody>
• Confusion matrix interpretation
• Interpretation of other Model Performance Measures <AUC, ROC>
• Remarks on Model validation exercise <Which model performed the best>
Project Report
EDA - Basic data summary, Univariate, Bivariate analysis, graphs
library(dplyr)   # provides glimpse()
mythera <- read.csv("Thera Bank_data.csv", header = TRUE)
#### Basic sanity checks and conversions
str(mythera)
glimpse(mythera)
summary(mythera)
With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data is as follows:
The dataset has 14 columns: the dependent variable (‘Personal Loan’) and 13 others,
including the customer ID. Its 5000 rows can be split into train and test datasets for
model building.
Data Description:
Variable             Description
ID                   Customer ID
Age                  Customer's age in years
Experience           Years of professional experience
Income               Annual income of the customer ($000)
ZIPCode              Home address ZIP code
Family               Family size of the customer
CCAvg                Avg. spending on credit cards per month ($000)
Education            Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage             Value of house mortgage, if any ($000)
Personal Loan        Did this customer accept the personal loan offered in the last campaign?
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank?
Online               Does the customer use internet banking facilities?
CreditCard           Does the customer use a credit card issued by the bank?
Initial summary of the data
Data structure just after loading
Missing values were checked, with the results below:
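The report does not show the code used for the missing-value check. A minimal sketch of one common way to do it in base R is given here; the toy data frame and its values are assumptions made for illustration, with column names borrowed from the report's code.

```r
# Toy stand-in for the bank data (values are made up for illustration).
toy <- data.frame(
  Age            = c(25, 45, 39, 35),
  Family.members = c(4, NA, 1, NA),
  Income         = c(49, 34, 11, 100)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Row numbers that contain at least one missing value
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```

Run against the real dataset, the same two calls would show which columns need imputation and how many rows are affected.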
Data correction
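The missing values in Family.members are imputed using mice (predictive mean matching) and a new dataset is created:
library(mice)
impute1 <- mice(data = mythera, m = 5, method = "pmm", maxit = 50, seed = 500)
impute1$imp$Family.members          # inspect the imputed values
mythera1 <- complete(impute1, 2)    # use the second of the five imputed datasets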
No missing values remain after imputation.
Experience (in years) also contains negative values, which are invalid because experience can
only be zero or more. The negative values are replaced with zero:
mythera1$Experience..in.years.[mythera1$Experience..in.years. < 0] <- 0
The categorical variables, stored as numeric, are converted to factors:
mythera1$Education <- as.factor(mythera1$Education)
mythera1$Family.members <- as.factor(mythera1$Family.members)
mythera1$Personal.Loan <- as.factor(mythera1$Personal.Loan)
mythera1$CD.Account <- as.factor(mythera1$CD.Account)
mythera1$Online <- as.factor(mythera1$Online)
mythera1$CreditCard <- as.factor(mythera1$CreditCard)
mythera1$Securities.Account <- as.factor(mythera1$Securities.Account)
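The two corrections above (clamping negative experience to zero and converting categorical columns to factors) can be tried out on a small example. The sketch below uses a made-up data frame and is an alternative, more compact way of writing the same steps, not the code used in the report.

```r
# Toy data (values are assumptions); column names follow the report's code.
toy <- data.frame(
  Experience..in.years. = c(-1, 0, 12, -3, 20),
  Education             = c(1, 2, 3, 1, 2),
  Personal.Loan         = c(0, 0, 1, 0, 1)
)

# Clamp negative experience to zero; pmax() compares element-wise with 0.
toy$Experience..in.years. <- pmax(toy$Experience..in.years., 0)

# Convert all categorical columns to factors in one step.
factor_cols <- c("Education", "Personal.Loan")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)
```

Listing the factor columns once in a vector avoids repeating (or accidentally duplicating) a conversion line per variable.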
Final summary of the dataset
Univariate Analysis
Factor Variables
42% of the customers are undergraduates, 28% are graduates and the remaining 30% are advanced/professional.
29% of the customers have a family size of 1, 26% a family size of 2, 20% a family size of 3 and 24% a family size of 4.
94% of the customers do not have a CD account with the bank.
71% of the customers do not use a credit card issued by the bank.
60% of the customers use internet banking facilities.
90% of the customers do not have a securities account with the bank.
90% of the customers did not accept the personal loan offered in the last campaign.
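Percentages like those above can be obtained from a frequency table. The following is a minimal sketch on a made-up Education vector (the values are assumptions, so the percentages differ from the report's).

```r
# Hypothetical education codes for ten customers.
education <- factor(c(1, 1, 1, 2, 3, 3, 1, 2, 1, 3),
                    levels = 1:3,
                    labels = c("Undergrad", "Graduate", "Advanced/Professional"))

# Share of customers in each level, as a percentage.
edu_pct <- round(100 * prop.table(table(education)), 1)
print(edu_pct)
```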
Numeric Variables
Age: mean = 45, std. dev. = 11.5
Credit card spending: mean = 1.94, std. dev. = 1.75
House mortgage: mean = 56.5, std. dev. = 101.71; plenty of outliers, with kurtosis = 4.74
Annual income: mean = 73.77, std. dev. = 46.03
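The report does not say which function produced the kurtosis figure. As a sketch, the summary measures can be computed in base R as below, assuming the moment-based excess kurtosis (fourth central moment over squared variance, minus 3); the mortgage values are made up.

```r
# Hypothetical mortgage values ($000): mostly zero with a long right tail.
mortgage <- c(0, 0, 0, 0, 0, 0, 90, 120, 0, 635)

m <- mean(mortgage)
s <- sd(mortgage)

# Moment-based excess kurtosis; other packages may use a slightly
# different (bias-corrected) definition.
central <- mortgage - m
excess_kurtosis <- mean(central^4) / mean(central^2)^2 - 3

cat("mean =", round(m, 2), " sd =", round(s, 2),
    " excess kurtosis =", round(excess_kurtosis, 2), "\n")
```

A large positive value indicates heavy tails, which is consistent with a variable that is zero for most customers and very large for a few.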
The mortgage variable has many outliers which might affect the model: 291 records, about 6%
of the dataset. Removing them would distort the personal loan characteristic considerably,
because 31.96% of the outlier customers accepted the loan against 9.60% overall (see below).
They are also not data entry or measurement errors but very much part of the population being
addressed. The outliers are therefore retained.
            Count   %age    Pers.loan   %age
Original    5000    100%    480         9.60%
Outliers    291     6%      93          31.96%
Treated     4709    94%     387         8.22%
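The report does not state how the outliers were identified. The sketch below assumes the usual boxplot rule (values above Q3 + 1.5 x IQR) and shows, on made-up data, how the loan acceptance rate can be compared before and after removing them.

```r
# Toy data (values are assumptions, not taken from the bank dataset).
toy <- data.frame(
  Mortgage      = c(0, 0, 0, 0, 80, 100, 120, 150, 400, 600),
  Personal.Loan = c(0, 0, 0, 0, 0,  0,   1,   0,   1,   0)
)

# Upper fence of the 1.5 x IQR rule.
q <- quantile(toy$Mortgage, c(0.25, 0.75))
upper_fence <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
is_outlier <- toy$Mortgage > upper_fence

# Loan acceptance rate overall, among outliers, and after dropping them.
rates <- c(
  original = mean(toy$Personal.Loan),
  outliers = mean(toy$Personal.Loan[is_outlier]),
  treated  = mean(toy$Personal.Loan[!is_outlier])
)
print(sum(is_outlier))
print(round(100 * rates, 2))
```

If the outlier group has a much higher acceptance rate than the rest, as in the table above, dropping it removes a disproportionate share of the positive cases.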
Bivariate Analysis
We will analyze how the independent variables relate to the dependent variable
(Personal Loan).
Family Members Vs Personal Loan
Family size has little effect on availing the personal loan.
Education Vs Personal Loan
As the education level increases, there is an increasing trend in availing the personal loan.
Credit card Vs Personal Loan
Having a credit card issued by the bank has little effect on availing the personal loan.
Online facility Vs Personal Loan
Using the bank's online facilities has no effect on availing the personal loan.
Securities account Vs Personal Loan
Again, having a securities account with the bank has no effect on availing the personal loan.
CD Accounts Vs Personal Loan
Close to 50% of the customers who have a CD account take the personal loan.
Age Vs Personal Loan
An age bucket was created in the dataset; age does not affect the percentage of customers
availing the personal loan.
Income Vs Personal Loan
As income increases, more customers avail the personal loan, as is evident from the
grouped graph.
Education Vs Personal Loan
Customers with higher education are more likely to have availed the personal loan.
Correlation in numeric variables
There is a moderate positive correlation between credit card spending and income, and a high
correlation between age and experience, which is expected.
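A correlation matrix of this kind can be produced with cor(). The sketch below runs on simulated data: the distributions and the way Experience and CCAvg are tied to Age and Income are assumptions chosen only to mimic the pattern described, not properties of the bank data.

```r
set.seed(1)
n <- 200
Age        <- round(runif(n, 23, 67))
Experience <- pmax(Age - 25 + round(rnorm(n, 0, 1)), 0)   # tracks age closely
Income     <- round(rexp(n, 1 / 70)) + 8
CCAvg      <- round(0.02 * Income + rexp(n, 1), 1)        # loosely tied to income

num_vars <- data.frame(Age, Experience, Income, CCAvg)

# Pairwise Pearson correlations between the numeric variables.
cor_mat <- round(cor(num_vars), 2)
print(cor_mat)
```

A pair as strongly correlated as Age and Experience carries largely the same information, which is worth keeping in mind when reading variable importance later.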
Model Building
A CART model is built first. The dataset has already been cleaned: the ID column is removed,
missing values are imputed, and so on.
Train and test datasets are created with a 70%/30% split.
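The splitting code is not shown in the report. One simple way to create a 70/30 split in base R is sketched below; the toy data, the seed and the use of sample() are assumptions.

```r
set.seed(123)   # fixed seed so the split is reproducible
toy <- data.frame(id = 1:100, Personal.Loan = rbinom(100, 1, 0.1))

# Draw 70% of the row numbers at random for training; the rest form the test set.
train_idx <- sample(nrow(toy), size = round(0.7 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

cat("train rows:", nrow(train), " test rows:", nrow(test), "\n")
```

With a target as imbalanced as this one (about 9.6% positives), it is worth checking that the loan rate is similar in both parts after the split.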
The initial tree is grown with the control parameters minsplit = 30, minbucket = 3 and cp = 0.
The initial tree output, with CP, relative error and cross-validated error, is below:
This tree has to be pruned to avoid overfitting.
Best practice is to choose a small tree with the lowest cross-validated error.
The cp value is therefore chosen so that the cross-validated error rate (xerror) is at its
minimum.
Output of cptable as below:
This is also validated by plotting the CP values using plotcp.
The tree is pruned at the cp value with the minimum cross-validated error.
myptree <- prune(mytree, cp = mytree$cptable[which.min(mytree$cptable[, "xerror"]), "CP"])
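The grow-then-prune procedure can be tried end to end on simulated data. In the sketch below, the data and the rule that generates Personal.Loan are assumptions made purely for illustration; the control values are those quoted in the report, and rpart is assumed to be the package in use since prune, cptable and plotcp all belong to it.

```r
library(rpart)

set.seed(42)
n <- 600
toy <- data.frame(
  Income    = round(runif(n, 8, 220)),
  Education = factor(sample(1:3, n, replace = TRUE)),
  CCAvg     = round(runif(n, 0, 10), 1)
)
# Hypothetical rule plus noise, only to give the tree something to learn.
p <- ifelse(toy$Income > 115 & toy$Education != "1", 0.9, 0.03)
toy$Personal.Loan <- factor(rbinom(n, 1, p))

# Grow a deliberately large tree (cp = 0 disables the complexity penalty).
mytree <- rpart(Personal.Loan ~ ., data = toy, method = "class",
                control = rpart.control(minsplit = 30, minbucket = 3, cp = 0))
printcp(mytree)

# Prune at the cp with the lowest cross-validated error (xerror).
best_cp <- mytree$cptable[which.min(mytree$cptable[, "xerror"]), "CP"]
myptree <- prune(mytree, cp = best_cp)
```

Storing the chosen cp in its own variable makes the prune() call easier to read and avoids bracket mistakes in the nested expression.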
The predicted class and probability are added to the train dataset.
Confusion matrix and concordance ratio for the train dataset:
The pruned CART model is then applied to the test dataset, and the predicted classes and
scores are added to it.
The ROC curve is plotted using the train dataset and the area under it (AUC) is computed:
AUC = 97.85%
Gini coefficient = 2 × AUC - 1 = 95.71%
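The report does not show how the AUC was computed. As a sketch, AUC can be obtained without any extra package from the rank-sum (Mann-Whitney) statistic, and the Gini coefficient then follows from the formula above. The labels and scores below are made up.

```r
# Hypothetical true labels and predicted probabilities of taking the loan.
actual <- c(0, 0, 1, 0, 1, 0, 1, 0, 0, 1)
score  <- c(0.05, 0.10, 0.80, 0.30, 0.65, 0.20, 0.25, 0.15, 0.40, 0.90)

# AUC from the Mann-Whitney rank-sum statistic; rank() averages ties.
r     <- rank(score)
n_pos <- sum(actual == 1)
n_neg <- sum(actual == 0)
auc   <- (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Gini coefficient as defined in the report.
gini <- 2 * auc - 1

cat("AUC =", round(auc, 4), " Gini =", round(gini, 4), "\n")
```

Here 22 of the 24 (loan, no-loan) pairs are ranked correctly, so AUC = 22/24.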
Random Forest
Next, a Random Forest (RF) model is built.
The original dataset is again split into train and test sets, this time using a different technique.
The RF model is first run on the train dataset with ntree = 300 and without specifying mtry,
which is tuned in the next step.
From this plot, we can infer that the OOB (out-of-bag) error stays flat after about 50 trees,
so 100 trees is a safe choice.
The model is tuned with tuneRF, with mtry = 3 as the starting point, step factor = 1.5 and
improve = 0.01. The purpose of this exercise is to find the value of mtry at which the OOB
error is minimum.
The tuning exercise gives mtry = 6, and the RF model is rerun with this value and
100 trees.
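The sequence described above (a first forest, tuning mtry, then a refit) can be sketched with the randomForest package. The simulated data, the seed and ntreeTry = 100 are assumptions; with only five toy predictors the tuned mtry will not match the report's value of 6.

```r
library(randomForest)

set.seed(1000)
n <- 500
toy <- data.frame(
  Income    = round(runif(n, 8, 220)),
  Education = factor(sample(1:3, n, replace = TRUE)),
  CCAvg     = round(runif(n, 0, 10), 1),
  Age       = round(runif(n, 23, 67)),
  Mortgage  = round(rexp(n, 1 / 60))
)
# Hypothetical rule plus noise, only so that the forest has a signal to find.
p <- ifelse(toy$Income > 115 & toy$Education != "1", 0.9, 0.03)
toy$Personal.Loan <- factor(rbinom(n, 1, p))
x <- toy[, setdiff(names(toy), "Personal.Loan")]

# First pass: 300 trees, default mtry. err.rate holds the OOB error per tree count.
rf0 <- randomForest(Personal.Loan ~ ., data = toy, ntree = 300)
oob <- rf0$err.rate[, "OOB"]
print(round(oob[c(50, 100, 300)], 4))

# Search for the mtry with the lowest OOB error, with the settings quoted above.
tuned <- tuneRF(x, toy$Personal.Loan, mtryStart = 3, stepFactor = 1.5,
                improve = 0.01, ntreeTry = 100, trace = FALSE, plot = FALSE)
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

# Refit with the chosen mtry and 100 trees, keeping variable importance.
rf <- randomForest(Personal.Loan ~ ., data = toy, ntree = 100,
                   mtry = best_mtry, importance = TRUE)
print(importance(rf))
```

Comparing the OOB error at 50, 100 and 300 trees is a numeric counterpart to reading the flat region off the error plot.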
The output of the tuned model:
The predicted class and score are added to the train dataset, and the output measures are
below:
The RF model is applied to the test dataset, with the results below:
The margin of prediction of the tuned RF model:
The variable importance is given below. Two variables, Income and Education, clearly stand
out with high MeanDecreaseGini and MeanDecreaseAccuracy scores.
The same importance plot can be produced with another function, as below:
The ROC curve is plotted for this RF model and the AUC and Gini coefficient are calculated.
AUC = 99.65%
Gini coefficient = 99.31%
Model Comparison
Prediction models were built using a CART decision tree and a tuned Random Forest.
Each model was trained on the train dataset and then applied to the test dataset.
The predicted scores and classes were added to the test dataset to compute the model
performance measures.
Key model performance measures:
• Confusion matrix parameters
  o These are ratios built from the counts of TP, TN, FP and FN: accuracy, sensitivity,
    specificity, and random and balanced accuracy.
• Concordance ratio
  o This measures the percentage of pairs in which the true event's probability score is
    greater than the score of the true non-event. The higher the ratio, the better the
    quality of the model.
• AUC
  o This represents the probability that a random positive is ranked above a random
    negative. It measures how well the predictions are ranked rather than their
    absolute values.
• GINI coefficient
  o This is the ratio of the area between the model's ROC curve and the random-model
    line to the same area for a perfect model (equal to 2 × AUC - 1). It shows how far
    the model improves on a random model.
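The confusion-matrix measures and the concordance ratio can be computed directly from predicted scores. The sketch below uses made-up labels and scores, a 0.5 cut-off, and treats "1" (loan accepted) as the positive class; all three are assumptions, and the report does not state which class its own figures treat as positive.

```r
# Hypothetical true labels and predicted probabilities.
actual <- factor(c(0, 0, 1, 0, 1, 0, 1, 0, 0, 1), levels = c(0, 1))
score  <- c(0.05, 0.10, 0.80, 0.30, 0.65, 0.20, 0.25, 0.15, 0.40, 0.90)
pred   <- factor(as.integer(score > 0.5), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are actual classes.
cm <- table(Predicted = pred, Actual = actual)
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]

accuracy    <- (TP + TN) / sum(cm)
sensitivity <- TP / (TP + FN)   # share of actual loan takers that are caught
specificity <- TN / (TN + FP)   # share of non-takers correctly rejected
balanced    <- (sensitivity + specificity) / 2

# Concordance: share of (event, non-event) pairs where the event scores higher.
pairs <- outer(score[actual == "1"], score[actual == "0"], ">")
concordance <- mean(pairs)

print(cm)
cat("accuracy =", accuracy, " sensitivity =", sensitivity,
    " specificity =", specificity, " balanced =", balanced,
    " concordance =", round(concordance, 4), "\n")
```

When no scores are tied, the concordance ratio equals the AUC, which is why the two measures sit so close together in the comparison table.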
The measures for both models are given below:
Parameters             CART model   RF model
Confusion Matrix
  Total Accuracy       98.44%       98.53%
  Random Accuracy      90.75%       92.09%
  Sensitivity          98.93%       98.67%
  Specificity          93.57%       97.30%
  Balanced Accuracy    96.25%       97.98%
Concordance ratio      96.01%       99.63%
A.U.C                  97.85%       99.65%
Gini Coefficient       95.71%       99.31%
The tuned RF model scores higher than the CART model on every measure except sensitivity
(98.67% against 98.93%).
The Random Forest model can therefore be used to classify the potential loan customers.
The RF model is applied to the whole dataset and the measures are given below:
AUC: 99.94%
Gini coefficient: 99.88%
Reference
Gordon S. Linoff and Michael J. A. Berry, Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management
Great Learning course materials