
Thera Bank

Loan Purchase Modelling

R Venkataraman

15th April 2020



PGP BABI

Group 5
Index

Project Description & Objective

Project Report

Reference
Data Mining – Thera Bank

Project Description

Thera Bank - Loan Purchase Modeling

This case is about a bank (Thera Bank) which has a growing customer base. The majority of these
customers are liability customers (depositors) with varying sizes of deposits. The number of
customers who are also borrowers (asset customers) is quite small, and the bank is interested
in expanding this base rapidly to bring in more loan business and, in the process, earn more
through the interest on loans. In particular, the management wants to explore ways of
converting its liability customers to personal loan customers (while retaining them as
depositors). A campaign that the bank ran last year for liability customers showed a healthy
conversion rate of over 9%. This has encouraged the retail marketing department to
devise campaigns with better target marketing to increase the success ratio with a minimal
budget. The department wants to build a model that will help them identify the potential
customers who have a higher probability of purchasing the loan. This will increase the success
ratio while at the same time reducing the cost of the campaign. The dataset has data on 5000
customers. The data include customer demographic information (age, income, etc.), the
customer's relationship with the bank (mortgage, securities account, etc.), and the customer's
response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign.

You are brought in as a consultant and your job is to build the best model which can classify the
right customers who have a higher probability of purchasing the loan.

Project Objective

• EDA - Basic data summary, Univariate, Bivariate analysis, graphs
• Applying CART <plot the tree>
• Interpret the CART model output <pruning, remarks on pruning, plot the pruned tree>
• Applying Random Forests <plot the tree>
• Interpret the RF model output <with remarks, making it meaningful for everybody>
• Confusion matrix interpretation
• Interpretation of other Model Performance Measures <AUC, ROC>
• Remarks on Model validation exercise <Which model performed the best>


Project Report

EDA - Basic data summary, Univariate, Bivariate analysis, graphs

library(dplyr)  # provides glimpse()

mythera=read.csv("Thera Bank_data.csv",header=TRUE)

#### Basic sanity checks and conversions

str(mythera)
glimpse(mythera)
summary(mythera)

With the necessary R libraries loaded and the working directory set, the dataset is
loaded into R. An initial glimpse of the data follows:

We have 13 independent variables and 1 dependent variable (‘Personal Loan’) in the given
dataset. We have 5000 rows, which can be split into train and test datasets for model
building.

Data Description:

Variable             Description
ID                   Customer ID
Age                  Customer's age in years
Experience           Years of professional experience
Income               Annual income of the customer ($000)
ZIPCode              Home address ZIP code
Family               Family size of the customer
CCAvg                Avg. spending on credit cards per month ($000)
Education            Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage             Value of house mortgage, if any ($000)
Personal Loan        Did this customer accept the personal loan offered in the last campaign?
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank?
Online               Does the customer use internet banking facilities?
CreditCard           Does the customer use a credit card issued by the bank?


Initial summary of the data

Data structure just after loading

Missing values checked and results as below:


Data correction

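The missing values in Family.members are imputed with the mice package (predictive mean
matching), and a new dataset is created from the second of the five imputations.

library(mice)
# Five imputed datasets via predictive mean matching
impute1=mice(data=mythera,m=5,method="pmm",maxit=50,seed = 500)
impute1$imp$Family.members
# Keep the second imputed dataset
mythera1=complete(impute1,2)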

No missing values remain.

There are also negative values in Experience (in years), which can only be zero or greater.

The negative values are replaced with zero:

mythera1$Experience..in.years.[mythera1$Experience..in.years.<0]=0

The categorical variables are converted from numeric to factor:

mythera1$Education=as.factor(mythera1$Education)
mythera1$Family.members=as.factor(mythera1$Family.members)
mythera1$Personal.Loan=as.factor(mythera1$Personal.Loan)
mythera1$CD.Account=as.factor(mythera1$CD.Account)
mythera1$Online=as.factor(mythera1$Online)
mythera1$CreditCard=as.factor(mythera1$CreditCard)
mythera1$Securities.Account=as.factor(mythera1$Securities.Account)

Final summary of the dataset


Univariate Analysis

Factor Variables

42% of the customers are undergraduates, 28% are graduates, and the remaining 30%
are advanced/professional.

29% of the customers have a family size of 1, 26% a family size of 2, 20% a family size
of 3, and 24% a family size of 4.

94% of the customers do not have a CD account with the bank.


71% of the customers do not use a credit card issued by the bank.

60% of the customers use internet banking facilities.

90% of the customers do not have a securities account with the bank.

90% of the customers did not accept the personal loan offered in the last campaign.


Numeric Variables

Mean age = 45 years, with standard deviation = 11.5

Mean credit card spending = 1.94 ($000 per month), with standard deviation = 1.75

Mean house mortgage = 56.5 ($000), with standard deviation = 101.71. There are plenty of
outliers, with kurtosis = 4.74


Mean annual income = 73.77 ($000), with standard deviation = 46.03

The mortgage variable has many outliers, which might affect the model: 291 records, about 6%
of the dataset. Removing them just for the sake of removal would noticeably change the
personal loan distribution. 31.96% of the outlier records accepted the loan, so dropping them
lowers the overall acceptance rate from 9.60% to 8.22% (see the table below). They are also
not data entry or measurement errors, but very much part of the population we are addressing.
So I am not removing the outliers.

            Count   % of total   Personal loan   % with loan
Original     5000      100%           480            9.60%
Outliers      291        6%            93           31.96%
Treated      4709       94%           387            8.22%
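
A comparison like the one in the table above can be sketched in base R as follows. This is only an illustration: the data are simulated, the variable names are hypothetical, and the boxplot rule (values above Q3 + 1.5 × IQR) is an assumed outlier definition, since the report does not state which rule was used.

```r
# Simulated stand-in for the bank data (NOT the report's figures):
# mortgage in $000 and a 0/1 personal loan flag.
set.seed(1)
n <- 1000
mortgage <- ifelse(runif(n) < 0.7, 0, rexp(n, rate = 1 / 180))
loan <- rbinom(n, 1, ifelse(mortgage > 300, 0.3, 0.08))

# Assumed outlier rule: values above Q3 + 1.5 * IQR (boxplot rule).
q <- quantile(mortgage, c(0.25, 0.75))
upper <- q[[2]] + 1.5 * (q[[2]] - q[[1]])
is_outlier <- mortgage > upper

# Counts and loan acceptance rates with and without the outliers.
outlier_summary <- data.frame(
  group = c("Original", "Outliers", "Treated"),
  count = c(n, sum(is_outlier), sum(!is_outlier)),
  loan_count = c(sum(loan), sum(loan[is_outlier]), sum(loan[!is_outlier]))
)
outlier_summary$loan_pct <-
  round(100 * outlier_summary$loan_count / outlier_summary$count, 2)
print(outlier_summary)
```

If the acceptance rate among the flagged records differs sharply from the rest, as it does in the report, dropping them changes the target distribution rather than just cleaning the data.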

Bivariate Analysis

We will analyze how the independent variables relate to the dependent variable
(Personal Loan).

Family Members Vs Personal Loan

Family size has little effect on whether a customer takes the personal loan.


Education Vs Personal Loan

As the education level increases, the share of customers taking the personal loan increases.

Credit card Vs Personal Loan

Having a credit card from the bank has little effect on taking the personal loan.

Online facility Vs Personal Loan

Using the bank's online facilities has no effect on taking the personal loan.


Securities account Vs Personal Loan

Again, having a securities account with the bank has no effect on taking the personal loan.

CD Accounts Vs Personal Loan

Close to 50% of the customers with a CD account took the personal loan.

Age Vs Personal Loan

An age bucket variable is created in the dataset; age does not affect the percentage of
customers taking the personal loan.


Income Vs Personal Loan

As income increases, more customers take the personal loan, as is evident from the
grouped graph.

Education Vs Personal Loan

Customers with higher education levels are more likely to have taken the personal loan.

Correlation in numeric variables


There is a moderate positive correlation between credit card spending and income. There is a
high correlation between age and experience, as expected.

Model Building

A CART model is built first. The dataset has already been cleaned: the ID column is
removed, missing values are imputed, and so on.

Train and test datasets are created with a 70%/30% split.
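
A 70/30 split of this kind can be sketched in base R as below. This is a minimal sketch, not the report's own code: the built-in iris data stand in for the cleaned bank dataset, the seed is arbitrary, and simple random sampling without stratification is an assumption.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# iris stands in for the cleaned bank dataset (assumption for illustration).
dat <- iris
n <- nrow(dat)

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

c(train = nrow(train), test = nrow(test))
```

With a rare target class (9.6% here), a stratified split that preserves the class ratio in both parts may be preferable.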

The initial tree is created with control settings minsplit=30, minbucket=3 and cp=0.
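
A fit with these control settings might look like the sketch below. It assumes the rpart package (which matches the cptable, plotcp and prune calls used in this report) and uses rpart's bundled kyphosis data in place of the bank's train dataset; the formula and the seed are illustrative only.

```r
library(rpart)  # recommended package, shipped with standard R installations

# kyphosis is a small example dataset bundled with rpart; it stands in for
# the bank's train dataset here (assumption for illustration).
set.seed(42)  # cross-validated errors depend on the random folds
ctrl <- rpart.control(minsplit = 30, minbucket = 3, cp = 0)
mytree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                method = "class", control = ctrl)

# CP table: one row per candidate subtree, with relative error (rel error)
# and cross-validated error (xerror).
printcp(mytree)
```

Setting cp = 0 grows the tree without a complexity penalty, which is why pruning is needed afterwards.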

Initial tree output with CP, relative error, cross validated error as below:


This tree has to be pruned to avoid overfitting.

Best practice is to keep the tree small by choosing the subtree with the lowest cross-validated
error.

The value of cp is chosen so that the cross-validated error (xerror) is at its minimum.

The output of cptable is as below:

This is also confirmed by plotting the CP values with plotcp.


The tree is pruned at the cp value that gives the minimum cross-validated error:

myptree = prune(mytree, cp = mytree$cptable[which.min(mytree$cptable[, "xerror"]), "CP"])


The predicted classes and probabilities are added to the train dataset.

Confusion matrix and concordance ratio for the train dataset:

The pruned CART model is applied to the test dataset, and the predicted classes and scores
are added to it.


The ROC curve is plotted using the train dataset, and the AUC is computed as below:

AUC = 97.85%
Gini coefficient = 2 × AUC − 1 = 95.71%
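
The relation between AUC and the Gini coefficient can be checked with a few lines of base R. This is a sketch with made-up scores and labels, not the report's data; the helper auc_rank is a hypothetical name, and it uses the rank-based (Mann-Whitney) form of the AUC.

```r
# Rank-based AUC: the share of (event, non-event) pairs in which the event
# receives the higher score (ties count as half through average ranks).
auc_rank <- function(score, label) {
  pos <- label == 1
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(score)
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores and labels, for illustration only.
score <- c(0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1, 0.05)
label <- c(1, 1, 1, 0, 0, 0, 0, 0)

auc  <- auc_rank(score, label)  # 14 of 15 pairs are ordered correctly
gini <- 2 * auc - 1
c(AUC = auc, Gini = gini)
```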


Random Forest

We will now build the Random Forest model.

The original dataset is again split into train and test sets, this time using a different
technique. The RF model is run on the train dataset, initially with ntree=300 and without
specifying mtry, which is tuned in the next step.

From this plot, we can infer that the OOB error stays flat after about 50 trees, so 100 trees
is a safe choice.

The model is tuned with tuneRF, with mtry=3 as the starting point, step factor=1.5 and
improve=0.01. The purpose of this exercise is to find the value of mtry at which the
OOB (out-of-bag) error is at its minimum.


We infer mtry=6 from the tuning exercise, and the RF model is rerun with this value,
this time with number of trees = 100.
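
The fit, tune and refit sequence described above can be sketched as follows. It assumes the randomForest package and uses the built-in iris data in place of the bank's train dataset, so the starting mtry and the tuned value differ from the report's; the block is skipped if the package is not installed.

```r
if (requireNamespace("randomForest", quietly = TRUE)) {
  library(randomForest)
  set.seed(500)

  # iris stands in for the bank's train dataset (assumption for illustration).
  x <- iris[, 1:4]
  y <- iris$Species

  # Initial forest with the default mtry; plot(rf0) would show OOB error
  # against the number of trees.
  rf0 <- randomForest(x = x, y = y, ntree = 300, importance = TRUE)

  # Search around a starting mtry; the result is a matrix with one row per
  # mtry tried and its OOB error.
  tuned <- tuneRF(x, y, mtryStart = 2, ntreeTry = 100, stepFactor = 1.5,
                  improve = 0.01, trace = FALSE, plot = FALSE)
  best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

  # Refit with the tuned mtry and fewer trees.
  rf1 <- randomForest(x = x, y = y, ntree = 100, mtry = best_mtry,
                      importance = TRUE)
  print(rf1)
}
```

Because OOB error is estimated from random bootstrap samples, the selected mtry can change with the seed; repeating the search a few times is a cheap sanity check.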

The output of the tuned model:

The predicted classes and scores are added to the train dataset, and the output measures
are as below:


The RF model is applied to the test dataset, and the results are as below:

The margin of prediction of the tuned RF model:

The variable importance of the model is given below. Two variables, Income and Education,
stand out with high MeanDecreaseGini and MeanDecreaseAccuracy scores.


The same importance measures can also be viewed using another function, as below:

The ROC curve is plotted for this RF model, and the AUC and Gini coefficient are calculated.

AUC = 99.65%
Gini coefficient = 99.31%


Model Comparison

We have created prediction models using the CART decision tree and tuned Random Forest
methods.

Both models were trained on the train dataset and applied to the test dataset.

The predicted scores and classes are added to the test dataset to measure the model
performance parameters.

Key model performance measures:

• Confusion matrix parameters
   o These are ratios built from the counts of TP, TN, FP and FN: accuracy, sensitivity,
     specificity, and random and balanced accuracy.
• Concordance ratio
   o This measures the percentage of pairs in which the probability score of the true event
     is greater than the score of the true non-event. The higher the ratio, the better the
     quality of the model.
• AUC
   o This represents the probability that a random positive is scored higher than (positioned
     to the right of) a random negative. It measures how well the predictions are ranked
     rather than their absolute values.
• Gini coefficient
   o This is the ratio of the area between the model's ROC curve and the random-model line to
     the same area for a perfect model. It indicates how much better the model is than a
     random model.
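
The measures listed above can be computed by hand in base R, as in the sketch below. The labels and scores are made up for illustration, the 0.5 cutoff is an assumption, and class 1 is treated as the event; which class counts as "positive" changes which figure is reported as sensitivity and which as specificity.

```r
# Made-up actual labels and predicted scores, for illustration only.
actual <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
score  <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05)
pred   <- as.integer(score >= 0.5)  # 0.5 cutoff is an assumption

# Confusion matrix cells, with class 1 as the event.
tp <- sum(pred == 1 & actual == 1)
tn <- sum(pred == 0 & actual == 0)
fp <- sum(pred == 1 & actual == 0)
fn <- sum(pred == 0 & actual == 1)

accuracy          <- (tp + tn) / length(actual)
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

# Concordance: share of (event, non-event) pairs where the event scores higher.
pairs <- outer(score[actual == 1], score[actual == 0], ">")
concordance <- mean(pairs)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balanced_accuracy,
        concordance = concordance), 4)
```

When there are no tied scores between events and non-events, the concordance ratio equals the AUC, which is why the two figures track each other closely.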

The performance measures for both models are given below:

Measure                CART model   RF model
Confusion matrix
  Total accuracy         98.44%      98.53%
  Random accuracy        90.75%      92.09%
  Sensitivity            98.93%      98.67%
  Specificity            93.57%      97.30%
  Balanced accuracy      96.25%      97.98%
Concordance ratio        96.01%      99.63%
AUC                      97.85%      99.65%
Gini coefficient         95.71%      99.31%


On almost every measure the tuned RF model outperforms the CART model, most clearly on
specificity, concordance ratio, AUC and Gini coefficient. CART is marginally ahead only on
sensitivity (98.93% vs 98.67%).

So the Random Forest model can be used to classify the potential loan
customers.

The RF model is applied to the whole dataset, and the parameters are given below:

AUC: 99.94%

Gini coefficient: 99.88%


Reference

Gordon S. Linoff and Michael J. A. Berry, Data Mining Techniques: For Marketing, Sales,
and Customer Relationship Management.

Great Learning course materials.
