
Taiwan Credit Defaults
Henry Chang | Avani Sharma
Atindra Bandi | Abraham Khan
Group 14
Situation

Taiwan's economy grew 95% from 1990 to 2000

Banks loosened credit requirements to continue growth

People started borrowing more than they could pay

Introduction Data Exploration Analysis Conclusion


Problem Statement

Decision: Identify high-risk customers based on their credit history

Key Questions:

How to identify potential defaulters?

What are the factors leading to potential default?



Dataset

Predict default on credit card payment next month

Categorical variables
Sex, Education, Marriage, and Payment status for 6 months

Numerical variables
Age, Credit limit, Balance and Payment amounts for 6 months

Data Source: https://archive.ics.uci.edu/ml/datasets/default%20of%20credit%20card%20clients



Demographics
Gender: 60% Female, 40% Male
[Figure: default vs. no default share by gender (defaults make up 20% and 25% of the two groups); density of age, 20 to 80, for non-defaulters and defaulters]



Demographics
Education (% of credit card holders): University 47%, Post Graduate 35%, Highschool 16%, Others 2%
Marital Status (% of credit card holders): Single 53%, Married 45%, Others 2%



Credit Limits by Default Status
[Figure: density of credit limit (TWD, 0 to 750,000) for non-defaulters and defaulters, with two callouts giving the % of customers who defaulted relative to a credit limit of TWD 100,000]



Variable Creation
Payment – Spending Ratio = Σ Payments / Σ Spending

Weighted Payment Score =
(w1 · 1st Month Status) + (w2 · 2nd Month Status) + (w3 · 3rd Month Status)
+ (w4 · 4th Month Status) + (w5 · 5th Month Status) + (w6 · 6th Month Status)
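
As a minimal sketch of the two engineered variables, the snippet below computes both for a single made-up customer. The slides do not give the weights w1..w6 or say which column measures "spending", so the weights, the sample values, and the use of monthly bill amounts as spending are all assumptions for illustration. Status values follow the usual coding of this dataset (negative = paid on time, positive = months of delay).

```python
import math
from typing import Sequence


def payment_spending_ratio(payments: Sequence[float], spending: Sequence[float]) -> float:
    """Sum of payments divided by sum of spending over the same months."""
    total_spending = sum(spending)
    if total_spending == 0:
        # Undefined when nothing was spent; how to impute this is a modeling choice.
        return math.nan
    return sum(payments) / total_spending


def weighted_payment_score(statuses: Sequence[float], weights: Sequence[float]) -> float:
    """w1 * status_1 + ... + w6 * status_6, with month 1 the most recent."""
    if len(statuses) != len(weights):
        raise ValueError("need exactly one weight per monthly status")
    return sum(w * s for w, s in zip(weights, statuses))


# Hypothetical customer: six months of data, most recent month first.
payments = [2000, 1500, 0, 1000, 1200, 800]       # amounts paid (TWD)
spending = [3000, 2500, 2000, 1500, 1500, 1000]   # bill amounts used as a spending proxy
statuses = [2, 1, 0, 0, -1, -1]                   # repayment status per month
weights = [6, 5, 4, 3, 2, 1]                      # hypothetical: recent months count more

ratio = payment_spending_ratio(payments, spending)
score = weighted_payment_score(statuses, weights)
print(f"payment-spending ratio = {ratio:.3f}, weighted payment score = {score}")
```

Weights that decrease with the age of the month are only one possible choice; it reflects the idea that recent delays say more about next month's default than older ones.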



Variable Selection

Lasso logistic regression
• 10-fold cross-validation, repeated 20 times
Select the variables that were retained most often
• Credit limit
• Recent payment amounts
• Recent delayed payments
• Age of customer
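
The bookkeeping behind "retained most often" can be sketched as follows: run 10-fold cross-validation 20 times, apply a selector to each training fold, and report the fraction of fits in which each variable survives. The selector here is a deliberately simple stand-in (a class-mean gap filter), not the lasso logistic fit used in the deck, and the two-column toy data is made up. Any function with the same signature could be plugged in instead.

```python
import random
from collections import Counter
from typing import Callable, Dict, Iterator, List, Sequence

Row = Dict[str, float]
Selector = Callable[[Sequence[Row], Sequence[int]], List[str]]


def repeated_kfold(n: int, k: int, repeats: int, seed: int = 0) -> Iterator[List[int]]:
    """Yield the training indices of every fold of `repeats` shuffled k-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for fold in range(k):
            held_out = set(idx[fold::k])
            yield [i for i in idx if i not in held_out]


def mean_gap_selector(rows: Sequence[Row], labels: Sequence[int],
                      min_gap: float = 0.5) -> List[str]:
    """Stand-in for a lasso fit: keep variables whose class means differ
    by more than `min_gap` overall standard deviations."""
    chosen = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        mean = sum(values) / len(values)
        sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        if pos and neg and sd > 0:
            gap = abs(sum(pos) / len(pos) - sum(neg) / len(neg)) / sd
            if gap > min_gap:
                chosen.append(name)
    return chosen


def selection_frequency(rows: Sequence[Row], labels: Sequence[int], select: Selector,
                        k: int = 10, repeats: int = 20, seed: int = 0) -> Dict[str, float]:
    """Fraction of the k * repeats training folds in which each variable is kept."""
    counts: Counter = Counter()
    n_fits = 0
    for train in repeated_kfold(len(rows), k, repeats, seed):
        counts.update(select([rows[i] for i in train], [labels[i] for i in train]))
        n_fits += 1
    return {name: counts[name] / n_fits for name in rows[0]}


# Toy data (made up): "signal" depends on the label, "noise" does not.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
rows = [{"signal": 2.0 * y + rng.gauss(0, 1), "noise": rng.gauss(0, 1)} for y in labels]

freq = selection_frequency(rows, labels, mean_gap_selector)
print(freq)
```

Variables with a selection frequency close to 1 are stable across resamples, which is the property the slide uses to pick its final list.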



Lasso Logistic Output

• Credit limit
• Recent payment amounts
• Recent delayed payments
• Age of customer

Model Comparison

Models        Logistic Regression   Naïve Bayes   Random Forest
Accuracy      80%                   80%           80%
Sensitivity   15%                   46%           53%
Specificity   98%                   90%           88%
AUC           73%                   73%           76%
Cut-off       0.57                  0.97          0.32

Sensitivity = true positive rate
Specificity = true negative rate
Accuracy = correct prediction rate
AUC = area under ROC curve
Cut-off = threshold for calculating default
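
To make the definitions concrete, here is a small sketch that turns predicted default probabilities into sensitivity, specificity, and accuracy at a given cut-off. The probabilities and labels are made-up toy values, not output from any of the three models, and the function name is hypothetical.

```python
from typing import Dict, Sequence


def classification_metrics(probs: Sequence[float], labels: Sequence[int],
                           cutoff: float) -> Dict[str, float]:
    """Classify as default (1) when prob >= cutoff, then score against labels.

    Assumes both classes are present in `labels`.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "accuracy": (tp + tn) / len(labels),    # correct prediction rate
    }


# Toy example: 3 defaulters (label 1) and 5 non-defaulters (label 0).
probs = [0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

at_half = classification_metrics(probs, labels, cutoff=0.5)
at_low = classification_metrics(probs, labels, cutoff=0.3)
print(at_half)
print(at_low)
```

In this toy case, lowering the cut-off from 0.5 to 0.3 raises sensitivity and lowers specificity while accuracy stays the same, which is why accuracy alone cannot separate models that all score 80%.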



Receiver Operating Characteristic (ROC) Curve
[Figure: ROC curves, Sensitivity (True Positive) vs. 1 - Specificity (False Positive)]
Random Forest AUC = 76.8%
Logistic Regression AUC = 72.3%
Naïve Bayes AUC = 71.3%
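
AUC can be read as the probability that a randomly chosen defaulter gets a higher score than a randomly chosen non-defaulter. The sketch below computes it directly from that pairwise definition. It is illustrative only: the scores are made-up toy values, and the quadratic loop is meant for small examples rather than the full dataset.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 defaulters (label 1) and 5 non-defaulters (label 0).
scores = [0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.1, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

Because it depends only on how the scores rank customers, AUC does not change with the cut-off, so it complements the cut-off-dependent sensitivity and specificity figures.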



Random Forest
Run cross validation to optimize parameters
• Threshold
• Number of trees
• Number of variables
[Figure: misclassification error, in sample and out of sample, vs. # of trees (0 to 500)]



Random Forest – Variable Importance

[Figure: variable importance ranked by Mean Decrease Accuracy and by Mean Decrease Gini]



Key Takeaways

Random forest has the best balance between TPR and FPR

Recent payments are the most important variables for prediction

Demographic variables are not important predictors for defaulting

Feature engineering can be tricky, but insightful
