ASSIGNMENT-1
1.
There are a total of 1,000 observations in the dataset, of which 700 have a good credit history while the remaining 300 are categorized as bad. So the proportion of good to bad cases is 700:300, i.e. 7:3.
Several columns in the dataset contain missing values. More than 50% of the records in those columns are missing, so we replace the missing values with ‘0’.
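As a minimal sketch of this replacement step (the report's R code is not shown; this uses pandas on stand-in columns, since the actual column names with missing values are not listed above):

```python
import pandas as pd

# Hypothetical slice of the credit dataset; the real columns with
# >50% missing values are not named in the report, so these are stand-ins.
df = pd.DataFrame({
    "NEW_CAR":  [1, None, None, 1],
    "USED_CAR": [None, 1, None, None],
})

# Replace missing values with 0, as described above.
df = df.fillna(0)
print(df)
```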
● SAV_ACCT: We observe that the current status of the savings account matters, as the frequency of the response variable differs from one sub-category to another. Overall, people with no savings account, or with a low account balance, tend to have a bad credit history more often than people in the other categories.
● HISTORY: It is surprising that people who have duly paid their existing credits to date have a bad credit history.
● FOREIGN: The foreign-worker category has a smaller number of bad records and therefore a good credit history.
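The per-category frequency checks behind these observations can be sketched with a cross-tabulation. This is an illustrative pandas analogue on toy values, not the report's actual R code or data (the SAV_ACCT coding shown is an assumption):

```python
import pandas as pd

# Toy rows standing in for the credit dataset.
df = pd.DataFrame({
    "SAV_ACCT": [0, 0, 1, 1, 4, 4, 4],   # category codes (coding assumed)
    "RESPONSE": [0, 0, 1, 1, 1, 0, 1],   # 1 = good credit, 0 = bad
})

# Frequency of the response variable within each sub-category.
print(pd.crosstab(df["SAV_ACCT"], df["RESPONSE"]))
```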
(b) The important variables we identified are CHK_ACCT, OBS#, HISTORY, DURATION, SAV_ACCT and AMOUNT. We consider them important because they account for the largest count of good and bad cases. Yes, this matches our expectation: both HISTORY and SAV_ACCT are among the important variables.
Parameters                                           Accuracy   Class 0   Class 1
split = “information”, minsplit = 5, minbucket = 1   0.7997     0.5368    0.8033
split = “information”, cp = 0.01                     0.7266     0.5882    0.7672
c.) A lift chart was not created because this is only a descriptive model. The model is not yet reliable, as the data has not been split and there is a lot of unseen data. An ROC curve is created in part 3, after partitioning the data into training and testing sets.
3.
For this problem we started off by creating three different scenarios. Our first split between training and testing data has a 50:50 ratio, the second scenario uses 70:30, and the third 80:20. In each scenario our team tested different parameters to obtain a model that is highly accurate and precise. We also used ROC plots to further evaluate model performance.
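The three partitioning scenarios can be sketched as follows. The report uses R; this is an illustrative scikit-learn analogue on random stand-in data, not the original code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # stand-in for the 1,000-record data
y = rng.integers(0, 2, size=1000)

# 50:50, 70:30 and 80:20 training:testing splits.
for test_size in (0.5, 0.3, 0.2):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=1)
    print(len(X_tr), len(X_te))
```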
Scenario I
Training: Testing → 50:50
Parameters                    Accuracy   Class 0   Class 1
minsplit = 5, minbucket = 1   0.832      0.8695    0.8235
The table above displays the different parameters used to identify the best model for the 50:50 split. Here, using the C5.0 function with parameters such as minsplit and minbucket results in the highest accuracy while maintaining precision.
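The parameter tuning described here can be sketched in scikit-learn, whose min_samples_split and min_samples_leaf play roughly the role of rpart's minsplit and minbucket (an illustrative analogue on synthetic data, not the R code used in the report):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the credit data.
X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(min_samples_split=5,   # ~ minsplit = 5
                              min_samples_leaf=1,    # ~ minbucket = 1
                              criterion="entropy",   # ~ split = "information"
                              random_state=0).fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3))
```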
Fig 3.1: ROC curve for split 50:50
Scenario II
Training: Testing → 70:30
Function and parameters                                     Accuracy   Class 0   Class 1
rpart: split = “information”, minsplit = 5, minbucket = 1   0.72       0.5573    0.7615
C5.0:  split = “information”, minsplit = 5, minbucket = 1   0.774      0.72321   0.7886
C5.0:  split = “gini”, cp = 0.01                            0.778      0.70796   0.7984
Scenario III
Training: Testing → 80:20
Function and parameters                                     Accuracy   Class 0   Class 1
rpart: split = “information”, minsplit = 5, minbucket = 1   0.725      0.5490    0.7852
C5.0:  split = “information”, minsplit = 5, minbucket = 1   0.765      0.6097    0.8050
C5.0:  split = “gini”, cp = 0.01                            0.73       0.5882    0.7590
b.)
We built the C5.0 decision tree using a 50:50 ratio of training to test data. We started building the tree using basic parameters, such as type = ‘class’, and obtained the accuracy below.

Accuracy: 90.2

             TRUE
pred        0     1
   0      119    18
   1       31   332
· After obtaining the results, we optimized the parameters using a for loop and obtained the results below for the optimized values.
Trials -> An integer specifying the number of boosting iterations. A value of one indicates that a single model is used. We obtained the optimized value 30.
Model -> tree
Winnow -> TRUE
Outcome -> 1
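The for-loop search over boosting trials can be sketched as follows. C5.0's trials parameter is the number of boosting iterations; here AdaBoost's n_estimators plays the same role, on synthetic stand-in data (an illustrative analogue, not the report's R code):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Try several boosting-iteration counts and keep the best test accuracy.
best = max(
    (AdaBoostClassifier(n_estimators=t, random_state=0)
        .fit(X_tr, y_tr).score(X_te, y_te), t)
    for t in (1, 10, 20, 30)
)
print("best accuracy %.3f at trials=%d" % best)
```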
   0      1      0      1      0      1
 73.6   58.8   77.7   40.8   47.6   82.3
· After applying this model to the test data, we obtain an AUC value of 73.3. We should use precision as the measure, since from the above model we can see that the correct classification rate for good credit history is high, which is what matters here.
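AUC computations like the one above can be sketched with scikit-learn's roc_auc_score. The probabilities below are toy values standing in for the model's actual test-set scores (the 73.3 figure came from the real C5.0 model):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores for six records.
y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.2, 0.4, 0.8, 0.7, 0.3, 0.6]

# AUC = fraction of (positive, negative) pairs ranked correctly.
print(round(roc_auc_score(y_true, y_score), 3))   # 7 of 9 pairs correct
```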
The model obtained using this method gives a completely different set of results from the rpart decision tree. Below is a comparison of the two:

Model           Accuracy
Decision Tree   83.2
C5.0            73.6
c.) Decision trees are said to be unstable in the sense that even a small change in the data can change the results on the training or test data. We tried analyzing this and obtained the result below on the test data. As the initial seed value of the tree is changed, we observe changes in accuracy and precision. This is because changing the seed value introduces randomness into the generation of the tree: each time we change the initial value, the tree is initialized and built differently. We observe only slight changes in the values here because our dataset has only 1,000 records.
0 1
576 72.6 55 77
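The instability check described above can be sketched by re-partitioning the data under different seeds and refitting the tree; accuracy shifts slightly from run to run. This is an illustrative scikit-learn analogue on synthetic data, not the report's R code:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# A different seed changes which records the tree trains on.
for seed in (1, 2, 3):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=seed)
    acc = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
    print(seed, round(acc, 3))
```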
d.)
Below is a table containing the top 5 important variables obtained in each model. We observe that there are differences in the variables chosen by each model.

C5.0 Model      rpart Decision Tree
CHK_ACCT        CHK_ACCT
OTHER_INSTALL   HISTORY
DURATION        DURATION
SAV_ACCT        SAV_ACCT
GUARANTOR       AMOUNT
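A top-5 list like the one above can be extracted from a fitted tree's importance scores. The feature indices below are positional stand-ins on synthetic data, not the actual credit variables:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Indices of the five most important features, best first.
top5 = np.argsort(tree.feature_importances_)[::-1][:5]
print(top5.tolist())
```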
4.
The best model was identified as the 50:50 split with parameters minsplit = 5 and minbucket = 1.

Threshold   Accuracy   Class 0   Class 1
0.5         0.83       0.7777    0.84438
0.7         0.822      0.7333    0.8500
0.9         0.306      0.2947    1.0000
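Applying a cutoff to the predicted probability of a good (class 1) record, as in the rows above, can be sketched as follows; the probabilities here are toy values, not the model's actual scores:

```python
import numpy as np

# Toy predicted probabilities of "good" for five records.
proba = np.array([0.95, 0.72, 0.55, 0.40, 0.85])

# Raising the threshold makes the "good" prediction stricter.
for threshold in (0.5, 0.7, 0.9):
    pred = (proba >= threshold).astype(int)
    print(threshold, pred.tolist())
```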
b.) The ‘theoretical’ threshold was calculated to be 0.833, and assessing performance at the different threshold values tested above supports this figure. As listed in Table 4.1, accuracy dips from 0.822 (for a threshold of 0.7) to 0.306 (for a threshold of 0.9). This shows that the model becomes highly unreliable once the threshold value goes beyond 0.8.
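The reported 0.833 is consistent with a 5:1 misclassification-cost ratio (classifying a bad record as good costing five times the reverse). That ratio is an assumption here, as the report does not state the costs it used:

```python
# Assumed costs: calling a bad record "good" costs 5, the reverse costs 1.
cost_fp, cost_fn = 5, 1

# Cost-minimizing probability cutoff for predicting "good".
threshold = cost_fp / (cost_fp + cost_fn)
print(round(threshold, 3))   # 0.833
```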
c.) Decision trees for the ‘rpart’ function with and without misclassification costs are shown below. The figures clearly indicate that the decision tree with misclassification costs has more splits; in that tree, the variable AMOUNT appears in more splits.
Fig 4.1: Decision tree using misclassification costs (‘rpart’)
Fig 4.2: Decision tree for ‘rpart’ function without misclassification costs
Table 4.1: Comparison b/w using misclassification costs and not using them
The C5.0 model with threshold 0.5 has the best accuracy, as identified previously. Using misclassification costs has improved the model’s accuracy.
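rpart's loss matrix can be approximated in scikit-learn with class weights: weighting the bad class more heavily penalizes calling a bad record good, which typically shifts predictions toward that class. The 5:1 ratio and the synthetic data below are assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# ~70% good (class 1), 30% bad (class 0), as in the credit data.
X, y = make_classification(n_samples=1000, weights=[0.3, 0.7], random_state=0)

plain    = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
weighted = DecisionTreeClassifier(max_depth=3, random_state=0,
                                  class_weight={0: 5, 1: 1}).fit(X, y)

# Compare how often each tree predicts the costly "bad" class.
print(sum(plain.predict(X) == 0), sum(weighted.predict(X) == 0))
```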
5.
· Tree depth: 11
· Number of nodes: 14
Based on the decision tree, checking account and duration play an important role in determining good
vs bad credit scores.
6.
ProfitValue ScoreTest
7100 0.9161677
· We see the maximum gain at 200 observations, where the net cumulative profit adds up to 7100. Moving past the 200th observation, we start incurring losses for any incorrect prediction we make.
· In order to get the maximum benefit (7100), our test score cutoff would need to be 91.6%.
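A cumulative-profit curve like the one described above can be sketched by sorting records by model score and accumulating payoffs. The per-record payoffs (+100 for a correctly targeted good record, -500 for a bad one) and the random scores are assumptions for illustration, not the assignment's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
score = rng.random(500)                 # stand-in model scores
good  = rng.random(500) < 0.7          # ~70% good records, as in the data

# Target records in descending score order and accumulate profit.
order = np.argsort(score)[::-1]
profit = np.where(good[order], 100, -500).cumsum()

# The peak of the curve is the most profitable cutoff point.
k = int(profit.argmax())
print("max cumulative profit %d at record %d" % (profit[k], k + 1))
```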