
DATA MINING

ASSIGNMENT-1
1.
There are a total of 1000 observations in the dataset, of which 700 have a good credit history while the remaining 300 are categorized as bad. The proportion of good to bad cases is therefore 700:300, i.e. 7:3.
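These counts can be reproduced with a one-line frequency table in R (a minimal sketch; the data frame name `credit` and the response column `RESPONSE`, coded 1 = good / 0 = bad, are assumptions):

table(credit$RESPONSE)              # class counts: 300 bad (0), 700 good (1)
prop.table(table(credit$RESPONSE))  # proportions: 0.3 and 0.7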

Fig 1.1 Proportion of Cases

Several columns in the dataset contain missing values, as listed below:

Table 1: Missing values

Sr. No.   Variable Name   Missing Values
1         EDUCATION       950
2         FURNITURE       819
3         RADIO/TV        720
4         USED_CAR        897
5         RETRAINING      903

Since more than 50% of the records in each of these columns are missing, we replace the missing values with ‘0’.
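A minimal sketch of this replacement in R (assuming the data frame is named `credit` and the columns carry the names from Table 1; with read.csv defaults, RADIO/TV would be read in as RADIO.TV):

cols <- c("EDUCATION", "FURNITURE", "RADIO.TV", "USED_CAR", "RETRAINING")
credit[cols][is.na(credit[cols])] <- 0   # treat a missing purpose flag as 0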

Some of the real-valued attributes (predictor variables) are shown below:


Categorical Attributes:

Fig 1.2 Categorical Attributes

● SAV_ACCT: We observe that the current status of the savings account matters, as the frequency of the response variable differs from one sub-category to another. Overall, people with no savings account, or with a low account balance, tend to have a worse credit history than people in the other categories.
● HISTORY: It is surprising that people who have duly paid their existing credits so far have a bad credit history.
● FOREIGN: The foreign-worker category has fewer bad records and therefore a better credit history.
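These per-category frequencies can be inspected with a cross-tabulation of each attribute against the response (a sketch; column names as above are assumptions):

table(credit$SAV_ACCT, credit$RESPONSE)                 # raw counts per category
prop.table(table(credit$SAV_ACCT, credit$RESPONSE), 1)  # row-wise proportions
prop.table(table(credit$HISTORY,  credit$RESPONSE), 1)
prop.table(table(credit$FOREIGN,  credit$RESPONSE), 1)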

Fig 1.3: Additional Categories


2. (a) We identified a number of parameters and optimised them to obtain the best decision tree:

Criterion: Information gain; Maximum depth: 6; Apply pruning: Yes;
Minimal size for split: 5; Confidence: 0.5; Minimal gain: 0.001;
Minimal leaf size: 6

First we ran the tree with “information” as the split criterion (accuracy 0.787). We then changed the criterion to “gini” (again 0.787 accuracy), and finally added cp = 0.001 (accuracy 0.83).
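The three runs above look roughly as follows with the rpart package (a sketch; `credit` and `RESPONSE` are assumed names, and the tree-size parameters are passed through rpart.control):

library(rpart)

fit_info <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(minsplit = 5, minbucket = 1,
                                          maxdepth = 6))    # accuracy ~0.787
fit_gini <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "gini"))              # accuracy ~0.787
fit_cp   <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(cp = 0.001))       # accuracy ~0.83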

(b) The important variables we identified are CHK_ACCT, OBS#, HISTORY, DURATION, SAV_ACCT and AMOUNT. We consider them important because they account for the largest counts of good and bad cases. Yes, this matches our expectation: both HISTORY and SAV_ACCT are among the important variables.

Scenario                                             Accuracy   Precision (0)   Precision (1)
split = “information”, minsplit = 5, minbucket = 1   0.7997     0.5368          0.8033
split = “gini”, cp = 0.001                           0.7366     0.5818          0.7714
split = “information”, cp = 0.01                     0.7266     0.5882          0.7672

Table 2.1: Multiple scenarios to identify best parameters

c.) A lift chart was not created as this is just a descriptive model. The model is not reliable yet because the data has not been split, so there is no unseen data to validate against. An ROC curve is created in part 3, after partitioning into training and testing data.

3.
For this problem we started off by creating three different scenarios. Our first split between training and testing data uses a 50:50 ratio, the second scenario 70:30, and the third 80:20. In each scenario our team tested different parameters to obtain a model that is highly accurate and precise. We also used ROC plots to further evaluate model performance.
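The three partitions can be created along these lines in base R (a sketch; a fixed seed keeps the splits reproducible):

set.seed(123)
for (frac in c(0.5, 0.7, 0.8)) {
  idx   <- sample(nrow(credit), size = round(frac * nrow(credit)))
  train <- credit[idx, ]
  test  <- credit[-idx, ]
  # ... fit the rpart / C5.0 candidates on `train`, score them on `test` ...
}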

Scenario I
Training: Testing → 50:50

Scenario                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: minsplit = 5, minbucket = 1          0.832      0.8695          0.8235
Using C50 function: minsplit = 5, minbucket = 1            0.896      0.9210          0.8886
Using C50 function: split = “information”, cp = 0.01       0.758      0.6578          0.7875

Table 3.1: Split 50:50

The table above displays the different parameters used to identify the best model for the 50:50 split. Here, using the C50 function with parameters such as minsplit and minbucket results in the highest accuracy while maintaining precision.
Fig 3.1: ROC curve for split 50:50
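The ROC curves in this section were generated roughly as follows (a sketch using the pROC package on the 50:50 split; object names are assumptions, and the positive class is labelled "1"):

library(rpart)
library(pROC)

fit  <- rpart(RESPONSE ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 5, minbucket = 1))
prob <- predict(fit, newdata = test, type = "prob")[, "1"]  # P(good)
roc_obj <- roc(test$RESPONSE, prob)
plot(roc_obj)
auc(roc_obj)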

Scenario II
Training: Testing → 70:30

Scenario                                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: split = “information”, minsplit = 5, minbucket = 1   0.72       0.5573          0.7615
Using C50 function: split = “information”, minsplit = 5, minbucket = 1     0.774      0.72321         0.7886
Using C50 function: split = “gini”, cp = 0.01                              0.778      0.70796         0.7984

Table 3.2: Split 70:30


The table above displays the different parameters used to identify the best model for the 70:30 split. Here, the C50 models outperform rpart, with the run using split = “gini” and cp = 0.01 giving the highest accuracy (0.778) while maintaining precision.

Fig 3.2: ROC curve for split 70:30

Scenario III
Training: Testing → 80:20

Scenario                                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: split = “information”, minsplit = 5, minbucket = 1   0.725      0.5490          0.7852
Using C50 function: split = “information”, minsplit = 5, minbucket = 1     0.765      0.6097          0.8050
Using C50 function: split = “gini”, cp = 0.01                              0.73       0.5882          0.7590

Table 3.3: Split 80:20

Fig 3.3: ROC curve for split 80:20

b.)

We built the C5.0 decision tree using a 50:50 ratio of training to test data. We started building the tree using basic parameters such as type = ‘class’ and obtained the accuracy below.

Accuracy 90.2

Below is the confusion matrix for the same:

              TRUE
Pred        0       1
0         119      18
1          31     332
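This matrix and the accuracy can be reproduced from the class predictions (a sketch; `c5_fit`, `train` and `test` are assumed names):

pred <- predict(c5_fit, newdata = test, type = "class")
cm   <- table(pred = pred, actual = test$RESPONSE)
cm                        # confusion matrix as above
sum(diag(cm)) / sum(cm)   # overall accuracy, ~0.902 here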
· After obtaining these results we optimized the parameters using a for loop (see the sketch below) and obtained the following optimized values.

Trials -> an integer specifying the number of boosting iterations; a value of one indicates that a single model is used. The optimized value obtained was 30.

Model -> tree

Winnow -> TRUE

Outcome -> 1
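The assignment does not show the exact loop, but a grid search over the C5.0 boosting and winnowing options could look like this (a sketch; `train`, `test` and `RESPONSE` — which must be a factor — are assumed names):

library(C50)

best <- list(acc = 0)
for (t in c(1, 5, 10, 20, 30)) {          # boosting trials
  for (w in c(TRUE, FALSE)) {             # winnowing on/off
    fit <- C5.0(RESPONSE ~ ., data = train, trials = t,
                control = C5.0Control(winnow = w))
    acc <- mean(predict(fit, test) == test$RESPONSE)
    if (acc > best$acc) best <- list(acc = acc, trials = t, winnow = w)
  }
}
best   # trials = 30, winnow = TRUE in our runs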

Below are the results we obtained for different measures:

Accuracy   Precision        Recall           Sensitivity
           0      1         0      1        0      1
73.6       58.8   77.7      40     88       47.6   82.3

ROC Curve for C5.0

· The model performs poorly beyond a threshold value of 0.6.

· After applying this model to the test data we obtain an AUC value of 73.3.

We should use precision as the measure, since the above model shows a high correct-classification rate for good credit histories, which is what matters here.

The model obtained using this method gives a completely different set of results from the rpart decision tree. Below is a comparison of the two:

Model                  Accuracy
rpart Decision Tree    83.2
C5.0                   73.6
c.) Decision trees are said to be unstable in the sense that even a small change in the data can change the results on the training or test data. We analyzed this and obtained the results below on the test data. As the seed value is changed, the accuracy and precision change as well, because changing the seed introduces randomness into how the data is partitioned and hence how the tree is initialized and built. We observe only slight changes in the values here because our dataset has just 1000 records.

Seed Value   Accuracy (%)   Precision 0 (%)   Precision 1 (%)
123          73             58.8              75.90
576          72.6           55                77
893          70             53.33             74.4
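The instability can be demonstrated by re-partitioning the data under each seed and refitting (a sketch; the model settings mirror the optimized C5.0 above):

library(C50)

for (s in c(123, 576, 893)) {
  set.seed(s)                                   # controls the random split
  idx  <- sample(nrow(credit), 0.5 * nrow(credit))
  fit  <- C5.0(RESPONSE ~ ., data = credit[idx, ], trials = 30)
  pred <- predict(fit, credit[-idx, ])
  cat("seed", s, "accuracy:", mean(pred == credit$RESPONSE[-idx]), "\n")
}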

d.)

Below is a table containing the top 5 important variables obtained in each model. We observe that there are differences in the variables chosen by each model.

Variable Importance (Top 5)

C5.0 Model       rpart Decision Tree
CHK_ACCT         CHK_ACCT
OTHER_INSTALL    HISTORY
DURATION         DURATION
SAV_ACCT         SAV_ACCT
GUARANTOR        AMOUNT
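These rankings come from each package's own importance measure (a sketch; `c5_fit` and `rpart_fit` are assumed fitted models):

library(C50)
C5imp(c5_fit, metric = "usage")   # C5.0: % of training cases used in splits on each variable
rpart_fit$variable.importance     # rpart: summed goodness-of-split per variable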

4.

The best model was identified as the 50:50 split with parameters minsplit = 5 and minbucket = 1.

Scenario          Accuracy   Precision (0)   Precision (1)
Threshold = 0.5   0.83       0.7777          0.84438
Threshold = 0.7   0.822      0.7333          0.8500
Threshold = 0.9   0.306      0.2947          1.0000

Table 4.1: Testing different thresholds
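Each row of Table 4.1 corresponds to re-labelling the predicted probabilities with a different cutoff (a sketch; `best_fit` is the 50:50 model from above):

prob <- predict(best_fit, newdata = test, type = "prob")[, "1"]  # P(good)
for (th in c(0.5, 0.7, 0.9)) {
  pred <- factor(ifelse(prob >= th, "1", "0"), levels = c("0", "1"))
  cat("threshold", th, "accuracy:", mean(pred == test$RESPONSE), "\n")
}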

b.) The ‘theoretical’ threshold was calculated to be 0.833. This follows from the payoff structure used in question 6: with a 100 DM gain per correct prediction and a 500 DM loss per incorrect one, the break-even threshold is 500 / (500 + 100) = 0.833. The value is supported by the different threshold values tested above: as listed in Table 4.1, accuracy dips from 0.822 (for TH = 0.7) to 0.306 (for TH = 0.9). This shows us that the model becomes highly unreliable once the threshold moves beyond roughly 0.83.

c.) Decision trees from the “rpart” function with and without misclassification costs are shown below. The figures clearly indicate that the decision tree with misclassification costs has more splits; in particular, the variable AMOUNT is split on more often.

Fig 4.1: Decision tree using misclassification costs (‘rpart’)
Fig 4.2: Decision tree for ‘rpart’ function without misclassification costs
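In rpart the misclassification costs enter through a loss matrix in `parms` (a sketch; here a bad case predicted as good is penalised five times as much, matching the 100 DM : 500 DM payoff from question 6, and assuming the factor levels are ordered "0" = bad, "1" = good). The C5.0 function accepts an analogous `costs` matrix argument.

library(rpart)

# loss[i, j] = cost of classifying a true class i case as class j; zero diagonal
loss <- matrix(c(0, 5,    # true "0" (bad):  predicted bad = 0, predicted good = 5
                 1, 0),   # true "1" (good): predicted bad = 1, predicted good = 0
               nrow = 2, byrow = TRUE)
fit_cost <- rpart(RESPONSE ~ ., data = train, method = "class",
                  parms = list(loss = loss))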

C50 model    Using misclassification costs    Without misclassification costs
Threshold    Train        Test                Train        Test
0.5          0.831        0.863               0.794        0.866
0.7          0.6670       0.738               0.694        0.742
0.8          0.560        0.615               0.584        0.619

Table 4.2: Comparison between using misclassification costs and not using them

The C50 model with threshold 0.5 has the best accuracy, as identified previously. Using misclassification costs improved the model’s training accuracy.
5.

Tree depth: 11

· Number of nodes: 14

Based on the decision tree, checking account and duration play an important role in determining good
vs bad credit scores.

· Two relatively pure leaf nodes:

OBS# < 934 → right node (0:14)

AMOUNT >= 8725 → left node (4:0)

Probabilities for these pure nodes:


Probabilities of pure leaf nodes   Good   Bad

OBS# < 934                         1      0

AMOUNT >= 8754                     0      1

6)

ProfitValue   ScoreTest
7100          0.9161677

· For each correct prediction we gain 100 DM and for each incorrect one we lose 500 DM. After sorting the data on predicted probability, we infer that below a score value of 0.916 we start incurring losses on incorrect predictions.

· We see the maximum gain at the 200th observation, where the net cumulative profit adds up to 7100. Moving past the 200th observation, we start incurring losses for the incorrect predictions we make.

· To obtain the maximum benefit (7100 DM), the test-score cutoff would need to be 91.6%.
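These figures can be derived by sorting the test cases on the predicted probability of a good outcome and accumulating the payoff (a sketch; +100 DM per correct prediction, -500 DM per incorrect one, names as before):

prob   <- predict(best_fit, newdata = test, type = "prob")[, "1"]
ord    <- order(prob, decreasing = TRUE)          # most confident "good" first
profit <- cumsum(ifelse(test$RESPONSE[ord] == "1", 100, -500))
which.max(profit)              # ~200th observation
max(profit)                    # net cumulative profit, ~7100 DM
prob[ord][which.max(profit)]   # score cutoff, ~0.916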
