
DATA MINING

ASSIGNMENT-1
1.
There are a total of 1000 observations in the dataset, of which 700 have a good credit history while the remaining 300 are categorized as bad. The proportion of good to bad cases is therefore 700:300, i.e. 7:3.
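These counts can be reproduced with a one-line frequency table in R (a minimal sketch; the data frame name `credit` and the response column `RESPONSE`, coded 1 = good / 0 = bad, are assumptions):

table(credit$RESPONSE)              # class counts: 300 bad (0), 700 good (1)
prop.table(table(credit$RESPONSE))  # proportions: 0.3 and 0.7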

Fig 1.1 Proportion of Cases

Several columns in the dataset contain missing values, as listed below:

Table 1: Missing values

Sr. No.   Variable Name   Missing Values
1         EDUCATION       950
2         FURNITURE       819
3         RADIO/TV        720
4         USED_CAR        897
5         RETRAINING      903

Since more than 50% of the records in each of these columns are missing, we replace the missing values with ‘0’.
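A minimal sketch of this replacement in R (assuming the data frame is named `credit` and the columns carry the names from Table 1; with read.csv defaults, RADIO/TV would be read in as RADIO.TV):

cols <- c("EDUCATION", "FURNITURE", "RADIO.TV", "USED_CAR", "RETRAINING")
credit[cols][is.na(credit[cols])] <- 0   # treat a missing purpose flag as 0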

Some of the real-valued attributes (predictor variables) are shown below:


Categorical Attributes:

Fig 1.2 Categorical Attributes

● SAV_ACCT: We observe that the current status of the savings account matters, as the frequency of the response variable differs from one sub-category to another. Overall, people with no savings account, or with a low account balance, tend to have a worse credit history than people in the other categories.
● HISTORY: It is surprising that people who have duly paid their existing credits so far have a bad credit history.
● FOREIGN: The foreign-worker category has fewer bad records and therefore a better credit history.
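These per-category frequencies can be inspected with a cross-tabulation of each attribute against the response (a sketch; column names as above are assumptions):

table(credit$SAV_ACCT, credit$RESPONSE)                 # raw counts per category
prop.table(table(credit$SAV_ACCT, credit$RESPONSE), 1)  # row-wise proportions
prop.table(table(credit$HISTORY,  credit$RESPONSE), 1)
prop.table(table(credit$FOREIGN,  credit$RESPONSE), 1)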

Fig 1.3: Additional Categories


2. (a) We identified a number of parameters and optimised them to obtain the best decision tree:

Criterion: Information gain; Maximum depth: 6; Apply pruning: Yes;
Minimal size for split: 5; Confidence: 0.5; Minimal gain: 0.001;
Minimal leaf size: 6

First we ran the tree with “information” as the split criterion (accuracy 0.787). We then changed the criterion to “gini” (again 0.787 accuracy), and finally added cp = 0.001 (accuracy 0.83).
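The three runs above look roughly as follows with the rpart package (a sketch; `credit` and `RESPONSE` are assumed names, and the tree-size parameters are passed through rpart.control):

library(rpart)

fit_info <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(minsplit = 5, minbucket = 1,
                                          maxdepth = 6))    # accuracy ~0.787
fit_gini <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "gini"))              # accuracy ~0.787
fit_cp   <- rpart(RESPONSE ~ ., data = credit, method = "class",
                  parms = list(split = "information"),
                  control = rpart.control(cp = 0.001))       # accuracy ~0.83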

(b) The important variables we identified are CHK_ACCT, OBS#, HISTORY, DURATION, SAV_ACCT and AMOUNT. We consider them important because they account for the largest counts of good and bad cases. Yes, this matches our expectation: both HISTORY and SAV_ACCT are among the important variables.

Scenario                                             Accuracy   Precision (0)   Precision (1)
split = “information”, minsplit = 5, minbucket = 1   0.7997     0.5368          0.8033
split = “gini”, cp = 0.001                           0.7366     0.5818          0.7714
split = “information”, cp = 0.01                     0.7266     0.5882          0.7672

Table 2.1: Multiple scenarios to identify best parameters

c.) A lift chart was not created as this is just a descriptive model. The model is not reliable yet because the data has not been split, so there is no unseen data to validate against. An ROC curve is created in part 3, after partitioning into training and testing data.

3.
For this problem we started off by creating three different scenarios. Our first split between training and testing data uses a 50:50 ratio, the second scenario 70:30, and the third 80:20. In each scenario our team tested different parameters to obtain a model that is highly accurate and precise. We also used ROC plots to further evaluate model performance.
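The three partitions can be created along these lines in base R (a sketch; a fixed seed keeps the splits reproducible):

set.seed(123)
for (frac in c(0.5, 0.7, 0.8)) {
  idx   <- sample(nrow(credit), size = round(frac * nrow(credit)))
  train <- credit[idx, ]
  test  <- credit[-idx, ]
  # ... fit the rpart / C5.0 candidates on `train`, score them on `test` ...
}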

Scenario I
Training: Testing → 50:50

Scenario                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: minsplit = 5, minbucket = 1          0.832      0.8695          0.8235
Using C50 function: minsplit = 5, minbucket = 1            0.896      0.9210          0.8886
Using C50 function: split = “information”, cp = 0.01       0.758      0.6578          0.7875

Table 3.1: Split 50:50

The table above displays the different parameters used to identify the best model for the 50:50 split. Here, using the C50 function with parameters such as minsplit and minbucket results in the highest accuracy while maintaining precision.
Fig 3.1: ROC curve for split 50:50
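The ROC curves in this section were generated roughly as follows (a sketch using the pROC package on the 50:50 split; object names are assumptions, and the positive class is labelled "1"):

library(rpart)
library(pROC)

fit  <- rpart(RESPONSE ~ ., data = train, method = "class",
              control = rpart.control(minsplit = 5, minbucket = 1))
prob <- predict(fit, newdata = test, type = "prob")[, "1"]  # P(good)
roc_obj <- roc(test$RESPONSE, prob)
plot(roc_obj)
auc(roc_obj)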

Scenario II
Training: Testing → 70:30

Scenario                                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: split = “information”, minsplit = 5, minbucket = 1   0.72       0.5573          0.7615
Using C50 function: split = “information”, minsplit = 5, minbucket = 1     0.774      0.72321         0.7886
Using C50 function: split = “gini”, cp = 0.01                              0.778      0.70796         0.7984

Table 3.2: Split 70:30


The table above displays the different parameters used to identify the best model for the 70:30 split. Here, the C50 models outperform rpart, with the run using split = “gini” and cp = 0.01 giving the highest accuracy (0.778) while maintaining precision.

Fig 3.2: ROC curve for split 70:30

Scenario III
Training: Testing → 80:20

Scenario                                                                   Accuracy   Precision (0)   Precision (1)
Using rpart function: split = “information”, minsplit = 5, minbucket = 1   0.725      0.5490          0.7852
Using C50 function: split = “information”, minsplit = 5, minbucket = 1     0.765      0.6097          0.8050
Using C50 function: split = “gini”, cp = 0.01                              0.73       0.5882          0.7590

Table 3.3: Split 80:20

Fig 3.3: ROC curve for split 80:20

b.)

We built the C5.0 decision tree using a 50:50 ratio of training to test data. We started building the tree using basic parameters such as type = ‘class’ and obtained the accuracy below.

Accuracy 90.2

Below is the confusion matrix for the same:

              TRUE
Pred        0       1
0         119      18
1          31     332
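This matrix and the accuracy can be reproduced from the class predictions (a sketch; `c5_fit`, `train` and `test` are assumed names):

pred <- predict(c5_fit, newdata = test, type = "class")
cm   <- table(pred = pred, actual = test$RESPONSE)
cm                        # confusion matrix as above
sum(diag(cm)) / sum(cm)   # overall accuracy, ~0.902 here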
· After obtaining these results we optimized the parameters using a for loop (see the sketch below) and obtained the following optimized values.

Trials -> an integer specifying the number of boosting iterations; a value of one indicates that a single model is used. The optimized value obtained was 30.

Model -> tree

Winnow -> TRUE

Outcome -> 1
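The assignment does not show the exact loop, but a grid search over the C5.0 boosting and winnowing options could look like this (a sketch; `train`, `test` and `RESPONSE` — which must be a factor — are assumed names):

library(C50)

best <- list(acc = 0)
for (t in c(1, 5, 10, 20, 30)) {          # boosting trials
  for (w in c(TRUE, FALSE)) {             # winnowing on/off
    fit <- C5.0(RESPONSE ~ ., data = train, trials = t,
                control = C5.0Control(winnow = w))
    acc <- mean(predict(fit, test) == test$RESPONSE)
    if (acc > best$acc) best <- list(acc = acc, trials = t, winnow = w)
  }
}
best   # trials = 30, winnow = TRUE in our runs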

Below are the results we obtained for different measures:

Accuracy   Precision        Recall           Sensitivity
           0      1         0      1        0      1
73.6       58.8   77.7      40     88       47.6   82.3

ROC Curve for C5.0

· The model performs poorly beyond a threshold value of 0.6.

· After applying this model to the test data we obtain an AUC value of 73.3.

We should use precision as the measure, since the above model shows a high correct-classification rate for good credit histories, which is what matters here.

The model obtained using this method gives a completely different set of results from the rpart decision tree. Below is a comparison of the two:

Model                  Accuracy
rpart Decision Tree    83.2
C5.0                   73.6
c.) Decision trees are said to be unstable in the sense that even a small change in the data can change the results on the training or test data. We analyzed this and obtained the results below on the test data. As the seed value is changed, the accuracy and precision change as well, because changing the seed introduces randomness into how the data is partitioned and hence how the tree is initialized and built. We observe only slight changes in the values here because our dataset has just 1000 records.

Seed Value   Accuracy (%)   Precision 0 (%)   Precision 1 (%)
123          73             58.8              75.90
576          72.6           55                77
893          70             53.33             74.4
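The instability can be demonstrated by re-partitioning the data under each seed and refitting (a sketch; the model settings mirror the optimized C5.0 above):

library(C50)

for (s in c(123, 576, 893)) {
  set.seed(s)                                   # controls the random split
  idx  <- sample(nrow(credit), 0.5 * nrow(credit))
  fit  <- C5.0(RESPONSE ~ ., data = credit[idx, ], trials = 30)
  pred <- predict(fit, credit[-idx, ])
  cat("seed", s, "accuracy:", mean(pred == credit$RESPONSE[-idx]), "\n")
}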

d.)

Below is a table containing the top 5 important variables obtained in each model. We observe that there are differences in the variables chosen by each model.

Variable Importance (Top 5)

C5.0 Model       rpart Decision Tree
CHK_ACCT         CHK_ACCT
OTHER_INSTALL    HISTORY
DURATION         DURATION
SAV_ACCT         SAV_ACCT
GUARANTOR        AMOUNT
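These rankings come from each package's own importance measure (a sketch; `c5_fit` and `rpart_fit` are assumed fitted models):

library(C50)
C5imp(c5_fit, metric = "usage")   # C5.0: % of training cases used in splits on each variable
rpart_fit$variable.importance     # rpart: summed goodness-of-split per variable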

4.

The best model was identified as the 50:50 split with parameters minsplit = 5 and minbucket = 1.

Scenario          Accuracy   Precision (0)   Precision (1)
Threshold = 0.5   0.83       0.7777          0.84438
Threshold = 0.7   0.822      0.7333          0.8500
Threshold = 0.9   0.306      0.2947          1.0000

Table 4.1: Testing different thresholds
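Each row of Table 4.1 corresponds to re-labelling the predicted probabilities with a different cutoff (a sketch; `best_fit` is the 50:50 model from above):

prob <- predict(best_fit, newdata = test, type = "prob")[, "1"]  # P(good)
for (th in c(0.5, 0.7, 0.9)) {
  pred <- factor(ifelse(prob >= th, "1", "0"), levels = c("0", "1"))
  cat("threshold", th, "accuracy:", mean(pred == test$RESPONSE), "\n")
}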

b.) The ‘theoretical’ threshold was calculated to be 0.833. This follows from the payoff structure used in question 6: with a 100 DM gain per correct prediction and a 500 DM loss per incorrect one, the break-even threshold is 500 / (500 + 100) = 0.833. The value is supported by the different threshold values tested above: as listed in Table 4.1, accuracy dips from 0.822 (for TH = 0.7) to 0.306 (for TH = 0.9). This shows us that the model becomes highly unreliable once the threshold moves beyond roughly 0.83.

c.) Decision trees from the “rpart” function with and without misclassification costs are shown below. The figures clearly indicate that the decision tree with misclassification costs has more splits; in particular, the variable AMOUNT is split on more often.

Fig 4.1: Decision tree using misclassification costs (‘rpart’)
Fig 4.2: Decision tree for ‘rpart’ function without misclassification costs
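In rpart the misclassification costs enter through a loss matrix in `parms` (a sketch; here a bad case predicted as good is penalised five times as much, matching the 100 DM : 500 DM payoff from question 6, and assuming the factor levels are ordered "0" = bad, "1" = good). The C5.0 function accepts an analogous `costs` matrix argument.

library(rpart)

# loss[i, j] = cost of classifying a true class i case as class j; zero diagonal
loss <- matrix(c(0, 5,    # true "0" (bad):  predicted bad = 0, predicted good = 5
                 1, 0),   # true "1" (good): predicted bad = 1, predicted good = 0
               nrow = 2, byrow = TRUE)
fit_cost <- rpart(RESPONSE ~ ., data = train, method = "class",
                  parms = list(loss = loss))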

C50 model    Using misclassification costs    Without misclassification costs
Threshold    Train        Test                Train        Test
0.5          0.831        0.863               0.794        0.866
0.7          0.6670       0.738               0.694        0.742
0.8          0.560        0.615               0.584        0.619

Table 4.2: Comparison between using misclassification costs and not using them

The C50 model with threshold 0.5 has the best accuracy, as identified previously. Using misclassification costs improved the model’s training accuracy.
5.

Tree depth: 11

· Number of nodes: 14

Based on the decision tree, checking account and duration play an important role in determining good
vs bad credit scores.

· Two relatively pure leaf nodes:

OBS# < 934 → right node (0:14)

AMOUNT >= 8725 → left node (4:0)

Probabilities for these pure nodes:


Probabilities of pure leaf nodes   Good   Bad

OBS# < 934                         1      0

AMOUNT >= 8754                     0      1

6)

ProfitValue   ScoreTest
7100          0.9161677

· For each correct prediction we gain 100 DM and for each incorrect one we lose 500 DM. After sorting the data on predicted probability, we infer that below a score value of 0.916 we start incurring losses on incorrect predictions.

· We see the maximum gain at the 200th observation, where the net cumulative profit adds up to 7100. Moving past the 200th observation, we start incurring losses for the incorrect predictions we make.

· To obtain the maximum benefit (7100 DM), the test-score cutoff would need to be 91.6%.
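These figures can be derived by sorting the test cases on the predicted probability of a good outcome and accumulating the payoff (a sketch; +100 DM per correct prediction, -500 DM per incorrect one, names as before):

prob   <- predict(best_fit, newdata = test, type = "prob")[, "1"]
ord    <- order(prob, decreasing = TRUE)          # most confident "good" first
profit <- cumsum(ifelse(test$RESPONSE[ord] == "1", 100, -500))
which.max(profit)              # ~200th observation
max(profit)                    # net cumulative profit, ~7100 DM
prob[ord][which.max(profit)]   # score cutoff, ~0.916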
