
Homework 4: Decision Tree Inductive Learning

The file dt_train.csv contains 601 lines with 10 variables. The first line contains
column headers that may be interpreted as follows:
id: observation identifier;
t1: measurement on test 1;
t2: measurement on test 2;
t3: measurement on test 3;
t4: measurement on test 4;
t5: measurement on test 5;
t6: measurement on test 6;
t7: measurement on test 7;
t8: measurement on test 8;
d: binary output variable, set to 1 if the product is defective and 0 otherwise.
The next 600 lines contain 600 examples, for which the values of the above
features are specified.
The table below reproduces the first 2 observations.

id  t1  t2  t3  t4  t5  t6  t7  t8  d
1   84   8  64   6  94  36  51  21  1
2   39  67  61  77  80  35  89  80  1

Use rpart with the training examples to come up with a small set of rules that
correctly classify the output variable d based on input variable values (t1, t2,
t3, t4, t5, t6, t7, and t8).
Answer:
Command:
> library(rpart)
> trainingdata <- read.csv("dt_train.csv")
> modeldata <- rpart(d ~ t1 + t2 + t3 + t4 + t5 + t6 + t7 + t8,
+     data = trainingdata, method = "class")
> modeldata

Output:
n= 600
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 600 95 1 (0.1583333 0.8416667)
2) t7< 33.5 196 95 1 (0.4846939 0.5153061)
4) t5< 60.5 125 30 0 (0.7600000 0.2400000)
8) t3< 76 95 0 0 (1.0000000 0.0000000) *
9) t3>=76 30 0 1 (0.0000000 1.0000000) *
5) t5>=60.5 71 0 1 (0.0000000 1.0000000) *
3) t7>=33.5 404 0 1 (0.0000000 1.0000000) *
Analysis:
Terminal nodes (leaves) are marked with * at the end of the row. In this tree
the terminal nodes are 3, 5, 8 and 9.
Specify the rules.
Command:
> rules <- path.rpart(modeldata, nodes = c(3, 5, 8, 9))
Output:
node number: 3
root
t7>=33.5
node number: 5
root
t7< 33.5
t5>=60.5
node number: 8
root
t7< 33.5
t5< 60.5
t3< 76
node number: 9
root
t7< 33.5
t5< 60.5
t3>=76
Analysis:
The predicted value of 'd' is taken from the 'yval' value of the corresponding
terminal node in 'modeldata'. The rules can be specified as:
IF t7 >= 33.5 THEN d = 1
IF t7 < 33.5 AND t5 >= 60.5 THEN d = 1
IF t7 < 33.5 AND t5 < 60.5 AND t3 < 76 THEN d = 0
IF t7 < 33.5 AND t5 < 60.5 AND t3 >= 76 THEN d = 1
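As a quick sanity check (a sketch, not part of the required assignment output), the four rules can be encoded as one vectorized expression and applied to the two sample observations reproduced at the top of this document; only t3, t5 and t7 appear in the rules.

```r
# Encode the four rules read off the fitted tree; only t3, t5, t7 matter.
apply_rules <- function(t3, t5, t7) {
  ifelse(t7 >= 33.5, 1,
         ifelse(t5 >= 60.5, 1,
                ifelse(t3 < 76, 0, 1)))
}

# t3, t5 and t7 values of the first two training observations.
apply_rules(t3 = c(64, 61), t5 = c(94, 80), t7 = c(51, 89))
# Both rows have t7 >= 33.5, so both are classified d = 1,
# matching the d column in the table above.
```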
The file dt_test.csv contains 200 test examples with the same 10 variables.
Test your trained classifier on these test examples and present your confusion
matrix. Comment on your classification accuracy.
Command & Confusion Matrix:
> testdata = read.csv("dt_test.csv")
> testdataRule1 <- subset(testdata, t7>= 33.5)
> table(testdataRule1$d, testdataRule1$d == "1")
TRUE
1 129
> testdataRule2 <- subset(testdata, t7< 33.5 & t5>= 60.5)
> table(testdataRule2$d, testdataRule2$d == "1")
TRUE
1 22
> testdataRule3 <- subset(testdata, t7< 33.5 & t5< 60.5 & t3 < 76)
> table(testdataRule3$d, testdataRule3$d == "0")
TRUE
0 37
> testdataRule4 <- subset(testdata, t7< 33.5 & t5< 60.5 & t3 >= 76)
> table(testdataRule4$d, testdataRule4$d == "1")
TRUE
1 12
Analysis:
All four rule subsets contain only correct classifications (129 + 22 + 37 + 12
= 200 of 200 test examples), so no false entries appear in the confusion
matrix and classification accuracy on the test set is 100%.
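A more compact way to get the same result is predict() with type = "class" followed by a single table() call. Since dt_test.csv is not reproduced in this document, the sketch below fits and evaluates on synthetic data generated to follow the same rules; the column names and thresholds match the assignment, but the data themselves are an assumption for illustration only.

```r
library(rpart)
set.seed(1)

# Synthetic stand-in for dt_train.csv / dt_test.csv: d is a deterministic
# function of t3, t5 and t7, using the thresholds the tree recovered.
make_data <- function(n) {
  df <- as.data.frame(matrix(sample(0:99, n * 8, replace = TRUE), ncol = 8))
  names(df) <- paste0("t", 1:8)
  df$d <- with(df, ifelse(t7 >= 33.5 | t5 >= 60.5 | t3 >= 76, 1, 0))
  df
}
train <- make_data(600)
test  <- make_data(200)

fit  <- rpart(d ~ ., data = train, method = "class")
pred <- predict(fit, test, type = "class")

# One-step confusion matrix and overall accuracy.
confusion <- table(actual = test$d, predicted = pred)
accuracy  <- sum(diag(confusion)) / sum(confusion)
```

With the real files, the two lines `predict(modeldata, testdata, type = "class")` and `table(actual = testdata$d, predicted = pred)` collapse the four per-rule counts above into one 2x2 matrix.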
Then use the rules to predict the output class d for the following 10 test cases
(presented in the file dt_new.csv):
new_case  t1  t2  t3  t4  t5  t6  t7  t8  d
1          8  86  55  53  36  12  82  19  1
2         22  36  80  69  90  33  22   6  1
3         74  26  32  26  38  52  63  12  1
4         66  71  71  52  42  88  89  70  1
5         55  72  61  41  91  39  50  96  1
6         34  58  22  84  84  61  95  57  1
7         23  70  39  65  16  71  96  78  1
8          9  19  67  43   2  20  92   3  1
9          6  71  20   6  27  58   6  22  0
10        68  40  86  82  82  44  61  48  1

Command:
> newdata <- read.csv("dt_new.csv")
> predict(modeldata, newdata)
Output:
    0 1
1   0 1
2   0 1
3   0 1
4   0 1
5   0 1
6   0 1
7   0 1
8   0 1
9   1 0
10  0 1
Analysis:
d = 1 for new_case = 1, 2, 3, 4, 5, 6, 7, 8 and 10
d = 0 for new_case = 9
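The ten predictions can also be reproduced without rpart by applying the rules directly (a sketch: the t3, t5 and t7 values are hand-copied from the new-case table above, since only those three columns appear in the rules).

```r
# t3, t5 and t7 for the 10 new cases, copied from the table above.
newcases <- data.frame(
  t3 = c(55, 80, 32, 71, 61, 22, 39, 67, 20, 86),
  t5 = c(36, 90, 38, 42, 91, 84, 16,  2, 27, 82),
  t7 = c(82, 22, 63, 89, 50, 95, 96, 92,  6, 61)
)

# d = 1 whenever any of the three "defective" rules fires; this OR form
# is equivalent to the nested IF/THEN rules listed earlier.
d_pred <- with(newcases, ifelse(t7 >= 33.5 | t5 >= 60.5 | t3 >= 76, 1, 0))
d_pred
# 1 1 1 1 1 1 1 1 0 1 -- only new_case 9 is classified d = 0
```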