Classification
class labels (the values of the target attribute, class) are predicted
Every case belongs to one of the mutually exclusive classes. This class is known.
supervised learning
The classification method learns from the training data based on the attribute values and the known class labels.
constructs a model
Classification
The model is evaluated (e.g. accuracy and subjective estimates). If the model is acceptable, it is used to classify cases whose class labels are not known.
Classification methods
Decision trees, rules, the k-nearest-neighbour method, the naive Bayesian classifier, neural networks, ...
Application examples: To give a diagnosis suggestion on the basis of the symptoms and test results of a patient
Learning algorithm
Classifier (Model)
Known class labels
Test data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Cases are described using fixed-length attribute vectors. Each case belongs to one class. Classes are mutually exclusive. The class of a case is known: supervised learning
The tree is constructed in a top-down manner (from the root to the leaves)
Decision tree
Decision tree
Decision tree
Inner nodes contain tests based on attributes (test nodes) Branches correspond to the outcomes of the tests (attribute values) Leaf nodes (leaves) contain the class information (one class or class distribution)
Decision tree
The attribute assigned to the root node is examined and a branch corresponding to the attribute value is followed. This process continues until a leaf node is encountered.
(Figure: a decision tree with the test Outlook at the root and branches sunny, overcast (leading to class P) and rain.)
Decision tree
The classification path from the root to a leaf gives an explanation for the decision. The number of tested attributes depends on the classification path.
A classification path: a conjunction of constraints on attribute values. A decision tree: a disjunction of its classification paths.
Tree construction
A complete (fully-grown) tree is built based on the training data. (prepruning: the growth of tree is restricted)
Tree pruning
postpruning: branches are pruned from a complete tree (or from a prepruned tree)
A decision tree is constructed in a top-down, recursive, divide-and-conquer manner. In the beginning, all the training examples are at the root. If the stopping criterion is fulfilled, a leaf node is formed. If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and
a test node is formed; the cases are divided into subsets according to the values of the chosen attribute; a decision tree is formed recursively for each subset.
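The divide-and-conquer procedure above can be sketched as follows. This is a minimal illustration, not Quinlan's implementation: the attribute-selection criterion is passed in as a function, empty branches are not handled, and the tree is a plain dictionary.

```python
from collections import Counter

def build_tree(cases, labels, attributes, select_best, min_cases=2):
    """Top-down, recursive divide-and-conquer tree construction sketch."""
    # stopping criterion: pure node, no attributes left, or too few cases
    if len(set(labels)) == 1 or not attributes or len(cases) < min_cases:
        # leaf node: label with the majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # greedy choice of the best attribute by the supplied criterion
    best = max(attributes, key=lambda a: select_best(cases, labels, a))
    node = {"test": best, "branches": {}}
    # divide the cases into subsets by the values of the chosen attribute
    subsets = {}
    for case, label in zip(cases, labels):
        subsets.setdefault(case[best], ([], []))
        subsets[case[best]][0].append(case)
        subsets[case[best]][1].append(label)
    remaining = [a for a in attributes if a != best]
    # build a subtree recursively for each subset
    for value, (sub_cases, sub_labels) in subsets.items():
        node["branches"][value] = build_tree(sub_cases, sub_labels,
                                             remaining, select_best, min_cases)
    return node
```

Cases are represented as dictionaries mapping attribute names to values; `select_best(cases, labels, attr)` stands in for a criterion such as information gain.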
How to select the best attribute? How to specify the attribute test condition?
When to stop the recursive splitting? How to form decision nodes (leaves)? How to prune a tree?
Attributes are adequate for the classification task, if all the cases having the same attribute values belong to the same class.
If the attributes are adequate it is always possible to construct a decision tree which correctly classifies all the training data.
Usually there exist several correctly classifying decision trees. In the worst case, there is a leaf in the tree for each of the training cases.
Derives from the principle called Occam's razor: if there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better. Smaller trees: more general, easier to understand, and possibly more accurate in classifying unseen cases.
Try to generate simple trees by generating simple nodes. The complexity of a node is
at its largest when the node has an equal number of cases from every class present in the node; at its smallest when the node has cases from one class only.
Heuristic attribute selection measures (measures of goodness of split) are used. These aim to generate homogeneous (pure) child nodes (subsets).
E.B. Hunt (1950s and 1960s): to simulate human problem-solving methods; analysing the content of English texts, medical diagnostics.
J.R. Quinlan (late 1970s): chess endgames; applications from medical diagnostics to scouting.
descendants of ID3 Addresses issues arising in real world classification tasks C4.5 is one of the most widely used machine learning algorithms, frequently used as a reference algorithm in machine learning research
ID3
Assumes that
attributes are categorical and have a small number of possible values the class (the target attribute) has two possible values
ID3 selects the best attribute according to a criterion called information gain
Criterion selects an attribute that maximises information gain (or minimises entropy)
Let
S be a training set that contains s cases (s is the number of cases); the class attribute C have values C1, ..., Cm (m is the number of classes)
In ID3 m = 2
si be the number of cases belonging to the class Ci in the training set S and p(Ci) = si /s the relative frequency of the class Ci in S
The entropy of the class distribution in S is

H(C) = - SUM(i=1..m) p(Ci) log2 p(Ci)

A 2-based logarithm is used because the information is coded in bits. We define in this context that if p(Ci) = 0, the term p(Ci) log2 p(Ci) is 0 (zero).
(Recall: logk a = x  <=>  k^x = a, for k in R+ \ {1} and a in R+.)
Example: class distributions (C1, C2) in a node of six cases

(1, 5): p(C1) = 1/6, p(C2) = 5/6: H(C) = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
(2, 4): p(C1) = 2/6, p(C2) = 4/6: H(C) = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
(3, 3): p(C1) = 3/6, p(C2) = 3/6: H(C) = -(3/6) log2 (3/6) - (3/6) log2 (3/6) = 1
Maximum (= log2 m) when cases are equally distributed among the classes
m = number of classes
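The entropy computation can be sketched in Python; the zero-probability convention from the definition is handled by skipping empty classes. A minimal illustration:

```python
import math

def entropy(counts):
    """Entropy H(C) of a class distribution given as a list of class counts.
    Terms with p(Ci) = 0 contribute 0 by definition, so they are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

For the distributions on the previous slide, `entropy([1, 5])` gives about 0.65, `entropy([2, 4])` about 0.92, and `entropy([3, 3])` exactly 1 (the maximum log2 2 for two equally distributed classes).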
Let an attribute A have the values Aj, j = 1, ..., v. Let the set S be divided into subsets {S1, S2, ..., Sv} according to the values of the attribute A. The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

H(C|Aj) = - SUM(i=1..m) p(Ci) log2 p(Ci)

where p(Ci) is calculated considering only those cases having the value Aj for the attribute A.
The expected information needed to classify an arbitrary case when using the attribute A as root is

H(C|A) = SUM(j=1..v) p(Aj) H(C|Aj)

where p(Aj) is the relative frequency of the cases having the value Aj for the attribute A in the set S. The information gain obtained by branching on the attribute A is

I(C|A) = H(C) - H(C|A)
ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
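A minimal sketch of the information gain computation I(C|A) = H(C) - H(C|A); cases are represented as plain dictionaries for illustration, this is not ID3's actual code:

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) from a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(cases, labels, attr):
    """I(C|A) = H(C) - H(C|A): expected reduction in entropy
    obtained by splitting on the attribute attr."""
    total = len(cases)
    # group the class labels by the value of attr
    by_value = {}
    for case, label in zip(cases, labels):
        by_value.setdefault(case[attr], []).append(label)
    # H(C|A): weighted average of the branch entropies
    h_cond = sum(len(subset) / total * entropy(subset)
                 for subset in by_value.values())
    return entropy(labels) - h_cond
```

With the 14-case Tennis data (9 P, 5 N; Outlook values sunny x5, overcast x4, rain x5), this reproduces the gain 0.246 computed on the following slides.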
ID3: Tests
A = Aj
Outcomes of a test are mutually exclusive. Each possible outcome has its own branch in the tree.
ID3 assumes that attributes are adequate. It splits the data in recursive fashion, until all the cases of a node belong to the same class. The class of a leaf node is defined on the basis of the class of the cases in the node.
If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled as null).
The Tennis training set contains 14 cases: 9 cases of class P and 5 cases of class N.

H(C) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets:

sunny (2 P, 3 N):    H = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971
overcast (4 P, 0 N): H = 0
rain (3 P, 2 N):     H = 0.971
The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as root is

H(C|A) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.694
The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.
I(C|Outlook) = 0.940 - 0.694 = 0.246
The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.
outlook sunny
Cases (1, sunny, hot, , N) (2, sunny, hot, , N) (8, sunny, mild, , N) (9, sunny, cool, , P) (11, sunny, mild, , P)
overcast
Cases (3, overcast, hot, , P) (7, overcast, cool, , P) (12, overcast, mild, , P) (13, overcast, hot, , P)
rain
Cases (4, rain, mild, , P) (5, rain, cool, , P) (6, rain, cool, , N) (10, rain, mild, , P) (14, rain, mild, , N)
(Figure: class distributions of the candidate tests in the sunny subset.)
Humidity is chosen
Cases are sent down the high and normal branches. The cases in the high branch are all of the same class: a leaf node is formed. The same holds for the normal branch.
(Figure: partial tree with the humidity test under the sunny branch; the overcast and rain branches are not yet expanded.)
Branches for overcast and rain are built in a similar way.
Complete decision tree and classification of a new case (Outlook: rain, Temperature: hot, Humidity: high, Windy: true) Play tennis?
ID3 does not address issues arising in real world classification tasks.
C4.5
Gain ratio attribute selection criterion Tests for value groups and quantitative attributes No requirement of fully adequate attributes Probabilistic approach for handling of missing values Pruning
The information gain criterion has a tendency to favour attributes with many outcomes.
However, this kind of attributes may be less relevant in prediction than attributes having a smaller number of outcomes.
An extreme example is an attribute that is used as an identifier. Identifiers have unique values, resulting in pure nodes, but they don't have predictive power.
gain ratio(A) = I(C|A) / H(A)

where I(C|A) is the information gain got from testing the attribute A, and H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

H(A) = - SUM(j=1..v) p(Aj) log2 p(Aj)

where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj).
The gain ratio criterion selects the attribute having the highest gain ratio among of those attributes whose information gain is at least the average information gain over all the attributes examined.
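The selection rule above can be sketched as follows. Attribute gains and value counts are assumed precomputed; the numbers used for windy in the usage note are illustrative, not taken from the slides.

```python
import math

def split_info(value_counts):
    """H(A): expected information needed to sort out the value of A."""
    total = sum(value_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in value_counts if c > 0)

def gain_ratio(gain, value_counts):
    return gain / split_info(value_counts)

def select_attribute(candidates):
    """candidates: {attr: (gain, value_counts)}. Pick the highest gain
    ratio among attributes whose gain is at least the average gain."""
    avg_gain = sum(g for g, _ in candidates.values()) / len(candidates)
    eligible = {a: gv for a, gv in candidates.items() if gv[0] >= avg_gain}
    return max(eligible, key=lambda a: gain_ratio(*eligible[a]))
```

For the Outlook attribute of the Tennis data, `split_info([5, 4, 5])` gives 1.577 and `gain_ratio(0.246, [5, 4, 5])` gives 0.156, matching the worked example on the next slide.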
Let us calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. The expected information for the Outlook attribute is

H(A) = -(5/14) log2 (5/14) - (4/14) log2 (4/14) - (5/14) log2 (5/14) = 1.577

The gain ratio is thus 0.246 / 1.577 = 0.156.
(Figure: left, a three-way split on Outlook with branches Sunny, Rain, Overcast; right, the same data split with value groups, Outlook in {Sunny, Overcast} vs. Outlook in {Rain}, followed by a Humidity test with threshold 75 (<= 75, > 75).)

Value groups
Tests based on qualitative attributes can take the form: outlook in {sunny, overcast}; outlook = rain.
To assess equitably qualitative attributes that vary in their numbers of possible values
Gain ratio criterion is biased to prefer attributes having a small number of possible values
For each appropriate grouping, an additional attribute is formed in the preprocessing phase. This approach is economical from a computational viewpoint. Problem: the appropriateness of a grouping may depend on the context (the part of the tree). A constant grouping may be too crude.
At first, each value forms its own group. Then, all possible pairs of groups are formed.
Process continues until just two value groups remain, or until no such merger would result in a better division of the training data.
Michalski's Soybean data: 35 attributes, 19 classes, 683 training cases. The attribute stem canker has four values: none, below soil, above soil, above 2nd node.
1) Partition into four one-value groups.
2) All possible mergers of two one-value groups are evaluated.
3) Based on the results of step 2, "above soil" and "above 2nd node" are merged. No further merger improves the situation, so the process stops. Final groups: {none}, {below soil}, {above soil, above 2nd node}.
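The iterative merging can be sketched as a greedy search. The evaluation function `score` is a placeholder standing in for the gain ratio of the candidate partition; this is an illustration, not C4.5's code.

```python
def greedy_value_groups(groups, score):
    """Iteratively merge the pair of value groups that most improves the
    split score; stop at two groups or when no merger helps.
    groups: list of sets of attribute values.
    score(groups): rates a candidate partition (e.g. gain ratio)."""
    best_score = score(groups)
    while len(groups) > 2:
        candidates = []
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # merge groups i and j, keep the rest unchanged
                merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                merged.append(groups[i] | groups[j])
                candidates.append((score(merged), merged))
        top_score, top = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no merger improves the division: stop
        groups, best_score = top, top_score
    return groups
```

With a scoring function that rewards grouping "above soil" with "above 2nd node", the search reproduces the three final stem canker groups of the Soybean example.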
From the overall viewpoint, the aim is to get simpler and more accurate trees. The advantage of value groupings depends on the application domain. The search for value groups can require a substantial increase in computation.
The threshold is defined dynamically. Cases are first sorted on the values of the attribute A being considered. Each midpoint of successive values,

Z = (Ak + Ak+1) / 2,

is a possible threshold that divides the cases of the training set S into two subsets.
If the attribute has w distinct values, there are w-1 candidate thresholds. The best candidate is the one that results in the largest gain ratio. The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.
All the threshold values appearing in the tree actually occur in the training data.
After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.
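The threshold search can be sketched as follows. The split-scoring function is a parameter standing in for the gain ratio; this is an illustration under that assumption.

```python
def choose_threshold(values, labels, score):
    """Evaluate midpoints of successive distinct values as candidate
    thresholds; return the largest actual value not exceeding the best
    midpoint, so that the threshold occurs in the training data.
    score(labels_le, labels_gt) rates a binary split (e.g. gain ratio)."""
    pairs = sorted(zip(values, labels))
    svals = sorted(set(values))
    best_mid, best_score = None, float("-inf")
    for lo, hi in zip(svals, svals[1:]):
        mid = (lo + hi) / 2  # candidate threshold Z
        le = [l for v, l in pairs if v <= mid]
        gt = [l for v, l in pairs if v > mid]
        s = score(le, gt)
        if s > best_score:
            best_mid, best_score = mid, s
    # snap to the largest training value <= best midpoint
    return max(v for v in values if v <= best_mid)
```

With the values 40, 46, 52, 60 (classes P, P, N, N) and a simple purity score, the best midpoint is 49 and the chosen threshold is 46, mirroring the example on the next slide.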
Candidate thresholds Z and the resulting class counts:

Z    A <= Z (P, N)   A > Z (P, N)
39   1, 0            2, 1
49   1, 1            2, 0
55   2, 1            1, 0

The candidate threshold 49 yields the highest gain ratio. The largest value of A in the training set not exceeding 49 is 46, and thus 46 is chosen as the threshold: A <= 46, A > 46.
In summary: midpoints of successive values are possible thresholds. The gain ratio is calculated for each candidate threshold. The best candidate is the one resulting in the highest gain ratio. The threshold is the biggest value of A in the training set that does not exceed the best candidate (midpoint).
An example of a decision tree built from the Tennis data in which the attribute humidity has been measured using a quantitative scale.
outlook = overcast: P outlook = sunny: :...humidity = high: N : humidity = normal: P outlook = rain: :...windy = true: N windy = false: P
outlook = overcast: P outlook = sunny: :...humidity <= 75: P : humidity > 75: N outlook = rain: :...windy = true: N windy = false: P
Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes. The processing of quantitative attributes is based on the ordering of values. Since the values of ordinal attributes have a natural order, the approach employed for quantitative attributes can be utilised with ordinal attributes, too.
Stopping criteria
All the cases in a node belong to the same class No cases in a node None of the attributes improves the situation in a node The number of cases in a node is too small for continuing the splitting process:
Every test must have at least two outcomes having the minimum number of cases. The default value for the number of cases is 2.
C4.5 - Leaves
If a leaf contains no cases, the most frequent class (the majority class) at the parent of the leaf is associated with the leaf.
If a leaf contains cases from several classes, the most frequent class (the majority class) at the leaf is associated with the leaf.
Real-world data often have missing attribute values. Missing values may e.g. be filled in (imputed) with
the mode, median or mean of the complete cases of a class, or with estimates given by some more "intelligent" method,
before running the decision tree program. However, imputation is not unproblematic.
Missing values are taken into account when calculating the information gain
I(C|A) = p(Aknown) * (H(C) - H(C|A)) + p(Aunknown) * 0 = p(Aknown) * (H(C) - H(C|A))
where p (Aknown) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known)
and calculating the expected information H (A) needed to test the value of the attribute A
Let an attribute A have the values A1, A2, , Av . Missing values are now treated as an own value, the value v+1.
H(A) = - SUM(j=1..v+1) p(Aj) log2 p(Aj)
Let us assume that the Tennis example has one missing value
No.  Outlook      Temperature  Humidity  Windy  Class
1    sunny        hot          85        false  N
2    sunny        hot          90        true   N
3    overcast     hot          78        false  P
4    rain         mild         96        false  P
5    rain         cool         80        false  P
6    rain         cool         70        true   N
7    overcast     cool         65        true   P
8    sunny        mild         95        false  N
9    sunny        cool         70        false  P
10   rain         mild         80        false  P
11   sunny        mild         70        true   P
12   ? (missing)  mild         90        true   P
13   overcast     hot          75        false  P
14   rain         mild         96        true   N

(Tennis playing data, Quinlan 86; the Outlook value of case 12 is missing.)
The information gain for the Outlook attribute is calculated on the basis of the 13 cases having a known value.
The expected information needed to test the value of the Outlook attribute is calculated:
H(A) = -(5/14) log2 (5/14) - (3/14) log2 (3/14) - (5/14) log2 (5/14) - (1/14) log2 (1/14) = 1.809
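As a sketch, treating the missing values as an own (v+1)-th value:

```python
import math

def split_info_with_missing(value_counts, n_missing):
    """H(A) when missing values are treated as an extra (v+1)-th value
    of the attribute A."""
    counts = list(value_counts) + [n_missing]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

For the Tennis data with Outlook counts sunny 5, overcast 3, rain 5 and one missing value, this reproduces H(A) = 1.809.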
When cases are sent to subtrees, a weight is given to each case. If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1. Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).
p(Oi ) is the probability (the relative frequency) of the outcome Oi in the current node. The case is divided between the possible outcomes {O1, O2, , Ov} of the test.
The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branches with the weight w = 1. Case 12 is divided between the sunny, overcast and rain branches.
(Figure: case 12 divided in fractions among the sunny, overcast and rain branches.)
The number of cases in a node is now interpreted as the sum of weights of (fractional) cases in the node.
A case came to a node with the weight w. It is sent to the node(s) of the next level with the weight

w' = w * 1 = w      (the value of the attribute tested in the current node is known)
w' = w * p(Oi)      (the value of the attribute tested in the current node is unknown)
Let us assume that this subset is partitioned further by the test on humidity. The branch "humidity <= 75" has cases from the single class P. The branch "humidity > 75" has cases from both classes (class P 0.4/3.4 and class N 3/3.4).
Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).
The tree is like the tree constructed from the original data, but now some leaves have a marking (N/E):
N is the sum of the fractional cases belonging to the leaf; E is the sum of the cases misclassified by the leaf (i.e. the sum of the fractional cases belonging to classes other than that suggested by the leaf). The majority class gives the class label of the node.
If the new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test. Now the case has multiple classification paths from the root to leaves, and, therefore, a classification is a class distribution. The majority class is the predicted class.
67
67
A case having a missing value is classified Outlook: sunny, temperature: mild, humidity: ?, windy: false
outlook = overcast: P (3.2) outlook = sunny: :...humidity <= 75: P (2) : humidity > 75: N (3.4/0.4) outlook = rain: :...windy = true: N (2.4/0.4) windy = false: P (3)
If the humidity were less than or equal to 75, the class for the case would be P If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).
Results from the two humidity branches are summed for the final class distribution:
class P: (2.0/5.4) * 100% + (3.4/5.4) * 12% = 44%
(2 of the 5.4 training cases belonged to the humidity <= 75 branch, where the probability of class P is 100%; 3.4 of the 5.4 training cases belonged to the humidity > 75 branch, where the probability of class P is 12%)
class N: (3.4/5.4) * 88% = 56%
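The fractional classification above can be sketched as a recursive traversal. The tree representation (leaves carrying a `dist` of training weights per class, test nodes carrying branch `weights`) is an assumption made for this illustration, not C4.5's data structure.

```python
def classify_with_missing(tree, case, weight=1.0):
    """Return a class distribution for `case`. At a leaf, the leaf's
    training class distribution is normalised and scaled by the case's
    weight; at a test node with a missing value, the case is divided
    among the branches in proportion to their training weights."""
    if "dist" in tree:  # leaf node
        total = sum(tree["dist"].values())
        return {c: weight * w / total for c, w in tree["dist"].items()}
    value = case.get(tree["test"])
    if value is not None:  # value known: follow one branch
        return classify_with_missing(tree["branches"][value], case, weight)
    # value missing: split the case among all branches and sum the results
    total = sum(tree["weights"].values())
    result = {}
    for v, subtree in tree["branches"].items():
        sub = classify_with_missing(subtree, case,
                                    weight * tree["weights"][v] / total)
        for c, w in sub.items():
            result[c] = result.get(c, 0.0) + w
    return result
```

Applied to the humidity subtree of the example (branch weights 2.0 and 3.4) with a missing humidity value, it yields approximately 44% P and 56% N.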
Underfitting: when model is too simple, both training and test errors are large
Overfitting
The tree is complex: its lowest branches reflect noise and outliers occurring in the training data. The result is lower classification accuracy on unseen cases.
Causes: noise and outliers; inadequate attributes; too small training data; a local maximum in the greedy search.
Pruning
Prepruning: halt the tree growth before the tree is fully grown.
Postpruning: let the tree grow and remove branches from the fully grown tree.
Pruning
(a) The branch marked with a star may be partly based on erroneous or exceptional cases.
(c) The tree has grown full, after which it has been pruned (postpruning).
Pruning
The tree growth can be limited in many ways. Define a minimum for the number of cases in a node.
If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.
The leaf is labeled with the majority class or the class distribution.
Too high a threshold: oversimplification, useful attributes are discarded. Too low a threshold: no simplification at all (or little simplification).
Postpruning
Usually it is more profitable to let the tree grow to completion and prune it afterwards than to halt the tree growth.
If the tree growth is halted, all the branches growing from a node are lost. Postpruning allows saving some of the branches.
Postpruning requires more calculation than prepruning but it usually results in more reliable trees than prepruning. In postpruning, parts of the tree, whose removal does not decrease the classification accuracy on unseen cases, are discarded.
Postpruning
N is the number of training cases belonging to the leaf; E is the number of cases that do not belong to the class suggested by the leaf. The error rate of a leaf is E/N. For the error rate of the whole tree, E and N are summed over all the leaves.
Postpruning
Start from the bottom of the tree and examine each subtree that is not a leaf. If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.
Reducing the error rate of any subtree also reduces the error rate of the whole tree.
There can be cases from several classes in a leaf, and, thus, the leaf is labeled with the majority class. The error rate can be predicted by using the training set or a new set of cases.
C4.5 - Pruning
Prepruning
Every test must have at least two outcomes having the minimum number of cases.
Because of the missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.
Postpruning
A "very pessimistic" method based on estimated error rates. How to calculate the very pessimistic estimates is not a topic of this course; however, the idea of the pruning is presented on the next slides.
C4.5 - postpruning
physician fee freeze = n: :...adoption of the budget resolution = y: democrat (151.0) : adoption of the budget resolution = u: democrat (1.0) : adoption of the budget resolution = n: : :...education spending = n: democrat (6.0) : education spending = y: democrat (9.0) : education spending = u: republican (1.0) physician fee freeze = y: :...synfuels corporation cutback = n: republican (97.0/3.0) synfuels corporation cutback = u: republican (4.0) synfuels corporation cutback = y: :...duty free exports = y: democrat (2.0) duty free exports = u: republican (1.0) duty free exports = n: :...education spending = n: democrat (5.0/2.0) education spending = y: republican (13.0/2.0) education spending = u: democrat (1.0) physician fee freeze = u: :...water project cost sharing = n: democrat (0.0) water project cost sharing = y: democrat (4.0) water project cost sharing = u: :...mx missile = n: republican (0.0) mx missile = y: democrat (3.0/1.0) mx missile = u: republican (2.0)
C4.5 postpruning
Pruned tree
The original tree had 17 leaves, the pruned one has 5 leaves.
Subtrees have been replaced with leaves
physician fee freeze = n: democrat (168.0/2.6) physician fee freeze = y: republican (123.0/13.9) physician fee freeze = u: :...mx missile = n: democrat (3.0/1.1) mx missile = y: democrat (4.0/2.2) mx missile = u: republican (2.0/1.0)
Subtree has been replaced with the most frequently used subtree
123 training cases in the leaf. If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate).
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
The subtree has been replaced with a leaf: physician fee freeze = n: democrat (168.0/2.6). 168 training cases in the leaf; one of them is misclassified by the leaf. If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate).
C4.5 postpruning
The subtree under "adoption of the budget resolution = n":

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The very pessimistic estimate: the sum of predicted errors of the education spending subtree is 3.273. Replacing the subtree with a leaf:

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16/2.512)

Since the predicted error of the leaf (2.512) is smaller than the sum of predicted errors of the subtree (3.273), the replacement reduces the predicted error rate and the subtree is pruned.
C4.5 postpruning
N is the number of training cases in the leaf E is the number of predicted errors if a set of N unseen cases were classified by the tree. The sum of the predicted errors over the leaves, divided by the size of the training set (the number of the training cases) provides an immediate estimate of the error rate of the pruned tree on new cases.
20.8/300=0.069 (6.9%) (The pruned tree will misclassify 6.9% of new cases.)
C4.5 - postpruning
Results for the Congressional voting data:

Training set (300 cases):
  Complete tree: 25 nodes, 8 errors (2.7%)
  Pruned tree: 7 nodes, 13 errors (4.3%)
Unseen cases:
  Pruned tree: 7 errors (5.2%)
10-fold cross-validation gives the error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%)
DTI - pros
Construction of a tree does not (necessarily) require any parameter setting Can handle high dimensional data Can handle heterogeneous data Nonparametric approach Representation form is intuitive, relatively easy to interpret
DTI - pros
Learning: the complexity depends on the number of nodes, cases and attributes
In each node: O(n * p); with quantitative attributes O(p * n * log n), where n = number of cases in the node and p = number of attributes.
Classification: O(w), where w is the maximum depth of the tree. An "eager" method: training is computationally more expensive than classification.
Quite robust to the presence of noise In general, good classification accuracy comparable with other classification methods
Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion. Problems
Data fragmentation
Number of instances at the leaf nodes can be too small to make any statistically significant decision
Repetition: the same attribute is tested repeatedly along a path of the tree.
Replication: a decision tree contains duplicate subtrees.
The borderline between two neighbouring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
(Figure: an oblique decision boundary x + y < 1 separating class + from class -.)
Multivariate splits are based on a combination of attributes and give a more expressive representation. The use of multivariate splits can prevent the problems of fragmentation, repetition and replication, but finding the optimal test condition is computationally expensive.
Decision tree induction is a widely studied topic - different kind of enhancements to the basic algorithm have been developed.
challenges arising from real world data: quantitative attributes, missing values, noise, outliers multivariate decision trees incremental decision tree induction
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
C4.5 is a kind of reference algorithm used in machine learning research. In this course we will use See5, a descendant of C4.5.
http://www.rulequest.com/download.html
The source code of C4.5 is freely available for research and teaching from
http://www.rulequest.com/Personal/c4.5r8.tar.gz written in C
References
These slides are partly based on the slides of the books:
Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2/
Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://www-users.cs.umn.edu/~kumar/dmbook/

Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Quinlan JR. Induction of decision trees. Machine Learning 1: 81-106, 1986.
Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Quinlan JR. See5. http://www.rulequest.com