Classification
class labels (the values of the target attribute, class) are predicted
Every case belongs to one of the mutually exclusive classes. This class is known.
supervised learning
The classification method learns from the training data based on the attribute values and the known class labels.
constructs a model
Classification
The model is evaluated (e.g. accuracy and subjective estimates). If the model is acceptable, it is used to classify cases whose class labels are not known.
Classification methods
Decision trees, rules, the k-nearest-neighbour method, the naive Bayesian classifier, neural networks, ...
Application examples: To give a diagnosis suggestion on the basis of the symptoms and test results of a patient
Learning algorithm
Classifier (Model)
Known class labels
Test data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Cases are described using fixed-length attribute vectors. Each case belongs to one class. Classes are mutually exclusive. The class of a case is known: supervised learning
The tree is constructed in a top-down manner (from the root to the leaves)
Decision tree
Decision tree
Decision tree
Inner nodes contain tests based on attributes (test nodes) Branches correspond to the outcomes of the tests (attribute values) Leaf nodes (leaves) contain the class information (one class or class distribution)
Decision tree
The attribute assigned to the root node is examined and a branch corresponding to the attribute value is followed. This process continues until a leaf node is encountered.
(Figure: a decision tree with the test Outlook at the root and branches sunny, overcast (leading to class P) and rain.)
Decision tree
The classification path from the root to a leaf gives an explanation for the decision. The number of tested attributes depends on the classification path.
A classification path: a conjunction of constraints on attribute values. A decision tree: a disjunction of its classification paths.
Tree construction
A complete (fully-grown) tree is built based on the training data. (prepruning: the growth of tree is restricted)
Tree pruning
postpruning: branches are pruned from a complete tree (or from a prepruned tree)
A decision tree is constructed in a top-down, recursive, divide-and-conquer manner. In the beginning, all the training examples are at the root. If the stopping criterion is fulfilled, a leaf node is formed. If the stopping criterion is not fulfilled, the best attribute is selected according to some criterion (a greedy algorithm) and
a test node is formed; the cases are divided into subsets according to the values of the chosen attribute; a decision tree is formed recursively for each subset.
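The divide-and-conquer procedure above can be sketched as follows. This is a minimal illustration, not Quinlan's implementation: the attribute-selection criterion is passed in as a function, empty branches are not handled, and the tree is a plain dictionary.

```python
from collections import Counter

def build_tree(cases, labels, attributes, select_best, min_cases=2):
    """Top-down, recursive divide-and-conquer tree construction sketch."""
    # stopping criterion: pure node, no attributes left, or too few cases
    if len(set(labels)) == 1 or not attributes or len(cases) < min_cases:
        # leaf node: label with the majority class
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # greedy choice of the best attribute by the supplied criterion
    best = max(attributes, key=lambda a: select_best(cases, labels, a))
    node = {"test": best, "branches": {}}
    # divide the cases into subsets by the values of the chosen attribute
    subsets = {}
    for case, label in zip(cases, labels):
        subsets.setdefault(case[best], ([], []))
        subsets[case[best]][0].append(case)
        subsets[case[best]][1].append(label)
    remaining = [a for a in attributes if a != best]
    # build a subtree recursively for each subset
    for value, (sub_cases, sub_labels) in subsets.items():
        node["branches"][value] = build_tree(sub_cases, sub_labels,
                                             remaining, select_best, min_cases)
    return node
```

Cases are represented as dictionaries mapping attribute names to values; `select_best(cases, labels, attr)` stands in for a criterion such as information gain.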
How to select the best attribute? How to specify the attribute test condition?
When to stop the recursive splitting? How to form decision nodes (leaves)? How to prune a tree?
Attributes are adequate for the classification task, if all the cases having the same attribute values belong to the same class.
If the attributes are adequate it is always possible to construct a decision tree which correctly classifies all the training data.
Usually there exist several correctly classifying decision trees. In the worst case, there is a leaf in the tree for each of the training cases.
Derives from the principle called Occam's razor: if there are two models having the same accuracy on the training data, the smaller (simpler) one can be seen as more general and thus better. Smaller trees: more general, easier to understand, and possibly more accurate in classifying unseen cases.
Try to generate simple trees by generating simple nodes. The complexity of a node is
at its largest when the node has an equal number of cases from every class present in the node; at its smallest when the node has cases from one class only.
Heuristic attribute selection measures (measures of goodness of split) are used. These aim to generate homogeneous (pure) child nodes (subsets).
E.B. Hunt (1950s and 1960s): to simulate human problem-solving methods; analysing the content of English texts, medical diagnostics.
J.R. Quinlan (late 1970s): chess endgames; applications from medical diagnostics to scouting.
descendants of ID3 Addresses issues arising in real world classification tasks C4.5 is one of the most widely used machine learning algorithms, frequently used as a reference algorithm in machine learning research
ID3
Assumes that
attributes are categorical and have a small number of possible values the class (the target attribute) has two possible values
ID3 selects the best attribute according to a criterion called information gain
Criterion selects an attribute that maximises information gain (or minimises entropy)
Let
S be a training set that contains s cases (s is the number of cases); the class attribute C have values C1, ..., Cm (m is the number of classes)
In ID3 m = 2
si be the number of cases belonging to the class Ci in the training set S and p(Ci) = si /s the relative frequency of the class Ci in S
The entropy of the class distribution in S is

H(C) = - SUM(i=1..m) p(Ci) log2 p(Ci)

A 2-based logarithm is used because the information is coded in bits. We define in this context that if p(Ci) = 0, the term p(Ci) log2 p(Ci) is 0 (zero).
(Recall: logk a = x  <=>  k^x = a, for k in R+ \ {1} and a in R+.)
Example: class distributions (C1, C2) in a node of six cases

(1, 5): p(C1) = 1/6, p(C2) = 5/6: H(C) = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
(2, 4): p(C1) = 2/6, p(C2) = 4/6: H(C) = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
(3, 3): p(C1) = 3/6, p(C2) = 3/6: H(C) = -(3/6) log2 (3/6) - (3/6) log2 (3/6) = 1
Maximum (= log2 m) when cases are equally distributed among the classes
m = number of classes
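The entropy computation can be sketched in Python; the zero-probability convention from the definition is handled by skipping empty classes. A minimal illustration:

```python
import math

def entropy(counts):
    """Entropy H(C) of a class distribution given as a list of class counts.
    Terms with p(Ci) = 0 contribute 0 by definition, so they are skipped."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

For the distributions on the previous slide, `entropy([1, 5])` gives about 0.65, `entropy([2, 4])` about 0.92, and `entropy([3, 3])` exactly 1 (the maximum log2 2 for two equally distributed classes).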
Let an attribute A have the values Aj, j = 1, ..., v. Let the set S be divided into subsets {S1, S2, ..., Sv} according to the values of the attribute A. The expected information needed to classify an arbitrary case in the branch corresponding to the value Aj is

H(C|Aj) = - SUM(i=1..m) p(Ci) log2 p(Ci)

where p(Ci) is calculated considering only those cases having the value Aj for the attribute A.
The expected information needed to classify an arbitrary case when using the attribute A as root is

H(C|A) = SUM(j=1..v) p(Aj) H(C|Aj)

where p(Aj) is the relative frequency of the cases having the value Aj for the attribute A in the set S. The information gain obtained by branching on the attribute A is

I(C|A) = H(C) - H(C|A)
ID3 chooses the attribute resulting in the greatest information gain as the attribute for the root of the decision tree.
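A minimal sketch of the information gain computation I(C|A) = H(C) - H(C|A); cases are represented as plain dictionaries for illustration, this is not ID3's actual code:

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) from a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(cases, labels, attr):
    """I(C|A) = H(C) - H(C|A): expected reduction in entropy
    obtained by splitting on the attribute attr."""
    total = len(cases)
    # group the class labels by the value of attr
    by_value = {}
    for case, label in zip(cases, labels):
        by_value.setdefault(case[attr], []).append(label)
    # H(C|A): weighted average of the branch entropies
    h_cond = sum(len(subset) / total * entropy(subset)
                 for subset in by_value.values())
    return entropy(labels) - h_cond
```

With the 14-case Tennis data (9 P, 5 N; Outlook values sunny x5, overcast x4, rain x5), this reproduces the gain 0.246 computed on the following slides.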
ID3: Tests
A = Aj
Outcomes of a test are mutually exclusive. Each possible outcome has its own branch in the tree.
ID3 assumes that attributes are adequate. It splits the data in recursive fashion, until all the cases of a node belong to the same class. The class of a leaf node is defined on the basis of the class of the cases in the node.
If the leaf is empty (there are no cases with some particular value of an attribute), the class is unknown (the leaf is labelled as null).
The Tennis training set contains 14 cases: 9 cases of class P and 5 cases of class N.

H(C) = -(9/14) log2 (9/14) - (5/14) log2 (5/14) = 0.940
The expected information required for each of the subtrees after using the attribute Outlook to split the set S into 3 subsets:

sunny (2 P, 3 N):    H = -(2/5) log2 (2/5) - (3/5) log2 (3/5) = 0.971
overcast (4 P, 0 N): H = 0
rain (3 P, 2 N):     H = 0.971
The expected information needed to classify an arbitrary case for the tree with the attribute Outlook as root is

H(C|A) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.694
The attribute resulting in the greatest information gain is chosen as the attribute for the root of the decision tree.
I(C|Outlook) = 0.940 - 0.694 = 0.246
The attribute Outlook has been chosen and the cases have been divided into subsets according to their values of the Outlook attribute.
outlook sunny
Cases (1, sunny, hot, , N) (2, sunny, hot, , N) (8, sunny, mild, , N) (9, sunny, cool, , P) (11, sunny, mild, , P)
overcast
Cases (3, overcast, hot, , P) (7, overcast, cool, , P) (12, overcast, mild, , P) (13, overcast, hot, , P)
rain
Cases (4, rain, mild, , P) (5, rain, cool, , P) (6, rain, cool, , N) (10, rain, mild, , P) (14, rain, mild, , N)
(Figure: class distributions of the candidate tests in the sunny subset.)
Humidity is chosen
Cases are sent down the high and normal branches. The cases in the high branch are all of the same class: a leaf node is formed. The same holds for the normal branch.
(Figure: partial tree with the humidity test under the sunny branch; the overcast and rain branches are not yet expanded.)
Branches for overcast and rain are built in a similar way.
Complete decision tree and classification of a new case (Outlook: rain, Temperature: hot, Humidity: high, Windy: true) Play tennis?
ID3 does not address issues arising in real world classification tasks.
C4.5
Gain ratio attribute selection criterion Tests for value groups and quantitative attributes No requirement of fully adequate attributes Probabilistic approach for handling of missing values Pruning
The information gain criterion has a tendency to favour attributes with many outcomes.
However, this kind of attributes may be less relevant in prediction than attributes having a smaller number of outcomes.
An extreme example is an attribute that is used as an identifier. Identifiers have unique values, resulting in pure nodes, but they don't have predictive power.
gain ratio(A) = I(C|A) / H(A)

where I(C|A) is the information gain got from testing the attribute A, and H(A) is the expected information needed to sort out the value of the attribute A, i.e. the uncertainty relating to the value of the attribute A:

H(A) = - SUM(j=1..v) p(Aj) log2 p(Aj)

where p(Aj) is the probability of the value Aj (the relative frequency of the value Aj).
The gain ratio criterion selects the attribute having the highest gain ratio among of those attributes whose information gain is at least the average information gain over all the attributes examined.
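The selection rule above can be sketched as follows. Attribute gains and value counts are assumed precomputed; the numbers used for windy in the usage note are illustrative, not taken from the slides.

```python
import math

def split_info(value_counts):
    """H(A): expected information needed to sort out the value of A."""
    total = sum(value_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in value_counts if c > 0)

def gain_ratio(gain, value_counts):
    return gain / split_info(value_counts)

def select_attribute(candidates):
    """candidates: {attr: (gain, value_counts)}. Pick the highest gain
    ratio among attributes whose gain is at least the average gain."""
    avg_gain = sum(g for g, _ in candidates.values()) / len(candidates)
    eligible = {a: gv for a, gv in candidates.items() if gv[0] >= avg_gain}
    return max(eligible, key=lambda a: gain_ratio(*eligible[a]))
```

For the Outlook attribute of the Tennis data, `split_info([5, 4, 5])` gives 1.577 and `gain_ratio(0.246, [5, 4, 5])` gives 0.156, matching the worked example on the next slide.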
Let us calculate the gain ratio for the Outlook attribute of the Tennis example. The information gain I(C|A) for the attribute Outlook is 0.246. The expected information for the Outlook attribute is

H(A) = -(5/14) log2 (5/14) - (4/14) log2 (4/14) - (5/14) log2 (5/14) = 1.577

The gain ratio is thus 0.246 / 1.577 = 0.156.
(Figure: left, a three-way split on Outlook with branches Sunny, Rain, Overcast; right, the same data split with value groups, Outlook in {Sunny, Overcast} vs. Outlook in {Rain}, followed by a Humidity test with threshold 75 (<= 75, > 75).)

Value groups
Tests based on qualitative attributes can take the form: outlook in {sunny, overcast}; outlook = rain.
To assess equitably qualitative attributes that vary in their numbers of possible values
Gain ratio criterion is biased to prefer attributes having a small number of possible values
For each appropriate grouping, an additional attribute is formed in the preprocessing phase. This approach is economical from a computational viewpoint. Problem: the appropriateness of a grouping may depend on the context (the part of the tree). A constant grouping may be too crude.
At first, each value forms its own group. Then, all possible pairs of groups are formed.
Process continues until just two value groups remain, or until no such merger would result in a better division of the training data.
Michalski's Soybean data: 35 attributes, 19 classes, 683 training cases. The attribute stem canker has four values: none, below soil, above soil, above 2nd node.
1) Partition into four one-value groups.
2) All possible mergers of two one-value groups are evaluated.
3) Based on the results of step 2, "above soil" and "above 2nd node" are merged. No further merger improves the situation, so the process stops. Final groups: {none}, {below soil}, {above soil, above 2nd node}.
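The iterative merging can be sketched as a greedy search. The evaluation function `score` is a placeholder standing in for the gain ratio of the candidate partition; this is an illustration, not C4.5's code.

```python
def greedy_value_groups(groups, score):
    """Iteratively merge the pair of value groups that most improves the
    split score; stop at two groups or when no merger helps.
    groups: list of sets of attribute values.
    score(groups): rates a candidate partition (e.g. gain ratio)."""
    best_score = score(groups)
    while len(groups) > 2:
        candidates = []
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # merge groups i and j, keep the rest unchanged
                merged = [g for k, g in enumerate(groups) if k not in (i, j)]
                merged.append(groups[i] | groups[j])
                candidates.append((score(merged), merged))
        top_score, top = max(candidates, key=lambda t: t[0])
        if top_score <= best_score:
            break  # no merger improves the division: stop
        groups, best_score = top, top_score
    return groups
```

With a scoring function that rewards grouping "above soil" with "above 2nd node", the search reproduces the three final stem canker groups of the Soybean example.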
From the overall viewpoint, the aim is to get simpler and more accurate trees. The advantage of value groupings depends on the application domain. The search for value groups can require a substantial increase in computation.
The threshold is defined dynamically. Cases are first sorted on the values of the attribute A being considered. Each midpoint of successive values,

Z = (Ak + Ak+1) / 2,

is a possible threshold that divides the cases of the training set S into two subsets.
If the attribute has w distinct values, there are w-1 candidate thresholds. The best candidate is the one that results in the largest gain ratio. The largest value of the attribute A in the training set that does not exceed the best midpoint is chosen as the threshold.
All the threshold values appearing in the tree actually occur in the training data.
After finding the threshold, the quantitative attribute can be compared to qualitative and to other quantitative attributes in the usual way.
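The threshold search can be sketched as follows. The split-scoring function is a parameter standing in for the gain ratio; this is an illustration under that assumption.

```python
def choose_threshold(values, labels, score):
    """Evaluate midpoints of successive distinct values as candidate
    thresholds; return the largest actual value not exceeding the best
    midpoint, so that the threshold occurs in the training data.
    score(labels_le, labels_gt) rates a binary split (e.g. gain ratio)."""
    pairs = sorted(zip(values, labels))
    svals = sorted(set(values))
    best_mid, best_score = None, float("-inf")
    for lo, hi in zip(svals, svals[1:]):
        mid = (lo + hi) / 2  # candidate threshold Z
        le = [l for v, l in pairs if v <= mid]
        gt = [l for v, l in pairs if v > mid]
        s = score(le, gt)
        if s > best_score:
            best_mid, best_score = mid, s
    # snap to the largest training value <= best midpoint
    return max(v for v in values if v <= best_mid)
```

With the values 40, 46, 52, 60 (classes P, P, N, N) and a simple purity score, the best midpoint is 49 and the chosen threshold is 46, mirroring the example on the next slide.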
Candidate thresholds Z and the resulting class counts:

Z    A <= Z (P, N)   A > Z (P, N)
39   1, 0            2, 1
49   1, 1            2, 0
55   2, 1            1, 0

The candidate threshold 49 yields the highest gain ratio. The largest value of A in the training set not exceeding 49 is 46, and thus 46 is chosen as the threshold: A <= 46, A > 46.
In summary: midpoints of successive values are possible thresholds. The gain ratio is calculated for each candidate threshold. The best candidate is the one resulting in the highest gain ratio. The threshold is the biggest value of A in the training set that does not exceed the best candidate (midpoint).
An example of a decision tree built from the Tennis data in which the attribute humidity has been measured using a quantitative scale.
outlook = overcast: P outlook = sunny: :...humidity = high: N : humidity = normal: P outlook = rain: :...windy = true: N windy = false: P
outlook = overcast: P outlook = sunny: :...humidity <= 75: P : humidity > 75: N outlook = rain: :...windy = true: N windy = false: P
Ordinal attributes can be handled either in the same way as nominal attributes or in the same way as quantitative attributes. The processing of quantitative attributes is based on the ordering of values. Since the values of ordinal attributes have a natural order, the approach employed for quantitative attributes can be utilised with ordinal attributes, too.
Stopping criteria
All the cases in a node belong to the same class No cases in a node None of the attributes improves the situation in a node The number of cases in a node is too small for continuing the splitting process:
Every test must have at least two outcomes having the minimum number of cases. The default value for the number of cases is 2.
C4.5 - Leaves
If a leaf contains no cases, the most frequent class (the majority class) at the parent of the leaf is associated with the leaf.
If a leaf contains cases from several classes, the most frequent class (the majority class) at the leaf is associated with the leaf.
Real-world data often have missing attribute values. Missing values may e.g. be filled in (imputed) with
the mode, median or mean of the complete cases of a class, or with estimates given by some more "intelligent" method,
before running the decision tree program. However, imputation is not unproblematic.
Missing values are taken into account when calculating the information gain
I(C|A) = p(Aknown) * (H(C) - H(C|A)) + p(Aunknown) * 0 = p(Aknown) * (H(C) - H(C|A))
where p (Aknown) is the probability that the value of the attribute A is known (i.e. the relative frequency of those cases for which the value of the attribute A is known)
and calculating the expected information H (A) needed to test the value of the attribute A
Let an attribute A have the values A1, A2, , Av . Missing values are now treated as an own value, the value v+1.
H(A) = - SUM(j=1..v+1) p(Aj) log2 p(Aj)
Let us assume that the Tennis example has one missing value
No.  Outlook      Temperature  Humidity  Windy  Class
1    sunny        hot          85        false  N
2    sunny        hot          90        true   N
3    overcast     hot          78        false  P
4    rain         mild         96        false  P
5    rain         cool         80        false  P
6    rain         cool         70        true   N
7    overcast     cool         65        true   P
8    sunny        mild         95        false  N
9    sunny        cool         70        false  P
10   rain         mild         80        false  P
11   sunny        mild         70        true   P
12   ? (missing)  mild         90        true   P
13   overcast     hot          75        false  P
14   rain         mild         96        true   N

(Tennis playing data, Quinlan 86; the Outlook value of case 12 is missing.)
The information gain for the Outlook attribute is calculated on the basis of the 13 cases having a known value.
The expected information needed to test the value of the Outlook attribute is calculated:
H(A) = -(5/14) log2 (5/14) - (3/14) log2 (3/14) - (5/14) log2 (5/14) - (1/14) log2 (1/14) = 1.809
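As a sketch, treating the missing values as an own (v+1)-th value:

```python
import math

def split_info_with_missing(value_counts, n_missing):
    """H(A) when missing values are treated as an extra (v+1)-th value
    of the attribute A."""
    counts = list(value_counts) + [n_missing]
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
```

For the Tennis data with Outlook counts sunny 5, overcast 3, rain 5 and one missing value, this reproduces H(A) = 1.809.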
When cases are sent to subtrees, a weight is given to each case. If the tested attribute value is known, the case is sent to the branch corresponding to the outcome Oi with the weight w = 1. Otherwise, a fraction of the case is sent to each branch Oi with the weight w = p(Oi).
p(Oi ) is the probability (the relative frequency) of the outcome Oi in the current node. The case is divided between the possible outcomes {O1, O2, , Ov} of the test.
The 13 cases with a known value for the Outlook attribute are sent to the corresponding sunny, overcast or rain branches with the weight w = 1. Case 12 is divided between the sunny, overcast and rain branches.
(Figure: case 12 divided in fractions among the sunny, overcast and rain branches.)
The number of cases in a node is now interpreted as the sum of weights of (fractional) cases in the node.
A case came to a node with the weight w. It is sent to the node(s) of the next level with the weight

w' = w * 1 = w      (the value of the attribute tested in the current node is known)
w' = w * p(Oi)      (the value of the attribute tested in the current node is unknown)
Let us assume that this subset is partitioned further by the test on humidity. The branch "humidity <= 75" has cases from the single class P. The branch "humidity > 75" has cases from both classes (class P 0.4/3.4 and class N 3/3.4).
Since no test improves the situation further, a leaf is made (the most frequent class in the node gives the class label).
The tree is like the tree constructed from the original data, but now some leaves have a marking (N/E):
N is the sum of the fractional cases belonging to the leaf; E is the sum of the cases misclassified by the leaf (i.e. the sum of the fractional cases belonging to classes other than that suggested by the leaf). The majority class gives the class label of the node.
If the new case has a missing value for the attribute tested in the current node, the case is divided between the outcomes of the test. Now the case has multiple classification paths from the root to leaves, and, therefore, a classification is a class distribution. The majority class is the predicted class.
67
67
A case having a missing value is classified Outlook: sunny, temperature: mild, humidity: ?, windy: false
outlook = overcast: P (3.2) outlook = sunny: :...humidity <= 75: P (2) : humidity > 75: N (3.4/0.4) outlook = rain: :...windy = true: N (2.4/0.4) windy = false: P (3)
If the humidity were less than or equal to 75, the class for the case would be P If the humidity were greater than 75, the class for the case would be N with the probability of 3/3.4 (88%) and P with the probability of 0.4/3.4 (12%).
Results from the two humidity branches are summed for the final class distribution:
class P: (2.0/5.4) * 100% + (3.4/5.4) * 12% = 44%
(2 of the 5.4 training cases belonged to the humidity <= 75 branch, where the probability of class P is 100%; 3.4 of the 5.4 training cases belonged to the humidity > 75 branch, where the probability of class P is 12%)
class N: (3.4/5.4) * 88% = 56%
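The fractional classification above can be sketched as a recursive traversal. The tree representation (leaves carrying a `dist` of training weights per class, test nodes carrying branch `weights`) is an assumption made for this illustration, not C4.5's data structure.

```python
def classify_with_missing(tree, case, weight=1.0):
    """Return a class distribution for `case`. At a leaf, the leaf's
    training class distribution is normalised and scaled by the case's
    weight; at a test node with a missing value, the case is divided
    among the branches in proportion to their training weights."""
    if "dist" in tree:  # leaf node
        total = sum(tree["dist"].values())
        return {c: weight * w / total for c, w in tree["dist"].items()}
    value = case.get(tree["test"])
    if value is not None:  # value known: follow one branch
        return classify_with_missing(tree["branches"][value], case, weight)
    # value missing: split the case among all branches and sum the results
    total = sum(tree["weights"].values())
    result = {}
    for v, subtree in tree["branches"].items():
        sub = classify_with_missing(subtree, case,
                                    weight * tree["weights"][v] / total)
        for c, w in sub.items():
            result[c] = result.get(c, 0.0) + w
    return result
```

Applied to the humidity subtree of the example (branch weights 2.0 and 3.4) with a missing humidity value, it yields approximately 44% P and 56% N.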
Underfitting: when model is too simple, both training and test errors are large
Overfitting
The tree is complex: its lowest branches reflect noise and outliers occurring in the training data. The result is lower classification accuracy on unseen cases.
Causes: noise and outliers; inadequate attributes; too small training data; a local maximum in the greedy search.
Pruning
Prepruning: halt the tree growth before the tree is fully grown.
Postpruning: let the tree grow and remove branches from the fully grown tree.
Pruning
(a) The branch marked with a star may be partly based on erroneous or exceptional cases.
(c) The tree has grown full, after which it has been pruned (postpruning).
Pruning
The tree growth can be limited in many ways. Define a minimum for the number of cases in a node.
If the number of cases in a node is below the minimum, the recursive division of the example set is stopped and a leaf is formed.
The leaf is labeled with the majority class or the class distribution.
Too high a threshold: oversimplification, useful attributes are discarded. Too low a threshold: no simplification at all (or little simplification).
Postpruning
Usually it is more profitable to let the tree grow to completion and prune it afterwards than to halt the tree growth.
If the tree growth is halted, all the branches growing from a node are lost. Postpruning allows saving some of the branches.
Postpruning requires more calculation than prepruning but it usually results in more reliable trees than prepruning. In postpruning, parts of the tree, whose removal does not decrease the classification accuracy on unseen cases, are discarded.
Postpruning
N is the number of training cases belonging to the leaf; E is the number of cases that do not belong to the class suggested by the leaf. The error rate of a leaf is E/N. For the error rate of the whole tree, E and N are summed over all the leaves.
Postpruning
Start from the bottom of the tree and examine each subtree that is not a leaf. If replacement of the subtree with a leaf (or with its most frequently used branch) would reduce the predicted error rate, then prune the tree accordingly.
Reducing the error rate of any subtree also reduces the error rate of the whole tree.
There can be cases from several classes in a leaf, and, thus, the leaf is labeled with the majority class. The error rate can be predicted by using the training set or a new set of cases.
C4.5 - Pruning
Prepruning
Every test must have at least two outcomes having the minimum number of cases.
Because of the missing values, the minimum number of cases is actually the minimum for the summed weights of the cases.
Postpruning
A "very pessimistic" method based on estimated error rates. How to calculate the very pessimistic estimates is not a topic of this course; however, the idea of the pruning is presented on the next slides.
C4.5 - postpruning
physician fee freeze = n: :...adoption of the budget resolution = y: democrat (151.0) : adoption of the budget resolution = u: democrat (1.0) : adoption of the budget resolution = n: : :...education spending = n: democrat (6.0) : education spending = y: democrat (9.0) : education spending = u: republican (1.0) physician fee freeze = y: :...synfuels corporation cutback = n: republican (97.0/3.0) synfuels corporation cutback = u: republican (4.0) synfuels corporation cutback = y: :...duty free exports = y: democrat (2.0) duty free exports = u: republican (1.0) duty free exports = n: :...education spending = n: democrat (5.0/2.0) education spending = y: republican (13.0/2.0) education spending = u: democrat (1.0) physician fee freeze = u: :...water project cost sharing = n: democrat (0.0) water project cost sharing = y: democrat (4.0) water project cost sharing = u: :...mx missile = n: republican (0.0) mx missile = y: democrat (3.0/1.0) mx missile = u: republican (2.0)
C4.5 postpruning
Pruned tree
The original tree had 17 leaves, the pruned one has 5 leaves.
Subtrees have been replaced with leaves
physician fee freeze = n: democrat (168.0/2.6) physician fee freeze = y: republican (123.0/13.9) physician fee freeze = u: :...mx missile = n: democrat (3.0/1.1) mx missile = y: democrat (4.0/2.2) mx missile = u: republican (2.0/1.0)
Subtree has been replaced with the most frequently used subtree
123 training cases in the leaf. If 123 new cases were classified, 13.9 cases would be misclassified (a very pessimistic estimate).
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)
The subtree has been replaced with a leaf: physician fee freeze = n: democrat (168.0/2.6). 168 training cases in the leaf; one of them is misclassified by the leaf. If 168 new cases were classified, 2.6 cases would be misclassified (a very pessimistic estimate).
C4.5 postpruning
The subtree under "adoption of the budget resolution = n":

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n:
:   :...education spending = n: democrat (6.0)
:       education spending = y: democrat (9.0)
:       education spending = u: republican (1.0)

The very pessimistic estimate: the sum of predicted errors of the education spending subtree is 3.273. Replacing the subtree with a leaf:

physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16.0/2.512)
C4.5 postpruning
physician fee freeze = n:
:...adoption of the budget resolution = y: democrat (151.0)
:   adoption of the budget resolution = u: democrat (1.0)
:   adoption of the budget resolution = n: democrat (16/2.512)

Since the predicted error of the leaf (2.512) is smaller than the sum of predicted errors of the subtree (3.273), the replacement reduces the predicted error rate and the subtree is pruned.
C4.5 postpruning
N is the number of training cases in the leaf E is the number of predicted errors if a set of N unseen cases were classified by the tree. The sum of the predicted errors over the leaves, divided by the size of the training set (the number of the training cases) provides an immediate estimate of the error rate of the pruned tree on new cases.
20.8/300=0.069 (6.9%) (The pruned tree will misclassify 6.9% of new cases.)
C4.5 - postpruning
Results for the Congressional voting data:

Training set (300 cases):
  Complete tree: 25 nodes, 8 errors (2.7%)
  Pruned tree: 7 nodes, 13 errors (4.3%)
Unseen cases:
  Pruned tree: 7 errors (5.2%)
10-fold cross-validation gives the error rate of 5.3% on new cases (the average predicted, very pessimistic error rate on new cases is 5.6%)
DTI - pros
Construction of a tree does not (necessarily) require any parameter setting Can handle high dimensional data Can handle heterogeneous data Nonparametric approach Representation form is intuitive, relatively easy to interpret
DTI - pros
Learning: the complexity depends on the number of nodes, cases and attributes
In each node: O(n * p); with quantitative attributes O(p * n * log n), where n = number of cases in the node and p = number of attributes.
Classification: O(w), where w is the maximum depth of the tree. An "eager" method: training is computationally more expensive than classification.
Quite robust to the presence of noise In general, good classification accuracy comparable with other classification methods
Decision tree algorithms divide the training data into smaller and smaller subsets in a recursive fashion. Problems
Data fragmentation
Number of instances at the leaf nodes can be too small to make any statistically significant decision
Repetition: the same attribute is tested repeatedly along a path of the tree.
Replication: a decision tree contains duplicate subtrees.
The borderline between two neighbouring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
(Figure: an oblique decision boundary x + y < 1 separating class + from class -.)
Multivariate splits are based on a combination of attributes and give a more expressive representation. The use of multivariate splits can prevent the problems of fragmentation, repetition and replication, but finding the optimal test condition is computationally expensive.
Decision tree induction is a widely studied topic - different kind of enhancements to the basic algorithm have been developed.
challenges arising from real world data: quantitative attributes, missing values, noise, outliers multivariate decision trees incremental decision tree induction
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
C4.5 is a kind of reference algorithm used in machine learning research. In this course we will use See5, a descendant of C4.5.
http://www.rulequest.com/download.html
The source code of C4.5 is freely available for research and teaching from
http://www.rulequest.com/Personal/c4.5r8.tar.gz written in C
References
These slides are partly based on the slides of the books:
Han J, Kamber M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006. http://www-sal.cs.uiuc.edu/~hanj/bk2/
Tan P-N, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley, 2006. http://www-users.cs.umn.edu/~kumar/dmbook/

Hand D, Mannila H, Smyth P. Principles of Data Mining. MIT Press, 2001.
Mitchell TM. Machine Learning. McGraw-Hill, 1997.
Quinlan JR. Induction of decision trees. Machine Learning 1: 81-106, 1986.
Quinlan JR. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
Quinlan JR. See5. http://www.rulequest.com