
Unit 5
Classification: Basic Concepts, Decision Trees, and Model Evaluation
by Pang-Ning Tan, Vipin Kumar, Michael Steinbach



Examples of Classification Task
Predicting tumor cells as benign or malignant

Classifying credit card transactions
as legitimate or fraudulent

Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil

Categorizing news stories as finance,
weather, entertainment, sports, etc.
Definition
Classification is the task of learning a target function f that maps each attribute set
x to one of the predefined class labels y.
The target function is also known as a classification model.
Descriptive Modeling: the model serves as an explanatory tool to distinguish between objects of different classes
(mammal, reptile, bird, fish, or amphibian).
Predictive Modeling: the model is used to predict the class label of unknown records (e.g., assigning the class "fish" to an unseen record).
Example of an unlabeled record: a gila monster is cold-blooded, has scales, and does not give birth; which class does it belong to?
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
Vertebrate data set (20 records)
General Approach to Solve a Classification Problem
A learning algorithm is applied to a training set to learn (induce) a classification model; the model is then applied (deduction) to a test set of records whose class labels are unknown.

Training set:
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes

Test set:
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
Classification
The task of assigning objects to one of several predefined categories.
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class:
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is
divided into training and test sets, with the training set used to build the model and the
test set used to validate it.
Performance Metrics
Let f_ij denote the number of records from class i that are predicted to be of class j.
Accuracy = (number of correct predictions) / (total number of predictions) = (f_11 + f_00) / (f_11 + f_10 + f_01 + f_00)
Error rate = (number of wrong predictions) / (total number of predictions) = (f_10 + f_01) / (f_11 + f_10 + f_01 + f_00)
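A small Python sketch (our own illustration, not part of the slides) of these two metrics, computed from the four counts of a binary confusion matrix:

```python
# Accuracy and error rate from the confusion-matrix counts f11, f10, f01, f00.

def accuracy(f11, f10, f01, f00):
    """Fraction of correct predictions: (f11 + f00) / total."""
    total = f11 + f10 + f01 + f00
    return (f11 + f00) / total

def error_rate(f11, f10, f01, f00):
    """Fraction of wrong predictions: (f10 + f01) / total."""
    total = f11 + f10 + f01 + f00
    return (f10 + f01) / total

# Hypothetical counts: 40 true positives, 5 false negatives, 10 false positives, 45 true negatives
print(accuracy(40, 5, 10, 45))    # 0.85
print(error_rate(40, 5, 10, 45))  # 0.15
```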


Classification Techniques
Decision Tree Induction Methods
Rule-based Classifier Methods
Nearest-Neighbor classifiers
Bayesian Classifiers
Decision Tree Induction (how it works)
Root node: no incoming edges and zero or more outgoing edges.
Internal nodes: exactly one incoming edge and two or more outgoing edges.
Leaf (terminal) nodes: exactly one incoming edge and no outgoing edges.
Classifying an unlabeled vertebrate: starting from the root, apply the test condition at each node and follow the appropriate branch until a leaf node is reached; the class label of the leaf is assigned to the record.
How to Build a Decision Tree
An algorithm that employs a greedy method to build the tree in a reasonable amount of time is used.
One such algorithm is Hunt's algorithm.
Hunt's Algorithm (let D_t be the set of training records that reach node t):
1. If all the records in D_t belong to the same class y_t, then t is a leaf node labeled as y_t.
2. If D_t contains records that belong to more than one class, an attribute test condition is used
to split the data into smaller subsets. A child node is created for each outcome of the test
condition, and the records in D_t are distributed to the children based on the outcomes.
3. Recursively apply the procedure to each subset.
Conditions & Issues
In the loan example, the initial tree is a single node labeled "Defaulted = No" because most of the borrowers repaid their loans.
We need to consider data from both classes, so the root test is chosen on Home Owner.
The left child is split again to continue growing the tree.
Some child nodes can be empty, with no records associated with them; such a node is declared a leaf with the majority class of its parent's records.
Records with identical attribute values cannot be split further; such a node is declared a leaf with the majority class of its records.
Design issues:
How to split the training records: splitting is based on an attribute test condition that partitions the records into smaller subsets.
When to stop splitting: one procedure is to expand a node until either all of its records belong to the same class or all of its records have identical attribute values.
Method for Expressing Attribute Test Conditions
Depends on attribute types
Binary
Nominal
Ordinal
Continuous

Depends on number of ways to split
2-way split
Multi-way split
Binary Attributes
Test condition generates two potential outcomes


Nominal Attributes
Multi-way split: Use as many partitions as distinct
values.


Binary split: Divides values into two subsets.
Need to find optimal partitioning.
A decision tree algorithm such as CART, which produces only binary splits, must consider
all 2^(k-1) - 1 ways of creating a binary partition of k attribute values.
For example, CarType in {Family, Sports, Luxury} can be split into
{Family, Luxury} vs. {Sports}, or {Sports, Luxury} vs. {Family}, among others.
Ordinal Attributes
Can produce a multi-way split or a binary split.
Values can be grouped as long as the grouping does not violate the order of the attribute values.
In Figure 4.10, splits (a) and (b) preserve the order, but (c) does not, since it combines
Small & Large and also Medium & Extra Large.
Continuous Attributes
Different ways of handling:
Discretization to form an ordinal categorical attribute:
  Static: discretize once at the beginning.
  Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
Binary decision: (A < v) or (A >= v):
  consider all possible splits and find the best cut;
  can be more compute-intensive.
Measures for Selecting the Best Split
Let p(i | t) denote the fraction of records belonging to class i at a given node t.
The measures for selecting the best split are based on the degree of impurity of the child nodes:
the smaller the degree of impurity, the more skewed the class distribution.
A node with class distribution (0, 1) has impurity = 0; a node with uniform distribution (0.5, 0.5) has the highest impurity.

Which test condition is the best?
How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
C0: 9, C1: 1 (homogeneous, low degree of impurity)
Measures of Node Impurity
Gini index: GINI(t) = 1 - SUM_i [p(i | t)]^2
Entropy: Entropy(t) = - SUM_i p(i | t) log2 p(i | t)
Misclassification error: Error(t) = 1 - max_i [p(i | t)]
(NOTE: p(i | t) is the relative frequency of class i at node t, the sums run over the c classes, and 0 log2 0 = 0 in entropy calculations.)
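The three impurity measures can be sketched in a few lines of Python (our own illustration, not from the slides); the class counts at a node are given as a list, e.g. [3, 3] for three C1 and three C2 records:

```python
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    # by convention, 0 * log2(0) = 0, so zero counts are skipped
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(c / n for c in counts)

print(gini([0, 6]), gini([1, 5]), gini([3, 3]))   # ~ 0.0  0.278  0.5
print(entropy([1, 5]), entropy([2, 4]))           # ~ 0.650  0.918
print(classification_error([2, 4]))               # ~ 0.333
```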

Measure of Impurity: GINI
Gini index for a given node t: GINI(t) = 1 - SUM_i [p(i | t)]^2
(NOTE: p(i | t) is the relative frequency of class i at node t.)
Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - 0^2 - 1^2 = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
C1 = 3, C2 = 3: P(C1) = 3/6, P(C2) = 3/6; Gini = 1 - (3/6)^2 - (3/6)^2 = 0.500
Alternative Splitting Criteria Based on INFO
Entropy at a given node t: Entropy(t) = - SUM_i p(i | t) log2 p(i | t)
(NOTE: p(i | t) is the relative frequency of class i at node t.)
Measures the homogeneity of a node.
Maximum (log2 n_c) when records are equally distributed among all classes, implying the least information.
Minimum (0.0) when all records belong to one class, implying the most information.
Entropy-based computations are similar to the GINI index computations.

Examples for computing entropy:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = -0 log2 0 - 1 log2 1 = 0 - 0 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
Splitting Criteria Based on Classification Error
Classification error at a node t: Error(t) = 1 - max_i [p(i | t)]
Measures the misclassification error made by a node.
Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for computing error:
C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 - max(0, 1) = 1 - 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Misclassification Error vs. Gini
Parent node: C1 = 7, C2 = 3; Gini = 0.42.
Split on attribute A (Yes branch to node N1, No branch to node N2):
N1: C1 = 3, C2 = 0; N2: C1 = 4, C2 = 3.
Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
Gini improves after the split, even though the misclassification error stays the same (3/10 before and after).
Information Gain to Find the Best Split
Before splitting, the parent node (C0: N00, C1: N01) has impurity M0.
Splitting on attribute A yields nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with impurities M1 and M2, whose weighted combination is M12.
Splitting on attribute B yields nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with impurities M3 and M4, whose weighted combination is M34.
The difference in impurity (e.g., entropy) gives the information gain:
Gain = M0 - M12 (for A) vs. M0 - M34 (for B); the attribute with the larger gain is chosen.
Splitting of Binary Attributes: Computing the GINI Index
A binary attribute splits the data into two partitions.
Effect of weighting the partitions: larger and purer partitions are sought.
Parent: C0 = 6, C1 = 6; Gini = 0.500.
Split on B (Yes branch to node N1, No branch to node N2):
N1: C0 = 1, C1 = 4; N2: C0 = 5, C1 = 2.
Gini(N1) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(N2) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(Children) = 5/12 * 0.320 + 7/12 * 0.408 = 0.371
Splitting of Nominal Attributes: Computing the Gini Index
For each distinct value, gather the counts for each class in the dataset, and use the count matrix to make decisions.
Multi-way split on CarType:
Gini({Family}) = 0.375; Gini({Sports}) = 0; Gini({Luxury}) = 0.219
Gini(CarType split) = (4/20) * 0.375 + (8/20) * 0 + (8/20) * 0.219 = 0.163
Splitting of Continuous Attributes: Computing the Gini Index
Use a binary decision based on one value (e.g., Taxable Income > 80K? Yes/No).
There are several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A >= v.
Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index.
This is computationally inefficient, because it repeats work.

Tid Refund Marital Status Taxable Income Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Continuous Attributes: Computing Gini Index...
For efficient computation: for each attribute,
Sort the attribute on values
Linearly scan these values, each time updating the count matrix and
computing Gini index
Choose the split position that has the least Gini index
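A sketch of this sorted-scan procedure (our own illustration; the function name and the midpoint cut points are assumptions, not from the slides), applied to the Taxable Income column of the table above:

```python
def best_binary_split(values, labels):
    """Return (cut_point, weighted_gini) for the best split of the form 'attribute < cut_point'."""
    pairs = sorted(zip(values, labels))
    classes = set(labels)
    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)

    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    best_cut, best_gini = None, float("inf")
    for i in range(n - 1):
        v, y = pairs[i]
        left[y] += 1            # move one record from the right partition to the left
        right[y] -= 1
        if pairs[i + 1][0] == v:            # only cut between distinct values
            continue
        n_left, n_right = i + 1, n - i - 1
        weighted = (n_left / n) * gini(left, n_left) + (n_right / n) * gini(right, n_right)
        if weighted < best_gini:
            best_cut, best_gini = (v + pairs[i + 1][0]) / 2, weighted
    return best_cut, best_gini

# Taxable Income column of the table above (values in thousands)
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_binary_split(income, cheat))     # (97.5, 0.3)
```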

Splitting Based on INFO...
Information Gain:
GAIN_split = Entropy(p) - SUM_{i=1..k} (n_i / n) * Entropy(i)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
Measures the reduction in entropy achieved because of the split.
Choose the split that achieves the most reduction (i.e., maximizes GAIN).
Used in ID3 and C4.5.
Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
Example of Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"

age     | p_i | n_i | I(p_i, n_i)
<=30    |  2  |  3  | 0.971
31...40 |  4  |  0  | 0
>40     |  3  |  2  | 0.971

Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) - Info_age(D) = 0.246
Gain(income) = 0.029; Gain(student) = 0.151; Gain(credit_rating) = 0.048

age     income student credit_rating buys_computer
<=30    high   no      fair          no
<=30    high   no      excellent     no
31...40 high   no      fair          yes
>40     medium no      fair          yes
>40     low    yes     fair          yes
>40     low    yes     excellent     no
31...40 low    yes     excellent     yes
<=30    medium no      fair          no
<=30    low    yes     fair          yes
>40     medium yes     fair          yes
<=30    medium yes     excellent     yes
31...40 medium no      excellent     yes
31...40 high   yes     fair          yes
>40     medium no      excellent     no
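The numbers above can be checked with a short Python sketch (our own arithmetic, not part of the slides):

```python
import math

def info(p, n):
    """Entropy I(p, n) of a node with p positive and n negative records."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x > 0)

info_D = round(info(9, 5), 3)                                              # 0.940
info_age = round(5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2), 3)  # 0.694
print(info_D, info_age, round(info_D - info_age, 3))                      # 0.94 0.694 0.246
```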
Splitting the Samples Using age

age <= 30:
income student credit_rating buys_computer
high   no  fair      no
high   no  excellent no
medium no  fair      no
low    yes fair      yes
medium yes excellent yes

age 31...40 (all records are "yes", so this branch becomes a leaf labeled "yes"):
income student credit_rating buys_computer
high   no  fair      yes
low    yes excellent yes
medium no  excellent yes
high   yes fair      yes

age > 40:
income student credit_rating buys_computer
medium no  fair      yes
low    yes fair      yes
low    yes excellent no
medium yes fair      yes
medium no  excellent no
Splitting Based on INFO...
Gain Ratio:
GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = - SUM_{i=1..k} (n_i / n) log2 (n_i / n)
where the parent node p is split into k partitions and n_i is the number of records in partition i.
Adjusts the information gain by the entropy of the partitioning (SplitINFO): a higher-entropy partitioning (a large number of small partitions) is penalized.
Used in C4.5; designed to overcome the disadvantage of information gain.
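Continuing the buys_computer example above, a short sketch (our own arithmetic, not from the slides) of SplitINFO and the gain ratio for the 3-way split on age:

```python
import math

sizes = [5, 4, 5]            # partition sizes for age <=30, 31...40, >40
n = sum(sizes)
split_info = -sum((ni / n) * math.log2(ni / n) for ni in sizes)
gain_age = 0.246             # Gain(age) from the previous example
print(round(split_info, 3), round(gain_age / split_info, 3))   # 1.577 0.156
```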
Decision Tree Induction
Greedy strategy: split the records based on an attribute test
that optimizes a certain criterion.

Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Determine when to stop splitting

Stopping Criteria for Tree Induction
Stop expanding a node when all the records
belong to the same class

Stop expanding a node when all the records have
similar attribute values

Early termination
Algorithm: Decision Tree Induction
TreeGrowth(E, F)
1. if stopping_cond(E, F) = true then
2.   leaf = createNode()
3.   leaf.label = Classify(E)
4.   return leaf
5. else
6.   root = createNode()
7.   root.test_cond = find_best_split(E, F)
8.   let V = { v | v is a possible outcome of root.test_cond }
9.   for each v in V do
10.    E_v = { e | root.test_cond(e) = v and e in E }
11.    child = TreeGrowth(E_v, F)
12.    add child as a descendant of root and label the edge (root -> child) as v
13.  end for
14. end if
15. return root

createNode(): creates a new node; a node has either a test condition (node.test_cond) or a class label (node.label).
find_best_split(): determines the attribute to be selected as the test condition for splitting the records, using a measure such as entropy, the Gini index, or the classification error.
Classify(): determines the class label to be assigned to a leaf node: leaf.label = argmax_i p(i | t).
stopping_cond(): terminates the tree growth by testing whether all records have the same class label or the same attribute values.
Tree pruning and overfitting are discussed later.
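A compact Python sketch of TreeGrowth (our own illustration, not the textbook's implementation): records are assumed to be dicts, splits are binary equality tests on one attribute value, and find_best_split uses the weighted Gini index; the helper names mirror the pseudocode above.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def classify(records, target):
    # majority class at the node: argmax p(i | t)
    return Counter(r[target] for r in records).most_common(1)[0][0]

def stopping_cond(records, attrs, target):
    return len({r[target] for r in records}) == 1 or not attrs

def find_best_split(records, attrs, target):
    best = None
    for a in attrs:
        for v in {r[a] for r in records}:
            left = [r for r in records if r[a] == v]
            right = [r for r in records if r[a] != v]
            if not left or not right:
                continue
            w = (len(left) * gini([r[target] for r in left]) +
                 len(right) * gini([r[target] for r in right])) / len(records)
            if best is None or w < best[0]:
                best = (w, a, v, left, right)
    return best

def tree_growth(records, attrs, target="Class"):
    if stopping_cond(records, attrs, target):
        return {"label": classify(records, target)}        # leaf node
    split = find_best_split(records, attrs, target)
    if split is None:                                       # records cannot be separated
        return {"label": classify(records, target)}
    _, a, v, left, right = split
    return {"test": (a, v),                                 # internal node: attribute == value?
            "yes": tree_growth(left, attrs - {a}, target),
            "no": tree_growth(right, attrs, target)}
```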



Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification
techniques for many simple data sets



Decision Tree Induction
Many algorithms:
Hunt's Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
Computing the Impurity Measure with a Missing Value

Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 ? Single 90K Yes   (missing Refund value)

Before splitting: Entropy(Parent) = -0.3 log2(0.3) - 0.7 log2(0.7) = 0.8813

Count matrix for the split on Refund:
            Class=Yes  Class=No
Refund=Yes      0         3
Refund=No       2         4
Refund=?        1         0

Split on Refund:
Entropy(Refund=Yes) = 0
Entropy(Refund=No) = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.9183
Entropy(Children) = 0.3 * 0 + 0.6 * 0.9183 = 0.551
Gain = 0.9 * (0.8813 - 0.551) = 0.297
(The factor 0.9 is the fraction of records with a known Refund value.)
Rule-Based Classifier
Classify records by using a collection of "if...then" rules.
Rule: (Condition) -> y
where Condition is a conjunction of attribute tests and y is the class label.
LHS: rule antecedent or condition
RHS: rule consequent
Examples of classification rules:
(Blood Type = Warm) AND (Lay Eggs = Yes) -> Birds
(Taxable Income < 50K) AND (Refund = Yes) -> Evade = No
Rule-Based Classifier (Example)
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Application of a Rule-Based Classifier
A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
Each attribute test (A_j op v_j) is an attribute-value pair called a conjunct.
Condition_i = (A_1 op v_1) AND (A_2 op v_2) AND ... AND (A_k op v_k)
The rule R1 covers the hawk, so it is assigned to class Birds.
The rule R3 covers the grizzly bear, so it is assigned to class Mammals.

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Rule Coverage and Accuracy
Coverage of a rule: the fraction of records that satisfy the antecedent of the rule
  Coverage(r) = |A| / |D|
Accuracy of a rule: the fraction of records satisfying the antecedent that also satisfy the consequent
  Accuracy(r) = |A and y| / |A|
where |A| is the number of records that satisfy the antecedent and |D| is the total number of records.

(Gives Birth = yes) AND (Body Temperature = warm-blooded) -> Mammals
Coverage = 33%, Accuracy = 6/6 = 100%
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes

(Status = Single) -> No
Coverage = 4/10 = 40%, Accuracy = 2/4 = 50% (see the sketch below)
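A small Python check of this exercise (our own sketch), using only the Status and Class columns of the table above:

```python
records = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"), ("Single", "Yes"),
    ("Married", "No"), ("Single", "Yes"),
]

covered = [r for r in records if r[0] == "Single"]   # antecedent satisfied
correct = [r for r in covered if r[1] == "No"]       # consequent also satisfied

print(len(covered) / len(records))   # coverage = 4/10 = 0.4
print(len(correct) / len(covered))   # accuracy = 2/4  = 0.5
```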
How Does a Rule-Based Classifier Work?
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Characteristics of Rule-Based Classifier
Mutually exclusive rules
Classifier contains mutually exclusive rules if
the rules are independent of each other
Every record is covered by at most one rule

Exhaustive rules
Classifier has exhaustive coverage if it
accounts for every possible combination of
attribute values
Each record is covered by at least one rule
From Decision Trees to Rules
Decision tree: the root tests Refund (Yes -> leaf NO; No -> test Marital Status);
Marital Status ({Married} -> leaf NO; {Single, Divorced} -> test Taxable Income);
Taxable Income (< 80K -> leaf NO; > 80K -> leaf YES).

Classification rules (one per leaf):
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

The rules are mutually exclusive and exhaustive.
The rule set contains as much information as the tree.
Rules Can Be Simplified
Using the same decision tree and training set as above:
Initial rule: (Refund=No) AND (Status=Married) -> No
Simplified rule: (Status=Married) -> No
Effect of Rule Simplification
Rules are no longer mutually exclusive
A record may trigger more than one rule
Solution?
Ordered rule set
Unordered rule set use voting schemes

Rules are no longer exhaustive
A record may not trigger any rules
Solution?
Use a default class
Ordered Rule Set
Rules are rank ordered according to their priority
An ordered rule set is known as a decision list
When a test record is presented to the classifier
It is assigned to the class label of the highest ranked rule it has
triggered
If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
R2: (Give Birth = no) AND (Live in Water = yes) -> Fishes
R3: (Give Birth = yes) AND (Blood Type = warm) -> Mammals
R4: (Give Birth = no) AND (Can Fly = no) -> Reptiles
R5: (Live in Water = sometimes) -> Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Rule Ordering Schemes
Rule-based ordering
Individual rules are ranked based on their quality
Class-based ordering
Rules that belong to the same class appear together
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
Building Classification Rules
Direct method:
Extract rules directly from data.
e.g., RIPPER, CN2, Holte's 1R

Indirect method:
Extract rules from other classification models (e.g., decision trees, neural networks).
e.g., C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Step (2) and (3) until stopping criterion
is met
Example of Sequential Covering
Figure: (i) original data; (ii) step 1: the first rule R1 is grown; (iii) step 2: the records covered by R1 are removed; (iv) step 3: the second rule R2 is grown from the remaining records.
Aspects of Sequential Covering
Rule Growing

Instance Elimination

Rule Evaluation

Stopping Criterion

Rule Pruning
Rule Growing
Two common strategies:
(a) General-to-specific: start from an empty rule { } => Class = Yes and greedily add the conjunct that most improves rule quality, choosing among candidates such as Refund = No, Status = Single, Status = Divorced, Status = Married, Income > 80K, ... (the figure shows the positive/negative counts, e.g. Yes: 3 / No: 4, covered by each candidate rule).
(b) Specific-to-general: start from a rule built from a positive instance, e.g. (Refund = No, Status = Single, Income = 85K) => Class = Yes or (Refund = No, Status = Single, Income = 90K) => Class = Yes, and generalize it by removing conjuncts, e.g. to (Refund = No, Status = Single) => Class = Yes.
Rule Growing (Examples)
CN2 algorithm:
Start from an empty conjunct: { }.
Add the conjunct that minimizes the entropy measure: {A}, {A, B}, ...
Determine the rule consequent by taking the majority class of the instances covered by the rule.
RIPPER algorithm:
Start from an empty rule: { } => class.
Add the conjunct that maximizes FOIL's information gain measure:
R0: { } => class (initial rule)
R1: {A} => class (rule after adding a conjunct)
Gain(R0, R1) = t * [ log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)) ]
where t: number of positive instances covered by both R0 and R1
p0: number of positive instances covered by R0
n0: number of negative instances covered by R0
p1: number of positive instances covered by R1
n1: number of negative instances covered by R1
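A short sketch of FOIL's information gain as defined above (our own illustration; the counts in the example call are hypothetical):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """Gain when extending rule R0 (p0 positives / n0 negatives covered)
    to rule R1 (p1 positives / n1 negatives covered). When R1 is obtained by
    adding a conjunct to R0, every positive covered by R1 is also covered by R0,
    so t = p1."""
    t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# e.g. a rule covering 3 positives / 4 negatives refined to 3 positives / 1 negative
print(round(foil_gain(3, 4, 3, 1), 2))   # ~ 2.42
```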
Instance Elimination
Why do we need to
eliminate instances?
Otherwise, the next rule is
identical to previous rule
Why do we remove
positive instances?
Ensure that the next rule is
different
Why do we remove
negative instances?
Prevent underestimating
accuracy of rule
Compare rules R2 and R3
in the diagram
Figure: a two-class data set (class = + and class = -) in which rule R1 covers a region of mostly positive instances; R2 and R3 are candidate next rules, to be compared after the instances covered by R1 have been removed.
Rule Evaluation
Metrics:
Accuracy = n_c / n
Laplace = (n_c + 1) / (n + k)
M-estimate = (n_c + k*p) / (n + k)
where
n: number of instances covered by the rule
n_c: number of instances covered by the rule that belong to the predicted class
k: number of classes
p: prior probability of the predicted class
Other metrics include the likelihood ratio statistic and FOIL's information gain (see the textbook, page 217, for details).
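A small numeric sketch of these metrics (our own illustration, with hypothetical counts) for a two-class problem:

```python
# Hypothetical rule: covers n = 10 instances, n_c = 8 of the predicted class.
n, n_c, k, p = 10, 8, 2, 0.5     # p: prior probability of the predicted class

accuracy   = n_c / n                  # 0.8
laplace    = (n_c + 1) / (n + k)      # 9/12 = 0.75
m_estimate = (n_c + k * p) / (n + k)  # 9/12 = 0.75 (equals Laplace when p = 1/k)
print(accuracy, laplace, m_estimate)
```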
Stopping Criterion and Rule Pruning
Stopping criterion
Compute the gain
If gain is not significant, discard the new rule

Rule Pruning
Similar to post-pruning of decision trees
Reduced Error Pruning:
Remove one of the conjuncts in the rule
Compare error rate on validation set before and
after pruning
If error improves, prune the conjunct
Summary of Direct Method
Grow a single rule

Remove Instances from rule

Prune the rule (if necessary)

Add rule to Current Rule Set

Repeat
Direct Method: RIPPER
For 2-class problem, choose one of the classes as
positive class, and the other as negative class
Learn rules for positive class
Negative class will be default class
For multi-class problem
Order the classes according to increasing class
prevalence (fraction of instances that belong to a
particular class)
Learn the rule set for smallest class first, treat the rest
as negative class
Repeat with next smallest class as positive class
Direct Method: RIPPER
Growing a rule:
Start from empty rule
Add conjuncts as long as they improve FOIL's
information gain
Stop when rule no longer covers negative examples
Prune the rule immediately using incremental reduced
error pruning
Measure for pruning: v = (p-n)/(p+n)
p: number of positive examples covered by the rule in
the validation set
n: number of negative examples covered by the rule in
the validation set
Pruning method: delete any final sequence of
conditions that maximizes v
Direct Method: RIPPER
Building a Rule Set:
Use sequential covering algorithm
Finds the best rule that covers the current set of
positive examples
Eliminate both positive and negative examples
covered by the rule
Each time a rule is added to the rule set,
compute the new description length
stop adding new rules when the new description
length is d bits longer than the smallest description
length obtained so far
Direct Method: RIPPER
Optimize the rule set:
For each rule r in the rule set R:
  Consider two alternative rules:
    Replacement rule (r*): grow a new rule from scratch
    Revised rule (r'): add conjuncts to extend the rule r
  Compare the rule set containing r against the rule sets containing r* and r'
  Choose the rule set that minimizes the MDL principle
Repeat rule generation and rule optimization for the remaining positive examples.
Indirect Methods
Rule set generated from a decision tree (one rule per leaf, each with class + or -):
r1: (P=No, Q=No) ==> -
r2: (P=No, Q=Yes) ==> +
r3: (P=Yes, R=No) ==> +
r4: (P=Yes, R=Yes, Q=No) ==> -
r5: (P=Yes, R=Yes, Q=Yes) ==> +
(Tree: the root tests P; P=No -> test Q (No -> -, Yes -> +); P=Yes -> test R (No -> +, Yes -> test Q: No -> -, Yes -> +).)
Rules r2, r3, and r5 predict the positive class; r2 and r5 both fire when Q=Yes, so the set can be simplified to:
r2': (Q=Yes) -> +
r3: (P=Yes) AND (R=No) -> +
Rule Generation: C4.5rules
Extract classification rules from every path of a decision tree.
For each rule r: A -> y, consider an alternative rule r': A' -> y, where A' is obtained by removing one of the conjuncts in A.
Compare the pessimistic error rate of r against that of every such r'.
Prune r if one of the r' rules has a lower pessimistic error rate.
Repeat until we can no longer improve the generalization error.
Rule Ordering: C4.5rules
Class-based ordering is used: rules that predict the same class are grouped into the same subset.
Compute the description length of each subset.
The classes are arranged in increasing order of their total description length; the class with the smallest description length is given the highest priority.
Description length = L(exceptions) + g * L(model)
where g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5).
Example
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) -> Birds
(Give Birth=No, Live in Water=Yes) -> Fishes
(Give Birth=Yes) -> Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) -> Reptiles
( ) -> Amphibians

C4.5 decision tree: Give Birth? (Yes -> Mammals; No -> Live in Water?);
Live in Water? (Yes -> Fishes; Sometimes -> Amphibians; No -> Can Fly?);
Can Fly? (Yes -> Birds; No -> Reptiles).

RIPPER:
(Live in Water=Yes) -> Fishes
(Have Legs=No) -> Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) -> Reptiles
(Can Fly=Yes, Give Birth=No) -> Birds
( ) -> Mammals
C4.5 versus C4.5rules versus RIPPER (confusion matrices)

C4.5 and C4.5rules:
                      PREDICTED CLASS
                      Amphibians Fishes Reptiles Birds Mammals
ACTUAL   Amphibians        2       0        0      0      0
CLASS    Fishes            0       2        0      0      1
         Reptiles          1       0        3      0      0
         Birds             1       0        0      3      0
         Mammals           0       0        1      0      6

RIPPER:
                      PREDICTED CLASS
                      Amphibians Fishes Reptiles Birds Mammals
ACTUAL   Amphibians        0       0        0      0      2
CLASS    Fishes            0       3        0      0      0
         Reptiles          0       0        3      0      1
         Birds             0       0        1      2      1
         Mammals           0       2        1      0      4
Advantages of Rule-Based Classifiers
As highly expressive as decision trees
Easy to interpret
Easy to generate
Can classify new instances rapidly
Performance comparable to decision trees
Instance-Based Classifiers
Figure: a set of stored cases (Atr1, ..., AtrN, Class, with class labels A, B, C, ...) and an unseen case (Atr1, ..., AtrN) to be classified.
Store the training records.
Use the training records to predict the class label of unseen cases.
Instance Based Classifiers
Examples:
Rote-learner
Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly

Nearest neighbor
Uses k closest points (nearest neighbors) for
performing classification

Nearest Neighbor Classifiers
Basic idea: "If it walks like a duck, quacks like a duck, then it's probably a duck."
Figure: for a test record, compute the distance to the training records, then choose the k nearest records.
Nearest-Neighbor Classifiers
Requires three things
The set of stored records
Distance Metric to compute
distance between records
The value of k, the number of
nearest neighbors to retrieve

To classify an unknown record:
Compute distance to other
training records
Identify k nearest neighbors
Use class labels of nearest
neighbors to determine the
class label of unknown record
(e.g., by taking majority vote)
Definition of Nearest Neighbor
The k nearest neighbors of a record x are the data points that have the k smallest distances to x.
Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor regions around x.
A 1-nearest-neighbor classifier partitions the space into a Voronoi diagram.
Nearest Neighbor Classification
Compute the distance between two points, e.g. the Euclidean distance:
d(p, q) = sqrt( SUM_i (p_i - q_i)^2 )
Determine the class from the nearest-neighbor list:
take the majority vote of the class labels among the k nearest neighbors;
optionally weigh each vote according to distance, e.g. weight factor w = 1 / d^2.
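A minimal k-NN sketch in Python (our own illustration with made-up points), using the Euclidean distance and majority vote described above:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train, test_point, k=3):
    """train: list of (feature_vector, label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((5.0, 5.0), "-"), ((5.2, 4.8), "-")]
print(knn_predict(train, (1.1, 0.9), k=3))   # "+"
```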
Nearest Neighbor Classification
Choosing the value of k:
If k is too small, sensitive to noise points
If k is too large, neighborhood may include points from
other classes
X
Nearest Neighbor Classification
Scaling issues
Attributes may have to be scaled to prevent
distance measures from being dominated by
one of the attributes
Example:
height of a person may vary from 1.5m to 1.8m
weight of a person may vary from 90lb to 300lb
income of a person may vary from $10K to $1M
Nearest Neighbor Classification
Problem with the Euclidean measure:
High-dimensional data: curse of dimensionality.
Can produce counter-intuitive results, e.g. the two pairs of binary vectors
1 1 1 1 1 1 1 1 1 1 1 0   vs   0 1 1 1 1 1 1 1 1 1 1 1      d = 1.4142
1 0 0 0 0 0 0 0 0 0 0 0   vs   0 0 0 0 0 0 0 0 0 0 0 1      d = 1.4142
have the same Euclidean distance, even though the first pair is intuitively much more similar.
Solution: normalize the vectors to unit length.
Nearest Neighbor Classification
k-NN classifiers are lazy learners:
they do not build models explicitly,
unlike eager learners such as decision tree induction and rule-based systems;
as a result, classifying unknown records is relatively expensive.
Example: PEBLS
PEBLS: Parallel Exemplar-Based Learning System (Cost & Salzberg)
Works with both continuous and nominal features:
for nominal features, the distance between two nominal values is computed using the modified value difference metric (MVDM).
Each record is assigned a weight factor.
Number of nearest neighbors: k = 1.
Example: PEBLS
Distance between nominal attribute values V1 and V2:
d(V1, V2) = SUM_i | n_{1i} / n_1 - n_{2i} / n_2 |
where n_{ji} is the number of records with value V_j that belong to class i, and n_j is the total number of records with value V_j.

Class counts for Marital Status (from the 10-record tax data set used earlier):
           Single  Married  Divorced
Cheat=Yes     2       0        1
Cheat=No      2       4        1

Class counts for Refund:
           Yes   No
Cheat=Yes   0     3
Cheat=No    3     4

d(Single, Married)   = |2/4 - 0/4| + |2/4 - 4/4| = 1
d(Single, Divorced)  = |2/4 - 1/2| + |2/4 - 1/2| = 0
d(Married, Divorced) = |0/4 - 1/2| + |4/4 - 1/2| = 1
d(Refund=Yes, Refund=No) = |0/3 - 3/7| + |3/3 - 4/7| = 6/7
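A short Python sketch of the MVDM computation above (our own illustration), using the Marital Status counts from the table:

```python
counts = {                      # value -> per-class counts (Cheat = Yes / No)
    "Single":   {"Yes": 2, "No": 2},
    "Married":  {"Yes": 0, "No": 4},
    "Divorced": {"Yes": 1, "No": 1},
}

def mvdm(v1, v2):
    n1, n2 = sum(counts[v1].values()), sum(counts[v2].values())
    return sum(abs(counts[v1][c] / n1 - counts[v2][c] / n2) for c in ("Yes", "No"))

print(mvdm("Single", "Married"))    # 1.0
print(mvdm("Single", "Divorced"))   # 0.0
print(mvdm("Married", "Divorced"))  # 1.0
```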
Example: PEBLS
Distance between record X and record Y:
Delta(X, Y) = w_X * w_Y * SUM_{i=1..d} d(X_i, Y_i)^2
where d is the number of attributes and
w_X = (number of times X is used for prediction) / (number of times X predicts correctly)
w_X is approximately 1 if X makes accurate predictions most of the time;
w_X > 1 if X is not reliable for making predictions.

Tid Refund Marital Status Taxable Income Cheat
X   Yes    Single         125K           No
Y   No     Married        100K           No