
Decision Tree Induction

Lecture 12
14-08-09


Example: a decision tree for classifying vertebrates as mammals or non-mammals.
Training attributes: Name, Body temp., Skin cover, Gives birth, Aquatic creature, Legs, Hibernates, Class label.

Decision tree (figure):
• Root node: Body temp.
  – cold → leaf node: Non-mammals
  – warm → internal node: Gives birth
      • yes → leaf node: Mammals
      • no → leaf node: Non-mammals

Example: whether a person will buy a computer or not

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no

Output: A Decision Tree for "buys_computer"

• To classify an unknown sample, the attribute values of the sample are tested against the decision tree: a path is traced from the root to a leaf node, which holds the class prediction for that sample.
• Decision trees can be converted easily to classification rules.

Decision tree (figure):
• Root node: age?
  – <=30 → student?
      • no → buys_computer = no
      • yes → buys_computer = yes
  – 31..40 → buys_computer = yes
  – >40 → credit_rating?
      • excellent → buys_computer = no
      • fair → buys_computer = yes
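
For example, the tree above corresponds directly to classification rules of the form (one rule per root-to-leaf path):
• IF age <= 30 AND student = no THEN buys_computer = no
• IF age <= 30 AND student = yes THEN buys_computer = yes
• IF age = 31..40 THEN buys_computer = yes
• IF age > 40 AND credit_rating = excellent THEN buys_computer = no
• IF age > 40 AND credit_rating = fair THEN buys_computer = yes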

Apply Model to Test Data

• Start from the root of the tree.

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Decision tree (figure):
• Refund?
  – Yes → leaf: NO
  – No → MarSt?
      • Single, Divorced → TaxInc?
          – < 80K → leaf: NO
          – >= 80K → leaf: YES
      • Married → leaf: NO




Apply Model to Test Data (contd.)

Tracing the test record (Refund = No, Marital Status = Married, Taxable Income = 80K) through the tree: Refund = No → MarSt = Married → leaf NO.
Assign Cheat = "No" to the test record.


How to build a decision tree
• Exponentially many trees can be built from a given set of attributes.
• Finding the optimal tree is computationally infeasible.
• However, efficient algorithms have been designed that construct reasonably accurate decision trees in a reasonable amount of time.
• These algorithms usually employ a greedy strategy that grows the decision tree by making a series of locally optimal decisions about which attribute to use for partitioning the data.


Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT



Decision Tree (contd.)

Lecture 13
17-08-09


General Structure of Hunt's Algorithm
• Let Dt be the set of training records that are associated with a node t.
• General procedure:
  – If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  – If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
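
A minimal Python sketch of this recursive procedure (not from the lecture; the Node class, the choose_split hook, and the majority-class fallback are illustrative assumptions):

```python
from collections import Counter

class Node:
    def __init__(self, label=None, test=None, children=None):
        self.label = label        # class label (leaf nodes only)
        self.test = test          # attribute test condition (internal nodes)
        self.children = children  # dict: test outcome -> child Node

def hunt(records, classes, default_class, choose_split):
    """Grow a decision tree following Hunt's procedure.

    records       -- list of training records (e.g. dicts of attribute values)
    classes       -- list of class labels, aligned with records
    default_class -- label used when a node receives no records
    choose_split  -- callable returning (test, partition_fn) or None, where
                     partition_fn(records, classes) maps each test outcome
                     to a (sub_records, sub_classes) pair
    """
    # Case 1: Dt is empty -> leaf labeled with the default class
    if not records:
        return Node(label=default_class)
    # Case 2: all records belong to one class -> leaf labeled with that class
    if len(set(classes)) == 1:
        return Node(label=classes[0])
    # Case 3: mixed classes -> pick an attribute test and split recursively
    majority = Counter(classes).most_common(1)[0][0]
    split = choose_split(records, classes)
    if split is None:             # no useful test condition remains
        return Node(label=majority)
    test, partition = split
    children = {
        outcome: hunt(sub_recs, sub_cls, majority, choose_split)
        for outcome, (sub_recs, sub_cls) in partition(records, classes).items()
    }
    return Node(test=test, children=children)
```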

Hunt's Algorithm (illustrated on the tax-cheat training data)
(a) A single leaf node predicting the majority class: Cheat = no.
(b) Split on Home Owner: Yes → Cheat = no; No → Cheat = no (this subset is still impure).
(c) The Home Owner = No branch is split on Marital Status: Married → Cheat = no; Single, Divorced → Cheat = yes.
(d) The Single, Divorced branch is split on Taxable Income: < 80K → Cheat = no; >= 80K → Cheat = yes.

Design Issues of Tree Induction
• Determine how to split the records
  – Each recursive step of the tree-generating process must select an attribute test condition to divide the records into smaller subsets.
  – For this, the algorithm must provide a method for specifying the test condition for different attribute types, as well as an objective measure for evaluating the goodness of each test condition.
• Determine when to stop splitting
  – A stopping condition is needed to terminate the tree-generating process.
  – A simple strategy is to continue expanding a node until either all the records belong to the same class or all records have identical attribute values.


How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous

• Depends on number of ways to split


– 2-way split
– Multi-way split



Methods for expressing attribute test conditions

• Binary attributes
  – The test condition for a binary attribute generates two potential outcomes, e.g.:
    Body temperature? → Warm-blooded / Cold-blooded


Splitting Based on Nominal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  – Marital status: single / married / divorced
  – CarType: Family / Sports / Luxury

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  – CarType: {Sports, Luxury} vs. {Family}   OR   {Family, Luxury} vs. {Sports}
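
Finding the optimal binary partitioning means examining the 2^(k-1) - 1 ways of grouping k distinct values into two non-empty subsets. A small enumeration sketch in Python (illustrative, not from the lecture):

```python
from itertools import combinations

def binary_partitions(values):
    """Yield the 2**(k-1) - 1 two-subset groupings of a nominal attribute."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Fixing the first value on the left side avoids mirror-image duplicates.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {first, *combo}
            right = set(values) - left
            if right:                       # skip the trivial all-vs-nothing split
                yield left, right

for left, right in binary_partitions(["Family", "Sports", "Luxury"]):
    print(left, "vs.", right)
# Prints the three groupings: {Family} vs. {Sports, Luxury},
# {Family, Sports} vs. {Luxury}, and {Family, Luxury} vs. {Sports}.
```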


Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as there are distinct values.
  – Size: Small / Medium / Large

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  – Size: {Small, Medium} vs. {Large}   OR   {Small} vs. {Medium, Large}

• What about the split {Small, Large} vs. {Medium}?


Splitting Based on Continuous Attributes
• Different ways of handling:
  – Binary decision: (A < v) or (A >= v)
    • consider all possible split points and find the best cut
    • can be computationally intensive
  – Multi-way split:
    • The algorithm must consider all possible ranges of continuous values, vi <= A < vi+1, for i = 1, ..., k.
    • Discretization strategies are also applied to form an ordinal categorical attribute.
    • After discretization, a new ordinal value is assigned to each discretized interval.
    • The order property must be preserved.


Splitting Based on Continuous Attributes

(i) Binary split: Taxable Income > 80K? → Yes / No
(ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), >= 80K
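
A tiny sketch of the discretization idea in Python (the bin edges are taken from the multi-way split above; the function name and the choice of half-open intervals are illustrative assumptions):

```python
def discretize(income, edges=(10_000, 25_000, 50_000, 80_000)):
    """Map a continuous income value to an ordinal interval index.
    Index 0 means '< 10K'; index len(edges) means '>= 80K'."""
    for i, edge in enumerate(edges):
        if income < edge:
            return i
    return len(edges)

print([discretize(v) for v in (8_000, 30_000, 95_000)])   # -> [0, 2, 4]
```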


How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate test conditions (figure):
• Gender? (M / F): M → C0: 6, C1: 4;  F → C0: 4, C1: 6
• Car Type? (Family / Sports / Luxury): Family → C0: 1, C1: 3;  Sports → C0: 8, C1: 0;  Luxury → C0: 1, C1: 7
• Customer ID? (c1 … c20): each child node receives a single record, e.g. C0: 1, C1: 0 or C0: 0, C1: 1

Which test condition is the best?


How to determine the Best Split
• Greedy approach:
  – Nodes with a homogeneous class distribution are preferred.
• Need a measure of node impurity:
  – C0: 5, C1: 5 → non-homogeneous, high degree of impurity
  – C0: 9, C1: 1 → homogeneous, low degree of impurity


Measures for selecting the best split
• These measures are defined in terms of the class distribution before and after splitting.
• Let pi denote the fraction of records belonging to class i at a given node t.
• For the previous dataset, the class distribution before splitting is (0.5, 0.5) because there are equal numbers of records of both classes.


• Consider a split on "Gender": the class distributions of the child nodes are (0.6, 0.4) and (0.4, 0.6).
• Similarly for "Car Type".
• A node with class distribution (0, 1) has zero impurity, and a node with class distribution (0.5, 0.5) has the highest impurity.
• Impurity measures are:


Measures of Node Impurity

• Gini index: $\mathrm{Gini}(t) = 1 - \sum_{i=0}^{c-1} p_i^2$

• Entropy: $\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p_i \log_2 p_i$

• (Mis)classification error: $\mathrm{Error}(t) = 1 - \max_i p_i$

where pi is the fraction of records belonging to class i at a given node t, c is the number of classes, and 0 log 0 = 0 in entropy calculations.
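
A small Python sketch (not from the lecture) that computes all three measures from the per-class record counts at a node; the function name is illustrative:

```python
import math

def impurities(counts):
    """Return (gini, entropy, classification_error) for a node,
    given its per-class record counts."""
    n = sum(counts)
    probs = [c / n for c in counts]
    gini = 1.0 - sum(p ** 2 for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)   # 0 log 0 = 0
    error = 1.0 - max(probs)
    return gini, entropy, error

# Example: a node with class counts C1 = 1, C2 = 5
print(impurities([1, 5]))   # -> approximately (0.278, 0.650, 0.167)
```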


How to Find the Best Split

Before splitting: impurity of the parent node = M0.

Candidate splits (figure):
• Attribute A splits the parent into nodes N1 and N2, with impurities M1 and M2; M12 is their weighted average.
• Attribute B splits the parent into nodes N3 and N4, with impurities M3 and M4; M34 is their weighted average.

Gain of A = M0 - M12 vs. Gain of B = M0 - M34: choose the attribute with the larger gain.

Measure of Impurity: GINI
• Gini index for a given node t:

  $\mathrm{GINI}(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information

  Examples:
  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500

Examples for computing GINI

$\mathrm{GINI}(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$

• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444


Examples for computing Entropy

$\mathrm{Entropy}(t) = -\sum_{i=0}^{c-1} p_i \log_2 p_i$

• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92


Examples for Computing Error

$\mathrm{Error}(t) = 1 - \max_i p_i$

• C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
  Error = 1 - max(0, 1) = 1 - 1 = 0

• C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

• C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3


Comparison among Splitting Criteria
For a 2-class problem (figure: impurity measure plotted against p):
• The figure compares the values of the impurity measures for binary classification problems.
• p refers to the fraction of records that belong to one of the two classes.
• All measures attain their maximum value when the class distribution is uniform (p = 0.5) and their minimum value when all records belong to the same class (p = 0 or p = 1).

Customer ID Gender Car Type Shirt Size Class

1 M Family S C0
2 M Sports M C0
3 M Sports M C0
4 M Sports L C0
5 M Sports XL C0
6 M Sports XL C0
7 F Sports S C0
8 F Sports S C0
9 F Sports M C0
10 F Luxury L C0
11 M Family L C1
12 M Family XL C1
13 M Family M C1
14 M Luxury XL C1
15 F Luxury S C1
16 F Luxury S C1
17 F Luxury M C1
18 F Luxury M C1
19 F Luxury M C1
20 F Luxury L C1

• Q1: Consider the training examples shown on the previous slide for a binary classification problem.
  – (a) Compute the Gini index for the Customer ID attribute.
  – (b) Compute the Gini index for the Gender attribute.
  – (c) Compute the Gini index for the Car Type attribute using a multi-way split.


Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

  $\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, \mathrm{GINI}(i)$

  where ni = number of records at child i, and n = number of records at node p.
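
A hedged Python sketch of this weighted computation (helper names are illustrative); the example reuses the CarType multi-way count matrix that appears two slides later:

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` holds one per-class count list
    per child node."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

# CarType multi-way split: Family (C1=1, C2=4), Sports (C1=2, C2=1), Luxury (C1=1, C2=1)
print(round(gini_split([[1, 4], [2, 1], [1, 1]]), 3))   # -> 0.393
```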


Binary Attributes: Computing GINI Index

• Splits into two partitions.
• Effect of weighting the partitions: larger and purer partitions are sought.

Split on B? (figure):
  Parent: C1 = 6, C2 = 6, Gini = 0.500
  Node N1 (Yes): C1 = 5, C2 = 2
  Node N2 (No):  C1 = 1, C2 = 4

  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
  Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing Gini Index

• For each distinct value, gather the counts for each class in the dataset.
• Use the count matrix to make decisions.

Multi-way split (CarType):
  Family: C1 = 1, C2 = 4;  Sports: C1 = 2, C2 = 1;  Luxury: C1 = 1, C2 = 1
  Gini = 0.393

Two-way splits (find the best partition of values):
  {Sports, Luxury} vs. {Family}:  C1 = 3 / 1,  C2 = 2 / 4,  Gini = 0.400
  {Sports} vs. {Family, Luxury}:  C1 = 2 / 2,  C2 = 1 / 5,  Gini = 0.419


Continuous Attributes: Computing Gini Index
• Use binary decisions based on one value (e.g., Taxable Income > 80K? Yes / No).
• Several choices for the splitting value:
  – Number of possible splitting values = number of distinct values.
• Each splitting value v has a count matrix associated with it:
  – Class counts in each of the partitions, A < v and A >= v.
• Simple method to choose the best v:
  – For each v, scan the database to gather the count matrix and compute its Gini index.
  – Computationally inefficient! Repetition of work.


Continuous Attributes: Computing Gini Index (contd.)
• For efficient computation, for each attribute:
  – Sort the attribute on its values.
  – Linearly scan these values, each time updating the count matrix and computing the Gini index.
  – Choose the split position that has the least Gini index.

Sorted Taxable Income values (class label Cheat above each value) and candidate split positions:

  Cheat:           No    No    No    Yes   Yes   Yes   No    No    No    No
  Taxable Income:  60    70    75    85    90    95    100   120   125   220
  Split position:  55    65    72    80    87    92    97    110   122   172   230
  Yes (<=, >):     0,3   0,3   0,3   0,3   1,2   2,1   3,0   3,0   3,0   3,0   3,0
  No (<=, >):      0,7   1,6   2,5   3,4   3,4   3,4   3,4   4,3   5,2   6,1   7,0
  Gini:            0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The least Gini index (0.300) is obtained at split position 97, i.e., Taxable Income <= 97.
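
A Python sketch of this sorted linear scan (illustrative, not from the lecture; candidate split points are taken midway between consecutive distinct values, so the best point comes out as 97.5 rather than the rounded 97 shown above):

```python
def best_split(values, labels):
    """Find the best binary split on a continuous attribute with one
    sorted linear scan, using the Gini index."""
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    data = sorted(zip(values, labels))
    classes = set(labels)
    left = {c: 0 for c in classes}                  # counts for A <= v
    right = {c: labels.count(c) for c in classes}   # counts for A > v
    n = len(labels)
    best_gini, best_point = float("inf"), None
    for i in range(len(data) - 1):
        _, y = data[i]
        left[y] += 1                # move one record from the right partition...
        right[y] -= 1               # ...to the left partition
        if data[i][0] == data[i + 1][0]:
            continue                # no valid cut between equal values
        point = (data[i][0] + data[i + 1][0]) / 2
        weighted = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if weighted < best_gini:
            best_gini, best_point = weighted, point
    return best_gini, best_point

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_split(income, cheat))   # -> approximately (0.3, 97.5)
```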


Alternative Splitting Criteria based on INFO
• Entropy at a given node t:

  $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

  – Measures the homogeneity of a node.
    • Maximum (log nc) when records are equally distributed among all classes, implying the least information
    • Minimum (0.0) when all records belong to one class, implying the most information
  – Entropy-based computations are similar to the GINI index computations.

Splitting Based on INFO...
• Information Gain:

  $\mathrm{GAIN}_{split} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, \mathrm{Entropy}(i)$

  where the parent node p is split into k partitions and ni is the number of records in partition i.

  – Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
  – Used in ID3 and C4.5.
  – Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
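
A minimal Python sketch of this computation (helper names are illustrative, not from the lecture):

```python
import math

def entropy(counts):
    """Entropy of a node from its per-class record counts (0 log 0 = 0)."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def information_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

# Example: parent (C1=6, C2=6) split into children (5, 2) and (1, 4)
print(round(information_gain([6, 6], [[5, 2], [1, 4]]), 3))   # -> about 0.196
```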

Splitting Based on INFO...
• Gain Ratio:

  $\mathrm{GainRATIO}_{split} = \frac{\mathrm{GAIN}_{split}}{\mathrm{SplitINFO}}$, where $\mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

  The parent node p is split into k partitions; ni is the number of records in partition i.

  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
  – Used in C4.5.
  – Designed to overcome the disadvantage of Information Gain.
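
The gain ratio divides that gain by SplitINFO, which depends only on the sizes of the partitions; a short hedged sketch (the numeric gain value is carried over from the previous example):

```python
import math

def split_info(children_counts):
    """SplitINFO = -sum_i (n_i / n) * log2(n_i / n), computed from child sizes."""
    sizes = [sum(child) for child in children_counts]
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s)

# Same split as in the information-gain example: children (5, 2) and (1, 4).
gain = 0.196                                  # GAIN_split from the previous sketch
s_info = split_info([[5, 2], [1, 4]])         # about 0.980
print(round(gain / s_info, 3))                # gain ratio, about 0.200
```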


Splitting Criteria based on Classification Error
• Classification error at a node t:

  $\mathrm{Error}(t) = 1 - \max_i P(i \mid t)$

• Measures the misclassification error made by a node.
  – Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information
  – Minimum (0.0) when all records belong to one class, implying the most interesting information
