
INTRODUCTION TO DATA ANALYTICS

XIAOFENG ZHOU | ITEC3040


CLASSIFICATION

Basic Concept

Decision Tree Induction

Bayes Classification Method

KNN method

Model Evaluation and Selection

Techniques to Improve Classification Accuracy

SUPERVISED VS. UNSUPERVISED LEARNING

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

FUNDAMENTAL ASSUMPTION OF LEARNING

Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).

In practice, this assumption is often violated to some degree.

Strong violations will clearly result in poor classification accuracy. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

CLASSIFICATION: DEFINITION

Given a collection of records (training set)

Each record contains a set of attributes; one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

CLASSIFICATION: A TWO-STEP PROCESS

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: for classifying future or unknown objects

Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select models, it is called validation (test) set

PROCESS (1): MODEL CONSTRUCTION

Training Data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The training data are fed into a classification algorithm, which produces a classifier (model), e.g.:

IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'

PROCESS (2): USING THE MODEL IN PREDICTION

The classifier from step 1 is applied to a test set to estimate accuracy, and then to unseen data.

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4)  ->  Tenured? (the rule above gives 'yes')
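To make the two steps concrete, here is a minimal R sketch (the data frame and the classify helper are my own names; the rule is simply the one read off the slide): step 1 encodes the learned classifier, step 2 applies it to the unseen record (Jeff, Professor, 4).

# Training data from the slide
train <- data.frame(
  name  = c("Mike", "Mary", "Bill", "Jim", "Dave", "Anne"),
  rank  = c("Assistant Prof", "Assistant Prof", "Professor",
            "Associate Prof", "Assistant Prof", "Associate Prof"),
  years = c(3, 7, 2, 7, 6, 3),
  tenured = c("no", "yes", "yes", "yes", "no", "no")
)

# Step 1 (model construction): the classifier learned from the training data,
# expressed as a simple rule
classify <- function(rank, years) {
  ifelse(rank == "Professor" | years > 6, "yes", "no")
}

# The rule reproduces all the training labels
all(classify(train$rank, train$years) == train$tenured)   # TRUE

# Step 2 (model usage): classify the unseen record (Jeff, Professor, 4)
classify("Professor", 4)   # "yes"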

EXAMPLES OF CLASSIFICATION

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.

A decision is needed: whether to put a new patient in an intensive-care unit.

Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.

Problem: to predict high-risk patients and discriminate them from low-risk patients.

Other applications:

Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

THE DATA AND THE GOAL

Data: A set of data records (also called examples, instances or cases) described by

k attributes: A1, A2, …, Ak

a class: each example is labelled with a pre-defined class (the class attribute)

The class attribute has a set of n discrete values, where n >= 2.

Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

Table is from Web Data Mining by Bing Liu

DECISION TREE INDUCTION

Decision tree induction is one of the most widely used techniques for classification.

Its classification accuracy is competitive with other methods, and it is very efficient.

The classification model is a tree, called a decision tree.

How does a decision tree work?

EXAMPLE OF A DECISION TREE

Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (the internal nodes test the splitting attributes; the leaves hold the class labels)

Refund?
  Yes  -> NO
  No   -> MarSt?
            Married           -> NO
            Single, Divorced  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
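As a sketch (not part of the slides), the tree above can be written as nested attribute tests in R and checked against the ten training records; the predict_cheat helper name is my own.

# Training data from the slide (Taxable Income in thousands)
cheat_data <- data.frame(
  refund  = c("Yes","No","No","Yes","No","No","Yes","No","No","No"),
  marital = c("Single","Married","Single","Married","Divorced",
              "Married","Divorced","Single","Married","Single"),
  income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  cheat   = c("No","No","No","No","Yes","No","No","Yes","No","Yes")
)

# The decision tree as nested tests: each internal node tests one attribute
predict_cheat <- function(refund, marital, income) {
  if (refund == "Yes")      return("No")
  if (marital == "Married") return("No")
  if (income < 80)          return("No")   # TaxInc < 80K
  "Yes"                                    # TaxInc > 80K
}

preds <- mapply(predict_cheat, cheat_data$refund, cheat_data$marital, cheat_data$income)
all(preds == cheat_data$cheat)   # TRUE: the tree fits the training data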

USE THE DECISION TREE

Test Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
100   No       ?                65K              ?

Start from the root of the tree and, at each internal node, follow the branch that matches the record's attribute values until a leaf is reached; the leaf's label is the predicted class. With Refund = No and Taxable Income = 65K (< 80K), every matching path in the tree above ends in a NO leaf, so the prediction is Cheat = No.

IS THE DECISION TREE UNIQUE?

No. Here is another tree that fits the same data.

We want a small and accurate tree: it is easier to understand and tends to perform better.

All current tree-building algorithms are heuristic algorithms.

MarSt?
  Married           -> NO
  Single, Divorced  -> Refund?
                         Yes -> NO
                         No  -> TaxInc?
                                  < 80K -> NO
                                  > 80K -> YES

There could be more than one tree that fits the same data!

ALGORITHM FOR DECISION TREE INDUCTION

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf

There are no samples left

DECISION TREE INDUCTION

Many Algorithms:

Hunt’s Algorithm (one of the earliest)

CART

ID3, C4.5

SLIQ, SPRINT

Issues

Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?

Determine when to stop splitting

HOW TO SPECIFY TEST CONDITION?

Depends on attribute types

Nominal

Ordinal

Continuous

Depends on number of ways to split

2-way split

Multi-way split

SPLITTING BASED ON NOMINAL ATTRIBUTES

Multi-way split: use as many partitions as there are distinct values.

  CarType?  ->  Family | Sports | Luxury

Binary split: divide the values into two subsets; need to find the optimal partitioning.

  CarType?  ->  {Sports, Luxury} | {Family}     OR     {Family, Luxury} | {Sports}

SPLITTING BASED ON ORDINAL ATTRIBUTES

Multi-way split: use as many partitions as there are distinct values.

  Size?  ->  Small | Medium | Large

Binary split: divide the values into two subsets; need to find the optimal partitioning.

  Size?  ->  {Small, Medium} | {Large}     OR     {Small} | {Medium, Large}

What about the split {Small, Large} | {Medium}? It groups values that are not adjacent in the ordering, so it does not preserve the order of the attribute and is normally not allowed for ordinal attributes.

SPLITTING BASED ON CONTINUOUS ATTRIBUTES

Different ways of handling

Discretization to form an ordinal categorical attribute

Static – discretize once at the beginning

Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.

Binary decision: (A < v) or (A >= v)

consider all possible splits and find the best cut

can be more computationally intensive

SPLITTING BASED ON CONTINUOUS ATTRIBUTES

(i) Binary split:     Taxable Income > 80K?  ->  Yes | No

(ii) Multi-way split: Taxable Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
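A minimal sketch of the "consider all possible splits and find the best cut" idea, applied to the Taxable Income values from the earlier cheat example. It uses the Gini index, which is defined later in this deck, and the helper names are my own; the cut it selects is whatever minimizes the weighted Gini on these ten records, which need not be the illustrative 80K threshold used in the earlier tree.

# Taxable Income and Cheat labels from the earlier training table (in thousands)
income <- c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90)
cheat  <- c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes")

gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Candidate cuts: midpoints between consecutive sorted distinct values
v    <- sort(unique(income))
cuts <- (head(v, -1) + tail(v, -1)) / 2

# Weighted Gini of the binary split (A < cut) vs (A >= cut) for each candidate
weighted_gini <- sapply(cuts, function(cut) {
  left <- income < cut
  mean(left) * gini(cheat[left]) + mean(!left) * gini(cheat[!left])
})

data.frame(cut = cuts, weighted_gini = round(weighted_gini, 3))
cuts[which.min(weighted_gini)]   # the best cut for this data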

HOW TO DETERMINE THE BEST SPLIT

Before splitting: 10 records of class 0, 10 records of class 1.

Own Car?
  Yes -> C0: 6, C1: 4
  No  -> C0: 4, C1: 6

Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7

Student ID?
  c1  -> C0: 1, C1: 0
  ...
  c10 -> C0: 1, C1: 0
  c11 -> C0: 0, C1: 1
  ...
  c20 -> C0: 0, C1: 1

Which test condition is the best?

HOW TO DETERMINE THE BEST SPLIT

The key to building a decision tree is which attribute to choose in order to branch.

The objective is to reduce impurity or uncertainty in the data as much as possible.

A subset of data is pure if all instances belong to the same class.

Need a measure of node impurity

Information gain

Gain ratio

Gini Index

BRIEF INTRODUCTION OF ENTROPY

Entropy(Information Theory)

A measure of uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym},

entropy(Y) = -\sum_{j=1}^{m} P(Y = y_j) \log_2 P(Y = y_j),  where  \sum_{j=1}^{m} P(Y = y_j) = 1

Define 0 \cdot \log_2 0 = 0.

Interpretation: higher entropy -> higher uncertainty.

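A small R helper (my own, not from the slides) makes the definition concrete; the counts [9+, 5-] of the PlayTennis data used later give the 0.940 value quoted there.

# Entropy of a class distribution given as counts, using the 0*log2(0) = 0 convention
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(9, 5))    # 0.940  (the PlayTennis data set used later)
entropy(c(7, 7))    # 1      (maximum uncertainty for two classes)
entropy(c(14, 0))   # 0      (a pure set: no uncertainty)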

ATTRIBUTE SELECTION MEASURE: INFORMATION GAIN (ID3/C4.5)

Select the attribute with the highest information gain

Let p_i = P(Y = y_i) be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
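A short R sketch of Gain(A) computed from per-partition class counts (helper names are my own); the Humidity and Wind splits of the PlayTennis example that follows reproduce the 0.151 and 0.048 values shown there.

# entropy helper as in the previous sketch
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log2(p))
}

# partitions: a list of class-count vectors, one per value of attribute A
info_gain <- function(parent_counts, partitions) {
  n <- sum(parent_counts)
  expected_info <- sum(sapply(partitions, function(d) sum(d) / n * entropy(d)))
  entropy(parent_counts) - expected_info
}

# Humidity split of the PlayTennis data: High = [3+, 4-], Normal = [6+, 1-]
info_gain(c(9, 5), list(High = c(3, 4), Normal = c(6, 1)))   # 0.151

# Wind split: Weak = [6+, 2-], Strong = [3+, 3-]
info_gain(c(9, 5), list(Weak = c(6, 2), Strong = c(3, 3)))   # 0.048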

AN ILLUSTRATION EXAMPLE


Training examples:

Day   Outlook    Temperature   Humidity   Wind     PlayTennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rainy      Mild          High       Weak     Yes
D5    Rainy      Cool          Normal     Weak     Yes
D6    Rainy      Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rainy      Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rainy      Mild          High       Strong   No

Which attribute is the best for the root?

S: [9+, 5-], Entropy = 0.940

Split on Humidity:
  High   -> S_High: [3+, 4-], Entropy = 0.985
  Normal -> S_Normal: [6+, 1-], Entropy = 0.592
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Split on Wind:
  Weak   -> S_Weak: [6+, 2-], Entropy = 0.811
  Strong -> S_Strong: [3+, 3-], Entropy = 1.0
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Split on Outlook:
  Sunny    -> S_Sunny: [2+, 3-], Entropy = 0.97095
  Overcast -> S_Overcast: [4+, 0-], Entropy = 0
  Rainy    -> S_Rain: [3+, 2-], Entropy = 0.97095
Gain(S, Outlook) = 0.940 - (5/14)(0.97095) - (4/14)(0) - (5/14)(0.97095) = 0.2467

Split on Temperature:
  Hot  -> S_Hot: [2+, 2-], Entropy = 1
  Mild -> S_Mild: [4+, 2-], Entropy = 0.91826
  Cool -> S_Cool: [3+, 1-], Entropy = 0.811
Gain(S, Temperature) = 0.940 - (4/14)(1) - (6/14)(0.91826) - (4/14)(0.811) = 0.029

Outlook has the highest information gain, so it is selected as the root attribute.

An illustrative example (Cont’d.)

S: {D1, D2, …, D14} [9+, 5-]

Outlook?
  Sunny    -> S_Sunny: {D1, D2, D8, D9, D11} [2+, 3-], Entropy = 0.97095  -> ?
  Overcast -> S_Overcast: {D3, D7, D12, D13} [4+, 0-], Entropy = 0        -> Yes
  Rainy    -> S_Rain: {D4, D5, D6, D10, D14} [3+, 2-], Entropy = 0.97095  -> ?

Which attribute should be tested here (at the Sunny branch): Humidity, Temperature, or Wind?

Gain(S_Sunny, Humidity)    = 0.97095 - (3/5)(0.0) - (2/5)(0.0) = 0.97095
Gain(S_Sunny, Temperature) = 0.97095 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57095
Gain(S_Sunny, Wind)        = 0.97095 - (2/5)(1.0) - (3/5)(0.918) = 0.02015

Therefore, Humidity is chosen as the next test attribute for the left branch.

An illustrative example (Cont’d.)

S: {D1, D2, …, D14} [9+, 5-]

Outlook?
  Sunny    -> S_Sunny: {D1, D2, D8, D9, D11} [2+, 3-], Entropy = 0.97095
                Humidity?
                  High   -> {D1, D2, D8} [0+, 3-]  -> No
                  Normal -> {D9, D11} [2+, 0-]     -> Yes
  Overcast -> S_Overcast: {D3, D7, D12, D13} [4+, 0-], Entropy = 0  -> Yes
  Rainy    -> S_Rain: {D4, D5, D6, D10, D14} [3+, 2-], Entropy = 0.97095
                Wind?
                  Strong -> {D6, D14} [0+, 2-]      -> No
                  Weak   -> {D4, D5, D10} [3+, 0-]  -> Yes

BASIC DECISION TREE LEARNING ALGORITHM

1. Select the "best" attribute A for the root node.

2. Create a new descendant of the node for each value of A.

3. Put the training examples into the descendant nodes.

4. For each descendant node (this is where we decide when to terminate the recursive process):

if the training examples associated with the node belong to the same class, the node is marked as a leaf node and labeled with the class;

else if there are no remaining attributes on which the examples can be further partitioned, the node is marked as a leaf node and labeled with the most common class among the training cases;

else if there is no example for the node, the node is marked as a leaf node and labeled with the majority class in its parent node;

otherwise, recursively apply the process on the new node.
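The steps above can be turned into a compact recursive R sketch. This is my own simplified implementation for categorical attributes using information gain, not the course's reference code; run on the PlayTennis table it selects Outlook at the root, then Humidity under Sunny and Wind under Rainy, matching the tree built by hand earlier.

# data-frame versions of the entropy / information-gain helpers sketched earlier
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log2(p))
}

info_gain <- function(data, attribute, class_col) {
  parent <- entropy(table(data[[class_col]]))
  parts  <- split(data[[class_col]], data[[attribute]], drop = TRUE)
  parent - sum(sapply(parts, function(y) length(y) / nrow(data) * entropy(table(y))))
}

majority <- function(y) names(which.max(table(y)))

# Recursive divide-and-conquer tree builder for categorical attributes
build_tree <- function(data, attrs, class_col) {
  y <- data[[class_col]]
  if (length(unique(y)) == 1) return(unique(y))    # pure node -> leaf
  if (length(attrs) == 0)     return(majority(y))  # no attributes left -> majority leaf
  gains <- sapply(attrs, function(a) info_gain(data, a, class_col))
  best  <- attrs[which.max(gains)]                 # attribute with highest gain
  # empty branches cannot arise here because split(..., drop = TRUE) keeps only observed values
  subsets  <- split(data, data[[best]], drop = TRUE)
  children <- lapply(subsets, build_tree,
                     attrs = setdiff(attrs, best), class_col = class_col)
  list(split_on = best, branches = children)
}

# The PlayTennis training examples from the earlier table
play_tennis <- data.frame(
  Outlook     = c("Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                  "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"),
  Temperature = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                  "Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
  Humidity    = c("High","High","High","High","Normal","Normal","Normal",
                  "High","Normal","Normal","Normal","High","Normal","High"),
  Wind        = c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                  "Weak","Weak","Weak","Strong","Strong","Weak","Strong"),
  PlayTennis  = c("No","No","Yes","Yes","Yes","No","Yes",
                  "No","Yes","Yes","Yes","Yes","Yes","No"),
  stringsAsFactors = FALSE
)

tree <- build_tree(play_tennis, c("Outlook", "Temperature", "Humidity", "Wind"), "PlayTennis")
str(tree)   # Outlook at the root; Humidity under Sunny, Wind under Rainy, Yes under Overcast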

GAIN RATIO FOR ATTRIBUTE SELECTION (C4.5)

Information gain measure is biased towards attributes with a large number of values

C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. Temperature splits D into Hot (4 tuples), Mild (6 tuples), and Cool (4 tuples):

SplitInfo_Temperature(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557

gain_ratio(Temperature) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute
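The Temperature numbers can be checked with a couple of lines of R (the helper name is my own):

# SplitInfo_A(D) from the sizes of the partitions induced by A
split_info <- function(sizes) {
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}

# Temperature partitions D into Hot (4), Mild (6), Cool (4) tuples
split_info(c(4, 6, 4))           # 1.557
0.029 / split_info(c(4, 6, 4))   # gain_ratio(Temperature) = 0.019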

GINI INDEX (CART, IBM INTELLIGENTMINER)

If a data set D contains examples from n classes, gini index, gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D

If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as

gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).


COMPUTATION OF GINI INDEX

Ex. D has 9 tuples with playTennis = yes and 5 with no:

gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Suppose the attribute Temperature partitions D into D_1 = {Cool, Mild} with 10 tuples and D_2 = {Hot} with 4 tuples:

gini_{Temperature \in \{Cool, Mild\}}(D) = \frac{10}{14}\, Gini(D_1) + \frac{4}{14}\, Gini(D_2) = \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.443

Gini_{Cool, Hot} is 0.458 and Gini_{Mild, Hot} is 0.450. Thus, we split on {Cool, Mild} (and {Hot}) since it has the lowest Gini index.

All attributes are assumed continuous-valued

May need other tools, e.g., clustering, to get the possible split values

Can be modified for categorical attributes

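The same Gini figures can be reproduced with a short R sketch (the gini helper is my own):

# Gini index of a class distribution given as counts
gini <- function(counts) 1 - sum((counts / sum(counts))^2)

gini(c(9, 5))                                           # gini(D) = 0.459

# Temperature in {Cool, Mild} vs {Hot}: D1 = [7+, 3-], D2 = [2+, 2-]
(10 / 14) * gini(c(7, 3)) + (4 / 14) * gini(c(2, 2))    # 0.443

# The other two binary groupings of Temperature
(8 / 14) * gini(c(5, 3)) + (6 / 14) * gini(c(4, 2))     # {Cool, Hot} vs {Mild}: 0.458
(10 / 14) * gini(c(6, 4)) + (4 / 14) * gini(c(3, 1))    # {Mild, Hot} vs {Cool}: 0.450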

COMPARING ATTRIBUTE SELECTION MEASURES

The three measures, in general, return good results but

Information gain:

biased towards multivalued attributes

Gain ratio:

tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions and purity in both partitions

LIMITATION: OVERFITTING

Overfitting: An induced tree may overfit the training data

Too many branches, some may reflect anomalies due to noise or outliers

Poor accuracy for unseen samples

Two approaches to avoid overfitting

Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold

Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees

Use a set of data different from the training data to decide which is the best pruned tree

Postpruning is the approach used in C4.5.
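In R, rpart implements postpruning through its complexity parameter cp: grow a large tree, inspect the cross-validated error of the nested pruned subtrees, and keep the best one. A minimal sketch on the iris data used later in this deck (note that rpart relies on cross-validation estimates rather than a separate pruning set):

library(rpart)

# Grow a deliberately large tree (cp = 0 disables the complexity-based prepruning)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

printcp(full_tree)   # cross-validated error for each pruned subtree

# Postprune: keep the subtree whose cp gives the lowest cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)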

AN EXAMPLE: OVERFITTING(1)


This example is from Web Data Mining by Bing Liu


AN EXAMPLE: OVERFITTING(2)

This example is from Web Data Mining by Bing Liu

OTHER ISSUES IN DECISION TREE INDUCTION

From tree to rules, and rule pruning

Handling of missing values

Handling skewed distributions

Handling attributes and classes with different costs.

Attribute construction

Etc.

IRIS EXAMPLE

# install the required packages (only needed once)
install.packages("rpart")
install.packages("rpart.plot")

library("rpart")
library("rpart.plot")

data("iris")
str(iris)  # explore the structure of the data set

# divide the data into training and test sets
set.seed(1)  # make the random split reproducible
indexes <- sample(150, 110)
iris_train <- iris[indexes, ]
iris_test <- iris[-indexes, ]

# build the decision tree
decision_tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris_train, method = "class")
rpart.plot(decision_tree)

# check the accuracy on the test set
predictions <- predict(decision_tree, iris_test, type = "class")
mean(predictions == iris_test$Species)

By default, rpart splits using the Gini index; it can also be told to split using information gain.
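Continuing the code above, the splitting measure is selected through rpart's parms argument:

# Same model as above, but using information gain (entropy) as the splitting measure
decision_tree_info <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                            data = iris_train, method = "class",
                            parms = list(split = "information"))
rpart.plot(decision_tree_info)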

REFERENCE

Data Mining: Concepts and Techniques, Third Edition. By Jiawei Han, Micheline Kamber, Jian Pei

Chapter 8

Introduction to Data Mining. By Pang-Ning Tan, Michael Steinbach, Vipin Kumar

Chapter 3

QUESTIONS?