
INTRODUCTION TO DATA ANALYTICS

XIAOFENG ZHOU | ITEC3040


CLASSIFICATION

Basic Concept

Decision Tree Induction

Bayes Classification Method

KNN method

Model Evaluation and Selection

Techniques to Improve Classification Accuracy

SUPERVISED VS. UNSUPERVISED LEARNING

Supervised learning (classification)

Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

New data is classified based on the training set

Unsupervised learning (clustering)

The class labels of the training data are unknown

Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

FUNDAMENTAL ASSUMPTION OF LEARNING

Assumption: The distribution of training examples is identical to the distribution of test examples (including future unseen examples).

In practice, this assumption is often violated to some degree.

Strong violations will clearly result in poor classification accuracy. To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

CLASSIFICATION: DEFINITION

Given a collection of records (training set)

Each record contains a set of attributes; one of the attributes is the class.

Find a model for class attribute as a function of the values of other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.

CLASSIFICATION: A TWO-STEP PROCESS

Model construction: describing a set of predetermined classes

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

The set of tuples used for model construction is training set

The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: for classifying future or unknown objects

Estimate accuracy of the model

The known label of test sample is compared with the classified result from the model

Accuracy rate is the percentage of test set samples that are correctly classified by the model

Test set is independent of training set (otherwise overfitting)

If the accuracy is acceptable, use the model to classify new data

Note: If the test set is used to select models, it is called validation (test) set

PROCESS (1): MODEL CONSTRUCTION

Training Data:

NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The training data are fed into a classification algorithm, which produces a classifier (model), e.g.:

IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'

PROCESS (2): USING THE MODEL IN PREDICTION

The classifier from step 1 is applied to a test set to estimate accuracy, and then to unseen data.

Testing Data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4)  ->  Tenured? (the rule above gives 'yes')
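To make the two steps concrete, here is a minimal R sketch (the data frame and the classify helper are my own names; the rule is simply the one read off the slide): step 1 encodes the learned classifier, step 2 applies it to the unseen record (Jeff, Professor, 4).

# Training data from the slide
train <- data.frame(
  name  = c("Mike", "Mary", "Bill", "Jim", "Dave", "Anne"),
  rank  = c("Assistant Prof", "Assistant Prof", "Professor",
            "Associate Prof", "Assistant Prof", "Associate Prof"),
  years = c(3, 7, 2, 7, 6, 3),
  tenured = c("no", "yes", "yes", "yes", "no", "no")
)

# Step 1 (model construction): the classifier learned from the training data,
# expressed as a simple rule
classify <- function(rank, years) {
  ifelse(rank == "Professor" | years > 6, "yes", "no")
}

# The rule reproduces all the training labels
all(classify(train$rank, train$years) == train$tenured)   # TRUE

# Step 2 (model usage): classify the unseen record (Jeff, Professor, 4)
classify("Professor", 4)   # "yes"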

EXAMPLES OF CLASSIFICATION

An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc.) of newly admitted patients.

A decision is needed: whether to put a new patient in an intensive-care unit.

Due to the high cost of ICU, those patients who may survive less than a month are given higher priority.

Problem: to predict high-risk patients and discriminate them from low-risk patients.

Other applications:

Predicting tumor cells as benign or malignant

Classifying credit card transactions as legitimate or fraudulent

Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

Categorizing news stories as finance, weather, entertainment, sports, etc.

THE DATA AND THE GOAL

Data: A set of data records (also called examples, instances or cases) described by

k attributes: A1, A2, …, Ak

a class: each example is labelled with a pre-defined class (the class attribute)

The class attribute has a set of n discrete values, where n >= 2.

Goal: To learn a classification model from the data that can be used to predict the classes of new (future, or test) cases/instances.

Table is from Web Data Mining by Bing Liu

DECISION TREE INDUCTION

Decision tree induction is one of the most widely used techniques for classification.

Its classification accuracy is competitive with other methods, and it is very efficient.

The classification model is a tree, called a decision tree.

How does a decision tree work?

EXAMPLE OF A DECISION TREE

Training Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model: Decision Tree (the internal nodes test the splitting attributes; the leaves hold the class labels)

Refund?
  Yes  -> NO
  No   -> MarSt?
            Married           -> NO
            Single, Divorced  -> TaxInc?
                                   < 80K -> NO
                                   > 80K -> YES
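As a sketch (not part of the slides), the tree above can be written as nested attribute tests in R and checked against the ten training records; the predict_cheat helper name is my own.

# Training data from the slide (Taxable Income in thousands)
cheat_data <- data.frame(
  refund  = c("Yes","No","No","Yes","No","No","Yes","No","No","No"),
  marital = c("Single","Married","Single","Married","Divorced",
              "Married","Divorced","Single","Married","Single"),
  income  = c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90),
  cheat   = c("No","No","No","No","Yes","No","No","Yes","No","Yes")
)

# The decision tree as nested tests: each internal node tests one attribute
predict_cheat <- function(refund, marital, income) {
  if (refund == "Yes")      return("No")
  if (marital == "Married") return("No")
  if (income < 80)          return("No")   # TaxInc < 80K
  "Yes"                                    # TaxInc > 80K
}

preds <- mapply(predict_cheat, cheat_data$refund, cheat_data$marital, cheat_data$income)
all(preds == cheat_data$cheat)   # TRUE: the tree fits the training data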

USE THE DECISION TREE

Test Data:

Tid   Refund   Marital Status   Taxable Income   Cheat
100   No       ?                65K              ?

Start from the root of the tree and, at each internal node, follow the branch that matches the record's attribute values until a leaf is reached; the leaf's label is the predicted class. With Refund = No and Taxable Income = 65K (< 80K), every matching path in the tree above ends in a NO leaf, so the prediction is Cheat = No.

IS THE DECISION TREE UNIQUE?

No. Here is another tree that fits the same data.

We want a small and accurate tree: it is easier to understand and tends to perform better.

All current tree-building algorithms are heuristic algorithms.

MarSt?
  Married           -> NO
  Single, Divorced  -> Refund?
                         Yes -> NO
                         No  -> TaxInc?
                                  < 80K -> NO
                                  > 80K -> YES

There could be more than one tree that fits the same data!

ALGORITHM FOR DECISION TREE INDUCTION

Basic algorithm (a greedy algorithm)

Tree is constructed in a top-down recursive divide-and-conquer manner

At start, all the training examples are at the root

Attributes are categorical (if continuous-valued, they are discretized in advance)

Examples are partitioned recursively based on selected attributes

Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)

Conditions for stopping partitioning

All samples for a given node belong to the same class

There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf

There are no samples left

DECISION TREE INDUCTION

Many Algorithms:

Hunt’s Algorithm (one of the earliest)

CART

ID3, C4.5

SLIQ, SPRINT

Issues

Determine how to split the records

How to specify the attribute test condition?

How to determine the best split?

Determine when to stop splitting

HOW TO SPECIFY TEST CONDITION?

Depends on attribute types

Nominal

Ordinal

Continuous

Depends on number of ways to split

2-way split

Multi-way split

SPLITTING BASED ON NOMINAL ATTRIBUTES

Multi-way split: use as many partitions as there are distinct values.

  CarType?  ->  Family | Sports | Luxury

Binary split: divide the values into two subsets; need to find the optimal partitioning.

  CarType?  ->  {Sports, Luxury} | {Family}     OR     {Family, Luxury} | {Sports}

SPLITTING BASED ON ORDINAL ATTRIBUTES

Multi-way split: use as many partitions as there are distinct values.

  Size?  ->  Small | Medium | Large

Binary split: divide the values into two subsets; need to find the optimal partitioning.

  Size?  ->  {Small, Medium} | {Large}     OR     {Small} | {Medium, Large}

What about the split {Small, Large} | {Medium}? It groups values that are not adjacent in the ordering, so it does not preserve the order of the attribute and is normally not allowed for ordinal attributes.

SPLITTING BASED ON CONTINUOUS ATTRIBUTES

Different ways of handling

Discretization to form an ordinal categorical attribute

Static – discretize once at the beginning

Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.

Binary decision: (A < v) or (A >= v)

consider all possible splits and find the best cut

can be more computationally intensive

SPLITTING BASED ON CONTINUOUS ATTRIBUTES

(i) Binary split:     Taxable Income > 80K?  ->  Yes | No

(ii) Multi-way split: Taxable Income?  ->  < 10K | [10K, 25K) | [25K, 50K) | [50K, 80K) | > 80K
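A minimal sketch of the "consider all possible splits and find the best cut" idea, applied to the Taxable Income values from the earlier cheat example. It uses the Gini index, which is defined later in this deck, and the helper names are my own; the cut it selects is whatever minimizes the weighted Gini on these ten records, which need not be the illustrative 80K threshold used in the earlier tree.

# Taxable Income and Cheat labels from the earlier training table (in thousands)
income <- c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90)
cheat  <- c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes")

gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Candidate cuts: midpoints between consecutive sorted distinct values
v    <- sort(unique(income))
cuts <- (head(v, -1) + tail(v, -1)) / 2

# Weighted Gini of the binary split (A < cut) vs (A >= cut) for each candidate
weighted_gini <- sapply(cuts, function(cut) {
  left <- income < cut
  mean(left) * gini(cheat[left]) + mean(!left) * gini(cheat[!left])
})

data.frame(cut = cuts, weighted_gini = round(weighted_gini, 3))
cuts[which.min(weighted_gini)]   # the best cut for this data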

HOW TO DETERMINE THE BEST SPLIT

Before splitting: 10 records of class 0, 10 records of class 1.

Own Car?
  Yes -> C0: 6, C1: 4
  No  -> C0: 4, C1: 6

Car Type?
  Family -> C0: 1, C1: 3
  Sports -> C0: 8, C1: 0
  Luxury -> C0: 1, C1: 7

Student ID?
  c1  -> C0: 1, C1: 0
  ...
  c10 -> C0: 1, C1: 0
  c11 -> C0: 0, C1: 1
  ...
  c20 -> C0: 0, C1: 1

Which test condition is the best?

HOW TO DETERMINE THE BEST SPLIT

The key to building a decision tree is which attribute to choose in order to branch.

The objective is to reduce impurity or uncertainty in the data as much as possible.

A subset of data is pure if all instances belong to the same class.

Need a measure of node impurity

Information gain

Gain ratio

Gini Index

BRIEF INTRODUCTION OF ENTROPY

Entropy(Information Theory)

A measure of uncertainty associated with a random variable

Calculation: for a discrete random variable Y taking m distinct values {y1, y2, …, ym},

entropy(Y) = -\sum_{j=1}^{m} P(Y = y_j) \log_2 P(Y = y_j),  where  \sum_{j=1}^{m} P(Y = y_j) = 1

Define 0 \cdot \log_2 0 = 0.

Interpretation: higher entropy -> higher uncertainty.

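A small R helper (my own, not from the slides) makes the definition concrete; the counts [9+, 5-] of the PlayTennis data used later give the 0.940 value quoted there.

# Entropy of a class distribution given as counts, using the 0*log2(0) = 0 convention
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

entropy(c(9, 5))    # 0.940  (the PlayTennis data set used later)
entropy(c(7, 7))    # 1      (maximum uncertainty for two classes)
entropy(c(14, 0))   # 0      (a pure set: no uncertainty)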

ATTRIBUTE SELECTION MEASURE: INFORMATION GAIN (ID3/C4.5)

Select the attribute with the highest information gain

Let p_i = P(Y = y_i) be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)
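A short R sketch of Gain(A) computed from per-partition class counts (helper names are my own); the Humidity and Wind splits of the PlayTennis example that follows reproduce the 0.151 and 0.048 values shown there.

# entropy helper as in the previous sketch
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log2(p))
}

# partitions: a list of class-count vectors, one per value of attribute A
info_gain <- function(parent_counts, partitions) {
  n <- sum(parent_counts)
  expected_info <- sum(sapply(partitions, function(d) sum(d) / n * entropy(d)))
  entropy(parent_counts) - expected_info
}

# Humidity split of the PlayTennis data: High = [3+, 4-], Normal = [6+, 1-]
info_gain(c(9, 5), list(High = c(3, 4), Normal = c(6, 1)))   # 0.151

# Wind split: Weak = [6+, 2-], Strong = [3+, 3-]
info_gain(c(9, 5), list(Weak = c(6, 2), Strong = c(3, 3)))   # 0.048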

AN ILLUSTRATION EXAMPLE


Training examples:

Day   Outlook    Temperature   Humidity   Wind     PlayTennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rainy      Mild          High       Weak     Yes
D5    Rainy      Cool          Normal     Weak     Yes
D6    Rainy      Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rainy      Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rainy      Mild          High       Strong   No

Which attribute is the best for the root?

S: [9+, 5-], Entropy = 0.940

Split on Humidity:
  High   -> S_High: [3+, 4-], Entropy = 0.985
  Normal -> S_Normal: [6+, 1-], Entropy = 0.592
Gain(S, Humidity) = 0.940 - (7/14)(0.985) - (7/14)(0.592) = 0.151

Split on Wind:
  Weak   -> S_Weak: [6+, 2-], Entropy = 0.811
  Strong -> S_Strong: [3+, 3-], Entropy = 1.0
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.0) = 0.048

Split on Outlook:
  Sunny    -> S_Sunny: [2+, 3-], Entropy = 0.97095
  Overcast -> S_Overcast: [4+, 0-], Entropy = 0
  Rainy    -> S_Rain: [3+, 2-], Entropy = 0.97095
Gain(S, Outlook) = 0.940 - (5/14)(0.97095) - (4/14)(0) - (5/14)(0.97095) = 0.2467

Split on Temperature:
  Hot  -> S_Hot: [2+, 2-], Entropy = 1
  Mild -> S_Mild: [4+, 2-], Entropy = 0.91826
  Cool -> S_Cool: [3+, 1-], Entropy = 0.811
Gain(S, Temperature) = 0.940 - (4/14)(1) - (6/14)(0.91826) - (4/14)(0.811) = 0.029

Outlook has the highest information gain, so it is selected as the root attribute.

An illustrative example (Cont’d.)

S: {D1, D2, …, D14} [9+, 5-]

Outlook?
  Sunny    -> S_Sunny: {D1, D2, D8, D9, D11} [2+, 3-], Entropy = 0.97095  -> ?
  Overcast -> S_Overcast: {D3, D7, D12, D13} [4+, 0-], Entropy = 0        -> Yes
  Rainy    -> S_Rain: {D4, D5, D6, D10, D14} [3+, 2-], Entropy = 0.97095  -> ?

Which attribute should be tested here (at the Sunny branch): Humidity, Temperature, or Wind?

Gain(S_Sunny, Humidity)    = 0.97095 - (3/5)(0.0) - (2/5)(0.0) = 0.97095
Gain(S_Sunny, Temperature) = 0.97095 - (2/5)(0.0) - (2/5)(1.0) - (1/5)(0.0) = 0.57095
Gain(S_Sunny, Wind)        = 0.97095 - (2/5)(1.0) - (3/5)(0.918) = 0.02015

Therefore, Humidity is chosen as the next test attribute for the left branch.

An illustrative example (Cont’d.)

S: {D1, D2, …, D14} [9+, 5-]

Outlook?
  Sunny    -> S_Sunny: {D1, D2, D8, D9, D11} [2+, 3-], Entropy = 0.97095
                Humidity?
                  High   -> {D1, D2, D8} [0+, 3-]  -> No
                  Normal -> {D9, D11} [2+, 0-]     -> Yes
  Overcast -> S_Overcast: {D3, D7, D12, D13} [4+, 0-], Entropy = 0  -> Yes
  Rainy    -> S_Rain: {D4, D5, D6, D10, D14} [3+, 2-], Entropy = 0.97095
                Wind?
                  Strong -> {D6, D14} [0+, 2-]      -> No
                  Weak   -> {D4, D5, D10} [3+, 0-]  -> Yes

BASIC DECISION TREE LEARNING ALGORITHM

1. Select the "best" attribute A for the root node.

2. Create a new descendant of the node for each value of A.

3. Put the training examples into the descendant nodes.

4. For each descendant node (this is where we decide when to terminate the recursive process):

if the training examples associated with the node belong to the same class, the node is marked as a leaf node and labeled with the class;

else if there are no remaining attributes on which the examples can be further partitioned, the node is marked as a leaf node and labeled with the most common class among the training cases;

else if there is no example for the node, the node is marked as a leaf node and labeled with the majority class in its parent node;

otherwise, recursively apply the process on the new node.
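The steps above can be turned into a compact recursive R sketch. This is my own simplified implementation for categorical attributes using information gain, not the course's reference code; run on the PlayTennis table it selects Outlook at the root, then Humidity under Sunny and Wind under Rainy, matching the tree built by hand earlier.

# data-frame versions of the entropy / information-gain helpers sketched earlier
entropy <- function(counts) {
  p <- counts / sum(counts); p <- p[p > 0]; -sum(p * log2(p))
}

info_gain <- function(data, attribute, class_col) {
  parent <- entropy(table(data[[class_col]]))
  parts  <- split(data[[class_col]], data[[attribute]], drop = TRUE)
  parent - sum(sapply(parts, function(y) length(y) / nrow(data) * entropy(table(y))))
}

majority <- function(y) names(which.max(table(y)))

# Recursive divide-and-conquer tree builder for categorical attributes
build_tree <- function(data, attrs, class_col) {
  y <- data[[class_col]]
  if (length(unique(y)) == 1) return(unique(y))    # pure node -> leaf
  if (length(attrs) == 0)     return(majority(y))  # no attributes left -> majority leaf
  gains <- sapply(attrs, function(a) info_gain(data, a, class_col))
  best  <- attrs[which.max(gains)]                 # attribute with highest gain
  # empty branches cannot arise here because split(..., drop = TRUE) keeps only observed values
  subsets  <- split(data, data[[best]], drop = TRUE)
  children <- lapply(subsets, build_tree,
                     attrs = setdiff(attrs, best), class_col = class_col)
  list(split_on = best, branches = children)
}

# The PlayTennis training examples from the earlier table
play_tennis <- data.frame(
  Outlook     = c("Sunny","Sunny","Overcast","Rainy","Rainy","Rainy","Overcast",
                  "Sunny","Sunny","Rainy","Sunny","Overcast","Overcast","Rainy"),
  Temperature = c("Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                  "Mild","Cool","Mild","Mild","Mild","Hot","Mild"),
  Humidity    = c("High","High","High","High","Normal","Normal","Normal",
                  "High","Normal","Normal","Normal","High","Normal","High"),
  Wind        = c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                  "Weak","Weak","Weak","Strong","Strong","Weak","Strong"),
  PlayTennis  = c("No","No","Yes","Yes","Yes","No","Yes",
                  "No","Yes","Yes","Yes","Yes","Yes","No"),
  stringsAsFactors = FALSE
)

tree <- build_tree(play_tennis, c("Outlook", "Temperature", "Humidity", "Wind"), "PlayTennis")
str(tree)   # Outlook at the root; Humidity under Sunny, Wind under Rainy, Yes under Overcast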

GAIN RATIO FOR ATTRIBUTE SELECTION (C4.5)

Information gain measure is biased towards attributes with a large number of values

C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. Temperature splits D into Hot (4 tuples), Mild (6 tuples), and Cool (4 tuples):

SplitInfo_Temperature(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557

gain_ratio(Temperature) = 0.029 / 1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute
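The Temperature numbers can be checked with a couple of lines of R (the helper name is my own):

# SplitInfo_A(D) from the sizes of the partitions induced by A
split_info <- function(sizes) {
  p <- sizes / sum(sizes)
  -sum(p * log2(p))
}

# Temperature partitions D into Hot (4), Mild (6), Cool (4) tuples
split_info(c(4, 6, 4))           # 1.557
0.029 / split_info(c(4, 6, 4))   # gain_ratio(Temperature) = 0.019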

GINI INDEX (CART, IBM INTELLIGENTMINER)

If a data set D contains examples from n classes, gini index, gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D

If a data set D is split on A into two subsets D_1 and D_2, the gini index gini_A(D) is defined as

gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)

Reduction in impurity:

\Delta gini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).


COMPUTATION OF GINI INDEX

Ex. D has 9 tuples with playTennis = yes and 5 with no:

gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459

Suppose the attribute Temperature partitions D into D_1 = {Cool, Mild} with 10 tuples and D_2 = {Hot} with 4 tuples:

gini_{Temperature \in \{Cool, Mild\}}(D) = \frac{10}{14}\, Gini(D_1) + \frac{4}{14}\, Gini(D_2) = \frac{10}{14}\left(1 - \left(\frac{7}{10}\right)^2 - \left(\frac{3}{10}\right)^2\right) + \frac{4}{14}\left(1 - \left(\frac{2}{4}\right)^2 - \left(\frac{2}{4}\right)^2\right) = 0.443

Gini_{Cool, Hot} is 0.458 and Gini_{Mild, Hot} is 0.450. Thus, we split on {Cool, Mild} (and {Hot}) since it has the lowest Gini index.

All attributes are assumed continuous-valued

May need other tools, e.g., clustering, to get the possible split values

Can be modified for categorical attributes

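The same Gini figures can be reproduced with a short R sketch (the gini helper is my own):

# Gini index of a class distribution given as counts
gini <- function(counts) 1 - sum((counts / sum(counts))^2)

gini(c(9, 5))                                           # gini(D) = 0.459

# Temperature in {Cool, Mild} vs {Hot}: D1 = [7+, 3-], D2 = [2+, 2-]
(10 / 14) * gini(c(7, 3)) + (4 / 14) * gini(c(2, 2))    # 0.443

# The other two binary groupings of Temperature
(8 / 14) * gini(c(5, 3)) + (6 / 14) * gini(c(4, 2))     # {Cool, Hot} vs {Mild}: 0.458
(10 / 14) * gini(c(6, 4)) + (4 / 14) * gini(c(3, 1))    # {Mild, Hot} vs {Cool}: 0.450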

COMPARING ATTRIBUTE SELECTION MEASURES

The three measures, in general, return good results but

Information gain:

biased towards multivalued attributes

Gain ratio:

tends to prefer unbalanced splits in which one partition is much smaller than the others

Gini index:

biased to multivalued attributes

has difficulty when # of classes is large

tends to favor tests that result in equal-sized partitions and purity in both partitions

LIMITATION: OVERFITTING

Overfitting: An induced tree may overfit the training data

Too many branches, some may reflect anomalies due to noise or outliers

Poor accuracy for unseen samples

Two approaches to avoid overfitting

Prepruning: Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold

Difficult to choose an appropriate threshold

Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees

Use a set of data different from the training data to decide which is the best pruned tree

Postpruning is the approach used in C4.5.
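In R, rpart implements postpruning through its complexity parameter cp: grow a large tree, inspect the cross-validated error of the nested pruned subtrees, and keep the best one. A minimal sketch on the iris data used later in this deck (note that rpart relies on cross-validation estimates rather than a separate pruning set):

library(rpart)

# Grow a deliberately large tree (cp = 0 disables the complexity-based prepruning)
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

printcp(full_tree)   # cross-validated error for each pruned subtree

# Postprune: keep the subtree whose cp gives the lowest cross-validated error
best_cp <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)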

AN EXAMPLE: OVERFITTING(1)


This example is from Web Data Mining by Bing Liu


AN EXAMPLE: OVERFITTING(2)

This example is from Web Data Mining by Bing Liu

OTHER ISSUES IN DECISION TREE INDUCTION

From tree to rules, and rule pruning

Handling of missing values

Handling skewed distributions

Handling attributes and classes with different costs.

Attribute construction

Etc.

IRIS EXAMPLE

# install the required packages (only needed once)
install.packages("rpart")
install.packages("rpart.plot")

library("rpart")
library("rpart.plot")

data("iris")
str(iris)  # explore the structure of the data set

# divide the data into training and test sets
set.seed(1)  # make the random split reproducible
indexes <- sample(150, 110)
iris_train <- iris[indexes, ]
iris_test <- iris[-indexes, ]

# build the decision tree
decision_tree <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris_train, method = "class")
rpart.plot(decision_tree)

# check the accuracy on the test set
predictions <- predict(decision_tree, iris_test, type = "class")
mean(predictions == iris_test$Species)

By default, rpart splits using the Gini index; it can also be told to split using information gain.
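Continuing the code above, the splitting measure is selected through rpart's parms argument:

# Same model as above, but using information gain (entropy) as the splitting measure
decision_tree_info <- rpart(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                            data = iris_train, method = "class",
                            parms = list(split = "information"))
rpart.plot(decision_tree_info)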

REFERENCE

Data Mining: Concepts and Techniques, Third Edition. By Jiawei Han, Micheline Kamber, Jian Pei

Chapter 8

Introduction to Data Mining. By Pang-Ning Tan, Michael Steinbach, Vipin Kumar

Chapter 3

QUESTIONS?