
Applied Predictive Analytics

for Business
Decision Trees

1
Today's Agenda

Class Business Items


Homework 2, Homework 3
Projects, Review Session, Exam
Finish Decision Trees
Bagging
Random Forests
Boosting

2
Decision Tree Representation

Internal nodes test attributes
Branches correspond to attribute values
Each leaf node assigns a classification or value

3
Classification and Regression Trees

The tree-building algorithm is formally called recursive partitioning.
It's a greedy algorithm: it doesn't look ahead; it takes the best split currently available.
The algorithm only considers binary splits.
Two types, based on Y:
Classification trees: Y is categorical.
Regression trees: Y is quantitative.

4
Top-Down Induction of Decision Trees

1. A ← the best decision attribute for the next node.
2. Assign A as the decision attribute for the node.
3. For the binary split according to the decision attribute A, create two new descendants.
4. Sort the training examples to the leaf nodes.
5. If the training examples are perfectly classified, or a node-size stopping rule is met, then STOP; else iterate over the new leaf nodes.

5
Regression Trees
Customer  Assets  Income  Amount
1         H       75      150
2         L       50      30
3         M       25      25
4         M       50      100
5         M       100     110
6         H       25      200
7         L       25      15
8         M       75      90

Two predictors:
Assets = {Low, Medium, High}
Income, in $1000s
Response (quantitative):
Borrowing amount, in $1000s
Goal: Can you create a decision rule (a tree!) to predict the borrowing amount?

6
Building Regression Trees
Regression tree building process:
Suppose we are starting at the root node.
The algorithm considers all the partitions of the predictors into a left and a right node and computes the RSS (residual sum of squares) for each partition:

RSS = Σ_{i ∈ left node} (yᵢ − ȳ_left)² + Σ_{i ∈ right node} (yᵢ − ȳ_right)²

It then picks the partition that gives the lowest RSS.


We continue the process at each node until some stopping
rule is met.
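A minimal R sketch of this computation for a single candidate split (the function and variable names are my own, not from the course code):

  # RSS of one candidate binary split: squared deviations from each
  # child node's mean response, summed over both children.
  split_rss <- function(y, goes_left) {
    left  <- y[goes_left]
    right <- y[!goes_left]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }

  # Example: the borrowing data on slide 6, split into Assets = L or M vs. Assets = H
  amount <- c(150, 30, 25, 100, 110, 200, 15, 90)
  assets <- c("H", "L", "M", "M", "M", "H", "L", "M")
  split_rss(amount, goes_left = assets %in% c("L", "M"))   # about 10383

The split with the smallest value of this quantity is the one the algorithm keeps.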

7
Recursive Partitioning in Action

Suppose we are starting at the root node. We consider:

Partition Left Node Right Node RSS


1 Asset = M or H Asset = L ?
2 Asset = L or H Asset = M ?
3 Asset = L or M Asset = H ?
4 Income < 37.5 Income > 37.5 ?
5 Income < 62.5 Income > 62.5 ?
6 Income < 87.5 Income > 87.5 ?

8
Partition 3
Partition Left Node Right Node RSS
1 Asset = M or H Asset = L ?
2 Asset = L or H Asset = M ?
3 Asset = L or M Asset = H ?
4 Income < 37.5 Income > 37.5 ?
5 Income < 62.5 Income > 62.5 ?
6 Income < 87.5 Income > 87.5 ?

Customer  Assets  Income  Amount  Node   RSS contribution
1         H       75      150     right  625
2         L       50      30      left   1003
3         M       25      25      left   1344
4         M       50      100     left   1469
5         M       100     110     left   2336
6         H       25      200     right  625
7         L       25      15      left   2178
8         M       75      90      left   803

Left node (Asset = L or M): mean Amount = 61.67, RSS = 9133
Right node (Asset = H): mean Amount = 175.00, RSS = 1250
Total RSS for Partition 3: 9133 + 1250 = 10383

9
Partition 4
Partition Left Node Right Node RSS
1 Asset = M or H Asset = L ?
2 Asset = L or H Asset = M ?
3 Asset = L or M Asset = H ?
4 Income < 37.5 Income > 37.5 ?
5 Income < 62.5 Income > 62.5 ?
6 Income < 87.5 Income > 87.5 ?

Customer  Assets  Income  Amount  Node   RSS contribution
1         H       75      150     right  2916
2         L       50      30      right  4356
3         M       25      25      left   3025
4         M       50      100     right  16
5         M       100     110     right  196
6         H       25      200     left   14400
7         L       25      15      left   4225
8         M       75      90      right  36

Left node (Income < 37.5): mean Amount = 80.00, RSS = 21650
Right node (Income > 37.5): mean Amount = 96.00, RSS = 7520
Total RSS for Partition 4: 21650 + 7520 = 29170

Note: RSS for Partition 3 < RSS for Partition 4, so Partition 3 is chosen over Partition 4.
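As a check on slides 8-10, here is a short R script (my own sketch, not the course code) that evaluates all six candidate partitions and reports each total RSS; Partition 3 comes out lowest:

  # Toy borrowing data from slide 6
  loans <- data.frame(
    assets = c("H", "L", "M", "M", "M", "H", "L", "M"),
    income = c(75, 50, 25, 50, 100, 25, 25, 75),
    amount = c(150, 30, 25, 100, 110, 200, 15, 90)
  )

  # Total RSS of a binary split, given a logical vector marking the left node
  split_rss <- function(y, goes_left) {
    left  <- y[goes_left]
    right <- y[!goes_left]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }

  partitions <- list(
    "1: M or H vs. L"  = loans$assets %in% c("M", "H"),
    "2: L or H vs. M"  = loans$assets %in% c("L", "H"),
    "3: L or M vs. H"  = loans$assets %in% c("L", "M"),
    "4: income < 37.5" = loans$income < 37.5,
    "5: income < 62.5" = loans$income < 62.5,
    "6: income < 87.5" = loans$income < 87.5
  )

  round(sapply(partitions, function(left) split_rss(loans$amount, left)))
  # Partition 3 has the smallest total RSS (about 10383), so it is chosen first.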

10
Executing Tree Commands in R
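The R output on the original slide is not reproduced here. As a minimal sketch of the kind of commands involved, assuming the tree package and the ISLR Hitters data used on the next slide:

  library(tree)   # tree-fitting package used in the ISLR labs
  library(ISLR)   # provides the Hitters data set

  hitters <- na.omit(Hitters)                        # drop players with missing Salary
  fit <- tree(log(Salary) ~ Years + Hits, data = hitters)

  summary(fit)           # splits used, number of terminal nodes, residual deviance
  plot(fit)              # draw the tree
  text(fit, pretty = 0)  # label the splits and leaf predictions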

11
Decision Tree with the Hitters Data Set

12
Cost Complexity Pruning
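The pruning slides are figure-only; as a minimal sketch (my own, not the course code), cost-complexity pruning with cv.tree and prune.tree from the tree package, using the same Hitters fit as above, might look like:

  library(tree)
  library(ISLR)

  hitters <- na.omit(Hitters)
  fit <- tree(log(Salary) ~ Years + Hits, data = hitters)

  # Cross-validate over the cost-complexity sequence of subtrees
  cv_fit <- cv.tree(fit)
  plot(cv_fit$size, cv_fit$dev, type = "b",
       xlab = "Tree size (terminal nodes)", ylab = "CV deviance")

  # Prune back to the size with the smallest cross-validated deviance
  best_size <- cv_fit$size[which.min(cv_fit$dev)]
  pruned <- prune.tree(fit, best = best_size)
  plot(pruned)
  text(pruned, pretty = 0)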

13
Pruning Through Cost Complexity Analysis

14
Pruned Tree

15
Regression Tree Lab

16
Classification Trees
With a regression tree, the predicted response for an observation is given by the mean response of the training observations that belong to the same terminal node.

For a classification tree, we predict that each observation belongs to the most commonly occurring class among the training observations in the region to which it belongs.
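As a small illustration (my own sketch, using the built-in iris data rather than the course data set), a classification tree in R predicts the majority class of each leaf:

  library(tree)

  # Each terminal node predicts the most common species among the
  # training observations that fall into it.
  fit <- tree(Species ~ ., data = iris)
  head(predict(fit, iris, type = "class"))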

17
Growing a Classification Tree
Similar to regression, but we can't use RSS.

Classification error would be a natural criterion, but it is not sensitive enough for tree growing.

Instead, use the Gini index or cross-entropy.

The basic idea behind these indices is that each split is chosen to reduce node impurity.
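The slide does not define these measures. With p̂ₖ the proportion of class k observations in a node, the standard definitions (not given on the slide) are:

Gini index:     G = Σₖ p̂ₖ (1 − p̂ₖ)
Cross-entropy:  D = − Σₖ p̂ₖ log(p̂ₖ)

Both are close to zero when the node is nearly pure (one p̂ₖ near 1), which is why a split that reduces them reduces impurity.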
18
Entropy

S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S:

Entropy(S) = − p⊕ log₂(p⊕) − p⊖ log₂(p⊖)

19
Information Gain

Gain(S, A) = expected reduction in entropy due to sorting on A:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|Sᵥ| / |S|) · Entropy(Sᵥ)
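A small R sketch of these two quantities (the helper names are mine, not from the course code); on the Play Tennis data that follows, the root node has 9 Yes and 5 No, giving an entropy of about 0.940:

  # Entropy of a node, given the count of observations in each class
  entropy <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log2(p))
  }

  # Information gain from splitting a parent node into child nodes;
  # `children` is a list of class-count vectors, one per attribute value
  info_gain <- function(parent, children) {
    weighted <- sapply(children, function(ch) sum(ch) / sum(parent) * entropy(ch))
    entropy(parent) - sum(weighted)
  }

  entropy(c(9, 5))   # about 0.940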

20
Decision Tree for Play Tennis

21
Training Examples
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

22
Which attribute is the best classifier?

23
Which attribute should be tested?

[Figure: partial tree after splitting on Outlook. The full training set has entropy E = 0.940; the Sunny and Rain branches each have entropy E = 0.971, and the Overcast branch is pure.]

Which attribute should be tested here?

24
Test Temperature
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Ssunny = {D1, D2, D8, D9, D11}


Gain(Ssunny, Temperature) = 0.971 − (2/5)·1 [Mild] − (2/5)·0 [Hot] − (1/5)·0 [Cool] = 0.571
25
Test Wind
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Ssunny = {D1, D2, D8, D9, D11}


Gain(Ssunny, Wind) = 0.971 − (3/5)·0.918 [Weak] − (2/5)·1.0 [Strong] = 0.019
26
Test Humidity
Day Outlook Temperature Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Ssunny = {D1, D2, D8, D9, D11}


Gain(Ssunny, Humidity) = 0.971 − (3/5)·0 [High] − (2/5)·0 [Normal] = 0.971
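As a check on the three calculations above, a short R script (my own sketch; class counts are given as c(Yes, No)):

  entropy <- function(counts) {
    p <- counts[counts > 0] / sum(counts)
    -sum(p * log2(p))
  }
  info_gain <- function(parent, children) {
    entropy(parent) - sum(sapply(children, function(ch) sum(ch) / sum(parent) * entropy(ch)))
  }

  # Ssunny = {D1, D2, D8, D9, D11}: 2 Yes, 3 No
  info_gain(c(2, 3), list(Hot = c(0, 2), Mild = c(1, 1), Cool = c(1, 0)))   # Temperature: 0.571
  info_gain(c(2, 3), list(Weak = c(1, 2), Strong = c(1, 1)))                # Wind: about 0.02
  info_gain(c(2, 3), list(High = c(0, 3), Normal = c(2, 0)))                # Humidity: 0.971
  # Humidity has the largest gain, so it is tested at the Sunny branch.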
27
Regression Tree Lab

28
Advantages of Trees
Most common transformations of the predictors will not change the tree results. (Transformations of the response could make a difference.)
Interaction terms (often used in multiple regression) are not needed; interactions are handled automatically within the context of a tree.
They are easy to visualize. (Unless you have too many
branches.)
There is no need to have dummy variables for categorical data.
They are easy to explain and use.
Missing values can be handled easily.
Many people use trees as an exploratory tool.

29
Disadvantages of Trees
Trees are generally not robust: they can be severely unstable when small changes are made to the data.
Regression trees give only the mean (or mode) of the Y values in each leaf as predictions. Why just a mean?
Often they don't predict well!

There are fixes for these disadvantages, but they come at a price: they take away from the advantages!

30
