
Basic Principles

• The top, or first, node is called the root node.
• The nodes at the last level are the leaf nodes and contain the final classification.
• The intermediate nodes are the descendant or “hidden” layers.
• Binary trees, like the one shown to the right, are the most popular type of tree. However, M-ary trees (M branches at each node) are possible.
• Nodes can contain one or more questions. In a binary tree, by convention, if the answer to a question is “yes”, the left branch is selected. Note that the same question can appear in multiple places in the tree (see the sketch after this list).
• Decision trees have several benefits over neural network-type approaches, including
interpretability and data-driven learning.
• Key questions include how to grow the tree, how to stop growing, and how to prune the
tree to increase generalization.
• Decision trees are very powerful and can give excellent performance on closed-set testing.
Generalization is a challenge.
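
To make the yes/left, no/right convention concrete, the following is a minimal sketch, not taken from the original notes; the Node class, attribute names, and thresholds are illustrative assumptions.

# Minimal sketch (illustrative only) of the binary-tree convention described above:
# each internal node holds a yes/no question, "yes" follows the left branch, and
# leaf nodes carry the final classification.

class Node:
    def __init__(self, question=None, yes=None, no=None, label=None):
        self.question = question  # callable returning True/False; None at a leaf
        self.yes = yes            # left branch, taken when the answer is "yes"
        self.no = no              # right branch, taken when the answer is "no"
        self.label = label        # class label, set only at leaf nodes

def classify(node, x):
    """Walk from the root to a leaf and return the leaf's class label."""
    while node.label is None:
        node = node.yes if node.question(x) else node.no
    return node.label

# Example: a small two-level tree; attribute names and thresholds are made up.
tree = Node(
    question=lambda x: x["height"] >= 64,            # is height >= 5'4"?
    yes=Node(label="class A"),
    no=Node(
        question=lambda x: x["weight"] >= 150,
        yes=Node(label="class A"),
        no=Node(label="class B"),
    ),
)

print(classify(tree, {"height": 60, "weight": 120}))  # -> class B
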
Classification Trees: the target variable is categorical, and the tree is used to identify the "class" into which the target variable is most likely to fall.

Regression Trees: the target variable is continuous, and the tree is used to predict its value. The main goal of regression algorithms is to predict a continuous value.

Although classification and regression come under the same umbrella of supervised machine learning and share the common concept of using past data to make predictions or decisions, that is where their similarity ends.
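
The contrast can be sketched briefly in code; scikit-learn and the toy data below are assumptions made purely for illustration, not part of the original notes.

# Sketch contrasting the two tree types with scikit-learn (assumed tooling).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 30000], [40, 80000], [35, 60000], [50, 120000]]   # e.g. [age, income]

# Classification tree: the target is categorical, so the prediction is a class label.
clf = DecisionTreeClassifier().fit(X, ["No", "No", "Yes", "Yes"])
print(clf.predict([[45, 90000]]))     # -> a class label such as ['Yes']

# Regression tree: the target is continuous, so the prediction is a numeric value.
reg = DecisionTreeRegressor().fit(X, [200.0, 550.0, 400.0, 800.0])
print(reg.predict([[45, 90000]]))     # -> a real-valued estimate
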
• A generic tree-growing methodology, known as CART, successively splits nodes until they are pure (a grow-until-pure sketch appears after the list of questions below). Six key questions:

1) Should the questions be binary (e.g., is gender male or female) or numeric (e.g., is height >= 5’4”) or
multi-valued (e.g., race)?

2) Which properties should be tested at each node?

3) When should a node be declared a leaf?

4) If the tree becomes too large, how can it be pruned?

5) If the leaf node is impure, what category should be assigned to it?

6) How should missing data be handled?
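
The notes do not give pseudocode, so the following is a minimal sketch of CART-style growth under simple assumptions: binary numeric questions (question 1), the best question chosen by impurity reduction (question 2), a node declared a leaf when it is pure (question 3), and impure leaves labeled with the majority class (question 5). All names are illustrative.

# Minimal CART-style growth sketch (illustrative only).
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Try every (feature, threshold) question; return the split with the lowest
    weighted child impurity, or None if no question separates the data."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [i for i, row in enumerate(X) if row[f] >= t]
            right = [i for i in range(len(X)) if i not in left]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left]) +
                     len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    return best

def grow(X, y):
    """Split recursively; declare a leaf once the node is pure or no useful
    question remains, and label it with the majority class."""
    split = None if gini(y) == 0.0 else best_split(X, y)
    if split is None:
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, f, t, left, right = split
    return {"feature": f, "threshold": t,
            "yes": grow([X[i] for i in left], [y[i] for i in left]),
            "no": grow([X[i] for i in right], [y[i] for i in right])}

print(grow([[5.0], [6.1], [5.5], [6.4]], ["B", "A", "B", "A"]))
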


When To Stop Splitting

• If we continue to grow the tree until each leaf node has the lowest impurity, then the data
will be overfit.
• Two strategies: (1) stop tree from growing or (2) grow and then prune the tree.
• A traditional approach to stopping splitting relies on cross-validation:
  – Validation: train a tree on 90% of the data and test on 10% of the data (referred to as the held-out set).
  – Cross-validation: repeat for several independently chosen partitions.
  – Stopping criterion: continue splitting until the error on the held-out data is minimized.
• Reduction in Impurity: stop if the best candidate split leads to only a marginal reduction of the impurity (drawback: this tends to produce an unbalanced tree).

• Cost-Complexity: use a global criterion function that combines size and impurity: $\alpha \cdot \mathrm{size} + \sum_{\text{leaf nodes}} i(N)$. This approach is related to minimum description length when the impurity is based on entropy (the criterion is sketched at the end of this list).

• Other approaches based on statistical significance and hypothesis testing attempt to assess the quality of the proposed split.
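
As a concrete reading of the cost-complexity criterion above, here is a minimal sketch; treating "size" as the number of leaf nodes and the particular values of alpha are assumptions made for illustration.

# Sketch of the cost-complexity criterion  alpha * size + sum over leaf nodes of i(N).
# "size" is taken here to be the number of leaf nodes (an assumption), and
# leaf_impurities holds the impurity i(N) of each leaf.

def cost_complexity(leaf_impurities, alpha):
    return alpha * len(leaf_impurities) + sum(leaf_impurities)

# Two candidate trees: a large one with pure leaves vs. a small one with impure leaves.
big_tree_leaves = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # 6 pure leaves
small_tree_leaves = [0.10, 0.15]                   # 2 slightly impure leaves

for alpha in (0.01, 0.2):
    print(alpha,
          cost_complexity(big_tree_leaves, alpha),
          cost_complexity(small_tree_leaves, alpha))
# For small alpha the large pure tree scores better; for larger alpha the small tree
# wins, which is exactly the size-vs-impurity trade-off the criterion encodes.
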
Pruning

• The most fundamental problem with decision trees is that they "overfit" the data and
hence do not provide good generalization. A solution to this problem is to prune the tree:

• Note, however, that pruning the tree will always increase the error rate on the training set.

• Cost-Complexity Pruning: $\alpha \cdot \mathrm{size} + \sum_{\text{leaf nodes}} i(N)$. Each node in the tree can be classified in terms of its impact on the cost-complexity if it were pruned. Nodes are successively pruned until certain heuristics are satisfied.


• By pruning the nodes that are far too specific to the training set, it is hoped the tree will
have better generalization. In practice, we use techniques such as cross-validation and
held-out training data to better calibrate the generalization properties.
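
In practice, cost-complexity pruning is available off the shelf; the sketch below uses scikit-learn (an assumed tool choice, not prescribed by the notes), whose ccp_alpha parameter plays the role of α above, combined with a 90/10 held-out split as described earlier.

# Sketch: cost-complexity pruning with scikit-learn (assumed tooling).
# cost_complexity_pruning_path() returns the effective alphas at which subtrees
# are pruned; refitting with a chosen ccp_alpha yields the pruned tree.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Fit one tree per candidate alpha and keep the one with the best held-out accuracy.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))
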
Classification: Definition
• Given a collection of records (the training set), each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
Illustrating Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

A learning algorithm performs induction on the training set to learn a model; the model is then applied (deduction) to the test set to predict the unknown class labels.
Decision Tree Classification Task

The same workflow applies when the learning algorithm is a tree induction algorithm: induction on the training set above learns a decision tree model, and that decision tree is applied (deduction) to the test set to predict the unknown class labels.
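
To tie the illustration together, here is a sketch that induces a tree from the ten training records above and applies it to the five test records. The use of scikit-learn and the integer encoding of the categorical attributes are assumptions made for illustration only.

# Sketch: inducing a decision tree from the training records and applying it to
# the test records (assumed tooling and encoding).
from sklearn.tree import DecisionTreeClassifier

size = {"Small": 0, "Medium": 1, "Large": 2}
yn = {"No": 0, "Yes": 1}

# Tid 1-10: (Attrib1, Attrib2, Attrib3 in thousands), Class
train = [("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
         ("No", "Small", 70, "No"),   ("Yes", "Medium", 120, "No"),
         ("No", "Large", 95, "Yes"),  ("No", "Medium", 60, "No"),
         ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
         ("No", "Medium", 75, "No"),  ("No", "Small", 90, "Yes")]

# Tid 11-15: class label unknown ("?")
test = [("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
        ("No", "Small", 95), ("No", "Large", 67)]

X = [[yn[a1], size[a2], a3] for a1, a2, a3, _ in train]
y = [label for *_, label in train]

model = DecisionTreeClassifier(random_state=0).fit(X, y)                       # induction
predictions = model.predict([[yn[a1], size[a2], a3] for a1, a2, a3 in test])   # deduction
print(list(predictions))   # predicted class labels for Tid 11-15
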
Alternate Splitting Criteria

• Variance impurity:

$i(N) = P(\omega_1)\,P(\omega_2)$

so named because this is related to the variance of a distribution associated with the two classes.

• Gini impurity:

$i(N) = \sum_{i \neq j} P(\omega_i)\,P(\omega_j) = \tfrac{1}{2}\big[1 - \sum_j P^2(\omega_j)\big]$

This is the expected error rate at node N if the category label is selected randomly from the class distribution present at node N.

• Misclassification impurity:

$i(N) = 1 - \max_j P(\omega_j)$

This measures the minimum probability that a training pattern would be misclassified at node N.

• In practice, simple entropy splitting (choosing the question that splits the data into two
classes of equal size) is very effective.
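
As a quick numerical check of the three measures above, here is a minimal sketch; the class-probability values are made up for illustration.

# Sketch: the three impurity measures, evaluated on an assumed class distribution
# at a node N. p is the list of class probabilities P(omega_j) at that node.

def variance_impurity(p):
    """i(N) = P(w1) * P(w2); defined for the two-class case."""
    assert len(p) == 2
    return p[0] * p[1]

def gini_impurity(p):
    """i(N) = 0.5 * (1 - sum_j P(w_j)^2)."""
    return 0.5 * (1.0 - sum(pj ** 2 for pj in p))

def misclassification_impurity(p):
    """i(N) = 1 - max_j P(w_j)."""
    return 1.0 - max(p)

p = [0.7, 0.3]   # assumed class distribution at node N
print(variance_impurity(p))           # 0.21
print(gini_impurity(p))               # 0.5 * (1 - 0.49 - 0.09) = 0.21
print(misclassification_impurity(p))  # 0.30
# Note that for two classes the Gini (with the 1/2 factor) and variance impurities coincide.
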
