Regression Trees: where the target variable is continuous and the tree is used to predict its value.
The main goal of a regression algorithm is to predict a continuous value.
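As a hedged illustration, here is a minimal regression-tree sketch using scikit-learn's DecisionTreeRegressor; the synthetic dataset, noise level, and max_depth are assumptions made purely for the example.

```python
# Minimal regression-tree sketch (assumed: scikit-learn, synthetic data, max_depth=4).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))              # one numeric input feature
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)    # continuous, noisy target

reg = DecisionTreeRegressor(max_depth=4).fit(X, y)
print(reg.predict([[2.5], [7.0]]))                 # continuous predicted values
```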
Although Classification and Regression come under the same umbrella of Supervised Machine Learning and
share the common concept of using past data to make predictions or decisions, that is where their
similarity ends.
• A generic tree-growing methodology, known as CART, successively splits nodes until they are pure. Six key
questions must be answered, the first of which is:
1) Should the questions be binary (e.g., is gender male or female?), numeric (e.g., is height >= 5'4"?), or
multi-valued (e.g., race)?
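As a rough illustration (not the full CART methodology), the sketch below greedily asks binary numeric-threshold questions of the form x[f] >= t and keeps splitting until every node is pure; the choice of Gini impurity and all function names here are assumptions made for the example.

```python
# Minimal CART-style sketch: binary numeric-threshold questions, split until pure.
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Pick the question 'x[f] >= t' that most reduces weighted impurity, else None."""
    best, best_imp = None, gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [lab for row, lab in zip(X, y) if row[f] >= t]
            right = [lab for row, lab in zip(X, y) if row[f] < t]
            if not left or not right:
                continue
            imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if imp < best_imp:
                best, best_imp = (f, t), imp
    return best

def grow(X, y):
    """Split nodes recursively until they are pure (or no useful split remains)."""
    split = best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]      # leaf: majority class label
    f, t = split
    yes = [(row, lab) for row, lab in zip(X, y) if row[f] >= t]
    no  = [(row, lab) for row, lab in zip(X, y) if row[f] < t]
    return {"question": (f, t),
            "yes": grow([r for r, _ in yes], [l for _, l in yes]),
            "no":  grow([r for r, _ in no],  [l for _, l in no])}

X = [[5.1], [4.9], [6.3], [6.7]]
y = ["A", "A", "B", "B"]
print(grow(X, y))   # {'question': (0, 6.3), 'yes': 'B', 'no': 'A'}
```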
• If we continue to grow the tree until each leaf node has the lowest impurity, then the data
will be overfit.
• Two strategies: (1) stop tree from growing or (2) grow and then prune the tree.
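A sketch of the two strategies in scikit-learn terms (the specific hyperparameter values are illustrative assumptions): depth and leaf-size limits stop the tree from growing, while cost-complexity pruning grows and then prunes it.

```python
# Sketch (assumed scikit-learn usage) of the two strategies for limiting tree size.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Strategy 1: stop the tree from growing (early-stopping limits on depth / leaf size).
stopped = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Strategy 2: grow the tree, then prune it back (cost-complexity pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(stopped.get_n_leaves(), pruned.get_n_leaves())
```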
• A traditional approach to stopping splitting relies on cross-validation:
Validation: train a tree on 90% of the data and test on 10% of the data (referred to as
the held-out set).
Cross-validation: repeat for several independently chosen partitions.
Stopping Criterion: Continue splitting until the error on the held-out data is minimized.
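A hedged sketch of this stopping procedure, assuming scikit-learn's ShuffleSplit for the repeated 90%/10% partitions and tree depth as the quantity being chosen:

```python
# Sketch (assumed setup): pick the tree size whose error on repeated 90%/10%
# held-out splits is smallest, instead of growing until every leaf is pure.
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = ShuffleSplit(n_splits=10, test_size=0.10, random_state=0)   # independent partitions

for depth in range(1, 8):
    scores = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=cv)
    print(depth, round(1.0 - scores.mean(), 3))   # mean held-out error per tree depth
```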
• Reduction in Impurity: stop if the best candidate split yields only a marginal reduction of the
impurity (drawback: this tends to produce an unbalanced tree).
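In scikit-learn this roughly corresponds to the min_impurity_decrease threshold; a small sketch (the 0.01 threshold is an illustrative assumption):

```python
# Sketch: stop splitting when a candidate split reduces impurity by less than a threshold.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(min_impurity_decrease=0.01).fit(X, y)
print(tree.get_n_leaves())   # splitting stops earlier, so fewer leaves than a full tree
```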
• Cost-Complexity: use a global criterion function that combines size and impurity:
$\alpha \cdot \text{size} + \sum_{\text{leaf nodes}} i(N)$
This approach is related to minimum description length when the impurity is based on entropy.
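A tiny worked sketch of this global criterion; the value of alpha and the leaf impurities are made-up numbers purely for illustration.

```python
# Global criterion: alpha * size + sum of impurities over the leaf nodes.
def cost_complexity(leaf_impurities, alpha):
    size = len(leaf_impurities)                     # number of leaf nodes
    return alpha * size + sum(leaf_impurities)

print(cost_complexity([0.0] * 12, alpha=0.05))            # big pure tree     -> 0.60
print(cost_complexity([0.10, 0.15, 0.05], alpha=0.05))    # small impure tree -> 0.45
```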
• The most fundamental problem with decision trees is that they "overfit" the data and
hence do not provide good generalization. A solution to this problem is to prune the tree:
• But pruning the tree will always increase the error rate on the training set.
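A sketch of grow-then-prune using scikit-learn's cost-complexity pruning path, which also shows the training error rising as more of the tree is pruned away (the dataset and random_state are illustrative choices):

```python
# Sketch (assumed scikit-learn usage): grow fully, then prune with increasing alpha;
# the tree shrinks and the training error goes up, never down.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"train error={1.0 - tree.score(X, y):.3f}")
```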
Decision Tree Classification Task
[Figure: a tree induction algorithm learns a decision-tree model from a labelled Training Set (columns Tid, Attrib1, Attrib2, Attrib3, Class); the model is then applied to the unlabelled records of a Test Set (e.g., Tid 11 and 15) to predict their Class.]
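A minimal sketch of this induce-then-apply workflow; the attribute encodings and class labels below are illustrative assumptions rather than the figure's actual records.

```python
# Sketch of the figure's workflow: induce a tree from a labelled training set,
# then apply the model to unlabelled test records.
from sklearn.tree import DecisionTreeClassifier

# Training set: one row per record [Attrib1, Attrib2, Attrib3]
# (Attrib1: Yes=1/No=0, Attrib2: Small=0/Medium=1/Large=2, Attrib3: e.g., income in $K).
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [0, 1, 60]]
y_train = ["No", "No", "Yes", "Yes"]        # illustrative Class labels

# Tree induction algorithm: learn the model from the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Apply the model: predict the unknown Class of the test records.
X_test = [[0, 0, 55], [0, 2, 67]]
print(model.predict(X_test))
```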
Alternate Splitting Criteria
• Variance impurity (for the two-class case; all three measures below are compared numerically in the sketch at the end of this section):
$i(N) = P(\omega_1)\,P(\omega_2)$
because this is related to the variance of the distribution associated with the two classes.
• Gini Impurity:
$i(N) = \sum_{i \neq j} P(\omega_i)P(\omega_j) = \frac{1}{2}\Big[1 - \sum_j P^2(\omega_j)\Big]$
the expected error rate at node N if the category label is selected randomly from the class
distribution present at node N.
• Misclassification impurity:
$i(N) = 1 - \max_j P(\omega_j)$
measures the minimum probability that a training pattern would be misclassified at node
N.
• In practice, simple entropy splitting (choosing the question that splits the data into two
classes of equal size) is very effective.
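A small sketch comparing the three alternate impurity measures defined above on one node's class distribution; the probabilities P(w1)=0.7, P(w2)=0.3 are assumed purely for the example.

```python
# Compare the three impurity measures on one illustrative class distribution at node N.
def variance_impurity(p):          # two-class case: P(w1) * P(w2)
    return p[0] * p[1]

def gini_impurity(p):              # 1/2 * (1 - sum_j P(w_j)^2)
    return 0.5 * (1.0 - sum(pj ** 2 for pj in p))

def misclassification_impurity(p): # 1 - max_j P(w_j)
    return 1.0 - max(p)

p = [0.7, 0.3]
print(variance_impurity(p), gini_impurity(p), misclassification_impurity(p))
# -> approximately 0.21, 0.21, 0.30; all three peak when the classes are equally likely.
```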