
Machine Learning

Classification & Decision Trees


04/01/2013

Proprietary Information created by Parth Khare

Contents
Recursive Partitioning
Classification
Regression/Decision

Bagging
Random Forest

Boosting
Gradient Boosting

Questions


Detail and flow

What is the difference between supervised and unsupervised learning?

What is machine learning, and how does it differ from classical statistics?
Supervised learning -> an application: trees
Most elementary tree analysis: CART (Classification and Regression Trees)


Basics
Supervised Learning:
Called supervised because the outcome variable is present to guide the learning process
The goal is to build a learner (model) that predicts the outcome for new, unseen objects

Alternatively,
Unsupervised Learning:
We observe only the features and have no measurements of the outcome
The task is instead to describe how the data are organized or clustered


Machine Learning vs. Statistics


Learning vs. fitting
Machine learning, a branch of artificial intelligence, is about the construction and study
of systems that can learn from data.
Statistics bases everything on probability models:
assuming your data are samples from a random variable with some distribution, then
making inferences about the parameters of that distribution

Machine learning may use probability models, and when it does, it overlaps with
statistics.
It is not as committed to probability, and may
use other approaches to problem solving that are not based on probability

The basic optimization concept for trees is the same as that of parametric
techniques: minimizing an error metric. Instead of a squared-error function or MLE,
machine learning optimizes criteria such as entropy or node impurity
An application -> trees

Decision Tree Approach: Parlance


A decision tree represents a hierarchical segmentation of the data
The original segment is called the root node and is the entire data set
The root node is partitioned into two or more segments by applying a series of simple
rules over the input variables
For example, risk = low, risk = not low
Each rule assigns an observation to a segment based on its input value

Each resulting segment can be further partitioned into sub-segments, and so on


For example, risk = low can be partitioned into income = low and income = not low

The segments are also called nodes, and the final segments are called leaf nodes or
leaves
A final node that survives all the partitioning is called a terminal node

Decision Tree Example: Risk Assessment (Loan)

Income
    < $30k   -> Age
                    < 25    -> not on-time
                    >= 25   -> on-time
    >= $30k  -> Credit Score
                    < 600   -> not on-time
                    >= 600  -> on-time

CART: Heuristic and Visual

Generic supervised learning problem:


Given a bunch of data (x1, y1), (x2, y2), ..., (xn, yn) and a new point x, the supervised
learning objective is to associate a y with this new x

Main idea: form a binary tree and minimize the error in each leaf

Given a dataset, a decision tree chooses a sequence of binary splits of the data (sketched below)
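As a concrete sketch of this idea in R: the rpart package and the synthetic loan data below are illustrative assumptions, not part of the original deck. The code grows a binary classification tree and associates a class with a new point x.

# Minimal CART sketch in R; rpart and the synthetic loan data are assumptions.
library(rpart)

set.seed(1)
n <- 1000
loans <- data.frame(
  income       = round(runif(n, 10e3, 90e3)),
  age          = sample(18:70, n, replace = TRUE),
  credit_score = sample(300:850, n, replace = TRUE)
)
# Synthetic target: on-time payment is more likely at higher income / credit score
p <- plogis(-4 + 0.00005 * loans$income + 0.004 * loans$credit_score)
loans$on_time <- factor(rbinom(n, 1, p), levels = c(0, 1),
                        labels = c("late", "on_time"))

# Grow the tree: a sequence of binary splits chosen to reduce node impurity
fit <- rpart(on_time ~ income + age + credit_score, data = loans, method = "class")
print(fit)

# Associate a y with a new, unseen x by passing it down the tree
new_x <- data.frame(income = 25e3, age = 23, credit_score = 580)
predict(fit, new_x, type = "class")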


Growing the tree


Growing the tree involves successively (recursively) partitioning the data
If an input variable is binary, then its two categories can be used to split the data
(relative concentration of 0s and 1s)

If an input variable is interval, a splitting value is used to classify the data into two
segments
For example, if household income is interval and there are 100 possible incomes in the
data set, then there are 100 possible splitting values
For example, income < $30k and income >= $30k

Classification Tree: again (reference)


Represented by a series of binary splits.
Each internal node represents a query on the value of one of the variables, e.g. Is
X3 > 0.4? If the answer is Yes, go right, else go left.
The terminal nodes are the decision nodes. Typically each terminal node is
dominated by one of the classes.
The tree is grown using training data, by recursive splitting.
The tree is often pruned to an optimal size, evaluated by cross-validation (see the sketch below).
New observations are classified by passing their X down to a terminal node of
the tree, and then using majority vote.
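Continuing the earlier rpart sketch (again an assumption, not from the deck), pruning to an optimal size evaluated by cross-validation looks roughly like this:

# rpart cross-validates each subtree size while growing the tree
printcp(fit)                                              # xerror = cross-validated error

# Prune back to the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Classify a new observation: pass its X down to a terminal node, majority vote
predict(pruned, new_x, type = "class")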


Evaluating the partitions


When the target is categorical, a chi-square statistic is computed for each candidate
partition of an input variable
A contingency table is formed that maps responders and non-responders against the
partitioned input variable
For example, the null hypothesis might be that there is no difference between people
with income < $30k and those with income >= $30k in making an on-time loan payment
The lower the p-value, the stronger the evidence against this hypothesis, meaning that
the income split is a discriminating factor (illustrated in the sketch below)
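A brief R illustration of this chi-square check, reusing the synthetic loans data from the earlier sketch (an assumption):

# Contingency table of the income split against on-time payment
income_split <- factor(loans$income < 30e3, labels = c(">= $30k", "< $30k"))
tab <- table(income_split, loans$on_time)
tab

# A small p-value is evidence against "no difference", i.e. the split discriminates
chisq.test(tab)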


Splitting Criteria: Categorical


Information Gain -> Entropy
The rarity of an event is defined as: -log2(pi)
Impurity measure:
- Pr(Y=0) x log2[Pr(Y=0)] - Pr(Y=1) x log2[Pr(Y=1)]
e.g. check the value at Pr(Y=0) = 0.5, where impurity reaches its maximum (see the sketch below)
Entropy sums up the rarity of response and non-response over all observations
Entropy ranges from the best case of 0 (all responders or all non-responders) to 1
(an equal mix of responders and non-responders)
Link:
http://www.youtube.com/watch?v=p17C9q2M00Q
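A small R sketch of the impurity measure above, written directly from the formula on this slide (the function itself is not from the deck):

# Binary entropy: -p0*log2(p0) - p1*log2(p1), with 0*log2(0) taken as 0
entropy <- function(p1) {
  q <- c(1 - p1, p1)
  q <- q[q > 0]
  -sum(q * log2(q))
}

entropy(0.5)   # 1: worst case, an equal mix of responders and non-responders
entropy(0.1)   # ~0.47
entropy(0)     # 0: best case, a pure node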


Splitting Criteria: Continuous


An F-statistic is used to measure the degree of separation of a split for an interval
target, such as revenue
Similar to the sum-of-squares discussion under multiple regression, the
F-statistic is based on the ratio of the sum of squares between the groups to the sum
of squares within the groups, both adjusted for their degrees of freedom
The null hypothesis is that there is no difference in the target mean between the two
groups (see the sketch below)
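A brief R illustration of this F-test for a continuous target; the revenue numbers and the two groups are synthetic, assumed for illustration:

set.seed(2)
# Two groups induced by a candidate split, e.g. income < $30k vs. >= $30k
revenue <- c(rnorm(60, mean = 100, sd = 20),
             rnorm(40, mean = 130, sd = 20))
group   <- factor(rep(c("< $30k", ">= $30k"), times = c(60, 40)))

# F = (between-group SS / df_between) / (within-group SS / df_within)
summary(aov(revenue ~ group))   # reports the F value and its p-value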


Contents
Recursive Partitioning

Classification
Regression/Decision

Bagging

Random Forest

Boosting

Gradient Boosting


Bagging
Ensemble models: combine the results from different models
An ensemble classifier using many decision tree models

Bagging: bootstrapped samples of the data


Working: Random Forest
A different subset of the training data (~2/3) is selected, with replacement, to train
each tree
The remaining training data (OOB) are used to estimate error and variable importance
Class assignment is made by the number of votes from all of the trees; for
regression, the average of the results is used
A randomly selected subset of variables is used to split each node
The number of variables used is decided by the user (the mtry parameter in R; see the sketch below)
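A short random forest sketch with the randomForest package in R; the slide names mtry, but the package choice and the reuse of the synthetic loans data are assumptions:

library(randomForest)

rf <- randomForest(on_time ~ income + age + credit_score,
                   data = loans,
                   ntree = 500,        # number of bootstrapped trees
                   mtry  = 2,          # variables tried at each split
                   importance = TRUE)

rf               # prints the OOB error estimate from the ~1/3 held out per tree
importance(rf)   # OOB-based variable importance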


Bagging: Stanford
Suppose
C(S, x) is a classifier, such as a tree, based
on our training data S, producing a
predicted class label at input point x.
To bag C, we draw bootstrap samples
S_1, ..., S_B, each of size N, with replacement
from the training data.
Then
C_bag(x) = Majority Vote { C(S_b, x) }, b = 1, ..., B (sketched below).
Bagging can dramatically reduce the
variance of unstable procedures (like trees),
leading to improved prediction.
However, any simple structure in C (e.g. a
tree) is lost.
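A hand-rolled sketch of exactly this recipe (bootstrap samples S_1..S_B, trees C(S_b, x), majority vote), reusing the earlier synthetic data; a toy illustration, not a production implementation:

library(rpart)

B <- 25
trees <- lapply(seq_len(B), function(b) {
  Sb <- loans[sample(nrow(loans), replace = TRUE), ]    # bootstrap sample S_b
  rpart(on_time ~ income + age + credit_score, data = Sb, method = "class")
})

# C_bag(x): majority vote over the B trees at a new point x
votes <- sapply(trees, function(tr) as.character(predict(tr, new_x, type = "class")))
names(which.max(table(votes)))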


Bootstrapped samples
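The original slide showed bootstrapped samples as a figure; as a stand-in, a tiny R sketch of what one bootstrap sample looks like (roughly 63% of distinct rows appear in each sample, the rest are out-of-bag):

N   <- nrow(loans)
idx <- sample(N, replace = TRUE)     # one bootstrap sample of size N
length(unique(idx)) / N              # ~0.632 of distinct observations included
mean(!(seq_len(N) %in% idx))         # ~0.368 of rows are out-of-bag (OOB)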


Contents
Recursive Partitioning
Classification
Regression/Decision

Bagging
Random Forest

Boosting
Gradient Boosting


Boosting
Makes copies of the data
Boosting idea: based on the "strength of weak learnability" principle
Example:
IF Gender = MALE AND Age <= 25 THEN claim_freq. = high

A combination of weak learners increases accuracy


"Simple" or "weak" learners are not perfect!
Every boosting algorithm can be interpreted as optimizing the loss function in a greedy, stagewise manner

Working: Gradient Descent


The first tree is created and its residuals are observed
Next, a tree is fitted on the residuals of the first tree, and so on
In this way, boosting grows trees in series, with later trees dependent on the results of previous
trees (see the sketch after this list)
Key parameters: shrinkage, CV folds, interaction depth
Variants: AdaBoost, DirectBoost, Laplace loss / Gaussian boosting
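A hand-rolled sketch of this residual-fitting loop for squared-error loss; the synthetic data and the use of rpart as the weak learner are assumptions for illustration:

library(rpart)

set.seed(3)
x   <- runif(500, 0, 10)
y   <- sin(x) + rnorm(500, sd = 0.3)
dat <- data.frame(x = x, y = y)

shrinkage <- 0.1                       # learning rate
pred <- rep(mean(y), length(y))        # start from a constant prediction
for (m in 1:100) {
  dat$resid <- y - pred                                        # residuals of the current model
  tree <- rpart(resid ~ x, data = dat, maxdepth = 2, cp = 0)   # weak learner fit to the residuals
  pred <- pred + shrinkage * predict(tree, dat)                # later trees depend on earlier ones
}
mean((y - pred)^2)    # training error shrinks as trees are added in series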


GBM

Gradient Tree Boosting is a generalization of boosting to arbitrary differentiable loss functions.


GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and
classification problems.

What it does essentially


By sequentially learning from the errors of the previous trees, gradient boosting in a way tries
to learn the distribution of the target variable. So, analogous to how we use
different types of distributions in GLM modelling, GBM creates/replicates the distribution in the
given data as closely as possible.
This comes with an additional risk of over-fitting, mitigated by methods such as internal
cross-validation, a minimum number of observations per node, etc.

How the parameters work: OOB data/error


We know that the first tree of a GBM is built on the training data and the subsequent trees are
developed on the errors from the previous trees. This process carries on.
For OOB, the training data are also split into two parts: on one part the trees are developed, and
on the other part the tree developed on the first part is tested. This second part is called the
OOB data, and the error obtained is known as the OOB error (see the sketch below).
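A gbm sketch tying these parameters together; the gbm package and the reuse of the synthetic loans data are assumptions, with the slide's shrinkage, CV folds and interaction depth mapped to the arguments below:

library(gbm)

loans$y <- as.integer(loans$on_time == "on_time")   # bernoulli loss expects 0/1

boost <- gbm(y ~ income + age + credit_score,
             data              = loans,
             distribution      = "bernoulli",
             n.trees           = 2000,
             shrinkage         = 0.01,   # learning rate
             interaction.depth = 3,      # depth of each tree
             bag.fraction      = 0.5,    # part of the data held out from each tree (OOB)
             cv.folds          = 5)      # internal cross-validation

gbm.perf(boost, method = "OOB")   # number of trees chosen via the OOB error
gbm.perf(boost, method = "cv")    # number of trees chosen via cross-validation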

Summary: RF and GBM


Main similarities:
Both derive many benefits from ensembling, with few disadvantages
Both can be applied to ensembles of decision trees
Main differences:
Boosting performs an exhaustive search for the best predictor to split on; RF searches
only a small subset
Boosting grows trees in series, with later trees dependent on the results of
previous trees
RF grows trees in parallel, independently of one another
RF cannot work with missing values; GBM can


More differences between RF and GBM


The algorithmic difference:
Random Forests are trained with random samples of the data (with even more randomized
variants available, such as feature randomization) and rely on that randomization for better
generalization performance beyond the training set.
At the other end of the spectrum, the Gradient Boosted Trees algorithm additionally tries to find
an optimal linear combination of trees (the final model is the weighted sum of
predictions of individual trees) with respect to the given training data. This extra tuning might
be deemed the key difference. Note that there are many variations of these
algorithms as well.
On the practical side, owing to this tuning stage,
Gradient Boosted Trees are more sensitive to noisy data. This final stage makes
GBT more likely to overfit, so if the test cases deviate markedly from the training
cases, the algorithm starts to fall behind.
By contrast, Random Forests are better at resisting overfitting, although they can fall
short the other way around.


Questions
Concept/ Interpretation
Application


For further details contact:

Parth Khare
https://www.linkedin.com/profile/view?id=43877647&trk=nav_responsive_tab_profile
