
Ensemble Methods

Bagging + Boosting + Gradient Boosting

Pulak Ghosh
IIMB
Introduction & Motivation
Suppose that you are a patient with a set of symptoms.
Instead of taking the opinion of just one doctor (classifier), you decide to take
the opinion of a few doctors.
Is this a good idea? Indeed it is.
Consult many doctors and then, based on their combined diagnoses, you can get a fairly
accurate idea of the diagnosis.
Majority voting - bagging
More weight to the opinion of some good (accurate) doctors -
boosting
In bagging, you give equal weight to all classifiers, whereas in boosting
you give weight according to the accuracy of each classifier.
Ensemble Methods

Construct a set of classifiers from the training data

Predict the class label of previously unseen records by
aggregating the predictions made by multiple classifiers
General Idea
Ensemble Classifiers (EC)

An ensemble classifier constructs a set of base classifiers
from the training data
Methods for constructing an EC
Manipulating training set
Manipulating input features
Manipulating class labels
Manipulating learning algorithms
Ensemble Classifiers (EC)

Manipulating training set


Multiple training sets are created by resampling the data
according to some sampling distribution
The sampling distribution determines how likely it is that an example
will be selected for training, and it may vary from one trial to another
A classifier is built from each training set using a particular
learning algorithm
Examples: Bagging & Boosting
Ensemble Classifiers (EC)

Manipulating input features


A subset of the input features is chosen to form each training set
The subset can be chosen randomly or based on input from
domain experts
Good for data that has redundant features
Random Forest is an example, which uses decision trees as its base classifiers (a short illustration follows below)
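
As a quick illustration (not part of the original slides), the scikit-learn sketch below fits a random forest on a synthetic data set with redundant features; the data set and all parameter values are assumptions chosen for this example.

# Hedged sketch: a random forest manipulates the input features by letting
# each tree consider only a random subset of features at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with many redundant features (an assumption for this example)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_features controls how many features each split may consider
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("Random forest accuracy:", rf.score(X_test, y_test))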
Ensemble Classifiers (EC)

Manipulating class labels


Useful when the number of classes is sufficiently large
The training data is transformed into a binary class problem by randomly
partitioning the class labels into 2 disjoint subsets, A0 & A1
The re-labelled examples are used to train a base classifier
By repeating the class-relabelling and model-building steps several times, an
ensemble of base classifiers is obtained
How is a new tuple classified? Each base classifier votes for all the classes in the subset it predicts, and the class receiving the most votes is assigned
Example: error-correcting output coding (p. 307)
Ensemble Classifiers (EC)

Manipulating learning algorithm


Learning algorithms can be manipulated in such a way that applying
the algorithm several times on the same training data may result in
different models
Example: an ANN can produce different models by changing the network
topology or the initial weights of the links between neurons
Example: an ensemble of decision trees can be constructed by introducing
randomness into the tree-growing procedure: instead of choosing the
best split attribute at each node, we randomly choose one of the top k
attributes
Ensemble Classifiers (EC)

The first three approaches are generic and can be applied to any
classifier
The fourth approach depends on the type of classifier used
Base classifiers can be generated sequentially or in
parallel
General Idea
[Diagram] The training data S is resampled into multiple data sets S1, S2, ..., Sn;
a classifier C1, C2, ..., Cn is built from each;
their predictions are aggregated into a combined classifier H.
Build Ensemble Classifiers
Basic idea:
Build different experts, and let them vote
Advantages:
Improve predictive performance
Other types of classifiers can be directly included
Easy to implement
Not much parameter tuning required
Disadvantages:
The combined classifier is not so transparent (black box)
Not a compact representation
Why does it work?

Suppose there are 25 base classifiers


Each classifier has error rate \varepsilon = 0.35
Assume the classifiers are independent
The majority vote is wrong only if at least 13 of the 25 base classifiers are wrong,
so the probability that the ensemble makes a wrong prediction is

P(\text{ensemble wrong}) = \sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1-\varepsilon)^{25-i} \approx 0.06
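
The 0.06 figure can be checked directly; the short Python computation below (not in the original slides) sums the binomial probabilities of 13 or more of the 25 independent classifiers being wrong.

# Sanity check of the ensemble error above: 25 independent base classifiers,
# each with error rate eps = 0.35; the majority vote errs only if >= 13 err.
from math import comb

eps, n = 0.35, 25
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))  # approximately 0.06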

Examples of Ensemble Methods

How to generate an ensemble of classifiers?


Bagging
Boosting
Random Forests
Bagging
Introduced by Breiman (1996)

Bagging stands for bootstrap aggregating.

It is an ensemble method: a method of combining multiple predictors.
Bagging algorithm

Let the original training data be L


Repeat B times:
Get a bootstrap sample Lk from L.
Train a predictor using Lk.

Combine the B predictors by:
Voting (for classification problems)
Averaging (for estimation/regression problems)
A minimal sketch of this loop follows below.
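
A minimal from-scratch sketch of the bagging loop, assuming numpy arrays, scikit-learn decision trees as the base predictor, and integer class labels; the function names and defaults are illustrative, not a prescribed implementation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=50, random_state=0):
    # Repeat B times: draw a bootstrap sample L_k and train a predictor on it
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)          # sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # Combine the B predictors by majority vote (classification case);
    # assumes non-negative integer class labels
    votes = np.stack([m.predict(X) for m in models]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

# Usage (illustrative): models = bagging_fit(X_train, y_train, B=100)
#                       y_hat  = bagging_predict(models, X_test)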

Bagging -- the Idea

[Diagram] From the original sample X = (x_1, ..., x_n), draw bootstrap samples
X*_1, X*_2, ..., X*_B and compute a bootstrap estimator \hat{\theta}_1, \hat{\theta}_2, ..., \hat{\theta}_B from each.

The final estimate: \hat{\theta} = (\hat{\theta}_1 + \hat{\theta}_2 + \cdots + \hat{\theta}_B)/B


Adaptive Bagging

[Diagram: the same bootstrap-and-aggregate scheme, applied in stages]

Goal: reduce both variance and bias


Bagging
Bagging works because it reduces variance by voting/averaging
o In some pathological hypothetical situations the overall error might increase
o Usually, the more classifiers the better
Problem: we only have one dataset.
Solution: generate new ones of size n by bootstrapping, i.e.
sampling it with replacement
Can help a lot if data is noisy.
When does Bagging work?
A learning algorithm is unstable if small changes to the training
set cause large changes in the learned classifier.

If the learning algorithm is unstable, then bagging almost
always improves performance (a small comparison sketch follows below)

Some candidates:
Decision tree, decision stump, regression tree, linear
regression, SVMs
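
The sketch below (an assumption-laden example, not from the slides) compares a single decision tree with a bagged version of it on noisy synthetic data; with an unstable base learner, the bagged cross-validation score is typically higher.

# Hedged comparison: unstable base learner (fully grown tree) vs. its bagged ensemble
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(tree, n_estimators=100, random_state=0)

print("single tree :", cross_val_score(tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged, X, y, cv=5).mean())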
Bias-variance Decomposition
Used to analyze how much the selection of a specific training set
affects performance

Assume infinitely many classifiers, built from different training sets

For any learning scheme,


o Bias = expected error of the combined classifier on new data
o Variance = expected error due to the particular training set used

Total expected error ~ bias + variance
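
A rough Monte Carlo sketch of this decomposition for a 1-D regression problem (numpy/scikit-learn assumed; the target function, noise level, and model are illustrative choices): many training sets are drawn, a tree is fit to each, and the squared bias and variance of the predictions are estimated at fixed test points.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 100).reshape(-1, 1)
f_true = np.sin(2 * np.pi * x_test).ravel()          # assumed true target function

preds = []
for _ in range(200):                                  # many training sets (finite stand-in)
    x = rng.uniform(0, 1, (50, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 50)
    preds.append(DecisionTreeRegressor(max_depth=3).fit(x, y).predict(x_test))

preds = np.array(preds)
bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # squared bias of the average model
variance = np.mean(preds.var(axis=0))                 # spread due to the training set used
print(f"bias^2 ~ {bias2:.3f}, variance ~ {variance:.3f}")  # noise term ignored, as above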


Why Bagging works?
Let S = \{(y_i, x_i),\ i = 1, \ldots, N\} be the training data set

Let \{S_k\} be a sequence of training sets, each containing a subset of S

Let P be the underlying distribution of S

Bagging replaces the prediction of a single model with the majority
(or average) of the predictions given by the classifiers trained on the S_k:

\varphi_A(x, P) = E_S[\varphi(x, S_k)]
Why Bagging works?

\varphi_A(x, P) = E_S[\varphi(x, S_k)]

Direct error:
e = E_S\, E_{Y,X}\big[(Y - \varphi(X, S))^2\big]

Bagging error:
e_A = E_{Y,X}\big[(Y - \varphi_A(X, P))^2\big]

Jensen's inequality: (E[Z])^2 \le E[Z^2], so
e = E[Y^2] - 2\,E[Y \varphi_A] + E_{Y,X}\, E_S[\varphi^2(X, S)]
  \ge E_{Y,X}\big[(Y - \varphi_A)^2\big] = e_A
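
A quick numeric check of this inequality (numpy assumed; the numbers are simulated): the squared error of the aggregated predictor never exceeds the average squared error of the individual predictors, the gap being their variance.

import numpy as np

rng = np.random.default_rng(0)
y = 1.0                                     # the target value at some fixed input x
phi = y + rng.normal(0, 0.5, size=1000)     # predictions phi(x, S_k) over many training sets

e = np.mean((y - phi) ** 2)                 # direct error  E_S[(Y - phi)^2]
e_A = (y - phi.mean()) ** 2                 # bagging error (Y - phi_A)^2, phi_A = E_S[phi]
print(e, e_A)                               # e_A <= e; the difference is Var(phi)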
Boosting

An iterative procedure to adaptively change the distribution
of the training data by focusing more on previously
misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of a boosting
round
Overview of boosting
Introduced by Schapire and Freund in the 1990s.

Boosting: convert a weak learning algorithm into a strong one.

Main idea: combine many weak classifiers to produce a powerful committee.

Algorithms:
AdaBoost: adaptive boosting
Gentle AdaBoost
BrownBoost

Bagging

[Diagram] Random samples drawn with replacement from the training data are each passed
to the learning algorithm (ML), giving predictors f_1, f_2, ..., f_T, which are
combined into the final predictor f.
Boosting

[Diagram] The learning algorithm (ML) is applied first to the original training sample
to give f_1, then to successively re-weighted samples to give f_2, ..., f_T;
the f_t are combined into the final predictor f.
What is Boosting?
Analogy: consult several doctors and combine their weighted diagnoses, where each weight
is assigned based on that doctor's previous diagnostic accuracy

How does boosting work?


Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that
were misclassified by Mi
The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy

The boosting algorithm can be extended for the prediction of continuous values

Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks
overfitting the model to misclassified data
Basic Idea?

Suppose there are just 5 training examples {1,2,3,4,5}

Initially each example has a 0.2 (1/5) probability of being sampled

The 1st round of boosting samples (with replacement) 5 examples:
{2, 4, 4, 3, 2}, and builds a classifier from them

Suppose examples 2, 3, 5 are correctly predicted by this classifier, and examples 1, 4 are wrongly predicted:

Weight of examples 1 and 4 is increased


Weight of examples 2, 3, 5 is decreased

2nd round of boosting samples again 5 examples, but now examples 1 and 4 are more likely to be sampled

And so on, until some convergence criterion is met (a toy reweighting sketch follows below)
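
A toy sketch of the reweighting step described above (numpy assumed). The halve/double update used here is a simplification for illustration only, not the exact AdaBoost rule given later.

import numpy as np

weights = np.full(5, 0.2)                                      # examples {1,...,5}, uniform start
misclassified = np.array([True, False, False, True, False])    # examples 1 and 4 were wrong

weights[misclassified] *= 2.0        # increase the weight of the hard examples
weights[~misclassified] *= 0.5       # decrease the weight of the easy ones
weights /= weights.sum()             # renormalise into a sampling distribution
print(weights)                       # examples 1 and 4 are now more likely to be sampled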


Boosting (Contd)
AdaBoost: Adaptive Boosting
Instead of resampling, uses training set re-weighting
Each training sample uses a weight to determine the probability of
being selected for a training set.

AdaBoost is an algorithm for constructing a strong classifier
as a linear combination of simple weak classifiers

Final classification based on weighted vote of weak classifiers


AdaBoost

[Diagram] Start from the sample X = (x_1, ..., x_n) with the initial distribution D_1(i) = 1/n.
At each round t: calculate the error \varepsilon_t, choose the classifier weight
\alpha_t = \tfrac{1}{2}\ln\big((1 - \varepsilon_t)/\varepsilon_t\big), and update the distribution D_{t+1}.

The final estimate: \hat{\theta} = (\alpha_1 \hat{\theta}_1 + \alpha_2 \hat{\theta}_2 + \cdots + \alpha_B \hat{\theta}_B)/B


AdaBoost.M1
The most popular boosting algorithm (Freund and Schapire,
1997)
Consider a two-class problem, with the output variable coded as Y \in \{-1, +1\}
For a predictor variable X, a classifier G(X) produces
predictions that are in \{-1, +1\}
The error rate on the training sample is

\overline{err} = \frac{1}{N} \sum_{i=1}^{N} I\big(y_i \neq G(x_i)\big)
AdaBoost.M1 (Contd)
Sequentially apply the weak classifier to repeatedly
modified versions of the data,
producing a sequence of weak classifiers G_m(x), m = 1, 2, ..., M
The predictions from all classifiers are combined via majority
vote to produce the final prediction
Adaboost Concept
Adaboost starts with a uniform
distribution of weights over training
examples. The weights tell the learning
algorithm the importance of the
example.

Obtain a weak classifier from the weak learning algorithm, h_j(x).

Increase the weights on the training examples that were misclassified.

(Repeat)
At the end, carefully make a linear
combination of the weak classifiers
obtained at all iterations.

f_{\text{final}}(x) = \alpha_{\text{final},1}\, h_1(x) + \cdots + \alpha_{\text{final},n}\, h_n(x)
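
A compact sketch of AdaBoost.M1 in this spirit (not the authors' code), assuming numpy/scikit-learn, labels coded in {-1, +1} as above, and decision stumps as the weak learner. The exponential weight update used here is equivalent, up to normalisation, to the indicator-based update of Freund and Schapire.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m1(X, y, M=50):
    # y must be coded as -1/+1
    n = len(y)
    w = np.full(n, 1.0 / n)                              # initial distribution D_1(i) = 1/n
    stumps, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)        # weighted error eps_m
        alpha = 0.5 * np.log((1 - err) / (err + 1e-10))  # alpha_m = 1/2 ln((1 - eps_m)/eps_m)
        w *= np.exp(-alpha * y * pred)                   # up-weight the misclassified examples
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    # f(x) = sign( sum_m alpha_m G_m(x) ): weighted vote of the weak classifiers
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))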


A toy example (contd)

Final Classifier: integrate the three weak classifiers and obtain a final strong
classifier.
Bagging vs Boosting
Bagging: the construction of complementary base-learners is left
to chance and to the instability of the learning methods.
Boosting: actively seeks to generate complementary base-learners
by training the next base-learner on the mistakes of
the previous learners.
