
Foundations of Machine Learning Lecture 6

Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Boosting


Weak Learning
(Kearns and Valiant, 1994)

Definition: a concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that: for all $c \in C$, $\delta > 0$, and all distributions $D$,
$$\Pr_{S \sim D}\Bigl[ R(h_S) \le \frac{1}{2} - \gamma \Bigr] \ge 1 - \delta,$$
for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.


Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard: this is the role of the weak learner. Main ideas:
- use the weak learner to create a strong learner. How?
- combine the base classifiers returned by the weak learner (ensemble method). But how should the base classifiers be combined?


AdaBoost - Preliminaries
(Freund and Schapire, 1997)

Family of base classifiers: $H \subseteq \{-1, +1\}^X$.

Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, the algorithm
- maintains a distribution $D_t$ over the training sample;
- forms the convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$.

Weighted error of hypothesis $h$ at round $t$:
$$\Pr_{i \sim D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.$$


AdaBoost
(Freund and Schapire, 1997)

Base classifiers: $H \subseteq \{-1, +1\}^X$.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← (1/2) log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}        (normalization factor)
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9  f ← Σ_{t=1}^{T} α_t h_t
10  return h = sgn(f)

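To make the pseudocode concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as base classifiers. The stump search, function names, and data layout are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, D):
    """Return (weighted error, parameters) of the best threshold stump
    under distribution D; brute-force search over features and thresholds."""
    m, N = X.shape
    best_err, best_params = np.inf, None
    for j in range(N):
        for thr in np.unique(X[:, j]):
            for sign in (+1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, sign = params
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T):
    """AdaBoost with T rounds; labels y must be in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                    # D_1: uniform distribution
    stumps, alphas = [], []
    for t in range(T):
        eps, params = best_stump(X, y, D)      # base classifier with small error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # numerical safeguard
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        pred = stump_predict(params, X)
        D = D * np.exp(-alpha * y * pred)      # unnormalized D_{t+1}
        D /= D.sum()                           # divide by Z_t
        stumps.append(params)
        alphas.append(alpha)
    def h(X_new):
        f = sum(a * stump_predict(p, X_new) for a, p in zip(alphas, stumps))
        return np.sign(f)                      # h = sgn(f)
    return h
```

For example, `h = adaboost(X_train, y_train, T=100)` returns a classifier that can then be evaluated on new points; `X_train`, `y_train`, and the number of rounds are of course placeholders.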

Notes

Distributions $D_t$ over the training sample:
- originally uniform;
- at each round, the weight of a misclassified example is increased;
- observation: $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.


Illustration

(Figure: a toy example showing, at rounds $t = 1$, $t = 2$, $t = 3$, ..., the selected base classifier and the re-weighted sample, and the resulting combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3 + \cdots$.)

Bound on Empirical Error


(Freund and Schapire, 1997)

Theorem: the empirical error of the classifier output by AdaBoost verifies:
$$\widehat{R}(h) \le \exp\Bigl(-2 \sum_{t=1}^{T} \Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr).$$
If furthermore $\gamma \le \bigl(\frac{1}{2} - \epsilon_t\bigr)$ for all $t \in [1, T]$, then
$$\widehat{R}(h) \le \exp\bigl(-2 \gamma^{2} T\bigr).$$
$\gamma$ does not need to be known in advance: adaptive boosting.


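As a quick numeric illustration of the second bound (the values are chosen here for concreteness and are not from the slides): if every round achieves edge $\gamma = 0.1$, that is $\epsilon_t \le 0.4$, then after $T = 500$ rounds

```latex
\widehat{R}(h) \;\le\; \exp(-2\gamma^{2} T)
            \;=\; \exp(-2 \times 0.1^{2} \times 500)
            \;=\; e^{-10} \;\approx\; 4.5 \times 10^{-5},
```

so the training error is driven to zero very quickly once a constant edge is available.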

Proof: since, as we saw, $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)} = \frac{1}{m} \sum_{i=1}^{m} \Bigl[ m \prod_{t=1}^{T} Z_t \Bigr] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$$

Now, since $Z_t$ is a normalization factor,
$$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i : y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i : y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t)\sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}.$$


Thus,
$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}} \le \prod_{t=1}^{T} \exp\Bigl(-2\Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr) = \exp\Bigl(-2 \sum_{t=1}^{T} \Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr).$$

Notes:
- $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$.
- Since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
- For base classifiers taking values in $[-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.

AdaBoost = Coordinate Descent


Objective function: convex and differentiable,
$$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}.$$

(Figure: the exponential loss $x \mapsto e^{-x}$ upper bounds the 0-1 loss.)


Direction: unit vector $e_t$ with
$$e_t = \mathop{\mathrm{argmin}}_{t}\ \frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0}.$$
Since $F(\alpha_{t-1} + \eta\, e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta\, y_i h_t(x_i)}$,
$$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0} = -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Bigl(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Bigr) = -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Bigl[ m \prod_{s=1}^{t-1} Z_s \Bigr] = -\bigl[(1 - \epsilon_t) - \epsilon_t\bigr]\, m \prod_{s=1}^{t-1} Z_s = \bigl[2\epsilon_t - 1\bigr]\, m \prod_{s=1}^{t-1} Z_s.$$
Thus, the selected direction corresponds to the base classifier with the smallest error.


Step size: obtained via
$$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Bigl(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Bigr)\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Bigl[ m \prod_{s=1}^{t-1} Z_s \Bigr]\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\bigl[(1 - \epsilon_t)\, e^{-\eta} - \epsilon_t\, e^{\eta}\bigr] = 0$$
$$\Longleftrightarrow\ \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.$$
Thus, the step size matches the base classifier weight of AdaBoost.

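As a small sanity check of this derivation (a sketch; the value of `eps` below is an assumed weighted error, not from the slides), one can compare the closed-form step to a direct numerical minimization of the per-round factor $Z_t(\eta) = (1 - \epsilon_t)\, e^{-\eta} + \epsilon_t\, e^{\eta}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3  # assumed weighted error of h_t at the current round

# Closed-form AdaBoost step (base classifier weight).
alpha_closed = 0.5 * np.log((1 - eps) / eps)

# Direct 1-D minimization of Z_t(eta) = (1 - eps) e^{-eta} + eps e^{eta}.
Z = lambda eta: (1 - eps) * np.exp(-eta) + eps * np.exp(eta)
alpha_numeric = minimize_scalar(Z, bounds=(0.0, 10.0), method="bounded").x

print(alpha_closed, alpha_numeric)  # both approximately 0.4236
```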

Alternative Loss Functions


- boosting (exponential) loss: $x \mapsto e^{-x}$
- logistic loss: $x \mapsto \log_2(1 + e^{-x})$
- square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
- hinge loss: $x \mapsto \max(1 - x, 0)$
- zero-one loss: $x \mapsto 1_{x < 0}$

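For reference, a small sketch defining these losses as functions of the margin value $x = y f(x)$ (variable and function names are illustrative):

```python
import numpy as np

boosting_loss = lambda x: np.exp(-x)                  # x -> e^{-x}
logistic_loss = lambda x: np.log2(1 + np.exp(-x))     # x -> log_2(1 + e^{-x})
square_loss   = lambda x: (1 - x) ** 2 * (x <= 1)     # x -> (1 - x)^2 1_{x <= 1}
hinge_loss    = lambda x: np.maximum(1 - x, 0)        # x -> max(1 - x, 0)
zero_one_loss = lambda x: (x < 0).astype(float)       # x -> 1_{x < 0}
```

The first four are convex upper bounds on the zero-one loss.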

Relationship with Logistic Regression


(Lebanon and Lafferty, 2001)

Objective function:
$$F(\alpha) = \sum_{i=1}^{m} \log\bigl(1 + e^{-2 y_i f(x_i)}\bigr).$$

Both methods solve exactly the same optimization problem, except for a normalization constraint required for logistic regression:
$$\min_{p \ge 0}\ D(p, \mathbf{1})$$
subject to: for every feature $j$,
$$\sum_{x} p(x) \sum_{y} p(x, y)\bigl[\Phi_j(x, y) - \mathop{\mathrm{E}}_{\hat{p}}[\Phi_j \mid x]\bigr] = 0,$$
and, for logistic regression only (the difference between the two algorithms),
$$\forall x \in X,\quad \sum_{y \in Y} p(x, y) = 1,$$
with
$$D(p, q) = \sum_{x} p(x) \sum_{y} \Bigl[ p(x, y) \log \frac{p(x, y)}{q(x, y)} - p(x, y) + q(x, y) \Bigr].$$

Standard Use in Practice


Base learners: decision trees, quite often just decision stumps (trees of depth one).

Boosting stumps:
- data in $\mathbb{R}^N$, e.g., $N = 2$, $(\text{height}(x), \text{weight}(x))$;
- associate a stump with each component;
- pre-sort each component: $O(N m \log m)$ (a sketch of this stump search is given after this list);
- at each round, find the best component and threshold;
- total complexity: $O((m \log m) N + m N T)$;
- stumps are not always weak learners: think of the XOR example!
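As mentioned in the list above, here is a sketch of the pre-sorting idea (function names and the handling of ties at equal feature values are simplifying assumptions): the sort is done once in $O(N m \log m)$, then each round scans every component in sorted order, so all thresholds of a component are evaluated in $O(m)$ by incremental weight updates.

```python
import numpy as np

def presort(X):
    """Sort the indices of each of the N components once: O(N m log m)."""
    return np.argsort(X, axis=0)

def best_stump_presorted(X, y, D, order):
    """Best (component, threshold, sign) stump in O(N m) for one round.

    Convention: the candidate stump predicts -1 for x_j <= threshold and +1
    otherwise; its sign-flipped version is also considered via 1 - err.
    """
    m, N = X.shape
    best_err, best = np.inf, None
    for j in range(N):
        # Threshold below all points: every example is predicted +1.
        err = D[y == -1].sum()
        for i in order[:, j]:
            # Moving the threshold past x_i flips its prediction from +1 to -1.
            err += D[i] if y[i] == +1 else -D[i]
            e = min(err, 1.0 - err)            # allow the flipped stump too
            if e < best_err:
                sign = +1.0 if err <= 1.0 - err else -1.0
                best_err, best = e, (j, X[i, j], sign)
    return best_err, best
```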

Overfitting?

Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
$$\mathcal{F}_T = \Bigl\{ \mathrm{sgn}\Bigl( \sum_{t=1}^{T} \alpha_t h_t + b \Bigr) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Bigr\}.$$
$\mathcal{F}_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
$$\mathrm{VCdim}(\mathcal{F}_T) \le 2(d + 1)(T + 1) \log_2\bigl((T + 1) e\bigr).$$
This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error can continue to decrease after the training error has reached zero.

(Figure: training and test error curves as a function of the number of rounds, and the cumulative margin distribution graph, for boosted C4.5 decision trees on the letter dataset, as reported by Schapire et al., 1998.)

L1 Margin Definitions

Definition: the $L_1$-margin of a point $x$ with label $y$ is the algebraic distance of $\mathbf{h}(x) = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot \mathbf{x} = 0$:
$$\rho(x) = \frac{y f(x)}{\|\alpha\|_1} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = y\, \frac{\alpha \cdot \mathbf{h}(x)}{\|\alpha\|_1}.$$

Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:
$$\rho = \min_{i \in [1, m]} y_i\, \frac{\alpha \cdot \mathbf{h}(x_i)}{\|\alpha\|_1}.$$

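A small worked example (the numbers are chosen purely for illustration): with three base classifiers, weights $\alpha = (0.5, 0.3, 0.2)$ so that $\|\alpha\|_1 = 1$, and a point $x$ with label $y = +1$ for which $(h_1(x), h_2(x), h_3(x)) = (+1, -1, +1)$,

```latex
\rho(x) \;=\; \frac{y \sum_{t=1}^{3} \alpha_t h_t(x)}{\|\alpha\|_1}
        \;=\; \frac{(0.5)(+1) + (0.3)(-1) + (0.2)(+1)}{1}
        \;=\; 0.4 .
```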

Note: with $\mathbf{H}(x) = [h_1(x), \ldots, h_T(x)]^\top$:
- SVM margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{w \cdot \Phi(x_i)}{\|w\|_2}$;
- Boosting margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{\alpha \cdot \mathbf{H}(x_i)}{\|\alpha\|_1}$.

Distances: the $L_q$-distance of a point $x$ to the hyperplane $w \cdot x + b = 0$ is
$$\frac{|w \cdot x + b|}{\|w\|_p}, \quad \text{with } \frac{1}{p} + \frac{1}{q} = 1.$$

Rademacher Complexity of Convex Hulls


Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$. Let the convex hull of $H$ be defined as
$$\mathrm{conv}(H) = \Bigl\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\ \mu_k \ge 0,\ \sum_{k=1}^{p} \mu_k \le 1,\ h_k \in H \Bigr\}.$$
Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.

Proof:
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1}\ \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H}\ \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1}\ \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H}\ \max_{k \in [1, p]}\ \sum_{i=1}^{m} \sigma_i h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h \in H}\ \sum_{i=1}^{m} \sigma_i h(x_i) \Bigr] = \widehat{\mathfrak{R}}_S(H).$$

Margin Bound - Ensemble Methods


(Koltchinskii and Panchenko, 2002)

Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: direct consequence of the margin bound of Lecture 4 and of $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.


Margin Bound - Ensemble Methods


(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)

Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC-dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: follows directly from the previous corollary and the VC-dimension bound on the Rademacher complexity (see Lecture 3).

Notes
- All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor changes (Koltchinskii and Panchenko, 2002).
- For AdaBoost, the bound applies to the functions
$$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$$
- Note that $T$ does not appear in the bound.


Margin Distribution
Theorem: for any $\theta > 0$, the following holds:
$$\widehat{\Pr}\Bigl[ \frac{y f(x)}{\|\alpha\|_1} \le \theta \Bigr] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\theta}\, (1 - \epsilon_t)^{1+\theta}}.$$

Proof: using the identity $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \theta \|\alpha\|_1} \le \frac{1}{m} \sum_{i=1}^{m} \exp\bigl(-y_i f(x_i) + \theta \|\alpha\|_1\bigr) = \frac{1}{m}\, e^{\theta \|\alpha\|_1} \sum_{i=1}^{m} \Bigl[ m \prod_{t=1}^{T} Z_t \Bigr] D_{T+1}(i) = e^{\theta \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} Z_t = e^{\theta \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\theta}\, (1 - \epsilon_t)^{1+\theta}}.$$


Notes
- If $\gamma \le \bigl(\frac{1}{2} - \epsilon_t\bigr)$ for all $t \in [1, T]$, then the upper bound can be bounded by
$$\widehat{\Pr}\Bigl[ \frac{y f(x)}{\|\alpha\|_1} \le \theta \Bigr] \le \Bigl[ (1 - 2\gamma)^{1-\theta} (1 + 2\gamma)^{1+\theta} \Bigr]^{T/2}.$$
- For $\theta < \gamma$, $(1 - 2\gamma)^{1-\theta} (1 + 2\gamma)^{1+\theta} < 1$, and the bound decreases exponentially in $T$.
- For the bound to be convergent, $\theta = O(1/\sqrt{m})$ suffices; thus, $\gamma = O(1/\sqrt{m})$ is roughly the condition on the edge value.

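A quick numeric illustration (the values of $\gamma$, $\theta$, and $T$ are assumptions for the sake of the example): with edge $\gamma = 0.1$ and margin level $\theta = 0.05$, the per-round factor is $(0.8)^{0.95}(1.2)^{1.05} \approx 0.98 < 1$, so after $T = 200$ rounds

```latex
\widehat{\Pr}\Bigl[\frac{y f(x)}{\|\alpha\|_1} \le 0.05\Bigr]
  \;\le\; \bigl[(0.8)^{0.95}(1.2)^{1.05}\bigr]^{100}
  \;\approx\; 0.13,
```

that is, at most roughly 13% of the training points can have margin below 0.05.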

But, Does AdaBoost Maximize the Margin?


- No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
- Lower bound: AdaBoost can asymptotically achieve a margin that is at least $\rho_{\max}/2$ if the data is separable and some conditions on the base learners hold (Rätsch and Warmuth, 2002).
- There are several boosting-type margin-maximization algorithms, but their performance in practice is not clear or not reported.

Outliers
AdaBoost assigns larger weights to harder examples.

Applications:
- detecting mislabeled examples (see the sketch below);
- dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).

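One simple way to act on the first observation (a sketch, not from the slides: the weight-history array and the cutoff $k$ are assumptions) is to record the distributions $D_t$ produced during training and flag the examples that receive the largest average weight:

```python
import numpy as np

def flag_suspect_examples(weight_history, k=10):
    """weight_history: array of shape (T, m) whose row t is the distribution D_t.
    Returns the indices of the k examples with the largest average weight,
    which are natural candidates for mislabeled points or outliers."""
    avg_weight = np.asarray(weight_history).mean(axis=0)
    return np.argsort(avg_weight)[::-1][:k]
```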

AdaBoost's Weak Learning Condition

Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
$$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$$

Condition: there exists $\gamma > 0$ such that, for any distribution $D$ over the training sample, the base classifier $h_t$ selected satisfies
$$\gamma(t) \ge \gamma.$$

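Equivalently, the edge can be computed as a weighted correlation between the predictions and the labels (a minimal sketch; the array layout is an assumption):

```python
import numpy as np

def edge(D, h_pred, y):
    """gamma = 1/2 - eps = (1/2) * sum_i D(i) y_i h(x_i),
    with D a distribution, h_pred the predictions h(x_i), y labels in {-1, +1}."""
    return 0.5 * float(np.sum(D * y * h_pred))
```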

Zero-Sum Games

Definition: a zero-sum game is specified by a payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
- $m$ possible actions (pure strategies) for the row player;
- $n$ possible actions for the column player;
- $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.

Example (rock-paper-scissors):

              rock   paper   scissors
  rock          0     -1        1
  paper         1      0       -1
  scissors     -1      1        0

Mixed Strategies
(von Neumann, 1928)

Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is
$$\mathop{\mathrm{E}}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$$

von Neumann's minimax theorem:
$$\max_{p} \min_{q}\ p^\top M q = \min_{q} \max_{p}\ p^\top M q.$$
Equivalent form:
$$\max_{p} \min_{j \in [1, n]}\ p^\top M e_j = \min_{q} \max_{i \in [1, m]}\ e_i^\top M q.$$
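To make the minimax theorem concrete, here is a sketch that computes the game value and an optimal mixed strategy for the rock-paper-scissors payoff matrix from the previous slide, via a standard LP encoding (the use of scipy and this particular encoding are my own choices, not from the slides); it returns value 0 and the uniform strategy:

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix M for the row player (rows/columns: rock, paper, scissors).
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
m, n = M.shape

# Variables x = (p_1, ..., p_m, v): maximize v subject to
# (M^T p)_j >= v for every column j, and p a probability distribution.
c = np.zeros(m + 1)
c[-1] = -1.0                                   # linprog minimizes, so minimize -v
A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - (M^T p)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum_i p_i = 1
b_eq = np.array([1.0])
bounds = [(0, 1)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, value = res.x[:m], res.x[-1]
print(p, value)   # approximately [1/3, 1/3, 1/3] and 0.0
```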

AdaBoost and Game Theory

Game:
- Player A (row): selects a point $x_i$, $i \in [1, m]$.
- Player B (column): selects a base learner $h_t$, $t \in [1, T]$.
- Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem (assuming a finite $H$):
$$2\gamma^{*} = \min_{D} \max_{h \in H}\ \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \max_{\alpha} \min_{i \in [1, m]}\ \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^{*},$$
where $\gamma^{*}$ is the best possible edge and $\rho^{*}$ the maximum margin.


Consequences

- Weak learning condition = non-zero margin.
- Thus, it is possible to search for a non-zero margin.
- AdaBoost = (suboptimal) search for the corresponding $\alpha$.
- Weak learning is a strong condition: it implies linear separability with margin $2\gamma > 0$.


Linear Programming Problem


Maximizing the margin:
$$\rho = \max_{\alpha} \min_{i \in [1, m]}\ \frac{y_i (\alpha \cdot \mathbf{x}_i)}{\|\alpha\|_1}.$$
This is equivalent to the following convex optimization (LP) problem:
$$\max_{\alpha, \rho}\ \rho \quad \text{subject to:}\quad y_i (\alpha \cdot \mathbf{x}_i) \ge \rho\ \ (i \in [1, m]), \qquad \|\alpha\|_1 = 1.$$
Note that
$$\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_\infty(x, H), \quad \text{with } H = \{x : \alpha \cdot x = 0\},$$
that is, the normalized quantity is the $\ell_\infty$ distance of $x$ to the hyperplane $H$.

(Figure: maximum-margin solutions with different norms: norm $\|\cdot\|_2$ versus norm $\|\cdot\|_\infty$.)

AdaBoost gives an approximate solution for this LP problem.

Advantages of AdaBoost
- Simple: straightforward implementation.
- Efficient: complexity $O(m N T)$ for stumps; when $N$ and $T$ are not too large, the algorithm is quite fast.
- Theoretical guarantees: but still many questions.
  - AdaBoost was not designed to maximize the margin;
  - regularized versions of AdaBoost exist.


Weaker Aspects

- Parameters:
  - need to determine $T$, the number of rounds of boosting: stopping criterion;
  - need to determine the base learners: risk of overfitting or low margins.
- Noise:
  - noise severely damages the accuracy of AdaBoost (Dietterich, 2000);
  - boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with $L_1$ regularization or early stopping (Long and Servedio, 2010).

References

Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.

Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.

Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.

John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100: 295-320, 1928.

Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5: 1557-1595, 2004.



Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.

Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.

