
Foundations of Machine Learning Lecture 6

Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Boosting


Weak Learning
(Kearns and Valiant, 1994)

Definition: a concept class $C$ is weakly PAC-learnable if there exists a (weak) learning algorithm $L$ and $\gamma > 0$ such that: for all $c \in C$, $\delta > 0$, and all distributions $D$,
$$\Pr_{S \sim D}\Bigl[ R(h_S) \le \frac{1}{2} - \gamma \Bigr] \ge 1 - \delta,$$
for samples $S$ of size $m = \mathrm{poly}(1/\delta)$ for a fixed polynomial.


Boosting Ideas
Finding simple, relatively accurate base classifiers is often not hard: this is the role of the weak learner. Main ideas:
- use the weak learner to create a strong learner. How?
- combine the base classifiers returned by the weak learner (ensemble method). But how should the base classifiers be combined?


AdaBoost - Preliminaries
(Freund and Schapire, 1997)

Family of base classifiers: $H \subseteq \{-1, +1\}^X$.

Iterative algorithm ($T$ rounds): at each round $t \in [1, T]$, the algorithm
- maintains a distribution $D_t$ over the training sample;
- forms the convex combination hypothesis $f_t = \sum_{s=1}^{t} \alpha_s h_s$, with $\alpha_s \ge 0$ and $h_s \in H$.

Weighted error of hypothesis $h$ at round $t$:
$$\Pr_{i \sim D_t}[h(x_i) \ne y_i] = \sum_{i=1}^{m} D_t(i)\, 1_{h(x_i) \ne y_i}.$$


AdaBoost
(Freund and Schapire, 1997)

Base classifiers: $H \subseteq \{-1, +1\}^X$.

AdaBoost(S = ((x_1, y_1), ..., (x_m, y_m)))
 1  for i ← 1 to m do
 2      D_1(i) ← 1/m
 3  for t ← 1 to T do
 4      h_t ← base classifier in H with small error ε_t = Pr_{D_t}[h_t(x_i) ≠ y_i]
 5      α_t ← (1/2) log((1 − ε_t)/ε_t)
 6      Z_t ← 2[ε_t(1 − ε_t)]^{1/2}        (normalization factor)
 7      for i ← 1 to m do
 8          D_{t+1}(i) ← D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t
 9  f ← Σ_{t=1}^{T} α_t h_t
10  return h = sgn(f)

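To make the pseudocode concrete, here is a minimal NumPy sketch of AdaBoost with decision stumps as base classifiers. The stump search, function names, and data layout are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def best_stump(X, y, D):
    """Return (weighted error, parameters) of the best threshold stump
    under distribution D; brute-force search over features and thresholds."""
    m, N = X.shape
    best_err, best_params = np.inf, None
    for j in range(N):
        for thr in np.unique(X[:, j]):
            for sign in (+1.0, -1.0):
                pred = sign * np.where(X[:, j] <= thr, 1.0, -1.0)
                err = D[pred != y].sum()
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, sign = params
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def adaboost(X, y, T):
    """AdaBoost with T rounds; labels y must be in {-1, +1}."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                    # D_1: uniform distribution
    stumps, alphas = [], []
    for t in range(T):
        eps, params = best_stump(X, y, D)      # base classifier with small error
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # numerical safeguard
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t
        pred = stump_predict(params, X)
        D = D * np.exp(-alpha * y * pred)      # unnormalized D_{t+1}
        D /= D.sum()                           # divide by Z_t
        stumps.append(params)
        alphas.append(alpha)
    def h(X_new):
        f = sum(a * stump_predict(p, X_new) for a, p in zip(alphas, stumps))
        return np.sign(f)                      # h = sgn(f)
    return h
```

For example, `h = adaboost(X_train, y_train, T=100)` returns a classifier that can then be evaluated on new points; `X_train`, `y_train`, and the number of rounds are of course placeholders.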

Notes

Distributions $D_t$ over the training sample:
- originally uniform;
- at each round, the weight of a misclassified example is increased;
- observation: $D_{t+1}(i) = \dfrac{e^{-y_i f_t(x_i)}}{m \prod_{s=1}^{t} Z_s}$, since
$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t} = \frac{D_{t-1}(i)\, e^{-\alpha_{t-1} y_i h_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t-1} Z_t} = \frac{1}{m}\, \frac{e^{-y_i \sum_{s=1}^{t} \alpha_s h_s(x_i)}}{\prod_{s=1}^{t} Z_s}.$$

Weight assigned to base classifier $h_t$: $\alpha_t$ directly depends on the accuracy of $h_t$ at round $t$.


Illustration

(Figure: a toy example showing, at rounds $t = 1$, $t = 2$, $t = 3$, ..., the selected base classifier and the re-weighted sample, and the resulting combination $\alpha_1 h_1 + \alpha_2 h_2 + \alpha_3 h_3 + \cdots$.)

Bound on Empirical Error


(Freund and Schapire, 1997)

Theorem: the empirical error of the classifier output by AdaBoost verifies:
$$\widehat{R}(h) \le \exp\Bigl(-2 \sum_{t=1}^{T} \Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr).$$
If furthermore $\gamma \le \bigl(\frac{1}{2} - \epsilon_t\bigr)$ for all $t \in [1, T]$, then
$$\widehat{R}(h) \le \exp\bigl(-2 \gamma^{2} T\bigr).$$
$\gamma$ does not need to be known in advance: adaptive boosting.


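As a quick numeric illustration of the second bound (the values are chosen here for concreteness and are not from the slides): if every round achieves edge $\gamma = 0.1$, that is $\epsilon_t \le 0.4$, then after $T = 500$ rounds

```latex
\widehat{R}(h) \;\le\; \exp(-2\gamma^{2} T)
            \;=\; \exp(-2 \times 0.1^{2} \times 500)
            \;=\; e^{-10} \;\approx\; 4.5 \times 10^{-5},
```

so the training error is driven to zero very quickly once a constant edge is available.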

Proof: since, as we saw, $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\widehat{R}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le 0} \le \frac{1}{m} \sum_{i=1}^{m} e^{-y_i f(x_i)} = \frac{1}{m} \sum_{i=1}^{m} \Bigl[ m \prod_{t=1}^{T} Z_t \Bigr] D_{T+1}(i) = \prod_{t=1}^{T} Z_t.$$

Now, since $Z_t$ is a normalization factor,
$$Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i : y_i h_t(x_i) \ge 0} D_t(i)\, e^{-\alpha_t} + \sum_{i : y_i h_t(x_i) < 0} D_t(i)\, e^{\alpha_t} = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t} = (1 - \epsilon_t)\sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t (1 - \epsilon_t)}.$$


Thus,
$$\prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \prod_{t=1}^{T} \sqrt{1 - 4\Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}} \le \prod_{t=1}^{T} \exp\Bigl(-2\Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr) = \exp\Bigl(-2 \sum_{t=1}^{T} \Bigl(\frac{1}{2} - \epsilon_t\Bigr)^{2}\Bigr).$$

Notes:
- $\alpha_t$ is the minimizer of $\alpha \mapsto (1 - \epsilon_t)\, e^{-\alpha} + \epsilon_t\, e^{\alpha}$.
- Since $(1 - \epsilon_t)\, e^{-\alpha_t} = \epsilon_t\, e^{\alpha_t}$, at each round, AdaBoost assigns the same probability mass to correctly classified and misclassified instances.
- For base classifiers taking values in $[-1, +1]$, $\alpha_t$ can be similarly chosen to minimize $Z_t$.

AdaBoost = Coordinate Descent


Objective function: convex and differentiable,
$$F(\alpha) = \sum_{i=1}^{m} e^{-y_i f(x_i)} = \sum_{i=1}^{m} e^{-y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}.$$

(Figure: the exponential loss $x \mapsto e^{-x}$ upper bounds the 0-1 loss.)


Direction: unit vector $e_t$ with
$$e_t = \mathop{\mathrm{argmin}}_{t}\ \frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0}.$$
Since $F(\alpha_{t-1} + \eta\, e_t) = \sum_{i=1}^{m} e^{-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)}\, e^{-\eta\, y_i h_t(x_i)}$,
$$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta}\Big|_{\eta = 0} = -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Bigl(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Bigr) = -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Bigl[ m \prod_{s=1}^{t-1} Z_s \Bigr] = -\bigl[(1 - \epsilon_t) - \epsilon_t\bigr]\, m \prod_{s=1}^{t-1} Z_s = \bigl[2\epsilon_t - 1\bigr]\, m \prod_{s=1}^{t-1} Z_s.$$
Thus, the selected direction corresponds to the base classifier with the smallest error.


Step size: obtained via
$$\frac{dF(\alpha_{t-1} + \eta\, e_t)}{d\eta} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i) \exp\Bigl(-y_i \sum_{s=1}^{t-1} \alpha_s h_s(x_i)\Bigr)\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, \Bigl[ m \prod_{s=1}^{t-1} Z_s \Bigr]\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\sum_{i=1}^{m} y_i h_t(x_i)\, D_t(i)\, e^{-\eta\, y_i h_t(x_i)} = 0$$
$$\Longleftrightarrow\ -\bigl[(1 - \epsilon_t)\, e^{-\eta} - \epsilon_t\, e^{\eta}\bigr] = 0$$
$$\Longleftrightarrow\ \eta = \frac{1}{2} \log \frac{1 - \epsilon_t}{\epsilon_t}.$$
Thus, the step size matches the base classifier weight of AdaBoost.

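As a small sanity check of this derivation (a sketch; the value of `eps` below is an assumed weighted error, not from the slides), one can compare the closed-form step to a direct numerical minimization of the per-round factor $Z_t(\eta) = (1 - \epsilon_t)\, e^{-\eta} + \epsilon_t\, e^{\eta}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3  # assumed weighted error of h_t at the current round

# Closed-form AdaBoost step (base classifier weight).
alpha_closed = 0.5 * np.log((1 - eps) / eps)

# Direct 1-D minimization of Z_t(eta) = (1 - eps) e^{-eta} + eps e^{eta}.
Z = lambda eta: (1 - eps) * np.exp(-eta) + eps * np.exp(eta)
alpha_numeric = minimize_scalar(Z, bounds=(0.0, 10.0), method="bounded").x

print(alpha_closed, alpha_numeric)  # both approximately 0.4236
```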

Alternative Loss Functions


- boosting (exponential) loss: $x \mapsto e^{-x}$
- logistic loss: $x \mapsto \log_2(1 + e^{-x})$
- square loss: $x \mapsto (1 - x)^2\, 1_{x \le 1}$
- hinge loss: $x \mapsto \max(1 - x, 0)$
- zero-one loss: $x \mapsto 1_{x < 0}$

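For reference, a small sketch defining these losses as functions of the margin value $x = y f(x)$ (variable and function names are illustrative):

```python
import numpy as np

boosting_loss = lambda x: np.exp(-x)                  # x -> e^{-x}
logistic_loss = lambda x: np.log2(1 + np.exp(-x))     # x -> log_2(1 + e^{-x})
square_loss   = lambda x: (1 - x) ** 2 * (x <= 1)     # x -> (1 - x)^2 1_{x <= 1}
hinge_loss    = lambda x: np.maximum(1 - x, 0)        # x -> max(1 - x, 0)
zero_one_loss = lambda x: (x < 0).astype(float)       # x -> 1_{x < 0}
```

The first four are convex upper bounds on the zero-one loss.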

Relationship with Logistic Regression


(Lebanon and Lafferty, 2001)

Objective function:
$$F(\alpha) = \sum_{i=1}^{m} \log\bigl(1 + e^{-2 y_i f(x_i)}\bigr).$$

Both methods solve exactly the same optimization problem, except for a normalization constraint required for logistic regression:
$$\min_{p \ge 0}\ D(p, \mathbf{1})$$
subject to: for every feature $j$,
$$\sum_{x} p(x) \sum_{y} p(x, y)\bigl[\Phi_j(x, y) - \mathop{\mathrm{E}}_{\hat{p}}[\Phi_j \mid x]\bigr] = 0,$$
and, for logistic regression only (the difference between the two algorithms),
$$\forall x \in X,\quad \sum_{y \in Y} p(x, y) = 1,$$
with
$$D(p, q) = \sum_{x} p(x) \sum_{y} \Bigl[ p(x, y) \log \frac{p(x, y)}{q(x, y)} - p(x, y) + q(x, y) \Bigr].$$

Standard Use in Practice


Base learners: decision trees, quite often just decision stumps (trees of depth one).

Boosting stumps:
- data in $\mathbb{R}^N$, e.g., $N = 2$, $(\text{height}(x), \text{weight}(x))$;
- associate a stump with each component;
- pre-sort each component: $O(N m \log m)$ (a sketch of this stump search is given after this list);
- at each round, find the best component and threshold;
- total complexity: $O((m \log m) N + m N T)$;
- stumps are not always weak learners: think of the XOR example!
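As mentioned in the list above, here is a sketch of the pre-sorting idea (function names and the handling of ties at equal feature values are simplifying assumptions): the sort is done once in $O(N m \log m)$, then each round scans every component in sorted order, so all thresholds of a component are evaluated in $O(m)$ by incremental weight updates.

```python
import numpy as np

def presort(X):
    """Sort the indices of each of the N components once: O(N m log m)."""
    return np.argsort(X, axis=0)

def best_stump_presorted(X, y, D, order):
    """Best (component, threshold, sign) stump in O(N m) for one round.

    Convention: the candidate stump predicts -1 for x_j <= threshold and +1
    otherwise; its sign-flipped version is also considered via 1 - err.
    """
    m, N = X.shape
    best_err, best = np.inf, None
    for j in range(N):
        # Threshold below all points: every example is predicted +1.
        err = D[y == -1].sum()
        for i in order[:, j]:
            # Moving the threshold past x_i flips its prediction from +1 to -1.
            err += D[i] if y[i] == +1 else -D[i]
            e = min(err, 1.0 - err)            # allow the flipped stump too
            if e < best_err:
                sign = +1.0 if err <= 1.0 - err else -1.0
                best_err, best = e, (j, X[i, j], sign)
    return best_err, best
```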

Overfitting?

Assume that $\mathrm{VCdim}(H) = d$ and, for a fixed $T$, define
$$\mathcal{F}_T = \Bigl\{ \mathrm{sgn}\Bigl( \sum_{t=1}^{T} \alpha_t h_t + b \Bigr) : \alpha_t, b \in \mathbb{R},\ h_t \in H \Bigr\}.$$
$\mathcal{F}_T$ can form a very rich family of classifiers. It can be shown (Freund and Schapire, 1997) that:
$$\mathrm{VCdim}(\mathcal{F}_T) \le 2(d + 1)(T + 1) \log_2\bigl((T + 1) e\bigr).$$
This suggests that AdaBoost could overfit for large values of $T$, and that is in fact observed in some cases, but in various others it is not!

Empirical Observations
Several empirical observations (not all): AdaBoost does not seem to overfit; furthermore, the test error can continue to decrease after the training error has reached zero.

(Figure: training and test error curves as a function of the number of rounds, and the cumulative margin distribution graph, for boosted C4.5 decision trees on the letter dataset, as reported by Schapire et al., 1998.)

L1 Margin Definitions

Definition: the $L_1$-margin of a point $x$ with label $y$ is the algebraic distance of $\mathbf{h}(x) = [h_1(x), \ldots, h_T(x)]^\top$ to the hyperplane $\alpha \cdot \mathbf{x} = 0$:
$$\rho(x) = \frac{y f(x)}{\|\alpha\|_1} = \frac{y \sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} = y\, \frac{\alpha \cdot \mathbf{h}(x)}{\|\alpha\|_1}.$$

Definition: the margin of the classifier for a sample $S = (x_1, \ldots, x_m)$ is the minimum margin of the points in that sample:
$$\rho = \min_{i \in [1, m]} y_i\, \frac{\alpha \cdot \mathbf{h}(x_i)}{\|\alpha\|_1}.$$

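A small worked example (the numbers are chosen purely for illustration): with three base classifiers, weights $\alpha = (0.5, 0.3, 0.2)$ so that $\|\alpha\|_1 = 1$, and a point $x$ with label $y = +1$ for which $(h_1(x), h_2(x), h_3(x)) = (+1, -1, +1)$,

```latex
\rho(x) \;=\; \frac{y \sum_{t=1}^{3} \alpha_t h_t(x)}{\|\alpha\|_1}
        \;=\; \frac{(0.5)(+1) + (0.3)(-1) + (0.2)(+1)}{1}
        \;=\; 0.4 .
```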

Note: with $\mathbf{H}(x) = [h_1(x), \ldots, h_T(x)]^\top$:
- SVM margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{w \cdot \Phi(x_i)}{\|w\|_2}$;
- Boosting margin: $\rho = \min_{i \in [1, m]} y_i\, \dfrac{\alpha \cdot \mathbf{H}(x_i)}{\|\alpha\|_1}$.

Distances: the $L_q$-distance of a point $x$ to the hyperplane $w \cdot x + b = 0$ is
$$\frac{|w \cdot x + b|}{\|w\|_p}, \quad \text{with } \frac{1}{p} + \frac{1}{q} = 1.$$

Rademacher Complexity of Convex Hulls


Theorem: let $H$ be a set of functions mapping from $X$ to $\mathbb{R}$. Let the convex hull of $H$ be defined as
$$\mathrm{conv}(H) = \Bigl\{ \sum_{k=1}^{p} \mu_k h_k : p \ge 1,\ \mu_k \ge 0,\ \sum_{k=1}^{p} \mu_k \le 1,\ h_k \in H \Bigr\}.$$
Then, for any sample $S$, $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.

Proof:
$$\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H,\, \mu \ge 0,\, \|\mu\|_1 \le 1}\ \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{p} \mu_k h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H}\ \sup_{\mu \ge 0,\, \|\mu\|_1 \le 1}\ \sum_{k=1}^{p} \mu_k \sum_{i=1}^{m} \sigma_i h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h_k \in H}\ \max_{k \in [1, p]}\ \sum_{i=1}^{m} \sigma_i h_k(x_i) \Bigr] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Bigl[ \sup_{h \in H}\ \sum_{i=1}^{m} \sigma_i h(x_i) \Bigr] = \widehat{\mathfrak{R}}_S(H).$$

Margin Bound - Ensemble Methods


(Koltchinskii and Panchenko, 2002)

Corollary: let $H$ be a set of real-valued functions. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, each of the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}},$$
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho}\, \widehat{\mathfrak{R}}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$

Proof: direct consequence of the margin bound of Lecture 4 and of $\widehat{\mathfrak{R}}_S(\mathrm{conv}(H)) = \widehat{\mathfrak{R}}_S(H)$.


Margin Bound - Ensemble Methods


(Koltchinskii and Panchenko, 2002); see also (Schapire et al., 1998)

Corollary: let $H$ be a family of functions taking values in $\{-1, +1\}$ with VC-dimension $d$. Fix $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following holds for all $h \in \mathrm{conv}(H)$:
$$R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{2 d \log \frac{em}{d}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.$$

Proof: follows directly from the previous corollary and the VC-dimension bound on the Rademacher complexity (see Lecture 3).

Notes
- All of these bounds can be generalized to hold uniformly for all $\rho \in (0, 1)$, at the cost of an additional term $\sqrt{\frac{\log \log_2 \frac{2}{\rho}}{m}}$ and other minor constant factor changes (Koltchinskii and Panchenko, 2002).
- For AdaBoost, the bound applies to the functions
$$x \mapsto \frac{f(x)}{\|\alpha\|_1} = \frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\|\alpha\|_1} \in \mathrm{conv}(H).$$
- Note that $T$ does not appear in the bound.


Margin Distribution
Theorem: for any $\theta > 0$, the following holds:
$$\widehat{\Pr}\Bigl[ \frac{y f(x)}{\|\alpha\|_1} \le \theta \Bigr] \le 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\theta}\, (1 - \epsilon_t)^{1+\theta}}.$$

Proof: using the identity $D_{T+1}(i) = \dfrac{e^{-y_i f(x_i)}}{m \prod_{t=1}^{T} Z_t}$,
$$\frac{1}{m} \sum_{i=1}^{m} 1_{y_i f(x_i) \le \theta \|\alpha\|_1} \le \frac{1}{m} \sum_{i=1}^{m} \exp\bigl(-y_i f(x_i) + \theta \|\alpha\|_1\bigr) = \frac{1}{m}\, e^{\theta \|\alpha\|_1} \sum_{i=1}^{m} \Bigl[ m \prod_{t=1}^{T} Z_t \Bigr] D_{T+1}(i) = e^{\theta \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} Z_t = e^{\theta \sum_{t=1}^{T} \alpha_t} \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} = 2^T \prod_{t=1}^{T} \sqrt{\epsilon_t^{\,1-\theta}\, (1 - \epsilon_t)^{1+\theta}}.$$


Notes
- If $\gamma \le \bigl(\frac{1}{2} - \epsilon_t\bigr)$ for all $t \in [1, T]$, then the upper bound can be bounded by
$$\widehat{\Pr}\Bigl[ \frac{y f(x)}{\|\alpha\|_1} \le \theta \Bigr] \le \Bigl[ (1 - 2\gamma)^{1-\theta} (1 + 2\gamma)^{1+\theta} \Bigr]^{T/2}.$$
- For $\theta < \gamma$, $(1 - 2\gamma)^{1-\theta} (1 + 2\gamma)^{1+\theta} < 1$, and the bound decreases exponentially in $T$.
- For the bound to be convergent, $\theta = O(1/\sqrt{m})$ suffices; thus, $\gamma = O(1/\sqrt{m})$ is roughly the condition on the edge value.

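A quick numeric illustration (the values of $\gamma$, $\theta$, and $T$ are assumptions for the sake of the example): with edge $\gamma = 0.1$ and margin level $\theta = 0.05$, the per-round factor is $(0.8)^{0.95}(1.2)^{1.05} \approx 0.98 < 1$, so after $T = 200$ rounds

```latex
\widehat{\Pr}\Bigl[\frac{y f(x)}{\|\alpha\|_1} \le 0.05\Bigr]
  \;\le\; \bigl[(0.8)^{0.95}(1.2)^{1.05}\bigr]^{100}
  \;\approx\; 0.13,
```

that is, at most roughly 13% of the training points can have margin below 0.05.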

But, Does AdaBoost Maximize the Margin?


- No: AdaBoost may converge to a margin that is significantly below the maximum margin (Rudin et al., 2004) (e.g., 1/3 instead of 3/8)!
- Lower bound: AdaBoost can asymptotically achieve a margin that is at least $\rho_{\max}/2$ if the data is separable and some conditions on the base learners hold (Rätsch and Warmuth, 2002).
- There are several boosting-type margin-maximization algorithms, but their performance in practice is not clear or not reported.

Outliers
AdaBoost assigns larger weights to harder examples.

Applications:
- detecting mislabeled examples (see the sketch below);
- dealing with noisy data: regularization based on the average weight assigned to a point (soft-margin idea for boosting) (Meir and Rätsch, 2003).

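One simple way to act on the first observation (a sketch, not from the slides: the weight-history array and the cutoff $k$ are assumptions) is to record the distributions $D_t$ produced during training and flag the examples that receive the largest average weight:

```python
import numpy as np

def flag_suspect_examples(weight_history, k=10):
    """weight_history: array of shape (T, m) whose row t is the distribution D_t.
    Returns the indices of the k examples with the largest average weight,
    which are natural candidates for mislabeled points or outliers."""
    avg_weight = np.asarray(weight_history).mean(axis=0)
    return np.argsort(avg_weight)[::-1][:k]
```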

AdaBoost's Weak Learning Condition

Definition: the edge of a base classifier $h_t$ for a distribution $D$ over the training sample is
$$\gamma(t) = \frac{1}{2} - \epsilon_t = \frac{1}{2} \sum_{i=1}^{m} y_i h_t(x_i)\, D(i).$$

Condition: there exists $\gamma > 0$ such that, for any distribution $D$ over the training sample, the base classifier $h_t$ selected satisfies
$$\gamma(t) \ge \gamma.$$

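Equivalently, the edge can be computed as a weighted correlation between the predictions and the labels (a minimal sketch; the array layout is an assumption):

```python
import numpy as np

def edge(D, h_pred, y):
    """gamma = 1/2 - eps = (1/2) * sum_i D(i) y_i h(x_i),
    with D a distribution, h_pred the predictions h(x_i), y labels in {-1, +1}."""
    return 0.5 * float(np.sum(D * y * h_pred))
```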

Zero-Sum Games

Definition: a zero-sum game is specified by a payoff matrix $M = (M_{ij}) \in \mathbb{R}^{m \times n}$:
- $m$ possible actions (pure strategies) for the row player;
- $n$ possible actions for the column player;
- $M_{ij}$: payoff for the row player (= loss for the column player) when row plays $i$ and column plays $j$.

Example (rock-paper-scissors):

              rock   paper   scissors
  rock          0     -1        1
  paper         1      0       -1
  scissors     -1      1        0

Mixed Strategies
(von Neumann, 1928)

Definition: the row player selects a distribution $p$ over the rows, the column player a distribution $q$ over the columns. The expected payoff for the row player is
$$\mathop{\mathrm{E}}_{i \sim p,\, j \sim q}[M_{ij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} p_i M_{ij} q_j = p^\top M q.$$

von Neumann's minimax theorem:
$$\max_{p} \min_{q}\ p^\top M q = \min_{q} \max_{p}\ p^\top M q.$$
Equivalent form:
$$\max_{p} \min_{j \in [1, n]}\ p^\top M e_j = \min_{q} \max_{i \in [1, m]}\ e_i^\top M q.$$
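To make the minimax theorem concrete, here is a sketch that computes the game value and an optimal mixed strategy for the rock-paper-scissors payoff matrix from the previous slide, via a standard LP encoding (the use of scipy and this particular encoding are my own choices, not from the slides); it returns value 0 and the uniform strategy:

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix M for the row player (rows/columns: rock, paper, scissors).
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
m, n = M.shape

# Variables x = (p_1, ..., p_m, v): maximize v subject to
# (M^T p)_j >= v for every column j, and p a probability distribution.
c = np.zeros(m + 1)
c[-1] = -1.0                                   # linprog minimizes, so minimize -v
A_ub = np.hstack([-M.T, np.ones((n, 1))])      # v - (M^T p)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum_i p_i = 1
b_eq = np.array([1.0])
bounds = [(0, 1)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
p, value = res.x[:m], res.x[-1]
print(p, value)   # approximately [1/3, 1/3, 1/3] and 0.0
```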

AdaBoost and Game Theory

Game:
- Player A (row): selects a point $x_i$, $i \in [1, m]$.
- Player B (column): selects a base learner $h_t$, $t \in [1, T]$.
- Payoff matrix $M \in \{-1, +1\}^{m \times T}$: $M_{it} = y_i h_t(x_i)$.

von Neumann's theorem (assuming a finite $H$):
$$2\gamma^{*} = \min_{D} \max_{h \in H}\ \sum_{i=1}^{m} D(i)\, y_i h(x_i) = \max_{\alpha} \min_{i \in [1, m]}\ \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\|\alpha\|_1} = \rho^{*},$$
where $\gamma^{*}$ is the best possible edge and $\rho^{*}$ the maximum margin.


Consequences

- Weak learning condition = non-zero margin.
- Thus, it is possible to search for a non-zero margin.
- AdaBoost = (suboptimal) search for the corresponding $\alpha$.
- Weak learning is a strong condition: it implies linear separability with margin $2\gamma > 0$.


Linear Programming Problem


Maximizing the margin:
$$\rho = \max_{\alpha} \min_{i \in [1, m]}\ \frac{y_i (\alpha \cdot \mathbf{x}_i)}{\|\alpha\|_1}.$$
This is equivalent to the following convex optimization (LP) problem:
$$\max_{\alpha, \rho}\ \rho \quad \text{subject to:}\quad y_i (\alpha \cdot \mathbf{x}_i) \ge \rho\ \ (i \in [1, m]), \qquad \|\alpha\|_1 = 1.$$
Note that
$$\frac{|\alpha \cdot x|}{\|\alpha\|_1} = d_\infty(x, H), \quad \text{with } H = \{x : \alpha \cdot x = 0\},$$
that is, the normalized quantity is the $\ell_\infty$ distance of $x$ to the hyperplane $H$.

(Figure: maximum-margin solutions with different norms: norm $\|\cdot\|_2$ versus norm $\|\cdot\|_\infty$.)

AdaBoost gives an approximate solution for this LP problem.

Advantages of AdaBoost
- Simple: straightforward implementation.
- Efficient: complexity $O(m N T)$ for stumps; when $N$ and $T$ are not too large, the algorithm is quite fast.
- Theoretical guarantees: but still many questions.
  - AdaBoost was not designed to maximize the margin;
  - regularized versions of AdaBoost exist.


Weaker Aspects

- Parameters:
  - need to determine $T$, the number of rounds of boosting: stopping criterion;
  - need to determine the base learners: risk of overfitting or low margins.
- Noise:
  - noise severely damages the accuracy of AdaBoost (Dietterich, 2000);
  - boosting algorithms based on convex potentials do not tolerate even low levels of random noise, even with $L_1$ regularization or early stopping (Long and Servedio, 2010).

References

Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning, 40(2): 139-158, 2000.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1): 119-139, 1997.

Guy Lebanon and John Lafferty. Boosting and maximum likelihood for exponential models. In NIPS, pages 447-454, 2001.

Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.

John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100: 295-320, 1928.

Cynthia Rudin, Ingrid Daubechies, and Robert E. Schapire. The dynamics of AdaBoost: cyclic behavior and convergence of margins. Journal of Machine Learning Research, 5: 1557-1595, 2004.



Gunnar Rätsch and Manfred K. Warmuth. Maximizing the margin with boosting. In Proceedings of the 15th Annual Conference on Computational Learning Theory (COLT 2002), Sydney, Australia, pages 334-350, July 2002.

Robert E. Schapire. The boosting approach to machine learning: an overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.

Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5): 1651-1686, 1998.

