
Learning From Data

Prof. Yaser Abu-Mostafa

Caltech, 2012
http://work.caltech.edu/telecourse.html

Outline of the Course

1. The Learning Problem (April 3)
2. Is Learning Feasible? (April 5)
3. The Linear Model I (April 10)
4. Error and Noise (April 12)
5. Training versus Testing (April 17)
6. Theory of Generalization (April 19)
7. The VC Dimension (April 24)
8. Bias-Variance Tradeoff (April 26)
9. The Linear Model II (May 1)
10. Neural Networks (May 3)
11. Overfitting (May 8)
12. Regularization (May 10)
13. Validation (May 15)
14. Support Vector Machines (May 17)
15. Kernel Methods (May 22)
16. Radial Basis Functions (May 24)
17. Three Learning Principles (May 29)
18. Epilogue (May 31)

Legend: theory (mathematical); technique (practical); analysis (conceptual)

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 1: The Learning Problem

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 3, 2012

The learning problem - Outline

Example of machine learning
Components of learning
A simple model
Types of learning
Puzzle

Learning From Data - Lecture 1    2/19

Example: Predicting how a viewer will rate a movie

10% improvement = 1 million dollar prize

The essence of machine learning:

A pattern exists.
We cannot pin it down mathematically.
We have data on it.

Learning From Data - Lecture 1    3/19

Movie rating - a solution

[Figure: match movie factors (comedy content? action content? blockbuster? Tom Cruise in it?) against the corresponding viewer factors (likes comedy? likes action? prefers blockbusters? likes Tom Cruise?), then add the contributions from each factor to get the predicted rating.]

Learning From Data - Lecture 1    4/19

The learning approach

[Figure: the viewer and the movie go into LEARNING, which produces the rating.]

Learning From Data - Lecture 1    5/19

Metaphor: Credit approval

Applicant information:

age                 23 years
gender              male
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

Approve credit?

Learning From Data - Lecture 1    6/19

Components of learning

Formalization:

Input: x                                     (customer application)
Output: y                                    (good/bad customer?)
Target function: f : X → Y                   (ideal credit approval formula)
Data: (x1, y1), (x2, y2), ..., (xN, yN)      (historical records)
Hypothesis: g : X → Y                        (formula to be used)

Learning From Data - Lecture 1    7/19

[Learning diagram: an unknown target function f : X → Y (ideal credit approval function) generates the training examples (x1, y1), ..., (xN, yN) (historical records of credit customers); the learning algorithm A picks the final hypothesis g ≈ f (final credit approval formula) from the hypothesis set H (set of candidate formulas).]

Learning From Data - Lecture 1    8/19

Solution components

The 2 solution components of the learning problem:

The Hypothesis Set:    H = {h},    g ∈ H

The Learning Algorithm:    A

Together, they are referred to as the learning model.

Learning From Data - Lecture 1    9/19

A simple hypothesis set - the `perceptron'

For input x = (x1, ..., xd)    `attributes of a customer'

Approve credit if   Σ_{i=1}^{d} w_i x_i > threshold,
Deny credit if      Σ_{i=1}^{d} w_i x_i < threshold.

This linear formula h ∈ H can be written as

h(x) = sign( Σ_{i=1}^{d} w_i x_i − threshold )

Learning From Data - Lecture 1    10/19

h(x) = sign( Σ_{i=1}^{d} w_i x_i + w_0 )

Introduce an artificial coordinate x_0 = 1:

h(x) = sign( Σ_{i=0}^{d} w_i x_i )

In vector form, the perceptron implements

h(x) = sign(w^T x)

[Figure: `linearly separable' data, with + and − points separated by a line.]

Learning From Data - Lecture 1    11/19

A simple learning algorithm - PLA

The perceptron implements

h(x) = sign(w^T x)

Given the training set:

(x1, y1), (x2, y2), ..., (xN, yN)

pick a misclassified point:

sign(w^T x_n) ≠ y_n

and update the weight vector:

w ← w + y_n x_n

[Figure: for y = +1 the update w + y x rotates w toward x; for y = −1 it rotates w away from x.]

Learning From Data - Lecture 1    12/19

Iterations of PLA

One iteration of the PLA:

w ← w + y x

where (x, y) is a misclassified training point.

At iteration t = 1, 2, 3, ..., pick a misclassified point from

(x1, y1), (x2, y2), ..., (xN, yN)

and run a PLA iteration on it.

That's it!

Learning From Data - Lecture 1    13/19
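The update rule above is all there is to PLA, so it is easy to run end to end. Below is a minimal NumPy sketch (not taken from the course materials); the random choice among misclassified points and the iteration cap are illustrative choices.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm on linearly separable data.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns the weight vector w (with an added bias coordinate x0 = 1).
    """
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])     # artificial coordinate x0 = 1
    w = np.zeros(d + 1)
    for _ in range(max_iters):
        preds = np.sign(Xb @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:          # all points correct: done
            break
        n = np.random.choice(misclassified)  # pick a misclassified point
        w = w + y[n] * Xb[n]                 # PLA update: w <- w + y_n x_n
    return w
```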

The learning problem - Outline


Example of ma hine learning
Components of learning
A simple model
Types of learning
Puzzle

Learning From Data - Le ture 1

14/19

Basic premise of learning

using a set of observations to uncover an underlying process

broad premise = many variations

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Learning From Data - Lecture 1    15/19

Supervised learning
Example from vending ma hines  oin re ognition
25

Mass

Mass

25

5
1

10

10

Size

Learning From Data - Le ture 1

Size

16/19

Mass

Instead of

Unsupervised learning
(input, orre t output), we get (input, ? )

Size

Learning From Data - Le ture 1

17/19

Reinfor ement learning


Instead of (input, orre t output),
we get (input,some output,grade for this output)
The world hampion was
a neural network!

Learning From Data - Le ture 1

18/19

A Learning puzzle
f = 1

f = +1

f =?
Learning From Data - Le ture 1

19/19

Review of Lecture 1

Learning is used when
- A pattern exists
- We cannot pin it down mathematically
- We have data on it

Focus on supervised learning
- Unknown target function  y = f(x)
- Data set  (x1, y1), ..., (xN, yN)
- Learning algorithm picks g ≈ f from a hypothesis set

Example: Perceptron Learning Algorithm

Learning an unknown function?
- Impossible. The function can assume any value outside the data we have.
- So what now?

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 2: Is Learning Feasible?

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 5, 2012

Feasibility of learning - Outline

Probability to the rescue

Connection to learning

Connection to real learning

A dilemma and a solution

Learning From Data - Lecture 2    2/17

A related experiment

- Consider a `bin' with red and green marbles.

  P[ picking a red marble ]   = μ
  P[ picking a green marble ] = 1 − μ

- The value of μ is unknown to us.

- We pick N marbles independently.

- The fraction of red marbles in the sample = ν

[Figure: BIN with μ = probability of red marbles; SAMPLE with ν = fraction of red marbles.]

Learning From Data - Lecture 2    3/17

Does ν say anything about μ?

No!
Sample can be mostly green while bin is mostly red.

Yes!
Sample frequency ν is likely close to bin frequency μ.

possible  versus  probable

Learning From Data - Lecture 2    4/17

What does ν say about μ?

In a big sample (large N), ν is probably close to μ (within ε).

Formally,

P[ |ν − μ| > ε ] ≤ 2 e^{−2 ε² N}

This is called Hoeffding's Inequality.

In other words, the statement "μ = ν" is P.A.C.

Learning From Data - Lecture 2    5/17

P[ |ν − μ| > ε ] ≤ 2 e^{−2 ε² N}

Valid for all N and ε

Bound does not depend on μ

Tradeoff: N, ε, and the bound.

ν ≈ μ   =>   μ ≈ ν

Learning From Data - Lecture 2    6/17

Connection to learning

Bin: The unknown is a number μ

Learning: The unknown is a function f : X → Y

Each marble is a point x ∈ X

Green marble: hypothesis got it right,  h(x) = f(x)
Red marble: hypothesis got it wrong,    h(x) ≠ f(x)

Learning From Data - Lecture 2    7/17

Back to the learning diagram

The bin analogy:

[Learning diagram with a probability distribution P on X added: the training inputs x1, ..., xN are generated independently from P, playing the role of picking marbles from the bin.]

Learning From Data - Lecture 2    8/17

Are we done?

Not so fast!  h is fixed.

For this h, ν generalizes to μ.

`verification' of h, not learning

No guarantee ν will be small.

We need to choose from multiple h's.

Learning From Data - Lecture 2    9/17

Multiple bins

Generalizing the bin model to more than one hypothesis:

[Figure: M bins, one per hypothesis h1, h2, ..., hM, each with its own μ_m and sample fraction ν_m.]

Learning From Data - Lecture 2    10/17

Notation for learning

Both μ and ν depend on which hypothesis h

ν is `in sample', denoted by E_in(h)

μ is `out of sample', denoted by E_out(h)

The Hoeffding inequality becomes:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 e^{−2 ε² N}

Learning From Data - Lecture 2    11/17

Notation with multiple bins

[Figure: bins h1, h2, ..., hM, each with its out-of-sample error E_out(h_m) and in-sample error E_in(h_m).]

Learning From Data - Lecture 2    12/17

Are we done already?

Not so fast!! Hoeffding doesn't apply to multiple bins.

What?

[Figure: the single-bin picture and the learning diagram side by side with the multiple-bin picture; the bound for one bin does not cover choosing among many bins.]

Learning From Data - Lecture 2    13/17

Coin analogy

Question: If you toss a fair coin 10 times, what is the probability that you will get 10 heads?

Answer: ≈ 0.1%

Question: If you toss 1000 fair coins 10 times each, what is the probability that some coin will get 10 heads?

Answer: ≈ 63%

Learning From Data - Lecture 2    14/17
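Both answers are easy to check directly; a small sketch of the exact computation (the 0.1% and 63% on the slide are these two numbers, rounded):

```python
# one coin: p = (1/2)^10; 1000 coins: 1 - (1 - p)^1000
p_ten_heads = 0.5 ** 10                         # ~0.0977%
p_some_coin = 1 - (1 - p_ten_heads) ** 1000     # ~62.4%
print(f"{p_ten_heads:.4%}  {p_some_coin:.1%}")
```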

From coins to learning

[Figure: each coin corresponds to a hypothesis h_i; with many hypotheses, some h_i can look perfect on the sample purely by chance - "BINGO".]

Learning From Data - Lecture 2    15/17

A simple solution

P[ |E_in(g) − E_out(g)| > ε ]  ≤  P[ |E_in(h1) − E_out(h1)| > ε
                                     or |E_in(h2) − E_out(h2)| > ε
                                     ...
                                     or |E_in(hM) − E_out(hM)| > ε ]

                               ≤  Σ_{m=1}^{M} P[ |E_in(h_m) − E_out(h_m)| > ε ]

Learning From Data - Lecture 2    16/17

The final verdict

P[ |E_in(g) − E_out(g)| > ε ]  ≤  Σ_{m=1}^{M} P[ |E_in(h_m) − E_out(h_m)| > ε ]

                               ≤  Σ_{m=1}^{M} 2 e^{−2 ε² N}

P[ |E_in(g) − E_out(g)| > ε ]  ≤  2 M e^{−2 ε² N}

Learning From Data - Lecture 2    17/17

Review of Lecture 2

Is learning feasible?

Yes, in a probabilistic sense:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 e^{−2 ε² N}

Since g has to be one of h1, h2, ..., hM, we conclude that

If:    |E_in(g) − E_out(g)| > ε

Then:  |E_in(h1) − E_out(h1)| > ε  or
       |E_in(h2) − E_out(h2)| > ε  or
       ...
       |E_in(hM) − E_out(hM)| > ε

This gives us an added M factor.

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 3: Linear Models I

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 10, 2012

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

2/23

A real data set

Learning From Data - Le ture 3

3/23

Input representation

`raw' input:   x = (x0, x1, x2, ..., x256)
linear model:  (w0, w1, w2, ..., w256)

Features: extract useful information, e.g.,

intensity and symmetry:  x = (x0, x1, x2)
linear model:            (w0, w1, w2)

Learning From Data - Lecture 3    4/23

Illustration of features

x = (x0, x1, x2)      x1: intensity      x2: symmetry

[Figure: handwritten digits `1' and `5' plotted in the intensity-symmetry plane.]

Learning From Data - Lecture 3    5/23

Evolution of E_in and E_out

[Figure, left: E_in and E_out versus PLA iteration (0 to 1000), both fluctuating between roughly 1% and 10%. Figure, right: what PLA does - the final perceptron boundary in the intensity-symmetry plane.]

Learning From Data - Lecture 3    6/23

The `pocket' algorithm

PLA:                                   Pocket:

[Figure: E_in and E_out versus iteration. With plain PLA both errors fluctuate; with the pocket algorithm (keep the best weights seen so far) the reported errors decrease monotonically.]

Learning From Data - Lecture 3    7/23

Classification boundary - PLA versus Pocket

[Figure: final decision boundaries in the intensity-symmetry plane for PLA (left) and the pocket algorithm (right).]

Learning From Data - Lecture 3    8/23

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

regression

real-valued output

9/23

Credit again

Classification: Credit approval (yes/no)
Regression:     Credit line (dollar amount)

Input x:

age                 23 years
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

Linear regression output:

h(x) = Σ_{i=0}^{d} w_i x_i = w^T x

Learning From Data - Lecture 3    10/23

The data set

Credit officers decide on credit lines:

(x1, y1), (x2, y2), ..., (xN, yN)

y_n ∈ R is the credit line for customer x_n.

Linear regression tries to replicate that.

Learning From Data - Lecture 3    11/23

How to measure the error

How well does h(x) = w^T x approximate f(x)?

In linear regression, we use squared error  (h(x) − f(x))²

in-sample error:

E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)²

Learning From Data - Lecture 3    12/23

Illustration of linear regression

[Figure: a line fit to (x, y) data in one dimension, and a plane fit to (x1, x2, y) data in two dimensions.]

Learning From Data - Lecture 3    13/23

The expression for E_in

E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)²

        = (1/N) ||Xw − y||²

where

X = [ x_1^T ]        y = [ y_1 ]
    [ x_2^T ]            [ y_2 ]
    [  ...  ]            [ ... ]
    [ x_N^T ]            [ y_N ]

Learning From Data - Lecture 3    14/23

Minimizing E_in

E_in(w) = (1/N) ||Xw − y||²

∇E_in(w) = (2/N) X^T (Xw − y) = 0

X^T X w = X^T y

w = X† y    where    X† = (X^T X)^{-1} X^T

X† is the `pseudo-inverse' of X

Learning From Data - Lecture 3    15/23

The pseudo-inverse

X† = (X^T X)^{-1} X^T

[Dimensions: X is N × (d+1); X^T X and its inverse are (d+1) × (d+1); X† is (d+1) × N.]

Learning From Data - Lecture 3    16/23

The linear regression algorithm

1: Construct the matrix X and the vector y from the data set (x1, y1), ..., (xN, yN) as follows:

   X = [ x_1^T ]        y = [ y_1 ]
       [ x_2^T ]            [ y_2 ]
       [  ...  ]            [ ... ]
       [ x_N^T ]            [ y_N ]

   (input data matrix)      (target vector)

2: Compute the pseudo-inverse X† = (X^T X)^{-1} X^T

3: Return w = X† y.

Learning From Data - Lecture 3    17/23
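The three steps translate directly into a few lines of NumPy. A minimal sketch, with np.linalg.pinv standing in for X† and illustrative synthetic data:

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression: w = pseudo-inverse(X) @ y.

    X: (N, d) raw inputs; y: (N,) real-valued targets.
    """
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])    # add the coordinate x0 = 1
    return np.linalg.pinv(Xb) @ y           # w = X_dagger y

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=100)
print(linear_regression(X, y))   # approximately [0.3, 1.0, -2.0, 0.5]
```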

Linear regression for classification

Linear regression learns a real-valued function y = f(x) ∈ R

Binary-valued functions are also real-valued!  ±1 ∈ R

Use linear regression to get w with w^T x_n ≈ y_n = ±1

In this case, sign(w^T x_n) is likely to agree with y_n = ±1

Good initial weights for classification

Learning From Data - Lecture 3    18/23

Linear regression boundary

[Figure: the boundary obtained by linear regression in the (average intensity, symmetry) plane.]

Learning From Data - Lecture 3    18/23

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

19/23

Linear is limited

Data:                      Hypothesis:

[Figure: a data set with a circular +/− pattern, and the best linear hypothesis failing to separate it.]

Learning From Data - Lecture 3    20/23

Another example

Credit line is affected by `years in residence'

but not in a linear way!

Nonlinear features [[x_i < 1]] and [[x_i > 5]] are better.

Can we do that with linear models?

Learning From Data - Lecture 3    21/23

Linear in what?

Linear regression implements

Σ_{i=0}^{d} w_i x_i

Linear classification implements

sign( Σ_{i=0}^{d} w_i x_i )

Algorithms work because of linearity in the weights

Learning From Data - Lecture 3    22/23

Transform the data nonlinearly

(x1, x2)  --Φ-->  (x1², x2²)

[Figure: the circular +/− pattern in the X space becomes linearly separable in the Z space of (x1², x2²).]

Learning From Data - Lecture 3    23/23

Review of Lecture 3

Linear models use the `signal':

s = Σ_{i=0}^{d} w_i x_i = w^T x

- Classification:  h(x) = sign(w^T x)
- Regression:      h(x) = w^T x

Linear regression algorithm:

w = (X^T X)^{-1} X^T y      one-step learning

Nonlinear transformation:

- w^T x is linear in w
- Any x --> z = Φ(x) preserves this linearity.
- Example: (x1, x2) --> (x1², x2²)
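As a concrete illustration of the nonlinear-transformation idea reviewed above, here is a sketch that applies Φ(x1, x2) = (1, x1², x2²) and then runs ordinary linear regression for classification in the Z space; the circular data set and the 0.6 threshold are made up for illustration:

```python
import numpy as np

def phi(X):
    """Illustrative nonlinear transform: (x1, x2) -> (1, x1^2, x2^2)."""
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

# circularly separable data: label +1 inside the circle x1^2 + x2^2 < 0.6
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.6, 1.0, -1.0)

Z = phi(X)                              # work in the Z space
w_tilde = np.linalg.pinv(Z) @ y         # linear regression in Z
pred = np.sign(Z @ w_tilde)             # classify via sign(w~^T z)
print("training accuracy:", np.mean(pred == y))
```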

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 4: Error and Noise

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 12, 2012

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

2/22

1. Original data:  x_n ∈ X                 2. Transform the data:  z_n = Φ(x_n) ∈ Z

4. Classify in X space:                    3. Separate data in Z space:
   g(x) = g~(Φ(x)) = sign( w~^T Φ(x) )        g~(z) = sign( w~^T z )

[Figure: the four steps, going from the original X space to the feature Z space and back.]

Learning From Data - Lecture 4    3/22

What transforms to what

x = (x0, x1, ..., xd)        --Φ-->    z = (z0, z1, ..., z_d~)

x1, x2, ..., xN              --Φ-->    z1, z2, ..., zN
y1, y2, ..., yN              --Φ-->    y1, y2, ..., yN

No weights in X                        w~ = (w~0, w~1, ..., w~_d~)

g(x) = sign( w~^T Φ(x) )

Learning From Data - Lecture 4    4/22

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

5/22

The learning diagram - where we left it

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

6/22

Error measures

What does "h ≈ f" mean?

Error measure:  E(h, f)

Almost always pointwise definition:  e( h(x), f(x) )

Examples:

Squared error:  e( h(x), f(x) ) = ( h(x) − f(x) )²

Binary error:   e( h(x), f(x) ) = [[ h(x) ≠ f(x) ]]

Learning From Data - Lecture 4    7/22

From pointwise to overall

Overall error E(h, f) = average of pointwise errors e( h(x), f(x) ).

In-sample error:

E_in(h) = (1/N) Σ_{n=1}^{N} e( h(x_n), f(x_n) )

Out-of-sample error:

E_out(h) = E_x[ e( h(x), f(x) ) ]

Learning From Data - Lecture 4    8/22

The learning diagram - with pointwise error

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

9/22

How to choose the error measure

Fingerprint verification:

Two types of error: false accept and false reject

How do we penalize each type?

          f = +1 (you)    f = −1 (intruder)

h = +1    no error        false accept
h = −1    false reject    no error

Learning From Data - Lecture 4    10/22

The error measure - for supermarkets

Supermarket verifies fingerprint for discounts

False reject is costly; customer gets annoyed!

False accept is minor; gave away a discount
and intruder left their fingerprint

          f = +1    f = −1

h = +1      0         1
h = −1     10         0

Learning From Data - Lecture 4    11/22

The error measure - for the CIA

CIA verifies fingerprint for security

False accept is a disaster!

False reject can be tolerated
Try again; you are an employee

          f = +1    f = −1

h = +1      0       1000
h = −1      1          0

Learning From Data - Lecture 4    12/22

Take-home lesson

The error measure should be specified by the user.

Not always possible. Alternatives:

Plausible measures: squared error ≡ Gaussian noise

Friendly measures: closed-form solution, convex optimization

Learning From Data - Lecture 4    13/22

The learning diagram - with error measure

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE
e( )

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

14/22

Noisy targets

The `target function' is not always a function

Consider the credit-card approval:

age                 23 years
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

two `identical' customers  ->  two different behaviors

Learning From Data - Lecture 4    15/22

Target `distribution'

Instead of y = f(x), we use target distribution:

P(y | x)

(x, y) is now generated by the joint distribution:

P(x) P(y | x)

Noisy target = deterministic target f(x) = E(y|x) plus noise y − f(x)

Deterministic target is a special case of noisy target:

P(y | x) is zero except for y = f(x)

Learning From Data - Lecture 4    16/22

The learning diagram - in luding noisy target

UNKNOWN TARGET DISTRIBUTION


P(y |

target function f:

PROBABILITY
DISTRIBUTION

x)

X Y

plus noise

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE
e( )

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

17/22

Distinction between P(y | x) and P(x)

Both convey probabilistic aspects of x and y

The target distribution P(y | x) is what we are trying to learn

The input distribution P(x) quantifies relative importance of x

Merging P(x) P(y|x) as P(x, y) mixes the two concepts

Learning From Data - Lecture 4    18/22

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

19/22

What we know so far

Learning is feasible. It is likely that

E_out(g) ≈ E_in(g)

Is this learning?

We need g ≈ f, which means

E_out(g) ≈ 0

Learning From Data - Lecture 4    20/22

The 2 questions of learning

E_out(g) ≈ 0 is achieved through:

E_out(g) ≈ E_in(g)    (Lecture 2)      and      E_in(g) ≈ 0    (Lecture 3)

Learning is thus split into 2 questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?

2. Can we make E_in(g) small enough?

Learning From Data - Lecture 4    21/22

What the theory will achieve

Characterizing the feasibility of learning for infinite M

Characterizing the tradeoff:

model complexity up:  E_in down
model complexity up:  E_out − E_in up

[Figure: error versus VC dimension d_vc, showing the in-sample error, the out-of-sample error, and the model-complexity term.]

Learning From Data - Lecture 4    22/22

Review of Lecture 4

Error measures

- User-specified e( h(x), f(x) )

- In-sample:  E_in(h) = (1/N) Σ_{n=1}^{N} e( h(x_n), f(x_n) )

- Out-of-sample:  E_out(h) = E_x[ e( h(x), f(x) ) ]

Noisy targets

- y ~ P(y | x) instead of y = f(x): target function plus noise

- (x1, y1), ..., (xN, yN) generated by P(x, y) = P(x) P(y|x)

- E_out(h) is now E_{x,y}[ e( h(x), y ) ]

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 5: Training versus Testing

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 17, 2012

Outline

From training to testing

Illustrative examples

Key notion: break point

Puzzle

Learning From Data - Le ture 5

2/20

The final exam

Testing:

P[ |E_in − E_out| > ε ] ≤ 2 e^{−2 ε² N}

Training:

P[ |E_in − E_out| > ε ] ≤ 2 M e^{−2 ε² N}

Learning From Data - Lecture 5    3/20

Where did the M come from?

The Bad events B_m are

"|E_in(h_m) − E_out(h_m)| > ε"

The union bound:

P[ B_1 or B_2 or ... or B_M ] ≤ P[B_1] + P[B_2] + ... + P[B_M]      (no overlaps: M terms)

[Figure: three overlapping event regions B_1, B_2, B_3.]

Learning From Data - Lecture 5    4/20

Can we improve on M?

Yes, bad events are very overlapping!

ΔE_out: change in +1 and −1 areas

ΔE_in: change in labels of data points

|E_in(h1) − E_out(h1)| ≈ |E_in(h2) − E_out(h2)|

Learning From Data - Lecture 5    5/20

What can we replace M with?

Instead of the whole input space,

we consider a finite set of input points,

and count the number of dichotomies

Learning From Data - Lecture 5    6/20

Dichotomies: mini-hypotheses

A hypothesis:   h : X → {−1, +1}

A dichotomy:    h : {x1, x2, ..., xN} → {−1, +1}

Number of hypotheses |H| can be infinite

Number of dichotomies |H(x1, x2, ..., xN)| is at most 2^N

Candidate for replacing M

Learning From Data - Lecture 5    7/20

The growth function

The growth function counts the most dichotomies on any N points:

m_H(N) = max_{x1, ..., xN ∈ X} |H(x1, ..., xN)|

The growth function satisfies:

m_H(N) ≤ 2^N

Let's apply the definition.

Learning From Data - Lecture 5    8/20

Applying the definition - perceptrons

[Figure: for N = 3 points in general position, all 8 dichotomies can be generated by a line, so m_H(3) = 8; for N = 4 points, at most 14 of the 16 dichotomies can be generated, so m_H(4) = 14.]

Learning From Data - Lecture 5    9/20

Outline

From training to testing

Illustrative examples

Key notion: break point

Puzzle

Learning From Data - Le ture 5

10/20

Example 1: positive rays

H is set of h : R → {−1, +1}

h(x) = sign(x − a)

[Figure: points x1 < x2 < ... < xN on a line; h(x) = −1 to the left of a and +1 to the right.]

m_H(N) = N + 1

Learning From Data - Lecture 5    11/20

Example 2: positive intervals

H is set of h : R → {−1, +1}

[Figure: h(x) = +1 inside an interval, −1 outside.]

Place the interval ends in two of N + 1 spots:

m_H(N) = C(N+1, 2) + 1 = ½ N² + ½ N + 1

Learning From Data - Lecture 5    12/20

Example 3: convex sets

H is set of h : R² → {−1, +1}, where the region h(x) = +1 is convex

m_H(N) = 2^N

The N points are `shattered' by convex sets

[Figure: N points on a circle; any subset can be enclosed by a convex region.]

Learning From Data - Lecture 5    13/20

The 3 growth functions

H is positive rays:       m_H(N) = N + 1

H is positive intervals:  m_H(N) = ½ N² + ½ N + 1

H is convex sets:         m_H(N) = 2^N

Learning From Data - Lecture 5    14/20
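The first two growth functions are easy to verify by brute force for small N. A sketch that enumerates the dichotomies positive intervals can generate on N points and compares with ½N² + ½N + 1 (only the order of the points matters, not their positions):

```python
from itertools import combinations

def interval_dichotomies(N):
    """Count the dichotomies positive intervals generate on N points on a line."""
    dichos = {(-1,) * N}                      # empty interval: all -1
    # interval ends placed in two of the N+1 gaps between/around the points
    for lo, hi in combinations(range(N + 1), 2):
        dichos.add(tuple(+1 if lo <= i < hi else -1 for i in range(N)))
    return len(dichos)

for N in range(1, 8):
    print(N, interval_dichotomies(N), N * (N + 1) // 2 + 1)   # counts agree
```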

Back to the big picture

Remember this inequality?

P[ |E_in − E_out| > ε ] ≤ 2 M e^{−2 ε² N}

What happens if m_H(N) replaces M?

m_H(N) polynomial  =>  Good!

Just prove that m_H(N) is polynomial?

Learning From Data - Lecture 5    15/20

Outline

From training to testing

Illustrative examples

Key notion:

Puzzle

Learning From Data - Le ture 5

break point

16/20

Break point of H

Definition: If no data set of size k can be shattered by H, then k is a break point for H

m_H(k) < 2^k

For 2D perceptrons, k = 4

A bigger data set cannot be shattered either

Learning From Data - Lecture 5    17/20

Break point - the 3 examples

Positive rays:       m_H(N) = N + 1              break point k = 2

Positive intervals:  m_H(N) = ½ N² + ½ N + 1     break point k = 3

Convex sets:         m_H(N) = 2^N                break point k = `∞'

Learning From Data - Lecture 5    18/20

Main result

No break point   =>  m_H(N) = 2^N

Any break point  =>  m_H(N) is polynomial in N

Learning From Data - Lecture 5    19/20

Puzzle

x1

Learning From Data - Le ture 5

x2

x3

20/20

Review of Lecture 5

Dichotomies on x1, x2, ..., xN

Growth function:

m_H(N) = max_{x1, ..., xN ∈ X} |H(x1, ..., xN)|      (maximum # of dichotomies)

Break point k:  m_H(k) < 2^k



Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 6: Theory of Generalization

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 19, 2012

Outline

Proof that mH(N ) is polynomial

Proof that mH(N ) an repla e M

Learning From Data - Le ture 6

2/18

Bounding m_H(N)

To show:  m_H(N) is polynomial

We show:  m_H(N) ≤ a polynomial

Key quantity:

B(N, k): maximum number of dichotomies on N points, with break point k

Learning From Data - Lecture 6    3/18

Recursive bound on B(N, k)

Consider a table of the B(N, k) dichotomies on x1, x2, ..., xN and group its rows:

S1:  rows whose pattern on x1, ..., x_{N−1} appears only once (with xN = +1 or xN = −1, but not both):  α rows

S2:  rows whose pattern on x1, ..., x_{N−1} appears twice, once with xN = +1 (S2+) and once with xN = −1 (S2−):  β + β rows

B(N, k) = α + 2β

Learning From Data - Lecture 6    4/18

Estimating α + β

Focus on the x1, x2, ..., x_{N−1} columns: the α + β distinct patterns there are dichotomies on N − 1 points with the same break point k, so

α + β ≤ B(N − 1, k)

Learning From Data - Lecture 6    5/18

Estimating β by itself

Now, focus on the S2 = S2+ ∪ S2− rows: the β distinct patterns on x1, ..., x_{N−1} must have break point k − 1 (otherwise, adding xN back with both labels would shatter k points), so

β ≤ B(N − 1, k − 1)

Learning From Data - Lecture 6    6/18

Putting it together

B(N, k) = α + 2β = (α + β) + β ≤ B(N − 1, k) + B(N − 1, k − 1)

Learning From Data - Lecture 6    7/18

Numerical computation of the B(N, k) bound

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

[Table: B(N, k) filled in row by row. The k = 1 column is all 1's; the N = 1 row is 1, 2, 2, 2, ...; every other entry is the sum of the entry above it and the entry above-left, e.g. B(3, 3) = 4 + 3 = 7 and B(4, 3) = 7 + 4 = 11.]

Learning From Data - Lecture 6    8/18
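Filling the table is a direct translation of the recursion. The sketch below uses the boundary values B(N, 1) = 1 and B(1, k) = 2 for k ≥ 2 (the first column and first row of the table) and checks the result against the closed form Σ_{i=0}^{k−1} C(N, i) derived on the next slide:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def B(N, k):
    """Bound on the number of dichotomies on N points with break point k."""
    if k == 1:
        return 1        # no single point can be shattered: only one dichotomy
    if N == 1:
        return 2        # one point, break point >= 2: both labels allowed
    return B(N - 1, k) + B(N - 1, k - 1)    # the recursive bound

# compare with the closed form  sum_{i=0}^{k-1} C(N, i)
for N in range(1, 7):
    for k in range(1, 7):
        assert B(N, k) == sum(comb(N, i) for i in range(k))
print("recursion matches the closed form")
```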

Analytic solution for the B(N, k) bound

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

Theorem:

B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i)

1. Boundary conditions: easy

[Table: the same B(N, k) table as before.]

Learning From Data - Lecture 6    9/18

2. The induction step

Σ_{i=0}^{k−1} C(N−1, i) + Σ_{i=0}^{k−2} C(N−1, i)

  = 1 + Σ_{i=1}^{k−1} C(N−1, i) + Σ_{i=1}^{k−1} C(N−1, i−1)

  = 1 + Σ_{i=1}^{k−1} [ C(N−1, i) + C(N−1, i−1) ]

  = 1 + Σ_{i=1}^{k−1} C(N, i)

  = Σ_{i=0}^{k−1} C(N, i)

Learning From Data - Lecture 6    10/18

It is polynomial!

For a given H, the break point k is fixed

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)      maximum power is N^{k−1}

Learning From Data - Lecture 6    11/18

Three examples

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)

H is positive rays (break point k = 2):
   m_H(N) = N + 1                    ≤  N + 1

H is positive intervals (break point k = 3):
   m_H(N) = ½ N² + ½ N + 1           ≤  ½ N² + ½ N + 1

H is 2D perceptrons (break point k = 4):
   m_H(N) = ?                        ≤  (1/6) N³ + (5/6) N + 1

Learning From Data - Lecture 6    12/18

Outline

Proof that mH(N ) is polynomial

Proof that mH(N ) an repla e M

Learning From Data - Le ture 6

13/18

What we want

Instead of:   P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 M e^{−2 ε² N}

We want:      P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 m_H(N) e^{−2 ε² N}

Learning From Data - Lecture 6    14/18

Pictorial proof

How does m_H(N) relate to overlaps?

What to do about E_out?

Putting it together

[Figure: the space of data sets, with the bad regions under (a) the Hoeffding inequality, (b) the union bound, and (c) the VC bound.]

Learning From Data - Lecture 6    16/18

What to do about E_out

[Figure: E_out(h) is tracked by the in-sample error on a second sample, so the comparison can be made between two in-sample quantities.]

Learning From Data - Lecture 6    17/18

Putting it together

Not quite:   P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 m_H(N) e^{−2 ε² N}

but rather:  P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}

The Vapnik-Chervonenkis Inequality

Learning From Data - Lecture 6    18/18

Review of Lecture 6

m_H(N) is polynomial

if H has a break point k:

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)      maximum power is N^{k−1}

[Table: the B(N, k) values computed from the recursion.]

The VC Inequality

P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 M e^{−2 ε² N}                  (Hoeffding + union bound)

P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}        (VC bound)

[Figure: the space of data sets under the Hoeffding inequality, the union bound, and the VC bound.]

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 7: The VC Dimension

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 24, 2012

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

2/24

Definition of VC dimension

The VC dimension of a hypothesis set H, denoted by d_vc(H), is

the largest value of N for which m_H(N) = 2^N

"the most points H can shatter"

N ≤ d_vc(H)  =>  H can shatter N points
k > d_vc(H)  =>  k is a break point for H

Learning From Data - Lecture 7    3/24

The growth function

In terms of a break point k:          m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)

In terms of the VC dimension d_vc:    m_H(N) ≤ Σ_{i=0}^{d_vc} C(N, i)      maximum power is N^{d_vc}

Learning From Data - Lecture 7    4/24

Examples

H is positive rays:    d_vc = 1

H is 2D perceptrons:   d_vc = 3

H is convex sets:      d_vc = ∞

Learning From Data - Lecture 7    5/24

VC dimension and learning

d_vc(H) is finite  =>  g ∈ H will generalize

Independent of the learning algorithm
Independent of the input distribution
Independent of the target function

[Learning diagram: only the hypothesis set H and the training examples enter the guarantee.]

Learning From Data - Lecture 7    6/24

VC dimension of perceptrons

For d = 2, d_vc = 3

In general, d_vc = d + 1

We will prove two directions:

d_vc ≤ d + 1
d_vc ≥ d + 1

Learning From Data - Lecture 7    7/24

Here is one direction

A set of N = d + 1 points in R^d shattered by the perceptron:

X = [ x_1^T     ]   =   [ 1  0  0  ...  0 ]
    [ x_2^T     ]       [ 1  1  0  ...  0 ]
    [ x_3^T     ]       [ 1  0  1  ...  0 ]
    [   ...     ]       [ ...             ]
    [ x_{d+1}^T ]       [ 1  0  ...  0  1 ]

X is invertible

Learning From Data - Lecture 7    8/24

Can we shatter this data set?

For any y = (y_1, y_2, ..., y_{d+1}) with y_n = ±1, can we find a vector w satisfying

sign(Xw) = y

Easy! Just make Xw = y, which means

w = X^{-1} y

Learning From Data - Lecture 7    9/24

We can shatter these d + 1 points

This implies what?

[a] d_vc = d + 1
[b] d_vc ≥ d + 1
[c] d_vc ≤ d + 1
[d] No conclusion

Learning From Data - Lecture 7    10/24

Now, to show that d_vc ≤ d + 1

We need to show that:

[a] There are d + 1 points we cannot shatter
[b] There are d + 2 points we cannot shatter
[c] We cannot shatter any set of d + 1 points
[d] We cannot shatter any set of d + 2 points

Learning From Data - Lecture 7    11/24

Take any d + 2 points

For any d + 2 points  x_1, ..., x_{d+1}, x_{d+2}:

More points than dimensions  =>  we must have

x_j = Σ_{i ≠ j} a_i x_i

where not all the a_i's are zeros

Learning From Data - Lecture 7    12/24

So?

Consider the following dichotomy:

x_i's with non-zero a_i get y_i = sign(a_i), and x_j gets y_j = −1

No perceptron can implement such a dichotomy!

Learning From Data - Lecture 7    13/24

Why?

x_j = Σ_{i ≠ j} a_i x_i    =>    w^T x_j = Σ_{i ≠ j} a_i w^T x_i

If y_i = sign(w^T x_i) = sign(a_i), then a_i w^T x_i > 0

This forces  w^T x_j = Σ_{i ≠ j} a_i w^T x_i > 0

Therefore,  y_j = sign(w^T x_j) = +1

Learning From Data - Lecture 7    14/24

Putting it together

We proved  d_vc ≤ d + 1  and  d_vc ≥ d + 1:

d_vc = d + 1

What is d + 1 in the perceptron?  It is the number of parameters w_0, w_1, ..., w_d

Learning From Data - Lecture 7    15/24

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

16/24

1. Degrees of freedom

Parameters create degrees of freedom

# of parameters: analog degrees of freedom

d_vc: equivalent `binary' degrees of freedom

Learning From Data - Lecture 7    17/24

The usual suspects

Positive rays (d_vc = 1):      [Figure: one parameter, the threshold a.]

Positive intervals (d_vc = 2): [Figure: two parameters, the interval ends.]

Learning From Data - Lecture 7    18/24

Not just parameters

Parameters may not contribute degrees of freedom:

[Figure: a cascade of linear models; the intermediate parameters are redundant.]

d_vc measures the effective number of parameters

Learning From Data - Lecture 7    19/24

2. Number of data points needed

Two small quantities in the VC inequality:

P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}  ≡  δ

If we want certain ε and δ, how does N depend on d_vc?

Let us look at N^d e^{−N}

Learning From Data - Lecture 7    20/24

N^d e^{−N}

Fix N^d e^{−N} = small value.  How does N change with d?

[Figure: N^30 e^{−N} versus N on a log scale; the required N grows essentially in proportion to d.]

Rule of thumb:

N ≥ 10 d_vc

Learning From Data - Lecture 7    21/24

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

22/24

Rearranging things

Start from the VC inequality:

P[ |E_out − E_in| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}  ≡  δ

Get ε in terms of δ:

δ = 4 m_H(2N) e^{−(1/8) ε² N}    =>    ε = sqrt( (8/N) ln( 4 m_H(2N) / δ ) )  ≡  Ω

With probability ≥ 1 − δ,    |E_out − E_in| ≤ Ω(N, H, δ)

Learning From Data - Lecture 7    23/24

Generalization bound

With probability ≥ 1 − δ,    |E_out − E_in| ≤ Ω

With probability ≥ 1 − δ,    E_out ≤ E_in + Ω

Learning From Data - Lecture 7    24/24
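To get a feel for the bound, one can plug in numbers. A sketch that evaluates Ω for given N, d_vc and δ, using the polynomial bound (2N)^{d_vc} + 1 in place of m_H(2N) (a common simplification assumed here, not something the slides prescribe):

```python
import numpy as np

def vc_bound(N, d_vc, delta=0.05):
    """Omega = sqrt((8/N) ln(4 m_H(2N) / delta)), with m_H(2N) ~ (2N)^d_vc + 1."""
    m_H = (2.0 * N) ** d_vc + 1
    return np.sqrt(8.0 / N * np.log(4.0 * m_H / delta))

# the bound is loose for small N and only becomes informative for large N
for N in [100, 1000, 10000, 100000]:
    print(N, round(vc_bound(N, d_vc=3), 3))
```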

Review of Lecture 7

VC dimension d_vc(H): the most points H can shatter

Scope of VC analysis: independent of the learning algorithm, the input distribution, and the target function

Utility of VC dimension - rule of thumb:  N ≥ 10 d_vc

Generalization bound:  E_out ≤ E_in + Ω

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 8: Bias-Variance Tradeoff

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 26, 2012

Outline

Bias and Varian e

Learning Curves

Learning From Data - Le ture 8

2/22

Approximation-generalization tradeoff

Small E_out: good approximation of f out of sample.

More complex H  =>  better chance of approximating f

Less complex H  =>  better chance of generalizing out of sample

Ideal H = {f}    -  winning lottery ticket

Learning From Data - Lecture 8    3/22

Quantifying the tradeoff

VC analysis was one approach:  E_out ≤ E_in + Ω

Bias-variance analysis is another: decomposing E_out into

1. How well H can approximate f

2. How well we can zoom in on a good h ∈ H

Applies to real-valued targets and uses squared error

Learning From Data - Lecture 8    4/22

Start with E_out

E_out( g^(D) ) = E_x[ ( g^(D)(x) − f(x) )² ]

E_D[ E_out( g^(D) ) ] = E_D[ E_x[ ( g^(D)(x) − f(x) )² ] ]
                      = E_x[ E_D[ ( g^(D)(x) − f(x) )² ] ]

Now, let us focus on:  E_D[ ( g^(D)(x) − f(x) )² ]

Learning From Data - Lecture 8    5/22

The average hypothesis

To evaluate E_D[ ( g^(D)(x) − f(x) )² ], we define the `average' hypothesis g_bar(x):

g_bar(x) = E_D[ g^(D)(x) ]

Imagine many data sets D_1, D_2, ..., D_K:

g_bar(x) ≈ (1/K) Σ_{k=1}^{K} g^(D_k)(x)

Learning From Data - Lecture 8    6/22

Using g_bar(x)

E_D[ ( g^(D)(x) − f(x) )² ]
  = E_D[ ( g^(D)(x) − g_bar(x) + g_bar(x) − f(x) )² ]
  = E_D[ ( g^(D)(x) − g_bar(x) )² + ( g_bar(x) − f(x) )² + 2 ( g^(D)(x) − g_bar(x) )( g_bar(x) − f(x) ) ]
  = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²

Learning From Data - Lecture 8    7/22

Bias and variance

E_D[ ( g^(D)(x) − f(x) )² ] = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²
                                        var(x)                       bias(x)

Therefore,

E_D[ E_out( g^(D) ) ] = E_x[ E_D[ ( g^(D)(x) − f(x) )² ] ]
                      = E_x[ bias(x) + var(x) ]
                      = bias + var

Learning From Data - Lecture 8    8/22

The tradeoff

bias = E_x[ ( g_bar(x) − f(x) )² ]        var = E_x[ E_D[ ( g^(D)(x) − g_bar(x) )² ] ]

[Figure: a small H sits far from f (high bias, low var); a large H surrounds f (low bias, high var).]

Learning From Data - Lecture 8    9/22

Example: sine target

f : [−1, 1] → R        f(x) = sin(πx)

Only two training examples!   N = 2

Two models used for learning:

H0:  h(x) = b

H1:  h(x) = ax + b

Which is better, H0 or H1?

[Figure: one period of the sine target on [−1, 1].]

Learning From Data - Lecture 8    10/22

Approximation - H0 versus H1

[Figure: best approximations to sin(πx) on [−1, 1]: the best constant (H0) gives E_out = 0.50; the best line (H1) gives E_out = 0.20.]

Learning From Data - Lecture 8    11/22

Learning - H0 versus H1

[Figure: fits obtained from random two-example data sets: H0 fits a constant to the two points; H1 fits the line through them.]

Learning From Data - Lecture 8    12/22

Bias and variance - H0

[Figure: the average hypothesis g_bar(x) for H0, with a shaded one-standard-deviation band, against sin(πx).]

Learning From Data - Lecture 8    13/22

Bias and variance - H1

[Figure: g_bar(x) for H1, with a much wider band, against sin(πx).]

Learning From Data - Lecture 8    14/22

And the winner is ...

H0:  bias = 0.50,  var = 0.25          H1:  bias = 0.21,  var = 1.69

Learning From Data - Lecture 8    15/22
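The bias and variance values quoted above can be reproduced with a short Monte Carlo experiment. A sketch (the number of data sets and the test grid are arbitrary choices); its output should land near 0.50 / 0.25 for H0 and 0.21 / 1.69 for H1:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10_000                                    # number of two-point data sets
xs = rng.uniform(-1, 1, size=(K, 2))
ys = np.sin(np.pi * xs)
x_test = np.linspace(-1, 1, 201)
f_test = np.sin(np.pi * x_test)

def bias_var(g_test):
    """g_test: (K, len(x_test)) hypotheses evaluated on the test grid."""
    g_bar = g_test.mean(axis=0)
    return np.mean((g_bar - f_test) ** 2), np.mean((g_test - g_bar) ** 2)

# H0: h(x) = b, the constant that best fits the two points (their midpoint)
b0 = ys.mean(axis=1, keepdims=True)
print("H0 (bias, var):", bias_var(np.repeat(b0, len(x_test), axis=1)))

# H1: h(x) = ax + b, the line through the two points
a1 = (ys[:, 1] - ys[:, 0]) / (xs[:, 1] - xs[:, 0])
b1 = ys[:, 0] - a1 * xs[:, 0]
g1 = a1[:, None] * x_test[None, :] + b1[:, None]
print("H1 (bias, var):", bias_var(g1))
```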

Lesson learned

Match the `model complexity' to the data resources, not to the target complexity

Learning From Data - Lecture 8    16/22

Outline

Bias and Varian e

Learning Curves

Learning From Data - Le ture 8

17/22

Expected E_out and E_in

Data set D of size N

Expected out-of-sample error:  E_D[ E_out( g^(D) ) ]

Expected in-sample error:      E_D[ E_in( g^(D) ) ]

How do they vary with N?

Learning From Data - Lecture 8    18/22

The curves

[Figure: expected error versus number of data points N. Simple model: E_in and E_out converge quickly, to a higher error. Complex model: they converge more slowly, to a lower error.]

Learning From Data - Lecture 8    19/22

VC versus bias-variance

[Figure: the same learning curve split two ways. VC analysis: E_out = in-sample error + generalization error. Bias-variance: E_out = bias + variance.]

Learning From Data - Lecture 8    20/22

Linear regression case

Noisy target  y = w^T x + noise

Data set D = {(x1, y1), . . . , (xN, yN)}

Linear regression solution:  w = (X^T X)^{-1} X^T y

In-sample error vector        = Xw − y
`Out-of-sample' error vector  = Xw − y'

Learning From Data - Lecture 8    21/22

Learning curves for linear regression

Best approximation error      = σ²
Expected in-sample error      = σ² ( 1 − (d+1)/N )
Expected out-of-sample error  = σ² ( 1 + (d+1)/N )
Expected generalization error = 2 σ² (d+1)/N

[Figure: expected E_in and E_out versus N, approaching σ² from below and above for N > d + 1.]

Learning From Data - Lecture 8    22/22

Review of Lecture 8

Bias and variance

Expected value of E_out w.r.t. D:

E_D[ ( g^(D)(x) − f(x) )² ] decomposes into bias (how far g_bar is from f) and var (how much g^(D) varies around g_bar)

Learning curves

How E_in and E_out vary with the number of data points N

VC:  E_out = in-sample error + generalization error

B-V: E_out = bias + variance

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 9: The Linear Model II

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 1, 2012

Where we are

Learning From Data - Le ture 9

Linear lassi ation

Linear regression

Logisti regression

Nonlinear transforms

X
X
?

2/24

Nonlinear transforms

x = (x0, x1, ..., xd)   --Φ-->   z = (z0, z1, ..., z_d~)

Each z_i = φ_i(x),   z = Φ(x)

Example:  z = (1, x1, x2, x1 x2, x1², x2²)

Final hypothesis g(x) in X space:

sign( w~^T Φ(x) )    or    w~^T Φ(x)

Learning From Data - Lecture 9    3/24

The price we pay

x = (x0, x1, ..., xd)   --Φ-->   z = (z0, z1, ..., z_d~)

d_vc = d + 1            --Φ-->   d_vc ≤ d~ + 1

Learning From Data - Lecture 9    4/24

Two non-separable cases

[Figure, left: data that is almost linearly separable, with a few noisy points on the wrong side. Figure, right: data that is genuinely nonlinear (a circular pattern).]

Learning From Data - Lecture 9    5/24

First case

Use a linear model in X; accept E_in > 0

or

Insist on E_in = 0; go to a high-dimensional Z

Learning From Data - Lecture 9    6/24

Second case

z = (1, x1, x2, x1 x2, x1², x2²)

Why not:        z = (1, x1², x2²)

or better yet:  z = (1, x1² + x2²)

or even:        z = (x1² + x2² − 0.6)

Learning From Data - Lecture 9    7/24

Lesson learned

Looking at the data before choosing the model can be hazardous to your E_out

Data snooping

Learning From Data - Lecture 9    8/24

Logisti regression - Outline

The model

Error measure

Learning algorithm

Learning From Data - Le ture 9

9/24

A third linear model

s = Σ_{i=0}^{d} w_i x_i

linear classification:  h(x) = sign(s)
linear regression:      h(x) = s
logistic regression:    h(x) = θ(s)

[Figure: the three models share the signal s computed from x0, x1, ..., xd; they differ only in how s is mapped to the output.]

Learning From Data - Lecture 9    10/24

The logistic function θ

The formula:

θ(s) = e^s / (1 + e^s)

soft threshold: uncertainty

sigmoid: flattened-out `s'

[Figure: θ(s) rising from 0 to 1 as s goes from −∞ to +∞.]

Learning From Data - Lecture 9    11/24

Probability interpretation

h(x) = θ(s) is interpreted as a probability

Example. Prediction of heart attacks

Input x: cholesterol level, age, weight, etc.

θ(s): probability of a heart attack

The signal s = w^T x is a "risk score";   h(x) = θ(s)

Learning From Data - Lecture 9    12/24

Genuine probability

Data (x, y) with binary y, generated by a noisy target:

P(y | x) = f(x)        for y = +1;
           1 − f(x)    for y = −1.

The target f : R^d → [0, 1] is the probability

Learn  g(x) = θ(w^T x) ≈ f(x)

Learning From Data - Lecture 9    13/24

Error measure

For each (x, y), y is generated by probability f(x)

Plausible error measure based on likelihood:

If h = f, how likely to get y from x?

P(y | x) = h(x)        for y = +1;
           1 − h(x)    for y = −1.

Learning From Data - Lecture 9    14/24

Formula for likelihood

Substitute h(x) = θ(w^T x), noting θ(−s) = 1 − θ(s):

P(y | x) = θ(y w^T x)

Likelihood of D = (x1, y1), ..., (xN, yN) is

Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)

Learning From Data - Lecture 9    15/24

Maximizing the likelihood

Minimize  −(1/N) ln( Π_{n=1}^{N} θ(y_n w^T x_n) )  =  (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^T x_n) )

Using θ(s) = 1 / (1 + e^{−s}):

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} )
                            \______ e(h(x_n), y_n) ______/

"cross-entropy" error

Learning From Data - Lecture 9    16/24

Logisti regression - Outline

The model

Error measure

Learning algorithm

Learning From Data - Le ture 9

17/24

How to minimize E_in

For logistic regression,

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} )      ->  iterative solution

Compare to linear regression:

E_in(w) = (1/N) Σ_{n=1}^{N} ( w^T x_n − y_n )²              ->  closed-form solution

Learning From Data - Lecture 9    18/24

Iterative method: gradient descent

General method for nonlinear optimization

Start at w(0); take a step along the steepest slope

Fixed step size:  w(1) = w(0) + η v̂

What is the direction v̂?

[Figure: the in-sample error surface over the weights, descended step by step.]

Learning From Data - Lecture 9    19/24

Formula for the direction v̂

ΔE_in = E_in( w(0) + η v̂ ) − E_in( w(0) )
      = η ∇E_in( w(0) )^T v̂ + O(η²)
      ≥ −η || ∇E_in( w(0) ) ||

Since v̂ is a unit vector,

v̂ = − ∇E_in( w(0) ) / || ∇E_in( w(0) ) ||

Learning From Data - Lecture 9    20/24

Fixed-size step?

How η affects the algorithm:

[Figure: η too small - slow progress; η too large - unstable, overshooting; variable η - "just right".]

η should increase with the slope

Learning From Data - Lecture 9    21/24

Easy implementation

Instead of

Δw = η v̂ = −η ∇E_in( w(0) ) / || ∇E_in( w(0) ) ||

have

Δw = −η ∇E_in( w(0) )

Fixed learning rate η

Learning From Data - Lecture 9    22/24

Logistic regression algorithm

1: Initialize the weights at t = 0 to w(0)

2: for t = 0, 1, 2, ... do

3:    Compute the gradient

         ∇E_in = −(1/N) Σ_{n=1}^{N} y_n x_n / ( 1 + e^{y_n w(t)^T x_n} )

4:    Update the weights:  w(t + 1) = w(t) − η ∇E_in

5:    Iterate to the next step until it is time to stop

6: Return the final weights w

Learning From Data - Lecture 9    23/24
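A direct NumPy transcription of the algorithm, doing batch gradient descent on the cross-entropy error; the learning rate, the number of iterations, and the stopping rule (a fixed number of steps) are illustrative choices:

```python
import numpy as np

def logistic_regression(X, y, eta=0.1, n_steps=1000):
    """Batch gradient descent for logistic regression.

    X: (N, d) inputs (a column of ones is added for x0); y: (N,) labels in {-1, +1}.
    """
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        # gradient of E_in(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))
        grad = -(y[:, None] * Xb / (1 + np.exp(y * (Xb @ w)))[:, None]).mean(axis=0)
        w -= eta * grad
    return w
```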

Summary of Linear Models

Credit analysis tasks:

Approve or Deny         ->  Perceptron           Classification error    PLA, Pocket, ...
Amount of Credit        ->  Linear Regression    Squared error           Pseudo-inverse
Probability of Default  ->  Logistic Regression  Cross-entropy error     Gradient descent

Learning From Data - Lecture 9    24/24

Review of Lecture 9

Logistic regression

h(x) = θ(s),   s = w^T x

Likelihood measure

Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)

Gradient descent

- Initialize w(0)
- For t = 0, 1, 2, ...    [to termination]
    w(t + 1) = w(t) − η ∇E_in( w(t) )
- Return final w

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 10: Neural Networks

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 3, 2012

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

2/21

Stochastic gradient descent

GD minimizes

E_in(w) = (1/N) Σ_{n=1}^{N} e( h(x_n), y_n )        [ = ln(1 + e^{−y_n w^T x_n}) in logistic regression ]

by iterative steps along −∇E_in:

Δw = −η ∇E_in(w)

∇E_in is based on all examples (x_n, y_n):  "batch" GD

Learning From Data - Lecture 10    3/21

The stochastic aspect

Pick one (x_n, y_n) at a time. Apply GD to e( h(x_n), y_n ).

Average direction:

E_n[ −∇ e( h(x_n), y_n ) ] = −(1/N) Σ_{n=1}^{N} ∇ e( h(x_n), y_n ) = −∇E_in

randomized version of GD:  "stochastic gradient descent" (SGD)

Learning From Data - Lecture 10    4/21

Benefits of SGD

1. cheaper computation
2. randomization
3. simple

Rule of thumb:  η = 0.1 works

[Figure: randomization helps the descent escape shallow local minima and flat regions.]

Learning From Data - Lecture 10    5/21
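For logistic regression, turning the batch algorithm of Lecture 9 into SGD only changes which gradient is used per update. A sketch (the epoch count is arbitrary; η = 0.1 follows the rule of thumb above):

```python
import numpy as np

def logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=0):
    """SGD for logistic regression: one randomly picked example per update."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            # gradient of e(h(x_n), y_n) = ln(1 + exp(-y_n w^T x_n))
            grad_n = -y[n] * Xb[n] / (1 + np.exp(y[n] * (Xb[n] @ w)))
            w -= eta * grad_n
    return w
```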

SGD in action

Remember movie ratings?

e_ij = ( r_ij − Σ_{k=1}^{K} u_ik v_jk )²

[Figure: user factors u_i1, ..., u_iK and movie factors v_j1, ..., v_jK combine to give the rating r_ij; SGD updates one (user, movie) rating at a time.]

Learning From Data - Lecture 10    6/21

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

7/21

Biological inspiration

biological function       biological structure

[Figure: a biological neuron alongside a network of neurons.]

Learning From Data - Lecture 10    8/21

Combining perceptrons

[Figure: a target region built from two perceptrons h1 and h2; OR(x1, x2) and AND(x1, x2) are themselves perceptrons, e.g. with weights (1.5, 1, 1) and (−1.5, 1, 1) on (1, x1, x2).]

Learning From Data - Lecture 10    9/21

Creating layers

[Figure: combinations of h1, h2 and their negations are built layer by layer out of the OR and AND perceptrons.]

Learning From Data - Lecture 10    10/21

The multilayer perceptron

[Figure: a 3-layer feedforward network of perceptrons implementing the target region from x1, x2.]

3 layers      "feedforward"

Learning From Data - Lecture 10    11/21

A powerful model

[Figure: a circular target region approximated with 8 perceptrons and with 16 perceptrons.]

2 red flags: generalization and optimization

Learning From Data - Lecture 10    12/21

The neural network

[Figure: input x = (x1, ..., xd), hidden layers 1 ≤ l < L of soft-threshold units θ(s), and output layer l = L producing h(x).]

Learning From Data - Lecture 10    13/21

How the network operates

w_ij^(l):   1 ≤ l ≤ L          layers
            0 ≤ i ≤ d^(l−1)    inputs
            1 ≤ j ≤ d^(l)      outputs

x_j^(l) = θ( s_j^(l) ) = θ( Σ_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )

Apply x to x_1^(0), ..., x_{d^(0)}^(0)    ->    x_1^(L) = h(x)

θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

[Figure: tanh interpolates between a linear function near 0 and a hard threshold for large |s|.]

Learning From Data - Lecture 10    14/21

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

15/21

Applying SGD

All the weights w = { w_ij^(l) } determine h(x)

Error on example (x_n, y_n) is

e( h(x_n), y_n ) = e(w)

To implement SGD, we need the gradient ∇e(w):

∂e(w) / ∂w_ij^(l)    for all i, j, l

Learning From Data - Lecture 10    16/21

Computing ∂e(w) / ∂w_ij^(l)

We can evaluate ∂e(w) / ∂w_ij^(l) one by one: analytically or numerically

A trick for efficient computation:

∂e(w) / ∂w_ij^(l) = ( ∂e(w) / ∂s_j^(l) ) × ( ∂s_j^(l) / ∂w_ij^(l) )

We have  ∂s_j^(l) / ∂w_ij^(l) = x_i^(l−1)

We only need:  ∂e(w) / ∂s_j^(l) = δ_j^(l)

Learning From Data - Lecture 10    17/21

δ for the final layer

δ_j^(l) = ∂e(w) / ∂s_j^(l)

For the final layer, l = L and j = 1:

δ_1^(L) = ∂e(w) / ∂s_1^(L),     e(w) = ( x_1^(L) − y_n )²,     x_1^(L) = θ( s_1^(L) )

θ'(s) = 1 − θ²(s)    for the tanh

Learning From Data - Lecture 10    18/21

Back propagation of δ

δ_i^(l−1) = ∂e(w) / ∂s_i^(l−1)

          = Σ_{j=1}^{d^(l)} ( ∂e(w) / ∂s_j^(l) ) × ( ∂s_j^(l) / ∂x_i^(l−1) ) × ( ∂x_i^(l−1) / ∂s_i^(l−1) )

          = Σ_{j=1}^{d^(l)} δ_j^(l) × w_ij^(l) × θ'( s_i^(l−1) )

δ_i^(l−1) = ( 1 − ( x_i^(l−1) )² ) Σ_{j=1}^{d^(l)} w_ij^(l) δ_j^(l)

Learning From Data - Lecture 10    19/21

Backpropagation algorithm

1: Initialize all weights w_ij^(l) at random
2: for t = 0, 1, 2, ... do
3:    Pick n ∈ {1, 2, ..., N}
4:    Forward: Compute all x_j^(l)
5:    Backward: Compute all δ_j^(l)
6:    Update the weights:  w_ij^(l) ← w_ij^(l) − η x_i^(l−1) δ_j^(l)
7:    Iterate to the next step until it is time to stop
8: Return the final weights w_ij^(l)

Learning From Data - Lecture 10    20/21
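A compact NumPy sketch of the algorithm above, for fully connected tanh layers with squared error at a single output. The architecture, learning rate, initialization scale, and the XOR-style toy data are illustrative choices, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(layer_sizes):
    """One weight matrix per layer l, shape (d^(l-1)+1, d^(l)); row 0 is the bias."""
    return [rng.normal(scale=0.5, size=(d_in + 1, d_out))
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(W, x):
    """Return the activations x^(0), ..., x^(L), each with the bias unit x0 = 1 prepended."""
    xs = [np.concatenate(([1.0], x))]
    for l, Wl in enumerate(W):
        s = xs[-1] @ Wl                       # s_j^(l) = sum_i w_ij^(l) x_i^(l-1)
        x_l = np.tanh(s)
        if l < len(W) - 1:                    # hidden layers keep a bias unit
            x_l = np.concatenate(([1.0], x_l))
        xs.append(x_l)
    return xs

def backprop_step(W, x, y, eta=0.1):
    """One SGD update on example (x, y) with squared error (x_1^(L) - y)^2."""
    xs = forward(W, x)
    # final-layer delta: 2 (x^(L) - y) * theta'(s^(L)), with theta' = 1 - theta^2
    delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)
    for l in reversed(range(len(W))):
        x_prev = xs[l]
        grad = np.outer(x_prev, delta)        # de/dw_ij^(l) = x_i^(l-1) delta_j^(l)
        # propagate delta to layer l-1 before updating (drop the bias row, index 0)
        delta = (1 - x_prev[1:] ** 2) * (W[l][1:] @ delta)
        W[l] -= eta * grad

# illustrative usage: learn XOR-like data with one hidden layer of 3 tanh units
W = init_weights([2, 3, 1])
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
Y = np.array([[-1], [1], [1], [-1]], dtype=float)
for epoch in range(2000):
    for n in rng.permutation(4):
        backprop_step(W, X[n], Y[n])
# outputs should be near the XOR targets [-1, 1, 1, -1] (a different seed may occasionally be needed)
print([round(forward(W, x)[-1].item(), 2) for x in X])
```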

Final remark: hidden layers

learned nonlinear transform

interpretation?

[Figure: the hidden layers act as a learned feature transform applied to x before the output layer.]

Learning From Data - Lecture 10    21/21

Review of Lecture 10

Multilayer perceptrons

[Figure: logical combinations of perceptrons build complex regions.]

Neural networks

x_j^(l) = θ( Σ_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )      where θ(s) = tanh(s)

Backpropagation

Δw_ij^(l) = −η x_i^(l−1) δ_j^(l)

where  δ_i^(l−1) = ( 1 − ( x_i^(l−1) )² ) Σ_{j=1}^{d^(l)} w_ij^(l) δ_j^(l)

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 11: Overfitting

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 8, 2012

Outline

What is overfitting?

The role of noise

Deterministic noise

Dealing with overfitting

Learning From Data - Lecture 11    2/23

Illustration of overfitting

Simple target function

5 data points - noisy

4th-order polynomial fit

E_in = 0, E_out is huge

[Figure: the noisy data, the target, and the wildly oscillating 4th-order fit.]

Learning From Data - Lecture 11    3/23

Overfitting versus bad generalization

Neural network fitting noisy data

Overfitting:  E_in goes down, E_out goes up

[Figure: E_in and E_out versus training epochs; E_out starts rising while E_in keeps falling - "early stopping" marks where to stop.]

Learning From Data - Lecture 11    4/23

The culprit

Overfitting: fitting the data more than is warranted

Culprit: fitting the noise - harmful

Learning From Data - Lecture 11    5/23

Case study

[Figure, left: data from a 10th-order target plus noise. Figure, right: data from a noiseless 50th-order target.]

Learning From Data - Lecture 11    6/23

Two fits for each target

Noisy low-order target:                    Noiseless high-order target:

          2nd Order   10th Order                     2nd Order   10th Order
E_in        0.050       0.034              E_in        0.029       10^{−5}
E_out       0.127       9.00               E_out       0.120       7680

[Figure: the 2nd-order and 10th-order polynomial fits to each data set.]

Learning From Data - Lecture 11    7/23

An irony of two learners

Two learners, O and R

They know the target is 10th order

O chooses H_10;  R chooses H_2

[Figure: learning a 10th-order target - the 2nd-order fit generalizes better than the 10th-order fit.]

Learning From Data - Lecture 11    8/23

We have seen this case

Remember learning curves?

[Figure: expected E_in and E_out versus N for H_2 and H_10; for small N, H_10 has the lower E_in but the much higher E_out.]

Learning From Data - Lecture 11    9/23

Even without noise

The two learners know there is no noise

Is there really no noise?

[Figure: learning a 50th-order target with H_2 and H_10; the 2nd-order fit again wins out of sample.]

Learning From Data - Lecture 11    10/23

A detailed experiment

Impact of noise level and target complexity

y = f(x) + σ ε(x),      f(x) = Σ_{q=0}^{Q_f} α_q x^q      (normalized)

noise level:        σ²
target complexity:  Q_f
data set size:      N

Learning From Data - Lecture 11    11/23

The overfit measure

We fit the data set (x1, y1), ..., (xN, yN) using our two models:

H_2:   2nd-order polynomials

H_10:  10th-order polynomials

Compare the out-of-sample errors of g_2 ∈ H_2 and g_10 ∈ H_10

overfit measure:  E_out(g_10) − E_out(g_2)

Learning From Data - Lecture 11    12/23

The results

[Figure: the overfit measure E_out(g_10) − E_out(g_2) as color maps. Left: noise level σ² versus number of data points N. Right: target complexity Q_f versus N.]

Learning From Data - Lecture 11    13/23

Impact of "noise"

[Figure: the same two maps.]

stochastic noise:      the σ² versus N map

deterministic noise:   the Q_f versus N map

Learning From Data - Lecture 11    14/23

Outline

What is overtting?

The role of noise

Deterministi noise

Dealing with overtting

Learning From Data - Le ture 11

15/23

Definition of deterministic noise

The part of f that H cannot capture:  f(x) − h*(x)

Why "noise"?

Main differences with stochastic noise:

1. depends on H

2. fixed for a given x

[Figure: a complex target f and the best hypothesis h* in H; the gap between them is the deterministic noise.]

Learning From Data - Lecture 11    16/23

Deterministic noise and Q_f

Finite N: H tries to fit the noise

[Figure: the overfit measure versus target complexity Q_f and N - more deterministic noise, more overfitting.]

Learning From Data - Lecture 11    17/23

Noise and bias-variance

Recall the decomposition:

E_D[ ( g^(D)(x) − f(x) )² ] = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²
                                        var(x)                       bias(x)

What if f is a noisy target?

y = f(x) + ε(x),     E[ ε(x) ] = 0

Learning From Data - Lecture 11    18/23

A noise term

E_{D,ε}[ ( g^(D)(x) − y )² ] = E_{D,ε}[ ( g^(D)(x) − f(x) − ε(x) )² ]

  = E_{D,ε}[ ( g^(D)(x) − g_bar(x) + g_bar(x) − f(x) − ε(x) )² ]

  = E_{D,ε}[ ( g^(D)(x) − g_bar(x) )² + ( g_bar(x) − f(x) )² + ( ε(x) )² + cross terms ]

Learning From Data - Lecture 11    19/23

Actually, two noise terms

E_{D,x}[ ( g^(D)(x) − g_bar(x) )² ]  +  E_x[ ( g_bar(x) − f(x) )² ]  +  E_{ε,x}[ ( ε(x) )² ]
              var                        bias = "deterministic noise"    "stochastic noise"

Learning From Data - Lecture 11    20/23

Outline

What is overfitting?
The role of noise
Deterministic noise
Dealing with overfitting

Learning From Data - Lecture 11    21/23

Two cures

Regularization:   Putting the brakes

Validation:       Checking the bottom line

Learning From Data - Lecture 11    22/23

Putting the brakes

[Figure: fitting the same data and target; a free fit (left) overshoots wildly, a restrained fit (right) stays close to the target]

  free fit                               restrained fit

Learning From Data - Lecture 11    23/23

Review of Lecture 11

Overfitting:  fitting the data more than is warranted
  VC allows it; doesn't predict it
  Fitting the noise, stochastic/deterministic

Deterministic noise
  [Figure: f(x) and the best fit in H; the gap is the deterministic noise]
  [Figure: the overfit measure versus the number of data points N and the target complexity Q_f]

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 12: Regularization

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 10, 2012

Outline

Regularization - informal
Regularization - formal
Weight decay
Choosing a regularizer

Learning From Data - Lecture 12    2/21

Two approaches to regularization

Mathematical:  ill-posed problems in function approximation

Heuristic:     handicapping the minimization of E_in

Learning From Data - Lecture 12    3/21

A familiar example

[Figure: fitting two data points from a sine-like target, without regularization (left) and with regularization (right)]

  without regularization                 with regularization

Learning From Data - Lecture 12    4/21

and the winner is ...

[Figure: ḡ(x) and sin(x) with the spread of random fits, without regularization (left) and with regularization (right)]

  without regularization:   bias = 0.21,   var = 1.69
  with regularization:      bias = 0.23,   var = 0.33

Learning From Data - Lecture 12    5/21

The polynomial model

H_Q:  polynomials of order Q

  linear regression in Z space,    z = (1, L1(x), …, LQ(x))

  H_Q = { Σ_{q=0}^{Q} w_q L_q(x) }

Legendre polynomials:

  L1 = x
  L2 = ½(3x² − 1)
  L3 = ½(5x³ − 3x)
  L4 = ⅛(35x⁴ − 30x² + 3)
  L5 = ⅛(63x⁵ − 70x³ + 15x)

Learning From Data - Lecture 12    6/21
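A minimal sketch (my own illustration) of building the Legendre feature vector z = (L_0(x), …, L_Q(x)) with numpy; legvander returns the matrix whose columns are the Legendre polynomials evaluated at x.

import numpy as np

Q = 5
x = np.linspace(-1, 1, 7)
Z = np.polynomial.legendre.legvander(x, Q)   # shape (7, Q+1); column q is L_q(x)
print(Z.shape)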

Unconstrained solution

Given (x1, y1), …, (xN, yN)   →   (z1, y1), …, (zN, yN)

Minimize   E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)²

Minimize   (1/N) (Zw − y)ᵀ(Zw − y)

  ⇒   w_lin = (ZᵀZ)⁻¹ Zᵀ y

Learning From Data - Lecture 12    7/21

Constraining the weights

Hard constraint:   H2 is a constrained version of H10, with w_q = 0 for q > 2

Softer version:    Σ_{q=0}^{Q} w_q² ≤ C         "soft-order" constraint

Minimize   (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

Solution:  w_reg  instead of  w_lin

Learning From Data - Lecture 12    8/21

Solving for w_reg

Minimize   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

[Figure: contours of E_in = const., with w_lin outside the constraint disk wᵀw = C; at the solution w_reg the gradient ∇E_in is normal to the constraint]

  ∇E_in(w_reg) = −2 (λ/N) w_reg      ⇒      ∇E_in(w_reg) + 2 (λ/N) w_reg = 0

Minimize   E_in(w) + (λ/N) wᵀw

Learning From Data - Lecture 12    9/21

Augmented error

Minimizing   E_aug(w) = E_in(w) + (λ/N) wᵀw
                      = (1/N) (Zw − y)ᵀ(Zw − y) + (λ/N) wᵀw        unconditionally

solves

Minimizing   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C        ← VC formulation

Learning From Data - Lecture 12    10/21

The solution

Minimize   E_aug(w) = E_in(w) + (λ/N) wᵀw
                    = (1/N) [ (Zw − y)ᵀ(Zw − y) + λ wᵀw ]

  ∇E_aug(w) = 0    ⇒    Zᵀ(Zw − y) + λ w = 0

  w_reg = (ZᵀZ + λI)⁻¹ Zᵀ y         (with regularization)

as opposed to

  w_lin = (ZᵀZ)⁻¹ Zᵀ y              (without regularization)

Learning From Data - Lecture 12    11/21
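A minimal sketch of the two formulas above (assuming the Legendre features Z from the earlier sketch; the data and the value of λ are arbitrary):

import numpy as np

def fit(Z, y, lam=0.0):
    # lam = 0 gives w_lin; lam > 0 gives w_reg = (Z^T Z + lam I)^{-1} Z^T y
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(15)
Z = np.polynomial.legendre.legvander(x, 10)

w_lin = fit(Z, y)             # fits the noise
w_reg = fit(Z, y, lam=0.1)    # "puts the brakes" on the weights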

The result

Minimizing   E_in(w) + (λ/N) wᵀw   for different λ's:

[Figure: fits of the same data set for λ = 0, λ = 0.0001, λ = 0.01, and λ = 1]

  λ = 0:  overfitting        →        λ = 1:  underfitting

Learning From Data - Lecture 12    12/21

Weight 'decay'

Minimizing   E_in(w) + (λ/N) wᵀw   is called weight decay.  Why?

Gradient descent:

  w(t+1) = w(t) − η ∇E_in(w(t)) − 2 η (λ/N) w(t)
         = w(t) (1 − 2 η λ/N) − η ∇E_in(w(t))

Applies in neural networks:

  wᵀw = Σ_{l=1}^{L} Σ_{i=0}^{d^(l−1)} Σ_{j=1}^{d^(l)} ( w_ij^(l) )²

Learning From Data - Lecture 12    13/21
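A minimal sketch of one weight-decay step as written above (the names eta and lam are my own; any differentiable E_in works the same way):

import numpy as np

def grad_Ein(w, Z, y):
    # gradient of (1/N) ||Zw - y||^2
    return 2.0 / len(y) * Z.T @ (Z @ w - y)

def weight_decay_step(w, Z, y, eta=0.1, lam=0.01):
    N = len(y)
    # shrink ("decay") the weights, then take the usual gradient step
    return w * (1 - 2 * eta * lam / N) - eta * grad_Ein(w, Z, y)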

Variations of weight decay

Emphasis of certain weights:    Σ_{q=0}^{Q} γ_q w_q²

Examples:   γ_q = 2^q    ⇒  low-order fit
            γ_q = 2^(−q) ⇒  high-order fit

Neural networks: different layers get different γ's

Tikhonov regularizer:    wᵀ Γᵀ Γ w

Learning From Data - Lecture 12    14/21

Even weight growth!

We 'constrain' the weights to be large - bad!

[Figure: expected E_out versus the regularization parameter λ, for weight decay and for weight growth]

Practical rule:
  stochastic noise is 'high-frequency'
  deterministic noise is also non-smooth

  ⇒ constrain learning towards smoother hypotheses

Learning From Data - Lecture 12    15/21

General form of augmented error

Calling the regularizer Ω = Ω(h), we minimize

  E_aug(h) = E_in(h) + (λ/N) Ω(h)

Rings a bell?       E_out(h) ≤ E_in(h) + Ω(H)

E_aug is better than E_in as a proxy for E_out

Learning From Data - Lecture 12    16/21

Outline

Regularization - informal
Regularization - formal
Weight decay
Choosing a regularizer

Learning From Data - Lecture 12    17/21

The perfect regularizer Ω

Constraint in the 'direction' of the target function    (going in circles ...)

Guiding principle:  direction of smoother or simpler

Chose a bad Ω?  We still have λ!

  ⇒  regularization is a necessary evil

Learning From Data - Lecture 12    18/21

Neural-network regularizers

Weight decay: from linear to logical

[Figure: the tanh transfer function; small weights keep it in the linear regime, large weights approach a hard threshold]

Weight elimination:  fewer weights  =  smaller VC dimension

Soft weight elimination:

  Ω(w) = Σ_{i,j,l}  ( w_ij^(l) )² / ( β² + ( w_ij^(l) )² )

Learning From Data - Lecture 12    19/21

Early stopping as a regularizer

Regularization through the optimizer!

When to stop?  validation

[Figure: E_in and E_out versus training epochs; E_out bottoms out early (the early-stopping point) while E_in keeps decreasing]

Learning From Data - Lecture 12    20/21
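A minimal sketch of early stopping (my own illustration, reusing grad_Ein from the weight-decay sketch above): keep the weights that achieved the lowest validation error during gradient descent.

import numpy as np

def train_with_early_stopping(w, Z_tr, y_tr, Z_val, y_val, eta=0.1, epochs=5000):
    best_w, best_val = w.copy(), np.inf
    for _ in range(epochs):
        w = w - eta * grad_Ein(w, Z_tr, y_tr)
        val_err = np.mean((Z_val @ w - y_val) ** 2)   # validation estimate of E_out
        if val_err < best_val:
            best_val, best_w = val_err, w.copy()
    return best_w                                     # weights at the early-stopping point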

The optimal λ

[Figure: expected E_out versus the regularization parameter λ.
 Left, stochastic noise: curves for σ² = 0, 0.25, 0.5.
 Right, deterministic noise: curves for Q_f = 15, 30, 100.]

Learning From Data - Lecture 12    21/21

Review of Lecture 12

Regularization:  constrained  →  unconstrained

  Minimize   E_aug(w) = E_in(w) + (λ/N) wᵀw

  [Figure: E_in contours with the constraint wᵀw = C; fits for λ = 0.0001 and λ = 1.0]

Choosing a regularizer Ω(h):
  heuristic:  smooth, simple
  most used:  weight decay
  λ:  principled choice by validation

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 13: Validation

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 15, 2012

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    2/22

Validation versus regularization

In one form or another,     E_out(h) = E_in(h) + overfit penalty

Regularization:
  E_out(h) = E_in(h) + overfit penalty         ← regularization estimates this quantity

Validation:
  E_out(h) = E_in(h) + overfit penalty         ← validation estimates E_out(h) itself

Learning From Data - Lecture 13    3/22

Analyzing the estimate

On an out-of-sample point (x, y), the error is  e(h(x), y)

  Squared error:   ( h(x) − y )²
  Binary error:    [[ h(x) ≠ y ]]

  E[ e(h(x), y) ]   = E_out(h)
  var[ e(h(x), y) ] = σ²

Learning From Data - Lecture 13    4/22

From a point to a set

On a validation set (x1, y1), …, (xK, yK), the error is

  E_val(h) = (1/K) Σ_{k=1}^{K} e( h(x_k), y_k )

  E[ E_val(h) ]   = (1/K) Σ_{k=1}^{K} E[ e( h(x_k), y_k ) ]    = E_out(h)

  var[ E_val(h) ] = (1/K²) Σ_{k=1}^{K} var[ e( h(x_k), y_k ) ] = σ²/K

  E_val(h) = E_out(h) ± O(1/√K)

Learning From Data - Lecture 13    5/22
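A minimal sketch of the validation estimate (illustrative): hold out K points, evaluate the hypothesis on them, and use the mean error as a proxy for E_out; its standard deviation shrinks like 1/√K.

import numpy as np

def validation_error(h, x_val, y_val):
    # E_val(h) = (1/K) * sum of pointwise squared errors
    return np.mean((h(x_val) - y_val) ** 2)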

K is taken out of N

Given the data set  D = (x1, y1), …, (xN, yN):

  K points       →  validation set D_val
  N − K points   →  training set D_train

  Small K:  bad estimate           Large K:  ?

[Figure: learning curve of the expected error (E_in and E_out) versus the number of data points; training on N − K points moves us back along the curve]

Learning From Data - Lecture 13    6/22

K is put back into N

  D  →  D_train ∪ D_val

  D (N points)           →  g
  D_train (N − K points) →  g⁻
  D_val (K points)       →  E_val = E_val(g⁻)

Large K   ⇒   bad estimate!

Rule of thumb:     K = N/5

Learning From Data - Lecture 13    7/22

Why 'validation'

D_val is used to make learning choices

If an estimate of E_out affects learning:

  the set is no longer a test set;  it becomes a validation set

[Figure: E_in and E_out versus epochs, with early stopping chosen at the minimum of the validation estimate]

Learning From Data - Lecture 13    8/22

What's the difference?

Test set is unbiased; validation set has optimistic bias

Two hypotheses h1 and h2 with  E_out(h1) = E_out(h2) = 0.5

Error estimates e1 and e2, uniform on [0, 1]

Pick  h ∈ {h1, h2}  with  e = min(e1, e2)

  E[e] < 0.5     ⇒   optimistic bias

Learning From Data - Lecture 13    9/22

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    10/22

Using D_val more than once

M models:  H1, …, HM

Use D_train to learn g_m⁻ for each model

Evaluate g_m⁻ using D_val:

  E_m = E_val(g_m⁻),     m = 1, …, M

Pick the model m* with the smallest E_m

[Diagram: D_train trains g_1⁻, …, g_M⁻; D_val gives E_1, …, E_M; the best (H_m*, E_m*) is picked and retrained on all of D to give g_m*]

Learning From Data - Lecture 13    11/22

The bias

We selected the model H_m* using D_val

  ⇒  E_val(g_m*⁻) is a biased estimate of E_out(g_m*⁻)

Illustration: selecting between 2 models

[Figure: expected E_out(g_m*⁻) and E_val(g_m*⁻) versus the validation set size K; the validation estimate is optimistic]

Learning From Data - Lecture 13    12/22

How much bias

For M models H1, …, HM,  D_val is used for "training" on the finalists model:

  H_val = { g1⁻, g2⁻, …, gM⁻ }

Back to Hoeffding and VC!

  E_out(g_m*⁻)  ≤  E_val(g_m*⁻) + O( √( ln M / K ) )

  (used for choices such as regularization and early stopping)

Learning From Data - Lecture 13    13/22

Data contamination

Error estimates:  E_in, E_test, E_val

Contamination:  optimistic (deceptive) bias in estimating E_out

  Training set:    totally contaminated
  Validation set:  slightly contaminated
  Test set:        totally 'clean'

Learning From Data - Lecture 13    14/22

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    15/22

The dilemma about K

The following chain of reasoning:

  E_out(g)  ≈  E_out(g⁻)  ≈  E_val(g⁻)
          (small K)     (large K)

highlights the dilemma in selecting K:

  Can we have K both small and large?

Learning From Data - Lecture 13    16/22

Leave one out

N − 1 points for training, and 1 point for validation!

  D_n = (x1, y1), …, (x_{n−1}, y_{n−1}), (x_{n+1}, y_{n+1}), …, (xN, yN)

Final hypothesis learned from D_n is g_n⁻

  e_n = E_val(g_n⁻) = e( g_n⁻(x_n), y_n )

cross validation error:     E_cv = (1/N) Σ_{n=1}^{N} e_n

Learning From Data - Lecture 13    17/22
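A minimal sketch of leave-one-out cross validation for a polynomial model (illustrative; plain least squares via np.polyfit):

import numpy as np

def loocv_error(x, y, order):
    N = len(x)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n                        # leave point n out
        g_n = np.polyfit(x[mask], y[mask], order)       # g_n^- trained on N-1 points
        errors.append((np.polyval(g_n, x[n]) - y[n]) ** 2)
    return np.mean(errors)                              # E_cv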

Illustration of cross validation

[Figure: three data points; a line is fit to two of them while the third is left out, giving the errors e1, e2, e3 in turn]

  E_cv = (1/3) ( e1 + e2 + e3 )

Learning From Data - Lecture 13    18/22

Model selection using CV

[Figure: the same three points fit with a constant model (top row) and a linear model (bottom row); each panel shows the leave-one-out errors e1, e2, e3]

  Constant:                              Linear:

Learning From Data - Lecture 13    19/22

Cross validation in action

Digits classification task ('1' versus 'not 1'), with average intensity and symmetry as inputs

Nonlinear transform:

  (1, x1, x2)  →  (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, …, x1⁵, x1⁴x2, x1³x2², x1²x2³, x1x2⁴, x2⁵)

[Figure left: the two classes in the (average intensity, symmetry) plane.
 Figure right: E_in, E_cv and E_out versus the number of features used (up to 20).]

Learning From Data - Lecture 13    20/22

The result

[Figure: decision boundaries in the (average intensity, symmetry) plane, without validation (left) and with validation (right)]

  without validation:   E_in = 0%,      E_out = 2.5%
  with validation:      E_in = 0.8%,    E_out = 1.5%

Learning From Data - Lecture 13    21/22

Leave more than one out

Leave one out:  N training sessions on N − 1 points each

More points for validation?

  D = D1 D2 D3 D4 D5 D6 D7 D8 D9 D10      (train on nine of the parts, validate on the remaining one)

  N/K training sessions on N − K points each

10-fold cross validation:     K = N/10

Learning From Data - Lecture 13    22/22
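A minimal sketch of 10-fold cross validation (illustrative): partition the indices into 10 chunks, train on 9 and validate on the held-out chunk, then average.

import numpy as np

def kfold_cv_error(x, y, order, folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    errors = []
    for val_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, val_idx)
        g = np.polyfit(x[train_idx], y[train_idx], order)
        errors.append(np.mean((np.polyval(g, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)          # estimate of E_out, used for model selection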

Review of Lecture 13

Validation
  D (N)  →  D_train (N − K)  and  D_val (K)
  E_val(g⁻) estimates E_out(g)

  [Figure: expected E_out(g_m*⁻) and E_val(g_m*⁻) versus the validation set size K]

Data contamination:  the validation set is slightly contaminated

Cross validation
  D = D1 D2 … D10:  train on nine parts, validate on the remaining one
  10-fold cross validation

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 14: Support Vector Machines

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 17, 2012

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    2/20

Better linear separation

Linearly separable data

Different separating lines

Which is best?

[Figure: the same separable data set with several separating lines and their margins]

Two questions:

1. Why is bigger margin better?
2. Which w maximizes the margin?

Learning From Data - Lecture 14    3/20

Remember the growth function?

All dichotomies with any line:

[Figure: the dichotomies that arbitrary lines can generate on a small set of points]

Learning From Data - Lecture 14    4/20

Dichotomies with fat margin

Fat margins imply fewer dichotomies

[Figure: the dichotomies achievable when a fat margin is required; margin values shown: ∞, 0.866, 0.5, 0.397]

Learning From Data - Lecture 14    5/20

Finding w with large margin

Let x_n be the nearest data point to the plane  wᵀx = 0.   How far is it?

2 preliminary technicalities:

1. Normalize w:      |wᵀx_n| = 1

2. Pull out w0:      w = (w1, …, wd), apart from b
                     The plane is now  wᵀx + b = 0     (no x0)

Learning From Data - Lecture 14    6/20

Computing the distance

The distance between x_n and the plane  wᵀx + b = 0,  where  |wᵀx_n + b| = 1

The vector w is ⊥ to the plane in the X space:

  Take x′ and x″ on the plane:    wᵀx′ + b = 0  and  wᵀx″ + b = 0

    ⇒    wᵀ(x′ − x″) = 0

[Figure: the plane, two points x′ and x″ on it, the normal vector w, and the point x_n]

Learning From Data - Lecture 14    7/20

... and the distance is

Distance between x_n and the plane:

  Take any point x on the plane

  Projection of  x_n − x  on  ŵ = w/‖w‖ :

  distance = | ŵᵀ(x_n − x) | = (1/‖w‖) | wᵀx_n − wᵀx | = (1/‖w‖) | wᵀx_n + b − wᵀx − b | = 1/‖w‖

Learning From Data - Lecture 14    8/20

The optimization problem

Maximize     1/‖w‖      subject to     min_{n=1,2,…,N} |wᵀx_n + b| = 1

Notice:    |wᵀx_n + b| = y_n (wᵀx_n + b)

Minimize     ½ wᵀw      subject to     y_n (wᵀx_n + b) ≥ 1     for n = 1, 2, …, N

Learning From Data - Lecture 14    9/20

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    10/20

Constrained optimization

Minimize     ½ wᵀw      subject to     y_n (wᵀx_n + b) ≥ 1     for n = 1, 2, …, N

  w ∈ R^d,  b ∈ R

Lagrange?    inequality constraints    →    KKT

Learning From Data - Lecture 14    11/20

We saw this before

Remember regularization?

  Minimize   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

  [Figure: E_in contours with the constraint wᵀw = C; ∇E_in normal to the constraint]

Regularization:   optimize E_in,   constrain wᵀw
SVM:              optimize wᵀw,    constrain E_in

Learning From Data - Lecture 14    12/20

Lagrange formulation

Minimize

  L(w, b, α) = ½ wᵀw − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 )

w.r.t. w and b, and maximize w.r.t. each α_n ≥ 0

  ∇_w L = w − Σ_{n=1}^{N} α_n y_n x_n = 0

  ∂L/∂b = − Σ_{n=1}^{N} α_n y_n = 0

Learning From Data - Lecture 14    13/20

Substituting ...

  w = Σ_{n=1}^{N} α_n y_n x_n       and       Σ_{n=1}^{N} α_n y_n = 0

in the Lagrangian

  L(w, b, α) = ½ wᵀw − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 )

we get

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m

Maximize w.r.t. α   subject to   α_n ≥ 0 for n = 1, …, N   and   Σ_{n=1}^{N} α_n y_n = 0

Learning From Data - Lecture 14    14/20

The solution - quadratic programming

  min_α     ½ αᵀ Q α − 1ᵀ α        where the matrix of quadratic coefficients is  Q_nm = y_n y_m x_nᵀ x_m

subject to

  yᵀα = 0           (linear constraint)
  0 ≤ α_n ≤ ∞       (lower and upper bounds)

Learning From Data - Lecture 14    15/20

QP hands us α

Solution:    α = α_1, …, α_N      ⇒      w = Σ_{n=1}^{N} α_n y_n x_n

KKT condition:  for n = 1, …, N,

  α_n ( y_n (wᵀx_n + b) − 1 ) = 0

We saw this before!   [Figure: the constrained-optimization picture from regularization, with E_in contours and wᵀw = C]

  α_n > 0   ⇒   x_n is a support vector

Learning From Data - Lecture 14    16/20

Support vectors

Closest x_n's to the plane: they achieve the margin

  ⇒    y_n (wᵀx_n + b) = 1

[Figure: the maximum-margin plane; the support vectors lie on the margin]

  w = Σ_{x_n is SV} α_n y_n x_n

Solve for b using any SV:     y_n (wᵀx_n + b) = 1

Learning From Data - Lecture 14    17/20
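A small illustrative sketch of solving this dual for a toy data set (not the course's QP package): it maximizes L(α) subject to Σ α_n y_n = 0 and α_n ≥ 0 using scipy's SLSQP, then recovers w from the α's and b from a support vector. A dedicated quadratic-programming solver would normally be used instead.

import numpy as np
from scipy.optimize import minimize

def svm_hard_margin(X, y):
    N = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)            # Q_nm = y_n y_m x_n^T x_m
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimizing the negative of L(alpha)
    constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]
    res = minimize(objective, np.zeros(N), bounds=[(0, None)] * N, constraints=constraints)
    alpha = res.x
    w = (alpha * y) @ X                                  # w = sum_n alpha_n y_n x_n
    sv = np.argmax(alpha)                                # index of a support vector
    b = y[sv] - w @ X[sv]                                # from y_n (w^T x_n + b) = 1
    return w, b, alpha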

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    18/20

z instead of x

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀ z_m

  X → Z

[Figure: a data set that is not linearly separable in X becomes separable after the nonlinear transform to Z]

Learning From Data - Lecture 14    19/20

Support vectors in X space

Support vectors live in Z space

In X space, "pre-images" of support vectors

The margin is maintained in Z space

Generalization result:

  E[E_out]  ≤  E[# of SV's] / (N − 1)

[Figure: a nonlinear boundary in X space with the pre-images of the support vectors marked]

Learning From Data - Lecture 14    20/20

Review of Lecture 14

The margin:  maximizing the margin = dual problem

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m        →   quadratic programming

Support vectors:  x_n (or z_n) with Lagrange multiplier α_n > 0

  E[E_out]  ≤  E[# of SV's] / (N − 1)        (in-sample check of out-of-sample error)

Nonlinear transform:  complex h, but simple H

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 15: Kernel Methods

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 22, 2012

Outline

The kernel trick
Soft-margin SVM

Learning From Data - Lecture 15    2/20

What do we need from the Z space?

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀ z_m            ← need z_nᵀ z_m

Constraints:    α_n ≥ 0 for n = 1, …, N     and     Σ_{n=1}^{N} α_n y_n = 0

  g(x) = sign(wᵀz + b)      where     w = Σ_{z_n is SV} α_n y_n z_n                       ← need z_nᵀ z

and b:     y_m (wᵀz_m + b) = 1                                                            ← need z_nᵀ z_m

Learning From Data - Lecture 15    3/20

Generalized inner product

Given two points x and x′ ∈ X, we need  zᵀz′

Let    zᵀz′ = K(x, x′)         (the kernel)     "inner product" of x and x′

Example:    x = (x1, x2),    2nd-order    z = Φ(x) = (1, x1, x2, x1², x2², x1x2)

  K(x, x′) = zᵀz′ = 1 + x1x1′ + x2x2′ + x1²x1′² + x2²x2′² + x1x1′x2x2′

Learning From Data - Lecture 15    4/20

The trick

Can we compute K(x, x′) without transforming x and x′?

Example:  consider    K(x, x′) = (1 + xᵀx′)² = (1 + x1x1′ + x2x2′)²

  = 1 + x1²x1′² + x2²x2′² + 2x1x1′ + 2x2x2′ + 2x1x1′x2x2′

This is an inner product!  Of the vectors

  (1,  x1²,  x2²,  √2 x1,  √2 x2,  √2 x1x2)
  (1,  x1′², x2′², √2 x1′, √2 x2′, √2 x1′x2′)

Learning From Data - Lecture 15    5/20

The polynomial kernel

X = R^d  and  Φ: X → Z  is polynomial of order Q

The "equivalent" kernel:

  K(x, x′) = (1 + xᵀx′)^Q = (1 + x1x1′ + x2x2′ + … + xdxd′)^Q

Compare the computation for  d = 10  and  Q = 100

Can adjust the scale:     K(x, x′) = (a xᵀx′ + b)^Q

Learning From Data - Lecture 15    6/20
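A small numerical check (my own illustration) that the 2nd-order polynomial kernel equals the inner product of the transformed vectors above:

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2])

def K(x, xp):
    return (1 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(K(x, xp), phi(x) @ phi(xp))    # the two numbers agree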

We only need Z to exist!

If K(x, x′) is an inner product in some space Z, we are good.

Example:     K(x, x′) = exp( −γ ‖x − x′‖² )

Infinite-dimensional Z:  take the simple one-dimensional case

  K(x, x′) = exp( −(x − x′)² )
           = exp(−x²) · exp(−x′²) · Σ_{k=0}^{∞} ( 2^k x^k x′^k ) / k!         [ the sum is exp(2xx′) ]

Learning From Data - Lecture 15    7/20

This kernel in action

Slightly non-separable case:  transforming X into an ∞-dimensional Z

Overkill?    Count the support vectors

[Figure: the RBF-kernel decision boundary on a slightly non-separable data set, with the support vectors marked]

Learning From Data - Lecture 15    8/20

Kernel formulation of SVM

Remember quadratic programming?  The only difference now is the matrix of quadratic coefficients:

  Q_nm = y_n y_m K(x_n, x_m)

Everything else is the same.

Learning From Data - Lecture 15    9/20

The final hypothesis

Express  g(x) = sign(wᵀz + b)  in terms of K(·,·):

  w = Σ_{z_n is SV} α_n y_n z_n       ⇒       g(x) = sign( Σ_{α_n>0} α_n y_n K(x_n, x) + b )

where

  b = y_m − Σ_{α_n>0} α_n y_n K(x_n, x_m)         for any support vector  (α_m > 0)

Learning From Data - Lecture 15    10/20

How do we know that Z exists ...

... for a given K(x, x′)?         valid kernel

Three approaches:

1. By construction
2. Math properties (Mercer's condition)
3. Who cares?

Learning From Data - Lecture 15    11/20

Design your own kernel

K(x, x′) is a valid kernel iff

1. it is symmetric,  and

2. the matrix whose (i, j) entry is K(x_i, x_j) is positive semi-definite for any x1, …, xN

(Mercer's condition)

Learning From Data - Lecture 15    12/20

Outline

The kernel trick
Soft-margin SVM

Learning From Data - Lecture 15    13/20

Two types of non-separable

  slightly:                                seriously:

[Figure: a data set that is linearly separable except for a few points (left), and one that needs a genuinely nonlinear boundary (right)]

Learning From Data - Lecture 15    14/20

Error measure

Margin violation:     y_n (wᵀx_n + b) ≥ 1    fails

Quantify:     y_n (wᵀx_n + b) ≥ 1 − ξ_n,     with  ξ_n ≥ 0

Total violation  =  Σ_{n=1}^{N} ξ_n

[Figure: points inside the margin; ξ_n measures each point's violation]

Learning From Data - Lecture 15    15/20

The new optimization

Minimize      ½ wᵀw + C Σ_{n=1}^{N} ξ_n

subject to    y_n (wᵀx_n + b) ≥ 1 − ξ_n      for n = 1, …, N
              ξ_n ≥ 0                         for n = 1, …, N

  w ∈ R^d,  b ∈ R,  ξ ∈ R^N

Learning From Data - Lecture 15    16/20

Lagrange formulation

  L(w, b, ξ, α, β) = ½ wᵀw + C Σ_{n=1}^{N} ξ_n − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 + ξ_n ) − Σ_{n=1}^{N} β_n ξ_n

Minimize w.r.t. w, b and ξ, and maximize w.r.t. each α_n ≥ 0 and β_n ≥ 0

  ∇_w L = w − Σ_{n=1}^{N} α_n y_n x_n = 0

  ∂L/∂b = − Σ_{n=1}^{N} α_n y_n = 0

  ∂L/∂ξ_n = C − α_n − β_n = 0

Learning From Data - Lecture 15    17/20

... and the solution is

Maximize      L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m

w.r.t. α    subject to    0 ≤ α_n ≤ C  for n = 1, …, N    and    Σ_{n=1}^{N} α_n y_n = 0

  ⇒     w = Σ_{n=1}^{N} α_n y_n x_n         minimizes     ½ wᵀw + C Σ_{n=1}^{N} ξ_n

Learning From Data - Lecture 15    18/20

Types of support vectors

margin support vectors  (0 < α_n < C):          y_n (wᵀx_n + b) = 1       (ξ_n = 0)

non-margin support vectors  (α_n = C):          y_n (wᵀx_n + b) < 1       (ξ_n > 0)

[Figure: margin support vectors sit on the margin; non-margin support vectors violate it]

Learning From Data - Lecture 15    19/20

Two technical observations

1. Hard margin:  what if the data is not linearly separable?
     the primal and dual formulations break down

2. Z:  what if there is a w0?
     All goes to b,  and  w0 → 0

Learning From Data - Lecture 15    20/20

Review of Lecture 15

Kernel methods:     K(x, x′) = zᵀz′  for some Z space

  e.g.    K(x, x′) = exp( −γ ‖x − x′‖² )

Soft-margin SVM:    minimize   ½ wᵀw + C Σ_{n=1}^{N} ξ_n         (ξ_n = margin violation)

  Same as hard margin, but    0 ≤ α_n ≤ C

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 16: Radial Basis Functions

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 24, 2012

Outline

RBF and nearest neighbors
RBF and neural networks
RBF and kernel methods
RBF and regularization

Learning From Data - Lecture 16    2/20

Basic RBF model

Each (x_n, y_n) ∈ D influences h(x) based on  ‖x − x_n‖         (radial)

Standard form:

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )                     (basis function)

Learning From Data - Lecture 16    3/20

The learning algorithm

Finding  w1, …, wN :

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )       based on   D = (x1, y1), …, (xN, yN)

E_in = 0:   h(x_n) = y_n   for n = 1, …, N :

  Σ_{m=1}^{N} w_m exp( −γ ‖x_n − x_m‖² ) = y_n

Learning From Data - Lecture 16    4/20

The solution

  Σ_{m=1}^{N} w_m exp( −γ ‖x_n − x_m‖² ) = y_n           N equations in N unknowns

  Φ w = y,      where    Φ_nm = exp( −γ ‖x_n − x_m‖² )

If Φ is invertible,

  w = Φ⁻¹ y            "exact interpolation"

Learning From Data - Lecture 16    5/20
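A minimal sketch of exact RBF interpolation (illustrative): build the N x N matrix Φ and solve Φw = y, then predict with the weighted Gaussians.

import numpy as np

def rbf_interpolate(X, y, gamma=1.0):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)    # pairwise squared distances
    Phi = np.exp(-gamma * d2)
    return np.linalg.solve(Phi, y)                                # w = Phi^{-1} y

def rbf_predict(X_train, w, x, gamma=1.0):
    return w @ np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))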

The effect of γ

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )

[Figure: the interpolant for a small γ (wide, smooth bumps) and a large γ (narrow, spiky bumps)]

  small γ                                 large γ

Learning From Data - Lecture 16    6/20

RBF for classification

  h(x) = sign( Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² ) )

Learning:  ~ linear regression for classification

  s = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )

  Minimize  (s − y)²  on D,  with  y = ±1;      h(x) = sign(s)

Learning From Data - Lecture 16    7/20

Relationship to nearest-neighbor method

Adopt the y value of a nearby point:         similar effect by a basis function:

[Figure: the nearest-neighbor rule gives a piecewise-constant surface; a local basis function gives a similar, smoother effect]

Learning From Data - Lecture 16    8/20

RBF with K centers

N parameters w1, …, wN based on N data points

Use K ≪ N centers:    μ1, …, μK    instead of    x1, …, xN

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

1. How to choose the centers μ_k
2. How to choose the weights w_k

Learning From Data - Lecture 16    9/20

Choosing the centers

Minimize the distance between x_n and the closest center μ_k:      K-means clustering

Split x1, …, xN into clusters S1, …, SK

Minimize     Σ_{k=1}^{K} Σ_{x_n ∈ S_k} ‖x_n − μ_k‖²

Unsupervised learning              NP-hard

Learning From Data - Lecture 16    10/20

An iterative algorithm

Lloyd's algorithm:  iteratively minimize   Σ_{k=1}^{K} Σ_{x_n ∈ S_k} ‖x_n − μ_k‖²   w.r.t.  μ_k, S_k :

  μ_k ← (1/|S_k|) Σ_{x_n ∈ S_k} x_n

  S_k ← { x_n :  ‖x_n − μ_k‖ ≤ all  ‖x_n − μ_ℓ‖ }

Convergence:  to a local minimum

Learning From Data - Lecture 16    11/20

Lloyd's algorithm in action

1. Get the data points
2. Only the inputs!
3. Initialize the centers
4. Iterate
5. These are your μ_k's

[Figure: the cluster centers and boundaries found by Lloyd's algorithm on a 2D data set]

Learning From Data - Lecture 16    12/20

Centers versus support vectors

[Figure: the same data set with the SVM support vectors (left) and the RBF centers (right)]

  support vectors                          RBF centers

Learning From Data - Lecture 16    13/20

Choosing the weights

  Σ_{k=1}^{K} w_k exp( −γ ‖x_n − μ_k‖² ) ≈ y_n            N equations in K < N unknowns

  Φ w ≈ y,      where Φ is the N×K matrix with entries    Φ_nk = exp( −γ ‖x_n − μ_k‖² )

If ΦᵀΦ is invertible,

  w = (ΦᵀΦ)⁻¹ Φᵀ y             pseudo-inverse

Learning From Data - Lecture 16    14/20
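An illustrative sketch of the two steps above (the names are mine, not the course code): pick K centers with Lloyd's algorithm, then get the weights from the pseudo-inverse.

import numpy as np

def lloyd(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # initialize centers at data points
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)   # nearest center
        mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                       for k in range(K)])             # recompute the cluster means
    return mu

def rbf_weights(X, y, mu, gamma=1.0):
    Phi = np.exp(-gamma * ((X[:, None] - mu[None]) ** 2).sum(-1))   # N x K matrix
    return np.linalg.pinv(Phi) @ y                                  # w = (Phi^T Phi)^{-1} Phi^T y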

RBF network

The features are    exp( −γ ‖x − μ_k‖² )

Nonlinear transform depends on D    ⇒    no longer a linear model

[Diagram: the input x feeds K radial units ‖x − μ_k‖, each passed through a Gaussian, combined with weights w1, …, wK into h(x)]

A bias term (b or w0) is often added

Learning From Data - Lecture 16    15/20

Compare to neural networks

[Diagram: RBF network (features ‖x − μ_k‖ through a Gaussian nonlinearity) versus neural network (features w_kᵀx through a sigmoidal nonlinearity), both combined linearly into h(x)]

  RBF network                              neural network

Learning From Data - Lecture 16    16/20

Choosing γ

Treating γ as a parameter to be learned:

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

Iterative approach (EM algorithm in mixture of Gaussians):

1. Fix γ, solve for w1, …, wK
2. Fix w1, …, wK, minimize the error w.r.t. γ

We can have a different γ_k for each center μ_k

Learning From Data - Lecture 16    17/20

Outline

RBF and nearest neighbors
RBF and neural networks
RBF and kernel methods
RBF and regularization

Learning From Data - Lecture 16    18/20

RBF versus its SVM kernel

SVM kernel implements:

  sign( Σ_{α_n>0} α_n y_n exp( −γ ‖x − x_n‖² ) + b )

Straight RBF implements:

  sign( Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² ) + b )

[Figure: the two decision boundaries on the same data set]

Learning From Data - Lecture 16    19/20

RBF and regularization

RBF can be derived based purely on regularization:

  Σ_{n=1}^{N} ( h(x_n) − y_n )²  +  λ Σ_{k=0}^{∞} a_k ∫ ( dᵏh/dxᵏ )² dx

smoothest interpolation

Learning From Data - Lecture 16    20/20

Review of Lecture 16

Radial Basis Functions:

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

Related to:  nearest neighbors, neural networks, SVM (kernel = RBF), regularization

Choose the μ_k's:    Lloyd's algorithm    (unsupervised learning)
Choose the w_k's:    pseudo-inverse

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 17: Three Learning Principles

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 29, 2012

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    2/22

Recurring theme - simple hypotheses

A quote by Einstein:

  "An explanation of the data should be made as simple as possible, but no simpler"

The razor: symbolic of a principle set by William of Occam

Learning From Data - Lecture 17    3/22

Occam's Razor

The simplest model that fits the data is also the most plausible.

Two questions:

1. What does it mean for a model to be simple?
2. How do we know that simpler is better?

Learning From Data - Lecture 17    4/22

First question: 'simple' means?

Measures of complexity - two types:    complexity of h    and    complexity of H

  Complexity of h:   MDL, order of a polynomial
  Complexity of H:   entropy, VC dimension

When we think of simple, it's in terms of h

Proofs use simple in terms of H

Learning From Data - Lecture 17    5/22

and the link is . . .

counting:    ℓ bits specify h    ⇒    h is one of 2^ℓ elements of a set H

Real-valued parameters?

Example: 17th-order polynomial - complex and one of many

Exceptions? Looks complex but is one of few - SVM

[Figure: a 17th-order polynomial fit; an SVM boundary determined by a few support vectors]

Learning From Data - Lecture 17    6/22

Puzzle 1: Football oracle

  00000000000000001111111111111111      0
  00000000111111110000000011111111      1
  00001111000011110000111100001111      0
  00110011001100110011001100110011      1
  01010101010101010101010101010101      1

Letter predicting the game outcome         Good call!
More letters - for 5 weeks                 Perfect record!
Want more? $50 charge                      Should you pay?

Learning From Data - Lecture 17    7/22

Second question: Why is simpler better?

Better doesn't mean more elegant!  It means better out-of-sample performance

The basic argument:  (formal proof under different idealized conditions)

  Fewer simple hypotheses than complex ones:    m_H(N)
  less likely to fit a given data set:          m_H(N)/2^N
  more significant when it happens

The postal scam:     m_H(N) = 1    versus    2^N

Learning From Data - Lecture 17    8/22

A fit that means nothing

Conductivity linear in temperature?  Two scientists conduct experiments

[Figure: three panels of conductivity versus temperature - the linear hypothesis, Scientist A's data, and Scientist B's data]

What evidence do A and B provide?

  falsifiable:  a fit that could not have failed to fit provides no evidence

Learning From Data - Lecture 17    9/22

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    10/22

Puzzle 2: Presidential election

In 1948, Truman ran against Dewey in a close election

A newspaper ran a phone poll of how people voted

Dewey won the poll decisively - the newspaper declared:

[Photo: the "Dewey Defeats Truman" front page]

Learning From Data - Lecture 17    11/22

On to the victory rally . . .

. . . of Truman

[Photo: Truman holding up the erroneous front page]

It's not δ's fault:     P[ |E_in − E_out| > ε ] ≤ δ

Learning From Data - Lecture 17    12/22

The bias

In 1948, phones were expensive.

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

Example:  training on a normal period in the market;  testing on live trading in the real market

Learning From Data - Lecture 17    13/22

Matching the distributions

Methods exist to match the training and testing distributions

[Figure: the training and testing densities P(x) over the input space]

Doesn't work if a region has P = 0 in training but P > 0 in testing

Learning From Data - Lecture 17    14/22

Puzzle 3: Credit approval

Historical records of customers

Input: information on the credit application:

  age                  23 years
  gender               male
  annual salary        $30,000
  years in residence   1 year
  years in job         1 year
  current debt         $15,000

Target: profitable for the bank

Learning From Data - Lecture 17    15/22

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    16/22

The principle

If a data set has affected any step in the learning process,
its ability to assess the outcome has been compromised.

Most common trap for practitioners - many ways to slip

Learning From Data - Lecture 17    17/22

Looking at the data

Remember nonlinear transforms?

  z = (1, x1, x2, x1x2, x1², x2²)      or      z = (1, x1², x2²)      or      z = (1, x1² + x2²)

Snooping involves D, not other information

[Figure: a ring-shaped data set that tempts you to choose the transform after looking at it]

Learning From Data - Lecture 17    18/22

Puzzle 4: Financial forecasting

Predict US Dollar versus British Pound

Normalize the data, split randomly into D_train and D_test

Train on D_train only, test g on D_test

  inputs:  r_{-20}, r_{-19}, …, r_{-1}     →     output:  r_0

[Figure: cumulative profit % over 500 days of testing; the 'snooping' curve climbs steadily while the 'no snooping' curve hovers around zero]

Learning From Data - Lecture 17    19/22

Reuse of a data set

Trying one model after the other on the same data set, you will eventually 'succeed'

  "If you torture the data long enough, it will confess"

VC dimension of the total learning model - may include what others tried!

Key problem: matching a particular data set

Learning From Data - Lecture 17    20/22

Two remedies

1. Avoid data snooping:          strict discipline

2. Account for data snooping:    how much data contamination

Learning From Data - Lecture 17    21/22

Puzzle 5: Bias via snooping

Testing the long-term performance of "buy and hold" in stocks.  Use 50 years worth of data

  All currently traded companies in the S&P 500
  Assume you strictly followed buy and hold
  Would have made great profit!

Sampling bias caused by 'snooping'

Learning From Data - Lecture 17    22/22

Review of Lecture 17

Occam's Razor:  the simplest model that fits the data is also the most plausible
  complexity of h ↔ complexity of H;  an unlikely event is significant if it happens

Sampling bias
  [Figure: training versus testing distributions P(x)]

Data snooping
  [Figure: cumulative profit % over 500 days, with and without snooping]

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 18: Epilogue

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 31, 2012

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    2/23

It's a jungle out there

[Word cloud of machine-learning terms: semi-supervised learning, stochastic gradient descent, overfitting, Gaussian processes, distribution-free learning, collaborative filtering, deterministic noise, linear regression, VC dimension, nonlinear transformation, decision trees, data snooping, sampling bias, Q learning, SVM, learning curves, mixture of experts, neural networks, no free lunch, training versus testing, RBF, noisy targets, Bayesian prior, active learning, linear models, bias-variance tradeoff, weak learners, ordinal regression, logistic regression, data contamination, cross validation, ensemble learning, types of learning, exploration versus exploitation, error measures, is learning feasible?, clustering, regularization, kernel methods, hidden Markov models, perceptrons, graphical models, soft-order constraint, weight decay, Occam's razor, Boltzmann machines]

Learning From Data - Lecture 18    3/23

The map

THEORY:        VC, bias-variance, complexity, Bayesian

TECHNIQUES:
  models:      linear, neural networks, SVM, nearest neighbors, RBF, Gaussian processes, SVD, graphical models
  methods:     regularization, validation, aggregation, input processing

PARADIGMS:     supervised, unsupervised, reinforcement, active, online

Learning From Data - Lecture 18    4/23

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    5/23

Probabilistic approach

Extend the probabilistic role to all components

  P(D | h = f)    decides which h         (likelihood)

How about    P(h = f | D) ?

[Diagram: the learning setup with the unknown target distribution P(y | x), the unknown input distribution P(x), the data set D, the learning algorithm, the hypothesis set H, and the final hypothesis g ≈ f]

Learning From Data - Lecture 18    6/23

The prior

P(h = f | D) requires an additional probability distribution:

  P(h = f | D) = P(D | h = f) P(h = f) / P(D)    ∝    P(D | h = f) P(h = f)

  P(h = f)        is the  prior
  P(h = f | D)    is the  posterior

Given the prior, we have the full distribution

Learning From Data - Lecture 18    7/23
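A tiny sketch of this rule on a discrete hypothesis set (a toy example of my own): each h is a candidate coin bias, the prior is uniform, and the posterior follows from Bayes' rule.

import numpy as np

hypotheses = np.linspace(0.0, 1.0, 11)          # candidate values of P(y = 1)
prior = np.ones_like(hypotheses) / len(hypotheses)

data = [1, 1, 0, 1]                             # observed outcomes
k = sum(data)
likelihood = hypotheses**k * (1 - hypotheses)**(len(data) - k)   # P(D | h = f)

posterior = likelihood * prior
posterior /= posterior.sum()                    # divide by P(D)
print(hypotheses[np.argmax(posterior)])         # most probable h given the data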

Example of a prior

Consider a perceptron:  h is determined by  w = w0, w1, …, wd

A possible prior on w:  each w_i is independent, uniform over [−1, 1]

This determines the prior over h:     P(h = f)

Given D, we can compute     P(D | h = f)

Putting them together, we get     P(h = f | D)  ∝  P(h = f) P(D | h = f)

Learning From Data - Lecture 18    8/23

A prior is an assumption

Even the most "neutral" prior:

  x is unknown    →    x is random, uniform over [−1, 1]           [Figure: a flat density P(x)]

The true equivalent would be:

  x is unknown    →    x is random, with density δ(x − a)           [Figure: a unit spike at the unknown value a]

Learning From Data - Lecture 18    9/23

If we knew the prior

... we could compute  P(h = f | D)  for every  h ∈ H

  ⇒   we can find the most probable h given the data
       we can derive E[h(x)] for every x
       we can derive the error bar for every x
       we can derive everything in a principled way

Learning From Data - Lecture 18    10/23

When is Bayesian learning justified?

1. The prior is valid
     trumps all other methods

2. The prior is irrelevant
     just a computational catalyst

Learning From Data - Lecture 18    11/23

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    12/23

What is aggregation?

Combining different solutions h1, h2, …, hT that were trained on D:

[Figure: several individual boundaries combined into one aggregate boundary]

  Regression: take an average             Classification: take a vote

a.k.a. ensemble learning and boosting

Learning From Data - Lecture 18    13/23

Different from 2-layer learning

In a 2-layer model, all units learn jointly:

  [Diagram: the training data feeds one learning algorithm that trains all units together]

In aggregation, they learn independently then get combined:

  [Diagram: the training data feeds separate learning algorithms whose outputs are combined afterwards]

Learning From Data - Lecture 18    14/23

Two types of aggregation

1. After the fact:  combines existing solutions
     Example: the Netflix teams merging        (blending)

2. Before the fact:  creates solutions to be combined
     Example: bagging - resampling D

  [Diagram: D resampled into several training sets, each fed to the learning algorithm]

Learning From Data - Lecture 18    15/23

Decorrelation - boosting

Create h1, …, ht, … sequentially:  make ht decorrelated with the previous h's:

  Emphasize the points in D that were misclassified

  Choose the weight of ht based on E_in(ht)

  [Diagram: the training data is reweighted at each round before being fed to the learning algorithm]

Learning From Data - Lecture 18    16/23

Blending - after the fact

For regression,  h1, h2, …, hT :

  g(x) = Σ_{t=1}^{T} α_t h_t(x)

Principled choice of the α_t's:  minimize the error on an "aggregation" data set       (pseudo-inverse)

Some α_t's can come out negative

Most valuable h_t in the blend?

Uncorrelated h_t's help the blend

Learning From Data - Lecture 18    17/23
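A minimal sketch of blending (illustrative): choose the α_t's by least squares on a held-out "aggregation" set, which is exactly a pseudo-inverse computation; some coefficients may come out negative.

import numpy as np

def blend(models, X_agg, y_agg):
    # columns of H are the predictions h_t(x) of each model on the aggregation set
    H = np.column_stack([h(X_agg) for h in models])
    alpha, *_ = np.linalg.lstsq(H, y_agg, rcond=None)
    return alpha

def blended_predict(models, alpha, X):
    H = np.column_stack([h(X) for h in models])
    return H @ alpha          # g(x) = sum_t alpha_t h_t(x)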

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    18/23

Course content

  Professor Malik Magdon-Ismail, RPI
  Professor Hsuan-Tien Lin, NTU

Learning From Data - Lecture 18    19/23

Course staff

  Carlos Gonzalez (Head TA)
  Ron Appel
  Costis Sideris
  Doris Xin

Learning From Data - Lecture 18    20/23

Filming, production, and infrastructure

  Leslie Maxfield and the AMT staff
  Rich Fagen and the IMSS staff

Learning From Data - Lecture 18    21/23

Caltech support

  IST              -  Mathieu Desbrun
  E&AS Division    -  Ares Rosakis and Mani Chandy
  Provost's Office -  Ed Stolper and Melany Hunt

Learning From Data - Lecture 18    22/23

Many others

  Caltech TA's and staff members
  Caltech alumni and Alumni Association
  Colleagues all over the world

Learning From Data - Lecture 18    23/23

To the fond memory of

Faiza A. Ibrahim
