
Learning From Data

Prof. Yaser Abu-Mostafa

Caltech, 2012
http://work.caltech.edu/telecourse.html

Outline of the Course

1. The Learning Problem (April 3)
2. Is Learning Feasible? (April 5)
3. The Linear Model I (April 10)
4. Error and Noise (April 12)
5. Training versus Testing (April 17)
6. Theory of Generalization (April 19)
7. The VC Dimension (April 24)
8. Bias-Variance Tradeoff (April 26)
9. The Linear Model II (May 1)
10. Neural Networks (May 3)
11. Overfitting (May 8)
12. Regularization (May 10)
13. Validation (May 15)
14. Support Vector Machines (May 17)
15. Kernel Methods (May 22)
16. Radial Basis Functions (May 24)
17. Three Learning Principles (May 29)
18. Epilogue (May 31)

Legend: theory (mathematical); technique (practical); analysis (conceptual)

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 1: The Learning Problem

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 3, 2012

The learning problem - Outline

Example of machine learning
Components of learning
A simple model
Types of learning
Puzzle

Learning From Data - Lecture 1    2/19

Example: Predicting how a viewer will rate a movie

10% improvement = 1 million dollar prize

The essence of machine learning:

A pattern exists.
We cannot pin it down mathematically.
We have data on it.

Learning From Data - Lecture 1    3/19

Movie rating - a solution

[Figure: match movie factors (comedy content? action content? blockbuster? Tom Cruise in it?) against the corresponding viewer factors (likes comedy? likes action? prefers blockbusters? likes Tom Cruise?), then add the contributions from each factor to get the predicted rating.]

Learning From Data - Lecture 1    4/19

The learning approach

[Figure: the viewer and the movie go into LEARNING, which produces the rating.]

Learning From Data - Lecture 1    5/19

Metaphor: Credit approval

Applicant information:

age                 23 years
gender              male
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

Approve credit?

Learning From Data - Lecture 1    6/19

Components of learning

Formalization:

Input: x                                     (customer application)
Output: y                                    (good/bad customer?)
Target function: f : X → Y                   (ideal credit approval formula)
Data: (x1, y1), (x2, y2), ..., (xN, yN)      (historical records)
Hypothesis: g : X → Y                        (formula to be used)

Learning From Data - Lecture 1    7/19

[Learning diagram: an unknown target function f : X → Y (ideal credit approval function) generates the training examples (x1, y1), ..., (xN, yN) (historical records of credit customers); the learning algorithm A picks the final hypothesis g ≈ f (final credit approval formula) from the hypothesis set H (set of candidate formulas).]

Learning From Data - Lecture 1    8/19

Solution components

The 2 solution components of the learning problem:

The Hypothesis Set:    H = {h},    g ∈ H

The Learning Algorithm:    A

Together, they are referred to as the learning model.

Learning From Data - Lecture 1    9/19

A simple hypothesis set - the `perceptron'

For input x = (x1, ..., xd)    `attributes of a customer'

Approve credit if   Σ_{i=1}^{d} w_i x_i > threshold,
Deny credit if      Σ_{i=1}^{d} w_i x_i < threshold.

This linear formula h ∈ H can be written as

h(x) = sign( Σ_{i=1}^{d} w_i x_i − threshold )

Learning From Data - Lecture 1    10/19

h(x) = sign( Σ_{i=1}^{d} w_i x_i + w_0 )

Introduce an artificial coordinate x_0 = 1:

h(x) = sign( Σ_{i=0}^{d} w_i x_i )

In vector form, the perceptron implements

h(x) = sign(w^T x)

[Figure: `linearly separable' data, with + and − points separated by a line.]

Learning From Data - Lecture 1    11/19

A simple learning algorithm - PLA

The perceptron implements

h(x) = sign(w^T x)

Given the training set:

(x1, y1), (x2, y2), ..., (xN, yN)

pick a misclassified point:

sign(w^T x_n) ≠ y_n

and update the weight vector:

w ← w + y_n x_n

[Figure: for y = +1 the update w + y x rotates w toward x; for y = −1 it rotates w away from x.]

Learning From Data - Lecture 1    12/19

Iterations of PLA

One iteration of the PLA:

w ← w + y x

where (x, y) is a misclassified training point.

At iteration t = 1, 2, 3, ..., pick a misclassified point from

(x1, y1), (x2, y2), ..., (xN, yN)

and run a PLA iteration on it.

That's it!

Learning From Data - Lecture 1    13/19
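The update rule above is all there is to PLA, so it is easy to run end to end. Below is a minimal NumPy sketch (not taken from the course materials); the random choice among misclassified points and the iteration cap are illustrative choices.

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron Learning Algorithm on linearly separable data.

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns the weight vector w (with an added bias coordinate x0 = 1).
    """
    N, d = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])     # artificial coordinate x0 = 1
    w = np.zeros(d + 1)
    for _ in range(max_iters):
        preds = np.sign(Xb @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:          # all points correct: done
            break
        n = np.random.choice(misclassified)  # pick a misclassified point
        w = w + y[n] * Xb[n]                 # PLA update: w <- w + y_n x_n
    return w
```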

The learning problem - Outline


Example of ma hine learning
Components of learning
A simple model
Types of learning
Puzzle

Learning From Data - Le ture 1

14/19

Basic premise of learning

using a set of observations to uncover an underlying process

broad premise = many variations

Supervised Learning
Unsupervised Learning
Reinforcement Learning

Learning From Data - Lecture 1    15/19

Supervised learning
Example from vending ma hines  oin re ognition
25

Mass

Mass

25

5
1

10

10

Size

Learning From Data - Le ture 1

Size

16/19

Mass

Instead of

Unsupervised learning
(input, orre t output), we get (input, ? )

Size

Learning From Data - Le ture 1

17/19

Reinfor ement learning


Instead of (input, orre t output),
we get (input,some output,grade for this output)
The world hampion was
a neural network!

Learning From Data - Le ture 1

18/19

A Learning puzzle
f = 1

f = +1

f =?
Learning From Data - Le ture 1

19/19

Review of Lecture 1

Learning is used when
- A pattern exists
- We cannot pin it down mathematically
- We have data on it

Focus on supervised learning
- Unknown target function  y = f(x)
- Data set  (x1, y1), ..., (xN, yN)
- Learning algorithm picks g ≈ f from a hypothesis set

Example: Perceptron Learning Algorithm

Learning an unknown function?
- Impossible. The function can assume any value outside the data we have.
- So what now?

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 2: Is Learning Feasible?

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 5, 2012

Feasibility of learning - Outline

Probability to the rescue

Connection to learning

Connection to real learning

A dilemma and a solution

Learning From Data - Lecture 2    2/17

A related experiment

- Consider a `bin' with red and green marbles.

  P[ picking a red marble ]   = μ
  P[ picking a green marble ] = 1 − μ

- The value of μ is unknown to us.

- We pick N marbles independently.

- The fraction of red marbles in the sample = ν

[Figure: BIN with μ = probability of red marbles; SAMPLE with ν = fraction of red marbles.]

Learning From Data - Lecture 2    3/17

Does ν say anything about μ?

No!
Sample can be mostly green while bin is mostly red.

Yes!
Sample frequency ν is likely close to bin frequency μ.

possible  versus  probable

Learning From Data - Lecture 2    4/17

What does ν say about μ?

In a big sample (large N), ν is probably close to μ (within ε).

Formally,

P[ |ν − μ| > ε ] ≤ 2 e^{−2 ε² N}

This is called Hoeffding's Inequality.

In other words, the statement "μ = ν" is P.A.C.

Learning From Data - Lecture 2    5/17

P[ |ν − μ| > ε ] ≤ 2 e^{−2 ε² N}

Valid for all N and ε

Bound does not depend on μ

Tradeoff: N, ε, and the bound.

ν ≈ μ   =>   μ ≈ ν

Learning From Data - Lecture 2    6/17

Connection to learning

Bin: The unknown is a number μ

Learning: The unknown is a function f : X → Y

Each marble is a point x ∈ X

Green marble: hypothesis got it right,  h(x) = f(x)
Red marble: hypothesis got it wrong,    h(x) ≠ f(x)

Learning From Data - Lecture 2    7/17

Back to the learning diagram

The bin analogy:

[Learning diagram with a probability distribution P on X added: the training inputs x1, ..., xN are generated independently from P, playing the role of picking marbles from the bin.]

Learning From Data - Lecture 2    8/17

Are we done?

Not so fast!  h is fixed.

For this h, ν generalizes to μ.

`verification' of h, not learning

No guarantee ν will be small.

We need to choose from multiple h's.

Learning From Data - Lecture 2    9/17

Multiple bins

Generalizing the bin model to more than one hypothesis:

[Figure: M bins, one per hypothesis h1, h2, ..., hM, each with its own μ_m and sample fraction ν_m.]

Learning From Data - Lecture 2    10/17

Notation for learning

Both μ and ν depend on which hypothesis h

ν is `in sample', denoted by E_in(h)

μ is `out of sample', denoted by E_out(h)

The Hoeffding inequality becomes:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 e^{−2 ε² N}

Learning From Data - Lecture 2    11/17

Notation with multiple bins

[Figure: bins h1, h2, ..., hM, each with its out-of-sample error E_out(h_m) and in-sample error E_in(h_m).]

Learning From Data - Lecture 2    12/17

Are we done already?

Not so fast!! Hoeffding doesn't apply to multiple bins.

What?

[Figure: the single-bin picture and the learning diagram side by side with the multiple-bin picture; the bound for one bin does not cover choosing among many bins.]

Learning From Data - Lecture 2    13/17

Coin analogy

Question: If you toss a fair coin 10 times, what is the probability that you will get 10 heads?

Answer: ≈ 0.1%

Question: If you toss 1000 fair coins 10 times each, what is the probability that some coin will get 10 heads?

Answer: ≈ 63%

Learning From Data - Lecture 2    14/17
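Both answers are easy to check directly; a small sketch of the exact computation (the 0.1% and 63% on the slide are these two numbers, rounded):

```python
# one coin: p = (1/2)^10; 1000 coins: 1 - (1 - p)^1000
p_ten_heads = 0.5 ** 10                         # ~0.0977%
p_some_coin = 1 - (1 - p_ten_heads) ** 1000     # ~62.4%
print(f"{p_ten_heads:.4%}  {p_some_coin:.1%}")
```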

From coins to learning

[Figure: each coin corresponds to a hypothesis h_i; with many hypotheses, some h_i can look perfect on the sample purely by chance - "BINGO".]

Learning From Data - Lecture 2    15/17

A simple solution

P[ |E_in(g) − E_out(g)| > ε ]  ≤  P[ |E_in(h1) − E_out(h1)| > ε
                                     or |E_in(h2) − E_out(h2)| > ε
                                     ...
                                     or |E_in(hM) − E_out(hM)| > ε ]

                               ≤  Σ_{m=1}^{M} P[ |E_in(h_m) − E_out(h_m)| > ε ]

Learning From Data - Lecture 2    16/17

The final verdict

P[ |E_in(g) − E_out(g)| > ε ]  ≤  Σ_{m=1}^{M} P[ |E_in(h_m) − E_out(h_m)| > ε ]

                               ≤  Σ_{m=1}^{M} 2 e^{−2 ε² N}

P[ |E_in(g) − E_out(g)| > ε ]  ≤  2 M e^{−2 ε² N}

Learning From Data - Lecture 2    17/17

Review of Lecture 2

Is learning feasible?

Yes, in a probabilistic sense:

P[ |E_in(h) − E_out(h)| > ε ] ≤ 2 e^{−2 ε² N}

Since g has to be one of h1, h2, ..., hM, we conclude that

If:    |E_in(g) − E_out(g)| > ε

Then:  |E_in(h1) − E_out(h1)| > ε  or
       |E_in(h2) − E_out(h2)| > ε  or
       ...
       |E_in(hM) − E_out(hM)| > ε

This gives us an added M factor.

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 3: Linear Models I

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 10, 2012

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

2/23

A real data set

Learning From Data - Le ture 3

3/23

Input representation

`raw' input:   x = (x0, x1, x2, ..., x256)
linear model:  (w0, w1, w2, ..., w256)

Features: extract useful information, e.g.,

intensity and symmetry:  x = (x0, x1, x2)
linear model:            (w0, w1, w2)

Learning From Data - Lecture 3    4/23

Illustration of features

x = (x0, x1, x2)      x1: intensity      x2: symmetry

[Figure: handwritten digits `1' and `5' plotted in the intensity-symmetry plane.]

Learning From Data - Lecture 3    5/23

Evolution of E_in and E_out

[Figure, left: E_in and E_out versus PLA iteration (0 to 1000), both fluctuating between roughly 1% and 10%. Figure, right: what PLA does - the final perceptron boundary in the intensity-symmetry plane.]

Learning From Data - Lecture 3    6/23

The `pocket' algorithm

PLA:                                   Pocket:

[Figure: E_in and E_out versus iteration. With plain PLA both errors fluctuate; with the pocket algorithm (keep the best weights seen so far) the reported errors decrease monotonically.]

Learning From Data - Lecture 3    7/23

Classification boundary - PLA versus Pocket

[Figure: final decision boundaries in the intensity-symmetry plane for PLA (left) and the pocket algorithm (right).]

Learning From Data - Lecture 3    8/23

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

regression

real-valued output

9/23

Credit again

Classification: Credit approval (yes/no)
Regression:     Credit line (dollar amount)

Input x:

age                 23 years
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

Linear regression output:

h(x) = Σ_{i=0}^{d} w_i x_i = w^T x

Learning From Data - Lecture 3    10/23

The data set

Credit officers decide on credit lines:

(x1, y1), (x2, y2), ..., (xN, yN)

y_n ∈ R is the credit line for customer x_n.

Linear regression tries to replicate that.

Learning From Data - Lecture 3    11/23

How to measure the error

How well does h(x) = w^T x approximate f(x)?

In linear regression, we use squared error  (h(x) − f(x))²

in-sample error:

E_in(h) = (1/N) Σ_{n=1}^{N} (h(x_n) − y_n)²

Learning From Data - Lecture 3    12/23

Illustration of linear regression

[Figure: a line fit to (x, y) data in one dimension, and a plane fit to (x1, x2, y) data in two dimensions.]

Learning From Data - Lecture 3    13/23

The expression for E_in

E_in(w) = (1/N) Σ_{n=1}^{N} (w^T x_n − y_n)²

        = (1/N) ||Xw − y||²

where

X = [ x_1^T ]        y = [ y_1 ]
    [ x_2^T ]            [ y_2 ]
    [  ...  ]            [ ... ]
    [ x_N^T ]            [ y_N ]

Learning From Data - Lecture 3    14/23

Minimizing E_in

E_in(w) = (1/N) ||Xw − y||²

∇E_in(w) = (2/N) X^T (Xw − y) = 0

X^T X w = X^T y

w = X† y    where    X† = (X^T X)^{-1} X^T

X† is the `pseudo-inverse' of X

Learning From Data - Lecture 3    15/23

The pseudo-inverse

X† = (X^T X)^{-1} X^T

[Dimensions: X is N × (d+1); X^T X and its inverse are (d+1) × (d+1); X† is (d+1) × N.]

Learning From Data - Lecture 3    16/23

The linear regression algorithm

1: Construct the matrix X and the vector y from the data set (x1, y1), ..., (xN, yN) as follows:

   X = [ x_1^T ]        y = [ y_1 ]
       [ x_2^T ]            [ y_2 ]
       [  ...  ]            [ ... ]
       [ x_N^T ]            [ y_N ]

   (input data matrix)      (target vector)

2: Compute the pseudo-inverse X† = (X^T X)^{-1} X^T

3: Return w = X† y.

Learning From Data - Lecture 3    17/23
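The three steps translate directly into a few lines of NumPy. A minimal sketch, with np.linalg.pinv standing in for X† and illustrative synthetic data:

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression: w = pseudo-inverse(X) @ y.

    X: (N, d) raw inputs; y: (N,) real-valued targets.
    """
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])    # add the coordinate x0 = 1
    return np.linalg.pinv(Xb) @ y           # w = X_dagger y

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 + 0.1 * rng.normal(size=100)
print(linear_regression(X, y))   # approximately [0.3, 1.0, -2.0, 0.5]
```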

Linear regression for classification

Linear regression learns a real-valued function y = f(x) ∈ R

Binary-valued functions are also real-valued!  ±1 ∈ R

Use linear regression to get w with w^T x_n ≈ y_n = ±1

In this case, sign(w^T x_n) is likely to agree with y_n = ±1

Good initial weights for classification

Learning From Data - Lecture 3    18/23

Linear regression boundary

[Figure: the boundary obtained by linear regression in the (average intensity, symmetry) plane.]

Learning From Data - Lecture 3    18/23

Outline

Input representation

Linear Classi ation

Linear Regression

Nonlinear Transformation

Learning From Data - Le ture 3

19/23

Linear is limited

Data:                      Hypothesis:

[Figure: a data set with a circular +/− pattern, and the best linear hypothesis failing to separate it.]

Learning From Data - Lecture 3    20/23

Another example

Credit line is affected by `years in residence'

but not in a linear way!

Nonlinear features [[x_i < 1]] and [[x_i > 5]] are better.

Can we do that with linear models?

Learning From Data - Lecture 3    21/23

Linear in what?

Linear regression implements

Σ_{i=0}^{d} w_i x_i

Linear classification implements

sign( Σ_{i=0}^{d} w_i x_i )

Algorithms work because of linearity in the weights

Learning From Data - Lecture 3    22/23

Transform the data nonlinearly

(x1, x2)  --Φ-->  (x1², x2²)

[Figure: the circular +/− pattern in the X space becomes linearly separable in the Z space of (x1², x2²).]

Learning From Data - Lecture 3    23/23

Review of Lecture 3

Linear models use the `signal':

s = Σ_{i=0}^{d} w_i x_i = w^T x

- Classification:  h(x) = sign(w^T x)
- Regression:      h(x) = w^T x

Linear regression algorithm:

w = (X^T X)^{-1} X^T y      one-step learning

Nonlinear transformation:

- w^T x is linear in w
- Any x --> z = Φ(x) preserves this linearity.
- Example: (x1, x2) --> (x1², x2²)
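As a concrete illustration of the nonlinear-transformation idea reviewed above, here is a sketch that applies Φ(x1, x2) = (1, x1², x2²) and then runs ordinary linear regression for classification in the Z space; the circular data set and the 0.6 threshold are made up for illustration:

```python
import numpy as np

def phi(X):
    """Illustrative nonlinear transform: (x1, x2) -> (1, x1^2, x2^2)."""
    return np.column_stack([np.ones(len(X)), X[:, 0] ** 2, X[:, 1] ** 2])

# circularly separable data: label +1 inside the circle x1^2 + x2^2 < 0.6
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.6, 1.0, -1.0)

Z = phi(X)                              # work in the Z space
w_tilde = np.linalg.pinv(Z) @ y         # linear regression in Z
pred = np.sign(Z @ w_tilde)             # classify via sign(w~^T z)
print("training accuracy:", np.mean(pred == y))
```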

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 4: Error and Noise

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 12, 2012

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

2/22

1. Original data:  x_n ∈ X                 2. Transform the data:  z_n = Φ(x_n) ∈ Z

4. Classify in X space:                    3. Separate data in Z space:
   g(x) = g~(Φ(x)) = sign( w~^T Φ(x) )        g~(z) = sign( w~^T z )

[Figure: the four steps, going from the original X space to the feature Z space and back.]

Learning From Data - Lecture 4    3/22

What transforms to what

x = (x0, x1, ..., xd)        --Φ-->    z = (z0, z1, ..., z_d~)

x1, x2, ..., xN              --Φ-->    z1, z2, ..., zN
y1, y2, ..., yN              --Φ-->    y1, y2, ..., yN

No weights in X                        w~ = (w~0, w~1, ..., w~_d~)

g(x) = sign( w~^T Φ(x) )

Learning From Data - Lecture 4    4/22

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

5/22

The learning diagram - where we left it

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

6/22

Error measures

What does "h ≈ f" mean?

Error measure:  E(h, f)

Almost always pointwise definition:  e( h(x), f(x) )

Examples:

Squared error:  e( h(x), f(x) ) = ( h(x) − f(x) )²

Binary error:   e( h(x), f(x) ) = [[ h(x) ≠ f(x) ]]

Learning From Data - Lecture 4    7/22

From pointwise to overall

Overall error E(h, f) = average of pointwise errors e( h(x), f(x) ).

In-sample error:

E_in(h) = (1/N) Σ_{n=1}^{N} e( h(x_n), f(x_n) )

Out-of-sample error:

E_out(h) = E_x[ e( h(x), f(x) ) ]

Learning From Data - Lecture 4    8/22

The learning diagram - with pointwise error

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

9/22

How to choose the error measure

Fingerprint verification:

Two types of error: false accept and false reject

How do we penalize each type?

          f = +1 (you)    f = −1 (intruder)

h = +1    no error        false accept
h = −1    false reject    no error

Learning From Data - Lecture 4    10/22

The error measure - for supermarkets

Supermarket verifies fingerprint for discounts

False reject is costly; customer gets annoyed!

False accept is minor; gave away a discount
and intruder left their fingerprint

          f = +1    f = −1

h = +1      0         1
h = −1     10         0

Learning From Data - Lecture 4    11/22

The error measure - for the CIA

CIA verifies fingerprint for security

False accept is a disaster!

False reject can be tolerated
Try again; you are an employee

          f = +1    f = −1

h = +1      0       1000
h = −1      1          0

Learning From Data - Lecture 4    12/22

Take-home lesson

The error measure should be specified by the user.

Not always possible. Alternatives:

Plausible measures: squared error ≡ Gaussian noise

Friendly measures: closed-form solution, convex optimization

Learning From Data - Lecture 4    13/22

The learning diagram - with error measure

PROBABILITY
DISTRIBUTION

UNKNOWN TARGET FUNCTION


f: X Y

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE
e( )

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

14/22

Noisy targets

The `target function' is not always a function

Consider the credit-card approval:

age                 23 years
annual salary       $30,000
years in residence  1 year
years in job        1 year
current debt        $15,000

two `identical' customers  ->  two different behaviors

Learning From Data - Lecture 4    15/22

Target `distribution'

Instead of y = f(x), we use target distribution:

P(y | x)

(x, y) is now generated by the joint distribution:

P(x) P(y | x)

Noisy target = deterministic target f(x) = E(y|x) plus noise y − f(x)

Deterministic target is a special case of noisy target:

P(y | x) is zero except for y = f(x)

Learning From Data - Lecture 4    16/22

The learning diagram - in luding noisy target

UNKNOWN TARGET DISTRIBUTION


P(y |

target function f:

PROBABILITY
DISTRIBUTION

x)

X Y

plus noise

P
TRAINING EXAMPLES
( x1 , y1 ), ... , ( xN , yN )

on

x1 , ... , xN
x

ERROR
MEASURE
e( )

LEARNING
ALGORITHM

g ( x )~
~ f (x )

FINAL
HYPOTHESIS
g: X Y

HYPOTHESIS SET
H

Learning From Data - Le ture 4

17/22

Distinction between P(y | x) and P(x)

Both convey probabilistic aspects of x and y

The target distribution P(y | x) is what we are trying to learn

The input distribution P(x) quantifies relative importance of x

Merging P(x) P(y|x) as P(x, y) mixes the two concepts

Learning From Data - Lecture 4    18/22

Outline

Nonlinear transformation ( ontinued)

Error measures

Noisy targets

Preamble to the theory

Learning From Data - Le ture 4

19/22

What we know so far

Learning is feasible. It is likely that

E_out(g) ≈ E_in(g)

Is this learning?

We need g ≈ f, which means

E_out(g) ≈ 0

Learning From Data - Lecture 4    20/22

The 2 questions of learning

E_out(g) ≈ 0 is achieved through:

E_out(g) ≈ E_in(g)    (Lecture 2)      and      E_in(g) ≈ 0    (Lecture 3)

Learning is thus split into 2 questions:

1. Can we make sure that E_out(g) is close enough to E_in(g)?

2. Can we make E_in(g) small enough?

Learning From Data - Lecture 4    21/22

What the theory will achieve

Characterizing the feasibility of learning for infinite M

Characterizing the tradeoff:

model complexity up:  E_in down
model complexity up:  E_out − E_in up

[Figure: error versus VC dimension d_vc, showing the in-sample error, the out-of-sample error, and the model-complexity term.]

Learning From Data - Lecture 4    22/22

Review of Lecture 4

Error measures

- User-specified e( h(x), f(x) )

- In-sample:  E_in(h) = (1/N) Σ_{n=1}^{N} e( h(x_n), f(x_n) )

- Out-of-sample:  E_out(h) = E_x[ e( h(x), f(x) ) ]

Noisy targets

- y ~ P(y | x) instead of y = f(x): target function plus noise

- (x1, y1), ..., (xN, yN) generated by P(x, y) = P(x) P(y|x)

- E_out(h) is now E_{x,y}[ e( h(x), y ) ]

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 5: Training versus Testing

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 17, 2012

Outline

From training to testing

Illustrative examples

Key notion: break point

Puzzle

Learning From Data - Le ture 5

2/20

The final exam

Testing:

P[ |E_in − E_out| > ε ] ≤ 2 e^{−2 ε² N}

Training:

P[ |E_in − E_out| > ε ] ≤ 2 M e^{−2 ε² N}

Learning From Data - Lecture 5    3/20

Where did the M come from?

The Bad events B_m are

"|E_in(h_m) − E_out(h_m)| > ε"

The union bound:

P[ B_1 or B_2 or ... or B_M ] ≤ P[B_1] + P[B_2] + ... + P[B_M]      (no overlaps: M terms)

[Figure: three overlapping event regions B_1, B_2, B_3.]

Learning From Data - Lecture 5    4/20

Can we improve on M?

Yes, bad events are very overlapping!

ΔE_out: change in +1 and −1 areas

ΔE_in: change in labels of data points

|E_in(h1) − E_out(h1)| ≈ |E_in(h2) − E_out(h2)|

Learning From Data - Lecture 5    5/20

What can we replace M with?

Instead of the whole input space,

we consider a finite set of input points,

and count the number of dichotomies

Learning From Data - Lecture 5    6/20

Dichotomies: mini-hypotheses

A hypothesis:   h : X → {−1, +1}

A dichotomy:    h : {x1, x2, ..., xN} → {−1, +1}

Number of hypotheses |H| can be infinite

Number of dichotomies |H(x1, x2, ..., xN)| is at most 2^N

Candidate for replacing M

Learning From Data - Lecture 5    7/20

The growth function

The growth function counts the most dichotomies on any N points:

m_H(N) = max_{x1, ..., xN ∈ X} |H(x1, ..., xN)|

The growth function satisfies:

m_H(N) ≤ 2^N

Let's apply the definition.

Learning From Data - Lecture 5    8/20

Applying the definition - perceptrons

[Figure: for N = 3 points in general position, all 8 dichotomies can be generated by a line, so m_H(3) = 8; for N = 4 points, at most 14 of the 16 dichotomies can be generated, so m_H(4) = 14.]

Learning From Data - Lecture 5    9/20

Outline

From training to testing

Illustrative examples

Key notion: break point

Puzzle

Learning From Data - Le ture 5

10/20

Example 1: positive rays

H is set of h : R → {−1, +1}

h(x) = sign(x − a)

[Figure: points x1 < x2 < ... < xN on a line; h(x) = −1 to the left of a and +1 to the right.]

m_H(N) = N + 1

Learning From Data - Lecture 5    11/20

Example 2: positive intervals

H is set of h : R → {−1, +1}

[Figure: h(x) = +1 inside an interval, −1 outside.]

Place the interval ends in two of N + 1 spots:

m_H(N) = C(N+1, 2) + 1 = ½ N² + ½ N + 1

Learning From Data - Lecture 5    12/20

Example 3: convex sets

H is set of h : R² → {−1, +1}, where the region h(x) = +1 is convex

m_H(N) = 2^N

The N points are `shattered' by convex sets

[Figure: N points on a circle; any subset can be enclosed by a convex region.]

Learning From Data - Lecture 5    13/20

The 3 growth functions

H is positive rays:       m_H(N) = N + 1

H is positive intervals:  m_H(N) = ½ N² + ½ N + 1

H is convex sets:         m_H(N) = 2^N

Learning From Data - Lecture 5    14/20
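The first two growth functions are easy to verify by brute force for small N. A sketch that enumerates the dichotomies positive intervals can generate on N points and compares with ½N² + ½N + 1 (only the order of the points matters, not their positions):

```python
from itertools import combinations

def interval_dichotomies(N):
    """Count the dichotomies positive intervals generate on N points on a line."""
    dichos = {(-1,) * N}                      # empty interval: all -1
    # interval ends placed in two of the N+1 gaps between/around the points
    for lo, hi in combinations(range(N + 1), 2):
        dichos.add(tuple(+1 if lo <= i < hi else -1 for i in range(N)))
    return len(dichos)

for N in range(1, 8):
    print(N, interval_dichotomies(N), N * (N + 1) // 2 + 1)   # counts agree
```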

Back to the big picture

Remember this inequality?

P[ |E_in − E_out| > ε ] ≤ 2 M e^{−2 ε² N}

What happens if m_H(N) replaces M?

m_H(N) polynomial  =>  Good!

Just prove that m_H(N) is polynomial?

Learning From Data - Lecture 5    15/20

Outline

From training to testing

Illustrative examples

Key notion:

Puzzle

Learning From Data - Le ture 5

break point

16/20

Break point of H

Definition: If no data set of size k can be shattered by H, then k is a break point for H

m_H(k) < 2^k

For 2D perceptrons, k = 4

A bigger data set cannot be shattered either

Learning From Data - Lecture 5    17/20

Break point - the 3 examples

Positive rays:       m_H(N) = N + 1              break point k = 2

Positive intervals:  m_H(N) = ½ N² + ½ N + 1     break point k = 3

Convex sets:         m_H(N) = 2^N                break point k = `∞'

Learning From Data - Lecture 5    18/20

Main result

No break point   =>  m_H(N) = 2^N

Any break point  =>  m_H(N) is polynomial in N

Learning From Data - Lecture 5    19/20

Puzzle

x1

Learning From Data - Le ture 5

x2

x3

20/20

Review of Lecture 5

Dichotomies on x1, x2, ..., xN

Growth function:

m_H(N) = max_{x1, ..., xN ∈ X} |H(x1, ..., xN)|      (maximum # of dichotomies)

Break point k:  m_H(k) < 2^k



Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 6: Theory of Generalization

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 19, 2012

Outline

Proof that mH(N ) is polynomial

Proof that mH(N ) an repla e M

Learning From Data - Le ture 6

2/18

Bounding m_H(N)

To show:  m_H(N) is polynomial

We show:  m_H(N) ≤ a polynomial

Key quantity:

B(N, k): maximum number of dichotomies on N points, with break point k

Learning From Data - Lecture 6    3/18

Recursive bound on B(N, k)

Consider a table of the B(N, k) dichotomies on x1, x2, ..., xN and group its rows:

S1:  rows whose pattern on x1, ..., x_{N−1} appears only once (with xN = +1 or xN = −1, but not both):  α rows

S2:  rows whose pattern on x1, ..., x_{N−1} appears twice, once with xN = +1 (S2+) and once with xN = −1 (S2−):  β + β rows

B(N, k) = α + 2β

Learning From Data - Lecture 6    4/18

Estimating α + β

Focus on the x1, x2, ..., x_{N−1} columns: the α + β distinct patterns there are dichotomies on N − 1 points with the same break point k, so

α + β ≤ B(N − 1, k)

Learning From Data - Lecture 6    5/18

Estimating β by itself

Now, focus on the S2 = S2+ ∪ S2− rows: the β distinct patterns on x1, ..., x_{N−1} must have break point k − 1 (otherwise, adding xN back with both labels would shatter k points), so

β ≤ B(N − 1, k − 1)

Learning From Data - Lecture 6    6/18

Putting it together

B(N, k) = α + 2β = (α + β) + β ≤ B(N − 1, k) + B(N − 1, k − 1)

Learning From Data - Lecture 6    7/18

Numerical computation of the B(N, k) bound

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

[Table: B(N, k) filled in row by row. The k = 1 column is all 1's; the N = 1 row is 1, 2, 2, 2, ...; every other entry is the sum of the entry above it and the entry above-left, e.g. B(3, 3) = 4 + 3 = 7 and B(4, 3) = 7 + 4 = 11.]

Learning From Data - Lecture 6    8/18
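Filling the table is a direct translation of the recursion. The sketch below uses the boundary values B(N, 1) = 1 and B(1, k) = 2 for k ≥ 2 (the first column and first row of the table) and checks the result against the closed form Σ_{i=0}^{k−1} C(N, i) derived on the next slide:

```python
from math import comb
from functools import lru_cache

@lru_cache(maxsize=None)
def B(N, k):
    """Bound on the number of dichotomies on N points with break point k."""
    if k == 1:
        return 1        # no single point can be shattered: only one dichotomy
    if N == 1:
        return 2        # one point, break point >= 2: both labels allowed
    return B(N - 1, k) + B(N - 1, k - 1)    # the recursive bound

# compare with the closed form  sum_{i=0}^{k-1} C(N, i)
for N in range(1, 7):
    for k in range(1, 7):
        assert B(N, k) == sum(comb(N, i) for i in range(k))
print("recursion matches the closed form")
```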

Analytic solution for the B(N, k) bound

B(N, k) ≤ B(N − 1, k) + B(N − 1, k − 1)

Theorem:

B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i)

1. Boundary conditions: easy

[Table: the same B(N, k) table as before.]

Learning From Data - Lecture 6    9/18

2. The induction step

Σ_{i=0}^{k−1} C(N−1, i) + Σ_{i=0}^{k−2} C(N−1, i)

  = 1 + Σ_{i=1}^{k−1} C(N−1, i) + Σ_{i=1}^{k−1} C(N−1, i−1)

  = 1 + Σ_{i=1}^{k−1} [ C(N−1, i) + C(N−1, i−1) ]

  = 1 + Σ_{i=1}^{k−1} C(N, i)

  = Σ_{i=0}^{k−1} C(N, i)

Learning From Data - Lecture 6    10/18

It is polynomial!

For a given H, the break point k is fixed

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)      maximum power is N^{k−1}

Learning From Data - Lecture 6    11/18

Three examples

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)

H is positive rays (break point k = 2):
   m_H(N) = N + 1                    ≤  N + 1

H is positive intervals (break point k = 3):
   m_H(N) = ½ N² + ½ N + 1           ≤  ½ N² + ½ N + 1

H is 2D perceptrons (break point k = 4):
   m_H(N) = ?                        ≤  (1/6) N³ + (5/6) N + 1

Learning From Data - Lecture 6    12/18

Outline

Proof that mH(N ) is polynomial

Proof that mH(N ) an repla e M

Learning From Data - Le ture 6

13/18

What we want

Instead of:   P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 M e^{−2 ε² N}

We want:      P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 m_H(N) e^{−2 ε² N}

Learning From Data - Lecture 6    14/18

Pictorial proof

How does m_H(N) relate to overlaps?

What to do about E_out?

Putting it together

[Figure: the space of data sets, with the bad regions under (a) the Hoeffding inequality, (b) the union bound, and (c) the VC bound.]

Learning From Data - Lecture 6    16/18

What to do about E_out

[Figure: E_out(h) is tracked by the in-sample error on a second sample, so the comparison can be made between two in-sample quantities.]

Learning From Data - Lecture 6    17/18

Putting it together

Not quite:   P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 m_H(N) e^{−2 ε² N}

but rather:  P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}

The Vapnik-Chervonenkis Inequality

Learning From Data - Lecture 6    18/18

Review of Lecture 6

m_H(N) is polynomial

if H has a break point k:

m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)      maximum power is N^{k−1}

[Table: the B(N, k) values computed from the recursion.]

The VC Inequality

P[ |E_in(g) − E_out(g)| > ε ] ≤ 2 M e^{−2 ε² N}                  (Hoeffding + union bound)

P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}        (VC bound)

[Figure: the space of data sets under the Hoeffding inequality, the union bound, and the VC bound.]

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 7: The VC Dimension

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, April 24, 2012

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

2/24

Definition of VC dimension

The VC dimension of a hypothesis set H, denoted by d_vc(H), is

the largest value of N for which m_H(N) = 2^N

"the most points H can shatter"

N ≤ d_vc(H)  =>  H can shatter N points
k > d_vc(H)  =>  k is a break point for H

Learning From Data - Lecture 7    3/24

The growth function

In terms of a break point k:          m_H(N) ≤ Σ_{i=0}^{k−1} C(N, i)

In terms of the VC dimension d_vc:    m_H(N) ≤ Σ_{i=0}^{d_vc} C(N, i)      maximum power is N^{d_vc}

Learning From Data - Lecture 7    4/24

Examples

H is positive rays:    d_vc = 1

H is 2D perceptrons:   d_vc = 3

H is convex sets:      d_vc = ∞

Learning From Data - Lecture 7    5/24

VC dimension and learning

d_vc(H) is finite  =>  g ∈ H will generalize

Independent of the learning algorithm
Independent of the input distribution
Independent of the target function

[Learning diagram: only the hypothesis set H and the training examples enter the guarantee.]

Learning From Data - Lecture 7    6/24

VC dimension of perceptrons

For d = 2, d_vc = 3

In general, d_vc = d + 1

We will prove two directions:

d_vc ≤ d + 1
d_vc ≥ d + 1

Learning From Data - Lecture 7    7/24

Here is one direction

A set of N = d + 1 points in R^d shattered by the perceptron:

X = [ x_1^T     ]   =   [ 1  0  0  ...  0 ]
    [ x_2^T     ]       [ 1  1  0  ...  0 ]
    [ x_3^T     ]       [ 1  0  1  ...  0 ]
    [   ...     ]       [ ...             ]
    [ x_{d+1}^T ]       [ 1  0  ...  0  1 ]

X is invertible

Learning From Data - Lecture 7    8/24

Can we shatter this data set?

For any y = (y_1, y_2, ..., y_{d+1}) with y_n = ±1, can we find a vector w satisfying

sign(Xw) = y

Easy! Just make Xw = y, which means

w = X^{-1} y

Learning From Data - Lecture 7    9/24

We can shatter these d + 1 points

This implies what?

[a] d_vc = d + 1
[b] d_vc ≥ d + 1
[c] d_vc ≤ d + 1
[d] No conclusion

Learning From Data - Lecture 7    10/24

Now, to show that d_vc ≤ d + 1

We need to show that:

[a] There are d + 1 points we cannot shatter
[b] There are d + 2 points we cannot shatter
[c] We cannot shatter any set of d + 1 points
[d] We cannot shatter any set of d + 2 points

Learning From Data - Lecture 7    11/24

Take any d + 2 points

For any d + 2 points  x_1, ..., x_{d+1}, x_{d+2}:

More points than dimensions  =>  we must have

x_j = Σ_{i ≠ j} a_i x_i

where not all the a_i's are zeros

Learning From Data - Lecture 7    12/24

So?

Consider the following dichotomy:

x_i's with non-zero a_i get y_i = sign(a_i), and x_j gets y_j = −1

No perceptron can implement such a dichotomy!

Learning From Data - Lecture 7    13/24

Why?

x_j = Σ_{i ≠ j} a_i x_i    =>    w^T x_j = Σ_{i ≠ j} a_i w^T x_i

If y_i = sign(w^T x_i) = sign(a_i), then a_i w^T x_i > 0

This forces  w^T x_j = Σ_{i ≠ j} a_i w^T x_i > 0

Therefore,  y_j = sign(w^T x_j) = +1

Learning From Data - Lecture 7    14/24

Putting it together

We proved  d_vc ≤ d + 1  and  d_vc ≥ d + 1:

d_vc = d + 1

What is d + 1 in the perceptron?  It is the number of parameters w_0, w_1, ..., w_d

Learning From Data - Lecture 7    15/24

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

16/24

1. Degrees of freedom

Parameters create degrees of freedom

# of parameters: analog degrees of freedom

d_vc: equivalent `binary' degrees of freedom

Learning From Data - Lecture 7    17/24

The usual suspects

Positive rays (d_vc = 1):      [Figure: one parameter, the threshold a.]

Positive intervals (d_vc = 2): [Figure: two parameters, the interval ends.]

Learning From Data - Lecture 7    18/24

Not just parameters

Parameters may not contribute degrees of freedom:

[Figure: a cascade of linear models; the intermediate parameters are redundant.]

d_vc measures the effective number of parameters

Learning From Data - Lecture 7    19/24

2. Number of data points needed

Two small quantities in the VC inequality:

P[ |E_in(g) − E_out(g)| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}  ≡  δ

If we want certain ε and δ, how does N depend on d_vc?

Let us look at N^d e^{−N}

Learning From Data - Lecture 7    20/24

N^d e^{−N}

Fix N^d e^{−N} = small value.  How does N change with d?

[Figure: N^30 e^{−N} versus N on a log scale; the required N grows essentially in proportion to d.]

Rule of thumb:

N ≥ 10 d_vc

Learning From Data - Lecture 7    21/24

Outline

The denition

VC dimension of per eptrons

Interpreting the VC dimension

Generalization bounds

Learning From Data - Le ture 7

22/24

Rearranging things

Start from the VC inequality:

P[ |E_out − E_in| > ε ] ≤ 4 m_H(2N) e^{−(1/8) ε² N}  ≡  δ

Get ε in terms of δ:

δ = 4 m_H(2N) e^{−(1/8) ε² N}    =>    ε = sqrt( (8/N) ln( 4 m_H(2N) / δ ) )  ≡  Ω

With probability ≥ 1 − δ,    |E_out − E_in| ≤ Ω(N, H, δ)

Learning From Data - Lecture 7    23/24

Generalization bound

With probability ≥ 1 − δ,    |E_out − E_in| ≤ Ω

With probability ≥ 1 − δ,    E_out ≤ E_in + Ω

Learning From Data - Lecture 7    24/24
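To get a feel for the bound, one can plug in numbers. A sketch that evaluates Ω for given N, d_vc and δ, using the polynomial bound (2N)^{d_vc} + 1 in place of m_H(2N) (a common simplification assumed here, not something the slides prescribe):

```python
import numpy as np

def vc_bound(N, d_vc, delta=0.05):
    """Omega = sqrt((8/N) ln(4 m_H(2N) / delta)), with m_H(2N) ~ (2N)^d_vc + 1."""
    m_H = (2.0 * N) ** d_vc + 1
    return np.sqrt(8.0 / N * np.log(4.0 * m_H / delta))

# the bound is loose for small N and only becomes informative for large N
for N in [100, 1000, 10000, 100000]:
    print(N, round(vc_bound(N, d_vc=3), 3))
```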

Review of Lecture 7

VC dimension d_vc(H): the most points H can shatter

Scope of VC analysis: independent of the learning algorithm, the input distribution, and the target function

Utility of VC dimension - rule of thumb:  N ≥ 10 d_vc

Generalization bound:  E_out ≤ E_in + Ω

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 8: Bias-Variance Tradeoff

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, April 26, 2012

Outline

Bias and Varian e

Learning Curves

Learning From Data - Le ture 8

2/22

Approximation-generalization tradeoff

Small E_out: good approximation of f out of sample.

More complex H  =>  better chance of approximating f

Less complex H  =>  better chance of generalizing out of sample

Ideal H = {f}    -  winning lottery ticket

Learning From Data - Lecture 8    3/22

Quantifying the tradeoff

VC analysis was one approach:  E_out ≤ E_in + Ω

Bias-variance analysis is another: decomposing E_out into

1. How well H can approximate f

2. How well we can zoom in on a good h ∈ H

Applies to real-valued targets and uses squared error

Learning From Data - Lecture 8    4/22

Start with E_out

E_out( g^(D) ) = E_x[ ( g^(D)(x) − f(x) )² ]

E_D[ E_out( g^(D) ) ] = E_D[ E_x[ ( g^(D)(x) − f(x) )² ] ]
                      = E_x[ E_D[ ( g^(D)(x) − f(x) )² ] ]

Now, let us focus on:  E_D[ ( g^(D)(x) − f(x) )² ]

Learning From Data - Lecture 8    5/22

The average hypothesis

To evaluate E_D[ ( g^(D)(x) − f(x) )² ], we define the `average' hypothesis g_bar(x):

g_bar(x) = E_D[ g^(D)(x) ]

Imagine many data sets D_1, D_2, ..., D_K:

g_bar(x) ≈ (1/K) Σ_{k=1}^{K} g^(D_k)(x)

Learning From Data - Lecture 8    6/22

Using g_bar(x)

E_D[ ( g^(D)(x) − f(x) )² ]
  = E_D[ ( g^(D)(x) − g_bar(x) + g_bar(x) − f(x) )² ]
  = E_D[ ( g^(D)(x) − g_bar(x) )² + ( g_bar(x) − f(x) )² + 2 ( g^(D)(x) − g_bar(x) )( g_bar(x) − f(x) ) ]
  = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²

Learning From Data - Lecture 8    7/22

Bias and variance

E_D[ ( g^(D)(x) − f(x) )² ] = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²
                                        var(x)                       bias(x)

Therefore,

E_D[ E_out( g^(D) ) ] = E_x[ E_D[ ( g^(D)(x) − f(x) )² ] ]
                      = E_x[ bias(x) + var(x) ]
                      = bias + var

Learning From Data - Lecture 8    8/22

The tradeoff

bias = E_x[ ( g_bar(x) − f(x) )² ]        var = E_x[ E_D[ ( g^(D)(x) − g_bar(x) )² ] ]

[Figure: a small H sits far from f (high bias, low var); a large H surrounds f (low bias, high var).]

Learning From Data - Lecture 8    9/22

Example: sine target

f : [−1, 1] → R        f(x) = sin(πx)

Only two training examples!   N = 2

Two models used for learning:

H0:  h(x) = b

H1:  h(x) = ax + b

Which is better, H0 or H1?

[Figure: one period of the sine target on [−1, 1].]

Learning From Data - Lecture 8    10/22

Approximation - H0 versus H1

[Figure: best approximations to sin(πx) on [−1, 1]: the best constant (H0) gives E_out = 0.50; the best line (H1) gives E_out = 0.20.]

Learning From Data - Lecture 8    11/22

Learning - H0 versus H1

[Figure: fits obtained from random two-example data sets: H0 fits a constant to the two points; H1 fits the line through them.]

Learning From Data - Lecture 8    12/22

Bias and variance - H0

[Figure: the average hypothesis g_bar(x) for H0, with a shaded one-standard-deviation band, against sin(πx).]

Learning From Data - Lecture 8    13/22

Bias and variance - H1

[Figure: g_bar(x) for H1, with a much wider band, against sin(πx).]

Learning From Data - Lecture 8    14/22

And the winner is ...

H0:  bias = 0.50,  var = 0.25          H1:  bias = 0.21,  var = 1.69

Learning From Data - Lecture 8    15/22
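The bias and variance values quoted above can be reproduced with a short Monte Carlo experiment. A sketch (the number of data sets and the test grid are arbitrary choices); its output should land near 0.50 / 0.25 for H0 and 0.21 / 1.69 for H1:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10_000                                    # number of two-point data sets
xs = rng.uniform(-1, 1, size=(K, 2))
ys = np.sin(np.pi * xs)
x_test = np.linspace(-1, 1, 201)
f_test = np.sin(np.pi * x_test)

def bias_var(g_test):
    """g_test: (K, len(x_test)) hypotheses evaluated on the test grid."""
    g_bar = g_test.mean(axis=0)
    return np.mean((g_bar - f_test) ** 2), np.mean((g_test - g_bar) ** 2)

# H0: h(x) = b, the constant that best fits the two points (their midpoint)
b0 = ys.mean(axis=1, keepdims=True)
print("H0 (bias, var):", bias_var(np.repeat(b0, len(x_test), axis=1)))

# H1: h(x) = ax + b, the line through the two points
a1 = (ys[:, 1] - ys[:, 0]) / (xs[:, 1] - xs[:, 0])
b1 = ys[:, 0] - a1 * xs[:, 0]
g1 = a1[:, None] * x_test[None, :] + b1[:, None]
print("H1 (bias, var):", bias_var(g1))
```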

Lesson learned

Match the `model complexity' to the data resources, not to the target complexity

Learning From Data - Lecture 8    16/22

Outline

Bias and Varian e

Learning Curves

Learning From Data - Le ture 8

17/22

Expected E_out and E_in

Data set D of size N

Expected out-of-sample error:  E_D[ E_out( g^(D) ) ]

Expected in-sample error:      E_D[ E_in( g^(D) ) ]

How do they vary with N?

Learning From Data - Lecture 8    18/22

The curves

[Figure: expected error versus number of data points N. Simple model: E_in and E_out converge quickly, to a higher error. Complex model: they converge more slowly, to a lower error.]

Learning From Data - Lecture 8    19/22

VC versus bias-variance

[Figure: the same learning curve split two ways. VC analysis: E_out = in-sample error + generalization error. Bias-variance: E_out = bias + variance.]

Learning From Data - Lecture 8    20/22

Linear regression case

Noisy target  y = w^T x + noise

Data set D = {(x1, y1), . . . , (xN, yN)}

Linear regression solution:  w = (X^T X)^{-1} X^T y

In-sample error vector        = Xw − y
`Out-of-sample' error vector  = Xw − y'

Learning From Data - Lecture 8    21/22

Learning curves for linear regression

Best approximation error      = σ²
Expected in-sample error      = σ² ( 1 − (d+1)/N )
Expected out-of-sample error  = σ² ( 1 + (d+1)/N )
Expected generalization error = 2 σ² (d+1)/N

[Figure: expected E_in and E_out versus N, approaching σ² from below and above for N > d + 1.]

Learning From Data - Lecture 8    22/22

Review of Lecture 8

Bias and variance

Expected value of E_out w.r.t. D:

E_D[ ( g^(D)(x) − f(x) )² ] decomposes into bias (how far g_bar is from f) and var (how much g^(D) varies around g_bar)

Learning curves

How E_in and E_out vary with the number of data points N

VC:  E_out = in-sample error + generalization error

B-V: E_out = bias + variance

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 9: The Linear Model II

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 1, 2012

Where we are

Learning From Data - Le ture 9

Linear lassi ation

Linear regression

Logisti regression

Nonlinear transforms

X
X
?

2/24

Nonlinear transforms

x = (x0, x1, ..., xd)   --Φ-->   z = (z0, z1, ..., z_d~)

Each z_i = φ_i(x),   z = Φ(x)

Example:  z = (1, x1, x2, x1 x2, x1², x2²)

Final hypothesis g(x) in X space:

sign( w~^T Φ(x) )    or    w~^T Φ(x)

Learning From Data - Lecture 9    3/24

The price we pay

x = (x0, x1, ..., xd)   --Φ-->   z = (z0, z1, ..., z_d~)

d_vc = d + 1            --Φ-->   d_vc ≤ d~ + 1

Learning From Data - Lecture 9    4/24

Two non-separable cases

[Figure, left: data that is almost linearly separable, with a few noisy points on the wrong side. Figure, right: data that is genuinely nonlinear (a circular pattern).]

Learning From Data - Lecture 9    5/24

First case

Use a linear model in X; accept E_in > 0

or

Insist on E_in = 0; go to a high-dimensional Z

Learning From Data - Lecture 9    6/24

Second case

z = (1, x1, x2, x1 x2, x1², x2²)

Why not:        z = (1, x1², x2²)

or better yet:  z = (1, x1² + x2²)

or even:        z = (x1² + x2² − 0.6)

Learning From Data - Lecture 9    7/24

Lesson learned

Looking at the data before choosing the model can be hazardous to your E_out

Data snooping

Learning From Data - Lecture 9    8/24

Logisti regression - Outline

The model

Error measure

Learning algorithm

Learning From Data - Le ture 9

9/24

A third linear model

s = Σ_{i=0}^{d} w_i x_i

linear classification:  h(x) = sign(s)
linear regression:      h(x) = s
logistic regression:    h(x) = θ(s)

[Figure: the three models share the signal s computed from x0, x1, ..., xd; they differ only in how s is mapped to the output.]

Learning From Data - Lecture 9    10/24

The logistic function θ

The formula:

θ(s) = e^s / (1 + e^s)

soft threshold: uncertainty

sigmoid: flattened-out `s'

[Figure: θ(s) rising from 0 to 1 as s goes from −∞ to +∞.]

Learning From Data - Lecture 9    11/24

Probability interpretation

h(x) = θ(s) is interpreted as a probability

Example. Prediction of heart attacks

Input x: cholesterol level, age, weight, etc.

θ(s): probability of a heart attack

The signal s = w^T x is a "risk score";   h(x) = θ(s)

Learning From Data - Lecture 9    12/24

Genuine probability

Data (x, y) with binary y, generated by a noisy target:

P(y | x) = f(x)        for y = +1;
           1 − f(x)    for y = −1.

The target f : R^d → [0, 1] is the probability

Learn  g(x) = θ(w^T x) ≈ f(x)

Learning From Data - Lecture 9    13/24

Error measure

For each (x, y), y is generated by probability f(x)

Plausible error measure based on likelihood:

If h = f, how likely to get y from x?

P(y | x) = h(x)        for y = +1;
           1 − h(x)    for y = −1.

Learning From Data - Lecture 9    14/24

Formula for likelihood

Substitute h(x) = θ(w^T x), noting θ(−s) = 1 − θ(s):

P(y | x) = θ(y w^T x)

Likelihood of D = (x1, y1), ..., (xN, yN) is

Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)

Learning From Data - Lecture 9    15/24

Maximizing the likelihood

Minimize  −(1/N) ln( Π_{n=1}^{N} θ(y_n w^T x_n) )  =  (1/N) Σ_{n=1}^{N} ln( 1 / θ(y_n w^T x_n) )

Using θ(s) = 1 / (1 + e^{−s}):

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} )
                            \______ e(h(x_n), y_n) ______/

"cross-entropy" error

Learning From Data - Lecture 9    16/24

Logisti regression - Outline

The model

Error measure

Learning algorithm

Learning From Data - Le ture 9

17/24

How to minimize E_in

For logistic regression,

E_in(w) = (1/N) Σ_{n=1}^{N} ln( 1 + e^{−y_n w^T x_n} )      ->  iterative solution

Compare to linear regression:

E_in(w) = (1/N) Σ_{n=1}^{N} ( w^T x_n − y_n )²              ->  closed-form solution

Learning From Data - Lecture 9    18/24

Iterative method: gradient descent

General method for nonlinear optimization

Start at w(0); take a step along the steepest slope

Fixed step size:  w(1) = w(0) + η v̂

What is the direction v̂?

[Figure: the in-sample error surface over the weights, descended step by step.]

Learning From Data - Lecture 9    19/24

Formula for the direction v̂

ΔE_in = E_in( w(0) + η v̂ ) − E_in( w(0) )
      = η ∇E_in( w(0) )^T v̂ + O(η²)
      ≥ −η || ∇E_in( w(0) ) ||

Since v̂ is a unit vector,

v̂ = − ∇E_in( w(0) ) / || ∇E_in( w(0) ) ||

Learning From Data - Lecture 9    20/24

Fixed-size step?

How η affects the algorithm:

[Figure: η too small - slow progress; η too large - unstable, overshooting; variable η - "just right".]

η should increase with the slope

Learning From Data - Lecture 9    21/24

Easy implementation

Instead of

Δw = η v̂ = −η ∇E_in( w(0) ) / || ∇E_in( w(0) ) ||

have

Δw = −η ∇E_in( w(0) )

Fixed learning rate η

Learning From Data - Lecture 9    22/24

Logistic regression algorithm

1: Initialize the weights at t = 0 to w(0)

2: for t = 0, 1, 2, ... do

3:    Compute the gradient

         ∇E_in = −(1/N) Σ_{n=1}^{N} y_n x_n / ( 1 + e^{y_n w(t)^T x_n} )

4:    Update the weights:  w(t + 1) = w(t) − η ∇E_in

5:    Iterate to the next step until it is time to stop

6: Return the final weights w

Learning From Data - Lecture 9    23/24
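A direct NumPy transcription of the algorithm, doing batch gradient descent on the cross-entropy error; the learning rate, the number of iterations, and the stopping rule (a fixed number of steps) are illustrative choices:

```python
import numpy as np

def logistic_regression(X, y, eta=0.1, n_steps=1000):
    """Batch gradient descent for logistic regression.

    X: (N, d) inputs (a column of ones is added for x0); y: (N,) labels in {-1, +1}.
    """
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        # gradient of E_in(w) = (1/N) sum ln(1 + exp(-y_n w^T x_n))
        grad = -(y[:, None] * Xb / (1 + np.exp(y * (Xb @ w)))[:, None]).mean(axis=0)
        w -= eta * grad
    return w
```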

Summary of Linear Models

Credit analysis tasks:

Approve or Deny         ->  Perceptron           Classification error    PLA, Pocket, ...
Amount of Credit        ->  Linear Regression    Squared error           Pseudo-inverse
Probability of Default  ->  Logistic Regression  Cross-entropy error     Gradient descent

Learning From Data - Lecture 9    24/24

Review of Lecture 9

Logistic regression

h(x) = θ(s),   s = w^T x

Likelihood measure

Π_{n=1}^{N} P(y_n | x_n) = Π_{n=1}^{N} θ(y_n w^T x_n)

Gradient descent

- Initialize w(0)
- For t = 0, 1, 2, ...    [to termination]
    w(t + 1) = w(t) − η ∇E_in( w(t) )
- Return final w

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 10: Neural Networks

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 3, 2012

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

2/21

Stochastic gradient descent

GD minimizes

E_in(w) = (1/N) Σ_{n=1}^{N} e( h(x_n), y_n )        [ = ln(1 + e^{−y_n w^T x_n}) in logistic regression ]

by iterative steps along −∇E_in:

Δw = −η ∇E_in(w)

∇E_in is based on all examples (x_n, y_n):  "batch" GD

Learning From Data - Lecture 10    3/21

The stochastic aspect

Pick one (x_n, y_n) at a time. Apply GD to e( h(x_n), y_n ).

Average direction:

E_n[ −∇ e( h(x_n), y_n ) ] = −(1/N) Σ_{n=1}^{N} ∇ e( h(x_n), y_n ) = −∇E_in

randomized version of GD:  "stochastic gradient descent" (SGD)

Learning From Data - Lecture 10    4/21

Benefits of SGD

1. cheaper computation
2. randomization
3. simple

Rule of thumb:  η = 0.1 works

[Figure: randomization helps the descent escape shallow local minima and flat regions.]

Learning From Data - Lecture 10    5/21
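For logistic regression, turning the batch algorithm of Lecture 9 into SGD only changes which gradient is used per update. A sketch (the epoch count is arbitrary; η = 0.1 follows the rule of thumb above):

```python
import numpy as np

def logistic_sgd(X, y, eta=0.1, n_epochs=100, seed=0):
    """SGD for logistic regression: one randomly picked example per update."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(N):
            # gradient of e(h(x_n), y_n) = ln(1 + exp(-y_n w^T x_n))
            grad_n = -y[n] * Xb[n] / (1 + np.exp(y[n] * (Xb[n] @ w)))
            w -= eta * grad_n
    return w
```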

SGD in action

Remember movie ratings?

e_ij = ( r_ij − Σ_{k=1}^{K} u_ik v_jk )²

[Figure: user factors u_i1, ..., u_iK and movie factors v_j1, ..., v_jK combine to give the rating r_ij; SGD updates one (user, movie) rating at a time.]

Learning From Data - Lecture 10    6/21

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

7/21

Biological inspiration

biological function       biological structure

[Figure: a biological neuron alongside a network of neurons.]

Learning From Data - Lecture 10    8/21

Combining perceptrons

[Figure: a target region built from two perceptrons h1 and h2; OR(x1, x2) and AND(x1, x2) are themselves perceptrons, e.g. with weights (1.5, 1, 1) and (−1.5, 1, 1) on (1, x1, x2).]

Learning From Data - Lecture 10    9/21

Creating layers

[Figure: combinations of h1, h2 and their negations are built layer by layer out of the OR and AND perceptrons.]

Learning From Data - Lecture 10    10/21

The multilayer perceptron

[Figure: a 3-layer feedforward network of perceptrons implementing the target region from x1, x2.]

3 layers      "feedforward"

Learning From Data - Lecture 10    11/21

A powerful model

[Figure: a circular target region approximated with 8 perceptrons and with 16 perceptrons.]

2 red flags: generalization and optimization

Learning From Data - Lecture 10    12/21

The neural network

[Figure: input x = (x1, ..., xd), hidden layers 1 ≤ l < L of soft-threshold units θ(s), and output layer l = L producing h(x).]

Learning From Data - Lecture 10    13/21

How the network operates

w_ij^(l):   1 ≤ l ≤ L          layers
            0 ≤ i ≤ d^(l−1)    inputs
            1 ≤ j ≤ d^(l)      outputs

x_j^(l) = θ( s_j^(l) ) = θ( Σ_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )

Apply x to x_1^(0), ..., x_{d^(0)}^(0)    ->    x_1^(L) = h(x)

θ(s) = tanh(s) = (e^s − e^{−s}) / (e^s + e^{−s})

[Figure: tanh interpolates between a linear function near 0 and a hard threshold for large |s|.]

Learning From Data - Lecture 10    14/21

Outline

Sto hasti gradient des ent

Neural network model

Ba kpropagation algorithm

Learning From Data - Le ture 10

15/21

Applying SGD

All the weights w = { w_ij^(l) } determine h(x)

Error on example (x_n, y_n) is

e( h(x_n), y_n ) = e(w)

To implement SGD, we need the gradient ∇e(w):

∂e(w) / ∂w_ij^(l)    for all i, j, l

Learning From Data - Lecture 10    16/21

Computing ∂e(w) / ∂w_ij^(l)

We can evaluate ∂e(w) / ∂w_ij^(l) one by one: analytically or numerically

A trick for efficient computation:

∂e(w) / ∂w_ij^(l) = ( ∂e(w) / ∂s_j^(l) ) × ( ∂s_j^(l) / ∂w_ij^(l) )

We have  ∂s_j^(l) / ∂w_ij^(l) = x_i^(l−1)

We only need:  ∂e(w) / ∂s_j^(l) = δ_j^(l)

Learning From Data - Lecture 10    17/21

δ for the final layer

δ_j^(l) = ∂e(w) / ∂s_j^(l)

For the final layer, l = L and j = 1:

δ_1^(L) = ∂e(w) / ∂s_1^(L),     e(w) = ( x_1^(L) − y_n )²,     x_1^(L) = θ( s_1^(L) )

θ'(s) = 1 − θ²(s)    for the tanh

Learning From Data - Lecture 10    18/21

Back propagation of δ

δ_i^(l−1) = ∂e(w) / ∂s_i^(l−1)

          = Σ_{j=1}^{d^(l)} ( ∂e(w) / ∂s_j^(l) ) × ( ∂s_j^(l) / ∂x_i^(l−1) ) × ( ∂x_i^(l−1) / ∂s_i^(l−1) )

          = Σ_{j=1}^{d^(l)} δ_j^(l) × w_ij^(l) × θ'( s_i^(l−1) )

δ_i^(l−1) = ( 1 − ( x_i^(l−1) )² ) Σ_{j=1}^{d^(l)} w_ij^(l) δ_j^(l)

Learning From Data - Lecture 10    19/21

Backpropagation algorithm

1: Initialize all weights w_ij^(l) at random
2: for t = 0, 1, 2, ... do
3:    Pick n ∈ {1, 2, ..., N}
4:    Forward: Compute all x_j^(l)
5:    Backward: Compute all δ_j^(l)
6:    Update the weights:  w_ij^(l) ← w_ij^(l) − η x_i^(l−1) δ_j^(l)
7:    Iterate to the next step until it is time to stop
8: Return the final weights w_ij^(l)

Learning From Data - Lecture 10    20/21
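A compact NumPy sketch of the algorithm above, for fully connected tanh layers with squared error at a single output. The architecture, learning rate, initialization scale, and the XOR-style toy data are illustrative choices, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(layer_sizes):
    """One weight matrix per layer l, shape (d^(l-1)+1, d^(l)); row 0 is the bias."""
    return [rng.normal(scale=0.5, size=(d_in + 1, d_out))
            for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(W, x):
    """Return the activations x^(0), ..., x^(L), each with the bias unit x0 = 1 prepended."""
    xs = [np.concatenate(([1.0], x))]
    for l, Wl in enumerate(W):
        s = xs[-1] @ Wl                       # s_j^(l) = sum_i w_ij^(l) x_i^(l-1)
        x_l = np.tanh(s)
        if l < len(W) - 1:                    # hidden layers keep a bias unit
            x_l = np.concatenate(([1.0], x_l))
        xs.append(x_l)
    return xs

def backprop_step(W, x, y, eta=0.1):
    """One SGD update on example (x, y) with squared error (x_1^(L) - y)^2."""
    xs = forward(W, x)
    # final-layer delta: 2 (x^(L) - y) * theta'(s^(L)), with theta' = 1 - theta^2
    delta = 2 * (xs[-1] - y) * (1 - xs[-1] ** 2)
    for l in reversed(range(len(W))):
        x_prev = xs[l]
        grad = np.outer(x_prev, delta)        # de/dw_ij^(l) = x_i^(l-1) delta_j^(l)
        # propagate delta to layer l-1 before updating (drop the bias row, index 0)
        delta = (1 - x_prev[1:] ** 2) * (W[l][1:] @ delta)
        W[l] -= eta * grad

# illustrative usage: learn XOR-like data with one hidden layer of 3 tanh units
W = init_weights([2, 3, 1])
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
Y = np.array([[-1], [1], [1], [-1]], dtype=float)
for epoch in range(2000):
    for n in rng.permutation(4):
        backprop_step(W, X[n], Y[n])
# outputs should be near the XOR targets [-1, 1, 1, -1] (a different seed may occasionally be needed)
print([round(forward(W, x)[-1].item(), 2) for x in X])
```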

Final remark: hidden layers

learned nonlinear transform

interpretation?

[Figure: the hidden layers act as a learned feature transform applied to x before the output layer.]

Learning From Data - Lecture 10    21/21

Review of Lecture 10

Multilayer perceptrons

[Figure: logical combinations of perceptrons build complex regions.]

Neural networks

x_j^(l) = θ( Σ_{i=0}^{d^(l−1)} w_ij^(l) x_i^(l−1) )      where θ(s) = tanh(s)

Backpropagation

Δw_ij^(l) = −η x_i^(l−1) δ_j^(l)

where  δ_i^(l−1) = ( 1 − ( x_i^(l−1) )² ) Σ_{j=1}^{d^(l)} w_ij^(l) δ_j^(l)

Learning From Data

Yaser S. Abu-Mostafa

California Institute of Technology

Lecture 11: Overfitting

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 8, 2012

Outline

What is overfitting?

The role of noise

Deterministic noise

Dealing with overfitting

Learning From Data - Lecture 11    2/23

Illustration of overfitting

Simple target function

5 data points - noisy

4th-order polynomial fit

E_in = 0, E_out is huge

[Figure: the noisy data, the target, and the wildly oscillating 4th-order fit.]

Learning From Data - Lecture 11    3/23

Overfitting versus bad generalization

Neural network fitting noisy data

Overfitting:  E_in goes down, E_out goes up

[Figure: E_in and E_out versus training epochs; E_out starts rising while E_in keeps falling - "early stopping" marks where to stop.]

Learning From Data - Lecture 11    4/23

The culprit

Overfitting: fitting the data more than is warranted

Culprit: fitting the noise - harmful

Learning From Data - Lecture 11    5/23

Case study

[Figure, left: data from a 10th-order target plus noise. Figure, right: data from a noiseless 50th-order target.]

Learning From Data - Lecture 11    6/23

Two fits for each target

Noisy low-order target:                    Noiseless high-order target:

          2nd Order   10th Order                     2nd Order   10th Order
E_in        0.050       0.034              E_in        0.029       10^{−5}
E_out       0.127       9.00               E_out       0.120       7680

[Figure: the 2nd-order and 10th-order polynomial fits to each data set.]

Learning From Data - Lecture 11    7/23

An irony of two learners

Two learners, O and R

They know the target is 10th order

O chooses H_10;  R chooses H_2

[Figure: learning a 10th-order target - the 2nd-order fit generalizes better than the 10th-order fit.]

Learning From Data - Lecture 11    8/23

We have seen this case

Remember learning curves?

[Figure: expected E_in and E_out versus N for H_2 and H_10; for small N, H_10 has the lower E_in but the much higher E_out.]

Learning From Data - Lecture 11    9/23

Even without noise

The two learners know there is no noise

Is there really no noise?

[Figure: learning a 50th-order target with H_2 and H_10; the 2nd-order fit again wins out of sample.]

Learning From Data - Lecture 11    10/23

A detailed experiment

Impact of noise level and target complexity

y = f(x) + σ ε(x),      f(x) = Σ_{q=0}^{Q_f} α_q x^q      (normalized)

noise level:        σ²
target complexity:  Q_f
data set size:      N

Learning From Data - Lecture 11    11/23

The overfit measure

We fit the data set (x1, y1), ..., (xN, yN) using our two models:

H_2:   2nd-order polynomials

H_10:  10th-order polynomials

Compare the out-of-sample errors of g_2 ∈ H_2 and g_10 ∈ H_10

overfit measure:  E_out(g_10) − E_out(g_2)

Learning From Data - Lecture 11    12/23

The results

[Figure: the overfit measure E_out(g_10) − E_out(g_2) as color maps. Left: noise level σ² versus number of data points N. Right: target complexity Q_f versus N.]

Learning From Data - Lecture 11    13/23

Impact of "noise"

[Figure: the same two maps.]

stochastic noise:      the σ² versus N map

deterministic noise:   the Q_f versus N map

Learning From Data - Lecture 11    14/23

Outline

What is overtting?

The role of noise

Deterministi noise

Dealing with overtting

Learning From Data - Le ture 11

15/23

Definition of deterministic noise

The part of f that H cannot capture:  f(x) − h*(x)

Why "noise"?

Main differences with stochastic noise:

1. depends on H

2. fixed for a given x

[Figure: a complex target f and the best hypothesis h* in H; the gap between them is the deterministic noise.]

Learning From Data - Lecture 11    16/23

Deterministic noise and Q_f

Finite N: H tries to fit the noise

[Figure: the overfit measure versus target complexity Q_f and N - more deterministic noise, more overfitting.]

Learning From Data - Lecture 11    17/23

Noise and bias-variance

Recall the decomposition:

E_D[ ( g^(D)(x) − f(x) )² ] = E_D[ ( g^(D)(x) − g_bar(x) )² ] + ( g_bar(x) − f(x) )²
                                        var(x)                       bias(x)

What if f is a noisy target?

y = f(x) + ε(x),     E[ ε(x) ] = 0

Learning From Data - Lecture 11    18/23

A noise term

E_{D,ε}[ ( g^(D)(x) − y )² ] = E_{D,ε}[ ( g^(D)(x) − f(x) − ε(x) )² ]

  = E_{D,ε}[ ( g^(D)(x) − g_bar(x) + g_bar(x) − f(x) − ε(x) )² ]

  = E_{D,ε}[ ( g^(D)(x) − g_bar(x) )² + ( g_bar(x) − f(x) )² + ( ε(x) )² + cross terms ]

Learning From Data - Lecture 11    19/23

Actually, two noise terms

E_{D,x}[ ( g^(D)(x) − g_bar(x) )² ]  +  E_x[ ( g_bar(x) − f(x) )² ]  +  E_{ε,x}[ ( ε(x) )² ]
              var                        bias = "deterministic noise"    "stochastic noise"

Learning From Data - Lecture 11    20/23

Outline

What is overfitting?
The role of noise
Deterministic noise
Dealing with overfitting

Learning From Data - Lecture 11    21/23

Two cures

Regularization:   Putting the brakes

Validation:       Checking the bottom line

Learning From Data - Lecture 11    22/23

Putting the brakes

[Figure: fitting the same data and target; a free fit (left) overshoots wildly, a restrained fit (right) stays close to the target]

  free fit                               restrained fit

Learning From Data - Lecture 11    23/23

Review of Lecture 11

Overfitting:  fitting the data more than is warranted
  VC allows it; doesn't predict it
  Fitting the noise, stochastic/deterministic

Deterministic noise
  [Figure: f(x) and the best fit in H; the gap is the deterministic noise]
  [Figure: the overfit measure versus the number of data points N and the target complexity Q_f]

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 12: Regularization

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 10, 2012

Outline

Regularization - informal
Regularization - formal
Weight decay
Choosing a regularizer

Learning From Data - Lecture 12    2/21

Two approaches to regularization

Mathematical:  ill-posed problems in function approximation

Heuristic:     handicapping the minimization of E_in

Learning From Data - Lecture 12    3/21

A familiar example

[Figure: fitting two data points from a sine-like target, without regularization (left) and with regularization (right)]

  without regularization                 with regularization

Learning From Data - Lecture 12    4/21

and the winner is ...

[Figure: ḡ(x) and sin(x) with the spread of random fits, without regularization (left) and with regularization (right)]

  without regularization:   bias = 0.21,   var = 1.69
  with regularization:      bias = 0.23,   var = 0.33

Learning From Data - Lecture 12    5/21

The polynomial model

H_Q:  polynomials of order Q

  linear regression in Z space,    z = (1, L1(x), …, LQ(x))

  H_Q = { Σ_{q=0}^{Q} w_q L_q(x) }

Legendre polynomials:

  L1 = x
  L2 = ½(3x² − 1)
  L3 = ½(5x³ − 3x)
  L4 = ⅛(35x⁴ − 30x² + 3)
  L5 = ⅛(63x⁵ − 70x³ + 15x)

Learning From Data - Lecture 12    6/21
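A minimal sketch (my own illustration) of building the Legendre feature vector z = (L_0(x), …, L_Q(x)) with numpy; legvander returns the matrix whose columns are the Legendre polynomials evaluated at x.

import numpy as np

Q = 5
x = np.linspace(-1, 1, 7)
Z = np.polynomial.legendre.legvander(x, Q)   # shape (7, Q+1); column q is L_q(x)
print(Z.shape)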

Unconstrained solution

Given (x1, y1), …, (xN, yN)   →   (z1, y1), …, (zN, yN)

Minimize   E_in(w) = (1/N) Σ_{n=1}^{N} (wᵀz_n − y_n)²

Minimize   (1/N) (Zw − y)ᵀ(Zw − y)

  ⇒   w_lin = (ZᵀZ)⁻¹ Zᵀ y

Learning From Data - Lecture 12    7/21

Constraining the weights

Hard constraint:   H2 is a constrained version of H10, with w_q = 0 for q > 2

Softer version:    Σ_{q=0}^{Q} w_q² ≤ C         "soft-order" constraint

Minimize   (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

Solution:  w_reg  instead of  w_lin

Learning From Data - Lecture 12    8/21

Solving for w_reg

Minimize   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

[Figure: contours of E_in = const., with w_lin outside the constraint disk wᵀw = C; at the solution w_reg the gradient ∇E_in is normal to the constraint]

  ∇E_in(w_reg) = −2 (λ/N) w_reg      ⇒      ∇E_in(w_reg) + 2 (λ/N) w_reg = 0

Minimize   E_in(w) + (λ/N) wᵀw

Learning From Data - Lecture 12    9/21

Augmented error

Minimizing   E_aug(w) = E_in(w) + (λ/N) wᵀw
                      = (1/N) (Zw − y)ᵀ(Zw − y) + (λ/N) wᵀw        unconditionally

solves

Minimizing   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C        ← VC formulation

Learning From Data - Lecture 12    10/21

The solution

Minimize   E_aug(w) = E_in(w) + (λ/N) wᵀw
                    = (1/N) [ (Zw − y)ᵀ(Zw − y) + λ wᵀw ]

  ∇E_aug(w) = 0    ⇒    Zᵀ(Zw − y) + λ w = 0

  w_reg = (ZᵀZ + λI)⁻¹ Zᵀ y         (with regularization)

as opposed to

  w_lin = (ZᵀZ)⁻¹ Zᵀ y              (without regularization)

Learning From Data - Lecture 12    11/21
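A minimal sketch of the two formulas above (assuming the Legendre features Z from the earlier sketch; the data and the value of λ are arbitrary):

import numpy as np

def fit(Z, y, lam=0.0):
    # lam = 0 gives w_lin; lam > 0 gives w_reg = (Z^T Z + lam I)^{-1} Z^T y
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 15)
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(15)
Z = np.polynomial.legendre.legvander(x, 10)

w_lin = fit(Z, y)             # fits the noise
w_reg = fit(Z, y, lam=0.1)    # "puts the brakes" on the weights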

The result

Minimizing   E_in(w) + (λ/N) wᵀw   for different λ's:

[Figure: fits of the same data set for λ = 0, λ = 0.0001, λ = 0.01, and λ = 1]

  λ = 0:  overfitting        →        λ = 1:  underfitting

Learning From Data - Lecture 12    12/21

Weight 'decay'

Minimizing   E_in(w) + (λ/N) wᵀw   is called weight decay.  Why?

Gradient descent:

  w(t+1) = w(t) − η ∇E_in(w(t)) − 2 η (λ/N) w(t)
         = w(t) (1 − 2 η λ/N) − η ∇E_in(w(t))

Applies in neural networks:

  wᵀw = Σ_{l=1}^{L} Σ_{i=0}^{d^(l−1)} Σ_{j=1}^{d^(l)} ( w_ij^(l) )²

Learning From Data - Lecture 12    13/21
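A minimal sketch of one weight-decay step as written above (the names eta and lam are my own; any differentiable E_in works the same way):

import numpy as np

def grad_Ein(w, Z, y):
    # gradient of (1/N) ||Zw - y||^2
    return 2.0 / len(y) * Z.T @ (Z @ w - y)

def weight_decay_step(w, Z, y, eta=0.1, lam=0.01):
    N = len(y)
    # shrink ("decay") the weights, then take the usual gradient step
    return w * (1 - 2 * eta * lam / N) - eta * grad_Ein(w, Z, y)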

Variations of weight decay

Emphasis of certain weights:    Σ_{q=0}^{Q} γ_q w_q²

Examples:   γ_q = 2^q    ⇒  low-order fit
            γ_q = 2^(−q) ⇒  high-order fit

Neural networks: different layers get different γ's

Tikhonov regularizer:    wᵀ Γᵀ Γ w

Learning From Data - Lecture 12    14/21

Even weight growth!

We 'constrain' the weights to be large - bad!

[Figure: expected E_out versus the regularization parameter λ, for weight decay and for weight growth]

Practical rule:
  stochastic noise is 'high-frequency'
  deterministic noise is also non-smooth

  ⇒ constrain learning towards smoother hypotheses

Learning From Data - Lecture 12    15/21

General form of augmented error

Calling the regularizer Ω = Ω(h), we minimize

  E_aug(h) = E_in(h) + (λ/N) Ω(h)

Rings a bell?       E_out(h) ≤ E_in(h) + Ω(H)

E_aug is better than E_in as a proxy for E_out

Learning From Data - Lecture 12    16/21

Outline

Regularization - informal
Regularization - formal
Weight decay
Choosing a regularizer

Learning From Data - Lecture 12    17/21

The perfect regularizer Ω

Constraint in the 'direction' of the target function    (going in circles ...)

Guiding principle:  direction of smoother or simpler

Chose a bad Ω?  We still have λ!

  ⇒  regularization is a necessary evil

Learning From Data - Lecture 12    18/21

Neural-network regularizers

Weight decay: from linear to logical

[Figure: the tanh transfer function; small weights keep it in the linear regime, large weights approach a hard threshold]

Weight elimination:  fewer weights  =  smaller VC dimension

Soft weight elimination:

  Ω(w) = Σ_{i,j,l}  ( w_ij^(l) )² / ( β² + ( w_ij^(l) )² )

Learning From Data - Lecture 12    19/21

Early stopping as a regularizer

Regularization through the optimizer!

When to stop?  validation

[Figure: E_in and E_out versus training epochs; E_out bottoms out early (the early-stopping point) while E_in keeps decreasing]

Learning From Data - Lecture 12    20/21
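A minimal sketch of early stopping (my own illustration, reusing grad_Ein from the weight-decay sketch above): keep the weights that achieved the lowest validation error during gradient descent.

import numpy as np

def train_with_early_stopping(w, Z_tr, y_tr, Z_val, y_val, eta=0.1, epochs=5000):
    best_w, best_val = w.copy(), np.inf
    for _ in range(epochs):
        w = w - eta * grad_Ein(w, Z_tr, y_tr)
        val_err = np.mean((Z_val @ w - y_val) ** 2)   # validation estimate of E_out
        if val_err < best_val:
            best_val, best_w = val_err, w.copy()
    return best_w                                     # weights at the early-stopping point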

The optimal λ

[Figure: expected E_out versus the regularization parameter λ.
 Left, stochastic noise: curves for σ² = 0, 0.25, 0.5.
 Right, deterministic noise: curves for Q_f = 15, 30, 100.]

Learning From Data - Lecture 12    21/21

Review of Lecture 12

Regularization:  constrained  →  unconstrained

  Minimize   E_aug(w) = E_in(w) + (λ/N) wᵀw

  [Figure: E_in contours with the constraint wᵀw = C; fits for λ = 0.0001 and λ = 1.0]

Choosing a regularizer Ω(h):
  heuristic:  smooth, simple
  most used:  weight decay
  λ:  principled choice by validation

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 13: Validation

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 15, 2012

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    2/22

Validation versus regularization

In one form or another,     E_out(h) = E_in(h) + overfit penalty

Regularization:
  E_out(h) = E_in(h) + overfit penalty         ← regularization estimates this quantity

Validation:
  E_out(h) = E_in(h) + overfit penalty         ← validation estimates E_out(h) itself

Learning From Data - Lecture 13    3/22

Analyzing the estimate

On an out-of-sample point (x, y), the error is  e(h(x), y)

  Squared error:   ( h(x) − y )²
  Binary error:    [[ h(x) ≠ y ]]

  E[ e(h(x), y) ]   = E_out(h)
  var[ e(h(x), y) ] = σ²

Learning From Data - Lecture 13    4/22

From a point to a set

On a validation set (x1, y1), …, (xK, yK), the error is

  E_val(h) = (1/K) Σ_{k=1}^{K} e( h(x_k), y_k )

  E[ E_val(h) ]   = (1/K) Σ_{k=1}^{K} E[ e( h(x_k), y_k ) ]    = E_out(h)

  var[ E_val(h) ] = (1/K²) Σ_{k=1}^{K} var[ e( h(x_k), y_k ) ] = σ²/K

  E_val(h) = E_out(h) ± O(1/√K)

Learning From Data - Lecture 13    5/22
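A minimal sketch of the validation estimate (illustrative): hold out K points, evaluate the hypothesis on them, and use the mean error as a proxy for E_out; its standard deviation shrinks like 1/√K.

import numpy as np

def validation_error(h, x_val, y_val):
    # E_val(h) = (1/K) * sum of pointwise squared errors
    return np.mean((h(x_val) - y_val) ** 2)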

K is taken out of N

Given the data set  D = (x1, y1), …, (xN, yN):

  K points       →  validation set D_val
  N − K points   →  training set D_train

  Small K:  bad estimate           Large K:  ?

[Figure: learning curve of the expected error (E_in and E_out) versus the number of data points; training on N − K points moves us back along the curve]

Learning From Data - Lecture 13    6/22

K is put back into N

  D  →  D_train ∪ D_val

  D (N points)           →  g
  D_train (N − K points) →  g⁻
  D_val (K points)       →  E_val = E_val(g⁻)

Large K   ⇒   bad estimate!

Rule of thumb:     K = N/5

Learning From Data - Lecture 13    7/22

Why 'validation'

D_val is used to make learning choices

If an estimate of E_out affects learning:

  the set is no longer a test set;  it becomes a validation set

[Figure: E_in and E_out versus epochs, with early stopping chosen at the minimum of the validation estimate]

Learning From Data - Lecture 13    8/22

What's the difference?

Test set is unbiased; validation set has optimistic bias

Two hypotheses h1 and h2 with  E_out(h1) = E_out(h2) = 0.5

Error estimates e1 and e2, uniform on [0, 1]

Pick  h ∈ {h1, h2}  with  e = min(e1, e2)

  E[e] < 0.5     ⇒   optimistic bias

Learning From Data - Lecture 13    9/22

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    10/22

Using D_val more than once

M models:  H1, …, HM

Use D_train to learn g_m⁻ for each model

Evaluate g_m⁻ using D_val:

  E_m = E_val(g_m⁻),     m = 1, …, M

Pick the model m* with the smallest E_m

[Diagram: D_train trains g_1⁻, …, g_M⁻; D_val gives E_1, …, E_M; the best (H_m*, E_m*) is picked and retrained on all of D to give g_m*]

Learning From Data - Lecture 13    11/22

The bias

We selected the model H_m* using D_val

  ⇒  E_val(g_m*⁻) is a biased estimate of E_out(g_m*⁻)

Illustration: selecting between 2 models

[Figure: expected E_out(g_m*⁻) and E_val(g_m*⁻) versus the validation set size K; the validation estimate is optimistic]

Learning From Data - Lecture 13    12/22

How much bias

For M models H1, …, HM,  D_val is used for "training" on the finalists model:

  H_val = { g1⁻, g2⁻, …, gM⁻ }

Back to Hoeffding and VC!

  E_out(g_m*⁻)  ≤  E_val(g_m*⁻) + O( √( ln M / K ) )

  (used for choices such as regularization and early stopping)

Learning From Data - Lecture 13    13/22

Data contamination

Error estimates:  E_in, E_test, E_val

Contamination:  optimistic (deceptive) bias in estimating E_out

  Training set:    totally contaminated
  Validation set:  slightly contaminated
  Test set:        totally 'clean'

Learning From Data - Lecture 13    14/22

Outline

The validation set
Model selection
Cross validation

Learning From Data - Lecture 13    15/22

The dilemma about K

The following chain of reasoning:

  E_out(g)  ≈  E_out(g⁻)  ≈  E_val(g⁻)
          (small K)     (large K)

highlights the dilemma in selecting K:

  Can we have K both small and large?

Learning From Data - Lecture 13    16/22

Leave one out

N − 1 points for training, and 1 point for validation!

  D_n = (x1, y1), …, (x_{n−1}, y_{n−1}), (x_{n+1}, y_{n+1}), …, (xN, yN)

Final hypothesis learned from D_n is g_n⁻

  e_n = E_val(g_n⁻) = e( g_n⁻(x_n), y_n )

cross validation error:     E_cv = (1/N) Σ_{n=1}^{N} e_n

Learning From Data - Lecture 13    17/22
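A minimal sketch of leave-one-out cross validation for a polynomial model (illustrative; plain least squares via np.polyfit):

import numpy as np

def loocv_error(x, y, order):
    N = len(x)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n                        # leave point n out
        g_n = np.polyfit(x[mask], y[mask], order)       # g_n^- trained on N-1 points
        errors.append((np.polyval(g_n, x[n]) - y[n]) ** 2)
    return np.mean(errors)                              # E_cv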

Illustration of cross validation

[Figure: three data points; a line is fit to two of them while the third is left out, giving the errors e1, e2, e3 in turn]

  E_cv = (1/3) ( e1 + e2 + e3 )

Learning From Data - Lecture 13    18/22

Model selection using CV

[Figure: the same three points fit with a constant model (top row) and a linear model (bottom row); each panel shows the leave-one-out errors e1, e2, e3]

  Constant:                              Linear:

Learning From Data - Lecture 13    19/22

Cross validation in action

Digits classification task ('1' versus 'not 1'), with average intensity and symmetry as inputs

Nonlinear transform:

  (1, x1, x2)  →  (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, …, x1⁵, x1⁴x2, x1³x2², x1²x2³, x1x2⁴, x2⁵)

[Figure left: the two classes in the (average intensity, symmetry) plane.
 Figure right: E_in, E_cv and E_out versus the number of features used (up to 20).]

Learning From Data - Lecture 13    20/22

The result

[Figure: decision boundaries in the (average intensity, symmetry) plane, without validation (left) and with validation (right)]

  without validation:   E_in = 0%,      E_out = 2.5%
  with validation:      E_in = 0.8%,    E_out = 1.5%

Learning From Data - Lecture 13    21/22

Leave more than one out

Leave one out:  N training sessions on N − 1 points each

More points for validation?

  D = D1 D2 D3 D4 D5 D6 D7 D8 D9 D10      (train on nine of the parts, validate on the remaining one)

  N/K training sessions on N − K points each

10-fold cross validation:     K = N/10

Learning From Data - Lecture 13    22/22
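A minimal sketch of 10-fold cross validation (illustrative): partition the indices into 10 chunks, train on 9 and validate on the held-out chunk, then average.

import numpy as np

def kfold_cv_error(x, y, order, folds=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    errors = []
    for val_idx in np.array_split(idx, folds):
        train_idx = np.setdiff1d(idx, val_idx)
        g = np.polyfit(x[train_idx], y[train_idx], order)
        errors.append(np.mean((np.polyval(g, x[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)          # estimate of E_out, used for model selection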

Review of Lecture 13

Validation
  D (N)  →  D_train (N − K)  and  D_val (K)
  E_val(g⁻) estimates E_out(g)

  [Figure: expected E_out(g_m*⁻) and E_val(g_m*⁻) versus the validation set size K]

Data contamination:  the validation set is slightly contaminated

Cross validation
  D = D1 D2 … D10:  train on nine parts, validate on the remaining one
  10-fold cross validation

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 14: Support Vector Machines

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 17, 2012

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    2/20

Better linear separation

Linearly separable data

Different separating lines

Which is best?

[Figure: the same separable data set with several separating lines and their margins]

Two questions:

1. Why is bigger margin better?
2. Which w maximizes the margin?

Learning From Data - Lecture 14    3/20

Remember the growth function?

All dichotomies with any line:

[Figure: the dichotomies that arbitrary lines can generate on a small set of points]

Learning From Data - Lecture 14    4/20

Dichotomies with fat margin

Fat margins imply fewer dichotomies

[Figure: the dichotomies achievable when a fat margin is required; margin values shown: ∞, 0.866, 0.5, 0.397]

Learning From Data - Lecture 14    5/20

Finding w with large margin

Let x_n be the nearest data point to the plane  wᵀx = 0.   How far is it?

2 preliminary technicalities:

1. Normalize w:      |wᵀx_n| = 1

2. Pull out w0:      w = (w1, …, wd), apart from b
                     The plane is now  wᵀx + b = 0     (no x0)

Learning From Data - Lecture 14    6/20

Computing the distance

The distance between x_n and the plane  wᵀx + b = 0,  where  |wᵀx_n + b| = 1

The vector w is ⊥ to the plane in the X space:

  Take x′ and x″ on the plane:    wᵀx′ + b = 0  and  wᵀx″ + b = 0

    ⇒    wᵀ(x′ − x″) = 0

[Figure: the plane, two points x′ and x″ on it, the normal vector w, and the point x_n]

Learning From Data - Lecture 14    7/20

... and the distance is

Distance between x_n and the plane:

  Take any point x on the plane

  Projection of  x_n − x  on  ŵ = w/‖w‖ :

  distance = | ŵᵀ(x_n − x) | = (1/‖w‖) | wᵀx_n − wᵀx | = (1/‖w‖) | wᵀx_n + b − wᵀx − b | = 1/‖w‖

Learning From Data - Lecture 14    8/20

The optimization problem

Maximize     1/‖w‖      subject to     min_{n=1,2,…,N} |wᵀx_n + b| = 1

Notice:    |wᵀx_n + b| = y_n (wᵀx_n + b)

Minimize     ½ wᵀw      subject to     y_n (wᵀx_n + b) ≥ 1     for n = 1, 2, …, N

Learning From Data - Lecture 14    9/20

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    10/20

Constrained optimization

Minimize     ½ wᵀw      subject to     y_n (wᵀx_n + b) ≥ 1     for n = 1, 2, …, N

  w ∈ R^d,  b ∈ R

Lagrange?    inequality constraints    →    KKT

Learning From Data - Lecture 14    11/20

We saw this before

Remember regularization?

  Minimize   E_in(w) = (1/N) (Zw − y)ᵀ(Zw − y)    subject to:   wᵀw ≤ C

  [Figure: E_in contours with the constraint wᵀw = C; ∇E_in normal to the constraint]

Regularization:   optimize E_in,   constrain wᵀw
SVM:              optimize wᵀw,    constrain E_in

Learning From Data - Lecture 14    12/20

Lagrange formulation

Minimize

  L(w, b, α) = ½ wᵀw − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 )

w.r.t. w and b, and maximize w.r.t. each α_n ≥ 0

  ∇_w L = w − Σ_{n=1}^{N} α_n y_n x_n = 0

  ∂L/∂b = − Σ_{n=1}^{N} α_n y_n = 0

Learning From Data - Lecture 14    13/20

Substituting ...

  w = Σ_{n=1}^{N} α_n y_n x_n       and       Σ_{n=1}^{N} α_n y_n = 0

in the Lagrangian

  L(w, b, α) = ½ wᵀw − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 )

we get

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m

Maximize w.r.t. α   subject to   α_n ≥ 0 for n = 1, …, N   and   Σ_{n=1}^{N} α_n y_n = 0

Learning From Data - Lecture 14    14/20

The solution - quadratic programming

  min_α     ½ αᵀ Q α − 1ᵀ α        where the matrix of quadratic coefficients is  Q_nm = y_n y_m x_nᵀ x_m

subject to

  yᵀα = 0           (linear constraint)
  0 ≤ α_n ≤ ∞       (lower and upper bounds)

Learning From Data - Lecture 14    15/20

QP hands us α

Solution:    α = α_1, …, α_N      ⇒      w = Σ_{n=1}^{N} α_n y_n x_n

KKT condition:  for n = 1, …, N,

  α_n ( y_n (wᵀx_n + b) − 1 ) = 0

We saw this before!   [Figure: the constrained-optimization picture from regularization, with E_in contours and wᵀw = C]

  α_n > 0   ⇒   x_n is a support vector

Learning From Data - Lecture 14    16/20

Support vectors

Closest x_n's to the plane: they achieve the margin

  ⇒    y_n (wᵀx_n + b) = 1

[Figure: the maximum-margin plane; the support vectors lie on the margin]

  w = Σ_{x_n is SV} α_n y_n x_n

Solve for b using any SV:     y_n (wᵀx_n + b) = 1

Learning From Data - Lecture 14    17/20
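A small illustrative sketch of solving this dual for a toy data set (not the course's QP package): it maximizes L(α) subject to Σ α_n y_n = 0 and α_n ≥ 0 using scipy's SLSQP, then recovers w from the α's and b from a support vector. A dedicated quadratic-programming solver would normally be used instead.

import numpy as np
from scipy.optimize import minimize

def svm_hard_margin(X, y):
    N = len(y)
    Q = (y[:, None] * y[None, :]) * (X @ X.T)            # Q_nm = y_n y_m x_n^T x_m
    objective = lambda a: 0.5 * a @ Q @ a - a.sum()      # minimizing the negative of L(alpha)
    constraints = [{'type': 'eq', 'fun': lambda a: a @ y}]
    res = minimize(objective, np.zeros(N), bounds=[(0, None)] * N, constraints=constraints)
    alpha = res.x
    w = (alpha * y) @ X                                  # w = sum_n alpha_n y_n x_n
    sv = np.argmax(alpha)                                # index of a support vector
    b = y[sv] - w @ X[sv]                                # from y_n (w^T x_n + b) = 1
    return w, b, alpha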

Outline

Maximizing the margin
The solution
Nonlinear transforms

Learning From Data - Lecture 14    18/20

z instead of x

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀ z_m

  X → Z

[Figure: a data set that is not linearly separable in X becomes separable after the nonlinear transform to Z]

Learning From Data - Lecture 14    19/20

Support vectors in X space

Support vectors live in Z space

In X space, "pre-images" of support vectors

The margin is maintained in Z space

Generalization result:

  E[E_out]  ≤  E[# of SV's] / (N − 1)

[Figure: a nonlinear boundary in X space with the pre-images of the support vectors marked]

Learning From Data - Lecture 14    20/20

Review of Lecture 14

The margin:  maximizing the margin = dual problem

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m        →   quadratic programming

Support vectors:  x_n (or z_n) with Lagrange multiplier α_n > 0

  E[E_out]  ≤  E[# of SV's] / (N − 1)        (in-sample check of out-of-sample error)

Nonlinear transform:  complex h, but simple H

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 15: Kernel Methods

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 22, 2012

Outline

The kernel trick
Soft-margin SVM

Learning From Data - Lecture 15    2/20

What do we need from the Z space?

  L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀ z_m            ← need z_nᵀ z_m

Constraints:    α_n ≥ 0 for n = 1, …, N     and     Σ_{n=1}^{N} α_n y_n = 0

  g(x) = sign(wᵀz + b)      where     w = Σ_{z_n is SV} α_n y_n z_n                       ← need z_nᵀ z

and b:     y_m (wᵀz_m + b) = 1                                                            ← need z_nᵀ z_m

Learning From Data - Lecture 15    3/20

Generalized inner product

Given two points x and x′ ∈ X, we need  zᵀz′

Let    zᵀz′ = K(x, x′)         (the kernel)     "inner product" of x and x′

Example:    x = (x1, x2),    2nd-order    z = Φ(x) = (1, x1, x2, x1², x2², x1x2)

  K(x, x′) = zᵀz′ = 1 + x1x1′ + x2x2′ + x1²x1′² + x2²x2′² + x1x1′x2x2′

Learning From Data - Lecture 15    4/20

The trick

Can we compute K(x, x′) without transforming x and x′?

Example:  consider    K(x, x′) = (1 + xᵀx′)² = (1 + x1x1′ + x2x2′)²

  = 1 + x1²x1′² + x2²x2′² + 2x1x1′ + 2x2x2′ + 2x1x1′x2x2′

This is an inner product!  Of the vectors

  (1,  x1²,  x2²,  √2 x1,  √2 x2,  √2 x1x2)
  (1,  x1′², x2′², √2 x1′, √2 x2′, √2 x1′x2′)

Learning From Data - Lecture 15    5/20

The polynomial kernel

X = R^d  and  Φ: X → Z  is polynomial of order Q

The "equivalent" kernel:

  K(x, x′) = (1 + xᵀx′)^Q = (1 + x1x1′ + x2x2′ + … + xdxd′)^Q

Compare the computation for  d = 10  and  Q = 100

Can adjust the scale:     K(x, x′) = (a xᵀx′ + b)^Q

Learning From Data - Lecture 15    6/20
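A small numerical check (my own illustration) that the 2nd-order polynomial kernel equals the inner product of the transformed vectors above:

import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, np.sqrt(2)*x1*x2])

def K(x, xp):
    return (1 + x @ xp) ** 2

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(K(x, xp), phi(x) @ phi(xp))    # the two numbers agree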

We only need Z to exist!

If K(x, x′) is an inner product in some space Z, we are good.

Example:     K(x, x′) = exp( −γ ‖x − x′‖² )

Infinite-dimensional Z:  take the simple one-dimensional case

  K(x, x′) = exp( −(x − x′)² )
           = exp(−x²) · exp(−x′²) · Σ_{k=0}^{∞} ( 2^k x^k x′^k ) / k!         [ the sum is exp(2xx′) ]

Learning From Data - Lecture 15    7/20

This kernel in action

Slightly non-separable case:  transforming X into an ∞-dimensional Z

Overkill?    Count the support vectors

[Figure: the RBF-kernel decision boundary on a slightly non-separable data set, with the support vectors marked]

Learning From Data - Lecture 15    8/20

Kernel formulation of SVM

Remember quadratic programming?  The only difference now is the matrix of quadratic coefficients:

  Q_nm = y_n y_m K(x_n, x_m)

Everything else is the same.

Learning From Data - Lecture 15    9/20

The final hypothesis

Express  g(x) = sign(wᵀz + b)  in terms of K(·,·):

  w = Σ_{z_n is SV} α_n y_n z_n       ⇒       g(x) = sign( Σ_{α_n>0} α_n y_n K(x_n, x) + b )

where

  b = y_m − Σ_{α_n>0} α_n y_n K(x_n, x_m)         for any support vector  (α_m > 0)

Learning From Data - Lecture 15    10/20

How do we know that Z exists ...

... for a given K(x, x′)?         valid kernel

Three approaches:

1. By construction
2. Math properties (Mercer's condition)
3. Who cares?

Learning From Data - Lecture 15    11/20

Design your own kernel

K(x, x′) is a valid kernel iff

1. it is symmetric,  and

2. the matrix whose (i, j) entry is K(x_i, x_j) is positive semi-definite for any x1, …, xN

(Mercer's condition)

Learning From Data - Lecture 15    12/20

Outline

The kernel trick
Soft-margin SVM

Learning From Data - Lecture 15    13/20

Two types of non-separable

  slightly:                                seriously:

[Figure: a data set that is linearly separable except for a few points (left), and one that needs a genuinely nonlinear boundary (right)]

Learning From Data - Lecture 15    14/20

Error measure

Margin violation:     y_n (wᵀx_n + b) ≥ 1    fails

Quantify:     y_n (wᵀx_n + b) ≥ 1 − ξ_n,     with  ξ_n ≥ 0

Total violation  =  Σ_{n=1}^{N} ξ_n

[Figure: points inside the margin; ξ_n measures each point's violation]

Learning From Data - Lecture 15    15/20

The new optimization

Minimize      ½ wᵀw + C Σ_{n=1}^{N} ξ_n

subject to    y_n (wᵀx_n + b) ≥ 1 − ξ_n      for n = 1, …, N
              ξ_n ≥ 0                         for n = 1, …, N

  w ∈ R^d,  b ∈ R,  ξ ∈ R^N

Learning From Data - Lecture 15    16/20

Lagrange formulation

  L(w, b, ξ, α, β) = ½ wᵀw + C Σ_{n=1}^{N} ξ_n − Σ_{n=1}^{N} α_n ( y_n (wᵀx_n + b) − 1 + ξ_n ) − Σ_{n=1}^{N} β_n ξ_n

Minimize w.r.t. w, b and ξ, and maximize w.r.t. each α_n ≥ 0 and β_n ≥ 0

  ∇_w L = w − Σ_{n=1}^{N} α_n y_n x_n = 0

  ∂L/∂b = − Σ_{n=1}^{N} α_n y_n = 0

  ∂L/∂ξ_n = C − α_n − β_n = 0

Learning From Data - Lecture 15    17/20

... and the solution is

Maximize      L(α) = Σ_{n=1}^{N} α_n − ½ Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀ x_m

w.r.t. α    subject to    0 ≤ α_n ≤ C  for n = 1, …, N    and    Σ_{n=1}^{N} α_n y_n = 0

  ⇒     w = Σ_{n=1}^{N} α_n y_n x_n         minimizes     ½ wᵀw + C Σ_{n=1}^{N} ξ_n

Learning From Data - Lecture 15    18/20

Types of support vectors

margin support vectors  (0 < α_n < C):          y_n (wᵀx_n + b) = 1       (ξ_n = 0)

non-margin support vectors  (α_n = C):          y_n (wᵀx_n + b) < 1       (ξ_n > 0)

[Figure: margin support vectors sit on the margin; non-margin support vectors violate it]

Learning From Data - Lecture 15    19/20

Two technical observations

1. Hard margin:  what if the data is not linearly separable?
     the primal and dual formulations break down

2. Z:  what if there is a w0?
     All goes to b,  and  w0 → 0

Learning From Data - Lecture 15    20/20

Review of Lecture 15

Kernel methods:     K(x, x′) = zᵀz′  for some Z space

  e.g.    K(x, x′) = exp( −γ ‖x − x′‖² )

Soft-margin SVM:    minimize   ½ wᵀw + C Σ_{n=1}^{N} ξ_n         (ξ_n = margin violation)

  Same as hard margin, but    0 ≤ α_n ≤ C

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 16: Radial Basis Functions

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 24, 2012

Outline

RBF and nearest neighbors
RBF and neural networks
RBF and kernel methods
RBF and regularization

Learning From Data - Lecture 16    2/20

Basic RBF model

Each (x_n, y_n) ∈ D influences h(x) based on  ‖x − x_n‖         (radial)

Standard form:

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )                     (basis function)

Learning From Data - Lecture 16    3/20

The learning algorithm

Finding  w1, …, wN :

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )       based on   D = (x1, y1), …, (xN, yN)

E_in = 0:   h(x_n) = y_n   for n = 1, …, N :

  Σ_{m=1}^{N} w_m exp( −γ ‖x_n − x_m‖² ) = y_n

Learning From Data - Lecture 16    4/20

The solution

  Σ_{m=1}^{N} w_m exp( −γ ‖x_n − x_m‖² ) = y_n           N equations in N unknowns

  Φ w = y,      where    Φ_nm = exp( −γ ‖x_n − x_m‖² )

If Φ is invertible,

  w = Φ⁻¹ y            "exact interpolation"

Learning From Data - Lecture 16    5/20
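A minimal sketch of exact RBF interpolation (illustrative): build the N x N matrix Φ and solve Φw = y, then predict with the weighted Gaussians.

import numpy as np

def rbf_interpolate(X, y, gamma=1.0):
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)    # pairwise squared distances
    Phi = np.exp(-gamma * d2)
    return np.linalg.solve(Phi, y)                                # w = Phi^{-1} y

def rbf_predict(X_train, w, x, gamma=1.0):
    return w @ np.exp(-gamma * np.sum((X_train - x) ** 2, axis=1))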

The effect of γ

  h(x) = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )

[Figure: the interpolant for a small γ (wide, smooth bumps) and a large γ (narrow, spiky bumps)]

  small γ                                 large γ

Learning From Data - Lecture 16    6/20

RBF for classification

  h(x) = sign( Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² ) )

Learning:  ~ linear regression for classification

  s = Σ_{n=1}^{N} w_n exp( −γ ‖x − x_n‖² )

  Minimize  (s − y)²  on D,  with  y = ±1;      h(x) = sign(s)

Learning From Data - Lecture 16    7/20

Relationship to nearest-neighbor method

Adopt the y value of a nearby point:         similar effect by a basis function:

[Figure: the nearest-neighbor rule gives a piecewise-constant surface; a local basis function gives a similar, smoother effect]

Learning From Data - Lecture 16    8/20

RBF with K centers

N parameters w1, …, wN based on N data points

Use K ≪ N centers:    μ1, …, μK    instead of    x1, …, xN

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

1. How to choose the centers μ_k
2. How to choose the weights w_k

Learning From Data - Lecture 16    9/20

Choosing the centers

Minimize the distance between x_n and the closest center μ_k:      K-means clustering

Split x1, …, xN into clusters S1, …, SK

Minimize     Σ_{k=1}^{K} Σ_{x_n ∈ S_k} ‖x_n − μ_k‖²

Unsupervised learning              NP-hard

Learning From Data - Lecture 16    10/20

An iterative algorithm

Lloyd's algorithm:  iteratively minimize   Σ_{k=1}^{K} Σ_{x_n ∈ S_k} ‖x_n − μ_k‖²   w.r.t.  μ_k, S_k :

  μ_k ← (1/|S_k|) Σ_{x_n ∈ S_k} x_n

  S_k ← { x_n :  ‖x_n − μ_k‖ ≤ all  ‖x_n − μ_ℓ‖ }

Convergence:  to a local minimum

Learning From Data - Lecture 16    11/20

Lloyd's algorithm in action

1. Get the data points
2. Only the inputs!
3. Initialize the centers
4. Iterate
5. These are your μ_k's

[Figure: the cluster centers and boundaries found by Lloyd's algorithm on a 2D data set]

Learning From Data - Lecture 16    12/20

Centers versus support vectors

[Figure: the same data set with the SVM support vectors (left) and the RBF centers (right)]

  support vectors                          RBF centers

Learning From Data - Lecture 16    13/20

Choosing the weights

  Σ_{k=1}^{K} w_k exp( −γ ‖x_n − μ_k‖² ) ≈ y_n            N equations in K < N unknowns

  Φ w ≈ y,      where Φ is the N×K matrix with entries    Φ_nk = exp( −γ ‖x_n − μ_k‖² )

If ΦᵀΦ is invertible,

  w = (ΦᵀΦ)⁻¹ Φᵀ y             pseudo-inverse

Learning From Data - Lecture 16    14/20
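An illustrative sketch of the two steps above (the names are mine, not the course code): pick K centers with Lloyd's algorithm, then get the weights from the pseudo-inverse.

import numpy as np

def lloyd(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]       # initialize centers at data points
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - mu[None]) ** 2).sum(-1), axis=1)   # nearest center
        mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                       for k in range(K)])             # recompute the cluster means
    return mu

def rbf_weights(X, y, mu, gamma=1.0):
    Phi = np.exp(-gamma * ((X[:, None] - mu[None]) ** 2).sum(-1))   # N x K matrix
    return np.linalg.pinv(Phi) @ y                                  # w = (Phi^T Phi)^{-1} Phi^T y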

RBF network

The features are    exp( −γ ‖x − μ_k‖² )

Nonlinear transform depends on D    ⇒    no longer a linear model

[Diagram: the input x feeds K radial units ‖x − μ_k‖, each passed through a Gaussian, combined with weights w1, …, wK into h(x)]

A bias term (b or w0) is often added

Learning From Data - Lecture 16    15/20

Compare to neural networks

[Diagram: RBF network (features ‖x − μ_k‖ through a Gaussian nonlinearity) versus neural network (features w_kᵀx through a sigmoidal nonlinearity), both combined linearly into h(x)]

  RBF network                              neural network

Learning From Data - Lecture 16    16/20

Choosing γ

Treating γ as a parameter to be learned:

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

Iterative approach (EM algorithm in mixture of Gaussians):

1. Fix γ, solve for w1, …, wK
2. Fix w1, …, wK, minimize the error w.r.t. γ

We can have a different γ_k for each center μ_k

Learning From Data - Lecture 16    17/20

Outline

RBF and nearest neighbors
RBF and neural networks
RBF and kernel methods
RBF and regularization

Learning From Data - Lecture 16    18/20

RBF versus its SVM kernel

SVM kernel implements:

  sign( Σ_{α_n>0} α_n y_n exp( −γ ‖x − x_n‖² ) + b )

Straight RBF implements:

  sign( Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² ) + b )

[Figure: the two decision boundaries on the same data set]

Learning From Data - Lecture 16    19/20

RBF and regularization

RBF can be derived based purely on regularization:

  Σ_{n=1}^{N} ( h(x_n) − y_n )²  +  λ Σ_{k=0}^{∞} a_k ∫ ( dᵏh/dxᵏ )² dx

smoothest interpolation

Learning From Data - Lecture 16    20/20

Review of Lecture 16

Radial Basis Functions:

  h(x) = Σ_{k=1}^{K} w_k exp( −γ ‖x − μ_k‖² )

Related to:  nearest neighbors, neural networks, SVM (kernel = RBF), regularization

Choose the μ_k's:    Lloyd's algorithm    (unsupervised learning)
Choose the w_k's:    pseudo-inverse

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 17: Three Learning Principles

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Tuesday, May 29, 2012

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    2/22

Recurring theme - simple hypotheses

A quote by Einstein:

  "An explanation of the data should be made as simple as possible, but no simpler"

The razor: symbolic of a principle set by William of Occam

Learning From Data - Lecture 17    3/22

Occam's Razor

The simplest model that fits the data is also the most plausible.

Two questions:

1. What does it mean for a model to be simple?
2. How do we know that simpler is better?

Learning From Data - Lecture 17    4/22

First question: 'simple' means?

Measures of complexity - two types:    complexity of h    and    complexity of H

  Complexity of h:   MDL, order of a polynomial
  Complexity of H:   entropy, VC dimension

When we think of simple, it's in terms of h

Proofs use simple in terms of H

Learning From Data - Lecture 17    5/22

and the link is . . .

counting:    ℓ bits specify h    ⇒    h is one of 2^ℓ elements of a set H

Real-valued parameters?

Example: 17th-order polynomial - complex and one of many

Exceptions? Looks complex but is one of few - SVM

[Figure: a 17th-order polynomial fit; an SVM boundary determined by a few support vectors]

Learning From Data - Lecture 17    6/22

Puzzle 1: Football oracle

  00000000000000001111111111111111      0
  00000000111111110000000011111111      1
  00001111000011110000111100001111      0
  00110011001100110011001100110011      1
  01010101010101010101010101010101      1

Letter predicting the game outcome         Good call!
More letters - for 5 weeks                 Perfect record!
Want more? $50 charge                      Should you pay?

Learning From Data - Lecture 17    7/22

Second question: Why is simpler better?

Better doesn't mean more elegant!  It means better out-of-sample performance

The basic argument:  (formal proof under different idealized conditions)

  Fewer simple hypotheses than complex ones:    m_H(N)
  less likely to fit a given data set:          m_H(N)/2^N
  more significant when it happens

The postal scam:     m_H(N) = 1    versus    2^N

Learning From Data - Lecture 17    8/22

A fit that means nothing

Conductivity linear in temperature?  Two scientists conduct experiments

[Figure: three panels of conductivity versus temperature - the linear hypothesis, Scientist A's data, and Scientist B's data]

What evidence do A and B provide?

  falsifiable:  a fit that could not have failed to fit provides no evidence

Learning From Data - Lecture 17    9/22

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    10/22

Puzzle 2: Presidential election

In 1948, Truman ran against Dewey in a close election

A newspaper ran a phone poll of how people voted

Dewey won the poll decisively - the newspaper declared:

[Photo: the "Dewey Defeats Truman" front page]

Learning From Data - Lecture 17    11/22

On to the victory rally . . .

. . . of Truman

[Photo: Truman holding up the erroneous front page]

It's not δ's fault:     P[ |E_in − E_out| > ε ] ≤ δ

Learning From Data - Lecture 17    12/22

The bias

In 1948, phones were expensive.

If the data is sampled in a biased way, learning will produce a similarly biased outcome.

Example:  training on a normal period in the market;  testing on live trading in the real market

Learning From Data - Lecture 17    13/22

Matching the distributions

Methods exist to match the training and testing distributions

[Figure: the training and testing densities P(x) over the input space]

Doesn't work if a region has P = 0 in training but P > 0 in testing

Learning From Data - Lecture 17    14/22

Puzzle 3: Credit approval

Historical records of customers

Input: information on the credit application:

  age                  23 years
  gender               male
  annual salary        $30,000
  years in residence   1 year
  years in job         1 year
  current debt         $15,000

Target: profitable for the bank

Learning From Data - Lecture 17    15/22

Outline

Occam's Razor
Sampling Bias
Data Snooping

Learning From Data - Lecture 17    16/22

The principle

If a data set has affected any step in the learning process,
its ability to assess the outcome has been compromised.

Most common trap for practitioners - many ways to slip

Learning From Data - Lecture 17    17/22

Looking at the data

Remember nonlinear transforms?

  z = (1, x1, x2, x1x2, x1², x2²)      or      z = (1, x1², x2²)      or      z = (1, x1² + x2²)

Snooping involves D, not other information

[Figure: a ring-shaped data set that tempts you to choose the transform after looking at it]

Learning From Data - Lecture 17    18/22

Puzzle 4: Financial forecasting

Predict US Dollar versus British Pound

Normalize the data, split randomly into D_train and D_test

Train on D_train only, test g on D_test

  inputs:  r_{-20}, r_{-19}, …, r_{-1}     →     output:  r_0

[Figure: cumulative profit % over 500 days of testing; the 'snooping' curve climbs steadily while the 'no snooping' curve hovers around zero]

Learning From Data - Lecture 17    19/22

Reuse of a data set

Trying one model after the other on the same data set, you will eventually 'succeed'

  "If you torture the data long enough, it will confess"

VC dimension of the total learning model - may include what others tried!

Key problem: matching a particular data set

Learning From Data - Lecture 17    20/22

Two remedies

1. Avoid data snooping:          strict discipline

2. Account for data snooping:    how much data contamination

Learning From Data - Lecture 17    21/22

Puzzle 5: Bias via snooping

Testing the long-term performance of "buy and hold" in stocks.  Use 50 years worth of data

  All currently traded companies in the S&P 500
  Assume you strictly followed buy and hold
  Would have made great profit!

Sampling bias caused by 'snooping'

Learning From Data - Lecture 17    22/22

Review of Lecture 17

Occam's Razor:  the simplest model that fits the data is also the most plausible
  complexity of h ↔ complexity of H;  an unlikely event is significant if it happens

Sampling bias
  [Figure: training versus testing distributions P(x)]

Data snooping
  [Figure: cumulative profit % over 500 days, with and without snooping]

Learning From Data

Yaser S. Abu-Mostafa
California Institute of Technology

Lecture 18: Epilogue

Sponsored by Caltech's Provost Office, E&AS Division, and IST

Thursday, May 31, 2012

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    2/23

It's a jungle out there

[Word cloud of machine-learning terms: semi-supervised learning, stochastic gradient descent, overfitting, Gaussian processes, distribution-free learning, collaborative filtering, deterministic noise, linear regression, VC dimension, nonlinear transformation, decision trees, data snooping, sampling bias, Q learning, SVM, learning curves, mixture of experts, neural networks, no free lunch, training versus testing, RBF, noisy targets, Bayesian prior, active learning, linear models, bias-variance tradeoff, weak learners, ordinal regression, logistic regression, data contamination, cross validation, ensemble learning, types of learning, exploration versus exploitation, error measures, is learning feasible?, clustering, regularization, kernel methods, hidden Markov models, perceptrons, graphical models, soft-order constraint, weight decay, Occam's razor, Boltzmann machines]

Learning From Data - Lecture 18    3/23

The map

THEORY:        VC, bias-variance, complexity, Bayesian

TECHNIQUES:
  models:      linear, neural networks, SVM, nearest neighbors, RBF, Gaussian processes, SVD, graphical models
  methods:     regularization, validation, aggregation, input processing

PARADIGMS:     supervised, unsupervised, reinforcement, active, online

Learning From Data - Lecture 18    4/23

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    5/23

Probabilistic approach

Extend the probabilistic role to all components

  P(D | h = f)    decides which h         (likelihood)

How about    P(h = f | D) ?

[Diagram: the learning setup with the unknown target distribution P(y | x), the unknown input distribution P(x), the data set D, the learning algorithm, the hypothesis set H, and the final hypothesis g ≈ f]

Learning From Data - Lecture 18    6/23

The prior

P(h = f | D) requires an additional probability distribution:

  P(h = f | D) = P(D | h = f) P(h = f) / P(D)    ∝    P(D | h = f) P(h = f)

  P(h = f)        is the  prior
  P(h = f | D)    is the  posterior

Given the prior, we have the full distribution

Learning From Data - Lecture 18    7/23
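A tiny sketch of this rule on a discrete hypothesis set (a toy example of my own): each h is a candidate coin bias, the prior is uniform, and the posterior follows from Bayes' rule.

import numpy as np

hypotheses = np.linspace(0.0, 1.0, 11)          # candidate values of P(y = 1)
prior = np.ones_like(hypotheses) / len(hypotheses)

data = [1, 1, 0, 1]                             # observed outcomes
k = sum(data)
likelihood = hypotheses**k * (1 - hypotheses)**(len(data) - k)   # P(D | h = f)

posterior = likelihood * prior
posterior /= posterior.sum()                    # divide by P(D)
print(hypotheses[np.argmax(posterior)])         # most probable h given the data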

Example of a prior

Consider a perceptron:  h is determined by  w = w0, w1, …, wd

A possible prior on w:  each w_i is independent, uniform over [−1, 1]

This determines the prior over h:     P(h = f)

Given D, we can compute     P(D | h = f)

Putting them together, we get     P(h = f | D)  ∝  P(h = f) P(D | h = f)

Learning From Data - Lecture 18    8/23

A prior is an assumption

Even the most "neutral" prior:

  x is unknown    →    x is random, uniform over [−1, 1]           [Figure: a flat density P(x)]

The true equivalent would be:

  x is unknown    →    x is random, with density δ(x − a)           [Figure: a unit spike at the unknown value a]

Learning From Data - Lecture 18    9/23

If we knew the prior

... we could compute  P(h = f | D)  for every  h ∈ H

  ⇒   we can find the most probable h given the data
       we can derive E[h(x)] for every x
       we can derive the error bar for every x
       we can derive everything in a principled way

Learning From Data - Lecture 18    10/23

When is Bayesian learning justified?

1. The prior is valid
     trumps all other methods

2. The prior is irrelevant
     just a computational catalyst

Learning From Data - Lecture 18    11/23

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    12/23

What is aggregation?

Combining different solutions h1, h2, …, hT that were trained on D:

[Figure: several individual boundaries combined into one aggregate boundary]

  Regression: take an average             Classification: take a vote

a.k.a. ensemble learning and boosting

Learning From Data - Lecture 18    13/23

Different from 2-layer learning

In a 2-layer model, all units learn jointly:

  [Diagram: the training data feeds one learning algorithm that trains all units together]

In aggregation, they learn independently then get combined:

  [Diagram: the training data feeds separate learning algorithms whose outputs are combined afterwards]

Learning From Data - Lecture 18    14/23

Two types of aggregation

1. After the fact:  combines existing solutions
     Example: the Netflix teams merging        (blending)

2. Before the fact:  creates solutions to be combined
     Example: bagging - resampling D

  [Diagram: D resampled into several training sets, each fed to the learning algorithm]

Learning From Data - Lecture 18    15/23

Decorrelation - boosting

Create h1, …, ht, … sequentially:  make ht decorrelated with the previous h's:

  Emphasize the points in D that were misclassified

  Choose the weight of ht based on E_in(ht)

  [Diagram: the training data is reweighted at each round before being fed to the learning algorithm]

Learning From Data - Lecture 18    16/23

Blending - after the fact

For regression,  h1, h2, …, hT :

  g(x) = Σ_{t=1}^{T} α_t h_t(x)

Principled choice of the α_t's:  minimize the error on an "aggregation" data set       (pseudo-inverse)

Some α_t's can come out negative

Most valuable h_t in the blend?

Uncorrelated h_t's help the blend

Learning From Data - Lecture 18    17/23
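A minimal sketch of blending (illustrative): choose the α_t's by least squares on a held-out "aggregation" set, which is exactly a pseudo-inverse computation; some coefficients may come out negative.

import numpy as np

def blend(models, X_agg, y_agg):
    # columns of H are the predictions h_t(x) of each model on the aggregation set
    H = np.column_stack([h(X_agg) for h in models])
    alpha, *_ = np.linalg.lstsq(H, y_agg, rcond=None)
    return alpha

def blended_predict(models, alpha, X):
    H = np.column_stack([h(X) for h in models])
    return H @ alpha          # g(x) = sum_t alpha_t h_t(x)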

Outline

The map of machine learning
Bayesian learning
Aggregation methods
Acknowledgments

Learning From Data - Lecture 18    18/23

Course content

  Professor Malik Magdon-Ismail, RPI
  Professor Hsuan-Tien Lin, NTU

Learning From Data - Lecture 18    19/23

Course staff

  Carlos Gonzalez (Head TA)
  Ron Appel
  Costis Sideris
  Doris Xin

Learning From Data - Lecture 18    20/23

Filming, production, and infrastructure

  Leslie Maxfield and the AMT staff
  Rich Fagen and the IMSS staff

Learning From Data - Lecture 18    21/23

Caltech support

  IST              -  Mathieu Desbrun
  E&AS Division    -  Ares Rosakis and Mani Chandy
  Provost's Office -  Ed Stolper and Melany Hunt

Learning From Data - Lecture 18    22/23

Many others

  Caltech TA's and staff members
  Caltech alumni and Alumni Association
  Colleagues all over the world

Learning From Data - Lecture 18    23/23

To the fond memory of

Faiza A. Ibrahim
