Learning From Data — Caltech, 2012
http://work.caltech.edu/telecourse.html
Lecture schedule (April–May 2012):

3. The Linear Model I (April 10)
4. Error and Noise (April 12)
5. Training versus Testing (April 17)
6. Theory of Generalization (April 19)
7. The VC Dimension (April 24)
8. Bias-Variance Tradeoff (April 26)
9. The Linear Model II (May 1)
10. Neural Networks (May 3)
11. Overfitting (May 8)
12. Regularization (May 10)
13. Validation (May 15)
14. Support Vector Machines (May 17)
15. Kernel Methods (May 22)
16. Radial Basis Functions (May 24)
17. Three Learning Principles (May 29)
18. Epilogue (May 31)

Topics are coded by theme: theory (mathematical), technique (practical), analysis (conceptual).
Learning From Data — Lecture 1: The Learning Problem

The essence of machine learning:
- A pattern exists.
- We cannot pin it down mathematically.
- We have data on it.

[Figure: movie-rating example — a viewer's taste and a movie's content are each described by factors (comedy, action, blockbuster appeal, Tom Cruise presence, ...); the predicted rating adds the contributions from each factor.]

[Figure: the machine learning approach — take (viewer, movie) pairs with known ratings, and LEARN to predict the rating.]
Components of learning

Example: credit approval. Applicant information:

  age                23 years
  gender             male
  annual salary      $30,000
  years in residence 1 year
  years in job       1 year
  current debt       $15,000

Approve credit?
Components of learning

Formalization:
- Input: x (customer application)
- Output: y (good/bad customer?)
- Target function: f: X → Y (ideal credit approval formula)
- Data: (x1, y1), ..., (xN, yN) (historical records)
- Hypothesis: g: X → Y (formula to be used)

[Diagram: the unknown target f generates the training examples (x1, y1), ..., (xN, yN); the learning algorithm A selects the final hypothesis g ≈ f from the hypothesis set H, the set of candidate formulas.]
Solution components

The two solution components of the learning problem:
- The hypothesis set H = {h}, from which the final hypothesis g ∈ H is chosen
- The learning algorithm A

Together, they are referred to as the learning model.
A simple hypothesis set — the perceptron

For input x = (x1, ..., xd), the 'attributes of a customer':

- Approve credit if  Σ_{i=1..d} wi xi > threshold
- Deny credit if     Σ_{i=1..d} wi xi < threshold

This linear formula h ∈ H can be written as

  h(x) = sign( (Σ_{i=1..d} wi xi) − threshold )

Introduce an artificial coordinate x0 = 1 and let w0 = −threshold:

  h(x) = sign( Σ_{i=1..d} wi xi + w0 ) = sign( Σ_{i=0..d} wi xi )

In vector form, the perceptron implements

  h(x) = sign(wᵀx)

[Figure: a linear boundary separating the +1 and −1 points in the plane.]
A simple learning algorithm — PLA

The perceptron implements h(x) = sign(wᵀx). The perceptron learning algorithm (PLA) picks a misclassified point, sign(wᵀxn) ≠ yn, and updates the weight vector:

  w ← w + yn xn

[Figure: for a misclassified point with y = +1, the update w + y x rotates w toward x; for y = −1, away from x.]

One iteration of the PLA:

  w ← w + y x

where (x, y) is a misclassified training point. That's it!
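The PLA update can be sketched in a few lines (a minimal illustration added here; the toy data set and names are my own, not from the slides):

```python
import numpy as np

def pla(X, y, max_iters=1000):
    """Perceptron learning algorithm. X has a leading column of 1s (x0 = 1)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        preds = np.sign(X @ w)
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:
            return w                  # all points correctly classified
        n = misclassified[0]          # pick a misclassified point
        w = w + y[n] * X[n]           # w <- w + y_n x_n
    return w

# Linearly separable toy set: label = sign(x1 + x2 - 1)
X = np.array([[1, 0, 0], [1, 2, 2], [1, 0, 2], [1, 2, 0]], dtype=float)
y = np.array([-1, 1, 1, 1], dtype=float)
w = pla(X, y)
```

On separable data like this, the PLA is guaranteed to converge to a separating w.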
Types of learning:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning

Example from vending machines — coin recognition. Each coin is measured by (size, mass); labeled examples of 1¢, 5¢, 10¢, and 25¢ coins let us learn a region for each denomination.

[Figure: coins in the (size, mass) plane, before and after the classification boundaries are learned.]
Unsupervised learning

Instead of (input, correct output), we get (input, ?).

[Figure: the same coin data without labels — clusters are still apparent in the (size, mass) plane.]
A learning puzzle

[Figure: training examples labeled f = −1 and f = +1; a test point with f = ?]
Review of Lecture 1

Learning is used when:
- A pattern exists
- We cannot pin it down mathematically
- We have data on it

Focus on supervised learning:
- Unknown target function y = f(x)
- Data set (x1, y1), ..., (xN, yN)
- Learning algorithm picks g ≈ f from a hypothesis set H

So what now?

Lecture 2: Is Learning Feasible?

(Connection to learning, connection to real learning.)
A related experiment

[Figure: a BIN of red and green marbles, and a SAMPLE drawn from it.]

- Consider a bin of red and green marbles: P[picking a red marble] = μ, P[picking a green marble] = 1 − μ.
- The value of μ is unknown to us.
- We pick N marbles independently.
- The fraction of red marbles in the sample is ν.
Does ν say anything about μ?

Yes! The sample frequency ν is likely close to the bin frequency μ — possible versus probable.
What does ν say about μ? In a big sample (large N), ν is probably close to μ (within ε). Formally,

  P[|ν − μ| > ε] ≤ 2 e^(−2ε²N)

This is called Hoeffding's Inequality. In other words, the statement 'μ = ν' is P.A.C. (probably approximately correct).
Connection to learning

The bin corresponds to a hypothesis h. Each marble is a point x ∈ X:
- green: h(x) = f(x) — the hypothesis got it right
- red: h(x) ≠ f(x) — the hypothesis got it wrong

[Diagram: the learning setup augmented with a probability distribution P on X that generates the training inputs x1, ..., xN.]
Are we done? Not so fast! h is fixed. For this fixed h, ν generalizes to μ — that is verification of h, not learning. There is no guarantee that ν will be small; we need to choose from multiple h's.
Multiple bins

Generalizing the bin model to more than one hypothesis: each of h1, h2, ..., hM gets its own bin.

Notation for learning: the out-of-sample error Eout(h) plays the role of μ, and the in-sample error Ein(h) plays the role of ν. The Hoeffding inequality becomes

  P[|Ein(h) − Eout(h)| > ε] ≤ 2 e^(−2ε²N)
[Figure: bins for h1, h2, ..., hM with sample frequencies Ein(h1), ..., Ein(hM) and bin frequencies Eout(h1), ..., Eout(hM).]
Are the bins really bins?

[Diagram: the full learning setup — unknown target f: X → Y, probability distribution P on X, training examples (x1, y1), ..., (xN, yN), learning algorithm A, hypothesis set H, final hypothesis g ≈ f — annotated with the bin analogy: Ein = fraction of red marbles in the sample, Eout = probability of a red marble in the bin.]
Coin analogy

Question: If you toss a fair coin 10 times, what is the probability that you will get 10 heads?
Answer: ≈ 0.1%

Question: If you toss 1,000 fair coins 10 times each, what is the probability that some coin will get 10 heads?
Answer: ≈ 63%

[Figure: BINGO — with enough coins, a run of 10 heads becomes likely even though each individual coin obeys Hoeffding.]
A simple solution — the union bound

  P[|Ein(g) − Eout(g)| > ε] ≤ Σ_{m=1..M} P[|Ein(hm) − Eout(hm)| > ε]
                            ≤ Σ_{m=1..M} 2 e^(−2ε²N)

so

  P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)
Review of Lecture 2

Is learning feasible? Yes, in a probabilistic sense. Since g has to be one of h1, h2, ..., hM, we conclude that

  P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

If Eout(g) ≈ Ein(g) and Ein(g) ≈ 0, then Eout(g) ≈ 0 — but we pay a factor of M.
Lecture 3: Linear Models I

Outline:
- Input representation
- Linear regression
- Nonlinear transformation

Input representation (digit recognition example):
- 'raw' input x = (x0, x1, x2, ..., x256), so the linear model has parameters (w0, w1, ..., w256)
- Features: extract useful information, e.g. intensity and symmetry, x = (x0, x1, x2), leaving only (w0, w1, w2)
[Figure: illustration of the features — x1: average intensity and x2: symmetry, computed from 16×16 digit images.]
What PLA does on non-separable data:

[Figure: Ein and Eout of PLA over 1000 iterations — both fluctuate wildly; the final perceptron boundary is shown.]

The pocket algorithm keeps the best weight vector seen so far:

[Figure: with the pocket algorithm, the reported Ein and Eout decrease over the 1000 iterations.]
[Figure: classification boundary — PLA versus pocket — in the (intensity, symmetry) plane.]
Linear regression ('regression' ≡ real-valued output)

Credit again. Instead of approving or denying, we now want a real-valued credit line.

Input: x = (age 23 years, annual salary $30,000, years in residence 1 year, years in job 1 year, current debt $15,000, ...)

The linear regression output:

  h(x) = Σ_{i=0..d} wi xi = wᵀx

The data set consists of historical records (x1, y1), ..., (xN, yN), with yn ∈ R. Linear regression measures how well h approximates f using the squared error. In-sample:

  Ein(h) = (1/N) Σ_{n=1..N} (h(xn) − yn)²
[Figure: illustration of linear regression — a line fit to (x, y) data in one dimension, and a plane fit in two dimensions (x1, x2).]
The expression for Ein:

  Ein(w) = (1/N) Σ_{n=1..N} (wᵀxn − yn)²
         = (1/N) ‖Xw − y‖²

where

  X = [ x1ᵀ ; x2ᵀ ; ... ; xNᵀ ],   y = [ y1 ; y2 ; ... ; yN ]

Minimizing Ein:

  ∇Ein(w) = (2/N) Xᵀ(Xw − y) = 0
  XᵀXw = Xᵀy
  w = X†y   where   X† = (XᵀX)⁻¹Xᵀ

X† is the 'pseudo-inverse' of X.
The pseudo-inverse

  X† = (XᵀX)⁻¹Xᵀ

X is N × (d+1), so XᵀX is (d+1) × (d+1) and X† is (d+1) × N.

The linear regression algorithm:
1: Construct the input data matrix X and the target vector y from the data set:
     X = [ x1ᵀ ; ... ; xNᵀ ],   y = [ y1 ; ... ; yN ]
2: Compute the pseudo-inverse X† = (XᵀX)⁻¹Xᵀ.
3: Return w = X†y.
[Figure: the resulting boundary in the (average intensity, symmetry) plane for the digits data.]
Nonlinear transformation

Linear is limited:

[Figure: data not separable by any line — e.g. +1 inside a circle, −1 outside.]

Another example: the credit line is affected by 'years in residence', but not in a linear way! Nonlinear features [[xi < 1]] and [[xi > 5]] are better. Can we do that with linear models?
Linear in what?

Linear regression implements Σ_{i=0..d} wi xi; linear classification implements sign(Σ_{i=0..d} wi xi). These are linear in the weights — the algorithms work because of linearity in w, not in x.

Transform the data nonlinearly, e.g.

  (x1, x2) → (x1², x2²)

[Figure: the circularly separated data becomes linearly separable in the transformed space.]
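The transform can be demonstrated concretely (my own toy data; the circle radius 0.75 is an arbitrary choice, not from the slides):

```python
import numpy as np

np.random.seed(0)

# Points labeled +1 inside a circle of radius 0.75, -1 outside:
# not linearly separable in (x1, x2), but linear in (x1^2, x2^2).
X = np.random.uniform(-1, 1, size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.75 ** 2, 1.0, -1.0)

Z = X ** 2                                  # the transform (x1, x2) -> (x1^2, x2^2)
# In Z space the boundary is the line z1 + z2 = 0.75^2:
w = np.array([0.75 ** 2, -1.0, -1.0])       # h(z) = sign(w0 + w1 z1 + w2 z2)
Z1 = np.column_stack([np.ones(len(Z)), Z])
preds = np.sign(Z1 @ w)
```

A hypothesis that is a circle in X space is a line in Z space, so the linear machinery applies unchanged.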
Review of Lecture 3

- Linear models use a 'signal' s = Σ_{i=0..d} wi xi = wᵀx.
- Classification: h(x) = sign(wᵀx).
- Regression: h(x) = wᵀx, with one-step learning w = (XᵀX)⁻¹Xᵀy.
- Nonlinear transformation: wᵀx is linear in w; any transform x → z preserves this linearity. Example: (x1, x2) → (x1², x2²).
Lecture 4: Error and Noise

Outline:
- Nonlinear transformation (continued)
- Error measures
- Noisy targets

Nonlinear transformation, step by step:
1. Original data: xn ∈ X
2. Transform the data: zn = Φ(xn) ∈ Z
3. Separate data in Z space: g̃(z) = sign(w̃ᵀz)
4. Classify in X space: g(x) = g̃(Φ(x)) = sign(w̃ᵀΦ(x))

[Figure: the four steps — nonseparable data in X, mapped by Φ to separable data in Z, a linear boundary in Z, and the corresponding nonlinear boundary back in X.]
Error measures and noisy targets

[Diagram: the learning setup, now with an ERROR MEASURE block feeding the learning algorithm.]
Error measures

What does 'h ≈ f' mean? We need an error measure E(h, f). Almost always, a pointwise definition e(h(x), f(x)) is used. Examples:
- Squared error: e(h(x), f(x)) = (h(x) − f(x))²
- Binary error: e(h(x), f(x)) = [[h(x) ≠ f(x)]]
Overall error: E(h, f) is the average of the pointwise errors e(h(x), f(x)).

In-sample error:

  Ein(h) = (1/N) Σ_{n=1..N} e(h(xn), f(xn))

Out-of-sample error:

  Eout(h) = Ex[ e(h(x), f(x)) ]

[Diagram: the learning setup with the error measure in place.]
The error measure depends on the use. Example: fingerprint verification, with target

  f(x) = +1 (you),  −1 (intruder)

Two types of error: false accept (h = +1 when f = −1) and false reject (h = −1 when f = +1). For a vendor verifying customers, a false reject is costly (an annoyed customer) while a false accept is cheap — e.g. penalty 10 for false reject, 1 for false accept. For an agency controlling access to classified material, a false accept is a disaster — e.g. penalty 1000 for false accept, 1 for false reject.

Take-home lesson: the error measure should be specified by the user. When that is not possible, the alternatives are:
- Plausible: squared error ≡ Gaussian noise
- Friendly: closed-form solution, convex optimization
[Diagram: the learning setup with the error measure e(·) included.]
Noisy targets

The 'target function' is not always a function: two identical customers (age 23 years, annual salary $30,000, years in residence 1 year, years in job 1 year, current debt $15,000, ...) can exhibit different credit behavior.

Target 'distribution': instead of y = f(x), we use a target distribution P(y | x). Each example (x, y) is now generated by the joint distribution P(x)P(y | x).

Noisy target = deterministic target f(x) = E(y|x) plus noise y − f(x). A deterministic target is a special case of a noisy target: P(y | x) is zero except for y = f(x).
[Diagram: the learning setup with the unknown target replaced by an unknown target distribution P(y | x) — 'target function f: X → Y plus noise' — and the unknown input distribution P(x) shown explicitly. The two play different roles: P(y | x) is what we are trying to learn; P(x) only quantifies the relative importance of the points x. Merging P(x)P(y|x) as P(x, y) mixes these two roles.]
Preamble to the theory

What we know so far: it is likely that Eout(g) ≈ Ein(g). Is this learning? We need g ≈ f, which means Eout(g) ≈ 0. This is achieved through:

  Eout(g) ≈ Ein(g)  (Lecture 2)   and   Ein(g) ≈ 0  (Lecture 3)

Learning is thus split into two questions:
1. Can we make sure that Eout(g) is close enough to Ein(g)?
2. Can we make Ein(g) small enough?
[Figure: error versus VC dimension dvc — the in-sample error decreases with model complexity while the out-of-sample error first decreases, then increases; Eout = Ein + model-complexity penalty.]
Review of Lecture 4

Error measures:
- User-specified pointwise error e(h(x), f(x)), e.g. +1 (you) / −1 (intruder) with asymmetric costs
- In-sample: Ein(h) = (1/N) Σ_{n=1..N} e(h(xn), f(xn))
- Out-of-sample: Eout(h) = Ex[ e(h(x), f(x)) ]

Noisy targets: y = f(x) plus noise, y ~ P(y | x); Eout(h) is now Ex,y[ e(h(x), y) ].

[Diagram: the full learning setup with target distribution, input distribution, and error measure.]
Lecture 5: Training versus Testing

Outline:
- From training to testing
- Illustrative examples
- Key notion: break point
- Puzzle

Testing versus training:

  Testing:   P[|Ein − Eout| > ε] ≤ 2 e^(−2ε²N)
  Training:  P[|Ein − Eout| > ε] ≤ 2M e^(−2ε²N)
Where did the M come from? The bad events Bm are

  Bm: '|Ein(hm) − Eout(hm)| > ε'

The union bound:

  P[B1 or B2 or ... or BM] ≤ P[B1] + P[B2] + ... + P[BM]   (no overlaps: M terms)

[Figure: Venn diagram of events B1, B2, B3 — the union bound ignores their overlaps.]
Can we improve on M? Yes — the bad events are very overlapping! For similar hypotheses, ΔEout is the change in the +1 and −1 areas and ΔEin is the change in the labels of the data points, so

  |Ein(h1) − Eout(h1)| ≈ |Ein(h2) − Eout(h2)|

What can we replace M with? Instead of the whole input space, consider a finite set of input points and count the number of dichotomies.
Dichotomies: mini-hypotheses

- A hypothesis: h : X → {−1, +1}
- A dichotomy: h : {x1, x2, ..., xN} → {−1, +1}
- The number of hypotheses |H| can be infinite
- The number of dichotomies |H(x1, x2, ..., xN)| is at most 2^N

The growth function counts the most dichotomies on any N points:

  mH(N) = max_{x1,...,xN ∈ X} |H(x1, ..., xN)|

The growth function satisfies mH(N) ≤ 2^N. Let's apply the definition.
Applying mH(N) to 2D perceptrons:

[Figure: for N = 3 points in general position, all 8 dichotomies are realizable, so mH(3) = 8. For N = 4, two dichotomies (the 'XOR' patterns) cannot be separated by a line, so mH(4) = 14 < 2⁴.]
Example 1: positive rays. H is the set of h : R → {−1, +1} of the form

  h(x) = sign(x − a)

(−1 to the left of a, +1 to the right). The N points split the line into N + 1 segments for a, so

  mH(N) = N + 1
Example 2: positive intervals. H is the set of h : R → {−1, +1} with h(x) = +1 inside an interval and −1 elsewhere. Choosing the two interval ends among N + 1 segment spots gives

  mH(N) = C(N+1, 2) + 1 = ½N² + ½N + 1
Example 3: convex sets. H is the set of h : R² → {−1, +1} with the h(x) = +1 region convex. Place the N points on a circle: any subset can be made the positive region of a convex set, so all 2^N dichotomies are realizable and

  mH(N) = 2^N

The N points are 'shattered' by convex sets.

The three growth functions:
- H is positive rays: mH(N) = N + 1
- H is positive intervals: mH(N) = ½N² + ½N + 1
- H is convex sets: mH(N) = 2^N
Back to the big picture. What happens if mH(N) replaces M in

  P[|Ein − Eout| > ε] ≤ 2M e^(−2ε²N)  ?

If mH(N) is polynomial — good! The exponential kills any polynomial.

Key notion: break point.
Break point of H

Definition: if no data set of size k can be shattered by H, then k is a break point for H:

  mH(k) < 2^k

For 2D perceptrons, k = 4 is a break point.

Break points of the three examples:
- Positive rays (mH(N) = N + 1): break point k = 2
- Positive intervals (mH(N) = ½N² + ½N + 1): break point k = 3
- Convex sets (mH(N) = 2^N): break point k = '∞'
Main result:
- No break point ⇒ mH(N) = 2^N
- Any break point ⇒ mH(N) is polynomial in N

Puzzle: with break point k = 2, how many dichotomies can you list on three points x1, x2, x3?
Review of Lecture 5

- Dichotomies: the growth function mH(N) = max_{x1,...,xN ∈ X} |H(x1, ..., xN)| is the maximum number of dichotomies on N points.
- Break point: the smallest k for which mH(k) < 2^k.

Lecture 6: Theory of Generalization

Outline:
- Proof that mH(N) is polynomial
- Proof that mH(N) can replace M

Bounding mH(N): to show mH(N) is polynomial, we show mH(N) ≤ (a polynomial). Key quantity:

  B(N, k) = the maximum number of dichotomies on N points, given break point k
Recursive bound on B(N, k)

Consider a table whose rows are the dichotomies on x1, ..., xN, with the last column xN separated out. Group the rows into S1 (patterns on x1, ..., x_{N−1} that appear with only one value of xN) and S2 (patterns that appear with both xN = +1 and xN = −1, split into S2+ and S2−). Let α = |S1| and β = |S2+| = |S2−|, so the number of rows is

  B(N, k) = α + 2β

Estimating α + β: the distinct patterns on x1, ..., x_{N−1} number α + β, and they cannot shatter any k of those points (else the full table would too), so

  α + β ≤ B(N−1, k)

Estimating β by itself: the S2 patterns on x1, ..., x_{N−1} cannot shatter any k−1 points — otherwise, adding xN (which appears with both values) would shatter k points. So

  β ≤ B(N−1, k−1)

Putting it together:

  B(N, k) = (α + β) + β ≤ B(N−1, k) + B(N−1, k−1)
Numerical computation of the B(N, k) bound: with the boundary values B(N, 1) = 1 and B(1, k) = 2 for k > 1, the recursion fills in a table:

  N\k   1   2   3   4   5   6
  1     1   2   2   2   2   2
  2     1   3   4   4   4   4
  3     1   4   7   8   8   8
  4     1   5  11   ...
  5     1   6   ...
  6     1   7   ...

Analytic bound (theorem):

  B(N, k) ≤ Σ_{i=0..k−1} C(N, i)

Proof by induction on N. 1. Boundary conditions: easy.
2. Recursion: assuming the bound for N − 1,

  B(N, k) ≤ B(N−1, k) + B(N−1, k−1)
          ≤ Σ_{i=0..k−1} C(N−1, i) + Σ_{i=0..k−2} C(N−1, i)
          = 1 + Σ_{i=1..k−1} [ C(N−1, i) + C(N−1, i−1) ]
          = 1 + Σ_{i=1..k−1} C(N, i)
          = Σ_{i=0..k−1} C(N, i)

using the Pascal identity C(N−1, i) + C(N−1, i−1) = C(N, i).
It is polynomial!

For a hypothesis set H with break point k:

  mH(N) ≤ B(N, k) ≤ Σ_{i=0..k−1} C(N, i)

The maximum power in this polynomial is N^(k−1).

Three examples:
- H is positive rays (break point k = 2): mH(N) = N + 1 ≤ N + 1
- H is positive intervals (break point k = 3): mH(N) = ½N² + ½N + 1 ≤ ½N² + ½N + 1
- H is 2D perceptrons (break point k = 4): mH(N) = ? ≤ ⅙N³ + ⅚N + 1
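The recursion and the binomial-sum bound are easy to check numerically (a small verification added here; computing the recursion with equality reproduces the table values, which coincide with the bound):

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B(N, k):
    """Table values from the recursion B(N,k) <= B(N-1,k) + B(N-1,k-1),
    with boundary conditions B(N,1) = 1 and B(1,k) = 2 for k > 1."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B(N - 1, k) + B(N - 1, k - 1)

def bound(N, k):
    """The analytic bound: sum of C(N, i) for i = 0 .. k-1."""
    return sum(comb(N, i) for i in range(k))
```

By the Pascal identity, the recursion computed with equality equals the binomial sum exactly, so e.g. B(3, 3) = 7 and B(4, 3) = 11 as in the table.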
Proof that mH(N) can replace M.

What we want — instead of

  P[|Ein(g) − Eout(g)| > ε] ≤ 2M e^(−2ε²N)

we want

  P[|Ein(g) − Eout(g)| > ε] ≤ 2 mH(N) e^(−2ε²N)

Pictorial proof:

[Figure: three panels in the space of data sets — (a) Hoeffding inequality: one hypothesis colors a small bad region; (b) union bound: many hypotheses color possibly overlapping regions, which the bound simply adds up; (c) VC bound: the overlaps are accounted for through dichotomies.]

What to do about Eout? Eout(h) is based on the full input space, so dichotomies cannot capture it directly. The trick: track Eout(h) by the in-sample error Ein′(h) on a second, independent sample — then dichotomies on the combined 2N points suffice.

Putting it together — not quite

  P[|Ein(g) − Eout(g)| > ε] ≤ 2 mH(N) e^(−2ε²N)

but rather the Vapnik-Chervonenkis inequality:

  P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−(1/8)ε²N)
Review of Lecture 6

The VC inequality:

  P[|Ein(g) − Eout(g)| > ε] ≤ 4 mH(2N) e^(−(1/8)ε²N)

mH(N) is polynomial when H has a break point k:

  mH(N) ≤ Σ_{i=0..k−1} C(N, i)   (maximum power N^(k−1))

[Figure: the B(N, k) table and the space-of-data-sets panels (a), (b), (c) from the pictorial proof.]
Lecture 7: The VC Dimension

Outline:
- The definition
- VC dimension of perceptrons
- Interpreting the VC dimension
- Generalization bounds

Definition: the VC dimension of a hypothesis set H, denoted dvc(H), is the largest value of N for which mH(N) = 2^N — 'the most points H can shatter'.

- N ≤ dvc(H) ⇒ H can shatter N points
- k > dvc(H) ⇒ k is a break point for H

Growth function: with break point k = dvc + 1,

  mH(N) ≤ Σ_{i=0..dvc} C(N, i)   (maximum power N^dvc)
Examples:
- H is positive rays: dvc = 1
- H is 2D perceptrons: dvc = 3
- H is convex sets: dvc = '∞'

VC dimension and learning: if dvc(H) is finite, g ∈ H will generalize — independent of the learning algorithm, independent of the input distribution, and independent of the target function.

[Diagram: the learning setup — only the hypothesis set H matters for this guarantee.]
VC dimension of perceptrons. For d = 2, dvc = 3. In general,

  dvc = d + 1

We will prove dvc ≥ d + 1 and dvc ≤ d + 1.

For the first direction, take a specific set of d + 1 points: the rows of

  X = [ x1ᵀ ; x2ᵀ ; ... ; x_{d+1}ᵀ ]
    = [ 1 0 0 ... 0 ;
        1 1 0 ... 0 ;
        1 0 1 ... 0 ;
        ... ;
        1 0 ... 0 1 ]

X is invertible. For any y = (y1, ..., y_{d+1})ᵀ with yn = ±1, can we find w satisfying sign(Xw) = y? Just make Xw = y, which means

  w = X⁻¹y
We can shatter these d + 1 points, which implies:
[a] dvc = d + 1  [b] dvc ≥ d + 1  [c] dvc ≤ d + 1  [d] No conclusion
(Answer: [b].)

Now, to show dvc ≤ d + 1, we need to show:
[a] There are d + 1 points we cannot shatter
[b] There are d + 2 points we cannot shatter
[c] We cannot shatter any set of d + 1 points
[d] We cannot shatter any set of d + 2 points
(Answer: [d].)
Take any d + 2 points x1, ..., x_{d+2}. More points than dimensions, so they are linearly dependent: we must have

  xj = Σ_{i≠j} ai xi

where not all the ai's are zero.

So? Consider the dichotomy yi = sign(ai) for the points with ai ≠ 0, and yj = −1. No perceptron can implement it. Why? If yi = sign(ai), then ai wᵀxi > 0, and

  wᵀxj = Σ_{i≠j} ai wᵀxi > 0

Therefore yj = sign(wᵀxj) = +1, contradicting yj = −1.

Putting it together: we proved dvc ≤ d + 1 and dvc ≥ d + 1, so

  dvc = d + 1

What is d + 1 in the perceptron? It is the number of parameters w0, w1, ..., wd.
Interpreting the VC dimension

1. Degrees of freedom. Parameters create degrees of freedom: the number of parameters gives the analog degrees of freedom; dvc gives the equivalent 'binary' degrees of freedom.

Examples: positive rays (dvc = 1) have one parameter; positive intervals (dvc = 2) have two parameters.

[Figure: the positive-ray and positive-interval hypotheses on the line, h(x) = −1 / +1 regions with their parameters marked.]

But it is not just parameters: parameters may fail to contribute degrees of freedom. dvc measures the effective number of parameters.
2. Number of data points needed. The quantities in the VC inequality

  P[|Ein − Eout| > ε] ≤ 4 mH(2N) e^(−(1/8)ε²N)

are governed by N^d e^(−N).

[Figure: N^d e^(−N) versus N for several d — the N needed to bring the bound down grows roughly in proportion to dvc.]

Rule of thumb:

  N ≥ 10 dvc
Generalization bounds — rearranging things. Start from the VC inequality and set δ = 4 mH(2N) e^(−(1/8)ε²N); get ε in terms of δ:

  ε = √( (8/N) ln(4 mH(2N)/δ) ) = Ω(N, H, δ)

With probability ≥ 1 − δ,  |Eout − Ein| ≤ Ω(N, H, δ), and in particular, with probability ≥ 1 − δ,

  Eout ≤ Ein + Ω
Review of Lecture 7

- VC dimension dvc(H): the most points H can shatter.
- Scope of VC analysis: independent of the learning algorithm, the input distribution, and the target function.
- Utility: N grows with dvc; rule of thumb N ≥ 10 dvc.
- Generalization bound: Eout ≤ Ein + Ω.

Lecture 8: Bias-Variance Tradeoff

Outline:
- Bias and variance
- Learning curves
Approximation-generalization tradeoff. Small Eout means a good approximation of f out of sample. A more complex H gives a better chance of approximating f; a less complex H gives a better chance of generalizing out of sample. VC analysis captured this as Eout ≤ Ein + Ω; bias-variance analysis instead decomposes Eout into how well H can approximate f, plus how well we can zoom in on a good h ∈ H. It applies to real-valued targets and uses squared error.
Start with Eout for a hypothesis g^(D) learned from data set D:

  Eout(g^(D)) = Ex[ (g^(D)(x) − f(x))² ]

Average over data sets:

  ED[ Eout(g^(D)) ] = ED[ Ex[ (g^(D)(x) − f(x))² ] ]
                    = Ex[ ED[ (g^(D)(x) − f(x))² ] ]

Now focus on ED[(g^(D)(x) − f(x))²]. Define the 'average hypothesis' ḡ(x) = ED[g^(D)(x)]. Then

  ED[(g^(D)(x) − f(x))²]
  = ED[(g^(D)(x) − ḡ(x) + ḡ(x) − f(x))²]
  = ED[ (g^(D)(x) − ḡ(x))² + (ḡ(x) − f(x))² + 2(g^(D)(x) − ḡ(x))(ḡ(x) − f(x)) ]
  = ED[(g^(D)(x) − ḡ(x))²] + (ḡ(x) − f(x))²

(the cross term vanishes since ED[g^(D)(x) − ḡ(x)] = 0).
Bias and variance:

  ED[(g^(D)(x) − f(x))²] = ED[(g^(D)(x) − ḡ(x))²] + (ḡ(x) − f(x))²
                         =        var(x)          +     bias(x)

Therefore,

  ED[ Eout(g^(D)) ] = Ex[ bias(x) + var(x) ] = bias + var

where

  bias = Ex[ (ḡ(x) − f(x))² ],   var = Ex[ ED[ (g^(D)(x) − ḡ(x))² ] ]

The tradeoff: a small H has large bias but small variance; a large H that reaches f has small bias but large variance.

[Figure: a small H as a point far from f, versus a large H as a wide region around f.]
Example: sine target. f(x) = sin(πx), with only N = 2 training examples. Two models:
- H0: h(x) = b
- H1: h(x) = ax + b

Approximation — H0 versus H1 (fitting the whole curve): H0's best constant has Eout = 0.50; H1's best line has Eout = 0.20.

[Figure: the best constant and the best line fit to sin(πx).]
Learning — H0 versus H1 (fitting two random points, repeated over many data sets):

[Figure: for each data set D of two points, H0 yields a horizontal line and H1 the line through both points.]

Bias and variance for H0: ḡ(x) ≈ 0, with a narrow gray band of variation around it.

Bias and variance for H1: ḡ(x) tracks sin(πx) more closely, but the gray band of variation is wide — individual lines swing wildly.

[Figure: ḡ(x) with variation bands for H0 and H1, plotted against sin(πx).]
And the winner is ...

  H0: bias = 0.50, var = 0.25   →  expected Eout = 0.75
  H1: bias = 0.21, var = 1.69   →  expected Eout = 1.90

The winner is H0!

Lesson learned: match the 'model complexity' to the data resources, not to the target complexity.
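The H0-versus-H1 numbers can be reproduced by simulation (a sketch added here; the grid size and number of data sets are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)

x_test = np.linspace(-1, 1, 201)
n_datasets = 10000

# For each 2-point data set: H0 fits the constant b = (y1 + y2)/2,
# H1 fits the line through the two points.
g0 = np.empty((n_datasets, x_test.size))
g1 = np.empty((n_datasets, x_test.size))
for t in range(n_datasets):
    x1, x2 = rng.uniform(-1, 1, 2)
    y1, y2 = target(x1), target(x2)
    g0[t] = (y1 + y2) / 2
    a = (y2 - y1) / (x2 - x1)
    g1[t] = a * (x_test - x1) + y1

def bias_var(g):
    gbar = g.mean(axis=0)                        # the average hypothesis
    bias = np.mean((gbar - target(x_test)) ** 2)
    var = np.mean(g.var(axis=0))
    return bias, var

bias0, var0 = bias_var(g0)   # roughly 0.50, 0.25
bias1, var1 = bias_var(g1)   # roughly 0.21, 1.69
```

The simulated totals show bias0 + var0 well below bias1 + var1, matching the lecture's conclusion that H0 wins.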
Learning curves

How do the expected Eout and expected Ein vary with the number of data points N?

[Figure: learning curves — for a simple model, Ein and Eout start close together and converge quickly; for a complex model, Ein starts lower but the gap to Eout is larger and closes more slowly.]

Two analyses of the expected error:
- VC analysis: the gap between the Eout and Ein curves is the generalization error, bounded by Ω.
- Bias-variance: the bias is the asymptote of the curves; the variance is the gap above it.

[Figure: the same learning curves annotated by each analysis — 'generalization error' / 'in-sample error' for VC; 'bias' / 'variance' for bias-variance.]
Learning curves for linear regression. Noisy linear target; the best approximation error is σ². Exact expectations:

  Expected in-sample error     = σ² (1 − (d+1)/N)
  Expected out-of-sample error = σ² (1 + (d+1)/N)
  Expected generalization error = 2σ² (d+1)/N

[Figure: Ein approaching σ² from below and Eout from above as N grows past d + 1.]
Review of Lecture 8

- Bias-variance: Eout = bias + var; the bias measures how well H can approximate f (via ḡ), the variance how much g^(D) varies with the data set.
- Learning curves: how Ein and Eout vary with N. VC analysis bounds the generalization error; bias-variance splits Eout into bias plus variance.

[Figure: H relative to f in the bias-variance picture, and the annotated learning curves.]
Lecture 9: The Linear Model II

Outline:
- Where we are: linear classification, linear regression, logistic regression, nonlinear transforms
- Logistic regression
- Nonlinear transforms (recap)

Nonlinear transforms: each z is computed from x, z = Φ(x). The final hypothesis g(x), in X space, is

  sign(w̃ᵀΦ(x))   or   w̃ᵀΦ(x)

The price we pay: dvc. In X space, dvc = d + 1; after transforming to a d̃-dimensional Z space, we only get dvc ≤ d̃ + 1.
Two non-separable cases:

[Figure: (1) data that is almost linearly separable, with a few outliers; (2) genuinely nonlinear data — +1 inside a ring, −1 outside.]

First case: either use a linear model in X and accept Ein > 0, or insist on Ein = 0 and go to a high-dimensional Z space (and pay for it in dvc).
Second case: why not

  z = (1, x1², x2²)

or better yet z = (1, x1² + x2²), or even z = (x1² + x2² − 0.6)?

Lesson learned: looking at the data before choosing the model can hurt Eout. Choosing a transform after seeing the data is data snooping — the apparent savings in dvc are an illusion.
Logistic regression — the model

All three linear models share the signal s = Σ_{i=0..d} wi xi = wᵀx:
- Linear classification: h(x) = sign(s)
- Linear regression: h(x) = s
- Logistic regression: h(x) = θ(s)

The logistic function:

  θ(s) = e^s / (1 + e^s)

a 'soft threshold' running from 0 to 1, with θ(0) = ½.

[Figure: the sigmoidal curve θ(s) versus s.]
Probability interpretation: h(x) = θ(s) is interpreted as a probability. Example: medical risk prediction. For input x (patient attributes), the signal s = wᵀx acts as a 'risk score', and h(x) = θ(s) is the probability of the adverse event.
Genuine probability. The data (x, y) has binary y, generated by a noisy target:

  P(y | x) = f(x)      for y = +1;
           = 1 − f(x)  for y = −1.

The target f : R^d → [0, 1] is itself a probability; we learn g(x) = θ(wᵀx) ≈ f(x).

Error measure: for each (x, y), y is generated by the probability f(x). A plausible error measure is based on likelihood: if h = f, how likely is it to get y from x?

  P(y | x) = h(x)      for y = +1;
           = 1 − h(x)  for y = −1.

Substitute h(x) = θ(wᵀx), noting that θ(−s) = 1 − θ(s):

  P(y | x) = θ(y wᵀx)

Likelihood of D = (x1, y1), ..., (xN, yN):

  Π_{n=1..N} P(yn | xn) = Π_{n=1..N} θ(yn wᵀxn)
Maximizing the likelihood is equivalent to minimizing

  −(1/N) ln( Π_{n=1..N} θ(yn wᵀxn) ) = (1/N) Σ_{n=1..N} ln( 1/θ(yn wᵀxn) )

Using θ(s) = 1/(1 + e^(−s)),

  Ein(w) = (1/N) Σ_{n=1..N} ln( 1 + e^(−yn wᵀxn) )

with pointwise error e(h(xn), yn) = ln(1 + e^(−yn wᵀxn)) — the 'cross-entropy' error.
Logistic regression — the learning algorithm. How to minimize Ein?

For logistic regression,

  Ein(w) = (1/N) Σ_{n=1..N} ln(1 + e^(−yn wᵀxn))   →  iterative solution

Compare with linear regression,

  Ein(w) = (1/N) Σ_{n=1..N} (wᵀxn − yn)²   →  closed-form solution
Gradient descent: a general method for nonlinear optimization. Start at w(0); take a step along the steepest slope:

  w(1) = w(0) + η v̂

[Figure: the error surface Ein(w) over the weights, with a ball rolling downhill.]

In what direction? For small η,

  ΔEin = Ein(w(0) + η v̂) − Ein(w(0))
       = η ∇Ein(w(0))ᵀ v̂ + O(η²)
       ≥ −η ‖∇Ein(w(0))‖

Since v̂ is a unit vector, the steepest descent direction is

  v̂ = −∇Ein(w(0)) / ‖∇Ein(w(0))‖
Fixed-size step? How η affects the algorithm:

[Figure: η too small — painfully slow descent; η too large — wild bouncing; a variable η, large far from the minimum and small near it, is just right.]

In practice, make the step size proportional to the gradient: η ‖∇Ein‖.
Easy implementation. Instead of

  Δw = η v̂ = −η ∇Ein(w(0)) / ‖∇Ein(w(0))‖

have

  Δw = −η ∇Ein(w(0))

with a fixed learning rate η — the effective step then scales automatically with the gradient.
Logistic regression algorithm:

1: Initialize the weights at t = 0 to w(0)
2: for t = 0, 1, 2, ... do
3:   Compute the gradient
       ∇Ein = −(1/N) Σ_{n=1..N}  yn xn / (1 + e^(yn w(t)ᵀxn))
4:   Update the weights: w(t+1) = w(t) − η ∇Ein
5:   Iterate to the next step until it is time to stop
6: Return the final weights w
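The algorithm above translates directly into code (a minimal sketch added here; the toy data, learning rate, and iteration count are my own choices):

```python
import numpy as np

def logistic_ein(w, X, y):
    """Cross-entropy in-sample error."""
    return np.mean(np.log(1 + np.exp(-y * (X @ w))))

def logistic_regression(X, y, eta=0.1, iters=5000):
    """Fixed-rate gradient descent on the cross-entropy error."""
    w = np.zeros(X.shape[1])
    N = len(y)
    for _ in range(iters):
        # grad Ein = -(1/N) sum_n  y_n x_n / (1 + exp(y_n w.x_n))
        coeff = y / (1 + np.exp(y * (X @ w)))
        grad = -(1 / N) * (coeff[:, None] * X).sum(axis=0)
        w = w - eta * grad
    return w

# Toy data: y = +1 when x1 + x2 > 0 (leading column is x0 = 1)
rng = np.random.default_rng(1)
Xraw = rng.uniform(-1, 1, size=(200, 2))
X = np.column_stack([np.ones(200), Xraw])
y = np.where(Xraw.sum(axis=1) > 0, 1.0, -1.0)
w = logistic_regression(X, y)
preds = np.sign(X @ w)
```

On this separable toy set, the cross-entropy error drops steadily and the resulting classifier sign(wᵀx) labels nearly all training points correctly.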
Summary of linear models — credit analysis:
- Approve or deny: perceptron — classification error — PLA, pocket, ...
- Amount of credit: linear regression — squared error — pseudo-inverse
- Probability of default: logistic regression — cross-entropy error — gradient descent
Review of Lecture 9

- Logistic regression: h(x) = θ(s) with signal s = wᵀx.
- Likelihood measure: Π_{n=1..N} P(yn | xn) = Π_{n=1..N} θ(yn wᵀxn).
- Gradient descent: initialize w(0); for t = 0, 1, 2, ... (to termination), step along −∇Ein; return the final w.

[Figure: the logistic unit and the error surface being descended.]
Lecture 10: Neural Networks

Outline:
- Stochastic gradient descent
- Neural network model
- Backpropagation algorithm

Stochastic gradient descent. GD minimizes

  Ein(w) = (1/N) Σ_{n=1..N} e(h(xn), yn)

(in logistic regression, e(h(xn), yn) = ln(1 + e^(−yn wᵀxn))) by iterative steps along −∇Ein:

  Δw = −η ∇Ein(w)

∇Ein is based on all examples (xn, yn): 'batch' GD.
The stochastic aspect: pick one (xn, yn) at a time, and apply GD to e(h(xn), yn). The 'average' direction is

  En[ −∇e(h(xn), yn) ] = −(1/N) Σ_{n=1..N} ∇e(h(xn), yn) = −∇Ein

so this randomized version of GD — stochastic gradient descent (SGD) — moves in the right direction on average.
Benefits of SGD:
1. cheaper computation
2. randomization (helps escape shallow local minima and flat regions)
3. simple

Rule of thumb: η = 0.1 works.

[Figure: an error surface with many shallow local minima — randomization helps.]
SGD in action. Remember the movie ratings? Each user i and movie j get K-dimensional factor vectors ui and vj, and the error on a known rating rij is

  eij = ( rij − Σ_{k=1..K} uik vjk )²

SGD: pick one rating at a time and descend the gradient of eij with respect to ui and vj.

[Figure: the user-factor and movie-factor vectors meeting at a rating.]
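The per-rating SGD update can be sketched on synthetic data (my own construction — a noiseless rank-3 rating matrix; sizes, learning rate, and epoch count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ratings from hidden factors: r_ij = u_i . v_j, with K = 3
n_users, n_movies, K = 20, 15, 3
U_true = rng.normal(size=(n_users, K))
V_true = rng.normal(size=(n_movies, K))
R = U_true @ V_true.T

# SGD on e_ij = (r_ij - u_i . v_j)^2, one rating at a time
U = 0.1 * rng.normal(size=(n_users, K))
V = 0.1 * rng.normal(size=(n_movies, K))
eta = 0.02
pairs = [(i, j) for i in range(n_users) for j in range(n_movies)]
for epoch in range(200):
    for idx in rng.permutation(len(pairs)):      # random order each epoch
        i, j = pairs[idx]
        err = R[i, j] - U[i] @ V[j]
        # gradient steps for u_i and v_j (simultaneous update)
        U[i], V[j] = U[i] + eta * err * V[j], V[j] + eta * err * U[i]

mse = np.mean((R - U @ V.T) ** 2)
```

After a couple hundred epochs the reconstruction error is small, even though each step only looks at a single rating.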
Neural network model

Biological inspiration: biological function inspired a biological structure, and neurons and synapses inspired the neural network model — but the analogy should not be pushed too far.
Combining perceptrons:

[Figure: two linear boundaries h1, h2 combine to implement the XOR-like target f = h1h̄2 + h̄1h2.]

  OR(x1, x2) = sign(x1 + x2 + 1.5),   AND(x1, x2) = sign(x1 + x2 − 1.5)

Creating layers: f = h1h̄2 + h̄1h2 is built from AND and OR units arranged in layers.

[Figure: the layered network implementing f from h1 and h2.]
The multilayer perceptron:

[Figure: a 3-layer feedforward network computing f from x1, x2 through w1ᵀx and w2ᵀx and the combining layers.]

A powerful model: a target region such as a circle can be approximated arbitrarily well, e.g. with 8 perceptrons or 16 perceptrons.

2 red flags: generalization (many parameters) and optimization (a hard combinatorial problem with hard thresholds).
The neural network: layers of units with a soft threshold.

[Figure: input x = (x1, ..., xd) feeding hidden layers, producing the output h(x).]

  θ(s) = tanh(s) = (e^s − e^(−s)) / (e^s + e^(−s))

(tanh interpolates between linear behavior for small |s| and a hard threshold for large |s|.)

How the network operates, with layers 1 ≤ l ≤ L, inputs 0 ≤ i ≤ d^(l−1), and outputs 1 ≤ j ≤ d^(l):

  x_j^(l) = θ( s_j^(l) ) = θ( Σ_{i=0..d^(l−1)} w_ij^(l) x_i^(l−1) )

Apply x to the input layer x_1^(0), ..., x_{d^(0)}^(0), propagate forward, and read off h(x) = x_1^(L).
Backpropagation — applying SGD. All the weights w = {w_ij^(l)} determine h(x). The error on example (xn, yn) is e(h(xn), yn) = e(w). To implement SGD, we need the gradient ∇e(w): ∂e(w)/∂w_ij^(l) for all i, j, l.

Computing ∂e(w)/∂w_ij^(l): we can evaluate it one weight at a time, analytically or numerically — but there is a trick for efficient computation. Since w_ij^(l) affects e only through s_j^(l), and ∂s_j^(l)/∂w_ij^(l) = x_i^(l−1),

  ∂e(w)/∂w_ij^(l) = (∂e(w)/∂s_j^(l)) × (∂s_j^(l)/∂w_ij^(l)) = δ_j^(l) x_i^(l−1)

We only need δ_j^(l) = ∂e(w)/∂s_j^(l).
δ for the final layer: for l = L and j = 1,

  δ_1^(L) = ∂e(w)/∂s_1^(L)

With squared error e(w) = (x_1^(L) − yn)² and x_1^(L) = θ(s_1^(L)),

  δ_1^(L) = 2 (x_1^(L) − yn) θ′(s_1^(L))

where, for θ = tanh, θ′(s) = 1 − θ²(s).
Back propagation of δ: by the chain rule,

  δ_i^(l−1) = ∂e(w)/∂s_i^(l−1)
            = Σ_{j=1..d^(l)} (∂e(w)/∂s_j^(l)) × (∂s_j^(l)/∂x_i^(l−1)) × (∂x_i^(l−1)/∂s_i^(l−1))
            = Σ_{j=1..d^(l)} δ_j^(l) × w_ij^(l) × θ′(s_i^(l−1))

so, with θ = tanh,

  δ_i^(l−1) = (1 − (x_i^(l−1))²) Σ_{j=1..d^(l)} w_ij^(l) δ_j^(l)
Backpropagation algorithm:

1: Initialize all weights w_ij^(l) at random
2: for t = 0, 1, 2, ... do
3:   Pick n ∈ {1, 2, ..., N}
4:   Forward: compute all x_j^(l)
5:   Backward: compute all δ_j^(l)
6:   Update the weights: w_ij^(l) ← w_ij^(l) − η x_i^(l−1) δ_j^(l)
7:   Iterate to the next step until it is time to stop
8: Return the final weights w_ij^(l)

Final remark: the hidden layers act as a learned nonlinear transform of the input.

[Figure: the network with its hidden layers interpreted as a learned feature transform.]
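The forward/backward passes can be sketched for a tiny network and checked against a numerical gradient (my own toy sizes and weight layout; the bias unit x0 = 1 is stored as the first entry of each layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 2-layer tanh network: d(0) = 2 inputs, d(1) = 3 hidden units, 1 output.
# W[l] has shape (d(l-1)+1, d(l)); row 0 multiplies the bias x_0 = 1.
W = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(4, 1))]

def forward(W, x):
    xs = [np.concatenate(([1.0], x))]            # x^(0) with bias
    for l, Wl in enumerate(W):
        s = xs[-1] @ Wl
        x_next = np.tanh(s)                      # theta = tanh
        if l < len(W) - 1:
            x_next = np.concatenate(([1.0], x_next))
        xs.append(x_next)
    return xs

def backprop(W, x, y):
    """Gradients of e = (h(x) - y)^2 via delta back propagation."""
    xs = forward(W, x)
    h = xs[-1][0]
    delta = np.array([2 * (h - y) * (1 - h ** 2)])        # output delta
    grads = [None] * len(W)
    for l in reversed(range(len(W))):
        grads[l] = np.outer(xs[l], delta)                 # x_i^(l-1) * delta_j^(l)
        if l > 0:
            hidden = xs[l][1:]                            # drop the bias unit
            delta = (1 - hidden ** 2) * (W[l][1:] @ delta)
    return grads

# Check one analytic gradient entry against a central finite difference
x, y = np.array([0.3, -0.7]), 1.0
g = backprop(W, x, y)
eps = 1e-6
Wp = [w.copy() for w in W]; Wp[0][1, 2] += eps
Wm = [w.copy() for w in W]; Wm[0][1, 2] -= eps
num = ((forward(Wp, x)[-1][0] - y) ** 2
       - (forward(Wm, x)[-1][0] - y) ** 2) / (2 * eps)
```

The numerical check is the standard way to validate a backpropagation implementation before trusting it inside SGD.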
Review of Lecture 10

- Multilayer perceptrons: layered networks of simple units; a powerful model.
- Neural networks: x_j^(l) = θ( Σ_{i=0..d^(l−1)} w_ij^(l) x_i^(l−1) ), hidden layers 1 ≤ l < L, output layer l = L, h(x) = x_1^(L).
- Backpropagation: SGD with Δw_ij^(l) = −η x_i^(l−1) δ_j^(l), where

  δ_i^(l−1) = (1 − (x_i^(l−1))²) Σ_{j=1..d^(l)} w_ij^(l) δ_j^(l)
Lecture 11: Overfitting

Outline:
- What is overfitting?
- The role of noise
- Deterministic noise
- Dealing with overfitting

Illustration of overfitting: a simple target with 5 noisy data points, fit with a 4th-order polynomial:

  Ein = 0,  Eout is huge

[Figure: the data, the target, and the wildly oscillating 4th-order fit.]
Overfitting also occurs within a single model as training proceeds. In a neural network:

[Figure: Ein keeps decreasing over the training epochs while Eout bottoms out and then rises — the point to stop is 'early stopping'.]

Overfitting: fitting the data more than is warranted. Culprit: fitting the noise — harmful.
Case study: two targets.
- A 10th-order target with noisy data
- A 50th-order target with noiseless data

Fit each with 2nd-order and 10th-order polynomials.

Noisy low-order target:
        2nd Order   10th Order
  Ein     0.050       0.034
  Eout    0.127       9.00

Noiseless high-order target:
        2nd Order   10th Order
  Ein     0.029       10⁻⁵
  Eout    0.120       7680

[Figure: the data with the 2nd-order and 10th-order fits for both targets.]
An irony of two learners. Learners O and R both try to learn a 10th-order target: O chooses H10 (matching the target order), R chooses H2. With limited noisy data, R wins!

[Figure: the two fits on the 10th-order target data.]
[Figure: learning curves for H2 and H10 — for small N, the expected Eout of H10 exceeds that of H2 even though its Ein is smaller; the gray region marks where overfitting occurs.]

Even without stochastic noise, the same irony appears when learning a 50th-order noiseless target: the 10th-order fit has tiny Ein and enormous Eout.

[Figure: 2nd-order versus 10th-order fits to the 50th-order target.]
A detailed experiment — impact of noise level and target complexity:

  y = f(x) + σ ε(x) = Σ_{q=0..Qf} αq x^q + σ ε(x)

with normalized target, noise level σ², target complexity Qf, and data set size N.

Fit the data with g2 ∈ H2 (2nd-order polynomials) and g10 ∈ H10 (10th-order polynomials), and define the

  overfit measure:  Eout(g10) − Eout(g2)
The results:

[Figure: two maps of the overfit measure — noise level σ² versus N (with Qf fixed), and target complexity Qf versus N (with σ² fixed). Positive overfit dominates at small N, high σ², and high Qf.]

Impact of 'noise': the stochastic noise σ² and the target complexity Qf both drive overfitting — more data means less overfitting; more noise means more overfitting; more target complexity means more overfitting. Target complexity acts like a noise of its own: deterministic noise.
Definition of deterministic noise: the part of f that H cannot capture, f(x) − h*(x), where h* is the best approximation to f within H.

[Figure: f and h* — the shaded difference between them is the deterministic noise.]

Why 'noise'? Because H cannot model it, it might as well be random to the learner. Main differences with stochastic noise:
1. it depends on H
2. it is fixed for a given x
Impact on overfitting: with finite N, H tries to fit the noise — both stochastic and deterministic — and fitting the noise is harmful.

[Figure: the Qf-versus-N overfit map again — the shaded region shows how much overfit.]
Noise and bias-variance. Recall

  ED[(g^(D)(x) − f(x))²] = ED[(g^(D)(x) − ḡ(x))²] + (ḡ(x) − f(x))² = var(x) + bias(x)

What if f is a noisy target, y = f(x) + ε(x) with E[ε(x)] = 0?

  ED,ε[(g^(D)(x) − y)²] = ED,ε[(g^(D)(x) − f(x) − ε(x))²]
                        = ED,ε[(g^(D)(x) − ḡ(x) + ḡ(x) − f(x) − ε(x))²]
                        = ED,ε[(g^(D)(x) − ḡ(x))² + (ḡ(x) − f(x))² + (ε(x))² + cross terms]

Taking the expectation over x as well:

  ED,x[(g^(D)(x) − ḡ(x))²] + Ex[(ḡ(x) − f(x))²] + Eε,x[(ε(x))²]
  =        var              +       bias          +      σ²
                             (deterministic noise)  (stochastic noise)
Two cures:
- Regularization: putting the brakes
- Validation: checking the bottom line

[Figure: the 5-point example again — the free 4th-order fit oscillates wildly; a restrained fit stays near the target.]
Review of Lecture 11

- Overfitting: fitting the data more than is warranted.
- Deterministic noise: the part of f that H cannot capture; like stochastic noise, fitting it hurts Eout.

[Figure: the overfit map over (Qf, N), and the free versus restrained fits.]
Lecture 12: Regularization

Outline:
- Regularization — informal
- Regularization — formal
- Weight decay
- Choosing a regularizer

Heuristic approach: handicapping the minimization of Ein.
A familiar example (the sine target with two points, fit with lines):

[Figure: without regularization the fitted lines swing wildly; with regularization they cluster near the target.]

The corresponding bias and variance:

  without regularization: bias = 0.21, var = 1.69
  with regularization:    bias = 0.23, var = 0.33

A slight increase in bias buys a dramatic decrease in variance.
The polynomial model: HQ = polynomials of order Q, expressed as linear regression in Z space with

  z = (1, L1(x), L2(x), ..., LQ(x)),   HQ = { Σ_{q=0..Q} wq Lq(x) }

where Lq are the Legendre polynomials:

  L1 = x,  L2 = ½(3x² − 1),  L3 = ½(5x³ − 3x),  L4 = ⅛(35x⁴ − 30x² + 3),  L5 = ⅛(63x⁵ − 70x³ + 15x), ...
Unconstrained solution. Given the data in Z space, minimize

  Ein(w) = (1/N) Σ_{n=1..N} (wᵀzn − yn)² = (1/N) (Zw − y)ᵀ(Zw − y)

which gives wlin = (ZᵀZ)⁻¹Zᵀy.

Constraining the weights. Hard constraint: H2 is a constrained version of H10, with wq = 0 for q > 2. Softer version:

  Σ_{q=0..Q} wq² ≤ C   ('soft-order' constraint)

Minimize (1/N)(Zw − y)ᵀ(Zw − y) subject to wᵀw ≤ C. Solution: wreg instead of wlin.
Solving for w_reg

Minimize E_in(w) = (1/N)(Zw - y)ᵀ(Zw - y) subject to wᵀw ≤ C.

At the constrained minimum, ∇E_in(w_reg) is normal to the constraint surface
wᵀw = C, pointing opposite to w_reg:

    ∇E_in(w_reg) = -2(λ/N) w_reg    for some λ > 0
    ∇E_in(w_reg) + 2(λ/N) w_reg = 0

so w_reg also minimizes   E_in(w) + (λ/N) wᵀw.
Augmented error

Minimizing   E_aug(w) = E_in(w) + (λ/N) wᵀw   unconditionally
solves minimizing   E_in(w) = (1/N)(Zw - y)ᵀ(Zw - y)   subject to wᵀw ≤ C,
with a correspondence between C and λ. This echoes the VC formulation.
The solution

Minimize E_aug(w) = (1/N)[ (Zw - y)ᵀ(Zw - y) + λ wᵀw ].  Setting ∇E_aug(w) = 0:

    Zᵀ(Zw - y) + λw = 0

    w_reg = (ZᵀZ + λI)⁻¹Zᵀy     (with regularization)
    w_lin = (ZᵀZ)⁻¹Zᵀy          (without regularization)
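The two closed forms differ only by the λI term. A minimal numpy sketch (function name and data are made up for illustration):

```python
import numpy as np

def linear_reg(Z, y, lam=0.0):
    """w = (Z'Z + lam*I)^(-1) Z'y; lam=0 recovers the unregularized w_lin."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

rng = np.random.default_rng(1)
Z = rng.standard_normal((20, 5))
y = rng.standard_normal(20)

w_lin = linear_reg(Z, y)           # without regularization
w_reg = linear_reg(Z, y, lam=2.0)  # with regularization

# Regularization shrinks the solution.
print(np.linalg.norm(w_reg), "<", np.linalg.norm(w_lin))
```

The regularized solution also satisfies the normal equation Zᵀ(Zw - y) + λw = 0 exactly.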
The result

Minimizing E_in(w) + (λ/N) wᵀw for different λ's:

[Figure: four fits of the same data (Data, Target, Fit) for λ = 0, 0.0001, 0.01, 1,
going from overfitting at λ = 0 to underfitting at λ = 1.]
Weight decay

Minimizing E_in(w) + (λ/N) wᵀw is called weight decay. Why? In gradient descent,

    w(t+1) = w(t) - η ∇[ E_in(w(t)) + (λ/N) w(t)ᵀw(t) ]
           = w(t)(1 - 2ηλ/N) - η ∇E_in(w(t))

so the weights 'decay' by the factor (1 - 2ηλ/N) before each gradient step. The same
applies in neural networks:

    wᵀw = Σ_{l=1}^{L} Σ_{i=0}^{d^(l-1)} Σ_{j=1}^{d^(l)} ( w_ij^(l) )²
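The 'decay' reading can be checked directly: one gradient step on the augmented error equals shrinking w and then taking a plain gradient step on E_in. A small numpy check with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
N, lam, eta = len(y), 0.5, 0.05

def grad_Ein(w):
    # gradient of (1/N)(Zw - y)'(Zw - y)
    return 2.0 / N * Z.T @ (Z @ w - y)

w = rng.standard_normal(4)

# One step on the augmented error E_in(w) + (lam/N) w'w ...
step_aug = w - eta * (grad_Ein(w) + 2.0 * lam / N * w)
# ... equals: decay the weights, then step on E_in alone.
step_decay = w * (1 - 2.0 * eta * lam / N) - eta * grad_Ein(w)

print(np.allclose(step_aug, step_decay))
```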
Variations of weight decay

Emphasis of certain weights:   Σ_{q=0}^{Q} γ_q w_q²

Examples:  γ_q = 2^q gives a low-order fit;  γ_q = 2^{-q} gives a high-order fit.

Tikhonov regularizer:  wᵀΓᵀΓw
Even weight growth helps a bit!

[Figure: expected E_out versus the regularization parameter λ for weight decay and
for 'weight growth'; weight decay wins.]

Practical rule: stochastic noise is 'high-frequency', and deterministic noise is also
non-smooth, so constrain learning towards smoother hypotheses.
General form of the augmented error

Calling the regularizer Ω = Ω(h), we minimize

    E_aug(h) = E_in(h) + (λ/N) Ω(h)

Rings a bell?  Compare the VC bound:  E_out(h) ≤ E_in(h) + Ω(N, H, δ).

E_aug is better than E_in as a proxy for E_out.
Outline
Regularization - informal
Regularization - formal
Weight decay
Choosing a regularizer

Choosing a regularizer (going in circles?)

The ideal regularizer restricts learning in the 'direction' of the target function.
Guiding principle: move in the direction of smoother or simpler hypotheses.
Chose a bad Ω? We still have λ!
Neural-network regularizers

Weight decay: from linear to logical. With small weights, tanh is nearly linear; with
very large weights, it approaches a hard threshold.

Weight elimination: fewer weights = smaller VC dimension. Soft weight elimination:

    Ω(w) = Σ_{i,j,l}  ( w_ij^(l) )² / ( β² + ( w_ij^(l) )² )
Early stopping

[Figure: Error versus epochs. E_in keeps decreasing while E_out turns back up, so we
stop early. When to stop is determined by validation.]
The optimal λ

[Figure: expected E_out versus the regularization parameter λ.
Stochastic noise: curves for σ² = 0, 0.25, 0.5 at Qf = 100; more noise calls for more
regularization.
Deterministic noise: curves for Qf = 15, 30, 100; more target complexity likewise.]
Review of Lecture 12

Regularization: constrained → unconstrained.
    Minimize E_in subject to wᵀw ≤ C   ⇔   minimize E_aug(h) = E_in(h) + (λ/N) Ω(h)
Choosing a regularizer Ω(h): a heuristic; prefer smooth, simple h.
Most used: weight decay.

[Figure: fits at λ = 0.0001 and λ = 1.0, and the geometry of ∇E_in normal to the
constraint surface wᵀw = C.]
Yaser S. Abu-Mostafa

Lecture 13: Validation

Outline
The validation set
Model selection
Cross validation

Validation versus regularization

In both cases:  E_out(h) = E_in(h) + overfit penalty.

Regularization:  E_out(h) = E_in(h) + [overfit penalty]  (regularization estimates
this quantity).
Validation:  [E_out(h)] = E_in(h) + overfit penalty  (validation estimates this
quantity directly).
Analyzing the estimate

On an out-of-sample point (x, y), the error is e(h(x), y).

Squared error:  (h(x) - y)²
Binary error:   [h(x) ≠ y]

    E[ e(h(x), y) ] = E_out(h)
    var[ e(h(x), y) ] = σ²
From a point to a set

On a validation set (x_1,y_1), ..., (x_K,y_K), the error is

    E_val(h) = (1/K) Σ_{k=1}^{K} e(h(x_k), y_k)

    E[ E_val(h) ] = (1/K) Σ_{k=1}^{K} E[ e(h(x_k), y_k) ] = E_out(h)
    var[ E_val(h) ] = (1/K²) Σ_{k=1}^{K} var[ e(h(x_k), y_k) ] = σ²/K

    E_val(h) = E_out(h) ± O(1/√K)
K is taken out of N

The data set D of N points is split into a training set D_train of N - K points and a
validation set D_val of K points.

Small K: a bad (high-variance) estimate.  K = N? Nothing left to train on.

[Figure: expected error versus K, between E_out and E_in.]
K is put back into N

    D (N points) → D_train (N - K points) ∪ D_val (K points)

    D       → g
    D_train → g⁻
    E_val = E_val(g⁻),  then report g trained on all of D.

Large K: a bad estimate! (g⁻ is much worse than g.)

Rule of Thumb:  K = N/5
Why 'validation'?

If an estimate of E_out affects learning, the set is a validation set; otherwise it
is a test set.

[Figure: early stopping. E_in falls with epochs while the estimated E_out turns up;
using D_val to pick the stopping point makes it a validation set.]
What's the difference? The validation estimate is mildly optimistic.

Two hypotheses h_1 and h_2 with E_out(h_1) = E_out(h_2) = 0.5. The error estimates
e_1 and e_2 are uniform on [0, 1]. Pick h ∈ {h_1, h_2} with the smaller estimate, so
e = min(e_1, e_2). Then E(e) < 0.5: an optimistic bias.
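The size of the bias here is easy to compute: for two independent uniform estimates, E[min(e_1, e_2)] = 1/3, not 0.5. A quick numerical check (my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
e1 = rng.uniform(0, 1, 1_000_000)
e2 = rng.uniform(0, 1, 1_000_000)
e = np.minimum(e1, e2)   # the estimate reported for the chosen hypothesis

print(e.mean())          # close to 1/3 < 0.5: an optimistic bias
```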
Outline
Cross validation

Using D_val more than once: model selection

M models H_1, ..., H_M. Use D_train to learn g_m⁻ for each model. Evaluate each g_m⁻
using D_val:

    E_m = E_val(g_m⁻),   m = 1, ..., M

Pick the model H_{m*} with the smallest E_m, then train it on all of D to get g_{m*}.
The bias

We selected the model H_{m*} using D_val, so E_val(g_{m*}⁻) is a biased (optimistic)
estimate of E_out(g_{m*}⁻).

[Figure: expected E_out and E_val of g_{m*}⁻ versus validation-set size; E_val sits
below E_out.]
How much bias?

For M models H_1, ..., H_M, D_val is effectively 'training' on the finalists model

    H_val = { g_1⁻, g_2⁻, ..., g_M⁻ }

Back to Hoeffding and VC:

    E_out(g_{m*}⁻) ≤ E_val(g_{m*}⁻) + O( √(ln M / K) )

The same applies to choosing λ in regularization or the stopping point in early
stopping.
Data contamination

Error estimates of E_out: E_in, E_test, E_val.
Contamination: the optimistic (deceptive) bias in the estimate.
Training set: totally contaminated. Validation set: slightly contaminated. Test set:
totally 'clean'.
Outline
Cross validation

The dilemma about K

The chain of reasoning

    E_out(g)  ≈  E_out(g⁻)  ≈  E_val(g⁻)
        (small K)      (large K)

highlights the dilemma in selecting K. Can we have K both small and large?
Leave one out

N - 1 points for training, 1 point for validation. Leaving out (x_n, y_n) gives the
data set D_n, from which we learn g_n⁻. The error on the left-out point is

    e_n = E_val(g_n⁻) = e( g_n⁻(x_n), y_n )

Cross validation error:

    E_cv = (1/N) Σ_{n=1}^{N} e_n
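A direct, brute-force implementation of E_cv for a linear model (my own sketch; refit N times, leaving one point out each time):

```python
import numpy as np

def loo_cv(X, y):
    """E_cv = (1/N) * sum of squared errors e_n = (g_n^-(x_n) - y_n)^2."""
    N = len(y)
    errs = []
    for n in range(N):
        mask = np.arange(N) != n
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # g_n^-
        errs.append((X[n] @ w - y[n]) ** 2)                    # e_n
    return np.mean(errs)

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 30)
y = x + 0.1 * rng.standard_normal(30)

X_const = np.ones((30, 1))                  # constant model
X_lin = np.column_stack([np.ones(30), x])   # linear model

print(loo_cv(X_const, y), loo_cv(X_lin, y))
```

On this data the linear model should have the smaller cross validation error.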
Illustration of cross validation

[Figure: fitting a line to three points, leaving each point out in turn; the three
validation errors e_1, e_2, e_3 give

    E_cv = (1/3)(e_1 + e_2 + e_3) ]
Model selection using cross validation

[Figure: constant model versus linear model on the three-point example. For each
model, the three leave-one-out fits and their errors e_1, e_2, e_3; cross validation
picks the model with the smaller E_cv.]
Cross validation in action

Digits data ('1' versus 'not 1'), two features: average intensity and symmetry, with
a 5th-order nonlinear transform:

    (1, x1, x2) → (1, x1, x2, x1², x1x2, x2², x1³, x1²x2, ..., x1⁵, x1⁴x2, x1³x2²,
    x1²x2³, x1x2⁴, x2⁵)

[Figure: E_cv, E_in, and E_out versus the number of features used (up to 20); E_cv
tracks E_out while E_in keeps decreasing.]
The result

    without validation:  E_in = 0%,    E_out = 2.5%
    with validation:     E_in = 0.8%,  E_out = 1.5%

[Figure: the two decision boundaries in the (average intensity, symmetry) plane.]
Leave more than one out

Leave one out: N training sessions on N - 1 points each. With more points per
validation session, break D into N/K folds of size K: N/K training sessions on N - K
points each.

[Diagram: D_1 ... D_10; train on nine folds, validate on the tenth.]

10-fold cross validation:  K = N/10
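The fold scheme above, sketched with plain numpy (illustrative names and data; model selection over polynomial order):

```python
import numpy as np

def k_fold_cv(X, y, num_folds=10):
    """N/K training sessions on N - K points each; validate on the held-out fold."""
    N = len(y)
    folds = np.array_split(np.arange(N), num_folds)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 50)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(50)

# Model selection: pick the polynomial order with the smallest CV error.
scores = {Q: k_fold_cv(np.vander(x, Q + 1), y) for Q in (1, 3, 10)}
print(scores)   # expect the linear model (Q = 1) to do worst here
```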
Review of Lecture 13

Validation:  D (N points) → D_train (N - K points) ∪ D_val (K points).
Data contamination: D_val becomes slightly contaminated once it is used for choices.
[Figure: E_out and E_val of g_{m*}⁻ versus K.]
Cross validation:  E_val(g⁻) estimates E_out(g).
[Diagram: D_1 D_2 ... D_10; train on nine folds, validate on one.]
Lecture 14: Support Vector Machines

Outline
Maximizing the margin
The solution
Nonlinear transforms

Which separating line is best?

All of them separate the data, but intuitively the one with the biggest margin wins.
Two questions:
1. Why is a bigger margin better?
2. Which w maximizes the margin?

[Figure: dichotomies with fat margins. Requiring a fat margin (margin values ∞,
0.866, 0.5, 0.397 in the example) rules out some dichotomies, effectively shrinking
the hypothesis set.]
Finding w with a large margin

Let x_n be the nearest data point to the plane wᵀx = 0. Two preliminary
technicalities:

1. Normalize w:  |wᵀx_n| = 1
2. Pull out w_0:  w = (w_1, ..., w_d), with b playing the role of w_0.
   The plane is now  wᵀx + b = 0  (no x_0).
Computing the distance

The distance between x_n and the plane wᵀx + b = 0, where |wᵀx_n + b| = 1:

The vector w is ⊥ to the plane in the X space. Take x′ and x″ on the plane:

    wᵀx′ + b = 0  and  wᵀx″ + b = 0   ⟹   wᵀ(x′ - x″) = 0
Distance between x_n and the plane

Take any point x on the plane. The projection of x_n - x onto the unit normal
ŵ = w/‖w‖ gives

    distance = | ŵᵀ(x_n - x) |
             = (1/‖w‖) | wᵀx_n - wᵀx |
             = (1/‖w‖) | wᵀx_n + b - wᵀx - b |
             = 1/‖w‖
The optimization problem

Maximize  1/‖w‖  subject to  min_{n=1,...,N} |wᵀx_n + b| = 1.

Notice: for correctly classified points, |wᵀx_n + b| = y_n(wᵀx_n + b), so this is
equivalent to:

Minimize  (1/2) wᵀw  subject to  y_n(wᵀx_n + b) ≥ 1  for n = 1, 2, ..., N
Outline
The solution
Nonlinear transforms

Constrained optimization

Minimize (1/2) wᵀw subject to y_n(wᵀx_n + b) ≥ 1 for n = 1, 2, ..., N, with
w ∈ R^d, b ∈ R.

Lagrange? These are inequality constraints → KKT conditions.
Remember regularization?

Regularization: optimize E_in, constrain wᵀw.
SVM: optimize wᵀw, constrain E_in.

[Figure: the familiar picture, with ∇E_in normal to the constraint surface wᵀw = C.]
Lagrange formulation

    L(w, b, α) = (1/2) wᵀw - Σ_{n=1}^{N} α_n ( y_n(wᵀx_n + b) - 1 )

Minimize w.r.t. w and b, and maximize w.r.t. each α_n ≥ 0:

    ∇_w L = w - Σ_{n=1}^{N} α_n y_n x_n = 0
    ∂L/∂b = - Σ_{n=1}^{N} α_n y_n = 0
Substituting

    w = Σ_{n=1}^{N} α_n y_n x_n    and    Σ_{n=1}^{N} α_n y_n = 0

in the Lagrangian

    L(w, b, α) = (1/2) wᵀw - Σ_{n=1}^{N} α_n ( y_n(wᵀx_n + b) - 1 )

we get

    L(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀx_m

Maximize w.r.t. α subject to α_n ≥ 0 for n = 1, ..., N and Σ_{n=1}^{N} α_n y_n = 0.
The solution - quadratic programming

    min_α  (1/2) αᵀQα + (-1)ᵀα       (quadratic coefficients; linear term)
    subject to  yᵀα = 0              (linear constraint)
                0 ≤ α  (≤ ∞)         (lower and upper bounds)

where Q is the N×N matrix with entries  Q_nm = y_n y_m x_nᵀx_m.
QP hands us α

Solution:  α = α_1, ..., α_N   ⟹   w = Σ_{n=1}^{N} α_n y_n x_n

KKT condition: for n = 1, ..., N,

    α_n ( y_n(wᵀx_n + b) - 1 ) = 0

α_n > 0  ⟹  x_n is a support vector.
Support vectors

The closest x_n's to the plane achieve the margin:  y_n(wᵀx_n + b) = 1.

    w = Σ_{x_n is SV} α_n y_n x_n

Solve for b using any support vector:  y_n(wᵀx_n + b) = 1.
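The dual above is a small QP. A sketch that uses scipy's general-purpose SLSQP solver in place of a dedicated QP package; the toy data set is made up for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data
X = np.array([[0., 0.], [1., 0.], [2., 2.], [3., 3.]])
y = np.array([-1., -1., 1., 1.])
N = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)   # Q_nm = y_n y_m x_n' x_m

res = minimize(
    lambda a: 0.5 * a @ Q @ a - a.sum(),    # minimize (1/2) a'Qa - 1'a
    np.zeros(N),
    jac=lambda a: Q @ a - np.ones(N),
    bounds=[(0, None)] * N,                              # a_n >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}] # y'a = 0
)
alpha = res.x
w = (alpha * y) @ X                 # w = sum of a_n y_n x_n
sv = np.argmax(alpha)               # any support vector (a_n > 0)
b = y[sv] - w @ X[sv]               # solve y_n (w'x_n + b) = 1

print(np.sign(X @ w + b))           # reproduces the labels y
```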
Outline
The solution
Nonlinear transforms

z instead of x

In the dual, the data appear only through inner products, so we can work in a
transformed space Z:

    L(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀz_m

with x ∈ X transformed to z ∈ Z.

[Figure: a data set that is not linearly separable in X becomes separable in Z.]
Support vectors in X space

Support vectors live in Z space; in X space they appear as 'pre-images' of support
vectors on the nonlinear boundary.

Generalization result:

    E[E_out] ≤ E[# of SV's] / (N - 1)
Review of Lecture 14

The margin: maximizing it gives the support vector machine; support vectors x_n
achieve the margin exactly.

    L(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀx_m

solved by quadratic programming, with  E[E_out] ≤ E[# of SV's]/(N - 1).

Lecture 15: Kernel Methods

Outline
The kernel trick
Soft-margin SVM
What do we need from the Z space?

    L(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m z_nᵀz_m

Constraints: α_n ≥ 0 for n = 1, ..., N and Σ_{n=1}^{N} α_n y_n = 0.

    g(x) = sign(wᵀz + b)  needs  z_nᵀz
    w = Σ_{z_n is SV} α_n y_n z_n,  and b from  y_m(wᵀz_m + b) = 1,  needs  z_nᵀz_m
Generalized inner product

Given two points x, x′ ∈ X, we need the inner product of their transforms z and z′:

    zᵀz′ = K(x, x′)    (the kernel)

Example: x = (x_1, x_2) and a 2nd-order transform:

    z = Φ(x) = (1, x_1, x_2, x_1², x_2², x_1x_2)
    K(x, x′) = 1 + x_1x_1′ + x_2x_2′ + x_1²x_1′² + x_2²x_2′² + x_1x_2x_1′x_2′
The trick

Can we compute K(x, x′) without transforming x and x′?

Example: K(x, x′) = (1 + xᵀx′)², with x = (x_1, x_2):

    K(x, x′) = 1 + x_1²x_1′² + x_2²x_2′² + 2x_1x_1′ + 2x_2x_2′ + 2x_1x_2x_1′x_2′

This is, implicitly, the inner product of

    (1, x_1², x_2², √2·x_1, √2·x_2, √2·x_1x_2)
    (1, x_1′², x_2′², √2·x_1′, √2·x_2′, √2·x_1′x_2′)
The polynomial kernel

For X = R^d and Φ: X → Z a polynomial transform of order Q, the kernel

    K(x, x′) = (1 + xᵀx′)^Q = (1 + x_1x_1′ + x_2x_2′ + ... + x_dx_d′)^Q

is computed with O(d) work. Compare that for d = 10 and Q = 100 with computing the
transform explicitly in the Z space.
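The trick can be verified numerically: (1 + xᵀx′)² equals the explicit inner product of the transformed vectors. A quick check with made-up points:

```python
import numpy as np

def phi(x):
    """Explicit 2nd-order transform matching the kernel (1 + x'x)^2."""
    x1, x2 = x
    return np.array([1, x1**2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, np.sqrt(2) * x1 * x2])

def K(x, xp):
    return (1 + x @ xp) ** 2   # computed in X space: O(d) work

rng = np.random.default_rng(7)
x, xp = rng.standard_normal(2), rng.standard_normal(2)

print(K(x, xp), phi(x) @ phi(xp))   # identical, without visiting Z space
```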
We only need Z to exist!

If K(x, x′) is an inner product in some space Z, we are good.

Example (infinite-dimensional Z):  K(x, x′) = exp( -γ ‖x - x′‖² )

For one-dimensional x and γ = 1:

    K(x, x′) = exp( -(x - x′)² )
             = exp(-x²) exp(-x′²) Σ_{k=0}^{∞} [ 2^k x^k (x′)^k / k! ]

where the sum is the expansion of exp(2xx′): an inner product of infinite-dimensional
feature vectors.
This kernel in action

Overkill? Transforming a slightly non-separable 2-D data set into an
infinite-dimensional Z: the kernel yields a nonlinear boundary in X while the margin
is maximized in Z.
Kernel formulation of SVM

Everything stays the same, with z_nᵀz_m replaced by K(x_n, x_m); e.g. the quadratic
coefficients become y_n y_m K(x_n, x_m).

Express g(x) = sign(wᵀz + b) in terms of K(·, ·):

    g(x) = sign( Σ_{α_n > 0} α_n y_n K(x_n, x) + b )

where

    b = y_m - Σ_{α_n > 0} α_n y_n K(x_n, x_m)     (for any α_m > 0)
How do we know that Z exists ...

... for a given K(x, x′)? Three approaches to establishing a valid kernel:

1. By construction
2. Math properties (Mercer's condition)
3. Who cares? (just use it)
Design your own kernel

K(x, x′) is a valid kernel iff

1. it is symmetric, and
2. the matrix

       [ K(x_1,x_1)  ...  K(x_1,x_N) ]
       [    ...               ...    ]
       [ K(x_N,x_1)  ...  K(x_N,x_N) ]

   is positive semi-definite for any x_1, ..., x_N   (Mercer's condition).
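Condition 2 can be probed numerically: build the kernel matrix on random points and look at its eigenvalues. A sketch (my own, with made-up data):

```python
import numpy as np

def rbf(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

rng = np.random.default_rng(8)
pts = rng.standard_normal((40, 3))
Km = np.array([[rbf(a, b) for b in pts] for a in pts])

eig = np.linalg.eigvalsh(Km)          # symmetric, so eigenvalues are real
print(eig.min())                      # >= 0 up to round-off: positive semi-definite

# A symmetric but invalid 'kernel' fails the test (zero trace forces a
# negative eigenvalue):
bad = np.array([[-np.sum((a - b) ** 2) for b in pts] for a in pts])
print(np.linalg.eigvalsh(bad).min())  # negative: not a valid kernel
```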
Outline
Soft-margin SVM

Two types of non-separable: slightly (a few noisy points) versus seriously (an
essentially nonlinear target).

[Figure: a slightly non-separable data set and a seriously non-separable one.]
Error measure

Margin violation:  y_n(wᵀx_n + b) ≥ 1  fails.

Quantify:  y_n(wᵀx_n + b) ≥ 1 - ξ_n,  with ξ_n ≥ 0.

Total violation = Σ_{n=1}^{N} ξ_n
The new optimization

Minimize  (1/2) wᵀw + C Σ_{n=1}^{N} ξ_n
subject to  y_n(wᵀx_n + b) ≥ 1 - ξ_n  for n = 1, ..., N
and  ξ_n ≥ 0  for n = 1, ..., N,
with w ∈ R^d, b ∈ R, ξ ∈ R^N.
Lagrange formulation

    L(w, b, ξ, α, β) = (1/2) wᵀw + C Σ_{n=1}^{N} ξ_n
                       - Σ_{n=1}^{N} α_n ( y_n(wᵀx_n + b) - 1 + ξ_n )
                       - Σ_{n=1}^{N} β_n ξ_n

Minimize w.r.t. w, b, and ξ; maximize w.r.t. each α_n ≥ 0 and β_n ≥ 0:

    ∇_w L = w - Σ_{n=1}^{N} α_n y_n x_n = 0
    ∂L/∂b = - Σ_{n=1}^{N} α_n y_n = 0
    ∂L/∂ξ_n = C - α_n - β_n = 0
... and the solution

Maximize

    L(α) = Σ_{n=1}^{N} α_n - (1/2) Σ_{n=1}^{N} Σ_{m=1}^{N} y_n y_m α_n α_m x_nᵀx_m

w.r.t. α subject to  0 ≤ α_n ≤ C  for n = 1, ..., N  and  Σ_{n=1}^{N} α_n y_n = 0.
Then w = Σ_{n=1}^{N} α_n y_n x_n, and ξ minimizes (1/2)wᵀw + C Σ_n ξ_n given w, b.

The dual is the same as in the hard margin; only the upper bound C on α_n is new.
Types of support vectors

Margin support vectors (0 < α_n < C):  y_n(wᵀx_n + b) = 1  (ξ_n = 0)
Non-margin support vectors (α_n = C):  y_n(wᵀx_n + b) < 1  (ξ_n > 0)
Two technical observations

1. Hard margin: what if the data are not linearly separable? The primal becomes
   infeasible and the dual breaks down (α → ∞).
2. What if Z has a constant coordinate, duplicating the role of b through w_0?
   All of it goes to b, and w_0 → 0.
Review of Lecture 15

Kernel methods:  K(x, x′) = zᵀz′ for some Z space; e.g. the RBF kernel
K(x, x′) = exp(-γ‖x - x′‖²).
Soft-margin SVM: minimize (1/2)wᵀw + C Σ_{n=1}^{N} ξ_n, where ξ_n is the margin
violation; in the dual, 0 ≤ α_n ≤ C.

Lecture 16: Radial Basis Functions

Outline
Basic RBF model

Each (x_n, y_n) ∈ D influences h(x) based on the distance ‖x - x_n‖ (radial) through
a basis function. Standard form:

    h(x) = Σ_{n=1}^{N} w_n exp( -γ ‖x - x_n‖² )
The learning algorithm

Finding w_1, ..., w_N based on E_in = 0 (exact fit of D): for n = 1, ..., N,

    Σ_{m=1}^{N} w_m exp( -γ ‖x_n - x_m‖² ) = y_n
The solution

N equations in N unknowns, Φw = y:

    [ exp(-γ‖x_1-x_1‖²)  ...  exp(-γ‖x_1-x_N‖²) ] [ w_1 ]   [ y_1 ]
    [        ...                     ...         ] [ ... ] = [ ... ]
    [ exp(-γ‖x_N-x_1‖²)  ...  exp(-γ‖x_N-x_N‖²) ] [ w_N ]   [ y_N ]

If Φ is invertible,  w = Φ⁻¹y : 'exact interpolation'.
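Exact interpolation in code, with made-up 1-D data (my own sketch):

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.sort(rng.uniform(-1, 1, 10))
y = np.sin(np.pi * x)
gamma = 10.0

Phi = np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)  # Phi_nm = exp(-g|x_n - x_m|^2)
w = np.linalg.solve(Phi, y)                            # w = Phi^(-1) y

def h(t):
    return np.exp(-gamma * (t - x) ** 2) @ w

# E_in = 0: h passes through every data point.
print(max(abs(h(t) - yt) for t, yt in zip(x, y)))
```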
The effect of γ

    h(x) = Σ_{n=1}^{N} w_n exp( -γ ‖x - x_n‖² )

[Figure: small γ gives wide, smooth bumps; large γ gives narrow spikes around each
data point.]
RBF for classification

    h(x) = sign( Σ_{n=1}^{N} w_n exp( -γ ‖x - x_n‖² ) )

Learning: treat s = Σ_n w_n exp(-γ‖x - x_n‖²) as a linear signal and minimize
(s - y)² on D with y = ±1; then h(x) = sign(s).
RBF with K centers

N parameters w_1, ..., w_N based on N data points is too many. Use K ≪ N centers
μ_1, ..., μ_K instead of x_1, ..., x_N:

    h(x) = Σ_{k=1}^{K} w_k exp( -γ ‖x - μ_k‖² )

1. How to choose the centers μ_k?
2. How to choose the weights w_k?
Choosing the centers: K-means clustering

Split x_1, ..., x_N into clusters S_1, ..., S_K, each x_n going with the closest
center; minimize

    Σ_{k=1}^{K} Σ_{x_n ∈ S_k} ‖x_n - μ_k‖²

over the centers and the cluster memberships. This is unsupervised learning, and the
exact minimization is NP-hard.
An iterative algorithm

Lloyd's algorithm iteratively minimizes Σ_k Σ_{x_n ∈ S_k} ‖x_n - μ_k‖² w.r.t. μ_k
and S_k:

    μ_k ← (1/|S_k|) Σ_{x_n ∈ S_k} x_n
    S_k ← { x_n : ‖x_n - μ_k‖ ≤ all ‖x_n - μ_l‖ }

Iterate until no change. Convergence is guaranteed, but only to a local minimum.

[Figure: Lloyd's algorithm in action, centers moving at each iteration.]
Centers versus support vectors

[Figure: support vectors lie near the separating boundary; RBF centers found by
clustering represent the bulk of the inputs.]
Choosing the weights

    Σ_{k=1}^{K} w_k exp( -γ ‖x_n - μ_k‖² ) ≈ y_n :  N equations in K < N unknowns

    [ exp(-γ‖x_1-μ_1‖²)  ...  exp(-γ‖x_1-μ_K‖²) ] [ w_1 ]   [ y_1 ]
    [        ...                     ...         ] [ ... ] ≈ [ ... ]
    [ exp(-γ‖x_N-μ_1‖²)  ...  exp(-γ‖x_N-μ_K‖²) ] [ w_K ]   [ y_N ]

If ΦᵀΦ is invertible,  w = (ΦᵀΦ)⁻¹Φᵀy : the pseudo-inverse.
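Putting the two steps together (my own sketch: for brevity the centers are simply fixed on a grid rather than found by Lloyd's algorithm; data and parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(11)
x = rng.uniform(-1, 1, 40)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(40)

mu = np.linspace(-1, 1, 5)   # K = 5 centers (fixed here; Lloyd's in general)
gamma = 4.0

Phi = np.exp(-gamma * (x[:, None] - mu[None, :]) ** 2)  # N x K design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)             # w = (Phi'Phi)^(-1) Phi'y

def h(t):
    return np.exp(-gamma * (t - mu) ** 2) @ w

residual = np.mean((Phi @ w - y) ** 2)
print(residual)   # small, but not 0: K < N means no exact interpolation
```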
RBF network

The 'features' are exp(-γ‖x - μ_k‖²): compute ‖x - μ_1‖, ..., ‖x - μ_K‖, pass each
through the basis function, and combine with weights w_1, ..., w_K to get h(x).
A bias term (b or w_0) is often added.
Compare to neural networks

RBF network: features exp(-γ‖x - μ_k‖²) of the distances to centers.
Neural network: features θ(w_kᵀx) of learned projections.
Choosing γ

Treating γ as a parameter to be learned in

    h(x) = Σ_{k=1}^{K} w_k exp( -γ ‖x - μ_k‖² )

Iterative approach (EM algorithm in mixture of Gaussians):
1. Fix γ, solve for w_1, ..., w_K.
2. Fix w_1, ..., w_K, minimize over γ.

We can have a different γ_k for each center μ_k.
Outline

RBF versus SVM with the RBF kernel:

    SVM:  sign( Σ_{α_n > 0} α_n y_n exp(-γ‖x - x_n‖²) + b )
    RBF:  sign( Σ_{k=1}^{K} w_k exp(-γ‖x - μ_k‖²) + b )

[Figure: the two classifiers on the same data set.]
RBF and regularization

RBFs can be derived purely from regularization: minimize

    Σ_{n=1}^{N} (h(x_n) - y_n)² + λ Σ_{k=0}^{∞} a_k ∫ ( d^k h / dx^k )² dx

- the smoothest interpolation.
Review of Lecture 16

    h(x) = Σ_{k=1}^{K} w_k exp( -γ ‖x - μ_k‖² )

relates to nearest neighbors, neural networks, SVM (kernel = RBF), and
regularization. Choose the μ_k's with Lloyd's algorithm (unsupervised learning);
choose the w_k's with the pseudo-inverse.
Lecture 17: Three Learning Principles

Outline
Occam's Razor
Sampling Bias
Data Snooping

Occam's Razor

The simplest model that fits the data is also the most plausible.

Two questions:
1. What does it mean for a model to be simple?
2. How do we know that simpler is better?
First question: what does 'simple' mean?

Complexity of h: minimum description length, order of a polynomial.
Complexity of H: entropy, VC dimension.

The two go together: when h is specified by ℓ bits, h is one of 2^ℓ elements of a set
H. Real-valued parameters? Count them as a measure of complexity.
6/22
0
1
0
1
1
Good all!
Perfe t re ord!
mH(N )
mH(N )/2N
mH(N ) = 1
versus
2N
8/22
A simple illustration

Scientist A measures conductivity at two temperatures and fits a line: the fit means
nothing. Scientist B measures at three temperatures, and the line fits: evidence.
The simpler model is falsifiable, so its fit is significant.

[Figure: conductivity versus temperature for the two experiments.]
Outline
Occam's Razor
Sampling Bias
Data Snooping

Sampling bias

[Example: the 1948 'Dewey Defeats Truman' headline, based on a phone-in poll that
predicted the wrong winner.]

It's not the fault of δ in P[|E_in - E_out| > ε] ≤ δ; Hoeffding assumes the sample
comes from the right distribution.

The bias: in 1948, phones were expensive, so phone owners were not a representative
sample of voters.

Principle: if the data is sampled in a biased way, learning will produce a similarly
biased outcome.
Matching the distributions

[Figure: training and testing distributions P(x). A region can have P = 0 in
training but P > 0 in testing; no matching method can help there.]

A puzzle: credit approval. The historical data (age 23 years, male, $30,000 annual
salary, 1 year in residence, 1 year in job, $15,000 current debt) comes from
customers who were approved: a biased sample for deciding on new applicants.
Outline
Occam's Razor
Sampling Bias
Data Snooping

Data snooping

Principle: if a data set has affected any step in the learning process, its ability
to assess the outcome has been compromised.

Snooping involves D, or any information affected by D; the most common trap is
looking at the data before choosing the model.

[Figure: a nonlinear boundary chosen after eyeballing the data.]
Puzzle: financial forecasting

Train on D_train only, test on D_test; but if the data were normalized using all of
D_train ∪ D_test, that is snooping.

[Figure: cumulative profit % over 500 days predicting r_0 from r_{-20}, ..., r_{-1};
the 'snooping' curve looks far better than 'no snooping'.]

Reuse of a data set: trying one model after another on the same data set means the
effective VC dimension is that of the total learning model, including what others
tried!
Two remedies

1. Avoid data snooping: strict discipline.
2. Account for data snooping: how much data contamination.
Review of Lecture 17

Occam's Razor: the simplest model that fits the data is also the most plausible
(complexity of h, complexity of H; an unlikely event is significant if it happens).
Sampling bias: the training and testing distributions P(x) must match.
Data snooping: [figure - cumulative profit %, snooping versus no snooping over 500
days].
Lecture 18: Epilogue

Outline
Bayesian learning
Aggregation methods
Acknowledgments

It's a jungle out there: semi-supervised learning, overfitting, Gaussian processes,
distribution-free methods, collaborative filtering, deterministic noise, linear
regression, VC dimension, nonlinear transformation, decision trees, data snooping,
sampling bias, Q-learning, SVM, learning curves, mixture of experts, neural
networks, no free lunch, ensemble learning, types of learning, error measures, is
learning feasible?, clustering, regularization, kernel methods, soft-order
constraint, weight decay, Occam's razor, Boltzmann machines, ...

The map

THEORY: VC, bias-variance, complexity, bayesian.
TECHNIQUES - models: linear, neural networks, SVM, nearest neighbors, RBF, gaussian
processes, SVD, graphical models; methods: regularization, validation, aggregation,
input processing.
PARADIGMS: supervised, unsupervised, reinforcement, active, online.
Outline
Bayesian learning
Aggregation methods
Acknowledgments

Probabilistic approach

Extend the probabilistic role to all components: instead of just the likelihood
P(D | h = f), how about P(h = f | D), which decides which h (given x)?

[Diagram: the learning setup - unknown target function f: X → Y plus noise, unknown
input distribution P(x), data set D = (x_1,y_1), ..., (x_N,y_N), learning algorithm,
hypothesis set H, final hypothesis g ≈ f.]
The prior

P(h = f | D) requires an additional probability distribution. By Bayes' rule:

    P(h = f | D) = P(D | h = f) P(h = f) / P(D)  ∝  P(D | h = f) P(h = f)

P(h = f) is the prior; P(h = f | D) is the posterior. Given the prior, we have the
full distribution.
Example of a prior

Consider a perceptron: h is determined by w = w_0, w_1, ..., w_d.

A possible prior on w: each w_i is independent, uniform over [-1, 1]. This determines
the prior P(h = f) over h. Given D, we can compute P(D | h = f), and by Bayes' rule

    P(h = f | D) ∝ P(h = f) P(D | h = f)
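A minimal sketch of computing a posterior over a finite hypothesis set. This is my own toy example, a biased-coin 'target' rather than the perceptron on the slide; the grid and data are made up:

```python
import numpy as np

# Hypotheses: the coin's heads probability is one of these values (uniform prior).
hs = np.linspace(0.05, 0.95, 19)
prior = np.full(len(hs), 1 / len(hs))

# Data: 8 heads out of 10 flips.
heads, flips = 8, 10
likelihood = hs**heads * (1 - hs)**(flips - heads)   # P(D | h = f)

posterior = prior * likelihood                        # Bayes' rule (unnormalized)
posterior /= posterior.sum()

print(hs[np.argmax(posterior)])                       # most probable h: 0.8
```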
A prior is an assumption

Even the most 'neutral' prior is an assumption:

[Figure: 'x is unknown' versus 'x is random, uniform over [-1, 1]'. Saying the
unknown x is uniformly distributed is a much stronger statement than saying it is
simply unknown; the truth may be a delta function δ(x - a) at some unknown a.]
If we knew the prior ...

... we could compute P(h = f | D) for every h ∈ H

⟹ we can find the most probable h given the data, derive E(h(x)) for every x, and
derive the error bar for every x.

When is Bayesian learning justified?

1. The prior is valid.
2. The prior is irrelevant.
Outline
Bayesian learning
Aggregation methods
Acknowledgments

What is aggregation?

Combining different solutions h_1, h_2, ..., h_T that were trained on D.
[Figure: regression - average the outputs; classification - take a vote.]
Also called ensemble learning and boosting.
Different from 2-layer learning

In a 2-layer model, all units learn jointly from the training data; in aggregation,
each unit learns independently before being combined.

Two types of aggregation: after the fact, combining existing solutions (blending),
and before the fact, creating solutions to be combined.
Decorrelation - boosting

Create h_1, ..., h_t, ... sequentially: make h_t decorrelated with the previous h's.
Emphasize the points in D that were misclassified so far, train h_t on the
reweighted data, and choose the weight of h_t based on E_in(h_t).
Blending - after the fact

Given h_1, h_2, ..., h_T, construct

    g(x) = Σ_{t=1}^{T} α_t h_t(x)

Principled choice of the α_t's: minimize the error on an aggregation data set - the
pseudo-inverse. Some α_t's can come out negative. The most valuable h_t in the blend
need not be the best individual one.
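Least-squares blending in code (my own sketch): the α_t's come from regressing the true labels on the individual solutions' outputs, which is the pseudo-inverse choice mentioned above. The three 'pre-trained' solutions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(12)
x = rng.uniform(-1, 1, 200)
target = np.sin(np.pi * x)

# Three imperfect, pre-trained 'solutions' h_1, h_2, h_3:
h = np.column_stack([x, x - x**3, np.sign(x)])

alpha, *_ = np.linalg.lstsq(h, target, rcond=None)   # pseudo-inverse blend
g = h @ alpha

print(np.mean((g - target) ** 2),
      min(np.mean((h[:, t] - target) ** 2) for t in range(3)))
# the blend is at least as good as every individual h_t
```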
Outline
Bayesian learning
Aggregation methods
Acknowledgments

Course content: Professor -, Professor -.

Course staff: Carlos Gonzalez (Head TA), Ron Appel, Costis Sideris, Doris Xin.

Caltech support: IST - Mathieu Desbrun; E&AS Division - Ares Rosakis and Mani
Chandy; Provost's Office - Ed Stolper and Melany Hunt.

Many others: Caltech TA's and staff members, Caltech alumni and Alumni Association,
colleagues all over the world.

Faiza A. Ibrahim