
Basis Expansion and Regularization


DD3364

April 1, 2012

Introduction

Main idea

Augment the vector of inputs X with additional variables, which are transformations of X:

h_m(X) : R^p → R,  with m = 1, . . . , M.

Then model the relationship between X and Y as

f(X) = Σ_{m=1}^M β_m h_m(X) = Σ_{m=1}^M β_m Z_m

This is a linear model w.r.t. the new variables Z, so we can use the same fitting methods as before.

Which transformations?

Some examples:

Linear: h_m(X) = X_m, m = 1, . . . , p

Polynomial: h_m(X) = X_j^2, or h_m(X) = X_j X_k

Nonlinear: h_m(X) = log(X_j), √X_j, . . . , h_m(X) = ‖X‖

Indicator functions: h_m(X) = I(L_m ≤ X_k < U_m), which breaks the range of X_k into pieces.
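As a concrete sketch of the idea (the sine target and the cubic-polynomial basis are illustrative assumptions, not from the slides): build the transformed features Z = [h_1(x), . . . , h_M(x)] and fit the coefficients β by ordinary least squares, exactly as for any linear model.

```python
import numpy as np

# Made-up 1-D data for illustration
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 100)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(100)

# Basis expansion: h1 = 1, h2 = x, h3 = x^2, h4 = x^3 (a global cubic)
Z = np.column_stack([np.ones_like(x), x, x**2, x**3])

# The model is linear in Z, so ordinary least squares applies unchanged
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
f_hat = Z @ beta  # fitted values
```

Swapping in a different list of h_m changes only the construction of Z, not the fitting step.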

Pros

Can model more complicated decision boundaries.

Can model more complicated regression relationships.

Cons

Lack of locality in global basis functions. (Solution: use local polynomial representations such as splines.)

There is the danger of over-fitting.

Common approaches taken:

Restriction Methods

Limit the class of functions considered, e.g. use additive models

f(X) = Σ_{j=1}^p Σ_{m=1}^{M_j} β_{jm} h_{jm}(X_j)

Selection Methods

Include only those basis functions that contribute significantly to the fit of the model (Boosting, CART).

Regularization Methods

Let

f(X) = Σ_{j=1}^M θ_j h_j(X)

use the entire dictionary but restrict the coefficients, as in the case of ridge regression and lasso.

To obtain a piecewise polynomial function f(X):

Divide the domain of X into contiguous intervals.

Represent f by a separate polynomial in each interval.

Examples: Piecewise Constant, Piecewise Linear.

[Figure: scatter plots of the training data; the green curve is the piecewise constant/linear fit to the training data.]


Piecewise Constant Basis

Divide [a, b], the domain of X, into the three intervals

[a, ξ1), [ξ1, ξ2), [ξ2, b]

and use the basis of indicator functions

h1(X) = I(X < ξ1),  h2(X) = I(ξ1 ≤ X < ξ2),  h3(X) = I(ξ2 ≤ X).

The least squares fit of f(X) = Σ_{m=1}^3 β_m h_m(X) gives β̂_m = Ȳ_m, the mean of the y_i's in the m-th region.

[FIGURE 5.1, top left panel: a piecewise constant function fit to some artificial data.]
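A minimal sketch of this fit on assumed artificial data: regressing on the three indicator basis functions by least squares reproduces exactly the region means.

```python
import numpy as np

# Assumed piecewise-constant truth with Gaussian noise, knots xi1, xi2
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.where(x < 0.33, 0.0, np.where(x < 0.66, 1.0, 0.5)) \
    + 0.05 * rng.standard_normal(60)

xi1, xi2 = 0.33, 0.66
# Indicator basis h1, h2, h3 for the three regions
H = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     xi2 <= x]).astype(float)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
# beta[m] is the mean of the y_i falling in region m
```

The indicator columns are orthogonal (each point lies in exactly one region), which is why the least squares coefficients decouple into per-region means.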

Piecewise Linear Basis

Add to the three indicators the basis functions

h4(X) = X h1(X),  h5(X) = X h2(X),  h6(X) = X h3(X).

The least squares fit of f(X) = Σ_{m=1}^6 β_m h_m(X) fits a separate linear model to the data in each region.

[FIGURE 5.1, top right panel: an unrestricted piecewise linear fit to the same data.]

Continuous Piecewise Linear

To make the piecewise linear fit continuous at the knots ξ1 and ξ2, impose f(ξ1−) = f(ξ1+) and f(ξ2−) = f(ξ2+). This means

β1 + β4 ξ1 = β2 + β5 ξ1, and

β2 + β5 ξ2 = β3 + β6 ξ2

This reduces the # of dof of f(X) from 6 to 4.

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots ξ1 and ξ2. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data, the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, h3(X) = (X − ξ1)_+, continuous at ξ1. The black points indicate the sample evaluations h3(x_i), i = 1, . . . , N.

More direct: use a basis that incorporates the continuity constraints instead:

h1(X) = 1,  h2(X) = X,

h3(X) = (X − ξ1)_+,  h4(X) = (X − ξ2)_+

where t_+ denotes the positive part of t.

[FIGURE 5.1, lower panels: the continuous piecewise linear fit, and the basis function h3(X) = (X − ξ1)_+.]

Smoother f(X)

Can achieve a smoother f(X) by increasing the order of the local polynomials and increasing the order of the continuity at the knots.

Piecewise Cubic Polynomials

[Figure: piecewise cubic polynomials fit to the same data with increasing orders of continuity at the two knots. Panels: discontinuous; continuous; continuous first derivative; continuous first and second derivatives.]

The fit that is continuous and has continuous first and second derivatives at the knots is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial.

Cubic Spline

It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at ξ1 and ξ2:

h1(X) = 1,  h3(X) = X^2,  h5(X) = (X − ξ1)_+^3

h2(X) = X,  h4(X) = X^3,  h6(X) = (X − ξ2)_+^3

Order M spline

An order-M spline with knots ξ1, . . . , ξK is

a piecewise polynomial of order M, and

has continuous derivatives up to order M − 2.

Its truncated power basis is

h_j(X) = X^{j−1},  j = 1, . . . , M

h_{M+l}(X) = (X − ξ_l)_+^{M−1},  l = 1, . . . , K
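The truncated power basis above is easy to write generically; the helper and the test data below are assumptions for illustration. With M = 4 and two knots it reproduces the cubic spline basis of the previous slide.

```python
import numpy as np

def truncated_power_basis(x, knots, M=4):
    """Order-M spline basis: h_j(x) = x^(j-1) for j = 1..M, plus
    h_{M+l}(x) = (x - xi_l)_+^(M-1) for each knot xi_l."""
    cols = [x ** j for j in range(M)]                       # 1, x, ..., x^(M-1)
    cols += [np.clip(x - xi, 0.0, None) ** (M - 1)          # truncated powers
             for xi in knots]
    return np.column_stack(cols)

# Illustrative use: a cubic spline (M = 4) with two knots, fit by least squares
rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 80))
y = np.sin(4 * x) + 0.1 * rng.standard_normal(80)
H = truncated_power_basis(x, knots=[1 / 3, 2 / 3], M=4)
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
f_hat = H @ theta
```

In practice the equivalent B-spline basis is preferred numerically; the truncated power basis is used here because it matches the formulas on the slides.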

Regression Splines

Fixed-knot splines are known as regression splines. For a regression spline one needs to select

the order of the spline,

the number of knots, and

the placement of the knots.

There are many equivalent bases for representing splines; the B-spline basis is the computationally attractive one.

Problem

The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines

Have the additional constraints that the function is linear beyond the boundary knots. Near the boundaries one then has reduced the variance of the fit.

Smoothing Splines

Avoid the knot selection problem by using a maximal set of knots. The complexity of the fit is controlled by regularization.

Consider the following problem: find the function f(x) with continuous second derivative which minimizes

RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))^2 + λ ∫ (f″(t))^2 dt

where λ is the smoothing parameter, the first term measures closeness to the data, and the second term is a curvature penalty.

λ = 0: f can be any function which interpolates the data.

λ = ∞: f is the simple least squares line fit.

The hope is that λ ∈ (0, ∞) indexes an interesting class of functions in between.

Remarkably,

RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))^2 + λ ∫ (f″(t))^2 dt

has an explicit, finite-dimensional minimizer: a natural cubic spline with knots at the unique values of the x_i, i = 1, . . . , n. That is,

f(x) = Σ_{j=1}^n N_j(x) θ_j

where the N_j(x) are an n-dimensional set of basis functions for representing this family of natural splines.

The criterion to be optimized thus reduces to

RSS(θ, λ) = (y − Nθ)ᵀ(y − Nθ) + λ θᵀ Ω_N θ

where

N = [ N_1(x_1)  N_2(x_1)  · · ·  N_n(x_1)
      N_1(x_2)  N_2(x_2)  · · ·  N_n(x_2)
        ...       ...     · · ·    ...
      N_1(x_n)  N_2(x_n)  · · ·  N_n(x_n) ]

{Ω_N}_{jk} = ∫ N_j″(t) N_k″(t) dt

y = (y_1, y_2, . . . , y_n)ᵀ

The criterion to be optimized thus reduces to

RSS(θ, λ) = (y − Nθ)ᵀ(y − Nθ) + λ θᵀ Ω_N θ

and its solution is given by

θ̂ = (NᵀN + λΩ_N)⁻¹ Nᵀ y

The fitted smoothing spline is then given by

f̂(x) = Σ_{j=1}^n N_j(x) θ̂_j

Assume that λ has been set. Remember the estimated coefficients are a linear combination of the y_i's:

θ̂ = (NᵀN + λΩ_N)⁻¹ Nᵀ y

Let f̂ be the n-vector of the fitted values f̂(x_i). Then

f̂ = Nθ̂ = N(NᵀN + λΩ_N)⁻¹ Nᵀ y = S_λ y

where S_λ = N(NᵀN + λΩ_N)⁻¹ Nᵀ is the smoother matrix.

Properties of S_λ

S_λ is symmetric and positive semi-definite.

S_λ S_λ ⪯ S_λ (repeated smoothing shrinks further).

S_λ has rank n.

The book defines the effective degrees of freedom of a smoothing spline to be

df_λ = trace(S_λ)
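A small numerical sketch of S_λ and df_λ. To stay self-contained it uses an assumed discretization rather than the natural-spline basis: one parameter per observation (N = I) and a second-difference matrix D standing in for the curvature penalty, so that S_λ = (I + λ DᵀD)⁻¹.

```python
import numpy as np

n, lam = 50, 5.0

# D f gives the second differences of f; D^T D plays the role of Omega_N
D = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n
Omega = D.T @ D

# Smoother matrix: with N = I, S = (N^T N + lam*Omega)^{-1} = (I + lam*D^T D)^{-1}
S = np.linalg.inv(np.eye(n) + lam * Omega)

df = np.trace(S)                             # effective degrees of freedom
evals = np.linalg.eigvalsh(S)                # symmetric => real eigenvalues
# Eigenvalues lie in (0, 1]; exactly two are 1, since constant and linear
# sequences are in the null space of D and are left unpenalized.
```

This mirrors the properties on the slide: S_λ is symmetric and positive semi-definite, and df_λ = trace(S_λ) interpolates between 2 (λ → ∞, the line fit) and n (λ → 0, interpolation).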

[FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.]

Let N = USVᵀ be the SVD of N. Using this decomposition it is straightforward to re-write

S_λ = N(NᵀN + λΩ_N)⁻¹ Nᵀ

as

S_λ = (I + λK)⁻¹

where

K = U S⁻¹ Vᵀ Ω_N V S⁻¹ Uᵀ.

It is also easy to show that f̂ = S_λ y is the solution to the optimization problem

min_f (y − f)ᵀ(y − f) + λ fᵀKf

The eigen-decomposition of S_λ

Let K = PDP⁻¹ be the real eigen-decomposition of K, with D = diag(d_1, . . . , d_n). Then

S_λ = (I + λK)⁻¹ = (I + λPDP⁻¹)⁻¹
    = (PP⁻¹ + λPDP⁻¹)⁻¹
    = (P(I + λD)P⁻¹)⁻¹
    = P(I + λD)⁻¹P⁻¹
    = Σ_{k=1}^n [1 / (1 + λ d_k)] p_k p_kᵀ

where the d_k and p_k are the e-values and e-vectors of K.

The p_k are also the e-vectors of S_λ, and 1/(1 + λ d_k) its e-values.
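Continuing the discretized sketch (with K = DᵀD standing in for the spline penalty matrix), this eigen-structure can be verified numerically: S_λ shares the eigenvectors of K for every λ, only the eigenvalues 1/(1 + λ d_k) change, so two smoothers with different λ commute.

```python
import numpy as np

n = 40
D = np.diff(np.eye(n), n=2, axis=0)          # second-difference operator
K = D.T @ D                                  # assumed penalty matrix

d, P = np.linalg.eigh(K)                     # K = P diag(d) P^T, P orthogonal
S = {lam: np.linalg.inv(np.eye(n) + lam * K) for lam in (0.5, 5.0)}

# Eigen-shrinkage form: S_lambda = P diag(1/(1 + lam*d)) P^T
S_rebuilt = P @ np.diag(1.0 / (1.0 + 5.0 * d)) @ P.T
```

Because both smoothers are functions of the same K, they diagonalize in the same basis {p_k}; this is what makes the eigenvector plots below independent of λ.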

Example: Eigenvalues of S_λ

Green curve: eigenvalues of S_λ with df = 11.

Red curve: eigenvalues of S_λ with df = 5.

[FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df_λ = trace(S_λ). (Lower left:) The first 25 eigenvalues for the two smoother matrices. The first two are exactly 1, and all are ≥ 0.]

Example: Eigenvectors of S_λ

Each blue curve is an eigenvector of S_λ plotted against x. The top left has the highest eigenvalue, the bottom right the smallest. The red curve is the eigenvector damped by 1/(1 + λ d_k).

The eigenvectors of S_λ do not depend on λ.

The smoothing spline decomposes y w.r.t. the basis {p_k} and shrinks the contributions by 1/(1 + λ d_k):

S_λ y = Σ_{k=1}^n [1 / (1 + λ d_k)] p_k (p_kᵀ y)

df_λ = trace(S_λ) = Σ_{k=1}^n 1/(1 + λ d_k).

Visualization of S_λ: Equivalent Kernels

[FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel shows the elements of S_λ as an image; the other panels show selected rows (12, 25, 50, 75, 100, 115) of the smoother matrix.]

Choosing λ?

This is a crucial and tricky problem. We will deal with it in Chapter 7, where model assessment and selection are considered.

Nonparametric Logistic Regression

Previously we considered a binary classifier s.t.

log [P(Y = 1|X = x) / P(Y = 0|X = x)] = β_0 + βᵀx

Instead consider

log [P(Y = 1|X = x) / P(Y = 0|X = x)] = f(x)

which implies

P(Y = 1|X = x) = e^{f(x)} / (1 + e^{f(x)})

and fit f(x) by a penalized maximum likelihood estimate of P(Y = 1|X = x).

Construct the penalized log-likelihood criterion

ℓ(f; λ) = Σ_{i=1}^n [y_i log P(Y = 1|x_i) + (1 − y_i) log(1 − P(Y = 1|x_i))] − .5 λ ∫ (f″(t))^2 dt

        = Σ_{i=1}^n [y_i f(x_i) − log(1 + e^{f(x_i)})] − .5 λ ∫ (f″(t))^2 dt
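A gradient-ascent sketch of this penalized likelihood on synthetic data, again with the assumed discretized curvature penalty Ω = DᵀD applied to the vector of function values f(x_i) (the real method would use a natural-spline basis and Newton steps):

```python
import numpy as np

# Synthetic binary data with a sigmoidal true probability
rng = np.random.default_rng(3)
n, lam = 100, 1.0
x = np.sort(rng.uniform(-3, 3, n))
p_true = 1.0 / (1.0 + np.exp(-x))
y = (rng.uniform(size=n) < p_true).astype(float)

# Discretized curvature penalty on the values f_i = f(x_i)
D = np.diff(np.eye(n), n=2, axis=0)
Omega = D.T @ D

# Gradient ascent on l(f) = sum_i [y_i f_i - log(1 + e^{f_i})] - 0.5*lam*f'Omega f
f = np.zeros(n)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-f))             # current P(Y=1|x_i)
    grad = (y - p) - lam * (Omega @ f)       # dl/df
    f += 0.1 * grad                          # fixed step size
```

The gradient has the familiar (y − p) form of logistic regression plus the linear penalty term λΩf from the quadratic regularizer.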

Hilbert Spaces

There is a class of generalization problems which have the form

min_{f ∈ H} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) ]

where

L(y_i, f(x_i)) is a loss function,

J(f) is a penalty functional, and

H is a space of functions on which J(f) is defined.

An important subclass of these problems is generated by a positive definite kernel K(x, y) and the corresponding space of functions H_K, called a reproducing kernel Hilbert space; the penalty J can be defined in terms of the kernel as well.

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels

Definition

A kernel is a mapping k : X × X → R.

These three types of kernels are equivalent:

dot-product kernel

positive semi-definite kernel

Mercer kernel

Dot-product kernel

Definition

A mapping

k : X × X → R

is a dot-product kernel if and only if

k(x, y) = ⟨Φ(x), Φ(y)⟩

where Φ : X → H, H is a vector space, and ⟨·,·⟩ is an inner-product on H.

Positive semi-definite kernel

Definition

A mapping

k : X × X → R

is a positive semi-definite kernel on X × X if for all m ∈ N and all x_1, . . . , x_m with each x_i ∈ X, the Gram matrix

K = [ k(x_1, x_1)  k(x_1, x_2)  · · ·  k(x_1, x_m)
      k(x_2, x_1)  k(x_2, x_2)  · · ·  k(x_2, x_m)
         ...          ...       · · ·     ...
      k(x_m, x_1)  k(x_m, x_2)  · · ·  k(x_m, x_m) ]

is positive semi-definite.

Mercer kernel

Definition

A symmetric mapping k : X × X → R such that

∫∫ k(x, y) f(x) f(y) dx dy ≥ 0

for all functions f s.t. ∫ f(x)^2 dx < ∞ is a Mercer kernel.
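The positive semi-definite property can be spot-checked numerically, e.g. for a Gaussian kernel on arbitrary sample points: every Gram matrix it produces should be symmetric with non-negative eigenvalues.

```python
import numpy as np

def gaussian_kernel(X, sigma=1.0):
    """Gram matrix of k(x, y) = exp(-||x - y||^2 / sigma)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return np.exp(-sq / sigma)

rng = np.random.default_rng(4)
X = rng.standard_normal((30, 2))        # 30 arbitrary points in R^2
K = gaussian_kernel(X)
eigs = np.linalg.eigvalsh(K)            # all should be >= 0 (up to roundoff)
```

A single negative eigenvalue on any point set would disprove positive semi-definiteness; passing the check on samples is of course only evidence, not a proof.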

These different definitions lead to different interpretations of what the kernel does:

Interpretation I

Reproducing kernel map:

H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) }

⟨f, g⟩ = Σ_i Σ_j α_i α′_j k(x_i, x′_j)

Φ : x → k(·, x)

These different definitions lead to different interpretations of what the kernel does:

Interpretation II

Mercer kernel map:

H_M = ℓ_2 = { x | Σ_i x_i^2 < ∞ }

⟨f, g⟩ = fᵀg

Φ : x → (√λ_1 φ_1(x), √λ_2 φ_2(x), . . .)ᵀ

where the λ_i, φ_i are the eigenvalues and eigenfunctions of k(x, y) with λ_i > 0, and ℓ_2 is the space of sequences s.t. Σ_i a_i^2 < ∞.

When a Gaussian kernel k(x, x_i) = exp(−‖x − x_i‖^2 / σ) is used, the point x_i ∈ X is mapped into the Gaussian G(·, x_i, σI), and H_k is the space of all functions that are linear combinations of Gaussians on X.

With the definition of H_k and ⟨·,·⟩ one has the reproducing property

⟨k(·, x), f⟩ = f(x) for all f ∈ H_k.

This leads to the reproducing kernel Hilbert spaces.

Definition

A Hilbert space is a complete dot-product space. (vector space + dot product + limit points of all Cauchy sequences)

Definition

Let H be a Hilbert space of functions f : X → R. H is a Reproducing Kernel Hilbert Space (RKHS) with inner-product ⟨·,·⟩ if there exists a

k : X × X → R

s.t.

1. k(·,·) spans H, that is H = { f(·) | f(·) = Σ_i α_i k(·, x_i) for α_i ∈ R and x_i ∈ X }

2. k has the reproducing property: ⟨k(·, x), f⟩ = f(x) for all f ∈ H.

Theorem (Mercer)

Let k : X × X → R be a Mercer kernel. Then there exists an orthonormal set of functions,

∫ φ_i(x) φ_j(x) dx = δ_{ij},

and a set of λ_i ≥ 0 such that

Σ_i λ_i^2 = ∫∫ k^2(x, y) dx dy < ∞  and

k(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)

This eigen-decomposition gives another way to design the feature transformation induced by the kernel k(·,·).

Let

Φ : X → ℓ_2

be defined by

Φ(x) = (√λ_1 φ_1(x), √λ_2 φ_2(x), . . .)ᵀ

where ℓ_2 is the space of square summable sequences. Clearly

⟨Φ(x), Φ(y)⟩ = Σ_{i=1}^∞ √λ_i φ_i(x) √λ_i φ_i(y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y) = k(x, y)

Issues

Therefore there is a vector space ℓ_2, other than H_k, such that k(x, y) is a dot product in that space. We have two very different interpretations of what the kernel does:

1. the reproducing kernel map

2. the Mercer kernel map

For HM we write

(x) =

P

i

i i (x)ei

: `2 span{k }

Can write

( )(x) =

P

i

ek =

p

k k ()

i i (x)i () = k(, x)

[Diagram: points x_i ∈ X are mapped by Φ_M into ℓ_2 (axes e_1, . . . , e_d), and by Ψ ∘ Φ_M into function space (axes φ_1, . . . , φ_d), where (Ψ ∘ Φ_M)(x_i) = k(·, x_i).]

Mercer map

Define the inner-product in M as

⟨f, g⟩_M = ∫ f(x) g(x) dx

Note we will normalize the eigenfunctions φ_l such that

∫ φ_l(x) φ_k(x) dx = δ_{lk} / λ_l

Any function f ∈ M can be written as

f(x) = Σ_{k=1}^∞ ω_k φ_k(x)

Then

∫ f(x) k(x, y) dx = ∫ Σ_k ω_k φ_k(x) Σ_l λ_l φ_l(x) φ_l(y) dx
                  = Σ_k Σ_l ω_k λ_l φ_l(y) ∫ φ_k(x) φ_l(x) dx
                  = Σ_l ω_l λ_l φ_l(y) (1/λ_l)
                  = Σ_l ω_l φ_l(y) = f(y)

so k is a reproducing kernel on M.

We want to check if the space M = H_k:

1. Show H_k ⊆ M.

2. Show M ⊆ H_k.

H_k ⊆ M

If f ∈ H_k then there exist m ∈ N, {α_i} and {x_i} such that

f(·) = Σ_{i=1}^m α_i k(·, x_i)
     = Σ_{i=1}^m α_i Σ_{l=1}^∞ λ_l φ_l(x_i) φ_l(·)
     = Σ_{l=1}^∞ ( Σ_{i=1}^m α_i λ_l φ_l(x_i) ) φ_l(·)
     = Σ_{l=1}^∞ ω_l φ_l(·)

with ω_l = λ_l Σ_i α_i φ_l(x_i). This shows that if f ∈ H_k then f ∈ M, and therefore H_k ⊆ M.

Let f, g ∈ H_k with

f(·) = Σ_{i=1}^n α_i k(·, x_i),  g(·) = Σ_{j=1}^m β_j k(·, y_j)

Then by definition

⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, y_j)

While

⟨f, g⟩_M = ∫ f(x) g(x) dx
         = ∫ Σ_i α_i k(x, x_i) Σ_j β_j k(x, y_j) dx
         = Σ_i Σ_j α_i β_j ∫ k(x, x_i) k(x, y_j) dx
         = Σ_i Σ_j α_i β_j ∫ Σ_l λ_l φ_l(x) φ_l(x_i) Σ_s λ_s φ_s(x) φ_s(y_j) dx
         = Σ_i Σ_j α_i β_j Σ_l λ_l φ_l(x_i) φ_l(y_j)
         = Σ_i Σ_j α_i β_j k(x_i, y_j)
         = ⟨f, g⟩

Hence ⟨f, g⟩_M = ⟨f, g⟩.

M ⊆ H_k

Can also show that if f ∈ M then also f ∈ H_k. We will not prove that here, but it implies M ⊆ H_k.

Summary

The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I: reproducing kernel map

H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) },  ⟨f, g⟩ = Σ_i Σ_j α_i α′_j k(x_i, x′_j),  Φ_r : x → k(·, x)

Interpretation II: Mercer kernel map

H_M = ℓ_2 = { x | Σ_i x_i^2 < ∞ },  ⟨f, g⟩ = fᵀg,  Φ_M : x → (√λ_1 φ_1(x), √λ_2 φ_2(x), . . .)ᵀ

With Ψ : ℓ_2 → span{φ_k(·)}, the two maps are related by Ψ ∘ Φ_M = Φ_r.

Back to Regularization

We want to solve

min_{f ∈ H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ‖f‖^2 ]

Intuition: wigglier functions have larger norm than smoother functions.

For f ∈ H_k we have

f(x) = Σ_i α_i k(x, x_i) = Σ_i α_i Σ_l λ_l φ_l(x) φ_l(x_i) = Σ_l c_l φ_l(x)

with c_l = λ_l Σ_i α_i φ_l(x_i). Hence

‖f‖^2 = Σ_l Σ_k c_l c_k ⟨φ_l(x), φ_k(x)⟩_M = Σ_l Σ_k c_l c_k δ_{lk}/λ_l = Σ_l c_l^2 / λ_l

Functions with large e-values get penalized less, and vice versa; more coefficients means more high frequencies, i.e. less smoothness.

Representer Theorem

Theorem

Let

Ω : [0, ∞) → R be a strictly monotonically increasing function,

H_k be the RKHS associated with a kernel k(x, y),

L(y, f(x)) be a loss function.

Then

f̂ = arg min_{f ∈ H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖^2) ]

has a representation of the form

f̂(x) = Σ_{i=1}^n α_i k(x, x_i)

Relevance

The remarkable consequence of the theorem is that the minimization over an infinite-dimensional function space reduces to one over a finite-dimensional coefficient space. This is because, as f̂ = Σ_{i=1}^n α_i k(·, x_i), then

‖f̂‖^2 = ⟨f̂, f̂⟩ = Σ_{ij} α_i α_j ⟨k(·, x_i), k(·, x_j)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) = αᵀKα

and

f̂(x_i) = Σ_j α_j k(x_i, x_j) = K_i α

where K_i is the i-th row of the Gram matrix K.

Representer Theorem

Theorem

Let Ω : [0, ∞) → R be a strictly monotonically increasing function and H_k the RKHS associated with a kernel k(x, y). Then

f̂ = arg min_{f ∈ H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖^2) ]

is given by f̂(x) = Σ_{i=1}^n α̂_i k(x, x_i), where

α̂ = arg min_α [ Σ_{i=1}^n L(y_i, K_i α) + Ω(αᵀKα) ]
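With squared-error loss L(y, f) = (y − f)^2 and Ω(t) = λt, the finite-dimensional problem min_α ‖y − Kα‖^2 + λ αᵀKα has the closed form α̂ = (K + λI)⁻¹ y, i.e. kernel ridge regression. A sketch with an assumed Gaussian kernel and toy data:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Cross-Gram matrix of k(x, y) = exp(-||x - y||^2 / sigma)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma)

rng = np.random.default_rng(5)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(6 * X[:, 0]) + 0.1 * rng.standard_normal(40)

# Representer theorem: f_hat(x) = sum_i alpha_i k(x, x_i)
lam = 1e-3
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(40), y)

# Evaluate f_hat at new points via the kernel expansion
X_new = np.linspace(0, 1, 7)[:, None]
f_new = gaussian_kernel(X_new, X) @ alpha
f_train = K @ alpha
```

Nothing about the infinite-dimensional H_k appears in the code; the entire fit lives in the n coefficients α, exactly as the theorem promises.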

When given linearly separable data {(x_i, y_i)}, the optimal separating hyperplane is found by solving

min_{β_0, β} ‖β‖^2  subject to  y_i(β_0 + βᵀx_i) ≥ 1  ∀i

The constraints can be expressed with the hinge function:

max(0, 1 − y_i(β_0 + βᵀx_i)) = (1 − y_i(β_0 + βᵀx_i))_+ = 0  ∀i

Hence we can re-write the optimization problem as

min_{β_0, β} [ Σ_{i=1}^n (1 − y_i(β_0 + βᵀx_i))_+ + λ‖β‖^2 ]

Finding the optimal separating hyperplane

min_{β_0, β} [ Σ_{i=1}^n (1 − y_i(β_0 + βᵀx_i))_+ + λ‖β‖^2 ]

has the form

min_f [ Σ_{i=1}^n L(y_i, f(x_i)) + λΩ(‖f‖^2) ]

where

L(y, f(x)) = (1 − y f(x))_+ and Ω(‖f‖^2) = ‖f‖^2

From the Representer Theorem we know the solution to the latter problem is

f̂(x) = Σ_{i=1}^n α_i x_iᵀ x

(the linear kernel), and therefore ‖f̂‖^2 = αᵀKα. This is the same form of solution found via the KKT conditions:

β = Σ_{i=1}^n α_i y_i x_i
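A subgradient-descent sketch of the hinge + ridge objective on assumed separable toy data (this sidesteps the KKT/QP route entirely; data, step size, and iteration count are illustrative choices):

```python
import numpy as np

# Two well-separated Gaussian clusters, labels y in {-1, +1}
rng = np.random.default_rng(6)
n = 60
X = np.vstack([rng.standard_normal((n // 2, 2)) + 3.0,
               rng.standard_normal((n // 2, 2)) - 3.0])
y = np.r_[np.ones(n // 2), -np.ones(n // 2)]

# Minimize sum_i (1 - y_i (b0 + b.x_i))_+ + lam * ||b||^2 by subgradient steps
lam, step = 0.01, 0.01
b0, b = 0.0, np.zeros(2)
for _ in range(2000):
    margins = y * (b0 + X @ b)
    viol = margins < 1.0                    # points violating the margin
    g_b = -(y[viol, None] * X[viol]).sum(0) + 2 * lam * b
    g_b0 = -y[viol].sum()
    b -= step * g_b
    b0 -= step * g_b0

acc = np.mean(np.sign(b0 + X @ b) == y)     # training accuracy
```

The hinge term contributes a subgradient only for points inside the margin, which is why only the support-vector-like points (margins < 1) drive the update.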
