
SVMs: nonlinearity through kernels

Chapter 3.4, e-8

Non-separable data

Consider the following two datasets:

[Figure 8.6: Non-separable data (reproduction of Figure 3.1). (a) A few noisy data points: linear with outliers. (b) Nonlinearly separable.]

What if the data is not linearly separable? Both datasets above are non-separable, but there is a difference. Figure 8.6 (reproduced from Chapter 3) illustrates the two types of non-separability: in Figure 8.6(a), two noisy data points render the data non-separable; in Figure 8.6(b), the target function is inherently nonlinear.


Mechanics of the Feature Transform I

Transform your features! Map the data into a Z-space, using a nonlinear function Φ, in which the data is separable.

[Figure: the data in X-space (axes x₁, x₂) and, after the transform, in Z-space (axes z₁ = x₁², z₂ = x₂²).]

x = (x₁, x₂)ᵀ   →   z = Φ(x) = (x₁², x₂²)ᵀ = (Φ₁(x), Φ₂(x))ᵀ

Feature transform II: classify in Z-space

Separate the data in the Z-space with w. In Z-space the data can be linearly separated:

g(z) = sign(wᵀz)

Feature transform III: bring back to X-space

To classify a new x, first transform x to Φ(x) in Z-space and classify there with g:

g(x) = g(Φ(x)) = sign(wᵀΦ(x))
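A minimal Python sketch of this pipeline; the toy dataset, the explicit bias b, and the hand-picked w are illustrative (in practice w is learned in Z-space):

import numpy as np

def phi(X):
    # Nonlinear feature transform: (x1, x2) -> (x1^2, x2^2)
    return X ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.5, 1, -1)  # separable by a circle, not by a line

Z = phi(X)                     # map the data to Z-space
w = np.array([-1.0, -1.0])     # a separating direction in Z-space
b = 0.5                        # explicit bias, added for convenience
g = np.sign(Z @ w + b)         # classify in Z-space; same as sign(w.T phi(x) + b) in X-space
print("training accuracy:", np.mean(g == y))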

Summary of nonlinear transform

After constructing features carefully, before seeing the data... if you think linear is not enough, try the 2nd order polynomial transform:

x = (x₁, x₂)ᵀ   →   Φ(x) = (Φ₁(x), Φ₂(x), Φ₃(x), Φ₄(x), Φ₅(x))ᵀ = (x₁, x₂, x₁², x₁x₂, x₂²)ᵀ

What can we say about the dimensionality of the Z-space as a function of the dimensionality of the data?

The general polynomial transform

We can get even fancier and choose higher-order polynomials. The degree-k polynomial transform:

Φ₁(x) = (1, x₁, x₂)
Φ₂(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²)
Φ₃(x) = (1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², x₂³)
Φ₄(x) = (1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², x₂³, x₁⁴, x₁³x₂, x₁²x₂², x₁x₂³, x₂⁴)
...

What are the potential effects of increasing the order of the polynomial? The dimensionality of the feature space increases rapidly (and with it d_vc)!

Similar transforms exist for a d-dimensional original space.

Approximation-generalization tradeoff: a higher degree gives lower (even zero) Ein but worse generalization.
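To see how fast the dimensionality grows, here is a small sketch using scikit-learn's PolynomialFeatures (the library choice is ours; the counts match the transforms listed above):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.zeros((1, 2))  # a single 2-dimensional point, just to count features
for k in range(1, 6):
    n_features = PolynomialFeatures(degree=k).fit_transform(x).shape[1]
    print(f"degree {k}: {n_features} features")  # prints 3, 6, 10, 15, 21 for d = 2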

Be careful with nonlinear transforms

[Figure: a linear model vs. a fourth-order polynomial fit to the same data.]

Feature-space dimensionality increases rapidly, and with it the complexity of the model: danger of overfitting. A high-order polynomial transform can lead to nonsense.

Digits data

A few potential issues:

- Danger of overfitting
- Better chance of obtaining linear separability
- Computationally expensive (memory and time)

Kernels: avoid the computational expense with an implicit mapping.

Achieving non-linear discriminant functions

Consider two-dimensional data and the mapping

Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ

Let's plug that into the discriminant function:

wᵀΦ(x) = w₁x₁² + √2 w₂x₁x₂ + w₃x₂²

The resulting decision boundary is a conic section.
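For a concrete (illustrative) choice of weights, w = (1, 0, 1) gives wᵀΦ(x) = x₁² + x₂², so the boundary wᵀΦ(x) = c is a circle; a tiny sketch:

import numpy as np

def phi(x):
    # The quadratic feature map from the slide: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

w = np.array([1.0, 0.0, 1.0])  # illustrative weights: w.T phi(x) = x1^2 + x2^2
for x in [(1.0, 0.0), (0.0, 1.0), (np.sqrt(0.5), np.sqrt(0.5))]:
    print(x, w @ phi(np.array(x)))  # all print 1.0: these points lie on the unit circle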

How to avoid the overhead of explicit mapping

Suppose the weight vector can be expressed as:

w = Σᵢ₌₁ⁿ αᵢxᵢ

The discriminant function is then:

f(x) = wᵀx + b = Σᵢ αᵢ xᵢᵀx + b

And using our nonlinear mapping:

f(x) = wᵀΦ(x) + b = Σᵢ αᵢ Φ(xᵢ)ᵀΦ(x) + b

It turns out we can often compute the dot product without explicitly mapping the data into a high-dimensional feature space!

Example

Let's go back to the example

Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ

and compute the dot product:

Φ(x)ᵀΦ(z) = (x₁², √2 x₁x₂, x₂²)(z₁², √2 z₁z₂, z₂²)ᵀ
          = x₁²z₁² + 2x₁x₂z₁z₂ + x₂²z₂²
          = (xᵀz)²

Do we need to perform the mapping explicitly?


NO! Squaring the dot product in the original space has the same
effect as computing the dot product in feature space.
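A quick numerical check of this identity; a minimal sketch using numpy (the helper name phi is ours):

import numpy as np

def phi(x):
    # Explicit feature map: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(z)   # dot product computed in feature space
implicit = (x @ z) ** 2      # kernel evaluated in the original space
print(np.isclose(explicit, implicit))  # True: the two agree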


Kernels
Definition: A function k(x, z) that can be expressed as a dot product in some feature space is called a kernel. In other words, k(x, z) is a kernel if there exists Φ : X → F such that

k(x, z) = Φ(x)ᵀΦ(z)

Why is this interesting?


If the algorithm can be expressed in terms of dot products, we
can work in the feature space without performing the mapping
explicitly!


The dual SVM problem

The dual SVM formulation depends on the data only through dot products, and so can be expressed using kernels. Replace

maximize    Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
subject to  0 ≤ αᵢ ≤ C,   Σᵢ₌₁ⁿ αᵢyᵢ = 0

with:

maximize    Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ k(xᵢ, xⱼ)
subject to  0 ≤ αᵢ ≤ C,   Σᵢ₌₁ⁿ αᵢyᵢ = 0
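A minimal sketch of this replacement in practice: train an SVM on a precomputed kernel matrix with scikit-learn. The library choice, the Gaussian kernel, and the toy data are ours, not from the slides:

import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs of rows
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)   # a nonlinear (circular) target

K = rbf_kernel(X, X)                        # n x n matrix of k(x_i, x_j)
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)

X_test = rng.normal(size=(10, 2))
K_test = rbf_kernel(X_test, X)              # kernel between test points and training points
print(clf.predict(K_test))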


Standard kernel functions

The linear kernel:  k(x, z) = xᵀz
  Feature space: the original features.

Homogeneous polynomial kernel:  k(x, z) = (xᵀz)ᵈ
  Feature space: all monomials of degree d.

Polynomial kernel:  k(x, z) = (xᵀz + 1)ᵈ
  Feature space: all monomials of degree at most d.

Gaussian kernel:  k(x, z) = exp(−γ‖x − z‖²)
  Feature space: infinite dimensional.
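These kernels are one-liners in code; a sketch with γ and d as free parameters (function names are ours):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def homogeneous_poly_kernel(x, z, d=2):
    return (x @ z) ** d

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))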

Demo

Using the polynomial kernel: k(x, z) = (xᵀz + 1)ᵈ

Demo

Using the Gaussian kernel: k(x, z) = exp(−γ‖x − z‖²)

Kernelizing the perceptron algorithm

Recall the primal version of the perceptron algorithm:

Input: labeled data D in homogeneous coordinates
Output: weight vector w

w = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if xᵢ is misclassified:
            w = w + yᵢxᵢ
            converged = false

What do you need to change to express it in the dual? (I.e., express the algorithm in terms of the α coefficients.)

Kernelizing the perceptron algorithm

Input: labeled data D in homogeneous coordinates
Output: coefficients α, representing the weight vector as w = Σᵢ₌₁ⁿ αᵢxᵢ

α = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if xᵢ is misclassified:
            αᵢ ← αᵢ + yᵢ
            converged = false

The update αᵢ ← αᵢ + yᵢ is equivalent to w′ = w + yᵢxᵢ.
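A runnable sketch of the dual perceptron; the misclassification test sign(Σⱼ αⱼ k(xⱼ, xᵢ)) ≠ yᵢ follows from w = Σⱼ αⱼxⱼ, while the kernel choice, toy data, and epoch cap are illustrative:

import numpy as np

def kernel_perceptron(K, y, max_epochs=100):
    """Dual perceptron on a precomputed kernel matrix K (K[i, j] = k(x_i, x_j)).
    Learns coefficients alpha such that f(x) = sum_j alpha_j * k(x_j, x)."""
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(y)):
            if np.sign(alpha @ K[:, i]) != y[i]:   # x_i is misclassified
                alpha[i] += y[i]                   # dual update: alpha_i <- alpha_i + y_i
                converged = False
        if converged:
            break
    return alpha

# Example with the quadratic kernel k(x, z) = (x.T z)^2, data in homogeneous coordinates
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = np.where(X[:, 1] ** 2 > X[:, 2] ** 2, 1, -1)   # quadratic, not linear, in the inputs
K = (X @ X.T) ** 2
alpha = kernel_perceptron(K, y)
print("training accuracy:", np.mean(np.sign(K @ alpha) == y))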

Linear regression revisited

The sum-squared cost function:

Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)² = (y − Xw)ᵀ(y − Xw)

The optimal solution satisfies:

XᵀXw = Xᵀy

If we express w as:

w = Σᵢ₌₁ⁿ αᵢxᵢ = Xᵀα

we get:

XᵀXXᵀα = Xᵀy

Kernel linear regression

We now get that α satisfies:

XXᵀα = y

Compare with:

XᵀXw = Xᵀy

Which is harder to find? What have we gained?
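A sketch of kernel linear regression in numpy, with XXᵀ replaced by a kernel matrix K; the small ridge term λI is our addition for numerical stability and is not in the slides:

import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)        # a nonlinear target

lam = 1e-3                                             # ridge term (assumption, for stability)
K = gaussian_kernel_matrix(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # solve (K + lam I) alpha = y

X_new = np.array([[0.5], [1.5]])
y_pred = gaussian_kernel_matrix(X_new, X) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)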

The kernel matrix

XᵀX : the covariance matrix (d × d).
XXᵀ : the matrix of dot products associated with a dataset (n × n).

We can replace the latter with a matrix K such that:

Kᵢⱼ = Φ(xᵢ)ᵀΦ(xⱼ) = k(xᵢ, xⱼ)

This is the kernel matrix associated with a dataset, a.k.a. the Gram matrix.

What does the kernel matrix look like?

[Figure: kernel matrix for gene expression data in yeast.]

Properties of the kernel matrix

The kernel matrix:

Kᵢⱼ = Φ(xᵢ)ᵀΦ(xⱼ) = k(xᵢ, xⱼ)

- Symmetric (and therefore has real eigenvalues).
- Diagonal elements are non-negative: Kᵢᵢ = ‖Φ(xᵢ)‖² ≥ 0.
- Every kernel matrix is positive semi-definite, i.e. xᵀKx ≥ 0 for all x.
- Corollary: the eigenvalues of a kernel matrix are non-negative.
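A quick empirical check of these properties for a Gaussian kernel matrix (numpy only; the data and γ are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(X ** 2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                            # Gaussian kernel matrix, gamma = 0.5

print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.diag(K) >= 0))                   # non-negative diagonal
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)   # eigenvalues non-negative (up to round-off)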

Standard kernel functions

The linear kernel:  k(x, z) = xᵀz
  Feature space: the original features.

Homogeneous polynomial kernel:  k(x, z) = (xᵀz)ᵈ
  Feature space: all monomials of degree d.

Polynomial kernel:  k(x, z) = (xᵀz + 1)ᵈ
  Feature space: all monomials of degree at most d.

Gaussian kernel (aka RBF kernel):  k(x, z) = exp(−γ‖x − z‖²)
  Feature space: infinite dimensional.

How do we even know that the Gaussian is a valid kernel?

Some tricks for constructing kernels

Let K(x, z) be a kernel function. If a > 0, then aK(x, z) is a kernel (feature map: √a Φ(x), where Φ is the feature map of K).

Some tricks for constructing kernels

Sums of kernels are kernels: let K₁ and K₂ be kernel functions; then K₁ + K₂ is a kernel.
What is the feature map that shows this? The concatenation of the underlying feature maps.

Some tricks for constructing kernels

Products of kernels are kernels: let K₁ and K₂ be kernel functions; then K₁K₂ is a kernel.
Feature map: construct a feature map that contains all products of pairs of features (one from each underlying feature map).

The cosine kernel

If K(x, z) is a kernel, then

K′(x, z) = K(x, z) / √(K(x, x) K(z, z))

is a kernel.

This kernel is equivalent to normalizing each example to have unit norm in the feature space associated with K. It is the cosine in the feature space associated with K:

cos(Φ(x), Φ(z)) = Φ(x)ᵀΦ(z) / (‖Φ(x)‖ ‖Φ(z)‖) = Φ(x)ᵀΦ(z) / √(Φ(x)ᵀΦ(x) Φ(z)ᵀΦ(z))
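Normalizing a precomputed kernel matrix this way is a one-liner; a sketch, assuming no diagonal entry is zero:

import numpy as np

def cosine_normalize(K):
    # K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)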

Infinite sums of kernels

Theorem: A function K(x, z) = K(xᵀz) with a series expansion

K(t) = Σₙ aₙtⁿ = a₀ + a₁t + a₂t² + …

is a kernel iff aₙ ≥ 0 for all n.

Corollary:

K(x, z) = exp(2γ xᵀz)

is a kernel.

Corollary: the Gaussian kernel

k(x, z) = exp(−γ‖x − z‖²) = exp(−γ(xᵀx + zᵀz) + 2γ xᵀz)

is a kernel, i.e. the Gaussian kernel is the cosine kernel K′(x, z) = K(x, z) / √(K(x, x) K(z, z)) of the exponential kernel K(x, z) = exp(2γ xᵀz).
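A quick numerical sanity check of this last claim (γ = 0.7 and the random points are arbitrary choices):

import numpy as np

gamma = 0.7
rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

def K_exp(a, b):
    return np.exp(2 * gamma * (a @ b))            # the exponential kernel exp(2 gamma x.T z)

gaussian = np.exp(-gamma * np.sum((x - z) ** 2))
cosine_of_exp = K_exp(x, z) / np.sqrt(K_exp(x, x) * K_exp(z, z))
print(np.isclose(gaussian, cosine_of_exp))        # True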
