
Lecture 3: Dual problems and Kernels

C19 Machine Learning
Hilary 2013
A. Zisserman

Primal and dual forms
Linear separability revisited
Feature mapping
Kernels for SVMs
Kernel trick: requirements, radial basis functions

SVM review

We have seen that for an SVM, learning a linear classifier $f(x) = w^\top x + b$ is formulated as solving an optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

$$f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b$$

by solving an optimization problem over the $\alpha_i$. This is known as the dual problem, and we will look at the advantages of this formulation.

Sketch derivation of dual form

The Representer Theorem states that the solution $w$ can always be written as a linear combination of the training data:

$$w = \sum_{j=1}^{N} \alpha_j y_j x_j$$

Proof: see example sheet.

Now, substitute for $w$ in $f(x) = w^\top x + b$:

$$f(x) = \sum_{j=1}^{N} \alpha_j y_j\, (x_j^\top x) + b$$

and for $w$ in the cost function $\min_w \|w\|^2$ subject to $y_i\left(w^\top x_i + b\right) \ge 1,\ \forall i$:

$$\|w\|^2 = \left(\sum_j \alpha_j y_j x_j\right)^{\!\top} \left(\sum_k \alpha_k y_k x_k\right) = \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k)$$

Hence, an equivalent optimization problem is over the $\alpha_j$:

$$\min_{\alpha_j} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \text{subject to} \quad y_i\left(\sum_{j=1}^{N} \alpha_j y_j\, (x_j^\top x_i) + b\right) \ge 1,\ \forall i$$

and a few more steps are required to complete the derivation.
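As a sanity check on the substitution step, here is a minimal numpy sketch (not from the slides; the data, the $\alpha$ values and the bias are arbitrary and purely illustrative) confirming that $w^\top x + b$ and $\|w\|^2$ can be written entirely in terms of inner products $x_j^\top x_k$.

```python
# Numerical check: with w = sum_j alpha_j y_j x_j, the primal quantities
# w^T x + b and ||w||^2 match their dual (inner-product-only) expressions.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5
X = rng.normal(size=(N, d))          # training points, one per row
y = rng.choice([-1.0, 1.0], size=N)  # labels
alpha = rng.uniform(0, 1, size=N)    # arbitrary non-negative dual variables
b = 0.3
x_test = rng.normal(size=d)

w = (alpha * y) @ X                  # w = sum_j alpha_j y_j x_j

f_primal = w @ x_test + b
f_dual = np.sum(alpha * y * (X @ x_test)) + b

G = X @ X.T                          # Gram matrix of inner products x_j^T x_k
norm_primal = w @ w
norm_dual = (alpha * y) @ G @ (alpha * y)

print(np.isclose(f_primal, f_dual), np.isclose(norm_primal, norm_dual))
```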

Primal and dual formulations

$N$ is the number of training points, and $d$ is the dimension of the feature vector $x$.

Primal problem: for $w \in \mathbb{R}^d$,

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

Dual problem: for $\alpha \in \mathbb{R}^N$ (stated without proof),

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$

Need to learn $d$ parameters for the primal, and $N$ for the dual.
If $N \ll d$ then it is more efficient to solve for $\alpha$ than for $w$.
The dual form only involves $(x_j^\top x_k)$. We will return to why this is an advantage when we look at kernels.
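A small sketch of the dual objective and its constraints as stated above (the function and variable names are my own; an off-the-shelf QP solver would maximise this over $\alpha$, which is not shown here).

```python
# Evaluate the dual objective and check feasibility of a candidate alpha.
import numpy as np

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_jk alpha_j alpha_k y_j y_k (x_j^T x_k)."""
    G = X @ X.T                       # pairwise inner products
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ G @ ay

def dual_feasible(alpha, y, C, tol=1e-8):
    """Check 0 <= alpha_i <= C for all i and sum_i alpha_i y_i = 0."""
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol))
    balanced = abs(np.dot(alpha, y)) < tol
    return in_box and balanced

# Example: a uniform, feasible alpha on toy data with balanced labels
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.array([1., 1., 1., 1., -1., -1., -1., -1.])
alpha = np.full(8, 0.1)
print(dual_objective(alpha, X, y), dual_feasible(alpha, y, C=1.0))
```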

Primal and dual formulations

Primal version of the classifier: $f(x) = w^\top x + b$

Dual version of the classifier: $f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b$

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$ are zero. The ones that are non-zero define the support vectors $x_i$.

Support Vector Machine

[Figure: a linear SVM with decision boundary $w^\top x + b = 0$ at perpendicular distance $b/\|w\|$ from the origin, the support vectors marked, and a soft margin with C = 10]

$$f(x) = \sum_{i \in \text{support vectors}} \alpha_i y_i\, (x_i^\top x) + b$$

Handling data that is not linearly separable

Introduce slack variables $\xi_i \ge 0$:

$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i \ \text{for} \ i = 1 \dots N$$

But what if a linear classifier is not appropriate for the data at all?

Solution 1: use polar coordinates

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} r \\ \theta \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^2$$

[Figure: data with $f < 0$ and $f > 0$ regions arranged radially becomes separable after the change of coordinates]

The data is linearly separable in polar coordinates. The map acts non-linearly in the original space.

Solution 2: map data to higher dimension

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^3$$

[Figure: the mapped data plotted against $X = x_1^2$, $Y = x_2^2$, $Z = \sqrt{2}\, x_1 x_2$]

The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.
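A minimal sketch of why this works, on synthetic data of my own choosing: points on an inner circle and an outer ring are not linearly separable in $\mathbb{R}^2$, but after the map the first two coordinates sum to the squared radius, so a plane separates them in $\mathbb{R}^3$.

```python
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2); the plane X + Y = 1 separates radii 0.5 and 2.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, size=100)
inner = np.stack([0.5 * np.cos(angles[:50]), 0.5 * np.sin(angles[:50])], axis=1)
outer = np.stack([2.0 * np.cos(angles[50:]), 2.0 * np.sin(angles[50:])], axis=1)

Z_inner = np.array([phi(x) for x in inner])
Z_outer = np.array([phi(x) for x in outer])

# X + Y equals the squared radius, so thresholding it at 1 separates the classes.
print(np.all(Z_inner[:, 0] + Z_inner[:, 1] < 1.0),
      np.all(Z_outer[:, 0] + Z_outer[:, 1] > 1.0))
```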

SVM classifiers in a transformed feature space

$$\Phi : x \mapsto \Phi(x), \quad \mathbb{R}^d \to \mathbb{R}^D$$

[Figure: the map $\Phi$ takes points from $\mathbb{R}^d$ to $\mathbb{R}^D$, where the decision surface $f(x) = 0$ is linear]

Learn a classifier linear in $w$ for $\mathbb{R}^D$: $f(x) = w^\top \Phi(x) + b$. Here $\Phi(x)$ is a feature map.

Primal Classifier in transformed feature space

Classifier, with $w \in \mathbb{R}^D$: $f(x) = w^\top \Phi(x) + b$

Learning, for $w \in \mathbb{R}^D$:

$$\min_{w \in \mathbb{R}^D} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

Simply map $x$ to $\Phi(x)$ where the data is separable, and solve for $w$ in the high dimensional space $\mathbb{R}^D$.
If $D \gg d$ then there are many more parameters to learn for $w$. Can this be avoided?

Dual Classifier in transformed feature space

Classifier:

$$f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b \quad \longrightarrow \quad f(x) = \sum_i^N \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \longrightarrow \quad \max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, \Phi(x_j)^\top \Phi(x_k)$$

subject to $0 \le \alpha_i \le C$ for $\forall i$, and $\sum_i \alpha_i y_i = 0$

Dual Classifier in transformed feature space

Note that $\Phi(x)$ only occurs in pairs $\Phi(x_j)^\top \Phi(x_i)$.
Once the scalar products are computed, only the $N$ dimensional vector $\alpha$ needs to be learnt; it is not necessary to learn in the $D$ dimensional space, as it is for the primal.
Write $k(x_j, x_i) = \Phi(x_j)^\top \Phi(x_i)$. This is known as a Kernel.

Classifier:

$$f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, k(x_j, x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$
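A sketch of how the kernelised classifier is evaluated once the $\alpha_i$ and $b$ are known (the training step that produces them is not shown; function names are my own).

```python
# f(x) = sum_i alpha_i y_i k(x_i, x) + b, for any kernel function k.
import numpy as np

def decision_function(x, X_train, y_train, alpha, b, kernel):
    """Evaluate the dual SVM decision function at a single point x."""
    k_vals = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sum(alpha * y_train * k_vals) + b

def classify(x, X_train, y_train, alpha, b, kernel):
    return np.sign(decision_function(x, X_train, y_train, alpha, b, kernel))
```

Only training points with $\alpha_i > 0$ (the support vectors) contribute to the sum, so in practice the other points can be discarded after training.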

Special transformations

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^3$$

$$\Phi(x)^\top \Phi(z) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right) \begin{pmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\, z_1 z_2 \end{pmatrix} = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2$$

Kernel Trick

The classifier can be learnt and applied without explicitly computing $\Phi(x)$.
All that is required is the kernel $k(x, z) = (x^\top z)^2$.
The complexity of learning depends on $N$ (typically it is $O(N^3)$), not on $D$.
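A quick numerical confirmation of the trick (illustrative, on arbitrary points): the Gram matrix of the mapped data can be computed elementwise as $(x^\top z)^2$ without ever forming $\Phi(x)$.

```python
# Compare the explicit-feature Gram matrix with the kernel shortcut.
import numpy as np

def phi(x):  # the explicit R^2 -> R^3 map from the previous slides
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))

K_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])
K_kernel = (X @ X.T) ** 2            # k(x, z) = (x^T z)^2, no phi needed

print(np.allclose(K_explicit, K_kernel))
```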

Example kernels

Linear kernels: $k(x, x') = x^\top x'$

Polynomial kernels: $k(x, x') = \left(1 + x^\top x'\right)^d$ for any $d > 0$
  Contains all polynomial terms up to degree $d$

Gaussian kernels: $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$ for $\sigma > 0$
  Infinite dimensional feature space
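The three kernels above written out directly, as a sketch (x and xp are 1-D numpy arrays; the parameter names degree and sigma are mine).

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, degree=3):
    return (1.0 + x @ xp) ** degree

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
```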

Valid kernels: when can the kernel trick be used?

Given some arbitrary function $k(x_i, x_j)$, how do we know if it corresponds to a scalar product $\Phi(x_i)^\top \Phi(x_j)$ in some space?

Mercer kernels: if $k(\cdot, \cdot)$ satisfies
  Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
  Positive semi-definiteness: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^N$, where $K$ is the $N \times N$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$
then $k(\cdot, \cdot)$ is a valid kernel.

e.g. $k(x, z) = x^\top z$ is a valid kernel; $k(x, z) = -x^\top z$ is not (it is symmetric but not positive semi-definite).
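A practical, finite-sample version of this check (a sketch of my own, not from the slides): build the Gram matrix on some sample points and test symmetry and positive semi-definiteness via its eigenvalues. This only tests the condition on the sampled points, not for all possible inputs.

```python
import numpy as np

def is_valid_gram(kernel, X, tol=1e-8):
    """Symmetry and PSD check of the Gram matrix of `kernel` on the rows of X."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
print(is_valid_gram(lambda a, b: a @ b, X))      # linear kernel: True
print(is_valid_gram(lambda a, b: -(a @ b), X))   # negated inner product: False
```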

SVM classifier with Gaussian kernel

$N$ = size of training data

$$f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b$$

where each $x_i$ is a support vector and $\alpha_i$ is its weight (which may be zero).

Gaussian kernel: $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$

Radial Basis Function (RBF) SVM:

$$f(x) = \sum_i^N \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$

RBF Kernel SVM Example

[Figure: training points of two classes plotted in the original (feature x, feature y) space]

The data is not linearly separable in the original feature space.

$$f(x) = \sum_i^N \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$

[Figures: decision boundary $f(x) = 0$ and margins $f(x) = \pm 1$ for the following settings]

σ = 1.0, C = ∞
σ = 1.0, C = 100
σ = 1.0, C = 10    (decreasing C gives a wider, soft margin)
σ = 1.0, C = ∞
σ = 0.25, C = ∞
σ = 0.1, C = ∞     (decreasing σ moves towards a nearest neighbour classifier)
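A minimal sketch of this kind of sweep using scikit-learn's SVC (scikit-learn and the synthetic circles data are my assumptions, not part of the slides). Note that sklearn parameterises the RBF kernel by gamma, which corresponds to $1/(2\sigma^2)$ here, and a very large C is used to approximate C = ∞.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for sigma, C in [(1.0, 1e6), (1.0, 100), (1.0, 10), (0.25, 1e6), (0.1, 1e6)]:
    # C = 1e6 approximates the hard-margin (C = infinity) setting
    clf = SVC(kernel='rbf', C=C, gamma=1.0 / (2.0 * sigma**2)).fit(X, y)
    print(f"sigma={sigma}, C={C}: "
          f"{clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```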

Kernel block structure

Examine the structure of the kernel Gram matrix.

[Figure: 2D data with positive and negative vectors]

For $N$ data points $x_i$, the Gram matrix is an $N \times N$ matrix $K$ with entries

$$K_{ij} = k(x_i, x_j)$$

Kernel block structure

[Figures: decision boundaries for a linear kernel (C = 0.1) and an RBF kernel (C = 1, γ = 0.25), with support vectors, margin vectors, the decision boundary and the positive/negative margins marked; alongside, the Gram matrices for the linear kernel and the RBF kernel]

The kernel measures similarity between the points.
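A sketch of where the block structure comes from, on synthetic two-blob data of my own (not the data in the figures): with the points ordered so that one class comes first, the RBF Gram matrix has high within-class and low across-class similarity.

```python
import numpy as np

rng = np.random.default_rng(4)
pos = rng.normal(loc=+3.0, scale=1.0, size=(15, 2))
neg = rng.normal(loc=-3.0, scale=1.0, size=(15, 2))
X = np.vstack([pos, neg])            # class-ordered, as in the figure

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.25 * sq_dists)     # RBF kernel with gamma = 0.25
K_lin = X @ X.T                      # linear kernel, for comparison

# Average within-block vs. across-block similarity for the RBF kernel
within = (K_rbf[:15, :15].mean() + K_rbf[15:, 15:].mean()) / 2
across = K_rbf[:15, 15:].mean()
print(f"within-block {within:.3f}  vs  across-block {across:.3f}")
```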

Kernel Trick - Summary

Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space.
Data may be linearly separable in the high dimensional space, but not linearly separable in the original feature space.
Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere: they are not tied to the SVM formalism.
Kernels also apply to objects that are not vectors, e.g.

$$k(h, h') = \sum_k \min(h_k, h'_k) \quad \text{for histograms with bins } h_k,\ h'_k$$

We will see other examples of kernels later in regression and unsupervised learning.
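The histogram-intersection kernel above, written out as a short sketch for two histograms given as equal-length arrays of non-negative bin values.

```python
import numpy as np

def histogram_intersection_kernel(h, hp):
    """k(h, h') = sum_k min(h_k, h'_k)."""
    return np.sum(np.minimum(h, hp))

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(histogram_intersection_kernel(h1, h2))   # 0.1 + 0.5 + 0.3 = 0.9
```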

Background reading

Bishop, chapters 6.2 and 7
Hastie et al., chapter 12
More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
