
Lecture 3: Dual problems and Kernels

C19 Machine Learning
Hilary 2013
A. Zisserman

Primal and dual forms
Linear separability revisited
Feature mapping
Kernels for SVMs
Kernel trick: requirements, radial basis functions

SVM review

We have seen that for an SVM, learning a linear classifier $f(x) = w^\top x + b$ is formulated as solving an optimization problem over $w$:

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

This quadratic optimization problem is known as the primal problem. Instead, the SVM can be formulated to learn a linear classifier

$$f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b$$

by solving an optimization problem over the $\alpha_i$. This is known as the dual problem, and we will look at the advantages of this formulation.

Sketch derivation of dual form

The Representer Theorem states that the solution $w$ can always be written as a linear combination of the training data:

$$w = \sum_{j=1}^{N} \alpha_j y_j x_j$$

Proof: see example sheet.

Now, substitute for $w$ in $f(x) = w^\top x + b$:

$$f(x) = \sum_{j=1}^{N} \alpha_j y_j\, (x_j^\top x) + b$$

and for $w$ in the cost function $\min_w \|w\|^2$ subject to $y_i\left(w^\top x_i + b\right) \ge 1,\ \forall i$:

$$\|w\|^2 = \left(\sum_j \alpha_j y_j x_j\right)^{\!\top} \left(\sum_k \alpha_k y_k x_k\right) = \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k)$$

Hence, an equivalent optimization problem is over the $\alpha_j$:

$$\min_{\alpha_j} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \text{subject to} \quad y_i\left(\sum_{j=1}^{N} \alpha_j y_j\, (x_j^\top x_i) + b\right) \ge 1,\ \forall i$$

and a few more steps are required to complete the derivation.
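As a sanity check on the substitution step, here is a minimal numpy sketch (not from the slides; the data, the $\alpha$ values and the bias are arbitrary and purely illustrative) confirming that $w^\top x + b$ and $\|w\|^2$ can be written entirely in terms of inner products $x_j^\top x_k$.

```python
# Numerical check: with w = sum_j alpha_j y_j x_j, the primal quantities
# w^T x + b and ||w||^2 match their dual (inner-product-only) expressions.
import numpy as np

rng = np.random.default_rng(0)
N, d = 20, 5
X = rng.normal(size=(N, d))          # training points, one per row
y = rng.choice([-1.0, 1.0], size=N)  # labels
alpha = rng.uniform(0, 1, size=N)    # arbitrary non-negative dual variables
b = 0.3
x_test = rng.normal(size=d)

w = (alpha * y) @ X                  # w = sum_j alpha_j y_j x_j

f_primal = w @ x_test + b
f_dual = np.sum(alpha * y * (X @ x_test)) + b

G = X @ X.T                          # Gram matrix of inner products x_j^T x_k
norm_primal = w @ w
norm_dual = (alpha * y) @ G @ (alpha * y)

print(np.isclose(f_primal, f_dual), np.isclose(norm_primal, norm_dual))
```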

Primal and dual formulations

$N$ is the number of training points, and $d$ is the dimension of the feature vector $x$.

Primal problem: for $w \in \mathbb{R}^d$,

$$\min_{w \in \mathbb{R}^d} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

Dual problem: for $\alpha \in \mathbb{R}^N$ (stated without proof),

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$

Need to learn $d$ parameters for the primal, and $N$ for the dual.
If $N \ll d$ then it is more efficient to solve for $\alpha$ than for $w$.
The dual form only involves $(x_j^\top x_k)$. We will return to why this is an advantage when we look at kernels.
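A small sketch of the dual objective and its constraints as stated above (the function and variable names are my own; an off-the-shelf QP solver would maximise this over $\alpha$, which is not shown here).

```python
# Evaluate the dual objective and check feasibility of a candidate alpha.
import numpy as np

def dual_objective(alpha, X, y):
    """sum_i alpha_i - 1/2 sum_jk alpha_j alpha_k y_j y_k (x_j^T x_k)."""
    G = X @ X.T                       # pairwise inner products
    ay = alpha * y
    return alpha.sum() - 0.5 * ay @ G @ ay

def dual_feasible(alpha, y, C, tol=1e-8):
    """Check 0 <= alpha_i <= C for all i and sum_i alpha_i y_i = 0."""
    in_box = np.all((alpha >= -tol) & (alpha <= C + tol))
    balanced = abs(np.dot(alpha, y)) < tol
    return in_box and balanced

# Example: a uniform, feasible alpha on toy data with balanced labels
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = np.array([1., 1., 1., 1., -1., -1., -1., -1.])
alpha = np.full(8, 0.1)
print(dual_objective(alpha, X, y), dual_feasible(alpha, y, C=1.0))
```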

Primal and dual formulations

Primal version of the classifier: $f(x) = w^\top x + b$

Dual version of the classifier: $f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b$

At first sight the dual form appears to have the disadvantage of a K-NN classifier: it requires the training data points $x_i$. However, many of the $\alpha_i$ are zero. The ones that are non-zero define the support vectors $x_i$.

Support Vector Machine

[Figure: a linear SVM with decision boundary $w^\top x + b = 0$ at perpendicular distance $b/\|w\|$ from the origin, the support vectors marked, and a soft margin with C = 10]

$$f(x) = \sum_{i \in \text{support vectors}} \alpha_i y_i\, (x_i^\top x) + b$$

Handling data that is not linearly separable

Introduce slack variables $\xi_i \ge 0$:

$$\min_{w \in \mathbb{R}^d,\ \xi_i \in \mathbb{R}^+} \|w\|^2 + C \sum_i^N \xi_i \quad \text{subject to} \quad y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i \ \text{for} \ i = 1 \dots N$$

But what if a linear classifier is not appropriate for the data at all?

Solution 1: use polar coordinates

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} r \\ \theta \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^2$$

[Figure: data with $f < 0$ and $f > 0$ regions arranged radially becomes separable after the change of coordinates]

The data is linearly separable in polar coordinates. The map acts non-linearly in the original space.

Solution 2: map data to higher dimension

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^3$$

[Figure: the mapped data plotted against $X = x_1^2$, $Y = x_2^2$, $Z = \sqrt{2}\, x_1 x_2$]

The data is linearly separable in 3D. This means that the problem can still be solved by a linear classifier.
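A minimal sketch of why this works, on synthetic data of my own choosing: points on an inner circle and an outer ring are not linearly separable in $\mathbb{R}^2$, but after the map the first two coordinates sum to the squared radius, so a plane separates them in $\mathbb{R}^3$.

```python
# phi(x) = (x1^2, x2^2, sqrt(2) x1 x2); the plane X + Y = 1 separates radii 0.5 and 2.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2.0) * x1 * x2])

rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, size=100)
inner = np.stack([0.5 * np.cos(angles[:50]), 0.5 * np.sin(angles[:50])], axis=1)
outer = np.stack([2.0 * np.cos(angles[50:]), 2.0 * np.sin(angles[50:])], axis=1)

Z_inner = np.array([phi(x) for x in inner])
Z_outer = np.array([phi(x) for x in outer])

# X + Y equals the squared radius, so thresholding it at 1 separates the classes.
print(np.all(Z_inner[:, 0] + Z_inner[:, 1] < 1.0),
      np.all(Z_outer[:, 0] + Z_outer[:, 1] > 1.0))
```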

SVM classifiers in a transformed feature space

$$\Phi : x \mapsto \Phi(x), \quad \mathbb{R}^d \to \mathbb{R}^D$$

[Figure: the map $\Phi$ takes points from $\mathbb{R}^d$ to $\mathbb{R}^D$, where the decision surface $f(x) = 0$ is linear]

Learn a classifier linear in $w$ for $\mathbb{R}^D$: $f(x) = w^\top \Phi(x) + b$. Here $\Phi(x)$ is a feature map.

Primal Classifier in transformed feature space

Classifier, with $w \in \mathbb{R}^D$: $f(x) = w^\top \Phi(x) + b$

Learning, for $w \in \mathbb{R}^D$:

$$\min_{w \in \mathbb{R}^D} \|w\|^2 + C \sum_i^N \max\left(0,\ 1 - y_i f(x_i)\right)$$

Simply map $x$ to $\Phi(x)$ where the data is separable, and solve for $w$ in the high dimensional space $\mathbb{R}^D$.
If $D \gg d$ then there are many more parameters to learn for $w$. Can this be avoided?

Dual Classifier in transformed feature space

Classifier:

$$f(x) = \sum_i^N \alpha_i y_i\, (x_i^\top x) + b \quad \longrightarrow \quad f(x) = \sum_i^N \alpha_i y_i\, \Phi(x_i)^\top \Phi(x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, (x_j^\top x_k) \quad \longrightarrow \quad \max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, \Phi(x_j)^\top \Phi(x_k)$$

subject to $0 \le \alpha_i \le C$ for $\forall i$, and $\sum_i \alpha_i y_i = 0$

Dual Classifier in transformed feature space

Note that $\Phi(x)$ only occurs in pairs $\Phi(x_j)^\top \Phi(x_i)$.
Once the scalar products are computed, only the $N$ dimensional vector $\alpha$ needs to be learnt; it is not necessary to learn in the $D$ dimensional space, as it is for the primal.
Write $k(x_j, x_i) = \Phi(x_j)^\top \Phi(x_i)$. This is known as a Kernel.

Classifier:

$$f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b$$

Learning:

$$\max_{\alpha_i \ge 0} \sum_i \alpha_i - \frac{1}{2} \sum_{jk} \alpha_j \alpha_k y_j y_k\, k(x_j, x_k) \quad \text{subject to} \quad 0 \le \alpha_i \le C \ \ \forall i, \ \text{and} \ \sum_i \alpha_i y_i = 0$$
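A sketch of how the kernelised classifier is evaluated once the $\alpha_i$ and $b$ are known (the training step that produces them is not shown; function names are my own).

```python
# f(x) = sum_i alpha_i y_i k(x_i, x) + b, for any kernel function k.
import numpy as np

def decision_function(x, X_train, y_train, alpha, b, kernel):
    """Evaluate the dual SVM decision function at a single point x."""
    k_vals = np.array([kernel(x_i, x) for x_i in X_train])
    return np.sum(alpha * y_train * k_vals) + b

def classify(x, X_train, y_train, alpha, b, kernel):
    return np.sign(decision_function(x, X_train, y_train, alpha, b, kernel))
```

Only training points with $\alpha_i > 0$ (the support vectors) contribute to the sum, so in practice the other points can be discarded after training.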

Special transformations

$$\Phi : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto \begin{pmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{pmatrix} : \quad \mathbb{R}^2 \to \mathbb{R}^3$$

$$\Phi(x)^\top \Phi(z) = \left(x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2\right) \begin{pmatrix} z_1^2 \\ z_2^2 \\ \sqrt{2}\, z_1 z_2 \end{pmatrix} = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = (x^\top z)^2$$

Kernel Trick

The classifier can be learnt and applied without explicitly computing $\Phi(x)$.
All that is required is the kernel $k(x, z) = (x^\top z)^2$.
The complexity of learning depends on $N$ (typically it is $O(N^3)$), not on $D$.
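A quick numerical confirmation of the trick (illustrative, on arbitrary points): the Gram matrix of the mapped data can be computed elementwise as $(x^\top z)^2$ without ever forming $\Phi(x)$.

```python
# Compare the explicit-feature Gram matrix with the kernel shortcut.
import numpy as np

def phi(x):  # the explicit R^2 -> R^3 map from the previous slides
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 2))

K_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])
K_kernel = (X @ X.T) ** 2            # k(x, z) = (x^T z)^2, no phi needed

print(np.allclose(K_explicit, K_kernel))
```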

Example kernels

Linear kernels: $k(x, x') = x^\top x'$

Polynomial kernels: $k(x, x') = \left(1 + x^\top x'\right)^d$ for any $d > 0$
  Contains all polynomial terms up to degree $d$

Gaussian kernels: $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$ for $\sigma > 0$
  Infinite dimensional feature space
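The three kernels above written out directly, as a sketch (x and xp are 1-D numpy arrays; the parameter names degree and sigma are mine).

```python
import numpy as np

def linear_kernel(x, xp):
    return x @ xp

def polynomial_kernel(x, xp, degree=3):
    return (1.0 + x @ xp) ** degree

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))
```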

Valid kernels: when can the kernel trick be used?

Given some arbitrary function $k(x_i, x_j)$, how do we know if it corresponds to a scalar product $\Phi(x_i)^\top \Phi(x_j)$ in some space?

Mercer kernels: if $k(\cdot, \cdot)$ satisfies
  Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
  Positive semi-definiteness: $\alpha^\top K \alpha \ge 0$ for all $\alpha \in \mathbb{R}^N$, where $K$ is the $N \times N$ Gram matrix with entries $K_{ij} = k(x_i, x_j)$
then $k(\cdot, \cdot)$ is a valid kernel.

e.g. $k(x, z) = x^\top z$ is a valid kernel; $k(x, z) = -x^\top z$ is not (it is symmetric but not positive semi-definite).
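A practical, finite-sample version of this check (a sketch of my own, not from the slides): build the Gram matrix on some sample points and test symmetry and positive semi-definiteness via its eigenvalues. This only tests the condition on the sampled points, not for all possible inputs.

```python
import numpy as np

def is_valid_gram(kernel, X, tol=1e-8):
    """Symmetry and PSD check of the Gram matrix of `kernel` on the rows of X."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)
    return symmetric and psd

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
print(is_valid_gram(lambda a, b: a @ b, X))      # linear kernel: True
print(is_valid_gram(lambda a, b: -(a @ b), X))   # negated inner product: False
```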

SVM classifier with Gaussian kernel

$N$ = size of training data

$$f(x) = \sum_i^N \alpha_i y_i\, k(x_i, x) + b$$

where each $x_i$ is a support vector and $\alpha_i$ is its weight (which may be zero).

Gaussian kernel: $k(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right)$

Radial Basis Function (RBF) SVM:

$$f(x) = \sum_i^N \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$

RBF Kernel SVM Example

[Figure: training points of two classes plotted in the original (feature x, feature y) space]

The data is not linearly separable in the original feature space.

$$f(x) = \sum_i^N \alpha_i y_i \exp\left(-\|x - x_i\|^2 / 2\sigma^2\right) + b$$

[Figures: decision boundary $f(x) = 0$ and margins $f(x) = \pm 1$ for the following settings]

σ = 1.0, C = ∞
σ = 1.0, C = 100
σ = 1.0, C = 10    (decreasing C gives a wider, soft margin)
σ = 1.0, C = ∞
σ = 0.25, C = ∞
σ = 0.1, C = ∞     (decreasing σ moves towards a nearest neighbour classifier)
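A minimal sketch of this kind of sweep using scikit-learn's SVC (scikit-learn and the synthetic circles data are my assumptions, not part of the slides). Note that sklearn parameterises the RBF kernel by gamma, which corresponds to $1/(2\sigma^2)$ here, and a very large C is used to approximate C = ∞.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for sigma, C in [(1.0, 1e6), (1.0, 100), (1.0, 10), (0.25, 1e6), (0.1, 1e6)]:
    # C = 1e6 approximates the hard-margin (C = infinity) setting
    clf = SVC(kernel='rbf', C=C, gamma=1.0 / (2.0 * sigma**2)).fit(X, y)
    print(f"sigma={sigma}, C={C}: "
          f"{clf.n_support_.sum()} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```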

Kernel block structure

Examine the structure of the kernel Gram matrix.

[Figure: 2D data with positive and negative vectors]

For $N$ data points $x_i$, the Gram matrix is an $N \times N$ matrix $K$ with entries

$$K_{ij} = k(x_i, x_j)$$

Kernel block structure

[Figures: decision boundaries for a linear kernel (C = 0.1) and an RBF kernel (C = 1, γ = 0.25), with support vectors, margin vectors, the decision boundary and the positive/negative margins marked; alongside, the Gram matrices for the linear kernel and the RBF kernel]

The kernel measures similarity between the points.
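A sketch of where the block structure comes from, on synthetic two-blob data of my own (not the data in the figures): with the points ordered so that one class comes first, the RBF Gram matrix has high within-class and low across-class similarity.

```python
import numpy as np

rng = np.random.default_rng(4)
pos = rng.normal(loc=+3.0, scale=1.0, size=(15, 2))
neg = rng.normal(loc=-3.0, scale=1.0, size=(15, 2))
X = np.vstack([pos, neg])            # class-ordered, as in the figure

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_rbf = np.exp(-0.25 * sq_dists)     # RBF kernel with gamma = 0.25
K_lin = X @ X.T                      # linear kernel, for comparison

# Average within-block vs. across-block similarity for the RBF kernel
within = (K_rbf[:15, :15].mean() + K_rbf[15:, 15:].mean()) / 2
across = K_rbf[:15, 15:].mean()
print(f"within-block {within:.3f}  vs  across-block {across:.3f}")
```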

Kernel Trick - Summary

Classifiers can be learnt for high dimensional feature spaces, without actually having to map the points into the high dimensional space.
Data may be linearly separable in the high dimensional space, but not linearly separable in the original feature space.
Kernels can be used for an SVM because of the scalar product in the dual form, but can also be used elsewhere: they are not tied to the SVM formalism.
Kernels also apply to objects that are not vectors, e.g.

$$k(h, h') = \sum_k \min(h_k, h'_k) \quad \text{for histograms with bins } h_k,\ h'_k$$

We will see other examples of kernels later in regression and unsupervised learning.
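The histogram-intersection kernel above, written out as a short sketch for two histograms given as equal-length arrays of non-negative bin values.

```python
import numpy as np

def histogram_intersection_kernel(h, hp):
    """k(h, h') = sum_k min(h_k, h'_k)."""
    return np.sum(np.minimum(h, hp))

h1 = np.array([0.2, 0.5, 0.3])
h2 = np.array([0.1, 0.6, 0.3])
print(histogram_intersection_kernel(h1, h2))   # 0.1 + 0.5 + 0.3 = 0.9
```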

Background reading

Bishop, chapters 6.2 and 7
Hastie et al., chapter 12
More on the web page: http://www.robots.ox.ac.uk/~az/lectures/ml
