
SVMs: nonlinearity through kernels

Chapter 3.4, e-8

Non-separable data

Consider the following two datasets:

[Figure 8.6: Non-separable data (reproduction of Figure 3.1). (a) A few noisy data points: linear with outliers. (b) Nonlinearly separable.]

What if the data is not linearly separable? Both datasets above are non-separable, but there is a difference. Figure 8.6 (reproduced from Chapter 3) illustrates the two types of non-separability: in Figure 8.6(a), two noisy data points render the data non-separable; in Figure 8.6(b), the target function is inherently nonlinear.


Mechanics of the Feature Transform I

Transform your features! Map the data into a Z-space, using a nonlinear function Φ, in which the data is separable.

[Figure: the data in X-space (axes x₁, x₂) and, after the transform, in Z-space (axes z₁ = x₁², z₂ = x₂²).]

x = (x₁, x₂)ᵀ   →   z = Φ(x) = (x₁², x₂²)ᵀ = (Φ₁(x), Φ₂(x))ᵀ

Feature transform II: classify in Z-space

Separate the data in the Z-space with w. In Z-space the data can be linearly separated:

g(z) = sign(wᵀz)

Feature transform III: bring back to X-space

To classify a new x, first transform x to Φ(x) in Z-space and classify there with g:

g(x) = g(Φ(x)) = sign(wᵀΦ(x))
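A minimal Python sketch of this pipeline; the toy dataset, the explicit bias b, and the hand-picked w are illustrative (in practice w is learned in Z-space):

import numpy as np

def phi(X):
    # Nonlinear feature transform: (x1, x2) -> (x1^2, x2^2)
    return X ** 2

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(np.sum(X ** 2, axis=1) < 0.5, 1, -1)  # separable by a circle, not by a line

Z = phi(X)                     # map the data to Z-space
w = np.array([-1.0, -1.0])     # a separating direction in Z-space
b = 0.5                        # explicit bias, added for convenience
g = np.sign(Z @ w + b)         # classify in Z-space; same as sign(w.T phi(x) + b) in X-space
print("training accuracy:", np.mean(g == y))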

Summary of nonlinear transform

After constructing features carefully, before seeing the data... if you think linear is not enough, try the 2nd order polynomial transform:

x = (x₁, x₂)ᵀ   →   Φ(x) = (Φ₁(x), Φ₂(x), Φ₃(x), Φ₄(x), Φ₅(x))ᵀ = (x₁, x₂, x₁², x₁x₂, x₂²)ᵀ

What can we say about the dimensionality of the Z-space as a function of the dimensionality of the data?

The general polynomial transform

We can get even fancier and choose higher-order polynomials. The degree-k polynomial transform:

Φ₁(x) = (1, x₁, x₂)
Φ₂(x) = (1, x₁, x₂, x₁², x₁x₂, x₂²)
Φ₃(x) = (1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², x₂³)
Φ₄(x) = (1, x₁, x₂, x₁², x₁x₂, x₂², x₁³, x₁²x₂, x₁x₂², x₂³, x₁⁴, x₁³x₂, x₁²x₂², x₁x₂³, x₂⁴)
...

What are the potential effects of increasing the order of the polynomial? The dimensionality of the feature space increases rapidly (and with it d_vc)!

Similar transforms exist for a d-dimensional original space.

Approximation-generalization tradeoff: a higher degree gives lower (even zero) Ein but worse generalization.
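To see how fast the dimensionality grows, here is a small sketch using scikit-learn's PolynomialFeatures (the library choice is ours; the counts match the transforms listed above):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.zeros((1, 2))  # a single 2-dimensional point, just to count features
for k in range(1, 6):
    n_features = PolynomialFeatures(degree=k).fit_transform(x).shape[1]
    print(f"degree {k}: {n_features} features")  # prints 3, 6, 10, 15, 21 for d = 2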

Be careful with nonlinear transforms

[Figure: a linear model vs. a fourth-order polynomial fit to the same data.]

Feature-space dimensionality increases rapidly, and with it the complexity of the model: danger of overfitting. A high-order polynomial transform can lead to nonsense.

Digits data

A few potential issues:

- Danger of overfitting
- Better chance of obtaining linear separability
- Computationally expensive (memory and time)

Kernels: avoid the computational expense with an implicit mapping.

Achieving non-linear discriminant functions

Consider two-dimensional data and the mapping

Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ

Let's plug that into the discriminant function:

wᵀΦ(x) = w₁x₁² + √2 w₂x₁x₂ + w₃x₂²

The resulting decision boundary is a conic section.
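For a concrete (illustrative) choice of weights, w = (1, 0, 1) gives wᵀΦ(x) = x₁² + x₂², so the boundary wᵀΦ(x) = c is a circle; a tiny sketch:

import numpy as np

def phi(x):
    # The quadratic feature map from the slide: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

w = np.array([1.0, 0.0, 1.0])  # illustrative weights: w.T phi(x) = x1^2 + x2^2
for x in [(1.0, 0.0), (0.0, 1.0), (np.sqrt(0.5), np.sqrt(0.5))]:
    print(x, w @ phi(np.array(x)))  # all print 1.0: these points lie on the unit circle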

How to avoid the overhead of explicit mapping

Suppose the weight vector can be expressed as:

w = Σᵢ₌₁ⁿ αᵢxᵢ

The discriminant function is then:

f(x) = wᵀx + b = Σᵢ αᵢ xᵢᵀx + b

And using our nonlinear mapping:

f(x) = wᵀΦ(x) + b = Σᵢ αᵢ Φ(xᵢ)ᵀΦ(x) + b

It turns out we can often compute the dot product without explicitly mapping the data into a high-dimensional feature space!

Example

Let's go back to the example

Φ(x) = (x₁², √2 x₁x₂, x₂²)ᵀ

and compute the dot product:

Φ(x)ᵀΦ(z) = (x₁², √2 x₁x₂, x₂²)(z₁², √2 z₁z₂, z₂²)ᵀ
          = x₁²z₁² + 2x₁x₂z₁z₂ + x₂²z₂²
          = (xᵀz)²

Do we need to perform the mapping explicitly?


NO! Squaring the dot product in the original space has the same
effect as computing the dot product in feature space.
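A quick numerical check of this identity; a minimal sketch using numpy (the helper name phi is ours):

import numpy as np

def phi(x):
    # Explicit feature map: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(z)   # dot product computed in feature space
implicit = (x @ z) ** 2      # kernel evaluated in the original space
print(np.isclose(explicit, implicit))  # True: the two agree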


Kernels
Definition: A function k(x, z) that can be expressed as a dot product in some feature space is called a kernel. In other words, k(x, z) is a kernel if there exists Φ : X → F such that

k(x, z) = Φ(x)ᵀΦ(z)

Why is this interesting?


If the algorithm can be expressed in terms of dot products, we
can work in the feature space without performing the mapping
explicitly!


The dual SVM problem

The dual SVM formulation depends on the data only through dot products, and so can be expressed using kernels. Replace

maximize    Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ xᵢᵀxⱼ
subject to  0 ≤ αᵢ ≤ C,   Σᵢ₌₁ⁿ αᵢyᵢ = 0

with:

maximize    Σᵢ₌₁ⁿ αᵢ − ½ Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ αᵢαⱼ yᵢyⱼ k(xᵢ, xⱼ)
subject to  0 ≤ αᵢ ≤ C,   Σᵢ₌₁ⁿ αᵢyᵢ = 0
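A minimal sketch of this replacement in practice: train an SVM on a precomputed kernel matrix with scikit-learn. The library choice, the Gaussian kernel, and the toy data are ours, not from the slides:

import numpy as np
from sklearn.svm import SVC

def rbf_kernel(A, B, gamma=1.0):
    # k(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs of rows
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(np.sum(X ** 2, axis=1) < 1.0, 1, -1)   # a nonlinear (circular) target

K = rbf_kernel(X, X)                        # n x n matrix of k(x_i, x_j)
clf = SVC(C=1.0, kernel="precomputed").fit(K, y)

X_test = rng.normal(size=(10, 2))
K_test = rbf_kernel(X_test, X)              # kernel between test points and training points
print(clf.predict(K_test))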


Standard kernel functions

The linear kernel:  k(x, z) = xᵀz
  Feature space: the original features.

Homogeneous polynomial kernel:  k(x, z) = (xᵀz)ᵈ
  Feature space: all monomials of degree d.

Polynomial kernel:  k(x, z) = (xᵀz + 1)ᵈ
  Feature space: all monomials of degree at most d.

Gaussian kernel:  k(x, z) = exp(−γ‖x − z‖²)
  Feature space: infinite dimensional.
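These kernels are one-liners in code; a sketch with γ and d as free parameters (function names are ours):

import numpy as np

def linear_kernel(x, z):
    return x @ z

def homogeneous_poly_kernel(x, z, d=2):
    return (x @ z) ** d

def poly_kernel(x, z, d=2):
    return (x @ z + 1) ** d

def gaussian_kernel(x, z, gamma=1.0):
    return np.exp(-gamma * np.sum((x - z) ** 2))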

Demo

Using the polynomial kernel: k(x, z) = (xᵀz + 1)ᵈ

Demo

Using the Gaussian kernel: k(x, z) = exp(−γ‖x − z‖²)

Kernelizing the perceptron algorithm

Recall the primal version of the perceptron algorithm:

Input: labeled data D in homogeneous coordinates
Output: weight vector w

w = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if xᵢ is misclassified:
            w = w + yᵢxᵢ
            converged = false

What do you need to change to express it in the dual? (I.e., express the algorithm in terms of the α coefficients.)

Kernelizing the perceptron algorithm

Input: labeled data D in homogeneous coordinates
Output: coefficients α, representing the weight vector as w = Σᵢ₌₁ⁿ αᵢxᵢ

α = 0
converged = false
while not converged:
    converged = true
    for i in 1, ..., |D|:
        if xᵢ is misclassified:
            αᵢ ← αᵢ + yᵢ
            converged = false

The update αᵢ ← αᵢ + yᵢ is equivalent to w′ = w + yᵢxᵢ.
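A runnable sketch of the dual perceptron; the misclassification test sign(Σⱼ αⱼ k(xⱼ, xᵢ)) ≠ yᵢ follows from w = Σⱼ αⱼxⱼ, while the kernel choice, toy data, and epoch cap are illustrative:

import numpy as np

def kernel_perceptron(K, y, max_epochs=100):
    """Dual perceptron on a precomputed kernel matrix K (K[i, j] = k(x_i, x_j)).
    Learns coefficients alpha such that f(x) = sum_j alpha_j * k(x_j, x)."""
    alpha = np.zeros(len(y))
    for _ in range(max_epochs):
        converged = True
        for i in range(len(y)):
            if np.sign(alpha @ K[:, i]) != y[i]:   # x_i is misclassified
                alpha[i] += y[i]                   # dual update: alpha_i <- alpha_i + y_i
                converged = False
        if converged:
            break
    return alpha

# Example with the quadratic kernel k(x, z) = (x.T z)^2, data in homogeneous coordinates
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])
y = np.where(X[:, 1] ** 2 > X[:, 2] ** 2, 1, -1)   # quadratic, not linear, in the inputs
K = (X @ X.T) ** 2
alpha = kernel_perceptron(K, y)
print("training accuracy:", np.mean(np.sign(K @ alpha) == y))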

Linear regression revisited

The sum-squared cost function:

Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)² = (y − Xw)ᵀ(y − Xw)

The optimal solution satisfies:

XᵀXw = Xᵀy

If we express w as:

w = Σᵢ₌₁ⁿ αᵢxᵢ = Xᵀα

we get:

XᵀXXᵀα = Xᵀy

Kernel linear regression

We now get that α satisfies:

XXᵀα = y

Compare with:

XᵀXw = Xᵀy

Which is harder to find? What have we gained?
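A sketch of kernel linear regression in numpy, with XXᵀ replaced by a kernel matrix K; the small ridge term λI is our addition for numerical stability and is not in the slides:

import numpy as np

def gaussian_kernel_matrix(A, B, gamma=1.0):
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)        # a nonlinear target

lam = 1e-3                                             # ridge term (assumption, for stability)
K = gaussian_kernel_matrix(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)   # solve (K + lam I) alpha = y

X_new = np.array([[0.5], [1.5]])
y_pred = gaussian_kernel_matrix(X_new, X) @ alpha      # f(x) = sum_i alpha_i k(x_i, x)
print(y_pred)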

The kernel matrix

XᵀX : the covariance matrix (d × d).
XXᵀ : the matrix of dot products associated with a dataset (n × n).

We can replace the latter with a matrix K such that:

Kᵢⱼ = Φ(xᵢ)ᵀΦ(xⱼ) = k(xᵢ, xⱼ)

This is the kernel matrix associated with a dataset, a.k.a. the Gram matrix.

What does the kernel matrix look like?

[Figure: kernel matrix for gene expression data in yeast.]

Properties of the kernel matrix

The kernel matrix:

Kᵢⱼ = Φ(xᵢ)ᵀΦ(xⱼ) = k(xᵢ, xⱼ)

- Symmetric (and therefore has real eigenvalues).
- Diagonal elements are non-negative: Kᵢᵢ = ‖Φ(xᵢ)‖² ≥ 0.
- Every kernel matrix is positive semi-definite, i.e. xᵀKx ≥ 0 for all x.
- Corollary: the eigenvalues of a kernel matrix are non-negative.
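A quick empirical check of these properties for a Gaussian kernel matrix (numpy only; the data and γ are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
sq = np.sum(X ** 2, axis=1)[:, None] + np.sum(X ** 2, axis=1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                            # Gaussian kernel matrix, gamma = 0.5

print(np.allclose(K, K.T))                       # symmetric
print(np.all(np.diag(K) >= 0))                   # non-negative diagonal
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)   # eigenvalues non-negative (up to round-off)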

Standard kernel functions

The linear kernel:  k(x, z) = xᵀz
  Feature space: the original features.

Homogeneous polynomial kernel:  k(x, z) = (xᵀz)ᵈ
  Feature space: all monomials of degree d.

Polynomial kernel:  k(x, z) = (xᵀz + 1)ᵈ
  Feature space: all monomials of degree at most d.

Gaussian kernel (aka RBF kernel):  k(x, z) = exp(−γ‖x − z‖²)
  Feature space: infinite dimensional.

How do we even know that the Gaussian is a valid kernel?

Some tricks for constructing kernels

Let K(x, z) be a kernel function. If a > 0, then aK(x, z) is a kernel (feature map: √a Φ(x), where Φ is the feature map of K).

Some tricks for constructing kernels

Sums of kernels are kernels: let K₁ and K₂ be kernel functions; then K₁ + K₂ is a kernel.
What is the feature map that shows this? The concatenation of the underlying feature maps.

Some tricks for constructing kernels

Products of kernels are kernels: let K₁ and K₂ be kernel functions; then K₁K₂ is a kernel.
Feature map: construct a feature map that contains all products of pairs of features (one from each underlying feature map).

The cosine kernel

If K(x, z) is a kernel, then

K′(x, z) = K(x, z) / √(K(x, x) K(z, z))

is a kernel.

This kernel is equivalent to normalizing each example to have unit norm in the feature space associated with K. It is the cosine in the feature space associated with K:

cos(Φ(x), Φ(z)) = Φ(x)ᵀΦ(z) / (‖Φ(x)‖ ‖Φ(z)‖) = Φ(x)ᵀΦ(z) / √(Φ(x)ᵀΦ(x) Φ(z)ᵀΦ(z))
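Normalizing a precomputed kernel matrix this way is a one-liner; a sketch, assuming no diagonal entry is zero:

import numpy as np

def cosine_normalize(K):
    # K'[i, j] = K[i, j] / sqrt(K[i, i] * K[j, j])
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)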

Infinite sums of kernels

Theorem: A function K(x, z) = K(xᵀz) with a series expansion

K(t) = Σₙ aₙtⁿ = a₀ + a₁t + a₂t² + …

is a kernel iff aₙ ≥ 0 for all n.

Corollary:

K(x, z) = exp(2γ xᵀz)

is a kernel.

Corollary: the Gaussian kernel

k(x, z) = exp(−γ‖x − z‖²) = exp(−γ(xᵀx + zᵀz) + 2γ xᵀz)

is a kernel, i.e. the Gaussian kernel is the cosine kernel K′(x, z) = K(x, z) / √(K(x, x) K(z, z)) of the exponential kernel K(x, z) = exp(2γ xᵀz).
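A quick numerical sanity check of this last claim (γ = 0.7 and the random points are arbitrary choices):

import numpy as np

gamma = 0.7
rng = np.random.default_rng(0)
x, z = rng.normal(size=3), rng.normal(size=3)

def K_exp(a, b):
    return np.exp(2 * gamma * (a @ b))            # the exponential kernel exp(2 gamma x.T z)

gaussian = np.exp(-gamma * np.sum((x - z) ** 2))
cosine_of_exp = K_exp(x, z) / np.sqrt(K_exp(x, x) * K_exp(z, z))
print(np.isclose(gaussian, cosine_of_exp))        # True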
