Outline
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin
Kernel Functions
Support Vector Regression
Kernel Machines
Also known as the support vector machine (SVM)
A discriminant-based method: learn the class boundaries directly
Training set $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$
Margin
The distance of a point $x_i$ to the hyperplane $w^\top x + w_0 = 0$ is
$$\frac{|w^\top x_i + w_0|}{\|w\|}$$
The distance from the hyperplane to the closest instances is the margin.
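To make the geometry concrete, here is a minimal numpy sketch (the hyperplane and points are toy values of my own) computing point-to-hyperplane distances and the resulting margin:

```python
# Distance of x_i to the hyperplane w^T x + w_0 = 0 is |w^T x_i + w_0| / ||w||.
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0           # ||w|| = 5
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 0.0]])

dist = np.abs(X @ w + w0) / np.linalg.norm(w)
print(dist)                                   # per-point distances
print("margin:", dist.min())                  # distance to the closest instance
```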
Optimal Separating Hyperplane
After scaling so that the closest instances satisfy $y_i(w^\top x_i + w_0) = 1$, the margin is $1/\|w\|$; maximizing it is equivalent to
$$\min_{w, w_0}\; \frac{1}{2}\|w\|^2 \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1, \;\; \forall i$$
Later we will talk about the kernel that will map the points to an even higher-dimensional space!
Prefer a formulation whose complexity is not based on the input dimensionality.
Transform it to a (dual) problem where the complexity depends on the number of training examples, and more precisely on the points that lie on the margin (the support vectors).
Karush-Kuhn-Tucker (KKT) Conditions
Writing the constraints as $g_i(w, w_0) = 1 - y_i(w^\top x_i + w_0) \le 0$ with multipliers $\alpha_i$, the KKT conditions are:
$$\nabla_{w, w_0}\, L(w, w_0, \alpha) = 0$$
$$g_i(w, w_0) \le 0$$
$$\alpha_i \ge 0$$
$$\alpha_i\, g_i(w, w_0) = 0$$
Apply Lagrange multipliers $\alpha_i \ge 0$ to the primal problem
$$\min_{w, w_0}\; \frac{1}{2}\|w\|^2 \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1, \;\; \forall i$$
The primal Lagrangian:
$$L_p = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1\right] = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i y_i(w^\top x_i + w_0) + \sum_{i=1}^{N}\alpha_i$$
Set $\partial L_p / \partial w = 0$ and $\partial L_p / \partial w_0 = 0$.
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i y_i x_i \qquad\qquad \frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0$$
Substituting $w = \sum_{i=1}^{N}\alpha_i y_i x_i$ back:
$$L_d = \max_{\alpha}\left[\frac{1}{2}w^\top w - w^\top \sum_{i=1}^{N}\alpha_i y_i x_i - w_0\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i\right]$$
Using $\sum_i \alpha_i y_i = 0$:
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i \ge 0, \;\;\forall i$$
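As an aside (not from the slides), here is a hedged sketch that solves this dual on toy data with scipy's general-purpose SLSQP solver; dedicated QP or SMO solvers are used in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels y in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T     # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                              # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                        # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                             # support vectors: alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)               # from y_i (w^T x_i + w0) = 1
print("support vectors:", np.where(sv)[0], "w:", w, "w0:", w0)
```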
Solving the (quadratic) dual for $\alpha$ gives
$$w = \sum_{i=1}^{N}\alpha_i y_i x_i$$
Most $\alpha_i = 0$; the instances with $\alpha_i > 0$ lie on the margin and are the support vectors.
[Figure: the optimal separating hyperplane and its margin; the circled points are the support vectors.]
Soft Error
A correctly classified example far from the margin (on the correct side): $\xi_i = 0$
A correctly classified example on the margin: $\xi_i = 0$
A correctly classified example inside the margin: $0 < \xi_i < 1$
An incorrectly classified example (past the margin, on the wrong side): $\xi_i \ge 1$
Therefore the total error can be summarized in terms of the $\xi_i$'s:
Soft error $= \sum_{i=1}^{N}\xi_i$
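A small sketch (toy hyperplane and points of my own) that computes the $\xi_i$'s and the soft error for a given $(w, w_0)$:

```python
import numpy as np

w, w0 = np.array([1.0, 1.0]), -0.5
X = np.array([[2.0, 2.0], [0.8, 0.7], [0.4, 0.3], [-1.0, -1.0]])
y = np.array([1, 1, 1, 1], dtype=float)

xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))
# xi == 0: on or outside the margin; 0 < xi < 1: inside the margin;
# xi >= 1: misclassified.
print(xi, "soft error:", xi.sum())        # [0.  0.  0.8 3.5] soft error: 4.3
```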
The soft-margin primal:
$$\min_{w, w_0, \xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1 - \xi_i, \text{ and } \xi_i \ge 0, \;\;\forall i$$
where $C$ is a penalty factor that stresses the importance of reducing the soft error.
Using Lagrange multipliers:
$$L_p = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i$$
Set $\partial L_p / \partial w = 0$, $\partial L_p / \partial w_0 = 0$, and $\partial L_p / \partial \xi_i = 0$.
Setting $\partial L_p / \partial \xi_i = 0$ gives $C - \alpha_i - \mu_i = 0$, so the dual has the same form as before:
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Think of $C$ as a regularization parameter:
High $C$: high penalty for non-separable examples (can overfit)
Low $C$: less penalty (can underfit)
$C$ is determined using a validation set.
This is a quadratic optimization problem:
Support vectors have $\alpha_i > 0$
Misclassified examples have $\alpha_i = C$
Correctly classified examples (outside the margin) have $\alpha_i = 0$
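A hedged illustration (assuming scikit-learn is available; not part of the slides) of how $C$ affects the solution, via the number of support vectors of a linear SVC:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Low C tolerates margin violations, so more alpha_i hit the bound C
    # and more points remain support vectors.
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```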
Kernels
Prior approaches assumed that the data is linearly separable, or obtained the best linear fit.
Kernels (continued)
Transform the $d$-dimensional input feature space (the x-space) to a $k$-dimensional feature space using basis functions; call this the z-space:
$$z = \varphi(x), \quad \text{where } z_j = \varphi_j(x), \;\; j = 1, \ldots, k$$
The data may be linearly separable in the z-space:
$$g(z) = w^\top z + w_0$$
$$g(x) = w^\top \varphi(x) = \sum_{j=0}^{k} w_j\, \varphi_j(x)$$
Assume $\varphi_0(x) = 1$ (this absorbs $w_0$).
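A minimal sketch of this idea: XOR-style data (my own toy example) is not linearly separable in the x-space but becomes separable in the z-space under $\varphi(x) = (x_1, x_2, x_1 x_2)$:

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

def phi(x):                       # explicit basis functions
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([phi(x) for x in X])
w = np.array([0.0, 0.0, 1.0])     # in z-space, w = (0, 0, 1), w0 = 0 separates
print(np.sign(Z @ w))             # [ 1.  1. -1. -1.] matches y
```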
Recall the soft-margin Lagrangian and the resulting dual:
$$L_p = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i$$
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
In the z-space these become
$$w = \sum_{i=1}^{N}\alpha_i y_i z_i = \sum_{i=1}^{N}\alpha_i y_i\, \varphi(x_i)$$
$$g(z) = w^\top z = \sum_{i=1}^{N}\alpha_i y_i\, \varphi(x_i)^\top \varphi(x)$$
Kernel Functions
While solving the optimization and applying the solution, we observe that we only ever need $\varphi(x_i)^\top \varphi(x_j)$, the inner product of two points in the z-space. Define the kernel function $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$, so the mapping $\varphi$ never has to be computed explicitly.
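A sketch verifying the kernel trick numerically for the quadratic kernel $K(x, x') = (1 + x^\top x')^2$ in 2-D, whose explicit feature map is $\varphi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + x @ xp) ** 2, phi(x) @ phi(xp))   # both 4.0: K computes the
                                             # z-space inner product without phi
```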
[Figures: decision boundaries obtained with different kernels; the circled points are the support vectors.]
Gaussian Kernel
$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
$\sigma$ is the Gaussian kernel width (radius); larger $\sigma$ implies smoother boundaries.
What is the dimension of the z-space? (For the Gaussian kernel it is infinite-dimensional.)
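A short numpy sketch of the Gaussian kernel matrix (toy points of my own); varying $\sigma$ shows how the kernel's reach changes:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y, for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(gaussian_kernel(X, X, sigma=1.0).round(3))   # distant pairs -> K near 0
print(gaussian_kernel(X, X, sigma=10.0).round(3))  # larger sigma -> K near 1
```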
New kernels can be built from existing kernels $K_1$ and $K_2$, for example:
$$K(x, x') = c\,K_1(x, x'), \quad c > 0$$
$$K(x, x') = K_1(x, x') + K_2(x, x')$$
$$K(x, x') = K_1(x, x')\,K_2(x, x')$$
$$K(x, x') = \sum_{m=1}^{M} c_m K_m(x, x')$$
With a kernel, the dual and the discriminant become
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j) + \sum_{i=1}^{N}\alpha_i\right]$$
$$g(x) = \sum_{i=1}^{N}\alpha_i y_i\, K(x_i, x)$$
Recall the soft-margin constraints: $y_i(w^\top x_i + w_0) \ge 1 - \xi_i$, and $\xi_i \ge 0$, $\forall i$. Equivalently, $\xi_i = \max\left(0,\; 1 - y_i(w^\top x_i + w_0)\right)$, which is the hinge loss.
[Figure (Bishop, PRML, p. 337): plot of $E(z)$ comparing the squared loss, the hinge loss used by the SVM, and the logarithmic (logistic) loss.]
The SVM objective can therefore be written as regularized hinge loss (Bishop, eq. 7.44):
$$\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda\|w\|^2$$
Multiclass Kernel Machines
$$\min\; \sum_{m=1}^{K}\frac{1}{2}\|w_m\|^2 + C\sum_{i=1}^{N}\sum_{m \ne y_i}\xi_i^m$$
Subject to
$$w_{y_i}^\top x_i + w_{y_i 0} \ge w_m^\top x_i + w_{m0} + 2 - \xi_i^m, \qquad \xi_i^m \ge 0, \;\;\forall i,\; m \ne y_i$$
where the class label for $x_i$ is $y_i$. It is an expensive optimization problem.
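In practice the expensive joint problem is usually avoided by decomposition into binary problems. A hedged sketch (assuming scikit-learn): SVC trains one-vs-one binary SVMs internally and combines them:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # 3 classes
clf = SVC(kernel="linear", C=1.0).fit(X, y)  # K(K-1)/2 binary problems inside
print(clf.predict(X[:5]), y[:5])
```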
Coordinate ascent:
Until convergence:
  For $i = 1, \ldots, N$:
    $\alpha_i = \arg\max_{\hat{\alpha}_i} L(\alpha_1, \ldots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \ldots, \alpha_N)$
[Figure: contours of a quadratic objective and the zig-zag path taken by coordinate ascent.]
The ellipses in the figure are the contours of the quadratic function we want to optimize,
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Can we perform coordinate ascent here, optimizing with respect to a single $\alpha_i$ while fixing the remaining? No: the equality constraint means that fixing all but one multiplier also fixes the last one, so at least two multipliers must be updated at a time. This is the idea behind sequential minimal optimization (SMO).
Updating the pair $(\alpha_1, \alpha_2)$ while holding the rest fixed, the equality constraint gives
$$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{N}\alpha_i y_i \;=\; \text{constant}$$
From Platt (MSR-TR-98-14): the SMO algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine. Even though more optimization sub-problems are solved in the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly.
[Figure 1 (Platt): the two multipliers lie in the box $0 \le \alpha_1, \alpha_2 \le C$; if $y_1 \ne y_2$ they lie on a line $\alpha_1 - \alpha_2 = k$, and if $y_1 = y_2$ on a line $\alpha_1 + \alpha_2 = k$. The two Lagrange multipliers must fulfill all of the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box. The linear equality constraint causes them to lie on a diagonal line. Therefore, one step of SMO must find an optimum of the objective function on a diagonal line segment.]
Also from Platt: SMO can fit inside the memory of an ordinary personal computer or workstation, and because no matrix algorithms are used in SMO, it is less susceptible to numerical precision problems.
Compute the ends of the line segment in terms of $\alpha_2$:
If $y_1 \ne y_2$: $L = \max(0,\, \alpha_2 - \alpha_1)$, $H = \min(C,\, C + \alpha_2 - \alpha_1)$
If $y_1 = y_2$: $L = \max(0,\, \alpha_2 + \alpha_1 - C)$, $H = \min(C,\, \alpha_2 + \alpha_1)$
The new (unclipped) value is
$$\alpha_2 \leftarrow \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}, \quad \text{then clipped to } [L, H]$$
where
$$E_i = (w^\top x_i + w_0) - y_i = \sum_{j=1}^{N}\alpha_j y_j\, x_j^\top x_i + w_0 - y_i$$
$$\eta = (x_1 - x_2)^\top (x_1 - x_2)$$
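Below is a compact sketch of the simplified SMO variant (random choice of the second multiplier, no selection heuristics), using a linear kernel and the $L$, $H$, $\eta$, and $E_i$ formulas above. Platt's full algorithm adds heuristics for picking the pair; this is only an illustration:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    alpha, b = np.zeros(N), 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(N) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:                  # ends of the diagonal segment
                    L, H = max(0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0, ai + aj - C), min(C, ai + aj)
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if L == H or eta <= 0:
                    continue
                alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (aj - alpha[j])  # keep sum alpha_i y_i fixed
                # recompute the bias from the KKT conditions
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```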
Training time comparison from Platt, MSR-TR-98-14:

Training Set Size   SMO Time (s)   Chunking Time (s)   Non-Bound SVs   Bound SVs
 1605                  0.4              37.1                 42            633
 2265                  0.9             228.3                 47            930
 3185                  1.8             596.2                 57           1210
 4781                  3.6            1954.2                 63           1791
 6414                  5.5            3684.6                 61           2370
11221                 17.0           20711.3                 79           4079
16101                 35.3               N/A                 67           5854
22697                 85.7               N/A                 88           8209
32562                163.6               N/A                149          11558

The training set size was varied by taking random subsets of the full training set. These subsets are nested. The "N/A" entries in the chunking time column had matrices that were too large to fit into 128 megabytes, and hence could not be timed due to memory thrashing. The numbers of non-bound and bound support vectors were determined from SMO; the chunking results vary by a small amount, due to the tolerance of inaccuracies around the KKT conditions. By fitting a line to the log-log plot of training time versus training set size, an empirical scaling exponent can be derived for each algorithm.
Support Vector Regression
Use a linear model $f(x) = w^\top x + w_0$ with an $\epsilon$-insensitive tube:
$$\min\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\left(\xi_i^+ + \xi_i^-\right)$$
Subject to
$$y_i - (w^\top x_i + w_0) \le \epsilon + \xi_i^+$$
$$(w^\top x_i + w_0) - y_i \le \epsilon + \xi_i^-$$
$$\xi_i^+,\, \xi_i^- \ge 0$$
The dual:
$$\max_{\alpha^+,\, \alpha^-}\; -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\, x_i^\top x_j \;-\; \epsilon\sum_{i=1}^{N}(\alpha_i^+ + \alpha_i^-) \;+\; \sum_{i=1}^{N} y_i(\alpha_i^+ - \alpha_i^-)$$
Subject to
$$0 \le \alpha_i^+ \le C, \qquad 0 \le \alpha_i^- \le C, \qquad \sum_{i=1}^{N}(\alpha_i^+ - \alpha_i^-) = 0$$
Support Vectors
[Figure: the $\epsilon$-insensitive tube around the regression fit; points on or outside the tube are the support vectors.]
$$f(x) = w^\top x + w_0 = \sum_{i=1}^{N}(\alpha_i^+ - \alpha_i^-)\, x_i^\top x + w_0$$
Average $w_0$ over the instances satisfying:
$$y_i = w^\top x_i + w_0 + \epsilon, \quad \text{if } 0 < \alpha_i^+ < C$$
$$y_i = w^\top x_i + w_0 - \epsilon, \quad \text{if } 0 < \alpha_i^- < C$$
Similar to classification, this can be extended to use the kernel function.
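A hedged end-to-end sketch (assuming scikit-learn) of $\epsilon$-SVR on a toy 1-D problem; points outside the $\epsilon$-tube become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(model.support_), "of", len(X))
```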
[Figure: support vector regression fit with a Gaussian kernel.]
SVMLight
Summary
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin
Kernel Functions
Loss Function
SMO algorithm for optimization
Support Vector Regression
One-class Kernel Machines (refer to Section 13.11)