Outline
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin
Kernel Functions
Support Vector Regression
Kernel Machines
Also known as the support vector machine (SVM)
A discriminant-based method: learn the class boundaries directly
Training set $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N}$, where $y_i \in \{-1, +1\}$
Margin
The distance of a point $x_i$ to the hyperplane $w^\top x + w_0 = 0$ is
$$\frac{|w^\top x_i + w_0|}{\|w\|}$$
The distance from the hyperplane to the closest instances is the margin.
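To make the geometry concrete, here is a minimal numpy sketch (the hyperplane and points are toy values of my own) computing point-to-hyperplane distances and the resulting margin:

```python
# Distance of x_i to the hyperplane w^T x + w_0 = 0 is |w^T x_i + w_0| / ||w||.
import numpy as np

w, w0 = np.array([3.0, 4.0]), -5.0           # ||w|| = 5
X = np.array([[1.0, 2.0], [3.0, 1.0], [0.0, 0.0]])

dist = np.abs(X @ w + w0) / np.linalg.norm(w)
print(dist)                                   # per-point distances
print("margin:", dist.min())                  # distance to the closest instance
```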
Optimal Separating Hyperplane
After scaling so that the closest instances satisfy $y_i(w^\top x_i + w_0) = 1$, the margin is $1/\|w\|$; maximizing it is equivalent to
$$\min_{w, w_0}\; \frac{1}{2}\|w\|^2 \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1, \;\; \forall i$$
Later we will talk about the kernel that will map the points to an even higher-dimensional space!
Prefer a formulation whose complexity is not based on the input dimensionality.
Transform it to a (dual) problem where the complexity depends on the number of training examples, and more precisely on the points that lie on the margin (the support vectors).
Karush-Kuhn-Tucker (KKT) Conditions
Writing the constraints as $g_i(w, w_0) = 1 - y_i(w^\top x_i + w_0) \le 0$ with multipliers $\alpha_i$, the KKT conditions are:
$$\nabla_{w, w_0}\, L(w, w_0, \alpha) = 0$$
$$g_i(w, w_0) \le 0$$
$$\alpha_i \ge 0$$
$$\alpha_i\, g_i(w, w_0) = 0$$
Apply Lagrange multipliers $\alpha_i \ge 0$ to the primal problem
$$\min_{w, w_0}\; \frac{1}{2}\|w\|^2 \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1, \;\; \forall i$$
The primal Lagrangian:
$$L_p = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1\right] = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N}\alpha_i y_i(w^\top x_i + w_0) + \sum_{i=1}^{N}\alpha_i$$
Set $\partial L_p / \partial w = 0$ and $\partial L_p / \partial w_0 = 0$.
$$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N}\alpha_i y_i x_i \qquad\qquad \frac{\partial L_p}{\partial w_0} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0$$
Substituting $w = \sum_{i=1}^{N}\alpha_i y_i x_i$ back:
$$L_d = \max_{\alpha}\left[\frac{1}{2}w^\top w - w^\top \sum_{i=1}^{N}\alpha_i y_i x_i - w_0\sum_{i=1}^{N}\alpha_i y_i + \sum_{i=1}^{N}\alpha_i\right]$$
Using $\sum_i \alpha_i y_i = 0$:
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \alpha_i \ge 0, \;\;\forall i$$
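As an aside (not from the slides), here is a hedged sketch that solves this dual on toy data with scipy's general-purpose SLSQP solver; dedicated QP or SMO solvers are used in practice:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels y in {-1, +1}
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 3.0],
              [-1.0, -1.0], [-2.0, -1.5], [-3.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1], dtype=float)
N = len(y)

Q = (y[:, None] * X) @ (y[:, None] * X).T     # Q_ij = y_i y_j x_i^T x_j

def neg_dual(a):                              # minimize the negative dual
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=[(0, None)] * N,                        # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                             # support vectors: alpha_i > 0
w0 = np.mean(y[sv] - X[sv] @ w)               # from y_i (w^T x_i + w0) = 1
print("support vectors:", np.where(sv)[0], "w:", w, "w0:", w0)
```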
Solving the (quadratic) dual for $\alpha$ gives
$$w = \sum_{i=1}^{N}\alpha_i y_i x_i$$
Most $\alpha_i = 0$; the instances with $\alpha_i > 0$ lie on the margin and are the support vectors.
[Figure: the optimal separating hyperplane and its margin; the circled points are the support vectors.]
Soft Error
A correctly classified example far from the margin (on the correct side): $\xi_i = 0$
A correctly classified example on the margin: $\xi_i = 0$
A correctly classified example inside the margin: $0 < \xi_i < 1$
An incorrectly classified example (past the margin, on the wrong side): $\xi_i \ge 1$
Therefore the total error can be summarized in terms of the $\xi_i$'s:
Soft error $= \sum_{i=1}^{N}\xi_i$
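A small sketch (toy hyperplane and points of my own) that computes the $\xi_i$'s and the soft error for a given $(w, w_0)$:

```python
import numpy as np

w, w0 = np.array([1.0, 1.0]), -0.5
X = np.array([[2.0, 2.0], [0.8, 0.7], [0.4, 0.3], [-1.0, -1.0]])
y = np.array([1, 1, 1, 1], dtype=float)

xi = np.maximum(0.0, 1.0 - y * (X @ w + w0))
# xi == 0: on or outside the margin; 0 < xi < 1: inside the margin;
# xi >= 1: misclassified.
print(xi, "soft error:", xi.sum())        # [0.  0.  0.8 3.5] soft error: 4.3
```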
The soft-margin primal:
$$\min_{w, w_0, \xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i \quad \text{subject to}\quad y_i(w^\top x_i + w_0) \ge 1 - \xi_i, \text{ and } \xi_i \ge 0, \;\;\forall i$$
where $C$ is a penalty factor that stresses the importance of reducing the soft error.
Using Lagrange multipliers:
$$L_p = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i$$
Set $\partial L_p / \partial w = 0$, $\partial L_p / \partial w_0 = 0$, and $\partial L_p / \partial \xi_i = 0$.
Setting $\partial L_p / \partial \xi_i = 0$ gives $C - \alpha_i - \mu_i = 0$, so the dual has the same form as before:
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Think of $C$ as a regularization parameter:
High $C$: high penalty for non-separable examples (can overfit)
Low $C$: less penalty (can underfit)
$C$ is determined using a validation set.
This is a quadratic optimization problem:
Support vectors have $\alpha_i > 0$
Misclassified examples have $\alpha_i = C$
Correctly classified examples (outside the margin) have $\alpha_i = 0$
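A hedged illustration (assuming scikit-learn is available; not part of the slides) of how $C$ affects the solution, via the number of support vectors of a linear SVC:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Low C tolerates margin violations, so more alpha_i hit the bound C
    # and more points remain support vectors.
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```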
Kernels
Prior approaches assumed that the data is linearly separable, or obtained the best linear fit.
Kernels (continued)
Transform the $d$-dimensional input feature space (the x-space) to a $k$-dimensional feature space using basis functions; call this the z-space:
$$z = \varphi(x), \quad \text{where } z_j = \varphi_j(x), \;\; j = 1, \ldots, k$$
The data may be linearly separable in the z-space:
$$g(z) = w^\top z + w_0$$
$$g(x) = w^\top \varphi(x) = \sum_{j=0}^{k} w_j\, \varphi_j(x)$$
Assume $\varphi_0(x) = 1$ (this absorbs $w_0$).
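A minimal sketch of this idea: XOR-style data (my own toy example) is not linearly separable in the x-space but becomes separable in the z-space under $\varphi(x) = (x_1, x_2, x_1 x_2)$:

```python
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

def phi(x):                       # explicit basis functions
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([phi(x) for x in X])
w = np.array([0.0, 0.0, 1.0])     # in z-space, w = (0, 0, 1), w0 = 0 separates
print(np.sign(Z @ w))             # [ 1.  1. -1. -1.] matches y
```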
Recall the soft-margin Lagrangian and the resulting dual:
$$L_p = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\xi_i - \sum_{i=1}^{N}\alpha_i\left[y_i(w^\top x_i + w_0) - 1 + \xi_i\right] - \sum_{i=1}^{N}\mu_i\xi_i$$
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j + \sum_{i=1}^{N}\alpha_i\right]$$
In the z-space these become
$$w = \sum_{i=1}^{N}\alpha_i y_i z_i = \sum_{i=1}^{N}\alpha_i y_i\, \varphi(x_i)$$
$$g(z) = w^\top z = \sum_{i=1}^{N}\alpha_i y_i\, \varphi(x_i)^\top \varphi(x)$$
Kernel Functions
While solving the optimization and applying the solution, we observe that we only ever need $\varphi(x_i)^\top \varphi(x_j)$, the inner product of two points in the z-space. Define the kernel function $K(x_i, x_j) = \varphi(x_i)^\top \varphi(x_j)$, so the mapping $\varphi$ never has to be computed explicitly.
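A sketch verifying the kernel trick numerically for the quadratic kernel $K(x, x') = (1 + x^\top x')^2$ in 2-D, whose explicit feature map is $\varphi(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$:

```python
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print((1 + x @ xp) ** 2, phi(x) @ phi(xp))   # both 4.0: K computes the
                                             # z-space inner product without phi
```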
[Figures: decision boundaries obtained with different kernels; the circled points are the support vectors.]
Gaussian Kernel
$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
$\sigma$ is the Gaussian kernel width (radius); larger $\sigma$ implies smoother boundaries.
What is the dimension of the z-space? (For the Gaussian kernel it is infinite-dimensional.)
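A short numpy sketch of the Gaussian kernel matrix (toy points of my own); varying $\sigma$ shows how the kernel's reach changes:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y, for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
print(gaussian_kernel(X, X, sigma=1.0).round(3))   # distant pairs -> K near 0
print(gaussian_kernel(X, X, sigma=10.0).round(3))  # larger sigma -> K near 1
```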
New kernels can be built from existing kernels $K_1$ and $K_2$, for example:
$$K(x, x') = c\,K_1(x, x'), \quad c > 0$$
$$K(x, x') = K_1(x, x') + K_2(x, x')$$
$$K(x, x') = K_1(x, x')\,K_2(x, x')$$
$$K(x, x') = \sum_{m=1}^{M} c_m K_m(x, x')$$
With a kernel, the dual and the discriminant become
$$L_d = \max_{\alpha}\left[-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j) + \sum_{i=1}^{N}\alpha_i\right]$$
$$g(x) = \sum_{i=1}^{N}\alpha_i y_i\, K(x_i, x)$$
Recall the soft-margin constraints: $y_i(w^\top x_i + w_0) \ge 1 - \xi_i$, and $\xi_i \ge 0$, $\forall i$. Equivalently, $\xi_i = \max\left(0,\; 1 - y_i(w^\top x_i + w_0)\right)$, which is the hinge loss.
[Figure (Bishop, PRML, p. 337): plot of $E(z)$ comparing the squared loss, the hinge loss used by the SVM, and the logarithmic (logistic) loss.]
The SVM objective can therefore be written as regularized hinge loss (Bishop, eq. 7.44):
$$\sum_{n=1}^{N} E_{SV}(y_n t_n) + \lambda\|w\|^2$$
Multiclass Kernel Machines
$$\min\; \sum_{m=1}^{K}\frac{1}{2}\|w_m\|^2 + C\sum_{i=1}^{N}\sum_{m \ne y_i}\xi_i^m$$
Subject to
$$w_{y_i}^\top x_i + w_{y_i 0} \ge w_m^\top x_i + w_{m0} + 2 - \xi_i^m, \qquad \xi_i^m \ge 0, \;\;\forall i,\; m \ne y_i$$
where the class label for $x_i$ is $y_i$. It is an expensive optimization problem.
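In practice the expensive joint problem is usually avoided by decomposition into binary problems. A hedged sketch (assuming scikit-learn): SVC trains one-vs-one binary SVMs internally and combines them:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)            # 3 classes
clf = SVC(kernel="linear", C=1.0).fit(X, y)  # K(K-1)/2 binary problems inside
print(clf.predict(X[:5]), y[:5])
```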
Coordinate ascent:
Until convergence:
  For $i = 1, \ldots, N$:
    $\alpha_i = \arg\max_{\hat{\alpha}_i} L(\alpha_1, \ldots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \ldots, \alpha_N)$
[Figure: contours of a quadratic objective and the zig-zag path taken by coordinate ascent.]
The ellipses in the figure are the contours of the quadratic function we want to optimize,
Subject to
$$\sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C$$
Can we perform coordinate ascent here, optimizing with respect to a single $\alpha_i$ while fixing the remaining? No: the equality constraint means that fixing all but one multiplier also fixes the last one, so at least two multipliers must be updated at a time. This is the idea behind sequential minimal optimization (SMO).
Updating the pair $(\alpha_1, \alpha_2)$ while holding the rest fixed, the equality constraint gives
$$\alpha_1 y_1 + \alpha_2 y_2 = -\sum_{i=3}^{N}\alpha_i y_i \;=\; \text{constant}$$
From Platt (MSR-TR-98-14): the SMO algorithm can be expressed in a short amount of C code, rather than invoking an entire QP library routine. Even though more optimization sub-problems are solved in the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly.
[Figure 1 (Platt): the two multipliers lie in the box $0 \le \alpha_1, \alpha_2 \le C$; if $y_1 \ne y_2$ they lie on a line $\alpha_1 - \alpha_2 = k$, and if $y_1 = y_2$ on a line $\alpha_1 + \alpha_2 = k$. The two Lagrange multipliers must fulfill all of the constraints of the full problem. The inequality constraints cause the Lagrange multipliers to lie in the box. The linear equality constraint causes them to lie on a diagonal line. Therefore, one step of SMO must find an optimum of the objective function on a diagonal line segment.]
Also from Platt: SMO can fit inside the memory of an ordinary personal computer or workstation, and because no matrix algorithms are used in SMO, it is less susceptible to numerical precision problems.
Compute the ends of the line segment in terms of $\alpha_2$:
If $y_1 \ne y_2$: $L = \max(0,\, \alpha_2 - \alpha_1)$, $H = \min(C,\, C + \alpha_2 - \alpha_1)$
If $y_1 = y_2$: $L = \max(0,\, \alpha_2 + \alpha_1 - C)$, $H = \min(C,\, \alpha_2 + \alpha_1)$
The new (unclipped) value is
$$\alpha_2 \leftarrow \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta}, \quad \text{then clipped to } [L, H]$$
where
$$E_i = (w^\top x_i + w_0) - y_i = \sum_{j=1}^{N}\alpha_j y_j\, x_j^\top x_i + w_0 - y_i$$
$$\eta = (x_1 - x_2)^\top (x_1 - x_2)$$
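Below is a compact sketch of the simplified SMO variant (random choice of the second multiplier, no selection heuristics), using a linear kernel and the $L$, $H$, $\eta$, and $E_i$ formulas above. Platt's full algorithm adds heuristics for picking the pair; this is only an illustration:

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=10, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    alpha, b = np.zeros(N), 0.0
    K = X @ X.T                                   # linear kernel matrix
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(N):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(N) if k != i])
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai, aj = alpha[i], alpha[j]
                if y[i] != y[j]:                  # ends of the diagonal segment
                    L, H = max(0, aj - ai), min(C, C + aj - ai)
                else:
                    L, H = max(0, ai + aj - C), min(C, ai + aj)
                eta = K[i, i] + K[j, j] - 2 * K[i, j]
                if L == H or eta <= 0:
                    continue
                alpha[j] = np.clip(aj + y[j] * (Ei - Ej) / eta, L, H)
                if abs(alpha[j] - aj) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (aj - alpha[j])  # keep sum alpha_i y_i fixed
                # recompute the bias from the KKT conditions
                b1 = b - Ei - y[i] * (alpha[i] - ai) * K[i, i] - y[j] * (alpha[j] - aj) * K[i, j]
                b2 = b - Ej - y[i] * (alpha[i] - ai) * K[i, j] - y[j] * (alpha[j] - aj) * K[j, j]
                b = b1 if 0 < alpha[i] < C else (b2 if 0 < alpha[j] < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```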
Training time comparison from Platt, MSR-TR-98-14:

Training Set Size   SMO Time (s)   Chunking Time (s)   Non-Bound SVs   Bound SVs
 1605                  0.4              37.1                 42            633
 2265                  0.9             228.3                 47            930
 3185                  1.8             596.2                 57           1210
 4781                  3.6            1954.2                 63           1791
 6414                  5.5            3684.6                 61           2370
11221                 17.0           20711.3                 79           4079
16101                 35.3               N/A                 67           5854
22697                 85.7               N/A                 88           8209
32562                163.6               N/A                149          11558

The training set size was varied by taking random subsets of the full training set. These subsets are nested. The "N/A" entries in the chunking time column had matrices that were too large to fit into 128 megabytes, and hence could not be timed due to memory thrashing. The numbers of non-bound and bound support vectors were determined from SMO; the chunking results vary by a small amount, due to the tolerance of inaccuracies around the KKT conditions. By fitting a line to the log-log plot of training time versus training set size, an empirical scaling exponent can be derived for each algorithm.
Support Vector Regression
Use a linear model $f(x) = w^\top x + w_0$ with an $\epsilon$-insensitive tube:
$$\min\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\left(\xi_i^+ + \xi_i^-\right)$$
Subject to
$$y_i - (w^\top x_i + w_0) \le \epsilon + \xi_i^+$$
$$(w^\top x_i + w_0) - y_i \le \epsilon + \xi_i^-$$
$$\xi_i^+,\, \xi_i^- \ge 0$$
The dual:
$$\max_{\alpha^+,\, \alpha^-}\; -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_i^+ - \alpha_i^-)(\alpha_j^+ - \alpha_j^-)\, x_i^\top x_j \;-\; \epsilon\sum_{i=1}^{N}(\alpha_i^+ + \alpha_i^-) \;+\; \sum_{i=1}^{N} y_i(\alpha_i^+ - \alpha_i^-)$$
Subject to
$$0 \le \alpha_i^+ \le C, \qquad 0 \le \alpha_i^- \le C, \qquad \sum_{i=1}^{N}(\alpha_i^+ - \alpha_i^-) = 0$$
Support Vectors
[Figure: the $\epsilon$-insensitive tube around the regression fit; points on or outside the tube are the support vectors.]
$$f(x) = w^\top x + w_0 = \sum_{i=1}^{N}(\alpha_i^+ - \alpha_i^-)\, x_i^\top x + w_0$$
Average $w_0$ over the instances satisfying:
$$y_i = w^\top x_i + w_0 + \epsilon, \quad \text{if } 0 < \alpha_i^+ < C$$
$$y_i = w^\top x_i + w_0 - \epsilon, \quad \text{if } 0 < \alpha_i^- < C$$
Similar to classification, this can be extended to use the kernel function.
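A hedged end-to-end sketch (assuming scikit-learn) of $\epsilon$-SVR on a toy 1-D problem; points outside the $\epsilon$-tube become support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(model.support_), "of", len(X))
```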
[Figure: support vector regression fit with a Gaussian kernel.]
SVMLight
Summary
Optimal Separating Hyperplane
Support Vector Machine
Linearly Separable Case
Soft Margin
Kernel Functions
Loss Function
SMO algorithm for optimization
Support Vector Regression
One-class Kernel Machines (refer to Section 13.11)