By Ali Habibnia
(a.habibnia@lse.ac.uk)
31 Oct 2012
Outline
Study 2:
SVMs History
Pattern Recognition
Regression Estimation (non-parametric): applications to function estimation started around 1995, under the name Support Vector Regression.
Good generalization performance: SVMs implement the Structural Risk Minimization principle, which minimizes an upper bound on the generalization error rather than only the training error.
They generalize well even in high-dimensional spaces under small-training-set conditions, and they are robust to noise.
Linear Classifiers
A linear classifier is defined by a hyperplane in the feature space:
$g(x) = w^T x + b$
Points with $w^T x + b > 0$ fall on one side of the hyperplane and points with $w^T x + b < 0$ on the other; $w$ is the normal vector to the hyperplane.
[Figure: a separating hyperplane in the $(x_1, x_2)$ plane with normal vector $w$.]
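As a quick sketch (not from the slides), the decision rule just evaluates the sign of $g(x) = w^T x + b$; the weight vector, bias, and test points below are made up for illustration.

```python
import numpy as np

# Hypothetical hyperplane parameters (illustration only)
w = np.array([2.0, -1.0])   # normal vector to the hyperplane
b = -0.5                    # bias / offset

def g(x):
    """Decision function g(x) = w^T x + b."""
    return w @ x + b

def classify(x):
    """Label +1 if g(x) > 0, otherwise -1."""
    return 1 if g(x) > 0 else -1

for x in [np.array([1.0, 0.0]), np.array([0.0, 2.0])]:
    print(x, g(x), classify(x))   # one point on each side of the hyperplane
```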
Linear Classifiers
[Figure sequence: points labelled +1 and -1 in the $(x_1, x_2)$ plane; several different hyperplanes separate the two classes.]
Margin
Given training data $\{(x_i, y_i)\},\ i = 1, 2, \ldots, n$, with labels $y_i \in \{+1, -1\}$, we look for the separating hyperplane with the widest margin (safe zone) around it.
Requiring $w^T x_i + b > 0$ for $y_i = +1$ and $w^T x_i + b < 0$ for $y_i = -1$ separates the data; rescaling $w$ and $b$ so that the closest points meet the margin boundary gives:
For $y_i = +1$: $w^T x_i + b \ge 1$
For $y_i = -1$: $w^T x_i + b \le -1$
Support Vectors
The training points that lie exactly on the margin boundaries are the support vectors. For a support vector $x^+$ from the positive class and $x^-$ from the negative class we know that
$w^T x^+ + b = +1$ and $w^T x^- + b = -1$,
so the width of the margin is $\frac{2}{\|w\|}$.
Formulation:
maximize $\frac{2}{\|w\|}$
such that
for $y_i = +1$: $w^T x_i + b \ge 1$
for $y_i = -1$: $w^T x_i + b \le -1$
Formulation:
minimize $\frac{1}{2}\|w\|^2$
such that
for $y_i = +1$: $w^T x_i + b \ge 1$
for $y_i = -1$: $w^T x_i + b \le -1$
equivalently, such that
$y_i (w^T x_i + b) \ge 1$
This is a quadratic program with linear constraints:
minimize $\frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \ge 1$
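A minimal numerical sketch of this quadratic program (my own illustration, not from the slides): the primal problem can be handed to a general-purpose constrained optimizer. The toy data set and the variable layout z = [w1, w2, b] are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(z):
    w = z[:2]
    return 0.5 * w @ w                      # (1/2) ||w||^2

def margin_constraints(z):
    w, b = z[:2], z[2]
    return y * (X @ w + b) - 1.0            # y_i (w^T x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints=[{"type": "ineq", "fun": margin_constraints}])

w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))          # every entry should be >= 1
```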
Lagrangian function:
minimize $L_p(w, b, \alpha_i) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w^T x_i + b) - 1 \right]$
s.t. $\alpha_i \ge 0$
Setting the derivatives of $L_p$ to zero gives:
$\frac{\partial L_p}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i$
$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0$
together with $\alpha_i \ge 0$ and the complementary-slackness condition $\alpha_i \left[ y_i (w^T x_i + b) - 1 \right] = 0$.
The solution has the form
$w = \sum_{i=1}^{n} \alpha_i y_i x_i = \sum_{i \in SV} \alpha_i y_i x_i$
with $\alpha_i \ge 0$, where the nonzero $\alpha_i$ correspond to the support vectors ($x^+$ and $x^-$ on the margin boundaries).
Get $b$ from $y_i (w^T x_i + b) - 1 = 0$, where $x_i$ is a support vector.
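The expansion $w = \sum_i \alpha_i y_i x_i$, with nonzero $\alpha_i$ only at the support vectors, can be checked numerically. The sketch below (toy data, not from the slides) uses scikit-learn's SVC, whose dual_coef_ attribute stores $\alpha_i y_i$ for the support vectors; a very large C is used to approximate the hard-margin classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (assumed for illustration)
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [-1.0, -1.0], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

# dual_coef_ holds alpha_i * y_i for each support vector
w_from_dual = (clf.dual_coef_ @ clf.support_vectors_).ravel()  # w = sum_i alpha_i y_i x_i
print("w from the dual expansion:", w_from_dual)
print("w reported by sklearn    :", clf.coef_.ravel())
print("b:", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
```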
General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable:
$\Phi: x \mapsto \varphi(x)$
If every data point is mapped into the high-dimensional space via some transformation $\Phi: x \mapsto \varphi(x)$, the dot product becomes:
$K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
Kernel methods map the data into higher-dimensional spaces in the hope that the data become more easily separable or better structured there.
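A small numerical check of $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$, written as a sketch rather than taken from the slides: for the degree-2 polynomial kernel $(1 + x^T z)^2$ on two-dimensional inputs, the explicit feature map $\varphi$ is six-dimensional, and the kernel reproduces its dot product without ever forming $\varphi$.

```python
import numpy as np

def phi(x):
    """Explicit feature map whose dot product equals (1 + x^T z)^2 for x, z in R^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(poly_kernel(x, z))   # kernel evaluated directly in the input space
print(phi(x) @ phi(z))     # identical value via the explicit feature map
```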
Kernel functions proposed in the literature include:
Wave Kernel
Power Kernel
Log Kernel
Spline Kernel
B-Spline Kernel
Bessel Kernel
Cauchy Kernel
Chi-Square Kernel
Histogram Intersection Kernel
Generalized Histogram Intersection Kernel
Generalized T-Student Kernel
Bayesian Kernel
Wavelet Kernel
The kernel (Gram) matrix evaluated on the training set:
$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & K(x_1, x_3) & \cdots & K(x_1, x_N) \\ K(x_2, x_1) & K(x_2, x_2) & K(x_2, x_3) & \cdots & K(x_2, x_N) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ K(x_N, x_1) & K(x_N, x_2) & K(x_N, x_3) & \cdots & K(x_N, x_N) \end{pmatrix}$
Examples of kernel functions:
Linear kernel: $K(x_i, x_j) = x_i^T x_j$
Polynomial kernel: $K(x_i, x_j) = (1 + x_i^T x_j)^p$
Gaussian kernel: $K(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
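The kernels above and the Gram matrix from the previous slide are simple to compute directly. The sketch below (made-up data and parameter values) defines the four kernels and builds the N x N matrix K[i, j] = K(xi, xj) with the Gaussian kernel, checking that it is symmetric.

```python
import numpy as np

def linear(xi, xj):
    return xi @ xj

def polynomial(xi, xj, p=3):
    return (1.0 + xi @ xj) ** p

def gaussian(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid(xi, xj, beta0=0.1, beta1=-1.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)

# Made-up data: N points in 2 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
N = len(X)

# Gram (kernel) matrix K[i, j] = K(x_i, x_j)
K = np.array([[gaussian(X[i], X[j]) for j in range(N)] for i in range(N)])

print(K.round(3))
print("symmetric:", np.allclose(K, K.T))
```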
Some Issues
Choice of kernel
- A Gaussian or polynomial kernel is the default choice
- If these prove ineffective, more elaborate kernels are needed
Choice of kernel parameters
- e.g. $\sigma$ in the Gaussian kernel
- $\sigma$ is taken as the distance between the closest points with different classifications
- In the absence of reliable criteria, applications rely on a validation set or cross-validation to set such parameters (a sketch follows below)
Optimization criterion: hard margin vs. soft margin
- Typically settled by a lengthy series of experiments in which various parameter settings are tested
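One common way to set such parameters, sketched below with scikit-learn (the synthetic data and the parameter grid are arbitrary choices for illustration), is to grid-search the Gaussian kernel width and the soft-margin constant C by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Arbitrary synthetic classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Grid over the RBF width (gamma ~ 1 / (2 sigma^2)) and the soft-margin constant C
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```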
Ordinary regression with a squared loss:
$f(x) = w^T x + b$
$\mathrm{Loss} = \sum_i \left( (w^T x_i + b) - y_i \right)^2$
$\frac{d\,\mathrm{Loss}}{dw} = 0 \;\Rightarrow\; (X^T X)\, w = X^T Y$
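A minimal numeric illustration of the normal equations (the data are made up): stacking the inputs in a design matrix X with a column of ones for the intercept, $(X^T X) w = X^T Y$ is just a small linear system.

```python
import numpy as np

# Made-up one-dimensional data scattered around y = 2x + 1
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

# Design matrix with an intercept column, so w = [slope, intercept]
X = np.column_stack([x, np.ones_like(x)])

# Solve the normal equations (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print("slope, intercept:", w)
```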
Linear SVR with an $\varepsilon$-insensitive tube:
$f(x) = w^T x + b$
[Figure: the fitted line with a tube of width $\pm\varepsilon$ around it.]
min $\frac{1}{2} w^T w$
Constraints:
$y_i - w^T x_i - b \le \varepsilon$
$w^T x_i + b - y_i \le \varepsilon$
Soft-margin SVR with slack variables $\xi_i, \xi_i^*$ for points outside the tube:
$f(x) = w^T x + b$
min $\frac{1}{2} w^T w + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$
Constraints:
$y_i - w^T x_i - b \le \varepsilon + \xi_i$
$w^T x_i + b - y_i \le \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0$
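This soft-margin $\varepsilon$-SVR is what scikit-learn's SVR implements. The sketch below (toy data, arbitrary C and epsilon) fits the linear version and reports how many points end up as support vectors, i.e. on or outside the $\varepsilon$-tube.

```python
import numpy as np
from sklearn.svm import SVR

# Toy data scattered around a line (illustration only)
rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 10, size=(80, 1)), axis=0)
y = 1.5 * X.ravel() + 2.0 + rng.normal(scale=0.8, size=80)

# C penalizes the slacks xi, xi*; epsilon sets the width of the tube
svr = SVR(kernel="linear", C=10.0, epsilon=0.5).fit(X, y)

print("w:", svr.coef_.ravel(), "b:", svr.intercept_[0])
print("number of support vectors (outside or on the tube):", len(svr.support_))
```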
Lagrange Optimisation
$L = \frac{1}{2} w^T w + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) - \sum_{i=1}^{N} \alpha_i \left( \varepsilon + \xi_i - y_i + w^T x_i + b \right) - \sum_{i=1}^{N} \alpha_i^* \left( \varepsilon + \xi_i^* + y_i - w^T x_i - b \right) - \sum_{i=1}^{N} \left( \eta_i \xi_i + \eta_i^* \xi_i^* \right)$
Setting the derivatives with respect to the primal variables to zero and substituting back gives the dual problem and the regression function:
$y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$
Constraints: $0 \le \alpha_i, \alpha_i^* \le C$ and $\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0$.
Nonlinear Regression
[Figure: the $\varepsilon$-tube regression shown in the original input space and, after the mapping $\varphi(x)$, in the feature space.]
Regression Formulas
Linear: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b$
Nonlinear: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \langle \varphi(x_i), \varphi(x) \rangle + b$
General: $y(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(x_i, x) + b$
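The general formula $y(x) = \sum_i (\alpha_i - \alpha_i^*) K(x_i, x) + b$ can be verified against a fitted model: scikit-learn's SVR stores the differences $(\alpha_i - \alpha_i^*)$ for the support vectors in dual_coef_. The data, kernel, and parameter values below are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

# Arbitrary noisy sine data
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=60)

gamma = 0.5
svr = SVR(kernel="rbf", gamma=gamma, C=10.0, epsilon=0.1).fit(X, y)

def rbf(a, b):
    # scikit-learn's RBF kernel: exp(-gamma * ||a - b||^2)
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Manual prediction: y(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b
x_new = np.array([0.7])
k = np.array([rbf(sv, x_new) for sv in svr.support_vectors_])
y_manual = svr.dual_coef_.ravel() @ k + svr.intercept_[0]

print("manual dual expansion:", y_manual)
print("svr.predict          :", svr.predict(x_new.reshape(1, -1))[0])
```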
Kernel Types
Linear: $K(x, x_i) = \langle x, x_i \rangle$
Polynomial: $K(x, x_i) = \langle x, x_i \rangle^d$
Gaussian RBF: $K(x, x_i) = \exp\!\left( -\frac{\|x - x_i\|^2}{2\sigma^2} \right)$
Exponential RBF: $K(x, x_i) = \exp\!\left( -\frac{\|x - x_i\|}{2\sigma^2} \right)$
Wavelet Kernel
The Wavelet kernel (Zhang et al., 2004) comes from wavelet theory and is given as:
$K(x, z) = \prod_{i=1}^{N} h\!\left( \frac{x_i - c_i}{a} \right) h\!\left( \frac{z_i - c_i}{a} \right)$
where $h$ is a mother wavelet function and $a$ and $c$ are the wavelet dilation and translation coefficients, respectively (the form presented above is a simplification). A translation-invariant version of this kernel can be given as:
$K(x, z) = \prod_{i=1}^{N} h\!\left( \frac{x_i - z_i}{a} \right)$
Fourier Analysis
In other words: transform the view of the signal from the time domain to the frequency domain.
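As a minimal illustration of moving from the time domain to the frequency domain (not from the slides), numpy's FFT recovers the frequencies present in a sampled signal; the signal below is made up.

```python
import numpy as np

# Made-up signal: 5 Hz and 12 Hz components sampled at 100 Hz for 1 second
fs = 100
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)

# Transform from the time domain to the frequency domain
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two largest peaks sit at the component frequencies
peaks = freqs[np.argsort(spectrum)[-2:]]
print("dominant frequencies (Hz):", sorted(peaks))
```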
By Ali Habibnia
a.habibnia@lse.ac.uk
31 Oct 2012