Pattern Classification
Shigeo Abe
Multilayer Neural Networks

[Figure: neural network (N.N.) output separating Class 1 (output 1) from Class 2 (output 0)]
Support Vector Machines

[Figure: SVM output separating Class 1 (output 1) from Class 2 (output 0)]
Direct decision functions: decision boundaries are
determined to minimize the classification error of both
training data and unknown data.
[Figure: maximum margin between Class 1 and Class 2; optimal hyperplane (OHP) D(x) = 0]
Summary
• When the number of training data is small,
SVMs outperform conventional classifiers.
[Figure: mapping g(x) from the input space to the feature space; the OHP separates Class 1 and Class 2]
Types of SVMs
• Hard margin SVMs
  – linearly separable in the feature space
  – maximize generalization ability
• Soft margin SVMs
  – not separable in the feature space
  – minimize classification error and maximize generalization ability
  • L1 soft margin SVMs (commonly used)
  • L2 soft margin SVMs

[Figure: Class 1 and Class 2 in the feature space]
Hard Margin SVMs
We determine the OHP so that the generalization region (margin δ) is maximized:

maximize δ
subject to
  wᵗg(xi) + b ≧ 1 for Class 1 (yi = 1)
  −wᵗg(xi) − b ≧ 1 for Class 2 (yi = −1)

Combining the two:
  yi D(xi)/||w|| ≧ δ.

Imposing ||w|| δ = 1, the problem is to
  minimize ||w||²/2
  subject to yi D(xi) ≧ 1 for i = 1, …, M,

where OHP: D(x) = wᵗg(x) + b = 0.

[Figure: margin δ and generalization region around the OHP]
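The optimization problem above can be sketched numerically. The following is a minimal illustration, assuming SciPy's SLSQP solver and the identity mapping g(x) = x on a made-up separable data set; the slides do not prescribe any particular solver.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable data in 2-D (g(x) = x).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                       # v = [w1, w2, b]
    w = v[:2]
    return 0.5 * w @ w                  # minimize ||w||^2 / 2

# One inequality constraint y_i D(x_i) >= 1 per training point.
constraints = [{'type': 'ineq',
                'fun': lambda v, i=i: y[i] * (X[i] @ v[:2] + v[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
margins = y * (X @ w + b)               # all >= 1 at the optimum
```

At the optimum the support vectors satisfy yi D(xi) = 1 exactly, and D(x) = 0 is the OHP.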
Soft Margin SVMs
If the problem is non-separable, we introduce slack variables ξi ≧ 0.

Minimize
  ||w||²/2 + (C/p) Σi=1,…,M ξiᵖ
subject to
  yi D(xi) ≧ 1 − ξi,

where C is the margin parameter, p = 1 gives the L1 SVM and p = 2 the L2 SVM.

[Figure: margin δ and OHP D(x) = 0; a point with 0 < ξi < 1 lies inside the margin but is correctly classified, while ξk > 1 means misclassification]
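The L1 soft margin problem (p = 1) can be sketched without a constrained solver: substituting ξi = max(0, 1 − yi D(xi)) eliminates the constraints, leaving an unconstrained hinge-loss objective that plain subgradient descent can minimize. The data, C, and step size below are made up.

```python
import numpy as np

# Non-separable toy data: the third Class 1 point is an outlier.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.5, -1.5], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
C, lr = 1.0, 0.01

# Eliminating xi_i = max(0, 1 - y_i D(x_i)) turns the L1 soft margin problem
# into unconstrained hinge-loss minimization, solved here by subgradient descent.
w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1.0                      # points with positive slack
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

xi = np.maximum(0.0, 1.0 - y * (X @ w + b))   # recovered slack variables
```

The recovered ξi behave as on the slide: 0 < ξi < 1 means inside the margin but correctly classified, and ξi > 1 means misclassified (here the outlier).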
Conversion to Dual Problems
Introducing the Lagrange multipliers αi and βi, the L2 SVM dual is:

Maximize
  Σi αi − (C/2) Σi,j αi αj yi yj (g(xi)ᵗg(xj) + δij/C)
subject to
  Σi yi αi = 0, αi ≧ 0,

where δij = 1 for i = j and 0 for i ≠ j.
KKT Complementarity Condition
For L1 SVMs, from αi(yi(wᵗg(xi) + b) − 1 + ξi) = 0 and βiξi = (C − αi)ξi = 0, there are three cases for αi:

• αi = 0: ξi = 0 and yi D(xi) ≧ 1 (xi is correctly classified).
• 0 < αi < C: ξi = 0 and yi D(xi) = 1 (xi is an unbounded support vector).
• αi = C: yi D(xi) = 1 − ξi with ξi ≧ 0 (xi is a bounded support vector).

Kernel: H(x, x′) = g(x)ᵗg(x′).

L2 SVM:
  HL2 = HL1 + {(yi yj + δij)/C}
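The relation above can be illustrated with made-up data, assuming NumPy (the slides give no code): build HL1 from a polynomial kernel, add the (yi yj + δij)/C term, and check the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))                   # 6 made-up training points in 2-D
y = rng.choice([-1.0, 1.0], size=6)
C = 10.0

HL1 = (X @ X.T + 1.0) ** 2                    # polynomial kernel H(x, x') = (x.x' + 1)^2
HL2 = HL1 + (np.outer(y, y) + np.eye(6)) / C  # HL2 = HL1 + {(y_i y_j + d_ij)/C}

eigs = np.linalg.eigvalsh(HL2)                # smallest eigenvalue is at least 1/C
```

HL1 is only positive semidefinite, while the added δij/C diagonal makes HL2 positive definite; this is what underlies the uniqueness of the L2 SVM solution.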
Table: Uniqueness

         L1 SVM        L2 SVM
Primal   Non-unique*   Unique*
Dual     Non-unique    Unique*
Irreducible Set of Support Vectors
A set of support vectors is irreducible if deletion of boundary vectors and of any support vectors results in a change of the optimal hyperplane.
Property 2
For the L1 SVM, let all the support vectors
be unbounded. Then the Hessian matrix
associated with the irreducible set is positive
definite.
Property 3
For the L1 SVM, if there is only one
irreducible set, and support vectors are all
unbounded, the solution is unique.
Irreducible sets: {1, 3}, {2, 4}

The dual problem is non-unique, but the primal problem is unique.

[Figure: four support vectors 1–4; either irreducible set determines the same OHP]
[Chart: recognition rates (%, 75–95) of L1 and L2 SVMs, test and training, for dot, poly2–poly5 kernels]
Support Vectors for Polynomial Kernels
For poly4 and poly5 the numbers are the same.
[Chart: numbers of support vectors (0–200) of L1 and L2 SVMs for dot, poly2–poly5 kernels]
Training Time Comparison
As the polynomial degree increases, the
difference becomes smaller.
[Chart: training time (s, 0–60) of L1 and L2 SVMs for dot, poly2–poly5 kernels]
Summary
[Figure: one-against-all decision functions D1(x) = 0, D2(x) = 0, D3(x) = 0 for Classes 1–3, with an unclassifiable region]
Continuous SVMs

[Figure: Class 1 and Class 3 regions around D1(x) = 0]
Fuzzy SVMs

[Figure: membership function defined between D1(x) = 1 and D1(x) = 0]
Class i Membership

[Figure: membership function contours for Classes 1, 2, and 3]
Decision-tree-based SVMs
Each node corresponds to a hyperplane.

[Figure: training step and classification step; hyperplanes f_j(x) = 0 and f_k(x) = 0 separate Class j and Class k, and a query point '?' falls near f_j(x) = 0]

If separation at an upper node is poor, the overall classification performance becomes worse.
Unclassifiable regions still exist.

[Figure: pairwise decision functions D31(x) = 0 and D23(x) = 0 for Classes 1–3; in the centre D1(x) = D2(x) = D3(x) = 1]
Pairwise Fuzzy SVMs

[Figure: class regions of P-SVM vs. P-FSVM for Classes 1–3]
Decision-tree-based SVMs (DDAGs)

[Figure: DDAG with nodes D12(x), D13(x), D32(x) and leaves Class 1, Class 2, Class 3; an equivalent DDAG with a different node ordering]
ECC Capability

The maximum number of decision functions is 2^(n−1) − 1, where n is the number of classes.
Class 1:  1   1   1  -1   1   1  -1
Class 2:  1   1  -1   1   1  -1   1
Class 3:  1  -1   1   1  -1   1   1
Class 4: -1   1   1   1  -1  -1  -1
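The count 2^(n−1) − 1 can be checked by enumerating the two-class partitions directly. This is a small illustrative script (the function name is mine): each decision function assigns +1 to a non-empty proper subset of the classes and −1 to the rest, and a subset and its complement give the same function.

```python
from itertools import combinations

def num_decision_functions(n):
    """Count distinct two-class partitions of n classes.

    A subset S and its complement define the same decision function,
    so each unordered pair {S, complement} is counted once.
    """
    classes = range(n)
    seen = set()
    for r in range(1, n):
        for subset in combinations(classes, r):
            comp = tuple(c for c in classes if c not in subset)
            seen.add(frozenset((subset, comp)))
    return len(seen)

print(num_decision_functions(4))   # 7 = 2^(4-1) - 1, matching the table above
```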
Continuous Hamming Distance

Hamming distance = Σi ERi(x) = Σi (1 − mi(x))

This is equivalent to membership functions combined with sum operators.

[Figure: error function ERi(x) and membership function mi(x) varying between 0 and 1 as Di(x) goes from 0 to 1; Classes k and l]
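Classification by continuous Hamming distance can be sketched as follows, using the codeword table above; the membership values m_i(x) are made-up illustrative numbers, and the function name is mine.

```python
import numpy as np

# ECC codewords for 4 classes and 7 decision functions (the table above).
code = np.array([[ 1,  1,  1, -1,  1,  1, -1],
                 [ 1,  1, -1,  1,  1, -1,  1],
                 [ 1, -1,  1,  1, -1,  1,  1],
                 [-1,  1,  1,  1, -1, -1, -1]], dtype=float)

def continuous_hamming(m_signed, codeword):
    """Sum of (1 - m_i) over decision functions, where m_i is the
    membership of x in the side of function i given by the codeword."""
    m = np.where(codeword > 0, m_signed, 1.0 - m_signed)
    return np.sum(1.0 - m)

# Made-up memberships of x in the +1 side of each decision function.
m_signed = np.array([0.9, 0.8, 0.7, 0.1, 0.9, 0.8, 0.2])
dists = [continuous_hamming(m_signed, code[k]) for k in range(4)]
best = int(np.argmin(dists))   # class with the smallest continuous Hamming distance
```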
All-at-once SVMs

Decision functions:
• Original formulation: n × M variables, where n is the number of classes and M is the number of training data.
• Crammer and Singer (2000)'s formulation: M variables.
Performance Evaluation
[Figure: Japanese license plate (水戸 57, ひ 59-16)]
Data Sets Used for Evaluation
Performance is improved by fuzzy SVMs and decision trees.

[Chart: recognition rates (%, 82–100) for Blood, Thyroid, and H-50 data]
Performance Improvement for Pairwise Classification
FSVMs are comparable with ADAG MX.

[Chart: recognition rates (%, 86–100) of SVM, FSVM, ADAG MX, ADAG Av, and ADAG MN for Blood, Thyroid, and H-13 data]
Summary
Maximize
  Q(α) = Σi∈B αi − (C/2) Σi,j∈B αi αj yi yj H(xi, xj)
         − C Σi∈B,j∈N αi αj yi yj H(xi, xj)
         − (C/2) Σi,j∈N αi αj yi yj H(xi, xj) + Σi∈N αi
subject to
  Σi∈B yi αi = − Σi∈N yi αi,  C ≧ αi ≧ 0 for i ∈ B.
• SMO
– |B| = 2, i.e., solve the problem for two variables
– The subproblem is solvable without matrix calculations
• Case 1: Variables are updated.
• Case 2: Corrections are reduced to satisfy constraints.
• Case 3: Variables are not corrected.

[Figure: the three cases on the (α1, α2) plane with box constraints]
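The two-variable subproblem and the correction cases can be sketched as follows. This is a minimal SMO-style update on made-up data with a linear kernel, not the slides' exact procedure; the pair (i, j) is chosen by hand.

```python
import numpy as np

# Made-up data and current multipliers.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.3, 0.0, 0.2, 0.1])
C, b = 1.0, 0.0
K = X @ X.T                                    # linear kernel

def smo_step(i, j, alpha, b):
    """One analytic update of (alpha_i, alpha_j), keeping sum y_k alpha_k
    constant and clipping to the box [0, C] (the cases on the slide)."""
    D = (alpha * y) @ K + b                    # current outputs D(x_k)
    Ei, Ej = D[i] - y[i], D[j] - y[j]          # prediction errors
    eta = K[i, i] + K[j, j] - 2 * K[i, j]
    if eta <= 0:
        return alpha                           # case 3: no correction
    aj = alpha[j] + y[j] * (Ei - Ej) / eta     # unconstrained optimum
    if y[i] == y[j]:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    else:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    aj = min(max(aj, L), H)                    # cases 1-2: clip to the box
    ai = alpha[i] + y[i] * y[j] * (alpha[j] - aj)
    new = alpha.copy()
    new[i], new[j] = ai, aj
    return new

new_alpha = smo_step(0, 2, alpha, b)
```

No matrix factorization is needed: the update uses only the 2×2 block of the kernel matrix, which is why the subproblem is solvable without matrix calculations.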
Performance Evaluation
• Compare training time by the steepest
ascent method, SMO, and the interior point
method.
• Data sets used:
– white blood cell data
– thyroid data
– hiragana data with 50 inputs
– hiragana data with 13 inputs
Data Sets Used for Evaluation
Dot product kernels are used. For a larger size, L2 SVMs are trained faster.

[Chart: training time (s, 0–1200) of L1 and L2 SVMs vs. working set size (2, 4, 13, 20, 30)]
Effect of Working Set Size for Blood Cell Data (Cont'd)
Polynomial kernels with degree 4 are used. Not much difference between L1 and L2 SVMs.

[Chart: training time (s, 0–700) of L1 and L2 SVMs vs. working set size (2, 20, 40)]
Training Time Comparison
LOQO is combined with the decomposition technique. SMO is the slowest; LOQO and SAM are comparable for the hiragana data.

[Chart: training time (s, log scale 1–10000) of LOQO, SMO, and SAM for Thyroid, Blood, H-50, and H-13 data]
Summary
• Kernel-based Methods
– Kernel Perceptron
– Kernel Least Squares
– Kernel Mahalanobis Distance
– Kernel Principal Component Analysis
• Maximum Margin Classifiers
– Fuzzy Classifiers
– Neural Networks
Contents
Membership function:
  mi(x) = exp(−hi²(x))
  hi²(x) = di²(x)/αi
  di²(x) = (x − ci)ᵗ Qi⁻¹ (x − ci)

• αi is tuned by maximizing margins.
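A minimal sketch of evaluating this membership function with NumPy; the centre ci, covariance Qi, and tuning parameter αi below are made-up values.

```python
import numpy as np

# Hypothetical class centre c_i, covariance Q_i, and tuning parameter alpha_i.
c = np.array([1.0, 2.0])
Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])
alpha_i = 4.0

def membership(x):
    """m_i(x) = exp(-h_i^2(x)) with h_i^2 = d_i^2 / alpha_i and
    d_i^2(x) = (x - c)^t Q^{-1} (x - c), the squared Mahalanobis distance."""
    diff = x - c
    d2 = diff @ np.linalg.solve(Q, diff)   # avoids forming Q^{-1} explicitly
    return np.exp(-d2 / alpha_i)

print(membership(c))            # 1.0 at the centre, decaying away from it
```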
Symmetric Cholesky Factorization

Qi = Li Liᵗ

We calculate the upper bound of αi for all the class i data.
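The factorization can be sketched as follows, assuming NumPy; Qi is a made-up positive definite matrix. One benefit is that the Mahalanobis distance then needs only a triangular solve.

```python
import numpy as np

# Hypothetical symmetric positive definite covariance Q_i (made-up values).
Q = np.array([[4.0, 2.0],
              [2.0, 3.0]])

L = np.linalg.cholesky(Q)            # Q_i = L_i L_i^t with L_i lower triangular

# The squared Mahalanobis distance d^2 = (x - c)^t Q^{-1} (x - c) follows from
# one triangular solve, since Q^{-1} = L^{-t} L^{-1}:
x_minus_c = np.array([1.0, -1.0])
z = np.linalg.solve(L, x_minus_c)    # forward substitution L z = x - c
d2 = z @ z                           # = ||L^{-1} (x - c)||^2
```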
Lower Bound of αi
The class j datum remains correctly classified when αi is decreased. We calculate the lower bound for all the data not belonging to class i.
Tuning Procedure
The recognition rate improved as η was increased.

[Chart: recognition rates (%, 0–100) of test and training data vs. η = 10^-5 … 10^-1]

Recognition Rate of Hiragana-50 Data (Maximizing Margins)
The recognition rate is better for small η.

[Chart: recognition rates (%, 84–100) of test and training data vs. η = 10^-5 … 10^-1]
Performance Comparison
CARVE Algorithm (hidden layer)

The weights between the input layer and the hidden layer represent the hyperplane.

[Figure: input space (x1, x2); network with input layer (x1, x2, bias) and hidden layer]
CARVE Algorithm (output layer)

[Figure: input space regions mapped to vertices (0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,0) of the hidden layer output space; sigmoid function; example target values: +1 for the largest violation, -1 otherwise]
FSVM: one-against-all. MM-NN is better than BP and comparable to FSVMs.

[Chart: recognition rates (%, 86–100) for Blood, Thyroid, and H-50 data]
Summary