
A Summary of Support Vector Machine

Jinlong Wu
Computational Mathematics,
SMS, PKU
May 4, 2007

1 Introduction

The Support Vector Machine (SVM) has attracted a lot of attention since it was developed. It is widely used in many areas, such as text classification, face recognition, image processing and handwriting recognition, because of its powerful classification and regression abilities.
In this article I try to give a summary of SVM. Because the subject is so rich, this article mainly focuses on one aspect: the Support Vector Classifier (SVC). All of the techniques mentioned here are implemented in a new SVM software package, PKSVM. Presently PKSVM can handle two-class and multi-class classification problems using C-SVC and $\nu$-SVC (omitted here).
The earliest pattern recognition systems were Linear Classifiers (LC) (Nilsson, 1965). We will point out the differences between SVC and LC in the following sections. Suppose $n$ training data (sometimes called examples or observations) $(\mathbf{x}_i, y_i)$ are given, where $\mathbf{x}_i \in \mathbb{R}^p$ ($p$ features or variables) and $y_i \in \mathbb{R}$. Without loss of generality, the whole article is limited to two-class problems, that is to say, the $y_i$ take only two values, which can be $+1$ and $-1$ for simplicity.

2 SVC in the linearly separable situation

In this section I will review linearly separable problems because they are the simplest for SVC. The training set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ is said to be linearly separable if there exists a linear discriminant function whose sign matches the class of all training examples. Figure 1 gives a trivial example with $n = 9$, $p = 2$.

Figure 1: A linearly separable training set ($n = 9$, $p = 2$); data from class $+1$ (red) and class $-1$ (green)
In linearly separable problems, one can obviously use the simplest model, the linear model $y(\mathbf{x}) = \boldsymbol{\beta}^T \mathbf{x} + \beta_0$, to separate the two classes. However, as in Figure 2, there usually exist infinitely many separating hyperplanes which separate the training set perfectly, so which one should be chosen is an important question.
Vapnik and Lerner (1963) proposed choosing the separating hyperplane that maximizes the margin. This optimal hyperplane for the trivial example in Figure 1 is illustrated in Figure 3. Selecting the optimal hyperplane as the best one is the first new idea that SVM adds in comparison with linear classifiers.
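As in Figure 4, the signed distance from a point $\mathbf{x}$ to the hyperplane $L$ is
$$\frac{1}{\|\boldsymbol{\beta}\|}\boldsymbol{\beta}^T(\mathbf{x} - \mathbf{x}_0) = \frac{1}{\|\boldsymbol{\beta}\|}(\boldsymbol{\beta}^T\mathbf{x} - \boldsymbol{\beta}^T\mathbf{x}_0) = \frac{1}{\|\boldsymbol{\beta}\|}(\boldsymbol{\beta}^T\mathbf{x} + \beta_0), \tag{1}$$
where $\mathbf{x}_0$ is any point on $L$, so that $\boldsymbol{\beta}^T\mathbf{x}_0 = -\beta_0$.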
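As a quick numerical check of the signed distance formula (1), the quantity can be computed directly. This is a minimal sketch in plain Python; the function name `signed_distance` and the example hyperplane are illustrative assumptions and are not taken from PKSVM.

```python
import math

def signed_distance(beta, beta0, x):
    """Signed distance from x to the hyperplane {z : beta^T z + beta0 = 0}, as in (1)."""
    norm = math.sqrt(sum(b * b for b in beta))
    if norm == 0.0:
        raise ValueError("beta must be nonzero")
    return (sum(b * v for b, v in zip(beta, x)) + beta0) / norm

# Illustrative hyperplane x1 + x2 - 2 = 0 in the plane.
beta, beta0 = [1.0, 1.0], -2.0
d_pos = signed_distance(beta, beta0, [2.0, 2.0])  # point on the positive side
d_neg = signed_distance(beta, beta0, [0.0, 0.0])  # point on the negative side
d_on = signed_distance(beta, beta0, [1.0, 1.0])   # point lying on the hyperplane
```

The sign tells on which side of $L$ the point lies, and the absolute value is the usual Euclidean distance.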

Figure 2: Some optional hyperplanes ($p = 2$); three lines which can separate the training set perfectly


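Hence, our aim is to obtain the biggest positive $C$ which makes all examples satisfy $\frac{1}{\|\boldsymbol{\beta}\|}|\boldsymbol{\beta}^T\mathbf{x}_i + \beta_0| \ge C$, $\forall i$. Obviously it is a constrained optimization problem (OP):
$$\max_{\boldsymbol{\beta},\,\beta_0} C \quad \text{subject to} \quad \frac{1}{\|\boldsymbol{\beta}\|}\, y_i(\boldsymbol{\beta}^T\mathbf{x}_i + \beta_0) \ge C, \quad \forall i.$$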

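Since the length of $\boldsymbol{\beta}$ is insignificant, we can let $\|\boldsymbol{\beta}\|$ take any value; in particular, we may assume $\frac{1}{\|\boldsymbol{\beta}\|} = C$ throughout. With this assumption the previous OP can be rewritten as
$$\min_{\boldsymbol{\beta},\,\beta_0} \frac{1}{2}\|\boldsymbol{\beta}\|^2 \quad \text{subject to} \quad y_i(\boldsymbol{\beta}^T\mathbf{x}_i + \beta_0) \ge 1, \quad \forall i. \tag{2}$$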
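The rescaling argument behind (2) can be illustrated numerically: any separating hyperplane can be rescaled so that the closest example satisfies the constraint with equality, and the margin is then $1/\|\boldsymbol{\beta}\|$. The sketch below assumes a small hand-made data set and a hypothetical helper `canonical_form`; it does not solve (2), it only normalizes a given separating hyperplane.

```python
import math

def canonical_form(beta, beta0, X, y):
    """Rescale (beta, beta0) so that min_i y_i (beta^T x_i + beta0) equals 1."""
    margins = [yi * (sum(b * v for b, v in zip(beta, xi)) + beta0)
               for xi, yi in zip(X, y)]
    smallest = min(margins)
    if smallest <= 0:
        raise ValueError("hyperplane does not separate the training set")
    return [b / smallest for b in beta], beta0 / smallest

# Illustrative data: class +1 lies to the right of x1 = 2, class -1 to the left.
X = [[3.0, 0.0], [4.0, 1.0], [1.0, 0.0], [0.0, 2.0]]
y = [1, 1, -1, -1]

# The hyperplane 2*x1 - 4 = 0 separates the data but is not in canonical form.
beta, beta0 = canonical_form([2.0, 0.0], -4.0, X, y)
margin = 1.0 / math.sqrt(sum(b * b for b in beta))
```

After rescaling, the hyperplane is unchanged as a set of points, but every example satisfies the constraint in (2) and the closest ones satisfy it with equality.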

However, it is difficult to solve this OP directly because of the complicated constraint conditions, so we will resort to solving its dual problem. Details will be given in Section 4.
Figure 3: The optimal hyperplane

Figure 4: The distance of any point $\mathbf{x}$ to the hyperplane $L$

Figure 5: An inseparable example; SVM with RBF kernel, decision boundary in black (axes $x_1$ and $x_2$)

3 Two important extensions to QP (2)

3.1 Hypersurfaces

Real-world data sets usually cannot be separated ideally by simple hyperplanes. If hyperplanes are replaced by hypersurfaces, better classification results can be expected. Therefore, the OP (2) should be generalized to
$$\min_{\boldsymbol{\beta},\,\beta_0} \frac{1}{2}\|\boldsymbol{\beta}\|^2 \quad \text{subject to} \quad y_i(\boldsymbol{\beta}^T\phi(\mathbf{x}_i) + \beta_0) \ge 1, \quad \forall i, \tag{3}$$
where $\phi(\cdot)$ is a map from the $p$-dimensional input space to a higher (maybe infinite) dimensional space.

3.2 Soft margins

After careful thought, one may raise another question: is this generalization able to separate any real-world data set perfectly?
My first answer is "yes". If the function $\phi(\cdot)$ is very complicated and produces very wiggly surfaces, then the given set can always be separated perfectly. However, if one really does that, another serious problem arises: overfitting. The model obtained will be worthless because of its serious overfitting. Figure 5 gives another data set which cannot be separated perfectly using linear hyperplanes. At first sight the black surfaces give a much better separating result, but at the same time they are too wiggly to classify new data well. Therefore, one must not use an excessively complicated function $\phi(\cdot)$!
Hence, my second answer is "no". Compared to the first answer, this one is more realistic, but it is also a very depressing answer. Can't people do something to improve the separating result for the example in Figure 5? Of course we can: we can take one more step and extend our model (3) again.
Since data sets come from real-world problems, they almost always contain some noise. The intention of our model is to learn useful information from the data sets, not to learn their noise. It is not wise to use an excessively complicated function $\phi(\cdot)$ to separate the classes of data.
Cortes and Vapnik (1995) showed that noisy problems are best addressed by allowing some examples to violate the margin constraints in (3). We can realize this idea using nonnegative slack variables $\boldsymbol{\xi} = (\xi_1, \ldots, \xi_n)$, that is to say, we change the previous constraints $y_i(\boldsymbol{\beta}^T\phi(\mathbf{x}_i) + \beta_0) \ge 1$, $\forall i$, to the new constraints $y_i(\boldsymbol{\beta}^T\phi(\mathbf{x}_i) + \beta_0) \ge 1 - \xi_i$, $\forall i$. However, it is necessary to prevent each $\xi_i$ from taking an unnecessarily big value, such as $+\infty$. So some penalty on all the $\xi_i$ should be applied.
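Combining this consideration, we get a new QP problem for the inseparable case:
$$\min_{\boldsymbol{\beta},\,\beta_0} \frac{1}{2}\|\boldsymbol{\beta}\|^2 \quad \text{subject to} \quad \begin{cases} y_i(\phi(\mathbf{x}_i)^T\boldsymbol{\beta} + \beta_0) \ge 1 - \xi_i, \quad \forall i, \\ \xi_i \ge 0, \quad \sum_i \xi_i \le \text{constant}. \end{cases} \tag{4}$$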
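Computationally it is convenient to re-express (4) in the equivalent form
$$\min_{\boldsymbol{\beta},\,\beta_0} \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad \begin{cases} y_i(\phi(\mathbf{x}_i)^T\boldsymbol{\beta} + \beta_0) \ge 1 - \xi_i, \quad \forall i, \\ \xi_i \ge 0, \end{cases} \tag{5}$$
where $C$ replaces the constant in (4); the separable case corresponds to $C = +\infty$. We call (5) the Lagrange primal problem of the Support Vector Classifier (SVC) hereinafter.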
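To make the role of the slack variables in (5) concrete, the following sketch evaluates the objective of (5) for a fixed hyperplane. For given $\boldsymbol{\beta}$ and $\beta_0$, the smallest slack satisfying both constraints is $\xi_i = \max(0, 1 - y_i f(\mathbf{x}_i))$. The function name `primal_objective`, the identity feature map used as the default `phi`, and the one-dimensional data are assumptions chosen for illustration.

```python
def primal_objective(beta, beta0, X, y, C, phi=lambda x: x):
    """Return (value of the objective in (5), slacks) for a fixed (beta, beta0).

    Each slack is the smallest value satisfying both constraints of (5):
    xi_i = max(0, 1 - y_i * (phi(x_i)^T beta + beta0)).
    """
    slacks = []
    for xi, yi in zip(X, y):
        f = sum(b * v for b, v in zip(beta, phi(xi))) + beta0
        slacks.append(max(0.0, 1.0 - yi * f))
    objective = 0.5 * sum(b * b for b in beta) + C * sum(slacks)
    return objective, slacks

# Illustrative 1-D data with the hyperplane x - 2 = 0.
X = [[3.0], [1.0], [2.5], [1.5]]
y = [1, -1, 1, 1]
objective, slacks = primal_objective([1.0], -2.0, X, y, C=2.0)
```

The first two examples lie on the margin and need no slack, the third is inside the margin ($\xi = 0.5$), and the fourth is misclassified ($\xi = 1.5 > 1$).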

4 Duality of (5)

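The Lagrange primal function of (5) is
$$L_P = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left[y_i(\phi(\mathbf{x}_i)^T\boldsymbol{\beta} + \beta_0) - (1 - \xi_i)\right] - \sum_{i=1}^{n}\mu_i\xi_i, \tag{6}$$
where $\alpha_i, \mu_i, \xi_i \ge 0$, $\forall i$. We minimize $L_P$ w.r.t. $\boldsymbol{\beta}$, $\beta_0$ and $\xi_i$. Setting the respective derivatives to zero, we get
$$\boldsymbol{\beta} = \sum_{i=1}^{n}\alpha_i y_i \phi(\mathbf{x}_i), \tag{7}$$
$$0 = \sum_{i=1}^{n}\alpha_i y_i, \tag{8}$$
$$\alpha_i = C - \mu_i, \quad \forall i. \tag{9}$$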

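By substituting the last three equalities into (6), we obtain the Lagrangian (Wolfe) dual objective function
$$L_D = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_{i'}). \tag{10}$$
We maximize $L_D$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n}\alpha_i y_i = 0$. In addition to (7)-(9), the Karush-Kuhn-Tucker (KKT) conditions include the constraints
$$\alpha_i\left[y_i(\phi(\mathbf{x}_i)^T\boldsymbol{\beta} + \beta_0) - (1 - \xi_i)\right] = 0, \tag{11}$$
$$\mu_i\xi_i = 0, \tag{12}$$
$$y_i\left[\phi(\mathbf{x}_i)^T\boldsymbol{\beta} + \beta_0\right] - (1 - \xi_i) \ge 0, \tag{13}$$
for $i = 1, 2, \ldots, n$. Together these equations (7)-(13) uniquely characterize the solution to the primal and dual problem.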
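The substitution that leads from (6) to (10) is short enough to spell out. The following worked derivation only rearranges (6) and then applies (7), (8) and (9); it introduces no new assumptions.

```latex
\begin{align*}
L_P &= \tfrac{1}{2}\|\boldsymbol{\beta}\|^2
       - \sum_{i=1}^{n}\alpha_i y_i \phi(\mathbf{x}_i)^T\boldsymbol{\beta}
       - \beta_0 \sum_{i=1}^{n}\alpha_i y_i
       + \sum_{i=1}^{n}\alpha_i
       + \sum_{i=1}^{n}(C - \alpha_i - \mu_i)\,\xi_i \\
    &= \tfrac{1}{2}\|\boldsymbol{\beta}\|^2 - \|\boldsymbol{\beta}\|^2 + \sum_{i=1}^{n}\alpha_i
       && \text{by (7), (8) and (9)} \\
    &= \sum_{i=1}^{n}\alpha_i
       - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}
         \alpha_i \alpha_{i'} y_i y_{i'} \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_{i'})
       = L_D
       && \text{by (7) again.}
\end{align*}
```

The box constraint $0 \le \alpha_i \le C$ follows from (9) together with $\alpha_i \ge 0$ and $\mu_i \ge 0$.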

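Hence all we need to do is to solve the dual problem
$$\max_{\boldsymbol{\alpha}} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{i'=1}^{n}\alpha_i\alpha_{i'}y_iy_{i'}\phi(\mathbf{x}_i)^T\phi(\mathbf{x}_{i'}) \quad \text{subject to} \quad \begin{cases} \sum_{i=1}^{n} y_i\alpha_i = 0, \\ 0 \le \alpha_i \le C, \quad \forall i, \end{cases} \tag{14}$$
and get the solutions $\hat{\alpha}_i$. From (7), $\hat{\boldsymbol{\beta}}$ has the form
$$\hat{\boldsymbol{\beta}} = \sum_{i=1}^{n}\hat{\alpha}_i y_i \phi(\mathbf{x}_i), \tag{15}$$
with nonzero coefficients $\hat{\alpha}_i$ only for those observations $i$ for which the constraints in (13) are met with equality (due to (11)). These observations are called the support vectors; that is to say, an observation $\mathbf{x}_i$ is called a support vector if its respective Lagrange multiplier $\hat{\alpha}_i > 0$. Any of the margin points ($0 < \hat{\alpha}_i$, $\hat{\xi}_i = 0$) can be used to solve for $\beta_0$, and we typically use an average of all the solutions for numerical stability.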
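The recovery of $\boldsymbol{\beta}$ and $\beta_0$ from the dual solution can be sketched as follows. On a margin point the constraint (13) holds with equality and $\xi_i = 0$, so $\beta_0 = y_i - \phi(\mathbf{x}_i)^T\boldsymbol{\beta}$. The sketch assumes that margin points are detected by $0 < \alpha_i < C$ (which forces $\mu_i > 0$ and hence $\xi_i = 0$ by (9) and (12)); the function name, the tolerance `eps`, the identity feature map and the toy data with a hand-computed dual solution are all illustrative.

```python
def intercept_from_margin_points(alpha, X, y, C, phi=lambda x: x, eps=1e-8):
    """Return (beta, beta0) from dual multipliers, averaging beta0 over margin points."""
    feats = [phi(x) for x in X]
    p = len(feats[0])
    # beta = sum_i alpha_i y_i phi(x_i), as in (15)
    beta = [sum(a * yi * f[k] for a, yi, f in zip(alpha, y, feats))
            for k in range(p)]
    estimates = []
    for a, yi, f in zip(alpha, y, feats):
        if eps < a < C - eps:  # margin point: y_i (phi(x_i)^T beta + beta0) = 1
            estimates.append(yi - sum(b * v for b, v in zip(beta, f)))
    if not estimates:
        raise ValueError("no margin point with 0 < alpha_i < C")
    return beta, sum(estimates) / len(estimates)

# Toy problem whose dual solution is known by hand: the classes are split at x1 = 2.
X = [[3.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]  # third point is not a support vector
beta, beta0 = intercept_from_margin_points(alpha, X, y, C=10.0)
```

Both margin points give the same estimate here; on real data the estimates differ slightly, which is why averaging is used.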
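After $\hat{\boldsymbol{\beta}}$ and $\hat{\beta}_0$ are both obtained, the discriminant function is
$$f(\mathbf{x}) = \phi(\mathbf{x})^T\hat{\boldsymbol{\beta}} + \hat{\beta}_0 = \sum_{i=1}^{n}\hat{\alpha}_i y_i \phi(\mathbf{x})^T\phi(\mathbf{x}_i) + \hat{\beta}_0. \tag{16}$$
The decision function can be written as
$$G(\mathbf{x}) = \operatorname{sign}[f(\mathbf{x})] = \operatorname{sign}\Big[\sum_{i=1}^{n}\hat{\alpha}_i y_i \phi(\mathbf{x})^T\phi(\mathbf{x}_i) + \hat{\beta}_0\Big]. \tag{17}$$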

The tuning parameter of this procedure is $C$. The optimal value for $C$ can be estimated by cross-validation. A large value of $C$ will discourage any positive $\xi_i$, and leads to an overfit wiggly boundary in the original feature space; a small value of $C$ will encourage a small value of $\|\boldsymbol{\beta}\|$, which causes $f(\mathbf{x})$ and hence the boundary to be smoother.

5 Another useful extension of QP (2): Kernel

Since SVC uses only the sign of the decision function to classify, only the decision function is ultimately informative. Moreover, $\mathbf{x}$ always appears in the pairwise form $\phi(\mathbf{x})^T\phi(\mathbf{x}')$ in (17). Defining the kernel function $K(\mathbf{x}, \mathbf{x}')$ as $K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$, (17) can be written as
$$G(\mathbf{x}) = \operatorname{sign}[f(\mathbf{x})] = \operatorname{sign}\Big[\sum_{i=1}^{n}\hat{\alpha}_i y_i K(\mathbf{x}, \mathbf{x}_i) + \hat{\beta}_0\Big]. \tag{18}$$
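The kernel form (18) of the decision function translates almost literally into code. The sketch below is a minimal illustration, assuming the dual solution is already available; the function names, the convention that `sign(0)` is mapped to $+1$, and the toy solution (computed by hand for a linear kernel) are assumptions.

```python
def decision_value(x, alpha, X, y, beta0, kernel):
    """f(x) = sum_i alpha_i y_i K(x, x_i) + beta0; only support vectors contribute."""
    return sum(a * yi * kernel(x, xi)
               for a, yi, xi in zip(alpha, y, X) if a > 0) + beta0

def predict(x, alpha, X, y, beta0, kernel):
    """G(x) = sign[f(x)], with the (arbitrary) convention sign(0) = +1."""
    return 1 if decision_value(x, alpha, X, y, beta0, kernel) >= 0 else -1

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy dual solution for three training points, split at x1 = 2.
X = [[3.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
y = [1, -1, 1]
alpha = [0.5, 0.5, 0.0]
beta0 = -2.0

f_right = decision_value([4.0, 2.0], alpha, X, y, beta0, linear_kernel)
label_right = predict([4.0, 2.0], alpha, X, y, beta0, linear_kernel)
label_left = predict([0.0, 0.0], alpha, X, y, beta0, linear_kernel)
```

Note that prediction never needs $\phi$ or $\boldsymbol{\beta}$ explicitly: only kernel evaluations against the support vectors are required.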

Instead of hand-choosing a feature function $\phi(\mathbf{x})$, Boser, Guyon, and Vapnik (1992) proposed directly choosing a kernel function $K(\mathbf{x}, \mathbf{x}')$ that represents a dot product $\phi(\mathbf{x})^T\phi(\mathbf{x}')$ in some unspecified high dimensional space. For instance, any continuous decision boundary can be implemented using the Radial Basis Function (RBF) kernel $K(\mathbf{x}, \mathbf{x}') = e^{-\gamma\|\mathbf{x} - \mathbf{x}'\|^2}$.
Although many new kernels are being proposed by researchers, the following four basic kernels are used most widely:
Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$
Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\gamma\,\mathbf{x}^T\mathbf{x}' + r)^d$, $\gamma > 0$, $d \in \mathbb{N}$
Radial Basis Function (RBF): $K(\mathbf{x}, \mathbf{x}') = e^{-\gamma\|\mathbf{x} - \mathbf{x}'\|^2}$, $\gamma > 0$
Sigmoid: $K(\mathbf{x}, \mathbf{x}') = \tanh(\gamma\,\mathbf{x}^T\mathbf{x}' + r)$
where $\gamma$, $r$, and $d$ are kernel parameters.
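The four basic kernels listed above can be written down directly. This is a plain-Python sketch; the function names and the default parameter values are illustrative choices and are not the defaults of PKSVM or LIBSVM.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(x, z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x, z) + r)

x, z = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(x, z)                              # x^T z = 11
k_poly = polynomial_kernel(x, z, gamma=0.5, r=1.0, d=2)  # (5.5 + 1)^2
k_rbf = rbf_kernel(x, z, gamma=0.25)                     # exp(-0.25 * 8)
k_sig = sigmoid_kernel(x, z, gamma=0.1, r=-1.0)          # tanh(1.1 - 1)
```

The RBF kernel of a point with itself is always 1, and it decays towards 0 as the two points move apart, at a rate controlled by $\gamma$.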

6 Shrinking and Caching

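Defining a new $n$ by $n$ matrix $Q$ as $Q_{ij} = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$, QP (14) can be expressed more simply as
$$\min_{\boldsymbol{\alpha}} g(\boldsymbol{\alpha}) = \frac{1}{2}\boldsymbol{\alpha}^T Q \boldsymbol{\alpha} - \mathbf{e}^T\boldsymbol{\alpha} \quad \text{subject to} \quad \begin{cases} \mathbf{y}^T\boldsymbol{\alpha} = 0, \\ 0 \le \alpha_i \le C, \quad \forall i, \end{cases} \tag{19}$$
where $\mathbf{e}$ is a vector with all elements equal to 1.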
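For small $n$ the matrix $Q$, the objective $g(\boldsymbol{\alpha})$ of (19) and its gradient $Q\boldsymbol{\alpha} - \mathbf{e}$ can be formed explicitly, which is useful for checking a solver on toy problems. The sketch below uses dense Python lists and a linear kernel; the function names and the data are illustrative, and this dense approach is exactly what does not scale to large $n$.

```python
def build_Q(X, y, kernel):
    """Q_ij = y_i y_j K(x_i, x_j) as a dense n-by-n list of lists."""
    n = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(n)]
            for i in range(n)]

def objective(Q, alpha):
    """g(alpha) = 0.5 * alpha^T Q alpha - e^T alpha."""
    n = len(alpha)
    quad = sum(alpha[i] * Q[i][j] * alpha[j]
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(alpha)

def gradient(Q, alpha):
    """Gradient of g: Q alpha - e."""
    n = len(alpha)
    return [sum(Q[i][j] * alpha[j] for j in range(n)) - 1.0 for i in range(n)]

def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [[3.0, 0.0], [1.0, 0.0], [5.0, 1.0]]
y = [1, -1, 1]
Q = build_Q(X, y, linear_kernel)
alpha = [0.5, 0.5, 0.0]
g_value = objective(Q, alpha)
g_grad = gradient(Q, alpha)
```

Since (19) is the negated form of (14), `g_value` is the negative of the dual objective $L_D$ at the same $\boldsymbol{\alpha}$.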


Traditional optimization methods need to store the whole matrix $Q$ when used to solve (19). In most situations it is impossible to hold the whole of $Q$ in computer memory when the number of training examples $n$ is very large. Therefore, traditional algorithms are not suitable for SVC.
Fortunately, many more powerful algorithms have been invented to solve (19), such as chunking, decomposition and sequential minimal optimization (SMO). SMO in particular has attracted a lot of attention since it was proposed by Platt (1998). All of them employ divide-and-conquer strategies: they split the original big QP (19) into many much smaller subproblems and solve them iteratively to obtain the solution to (19).
However, the iterative computation is very time-consuming when $n$ is big. Joachims (1998) proposed two techniques to reduce the computational cost. The first one is shrinking.

6.1 Shrinking

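For many problems the number of free support vectors ($0 < \alpha_i < C$) is small compared to the number of all support vectors ($0 < \alpha_i \le C$). The shrinking technique reduces the size of the problem by leaving some bounded support vectors ($\alpha_i = C$) out of consideration. When the iterative process approaches the end, only variables in a small set $A$, which we call the active set, are allowed to move, according to Theorem 5 in Fan et al. (2005). After shrinking, the decomposition method works on a smaller problem:
$$\min_{\boldsymbol{\alpha}_A} \frac{1}{2}\boldsymbol{\alpha}_A^T Q_{AA}\boldsymbol{\alpha}_A - (\mathbf{e}_A - Q_{AN}\boldsymbol{\alpha}_N^k)^T\boldsymbol{\alpha}_A \quad \text{subject to} \quad \begin{cases} \mathbf{y}_A^T\boldsymbol{\alpha}_A = -\mathbf{y}_N^T\boldsymbol{\alpha}_N^k, \\ 0 \le (\boldsymbol{\alpha}_A)_i \le C, \quad i = 1, \ldots, |A|, \end{cases} \tag{20}$$
where $N = \{1, \ldots, n\} \setminus A$ is the set of shrunken variables, and $\boldsymbol{\alpha}_N^k$ is a constant determined by the $(k-1)$th iteration.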
Although Theorem 5 in Fan et al. (2005) tells us that the active set $A$ exists, it does not tell us when $A$ will be obtained. Hence the heuristic shrinking above may fail if the optimal solution of subproblem (20) is not the corresponding part of that of (19). If a failure happens, the whole QP (19) is reoptimized from initial values $\boldsymbol{\alpha}$, where $\boldsymbol{\alpha}_A$ is an optimal solution of (20) and $\boldsymbol{\alpha}_N$ are the bounded variables identified in the previous subproblem.
Before presenting the shrinking details, it is necessary to supply some definitions.

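Define
$$I_{up}(\boldsymbol{\alpha}) \equiv \{i \mid \alpha_i < C,\ y_i = 1 \ \text{or} \ \alpha_i > 0,\ y_i = -1\},$$
$$I_{low}(\boldsymbol{\alpha}) \equiv \{i \mid \alpha_i < C,\ y_i = -1 \ \text{or} \ \alpha_i > 0,\ y_i = 1\}; \tag{21}$$
$$m(\boldsymbol{\alpha}) \equiv \max_{i \in I_{up}(\boldsymbol{\alpha})} \{-y_i \nabla g(\boldsymbol{\alpha})_i\}, \qquad M(\boldsymbol{\alpha}) \equiv \min_{i \in I_{low}(\boldsymbol{\alpha})} \{-y_i \nabla g(\boldsymbol{\alpha})_i\}, \tag{22}$$
where $\nabla g(\boldsymbol{\alpha})$ is the gradient of the objective function $g(\boldsymbol{\alpha})$ in (19), that is, $\nabla g(\boldsymbol{\alpha}) = Q\boldsymbol{\alpha} - \mathbf{e}$.
$\boldsymbol{\alpha}$ is an optimal solution to (19) if and only if
$$m(\boldsymbol{\alpha}) \le M(\boldsymbol{\alpha}). \tag{23}$$
In other words, the above inequality is equivalent to the KKT conditions. So we can stop the iteration if
$$m(\boldsymbol{\alpha}^k) - M(\boldsymbol{\alpha}^k) \le tol, \tag{24}$$
where $tol$ is a small positive value which indicates that the KKT conditions are obeyed within $tol$. In PKSVM, $tol = 10^{-3}$ by default.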
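The stopping test (24) only needs the current $\boldsymbol{\alpha}$, the labels and the gradient. The following sketch computes $m(\boldsymbol{\alpha})$ and $M(\boldsymbol{\alpha})$ from (21) and (22) and applies (24); the function names are illustrative, and the gradient values in the example are those of a toy three-point problem with a linear kernel, worked out by hand.

```python
import math

def m_and_M(alpha, y, grad, C):
    """Return (m(alpha), M(alpha)) as defined in (21) and (22)."""
    n = len(alpha)
    up = [-y[i] * grad[i] for i in range(n)
          if (alpha[i] < C and y[i] == 1) or (alpha[i] > 0 and y[i] == -1)]
    low = [-y[i] * grad[i] for i in range(n)
           if (alpha[i] < C and y[i] == -1) or (alpha[i] > 0 and y[i] == 1)]
    # An empty index set imposes no condition, hence the infinite defaults.
    return max(up, default=-math.inf), min(low, default=math.inf)

def is_optimal(alpha, y, grad, C, tol=1e-3):
    """Stopping criterion (24): m(alpha) - M(alpha) <= tol."""
    m, M = m_and_M(alpha, y, grad, C)
    return m - M <= tol

y = [1, -1, 1]
C = 10.0

# At the optimum alpha = (0.5, 0.5, 0) the gradient Q alpha - e is (2, -2, 4).
m_opt, M_opt = m_and_M([0.5, 0.5, 0.0], y, [2.0, -2.0, 4.0], C)
stop_at_optimum = is_optimal([0.5, 0.5, 0.0], y, [2.0, -2.0, 4.0], C)

# At the starting point alpha = 0 the gradient is -e and the test fails.
m_start, M_start = m_and_M([0.0, 0.0, 0.0], y, [-1.0, -1.0, -1.0], C)
stop_at_start = is_optimal([0.0, 0.0, 0.0], y, [-1.0, -1.0, -1.0], C)
```

The gap $m(\boldsymbol{\alpha}) - M(\boldsymbol{\alpha})$ measures the largest violation of the KKT conditions, so it shrinks towards zero (or below) as the iterations converge.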
The following shrinking procedure is from LIBSVM, which is one of the most popular SVM software packages at present. More details are available from Chang and Lin (2001).
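1. Some bounded variables will be shrunken after every $\min(n, 1000)$ iterations. Since the KKT conditions are not satisfied within $tol$ during the iterative process, (24) is not yet obeyed, that is,
$$m(\boldsymbol{\alpha}^k) > M(\boldsymbol{\alpha}^k). \tag{25}$$
Following Theorem 5 in Fan et al. (2005), variables in the following set may be shrunken:
$$\begin{aligned} &\{i \mid -y_i\nabla g(\boldsymbol{\alpha}^k)_i > m(\boldsymbol{\alpha}^k),\ i \in I_{low}(\boldsymbol{\alpha}^k),\ \alpha_i^k = C \text{ or } 0\} \\ &\quad \cup\ \{i \mid -y_i\nabla g(\boldsymbol{\alpha}^k)_i < M(\boldsymbol{\alpha}^k),\ i \in I_{up}(\boldsymbol{\alpha}^k),\ \alpha_i^k = C \text{ or } 0\} \\ &= \{i \mid -y_i\nabla g(\boldsymbol{\alpha}^k)_i > m(\boldsymbol{\alpha}^k),\ \alpha_i^k = C,\ y_i = 1 \text{ or } \alpha_i^k = 0,\ y_i = -1\} \\ &\quad \cup\ \{i \mid -y_i\nabla g(\boldsymbol{\alpha}^k)_i < M(\boldsymbol{\alpha}^k),\ \alpha_i^k = C,\ y_i = -1 \text{ or } \alpha_i^k = 0,\ y_i = 1\}. \end{aligned} \tag{26}$$
Hence the active set $A$ is dynamically reduced every $\min(n, 1000)$ iterations.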

2. Since the previous shrinking strategy may fail, and many iterations are spent obtaining the final digit of the required accuracy, we do not want these iterations to be wasted on a wrongly shrunken subproblem (20). Thus once the iteration attains the tolerance
$$m(\boldsymbol{\alpha}^k) \le M(\boldsymbol{\alpha}^k) + 10\,tol, \tag{27}$$
we reconstruct the whole gradient $\nabla g(\boldsymbol{\alpha}^k)$. After reconstruction, we shrink some bounded variables based on the same rule as in step 1, and the iterations continue.
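The selection rule (26) from step 1 can be sketched as a small function that returns the indices eligible for shrinking at the current iterate. This is an illustration of the rule only, not LIBSVM's or PKSVM's actual code; the function name and the six-variable example (with made-up gradient values satisfying (25)) are assumptions.

```python
import math

def shrink_candidates(alpha, y, grad, C):
    """Return (m, M, indices that rule (26) allows to be shrunken)."""
    n = len(alpha)
    v = [-y[i] * grad[i] for i in range(n)]
    in_up = [(alpha[i] < C and y[i] == 1) or (alpha[i] > 0 and y[i] == -1)
             for i in range(n)]
    in_low = [(alpha[i] < C and y[i] == -1) or (alpha[i] > 0 and y[i] == 1)
              for i in range(n)]
    m = max((v[i] for i in range(n) if in_up[i]), default=-math.inf)
    M = min((v[i] for i in range(n) if in_low[i]), default=math.inf)
    shrunk = []
    for i in range(n):
        at_upper, at_lower = alpha[i] >= C, alpha[i] <= 0
        # Bounded variable in I_low with -y_i grad_i above m(alpha).
        first = v[i] > m and ((at_upper and y[i] == 1) or (at_lower and y[i] == -1))
        # Bounded variable in I_up with -y_i grad_i below M(alpha).
        second = v[i] < M and ((at_upper and y[i] == -1) or (at_lower and y[i] == 1))
        if first or second:
            shrunk.append(i)
    return m, M, shrunk

# Made-up intermediate iterate: variables 2 and 3 are free, the rest are bounded.
alpha = [0.0, 2.0, 1.3, 1.3, 0.0, 2.0]
y = [1, -1, 1, -1, -1, 1]
grad = [1.5, 0.1, -0.4, -0.2, 0.9, 0.0]
m, M, shrunk = shrink_candidates(alpha, y, grad, C=2.0)
```

Free variables are never shrunken, and a bounded variable is kept in the active set whenever its value of $-y_i\nabla g(\boldsymbol{\alpha}^k)_i$ lies between $M(\boldsymbol{\alpha}^k)$ and $m(\boldsymbol{\alpha}^k)$.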
The other technique proposed by Joachims (1998) for saving computational time is called caching. To show why caching is necessary, some analysis of the computational complexity is presented first.

6.2 Computational complexity

Most of the time in each iteration is spent on the kernel evaluations needed to compute the $q$ rows of $Q$, where $q$ depends on the decomposition method; for SMO, $q = 2$. This step has a time complexity of $O(npq)$. Using the stored rows of $Q$, updating $\boldsymbol{\alpha}^k$ is done in time $O(nq)$. Setting up the QP subproblem requires $O(nq)$ as well. The selection of the next working set, which includes computing the gradient, can be done in $O(nq)$.

6.3 Caching

As illustrated in the last subsection, the most expensive step in each iteration is the kernel evaluations needed to compute the $q$ rows of $Q$. Near the end of the iterations, eventual support vectors enter the working set multiple times. Caching avoids recomputing these rows of $Q$ and therefore reduces the computational cost.
Since $Q$ is fully dense and may not fit into computer memory completely, a special storage based on the idea of a cache is usually used to store recently used $Q_{ij}$.
Just as in SVMlight and LIBSVM, a simple least-recently-used caching strategy is implemented in PKSVM. When the cache does not have enough room for a new row, the row which has not been used for the greatest number of iterations is eliminated from the cache.
Only those rows of $Q$ which correspond to the active set $A$ are computed and cached in PKSVM. Once shrinking occurs, we simply clean up the whole cache and recache shrunken rows.
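A least-recently-used cache for rows of $Q$ can be sketched in a few lines. This is not the PKSVM or LIBSVM implementation: the class name `RowCache` is hypothetical, the capacity is counted in rows rather than bytes, and `compute_row` stands in for the kernel evaluations that produce row $i$ of $Q$.

```python
from collections import OrderedDict

class RowCache:
    """Least-recently-used cache holding at most `capacity` rows of Q."""

    def __init__(self, capacity, compute_row):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._compute_row = compute_row
        self._rows = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, i):
        if i in self._rows:
            self._rows.move_to_end(i)  # mark row i as most recently used
            self.hits += 1
            return self._rows[i]
        self.misses += 1
        row = self._compute_row(i)  # the expensive kernel evaluations
        if len(self._rows) >= self.capacity:
            self._rows.popitem(last=False)  # evict the least recently used row
        self._rows[i] = row
        return row

    def clear(self):
        """Drop all rows, e.g. after the active set has been shrunken."""
        self._rows.clear()

computed = []  # records which rows had to be computed

def fake_row(i):
    computed.append(i)
    return [float(i)] * 3  # stand-in for a real row of Q

cache = RowCache(capacity=2, compute_row=fake_row)
for index in (0, 1, 0, 2, 1):
    cache.get(index)
```

In the usage above, the second request for row 0 is served from the cache, row 1 is evicted when row 2 arrives, and the final request for row 1 therefore has to recompute it.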

7 Conclusions

This article presents a simple summary of the Support Vector Machine for classification problems. It also covers several state-of-the-art techniques, such as shrinking and caching, that make SVM more practical for large-scale problems. Some other useful methods, e.g., working set selection, have to be skipped because of space restrictions.
SVM has grown into a big family thanks to thousands of excellent research papers in the last ten years. It has become one of the most powerful and popular tools in machine learning. Although the whole article is dedicated to C-SVC, $\nu$-SVC, $\epsilon$-SVR, $\nu$-SVR and some other generalizations of SVM share most of the techniques mentioned here. The differences between them are small, except that the dual optimization problems differ formally. One can easily find more details about them in many textbooks and papers if necessary.

References
[1] Naiyang Deng and Yingjie Tian. A New Method in Data Mining: Support Vector Machine. Science Press, 2004.
[2] Nils J. Nilsson. Learning Machines: Foundations of Trainable Pattern Classifying Systems. McGraw-Hill, 1965.
[3] Vladimir N. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:774-780, 1963.
[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.
[5] Corinna Cortes and Vladimir N. Vapnik. Support vector networks. Machine Learning, 20:1-25, 1995.
[6] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[7] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.
[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm
[9] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification. July 2003.
[10] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889-1918, 2005. URL http://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf.