A Summary of SVM


Jinlong Wu

Computational Mathematics,

SMS, PKU

May 4, 2007

Introduction

Since the Support Vector Machine (SVM) was first proposed, it has been rapidly developed. It is widely used in many areas because of its powerful ability in classification and regression, such as text classification, face recognition, image processing, hand-written recognition and so forth.

In this article I try to give a summary of SVM. Because of its plentiful contents, this article has to focus mainly on one aspect: the Support Vector Classifier (SVC). All of the techniques mentioned here are realized in a new SVM software package, PKSVM. Presently PKSVM can handle two-class and multi-class classification problems using C-SVC and $\nu$-SVC (the latter is omitted here).

The earliest pattern recognition systems were Linear Classifiers (Nilsson, 1965). We will point out the differences between SVC and the linear classifier (LC) in the following sections. Suppose $n$ training data (sometimes called examples or observations) $(x_i, y_i)$ are given, where $x_i \in \mathbb{R}^p$ ($p$ features or variables) and $y_i \in \mathbb{R}$. Without loss of generality, the whole article is limited to two-class problems, that is to say, the $y_i$'s take only two values, which can be $+1$ and $-1$ for simplicity.

In this section I will review linearly separable problems because they are the simplest case for SVC. The training set $\{(x_1, y_1), \dots, (x_n, y_n)\}$ is said to be linearly separable if there exists a linear discriminant function whose sign matches the class of all training examples. Figure 1 gives a trivial example with $n = 9$, $p = 2$.

For linearly separable problems, it is obvious that one can use the simplest model, the linear model $y(x) = \beta^T x + \beta_0$, to separate the two classes. However, as in Figure 2, there usually exist infinitely many separating hyperplanes which can separate the training set perfectly. So which one should be chosen is a big problem.

Vapnik and Lerner (1963) propose to choose the separating hyperplane that maximizes the margin. This optimal hyperplane for the trivial example in Figure 1 is illustrated in Figure 3. Selecting the optimal hyperplane as the best one is the first new idea that SVM adds in comparison with the linear classifier.

As in Figure 4, one can easily see that the signed distance from a point $x$ to the hyperplane $L = \{x : \beta^T x + \beta_0 = 0\}$ is

$$\frac{1}{\|\beta\|}\,\beta^T(x - x_0) = \frac{1}{\|\beta\|}\,(\beta^T x - \beta^T x_0) = \frac{1}{\|\beta\|}\,(\beta^T x + \beta_0), \qquad (1)$$

where $x_0$ is any point lying on $L$.
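As a quick numerical illustration of (1) (a made-up example, not one of the paper's figures): take $\beta = (1, 1)^T$ and $\beta_0 = -1$, so $L$ is the line $x_1 + x_2 = 1$. For the point $x = (2, 0)^T$, (1) gives

$$\frac{1}{\|\beta\|}\,(\beta^T x + \beta_0) = \frac{2 + 0 - 1}{\sqrt{2}} = \frac{1}{\sqrt{2}},$$

the perpendicular distance from $x$ to the line, with positive sign because $x$ lies on the side toward which $\beta$ points.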

Hence, our aim is to obtain the biggest positive $C$ which makes all examples satisfy $\frac{1}{\|\beta\|}\,|\beta^T x_i + \beta_0| \ge C$, $\forall i$. Obviously it is a constrained optimization problem (OP):

$$\max_{\beta,\,\beta_0}\ C \quad \text{subject to} \quad \frac{1}{\|\beta\|}\,y_i(\beta^T x_i + \beta_0) \ge C, \ \forall i.$$

Since rescaling $(\beta, \beta_0)$ by a positive constant does not change the constraints, of course it is allowed to assume $\frac{1}{\|\beta\|} = C$ all the time. With this assumption the previous OP can be rewritten as

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^T x_i + \beta_0) \ge 1, \ \forall i. \qquad (2)$$

It is hard to solve (2) directly because of its constraint conditions. So we will resort to solving its dual problem. Details will be exhibited in Section 4.

[Figure: the toy data set with separating hyperplanes; axes $x_1$ and $x_2$, classes labeled val $= +1$ and val $= -1$, decision boundary marked.]

3

3.1 Hypersurfaces

Real-world data sets usually cannot be separated ideally by simple hyperplanes. If hyperplanes are replaced by hypersurfaces, it can be expected that better classification results will be obtained. Therefore, the OP (2) should be generalized to

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i(\beta^T \phi(x_i) + \beta_0) \ge 1, \ \forall i, \qquad (3)$$

where $\phi(\cdot)$ is a feature function which maps each $x_i$ into some (usually higher) dimensional space.
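To make the feature-function idea concrete, here is a minimal sketch (my own illustration, not part of PKSVM): for $p = 2$, the map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ turns the linear machinery of (3) into a quadratic hypersurface in the original space, and its dot products can be evaluated without ever forming $\phi$ explicitly.

    import numpy as np

    def phi(x):
        # Explicit degree-2 feature map for p = 2 (illustration only).
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    lhs = phi(x) @ phi(xp)     # dot product in the feature space
    rhs = (x @ xp) ** 2        # same value computed in the input space
    assert np.isclose(lhs, rhs)  # phi(x)^T phi(x') == (x^T x')^2

This identity is exactly what the kernel functions introduced later exploit.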

3.2 Soft margins

Maybe someone will raise another problem after careful thought: is this generalization able to separate any real-world data set perfectly? My first answer is yes: if one chooses an excessively complicated function $\phi(\cdot)$ which produces very wiggly surfaces, then the given set can always be separated perfectly. However, if one really does that, another serious problem arises: overfitting. The model obtained will be worthless because of its serious overfitting. Figure 5 gives another data set which cannot be separated perfectly using linear hyperplanes. At first sight the black surfaces give a much better separating result, but at the same time they are too wiggly to classify new data well. Therefore, one must not utilize an excessively complicated function $\phi(\cdot)$!

Hence, my second answer is no. Compared to the first answer, this one is more realistic. That is a very depressing answer. Can't people do something to improve the separating result for the example in Figure 5? Of course we can. We can take one more step to extend our model (3) again.

Since data sets come from real-world problems, they almost always include some noise. The intention of our model is to learn useful information from the data sets, not to learn their noise. It is not wise to use an excessively complicated function $\phi(\cdot)$ to separate the classes of data.

Cortes and Vapnik (1995) show that noisy problems are best addressed by allowing some examples to violate the margin constraints in (3). We can realize this idea using positive slack variables $\xi = (\xi_1, \dots, \xi_n)$, that is to say, we change the previous constraints $y_i(\beta^T \phi(x_i) + \beta_0) \ge 1$, $\forall i$ to the new constraints $y_i(\beta^T \phi(x_i) + \beta_0) \ge 1 - \xi_i$, $\forall i$. However, it is necessary to avoid each $\xi_i$ taking an unnecessarily big value, such as $+\infty$. So some penalization of all the $\xi_i$'s should be applied.

Combining these considerations, we get a new QP problem for the inseparable case:

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad \begin{cases} y_i(\phi(x_i)^T\beta + \beta_0) \ge 1 - \xi_i, \ \forall i, \\[2pt] \xi_i \ge 0, \ \sum_i \xi_i \le \text{constant}. \end{cases} \qquad (4)$$

Computationally it is convenient to re-express (4) in the equivalent form

$$\min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to} \quad \begin{cases} y_i(\phi(x_i)^T\beta + \beta_0) \ge 1 - \xi_i, \ \forall i, \\[2pt] \xi_i \ge 0, \end{cases} \qquad (5)$$

where the cost parameter $C$ replaces the constant in (4); the separable case corresponds to $C = +\infty$. We call (5) the Lagrange primal problem of the Support Vector Classifier (SVC) hereinafter.

4 Duality of (5)

The Lagrange function of (5) is

$$L_P = \frac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\big[y_i(\phi(x_i)^T\beta + \beta_0) - (1 - \xi_i)\big] - \sum_{i=1}^n \mu_i \xi_i, \qquad (6)$$

where $\alpha_i, \mu_i, \xi_i \ge 0$ $\forall i$. We minimize $L_P$ w.r.t. $\beta$, $\beta_0$ and the $\xi_i$. Setting the respective derivatives to zero, we get

$$\beta = \sum_{i=1}^n \alpha_i y_i \phi(x_i), \qquad (7)$$

$$0 = \sum_{i=1}^n \alpha_i y_i, \qquad (8)$$

$$\alpha_i = C - \mu_i, \ \forall i. \qquad (9)$$

By substituting the last three equalities into (6), we obtain the Lagrangian (Wolfe) dual objective function

$$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{i'=1}^n \alpha_i \alpha_{i'} y_i y_{i'} \phi(x_i)^T \phi(x_{i'}). \qquad (10)$$
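For completeness, the substitution is a short computation. Expanding (6) and grouping the terms in $\xi_i$,

$$L_P = \frac{1}{2}\beta^T\beta - \Big(\sum_{i=1}^n \alpha_i y_i \phi(x_i)\Big)^T \beta - \beta_0 \sum_{i=1}^n \alpha_i y_i + \sum_{i=1}^n \alpha_i + \sum_{i=1}^n (C - \alpha_i - \mu_i)\,\xi_i;$$

by (8) the $\beta_0$ term vanishes, by (9) every $\xi_i$ coefficient vanishes, and by (7) the middle term equals $-\beta^T\beta$, leaving $L_D = \sum_i \alpha_i - \frac{1}{2}\beta^T\beta$, which is (10).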

We maximize $L_D$ subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$. In addition to (7)-(9), the Karush-Kuhn-Tucker (KKT) conditions include the constraints

$$\alpha_i\big[y_i(\phi(x_i)^T\beta + \beta_0) - (1 - \xi_i)\big] = 0, \qquad (11)$$

$$\mu_i \xi_i = 0, \qquad (12)$$

$$y_i(\phi(x_i)^T\beta + \beta_0) - (1 - \xi_i) \ge 0, \qquad (13)$$

for $i = 1, \dots, n$. Together, (7)-(13) characterize the solution to the primal and dual problems.

Hence we solve the dual problem

$$\max_{\alpha}\ \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i=1}^n \sum_{i'=1}^n \alpha_i \alpha_{i'} y_i y_{i'} \phi(x_i)^T \phi(x_{i'}) \quad \text{subject to} \quad \begin{cases} \sum_{i=1}^n y_i \alpha_i = 0, \\[2pt] 0 \le \alpha_i \le C, \ \forall i, \end{cases} \qquad (14)$$

and get the optimal $\hat\alpha_i$'s; from (7), $\hat\beta$ has the form

$$\hat\beta = \sum_{i=1}^n \hat\alpha_i y_i \phi(x_i), \qquad (15)$$

with nonzero coefficients $\hat\alpha_i$ only for those observations $i$ for which the constraints in (13) are exactly met (hold with equality), due to (11). These observations are called the support vectors; that is to say, an observation $x_i$ is called a support vector if its respective Lagrange multiplier $\hat\alpha_i > 0$. Any of these margin points ($0 < \hat\alpha_i$, $\hat\xi_i = 0$) can be used to solve for $\hat\beta_0$, and we typically use an average of all the solutions for numerical stability.
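Concretely, that averaging step might look as follows (a sketch of mine, not PKSVM's code; it takes the dual solution alpha, labels y, the kernel matrix K with K[i, j] = $\phi(x_i)^T\phi(x_j)$, and C, and uses the margin condition $y_i \hat f(x_i) = 1$ at points with $0 < \hat\alpha_i < C$):

    import numpy as np

    def estimate_beta0(alpha, y, K, C, eps=1e-8):
        # Margin points: 0 < alpha_i < C (strictly inside the box).
        on_margin = (alpha > eps) & (alpha < C - eps)
        sv = alpha > eps  # all support vectors
        # At a margin point i: y_i (sum_j alpha_j y_j K(x_j, x_i) + beta0) = 1,
        # so beta0 = y_i - sum_j alpha_j y_j K(x_j, x_i) (using y_i in {-1, +1}).
        f_wo_b = K[np.ix_(on_margin, sv)] @ (alpha[sv] * y[sv])
        return np.mean(y[on_margin] - f_wo_b)  # average for numerical stability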

After $\hat\beta$ and $\hat\beta_0$ are both obtained, the discriminant function is

$$\hat f(x) = \phi(x)^T \hat\beta + \hat\beta_0 = \sum_{i=1}^n \hat\alpha_i y_i \phi(x)^T \phi(x_i) + \hat\beta_0, \qquad (16)$$

and the classification rule is

$$\hat G(x) = \mathrm{sign}[\hat f(x)] = \mathrm{sign}\Big[\sum_{i=1}^n \hat\alpha_i y_i \phi(x)^T \phi(x_i) + \hat\beta_0\Big]. \qquad (17)$$

The tuning parameter of this procedure is the cost parameter $C$, which can be estimated by cross-validation. A large value of $C$ will discourage any positive $\xi_i$, and leads to an overfit wiggly boundary in the original feature space; a small value of $C$ will encourage a small value of $\|\beta\|$, which causes $f(x)$ and hence the boundary to be smoother.
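In practice such a cross-validation over $C$ can be run with any SVM implementation; a brief sketch using scikit-learn (not PKSVM; the data set and candidate grid here are arbitrary placeholders):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    # Search C on a log scale: small C -> smoother boundary, large C -> wigglier.
    search = GridSearchCV(SVC(kernel="rbf", gamma="scale"),
                          {"C": np.logspace(-2, 3, 6)}, cv=5)
    search.fit(X, y)
    print(search.best_params_)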

Since SVC just uses the sign of the decision function to assign the class, only the decision function is informative eventually. However, $x$ always appears in the pairwise form $\phi(x)^T\phi(x')$ in (17). Defining the kernel function $K(x, x')$ as $K(x, x') = \phi(x)^T\phi(x')$, (17) can be written as

$$\hat G(x) = \mathrm{sign}[\hat f(x)] = \mathrm{sign}\Big[\sum_{i=1}^n \hat\alpha_i y_i K(x, x_i) + \hat\beta_0\Big]. \qquad (18)$$

Instead of hand-choosing a feature function $\phi(x)$, Boser, Guyon and Vapnik (1992) propose to directly choose a kernel function $K(x, x')$ that represents a dot product $\phi(x)^T\phi(x')$ in some unspecified high dimensional space. For instance, any continuous decision boundary can be implemented using the Radial Basis Function (RBF) kernel $K(x, x') = e^{-\gamma\|x - x'\|^2}$.

Although many new kernels are being proposed by researchers, the following four basic kernels are used most widely:

- Linear: $K(x, x') = x^T x'$
- Polynomial: $K(x, x') = (\gamma\, x^T x' + r)^d$, $\gamma > 0$, $d \in \mathbb{N}$
- Radial Basis Function (RBF): $K(x, x') = e^{-\gamma\|x - x'\|^2}$, $\gamma > 0$
- Sigmoid: $K(x, x') = \tanh(\gamma\, x^T x' + r)$

where $\gamma$, $r$, and $d$ are kernel parameters.
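Written out in NumPy, the four kernels are one-liners (a sketch; production code would evaluate them on whole matrices at once):

    import numpy as np

    def linear(x, xp):
        return x @ xp

    def polynomial(x, xp, gamma=1.0, r=0.0, d=3):
        return (gamma * (x @ xp) + r) ** d

    def rbf(x, xp, gamma=1.0):
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def sigmoid(x, xp, gamma=1.0, r=0.0):
        return np.tanh(gamma * (x @ xp) + r)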

Defining a new $n \times n$ matrix $Q$ by $Q_{ij} = y_i y_j K(x_i, x_j)$, the QP (14) can be expressed more simply as

$$\min_{\alpha}\ g(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad \begin{cases} y^T \alpha = 0, \\[2pt] 0 \le \alpha_i \le C, \ \forall i, \end{cases} \qquad (19)$$

where $e$ is the vector of all ones.
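For small $n$, (19) can be handed directly to an off-the-shelf QP solver. A minimal sketch using the cvxopt package (a generic solver call, not PKSVM's algorithm; K is the precomputed kernel matrix and y the label vector):

    import numpy as np
    from cvxopt import matrix, solvers

    def solve_dual(K, y, C):
        # Solve (19): min 1/2 a^T Q a - e^T a  s.t.  y^T a = 0, 0 <= a_i <= C.
        n = len(y)
        Q = np.outer(y, y) * K                  # Q_ij = y_i y_j K(x_i, x_j)
        P, q = matrix(Q), matrix(-np.ones(n))
        G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -a <= 0 and a <= C
        h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
        A, b = matrix(y.astype(float).reshape(1, -1)), matrix(0.0)
        sol = solvers.qp(P, q, G, h, A, b)
        return np.array(sol["x"]).ravel()       # the alpha vector

Note that this builds the full dense $Q$ up front, which is precisely the limitation discussed next.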

Traditional optimization methods need to store the whole matrix $Q$ if they are used to solve (19). In most situations it is impossible to put the whole $Q$ into computer memory when the number of training examples $n$ is very large. Therefore, traditional algorithms are not suitable for SVC.

Fortunately, more powerful algorithms have been invented to solve (19), such as chunking, decomposition and sequential minimal optimization (SMO). SMO in particular has attracted a lot of attention since it was proposed by Platt (1998). All of them employ divide and conquer strategies to split the original big QP (19) into many much smaller subproblems, solve them iteratively, and finally obtain the solution to (19).

But the iterative computation is expensively time-consuming when $n$ is big. Joachims (1998) proposes two new techniques to reduce the cost of computation. The first one is shrinking.

6.1 Shrinking

For many problems the number of free support vectors ($0 < \alpha_i < C$) is small compared to the number of all support vectors ($0 < \alpha_i \le C$). The shrinking technique reduces the size of the problem by leaving some bounded support vectors ($\alpha_i = C$) out of consideration. When the iterative process approaches the end, only variables in a small set $A$, which we call the active set, are allowed to move, according to Theorem 5 in Fan et al. (2005). After shrinking, the decomposition method works on a smaller problem:

$$\min_{\alpha_A}\ \frac{1}{2}\alpha_A^T Q_{AA} \alpha_A - (e_A - Q_{AN}\alpha_N^k)^T \alpha_A \quad \text{subject to} \quad \begin{cases} y_A^T \alpha_A = -y_N^T \alpha_N^k, \\[2pt] 0 \le (\alpha_A)_i \le C, \ i = 1, \dots, |A|, \end{cases} \qquad (20)$$

where $N$ is the set of shrunken (inactive) variables and $\alpha_N^k$ is determined by the $(k-1)$th iteration.

Although Theorem 5 in Fan et al. (2005) tells us the active set $A$ exists, it does not tell when $A$ will be obtained. Hence the previous heuristic shrinking may fail, namely when the optimal solution of the subproblem (20) is not the corresponding part of that of (19). If a failure happens, the whole QP (19) will be reoptimized with initial values where $\alpha_A$ is an optimal solution of (20) and $\alpha_N$ are the bounded variables identified in the previous subproblem.

Before presenting the shrinking details, it is necessary to supply some definitions.

Define

$$I_{up}(\alpha) \equiv \{i \mid \alpha_i < C, y_i = 1 \ \text{or} \ \alpha_i > 0, y_i = -1\},$$
$$I_{low}(\alpha) \equiv \{i \mid \alpha_i < C, y_i = -1 \ \text{or} \ \alpha_i > 0, y_i = 1\}; \qquad (21)$$

$$m(\alpha) \equiv \max_{i \in I_{up}(\alpha)} \{-y_i \nabla g(\alpha)_i\}, \quad M(\alpha) \equiv \min_{i \in I_{low}(\alpha)} \{-y_i \nabla g(\alpha)_i\}, \qquad (22)$$

where $\nabla g(\alpha)$ is the gradient of $g(\alpha)$ in (19), that is, $\nabla g(\alpha) = Q\alpha - e$. Then $\alpha$ is an optimal solution to (19) if and only if

$$m(\alpha) \le M(\alpha). \qquad (23)$$

In practice we can stop the iteration if

$$m(\alpha^k) - M(\alpha^k) \le \epsilon_{tol}, \qquad (24)$$

where $\epsilon_{tol}$ is a small positive value which indicates that the KKT conditions are obeyed within $\epsilon_{tol}$. In PKSVM, $\epsilon_{tol} = 10^{-3}$ by default.
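In code, the optimality measures (22) and the stopping rule (24) amount to a few masked array reductions. A sketch (my own illustration, not PKSVM's source; Q and alpha as in (19)):

    import numpy as np

    def should_stop(alpha, y, Q, C, tol=1e-3):
        grad = Q @ alpha - 1.0           # gradient of g: Q alpha - e
        score = -y * grad                # the quantity -y_i * grad_i from (22)
        i_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        i_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        m, M = score[i_up].max(), score[i_low].min()
        return m - M <= tol              # stopping rule (24)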

The following shrinking procedure is from LIBSVM, which is one of the most popular SVM software packages at present. More details are available in Chang and Lin (2001).

1. Some bounded variables will be shrunken after every $\min(n, 1000)$ iterations. Since the KKT conditions are not yet satisfied within $\epsilon_{tol}$ during the iterative process, (24) is not yet obeyed, that is,

$$m(\alpha^k) > M(\alpha^k). \qquad (25)$$

Following Theorem 5 in Fan et al. (2005), the variables in the following set may be shrunken:

$$\{i \mid -y_i \nabla g(\alpha^k)_i > m(\alpha^k), \ i \in I_{low}(\alpha^k), \ \alpha_i^k = C \ \text{or} \ 0\}$$
$$\cup \ \{i \mid -y_i \nabla g(\alpha^k)_i < M(\alpha^k), \ i \in I_{up}(\alpha^k), \ \alpha_i^k = C \ \text{or} \ 0\}$$
$$= \ \{i \mid -y_i \nabla g(\alpha^k)_i > m(\alpha^k), \ \alpha_i^k = C, y_i = 1 \ \text{or} \ \alpha_i^k = 0, y_i = -1\}$$
$$\cup \ \{i \mid -y_i \nabla g(\alpha^k)_i < M(\alpha^k), \ \alpha_i^k = 0, y_i = 1 \ \text{or} \ \alpha_i^k = C, y_i = -1\}. \qquad (26)$$

Hence the active set $A$ is dynamically reduced every $\min(n, 1000)$ iterations.
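Continuing the sketch above, the shrinkable set (26) can be expressed with the same boolean masks (again illustrative only, with m and M computed as in should_stop):

    import numpy as np

    def shrinkable(alpha, y, Q, C, m, M):
        score = -y * (Q @ alpha - 1.0)
        bounded = (alpha == 0) | (alpha == C)
        i_up = ((alpha < C) & (y == 1)) | ((alpha > 0) & (y == -1))
        i_low = ((alpha < C) & (y == -1)) | ((alpha > 0) & (y == 1))
        # Eq. (26): bounded variables whose scores lie outside [M, m].
        return (bounded & i_low & (score > m)) | (bounded & i_up & (score < M))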


2. Since the previous shrinking strategy may fail, and many iterations are spent obtaining the final digit of the required accuracy, we do not want those iterations to be wasted on solving a wrongly shrunken subproblem (20). Thus once the iterate attains the tolerance

$$m(\alpha^k) \le M(\alpha^k) + 10\,\epsilon_{tol}, \qquad (27)$$

we reconstruct the whole gradient $\nabla g(\alpha)$. After reconstruction, we shrink some bounded variables based on the same rule as in step 1, and the iterations continue.

The other useful technique proposed by Joachims (1998) for saving computational time is called caching. To illustrate why the caching technique is necessary, some analysis of computational complexity will first be presented.

6.2 Computational complexity

Most of the time in each iteration is spent on the kernel evaluations needed to compute $q$ rows of $Q$, where $q$ depends on the decomposition method; for SMO, $q = 2$. This step has a time complexity of $O(npq)$. Using the stored rows of $Q$, updating the gradient $\nabla g(\alpha^k)$ is done in time $O(nq)$. Setting up the QP subproblem requires $O(nq)$ as well. The selection of the next working set, which includes computing the gradient, can be done in $O(nq)$.

6.3 Caching

As illustrated in the last subsection, the most expensive step in each iteration is the kernel evaluations needed to compute $q$ rows of $Q$. Near the end of the iterations, the eventual support vectors enter the working set multiple times. To avoid recomputing these rows of $Q$, caching is useful for reducing the computational cost.

Since $Q$ is fully dense and may not fit into computer memory completely, usually a special storage scheme using the idea of a cache is utilized to store recently used entries $Q_{ij}$.

Just as in SVM$^{light}$ and LIBSVM, a simple least-recently-used (LRU) caching strategy is implemented in PKSVM. When the cache does not have enough room for a new row, the row which has not been used for the greatest number of iterations is eliminated from the cache.
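Such an LRU row cache is only a few lines in Python; here is a toy sketch (real implementations like LIBSVM's use preallocated C buffers instead):

    from collections import OrderedDict

    class RowCache:
        def __init__(self, max_rows, compute_row):
            self.max_rows = max_rows
            self.compute_row = compute_row   # function: i -> i-th row of Q
            self.rows = OrderedDict()

        def get(self, i):
            if i in self.rows:
                self.rows.move_to_end(i)     # mark row i as most recently used
            else:
                if len(self.rows) >= self.max_rows:
                    self.rows.popitem(last=False)  # evict least recently used
                self.rows[i] = self.compute_row(i)
            return self.rows[i]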

Rows of $Q$ are computed on demand and cached in PKSVM. Once shrinking occurs, we simply clean up the whole cache and re-cache the shrunken rows.

Conclusions

This article presents a simple summary of the Support Vector Machine for classification problems. Nevertheless, it includes most of the state-of-the-art techniques which make SVM practical for large-scale problems, such as shrinking and caching. Of course some other useful methods, e.g., working set selection, have to be skipped because of space restrictions.

SVM has grown into a big family thanks to thousands of excellent research papers in the last ten years. It has become one of the most powerful and popular tools in machine learning. Although the whole article is dedicated to C-SVC, $\nu$-SVC, $\epsilon$-SVR, $\nu$-SVR and some other generalizations of SVM share most of the techniques mentioned here. The difference between them is small, except that the dual optimization problems differ formally. Thus one can easily consult many textbooks and papers for more details about them if necessary.

References

[1] Naiyang Deng and Yingjie Tian. A New Method in Data Mining: Support Vector Machine. Science Press, 2004.

[2] Nils J. Nilsson. Learning Machines: Foundations of Trainable Pattern-Classifying Systems. McGraw-Hill, 1965.

[3] Vladimir N. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24:774-780, 1963.

[4] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pages 144-152, Pittsburgh, PA, July 1992. ACM Press.

[5] Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning, 20:273-297, 1995.

[6] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[7] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, Cambridge, MA, 1998. MIT Press.

[8] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm

[9] C.-W. Hsu, C.-C. Chang, and C.-J. Lin. A practical guide to support vector classification, July 2003.

[10] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training SVM. Journal of Machine Learning Research, 6:1889-1918, 2005. URL http://www.csie.ntu.edu.tw/~cjlin/papers/quadworkset.pdf
