
Parallel Multiclass Classification using SVMs on GPUs

Sergio Herrero-Lopez
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA, USA
sherrero@mit.edu

John R. Williams
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA, USA
jrw@mit.edu

Abel Sanchez
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA, USA
doval@mit.edu
ABSTRACT
The scaling of serial algorithms can no longer rely on the improvement of CPUs. The performance of classical Support Vector Machine (SVM) implementations has reached its limit, and the arrival of the multi-core era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high performance platforms on which to implement data parallel algorithms. In this paper, we describe how a naive implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical multiclass solver, LIBSVM, while guaranteeing the same accuracy.
Categories and Subject Descriptors
D.2 [Programming Techniques]: Concurrent Programming
General Terms
Parallel Programming
Keywords
Support Vector Machine, GPU
1. INTRODUCTION.
Since the semiconductor industry revealed that physical constraints were imposing an unbeatable upper frequency limit for processors, the computing market has been continuously supplied with multi-core, multi-processor, and most recently GPU and multi-GPU architectures. This shift to parallel architectures was initially ignored, but the demand for performance and scalability has eventually required adapting not only new algorithms, but also old methods, to this new
panorama. Unfortunately, not all algorithms see their performance enhanced by parallel architectures, but those that do, often called data parallel algorithms [11], can benefit greatly from their adaptation to this new scenario.
The Support Vector Machine (SVM) [28] is a learning algorithm that can be conveniently adapted to parallel architectures. Its success in solving classification tasks in a wide variety of fields, such as text or image processing and medical informatics, has stimulated research not only on its generalization performance, but also on its execution performance and scalability. Nevertheless, the training phase of an SVM is a computationally expensive task. Training a binary task composed of 100,000 points with hundreds of dimensions can often take on the order of hours of serial execution.
The efforts to reduce training time have been numerous and effective. Osuna et al. presented a decomposition approach that enabled tackling larger problems by solving subproblems iteratively [24]. Joachims introduced additional techniques such as shrinking and kernel caching that are common practice today [15]. Platt presented the Sequential Minimal Optimization (SMO) algorithm, which decomposed the Quadratic Programming (QP) problem into a series of smaller QP subproblems that were solvable analytically without the need for time-consuming QP optimization [25]. More recently, Fan et al. developed a series of working set selection heuristics that lead the SMO algorithm to faster convergence [7]. The combination of these contributions has enabled fast serial SVM implementations.
Nevertheless, efforts have not only been focused on the development of techniques to accelerate serial SVMs; there are several initiatives that seek to achieve performance gains by either parallelizing the algorithm to make it fit new architectures (PSVMs), distributing the classification problem across nodes in a cluster (DSVMs), or a hybrid of both (DPSVMs). Cao et al. present a parallel version of the SMO algorithm that divides the large dataset into smaller subsets in which Platt's SMO is applied [3]. Considerable performance gains are reported when executing this approach on multiprocessor machines that use MPI for communication. Similarly, Zanni et al. explored an iterative decomposition technique that uses both the storage and computing resources available in multiprocessor systems [29]. Graf et al. designed the Cascade SVM, an algorithm that distributed the classification task over a cascade top-down network topology [8]. This was the first initiative to include network topologies in DSVM problems. Finally, Lu et al. present their DPSVM approach, which follows the basic principles of the Cascade SVM but generalizes the distributed classification problem to general strongly connected network topologies and provides a convergence proof for them [20].
There is little work on how SVM algorithms map to GPU architectures. The computational, memory and communication possibilities of new multicore hardware seem to provide a convenient platform for the execution of an entire range of SVM alternatives, from simple individual SVMs to complex DPSVM network configurations. Catanzaro et al. [4] presented a pioneering implementation of a binary SVM on a Graphics Processing Unit (GPU), and reported promising speedups both for training (5-32x) and classification (81-138x) compared to LIBSVM [5]. Do et al. [6] propose a Least Squares SVM (LS-SVM) implementation for graphics processors that is substantially faster than some configurations of LIBSVM on some specific datasets. In this paper, we continue the research direction initiated by Catanzaro et al. and implement a multiclass SVM classifier on a GPU. This work shows that the resources and degrees of parallelism provided by the latest multicore hardware can be conveniently exploited to train large scale multiclass classification tasks. To the best of the authors' knowledge, this multiclass SVM implementation is unique in the sense that it not only provides speedups similar to the best existing binary SVMs on GPUs, but also executes multiple cooperative binary SVM instances concurrently, which is advantageous compared to the sequential execution of tasks carried out by LIBSVM, GPU binary SVMs and other popular solvers. Although there are a variety of methods to solve the SVM training problem, the SMO algorithm was chosen for its popularity and, more importantly, because it allows defining data reusability patterns that can be exploited by the GPU.
The organization of this paper is as follows. Section 2 briefly introduces the SVM training and classification problem for binary tasks and for problems with multiple classes. Section 3 gives an overview of the GPU architecture and programming model. Section 4 describes the Parallel-Parallel SMO (P2SMO) algorithm and the motivation to execute binary tasks concurrently, along with the details of our implementation. The performance results of our multiclass classifier are compared with LIBSVM in Section 5. Section 6 gives the conclusions of this project and Section 7 discusses future work.
2. MULTICLASS CLASSIFICATION.
This section succinctly reviews the basic principles of soft-margin binary SVM classification and its combination to solve multiclass classification problems.
2.1 Binary SVM.
The binary classification problem is defined as finding the classification function that solves the following regularized learning problem: given l examples (x_1, y_1), ..., (x_l, y_l) with x_i \in R^n and y_i \in \{-1, 1\} \forall i, where the regularization is controlled via C,

    \min_{f \in H} \; C \sum_{i=1}^{l} V(y_i, f(x_i)) + \frac{1}{2} \|f\|_k^2    (1)
The classical SVM arises by considering a specific loss function:

    V(y_i, f(x_i)) = (1 - y_i f(x_i))_+    (2)

where (k)_+ = \max(k, 0). Slack variables \xi_i are then introduced to overcome the problem introduced by its non-differentiability:

    \min_{f \in H} \; C \sum_{i=1}^{l} \xi_i + \frac{1}{2} \|f\|_k^2    (3)
subject to: y_i f(x_i) \geq 1 - \xi_i and \xi_i \geq 0, i = 1, ..., l. The dual form of this problem is given by:

    \max_{\alpha \in R^l} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \alpha^T K \alpha    (4)

subject to: \sum_{i=1}^{l} y_i \alpha_i = 0 and 0 \leq \alpha_i \leq C, i = 1, ..., l, where K_{ij} = y_i y_j k(x_i, x_j) and k(\cdot, \cdot) is a kernel function. Equation 4 is a quadratic programming optimization problem. Common kernel functions are shown in Table 1. Solving equation 4 defines the classification function:

    f(x) = \sum_{i=1}^{l} y_i \alpha_i k(x, x_i) + b    (5)

where b is an unregularized bias term.
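To make equation 5 concrete, the following minimal CUDA sketch (our own illustration, not part of the original implementation) evaluates f(z) for a linear kernel by assigning one thread per training point and accumulating the partial products with an atomic addition. The row-major layout of X, the launch configuration and the use of atomicAdd on floats (which assumes a device of compute capability 2.0 or newer) are assumptions of ours.

// Sketch: evaluate f(z) = sum_i y_i alpha_i <z, x_i> + b for a linear kernel.
// X is assumed row-major of size l x n; *f must be initialized to b on the host.
__global__ void decisionFunctionLinear(const float* X, const float* z,
                                       const float* y, const float* alpha,
                                       int l, int n, float* f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= l) return;
    float dot = 0.0f;
    for (int d = 0; d < n; ++d)
        dot += X[i * n + d] * z[d];           // <z, x_i>
    atomicAdd(f, y[i] * alpha[i] * dot);      // accumulate one term of the sum
}
// Host launch (hypothetical sizes):
//   int threads = 256, blocks = (l + threads - 1) / threads;
//   decisionFunctionLinear<<<blocks, threads>>>(dX, dz, dy, dalpha, l, n, df);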
2.2 Multiclass SVM.
In multiclass classification, given l examples (x_1, y_1), ..., (x_l, y_l) with x_i \in R^n, y_i \in Y \forall i and Y = \{1, ..., M\}, the goal is to design a classifier that predicts the label of new unseen samples. Classical approaches construct the multiclass classifier as the combination of N independent binary classification tasks. Binary tasks are defined in the output code matrix R of size M x N, where M is the number of classes and N is the number of tasks, and R_{ij} \in \{-1, 0, 1\}. Each column in R represents the way original labels are translated into binary labels for each specific binary task. Then each f^k(x) is trained separately with (x_1, R_{y_1 k}), ..., (x_l, R_{y_l k}), where k = 1, ..., N. The outputs of the trained binary classifiers f^k(x) are used to predict the class label that best agrees with the binary predictions:

    y = \arg\max_{y \in Y} \left\{ \sum_{k=1}^{N} R_{yk} f^k(x) \right\}    (6)
In general, predictions can be derived from output codes by specifying a loss function:

    y = \arg\min_{y \in Y} \left\{ \sum_{k=1}^{N} Loss(R_{yk} f^k(x)) \right\}    (7)
The common types of output codes are:
1. One-vs-All (OVA): M classifiers are needed. For the f^i(x) classifier, the positive examples are all the points in class i, and the negative examples all the points not in class i.
Table 1: Kernel Functions
    Linear:        k(x_i, x_j) = x_i \cdot x_j
    Polynomial:    k(x_i, x_j) = (a \, x_i \cdot x_j + b)^d
    Radial Basis:  k(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}
    Sigmoid:       k(x_i, x_j) = \tanh(a \, x_i \cdot x_j + b)
2. All-vs-All (AVA): \binom{M}{2} classifiers are needed, one classifier to distinguish each pair of classes i and j. f^{ij}(x) is the classifier where class i provides the positive samples and class j the negative ones.
3. Error correcting codes: Often error correcting codes
are applied to reconstruct labels from noisy predicted
binary labels.
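As a small illustration of how output codes are used in practice, the host-side sketch below (our own example, not the paper's code) builds the one-vs-all code matrix R and decodes a prediction according to equation 6; the class count and the scores f^k(x) are hypothetical inputs.

#include <vector>

// One-vs-All output codes: R is M x N with N = M and R[y][k] = +1 if y == k, else -1.
std::vector<std::vector<int>> buildOVACodes(int M) {
    std::vector<std::vector<int>> R(M, std::vector<int>(M, -1));
    for (int k = 0; k < M; ++k) R[k][k] = +1;
    return R;
}

// Decode a label from the N binary outputs f[k] = f^k(x), following equation 6.
int predictLabel(const std::vector<std::vector<int>>& R, const std::vector<float>& f) {
    int best = 0;
    float bestScore = -1e30f;
    for (int y = 0; y < (int)R.size(); ++y) {
        float score = 0.0f;
        for (int k = 0; k < (int)f.size(); ++k)
            score += R[y][k] * f[k];            // R_{yk} * f^k(x)
        if (score > bestScore) { bestScore = score; best = y; }
    }
    return best;
}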
3. GENERAL PURPOSE GPU.
The popularity and relatively low price of graphics processors have motivated many programmers to use GPUs for scientific computing. Even though they were designed specifically for triangle rasterization, today they have evolved to serve general purpose computation needs. Since NVIDIA released the Compute Unified Device Architecture (CUDA) [22] in 2007, parallel programs have been developed for a variety of different applications, including fluid dynamics, finance and imaging. The key to CUDA's success lies in three abstractions that can be integrated using extensions to conventional C code [21]: (1) the hierarchy of thread groups, (2) shared memory, and (3) barrier synchronization. The combination of multiple levels of threads, a memory hierarchy and synchronization mechanisms enables fine-grained data parallelism, which can be conveniently interleaved with coarse-grained data parallelism and task parallelism. Nevertheless, GPUs do not speed up all possible applications. Algorithms need to explicitly express parallelism through the execution of thousands of threads, so that the available resources in the GPU are efficiently occupied. Fortunately, machine learning algorithms are typically composed of primitives that are highly parallelizable [27]: (1) inner products, (2) outer products, (3) linear algebra, (4) the application of non-linearities to vectors or matrices, and (5) matrix transposes.
In November 2006, NVIDIA presented the Tesla architecture, a massively multithreaded processor array capable of concurrently executing tens of thousands of threads [19]. This architecture was designed for computation rather than control flow and caching, and one of its state-of-the-art devices reports almost a teraflop of processing power, over an order of magnitude more than the latest CPUs existing today. It has been shown in other research fields that Tesla's high throughput in floating point computation, along with abundant hierarchical memory and fast memory bandwidth for communication, has yielded promising acceleration results. Next, the CUDA programming model and the memory model are briefly introduced.
3.1 CUDA Programming Model.
The CUDA programming model is based on a logical representation composed of three elements: grids, blocks and threads. This logical representation is determined by the user, and CUDA maps it to the real hardware representation underneath. This separation between the logical and physical representations allows algorithms to scale as GPUs augment their capabilities. As a first step, it is the programmer's task to adapt the algorithm to a 2D grid structure. Grid executions, known as kernels, are sequentially invoked by the host. Grids are composed of blocks, which are groups of threads that share local memory and can be synchronized using barriers. Similarly, threads in a block are organized in 3D. The logical representation is illustrated in figure 1.

Figure 1: Logical Representation
Figure 2: Memory Model
The maximum size of each dimension of a block is (512, 512, 64), but the maximum number of threads in a block cannot exceed 512. The maximum size of each dimension of a grid is (65535, 65535). The blocks in a grid are launched in parallel, which allows a large number of threads to be executed concurrently. The number of threads that run simultaneously on a block, called a warp, and the number of blocks that run simultaneously on a grid are hardware-specific and depend on the number of Stream Multiprocessors (SMs) and Stream Processors (SPs) available in the device. The number of SMs and SPs increases with every hardware generation. Consequently, developers need to find a partitioning of the data that occupies the maximum number of blocks possible, and hence utilizes hardware resources uninterruptedly, in order to get maximum acceleration of the algorithm.
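As an illustration of this mapping, the hedged CUDA sketch below launches a 2D grid in which blockIdx.x selects a data partition and blockIdx.y selects a binary task, with a one-dimensional block of threads striding over the samples of the partition; the contiguous data layout and the per-sample work are placeholders of our own, not the paper's kernels.

// Sketch: 2D grid (P partitions x N tasks), 1D blocks of threads.
// Each thread strides over the samples of its partition (placeholder work).
__global__ void perPartitionKernel(float* data, int samplesPerPartition) {
    int p = blockIdx.x;     // data partition handled by this block
    int k = blockIdx.y;     // binary task handled by this block
    int taskOffset = k * gridDim.x * samplesPerPartition;   // hypothetical layout
    for (int i = threadIdx.x; i < samplesPerPartition; i += blockDim.x) {
        data[taskOffset + p * samplesPerPartition + i] += 1.0f;  // placeholder
    }
}
// Host launch: a grid of P x N blocks with 128 threads each stays well within
// the 512-thread and 65535 x 65535 grid limits quoted above.
//   dim3 grid(P, N); dim3 block(128);
//   perPartitionKernel<<<grid, block>>>(d_data, samplesPerPartition);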
3.2 CUDA Memory Model.
CUDA provides a hierarchy of memories that differ in their accessibility, operability and speed: (1) Registers: the smallest but fastest memory available, accessible only at the thread level with read/write operations. (2) Shared memory: can be as fast as registers; it is shared by all the threads in a block and allows read/write operations. (3) Device memory: the largest but slowest memory available; accessible by all the threads executed in the grid, and allows read/write operations. (4) Constant/texture memory: faster than device memory and similarly accessible at the grid level, but read only. These memories are illustrated in figure 2.
The latest generations of GPUs provide 102 GB/s of memory bandwidth on the GPU and 8 GB/s for communication with the CPU via the PCI-Express bus.
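In practice these levels are combined by staging per-thread values held in registers into shared memory, synchronizing the block, and writing one result per block to device memory. The block-sum sketch below is a minimal example of ours and assumes a power-of-two block size.

// Sketch: per-block sum reduction combining registers, shared memory and a barrier.
// One partial sum per block is written to device (global) memory.
__global__ void blockSum(const float* in, float* blockSums, int nelems) {
    extern __shared__ float sdata[];                // shared memory, one float per thread
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    float v = (i < nelems) ? in[i] : 0.0f;          // per-thread value in a register
    sdata[tid] = v;
    __syncthreads();                                // barrier: shared memory is populated
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // assumes blockDim.x is a power of two
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = sdata[0]; // device memory
}
// Launch: blockSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_partial, n);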
4. MULTICLASS SVM IMPLEMENTATION.
Each of the N binary tasks is trained using the Sequential Minimal Optimization (SMO) algorithm, which solves equation 4. SMO solves very large quadratic programming (QP) optimization problems by breaking them into a series of smaller QP subproblems. These QP subproblems can be solved analytically without the need for numerical optimization. In order to map the training of the N binary tasks to the CUDA programming model, Cao's Parallel SMO (PSMO) [3] algorithm was conveniently adapted and the execution of N instances of it was intertwined. This new algorithm was named Parallel-Parallel SMO (P2SMO). In this section, the P2SMO algorithm is described. Next, its mapping to grids, blocks and threads is explained. Finally, the implications of executing N PSMO instances concurrently and cooperatively are analyzed.
4.1 Parallel-Parallel SMO Algorithm.
Cao et al. designed the PSMO algorithm aiming to accelerate the binary SVM training time by partitioning the algorithm across multiple processors. Their experiments reported considerable speedups while maintaining the accuracy of the sequential SMO. Multiple intertwined instances of the PSMO algorithm running cooperatively lead to this new algorithm, denominated Parallel-Parallel SMO (P2SMO). Next, this algorithm is introduced. Since N binary tasks need to be executed, the correspondence between an instance of the PSMO algorithm and a binary task is represented by the superscript k, where k = 1, ..., N. Given P processing units per binary task, the sample dataset l is partitioned into P subsets and one subset is given to each processing unit. The subsets are represented by \{l^p\}, p = 1, ..., P, where \sum_{p=1}^{P} l^p = l. Each of the P subsets of every task k has its own set of variables, based on the filters I^k_0, I^k_1, I^k_2, I^k_3 and I^k_4. We use the following notation:
    I^k_0 = \{ i : y^k_i = 1, 0 < \alpha^k_i < C^k \} \cup \{ i : y^k_i = -1, 0 < \alpha^k_i < C^k \}
    I^k_1 = \{ i : y^k_i = 1, \alpha^k_i = 0 \}
    I^k_2 = \{ i : y^k_i = -1, \alpha^k_i = C^k \}
    I^k_3 = \{ i : y^k_i = 1, \alpha^k_i = C^k \}
    I^k_4 = \{ i : y^k_i = -1, \alpha^k_i = 0 \}    (8)
    f^{p,k}_i = \sum_{j=1}^{l} \alpha^k_j y^k_j k(x_j, x_i) - y^k_i    (9)
    b^{p,k}_{up} = \min \{ f^{p,k}_i : i \in (I^k_0 \cup I^k_1 \cup I^k_2) \cap l^p \},   I^{p,k}_{up} = \arg\min_i f^{p,k}_i
    b^{p,k}_{low} = \max \{ f^{p,k}_i : i \in (I^k_0 \cup I^k_3 \cup I^k_4) \cap l^p \},   I^{p,k}_{low} = \arg\max_i f^{p,k}_i    (10)
Global variables representing the entire dataset can be obtained from the subset variables:

    b^k_{up} = \min_p \{ b^{p,k}_{up} \},   I^k_{up} = the I^{p,k}_{up} attaining b^k_{up}
    b^k_{low} = \max_p \{ b^{p,k}_{low} \},   I^k_{low} = the I^{p,k}_{low} attaining b^k_{low}    (11)
I^k_{up} and I^k_{low} are the indices of the two weights \alpha^k_i of the smallest QP subproblem, which can be solved analytically:

    \alpha^{new,k}_{I^k_{up}} = \alpha^{old,k}_{I^k_{up}} - \frac{ y^k_{I^k_{up}} \left( f^{old,k}_{I^k_{low}} - f^{old,k}_{I^k_{up}} \right) }{ \eta^k }
    \alpha^{new,k}_{I^k_{low}} = \alpha^{old,k}_{I^k_{low}} + s^k \left( \alpha^{old,k}_{I^k_{up}} - \alpha^{new,k}_{I^k_{up}} \right)    (12)
where

    s^k = y^k_{I^k_{up}} \, y^k_{I^k_{low}}
    \eta^k = 2 k( x_{I^k_{low}}, x_{I^k_{up}} ) - k( x_{I^k_{low}}, x_{I^k_{low}} ) - k( x_{I^k_{up}}, x_{I^k_{up}} )    (13)

\eta^k needs three kernel evaluations: k_{I^k_{low}, I^k_{up}} = k( x_{I^k_{low}}, x_{I^k_{up}} ), k_{I^k_{low}, I^k_{low}} = k( x_{I^k_{low}}, x_{I^k_{low}} ) and k_{I^k_{up}, I^k_{up}} = k( x_{I^k_{up}}, x_{I^k_{up}} ). These values are either calculated on demand or retrieved from a common cache \tilde{K} in which the N concurrent tasks cooperate to share kernel evaluations. This is explored in section 5.4. \alpha^{new,k}_{I^k_{up}} and \alpha^{new,k}_{I^k_{low}} need to be clipped to [0, C^k]. After optimizing the weights, the error on the i-th data pattern, f^{p,k}_i, needs to be updated:
    f^{p,new,k}_i = f^{p,old,k}_i + \left( \alpha^{new,k}_{I^k_{low}} - \alpha^{old,k}_{I^k_{low}} \right) y^k_{I^k_{low}} k( x_{I^k_{low}}, x_i ) + \left( \alpha^{new,k}_{I^k_{up}} - \alpha^{old,k}_{I^k_{up}} \right) y^k_{I^k_{up}} k( x_{I^k_{up}}, x_i )    (14)
Finally, the offset for each task k is calculated:

    b^k = \frac{ b^k_{up} + b^k_{low} }{ 2 }    (15)
The iterative algorithm is summarized in algorithm 1:
Algorithm 1 Parallel-Parallel SMO
Require: l samples (x_1, y_1), ..., (x_l, y_l) with x_i \in R^n and y_i \in \{-1, 1\} \forall i.
Ensure: \sum_{i=1}^{l} y^k_i \alpha^k_i = 0 and 0 \leq \alpha^k_i \leq C^k, i = 1, ..., l, k = 1, ..., N.
1:  initialize: \alpha^k_i = 0, f^{p,k}_i = -y^k_i, \forall i \in l^p, p = 1, ..., P, k = 1, ..., N
2:  calculate: b^{p,k}_{up}, I^{p,k}_{up}, b^{p,k}_{low}, I^{p,k}_{low}, p = 1, ..., P, k = 1, ..., N
3:  obtain: b^k_{up}, I^k_{up}, b^k_{low}, I^k_{low}, k = 1, ..., N
4:  while b^k_{low} > b^k_{up} + 2\tau, k = 1, ..., N do
5:    if k_{I^k_{low}, I^k_{up}} \in \tilde{K} then
6:      retrieve: k_{I^k_{low}, I^k_{up}}
7:    else
8:      compute: k_{I^k_{low}, I^k_{up}}
9:    end if
10:   if k_{I^k_{low}, I^k_{low}} \in \tilde{K} then
11:     retrieve: k_{I^k_{low}, I^k_{low}}
12:   else
13:     compute: k_{I^k_{low}, I^k_{low}}
14:   end if
15:   if k_{I^k_{up}, I^k_{up}} \in \tilde{K} then
16:     retrieve: k_{I^k_{up}, I^k_{up}}
17:   else
18:     compute: k_{I^k_{up}, I^k_{up}}
19:   end if
20:   optimize: \alpha^k_{I^k_{up}}, \alpha^k_{I^k_{low}}
21:   update: f^{p,k}_i, p = 1, ..., P
22:   update: b^{p,k}_{up}, I^{p,k}_{up}, b^{p,k}_{low}, I^{p,k}_{low}, p = 1, ..., P
23:   obtain: b^k_{up}, I^k_{up}, b^k_{low}, I^k_{low}
24: end while
25: return \alpha^k_i, i = 1, ..., l, k = 1, ..., N.
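For clarity, the analytic optimization of the selected pair (step 20, equations 12 and 13, followed by the clipping to [0, C^k] described above) can be written as the small host-side helper below. It is a sketch of the arithmetic only, with our own variable names, and uses the simple clipping stated in the text rather than the full box-constraint handling found in other SMO descriptions.

// Sketch of the analytic pair update (equations 12-13) followed by clipping.
// k_ll, k_uu, k_lu are the three kernel evaluations; f_low, f_up the current
// errors; y_low, y_up the binary labels; C the regularization parameter.
void optimizePair(float y_up, float y_low, float f_up, float f_low,
                  float k_uu, float k_ll, float k_lu, float C,
                  float& alpha_up, float& alpha_low) {
    float s   = y_up * y_low;                      // equation 13
    float eta = 2.0f * k_lu - k_ll - k_uu;         // equation 13 (eta <= 0)
    float alpha_up_old  = alpha_up;
    float alpha_low_old = alpha_low;
    alpha_up = alpha_up_old - y_up * (f_low - f_up) / eta;   // equation 12
    if (alpha_up < 0.0f) alpha_up = 0.0f;                    // clip to [0, C]
    if (alpha_up > C)    alpha_up = C;
    alpha_low = alpha_low_old + s * (alpha_up_old - alpha_up);
    if (alpha_low < 0.0f) alpha_low = 0.0f;
    if (alpha_low > C)    alpha_low = C;
}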
4.2 P2SMO Implementation on GPU.
There is a direct mapping between the execution of N PSMO algorithm instances and the grid structure defined by the CUDA programming model. The training set is divided into P subsets. Given these P subsets and N binary tasks, a grid composed of PxN blocks can be used to execute the most computationally expensive steps of the training phase. Each block p in P processes a subset l^p of the training samples. Each of the threads in a block manipulates more than one sample i. Threads within a block are organized in a single dimension. The vertical dimension of the grid indicates the task k that is processed by the block. Blocks in the same column of the grid share the same training samples, but since they belong to different tasks they have different binary labels. Rows of the grid represent an instance of the PSMO algorithm. Therefore, the intertwined execution of the rows in the grid results in the P2SMO algorithm.
A single instance of the PSMO algorithm is illustrated in figure 3.

Figure 3: Parallel SMO for a single binary task

The computation of the subset variables is carried out by executing P blocks in parallel on the GPU. The computation of the max, min, argmax and argmin values, called reduction, follows a highly optimized parallel reduction implementation in CUDA [9]. Peak performance of each parallel reduction instance is observed for a maximum of 64 blocks and a maximum of 128 threads per block. Consequently, P is set to 64. Given that the number of subset variables is small, calculating the global variables on the host CPU results in better performance than calculating them on the GPU. The global variables are then used to compute the new values of the weights \alpha^k_{I^k_{up}}, \alpha^k_{I^k_{low}}, k = 1, ..., N. Since this step may involve matrix
operations (SGEMV and SGEMM routines), it needs to be carried out on the GPU. This is the meeting point where all the concurrent tasks in the P2SMO algorithm cooperate in the calculation of the kernel values. After the weights have been updated, PxN blocks are again executed to calculate the new f^{p,k}_i values as part of the update routine. Finally, the stopping criterion is checked to determine whether individual PSMO instances have converged. If a task has converged, the corresponding row of the grid is eliminated and the reduced grid proceeds to a new iteration of the P2SMO algorithm.
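A hedged sketch of the reduction stage is shown below: each block of a PxN grid computes the minimum f value and its index over one subset of one task, corresponding to b^{p,k}_up and I^{p,k}_up. It is a simplified stand-in for the optimized reduction of [9]; the index-set filters are omitted for brevity, the block size is assumed to be a power of two, and the data layout is our own.

#include <cfloat>   // FLT_MAX

// Sketch: per-block min/argmin of f over subset p of task k. Grid is (P, N),
// block is 1D. Shared memory holds one float and one int per thread.
__global__ void reduceBup(const float* f,          // f[k * l + i], one row per task
                          int l, int subsetSize,
                          float* bUp, int* iUp) {  // one entry per (p, k) block
    extern __shared__ float sval[];
    int* sidx = (int*)&sval[blockDim.x];
    int p = blockIdx.x, k = blockIdx.y, tid = threadIdx.x;
    int begin = p * subsetSize;
    int end = min(begin + subsetSize, l);
    float best = FLT_MAX; int bestIdx = -1;
    for (int i = begin + tid; i < end; i += blockDim.x) {
        float v = f[k * l + i];
        if (v < best) { best = v; bestIdx = i; }
    }
    sval[tid] = best; sidx[tid] = bestIdx;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // power-of-two block assumed
        if (tid < s && sval[tid + s] < sval[tid]) {
            sval[tid] = sval[tid + s]; sidx[tid] = sidx[tid + s];
        }
        __syncthreads();
    }
    if (tid == 0) {
        bUp[k * gridDim.x + p] = sval[0];
        iUp[k * gridDim.x + p] = sidx[0];
    }
}
// Launch: dim3 grid(P, N); dim3 block(128);
//         size_t shmem = 128 * (sizeof(float) + sizeof(int));
//         reduceBup<<<grid, block, shmem>>>(d_f, l, subsetSize, d_bUp, d_iUp);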
4.3 Task Parallelization Implications.
Although binary tasks are independently trained, there are a number of direct implications associated with their parallel execution:

4.3.1 Cross-Task Kernel Caching:
As the dimensionality of the samples increases, kernel evaluations become the most computationally expensive step of SVM training. Since the SMO algorithm focuses on finding and optimizing non-zero weights, it tends to demand the same rows of the Gram matrix K several times as it approaches convergence. For large datasets, it is not feasible to store the entire matrix K in memory; hence it is common practice to implement kernel caching mechanisms that exploit the reusability of rows of matrix K. SVMLight uses an LRU caching strategy [15]. Zhao et al. [30] show that the probability of selecting an index that was previously selected is higher for the SMO algorithm. The concurrent execution of multiple binary tasks that share the same memory allows different tasks to share kernel evaluations. If a training sample is found to be a support vector in more than one task, a single kernel evaluation is shared among those tasks and the kernel cache hit rate increases. A representation of support vectors being shared among tasks is illustrated in figure 4.

Figure 4: Shared Support Vectors

Our implementation exploits this beneficial property. Similarly, if multiple tasks need to evaluate the same non-cached row of matrix K in the same iteration, the row is evaluated once and shared with the others, avoiding multiple evaluations. Empirical results on this effect are presented in Section 5. In cache miss situations, new rows of the Gram matrix are calculated using the CUDA Basic Linear Algebra Subroutines (CUBLAS), which provide an optimized matrix-vector multiplication routine, SGEMV [23]. The performance of these routines has been confirmed by [2].
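On a cache miss, one row of the Gram matrix has to be materialized. The hedged sketch below shows one way to do this for the RBF kernel: cublasSgemv produces the dot products of every training sample with x_i, and a small elementwise kernel turns them into kernel values using precomputed squared norms. The column-major layout of X, the precomputed norms and the use of the cublas_v2 interface are assumptions of ours, not details taken from the paper.

#include <cublas_v2.h>

// Turn dot products into RBF values:
// kRow[j] = exp(-gamma * (||x_i||^2 + ||x_j||^2 - 2 <x_i, x_j>))
__global__ void dotToRBF(const float* dots, const float* sqNorms, float sqNormI,
                         float gamma, float* kRow, int l) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < l)
        kRow[j] = expf(-gamma * (sqNormI + sqNorms[j] - 2.0f * dots[j]));
}

// Host side: X is stored column-major as an n x l matrix so that X^T * x_i
// yields all l dot products in a single SGEMV call.
void gramRow(cublasHandle_t handle, const float* dX, const float* dxi,
             const float* dSqNorms, float sqNormI, float gamma,
             float* dDots, float* dKRow, int n, int l) {
    const float one = 1.0f, zero = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T, n, l, &one, dX, n, dxi, 1, &zero, dDots, 1);
    int threads = 256, blocks = (l + threads - 1) / threads;
    dotToRBF<<<blocks, threads>>>(dDots, dSqNorms, sqNormI, gamma, dKRow, l);
}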
4.3.2 Progressive Grid Reduction:
Each of the N binary tasks has a different convergence rate. If any of the tasks has already converged, a static PxN grid would require launching idle rows of blocks. Even if passive blocks do not need to run, they would still have to be assigned to the underlying hardware like the rest of the blocks. Consequently, they would hold GPU resources and delay the execution of blocks that correspond to non-converged tasks. Hence, our implementation dynamically reduces the vertical dimension of the grid as binary tasks converge, facilitating the resource allocation of the remaining tasks. Grid reduction is illustrated in figure 5.
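A hedged host-side sketch of this bookkeeping is given below: converged tasks are dropped from a list of active task indices and the next launch uses the reduced vertical grid dimension. The kernel itself is left as a commented placeholder.

#include <vector>
#include <cuda_runtime.h>

// Sketch: shrink the vertical grid dimension as binary tasks converge.
// convergedFlag[k] holds the result of the stopping test for task k.
void launchActiveTasks(std::vector<int>& activeTasks,
                       const std::vector<bool>& convergedFlag, int P) {
    std::vector<int> stillActive;
    for (int k : activeTasks)
        if (!convergedFlag[k]) stillActive.push_back(k);
    activeTasks.swap(stillActive);
    if (activeTasks.empty()) return;
    dim3 grid(P, static_cast<unsigned>(activeTasks.size()));  // reduced grid
    dim3 block(128);
    // p2smoIterationKernel<<<grid, block>>>(/* ..., active task ids, ... */);
    (void)grid; (void)block;   // placeholder only; no kernel is launched here
}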
4.3.3 Shared Kernel Evaluation in Testing:
The classification phase can also benefit from the execution of the N binary tasks concurrently. The classification of a single test sample z requires the evaluation of f^k(z) for each of the N binary tasks. In order to obtain f^k(z), it is necessary to have k(z, x_i), i = 1, ..., l, which is the same for all the N binary tasks.

Figure 5: Grid Reduction

Regardless of the kernel used, this operation requires the multiplication of the matrix X containing the training samples and the vector z with the testing sample. Since a set Z of M testing samples Z = [z_1 z_2 ... z_M] is usually provided, this task can be grouped into a matrix multiplication. The matrix multiplication is efficiently calculated using the SGEMM routine in CUBLAS.
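A hedged sketch of grouping the test-time dot products into a single SGEMM call is shown below; the column-major layouts (X of size n x l, Z of size n x m) and the cublas_v2 interface are assumptions of ours.

#include <cublas_v2.h>

// Sketch: dot products between all training samples X (n x l, column-major)
// and all test samples Z (n x m, column-major) in one SGEMM call.
// D = X^T * Z is l x m with D(i, j) = <x_i, z_j>, and every binary task can
// reuse it to evaluate its own f^k(z_j).
void testDotProducts(cublasHandle_t handle, const float* dX, const float* dZ,
                     float* dD, int n, int l, int m) {
    const float one = 1.0f, zero = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                l, m, n,          // result is l x m, inner dimension n
                &one, dX, n,      // X is n x l, lda = n
                dZ, n,            // Z is n x m, ldb = n
                &zero, dD, l);    // D is l x m, ldc = l
}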
5. PERFORMANCE RESULTS.
This section presents the performance results of executing this GPU implementation of the multiclass classifier, compared with the classical solver LIBSVM. In both cases, the same kernel type, regularization parameter C, and stopping criterion are used. LIBSVM is also based on the SMO algorithm. These tests use OVA output codes, which have been extensively studied and proven to give satisfactory results [26]. Both the LIBSVM kernel cache and the GPU implementation kernel cache were set to the same size, 1 GB. The training phase and the classification phase are self-contained, in the sense that no kernel values are shared between phases or precomputed. In both classifiers, GPU and LIBSVM, I/O was considered an intrinsic part of the classifier. In this section, the hardware and datasets used for the experiments are introduced, and the performance gain for both the training and classification phases is reported. The source code of our GPU implementation and the datasets utilized for this work are available for download at [10].
5.1 Host and Device.
The measurements in this section were carried out on a single machine with an Intel Core i7 920 @ 2.67 GHz and 6 GB of RAM running Ubuntu 9.04 (64 bit). The graphics processor used was an NVIDIA Tesla C1060 with 240 Stream Processors, each with a frequency of 1.3 GHz. The card has 4 GB of memory and a memory bandwidth of 102 GB/s.
5.2 Datasets.
The GPU implementation was tested on publicly available datasets. The Adult [1], Web [25], Mnist [18] and Usps [13] datasets were used to test the precision of single binary classification tasks. For this purpose, the Mnist and Usps datasets were converted from multiclass to binary problems by considering even vs. odd values. Once accuracy was ensured, the Mnist, Usps, Shuttle [12] and Letter [12] datasets were used to analyse the performance of the classifier on multiclass problems. The sizes of these datasets and the parameters used for training are indicated in table 2. The Radial Basis kernel was used for the classification phase in all the datasets.
5.3 Classifier Accuracy.
Binary tasks are the smallest classification units in which accuracy can be evaluated. The classification performance of the multiclass classifier directly depends on the accuracy of the binary tasks. The latest GPUs provide IEEE 754 capabilities with both single precision and double precision support [14]. For these experiments, single precision was used. In this subsection, the accuracy of the GPU implementation's training phase on well known binary tasks is compared to the results provided by LIBSVM. Table 3 shows the accuracy comparison for binary classification. The results show that the classification accuracy on the GPU matches that of the LIBSVM solver. Even though both optimization algorithms were executed with the same tolerance value τ = 0.001, there is some variation in the number of support vectors and the value of the offset. It is speculated that this difference might be due to the application of second order heuristics [17] or shrinking [15] techniques in LIBSVM.
5.4 Cross-Task Kernel Caching Performance.
Figure 6 shows the measurements of kernel cache performance as the number of binary tasks executed concurrently is increased. For a given number n of tasks, different subsets of n tasks could result in great variations of the kernel cache behavior. Consequently, for a fixed number n of tasks, the performance of the resulting \binom{n}{2} possible subsets was measured and averaged. For all the multiclass datasets considered in these experiments, the execution of multiple binary tasks in parallel improved the kernel cache hit rate due to the existence of shared support vectors among these tasks. The impact of executing binary tasks in parallel differs across datasets. The Usps, Shuttle and Letter datasets show an asymptotic levelling off of the kernel cache hits as iterations increase. This effect reveals that, during the approach to convergence, the algorithm focused on the same pairs of points and intensively reutilized kernel evaluations stored in cache for these datasets. The Mnist dataset, in contrast, shows the case where the size of the learning problem prevents the storage of reutilizable samples in the kernel cache and cache miss situations occur continuously. Even under these circumstances, the highest number of concurrent tasks resulted in the best kernel cache performance.
Table 2: Datasets
    Dataset   # Training Points   # Testing Points   (Features, Classes)   (C, γ)
    Adult     32561               16281              (123, 2)              (100, 0.5)
    Web       49749               14951              (300, 2)              (64, 7.8125)
    Mnist     60000               10000              (780, 10)             (10, 0.125)
    Usps      7291                2007               (256, 10)             (10, 1/256)
    Shuttle   43500               14500              (9, 7)                (2048, 8)
    Letter    15000               5000               (16, 26)              (16, 12)
Table 3: Binary Classification Accuracy. * Even vs Odd. (G) GPU, (L) LIBSVM
    Dataset   SVM   Accuracy (%)   SVs     b        It.
    Adult     G     82.6976        18677   -        115177
              L     82.6976        19058   0.0018   43735
    Web       G     99.4515        35220   -        76242
              L     99.4515        35232   0.0137   85299
    Mnist*    G     95.3200        43731   -        68038
              L     95.3200        43753   0.0452   76104
    Usps*     G     97.0603        684     -        7518
              L     97.0603        684     0.0042   4614
In summary, all the cases show an increase in the kernel cache hit rate caused by the addition of a binary task when few tasks are being executed; the relative increase is attenuated as more and more binary tasks are executed in parallel and a new task is incorporated.
5.5 Training Time.
Table 4 contains the training times for the different datasets for both the GPU multiclass classifier and LIBSVM. The distribution of the GPU time by operation is also included, in order to help identify possible communication or computation bottlenecks. The speedup obtained on four binary classification problems (10-33x) was comparable to the results shown by [4]. The training time for multiclass problems was then measured and compared with LIBSVM. It was observed that the GPU implementation benefited classification problems with the largest numbers of features. The Mnist and Usps datasets, whose samples have hundreds of features, resulted in a substantial speedup (16-57x). The Letter dataset, which has an order of magnitude fewer features but is the largest in the number of classes, also resulted in a considerable speedup (19x). Nevertheless, the Shuttle multiclass problem, which is composed of samples with tens of features and few classes, obtained a modest acceleration (3x). It is noticeable that the acceleration of the algorithm is strongly related to the fraction of the time spent on the execution of the SGEMV, reduction and update routines.
Table 4 shows that datasets with a large number of features per sample and a large number of training samples, like Web or Mnist, spend 90% of the GPU time on the SGEMV operation. This makes memory copy, reduction and update times marginal. In these scenarios, the bottleneck is purely computational. The Usps dataset has a comparable number of features per sample but an order of magnitude fewer training samples. This leads to an important reduction of the SGEMV fraction of GPU time and increases the relevance of the rest of the operations. On the contrary, datasets with a reduced number of features per sample, like Shuttle or Letter, have respectable update, reduction and memory copy times, but marginal SGEMV operation times. In these scenarios the bottleneck is still primarily computational, but communication times gain significance to the point of taking between 17-33% of the total GPU time.
In this subsection, it has been shown that the speedups achieved by GPU based binary classifiers can be combined to create a high performance implementation of a parallel multiclass classifier.
Figure 6: Kernel Cache Hit Rate: Mnist (Top Left); Usps (Top Right); Shuttle (Bottom Left); Letter (Bottom Right)
Table 4: Training Time. GPU % Time per operation. {H} Host, {D} Device, {*} Even vs Odd.
    Dataset   Tasks       GPU (sec)   LIBSVM (sec)   Speedup   {H} to {D} (%)   {D} to {H} (%)   SGEMV (%)   Reduction (%)   Update (%)
    Adult     Binary(2)   32.67       341.5          10.45     16.01            11.2             31.75       14.2            26.84
    Web       Binary(2)   156.95      2350           14.97     1.46             1.01             90.11       1.43            5.99
    Mnist*    Binary(2)   425.89      13963.4        32.79     0.47             0.31             95.75       0.57            0.61
    Usps*     Binary(2)   1.65        27             16.36     26.5             19.09            16.19       14.39           23.83
    Mnist     OVA(10)     2067.24     118916.2       57.52     0.1              0.08             96.91       0.5             3.31
    Usps      OVA(10)     1.28        21.3           16.64     11.06            8.68             34.88       14.64           30.74
    Shuttle   OVA(7)      5.85        18.88          3.38      18.68            14.61            0.4         23.57           42.74
    Letter    OVA(26)     19.04       479.9          25.20     9.01             7.98             2.87        29.35           50.79
We believe that these results can be further improved if our naive implementation is enriched by incorporating some of the latest optimizations applied in popular SVM solvers.
5.6 Classification Time.
Table 5 shows the speedup achieved by the GPU multiclass classifier for the classification phase of both two-class problems and multiclass problems. The speedups observed in binary classification are also reproduced in multiclass classification. As was concluded for the training phase, the acceleration is strongly related to the fraction of the time spent on computation executing the SGEMM and reduction routines.
The distribution of the GPU times in classification is analogous to the distribution of times in the training phase. Web and Mnist concentrate their GPU time in the SGEMM operation, which results in marginal communication and reduction times. On the contrary, the Shuttle and Letter datasets have negligible SGEMM operation times, but significant times for the rest of the operations.
5.7 Comparison Across GPU Generations.
In this subsection, the evolution of the performance of our multiclass SVM solver across two different generations of GPU hardware is analyzed. Training times were measured with two graphics cards: (1) a GeForce 8800 GT with 112 processing cores running at 600 MHz; and (2) a Tesla C1060 with 240 processing cores running at 1.3 GHz. Figure 7 illustrates the measured time for both GPUs as more parallel binary tasks are solved. The addition of a binary task requires the processing of an additional set of blocks corresponding to the new task in every PSMO iteration. In our implementation, this additional set translates into 64 more blocks in the parallel reduction step, one more block for the evaluation of the new weights (\alpha^{new,k}_{I^k_{up}}, \alpha^{new,k}_{I^k_{low}}) and 64 more blocks for the update of the f^{p,new,k}_i values. As expected, the large number of processing cores available in the Tesla GPU does a better job absorbing the increasing number of blocks to be executed.
Table 5: Classification Time. GPU % Time per operation. {H} Host, {D} Device, {*} Even vs Odd.
    Dataset   Tasks       GPU (sec)   LIBSVM (sec)   Speedup   {H} to {D} (%)   {D} to {H} (%)   SGEMM (%)   Reduction (%)
    Adult     Binary(2)   1.10        42.7           38.77     0.18             7.96             58.56       33.30
    Web       Binary(2)   2.51        75             29.88     0.15             2.66             81.08       16.11
    Mnist*    Binary(2)   4.43        496.5          112.19    0.14             0.95             92.63       6.58
    Usps*     Binary(2)   0.07        1              13.72     0.57             28.10            13.09       58.24
    Mnist     OVA(10)     14.00       683.9          48.85     0.09             0.71             62.10       37.10
    Usps      OVA(10)     0.13        3.62           27.84     0.35             16.15            19.50       64.01
    Shuttle   OVA(7)      0.49        1.43           2.92      0.05             30.69            0.83        68.43
    Letter    OVA(26)     2.02        6.77           3.35      0.01             2.99             2.59        94.53
Figure 7: GPU Generation comparison. Mnist (Top Left); Usps (Top Right); Shuttle (Bottom Left); Letter
(Bottom Right)
The difference is modest for a small number of tasks but becomes accentuated as the number of tasks increases. For a sufficiently large number of cores, the overall training time should approximate the training time of the longest task.
6. CONCLUSIONS.
The rise of GPUs as massively parallel processors opens a wide range of opportunities for the acceleration and scaling of learning algorithms. The data parallel nature of many learning algorithms fits conveniently the set of problems that modern GPUs are meant to solve. Moreover, previous research on accelerating SVMs on multiprocessor systems or scaling SVMs on computer clusters can be ported to smaller and cheaper GPU or multi-GPU configurations, where memory systems are aggressive and communication is considerably faster than in networked environments. It has been shown in this paper that a naive implementation of a multiclass classifier based on the SMO algorithm running on a single GPU can lead to dataset-dependent speedups in the range of 3-57x for training and 3-112x for classification. These results reduced the training time by more than an order of magnitude while maintaining the accuracy of the classification tasks. This multiclass SVM classifier implementation leaves room for improvement, and better results could potentially be achieved by using more involved SVM training techniques [17] [16]. Nevertheless, this work showed that the GPU programming model conveniently allows executing multiple binary tasks in parallel over the same global memory. This benefited the training time not only because of the parallel execution, but also due to the reusability of data across binary tasks, as confirmed by the empirical results.
7. FUTURE WORK.
Not only can classic algorithms like SVMs be adapted to state-of-the-art programming models; the latest research on statistical learning algorithms can benefit from them as well. New techniques for large scale learning should be built taking into account this new era of multi-core and GPU systems, in order to make the training of large problems practical or to allow real-time training of smaller problems. The latest research on large scale SVMs uses network topologies to partition the data [8] [20]. A priori, these algorithms may find multi-GPU configurations convenient due to the availability of large amounts of memory and the data transfer speed between devices. It is a natural continuation of this multiclass classification work to explore the implementation of distributed classification approaches, such as the Cascade SVM or DPSVM, by creating a network topology composed of multiple GPU devices that work on partitions of the data concurrently.
8. REFERENCES
[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[2] S. Barrachina, M. Castillo, F. Igual, R. Mayo, and E. Quintana-Orti. Evaluation and tuning of the level 3 CUBLAS for graphics processors. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1-8, April 2008.
[3] L. Cao, S. Keerthi, C.-J. Ong, J. Zhang, U. Periyathamby, X. J. Fu, and H. Lee. Parallel sequential minimal optimization for the training of support vector machines. Neural Networks, IEEE Transactions on, 17(4):1039-1049, July 2006.
[4] B. Catanzaro, N. Sundaram, and K. Keutzer. Fast support vector machine training and classification on graphics processors. In ICML '08: Proceedings of the 25th International Conference on Machine Learning, pages 104-111, New York, NY, USA, 2008. ACM.
[5] C. Chang and C. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] T.-N. Do, V.-H. Nguyen, and F. Poulet. Speed up SVM algorithm for massive classification tasks. In ADMA '08: Proceedings of the 4th International Conference on Advanced Data Mining and Applications, pages 147-157, Berlin, Heidelberg, 2008. Springer-Verlag.
[7] R.-E. Fan, P.-H. Chen, and C.-J. Lin. Working set selection using second order information for training support vector machines. J. Mach. Learn. Res., 6:1889-1918, 2005.
[8] H. Graf, E. Cosatto, L. Bottou, I. Durdanovic, and V. Vapnik. Parallel support vector machines: The cascade SVM. In Advances in Neural Information Processing Systems, pages 521-528. MIT Press, 2005.
[9] M. Harris. Mapping computational concepts to GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 50, New York, NY, USA, 2005. ACM.
[10] S. Herrero-Lopez. GPUSVM: a CUDA library for multiclass support vector machines, 2009. Software available at http://code.google.com/p/multisvm/.
[11] W. D. Hillis and G. L. Steele, Jr. Data parallel algorithms. Commun. ACM, 29(12):1170-1183, 1986.
[12] C.-W. Hsu and C.-J. Lin. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13(2):415-425, Mar 2002.
[13] J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994.
[14] IEEE. 754-2008 standard for floating-point arithmetic.
[15] T. Joachims. Making large-scale support vector machine learning practical. MIT Press, Cambridge, MA, USA, 1999.
[16] T. Joachims. Training linear SVMs in linear time. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217-226, New York, NY, USA, 2006. ACM.
[17] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13(3):637-649, 2001.
[18] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov 1998.
[19] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. Micro, IEEE, 28(2):39-55, March-April 2008.
[20] Y. Lu, V. Roychowdhury, and L. Vandenberghe. Distributed parallel support vector machines in strongly connected networks. Neural Networks, IEEE Transactions on, 19(7):1167-1178, July 2008.
[21] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, 2008.
[22] NVIDIA. CUDA Compute Unified Device Architecture. Programming Guide, June 2007.
[23] NVIDIA. CUDA. CUBLAS Library, June 2007.
[24] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In Neural Networks for Signal Processing [1997] VII. Proceedings of the 1997 IEEE Workshop, pages 276-285, Sep 1997.
[25] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. MIT Press, Cambridge, MA, USA, 1999.
[26] R. Rifkin and A. Klautau. In defense of one-vs-all classification. J. Mach. Learn. Res., 5:101-141, 2004.
[27] D. Steinkraus, I. Buck, and P. Simard. Using GPUs for machine learning algorithms. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 1115-1120, Vol. 2, Aug.-Sept. 2005.
[28] V. N. Vapnik. The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, November 1999.
[29] L. Zanni, T. Serafini, and G. Zanghirati. Parallel software for training large scale support vector machines on multiprocessor systems. J. Mach. Learn. Res., 7:1467-1492, 2006.
[30] Z.-D. Zhao, L. Yuan, W. Y.-X., F. Sheng Bao, S.-Y. Zhang, and Y.-F. Sun. A novel model of working set selection for SMO decomposition methods. Tools with Artificial Intelligence, IEEE International Conference on, 2:283-290, 2007.