
Performance Evaluation of Multithreaded Sparse

Matrix-Vector Multiplication using OpenMP

Shengfei Liu 1,2, Yunquan Zhang, Xiangzheng Sun 1,2, RongRong Qiu 2,3
1) Institute of Software, Chinese Academy of Sciences, P.R. China
2) Graduate University of Chinese Academy of Sciences, P.R. China
3) National Research Center for Intelligent Computing Systems, P.R. China
Abstract

Sparse matrix-vector multiplication is an important computational kernel in scientific applications. However, it performs poorly on modern processors because of its low compute-to-memory ratio and irregular memory access patterns. This paper discusses implementations of the sparse matrix-vector multiplication algorithm using OpenMP to execute iterative methods on the Dawning S4800A1. Two storage formats for sparse matrices (CSR and BCSR) and the three scheduling schemes provided by standard OpenMP (static, dynamic and guided) are evaluated. We also compared these three schemes with non-zero scheduling, where each thread is assigned approximately the same number of non-zero elements. Experimental data shows that non-zero scheduling provides the best performance in most cases. The current implementation provides satisfactory scalability for most matrices. However, we obtained only a limited speedup for some large matrices that contain millions of non-zero elements.
1. Introduction
OpenMP is a shared memory parallel programming standard based on the fork/join parallel execution model. In recent years, it has been widely used in parallel computing for SMP (symmetric multi-processing) systems and multi-core processors. Multi-core processors have become more and more popular as they offer a better trade-off among performance, energy efficiency and reliability [1, 2]. The problem of scaling applications up to a large number of processing cores is considered of vital importance. Sparse matrix-vector multiplication (SpMV), which is widely used in scientific computing and many industrial applications, is one such application. SpMV usually performs poorly on modern processors because of its low compute-to-memory ratio and irregular memory access patterns.
In this paper, we discuss the performance of
multithreaded sparse matrix-vector multiplication on
the Dawning S4800A1. We will show the effect of
storage formats and scheduling schemes on the
performance of SpMV. The rest of this paper is
organized as follows: Section 2 provides an overview
of related work. Section 3 introduces how to
parallelize SpMV using OpenMP. Section 4 discusses the scheduling schemes provided by OpenMP; we also implemented a static scheduling scheme based on the number of non-zero elements, which we call non-zero scheduling. Experimental results are shown and
discussed in Section 5. Section 6 presents conclusions
and future work.
2. Related work
There are many existing optimization techniques for
serial SpMV, including register blocking [7, 8], cache
blocking [9, 10], multi-vector technology [3, 4],
symmetrical structure [8], diagonal technology [8],
rearrangement [8], and so on. The SPARSITY [3, 4] package is widely used for the automatic optimization of SpMV and SpMM (Sparse Matrix Multi-Vector Multiplication). BeBOP (the Berkeley Benchmarking and Optimization Group) [5] has developed a software package, OSKI (Optimized Sparse Kernel Interface) [6], which provides a C programming interface for the automatic optimization of sparse
matrix computing kernels, which is used in mathematical libraries and specific applications.

[2009 11th IEEE International Conference on High Performance Computing and Communications. 978-0-7695-3738-2/09 $25.00 © 2009 IEEE. DOI 10.1109/HPCC.2009.75]

Register blocking
and cache blocking were first presented in [11], but
OSKI was the first released autotuning library that
incorporated them. In order to improve the
performance of SpMV, both SPARSITY and OSKI
adopted a heuristic algorithm to determine the optimal
block size for a sparse matrix. Although OSKI is a serial library, it is being integrated into higher-level parallel linear solver libraries, including PETSc [12] and Trilinos [13]. PETSc uses MPI (Message Passing Interface) to implement SpMV on platforms with distributed memory. For load balancing, PETSc uses a row partitioning scheme with an equal number of rows per process.
Past research on multithreaded SpMV has focused mainly on SMP clusters, applying and evaluating known optimization techniques, including register and cache blocking, reordering techniques, and minimizing communication costs [14, 15, 16]. In [17], Williams and Oliker summarized a rich collection of optimizations for different multithreading architectures and pointed out that memory bandwidth can be a significant bottleneck in CMP (chip multiprocessor) systems. Kourtis and Goumas improved the performance of multithreaded SpMV using index and value compression techniques in [18]. Based on OSKI, the BeBOP team developed the pOSKI (Parallel Optimized Sparse Kernel Interface) software package [19] for multi-core architectures. The pOSKI package uses automatic optimization techniques like OSKI, along with some of the techniques introduced in [17].
3. Parallelization of SpMV
The form of SpMV is y = y + Ax, where A is a sparse matrix and x and y are dense vectors; x is called the source vector and y the destination vector. One
widely used and efficient sparse matrix storage format is CSR (Compressed Sparse Row). It stores only the non-zero elements, together with their column indices and the index of the first non-zero element of each row. If matrix A is an m*n matrix and the number of non-zero elements is nz, the CSR format stores the following three arrays:

csrNz[nz] stores the value of each non-zero element in matrix A.

csrCols[nz] stores the column index of each element in the csrNz array.

csrRowStart[m+1] stores the index of the first non-zero element of each row in the arrays csrNz and csrCols, with csrRowStart[m] = nz.

We can see that matrix A needs nz*2 + m + 1 storage locations in CSR format.
For example, the matrix

A = [ 1 0 0 2
      0 0 3 0
      0 0 4 0
      5 0 0 6 ],

where m = 4, n = 4, nz = 6, is stored as follows:

csrNz = {1, 2, 3, 4, 5, 6};
csrCols = {0, 3, 2, 2, 0, 3};
csrRowStart = {0, 2, 3, 4, 6}.
Since we must store the location information explicitly as well as the value of each non-zero element, extra memory accesses are needed to read this location data. Accordingly, both computing time and memory access cost should be considered when optimizing and parallelizing SpMV. We implemented the multithreaded SpMV for the CSR format using OpenMP as shown in Figure 1.
Figure 1. Pseudo code of multithreaded SpMV
for CSR using OpenMP
// The following line is an OpenMP directive.
#pragma omp parallel for private(j)
// The loop body is the serial SpMV based on CSR.
for (i = 0; i < m; i++) {
    double temp = y[i];
    for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++) {
        int index = csrCols[j];
        temp += csrNz[j] * x[index];
    }
    y[i] = temp;
}
In the above code, we declare j as a private variable so that each thread has its own copy of j. The iteration variable i and the variables defined inside the loop, such as temp and index, are private to each thread by default. The low efficiency of SpMV is mainly caused by poor data access locality and the indirect accesses to vector x. An improved way to store a sparse matrix is BCSR (Block Compressed Sparse Row), in which the matrix is divided into blocks. Based on BCSR, the register blocking algorithm further improves data access efficiency by reusing data in registers. Paper [4] discusses in more detail how SPARSITY optimizes sparse matrix computations by reusing register data.
The register blocking algorithm divides the sparse matrix A[m][n] into many small r*c blocks. After A is divided into m/r block rows and n/c block columns, the elements in each block are computed one by one. In this way, c elements of vector x held in registers are reused r times for each block. When r = c = 1, BCSR is equivalent to CSR. If the number of non-zero blocks is nzb, BCSR also needs three arrays to store the sparse matrix:

bcsrNz[nzb*r*c] stores the values of the elements in each non-zero block.

bcsrCols[nzb] stores the column index of the first element of each non-zero block.

bcsrRowStart[m/r+1] stores, for each block row, the index of its first non-zero block in bcsrCols, with bcsrRowStart[m/r] = nzb, the number of non-zero blocks in the matrix.
As an example, take

A = [ 1 1 1 0 0 0 2 2 2
      1 0 0 0 0 0 0 2 0
      0 0 0 3 3 0 0 0 0
      0 0 0 0 3 3 0 0 0
      0 0 0 4 0 4 5 5 0
      0 0 0 4 4 0 5 5 0 ],

where m = 6, n = 9, nz = 20. With r = 2, c = 3 blocks (so nzb = 5), the corresponding BCSR format of A is:

bcsrNz = {1,1,1,1,0,0, 2,2,2,0,2,0, 3,3,0,0,3,3, 4,0,4,4,4,0, 5,5,0,5,5,0};
bcsrCols = {0, 6, 3, 3, 6};
bcsrRowStart = {0, 2, 3, 5}.
Taking a 2*2 block size as an example, the pseudo code in Figure 2 implements SpMV based on BCSR using OpenMP, where blockRows = m/r is the number of block rows in the matrix. We can see from the code that x0, x1, y0 and y1 are each reused twice in registers. The data access locality of vector x is also improved, since four elements of the sparse matrix are processed at a time.

Figure 2. Pseudo code of multithreaded SpMV
for BCSR using OpenMP
4. Load balancing of SpMV
Considering load balancing, scheduling overhead and other factors, the OpenMP API specifies four different scheduling schemes: static, dynamic, guided and runtime [20]. The runtime scheme defers the choice of schedule type and chunk size to runtime, reading them from environment variables, so we mainly discuss the other three scheduling schemes.
The static scheme divides the iterations of a loop into pieces of size chunk, an input parameter. The pieces are statically assigned to the threads in a round-robin fashion. By default, the iterations are divided into P evenly sized contiguous chunks, where P is the number of threads. Since the schedule can be statically determined, this method has the least runtime overhead.
The dynamic scheme also divides the iterations of a loop into pieces, but distributes the pieces dynamically: each thread obtains the next set of iterations only after it finishes its current piece. If not specified by the input parameter chunk, the default number of iterations in a piece is 1.
The guided scheme works in a fashion similar to the dynamic scheme. The partition size S is calculated by Formula (1), where L_C is the count of unassigned iterations, P is the number of threads, and λ is a scale factor (recommended to be 1 or 2):

S = L_C / (λ * P)    (1)

The guided scheme also accepts an input parameter chunk, which specifies the minimum number of iterations to dispatch each time; when no chunk is specified, its default value is 1. The chunk sizes in guided scheduling begin large and slowly decrease, resulting in fewer synchronizations than dynamic scheduling while still providing load balancing.
As an example, Table 1 gives the partition sizes for a problem with N = 1000 iterations and P = 4 threads.

Table 1. Partition sizes for different schemes

Scheme               Partition sizes
static (default)     250, 250, 250, 250
dynamic (chunk=50)   50, 50, 50, 50, 50, ...
guided (chunk=50)    250, 188, 141, 106, 79, 59, 50, 50, 50, 27
In the case of the CSR storage format, the coarse-grain row partitioning scheme is widely applied [17]. The matrix is partitioned into blocks of rows according to the number of threads, and each block is assigned to one thread. As described in the left part of Figure 3, each thread operates on its own parts of the csrNz, csrCols, csrRowStart and y arrays, so their accesses to these data are independent and can be parallelized. All threads access elements of the x array; since accesses to x are read-only, the data can reside in each processor's cache without causing invalidation traffic due to the cache coherency protocol [16]. An advantage of row partitioning is that each thread operates on its own part of the y array, which allows better temporal locality on the array's elements in the case of distinct caches.

The code of Figure 2 (SpMV for BCSR with 2*2 blocks) is:

// The following line is an OpenMP directive.
#pragma omp parallel for private(j, k, h, t)
// The loop body is the serial SpMV based on BCSR.
for (i = 0; i < blockRows; i++) {
    t = i << 1;
    register double y0 = y[t];
    register double y1 = y[t+1];
    for (j = bcsrRowStart[i]; j < bcsrRowStart[i+1]; j++) {
        k = j << 2;
        h = bcsrCols[j];
        register double x0 = x[h];
        register double x1 = x[h+1];
        y0 += bcsrNz[k]   * x0;
        y0 += bcsrNz[k+1] * x1;
        y1 += bcsrNz[k+2] * x0;
        y1 += bcsrNz[k+3] * x1;
    }
    y[t] = y0;
    y[t+1] = y1;
}
Figure 3. Matrix partition for SpMV
There is a complementary approach to row partitioning, column partitioning, where each thread is assigned a block of columns. Although column partitioning is more naturally applied to the CSC (Compressed Sparse Column) storage format, it can also be applied to the CSR format. A disadvantage of column partitioning is that all threads must write to all elements of the vector y. The best way to solve this problem is to give each thread its own copy of vector y and perform a reduction at the end of the multiplication [18]. The reduction causes more memory accesses, as it must visit each thread's copy of y, and it costs (P-1)*sizeof(y) extra space to store the per-thread copies, where P is the number of threads. Since each thread only needs to operate on its own part of vector x in column partitioning, this approach provides better temporal locality for vector x. To combine the advantages of the above two partitioning methods, we can use block partitioning, where each thread is assigned a two-dimensional block, as shown in the right part of Figure 3. On the other hand, the implementation of block partitioning is more complex and needs more extra space. Its effect on the performance of SpMV is beyond the scope of this paper.

In our implementation of multithreaded SpMV, we
used row partitioning with all three scheduling schemes provided by OpenMP. We also applied a static load-balancing scheme, non-zero scheduling [18], which is based on the number of non-zero elements: each thread is assigned approximately the same number of non-zero elements and thus a similar number of floating-point operations. To implement non-zero scheduling on top of the default static scheduling, we define an array nzStart[P+1], where P is the number of threads. nzStart[i] is the index of the first row assigned to the i-th thread and, correspondingly, nzStart[i+1] - 1 is the index of the last row assigned to the i-th thread. The cost of initializing nzStart is negligible, since we only need to scan the csrRowStart array once. The value of P is obtained at runtime through the environment variable OMP_NUM_THREADS. The implementation based on CSR with non-zero scheduling is shown in Figure 4.
Figure 4. Pseudo code of multithreaded SpMV
with the non-zero scheduling
5. Experimental Evaluation
In this section, we evaluate the performance of
multithreaded SpMV with different storage formats
and scheduling schemes. The testing environment is
based on Dawning S4800A1 with the following
configuration: 4 AMD Opteron dual-core processors
870, 2GHz, 1MB L2 cache, 16 GB DDR2 RAM, and 2
TB SCSI disk. The operating system is Turbo Linux
3.4.3-9.2. We compiled our code with Intel compiler
9.1 using the optimization option -O3.
Considering the number of non-zero elements and the matrix size (small, medium, large), we collected 8 test sparse matrices from a popular collection, the UF Sparse Matrix Collection [21]. The details of the experimental matrices are shown in Table 2.
Table 2. Properties of the test matrices

Name          Rows     Columns   Non-zeros
bcsstk17.RSA  10974    10974     219812
bcsstk28.RSA  4410     4410      111717
af23560.rsa   23560    23560     484256
epb1.rua      14734    14734     95053
raefsky2.rb   3242     3242      294276
raefsky3.rb   21200    21200     1488768
twotone.rua   120750   120750    1224224
venkat01.rb   62424    62424     1717792
The code of Figure 4 (CSR SpMV with non-zero scheduling) is:

// The following line is an OpenMP directive.
#pragma omp parallel for private(i, j)
for (int t = 0; t < P; t++) {
    for (i = nzStart[t]; i < nzStart[t+1]; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++) {
            int index = csrCols[j];
            temp += csrNz[j] * x[index];
        }
        y[i] = temp;
    }
}
Our implementations use OpenMP with the default static scheduling scheme. For the BCSR format, we chose 2*2 as the block size. The number of iterations is 10000. To compare the CSR and BCSR formats, we chose four matrices: bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA, covering the large, medium and small size categories.

As shown in Figures 5 to 8, BCSR performs better than CSR for most matrices, since the BCSR format improves data access locality and reuses data in registers. The improvement is about 28.09% on average over the four matrices.

Figure 5. Computing time with CSR and BCSR for bcsstk17.RSA matrix
Figure 6. Computing time with CSR and BCSR for raefsky3.rb matrix
Figure 7. Computing time with CSR and BCSR for epb1.rua matrix
Figure 8. Computing time with CSR and BCSR for bcsstk28.RSA matrix
The speedups we obtained for CSR and BCSR are shown in Figure 9 and Figure 10. Most of the test matrices achieve a scalable, even super-linear, speedup for both CSR and BCSR as the number of threads increases. With eight threads, the speedup ranges from 1.82 to 14.68 for the CSR format and from 1.93 to 18.74 for the BCSR format; the average eight-thread speedup is 8.81 for CSR and 9.76 for BCSR. However, three matrices (raefsky3.rb, twotone.rua and venkat01.rb) achieve only a limited speedup, less than 2.0 with eight threads, and cannot achieve any higher speedup once the number of threads exceeds 3. We observed that these are exactly the three largest matrices in the test set, each with over one million non-zero elements. Correspondingly, the vector x multiplied with them is longer than for the other matrices. As vector x is stored contiguously in memory, the longer it is, the more TLB misses occur when it is accessed irregularly. This is likely one reason why we obtained only a limited speedup for the large matrices.
Figure 9. Speedup of CSR using 8 matrices for 1 to 8 threads
Figure 10. Speedup of BCSR using 8 matrices for 1 to 8 threads
We also evaluated the three scheduling schemes provided by OpenMP and the non-zero scheduling for the CSR format, again on bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA. The performance of these scheduling schemes is shown in Figures 11 to 14. Generally, the non-zero scheduling performs better than the three schemes provided by OpenMP. Although guided scheduling partitions the matrix into fewer pieces than dynamic scheduling, it may incur more runtime overhead per partition. The non-zero scheduling and the default static scheduling have similar runtime overhead, because the non-zero scheduling is implemented on top of the default static scheduling, which is statically determined and has the least runtime overhead. However, all threads get a similar number of floating-point operations under non-zero scheduling, so it performs better than the default static scheduling in most cases.

Figure 11. Different scheduling schemes with CSR for bcsstk17.RSA matrix
Figure 12. Different scheduling schemes with CSR for raefsky3.rb matrix
Figure 13. Different scheduling schemes with CSR for epb1.rua matrix
Figure 14. Different scheduling schemes with CSR for bcsstk28.RSA matrix
6. Conclusions and future work
In this paper, we implemented and evaluated multithreaded SpMV using OpenMP. We evaluated two storage formats for the sparse matrix, as well as the scheduling schemes provided by OpenMP and the non-zero scheduling scheme. In most cases, the non-zero scheduling performs better than the schemes provided by OpenMP. Our implementation obtained satisfactory scalability for most matrices, except for some large ones. To address this problem, our future work will consider block partitioning: since a large matrix would be partitioned into several small matrices, it should achieve better data access locality for vectors x and y. We will also implement a hybrid MPI/OpenMP parallelization for distributed memory parallel machines.
7. Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under contracts No. 60303020 and No. 60533020, and by the National 863 Plan of China under contracts No. 2006AA01A102 and No. 2006AA01A125. We thank the reviewers for their careful reviews and helpful suggestions.
8. References
[1] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23-29, Jul.-Aug. 1999.
[2] J. L. Hennessy and D. A. Patterson. Computer
Architecture: A Quantitative Approach; fourth edition.
Morgan Kaufmann, San Francisco, 2006.
[3] Eun-Jin Im, Katherine Yelick, Richard Vuduc. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications, 18(1):135-158, 2004.
[4] E.-J. Im and K. A. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127-136, San Francisco, CA, May 2001. Springer.
[5] Berkeley Benchmarking and OPtimization (BeBOP)
Project. http://Bebop.cs.berkeley.edu.
[6] Richard Vuduc, James Demmel, Katherine Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.
[7] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
[8] Richard Wilson Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, Computer Science Division, U.C. Berkeley, December 2003.
[9] Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Technical report.
[10] Rajesh Nishtala, Richard Vuduc, James W. Demmel,
Katherine A. Yelick. When Cache Blocking of Sparse Matrix
Vector Multiply Works and Why.
[11] E.-J. Im. Optimizing the Performance of Sparse Matrix-
Vector Multiplication. PhD thesis, UC Berkeley,
Berkeley,CA, USA, 2000.
[12] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith.
Efficient management of parallelism in object oriented
numerical software libraries. In E. Arge, A. M. Bruaset, and
H. P. Langtangen, editors, Modern Software Tools in
Scientific Computing, pages 163-202, 1997.
[13] J. W. Willenbring, A. A. Anda, M. Heroux. Improving
sparse matrix-vector product kernel performance and
availability. In Proc. Midwest Instruction and Computing
Symposium, Mt. Pleasant, IA, 2006.
[14] E. Im, K. Yelick. Optimizing sparse matrix-vector
multiplication on SMPs. In 9th SIAM Conference on Parallel
Processing for Scientific Computing. SIAM, Mar. 1999.
[15] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, F. F. Rivera.
Improving the locality of the sparse matrix-vector product on
shared memory multiprocessors. In PDP, pages 66-71. IEEE Computer Society, 2004.
[16] U. V. Catalyuerek, C. Aykanat. Decomposing
irregularly sparse matrices for parallel matrix-vector
multiplication. Lecture Notes in Computer Science, 1117:75-86, 1996.
[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick,
and J. Demmel. Optimization of sparse matrix-vector
multiplication on emerging multicore platforms. In
Proceedings of Supercomputing, November 2007.
[18] K. Kourtis, G. Goumas, N. Koziris. Improving the
Performance of Multithreaded Sparse Matrix-Vector
Multiplication Using Index and Value Compression. In
Proceedings of the 37th International Conference on Parallel
Processing, Washington, DC, USA, 2008. pp: 511-519.
[19] Ankit Jain. pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures. Master's report.
[20] LAI Jianxin, HU Changju. Analysis of Task Schedule Overhead and Load Balance in OpenMP. Computer Engineering, 2006, 32(18): 58-60.
[21] Matrix Market. http://math.nist.gov/MatrixMarket/