Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP

Shengfei Liu 1,2, Yunquan Zhang 1, Xiangzheng Sun 1,2, RongRong Qiu 2,3

1) Institute of Software, Chinese Academy of Sciences, P.R.China

2) Graduate University of Chinese Academy of Sciences, P.R.China

3) National Research Center for Intelligent Computing Systems, P.R.China

{lsf,zyq,sxz}@mail.rdcps.ac.cn

qiurongrong@ncic.ac.cn

Abstract

Sparse matrix-vector multiplication is an important computational kernel in scientific applications. However, it performs poorly on modern processors because of its low compute-to-memory ratio and irregular memory access patterns. This paper discusses implementations of the sparse matrix-vector multiplication algorithm using OpenMP to execute iterative methods on the Dawning S4800A1. Two storage formats for sparse matrices (CSR and BCSR) and the three scheduling schemes provided by standard OpenMP (static, dynamic and guided) are evaluated. We also compare these three schemes with non-zero scheduling, in which each thread is assigned approximately the same number of non-zero elements. Experimental data shows that non-zero scheduling provides the best performance in most cases. The current implementation provides satisfactory scalability for most of the matrices. However, we obtained only a limited speedup for some large matrices that contain millions of non-zero elements.

1. Introduction

OpenMP is a shared-memory parallel programming standard based on the fork/join parallel execution model. In recent years it has been widely used in parallel computing on SMP (symmetric multi-processing) systems and multi-core processors. Multi-core processors have become more and more popular because they offer a better trade-off among performance, energy efficiency and reliability [1, 2]. Scaling applications to large numbers of processing cores is therefore considered of vital importance. Sparse matrix-vector multiplication (SpMV) is one such application: it is widely used in scientific computing and many industrial applications. SpMV usually performs poorly on modern processors because of its low compute-to-memory ratio and irregular memory access patterns.

In this paper, we discuss the performance of multithreaded sparse matrix-vector multiplication on the Dawning S4800A1. We show the effect of storage formats and scheduling schemes on the performance of SpMV. The rest of this paper is organized as follows: Section 2 provides an overview of related work. Section 3 introduces how to parallelize SpMV using OpenMP. Section 4 discusses the scheduling schemes provided by OpenMP; we also implement a static scheduling scheme based on the number of non-zero elements, which we call non-zero scheduling. Experimental results are shown and discussed in Section 5. Section 6 presents conclusions and future work.

2. Related work

There are many existing optimization techniques for serial SpMV, including register blocking [7, 8], cache blocking [9, 10], multi-vector technology [3, 4], symmetrical structure [8], diagonal technology [8], rearrangement [8], and so on. The SPARSITY [3, 4] package is widely used for the automatic optimization of SpMV and SpMM (sparse matrix multi-vector multiplication). BeBOP (Berkeley Benchmarking and Optimization Group) [5] has developed a software package, OSKI (Optimized Sparse Kernel Interface) [6], which provides a C programming interface for the automatic optimization of sparse matrix computing kernels and is used in mathematical libraries and specific applications. Register blocking and cache blocking were first presented in [11], but OSKI was the first released autotuning library that incorporated them. To improve the performance of SpMV, both SPARSITY and OSKI adopt a heuristic algorithm to determine the optimal block size for a sparse matrix. Although OSKI is a serial library, it is being integrated into higher-level parallel linear solver libraries, including PETSc [12] and Trilinos [13]. PETSc uses MPI (Message Passing Interface) to implement SpMV on platforms with distributed memory. For load balancing, PETSc uses a row partitioning scheme with equal numbers of rows per process.

Past research on multithreaded SpMV has mainly targeted SMP clusters. It applied and evaluated known optimization techniques, including register and cache blocking, reordering techniques and minimizing communication costs [14, 15, 16]. In the work of [17], Williams and Oliker summarized a rich collection of optimizations for different multithreading architectures. They pointed out that memory bandwidth can be a significant bottleneck in CMP (chip multiprocessor) systems. Kourtis and Goumas improved the performance of multithreaded SpMV using index and value compression techniques in [18]. Based on OSKI, the BeBOP team developed the pOSKI (Parallel Optimized Sparse Kernel Interface) software package [19] for multi-core architectures. The pOSKI package uses automatic optimization techniques like OSKI, as well as some other techniques introduced in [17].

3. Parallelization of SpMV

The form of SpMV is y = y + Ax, where A is a sparse matrix and x and y are dense vectors; x is called the source vector and y the destination vector. One widely used, efficient sparse matrix storage format is CSR (Compressed Sparse Row). It stores only the non-zero elements, together with their column indices and the index of the first non-zero element of each row. If matrix A is an m*n matrix and the number of non-zero elements is nz, the CSR format needs to store the following three arrays.

csrNz[nz] stores the value of each non-zero element in matrix A.

csrCols[nz] stores the column index of each element in the csrNz array.

csrRowStart[m+1] stores, for each row, the index of its first non-zero element in the csrNz and csrCols arrays, with csrRowStart[m] = nz.

We can see that matrix A needs nz*2+m+1 units of storage in CSR format.

For example, the matrix

A1 = | 1 0 0 2 |
     | 0 0 3 0 |
     | 0 0 4 0 |
     | 5 0 0 6 |

where m=4, n=4, nz=6, should be stored as follows:

csrNz = {1,2,3,4,5,6};
csrCols = {0,3,2,2,0,3};
csrRowStart = {0,2,3,4,6}.

Since we must explicitly store the location information as well as the value of each non-zero element, extra memory accesses are needed to fetch these location data. Accordingly, both computing time and memory access cost should be considered in the optimization and parallelization of SpMV. We implemented the multithreaded SpMV for the CSR format using OpenMP as shown in Figure 1.

Figure 1. Pseudo code of multithreaded SpMV for CSR using OpenMP

//The following line is an OpenMP directive.
#pragma omp parallel for private (j)
//The following code is serial SpMV based on CSR.
for (i = 0; i < m; i++)
{
    double temp = y[i];
    for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++)
    {
        int index = csrCols[j];
        temp += csrNz[j]*x[index];
    }
    y[i] = temp;
}

In the above code, we declare j as a private variable so that each thread has its own copy of j. The iteration variable i and the variables defined inside the loop, such as temp and index, are private to each thread by default.
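To make the example concrete, the following self-contained sketch (our addition, not from the paper) runs the kernel of Figure 1 on the A1 matrix given above; the all-ones source vector x is a hypothetical choice. Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp.

#include <stdio.h>

int main(void)
{
    /* A1 from the example above, in CSR form */
    double csrNz[] = {1, 2, 3, 4, 5, 6};
    int csrCols[] = {0, 3, 2, 2, 0, 3};
    int csrRowStart[] = {0, 2, 3, 4, 6};
    int m = 4;
    double x[] = {1, 1, 1, 1};   /* hypothetical source vector */
    double y[] = {0, 0, 0, 0};
    int i, j;

    #pragma omp parallel for private (j)
    for (i = 0; i < m; i++)
    {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++)
            temp += csrNz[j] * x[csrCols[j]];
        y[i] = temp;
    }

    for (i = 0; i < m; i++)
        printf("y[%d] = %g\n", i, y[i]);   /* prints 3, 3, 4, 11 */
    return 0;
}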

The low efficiency of SpMV is mainly caused by poor data access locality and the indirect accesses of vector x. An improved way to store a sparse matrix is BCSR (Block Compressed Sparse Row), in which the matrix is divided into blocks. Based on BCSR, the register blocking algorithm further improves data access efficiency by reusing data in registers. Paper [4] discusses in more detail how to optimize sparse matrix computations by reusing register data in SPARSITY.

The register blocking algorithm divides the sparse matrix A[m][n] into many small r*c blocks. After A is divided into m/r block rows and n/c block columns, the elements in each block are computed one by one. In this situation, c elements of vector x held in registers are reused r times for each block. When r=c=1, BCSR is equivalent to CSR. Supposing the number of non-zero blocks is nzb, BCSR also needs three arrays to store the sparse matrix.


bcsrNz[nzb*r*c] stores the values of the elements in each non-zero block.

bcsrCols[nzb] stores the column index of the first element of each non-zero block.

bcsrRowStart[m/r+1] stores, for each block row, the index of its first non-zero block in bcsrCols, with bcsrRowStart[m/r] = nzb, the number of non-zero blocks in the given matrix.

We take

A2 = | 1 1 1 0 0 0 2 2 2 |
     | 1 0 0 0 0 0 0 2 0 |
     | 0 0 0 3 3 0 0 0 0 |
     | 0 0 0 0 3 3 0 0 0 |
     | 0 0 0 4 0 4 5 5 0 |
     | 0 0 0 4 4 0 5 5 0 |

as an example, where m=6, n=9, nz=20. Using 2*3 blocks (r=2, c=3), the corresponding BCSR format of A2 is as follows:

bcsrNz = {1,1,1,1,0,0,2,2,2,0,2,0,3,3,0,0,3,3,4,0,4,4,4,0,5,5,0,5,5,0};
bcsrCols = {0,6,3,3,6};
bcsrRowStart = {0,2,3,5}.

Note that bcsrNz holds nzb*r*c = 30 values for only nz = 20 non-zero elements; the explicitly stored zeros are the price paid for the regular block structure.

Taking a 2*2 block size as an example, the pseudo code in Figure 2 implements SpMV based on BCSR using OpenMP, where blockRows = m/r is the number of block rows in the matrix.

Figure 2. Pseudo code of multithreaded SpMV for BCSR using OpenMP

//The following line is an OpenMP directive.
#pragma omp parallel for private (j,k,h,t)
//The following code is serial SpMV based on BCSR.
for (i = 0; i < blockRows; i++)
{
    t = i<<1;
    register double y0 = y[t];
    register double y1 = y[t+1];
    for (j = bcsrRowStart[i]; j < bcsrRowStart[i+1]; j++)
    {
        k = j<<2;
        h = bcsrCols[j];
        register double x0 = x[h];
        register double x1 = x[h+1];
        y0 += bcsrNz[k] * x0;
        y0 += bcsrNz[k+1] * x1;
        y1 += bcsrNz[k+2] * x0;
        y1 += bcsrNz[k+3] * x1;
    }
    y[t] = y0;
    y[t+1] = y1;
}

We can see from the code that x0, x1, y0 and y1 are each reused twice in registers. The data access locality of vector x is also improved, since four elements of the sparse matrix are processed at a time.

4. Load balancing of SpMV

Considering load balancing, scheduling overhead and other factors, the OpenMP API specifies four different scheduling schemes: static, dynamic, guided and runtime [20]. The runtime scheme defers the choice of schedule type and chunk size to runtime, reading them from an environment variable. We therefore mainly discuss the other three scheduling schemes.

The static scheme divides the iterations of a loop into pieces according to an input parameter, chunk. The pieces are statically assigned to the threads in round-robin fashion. By default, the iterations are divided into P evenly sized contiguous chunks, where P is the number of threads. Since the schedule can be determined statically, this method has the least runtime overhead.

The dynamic scheme also divides the iterations of a loop into pieces, but it distributes the pieces dynamically: a thread obtains the next set of iterations only after it finishes its current piece. If not specified by the input parameter chunk, the default number of iterations in a piece is 1.

The guided scheme works in a fashion similar to the dynamic scheme. The partition size is calculated by Formula (1), where LC_k is the count of unassigned iterations, P is the number of threads, and α is a scale factor (recommended to be 1 or 2). The guided scheme also accepts an input parameter chunk, which specifies the minimum number of iterations to dispatch each time; when no chunk is specified, its default value is 1. The chunk sizes in guided scheduling begin large and slowly decrease, resulting in fewer synchronizations than dynamic scheduling while still providing load balancing.

NC_k = LC_k / (α * P)    (1)
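In code, the scheme and chunk size are selected with the standard OpenMP schedule clause. As a sketch, the kernel of Figure 1 can be switched to dynamic scheduling with a chunk of 50 by changing only the directive (the function name and wrapper are our illustration):

void spmv_csr_dynamic(int m, const double *csrNz, const int *csrCols,
                      const int *csrRowStart, const double *x, double *y)
{
    int i, j;
    /* only the schedule clause differs from Figure 1; likewise
       schedule(static, 50), schedule(guided, 50) or schedule(runtime) */
    #pragma omp parallel for private (j) schedule(dynamic, 50)
    for (i = 0; i < m; i++)
    {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++)
            temp += csrNz[j] * x[csrCols[j]];
        y[i] = temp;
    }
}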

As an example, Table 1 gives the partition sizes for a problem with N = 1000 iterations and P = 4 threads.

Table 1. Partition sizes for different schemes

Scheme               Partition sizes
static (default)     250, 250, 250, 250
dynamic (chunk=50)   50, 50, 50, ..., 50 (20 pieces)
guided (chunk=50)    250, 188, 141, 106, 79, 59, 50, 50, 50, 27
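The guided row of Table 1 can be reproduced with a short simulation of Formula (1). This sketch (our addition) assumes α = 1, ceiling rounding, and that chunk acts as a lower bound on the piece size except for the final remainder; under those assumptions it prints exactly the sizes in the table.

#include <stdio.h>

int main(void)
{
    int LC = 1000, P = 4, alpha = 1, chunk = 50;
    while (LC > 0)
    {
        int NC = (LC + alpha*P - 1) / (alpha*P);   /* ceil(LC / (alpha*P)) */
        if (NC < chunk) NC = chunk;                /* chunk is the minimum piece */
        if (NC > LC) NC = LC;                      /* last piece: whatever is left */
        printf("%d ", NC);                         /* 250 188 141 106 79 59 50 50 50 27 */
        LC -= NC;
    }
    return 0;
}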

In the case of the CSR storage format, a coarse-grain row partitioning scheme is commonly applied [17]. The matrix is partitioned into blocks of rows according to the number of threads, and each block is assigned to one thread. As described in the left part of Figure 3, each thread operates on its own parts of the csrNz, csrCols, csrRowStart and y arrays. Their accesses to these data are therefore independent and can proceed in parallel. All threads access elements of the x array. Since the accesses to x are read-only, the data can reside in each processor's cache without causing invalidation traffic due to the cache coherency protocol [16]. An advantage of row partitioning is that each thread operates on its own part of the y array, which allows better temporal locality on that array's elements in the case of distinct caches.

Figure 3. Matrix partition for SpMV (left: row partitioning; right: block partitioning)

There is a complementary approach to row partitioning: column partitioning, where each thread is assigned a block of columns. Although column partitioning applies more naturally to the CSC (Compressed Sparse Column) storage format, it can also be applied to the CSR format. A disadvantage of column partitioning is that all threads must perform writes on all the elements of the vector y. To solve this problem, the best method is to give each thread its own copy of vector y and perform a reduction at the end of the multiplication [18]. The reduction causes additional memory accesses, as it must visit each thread's copy of y, and it also costs (P-1)*sizeof(y) extra space to store the per-thread y arrays, where P is the number of threads. On the other hand, since each thread operates only on its own part of vector x in column partitioning, it yields better temporal locality for vector x. To combine the advantages of the two partitioning methods, we can use block partitioning, where each thread is assigned a two-dimensional block, as shown in the right part of Figure 3. The implementation of block partitioning, however, is more complex and needs more extra space. The effects of these schemes on the performance of SpMV are beyond the scope of this paper.
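For concreteness, the following sketch illustrates the per-thread copy of y with a final reduction (our illustration only; the paper evaluates row partitioning, and the CSC arrays cscColStart, cscRows, cscNz and the per-thread column ranges in colStart are assumptions):

#include <omp.h>
#include <string.h>

void spmv_col_partitioned(int m, int P, const int *colStart,
                          const int *cscColStart, const int *cscRows,
                          const double *cscNz, const double *x,
                          double *y, double *ybuf /* P*m scratch: one y copy per thread */)
{
    memset(ybuf, 0, (size_t)P * m * sizeof(double));
    #pragma omp parallel num_threads(P)
    {
        int t = omp_get_thread_num();
        double *yt = ybuf + (size_t)t * m;                       /* this thread's own y */
        for (int c = colStart[t]; c < colStart[t+1]; c++)        /* owned columns */
            for (int k = cscColStart[c]; k < cscColStart[c+1]; k++)
                yt[cscRows[k]] += cscNz[k] * x[c];
        #pragma omp barrier
        /* reduction: each thread sums one disjoint slice of all P copies into y */
        int lo = (int)((long long)m * t / P), hi = (int)((long long)m * (t+1) / P);
        for (int p = 0; p < P; p++)
            for (int i = lo; i < hi; i++)
                y[i] += ybuf[(size_t)p * m + i];
    }
}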

In our implementation of multithreaded SpMV, we use row partitioning with all three scheduling schemes provided by OpenMP. We also apply a static balancing scheme, non-zero scheduling [18], which is based on the number of non-zero elements: each thread is assigned approximately the same number of non-zero elements, and thus a similar number of floating-point operations. To implement non-zero scheduling on top of the default static scheduling, we define an array nzStart[P+1], where P is the number of threads; nzStart[i] is the index of the first row assigned to the ith thread, and correspondingly nzStart[i+1] - 1 is the index of the last row assigned to that thread, with nzStart[P] = m. The cost of initializing nzStart is negligible, since we only need to scan the csrRowStart array once. The value of P is obtained at runtime through the environment variable OMP_NUM_THREADS.
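A minimal sketch of this initialization follows (our illustration; the function name and the greedy threshold rule are assumptions, not the authors' exact code): a single pass over csrRowStart opens thread t's row range once t shares of the non-zeros have been consumed. In practice P could also be queried portably with omp_get_max_threads().

/* Fill nzStart[0..P] so each thread gets roughly nz/P non-zero elements. */
void init_nzstart(const int *csrRowStart, int m, int P, int *nzStart)
{
    long long nz = csrRowStart[m];
    int t = 1;
    nzStart[0] = 0;
    for (int i = 1; i < m && t < P; i++)
        /* row i starts thread t's range once t shares of non-zeros are passed */
        if ((long long)csrRowStart[i] * P >= nz * t)
            nzStart[t++] = i;
    while (t <= P)
        nzStart[t++] = m;   /* remaining boundaries; in particular nzStart[P] = m */
}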

The implementation based on CSR with non-zero scheduling is shown in Figure 4.

Figure 4. Pseudo code of multithreaded SpMV with non-zero scheduling

//The following line is an OpenMP directive.
#pragma omp parallel for private(i, j)
for (int t = 0; t < P; t++) {
    for (i = nzStart[t]; i < nzStart[t+1]; i++) {
        double temp = y[i];
        for (j = csrRowStart[i]; j < csrRowStart[i+1]; j++) {
            int index = csrCols[j];
            temp += csrNz[j] * x[index];
        }
        y[i] = temp;
    }
}

5. Experimental Evaluation

In this section, we evaluate the performance of multithreaded SpMV with different storage formats and scheduling schemes. The testing environment is a Dawning S4800A1 with the following configuration: four dual-core 2 GHz AMD Opteron 870 processors, 1 MB L2 cache, 16 GB DDR2 RAM, and a 2 TB SCSI disk. The operating system is Turbo Linux 3.4.3-9.2. We compiled our code with the Intel compiler 9.1 using the optimization option -O3.

Considering the number of non-zero elements and the matrix size (small, medium, large), we collected 8 test sparse matrices from a popular collection, the UF Sparse Matrix Collection [21]. The details of the experimental matrices are shown in Table 2.

Table 2. Properties of the test matrices

Name          Rows     Columns  Non-zeros
bcsstk17.RSA  10974    10974    219812
bcsstk28.RSA  4410     4410     111717
af23560.rsa   23560    23560    484256
epb1.rua      14734    14734    95053
raefsky2.rb   3242     3242     294276
raefsky3.rb   21200    21200    1488768
twotone.rua   120750   120750   1224224
venkat01.rb   62424    62424    1717792


Our implementations use OpenMP with the default static scheduling scheme. For the BCSR format, we chose 2*2 as the block size. The number of iterations is 10000. To compare the CSR and BCSR formats, we chose four matrices: bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA, covering large, medium and small sizes.

As shown in Figure 5 - Figure 8, BCSR performs better than CSR for most matrices, since the BCSR format improves data access locality and reuses data in registers. The improvement averages about 28.09% over the four matrices.

Figure 5. Computing time (sec) vs. number of threads with CSR and BCSR for the bcsstk17.RSA matrix

Figure 6. Computing time (sec) vs. number of threads with CSR and BCSR for the raefsky3.rb matrix

Figure 7. Computing time (sec) vs. number of threads with CSR and BCSR for the epb1.rua matrix

Figure 8. Computing time (sec) vs. number of threads with CSR and BCSR for the bcsstk28.RSA matrix

The speedups we obtained for CSR and BCSR are shown in Figure 9 and Figure 10. Most of the test matrices achieve a scalable, even super-linear, speedup for both CSR and BCSR as the number of threads increases. With eight threads, the speedup ranges from 1.82 to 14.68 for the CSR format and from 1.93 to 18.74 for the BCSR format; the average eight-thread speedup is 8.81 for CSR and 9.76 for BCSR. However, three matrices (raefsky3.rb, twotone.rua and venkat01.rb) achieve only a limited speedup, less than 2.0 at eight threads, and cannot achieve higher speedup once the number of threads exceeds 3. We observed that these are exactly the three largest matrices in the test set: each contains over one million non-zero elements, and correspondingly the vector x they are multiplied with is longer than for the other matrices. As vector x is stored contiguously in memory, the longer it is, the more TLB misses irregular accesses to it will incur. This is likely one reason why we obtained only a limited speedup for the large matrices.

Figure 9. Speedup of CSR for the 8 test matrices, 1 to 8 threads

Figure 10. Speedup of BCSR for the 8 test matrices, 1 to 8 threads

We also evaluated the three scheduling schemes provided by OpenMP and non-zero scheduling for the CSR format, again on bcsstk17.RSA, raefsky3.rb, epb1.rua and bcsstk28.RSA. The performance of these scheduling schemes is shown in Figure 11 - Figure 14. In general, non-zero scheduling performs better than the three schemes provided by OpenMP. Although guided scheduling partitions the matrix into fewer blocks than dynamic scheduling, it may incur more runtime overhead per partition. Non-zero scheduling and the default static scheduling have similar runtime overhead, because our non-zero scheduling is implemented on top of the default static scheduling, which is determined statically and has the least runtime overhead. However, under non-zero scheduling all threads get a similar number of floating-point operations, so it performs better than the default static scheduling in most cases.

Figure 11. Different scheduling schemes with CSR for the bcsstk17.RSA matrix

Figure 12. Different scheduling schemes with CSR for the raefsky3.rb matrix

Figure 13. Different scheduling schemes with CSR for the epb1.rua matrix

Figure 14. Different scheduling schemes with CSR for the bcsstk28.RSA matrix

6. Conclusions and future work

In this paper, we implemented and evaluated multithreaded SpMV using OpenMP. We evaluated two storage formats for the sparse matrix, as well as the scheduling schemes provided by OpenMP and the non-zero scheduling scheme. In most cases, non-zero scheduling performs better than the schemes provided by OpenMP. Our implementation obtained satisfactory scalability for most matrices, except for some large ones. To solve this problem, our future work will consider block partitioning: since a large matrix would be partitioned into several smaller submatrices, block partitioning should yield better data access locality for vectors x and y. We will also implement a hybrid parallelization using MPI and OpenMP for distributed-memory parallel machines.

7. Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under contracts No. 60303020 and No. 60533020, and the National 863 Plan of China under contracts No. 2006AA01A102 and No. 2006AA01A125. We thank the reviewers for their careful reviews and helpful suggestions.

8. References

[1] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23-29, Jul-Aug 1999.

[2] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, fourth edition. Morgan Kaufmann, San Francisco, 2006.

[3] E.-J. Im, K. Yelick, R. Vuduc. Sparsity: Optimization Framework for Sparse Matrix Kernels. International Journal of High Performance Computing Applications, Vol. 18, No. 1, 135-158, 2004.

[4] E.-J. Im and K. A. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127-136, San Francisco, CA, May 2001. Springer.

[5] Berkeley Benchmarking and OPtimization (BeBOP) Project. http://Bebop.cs.berkeley.edu.

[6] R. Vuduc, J. Demmel, K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Proceedings of SciDAC 2005, Journal of Physics: Conference Series, June 2005.

[7] R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.

[8] R. W. Vuduc. Automatic Performance Tuning of Sparse Matrix Kernels. PhD thesis, Computer Science Division, U.C. Berkeley, December 2003.

[9] R. Nishtala, R. W. Vuduc, J. W. Demmel, K. A. Yelick. Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply. Report No. UCB/CSD-04-1335.

[10] R. Nishtala, R. Vuduc, J. W. Demmel, K. A. Yelick. When Cache Blocking of Sparse Matrix Vector Multiply Works and Why.

[11] E.-J. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, UC Berkeley, Berkeley, CA, USA, 2000.

[12] S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith. Efficient management of parallelism in object oriented numerical software libraries. In E. Arge, A. M. Bruaset, and H. P. Langtangen, editors, Modern Software Tools in Scientific Computing, pages 163-202, 1997.

[13] J. W. Willenbring, A. A. Anda, M. Heroux. Improving sparse matrix-vector product kernel performance and availability. In Proc. Midwest Instruction and Computing Symposium, Mt. Pleasant, IA, 2006.

[14] E. Im, K. Yelick. Optimizing sparse matrix-vector multiplication on SMPs. In 9th SIAM Conference on Parallel Processing for Scientific Computing. SIAM, March 1999.

[15] J. C. Pichel, D. B. Heras, J. C. Cabaleiro, F. F. Rivera. Improving the locality of the sparse matrix-vector product on shared memory multiprocessors. In PDP, pages 66-71. IEEE Computer Society, 2004.

[16] U. V. Catalyurek, C. Aykanat. Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. Lecture Notes in Computer Science, 1117:75-86, 1996.

[17] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In Proceedings of Supercomputing, November 2007.

[18] K. Kourtis, G. Goumas, N. Koziris. Improving the Performance of Multithreaded Sparse Matrix-Vector Multiplication Using Index and Value Compression. In Proceedings of the 37th International Conference on Parallel Processing, Washington, DC, USA, 2008, pp. 511-519.

[19] A. Jain. Master's Report: pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures.

[20] J. Lai, C. Hu. Analysis of Task Schedule Overhead and Load Balance in OpenMP. Computer Engineering, 2006, 32(18):58-60.

[21] Matrix Market. http://math.nist.gov/MatrixMarket/
