
1

Software Engineering Seminar


Sebastian Hafen
An Auto-Tuning Framework
for Parallel Multicore Stencil Computations
Shoaib Kamil, Cy Chan, Leonid Oliker, John Shalf, Samuel Williams
2
Stencils
3
What is a Stencil Computation?

Nearest-neighbor computations

E.g. finite differences between data points

Sweeps over a structured grid

Like an n-dimensional array

Iterative: i → i+1 → i+2


Left two: http://iopscience.iop.org/1749-4699/2/1/015005/fulltext
Middle: http://en.wikipedia.org/wiki/Stencil_(numerical_analysis)
Right: http://en.wikipedia.org/wiki/Five-point_stencil
4
Example: 2D 5-Point Stencil
//Stencil-loop
do k=2, xLength-1, 1
    do i=2, yLength-1, 1
        writeArray[k][i] = useStencil(k,i)
    enddo
enddo
//Stencil-function
function useStencil(k,i)
    int result = readArray[k][i]
               + readArray[k+1][i]
               + readArray[k-1][i]
               + readArray[k][i+1]
               + readArray[k][i-1]
    result = result/5
    return result
endfunction
Points read by the stencil: (k+1,i), (k,i), (k-1,i), (k,i-1), (k,i+1)
5
Example
readArray:
5 2 3 1 2 8 4
1 3 3 7 3 3 1
9 8 7 6 5 4 3
11 22 33 44 55 66 77
1 2 4 8 16 32 64
2 3 2 3 3 4 4
writeArray (interior values computed so far): 3
(2+1+3+3+8)/5 = 3 (integer division)
6
Example
readArray:
5 2 3 1 2 8 4
1 3 3 7 3 3 1
9 8 7 6 5 4 3
11 22 33 44 55 66 77
1 2 4 8 16 32 64
2 3 2 3 3 4 4
writeArray (interior values computed so far): 3 4
(3+3+3+7+7)/5 = 4 (integer division)
7
Example
readArray:
5 2 3 1 2 8 4
1 3 3 7 3 3 1
9 8 7 6 5 4 3
11 22 33 44 55 66 77
1 2 4 8 16 32 64
2 3 2 3 3 4 4
writeArray (interior values computed so far): 3 4 4
(1+3+7+3+6)/5 = 4
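The three steps above can be reproduced with a short runnable sketch. Python is used purely for illustration (the slides use pseudocode). The names `READ_ARRAY`, `use_stencil` and `sweep` are assumptions, indices are 0-based instead of 1-based, and border cells are simply copied, which the slides do not specify.

```python
# 5-point stencil sweep over the example grid from the slides.
READ_ARRAY = [
    [5, 2, 3, 1, 2, 8, 4],
    [1, 3, 3, 7, 3, 3, 1],
    [9, 8, 7, 6, 5, 4, 3],
    [11, 22, 33, 44, 55, 66, 77],
    [1, 2, 4, 8, 16, 32, 64],
    [2, 3, 2, 3, 3, 4, 4],
]


def use_stencil(grid, k, i):
    """Average of a cell and its four nearest neighbors (integer division)."""
    total = (grid[k][i] + grid[k + 1][i] + grid[k - 1][i]
             + grid[k][i + 1] + grid[k][i - 1])
    return total // 5


def sweep(grid):
    """One sweep: read from `grid`, write into a separate output array."""
    out = [row[:] for row in grid]  # assumption: border cells are copied unchanged
    for k in range(1, len(grid) - 1):
        for i in range(1, len(grid[0]) - 1):
            out[k][i] = use_stencil(grid, k, i)
    return out


write_array = sweep(READ_ARRAY)
print(write_array[1][1:4])  # → [3, 4, 4], the three values from the slides
```

Reading from one array and writing to another means no point ever sees a value updated during the same sweep, so the points can be computed in any order.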
8
Example from the paper: Gradient
[Picture from paper]
9
Why?

Solving Partial Differential Equations

Used by many branches of science

Heat equations

Wave equations

“Automatic beam path analysis of laser wakefield particle acceleration data”

...
Quote: title of the paper at http://iopscience.iop.org/1749-4699/2/1/015005/fulltext
Images: http://www.math.uwaterloo.ca/~fpoulin/Files_html/fpcmresearch.html
10
Characteristics of stencil computations

High memory traffic

Low arithmetic intensity

CPUs can handle the arithmetic


Computations are memory bound

Auto-tuning for better memory access management


//Stencil-function
function useStencil(k,i)
    int result = readArray[k][i]
               + readArray[k+1][i]
               + readArray[k-1][i]
               + readArray[k][i+1]
               + readArray[k][i-1]
    result = result/5
    return result
endfunction
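A rough back-of-the-envelope estimate shows why this kernel is memory bound. The sketch below counts operations and bytes per grid point under stated assumptions: 8-byte floating-point values (the pseudocode above uses `int`), four additions plus one division counted as five operations, and two extreme cases for cache behavior. None of these numbers come from the paper.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Operations performed per byte of memory traffic."""
    return flops / bytes_moved


WORD = 8             # assumption: 8-byte double-precision values
FLOPS_PER_POINT = 5  # four additions and one division

# Worst case: all five reads and the one write go to main memory.
naive = arithmetic_intensity(FLOPS_PER_POINT, (5 + 1) * WORD)

# Best case: neighbors are reused from cache, so each point costs
# one read and one write of main-memory traffic.
ideal = arithmetic_intensity(FLOPS_PER_POINT, (1 + 1) * WORD)

print(f"no reuse:      {naive:.3f} flop/byte")
print(f"perfect reuse: {ideal:.3f} flop/byte")
```

Even with perfect reuse the kernel performs well under one operation per byte moved, which is why the tuning effort targets memory access rather than arithmetic.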
11
The Framework
12
Overview

Not the first auto-tuning framework for stencils

But other work covers only static or single kernel instantiations

Proof-of-Concept

Supports a broad range of stencil kernels

Fully generalized framework

Auto-parallelization

Multiple back-end architectures

Even a GPU
13
Framework flow
Reference implementation
→ Parse as AST
→ Myriad of equivalent, optimized implementations
→ Best-performing implementation and configuration parameters
(Inspired by a picture in the paper)
14
Strateg! Engine

Parameter space is massive

Combined serial and parallel optimizations

Decides on an appropriate subset of parameter combinations


(strategies)

Based on the underlying architecture

Knows about correlations between different optimizations

Chooses only legal combinations
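The pruning idea in the list above can be sketched in a few lines. Everything in this example is hypothetical: the parameter names, their value ranges, the legality rules and the `max_unroll` limit are placeholders chosen for illustration, not the framework's actual strategies.

```python
from itertools import product

# Hypothetical parameter ranges; the framework's real values are not in the slides.
CORE_BLOCK = [16, 32, 64, 128]
THREAD_BLOCK = [8, 16, 32]
UNROLL = [1, 2, 4, 8]


def is_legal(core, thread, unroll):
    """Assumed rules: blocks must nest and sizes must divide evenly."""
    return thread <= core and core % thread == 0 and thread % unroll == 0


def strategies(max_unroll):
    """Legal combinations, pruned by an architecture-dependent limit."""
    return [
        (core, thread, unroll)
        for core, thread, unroll in product(CORE_BLOCK, THREAD_BLOCK, UNROLL)
        if unroll <= max_unroll and is_legal(core, thread, unroll)
    ]


full_space = len(CORE_BLOCK) * len(THREAD_BLOCK) * len(UNROLL)
subset = strategies(max_unroll=4)
print(f"{len(subset)} of {full_space} combinations kept")
```

Encoding the rules as a filter keeps the search small without hard-coding one configuration per machine.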


15
Transformation Engine

Transforms the AST

First applies auto-parallelization

Then uses auto-tuning

Has domain knowledge

Can do transformations a compiler cannot


16
Auto-parallelization

Basically dividing the problem space into blocks

Core blocks, thread blocks and register blocks

Creates new loops for every block

Non-Uniform Memory Access (NUMA)-aware

Separate stencil for the border cases


Image: http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms
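The block decomposition described on this slide can be illustrated with a small sketch. It is a simplification under assumptions: only two levels (core blocks and thread blocks) are shown, thread blocks split a core block along rows only, register blocks are omitted, and the names `split` and `decompose` are invented for this example.

```python
def split(start, stop, size):
    """Cut the half-open range [start, stop) into chunks of at most `size`."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def decompose(n_rows, n_cols, core, thread):
    """Map each core block to its thread blocks for the interior of a grid.

    Blocks are (row_lo, row_hi, col_lo, col_hi) with exclusive upper bounds.
    The first and last row and column are excluded, since border cells
    need a separate stencil.
    """
    plan = {}
    for r_lo, r_hi in split(1, n_rows - 1, core):
        for c_lo, c_hi in split(1, n_cols - 1, core):
            plan[(r_lo, r_hi, c_lo, c_hi)] = [
                (t_lo, t_hi, c_lo, c_hi)
                for t_lo, t_hi in split(r_lo, r_hi, thread)
            ]
    return plan


plan = decompose(n_rows=10, n_cols=10, core=4, thread=2)
for core_block, thread_blocks in plan.items():
    print(core_block, "->", thread_blocks)
```

Each block becomes its own loop nest, so the block sizes turn into tunable parameters.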
17
Auto-parallelization
[Picture from paper]
18
Auto-tuning

Loop unrolling and register blocking

Improves innermost loop efficiency

Cache blocking

Exposes temporal locality and increases cache reuse

Arithmetic simplifications

Many more possible

It is a proof-of-concept
Example for cache blocking: http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch06.html
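Cache blocking changes only the order in which points are visited, not the result. The sketch below tiles the column loop of a 5-point sweep and checks that the output matches the untiled version. The function names and the tile width are assumptions, and Python will not show the speedup itself: the benefit appears in compiled code, where a tile's rows stay in cache while neighboring rows are processed.

```python
def stencil_point(grid, k, i):
    """5-point average with integer division, as in the slides' pseudocode."""
    return (grid[k][i] + grid[k + 1][i] + grid[k - 1][i]
            + grid[k][i + 1] + grid[k][i - 1]) // 5


def sweep_naive(grid):
    """Plain row-by-row sweep over the interior."""
    out = [row[:] for row in grid]
    for k in range(1, len(grid) - 1):
        for i in range(1, len(grid[0]) - 1):
            out[k][i] = stencil_point(grid, k, i)
    return out


def sweep_blocked(grid, block=4):
    """Same sweep with the column loop tiled into strips of width `block`."""
    out = [row[:] for row in grid]
    n_rows, n_cols = len(grid), len(grid[0])
    for tile_start in range(1, n_cols - 1, block):    # loop over column tiles
        tile_end = min(tile_start + block, n_cols - 1)
        for k in range(1, n_rows - 1):                # all rows within one tile
            for i in range(tile_start, tile_end):
                out[k][i] = stencil_point(grid, k, i)
    return out


grid = [[(3 * r + 7 * c) % 11 for c in range(10)] for r in range(8)]
assert sweep_blocked(grid) == sweep_naive(grid)
print("blocked and naive sweeps agree")
```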
19
Search Engine

Runs all the different tuned versions of the stencil kernel

256^3 grids (16,777,216 elements) initialized with random values

User can replace the original kernel with the fastest one
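The search step amounts to timing every candidate and keeping the fastest. The sketch below is a minimal stand-in, not the framework's search engine: `make_variant` and `fastest` are invented names, the "variants" differ only in a column tile width, and a 64x64 random grid replaces the 256^3 grid to keep the run short.

```python
import random
import time


def make_variant(block):
    """Build a column-tiled 5-point sweep; stands in for one tuned kernel."""
    def kernel(grid):
        out = [row[:] for row in grid]
        n_rows, n_cols = len(grid), len(grid[0])
        for tile_start in range(1, n_cols - 1, block):
            tile_end = min(tile_start + block, n_cols - 1)
            for k in range(1, n_rows - 1):
                for i in range(tile_start, tile_end):
                    out[k][i] = (grid[k][i] + grid[k + 1][i] + grid[k - 1][i]
                                 + grid[k][i + 1] + grid[k][i - 1]) / 5
        return out
    return kernel


def fastest(variants, grid, repeats=3):
    """Time each variant and return the name of the quickest plus all timings."""
    timings = {}
    for name, kernel in variants.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            kernel(grid)
            best = min(best, time.perf_counter() - start)
        timings[name] = best  # best-of-N reduces noise from other processes
    return min(timings, key=timings.get), timings


random.seed(0)
grid = [[random.random() for _ in range(64)] for _ in range(64)]
variants = {f"block={b}": make_variant(b) for b in (4, 16, 64)}
winner, timings = fastest(variants, grid)
print("fastest variant:", winner)
```

Which variant wins depends on the machine, which is the point of searching empirically instead of predicting the best configuration.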
20
Limitations

Only 2D or 3D

Only arrays

No sophisticated data structures

Only arithmetic stencils

They want to change that in future work


21
Code Generator

Creates code from the modified ASTs

For the CPUs: pthreads

For the GPU: CUDA thread blocks

Serial Fortran and C code also possible


22
Tested
Stencils and Architectures
23
Used Stencils
[Picture from paper]
Laplacian Stencil, Divergence Stencil, Gradient Stencil
24
Used Architectures
[Picture from paper]
25
Results
26
One Result
[Pictures from paper]
Laplacian
27
Results
[Pictures from paper]
28
Conclusion

Pro

It does work. Concept is proven

Fully general

Performance comparable to hand-optimized code

“Programmer Production Benefits”

Few minutes to annotate code

Contra

OpenMP works well too

New architecture means new coding

Peak not yet reached


Quote from paper
29
End of Presentation
