
Experiences with Coarrays
DEISA/PRACE Spring School, 30th March 2011
Parallel Programming with Coarray Fortran

Overview
Implementations
Performance considerations
Where to use the coarray model
Examples of coarrays in practice
References
Wrap-up

Implementation Status

History of coarrays dates back to Cray implementations
Expect support from vendors as part of Fortran 2008
G95 had multi-image support in 2010
gfortran: work progressing (4.6 trunk) for single-image support
Intel: multi-process coarray support in Intel Composer XE 2011
(based on the Fortran 2008 draft)

Runtimes are SMP, GASNet and compiler/vendor runtimes

GASNet has support for multiple environments
(IB, Myrinet, MPI, UDP and Cray/IBM systems), so it
could be an option for new implementations

Implementation Status (Cray)
Cray has supported coarrays and UPC on various architectures
over the last decade (from the T3E)
Full PGAS support on the Cray XT/XE
Cray Compiling Environment 7.0, Dec 2008
Cray Compiler Environment 7.2, Jan 2010
Full Fortran 2008 coarray support
Full Fortran 2003 with some Fortran 2008 features
Fully integrated with the Cray software stack
Same compiler drivers, job launch tools, libraries
Integrated with CrayPat, the Cray performance tools
Can mix MPI and coarrays

Implementations we have used
Cray X1/X2
Hardware supports communication by direct load/store
Very efficient with low overhead
Cray XT
PGAS (UPC, CAF) layered on GASNet/Portals (so messaging)
Not that efficient
Cray XE
PGAS layered on DMAPP, a portable layer over the Gemini
network hardware
Intermediate between the XT and X1/X2
g95 on shared memory
Using cloned process images on Linux

g95 on Ubuntu

When to use coarrays
Two obvious contexts:
Complete application using coarrays
Mixed with MPI
As an incremental addition to a (potentially large) serial code
As an incremental addition to an MPI code (allowing use of
MPI collectives)
Use coarrays for some of the communication:
opportunity to express communication much more simply
opportunity to overlap communication
For subset synchronisation
Work-sharing schemes

Adding coarrays to existing applications
Constrain use of coarrays to part of the application:
Move relevant data into coarrays
Implement the parallel part with coarray syntax
Move data back to the original structures
Use coarray structures to contain pointers to existing data
Place relevant arrays in global scope (modules):
avoids multiple declarations
Declare existing arrays as coarrays at the top level and through
the complete call tree
(some effort, but only requires changes to declarations; see the sketch below)
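
As a minimal sketch (ours, not from the original slides; the module and
variable names are illustrative), the global-scope approach can look like
this: the existing array moves into a module and gains a codimension, so
only declarations change while the computational code keeps using the
same name.

module work_data
  implicit none
  real, allocatable :: u(:,:)[:]   ! was: real, allocatable :: u(:,:)
contains
  subroutine setup(n)
    integer, intent(in) :: n
    allocate( u(n,n)[*] )          ! collective allocation, same shape on every image
  end subroutine setup
end module work_data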

Performance Considerations
What is the latency?
Do you need to avoid strided transfers?
Is the compiler optimising the communication for the target
architecture?
Is it using blocking communication within a segment when
it does not need to?
Is it optimising strided communication?
Can it pattern-match loops to single communication
primitives or collectives? (see the sketch below)
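
As an illustration (ours; the variable names are placeholders) of the
pattern-matching question above: an element-by-element loop over remote
data may become many small gets, while the equivalent array assignment
gives the compiler an obvious single large transfer.

integer, parameter :: n = 1000
real    :: a(n)[*], buf(n)
integer :: i, p
p = 1                      ! some remote image
do i = 1, n                ! element by element: may become n small remote gets
  buf(i) = a(i)[p]
end do
buf(:) = a(:)[p]           ! array assignment: one remote get of n elements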

Performance: Communication patterns
Try to avoid creating traffic jams on the network, such as all
images storing to a single image.

The following examples show two ways to implement an
all-reduce function using coarrays.

AllReduce (everyone gets)
All images get data from the others simultaneously
function allreduce_max_allget(v) result(vmax)
  double precision :: vmax, v[*]
  integer :: i
  sync all
  vmax = v
  do i = 1, num_images()
    vmax = max(vmax, v[i])
  end do
end function allreduce_max_allget

AllReduce (everyone gets, optimized)
All images get data from the others simultaneously, but this is
optimized so the communication is more balanced
!...
sync all
vmax = v
do i = this_image()+1, num_images()
  vmax = max(vmax, v[i])
end do
do i = 1, this_image()-1
  vmax = max(vmax, v[i])
end do

We have seen this version run much faster: each image starts reading
from a different image, so the remote loads are spread across the
machine rather than all hitting image 1 first.

Synchronization
For some algorithms (finite difference etc.) don't use
sync all
but pairwise synchronization using sync images(image), as sketched below
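
A minimal sketch (ours, not from the slides) of pairwise synchronisation
for a 1-D decomposition: each image synchronises only with its left and
right neighbours instead of calling sync all.

integer :: me, np
me = this_image()
np = num_images()
if (me > 1 .and. me < np) then
  sync images( [me-1, me+1] )     ! interior image: both neighbours
else if (me == 1 .and. np > 1) then
  sync images( me+1 )             ! first image: right neighbour only
else if (me == np .and. np > 1) then
  sync images( me-1 )             ! last image: left neighbour only
end if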


Synchronization (one to many)
Often one image will be communicating with a set of images
In general not a good thing to do, but assume we are...

Tempting to use sync all

Synchronisation (one to many)
If this is all images then we could do
if ( this_image() == 1 ) then
  sync images(*)
else
  sync images(1)
end if

Note that sync all is likely to be fast, so it is an alternative

Synchronisation (one to many)
For a subset, use this:
if ( this_image() == image_list(1) ) then
  sync images(image_list)
else
  sync images(image_list(1))
end if

instead of sync images(image_list)
on all of them, which is likely to be slower

Collective Operations
If your implementation does not have collectives and you need
scalability to a large number of images:
Use MPI for the collectives if MPI + coarrays is supported
Implement your own, but this might be hard:
For reductions of scalars a tree will be the best to try
(a sketch follows below)
For reductions of more data you would have to experiment,
and this may depend on the topology
Coarrays can be good for collective operations where:
there is an unusual communication pattern that does not
match what MPI collectives provide
there is opportunity to overlap communication
with computation
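
A hedged sketch (ours, not from the slides) of the scalar tree reduction
mentioned above: a binary-tree maximum whose result ends up on image 1,
using pairwise sync images between partners at each level.

double precision, save :: v[*]
integer :: me, step, partner
me = this_image()
sync all                            ! make sure every image has defined v
step = 1
do while (step < num_images())
  if (mod(me-1, 2*step) == 0) then
    partner = me + step
    if (partner <= num_images()) then
      sync images(partner)          ! wait until the partner's contribution is ready
      v = max(v, v[partner])
    end if
  else if (mod(me-1, 2*step) == step) then
    partner = me - step
    sync images(partner)            ! signal that our partial result can be read
  end if
  step = 2*step
end do
sync all                            ! image 1 now holds the maximum;
                                    ! broadcasting it back would need another sweep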

Tools: debugging and profiling
Tool support should improve once coarray take-up increases
Cray: the CrayPat tool supports coarrays
Totalview works with coarray programs on Cray systems
Scalasca:
currently investigating how PGAS support can be
incorporated

Debugging synchronisation problems
The one-sided model is tricky because subtle synchronisation
errors change data
TRY TO GET IT RIGHT FIRST TIME
Look carefully at the remote operations in the code
Think about the synchronisation of segments
Especially look for early-arriving communications trashing
your data at the start of loops (this one is easy to miss)
One way to test is to put sleep() calls in the code (see the sketch below)
Delay one or more images
Delay the master image, or other images for some patterns
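
A small illustration (ours) of the delay trick, using the standard
execute_command_line rather than a vendor-specific sleep() intrinsic:
holding back one image makes "early" communication from the others much
more likely to expose an ordering bug.

if (this_image() == 1) call execute_command_line('sleep 2')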


Examples of coarrays in practice
Work-sharing examples:
puzzles
many-body deformation code
Distributed remote gather
Example of incorporating coarrays into an existing application

Solving Sudoku Puzzles

Going Parallel
Started with the serial code
Changed to read in all 125,000 puzzles at the start
Choose a work-sharing strategy:
One image (image 1) holds a queue of puzzles to solve
Each image picks work from the queue and writes the result
back to the queue

Arbitrarily decide to parcel the work as
blocksize = npuzzles / (8*num_images())

Data Structures
use, intrinsic :: iso_fortran_env
type puzzle
  integer :: input(9,9)
  integer :: solution(9,9)
end type puzzle
type queue
  type(lock_type) :: lock
  integer :: next_available = 1
  type(puzzle), allocatable :: puzzles(:)
end type queue
type(queue), save :: workqueue[*]
type(puzzle)      :: local_puzzle
integer, save     :: npuzzles[*], blocksize[*]

Input
if (this_image() == 1) then
  ! After file setup.
  inquire (unit=inunit, size=nbytes)
  nrecords = nbytes/10
  npuzzles = nrecords/9
  blocksize = npuzzles / (num_images()*8)
  write (*,*) "Found ", npuzzles, " puzzles."
  allocate (workqueue%puzzles(npuzzles))
  do i = 1, npuzzles
    call read_puzzles( &
         workqueue%puzzles(i)%input, inunit, error)
  end do
  close(inunit)
end if

Core program structure
! After coarray data loaded
sync all
blocksize = blocksize[1]
npuzzles = npuzzles[1]
done = .false.
workloop: do
  ! Acquire lock and claim work
  ! Solve our puzzles
end do workloop

Acquire lock and claim work
! Reserve the next block of puzzles
lock (workqueue[1]%lock)
next = workqueue[1]%next_available
if (next <= npuzzles) then
  istart = next
  iend   = min(npuzzles, next+blocksize-1)
  workqueue[1]%next_available = iend+1
else
  done = .true.
end if
unlock (workqueue[1]%lock)
if (done) exit workloop

Solve the puzzles and write back
! Solve those puzzles
do i = istart, iend
  local_puzzle%input = workqueue[1]%puzzles(i)%input
  call sudoku_solve &
       (local_puzzle%input, local_puzzle%solution)
  workqueue[1]%puzzles(i)%solution = local_puzzle%solution
end do

Output the solutions
! Need to synchronize puzzle output updates
sync all
if (this_image() == 1) then
  open (outunit, file=outfile, iostat=error)
  do i = 1, npuzzles
    call write_puzzle &
         (workqueue%puzzles(i)%input, &
          workqueue%puzzles(i)%solution, outunit, error)
  end do
end if

More on the Locking
We protected access to the queue state by lock and unlock
During this time no other image can acquire the lock
We need the discipline to only access the data within the
window when we hold the lock
There is no connection between the lock variable and the other
elements of the queue structure

The unlock is acting like sync memory:
if one image executes an unlock...
another image getting the lock is ordered after the
first image

Summary and Commentary
We implemented solving the puzzles using a work-sharing
scheme with coarrays

Scalability is limited by the serial work done by image 1:
I/O
Parallel I/O (deferred to a TR) with multiple images running
distributed work queues.
Defer the character-to-integer format conversion to the
solver, which is executed in parallel.
Lock contention
Could use distributed work queues,
each with its own lock.

Impact Application
Study of the impact and deformation of multiple bodies

Bodies deform on contact

Impact application: challenges and approach
Challenges
Body contact is unpredictable
Much more computation in contact regions
Very poor load balance limits scalability
Approach
Replicate contact data on all images (cores)
Divide contact data into segments (e.g. 5 times as many
segments as images)
Use coarrays to assign data segments to images by atomically
incrementing the global segment index (see the sketch below)
Images process more or fewer segments, depending on the
computational work in each segment
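
A hedged sketch (ours, not the application's actual code) of that
work-dispatch idea: the global segment counter lives on image 1 and is
advanced under a lock, so each image claims the next unprocessed segment.
Here nsegments and process_segment are placeholders.

use, intrinsic :: iso_fortran_env, only : lock_type
type(lock_type), save :: counter_lock[*]
integer, save :: next_segment[*] = 1
integer :: iseg

do
  lock (counter_lock[1])          ! serialise access to the shared counter
  iseg = next_segment[1]          ! claim the next segment index
  next_segment[1] = iseg + 1
  unlock (counter_lock[1])
  if (iseg > nsegments) exit      ! nsegments: placeholder for the segment count
  call process_segment(iseg)      ! placeholder for the contact work on one segment
end do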

Results
After implementing the work-dispatch method in the contact algorithm:

Code version                          Contact algorithm       Comm. & sync. time    Load imbalance
                                      elapsed time (fraction  (contact algorithm)   (contact algorithm)
                                      of entire code)
Original MPI                          15%                     75%                   28%
Coarray version with work dispatch    10%                     35%                    1%

Big improvement in load balancing

Distributed remote gather
The problem is how to implement the following gather loop on
a distributed memory system:
REAL    :: table(n), buffer(nelts)
INTEGER :: index(nelts)
! nelts << n
...
DO i = 1, nelts
  buffer(i) = table(index(i))
ENDDO

The array table is distributed across the processors, while
index and buffer are replicated
Synthetic code, but it simulates irregular communication
access patterns

Remote gather: MPI implementation
MPI rank 0 controls the index and receives the values from the
other ranks

IF (mype.eq.0) THEN
  isum = 0
  ! PE0 gathers indices to send out to individual PEs
  DO i = 1, nelts
    pe = (index(i)-1)/nloc
    isum(pe) = isum(pe)+1
    who(isum(pe),pe) = index(i)
  ENDDO
  ! send out count and indices to PEs
  DO i = 1, npes-1
    CALL MPI_SEND(isum(i),1,MPI_INTEGER,i,10,...
    IF (isum(i).gt.0) THEN
      CALL MPI_SEND(who(1,i),isum(i),...
    ENDIF
  ENDDO
  ! now wait to receive values and scatter them
  DO i = 1, isum(0)
    offset = mod(who(i,0)-1,nloc)+1
    buff(i,0) = mpi_table(offset)
  ENDDO
  DO i = 1, npes-1
    IF (isum(i).gt.0) THEN
      CALL MPI_RECV(buff(1,i),isum(i),...
    ENDIF
  ENDDO
  ! reorder the received values into the result buffer
  DO i = nelts, 1, -1
    pe = (index(i)-1)/nloc
    offset = isum(pe)
    mpi_buffer(i) = buff(offset,pe)
    isum(pe) = isum(pe) - 1
  ENDDO
ELSE   ! my_rank.ne.0
  ! Each PE gets the list and sends the values to PE0
  CALL MPI_RECV(my_sum,1,MPI_INTEGER,...
  IF (my_sum.gt.0) THEN
    CALL MPI_RECV(index,my_sum,MPI_INTEGER,...
    DO i = 1, my_sum
      offset = mod(index(i)-1,nloc)+1
      mpi_buffer(i) = mpi_table(offset)
    ENDDO
    CALL MPI_SEND(mpi_buffer,my_sum,...
  ENDIF
ENDIF

Remote gather: coarray implementation (get)
Image 1 gets the values from the other images

IF (myimg.eq.1) THEN
  DO i = 1, nelts
    pe = (index(i)-1)/nloc+1
    offset = MOD(index(i)-1,nloc)+1
    caf_buffer(i) = caf_table(offset)[pe]
  ENDDO
ENDIF

Remote gather: coarray vs MPI
The coarray implementation is much simpler
Coarray syntax allows the expression of remote data in a
natural way, with no need for complex protocols
The coarray implementation is orders of magnitude faster for
small numbers of indices

Adding coarrays with minimal code modifications
Coarray structures can be inserted into an existing MPI code to
simplify and improve communication without affecting the
general code structure
In some cases the use of Fortran pointers can simplify the
introduction of coarray variables in a routine without
modifying the calling tree
Code modifications are mostly limited to the routine where
coarrays must be introduced
A coarray is not permitted to be a pointer; however, a coarray
may be of a derived type with pointer or allocatable
components

Example: adding coarrays to existing code
In subroutine sub2 we want to use coarrays for
communication
Array x is defined and allocated in the main program and is
passed as an argument to subroutine sub1 and then to sub2

PROGRAM main
  REAL, ALLOCATABLE :: x(:)
  ALLOCATE (x(n))
  CALL sub1(x,n)

SUBROUTINE sub1(x,n)
  REAL :: x(n)
  x(:) = ...
  CALL sub2(x,n)

SUBROUTINE sub2(x,n)
  REAL :: x(n)
  CALL MPI_Irecv(n,rx,...
  CALL MPI_Send(n,x,...

Modify the routine where coarrays are used
In sub2 add the derived type CAFP, which contains just a
pointer, and declare a coarray variable xp of type CAFP
Each image sets xp to point to its array x
Array x on image p can then be referenced as xp[p]%ptr

SUBROUTINE sub2(x,n)
  REAL, TARGET :: x(n)

  ! Add derived type CAFP
  TYPE CAFP
    REAL, POINTER :: ptr(:)
  END TYPE CAFP

  ! Declare xp of type CAFP
  TYPE(CAFP) :: xp[*]

  ! Each image sets the pointer and syncs
  xp%ptr => x(1:n)
  sync all

  ! Array x on image k can be referenced through the
  ! derived type pointer:  rx = x[k]
  rx(1:n) = xp[k]%ptr(1:n)

References
Numrich, R. W. and Reid, J. K. (1998). Co-array Fortran for parallel
programming. RAL Report 98-060.
http://lacsi.rice.edu/software/caf/downloads/documentation/nrRAL98060.pdf
Reid, J. (April 2010). Coarrays in the next Fortran Standard.
ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1824.pdf
Ashby, J. V. and Reid, J. K. (2008). Migrating a scientific
application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015.
See http://www.numerical.rl.ac.uk/reports/reports.shtml
Unified Parallel C at George Washington University: http://upc.gwu.edu/
Berkeley Unified Parallel C Project: http://upc.lbl.gov/

Wrap-up
Remember our first Motivation slide?
Fortran now supports parallelism as a full first-class feature of
the language
Changes are minimal
Performance is maintained
Flexibility in expressing communication patterns
We hope you learned something and have success
with coarrays in the future

Acknowledgements
This material is based on original content developed by EPCC at
The University of Edinburgh for use in teaching the MSc in High
Performance Computing. The following people contributed to its
development:
Alan Simpson, Michele Weiland, Jim Enright and Paul Graham
The material was subsequently developed by EPCC and Cray with
contributions from the following people:
David Henty, Alan Simpson (EPCC)
Harvey Richardson, Bill Long, Roberto Ansaloni,
Jef Dawson, Nathan Wichmann (Cray)

This material is Copyright 2010 by The University of Edinburgh
and Cray Inc.
