Coarrays
DEISA/PRACE Spring School, 30th March 2011
Parallel Programming with Coarray Fortran
Overview
Implementations
Performance considerations
Where to use the coarray model
Examples of coarrays in practice
References
Wrap-up
Implementation Status
History of coarrays dates back to Cray implementations
Expect support from vendors as part of Fortran 2008
G95 had multi-image support in 2010
gfortran: work progressing (4.6 trunk) for single-image support
Intel: multi-process coarray support in Intel Composer XE 2011 (based on the Fortran 2008 draft)
Runtimes are SMP, GASNet and compiler/vendor runtimes
GASNet has support for multiple environments (IB, Myrinet, MPI, UDP and Cray/IBM systems), so it could be an option for new implementations
Implementation Status (Cray)
Cray has supported coarrays and UPC on various architectures over the last decade (from the T3E)
Full PGAS support on the Cray XT/XE
Cray Compiling Environment 7.0 (Dec 2008): full Fortran 2003 with some Fortran 2008 features
Cray Compiling Environment 7.2 (Jan 2010): full Fortran 2008 coarray support
Fully integrated with the Cray software stack
Same compiler drivers, job launch tools, libraries
Integrated with CrayPAT, the Cray performance tools
Can mix MPI and coarrays
Implementations we have used
Cray X1/X2
Hardware supports communication by direct load/store
Very efficient with low overhead
Cray XT
PGAS (UPC, CAF) layered on GASNet/Portals (so messaging)
Not that efficient
Cray XE
PGAS layered on DMAPP, a portable layer over the Gemini network hardware
Intermediate between the XT and X1/X2
g95 on shared memory
Using cloned process images on Linux
g95 on Ubuntu
When to use coarrays
Two obvious contexts:
Complete application using coarrays
Mixed with MPI
As an incremental addition to a (potentially large) serial code
As an incremental addition to an MPI code (allowing use of MPI collectives)
Use coarrays for some of the communication
opportunity to express communication much more simply
opportunity to overlap communication
For subset synchronisation
Work-sharing schemes
Adding coarrays to existing applications
Constrain use of coarrays to part of the application:
Move relevant data into coarrays
Implement the parallel part with coarray syntax
Move the data back to the original structures
(a sketch of this copy-in/copy-out approach follows below)
Use coarray structures to contain pointers to existing data
Place relevant arrays in global scope (modules): avoids multiple declarations
Declare existing arrays as coarrays at the top level and through the complete call tree
(some effort, but only requires changes to declarations)
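A minimal sketch of the copy-in/copy-out approach, assuming a one-dimensional array a(n) whose neighbour's values are wanted; the names a, tmp and n and the neighbour pattern are illustrative, not from the original slides:

real, allocatable :: a(:)
real, allocatable :: tmp(:)[:]       ! staging coarray
allocate (tmp(n)[*])
tmp(:) = a(1:n)                      ! move relevant data into the coarray
sync all                             ! make tmp visible on all images
if (this_image() < num_images()) then
   a(1:n) = tmp(:)[this_image()+1]   ! parallel part in coarray syntax
end if
sync all
deallocate (tmp)                     ! results are back in the original array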
Performance Considerations
What is the latency?
Do you need to avoid strided transfers?
Is the compiler optimising the communication for the target architecture?
Is it using blocking communication within a segment when it does not need to?
Is it optimising strided communication?
Can it pattern-match loops to single communication primitives or collectives? (see the sketch below)
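A sketch of the kind of loop in question, assuming a remote image p and conforming arrays a and b (names are illustrative, not from the slides); a good compiler should turn the second form, and ideally also the first, into a single contiguous transfer:

do i = 1, n
   a(i) = b(i)[p]     ! naively, n small remote gets
end do
a(1:n) = b(1:n)[p]    ! expressible as one contiguous transfer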
Performance: Communication patterns
Try to avoid creating traffic jams on the network, such as all images storing to a single image.
The following examples show two ways to implement an AllReduce() function using coarrays.
AllReduce (everyone gets)
All images get data from the others simultaneously.

function allreduce_max_allget(v) result(vmax)
   double precision :: vmax, v[*]
   integer :: i
   sync all                    ! ensure v is defined on all images
   vmax = v
   do i = 1, num_images()
      vmax = max(vmax, v[i])
   end do
   sync all                    ! don't let v be redefined while others still read it
end function allreduce_max_allget
AllReduce (everyone gets, optimized)
All images get data from the others simultaneously, but this is optimized so communication is more balanced: each image starts reading at a different point in the ring.

!...
sync all
vmax = v
do i = this_image()+1, num_images()
   vmax = max(vmax, v[i])
end do
do i = 1, this_image()-1
   vmax = max(vmax, v[i])
end do

We have seen this be much faster.
Synchronization
For some algorithms (finite difference etc.) don't use
sync all
but pairwise synchronization using sync images(image), as in the sketch below.
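A minimal sketch of pairwise synchronisation with nearest neighbours (assumed, not from the original slides), as used in halo-swap style finite-difference codes:

me = this_image()
if (me > 1)            sync images (me-1)   ! wait only for the left neighbour
if (me < num_images()) sync images (me+1)   ! and the right neighbour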
Synchronization (one to many)
Often one image will be communicating with a set of images.
In general not a good thing to do, but assume we are...
It is tempting to use sync all.
Synchronisation (one to many)
If this is all images, then we could do:

if (this_image() == 1) then
   sync images(*)
else
   sync images(1)
end if
Synchronisation (one to many)
For a subset, use this:

if (this_image() == image_list(1)) then
   sync images(image_list)
else
   sync images(image_list(1))
end if

instead of sync images(image_list) for all of them, which is likely to be slower.
Collective Operations
If your implementation does not have collectives and you need scalability to a large number of images:
Use MPI for the collectives if MPI+coarrays is supported
Implement your own, but this might be hard
For reductions of scalars, a tree will be the best to try (see the sketch below)
For reductions of more data you would have to experiment, and this may depend on the topology
Coarrays can be good for collective operations where
there is an unusual communication pattern that does not match what MPI collectives provide
there is opportunity to overlap communication with computation
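A minimal sketch of a tree reduction for a scalar maximum (assumed, not from the original slides); for simplicity it uses sync all at each level, where a real implementation would use sync images on just the partners:

function allreduce_max_tree(v) result(vmax)
   double precision :: vmax, v[*]
   double precision, save :: work[*]
   integer :: me, step
   me = this_image()
   work = v
   step = 1
   do while (step < num_images())
      sync all                            ! order segments at each tree level
      if (mod(me-1, 2*step) == 0 .and. me+step <= num_images()) then
         work = max(work, work[me+step])  ! combine with partner's partial result
      end if
      step = 2*step
   end do
   sync all
   vmax = work[1]                         ! image 1 now holds the global maximum
end function allreduce_max_tree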
Tools: debugging and profiling
Tool support should improve once coarray take-up increases.
Cray: the CrayPAT tools support coarrays
Totalview works with coarray programs on Cray systems
Scalasca: currently investigating how PGAS support can be incorporated
Debugging synchronisation problems
The one-sided model is tricky because subtle synchronisation errors change data.
TRY TO GET IT RIGHT FIRST TIME
Look carefully at the remote operations in the code
Think about synchronisation of segments
Especially look for early-arriving communications trashing your data at the start of loops (this one is easy to miss)
One way to test is to put sleep() calls in the code:
Delay one or more images
Delay the master image, or other images for some patterns
Examples of coarrays in practice
Work-sharing examples
Puzzles
Many-body deformation code
Distributed remote gather
Example of incorporating coarrays into an existing application
Solving Sudoku Puzzles
Going Parallel
Started with the serial code
Changed to read in all 125,000 puzzles at the start
Choose a work-sharing strategy:
One image (1) holds a queue of puzzles to solve
Each image picks work from the queue and writes the result back to the queue
Arbitrarily decide to parcel out work as
blocksize = npuzzles/(8*num_images())
Data Structures

use, intrinsic :: iso_fortran_env
type puzzle
   integer :: input(9,9)
   integer :: solution(9,9)
end type puzzle
type queue
   type(lock_type) :: lock
   integer :: next_available = 1
   type(puzzle), allocatable :: puzzles(:)
end type queue
type(queue), save :: workqueue[*]
type(puzzle) :: local_puzzle
integer, save :: npuzzles[*], blocksize[*]
Input

if (this_image() == 1) then
   ! After file setup.
   inquire (unit=inunit, size=nbytes)
   nrecords = nbytes/10
   npuzzles = nrecords/9
   blocksize = npuzzles / (num_images()*8)
   write (*,*) "Found ", npuzzles, " puzzles."
   allocate (workqueue%puzzles(npuzzles))
   do i = 1, npuzzles
      call read_puzzles(workqueue%puzzles(i)%input, inunit, error)
   end do
   close(inunit)
end if
Core program structure

! After coarray data is loaded
sync all
blocksize = blocksize[1]
npuzzles = npuzzles[1]
done = .false.
workloop: do
   ! Acquire lock and claim work
   ! Solve our puzzles
end do workloop
Acquire lock and claim work
! (code elided on the original slide; a possible sketch follows)
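A possible sketch of this step, using the locking discipline described on the "More on the Locking" slide; the index variable mine is illustrative, since the original slide's code is not reproduced here:

lock (workqueue[1]%lock)                        ! serialise access to the queue state
mine = workqueue[1]%next_available              ! claim a block of puzzles
workqueue[1]%next_available = mine + blocksize
unlock (workqueue[1]%lock)                      ! acts like sync memory
if (mine > npuzzles) exit workloop              ! no work left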
Solve the puzzles and write back
! (code elided on the original slide; a possible sketch follows)
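A possible sketch of this step; solve_puzzle is a hypothetical solver routine, not named on the original slides:

do i = mine, min(mine + blocksize - 1, npuzzles)
   local_puzzle = workqueue[1]%puzzles(i)       ! fetch the puzzle from image 1
   call solve_puzzle(local_puzzle)              ! fills local_puzzle%solution
   workqueue[1]%puzzles(i)%solution = local_puzzle%solution   ! write result back
end do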
Output the solutions

! Need to synchronize puzzle output updates
sync all
if (this_image() == 1) then
   open (outunit, file=outfile, iostat=error)
   do i = 1, npuzzles
      call write_puzzle(workqueue%puzzles(i)%input, &
                        workqueue%puzzles(i)%solution, outunit, error)
   end do
end if
More on the Locking
We protected access to the queue state by lock and unlock.
During this time no other image can acquire the lock.
We need the discipline to access the data only within the window when we hold the lock.
There is no connection between the lock variable and the other elements of the queue structure.
The unlock is acting like sync memory:
If one image executes an unlock...
...another image getting the lock is ordered after the first image.
Summary and Commentary
We implemented solving the puzzles using a work-sharing scheme with coarrays.
Scalability is limited by the serial work done by image 1:
I/O
Parallel I/O (deferred to a Technical Report) with multiple images running distributed work queues
Defer the character-to-integer format conversion to the solver, which is executed in parallel
Lock contention
Could use distributed work queues, each with its own lock
Impact Application
Study of impact and deformation of multiple bodies.
Bodies deform on contact.
Impact application: challenges and approach
Challenges
Body contact is unpredictable
Much more computation in contact regions
Very poor load balance limits scalability
Approach
Replicate contact data on all images (cores)
Divide contact data into segments (e.g. 5 times as many segments as images)
Use coarrays to assign data segments to images by atomically incrementing the global segment index (a sketch follows below)
Images process more or fewer segments, depending on the computational work in each segment
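A minimal sketch of the work-dispatch idea, assuming a shared counter next_segment held on image 1 and protected by a critical section (the original code is not shown on the slides; a lock variable or, later, Fortran 2018 atomics would do the same job; nsegments and process_segment are hypothetical names):

integer, save :: next_segment[*]
integer :: myseg
do
   critical                        ! only one image at a time increments the index
      myseg = next_segment[1]
      next_segment[1] = myseg + 1
   end critical
   if (myseg > nsegments) exit     ! all segments dispatched
   call process_segment(myseg)     ! per-segment contact work
end do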
Results
After implementing the work-dispatch method in the contact algorithm:

Contact algorithm (as a fraction of the entire code):

Code version                         Elapsed time   Comm. & sync. time   Load imbalance
Original MPI                         15%            75%                  28%
Coarray version with work dispatch   10%            35%                  1%

Big improvement in load balancing.
Distributed remote gather
The problem is how to implement the following gather loop on a distributed-memory system:

REAL    :: table(n), buffer(nelts)
INTEGER :: index(nelts)   ! nelts << n
...
DO i = 1, nelts
   buffer(i) = table(index(i))
ENDDO

The array table is distributed across the processors, while index and buffer are replicated.
Synthetic code, but it simulates irregular communication access patterns.
Remote gather: MPI implementation
MPI rank 0 controls the index and receives the values from the other ranks.

IF (mype.eq.0) THEN
   isum = 0
   ! PE0 gathers indices to send out to individual PEs
   DO i = 1, nelts
      pe = (index(i)-1)/nloc
      isum(pe) = isum(pe)+1
      who(isum(pe),pe) = index(i)
   ENDDO
   ! send out count and indices to PEs
   DO i = 1, npes-1
      CALL MPI_SEND(isum(i),1,MPI_INTEGER,i,10,...
      IF (isum(i).gt.0) THEN
         CALL MPI_SEND(who(1,i),isum(i),...
      ENDIF
   ENDDO
   ! gather rank 0's own values locally
   DO i = 1, isum(0)
      offset = mod(who(i,0)-1,nloc)+1
      buff(i,0) = mpi_table(offset)
   ENDDO
   ! now wait to receive values from the other PEs
   DO i = 1, npes-1
      IF (isum(i).gt.0) THEN
         CALL MPI_RECV(buff(1,i),isum(i),...
      ENDIF
   ENDDO
   ! unpack the received values into index order
   DO i = nelts, 1, -1
      pe = (index(i)-1)/nloc
      offset = isum(pe)
      mpi_buffer(i) = buff(offset,pe)
      isum(pe) = isum(pe) - 1
   ENDDO
ELSE
   ! ... (IF my_rank.ne.0: receive indices, gather local values, send back)
ENDIF
Remote gather: coarray implementation (get)
Image 1 gets the values from the other images.

IF (myimg.eq.1) THEN
   DO i = 1, nelts
      pe = (index(i)-1)/nloc+1
      offset = MOD(index(i)-1,nloc)+1
      caf_buffer(i) = caf_table(offset)[pe]
   ENDDO
ENDIF
Remote gather: coarray vs MPI
The coarray implementations are much simpler.
Coarray syntax allows the expression of remote data in a natural way: no need for complex protocols.
The coarray implementation is orders of magnitude faster for small numbers of indices.
Adding coarrays with minimal code modifications
Coarray structures can be inserted into an existing MPI code to simplify and improve communication without affecting the general code structure.
In some cases the use of Fortran pointers can simplify the introduction of coarray variables in a routine without modifying the calling tree.
Code modifications are mostly limited to the routine where coarrays must be introduced.
A coarray is not permitted to be a pointer; however, a coarray may be of a derived type with pointer or allocatable components.
Example: adding coarrays to existing code
In subroutine sub2 we want to use coarrays for communication.
Array x is defined and allocated in the main program and is passed as an argument to subroutine sub1 and then to sub2.

PROGRAM main
   REAL, ALLOCATABLE :: x(:)
   ALLOCATE (x(n))
   CALL sub1(x,n)

SUBROUTINE sub1(x,n)
   REAL :: x(n)
   x(:) = ...
   CALL sub2(x,n)

SUBROUTINE sub2(x,n)
   REAL :: x(n)
   CALL MPI_Irecv(n,rx,...
   CALL MPI_Send(n,x,...
Modify the routine where coarrays are used
In sub2, add the derived type CAFP, which contains just a pointer, and declare a coarray variable xp of type CAFP.
Each image sets xp to point to its array x.
Array x on image p can then be referenced by xp[p]%ptr.

SUBROUTINE sub2(x,n)
   REAL, TARGET :: x(n)
   ! Add derived type CAFP
   TYPE CAFP
      REAL, POINTER :: ptr(:)
   END TYPE CAFP
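A sketch of how the routine might continue, following the slide's description (the exact continuation is not shown on the slide, so the SAVE attribute and the synchronisation are assumptions):

   TYPE(CAFP), SAVE :: xp[*]
   xp%ptr => x          ! each image points at its own x
   SYNC ALL             ! make the association visible before remote use
   ! array x on image p can now be referenced as xp[p]%ptr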
References
Numrich, R. W. and Reid, J. K. (1998). Co-array Fortran for parallel programming. http://lacsi.rice.edu/software/caf/downloads/documentation/nrRAL98060.pdf
Reid, J. (April 2010). Coarrays in the next Fortran Standard. ftp://ftp.nag.co.uk/sc22wg5/N1801-N1850/N1824.pdf
Ashby, J. V. and Reid, J. K. (2008). Migrating a scientific application from MPI to coarrays. CUG 2008 Proceedings. RAL-TR-2008-015. See http://www.numerical.rl.ac.uk/reports/reports.shtml
Unified Parallel C at George Washington University: http://upc.gwu.edu/
Berkeley Unified Parallel C Project: http://upc.lbl.gov/
Wrap-up
Remember our first Motivation slide?
Fortran now supports parallelism as a full first-class feature of the language.
Changes are minimal.
Performance is maintained.
Flexibility in expressing communication patterns.
We hope you learned something and have success with coarrays in the future.
Acknowledgements
This material for is based on original content developed by EPCC at
The University of Edinburgh for use in teaching the MSc in High
Performance Computing. The following people contributed to its
development:
Alan Simpson, Michele Weiland, Jim Enright and Paul Graham
The material was subsequently developed by EPCC and Cray with
contributions from the following people:
David Henty, Alan Simpson ( EPCC)
Harvey Richardson, Bill Long, Roberto Ansaloni,
Jef Dawson, Nathan Wichmann (Cray)