
Parallel Processing

Dr. Pierre Vignéras

http://www.vigneras.name/pierre

This work is licensed under a Creative Commons
Attribution-Share Alike 2.0 France. See
http://creativecommons.org/licenses/by-sa/2.0/fr/ for details.


Text & Reference Books

- Fundamentals of Parallel Processing
  Harry F. Jordan & Gita Alaghband
  Pearson Education, ISBN: 81-7808-992-0
- Parallel & Distributed Computing Handbook
  Albert Y. H. Zomaya
  McGraw-Hill Series on Computer Engineering, ISBN: 0-07-073020-2
- Parallel Programming (2nd edition)
  Barry Wilkinson and Michael Allen
  Pearson/Prentice Hall, ISBN: 0-13-140563-2
Class, Quiz & Exam Rules

- No entry after the first 5 minutes
- No exit before the end of the class
- Announced Quiz
  - At the beginning of a class
  - Fixed timing (you may suffer if you arrive late)
  - Spread out (do it quickly to save your time)
  - Papers that are not strictly in front of you will be considered as done
  - Cheaters will get the '1' mark
- Your head should be in front of your paper!


Grading Policy
(Tentative, can be slightly modified)

- Quiz: 5%
- Assignments: 5%
- Project: 15%
- Mid: 25%
- Final: 50%


Outline
(Important points)

- Basics
  - [SI|MI][SD|MD], Interconnection Networks
  - Language-Specific Constructions
- Parallel Complexity
  - Size and Depth of an algorithm
  - Speedup & Efficiency
- SIMD Programming
  - Matrix Addition & Multiplication
  - Gauss Elimination


Basics
Outline

- Flynn's Classification
- Interconnection Networks
- Language-Specific Constructions


History

- Since
  - Computers are based on the Von Neumann Architecture
  - Languages are based on either Turing Machines (imperative languages), Lambda Calculus (functional languages) or Boolean Algebra (Logic Programming)
  - Sequential processing is the way both computers and languages are designed
- But doing several things at once is an obvious way to increase speed


Definitions

- The Parallel Processing objective is execution speed improvement
- The manner is by doing several things at once
- The instruments are
  - Architecture,
  - Languages,
  - Algorithms
- Parallel Processing is an old field
  - but always limited by the wide sequential world, for various reasons (financial, psychological, ...)


Flynn Architecture

- Characterization
  - Instruction Stream (what to do): S: Single, M: Multiple
  - Data Stream (on what): I: Instruction, D: Data
- Four possibilities
  - SISD: usual computer
  - SIMD: Vector Computers (Cray)
  - MIMD: Multi-Processor Computers
  - MISD: Not very useful (sometimes presented as pipelined SIMD)


SISD Architecture

- Traditional Sequential Computer
- Parallelism at the Architecture Level:
  - Interruption
  - IO processor
  - Pipeline

(figure: a 4-stage pipeline IF | OF | EX | STR, with instructions I1..I4 in the stages and the flow of instructions I5..I8 waiting to enter)


Problem with pipelines

- Conditional branches need a complete flush
  - The next instruction depends on the final result of a condition
- Dependencies
  - Wait (inserting NOP in pipeline stages) until a previous result becomes available
- Solutions:
  - Branch prediction, out-of-order execution, ...
- Consequences:
  - Circuitry complexity (cost)
  - Performance
Exercises: Real Life Examples

- Intel:
  - Pentium 4:
  - Pentium 3 (= M = Core Duo):
- AMD
  - Athlon:
  - Opteron:
- IBM
  - Power6:
  - Cell:
- Sun
  - UltraSparc:
  - T1:
Instruction Level Parallelism

- Multiple Computing Units that can be used simultaneously (scalar)
  - Floating Point Unit
  - Arithmetic and Logic Unit
- Reordering of instructions
  - By the compiler
  - By the hardware
- Lookahead (prefetching) & Scoreboarding (resolve conflicts)
- The original semantics is guaranteed


Reordering: Example

EXP = A + B + C + (D x E x F) + G + H

(figure: two expression trees for EXP; the straightforward left-to-right association has height 5, the reordered tree has height 4)

- 5 steps required if the hardware can do 1 add and 1 multiplication simultaneously
- 4 steps required if the hardware can do 2 adds and 1 multiplication simultaneously
SIMD Architecture

- Often (especially in scientific applications), one instruction is applied on different data

for (int i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
}

- Using vector operations (add, mul, div, ...)
- Two versions
  - True SIMD: n Arithmetic Units
    - n high-level operations at a time
  - Pipelined SIMD: pipelined Arithmetic Unit (depth n)
    - n non-identical low-level operations at a time


Real Life Examples

- Cray: pipelined SIMD vector supercomputers
- Modern microprocessor instruction sets contain SIMD instructions
  - Intel: MMX, SSE{1,2,3}
  - AMD: 3DNow!
  - IBM: Altivec
  - Sun: VIS
- Very efficient
  - 2002: the winner of the TOP500 list was the Japanese Earth Simulator, a vector supercomputer


MIMD Architecture

- Two forms
  - Shared memory: any processor can access any memory region
  - Distributed memory: a distinct memory is associated with each processor
- Usage:
  - Multiple sequential programs can be executed simultaneously
  - Different parts of a program are executed simultaneously on each processor
    - Needs cooperation!


Warning: Definitions

- Multiprocessors: computers capable of running multiple instruction streams simultaneously to cooperatively execute a single program
- Multiprogramming: sharing of computing resources by independent jobs
- Multiprocessing:
  - running a possibly sequential program on a multiprocessor (not our concern here)
  - running a program consisting of multiple cooperating processes


Shared vs Distributed

(figure: in a shared memory machine, all CPUs access a single memory through a switch; in a distributed memory machine, each CPU has its own memory M and the CPUs communicate through a switch)


MIMD
Cooperation/Communication

- Shared memory
  - read/write in the shared memory by hardware
  - synchronisation with locking primitives (semaphore, monitor, etc.)
  - Variable latency
- Distributed memory
  - explicit message passing in software
    - synchronous, asynchronous, grouped, remote method invocation
  - Large messages may mask long and variable latency


Interconnection Networks


Hard Network Metrics

- Bandwidth (B): maximum number of bytes the network can transport per unit of time
- Latency (L): transmission time of a single message
  - 4 components
    - software overhead (so): msg packing/unpacking
    - routing delay (rd): time to establish the route
    - contention time (ct): network is busy
    - channel delay (cd): msg size / bandwidth
  - so + ct depends on program behaviour
  - rd + cd depends on hardware
Soft Network Metrics

- Diameter (r): longest path between two nodes
- Average Distance (da):

      da = (1/(N-1)) * Σ_{d=1..r} d · Nd

  where d is the distance between two nodes, defined as the number of links in the shortest path between them, N the total number of nodes and Nd the number of nodes at a distance d from a given node.
- Connectivity (P): the number of nodes that can be reached from a given node in one hop.
- Concurrency (C): the number of independent connections that a network can make.
  - bus: C = 1; linear: C = N-1 (if a processor can send and receive simultaneously, otherwise C = N/2)
Influence of the network

- Bus
  - low cost,
  - low concurrency degree: 1 message at a time
- Fully connected
  - high cost: N(N-1)/2 links
  - high concurrency degree: N messages at a time
- Star
  - low cost (high availability)
  - Exercise: concurrency degree?


Parallel Responsibilities

- Human writes a Parallel Algorithm
- Human implements it using a Language
- Compiler translates it into a Parallel Program
- Runtime/Libraries are required in a Parallel Process
- Human & Operating System map it to the Hardware
- The hardware is leading the Parallel Execution

Many actors are involved in the execution of a program (even a sequential one). Parallelism can be introduced automatically by compilers, runtime/libraries, the OS and the hardware. We usually consider explicit parallelism: parallelism introduced at the language level.


SIMD Pseudo-Code

- SIMD statement:
  <indexed variable> = <indexed expression>, (<index range>);
- Example:
  C[i,j] = A[i,j] + B[i,j], (0 <= i <= N-1)

(figure: the element-wise additions C[i] = A[i] + B[i] shown for j = 0, j = 1, ..., j = N-1)
SIMD Matrix Multiplication

- General formula:

      c_ij = Σ_{k=0..N-1} a_ik · b_kj

for (i = 0; i < N; i++) {
    // SIMD initialisation
    C[i,j] = 0, (0 <= j < N);
    for (k = 0; k < N; k++) {
        // SIMD computation
        C[i,j] = C[i,j] + A[i,k]*B[k,j], (0 <= j < N);
    }
}


SIMD C Code (GCC)

- gcc vector type & operations
  - Vector of 8 bytes, composed of 4 short elements:
    typedef short v4hi __attribute__ ((vector_size (8)));
  - Use a union to access individual elements:
    union vec { v4hi v; short s[4]; };
  - Usual operations on vector variables:
    v4hi v, w, r; r = v + w*w/v;
- gcc builtin functions
  - #include <mmintrin.h> // MMX only
  - Compile with -mmmx
  - Use the provided functions:
    _mm_setr_pi16 (short w0, short w1, short w2, short w3);
    _mm_add_pi16 (__m64 m1, __m64 m2);
    ...
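The following minimal, self-contained sketch shows the gcc vector extension in action; the file name, the input values and the printed output are illustrative additions, not part of the original slides.

/* vec_add.c -- element-wise addition of two 4-short vectors with the
 * gcc vector extension shown above. Compile: gcc -O2 vec_add.c */
#include <stdio.h>

typedef short v4hi __attribute__ ((vector_size (8)));
union vec { v4hi v; short s[4]; };

int main(void) {
    union vec a = { .s = {1, 2, 3, 4} };
    union vec b = { .s = {10, 20, 30, 40} };
    union vec r;

    r.v = a.v + b.v;               /* one vector addition on 4 shorts */

    for (int i = 0; i < 4; i++)
        printf("%d ", r.s[i]);     /* prints: 11 22 33 44 */
    printf("\n");
    return 0;
}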


MIMD Shared Memory
Pseudo Code

- New instructions:
  - fork <label>: start a new process executing at <label>;
  - join <integer>: join <integer> processes into one;
  - shared <variable list>: make the storage class of the variables shared;
  - private <variable list>: make the storage class of the variables private.
- Sufficient? What about synchronization?
  - We will focus on that later on...


MIMD Shared Memory Matrix
Multiplication

private i, j, k;
shared A[N,N], B[N,N], C[N,N], N;
for (j = 0; j < N - 1; j++) fork DOCOL;   // create N-1 extra processes
// Only the original process reaches this point
j = N-1;
DOCOL: // Executed by N processes, one per column
for (i = 0; i < N; i++) {
    C[i,j] = 0;
    for (k = 0; k < N; k++) {
        C[i,j] = C[i,j] + A[i,k]*B[k,j];
    }
}
join N; // Wait for the other processes


MIMD Shared Memory C Code

- Process based:
  - fork()/wait()
- Thread based:
  - Pthread library
  - Solaris Thread library
- Specific framework: OpenMP
- Exercise: read the following articles:
  - Imperative Concurrent Object-Oriented Languages, M. Philippsen, 1995
  - Threads Cannot Be Implemented As a Library, Hans-J. Boehm, 2005
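To make the thread-based option concrete, here is a minimal pthread sketch of the fork/join column multiplication from the previous slide; N, the one-thread-per-column mapping and the helper names are assumptions made for illustration, not the course's reference implementation.

/* mm_pthread.c -- one thread per column of C. Compile: gcc -pthread mm_pthread.c */
#include <pthread.h>
#include <stdio.h>

#define N 4
static double A[N][N], B[N][N], C[N][N];   /* shared */

static void *docol(void *arg) {
    int j = (int)(long)arg;                /* private column index */
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += A[i][k] * B[k][j];
        C[i][j] = sum;
    }
    return NULL;
}

int main(void) {
    pthread_t t[N];
    /* ... fill A and B here ... */
    for (int j = 0; j < N; j++)            /* "fork DOCOL" */
        pthread_create(&t[j], NULL, docol, (void *)(long)j);
    for (int j = 0; j < N; j++)            /* "join N"     */
        pthread_join(t[j], NULL);
    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}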


MIMD Distributed Memory
Pseudo Code

- Basic instructions: send() / receive()
- Many possibilities:
  - non-blocking send needs message buffering
  - non-blocking receive needs failure return
  - blocking send and receive both need termination detection
- Most used combinations
  - blocking send and receive
  - non-blocking send and blocking receive
- Grouped communications
  - scatter/gather, broadcast, multicast, ...
MIMD Distributed Memory
C Code

- Traditional sockets
  - Bad performance
  - Grouped communication limited
- Parallel Virtual Machine (PVM)
  - Old but good and portable
- Message Passing Interface (MPI)
  - The standard used in parallel programming today
  - Very efficient implementations
    - Vendors of supercomputers provide their own highly optimized MPI implementation
  - Open source implementations available
Performance
Outline

- Size and Depth
- Speedup and Efficiency


Parallel Program
Characteristics

s = V[0];
for (i = 1; i < N; i++) {
    s = s + V[i];
}

(figure: the data dependence graph of this loop is a chain of additions from V[0], V[1], ..., V[N-1] down to s)

- Size = N-1 (number of operations)
- Depth = N-1 (number of operations in the longest path from any input to any output)
Parallel Program
Characteristics

(figure: the same sum computed with a balanced binary tree of additions over V[0], V[1], ..., V[N-1])

- Size = N-1
- Depth = log2(N)


Parallel Program
Characteristics

- Size and Depth do not take into account many things
  - memory operations (allocation, indexing, ...)
  - communication between processes
    - read/write for shared memory MIMD
    - send/receive for distributed memory MIMD
  - load balancing when the number of processors is less than the number of parallel tasks
  - etc.
- They are inherent characteristics of a parallel algorithm, independent of any specific architecture
Prefix Problem

for (i = 1; i < N; i++) {
    V[i] = V[i-1] + V[i];
}

(figure: the dependence graph is a chain of additions over V[0], V[1], ..., V[N-1]; every V[i] depends on V[i-1])

- Size = N-1
- Depth = N-1
- How to make this program parallel?
Prefix Solution 1
Upper/Lower Prefix Algorithm

- Divide and Conquer
  - Divide a problem of size N into 2 equivalent problems of size N/2
  - Apply the same algorithm for the sub-prefixes. When N = 2, use the direct, sequential implementation
  - (figure: the N/2 lower results come directly from the lower prefix; each of the N/2 upper results is obtained by adding the last lower result to an upper-prefix result)
- Size has increased!
  - N-2+N/2 > N-1 (N > 2)
  - Size (N = 2^k) = k·2^(k-1) = (N/2)·log2(N)
- Depth = log2(N)
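A compact way to see the recursion is this sequential C simulation of the upper/lower scheme, added here purely for illustration (the final loop is the step that would run in parallel); it is not the solution to the later exercise on the other prefix algorithms.

/* ul_prefix.c -- sequential simulation of the upper/lower prefix algorithm. */
#include <stdio.h>

/* After the call, V[0..n-1] holds the prefix sums; n must be a power of 2. */
static void ul_prefix(int *V, int n) {
    if (n == 2) {                      /* base case: direct sequential prefix */
        V[1] += V[0];
        return;
    }
    ul_prefix(V, n / 2);               /* lower half */
    ul_prefix(V + n / 2, n / 2);       /* upper half */
    for (int i = n / 2; i < n; i++)    /* these N/2 additions are the parallel step */
        V[i] += V[n / 2 - 1];
}

int main(void) {
    int V[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    ul_prefix(V, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", V[i]);           /* prints: 1 3 6 10 15 21 28 36 */
    printf("\n");
    return 0;
}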
Analysis of P^ul

- Proof by induction
  - Only for N a power of 2
  - Assume from construction that size increases smoothly with N
- Example: N = 1,048,576 = 2^20
  - Sequential Prefix Algorithm:
    - 1,048,575 operations in 1,048,575 time unit steps
  - Upper/Lower Prefix Algorithm:
    - 10,485,760 operations in 20 time unit steps


Other Prefix Parallel
Algorithms

- P^ul requires N/2 processors available!
  - In our example, 524,288 processors!
- Other algorithms (exercise: study them)
  - Odd/Even Prefix Algorithm
    - Size = 2N - log2(N) - 2 (N = 2^k)
    - Depth = 2·log2(N) - 2
  - Ladner and Fischer's Parallel Prefix
    - Size = 4N - 4.96·N^0.69 + 1
    - Depth = log2(N)
- Exercise: implement them in C
Benchmarks
(Inspired by Pr. Ishfaq Ahmad Slides)

- A benchmark is a performance testing program supposed to capture the processing and data movement characteristics of a class of applications.
- Used to measure and to predict the performance of computer systems, and to reveal their architectural weak and strong points.
- Benchmark suite: set of benchmark programs.
- Benchmark family: set of benchmark suites.
- Classification:
  - scientific computing, commercial applications, network services, multimedia applications, ...


Benchmark Examples

Type            | Name      | Measuring
----------------|-----------|--------------------------------------------------
Micro-Benchmark | LINPACK   | Numerical computing (linear algebra)
                | LMBENCH   | System calls and data movement operations in Unix
                | STREAM    | Memory bandwidth
Macro-Benchmark | NAS       | Parallel computing (CFD)
                | PARKBENCH | Parallel computing
                | SPEC      | A mixed benchmark family
                | Splash    | Parallel computing
                | STAP      | Signal processing
                | TPC       | Commercial applications


SPEC Benchmark Family

- Standard Performance Evaluation Corporation
- Started with benchmarks that measure CPU performance
- Has extended to client/server computing, commercial applications, I/O subsystems, etc.
- Visit http://www.spec.org/ for more information


Performance Metrics

- Execution time
  - Generally in seconds
  - Real time, User time, System time?
    - Real time
- Instruction Count
  - Generally in MIPS or BIPS
  - Dynamic: the number of executed instructions, not the number of assembly code lines!
- Number of Floating Point Operations Executed
  - Mflop/s, Gflop/s
  - For scientific applications where flop dominates
Memory Performance

- Three parameters:
  - Capacity,
  - Latency,
  - Bandwidth


Parallelism and Interaction
Overhead

- Time to execute a parallel program is
  T = Tcomp + Tpar + Tinter
  - Tcomp: time of the effective computation
  - Tpar: parallel overhead
    - Process management (creation, termination, context switching)
    - Grouping operations (creation, destruction of groups)
    - Process inquiry operations (Id, Group Id, Group size)
  - Tinter: interaction overhead
    - Synchronization (locks, barrier, events, ...)
    - Aggregation (reduction and scan)
    - Communication (p2p, collective, read/write shared variables)


Measuring Latency
Ping-Pong

for (i = 0; i < Runs; i++) {
    if (my_node_id == 0) { /* sender */
        tmp = Second();
        start_time = Second();
        send an m-byte message to node 1;
        receive an m-byte message from node 1;
        end_time = Second();
        timer_overhead = start_time - tmp;
        total_time = end_time - start_time - timer_overhead;
        communication_time[i] = total_time / 2;
    } else if (my_node_id == 1) { /* receiver */
        receive an m-byte message from node 0;
        send an m-byte message to node 0;
    }
}
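The pseudo-code above maps almost directly onto MPI; the sketch below is illustrative (RUNS and the message size M are arbitrary choices, and MPI_Wtime replaces Second()).

/* pingpong.c -- compile: mpicc pingpong.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define RUNS 100
#define M    1024                       /* message size in bytes */

int main(int argc, char **argv) {
    char buf[M];
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < RUNS; i++) {
        if (rank == 0) {                /* sender */
            double start = MPI_Wtime();
            MPI_Send(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, M, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            double total = MPI_Wtime() - start;
            printf("round-trip / 2 = %g s\n", total / 2);
        } else if (rank == 1) {         /* receiver: echo the message back */
            MPI_Recv(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, M, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}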


Measuring Latency
Hot-Potato (fire-brigade)
Method

- n nodes are involved (instead of just two)
- Node 0 sends an m-byte message to node 1
- On receipt, node 1 immediately sends the same message to node 2, and so on.
- Finally node n-1 sends the message back to node 0
- The total time is divided by n to get the point-to-point average communication time.


Collective Communication
Performance
for (i = 0; i < Runs; i++) {
barrier synchronization;
tmp = Second();
start_time = Second();
for (j = 0; j < Iterations; j++)
the_collective_routine_being_measured;
end_time = Second();
timer_overhead = start_time - tmp;
total_time = end_time - start_time - timer_overhead;
local_time = total_time / Iterations;
communication_time[i] = max {all n local_time};
}

How to get all n local_time values?


Point-to-Point
Communication Overhead

- Linear function of the message length m (in bytes)
  t(m) = t0 + m/r
  - t0: startup time
  - r: asymptotic bandwidth

(figure: t(m) as a straight line with intercept t0 in a t vs m plot)

- Ideal situation! Many things are left out
  - Topology metrics
  - Network contention
Collective Communication

- Broadcast: 1 process sends an m-byte message to all n processes
- Multicast: 1 process sends an m-byte message to p < n processes
- Gather: 1 process receives m bytes from each of the n processes (m·n bytes received)
- Scatter: 1 process sends a distinct m-byte message to each of the n processes (m·n bytes sent)
- Total exchange: every process sends a distinct m-byte message to each of the n processes (m·n² bytes sent)
- Circular shift: process i sends an m-byte message to process i+1 (process n-1 to process 0)
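For reference, two of these operations written with MPI collectives; the one-int payload and the choice of rank 0 as root are illustrative assumptions.

/* coll.c -- compile: mpicc coll.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size, value = 0, *all = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) value = 42;
    /* Broadcast: process 0 sends the same message to all n processes */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Gather: process 0 receives one int from each of the n processes */
    if (rank == 0) all = malloc(size * sizeof(int));
    int mine = rank * value;
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("%d ", all[i]);
        printf("\n");
        free(all);
    }
    MPI_Finalize();
    return 0;
}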
Collective Communication
Overhead

- Function of both m and n
  - the startup latency depends only on n
  T(m,n) = t0(n) + m/r(n)
- Many problems here! For a broadcast, for example, how is the communication done at the link layer?
  - Sequential: process 1 sends one m-byte message to process 2, then to process 3, and so on... (n steps required)
  - Truly parallel (1 step required)
  - Tree based (log2(n) steps required)
- Again, many things are left out
  - Topology metrics
  - Network contention


Speedup
(Problem-oriented)

- For a given algorithm, let Tp be the time to perform the computation using p processors
  - T∞: depth of the algorithm
- How much time do we gain by using a parallel algorithm?
  - Speedup: Sp = T1/Tp
- Issue: what is T1?
  - The time taken by the execution of the best sequential algorithm on the same input data
  - We note T1 = Ts


Speedup consequence

- The best parallel time achievable is: Ts/p
- The best speedup is bounded by Sp <= Ts/(Ts/p) = p
  - Called Linear Speedup
- Conditions
  - The workload can be divided into p equal parts
  - No overhead at all
  - Ideal situation, usually impossible to achieve!
- Objective of a parallel algorithm
  - Being as close to a linear speedup as possible


Biased Speedup
(Algorithm-oriented)

- We usually don't know what the best sequential algorithm is, except for some simple problems (sorting, searching, ...)
- In this case, T1 is the time taken by the execution of the same parallel algorithm on 1 processor.
- Clearly biased!
  - The parallel algorithm and the best sequential algorithm are often very, very different
- This speedup should be taken with that limitation in mind!
Biased Speedup Consequence

- Since we compare our parallel algorithm on p processors with a suboptimal sequential one (the parallel algorithm on only 1 processor), we can have: Sp > p
  - Superlinear speedup
- Can also be the consequence of
  - a unique feature of the system architecture that favours parallel formation
  - the indeterminate nature of the algorithm
  - extra memory due to the use of p computers (less swap)
Efficiency

- How long are processors actually being used on the computation?
  - Efficiency: Ep = Sp/p (expressed in %)
- Also represents how much benefit we gain by using p processors
- Bounded by 100%
  - Superlinear speedup of course can give more than 100%, but it is biased


Amdahl's Law: fixed problem size
(1967)

- For a problem with a workload W, we assume that we can divide it into two parts:
  W = αW + (1-α)W
  - a fraction α of W must be executed sequentially, and the remaining (1-α) can be executed by p nodes simultaneously with 100% efficiency

  Ts(W) = Ts(αW + (1-α)W) = Ts(αW) + Ts((1-α)W) = αTs(W) + (1-α)Ts(W)
  Tp(W) = Tp(αW + (1-α)W) = Ts(αW) + Tp((1-α)W) = αTs(W) + (1-α)Ts(W)/p

      Sp = Ts / (αTs + (1-α)Ts/p) = p / (1 + (p-1)α)  →  1/α  (p → ∞)
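As a worked example (numbers added here for illustration): with α = 5% and p = 100, Sp = 100 / (1 + 99 × 0.05) = 100 / 5.95 ≈ 16.8, far below 100, and no number of processors can push the speedup above 1/α = 20.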
Amdahl's Law: fixed problem size

(figure: the sequential time Ts is split into a serial section αTs and parallelizable sections (1-α)Ts; with p processors the parallelizable part shrinks to (1-α)Ts/p, giving the parallel time Tp)
Amdahl's Law Consequences

(figure: speedup curves S(p) for α = 5%, 10%, 20%, 40%, and S(α) for p = 10 and p = 100; the curves flatten quickly as p grows)


Amdahl's Law Consequences

- For a fixed workload, and without any overhead, the maximal speedup has an upper bound of 1/α
- Taking into account the overhead To
  - Performance is limited not only by the sequential bottleneck but also by the average overhead

      Sp = Ts / (αTs + (1-α)Ts/p + To) = p / (1 + (p-1)α + p·To/Ts)

      Sp → 1 / (α + To/Ts)  (p → ∞)
Amdahl's Law Consequences

- The sequential bottleneck cannot be solved just by increasing the number of processors in a system.
  - Very pessimistic view on parallel processing during two decades.
- Major assumption: the problem size (workload) is fixed.
- Conclusion: don't use parallel programming on a small sized problem!


Gustafson's Law: fixed time
(1988)

- As the machine size increases, the problem size may increase with execution time unchanged
- Sequential case: W = αW + (1-α)W
  Ts(W) = αTs(W) + (1-α)Ts(W)
  Tp(W) = αTs(W) + (1-α)Ts(W)/p
- To keep the time constant, the parallel workload should be increased
- Parallel case (p processors): W' = αW + p(1-α)W
  Ts(W') = αTs(W) + p(1-α)Ts(W)
  - Assumption: the sequential part is constant: it does not increase with the size of the problem


Fixed time speedup

    Sp = Ts(W') / Tp(W') = Ts(W') / Ts(W)
       = (αTs(W) + p(1-α)Ts(W)) / (αTs(W) + (1-α)Ts(W))

    Sp = α + (1-α)p

- Fixed time speedup is a linear function of p, if the workload is scaled up to maintain a fixed execution time
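A quick worked example (values chosen here for illustration): with α = 5% and p = 100, the scaled speedup is Sp = 0.05 + 0.95 × 100 = 95.05, essentially linear in p, in sharp contrast with the fixed-size bound of 20 obtained from Amdahl's law for the same α.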


Gustafson's Law
Consequences

(figure: fixed-time speedup curves S(p) for α = 5%, 10%, 20%, 40% and S(α) for p = 10, 25, 50, 100; the curves stay close to linear in p)


Fixed time speedup with overhead

    Sp = Ts(W') / Tp(W') = Ts(W') / (Ts(W) + To(p)) = (α + (1-α)p) / (1 + To(p)/Ts(W))

- The best fixed-time speedup is achieved with a low overhead, as expected
  - To: a function of the number of processors
  - Quite difficult in practice to make it a decreasing function


SIMD Programming
Outline

- Matrix Addition & Multiplication
- Gauss Elimination


Matrix Vocabulary

- A matrix is dense if most of its elements are nonzero
- A matrix is sparse if a significant number of its elements are zero
  - Space-efficient representations available
  - Efficient algorithms available
- Usual matrix size in parallel programming
  - Lower than 1000x1000 for a dense matrix
  - Bigger than 1000x1000 for a sparse matrix
  - Growing with the performance of computer architectures
Matrix Addition

    C = A + B,  c_{i,j} = a_{i,j} + b_{i,j}  (0 <= i < n, 0 <= j < m)

// Sequential
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        c[i,j] = a[i,j] + b[i,j];
    }
}

// SIMD
for (i = 0; i < n; i++) {
    c[i,j] = a[i,j] + b[i,j], (0 <= j < m);
}

- Need to map the vectored addition of m elements a[i]+b[i] to our underlying architecture made of p processing elements
- Complexity:
  - Sequential case: n.m additions
  - SIMD case: n.m/p
  - Speedup: S(p) = p


Matrix Multiplication

    C = A·B,  A[n,l]; B[l,m]; C[n,m]

    c_{i,j} = Σ_{k=0..l-1} a_{i,k} · b_{k,j}  (0 <= i < n, 0 <= j < m)

for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        c[i,j] = 0;
        for (k = 0; k < l; k++) {
            c[i,j] = c[i,j] + a[i,k]*b[k,j];
        }
    }
}

- Complexity:
  - n.m initialisations
  - n.m.l additions and multiplications
- Overhead: memory indexing!
  - Use a temporary variable sum for c[i,j]!


SIMD Matrix Multiplication

for (i = 0; i < n; i++) {
    c[i,j] = 0, (0 <= j < m); // SIMD Op
    for (k = 0; k < l; k++) {
        c[i,j] = c[i,j] + a[i,k]*b[k,j], (0 <= j < m);
    }
}

- Complexity (n = m = l for simplicity):
  - n²/p initialisations (p <= n)
  - n³/p additions and multiplications (p <= n)
- Overhead: memory indexing!
  - Use a temporary vector for c[i]!
- Algorithm in O(n) with p = n²
  - Assign one processor to each element of C
  - Initialisation in O(1)
  - Computation in O(n)
- Algorithm in O(lg(n)) with p = n³
  - Parallelize the inner loop!
  - Lowest bound


Gauss Elimination

- A matrix (n x n) represents a system of n linear equations of n unknowns

    a_{n-1,0}·x_0 + a_{n-1,1}·x_1 + a_{n-1,2}·x_2 + ... + a_{n-1,n-1}·x_{n-1} = b_{n-1}
    a_{n-2,0}·x_0 + a_{n-2,1}·x_1 + a_{n-2,2}·x_2 + ... + a_{n-2,n-1}·x_{n-1} = b_{n-2}
    ...
    a_{1,0}·x_0 + a_{1,1}·x_1 + a_{1,2}·x_2 + ... + a_{1,n-1}·x_{n-1} = b_1
    a_{0,0}·x_0 + a_{0,1}·x_1 + a_{0,2}·x_2 + ... + a_{0,n-1}·x_{n-1} = b_0

- Represented by the augmented matrix

    M = | a_{n-1,0}  a_{n-1,1}  a_{n-1,2}  ...  a_{n-1,n-1}  b_{n-1} |
        | a_{n-2,0}  a_{n-2,1}  a_{n-2,2}  ...  a_{n-2,n-1}  b_{n-2} |
        | ...        ...        ...        ...  ...          ...     |
        | a_{1,0}    a_{1,1}    a_{1,2}    ...  a_{1,n-1}    b_1     |
        | a_{0,0}    a_{0,1}    a_{0,2}    ...  a_{0,n-1}    b_0     |
Gauss Elimination Principle

- Transform the linear system into a triangular system of equations.
- At the ith row, each row j below is replaced by:
    {row j} + {row i}·(-a_{j,i}/a_{i,i})


Gauss Elimination
Sequential & SIMD Algorithms

// Sequential version -- Complexity: O(n^3)
for (i = 0; i < n-1; i++) {          // for each row except the last
    for (j = i+1; j < n; j++) {      // step through subsequent rows
        m = a[j,i] / a[i,i];         // compute multiplier
        for (k = i; k < n; k++) {
            a[j,k] = a[j,k] - a[i,k]*m;
        }
    }
}

// SIMD version -- Complexity: O(n^3/p)
for (i = 0; i < n-1; i++) {
    for (j = i+1; j < n; j++) {
        m = a[j,i] / a[i,i];
        a[j,k] = a[j,k] - a[i,k]*m, (i <= k < n); // SIMD Op
    }
}
Other SIMD Applications

- Image Processing
- Climate and Weather Prediction
- Molecular Dynamics
- Semiconductor Design
- Fluid Flow Modelling
- VLSI Design
- Database Information Retrieval
- ...


MIMD SM Programming
Outline

- Process, Threads & Fibers
- Pre- & self-scheduling
- Common Issues
- OpenMP


Process, Threads, Fibers

- Process: heavyweight
  - Instruction Pointer, Stacks, Registers
  - memory, file handles, sockets, device handles, and windows
- (kernel) Thread: medium weight
  - Instruction Pointer, Stacks, Registers
  - Known and scheduled preemptively by the kernel
- (user thread) Fiber: lightweight
  - Instruction Pointer, Stacks, Registers
  - Unknown by the kernel and scheduled cooperatively in user space
Multithreaded Process

(figure: a single-threaded process has one set of registers and one stack next to its code, data and files; in a multi-threaded process the threads share code, data and files, but each thread has its own registers and stack)
Kernel Thread

- Provided and scheduled by the kernel
  - Solaris, Linux, MacOS X, AIX, Windows XP
  - Incompatibilities: POSIX PThread Standard
- Scheduling question: contention scope
  - The time slice q can be given to:
    - the whole process (PTHREAD_PROCESS_SCOPE)
    - each individual thread (PTHREAD_SYSTEM_SCOPE)
  - Fairness?
    - With PTHREAD_SYSTEM_SCOPE, the more threads a process has, the more CPU it gets!


Fibers (User Threads)

- Implemented entirely in user space (also called green threads)
- Advantages:
  - Performance: no need to cross the kernel (context switch) to schedule another thread
  - The scheduler can be customized for the application
- Disadvantages:
  - Cannot directly use a multiprocessor
  - Any system call blocks the entire process
    - Solutions: Asynchronous I/O, Scheduler Activation
- Systems: NetBSD, Windows XP (Fiber Thread)
Mapping

- Threads are expressed by the application in user space.
- 1:1: any thread expressed in user space is mapped to a kernel thread
  - Linux, Windows XP, Solaris > 8
- n:1: n threads expressed in user space are mapped to one kernel thread
  - Any user-space-only thread library
- n:m: n threads expressed in user space are mapped to m kernel threads (n >> m)
  - AIX, Solaris < 9, IRIX, HP-UX, TRU64 UNIX
Mapping

(figure: the 1-1, n-1 and n-m mappings between user-space threads and kernel-space threads)


MIMD SM Matrix Addition
Self-Scheduling

private i, j, k;
shared A[N,N], B[N,N], C[N,N], N;
for (i = 0; i < N - 1; i++) fork DOLINE;   // create N-1 extra processes
// Only the original process reaches this point
i = N-1; // Really required?
DOLINE: // Executed by N processes, one per line
for (j = 0; j < N; j++) {
    C[i,j] = A[i,j] + B[i,j];
}
join N; // Wait for the other processes

- Complexity: O(n²/p)
  - If SIMD is used: O(n²/(p.m)), m: number of SIMD PEs
- Issue: if n > p, self-scheduling.
  - The OS decides who will do what. (Dynamic)
  - In this case, overhead


MIMD SM Matrix Addition
Pre-Scheduling

- Basic idea: divide the whole matrix into p = s*s submatrices

(figure: C = A + B where each matrix is cut into s x s blocks; each of the p threads computes one block of C as the sum of the corresponding blocks of A and B)

- Ideal situation! No communication overhead!
- Pre-scheduling: decide statically who will do what.
Parallel Primality Testing
(from Herlihy and Shavit Slides)

- Challenge
  - Print primes from 1 to 10^10
- Given
  - Ten-processor multiprocessor
  - One thread per processor
- Goal
  - Get ten-fold speedup (or close)


Load Balancing

- Split the work evenly
- Each thread tests a range of 10^9

(figure: the range 1..10^10 is cut into ten consecutive blocks of 10^9 numbers, handled by P0, P1, ..., P9)

Code for thread i:
void thread(int i) {
    for (j = i*10^9 + 1; j < (i+1)*10^9; j++) {
        if (isPrime(j)) print(j);
    }
}


Issues

- Larger number ranges have fewer primes
- Larger numbers are harder to test
- Thread workloads
  - Uneven
  - Hard to predict
- Need dynamic load balancing


Shared Counter

(figure: a shared counter holding 17, 18, 19, ...; each thread takes a number from it)


Procedure for Thread i

// Global (shared) variable
static long value = 1;

void thread(int i) {
    long j = 0;
    while (j < 10^10) {
        j = inc();
        if (isPrime(j)) print(j);
    }
}


Counter Implementation

// Global (shared) variable
static long value = 1;

long inc() {
    return value++;
}

OK for a uniprocessor, not for a multiprocessor!


Counter Implementation

// Global (shared) variable
static long value = 1;

long inc() {
    return value++;
}

value++ actually expands into three separate steps:
    temp = value;
    value = value + 1;
    return temp;


Uh-Oh

(figure: timeline of two threads racing on the counter; one thread reads 1 and writes 2, another reads 2 and writes 3, but a thread that read 1 earlier then writes 2, so the value drops back to 2 and an update is lost)


Challenge

// Global (shared) variable
static long value = 1;

long inc() {
    temp = value;
    value = temp + 1;
    return temp;
}

Make these steps atomic (indivisible)


An Aside: C/Posix Threads API

// Global (shared) variable
static long value = 1;
static pthread_mutex_t mutex;
// somewhere before
pthread_mutex_init(&mutex, NULL);

long inc() {
    long temp;
    pthread_mutex_lock(&mutex);
    // critical section
    temp = value;
    value = temp + 1;
    pthread_mutex_unlock(&mutex);
    return temp;
}


Correctness

- Mutual Exclusion
  - Never two or more in the critical section
  - This is a safety property
- No Lockout (lockout freedom)
  - If someone wants in, someone gets in
  - This is a liveness property


Multithreading Issues

- Coordination
  - Locks, Condition Variables, Volatile
- Cancellation (interruption)
  - Asynchronous (immediate)
    - Equivalent to 'kill -KILL' but much more dangerous!!
    - What happens to locks, open files, buffers, pointers, ...?
    - Avoid it as much as you can!!
  - Deferred
    - The target thread should check periodically if it should be cancelled
    - No guarantee that it will actually stop running!


General Multithreading
Guidelines

- Use good software engineering
  - Always check error status
  - Always use a while(condition) loop when waiting on a condition variable
- Use a thread pool for efficiency
  - Reduces the number of created threads
- Use thread-specific data so one thread can hold its own (private) data
- The Linux kernel schedules tasks only,
  - Process == a task that shares nothing
  - Thread == a task that shares everything
General Multithreading
Guidelines

- Give the biggest priorities to I/O-bound threads
  - Usually blocked, should be responsive
- CPU-bound threads should usually have the lowest priority
- Use asynchronous I/O when possible; it is usually more efficient than the combination of threads + synchronous I/O
- Use event-driven programming to keep indeterminism as low as possible
- Use threads when you really need them!
OpenMP

- Designed as a superset of common languages for MIMD SM programming (C, C++, Fortran)
- UNIX, Windows versions
- fork()/join() execution model
  - OpenMP programs always start with one thread
- relaxed-consistency memory model
  - each thread is allowed to have its own temporary view of the memory
- Compiler directives in C/C++ are called pragmas (pragmatic information)
  - Preprocessor directives:
    #pragma omp <directive> [arguments]




OpenMP
Thread Creation

- #pragma omp parallel [arguments]
  <C instructions> // parallel region
- A team of threads is created to execute the parallel region
  - the original thread becomes the master thread (thread ID = 0 for the duration of the region)
  - the number of threads is determined by
    - Arguments (if, num_threads)
    - Implementation (nested parallelism support, dynamic adjustment)
    - Environment variables (OMP_NUM_THREADS)
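A minimal sketch of such a parallel region (the thread count of 4 and the printed message are illustrative):

/* hello_omp.c -- compile: gcc -fopenmp hello_omp.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    /* A team of threads executes the block; the original thread is the
     * master (thread ID 0). The count can also come from OMP_NUM_THREADS. */
    #pragma omp parallel num_threads(4)
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}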


OpenMP
Work Sharing

- Distributes the execution of a parallel region among team members
  - Does not launch threads
  - No barrier on entry
  - Implicit barrier on exit (unless nowait is specified)
- Work Sharing Constructs
  - Sections
  - Loop
  - Single
  - Workshare (Fortran)
OpenMP
Work Sharing - Section

double e, pi, factorial, product;
int i;
e = 1; // Compute e using Taylor expansion
factorial = 1;
for (i = 1; i < num_steps; i++) {
    factorial *= i;
    e += 1.0/factorial;
} // e done
pi = 0; // Compute pi/4 using Taylor expansion
for (i = 0; i < num_steps*10; i++) {
    pi += 1.0/(i*4.0 + 1.0);
    pi -= 1.0/(i*4.0 + 3.0);
}
pi = pi * 4.0; // pi done
return e * pi;

- Independent loops: the ideal case!


OpenMP
Work Sharing - Section

double e, pi, factorial, product;
int i;
#pragma omp parallel sections shared(e, pi) private(i)
{
    #pragma omp section
    { /* e section */
        e = 1; factorial = 1;
        for (i = 1; i < num_steps; i++) {
            factorial *= i; e += 1.0/factorial;
        }
    }
    #pragma omp section
    { /* pi section */
        pi = 0;
        for (i = 0; i < num_steps*10; i++) {
            pi += 1.0/(i*4.0 + 1.0);
            pi -= 1.0/(i*4.0 + 3.0);
        }
        pi = pi * 4.0;
    }
} /* omp sections */
return e * pi;

- Private copies, except for the shared variables
- The independent sections are declared with #pragma omp section
- Implicit barrier on exit
Taylor Example Analysis

- OpenMP Implementation: Intel Compiler
- System: Gentoo/Linux 2.6.18
- Hardware: Intel Core Duo T2400 (1.83 GHz), L2 Cache 2 Mb (Smart Cache)
- Gcc:
  - Standard: 13250 ms
  - Optimized (-Os -msse3 -march=pentium-m -mfpmath=sse): 7690 ms
- Intel CC:
  - Standard: 12530 ms
  - Optimized (-O2 -xW -march=pentium4): 7040 ms
- OpenMP Standard: 15340 ms
- OpenMP Optimized: 10460 ms
  - Speedup: 0.6
  - Efficiency: 3%
  - Reason: Core Duo architecture?


OpenMP
Work Sharing - Section

- Defines a set of structured blocks that are to be divided among, and executed by, the threads in a team
- Each structured block is executed once by one of the threads in the team
- The method of scheduling the structured blocks among threads in the team is implementation defined
- There is an implicit barrier at the end of a sections construct, unless a nowait clause is specified
OpenMP
Work Sharing - Loop
// LU Matrix Decomposition
for (k = 0; k<SIZE-1; k++) {
for (n = k; n<SIZE; n++) {
col[n] = A[n][k];
}
for (n = k+1; n<SIZE; n++) {
A[k][n] /= col[k];
}
for (n = k+1; n<SIZE; n++) {
row[n] = A[k][n];
}
for (i = k+1; i<SIZE; i++) {
for (j = k+1; j<SIZE; j++) {
A[i][j] = A[i][j] - row[i] * col[j];
}
}
}



OpenMP
Work Sharing - Loop

// LU Matrix Decomposition
for (k = 0; k < SIZE-1; k++) {
    ... // same as before
    #pragma omp parallel for shared(A, row, col)
    for (i = k+1; i < SIZE; i++) {
        for (j = k+1; j < SIZE; j++) {
            A[i][j] = A[i][j] - row[i] * col[j];
        }
    }
}

- Private copies, except for shared variables (Memory Model)
- The loop should be in canonical form (see spec)
- Implicit barrier at the end of the loop


OpenMP
Work Sharing - Loop

- Specifies that the iterations of the associated loop will be executed in parallel.
- Iterations of the loop are distributed across threads that already exist in the team
- The for directive places restrictions on the structure of the corresponding for loop
  - Canonical form (see specifications):
    - for (i = 0; i < N; i++) {...} is OK
    - for (i = f(); g(i); ) {... i++} is not OK
- There is an implicit barrier at the end of a loop construct unless a nowait clause is specified.
OpenMP
Work Sharing - Reduction

- reduction(op: list)
  - e.g.: reduction(*: result)
- A private copy is made for each variable declared in 'list' for each thread of the parallel region
- A final value is produced using the operator 'op' to combine all private copies

#pragma omp parallel for reduction(*: res)
for (i = 0; i < SIZE; i++) {
    res = res * a[i];
}


OpenMP
Synchronisation -- Master

- Master directive:
  #pragma omp master
  {...}
- Defines a section that is only to be executed by the master thread
- No implicit barrier: other threads in the team do not wait


OpenMP
Synchronisation -- Critical

- Critical directive:
  #pragma omp critical [name]
  {...}
- A thread waits at the beginning of a critical region until no other thread is executing a critical region with the same name.
- Enforces exclusive access with respect to all critical constructs with the same name in all threads, not just in the current team.
- Constructs without a name are considered to have the same unspecified name.
OpenMP
Synchronisation -- Barrier

- Barrier directive:
  #pragma omp barrier
- All threads of the team must execute the barrier before any are allowed to continue execution
- Restrictions
  - Each barrier region must be encountered by all threads in a team or by none at all.
  - A barrier is not part of the C grammar!
    if (i < n)
        #pragma omp barrier
    // Syntax error!


OpenMP
Synchronisation -- Atomic

- Atomic directive:
  #pragma omp atomic
  expr-stmt
- Where expr-stmt is:
  x binop= expr, x++, ++x, x--, --x
  (x scalar, binop any non-overloaded binary operator: +, /, -, *, <<, >>, &, ^, |)
- Only the load and store of the object designated by x are atomic; evaluation of expr is not atomic.
- Does not enforce exclusive access to the same storage location x.
- Some restrictions (see specifications)
OpenMP
Synchronisation -- Flush

- Flush directive:
  #pragma omp flush [(list)]
- Makes a thread's temporary view of memory consistent with memory
- Enforces an order on the memory operations of the variables explicitly specified or implied.
- Implied flush:
  - barrier, entry and exit of parallel regions, critical
  - exit from a work-sharing region (unless nowait)
  - ...
- Warning with pointers!
MIMD DM Programming
Outline

- Message Passing Solutions
- Process Creation
- Send/Receive
- MPI


Message Passing Solutions

- How to provide Message Passing to programmers?
  - Designing a special parallel programming language
    - Sun Fortress, E language, ...
  - Extending the syntax of an existing sequential language
    - ProActive, JavaParty
  - Using a library
    - PVM, MPI, ...


Requirements

- Process creation
  - Static: the number of processes is fixed and defined at startup (e.g. from the command line)
  - Dynamic: processes can be created and destroyed at will during the execution of the program.
- Message Passing
  - Efficient send and receive primitives
    - Blocking or Unblocking
  - Grouped Communication


Process Creation
Multiple Program, Multiple Data Model (MPMD)

(figure: each of the p source files is compiled to suit its processor, producing one executable per processor; executable 0 runs on Processor 0, ..., executable p-1 on Processor p-1)


Process Creation
Single Program, Multiple Data Model (SPMD)

(figure: a single source file is compiled to suit each processor, producing one executable per processor; the same program runs on Processor 0 through Processor p-1)


SPMD Code Sample

int pid = getPID();
if (pid == MASTER_PID) {
    execute_master_code();
} else {
    execute_slave_code();
}

- Master/Slave architecture
- getPID() is provided by the parallel system (library, language, ...)
- MASTER_PID should be well defined
- Compilation for heterogeneous processors is still required
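With MPI, the process rank plays the role of getPID(); the sketch below shows one common way to write this pattern (rank 0 as master and the printed messages are illustrative assumptions).

/* spmd.c -- compile: mpicc spmd.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

#define MASTER_RANK 0

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* the "PID" of this process */

    if (rank == MASTER_RANK) {
        printf("master: distributing work\n");   /* execute_master_code() */
    } else {
        printf("slave %d: doing work\n", rank);  /* execute_slave_code()  */
    }
    MPI_Finalize();
    return 0;
}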


Dynamic Process Creation

- Primitive required:
  spawn(location, executable)

(figure: Process 1 executes sequentially, calls spawn(), which starts Process 2; both processes then execute concurrently)


Send/Receive: Basic

- Basic primitives:
  send(dst, msg);
  receive(src, msg);
- Questions:
  - What is the type of src, dst?
    - IP:port: low performance but portable
    - Process ID: implementation dependent (efficient non-IP implementations possible)
  - What is the type of msg?
    - Packing/unpacking of complex messages
    - Encoding/decoding for heterogeneous architectures
Persistence and Synchronicity in Communication (3)
(from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

(figure 2-22.1)
a) Persistent asynchronous communication
b) Persistent synchronous communication

Persistence and Synchronicity in Communication (4)
(from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

(figure 2-22.2)
a) Transient asynchronous communication
b) Receipt-based transient synchronous communication

Persistence and Synchronicity in Communication (5)
(from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

a) Delivery-based transient synchronous communication at message delivery
b) Response-based transient synchronous communication
Send/Receive: Properties

- Send: non-blocking, Receive: non-blocking
  - System requirements: message buffering; failure return from receive
  - Precedence constraints: none, unless the message is received successfully
- Send: non-blocking, Receive: blocking
  - System requirements: message buffering; termination detection
  - Precedence constraints: actions preceding send occur before those following send
- Send: blocking, Receive: non-blocking
  - System requirements: termination detection; failure return from receive
  - Precedence constraints: actions preceding send occur before those following send
- Send: blocking, Receive: blocking
  - System requirements: termination detection
  - Precedence constraints: actions preceding send occur before those following send
Receive Filtering

- So far, we have the primitive:
  receive(src, msg);
- What if one wants to receive a message:
  - From any source?
  - From a group of processes only?
- Wildcards are used in those cases
- A wildcard is a special value such as:
  - IP: broadcast (e.g.: 192.168.255.255), multicast address
  - The API provides well-defined wildcards (e.g.: MPI_ANY_SOURCE)
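A minimal sketch of a wildcard receive with MPI (the integer payload is illustrative): rank 0 accepts a message from whichever process answers first and reads the actual sender from the status.

/* any_source.c -- compile: mpicc any_source.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Receive from any sender, any tag. */
        MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        printf("got %d from rank %d\n", value, status.MPI_SOURCE);
    } else {
        value = rank * 10;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}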


The Message-Passing Interface (MPI)
(from Distributed Systems -- Principles and Paradigms, A. Tanenbaum, M. Steen)

Primitive Meaning

MPI_bsend Append outgoing message to a local send buffer

MPI_send Send a message and wait until copied to local or remote buffer

MPI_ssend Send a message and wait until receipt starts

MPI_sendrecv Send a message and wait for reply

MPI_isend Pass reference to outgoing message, and continue

MPI_issend Pass reference to outgoing message, and wait until receipt starts

MPI_recv Receive a message; block if there are none

MPI_irecv Check if there is an incoming message, but do not block

Some of the most intuitive message-passing primitives of MPI.


MPI Tutorial
(by William Gropp)

See William Gropp Slides



Parallel Strategies
Outline

- Embarrassingly Parallel Computation
- Partitioning


Embarrassingly Parallel
Computations

- Problems that can be divided into different independent subtasks
  - No interaction between processes
  - No data is really shared (but copies may exist)
  - The results of each process have to be combined in some way (reduction).
- Ideal situation!
- Master/Slave architecture
  - The master defines tasks and sends them to slaves (statically or dynamically)


Embarrassingly Parallel
Computations Typical Example:
Image Processing

- Shifting, Scaling, Clipping, Rotating, ...
- 2 partitioning possibilities

(figure: a 640x480 image divided either into horizontal bands, one band per process, or into rectangular blocks, one block per process)


Image Shift Pseudo-Code

Master:
h = Height/slaves;
for (i = 0, row = 0; i < slaves; i++, row += h)
    send(row, Pi);
for (i = 0; i < Height*Width; i++) {
    recv(oldrow, oldcol, newrow, newcol, PANY);
    map[newrow,newcol] = map[oldrow,oldcol];
}

Slave:
recv(row, Pmaster);
for (oldrow = row; oldrow < (row+h); oldrow++)
    for (oldcol = 0; oldcol < Width; oldcol++) {
        newrow = oldrow + delta_x;
        newcol = oldcol + delta_y;
        send(oldrow, oldcol, newrow, newcol, Pmaster);
    }
Image Shift Pseudo-Code
Complexity

- Hypothesis:
  - 2 computational steps / pixel
  - n*n pixels
  - p processes
- Sequential: Ts = 2n² = O(n²)
- Parallel: T// = O(p + n²) + O(n²/p) = O(n²) (p fixed)
  - Communication: Tcomm = Tstartup + m·Tdata
    = p(Tstartup + 1·Tdata) + n²(Tstartup + 4·Tdata) = O(p + n²)
  - Computation: Tcomp = 2(n²/p) = O(n²/p)


Image Shift Pseudo-Code
Speedup

    Speedup = Ts/Tp = 2n² / (2n²/p + p(Tstartup + Tdata) + n²(Tstartup + 4·Tdata))

- Computation/Communication ratio:

    (2n²/p) / (p(Tstartup + Tdata) + n²(Tstartup + 4·Tdata)) = O((n²/p) / (p + n²))

  - Constant when p is fixed and n grows!
  - Not good: the larger, the better
    - Computation should hide the communication
  - Broadcasting is better
  - This problem is better suited to Shared Memory!


Computing the Mandelbrot Set

- Problem: we consider the limit of the following sequence in the complex plane:
    z_0 = 0
    z_{k+1} = z_k² + c
- The constant c = a.i + b is given by the point (a,b) in the complex plane
- We compute z_k for 0 < k < n
  - if for a given k, |z_k| > 2, we know the sequence diverges to infinity: we plot c in color(k)
  - if for k == n, |z_k| < 2, we assume the sequence converges: we plot c in black.


Mandelbrot Set: sequential code

// Computes the number of iterations for the given complex point (cr, ci)
int calculate(double cr, double ci) {
    double bx = cr, by = ci;
    double xsq, ysq;
    int cnt = 0;

    while (true) {
        xsq = bx * bx;
        ysq = by * by;
        if (xsq + ysq >= 4.0) break;
        by = (2 * bx * by) + ci;
        bx = xsq - ysq + cr;
        cnt++;
        if (cnt >= max_iterations) break;
    }

    return cnt;
}


Mandelbrot Set: sequential
code

y = startDoubleY;
for (int j = 0; j < height; j++) {
double x = startDoubleX;
int offset = j * width;
for (int i = 0; i < width; i++) {
int iter = calculate(x, y);
pixels[i + offset] = colors[iter];
x += dx;
}
y -= dy;
}



Mandelbrot Set:
parallelization

- Static task assignment:
  - Cut the image into p areas, and assign one to each of the p processors
    - Problem: some areas are easier to compute (fewer iterations)
  - Assign n²/p pixels per processor in a round-robin or random manner
    - On average, the number of iterations per processor should be approximately the same
    - Left as an exercise


Mandelbrot set: load balancing

- The number of iterations per pixel varies
- The performance of processors may also vary
- Dynamic Load Balancing
  - Ideally we want each CPU to be 100% busy
- Approach: a work pool, a collection of tasks
  - Sometimes, processes can add new tasks into the work pool
- Problems:
  - How to define a task to minimize the communication overhead?
  - When should a task be given to a process to increase the computation/communication ratio?
Mandelbrot set: dynamic load balancing

- Task == pixel
  - Lots of communications, good load balancing
- Task == row
  - Fewer communications, less load balanced
- Send a task on request
  - Simple, low communication/computation ratio
- Send a task in advance
  - Complex to implement, good communication/computation ratio
- Problem: Termination Detection?


Mandelbrot Set: parallel code
Master Code

count = row = 0;
for (k = 0; k < num_procs; k++) {
    send(row, Pk, data_tag);
    count++, row++;
}
do {
    recv(r, pixels, PANY, result_tag);
    count--;
    if (row < HEIGHT) {
        send(row, Pslave, data_tag);
        row++, count++;
    } else send(row, Pslave, terminator_tag);
    display(r, pixels);
} while (count > 0);

- Pslave is the process which sent the last received message.


Mandelbrot Set: parallel code
Slave Code

recv(row, PMASTER, ANYTAG, source_tag);


while(source_tag == data_tag) {
double y = startDoubleYFrom(row);
double x = startDoubleX;
for (int i = 0; i < WIDTH; i++) {
int iter = calculate(x, y);
pixels[i] = colors[iter];
x += dx;
}
send(row, pixels, PMASTER, result_tag);
recv(row, PMASTER, ANYTAG, source_tag);
};



Mandelbrot Set: parallel code
Analysis

    Ts <= Max_iter · n²
    Tp = Tcomm + Tcomp
    Tcomm = n(t_startup + 2·t_data) + (p-1)(t_startup + 1·t_data) + n(t_startup + n·t_data)
    Tcomp <= Max_iter · n²/p

    Ts/Tp = (Max_iter · n²) / (Max_iter · n²/p + n²·t_data + (2n + p - 1)(t_startup + t_data))

  (only valid when all pixels are black: inequality of Ts and Tp)

    Tcomp/Tcomm = (Max_iter · n²/p) / (n²·t_data + (2n + p - 1)(t_startup + t_data))

- Speedup approaches p when Max_iter is high
- Computation/Communication is in O(Max_iter)


Partitioning

- Basis of parallel programming
- Steps:
  - cut the initial problem into smaller parts
  - solve the smaller parts in parallel
  - combine the results into one
- Two ways
  - Data partitioning
  - Functional partitioning
- Special case: divide and conquer
  - subproblems are of the same form as the larger problem
Partitioning Example:
numerical integration

- We want to compute:

    ∫_a^b f(x) dx
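As a sketch of data partitioning applied to this integral (the integrand, bounds and step count below are arbitrary choices for illustration): each OpenMP thread sums the rectangles of its own sub-range and the partial results are combined with a reduction.

/* integrate.c -- compile: gcc -fopenmp integrate.c */
#include <omp.h>
#include <stdio.h>

static double f(double x) { return x * x; }    /* example integrand */

int main(void) {
    const double a = 0.0, b = 1.0;
    const long   N = 1000000;                  /* number of rectangles */
    const double h = (b - a) / N;
    double sum = 0.0;

    /* The iteration space is partitioned among the threads; each thread
     * accumulates a private sum, combined at the end by the reduction. */
    #pragma omp parallel for reduction(+: sum)
    for (long i = 0; i < N; i++)
        sum += f(a + (i + 0.5) * h);           /* midpoint rule */

    printf("integral ~= %f\n", sum * h);       /* ~0.333333 */
    return 0;
}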


Quiz


Readings

- Describe and compare the following mechanisms for initiating concurrency:
  - Fork/Join
  - Asynchronous Method Invocation with futures
  - Cobegin
  - Forall
  - Aggregate
  - Life Routine (aka Active Objects)
