Reference: Albert Y. H. Zomaya (ed.), Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computer Engineering. ISBN 0-07-073020-2.
Your head should be in front of your paper!
Final: 50%
Outline
Flynn's Classification
Interconnection Networks
Language Specific Constructions
Computers and languages are designed sequentially.
But doing several things at once is an obvious way to increase speed.
Parallel processing is an old field, but it has always been limited by the wide sequential world, for various reasons (financial, psychological, ...).
MIMD: Multi-Processor Computers
MISD: Not very useful (sometimes presented as pipelined SIMD)
[Figure: a 4-stage pipeline (IF, OF, EX, STR) with a flow of instructions I1...I8 passing through the stages]
Consequences:
  Circuitry complexity (cost)
  Performance
Exercises: Real-Life Examples
Intel:
  Pentium 4:
  Pentium 3 (= Pentium M = Core Duo):
AMD:
  Athlon:
  Opteron:
IBM:
  Power6:
  Cell:
Sun:
  UltraSparc:
  T1:
Instruction Level Parallelism
Multiple computing units that can be used simultaneously (scalar):
  Floating-Point Unit
  Arithmetic and Logic Unit
Reordering of instructions:
  By the compiler
  By the hardware
Lookahead (prefetching) & scoreboarding (resolve conflicts)
Original semantics is guaranteed
[Figure: two data dependence trees for the same expression over operands A...H. The tree of height 5 requires 5 steps if the hardware can do 1 add and 1 multiplication simultaneously; the balanced tree of height 4 requires only 4 steps if the hardware can do 2 adds and 1 multiplication simultaneously.]
SIMD Architecture
Often (especially in scientific applications), one instruction is applied on different data:
for (int i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
}
Using vector operations (add, mul, div, ...)
Two versions:
  True SIMD: n arithmetic units, n high-level operations at a time
  Pipelined SIMD: arithmetic pipelined (depth n), n non-identical low-level operations at a time
Very efficient: in 2002, the winner of the TOP500 list was the Japanese Earth Simulator, a vector supercomputer.
Different parts of a program are executed simultaneously on each processor
Need cooperation!
A multiprocessor (not our concern here) running a program consisting of multiple cooperating processes
[Figure: shared-memory architecture, several CPUs connected to a single memory M]
Communication modes: synchronous, asynchronous, grouped, remote method invocation
Large messages may mask long and variable latency
Channel delay (cd): message size / bandwidth
so + ct depends on program behaviour
rd + cd depends on hardware
Soft Network Metrics
Diameter (r): longest path between two nodes
Average distance (da):
$d_a = \frac{1}{N-1} \sum_{d=1}^{r} d \cdot N_d$
where d is the distance between two nodes, defined as the number of links in the shortest path between them, N the total number of nodes, and N_d the number of nodes at a distance d from a given node.
Connectivity (P): the number of nodes that can be reached from a given node in one hop.
Concurrency (C): the number of independent connections that a network can make.
  bus: C = 1; linear: C = N − 1 (if a processor can send and receive simultaneously; otherwise, C = N/2)
Influence of the Network
Bus:
  low cost
  low concurrency degree: 1 message at a time
Fully connected:
  high cost: N(N−1)/2 links
  high concurrency degree: N messages at a time
Star:
  low cost (high availability)
  Exercise: concurrency degree?
[Figure: SIMD vector addition C = A + B; each processing element i computes C[i] = A[i] + B[i] simultaneously]
SIMD Matrix Multiplication
General formula:
$c_{ij} = \sum_{k=0}^{N-1} a_{ik} \, b_{kj}$
// SIMD computation
for (i = 0; i < N; i++) {
    for (k = 0; k < N; k++) {
        C[i,j] = C[i,j] + A[i,k]*B[k,j], (0 <= j < N); // one vector operation over all j
    }
}
gcc builtin functions:
#include <mmintrin.h> // MMX only
Compile with -mmmx
Use the provided functions:
_mm_setr_pi16 (short w0, short w1, short w2, short w3);
_mm_add_pi16 (__m64 m1, __m64 m2);
...
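For instance, a minimal sketch using these intrinsics: one _mm_add_pi16 performs four 16-bit additions at once (the store through r[] is an illustration, not from the slides):

#include <stdio.h>
#include <mmintrin.h> // MMX intrinsics; compile with gcc -mmmx

int main(void) {
    __m64 a = _mm_setr_pi16(1, 2, 3, 4);     // pack four shorts into a 64-bit register
    __m64 b = _mm_setr_pi16(10, 20, 30, 40);
    __m64 c = _mm_add_pi16(a, b);            // four 16-bit additions in parallel
    short r[4];
    *(__m64 *)r = c;                         // store the packed result back to memory
    _mm_empty();                             // reset MMX state before any FP code
    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]); // 11 22 33 44
    return 0;
}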
Sufficient? What about synchronization?
We will focus on that later on...
    } // end of forked code block
}
join N; // Wait for the other processes
M. Philippsen, "Imperative Concurrent Object-Oriented Languages", 1995
Hans-J. Boehm, "Threads Cannot Be Implemented As a Library", 2005
Non-blocking send and blocking receive
Grouped communications:
  scatter/gather, broadcast, multicast, ...
MIMD Distributed Memory
C code:
  Traditional sockets:
    Bad performance
    Grouped communication limited
  Parallel Virtual Machine (PVM):
    Old but good and portable
  Message Passing Interface (MPI):
    The standard used in parallel programming today
    Very efficient implementations
    Vendors of supercomputers provide their own highly optimized MPI implementation
    Open-source implementations available
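As an illustration, a minimal MPI program (a sketch, not from the slides) in which rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           // blocking send to rank 1
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  // blocking receive from rank 0
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}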
Performance
Outline
  Size and Depth
  Speedup and Efficiency
[Figure: data dependence graphs for summing N values. Size = number of operations; Depth = number of operations on the longest path. Sequential chain: Size = N−1, Depth = N−1. Balanced binary tree: Size = N−1, Depth = log2 N.]
Not captured by size and depth: send/receive for distributed-memory MIMD, load balancing when the number of processors is less than the number of parallel tasks, etc.
Size and depth are inherent characteristics, independent of any specific parallel machine architecture.
Prefix Problem
for (i = 1; i < N; i++) {
V[i] = V[i-1] + V[i];
}
[Figure: data dependence graph of the sequential prefix algorithm on V[0..N−1]: Size = N−1, Depth = N−1.]
How to make this program parallel?
[Figure: data dependence graph of the upper/lower parallel prefix algorithm: Depth = log2 N, Size (N = 2^k) = k·2^(k−1) = (N/2)·log2 N.]
Analysis of the Upper/Lower Prefix Algorithm
Proof by induction
Only for N a power of 2
Assume from construction that size increases smoothly with N
Example: N = 1,048,576 = 2^20
  Sequential prefix algorithm: 1,048,575 operations in 1,048,575 time-unit steps
  Upper/lower prefix algorithm: 10,485,760 operations in 20 time-unit steps
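A recursive sketch of the upper/lower scheme in C (a minimal version assuming N is a power of 2; the function name is illustrative). The two recursive calls are independent and could run in parallel, and the final loop is one vector step:

/* Size(n) = 2*Size(n/2) + n/2 = (n/2)*log2(n); Depth = log2(n) if the two
   recursive calls run in parallel and the final loop is one parallel step. */
void prefix(int *V, int n) {
    if (n == 1) return;
    prefix(V, n / 2);           // prefix of the lower half
    prefix(V + n / 2, n / 2);   // prefix of the upper half (independent of lower)
    for (int i = n / 2; i < n; i++)
        V[i] += V[n / 2 - 1];   // add the lower-half total to each upper element
}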
A benchmark is a performance-testing program supposed to capture processing and data movement characteristics of a class of applications.
Used to measure and to predict performance of computer systems, and to reveal their architectural weaknesses and strong points.
Benchmark suite: set of benchmark programs.
Benchmark family: set of benchmark suites.
Classification: scientific computing, commercial applications, network services, multimedia applications, ...
Collective Communication
Broadcast: 1 process sends an m-byte message to all n processes
Multicast: 1 process sends an m-byte message to p < n processes
Gather: 1 process receives m bytes from each of the n processes (mn bytes received)
Scatter: 1 process sends a distinct m-byte message to each of the n processes (mn bytes sent)
Total exchange: every process sends a distinct m-byte message to each of the n processes (mn² bytes sent)
Circular shift: process i sends an m-byte message to process i+1 (process n−1 to process 0)
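These patterns map directly onto MPI's collective operations; a minimal sketch (not from the slides) showing broadcast and gather:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, n, value, *all = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   // root 0 sends to all n processes
    value += rank;                                      // some local work
    if (rank == 0) all = malloc(n * sizeof(int));
    MPI_Gather(&value, 1, MPI_INT, all, 1, MPI_INT, 0,
               MPI_COMM_WORLD);                         // root receives from each process
    if (rank == 0) { printf("gathered %d values\n", n); free(all); }
    MPI_Finalize();
    return 0;
}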
Collective Communication Overhead
Function of both m and n
Startup latency depends only on n:
T(m, n) = t0(n) + m / r(n)
Many problems here! For a broadcast, for example, how is the communication done at the link layer? Sequentially?
Network contention
For a given algorithm, let Tp be the time to perform the computation using p processors
T∞: depth of the algorithm
How much time do we gain by using a parallel algorithm?
Speedup: Sp = T1 / Tp
Issue: what is T1?
  Time taken by the execution of the best sequential algorithm on the same input data
  We note T1 = Ts
Best parallel time achievable is: Ts / p
Best speedup is bounded by Sp ≤ Ts / (Ts/p) = p
  Called linear speedup
Conditions:
  Workload can be divided into p equal parts
  No overhead at all
Ideal situation, usually impossible to achieve!
Objective of a parallel algorithm: being as close to a linear speedup as possible
Let α be the fraction of the workload W that is inherently sequential. Then:
$T_p(W) = T_p(\alpha W + (1-\alpha)W) = T_s(\alpha W) + T_p((1-\alpha)W) = \alpha T_s(W) + (1-\alpha)\frac{T_s(W)}{p}$
$S_p = \frac{T_s}{\alpha T_s + (1-\alpha)\frac{T_s}{p}} = \frac{p}{1+(p-1)\alpha} \le p, \qquad S_p \xrightarrow{p\to\infty} \frac{1}{\alpha}$
Amdahl's Law: fixed problem size
[Figure: the sequential time Ts splits into a serial section αTs and parallelizable sections (1−α)Ts; with p processors the parallelizable part takes (1−α)Ts/p, so Tp = αTs + (1−α)Ts/p.]
Amdahl's Law Consequences
[Plot: speedup S(p) for α = 5%, 10%, 20%, 40% at p = 10 and p = 100, with the asymptote S(∞) = 1/α. E.g. for α = 5% and p = 100, S(p) = 100/(1 + 99 × 0.05) ≈ 16.8, far below the linear speedup of 100.]
Gustafson's Law: fixed execution time
Tp(W) = αTs(W) + (1−α)Ts(W)/p
To keep the execution time constant, the parallel workload should be increased:
Parallel case (p processors): W' = αW + p(1−α)W
Ts(W') = αTs(W) + p(1−α)Ts(W)
Assumption: the sequential part is constant: it does not increase with the size of the problem
$S_p = \frac{T_s(W')}{T_p(W')} = \frac{T_s(W')}{T_s(W)} = \frac{\alpha T_s(W) + p(1-\alpha)T_s(W)}{\alpha T_s(W) + (1-\alpha)T_s(W)}$
$S_p = \alpha + (1-\alpha)p$
[Plot: scaled speedup Sp = α + (1−α)p for α = 10%, 20%, 40% at p = 10 and p = 25; for fixed p, Sp is a decreasing function of α.]
Outline
Matrix Addition & Multiplication
Gauss Elimination
$C = A + B, \quad c_{i,j} = a_{i,j} + b_{i,j}, \quad 0 \le i < n,\ 0 \le j < m$
// Sequential
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        c[i,j] = a[i,j] + b[i,j];
    }
}
Need to map the vectored addition of m elements a[i] + b[i] to our underlying architecture made of p processing elements.
// SIMD
for (i = 0; i < n; i++) {
    c[i,j] = a[i,j] + b[i,j], (0 <= j < m);
}
Complexity:
  Sequential case: n·m additions
  SIMD case: n·m/p
Speedup: S(p) = p
$C = A \cdot B, \quad A[n,l],\ B[l,m],\ C[n,m]$
$c_{i,j} = \sum_{k=0}^{l-1} a_{i,k} \, b_{k,j}, \quad 0 \le i < n,\ 0 \le j < m$
Represented by
$M = \begin{pmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & \cdots & a_{0,n-1} & b_0 \\
a_{1,0} & a_{1,1} & a_{1,2} & \cdots & a_{1,n-1} & b_1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
a_{n-1,0} & a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & b_{n-1}
\end{pmatrix}$
Gauss Elimination Principle
Transform the linear system into a triangular system of equations.
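A sequential sketch of the elimination phase in C (a minimal version assuming non-zero pivots and no pivoting; N and the function name are illustrative). The updates of the rows below a given pivot are independent of each other, which is what a parallel version exploits:

#define N 4 /* illustrative system size */

/* Eliminate below the diagonal of the augmented matrix M[N][N+1],
   leaving a triangular system. Assumes every pivot M[k][k] is non-zero. */
void gauss(double M[N][N + 1]) {
    for (int k = 0; k < N; k++) {           // pivot row k
        for (int i = k + 1; i < N; i++) {   // rows below the pivot: independent,
            double f = M[i][k] / M[k][k];   // so they can be updated in parallel
            for (int j = k; j <= N; j++)
                M[i][j] -= f * M[k][j];     // row_i -= f * row_k
        }
    }
}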
Outline
Process, Threads & Fibers
Pre- & self-scheduling
Common Issues
OpenMP
[Figure: a single-threaded process vs. a multi-threaded process]
Kernel Threads
Provided and scheduled by the kernel:
  Solaris, Linux, MacOS X, AIX, Windows XP
Incompatibilities: POSIX PThread standard
Scheduling question: contention scope
The time slice q can be given to:
  the whole process (PTHREAD_PROCESS_SCOPE)
  each individual thread (PTHREAD_SYSTEM_SCOPE)
Fairness?
  With PTHREAD_SYSTEM_SCOPE, the more threads a process has, the more CPU it gets!
Challenge
Print primes from 1 to 10^10
Given:
  A ten-processor multiprocessor
  One thread per processor
Goal:
  Get tenfold speedup (or close)
[Figure: the range is split statically among processors P0, P1, ..., P9]
Code for thread i:
void thread(int i) {
    // static decomposition: thread i tests the 10^9 numbers of its own range
    for (long j = i*1000000000L + 1; j < (i+1)*1000000000L; j++) {
        if (isPrime(j)) print(j);
    }
}
[Figure: self-scheduling with a shared counter; each thread takes the next number to test (..., 17, 18, 19)]
void thread(int i) {
    long j = 0;
    while (j < 10000000000L) { // 10^10
        j = inc();
        if (isPrime(j)) print(j);
    }
}
long inc() {
    return value++;
}
// value++ actually compiles to three separate steps:
// temp = value; value = value + 1; return temp;
[Figure: possible interleavings over time of two threads executing inc(): depending on the order of the reads and writes, value may correctly end at 3, or, when both threads read before either writes, an update is lost and value ends at 2.]
long inc() {
    temp = value;       // Make these steps
    value = temp + 1;   // atomic (indivisible)
    return temp;
}
long inc() {
    pthread_mutex_lock(&mutex);
    temp = value;           // critical
    value = temp + 1;       // section
    pthread_mutex_unlock(&mutex);
    return temp;
}
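An alternative sketch (not in the original slides): with C11 atomics, the counter can be incremented without a mutex, since the hardware provides an indivisible read-modify-write:

#include <stdatomic.h>

static atomic_long value; /* shared counter */

long inc(void) {
    /* One atomic fetch-and-add; returns the previous value,
       like the mutex-protected version above. */
    return atomic_fetch_add(&value, 1);
}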
Loop should be in canonical form (see spec)
Private copies, except for shared variables (memory model)
Implicit barrier at the end of a loop
reduction(op: list)
  e.g.: reduction(*: result)
A private copy is made for each variable declared in 'list' for each thread of the parallel region
A final value is produced using the operator 'op' to combine all private copies
#pragma omp parallel for reduction(*: res)
for (i = 0; i < SIZE; i++) {
    res = res * a[i];
}
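Wrapped into a complete, compilable sketch (SIZE and the array contents are illustrative additions; build with gcc -fopenmp):

#include <stdio.h>

#define SIZE 8

int main(void) {
    double a[SIZE], res = 1.0;
    for (int i = 0; i < SIZE; i++)
        a[i] = i + 1;                  // a = {1, 2, ..., 8}
    #pragma omp parallel for reduction(*: res)
    for (int i = 0; i < SIZE; i++)
        res = res * a[i];              // each thread multiplies into its private copy
    printf("product = %g\n", res);     // copies combined with '*': 8! = 40320
    return 0;
}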
Outline
Message Passing Solutions
  Process Creation
  Send/Receive
  MPI
[Figure: each source file is compiled to suit its target processor, producing one executable file per processor]
Master/Slave Architecture
GetPID() is provided by the parallel system (library, language, ...)
MASTER_PID should be well defined
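A typical dispatch sketch built on these primitives (master() and slave() are hypothetical helpers, not defined in the slides):

if (GetPID() == MASTER_PID) {
    master();   // distribute the work, then collect the results
} else {
    slave();    // receive work, compute, send the result back
}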
Primitive     Meaning
MPI_Send      Send a message and wait until it is copied to a local or remote buffer
MPI_Issend    Pass a reference to the outgoing message, and wait until the receipt starts
Outline
Embarrassingly Parallel Computation
Partitioning
Slave:
recv(row, Pmaster);
for (oldrow = row; oldrow < (row + h); oldrow++)
    for (oldcol = 0; oldcol < Width; oldcol++) {
        newrow = oldrow + delta_x;
        newcol = oldcol + delta_y;
        send(oldrow, oldcol, newrow, newcol, Pmaster);
    }
Image Shift Pseudo-Code
Complexity
Hypothesis:
  2 computational steps per pixel
  n × n pixels
  p processes
Sequential: Ts = 2n² = O(n²)
Parallel: T∥ = O(p + n²) + O(n²/p) = O(n²) (p fixed)
Communication: Tcomm = Tstartup + m·Tdata
  = p(Tstartup + 1·Tdata) + n²(Tstartup + 4·Tdata) = O(p + n²)
Computation: Tcomp = 2(n²/p) = O(n²/p)
    return cnt; // end of calculate(x, y): the iteration count for one pixel
}

y = startDoubleY;
for (int j = 0; j < height; j++) {           // rows are independent of each other
    double x = startDoubleX;
    int offset = j * width;                  // start of row j in the pixel buffer
    for (int i = 0; i < width; i++) {
        int iter = calculate(x, y);
        pixels[i + offset] = colors[iter];   // map the iteration count to a color
        x += dx;
    }
    y -= dy;
}
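Since every pixel is computed independently, the row loop is embarrassingly parallel; a hedged OpenMP sketch (the pragma and the recomputation of y from j are additions, not from the original code):

#pragma omp parallel for
for (int j = 0; j < height; j++) {
    double y = startDoubleY - j * dy;   // derive y from j so iterations are independent
    double x = startDoubleX;
    int offset = j * width;
    for (int i = 0; i < width; i++) {
        pixels[i + offset] = colors[calculate(x, y)]; // disjoint writes per row
        x += dx;
    }
}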
$\int_{a}^{b} f(x)\,dx$