Reference: Albert Y. H. Zomaya (ed.), Parallel and Distributed Computing Handbook, McGraw-Hill Series on Computer Engineering. ISBN 0-07-073020-2.
Your head should be in front of your paper!
Final: 50%
Outline
Flynn's Classification
Interconnection Networks
Language Specific Constructions
Computers and languages are designed sequentially.
But doing several things at once is an obvious way to increase speed.
Parallel processing is an old field, but it has always been limited by the wide sequential world, for various reasons (financial, psychological, ...).
MIMD: Multi-Processor Computers
MISD: Not very useful (sometimes presented as pipelined SIMD)
[Figure: a 4-stage pipeline (IF, OF, EX, STR) with a flow of instructions I1...I8 passing through the stages]
Consequences:
  Circuitry complexity (cost)
  Performance
Exercises: Real-Life Examples
Intel:
  Pentium 4:
  Pentium 3 (= Pentium M = Core Duo):
AMD:
  Athlon:
  Opteron:
IBM:
  Power6:
  Cell:
Sun:
  UltraSparc:
  T1:
Instruction Level Parallelism
Multiple computing units that can be used simultaneously (scalar):
  Floating-Point Unit
  Arithmetic and Logic Unit
Reordering of instructions:
  By the compiler
  By the hardware
Lookahead (prefetching) & scoreboarding (resolve conflicts)
Original semantics is guaranteed
[Figure: two data dependence trees for the same expression over operands A...H. The tree of height 5 requires 5 steps if the hardware can do 1 add and 1 multiplication simultaneously; the balanced tree of height 4 requires only 4 steps if the hardware can do 2 adds and 1 multiplication simultaneously.]
SIMD Architecture
Often (especially in scientific applications), one instruction is applied on different data:
for (int i = 0; i < n; i++) {
    r[i] = a[i] + b[i];
}
Using vector operations (add, mul, div, ...)
Two versions:
  True SIMD: n arithmetic units, n high-level operations at a time
  Pipelined SIMD: arithmetic pipelined (depth n), n non-identical low-level operations at a time
Very efficient: in 2002, the winner of the TOP500 list was the Japanese Earth Simulator, a vector supercomputer.
Different parts of a program are executed simultaneously on each processor
Need cooperation!
A multiprocessor (not our concern here) running a program consisting of multiple cooperating processes
[Figure: shared-memory architecture, several CPUs connected to a single memory M]
Communication modes: synchronous, asynchronous, grouped, remote method invocation
Large messages may mask long and variable latency
Channel delay (cd): message size / bandwidth
so + ct depends on program behaviour
rd + cd depends on hardware
Soft Network Metrics
Diameter (r): longest path between two nodes
Average distance (da):
$d_a = \frac{1}{N-1} \sum_{d=1}^{r} d \cdot N_d$
where d is the distance between two nodes, defined as the number of links in the shortest path between them, N the total number of nodes, and N_d the number of nodes at a distance d from a given node.
Connectivity (P): the number of nodes that can be reached from a given node in one hop.
Concurrency (C): the number of independent connections that a network can make.
  bus: C = 1; linear: C = N − 1 (if a processor can send and receive simultaneously; otherwise, C = N/2)
Influence of the Network
Bus:
  low cost
  low concurrency degree: 1 message at a time
Fully connected:
  high cost: N(N−1)/2 links
  high concurrency degree: N messages at a time
Star:
  low cost (high availability)
  Exercise: concurrency degree?
[Figure: SIMD vector addition C = A + B; each processing element i computes C[i] = A[i] + B[i] simultaneously]
SIMD Matrix Multiplication
General formula:
$c_{ij} = \sum_{k=0}^{N-1} a_{ik} \, b_{kj}$
// SIMD computation
for (i = 0; i < N; i++) {
    for (k = 0; k < N; k++) {
        C[i,j] = C[i,j] + A[i,k]*B[k,j], (0 <= j < N); // one vector operation over all j
    }
}
gcc builtin functions:
#include <mmintrin.h> // MMX only
Compile with -mmmx
Use the provided functions:
_mm_setr_pi16 (short w0, short w1, short w2, short w3);
_mm_add_pi16 (__m64 m1, __m64 m2);
...
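For instance, a minimal sketch using these intrinsics: one _mm_add_pi16 performs four 16-bit additions at once (the store through r[] is an illustration, not from the slides):

#include <stdio.h>
#include <mmintrin.h> // MMX intrinsics; compile with gcc -mmmx

int main(void) {
    __m64 a = _mm_setr_pi16(1, 2, 3, 4);     // pack four shorts into a 64-bit register
    __m64 b = _mm_setr_pi16(10, 20, 30, 40);
    __m64 c = _mm_add_pi16(a, b);            // four 16-bit additions in parallel
    short r[4];
    *(__m64 *)r = c;                         // store the packed result back to memory
    _mm_empty();                             // reset MMX state before any FP code
    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]); // 11 22 33 44
    return 0;
}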
Sufficient? What about synchronization?
We will focus on that later on...
    } // end of forked code block
}
join N; // Wait for the other processes
M. Philippsen, "Imperative Concurrent Object-Oriented Languages", 1995
Hans-J. Boehm, "Threads Cannot Be Implemented As a Library", 2005
Non-blocking send and blocking receive
Grouped communications:
  scatter/gather, broadcast, multicast, ...
MIMD Distributed Memory
C code:
  Traditional sockets:
    Bad performance
    Grouped communication limited
  Parallel Virtual Machine (PVM):
    Old but good and portable
  Message Passing Interface (MPI):
    The standard used in parallel programming today
    Very efficient implementations
    Vendors of supercomputers provide their own highly optimized MPI implementation
    Open-source implementations available
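As an illustration, a minimal MPI program (a sketch, not from the slides) in which rank 0 sends one integer to rank 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           // blocking send to rank 1
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                  // blocking receive from rank 0
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}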
Performance
Outline
  Size and Depth
  Speedup and Efficiency
[Figure: data dependence graphs for summing N values. Size = number of operations; Depth = number of operations on the longest path. Sequential chain: Size = N−1, Depth = N−1. Balanced binary tree: Size = N−1, Depth = log2 N.]
Not captured by size and depth: send/receive for distributed-memory MIMD, load balancing when the number of processors is less than the number of parallel tasks, etc.
Size and depth are inherent characteristics, independent of any specific parallel machine architecture.
Prefix Problem
for (i = 1; i < N; i++) {
V[i] = V[i-1] + V[i];
}
[Figure: data dependence graph of the sequential prefix algorithm on V[0..N−1]: Size = N−1, Depth = N−1.]
How to make this program parallel?
[Figure: data dependence graph of the upper/lower parallel prefix algorithm: Depth = log2 N, Size (N = 2^k) = k·2^(k−1) = (N/2)·log2 N.]
Analysis of the Upper/Lower Prefix Algorithm
Proof by induction
Only for N a power of 2
Assume from construction that size increases smoothly with N
Example: N = 1,048,576 = 2^20
  Sequential prefix algorithm: 1,048,575 operations in 1,048,575 time-unit steps
  Upper/lower prefix algorithm: 10,485,760 operations in 20 time-unit steps
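A recursive sketch of the upper/lower scheme in C (a minimal version assuming N is a power of 2; the function name is illustrative). The two recursive calls are independent and could run in parallel, and the final loop is one vector step:

/* Size(n) = 2*Size(n/2) + n/2 = (n/2)*log2(n); Depth = log2(n) if the two
   recursive calls run in parallel and the final loop is one parallel step. */
void prefix(int *V, int n) {
    if (n == 1) return;
    prefix(V, n / 2);           // prefix of the lower half
    prefix(V + n / 2, n / 2);   // prefix of the upper half (independent of lower)
    for (int i = n / 2; i < n; i++)
        V[i] += V[n / 2 - 1];   // add the lower-half total to each upper element
}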
A benchmark is a performance-testing program supposed to capture processing and data movement characteristics of a class of applications.
Used to measure and to predict performance of computer systems, and to reveal their architectural weaknesses and strong points.
Benchmark suite: set of benchmark programs.
Benchmark family: set of benchmark suites.
Classification: scientific computing, commercial applications, network services, multimedia applications, ...
Collective Communication
Broadcast: 1 process sends an m-byte message to all n processes
Multicast: 1 process sends an m-byte message to p < n processes
Gather: 1 process receives m bytes from each of the n processes (mn bytes received)
Scatter: 1 process sends a distinct m-byte message to each of the n processes (mn bytes sent)
Total exchange: every process sends a distinct m-byte message to each of the n processes (mn² bytes sent)
Circular shift: process i sends an m-byte message to process i+1 (process n−1 to process 0)
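These patterns map directly onto MPI's collective operations; a minimal sketch (not from the slides) showing broadcast and gather:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, n, value, *all = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    value = 42;
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);   // root 0 sends to all n processes
    value += rank;                                      // some local work
    if (rank == 0) all = malloc(n * sizeof(int));
    MPI_Gather(&value, 1, MPI_INT, all, 1, MPI_INT, 0,
               MPI_COMM_WORLD);                         // root receives from each process
    if (rank == 0) { printf("gathered %d values\n", n); free(all); }
    MPI_Finalize();
    return 0;
}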
Collective Communication Overhead
Function of both m and n
Startup latency depends only on n:
T(m, n) = t0(n) + m / r(n)
Many problems here! For a broadcast, for example, how is the communication done at the link layer? Sequentially?
Network contention
For a given algorithm, let Tp be the time to perform the computation using p processors
T∞: depth of the algorithm
How much time do we gain by using a parallel algorithm?
Speedup: Sp = T1 / Tp
Issue: what is T1?
  Time taken by the execution of the best sequential algorithm on the same input data
  We note T1 = Ts
Best parallel time achievable is: Ts / p
Best speedup is bounded by Sp ≤ Ts / (Ts/p) = p
  Called linear speedup
Conditions:
  Workload can be divided into p equal parts
  No overhead at all
Ideal situation, usually impossible to achieve!
Objective of a parallel algorithm: being as close to a linear speedup as possible
Let α be the fraction of the workload W that is inherently sequential. Then:
$T_p(W) = T_p(\alpha W + (1-\alpha)W) = T_s(\alpha W) + T_p((1-\alpha)W) = \alpha T_s(W) + (1-\alpha)\frac{T_s(W)}{p}$
$S_p = \frac{T_s}{\alpha T_s + (1-\alpha)\frac{T_s}{p}} = \frac{p}{1+(p-1)\alpha} \le p, \qquad S_p \xrightarrow{p\to\infty} \frac{1}{\alpha}$
Amdahl's Law: fixed problem size
[Figure: the sequential time Ts splits into a serial section αTs and parallelizable sections (1−α)Ts; with p processors the parallelizable part takes (1−α)Ts/p, so Tp = αTs + (1−α)Ts/p.]
Amdahl's Law Consequences
[Plot: speedup S(p) for α = 5%, 10%, 20%, 40% at p = 10 and p = 100, with the asymptote S(∞) = 1/α. E.g. for α = 5% and p = 100, S(p) = 100/(1 + 99 × 0.05) ≈ 16.8, far below the linear speedup of 100.]
Gustafson's Law: fixed execution time
Tp(W) = αTs(W) + (1−α)Ts(W)/p
To keep the execution time constant, the parallel workload should be increased:
Parallel case (p processors): W' = αW + p(1−α)W
Ts(W') = αTs(W) + p(1−α)Ts(W)
Assumption: the sequential part is constant: it does not increase with the size of the problem
$S_p = \frac{T_s(W')}{T_p(W')} = \frac{T_s(W')}{T_s(W)} = \frac{\alpha T_s(W) + p(1-\alpha)T_s(W)}{\alpha T_s(W) + (1-\alpha)T_s(W)}$
$S_p = \alpha + (1-\alpha)p$
[Plot: scaled speedup Sp = α + (1−α)p for α = 10%, 20%, 40% at p = 10 and p = 25; for fixed p, Sp is a decreasing function of α.]
Outline
Matrix Addition & Multiplication
Gauss Elimination
$C = A + B, \quad c_{i,j} = a_{i,j} + b_{i,j}, \quad 0 \le i < n,\ 0 \le j < m$
// Sequential
for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++) {
        c[i,j] = a[i,j] + b[i,j];
    }
}
Need to map the vectored addition of m elements a[i] + b[i] to our underlying architecture made of p processing elements.
// SIMD
for (i = 0; i < n; i++) {
    c[i,j] = a[i,j] + b[i,j], (0 <= j < m);
}
Complexity:
  Sequential case: n·m additions
  SIMD case: n·m/p
Speedup: S(p) = p
$C = A \cdot B, \quad A[n,l],\ B[l,m],\ C[n,m]$
$c_{i,j} = \sum_{k=0}^{l-1} a_{i,k} \, b_{k,j}, \quad 0 \le i < n,\ 0 \le j < m$
Represented by
$M = \begin{pmatrix}
a_{0,0} & a_{0,1} & a_{0,2} & \cdots & a_{0,n-1} & b_0 \\
a_{1,0} & a_{1,1} & a_{1,2} & \cdots & a_{1,n-1} & b_1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
a_{n-1,0} & a_{n-1,1} & a_{n-1,2} & \cdots & a_{n-1,n-1} & b_{n-1}
\end{pmatrix}$
Gauss Elimination Principle
Transform the linear system into a triangular system of equations.
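A sequential sketch of the elimination phase in C (a minimal version assuming non-zero pivots and no pivoting; N and the function name are illustrative). The updates of the rows below a given pivot are independent of each other, which is what a parallel version exploits:

#define N 4 /* illustrative system size */

/* Eliminate below the diagonal of the augmented matrix M[N][N+1],
   leaving a triangular system. Assumes every pivot M[k][k] is non-zero. */
void gauss(double M[N][N + 1]) {
    for (int k = 0; k < N; k++) {           // pivot row k
        for (int i = k + 1; i < N; i++) {   // rows below the pivot: independent,
            double f = M[i][k] / M[k][k];   // so they can be updated in parallel
            for (int j = k; j <= N; j++)
                M[i][j] -= f * M[k][j];     // row_i -= f * row_k
        }
    }
}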
Outline
Process, Threads & Fibers
Pre- & self-scheduling
Common Issues
OpenMP
[Figure: a single-threaded process vs. a multi-threaded process]
Kernel Threads
Provided and scheduled by the kernel:
  Solaris, Linux, MacOS X, AIX, Windows XP
Incompatibilities: POSIX PThread standard
Scheduling question: contention scope
The time slice q can be given to:
  the whole process (PTHREAD_PROCESS_SCOPE)
  each individual thread (PTHREAD_SYSTEM_SCOPE)
Fairness?
  With PTHREAD_SYSTEM_SCOPE, the more threads a process has, the more CPU it gets!
Challenge
Print primes from 1 to 10^10
Given:
  A ten-processor multiprocessor
  One thread per processor
Goal:
  Get tenfold speedup (or close)
[Figure: the range is split statically among processors P0, P1, ..., P9]
Code for thread i:
void thread(int i) {
    // static decomposition: thread i tests the 10^9 numbers of its own range
    for (long j = i*1000000000L + 1; j < (i+1)*1000000000L; j++) {
        if (isPrime(j)) print(j);
    }
}
[Figure: self-scheduling with a shared counter; each thread takes the next number to test (..., 17, 18, 19)]
void thread(int i) {
    long j = 0;
    while (j < 10000000000L) { // 10^10
        j = inc();
        if (isPrime(j)) print(j);
    }
}
long inc() {
    return value++;
}
// value++ actually compiles to three separate steps:
// temp = value; value = value + 1; return temp;
[Figure: possible interleavings over time of two threads executing inc(): depending on the order of the reads and writes, value may correctly end at 3, or, when both threads read before either writes, an update is lost and value ends at 2.]
long inc() {
    temp = value;       // Make these steps
    value = temp + 1;   // atomic (indivisible)
    return temp;
}
long inc() {
    pthread_mutex_lock(&mutex);
    temp = value;           // critical
    value = temp + 1;       // section
    pthread_mutex_unlock(&mutex);
    return temp;
}
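An alternative sketch (not in the original slides): with C11 atomics, the counter can be incremented without a mutex, since the hardware provides an indivisible read-modify-write:

#include <stdatomic.h>

static atomic_long value; /* shared counter */

long inc(void) {
    /* One atomic fetch-and-add; returns the previous value,
       like the mutex-protected version above. */
    return atomic_fetch_add(&value, 1);
}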
Loop should be in canonical form (see spec)
Private copies, except for shared variables (memory model)
Implicit barrier at the end of a loop
reduction(op: list)
  e.g.: reduction(*: result)
A private copy is made for each variable declared in 'list' for each thread of the parallel region
A final value is produced using the operator 'op' to combine all private copies
#pragma omp parallel for reduction(*: res)
for (i = 0; i < SIZE; i++) {
    res = res * a[i];
}
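Wrapped into a complete, compilable sketch (SIZE and the array contents are illustrative additions; build with gcc -fopenmp):

#include <stdio.h>

#define SIZE 8

int main(void) {
    double a[SIZE], res = 1.0;
    for (int i = 0; i < SIZE; i++)
        a[i] = i + 1;                  // a = {1, 2, ..., 8}
    #pragma omp parallel for reduction(*: res)
    for (int i = 0; i < SIZE; i++)
        res = res * a[i];              // each thread multiplies into its private copy
    printf("product = %g\n", res);     // copies combined with '*': 8! = 40320
    return 0;
}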
Outline
Message Passing Solutions
  Process Creation
  Send/Receive
  MPI
[Figure: each source file is compiled to suit its target processor, producing one executable file per processor]
Master/Slave Architecture
GetPID() is provided by the parallel system (library, language, ...)
MASTER_PID should be well defined
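A typical dispatch sketch built on these primitives (master() and slave() are hypothetical helpers, not defined in the slides):

if (GetPID() == MASTER_PID) {
    master();   // distribute the work, then collect the results
} else {
    slave();    // receive work, compute, send the result back
}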
Primitive     Meaning
MPI_Send      Send a message and wait until it is copied to a local or remote buffer
MPI_Issend    Pass a reference to the outgoing message, and wait until the receipt starts
Outline
Embarrassingly Parallel Computation
Partitioning
Slave:
recv(row, Pmaster);
for (oldrow = row; oldrow < (row + h); oldrow++)
    for (oldcol = 0; oldcol < Width; oldcol++) {
        newrow = oldrow + delta_x;
        newcol = oldcol + delta_y;
        send(oldrow, oldcol, newrow, newcol, Pmaster);
    }
Image Shift Pseudo-Code
Complexity
Hypothesis:
  2 computational steps per pixel
  n × n pixels
  p processes
Sequential: Ts = 2n² = O(n²)
Parallel: T∥ = O(p + n²) + O(n²/p) = O(n²) (p fixed)
Communication: Tcomm = Tstartup + m·Tdata
  = p(Tstartup + 1·Tdata) + n²(Tstartup + 4·Tdata) = O(p + n²)
Computation: Tcomp = 2(n²/p) = O(n²/p)
    return cnt; // end of calculate(x, y): the iteration count for one pixel
}

y = startDoubleY;
for (int j = 0; j < height; j++) {           // rows are independent of each other
    double x = startDoubleX;
    int offset = j * width;                  // start of row j in the pixel buffer
    for (int i = 0; i < width; i++) {
        int iter = calculate(x, y);
        pixels[i + offset] = colors[iter];   // map the iteration count to a color
        x += dx;
    }
    y -= dy;
}
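Since every pixel is computed independently, the row loop is embarrassingly parallel; a hedged OpenMP sketch (the pragma and the recomputation of y from j are additions, not from the original code):

#pragma omp parallel for
for (int j = 0; j < height; j++) {
    double y = startDoubleY - j * dy;   // derive y from j so iterations are independent
    double x = startDoubleX;
    int offset = j * width;
    for (int i = 0; i < width; i++) {
        pixels[i + offset] = colors[calculate(x, y)]; // disjoint writes per row
        x += dx;
    }
}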
$\int_{a}^{b} f(x)\,dx$