Jan Thorbecke
Delft University of Technology
Contents
Introduction
Programming
OpenMP
MPI
Amdahl's Law
Describes the relation between the parallel portion of your code
and the expected speedup
speedup = 1 / ((1 - P) + P/N)
P = parallel portion
N = number of processors used in parallel part
Amdahl's Law
[Figure: speedup versus number of processors (up to 60) for parallel fractions P = 1.0, 0.99 and 0.98]
Amdahl's Law
[Figure: maximum speedup as a function of the sequential portion of the code (1% to 90%)]
Classical CPUs are sequential (iterative processes).

Threads (functional): a single process can have multiple, concurrent execution paths. Example implementations: POSIX threads & OpenMP.

Message Passing: tasks exchange data through communications by sending and receiving messages. Example: MPI-2 specification.

Hybrid: a combination of the two (e.g. MPI + OpenMP).
Programming Models
[Figure: software stack from the user (new to parallel programming or experienced programmer) through the programming model (shared memory, message passing, hybrid) and the system software (operating system, compilers) down to the hardware layer: a shared-memory machine (memory + interconnect) or a distributed-memory machine (memories + interconnect)]
Synchronization
Do the different parallel tasks meet?
Communication
Is communication between parallel parts needed?
Load Balance
Does every task have the same amount of work?
Are all processors of the same speed?
Data decomposition: all work for a local portion of the data is done by the local processor.

do i=1,100   ->   1: i=1,25   2: i=26,50   3: i=51,75   4: i=76,100

A(1:10,1:25)    A(1:10,26:50)
A(11:20,1:25)   A(11:20,26:50)

Domain decomposition: decomposition of both work and data.
Synchronization

Do i=1,100
   a(i) = b(i)+c(i)
Enddo

Do i=1,100
   d(i) = 2*a(101-i)
Enddo

Synchronization between the two loops is necessary: every a(i) must have been written before it is read in the second loop. Synchronization may cause threads to sit idle (overhead).
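A minimal sketch in C with OpenMP of the same two loops (array names follow the Fortran example above): the implicit barrier at the end of the first work-sharing loop provides exactly the synchronization that is needed before the reversed read in the second loop.

#include <omp.h>

#define N 100

void two_loops(double *a, double *b, double *c, double *d)
{
    int i;

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];
        /* implicit barrier: no thread continues until all of a[] is written */

        #pragma omp for
        for (i = 0; i < N; i++)
            d[i] = 2.0 * a[N - 1 - i];
    }
}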
Communication

[Figure: a 1D array A(1:100) decomposed over four processes: A(1:25), A(26:50), A(51:75), A(76:100)]

do i=2,99
   b(i) = b(i) + h*(a(i-1)-2*a(i)+a(i+1))
end do

The stencil needs a(i-1) and a(i+1), so neighbouring processes must communicate the values at the boundaries of their subdomains.

Mesh Partitioning: a subdomain for each process.
(Adamidis/Bönisch, Höchstleistungsrechenzentrum Stuttgart)
Unstructured Grids

[Figure: partitioning of an unstructured mesh with Jostle (Chris Walshaw, University of Greenwich)]

Load Imbalance: processors are idle due to unequal amounts of work or unequal processor speeds.
Domain Decomposition
Find the largest element of an array.

The array (49 34 50 22 12 68 12 98 16 78 31 83 51 94 27 74 64 ...) is split into four equal parts, one per core (Core 0 ... Core 3). Each core scans only its own part, keeping a running local maximum. When all cores have finished, the four local maxima (here 26, 50, 98 and 94) are compared to obtain the global maximum, 98.
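A minimal sketch of the same idea in C with OpenMP (assuming OpenMP 3.1 or later, which supports the max reduction operator in C): each thread scans its share of the array and keeps a private maximum, and the private maxima are combined when the loop ends.

#include <omp.h>

int find_max(const int *v, int n)
{
    int best = v[0];

    #pragma omp parallel for reduction(max:best)
    for (int i = 1; i < n; i++) {
        if (v[i] > best)
            best = v[i];      /* each thread updates its own copy of best */
    }
    return best;              /* combined maximum over all threads */
}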
Master-Slave
Get a heap of work done.

[Figure: Core 0 (the master) hands out tasks T to the workers Core 1, Core 2 and Core 3]
Task Decomposition

The independent tasks f(), g(), h(), r(), q() and s() are distributed over the cores (e.g. Core 0, Core 1 and Core 2) and executed concurrently; each core works through its own tasks one after another.
Synchronization: semaphore

[Figure: two threads, each with its own private variables, both accessing a common set of shared variables]
Race Conditions

Parallel threads can race against each other to update resources.
Example: one thread withdraws $100 from an account while another thread deposits $100 into the same account. Over time each thread reads the balance, subtracts or adds $100, and writes the result back; depending on how these steps interleave, one of the two updates can be lost.
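A hypothetical sketch in C/OpenMP of the bank-account race (the variable and function names are illustrative, not from the slides): without protection the read-modify-write of the shared balance can interleave between threads; the atomic directive makes each update indivisible.

#include <omp.h>

double balance = 500.0;        /* shared between all threads */

void withdraw(double amount)
{
    #pragma omp atomic
    balance -= amount;         /* without atomic this update races */
}

void deposit(double amount)
{
    #pragma omp atomic
    balance += amount;
}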
[Output residue: values printed for the array a with 1 and with 4 threads; the correct result is a = 1.0, 3.0, 5.0, 7.0, 9.0, 11.0, ...]
computation

a[1]  = a[0]  + b[0]
a[2]  = a[1]  + b[1]
a[3]  = a[2]  + b[2]    <--| Problem
a[4]  = a[3]  + b[3]    <--| Problem
a[5]  = a[4]  + b[4]
a[6]  = a[5]  + b[5]    <--| Problem
a[7]  = a[6]  + b[6]    <--| Problem
a[8]  = a[7]  + b[7]
a[9]  = a[8]  + b[8]    <--| Problem
a[10] = a[9]  + b[9]    <--| Problem
a[11] = a[10] + b[10]

Each iteration reads the value of a written by the previous iteration, so iterations that end up on different threads (marked "Problem") may read a value that has not been computed yet.
Example: variables
Domain Decomposition

Sequential Code:
int a[1000], i;
for (i = 0; i < 1000; i++) a[i] = foo(i);

Split over two threads:
Thread 0: for (i = 0;   i < 500;  i++) a[i] = foo(i);
Thread 1: for (i = 500; i < 1000; i++) a[i] = foo(i);

The loop index i is Private (each thread has its own copy); the array a is Shared.
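The same decomposition can be expressed with an OpenMP directive instead of hand-coded thread ranges; a minimal sketch (foo() is the function from the slide above, the runtime splits the 1000 iterations over the available threads, i is private, a is shared):

int a[1000], i;

#pragma omp parallel for
for (i = 0; i < 1000; i++)
    a[i] = foo(i);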
Task Decomposition

int e;
main () {
  int x[10], j, k, m;
  j = f(k);
  m = g(k);
  ...
}

int f(int k)
{
  int a;
  a = e * x[k] * x[k];
  return a;
}

int g(int k)
{
  int a;
  k = k - 1;
  a = e / x[k];
  return a;
}

Thread 0 executes f() and Thread 1 executes g(). The global variable e is shared by both threads. The array x is local to main, so it has to be passed to the functions (j = f(x, k); m = g(x, k)). The functions' local variables (a and the argument k) are Private: each thread has its own copy on its run-time stack.
Private variables:
Loop index variables
The run-time stack of functions invoked by a thread
What Is OpenMP?
Compiler directives for multithreaded programming.
Incremental parallelism.
Combines serial and parallel code in a single source.
Directive based

Directives are special comments in the language:
Fortran fixed form: !$OMP, C$OMP, *$OMP
Fortran free form: !$OMP
These special comments are interpreted by OpenMP compilers.

      w = 1.0/n
      sum = 0.0
!$OMP PARALLEL DO PRIVATE(x) REDUCTION(+:sum)
      do I=1,n
         x = w*(I-0.5)
         sum = sum + f(x)
      end do
      pi = w*sum
      print *,pi
      end

The !$OMP line is a comment in Fortran, but it is interpreted by OpenMP compilers.
C example

#pragma omp directives in C are ignored by non-OpenMP compilers.

w = 1.0/n;
sum = 0.0;
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0; i<n; i++) {
   x = w*((double)i+0.5);
   sum += f(x);
}
pi = w*sum;
printf("pi=%g\n", pi);
Architecture of OpenMP

Directives, pragmas: control structures, work sharing, synchronization, data scope attributes (private, shared, reduction), orphaning.
Runtime library routines: control & query routines (number of threads, throughput mode, nested parallelism), lock API.
Environment variables: control the runtime (schedule type, max threads, nested parallelism, throughput mode).
Programming Model

Fork-join parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally: the sequential program evolves into a parallel program.

[Figure: the master thread forking a team of threads in each successive parallel region]
Work-sharing Construct

#pragma omp parallel
#pragma omp for
for (i = 0; i < 12; i++)
   c[i] = a[i] + b[i];

Threads are assigned an independent set of iterations. There is an implicit barrier at the end of the loop.
Combining pragmas
These two code segments are equivalent
#pragma omp parallel
{
#pragma omp for
for (i=0; i< MAX; i++) {
res[i] = huge();
}
}
#pragma omp parallel for
for (i=0; i< MAX; i++) {
res[i] = huge();
}
Example: matrix times vector

[Figure: a = B * c, computed row by row; the rows of B are distributed over the threads]

With two threads and ten rows, thread TID = 0 computes rows i = 0..4 and thread TID = 1 computes rows i = 5..9; each thread forms sum = b[i][j]*c[j] over j and stores a[i] = sum.
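A minimal sketch of this loop in C with OpenMP (array names a, b, c and sizes m, n follow the slide; the 2D layout of b as an array of row pointers is an assumption): the outer loop over rows is split over the threads and each thread computes complete rows, so no synchronization is needed inside the loop.

void matvec(int m, int n, double *a, double **b, double *c)
{
    #pragma omp parallel for default(none) shared(m, n, a, b, c)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;               /* private partial sum for row i */
        for (int j = 0; j < n; j++)
            sum += b[i][j] * c[j];
        a[i] = sum;
    }
}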
OpenMP Performance Example

Performance is matrix-size dependent: if the matrix is too small, the code does not scale.

[Figure: performance (Mflop/s) as a function of matrix size]
OpenMP parallelization
OpenMP Team := Master + Workers
A Parallel Region is a block of code executed by all
threads simultaneously
The master thread always has thread ID 0
Thread adjustment (if enabled) is only done before entering a parallel
region
Parallel regions can be nested, but support for this is implementation
dependent
An "if" clause can be used to guard the parallel region; in case the
condition evaluates to "false", the code is executed serially
Data Environment

OpenMP uses a shared-memory programming model.

Shared variables: can be accessed by every thread; independent read/write operations can take place.
Private variables: every thread has its own copy, created/destroyed upon entering/leaving the procedure. They are not visible to other threads.

Mapping from serial code to parallel code:
   global        -> shared
   auto (local)  -> local (private)
   static        -> use with care
   dynamic       -> use with care
Data scope clauses:
default(shared)
shared(varname, ...)
private(varname, ...)
Synchronization

Barriers
Critical sections
Locks:
   omp_set_lock(omp_lock_t *lock)
   omp_unset_lock(omp_lock_t *lock)
   ....
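A minimal sketch of using the lock routines listed above (the counter and loop are illustrative, not from the slides): the lock protects updates to a shared counter much like a critical section, but the lock object can be initialized, passed around, and set/unset explicitly.

#include <omp.h>

omp_lock_t lock;
int counter = 0;

void count_hits(int n)
{
    omp_init_lock(&lock);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        omp_set_lock(&lock);
        counter++;                 /* only one thread at a time in here */
        omp_unset_lock(&lock);
    }

    omp_destroy_lock(&lock);
}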
OpenMP Critical

Only one thread at a time can execute a critical region; the other threads wait until the current thread exits the critical region, so only one thread at a time can manipulate the values used in the critical region.

[Figure: fork - critical region - join]
cnt = 0;
f = 7;
#pragma omp parallel
{
   #pragma omp for
   for (i=0; i<20; i++) {
      if (b[i] == 0) {
         #pragma omp critical
         cnt++;
      } /* end if */
      a[i] = b[i] + f*(i+1);
   } /* end for */
} /* omp end parallel */
[Figure: the 20 iterations are distributed over four threads (i = 0-4, 5-9, 10-14, 15-19); each thread evaluates the if, increments cnt one thread at a time inside the critical region, and updates a[i] = b[i] + ...]
The single construct: the enclosed block is executed by only one thread.

#pragma omp single [nowait][clause, ...]
{
   block
}

Clauses: private(list), firstprivate(list).
NOWAIT: the other threads will not wait at the end of the single directive.
master: all threads but the master skip the enclosed section of code and continue.

#pragma omp barrier

Each thread waits until all others in the team have reached this point.
Directives covered so far: parallel, single [nowait], master, barrier.
A work-sharing directive may also appear in a routine that is called from a parallel region (orphaning):

void dowork()
{
   #pragma omp for
   for (i=0; i<n; i++) {
      ....
   }
}
The schedule clause

static [, chunk]
Iterations are divided over the threads in contiguous blocks; with a chunk size, blocks of chunk iterations are assigned to the threads in a round-robin fashion.

[Figure: mapping of 20 iterations to 4 threads for static and static,chunk scheduling]
82
dynamic [, chunk]!
Fixed portions of work; size is controlled by the value of chunk
When a thread finishes, it starts on the next portion of work
!
guided [, chunk]!
Same dynamic behavior as "dynamic", but size of the portion of work
decreases exponentially
!
runtime!
Iteration scheduling scheme is set at runtime through environment
variable OMP_SCHEDULE
83
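A small sketch of selecting a schedule (work[] and expensive_computation() are illustrative placeholders, not from the slides): with dynamic,5 each thread grabs chunks of 5 iterations as it becomes free, which helps when the cost per iteration varies strongly.

extern double expensive_computation(int i);   /* hypothetical, uneven work */

void run(double *work, int n)
{
    #pragma omp parallel for schedule(dynamic, 5)
    for (int i = 0; i < n; i++)
        work[i] = expensive_computation(i);
}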
Example: scheduling experiment

500 iterations on 4 threads.

[Figure: thread ID (0-3) versus iteration number (0-500) for guided,5; dynamic,5; and static scheduling]
OpenMP: Reduction

reduction(op:list) performs a reduction on the shared variables in list based on the operator provided. For C/C++ the operator can be any one of: +, *, -, ^, |, ||, & or &&.

#pragma omp parallel for reduction(+:sum)
for (i=0; i<N; i++) {
   sum += a[i] * b[i];
}

At the end of a reduction, the shared variable contains the result obtained upon combination of the list of variables processed using the operator specified: at the end of the construct, the local copies are combined through the operator into a single value and combined with the value in the original shared variable.

Example:

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i < 20; i++)
   sum = sum + (a[i] * b[i]);

Each thread starts with a local copy of sum (initialized to 0) and processes its own share of the iterations (i = 0-4, 5-9, 10-14, 15-19 with four threads); all local copies of sum are then added together and stored in the global variable.
Operators and initial values: the local copy is initialized to the identity of the operator, e.g. + -> 0, * -> 1, & -> ~0, && -> 1, || -> 0.

FORTRAN:
operator is one of +, *, -, .AND., .OR., .EQV., .NEQV.
intrinsic is one of MAX, MIN, IAND, IOR, IEOR
Computing Pi

Numerical integration: the integral of 4/(1+x^2) from 0 to 1 equals pi.

static long num_steps=100000;
double step, pi;
void main()
{ int i;
  double x, sum = 0.0;
  ...

[Figure: the function 4/(1+x^2) on the interval 0..1]
Which variables are shared, private, reduction?

void main()
{ int i;
  double x, sum = 0.0;
  ...

shared:    step, num_steps
private:   x, i
reduction: sum
Solution to Computing Pi

static long num_steps=100000;
double step, pi;
void main()
{ int i;
  double x, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i=0; i<num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %f\n", pi);
}
Run time (s) with OpenMP for increasing numbers of CPUs: 9.617728, 4.874539, 2.455036, 1.627149, 1.214713, 0.820746 (12 CPUs), 0.616482 (16 CPUs).
[Figure: speedup versus number of CPUs (1-16) for the OpenMP version on a 2.2 GHz Barcelona, compared with Amdahl's law for P = 1.0]
Exercise: replace

#pragma omp parallel for private(x) reduction(+:sum)

with

#pragma omp parallel for private(x) shared(sum)

and solve the resulting race condition with #pragma omp critical before the update of sum.
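A possible solution sketch for this exercise (the modified loop of the pi program above; correct, but much slower than reduction(+:sum), because every update of sum is serialized):

sum = 0.0;
#pragma omp parallel for private(x) shared(sum)
for (i=0; i<num_steps; i++){
   x = (i+0.5)*step;
   #pragma omp critical
   sum = sum + 4.0/(1.0 + x*x);   /* one thread at a time updates sum */
}
pi = step * sum;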
CUDA: computing Pi
Environment Variables

The names of the OpenMP environment variables must be UPPERCASE; the values assigned to them are case insensitive.

OMP_NUM_THREADS
OMP_SCHEDULE "schedule [chunk]"
OMP_NESTED { TRUE | FALSE }
Consider splitting the iterations cyclically over two processors:
   processor 1: i = 0,2,4,6,8
   processor 2: i = 1,3,5,7,9

Processor 1:                     Processor 2:
for (i1=0,2,4,6,8) {             for (i2=1,3,5,7,9) {
   a[i1] = b[i1] + c[i1];           a[i2] = b[i2] + c[i2];
}                                }

The loop indices i1 and i2 live in the private area of each processor; the array a lives in the shared area, and both processors update neighbouring elements of a.
The parallel loop runs slower than expected. Why?

Both processors repeatedly write elements of a[] that lie in the same cache line.

The Impact: non-scalable parallel applications.
The Remedy: false sharing is often quite simple to solve.
False Sharing

[Figure: the elements B(0)...B(9) fall in one cache line; processor 1 and processor 2 each use only some of the elements, but every write by one processor invalidates the cache line of the other, so the line ping-pongs between the processors over time]
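A minimal sketch of the usual remedy (the accumulator structure and the 64-byte cache-line size are assumptions for illustration): give each thread its own partial result and pad it to a full cache line, so that accumulators of different threads never share a line.

#include <omp.h>

#define NTHREADS  4
#define CACHELINE 64                       /* assumed cache-line size in bytes */

struct padded { double val; char pad[CACHELINE - sizeof(double)]; };

double sum_array(const double *b, int n)
{
    struct padded partial[NTHREADS] = {{0.0}};
    double total = 0.0;

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[t].val += b[i];        /* no false sharing between threads */
    }
    for (int t = 0; t < NTHREADS; t++)
        total += partial[t].val;
    return total;
}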
Summary
First tune single-processor performance: pay attention to the location of memory.
Cache-friendly programs: no special placement needed.
Non-cache-friendly programs: check for false sharing.
Use of OpenMP: try to avoid synchronization (barrier, critical, single, ordered).
Atomic
Barrier
Sections
API routines

export OMP_NUM_THREADS=4
OpenACC
A new set of directives to support accelerators:
GPUs
Intel's MIC
AMD Fusion processors
OpenACC example

void convolution_SM_N(typeToUse A[M][N], typeToUse B[M][N]) {
   int i, j, k;
   int m=M, n=N;
   // OpenACC kernel region
   // Define a region of the program to be compiled into a sequence of kernels
   // for execution on the accelerator device
   #pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m])
   {
      typeToUse c11, c12, c13, c21, c22, c23, c31, c32, c33;
      c11 = +2.0f; c21 = +5.0f; c31 = -8.0f;
      c12 = -3.0f; c22 = +6.0f; c32 = -9.0f;
      c13 = +4.0f; c23 = +7.0f; c33 = +10.0f;
      // The OpenACC loop gang clause tells the compiler that the iterations of the loops
      // are to be executed in parallel across the gangs.
      // The argument specifies how many gangs to use to execute the iterations of this loop.
      #pragma acc loop gang(64)
      for (int i = 1; i < M - 1; ++i) {
         // The OpenACC loop worker clause specifies that the iterations of the associated loop
         // are to be executed in parallel across the workers within the gangs created.
         // The argument specifies how many workers to use to execute the iterations of this loop.
         #pragma acc loop worker(128)
         for (int j = 1; j < N - 1; ++j) {
            B[i][j] = c11 * A[i - 1][j - 1] + c12 * A[i + 0][j - 1] + c13 * A[i + 1][j - 1]
                    + c21 * A[i - 1][j + 0] + c22 * A[i + 0][j + 0] + c23 * A[i + 1][j + 0]
                    + c31 * A[i - 1][j + 1] + c32 * A[i + 0][j + 1] + c33 * A[i + 1][j + 1];
         }
      }
   } // kernels region
}
MPI
Message Passing
Point-to-Point
Multi-processor communications, e.g. broadcast, reduce

MPI advantages
Mature and well understood
Backed by a widely-supported formal standard (1992)
Porting is easy
User interface: efficient and simple, buffer handling, allows high-level abstractions
Performance
MPI disadvantages
MPI 2.0 includes many features beyond message passing
Learning curve
Programming Model
Explicit parallelism: all processes start at the same time at the same point in the code.
Full parallelism: there is no sequential part in the program.

[Figure: all processes execute the parallel region from start to finish]

Work Distribution
All processors run the same executable.
#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
   MPI_Init( &argc, &argv );
   printf( "Hello, world!\n" );
   MPI_Finalize();
   return 0;
}
      include 'mpif.h'
      call MPI_INIT( ierr )
      print *, 'Hello, world!'
      call MPI_FINALIZE( ierr )
      end
Syntax

int MPI_Init( int *argc, char ***argv );

MPI_INIT( IERROR )
INTEGER IERROR

NOTE: Both C and Fortran return error codes for all calls.
MPI_Finalize cleans up all MPI state. Once this routine has been called, no MPI routine (not even MPI_INIT) may be called.

Syntax

int MPI_Finalize( );

MPI_FINALIZE( IERROR )
INTEGER IERROR

You MUST call MPI_FINALIZE when you exit from an MPI program.
Bindings
C
All MPI names have an MPI_ prefix.
Defined constants are in all capital letters.
Defined types and functions have one capital letter after the prefix; the remaining letters are lowercase.
Fortran
All MPI names have an MPI_ prefix.
No capitalization rules apply.
The last argument is a returned error value.
      include 'mpif.h'
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'I am ', rank, ' of ', size
      call MPI_FINALIZE( ierr )
      end
[Figure: processes with ranks 0..7 inside the communicator MPI_COMM_WORLD]
Communicator
Communication in MPI takes place with respect to
communicators
MPI_COMM_WORLD is one such predefined communicator
(something of type MPI_COMM) and contains group and
context information
MPI_COMM_RANK and MPI_COMM_SIZE return
information based on the communicator passed in as the
first argument
Processes may belong to many different communicators
[Figure: point-to-point communication. One process holds send buffer A, the other a receive buffer B. The sender asks "May I send?", the receiver answers "Yes", and the data is then transferred piece by piece over time.]
And the receive has become:

MPI_Recv( start, count, datatype, source, tag, comm, status )

The source, tag, and the count of the message actually received can be retrieved from status.
MPI C Datatypes
[Table: MPI datatype names (MPI_CHAR, MPI_INT, MPI_FLOAT, MPI_DOUBLE, ...) and the corresponding C types]
MPI Tags
Messages are sent with an accompanying user-defined integer tag, to assist the receiving process in identifying the message.
Naming a process
A destination is specified by (rank, group). Processes are named according to their rank in the group; groups are defined by their distinct communicator.
MPI_ANY_SOURCE is a wildcard rank permitted in a receive. Tags are integer variables or constants used to uniquely identify individual messages.

In C:
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &recvd_count );

In Fortran:
integer recvd_tag, recvd_from, recvd_count
integer status(MPI_STATUS_SIZE)
call MPI_RECV(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., status, ierr)
recvd_tag  = status(MPI_TAG)
recvd_from = status(MPI_SOURCE)
call MPI_GET_COUNT(status, datatype, recvd_count, ierr)
INTEGER status(MPI_STATUS_SIZE)
call MPI_Recv( . . . , status, ierr )
tag_of_received_message = status(MPI_TAG)
src_of_received_message = status(MPI_SOURCE)
call MPI_Get_count(status, datatype, count, ierr)
MPI_TAG and MPI_SOURCE are primarily of use when
MPI_ANY_TAG and/or MPI_ANY_SOURCE is used in the receive
The function MPI_GET_COUNT may be used to determine how much
data of a particular type was received
      integer rank, size, to, from, tag, count, i, ierr
      integer src, dest
      integer st_source, st_tag, st_count
      integer status(MPI_STATUS_SIZE)
      double precision data(10)

      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, size, ierr )
      print *, 'Process ', rank, ' of ', size, ' is alive'
      dest = size - 1
      src = 0
      else if (rank .eq. dest) then
         tag = MPI_ANY_TAG
         source = MPI_ANY_SOURCE
         call MPI_RECV( data, 10, MPI_DOUBLE_PRECISION,
     +                  source, tag, MPI_COMM_WORLD,
     +                  status, ierr)
      ...
      call MPI_FINALIZE( ierr )
      end
The six essential MPI routines:
MPI_INIT
MPI_FINALIZE
MPI_COMM_SIZE
MPI_COMM_RANK
MPI_SEND
MPI_RECV
work distribution

      program main
      use MPI
      double precision PI25DT
      parameter (PI25DT = 3.141592653589793238462643d0)
      double precision mypi, pi, h, sum, x, f, a
      integer n, myid, numprocs, i, ierr
c     function to integrate
      f(a) = 4.d0 / (1.d0 + a*a)
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
10    if ( myid .eq. 0 ) then
         write(6,98)
98       format('Enter the number of intervals: (0 quits)')
         read(5,'(i10)') n
      endif
      call MPI_BCAST( n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      if ( n .le. 0 ) goto 30
      h = 1.0d0/n
      sum = 0.0d0
      do 20 i = myid+1, n, numprocs
         x   = h * (dble(i) - 0.5d0)
         sum = sum + f(x)
20    continue
      mypi = h * sum
      call MPI_REDUCE( mypi, pi, 1, MPI_DOUBLE_PRECISION,
     +                 MPI_SUM, 0, MPI_COMM_WORLD, ierr)
      if (myid .eq. 0) then
         write(6, 97) pi, abs(pi - PI25DT)
97       format(' pi is approximately: ', F18.16,
     +          ' Error is: ', F18.16)
      endif
      goto 10
30    call MPI_FINALIZE(ierr)
      end

Work distribution: each rank handles iterations i = myid+1, myid+1+numprocs, ... (cyclic distribution); MPI_BCAST distributes n to all ranks and MPI_REDUCE sums the partial results on rank 0.
It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.
Use qsub to submit the MPI job (an example script is provided) to a queue.
Run times (s) for increasing numbers of CPUs:

        OpenMP      marken       linux
        9.617728    14.10798     22.15252
        4.874539     7.071287     9.661745
        2.455036     3.532871     5.730912
        1.627149     2.356928     3.547961
        1.214713     1.832055     2.804715
  12    0.820746     1.184123
  16    0.616482     0.955704
[Figure: speedup versus number of CPUs (1-16): Amdahl 1.0, OpenMP on a 2.2 GHz Barcelona, MPI on marken (Xeon 3.1 GHz) and MPI on linux (Xeon 2.4 GHz)]
[Figure: speedup versus number of cores (up to 32): Amdahl 1.0, OpenMP on a Core i7-950 (3.06 GHz) and MPI on a Xeon E5504 (2.0 GHz)]
Blocking Communication
So far we have discussed blocking communication:
MPI_SEND does not complete until the buffer is empty (available for reuse).
MPI_RECV does not complete until the buffer is full (available for use).
A process sending data will be blocked until the data in the send buffer is emptied.
A process receiving data will be blocked until the receive buffer is filled.
Completion of communication generally depends on the message size and the system buffer size.
Blocking communication is simple to use but can be prone to deadlocks:

If (my_proc.eq.0) Then
   Call mpi_send(..)
   Call mpi_recv(..)
Else                          ! both ranks send first:
   Call mpi_send(..)          ! usually deadlocks
   Call mpi_recv(..)
Endif
Sources of Deadlocks
Send a large message from process 0 to process 1: if there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).

Unsafe (may deadlock):
   Process 0: Send(1)   Recv(1)
   Process 1: Send(0)   Recv(0)

Safe ordering:
   Process 0: Send(1)   Recv(1)
   Process 1: Recv(0)   Send(0)

Non-blocking:
   Process 0: Isend(1)  Irecv(1)  Waitall
   Process 1: Isend(0)  Irecv(0)  Waitall
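A minimal sketch in C of the third (deadlock-free) variant (buffer names and the partner rank are illustrative): both processes post the non-blocking send and receive first and then wait for both to complete, so neither blocks while the other is still sending.

#include "mpi.h"

void exchange(double *sendbuf, double *recvbuf, int count, int partner)
{
    MPI_Request req[2];

    MPI_Isend(sendbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* both transfers complete here */
}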
Blocking send/receive timeline (send side / receive side):
T0: MPI_Recv is called; from this point the receive buffer is unavailable to the user.
T1: MPI_Send is called on the sender.
T2: the send returns; the send buffer can be reused.
T4: the receive returns; the receive buffer is filled.
Non-blocking send/receive timeline:
T0: MPI_Irecv is posted; T1: it returns immediately.
T2: MPI_Isend is posted; T3: it returns, but the send buffer is still unavailable to the user.
T5: the send completes; the send buffer becomes available again after MPI_Wait.
T6: MPI_Wait is called; T7: the transfer finishes; the wait returns at T9.
On the receive side MPI_Wait returns at T8, when the receive buffer has been filled.
Internal completion is soon followed by the return of MPI_Wait.
Non-Blocking Communication
Multiple Completions
*buf = 3;
MPI_Send ( buf, 1, MPI_INT, ... );
*buf = 4; /* OK, receiver will always receive 3 */

*buf = 3;
MPI_Isend(buf, 1, MPI_INT, ...);
*buf = 4; /* Undefined whether the receiver will get 3 or 4 */
MPI_Wait ( ... );
Collective calls
MPI has several collective communication calls, the most frequently
used are:
Synchronization
Barrier
Communication
Broadcast
Gather Scatter
All Gather
Reduction
Reduce
All Reduce
MPI_Barrier( comm )
Synchronization within a communicator: each process, upon reaching the MPI_Barrier call, blocks until all processes in the group reach the same MPI_Barrier call.
See http://www.mcs.anl.gov/mpi/www/www3/MPI_Barrier.html
Bcast (root=0): the contents of the root's send buffer (A) are copied into the receive buffers of all processes (ranks 0, 1, 2, ...).

Scatter (root=0): the root's send buffer (A B C D) is split into pieces; process i receives the i-th piece into its receive buffer.

Gather (root=0): each process contributes the contents of its send buffer; the pieces are collected, in rank order, into the root's receive buffer (A B C D).
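A minimal sketch in C of the scatter/gather pattern from the figures (variable names are illustrative): the root holds nprocs*chunk elements, every process receives its own chunk, works on it locally, and the results are gathered back on the root.

#include "mpi.h"

void scatter_work_gather(double *full, double *part, int chunk, int root)
{
    MPI_Scatter(full, chunk, MPI_DOUBLE,
                part, chunk, MPI_DOUBLE, root, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)        /* local work on the received piece */
        part[i] *= 2.0;

    MPI_Gather(part, chunk, MPI_DOUBLE,
               full, chunk, MPI_DOUBLE, root, MPI_COMM_WORLD);
}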
Other collectives: MPI_BCAST, MPI_REDUCE, MPI_ALLREDUCE, MPI_SCATTER, MPI_GATHERV, MPI_ALLGATHERV, MPI_ALLTOALLV, MPI_SCAN
MPI-2 adds, among other things:
One-sided communication (put/get and other operations)
Parallel I/O
Generalized requests
Bindings for C++ / Fortran-90; inter-language issues
Domain Decomposition
For example: the wave equation in a 2D domain.

With w = width of the halo, n^3 = size of the matrix and p = number of processors, the communication volume per processor is:

one dimension:    n^2 * 2w * 1
two dimensions:   n^2 * 2w * 2 / p^(1/2)
three dimensions: n^2 * 2w * 3 / p^(2/3)

With a cyclic boundary there are two neighbors in each direction.
[Figure: a 2D grid of nz x nx points is decomposed into 2 x 2 blocks of size nz_block = nz/2 by nx_block = nx/2; block [0,0] is MPI rank 0, [1,0] rank 1, [0,1] rank 2 and [1,1] rank 3. In a larger decomposition each rank stores the ranks of its neighbours, e.g. left=10, right=40, bottom=26, top=24.]
communication

for (ix=0; ix<nx_block; ix++) {
   for (iz=0; iz<nz_block; iz++) {
      vx[ix][iz] -= rox[ix][iz]*(
            c1*(p[ix][iz]   - p[ix-1][iz]) +
            c2*(p[ix+1][iz] - p[ix-2][iz]));
   }
}

At the left edge of a subdomain (ix = 0, 1) the stencil needs p values that belong to the neighbouring MPI rank: e.g. p[nx_block-1][iz] and p[nx_block-2][iz] of rank 10 are needed by rank 12 to compute vx[0][iz].
pseudocode

if (my_rank==12) {
   call MPI_Receive(p[-2:-1][iz], .., 10, ..);

   vx[ix][iz] -= rox[ix][iz]*(
         c1*(p[ix][iz]   - p[ix-1][iz]) +
         c2*(p[ix+1][iz] - p[ix-2][iz]));
}

if (my_rank==10) {
   call MPI_Send(p[-2:-1][iz], .., 12, ..);
}
Every subdomain [i][j] sends its last two columns (nx-1, nx-2) to its right neighbour and receives two ghost columns (-1, -2) from its left neighbour, and vice versa:

mpi_sendrecv(
    p(nx-1:nx-2), 2, .., my_neighbor_right, ...,
    p(-2:-1),     2, .., my_neighbor_left, ...)
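A minimal sketch in C of this halo exchange (function and variable names, and the buffer layout with halo cells at both ends, are assumptions): every process sends its last interior cells to the right neighbour and receives ghost cells from the left neighbour in one MPI_Sendrecv call, then does the same in the other direction, so sends and receives cannot deadlock. At the domain boundary left or right may be MPI_PROC_NULL, which makes the corresponding transfer a no-op.

#include "mpi.h"

/* layout: p[0..halo-1] ghost | p[halo..halo+n-1] interior | p[halo+n..] ghost */
void exchange_halo(double *p, int n, int halo, int left, int right)
{
    MPI_Sendrecv(&p[n],        halo, MPI_DOUBLE, right, 1,   /* send last interior cells right  */
                 &p[0],        halo, MPI_DOUBLE, left,  1,   /* receive ghost cells from left   */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&p[halo],     halo, MPI_DOUBLE, left,  2,   /* send first interior cells left  */
                 &p[halo + n], halo, MPI_DOUBLE, right, 2,   /* receive ghost cells from right  */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}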
[Figure: neighbour ranks in the decomposition. With cyclic boundaries the ranks wrap around (e.g. MPI 0 has left=47, right=10; MPI 47 has left=38, right=0). Without cyclic boundaries a rank at the edge has no neighbour there (e.g. MPI 47: left=38, right=NULL).]
[Figure: layout of the array P[i][iz] for one subdomain: guard (halo) cells at i = -2, -1 in the outer area, outside the inner area i = is..ie; the outer area holds copies of the neighbouring subdomains' values.]
Exercise: MPI_Cart
Do I have to make a map of MPI processes myself?
Answer: no, the MPI Cartesian topology routines do this for you. Example output for 16 MPI tasks on a 4 x 4 grid:

PE- 14: x=3, y=2: Sum = 32 neighbors: left=10 right=2  top=13 bottom=15
PE- 11: x=2, y=3: Sum = 36 neighbors: left=7  right=15 top=10 bottom=-1
PE- 15: x=3, y=3: Sum = 36 neighbors: left=11 right=3  top=14 bottom=-1
PE- 10: x=2, y=2: Sum = 32 neighbors: left=6  right=14 top=9  bottom=11
PE- 12: x=3, y=0: Sum = 24 neighbors: left=8  right=0  top=-1 bottom=13
PE- 13: x=3, y=1: Sum = 28 neighbors: left=9  right=1  top=12 bottom=14
(output for the remaining ranks not shown)
MPI_Comm_size(MPI_COMM_WORLD, &size);
...
npz = 2*halo+nz/dims[0];
npx = 2*halo+nx/dims[1];
npy = 2*halo+ny/dims[2];

/* create cartesian domain decomposition */
MPI_Cart_create(MPI_COMM_WORLD, 3, dims, period, reorder, &COMM_CART);
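A small sketch of how the Cartesian communicator created above answers the exercise question (COMM_CART is the communicator from the snippet; the other variable names are illustrative): MPI_Cart_coords gives this rank's position in the process grid and MPI_Cart_shift returns the neighbour ranks in a chosen dimension (MPI_PROC_NULL at a non-periodic boundary), so no hand-made process map is needed.

int rank, coords[3], left, right;

MPI_Comm_rank(COMM_CART, &rank);
MPI_Cart_coords(COMM_CART, rank, 3, coords);      /* my (z, x, y) position  */
MPI_Cart_shift(COMM_CART, 1, 1, &left, &right);   /* neighbours in dim 1    */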
Subarrays for I/O

[Figure: the local array (lnx x lny, at offset px, py) is a subarray of the global nx x ny array]
/* create subarrays (excluding halo areas) to write to file with MPI-IO */
/* data in the local array */
sizes[0]=npz; sizes[1]=npx; sizes[2]=npy;
subsizes[0]=sizes[0]-2*halo;
subsizes[1]=sizes[1]-2*halo;
subsizes[2]=sizes[2]-2*halo;
starts[0]=halo; starts[1]=halo; starts[2]=halo;
MPI_Type_create_subarray(3, sizes, subsizes, starts, MPI_ORDER_C,
                         MPI_FLOAT, &local_array);
MPI_Type_commit(&local_array);
disp = 0;
rc = MPI_File_set_view(fh, disp, MPI_FLOAT, global_array,
                       "native", fileinfo);
...
MPI_File_close(&fh);
} /* end of time loop */
Summary
MPI Summary
The MPI Standard is widely accepted by vendors and programmers.
MPI implementations are available on most modern platforms.
Several MPI applications are deployed, and several tools exist to trace and tune MPI applications.
Simple applications use point-to-point and collective communication operations.
Advanced applications use point-to-point, collective, communicators, datatypes, one-sided, and topology operations.
Pure MPI.
Hybrid MPI/OpenMP, no overlapping of computation/communication: MPI only outside of the OpenMP code; only the master thread performs MPI communication.
Hybrid MPI/OpenMP, overlapping computation/communication: MPI inside the OpenMP code.
Pure OpenMP: supported by software distributed shared memory.

Hybrid OpenMP/MPI
A natural paradigm for clusters of SMPs.
May offer considerable advantages when application mapping and load balancing is tough.
Benefits with slower interconnection networks (overlapping computation/communication).
Overlapping computation/communication: Example

[Figure: timeline of processes P0..P2 overlapping communication with computation (C. Bekas)]
if (my_thread_id < # ) {
   MPI_...   /* communicate needed data */
} else {
   /* Perform computations that do not need communication */
   .
   .
}
/* All threads execute code that requires communication */
.
.
   for (i=0; i<(all_my_points); i++)
      oldval[i] = newval[i];
   }
}
MPI_Barrier();   /* Synchronize all MPI processes here */
PGAS
What is PGAS? Partitioned Global Address Space.

Explicitly parallel, shared-memory-like programming model.
Global addressable space: allows programmers to declare and directly access data distributed across the machine.

Programming model and address space:
   Message passing -> MPI
   Shared memory   -> OpenMP
   PGAS            -> PGAS languages (see below)
OpenMP terminology: thread, thread team.
Implementation: threads map to user threads running on one SMP node; extensions to multiple servers have not been very successful.

[Figure: OpenMP - several threads working on one shared memory]

!$OMP PARALLEL
do i=1,32
   a(i)=a(i)*2
end do

[Figure: OpenMP implementation - one process with multiple threads, mapped onto the CPUs of a shared-memory node]
MPI
Implementation: MPI processes map to processes within one SMP node or across multiple networked nodes.
The API provides process numbering, point-to-point and collective messaging operations.

[Figure: MPI - several processes, each with its own memory and CPU, connected by a network]
[Figure: two MPI processes, each with its own memory and CPU; the value a is transferred with MPI_Send(a,...,1,...) on process 0 and MPI_Recv(a,...,0,...) on process 1]
PGAS

[Figure: several processes, each with its own memory and CPU, but with one global address space spanning all of them]
PGAS languages
Partitioned Global Address Space (PGAS):
Global address space: any process can address memory on any processor.
Partitioned: the global address space retains information about locality.
Core idea: the hardest part of writing parallel code is managing data distribution and communication; make that simple and explicit.
PGAS languages try to simplify parallel programming (increase programmer productivity).
UPC
Unified Parallel C:
An extension of C99; an evolution of AC, PCP, and Split-C.
Features: SPMD parallelism via replication of threads; shared and private address spaces; multiple memory consistency models.
Benefits: global view of data; one-sided communication.
Co-Array Fortran
Co-Array Fortran:
An extension of Fortran 95 and part of Fortran 2008; the language formerly known as F--.
Features: SPMD parallelism via replication of images; co-arrays for distributed shared data.
Benefits: syntactically transparent communication; one-sided communication; multi-dimensional arrays; array operations.

Basic execution model (CAF): the program executes as if replicated into multiple copies, with each copy executing asynchronously (SPMD).
Each image has its own copy of the co-array a(1:4) and the scalar b:
   image 1: a = 1 8 1 5, b = 1
   image 2: a = 1 7 9 9, b = 3
   image 3: a = 1 7 9 4, b = 6

b = a(2)      ! local read: image 1 gets b = 8, image 2 gets b = 7, image 3 gets b = 7

b = a(4)[3]   ! remote read from image 3: every image gets b = 4
Performance Tuning
Performance tuning of parallel applications is an iterative process.

[Figure: speedup S versus number of processors p: ideal, first attempt, second attempt]
Starting jobs:
Most implementations use a special loader named mpirun:
   mpirun -np <no_of_processors> <prog_name>

In MPI-2 it is recommended to use:
   mpiexec -n <no_of_processors> <prog_name>
The CH comes from Chameleon, the portability layer used in the original
MPICH to provide portability to the existing message-passing systems.
MPICH2
MPICH2 is an all-new implementation of the MPI Standard,
designed to implement all of the MPI-2 additions to MPI.
!
!
http://www.mcs.anl.gov/research/projects/mpich2/
example hello.c

#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
   int myrank;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
   fprintf(stdout, "Hello World, I am process %d\n", myrank);
   MPI_Finalize();
   return 0;
}
Example: details
If using frontend and compute nodes in the machines file, use:
   mpirun -np 2 -machinefile machines hello

If using only compute nodes in the machine file, use:
   ...
Notes on clusters
Make sure you have access to the compute nodes (ssh keys are set up).
Which mpicc are you using?
   $ which mpicc
Command line arguments are not always passed on to mpirun/mpiexec.
MPICH2 daemons
mpd         - the MPICH2 process-management daemon
mpdboot     - start the daemons
mpdtrace    - list the running daemons
mpdlistjobs - list running jobs
mpdkilljob  - kill a job listed by mpdlistjobs
mpdsigjob   - deliver a signal to a job
MPI Standard:
http://www-unix.mcs.anl.gov/mpi/standard.html
http://www.mpi-forum.org/docs/mpi-11-html/node182.html
http://www-unix.mcs.anl.gov/mpi/mpi-standard/mpi-report-2.0/node306.htm
http://www.mpi-forum.org/index.html

MPI tutorials:
http://www.nccs.nasa.gov/tutorials/mpi_tutorial2/mpi_II_tutorial.html
https://fs.hlrs.de/projects/par/par_prog_ws/
MPI Sources
The Standard (3.0) itself: at http://www.mpi-forum.org - all MPI official releases, in both PDF and HTML.
Books:
Using MPI: Portable Parallel Programming with the Message-Passing Interface, by Gropp, Lusk, and Skjellum, MIT Press, 1994.
Parallel Programming with MPI, by Peter Pacheco, Morgan Kaufmann, 1997.
At http://www.mcs.anl.gov/mpi: pointers to lots of stuff, including other talks and tutorials, a FAQ, other MPI pages.
http://mpi.deino.net/mpi_functions/index.htm
What to do when:
The system is full and you want to run your 512 CPU job?
You want to run 16 jobs - should others wait for that?
Should bigger jobs have priority over smaller jobs?
Should longer jobs have lower or higher priority?
MPICH1 in PBS

#!/bin/bash
#PBS -N Flank
#PBS -V
#PBS -l nodes=11:ppn=2
#PBS -j eo

cd $PBS_O_WORKDIR
export nb=`wc -w < $PBS_NODEFILE`
echo $nb

mpirun -np $nb -machinefile $PBS_NODEFILE ~/bin/migr_mpi \
   file_vel=$PBS_O_WORKDIR/grad_salt_rot.su \
   file_shot=$PBS_O_WORKDIR/cfp2_sx.su
Maui cluster scheduler: http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php
Torque resource manager: http://www.clusterresources.com/pages/products/torque-resource-manager.php
SLURM: http://slurm.schedmd.com
queuing commands
qsub
qstat
qdel
xpbsmon
qsub

#PBS -l nodes=10:ppn=1
#PBS -l mem=20mb
#PBS -l walltime=1:00:00
#PBS -j eo

submit:
   qsub -q normal job.scr

output:
   jobname.ejobid
   jobname.ojobid
qstat
shows available queues and resources:
   qstat -q
   qstat (-a)
qdel
deletes a job from the queue and stops all running executables:
   qdel jobid
Be careful with loops that generate and submit one job per parameter value, e.g.:

   program with arguments
EOF
qsub jobs/pbs_${xsrc}.job
(( xsrc = $xsrc + $dxsrc ))
done
It is assumed that you use mpich1 and have PBS installed as a job scheduler. If you have mpich2, let me know.
Use qsub to submit the MPI job (an example script is provided) to a queue.
Run times (s) for increasing numbers of CPUs:

        OpenMP      marken (MPI)   linux (MPI)
        9.617728    14.10798       22.15252
        4.874539     7.071287       9.661745
        2.455036     3.532871       5.730912
        1.627149     2.356928       3.547961
        1.214713     1.832055       2.804715
  12    0.820746     1.184123
  16    0.616482     0.955704
248
Amdahl 1.0
OpenMP Barcelona 2.2 GHz
MPI marken Xeon 3.1 GHz
MPI linux Xeon 2.4 GHz
14
speed up
12
10
8
6
4
2
0
1
6
number of CPUs
12
16
249
Also inspect the code and see how the reductions are done. Is there another way of doing the reductions?
These are old exercises from SGI and give insight into problems you can encounter using OpenMP.
END