Objectives
At the end of this module you should be able to:
Define parallel computing
[Figure: processor clock speeds climbing from roughly 1 MHz toward 1 GHz over the years 1979-2011]
Execution Optimization
Popular optimizations to increase CPU speed
Instruction prefetching
Instruction reordering
Pipelined functional units
Branch prediction
Functional unit allocation
Register allocation
Hyperthreading
Added sophistication means more silicon devoted to control hardware
Multi-core Architectures
Potential performance = CPU speed × number of CPUs
Strategy:
- Limit CPU speed and sophistication
- Put multiple CPUs ("cores") on a single chip
[Figure: per-CPU speed goes down, the number of CPUs goes up, and the potential performance stays the same]
A vicious cycle:
Parallel computing is not mainstream →
parallel programming environments are inadequate →
parallel programming is difficult →
parallel computing stays out of the mainstream

A virtuous cycle:
Everyone has a parallel computer →
parallel programming is considered mainstream →
parallel programming environments improve →
parallel programming gets easier
Recognizing Potential Parallelism
Methodology
Study problem, sequential program, or code segment
Look for opportunities for parallelism
Try to keep all processors busy doing useful work
Exploiting parallelism: domain decomposition, task decomposition, and pipelining (each covered below)
Domain Decomposition
First, decide how data elements should be divided
among processors
Second, decide which tasks each processor should be
doing
Example: Vector addition
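The vector-addition example can be sketched in C as a purely sequential simulation of the decomposition: the outer loop stands in for the processors, and the block arithmetic shows which contiguous indices each one owns. The function name and the block-partitioning scheme are illustrative choices, not something the deck specifies.

```c
#include <stddef.h>

/* Domain decomposition of vector addition: each of `ncpus` (simulated)
   processors owns one contiguous block of the index range and performs
   the same task -- c[i] = a[i] + b[i] -- on its own block. */
void vector_add_blocks(const double *a, const double *b, double *c,
                       size_t n, size_t ncpus)
{
    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        /* Block boundaries: the first n % ncpus blocks get one extra element. */
        size_t base = n / ncpus, rem = n % ncpus;
        size_t lo = cpu * base + (cpu < rem ? cpu : rem);
        size_t hi = lo + base + (cpu < rem ? 1 : 0);
        for (size_t i = lo; i < hi; i++)   /* work done by this "CPU" */
            c[i] = a[i] + b[i];
    }
}
```

With real threads or MPI ranks, each block's inner loop would run concurrently; since the blocks are disjoint, no synchronization is needed during the additions.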
Domain Decomposition
Example: find the largest element of an array
[Figure: animation — the array is divided into four contiguous blocks, one per CPU (CPU 0 through CPU 3), and each CPU steps through its own block]
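The largest-element example is a reduction: each CPU finds a local maximum over its block, and the local maxima are then combined. A minimal sequential sketch of that two-phase structure (function name and partitioning scheme assumed, not from the deck):

```c
#include <stddef.h>

/* Domain decomposition for a reduction: each simulated CPU scans its own
   block of the array for a local maximum (phase 1), then the local maxima
   are combined into the global maximum (phase 2). */
int array_max_blocks(const int *a, size_t n, size_t ncpus)
{
    int global_max = a[0];
    for (size_t cpu = 0; cpu < ncpus; cpu++) {
        size_t lo = cpu * n / ncpus;        /* this CPU's block: [lo, hi) */
        size_t hi = (cpu + 1) * n / ncpus;
        if (lo == hi) continue;             /* more CPUs than elements */
        int local_max = a[lo];              /* phase 1: local reduction */
        for (size_t i = lo + 1; i < hi; i++)
            if (a[i] > local_max) local_max = a[i];
        if (local_max > global_max)         /* phase 2: combine */
            global_max = local_max;
    }
    return global_max;
}
```

In a genuinely parallel version, phase 1 runs concurrently with no communication, and only phase 2 requires coordination among processors.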
Task Decomposition
[Figure: a dependence graph of tasks f(), g(), h(), r(), q(), s()]
[Figure: animation — the tasks are assigned to three CPUs: CPU 0 runs f(); CPU 1 runs g() and h(); CPU 2 runs r(), q(), and s()]
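The task assignment above can be sketched as follows. The tasks f through s are hypothetical stand-ins for the deck's diagram (here they just record that they ran), and their dependence structure is not shown; the point is only that different *functions*, rather than different data blocks, are mapped to processors.

```c
#include <string.h>

/* Task decomposition: functions, not data blocks, are divided among CPUs. */
static char trace[16];
static size_t pos;
static void run(char name) { trace[pos++] = name; }

/* Hypothetical tasks mirroring the labels in the deck's diagram. */
static void f(void) { run('f'); }   static void g(void) { run('g'); }
static void h(void) { run('h'); }   static void r(void) { run('r'); }
static void q(void) { run('q'); }   static void s(void) { run('s'); }

const char *task_decomposition_demo(void)
{
    pos = 0;
    memset(trace, 0, sizeof trace);
    /* Each row is one simulated CPU's work list, executed here in turn;
       with real threads the three rows would run concurrently. */
    f();                  /* CPU 0 */
    g(); h();             /* CPU 1 */
    r(); q(); s();        /* CPU 2 */
    return trace;
}
```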
Pipelining
- A special kind of task decomposition
- "Assembly line" parallelism
- Example: a 3-stage graphics pipeline — Input → Project → Clip → Rasterize → Output
[Figure: animation — with CPU 1 projecting, CPU 2 clipping, and CPU 3 rasterizing, data sets 0 through 4 flow through the pipeline, each advancing one stage per time step]
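The timing behavior in the figure can be worked out with a small simulation: once the pipeline is full, one finished data set emerges per time step, so n data sets through p stages take n + p − 1 steps instead of the sequential n × p. The function below is an illustrative sketch, not code from the deck.

```c
/* Pipelining: simulate n data sets flowing through p stages, where data
   set d occupies stage s at time t = d + s (0-based).  Count the time
   steps until the last data set leaves the last stage. */
int pipeline_steps(int n_datasets, int n_stages)
{
    int steps = 0;
    for (int t = 0; ; t++) {
        int busy = 0;
        for (int s = 0; s < n_stages; s++) {
            int d = t - s;                 /* data set in stage s at time t */
            if (d >= 0 && d < n_datasets)
                busy = 1;
        }
        if (!busy)                         /* pipeline has drained */
            break;
        steps++;
    }
    return steps;
}
```

For the figure's 5 data sets and 3 stages this gives 5 + 3 − 1 = 7 steps, versus 15 for one CPU doing all three stages on each data set in turn.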
Dependence Graph
Graph = (nodes, arrows)
Node for each:
- Variable assignment (except index variables)
- Constant
Arrow for each:
- Control flow / data dependence between nodes
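A small loop makes the idea concrete. The arrays and loop below are hypothetical, chosen to mirror the b[i]/a[i] figures that follow; the comments describe its dependence graph.

```c
/* A loop whose dependence graph forbids domain decomposition.
   Nodes: one per assignment a[1]..a[n-1] (the index variable i is
   excluded) plus the constants/inputs a[0] and b[1]..b[n-1].
   Arrows: a[i-1] -> a[i] and b[i] -> a[i] (data dependences).
   The chain a[0] -> a[1] -> a[2] -> ... means the iterations
   cannot be split among processors. */
void prefix_sum_chain(const int *b, int *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}
```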
[Figure: dependence graph over b[1], b[2] and a[0], a[1], a[2]; the assignments to the a[i] are mutually independent, so domain decomposition is possible]
[Figure: dependence graph over b[1]-b[3] and a[1]-a[3]; dependences among the a[i] mean no domain decomposition is possible]
[Figure: a dependence graph (nodes include a division and a variable s) partitioned for task decomposition with 3 CPUs]
[Figure: dependence graph over a[0]-a[2], partitioned by domain decomposition]
[Figure: dependence graph with comparison nodes (a[0] < a[1], a[1] < a[2]) over a[0]-a[2]]
Dense matrices
Sparse matrices
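The dense/sparse contrast can be sketched with matrix-vector multiply, a common kernel in this setting. A dense matrix stores every entry; a sparse matrix in compressed sparse row (CSR) form stores only the nonzeros, their column indices, and row boundaries. CSR is the standard scheme, assumed here rather than taken from the deck.

```c
#include <stddef.h>

/* Dense storage: all rows*cols entries, laid out row-major. */
void dense_matvec(const double *m, const double *x, double *y,
                  size_t rows, size_t cols)
{
    for (size_t i = 0; i < rows; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < cols; j++)
            y[i] += m[i * cols + j] * x[j];
    }
}

/* CSR storage: vals holds the nonzeros, colidx their columns, and
   row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros. */
void csr_matvec(const double *vals, const size_t *colidx,
                const size_t *row_ptr, const double *x, double *y,
                size_t rows)
{
    for (size_t i = 0; i < rows; i++) {
        y[i] = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[i] += vals[k] * x[colidx[k]];   /* only nonzeros touched */
    }
}
```

Either kernel domain-decomposes naturally by rows, since each y[i] is computed independently; the sparse case simply gives uneven per-row work.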
References
- Richard H. Carver and Kuo-Chung Tai, Modern Multithreading: Implementing, Testing, and Debugging Java and C++/Pthreads/Win32 Programs, Wiley-Interscience (2006).
- Robert L. Mitchell, "Decline of the Desktop," Computerworld (September 26, 2005).
- Michael J. Quinn, Parallel Programming in C with MPI and OpenMP, McGraw-Hill (2004).
- Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software," Dr. Dobb's Journal 30(3) (March 2005).