
Introduction to Parallel Processing
Shantanu Dutt
University of Illinois at Chicago
2
Acknowledgements
Ashish Agrawal, IIT Kanpur, Fundamentals of Parallel Processing (slides), w/ some modifications and augmentations by Shantanu Dutt
John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt
John Mellor-Crummey, COMP 422 Parallel Computing: An Introduction, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt


3
Outline
The need for explicit multi-core/processor parallel processing:
  Moore's Law and its limits
  Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary
4
Outline
The need for explicit multi-core/processor parallel processing:
  Moore's Law and its limits
  Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary
5
Moore's Law & Need for Parallel Processing
Chip performance doubles every 18-24 months.
Power consumption is prop. to freq.
Limits of serial computing:
  Heating issues
  Limit to transmission speeds
  Leakage currents
  Limit to miniaturization
Multi-core processors already commonplace.
Most high-performance servers already parallel.
6
Quest for Performance
Pipelining
Superscalar Architecture
Out of Order Execution
Caches
Instruction Set Design Advancements
Parallelism
  Multi-core processors
  Clusters
  Grid
This is the future
7
Pipelining
Illustration of a pipeline using the fetch, load, execute, store stages.
At the start of execution: wind-up.
At the end of execution: wind-down.
Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup.
Pipeline depth: number of stages, i.e., the number of instructions in execution simultaneously.
Intel Pentium 4: 35 stages.
T_pipe(n), the pipelined time to process n instructions, = fill_time + n*max{t_i} ~ n*max{t_i} for large n (fill_time is a constant w.r.t. n), where t_i = exec. time of the i-th stage.
The pipelined throughput = 1/max{t_i}.
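To make the formula concrete, here is a minimal C sketch (not from the slides; the per-stage times t_i are made-up values) that evaluates T_pipe(n), the pipelined throughput, and the speedup over unpipelined execution:

#include <stdio.h>

int main(void) {
    /* Hypothetical per-stage times (ns); these t_i values are illustrative only. */
    double t[] = {1.0, 1.5, 2.0, 1.0};
    int    k   = (int)(sizeof t / sizeof t[0]);   /* pipeline depth (number of stages) */
    long   n   = 1000000;                         /* number of instructions */

    double t_max = 0.0, t_sum = 0.0;
    for (int i = 0; i < k; i++) {
        if (t[i] > t_max) t_max = t[i];
        t_sum += t[i];                            /* unpipelined time per instruction */
    }

    /* Synchronous pipeline clocked at max{t_i}:
       T_pipe(n) = fill_time + n*max{t_i}, with fill_time = (k-1)*max{t_i}. */
    double t_pipe   = (k - 1) * t_max + n * t_max;
    double t_serial = n * t_sum;

    printf("T_pipe(n)  = %.1f ns\n", t_pipe);
    printf("throughput = %.3f instr/ns  (= 1/max{t_i})\n", 1.0 / t_max);
    printf("speedup    = %.2f\n", t_serial / t_pipe);
    return 0;
}

Note that the fill time (k-1)*max{t_i} is indeed constant w.r.t. n, so for large n the speedup approaches t_sum/max{t_i}, which is at best the pipeline depth k.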
8
Pipelining
9
Cache
Desire for fast, cheap and non-volatile memory.
Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
Cache: fast, small memory.
L1 and L2 caches.
Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
Cache hit and miss.
Prefetch: used to avoid cache misses at the start of the execution of the program.
Cache lines: used to amortize the latency of a cache miss (a whole line of contiguous words is brought in per miss; see the sketch below).
Order of search: L1 cache -> L2 cache -> RAM -> Disk.
Cache coherency: correctness of shared data; important for distributed parallel computing.
Limit to cache improvement: improving cache performance will at most bring memory-access efficiency up to match processor efficiency.
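As a small illustration (an assumption-based C sketch, not from the slides) of why cache lines matter: both functions below sum the same matrix, but the row-major traversal reuses each fetched cache line while the column-major one touches a different line on almost every access and therefore misses far more often.

#include <stdio.h>

#define N 1024

static double a[N][N];

/* Row-major traversal: consecutive accesses fall in the same cache line,
   so after one miss the following accesses hit in L1. */
static double sum_row_major(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major traversal: each access jumps a whole row ahead, touching a
   new cache line almost every time (each miss costs ~100s of cycles to RAM). */
static double sum_col_major(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%f %f\n", sum_row_major(), sum_col_major());
    return 0;
}

On typical machines the column-major version runs several times slower, purely because of cache misses.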
10
(Slides 10-15: figures on uni-processor parallelism techniques; surviving annotations: examples of limited data parallelism; examples of limited & low-level functional parallelism; single-instr. multiple-data (SIMD) units; instruction-level parallelism, whose degree is generally low and dependent on how the sequential code has been written, so not very effective; simultaneous multi-threading (SMT); multi-threading.)
16
Thus: Two Fundamental Issues in Future High Performance
17
Microprocessor performance improvement via various implicit and explicit parallelism schemes and technology improvements is reaching (has reached?) a point of diminishing returns.
Thus we need the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts.
18
Outline
The need for explicit multi-core/processor parallel processing:
  Moore's Law and its limits
  Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary
23
Applications of Parallel Processing
29
An example parallel algorithm for a finite element computation
Easy parallel situation: each data part is independent. No communication is required between the execution units solving two different parts.
Next level: simple, structured and sparse communication needed.
Example: Heat Equation -
  The initial temperature is zero on the boundaries and high in the middle.
  The boundary temperature is held at zero.
  The calculation of an element depends on its neighboring elements.
(Figure: independent data parts data1, data2, ..., data N)
30
find out if I am MASTER or WORKER
if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  do until all WORKERS converge
    gather from all WORKERS convergence data
    broadcast to all WORKERS convergence signal
  end do
  receive results from each WORKER

else if I am WORKER
  receive from MASTER starting info and subarray
  do until solution converged {
    update time
    send (non-blocking?) neighbors my border info
    receive (non-blocking?) neighbors' border info
    update interior of my portion of solution array (see the computation in the serial code)
    wait for non-blocking communication (if any) to complete
    update border of my portion of solution array
    determine if my solution has converged
    if so { send MASTER convergence signal;
            recv. from MASTER convergence signal }
  end do }
  send MASTER results
endif
Serial Code -
do y = 2, N-1
  do x = 2, M-1
    u2(x,y) = u1(x,y) + cx*[u1(x+1,y) + u1(x-1,y)]
                      + cy*[u1(x,y+1) + u1(x,y-1)]   /* cx, cy are constants */
  enddo
enddo
u1 = u2;
(Figure: the problem grid partitioned among workers; the master can be one of the workers.)
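Below is a hedged MPI sketch in C of the worker's inner loop above, assuming a 1-D row-block decomposition of the grid; the sizes, tags, constants, and the fixed step count standing in for the real convergence test are illustrative assumptions, not the slides' actual code.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define M     256      /* global number of columns (assumed) */
#define NLOC   64      /* local rows per worker (assumed) */
#define STEPS 100      /* fixed step count in place of a real convergence test */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* ranks above/below in the 1-D decomposition; MPI_PROC_NULL at the physical boundary */
    int up   = (rank == 0)      ? MPI_PROC_NULL : rank - 1;
    int down = (rank == np - 1) ? MPI_PROC_NULL : rank + 1;

    /* (NLOC+2) x M arrays: rows 0 and NLOC+1 are ghost (halo) rows, initially zero */
    double *u1 = calloc((NLOC + 2) * M, sizeof *u1);
    double *u2 = calloc((NLOC + 2) * M, sizeof *u2);
    double cx = 0.1, cy = 0.1;                    /* assumed constants */

    for (int step = 0; step < STEPS; step++) {
        MPI_Request req[4];
        /* non-blocking exchange of border rows with the up/down neighbors */
        MPI_Irecv(&u1[0 * M],          M, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(&u1[(NLOC + 1) * M], M, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Isend(&u1[1 * M],          M, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(&u1[NLOC * M],       M, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* update interior rows, which need no ghost data (overlaps with communication) */
        for (int y = 2; y <= NLOC - 1; y++)
            for (int x = 1; x < M - 1; x++)
                u2[y*M + x] = u1[y*M + x]
                            + cx * (u1[y*M + x + 1] + u1[y*M + x - 1])
                            + cy * (u1[(y + 1)*M + x] + u1[(y - 1)*M + x]);

        /* wait for the halo exchange to complete */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* update the two border rows, which read the received ghost rows */
        int border[2] = {1, NLOC};
        for (int b = 0; b < 2; b++) {
            int y = border[b];
            for (int x = 1; x < M - 1; x++)
                u2[y*M + x] = u1[y*M + x]
                            + cx * (u1[y*M + x + 1] + u1[y*M + x - 1])
                            + cy * (u1[(y + 1)*M + x] + u1[(y - 1)*M + x]);
        }
        memcpy(u1, u2, (NLOC + 2) * M * sizeof *u1);    /* u1 = u2 */
    }

    free(u1); free(u2);
    MPI_Finalize();
    return 0;
}

A full implementation would also have the master distribute the subarrays and gather convergence data as in the pseudocode; an MPI_Allreduce on a local error norm is one common way to realize the convergence test.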
31
Outline
The need for explicit multi-core/processor parallel processing:
  Moore's Law and its limits
  Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary and future advances
32
Parallelism - A simplistic understanding
Multiple tasks at once.
Distribute work into multiple execution units.
A classification of parallelism:
  Data Parallelism
  Functional or Control Parallelism
Data Parallelism - Divide the dataset and solve each sector similarly on a separate execution unit.
Functional Parallelism - Divide the 'problem' into different tasks and execute the tasks on different units. What would functional parallelism look like for the example on the right? (A small sketch contrasting the two follows below.)
(Figure: a sequential computation vs. its data-parallel decomposition.)
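As a concrete (and purely illustrative) C/OpenMP sketch of the two kinds of parallelism above: the first loop splits one dataset across threads (data parallelism), while the sections construct runs two different tasks concurrently (functional parallelism). The arrays and the two tasks are made-up examples, not taken from the slides.

#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N];

int main(void) {
    double sum_a = 0.0, max_b = 0.0;
    for (int i = 0; i < N; i++) { a[i] = i * 0.5; b[i] = N - i; }
    printf("threads available: %d\n", omp_get_max_threads());

    /* Data parallelism: every thread performs the same computation on its own chunk. */
    #pragma omp parallel for reduction(+:sum_a)
    for (int i = 0; i < N; i++)
        sum_a += a[i];

    /* Functional (control) parallelism: different tasks run on different threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        {   /* task 1: scale array a */
            for (int i = 0; i < N; i++) a[i] *= 2.0;
        }
        #pragma omp section
        {   /* task 2: find the maximum of array b */
            max_b = b[0];
            for (int i = 1; i < N; i++) if (b[i] > max_b) max_b = b[i];
        }
    }

    printf("sum_a = %f, max_b = %f\n", sum_a, max_b);
    return 0;
}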
33
Data Parallelism
Functional Parallelism
Flynn's Classification
Flynn's classical taxonomy: based on the number of instruction/task streams and data streams.
  Single Instruction, Single Data stream (SISD): your single-core uni-processor PC.
  Single Instruction, Multiple Data streams (SIMD): special-purpose, low-granularity multi-processor m/c w/ a single control unit relaying the same instruction to all processors (w/ different data) every clock cycle (e.g., nVIDIA graphics co-processors w/ 1000s of simple cores).
  Multiple Instruction, Single Data stream (MISD): pipelining is a major example.
  Multiple Instruction, Multiple Data streams (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is v. different from SIMD. Why?
Data vs. control parallelism is another classification, independent of Flynn's.
34
Flynn's Classification (contd.)
35
Flynn's Classification (contd.)
36
Flynn's Classification (contd.)
37
Flynn's Classification (contd.)
38
Flynn's Classification (contd.)
39
Flynn's Classification (contd.)
40
Data Parallelism: SIMD and SPMD fall into this category.
Functional Parallelism: MISD falls into this category.
MIMD can incorporate both data and functional parallelism (the latter either at the instruction level, with different instrs. being executed across the processors at any time, or at the level of high-level functions).
41
Outline
The need for explicit multi-core/processor parallel processing:
  Moore's Law and its limits
  Different uni-processor performance enhancement techniques and their limits
Applications for parallel processing
  Overview of different applications
  An example parallel algorithm
Classification of parallel computations
Classification of parallel architectures
  Including an example of an SPMD parallel algorithm
Summary
42
Parallel Arch. Classification
Multi-processor Architectures:
  Distributed Memory: the most prevalent architecture model for # processors > 8
    Indirect interconnection n/ws
    Direct interconnection n/ws
  Shared Memory:
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA): distributed shared memory
43
Distributed Memory: Message-Passing Architectures
Each processor P (with its own local cache C) is connected to exclusive local memory, i.e., no other CPU has direct access to it.
Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
Non-blocking vs. blocking communication.
Direct vs. indirect communication/interconnection network.
Example: a 2x4 mesh n/w (a direct-connection n/w).
The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)
44
Has 56 compute nodes/computers and a master node.
Master here has a different meaning (generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes) than the master node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem), which would actually be one of the compute nodes, and generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also additionally perform a part of the computation.
Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level n/w.
Each node (compute or master) has 2 processors. The processors are single-core on some nodes and dual-core on others; see http://accc.uic.edu/service/arg/nodes
System Computational Actions in a Message-Passing Program
45
(a) Two basic parallel processes X and Y, and their data dependency:
  Proc. X: a := b+c;      Proc. Y: b := x*y;
(b) Their mapping to a message-passing multicomputer:
  Proc. X (on the processor/core containing X): recv(P2, b); /* blocking */ a := b+c;
  Proc. Y (on the processor/core containing Y): b := x*y; send(P1, b); /* non-blocking */
  Data item b is message-passed from P(Y) to P(X) over a link (direct or indirect) between the two processors.
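A minimal MPI sketch in C of mapping (b) above: the process playing Y computes b and posts a non-blocking send, while the process playing X does a blocking receive before computing a := b + c. The rank assignments and the values of c, x, y are illustrative assumptions; run with at least 2 processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* plays the role of Proc. X (on P1) */
        double b, c = 3.0, a;
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* blocking */
        a = b + c;
        printf("X: a = %f\n", a);
    } else if (rank == 1) {     /* plays the role of Proc. Y (on P2) */
        double x = 2.0, y = 5.0, b = x * y;
        MPI_Request req;
        MPI_Isend(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);   /* non-blocking */
        /* Y could do other useful work here before waiting for completion. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}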
(Figure: dual-core and quad-core processors with L1 and L2 caches.)
46
Distributed Shared Memory Arch.: UMA
Flat memory model.
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: a dual-core processor.
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Cache-coherent UMA: consistent cache values of the same data item in different proc./core caches (see the sketch below).
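One practical consequence of cache-coherent UMA is false sharing; in the illustrative C/pthreads sketch below (not from the slides; the 64-byte line size is an assumption), two threads increment counters that either share one cache line or sit on separate, padded lines.

#include <stdio.h>
#include <pthread.h>

#define ITERS 100000000L

struct padded { long v; char pad[64 - sizeof(long)]; };   /* one counter per (assumed) cache line */

static long          shared_line[2];     /* both counters likely in the same cache line */
static struct padded separate[2];        /* counters on separate cache lines */

static void *bump_shared(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) shared_line[id]++;
    return NULL;
}

static void *bump_separate(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++) separate[id].v++;
    return NULL;
}

int main(void) {
    pthread_t t[2];

    /* Phase 1, false sharing: the coherence protocol bounces the line between caches. */
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_shared, (void *)id);
    for (int  id = 0; id < 2; id++) pthread_join(t[id], NULL);

    /* Phase 2, padded: each thread's line stays resident in its own cache. */
    for (long id = 0; id < 2; id++) pthread_create(&t[id], NULL, bump_separate, (void *)id);
    for (int  id = 0; id < 2; id++) pthread_join(t[id], NULL);

    printf("%ld %ld %ld %ld\n", shared_line[0], shared_line[1], separate[0].v, separate[1].v);
    return 0;
}

Timing the two phases typically shows the unpadded version running markedly slower, purely because of coherence traffic on the shared line.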
System Computational Actions in a Shared-Memory Program
47
(a) Two basic parallel processes X and Y, and their data dependency:
  Proc. X: a := b+c;      Proc. Y: b := x*y;
(b) Their mapping to a shared-memory multiprocessor (P(X) and P(Y) sharing one memory):
  Proc. X: a := b+c;      Proc. Y: b := z*w;
Possible actions by the O.S. for the writer (Y):
  (i) Since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it can be written to (all previous reads done: read_cntr for b = 0).
  (ii) If so, write b to its location and mark its status bit as "written by Y". Initialize read_cntr for b to a pre-determined value.
Possible actions by the O.S. for the reader (X):
  (i) Since b is a shared data item (e.g., designated by the compiler or programmer), check b's location to see if it has been written to by Y, or by any process (if we don't care about the writing process).
  (ii) If so { read b & decrement read_cntr for b } else go to (i) and busy-wait (check periodically).
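The "status bit + read_cntr" bookkeeping described above is done by the O.S./runtime; purely as an illustration, here is a user-level C sketch with pthreads and C11 atomics that mimics it (the number of readers and the data values are made-up assumptions):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>
#include <sched.h>

#define NREADERS 2

static double      b;                 /* the shared data item */
static atomic_bool written;           /* the "written by Y" status bit */
static atomic_int  read_cntr;         /* pending reads of b */

static void *writer_Y(void *arg) {    /* Proc. Y: b := x*y */
    (void)arg;
    double x = 2.0, y = 5.0;
    while (atomic_load(&read_cntr) != 0)   /* wait until all previous reads are done */
        sched_yield();
    b = x * y;                             /* write b to its location */
    atomic_store(&read_cntr, NREADERS);    /* init read_cntr to the pre-determined value */
    atomic_store(&written, true);          /* mark b as written */
    return NULL;
}

static void *reader_X(void *arg) {    /* Proc. X: a := b + c */
    (void)arg;
    double c = 3.0, a;
    while (!atomic_load(&written))         /* busy-wait until b has been written */
        sched_yield();
    a = b + c;                             /* read b ... */
    atomic_fetch_sub(&read_cntr, 1);       /* ... and decrement read_cntr */
    printf("a reader computed a = %f\n", a);
    return NULL;
}

int main(void) {
    pthread_t tw, tr[NREADERS];
    for (int i = 0; i < NREADERS; i++) pthread_create(&tr[i], NULL, reader_X, NULL);
    pthread_create(&tw, NULL, writer_Y, NULL);
    pthread_join(tw, NULL);
    for (int i = 0; i < NREADERS; i++) pthread_join(tr[i], NULL);
    return 0;
}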
48
Distributed Shared Memory Arch.: NUMA
Memory is physically distributed but logically shared.
The physical layout is similar to the distributed-memory message-passing case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (local vs. remote access; see the first-touch sketch below).
Example: two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing archs., only here these links are used by the O.S. to transmit read/write non-local data to/from the processor/non-local memory).
Advantage: scalability (compared to UMAs).
Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel programming/algorithm model (it is easier to partition data among procs. than to think of all of it occupying a large monolithic address space that each proc. can access).
(Figure: locality domains with an all-to-all (complete-graph) connection via a combination of direct and indirect connections.)
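One common way to mitigate the locality problem is to exploit first-touch page placement (an assumed, Linux-style O.S. policy): each thread initializes, i.e. first-touches, the part of the data it will later use, so those pages are allocated in its local locality domain. An illustrative C/OpenMP sketch, not from the slides:

#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);

    /* Parallel first-touch initialization: each page lands in the locality
       domain of the thread that touches it first (assumed first-touch policy). */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* Compute phase with the same static schedule: each thread now accesses
       mostly pages in its own (local) memory rather than a remote domain. */
    double s = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:s)
    for (long i = 0; i < N; i++)
        s += a[i];

    printf("sum = %f\n", s);
    free(a);
    return 0;
}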
50
An example of an SPMD message-passing parallel program
51
SPMD message-passing parallel program (contd.)
52
(Slide 52: code figure; only the fragment "node xor D" is legible.)
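The SPMD program on these slides is not recoverable from the extracted text; as a stand-in, here is a hedged C/MPI sketch of a hypercube-style global sum in which, in dimension d, each node exchanges with partner = node xor 2^d (consistent with the "node xor D" fragment above). It assumes the number of processes is a power of two; the local values are made-up.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, np;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    double my_sum = rank + 1.0;             /* each node's local value (assumed) */

    for (int d = 1; d < np; d <<= 1) {      /* one step per hypercube dimension */
        int partner = rank ^ d;             /* "node xor D" */
        double recv_val;
        MPI_Sendrecv(&my_sum, 1, MPI_DOUBLE, partner, 0,
                     &recv_val, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        my_sum += recv_val;                 /* after log2(np) steps every node has the total */
    }

    printf("node %d: global sum = %f\n", rank, my_sum);
    MPI_Finalize();
    return 0;
}

Every process runs this same program, differing only in its rank and its data; that is exactly the SPMD style noted under Flynn's MIMD category.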
53
How to interconnect the multiple cores/processors is a major consideration in a parallel architecture.
54
(Table: system performance figures in Tflops and power in kW; contents not recoverable.)
55
Summary
Serial computers / microprocessors will probably not get much faster - parallelization unavoidable.
Pipelining, caches and other optimization strategies for serial computers are reaching a plateau.
Application examples.
Data and functional parallelism.
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
Parallel architectures intro:
  Distributed Memory
  Shared Memory
    Uniform Memory Access
    Non-Uniform Memory Access
Parallel program/algorithm examples.
56
Additional References
Computer Organization and Design, Patterson & Hennessy
Modern Operating Systems, Tanenbaum
Concepts of High Performance Computing, Georg Hager & Gerhard Wellein
Cramming More Components onto Integrated Circuits, Gordon Moore, 1965
Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley, 2006
