
Parallel and Distributed Algorithms
Spring 2007
Johnnie W. Baker

Introduction and General Concepts
Chapter 1

References

Selim Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997. An updated online version is available through the author's website.
Selim Akl, "The Design of Efficient Parallel Algorithms," Chapter 2 in Handbook on Parallel and Distributed Processing, edited by J. Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag, 2000.
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, 2nd Edition, Addison Wesley, 2003.
Harry Jordan and Gita Alaghband, Fundamentals of Parallel Processing: Algorithms, Architectures, Languages, Prentice Hall, 2003.
Michael Quinn, Parallel Programming in C with MPI and OpenMP, McGraw Hill, 2004.
Michael Quinn, Parallel Computing: Theory and Practice, McGraw Hill, 1994.
Barry Wilkinson and Michael Allen, Parallel Programming, 2nd Edition, Prentice Hall, 2005.

Outline

Need for Parallel & Distributed Computing
Flynn's Taxonomy of Parallel Computers
Two Main Types of MIMD Computers
Examples of Computational Models
Data Parallel & Functional/Control/Job Parallel
Granularity
Analysis of Parallel Algorithms
Elementary Steps: computational and routing steps
Running Time & Time Optimal
Parallel Speedup
Speedup
Cost and Work
Efficiency
Linear and Superlinear Speedup
Speedup and Slowdown Folklore Theorems
Amdahl's and Gustafson's Laws

Reasons to Study Parallel & Distributed Computing

Sequential computers have severe limits on memory size.
Significant slowdowns occur when accessing data that is stored in external devices.
Sequential computation times for most large problems are unacceptable.
Sequential computers cannot meet the deadlines for many real-time problems.
Some problems are distributed in nature and are natural candidates for distributed computation.

Grand Challenge to Computational Science Categories (1989)
Quantum chemistry, statistical mechanics, and relativistic
physics
Cosmology and astrophysics
Computational fluid dynamics and turbulence
Materials design and superconductivity
Biology, pharmacology, genome sequencing, genetic
engineering, protein folding, enzyme activity, and cell
modeling
Medicine, and modeling of human organs and bones
Global weather and environmental modeling

Weather Prediction
Atmosphere is divided into 3D cells
Data includes temperature, pressure, humidity,
wind speed and direction, etc
Recorded at regular time intervals in each cell
There are about 5 × 10³ cells of 1-mile cubes.
Calculations would take a modern computer over 100 days to perform the computations needed for a 10-day forecast.
Details are in Ian Foster's 1995 online textbook, Designing and Building Parallel Programs.
See the author's online copy for further information.

Flynn's Taxonomy
Best-known classification scheme for parallel computers.
Depends on the parallelism a computer exhibits in its
Instruction stream
Data stream
A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream).
The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M).
Four combinations: SISD, SIMD, MISD, MIMD

SISD
Single Instruction, Single Data
Single-CPU systems, i.e., uniprocessors
Note: co-processors don't count
Concurrent processing is allowed:
Instruction prefetching
Pipelined execution of instructions
Functional but not data-parallel execution
Example: PCs

SIMD
Single instruction, multiple data
One instruction stream is broadcast to all
processors
Each processor (also called a processing
element or PE) is very simplistic and is
essentially an ALU;
PEs do not store a copy of the program nor
have a program control unit.
Individual processors can be inhibited from
participating in an instruction (based on a
data test).

SIMD (cont.)
All active processors execute the same instruction synchronously, but on different data.
On a memory access, all active processors must access the same location in their local memory.
The data items form an array, and an instruction can act on the complete array in one cycle.

SIMD (cont.)
Quinn calls this architecture a processor array.
Two examples are Thinking Machines' Connection Machine CM-2 and CM-200.
Also, MasPar's MP-1 and MP-2 are examples.
Quinn also considers a pipelined vector processor to be a SIMD.
This is somewhat non-standard.
An example is the Cray-1.

MISD
Multiple instruction, single data
Quinn argues that a systolic array is an
example of a MISD structure (pg 55-57)
Some authors include pipelined
architecture in this category
This category does not receive much
attention from most authors

MIMD
Multiple instruction, multiple data
Processors are asynchronous, since they
can independently execute different
programs on different data sets.
Communication is handled either by message passing or through shared memory.
Considered by most researchers to
contain the most powerful, least restricted
computers.

MIMD (cont.)
MIMDs have major communication costs that are usually ignored when comparing them to SIMDs or when computing algorithmic complexity.
A common way to program MIMDs is for all processors to execute the same program.
This is called the SPMD (single program, multiple data) method; a minimal sketch in C with MPI follows below.
It is the usual method when the number of processors is large.
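
As an illustration (not from the original slides), here is a minimal SPMD sketch in C with MPI; the problem size and block partitioning are arbitrary choices for the example. Every process runs the same program and selects its own work by testing its rank.

/* Hypothetical SPMD sketch: every processor runs this same program, but each
   one selects its own work by testing its rank (its MPI process number). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 1000;                              /* assumed total problem size */
    int chunk = n / size;                      /* same code, different data  */
    int first = rank * chunk;
    int last  = (rank == size - 1) ? n : first + chunk;
    printf("rank %d of %d works on elements %d..%d\n",
           rank, size, first, last - 1);

    MPI_Finalize();
    return 0;
}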

Multiprocessors
(Shared Memory MIMDs)
All processors have access to all memory locations.
The processors access memory through some type of interconnection network.
This type of memory access is called uniform memory access (UMA).
Most multiprocessors have hierarchical or distributed memory systems in order to provide fast access to all memory locations.
This type of memory access is called nonuniform memory access (NUMA).

Multiprocessors (cont.)
Normally, fast cache is used with NUMA systems to reduce the problem of different memory access times for PEs.
This creates the problem of ensuring that all copies of the same data in different memory locations are identical.
Symmetric Multiprocessors (SMPs) are currently a popular example of shared-memory multiprocessors; a small shared-memory sketch follows below.
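
A minimal shared-memory sketch (hypothetical, assuming an OpenMP-capable C compiler): one thread writes a value and the other threads simply read it from the common address space, with no messages involved.

/* Hypothetical shared-memory sketch: one thread writes into shared memory
   and every other thread can read the value directly - no messages needed. */
#include <omp.h>
#include <stdio.h>

int main(void) {
    int shared_value = 0;                  /* lives in the single shared memory */

    #pragma omp parallel
    {
        #pragma omp single
        shared_value = 42;                 /* one thread writes it ...          */
                                           /* (implicit barrier ends 'single')  */
        printf("thread %d reads %d directly from shared memory\n",
               omp_get_thread_num(), shared_value);   /* ... all threads read it */
    }
    return 0;
}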

Multicomputers
(Message-Passing MIMDs)
Processors are connected by an interconnection network
Each processor has a local memory and can only access
its own local memory.
Data is passed between processors using messages, as
dictated by the program.
A common approach to programming multicomputers is to use a message-passing library (e.g., MPI, PVM); a minimal sketch is given below.
The problem is divided into processes that can be executed concurrently on individual processors. Each processor is normally assigned multiple processes.
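
By contrast with the shared-memory sketch above, here is a minimal message-passing sketch (hypothetical, assuming MPI and at least two processes): process 0 must explicitly send its datum as a message, because process 1 cannot read process 0's local memory.

/* Hypothetical message-passing sketch: data moves only via explicit messages,
   since each process can access only its own local memory. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        double datum = 3.14;                       /* lives in rank 0's memory */
        MPI_Send(&datum, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        double copy;                               /* a separate copy, not shared */
        MPI_Recv(&copy, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f by message passing\n", copy);
    }

    MPI_Finalize();
    return 0;
}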

Multicomputers (cont. 2/3)

Programming disadvantages of message-passing:
Programmers must make explicit message-passing calls in the code.
This is low-level programming and is error prone.
Data is not shared but copied, which increases the total data size.
Data integrity: difficulty in maintaining correctness of multiple copies of a data item.

Multicomputers (cont. 3/3)

Programming advantages of message-passing:
No problem with simultaneous access to data.
Allows different PCs to operate on the same data independently.
Allows PCs on a network to be easily upgraded when faster processors become available.
Mixed distributed shared-memory systems:
There is a lot of current interest in clusters of SMPs.
See Dr. David Bader's or Dr. Joseph JaJa's websites.

Data Parallel Computation

All tasks (or processors) apply the same set of operations to different data.
Example (an OpenMP rendering is sketched below):
for i ← 0 to 99 do
  a[i] ← b[i] + c[i]
endfor
Data parallel is the usual mode of computation for SIMDs:
All active processors execute the operations synchronously.
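
One possible rendering of the slide's loop as data-parallel C with OpenMP (the array contents here are made up for the example):

/* The slide's data-parallel loop: every iteration applies the same operation
   (an add) to different data, so the iterations may run concurrently. */
#include <stdio.h>

int main(void) {
    double a[100], b[100], c[100];
    for (int i = 0; i < 100; i++) {        /* sample input data */
        b[i] = i;
        c[i] = 2 * i;
    }

    #pragma omp parallel for               /* same operation, different data */
    for (int i = 0; i < 100; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %.1f\n", a[99]);       /* expect 99 + 198 = 297 */
    return 0;
}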

Data Parallel Computation for MIMDs

A common way to support data-parallel programming on MIMDs is to use SPMD (single program, multiple data) programming, as follows:
Processors (or tasks) execute the same block of instructions concurrently but asynchronously.
No communication or synchronization occurs within these concurrent instruction blocks.
Each instruction block is normally followed by synchronization and communication steps.
If processors have identical tasks (with different data), this method allows these tasks to be executed in a data-parallel fashion.

Functional/Control/Job Parallelism
Involves executing different sets of operations on different data sets (a small sketch follows below).
Typical MIMD control parallelism:
The problem is divided into different, non-identical tasks.
The tasks are distributed among the processors.
Tasks are usually divided between processors so that their workload is roughly balanced.
Each processor follows some scheme for executing its tasks.
One scheme is to use a task-scheduling algorithm.
Alternately, a processor may follow some scheme for executing its tasks concurrently.
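
A small control-parallel sketch (hypothetical, using OpenMP sections): two non-identical tasks, a sum and a maximum, run concurrently on the same data set.

/* Hypothetical control-parallel sketch: different (non-identical) tasks are
   run on different processors at the same time, expressed with OpenMP sections. */
#include <stdio.h>

int main(void) {
    double data[4] = {3.0, 1.0, 4.0, 1.5};
    double sum = 0.0, max = 0.0;

    #pragma omp parallel sections
    {
        #pragma omp section
        {                                   /* task 1: compute the sum */
            for (int i = 0; i < 4; i++) sum += data[i];
        }
        #pragma omp section
        {                                   /* task 2: find the maximum */
            for (int i = 0; i < 4; i++) if (data[i] > max) max = data[i];
        }
    }
    printf("sum = %.1f, max = %.1f\n", sum, max);   /* expect 9.5 and 4.0 */
    return 0;
}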

Grain Size
Defn: Grain size is the average number of computations performed between communication or synchronization steps.
See the Quinn textbook, page 411.
Data-parallel programming usually results in smaller-grain computation:
SIMD computation is considered to be fine grain.
MIMD data parallelism is usually considered to be medium grain.
Parallelism at the task level is considered to be coarse-grained parallelism.
Generally, increasing the grain size increases the performance of MIMD programs.

Examples of Parallel Models

Synchronous PRAM (Parallel RAM)
Each processor stores a copy of the program.
All execute instructions simultaneously on different data.
Generally, the active processors execute the same instruction.
The full power of the PRAM allows processors to execute different instructions simultaneously.
Here, simultaneous means they execute using the same clock.
All processors have access to the same (shared) memory.
Data exchanges occur through the shared memory.
All processors have I/O capabilities.
See Akl, Figure 1.2.

Examples of Parallel Models (cont.)

Binary Tree of Processors (Akl, Fig. 1.3)
No shared memory; memory is distributed equally among the PEs.
PEs communicate through the binary tree network.
Each PE stores a copy of the program.
PEs can execute different instructions.
Normally synchronous, but asynchronous operation is also possible.
Only the leaf and root PEs have I/O capabilities.

Examples of Parallel Models (cont.)

Symmetric Multiprocessors (SMPs)
Asynchronous processors of the same size.
Shared memory, with memory locations equally distant from each PE in smaller machines.
Formerly called tightly coupled multiprocessors.

Analysis of Parallel Algorithms

Elementary Steps:
A computational step is a basic arithmetic or logic operation.
A routing step is used to move data of constant size between PEs, using either shared memory or a link connecting the two PEs.

Communication Time Analysis

Worst Case
Usually the default for synchronous computation.
Almost impossible to use for asynchronous computation when executing multiple tasks.
Can be used for asynchronous SPMD programming when simulating synchronous computation.
Average Case
Used occasionally in synchronous computation, in the same manner as in sequential computation.
The usual default for asynchronous computation.
It is important to distinguish which is being used:
Often worst-case synchronous timings are compared to average-case asynchronous timings.
The default in Akl's text is worst case.
The default in Quinn's 2004 text is average case.

Shared Memory Communication Time

Memory accesses: a communication requires two memory accesses.
P(i) writes the datum to a memory cell.
P(j) reads the datum from that memory cell.
Uniform Analysis
Charge constant time for a memory access.
Unrealistic, as shown for the PRAM in Chapter 2 of Akl's book.
Non-Uniform Analysis
Charge O(lg M) cost for a memory access, where M is the number of memory cells.
More realistic, based on how memory is accessed.
It is traditional to use the uniform analysis:
Considering a second variable M makes the analysis complex.
Most non-constant-time algorithms have the same complexity using either the uniform or the non-uniform analysis.

Communication through Links

Assumes constant time to communicate along one link.
An earlier paper (mid-1990s?) estimates a communication step along a link to be on the order of 10-100 times slower than a computational step.
When two processors are not connected, routing a datum between the two processors requires O(m) time units, where m is the number of links along the path used.

Running Time
Running time for an algorithm:
The time between when the first processor starts executing and when the last processor terminates.
Based on worst case, unless stated otherwise.
Running time bounds for a problem:
An upper bound is given by the number of steps of the fastest algorithm known for the problem.
The lower bound gives the fewest number of steps possible to solve a worst case of the problem.
Time-optimal algorithm:
An algorithm in which the number of steps has the same complexity as the lower bound for the problem.

Speedup
A measure of the decrease in running time due
to parallelism
Let t1 denote the running time of the fastest
(possible/known) sequential algorithm for the
problem.
If the parallel algorithm uses p PEs, let tp denote
its running time.
The speedup of the parallel algorithm using p
processors is defined by S(1,p) = t1/tp .
The goal is to obtain the largest speedup possible (a small worked example follows below).
For worst (average) case speedup, both t1 and tp should be worst (average) case, respectively.
There is a tendency to ignore these details for asynchronous computing involving multiple concurrent tasks.
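
A quick worked example with made-up timings, just to fix the notation: if the fastest sequential algorithm takes t1 = 1000 seconds and the parallel algorithm with p = 10 processors takes tp = 125 seconds, then S(1,10) = t1/tp = 1000/125 = 8, which is less than p = 10.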

Optimal Speedup
Theorem: The maximum possible speedup for parallel
computers with n PEs for traditional problems is n.
Proof (for traditional problems):
Assume a computation is partitioned perfectly into n
processes of equal duration.
Assume no overhead is incurred as a result of this partitioning of the computation (e.g., the partitioning process, information passing, coordination of processes, etc.).
Under these ideal conditions, the parallel computation will execute n times faster than the sequential computation.
The parallel running time is ts/n.
Then the parallel speedup of this computation is
S(n,1) = ts/(ts/n) = n

Linear Speedup
The preceding slide argues that
S(n,1) ≤ n
and optimally S(n,1) = n.
This speedup is called linear since S(n,1) = Θ(n).
The next slide formalizes this statement and provides an alternate proof.

Speedup Folklore Theorem


Statement: For a given problem, the maximum speedup possible for a parallel algorithm using p processors is at most p. That is, S(1,p) ≤ p.
An alternate proof (for traditional problems):
Let t1 and tp be as above, and assume t1/tp > p.
This parallel algorithm can be simulated by a sequential computer which executes each parallel step with p serial steps.
The sequential simulation runs in p·tp time, which is faster than the fastest sequential algorithm, since p·tp < t1.
This contradiction establishes the theorem.
Note: We will see that this result is not valid for some non-traditional problems.

Speedup Usually Suboptimal

Unfortunately, the best speedup possible for most applications is much less than n, as the assumptions in the proof are usually invalid:
Usually some portions of the algorithm are sequential, with only one PE active.
Other portions of the algorithm may allow a non-constant number of PEs to be idle for more than a constant number of steps.
E.g., during parts of the execution, many PEs may be waiting to receive or to send data.

Speedup Example
Example: Compute the sum of n numbers using a binary tree with n leaves (a code sketch of this pattern follows below).
It takes tp = lg n steps to compute the sum in the root node.
It takes t1 = n − 1 steps to compute the sum sequentially.
The number of processors is p = 2n − 1, so for n > 1,
S(1,p) = n/(lg n) < n < 2n − 1 = p
The solution is not optimal.
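
A sketch of the summation pattern (hypothetical C code that simulates the lg n parallel rounds one after another; on the tree machine, each round's additions would happen simultaneously):

/* Simulate the binary-tree sum of n = 8 values: in each round, element i
   absorbs the value held by element i + stride; after lg n rounds the
   total sits in a[0] (the "root"). */
#include <stdio.h>

int main(void) {
    double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int n = 8;
    for (int stride = 1; stride < n; stride *= 2)    /* lg n = 3 rounds */
        for (int i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];                   /* parallel on the tree */
    printf("sum = %.1f\n", a[0]);                    /* expect 36.0 */
    return 0;
}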

Optimal Speedup Example

Example: Computing the sum of n numbers optimally using a binary tree.
Assume the binary tree has n/(lg n) leaves.
Each leaf is given lg n numbers and computes the sum of these numbers in the first step.
The overall sum of the values in the n/(lg n) leaves is found in the next step, as in the above example, in lg(n/lg n) or Θ(lg n) time.
Then tp = Θ(lg n) and S(1,p) = Θ(n/lg n) = cp for some constant c, since p = 2(n/lg n) − 1.
This parallel algorithm and speedup are optimal, as the Folklore Speedup Theorem is valid for traditional problems like this one.

Superlinear Speedup
Superlinear speedup occurs when S(n) > n.
Most texts besides Akl's and Quinn's argue that linear speedup is the maximum speedup obtainable.
The preceding optimal speedup proof is used to argue that superlinearity is impossible, since linear speedup is optimal.
Those not accepting superlinearity claim that any superlinear behavior can be explained using the following types of reasons:
the extra memory in the parallel system;
a sub-optimal sequential algorithm was used;
random luck, e.g., in the case of an algorithm that has a random aspect in its design (e.g., random selection).

Superlinearity (cont.)
Some problems cannot be solved without the use of parallel computation.
It seems reasonable to consider parallel solutions to these to be superlinear.
Examples include many large software systems:
Data too large to fit in the memory of a sequential computer.
Problems containing deadlines that must be met, where the required computation is too great for a sequential computer to meet the deadlines.
Those who do not accept superlinearity would discount these, as indicated on the previous slide.

Some Non-Traditional Problems

Problems with any of the following characteristics are non-traditional:
Input may arrive in real time.
Input may vary over time.
For example, find the convex hull of a set of points, where points can be added or removed over time.
Output may affect future input.
Output may have to meet certain deadlines.
See the discussion on pages 16-17 of Akl's text.
See Chapter 12 of Akl's text for additional discussion.

Other Superlinear Examples

Selim Akl has given many different examples of different types of superlinear algorithms:
Example 1.17 in Akl's online textbook (p. 17).
See the last chapter (Ch. 12) of his online text.
A number of conference and journal publications.

A Superlinearity Example
Example: Tracking
6,000 streams of radar data arrive during each 0.5 second, and each corresponds to an aircraft.
Each must be checked against the projected position box for each of 4,000 tracked aircraft.
The position of each plane that is matched is updated to correspond with its radar location.
This process must be repeated every 0.5 seconds.
A sequential processor cannot process data this fast.

Slowdown Folklore Theorem

If a computation can be performed with p processors in time tp, and with q processors in time tq, where q < p, then
tp ≤ tq ≤ (1 + p/q) tp
Comments:
This result puts a tight bound on the amount of extra time required to execute an algorithm with fewer PEs.
It establishes a smooth degradation of running time as the number of processors decreases.
For q = 1, this is essentially the speedup theorem, but with a less sharp constant.

Proof (for traditional problems):

Let Wi denote the number of elementary steps performed collectively during the i-th time period of the algorithm (using p PEs).
Then W = W1 + W2 + ... + Wk (where k = tp) is the total number of elementary steps performed collectively by the processors.
Since all p processors are not necessarily busy all the time, W ≤ p·tp.
With q processors, q < p, we can simulate the Wi elementary steps of the i-th time period by having each of the q processors do ⌈Wi/q⌉ elementary steps.
The time taken for the simulation is
tq ≤ ⌈W1/q⌉ + ⌈W2/q⌉ + ... + ⌈Wk/q⌉
   ≤ (W1/q + 1) + (W2/q + 1) + ... + (Wk/q + 1)
   ≤ tp + p·tp/q

A Non-Traditional Example for the Folklore Theorems

Problem statement (from Akl, Example 1.17):
n sensors are each used to repeat the same set {s1, s2, ..., sn} of measurements n times at an environmental site.
It takes each sensor n time units to make each measurement.
Each si is measured in turn by another sensor, producing n distinct cyclic permutations of the sequence S = (s1, s2, ..., sn).

Non-Traditional Example (cont.)

The sequences produced by 3 sensors are
(s1, s2, s3), (s3, s1, s2), (s2, s3, s1)
The stream of data from each sensor is sent to a different processor, with all corresponding si values in each stream arriving at the same time.
Each si must be read and stored in unit time, or the stream is terminated.
Additionally, the minimum of the n new data values that arrive during each n steps must be computed, validated, and stored.

Sequential Processor Solution

A single processor can only monitor the values in one data stream.
After the first value of a selected stream is stored, it is too late to process the other n − 1 items that arrived at the same time in the other streams.
A stream remains active only if its preceding value was read.
Note that this is the best we can do to solve the problem sequentially.
We assume that a processor can read a value, compare it to the smallest previous value received, and update its smallest value in one unit of time.
Sequentially, the computation takes n computational steps and n(n − 1) time units of delay between arrivals, or exactly n² time units to process all n streams.
Summary: Calculating the first minimum sequentially takes n² time units; a small worked instance is given below.
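
For instance, taking n = 100 streams (a made-up value): 100 computational steps plus 100 × 99 = 9,900 time units of waiting between arrivals gives exactly 100² = 10,000 time units before the first minimum is known.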

Binary Tree Solution Using n Leaves

Assume a complete binary tree with n leaf processors and a total of 2n − 1 processors.
Each leaf receives the input from a different stream.
Inputs from each stream are sent up the tree.
Each interior PE receives two inputs, computes the minimum, and sends the result to its parent.
It takes lg n time to compute the minimum and to store the answer at the root.
The running time of the parallel solution for the first round (i.e., the first minimum) is lg n, using 2n − 1 = Θ(n) processors.

Counterexample to Speedup
Folklore Theorem
A sequential solution to the previous example required n² steps.
The parallel solution took lg n steps using a binary tree with p = 2n − 1 = Θ(n) processors.
The speedup is S(1,p) = t1/tp = n²/lg n, which is asymptotically larger than p, the number of processors.
Akl calls this speedup parallel synergy.

Binary Tree Solution Using √n Leaves

Assume a complete binary tree with √n leaf processors and a total of q = 2√n − 1 processors.
The √n leaf processors monitor √n streams whose first √n values are distinct from each other.
This requires the order of data arrival in the streams to be known in advance.
Note that this is the best we can do to solve the original problem with q processors.
Consecutive data in each stream are separated by n time units, so each PE takes (√n − 1)n time in the first phase to see the first √n values in its stream and to compute their minimum.
Each leaf computes the minimum of its first √n values.
The minimum of the leaf minimum values is computed by the binary tree in lg(√n) = Θ(lg n) steps.
The running time of the parallel solution using q = 2√n − 1 PEs for the first round is (√n − 1)n + Θ(lg n).

Counterexample to Slowdown
Folklore Theorem
The parallel solution to the previous example with q = 2√n − 1 = Θ(√n) PEs had tq = (√n − 1)n + Θ(lg n).
The parallel solution using a binary tree with p = 2n − 1 = Θ(n) PEs took tp = lg n.
The Slowdown Folklore Theorem states that tp ≤ tq ≤ (1 + p/q) tp.
Clearly tq/tp is asymptotically greater than p/q, which contradicts the Slowdown Folklore Theorem:
(p/q) tp = [(2n − 1)/(2√n − 1)] lg n = Θ(√n lg n)
tq = (√n − 1)n + Θ(lg n) = Θ(n^(3/2))
Akl also calls this asymptotic speedup parallel synergy.

Cost
The cost of a parallel algorithm is defined by
Cost = (running time) × (number of PEs) = tp × n
Cost allows the performance of parallel algorithms to be compared to that of sequential algorithms.
The cost of a sequential algorithm is its running time.
The advantage that parallel algorithms have in using multiple processors is removed by multiplying their running time by the number n of processors used.
If a parallel algorithm requires exactly 1/n of the running time of a sequential algorithm, then its parallel cost is the same as the sequential running time.

Cost Optimal
A parallel algorithm for a problem is cost-optimal if its cost is proportional to the running time of an optimal sequential algorithm for the same problem.
By proportional, we mean that
cost = tp × n ≤ k·ts
for some constant k.
Equivalently, a parallel algorithm with cost C is cost optimal if there is an optimal sequential algorithm with running time ts and C = Θ(ts).
If no optimal sequential algorithm is known, then the cost of a parallel algorithm is usually compared to the running time of the fastest known sequential algorithm instead.

Cost Optimal Example

Recall the speedup example of a binary tree with n/(lg n) leaf PEs used to compute the sum of n values in Θ(lg n) time.
C(n) = p(n) × t(n) = [2(n/lg n) − 1] × Θ(lg(n/lg n)) = Θ((n/lg n) × lg n) = Θ(n)
An optimal sequential algorithm requires n − 1 = Θ(n) steps.
This shows that this algorithm is cost optimal.

Another Cost Optimal Example

Sorting has a sequential lower bound of Ω(n lg n) steps.
This assumes the algorithm is general purpose and not tailored to a specific machine.
A parallel sorting algorithm using Θ(n) PEs will require at least Ω(lg n) time.
A parallel algorithm requiring n PEs and O(lg n) time is cost optimal.
A parallel algorithm using Θ(lg n) PEs will require at least Ω(n) time.

Work
Defn: The work of a parallel algorithm is the sum of the individual steps executed by all the processors.
While inactive processor time is included in cost, only active processor time is included in work.
Work indicates the actual steps that a sequential computer would have to take in order to simulate the action of the parallel computer.
Observe that the cost is an upper bound for the work.
While this definition of work is fairly standard, a few authors use our definition of cost to define work.

Algorithm Efficiency
Efficiency is defined by
E(1,n) = S(n)/n = ts/(tp × n) = ts/cost
(A worked example is given below.)
Efficiency gives the percentage of full utilization of the parallel processors on the computation.
Note that for traditional problems:
The maximum speedup when using n processors is n.
The maximum efficiency is 1.
For traditional problems, efficiency ≤ 1.
For non-traditional problems, algorithms may have speedup > n and efficiency > 1.
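
As a worked example, consider the earlier (non-cost-optimal) binary-tree sum with n leaves: S(1,p) ≈ n/lg n while p = 2n − 1, so E ≈ (n/lg n)/(2n − 1) ≈ 1/(2 lg n), which tends to 0 as n grows. The cost-optimal variant with n/lg n leaves instead keeps the efficiency bounded away from 0.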

Amdahl's Law
Let f be the fraction of operations in a computation that must be performed sequentially, where 0 ≤ f ≤ 1. The maximum speedup S(n) achievable by a parallel computer with n processors is
S(n) ≤ 1/(f + (1 − f)/n) ≤ 1/f
Note: Amdahl's law holds only for traditional problems.

Proof: If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, then the time to perform the computation with n processors is given by
tp = f·ts + [(1 − f)·ts]/n

Proof of Amdahl's Law (cont.)

Using the preceding expression for tp:
S(n) = ts/tp = ts/(f·ts + (1 − f)·ts/n) = 1/(f + (1 − f)/n)
The last expression is obtained by dividing the numerator and denominator by ts, which establishes Amdahl's law.
Multiplying the numerator and denominator by n produces the following alternate version of this formula:
S(n) = n/(n·f + (1 − f)) = n/(1 + (n − 1)f)

Comments on Amdahl's Law

The preceding proof assumes that speedup cannot be superlinear; i.e.,
S(n) = ts/tp ≤ n
Question: Where was this assumption used???
Amdahl's law is not valid for non-traditional problems, where superlinearity can occur.
Note that S(n) never exceeds 1/f, but approaches 1/f as n increases.
The conclusion of Amdahl's law is sometimes stated as
S(n) ≤ 1/f

Limits on Parallelism
Note that Amdahl's law places a very strict limit on parallelism.
Example: If 5% of the computation is serial, then its maximum speedup is at most 20, no matter how many processors are used, since
S(n) ≤ 1/f = 1/0.05 = 100/5 = 20
(A small program tabulating this bound is sketched below.)
Initially, Amdahl's law was viewed as a fatal flaw to any significant future for parallelism.
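
A small C sketch (hypothetical) that tabulates the Amdahl bound 1/(f + (1 − f)/n) for the 5% serial fraction used in the example above, showing the speedup creeping toward, but never reaching, 1/f = 20:

/* Tabulate Amdahl's bound for serial fraction f = 0.05 and growing n. */
#include <stdio.h>

int main(void) {
    const double f = 0.05;                        /* serial fraction from the example */
    const int n_values[] = {1, 2, 4, 8, 16, 64, 256, 1024};
    for (int i = 0; i < 8; i++) {
        int n = n_values[i];
        double bound = 1.0 / (f + (1.0 - f) / n); /* maximum possible speedup */
        printf("n = %4d   max speedup = %6.2f\n", n, bound);
    }
    printf("limit as n grows: 1/f = %.2f\n", 1.0 / f);
    return 0;
}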

Applications of Amdahl's Law

The law shows that efforts to further reduce the fraction of the code that is sequential may pay off in large performance gains.
It shows that hardware that achieves even a small decrease in the percentage of things executed sequentially may be considerably more efficient.
One can sometimes use Amdahl's law to increase the efficiency of parallel algorithms.
E.g., see Jordan & Alaghband's textbook.

Amdahl's & Gustafson's Laws

A key flaw in past arguments that Amdahl's law is a fatal flaw to the future of parallelism is addressed by
Gustafson's Law: The proportion of the computation that is sequential normally decreases as the problem size increases.
Gustafson's law is a rule of thumb or general principle, but it is not a theorem that has been proved.
Other limitations in applying Amdahl's Law:
Its proof focuses on the steps in a particular algorithm, and does not consider that other algorithms with a higher percentage of parallelism may exist.
Amdahl's law applies only to traditional problems, where superlinearity cannot occur.

End of Chapter One

Introduction and General Concepts
