
The Need For High Performance Computers
Many of today's applications, such as weather prediction, aerodynamics and artificial intelligence, are very computationally intensive and require vast amounts of processing power.

To calculate a 24-hour weather forecast for the UK requires about 10^12 operations to be performed. This would take about 2.7 hours on a Cray-1 (capable of 10^8 operations per second).

So to give accurate long-range forecasts (e.g. a week ahead), much more powerful computers are needed.
One way of achieving this is to use faster electronic components. The limiting factor, however, is the speed of light.

The speed of light is 3 * 10^8 m/s. Consider two electronic devices, each capable of performing 10^12 operations per second, placed 0.5 mm apart. It takes longer for a signal to travel between them than it takes either of them to process it (10^-12 seconds), so producing ever faster components is ultimately a dead end.
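Making the arithmetic explicit (using the figures above):

    signal travel time = distance / speed of light
                       = ( 0.5 * 10^-3 m ) / ( 3 * 10^8 m/s )
                       ≈ 1.7 * 10^-12 s

which is longer than the 10^-12 s either device needs to perform one operation.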
So it appears that the only way forward is to use PARALLELISM. The idea here is that if
several operations can be performed simultaneously then the total computation time is
reduced.

For example, if a computation can be split into three parts executed on three processors simultaneously, the parallel version has the potential of being 3 times as fast as the sequential machine.
Classification of Parallel Machines
Models of Computation ( Flynn 1966 )
Any computer, whether sequential or parallel, operates by executing instructions on data:

• a stream of instructions (the algorithm) tells the computer what to do;
• a stream of data (the input) is affected by these instructions.

Depending on whether there is one or several of these streams, we have four classes of
computers. There is also a discussion of an additional 'pseudo-machine' SPMD.

1. Single Instruction Stream, Single Data Stream : SISD.


2. Multiple Instruction Stream, Single Data Stream : MISD.
3. Single Instruction Stream, Multiple Data Stream : SIMD.
4. Multiple Instruction Stream, Multiple Data Stream : MIMD.
5. Single Program Multiple Data: SPMD.

SISD Computers
This is the standard sequential computer.
A single processing unit receives a single stream of instructions that operate on a single
stream of data.

To compute the sum of N numbers a1, a2, ..., aN the processor needs to gain access to memory N consecutive times (receiving one number each time). Also, N-1 additions are executed in sequence, so the computation takes O(N) time.
i.e. algorithms for SISD computers do not contain any parallelism: there is only one processor.

MISD Computers
N processors, each with its own control unit, share a common memory.
There are N streams of instructions (algorithms / programs) and one stream of data.
Parallelism is achieved by letting the processors do different things at the same time on
the same datum.

MISD machines are useful in computations where the same input is to be subjected to
several different operations.

Example: checking whether a number Z is prime.

A simple solution is to try all possible divisors of Z. Assume the number of processors is N = Z-2. All processors take Z as input and each tries to divide it by its associated divisor, so it is possible to check in one step whether Z is prime. More realistically, if N < Z-2, a subset of divisors would be assigned to each processor.
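As an illustration only (no commercial MISD machine exists), the divisor-splitting idea can be simulated with a Python thread pool; the worker count and chunking scheme here are our own assumptions:

    from concurrent.futures import ThreadPoolExecutor

    def has_divisor(z, divisors):
        # Each simulated "processor" tries its own subset of divisors on the same datum z.
        return any(z % d == 0 for d in divisors)

    def is_prime(z, n_workers=4):
        if z < 2:
            return False
        candidates = range(2, z)                       # the divisors 2 .. Z-1
        chunks = [candidates[i::n_workers] for i in range(n_workers)]
        with ThreadPoolExecutor(n_workers) as pool:
            results = pool.map(has_divisor, [z] * n_workers, chunks)
        return not any(results)

    print(is_prime(97))    # True
    print(is_prime(91))    # False (7 * 13)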
For most applications MISD machines are very awkward to use, and no commercial machines exist with this design.

SIMD Computers
All N identical processors operate under the control of a single instruction stream issued
by a central control unit. ( to ease understanding assume that each processor holds the
same identical program. )

There are N data streams, one per processor so different data can be used in each
processor.
The processors operate synchronously and a global clock is used to ensure lockstep
operation.
i.e. at each step (global clock tick) all processors execute the same instruction, each on a different datum. [ SPMD operates asynchronously by running the same program on different data using an MIMD machine. ]
Array processors such as the ICL DAP (Distributed Array Processor) and pipelined
vector computers such as the CRAY 1 & 2 and CYBER 205 fit into the SIMD category.

SIMD machines are particularly useful to solve problems which have a regular structure.
i.e. the same instruction can be applied to subsets of the data.

Adding two matrices A + B = C.


Say we have two matrices A and B of order 2 and we have 4 processors.
A11 + B11 = C11 ... A12 + B12 = C12
A21 + B21 = C21 ... A22 + B22 = C22
The same instruction is issued to all 4 processors ( add the two numbers ) and all
processors execute the instructions simultaneously.
It takes one step as opposed to four steps on a sequential machine.
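A hedged sketch of the same idea in Python: NumPy's elementwise operations are a convenient stand-in for SIMD execution, since one "instruction" is applied across all the data at once (on real SIMD hardware each of the 4 additions would be done by its own processor):

    import numpy as np

    A = np.array([[1, 2], [3, 4]])
    B = np.array([[10, 20], [30, 40]])

    # One instruction (elementwise add) applied to all data elements at once.
    C = A + B
    print(C)    # [[11 22]
                #  [33 44]]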
An instruction could be a simple one (eg adding two numbers) or a complex one (eg
merging two lists of numbers). Similarly the datum may be simple (one number) or
complex (several numbers).
Sometimes it may be necessary to have only a subset of the processors execute an
instruction i.e. only some data needs to be operated on for that instruction.
This information can be encoded in the instruction itself indicating whether

1. the processor is active ( execute the instruction )


2. the processor is inactive ( wait for the next instruction )
In most problems to be solved on SIMD (and MIMD) computers, it is useful for the processors to be able to communicate with each other to exchange data or results.
This can be done in two ways: using a shared memory and shared variables, or some form of interconnection network and message passing (distributed memory).

MIMD Computers (multiprocessors / multicomputers)


This is the most general and most powerful class in the classification.
We have N processors, N streams of instructions and N streams of data.

Each processor operates under the control of an instruction stream issued by its own control unit (i.e. each processor is capable of executing its own program on different data).
This means that the processors operate asynchronously ( typically ) i.e. can be doing
different things on different data at the same time. As with SIMD computers
communication of data or results between processors can be via a shared memory or
interconnection network.

MIMD computers with shared memory are known as multiprocessors or tightly coupled machines. Examples are the Encore Multimax and the Sequent Balance.

MIMD computers with an interconnection network are known as multicomputers or


loosely coupled machines. Examples are INTEL iPSC, NCUBE/7 and transputer
networks.

Note
Multicomputers are sometimes referred to as distributed systems. This is INCORRECT. 'Distributed systems' should refer, for example, to a network of personal workstations (such as Suns): even though the number of processing units can be quite large, communication in such systems is currently too slow to allow close cooperation on one job.

SPMD - Single Program Multiple Data


The SIMD paradigm is an example of synchronous parallelism: each processor operates in lockstep.
A related asynchronous version is SPMD, where the same program is run on the processors of an MIMD machine.

SPMD is not a hardware paradigm, it is the software equivalent of SIMD.

Because an entire program is executed on separate data, it is possible that different


branches are taken, leading to asynchronous parallelism. The processors no longer do the
same thing ( or nothing ) in lockstep, they are busy executing different instructions within
the same program.

Consider

    IF X = 0 THEN S1 ELSE S2        ( assume X = 0 on P1 and X != 0 on P2 )
Now P1 executes S1 at the same time P2 executes S2 ( which could not happen on an
SIMD machine )
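A minimal SPMD sketch using Python's standard multiprocessing module (real SPMD codes would more likely branch on an MPI rank; the process "ranks" and data values here are our own illustration):

    from multiprocessing import Process

    def program(rank, x):
        # The SAME program runs on every processor; the data decides the branch.
        if x == 0:
            print(f"P{rank}: executing S1")     # taken where X = 0
        else:
            print(f"P{rank}: executing S2")     # taken where X != 0

    if __name__ == "__main__":
        data = {1: 0, 2: 7}                     # X = 0 on P1, X != 0 on P2
        procs = [Process(target=program, args=(rank, x)) for rank, x in data.items()]
        for p in procs: p.start()
        for p in procs: p.join()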

Fundamentals of Interprocessor
Communication
This section gives an overview of the two forms of interprocessor communication; further details can be found in the next two sections, Shared Memory and Message Passing, and Interconnection Networks.

Where there are N processors, each with its own individual data stream (i.e. SIMD and MIMD), it is usually necessary to communicate data / results between processors. This can be done in two main ways.

1. Using a SHARED MEMORY and SHARED VARIABLES


This consists of a global address space which is accessible by all N processors. A processor can communicate with another by writing into the global memory, where the second processor can read it.

Shared memory solves the interprocessor communication problem but introduces the
problem of simultaneous accessing of the same location in the memory.
Consider two processors P1 and P2 executing x=x+1 and x=x+2 respectively, where x is a shared variable (initially 0) accessible by both. Depending on certain factors, the final result is x=1, x=2 or x=3.

1. If P1 executes and completes x=x+1 before P2 reads the value of x from memory, then x=3; similarly, if P2 executes and completes x=x+2 before P1 reads the value of x, then x=3.
2. If P1 and P2 both read the value of x before either has updated it, then the processor which finishes last determines the value of x: if P1 finishes last, x=1; if P2 finishes last, x=2.
In a multiuser, real-time environment the processor which finishes last would vary from run to run, so the final value would vary. Also, even if they finish at the same time, only one value of x can be stored in the location for x. This gives rise to NON-DETERMINACY: a parallel program with the same input data yields different results on different runs.

Non-determinacy is caused by race conditions.

A race occurs when two statements in concurrent tasks access the same memory location, at least one of the accesses is a write, and there is no guaranteed execution ordering between them.
The problem of non-determinacy is solved by synchronising the use of shared data: if x=x+1 and x=x+2 were mutually exclusive statements, i.e. could not be executed at the same time, then x=3 always.
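The race and its cure can be demonstrated in Python with threads; the iteration count is an arbitrary choice, and the lost updates may take a few runs to show up:

    import threading

    x = 0
    lock = threading.Lock()

    def add(n, use_lock):
        global x
        for _ in range(100_000):
            if use_lock:
                with lock:          # x = x + n is now mutually exclusive
                    x = x + n
            else:
                x = x + n           # unsynchronised read-modify-write: a race

    for use_lock in (False, True):
        x = 0
        threads = [threading.Thread(target=add, args=(n, use_lock)) for n in (1, 2)]
        for t in threads: t.start()
        for t in threads: t.join()
        print("with lock:" if use_lock else "no lock:  ", x)

Without the lock the result can vary from run to run (non-determinacy); with the lock it is always 300000.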

Shared memory computers (e.g. Sequent, Encore) are often implemented by incorporating a fast bus to connect the processors to the memory.
However, the bus has a finite bandwidth, i.e. a finite amount of data it can carry at any instant, so as the number of processors increases, contention for the bus becomes a problem. It is therefore only feasible to allow P processors to access P memory locations simultaneously for relatively small P (< 30).

Comparison Table
----------------
    Distributed Memory              Shared Memory
    ------------------------------------------------------------
    Large number of processors      Modest number of processors
    (100's - 1000's)                (10's - 100's)
    High theoretical power          Modest power
    Unlimited expansion             Limited expansion
    Difficult to fully utilise      Easy (?) to fully utilise
    Revolutionary parallel          Evolutionary parallel
    programming                     programming


2. INTERCONNECTION NETWORK and MESSAGE PASSING


Here each processor has its own private (local) memory and there is no global, shared memory. The processors therefore need to be connected in some way to allow communication of data to take place.
If a processor requires data contained in another processor, then the data must be explicitly passed using the instructions supplied for communication, e.g. send and receive functions.

    P1                    P2
    receive (x, P2)       send (x, P1)

The value of x is explicitly passed from P2 to P1. This is known as message passing.
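A sketch of the same exchange using Python's standard multiprocessing module (the value 42 and the process names are our own illustration):

    from multiprocessing import Process, Pipe

    def p2(conn):
        x = 42
        conn.send(x)               # send (x, P1)

    def p1(conn):
        x = conn.recv()            # receive (x, P2)
        print("P1 received", x)

    if __name__ == "__main__":
        end1, end2 = Pipe()        # one channel, one connection object per end
        procs = [Process(target=p1, args=(end1,)), Process(target=p2, args=(end2,))]
        for p in procs: p.start()
        for p in procs: p.join()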

In addition to the extreme cases of shared memory and distributed memory, there are possibilities for hybrid designs that combine features of both, e.g. clusters of processors, where a high-speed bus serves for intracluster communication and an interconnection network is used for intercluster communication.

Summing m numbers using (a) shared memory, (b) distributed memory

Example: summing m numbers.

On a sequential computer we have

    sum := A0
    FOR i = 1 TO m-1
        sum := sum + Ai
    ENDFOR

i.e. [ (A0 + A1) + A2] + A3 ... etc


Giving Theta(m) time
Is addition inherently sequential ?

If we have N processors then each processor can calculate the sum of m / N numbers and
then the sum of these partial sums will give the final sum.

Using Shared Memory


The m numbers are held in the global shared memory.

    Global_sum := 0
    FOR all Pi WHERE 0 <= i <= N-1
        Local_sum := 0
        calculate local sum of m/N numbers
        LOCK
            Global_sum := Global_sum + Local_sum
        UNLOCK
    ENDFOR
Since global_sum is a shared variable each processor must have mutually exclusive
access to it - otherwise the final answer will be incorrect.

Running time (algorithm time complexity) is now

    Theta( m/N + N ) + S

where m/N comes from adding m/N numbers in parallel, N comes from adding the N partial sums in sequence, and S is any time required for synchronization.
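A hedged Python rendering of this algorithm (threads stand in for the processors; the chunking of the m numbers is our own choice):

    import threading

    def parallel_sum(A, N):
        global_sum = 0
        lock = threading.Lock()

        def worker(chunk):
            nonlocal global_sum
            local_sum = sum(chunk)        # local sum of m/N numbers
            with lock:                    # LOCK .. UNLOCK around the shared update
                global_sum += local_sum

        m = len(A)
        threads = [threading.Thread(target=worker,
                                    args=(A[i * m // N:(i + 1) * m // N],))
                   for i in range(N)]
        for t in threads: t.start()
        for t in threads: t.join()
        return global_sum

    print(parallel_sum(list(range(1, 101)), N=4))    # 5050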

Distributed Memory
Say we have a "square mesh" of processors.
Each processor finds the local sum of its m / N numbers.
Then each processor passes its local sum to another processor (at the correct time ) until
finally the global_sum is contained in processor P11

    FORALL Pij                               - all processors active
        find local_sum of its m/N numbers
    ENDFOR

    1. FORALL Pij, j = sqrt(N) DOWNTO 2      - all column-j processors active
           Pij passes its local_sum to Pi,j-1
           Pi,j-1 computes local_sum := new_local_sum + old_local_sum
       ENDFOR

    2. FORALL Pi1, i = sqrt(N) DOWNTO 2      - only the column-1 processors active
           Pi1 passes its local_sum to Pi-1,1
           Pi-1,1 computes local_sum := new_local_sum + old_local_sum
       ENDFOR

There are (sqrt(N) - 1) + (sqrt(N) - 1) additions and communications, so the total time complexity is

    Theta( m/N + 2*sqrt(N) - 2 + C )

where C is the time for communication.
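A serial Python simulation of the two reduction phases (the mesh is simulated with a list of lists; the data layout across processors is our own assumption):

    import math

    def mesh_sum(numbers, N):
        w = math.isqrt(N)              # mesh width; assumes N is a perfect square
        m = len(numbers)
        # Each processor P(i,j) first sums its own m/N numbers.
        local = [[sum(numbers[(i * w + j) * m // N:(i * w + j + 1) * m // N])
                  for j in range(w)] for i in range(w)]
        # Step 1: for j = sqrt(N) downto 2, P(i,j) passes its sum to P(i,j-1).
        for j in range(w - 1, 0, -1):
            for i in range(w):
                local[i][j - 1] += local[i][j]
        # Step 2: for i = sqrt(N) downto 2, P(i,1) passes its sum to P(i-1,1).
        for i in range(w - 1, 0, -1):
            local[i - 1][0] += local[i][0]
        return local[0][0]             # the global sum ends up in P11

    print(mesh_sum(list(range(1, 101)), N=16))    # 5050

The simulation performs the passes one after another; on a real mesh, all the processors involved in a given step act simultaneously, so each phase takes sqrt(N) - 1 parallel steps.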

Shared Memory and Message Passing
Interprocessor Communication
Where there are N processors, each with its own individual data stream (i.e. SIMD and MIMD), it is usually necessary to communicate data between processors. This is done in one of two ways:

1. Using SHARED MEMORY and SHARED VARIABLES or


2. via an INTERCONNECTION NETWORK

Shared Memory

Here each processor has access to any variable residing in the shared memory. So if processor x wishes to pass a number to processor y, it must be done in two steps: processor x writes the number into the shared memory at a given location, from which processor y then reads it.
During the execution of a parallel algorithm, the N processors can access shared memory
for reading / writing data and / or results.

All processors can gain access to the shared memory simultaneously if the memory
locations they are trying to read from or write into are different.
However we can get problems when two or more processors require access to the same
memory location simultaneously.

Let x be a variable that can be accessed by two processors P1 and P2


Now consider the assignment x := x + 1, which is normally done in three stages.

1. copy value of x into some register


2. add 1 to value on register
3. store value on register at the address for x

Say now that P1 and P2 both execute such an assignment, assume x=0 initially

What is the final result ?

    P1 copies the value of x (= 0) into its register
    P2 copies the value of x (= 0) into its register

    P1 adds 1 to its register
    P2 adds 1 to its register                             --- these two at the same time

    P1 stores its register value at x (= 1)
    P2 stores its register value at x (= 1)               --- these two at the same time

Giving a result of 1 rather than 2, because P2 read the value of x (= 0) before P1 had updated it.

Therefore depending on whether 2 or more processors can gain access to the same
memory location simultaneously, we have 4 subclasses of shared memory computers :-

1. Exclusive Read, Exclusive Write (EREW) SM Computers


Access to memory locations is exclusive i.e. no 2 processors are allowed to
simultaneously read from or write into the same location.

2. Concurrent Read, Exclusive Write (CREW) SM Computers


Multiple processors are allowed to read from the same location, but write access is still exclusive, i.e. no 2 processors are allowed to write into the same location simultaneously.
3. Exclusive Read, Concurrent Write (ERCW) SM Computers
Multiple processors are allowed to write into the same memory location but read access
remains exclusive.

4. Concurrent Read, Concurrent Write (CRCW) SM Computers


Both multiple read and multiple write privileges are allowed.

NOTES
Allowing concurrent read access to the same address should pose no problems (except perhaps to the result of a calculation, as in the example above).

Conceptually, each of the several processors reading from that location makes a copy of its contents and stores it in its own local register.
Problems arise however, with concurrent write access.
If several processors are trying to simultaneously store ( potentially different ) data at the
same address, which of them should succeed ?
i.e. we need a deterministic way of specifying the contents of a memory location after a
concurrent write operation.

Some ways of resolving write conflicts include :-


A ) Assign priorities to the processors and accept value from highest priority processor

B ) All the processors are allowed to write, provided that the quantities they are
attempting to store are equal, otherwise access is denied to ALL processors.

C ) The max / min / sum / average of the value is stored (numeric data ).

Despite this, it is only feasible to allow P processors to access P memory locations simultaneously for relatively small P (< 30), usually because the cost of the communication hardware is too high, and because of the physical size of the device used for a memory location.

Generally SIMD machines, because they can use very simple processors (since the processors have no control units of their own), typically need large numbers of processors (> 1000) to achieve high performance.

So shared memory SIMD machines are unrealistic and no commercial machines exist
with this design.

However, in MIMD machines, which use much more powerful processors, shared memory systems are in existence with small numbers of processors (2 - 30).

To illustrate the theoretical potential of the four different subclasses of shared memory
consider the following example.
We have N processors to search a list S = { L1, L2, ..., Ln } for the index of a given element x, where 1 < N <= n. Assume x may appear several times in the list; any index will do.

ALGORITHM:

Procedure SM_search (S, x, k)

    Step 1: FOR i = 1 TO N DO in parallel
                read x
            ENDFOR

    Step 2: FOR i = 1 TO N DO in parallel
                Si := { L((i-1)n/N + 1), ..., L(in/N) }
                perform a sequential search on sublist Si
                ( returning Ki = -1 if x is not in Si, or the index of x if it is )
            ENDFOR

    Step 3: FOR i = 1 TO N DO in parallel
                IF Ki > 0 THEN k := Ki ENDIF
            ENDFOR
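A Python sketch of the procedure (threads stand in for the processors; the sample list is our own, and the unsynchronised writes to k mirror the concurrent-write step):

    import threading

    def sm_search(S, x, N):
        k = -1                               # the shared result variable
        n = len(S)

        def worker(i):
            nonlocal k
            lo, hi = i * n // N, (i + 1) * n // N
            for idx in range(lo, hi):        # sequential search of sublist Si
                if S[idx] == x:
                    k = idx + 1              # 1-based index; any occurrence will do
                    return

        threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
        for t in threads: t.start()
        for t in threads: t.join()
        return k

    print(sm_search([5, 8, 1, 9, 1, 7, 3, 2], x=1, N=4))    # 3 or 5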

Assuming the sequential search procedure takes O(n/N) time in the worst case, what is the time complexity of running this algorithm on the four subclasses of shared memory machine?

EREW : Step 1 takes O(N) time (N exclusive reads of x, one at a time)
       Step 2 takes O(n/N) time (reading the sublist & sequential search)
       Step 3 takes O(N) time (N exclusive writes)
       i.e. overall O(N) + O(n/N) time

ERCW : Step 1 takes O(N) time
       Step 2 takes O(n/N) time
       Step 3 takes constant time (with an appropriate conflict resolution rule)
       i.e. overall O(N) + O(n/N)

CREW : Step 1 takes constant time (concurrent read of x)
       Step 2 takes O(n/N) time
       Step 3 takes O(N) time
       i.e. overall O(N) + O(n/N)

CRCW : Step 1 takes constant time
       Step 2 takes O(n/N) time
       Step 3 takes constant time
       i.e. overall O(n/N)

Interconnection Networks
Introduction
We have seen that one way for processors to communicate data is to use a shared
memory and shared variables. However this is unrealistic for large numbers of
processors. A more realistic assumption is that each processor has its own private
memory and data communication takes place using message passing via an
INTERCONNECTION NETWORK.

Interconnection Networks
The interconnection network plays a central role in determining the overall performance
of a multicomputer system. If the network cannot provide adequate performance, for a
particular application, nodes will frequently be forced to wait for data to arrive.
Some of the more important networks include (those not described below should be investigated by the reader):

• Fully connected or all-to-all


• Mesh
• Rings
• Hypercube
• X - Tree
• Shuffle Exchange
• Butterfly
• Cube Connected Cycles

Fully connected or all-to-all


This is the most powerful interconnection network (topology): each node is directly connected to ALL other nodes.

Each node has N-1 connections (N-1 nearest neighbours), giving a total of N(N-1)/2 connections for the network.

Even though this is the best network to have, the high number of connections per node means it can only be implemented for small values of N. Therefore some form of limited interconnection network must be used.

Mesh ( Torus )
In a mesh network the nodes are arranged in a k-dimensional lattice of width w, giving a total of w^k nodes [ usually k=1 (linear array) or k=2 (2D array), e.g. the ICL DAP ].
Communication is allowed only between neighbouring nodes, so all interior nodes are connected to 2k other nodes. (A torus is a mesh with wraparound connections joining the nodes on opposite edges.)
Rings
A simple ring is just a linear array with the end nodes linked; it is equivalent to a 1D mesh with wraparound connections. One drawback of this network is that some data transfers may require up to N/2 links to be traversed (e.g. between two nodes on opposite sides of the ring).

This can be reduced by using a chordal ring: a simple ring with additional cross (chordal) links between nodes on opposite sides.

Hypercube Connection ( Binary n-Cube )


Hypercube networks consist of N = 2^k nodes arranged in a k-dimensional hypercube. The nodes are numbered 0, 1, ..., 2^k - 1, and two nodes are connected if their binary labels differ by exactly one bit.
A k-dimensional hypercube is formed by combining two (k-1)-dimensional hypercubes and connecting their corresponding nodes, i.e. hypercubes are recursive. Each node is connected to k other nodes, i.e. each node has degree k.

The departmental NCUBE is based on this topology, i.e. a 6-dimensional hypercube (64 nodes).
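The labelling rule is easy to express in code; a small Python sketch (the example labels are arbitrary):

    def hypercube_neighbours(node, k):
        # Two nodes are connected iff their labels differ in exactly one bit,
        # so a node's neighbours are found by flipping each of its k bits.
        return [node ^ (1 << bit) for bit in range(k)]

    print(hypercube_neighbours(0b000, 3))    # [1, 2, 4]
    print(hypercube_neighbours(0b101, 3))    # [4, 7, 1]  (each node has degree k)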


Metrics for Interconnection Networks


Metrics provide a framework to compare and evaluate interconnection networks. The
metrics we will use are:

1. Network connectivity
2. Network diameter
3. Narrowness
4. Network expansion increments

For examples of their calculation, refer to your own lecture notes.

Network Connectivity
Network nodes and communication links sometimes fail and must be removed from
service for repair. When components do fail the network should continue to function with
reduced capacity.

Network connectivity measures the resiliency of a network, i.e. its ability to continue operating despite disabled components: connectivity is the minimum number of nodes or links that must fail to partition the network into two or more disjoint networks.

The larger the connectivity, the better the network is able to cope with failures.

Network Diameter
The diameter of a network is the maximum internode distance i.e. it is the maximum
number of links that must be traversed to send a message to any node along a shortest
path.

The lower the diameter of a network the shorter the time to send a message from one
node to the node farthest away from it.
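The diameter of a small network can be computed directly by breadth-first search from every node; a sketch (the 8-node ring and 3-dimensional hypercube are our own test cases):

    from collections import deque

    def diameter(nodes, neighbours):
        # Maximum over all source nodes of the BFS distance to the farthest node.
        best = 0
        for src in nodes:
            dist = {src: 0}
            queue = deque([src])
            while queue:
                u = queue.popleft()
                for v in neighbours(u):
                    if v not in dist:
                        dist[v] = dist[u] + 1
                        queue.append(v)
            best = max(best, max(dist.values()))
        return best

    N = 8
    ring = lambda u: [(u - 1) % N, (u + 1) % N]
    cube = lambda u: [u ^ (1 << b) for b in range(3)]     # 3D hypercube, 8 nodes
    print(diameter(range(N), ring))    # 4  (= N/2)
    print(diameter(range(N), cube))    # 3  (= k, the dimension)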
Narrowness
This is a measure of congestion in a network and is calculated as follows:

Partition the network into two groups of processors, A and B, containing Na and Nb processors respectively, with Nb <= Na. Count the number of interconnections between A and B; call this I. The narrowness of the network is the maximum value of Nb / I over all such partitionings.

The idea is that if the narrowness is high (Nb > I), then when the group B processors want to send messages to group A, congestion in the network will be high (since there are fewer links than processors).

Network Expansion Increments


A network should be expandable i.e. it should be possible to create larger and more
powerful multicomputer systems by simply adding more nodes to the network.

For reasons of cost it is better to have the option of small increments since this allows
you to upgrade your network to the size you require ( i.e. flexibility ) within a particular
budget.

E.g. an 8 node linear array can be expanded in increments of 1 node but a 3 dimensional
hypercube can be expanded only by adding another 3D hypercube. (i.e. 8 nodes)

Parallel Algorithm Construction


Classification of MIMD Algorithms
Parallel algorithms for MIMD machines can be divided into 3 categories, these are
expanded in the next sections.

1. Pipelined Algorithms / Algorithmic Parallelism


2. Partitioned Algorithms / Geometric Parallelism
3. Asynchronous / Relaxed Algorithms

Pipelined Algorithms / Algorithmic Parallelism
A pipelined algorithm is an ordered set of ( possibly different ) processes in which the
output of each process is the input to its successor.
The input to the first process is the input to the algorithm
The output from the last process is the output of the algorithm.

Typically each processor forms part of a pipeline and performs only a small part of the
algorithm. Data then flows through the system ( pipeline ) being operated on by each
processor in succession.

To illustrate pipelining consider the following

Say it takes 3 steps A, B & C to assemble a widget and assume each step takes one unit
of time

Sequential widget assembly machine:


Spends 1 unit of time doing step A followed by 1 unit of time doing step B, followed by 1
unit of time doing step C

So a sequential widget assembler produces 1 widget in 3 time units, 2 in 6 time units etc.
i.e. one widget every 3 units.

Pipelined widget assembly machine


Say we use a 3 segment pipeline where each of the subtasks ( A, B or C) is assigned to a
segment

i.e. the machine is split into 3 smaller machines; one to do step A, one for step B and one
for step C and which can operate simultaneously.

The first machine performs step A on a new widget every time step and passes the
partially assembled widget to the second machine which performs step B. This is then
passed onto the third machine to perform step C

This produces the first widget in 3 time units (as the sequential machine), but after this
initial startup time one widget appears every time step.
i.e. the second widget appears at time 4
the third widget appears at time 5 etc.

In general, if L is the number of steps to be performed, T is the time for each step, and n is the number of items (widgets), then

    Time_sequential = L * T * n
    Time_pipelined  = ( L + n - 1 ) * T

For T = 1, L = 100, n = 10^6:

    Tseq  = 10^8
    Tpipe = 100 + 10^6 - 1 = 10^6 + 99

    increase = Tseq / Tpipe = 10^8 / ( 10^6 + 99 ) ≈ 100

i.e. roughly a 100-fold increase in speed. In general, as n tends to infinity, the speedup tends to L.
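These formulas are one-liners to check in Python (the function names are ours):

    def t_seq(L, T, n):
        return L * T * n                 # Time_sequential = L * T * n

    def t_pipe(L, T, n):
        return (L + n - 1) * T           # Time_pipelined = ( L + n - 1 ) * T

    L, T, n = 100, 1, 10**6
    print(t_seq(L, T, n))                     # 100000000
    print(t_pipe(L, T, n))                    # 1000099
    print(t_seq(L, T, n) / t_pipe(L, T, n))   # 99.99..., approaching L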

Geometric Parallelism / Partitioned Algorithms
These algorithms arise when there is a natural way to decompose the data set into
smaller "chunks" of data, which are then allocated to individual processors.
Thus each processor contains more or less the same code but operates on a subset of the
total data.
The solution to these subproblems are then combined to form the complete solution.
Depending on the algorithm being solved, this combining of solutions usually implies communication and synchronization among the processors.
Synchronization means constraining events to occur in a particular order.

If data needs to be communicated between processors after each iteration of a numerical calculation, then this implies synchronization between the processes. [ We will see a particular example of this in the matrix multiplication algorithm later on. ]

Thus partitioned algorithms are sometimes called synchronous algorithms.

To illustrate the difference between pipelined and partitioned algorithms, consider the following:
Say an algorithm consists of 4 parts A, B, C and D, and is to operate on a data set E consisting of 4 subsets E1, E2, E3 and E4 (e.g. a matrix divided into submatrices).

The pipelined algorithm would consist of 4 processors performing A, B, C or D; the complete data set would then pass through all 4 processors.

However in the partitioned algorithm the four processors all perform A, B, C and D but
only on a subset of the data

i.e. In pipelined algorithms the algorithm is distributed among the processors whereas in
partitioned algorithms the data is distributed among the processors.

Say we want to calculate Fi = cos(sin(e^(xi^2))) for x1, x2, ..., x6 using 4 processors.
Pipelined Version

F1 is produced in 4 time units
F2 is produced at time 5
i.e. time = 4 + (6-1) = 9 units
==> SPEEDUP = 24 / 9 ≈ 2.7
==> EFFICIENCY ≈ 67%

Partitioned Version
This time each processor performs the complete algorithm, i.e. cos(sin(e^(x^2))), but on its own data.

i.e. time = 8 units


==> SPEEDUP = 24 / 8 = 3
==> EFFICIENCY = 75%
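A sketch of the partitioned version with a process pool (the sample x values are our own; note that the pool distributes the data, not the code):

    import math
    from multiprocessing import Pool

    def f(x):
        # The complete algorithm, applied by every processor to its own data.
        return math.cos(math.sin(math.exp(x * x)))

    if __name__ == "__main__":
        xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]      # x1 .. x6
        with Pool(processes=4) as pool:
            results = pool.map(f, xs)
        print(results)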

Asynchronous / Relaxed Parallelism


In relaxed algorithms there is no explicit dependency between processes, as occurs in synchronized algorithms. Instead, relaxed algorithms never wait for input: if a processor is ready, it uses the most recently available data.

To illustrate this consider the following.

Say we have two processors A and B. A produces a sequence of numbers a1, a2, ...; B inputs ai and performs some calculation F which uses ai.
Say that B runs much faster than A.
Synchronous Operation
A produces a1 and passes it to B, which calculates F1;
A produces a2 and passes it to B, which calculates F2;
i.e. B waits for A to finish each ai (since B is faster than A), etc.

Asynchronous Operation
A produces a1 and passes it to B, which calculates F1. But now A is still in the process of computing a2, so instead of waiting, B carries on and calculates F2 (based on the old data a1, and therefore possibly not the same as the F2 above), and continues to calculate F using old data until a new input arrives, e.g. Fnew = Fold + ai.

The idea in using asynchronous algorithms is that all processors are kept busy and never
remain idle (unlike synchronous algorithms ) so speedup is maximized.

A drawback is that they are difficult to analyse (because we do not know what data is being used at each step), and an algorithm that is known to work (e.g. converge) in synchronous mode may not work (e.g. diverge) in asynchronous mode.

[ Examples of calculations that do run successfully as asynchronous algorithms are the Gauss-Seidel and Jacobi methods for solving linear systems of equations. ]

Consider the Newton-Raphson iteration for solving F(x) = 0, where F is some non-linear function:

    Xn+1 = Xn - F(Xn) / F'(Xn) ............... ( 1 )

This generates a sequence of approximations to the root, starting from an initial value X0.
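For reference, the sequential iteration in Python (the test function x^2 - 2 and the tolerance are our own choices):

    def newton(F, dF, x0, eps=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            d = x - F(x) / dF(x)           # equation ( 1 )
            if abs(d - x) <= eps:          # P3's convergence test on |d - a|
                return d
            x = d
        return x

    # F(x) = x^2 - 2, whose positive root is sqrt(2)
    print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))    # 1.41421356...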

Say we have 3 processors

P1 : given x, P1 calculates F(x) in time t1 units and sends it to P3
P2 : given y, P2 calculates F'(y) in time t2 units and sends it to P3
P3 : given a, b, c, P3 calculates d = a - b/c in time t3 units; if |d - a| > epsilon then d is sent to P1 and P2, otherwise d is output.

Serial Mode
P1 computes F(Xn) then P2 computes F'(Xn) then P3 computes Xn+1 using (1)
So time per iteration is t1 + t2 + t3
If k iterations are necessary for convergence then total time is k (t1 + t2 + t3)

Synchronous Parallel Mode.


P1 and P2 compute F(Xn) and F'(Xn) simultaneously and when BOTH have finished the
values F(Xn) and F'(Xn) are used by P3 to compute Xn+1
Time per iteration is max( t1, t2) + t3
Again k iterations will be necessary so total time is k [ max( t1, t2) + t3]
X1 = X0 - F(X0)/F'(X0) ...etc

Asynchronous Parallel Mode


P1 and P2 begin computing as soon as a new input value is made available by P3 and they are ready to receive it.

P3 computes a new value using (1) as soon as EITHER P1 OR P2 provides a new input, i.e. (1) now takes the form

    Xn+1 = Xn - F(Xi) / F'(Xj)

Time per iteration is at most min(t1, t2) + t3 [ but may be as low as t3, e.g. if P1 and P2 complete at consecutive time steps ].

We cannot predict how many iterations will be necessary.

Say t1 = 2, t2 = 3 and t3 = 1.

    TIME   P1 (2 units)   P2 (3 units)   P3 (1 unit)
      2    F(X0)          C0             -
      3    -              F'(X0)         -
      4    -              -              X1 = X0 - F(X0)/F'(X0)
      5    C1             C1             -
      6    F(X1)          C1             -
      7    -              F'(X1)         X2 = X1 - F(X1)/F'(X0)
      8    C2             C2             X3 = X2 - F(X1)/F'(X1)
      9    F(X2)          C2             -
     10    C3             F'(X2)         X4 = X3 - F(X2)/F'(X1)
     11    F(X3)          C3 or C4       X5 = X4 - F(X2)/F'(X2)
     12    C4 or C5       C3 or C4       X6 = X5 - F(X3)/F'(X2)
     13    .              .              .
     14    .              .              .
    etc

    -    indicates the processor is idle
    Ci   indicates the processor is using Xi in its calculation

NOTE
At time 11, P2 has the choice of using X3 to calculate F'(X3), or of using X4 to calculate F'(X4), i.e. omitting X3. Which choice is made should be determined experimentally, to see which gives the best results: we can relax the parameter (use X4) or synchronise with the parameter (use X3).

Factors That Limit Speedup


This section is divided into two parts :

1. Defining Speedup and Efficiency.


2. Factors that limit speedup

Defining Speedup and Efficiency.


A Parallel Algorithm is an algorithm for the execution of a program which involves the
running of two or more processes on two or more processors simultaneously.

Two important measures of the quality of parallel algorithms are speedup and
efficiency :

If Ts is the time taken to run the fastest serial algorithm on one processor and if Tp is the
time taken by a parallel algorithm on N processors then

Speedup = SN = Ts / Tp
and the efficiency of the parallel algorithm is given by
Efficiency = EN = SN / N

If the best known serial algorithm takes 8 seconds i.e. Ts = 8, while a parallel algorithm
takes 2 seconds using 5 processors, then
SN = Ts / Tp = 8 / 2 = 4 and
EN = SN / N = 4 / 5 = 0.8 = 80%
i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors giving an 80%
efficiency.
NOTE
Care should be taken as to exactly what is meant by Ts, the time taken to run the fastest serial algorithm on one processor - which processor? We can use one processor of the parallel computer, or we can use the fastest serial machine available.
The latter is the fairest way to compare parallel algorithms, but it is unrealistic in practice, since most people do not have access to the fastest serial machines, making it impossible to substantiate claims about speedup.
(Researchers also do not like this definition because the speedup is reduced - since Ts is lower!)

A slightly different definition of speedup also exists: the time taken by the parallel algorithm on one processor divided by the time taken by the parallel algorithm on N processors. However, this is misleading, since many parallel algorithms contain extra operations to accommodate the parallelism (e.g. the communication), so Ts is inflated and the speedup exaggerated.

Whichever definition is used, the ideal is linear speedup: a speedup of N using N processors, with an efficiency of 1 (100%). In practice the speedup is reduced from its ideal value of N, and the efficiency is bounded above by 1.

Factors that limit speedup


1. Software Overhead
Even with a completely equivalent algorithm, software overhead arises in the concurrent implementation (e.g. there may be additional index calculations necessitated by the manner in which data are "split up" among processors), i.e. there are generally more lines of code to be executed in the parallel program than in the sequential program.

2. Load Balancing
Speedup is generally limited by the speed of the slowest node. So an important
consideration is to ensure that each node performs the same amount of work. i.e. the
system is load balanced.

3. Communication Overhead
Assuming that communication and calculation cannot be overlapped, then any time spent
communicating the data between processors directly degrades the speedup. (because the
processors are not calculating ).

Because of this, a goal of the parallel algorithm designer should be to make the grain size (the relative amount of work done between synchronizations / communications) as large as possible, while keeping all the processors busy.

The effect of communication on speedup is reduced, in relative terms, as the grain size increases.

Another important parameter when considering communication is the machine-dependent ratio

    tcomm / tcalc

where tcomm is the time to transfer a single word between two nodes, and tcalc is the time to perform a floating point calculation.

4. Amdahl's Law
This states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially, i.e. its serial fraction.
Let S be the amount of time spent (by one processor) on the serial parts of the program, and P the amount of time spent (by one processor) on parts of the program that could be done in parallel, so that

    Tseq = S + P   and   Tpar = S + P/N ............... ( 0 )

where N is the number of processors. Therefore

    Speedup = Tseq / Tpar = ( S + P ) / ( S + P/N ) ............... ( 1 )

Say we have a program containing 100 operations, each taking 1 time unit. If 80 operations can be done in parallel (P = 80) and 20 operations must be done sequentially (S = 20), then using 80 processors:

    Speedup = 100 / ( 20 + 80/80 ) = 100 / 21 < 5

i.e. at most a speedup of 5 is possible, no matter how many processors are available.

If we define the serial fraction F as

    F = S / Tseq   ==>   P = ( 1 - F ) Tseq

then equation ( 1 ) can be written as

    Speedup = 1 / ( F + (1 - F)/N ) ............... ( 2 )

So if F = 0, i.e. no serial part, then the speedup is N (the ideal value).
If F = 1, i.e. completely serial, then the speedup is 1 no matter how many processors are used.
Consider the effect of the serial fraction F on the speedup for N = 10 and N = 1024: if 1% of a parallel program involves serial code (F = 0.01), the maximum speedup that can be attained is about 9 using 10 processors, but only about 91 using 1024 processors.
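Equation ( 2 ) is easy to tabulate; a quick Python check of the figures above:

    def amdahl_speedup(F, N):
        return 1 / (F + (1 - F) / N)       # equation ( 2 )

    print(amdahl_speedup(0.01, 10))        # ~9.2
    print(amdahl_speedup(0.01, 1024))      # ~91.2
    print(amdahl_speedup(0.01, 10**9))     # ~100, i.e. the 1/F limit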

Therefore Amdahl's Law tells us that the serial fraction F places a severe constraint on the speedup as the number of processors increases.

Since most parallel programs contain a certain amount of sequential code, a possible conclusion of Amdahl's Law is that it is not cost effective to build systems with large numbers of processors, because sufficient speedup will never be produced.

However, most of the important applications that need to be parallelised have very small serial fractions (< 0.001).

Scaled Speedup
Say we have a problem to solve which takes Tseq seconds to finish, S seconds on the
parts of the program that must be done serially and P ( = Tseq - S ) seconds on parts of
the program that could have been done in parallel.

i.e. Tseq = S + P, where F = S / Tseq is the serial fraction.

So if we solve the same problem on a parallel machine, i.e. the problem size is fixed, then the speedup can be predicted by Amdahl's Law as

    Speedup = ( S + P ) / ( S + P/N ) = 1 / ( F + (1 - F)/N )

However, we could argue instead that the run time is fixed, i.e. the problem size is not fixed but increases in proportion to N, AND that the serial fraction is not constant but decreases as the problem size increases (S constant while P grows with N).
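The standard completion of this argument is Gustafson's (1988) scaled speedup; a sketch, under the assumption that the parallel run always takes S + P' seconds, where P' is the fixed parallel compute time:

    equivalent serial time = S + N * P'

    Scaled speedup = ( S + N * P' ) / ( S + P' ) = F' + N ( 1 - F' )

where F' = S / ( S + P' ) is the serial fraction measured on the parallel run. This grows almost linearly in N, a far more optimistic prediction than the fixed-size Amdahl formula.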


Amdahl's Law is valid for problems in which the serial fraction F does not vary with the problem size, i.e. as the problem grows, Tseq and S increase together, keeping F = S / Tseq constant.
NOTE
Applications that parallelise well are ones with very small serial fractions.

Using the Serial Fraction in Measuring Performance.


Given the serial fraction F and the number of processors N, we can calculate the predicted speedup using equation ( 2 ), i.e. speedup = 1 / ( F + (1 - F)/N ).

So if we run the program and find the actual speedup from Tseq / Tpar, we can rearrange ( 2 ) to find the actual serial fraction:

    F = ( 1/Speedup - 1/N ) / ( 1 - 1/N )
The value of F is useful because equation ( 0 ), i.e. Tpar = S + P/N, is idealised: it assumes all processors compute for the same amount of time (perfect load balance), and the overhead of communication and synchronization of processors is not included.

Load balancing effects are likely to result in an irregular change in F as N increases.

Say we have 12 pieces of work each taking the same amount of time.
We have perfect load balancing for N = 2,3,4,6 & 12 processors but not for N =
5,7,8,9,10,11

Since a larger load imbalance results in a larger F, problems can be identified that are not apparent from the speedup or efficiency alone.

The overhead of communication & synchronization tends to increase as N increases.

Since increasing the overhead decreases the speedup, the value of F increases smoothly as N increases. A smoothly increasing F is therefore a warning that the grain size is too small.

Consider the following results:

    N    Speedup    Efficiency (%)    F
    2    1.95       97                0.024
    3    2.88       96                0.021
    4    3.76       94                0.021
    8    6.96       87                0.021

Without looking at the serial fraction we cannot tell whether these results are good or not, e.g. why does the efficiency decrease? Since F is almost constant, we can conclude that it is due to the limited parallelism of the program, not to growing overheads.
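Computing the experimentally determined serial fraction is a one-liner; a quick check against the table (the function name is ours):

    def serial_fraction(speedup, N):
        # rearranged equation ( 2 ): F = (1/speedup - 1/N) / (1 - 1/N)
        return (1 / speedup - 1 / N) / (1 - 1 / N)

    print(round(serial_fraction(3.76, 4), 3))    # 0.021, as in the table
    print(round(serial_fraction(6.96, 8), 3))    # 0.021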
