Many of today's applications, such as weather prediction, aerodynamics and artificial intelligence, are very computationally intensive and require vast amounts of processing power.
For example, to give accurate long-range forecasts ( e.g. a week ahead ), much more powerful computers are needed.
One way of achieving this is to use faster electronic components. The limiting factor here, however, is the speed of light, which is 3 * 10^8 m/s.
Consider two electronic devices, each capable of performing 10^12 operations per second, placed 0.5 mm apart. A signal takes about 1.7 * 10^-12 seconds to travel between them, which is longer than the 10^-12 seconds either of them needs to process it. So producing ever faster components is ultimately of limited benefit.
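As a rough check of this argument, here is a small Python sketch using the figures quoted above:

    # Signal travel time over 0.5 mm versus the time for one operation.
    SPEED_OF_LIGHT = 3e8      # m/s
    DISTANCE = 0.5e-3         # 0.5 mm, in metres
    OPS_PER_SECOND = 1e12     # each device performs 10^12 operations per second

    travel_time = DISTANCE / SPEED_OF_LIGHT   # ~1.7e-12 s
    operation_time = 1 / OPS_PER_SECOND       # 1e-12 s
    print(travel_time, operation_time)        # the signal is the slower of the two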
So it appears that the only way forward is to use PARALLELISM. The idea here is that if
several operations can be performed simultaneously then the total computation time is
reduced.
For example, with three processors working simultaneously on independent parts of a problem, the parallel version has the potential of being 3 times as fast as the sequential machine.
Classification of Parallel Machines
Models of Computation ( Flynn 1966 )
Any computer, whether sequential or parallel, operates by executing instructions on data.
A stream of instructions ( the algorithm ) tells the computer what to do.
A stream of data ( the input ) is affected by these instructions.
Depending on whether there is one or several of these streams, we have four classes of
computers. There is also a discussion of an additional 'pseudo-machine' SPMD.
SISD Computers
This is the standard sequential computer.
A single processing unit receives a single stream of instructions that operate on a single
stream of data.
To compute the sum of N numbers a1, a2, ..., aN, the processor needs to gain access to memory N consecutive times ( each time receiving one number ). Also, N-1 additions are executed in sequence. Therefore the computation takes O(N) operations.
i.e. algorithms for SISD computers do not contain any parallelism; there is only one processor.
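A minimal sketch of this sequential ( SISD ) summation, in Python, with placeholder data:

    # One memory access and one addition per step: O(N) in total.
    a = [3, 1, 4, 1, 5, 9, 2, 6]      # the N numbers a1 .. aN (placeholder values)

    total = a[0]
    for value in a[1:]:               # N-1 additions executed in sequence
        total = total + value         # each step fetches one number from memory
    print(total)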
MISD Computers
N processors, each with its own control unit, share a common memory.
There are N streams of instructions (algorithms / programs) and one stream of data.
Parallelism is achieved by letting the processors do different things at the same time on
the same datum.
MISD machines are useful in computations where the same input is to be subjected to
several different operations.
SIMD Computers
All N identical processors operate under the control of a single instruction stream issued
by a central control unit. ( To ease understanding, assume that each processor holds an identical copy of the program. )
There are N data streams, one per processor so different data can be used in each
processor.
The processors operate synchronously and a global clock is used to ensure lockstep
operation.
i.e. at each step (global clock tick) all processors execute the same instruction, each on a
different datum. [ SPMD operates asynchronously by running the same program on
different data using an MIMD machine. ]
Array processors such as the ICL DAP (Distributed Array Processor) and pipelined
vector computers such as the CRAY 1 & 2 and CYBER 205 fit into the SIMD category.
SIMD machines are particularly useful to solve problems which have a regular structure.
i.e. the same instruction can be applied to subsets of the data.
MIMD Computers
Each processor operates under the control of an instruction stream issued by its own control unit ( i.e. each processor is capable of executing its own program on different data ).
This means that the processors operate asynchronously ( typically ) i.e. can be doing
different things on different data at the same time. As with SIMD computers
communication of data or results between processors can be via a shared memory or
interconnection network.
MIMD computers with shared memory are known as multiprocessors or tightly coupled machines; examples are the Encore Multimax and the Sequent Balance. MIMD computers that communicate through an interconnection network are known as multicomputers or loosely coupled machines.
Note
Multicomputers are sometimes referred to as distributed systems. This is INCORRECT.
Distributed systems should, for example, refer to a network of personal workstations
(such as SUN's ) and even though the number of processing units can be quite large the
communication in such systems is currently too slow to allow close operation on one job.
Consider:

    IF X = 0
    THEN S1
    ELSE S2

Assume X = 0 on P1 and X != 0 on P2. Now P1 executes S1 at the same time as P2 executes S2 ( which could not happen on an SIMD machine ).
Fundamentals of Interprocessor Communication
This section aims to give an overview of the two forms of interprocessor communication; further details can be found in the next two sections, Shared Memory and Message Passing, and Interconnection Networks.
Where there are N processors, each with its own individual data stream ( i.e. SIMD and MIMD machines ), it is usually necessary to communicate data / results between processors. This can be done in two main ways: through a shared memory or through message passing over an interconnection network.
Shared memory solves the interprocessor communication problem but introduces the
problem of simultaneous accessing of the same location in the memory.
Consider a shared variable x, initially 0, accessible by both P1 and P2, where P1 executes x = x + 1 and P2 executes x = x + 2. Depending on the order in which the reads and writes occur, the final value may be x = 1, x = 2 or x = 3.
1. If P1 executes and completes x = x + 1 before P2 reads the value of x from memory, then x = 3; similarly, if P2 executes and completes x = x + 2 before P1 reads the value of x from memory, then x = 3.
2. If P1 and P2 both read the value of x before either has updated it, then the processor which finishes last will determine the value of x ( x = 1 or x = 2 ).
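The same race can be sketched in Python using two threads to stand in for P1 and P2 ( the delay is artificial and only serves to expose the interleaving ):

    import threading, time

    x = 0   # shared variable, initially 0

    def update(amount):
        global x
        local = x            # read the shared value of x
        time.sleep(0.01)     # window in which the other thread may also read
        x = local + amount   # write back, possibly overwriting the other update

    p1 = threading.Thread(target=update, args=(1,))   # P1 : x = x + 1
    p2 = threading.Thread(target=update, args=(2,))   # P2 : x = x + 2
    p1.start(); p2.start()
    p1.join(); p2.join()

    # With the delay both threads usually read x = 0, so the result is 1 or 2
    # (whichever writes last); without it one may finish first, giving 3.
    print(x)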
Comparison Table
----------------

    Distributed Memory                   Shared Memory
    ------------------------------------------------------------
    Large number of processors           Modest number of processors
    ( 100's - 1000's )                   ( 10's - 100's )
With distributed memory the processors communicate by exchanging messages. Consider:

    P1 :  receive ( x, P2 )              P2 :  send ( x, P1 )

The value of x is explicitly passed from P2 to P1. This is known as message passing.
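The same send / receive idea can be sketched with Python's multiprocessing module ( the value 42 and the function names are placeholders, not part of the example above ):

    from multiprocessing import Process, Pipe

    def p2_task(conn):
        x = 42            # P2 holds x in its own private memory
        conn.send(x)      # send ( x, P1 )

    def p1_task(conn):
        x = conn.recv()   # receive ( x, P2 ) : blocks until the message arrives
        print("P1 received", x)

    if __name__ == "__main__":
        end1, end2 = Pipe()
        a = Process(target=p1_task, args=(end1,))
        b = Process(target=p2_task, args=(end2,))
        a.start(); b.start()
        a.join(); b.join()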
In addition to the extreme cases of shared memory and distributed memory, there are possibilities for hybrid designs that combine features of both, e.g. clusters of processors, where a high speed bus serves for intracluster communication and an interconnection network is used for intercluster communication.
Example : adding m numbers together using N processors.

Shared Memory
If we have N processors then each processor can calculate the sum of m/N numbers, and the sum of these N partial sums will then give the final sum. The time complexity is

    Theta( m/N + N ) + S

where:
    m/N comes from adding m/N numbers in parallel
    N comes from adding the N partial sums in sequence
    S is any time required for synchronization.
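A sketch of this two-phase sum, written sequentially in Python to show the arithmetic ( the data and the value of N are placeholders ):

    numbers = list(range(1, 101))   # m = 100 numbers
    N = 4                           # number of processors (assumed)
    m = len(numbers)
    chunk = m // N

    # Phase 1: each processor forms its local sum (in parallel on a real machine).
    partial_sums = [sum(numbers[i*chunk:(i+1)*chunk]) for i in range(N)]

    # Phase 2: the N partial sums are added in sequence.
    total = sum(partial_sums)
    print(total)   # 5050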
Distributed Memory
Say we have a "square mesh" of processors.
Each processor finds the local sum of its m / N numbers.
Then each processor passes its local sum to another processor ( at the correct time ) until finally the global sum is contained in processor P11.
There are ( sqrt(N) - 1 ) + ( sqrt(N) - 1 ) additions and communications, therefore the total time complexity is

    Theta( m/N + 2 sqrt(N) - 2 + C )
where:
C is the time for communication
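The mesh version can be simulated sequentially as follows ( a sketch only: the data and N are placeholders, and P11 is taken to be the top-left processor of the sqrt(N) x sqrt(N) grid ):

    import math

    N = 16
    side = int(math.sqrt(N))
    numbers = list(range(160))          # m = 160 numbers
    chunk = len(numbers) // N

    # Local sums, arranged as a side x side grid of "processors".
    grid = [[sum(numbers[(r*side + c)*chunk:(r*side + c + 1)*chunk])
             for c in range(side)] for r in range(side)]

    # Each row passes sums leftwards into column 0 : sqrt(N) - 1 steps.
    for r in range(side):
        for c in range(side - 1, 0, -1):
            grid[r][c-1] += grid[r][c]

    # Column 0 passes sums upwards into P11 : another sqrt(N) - 1 steps.
    for r in range(side - 1, 0, -1):
        grid[r-1][0] += grid[r][0]

    print(grid[0][0], sum(numbers))     # the global sum ends up in P11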
Shared Memory and Message Passing
Interprocessor Communication
Where there are N processors, each with its own individual data stream ( i.e. SIMD and MIMD machines ), it is usually necessary to communicate data between processors.
This is done in two ways: through shared ( global ) memory or through message passing.
All processors can gain access to the shared memory simultaneously if the memory
locations they are trying to read from or write into are different.
However we can get problems when two or more processors require access to the same
memory location simultaneously.
Say now that P1 and P2 both execute the assignment x = x + 1, and assume x = 0 initially.
This can give a result of 1 rather than 2, because P2 reads the value of x ( = 0 ) before P1 has updated it.
Therefore, depending on whether two or more processors can gain access to the same memory location simultaneously ( for reading and / or writing ), we have 4 subclasses of shared memory computers :-

    EREW - Exclusive Read, Exclusive Write
    CREW - Concurrent Read, Exclusive Write
    ERCW - Exclusive Read, Concurrent Write
    CRCW - Concurrent Read, Concurrent Write
NOTES
Allowing concurrent read access to the same address should pose no problems ( except
perhaps to the result of a calculation; as in the example )
Conceptually, each of the several processors reading from that location makes a copy of
its contents and stores it in its own register ( RAM )
Problems arise however, with concurrent write access.
If several processors are trying to simultaneously store ( potentially different ) data at the
same address, which of them should succeed ?
i.e. we need a deterministic way of specifying the contents of a memory location after a
concurrent write operation.
Some possibilities are:
A ) Only one of the processors is allowed to write, for example the one with the highest priority; access is denied to the others.
B ) All the processors are allowed to write, provided that the quantities they are attempting to store are equal, otherwise access is denied to ALL processors.
C ) The max / min / sum / average of the values is stored ( numeric data ).
Generally SIMD machines, because they can use very simple processors ( since they have no individual control unit ), typically need to have large numbers of processors ( > 1000 ) to achieve high performance.
So shared memory SIMD machines, in which such large numbers of processors would have to share one memory, are unrealistic and no commercial machines exist with this design.
However in MIMD machines, which use much more powerful processors, shared memory systems are in existence which have small numbers of processors ( 2 - 30 ).
To illustrate the theoretical potential of the four different subclasses of shared memory
consider the following example.
We have N processors ( 1 < N <= n ) to search a list S = { L1, L2, ..., Ln } for the index of a given element x. Assume x may appear several times in the list, and any index will do.
ALGORITHM :

    FOR each processor Pi DO in parallel
        Step 1 : read x
        Step 2 : search the sublist Si ( of n/N elements ) sequentially for x
        Step 3 : if x was found, return its index
    ENDFOR
Assuming the sequential search procedure takes O(n/N) time in the worst case what is the
time complexity of running this algorithm on the four subclasses of shared memory
machine ?
EREW : Step 1 takes O(N) time ( the N processors read x one at a time )
       Step 2 takes O(n/N) time ( time for reading the sublist & sequential search )
       Step 3 takes O(N) time ( the results are returned one at a time )
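The search itself can be sketched sequentially in Python ( placeholder data; each iteration of the outer loop stands for one processor scanning its own sublist ):

    S = [7, 3, 9, 3, 5, 8, 1, 3]   # the list, n = 8
    x = 3                          # the element to find
    N = 4                          # number of processors
    chunk = len(S) // N

    found_index = -1
    for i in range(N):                         # processor Pi
        for j in range(i*chunk, (i+1)*chunk):  # sequential search of its sublist
            if S[j] == x:
                found_index = j                # any index will do
                break
    print(found_index)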
Interconnection Networks
Introduction
We have seen that one way for processors to communicate data is to use a shared
memory and shared variables. However this is unrealistic for large numbers of
processors. A more realistic assumption is that each processor has its own private
memory and data communication takes place using message passing via an
INTERCONNECTION NETWORK.
Interconnection Networks
The interconnection network plays a central role in determining the overall performance
of a multicomputer system. If the network cannot provide adequate performance, for a
particular application, nodes will frequently be forced to wait for data to arrive.
Some of the more important networks include the fully connected network, the mesh ( torus ), the ring and the hypercube ( those not covered in detail here should be investigated by the reader ):

Fully Connected Network
Each node has N-1 connections ( N-1 nearest neighbours ) giving a total of N(N-1) / 2 connections for the network.
Even though this is the best network to have, the high number of connections per node means that it can only be implemented for small values of N. Therefore some form of limited interconnection network must be used.
Mesh ( Torus )
In a mesh network, the nodes are arranged in a k dimensional lattice of width w, giving
a total of w^k nodes.
[ usually k=1 (linear array) or k=2 (2D array) e.g. ICL DAP ]
Communication is allowed only between neighbouring nodes. All interior nodes are
connected to 2k other nodes.
Rings
A simple ring is just a linear array with the end nodes linked.
It is equivalent to a 1D mesh with wraparound connections. One drawback of this network is that some data transfers may require up to N/2 links to be traversed, e.g. the nodes A and B above are 3 links apart.
This can be reduced by using a chordal ring. This is a simple ring with cross or chordal links between nodes on opposite sides.
Hypercubes
In a hypercube ( binary n-cube ) with N = 2^k nodes, each node is connected to the k nodes whose binary labels differ from its own in exactly one bit.
The departmental NCUBE is based on this topology ( a 64-node hypercube ).
Four criteria are commonly used to compare interconnection networks:

1. Network connectivity
2. Network diameter
3. Narrowness
4. Network expansion increments
Network Connectivity
Network nodes and communication links sometimes fail and must be removed from
service for repair. When components do fail the network should continue to function with
reduced capacity.
Network connectivity measures the resiliency of a network and its ability to continue
operation despite disabled components i.e. connectivity is the minimum number of nodes
or links that must fail to partition the network into two or more disjoint networks
The larger the connectivity for a network the better the network is able to cope with
failures.
Network Diameter
The diameter of a network is the maximum internode distance i.e. it is the maximum
number of links that must be traversed to send a message to any node along a shortest
path.
The lower the diameter of a network the shorter the time to send a message from one
node to the node farthest away from it.
Narrowness
This is a measure of congestion in a network and is calculated as follows:
Partition the network into two groups of processors A and B, where the number of processors in each group is Na and Nb respectively, and assume Nb <= Na. Now count the number of interconnections between A and B; call this I. The maximum value of Nb / I over all such partitionings of the network is the narrowness of the network.
The idea is that if the narrowness is high ( Nb > I) then if the group B processors want to
send messages to group A congestion in the network will be high ( since there are fewer
links than processors )
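Narrowness can be computed by brute force for a small network; the sketch below does so for a 6-node ring ( only practical for tiny networks, since all partitions are enumerated ):

    from itertools import combinations

    N = 6
    edges = {(i, (i + 1) % N) for i in range(N)}     # a simple 6-node ring
    nodes = set(range(N))

    narrowness = 0.0
    for size in range(1, N // 2 + 1):                # ensure Nb <= Na
        for B in map(set, combinations(nodes, size)):
            A = nodes - B
            I = sum(1 for (u, v) in edges if (u in A) != (v in A))
            narrowness = max(narrowness, len(B) / I)

    print(narrowness)   # 1.5 : three consecutive nodes join the rest by only 2 links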
Network Expansion Increments
For reasons of cost it is better to have the option of small increments, since this allows you to upgrade your network to the size you require ( i.e. flexibility ) within a particular budget.
E.g. an 8 node linear array can be expanded in increments of 1 node but a 3 dimensional
hypercube can be expanded only by adding another 3D hypercube. (i.e. 8 nodes)
Pipelining
Typically each processor forms part of a pipeline and performs only a small part of the algorithm. Data then flows through the system ( pipeline ), being operated on by each processor in succession.
Say it takes 3 steps A, B & C to assemble a widget and assume each step takes one unit
of time
So a sequential widget assembler produces 1 widget in 3 time units, 2 in 6 time units etc.
i.e. one widget every 3 units.
In the pipelined version the machine is split into 3 smaller machines: one to do step A, one for step B and one for step C, and these can operate simultaneously.
The first machine performs step A on a new widget every time step and passes the
partially assembled widget to the second machine which performs step B. This is then
passed onto the third machine to perform step C
This produces the first widget in 3 time units (as the sequential machine), but after this
initial startup time one widget appears every time step.
i.e. the second widget appears at time 4
the third widget appears at time 5 etc.
Example

If T is the time taken by one pipeline stage, L is the number of stages and n is the number of data items, then Tseq = n L T and Tpipe = ( L + n - 1 ) T.

With T = 1, L = 100, n = 10^6,
then Tseq = 10^8 and Tpipe = 100 + 10^6 - 1 = 10^6 + 99
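These figures follow directly from the formulas ( a quick check in Python ):

    T = 1          # time per stage
    L = 100        # number of pipeline stages
    n = 10**6      # number of data items

    T_seq  = n * L * T          # every item passes through all L stages in turn
    T_pipe = (L + n - 1) * T    # first result after L steps, then one per step
    print(T_seq, T_pipe, T_seq / T_pipe)   # speedup approaches L for large n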
In a partitioned algorithm, by contrast, the four processors all perform every step A, B, C and D, but only on a subset of the data; i.e. in pipelined algorithms the algorithm is distributed among the processors, whereas in partitioned algorithms the data is distributed among the processors.
Say we want to calculate Fi = cos( sin( e^sqr(xi) ) ) for x1, x2, ..., x6 using 4 processors.
Pipelined Version
Each processor performs one step of the algorithm ( sqr, e^, sin or cos ) and the values x1, ..., x6 flow through the four processors in turn.
Partitioned Version
This time each processor performs the complete algorithm i.e. cos(sin e^sqr(x)) but on its
own data.
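The contrast can be sketched in Python, with ordinary loops standing in for the four processors ( the input values are placeholders ):

    import math

    xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]               # x1 .. x6

    # Pipelined: the algorithm is distributed, one stage per "processor".
    stages = [lambda v: v * v, math.exp, math.sin, math.cos]   # sqr, e^, sin, cos
    pipelined = list(xs)
    for stage in stages:
        pipelined = [stage(v) for v in pipelined]

    # Partitioned: each "processor" runs the whole algorithm on its own subset.
    def full(v):
        return math.cos(math.sin(math.exp(v * v)))

    subsets = [xs[0:2], xs[2:4], xs[4:5], xs[5:6]]    # data split among 4 processors
    partitioned = [full(v) for subset in subsets for v in subset]

    print(pipelined == partitioned)   # True : same results, different organisation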
Asynchronous Operation
A produces a1 and passes it to B, which calculates F1. But A is still in the process of computing a2, so instead of waiting, B carries on and calculates F2 ( based on the old data, i.e. a1, and therefore possibly not the same as the F2 above ) and continues to calculate F using the old data until a new input arrives,
e.g. Fnew = Fold + ai
The idea in using asynchronous algorithms is that all processors are kept busy and never
remain idle (unlike synchronous algorithms ) so speedup is maximized.
A drawback is that they are difficult to analyse ( because we do not know what data is being used ) and also an algorithm that is known to work ( e.g. converge ) in synchronous mode may not work ( e.g. diverge ) in asynchronous mode.
Consider the Newton Raphson iteration for solving F (x) = 0 where F is some non-linear
function
i.e. Xn+1 = Xn - F(Xn) / F'(Xn)      ( 1 )
generates a sequence of approximations to the root, starting from a value X0.
Serial Mode
P1 computes F(Xn) then P2 computes F'(Xn) then P3 computes Xn+1 using (1)
So the time per iteration is t1 + t2 + t3, where t1, t2 and t3 are the times taken by P1, P2 and P3 respectively.
If k iterations are necessary for convergence then the total time is k( t1 + t2 + t3 ).
Asynchronous Mode
P3 computes a new value using ( 1 ) as soon as EITHER P1 OR P2 provides a new input, i.e. ( 1 ) is now of the form

    Xn+1 = Xn - F(Xi) / F'(Xj)

where Xi and Xj are the most recent values available from P1 and P2.
Time per iteration is at most min( t1, t2) + t3 [but may be as low as t3 - e.g. if P1 and P2
complete at consecutive time steps]
    Time    P1           P2           P3
    ----    ----------   ----------   ------------------------
     2      F(X0)        C0           -
     3      -            F'(X0)       -
     4      -            -            X1 = X0 - F(X0)/F'(X0)
     5      C1           C1           -
     6      F(X1)        C1           -
     7      -            F'(X1)       X2 = X1 - F(X1)/F'(X0)
     8      C2           C2           X3 = X2 - F(X1)/F'(X1)
     9      F(X2)        C2           -
    10      C3           F'(X2)       X4 = X3 - F(X2)/F'(X1)
    11      F(X3)        C3 or C4     X5 = X4 - F(X2)/F'(X2)
    12      C4 or C5                  X6 = X5 - F(X3)/F'(X2)
    13      .            .            .
    14      .            .            .
    etc

- indicates the processor is idle
Ci indicates the processor is using Xi in its calculation
NOTE
At time 11, P2 has the choice of using X3 to calculate F'(X3), or of using X4 to calculate F'(X4), i.e. omitting X3. Which choice is made should be determined experimentally, to see which gives the best results: we could relax the parameter, i.e. use X4, or we could synchronise with the parameter, i.e. use X3.
Speedup and Efficiency
Two important measures of the quality of parallel algorithms are speedup and efficiency:
If Ts is the time taken to run the fastest serial algorithm on one processor and if Tp is the
time taken by a parallel algorithm on N processors then
Speedup = SN = Ts / Tp
and the efficiency of the parallel algorithm is given by
Efficiency = EN = SN / N
If the best known serial algorithm takes 8 seconds i.e. Ts = 8, while a parallel algorithm
takes 2 seconds using 5 processors, then
SN = Ts / Tp = 8 / 2 = 4 and
EN = SN / N = 4 / 5 = 0.8 = 80%
i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors giving an 80%
efficiency.
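The two measures are trivial to compute; for instance, using the figures above:

    def speedup(Ts, Tp):
        return Ts / Tp

    def efficiency(Ts, Tp, N):
        return speedup(Ts, Tp) / N

    print(speedup(8, 2))          # 4.0
    print(efficiency(8, 2, 5))    # 0.8, i.e. 80%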
NOTE
Care should be taken as to exactly what is meant by Ts, i.e. the time taken to run the fastest serial algorithm on one processor - but which processor?
We can use one processor of the parallel computer, or we can use the fastest serial machine available.
The latter is the fairest way to compare parallel algorithms, but it is unrealistic in practice, since most people do not have access to the fastest serial machines, making it impossible to make a claim about speedup.
( Researchers also do not like this way because the speedup is reduced ( since Ts is lower ) !!! )
Whichever definition is used, the ideal is to produce linear speedup, i.e. a speedup of N using N processors and an efficiency of 1 ( 100% ).
However in practice the speedup is reduced from its ideal value of N ( and the efficiency is bounded from above by 1 ). Some of the reasons for this are listed below.
2. Load Balancing
Speedup is generally limited by the speed of the slowest node. So an important
consideration is to ensure that each node performs the same amount of work. i.e. the
system is load balanced.
3. Communication Overhead
Assuming that communication and calculation cannot be overlapped, then any time spent
communicating the data between processors directly degrades the speedup. (because the
processors are not calculating ).
Because of this, a goal of the parallel algorithm designer should be to make the grain size ( the relative amount of work done between synchronizations / communications ) as large as possible, while keeping all the processors busy.
A useful measure here is the ratio tcomm / tcalc, where tcomm is the time to transfer a single word between two nodes and tcalc is the time to perform some floating point calculation.
4. Amdahl's Law
This states that the speedup of a parallel algorithm is effectively limited by the number of operations which must be performed sequentially, i.e. its serial fraction.
Let S be the amount of time spent ( by one processor ) on serial parts of the program, and P be the amount of time spent ( by one processor ) on parts of the program that could be done in parallel; the serial fraction is then F = S / ( S + P ).

SPEEDUP = ( S + P ) / ( S + P/N )      ( 1 )
Say we have a program containing 100 operations each of which take 1 time unit.
If 80 operations can be done in parallel i.e. P = 80
and 20 operations must be done sequentially i.e. S = 20
then using 80 processors
Speedup = 100 / (20 + 80/80) = 100 / 21 < 5
Therefore Amdahl's Law tells us that the serial fraction F places a severe constraint on the speedup as the number of processors increases.
Since most parallel programs contain a certain amount of sequential code, a possible conclusion of Amdahl's Law is that it is not cost effective to build systems with large numbers of processors, because sufficient speedup will never be produced.
However, most of the important applications that need to be parallelised contain very small serial fractions ( < 0.001 ).
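Equation ( 1 ) is easy to experiment with; the sketch below reproduces the example above and also shows how much better an application with a very small serial fraction scales:

    # Amdahl's Law : speedup = (S + P) / (S + P/N) = 1 / (F + (1 - F)/N)
    def amdahl_speedup(F, N):
        return 1.0 / (F + (1.0 - F) / N)

    print(amdahl_speedup(0.20, 80))      # ~4.76, i.e. less than 5 (the example above)
    print(amdahl_speedup(0.001, 1000))   # ~500 : a tiny serial fraction scales well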
Scaled Speedup
Say we have a problem to solve which takes Tseq seconds to finish, S seconds on the
parts of the program that must be done serially and P ( = Tseq - S ) seconds on parts of
the program that could have been done in parallel.
So if we solve the same problem on a parallel machine, i.e. the problem size is fixed, then the speedup can be predicted by Amdahl's Law as

Speedup = ( S + P ) / ( S + P/N ) = 1 / ( F + (1-F)/N )
However we could argue that the run time is fixed, i.e. the problem size is not fixed but increases in proportion to N,
AND
we could argue that the serial fraction is not constant but decreases as the problem size increases, i.e. S stays constant while P grows with the problem.
Amdahl's Law is valid for problems in which the serial fraction F does not vary with the problem size, i.e. as the problem grows, the times Tseq and S increase together, keeping F = S / Tseq constant.
NOTE
Applications that parallelise well are ones with very small serial fractions,

i.e. speedup = 1 / ( F + (1 - F)/N )      ( 2 )

So if we run the program and find the actual speedup from Tseq / Tpar, we can rearrange ( 2 ) to find the actual ( experimentally determined ) serial fraction F:

F = ( 1/speedup - 1/N ) / ( 1 - 1/N )
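For instance, using the earlier figures ( a speedup of 4 on 5 processors ):

    # Experimentally determined serial fraction, from rearranging ( 2 ).
    def serial_fraction(speedup, N):
        return (1.0 / speedup - 1.0 / N) / (1.0 - 1.0 / N)

    print(serial_fraction(4.0, 5))   # 0.0625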
Say we have 12 pieces of work each taking the same amount of time.
We have perfect load balancing for N = 2,3,4,6 & 12 processors but not for N =
5,7,8,9,10,11
Since a larger load imbalance results in a larger F, problems can be identified that are not apparent from the speedup or efficiency alone.
Since increasing communication overhead decreases the speedup, the value of F increases smoothly as N increases. So a smoothly increasing F is a warning that the grain size is too small.
For example, suppose a program gives the following measured results:

    N    Speedup    Efficiency (%)    Serial fraction F
    2    1.95       97                0.024
    3    2.88       96                0.021
    4    3.76       94                0.021
    8    6.96       87                0.021
Without looking at the serial fraction we cannot tell whether these results are good or not, e.g. why does the efficiency decrease? Since F is almost constant, we can conclude that the decrease is due to the limited parallelism of the program.