
PARALLEL ALGORITHMS

Partha Protim Konwar, Umang Kotriwala


Dept. of Information Science, Siddaganga Institute of Technology, Tumkur

Abstract – This document gives guidelines for the formulation of parallel algorithms that would run on different parallel computers and for the analysis of their efficiency. It can be used both as an instruction set and as a guideline for the development of such algorithms.

I. BRIEF INTRODUCTION

In order to solve a problem on a standalone system we need to design an algorithm for the problem. This algorithm gives a sequence of steps which the sequential computer has to execute in order to solve the problem. This type of algorithm is known as a sequential algorithm. Similarly, the algorithms used for solving problems on a parallel computer are known as parallel algorithms.

A parallel algorithm defines how a given problem can be solved on a given parallel computer, i.e. how the problem is divided into sub-problems, how the processes communicate and how the partial solutions are combined to produce the final result. This type of algorithm is generally machine dependent. In order to simplify the design and analysis of parallel algorithms, parallel computers are represented by various abstract machine models. These models make simplifying assumptions about the parallel computer. Even though some assumptions may not be practical, they are justified in the following sense:
• In designing algorithms for these models one can learn about the inherent parallelism in the given problem.
• The models help us compare the relative powers of the various computers. They also help us in determining the kind of parallel architecture a problem is best suited for.

Consider, as a running example, the problem of marking n examination scripts, each containing m questions. In order to design a parallel solution to this problem, it must first be decomposed into smaller tasks which can be executed simultaneously. This is referred to as the partitioning stage. This can be done in one of two ways. Each script could be marked by a different marker – this would require n markers. Alternatively, marking each question could be viewed as a task. This would result in m such tasks, each of which could be tackled by a separate marker, implying that every script passes through every marker.

In the first approach, the data (scripts) is first decomposed and then the computation (marking) is associated with it. This technique is called domain decomposition.

In the second approach, the computation to be performed (marking) is first decomposed and then the data (scripts) is associated with it. This technique is called functional decomposition. The partitioning technique that is chosen often depends on the nature of the problem.

Suppose one needs to compute the average mark of the n scripts. If domain decomposition were chosen, then the marks from each of the markers would be required. If the markers are at different physical locations, then some form of communication is needed in order to obtain the sum of the marks.

The nature of the information flow is specified in the communication analysis stage of the design. In this case, each marker can proceed independently and communicate the marks at the end. However, other situations would require communication between two concurrent tasks before computation can proceed.

It may be the case that the time to communicate the marks between two markers is much greater than the time to mark a question, in which case it is more efficient to reduce the number of markers and have a marker work on a number of scripts, thereby decreasing the amount of communication.
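As a minimal sketch of the domain-decomposition approach (using the MPI library introduced in Section IV; the marks and the four scripts per marker are invented values), each marker process can sum its own scripts locally and a single MPI_Reduce call can combine the partial sums, so that communication happens only once, at the end:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nmarkers, i;
    int local_marks[4] = {55, 60, 70, 65};   /* invented marks for this marker's 4 scripts */
    int local_sum = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nmarkers);

    for (i = 0; i < 4; i++)                  /* each marker marks its own scripts locally */
        local_sum += local_marks[i];

    /* a single communication step combines the partial sums at marker 0 */
    MPI_Reduce(&local_sum, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("average mark = %.2f\n", (double)total / (4.0 * nmarkers));

    MPI_Finalize();
    return 0;
}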

Effectively, several small tasks are combined to produce larger ones, which results in a more efficient solution. This is called granularity control. For example, k markers could mark n/k scripts each. The problem here is to determine the best value of k.

The mapping stage specifies where each task is to execute. In this example, all tasks are of equal size and the communication is uniform, so any task can be mapped to any marker. However, in more complex situations, mapping strategies may not be obvious, requiring the use of more sophisticated techniques.

Parallel algorithm design is an interesting and challenging area of computer science which requires a combination of creative and analytical skills.

II. ANALYSIS OF PARALLEL ALGORITHMS

A sequential algorithm is evaluated in terms of two parameters, i.e. the running time complexity and the space complexity. For evaluating parallel algorithms we consider three principal criteria. They are:
• Running Time
• Number of processors
• Cost

Running Time – Since speeding up the solution to a problem is the main reason for building parallel computers, an important measure in evaluating a parallel algorithm is its running time. This is defined as the time taken by the algorithm to solve a problem on a parallel computer. A parallel algorithm is made up of two kinds of steps: computational steps and communication steps. In a computational step a processor performs a local arithmetic or logical operation, whereas in a communication step data is exchanged between the processors via the shared memory or through the interconnection network. Thus the running time of a parallel algorithm includes the time spent during both computational and communication steps. The worst-case running time of a parallel algorithm is defined as the maximum running time of the algorithm taken over all inputs, whereas the average-case running time is the average of the running time of the algorithm over all inputs.

Number of Processors – Another important criterion for evaluating a parallel algorithm is the number of processors required. Given a problem of input size n, the number of processors required by an algorithm is a function of n, denoted by P(n). Sometimes the number of processors is a constant independent of n.

Cost – The cost of a parallel algorithm is defined as the product of the running time of the parallel algorithm and the number of processors used.

Cost = Running time × Number of processors

If the cost of a parallel algorithm matches the lower bound of the best known sequential algorithm to within a constant multiplicative factor, then the algorithm is said to be cost optimal. The algorithm for adding n numbers takes O(log n) steps on a tree of n-1 processors. Thus the cost of the parallel algorithm is O(n log n), whereas the sequential algorithm in this case takes O(n) time. Thus this parallel algorithm is not cost optimal.

The efficiency of a parallel algorithm is defined as the ratio of the worst-case running time of the sequential algorithm to the cost of the parallel algorithm.

Efficiency = (Worst-case running time of sequential algorithm) / (Cost of parallel algorithm)
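A short worked example makes the arithmetic concrete; the input size n = 1024 is an arbitrary choice used only for illustration:

#include <stdio.h>
#include <math.h>

int main(void)
{
    int    n     = 1024;
    double t_par = ceil(log2(n));   /* parallel running time: about log2(n) = 10 steps */
    int    p     = n - 1;           /* number of processors in the addition tree       */
    double cost  = t_par * p;       /* cost = running time x number of processors      */
    double t_seq = n - 1;           /* best sequential algorithm: n - 1 additions      */

    printf("cost = %.0f, efficiency = %.2f\n", cost, t_seq / cost);
    /* prints: cost = 10230, efficiency = 0.10 -- O(n log n) cost, so not cost optimal */
    return 0;
}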
III. MODELS OF COMPUTATION

The Random Access Machine (RAM) – The schematic diagram of the RAM is shown below.

[Figure: the RAM model – a processor connected to a memory unit through a memory access unit (MAU).]

The basic functional units of the RAM are:
• A memory unit with m locations.
• A processor that operates under the control of a sequential algorithm. The processor can read data from a memory location and can perform basic arithmetic and logical operations.
• A memory access unit (MAU), which creates a path from the processor to an arbitrary location in the memory. The processor provides the MAU with the address of the location it wishes to access and the read/write operation it wishes to perform; this address is used by the MAU to establish a direct connection between the processor and the memory location.

Any step of an algorithm for the RAM model consists of three basic phases, namely:
• Read: The processor reads a datum from the memory, which is stored in one of its local registers.
• Execute: The processor performs a basic arithmetic or logical operation on the contents of one or two of its registers.
• Write: The processor writes the contents of one of its registers into a memory location.

The Parallel Random Access Machine (PRAM) – The schematic diagram of the PRAM is shown below:

[Figure: the PRAM model – N processors P1, P2, ..., PN connected through an MAU to a shared memory.]

The PRAM is one of the popular models for designing parallel algorithms. The PRAM consists of the following:
• A set of N identical processors (P1, P2, P3, ..., PN).
• A memory with m locations which is shared by all the N processors.
• An MAU which allows the processors to access the shared memory.

It is to be noted that the shared memory also functions as a communication medium for the processors. Here each step of an algorithm consists of the following phases:
• Read: Up to N processors can read simultaneously, in parallel, from N memory locations and store the values in their local registers.
• Compute: The N processors perform basic arithmetic or logical operations on the values in their registers.
• Write: The N processors can write simultaneously into N memory locations from their registers.

This PRAM model can be further sub-divided into four categories based on the way simultaneous memory accesses are handled:
• Exclusive Read, Exclusive Write (EREW) PRAM. In this model every access to a memory location (read or write) has to be exclusive.
• Concurrent Read, Exclusive Write (CREW) PRAM. In this model only write operations to a memory location are exclusive, whereas two or more processors can concurrently read from the same memory location.
• Exclusive Read, Concurrent Write (ERCW) PRAM. This model allows multiple processors to concurrently write to the same memory location, whereas the read operations are exclusive.
• Concurrent Read, Concurrent Write (CRCW) PRAM. This model allows both multiple read and multiple write operations to a memory location. It is the most powerful of the four models. During a read operation all processors reading from a particular memory location read the same value, whereas during a write operation many processors may try to write different values to the same memory location, so this model has to specify precisely the value that is to be written. The following protocols identify the value that is actually written:
o Priority CW: Here only the processor with the highest priority succeeds in writing its value to the memory location.

o Common CW: Here the processors are allowed to write to the memory location if and only if they all write the same value.
o Arbitrary CW: Here one of the processors succeeds in writing to the memory location; it is chosen arbitrarily, without affecting the correctness of the algorithm.
o Combining CW: Here there is a function that maps the multiple values that the processors try to write to a single value that is actually written into the memory location.
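The synchronous read–compute–write discipline described above can be illustrated with a small sequential C simulation (a sketch written for this discussion, not an actual PRAM program): N numbers held in shared memory are summed in log N steps, and because every simulated processor finishes its read before any write takes place, and no two processors touch the same cell in the same phase, the access pattern is legal even on the EREW PRAM.

#include <stdio.h>
#define N 8                                   /* number of processors and of inputs */

int main(void)
{
    int shared[N] = {3, 1, 4, 1, 5, 9, 2, 6}; /* the shared memory                  */
    int reg[N];                               /* one local register per processor   */
    int i, stride;

    for (stride = 1; stride < N; stride *= 2) {
        /* read + compute phase: processor i loads two cells and adds them locally */
        for (i = 0; i + stride < N; i += 2 * stride)
            reg[i] = shared[i] + shared[i + stride];
        /* write phase: results go back only after every read has completed */
        for (i = 0; i + stride < N; i += 2 * stride)
            shared[i] = reg[i];
    }
    printf("sum = %d\n", shared[0]);          /* prints 31 after log2(N) = 3 steps  */
    return 0;
}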

Interconnection Networks – We know that in the PRAM all exchanges of data among processors take place through the shared memory. There is also another way for the processors to communicate, i.e. via direct links. In this method, instead of a shared memory, the M memory locations are distributed among the N processors, so the local memory of each processor contains M/N locations.

Combinational Circuits – A combinational circuit can be viewed as a device that has a set of input lines on one end and a set of output lines on the other. Such circuits are made of interconnected components arranged in columns called stages. Each component has a fixed number of input lines, called its fan-in, and a fixed number of output lines, called its fan-out. After a component receives its inputs, a simple arithmetic or logical operation is performed in one unit of time and the result is produced as output.

A schematic diagram of a combinational circuit is shown below. For convenience, the fan-in and the fan-out are assumed to be 2.

[Figure: schematic of a combinational circuit, with the input lines, output lines, stages, width and depth marked.]

An important feature of a combinational circuit is that it has no feedback. The important parameters used to analyse a combinational circuit are:
• Size: the number of components used in the combinational circuit.
• Depth: the number of stages in the combinational circuit, i.e. the maximum number of components on a path from input to output.
• Width: the maximum number of components in a given stage.

IV. IMPLEMENTATION - I

Parallel processing can be achieved using two public domain message passing systems, namely PVM and MPI.

Message Passing Interface (MPI) – MPI is a standard specification for a library of message passing functions. MPI specifies a public domain, platform-independent standard for a portable message passing library. An MPI library defines the names, calling sequences and results of subroutines to be called from FORTRAN 77 programs and of functions to be called from C programs. Users write their programs in FORTRAN 77 or C, compile them with ordinary compilers and link them against the MPI library.

A parallel program written in FORTRAN 77 or C using the MPI library can run without any change on a single PC, a workstation, a network of workstations, a parallel computer from any vendor, or under any operating system. The design of MPI is based on four orthogonal concepts: message data types, communicators, communication operations and virtual topologies.

Compiling an MPI program – A program written using MPI should be compiled using the parallel C compiler mpcc:

mpcc functionname.c -o myprog.o
Message Passing in MPI – Processes in MPI are heavyweight and single-threaded, with separate address spaces. Since one process cannot directly access variables in another process's address space, message passing is used for interprocess communication. The routines MPI_Send and MPI_Recv are used to send or receive a message to or from a process. A message has two parts, namely the content of the message (the message buffer) and the destination of the message (the message envelope).

An example of the usage of MPI_Send is shown below:

MPI_Send(&N, 1, MPI_INT, i, i, MPI_COMM_WORLD)

The MPI routine has six parameters: the first three specify the message address, the message count and the message datatype. MPI introduces datatype identifiers to support heterogeneous computing and to allow messages to be taken from non-contiguous memory locations. The last three parameters specify the destination process id, the tag and the communicator, respectively. Together these three parameters constitute the message envelope.
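A minimal matching send/receive pair, written here only to illustrate the envelope just described (the payload, the tag value and the choice of ranks 0 and 1 are arbitrary), could look as follows; it assumes the program is run with at least two processes:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, N = 0, tag = 1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        N = 42;
        MPI_Send(&N, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);    /* envelope: dest 1, tag 1 */
    } else if (rank == 1) {
        MPI_Recv(&N, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
        printf("process 1 received %d from process 0\n", N);
    }

    MPI_Finalize();
    return 0;
}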
Point to Point Communications – MPI provides both blocking and non-blocking operations; the completion of the non-blocking versions can be tested for and waited for explicitly. MPI also has multiple communication modes. The standard mode corresponds to current practice in message passing systems. The synchronous mode requires a send to block until the corresponding receive has occurred. In buffered mode a send assumes the availability of a certain amount of buffer space, which must previously have been specified by the user program through a call to the routine MPI_Buffer_attach(buffer, size), which attaches a user buffer. The ready mode is a way for the programmer to notify the system that the corresponding receive has already been posted, so that the underlying system can use a faster protocol.
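The buffered mode can be sketched as follows (the buffer size and the message value are arbitrary choices): the user buffer is attached before the buffered send and detached afterwards, and the detach call waits until any buffered messages have been delivered to the system.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 7;
    int size = 1024 + MPI_BSEND_OVERHEAD;     /* room for one small message              */
    char *buf = malloc(size);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Buffer_attach(buf, size);         /* buffer space supplied by the user program */
        MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &size);       /* completes the buffered send               */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("received %d\n", value);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}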
Collective Communication – When all the processes in a group participate in a global communication operation, the resulting communication is called collective communication. MPI provides the following collective communication routines:
• Broadcast – Using the routine MPI_Bcast(address, count, datatype, root, comm), the process ranked root sends the same message, whose content is identified by the triple (address, count, datatype), to all processes in the communicator comm.
• Gather – The routine MPI_Gather(send_address, send_count, send_datatype, recv_address, recv_count, recv_datatype, root, comm) lets the root process receive a personalized message from each of the N processes.
• Scatter – The routine MPI_Scatter() ensures that the root process sends out personalized messages to all the N processes.
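An illustrative fragment combining a broadcast with a gather (it assumes MPI has already been initialised and that rank and size have been obtained from MPI_COMM_WORLD; the computed value is a stand-in for real work) is shown below:

#include <stdlib.h>
#include <mpi.h>

void broadcast_and_gather(int rank, int size)
{
    int parameter = 0, my_result;
    int *all_results = NULL;

    if (rank == 0) {
        parameter = 100;                               /* value chosen by the root               */
        all_results = malloc(size * sizeof(int));      /* receive buffer needed at the root only */
    }

    MPI_Bcast(&parameter, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* same message to every process     */

    my_result = parameter + rank;                      /* stand-in for real local work           */
    MPI_Gather(&my_result, 1, MPI_INT,                 /* each process contributes one integer   */
               all_results, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)                                     /* all_results[i] holds process i's value */
        free(all_results);
}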
Virtual Topologies – A topology describes how the processors in a parallel computer are interconnected. In most parallel programs each process communicates with only a few other processes, and the pattern of communication among these processes is called an application topology. MPI allows the user to define a virtual topology. Communication within this topology takes place with the hope that the underlying network topology will correspond to it and expedite the message transfers. An example of a virtual topology is the Cartesian or mesh topology.

V. IMPLEMENTATION – II

Parallel Virtual Machine (PVM) – PVM is a public domain software system that was originally designed to enable a collection of heterogeneous UNIX computers to be used co-operatively as one virtual message passing parallel computer. Unlike MPI, PVM is a self-contained system: while MPI depends on the underlying platform to provide process management and I/O functions, PVM does not. On the other hand, PVM is not a standard, which means it can undergo version changes more frequently than MPI.

The PVM system is composed of two parts: a PVM daemon (pvmd3) that resides on all the computers which make up the virtual machine, and a user-callable library (libpvm3.a) that is linked to the user application for message passing, process management and modification of the virtual machine.

To run a PVM application the user first creates a virtual machine by starting up PVM. Multiple users can configure overlapping virtual machines on a UNIX system, and each user can execute several PVM applications simultaneously. A general method of programming an application with PVM is as follows: the user codes one or more sequential programs in FORTRAN 77 or C which contain calls to the PVM library. These programs are compiled for the host pool, and the resulting object files are placed in a location accessible from the machines in the host pool. To execute an application, the user starts one copy of one task from a machine within the host pool. This task subsequently starts other PVM tasks, which compute locally and exchange messages with each other to solve the problem. All PVM tasks are identified by an integer task identifier (tid) which is assigned by the PVM system.
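A hypothetical master task in this style (the worker executable name "worker", the message tags and the single integer of work per task are invented for the illustration) could look as follows; each spawned worker would correspondingly call pvm_recv and pvm_upkint, compute, and send its answer back with tag 2:

#include <stdio.h>
#include <pvm3.h>

#define NWORKERS 4

int main(void)
{
    int tids[NWORKERS], work, result, i;

    pvm_mytid();                                /* enrol this task in the virtual machine */

    /* start NWORKERS copies of the (hypothetical) worker executable */
    pvm_spawn("worker", NULL, PvmTaskDefault, "", NWORKERS, tids);

    for (i = 0; i < NWORKERS; i++) {            /* hand one piece of work to each worker  */
        work = i;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&work, 1, 1);
        pvm_send(tids[i], 1);                   /* message tag 1: work                    */
    }

    for (i = 0; i < NWORKERS; i++) {            /* collect the answers                    */
        pvm_recv(-1, 2);                        /* message tag 2: result, from any task   */
        pvm_upkint(&result, 1, 1);
        printf("received result %d\n", result);
    }

    pvm_exit();                                 /* leave the virtual machine              */
    return 0;
}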
VI. EFFICIENCY ANALYSIS

We begin by reviewing the standard framework for sequential algorithm analysis. We then consider the complications introduced by parallelism and look at some proposed parallel frameworks.

Analysing Sequential Algorithms – The design and analysis of sequential algorithms is a well developed field, with a large body of commonly accepted results and techniques. This consensus is built upon the fact that the methodology and notation of asymptotic analysis (the so-called "big-O" notation) deliver results which are applicable across all sequential computers, programming languages, compilers and so on. This generality is achieved at the expense of a certain degree of blurring, in which constant factors and non-dominating terms in the analysis are simply ignored. In spite of this, the approach produces results which allow useful comparisons of the essential performance characteristics of different algorithms, comparisons which are reflected in practice when the algorithms are implemented on real machines, in real languages, through real compilers. For example, mergesort with its Θ(n log n) run time is (in the worst case) an asymptotically better sorting algorithm than insertion sort (Θ(n²)) on any normal sequential machine (although the actual problem size at which the dominance becomes apparent will vary from implementation to implementation). Underpinning this work is the "Random Access Machine" (RAM) model, which is an abstraction of the essential capabilities and cost characteristics that unite all sequential machines. The notation allows the description of "upper bounds" (with O()), "lower bounds" (with Ω()) and "tight bounds" (with Θ()) on the behaviour of functions representing the time or space requirements of an algorithm as its input problem size grows.

Analysing Parallel Algorithms – The sequential world benefits from a single universal abstract machine model (the RAM) which accurately (enough) characterizes all sequential computers, and from a simple criterion of "better" for algorithm comparison ("less is better", usually of run time, and occasionally of memory space).

Thinking parallel, we immediately encounter two complications.

Firstly, and fundamentally, there is no commonly agreed model of parallel computation. The diversity of proposed and implemented parallel architectures is such that it is not clear that such a model will ever emerge. Worse than this, the variations in architectural capabilities and associated costs mean that no such model can emerge, unless we are prepared to forgo certain tricks or shortcuts exploitable on one machine but not another. An algorithm designed in some abstract model of parallelism may have asymptotically different performance on two different architectures (rather than just the varying constant factors of different sequential machines). Secondly, our notion of "better", even in the context of a single architecture, must surely take into account the number of processors involved as well as the run time. The trade-offs here will need careful consideration.

In this paper we will not attempt to unify the irretrievably diverse. Thus we will have a small number of machine models and will design algorithms for our chosen problems for some or all of these. However, in doing so we still hope to emphasize common principles of design which transcend the differences in architecture. Equally, in some instances, we will exploit particular features of one model where that leads to a novel or particularly effective algorithm. Similarly, we will investigate notions of "better" as they have been traditionally defined in the context of each model.

We will continue to employ the notation of asymptotic analysis, but note that we must be particularly wary of constant factors in the parallel case – a "constant factor" discrepancy of 32 in an asymptotically optimal algorithm on a 64-processor machine is a serious matter.

VII. EXPERIMENTAL RESULTS AND ANALYSIS

String Matching Problem – In this section we present the experimental results for the performance of the parallel string matching implementation, which is based on the static master-worker model. The algorithm is implemented in the ANSI C programming language using the MPI library [5, 10, 11] for the point-to-point and collective communication operations. The target platform for our experimental study is a personal computer cluster connected with a 100 Mb/s Fast Ethernet network. More specifically, the cluster consists of 6 PCs based on 100 MHz Intel Pentium processors with 64 MB RAM. The MPI implementation used on the network is MPICH version 1.2. During all experiments the cluster of personal computers was dedicated. Finally, to get reliable performance results, 10 executions were performed for each experiment and the reported values are the averages.

The number of processors, the pattern lengths and the text sizes can influence the performance of the parallel string matching significantly, and thus these parameters are varied in our experimental study.

In Tables 1 and 2 we show the execution times in seconds for the BF string matching algorithm, for four pattern lengths, two total English text sizes and different numbers of processors.

Further, Figure 2 presents the speedup factor with respect to the number of processors, for English texts of various sizes, for the BF string matching algorithm. We define the speedup Sp in the usual form

Sp = T1 / Tp

where T1 and Tp are the execution times of the same algorithm (implemented for sequential and parallel execution) on 1 and p processors, respectively. It is important to note that the speedup plotted in the figure is the average over the four pattern lengths. For example, from Table 2 with pattern length m = 5, the speedup on six processors is 9.237 / 1.558 ≈ 5.9.
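The measured implementation is not reproduced in this paper; purely as an indication of how a static master–worker brute-force matcher can be organised with MPI, the following simplified sketch broadcasts a tiny hard-coded text, statically partitions the candidate starting positions among the processes and sums the local match counts with MPI_Reduce (a real implementation would instead distribute only overlapping chunks of a large text read from disk):

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    char text[]    = "abcabcababcabcabcabcabca";   /* tiny stand-in for the multi-MB texts */
    char pattern[] = "abc";
    int  rank, nprocs, n, m, first, last, i;
    int  local = 0, total = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    n = (int)strlen(text);
    m = (int)strlen(pattern);

    /* the master (rank 0) owns the text; here it is simply broadcast to everyone */
    MPI_Bcast(text, n + 1, MPI_CHAR, 0, MPI_COMM_WORLD);

    /* static partitioning of the n - m + 1 candidate starting positions */
    first = rank * (n - m + 1) / nprocs;
    last  = (rank + 1) * (n - m + 1) / nprocs;

    for (i = first; i < last; i++)                  /* brute-force comparison at position i */
        if (strncmp(&text[i], pattern, m) == 0)
            local++;

    /* every worker reports its count; the total arrives back at the master */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pattern found %d times\n", total);

    MPI_Finalize();
    return 0;
}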
p \ m      5        10       30       60
1          1.155    1.111    1.087    1.182
2          0.622    0.596    0.584    0.631
3          0.419    0.405    0.396    0.428
4          0.312    0.301    0.296    0.318
5          0.252    0.247    0.242    0.266
6          0.203    0.205    0.197    0.212
Table 1: Experimental execution times (in secs) for text size 3 MB using several pattern lengths (p = number of processors, m = pattern length).

p \ m      5        10       30       60
1          9.237    8.724    8.513    9.284
2          4.642    4.46     4.375    4.769
3          3.095    2.969    2.926    3.201
4          2.334    2.26     1.801    2.421
5          1.875    1.801    1.785    1.962
6          1.558    1.492    1.472    1.631
Table 2: Experimental execution times (in secs) for text size 24 MB using several pattern lengths (p = number of processors, m = pattern length).

Matrix-Vector Multiplication Problem –

[Figure: execution time for the matrix-vector multiplication operation with 3 processors, for TMR (traditional matrix representation) and EKMR (extended Karnaugh map representation), plotted against matrix sizes from 10 to 100.]
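Neither the TMR nor the EKMR code measured above is reproduced here; for reference, a generic row-block parallel matrix–vector product in MPI (with a made-up 8 × 8 matrix and the assumption that the number of rows divides evenly among the processes) can be sketched as follows:

#include <stdio.h>
#include <mpi.h>

#define N 8                         /* matrix is N x N; N assumed divisible by the process count */

int main(int argc, char *argv[])
{
    double A[N * N], x[N], y[N];    /* full matrix and vectors live on the root    */
    double localA[N * N], local_y[N];
    int rank, nprocs, rows, i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    rows = N / nprocs;

    if (rank == 0)                  /* fill A and x with sample data               */
        for (i = 0; i < N; i++) {
            x[i] = 1.0;
            for (j = 0; j < N; j++)
                A[i * N + j] = i + j;
        }

    /* row-block decomposition: each process receives "rows" consecutive rows of A */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, localA, rows * N, MPI_DOUBLE,
                0, MPI_COMM_WORLD);
    MPI_Bcast(x, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < rows; i++) {    /* local part of the product                   */
        local_y[i] = 0.0;
        for (j = 0; j < N; j++)
            local_y[i] += localA[i * N + j] * x[j];
    }

    MPI_Gather(local_y, rows, MPI_DOUBLE, y, rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)                  /* with the sample data, y[i] = 8*i + 28       */
        printf("y[0] = %g, y[%d] = %g\n", y[0], N - 1, y[N - 1]);

    MPI_Finalize();
    return 0;
}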

[Figure: relative performance for matrix-vector multiplication (number of processors vs. performance %).]

VIII. CONCLUSION

In this paper we compared three models of parallel computation – the PRAM, interconnection networks and combinational circuits – which differ according to whether the processors communicate among themselves through a shared memory or through an interconnection network. We also looked at two parallel programming models, Message Passing Programming and Shared Memory Programming, and at two standard libraries, namely PVM and MPI, which are implemented on almost all types of parallel computers.

We then analysed the design and performance of parallel algorithms for the computational problem of matrix-vector multiplication and the logical problem of string matching.

IX. REFERENCES

[1] V. Rajaraman and C. Siva Ram Murthy, Parallel Computers: Architecture and Programming, 5th ed., May 2006.
[2] (2009) The IEEE website. [Online]. Available: http://www.ieee.org/
[3] Panagiotis D. Michailidis and Konstantinos G. Margaritis, "String Matching Problem on a Cluster of Personal Computers," Department of Applied Informatics, University of Macedonia, 156 Egnatia str., P.O. Box 1591, Thessaloniki, Greece.
[4] Jen-Shiuh Liu, Jiun-Yuan Lin, and Yeh-Ching Chung, "Efficient Parallel Algorithm for Multi-dimensional Matrix Operations," Department of Information Engineering, Feng Chia University, Taichung, Taiwan 407, ROC.
