Parallel Processing and Data Transfer Modes in a Computer System
Parallel Processing
Instead of processing each instruction sequentially, a parallel processing system provides
concurrent data processing to reduce the overall execution time. Such a system may have
two or more ALUs and should be able to execute two or more instructions at the same time.
The purpose of parallel processing is to speed up the computer's processing capability and
increase its throughput.
NOTE: Throughput is the number of instructions that can be executed in a unit of time.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used. At the
higher level of complexity, parallel processing can be achieved by using multiple functional
units that perform many operations simultaneously.
Data Transfer Modes of a Computer System
According to Flynn's classification, which is based on the number of instruction and data streams, computers can be divided into 4 major groups:
SISD (Single Instruction Stream, Single Data Stream)
It represents the organization of a single computer containing a control unit, a processor unit
and a memory unit. Instructions are executed sequentially; limited parallel processing can be
achieved by pipelining or multiple functional units.
SIMD (Single Instruction Stream, Multiple Data Stream)
It represents an organization that includes multiple processing units under the control of a
common control unit. All processors receive the same instruction from control unit but
operate on different parts of the data.
These are highly specialized computers, basically used for numerical problems expressed in
vector or matrix form. They are not suitable for other types of computations.
MISD (Multiple Instruction Stream, Single Data Stream)
It consists of a single computer containing multiple processors connected with multiple
control units and a common memory unit. It is capable of processing several instructions
over a single data stream simultaneously. The MISD structure is only of theoretical interest since
no practical system has been constructed using this organization.
MIMD (Multiple Instruction Stream, Multiple Data Stream)
It represents an organization which is capable of processing several programs at the same
time. It is the organization of a single computer containing multiple processors connected
with multiple control units and a shared memory unit. The shared memory unit contains
multiple modules to communicate with all processors simultaneously. Multiprocessors and
multicomputers are examples of MIMD. It fulfills the demand of large-scale computations.
Task parallelism
Task parallelism (also known as function parallelism and control parallelism) is a form
of parallelization of computer code across multiple processors in parallel computing environments.
Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—
across different processors. In contrast to data parallelism which involves running the same task on
different components of data, task parallelism is distinguished by running many different tasks at the
same time on the same data.[1] A common type of task parallelism is pipelining which consists of
moving a single set of data through a series of separate tasks where each task can execute
independently of the others.
In a multiprocessor system, task parallelism is achieved when each processor executes a different
thread (or process) on the same or different data. The threads may execute the same or different
code. In the general case, different execution threads communicate with one another as they work,
but this is not a requirement. Communication usually takes place by passing data from one thread to the
next as part of a workflow.[2]
As a simple example, if a system is running code on a 2-processor system (CPUs "a" & "b") in
a parallel environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task
"A" and CPU "b" to do task "B" simultaneously, thereby reducing the run time of the execution. The
tasks can be assigned using conditional statements.
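A minimal sketch of this idea in Python (the task bodies here are invented purely for illustration); the "conditional statement" is the branch on the worker's identifier:

```python
import threading

results = {}

def worker(cpu_id):
    # Conditional assignment of tasks: CPU "a" runs task A, CPU "b" runs task B.
    if cpu_id == "a":
        results["A"] = sum(range(1000))      # task "A"
    elif cpu_id == "b":
        results["B"] = "parallel"[::-1]      # task "B"

# One thread per CPU; the OS scheduler can place the threads on different
# processors, so tasks A and B run at the same time rather than one after
# the other.
threads = [threading.Thread(target=worker, args=(c,)) for c in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

On a machine with at least two processors, the two tasks can genuinely overlap in time; on a single processor the same code still works but the threads are interleaved instead.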
Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as
opposed to the data (data parallelism). Most real programs fall somewhere on a continuum between
task parallelism and data parallelism.[3]
PRAM MODEL
The Parallel Random Access Machine (PRAM) is the model assumed for most
parallel algorithms. Here, multiple processors are
attached to a single block of memory. A PRAM model contains −
All the processors share a common memory unit. Processors can communicate
among themselves through the shared memory only.
A memory access unit (MAU) connects the processors with the single shared
memory.
As multiple processors access the same shared memory, it may happen that
at any particular point of time more than one processor is accessing the
same memory location. Suppose one is reading that location while another is
writing to it; the result is unpredictable. To avoid this, some control
mechanism, like a lock or semaphore, is implemented to ensure mutual
exclusion.
To solve this problem, the following constraints have been enforced on the PRAM
model −
Exclusive Read Exclusive Write (EREW) − No two processors are allowed
to read from or write to the same memory location at the same time.
Concurrent Read Exclusive Write (CREW) − All the processors are
allowed to read from the same memory location at the same time, but are not
allowed to write to the same memory location at the same time.
Concurrent Read Concurrent Write (CRCW) − All the processors are allowed
to read from or write to the same memory location at the same time.
There are many methods to implement the PRAM model, but the most
prominent ones are the shared memory model, the message passing model and
the data parallel model.
Shared memory programming has been implemented in the following −
Thread libraries − A thread library allows multiple threads of control to run
concurrently in the same memory space. It supports multithreading through a
library of subroutines, which include routines for creating and destroying
threads and for scheduling and synchronizing their execution.
Merits − Due to the closeness of memory to the CPU, data sharing among threads is fast and
uniform.
Demerits − It is not portable.
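The thread-library style of shared memory programming can be sketched as follows (a minimal example using Python's `threading` module as the thread library; the shared counter is invented for illustration). Note the lock enforcing the mutual exclusion discussed above:

```python
import threading

counter = 0
lock = threading.Lock()  # control mechanism ensuring mutual exclusion

def worker():
    global counter
    for _ in range(10000):
        with lock:       # only one thread may update the shared location at a time
            counter += 1

# Four threads of control running concurrently in the same memory space.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```

Without the lock, concurrent read-modify-write sequences could interleave and lose updates, which is exactly the "confusion" the PRAM constraints are designed to rule out.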
In the message passing model, a message typically carries −
The address of the processor from which the message is being sent;
The starting address of the memory location of the data in the sending processor;
The starting address of the memory location for the data in the receiving processor.
Processors can communicate with each other by any of the following methods
−
Point-to-Point Communication
Collective Communication
Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here,
a message can be sent from the sending processor to a receiving processor
by any of the following transfer modes −
Synchronous mode − The next message is sent only after receiving a
confirmation that the previous message has been delivered, to maintain the
sequence of messages.
Asynchronous mode − The next message is sent without waiting for a
confirmation of delivery of the previous message.
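Synchronous point-to-point transfer can be sketched as follows (a toy model: the two "processors" are threads and the channels are queues, which is an assumption of this example, not part of the text):

```python
import queue
import threading

# Point-to-point channel between a sending and a receiving "processor".
data_ch = queue.Queue()   # sender -> receiver messages
ack_ch = queue.Queue()    # receiver -> sender confirmations

def receiver(n_msgs, received):
    for _ in range(n_msgs):
        received.append(data_ch.get())
        ack_ch.put("ack")             # confirm delivery of this message

def send_sync(messages):
    for msg in messages:
        data_ch.put(msg)
        assert ack_ch.get() == "ack"  # synchronous mode: wait for the
                                      # confirmation before the next send

received = []
r = threading.Thread(target=receiver, args=(2, received))
r.start()
send_sync(["hello", "world"])
r.join()
print(received)  # ['hello', 'world']
```

Because each send blocks until the previous message is confirmed, the messages are guaranteed to arrive in order, which is the point of synchronous mode.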
Collective Communication
Collective communication involves more than two processors for message
passing. Modes such as broadcast (one-to-all), scatter and gather allow
collective communication.
Merits of the message passing model − It is portable.
Demerits − Performance depends on the communication network between the nodes.
Features of PVM (Parallel Virtual Machine) −
For a given run of a PVM program, users can select the group of machines;
It is a message-passing model;
Process-based computation;
Data parallel languages help to specify the data decomposition and mapping
to the processors. They also include data distribution statements that allow the
programmer to have control over the data – for example, which data will go to
which processor – to reduce the amount of communication among the
processors.
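The kind of data decomposition such languages automate can be sketched by hand (a minimal illustration; the block distribution rule shown is a standard one and is assumed here, not taken from the text):

```python
# Block decomposition: distribute an array across processors so that each
# "processor" (here just an index p) owns one contiguous chunk, with any
# remainder spread over the first few processors.
def block_decompose(data, num_procs):
    n = len(data)
    chunks = []
    for p in range(num_procs):
        start = p * (n // num_procs) + min(p, n % num_procs)
        size = n // num_procs + (1 if p < n % num_procs else 0)
        chunks.append(data[start:start + size])
    return chunks

data = list(range(10))
print(block_decompose(data, 4))  # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```

Keeping each processor's data contiguous and balanced is what reduces the communication among processors that the paragraph mentions.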
Multicomputers
Multicomputers are distributed memory MIMD architectures. The following
diagram shows a conceptual model of a multicomputer −
Third-generation multicomputers are the next generation of computers, in which
VLSI-implemented nodes will be used. Each node may have a 14-MIPS processor,
20-Mbytes/s routing channels and 16 Kbytes of RAM integrated on a single
chip.
Message-Routing Schemes
In a multicomputer with the store-and-forward routing scheme, packets are the
smallest unit of information transmission. In wormhole-routed networks,
packets are further divided into flits. Packet length is determined by the
routing scheme and network implementation, whereas flit length is
affected by the network size.
In Store and forward routing, packets are the basic unit of information
transmission. In this case, each node uses a packet buffer. A packet is
transmitted from a source node to a destination node through a sequence of
intermediate nodes. Latency is directly proportional to the distance between
the source and the destination.
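The latency difference between the two schemes can be sketched numerically (the packet, flit and bandwidth values below are invented, and the formulas are the usual first-order approximations, not ones given in the text):

```python
# Store-and-forward: the WHOLE packet is buffered at every intermediate node,
# so latency grows with (packet length x distance).
def store_and_forward_latency(packet_len, bandwidth, hops):
    return (packet_len / bandwidth) * hops

# Wormhole routing: only the header flit pays the full per-hop cost; the rest
# of the packet is pipelined along behind it.
def wormhole_latency(packet_len, flit_len, bandwidth, hops):
    return (flit_len / bandwidth) * hops + packet_len / bandwidth

L, F, B = 1024, 8, 1.0   # packet bytes, flit bytes, bytes per time unit (assumed)
for d in (1, 4, 16):
    print(d, store_and_forward_latency(L, B, d), wormhole_latency(L, F, B, d))
```

The printed values show store-and-forward latency growing directly with distance, while wormhole latency is nearly independent of it once the packet is long compared with a flit.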
When all the channels are occupied by messages and none of the channels in
the cycle is freed, a deadlock occurs. To avoid this, a deadlock
avoidance scheme has to be followed.
Star-Connected Network
In a star-connected network, one processor acts as the central processor. Every other
processor has a communication link connecting it to this processor. Figure 2.14(b) shows a
star-connected network of nine processors. The star-connected network is similar to bus-
based networks. Communication between any pair of processors is routed through the
central processor, just as the shared bus forms the medium for all communication in a bus-
based network. The central processor is the bottleneck in the star topology.
Due to the large number of links in completely connected networks, sparser networks are
typically used to build parallel computers. A family of such networks spans the space of
linear arrays and hypercubes. A linear array is a static network in which each node (except
the two nodes at the ends) has two neighbors, one each to its left and right. A simple
extension of the linear array (Figure 2.15(a)) is the ring or a 1-D torus (Figure 2.15(b)).
The ring has a wraparound connection between the extremities of the linear array. In this
case, each node has two neighbors.
Figure 2.15. Linear arrays: (a) with no wraparound links; (b) with
wraparound link.
A two-dimensional mesh illustrated in Figure 2.16(a) is an extension of the linear array to
two dimensions. Each dimension has √p nodes, with a node identified by a two-tuple (i, j).
Every node (except those on the periphery) is connected to four other nodes whose indices
differ in any dimension by one. A 2-D mesh has the property that it can be laid out in 2-D
space, making it attractive from a wiring standpoint. Furthermore, a variety of regularly
structured computations map very naturally to a 2-D mesh. For this reason, 2-D meshes
were often used as interconnects in parallel machines. Two dimensional meshes can be
augmented with wraparound links to form two dimensional tori illustrated in Figure 2.16(b).
The three-dimensional cube is a generalization of the 2-D mesh to three dimensions, as
illustrated in Figure 2.16(c). Each node element in a 3-D cube, with the exception of those
on the periphery, is connected to six other nodes, two along each of the three dimensions. A
variety of physical simulations commonly executed on parallel computers (for example, 3-D
weather modeling, structural modeling, etc.) can be mapped naturally to 3-D network
topologies. For this reason, 3-D cubes are used commonly in interconnection networks for
parallel computers (for example, in the Cray T3E).
Figure 2.16. Two and three dimensional meshes: (a) 2-D mesh with no
wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D
mesh with no wraparound.
The general class of k-d meshes refers to the class of topologies consisting of d dimensions
with k nodes along each dimension. Just as a linear array forms one extreme of the k-
d mesh family, the other extreme is formed by an interesting topology called the hypercube.
The hypercube topology has two nodes along each dimension and log p dimensions. The
construction of a hypercube is illustrated in Figure 2.17. A zero-dimensional hypercube
consists of 2^0, i.e., one node. A one-dimensional hypercube is constructed from two zero-dimensional hypercubes by connecting them. A two-dimensional hypercube of four nodes is
constructed from two one-dimensional hypercubes by connecting corresponding nodes. In
general a d-dimensional hypercube is constructed by connecting corresponding nodes of two
(d - 1) dimensional hypercubes. Figure 2.17 illustrates this for up to 16 nodes in a 4-D
hypercube.
Figure 2.17. Construction of hypercubes from hypercubes of lower
dimension.
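This construction gives hypercube nodes a convenient property: if nodes are labeled with d-bit numbers, two nodes are connected exactly when their labels differ in one bit. A short sketch (the labeling convention is the standard one, assumed here):

```python
# In a d-dimensional hypercube, each node has a d-bit label, and flipping
# each of the d bits enumerates that node's d neighbors.
def hypercube_neighbors(node, d):
    return [node ^ (1 << bit) for bit in range(d)]

# Node 0 (binary 0000) in the 4-D hypercube of 16 nodes:
print(hypercube_neighbors(0, 4))   # [1, 2, 4, 8]
# Node 5 (binary 0101):
print(hypercube_neighbors(5, 4))   # [4, 7, 1, 13]
```

This bit-flip view mirrors the recursive construction: connecting corresponding nodes of two (d - 1)-dimensional hypercubes is the same as adding one more bit to every label and linking nodes that differ only in that new bit.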
Tree-Based Networks
A tree network is one in which there is only one path between any pair of nodes. Both
linear arrays and star-connected networks are special cases of tree networks. Figure
2.18 shows networks based on complete binary trees. Static tree networks have a
processing element at each node of the tree (Figure 2.18(a)). Tree networks also have a
dynamic counterpart. In a dynamic tree network, nodes at intermediate levels are switching
nodes and the leaf nodes are processing elements (Figure 2.18(b)).
Figure 2.18. Complete binary tree networks: (a) a static tree network; and
(b) a dynamic tree network.
To route a message in a tree, the source node sends the message up the tree until it
reaches the node at the root of the smallest subtree containing both the source and
destination nodes. Then the message is routed down the tree towards the destination node.
Tree networks suffer from a communication bottleneck at higher levels of the tree. For
example, when many nodes in the left subtree of a node communicate with nodes in the
right subtree, the root node must handle all the messages. This problem can be alleviated in
dynamic tree networks by increasing the number of communication links and switching
nodes closer to the root. This network, also called a fat tree, is illustrated in Figure 2.19.
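The tree routing rule described above (climb to the root of the smallest subtree containing both endpoints, then descend) can be sketched for a complete binary tree with heap-style labels, where node k's parent is k // 2 (this labeling is an assumption of the example, not given in the text):

```python
# Route a message in a complete binary tree: climb from source and
# destination toward the root until the two paths meet (the root of the
# smallest subtree containing both), then descend to the destination.
def tree_route(src, dst):
    up_src, up_dst = [], []
    a, b = src, dst
    while a != b:
        if a > b:                 # deeper (or equal-depth, larger) label climbs
            up_src.append(a)
            a //= 2
        else:
            up_dst.append(b)
            b //= 2
    # a == b is now the lowest common ancestor
    return up_src + [a] + list(reversed(up_dst))

print(tree_route(4, 6))  # [4, 2, 1, 3, 6]
```

In this example the message from node 4 to node 6 must pass through the root (node 1), illustrating the bottleneck at higher levels of the tree that fat trees are designed to relieve.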
Unit 4
Shared memory
In computer science, shared memory is memory that may be simultaneously accessed by multiple
programs with an intent to provide communication among them or avoid redundant copies. Shared
memory is an efficient means of passing data between programs. Depending on context, programs
may run on a single processor or on multiple separate processors.
Using memory for communication inside a single program, e.g. among its multiple threads, is also
referred to as shared memory.
In computer hardware, shared memory refers to a (typically large) block of random access
memory (RAM) that can be accessed by several different central processing units (CPUs) in
a multiprocessor computer system.
Shared memory systems may use:[1]
uniform memory access (UMA): all the processors share the physical memory uniformly;
non-uniform memory access (NUMA): memory access time depends on the memory location
relative to a processor;
cache-only memory architecture (COMA): the local memories of the processors at each node are
used as cache instead of as actual main memory.
A shared memory system is relatively easy to program since all processors share a single view of
data, and communication between processors can be as fast as memory accesses to the same
location. The issue with shared memory systems is that many CPUs need fast access to memory
and will likely cache memory, which has two complications:
access time degradation: when several processors try to access the same memory location it
causes contention. Trying to access nearby memory locations may cause false sharing. Shared
memory computers cannot scale very well. Most of them have ten or fewer processors;
lack of data coherence: whenever one cache is updated with information that may be used by
other processors, the change needs to be reflected to the other processors, otherwise the
different processors will be working with incoherent data. Such cache coherence protocols can,
when they work well, provide extremely high-performance access to shared information between
multiple processors. On the other hand, they can sometimes become overloaded and become a
bottleneck to performance.
Technologies like crossbar switches, Omega networks, HyperTransport or front-side bus can be
used to dampen the bottleneck-effects.
In case of a Heterogeneous System Architecture (processor architecture that integrates different
types of processors, such as CPUs and GPUs, with shared memory), the memory management
unit (MMU) of the CPU and the input–output memory management unit (IOMMU) of the GPU have
to share certain characteristics, like a common address space.
The alternatives to shared memory are distributed memory and distributed shared memory, each
having a similar set of issues.
Shared Memory
Shared memory multiprocessors are one of the most important classes of
parallel machines. They give better throughput on multiprogramming workloads
and support parallel programs.
In this case, the computer system allows a processor and a set of I/O
controllers to access a collection of memory modules through some hardware
interconnection. Memory capacity is increased by adding memory
modules, and I/O capacity is increased by adding devices to an I/O controller or
by adding additional I/O controllers. Processing capacity can be increased by
waiting for a faster processor to become available or by adding more processors.
All the resources are organized around a central memory bus. Through the
bus access mechanism, any processor can access any physical address in the
system. As all the processors are equidistant from all memory locations,
the access time or latency is the same for every processor on any memory
location. This is called a symmetric multiprocessor.
Message-Passing Architecture
Message passing architecture is also an important class of parallel machines.
It provides communication among processors as explicit I/O operations. In
this case, the communication is carried out at the I/O level instead of
through the memory system.
Send and receive are the most common user-level communication operations
in a message passing system. Send specifies a local data buffer (which is to be
transmitted) and a receiving remote processor. Receive specifies a sending
process and a local data buffer in which the transmitted data will be placed.
In the send operation, an identifier or tag is attached to the message, and the
receive operation specifies a matching rule, like a specific tag from a
specific processor or any tag from any processor.
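The tag-matching rule can be sketched with a toy mailbox (the `ANY_TAG` wildcard and function names below are invented for illustration; this is a sketch of the idea, not a real message passing API):

```python
import queue

ANY_TAG = -1          # wildcard: match a message with any tag
mailbox = queue.Queue()

def send(data, tag):
    # Each message carries its identifier (tag) along with the data.
    mailbox.put((tag, data))

def receive(tag):
    # Return the first message whose tag satisfies the matching rule,
    # deferring (and re-queuing) any non-matching messages.
    deferred = []
    while True:
        t, data = mailbox.get()
        if tag == ANY_TAG or t == tag:
            for m in deferred:
                mailbox.put(m)
            return data
        deferred.append((t, data))

send("temperature=21", tag=7)
send("pressure=1013", tag=9)
print(receive(9))        # matches tag 9  -> pressure=1013
print(receive(ANY_TAG))  # matches any tag -> temperature=21
```

A real system would also match on the sending processor's identity, per the matching rule described above; the sketch keeps only the tag to stay short.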
Shared-Memory Multiprocessors
The three most common shared memory multiprocessor models are uniform
memory access (UMA), non-uniform memory access (NUMA) and cache-only
memory architecture (COMA).
When all the processors have equal access to all the peripheral devices, the
system is called a symmetric multiprocessor. When only one or a few
processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.
Non-uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of
the memory word. Here, the shared memory is physically distributed among
all the processors as local memories. The collection of all local memories
forms a global address space which can be accessed by all the processors.
Load balancing (computing)
In computing, load balancing improves the distribution of workloads across multiple computing
resources, such as computers, a computer cluster, network links, central processing units, or disk
drives.[1] Load balancing aims to optimize resource use, maximize throughput, minimize response
time, and avoid overload of any single resource. Using multiple components with load balancing
instead of a single component may increase reliability and availability through redundancy. Load
balancing usually involves dedicated software or hardware, such as a multilayer switch or a Domain
Name System server process.
Load balancing differs from channel bonding in that load balancing divides traffic between network
interfaces on a network socket (OSI model layer 4) basis, while channel bonding implies a division of
traffic between physical interfaces at a lower level, either per packet (OSI model Layer 3) or on a
data link (OSI model Layer 2) basis with a protocol like shortest path bridging.
Round-robin DNS
Main article: Round-robin DNS
An alternate method of load balancing, which does not require a dedicated software or hardware
node, is called round-robin DNS. In this technique, multiple IP addresses are associated with a
single domain name; clients are given IP addresses in round-robin fashion. Each IP address is
assigned to a client for a time quantum.
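The client-side effect of round-robin DNS can be sketched as follows (a toy model using the documentation-reserved example addresses 192.0.2.1 and 203.0.113.2; a real DNS server rotates its answer list, which is simulated here with a simple cycle):

```python
from itertools import cycle

# The IP addresses associated with one domain name.
ips = ["192.0.2.1", "203.0.113.2"]
next_ip = cycle(ips)  # hand addresses out in round-robin order

# Five successive clients each receive the "next" address in rotation.
assignments = [next(next_ip) for _ in range(5)]
print(assignments)
# ['192.0.2.1', '203.0.113.2', '192.0.2.1', '203.0.113.2', '192.0.2.1']
```

Successive clients are thus spread across the servers even though no dedicated load-balancing node sits between them.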
DNS delegation
Another more effective technique for load-balancing using DNS is to delegate www.example.org as
a sub-domain whose zone is served by each of the same servers that are serving the web site. This
technique works particularly well where individual servers are spread geographically on the Internet.
For example:
one.example.org A 192.0.2.1
two.example.org A 203.0.113.2
www.example.org NS one.example.org
www.example.org NS two.example.org
However, the zone file for www.example.org on each server is different such that each server
resolves its own IP address as the A-record.[2] On server one the zone file
for www.example.org reports: