
Unit 3

Parallel processing
Parallel Processing and Data Transfer Modes in a Computer System
Parallel Processing
Instead of processing each instruction sequentially, a parallel processing system provides
concurrent data processing to reduce the execution time. Such a system may have two
or more ALUs and should be able to execute two or more instructions at the same time.
The purpose of parallel processing is to speed up the computer processing capability and
increase its throughput.
NOTE: Throughput is the number of instructions that can be executed in a unit of time.
Parallel processing can be viewed from various levels of complexity. At the lowest level, we
distinguish between parallel and serial operations by the type of registers used. At the
higher level of complexity, parallel processing can be achieved by using multiple functional
units that perform many operations simultaneously.
Data Transfer Modes of a Computer System
According to how instruction and data streams are handled (Flynn's classification), computers
can be divided into four major groups:
SISD (Single Instruction Stream, Single Data Stream)
It represents the organization of a single computer containing a control unit, a processor unit
and a memory unit. Instructions are executed sequentially. Any parallel processing in this
organization is achieved by pipelining or by multiple functional units.
SIMD (Single Instruction Stream, Multiple Data Stream)
It represents an organization that includes multiple processing units under the control of a
common control unit. All processors receive the same instruction from the control unit but
operate on different parts of the data.
SIMD machines are highly specialized computers, used mainly for numerical problems that
are expressed in vector or matrix form. They are not well suited to other types of
computations.
MISD (Multiple Instruction Stream, Single Data Stream)
It consists of a single computer containing multiple processors connected with multiple
control units and a common memory unit. It is capable of processing several instructions
over single data stream simultaneously. MISD structure is only of theoretical interest since
no practical system has been constructed using this organization.
MIMD (Multiple Instruction Stream, Multiple Data Stream)
It represents an organization that is capable of processing several programs at the same
time. It is the organization of a single computer containing multiple processors connected
with multiple control units and a shared memory unit. The shared memory unit contains
multiple modules so that it can communicate with all processors simultaneously. Multiprocessors
and multicomputers are examples of MIMD. It meets the demand of large-scale computations.

Data parallelism
From Wikipedia, the free encyclopedia
Sequential vs. data-parallel job execution

Data parallelism is a form of parallelization across multiple processors in parallel
computing environments. It focuses on distributing the data across different nodes, which operate on
the data in parallel. It can be applied to regular data structures like arrays and matrices by working
on each element in parallel. It contrasts with task parallelism, which is another form of parallelism.
A data parallel job on an array of 'n' elements can be divided equally among all the processors. Let
us assume we want to sum all the elements of the given array and the time for a single addition
operation is Ta time units. In the case of sequential execution, the time taken by the process will be
n*Ta time units as it sums up all the elements of an array. On the other hand, if we execute this job
as a data parallel job on 4 processors the time taken would reduce to (n/4)*Ta + merging overhead
time units. Parallel execution thus gives a speedup of nearly 4 over sequential execution when the
merging overhead is small compared with the computation. One important
thing to note is that the locality of data references plays an important part in evaluating the
performance of a data parallel programming model. Locality of data depends on the memory
accesses performed by the program as well as the size of the cache.
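
The summation example above can be sketched in Python. This is a minimal illustration, not a benchmark: the names chunk, parallel_sum and estimated_time are hypothetical helpers introduced here, and the thread pool is only one possible way to run the partial sums. In CPython, threads will not give a real 4x speedup for pure-Python arithmetic; the point is the shape of the computation (partition, partial sums, merge).

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, parts):
    """Split data into `parts` contiguous slices of nearly equal size."""
    size, extra = divmod(len(data), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices


def parallel_sum(data, workers=4):
    """Sum each slice in its own worker, then merge the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(sum, chunk(data, workers)))
    return sum(partial)  # this final step is the "merging overhead"


def estimated_time(n, ta, workers=1, merge_overhead=0.0):
    """Time model from the text: (n / workers) * Ta + merging overhead."""
    return (n / workers) * ta + merge_overhead
```

With workers=1 and no overhead, estimated_time reduces to the sequential cost n*Ta, so the ratio of the two estimates shows how the merging overhead pulls the speedup below the number of processors.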

Task parallelism
Task parallelism (also known as function parallelism and control parallelism) is a form
of parallelization of computer code across multiple processors in parallel computing environments.
Task parallelism focuses on distributing tasks—concurrently performed by processes or threads—
across different processors. In contrast to data parallelism which involves running the same task on
different components of data, task parallelism is distinguished by running many different tasks at the
same time on the same data.[1] A common type of task parallelism is pipelining which consists of
moving a single set of data through a series of separate tasks where each task can execute
independently of the others.
In a multiprocessor system, task parallelism is achieved when each processor executes a different
thread (or process) on the same or different data. The threads may execute the same or different
code. In the general case, different execution threads communicate with one another as they work,
but this is not a requirement. Communication usually takes place by passing data from one thread to the
next as part of a workflow.[2]
As a simple example, if a system is running code on a 2-processor system (CPUs "a" & "b") in
a parallel environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task
"A" and CPU "b" to do task "B" simultaneously, thereby reducing the run time of the execution. The
tasks can be assigned using conditional statements that test which CPU is running the code.
Task parallelism emphasizes the distributed (parallelized) nature of the processing (i.e. threads), as
opposed to the data (data parallelism). Most real programs fall somewhere on a continuum between
task parallelism and data parallelism.[3]
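
The two-CPU example above can be sketched as follows. This is an illustrative sketch only: the functions task_a, task_b, worker and run_tasks are hypothetical names, the two "CPUs" are modelled as two threads, and the choice of sum and max as the two tasks is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def task_a(data):
    """Task "A": compute the sum of the data."""
    return sum(data)


def task_b(data):
    """Task "B": find the maximum of the data."""
    return max(data)


def worker(cpu, data):
    """Pick the task from the worker's identity with a conditional."""
    if cpu == "a":
        return task_a(data)
    elif cpu == "b":
        return task_b(data)
    raise ValueError(f"no task assigned to CPU {cpu!r}")


def run_tasks(data):
    """Run both workers concurrently on the same data."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {cpu: pool.submit(worker, cpu, data) for cpu in ("a", "b")}
        return {cpu: f.result() for cpu, f in futures.items()}
```

Both workers run the same code; only the conditional on the worker identity decides which task each one performs, which is what distinguishes this from the data-parallel case.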

The Parallel Random Access Machine (PRAM) is the model assumed by most
parallel algorithms. Here, multiple processors are
attached to a single block of memory. A PRAM model contains −

 A set of processors of the same type.

 A common memory unit shared by all the processors. Processors can communicate
among themselves through the shared memory only.

 A memory access unit (MAU) that connects the processors with the single shared
memory.

Here, n processors can perform independent operations on n data items in a
particular unit of time. This may result in simultaneous access to the
same memory location by different processors.

To solve this problem, the following constraints have been enforced on the PRAM
model −
 Exclusive Read Exclusive Write (EREW) − Here no two processors are allowed
to read from or write to the same memory location at the same time.

 Exclusive Read Concurrent Write (ERCW) − Here no two processors are
allowed to read from the same memory location at the same time, but are allowed
to write to the same memory location at the same time.

 Concurrent Read Exclusive Write (CREW) − Here all the processors are
allowed to read from the same memory location at the same time, but are not
allowed to write to the same memory location at the same time.

 Concurrent Read Concurrent Write (CRCW) − All the processors are allowed
to read from or write to the same memory location at the same time.
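
The four access rules can be expressed as a small checker. This is a sketch under stated assumptions: the MODELS table and the function step_allowed are hypothetical names, and the sketch assumes that each PRAM step has a separate read phase and write phase, so a read and a write in the same step never clash with each other.

```python
from collections import Counter

# Which kinds of concurrent access each PRAM variant permits.
MODELS = {
    "EREW": {"concurrent_read": False, "concurrent_write": False},
    "ERCW": {"concurrent_read": False, "concurrent_write": True},
    "CREW": {"concurrent_read": True, "concurrent_write": False},
    "CRCW": {"concurrent_read": True, "concurrent_write": True},
}


def step_allowed(model, reads, writes):
    """Return True if one PRAM step obeys the given model.

    `reads` and `writes` list the memory addresses touched in this step,
    one entry per processor access.
    """
    rules = MODELS[model]
    read_clash = any(c > 1 for c in Counter(reads).values())
    write_clash = any(c > 1 for c in Counter(writes).values())
    if read_clash and not rules["concurrent_read"]:
        return False
    if write_clash and not rules["concurrent_write"]:
        return False
    return True
```

For example, two processors reading address 0 in the same step is rejected under EREW and ERCW but accepted under CREW and CRCW.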

There are many methods to implement the PRAM model, but the most
prominent ones are −

 Shared memory model

 Message passing model

 Data parallel model

Shared Memory Model

Shared memory emphasizes control parallelism more than data
parallelism. In the shared memory model, multiple processes execute on
different processors independently, but they share a common memory space.
If any processor changes a memory location, the change is visible to all
the other processors.

As multiple processors access the same memory, it may happen that
at any particular point in time more than one processor is accessing the
same memory location. Suppose one is reading that location while another is
writing to it. The value read may then be inconsistent. To avoid this, a control
mechanism such as a lock or semaphore is used to ensure mutual exclusion.
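
A minimal sketch of lock-based mutual exclusion, using Python's standard threading module. The function name run_counter and the shared counter are assumptions made for this illustration; the pattern (acquire the lock, update the shared location, release the lock) is the point.

```python
import threading


def run_counter(workers=4, increments=10_000):
    """Increment a shared counter from several threads under a lock."""
    state = {"count": 0}          # memory shared by all threads
    lock = threading.Lock()       # guards every update to state["count"]

    def work():
        for _ in range(increments):
            with lock:            # only one thread at a time in here
                state["count"] += 1

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]
```

Because every read-modify-write of the counter happens while the lock is held, no update can be lost, and the final value is always workers * increments.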
Shared memory programming has been implemented in the following −

 Thread libraries − A thread library allows multiple threads of control to run
concurrently in the same address space. It provides an interface
that supports multithreading through a library of subroutines. It contains
subroutines for

o Creating and destroying threads

o Scheduling the execution of threads

o Passing data and messages between threads

o Saving and restoring thread contexts

Examples of thread libraries include − Solaris™ threads for Solaris, POSIX
threads as implemented in Linux, Win32 threads available in Windows NT and
Windows 2000, and Java™ threads as part of the standard Java™
Development Kit (JDK).

 Distributed Shared Memory (DSM) Systems − DSM systems create an
abstraction of shared memory on a loosely coupled architecture in order to
implement shared memory programming without hardware support. They
implement standard libraries and use the advanced user-level memory
management features present in modern operating systems. Examples include
TreadMarks, Munin, IVY, Shasta, Brazos, and Cashmere.

 Program Annotation Packages − These are implemented on architectures
having uniform memory access characteristics. The most notable example of
program annotation packages is OpenMP. OpenMP implements functional
parallelism. It mainly focuses on parallelization of loops.
The concept of shared memory provides low-level control of a shared memory
system, but it tends to be tedious and error-prone. It is more applicable to
system programming than to application programming.

Merits of Shared Memory Programming

 Global address space gives a user-friendly programming approach to memory.

 Due to the closeness of memory to CPU, data sharing among processes is fast and

 There is no need to specify distinctly the communication of data among processes.

 Process-communication overhead is negligible.

 It is very easy to learn.

Demerits of Shared Memory Programming

 It is not portable.

 Managing data locality is very difficult.

Message Passing Model

Message passing is the most commonly used parallel programming approach
in distributed memory systems. Here, the programmer has to determine the
parallelism. In this model, all the processors have their own local memory
unit and they exchange data through a communication network.

Processors use message-passing libraries for communication among
themselves. Along with the data being sent, the message contains the
following components −

 The address of the processor from which the message is being sent;
 Starting address of the memory location of the data in the sending processor;

 Data type of the sending data;

 Data size of the sending data;

 The address of the processor to which the message is being sent;

 Starting address of the memory location for the data in the receiving processor.

Processors can communicate with each other by any of the following methods

 Point-to-Point Communication

 Collective Communication

 Message Passing Interface

Point-to-Point Communication
Point-to-point communication is the simplest form of message passing. Here,
a message can be sent from the sending processor to a receiving processor
by any of the following transfer modes −

 Synchronous mode − The next message is sent only after receiving a
confirmation that the previous message has been delivered, to maintain the
sequence of the messages.

 Asynchronous mode − To send the next message, receipt of the confirmation
of the delivery of the previous message is not required.
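
The two transfer modes can be contrasted with a small sketch built on Python's standard queue and threading modules. The names receiver and send_all are hypothetical, and a thread-safe queue stands in for the communication network; task_done() plays the role of the delivery confirmation.

```python
import queue
import threading


def receiver(inbox, received, count):
    """Take `count` messages from the inbox and acknowledge each one."""
    for _ in range(count):
        msg = inbox.get()
        received.append(msg)
        inbox.task_done()          # acts as the delivery confirmation


def send_all(messages, synchronous):
    """Send messages to a receiver thread and return them in arrival order."""
    inbox, received = queue.Queue(), []
    t = threading.Thread(target=receiver, args=(inbox, received, len(messages)))
    t.start()
    for msg in messages:
        inbox.put(msg)             # hand the message to the "network"
        if synchronous:
            inbox.join()           # wait for confirmation before the next send
    t.join()                       # asynchronous mode only waits here, at the end
    return received
```

In synchronous mode the sender blocks after every message, so at most one message is ever unconfirmed; in asynchronous mode the sender can queue all messages before any of them has been confirmed.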

Collective Communication
Collective communication involves more than two processors for message
passing. Following modes allow collective communications −

 Barrier − Barrier mode is possible if all the processors included in the
communications run a particular block (known as the barrier block) for message
passing.

 Broadcast − Broadcasting is of two types −

o One-to-all − Here, one processor with a single operation sends same

message to all other processors.

o All-to-all − Here, all processors send message to all other processors.

Broadcast messages may be of three types −

 Personalized − Unique messages are sent to all other destination processors.

 Non-personalized − All the destination processors receive the same message.

 Reduction − In reduction broadcasting, one processor of the group collects all
the messages from all other processors in the group and combines them into a single
message which all other processors in the group can access.
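
The broadcast and reduction patterns above can be sketched as pure functions over a list in which entry i is the message held by processor i. This is a simulation only, with no real communication; the function names one_to_all, all_to_all and reduction are hypothetical.

```python
from functools import reduce
import operator


def one_to_all(local, root):
    """One-to-all broadcast: every processor ends up with root's message."""
    return [local[root] for _ in local]


def all_to_all(local):
    """All-to-all broadcast: every processor ends up with every message."""
    return [list(local) for _ in local]


def reduction(local, op=operator.add):
    """Combine all messages with `op`; the result is visible to everyone."""
    combined = reduce(op, local)
    return [combined for _ in local]
```

The return value of each function is the state after the operation, again indexed by processor, so reduction([1, 2, 3, 4]) leaves the combined value 10 available to all four processors.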

Merits of Message Passing

 Provides low-level control of parallelism;

 It is portable;

 Less error prone;

 Less overhead in parallel synchronization and data distribution.

Demerits of Message Passing

 As compared to parallel shared-memory code, message-passing code generally
needs more software overhead.

Message Passing Libraries

There are many message-passing libraries. Here, we will discuss two of the
most-used message-passing libraries −

 Message Passing Interface (MPI)

 Parallel Virtual Machine (PVM)


Message Passing Interface (MPI)

MPI is a universal standard for communication among all the concurrent
processes in a distributed memory system. Most of the commonly used
parallel computing platforms provide at least one implementation of the message
passing interface. It is implemented as a collection of predefined
functions, called a library, that can be called from languages such as C, C++,
Fortran, etc. MPI implementations are both fast and portable compared with other
message passing libraries.

Merits of Message Passing Interface

 Runs on either shared memory architectures or distributed memory architectures;

 Each processor has its own local variables;

 As compared to large shared memory computers, distributed memory computers
are less expensive.

Demerits of Message Passing Interface

 More programming changes are required for parallel algorithm;

 Sometimes difficult to debug; and

 Performance is limited by the communication network between the nodes.


Parallel Virtual Machine (PVM)

PVM is a portable message passing system, designed to connect separate
heterogeneous host machines to form a single virtual machine. It is a single
manageable parallel computing resource. Large computational problems like
superconductivity studies, molecular dynamics simulations, and matrix
algorithms can be solved more cost effectively by using the memory and the
aggregate power of many computers. It manages all message routing, data
conversion, and task scheduling in the network of incompatible computer
architectures.

Features of PVM

 Very easy to install and configure;

 Multiple users can use PVM at the same time;

 One user can execute multiple applications;

 It’s a small package;

 Supports C, C++, Fortran;

 For a given run of a PVM program, users can select the group of machines;

 It is a message-passing model;

 Process-based computation;

 Supports heterogeneous architecture.

Data Parallel Programming

The major focus of data parallel programming model is on performing
operations on a data set simultaneously. The data set is organized into some
structure like an array, hypercube, etc. Processors perform operations
collectively on the same data structure. Each task is performed on a different
partition of the same data structure.

It is restrictive, as not all algorithms can be specified in terms of data
parallelism. This is the reason why data parallelism is not universal.

Data parallel languages help to specify the data decomposition and mapping
to the processors. They also include data distribution statements that allow the
programmer to have control on data – for example, which data will go on
which processor – to reduce the amount of communication within the

Multiprocessor System Interconnects

Parallel processing needs the use of efficient system interconnects for fast
communication among the Input/Output and peripheral devices,
multiprocessors and shared memory.

Hierarchical Bus Systems

A hierarchical bus system consists of a hierarchy of buses connecting various
systems and sub-systems/components in a computer. Each bus is made up
of a number of signal, control, and power lines. Different buses like local
buses, backplane buses and I/O buses are used to perform different
interconnection functions.

Local buses are the buses implemented on the printed-circuit boards. A
backplane bus is a printed circuit on which many connectors are used to plug
in functional boards. Buses which connect input/output devices to a computer
system are known as I/O buses.

Crossbar switch and Multiport Memory

Switched networks give dynamic interconnections among the inputs and
outputs. Small or medium size systems mostly use crossbar networks.
Multistage networks can be expanded to the larger systems, if the increased
latency problem can be solved.

Both the crossbar switch and the multiport memory organization are single-stage
networks. Although a single-stage network is cheaper to build, multiple
passes may be needed to establish certain connections. A multistage network
has more than one stage of switch boxes. These networks should be able to
connect any input to any output.

Multistage and Combining Networks

Multistage networks, or multistage interconnection networks, are a class of
high-speed computer networks mainly composed of processing
elements on one end of the network and memory elements on the other end,
connected by switching elements.

These networks are used to build larger multiprocessor systems. Examples
include the Omega network, the Butterfly network and many more.

Multicomputers are distributed memory MIMD architectures. The following
diagram shows a conceptual model of a multicomputer −

Multicomputers are message-passing machines which use the packet switching
method to exchange data. Here, each processor has a private memory, but there is
no global address space, as a processor can access only its own local memory.
So, communication is not transparent: programmers have to explicitly
put communication primitives in their code.

Having no globally accessible memory is a drawback of multicomputers. This
can be solved by using the following two schemes −
 Virtual Shared Memory (VSM)

 Shared Virtual Memory (SVM)

In these schemes, the application programmer assumes a big shared memory
which is globally addressable. If required, the memory references made by
applications are translated into the message-passing paradigm.

Virtual Shared Memory (VSM)

VSM is a hardware implementation. The virtual memory system of the
operating system is transparently implemented on top of VSM, so the
operating system thinks it is running on a machine with a shared memory.

Shared Virtual Memory (SVM)

SVM is a software implementation at the Operating System level with
hardware support from the Memory Management Unit (MMU) of the
processor. Here, the unit of sharing is Operating System memory pages.

If a processor addresses a particular memory location, the MMU determines
whether the memory page associated with the memory access is in the local
memory or not. If the page is not in the memory, in a normal computer
system it is swapped in from the disk by the Operating System. But, in SVM,
the Operating System fetches the page from the remote node which owns
that particular page.
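
The page-fetch behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch: the classes Node and SharedVirtualMemory are hypothetical, pages are plain Python objects, and the sketch ignores write coherence and page invalidation, which any real SVM system has to handle.

```python
class Node:
    """One multicomputer node with its own local page store."""

    def __init__(self, name):
        self.name = name
        self.pages = {}            # page number -> page contents held locally
        self.remote_fetches = 0


class SharedVirtualMemory:
    """Toy directory that records which node owns each page."""

    def __init__(self, nodes):
        self.nodes = {n.name: n for n in nodes}
        self.owner = {}            # page number -> name of owning node

    def create(self, node_name, page, contents):
        """Place a new page on a node and record that node as its owner."""
        self.nodes[node_name].pages[page] = contents
        self.owner[page] = node_name

    def read(self, node_name, page):
        """Read a page, fetching it from the owner on a local miss."""
        node = self.nodes[node_name]
        if page not in node.pages:                 # local miss ("page fault")
            owner = self.nodes[self.owner[page]]
            node.pages[page] = owner.pages[page]   # fetch a copy from the owner
            node.remote_fetches += 1
        return node.pages[page]
```

The first read of a remote page costs one fetch from the owning node; later reads of the same page are served from local memory.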

Three Generations of Multicomputers

In this section, we will discuss three generations of multicomputers.

Design Choices in the Past

While selecting a processor technology, a multicomputer designer chooses
low-cost medium-grain processors as building blocks. The majority of parallel
computers are built with standard off-the-shelf microprocessors. Distributed
memory was chosen for multicomputers rather than shared memory,
which would limit the scalability. Each processor has its own local memory.

For the interconnection scheme, multicomputers use message-passing,
point-to-point direct networks rather than address-switching networks. For the control
strategy, designers of multicomputers choose asynchronous MIMD,
MPMD, and SPMD operations. Caltech’s Cosmic Cube (Seitz, 1983) is the first
of the first-generation multicomputers.

Present and Future Development

The next generation of computers evolved from medium-grain to fine-grain
multicomputers using a globally shared virtual memory. Second-generation
multicomputers are still in use at present. With better processors such as the
i386 and i860, second-generation computers have developed considerably.

Third-generation computers are the next generation of computers, in which
VLSI-implemented nodes will be used. Each node may have a 14-MIPS processor,
20-Mbytes/s routing channels and 16 Kbytes of RAM integrated on a single
chip.

The Intel Paragon System

Previously, homogeneous nodes were used to make hypercube
multicomputers, and all the functions were given to the host. This limited
the I/O bandwidth, so these computers could not be used to solve large-scale
problems efficiently or with high throughput. The Intel Paragon System
was designed to overcome this difficulty. It turned the multicomputer into an
application server with multiuser access in a network environment.

Message Passing Mechanisms

Message passing mechanisms in a multicomputer network need special
hardware and software support. In this section, we will discuss some

Message-Routing Schemes
In a multicomputer with a store-and-forward routing scheme, packets are the
smallest unit of information transmission. In wormhole-routed networks,
packets are further divided into flits. Packet length is determined by the
routing scheme and network implementation, whereas the flit length is
affected by the network size.

In store-and-forward routing, packets are the basic unit of information
transmission. In this case, each node uses a packet buffer. A packet is
transmitted from a source node to a destination node through a sequence of
intermediate nodes. Latency is directly proportional to the distance between
the source and the destination.

In wormhole routing, the transmission from the source node to the
destination node is done through a sequence of routers. All the flits of the
same packet are transmitted in an inseparable sequence in a pipelined
fashion. In this case, only the header flit knows where the packet is going.
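
The difference between the two schemes can be made concrete with a commonly used simplified latency model. The formulas below are not taken from the text above; they are a standard textbook approximation that ignores contention and routing delay, and the function and parameter names are assumptions of this sketch. In store-and-forward routing the whole packet is retransmitted at every hop, whereas in wormhole routing only the flit-sized pipeline fill depends on the distance.

```python
def store_and_forward_latency(packet_bits, bandwidth, hops):
    """Whole packet is received and resent at each node: (L / W) * (D + 1)."""
    return (packet_bits / bandwidth) * (hops + 1)


def wormhole_latency(packet_bits, flit_bits, bandwidth, hops):
    """Flits are pipelined through the routers: L / W + (F / W) * D."""
    return packet_bits / bandwidth + (flit_bits / bandwidth) * hops
```

Under this model, store-and-forward latency grows in proportion to the distance, as stated above, while wormhole latency is nearly independent of distance when the flit is much shorter than the packet.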

Deadlock and Virtual Channels

A virtual channel is a logical link between two nodes. It is formed by a flit buffer
in the source node, a flit buffer in the receiver node, and a physical channel between them.
When a physical channel is allocated for a pair, one source buffer is paired
with one receiver buffer to form a virtual channel.

When all the channels in a cycle are occupied by messages and none of them
is freed, a deadlock occurs. To prevent this, a deadlock
avoidance scheme has to be followed.
Star-Connected Network

In a star-connected network, one processor acts as the central processor. Every other
processor has a communication link connecting it to this processor. Figure 2.14(b) shows a
star-connected network of nine processors. The star-connected network is similar to bus-
based networks. Communication between any pair of processors is routed through the
central processor, just as the shared bus forms the medium for all communication in a bus-
based network. The central processor is the bottleneck in the star topology.

Linear Arrays, Meshes, and k-d Meshes

Due to the large number of links in completely connected networks, sparser networks are
typically used to build parallel computers. A family of such networks spans the space of
linear arrays and hypercubes. A linear array is a static network in which each node (except
the two nodes at the ends) has two neighbors, one each to its left and right. A simple
extension of the linear array (Figure 2.15(a)) is the ring or a 1-D torus (Figure 2.15(b)).
The ring has a wraparound connection between the extremities of the linear array. In this
case, each node has two neighbors.

Figure 2.15. Linear arrays: (a) with no wraparound links; (b) with
wraparound link.
A two-dimensional mesh illustrated in Figure 2.16(a) is an extension of the linear array to
two dimensions. Each dimension has sqrt(p) nodes (for a mesh of p nodes in total), with a
node identified by a two-tuple (i, j).
Every node (except those on the periphery) is connected to four other nodes whose indices
differ in any dimension by one. A 2-D mesh has the property that it can be laid out in 2-D
space, making it attractive from a wiring standpoint. Furthermore, a variety of regularly
structured computations map very naturally to a 2-D mesh. For this reason, 2-D meshes
were often used as interconnects in parallel machines. Two dimensional meshes can be
augmented with wraparound links to form two dimensional tori illustrated in Figure 2.16(b).
The three-dimensional cube is a generalization of the 2-D mesh to three dimensions, as
illustrated in Figure 2.16(c). Each node element in a 3-D cube, with the exception of those
on the periphery, is connected to six other nodes, two along each of the three dimensions. A
variety of physical simulations commonly executed on parallel computers (for example, 3-D
weather modeling, structural modeling, etc.) can be mapped naturally to 3-D network
topologies. For this reason, 3-D cubes are used commonly in interconnection networks for
parallel computers (for example, in the Cray T3E).
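
The neighbor relation in a 2-D mesh and a 2-D torus can be sketched as follows. The function name mesh_neighbors and its parameters are hypothetical; nodes are identified by the two-tuple (i, j) used in the text.

```python
def mesh_neighbors(i, j, rows, cols, wraparound=False):
    """Neighbors of node (i, j) in a rows x cols mesh, or torus if wraparound."""
    result = []
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if wraparound:
            # Torus: indices wrap around at the edges.
            result.append((ni % rows, nj % cols))
        elif 0 <= ni < rows and 0 <= nj < cols:
            # Plain mesh: links that would leave the grid do not exist.
            result.append((ni, nj))
    return result
```

An interior node has four neighbors in both cases; a corner node has only two neighbors in the plain mesh but four in the torus, because the wraparound links connect the extremities.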

Figure 2.16. Two and three dimensional meshes: (a) 2-D mesh with no
wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D
mesh with no wraparound.

The general class of k-d meshes refers to the class of topologies consisting of d dimensions
with k nodes along each dimension. Just as a linear array forms one extreme of the
k-d mesh family, the other extreme is formed by an interesting topology called the hypercube.
The hypercube topology has two nodes along each dimension and log p dimensions. The
construction of a hypercube is illustrated in Figure 2.17. A zero-dimensional hypercube
consists of 2^0, i.e., one node. A one-dimensional hypercube is constructed from two
zero-dimensional hypercubes by connecting them. A two-dimensional hypercube of four nodes is
constructed from two one-dimensional hypercubes by connecting corresponding nodes. In
general a d-dimensional hypercube is constructed by connecting corresponding nodes of two
(d - 1) dimensional hypercubes. Figure 2.17 illustrates this for up to 16 nodes in a 4-D
hypercube.

Figure 2.17. Construction of hypercubes from hypercubes of lower dimension.

It is useful to derive a numbering scheme for nodes in a hypercube. A simple numbering
scheme can be derived from the construction of a hypercube. As illustrated in Figure 2.17, if
we have a numbering of two subcubes of p/2 nodes, we can derive a numbering scheme for
the cube of p nodes by prefixing the labels of one of the subcubes with a "0" and the labels
of the other subcube with a "1". This numbering scheme has the useful property that the
minimum distance between two nodes is given by the number of bits that are different in
the two labels. For example, nodes labeled 0110 and 0101 are two links apart, since they
differ at two bit positions. This property is useful for deriving a number of parallel
algorithms for the hypercube architecture.
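
The distance property of this numbering scheme can be checked with a short sketch. The function names hypercube_distance and hypercube_neighbors are hypothetical; node labels are treated as integers whose binary digits are the label bits.

```python
def hypercube_distance(a, b):
    """Minimum number of links between nodes a and b: differing label bits."""
    return bin(a ^ b).count("1")


def hypercube_neighbors(node, dimensions):
    """Labels that differ from `node` in exactly one bit position."""
    return [node ^ (1 << bit) for bit in range(dimensions)]
```

For the example in the text, hypercube_distance(0b0110, 0b0101) evaluates to 2, since the labels differ in the two lowest bit positions. Each node in a d-dimensional hypercube has exactly d neighbors, one per bit.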

Tree-Based Networks

A tree network is one in which there is only one path between any pair of nodes. Both
linear arrays and star-connected networks are special cases of tree networks. Figure
2.18 shows networks based on complete binary trees. Static tree networks have a
processing element at each node of the tree (Figure 2.18(a)). Tree networks also have a
dynamic counterpart. In a dynamic tree network, nodes at intermediate levels are switching
nodes and the leaf nodes are processing elements (Figure 2.18(b)).

Figure 2.18. Complete binary tree networks: (a) a static tree network; and
(b) a dynamic tree network.

To route a message in a tree, the source node sends the message up the tree until it
reaches the node at the root of the smallest subtree containing both the source and
destination nodes. Then the message is routed down the tree towards the destination node.
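
The routing rule above can be sketched for a static complete binary tree. The numbering is an assumption of this sketch: the root is node 1 and the children of node k are 2k and 2k + 1, so the parent of any node is obtained by integer division by 2. The function name tree_route is hypothetical.

```python
def tree_route(src, dst):
    """Path from src to dst in a complete binary tree with root 1."""
    up, down = [src], [dst]
    a, b = src, dst
    while a != b:
        # The larger label is never closer to the root, so move it up a level.
        if a > b:
            a //= 2
            up.append(a)
        else:
            b //= 2
            down.append(b)
    # `a` is now the root of the smallest subtree containing both nodes.
    return up + down[::-1][1:]
```

The first half of the returned path climbs from the source to the common ancestor, and the second half descends to the destination; for two leaves in different halves of the tree, the path passes through the root.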

Tree networks suffer from a communication bottleneck at higher levels of the tree. For
example, when many nodes in the left subtree of a node communicate with nodes in the
right subtree, the root node must handle all the messages. This problem can be alleviated in
dynamic tree networks by increasing the number of communication links and switching
nodes closer to the root. This network, also called a fat tree, is illustrated in Figure 2.19.

Figure 2.19. A fat tree network of 16 processing nodes.

Unit 4
Shared memory
From Wikipedia, the free encyclopedia

An illustration of a shared memory system of three processors.

In computer science, shared memory is memory that may be simultaneously accessed by multiple
programs with an intent to provide communication among them or avoid redundant copies. Shared
memory is an efficient means of passing data between programs. Depending on context, programs
may run on a single processor or on multiple separate processors.
Using memory for communication inside a single program, e.g. among its multiple threads, is also
referred to as shared memory.
In computer hardware, shared memory refers to a (typically large) block of random access
memory (RAM) that can be accessed by several different central processing units (CPUs) in
a multiprocessor computer system.
Shared memory systems may use:[1]

 uniform memory access (UMA): all the processors share the physical memory uniformly;
 non-uniform memory access (NUMA): memory access time depends on the memory location
relative to a processor;
 cache-only memory architecture (COMA): the local memories for the processors at each node are
used as cache instead of as actual main memory.
A shared memory system is relatively easy to program since all processors share a single view of
data and the communication between processors can be as fast as memory accesses to the same
location. The issue with shared memory systems is that many CPUs need fast access to memory
and will likely cache memory, which has two complications:

 access time degradation: when several processors try to access the same memory location it
causes contention. Trying to access nearby memory locations may cause false sharing. Shared
memory computers cannot scale very well. Most of them have ten or fewer processors;
 lack of data coherence: whenever one cache is updated with information that may be used by
other processors, the change needs to be reflected to the other processors, otherwise the
different processors will be working with incoherent data. Such cache coherence protocols can,
when they work well, provide extremely high-performance access to shared information between
multiple processors. On the other hand, they can sometimes become overloaded and become a
bottleneck to performance.
Technologies like crossbar switches, Omega networks, HyperTransport or front-side bus can be
used to dampen the bottleneck-effects.
In case of a Heterogeneous System Architecture (processor architecture that integrates different
types of processors, such as CPUs and GPUs, with shared memory), the memory management
unit (MMU) of the CPU and the input–output memory management unit (IOMMU) of the GPU have
to share certain characteristics, like a common address space.
The alternatives to shared memory are distributed memory and distributed shared memory, each
having a similar set of issues.

Shared Memory
Shared memory multiprocessors are one of the most important classes of
parallel machines. They give better throughput on multiprogramming workloads
and supports parallel programs.

In this case, the computer system allows a processor and a set of I/O
controllers to access a collection of memory modules through some hardware
interconnection. The memory capacity is increased by adding memory
modules, and I/O capacity is increased by adding devices to an I/O controller or
by adding more I/O controllers. Processing capacity can be increased by
waiting for a faster processor to become available or by adding more processors.

All the resources are organized around a central memory bus. Through the
bus access mechanism, any processor can access any physical address in the
system. As all the processors are equidistant from all the memory locations,
the access time, or latency, to any memory location is the same for every
processor. This organization is called a symmetric multiprocessor.
Message-Passing Architecture
Message passing architecture is also an important class of parallel machines.
It provides communication among processors as explicit I/O operations. In
this case, the communication is combined at the I/O level, instead of the
memory system.

In message passing architecture, user communication is executed by using
operating system or library calls that perform many lower-level actions,
including the actual communication operation. As a result, there is a distance
between the programming model and the communication operations at the
physical hardware level.

Send and receive are the most common user-level communication operations
in a message passing system. Send specifies a local data buffer (which is to be
transmitted) and a receiving remote processor. Receive specifies a sending
process and a local data buffer in which the transmitted data will be placed.
In a send operation, an identifier or tag is attached to the message, and the
receive operation specifies a matching rule, such as a specific tag from a
specific processor or any tag from any processor.

The combination of a send and a matching receive completes a memory-to-memory
copy. Each end specifies its local data address and a pairwise
synchronization event.
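The send and receive operations and their tag-matching rule can be sketched with threads standing in for processors. This is a minimal sketch under stated assumptions: the Mailbox class, the send function, the ANY wildcard and the tag names are all hypothetical and do not correspond to any particular message passing library.

```python
import threading

ANY = None  # wildcard meaning "any tag" or "any source"


class Mailbox:
    """Hypothetical per-process mailbox of (source, tag, payload) messages."""

    def __init__(self):
        self._messages = []
        self._cond = threading.Condition()

    def deliver(self, source, tag, payload):
        with self._cond:
            self._messages.append((source, tag, payload))
            self._cond.notify_all()

    def receive(self, source=ANY, tag=ANY):
        # Block until a message satisfying the matching rule is available.
        with self._cond:
            while True:
                for i, (src, t, payload) in enumerate(self._messages):
                    if source in (ANY, src) and tag in (ANY, t):
                        del self._messages[i]
                        return src, t, payload
                self._cond.wait()


mailboxes = {0: Mailbox(), 1: Mailbox()}  # one mailbox per process rank


def send(source, dest, tag, data):
    # Copy the local buffer so sender and receiver never share memory.
    mailboxes[dest].deliver(source, tag, list(data))


def worker():
    send(source=1, dest=0, tag="noise", data=[0])
    send(source=1, dest=0, tag="result", data=[1, 2, 3])


t = threading.Thread(target=worker)
t.start()
first = mailboxes[0].receive(source=1, tag="result")  # specific tag and source
second = mailboxes[0].receive()                       # any tag from any source
t.join()
print(first, second)
```

The first receive skips the earlier "noise" message because its tag does not match, while the second receive accepts whatever is left. The blocking receive also acts as the synchronization point between the two ends.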

Shared-Memory Multicomputers
Common shared-memory multiprocessor models include:

Uniform Memory Access (UMA)

In this model, all the processors share the physical memory uniformly. All the
processors have equal access time to all the memory words. Each processor
may have a private cache memory. The same rule is followed for peripheral
devices.

When all the processors have equal access to all the peripheral devices, the
system is called a symmetric multiprocessor. When only one or a few
processors can access the peripheral devices, the system is called
an asymmetric multiprocessor.
Non-uniform Memory Access (NUMA)
In the NUMA multiprocessor model, the access time varies with the location of
the memory word. Here, the shared memory is physically distributed among
all the processors as local memories. The collection of all local memories
forms a global address space which can be accessed by all the processors.
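The dependence of access time on the location of the memory word can be shown with a toy model. The latency figures, the memory size per node and the function names below are illustrative assumptions, not measurements of any real machine.

```python
LOCAL_NS = 100         # assumed latency of a processor's own local memory
REMOTE_NS = 300        # assumed latency of another processor's local memory
WORDS_PER_NODE = 1024  # assumed size of each local memory, in words


def home_node(address):
    """Return the node whose local memory holds this global address."""
    return address // WORDS_PER_NODE


def access_time(cpu, address):
    """Latency depends on where the word lives, unlike in the UMA model."""
    return LOCAL_NS if home_node(address) == cpu else REMOTE_NS


print(access_time(cpu=0, address=10))    # word held in node 0's local memory
print(access_time(cpu=0, address=2000))  # word held in node 1's local memory
```

Both addresses belong to the same global address space, but only the first is local to processor 0.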
Load balancing (computing)
In computing, load balancing improves the distribution of workloads across multiple computing
resources, such as computers, a computer cluster, network links, central processing units, or disk
drives.[1] Load balancing aims to optimize resource use, maximize throughput, minimize response
time, and avoid overload of any single resource. Using multiple components with load balancing
instead of a single component may increase reliability and availability through redundancy. Load
balancing usually involves dedicated software or hardware, such as a multilayer switch or a Domain
Name System server process.
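One simple policy consistent with these goals is to send each incoming job to the resource that currently carries the least load. The text does not name a specific algorithm, so the greedy policy, the function name assign_jobs and the job costs below are assumptions chosen for illustration.

```python
import heapq


def assign_jobs(job_costs, n_workers):
    """Greedy least-loaded assignment; returns the total load per worker."""
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    loads = [0] * n_workers
    for cost in job_costs:
        load, w = heapq.heappop(heap)          # worker with the least load
        loads[w] = load + cost
        heapq.heappush(heap, (loads[w], w))
    return loads


print(assign_jobs([5, 3, 8, 2, 7, 4], n_workers=3))  # prints [12, 9, 8]
```

No single worker ends up with all the work, which is the sense in which load balancing avoids overloading one resource.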
Load balancing differs from channel bonding in that load balancing divides traffic between network
interfaces on a network socket (OSI model layer 4) basis, while channel bonding implies a division of
traffic between physical interfaces at a lower level, either per packet (OSI model Layer 3) or on a
data link (OSI model Layer 2) basis with a protocol like shortest path bridging.

Round-robin DNS
An alternate method of load balancing, which does not require a dedicated software or hardware
node, is called round-robin DNS. In this technique, multiple IP addresses are associated with a
single domain name; clients are given an IP address in round-robin fashion, and each address is
assigned to a client for a time quantum.
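The rotation can be sketched as follows. This is a simplified model rather than a DNS server: the addresses come from a documentation-only range, and make_resolver is a hypothetical helper that hands one address to each client in turn.

```python
from itertools import cycle

# Hypothetical address records for a single domain name.
ADDRESSES = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def make_resolver(addresses):
    """Return a function that hands out the addresses in rotating order."""
    rotation = cycle(addresses)
    return lambda: next(rotation)


resolve = make_resolver(ADDRESSES)
answers = [resolve() for _ in range(5)]  # five successive client lookups
print(answers)
```

After the third lookup the rotation wraps around, so the fourth client receives the first address again.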

DNS delegation
Another more effective technique for load-balancing using DNS is to delegate www.example.org as
a sub-domain whose zone is served by each of the same servers that are serving the web site. This
technique works particularly well where individual servers are spread geographically on the Internet.
For example:

one.example.org A
two.example.org A
www.example.org NS one.example.org
www.example.org NS two.example.org

However, the zone file for www.example.org on each server is different such that each server
resolves its own IP Address as the A-record.[2] On server one the zone file
for www.example.org reports:

Client-side random load balancing

Another approach to load balancing is to deliver a list of server IPs to the client, and then to have
the client randomly select an IP from the list on each connection.[3][4] This essentially relies on all
clients generating similar loads, and on the Law of Large Numbers[4], to achieve a reasonably flat
load distribution across servers. It has been claimed that client-side random load balancing tends to
provide better load distribution than round-robin DNS. This has been attributed to caching issues
with round-robin DNS: large DNS caching servers tend to skew its distribution, while client-side
random selection remains unaffected regardless of DNS caching.
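The effect of the Law of Large Numbers can be checked with a small simulation. This is a minimal sketch: the server list, the number of connections and the fixed seed are assumptions, and every connection is treated as generating the same load.

```python
import random
from collections import Counter

# Hypothetical server list delivered to the client (documentation-only range).
SERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]


def pick_server(servers, rng):
    """Client-side choice: select one server uniformly at random."""
    return rng.choice(servers)


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = Counter(pick_server(SERVERS, rng) for _ in range(30_000))
for server in SERVERS:
    print(server, counts[server])
```

With many connections each server receives close to one third of them, even though no component coordinates the choices.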