


Authored by:

Assistant Professor
Department of Information Science & Engineering
J.N.N.College of Engineering
SHIMOGA 577204
Email: sathya_sv@rediffmail.com


Sl.No  Units  Topics

Preface
1. UNIT - 1 Introduction to Computer Networks
2. UNIT - 2 Distributed Computing System: An Overview
3. UNIT - 3 Distributed Databases: An Overview
4. UNIT - 4 Levels of Distribution Transparency
5. UNIT - 5 Distributed Database Design
6. UNIT - 6 Overview of Query Processing
7. UNIT - 7 Transaction Management and Concurrency Control
8. UNIT - 8 Time and Synchronization
9. References


The nineteen seventies saw the extensive use of computers for building
powerful integrated database systems, which found a large number of
applications. The decade also witnessed the excellent price-performance ratio
offered by microprocessor-based workstations over mainframe systems.
The eighties saw the advent of computer networks, which allowed different
computers to be connected, to exchange data, and to share resources.
This led to the distributed database, which is an integrated database built on
top of a computer network rather than on a single computer. The data that
form the database are stored at different sites of the computer network. A lot
of research work has been done to solve the problems faced in building and
implementing distributed databases.
Knowledge of traditional databases and computer networks is necessary
to integrate them into a new discipline, the Distributed Computing System. This
has led to the concepts and design issues of Distributed Operating Systems,
which are now commercially available.
We begin the discussion with an introduction to computer networking
in UNIT-1, followed by a discussion of Distributed Computing Systems in
UNIT-2, where we present several models. The differences between distributed and
centralized databases are discussed in UNIT-3. UNIT-4 and UNIT-5
describe transparency and design issues of distributed databases. Query
processing is explained with the help of relational algebra and
relational calculus in UNIT-6. Finally, UNIT-7 and UNIT-8 explain relevant
issues in distributed data processing, such as transaction management and time and
synchronization.
The author depended heavily on the books written by Stefano Ceri,
Giuseppe Pelagatti, Pradeep K. Sinha and George Coulouris in preparing this
study material and is indebted to these authors. The author wishes to thank
S.N.Jagadeesha and Suma.S.S, who carefully reviewed the manuscript
and suggested many important improvements. The author would like to
express his heartfelt thanks to the KSOU for entrusting him with this work. Last
but not least, the author warmly thanks the technical staff of the CS & E
department and all his friends for their help in preparing this course material.

- Sathyanarayana.S.V

UNIT - 1


1.0 Objectives

1.1 Introduction
1.2 Network types
1.3 LAN Technologies
1.3.1 LAN Topologies
1.3.2 Medium Access Protocols
      CSMA/CD Protocol
      Token Ring Protocol
1.4 WAN Technologies
1.4.1 Switching Techniques
      Circuit Switching
      Packet Switching
1.4.2 Routing Techniques
      Static Routing
      Dynamic Routing
1.5 Communication Protocols
1.5.1 The OSI Reference Model
1.6 Summary

1.0 Objectives: Computer networks provide the necessary means for communication
between the computing elements of a system. By the end of this unit, you will
know the basic concepts of computer networking. We outline the characteristics of
local and wide area networks and summarize the principles of protocols and protocol
layering. Overall, this unit will help you to understand the advanced concept of task
execution called distributed computing.

1.1 Introduction:
A computer network is a communication system that links end systems by
communication lines and software protocols to exchange data between two processes
running on different end systems of the network. The end systems are often referred
to as nodes, sites, hosts, computers, machines, and so on. The nodes may vary in size
and function. Size-wise, a node may be a small microprocessor, a workstation, a
minicomputer, or a large supercomputer. Function-wise, a node may be a dedicated
system (such as a print server or a file server) without any capability for interactive
users, a single personal computer, or a general-purpose time-sharing system.
A distributed computing system is basically a computer network whose nodes
have their own local memory and also other hardware and software resources. A
distributed system, therefore, relies entirely on the underlying computer network for
the communication of data and control information between the nodes of which it
is composed. The performance and reliability of a distributed system depend to a
great extent on the performance and reliability of the underlying computer network.
Hence, a basic knowledge of computer networks is required for the study of
distributed operating systems. Therefore, this unit deals with important aspects of
networking concepts and designs emphasizing the aspects needed for designing
distributed operating systems.
1.2 Network Types:
Networks are broadly classified into two types: local area networks (LANs) and
wide-area networks (WANs). WANs are also referred to as long-haul networks. The key
characteristics that are often used to differentiate between these two types of networks are
as follows:
1. Geographic distribution: The main difference between the two types of networks is
the way in which they are geographically distributed. A LAN is restricted to a
limited geographic coverage of a few kilometers. Therefore, LANs typically
provide communication facilities within a building or a campus, whereas
WANs may operate nationwide or even worldwide.
2. Data rate: Data transmission rates are usually much higher in LANs than in
WANs. Transmission rates in LANs usually range from 0.2 megabits per
second (Mbps) to 1 gigabit per second (Gbps), whereas transmission rates in
WANs usually range from 1200 bits per second to slightly over 1 Mbps.
3. Error rate: Local area networks generally experience fewer data transmission
errors than WANs do. Typically, bit error rates are in the range 10^-8 to 10^-12 with
LANs as opposed to 10^-5 to 10^-7 with WANs.
4. Communication link: The most common communication links used in LANs
are twisted pair, coaxial cable, and fiber optics. On the other hand, WANs are
physically distributed over a large geographic area, and the communication
links used are generally relatively slow and unreliable. Telephone lines,
microwave links, and satellite channels are used as links in WANs.
5. Ownership: A LAN is typically owned by a single organization because of its
limited geographic coverage. A WAN, however, is usually formed by interconnecting
multiple LANs, each of which may belong to a different organization.
Therefore, administrative and maintenance complexities and costs for LANs
are usually much lower than for WANs.
6. Communication cost: The overall communication costs of a LAN are much
lower than those of a WAN. The main reasons for this are lower error rates,
simple (or absence of) routing algorithms, and lower administrative and
maintenance costs. Moreover, the cost to transmit data in a LAN is negligible
since the transmission medium is usually owned by the user organization.
However, with a WAN, this cost may be very high because the transmission
media used are leased lines or public communication systems, such as
telephone lines, microwave links, and satellite channels.
Networks that share some of the characteristics of both LANs and WANs are
referred to as Metropolitan Area Networks (MANs). The MANs usually cover a wider
geographic area (up to about 50 km in diameter) than LANs and frequently operate at
speeds very close to LAN speeds. A main objective of MANs is to interconnect LANs
located in an entire city or metropolitan area. Communication links commonly used for
MANs are coaxial cable and microwave links.
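
To get a feel for these figures, the short Python sketch below works out how long a
file would take to cross links at some of the data rates quoted above, and how many bit
errors one would expect on average. The 1-megabyte file size and the function names are
assumptions chosen only for illustration; the rates and error rates are simply picked from
the ranges given in the list.

```python
# Illustrative figures only: the rates and error rates are picked from the
# ranges quoted in the text; the 1-megabyte file size is an assumed example.

FILE_BITS = 8 * 1_000_000  # a 1-megabyte file expressed in bits (assumption)


def transfer_seconds(bits, rate_bps):
    """Time needed to push `bits` onto a link running at `rate_bps`."""
    return bits / rate_bps


def expected_bit_errors(bits, bit_error_rate):
    """Average number of corrupted bits for a given bit error rate."""
    return bits * bit_error_rate


links = [
    ("slow WAN (1200 bps, error rate 1e-5)", 1200, 1e-5),
    ("fast WAN (1 Mbps, error rate 1e-7)", 1_000_000, 1e-7),
    ("fast LAN (1 Gbps, error rate 1e-12)", 1_000_000_000, 1e-12),
]

for name, rate, ber in links:
    seconds = transfer_seconds(FILE_BITS, rate)
    errors = expected_bit_errors(FILE_BITS, ber)
    print(f"{name}: {seconds:.3f} s, about {errors:.6f} expected bit errors")
```

Under these assumptions the same file needs well over an hour on the slowest WAN link
but only a few milliseconds on the fastest LAN, which is why the two kinds of network
are designed so differently.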


1.3 LAN Technologies:
This section presents a description of topologies and principles of operation of LANs.
1.3.1 LAN Topologies: The two commonly used network topologies for constructing
LANs are multi-access bus and ring.

In a simple multi-access bus network, all sites are directly connected to a single
transmission medium (called the bus) that spans the whole length of the network
(Fig 1.1). The bus is passive and is shared by all the sites for any message
transmission in the network. Each site is connected to the bus by a drop cable
using a T-connection or tap. Broadcast communication is used for message
transmission. That is, a message is transmitted from one site to another by placing it on the
shared bus. An address designator is associated with the message. As the message
travels on the bus, each site checks whether the message is addressed to it, and the
addressed site picks up the message.

Fig 1.1 Simple multi-access bus network

A variant of the simple multi-access bus network topology is the multi-access
branching bus network topology. In such a network, two or more simple
multi-access bus networks are interconnected using repeaters (Fig 1.2). Repeaters are
the hardware devices used to connect cable segments. They amplify and copy
electric signals from one segment of a network to its next segment.

Fig 1.2 Multi-access branching bus network topology

In a ring network, each site is connected to exactly two other sites so that a loop is
formed (Fig 1.3). A separate link is used to connect two sites. The links are
interconnected using repeaters. Data is transmitted in one direction around the
ring by signaling between sites. That is, to send a message from one site to
another, the source site writes the destination site's address in the message header
and passes the message to its neighbor. A site that receives the message from its
neighbor checks the message header to see if the message is addressed to it. If not,
the site passes on the message to its own neighbor. In this manner, the message
circulates around the ring until some site (to which the message is addressed)
removes the message from the ring; otherwise, it is removed by the
source site (which sent the message). In the latter
case, the message always circulates for one complete round on the ring.
In ring networks, one of the sites acts as a monitor site to ensure that a
message does not circulate indefinitely (that is, in case the source site or
destination site fails). The monitor site also performs other jobs, such as
housekeeping functions, ring utilization, and handling other error conditions.

Fig 1.3 Ring network topology
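
The way a message travels around a ring can be sketched in a few lines of Python. This
is only a minimal model under simplifying assumptions: the site names and the function
send_on_ring are hypothetical, no site fails, and the monitor site is left out.

```python
def send_on_ring(sites, source, destination):
    """Pass a message around a one-directional ring and report what happens.

    `sites` lists the site names in ring order; each site forwards to the
    next entry, and the last entry forwards back to the first.
    Returns (delivered, hops): whether some site accepted the message and
    how many links the message crossed before it left the ring.
    """
    n = len(sites)
    position = sites.index(source)
    hops = 0
    while True:
        position = (position + 1) % n   # pass the message to the neighbor
        hops += 1
        current = sites[position]
        if current == destination:
            return True, hops            # the addressed site removes the message
        if current == source:
            return False, hops           # one full round: the source removes it


ring = ["A", "B", "C", "D"]
print(send_on_ring(ring, "A", "C"))   # (True, 2)
print(send_on_ring(ring, "A", "Z"))   # (False, 4): nobody claimed the message
```

The second call shows the case described above in which no site accepts the message,
so it completes one full round and is removed by the site that sent it.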


1.3.2 Medium-Access Control Protocols: In both multi-access bus and ring
networks, all the sites of a network share a single channel, resulting in a
multi-access environment. In such an environment, it is possible that several sites try to
transmit information over the shared channel simultaneously. In this case, the transmitted
information may become scrambled and must be discarded. The concerned sites must be
notified about the discarded information, so that they can retransmit their information. If
no special provisions are made, this situation may be repeated, resulting in degraded
performance. Therefore, special schemes are needed in a multi-access
environment to control the access to a shared channel. These schemes are known as
medium-access control protocols.
Clearly, in a multi-access environment, the use of a medium having high raw data
rate alone is not sufficient. The medium-access control protocol used must also provide
for efficient bandwidth use of the medium. Therefore, the medium-access control
protocol has a significant effect on the overall performance of a computer network, and
often it is by such protocols that the networks differ the most. The three important
performance objectives of a medium-access control protocol are high throughput, high
channel utilization, and low message delay.
In addition to meeting the performance objectives, other desirable characteristics of a
medium-access protocol are:
For fairness, unless a priority scheme is intentionally implemented, the protocol
should provide equal opportunity to all sites in allowing them to transmit their
information over the shared medium.
For better scalability, sites should require a minimum knowledge of the network
structure (topology, size, or relative location of other sites), and addition, removal,
or movement of a site from one place to another in the network should be possible
without the need to change the protocol. Furthermore, it should not be necessary
to have knowledge of the exact value of the end-to-end propagation delay of the
network for the protocol to function correctly.
For higher reliability, centralized control should be avoided and the operation of
the protocol should be completely distributed.
For supporting real time applications, the protocol should exhibit bounded delay
properties. That is, the maximum message transfer delay from one site to another
in the network must be known and fixed.
Several protocols have been developed for medium-access control in a
multi-access environment. Of these, the Carrier Sense Multiple Access with Collision
Detection (CSMA/CD) protocol is popular for multi-access bus networks, and the token
ring protocol is commonly used for ring networks. These two protocols
are described in the following section.
CSMA/CD Protocol: The CSMA/CD scheme employs decentralized control of
the shared medium. In this scheme, each site has equal status, in the sense that there is
no central controller site. The sites contend with each other for use of the shared
medium, and the site that first gains access during an idle period of the medium
uses the medium for the transmission of its own message. Obviously, occasional
collisions of messages may occur when more than one site senses the medium to
be idle and transmits messages simultaneously. The scheme uses collision
detection, recovery, and controlled retransmission mechanisms to deal with this
problem. Therefore, the scheme comprises the following three mechanisms:
Carrier sense and defer mechanism. Whenever a site wishes to transmit a packet,
it first listens for the presence of a signal (known as a carrier by analogy with
radio broadcasting) on the shared medium. If the medium is found to be free (no
carrier is present on the medium), the site starts transmitting its packet. Otherwise,
the site defers its packet transmission and waits (continues to listen) until the
medium becomes free. The site initiates its packet transmission as soon as it
senses the medium free.
Collision detection mechanism. Unfortunately, carrier sensing does not prevent all
collisions because of the nonzero propagation delay of the shared medium.
Obviously, collisions occur only within a short time interval following the start of
transmission, since after this interval all sites will detect that the medium is not
free and defer transmission. This time interval is called the collision window or
collision interval and is equal to the amount of time required for a signal to
propagate from one end of the shared medium to the other and back again. If a
site attempts to transmit a packet, it must listen to the shared medium for a time
period that is at least equal to the collision interval in order to guarantee that the
packet will not experience a collision. Collision avoidance by listening to the
shared medium for at least the collision interval time before initiation of packet
transmission leads to inefficient utilization of the medium when collisions are
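rare. Therefore, instead of trying to avoid collisions, the CSMA/CD scheme
allows collisions to occur, detects them, and then takes the necessary recovery
action. A station that detects a collision stops transmitting and sends a brief
jamming signal so that all other stations learn of the collision.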
Controlled retransmission mechanism. After a collision, the packets that become
corrupted due to the collision must be retransmitted. If all the transmitting stations
whose packets were corrupted by the collision attempt to retransmit their packets
immediately after the jamming signal, a collision will probably occur again. To
minimize repeated collisions and to achieve channel stability under overload
conditions, a controlled retransmission strategy is used in which the competition
for the shared medium is resolved using a suitable algorithm.
The Token Ring Protocol: This scheme also employs decentralized control of the
shared medium. In this scheme, a single token is circulated among the sites in the system,
and the site in possession of the token has access to the shared medium.
A token is a special type of message (having a unique bit pattern) that entitles its
holder to use the shared medium for transmitting its messages. A special field in
the token indicates whether it is free or busy. The token is passed from one site to
the adjacent site around the ring in one direction.
A site that has a message ready for transmission must wait until the token reaches
it and the token is free. When it receives the free token, it sets the token to busy,
attaches its message to the token, and transmits it to the next site in the ring.
A receiving site checks the status of the token. If the token is free, the site may use
it to transmit its own message. If the token is busy, the site checks to see if the
message attached to it is addressed to this site. If it is, the site retrieves the message
and forwards the token without the attached message to the
next site in the ring; if it is not, the site simply passes the token and its message
on to the next site.
When the busy token returns to the sending site after one complete round, the
sending site sets it free and passes it to the next site,
allowing the next site to transmit its message (if it has any). The free token
circulates from one site to another until it reaches a site that has some message to
transmit.
To prevent a site from holding the token for a very long time, a token-holding
timer is used to control the length of time for which a site may occupy the token.

To guarantee reliable operation, the token has to be protected against loss or
duplication. That is, if the token gets lost due to a site failure, the system must detect the
loss and generate a new token. The monitor site usually does this. Moreover, if a site i
crashes, the ring must be reconfigured so that site i-1 will send the token directly to site
i+1.
An advantage of the token protocol is that the message delay can be bounded
because of the absence of collisions. Another advantage is that it can work with both
large and small packet size as well as variable-size packets. In principle, a message
attached to the token may be of almost any length. A major disadvantage however is the
initial waiting time to receive a free token even at very light loads. This initial waiting
time could be appreciable, especially in large rings.
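
The token-passing rules can be illustrated with a small Python simulation. It is a
sketch under stated assumptions rather than a real token ring implementation: the name
run_token_ring and the data layout are hypothetical, each site sends at most one message
per token capture (a crude stand-in for the token-holding timer), and token loss, site
failures, and the monitor site are not modeled.

```python
from collections import deque


def run_token_ring(sites, outbox, max_steps=100):
    """Simulate a single token circulating on a one-directional ring.

    sites  : site names in ring order.
    outbox : dict mapping a site to a list of (destination, text) messages.
    Returns the list of (sender, destination, text) in delivery order.
    """
    pending = {s: deque(outbox.get(s, [])) for s in sites}
    delivered = []
    holder = 0      # index of the site the token is currently visiting
    busy = None     # None means a free token, else (sender, dest, text or None)

    for _ in range(max_steps):
        site = sites[holder]
        if busy is None:
            if pending[site]:
                dest, text = pending[site].popleft()
                busy = (site, dest, text)         # mark busy and attach message
        else:
            sender, dest, text = busy
            if site == dest and text is not None:
                delivered.append((sender, dest, text))
                busy = (sender, dest, None)       # forward token without message
            elif site == sender:
                busy = None                       # back at the sender: free it
        if busy is None and not any(pending.values()):
            break                                 # nothing left to transmit
        holder = (holder + 1) % len(sites)        # pass the token to the next site
    return delivered


log = run_token_ring(
    ["A", "B", "C"],
    {"A": [("C", "m1"), ("C", "m2")], "B": [("C", "m3")]},
)
print(log)   # [('A', 'C', 'm1'), ('B', 'C', 'm3'), ('A', 'C', 'm2')]
```

Site A has two messages queued, yet site B gets its turn in between. This is the
bounded-delay and fairness property of the token scheme, obtained here because a site
must release the token after each transmission.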
1.4 WAN Technologies:
A WAN of computers is constructed by interconnecting computers that are separated by
large distances; they may be located in different cities or even in different countries. In
general, no fixed regular network topology is used for interconnecting the computers of a
WAN. Moreover, different communication media may be used for different links of a
WAN. For example, in a WAN, computers located in the same country may be
interconnected by coaxial cables (telephone lines), but communications satellites may be
used to interconnect two computers that are located in different countries.
The computers of a WAN are not connected directly to the communication
channels but are connected to hardware devices called packet-switching exchanges
(PSEs), which are special-purpose computers dedicated to the task of data
communication. Therefore, the communication channels of the network interconnect the
PSEs, which actually perform the task of data communication across the network (Fig.
1.4).
To send a message packet to another computer on the network, a computer sends the
packet to the PSE to which it is connected. The packet is transmitted from the sending
computer's PSE to the receiving computer's PSE, possibly via other PSEs. When the
packet reaches the receiving computer's PSE, it is delivered to the receiving computer.
The actual mode of packet transmission and the route used for forwarding a packet from
its sending computer's PSE to its receiving computer's PSE depend on the switching and
routing techniques used by the PSEs of the network. Various possible options for these
techniques are described next.





Fig 1.4 The WAN using Packet Switching Exchanges

1.4.1 Switching Techniques: We saw that in a WAN communication is achieved by
transmitting a packet from its source computer to its destination computer through two or
more PSEs. The PSEs provide a switching facility to move a packet from one PSE to
another until the packet reaches its destination. That is, a PSE removes a packet from an
input channel and places it on an output channel. Network latency is highly dependent on
the switching technique used by the PSEs of the WAN. The two most commonly used
schemes are circuit switching and packet switching. They are described next.
Circuit Switching: This scheme is similar to that used in the public telephone
system. In this system, when a telephone call is made, a dedicated circuit is established by
the telephone switching office from the caller's telephone to the callee's telephone. Once
this circuit is established, the only delay involved in the communication is the time
required for the propagation of the electromagnetic signal through all the wires and
switches. While it might be hard to obtain a circuit sometimes (such as during busy hours),
once the circuit is established, exclusive access to it is guaranteed until the call
is terminated.
In this method, before data transmission starts, a physical circuit is constructed
between the sender and receiver computers during the circuit establishment phase. During
this phase, the channels constituting the circuit are reserved exclusively for the circuit;
hence there is no need for buffers at the intermediate PSEs. Once the circuit is
established, all packets of the data are transferred one after another through the dedicated
circuit without being buffered at intermediate sites; the packets appear to form a
continuous data stream. Finally, in the circuit termination phase, the circuit is torn down
as the last packet of the data is transmitted. As soon as the circuit is torn down, the channels
that were reserved for the circuit become available for use by others. If a circuit cannot be
established because a desired channel is busy (being used), the circuit is said to be
blocked. Depending on the way blocked circuits are handled, the partial circuit may be
torn down, with establishment to be attempted later.
The main advantages of the circuit-switching technique are:
Once the circuit is established, data is transmitted with no delay other than the
propagation delay, which is negligible.
Since the full capacity of the circuit is available for exclusive use by the
connected pair of computers, the transmission time required to send a message
can be known and guaranteed after the circuit has been successfully established.
However, the method requires additional overhead during circuit establishment and
circuit disconnection phases, and channel bandwidth may be wasted if the connected pair
of computers does not utilize the channel capacities of the path forming the circuit
efficiently. Therefore, the method is considered suitable only for long continuous
transmissions or for transmissions that require guaranteed maximum transmission delay.
It is the preferred method for transmission of voice and real-time data in distributed
applications. Circuit switching is used in the Public Switched Telephone Network
(PSTN).
Packet Switching: In this method, instead of establishing a dedicated path
between a sender and receiver pair (of computers), the channels are shared for
transmitting packets of different sender-receiver pairs. That is, a channel is occupied by a
sender-receiver pair only while transmitting a single packet of the message of that pair;
the channel may then be used for transmitting either another packet of the same
sender-receiver pair or a packet of some other sender-receiver pair.
In this method, each packet of the message contains the address of the
destination computer, so that it can be sent to its destination independently of all other
packets. Notice that different packets of the same message may take different paths
through the network and, at the destination computer, the receiver may get the
packets in an order different from the order in which they were sent. Therefore, at the
destination computer, the packets have to be properly reassembled into a message. When
a packet reaches a PSE (refer to Fig 1.4), the packet is temporarily stored there in a packet
buffer. The packet is then forwarded to a selected neighboring PSE when the next channel
becomes available and the neighboring PSE has an available packet buffer. Hence the
actual path taken by a packet to its destination is dynamic because the path is established
as the packet travels along. The packet-switching technique is also known as
store-and-forward communication because every packet is temporarily stored by each PSE
along its route before it is forwarded to another PSE.
Compared to circuit switching, packet switching has the following characteristics:
It is suitable for transmitting small amounts of data that are bursty in nature.
The method allows efficient usage of channels because the communication
bandwidth of a channel is shared for transmitting several messages.
Furthermore, the dynamic selection of the actual path to be taken by a packet
gives the network considerable reliability because failed PSEs or channels can be
ignored and alternate paths may be used. For example, in the WAN of Figure 1.4,
if channel 2 fails, the message can still be sent from computer A to D using the
path 1-3.
On the other hand, due to the need to buffer each packet at every PSE and to
reassemble the packets at the destination computer, the overhead incurred is large.
Therefore, the method is inefficient for transmitting large messages.
Another drawback of the method is that there is no guarantee of how long it takes
a message to go from its source computer to its destination computer because the
time taken for each packet depends on the route chosen for the packet, along with
the volume of data being transferred along this route.
Packet switching is used in the X.25 public packet network and the Internet.
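
The idea that every packet carries enough information to travel on its own, and that the
destination must put the packets back in order, can be sketched as follows. The packet
layout (a dictionary with "dest", "seq", and "data" fields), the function names, and the
4-character packet size are assumptions made only for this illustration; random.shuffle
stands in for packets arriving over different routes.

```python
import random


def packetize(message, destination, size):
    """Split `message` into packets that each carry the destination address
    and a sequence number, so every packet can travel independently."""
    return [
        {"dest": destination, "seq": i, "data": message[start:start + size]}
        for i, start in enumerate(range(0, len(message), size))
    ]


def reassemble(packets):
    """Rebuild the message at the destination, whatever the arrival order."""
    ordered = sorted(packets, key=lambda p: p["seq"])
    return "".join(p["data"] for p in ordered)


packets = packetize("store and forward", destination="D", size=4)
random.shuffle(packets)          # packets may arrive in a different order
print(reassemble(packets))       # store and forward
```

The sorting step at the destination is part of the reassembly overhead mentioned above;
a circuit-switched connection would deliver the data as one ordered stream instead.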
1.4.2 Routing Techniques: In a WAN, when multiple paths exist between the source and
destination computers of a packet, any one of the paths may be used to transfer the
packet. For example, in the WAN of Figure 1.4, there are two paths between computers E
and F, 3-4 and 1-2-4, and either of the two may be used to transmit a packet from
computer E to F. The selection of the actual path to be used for transmitting a packet is
determined by the routing technique used. An efficient routing technique is crucial to the
overall performance of the network. This requires that the routing decision process must
be as fast as possible to reduce the network latency. The routing decision
process should also be easily implementable in hardware. Furthermore, the decision process
usually should not require global state information of the network because such
information gathering is a difficult task and creates additional traffic in the network.
Routing algorithms are usually classified based on the following three attributes:
Place where routing decisions are made
Time constant of the information upon which the routing decisions are based
Control mechanism used for dynamic routing
Note that routing techniques are not needed in LANs because the sender of a message
simply puts the message on the communication channel and the receiver takes it off from
the channel. There is no need to decide the path to be used for transmitting the message
from the sender to the receiver.
Out of the three attributes, let us consider only the second one, as it is the most
important as far as our requirement is concerned. According to this attribute, routing
algorithms are classified as follows:
Static routing. In this method, routing tables (stored on PSEs) are set once and
do not change for very long periods of time. They are changed only when the
network undergoes major modifications. Static routing is also known as fixed
or deterministic routing. Static routing is simple and easy to implement.
However, it makes poor use of network bandwidth and causes blocking of a
packet even when alternative paths are available for its transmission. Hence,
static routing schemes are susceptible to component failures.
Dynamic routing. In this method routing tables are updated relatively
frequently, reflecting shorter-term changes in the network environment.
Dynamic routing strategy is also known as adaptive routing because it has a
tendency to adapt to the dynamically changing state of the network, such as
the presence of faulty or congested channels. Dynamic routing schemes can
use alternative paths for packet transmissions, making more efficient use of
network bandwidth and providing resilience to failures. The latter property is
particularly important for large-scale architectures, since expanding network
size can increase the probability of encountering a faulty network component.
In dynamic routing, however, packets of a message may arrive out of order at
the destination computer. Appending a sequence number to each packet and
properly reassembling the packets at the destination computer can solve this
problem. The path selection policy for dynamic routing may either be minimal
or non-minimal. In the minimal policy, the selected path is one of the shortest
paths between the source and destination pair of computers. Therefore, every
channel visited will bring the packet closer to the destination. On the other
hand, in the non-minimal policy, a packet may follow a longer path, usually in
response to current network conditions. If the non-minimal policy is used,
care must be taken to avoid a situation in which the packet will continue to be
routed through the network but never reach the destination.
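
The difference between a fixed route and a route that adapts to failures can be sketched
with a breadth-first search over a small network. Everything named here is an assumption
for illustration: the four-PSE topology is hypothetical (it is not the network of Fig 1.4),
and shortest_path is just one simple way to apply the minimal path selection policy, not
a routing protocol used by any particular network.

```python
from collections import deque


def shortest_path(links, source, destination, failed=frozenset()):
    """Breadth-first search for a minimal-hop path that avoids failed links.

    links  : iterable of (node, node) pairs, treated as two-way channels.
    failed : set of frozenset({a, b}) channels currently out of service.
    Returns the list of nodes on the path, or None if no path exists.
    """
    neighbours = {}
    for a, b in links:
        if frozenset((a, b)) in failed:
            continue                       # skip faulty channels
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:        # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical four-PSE network (not the topology of Fig 1.4).
LINKS = [("P1", "P2"), ("P2", "P4"), ("P1", "P3"), ("P3", "P4"), ("P1", "P4")]

static_route = shortest_path(LINKS, "P1", "P4")
adaptive_route = shortest_path(LINKS, "P1", "P4", failed={frozenset(("P1", "P4"))})
print(static_route)     # ['P1', 'P4']
print(adaptive_route)   # ['P1', 'P2', 'P4']
```

A static scheme would compute static_route once and keep using it, so packets would be
blocked when the P1-P4 channel fails. A dynamic scheme recomputes the route from the
current state of the network and finds the detour through P2.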
1.5 Communication Protocols:
For transmission of message data comprised of multiple packets, the sender and receiver
must agree upon the method used for identifying the first packet and the last packet
of the message. Moreover, agreement is also needed for handling duplicate messages,
avoiding buffer overflows, and assuring proper message sequencing. Network designers
define all such agreements, needed for communication between the communicating
parties, in terms of rules and conventions. The term protocol is used to refer to a set of
such rules and conventions.

Computer networks are implemented using the concept of
layered protocols.
According to this concept, the protocols of a network are organized into a series
of layers in such a way that each layer contains protocols for exchanging data and
providing functions in a logical sense with the peer entities at other sites in the
network.
Entities in the adjacent layers interact in a physical sense through the common
interface defined between the two layers by passing parameters such as headers,
trailers, and data parameters. The main reasons for using the concept of layered
protocols in network design are as follows:
The protocols of a network are fairly complex. Designing them in layers makes
their implementation more manageable.
Layering of protocols provides well-defined interfaces between the layers, so
that a change in one layer does not affect an adjacent layer. That is, the various
functionalities can be partitioned and implemented independently so that each
one can be changed as technology improves without the other ones being
affected. For example, a change to a routing algorithm in a network control
program should not affect the functions of message sequencing, which is located
in another layer of the network architecture.
Layering of protocols also allows interaction between functionality-paired
layers in different locations. This concept aids in permitting the distribution of
functions to remote sites.
The terms protocol suite, protocol family, or protocol stack are used to refer to
the collection of protocols (of all layers) of a particular network system.
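The passing of headers and trailers between adjacent layers can be sketched as follows. This is a minimal illustration that assumes a hypothetical three-layer stack and a made-up textual header/trailer format; real protocol stacks use binary headers with many fields.

```python
# Hypothetical three-layer stack, listed from the top layer to the bottom layer.
LAYERS = ["transport", "network", "data-link"]


def send_down(data: str) -> str:
    """Each layer wraps what it receives from the layer above in its own header and trailer."""
    for layer in LAYERS:
        data = f"[{layer}-hdr]{data}[{layer}-trl]"
    return data


def pass_up(frame: str) -> str:
    """The peer stack strips the wrappers in reverse order, outermost first."""
    for layer in reversed(LAYERS):
        header, trailer = f"[{layer}-hdr]", f"[{layer}-trl]"
        if not (frame.startswith(header) and frame.endswith(trailer)):
            raise ValueError(f"malformed unit at {layer} layer")
        frame = frame[len(header):len(frame) - len(trailer)]
    return frame
```

Each layer touches only its own header and trailer, which is why one layer can be changed without affecting the others as long as the interface stays the same.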
1.5.1 The OSI Reference Model: The basic goal of communication protocols for
network systems is to allow remote computers to communicate with each other and to
allow users to access remote resources. On the other hand, the basic goal of
communication protocols for distributed systems is not only to allow users to access
remote resources but also to do so in a transparent manner. Several standards and
protocols for network systems are already available.

The number of layers, the name of each layer, and the functions of each layer may
differ from one network to another.
To make the job of the network communication protocol designers easier, the
International Organization for Standardization (ISO) has developed a reference model
that identifies seven standard layers and defines the jobs to be performed at each
layer. This model is called the Open Systems Interconnection Reference Model (OSI
model).
It is a guide, not a specification. It provides a framework in which standards can
be developed for the services and protocols at each layer. To provide an
understanding of the structure and functioning of layered network protocols, a
brief description of the OSI model is given here, as shown in Fig 1.5.
It is a seven-layer architecture in which a separate set of protocols is defined for
each layer. Thus each layer has an independent function and deals with one or
more specific aspects of the communication.
The seven layers are Physical, Data link, Network, Transport, Session,
Presentation, and Application. Let us look at them one by one.
o Physical Layer: This specifies the physical link interconnection, including
electrical/photonic characteristics.
o Data link layer: This specifies how data travels between two end points of
a communication link (for example, a host and a packet switch). At this level,
data is delivered in a frame, which consists of a stream of binary data 0s
and 1s in which checksum techniques are applied to detect errors. The
Ethernet protocol is one example of this.
o Network layer: This defines the basic unit of transfer across the network
and includes the concept of multiplexing and routing. At this level the
software assembles a packet in the form the network expects and uses
layer 2 to transfer it over single links.
o Transport layer: This provides end-to-end reliability by having the
destination host communicate with the source host to compensate for the fact
that multiple networks with different qualities of service may have been used.
o Session layer: This describes how protocol software can be organized to
handle all the functionality needed by the application programs,
particularly to maintain transfer-level synchronization.
        Site 1                                               Site 2
        Process                                              Process
Layer 7 (Application)   <-- Application protocol  -->  Layer 7 (Application)
        Interface                                            Interface
Layer 6 (Presentation)  <-- Presentation protocol -->  Layer 6 (Presentation)
        Interface                                            Interface
Layer 5 (Session)       <-- Session protocol      -->  Layer 5 (Session)
        Interface                                            Interface
Layer 4 (Transport)     <-- Transport protocol    -->  Layer 4 (Transport)
        Interface                                            Interface
Layer 3 (Network)       <-- Network protocol      -->  Layer 3 (Network)
        Interface                                            Interface
Layer 2 (Data link)     <-- Data-link protocol    -->  Layer 2 (Data link)
        Interface                                            Interface
Layer 1 (Physical)      <-- Physical protocol     -->  Layer 1 (Physical)

Fig. 1.5 The architecture of the OSI model


o Presentation layer: This includes functions required for the basic
encoding rules used in transferring information, be it text, voice, or video.
o Application layer: This includes application programs like electronic mail
or file transfer programs.
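The error detection mentioned for the data link layer can be sketched as follows. This is a minimal illustration, assuming a frame that is just a payload followed by a 4-byte CRC-32 trailer computed with Python's zlib module; the function names are hypothetical and real frame formats such as Ethernet carry additional fields that are not modeled here.

```python
import struct
import zlib


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 of the payload as the frame trailer."""
    return payload + struct.pack("!I", zlib.crc32(payload))


def check_frame(frame: bytes) -> bool:
    """Receiver side: recompute the CRC-32 and compare it with the trailer."""
    if len(frame) < 4:
        return False
    payload, trailer = frame[:-4], frame[-4:]
    return struct.pack("!I", zlib.crc32(payload)) == trailer
```

A frame that fails the check is simply reported as damaged here; what happens next (discarding, retransmission) is left to the protocol.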
Check Your Progress-1: Fill in the blanks:
A. The end systems are called ---------------.
B. Networks are broadly classified as ------------- and ---------------------.
C. The two commonly used network topologies for LAN are ----------- and ----------.
D. An example of a medium access protocol is ----------.
E. The two types of Switching techniques are -----------and-------------.
F. An example of a communication protocol is -----------.
G. A reference model suggested by ISO is known as -------------.
Check Your Progress-2: Answer the following questions: -
1) Explain the different network types.
2) What are the key characteristics used to differentiate the different network types?
3) Describe the LAN topologies.
4) Discuss the importance of routing techniques and briefly explain the different
5) What do you mean by switching? Differentiate Circuit and Packet switching
6) Explain the different medium access protocols in brief.
7) Describe the seven-layer architecture suggested by ISO.
1.6 Summary: A distributed system relies entirely on the underlying computer network
for the communication of data and control information between the nodes of which it
is composed.
A computer network is a communication system that links the nodes by
communication lines and software protocols to exchange data between
processes running on different nodes of the network.
Based on characteristics such as geographic distribution of nodes, data rate,
error rate, and communication cost, networks are broadly classified into two
types, LAN and WAN. Networks that share characteristics of both
LAN and WAN are sometimes referred to as MANs.
The two commonly used network topologies for constructing LAN are the
multi-access bus and ring.
A wide area network of computers is constructed by interconnecting
computers that are separated by large distances. Special hardware devices
called packet switching exchanges are used to connect the computers to the
communication channels.
The selection of actual path to be used to transmit a packet in a WAN is
determined by the routing strategy used. The path used to transmit a packet in
a WAN can be either statically or dynamically changed based on the network
conditions using suitable algorithms.
Computer networks are implemented using the concept of layered protocols.
The OSI model provides a standard for layered protocols for WANs.
The seven layers of the OSI model are Physical, Data-Link, Network,
Transport, Session, Presentation and Application.



2.0 Objectives
2.1 Introduction
2.2 Distributed Computing System An Outline
2.3 Evolution of Distributed Computing System
2.4 Distributed Computing System Models
2.4.1 Minicomputer model
2.4.2 Workstation model
2.4.3 Workstation Server model
2.4.4 Processor pool model
2.4.5 Hybrid model
2.5 Uses of Distributed Computing System
2.6 Distributed Operating System
2.7 Issues in designing a Distributed Operating system
2.8 Introduction to Distributed Computing Environment
2.8.1 DCE
2.8.2 DCE Components
2.8.3 DCE Cells
2.9 Summary
2.0 Objectives: In this unit we will be learning a new task execution strategy called
Distributed Computing. By the end of this unit you will be able to understand this new
approach and the following related terminologies.
Distributed Computing System (DCS)
Distributed Computing models
Distributed Operating System
Distributed Computing Environment (DCE)

2.1 Introduction: Advancements in microelectronic technology have resulted in the

availability of fast, inexpensive processors, and advancements in communication
technology have resulted in the availability of cost-effective and highly efficient
computer networks. The net result of the advancements in these two technologies is that
the price performance ratio has now changed to favor the use of interconnected, multiple
processors in place of a single, high-speed processor.
The merging of computer and networking technologies gave birth to Distributed
computing systems in the late 1970s. Therefore, starting from the late 1970s, a significant
amount of research work was carried out in both universities and industries in the area of
distributed operating systems. These research activities have provided us with the basic
ideas of designing distributed operating systems. Although the field is still immature,
with ongoing active research activities, commercial distributed operating systems have
already started to emerge. These systems are based on already established basic concepts.
This unit deals with these basic concepts and their use in the design and implementation
of distributed operating systems. Finally, the unit takes a brief look at a complete
system known as the Distributed Computing Environment.
2.2 Distributed Computing System - An Outline:
Computer architectures consisting of interconnected, multiple processors are basically of
two types:
1. Tightly Coupled systems: In these systems, there is a single system-wide
primary memory (address space) that is shared by all the processors (Fig 2.1
(a)). If any processor writes a value to a memory location, any other processor
that subsequently reads that location gets the new value. Therefore, in these
systems, any communication between the processors usually takes place through
the shared memory.
2. Loosely Coupled systems: In these systems, the processors do not share
memory, and each processor has its own local memory (Fig 2.1(b)). In these
systems, all physical communication between the processors is done by
passing messages across the network that interconnects the processors.

[Figure: four CPUs and a system-wide shared memory connected by interconnection hardware]

Fig. 2.1 (a) A tightly coupled multiprocessor system

[Figure: four local memories, one per processor, with the processors connected by a communication network]

Fig 2.1 (b) A loosely coupled multiprocessor system

Let us see some points with respect to both tightly coupled and
loosely coupled multiprocessor systems.
Tightly coupled systems are referred to as parallel processing systems, and
loosely coupled systems are referred to as distributed computing systems, or
simply distributed systems.
In contrast to tightly coupled systems, the processors of distributed computing
systems can be located far from each other to cover a wider geographical area.
In tightly coupled systems, the number of processors that can be usefully
deployed is usually small and limited by the bandwidth of the shared memory.
Distributed computing systems are more freely expandable and can have an
almost unlimited number of processors.

In short, a distributed computing system is basically a collection of processors

interconnected by a communication network in which each processor has its own local
memory and other peripherals, and the communication between any two processors of
the system takes place by message passing over the communication network.
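Communication by message passing between processors that keep only local memory can be sketched as follows. This is only an illustration: a Python thread stands in for a remote processor and a pair of queues stands in for the communication network, and the message format (operation, key, value) is a hypothetical one chosen for the example.

```python
import queue
import threading


def node(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """A 'processor' with purely local memory: it learns values only from messages."""
    local_memory = {}
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        op, key, value = msg
        if op == "write":
            local_memory[key] = value
        elif op == "read":
            outbox.put(local_memory.get(key))   # the reply is itself a message


# The two queues play the role of the communication network.
to_node, from_node = queue.Queue(), queue.Queue()
worker = threading.Thread(target=node, args=(to_node, from_node))
worker.start()

to_node.put(("write", "x", 100))    # ask the remote node to store x = 100
to_node.put(("read", "x", None))    # ask for the value back
reply = from_node.get(timeout=5)

to_node.put("stop")
worker.join()
```

The sender never touches the dictionary of the other node directly; everything it knows about the remote state arrives in a reply message.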
2.3 Evolution of Distributed Computing Systems:
Early computers were very expensive (they cost millions of dollars) and very large in size
(they occupied a big room). There were very few computers, and they were available only in
research laboratories of universities and industries. These computers were run from a
console by an operator and were not accessible to ordinary users. The programmers
would write their programs and submit them to the computer center on some media such
as punched cards, for processing. Before processing a job, the operator would set up the
necessary environment (mounting tapes, loading punched cards in a card reader etc.,) for
processing the job. The job was then executed and the result, in the form of printed
output, was later returned to the programmer.
The job setup time was a real problem in early computers and wasted most of the
valuable central processing unit (CPU) time. Several new concepts were introduced in the
1950s and 1960s to increase CPU utilization of these computers. Notable among these are
batching together of jobs with similar needs before processing them, automatic
sequencing of jobs, off-line processing by using the concepts of buffering and spooling
and multiprogramming. Automatic job sequencing with the use of control cards to define
the beginning and end of a job improved CPU utilization by eliminating the need for
human job sequencing. Off-line processing improved CPU utilization by allowing
overlap of CPU and input/output (I/O) operations by executing those two actions on two
independent machines (I/O devices are normally several orders of magnitude slower than
the CPU). Finally, multiprogramming improved CPU utilization by organizing jobs so
that the CPU always had something to execute.
However, none of these ideas allowed multiple users to directly interact with a
computer system and to share its resources simultaneously. Therefore, execution of
interactive jobs that are composed of many short actions in which the next action depends
on the result of a previous action was a tedious and time-consuming activity.
Development and debugging of programs are examples of interactive jobs. It was not
until the early 1970s that computers started to use the concept of time-sharing to
overcome this hurdle. Early time-sharing systems had several dumb terminals attached to
the main computer. These terminals were placed in a room different from the main computer
room.
Using these terminals, multiple users could now simultaneously execute interactive jobs
and share the resources of the computer system. In a time-sharing system, each user is
given the impression that he or she has his or her own computer because the system
switches rapidly from one user's job to the next user's job, executing only a very small
part of each job at a time. Although the idea of time-sharing was demonstrated as early as
1960, time-sharing computer systems were not common until the early 1970s because
they were difficult and expensive to build. Parallel advancements in hardware technology
allowed reduction in the size and increase in the processing speed of computers, causing
large-sized computers to be gradually replaced by smaller and cheaper ones that had more
processing capability than their predecessors. These systems were called minicomputers.
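The rapid switching between user jobs in a time-sharing system can be sketched with a simple round-robin loop. This is a minimal sketch under stated assumptions: the job names, the time units, and the fixed time slice (quantum) are hypothetical, and real time-sharing schedulers are considerably more elaborate.

```python
from collections import deque
from typing import Dict, List


def round_robin(jobs: Dict[str, int], quantum: int) -> List[str]:
    """Return the order in which jobs receive time slices.

    `jobs` maps a job name to the CPU time it still needs.
    """
    ready = deque(jobs.items())
    schedule = []
    while ready:
        name, remaining = ready.popleft()
        schedule.append(name)                 # the job runs for at most one quantum
        remaining -= quantum
        if remaining > 0:
            ready.append((name, remaining))   # unfinished: back to the end of the queue
    return schedule
```

For example, round_robin({"A": 3, "B": 1, "C": 2}, 1) gives the order A, B, C, A, C, A, so every user sees progress on his or her job within a few time slices.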
The advent of time-sharing systems was the first step toward distributed
computing systems because it provided us with two important concepts used in
distributed computing systems:
The sharing of computer resources simultaneously by many users
The accessing of computers from a place different from the main computer room.
Initially the terminals of a time-sharing system were dumb terminals and all processing
was done by the main computer system. Advancements in microprocessor technology in
the 1970s allowed the dumb terminals to be replaced by intelligent terminals so that the
concepts of offline processing and time sharing could be combined to have the
advantages of both concepts in a single system. Microprocessor technology continued to
advance rapidly, making available in the early 1980s single-user computers called
workstations that had computing power almost equal to that of minicomputers but were
available for only a small fraction of the price of a minicomputer. For example, the first
workstation developed at Xerox PARC (called Alto) had a high-resolution monochrome
display, a mouse, 128 kilobytes of main memory, a 2.5 megabyte hard disk, and a
microprogrammed CPU that executed machine-level instructions in 2-6 microseconds. These
workstations were then used as terminals in the time-sharing systems. In these
time-sharing systems, most of the processing of a user's job could be done at the user's own
computer, allowing the main computer to be simultaneously shared by a larger number of
users. Shared resources such as files, databases, and software libraries were placed on the
main computer. Centralized time-sharing systems described above had a limitation in that
the terminals could not be placed very far from the main computer room since ordinary
cables were used to connect the terminals to the main computer. However, in parallel,
there were advancements in computer networking technology in the late 1960s and early
1970s that emerged as two key networking technologies:
LAN (Local Area Network): The LAN technology allowed several computers
located within a building or a campus to be interconnected in such a way that these
machines could exchange information with each other at data rates of about 10
megabits per second (Mbps). The first high-speed LAN was the Ethernet, developed
at Xerox PARC in 1973.
WAN (Wide Area Network): The WAN technology allowed computers located far from
each other (maybe in different cities, countries, or continents) to be interconnected
in such a way that these machines could exchange information with each other at data
rates of about 56 kilobits per second (Kbps). The first WAN was the ARPANET (Advanced
Research Projects Agency Network), developed by the U.S. Department of Defense
in 1969.
The ATM technology: The data rates of networks continued to improve gradually in the
1980s, providing data rates of up to 100 Mbps for LANs and data rates of up to 64 Kbps
for WANs. Recently (early 1990s) there has been another major advancement in
networking technology: the ATM (Asynchronous Transfer Mode) technology. The ATM
technology is an emerging technology that is still not very well established. It will make
very high speed networking possible, providing data transmission rates up to 1.2 gigabits
per second (Gbps) in both LAN and WAN environments. The availability of such high-
bandwidth networks will allow future distributed computing systems to support a
completely new class of distributed applications, called multimedia applications, that deal
with the handling of a mixture of information, including voice, video and ordinary data.
The merging of computer and networking technologies gave birth to Distributed
computing systems in the late 1970s.
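To get a feel for the data rates quoted in this section, the following sketch computes the ideal time needed to transfer one file at each rate. It is simple arithmetic under stated assumptions: the 1-megabyte file size is hypothetical, and protocol overhead, propagation delay, and congestion are ignored.

```python
# Data rates quoted in the text, in bits per second.
RATES_BPS = {
    "WAN, 56 Kbps": 56_000,
    "LAN, 10 Mbps": 10_000_000,
    "ATM, 1.2 Gbps": 1_200_000_000,
}


def transfer_seconds(size_bytes: int, rate_bps: int) -> float:
    """Ideal transfer time: number of bits to send divided by bits per second."""
    return size_bytes * 8 / rate_bps


FILE_SIZE = 1_000_000  # hypothetical file of one million bytes
times = {name: transfer_seconds(FILE_SIZE, rate) for name, rate in RATES_BPS.items()}
```

Under these assumptions the file needs about 143 seconds on the 56 Kbps WAN, 0.8 seconds on the 10 Mbps LAN, and under 7 milliseconds at 1.2 Gbps.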

2.4 Distributed Computing System Models:

Various models are used for building distributed computing systems. These models can be
broadly classified into five categories: minicomputer, workstation, workstation-server,
processor-pool, and hybrid. They are briefly described below.
2.4.1 Minicomputer Model: The minicomputer model is a simple extension of the
centralized time-sharing system. As shown in Fig 2.2, a distributed computing
system based on this model consists of a few minicomputers (they may be large
supercomputers as well) interconnected by a communication network. Each
minicomputer usually has multiple users simultaneously logged on to it. For this,
several interactive terminals are connected to each minicomputer. Each user is
logged on to one specific minicomputer, with remote access to other minicomputers.
The network allows a user to access remote resources that are available on some
machine other than the one on to which the user is currently logged.
[Figure: minicomputers, each with attached terminals, interconnected by a communication network]

Fig 2.2. A distributed computing system based on the minicomputer model

The minicomputer model may be used when resource sharing (such as sharing of
information databases of different types, with each type of database located on a different
machine) with remote users is desired. The early ARPANET is an example of a distributed
computing system based on the minicomputer model.
2.4.2 Workstation Model: As shown in Fig 2.3, a distributed computing system based
on the workstation model consists of several workstations interconnected by a
communication network. A company's office or a university department may have several
workstations scattered throughout a building or campus, each workstation equipped with
its own disk and serving as a single-user computer. It has often been found that in such an
environment, at any one time (especially at night), a significant proportion of the
workstations are idle (not being used), resulting in the waste of large amounts of CPU time.
Therefore, the idea of the workstation model is to interconnect all these workstations by
a high-speed LAN so that idle workstations may be used to process jobs of users who are
logged onto other workstations and do not have sufficient processing power at their own
workstations to get their jobs processed efficiently.

[Figure: several workstations interconnected by a communication network]

Fig 2.3. A distributed computing system based on the workstation model

In this model, a user logs onto one of the workstations called his or her home
workstation and submits jobs for execution. When the system finds that the user's
workstation does not have sufficient processing power for executing the processes of
the submitted jobs efficiently, it transfers one or more of the processes from the
user's workstation to some other workstation that is currently idle and gets the
processes executed there, and finally the result of execution is returned to the user's
workstation.
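The placement decision described above can be sketched as follows. This is only a sketch under stated assumptions: the load figures, the thresholds (idle_threshold, home_limit), and the function names are hypothetical, and the hard part of the workstation model, actually migrating a process, is not shown.

```python
from typing import Dict, Optional


def pick_idle_workstation(load: Dict[str, float], home: str,
                          idle_threshold: float = 0.1) -> Optional[str]:
    """Return the least-loaded workstation other than `home` whose load is
    below `idle_threshold`, or None if every other workstation is busy."""
    candidates = {w: l for w, l in load.items() if w != home and l < idle_threshold}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def place_process(load: Dict[str, float], home: str, home_limit: float = 0.8) -> str:
    """Run at home unless home is overloaded and an idle workstation exists."""
    if load[home] < home_limit:
        return home
    return pick_idle_workstation(load, home) or home
```

With load = {"ws1": 0.95, "ws2": 0.05, "ws3": 0.02} and home workstation "ws1", the process is placed on "ws3"; if no other workstation is idle, it stays at home.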
2.4.3 Workstation-Server Model: The workstation model is a network of personal
workstations, each with its own disk and a local file system. A workstation with its own
local disk is usually called a diskful workstation and a workstation without a local disk is
called a diskless workstation. With the advent of high-speed networks, diskless
workstations have become more popular in network environments than diskful workstations,
making the workstation-server model more popular than the workstation model for
building distributed computing systems.
As shown in Fig 2.4, a distributed computing system based on the workstation-server
model consists of a few minicomputers and several workstations (most of which
are diskless, but a few of which may be diskful) interconnected by a communication
network.
For a number of reasons, such as higher reliability and better scalability, multiple
servers are often used for managing the resources of a particular type in a distributed
computing system. For example, there may be multiple file servers, each running on a
separate minicomputer and cooperating via the network, for managing the files of all the
users in the system. For this reason, a distinction is often made between the services
that are provided to clients and the servers that provide them. That is, a service is an
abstract entity that is provided by one or more servers. For example, one or more file
servers may be used in a distributed computing system to provide file service to the users.
In this model, a user logs onto a workstation called his or her home workstation.
Normal computation activities required by the user's processes are performed at the user's
home workstation, but requests for services provided by special servers (such as a file
server or a database server) are sent to a server providing that type of service, which
performs the user's requested activity and returns the result of request processing to the
user's workstation. Therefore, in this model, the user's processes need not be migrated
to the server machines for getting the work done by those machines.


[Figure: workstations and minicomputers interconnected by a communication network; the minicomputers are used as a file server, a database server, and a print server]

Fig. 2.4. A distributed computing system based on the workstation-server model

As compared to the workstation model, the workstation-server model has several
advantages:
1. In general, it is much cheaper to use a few minicomputers equipped with
large, fast disks that are accessed over the network than a large number of
diskful workstations, with each workstation having a small, slow disk.
2. Diskless workstations are also preferred to diskful workstations from a system
maintenance point of view. Backup and hardware maintenance are easier to
perform with a few large disks than with many small disks scattered all over a
building or campus. Furthermore, installing new releases of software (such as
file server with new functionalities) is easier when the software is to be
installed on a few file server machines than on every workstation.
3. In the workstation-server model, since the file servers manage all files, users
have the flexibility to use any workstation and access the files in the same
manner irrespective of which workstation the user is currently logged onto. Note
that this is not true of the workstation model, in which each workstation has
its local file system, because different
mechanisms are needed to access local and remote files.
4. In the workstation-server model, the request-response protocol is mainly used
to access the services of the server machines. Therefore, unlike the
workstation model, this model does not need a process migration facility,
which is difficult to implement. The request-response protocol is known as the
client-server model of communication. In this model, a client process (which
in this case resides on a workstation) sends a request to a server process (which
in this case resides on a minicomputer) for getting some services such as
reading a unit of a file. The server executes the request and sends back a reply
to the client that contains the result of processing.
5. A user has guaranteed response time because workstations are not used for
executing remote processes. However, the model does not utilize the
processing capability of idle workstations.
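The request-response exchange described in point 4 can be sketched as follows. This is a minimal illustration, not a real file service: the message fields (op, name, unit, status), the file store FILES, and the unit size BLOCK are all hypothetical, and an ordinary function call stands in for the network between the workstation and the minicomputer.

```python
FILES = {"notes.txt": b"0123456789abcdef"}  # hypothetical server-side file store
BLOCK = 4                                    # hypothetical size of one unit, in bytes


def file_server(request: dict) -> dict:
    """Server process: execute one request and build the reply."""
    if request.get("op") != "read":
        return {"status": "error", "reason": "unknown operation"}
    data = FILES.get(request["name"])
    if data is None:
        return {"status": "error", "reason": "no such file"}
    start = request["unit"] * BLOCK
    return {"status": "ok", "data": data[start:start + BLOCK]}


def client_read(name: str, unit: int, send=file_server) -> bytes:
    """Client process: send a request, wait for the reply, return its result.

    `send` stands in for the network; here it simply calls the server.
    """
    reply = send({"op": "read", "name": name, "unit": unit})
    if reply["status"] != "ok":
        raise IOError(reply["reason"])
    return reply["data"]
```

The client process stays on the workstation for the whole exchange; only the request and the reply travel, which is why no process migration facility is needed.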
2.4.4 Processor-Pool Model: The processor-pool model is based on the observation that
most of the time a user does not need any computing power but once in a while he or she
may need a very large amount of computing power for a short time. Therefore, unlike the
workstation-server model in which a processor is allocated to each user, in the processor-
pool model the processors are pooled together to be shared by the users as needed. The
pool of processors consists of a large number of microcomputers and minicomputers
attached to the network. Each processor in the pool has its own memory to load and run a
system program or an application program of the distributed computing system.
As shown in Fig 2.5, in the pure processor-pool model, the processors in the pool
have no terminals attached directly to them, and users access the system from terminals
that are attached to the network via special devices. A special server (called a run server)
manages and allocates the processors in the pool to different users on a demand basis.



[Figure: servers, including a file server, and a pool of processors]

Fig 2.5 The Processor pool model

When a user submits a job for computation, the run server temporarily assigns an
appropriate number of processors to his or her job. For example, if the user's computation
job is the compilation of a program having n segments, in which each of the segments
can be compiled independently to produce separate relocatable object files, n processors
from the pool can be allocated to this job to compile all the n segments in parallel. When
the computation is completed, the processors are returned to the pool for use by other
users.
In the processor-pool model there is no concept of a home machine. That is, a
user does not log onto a particular machine but to the system as a whole. This is in
contrast to other models in which each user has a home machine (e.g. a workstation or
minicomputer) onto which he or she logs and on which he or she runs most of his or her
programs by default.
Amoeba and the Cambridge Distributed Computing System are examples of
distributed computing systems based on the processor-pool model.
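The role of the run server can be sketched as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, processors are identified by plain strings, and the policy of granting fewer processors when the pool is short is one possible choice among many.

```python
from typing import Dict, List


class RunServer:
    """Hands out processors from a shared pool and takes them back."""

    def __init__(self, processors: List[str]) -> None:
        self.free = list(processors)
        self.assigned: Dict[str, List[str]] = {}

    def allocate(self, job: str, wanted: int) -> List[str]:
        """Assign up to `wanted` free processors to `job` (fewer if the pool is short)."""
        granted, self.free = self.free[:wanted], self.free[wanted:]
        self.assigned[job] = granted
        return granted

    def release(self, job: str) -> None:
        """Return a finished job's processors to the pool for use by other users."""
        self.free.extend(self.assigned.pop(job, []))
```

For a compilation with three independent segments, allocate("compile", 3) would grant three processors, one per segment, and release("compile") returns them to the pool once all segments are compiled.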
2.4.5 Hybrid Model: Out of the four models above, the workstation-server model is the
most widely used model for building distributed computing systems. This is because a
large number of computer users only perform simple interactive tasks such as editing
jobs, sending electronic mail, and executing small programs. The workstation-server
model is ideal for such simple usage. However, in a working environment that has
groups of users who often perform jobs needing massive computation, the
processor-pool model is more attractive and suitable.
To combine the advantages of both the workstation-server and processor-pool
models, a hybrid model may be used to build a distributed computing system. This hybrid
model is based on the workstation-server model but with the addition of a pool of
processors. The processors in the pool can be allocated dynamically for computations that
are too large for workstations or that require several computers concurrently for efficient
execution. In addition to efficient execution of computation-intensive jobs, the hybrid
model gives guaranteed response to interactive jobs by allowing them to be processed on
the local workstations of the users. However, the hybrid model is more expensive to
implement than the workstation-server model or the processor-pool model.
From the models of distributed computing systems presented above, it is obvious
that distributed computing systems are much more complex and difficult to build than
traditional centralized systems (those consisting of a single CPU, its memory, peripherals,
and one or more terminals). The increased complexity is mainly due to the fact that in
addition to being capable of effectively using and managing a very large number of
distributed resources, the system software of a distributed computing system should also
be capable of handling the communication and security problems that are very different
from those of centralized systems. For example, the performance and reliability of a
distributed computing system depends to a great extent on the performance and reliability
of the underlying communication network. Special software is usually needed to handle
loss of messages during transmission across the network or to prevent overloading of the
network that degrades the performance and responsiveness to the users. Similarly, special
software security measures are needed to protect the widely distributed shared resources
and services against intentional or accidental violation of access control and privacy
constraints.
Despite the increased complexity and the difficulty of building distributed
computing systems, the installation and use of distributed computing systems are
increasing, mainly because their advantages outweigh
their disadvantages. The technical needs, the economic pressures, and the major
advantages that have led to the emergence and popularity of distributed computing
systems are described here.
Inherently Distributed Applications: Distributed computing systems come into
existence in some very natural ways. For example, several applications are inherently
distributed in nature and require a distributed computing system for their realization.
For instance, in an employee database of a nationwide organization, the data pertaining
to a particular employee are generated at the employee's branch office, and in addition
to the global need to view the entire database, there is a local need for frequent and
immediate access to locally generated data at each branch office. Such applications
require that some processing power be available at the many distributed locations for
collecting, preprocessing, and accessing data, resulting in the need for distributed
computing systems. Some other examples of inherently distributed
applications are a computerized worldwide airline reservation system, a computerized
banking system in which a customer can deposit/withdraw money from his or her
account from any branch of the bank, and a factory automation system controlling
robots and machines all along an assembly line.
Information Sharing among Distributed Users: An efficient person-to-person
communication facility that shares information over great distances is one more
advantage. In a distributed computing system, the users working at other nodes of the
system can easily and efficiently share information generated by one of the users. This
facility may be useful in many ways. For example, two or more users who are
geographically far from each other, but whose computers are parts of the same
distributed computing system, can work together on a project.
The use of distributed computing systems by a group of users to work
cooperatively is known as computer-supported cooperative working (CSCW), or
groupware.
Resource Sharing: Information is not the only thing that can be shared in a distributed
computing system. Sharing of software resources such as software libraries and
databases as well as hardware resources such as printers, hard disks, and plotters can
also be done in a very effective way among all the computers and users of a single
distributed computing system.
Better Price-Performance Ratio: This is one of the most important reasons for the
growing popularity of distributed computing systems. With the rapidly increasing power
and reduction in the price of microprocessors, combined with the increasing speed of
communication networks, distributed computing systems potentially have a much better
price-performance ratio than a single large centralized system. Another reason for
distributed computing systems to be more cost effective than centralized systems is that
they facilitate resource sharing among multiple computers.
Shorter Response Times and Higher Throughput: Due to multiplicity of processors,
distributed computing systems are expected to have better performance than single-
processor centralized systems. The two most commonly used performance metrics are
response time and throughput of user processes. That is, the multiple processors of
distributed computing systems can be utilized properly for providing shorter response
times and higher throughput than a single processor centralized system. Another
method often used in distributed computing systems for achieving better overall
performance is to distribute the load more evenly among the multiple processors by
moving jobs from currently overloaded processors to lightly loaded ones.
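The load-distribution idea above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each processor's load is the number of queued jobs, all jobs are equal in cost, and moving a job is free. The function name rebalance and the processor names are hypothetical and do not come from any real system.

```python
def rebalance(loads):
    """Move one job at a time from the busiest processor to the idlest one
    until their queue lengths differ by at most one job."""
    loads = dict(loads)  # work on a copy so the caller's mapping is untouched
    moves = []           # record of (from_processor, to_processor) migrations
    while True:
        busiest = max(loads, key=loads.get)
        idlest = min(loads, key=loads.get)
        if loads[busiest] - loads[idlest] <= 1:
            return loads, moves
        loads[busiest] -= 1
        loads[idlest] += 1
        moves.append((busiest, idlest))


balanced, moves = rebalance({"p1": 7, "p2": 1, "p3": 1})
print(balanced, len(moves))
```

A real system would also weigh the cost of migrating a job against the expected gain, which this sketch ignores.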
Higher Reliability: Reliability refers to the degree of tolerance against errors and
component failures in a system. A reliable system prevents loss of information even in
the event of component failures. The multiplicity of storage devices and processors in a
distributed computing system allows the maintenance of multiple copies of critical
information within the system. With this approach, if one of the processors fails, the
computation can be successfully completed at the other processor, and if one of the
storage devices fails, the information can still be used from the other storage device.
Availability: An important aspect of reliability is availability, which refers to the
fraction of time for which a system is available for use. In comparison to a centralized
system, a distributed computing system also enjoys the advantage of increased
availability.
Extensibility and Incremental Growth: Another major advantage of distributed
computing systems is that they are capable of incremental growth. That is, it is
possible to gradually extend the power and functionality of a distributed computing
system by simply adding additional resources (both hardware and software) to the
system as and when the need arises. For example, additional processors can be easily
added to the system to handle the increased workload of an organization that might
have resulted from its expansion. Extensibility is also easier on a distributed computing
system because addition of new resources to an existing system can be performed
without significant disruption of the normal functioning of the system. Properly
designed distributed computing systems that have the property of extensibility and
incremental growth are called open distributed systems.
Better Flexibility in Meeting Users' Needs: Different types of computers are usually
more suitable for performing different types of computations. For example, computers
with ordinary power are suitable for ordinary data processing jobs, whereas high-performance
computers are more suitable for complex mathematical computations. In a
centralized system, the users have to perform all types of computations on the only
available computer. A distributed computing system, in contrast, may have a pool of
different types of computers, so the most appropriate one can be selected for each job.
2.5 Distributed Operating System (DOS):
Definition: An operating system is defined as a program that controls the resources of a
computer system and provides its users with an interface or virtual machine that is more
convenient to use than the bare machine. According to this definition, the two primary
tasks of an operating system are as follows:
o To present users with a virtual machine that is easier to program than the
underlying hardware.
o To manage the various resources of the system. This involves performing such
tasks as keeping track of who is using which resource, granting resource requests,
accounting for resource usage, and mediating conflicting requests from different
programs and users.
The Classification: The operating systems commonly used for distributed computing
systems can be broadly classified into two types - network operating systems and
distributed operating systems.
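The three most important features used to differentiate between these two types of
operating systems are system image, autonomy, and fault tolerance capability: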
System Image: The most important feature used to differentiate between the two
types of operating systems is the image of the distributed computing system from
the point of view of its users. In case of a network operating system, the users
view the distributed computing systems as a collection of distinct machines
connected by a communication subsystem. That is, the users are aware of the fact
that multiple computers are being used. On the other hand, a distributed
operating system hides the existence of multiple computers and provides a single-
system image to its users. That is, it makes a collection of networked machines act
as a virtual uniprocessor.
Autonomy: A network operating system is built on a set of existing centralized
operating systems and handles the interface and coordination of remote operations
and communications between these operating systems. That is, in the case of a
network operating system, each computer of the distributed computing system has
its own local operating system (the operating system of different computers may
be the same or different), and there is essentially no coordination at all among the
computers except for the rule that when two processes of different computers
communicate with each other, they must use a mutually agreed on communication
protocol. Each computer functions independently of other computers in the sense
that each one makes independent decisions about the creation and termination of
its own processes and management of local resources. Notice that due to the
possibility of difference in local operating systems, the system calls for different
computers of the same distributed computing system may be different in this case.
On the other hand, with a distributed operating system, there is a single system-wide
operating system and each computer of the distributed computing system
runs a part of this global operating system. The distributed operating system
tightly interweaves all the computers of the distributed computing system in the
sense that they work in close cooperation with each other for the efficient and
effective utilization of the various resources of the system. That is, processes and
several resources are managed globally (some resources are managed locally).
Moreover, there is a single set of globally valid system calls available on all
computers of the distributed computing system.
In short, it can be said that the degree of autonomy of each machine of a
distributed computing system that uses a network operating system is
considerably high as compared to that of machines of a distributed computing
system that uses a distributed operating system.
Fault tolerance capability: A network operating system provides little or no fault
tolerance capability in the sense that if 10% of the machines of the entire
distributed computing system are down at any moment, at least 10% of the users
are unable to continue with their work. On the other hand, with a distributed
operating system, most of the users are normally unaffected by the failed
machines and can continue to perform their work normally, with only a 10% loss
in performance of the entire distributed computing system. Therefore, the fault
tolerance capability of a distributed operating system is usually very high as
compared to that of a network operating system.
Some Important Points to be noted with respect to DOS:
A distributed operating system is one that looks to its users like an ordinary
centralized operating system but runs on multiple, independent central
processing units (CPUs). The key concept here is transparency. In other words,
the multiple processors should be invisible (transparent) to the user. Another
way of expressing the same idea is to say that the user views the system as a
virtual uniprocessor, not as a collection of distinct machines.
A distributed computing system that uses a network operating system is
usually referred to as a network system, whereas one that uses a distributed
operating system is usually referred to as a true distributed system (or simply
a distributed system).
2.7 Issues in Designing a Distributed Operating System:
In general, designing a distributed operating system is more difficult than designing a
centralized operating system for several reasons. In the design of a centralized operating
system, it is assumed that the operating system has access to complete and accurate
information about the environment in which it is functioning. In a distributed system, the
resources are physically separated, there is no common clock among the multiple
processors, delivery of messages is delayed, and messages could even be lost. Due to all
these reasons, a distributed operating system does not have up-to-date, consistent
knowledge about the state of the various components of the underlying distributed system.
Despite these complexities and difficulties, a distributed operating system must be
designed to provide all the advantages of a distributed system to its users. That is, the
users should be able to view a distributed system as a virtual centralized system that is
flexible, efficient, reliable, secure and easy to use. To meet this requirement, the
designers of a distributed operating system must deal with several design issues. Some of
the key design issues are described below.
Transparency: We saw that one of the main goals of a distributed operating
system is to make the existence of multiple computers invisible (transparent) and
provide a single system image to its users. That is, a distributed operating system
must be designed in such a way that a collection of distinct machines connected
by a communication subsystem appears to its users as a virtual uniprocessor. The
eight forms of transparency identified by the International Standards
Organization's Reference Model for Open Distributed Processing [ISO 1992] are
access transparency, location transparency, replication transparency, failure
transparency, migration transparency, concurrency transparency, performance
transparency, and scaling transparency. These transparency aspects are described
below.
o Access Transparency: It means that users should not need or be able to
recognize whether a resource (hardware or software) is remote or local.
This implies that the distributed operating system should allow users to
access remote resources in the same way as local resources. That is, the
user interface, which takes the form of a set of system calls, should not
distinguish between local and remote resources, and it should be the
responsibility of the distributed operating system to locate the resources
and to arrange for servicing user requests in a user-transparent manner.
o Location Transparency: The two main aspects of location transparency are
as follows:
Name transparency: This refers to the fact that the name of a
resource (hardware or software) should not reveal any hint as to
the physical location of the resource. That is, the name of a
resource should be independent of the physical connectivity or
topology of the system or the current location of the resource.
Furthermore, resources that are capable of being moved
from one node to another in a distributed system (such as a file) must
be allowed to move without having their names changed.
Therefore, resource names must be unique system-wide.
User mobility: This refers to the fact that no matter which
machine a user is logged onto, he or she should be able to access a
resource with the same name. That is, the user should not be
required to use different names to access the same resource from
two different nodes of the system.
o Replication Transparency: For better performance and reliability,
almost all distributed operating systems have the provision to create
replicas (additional copies) of files and other resources on different nodes
of the distributed system. In these systems, both the existence of multiple
copies of a replicated resource and the replication activity should be
transparent to the users. That is, two important issues related to replication
transparency are naming of replicas and replication control. It is the
responsibility of the system to name the various copies of a resource and
to map a user-supplied name of the resource to an appropriate replica of
the resource.
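The name-to-replica mapping described above can be illustrated with a small Python sketch. It assumes a directory that stores, for every user-visible name, the nodes holding a copy, and that the set of currently reachable nodes is known. The class ReplicaDirectory, the node names, and the paths are hypothetical and only serve the illustration.

```python
class ReplicaDirectory:
    """Maps one user-visible name to the replicas that hold the resource."""

    def __init__(self):
        self._replicas = {}  # name -> list of (node, local_path)

    def register(self, name, node, local_path):
        self._replicas.setdefault(name, []).append((node, local_path))

    def resolve(self, name, live_nodes):
        # The user supplies only `name`; which copy is used stays hidden.
        for node, local_path in self._replicas.get(name, []):
            if node in live_nodes:
                return node, local_path
        raise LookupError(f"no reachable replica of {name!r}")


directory = ReplicaDirectory()
directory.register("/reports/q1", "nodeA", "/disk0/r17")
directory.register("/reports/q1", "nodeB", "/disk3/r17")
print(directory.resolve("/reports/q1", live_nodes={"nodeB"}))
```

The user-visible name contains no node identifier, so the same sketch also reflects name transparency.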

o Failure Transparency: Failure transparency deals with masking from the
users partial failures in the system, such as a communication link failure, a
machine failure, or a storage device crash. A distributed operating system
having failure transparency property will continue to function, perhaps in
a degraded form, in the face of partial failures. However, in this type of
design, care should be taken to ensure that the cooperation among multiple
servers does not add too much overhead to the system.
o Migration Transparency: For better performance, reliability and security
reasons, an object that is capable of being moved (such as a process or a
file) is often migrated from one node to another in a distributed system.
The aim of migration transparency is to ensure that the movement of the
object is handled automatically by the system in a user-transparent
manner. Three important issues in achieving this goal are as follows:
Migration decisions such as which object is to be moved
from where to where should be made automatically by the
system.
Migration of an object from one node to another should not
require any change in its name.
When the migrating object is a process, the inter process
communication mechanism should ensure that a message
sent to the migrating process reaches it without the need for
the sender process to resend it if the receiver process moves
to another node before the message is received.
o Concurrency Transparency: In a distributed system, multiple users who
are spatially separated use the system concurrently. In such a situation, it is
economical to share the system resources (hardware or software) among
the concurrently executing user processes. However, since the number of
available resources in a computing system is restricted, one user process
must necessarily influence the action of other concurrently executing user
processes, as it competes for resources. For providing concurrency
transparency, the resource sharing mechanisms of the distributed operating
system must have the following four properties.
An event-ordering property ensures that all access requests to
various system resources are properly ordered to provide a
consistent view to all users of the system.
A mutual-exclusion property ensures that at any time at most one
process accesses a shared resource, which must not be used
simultaneously by multiple processes if program operation is to be
correct.
A no-starvation property ensures that if every process that is
granted a resource, which must not be used simultaneously by
multiple processes, eventually releases it, every request for that
resource is eventually granted.
A no-deadlock property ensures that a situation will never occur in
which competing processes prevent their mutual progress even
though no single one requests more resources than available in the
system.
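As a small illustration of the mutual-exclusion property, the following Python sketch uses threads inside one process in place of processes on different nodes, which is a simplifying assumption. The function run_deposits is hypothetical. The lock guarantees that at most one thread updates the shared counter at a time.

```python
import threading


def run_deposits(n_threads, per_thread):
    """Increment one shared counter from several threads under a lock."""
    balance = 0
    lock = threading.Lock()

    def deposit():
        nonlocal balance
        for _ in range(per_thread):
            with lock:        # at most one thread is inside this block
                balance += 1  # read-modify-write on the shared resource

    threads = [threading.Thread(target=deposit) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return balance


print(run_deposits(4, 10_000))
```

In a distributed operating system the same guarantee has to be provided without shared memory, typically by a message-based mutual exclusion algorithm.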
o Performance Transparency: The aim of performance transparency is to
allow the system to be automatically reconfigured to improve
performance, as loads vary dynamically in the system. As far as
practicable, a situation in which one processor of the system is
overloaded with jobs while another processor is idle should not be allowed
to occur. That is, the processing capability of the system should be
uniformly distributed among the currently available jobs in the system.
This requirement calls for the support of intelligent resource allocation and
process migration facilities in distributed operating systems.
o Scaling Transparency: The aim of scaling transparency is to allow the
system to expand in scale without disrupting the activities of the users.
This requirement calls for open-system architecture and the use of scalable
algorithms for designing the distributed operating system components. That
is, every component of a distributed operating system must use scalable
algorithms.
Reliability: In general, distributed systems are expected to be more reliable than
centralized systems due to the existence of multiple instances of resources.
However, the existence of multiple instances of the resources alone cannot
increase the system's reliability. Rather, the distributed operating system, which
manages these resources, must be designed properly to increase the system's
reliability by taking full advantage of this characteristic feature of a distributed
system.
For higher reliability, the fault-handling mechanisms of a distributed
operating system must be designed properly to avoid faults, to tolerate faults, and
to detect and recover from faults. Commonly used methods for dealing with these
issues are briefly described here.
o Fault Avoidance: Fault Avoidance deals with designing the components
of the system in such a way that the occurrence of faults is minimized.
Conservative design practices such as using high reliability components
are often employed for improving the system's reliability based on the
idea of fault avoidance. Although a distributed operating system often has
little or no role to play in improving the fault avoidance capability of a
hardware component, the designers of the various software components of
the distributed operating system must test them thoroughly to make these
components highly reliable.
o Fault Tolerance: Fault tolerance is the ability of a system to continue
functioning in the event of partial system failure. The performance of the
system might be degraded due to partial failure, but otherwise the system
functions properly. Some of the important concepts that may be used to
improve the fault tolerance ability of a distributed operating system are as
follows:
Redundancy techniques: The basic idea behind redundancy
techniques is to avoid single points of failure by replicating critical
hardware and software components, so that if one of them fails, the
others can be used to continue. Obviously, having two or more
copies of a critical component makes it possible, at least in
principle, to continue operations in spite of occasional partial
failures.
Distributed control: For better reliability, many of the particular
algorithms or protocols used in a distributed operating system must
employ a distributed control mechanism to avoid single points of
failure.
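The redundancy idea can be sketched as a simple failover loop in Python. The sketch assumes that each replica is a callable that either returns a result or raises ConnectionError when it is down. The function call_with_failover and the two stand-in replicas are hypothetical.

```python
def call_with_failover(replicas, request):
    """Try each replica in turn and return the first successful reply."""
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError:
            continue  # this copy is down, so fall through to the next one
    raise ConnectionError(f"all {len(replicas)} replicas failed")


def crashed_replica(request):
    raise ConnectionError("node is down")  # stand-in for a failed component


def healthy_replica(request):
    return f"served {request}"


print(call_with_failover([crashed_replica, healthy_replica], "read block 7"))
```

The request succeeds as long as at least one copy of the component is working, which is the sense in which replication removes a single point of failure.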
o Fault Detection and Recovery: The fault detection and recovery method
of improving reliability deals with the use of hardware and software
mechanisms to determine the occurrence of a failure and then to correct
the system to a state acceptable for continued operation. Some of the
commonly used techniques for implementing this method in a distributed
operating system are as follows:

Atomic transactions: An atomic transaction (or just transaction for
short) is a computation consisting of a collection of operations that
take place indivisibly in the presence of failures and concurrent
computations. That is, either all of the operations are performed
successfully or none of their effects prevails, and other processes
executing concurrently cannot modify or observe intermediate
states of the computation. Transactions help to preserve the
consistency of a set of shared data objects (e.g. files) in the face of
failures and concurrent access. They make crash recovery much
easier, because a transaction can only end in two states: Either all
the operations of the transaction are performed or none of the
operations of the transaction is performed.
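The all-or-nothing behaviour can be illustrated with the following Python sketch. It is a deliberately simplified model: the shared data is an in-memory dictionary, the operations are applied to a private copy, and the copy replaces the original only if every operation succeeds. Concurrency control and crash durability are not covered. The names run_atomically, debit, and credit are hypothetical.

```python
import copy


def run_atomically(store, operations):
    """Apply every operation or none: work on a copy, publish it at the end."""
    working = copy.deepcopy(store)
    for operation in operations:
        operation(working)  # any exception aborts before `store` is touched
    store.clear()
    store.update(working)   # commit point: all effects become visible


def debit(account, amount):
    def operation(accounts):
        if accounts[account] < amount:
            raise ValueError("insufficient funds")
        accounts[account] -= amount
    return operation


def credit(account, amount):
    def operation(accounts):
        accounts[account] += amount
    return operation


accounts = {"a": 100, "b": 0}
run_atomically(accounts, [debit("a", 40), credit("b", 40)])
print(accounts)
```

If the debit fails, the credit that ran before or after it leaves no trace, so the data never shows a half-finished transfer.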

Stateless servers: The client-server model is frequently used in
distributed systems to service user requests. In this model, a server
may be implemented by using any one of the following two service
paradigms: stateful or stateless. The two paradigms are
distinguished by one aspect of the client-server relationship,
whether or not the history of the served requests between a client
and a server affects the execution of the next service request. The
stateful approach does depend on the history of the serviced
requests, but the stateless approach does not depend on it.
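The difference between the two paradigms can be made concrete with a sketch of a read-only file server in Python. The classes StatefulServer and StatelessServer, and the byte string standing in for a file, are assumptions made for illustration. The stateful server remembers each client's read position, whereas the stateless server expects the position in every request.

```python
DATA = b"distributed operating systems"  # stand-in for a file's contents


class StatefulServer:
    """Remembers how far each client has read (history affects the reply)."""

    def __init__(self):
        self._offsets = {}

    def read(self, client_id, count):
        start = self._offsets.get(client_id, 0)
        self._offsets[client_id] = start + count
        return DATA[start:start + count]


class StatelessServer:
    """Keeps nothing between requests; the client supplies the offset."""

    def read(self, offset, count):
        return DATA[offset:offset + count]


server = StatefulServer()
print(server.read("c1", 11))            # b'distributed'
server = StatefulServer()               # simulated crash and restart
print(server.read("c1", 11))            # position lost: b'distributed' again
print(StatelessServer().read(12, 9))    # b'operating' regardless of restarts
```

After a restart the stateless server needs no recovery, because each request carries everything required to serve it.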
Acknowledgements and timeout-based retransmissions of
messages: In a distributed system, events such as a node crash or a
communication link failure may interrupt a communication that
was in progress between two processes, resulting in the loss of a
message. Therefore, a reliable inter-process communication
mechanism must have ways to detect lost messages so that they can
be retransmitted. Handling of lost messages usually involves
acknowledgements and timeout-based retransmissions: the receiver
returns an acknowledgement for every message received, and if the
sender does not receive any acknowledgement for a message within a
timeout period, it assumes the message was lost and retransmits it.
Duplicate messages may be sent in the event of failures or because
of timeouts, so the mechanism must also detect and handle them.
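The interplay of acknowledgements, timeouts, and duplicates can be sketched in Python with a simulated network. The sketch assumes that every message carries a sequence number and that the list losses states, per attempt, whether the network drops the message ("msg"), drops the acknowledgement ("ack"), or drops nothing (None). The names Receiver and send_reliably are hypothetical, and no real sockets or timers are used.

```python
class Receiver:
    def __init__(self):
        self.delivered = []  # payloads handed to the application
        self._seen = set()   # sequence numbers already delivered

    def on_message(self, seq, payload):
        if seq not in self._seen:  # duplicates are acknowledged but not redelivered
            self._seen.add(seq)
            self.delivered.append(payload)
        return seq                 # the acknowledgement carries the sequence number


def send_reliably(receiver, seq, payload, losses):
    """Retransmit until an acknowledgement arrives; return the attempt count."""
    for attempt, lost in enumerate(losses, start=1):
        if lost == "msg":
            continue  # message lost: the timeout expires and we retransmit
        ack = receiver.on_message(seq, payload)
        if lost == "ack":
            continue  # acknowledgement lost: the retransmission is a duplicate
        if ack == seq:
            return attempt
    raise TimeoutError(f"no acknowledgement after {len(losses)} attempts")


receiver = Receiver()
print(send_reliably(receiver, 1, "hello", ["msg", "ack", None]))  # 3
print(receiver.delivered)                                         # ['hello']
```

The message reaches the receiver twice, but the sequence number lets the receiver deliver it to the application only once.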
Flexibility: Another important issue in the design of distributed operating
systems is flexibility. The design of a distributed operating system should be
flexible due to the following reasons:
o Ease of modification: From the experience of system designers, it has been
found that some parts of the design often need to be replaced/modified
either because some bug is detected in the design or because the design is
no longer suitable for the changed system environment or new-user
requirements. Therefore, it should be easy to incorporate changes in the
system in a user-transparent manner or with minimum interruption caused
to the users.
o Ease of enhancement: In every system, new functionalities have to be
added from time to time to make it more powerful and easy to use.
Therefore, it should be easy to add new services to the system.
The most important design factor that influences the flexibility of a distributed
operating system is the model used for designing its kernel. The kernel of an
operating system is its central controlling part that provides basic system
facilities. It operates in a separate address space that a user cannot replace or
modify. The two commonly used models for kernel design in distributed
operating systems are the monolithic kernel and the micro kernel.
o In the monolithic kernel model, the kernel provides most operating system
services such as process management and inter-process communication. As a
result, the kernel has a large, monolithic structure. Many distributed operating
systems that are extensions or imitations of the UNIX operating system use
the monolithic kernel model. This is mainly because UNIX itself has a large,
monolithic kernel.
o In the micro kernel model, the main goal is to keep the kernel as small as
possible. Therefore, in this model, the kernel is a very small nucleus of
software that provides only the minimal facilities necessary for implementing
additional operating system services. The only services provided by the kernel
in this model are inter-process communication, low-level device management
and some memory management. All other operating system services, such as
file management, name management, additional process and memory
management activities, and much system call handling are implemented as
user-level server processes.
[Figure: Node 1, Node 2, ..., Node n each run user applications on top of a monolithic
kernel that includes most OS services; the nodes are connected by network hardware.]
Fig 2.6(a) The monolithic kernel model.

Performance: If a distributed system is to be used, its performance must be at
least as good as a centralized system. That is, when a particular application is run
on a distributed system, its overall performance should be better than or at least
equal to that of running the same application on a single-processor system.
However, to achieve this goal, it is important that the various components of the
operating system of a distributed system be designed properly; otherwise, the
overall performance of the distributed system may turn out to be worse than a
centralized system.
[Figure: Node 1, Node 2, ..., Node n each run user applications and server/manager
modules on top of a micro kernel that has only minimal facilities; the nodes are
connected by network hardware.]
Fig 2.6(b) The micro kernel model.

Scalability: Scalability refers to the capability of a system to adapt to increased
service load. It is inevitable that a distributed system will grow with time since it
is very common to add new machines or an entire sub-network to the system to
take care of increased workload or organizational changes in a company.
Therefore, a distributed operating system should be designed to easily cope with
the growth of nodes and users in the system. That is, such growth should not
cause serious disruption of service or significant loss of performance to users.

Heterogeneity: A heterogeneous distributed system consists of interconnected
sets of dissimilar hardware or software systems. Because of the diversity,
designing heterogeneous distributed systems is far more difficult than designing
homogeneous distributed systems in which each system is based on the same, or
closely related, hardware and software. However, as a consequence of large scale,
heterogeneity is often inevitable in distributed systems. Furthermore, many users
often prefer heterogeneity because heterogeneous distributed systems give them the
flexibility of using different computer platforms for different applications.

Security: In order that the users can trust the system and rely on it, the various
resources of a computer system must be protected against destruction and
unauthorized access. Enforcing security in a distributed system is more difficult
than in a centralized system because of the lack of a single point of control and
the use of insecure networks for data communication. In a centralized system, all
users are authenticated by the system at login time, and the system can easily
check whether a user is authorized to perform the requested operation on an
accessed resource. In a distributed system, however, since the client-server model
is often used for requesting and providing services, when a client sends a request
message to a server, the server must have some way of knowing who the client is.
This is not as simple as it might appear because any client identification field in
the message cannot be trusted. This is because an intruder (a person or program
trying to obtain unauthorized access to system resources) may pretend to be an
authorized client or may change the message contents during transmission.
Therefore, as compared to a centralized system, enforcement of security in a
distributed system has the following additional requirements:
1. It should be possible for the sender of a message to know that the
intended receiver received the message.
2. It should be possible for the receiver of a message to know that the
message was sent by the genuine sender.
3. It should be possible for both the sender and receiver of a message to
be guaranteed that the contents of the message were not changed while
it was in transfer.
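Requirements 2 and 3 can be illustrated with a message authentication code, here computed with Python's standard hmac module. This is one possible technique, not one prescribed by the text, and it assumes that sender and receiver already share a secret key. The helper names sign and verify and the sample key are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute an authentication tag that only holders of `key` can produce."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check that `message` is unmodified and was tagged with `key`."""
    return hmac.compare_digest(sign(key, message), tag)


shared_key = b"example-shared-secret"  # assumed to be agreed on beforehand
message = b"withdraw 100 from account 42"
tag = sign(shared_key, message)

print(verify(shared_key, message, tag))                           # True
print(verify(shared_key, b"withdraw 900 from account 42", tag))   # False
print(verify(b"intruder-key", message, tag))                      # False
```

Requirement 1 could be met in the same spirit by having the receiver return an acknowledgement that carries its own tag.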
Emulation of Existing Operating Systems: For commercial success, it is
important that a newly designed distributed operating system be able to emulate
existing popular operating systems such as UNIX. With this property, new software
can be written using the system call interface of the new operating system to take
full advantage of its special features of distribution, but a vast amount of already
existing old software can also be run on the same system without the need to
rewrite it. Therefore, moving to the new distributed operating system will
allow both types of software to be run side by side.
2.8 Introduction to Distributed Computing Environment (DCE):
The Open Software Foundation (OSF), a consortium of computer manufacturers including
IBM, DEC, and Hewlett Packard, defined a vendor-independent distributed computing
environment (DCE).
2.8.1 DCE: It is not an operating system, nor is it an application. To a certain extent, it is
an integrated set of services and tools that can be installed as a coherent environment on
top of existing operating systems and serve as a platform for building and running
distributed applications.
A primary goal of DCE is vendor independence. It runs on many different kinds
of computers, operating systems, and networks produced by different vendors. For
example, some operating systems to which DCE can be easily ported include OSF/1, AIX,
and OS/2. Likewise, it can be used with any network hardware and transport
software, including TCP/IP and X.25 as well as other similar products.
As shown in Fig 2.7, DCE is a middleware software layered between the
DCE applications layer and the operating system and networking layer. The basic idea is
to take a collection of existing machines (possibly from different vendors), interconnect
them by a communication network, add the DCE software platform on top of the native
operating systems of the machines, and then be able to build and run distributed
applications. Each machine has its own local operating system, which may be different
from that of other machines. The DCE software layer on top of the operating system and
networking layer hides the differences between machines by automatically performing
data-type conversions when necessary.
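Why such conversions are needed can be seen from byte ordering. The Python sketch below uses the standard struct module and assumes that both sides agree to send 32-bit integers in network (big-endian) byte order. It illustrates the general problem only and is not the actual DCE wire format. The helper names encode_int32 and decode_int32 are hypothetical.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Pack a signed 32-bit integer in network (big-endian) byte order."""
    return struct.pack("!i", value)


def decode_int32(data: bytes) -> int:
    """Unpack a signed 32-bit integer sent in network byte order."""
    return struct.unpack("!i", data)[0]


wire = encode_int32(1)
print(wire)                            # b'\x00\x00\x00\x01'
print(decode_int32(wire))              # 1
# A little-endian machine that skipped the conversion would misread the bytes:
print(struct.unpack("<i", wire)[0])    # 16777216
```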
2.8.2 DCE Components: It is a mix of various technologies developed independently
and nicely integrated by OSF. Each of these technologies forms a component of DCE.
The main components of DCE are as follows:

[Figure: three stacked layers, from top to bottom: DCE applications, DCE software,
operating systems and networking.]
Fig 2.7 Position of DCE software in a DCE based distributed system

1. Threads package: It provides a simple programming model for building
concurrent applications. It includes operations to create and control multiple
threads of execution in a single process and to synchronize access to global
data within an application.
2. Remote Procedure Call (RPC) facility: It provides programmers with a
number of powerful tools necessary to build client-server applications. In fact,
the DCE RPC facility is the basis for all communication in DCE because the
programming model underlying all of DCE is the client-server model.
3. Distributed Time Service (DTS): It closely synchronizes the clocks of all the
computers in the system. It also permits the use of time values from external
time sources, such as those of the U.S. National Institute for Standards and
Technology (NIST), to synchronize the clocks of the computers in the system
with external time. This facility can also be used to synchronize the clocks of
the computers of one distributed environment with the clocks of the
computers of another distributed environment.
4. Name Services: The name services of DCE include the cell directory service
(CDS), the Global Directory Service (GDS), and Global Directory Agent
(GDA). These services allow resources such as servers, files, devices, and so
on, to be uniquely named and accessed in a location-transparent manner.
5. Security Service: It provides the tools needed for authentication and
authorization to protect system resources against illegitimate access.
6. Distributed File Service (DFS): It provides a system-wide file system that has
such characteristics as location transparency, high performance, and high
availability. A unique feature of DCE DFS is that it can also provide file
services to clients of other file systems.
2.8.3 DCE Cells: The DCE system is highly scalable in the sense that a system running
DCE can have thousands of computers and millions of users spread over a worldwide
geographic area. To accommodate such large systems, DCE breaks a large system down
into smaller, manageable units called cells.
In a DCE system, a cell is a group of users, machines, or other resources that
typically have a common purpose and share common DCE services. The minimum cell
configuration requires a cell directory server, a security server, a distributed time server,
and one or more client machines. Each DCE client machine has client processes for the
security service, cell directory service, distributed time service, RPC facility, and threads
facility. A DCE client machine may also have a process for the distributed file service if the
cell configuration has a DCE distributed file server. Because the method of
intersection is used for clock synchronization, it is recommended that each cell in a DCE
system have at least three distributed time servers.
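The recommendation of at least three time servers can be illustrated with a small sketch. It assumes, hypothetically, that each time server reports its time as an interval (its reading minus and plus its inaccuracy) and that a client takes the midpoint of the region on which the largest number of intervals agree. The function name and the sample values are illustrative assumptions; this is a generic interval-intersection sketch and not the actual DCE DTS code.

```python
def best_overlap(intervals):
    """Return (count, (low, high)) for the region covered by the most intervals."""
    events = []
    for low, high in intervals:
        events.append((low, 0))   # 0 = an interval starts (sorts before an end at the same point)
        events.append((high, 1))  # 1 = an interval ends
    events.sort()
    best_count, count, region = 0, 0, None
    for i, (point, kind) in enumerate(events):
        if kind == 0:
            count += 1
            if count > best_count:
                best_count = count
                # The region of maximum overlap runs up to the next event.
                region = (point, events[i + 1][0])
        else:
            count -= 1
    return best_count, region

# Three servers; the third one is faulty and far off (hypothetical values, in seconds).
reports = [(10.0, 12.0), (11.0, 13.0), (20.0, 21.0)]
agreeing, (low, high) = best_overlap(reports)
estimate = (low + high) / 2
print(agreeing, (low, high), estimate)
```

With only two servers, a disagreement could not be attributed to either one. With three, the two intervals that overlap outvote the faulty one.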
Check Your Progress 1: Answer the following questions in one or two sentences.
a) Define a distributed computing system.
b) List the different computing system models.
c) What is a DCE?
d) Define transparency in DCS.
e) List the various forms of transparencies expected by a DCS.

Check Your Progress 2: Answer the following questions

1. Explain the different computing system models.
2. Why are distributed computing systems gaining popularity? Explain.
3. Discuss the various design issues of a distributed operating system.
4. Explain Distributed Computing Environment.
5. Discuss the relative advantages and disadvantages of the various commonly used
models for configuring distributed computing systems.

2.9 Summary:
Let us sum up the concepts we have studied so far.
A distributed computing system is a collection of processors
interconnected by a communication network in which each processor has
its own local memory and other peripherals, and communication between
any two processors of the system takes place by message passing over the
communication network.
The existing models for distributed computing systems can be broadly
classified into five categories: minicomputer, workstation, workstation-server,
processor-pool and hybrid.
Distributed computing systems are much more complex and difficult to
build than traditional centralized systems. Despite the increased
complexity and the difficulty of building them, the installation and use of
distributed computing systems are rapidly increasing. This is mainly
because the advantages of distributed computing systems outweigh their
disadvantages.
The main advantages of distributed computing systems are (a) suitability
for inherently distributed applications, (b) sharing of information among
distributed users, (c) sharing of resources, (d) better price-performance
ratio, (e) shorter response times and higher throughput, (f) higher reliability,
(g) extensibility and incremental growth and (h) better flexibility in
meeting users' needs.
The operating systems commonly used for distributed computing systems
can be broadly classified into two types: network operating systems and
distributed operating systems. As compared to a network operating
system, a distributed operating system has better transparency and fault
tolerance capability and provides the image of a virtual uniprocessor to the users.
The main issues involved in the design of a distributed operating system are
transparency, reliability, flexibility, performance, scalability, heterogeneity,
security and emulation of existing operating systems.




3.0 Objectives
3.1 Introduction
3.2 Review of Databases
3.2.1 The Relational Model
3.2.2 The Relational Operations
3.3 Distributed Processing- An Introduction
3.4 Features of Distributed Versus Centralized Databases
3.5 Uses of Distributed Databases
3.6 Distributed Database Management Systems
3.7 Summary
3.0 Objectives: The main objectives of this unit are:
Familiarization of
Basic Database Concepts
Distributed processing
Distributed Databases
Distributed Database Management Systems
3.1 Introduction:
In recent years databases have become an important area of information processing, and
it is easy to predict that their importance will grow rapidly. There are both organizational and
technological reasons for this trend. Distributed databases eliminate many of the
problems of centralized databases and fit more naturally in the decentralized structures of
many organizations. A loose definition of a distributed database is that it is a
collection of data which belong logically to the same system but are spread over the sites of a
computer network. To understand this new way of data processing, this unit first
reviews the basic concepts of databases. An overview of distributed data
processing follows. We then compare the conventional centralized
database system with the distributed database system. Finally, the Distributed Data Base
Management System (DDBMS) is discussed.
3.2 Review of Databases:
In this section we discuss some basic concepts of databases, which are
necessary for understanding the remaining topics of this unit. The database model used is the
relational model. The relational model allows the use of powerful, set-oriented,
associative expressions instead of the one-record-at-a-time primitives of more
procedural models like the Codasyl model. In this course we use two different data
manipulation languages: relational algebra and SQL. SQL, which is a user-friendly
language, is used for dealing with the problems of writing application programs for
distributed databases. Relational algebra is used instead for describing and manipulating
access strategies to distributed databases.
3.2.1 The Relational Model: Here some basic terms are defined. They are as follows:
o Relations: These are the tables that are used for storing the data in relational databases.
o Attributes: Each relation has a (fixed) number of columns, called attributes.
o Tuples: These are the rows of a relation; their number is dynamic and varies with time.
o The number of attributes of a relation is called its grade.
o The number of tuples is called its cardinality.
o The set of possible values of a given attribute is called its domain.
An example demonstrating the above terminology:

EMPNUM  NAME     AGE  DEPTNUM
2       Vasu     25   1
7       Varma    34   2
11      Rama     18   1
12      Krishna  22   3
14      Hari     30   1

(a) Relation EMP

EMPNUM  AGE  DEPTNUM  NAME
2       25   1        Vasu
7       34   2        Varma
11      18   1        Rama
12      22   3        Krishna
14      30   1        Hari

(b) Relation EMP

Fig 3.1 Example of relation
Figure 3.1(a) shows a relation EMP (employee) consisting of four attributes: EMPNUM,
NAME, AGE and DEPTNUM. The relation has five tuples; for example, (2, Vasu, 25, 1)
is a tuple of the EMP relation. The grade of relation EMP is 4 and the cardinality is 5.
o Relation Schema: The relation name and the names of the attributes appearing in it
are called the relation schema of the relation; for example,
EMP (EMPNUM, NAME, AGE, DEPTNUM)
is the relation schema of relation EMP in the above example.
o Strictly speaking the relations are treated as sets of tuples. So, the following
points are to be noted:
1. There cannot be two identical tuples in the same relation.
2. There is no defined order of the tuples of a relation.
o The keys of a relation are the subsets of the attributes of its relation schema
whose values are unique within the relation, and which can thus be used to uniquely
identify the tuples of the relation. Given that relations are sets of tuples, a key
must exist; at the least, the set of all attributes of the relation constitutes a key.
Thus, in the example of Figure 3.1, EMPNUM is an appropriate key of
EMP. A relation can have more than one key; for instance, the pair of
attributes (EMPNUM, DEPTNUM) is also unique. Typically, one of the keys is selected as the
primary key.
o The relational algebra is a collection of operations on relations, each of which
takes one or two relations as operands and produces one relation as result.
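The terms defined above can be made concrete with a short Python sketch. It stores the relation EMP of Fig 3.1(a) as a set of tuples next to its schema and computes the grade and cardinality; the function names are illustrative assumptions, not part of any database product.

```python
# Relation EMP of Fig 3.1(a): a schema (attribute names) plus a set of tuples.
EMP_SCHEMA = ("EMPNUM", "NAME", "AGE", "DEPTNUM")
EMP = {
    (2, "Vasu", 25, 1),
    (7, "Varma", 34, 2),
    (11, "Rama", 18, 1),
    (12, "Krishna", 22, 3),
    (14, "Hari", 30, 1),
}

def grade(schema):
    """Number of attributes of the relation."""
    return len(schema)

def cardinality(relation):
    """Number of tuples of the relation."""
    return len(relation)

def is_key(schema, relation, attrs):
    """True if the values of attrs are unique across the stored tuples."""
    positions = [schema.index(a) for a in attrs]
    seen = {tuple(t[p] for p in positions) for t in relation}
    return len(seen) == len(relation)

print(grade(EMP_SCHEMA), cardinality(EMP))
print(is_key(EMP_SCHEMA, EMP, ["EMPNUM"]))
print(is_key(EMP_SCHEMA, EMP, ["DEPTNUM"]))
```

The check inspects only the tuples currently stored, so it can refute a candidate key (as it does for DEPTNUM) but cannot prove that a key will hold for every future state of the relation.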

3.2.2 The Relational Operations: The operations of relational algebra can be composed
into arbitrarily complex expressions. Expressions specify the manipulations that
are required on relations in order to retrieve information from them.
o Five basic operations are defined on relations: Selection, Projection,
Cartesian product, Union, and
Difference. From these operations, some other operations are derived, such as
Intersection, Division, Join and Semi-join. We give particular attention to the
operations that have an important application in distributed databases, such as
Join and Semi-join. The meaning of each operation is described with reference to an
example (Fig 3.2).
Unary operations take only one relation as operand; they include selection
and projection.
The Selection SL_F R, where R is the operand to which the selection is
applied and F is a formula that expresses a selection predicate, produces a
result relation with the same relation schema as the operand relation,
containing the subset of the tuples of the operand that satisfy the
predicate. The formula involves attribute names or constants as operands,
arithmetic comparison operators, and logical operators.
The Projection PJ_Attr R, where Attr denotes a subset of the attributes of the
operand relation, produces a result having these attributes as its relation
schema.
Binary operations take two relations as operands; we review union,
difference, Cartesian product, join and semi-join.
The union R UN S is meaningful only between two relations R and S
with the same relation schema; it produces a relation with the same
relation schema as its operands and the union of the tuples of R and S (i.e.,
all tuples appearing either in R or in S or in both).
The difference R DF S is meaningful only between two relations R and S
with the same relation schema; it produces a relation with the same
relation schema as its operands and the difference between the tuples of R
and S (i.e., all tuples appearing in R but not in S).
The Cartesian product R CP S produces a relation whose relation
schema includes all the attributes of R and S. If two attributes with the
same name appear in R and S, they are nevertheless considered as
different attributes; in order to avoid ambiguity, the name of each attribute
is prefixed with the name of its original relation. Every tuple of R is
combined with every tuple of S to form one tuple of the result.
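The five basic operations can be sketched in Python by treating a relation as a set of tuples with a separate schema. The operand relations R and S are those of Fig 3.2(a); the function names and the representation are illustrative assumptions, not a standard library.

```python
from itertools import product

# Operand relations R and S of Fig 3.2(a), both with schema (A, B, C).
SCHEMA = ("A", "B", "C")
R = {("a", 1, "a"), ("b", 1, "b"), ("a", 1, "d"), ("b", 2, "f")}
S = {("a", 1, "a"), ("a", 3, "f")}

def select(schema, rel, predicate):
    """SL_F: keep the tuples that satisfy the predicate F."""
    return {t for t in rel if predicate(dict(zip(schema, t)))}

def project(schema, rel, attrs):
    """PJ_Attr: keep only the named attributes; duplicates vanish in the set."""
    idx = [schema.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

def cartesian(schema_r, r, schema_s, s, name_r="R", name_s="S"):
    """CP: combine every tuple of r with every tuple of s; attributes are prefixed."""
    schema = tuple(f"{name_r}.{a}" for a in schema_r) + \
             tuple(f"{name_s}.{a}" for a in schema_s)
    return schema, {tr + ts for tr, ts in product(r, s)}

sel = select(SCHEMA, R, lambda t: t["A"] == "a")   # SL_{A=a} R
proj = project(SCHEMA, R, ["A", "B"])              # PJ_{A,B} R
union = R | S                                      # R UN S
diff = R - S                                       # R DF S
cp_schema, cp = cartesian(SCHEMA, R, SCHEMA, S)    # R CP S
print(sorted(sel), sorted(proj), len(union), sorted(diff), len(cp))
```

Union and difference map directly onto Python's set operators because both operands share the same schema, which is exactly the condition the definitions above require.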
The join of two relations R and S is denoted as R JN_F S, where F is a
formula that specifies the join predicate. A join is derived from
selection and Cartesian product as follows:
R JN_F S = SL_F (R CP S)
The natural join R NJN S of two relations R and S is a join in which all
attributes with the same names in the two relations are compared. Since
these attributes have both the same name and the same values in all the
tuples of the result, one of the two attributes is omitted from the result.
Example: The operations are illustrated in Fig 3.2.

R            S            T
A  B  C      A  B  C      B  C  D
a  1  a      a  1  a      1  a  1
b  1  b      a  3  f      3  b  1
a  1  d                   3  c  2
b  2  f                   1  d  4
                          2  a  3
a) Operand relations

A  B  C      A  B      A  B  C
a  1  a      a  1      a  1  a
a  1  d      b  1      b  1  b
             b  2      a  1  d
                       b  2  f
                       a  3  f
b) Selection SL_{A=a} R    c) Projection PJ_{A,B} R    d) Union R UN S

A  B  C
b  1  b
a  1  d
b  2  f
e) Difference R DF S

R.A  R.B  R.C  S.A  S.B  S.C
a    1    a    a    1    a
b    1    b    a    1    a
a    1    d    a    1    a
b    2    f    a    1    a
a    1    a    a    3    f
b    1    b    a    3    f
a    1    d    a    3    f
b    2    f    a    3    f
f) Cartesian product R CP S

A  R.B  R.C  T.B  T.C  D
a  1    a    1    a    1
a  1    a    2    a    3
b  1    b    3    b    1
a  1    d    1    d    4
g) Join R JN_{R.C=T.C} T

A  B  C  D      A  B  C      A  B  C
a  1  a  1      a  1  a      a  1  a
a  1  d  4      b  1  b      a  1  d
                a  1  d
h) Natural join R NJN T    i) Semi-join R SJ_{R.C=T.C} T    j) Natural semi-join R NSJ T
Figure 3.2 Operations of relational algebra.
The semi-join of two relations R and S is denoted as R SJ_F S, where F is a
formula that specifies a join predicate. A semi-join is derived from
projection and join as follows:
R SJ_F S = PJ_Attr(R) (R JN_F S)
where Attr(R) denotes the set of all attributes of R.
The natural semi-join of two relations R and S, denoted R NSJ S, is
obtained by considering a semi-join with the same join predicate as in the
natural join.
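The derived operations can be sketched in the same spirit. Here relations are lists of dictionaries keyed by attribute name, which makes comparing same-named attributes easy; R and T are the operand relations of Fig 3.2(a), and the function names are illustrative assumptions.

```python
# Operand relations R(A, B, C) and T(B, C, D) of Fig 3.2(a), as lists of dicts.
R = [{"A": "a", "B": 1, "C": "a"}, {"A": "b", "B": 1, "C": "b"},
     {"A": "a", "B": 1, "C": "d"}, {"A": "b", "B": 2, "C": "f"}]
T = [{"B": 1, "C": "a", "D": 1}, {"B": 3, "C": "b", "D": 1},
     {"B": 3, "C": "c", "D": 2}, {"B": 1, "C": "d", "D": 4},
     {"B": 2, "C": "a", "D": 3}]

def join(r, s, predicate):
    """R JN_F S = SL_F (R CP S): filter the Cartesian product with predicate F."""
    return [(tr, ts) for tr in r for ts in s if predicate(tr, ts)]

def natural_join(r, s):
    """R NJN S: equate all same-named attributes and keep one copy of each."""
    out = []
    for tr in r:
        for ts in s:
            common = tr.keys() & ts.keys()
            if all(tr[a] == ts[a] for a in common):
                out.append({**tr, **ts})
    return out

def semi_join(r, s, predicate):
    """R SJ_F S: the tuples of R that join with at least one tuple of S."""
    return [tr for tr in r if any(predicate(tr, ts) for ts in s)]

same_c = lambda tr, ts: tr["C"] == ts["C"]   # join predicate R.C = T.C
print(len(join(R, T, same_c)))
print(natural_join(R, T))
print(semi_join(R, T, same_c))
```

The semi-join returns only tuples of R, which is why it matters in distributed databases: the result is never larger than R itself.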
o SQL Statement: A simple statement in SQL has the following structure:
Select (attribute list)
From (relation name)
Where (predicates)
The interpretation of this statement is equivalent to performing a selection
operation using the predicates of the where clause on the relation specified in
the from clause and then projecting the result on the attributes of the select
clause.
Example 1: Consider the relation EMP of Fig 3.1(a). The statement
Select NAME, AGE
From EMP
Where AGE > 20 AND DEPTNUM = 1
returns the following relation:
NAME  AGE
Vasu  25
Hari  30

Example 2: Referring to the relations R and T of Fig 3.2(a), the SQL statement
Select A, R.C
From R, T
Where R.B = T.B and D = 3
returns as a result the relation:
A  R.C
b  f
The execution of a statement of this type can be interpreted in the following way:
1. Perform the join of relations R and T using the join clause R.B = T.B.
2. Perform a selection with predicate D = 3 on the result of the join.
3. Project the result on attributes A and R.C.
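Both SQL examples can be tried out with Python's built-in sqlite3 module. The sketch below loads the relation EMP of Fig 3.1 and the relations R and T of Fig 3.2 into an in-memory database; the choice of SQLite and the column types are assumptions made only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Relation EMP of Fig 3.1 and relations R, T of Fig 3.2.
cur.execute("CREATE TABLE EMP (EMPNUM INTEGER, NAME TEXT, AGE INTEGER, DEPTNUM INTEGER)")
cur.executemany("INSERT INTO EMP VALUES (?, ?, ?, ?)", [
    (2, "Vasu", 25, 1), (7, "Varma", 34, 2), (11, "Rama", 18, 1),
    (12, "Krishna", 22, 3), (14, "Hari", 30, 1)])
cur.execute("CREATE TABLE R (A TEXT, B INTEGER, C TEXT)")
cur.executemany("INSERT INTO R VALUES (?, ?, ?)",
                [("a", 1, "a"), ("b", 1, "b"), ("a", 1, "d"), ("b", 2, "f")])
cur.execute("CREATE TABLE T (B INTEGER, C TEXT, D INTEGER)")
cur.executemany("INSERT INTO T VALUES (?, ?, ?)",
                [(1, "a", 1), (3, "b", 1), (3, "c", 2), (1, "d", 4), (2, "a", 3)])

# Example 1: selection on EMP followed by projection on NAME, AGE.
example1 = cur.execute(
    "SELECT NAME, AGE FROM EMP WHERE AGE > 20 AND DEPTNUM = 1 ORDER BY EMPNUM").fetchall()

# Example 2: join of R and T on B, selection D = 3, projection on A and R.C.
example2 = cur.execute(
    "SELECT A, R.C FROM R, T WHERE R.B = T.B AND D = 3").fetchall()

print(example1)
print(example2)
conn.close()
```

Unlike relational algebra, SQL does not remove duplicate rows unless DISTINCT is specified, so the equivalence with selection and projection holds exactly only when the result has no duplicates.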

3.3 Distributed Processing- An Introduction:

A Distributed processing system is one in which several autonomous processors and data
stores supporting processes and/or databases interact in order to cooperate to achieve an
overall goal. The processes coordinate their activities and exchange information
transferred over a communication network.

Distributed processing is now an important area of information processing, and
it is often implemented using distributed databases. A distributed database is a
collection of data which belong logically to the same system but are spread over the sites
of a computer network. This definition emphasizes two important aspects of a distributed database:
Distribution: The data do not reside at one central place.
Logical correlation: The data have some properties that tie them together.
Let us understand the above terms by taking an example. Consider a bank
that has three branches at different locations. At each branch a
computer controls the teller terminals of the branch and the account database of the
branch (Figure 2.1). Each local branch database is termed a SITE of the distributed
database, and the sites are connected by a communication network. Transactions that
need only the data of their own branch are managed by the local computer and are
therefore called local applications. An example
of a local application is a debit or a credit performed on an account stored at the
same branch at which the application is requested.
There are also applications that access data at more than one branch,
known as global applications or distributed applications. An example is the transfer
of funds from an account of one branch to an account of another branch. This is not a simple
matter, as it involves updating the databases at two different places.
We can now summarize the above aspects and refine the definition: a
distributed database is a collection of data which are distributed over different
computers of a computer network. Each site of the network has autonomous processing
capability and can perform local applications. Each site also participates in at least one
global application, which requires accessing data at several sites using a communication
subsystem. The most important aspect expected from the system is the cooperation
between the autonomous sites.
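The difference between a local and a global application can be sketched with two in-memory SQLite databases standing in for two branch sites. The table layout, account numbers and function names are hypothetical, and the way the two sites are committed one after the other is a simplification, not a real distributed commit protocol.

```python
import sqlite3

def open_branch(accounts):
    """Create one in-memory database standing in for a branch site."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE ACCOUNT (ACCNUM INTEGER PRIMARY KEY, BALANCE INTEGER)")
    db.executemany("INSERT INTO ACCOUNT VALUES (?, ?)", accounts)
    db.commit()
    return db

def balance(db, accnum):
    return db.execute("SELECT BALANCE FROM ACCOUNT WHERE ACCNUM = ?", (accnum,)).fetchone()[0]

def local_debit(db, accnum, amount):
    """Local application: touches only the database of one branch."""
    if balance(db, accnum) < amount:
        raise ValueError("insufficient funds")
    db.execute("UPDATE ACCOUNT SET BALANCE = BALANCE - ? WHERE ACCNUM = ?", (amount, accnum))

def local_credit(db, accnum, amount):
    db.execute("UPDATE ACCOUNT SET BALANCE = BALANCE + ? WHERE ACCNUM = ?", (amount, accnum))

def transfer(src_db, src_acc, dst_db, dst_acc, amount):
    """Global application: updates two sites and commits both or neither."""
    try:
        local_debit(src_db, src_acc, amount)
        local_credit(dst_db, dst_acc, amount)
    except Exception:
        src_db.rollback()
        dst_db.rollback()
        return False
    src_db.commit()
    dst_db.commit()
    return True

site1 = open_branch([(1, 500)])
site2 = open_branch([(2, 100)])
ok = transfer(site1, 1, site2, 2, 200)
failed = transfer(site1, 1, site2, 2, 1000)
print(ok, failed, balance(site1, 1), balance(site2, 2))
```

A crash between the two commit calls would leave the sites inconsistent, which is why a real distributed database needs a commit protocol that coordinates the sites.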

3.4 Features of Distributed Versus Centralized databases:

It is better if we look at the typical features of the centralized database and compare them
with the corresponding features of distributed databases.

The following table compares the main features.

Sl.No | Feature | Centralized databases | Distributed databases
1 | Centralized control | Centralization is strongly emphasized. A database administrator (DBA) takes care of the safety of the data. | The data is distributed, so administration is hierarchical: a global administrator and local administrators are used.
2 | Data independence | Data independence means that the actual organization of the data is transparent to the programmer. A conceptual schema gives the conceptual view of the data. | A notion called distribution transparency is used, which makes programs unaffected by the movement of data from one site to another.
3 | Reduction of redundancy | Redundancy is reduced as far as possible. | Redundancy is a desirable feature: the locality of applications increases if the data is replicated at all the sites where applications need it, and the reliability of applications can also increase.
4 | Complex physical structures and efficient access | Efficient access to the data is supported by complex physical data structures such as secondary indexes, interfile chains and so on. | Complex physical structures cannot provide efficient access because the data is distributed. Instead, local and global optimization determine the best procedure for accessing the data at different sites.
5 | Integrity, recovery and concurrency control | These problems are easier to solve because control is exercised at one point. Transaction atomicity is the concept used for this purpose, i.e., a sequence of operations is either performed completely or not performed at all. | These problems are harder because transactions may be initiated at different sites simultaneously. Special algorithms have to be used to take care of this aspect.
6 | Privacy and security | Maintained by the centralized database administrator. | Maintained by the local database administrators at the different sites.

3.5 Uses of Distributed databases:

The main reasons for using distributed databases are:
Organizational and economic reasons
Usage and interconnection of existing databases
Incremental growth of an organization
Reduced communication overhead
Performance aspects
Increased reliability and availability
3.6 Distributed Data Base Management Systems (DDBMS):
A Distributed Database Management System helps in the creation and management of
distributed databases. The software components required for building a distributed database are:
The database management component (DB)
The data communication component (DC)
The data dictionary (DD), which describes how the data is distributed in the entire network
The distributed database component (DDB)
Fig 3.3 shows the interaction between the above components.
The services supported by the above components are:
Remote database access to an application program.
Some degree of distribution transparency
Support for database administration and control
Support for concurrency control and recovery of distributed transactions
The different types of distributed database access available are:
a. Remote access via DBMS primitives: Here the application issues a request that
refers to remote data. Fig 3.4 shows this scenario. This request is automatically routed by
the DDBMS to the site where the data is located; then it is executed and the
corresponding result is returned.
b. Remote access via auxiliary program: Here the application requires an auxiliary
program to be executed at the remote site; this program accesses the remote database and
returns the result to the requesting application. Fig 3.5 shows this scenario.



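[Figure: two sites, Site 1 and Site 2, each with its local database (Database 1,
Database 2) and its local DD]
Fig. 3.3 Components of a commercial DDBMS

[Figure: an application at Site 1 issues a database access that is executed by DBMS2 on
Database 2 at Site 2]
Fig 3.4 Remote access via DBMS primitives

[Figure: an application at Site 1 sends a request for an auxiliary program that runs
against Database 2]
Fig 3.5 Remote access via an auxiliary program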

Check Your Progress 1: Answer the following
A. In relational databases, the data are stored in tables called -------------.
B. The number of attributes of a relation is ----------.
C. The number of tuples of a relation is ------------.
D. What is a query?
E. What is a transaction?

Check Your Progress 2: Answer the following

1. The collection of data that belong to the same system but are spread over the
sites of a computer network is -------------.
2. Who guarantees the safety of the data?
3. In DDB which are the two possible DBAs available?
4. Define global optimization and local optimization.
5. Give the reasons for the necessity of Distributed databases.
6. Give the different components of DDBMS.
Practical Exercise:
Demonstrate all the relational operations using an organizational structure of LIC. Use
SQL statements.
3.7 Summary:
From the discussion in this unit you have come to know the importance of data
distribution and distributed processing. We have also discussed the software
required for managing a distributed database. In the next unit we discuss the
architectural and conceptual requirements of distributed databases.




4.0 Objectives
4.1 Introduction
4.2 Reference Architecture for distributed databases
4.3 Types of data fragmentation
4.3.1 Horizontal Fragmentation
4.3.2 Derived Horizontal Fragmentation
4.3.3 Vertical Fragmentation
4.3.4 Mixed Fragmentation
4.4 Integrity Constraints in Distributed Databases
4.5 Summary

4.0 Objectives: In this unit we will learn the following topics:

Reference Architecture for Distributed Databases
Types of Data Fragmentation
Integrity constraints in Distributed databases
4.1 Introduction:
In this unit we present a reference architecture for distributed databases. This
architecture allows us to determine the different levels of transparency that are
conceptually relevant to understanding distributed databases. The mapping between the
different levels is also defined. We use the relational model and relational algebra
for this purpose. The distributed access primitives are represented using SQL
statements, as SQL is user friendly. The emphasis is on how the SQL
primitives reference the objects that constitute the database. The different types of
fragmentation are discussed, and the integrity constraints in distributed
databases are explained.
4.2 Reference Architecture for Distributed Databases:
Fig 4.1 shows a reference architecture for distributed databases.
The different levels are conceptually helpful for understanding the organization of
the whole system. The levels of the architecture are as follows.
Global Schema: It defines all the data that are contained in the distributed
database as if the database were a centralized system. A set of global relations is used.
Fragmentation Schema: Each global relation is split into several non-overlapping
portions called fragments. The mapping between the
global relations and fragments is defined in the fragmentation schema. It is a
one to many mapping: several fragments correspond to one global
relation, but only one global relation corresponds to one fragment. Fragments are
indicated as Ri, the ith fragment of the global relation R.
Allocation Schema: The fragments are logical portions of the
global relation, which are physically located at one or several sites of the
network. This schema defines at which site(s) a fragment is allocated. A fragment
may be allocated at more than one site, so the type of this mapping determines
whether the system is a redundant system (one to many) or a non-redundant
system (one to one). All the fragments of a global relation that are located at the
same site constitute the physical image of that relation at that site.
Local Mapping Schema: We have already described the relationships
between the objects at the three top levels of this architecture. These three
levels are site independent; therefore, they do not depend on the data model of
the local DBMSs. At a lower level, it is necessary to map the physical images
to the objects that are manipulated by the local DBMSs. This mapping is
called a local mapping schema and depends on the type of local DBMS;
therefore, in a heterogeneous system we have different types of local mappings
at different sites.
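The fragmentation and allocation schemata are both mappings, so they can be sketched as Python dictionaries. The relation, fragment and site names below are hypothetical, chosen only to show how the redundancy of a system and the fragments stored at one site follow from the mappings.

```python
# Hypothetical fragmentation and allocation schemata for one global relation R.
fragmentation_schema = {"R": ["R1", "R2", "R3"]}   # global relation -> its fragments
allocation_schema = {                               # fragment -> site(s) that store it
    "R1": ["site1"],
    "R2": ["site1", "site2"],
    "R3": ["site3"],
}

def global_relation_of(fragment):
    """Each fragment corresponds to exactly one global relation."""
    owners = [g for g, frags in fragmentation_schema.items() if fragment in frags]
    assert len(owners) == 1
    return owners[0]

def is_redundant(allocation):
    """The system is redundant if some fragment is stored at more than one site."""
    return any(len(sites) > 1 for sites in allocation.values())

def fragments_at(relation, site):
    """Fragments of a global relation that are stored at the given site."""
    return [f for f in fragmentation_schema[relation] if site in allocation_schema[f]]

print(global_relation_of("R2"), is_redundant(allocation_schema), fragments_at("R", "site1"))
```

Keeping the two mappings separate means that a fragment can be moved or replicated by changing only the allocation schema, without touching the definition of the fragments.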


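[Figure: the site-independent schemas, including the fragmentation schema, sit above one
local mapping schema per site (Local Mapping Schema 1 for Site 1, Local Mapping Schema 2
for Site 2, and so on for other sites), each leading to the local database at its site]
Fig. 4.1 A reference architecture for distributed databases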

This architecture provides a very general conceptual framework for understanding
distributed databases. The three most important objectives that motivate the features of
this architecture are the separation of data fragmentation and allocation, the control of
redundancy, and the independence from local DBMSs.
Separating the concept of data fragmentation from the concept of data
allocation: This separation allows us to distinguish two different levels of
distribution transparency, namely fragmentation transparency and location
transparency. Fragmentation transparency is the highest degree of
transparency and consists of the fact that the user or application programmer
works on global relations. Location transparency is a lower degree of
transparency and requires the user or application programmer to work on
fragments instead of global relations; however, he or she does not know where
the fragments are located.
Explicit control of redundancy: The reference architecture provides explicit
control of redundancy at the fragment level.
Independence from local DBMSs: This feature, called local mapping
transparency, allows us to study several problems of distributed database
management without having to take into account the specific data models of
local DBMSs. Another type of transparency, which is closely related to
location transparency, is replication transparency. Replication transparency
means that the user is unaware of the replication of fragments.
4.3 Types Of Data Fragmentation:
Global relations can be decomposed into fragments by two different types of
fragmentation: horizontal and vertical fragmentation. We will first consider these two types of
fragmentation separately and then consider the more complex fragmentation, which can
be obtained by applying a composition of both.
In all types of fragmentation, a fragment can be defined by an expression in a
relational language (we will use relational algebra), which takes global relations as
operands and produces the fragment as result. For example, if a global relation contains
data about employees, a fragment which contains only data about employees who work at
department D1 can obviously be defined by a selection operation on the global relation.

The following rules must be followed when defining fragments:
Completeness condition: All the data of the global relation must be mapped
into the fragments; i.e., it must not happen that a data item that belongs to a
global relation does not belong to any fragment.
Reconstruction condition: It must always be possible to reconstruct each
global relation from its fragments. The necessity of this condition is obvious:
only fragments are stored in the distributed database, and global
relations have to be built through this reconstruction operation if necessary.
Disjointness condition: It is convenient that fragments be disjoint, so that the
replication of data can be controlled explicitly at the allocation level.
4.3.1 Horizontal Fragmentation: Horizontal fragmentation consists of partitioning the
tuples of a global relation into subsets; this is clearly useful in distributed databases,
where each subset can contain data that have common geographical properties. It can be
defined by expressing each fragment as a selection operation on the global relation.
Example: Let SUPPLIER be a global relation whose attributes include SNUM (the supplier
number) and CITY.
Then the horizontal fragmentation can be defined in the following way:
SUPPLIER1 = SL_{CITY = "Mysore"} SUPPLIER
SUPPLIER2 = SL_{CITY = "Shimoga"} SUPPLIER
Now let us verify whether this fragmentation fulfills the conditions stated earlier.
The completeness condition: If Mysore and Shimoga are the only possible values of
the CITY attribute, then the fragmentation satisfies this condition.
The reconstruction condition can be verified easily, because it is always possible to
reconstruct the SUPPLIER global relation through the following operation:
SUPPLIER = SUPPLIER1 UN SUPPLIER2
The disjointness condition is clearly verified.
Qualification: The predicate that is used in the selection operation to define a
fragment is called its qualification. For instance, in the above example the qualifications
are:
q1 : CITY = "Mysore"
q2 : CITY = "Shimoga"

We can generalize from the above example that in order to satisfy the completeness
condition, the set of qualifications of all fragments must be complete, at least with respect
to the set of allowed values. The reconstruction condition is always satisfied through the
union operation, and the disjointness condition requires that qualifications be mutually
exclusive.
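The three conditions can be checked mechanically for a given set of tuples. The sketch below uses the qualifications q1 and q2 from the example; the sample tuples and the NAME attribute are made up for illustration, and the function names are assumptions.

```python
# Hypothetical tuples for SUPPLIER; the NAME attribute and all values are made up.
SUPPLIER = [
    {"SNUM": 1, "NAME": "S1", "CITY": "Mysore"},
    {"SNUM": 2, "NAME": "S2", "CITY": "Shimoga"},
    {"SNUM": 3, "NAME": "S3", "CITY": "Mysore"},
]

# Qualifications q1 and q2 of the two fragments.
qualifications = {
    "SUPPLIER1": lambda t: t["CITY"] == "Mysore",
    "SUPPLIER2": lambda t: t["CITY"] == "Shimoga",
}

# Each fragment is a selection on the global relation.
fragments = {name: [t for t in SUPPLIER if q(t)] for name, q in qualifications.items()}

def is_complete(relation, quals):
    """Every tuple satisfies at least one qualification."""
    return all(any(q(t) for q in quals.values()) for t in relation)

def is_disjoint(relation, quals):
    """No tuple satisfies more than one qualification."""
    return all(sum(q(t) for q in quals.values()) <= 1 for t in relation)

def reconstruct(frags):
    """Union of the fragments; tuples are compared by value."""
    result = []
    for frag in frags.values():
        result.extend(t for t in frag if t not in result)
    return result

rebuilt = reconstruct(fragments)
print(is_complete(SUPPLIER, qualifications), is_disjoint(SUPPLIER, qualifications))
print(sorted(t["SNUM"] for t in rebuilt))
```

A tuple with a third city would make is_complete return False, which is the situation the completeness condition rules out.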
4.3.2 Derived Horizontal Fragmentation: This is a type of fragmentation in which the
horizontal fragmentation of a relation is derived from the horizontal fragmentation of
another relation.
Example: Consider a global relation SUPPLY whose attributes include SNUM,
where SNUM is a supplier number. If it is required that a fragment contain the
tuples for suppliers that are in a given city, then we have to use derived
fragmentation. A semi-join operation with the fragments SUPPLIER1 and SUPPLIER2 is
needed in order to determine the tuples of SUPPLY that correspond to the suppliers in
a given city. The derived fragmentation of SUPPLY can therefore be defined as follows:
SUPPLY1 = SUPPLY SJ_{SNUM = SNUM} SUPPLIER1
SUPPLY2 = SUPPLY SJ_{SNUM = SNUM} SUPPLIER2
The reconstruction of the global relation SUPPLY can be performed through the union
operation as was shown for SUPPLIER.
The completeness of the above fragmentation requires that there be no supplier numbers
in the SUPPLY relation that are not also contained in the SUPPLIER relation. This is a
typical, and reasonable, integrity constraint for this database and is usually called the
referential integrity constraint.
The disjointness condition is satisfied if a tuple of the SUPPLY relation does not
correspond to two tuples of the SUPPLIER relation that belong to two different
fragments. In this case this condition is easily verified, because the supplier numbers are
unique keys of the SUPPLIER relation.
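Derived fragmentation and its dependence on referential integrity can be sketched as follows. Only SNUM and CITY come from the text; the other attributes (PNUM, QUAN), all values and the function name are hypothetical. One SUPPLY tuple deliberately refers to a supplier that does not exist.

```python
# Hypothetical data; only SNUM and CITY are taken from the text.
SUPPLIER1 = [{"SNUM": 1, "CITY": "Mysore"}, {"SNUM": 3, "CITY": "Mysore"}]
SUPPLIER2 = [{"SNUM": 2, "CITY": "Shimoga"}]
SUPPLY = [
    {"SNUM": 1, "PNUM": 10, "QUAN": 5},
    {"SNUM": 2, "PNUM": 10, "QUAN": 7},
    {"SNUM": 3, "PNUM": 20, "QUAN": 1},
    {"SNUM": 9, "PNUM": 30, "QUAN": 4},   # violates referential integrity on purpose
]

def semi_join_on_snum(supply, supplier_fragment):
    """Tuples of SUPPLY whose SNUM appears in the given SUPPLIER fragment."""
    snums = {s["SNUM"] for s in supplier_fragment}
    return [t for t in supply if t["SNUM"] in snums]

SUPPLY1 = semi_join_on_snum(SUPPLY, SUPPLIER1)
SUPPLY2 = semi_join_on_snum(SUPPLY, SUPPLIER2)

# Tuples that fall into no derived fragment: completeness is lost for them.
orphans = [t for t in SUPPLY if t not in SUPPLY1 and t not in SUPPLY2]
print(len(SUPPLY1), len(SUPPLY2), orphans)
```

The orphan tuple shows why completeness of a derived fragmentation depends on the referential integrity constraint.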
4.3.3 Vertical Fragmentation:
The vertical fragmentation of a global relation is the subdivision of its attributes into
groups; fragments are obtained by projecting the global relation over each group. This
can be useful in distributed databases where each group of attributes can contain data that
have common geographical properties. The fragmentation is correct if each attribute is
mapped into at least one attribute of the fragments; moreover, it must be possible to
reconstruct the original relation by joining the fragments together.
Example: Consider a global relation
A vertical fragmentation of this relation can be defined as
The reconstruction of relation EMP can be obtained as
This is because EMPNUM is a key of EMP.
Let us note some important points from this example.
The purpose of including the key of the global relation in each
fragment is to ensure the reconstruction property.
An alternative way to provide the reconstruction property is to generate
tuple identifiers that are used as system-controlled keys. This can be
convenient in order to avoid the replication of large keys; moreover,
users cannot modify tuple identifiers.
Let us finally consider the problem of fragment disjointness. First, we have seen that at
least the key should be replicated in all fragments in order to allow reconstruction.
If we include the same attribute in two different vertical fragments, we know exactly
that the column that corresponds to this attribute is replicated.
For example, consider the following vertical fragmentation of relation EMP:
The attribute NAME is replicated in both fragments. We can remove this attribute when
we reconstruct relation EMP through an additional projection operation.
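Vertical fragmentation and its reconstruction can be sketched in Python. The schema of EMP used here (EMPNUM, NAME, SAL, DEPTNUM), the tuples and the grouping of attributes are assumptions made for illustration; only the key EMPNUM and the replicated attribute NAME are taken from the text.

```python
# Hypothetical schema and tuples; only EMPNUM (the key) and NAME appear in the text.
EMP = [
    {"EMPNUM": 1, "NAME": "Vasu", "SAL": 1000, "DEPTNUM": 1},
    {"EMPNUM": 2, "NAME": "Hari", "SAL": 1500, "DEPTNUM": 2},
]

def project(relation, attrs):
    """PJ_attrs: keep the named attributes and drop duplicate tuples."""
    out = []
    for t in relation:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def join_on(r, s, key):
    """Equi-join on a single attribute, keeping one copy of shared attributes."""
    return [{**tr, **ts} for tr in r for ts in s if tr[key] == ts[key]]

# Both fragments carry the key EMPNUM; NAME is deliberately replicated.
EMP1 = project(EMP, ["EMPNUM", "NAME", "DEPTNUM"])
EMP2 = project(EMP, ["EMPNUM", "NAME", "SAL"])

# Project the replicated NAME out of EMP2 before joining the fragments on the key.
rebuilt = join_on(EMP1, project(EMP2, ["EMPNUM", "SAL"]), "EMPNUM")
print(rebuilt == EMP)
```

If the key were left out of one fragment, the join would have nothing to match on and the original tuples could not be recovered.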
4.3.4 Mixed Fragmentation: The fragments that are obtained by the above
fragmentation operations are relations themselves, so it is possible to apply the
fragmentation operations recursively, provided that the correctness conditions are
satisfied each time. The reconstruction can be obtained by applying the reconstruction
rules in reverse order.
Example: Consider the same global relation
The following is a mixed fragmentation, which is obtained by applying the vertical
fragmentation of the previous example, followed by a horizontal fragmentation on




Fig.4.3 The fragmentation tree of relation EMP

The reconstruction of relation EMP is defined by the following expression:
A fragmentation tree can conveniently represent mixed fragmentation (as shown in the
above figure). In a fragmentation tree, the root corresponds to a global relation, the leaves
correspond to the fragments, and the intermediate nodes correspond to the intermediate
results of the fragment-defining expressions.

The following code shows the global and fragmentation schemata of EXAMPLE_DDB.
Most of the global relations of EXAMPLE_DDB and their fragmentation have been
already introduced. A DEPT relation, horizontally fragmented into three fragments on the
value of the DEPTNUM attribute, is added.
Global schema
Fragmentation schema
4.4 Integrity Constraints In Distributed Databases:
When an update performed by a database application violates an integrity constraint, the
application is rejected and thus the correctness of data is preserved. A typical example of
integrity constraint is referential integrity, which requires that all values of a given
attribute of a relation exist also in some other relation. This constraint is particularly
useful in distributed databases, for ensuring the correctness of derived fragmentation. For
example, since the SUPPLY relation has a fragmentation which is derived from that of
SUPPLIER relation by means of a semi-join on the SUPNUM attribute, it is required that
all values of SUPNUM in SUPPLY be present also in SUPPLIER.

Integrity constraints can be enforced automatically by adding to application programs
some code for testing whether the constraint is violated. If so, the program execution is
suspended and all actions already performed by it are cancelled, if necessary.
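A minimal sketch of such a test, written in Python, is given here. The SUPPLIER and SUPPLY relations and the SUPNUM attribute come from the text; the function `insert_supply`, the exception class and the sample tuples are hypothetical.

```python
class IntegrityViolation(Exception):
    """Raised when an update would break referential integrity."""

def insert_supply(supply, supplier, new_row):
    """Append new_row to SUPPLY only if its SUPNUM exists in SUPPLIER."""
    known = {s["SUPNUM"] for s in supplier}
    if new_row["SUPNUM"] not in known:
        # The update is rejected before any change is made.
        raise IntegrityViolation(f"unknown SUPNUM {new_row['SUPNUM']!r}")
    supply.append(new_row)

# Hypothetical contents, invented for illustration.
SUPPLIER = [{"SUPNUM": "S1"}, {"SUPNUM": "S2"}]
SUPPLY = []

insert_supply(SUPPLY, SUPPLIER, {"SUPNUM": "S1", "QUAN": 5})
try:
    insert_supply(SUPPLY, SUPPLIER, {"SUPNUM": "S9", "QUAN": 1})
    rejected = False
except IntegrityViolation:
    rejected = True
print(rejected, len(SUPPLY))
```

In a distributed database the lookup of `known` may have to read SUPPLIER fragments stored at other sites, which is where the performance cost of the check comes from.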
One of the most serious disadvantages of integrity constraints is the loss in
performance that is due to the execution of the integrity tests; this loss is very important
in distributed databases. The major problem is that integrity checking might
increase the need to access remote sites. Integrity checking must therefore also be
considered in the design of the distribution of the database.
Check Your Progress-1: Answer the following: -
1. Define Global schema, Fragmentation schema, Allocation schema and Local
2. What is the concept of redundancy?
3. List the different types of Fragmentation techniques available?
4. Define referential integrity constraint.
5. What are the rules that the fragmentation technique should follow?
Check Your Progress-2: Answer the following: -
1. Explain the reference architecture for distributed databases.
2. Explain the different fragmentation techniques with an example.
3. Discuss the integrity constraints in distributed databases.
Check Your Progress-3: Solve the following: -
Design a Global schema for the organization structure of LIC Of India. Write the
fragmentation schema for the same.
In this unit we have studied reference architecture for distributed database. Also the
different types of fragmentation techniques are discussed. We have also seen some
demonstration examples. Some ideas about integrity constraints are given.


5.0 Objectives
5.1 Introduction
5.2 A Framework for Distributed Database Design
5.2.1 Objectives of the Design of Data Distribution
5.2.2 Top-Down and Bottom-Up Approaches: Classical Design Methodologies
5.3 The Design of Database Fragmentation
5.3.1 Horizontal Fragmentation
Primary Fragmentation
Derived Horizontal Fragmentation
5.3.2 Vertical Fragmentation
5.3.3 Mixed Fragmentation
5.4 The Allocation of Fragments
5.5 Summary
5.0 Objectives: In this unit you will come to know the different design aspects of
distributed databases. At the end of this unit you will be able to describe the topics like
A framework for distributed database design
The objectives of design of data distribution
Top Down and Bottom Up design approaches
The design of database fragmentation
Horizontal Fragmentation
Vertical Fragmentation
Mixed Fragmentation
The allocation of fragments
General Criteria for Fragment allocation
5.1 Introduction: Data distribution is difficult to design and implement because of
various technical and organizational issues, so an efficient design methodology is needed.
From the technical aspect, the issues are the interconnection of sites and the appropriate
distribution of data and applications to the sites, depending on the requirements of the
applications and with the aim of optimizing performance. From the organizational
point of view, the issue of decentralization is crucial, and distributing an application has a
great effect on the organization. In recent years, a lot of research work has taken place in
this area, and its major outcomes are:
o Design criteria for effective data distribution
o Mathematical background of the design aids
In section 5.2 you will learn a framework for the design, including design
approaches such as Top-Down and Bottom-Up. Section 5.3 explains the design
of Horizontal and Vertical Fragmentation. In section 5.4 we give the principles and
concepts of fragment allocation.
5.2 A Framework for Distributed Database Design: The design of a centralized
database concentrates on:
Designing the conceptual schema that describes the complete database
Designing the Physical database, which maps the conceptual schema to the
storage areas and determines the appropriate access methods.
In a distributed database, these two steps correspond to the design of the Global
schema and the design of the local databases. The added steps are:
Designing the Fragmentation: - The actual procedure of dividing the existing
global relations into horizontal, vertical or mixed fragments
Designing the allocation of fragments: -Allocation of different fragments
according to the site requirements
Before designing the Distributed database a thorough knowledge about the application is
a must. In this case we expect the following things from the designer.
Site of Origin: The site from which the application is issued.
The frequency of invoking the request at each site
The number, type and the statistical distribution of accesses made by each
application to each required data.
In the coming section let us try to know the actual need of design of data distribution.

5.2.1 Objectives of the Design of Data Distribution: In the design of data distribution
the following objectives should be considered.
Processing locality: Reducing the remote references in turn maximizing the local
references is the primary aim of the data distribution. This can be achieved by
having redundant fragment allocation meeting the site requirements. Complete
locality is an extended idea, which simplifies the execution of application.
Availability and reliability of distributed data: Availability is achieved by
having multiple copies of the data for read only applications. Reliability is
achieved by storing the multiple copies of the information, as it will be helpful in
case of system crashes.
Workload distribution: workload distribution is the major goal to have high
degree of parallelism.
Storage costs and availability: Cost criteria and availability of storage
areas should be intelligently handled for effective data distribution.
Using all of the above criteria may increase the design complexity. So the most important
aspects are taken as objectives depending upon the requirement, and the others are treated as constraints.
In the next section let us design a simple approach for maximizing the processing
5.2.2 Top-Down and Bottom-Up Approaches: Classical Design Methodologies:
There are two classical approaches as far as distributed databases design is concerned.
They are:
1. Top-Down Approach: This may be quite useful when the system has to be
designed from scratch. Here we follow these steps:
Design of Global Schema.
Design of Fragmentation Schema.
Design of Allocation Schema.
Design of Local Schema (Design of Physical Databases).
2. Bottom-Up Approach: This can be used for an existing system. This approach
is based on the integration of existing schemata into a single, global schema. It
requires the following:
The selection of a common database model for describing the Global
schema of the database.
The translation of each local schema into the common data model.
The integration of the local schemata into a common Global schema, i.e.
the merging of common data definitions and the resolution of conflicts
among different representations given to the same data.
The Bottom-Up design requires solving these three problems. The
design steps are then the reverse of those of the previous method.
5.3 The Design of Database Fragmentation: Here we discuss the design of non-
overlapping fragments, which are the logical units of allocation. That is, it is important to
have an efficient design methodology so that we can overcome the related problems of
allocation. In the following, we explain the design of Horizontal, Vertical and Mixed
5.3.1 Horizontal Fragmentation: Here we discuss two important methods called
Primary and Derived. Determining the horizontal fragmentation involves knowing:
The logical properties of the data such as fragmentation predicates.
The statistical properties of the data such as the number of references of
applications to the fragments.
Primary Fragmentation: The correctness of primary fragmentation requires that
each tuple of the global relation be selected in one and only one fragment. Thus, determining the
primary horizontal fragmentation of a global relation requires determining a set of
disjoint and complete selection predicates (we shall define this later in this section). The
property we expect from each fragment is that its elements must be referenced
homogeneously by all applications.
Let G be the global relation for which we want to produce a horizontal
primary fragmentation. Let us define some terminologies.
A Simple Predicate: is a predicate of the type:
Attribute = value
A Min-term Predicate y for a set P of simple predicates is the
conjunction of all predicates appearing in P, either taken in natural form or
negated. Thus:

$$y = \bigwedge_{p_i \in P} p_i^{*}$$

where ($p_i^{*} = p_i$ or $p_i^{*} = \text{NOT } p_i$) and $y \neq \text{false}$.
A fragment is the set of all tuples for which a min-term predicate holds.
A simple predicate pi is relevant with respect to a set P of simple predicates if
there exist at least two min-term predicates of P whose expressions differ
only in the predicate pi itself, such that the corresponding fragments are
referenced in a different way by at least one application.
Let us try to understand the above terminologies by taking an example. Let us consider
the relations DEPT (DEPTNUM, NAME, AREA) and JOB(JOBID,JOB NAME). Let us
assume that only two departments are functioning, i.e. 1 and 2. Some examples of
simple predicates are:
DEPTNUM = 1 or DEPTNUM ≠ 1
JOB = programmer or JOB ≠ programmer.
The corresponding min-term predicates are
DEPTNUM = 1 AND JOB = programmer
DEPTNUM = 1 AND JOB ≠ programmer
DEPTNUM ≠ 1 AND JOB ≠ programmer
DEPTNUM ≠ 1 AND JOB = programmer
Now let us concentrate on some more supporting terminologies. Let P = {p1, p2, ..., pn} be a
set of simple predicates. For correct and efficient fragmentation P must be complete and
minimal.
We say that a set P of predicates is complete if and only if any two tuples
belonging to the same fragment are referenced with the same probabilities
by any application.
The set P is minimal if all its predicates are relevant.
Example: In the above example, P1 ={DEPTNUM = 1} is not complete since the
application is even interested in the employees who are programmers. So in this case
P2 = {DEPTNUM =1,JOB = programmer} is complete and minimal. The set P3 =
{DEPTNUM =1, JOB = programmer, SAL > 50} is complete but not minimal since
SAL >50 is not relevant.

Knowing the minimum characteristics that are to be considered, let us now
generalize the method to be followed while producing fragments of the given global
relation.
I. Consider a predicate p1 that partitions the tuples of the global relation G
into two parts, which are referenced differently by at least one application.
Let P = {p1}.
II. Consider a new simple predicate pi which partitions at least one fragment of
P into two parts, which are referenced in a different way by at least one
application. Eliminate non-relevant predicates from P. Repeat this step until
the set of min-term fragments of P is complete.
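The way a set P of simple predicates induces fragments can be sketched in Python. The two predicates are those of the DEPTNUM and JOB example above; the tuples in `ROWS` and the helper names are hypothetical. The sketch only groups tuples by the min-term they satisfy; deciding whether a predicate is relevant still needs the access statistics of the applications, which are not modelled here.

```python
from itertools import product

# Simple predicates of the example, written as Python functions over a tuple.
P = {
    "DEPTNUM = 1": lambda t: t["DEPTNUM"] == 1,
    "JOB = programmer": lambda t: t["JOB"] == "programmer",
}

def describe(signature, predicates):
    """Render a min-term: True keeps a predicate, False negates it."""
    parts = [name if keep else f"NOT ({name})"
             for name, keep in zip(predicates, signature)]
    return " AND ".join(parts)

def minterm_fragments(relation, predicates):
    """Group tuples by the min-term predicate that holds for them."""
    fragments = {}
    for t in relation:
        signature = tuple(pred(t) for pred in predicates.values())
        fragments.setdefault(signature, []).append(t)
    return fragments

# Hypothetical tuples, invented for illustration.
ROWS = [
    {"NAME": "a", "DEPTNUM": 1, "JOB": "programmer"},
    {"NAME": "b", "DEPTNUM": 1, "JOB": "clerk"},
    {"NAME": "c", "DEPTNUM": 2, "JOB": "programmer"},
    {"NAME": "d", "DEPTNUM": 2, "JOB": "clerk"},
    {"NAME": "e", "DEPTNUM": 1, "JOB": "programmer"},
]

all_minterms = list(product([True, False], repeat=len(P)))  # 2**n candidates
fragments = minterm_fragments(ROWS, P)
for signature, rows in fragments.items():
    print(describe(signature, P), "->", [r["NAME"] for r in rows])
```

Because every tuple satisfies exactly one min-term, the fragments obtained this way are disjoint and together cover the whole relation.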
Example: Let us take two cities of Karnataka: Shimoga and Mysore. The example
application considered is the marketing of medical goods. The global schema for this
application includes the relations EMPL, DEPT, SUPPLIER and SUPPLY. These
relations look as follows:
We design the fragmentation of SUPPLIER and DEPT with a Primary Fragmentation.
Now let us take a query: find the names of suppliers with a given number SNUM. The
popular query language SQL, which you have already come across, can be used for
representing this query.
SELECT NAME
FROM SUPPLIER
WHERE SNUM = $Y

This query can be issued at any one of the sites. Let us assume that we have three sites in
our purview. Site 1 is in Shimoga, Site 3 is in Mysore and Site 2 is in between Shimoga
and Mysore. So, if the query is issued at Site 1 it references SUPPLIERS whose CITY is
Shimoga with almost 90% probability; if it is issued at Site 2 it references SUPPLIERS
of Shimoga and Mysore with equal probability; if it is issued at Site 3 it references
SUPPLIERS whose CITY is Mysore with almost 90% probability. This follows from the
obvious fact that departments around one city tend to use suppliers that are close to
them.
We can write the predicates for the above application,
Since the set {P1, P2} is complete and minimal, the search is terminated.
Let us now consider the global relation DEPT: DEPT (DEPTNUM, NAME,
AREA, MGRNUM). Some example predicates that are suitable for administrative
applications are considered.
P1: DEPTNUM <= 10
P2: (10 < DEPTNUM <= 20)
P3: DEPTNUM > 20
P4: AREA = NORTH
P5: AREA = SOUTH
If we assume that departments with DEPTNUM > 20 never occur in the northern area,
then AREA = NORTH implies that DEPTNUM > 20 is false. Assuming likewise that
departments with DEPTNUM <= 10 never occur in the southern area, the
fragments are reduced to the following four:
Y1: DEPTNUM <= 10
Y2: (10 < DEPTNUM <= 20) AND (AREA = NORTH)
Y3: (10 < DEPTNUM <= 20) AND (AREA = SOUTH)
Y4: DEPTNUM > 20
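The reduction to four fragments can be checked mechanically. The Python sketch below combines the three DEPTNUM ranges with the two areas and discards the combinations excluded by implications. The first exclusion (no northern department with DEPTNUM > 20) is the one stated in the text; the second (no southern department with DEPTNUM <= 10) is an assumption added here so that exactly four fragments remain. The names `IMPOSSIBLE` and `simplify` are hypothetical.

```python
from itertools import product

RANGES = ["DEPTNUM <= 10", "10 < DEPTNUM <= 20", "DEPTNUM > 20"]
AREAS = ["NORTH", "SOUTH"]

# Combinations ruled out by the implications (the second one is an assumption).
IMPOSSIBLE = {("DEPTNUM > 20", "NORTH"), ("DEPTNUM <= 10", "SOUTH")}

candidates = list(product(RANGES, AREAS))              # 3 * 2 combinations
feasible = [c for c in candidates if c not in IMPOSSIBLE]

def simplify(feasible_terms):
    """Drop the AREA conjunct when only one area is possible for a range."""
    by_range = {}
    for rng, area in feasible_terms:
        by_range.setdefault(rng, []).append(area)
    result = []
    for rng, areas in by_range.items():
        if len(areas) == 1:
            result.append(rng)
        else:
            result.extend(f"({rng}) AND (AREA = {a})" for a in areas)
    return result

min_terms = simplify(feasible)
for y in min_terms:
    print(y)
```

The printed predicates correspond to Y1 to Y4: the AREA condition survives only in the middle range, where both areas are possible.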
If we now turn to fragment allocation, we can easily allocate the
fragments corresponding to y1 and y4 at sites 1 and 3. Depending upon the
requirement, fragments y2 and y3 will be allocated to either site 1 or site 3.
Derived Horizontal Fragmentation: This is not based on the properties of the relation's
own attributes, but is derived from the horizontal fragmentation of another relation. It is
used to facilitate the join between fragments. A distributed join is a join between
horizontally fragmented relations. That is, when you want to join two relations G and
H you have to compare their fragments. Join graphs can represent this efficiently. Fig.
5.1 represents the different possible join graphs.

o Total: The join graph is total when it contains all possible edges between
fragments of G and H.
o Reduced: The join graph is reduced when some of the edges between G
and H are missing. Here we have two types:
Partitioned: A reduced graph is partitioned if the graph is
composed of two or more sub graphs without edge between them.
Simple: A reduced graph is simple if it is partitioned and each sub
graph has just one edge.
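The four kinds of join graph can be told apart with a small Python sketch. It assumes that a join graph is given as a set of (G fragment, H fragment) pairs whose join may be non-empty; the function name `classify_join_graph` and the fragment numbering are hypothetical.

```python
def classify_join_graph(g_frags, h_frags, edges):
    """Return 'total', 'reduced', 'partitioned' or 'simple' for a join graph."""
    edges = set(edges)
    if len(edges) == len(g_frags) * len(h_frags):
        return "total"

    # Union-find over the fragments that take part in at least one edge.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for g, h in edges:
        parent[find(("G", g))] = find(("H", h))

    # Collect the edges of each connected subgraph.
    subgraphs = {}
    for g, h in edges:
        subgraphs.setdefault(find(("G", g)), []).append((g, h))

    if len(subgraphs) < 2:
        return "reduced"          # edges are missing, but the graph is one piece
    if all(len(es) == 1 for es in subgraphs.values()):
        return "simple"
    return "partitioned"

print(classify_join_graph([1, 2], [1, 2], {(1, 1), (1, 2), (2, 1), (2, 2)}))
print(classify_join_graph([1, 2], [1, 2], {(1, 1), (2, 2)}))
print(classify_join_graph([1, 2, 3], [1, 2], {(1, 1), (2, 1), (3, 2)}))
```

A simple join graph is the most desirable case for allocation, since each pair of joined fragments can be placed at the same site and joined locally.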
Example: Consider the relation SUPPLY (SNUM, PNUM, DEPTNUM, QUAN). Let us
take the following case. Some applications
o Require the information about supplies of given suppliers; thus they join
SUPPLY and SUPPLIER on the SNUM attribute.
o Require the information about supplies at a given department; thus they
join SUPPLY and DEPT on the DEPTNUM attribute.
Let us assume that the relation DEPT is horizontally fragmented on the attribute
DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM. The
derived horizontal fragmentation can be obtained for relation SUPPLY by either
performing a Semi - join operation with SUPPLIER on SNUM or with DEPT on
DEPTNUM; both of them are correct.
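A derived horizontal fragmentation of SUPPLY can be sketched in Python as a semi-join with each fragment of SUPPLIER. The relation and attribute names follow the text; the fragment contents and the helper `semi_join` are hypothetical.

```python
def semi_join(relation, fragment, attr):
    """Tuples of `relation` whose `attr` value also appears in `fragment`."""
    keys = {row[attr] for row in fragment}
    return [row for row in relation if row[attr] in keys]

# Hypothetical primary fragments of SUPPLIER, fragmented on SNUM.
SUPPLIER1 = [{"SNUM": "S1", "NAME": "A"}]
SUPPLIER2 = [{"SNUM": "S2", "NAME": "B"}]

SUPPLY = [
    {"SNUM": "S1", "PNUM": "P1", "DEPTNUM": 5, "QUAN": 10},
    {"SNUM": "S2", "PNUM": "P1", "DEPTNUM": 15, "QUAN": 4},
    {"SNUM": "S1", "PNUM": "P2", "DEPTNUM": 15, "QUAN": 7},
]

# Each SUPPLY fragment follows the SUPPLIER fragment it joins with.
SUPPLY1 = semi_join(SUPPLY, SUPPLIER1, "SNUM")
SUPPLY2 = semi_join(SUPPLY, SUPPLIER2, "SNUM")
print(len(SUPPLY1), len(SUPPLY2))
```

With this choice the join between SUPPLY and SUPPLIER has a simple join graph. Deriving the fragments from DEPT on DEPTNUM instead would favour the second kind of application, so the choice depends on which join is used more often.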
5.3.2 Vertical Fragmentation: This requires grouping the attributes into sets, which are
referenced in a similar manner by applications. This method has been discussed by
considering two separate types of problems:
The Vertical Partitioning Problem: Here the sets must be disjoint, except that
one attribute (the key) must be common. For example, assume that a relation S is
vertically fragmented using this concept into S1 and S2. This can be useful
where an application can be executed using either S1 or S2; otherwise
having the complete S at a particular site may be an unnecessary burden.
Two design approaches are possible:
1. The split approach: The global relations are progressively
split into fragments.
2. The grouping approach: The attributes are progressively
aggregated to constitute fragments.
Both are heuristic approaches, as each iteration step looks for the best choice.
In both cases formulas are used to indicate the best
possible splitting or grouping.

Figure 5.1 The different possible join graphs
The Vertical Clustering Problem: Here the sets can overlap. Depending
upon the requirement, you may have more than one common attribute in
two different fragments of a global relation. This introduces replication
within fragments, as some common attributes are present in several fragments.
It is suitable mainly for read-only applications, because applications
that frequently update these common attributes must
access all the sites where these attributes are present. Therefore,
vertical clustering is suggested where overlapping attributes are not
heavily updated.

Example: Consider the global relation EMPL (EMPNUM, NAME, SAL, TAX,
MGRNUM, DEPTNUM). The following assumptions are made:
Administrative applications require NAME, SAL and TAX of employees.
The department requires NAME, MGRNUM and DEPTNUM.
Here vertical clustering is suggested, as the attribute NAME is required in both
fragments. So the fragments may be:
5.3.3 Mixed Fragmentation: The simple way for performing this is:
Apply Horizontal fragmentation to Vertical fragments
Apply Vertical fragmentation to Horizontal fragments
Both these aspects are illustrated in Figures 5.2 and 5.3.
Fig: 5.2 Vertical fragmentation followed by horizontal fragmentation.
Fig: 5.3 Horizontal fragmentation followed by vertical fragmentation.

5.4 The Allocation of Fragments:

In this section we explain the different aspects to be considered when
allocating a particular fragment to a site. This section describes some general criteria that
can be used for allocating fragments. There are two types of allocation methods, which
can be followed. They are:
Non-redundant Allocation: This is simple. A method known as the best-fit approach can be
used, i.e. a measure is associated with each possible allocation, and the site with the best
measure is selected. This method disregards the mutual effect of placing a fragment at a
given site where a related fragment is already present.
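A best-fit allocation can be sketched in a few lines of Python. The measure used here, the number of references made to a fragment from each site, is an assumption chosen for illustration, as are the fragment names and the figures in `REFS`. The sketch scores every fragment on its own and does not look at which other fragments already sit at a site.

```python
def best_fit(fragments, sites, measure):
    """Place each fragment, independently, at the site with the best measure."""
    return {f: max(sites, key=lambda s: measure(f, s)) for f in fragments}

# Hypothetical measure: references to each fragment issued from each site.
REFS = {
    ("F1", 1): 90, ("F1", 2): 50, ("F1", 3): 10,
    ("F2", 1): 10, ("F2", 2): 50, ("F2", 3): 90,
}

allocation = best_fit(["F1", "F2"], [1, 2, 3], lambda f, s: REFS[(f, s)])
print(allocation)
```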

Redundant Allocation: This is a more complex design problem, since:

o The degree of replication is a variable of the problem.
o The modeling of read applications is complicated as the applications may
select any of the several alternatives.
The following two methods can be used for determining the redundant allocation
of fragments:
Determine the set of all sites where the benefit of allocating one
copy of the fragment is higher than the cost, and allocate a copy of
the fragment to each element of this set; this method selects all
beneficial sites.
Start from a non-replicated version. Then progressively introduce
replicated copies from the most beneficial; the process is
terminated when no additional replication is beneficial.
Both the reliability and availability of the system increase if there are two or three copies
of the fragment, but further copies give a less than proportional increase.
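The two methods for redundant allocation can be sketched in Python as follows. The benefit, cost and gain figures are hypothetical, and so are the function names; in particular the gain function, in which every existing copy makes a further copy less attractive because updates must reach all copies, is only one possible model.

```python
def all_beneficial_sites(sites, benefit, cost):
    """Method 1: every site where the benefit of a copy exceeds its cost."""
    return [s for s in sites if benefit[s] > cost[s]]

def additional_replication(sites, start_site, gain):
    """Method 2: start non-replicated and add copies while the gain is positive."""
    copies = [start_site]
    while True:
        candidates = [s for s in sites if s not in copies]
        if not candidates:
            return copies
        best = max(candidates, key=lambda s: gain(s, copies))
        if gain(best, copies) <= 0:
            return copies        # no additional replication is beneficial
        copies.append(best)

SITES = [1, 2, 3]
BENEFIT = {1: 100, 2: 40, 3: 70}   # hypothetical read benefit per site
COST = {1: 50, 2: 60, 3: 50}       # hypothetical cost of keeping a copy
UPDATE_COST = 30                   # hypothetical extra cost per existing copy

def gain(site, copies):
    return BENEFIT[site] - UPDATE_COST * len(copies)

method1 = all_beneficial_sites(SITES, BENEFIT, COST)
method2 = additional_replication(SITES, 1, gain)
print(method1, method2)
```

In the second method the gain of each new copy shrinks as the number of copies grows, which mirrors the less than proportional increase mentioned above.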
Check Your Progress-1: Answer the following:
1) List the objectives of Distributed databases.
2) What do you mean by Top-Down and Bottom-Up approaches?
3) Define the following:
(i) A simple predicate (ii) Min-term predicate (iii) Complete predicate set
(iv) Minimal predicate set.
4) Define a join graph.
5) State Vertical partitioning and Clustering problem.
Check Your Progress-2: Answer the following:
1) Give an overall framework for the design of distributed databases.
2) Describe Top-Down and Bottom-Up design approach.
3) Explain the general principle of allocation of fragments in distributed
4) Discuss the general principle of fragmentation design considering all types of
5.5 Summary:

In this unit we have discussed the four phases of the design of Distributed databases:
Global schema, Fragmentation schema, Allocation schema and Local schema. Some
important aspects of design of fragmentation and allocation schemas are described in
detail. Also some of the practical examples are chosen for familiarizing the new concepts.



6.0 Objectives
6.1 Introduction
6.2 Query Processing Problem
6.3 Objectives of Query Processing
6.4 Characterization of Query Processors
6.5 Layers of Query Processing
6.5.1 Query Decomposition
6.5.2 Data Localization
6.5.3 Global Query Optimization
6.5.4 Local Query Optimization
6.6 Summary

6.0 Objectives: In this unit we give an overview of query processing in
Distributed Data Base Management Systems (DDBMSs). This is explained with the help
of Relational Calculus and Relational Algebra because of their generality and wide use
in DDBMSs. In this unit we discuss
Various problems of query processing
About an ideal Query Processor
The concept of layering in query processing
Some related examples of query processing
6.1 Introduction: The increasing success of relational database technology in data
processing is due, in part, to the availability of nonprocedural languages, which can
significantly improve application development and end-user productivity. By hiding the
low-level details about the physical organization of the data, relational database
languages allow the expression of complex queries in a concise and simple fashion. In
particular, to construct the answer to the query, the user does not specify the exact
procedure to follow. This procedure is actually devised by a DBMS module called the
Query Processor. This relieves the user from query optimization, a time-consuming task
that is handled properly by the query processor.
This issue is of considerable importance in both centralized and distributed
processing systems. However, the query processing problem is much more difficult in
distributed environments than in conventional systems. In particular, the relations
involved in distributed queries may be fragmented and/or replicated, thereby inducing
communication overhead costs.
So, in this unit let us discuss the different issues of query processing, about an
ideal query processor for distributed environment and finally, a layered software
approach for distributed query processing.
6.2 Query Processing Problem:
The main duty of a relational query processor is to transform a high-level query (in
relational calculus) into an equivalent lower-level query (in relational algebra). The
design of the distributed database is of major importance for query processing, since the
definition of fragments is based on the objective of increasing reference locality, and
sometimes parallel execution, for the most important queries. The role of a distributed query
processor is to map a high-level query on a distributed database (a set of global relations)
into a sequence of database operations (of relational algebra) on relational fragments.
Several important functions characterize this mapping:
The calculus query must be decomposed into a sequence of relational operations
called an algebraic query
The data accessed by the query must be localized so that the operations on
relations are translated to bear on local data (fragments)
The algebraic query on fragments must be extended with communication
operations and optimized with respect to a cost function to be minimized. This
cost function refers to computing resources such as disk I/Os, CPUs, and
communication networks.
The low-level query actually implements the execution strategy for the query. The
transformation must achieve both correctness and efficiency. The well-defined mapping
with the above functional characteristics makes the correctness issue easy, but
producing an efficient execution strategy is more complex. A relational calculus query
may have many equivalent and correct transformations into relational algebra. Since each
equivalent execution strategy can lead to different consumptions of computer resources,
the main problem is to select the execution strategy that minimizes the resource
consumption.
Example: We consider the following subset of the engineering database scheme given in
Fig. 6.0: E (ENO, ENAME, TITLE) and G (ENO, JNO, RESP, DUR), and the simple user
query: Find the names of employees who are managing a project.

E
ENO   ENAME   TITLE
E1    A       Elect. Eng.
E2    B       Syst. Anal.
E3    C       Mech. Eng.
E4    D       Programmer
E5    E       Syst. Anal.
E6    F       Elect. Eng.
E7    G       Mech. Eng.
E8    H       Syst. Anal.

G
ENO   JNO   RESP         DUR
E1    J1    Manager      12
E2    J1    Analyst      24
E2    J2    Analyst      6
E3    J3    Consultant   10
E3    J4    Engineer     48
E4    J2    Programmer   18
E5    J2    Manager      24
E6    J4    Manager      48
E7    J3    Engineer     36
E8    J3    Manager      40

J
JNO   JNAME               BUDGET   LOC
J1    Instrumentation     150000   Montreal
J2    Database Develop.   135000   New York
J3    CAD/CAM             250000   New York
J4    Maintenance         310000   Paris

S
TITLE         SAL
Elect. Eng.   40000
Syst. Anal.   34000
Mech. Eng.    27000
Programmer    24000

Figure 6.0 Example Database


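The equivalent relational calculus query using SQL syntax is:
SELECT ENAME
FROM E, G
WHERE E.ENO = G.ENO
AND RESP = "Manager"
Two equivalent relational algebra queries that are correct transformations of the above
query are:
PJ ENAME (SL RESP = "Manager" AND E.ENO = G.ENO (E CP G))
and
PJ ENAME (E JN ENO (SL RESP = "Manager" (G)))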

NOTE: The following observations are made from the above example:
It can be observed that the second query avoids the Cartesian product (CP) of E
and G, consumes much less computing resource than the first and thus should be
retained. That is, we have to avoid performing Cartesian product operation on a
full table.
In a centralized environment, the role of the query processor is to choose the best
relational algebra query for a given query among all equivalent ones.
In a distributed environment, relational algebra is not enough to express execution
strategies. It must be supported with operations for exchanging data between sites.
The distributed query processor has to select the best sites to process the data and
the way in which the data should be transformed with the choice of ordering the
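The difference between forming the Cartesian product first and selecting first can be seen on the data of Figure 6.0. The Python sketch below evaluates the query both ways and counts the intermediate tuples each strategy produces. Only the ENO and ENAME columns of E and the ENO and RESP columns of G are kept, and the function names are hypothetical.

```python
from itertools import product

# (ENO, ENAME) pairs of E and (ENO, RESP) pairs of G, taken from Figure 6.0.
E = [("E1", "A"), ("E2", "B"), ("E3", "C"), ("E4", "D"),
     ("E5", "E"), ("E6", "F"), ("E7", "G"), ("E8", "H")]
G = [("E1", "Manager"), ("E2", "Analyst"), ("E2", "Analyst"),
     ("E3", "Consultant"), ("E3", "Engineer"), ("E4", "Programmer"),
     ("E5", "Manager"), ("E6", "Manager"), ("E7", "Engineer"),
     ("E8", "Manager")]

def cartesian_then_select(e_rel, g_rel):
    """Build E CP G first, then apply the selection and the projection."""
    pairs = list(product(e_rel, g_rel))
    names = {e[1] for e, g in pairs if e[0] == g[0] and g[1] == "Manager"}
    return names, len(pairs)

def select_then_join(e_rel, g_rel):
    """Select the managers of G first, then join with E on ENO."""
    managers = [g for g in g_rel if g[1] == "Manager"]
    ename_by_eno = dict(e_rel)
    names = {ename_by_eno[g[0]] for g in managers if g[0] in ename_by_eno}
    return names, len(managers)

names_1, size_1 = cartesian_then_select(E, G)
names_2, size_2 = select_then_join(E, G)
print(sorted(names_1), size_1)
print(sorted(names_2), size_2)
```

Both strategies return the same names, but the first builds every pairing of an E tuple with a G tuple before discarding most of them, while the second carries only the manager tuples into the join.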
Example: This example illustrates the importance of site selection and communication
for a chosen relational algebra query against a fragmented database. We consider the
following query:
PJ ENAME (E JN ENO (SL RESP = Manager (G)))
This query is written considering the relations of the previous example. We assume that
the relations E and G are horizontally fragmented as follows:

E1 = SL ENO <= "E3" (E)
E2 = SL ENO > "E3" (E)
G1 = SL ENO <= "E3" (G)
G2 = SL ENO > "E3" (G)

Fragments G1, G2, E1 and E2 are stored at sites 1, 2, 3 and 4, respectively, and the
result is expected at site 5, as shown in Fig. 6.1. For simplicity, we have ignored the
project operation here. The figure shows two equivalent strategies for the above query.
Some of the observations of the Strategies:
An arrow from site i to site j labeled with R
indicates that relation R is transferred from site i to site j.
Strategy A exploits the fact that relations E and G
are fragmented in the same way in order to perform the select and join operations
in parallel.
Strategy B centralizes all the operations and the
data at the result site before processing the query.
Resource consumption of these two strategies:
Assumptions made:
1. A tuple access, denoted as tupacc, is 1 unit.
2. A tuple transfer, denoted as tuptrans, is 10 units.
3. Relations E and G have 400 and 1000 tuples, respectively.
4. There are 20 managers in relation G.
5. The data is uniformly distributed among sites.
6. Relations G and E are locally clustered on
attributes RESP and ENO, respectively.
7. There is direct access to tuples of G (respectively,
E) based on the value of attribute RESP (respectively, ENO).
The Cost Analysis:
The cost of strategy A can be derived as follows:

1. Produce G' by selecting G requires 20 * tupacc = 20

2. Transfer G' to the sites of E requires 20 * tuptrans = 200
3. Produce E' by joining G' and E requires
(10*10)* tupacc*2 = 200
4. Transfer E' to result site requires 20* tuptrans = 200

The total cost 620

The cost of strategy B can be derived as follows:
1. Transfer E to site 5 requires 400 * tuptrans = 4000
2. Transfer G to site 5 requires 1000 * tuptrans = 10000
3. Produce G' by selecting G requires 1000 * tupacc = 1000
4. Join E and G' requires 400 * 20 * tupacc = 8000

The total cost 23000

Strategy A is better by a factor of 37, which is quite significant. It also provides a
better distribution of work among the sites. The difference would be even greater if we
assumed slower communication and/or a higher degree of fragmentation.
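The arithmetic of the cost analysis can be reproduced with a few lines of Python. The unit costs and cardinalities are those listed in the assumptions; the variable names are chosen here for illustration.

```python
TUPACC = 1        # cost of one tuple access
TUPTRANS = 10     # cost of one tuple transfer
E_TUPLES, G_TUPLES, MANAGERS = 400, 1000, 20

cost_a = (
    MANAGERS * TUPACC             # produce G' by selecting G (direct access)
    + MANAGERS * TUPTRANS         # transfer G' to the sites of E
    + (10 * 10) * TUPACC * 2      # produce E' by joining G' and E at two sites
    + MANAGERS * TUPTRANS         # transfer E' to the result site
)

cost_b = (
    E_TUPLES * TUPTRANS           # transfer E to site 5
    + G_TUPLES * TUPTRANS         # transfer G to site 5
    + G_TUPLES * TUPACC           # produce G' by selecting G
    + E_TUPLES * MANAGERS * TUPACC  # join E and G'
)

ratio = cost_b / cost_a
print(cost_a, cost_b, round(ratio, 1))
```

Changing `TUPTRANS` to a larger value shows how strongly slower communication penalizes strategy B, which ships both relations in full.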

(a) Strategy A
Site 1: G'1 = SL RESP = "Manager" (G1), sent to Site 3
Site 2: G'2 = SL RESP = "Manager" (G2), sent to Site 4
Site 3: E'1 = E1 JN ENO G'1, sent to Site 5
Site 4: E'2 = E2 JN ENO G'2, sent to Site 5
Site 5: Result = E'1 UN E'2

(b) Strategy B
Sites 1, 2, 3 and 4 send G1, G2, E1 and E2, respectively, to Site 5
Site 5: Result = (E1 UN E2) JN ENO SL RESP = "Manager" (G1 UN G2)

Fig.6.1 Equivalent Distributed Execution Strategies

6.3 Objectives of Query Processing:

The main objective of query processing in a distributed environment is to transform a
high-level query on a distributed database, which is seen as a single database by the
users, into an efficient execution strategy expressed in a low-level language on local
databases.
An important point of query processing is query optimization. Because many
execution strategies are correct transformations of the same high-level query, the
one that optimizes (minimizes) resource consumption should be retained.
The good measures of resource consumption are:
o The total cost that will be incurred in processing the query. It is the sum
of all times incurred in processing the operations of the query at various
sites and in intersite communication.
o The response time of the query. This is the time elapsed for executing the
query. Since operations can be executed in parallel at different sites, the
response time of a query may be significantly less than its total cost.
Obviously the total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU, I/O,
and communication costs. The I/O cost can be minimized by reducing the
number of I/O operations through fast access methods to the data and
efficient use of main memory. The communication cost is the time needed
for exchanging the data between sites participating in the execution of the
query. This cost is incurred in processing the messages and transmitting
the data on the communication network. In distributed systems, the
communication cost is often assumed to dominate the local processing cost,
so that the other cost factors are ignored.
o In centralized systems, only CPU and I/O costs have to be considered.
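The difference between total cost and response time can be made concrete with a small Python sketch. It assumes a deliberately simple model in which a query consists of independent branches that run in parallel at different sites, each branch being a sequence of steps; the step times are hypothetical.

```python
def total_cost(branches):
    """Sum of the times of all steps, wherever they are executed."""
    return sum(sum(branch) for branch in branches)

def response_time(branches):
    """Elapsed time when the branches run in parallel: the slowest branch."""
    return max(sum(branch) for branch in branches)

# Two branches, each a local selection (10 units) followed by a transfer (100 units).
BRANCHES = [[10, 100], [10, 100]]

print(total_cost(BRANCHES), response_time(BRANCHES))
```

Here the response time is half the total cost because the two branches overlap completely; with a single site doing all the work the two measures would coincide.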
6.4 Characterization of Query Processors: It is difficult to give the characteristics that
differentiate centralized and distributed query processors. Still, some of them are listed
here. Of these, the first four are common to both and the next four are particular to
distributed query processors.
o Languages: The input language to the query processor can be based on relational
calculus or relational algebra. The former requires an additional phase to
decompose a query expressed in relational calculus to relational algebra. In
distributed context, the output language is generally some form of relational
algebra augmented with communication primitives. That is, the query processor must perform
a correct mapping from the input language to the output language.
o Types of optimization: Conceptually, query optimization is to choose the best point
of the solution space, the one that leads to the minimum cost. A straightforward approach is
exhaustive search of the solution space; because this can be very expensive, heuristic
techniques are used to restrict the search.
In both centralized and distributed systems a common heuristic is to minimize the
size of intermediate relations. Performing unary operations first and ordering the
binary operations by the increasing size of their intermediate relations can do
this.
o Optimization Timing: A query may be optimized at different times relative to the
actual time of query execution. Optimization can be done statically before
executing the query or dynamically as the query is executed. The main advantage
of the latter method is that the actual sizes of the intermediate relations are
available to the query processor, thereby minimizing the probability of a bad
choice. The main drawback of the dynamic method is that query optimization,
which is an expensive task, must be repeated each time the query is executed. So
hybrid optimization may be better in some situations.
o Statistics: The effectiveness of query optimization depends on statistics on the
database. Dynamic query optimization requires statistics in order to choose the
operation that has to be done first. Static query optimization requires statistics to
estimate the size of intermediate relations. The accuracy of the statistics can be
improved by periodic updating.
o Decision sites: Most systems use a centralized decision approach, in which a
single site generates the strategy. However, the decision process could be
distributed among the various sites participating in the elaboration of the best
strategy. The centralized approach is simpler but requires knowledge of the
complete distributed database, whereas the distributed approach requires only
local information. A hybrid approach, in which the major decisions are taken
at one particular site and the other decisions are taken locally, is often better.
o Exploitation of the Network Topology: The distributed query processor exploits
the network topology. With wide area networks, the cost function to be
minimized can be restricted to the data communication cost, which is the dominant
factor. This assumption simplifies distributed query optimization, which can then be
treated as two separate problems: selection of the global execution strategy, based
on inter-site communication, and selection of each local execution strategy,
based on centralized query processing algorithms. With local area networks,
communication costs are comparable to I/O costs. Therefore, it is reasonable for
the distributed query processor to increase parallel execution at the expense of
increased communication.
o Exploitation of Replicated fragments: For reliability purposes it is useful to have
fragments replicated at different sites. Query processors have to exploit this
information either statically or dynamically for processing the query efficiently.
o Use of semi-joins: The semi-join operation reduces the size of the data that is
exchanged between the sites, so that the communication cost can be reduced.
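The effect of a semi-join on the amount of data exchanged can be sketched as follows. The relations EMP and DEPT, their contents, and their placement at two sites are assumptions made only for this illustration.

```python
# Hypothetical relations: EMP is stored at site 1, DEPT at site 2.
emp = [  # (emp_no, name, dept_no)
    (1, "Asha", 10), (2, "Ravi", 20), (3, "Meena", 10), (4, "John", 30),
]
dept = [  # (dept_no, dept_name)
    (10, "Sales"), (40, "Audit"),
]

def semi_join(r, r_col, join_values):
    """Keep only the tuples of r whose join attribute appears in join_values."""
    return [t for t in r if t[r_col] in join_values]

# Step 1: site 2 ships only the join column of DEPT to site 1.
dept_keys = {d[0] for d in dept}
# Step 2: site 1 reduces EMP with the semi-join and ships the reduced relation.
emp_reduced = semi_join(emp, 2, dept_keys)
# Step 3: site 2 completes the join locally.
result = [e + d[1:] for e in emp_reduced for d in dept if e[2] == d[0]]

print(len(emp), len(emp_reduced))  # 4 2
print(result)
```

Only two of the four EMP tuples cross the network, at the price of first shipping the small set of DEPT join values.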
6.5 Layers of Query Processing:
The problem of query processing can itself be decomposed into several subproblems,
corresponding to various layers. In figure 6.2, a generic layering scheme for query
processing is shown where each layer solves a well-defined subproblem. The input is a
query on distributed data expressed in relational calculus. This distributed query is posed
on global (distributed) relations, meaning that data distribution is hidden. Four main
layers are involved in mapping the distributed query into an optimized sequence of local
operations, each acting on a local database. These layers perform the functions of query
decomposition, data localization, global query optimization, and local query
optimization. The first three layers are performed by a central site and use global
information; the local sites perform the fourth.













Figure 6.3 Generic Layering Scheme for Distributed Query Processing


6.5.1 Query Decomposition: The first layer decomposes the distributed calculus query
into an algebraic query on global relations. The information needed for this
transformation is found in the global conceptual schema describing the global relations.
However, the information about data distribution is not used here but in the next layer.
Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for
subsequent manipulation. Normalization of a query generally involves the
manipulation of the query quantifiers and of the query qualification by applying
logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries
exist only for a subset of relational calculus. Typically, they use some sort of
graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way to
simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an
algebraic query is defined in terms of expected performance. The traditional way
to do this transformation toward a "better" algebraic specification is to start with
an initial algebraic query and transform it in order to find a "good" one. The initial
algebraic query is derived immediately from the calculus query by translating the
predicates and the target statement into relational operations as they appear in the
query. This directly translated algebra query is then restructured through
transformation rules. The algebraic query generated by this layer is good in the
sense that the worst executions are avoided.
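One typical transformation rule used in the restructuring step is to apply a selection before a join, so that the join works on a smaller operand. The following is a minimal sketch of that single rule. The tuple-based tree representation and the relation and attribute names are assumptions made for this illustration and are not part of any particular DBMS.

```python
# Algebra trees as nested tuples (hypothetical representation):
#   ("rel", name, attrs) | ("select", attr, value, child) | ("join", left, right)

def attrs(node):
    """Attributes produced by an algebra tree."""
    kind = node[0]
    if kind == "rel":
        return set(node[2])
    if kind == "select":
        return attrs(node[3])
    return attrs(node[1]) | attrs(node[2])  # join

def push_selection(node):
    """Rewrite select(join(L, R)) so the selection is applied before the join."""
    kind = node[0]
    if kind == "rel":
        return node
    if kind == "join":
        return ("join", push_selection(node[1]), push_selection(node[2]))
    _, attr, value, child = node
    child = push_selection(child)
    if child[0] == "join":
        left, right = child[1], child[2]
        if attr in attrs(left):
            return ("join", push_selection(("select", attr, value, left)), right)
        if attr in attrs(right):
            return ("join", left, push_selection(("select", attr, value, right)))
    return ("select", attr, value, child)

emp = ("rel", "EMP", ("eno", "ename", "dno"))
dept = ("rel", "DEPT", ("dno", "dname"))
initial = ("select", "dname", "Sales", ("join", emp, dept))
print(push_selection(initial))
```

The initial query selects after joining EMP and DEPT; the restructured query selects on DEPT first and joins afterwards.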
6.5.2 Data Localization: The input to the second layer is an algebraic query on
distributed relations. The main role of the second layer is to localize the query's data
using data distribution information. Relations are fragmented and stored in disjoint
subsets called fragments, each being stored at a different site. This layer determines
which fragments are involved in the query and transforms the distributed query into a
fragment query. Fragmentation is defined through fragmentation rules that can be
expressed as relational operations. A distributed relation can be reconstructed by applying
the fragmentation rules, and then deriving a program, called a localization program, of
relational algebra operations, which then act on fragments.
Generating a fragment query is done in two steps.
o The distributed query is mapped into a fragment query by substituting each
distributed relation by its reconstruction program (also called materialization
program).
o The fragment query is simplified and restructured to produce another good
query. Simplification and restructuring may be done according to the same rules
used in the decomposition layer. As in the decomposition layer, the final fragment
query is generally far from optimal because information regarding fragments is
not utilized.
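The two steps can be sketched for the simplest case, a horizontally fragmented relation whose reconstruction program is the union of its fragments. The fragment names, sites, value ranges, and rows below are invented for the illustration.

```python
# Hypothetical horizontal fragmentation of EMP on dept_no.
# Each fragment records the range of dept_no values it holds and its site.
fragments = {
    "EMP1": {"site": "A", "lo": 0,  "hi": 19, "rows": [(1, "Asha", 10), (3, "Meena", 10)]},
    "EMP2": {"site": "B", "lo": 20, "hi": 39, "rows": [(2, "Ravi", 20), (4, "John", 30)]},
    "EMP3": {"site": "C", "lo": 40, "hi": 99, "rows": [(5, "Kiran", 45)]},
}

def reconstruct(fragments):
    """Reconstruction program for horizontal fragments: their union."""
    return [row for f in fragments.values() for row in f["rows"]]

def localize_selection(fragments, dept_no):
    """Fragment query for 'select * from EMP where dept_no = x'.

    Fragments whose defining predicate contradicts the selection are dropped,
    so only the relevant sites take part in the query.
    """
    relevant = {n: f for n, f in fragments.items() if f["lo"] <= dept_no <= f["hi"]}
    rows = [r for f in relevant.values() for r in f["rows"] if r[2] == dept_no]
    return sorted(relevant), rows

print(len(reconstruct(fragments)))        # 5
print(localize_selection(fragments, 30))  # (['EMP2'], [(4, 'John', 30)])
```

Dropping EMP1 and EMP3 is the simplification step: the query on the global relation EMP becomes a query on the single fragment EMP2.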

6.5.3 Global Query Optimization: The input to the third layer is a fragment query, that
is, an algebraic query on fragments. The goal of query optimization is to find an
execution strategy for the query that is close to optimal. An execution strategy for a
distributed query can be described with relational algebra operations and communication
primitives (send/receive operations) for transferring data between sites. The previous
layers have already optimized the query, for example, by eliminating redundant
expressions. However, this optimization is independent of fragment characteristics such
as cardinalities. In addition, communication operations are not yet specified. By
permuting the ordering of operations within one fragment query, many equivalent queries
may be found. Query optimization consists of finding the best ordering of
operations in the fragment query, including communication operations, which minimizes
a cost function. The cost function, often defined in terms of time units, refers to
computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost and so on. An important aspect of query optimization is join
ordering, since permutations of the joins within the query may lead to improvements of
orders of magnitude. One basic technique for optimizing a sequence of distributed join
operations is through the semi-join operator. The main value of the semi-join in a
distributed system is to reduce the size of the join operands and hence the communication
cost. The output of the query optimization layer is an optimized algebraic query on
fragments with communication operations included.
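The importance of join ordering can be illustrated by enumerating all orderings of a small query and comparing their costs. This is only a sketch: the cardinalities and selectivity factors are invented, and the cost function is simplified to the total size of the intermediate results rather than a full I/O, CPU, and communication model.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical fragment cardinalities and join selectivity factors.
card = {"EMP": 1000, "DEPT": 50, "PROJ": 200}
selectivity = {
    frozenset(("EMP", "DEPT")): Fraction(1, 50),
    frozenset(("EMP", "PROJ")): Fraction(1, 1000),
}

def cost(order):
    """Total size of the intermediate results of a left-deep join order.

    Relations with no join predicate between them are combined by a
    Cartesian product (selectivity 1).
    """
    joined, size, total = [order[0]], card[order[0]], 0
    for rel in order[1:]:
        sel = Fraction(1)
        for prev in joined:
            sel *= selectivity.get(frozenset((prev, rel)), 1)
        size = size * card[rel] * sel
        total += size
        joined.append(rel)
    return total

best = min(permutations(card), key=cost)
print(best, cost(best))              # ('EMP', 'PROJ', 'DEPT') 400
print(cost(("DEPT", "PROJ", "EMP")))  # 10200
```

Starting with the Cartesian product of DEPT and PROJ is more than twenty times as expensive as the best ordering, even though all orderings produce the same result.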

6.5.4 Local Query Optimization: The last layer is performed by all the sites having
fragments involved in the query. Each sub-query executing at one site, called a local query, is
then optimized using the local schema of the site. At this time, the algorithms to perform
the relational operations may be chosen. Local optimization uses the algorithms of
centralized systems.
Check Your Progress: Answer the following: -

a) What is a Query processor?

b) State the Query processing problem.

c) Explain the different characteristics of Query processor.

d) Describe the layer architecture of query processing.

e) Discuss Query optimization.

6.6 Summary: In this unit we have provided an overview of query processing in
distributed DBMSs. The following points are discussed:
We have introduced the function and objectives of query processing.
The goals of query processing are discussed. They are:
Given a calculus query on a distributed database, find a
corresponding execution strategy that minimizes a system cost
function, which includes I/O, CPU, and communication costs.
An execution strategy is specified in terms of relational algebra
operations and communication primitives applied to the local
databases.
We have described a characterization of query processors based on their
implementation choices. This is useful for comparing alternative query
processor designs and for understanding the trade-offs between efficiency and
complexity.
We have proposed a generic layering scheme for describing distributed
query processing. Here four main functions have been isolated: Query
decomposition, Data localization, Global query optimization, and Local query
optimization.


UNIT - 7


7.0 Objectives.
7.1 Introduction.
7.2 Transaction Management
7.2.1 A Framework for Transaction Management
Transaction Properties
Transaction Management Goals
Distributed Transactions
7.2.2 Atomicity of Distributed Transactions
Recovery in Centralized Systems
Problems with respect to Communication in Distributed Databases
Recovery of Distributed Transactions
The Transaction control
7.3 Concurrency Control for Distributed Transactions
7.3.1 Concurrency Control Based on Locking in Centralized
7.3.2 Concurrency Control Based on Locking in Distributed

7.4 Summary

7.0 Objectives: At the end of this unit you will be able to:
Describe the different problems in managing distributed transactions.
Discuss the recovery of distributed transactions.
Explain a popular algorithm called the 2-Phase Commit Protocol.
Discuss the aspects of concurrency control in distributed transactions.

7.1 Introduction:
The management of distributed transactions means dealing with interrelated problems such as
reliability, concurrency control and the efficient utilization of the resources of the
complete system. In this unit we consider well-known protocols such as the 2-Phase
Commit Protocol for recovery and 2-Phase Locking for concurrency control. All these
aspects are discussed under different sections as given in the above structure.

7.2 Transaction Management:

The following section deals with the problems of transactions in both centralized and
distributed databases. The properties of transactions and the various goals of transaction
management are discussed. The recovery problems in both centralized and distributed
databases are analyzed.
7.2.1 A Framework for Transaction Management: In this section we define the properties
of transactions, state the goals of distributed transaction management, and describe
the architecture of a distributed transaction.
Transaction Properties: A transaction is an application or part of an application
that is characterized by the following properties.
Atomicity: Either all or none of the transaction's operations are performed. It
requires that if a transaction is interrupted by a failure, its partial results are
discarded and the whole transaction has to be repeated. The two
types of problems that can prevent a transaction from completing are:
o Transaction aborts: An abort may be requested by the transaction itself because
some of its inputs are wrong or because it has been estimated that the results
produced may become useless. It may also be forced by the system for its
own reasons. The activity of ensuring atomicity in the presence of
transaction aborts is called transaction recovery.
o System crashes: These are caused by catastrophic events that crash the
system without any warning. The activity of ensuring atomicity in
the presence of system crashes is called crash recovery.
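The completion of a transaction is called Commit. The primitives that can be used
for carrying out a transaction are begin_transaction, commit, and abort. A
transaction that has issued begin_transaction can terminate in three ways:
Begin_Transaction ... Commit
Begin_Transaction ... Abort
Begin_Transaction ... system forces abort
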

Durability: Once a transaction is committed, the system must guarantee that the
results of its operations will never be lost, independent of subsequent failures. The
activity of providing durability is called database recovery.
Serializability: If many transactions execute concurrently, the result must be the
same as if they were executed serially in some order. The activity of providing
serializability is called concurrency control.
Isolation: This property states that an incomplete transaction cannot disclose its
results to other transactions until it is committed. This property has to be strictly
followed to avoid a problem called Cascading Aborts (Domino Effect), in which
all the transactions that have observed the partial results of an aborted
transaction have to be aborted as well.

These properties have to be fulfilled for transactions to execute correctly. In the next
section we will see why transaction management is an important aspect.
Transaction Management Goals: Having seen the properties
of transactions, let us see the real goals of transaction management. The goal of
transaction management in a distributed database is to control the execution of
transactions so that:
1. Transactions have atomicity, durability, serializability, and isolation properties.
2. Their cost in terms of main memory, CPU, and number of transmitted control
messages and their response time are minimized.
3. The availability of the system is maximized.

The second point concerns the efficiency of transactions. Let us discuss the
second and third points in detail, as we have already dealt with the first point in the previous
subsection.
CPU and main memory utilization: This aspect is common to both centralized and
distributed databases. In the case of concurrent transactions, the CPU and main memory
should be properly scheduled and managed by the operating system. Otherwise they
become a bottleneck when the number of concurrent transactions is large.

Control messages and their Response time: As control messages do not carry
any useful data and are used only to control the execution of transactions, as few
such messages as possible should be exchanged between the sites. The obvious reason is
that the communication cost would otherwise be increased unnecessarily.
Another important aspect is the response time of each individual transaction.
This should be as small as possible for better performance of the system. It is
especially crucial in a distributed system, where additional time is required for
communication between different sites.
Availability: This should be discussed keeping the failure of sites in mind. The
algorithms implemented by the transaction manager must bypass a site that is not
operational and provide access to another site so that the request can still be serviced.
After studying the goals of transaction management in detail, in the coming
section we will suggest an appropriate model for distributed transactions.
Distributed Transactions: A transaction is a part of an application. Once an
application issues the begin_transaction primitive, all actions
performed by the application from this point until a commit or abort primitive is issued are
considered as one complete transaction. Now let us discuss a model for distributed
transactions, starting with some related terminology.
Agents: An agent is a local process that performs several functions on behalf
of an application. In order to cooperate in the execution of the global operation
required by the application, the agents have to communicate. As they are resident
at different sites, the communication between the agents is performed through
messages. There are various methods for organizing the agents to build a
structure of cooperating processes. In this model we simply assume one such
method; it will be discussed in detail in the next section.
Root agent: There exists a root agent, which starts the whole transaction, so that
when the user requests the execution of an application, the root agent is started;
the site of the root agent is called the Site of Origin of the transaction.
The root agent has the responsibility of issuing the begin_transaction, commit
and abort primitives.
Only the root agent can request the creation of a new agent.

To summarize, the distributed transaction model consists of a root agent that has
initiated the transaction and a number of other agents, depending upon the application, which
work concurrently. All the primitives are issued by the root agent, and they are not
only local to the site of origin but also affect all the agents of the transaction.

A case study of a Distributed Transaction:

A distributed transaction includes one or more statements that, individually or as a
group, update data on two or more distinct nodes of a distributed database. For example,
assume the database configuration depicted in fig 7.1.

Fig. 7.1 Distributed System


The following distributed transaction executed by scott updates the local sales
database, the remote hq database, and the remote maint database:

UPDATE scott.dept@hq.us.acme.com
   SET loc = 'REDWOOD SHORES'
   WHERE deptno = 10;
UPDATE scott.emp
   SET deptno = 11
   WHERE deptno = 10;
UPDATE scott.bldg@maint.us.acme.com
   SET room = 1225
   WHERE room = 1163;
Example: Let us consider an example of a distributed transaction: the
fund transfer operation between two accounts. A global relation
FUND_TRAN (ACC_NUM, AMOUNT) is used to manage this application. The
application starts by reading from the terminal the amount that has to be transferred and the
account numbers from which the amount must be taken and to which it must be credited.
Then the application issues a begin_transaction primitive, and the usual operations start
from this point onwards. The transaction code in fig. 7.2 narrates the whole process
at the global level, as it would run in a centralized environment.
If we assume that the accounts are distributed at different sites of a network, like
the branches of a bank, then at execution time various cooperating processes will perform
the transaction. For example, in the transaction code of fig. 7.3 two agents are
shown. One of the two is the root agent. Here we assume that the "from" account is located
at the root agent site and that the "to" account is located at a different site, where
AGENT1 is executed. When the root agent wants to perform the transfer, it
executes the primitive Create AGENT1; then it sends the parameters to AGENT1. The
root agent also issues the begin_transaction, commit and abort primitives. All these
transaction operations are carried out preserving the properties of
transactions discussed in the previous section.

Read (terminal, $AMOUNT, $from_acc, $to_acc);
Begin_transaction;
Select AMOUNT into $FROM_AMOUNT
from ACC
where ACC_NUM = $from_acc;
if $FROM_AMOUNT - $AMOUNT < 0 then abort
else begin
    Update ACC
    Set AMOUNT = AMOUNT - $AMOUNT
    Where ACC_NUM = $from_acc;
    Update ACC
    Set AMOUNT = AMOUNT + $AMOUNT
    Where ACC_NUM = $to_acc;
    Commit
end

Fig 7.2 The FUND TRANSFER transaction in the centralized environment

ROOT AGENT:
Read (terminal, $AMOUNT, $from_acc, $to_acc);
Begin_transaction;
Select AMOUNT into $FROM_AMOUNT
from ACC
where ACC_NUM = $from_acc;
if $FROM_AMOUNT - $AMOUNT < 0 then abort
else begin
    Update ACC
    Set AMOUNT = AMOUNT - $AMOUNT
    Where ACC_NUM = $from_acc;
    Create AGENT1;
    Send to AGENT1 ($AMOUNT, $to_acc);
    Commit
end

AGENT1:
Receive from ROOT AGENT ($AMOUNT, $to_acc);
Update ACC
Set AMOUNT = AMOUNT + $AMOUNT
Where ACC_NUM = $to_acc;

Fig 7.3 The FUND TRANSFER transaction in the Distributed environment
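The cooperation between the root agent and AGENT1 can also be imitated in a few lines of Python. This is only a sketch under strong assumptions: the two "sites" are plain dictionaries in one process, the account numbers and balances are invented, the Create and Send primitives are replaced by a direct function call, and no failure is handled.

```python
# Hypothetical in-memory "sites": each holds its own part of the ACC relation.
site_root = {"A-100": 500}    # account held at the site of origin
site_remote = {"B-200": 300}  # account held at the site where AGENT1 runs

def agent1(amount, to_acc):
    """AGENT1: receives the parameters from the root agent and credits to_acc."""
    site_remote[to_acc] += amount

def root_agent(amount, from_acc, to_acc):
    """Root agent: debits from_acc locally, then hands the credit to AGENT1."""
    # begin_transaction is issued here by the root agent
    if site_root[from_acc] - amount < 0:
        return "abort"
    site_root[from_acc] -= amount
    agent1(amount, to_acc)  # stands in for Create AGENT1 followed by Send
    return "commit"

print(root_agent(200, "A-100", "B-200"), site_root, site_remote)
print(root_agent(900, "A-100", "B-200"), site_root, site_remote)
```

The first call commits and moves 200 between the sites; the second aborts because the balance is insufficient, and neither site is changed.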
7.2.2 Atomicity of Distributed Transactions: In this section we examine
the concept of atomicity, which is a required property of distributed transactions. First
we discuss recovery techniques in centralized systems. Next we deal with the
possible communication failures in distributed databases. After this we concentrate
on the recovery procedures followed for distributed transactions. Finally the section discusses
an algorithm called the 2-Phase Commit Protocol, which
takes care of the properties expected of distributed transactions.
Recovery in Centralized Systems:
A recovery mechanism is needed to restore the system to
normal database operation after a failure. Thus, before discussing
recovery, let us first analyze the different kinds of failure that can occur in a
centralized database.
Failures in Centralized Systems: Failures are classified as follows:
Failures without loss of information: In these failures, all the information
stored in memory is available for the recovery. An example of this type of
failure is the abort of a transaction because an error condition, such as
overflow, is discovered.
Failures with loss of volatile storage: Here the content of main memory is lost, but
the information recorded on disk is not affected. Examples are
system crashes.
Failures with loss of non-volatile storage: Such failures are called Media
Failures. Here the contents of the disks are also lost. Typical examples are head
crashes. The probability of such failures is much lower than that of the other
two types. However, it is possible to reduce this probability further by keeping
the same information on several disks. This idea is the basis for the
concept of Stable Storage. A strategy known as Careful Replacement is
used for this purpose, which states that, at every update operation, first one
copy of the information is updated, then the correctness of the update is
verified, and finally the remaining copies are updated.
Failures with loss of stable storage: In this type, some information stored in
stable storage is lost because of several simultaneous failures of the storage
disks. The probability of such failures cannot be reduced to zero.

Logs: Logging is a basic technique for handling transactions in the presence of failures. A
log contains information for undoing or redoing all the actions performed by
transactions.
To undo an action means to cancel the performed operation and restore
the result of the previous operation. Undoing the
actions of a transaction that fails before commitment is necessary because, if
commitment is not possible, the database must remain the same as if the
transaction had not been executed at all; hence partial actions must be undone.
To redo an action means to perform the action again. The
necessity of this is seen in the case of a failure with loss of
volatile storage in which the updates of an
already committed transaction have not yet been recorded in stable storage. In this case redoing
the actions of committed transactions is necessary to restore the database.
A log record contains the required information for redoing or undoing
actions. Whenever a transaction performs an action on the database, a log
record is written in the log file. The log record includes:
The identifier of the transaction
The identifier of the record
The type of action
The old record value
The new record value
Auxiliary information for the recovery procedure, such as a pointer to
the previous log record of the same transaction.
Also, when a transaction is started, committed, or aborted, a
begin_transaction, commit, or abort record is written in the log.
The Log Write-Ahead Protocol: Writing a database update and writing
the corresponding log record are two separate operations. In case of a
failure, if the database update were performed before writing the log
record, the recovery procedure would be unable to undo the update, as
the corresponding log record would not be available. In order to
overcome this, the log write-ahead protocol is used. This consists of two
basic rules:
At least the undo portion of the log record must have already been
recorded on stable storage before performing the corresponding
database update.
All log records of the transaction must have already been recorded
on stable storage before committing a transaction.
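The two rules can be sketched with a toy store in which a Python list stands in for the log on stable storage. The class and method names, and the layout of a log record as a tuple, are assumptions made for this illustration; a record whose old value is None is taken to mean that the record did not exist before the update.

```python
class Store:
    """Toy database with a log; the list stands in for stable storage."""

    def __init__(self):
        self.stable_log = []   # log records already forced to stable storage
        self.database = {}     # record identifier -> value

    def update(self, txn, record, new_value):
        old_value = self.database.get(record)
        # Rule 1: the log record (with its undo part) reaches stable storage
        # before the database itself is updated.
        self.stable_log.append((txn, record, "update", old_value, new_value))
        self.database[record] = new_value

    def commit(self, txn):
        # Rule 2: all log records of the transaction are already in the stable
        # log (update() forces each one), so the commit record can be written.
        self.stable_log.append((txn, None, "commit", None, None))

    def undo(self, txn):
        """Restore the old values written by txn, scanning the log backwards."""
        for t, record, action, old, _new in reversed(self.stable_log):
            if t == txn and action == "update":
                if old is None:
                    self.database.pop(record, None)
                else:
                    self.database[record] = old

s = Store()
s.update("T1", "x", 5)
s.commit("T1")
s.update("T2", "x", 9)
s.undo("T2")           # T2 failed before commitment
print(s.database)      # {'x': 5}
```

Because the old value was logged before the update, the uncommitted change of T2 can be cancelled and the committed value written by T1 is restored.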
Recovery Procedures: We have already seen the various kinds of failures
in centralized systems and the importance of log records in the recovery procedure.
Now let us see how recovery of the database is done and the different steps to be
followed. The recovery procedure reads the log and performs the following
operations if the failure is due to the loss of volatile storage.
o Step 1: Determine all non-committed transactions that have to be
undone. These can be recognized easily, as they have a begin_transaction
record in the log file without a commit or abort record.
o Step 2: Determine all transactions that have to be redone, i.e. all the
transactions that have a commit record in the log file. In order to
distinguish the transactions that have to be redone from the ones that
are already safely recorded in stable storage, checkpoints are used.
(Checkpoints are operations that are periodically performed to
simplify step 1 and step 2.)
o Step 3: Undo the transactions determined in step 1 and
redo the transactions determined in step 2.
A checkpoint requires the following operations to be performed:
o All log records and database updates that are still in volatile
storage have to be recorded in stable storage.
o A checkpoint record is written to stable storage. A checkpoint record in
the log indicates the transactions that are active at the time
when the checkpoint is done (an active transaction is one whose
begin_transaction record is in the log but whose commit or abort record is not).
The usage of checkpoints modifies step 1 and step 2 of the recovery
procedure as follows:
o Find and read the last checkpoint record.
o Put all transactions written in the checkpoint record into the undo
set, which contains the transactions to be undone. The redo set, which
contains the transactions to be redone, is initially empty.
o Read the log file starting from the checkpoint record until the end. If a
begin_transaction record is found, put the corresponding transaction in the
undo set. If a commit record is found, move the corresponding
transaction from the undo set to the redo set.
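The modified steps translate directly into a short procedure. The layout of the log records as tuples is an assumption made for this sketch, and the log is assumed to contain at least one checkpoint record. Abort records are simply skipped, since the steps above mention only begin_transaction and commit records.

```python
def recovery_sets(log):
    """Return (undo_set, redo_set) by scanning the log from the last checkpoint.

    Assumed record layout: ("checkpoint", [active txns]), ("begin", txn),
    ("commit", txn) or ("abort", txn).
    """
    last_cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    undo = set(log[last_cp][1])   # transactions active at the checkpoint
    redo = set()                  # initially empty
    for kind, txn in log[last_cp + 1:]:
        if kind == "begin":
            undo.add(txn)
        elif kind == "commit":
            undo.discard(txn)
            redo.add(txn)
    return undo, redo

log = [
    ("begin", "T1"), ("begin", "T2"), ("commit", "T1"),
    ("checkpoint", ["T2"]),
    ("begin", "T3"), ("commit", "T2"), ("begin", "T4"), ("commit", "T3"),
]
undo, redo = recovery_sets(log)
print(sorted(undo), sorted(redo))  # ['T4'] ['T2', 'T3']
```

T1 committed before the checkpoint and needs no action; T2 and T3 are redone, and T4, which never committed, is undone.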
From the above discussion we can say that only the latest portion of the log
must be kept online, whereas the remaining part can be kept in stable
storage. So far we have seen only the failure of volatile storage. Let us now see
failures with loss of stable storage. This can be studied by considering two
cases:
o Failures in which database information is lost, but the log is
safe: In this case, recovery is done by redoing all committed
transactions using the log. Before redoing the
transactions, the database is reset to a dump, which is an image of a previous state
stored on tape storage (of course this is a lengthy process).
o Failures in which the log information itself is lost: This is a catastrophic
event after which complete recovery is not possible. In this
case, the database is re-established in a recent available state by resetting
it to the last dump and applying the portion of the log that is not
damaged.
The principles that we have seen are sufficient to understand the recovery
procedures of distributed databases. Let us now see in detail recovery in
distributed databases.
Problems with respect to Communication in Distributed Databases:
Recovery mechanisms for distributed transactions require knowledge of the
communication failures that can occur between the sites of a distributed system. Let us
assume a communication model and estimate the different
possible errors with respect to it. When a message is transferred from A to B, we require the
following from the communication network:
i. A receives a positive acknowledgement after a delay that is less than some
maximum value MAX.
ii. The message is delivered at B in proper sequence with respect to other A-B
messages.
iii. The message is error free.
These specifications may not be met because of the different
types of errors that may occur, such as missing
acknowledgements and late arrival of acknowledgements. Many of these can be
eliminated by suitable design of the communication network, so that we can assume that:
o Once the message is delivered at B, the message is error free and is
in sequence with respect to other received messages.
o If A receives an acknowledgement, then the message has been delivered.
The two remaining communication errors are lost messages and network partitions. If
the acknowledgement for a message has not been received within some predefined interval
called the timeout, the source has to consider only these two errors and take the
corresponding steps.

Multiple Failures and K-resiliency: Failures do not necessarily occur one at a time. A system that
can tolerate K failures is called K-resilient. In distributed databases, this concept is
applied to site failures and/or partitions. With respect to site failures, an algorithm is said
to be K-resilient if it works properly even if K sites are down. An extreme case of failure
is called Total Failure, where all sites are down.
Recovery of Distributed Transactions: Now let us consider recovery problems
in distributed databases. For this purpose, let us assume that at each site a Local
Transaction Manager (LTM) is available. Each agent can issue begin_transaction, commit, and
abort primitives to its LTM. After having issued a begin_transaction to its LTM, an agent
possesses the properties of a local transaction. We will call an agent that has issued a
begin_transaction primitive to its local transaction manager a Sub-transaction. Also, to
distinguish the begin_transaction, commit, and abort primitives of the distributed
transaction from the local primitives issued by each agent to its LTM, we will call the
latter local_begin, local_commit, and local_abort.
For building a Distributed Transaction Manager (DTM), the following
properties are expected from the LTM:
Ensuring the atomicity of a sub-transaction
Writing some records on stable storage on behalf of the distributed
transaction manager
We need the second requirement because some additional information must also be
recorded in such a way that it can be recovered in case of failure. In order to
make sure that either all actions of a distributed transaction are performed or
none is performed at all, two conditions are necessary:
At each site either all actions are performed or none is performed
All sites must take the same decision with respect to the commitment
or abort of sub-transactions.
Fig. 7.4 shows a reference model of Distributed transaction recovery.
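[Figure 7.4 diagram: the ROOT AGENT and the other AGENTs of the distributed transaction exchange messages; below them the DTM-AGENTs of the Distributed Transaction Manager (DTM) exchange messages; each DTM-AGENT uses interface 1 to reach the Local Transaction Manager (LTM) at its site.]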

Interface 1 : Local_begin, Local_commit, Local_Abort, Local_Create

Interface 2 : Begin_Transaction, Commit, Abort, Create

Figure 7.4 A reference model of distributed transaction recovery

Begin_transaction: When it is issued by the root agent, the DTM will have to
issue a local_begin primitive to the LTM at the site of origin and at all the sites
at which there are already active agents of the same application, thus
transforming all agents into sub-transactions; from this time on, the activation
of a new agent by the same distributed transaction requires that a
local_begin be issued to the LTM where the agent is activated, so that the new
agent is created as a Sub-transaction. The example of FUND TRANSFER
(refer fig 7.2 and fig 7.3) is taken for explaining this concept. Fig 7.4
illustrates this.
Abort: When an abort is issued by the root agent, all existing sub-transactions
must be aborted. This is performed by issuing local_aborts to the LTMs at all sites
where there is an active sub-transaction.
Commit: The implementation of the commit primitive is the most difficult
and expensive. The main difficulty originates from the fact that the correct
commitment of a distributed transaction requires that all sub-transactions
commit locally even if there are failures.
In order to implement this primitive for a distributed transaction, the
general idea of the 2-Phase Commit Protocol has been developed. It is discussed
in detail in the next section.
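
The translation of the begin_transaction and abort primitives into local primitives can be sketched in Python as follows. This is only an illustrative sketch: the class names LTM and DTM, the method create_agent and the site names are assumptions made for the example, and commit is left out because it needs the 2-Phase Commit Protocol.

```python
class LTM:
    """Hypothetical local transaction manager; it only records the primitives it receives."""

    def __init__(self, site):
        self.site = site
        self.calls = []

    def local_begin(self, txn):
        self.calls.append(("local_begin", txn))

    def local_abort(self, txn):
        self.calls.append(("local_abort", txn))


class DTM:
    """Hypothetical distributed transaction manager built on top of the LTMs."""

    def __init__(self, ltms):
        self.ltms = ltms      # site name -> LTM
        self.active = {}      # transaction -> sites holding a sub-transaction

    def begin_transaction(self, txn, origin, sites_with_agents=()):
        # local_begin goes to the site of origin and to every site
        # that already hosts an active agent of the same application.
        self.active[txn] = set()
        for site in [origin, *sites_with_agents]:
            self._start(txn, site)

    def create_agent(self, txn, site):
        # A newly activated agent is created as a sub-transaction.
        self._start(txn, site)

    def _start(self, txn, site):
        if site not in self.active[txn]:
            self.ltms[site].local_begin(txn)
            self.active[txn].add(site)

    def abort(self, txn):
        # local_abort goes to every site with an active sub-transaction.
        for site in sorted(self.active.pop(txn)):
            self.ltms[site].local_abort(txn)


ltms = {site: LTM(site) for site in ("site_i", "site_j")}
dtm = DTM(ltms)
dtm.begin_transaction("T1", origin="site_i")
dtm.create_agent("T1", "site_j")
dtm.abort("T1")
```

After these calls each LTM has recorded one local_begin followed by one local_abort for T1.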
Example: Figure 7.5 shows the primitives and messages issued by the various
components of the reference model during the execution of the FUND TRANSFER
application, which has already been explained.
The different numbers indicate the various actions and their order of execution.
Issuing of Begin transaction primitive to the DTM agent.
1. Issuing of Local_begin primitive to the LTM agent.
2. Issuing of Create AGENT1 primitive to the DTM agent
3. Send Create requests to the other DTM agents
4. Issuing of Local Create primitives to the LTM agent
5. Depending upon the number of local transactions required that many agents
are created in a loop
6. Then the local transactions will begin.
7. Then the communications required for committing or aborting the transaction
take place between the ROOT AGENT and the AGENTS

Figure 7.5 Actions and messages during the first part of the FUND_TRANSFER transaction

The Transaction Control Statements: The following list describes transaction
control statements supported:

Session Trees for Distributed Transactions:

As the statements in a distributed transaction are issued, the system defines a
session tree of all nodes participating in the transaction. A
session tree is a hierarchical model that describes the relationships among sessions
and their roles. Fig 7.6 illustrates a session tree. All nodes participating in the session
tree of a distributed transaction assume one or more of the following roles:

Role                 Description

Client               A node that references information in a database belonging to a
                     different node.

Database server      A node that receives a request for information from another node.

Global coordinator   The node that originates the distributed transaction.

Local coordinator    A node that is forced to reference data on other nodes to complete
                     its part of the transaction.

Commit point site    The node that commits or rolls back the transaction as instructed
                     by the global coordinator.

Fig.7.6 Example of a Session Tree


The role a node plays in a distributed transaction is determined by:

Whether the transaction is local or remote

The commit point strength of the node
Whether all requested data is available at a node, or whether other nodes need to be
referenced to complete the transaction
Whether the node is read-only

Clients: A node acts as a client when it references information from another node's
database. The referenced node is a database server. In Fig 7.6, the node sales is a client of
the nodes that host the warehouse and finance databases.
Database Servers: A database server is a node that hosts a database from which a client
requests data. In Fig 7.6 an application at the sales node initiates a distributed transaction
that accesses data from the warehouse and finance nodes. Therefore, sales.acme.com
has the role of client node, and warehouse and finance are both database servers. In this
example, sales is both a database server and a client because the application also modifies
data in the sales database.
Local Coordinators: A node that must reference data on other nodes to complete its part
in the distributed transaction is called a local coordinator. In Fig 7.6, sales is a local
coordinator because it coordinates the nodes it directly references: warehouse and
finance. The node sales also happens to be the global coordinator because it coordinates all
the nodes involved in the transaction.
A local coordinator is responsible for coordinating the transaction among the nodes it
communicates directly with by:
Receiving and relaying transaction status information to and from those nodes
Passing queries to those nodes
Receiving queries from those nodes and passing them on to other nodes
Returning the results of queries to the nodes that initiated them

Global Coordinator: The node where the distributed transaction originates is called the
global coordinator. The database application issuing the distributed transaction is directly
connected to the node acting as the global coordinator. For example, in Fig 7.6 the
transaction issued at the node sales references information from the database servers
warehouse and finance. Therefore, sales.acme.com is the global coordinator of this
distributed transaction. The global coordinator becomes the parent or root of the session
tree. The global coordinator performs the following operations during a distributed
transaction:
Sends all of the distributed transaction's SQL statements, remote procedure calls, and so
forth to the directly referenced nodes, thus forming the session tree
Instructs all directly referenced nodes other than the commit point site to prepare the
transaction
Instructs the commit point site to initiate the global commit of the transaction if all nodes
prepare successfully
Instructs all nodes to initiate a global abort of the transaction if there is an abort response

Commit Point Site: The job of the commit point site is to initiate a commit or roll back
(abort) operation as instructed by the global coordinator. The system administrator
always designates one node to be the commit point site in the session tree by
assigning every node a commit point strength. The node selected as commit point
site should be the node that stores the most critical data. Fig 7.7 illustrates an
example of a distributed system, with sales serving as the commit point site.
The commit point site is distinct from all other nodes involved in a distributed transaction in
these ways:
The commit point site never enters the prepared state. Consequently, if the commit point
site stores the most critical data, this data never remains in-doubt, even if a failure occurs.
In failure situations, failed nodes remain in a prepared state, holding necessary locks on
data until in-doubt transactions are resolved.
The commit point site commits before the other nodes involved in the transaction. In
effect, the outcome of a distributed transaction at the commit point site determines
whether the transaction at all nodes is committed or rolled back: the other nodes follow
the lead of the commit point site. The global coordinator ensures that all nodes complete
the transaction in the same manner as the commit point site.

Figure 7.7 Commit Point Site

How a Distributed Transaction Commits?
A distributed transaction is considered committed after all non-commit point sites are
prepared, and the transaction has been actually committed at the commit point site.
The online redo log at the commit point site is updated as soon as the distributed
transaction is committed at this node.
Because the commit point log contains a record of the commit, the transaction is
considered committed even though some participating nodes may still be only in the
prepared state and the transaction not yet actually committed at these nodes. In the
same way, a distributed transaction is considered not committed if the commit has not
been logged at the commit point site.
Commit Point Strength: Every database server must be assigned a commit point
strength. If a database server is referenced in a distributed transaction, the value of its
commit point strength determines which role it plays in the two-phase commit.
Specifically, the commit point strength determines whether a given node is the
commit point site in the distributed transaction and thus commits before all of the
other nodes. This value is specified using the initialization parameter
COMMIT_POINT_STRENGTH. This section explains how the system determines the
commit point site. The commit point site, which is determined at the beginning of the
prepare phase, is selected only from the nodes participating in the transaction. The
following sequence of events occurs:
Of the nodes directly referenced by the global coordinator, the software selects the node
with the highest commit point strength as the commit point site.
The initially selected node determines if any of the nodes from which it has to obtain
information for this transaction has a higher commit point strength.
Either the node with the highest commit point strength directly referenced in the
transaction or one of its servers with a higher commit point strength becomes the commit
point site.
After the final commit point site has been determined, the global coordinator sends
prepare requests to all nodes participating in the transaction.

Fig 7.6 shows, in a sample session tree, the commit point strength of each node (in
parentheses) and the node chosen as the commit point site. The following
conditions apply when determining the commit point site:
A read-only node cannot be the commit point site.
If multiple nodes directly referenced by the global coordinator have the same commit
point strength, then the software designates one of these as the commit point site.
If a distributed transaction ends with an abort, then the prepare and commit phases are not
needed. Consequently, the software never determines a commit point site. Instead, the
global coordinator sends an ABORT statement to all nodes and ends the processing of the
distributed transaction.
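
The selection procedure described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the actual implementation of any product: the Node class, the node names and the strength values are hypothetical, read-only status is modelled as a fixed flag on a node, and the sketch considers only the nodes referenced by the global coordinator, as the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    strength: int                 # commit point strength (hypothetical values below)
    read_only: bool = False       # a read-only node can never be the commit point site
    servers: list = field(default_factory=list)   # nodes this node references directly


def select_commit_point_site(global_coordinator):
    """Return the commit point site, or None if every referenced node is read-only."""
    candidates = [n for n in global_coordinator.servers if not n.read_only]
    if not candidates:
        return None
    # Step 1: highest strength among the nodes referenced by the global coordinator.
    # On a tie, max() keeps the first candidate, i.e. "one of these" is designated.
    site = max(candidates, key=lambda n: n.strength)
    # Steps 2 and 3: the selected node checks whether one of its own servers
    # has a higher strength; if so, that server takes over and the check repeats.
    while True:
        higher = [n for n in site.servers
                  if not n.read_only and n.strength > site.strength]
        if not higher:
            return site
        site = max(higher, key=lambda n: n.strength)


hq = Node("hq", 200)
archive = Node("archive", 500, read_only=True)
warehouse = Node("warehouse", 140, servers=[hq, archive])
finance = Node("finance", 45)
sales = Node("sales", 100, servers=[warehouse, finance])

chosen = select_commit_point_site(sales)
```

In this hypothetical tree warehouse is selected first, and hq then takes over because its strength is higher; archive is ignored because it is read-only.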

As Fig 7.8 illustrates, the commit point site and the global coordinator can be
different nodes of the session tree. The commit point strength of each node is
communicated to the coordinators when the initial connections are made. The
coordinators retain the commit point strengths of each node they are in direct
communication with so those commit point sites can be efficiently selected during
two-phase commits. Therefore, it is not necessary for the commit point strength to be
exchanged between a coordinator and a node each time a commit occurs.

Fig 7.8 The commit point site and the global coordinator

Two-Phase Commit Mechanism: Unlike a transaction on a local database, a
distributed transaction involves altering data on multiple databases. Consequently,
distributed transaction processing is more complicated, because the system must
coordinate the committing or rolling back of the changes in a transaction as a
self-contained unit. In other words, the entire transaction commits, or the entire
transaction rolls back (aborts).
The software ensures the integrity of data in a distributed transaction using the
two-phase commit mechanism. In the prepare phase, the initiating node in the transaction
asks the other participating nodes to promise to commit or roll back the transaction.
During the commit phase, the initiating node asks all participating nodes to commit the
transaction. If this outcome is not possible, then all nodes are asked to roll back.
All participating nodes in a distributed transaction should perform the same action:
they should either all commit or all perform an abort of the transaction. The software
automatically controls and monitors the commit or abort of a distributed transaction and
maintains the integrity of the global database (the collection of databases participating
in the transaction) using the two-phase commit mechanism. This mechanism is
completely transparent, requiring no programming on the part of the user or application
developer. The commit mechanism has the following distinct phases, which the
software performs automatically whenever a user commits a distributed transaction:

Prepare phase   The initiating node, called the global coordinator, asks participating nodes
                other than the commit point site to promise to commit or roll back the
                transaction, even if there is a failure. If any node cannot prepare, the
                transaction is rolled back.

Commit phase    If all participants respond to the coordinator that they are prepared, then the
                coordinator asks the commit point site to commit. After it commits, the
                coordinator asks all other nodes to commit the transaction.

Forget phase    The global coordinator forgets about the transaction.

Prepare Phase
The first phase in committing a distributed transaction is the prepare phase. In this
phase, the system does not actually commit or roll back the transaction. Instead, all
nodes referenced in a distributed transaction (except the commit point site), are told to
prepare to commit. By preparing, a node:
Records information in the online redo logs so that it can subsequently either commit or
roll back the transaction, regardless of intervening failures
Places a distributed lock on modified tables, which prevents reads

When a node responds to the global coordinator that it is prepared to commit, the
prepared node promises to either commit or roll back the transaction later--but does not
make a unilateral decision on whether to commit or roll back the transaction. The
promise means that if an instance failure occurs at this point, the node can use the redo
records in the online log to recover the database back to the prepare phase.

Types of Responses in the Prepare Phase: When a node is told to prepare, it can
respond in the following ways:

Prepared    Data on the node has been modified by a statement in the distributed
            transaction, and the node has successfully prepared.

Read-only   No data on the node has been, or can be, modified (only queried), so no
            preparation is necessary.

Abort       The node cannot successfully prepare.

Prepared Response: When a node has successfully prepared, it issues a prepared
message. The message indicates that the node has records of the changes in the
online log, so it is prepared either to commit or to perform an abort. The message also
guarantees that locks held for the transaction can survive a failure.
Read-Only Response: When a node is asked to prepare, and the SQL statements
affecting the database do not change the node's data, the node responds with a
read-only message. The message indicates that the node will not participate in the
commit phase.

Note that if a distributed transaction is set to read-only, then it does not use rollback
segments. If many users connect to the database and their transactions are not set to READ
ONLY, then they allocate rollback space even if they are only performing queries.

Abort Response: When a node cannot successfully prepare, it performs the following
actions:
Releases resources currently held by the transaction and rolls back the local portion of the
transaction
Responds to the node that referenced it in the distributed transaction with an abort
response

These actions then propagate to the other nodes involved in the distributed transaction so
that they can roll back the transaction and guarantee the integrity of the data in the global
database. This response enforces the primary rule of a distributed transaction: all nodes
involved in the transaction either all commit or all roll back the transaction at the same
logical time.

Steps in the Prepare Phase: To complete the prepare phase, each node excluding the
commit point site performs the following steps:
The node requests that its descendants, that is, the nodes subsequently referenced,
prepare to commit.
The node checks to see whether the transaction changes data on itself or its descendants.
If there is no change to the data, then the node skips the remaining steps and returns a
read-only response
The node allocates the resources it needs to commit the transaction if data is changed.
The node saves redo records corresponding to changes made by the transaction to its
online redo log.
The node guarantees that locks held for the transaction are able to survive a failure.
The node responds to the initiating node with a prepared response or, if its attempt or the
attempt of one of its descendants to prepare was unsuccessful, with an abort response.
These actions guarantee that the node can subsequently commit or roll back the
transaction on the node. The prepared nodes then wait until a COMMIT or ABORT
request is received from the global coordinator.

After the nodes are prepared, the distributed transaction is said to be in-doubt. It retains
in-doubt status until all changes are either committed or aborted.

Commit Phase: The second phase in committing a distributed transaction is the commit
phase. Before this phase occurs, all nodes other than the commit point site
referenced in the distributed transaction have guaranteed that they are prepared, that
is, they have the necessary resources to commit the transaction.
Steps in the Commit Phase: The commit phase consists of the following steps:
The global coordinator instructs the commit point site to commit.
The commit point site commits.
The commit point site informs the global coordinator that it has committed.
The global and local coordinators send a message to all nodes instructing them to commit
the transaction.

At each node, the system commits the local portion of the distributed transaction and
releases locks.
At each node, the system records an additional redo entry in the local redo log, indicating
that the transaction has committed.
The participating nodes notify the global coordinator that they have committed.
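
The interplay of the prepare and commit phases can be sketched in Python. This is a simplified illustration under stated assumptions: the function name two_phase_commit and the node names are hypothetical, the session tree is flattened (no descendants), and failures, logging and locks are not modelled. Each participant other than the commit point site is represented only by its response to the prepare request.

```python
def two_phase_commit(prepare_responses, commit_point_site):
    """Decide the outcome of a distributed transaction.

    prepare_responses maps every node except the commit point site to its
    response: 'prepared', 'read-only' or 'abort'. The commit point site is
    never asked to prepare.
    Returns the decision and the order in which nodes are told to act.
    """
    if "abort" in prepare_responses.values():
        # Any abort response rolls the whole transaction back at all nodes.
        return "abort", [commit_point_site, *sorted(prepare_responses)]
    # The commit point site commits first; the prepared nodes follow.
    # Read-only nodes do not take part in the commit phase.
    prepared = sorted(n for n, r in prepare_responses.items() if r == "prepared")
    return "commit", [commit_point_site, *prepared]


ok = two_phase_commit({"sales": "prepared", "finance": "read-only"}, "warehouse")
failed = two_phase_commit({"sales": "prepared", "finance": "abort"}, "warehouse")
```

Here ok is the decision to commit with warehouse committing before sales, while failed is the decision to abort at every node.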

Guaranteeing Global Database Consistency: Each committed transaction has an
associated system change number (SCN) to uniquely identify the changes made by
the SQL statements within that transaction. The SCN functions as an internal
system timestamp that uniquely identifies a committed version of the database. In a
distributed system, the SCNs of communicating nodes are coordinated when all of
the following actions occur:
A connection occurs using the path described by one or more database links
A distributed SQL statement executes
A distributed transaction commits

During the prepare phase, the system determines the highest SCN at all nodes involved in
the transaction. The transaction then commits with the highest SCN at the commit point site.
The commit SCN is then sent to all prepared nodes with the commit decision.
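
As a small illustration of this rule, the commit SCN can be computed as the maximum of the SCNs reported by the nodes. The function name and the SCN values below are assumptions made for the example only.

```python
def commit_scn(node_scns):
    """Return the SCN with which the transaction commits: the highest SCN
    found at any node involved in the transaction."""
    return max(node_scns.values())


# Hypothetical SCNs collected during the prepare phase.
scns = {"sales": 1040, "warehouse": 1100, "finance": 990}
scn = commit_scn(scns)

# Every prepared node receives the same commit SCN with the commit decision.
after_commit = {node: scn for node in scns}
```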

Forget Phase: After the participating nodes notify the commit point site that they have
committed, the commit point site can forget about the transaction. The following
steps occur:
After receiving notice from the global coordinator that all nodes have committed, the
commit point site erases status information about this transaction.
The commit point site informs the global coordinator that it has erased the status
information.
The global coordinator erases its own information about the transaction.

Response of 2-Phase Commit protocol for Failures: The protocol is resilient to all failures in which
no log information is lost. The response of the protocol in the presence of failures is now
discussed.
1. Site Failure:
A participant fails before having written the ready record in the log. In this
case, the coordinator's timeout expires, and it takes the abort decision.
A participant fails after having written the ready record in the log. In this case, the
operational sites correctly terminate the transaction (abort or commit). When
the failed site recovers, the restart procedure has to ask the coordinator or
some other participant about the result of the transaction.
The coordinator fails after having written the prepare record in the log,
but before having written a global_commit or
global_abort record in the log. In this case all participants who have already
answered with a READY message must wait for the recovery of the coordinator. The
restart procedure of the coordinator resumes the commit protocol from the
beginning, reading the identity of the participants from the prepare record in
the log, and sending the PREPARE message to them again. Each ready participant
must recognize that the new PREPARE message is a repetition of the previous
one.
The coordinator fails after having written a global_commit or global_abort
record in the log, but before having written the complete record in the log. In
this case, the coordinator must send the decision again to all participants;
participants who have not received the command have to wait until the
coordinator recovers.
The coordinator fails after having written the complete record in the log. In
this case, the transaction was already concluded, and no action is required at
restart.
2. Lost Messages:
An answer message (READY or ABORT) from a participant is lost. In this
case the coordinator's timeout expires, and the whole transaction is aborted.
A PREPARE message is lost. In this case, the participant remains in wait. The
global result is the same as in the previous case, as the coordinator does not receive an
answer.
A command message (either COMMIT or ABORT) is lost. In this case, the
destination participant remains uncertain about the decision. Introducing a timeout
in the participant can eliminate this problem: if no command has been received after
the timeout interval from the answer, a request for the repetition of the
command is sent.
An ACK message is lost. In this case, the coordinator remains uncertain about
the fact that the participant has received the command message.
Introducing a timeout in the coordinator can eliminate this problem; if no
ACK message is received after the timeout interval from the transmission of
the command, the coordinator will send the command again.
3. Network Partitions: Let us suppose that a simple partition occurs, dividing the
sites into two groups; the group that contains the coordinator is called the coordinator
group, the other the participant group. From the view of the coordinator, the
partition is equivalent to the multiple failure of a set of participants, and the solution
is already discussed. From the view of the participants, the failure is equivalent to a
coordinator failure and the situation is similar to the case already discussed.
It has been observed that the recovery procedure for a site that is involved in
processing a distributed transaction is more complex than that for a centralized
database.
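
The site-failure cases above amount to a decision based on the last record found in the log at restart. The following Python sketch summarises them; the function names and the returned strings are assumptions made for illustration, while the record names (prepare, global_commit, global_abort, complete, ready) are those used in the discussion above.

```python
def coordinator_restart_action(last_record):
    """Action taken by the coordinator's restart procedure, given the last
    record it wrote in the log before failing."""
    actions = {
        # Decision not yet taken: resume the protocol from the beginning.
        "prepare": "send PREPARE again to all participants",
        # Decision taken but not completed: repeat the decision.
        "global_commit": "send COMMIT again to all participants",
        "global_abort": "send ABORT again to all participants",
        # Transaction already concluded.
        "complete": "no action",
    }
    return actions[last_record]


def participant_restart_action(last_record):
    """Action taken by a recovering participant whose last log record is 'ready'."""
    if last_record == "ready":
        return "ask the coordinator or another participant for the result"
    raise ValueError("case not covered by this sketch: " + last_record)
```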
7.3 Concurrency Control for Distributed Transactions:
In this section we discuss the fundamental problems which are due to the concurrent
execution of transactions. We deal with concurrency control based on locking. First, the
2-Phase-locking protocol in centralized databases is presented; then, 2-phase-locking is
extended to distributed databases.
7.3.1 Concurrency Control Based on Locking in Centralized Databases: The basic
idea of locking is that whenever a transaction accesses a data item, it locks it, and that a
transaction which wants to lock a data item which is already locked by another
transaction must wait until the other transaction has released the lock (unlock).

Let us see some important terminologies related to this concept:

Lock Mode: A transaction locks a data item in one of the following modes:
o Shared Mode: Here the transaction wants only to read the data item.
o Exclusive Mode: Here the transaction wants to edit the data item.
Well-formed Transactions: A transaction is well-formed if it
always locks a data item in shared mode before reading it, and it always locks a
data item in exclusive mode before writing it.
Compatibility Rules existing between Lock Modes:
o A transaction can lock a data item in shared mode if it is not locked at all
or it is locked in shared mode by another transaction
o A transaction can lock a data item in exclusive mode only if it is not
locked at all.
Conflicts: Two transactions are in conflict if they want to lock the same
data item with two incompatible modes; there are two types of conflicts: Read-Write conflict
and Write-Write conflict.
Granularity of Locking: This term relates to the size of objects that are locked
with a single lock operation. In general, it is possible to lock at the record level
(i.e. to lock individual tuples) or at the file level (i.e. to lock whole fragments).
Concurrent transactions execute correctly if the following rules are followed:
o Transactions are well-formed
o Compatibility rules are observed
o Each transaction does not request new locks after it has released a lock.
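
The compatibility rules between lock modes can be expressed as a small Python function. This is only a sketch; the names SHARED, EXCLUSIVE and can_grant are chosen for the example.

```python
SHARED, EXCLUSIVE = "shared", "exclusive"


def can_grant(requested_mode, modes_held_by_others):
    """Apply the compatibility rules to a lock request on one data item.

    modes_held_by_others lists the modes in which other transactions
    currently hold a lock on the same data item.
    """
    if not modes_held_by_others:
        return True          # item not locked at all: any mode can be granted
    # Otherwise only shared is compatible, and only with other shared locks.
    return requested_mode == SHARED and all(m == SHARED for m in modes_held_by_others)
```

A request that cannot be granted corresponds to a conflict: shared against exclusive is a Read-Write conflict, exclusive against exclusive a Write-Write conflict.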
A locking mechanism known as 2-Phase locking, which incorporates the above
principles, is normally used. According to this, there are two separate phases:
Growing phase: For each transaction there is a first phase during which new locks
are acquired
Shrinking Phase: A second phase during which locks are only released.

We will simply assume that all transactions are performed according to the following
scheme:
(Begin Application)
Begin transaction
Acquire locks before reading or writing
Release locks
(End application)
In this way the transactions are well formed, 2-Phase locked and isolated.
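
The two-phase rule itself, that no lock is requested after the first unlock, can be sketched in Python as follows. The class name TwoPhaseTransaction is an assumption made for this example, and lock modes and waiting are left out so that only the growing and shrinking phases are shown.

```python
class TwoPhaseTransaction:
    """Tracks the locks of one transaction and enforces the 2-phase rule."""

    def __init__(self):
        self.held = set()
        self.shrinking = False     # becomes True at the first unlock

    def lock(self, item):
        if self.shrinking:
            # Growing phase is over: acquiring a new lock would violate 2PL.
            raise RuntimeError("2-phase locking violated: lock after unlock")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True      # the shrinking phase starts here
        self.held.discard(item)


t = TwoPhaseTransaction()
t.lock("x")
t.lock("y")        # still in the growing phase
t.unlock("x")      # shrinking phase begins
try:
    t.lock("z")
    violation_detected = False
except RuntimeError:
    violation_detected = True
```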
Deadlock: A deadlock between two transactions arises if each transaction has locked a
data item and is waiting to lock a different data item which has already been locked by the
other transaction in a conflicting mode. Both transactions will wait forever in this
situation, and system intervention is required to unblock the situation. The system must
first detect the deadlock situation and then force one transaction to release its locks, so that
the other one can proceed; i.e. one transaction is aborted. This method is called
Deadlock detection.
7.3.2 Concurrency Control Based on Locking in Distributed Databases:
Let us now concentrate on distributed transaction concurrency control. Here we make some
assumptions:
The LTMs can lock and unlock local data items.
The LTMs interpret local locking primitives: local_lock_shared, local_lock_exclusive
and local_unlock.
The agents issue global primitives: lock_shared, lock_exclusive and
unlock.
The most important result for distributed databases is the following:
If distributed transactions are well-formed and 2-phase-locked, then 2-phase locking
is a correct locking mechanism in distributed databases as well as in centralized
databases.
We shall now discuss the important problems that have to be solved by the Distributed
transaction manager (DTM).

Dealing with multiple copies of the data: In distributed databases, redundancy
between data items, which are stored at different sites, is often desired, and in this
case two transactions which hold conflicting locks on two copies of the same data
item stored at different sites could be unaware of their mutual existence. In this
case locking would be completely useless.
In order to avoid this problem, the DTM
has to translate the lock primitive issued by an agent on a data item in such a way
that it is impossible for a conflicting transaction to be unaware of this lock. The
simplest way is to issue local locks to the LTMs at all sites where a local copy of the
data item is stored. In this way, the lock primitive is converted into as many local lock
primitives as there are copies of the locked item.
Alternative schemes are only briefly explained here:
Write-locks-all, read-locks-one: In this scheme exclusive locks are
acquired on all copies, while shared locks are acquired only on an
arbitrary copy. A conflict is always detected, because a shared-exclusive
conflict is detected at the site where the shared lock is
requested and exclusive-exclusive conflicts are detected at all sites.
Majority locking: Both shared and exclusive locks are
requested at a majority of the copies of the data item. In this way, if
two transactions are required to lock the same data item, there is at
least one copy of it where the conflict is discovered.
Primary copy locking: One copy of each data item is privileged
(called the Primary copy); all locks must be requested at this copy so
that conflicts are discovered at the site where the primary copy
resides.
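
The three schemes differ only in the set of copies at which a lock is requested. A minimal Python sketch, in which the function name copies_to_lock, the scheme labels and the site names are assumptions made for illustration:

```python
def copies_to_lock(scheme, mode, copies, primary=None):
    """Return the sites at which local locks must be requested.

    scheme: 'write-locks-all', 'majority' or 'primary-copy'
    mode:   'shared' or 'exclusive'
    copies: sites holding a copy of the data item
    """
    copies = list(copies)
    if scheme == "write-locks-all":
        # Exclusive locks on all copies, shared lock on one arbitrary copy
        # (here simply the first one).
        return copies if mode == "exclusive" else copies[:1]
    if scheme == "majority":
        # Any strict majority will do; here the first floor(n/2) + 1 copies.
        return copies[: len(copies) // 2 + 1]
    if scheme == "primary-copy":
        return [primary]
    raise ValueError("unknown scheme: " + scheme)


sites = ["site1", "site2", "site3"]
```

With three copies a majority consists of two sites, so any two majorities share at least one site, and that is where a conflict is discovered.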
[Figure 7.9 diagram: the ROOT AGENT and the other AGENTs of the distributed transaction exchange messages and use interface 2' to reach the DTM-AGENTs of the Distributed Transaction Manager; the DTM-AGENTs exchange messages and use interface 1' to reach the Local Transaction Manager (LTM) at their site.]

Interface 1' : Local_lock_shared, Local_lock_exclusive, Local_unlock

Interface 2' : lock_shared, lock_exclusive, Unlock

Figure 7.9 A reference model of distributed Concurrency control

Deadlock detection: The second problem that is faced by the DTM is Deadlock
detection. A deadlock is a circular waiting situation, which can involve many
transactions, not just two. The basic characteristic of a deadlock is the existence of
a set of transactions such that each transaction waits for another one. This can
be represented with a wait-for graph. It is a directed graph having transactions as
the nodes; an edge from transaction T1 to transaction T2 represents the fact that
T1 waits for T2 as shown in the fig 7.10. The existence of the deadlock situation
corresponds to the existence of a cycle in the wait-for graph. Therefore a system
can discover deadlocks by constructing wait-for graph and analyzing whether
there are cycles in it.
In the fig. 7.10 the notation TiAj refers to the agent Aj of transaction Ti. Here
there are two sites and two transactions T1 and T2, each one consisting of two
agents. For simplicity, we assume that each transaction has only one agent at each
site where it is executed. A direct edge from an agent TiAj to an agent TrAs
means that TiAj is blocked and waiting for TrAs.

Example of a deadlock situation:

[Figure 7.10 diagram: agents T1A1 and T2A1 at Site 1 and agents T1A2 and T2A2 at Site 2, connected by wait-for edges.]

Fig 7.10 A distributed wait-for graph showing a distributed deadlock

Clearly, if the arcs of the wait-for graph connect agents at different sites, the deadlock detection
problem becomes intrinsically a problem of distributed transaction management.
In this case, a global wait-for graph should be built by the DTM. The construction
of the global wait-for graph requires the execution of rather complex algorithms.
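
Once a global wait-for graph is available, finding a deadlock reduces to finding a cycle. The Python sketch below uses a depth-first search for this; it is not one of the distributed algorithms mentioned above, and the edges of the example graph are hypothetical, written in the TiAj notation.

```python
def find_deadlock(wait_for):
    """Return a list of nodes forming a cycle in the wait-for graph, or None.

    wait_for maps each waiting agent to the agents it waits for.
    """
    WHITE, GREY, BLACK = 0, 1, 2      # unvisited, on current path, finished
    colour = {}

    def visit(node, path):
        colour[node] = GREY
        path.append(node)
        for nxt in wait_for.get(node, ()):
            if colour.get(nxt, WHITE) == GREY:
                return path[path.index(nxt):]     # back edge: cycle found
            if colour.get(nxt, WHITE) == WHITE:
                cycle = visit(nxt, path)
                if cycle:
                    return cycle
        path.pop()
        colour[node] = BLACK
        return None

    for node in list(wait_for):
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


# Hypothetical edges: local waits at each site plus waits between the
# agents of the same transaction at different sites.
local_site1 = {"T1A1": ["T2A1"]}
local_site2 = {"T2A2": ["T1A2"]}
cross_site = {"T2A1": ["T2A2"], "T1A2": ["T1A1"]}
global_graph = {**local_site1, **local_site2, **cross_site}

cycle = find_deadlock(global_graph)
```

Neither local graph contains a cycle on its own; the deadlock becomes visible only in the global graph.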
Most systems do not detect deadlocks in the above way, as it is
somewhat complicated. They simply use timeouts for detecting deadlocks. With the
timeout method a transaction is aborted after a given time interval has passed
since the transaction entered a wait state. This method does not determine a
deadlock; it simply observes a "long waiting" which could possibly be caused by
a deadlock. The main challenge here is to estimate an optimum timeout interval.
In a distributed system it is even more difficult to determine a workable timeout
interval than in a centralized system, because of the less predictable behavior of
the communication network and of remote sites.
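
The timeout method can be sketched in a few lines of Python. The function name, the time values and the chosen timeout are assumptions made for the example; as noted above, a transaction selected this way is not necessarily deadlocked.

```python
def transactions_to_abort(wait_started, now, timeout):
    """Return the transactions that have been waiting longer than the timeout.

    wait_started maps each waiting transaction to the time (in seconds)
    at which it entered the wait state.
    """
    return sorted(t for t, started in wait_started.items()
                  if now - started > timeout)


# T1 has waited 10 seconds, T2 only 2 seconds; with a 5 second timeout
# only T1 is aborted.
victims = transactions_to_abort({"T1": 0.0, "T2": 8.0}, now=10.0, timeout=5.0)
```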
Check Your Progress 1: Answer the following: -
A. Define the different properties of a transaction.
B. List the goals of transaction management.
C. What do you mean by a distributed transaction?
D. Define the concept of atomicity.
E. Define the Log in case of transaction management.
F. What is the role of Distributed Transaction Manager?
G. What are the different transaction control statements available?
H. Define Commit point site.
I. Define Commit point strength.
J. State Two-Phase Commit Protocol.
K. List the different failures possible in case of Distributed transaction.
L. What do you mean by the problem of concurrency control?
M. Define Deadlock in Distributed transaction.
Check Your Progress 2: Answer the following: -
1. Explain Distributed transaction.
2. Explain the Recovery techniques in Centralized database. Also explain how it
is extended to distributed databases?
3. Explain Two-Phase commit protocol with an example.
4. Describe Two-Phase locking mechanism used to solve Concurrency control
issue in distributed transaction.
5. Explain Deadlock in Distributed transactions.
7.4 Summary: In this unit, we have learnt about distributed transaction
management and its related aspects. The key points are the following:
o Distributed transaction managers (DTMs) must ensure that transactions have the
atomicity, durability, serializability, and isolation properties. In most
systems this is obtained by implementing the 2-Phase Commit
protocol for reliability, 2-phase locking for concurrency control, and
timeouts for deadlock detection.
o The 2-Phase Commit protocol ensures that the sub-transactions of the
same transaction will either all commit or all abort, in spite of
possible failures; it is resilient to any failure in which no log
information is lost.
o The 2-phase-locking mechanism requires that all sub-transactions
acquire locks in the growing phase and release locks in the shrinking
phase.
o Timeout mechanisms for deadlock detection abort those
transactions that have been waiting for a long time, possibly because of a deadlock.



8.0 Objective
8.1 Introduction
8.2 Clock Synchronization
8.2.1 Synchronizing physical clocks
      Cristian's method for synchronizing clocks
      The Berkeley algorithm
      The Network Time Protocol
8.3 Logical Time and Logical clocks
8.3.1 Logical clocks
8.3.2 Totally Ordered Logical Clocks
8.4 Distributed Coordination
8.4.1 Distributed Mutual Exclusion
      The central server algorithm
      A distributed algorithm using logical clocks
      A ring-based algorithm
8.5 Elections
8.5.1 The bully algorithm
8.5.2 A Ring based election algorithm
8.6 Summary

8.0 Objective: In this unit we introduce some topics related to the issue of coordination
in distributed systems. To start with, the notion of time is dealt with. We go on to
explain the notion of logical clocks, which are a tool for ordering events without
knowing precisely when they occurred. The second half briefly examines some
algorithms to achieve distributed coordination. These include algorithms to achieve
mutual exclusion among a collection of processes, so as to coordinate their
accesses to shared resources. It goes on to examine how a group of processes can
agree upon a new coordinator of their activities after their previous coordinator
has failed or become unreachable. This process is called an election.
8.1 Introduction:
This unit introduces some concepts and algorithms related to the timing and coordination
of events occurring in distributed systems. In the first half we describe the notion
of time in a distributed system. We discuss the problem of how to synchronize
clocks in different computers, and so time events occurring at them consistently,
and we also discuss the related problem of determining the order in which events
occurred. In the latter half of the unit we introduce some algorithms whose
goal is to confer a privilege upon some unique member of a collection of processes.
In the next section we explain methods whereby computer clocks can be
approximately synchronized, using message passing. However, we shall not be able to
obtain sufficient accuracy to determine, in many cases, the relative ordering of events.
It is nonetheless possible to order
some events by appealing to the flow of data between processes in a distributed system.
We go on to introduce logical clocks, which are used to define an order on events without
measuring the physical time at which they occurred.
8.2 Clock Synchronization: Time is an important and interesting issue in distributed
systems, for several reasons.
o First, time is a quantity we often want to measure accurately. In order to know
at what time of day a particular event occurred at a particular computer (for
example, for accountancy purposes) it is necessary to synchronize its clock with
a trustworthy, external source of time. This is external synchronization. Also, if
computer clocks are synchronized with one another to a known degree of
accuracy, then we can, within the bounds of this accuracy, measure the interval
between two events occurring at different computers, by appealing to their local
clocks. This is internal synchronization. Two or more computers that are
internally synchronized are not necessarily externally synchronized, since they
may drift collectively from external time.
o Secondly, algorithms that depend upon clock synchronization have been
developed for several problems in distributed systems. These include maintaining the
consistency of distributed data (the use of timestamps to serialize transactions),
checking the authenticity of a request, and eliminating the processing of duplicates.
o Thirdly, the notion of physical time is itself a problem in a distributed system.
This is not due to the effects of special relativity, which are negligible or
non-existent for normal computers, but the problem is based on a similar limitation
concerning our ability to pass information from one computer to another. This is
that the clocks belonging to different computers can only be synchronized, at least
in the majority of cases, by network communication. Message passing is limited
by virtue of the speed at which it can transfer information, but this in itself would
not be a problem if we knew how long message transmission took. The problem is
that sending a message usually takes an unpredictable amount of time.
Now let us see different aspects of clock synchronization one by one.
8.2.1 Synchronizing physical clocks:
Computers each contain their own physical clock. These clocks are electronic
devices that count oscillations occurring in a crystal at a definite frequency, and which
typically divide this count and store the result in a counter register. Clock devices can be
programmed to generate interrupts at regular intervals in order that, for example, time
slicing can be implemented; however we shall not concern ourselves with this aspect of
clock operation. The clock output can be read by software and scaled into a suitable time
unit. This value can be used to timestamp any event experienced by any process
executing at the host computer. By event we mean an action that appears to occur
indivisibly, all at once, such as sending or receiving a message. Note however, that
successive events will correspond to different timestamps only if the clock resolution
(the period between updates of the clock register) is smaller than the time interval
between successive events. The rate at which events occur depends on such factors as
the length of the processor instruction cycle.
In order to compare timestamps generated by clocks of the same physical
construction at different computers, one might think that we need only know the relative
offset of one clock's counter from that of the other (for example, that one counter had the
value 1958 when the other was initialized to 0). Unfortunately, this supposition is based on
a false argument; computer clocks in practice are extremely unlikely to tick at the same
rate, whether or not they are of the same physical construction.
Clock drift: The crystal-based clocks used in computers are, like any other clocks,
subject to clock drift (Fig 8.1), which means that they count time at different rates, and
so diverge. The underlying oscillators are subject to physical variations, with the
consequence that their frequencies of oscillation differ. Moreover, even the same clock's
frequency varies with the temperature. Designs exist that attempt to compensate for this
variation but they cannot eliminate it. The difference in the oscillation period
between two clocks might be extremely small, but the difference accumulated over many
oscillations leads to an observable difference in the counters registered by two clocks, no
matter how accurately they were initialized to the same value. A clock's drift rate is the
change in the offset (difference in reading) between the clock and a nominal perfect
reference clock per unit of time measured by the reference clock. For clocks based on a
quartz crystal, this is about $10^{-6}$, giving a difference of one second every 1,000,000
seconds, or about 11.6 days.

Fig.8.1 Drift between computers' clocks in a distributed system.
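
The arithmetic behind the drift figure quoted above can be checked with a short
sketch. The function names are assumptions introduced for illustration only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def accumulated_drift(drift_rate, elapsed_seconds):
    """Offset (in seconds) built up after elapsed_seconds at a constant drift rate."""
    return drift_rate * elapsed_seconds


def seconds_until_drift(drift_rate, target_seconds=1.0):
    """Elapsed reference time needed for the offset to reach target_seconds."""
    return target_seconds / drift_rate


interval = seconds_until_drift(1e-6)          # about 1,000,000 seconds
print(interval, interval / SECONDS_PER_DAY)   # the second value is roughly 11.6 days
```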
Coordinated universal time: The most accurate physical clocks known use atomic
oscillators, whose accuracy is about one part in $10^{13}$. The output of these atomic clocks is
used as the standard for elapsed real time, known as International Atomic Time.
Coordinated universal time, abbreviated as UTC, is an international
standard that is based on atomic time, but a so-called leap second is occasionally
inserted or deleted to keep in step with astronomical time. UTC signals are
synchronized and broadcast regularly from land-based radio stations and satellites
covering many parts of the world. For example, in the US radio station WWV
broadcasts time signals on several short-wave frequencies. Satellite sources
include the Geostationary Operational Environmental Satellite (GOES) and the
Global Positioning System (GPS).
Receivers are available commercially. Compared with perfect UTC, the signals
received from land-based stations have an accuracy in the order of 0.1-10
milliseconds, depending on the station used. Signals received from GOES are
accurate to about 0.1 ms and signals received from GPS are accurate to about one
millisecond. Computers with receivers attached can synchronize their clocks with
these timing signals.
Compensating for clock drift: If the time provided by a time service, such as Coordinated
Universal Time signals, is greater than the time at a computer C, then it may be
possible simply to set C's clock to the time service's time. Several clock ticks appear to
have been missed, but time continues to advance, as expected. Now consider what
should happen if the time service's time is behind that of C. We cannot set C's time back
to that time, because this is liable to confuse applications that rely on the assumption that
time always advances. The solution is not to set C's clock back, but to cause it to run slow
for a period, until it is in accord with the timeserver's time. It is possible to change the rate
at which updates are made to the time as given to applications. This can be achieved in
software, without changing the rate at which the hardware clock ticks (an operation
which is not always supported by hardware clocks).
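
The idea of running the software clock slow (or fast) for a period can be sketched as
follows, assuming the linear compensation S = (1 + a)H + b that the text goes on to
derive. The function names and the sample numbers are assumptions for illustration.

```python
def slew_coefficients(t_skew, t_real, h, n):
    """Constants a and b so that S(h) = t_skew and S(h + n) = t_real + n."""
    a = (t_real - t_skew) / n
    b = t_skew - (1 + a) * h
    return a, b


def software_clock(hardware_reading, a, b):
    """Time handed to applications for a given hardware clock reading."""
    return (1 + a) * hardware_reading + b


# Hypothetical case: at hardware reading 1000 the software clock shows 1005
# but the real time is 1000, so the clock is slowed over the next 100 ticks.
a, b = slew_coefficients(t_skew=1005.0, t_real=1000.0, h=1000.0, n=100.0)
print(software_clock(1000.0, a, b), software_clock(1100.0, a, b))
```

Because 1 + a stays positive in this example, the software clock never runs backwards
while it converges on the real time.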
Let us call the time given to applications (the software clock's reading) $S$ and the
time given by the hardware clock $H$. Let the compensating factor be $\delta$, so that
$S(t) = H(t) + \delta(t)$
The simplest form for $\delta$ to make $S$ change continuously is a linear function of the
hardware clock: $\delta(t) = aH(t) + b$, where $a$ and $b$ are constants to be found. Substituting
for $\delta$ in the identity, we have $S(t) = (1 + a)H(t) + b$.
Let the value of the software clock be $T_{skew}$ when $H = h$, and let the actual time at
that point be $T_{real}$. We may have that $T_{skew} > T_{real}$ or $T_{skew} < T_{real}$. If $S$ is to give the actual
time after $N$ further ticks, we must have
$T_{skew} = (1 + a)h + b$, and $T_{real} + N = (1 + a)(h + N) + b$
By solving these equations, we find
$a = (T_{real} - T_{skew}) / N$ and $b = T_{skew} - (1 + a)h$.
Cristian's method for synchronizing clocks: One way to achieve
synchronization between computers in a distributed system is for a central timeserver
process S to supply the time according to its clock upon request, as shown in Fig 8.2. The
timeserver computer can be fitted with a suitable receiver so as to be synchronized with
UTC. If a process P requests the time in a message mr, and receives the time value t in a
message mt, then in principle it could set its clock to the time t + Ttrans,
where Ttrans is the time taken to transmit mt from S to P (t is inserted in mt at the last
possible point before transmission from S's computer).

Fig.8.2 Clock synchronization using a timeserver.

Unfortunately, Ttrans is subject to variation. In general, other processes are
competing with S and P for resources at each computer involved, and other messages
compete with mt for the network. These factors are unpredictable and not practically
avoidable or accurately measurable in most installations. In general, we may say that
Ttrans = min + x, where x ≥ 0. The minimum value min is the value that would be obtained
if no other processes executed and no other network traffic existed; min can be measured
or conservatively estimated. The value of x is not known in a particular case, although a
distribution of values may be measurable for a particular installation.
Cristian suggested the use of such a timeserver, connected to a device that
receives signals from a source of UTC, to synchronize computers. Synchronization
between the timeserver and its UTC receiver can be achieved by a method that is similar
to the following procedure for synchronization between computers.
A process P wishing to learn the time from S can record the total round-trip time,
Tround, with reasonable accuracy if its rate of clock drift is small. For example, the
round-trip time should be in the order of 1-10 milliseconds on a LAN, over which time a
clock with a drift rate of $10^{-6}$ varies by at most $10^{-5}$ milliseconds.
Let the time returned in S's message mt be t. A simple estimate of the time to
which P should set its clock is t + Tround/2, which assumes that the elapsed time is
split equally before and after S placed t in mt. Let the times between sending and
receipt for mr and mt be min + x and min + y, respectively. If the value of min is
known or can be conservatively estimated, then we can determine the accuracy of
this result as follows.
The earliest point at which S could have placed the time in mt was min after P
dispatched mr. The latest point at which it could have done this was min before mt
arrived at P. The time by S's clock when the reply message arrives is therefore in
the range [t + min, t + Tround - min]. The width of this range is Tround - 2 min, so the
accuracy is ±(Tround/2 - min).
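
The estimate and its accuracy bound can be sketched as below. The function name and
the sample values are assumptions; the formulas are the ones stated above.

```python
def cristian_estimate(t_server, t_round, min_delay=0.0):
    """Return (clock setting, accuracy) for one request/reply exchange.

    t_server  -- time value t carried in the reply mt
    t_round   -- round-trip time Tround measured by the client
    min_delay -- minimum one-way transmission time min
    """
    if t_round < 2 * min_delay:
        raise ValueError("round-trip time cannot be less than 2 * min")
    estimate = t_server + t_round / 2      # t + Tround/2
    accuracy = t_round / 2 - min_delay     # plus or minus (Tround/2 - min)
    return estimate, accuracy


# Hypothetical exchange on a LAN: 10 ms round trip, 2 ms minimum one-way delay.
estimate, accuracy = cristian_estimate(t_server=100.0, t_round=0.010, min_delay=0.002)
print(estimate, accuracy)
```

A shorter round trip gives a tighter bound, which is why a client may repeat the
request and keep the reading with the smallest Tround.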
Discussion of Cristian's algorithm: As described, Cristian's method suffers from the
problem associated with all services implemented by a single server: the single
timeserver might fail and thus render synchronization impossible temporarily. Cristian
suggested, for this reason, that time should be provided by a group of synchronized
timeservers, each with a receiver for UTC time signals. For example, a client could
multicast its request to all servers, and use only the first reply obtained. Note that a
malfunctioning timeserver that replied with spurious time values, or an imposter
timeserver that replied with deliberately incorrect times, could wreak havoc in a computer
system.
The Berkeley algorithm: Gusella and Zatti describe an algorithm for internal
synchronization which they developed for collections of computers running Berkeley
UNIX. In it, a coordinator computer is chosen to act as the master. Unlike Cristian's
protocol, this computer periodically polls the other computers whose clocks are to be
synchronized, called slaves. The slaves send back their clock values to it. The master
estimates their local clock times by observing the round-trip times (similarly to Cristian's
technique), and it averages the values obtained (including its own clock's reading). The
balance of probabilities is that this average cancels out the individual clocks' tendencies
to run fast or slow. The accuracy of the protocol depends upon a nominal maximum
round-trip time between the master and the slaves. The master eliminates any occasional
readings associated with larger times than this maximum.
Instead of sending the updated current time back to the other computers (which
would introduce further uncertainty due to the message transmission time) the master
sends the amount by which each individual slave's clock requires adjustment. This can be
a positive or negative value.
The algorithm eliminates readings from clocks that have drifted badly or that have
failed and provide spurious readings. Such clocks could have a significant adverse effect
if an ordinary average was taken. The master takes a fault-tolerant average. That is, a
subset of clocks is chosen that do not differ from one another by more than a specified
amount, and the average is taken only of readings from these clocks.
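
One possible reading of the fault-tolerant average is sketched below: the master keeps
the largest group of readings that lie within a given spread of one another, averages
them, and sends each machine the difference. The function names, the choice of the
largest group and the sample readings are assumptions for illustration; round-trip
compensation is left out.

```python
def fault_tolerant_average(readings, max_spread):
    """Average the largest set of readings that differ by at most max_spread."""
    ordered = sorted(readings)
    best = ordered[:1]
    start = 0
    for end in range(len(ordered)):
        # Shrink the window from the left until it fits within max_spread.
        while ordered[end] - ordered[start] > max_spread:
            start += 1
        if end - start + 1 > len(best):
            best = ordered[start:end + 1]
    return sum(best) / len(best)


def berkeley_adjustments(master_reading, slave_readings, max_spread):
    """Return the signed adjustment each clock should apply."""
    readings = [master_reading] + list(slave_readings.values())
    average = fault_tolerant_average(readings, max_spread)
    adjustments = {name: average - value for name, value in slave_readings.items()}
    adjustments["master"] = average - master_reading
    return adjustments


# Hypothetical readings in minutes; machine "c" has a wildly wrong clock.
result = berkeley_adjustments(180.0, {"a": 205.0, "b": 170.0, "c": 1000.0}, 60.0)
print(result)
```

The faulty reading from "c" is excluded from the average but "c" still receives an
adjustment that brings it back into line.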
Gusella and Zatti describe an experiment involving 15 computers whose clocks
were synchronized to within about 20-25 milliseconds using their protocol. The local
clocks' drift rate was measured to be less than $2 \times 10^{-5}$, and the maximum round-trip time
was taken to be 10 milliseconds.
The Network Time Protocol: The Network Time Protocol (NTP) defines an
architecture for a time service and a protocol to distribute time information over a wide
variety of interconnected networks. It has been adopted as a standard for clock
synchronization throughout the Internet.
NTP's chief design aims and features are:
o To provide a service enabling clients across the Internet to be
synchronized accurately to UTC, despite the large and variable message
delays encountered in Internet communication. NTP employs statistical
techniques for the filtering of timing data and it discriminates between the
qualities of timing data from different servers.
o To provide a reliable service that can survive lengthy losses of
connectivity. There are redundant servers and redundant paths between the
servers. The servers can reconfigure so as still to provide the service if one
of them becomes unreachable.
o To enable clients to resynchronize sufficiently frequently to offset the rates of drift
found in most computers. The service is designed to scale to large numbers
of clients and servers.
o To provide protection against interference with the time service, whether
malicious or accidental. The time service uses authentication techniques to
check that timing data originate from the claimed trusted sources. It also
validates the return addresses of messages sent to it.



Fig.8.3 An example synchronization subnet in an NTP implementation

Note: Arrows denote synchronization control and Numbers denote strata.

A network of servers located across the Internet provides the NTP service. Primary
servers are directly connected to a time source such as a radio clock receiving UTC;
secondary servers are synchronized, ultimately, to primary servers. The servers are
connected in a logical hierarchy called a synchronization subnet (see Fig 8.3),
whose levels are called strata. Primary servers occupy stratum 1: they are at the
root. Stratum 2 servers are secondary servers that are synchronized directly to the
primary servers; stratum 3 servers are synchronized from stratum 2 servers, and so on. The
lowest level (leaf) servers execute in users' workstations.
NTP servers synchronize with one another in one of three modes: multicast,
procedure-call and symmetric mode.
Multicast mode: is intended for use on a high-speed LAN. One or more servers
periodically multicast the time to the servers running in other computers
connected by the LAN, which set their clocks assuming a small delay. This mode
can only achieve relatively low accuracies, but ones which nonetheless are
considered sufficient for many purposes.
Procedure-call mode: is similar to the operation of Cristian's algorithm, described
above. In this mode, one server accepts requests from other computers, which it
processes by replying with its timestamp (current clock reading). This mode is
suitable where higher accuracies are required than can be achieved with multicast,
or where multicast is not supported in hardware. For example, file servers on the
same or a neighboring LAN, which need to keep accurate timing information for
file accesses, could contact a local master server in procedure-call mode.
Symmetric mode: is intended for use by the master servers that supply time
information in LANs and by the higher levels (lower strata) of the
synchronization subnet, where the highest accuracies are to be achieved. A pair of
servers operating in symmetric mode exchange messages bearing timing
information. Timing data are retained as part of an association between the servers
that is maintained in order to improve the accuracy of their synchronization over
time.

In all modes, messages are delivered unreliably, using the standard UDP Internet
transport protocol. In procedure-call mode and symmetric mode, messages are
exchanged in pairs. Each message bears timestamps of recent message events: the
local times when the previous NTP message between the pair was sent and
received, and the local time when the current message was transmitted. The
recipient of the NTP message notes the local time when it receives the message.
The four times $T_{i-3}$, $T_{i-2}$, $T_{i-1}$ and $T_i$ are shown in Fig 8.4 for messages m and m' sent
between servers A and B. Note that in symmetric mode, unlike in Cristian's
algorithm, there can be a non-negligible delay between the arrival of one message
and the dispatch of the next. Also, messages may be lost, but the three timestamps
carried by each message are nonetheless valid.
For each pair of messages sent between two servers the NTP protocol calculates an
offset $o_i$ that is an estimate of the actual offset between the two clocks, and a delay
$d_i$ that is the total transmission time for the two messages. If the true offset of the clock
at B relative to that at A is $o$, and if the actual transmission times for m and m' are $t$
and $t'$ respectively, then we have:

$T_{i-2} = T_{i-3} + t + o$, and $T_i = T_{i-1} + t' - o$

Defining $a = T_{i-2} - T_{i-3}$ and $b = T_{i-1} - T_i$

this leads to:

$d_i = t + t' = a - b$ ; $o = o_i + (t' - t)/2$ where $o_i = (a + b)/2$

[Figure: server A sends m at $T_{i-3}$; server B receives it at $T_{i-2}$ and sends m' at $T_{i-1}$; server A receives m' at $T_i$]

Fig.8.4 Messages exchanged between a pair of NTP peers

Using the fact that $t, t' \geq 0$ it can be shown that $o_i - d_i/2 \leq o \leq o_i + d_i/2$. Thus $o_i$ is an
estimate of the offset, and $d_i$ is a measure of the accuracy of this estimate.
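
The offset and delay calculation can be sketched directly from the four timestamps.
The function name and the sample values are assumptions; the sample was built with a
true offset of 5 and equal transmission times of 2 in each direction, so the estimate
is exact in this case.

```python
def ntp_offset_delay(t_i3, t_i2, t_i1, t_i):
    """Return (o_i, d_i) from the four timestamps of one message pair.

    t_i3 -- time at A when m was sent      t_i2 -- time at B when m arrived
    t_i1 -- time at B when m' was sent     t_i  -- time at A when m' arrived
    """
    a = t_i2 - t_i3
    b = t_i1 - t_i
    offset_estimate = (a + b) / 2    # o_i
    delay = a - b                    # d_i = t + t'
    return offset_estimate, delay


# Hypothetical exchange: B's clock is 5 ahead of A's, each message takes 2.
o_i, d_i = ntp_offset_delay(t_i3=100, t_i2=107, t_i1=108, t_i=105)
print(o_i, d_i)
```

When the two transmission times differ, the true offset still lies within d_i/2 of the
estimate o_i, as stated above.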
8.3 Logical time and logical clocks:
From the point of view of any single process, events are ordered uniquely by times shown
on the local clock. However, as Lamport pointed out, since we cannot perfectly
synchronize clocks across a distributed system, we cannot in general use physical time to
find out the order of any arbitrary pair of events occurring within it.
The order of events occurring at different processes can be critical in a distributed
application. In general, we can use a scheme which is similar to physical causality, but
which applies in distributed systems, to order some of the events that occur at different
processes. This ordering is based on two simple and intuitively obvious points:
o If two events occurred at the same process, then they occurred in the order in
which it observes them.
o Whenever a message is sent between processes, the event of sending the message
occurred before the event of receiving the message.
Lamport called the ordering obtained by generalizing these two relationships the
happened-before relation. It is also sometimes known as the relation of causal ordering
or potential causal ordering.
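More formally, we write $x \rightarrow_p y$ if two events x and y occurred at a single process p,
and x occurred before y. Using this restricted order we can define the happened-before
relation, denoted by $\rightarrow$, as follows:
HB1: If $\exists$ process p: $x \rightarrow_p y$, then $x \rightarrow y$.
HB2: For any message m, $send(m) \rightarrow rcv(m)$,
     where send(m) is the event of sending the message, and rcv(m)
     is the event of receiving it.
HB3: If x, y and z are events such that $x \rightarrow y$ and $y \rightarrow z$, then $x \rightarrow z$.

[Figure: time lines of processes P1 (events a, b), P2 (events c, d) and P3 (events e, f); message m1 is sent at b and received at c, message m2 is sent at d and received at f]

Fig. 8.5 Events occurring at three processes.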

Thus if $x \rightarrow y$, then we can find a series of events e1, e2, e3, ..., en occurring at one or
more processes such that x = e1 and y = en and, for i = 1, 2, ..., n-1, either HB1 or HB2 applies
between ei and ei+1. That is, either they occur in succession at the same process, or there
is a message m such that ei = send(m) and ei+1 = rcv(m). The sequence of events e1, e2, e3,
..., en need not be unique.
The relation $\rightarrow$ is illustrated for the case of the three processes P1, P2 and P3 in
Fig 8.5. It can be seen that $a \rightarrow b$, since the events occur in this order at process P1, and
similarly $c \rightarrow d$; $b \rightarrow c$, since these are the sending and reception of message m1, and
similarly $d \rightarrow f$. Combining these relations, we may also say that, for example, $a \rightarrow f$.
It can also be observed from Fig 8.5 that not all events are related by the relation $\rightarrow$. We
say that events such as a and e that are not ordered by $\rightarrow$ are concurrent, and write this
a || e.
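
A small sketch can make the transitive nature of the relation concrete. It encodes the
direct relationships of Fig 8.5 (rules HB1 and HB2) as graph edges and applies HB3 by
searching for a path; the function and variable names are assumptions for illustration.

```python
def happened_before(edges, x, y):
    """True if y can be reached from x through one or more direct edges."""
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt == y:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def concurrent(edges, x, y):
    """True if neither event happened before the other."""
    return not happened_before(edges, x, y) and not happened_before(edges, y, x)


# Direct edges: same-process succession (a-b, c-d, e-f) and messages m1, m2.
direct = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["f"], "e": ["f"]}
print(happened_before(direct, "a", "f"), concurrent(direct, "a", "e"))
```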
The relation $\rightarrow$ captures a flow of data intervening between two events. Note
however, that in real life data can flow in ways other than by message passing. For
example, if Smith enters a command to his process to send a message, then telephones
Jones, who commands her process to issue another message, then the issuing of the first
message clearly happened-before that of the second. Unfortunately, since no network
messages were sent between the issuing processes, we cannot model this type of
relationship in our system.
8.3.1 Logical clocks: Lamport invented a simple mechanism by which the
happened-before ordering can be captured numerically, called a logical clock. A logical clock is a
monotonically increasing software counter, whose value need bear no particular
relationship to any physical clock, in general. Each process p keeps its own logical clock,
Cp, which it uses to timestamp events.

[Figure: the time lines of Fig 8.5 annotated with logical timestamps: a = 1 and b = 2 at P1; c = 3 and d = 4 at P2; e = 1 and f = 5 at P3]

Fig.8.6 Logical timestamps for the events shown in Fig 8.5
We denote the timestamp of event a at p by Cp(a), and by C(b) we denote the timestamp
of event b at whatever process it occurred.
To capture the happened-before relation $\rightarrow$, processes update their logical
clocks and transmit the values of their logical clocks in messages as follows:
LC1 : Cp is incremented before each event is issued at process p: Cp :=
Cp + 1
LC2 : a) When a process p sends a message m, it piggybacks on m the
value t = Cp.
b) On receiving (m, t) a process q computes Cq := max(Cq, t) and
then applies LC1 before timestamping the event rcv(m).
Although we increment clocks by one, we could have chosen any positive value. It can
easily be shown, by induction on the length of any sequence of events relating two events
a and b, that $a \rightarrow b \Rightarrow C(a) < C(b)$.
Note that the converse is not true. If C(a) < C(b), then we cannot infer that
$a \rightarrow b$. In Fig 8.6 we illustrate the use of logical clocks for the example given in Fig 8.5. Each
of the processes P1, P2 and P3 has its logical clock initialized to 0. The clock values given
are those immediately after the event to which they are adjacent. Note that, for example,
C(b) > C(e) but b || e.
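
Rules LC1 and LC2 can be sketched as a small class and replayed on the events of
Fig 8.5. The class and method names are assumptions made for this illustration.

```python
class LamportClock:
    """Logical clock following rules LC1 and LC2."""

    def __init__(self):
        self.value = 0

    def tick(self):
        """LC1: increment before each event and return its timestamp."""
        self.value += 1
        return self.value

    def send(self):
        """LC2(a): timestamp the send event; the result is piggybacked on m."""
        return self.tick()

    def receive(self, t):
        """LC2(b): take the maximum with the received value, then apply LC1."""
        self.value = max(self.value, t)
        return self.tick()


p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
stamps = {}
stamps["a"] = p1.tick()
stamps["b"] = p1.send()                  # b sends m1
stamps["c"] = p2.receive(stamps["b"])    # c receives m1
stamps["d"] = p2.send()                  # d sends m2
stamps["e"] = p3.tick()
stamps["f"] = p3.receive(stamps["d"])    # f receives m2
print(stamps)   # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 1, 'f': 5}
```

The values agree with Fig 8.6, including the pair b and e, whose timestamps are ordered
although the events are concurrent.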
8.3.2 Totally ordered logical clocks: Logical clocks impose only a partial order on the
set of all events, since some pairs of distinct events, generated by different processes,
have numerically identical timestamps. However, we can extend this to a total order (that
is, one for which all pairs of distinct events are ordered) by taking into account the
identifiers of the processes at which events occur. If a is an event occurring at pa with
local timestamp Ta, and b is an event occurring at pb with local timestamp Tb, we define
the global logical timestamps for these events to be (Ta, pa) and (Tb, pb) respectively, and
we define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb, or Ta = Tb and pa < pb.
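
This definition is the same as lexicographic comparison of pairs, which Python tuples
provide directly. The sketch below uses integer process identifiers and a helper name
that are assumptions for illustration.

```python
def global_timestamp(local_time, process_id):
    """Pair a local logical timestamp with the identifier of its process."""
    return (local_time, process_id)


# Events from Fig 8.6 with assumed process identifiers 1, 2 and 3.
a = global_timestamp(1, 1)   # event a at P1
e = global_timestamp(1, 3)   # event e at P3, same logical time as a
b = global_timestamp(2, 1)   # event b at P1

print(a < e, e < b, sorted([b, e, a]))
```

The tie between a and e is broken by the process identifier, so every pair of distinct
events is ordered.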
8.4 Distributed Coordination:
Distributed processes often need to coordinate their activities. For example, if a collection
of processes shares a resource or collection of resources managed by a server, then often
mutual exclusion is required to prevent interference and ensure consistency when
accessing the resources. This is essentially the critical section problem, familiar in the
domain of operating systems. In the distributed case, however, neither
shared variables nor facilities supplied by a single local kernel can be used to solve it.
A separate, generic mechanism for distributed mutual exclusion is required in
certain cases, one that is independent of the particular resource management scheme in
question. Distributed mutual exclusion involves a single process being given a privilege
(the right to access shared resources) temporarily, before another process is granted it. In
some other cases, however, the requirement is for a set of processes to choose one of their
number to play a privileged, coordinating role over the long term. A method for
choosing a unique process to play a particular role is called an election algorithm.
This section now examines some algorithms for achieving the goals of distributed
mutual exclusion and elections.
8.4.1 Distributed mutual exclusion: Our basic requirements for mutual exclusion
concerning some resource (or collection of resources) are as follows:
ME1 : (safety) At most one process may execute in the critical section (CS) at a time.
ME2 : (liveness) A process requesting entry to the CS is eventually granted it (so
long as any process executing in the CS eventually leaves it).
ME2 implies that the implementation is deadlock-free, and that starvation does not occur.
A further requirement, which may be made, is that of causal ordering:
ME3 : (ordering) Entry to the CS should be granted in happened-before order.
A process may continue with other processing while waiting to be granted entry to a
critical section. During this time it might send a message to another process, which
consequently also tries to enter the critical section. ME3 specifies that the first process
should be granted access before the second.
We now discuss some algorithms for achieving these requirements.
The central server algorithm: The simplest way to achieve mutual exclusion is
to employ a server that grants permission to enter the critical section. For the sake of
simplicity, we shall assume that there is only one critical section managed by the server.
Recall that the protocol for executing a critical section is as follows.
enter() (* enter critical section - block if necessary *)
        (* access shared resources in critical section *)
exit()  (* leave critical section - other processes may now enter *)
Fig 8.7 shows the use of this server. To enter a critical section, a process sends a request
message to the server and awaits a reply from it. Conceptually, the reply constitutes a
token signifying permission to enter the critical section. If no other process has the token
at the time of the request, then the server replies immediately, granting the token. If the
token is currently held by another process, then the server does not reply
but queues the request. On exiting the critical section, a process sends a message to the
server, giving it back the token.

[Figure: a server holding a queue of requests; processes send "request token" and "release token" messages to it and receive "grant token" replies]

Fig.8.7 Server managing a mutual exclusion token for a set of processes

If the queue of waiting processes is not empty, then the server chooses the oldest
entry in the queue, removes it and replies to the corresponding process. The chosen
process then holds the token. In the figure, we show a situation in which p2's request has
been appended to the queue, which already contained p4's request. p3 exits the critical
section, and the server removes p4's entry and grants permission to enter to p4 by
replying to it. Process p1 does not currently require entry to the critical section.
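
The server's behaviour can be sketched as a small in-memory class that replays the
situation of Fig 8.7. Message passing is replaced by direct method calls, and the class
and method names are assumptions made for this illustration.

```python
from collections import deque


class TokenServer:
    """Grants a single mutual exclusion token in first-come, first-served order."""

    def __init__(self):
        self.holder = None       # process currently holding the token
        self.queue = deque()     # waiting requests, oldest first

    def request(self, pid):
        """Return True if the token is granted at once, False if pid is queued."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)
        return False

    def release(self, pid):
        """Take the token back and pass it to the oldest waiter, if any."""
        if pid != self.holder:
            raise ValueError("only the token holder may release it")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder


server = TokenServer()
server.request("p3")             # p3 enters the critical section
server.request("p4")             # queued
server.request("p2")             # queued behind p4
print(server.release("p3"))      # p4 now holds the token
```

A real server would send the grant as a reply message rather than return it from
release, but the queue discipline is the same.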
The server is also a critical point of failure. Any process attempting to communicate
with it, either to enter or exit the critical section, will detect the failure of the server. To
recover from a server failure, a new server can be created (or one of the processes
requiring mutual exclusion could be picked to play a dual role as a server). Since the
server must be unique if it is to guarantee mutual exclusion, an election must be called to
choose one of the clients to create or act as the server, and to multicast its address to the
others. We shall describe some election algorithms below. When the new server has been
chosen, it needs to obtain the state of its clients, so that it can process their requests as
the previous server would have done. The ordering of entry requests will be different in
the new server from that in the failed one unless precautions are taken, but we shall not
address this problem here.
A distributed algorithm using logical clocks: Ricart and Agrawala developed
an algorithm to implement mutual exclusion that is based upon distributed agreement,
instead of using a central server. The basic idea is that processes that require entry to a
critical section multicast a request message, and can enter it only when all the other
processes have replied to this message. The conditions under which a process replies to a
request are designed to ensure that conditions ME1 to ME3 are met.
Again, we can associate permission to enter with possession of a token, although
obtaining it in this case takes multiple messages. The assumptions of the algorithm are
that the processes p1, ..., pn know one another's addresses, that all messages sent are
eventually delivered, and that each process pi keeps a logical clock, updated according to
the rules LC1 and LC2 of the previous unit. Messages requesting the token are of the form
<T, pi>, where T is the sender's timestamp and pi is the sender's identifier. For simplicity's
sake, we assume that only one critical section is at issue, and so it does not have to be
identified. Each process records its state of having released the token (RELEASED),
wanting the token (WANTED) or possessing the token (HELD) in a variable. The protocol
is given in Fig 8.8.
Fig.8.8 Ricart and Agrawala's algorithm
On initialization:
    state := RELEASED;

To obtain the token:
    state := WANTED;
    (* request processing is deferred during the next two steps *)
    Multicast request to all processes;
    T := request's timestamp;
    Wait until (number of replies received = (n-1));
    state := HELD;

On receipt of a request <Ti, pi> at pj (i ≠ j):
    If (state = HELD or (state = WANTED and (T, pj) < (Ti, pi)))
    Then
        Queue request from pi without replying;
    Else
        Reply immediately to pi;
    End if

To release the token:
    state := RELEASED;
    Reply to any queued requests;

If a process requests the token and the token is RELEASED everywhere else (that is,
no other process wants it), then all processes will reply immediately to the request and the
requester will obtain the token. If the token is HELD at some process, then that process will
not reply to requests until it has finished with the token, and so the requester cannot
obtain the token in the meantime. If two or more processes request the token at the same
time, then whichever process's request bears the lowest timestamp will be the first to
collect (n-1) replies, granting it the token next. If the requests bear equal timestamps, the
process identifiers are compared to order them. Note that, when a process requests the
token, it defers processing requests from other processes until its own request has been
sent and the timestamp T is known. This is so that processes make consistent decisions
when processing requests.
To illustrate the algorithm, consider a situation involving three processes, p1, p2
and p3, shown in Fig 8.9. Let us assume that p3 is not interested in the token, and that p1
and p2 request it concurrently. The timestamp of p1's request is 41, that of p2's is 34. When
p3 receives their requests, it replies immediately. When p2 receives p1's request, it finds its
own request has the lower timestamp, and so does not reply, holding p1 off. However, p1
finds that p2's request has a lower timestamp than that of its own request, and so replies
immediately. On receiving this second reply, p2 possesses the token. When p2 releases the
token, it will reply to p1's request, and so grant it the token.
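The three-process example can be replayed with a small Python sketch. It is a simplified model under stated assumptions: the class Process and its method names are invented for the illustration, messages are delivered by direct method calls rather than over a network, and the timestamps 41 and 34 are taken from the example instead of being produced by logical clocks.

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"


class Process:
    """One participant in a toy model of the Ricart and Agrawala protocol."""

    def __init__(self, pid, n):
        self.pid = pid
        self.n = n               # total number of processes
        self.state = RELEASED
        self.request = None      # (T, pid) of this process's own request
        self.replies = 0
        self.deferred = []       # identifiers of requesters still awaiting our reply

    def want(self, timestamp):
        """Record our own request; the caller multicasts the returned pair."""
        self.state = WANTED
        self.request = (timestamp, self.pid)
        self.replies = 0
        return self.request

    def on_request(self, req):
        """Return True to reply immediately, False if the request is queued."""
        # Tuples compare by timestamp first and by identifier on a tie.
        if self.state == HELD or (self.state == WANTED and self.request < req):
            self.deferred.append(req[1])
            return False
        return True

    def on_reply(self):
        self.replies += 1
        if self.replies == self.n - 1:
            self.state = HELD    # all other processes have replied

    def release(self):
        """Leave the critical section; return the requesters to reply to now."""
        self.state = RELEASED
        waiting, self.deferred = self.deferred, []
        return waiting


procs = {i: Process(i, 3) for i in (1, 2, 3)}
# p1 and p2 request concurrently; p3 is not interested.
requests = [procs[1].want(41), procs[2].want(34)]
for req in requests:
    for p in procs.values():
        if p.pid != req[1] and p.on_request(req):
            procs[req[1]].on_reply()
first = [p.pid for p in procs.values() if p.state == HELD]

# p2 releases the token and replies to the request it had queued.
for pid in procs[2].release():
    procs[pid].on_reply()
second = [p.pid for p in procs.values() if p.state == HELD]
```

In this run, first contains only p2 and second contains only p1, in agreement with the order described for Fig 8.9.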
Obtaining the token takes 2(n-1) messages in this algorithm: (n-1) to multicast the
request, followed by (n-1) replies. Or, if there is hardware support for multicast, only one
message is required for the request; the total is then n messages. It is thus a considerably
more expensive algorithm, in general, than the central server algorithm just described.
Note also that, while it is a fully distributed algorithm, the failure of any process involved
would make progress impossible. And in the distributed algorithm all the processes
involved receive and process every request, so no performance gain has been made over
the single server bottleneck, which does just the same. Finally, note that a process that
wishes to obtain the token and which was the last to possess it still goes through the
protocol as described, even though it could simply decide locally to reallocate it to itself.
Ricart and Agrawala refined this protocol so that it requires n messages to obtain the
token in the worst (and common) case, without hardware support for multicast.
A ring-based algorithm: One of the simplest ways to arrange mutual exclusion
between n processes p1, ..., pn is to arrange them in a logical ring. The idea is that exclusion
is conferred by obtaining a token in the form of a message passed from process to process
in a single direction (clockwise, say) round the ring. The ring topology, which is
unrelated to the physical interconnections between the underlying computers, is created
by giving each process the address of its neighbor. A process that does not require the
token when it receives it immediately forwards it to its neighbor. A process that requires
the token waits until it receives it, and then retains it. To exit the critical section, the
process sends the token on to its neighbor.

Fig. 8.9 Multicast Synchronization

The arrangement of processes is shown in Fig 8.10. It is straightforward to verify
that the conditions ME1 and ME2 are met by this algorithm, but that the token is not
necessarily obtained in happened-before order. It can take from 1 to (n-1) messages to
obtain this token, from the point at which the token becomes required. However,
messages are sent around the ring even when no process requires the token. If a process
fails, then clearly no progress can be made beyond it in transferring the token, until a
reconfiguration is applied to extract the failed process from the ring.
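The message count quoted above can be checked with a short Python sketch. It assumes processes numbered 0 to n-1 in clockwise order and a requester other than the current token holder; the function name pass_token and its parameters are hypothetical and serve only this illustration.

```python
def pass_token(n, holder, wants):
    """Pass the token clockwise from `holder` until a process in `wants` gets it.

    Processes are numbered 0..n-1 round the ring. Returns the new holder and
    the number of token messages that were sent to reach it.
    """
    if not wants:
        # With no requester the token would simply keep circulating.
        raise ValueError("no process requires the token")
    messages = 0
    current = holder
    while True:
        current = (current + 1) % n   # forward the token to the clockwise neighbor
        messages += 1
        if current in wants:
            return current, messages  # this process retains the token


best = pass_token(5, 0, {1})    # requester is the next process on the ring
worst = pass_token(5, 0, {4})   # requester is the process just behind the holder
```

With five processes, the best case costs 1 message and the worst case 4, that is n-1, as stated in the text.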




Fig. 8.10 A ring of processes transferring a mutual exclusion token.

If the process holding the token fails, then an election is required to pick a unique process
from the surviving members, which will regenerate the token and transmit it as before.
Care has to be taken in ensuring that the failed process really has failed, and
does not later unexpectedly inject the old token into the ring, so that there are two tokens.
This situation can arise since process failure can only be ascertained by repeated failure
of the process to acknowledge messages sent to it.
8.5 Elections: An election is a procedure carried out to choose a process from a group,
for example to take over the role of a process that has failed. The main requirement is for
the choice of elected process to be unique, even if several processes call elections
concurrently.

8.5.1 The bully algorithm: The bully algorithm can be used when the members of the
group know the identities and addresses of the other members. The algorithm selects the
surviving member with the largest identifier to function as the coordinator. We assume
that communication is reliable, but that processes may fail during an election. The
algorithm proceeds as follows.
There are three types of messages in this algorithm.
An election message is sent to announce an election.
An answer message is sent in response to an election message.
A coordinator message is sent to announce the identity of the new coordinator.
A process begins an election when it notices that the coordinator has failed. To begin an
election, a process sends an election message to those processes that have a higher
identifier. It then awaits an answer message in response. If none arrives within a certain
time, the process considers itself the coordinator, and sends a coordinator message to all
processes with lower identifiers announcing this fact. Otherwise, the process waits a
further limited period for a coordinator message to arrive from the new coordinator. If
none arrives, it begins another election.
If a process receives a coordinator message, it records the identifier of the
coordinator contained within it, and treats that process as the coordinator. If a process
receives an election message, it sends back an answer message, and begins another
election unless it has begun one already.
When a failed process is restarted, it begins an election. If it has the highest
process identifier, then it will decide that it is the coordinator, and announce this to the
other processes. Thus it will become the coordinator, even though the current coordinator
is functioning. It is for this reason that the algorithm is called the bully algorithm. The
operation of the algorithm is shown in Fig 8.11. There are four processes p1-p4, and an
election is called when p1 detects the failure of the coordinator, p4, and announces an
election (stage 1 in the figure). On receiving an election message from p1, p2 and p3 send
answer messages to p1 and begin their own elections; p3 sends an answer message to p2,
but p3 receives no answer message from the failed process p4 (stage 2). It therefore decides
that it is the coordinator. But before it can send out the coordinator message, it too fails
(stage 3). When p1's timeout period expires (which we assume occurs before p2's timeout
expires), it notices the absence of a coordinator message and begins another election.
Eventually, p2 is elected coordinator (stage 4).
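The outcome of the elections in this example can be reproduced with a minimal Python sketch. It is a deliberately simplified model: timeouts and messages are replaced by a membership test on the set of processes that are up, and the function name bully_election and its parameters are assumptions made for the illustration.

```python
def bully_election(starter, alive):
    """Return the coordinator chosen when `starter` calls an election.

    `alive` is the set of identifiers of the processes that are currently up;
    `starter` is assumed to be a member of it. A failed process never answers,
    which stands in for the timeout of the real algorithm.
    """
    higher = [p for p in alive if p > starter]
    if not higher:
        # No answer message arrives: the starter declares itself coordinator.
        return starter
    # Every higher process that answered begins its own election. They all
    # reach the same result, so it is enough to follow one of them.
    return bully_election(min(higher), alive)


# p4 has failed and p1 calls an election among p1..p3.
before_p3_fails = bully_election(1, {1, 2, 3})
# p3 fails before announcing itself, so p1 calls another election.
after_p3_fails = bully_election(1, {1, 2})
```

The first call selects p3 and the second selects p2, matching stages 2 and 4 of Fig 8.11.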
8.5.2 A ring-based election algorithm: We give the algorithm of Chang and Roberts,
suitable for a collection of processes that are arranged in a logical ring (refer to Fig. 8.12).
We assume that the processes do not know the identities of the others a priori, and that
each process knows only how to communicate with its neighbor in, say, the clockwise
direction. The goal of this algorithm is to elect a single coordinator, which is the process
with the largest identifier. The algorithm assumes that all the processes remain functional
and reachable during its operation. Initially, every process is marked as a non-participant
in an election. Any process can begin an election. It proceeds by marking itself as a
participant, placing its identifier in an election message and sending it to its neighbor.
When a process receives an election message, it compares the identifier in the message
with its own. If the arrived identifier is the greater, then it forwards the message to its
neighbor. If the arrived identifier is smaller and the receiver is not a participant then it
substitutes its own identifier in the message and forwards it; but it does not forward the
message if it is already a participant. On forwarding an election message in any case, the
process marks itself as a participant. If, however, the received identifier is that of the
receiver itself, then this process's identifier must be the greatest, and it becomes the
coordinator. The coordinator marks itself as a non-participant once more and sends an
elected message to its neighbor, announcing its election.

Fig. 8.11 The bully algorithm: The election of coordinator p2, after the failure of p4
and then p3.



Fig.8.12 A Ring based Election in progress.


NOTE: The election was started by process 17. The highest process identifier
encountered so far is 24. Participant processes are shown darkened.
The point of marking processes as participant or non-participant is so that
messages arising when another process starts an election at the same time are extinguished as
soon as possible, and always before the winning election result has been announced.
If only a single process starts an election, then the worst case is when its
anticlockwise neighbor has the highest identifier. A total of n-1 messages are then
required to reach this neighbor, which will not announce its election until its identifier has
completed another circuit, taking a further n messages. The elected message is then sent
n times, making 3n-1 messages in all.
An example of a ring-based election in progress is shown in Fig. 8.12. The
election message currently contains 24, but process 28 will replace this with its identifier
when the message reaches it.
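The worst-case count of 3n-1 messages can be verified with a short Python simulation of a single election. The sketch assumes unique identifiers, exactly one process starting the election, and no failures; the function name ring_election is hypothetical, and the six-process ring used in the second call is an invented arrangement that merely contains the identifiers 17, 24 and 28 mentioned for Fig. 8.12.

```python
def ring_election(ids, start):
    """Simulate one ring-based election started by the process at index `start`.

    `ids` lists the unique process identifiers in clockwise order. Returns the
    elected identifier and the total number of messages sent, including the
    n elected messages that announce the result.
    """
    n = len(ids)
    participant = [False] * n
    participant[start] = True
    value = ids[start]            # identifier carried by the election message
    pos = start
    messages = 0
    while True:
        pos = (pos + 1) % n       # the message moves to the clockwise neighbor
        messages += 1
        if value == ids[pos]:
            break                 # own identifier came back: pos is the coordinator
        if value < ids[pos] and not participant[pos]:
            value = ids[pos]      # substitute the receiver's larger identifier
        # With a single starter the message is never discarded, so the
        # receiver always forwards it and becomes a participant.
        participant[pos] = True
    messages += n                 # the elected message travels once round the ring
    return ids[pos], messages


# Worst case: the starter's anticlockwise neighbor has the highest identifier.
worst_case = ring_election([1, 2, 3, 4], 0)
# Hypothetical ring containing the identifiers named in the text.
elected, _ = ring_election([17, 9, 24, 1, 28, 15], 0)
```

For n = 4 the worst case gives 11 messages, which equals 3n-1, and in the second ring the election started by 17 is won by 28.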
Check your Progress 1: Answer the following: -
1. Define Physical time and Logical time.
2. Define clock drift.
3. Define coordinated universal time.
4. List the design features of Network Time Protocol.
5. State Happened-before relation.
6. Define Distributed mutual exclusion principle.
7. Name a distributed algorithm that uses the concept of logical clocks.
8. What is the ring-based algorithm used for?
9. What are the three types of the messages used in bully algorithm?
10. What do you mean by distributed coordination?
Check your Progress 2: Answer the following: -
1. Explain the necessity of clock synchronization.
2. Describe Cristian's method for synchronizing clocks.
3. Explain the Berkeley algorithm.
4. What is the need for the concept of a logical clock? Explain the happened-before
relation.
5. Discuss and compare the different algorithms used for maintaining mutual
exclusion in distributed transactions.
6. Explain Election algorithms.
8.6 Summary:
In this unit we first discussed the importance of accurate timekeeping for
distributed systems. We then described algorithms for synchronizing clocks despite the
drift between them and the variability of message delays between computers.
The degree of synchronization accuracy that is practically obtainable fulfils many
requirements, but is nonetheless not sufficient to determine the ordering of an arbitrary
pair of events occurring at different computers. The happened-before relationship is a
partial order on events, which reflects a flow of information within a process, or via
messages between processes. Some algorithms require events to be
ordered in happened-before order, for example, successive updates made at separate
copies of data. Logical clocks are counters that are updated so as to reflect the
happened-before relationship between events.
The unit then described the need for processes to access shared resources under
conditions of mutual exclusion. Resource servers do not in all cases implement locks, and
a separate distributed mutual exclusion service is then required. Three algorithms were
considered which achieve mutual exclusion: the central server, a distributed algorithm
using logical clocks, and a ring-based algorithm. These are heavyweight mechanisms that
cannot withstand failure, although they can be modified to be fault-tolerant. On the
whole, it seems advisable to integrate locking with resource management.
Finally, the unit considered the bully algorithm and a ring-based algorithm whose
common aim is to elect a process uniquely from a given set even if several elections
take place concurrently. These algorithms could be used, for example, to elect a new
master timeserver, or a new lock server, when the previous one fails.


References

1. Stefano Ceri, Giuseppe Pelagatti, Distributed Databases: Principles & Systems, McGraw Hill International Editions, 1985.
2. Pradeep K. Sinha, Distributed Operating Systems: Concepts and Design, PHI, 1998.
3. George Coulouris, Jean Dollimore, Tim Kindberg, Distributed Systems: Concepts & Design, second edition, Addison Wesley, 2000.
4. Andrew S. Tanenbaum, Computer Networks, Third Edition, PHI, 1999.