
3/30/2014

BITS Pilani
Hyderabad Campus
CS ZG623: Advanced Operating Systems
Chittaranjan Hota, PhD
Dept. of Computer Sc. & Information Systems
hota@hyderabad.bits-pilani.ac.in
Second Semester 2013-2014
Course Overview

Introduction to Distributed Systems
Theoretical Foundations
Distributed Mutual Exclusion
Distributed Deadlock
Agreement Protocols
Distributed File Systems
Distributed Scheduling
Distributed Shared Memory
Recovery
Fault tolerance
Protection and Security

(Mid-Semester marks the split point in the syllabus)
Text and References

Text Book:
Advanced Concepts in Operating Systems,
Mukesh Singhal & Shivaratri, Tata McGraw-Hill

References:
1. Distributed Operating Systems: P. K. Sinha, PHI
2. Distributed Operating Systems: The Logical Design, A. Goscinski, AW
3. Modern Operating Systems: A.S. Tanenbaum & VAN, PHI
4. Distributed Systems: Concepts and Design, G. Coulouris, AW

Evaluation Components

EC No. | Evaluation Component & Type of Examination | Duration | Weightage | Day, Date, Session, Time
EC-1   | Assignment/Quiz                            | **       | 15%       | ** Details to be announced on LMS Taxila
EC-2   | Mid-Semester Test (Closed Book)*           | 2 Hours  | 35%       | Sunday, 16/02/2014 (AN)*, 2 PM - 4 PM
EC-3   | Comprehensive Exam (Open Book)*            | 3 Hours  | 50%       | Sunday, 06/04/2014 (AN)*, 2 PM - 5 PM
Growth of the Internet
Source: Internet World Stats
Source: Cisco VNI Global Forecast, 2011-2016

Leading Applications
Growth of Mobile world
Source: Traffic and Market data report, Ericsson, June 2012
Golden era in Computing
Powerful multi-core processors
General purpose graphic processors
Superior software methodologies
Virtualization leveraging the powerful hardware
Wider bandwidth for communication
Proliferation of devices
Explosion of domain applications
(Image: Tata Nano)
Source: Cloud Futures 2011, Redmond
Re-imagination of computing devices
Source: Internet Trends, Mary Meeker
Re-imagination of connectivity
Re-imagination of life stories
Re-imagination of recruiting/hiring
Re-imagination of Commerce
Re-imagination of Ticketing
Re-imagination of Meeting people (BITS Goa, BITS Pilani)
Re-imagination of Healthcare
Re-imagination of Teaching/Learning
Re-imagination of watching movies
Killer Applications for Distributed Systems
Source: Distributed computing, Kai Hwang

Hardware Concepts
(Figure 1.6: Different organizations and memories in distributed computer systems)
Uniprocessor Operating Systems
An OS acts as a resource manager
Manages CPU, I/O devices, and Memory
OS provides a virtual interface that is easier to use than
hardware
Structure of uniprocessor operating systems
Monolithic (e.g. MS-DOS, and early UNIX)
One large kernel that handles everything
Layered design (Kernel based UNIX)
Functionality is decomposed into N layers
Each layer N uses the services of layer N-1 and implements new service(s) for layer N+1
Virtual machine (e.g. VM/370)
Uniprocessor OS: Microkernel based
User level servers implement additional functionality; the kernel retains only the basics: setting device registers, CPU scheduling, manipulating the MMU, capturing hardware interrupts, etc.

Multicomputer Operating System (General structure, Figure 1.14)

Network Operating System (General structure, Figure 1-19)
What is a Distributed System?
A loosely coupled collection of autonomous computers connected by a network, running a distributed operating system, to produce a single integrated computing environment (a virtual computer).

Cluster of Cooperative Computers

Computational Grids
GARUDA from CDAC
P2P Overlays
(Figure: peers A-H arranged in a P2P overlay layer on top of the native IP layer, which spans autonomous systems AS1-AS6)
P2P Examples
Example P2P: BitTorrent (Source: wiki)
Another Example: SETI (Allen Telescope Array; Source: setiathome.berkeley.edu/)
Another Example: Seattle on Android (Source: http://boinc.berkeley.edu/)
Cloud Computing
Crowd Sourcing
Distributed Systems: Pros and Cons
Advantages
Communication and resource sharing possible
Economy: price-performance
Reliability & scalability
Potential for incremental growth
Disadvantages
Distribution-aware OSs and applications are needed
High network connectivity essential
Security and privacy concerns
Scalability Problems

Concept | Example
Centralized services | A single server for all users (medical records, bank account)
Centralized data | A single on-line telephone book
Centralized algorithms | Doing routing based on complete information
Scaling Technique (1): Hiding Communication Latency
If possible, use asynchronous communication
Not always possible if the client has nothing else to do
Alternatively, move part of the computation to the client (e.g., Java applets)

Scaling Technique (2): Distribution
Example: DNS name resolution
Scaling Technique (3): Replication
Copy information to increase availability and to balance the load
Example: P2P networks (Gnutella +) distribute copies uniformly or in proportion to use
Example: Gmail caches; the replication decision is made by the client
Issue: consistency of replicated information
More Design Issues
Lack of Global Knowledge
Naming
Compatibility
Process Synchronization
Resource Management
Security
Source: www.bbc.co.uk
Distributed System Models
Minicomputer model
Each computer supports many users
Local processing but can fetch remote data (files, databases, etc.)
Workstation model
Most of the work is done locally
Using a distributed file system, a user can access remote data
Processor pool model
Terminals are X-terminals or diskless terminals
A pool of backend processors handles the processing

Remote Procedure Calls (RPC)
RPC: Remote Procedure Call
Issues:
identifying and accessing the remote procedure
parameters
return value
Many types:
Sun RPC
Microsoft's DCOM
OMG's CORBA
Java RMI
XML/RPC
SOAP/.NET
AJAX (Asynchronous Javascript and XML)
SUN RPC
struct square_in {
long arg1;
};
struct square_out {
long res1;
};
program SQUARE_PROG {
version SQUARE_VERS {
square_out SQUAREPROC(square_in) = 1;
} = 1;
} = 0x13451111;
Rpcgen

Protocol Description (input file) -> rpcgen -> C source code: Client Stubs, XDR filters, header file, Server skeleton

Rpcgen continued
bash$ rpcgen square.x

produces:
square.h header
square_svc.c server stub
square_clnt.c client stub
square_xdr.c XDR conversion routines

Function names derived from IDL function names and
version numbers

Square Client: Client.c
#include "square.h"
#include <stdio.h>
int main (int argc, char **argv)
{
    CLIENT *cl;
    square_in in;
    square_out *out;
    if (argc != 3) { printf("client <localhost> <integer>\n"); exit(1); }
    cl = clnt_create(argv[1], SQUARE_PROG, SQUARE_VERS, "tcp");
    in.arg1 = atol(argv[2]);
    if ((out = squareproc_1(&in, cl)) == NULL)
        { printf("Error\n"); exit(1); }
    printf("Result %ld\n", out->res1);
    exit(0);
}


Square server: server.c
#include "square.h"
#include <stdio.h>
square_out *squareproc_1_svc (square_in *inp, struct svc_req *rqstp)
{
    static square_out outp;
    outp.res1 = inp->arg1 * inp->arg1;
    return (&outp);
}



Exe creation

gcc -o client client.c square_clnt.c square_xdr.c -lnsl

gcc -o server server.c square_svc.c square_xdr.c -lrpcsvc -lnsl
A Communication network
Example of layering
(Figure: a letter sealed in an addressed envelope by the sender and opened by the receiver, illustrating encapsulation across layers)
Topologies
Point-to-Point network

The Internet
(Figure: the Internet as a hierarchy of Tier 1 ISPs interconnected at NAPs, Tier-2 ISPs attached to them, and local/Tier 3 ISPs at the edge)
OSI Layers
7 Application
6 Presentation
5 Session
4 Transport
3 Network
2 Data-Link
1 Physical
High level protocols
Low level protocols
Headers
(Figure: as DATA descends the sender's Process, Transport, Network, and Data Link layers, each layer prepends its header H; the receiver's stack removes the headers in reverse order)
Packet journey
(Figure: a browser on the BITS network 68.80.2.0/24 queries a DNS server and reaches the web server 64.233.169.105 on Google's network 64.233.160.0/19, with the Tulip network 68.80.0.0/13 along the path)
Inherent limitations of a Distributed system

Absence of a global clock
Unpredictable message transmission delays
Drift from physical time
Absence of a shared memory
Difficult to maintain a coherent global state

Solution: Logical Clocks
Lamport's Logical clocks
The Happened-Before Relation (->)
Captures the behavior of underlying dependencies between events (causality)
Concurrent events (||)
Timestamps have no relation to physical time, hence the name Logical Clock
Implementation of Logical clocks

[IR1] Clock Ci is incremented between any two successive events in Pi:
if a -> b in Pi, then Ci(b) = Ci(a) + d (d > 0, usually d = 1)

[IR2] If event a is Pi sending a message m, then m is assigned a timestamp tm = Ci(a).
When the message is received by Pk, then Ck = max(Ck, tm + d) (d > 0, usually d = 1)
Total ordering
-> defines an irreflexive partial order among the events (irreflexive, antisymmetric, and transitive)
Total Ordering (=>) is antisymmetric, transitive, and total:
If a is any event in Pi and b is any event in Pk, then a => b iff
  Ci(a) < Ck(b), or
  Ci(a) = Ck(b) and Pi < Pk,
where < denotes any relation that totally orders the processes to break ties.
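A hypothetical comparator for this total order (illustrative; the tie-breaking relation is taken to be numeric process-id order):

```c
/* Total order on events as (timestamp, process id) pairs. */
#include <assert.h>

typedef struct { long c; int pid; } event_t;

/* Returns nonzero iff a => b in the total order. */
int before(event_t a, event_t b) {
    return a.c < b.c || (a.c == b.c && a.pid < b.pid);
}
```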
Limitations of Logical clocks

If a -> b then C(a) < C(b); but if C(a) < C(b), it is not necessarily true that a -> b when a and b occur in different processes.
So we cannot determine whether two events are causally related from their timestamps alone.
Vector Clocks
Definitions
n = number of processes
Each Pi has a clock Ci, an integer vector of length n
A timestamp vector Ci(a) is assigned to each event a
Ci[i] corresponds to Pi's own logical time
Ci[j], j != i, is Pi's best guess of the logical time at Pj
Implementation of Vector clocks

[IR1] Clock Ci is incremented between any two successive events in Pi:
Ci[i] := Ci[i] + d (d > 0)
Events include sending and receiving messages.

[IR2] If event a is Pi sending message m, then m is assigned tm = Ci(a).
When Pj receives m, Cj is updated:
for every k, Cj[k] := max(Cj[k], tm[k])

For every i and j, Ci[i] >= Cj[i].
Causal ordering of Messages
(Figure: space-time diagram of processes P1-P3; Send(M1) at P1 causally precedes Send(M2) at P2, so M1 should be delivered before M2 at the common destination)
Birman-Schiper-Stephenson (BSS) Protocol

Before broadcasting m, process Pi increments the vector time VTPi[i] and timestamps m.

When process Pj (j != i) receives m with timestamp VTm from Pi, it delays delivery until both:
1. VTPj[i] = VTm[i] - 1 (Pj has received all previous messages from Pi)
2. VTPj[k] >= VTm[k] for every k in {1, 2, ..., n} - {i} (Pj has received all messages that Pi had received before sending m)

When Pj delivers m, VTPj is updated by rule IR2.
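The two delivery conditions can be written as a single predicate (illustrative C, with a fixed number of processes):

```c
/* Sketch of the BSS delivery test: may Pj deliver a broadcast m from Pi
   carrying vector timestamp vtm? Names are illustrative. */
#include <assert.h>

#define N 3

int bss_deliverable(const long vtpj[N], const long vtm[N], int i) {
    if (vtpj[i] != vtm[i] - 1)          /* all earlier msgs from Pi seen */
        return 0;
    for (int k = 0; k < N; k++)
        if (k != i && vtpj[k] < vtm[k]) /* all msgs Pi saw before sending */
            return 0;
    return 1;
}
```

A message failing the test is buffered and the test is re-run as other messages arrive.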
Example of BSS
(Figure: P3 broadcasts a message timestamped (0,0,1) and P2 then broadcasts one timestamped (0,1,1); a process receiving the (0,1,1) message first buffers it and delivers it from the buffer only after the (0,0,1) message arrives)
Schiper-Eggli-Sandoz Algorithm

SES: no need for broadcast messages.
Each process maintains a vector V_P of size N - 1, where N is the number of processes in the system.
V_P is a vector of tuples (P, t): P is a destination process id and t a vector timestamp.
Tm: logical time of sending message m
Tpi: present logical time at Pi
Initially, V_P is empty.
SES Continued
Sending a Message:
Send message M, timestamped tm, along with V_P1 to P2.
Insert (P2, tm) into V_P1, overwriting the previous value of (P2, t), if any.
(P2, tm) is not sent with M itself; any future message carrying (P2, tm) in V_P1 cannot be delivered to P2 until tm < Tp2.

Delivering a message:
If V_M (in the message) does not contain any pair (P2, t), the message can be delivered.
Otherwise (P2, t) exists in V_M: if t is not < Tp2, buffer the message (don't deliver); else (t < Tp2) deliver it.

Example of SES
(Figure: three processes exchanging M1, M2, M3 with vector timestamps such as (0,1,0), (0,2,0), and (0,2,1); V_P2 starts empty, and entries such as (P1, <0,1,0>) in V_P2 and V_P3 force causally later messages destined for P1 to wait)
Global State: The Model
Node properties:
No shared memory
No global clock
Channel properties:
FIFO
loss free
non-duplicating
The Need for Global State
Many problems in Distributed Computing can be
cast as executing some action on reaching a
particular state
e.g.
-Distributed deadlock detection is finding a cycle in the
Wait For Graph.
-Termination detection
-Checkpointing
And many more..
Difficulties due to Non-Determinism
Deterministic Computation
- At any point in the computation there is at most one event that can happen next.
Non-Deterministic Computation
- At any point in the computation there can be more than one event that can happen next.

Example: Initial State
(Figure: two processes with a message m in transit)
Example continued
(Figures: successive snapshots as the message m is delivered and an internal event a occurs, ending in the deterministic state diagram)
Non-deterministic computation
(Figure: processes p, q, r with messages m1, m2, m3 in transit)

Three possible runs
(Figure: the three runs deliver m1, m2, m3 in different orders at different processes)

A Non-Deterministic Computation
All these states are feasible.
Global state Example
(Figure: three successive global states, Global State 1, 2, and 3)
Recording Global state

Suppose the global state of A is recorded in (1) and not in (2), while the states of B, C1, and C2 are recorded in (2).
An extra amount of $50 would then appear in the global state.
Reason: A's state was recorded before sending the message, but C1's state after sending it.

Let n be the number of messages sent by A along the channel before A's state was recorded, and n' the number of messages sent by A along the channel before the channel's state was recorded.
The global state is inconsistent if n < n'; a consistent global state requires n = n'.

Similarly, let m be the number of messages received along the channel before B's state was recorded, and m' the number received along the channel before the channel's state was recorded.
Consistency requires m = m'.
Also, n >= m, since the number of messages received along a channel cannot exceed the number sent; hence n' >= m'.
A consistent global state must satisfy these relations.

Consistent global state:
The channel state is the sequence of messages sent before recording the sender's state, excluding the messages received before the receiver's state was recorded.
Only in-transit messages are recorded in the channel state.
Notion of Consistency: Example
(Figure: processes p and q with local states Sp0..Sp3 and Sq0..Sq3, exchanging messages m1, m2, m3)
A Consistent State?
(Figure: the cut {Sp1, Sq1} on the p/q execution with messages m1, m2, m3)
Yes: this is a consistent state.
What about this?
(Figure: the cut {Sp2, Sq3})
Yes: this is also consistent.
What about this?
(Figure: the cut {Sp1, Sq3})
No: this cut is inconsistent.
Chandy-Lamport GSR algorithm

Sender (process p):
Record the state of p.
On each outgoing channel c incident on p, send a marker before sending any other message.

Receiver (q receives a marker on channel c1):
If q has not yet recorded its state:
  Record the state of q.
  Record the state of c1 as empty.
  On each outgoing channel c incident on q, send a marker before sending any other message.
If q has already recorded its state:
  Record the state of c1 as the sequence of messages received on c1 since the last time the state of q was recorded.
Uses of GSR
recording a consistent state of the global computation
checkpointing for fault tolerance (rollback, recovery)
testing and debugging
monitoring and auditing

detecting stable properties in a distributed system via
snapshots. A property is stable if, once it holds in a state,
it holds in all subsequent states.
termination
deadlock
garbage collection

State Recording Example

Let p transfer 100 to q, q transfer 50 to p and 30 to r, and let p record the global state.
(Figure: successive snapshots of the recorded balances, e.g. 400, 520, 530, with the in-transit amounts 50 and 30 captured as channel states)
Cut

A cut is a set of cut events, one per node, each of which captures the state of the node on which it occurs. It is also a graphical representation of a global state.
Consistent Cut



An inconsistent cut:
(Figure: e -> e', e' -> c2, but c2 -/-> c3)

A cut C = {c1, c2, c3, ...} is consistent if there are no events ei and ej such that:
(ei -> ej) and (ej -> cj) and (ei -/-> ci), where ci, cj are in C.
Ordering of Cut events

The cut events in a consistent cut are not causally related. Thus, a consistent cut is a set of concurrent events, and a set of concurrent events is a cut.

Note: in the inconsistent cut shown earlier, c3 -> c2.
Termination detection
Question:
In a distributed computation, when have all of the processes become idle (i.e., when has the computation terminated)?

Huang's algorithm

The computation starts when the controlling agent sends the first message and terminates when all processes are idle.

The role of weights:
- the controlling agent initially has a weight of 1 and all others have a weight of zero,
- when a process sends a message, a portion of the sender's weight is put in the message, reducing the sender's weight,
- a receiver adds the weight of a received message to its own weight,
- on becoming idle, a process sends its weight to the controlling agent,
- the sum of all weights in the system is always 1.

The computation has terminated when the weight of the controlling agent returns to 1 after the first message has been sent.
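A toy simulation of the weight bookkeeping (illustrative, not from the text; halving the weight on each send keeps the arithmetic exact in binary floating point):

```c
/* Sketch of Huang's weight-throwing termination detection. */
#include <assert.h>

#define MAXP 8

double weight[MAXP];    /* weight[0] is the controlling agent */

double send_msg(int from) {        /* give away half the sender's weight */
    double w = weight[from] / 2;
    weight[from] -= w;
    return w;                      /* weight carried by the message */
}

void recv_msg(int to, double w) { weight[to] += w; }

void become_idle(int p) {          /* return the process's weight */
    weight[0] += weight[p];
    weight[p] = 0;
}

int terminated(void) { return weight[0] == 1.0; }
```

The invariant is that the weights held by processes and in-transit messages always sum to 1.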
Distributed Mutual Exclusion

What is mutual exclusion?
1. simultaneous update and read of a directory
2. can you allow two or more processors to update a file simultaneously?
3. can two processes send their data to a printer at the same time?

So, it is exclusive access to a shared resource or to the critical region.
In a uniprocessor system, mutual exclusion is achieved with semaphores, monitors, etc.
An algorithm for implementing mutual exclusion must satisfy:
- Mutual exclusion
- No starvation
- Freedom from deadlock
- Fault tolerance

To handle mutual exclusion in a distributed system:
- Centralized Approach
- Distributed Approach
- Token-Passing Approach
All use message passing rather than shared variables.

Performance of DME Algorithms
Performance of each algorithm is measured in terms of:
- number of messages required per CS invocation
- synchronization delay (between one site leaving the CS and the next entering)
- response time (from request arrival and sending out requests to CS entry and exit)
System throughput = 1/(sd + E), where sd is the synchronization delay and E the CS execution time.

A Centralized Algorithm
A Central controller with a queue for deferring replies.
Request, Reply, and Release messages.
Reliability and Performance bottleneck.
Lamport's DME

Requesting the critical section:
1. When a site Si wants to enter the CS, it sends a REQUEST(tsi, i) message to all the sites in its request set Ri and places the request on request_queue_i.
2. When a site Sj receives the REQUEST(tsi, i) message from site Si, it returns a timestamped REPLY message to Si and places Si's request on request_queue_j.

Executing the critical section:
Site Si enters the CS when both of the following conditions hold:
1. Si has received a message with timestamp larger than (tsi, i) from all other sites.
2. Si's request is at the top of request_queue_i.
3/30/2014
55
Continued

Releasing the critical section:
1. Site Si, upon exiting the CS, removes its request from the top of its request queue and sends a timestamped RELEASE message to all the sites in its request set.
2. When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.
3. When a site removes a request from its request queue, its own request may come to the top of the queue, enabling it to enter the CS. The algorithm executes CS requests in increasing order of timestamps.

Correctness
Suppose that both Si and Sj were in CS at the same time (t).
Then we have:







Ricart-Agrawala DME
Requesting Site:
A requesting site Pi sends a message request(ts,i) to all
sites.
Receiving Site:
Upon reception of a request(ts,i) message, the receiving
site Pj will immediately send a timestamped reply(ts,j)
message if and only if:
Pj is not requesting or executing the critical section
OR
Pj is requesting the critical section but sent a request
with a higher timestamp than the timestamp of Pi
Otherwise, Pj will defer the reply message.

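The reply-or-defer decision can be captured in one predicate (an illustrative sketch; ties are broken by site id, as in the total ordering of timestamps):

```c
/* Sketch of Pj's decision on receiving request(ts, i) in Ricart-Agrawala. */
#include <assert.h>

typedef struct {
    int requesting;   /* is Pj itself requesting the CS? */
    int executing;    /* is Pj in the CS? */
    long ts;          /* timestamp of Pj's own request, if any */
    int id;           /* Pj's site id */
} site_t;

/* Returns nonzero iff Pj replies immediately; zero means defer. */
int reply_now(const site_t *pj, long req_ts, int req_id) {
    if (pj->executing) return 0;
    if (!pj->requesting) return 1;
    /* both requesting: the lower (ts, id) pair has priority */
    if (req_ts < pj->ts || (req_ts == pj->ts && req_id < pj->id)) return 1;
    return 0;
}
```

Deferred replies are sent when Pj exits the CS, which is what guarantees mutual exclusion.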

Maekawa's DME
A site requests permission only from a subset of sites.

Request sets of sites Si and Sj: Ri and Rj are constructed so that they have at least one common site Sk. Sk mediates conflicts between Ri and Rj.

A site can send only one REPLY message at a time, i.e., a site can send a REPLY message only after receiving a RELEASE message for the previous REPLY message.
Request Set

Maekawa's request sets with N = 13:
R1  = {1, 2, 3, 4}     R2  = {2, 5, 8, 11}    R3  = {3, 6, 8, 13}
R4  = {4, 6, 10, 11}   R5  = {1, 5, 6, 7}     R6  = {2, 6, 9, 12}
R7  = {2, 7, 10, 13}   R8  = {1, 8, 9, 10}    R9  = {3, 7, 9, 11}
R10 = {3, 5, 10, 12}   R11 = {1, 11, 12, 13}  R12 = {4, 7, 8, 12}
R13 = {4, 5, 9, 13}
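These sets can be checked mechanically for Maekawa's key property: any two request sets intersect, so some common site Sk can arbitrate between any pair of requesters.

```c
/* Verify pairwise intersection of the N = 13 request sets above. */
#include <assert.h>

#define N 13
#define K 4

static const int R[N][K] = {
    {1, 2, 3, 4},   {2, 5, 8, 11},  {3, 6, 8, 13},   {4, 6, 10, 11},
    {1, 5, 6, 7},   {2, 6, 9, 12},  {2, 7, 10, 13},  {1, 8, 9, 10},
    {3, 7, 9, 11},  {3, 5, 10, 12}, {1, 11, 12, 13}, {4, 7, 8, 12},
    {4, 5, 9, 13}
};

int intersects(int i, int j) {
    for (int a = 0; a < K; a++)
        for (int b = 0; b < K; b++)
            if (R[i][a] == R[j][b]) return 1;
    return 0;
}

int all_pairs_intersect(void) {
    for (int i = 0; i < N; i++)
        for (int j = i + 1; j < N; j++)
            if (!intersects(i, j)) return 0;
    return 1;
}
```

Note also that each site appears in exactly K = 4 sets and each set has size 4, i.e., load and responsibility are balanced.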
Maekawas DME Algo
Requesting the critical section
1. A site Si requests access to the CS by sending
REQUEST(i) messages to all the sites in its request set Ri.

2. When a site Sj receives the REQUEST(i) message, it sends a REPLY(j) message to Si provided it hasn't sent a REPLY message to any site since it received the last RELEASE message. Otherwise, it queues up the REQUEST for later consideration.
Executing the critical section
1. Site Si accesses the CS only after receiving REPLY
messages from all the sites in Ri .


Continued
Releasing the critical section
1. After the execution of the CS is over, site Si sends
RELEASE(i) message to all the sites in Ri .

2. When a site Sj receives a RELEASE(i) message from
site Si , it sends a REPLY message to the next site waiting
in the queue and deletes that entry from the queue. If the
queue is empty, then the site updates its state to reflect
that the site has not sent out any REPLY message.
Deadlocks
Deadlock handling
FAILED
A FAILED message from site Si to site Sj indicates that Si cannot grant Sj's request because it has currently granted permission to a site with a higher priority request, so that Sj does not keep waiting for a REPLY to arrive.
INQUIRE
An INQUIRE message from Si to Sj indicates that Si would like to find out from Sj whether it has succeeded in locking all the sites in its request set.
YIELD
A YIELD message from site Si to Sj indicates that Si is returning its permission to Sj (to yield to a higher priority request at Sj).

Token based DME algorithms
A site enters CS if it possesses the token (only
one token for the System).
The major difference is the way the token is
searched
Use sequence numbers instead of timestamps
Used to distinguish requests from same site
Keep advancing independently at each site
The proof of mutual exclusion is trivial
Suzuki-Kasami broadcast DME

Each site Si keeps an array RNi[1..n]: RNi[j] is the largest sequence number received so far in a REQUEST from site Sj.

The token contains a queue Q of requesting sites and an array LN[1..n], where LN[j] is the sequence number of the request that site Sj executed most recently.
The DME algorithm

Requesting the critical section:
1. If the requesting site Si does not have the token, it increments its sequence number RNi[i] and sends a REQUEST(i, sn) message to all other sites (sn is the updated value of RNi[i]).
2. When a site Sj receives this message, it sets RNj[i] to max(RNj[i], sn). If Sj has the idle token, it sends the token to Si if RNj[i] = LN[i] + 1.

Executing the critical section:
3. Site Si executes the CS when it has received the token.

Releasing the critical section. Having finished the execution of the CS, site Si takes the following actions:
4. It sets the LN[i] element of the token array equal to RNi[i].
5. For every site Sj whose ID is not in the token queue, it appends Sj's ID to the token queue if RNi[j] = LN[j] + 1.
6. If the token queue is nonempty after the above update, it deletes the top site ID from the queue and sends the token to the site indicated by that ID.
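Steps 4-6 of the release phase can be sketched in C (illustrative; the token is modeled as a struct holding LN and a FIFO queue):

```c
/* Sketch of the Suzuki-Kasami release phase at the token holder Si. */
#include <assert.h>

#define N 4

typedef struct {
    long LN[N];          /* LN[j]: last request of Sj that was satisfied */
    int  q[2 * N];       /* FIFO queue of waiting site ids */
    int  head, tail;
} token_t;

static int in_queue(const token_t *t, int j) {
    for (int k = t->head; k < t->tail; k++)
        if (t->q[k] == j) return 1;
    return 0;
}

/* Returns the id of the next token holder, or -1 to keep the idle token. */
int release(token_t *t, int i, const long RN[N]) {
    t->LN[i] = RN[i];                          /* step 4 */
    for (int j = 0; j < N; j++)                /* step 5 */
        if (j != i && !in_queue(t, j) && RN[j] == t->LN[j] + 1)
            t->q[t->tail++] = j;
    if (t->head < t->tail)                     /* step 6 */
        return t->q[t->head++];
    return -1;
}
```

The test `RN[j] == LN[j] + 1` is what distinguishes an outstanding request from a stale (already satisfied) one.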
Analysis of the DME algorithm

Correctness
Mutual exclusion is trivial (only the token holder enters the CS).
Theorem: A requesting site enters the CS in finite time.
Proof: A request enters the token queue in finite time. The queue is in FIFO order, and there can be at most N-1 sites ahead of the request.

Performance
0 or N messages per CS invocation.
Synchronization delay: 0 or T (T = average message delay).
Raymond's Tree-Based Algorithm
The Algo

Requesting the critical section:
1. When a site wants to enter the CS, it sends a REQUEST message to the next node along the directed path to the root, provided it does not hold the token and its request_q is empty. It then adds its request to its request_q.
2. When a site on the path receives this message, it places the REQUEST in its request_q and sends a REQUEST message along the directed path to the root, provided it has not already sent out a REQUEST message on its outgoing edge.
3. When the root site receives a REQUEST message, it sends the token to the site from which it received the REQUEST message and sets its holder variable to point at that site.
4. When a site receives the token, it deletes the top entry from its request_q, sends the token to the site indicated in this entry, and sets its holder variable to point at that site. If the request_q is nonempty at this point, the site sends a REQUEST message to the site pointed at by the holder variable.

Continued

Executing the critical section
1. A site enters the CS when it receives the token and
its own entry is at the top of its request_q. In this case,
the site deletes the top entry from its request_q and
enters the CS.

Releasing the critical section
1. After exiting the CS, if the site's request_q is nonempty, it deletes the top entry from its request_q, sends the token to that site, and sets its holder variable to point at that site.
2. If the request_q is still nonempty at this point, the site sends a REQUEST message to the site pointed at by the holder variable.

Example
Analysis

Proof of Correctness
Mutual exclusion is trivial (only the token holder enters the CS).
Finite waiting: all the requests in the system form a FIFO queue and the token is passed in that order.

Performance
O(log N) messages per CS invocation (the average distance between two nodes in the tree).
Synchronization delay: (T log N)/2, since the average distance between two sites is (log N)/2.
Deadlocks in Distributed System

What is a deadlock?
One or more processes waiting indefinitely for resources to be released by other waiting processes.

Can occur on h/w or s/w resources, but most often seen in distributed databases (Lock & Unlock):

time | T1      | T2
  |  | lock(x) |
  |  |         | lock(y)
  |  | lock(y) |
  v  |         | lock(x)
Types of Deadlock
Communication Deadlock
Resource Deadlock

Four Conditions : Mutual Exclusion, Hold & Wait,
No Preemption, and Circular Wait.
Ways to Handle Deadlocks
-Prevention
-Avoidance
-Detection and resolution
-Ignorance
Control Framework
Centralized control

Distributed Control

Hierarchical Control

False Deadlocks
(Figure: processes P0, P1, P2 and resources R, S, T split across System A and System B; delayed updates can make a cycle appear that no longer exists)
Cycle Vs Knot

The AND model of requests requires all resources currently being requested to be granted to un-block a computation.
A cycle is a sufficient condition to declare a deadlock with this model.

The OR model of requests allows a computation making multiple different resource requests to un-block as soon as any one is granted.
A cycle is a necessary condition.
A knot is a sufficient condition.
A knot is a strongly connected subgraph of a directed graph such that, starting from any node in the subgraph, it is impossible to leave the knot by following the edges of the graph.
Example
Detection Requirements
Progress

Safety


Individual sites maintain local WFGs
Nodes for local processes
Node Pex represents external processes
Deadlock detection:
If a site Si finds a cycle that does not involve Pex, it has found a
deadlock
If a site Si finds a cycle that does involve Pex, there is the
possibility of a deadlock
It sends a message containing its detected cycle to the sites
involved in Pex
If site Sj receives such a message, it updates its local WFG graph,
and searches it for a cycle
If Sj finds a cycle that does not involve its Pex, it has found a deadlock
If Sj finds a cycle that does involve its Pex, it sends out a message


Obermarck's Algorithm
Example
Consider each elementary cycle containing EX. For each such cycle EX -> T1 -> ... -> Tn -> EX, compare T1 with Tn. If T1 > Tn, send the cycle to each site where an agent of Tn is waiting to receive a message from the agent of Tn at this site.
Edge-Chasing: Chandy-Misra-Haas
Some processes wait for local resources
Some processes wait for resources on other
machines
Algorithm invoked when a process has to wait for
a resource
Uses local WFGs to detect local deadlocks and
probes to determine the existence of global
deadlocks.
Chandy-Misra-Haas's Algorithm

Sending the probe:
if Pi is locally dependent on itself then declare deadlock
else for all Pj and Pk such that
  (a) Pi is locally dependent upon Pj, and
  (b) Pj is waiting on Pk, and
  (c) Pj and Pk are on different sites, send probe(i, j, k) to the home site of Pk.

Receiving the probe(i, j, k):
if (d) Pk is blocked, and
   (e) dependent_k(i) is false, and
   (f) Pk has not replied to all requests of Pj,
then begin
  dependent_k(i) := true;
  if k = i then Pi is deadlocked
  else for all Pm and Pn such that
    (a) Pk is locally dependent upon Pm, and
    (b) Pm is waiting on Pn, and
    (c) Pm and Pn are on different sites, send probe(i, m, n) to the home site of Pn.
end.
Example
C-M-H Algorithm: Another Example
[Figure: a wait-for graph over processes P1 through P7 on several sites; probes such as probe(1,3,4) and probe(1,7,1) travel along the inter-site edges.]
Advantages
1. Popular; variants of it are used in locking
schemes.
2. Easy to implement, as each message is of
fixed length and requires few computational
steps.
3. No graph construction or information
collection is needed
4. False deadlocks are not detected
5. Does not require any particular structure among
processes
Disadvantages
Two or more processes may independently
detect the same deadlock; hence, while
resolving it, several processes may be aborted.
Even though a process detects a deadlock, it
does not know the full cycle
Exchanges M(n-1)/2 messages, where M = no. of processes,
n = no. of sites.
Centralized Control
Simple conceptually:
Each node reports to the master detection node
The master detection node builds and analyzes the
WFG from the request-resource and release-resource
messages it receives
The master detection node manages resolution when a
deadlock is detected
Same serious problems:
Single point of failure
Network congestion issues
False deadlock detection
Continued
There are different ways by which each node may send
its WFG status to the coordinator:
Whenever an arc is added/deleted at a site, a message
is sent
Periodically, every node can send the list of arcs changed
since the last update
The coordinator can ask for the information when required
None of these work well
False deadlocks
False deadlocks are produced because of
incomplete/delayed information
Example: a process releases a resource at one site and requests
another elsewhere; if the release message reaches the coordinator
after the request message, the coordinator sees a cycle that no
longer exists and declares a false deadlock.
A solution to false deadlock is Lamport's global time.
The Ho-Ramamoorthy Centralized Algorithms
Two phase (can be for AND or OR model)
Each site has a status table of locked and waited resources
The control site will periodically ask for this table from each node
The control node will search for cycles and, if found, will request the
table again from each node
Only the information common in both reports will be analyzed for
confirmation of a cycle

One phase (can be for AND or OR model)
Each site keeps 2 tables; process status and resource status
The control site will periodically ask for these tables (both together in
a single message) from each node
The control site will then build and analyze the WFG, looking for
cycles and resolving them when found

One-phase is faster than two-phase
One-phase requires fewer messages but more storage (i.e., 2 tables).
Distributed Control
Responsibility is shared by all sites
Not vulnerable to a single point of failure
Network congestion is less of an issue
Deadlock detection starts only when there is a
suspicion of a deadlock
Difficult to design, because there is no shared memory
Several processes may be involved in detecting a
single/same deadlock
Types:
Path-pushing, edge-chasing, diffusion-computation
and global-state-detection
Path-pushing: Obermarck's algorithm
Individual sites maintain local WFGs
Nodes for local processes
Node Pex represents external processes
Deadlock detection:
If a site Si finds a cycle that does not involve Pex, it has found a
deadlock
If a site Si finds a cycle that does involve Pex, there is the possibility of
a deadlock
It sends a message containing its detected cycle to the sites
involved in Pex
If site Sj receives such a message, it updates its local WFG graph,
and searches it for a cycle
If Sj finds a cycle that does not involve its Pex, it has found a
deadlock
If Sj finds a cycle that does involve its Pex, it sends out a
message to other dependent sites with this path (cycle
involving Pex)

Continued
Situation at node x:
[Figure: local WFG at node x over P1, P2, P3 and P4 plus the external node Pex. There is no local deadlock. Some process outside node x waits for a resource currently owned by P3, and some process outside node x holds a resource P4 is waiting for. Already a deadlock?]
Obermarck's Example
Consider each elementary cycle containing Pex. For each such cycle
Pex → T1 → . . . → Tn → Pex, compare T1 with Tn. If T1 > Tn, send the cycle to
each site where an agent of Tn is waiting to receive a message from the
agent of Tn at this site.
Edge-Chasing: Chandy-Misra-Haas
Some processes wait for local resources
Some processes wait for resources on other
machines
Algorithm invoked when a process has to wait
for a resource
Uses local WFGs to detect local deadlocks and
probes to determine the existence of global
deadlocks.

Edge-Chasing: Chandy-Misra-Haas
Sending the probe:
if Pi is locally dependent on itself then deadlock,
else for all Pj and Pk such that
(a) Pi is locally dependent upon Pj, and
(b) Pj is waiting on Pk, and
(c) Pj and Pk are on different sites, send probe(i,j,k) to the home
site of Pk.
Receiving the probe:
if (d) Pk is blocked, and
(e) dependentk(i) is false, and
(f) Pk has not replied to all requests of Pj,
then begin
dependentk(i) := true;
if (k == i) then Pi is deadlocked
else ...
for all Pm and Pn such that
(a) Pk is locally dependent upon Pm, and
(b) Pm is waiting on Pn, and
(c) Pm and Pn are on different sites, send probe(i,m,n)
to the home site of Pn.
end.
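The probe rules above can be sketched as a small simulation. This is an illustrative sketch, not the textbook code: the wait-for graph, the site map, and the in-memory message queue are assumptions standing in for real sites and channels, and each wait edge in the example crosses sites (one process per site), so the initiator sends probe(i, i, k) directly.

```python
from collections import deque

def cmh_detect(initiator, wfg, site, blocked):
    """Edge-chasing detection: True iff a probe(i, j, k) started by
    `initiator` comes back to it, i.e. k == i on receipt."""
    dependent = {p: set() for p in wfg}          # dependent_k(i) flags
    # initiator sends a probe along each cross-site wait edge
    net = deque(('probe', initiator, initiator, k)
                for k in wfg[initiator] if site[k] != site[initiator])
    while net:
        _, i, j, k = net.popleft()               # probe arrives at Pk's site
        if blocked[k] and i not in dependent[k]:
            dependent[k].add(i)
            if k == i:
                return True                      # probe returned: deadlock
            for n in wfg[k]:                     # forward along further
                if site[n] != site[k]:           # cross-site wait edges
                    net.append(('probe', i, k, n))
    return False

# Example: P1 -> P2 -> P3 -> P1, one process per site
wfg = {1: [2], 2: [3], 3: [1]}
site = {1: 'A', 2: 'B', 3: 'C'}
blocked = {1: True, 2: True, 3: True}
```

Breaking the edge out of P3 (so the cycle is open) makes the probe die out and the function return False.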
C-M-H Example1
C-M-H Example2
[Figure: processes P1 through P7 across several sites; probes probe(1,1,4), probe(1,2,6), probe(1,3,4) and probe(1,7,1) travel along the inter-site wait edges, and probe(1,7,1) arriving back at P1 declares the deadlock.]
C-M-H Advantages
1. Popular; variants of it are used in locking
schemes.
2. Easy to implement, as each message is of
fixed length and requires few computational
steps.
3. No graph construction or information
collection is needed
4. False deadlocks are not detected
5. Does not require any particular structure among
processes
C-M-H disadvantages
Two or more processes may independently
detect the same deadlock; hence, while
resolving it, several processes may be aborted.
Even though a process detects a deadlock, it
does not know the full cycle
Exchanges M(n-1)/2 messages, where M = no. of processes,
n = no. of sites.
Diffusion Computation: C-M-H
Initiation by a blocked process Pi:
send query(i,i,j) to all processes Pj in the dependent set DSi of Pi;
num(i) := |DSi|; waiti(i) := true;

Blocked process Pk receiving query(i,j,k):
if this is an engaging query for process Pk /* first query from Pi */
then send query(i,k,m) to all Pm in DSk;
numk(i) := |DSk|; waitk(i) := true;
else if waitk(i) then send a reply(i,k,j) to Pj.

Process Pk receiving reply(i,j,k)
if waitk(i) then
numk(i) := numk(i) - 1;
if numk(i) = 0 then
if i = k then declare a deadlock.
else send reply(i, k, m) to Pm, which sent the engaging query.
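A runnable sketch of the query/reply rules above (OR model). The dependent sets `ds`, the `blocked` map, and the in-memory message queue are illustrative assumptions; real deployments exchange these messages between sites.

```python
from collections import deque

def diffusion_detect(init, ds, blocked):
    """True iff the diffusion computation started by blocked `init`
    drives num_init(init) to 0, i.e. declares a deadlock."""
    num = {p: 0 for p in ds}
    wait = {p: False for p in ds}
    engager = {}
    num[init], wait[init] = len(ds[init]), True
    net = deque(('query', init, init, j) for j in ds[init])
    while net:
        kind, i, j, k = net.popleft()            # message from Pj arrives at Pk
        if kind == 'query':
            if not blocked[k]:
                continue                         # active process: no reply
            if not wait[k]:                      # engaging query
                wait[k], engager[k] = True, j
                num[k] = len(ds[k])
                net.extend(('query', i, k, m) for m in ds[k])
            else:                                # already engaged: reply at once
                net.append(('reply', i, k, j))
        elif wait[k]:                            # reply(i, j, k) arrives at Pk
            num[k] -= 1
            if num[k] == 0:
                if k == i:
                    return True                  # all queries answered: deadlock
                net.append(('reply', i, k, engager[k]))
    return False

ds = {1: [2], 2: [3], 3: [1]}                    # OR-model dependent sets
```

With all three processes blocked the computation terminates and P1 declares a deadlock; unblocking P3 breaks the echo and no deadlock is declared.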
Diffusion Computation Example
Hierarchical Deadlock Detection
Master Control Node
Level 1 Control Node
Level 2 Control Node
Level 3 Control Node
Menasce-Muntz Algorithm Example
Ho-Ramamoorthy Hierarchical Algorithm
[Figure: a central site coordinating several control sites, each responsible for a cluster of nodes.]
Persistence & Resolution
Deadlock persistence:
Average time a deadlock exists before it is
resolved.
Deadlock resolution:
Aborting at least one process/request involved in
the deadlock.
Efficient resolution of deadlock requires
knowledge of all processes and resources.
If every process detects a deadlock and tries to
resolve it independently, the result is highly inefficient:
several processes might be aborted.
Agreement Protocols
Have you ever wondered why vendors of (distributed)
software solutions only ever offer solutions that
promise 95% reliability or 97% reliability, but never
100% reliability?

The fault does not lie with Microsoft, Google or
Yahoo

The fault lies in the impossibility of consensus
What happened at Byzantium?
May 29th, 1453: the Turks are besieging the city of
Byzantium by making a coordinated attack.

Goals
Consensus between loyal generals
A small number of traitors cannot cause the loyal
generals to adopt a bad plan
Do not have to identify the traitors
Agreement Protocol: The System model
There are n processors in the system and at most
m of them can be faulty
The processors can directly communicate with
other processors via messages (fully connected
system)
A receiver computation always knows the
identity of a sending computation
The communication system is reliable
Communication Requirements
Synchronous communication model is assumed in this
section:
Healthy processors receive, process and reply to
messages in a lockstep manner
The receive, process, reply sequence is called a
round
In the synch-comm model, processes know what
messages they expect to receive in a round
The synch model is critical to agreement protocols, and
the agreement problem is not solvable in an
asynchronous system
Processor Failures
Crash fault
Abrupt halt, never resumes operation
Omission fault
Processor omits to send required messages
to some other processors
Malicious fault
Processor behaves randomly and arbitrarily
Known as Byzantine faults
Message Types
Authenticated messages (also called signed
messages)
assure the receiver of correct identification of
the sender

Non-authenticated messages (also called oral
messages)
are subject to intermediate manipulation
may lie about their origin
Agreement Problems
Problem                  Who initiates the value   Final agreement
Byzantine Agreement      One processor             Single value
Consensus                All processors            Single value
Interactive Consistency  All processors            A vector of values
BA: Impossibility condition
Theorem: There is no algorithm to solve Byzantine
agreement if only oral messages are used, unless more
than two-thirds of the generals are loyal.

In other words, agreement is impossible if n <= 3f for n
processes, f of which are faulty

Oral messages are under the control of the sender:
the sender can alter a message that it received before
forwarding it

Let's look at examples for the special case of n=3, f=1

Case 1
Traitor lieutenant tries to foil consensus by refusing to
participate.
(white hats == loyal or good guys; black hats == traitor or bad guys)
[Figure: loyal Commanding General 1 sends R to Lieutenants 2 and 3; L2 is the traitor.]
Round 1: Commanding General sends Retreat.
Round 2: L3 sends Retreat to L2, but L2 sends nothing.
Decide: L3 decides Retreat. The loyal lieutenant obeys
the commander. (good)
Acknowledgement: Class lectures of A.D. Brown of UofT
Case 2a
Traitor lieutenant tries to foil consensus by lying about the
order sent by the general.
[Figure: loyal Commanding General 1 sends R to both lieutenants; traitor L2 relays A instead.]
Round 1: Commanding General sends Retreat.
Round 2: L3 sends Retreat to L2; L2 sends Attack to L3.
Decide: L3 decides Retreat. The loyal lieutenant obeys
the commander. (good)
Acknowledgement: Class lectures of A.D. Brown of UofT
Case 2b
Traitor lieutenant tries to foil consensus by lying about the
order sent by the general.
[Figure: loyal Commanding General 1 sends A to both lieutenants; traitor L2 relays R instead.]
Round 1: Commanding General sends Attack.
Round 2: L3 sends Attack to L2; L2 sends Retreat to L3.
Decide: L3 decides Retreat. The loyal lieutenant disobeys
the commander. (bad)
Acknowledgement: Class lectures of A.D. Brown of UofT
Case 3
Traitor General tries to foil consensus by sending
different orders to the loyal lieutenants.
[Figure: traitor Commanding General 1 sends A to L2 and R to L3.]
Round 1: General sends Attack to L2 and Retreat to L3.
Round 2: L3 sends Retreat to L2; L2 sends Attack to L3.
Decide: L2 decides Attack and L3 decides Retreat.
The loyal lieutenants obey the commander (good) but
decide differently. (bad)
Acknowledgement: Class lectures of A.D. Brown of UofT
Oral Message Algorithm
The Oral Message algorithm OM(m) consists of m+1 phases
Algorithm OM(0) is the base case (no faults):
1) Commander sends a value to every lieutenant
2) Each lieutenant uses the value received from the commander, or the default
"retreat" if no value was received
Recursive algorithm OM(m) handles up to m faults:
1) Commander sends a value to every lieutenant
2) For each lieutenant i, let v_i be the value i received from the commander,
or "retreat" if no value was received. Lieutenant i acts as commander
and runs Alg OM(m-1) to send v_i to each of the n-2 other lieutenants
3) For each i, and each j != i, let v_j be the value lieutenant i
received from lieutenant j in step (2) (using Alg OM(m-1)), or else
"retreat" if no such value was received. Lieutenant i uses the value
majority(v_1, ..., v_{n-1}) to compute the agreed-upon value.
Example OM with m=1 and n=4
[Figure: loyal Commander 1 sends v to L2, L3 and L4; traitor L2 relays arbitrary values x and y.]
Step 1: Commander sends the same value, v, to all.
Step 2: Each of L2, L3, L4 executes OM(0) as commander,
but L2 sends arbitrary values.
Step 3: Decide. L3 has {v,v,x} and L4 has {v,v,y};
both choose v.
Continued, with the Commander being Byzantine
[Figure: traitor Commander 1 sends x to L2, y to L3 and z to L4; the loyal lieutenants relay what they received.]
Step 1: Commander sends a different value, x, y, z, to each.
Step 2: Each of L2, L3, L4 executes OM(0) as commander,
sending the value it received.
Step 3: Decide. L2, L3 and L4 each have {x,y,z}:
all loyal lieutenants get the same result.
Stages in OM
[Figure: the message tree of OM. In stage 0 the commander sends n-1 messages; in stage 1 each lieutenant relays to n-2 others, giving (n-1)(n-2) messages; and so on down to stage m.]
Complexity of OM: O(n^m)
OM(m) triggers n-1 invocations of OM(m-1)
OM(m-1) triggers n-2 invocations of OM(m-2)
...
OM(m-k) will be invoked (n-1)(n-2)...(n-k) times
...down to OM(0)
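The stage counts give a simple recurrence for the total number of messages, which grows as O(n^m). A quick check (the function name is mine):

```python
def om_messages(n, m):
    """Total messages sent by OM(m) with n processors:
    the commander sends n-1, then each of the n-1 lieutenants
    runs OM(m-1) among the remaining n-1 processors."""
    if m == 0:
        return n - 1
    return (n - 1) + (n - 1) * om_messages(n - 1, m - 1)
```

For n=4, m=1 this gives 3 + 3*2 = 9 messages, matching stage 0 (n-1 = 3) plus stage 1 ((n-1)(n-2) = 6).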
Interactive Consistency Model
Applications of BA
Building fault tolerant distributed services
Hardware Clock Synchronization in presence
of faulty nodes
Distributed commit in databases



Distributed File Systems
A File?
Named Object
Sequence of data items together with a set of attributes
Purpose of Use
Permanent storage of information
Large amount of information
Sharing of information
A File system?
Subsystem of an OS that performs file management
activities (or the OS programming interface to disk storage)
Why DFS?
Data sharing of multiple users
User mobility
Location transparency
Location independence
Backups and centralized management

Not all DFS are the same:
High-speed network DFS vs. low-speed network DFS
DFS Features
A distributed file system is
Used in a distributed environment
Complex, because users and storage devices are
physically dispersed.
Features
Remote information sharing
Access transparency
Location transparency
User mobility
Availability
Replication
Diskless workstations
Data access in a DFS
Methods of building a DFS
Mounting
Binding of different file systems to form a single hierarchical one
A name space can be mounted at an internal or leaf node, called
a mount point.
The kernel maintains a structure called the mount table, which maps
mount points to the appropriate storage devices.
Can be done at the client end or at the server end

Caching
Used for reducing the frequency of access to file servers
Exploits temporal locality of reference
A copy of the data stored remotely is brought and kept in the client
cache or on local disk, in anticipation of future use.
Employed at both the client (Sun NFS) and server (Sprite) sides
Helps in reducing the delays in accessing data and, of course,
network latency.
Continued
Hints:
Although caching reduces access time by providing a local copy,
the major problem is enforcing consistency.
The problem arises because different clients may be performing
different operations on their cached data.
So the file servers and clients must coordinate
Alternatively, cached data can be treated as hints, i.e., not
guaranteed to be completely accurate.
Bulk Data Transfer:
Multiple consecutive data blocks are transferred at once
Reduces the communication protocol overhead (including buffering,
assembly/disassembly, etc.)
File access overheads (seek time, etc.) are reduced
Encryption
For secure data transmission, session keys are established with the help
of key-exchange protocols.
DFS Design Issues
Components
File Service
Directory service
Continued
Caching on disk or memory


Writing policies


Write-through, delayed-write (write-back), and write-on-close policies
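The writing policies can be contrasted with a toy client cache. This is an illustrative sketch, not any particular DFS's code: the server is just a dict, and the delayed writes here are flushed only on close (i.e., write-on-close); a real delayed-write policy would also flush periodically.

```python
class CachedFile:
    """Toy client cache contrasting write-through with delayed write."""
    def __init__(self, server, write_through=True):
        self.server, self.write_through = server, write_through
        self.cache, self.dirty = {}, set()

    def write(self, block, data):
        self.cache[block] = data
        if self.write_through:
            self.server[block] = data    # every write goes to the server
        else:
            self.dirty.add(block)        # deferred until flush/close

    def close(self):
        for b in self.dirty:             # write-on-close / delayed flush
            self.server[b] = self.cache[b]
        self.dirty.clear()
```

The trade-off named above falls out directly: write-through keeps the server current at the cost of one RPC per write; delayed writing batches (and can even elide) writes, but the server lags until the flush.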
Cache Consistency
Continued
Stateful Vs Stateless
Scalability
Client-Server design
Client caching
Server initiated cache invalidation
Do not bother about read-only files
Clients serving as servers for few
clients
Structure of server


DFS Case Studies: Google FS
Goals: Performance, Scalability, Reliability, and Availability
Metadata: namespace, access control, mapping of files to chunks, and current location of chunks.
[Figure: HDFS architecture. Clients send metadata ops to the Namenode, which holds the metadata (name, replicas, ...) and issues block ops to the Datanodes; reads and writes of blocks go directly between clients and Datanodes, and blocks are replicated across racks (Rack 1, Rack 2).]
Source: http://hadoop.apache.org/
Hadoop clusters
File read
[Figure: anatomy of an HDFS file read. On the client node, the HDFS client (1) opens the file via DistributedFileSystem, which (2) gets the block locations from the NameNode; the client then (3) reads through FSDataInputStream, which (4, 5) reads the blocks from the DataNodes in turn, and finally (6) closes the stream.]
Rack awareness
[Figure: HDFS network distance between nodes. d = 0 for the same node, d = 2 for another node on the same rack, d = 4 for a node on a different rack in the same data center, and d = 6 for a node in a different data center.]
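The distances in the figure follow from treating the topology as a tree (data center / rack / node) and counting hops to the closest common ancestor. A small sketch; the tuple encoding of a node's location is an assumption for illustration.

```python
def distance(a, b):
    """Network distance between two nodes given as
    (datacenter, rack, node) tuples: 0 same node, 2 same rack,
    4 same data center, 6 different data centers."""
    for depth, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return 2 * (len(a) - depth)
    return 0
```

HDFS uses this kind of distance both to place replicas and to serve reads from the closest copy.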
HDFS Write
[Figure: anatomy of an HDFS file write. The HDFS client (1) creates the file via DistributedFileSystem, which (2) records the create at the NameNode; the client (3) writes through FSDataOutputStream, which (4) streams write packets down a pipeline of DataNodes and (5) receives ack packets back up the pipeline; the client then (6) closes the stream and (7) the NameNode is told the write is complete.]
Currently the de facto standard for LANs
A Unix computer has an NFS client and a server module in its OS kernel
Developed by Sun Microsystems (in 1985)


SUN Network File System
[Figure: NFS architecture. On both the client and server computers, application programs issue UNIX kernel system calls through a virtual file system layer; operations on local files go to the UNIX file system, while operations on remote files go to the NFS client module, which talks to the NFS server module on the server over the network via RPC. Non-UNIX (e.g., PC DOS) clients can implement the same protocol.]
read(fh, offset, count) -> attr, data
write(fh, offset, count, data) -> attr
create(dirfh, name, attr) -> newfh, attr
remove(dirfh, name) -> status
getattr(fh) -> attr
setattr(fh, attr) -> attr
lookup(dirfh, name) -> fh, attr
rename(dirfh, name, todirfh, toname)
link(newdirfh, newname, dirfh, name)
readdir(dirfh, cookie, count) -> entries
symlink(newdirfh, newname, string) -> status
readlink(fh) -> string
mkdir(dirfh, name, attr) -> newfh, attr
rmdir(dirfh, name) -> status
statfs(fh) -> fsstats

RPC Interface
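The operations above compose into the usual client sequence: look up a name to get a file handle, then read in chunks until the size reported in the attributes is reached. A sketch against a stand-in server object; `FakeServer` and its file layout are assumptions for illustration, not part of NFS.

```python
class FakeServer:
    """In-memory stand-in implementing just lookup and read."""
    def __init__(self, files):
        self.files = files                             # name -> bytes

    def lookup(self, dirfh, name):
        return name, {'size': len(self.files[name])}   # -> fh, attr

    def read(self, fh, offset, count):
        data = self.files[fh][offset:offset + count]
        return {'size': len(self.files[fh])}, data     # -> attr, data

def read_file(server, root_fh, name, chunk=8192):
    """Read a whole remote file using lookup + repeated read calls."""
    fh, attr = server.lookup(root_fh, name)
    data, offset = b'', 0
    while offset < attr['size']:
        _, part = server.read(fh, offset, chunk)
        data += part
        offset += len(part)
    return data
```

Because every call carries the file handle and offset, the sequence is stateless on the server side, which is the design point the NFS interface is built around.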
Reading in NFS V3 and V4
NFS V3
NFS V4
Mounting from single server
Mounting from multiple servers
Auto-mounting in NFS
Security
Caching in NFS
Server caching does nothing to reduce RPC traffic between client
and server
The NFS client module caches the results of read, write, getattr,
lookup and readdir operations (blocks, translations, attributes)
Synchronization of file contents (one-copy semantics) is not
guaranteed when two or more clients are sharing the same file
(delayed writing policy).
Data is cached on demand.
A file is flushed on close
Sprite DFS
Another Example
Remote Link
Caching is done on both the client and server side.
The client cache stores recently accessed blocks indexed by file token
No need to map to disk blocks; less communication with the server.
Cache block size is 4K
Directories are not cached, to avoid inconsistency
The server cache also stores file blocks.

A delayed writing policy is used, because 20 to 30% of new data
is deleted within 30 seconds and 90% of files are open for less
than 10 seconds.

A client's cache dynamically adapts to the changing demands of the
machine's virtual memory system and the file system.
Sprite continued
Cache consistency is server-initiated.
Concurrent write sharing is avoided:
When a concurrent write request is received, the server instructs the
current writer to flush its data, commits it, and then declares the file
uncachable to *all* clients.
All requests now flow through the server, which serializes
them.
About 1% traffic overhead
Sequential write sharing is handled using versions;
the version number is incremented upon write.
The server also keeps track of the last client doing a write, and
requests that client to flush its data when the next file open
arrives from some other client.
Cache consistency
CODA File System
CMU
Disconnected Operation
The X-kernel logical File System
The X-kernel logical File System
Distributed Scheduling
Grid computing
Distributed computing across networks using open
standards supporting heterogeneous resources - IBM
[Source: IBM TJ Watson Research Center]
Linux virtual server (load balancer)
WINDOWS also has load balancing features
Examples
Simple Linux Utility for Resource
Management (SLURM)







Tivoli Loadleveler from IBM



SLURM is an open-source resource
manager designed for Linux clusters of all
sizes. It provides three key functions. First it
allocates exclusive and/or non-exclusive
access to resources (computer nodes) to
users for some duration of time so they can
perform work. Second, it provides a
framework for starting, executing, and
monitoring work (typically a parallel job) on a
set of allocated nodes. Finally, it arbitrates
contention for resources by managing a
queue of pending work.
When jobs are submitted to LoadLeveler, they are not
necessarily executed in the order of submission.
Instead, LoadLeveler dispatches jobs based on their
priority, resource requirements and special instructions;
for example, administrators can specify that long-
running jobs run only on off-hours, that short-running
jobs be scheduled around long-running jobs or that jobs
belonging to certain users or groups get priority. In
addition, the resources themselves can be tightly
controlled: use of individual machines can be limited to
specific times, users or job classes or LoadLeveler can
use machines only when the keyboard and mouse are
inactive.
Fundamentals
Why is it required?
- Resource sharing
- Performance improvement

What is a distributed scheduler?
- Transparently and judiciously redistributes the load
- Useful in a LAN

Motivation
- Tasks arrive randomly and CPU service times are also random
- Applicable to both homogeneous and heterogeneous systems

Load sharing in Homogeneous systems
Load distribution is useful in heterogeneous and
homogeneous systems alike.
Livny & Melman model: for N identical, independent servers,
each idle with probability P0,
P(a task is waiting and a server is idle)
= 1 - (1 - P0)^N (1 - P0^N) - P0^N (2 - P0)^N

Load Index
CPU Queue length
The current value is not always the right representative if task
transfer involves significant delays
CPU Utilization
Classification of load distributing algorithms
Static - a priori knowledge is used
Dynamic - current system state is used
Adaptive - the algorithm itself changes at runtime
Load balancing vs. load sharing
Preemptive vs. non-preemptive
Issues in load distribution
Transfer Policy
Determines when a node is in a state to participate in a task transfer
Typically a threshold policy
Selection Policy
Newly arrived tasks are the ones chosen for transfer
Goal: response time / execution time improvement
Issues
Transfer overhead should be minimal
Location-dependent calls must be executed before
transfer

Components
Continued
Location Policy
Polling
Broadcasting a query
Information Policy
When, where and what
Demand Driven
Periodic
State-change driven

Stability
Queuing-theoretic perspective
The sum of the load due to tasks and due to distribution
must be less than the system capacity, else
unbounded queues can build up
A stable algorithm can still give worse
performance than not using it at all
Algorithmic perspective
Processor thrashing

Sender Initiated Load sharing
Initiated by a sender when its load > T
Eager, Lazowska and Zahorjan studied three
algorithms that differ in their location policy

Transfer Policy
CPU queue length threshold

Selection Policy
Newly arrived tasks are transferred
Location Policy
Random
No remote state information is collected by the sender
Useless task transfers are possible
Thrashing
Solution: the number of transfers per task is fixed
Threshold
Polling is used to avoid useless transfers
A node is randomly selected and then polled to find out if it is a receiver
A poll limit bounds the number of polls
During a searching session, a node is polled only once by a
sender

Continued
Shortest
Chooses the best receiver for a task
A number of nodes (up to the poll limit) are selected and polled
to find the shortest queue length
Once a node is selected, the task is executed there irrespective
of its queue length at the time of actual receipt
Information Policy
Demand-driven
Continued
Stability: unstable at high system loads
When all nodes are highly loaded, the probability that a sender
finds a receiver is very low.
Polling increases with the rate at which new tasks originate
May cause instability
The algorithm is initiated from an under-loaded node, i.e., a
receiver

Transfer Policy
Threshold policy on CPU queue length
Triggered when a task departs

Selection Policy
New tasks or partially completed ones
Response time or execution time reduction

Receiver Initiated
Continued
Sender vs receiver initiated
Recap: Local and Global
scheduling
Recap: Load distribution in homogeneous envt.
Recap: Sender-initiated algo.
Task arrives: if QueueLength + 1 > T, set Poll-set = Nil and repeat:
select node i randomly; if i is not already in the Poll-set, add it
(Poll-set = Poll-set U {i}) and poll node i;
if QueueLength at i < T, transfer the task to i;
otherwise, if the number of polls < PollLimit, poll another node,
else queue the task locally.
If QueueLength + 1 <= T, the task is queued locally.
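The flowchart reduces to a few lines. A sketch with illustrative T and PollLimit values; `queues` maps node id to CPU queue length, standing in for the poll replies.

```python
import random

def on_task_arrival(local_qlen, queues, T=2, poll_limit=3, rng=random):
    """Sender-initiated threshold policy: return the node to transfer
    the newly arrived task to, or None to queue it locally."""
    if local_qlen + 1 <= T:
        return None                          # node is not a sender
    poll_set = set()
    while len(poll_set) < min(poll_limit, len(queues)):
        i = rng.choice([n for n in queues if n not in poll_set])
        poll_set.add(i)                      # each node polled at most once
        if queues[i] < T:                    # poll reply: i is a receiver
            return i
    return None                              # poll limit reached
```

Note how the poll limit bounds the work done per arrival, which is exactly what keeps the policy cheap at moderate loads and what makes it thrash at high loads, when no poll finds a receiver.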
Recap: Receiver-initiated algo.
Task departs at node j: if QueueLength < T, set Poll-set = Nil and repeat:
select node i randomly; if i is not already in the Poll-set, add it
(Poll-set = Poll-set U {i}) and poll node i;
if QueueLength at i > T, transfer a task from i to j;
otherwise, if the number of polls < PollLimit, poll another node,
else wait for a predetermined period before retrying.
Combination of sender-initiated and receiver-initiated policies
At low system loads, the sender-initiated component
easily finds a receiver
At high system loads, the receiver-initiated
component easily finds a sender
Above-Average Algorithm:
Krueger & Finkel
The load at each node is maintained within an acceptable
range.

Symmetrically Initiated Load
sharing
Transfer Policy
A threshold policy that uses two adaptive thresholds,
equidistant from the node's estimate of the average load
across all the nodes
Lower threshold (LT) and upper threshold (UT)
Load < LT: the node is a receiver
Load > UT: the node is a sender
Continued
Continued
[Figure: message exchange in the above-average algorithm. A node whose load is too high broadcasts TooHigh and awaits an Accept; a node whose load is too low responds with Accept, increments its expected load, and waits (AwaitingTask) for the transferred task. If the sender receives no Accept before its timeout, it raises its load-average estimate by ChangeAverage. Source: Nguyen, Khoi]
Receiver-Initiated Component
Upon receiving a TooHigh message, a receiver sends an Accept
and sets an AwaitingTask timeout
When the TooLow timeout expires without any TooHigh message
having been received, the receiver decreases its load-average
estimate by ChangeAverage.
Selection Policy
New tasks or partially executed ones
Response time or execution time reduction

Continued
Information Policy
Demand-driven
Each node computes the average system load individually
Stable symmetrically initiated algorithm
Each node maintains a data structure comprising
senders, receivers and OK lists
Initially each node assumes every other node is a
receiver, so its OK and senders lists are empty

Transfer Policy
Uses LT and UT

Adaptive Load sharing
Sender initiated component
[Figure: sender-initiated component. The sender polls a node from its receivers list; the polled node replies with its current status, and each side moves the other's ID to the appropriate senders, receivers or OK list; if the polled node is still a receiver, the task is transferred. Source: Nguyen, Khoi]
Receiver initiated component
[Figure: receiver-initiated component. The receiver polls nodes from its senders list; a polled node that is still a sender transfers a task and informs the receiver of its after-transfer status; otherwise it informs its current status, and both sides update their senders/receivers/OK lists accordingly. Source: Nguyen, Khoi]
Selection Policy
Sender-initiated component: new tasks
Receiver-initiated component: new tasks or partially executed ones,
for response/execution time reduction
Information Policy
Demand-driven
Discussion:
At high system loads, the sender-initiated component is deactivated.
At low system loads, receiver-initiated polling generally fails, but it
keeps the receivers lists accurate, so future sender-initiated
transfers benefit.
Continued
Two desirable properties:
It does not cause instability
Load sharing is due to non-preemptive transfers
(which are cheaper) only.
This algorithm uses the sender-initiated load
sharing component of the stable symmetrically
initiated algorithm as-is, but has a modified
receiver-initiated component to attract future
non-preemptive task transfers from sender
nodes.

Stable adaptive sender initiated
The data structure (at each node) of the stable
symmetrically initiated algorithm is augmented by an
array called the state vector.
The state vector is used by each node to keep track of
which list (senders, receivers, or OK) it belongs to at all
the other nodes in the system.
When a sender polls a selected node, the sender's
state vector is updated to reflect that the sender now
belongs to the senders list at the selected node; the
polled node updates its state vector, based on the reply it
sent, to reflect which list it will belong to at the sender.

Continued
Continued
The receiver initiated component is replaced by the
following protocol:
When a node becomes a receiver, it informs all the nodes that
are misinformed about its current state. The misinformed nodes
are those nodes whose receivers lists do not contain the
receivers ID.
The statevector at the receiver is then updated to reflect that it
now belongs to the receivers list at all those nodes that were
informed of its current state.
By this technique, the algorithm avoids receivers having to broadcast messages to inform other nodes that they are receivers.
No preemptive transfers of partly executed tasks here.
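The statevector bookkeeping above can be sketched as follows. This is a minimal illustration with hypothetical names (Node, become_receiver are not from the text): a node whose load drops informs only the peers that are misinformed about its state, instead of broadcasting.

```python
# Sketch of the statevector idea: each node tracks which list (senders,
# receivers, or OK) it believes it occupies at every other node, and a new
# receiver notifies only the misinformed peers.

SENDERS, RECEIVERS, OK = "senders", "receivers", "ok"

class Node:
    def __init__(self, node_id, all_ids):
        self.id = node_id
        # Lists this node keeps about every other node (all start as OK).
        self.lists = {i: OK for i in all_ids if i != node_id}
        # state_vector[i]: which list this node believes it is on at node i.
        self.state_vector = {i: OK for i in all_ids if i != node_id}

def become_receiver(node, nodes):
    """Node's load drops: notify only the misinformed nodes."""
    notified = []
    for peer_id, believed in node.state_vector.items():
        if believed != RECEIVERS:        # peer's receivers list lacks us
            nodes[peer_id].lists[node.id] = RECEIVERS  # inform the peer
            node.state_vector[peer_id] = RECEIVERS     # remember we did
            notified.append(peer_id)
    return notified

nodes = {i: Node(i, [0, 1, 2]) for i in [0, 1, 2]}
first = become_receiver(nodes[0], nodes)   # both peers were misinformed
second = become_receiver(nodes[0], nodes)  # now nobody needs a message
print(first, second)
```

Note how the second call sends no messages at all, which is exactly the saving the algorithm aims for.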
Selecting a suitable load
sharing algorithm
System with low load -> sender-initiated
System with high load -> receiver-initiated
System with wide load fluctuation -> stable symmetrically initiated
System with wide load fluctuation and high cost for migration of partly executed tasks -> stable sender-initiated (adaptive)
System with heterogeneous work arrival -> adaptive stable algorithms
Comparison
Better than sender-initiated but unstable at high-load
Continued
ADSYM (the adaptive stable symmetrically initiated algorithm) performs very well.
Requirements for load sharing
Scalability
Quick scheduling decisions with minimum overhead
Location Transparency
Determinism
Same result irrespective of where the task executes
Preemption
A workstation's owner must not get degraded performance on his return
Heterogeneity
Able to distinguish among different architectures, processors of different capabilities, machines with special bandwidth, etc.
Case studies
The V-System
State change driven information policy
Selection policy - new tasks are selected
Transfer policy
A node is a receiver if it is among the N most lightly loaded nodes.
Location Policy
When a task arrives, the local cache is consulted to find a suitable node; a node is selected at random and polled to verify whether it can really act as the receiver.

Continued
Rarely are as many as three polls required.
The publishing scheme has advantages over direct queries:
Publishing of state information depends on the rate at which state changes.
With direct queries, the number of polls is determined by the number of jobs.
Load Index
CPU utilization (measured by a background process with a counter)
Sprite scheduler
Information policy (state change driven)
Location Policy
Centralized
Selection Policy
Tasks to be scheduled or run remotely are decided by the user (sender).
The sender contacts the central coordinator to learn the receiver.
When the owner of a workstation wants to use it, foreign jobs are evicted and sent back to their original workstation.
Transfer Policy
A sender can be designated manually by the user - Sender (semi-automated)
A sender is a node from which a job is evicted - Sender (semi-automated)
A workstation that has seen no user activity for 30 seconds and has fewer active tasks than processors - Receiver
Condor distributed scheduler
Long-running, CPU-intensive tasks (background jobs only)
Transfer and selection policies
Same as Sprite, but centralized selection is used
The user submits big jobs to a central coordinator
The central coordinator polls workstations every 2 minutes (a workstation is idle if the user has done no work for 12.5 minutes)
When the owner of the workstation remains active for more than 5 minutes, the foreign job is evicted/transferred to the originating node
Stealth distributed scheduler
Exploits the resources of workstations that are under-utilized by their owners
Prioritized local resource allocation
Foreign tasks are executed even if the owners of those workstations are active
Priority is higher for the owner's work
It has a prioritized CPU scheduler, prioritized virtual memory, and a prioritized file cache
Guarantees that owner processes get whatever resources they need; the remainder is given to the foreign processes
Replaces the preemptive transfer (as in Sprite and Condor) by a cheap operation, i.e., prioritized local allocation.
Task migration
Preemptive transfers are required in the following situations:
In receiver-initiated task transfers
In the workstation model, where foreign jobs are preempted
To avoid starvation
Task placement vs task migration
Placement moves a task that is yet to begin; migration moves one that has already started.
Task migration requirements
State transfer
Registers, stack, process state, VM address space, file descriptors, buffered messages, current working directory, signal mask and handlers, resource usage statistics, etc.
Unfreeze
State Transfer
The transfer of the task's state, including information such as registers, stack, ready/blocked status, virtual memory address space, file descriptors, buffered messages, etc., to the new machine.
The task is frozen at some point during the transfer so that the state does not change further.
Unfreeze
The task is installed at the new machine and is put
in the ready queue so that it can continue executing
there.
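The freeze / state-transfer / unfreeze sequence can be sketched as a toy simulation. All names here (Task, Machine, migrate) are illustrative, not from any real migration facility:

```python
# Toy sketch of task migration: freeze the task so its state stops
# changing, copy the state to the destination machine, and put the
# installed copy on the destination's ready queue (unfreeze).

from dataclasses import dataclass, field

@dataclass
class Task:
    pc: int = 0
    stack: list = field(default_factory=list)
    frozen: bool = False

class Machine:
    def __init__(self, name):
        self.name = name
        self.ready_queue = []

def migrate(task, dst):
    task.frozen = True                                   # freeze
    state = {"pc": task.pc, "stack": list(task.stack)}   # state transfer
    moved = Task(**state)                # install a copy at the new host
    dst.ready_queue.append(moved)        # unfreeze: ready to run there
    return moved

src, dst = Machine("A"), Machine("B")
t = Task(pc=42, stack=[1, 2])
m = migrate(t, dst)
print(m.pc, m.frozen, len(dst.ready_queue))
```

A real mechanism would of course also transfer the address space, file descriptors, and buffered messages listed above.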
Steps in Task migration
Issues in Task migration
State Transfer
Location Transparency
Structure of a Migration Mechanism
Performance
State transfer
The Cost
Obtaining & transferring the state, and unfreezing the task.
Residual Dependencies
Refers to the amount of resources a former host of a
preempted or migrated task continues to dedicate to
service requests from the migrated task.
Implementations
The V-System
Attempts to reduce the freezing time of a migrating task
by pre-copying the state.
Sprite
Makes use of the location-transparent file access
mechanism provided by its file system
All the modified pages of the migrating task are swapped to a file server.
Location transparency
Location transparency in principle requires that names
(e.g. process names, file names) be independent of their
location (i.e. host names).
Any operation (such as signaling) or communication that
was possible before the migration of a task should be
possible after its migration.
Example SPRITE Location Transparency
Mechanisms
A location-transparent distributed file system is
provided
The entire state of the migrating task is made
available at the new host, and therefore, any kernel
calls made will be local at new host.
Location-dependent information, such as the current host of a task, is maintained at the task's home machine.
Structure of Migration mechanism
Issues involved in Migration Mechanisms
Decision whether to separate the policy-making
modules from mechanism modules
Where the policy and mechanisms should reside
The migration mechanism may best fit inside the
kernel
Policy modules decide whether a task transfer
should occur, this can be placed in the kernel as
well
Interplay between the task migration mechanism and
various other mechanisms
The mechanisms can be designed to be independent of one another, so that if one mechanism's protocol changes, the others need not.
Performance
Comparing the performance of task migration
mechanisms implemented in different systems is a
difficult task, because of the different:
Hardware
Sprite consists of a collection of SPARCstation 1
Operating systems
IPC mechanism
File systems
Policy mechanisms
What is shared memory?
[Figure: a single-chip computer (one CPU plus memory in a chip package) and its hypothetical extension, a shared-memory multiprocessor in which CPU1-CPU4 share a single memory.]
Distributed Shared Memory: Motivation
SMP systems
Run parts of a program in parallel
Share single address space
Use threads for parallelism
Use synchronization primitives to ensure mutual
exclusion

Can we achieve this with multi-computers?
All communication and synchronization must be done
with messages
Each tiger has its own feeding trough, whereas in the other case a group of tigers eats from a single trough.
Each student looks at his/her own notebook, whereas in the second case all the students look at one blackboard.
Examples
Bus based Multi-processors
[Figure: a multiprocessor with several CPUs and a memory on a single bus, and a multiprocessor with caching, in which each CPU has a cache between it and the bus.]
Write once protocol
(a) CPU A reads word W (value 1) into its cache; the copy is marked CLEAN and memory is correct.
(b) CPU B also reads W; both A's and B's copies are CLEAN and memory is still correct.
Continued
(c) A writes value 2: B's copy is invalidated; A's copy becomes DIRTY and there is no update to memory.
(d) A writes W again (value 3). This and subsequent writes by A are done locally, without any bus traffic; still no update to memory.
Continued
(e) C acquires W for writing: the owner A supplies the block (value 3), A's and B's copies become INVALID, and C's copy is DIRTY; there is still no update to memory.
NUMA: Cm*
[Figure: a Cm* cluster consists of CPU-memory-I/O modules on a local bus; a microprogrammed MMU on each module satisfies references to local memory directly and forwards remote references over the intercluster bus.]
Uniform Memory Access (UMA)
Comparison of shared memory systems
Single-bus multiprocessor (Sequent, Firefly): transfer unit = cache block
Switched multiprocessor (Dash, Alewife): transfer unit = cache block
NUMA machine (Cm*, Butterfly): transfer unit = page
Page-based DSM (Ivy, Mirage): transfer unit = page
Shared-variable DSM (Munin, Midway): transfer unit = data structure
Object-based DSM (Linda, Orca): transfer unit = object
The spectrum runs from tightly coupled to loosely coupled: the multiprocessors use remote access in hardware with hardware-controlled caching (managed by the MMU), while the DSM systems use remote access in software with software-controlled caching (managed by the OS for page-based DSM, and by the language runtime system for shared-variable and object-based DSM).
Distributed shared memory
[Figure: in a bus-based multiprocessor each CPU's cache and the shared memory sit on one bus; in a DSM, page X resides in the memory of one node and is fetched by other nodes over the network.]
Software DSM
Continued
DSM: logical view
The DSM system hides the remote communication mechanism from the application programmer, making programming easier and ensuring portability.
The basic model of DSM
[Source: Harandi, J. Hou, Gupta]
[Figure: a shared address space of pages 0-9 is mapped across the local memories of P1, P2, and P3. When a process references page 9 held at another node, the page is transferred to the faulting node (page transfer); if page 9 is read-only, a replicated copy can instead be kept at both nodes (read-only replicated page).]
Another example
Advantages and Disadvantages
Source: Kshemkalyani and Singhal
Implementing DSMs
Issues:
1) Keeping track of location of remote data.
2) Reducing the Communication delays.
3) Minimal execution of communication
protocols.
4) Making shared data concurrently
accessible at several nodes.

Central server algorithm
Clients send data access requests to the central server, which holds all the shared data.
The server is similar to a monitor consisting of data structures and procedures that provide mutually exclusive access to those data structures.
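The central-server idea can be sketched in a few lines. This is an illustrative single-process sketch (the names `shared` and `server_handle` are made up), ignoring the messaging layer:

```python
# Minimal sketch of the central-server DSM algorithm: all shared data
# lives at one server, and the server serializes every read and write,
# like a monitor guarding a data structure.

shared = {}   # the server's data store

def server_handle(op, addr, value=None):
    """Serialize all accesses at the central server."""
    if op == "write":
        shared[addr] = value
        return "ack"
    return shared.get(addr)   # op == "read"

# Two clients access the same address through the server.
server_handle("write", "x", 10)    # client 1 writes x = 10
print(server_handle("read", "x"))  # client 2 reads x
```

The obvious drawback, which motivates the migration and replication algorithms that follow, is that every access pays a round trip to one node.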
The Migration algorithm
Data is shipped (migrated) to the location of the data access request.
Only one node can access a shared data item at a time.
Exploits locality of reference; susceptible to thrashing.
To locate a page, either use a server that keeps track of pages, or broadcast a query.
The Read-replication algo
Extends the migration algorithm by replicating data blocks at various sites, allowing either read-only access by multiple nodes or read/write access by a single node.
Multiple nodes can read the data concurrently.
Writing is expensive: all other copies must be invalidated or updated to maintain consistency.
Continued
The DSM keeps track of the location of all copies of data blocks.
In IVY, the owner of a page keeps track of the location of all its copies.
The Full replication algorithm
An extension of read replication: multiple nodes have both read and write access to shared data blocks, so many nodes can write concurrently and consistency becomes the prime concern.
A gap-free sequencer can be used:
All nodes wishing to write send their modifications to the sequencer.
The sequencer assigns a sequence number and multicasts the modification, with its sequence number, to the nodes holding a copy.
Each node processes the modifications in sequence-number order.
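The sequencer scheme above can be sketched as follows. The classes are illustrative, and the multicast is simulated by hand-delivering stamped modifications to a replica:

```python
# Sketch of full replication with a gap-free sequencer: writes are stamped
# with consecutive sequence numbers, and each replica applies them strictly
# in sequence-number order, buffering any that arrive early.

import heapq

class Sequencer:
    def __init__(self):
        self.next_seq = 0
    def submit(self, mod):
        seq = self.next_seq
        self.next_seq += 1
        return (seq, mod)          # multicast (seq, mod) to all replicas

class Replica:
    def __init__(self):
        self.expected = 0          # next sequence number to apply
        self.pending = []          # out-of-order arrivals wait here
        self.data = {}
    def deliver(self, stamped):
        heapq.heappush(self.pending, stamped)
        while self.pending and self.pending[0][0] == self.expected:
            _, (addr, value) = heapq.heappop(self.pending)
            self.data[addr] = value
            self.expected += 1

seq = Sequencer()
m1 = seq.submit(("x", 1))
m2 = seq.submit(("x", 2))
r = Replica()
r.deliver(m2)   # arrives early: buffered, not applied
r.deliver(m1)   # gap filled: both applied, in order
print(r.data["x"], r.expected)
```

Because every replica applies the same total order, all copies converge to the same final values.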
Continued
Memory coherence
Deals with the consistency of DSM memory while it is being used.

An example:

Process 1:                         Process 2:
    br = b;                            a = a + 1;
    ar = a;                            b = b + 1;
    if (ar >= br) then print("OK");

Informally, a memory is coherent if the value returned by a read is always the value the programmer expected.
Different models
Strict consistency
Strict Consistency (one-copy semantics)
Any read to a memory location x returns the value
stored by the most recent write operation to x.
Examples:

1. A system to provide sports fans with up-to-the-minute (not up-to-the-second) scores for cricket matches (cricinfo.com)?

2. Strictly consistent:        Not strictly consistent:
   P1: W(x)1                   P1: W(x)1
   P2:        R(x)1            P2:        R(x)0  R(x)1
Sequential consistency
Continued
Example: is the combination ar = 0, br = 1 possible under sequential consistency?

Process 1:                         Process 2:
    br = b;                            a = a + 1;
    ar = a;                            b = b + 1;
    if (ar >= br) then print("OK");
Continued
Causal consistency
Writes that are potentially causally related must be seen by
all processes in the same order. Concurrent writes may be
seen in a different order on different machines.

Example 1:

P1: W(x)1                       W(x)3
P2:   R(x)1  W(x)2
P3:   R(x)1          R(x)3  R(x)2
P4:   R(x)1          R(x)2  R(x)3

W(x)2 and W(x)3 are concurrent writes, so P3 and P4 may see them in different orders: this sequence obeys causal consistency.
Continued
P1: W(x)1
P2:   R(x)1  W(x)2
P3:          R(x)2  R(x)1
P4:          R(x)1  R(x)2

Here W(x)1 and W(x)2 are causally related (P2 read x = 1 before writing 2), so all processes must see them in the same order: this sequence does not obey causal consistency.

P1: W(x)1
P2: W(x)2
P3:        R(x)2  R(x)1
P4:        R(x)1  R(x)2

Without the intervening read, the two writes are concurrent: this sequence obeys causal consistency.
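The rule applied in these examples can be checked mechanically. The sketch below (illustrative only) tests whether a process's read sequence ever observes the later of two causally related writes before the earlier one:

```python
# Check of the causal-consistency rule: causally related writes must be
# observed in the same (causal) order by every process; concurrent writes
# need not be.

def respects_order(reads, first, second):
    """True if 'first' is never observed after 'second' in this sequence."""
    seen_second = False
    for v in reads:
        if v == second:
            seen_second = True
        elif v == first and seen_second:
            return False
    return True

# W(x)1 -> W(x)2 are causally related (P2 read 1 before writing 2).
p3_reads = [2, 1]   # P3 sees 2 then 1: violates causal consistency
p4_reads = [1, 2]   # P4 sees them in causal order
print(respects_order(p3_reads, 1, 2), respects_order(p4_reads, 1, 2))
```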
Pipelined RAM consistency
Writes done by a single processor are received by all other
processors in the same order. A pair of writes from
different processes may be seen in different orders at
different processors.
P1: W(x)1
P2:   R(x)1  W(x)2
P3:          R(x)2  R(x)1
P4:          R(x)1  R(x)2

This sequence is allowed with PRAM-consistent memory.
Cache consistency

Under sequential consistency, all processors have to agree on some sequential order of execution for all accesses. Coherence only requires that accesses be sequentially consistent on a per-location basis.

P1: W(x)1  R(y)0
P2: W(y)1  R(x)0

Accesses to x can be linearized as R(x)0, W(x)1, and accesses to y as R(y)0, W(y)1. The history is coherent (per variable), but not sequentially consistent. SC implies coherence, but not vice versa.
Processor consistency
Processor Consistency:
Combination of Coherence and PRAM (History should
be coherent and PRAM simultaneously).
Stronger than Coherence but weaker than SC.


Weak consistency
P1: W(x)1  W(y)2  S
P2:   R(y)2  R(x)0  S
P3:   R(x)0  R(y)2  S

This sequence is allowed with weakly consistent memory (though it may never be coded in practice).

P1: W(x)1  W(y)2  S
P2:                  S  R(y)2  R(x)1
P3:                  S  R(x)1  R(y)2

After synchronizing, the memory seen by the reading processes has been brought up to date.
Release consistency
Two synchronization variables are defined: Acquire and Release
Before any read/write op is allowed to complete at a
processor, all previous acquires at all processors must be
completed
Before a release is allowed to complete at a processor, all
previous read/write ops. must be propagated to all
processors
Acquire and release ops are sequentially consistent w.r.t.
one another

P1: Acq(L)  W(x)1  W(x)2  Rel(L)
P2:                         Acq(L)  R(x)2  Rel(L)
P3:                         R(x)1

This sequence is allowed with release-consistent memory: P3 did not acquire the lock, so it is not guaranteed to see the latest value.
Granularity of chunks
On a page fault,
the missing page is just brought in from a remote node.
A region of 2, 4, or 8 pages including the missing page may
also be brought in.
Locality of reference: if a processor has referenced one
word on a page, it is likely to reference other neighboring
words in the near future.
Chunk/Page size
Small => too many page transfers
Large => False sharing
False sharing
[Figure: one page holds two unrelated shared variables A and B; processor 1 runs code using A while processor 2 runs code using B, so the page ping-pongs between them.]
Occurs because the page size exceeds the locality of reference: unrelated variables placed in the same region cause a large number of page transfers, and larger page sizes mean more pairs of unrelated variables.
Consistency
Achieving consistency is not an issue if
Pages are not replicated, or
Only read-only pages are replicated.


Two approaches are taken in DSM
Update: the write is allowed to take place locally, but the
address of the modified word and its new value are broadcast
to all the other processors. Each processor holding the word
copies the new value, i.e., updates its local value.
Invalidate: The address of the modified word is broadcast, but
the new value is not. Other processors invalidate their copies.
Invalidation protocol
Each page is either in R or W state.
When a page is in W state, only one copy exists, located at
one processor (called current owner) in read-write mode.
When a page is in R state, the current/latest owner has a copy
(mapped read only), but other processors may have copies
too.
Suppose processor 1 attempts a read. Different scenarios:
(a) P1 holds the page in W state and is its owner.
(b) P1 holds the page in R state and is its owner.
Read continued
(c) P1 and P2 both hold the page in R state; P1 is the owner.
(d) P1 and P2 both hold the page in R state; P2 is the owner.
In the first four cases the page is already mapped into P1's address space, so the read is done locally.
(e) Only P2 holds the page, in R state (P2 is the owner):
    1. Ask for a copy  2. Mark page as R  3. Do read
(f) P2 holds the page in W state (P2 is the owner):
    1. Ask P2 to degrade its copy to R  2. Ask for a copy  3. Mark page as R  4. Do read
Invalidation protocol: write
Suppose processor 1 attempts a write:
(a) P1 holds the page in W state and is its owner: just do the write.
(b) P1 holds the only copy, in R state, and is its owner:
    1. Mark page as W  2. Do write
(c) P1 and P2 both hold the page in R state; P1 is the owner:
    1. Invalidate other copies  2. Mark local page as W  3. Do write
(d) P1 and P2 both hold the page in R state; P2 is the owner:
    1. Invalidate other copies  2. Ask for ownership  3. Mark page as W  4. Do write
Write continued
(e) Only P2 holds the page, in R state (P2 is the owner):
    1. Invalidate other copies  2. Ask for ownership  3. Ask for the page  4. Mark page as W  5. Do write
(f) P2 holds the page in W state (P2 is the owner): the same five steps as in (e).
DSM using write-update
[Figure: with write-update, one process's writes a := 7 and b := 7 are multicast as updates to all copies; other processes running if (b = a) then print("before"), if (a = 7) then b := b + 1, and if (b = 8) then print("after") observe the updates in order over time.]
Source: Class presentations of Pham
Quoc Cuong Phan Dinh Khoi
Write-update in PLUS
How does owner find copies to invalidate?
Option 1: broadcast a message. This works only if broadcast messages are reliable and can never be lost.
Option 2: the owner (or page manager) of a page maintains a copyset list giving the processors currently holding the page.
[Figure: four processors on a network; for each page, the owner's table records the page number and its copyset, e.g. a page with copyset {2, 3, 4} is currently held by processors 2, 3, and 4.]
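The copyset mechanism can be sketched as follows. The class and function names are illustrative; the point is that a write fault invalidates exactly the recorded copies, with no broadcast:

```python
# Sketch of copyset-based invalidation: the owner of a page records which
# processors hold copies; on a write fault the writer sends point-to-point
# invalidations to exactly those processors.

class PageTableEntry:
    def __init__(self, owner, copyset):
        self.owner = owner
        self.copyset = set(copyset)  # processors currently holding the page
        self.state = "R"

def write_fault(entry, writer, send_invalidate):
    """Writer gains exclusive (W) access to the page."""
    for proc in entry.copyset - {writer}:
        send_invalidate(proc)        # point-to-point, no broadcast needed
    entry.copyset = {writer}
    entry.owner = writer
    entry.state = "W"

invalidated = []
entry = PageTableEntry(owner=3, copyset={1, 2, 3})
write_fault(entry, writer=1, send_invalidate=invalidated.append)
print(sorted(invalidated), entry.owner, entry.state)
```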
Replacement strategy
Which block to replace?
Usage-based (e.g. LRU) vs non-usage-based (e.g. FIFO)
Some DSM systems (e.g. IVY) use a priority mechanism for replacement. Each memory block of a node is in one of the following states:
Unused
Nil (invalidated)
Read-Only
Read-Owned
Writable (has write access and, obviously, is the owner)
Replacement priority (highest first):
1. Unused and Nil
2. Read-Only
3. Read-Owned and Writable blocks for which replicas exist elsewhere
4. Read-Owned and Writable blocks for which there is no replica
(LRU is used to choose within a class.)
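The priority ordering above can be sketched as a victim-selection function. The state names and tuple layout here are illustrative:

```python
# Sketch of IVY-style replacement priority: Unused/Nil blocks are evicted
# first, then Read-Only, then owned/writable blocks with replicas
# elsewhere, and last those with no replica; LRU (older last-use first)
# breaks ties within a class.

PRIORITY = {"unused": 0, "nil": 0, "read-only": 1,
            "owned-replicated": 2, "owned-sole": 3}

def choose_victim(blocks):
    """blocks: list of (state, last_used_time) tuples."""
    return min(blocks, key=lambda b: (PRIORITY[b[0]], b[1]))

blocks = [("owned-sole", 1), ("read-only", 9), ("nil", 5)]
print(choose_victim(blocks))
```

Evicting an invalidated (`nil`) block is free, which is why it outranks even a rarely-used read-only copy.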
Caching in DSMs
For performance, DSM caches data locally
More efficient access (locality)
But, must keep caches consistent
Caching of pages in case of page based DSM
Issues
Page size
Consistency mechanism
Heterogeneous DSM
Data Conversion
Different architectures may use different byte orderings and floating-point representations.
The DSM cannot perform this data conversion without knowing the type of the application-level data contained in the block and the actual block layout.
Application programmers provide the appropriate conversion routines.
Alternatively, the DSM compiler can carry out the conversion if the DSM is organized into data objects.

Thrashing in DSMs
Thrashing occurs when network resources are
exhausted, and more time is spent invalidating
data and sending updates than is used doing
actual work
Based on system specifics, one should choose
write-update or write-invalidate to avoid
thrashing
DSM Case Study
IVY: Integrated shared Virtual memory at Yale
Apollo workstations connected in a token-ring network
DSM implemented as a region of each processor's virtual memory
Page-based DSM; granularity is a 1 KB page
Address space: private (local) space and shared virtual address space (global)
Coherence protocol
Read-replication algorithm with multiple-readers, single-writer semantics
A reader always sees the latest value written, i.e., it supports strict consistency
Write Invalidate protocol in IVY
On a write fault to page p:
The faulting node finds the owner of p; the owner sends the page along with its copyset to the requester and does not keep the page.
The requester, after getting the page, looks into the copyset and sends invalidation messages to all nodes holding a copy.
On a read fault to page p:
The faulting node finds the owner of p and makes a request.
The owner sends p, adds the requester to the copyset, and marks the page as read-only.
Continued
There are three different protocols implementing write invalidate; they differ only in how the owner of the page is located.

Centralized manager: the faulting processor sends a request to the page manager (1. Request), which replies with the owner's identity (2. Reply); the requester then asks the owner for the page (3. Request) and receives it (4. Reply). In a variant, the manager forwards the request to the owner (2. Request forwarded), and the owner replies to the requester directly (3. Reply).
Fixed distributed manager
The central manager's work is distributed to every processor in the system.
Every processor keeps track of the owners of a predetermined set of pages.
A faulting processor uses a hash function H(p) to find the manager of page p.
The rest is the same as in the centralized scheme.
Dynamic distributed manager
Every processor keeps, for each page, a hint (the probOwner field) pointing toward the probable owner. The hint is updated whenever a processor receives an invalidation request, relinquishes a page, receives a page, or forwards a page-fault request.
Continued
Double fault
All three schemes suffer a double fault when a page is read and then written in succession: the node first faults to obtain a read-only copy, then faults again because it needs read-write access.
Kessler and Livny address this with a sequence number associated with each page.
Process synchronization
Coherence guarantees consistency among the copies of a page, whereas eventcounts serialize concurrent accesses to a page:
Init(ec), Read(ec), Await(ec, value), Advance(ec)
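The four eventcount operations can be sketched sequentially. This is a toy, single-threaded illustration: Await here merely reports whether the condition already holds, whereas a real implementation would block the caller until it does:

```python
# Sequential sketch of eventcount operations Init/Read/Advance/Await.

class EventCount:
    def __init__(self):            # Init(ec): counter starts at 0
        self.value = 0
    def read(self):                # Read(ec): current value
        return self.value
    def advance(self):             # Advance(ec): one event has occurred
        self.value += 1
    def await_(self, target):      # Await(ec, value): condition check only
        return self.value >= target

ec = EventCount()
ec.advance()
ec.advance()
print(ec.read(), ec.await_(2), ec.await_(3))
```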
Recovery in Distributed Systems
What is Recovery?
Failure causes inconsistencies in the state of the
distributed system.

Recovery: bringing back the failed node to its
normal operational state along with other nodes in
the system.
Failure types
Process failure: (computation results in incorrect
outcome)
Deadlocks, protection violation, erroneous user
input, etc.
System failure:
Full/partial amnesia.
Pause failure
Halting failure.
Secondary storage failure
Communication failure
Forward and Backward recovery
Forward Recovery:
Difficult to do forward assessment.
Backward Recovery:
When forward assessment not possible. Restore
processes to previous error-free state.
Expensive to rollback states.
Does not eliminate same fault occurring again.
Unrecoverable actions: Cash dispensed at ATMs.
Approaches to backward error recovery
Operation based approach (logs or audit trails are
written)
Stable storage and Secondary storage
Updating in-place
Write-ahead log
State-based approach (checkpointing)
Shadow pages
State based recovery
Orphan messages and domino effect
Lost messages
Problem of livelocks
Synchronous Recovery
Recap: Synchronous Recovery
Disadvantages of synchronous recovery
Additional message exchanges for taking
checkpoints.

Delays normal executions as messages cannot be
exchanged during checkpointing.

Unnecessary overhead if no failures occur between
checkpoints.
Asynchronous recovery
Asynchronous approach: independent checkpoints at
each processor.
Identify a consistent set of checkpoints if needed, for
rollbacks.
e.g., {x3,y3,z2} not consistent; {x2,y2,z2}
consistent and is used for rollback
[Figure: processors X, Y, and Z take independent checkpoints x1, x2, x3; y1, y2, y3; and z1, z2 along their timelines.]
Asynchronous recovery example
Security & Protection
Misuses of important information and its
destruction should be prevented.

Potential security violations (Anderson)
Unauthorized information release
Unauthorized information modification
Unauthorized denial of service

Access Matrix Model (AMM)
3 Components
Current objects (O)
Current subjects (S)
Subjects can be treated as objects and accessed like objects by other subjects
Generic rights (R) = {R1, R2, ...}
E.g. O -> file, S -> process, R -> read, write, own, delete, execute, etc.
Protection State (S,O,P)
Access Matrix Model
Access Control Lists/Capability Lists
Capability Based AMM
capability = tuple (o, P(s,o))
Each subject has a set of capabilities; possession of a capability confers access rights.
Capability-based addressing: a schematic view of a capability:
[ Object descriptor | Access rights (read, write, execute, etc.) ]
Capability based Addressing
Features:
Re-locatability, and Sharing
Capability Implementation

Capabilities can be implemented in two ways:
Tagged (Burroughs B6700)
Partitioned (Plessey system)
Advantages of Capabilities
Efficient: an access by a subject is implicitly valid if it holds the capability, so validity is tested easily.
Simple: the structure of a capability is similar to that of the addressing mechanism.
Flexible: a user can decide which objects/addresses have a capability.
Disadvantages
Control of propagation
Copy bit
A depth counter

Review
Determining all subjects who have access to an object (feasible in a partitioned implementation)
Revocation of access rights
Delete the object
Garbage collection
A count of the number of copies of each capability is maintained by either the creator or the system.
Access Control List Method
Execution efficiency: poor
Revocation and review of access: easy
Storage efficiency?
The Lock-Key Method
Each subject has a capability list of tuples (o, k), where k is a key.
Each object has an ACL of tuples (l, A), where l is a lock and A is a set of access modes.
When subject s wants access a to object o:
find a tuple (o, k) in s's capability list (if not found, access is declined);
find a matching tuple (k, A) in o's ACL such that a ∈ A.
Revocation is easy.
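The lock-key check just described can be sketched directly. Names and the example data are illustrative:

```python
# Sketch of the lock-key access check: the subject's capability list holds
# (object, key) tuples, the object's list holds (lock, access-set) tuples,
# and access mode a is granted only when a matching key/lock pair exists
# with a in A.

def can_access(cap_list, obj_locks, obj, mode):
    for o, key in cap_list:
        if o == obj:                       # subject holds a key for obj
            for lock, modes in obj_locks:
                if lock == key and mode in modes:
                    return True            # key matches a lock granting a
    return False

caps = [("file1", "k37")]                  # subject's capability list
locks = [("k37", {"read", "write"}),       # file1's lock list
         ("k99", {"read"})]
print(can_access(caps, locks, "file1", "write"),
      can_access(caps, locks, "file1", "execute"))
```

Revocation is easy because removing or changing a lock tuple on the object side instantly invalidates every outstanding matching key.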
Security Goals
Confidentiality
Integrity
Availability
Passive Attacks
Active Attacks
Model for Network Security
Symmetric Cipher: DES Overview
Public-Key Cryptography
Digital signature = signed message digest.

Bob sends a digitally signed message: he runs the large message m through a hash function H and encrypts the digest H(m) with his private key KB- to produce the signature KB-(H(m)), which he sends along with m.

Alice verifies the signature and the integrity of the digitally signed message: she applies Bob's public key KB+ to the received encrypted digest KB-(H(m)) to recover H(m), independently computes H(m) from the received message, and checks that the two digests are equal.

Alice thus verifies that:
Bob signed m.
No one else signed m.
Bob signed m and not m'.
Non-repudiation: Alice can take m and the signature KB-(m) to court and prove that Bob signed m.
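The hashing half of the scheme can be demonstrated with Python's standard hashlib (the private-key encryption of H(m) is omitted here; a real signature would need a cryptography library, so treat this strictly as the digest-comparison step):

```python
# Alice's integrity check: recompute H(m) from the received message and
# compare it with the digest recovered from Bob's signature.

import hashlib

def digest(message: bytes) -> str:
    return hashlib.sha256(message).hexdigest()

m = b"pay Alice $10"
h_sent = digest(m)     # Bob: H(m), which he would sign with his private key
h_check = digest(m)    # Alice: recompute H(m) from the received message
print(h_sent == h_check, digest(b"pay Alice $99") == h_sent)
```

Any tampering with m changes the digest, so the comparison fails for a modified message.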
Authentication Using a Key Distribution
Center (Mutual Authentication)
The Needham-Schroeder authentication
protocol (Multiway Challenge Response).
Authentication Using a Key Distribution
Center (Mutual Authentication)
The Otway-Rees authentication protocol
(slightly simplified).
Kerberos Authentication
[Source: www.microsoft.com]
Public-Key Certificates
Fault Tolerance
Avoidance of disruptions due to failures and
to improve availability
System reliability: Fault-Intolerance vs. Fault-Tolerance
The fault intolerance (or fault-avoidance) approach
improves system reliability by removing the source of
failures (i.e., hardware and software faults) before normal
operation begins.

The fault-tolerance approach expects faults to be present during system operation, but employs design techniques which ensure the continued correct execution of the computing processes.
Approaches to fault-tolerance
Approaches:
(a) Mask failures:
System continues to provide its specified function(s) in the presence of
failures
Example: voting protocols
(b) Well defined failure behavior:
System exhibits a well defined behavior in the presence of failures
It may or may not perform its specified function(s), but facilitates actions
suitable for fault recovery
Example: commit protocols
A change made to a database by a transaction is made visible only if the transaction is successful and commits.
Redundancy:
Method for achieving fault tolerance (multiple copies of hardware,
processes, data, etc.)
Issues
Process Deaths:
All resources allocated to a process must be recovered when a
process dies
Kernel and remaining processes can notify other cooperating
processes
Client-server systems: client /server process needs to be informed
that the corresponding server/client process died
Machine failure:
All processes running on that machine will die
Client-server systems: difficult to distinguish between a process and
machine failure
Issue: detection by processes of other machines
Network Failure:
Network may be partitioned into subnets
Machines from different subnets cannot communicate
Difficult for a process to distinguish between a machine and a
communication link failure
Atomic actions
System activity: sequence of primitive or atomic actions

Atomic Action:
Example: Two processes, P1 and P2, share a memory location x
and both modify x
Process P1 Process P2

Lock(x); Lock(x);
x := x + z; x := x + y; Atomic action
Unlock(x); Unlock(x);

successful exit
Transaction Model
Transaction
A sequence of actions (typically read/write), each of which is executed at one or
more sites, the combined effect of which is guaranteed to be atomic.

Atomic Transactions
Atomicity: either all or none of the effects of the transaction are made permanent.
Consistency: the effect of concurrent transactions is equivalent to some serial
execution.
Isolation: transactions cannot observe each other's partial effects.
Durability: once accepted, the effects of a transaction are permanent (until changed
again, of course).

Environment
Each node is assumed to have:
data stored in a partially/fully replicated manner
stable storage (information that survives failures)
logs (a record of the intended changes to the data: write ahead, UNDO/REDO)
locks (to prevent access to data being used by a transaction in progress)
Committing
Transaction: Sequence of actions treated as an atomic action to
preserve consistency (e.g. access to a database)
Commit a transaction: Unconditional guarantee that the transaction
will complete successfully (even in the presence of failures)
Abort a transaction: Unconditional guarantee to back out of a
transaction, i.e., all the effects of the transaction have been removed
(transaction was backed out)
Events that may cause aborting a transaction: deadlocks, timeouts,
protection violation
Commit protocols:
Enforce global atomicity involving several cooperating distributed
processes
Ensure that all the sites either commit or abort transaction unanimously,
even in the presence of multiple and repetitive failures
The Two-Phase Commit Protocol

Coordinator state machine (q1 = initial, w1 = wait, a1 = abort, c1 = commit):
q1 -> w1: Commit_Req msg sent to all cohorts
w1 -> c1: all cohorts agreed; Commit msg sent to all cohorts
w1 -> a1: one or more cohorts replied abort; Abort msg sent to all cohorts

Cohort state machine (qi = initial, wi = wait, ai = abort, ci = commit):
qi -> wi: Commit_Req msg received; Agreed msg sent to coordinator
qi -> ai: Commit_Req msg received; Abort msg sent to coordinator
wi -> ai: Abort msg received from coordinator
wi -> ci: Commit msg received from coordinator
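The coordinator's decision rule in the diagram above can be sketched in a few lines (an illustrative sketch, not the textbook's code): the coordinator commits only if *every* cohort voted "agreed"; a single "abort" vote drives all sites to abort.

```python
def two_phase_commit(cohort_votes):
    """Return the coordinator's decision and the message sent to each cohort."""
    # Phase 1: coordinator sent Commit_Req; cohorts replied agreed/abort.
    if all(vote == "agreed" for vote in cohort_votes):
        decision = "commit"    # all cohorts agreed -> Commit msg to all
    else:
        decision = "abort"     # >= 1 cohort replied abort -> Abort msg to all
    # Phase 2: the same decision is delivered to every cohort (unanimity).
    return decision, [decision] * len(cohort_votes)


print(two_phase_commit(["agreed", "agreed", "agreed"]))  # commit everywhere
print(two_phase_commit(["agreed", "abort", "agreed"]))   # abort everywhere
```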
The Three-Phase Commit Protocol

Coordinator state machine (q1 = initial, w1 = wait, p1 = prepared, a1 = abort, c1 = commit):
q1 -> w1: Commit_Req msg sent to all cohorts
w1 -> p1: all cohorts agreed; Prepare msg sent to all cohorts
w1 -> a1: one or more cohorts replied abort; Abort msg sent to all cohorts
p1 -> c1: all cohorts sent ACK; Commit msg sent to all cohorts

Cohort state machine (qi = initial, wi = wait, pi = prepared, ai = abort, ci = commit):
qi -> wi: Commit_Req msg received; Agreed msg sent to coordinator
qi -> ai: Commit_Req msg received; Abort msg sent to coordinator
wi -> ai: Abort msg received from coordinator
wi -> pi: Prepare msg received; ACK msg sent to coordinator
pi -> ci: Commit msg received from coordinator
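The two state machines above can be encoded as transition tables and stepped through event by event. The state and event names follow the diagram; the encoding itself is an illustrative sketch, not library code.

```python
# Transition tables: (state, event) -> (next_state, action).
COORDINATOR = {
    ("q1", "start"):      ("w1", "send Commit_Req to all cohorts"),
    ("w1", "all agreed"): ("p1", "send Prepare to all cohorts"),
    ("w1", "some abort"): ("a1", "send Abort to all cohorts"),
    ("p1", "all ACKed"):  ("c1", "send Commit to all cohorts"),
}

COHORT = {
    ("qi", "Commit_Req, agree"):  ("wi", "send Agreed to coordinator"),
    ("qi", "Commit_Req, refuse"): ("ai", "send Abort to coordinator"),
    ("wi", "Abort received"):     ("ai", None),
    ("wi", "Prepare received"):   ("pi", "send ACK to coordinator"),
    ("pi", "Commit received"):    ("ci", None),
}


def run(table, start, events):
    """Drive a state machine through a sequence of events."""
    state = start
    for ev in events:
        state, _action = table[(state, ev)]
    return state


# A failure-free run reaches the commit state on both sides:
print(run(COORDINATOR, "q1", ["start", "all agreed", "all ACKed"]))  # c1
print(run(COHORT, "qi",
          ["Commit_Req, agree", "Prepare received", "Commit received"]))  # ci
```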
The Nonblocking Commit Protocol for Single-Site Failure

Coordinator state machine (states as in 3PC; legend: T = timeout transition,
F = failure transition, F,T = failure/timeout transition):
q1 -> w1: Commit_Req msg sent to all cohorts
w1 -> p1: all cohorts agreed; Prepare msg sent to all cohorts
w1 -> a1: one or more cohorts replied abort; Abort msg sent to all cohorts
p1 -> c1: all cohorts sent ACK; Commit msg sent to all cohorts
The diagram additionally marks T-, F-, and F,T-labeled edges; the edge into
the abort state carries the action "Abort msg sent to all cohorts".
Cohort state machine (states as in 3PC; legend: T = timeout transition,
F = failure transition, F,T = failure/timeout transition):
qi -> wi: Commit_Req msg received; Agreed msg sent to coordinator
qi -> ai: Commit_Req msg received; Abort msg sent to coordinator
wi -> ai: Abort msg received from coordinator
wi -> pi: Prepare msg received; ACK msg sent to coordinator
pi -> ci: Commit msg received from coordinator
The diagram additionally marks three F,T-labeled edges and an extra
"Abort msg received from coordinator" edge for recovering from coordinator
failure.
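The timeout (T) transitions are what make the protocol nonblocking: a site that waits too long for a message gives up instead of blocking forever. Below is a hedged sketch of a cohort timing out in its wait state, where it has made no commitment yet and can abort unilaterally; the timeout value, queue-based "inbox", and message names are illustrative assumptions, not the textbook's code.

```python
import queue


def cohort_wait(inbox, timeout=0.1):
    """Cohort in state wi: wait for the coordinator, abort on timeout."""
    try:
        msg = inbox.get(timeout=timeout)
    except queue.Empty:
        return "ai"        # T transition: coordinator silent -> unilateral abort
    if msg == "prepare":
        return "pi"        # Prepare received -> move to prepared, send ACK
    return "ai"            # Abort received -> abort


inbox = queue.Queue()      # empty inbox models a failed coordinator
print(cohort_wait(inbox))  # ai -- the cohort does not block forever
```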
Voting protocols
Principles:
Data replicated at several sites to increase reliability
Each replica assigned a number of votes
To access a replica, a process must collect a majority of votes
Vote mechanism:
(1) Static voting:
Each replica has a number of votes (recorded in stable storage)
A process can access a replica for a read or write operation if it
can collect a certain number of votes (read or write quorum)
(2) Dynamic voting
Number of votes or the set of sites that form a quorum change
with the state of system (due to site and communication failures)
(2.1) Majority based approach:
The set of sites that can form a majority to allow access to replicated
data changes with the changing state of the system
(2.2) Dynamic vote reassignment:
Number of votes assigned to a site changes dynamically
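For static voting, a common quorum rule (consistent with the "certain number of votes" above) is to choose a read quorum r and a write quorum w such that r + w > v and 2w > v, where v is the total number of votes: the first condition forces every read quorum to intersect every write quorum, the second prevents two writes from proceeding concurrently. The vote assignment below is an illustrative example, not from the slides.

```python
def quorums_ok(r, w, v):
    """Check the static-voting quorum conditions for v total votes."""
    return r + w > v and 2 * w > v   # read/write and write/write intersection


votes = {"s1": 1, "s2": 1, "s3": 1, "s4": 2}   # per-replica vote assignment
v = sum(votes.values())                        # v = 5

print(quorums_ok(r=2, w=4, v=v))   # True: every read sees the latest write
print(quorums_ok(r=2, w=2, v=v))   # False: a read quorum can miss a write
```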