
CS30002: Operating Systems
Arobinda Gupta
Spring 2012

General Information

- Textbook:
  - Operating System Concepts, 8th Ed, by Silberschatz, Galvin, and Gagne
  - I will use materials from other books as and when needed

- Course Webpage: http://cse.iitkgp.ac.in/~agupta/OS

Grading Policy

- Midsem: 30%
- Endsem: 50%
- TA: 20% (two class tests, may also have assignments)

Introduction

What is an Operating System?

- User-centric definition
  - A program that acts as an intermediary between a user of a computer and the computer hardware
  - Defines an interface for the user to use services provided by the system
  - Provides a view of the system to the user

- System-centric definition
  - Resource allocator: manages and allocates resources
  - Control program: controls the execution of user programs and operations of I/O devices

Computer System Components

1. Hardware: provides basic computing resources (CPU, memory, I/O devices).
2. Operating system: controls and coordinates the use of the hardware among the various application programs for the various users.
3. Application programs: define the ways in which the system resources are used to solve the computing problems of the users (compilers, databases, games, ...).
4. Users (people, machines, other computers).

Abstract View of System Components

Types of Systems

- Batch Systems
  - Multiple jobs, but only one job in memory at one time and executed (till completion) before the next one starts

- Multiprogrammed Batch Systems
  - Multiple jobs in memory, CPU is multiplexed between them
  - CPU-bound vs. I/O-bound jobs

- Time-sharing Systems
  - Multiple jobs in memory and on disk, CPU is multiplexed among jobs in memory, jobs swapped between disk and memory
  - Allows interaction with users

- Personal Computers
  - Dedicated to a single user at one time

- Multiprocessing Systems
  - More than one CPU in a single machine to allocate jobs to
  - Symmetric Multiprocessing, NUMA machines, Multicore

- Other Parallel Systems, Distributed Systems, Clusters
  - Different types of systems with multiple CPUs/Machines

- Real Time Systems
  - Systems to run jobs with time guarantees

- Other types possible depending on resources in the machine, types of jobs to be run

- OS design depends on the type of system it is designed for

- Our primary focus in this course:
  - Uniprocessor, time-sharing systems running general purpose jobs from users
  - Effect of multicore/multiprocessors
  - Will discuss some other topics at end

Resources Managed by OS

- Physical
  - CPU, Memory, Disk, I/O devices like keyboard, monitor, printer

- Logical
  - Process, File, ...

Main Components of an OS

- Resource-centric view
  - Process Management
  - Main Memory Management
  - File Management
  - I/O System Management
  - Secondary Storage Management
  - Security and Protection System
  - Networking (this is now integrated with most OSs, but will be covered in the Networks course)

- User-centric view
  - System Calls
  - Command Interpreter (not strictly a part of an OS)

Process Management

- A process is a program in execution
- Needs certain resources to accomplish its task
  - CPU time, memory, files, I/O devices

- OS responsibilities
  - Process creation and deletion
  - Process suspension and resumption
  - Provide mechanisms for:
    - process synchronization
    - interprocess communication

Main-Memory Management

- OS responsibilities
  - Keep track of which parts of memory are currently being used and by whom
  - Decide which processes to load when memory space becomes available
  - Allocate and deallocate memory space as needed

File Management

- OS responsibilities
  - File creation, deletion, modification
  - Directory creation, deletion, modification
  - Support of primitives for manipulating files and directories
  - Mapping files onto secondary storage
  - File backup on stable (nonvolatile) storage media

I/O System Management

- The I/O system consists of:
  - A buffer-caching system
  - Device driver interface
  - Drivers for specific hardware devices

Secondary-Storage Management

- Most modern computer systems use disks as the principal on-line storage medium, for both programs and data
- OS responsibilities
  - Free space management
  - Storage allocation
  - Disk scheduling

Security and Protection System

- Protection refers to a mechanism for controlling access by programs, processes, or users to both system and user resources
- The protection mechanism must:
  - distinguish between authorized and unauthorized usage
  - specify the controls to be imposed
  - provide a means of enforcement

System Calls

- System calls provide the interface between a running program and the OS
  - Think of it as a set of functions available to the program to call (but somewhat different from normal functions, we will see why)
  - Generally available as assembly-language instructions
  - Most common languages (e.g., C, C++) have APIs that call system calls underneath

Passing parameters to system calls

- Pass parameters in registers
- Store the parameters in a table in memory, and pass the table address as a parameter in a register
- Push (store) the parameters onto the stack by the program, and pop them off the stack by the operating system
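As a concrete (Linux-specific, illustrative) look at the register-passing method, the C library's syscall() wrapper places its arguments in registers before trapping into the kernel; programs normally use the write() API that sits on top of this. A minimal sketch:

/* Illustrative sketch: invoking the write system call directly on Linux. */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello via syscall\n";
    /* fd, buffer, and length are placed in registers before the trap */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}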

Command-Interpreter System

- Strictly not a part of the OS, but always there: the shell
- Allows the user to give commands to the OS; interprets the commands and executes them
  - Calls appropriate functions/system calls
- You will write one in your lab

Process Management

What is a Process?

- Process: an instance of a program in execution
  - Multiple instances of the same program are different processes
- A process has resources allocated to it by the OS during its execution
  - CPU time
  - Memory space for code, data, stack
  - Open files
  - Signals
  - Data structures to maintain different information about the process
- Each process is identified by a unique, positive integer id (process id)

Process Control Block (PCB)

- The primary data structure maintained by the OS that contains information about a process
- One PCB per process
- OS maintains a list of PCBs for all processes

Typical Contents of PCB

- Process id, parent process id
- Process state
- CPU state: CPU register contents, PSW
- Priority and other scheduling info
- Pointers to different memory areas
- Open file information
- Signals and signal handler info
- Various accounting info like CPU time used etc.
- Many other OS-specific fields can be there
  - Linux PCB (task_struct) has 100+ fields

Process States (5-state model)

- As a process executes, it changes state
  - new: The process is being created
  - running: Instructions are being executed
  - waiting: The process is waiting for some event (needed for its progress) to occur
  - ready: The process is waiting to be assigned to a CPU
  - terminated: The process has finished execution

Process State Transitions

Main Operations on a Process

- Process creation
  - Data structures like PCB set up and initialized
  - Initial resources allocated and initialized if needed
  - Process added to ready queue (queue of processes ready to run)

- Process scheduling
  - CPU is allotted to the process, process runs

- Process termination
  - Process is removed
  - Resources are reclaimed
  - Some data may be passed to parent process (ex. exit status)
  - Parent process may be informed (ex. SIGCHLD signal in UNIX)

Process Creation

- A process can create another process
  - By making a system call (a function to invoke the service of the OS, ex. fork())
  - Parent process: the process that invokes the call
  - Child process: the new process created

- The new process can in turn create other processes, forming a tree of processes

- The first process in the system is handcrafted
  - No system call, because the OS is still not running fully (not open for service)
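A minimal UNIX sketch of parent and child (illustrative; the slides only name fork()):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();          /* parent makes the system call */
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        printf("child: pid %d\n", getpid());  /* new process runs here */
        exit(0);                 /* exit status passed back to parent */
    } else {
        waitpid(pid, NULL, 0);   /* parent waits until child terminates */
        printf("parent: child %d done\n", pid);
    }
    return 0;
}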

Process Creation (contd.)

- Resource sharing possibilities
  - Parent and children share all resources
  - Children share a subset of the parent's resources
  - Parent and child share no resources

- Execution possibilities
  - Parent and children execute concurrently
  - Parent waits until children terminate

- Memory address space possibilities
  - Address space of child is a duplicate of the parent
  - Child has a new program loaded into it

Processes Tree on a UNIX System

Process Termination

- Process executes last statement and asks the operating system to terminate it (ex. exit/abort)
- Process encounters a fatal error
  - Can be for many reasons like arithmetic exception etc.
- Parent may terminate execution of children processes (ex. kill). Some possible reasons:
  - Child has exceeded allocated resources
  - Task assigned to child is no longer required
  - Parent is exiting
    - Some operating systems may not allow a child to continue if its parent terminates

Process Scheduling

- Ready queue: queue of all processes residing in main memory, ready and waiting to execute (links to PCBs)
- Scheduler/Dispatcher picks up a process from the ready queue according to some algorithm (CPU scheduling policy) and assigns it the CPU
- Selected process runs till
  - It needs to wait for some event to occur (ex. a disk read)
  - CPU time allotted to it expires (timesharing systems)
  - The CPU scheduling policy dictates that it be stopped
    - Arrival of a higher priority process
  - When it is ready to run again, it goes back to the ready queue
- Scheduler is invoked again to select the next process from the ready queue

Representation of Process Scheduling

Schedulers

- Long-term scheduler (or job scheduler)
  - Selects which processes should be brought into the ready queue
  - Controls the degree of multiprogramming (no. of jobs in memory)
  - Invoked infrequently (seconds, minutes)
  - May not be present in an OS (ex. Linux/Windows do not have one)

- Short-term scheduler (or CPU scheduler)
  - Selects which process should be executed next and allocates the CPU
  - Invoked very frequently (milliseconds), must be fast

What if all processes do not fit in memory?

- Partially executed jobs in secondary memory (swapped out)
  - Copy the process image to some pre-designated area in the disk (swap out)
  - Bring it in again later and add to the ready queue

Addition of Medium Term Scheduling

Other Questions

- How does the scheduler get scheduled? (Suppose we have only one CPU)
  - As part of execution of an ISR (ex. timer interrupt in a time-sharing system)
  - Called directly by an I/O routine/event handler after blocking the process making the I/O or event request

- What does it do with the running process?
  - Save its context

- How does it start the new process?
  - Load the saved context of the new process chosen to be run
  - Start the new process

Context of a Process

- Information that is required to be saved to be able to restart the process later from the same point
- Includes:
  - CPU state: all register contents, PSW
  - Program counter
  - Memory state: code, data
  - Stack
  - Open file information
  - Pending I/O and other event information

Context Switch

- When the CPU switches to another process, the system must save the state of the old process and load the saved state for the new process
- Context-switch time is overhead; the system does no useful work while switching
- Time dependent on hardware support

Handling Interrupts

- H/w saves PC, PSW
- Jump to ISR
- ISR should first save the context of the process
- Execute the ISR
- Before leaving, ISR should restore the context of the process being executed
- Return from ISR restores the PC
- ISR may invoke the dispatcher, which may load the context of a new process, which runs when the interrupt returns instead of the original process interrupted

CPU Switch From Process to Process

Example: Timesharing Systems

- Each process has a time quantum T allotted to it
- Dispatcher starts process P0, loads an external counter (timer) to count down from T to 0
- When the timer expires, the CPU is interrupted
- The ISR invokes the dispatcher
- The dispatcher saves the context of P0
  - PCB of P0 tells where to save
- The dispatcher selects P1 from the ready queue
  - The PCB of P1 tells where the old state, if any, is saved
- The dispatcher loads the context of P1
- The dispatcher reloads the counter (timer) with T
- The ISR returns, restarting P1 (since P1's PC is now loaded as part of the new context loaded)
- P1 starts running

CPU Scheduling

Types of jobs

- CPU-bound vs. I/O-bound
  - Maximum CPU utilization obtained with multiprogramming
- Batch, interactive, real time
  - Different goals, affects scheduling policies

CPU Scheduler

- Selects from among the processes in memory that are ready to execute, and allocates the CPU to one of them
- CPU scheduling decisions may take place when a process:
  1. Switches from running to waiting state
  2. Switches from running to ready state
  3. Switches from waiting to ready
  4. Terminates
- Scheduling under 1 and 4 is nonpreemptive. All other scheduling is preemptive.

Dispatcher

- Dispatcher module gives control of the CPU to the process selected by the CPU scheduler; this involves:
  - switching context
  - switching to user mode
  - jumping to the proper location in the user program to restart that program

- Dispatch latency: time it takes for the dispatcher to stop one process and start another running

Scheduling Criteria

- CPU utilization: keep the CPU as busy as possible
- Throughput: # of processes that complete their execution per time unit
- Turnaround time: amount of time to execute a particular process
- Waiting time: amount of time a process has been waiting in the ready queue
- Response time: amount of time from when a request was submitted until the first response is produced, not the full output (for time-sharing environments)

Optimization Criteria

- Max CPU utilization
- Max throughput
- Min turnaround time
- Min waiting time
- Min response time

First-Come, First-Served (FCFS) Scheduling

Process  Burst Time
P1       24
P2       3
P3       3

Suppose that the processes arrive in the order: P1, P2, P3.
The Gantt chart for the schedule is:

|   P1   | P2 | P3 |
0        24   27   30

- Waiting time for P1 = 0; P2 = 24; P3 = 27
- Average waiting time: (0 + 24 + 27)/3 = 17

FCFS Scheduling (Cont.)

Suppose that the processes arrive in the order: P2, P3, P1.
The Gantt chart for the schedule is:

| P2 | P3 |   P1   |
0    3    6        30

- Waiting time for P1 = 6; P2 = 0; P3 = 3
- Average waiting time: (6 + 0 + 3)/3 = 3
- Much better than previous case
- Convoy effect: short process behind long process

Shortest-Job-First (SJF) Scheduling

- Associate with each process the length of its next CPU burst. Use these lengths to schedule the process with the shortest time
- Two schemes:
  - nonpreemptive: once the CPU is given to the process, it cannot be preempted until it completes its CPU burst
  - preemptive: if a new process arrives with CPU burst length less than the remaining time of the currently executing process, preempt. This scheme is known as Shortest-Remaining-Time-First (SRTF)
- SJF is optimal: gives minimum average waiting time for a given set of processes

Example of Non-Preemptive SJF

Process  Arrival Time  Burst Time
P1       0.0           7
P2       2.0           4
P3       4.0           1
P4       5.0           4

SJF (non-preemptive):

| P1 | P3 | P2 | P4 |
0    7    8    12   16

Average waiting time = (0 + 6 + 3 + 7)/4 = 4

Example of Preemptive SJF

Process  Arrival Time  Burst Time
P1       0.0           7
P2       2.0           4
P3       4.0           1
P4       5.0           4

SJF (preemptive):

| P1 | P2 | P3 | P2 | P4 | P1 |
0    2    4    5    7    11   16

Average waiting time = (9 + 1 + 0 + 2)/4 = 3

Determining Length of Next CPU Burst

- Can only estimate the length
- Can be done by using the length of previous CPU bursts, using exponential averaging:

    tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n),  0 <= alpha <= 1

  where t(n) is the actual length of the nth CPU burst and tau(n) is the predicted value for the nth CPU burst
Properties of Exponential Averaging

- alpha = 0
  - tau(n+1) = tau(n)
  - Recent history does not count
- alpha = 1
  - tau(n+1) = t(n)
  - Only the actual last CPU burst counts
- If we expand the formula, each successive term has less weight than its predecessor
  - Recent history has more weight than old history
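A small C sketch of the estimator (the variable names, initial guess, and alpha value are illustrative, not from the slides):

/* Exponential averaging of CPU burst lengths:
   tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n) */
static double tau = 10.0;           /* assumed initial guess */
static const double alpha = 0.5;    /* assumed weighting factor */

double next_burst_estimate(double actual_burst) {
    tau = alpha * actual_burst + (1.0 - alpha) * tau;
    return tau;                     /* prediction for the next burst */
}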

Priority Scheduling

- A priority number (integer) is associated with each process
- The CPU is allocated to the process with the highest priority (smallest integer = highest priority)
  - Preemptive
  - Nonpreemptive
- SJF is priority scheduling where priority is the predicted next CPU burst time
- Problem: Starvation. Low priority processes may never execute
- Solution: Aging. As time progresses, increase the priority of the process

Round Robin (RR)

- Each process gets a small unit of CPU time (time quantum), usually 10-100 milliseconds. After this time has elapsed, the process is preempted and added to the end of the ready queue
- If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units at once. No process waits more than (n-1)q time units
- Performance
  - q large: behaves like FIFO
  - q small: q must be large with respect to context switch time, otherwise overhead is too high

Example of RR with Time Quantum = 20

Process  Burst Time
P1       53
P2       17
P3       68
P4       24

The Gantt chart is:

| P1 | P2 | P3 | P4 | P1 | P3 | P4 | P1 | P3 | P3 |
0    20   37   57   77   97   117  121  134  154  162

Typically, higher average turnaround than SJF, but better response.

Multilevel Queue

- Ready queue is partitioned into separate queues: foreground (interactive) and background (batch)
- Each queue has its own scheduling algorithm
  - foreground: RR
  - background: FCFS
- Scheduling must be done between the queues
  - Fixed priority scheduling (i.e., serve all from foreground, then from background). Possibility of starvation.
  - Time slice: each queue gets a certain amount of CPU time which it can schedule amongst its processes; e.g., 80% to foreground in RR, 20% to background in FCFS

Multilevel Feedback Queue

- A process can move between the various queues; aging can be implemented this way
- A multilevel-feedback-queue scheduler is defined by the following parameters:
  - number of queues
  - scheduling algorithm for each queue
  - method used to determine when to upgrade a process
  - method used to determine when to demote a process
  - method used to determine which queue a process will enter when that process needs service

Example of Multilevel Feedback Queue

- Three queues:
  - Q0: time quantum 8 milliseconds
  - Q1: time quantum 16 milliseconds
  - Q2: FCFS

- Scheduling
  - A new job enters queue Q0, which is served FCFS. When it gains the CPU, the job receives 8 milliseconds. If it does not finish in 8 milliseconds, the job is moved to queue Q1.
  - At Q1 the job is again served FCFS and receives 16 additional milliseconds. If it still does not complete, it is preempted and moved to queue Q2.

Multilevel Feedback Queues

Process Coordination

Why is it needed?

- Processes may need to share data
  - More than one process reading/writing the same data (a shared file, a database record, ...)
  - Output of one process being used by another
  - Needs mechanisms to pass data between processes

- Ordering executions of multiple processes may be needed to ensure correctness
  - Process X should not do something before process Y does something, etc.
  - Need mechanisms to pass control signals between processes

Interprocess Communication (IPC)

- Mechanism for processes P and Q to communicate and to synchronize their actions
  - Establish a communication link

- Fundamental types of communication links
  - Shared memory
    - P writes into a shared location, Q reads from it, and vice-versa
  - Message passing
    - P and Q exchange messages

- We will focus on shared memory, and discuss issues with message passing later

Implementation Questions

- How are links established?
- Can a link be associated with more than two processes?
- How many links can there be between every pair of communicating processes?
- What is the capacity of a link?
- Is the size of a message that the link can accommodate fixed or variable?
- Is a link unidirectional or bi-directional?

Producer-Consumer Problem

- Paradigm for cooperating processes
  - producer process produces information that is consumed by a consumer process
  - unbounded-buffer: places no practical limit on the size of the buffer
  - bounded-buffer: assumes that there is a fixed buffer size

- Basic synchronization requirement
  - Producer should not write into a full buffer
  - Consumer should not read from an empty buffer
  - All data written by the producer must be read exactly once by the consumer

Bounded-Buffer Shared-Memory Solution

Shared data:

#define BUFFER_SIZE 10
typedef struct {
    ...
} item;
item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;

We will see how to create such shared memory between processes in the lab

Bounded-Buffer: Producer Process

item nextProduced;
while (1) {
    /* produce an item in nextProduced */
    while (((in + 1) % BUFFER_SIZE) == out)
        ; /* do nothing: buffer full */
    buffer[in] = nextProduced;
    in = (in + 1) % BUFFER_SIZE;
}

Bounded-Buffer: Consumer Process

item nextConsumed;
while (1) {
    while (in == out)
        ; /* do nothing: buffer empty */
    nextConsumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    /* consume the item in nextConsumed */
}

- The solution allows at most n - 1 items in the buffer (of size n) at the same time. A solution where all n buffer slots are used is not simple
- Suppose we modify the producer-consumer code by adding a variable counter, initialized to 0 and incremented each time a new item is added to the buffer

Shared data

#define B_SIZE 10
typedef struct {
    ...
} item;
item buffer[B_SIZE];
int in = 0;
int out = 0;
int counter = 0;

Will this work?

Producer process

item nextProduced;
while (1) {
    while (counter == B_SIZE)
        ; /* do nothing */
    buffer[in] = nextProduced;
    in = (in + 1) % B_SIZE;
    counter++;
}

Consumer process

item nextConsumed;
while (1) {
    while (counter == 0)
        ; /* do nothing */
    nextConsumed = buffer[out];
    out = (out + 1) % B_SIZE;
    counter--;
}

The Problem with this Solution

- The statement counter++ may be implemented in machine language as:
    register1 = counter
    register1 = register1 + 1
    counter = register1
- The statement counter-- may be implemented as:
    register2 = counter
    register2 = register2 - 1
    counter = register2
- If both the producer and consumer attempt to update counter concurrently, the assembly language statements may get interleaved
- Interleaving depends upon how the producer and consumer processes are scheduled

An Illustration

Assume counter is initially 5. One interleaving of statements is:

producer: register1 = counter        (register1 = 5)
producer: register1 = register1 + 1  (register1 = 6)
consumer: register2 = counter        (register2 = 5)
consumer: register2 = register2 - 1  (register2 = 4)
producer: counter = register1        (counter = 6)
consumer: counter = register2        (counter = 4)

The value of counter may be either 4 or 6, where the correct result should be 5.

Race Condition

- A scenario in which the final output is dependent on the relative speed of the processes
  - Example: The final value of the shared data counter depends upon which process finishes last

- Race conditions must be prevented
  - Concurrent processes must be synchronized
  - Final output should be what is specified by the program, and should not change due to relative speeds of the processes

Atomic Operation

- An operation that is either executed fully without interruption, or not executed at all
- The operation can be a group of instructions
  - Ex. the instructions for counter++ and counter--
- Note that the producer-consumer solution above works if counter++ and counter-- are made atomic
- In practice, the process may be interrupted in the middle of an atomic operation, but the atomicity should ensure that no process uses the effect of the partially executed operation until it is completed

The Critical Section Problem

- n processes all competing to use some shared data (in general, some shared resource)
- Each process has a section of code, called the critical section, in which the shared data is accessed
- Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section, irrespective of the relative speeds of the processes
- Also known as the Mutual Exclusion Problem, as it requires that access to the critical section is mutually exclusive

Requirements for Solution to the Critical-Section Problem

1. Mutual Exclusion: If process Pi is executing in its critical section, then no other processes can be executing in their critical sections.
2. Progress: If no process is executing in its critical section and there exist some processes that wish to enter their critical sections, then the selection of the process that will enter the critical section next cannot be postponed indefinitely.
3. Bounded Waiting/No Starvation: A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted.

Entry and Exit Sections

- Entry section: a piece of code executed by a process just before entering a critical section
- Exit section: a piece of code executed by a process just after leaving a critical section
- General structure of a process Pi:

    ...
    entry section
    critical section
    exit section
    remainder section  /* remaining code */
    ...

- Solutions vary depending on how these sections are written

Peterson's Solution

- Only 2 processes, P0 and P1
- Processes share some common variables to synchronize their actions
  - int turn; initially turn = 0
    - turn == i means it is Pi's turn to enter its critical section
  - boolean flag[2]; initially flag[0] = flag[1] = false
    - flag[i] = true means Pi is ready to enter its critical section

Process Pi (j = 1 - i is the other process):

do {
    flag[i] = true;
    turn = j;
    while (flag[j] && turn == j) ;
    critical section
    flag[i] = false;
    remainder section
} while (1);

- Meets all three requirements; solves the critical-section problem for two processes
- Can be extended to n processes by pairwise mutual exclusion, but that is too costly

Solution for n Processes: Bakery Algorithm

- Before entering its critical section, a process receives a number. The holder of the smallest number enters the critical section.
- If processes Pi and Pj receive the same number: if i < j, then Pi is served first; else Pj is served first.
- The numbering scheme always generates numbers in increasing order of enumeration; i.e., 1,2,3,3,3,3,4,5,...

Bakery Algorithm (Notation)

- Notation: < is lexicographical order on (ticket #, process id #)
  - (a,b) < (c,d) if a < c, or if a == c and b < d
  - max(a0, ..., a(n-1)) is a number k such that k >= ai for i = 0, ..., n - 1

- Shared data:
    boolean choosing[n];
    int number[n];
  Data structures are initialized to false and 0, respectively

Bakery Algorithm

do {
    choosing[i] = true;
    number[i] = max(number[0], number[1], ..., number[n-1]) + 1;
    choosing[i] = false;
    for (j = 0; j < n; j++) {
        while (choosing[j]) ;
        while ((number[j] != 0) &&
               ((number[j], j) < (number[i], i))) ;
    }
    critical section
    number[i] = 0;
    remainder section
} while (1);

Hardware Instruction Based Solutions

- Some architectures provide special instructions that can be used for synchronization

- TestAndSet: test and modify the content of a word atomically

boolean TestAndSet(boolean &target) {
    boolean rv = target;
    target = true;
    return rv;
}

- Swap: atomically swap two variables

void Swap(boolean &a, boolean &b) {
    boolean temp = a;
    a = b;
    b = temp;
}

Mutual Exclusion with Test-and-Set

Shared data:
    boolean lock = false;

Process Pi:
do {
    while (TestAndSet(lock)) ;
    critical section
    lock = false;
    remainder section
} while (1);

Mutual Exclusion with Swap

Shared data (initialized to false):
    boolean lock;
    boolean waiting[n];  /* used only in the bounded-waiting extension */

Process Pi (key is a local variable):
do {
    key = true;
    while (key == true)
        Swap(lock, key);
    critical section
    lock = false;
    remainder section
} while (1);
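On modern hardware the same test-and-set idea is exposed through, for example, C11's atomic_flag; a minimal spinlock sketch (illustrative, not from the slides):

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;   /* shared; clear == unlocked */

void enter_cs(void) {
    /* spin until the previous value was clear: classic test-and-set loop */
    while (atomic_flag_test_and_set(&lock))
        ;                              /* busy-wait */
}

void exit_cs(void) {
    atomic_flag_clear(&lock);          /* lock = false */
}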

Semaphore

- Widely used synchronization tool
- Does not require busy-waiting
  - CPU is not held unnecessarily while the process is waiting
- A semaphore S is
  - A data structure with an integer variable S.value and a queue S.q of processes
  - The data structure can only be accessed by two atomic operations, wait(S) and signal(S) (also called P(S) and V(S))
- Value of the semaphore S = value of the integer S.value

wait and signal Operations

wait(S):   if (S.value > 0) S.value--;
           else {
               add the process to S.q;
               block the process;
           }

signal(S): if (S.q is not empty)
               choose a process from S.q and unblock it;
           else S.value++;

Note: which process is picked for unblocking may depend on policy. Also, implementations can allow S.value < 0 (change the wait and signal code appropriately).

Solution of n-Process Critical Section using Semaphores

Shared data:
    semaphore mutex;  /* initially mutex = 1 */

Process Pi:
do {
    wait(mutex);
    critical section
    signal(mutex);
    remainder section
} while (1);
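The same pattern with POSIX semaphores, as an illustrative mapping (sem_wait corresponds to wait/P, sem_post to signal/V; initialization shown in a comment):

#include <semaphore.h>

sem_t mutex;   /* during setup: sem_init(&mutex, 0, 1); */

void critical(void) {
    sem_wait(&mutex);   /* wait(mutex) */
    /* critical section: access shared data */
    sem_post(&mutex);   /* signal(mutex) */
}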

Ordering Execution of Processes using Semaphores

- Execute statement B in Pj only after statement A is executed in Pi
- Use a semaphore flag initialized to 0
- Code:

    Pi:                 Pj:
    ...                 ...
    A                   wait(flag)
    signal(flag)        B

- Multiple such points of synchronization can be enforced using one or more semaphores

Pitfalls

Use carefully to avoid:
- Deadlock: two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes
- Starvation: indefinite blocking. A process may never be removed from the semaphore queue in which it is suspended

Example of Deadlock

Let S and Q be two semaphores initialized to 1

    P0:              P1:
    wait(S);         wait(Q);
    wait(Q);         wait(S);
    ...              ...
    signal(S);       signal(Q);
    signal(Q);       signal(S);

Two Types of Semaphores

- Binary semaphore: integer value can range only between 0 and 1; can be simpler to implement
- Counting semaphore: value can be any positive integer
  - Useful in cases where there are multiple copies of resources
  - l-exclusion problem: at most l processes can be in their critical section at the same time
- Can implement a counting semaphore using a binary semaphore easily (do it yourself)

Internal Implementations of Semaphores

- How do we make wait and signal atomic?
  - Should we use another semaphore? Then who makes that atomic?
- Different solutions possible
  - Interrupts: disable interrupts just before a wait or a signal call, enable them just after that
    - Works fine for uniprocessors, but not for multiprocessors
  - Use s/w-based or h/w-instruction-based solutions to put entry and exit sections around the wait/signal code
    - Since the wait/signal code is small, it won't busy-wait for too long

Classical Problems of Synchronization

- Bounded-Buffer (Producer-Consumer) Problem
- Readers and Writers Problem
- Dining-Philosophers Problem

Bounded-Buffer Problem

Shared data:
    semaphore full, empty, mutex;
Initially:
    full = 0, empty = n, mutex = 1

Bounded-Buffer Problem: Producer Process

do {
    /* produce an item in nextp */
    wait(empty);
    wait(mutex);
    /* add nextp to buffer */
    signal(mutex);
    signal(full);
} while (1);

Bounded-Buffer Problem: Consumer Process

do {
    wait(full);
    wait(mutex);
    /* remove an item from buffer to nextc */
    signal(mutex);
    signal(empty);
    /* consume the item in nextc */
} while (1);
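For reference, the pseudocode above maps directly onto POSIX threads and semaphores. This sketch is illustrative (buffer size, item type, and item counts are assumptions); compile with -pthread:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 10
int buffer[N], in = 0, out = 0;
sem_t full, empty, mutex;              /* full = 0, empty = N, mutex = 1 */

void *producer(void *arg) {
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty);              /* wait(empty) */
        sem_wait(&mutex);              /* wait(mutex) */
        buffer[in] = item; in = (in + 1) % N;
        sem_post(&mutex);              /* signal(mutex) */
        sem_post(&full);               /* signal(full) */
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 100; i++) {
        sem_wait(&full);
        sem_wait(&mutex);
        int item = buffer[out]; out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);
        printf("%d\n", item);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&full, 0, 0);
    sem_init(&empty, 0, N);
    sem_init(&mutex, 0, 1);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}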

Readers-Writers Problem

- A common shared data
  - Reader process: only reads the data
  - Writer process: only writes the data
- Synchronization requirements
  - Writers should have exclusive access to the data
    - No other reader or writer can access the data at that time
  - Multiple readers should be allowed to access the data if there is no writer accessing the data

Solution using Semaphores

Shared data:
    semaphore mutex, wrt;
    int readcount;
Initially:
    mutex = 1, wrt = 1, readcount = 0

Writer:
    wait(wrt);
    /* perform write */
    signal(wrt);

Reader:
    wait(mutex);
    readcount++;
    if (readcount == 1)
        wait(wrt);
    signal(mutex);
    /* perform read */
    wait(mutex);
    readcount--;
    if (readcount == 0)
        signal(wrt);
    signal(mutex);

Dining-Philosophers Problem

Shared data:
    semaphore chopstick[5];
Initially all values are 1

Philosopher i:
do {
    wait(chopstick[i]);
    wait(chopstick[(i+1) % 5]);
    /* eat */
    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);
    /* think */
} while (1);

Other Synchronization Constructs

- Programming constructs
  - Specify critical sections or shared data to be protected by mutual exclusion in the program using special keywords
  - Compiler can then insert appropriate code to enforce the conditions (for ex., put wait/signal calls in appropriate places in the code)
- Examples
  - Critical regions, Monitors, Barriers, ...

Memory Management

Goals of Memory Management

- Allocate available memory efficiently to multiple processes
- Main functions
  - Allocate memory to processes when needed
  - Keep track of what memory is used and what is free
  - Protect one process's memory from another

Memory Allocation

- Contiguous Allocation
  - Each process allocated a single contiguous chunk of memory
- Non-contiguous Allocation
  - Parts of a process can be allocated noncontiguous chunks of memory

In this part, we assume that the entire process needs to be in memory for it to run

Contiguous Allocation

- Fixed Partition Scheme
  - Memory broken up into fixed size partitions
    - But the sizes of two partitions may be different
  - Each partition can have exactly one process
  - When a process arrives, allocate it a free partition
    - Can apply different policies to choose a partition
  - Easy to manage
  - Problems:
    - Maximum size of process bound by max. partition size
    - Large internal fragmentation possible

Contiguous Allocation (Cont.)

- Variable Partition Scheme
  - Hole: block of available memory; holes of various sizes are scattered throughout memory
  - When a process arrives, it is allocated memory from a hole large enough to accommodate it
  - Operating system maintains information about:
    a) allocated partitions   b) free partitions (holes)

[Figure: successive memory snapshots showing the OS area and holes being created and filled as processes (e.g., process 2, 5, 8, 9, 10) enter and leave]

Dynamic Storage-Allocation Problem

How to satisfy a request of size n from a list of free holes?

- First-fit: Allocate the first hole that is big enough
- Next-fit: Similar to first-fit, but start from the last hole allocated
- Best-fit: Allocate the smallest hole that is big enough; must search the entire list, unless ordered by size. Produces the smallest leftover hole.
- Worst-fit: Allocate the largest hole; must also search the entire list. Produces the largest leftover hole.
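A minimal first-fit sketch over a linked list of holes (node layout and the shrink-from-the-front splitting policy are illustrative):

#include <stddef.h>

struct hole {
    size_t start, size;
    struct hole *next;
};

/* Returns the start address of an allocated block, or (size_t)-1. */
size_t first_fit(struct hole **free_list, size_t n) {
    for (struct hole **pp = free_list; *pp; pp = &(*pp)->next) {
        struct hole *h = *pp;
        if (h->size >= n) {        /* first hole big enough */
            size_t addr = h->start;
            h->start += n;         /* shrink the hole from the front */
            h->size  -= n;
            if (h->size == 0)      /* hole fully consumed: unlink it */
                *pp = h->next;     /* (node memory management omitted) */
            return addr;
        }
    }
    return (size_t)-1;             /* no hole large enough */
}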

Fragmentation

- External Fragmentation: total memory space exists to satisfy a request, but it is not contiguous
- Internal Fragmentation: allocated memory may be larger than requested memory; this size difference is memory internal to a partition, but not being used
- Reduce external fragmentation by compaction
  - Shuffle memory contents to place all free memory together in one large block
  - Costly

Keeping Track of Free Partitions

- Bitmap method
  - Define some basic fixed allocation unit size
  - 1 bit maintained for each allocation unit
    - 0: unit is free, 1: unit is allocated
  - Bitmap: bitstring of the bits of all allocation units
  - To allocate space of size n allocation units, find a run of n consecutive 0s in the bitmap (see the sketch below)

- Maintain a linked list of free partitions
  - Each node contains start address, size, and pointer to the next free block
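A small sketch of the bitmap search (sizes and helper names are illustrative):

#include <stdint.h>

#define UNITS 1024
static uint8_t bitmap[UNITS / 8];   /* bit i == allocation unit i */

static int bit(int i) { return (bitmap[i / 8] >> (i % 8)) & 1; }

/* Returns the first unit of a run of n zero bits, or -1 if none exists. */
int find_run(int n) {
    int run = 0;
    for (int i = 0; i < UNITS; i++) {
        run = bit(i) ? 0 : run + 1;  /* extend or reset the current run */
        if (run == n)
            return i - n + 1;
    }
    return -1;
}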

Non-contiguous Allocation

- Paging
- Segmentation

Memory Abstraction

- What does the programmer see as memory?
- Simplest: no abstraction
  - Programmer sees the physical memory
  - Compiler generates absolute physical memory addresses

Abstraction: Address Spaces

- A set of addresses that the process can use to address memory
- Each process has its own address space

The Case of No Abstraction

- Addresses generated by the compiler (instruction and data) refer to exact physical memory addresses
  - Compile time binding
  - Instructions and data must be loaded in exactly the same physical memory locations
- Advantage: Fast execution
  - No address translation overhead during actual memory access
- Problem: Unrelated processes may read/write from/to each other's address space
- Multiple processes can still be run
  - If the behavior of the processes is well-known and they use different ranges of physical addresses
    - Possible in some closed systems with known processes
  - Swapping
    - Keep one process in memory at one time
    - Copy the memory space of the process to disk when another process is to be run
    - Copy the memory space back from the disk when the process needs to be rerun
- Not good for general purpose multiprogramming systems

Memory Abstraction: Logical or Virtual Addresses

- Each process has its own address space (Logical Address Space)
- Translating to physical address: Load Time or Run Time

- Load time binding
  - Compiler generates addresses in the process's address space
  - Loader changes addresses during loading depending on where in physical memory the process is loaded
  - Advantage: No address translation overhead during running
  - Problem: total memory requirement of a process needs to be known a priori
  - Problem: Process cannot be moved during execution
  - Problem: Rogue process can still overwrite another process's memory by writing out of bounds; no runtime check

- Load time binding with runtime check
  - Addresses bound at load time, but checked at run time if within bound
  - Solves the problem of overwriting another process's memory, but increases cost of access
  - One simple method
    - H/w provided base and limit registers
      - Accessible only by OS
    - Base register loaded with the beginning physical memory address of the process, given at load time
    - Limit register loaded with the length of memory given to the process
    - On every access, hardware checks if the limit register is exceeded
      - Aborts program if the limit is exceeded

Logical or Virtual Address (contd.)

- Execution/Run time binding
  - Physical address corresponding to a logical address is found only when the logical address is used
  - Process can be moved during its execution
  - CPU generates logical addresses
  - Memory Management Unit (MMU): hardware that converts a generated logical address to a physical address before access
  - Advantage: Processes can be moved during execution, protects one process from another, can grow process memory at run time
  - Problem: Address translation overhead at run time

- The user program deals with logical addresses; it never sees the real physical addresses
- The same logical address in the address spaces of two processes must always map to different physical addresses at runtime
  - How to ensure this for run time binding?

A Simple Solution

- H/w provided base and limit registers
  - Accessible only by OS
- Programs loaded in consecutive memory locations without relocation during load
- Base register loaded with beginning physical memory address of the process
- Limit register loaded with the length of the process
  - Must be known a priori
- On every access, the MMU adds the base register to the logical address, and then checks if the limit register is exceeded
  - Aborts program if the limit is exceeded
- Hard to grow memory if needed, but possible

A Better Solution: Paging

- Allows processes to grow memory as and when needed
- Logical/Virtual address space of a process can be noncontiguous; process is allocated physical memory whenever the latter is available
- Allows multiple processes to reside in memory at the same time

Paging

- Divide physical memory into fixed-sized (power of 2) blocks called frames
- Divide logical memory into blocks of the same size called pages
- Keep track of all free frames
- To run a program of size n pages, need to find n free frames and load the program
- Page table: used to translate logical to physical addresses
  - One page table per process

Page Table

- One entry for each page in the logical address space
- Contains the base address of the page frame where the page is stored
- Also contains a valid bit
  - If set, the logical address is valid and has physical memory allocated to it
  - If not set, the logical address is invalid

Address Translation Scheme

- Address generated by CPU is divided into:
  - Page number (p): used as an index into the page table, which contains the base address of the corresponding page frame in physical memory
  - Page offset (d): combined with the base address to define the physical memory address that is sent to the memory unit

- Translation steps (see the sketch below):
  - Use the page number to index the page table
  - Get the page frame start address
  - Add the offset to that to get the actual physical memory address
  - Access the memory
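A minimal sketch of these steps in C, assuming 4 KB pages and a made-up single-level page table (the valid-bit check is omitted here; it reappears under demand paging):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                        /* assumed 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* Hypothetical page table: entry i holds the frame number for page i. */
uint32_t page_table[16] = { 5, 9, 2, 7 };   /* made-up mappings */

uint32_t translate(uint32_t logical) {
    uint32_t p = logical >> PAGE_BITS;      /* page number */
    uint32_t d = logical & (PAGE_SIZE - 1); /* page offset */
    uint32_t frame = page_table[p];         /* page-table lookup */
    return (frame << PAGE_BITS) | d;        /* frame base + offset */
}

int main(void) {
    printf("0x%x\n", translate(0x1234));    /* page 1 -> frame 9: 0x9234 */
    return 0;
}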

Address Translation Architecture

Implementation of Page Table

- Page table is kept in main memory
- Page-table base register (PTBR) points to the page table
- Page-table length register (PTLR) indicates the size of the page table
- In this scheme every data/instruction access requires two memory accesses: one for the page table and one for the data/instruction
- The two-memory-access problem can be solved by the use of a special fast-lookup hardware cache called the translation look-aside buffer (TLB)

Paging Hardware With TLB

Effective Access Time

- TLB lookup time = epsilon time units
- Assume memory cycle time is 1 time unit
- Hit ratio (alpha): percentage of times that a page number is found in the TLB
- Effective Access Time (EAT):

    EAT = (1 + epsilon) * alpha + (2 + epsilon) * (1 - alpha)
        = 2 + epsilon - alpha
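For illustration (numbers assumed, not from the slides): with epsilon = 0.2 and alpha = 0.8, EAT = 2 + 0.2 - 0.8 = 1.4 memory cycles, i.e., about 40% slower than a bare memory access; raising the hit ratio to alpha = 0.98 brings EAT down to 1.22 cycles.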

Page Table Structure

- Hierarchical Paging
- Hashed Page Tables
- Inverted Page Tables

Hierarchical Page Tables

- Break up the logical address space into multiple page tables
- A simple technique is a two-level page table

Two-Level Paging Example

- A logical address (on a 32-bit machine with 4K page size) is divided into:
  - a page number consisting of 20 bits
  - a page offset consisting of 12 bits
- Since the page table is paged, the page number is further divided into:
  - a 10-bit page number (p1, index into the outer page table)
  - a 10-bit page offset (p2, index within a page of the page table)
- Thus, a logical address is as follows:

    |  p1 (10)  |  p2 (10)  |  d (12)  |
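A small sketch of extracting the three fields (bit positions follow the 10/10/12 layout above; names are illustrative):

#include <stdint.h>

void split(uint32_t addr, uint32_t *p1, uint32_t *p2, uint32_t *d) {
    *p1 = addr >> 22;            /* top 10 bits: index into outer table */
    *p2 = (addr >> 12) & 0x3FF;  /* next 10 bits: index into inner table */
    *d  = addr & 0xFFF;          /* low 12 bits: offset within the page */
}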

Two-Level Page-Table Scheme

Address-Translation Scheme

- Address-translation scheme for a two-level 32-bit paging architecture

Hashed Page Tables

- Common in address spaces > 32 bits
- The virtual page number is hashed into a page table. This page table contains a chain of elements hashing to the same location.
- Virtual page numbers are compared in this chain searching for a match. If a match is found, the corresponding physical frame is extracted.

Hashed Page Table

Inverted Page Table

- One entry for each real page of memory (page frame)
- Entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns that page
- Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
- Use a hash table to limit the search to one or at most a few page-table entries

Inverted Page Table Architecture

Protection

- A protection bit can be kept with each page in the page table
  - Ex. read-only page
  - Bits set by OS
- MMU can check for access type when translating the address
  - Traps if illegal access
- More elaborate protections possible with h/w support

Shared Pages

- Example: Shared code
  - One copy of read-only code shared among processes (i.e., text editors, compilers, window systems)
  - Store the shared page in a single page frame
  - Map it to the logical address spaces of the processes by inserting appropriate entries in their page tables that all point to the shared page frame

Segmentation

- Memory-management scheme that supports the user view of memory
- A program is a collection of segments. A segment can be any logical unit
  - code, global variables, heap, stack, ...
- Segment sizes may be different

Segmentation Architecture

- Logical address consists of a two-tuple:
    <segment-number, offset>
- Segment table maps two-dimensional logical addresses to one-dimensional physical addresses; each table entry has:
  - base: contains the starting physical address where the segment resides in memory
  - limit: specifies the length of the segment
- Segment-table base register (STBR) points to the segment table's location in memory
- Segment-table length register (STLR) indicates the number of segments used by a program; segment number s is legal if s < STLR

Segmentation Architecture (Cont.)

- Protection. With each entry in the segment table associate:
  - validation bit = 0 => illegal segment
  - read/write/execute privileges
- Protection bits associated with segments; code sharing occurs at segment level
- Since segments vary in length, memory allocation is a dynamic storage-allocation problem

Segmentation Hardware

Example of Segmentation

Sharing of Segments

Virtual Memory

Basic Concept

- Usually, only part of the program needs to be in memory for execution
- Allow the logical address space to be larger than the physical memory size
- Bring only what is needed into memory, when it is needed
- Virtual memory implementation
  - Demand paging
  - Demand segmentation

Virtual Memory That is Larger Than Physical Memory

Demand Paging

- Bring a page into memory only when it is needed (on demand)
  - Less I/O needed to start a process
  - Less memory needed
  - Faster response
  - More users
- Page is needed => reference to it
  - invalid reference => abort
  - not-in-memory => bring to memory

Transfer of a Paged Memory to Contiguous Disk Space

Some Questions

- How to know if a page is in memory?
  - Valid bit
- If not present, what happens during a reference to a logical address in that page?
  - Page fault
- If not present, where in disk is it?
- If a new page is brought to memory, where should it be placed?
  - Any free page frame
- What if there is no free page frame?
  - Page replacement policies
- Is it always necessary to copy a page back to disk on replacement?
  - Dirty/Modified bit

Valid-Invalid Bit

- With each page table entry a valid-invalid bit is associated (1: in-memory, 0: not-in-memory)
- Initially, the valid-invalid bit is set to 0 on all entries
- Set to 1 when memory is allocated for a page (page brought into memory)
- Address translation steps:
  - Use the page no. in the address to index into the page table
  - Check the valid bit
  - If set, get the page frame start address, add the offset to get the memory address, access memory
  - If not set, page fault

Page Table When Some Pages Are Not in Main Memory

Page Fault

- If there is ever a reference to a page not in memory, the first reference will trap to the OS: page fault
- OS looks at another table to decide:
  - Invalid reference => abort
  - Just not in memory => get from disk
- Get an empty frame
- Get the disk address of the page (in the PTE)
- Swap the page into the frame from disk (context switch for I/O wait)
- Modify the PTE entry for the page with the frame no., set valid bit = 1
- Restart the instruction that caused the fault; tricky cases:
  - block move
  - auto increment/decrement location

Steps in Handling a Page Fault

Performance of Demand Paging

- Page Fault Rate p, 0 <= p <= 1.0
  - if p = 0, no page faults
  - if p = 1, every reference is a fault
- Effective Access Time (EAT):

    EAT = (1 - p) x memory access time
        + p x (page fault overhead
               + [swap page out]
               + swap page in
               + restart overhead)

Demand Paging Example

- Memory access time = 150 ns
- 50% of the time the page being replaced has been modified and therefore needs to be swapped out
- Swap page time = 10 msec = 10^7 ns
- EAT = (1 - p) x 150 + p x (1.5 x 10^7) ns
      ~ 150 + p x (1.5 x 10^7) ns
- For EAT = 200 ns, we need p = 0.0000033!
- For EAT = 165 ns (10% loss), we need p = 0.000001, or 1 fault in 1,000,000 accesses

Page in Disk

- Swap space: part of disk divided into page-sized slots
  - First slot has swap space management info such as free slots etc.
- Pages can be swapped out from memory to a free page slot
  - Address of page: <swap partition no., page slot no.>
  - Address stored in PTE
- Read-only pages can be read from the file system directly using memory-mapped files

What happens if there is no free frame?

- Page replacement: find some page in memory, but not really in use, and swap it out
  - algorithm?
  - performance: want an algorithm which will result in the minimum number of page faults
- The same page may be brought into memory several times

Page Replacement

- Prevent over-allocation of memory by modifying the page-fault service routine to include page replacement
- Use the modified (dirty) bit to reduce the overhead of page transfers
  - Set to 0 when a page is brought into memory
  - Set to 1 when some location in the page is changed
  - Copy the page to disk on replacement only if the bit is set to 1
- Page replacement completes the separation between logical memory and physical memory: a large virtual memory can be provided on a smaller physical memory

Need For Page Replacement

Basic Page Replacement

1. Find the location of the desired page on disk
2. Find a free frame
   - If there is a free frame, use it
   - If there is no free frame, use a page replacement algorithm to select a victim frame
3. Copy the to-be-replaced frame to disk if needed
4. Read the desired page into the (newly) free frame; update the page and frame tables
5. Restart the process

Page Replacement

Page Replacement Algorithms

- Want the lowest page-fault rate
- Evaluate an algorithm by running it on a particular string of memory references (reference string of pages referenced) and computing the number of page faults on that string

First-In-First-Out (FIFO) Algorithm

- Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- 3 frames (3 pages can be in memory at a time per process): 9 page faults

FIFO Page Replacement

Belady's Anomaly

- The number of page faults may increase with an increase in the number of page frames for FIFO
  - Counter-intuitive
- Consider 4 page frames on the same reference string: 10 page faults

Belady's Anomaly

Optimal Algorithm

- Replace the page that will not be used for the longest period of time
- 4 frames example, reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: 6 page faults
- How do you know this?
- Used for measuring how well your algorithm performs

Optimal Page Replacement

Least Recently Used (LRU) Algorithm

- Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- Counter implementation
  - Every page entry has a counter; every time the page is referenced through this entry, copy the clock into the counter
  - When a page needs to be replaced, look at the counters to find the least recently used page
LRU Page Replacement

LRU Algorithm (Cont.)

- Stack implementation: keep a stack of page numbers in a doubly linked form
  - Page referenced:
    - move it to the top
    - requires 6 pointers to be changed
  - No search for replacement

Use Of A Stack to Record The Most Recent Page References
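A minimal sketch of the counter (timestamp) implementation (frame count and the logical clock are illustrative):

#include <stdint.h>

#define NFRAMES 4
static uint64_t last_used[NFRAMES]; /* "clock" copied on each reference */
static uint64_t clock_now = 0;

void on_reference(int frame) {
    last_used[frame] = ++clock_now; /* record time of use */
}

int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (last_used[i] < last_used[victim]) /* oldest timestamp */
            victim = i;
    return victim;
}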

LRU Approximation Algorithms

- Reference bit
  - 1 bit per page frame, initially = 0
  - When the page frame is referenced, the bit is set to 1
  - Periodically reset to 0
  - Replace a frame whose bit is 0 (if one exists). We do not know the order, however

- Additional Reference Bits algorithm (see the sketch below)
  - K bits, marked 1 to K from the MSB
  - The ith bit indicates if the page was accessed in the ith most recent interval
  - At every interval (timer interrupt), right shift the bits (LSB drops off), shift the reference bit into the MSB, and reset the reference bit to 0
  - To replace, find the frame with the smallest value of the K bits
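A small sketch of the additional-reference-bits (aging) bookkeeping, assuming K = 8 and a simulated reference bit array (names are illustrative):

#include <stdint.h>

#define NFRAMES 64

static uint8_t age[NFRAMES];     /* K = 8 history bits per frame */
static uint8_t ref_bit[NFRAMES]; /* hardware-set reference bit (simulated) */

/* Called from the (hypothetical) timer interrupt handler. */
void age_tick(void) {
    for (int i = 0; i < NFRAMES; i++) {
        age[i] = (age[i] >> 1) | (uint8_t)(ref_bit[i] << 7); /* ref bit -> MSB */
        ref_bit[i] = 0;                                      /* reset for next interval */
    }
}

/* Victim = frame with smallest age value (least recently used, approx). */
int pick_victim(void) {
    int victim = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (age[i] < age[victim])
            victim = i;
    return victim;
}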

Second Chance

- Needs a reference bit
- Clock replacement
- If the page to be replaced (in clock order) has reference bit = 1, then:
  - set reference bit 0
  - leave the page in memory
  - replace the next page (in clock order), subject to the same rules

Second-Chance (Clock) Page-Replacement Algorithm

Counting Algorithms

- Keep a counter of the number of references that have been made to each page
- LFU Algorithm: replaces the page with the smallest count
- MFU Algorithm: based on the argument that the page with the smallest count was probably just brought in and has yet to be used

Allocation of Frames

- Each process needs some minimum number of pages
- No process should use up nearly all page frames
- Two major allocation schemes
  - fixed allocation
  - priority allocation

Fixed Allocation

- Equal allocation: e.g., if 100 frames and 5 processes, give each 20 pages
- Proportional allocation: allocate according to the size of the process

Priority Allocation

- Use a proportional allocation scheme using priorities rather than size
- If process Pi generates a page fault,
  - select for replacement one of its frames, or
  - select for replacement a frame from a process with lower priority number

Global vs. Local Allocation

- Global replacement: a process selects a replacement frame from the set of all frames; one process can take a frame from another
- Local replacement: each process selects from only its own set of allocated frames

Why does paging work?

- Locality of reference
  - Processes tend to access locations which are close to each other (spatial locality) or which were accessed in the recent past (temporal locality)
- Locality of reference implies that once a set of pages is brought in for a process, there is less chance of page faults by the process for some time
- Also, the TLB hit ratio will be high
- Working set: the set of pages currently needed by a process
- The working set of a process changes over time
  - But remains the same for some time due to locality of reference

Thrashing

- If a process does not have enough pages, the page-fault rate is very high. This leads to:
  - low CPU utilization
  - operating system thinks that it needs to increase the degree of multiprogramming
  - another process is added to the system
  - this adds to the problem, as even fewer page frames are available for each process
- Thrashing: a process is busy swapping pages in and out
- Thrashing occurs when
    sum of the working sets of all processes > total memory size

Working-Set Model

- Delta = working-set window = a fixed number of page references
  - Example: 10,000 instructions
- WSSi (working set size of process Pi) = total number of pages referenced in the most recent Delta (varies in time)
  - if Delta is too small, it will not encompass the entire locality
  - if Delta is too large, it will encompass several localities
  - if Delta = infinity, it will encompass the entire program
- D = sum of all WSSi = total demand for frames
- If D > m (total number of frames), thrashing occurs
- Policy: if D > m, then suspend one of the processes

Working-set model

Keeping Track of the Working Set

- Approximate with an interval timer + a reference bit
- Example: Delta = 10,000
  - Timer interrupts after every 5000 time units
  - Keep in memory 2 bits for each page
  - Whenever the timer interrupts, copy and then set the values of all reference bits to 0
  - If one of the bits in memory = 1, the page is in the working set
- Why is this not completely accurate?
- Improvement: 10 bits and interrupt every 1000 time units

Page-Fault Frequency Scheme

- Establish an acceptable page-fault rate
  - If the actual rate is too low, the process loses a frame
  - If the actual rate is too high, the process gains a frame

What should the page size be?

- Reduce internal fragmentation: smaller is better
- Reduce page table size: larger is better
- Capture locality: smaller is better
- Reduce I/O overhead: larger is better
- Needs to be chosen judiciously

Other Considerations

- Prepaging
  - Bring in pages not referenced yet
- TLB Reach
  - The amount of memory accessible from the TLB
  - TLB Reach = (TLB Size) x (Page Size)
  - Ideally, the working set of each process is stored in the TLB; otherwise there is a high degree of page faults

Other Considerations (Cont.)

- Program structure
    int A[][] = new int[1024][1024];
  Each row is stored in one page

  Program 1:
    for (j = 0; j < A.length; j++)
        for (i = 0; i < A.length; i++)
            A[i][j] = 0;
  1024 x 1024 page faults!!

  Program 2:
    for (i = 0; i < A.length; i++)
        for (j = 0; j < A.length; j++)
            A[i][j] = 0;
  1024 page faults

Other Considerations (Cont.)

- I/O Interlock: pages must sometimes be locked into memory
  - Pages that are used for copying a file from a device must be locked from being selected for eviction by a page replacement algorithm
  - Some OS pages need to be in memory all the time
  - Use a lock bit to indicate that the page is locked and cannot be replaced

File Management

Two Parts

- Filesystem Interface
  - Interface the user sees
    - Organization of the files as seen by the user
    - Operations defined on files
    - Properties that can be read/modified
- Filesystem Design
  - Implementing the interface

Filesystem Interface

Basic Topics

- File Concept
- Access Methods
- Directory Structure
- File System Mounting
- File Sharing
- Protection

File Concept

- Logical unit of information on secondary storage
- Named collection of related info on secondary storage
- Smallest unit of allocation on disk
  - All info must be in at least one file
- Abstracts out the secondary storage details by presenting a common logical storage view

File Types

- Data
  - Text, binary, ...
- Program
- Regular files: store information
- Directory: stores information about file(s)
- Device files: represent different devices

File Structure

- None: sequence of words, bytes
- Simple record structure
  - Lines
  - Fixed length
  - Variable length
- Complex structures
  - Formatted document
  - Relocatable load file

Important File Attributes

- Name: only information kept in human-readable form
- Type: needed for systems that support different types
- Location: pointer to file location on device
- Size: current file size
- Protection: controls who can do reading, writing, executing
- Time, date, and user identification: data for protection, security, and usage monitoring

Information about files is kept in the directory structure, which is maintained on the disk

File Operations

- Create
- Write
- Read
- Reposition within file (file seek)
- Delete
- Truncate
- Open(Fi): search the directory structure on disk for entry Fi, and move the content of the entry to memory
- Close(Fi): move the content of entry Fi in memory to the directory structure on disk

Access Methods

- Sequential Access
    read next
    write next
    reset
- Direct Access (n = relative block number)
    read n
    write n
    position to n
    read next
    write next

Sequential-access File

Example of Index and Relative


Files

Directory Structure

z A collection of nodes containing information about all files
(figure: a directory whose entries point to files F1, F2, F3, F4, ..., Fn)

A Typical File-system Organization (figure)
Information in a Device Directory

z Name
z Type
z Address
z Current length
z Maximum length
z Date last accessed (for archival)
z Date last updated (for dump)
z Owner ID (who pays)
z Protection information (discussed later)
Operations Performed on a Directory

z Search for a file
z Create a file
z Delete a file
z List a directory
z Rename a file
z Traverse the file system
Organize the Directory (Logically) to Obtain

z Efficiency - locating a file quickly
z Naming - convenient to users
  z Two users can have the same name for different files
  z The same file can have several different names
z Grouping - logical grouping of files by properties (e.g., all Java programs, all games, ...)
Single-Level Directory

z A single directory for all users
z Problems
  z Naming problem
  z Grouping problem
Two-Level Directory

z Separate directory for each user
z Path names
z Can have the same file name for different users
z Efficient searching
z No grouping capability
Tree-Structured Directories (figure)

Tree-Structured Directories (Cont.)

z Efficient searching
z Grouping capability
z Current directory (working directory)
  z cd /spell/mail/prog
  z type list
Tree-Structured Directories (Cont.)

z Absolute or relative path names
z Creating a new file is done in the current directory
z Delete a file: rm <file-name>
z Creating a new subdirectory is done in the current directory: mkdir <dir-name>
Acyclic-Graph Directories

z Have shared subdirectories and files

Acyclic-Graph Directories (Cont.)

z Two different names (aliasing)
z If dict deletes list => dangling pointer
z Solutions:
  z Backpointers, so we can delete all pointers
    z Variable size records are a problem
  z Backpointers using a daisy chain organization
  z Entry-hold-count solution
General Graph Directory (figure)

General Graph Directory (Cont.)

z How do we guarantee no cycles?
  z Allow only links to files, not subdirectories
  z Garbage collection
  z Every time a new link is added, use a cycle detection algorithm to determine whether it is OK
File System Mounting

z A filesystem must be mounted before it can be accessed
z One filesystem is designated as the root filesystem
z The root directory of the root filesystem is the system root directory
z Parts of other filesystems are added to the directory tree under root by mounting them onto a directory in the root filesystem
z The directory onto which a filesystem is mounted is called the mount point
z The previous contents of the mount point become inaccessible
(figure: a root filesystem containing /usr, /sys, /dev, /etc, /bin, and a filesystem fs1 containing /local, /users, /bin, /adm, mounted on /usr)

z Accessing /usr/adm/... now actually accesses /adm/... in filesystem fs1
z /usr in the root file system is the mount point
z Anything under /usr in the root filesystem becomes inaccessible until fs1 is unmounted
z Mounting can now be done on any other mount point, including any directory on an earlier mounted filesystem
  z Ex. we can now mount some other filesystem fs2 on /usr/adm; this will hide all files under /adm of fs1, and access to /usr/adm will go to the corresponding part of fs2
z Need not mount / always; can mount any subtree of a filesystem on a mount point to add only part of a filesystem (but it has to be a complete subtree)
File Sharing

z Create links to files
  z The same file accessed from two different places in the directory structure, using possibly different names
z Soft links vs. hard links
Protection

z File owner/creator should be able to control:
  z what can be done
  z by whom
z Types of access
  z Read
  z Write
  z Execute
  z Append
  z Delete
  z List
Filesystem Implementation

Basic Topics
z Data Structures for File Access
z Disk Layout of Filesystems
z Allocating Storage for Files
z Directory Implementation
z Free-Space Management
z Virtual Filesystems
z Efficiency and Performance
z Recovery
Data Structures for File Access

z File Control Block (FCB)
  z One per file
  z Contains file attributes and the location of the disk blocks of the file
  z Stored on disk, usually brought to memory when the file is opened
z Open File Table
  z In-memory table with one entry per open file
  z Each entry points to the FCB of the file (on disk, or usually to a copy in memory)
  z Can be hierarchical
    z Per-process table with entries pointing to entries in a single system-wide table
    z System-wide table points to the FCB of the file

A Typical File Control Block (figure)

In-Memory Open File Tables (figure)
Disk Layout

z Files are stored on disks. Disks are broken up into one or more partitions, with a separate filesystem on each partition
z Sector 0 of the disk is the Master Boot Record (MBR)
  z Used to boot the computer
z The end of the MBR holds the partition table, with the starting and ending addresses of each partition
z One of the partitions is marked active in the master boot record

Disk Layout (contd.)

z Boot computer => BIOS reads/executes MBR
z MBR finds the active partition and reads in its first block (the boot block)
z The program in the boot block locates the OS for that partition and reads it in
z All partitions start with a boot block
One Possible Example (figure)

z Superblock contains info about the fs (e.g. type of fs, number of blocks, ...)
z i-nodes contain info about files
  z "i-node" is the common Unix name for the FCB
Allocation Methods

z An allocation method refers to how disk blocks are allocated for files
z Possibilities
  z Contiguous allocation
  z Linked allocation
  z Indexed allocation
Contiguous Allocation

z Each file occupies a set of contiguous blocks on the disk
z Easy to implement - only the starting location (block #) and length (number of blocks) are required
z Random access
z Wasteful of space (dynamic storage-allocation problem)
  z Fragmentation possible
z Files cannot grow

Contiguous Allocation of Disk Space (figure)
Extent-Based Systems

z Many newer file systems use a modified contiguous allocation scheme
z Extent-based file systems allocate disk blocks in extents
z An extent is a contiguous sequence of disk blocks. Extents are allocated for file allocation; a file consists of one or more extents
Linked Allocation

z Each file is a linked list of disk blocks; blocks may be scattered anywhere on the disk
z Simple - need only the starting address
z Free-space management system - no waste of space
z No random access

Linked Allocation (figure)
Linked List Allocation

Linked lists using a table in


memory
z
z
z

Put pointers in table in memory


File Allocation Table (FAT)
Still have to traverse pointers, but now in
memory
But table becomes really big
z
z

200 GB disk with 1 KB blocks needs a 600 MB table


Growth of the table size is linear with the growth of the
disk size

File-Allocation Table

Indexed Allocation

z Brings all pointers together into the index block
z Logical view: a per-file index table whose entries point to the data blocks

Example of Indexed Allocation (figure)
Indexed Allocation (Cont.)

z Need index table
z Random access
z Dynamic access without external fragmentation, but with the overhead of the index block
z Mapping from logical to physical in a file of maximum size 256K words and block size of 512 words: we need only 1 block for the index table

Indexed Allocation Mapping (Cont.)

z Mapping from logical to physical in a file of unbounded length (block size of 512 words)
z Linked scheme - link blocks of the index table together (no limit on size)
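Worked mapping for the bounded case above (assuming one-word block pointers): 256K words / 512 words per block = 512 data blocks, and one 512-word index block holds exactly 512 pointers, so a single index block suffices. A logical address LA then maps as

  index entry = LA div 512,  offset within block = LA mod 512

i.e. 9 bits of index plus 9 bits of offset cover the full 2^18 = 256K words.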
Two-level Indexing

z Two-level index: maximum file size is 512^3 words
  outer-index -> index table -> file
i-nodes

z FCB in Unix
z Contains file attributes and the disk addresses of blocks
z One block can hold only a limited number of disk block addresses, which limits the size of a file
z Solution: use some of the blocks to hold the addresses of blocks holding the addresses of disk blocks of files
  z Can take this to more than one level

i-node with one-level indirection (figure)
Unix i-node

z File attributes
z 12 direct pointers
z 1 singly indirect pointer
  z Points to a block that has disk block addresses
z 1 doubly indirect pointer
  z Points to a block that points to blocks that have disk block addresses
z 1 triply indirect pointer
  z Points to a block that points to blocks that point to blocks that have disk block addresses
z What is the max. file size possible??
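A worked answer under assumed parameters (4 KB blocks and 4-byte block addresses, so 1024 addresses per block; other choices change the numbers):

  direct:   12 x 4 KB       =  48 KB
  single:   1024 x 4 KB     =   4 MB
  double:   1024^2 x 4 KB   =   4 GB
  triple:   1024^3 x 4 KB   =   4 TB
  total:    ~ 4 TB (dominated by the triply indirect tree)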
Directory Implementation

z Linear list of file names, each with a pointer to its data blocks
  z Address of first block (contiguous)
  z Number of first block (linked)
  z Number of i-node
  z Simple to program
  z Time-consuming to execute
z Hash table - linear list with a hash data structure
  z Decreases directory search time
  z Collisions - situations where two file names hash to the same location
  z Fixed size
Free-Space Management

z Bit vector (n blocks): bit[i] = 1 if block[i] is free, 0 if block[i] is occupied
z Block number calculation for the first free block:
  (number of bits per word) * (number of 0-value words) + offset of first 1 bit
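A minimal C sketch of that calculation, following the formula literally (32-bit words; names are illustrative):

#include <stdint.h>

#define BITS_PER_WORD 32

/* Returns the block number of the first free block (bit = 1 means free),
   or -1 if no bit is set anywhere in the bitmap. */
int first_free_block(const uint32_t *bitmap, int nwords)
{
    int zero_words = 0;
    while (zero_words < nwords && bitmap[zero_words] == 0)
        zero_words++;                       /* number of 0-value words */
    if (zero_words == nwords)
        return -1;

    uint32_t word = bitmap[zero_words];
    int offset = 0;
    while (((word >> offset) & 1) == 0)     /* offset of first 1 bit */
        offset++;

    return BITS_PER_WORD * zero_words + offset;
}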
Free-Space Management (Cont.)

z Bit map requires extra space. Example:
  block size = 2^12 bytes
  disk size = 2^30 bytes (1 gigabyte)
  n = 2^30 / 2^12 = 2^18 bits (or 32K bytes)
z Easy to get contiguous files
z Linked list (free list)
  z Cannot get contiguous space easily
  z No waste of space
  z May need a number of disk accesses to find a free block
z Grouping
z Counting
Free-Space Management (Cont.)

z Need to protect:
  z Pointer to free list
  z Bit map
    z Must be kept on disk
    z Copy in memory and disk may differ

Linked Free Space List on Disk (figure)
Virtual File Systems

z Virtual File Systems (VFS) provide an object-oriented way of implementing file systems
z VFS allows the same system call interface (the API) to be used for different types of file systems
z The API is to the VFS interface, rather than to any specific type of file system

Schematic View of Virtual File System (figure)
How VFS Works

z A file system registers with the VFS (e.g. at boot time)
z At registration time, the fs provides the list of addresses of the function calls the VFS wants
z The VFS gets info from the new fs i-node and puts it in a v-node
z Makes an entry in the fd table for the process
z When a process issues a call (e.g. read), function pointers direct it to the concrete function calls

(figure: a simplified view of the data structures and code used by the VFS and a concrete file system to do a read)
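A minimal sketch of the registration idea in C - a concrete filesystem hands the VFS a table of function pointers, and the VFS dispatches through it. All names here (vfs_ops, vfs_register, vfs_read) are illustrative, not the real Linux interface:

#include <stddef.h>

struct vfs_ops {                        /* what the VFS wants from each fs */
    int  (*open)(const char *path);
    long (*read)(int fd, void *buf, size_t n);
    int  (*close)(int fd);
};

static const struct vfs_ops *registered_fs;   /* one slot, for simplicity */

void vfs_register(const struct vfs_ops *ops)  /* called by the fs at boot */
{
    registered_fs = ops;
}

long vfs_read(int fd, void *buf, size_t n)    /* system-call entry point */
{
    /* dispatch through the pointer the concrete fs registered */
    return registered_fs->read(fd, buf, n);
}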
Efficiency and Performance

z Efficiency dependent on:
  z disk allocation and directory algorithms
  z types of data kept in a file's directory entry
z Performance
  z disk cache - separate section of main memory for frequently used blocks
  z free-behind and read-ahead - techniques to optimize sequential access
  z improve PC performance by dedicating a section of memory as a virtual disk, or RAM disk

Various Disk-Caching Locations (figure)
Page Cache

z A page cache caches pages rather than disk blocks, using virtual memory techniques
z Memory-mapped I/O uses a page cache
z Routine I/O through the file system uses the buffer (disk) cache

I/O Without a Unified Buffer Cache (figure)

Unified Buffer Cache

z A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O

I/O Using a Unified Buffer Cache (figure)
Recovery

z Consistency checking - compares data in the directory structure with the data blocks on disk, and tries to fix inconsistencies
z Use system programs to back up data from disk to another storage device (floppy disk, magnetic tape)
z Recover a lost file or disk by restoring data from backup
Disk Management

Physical Disk Structure (figure)

Disk Structure

z Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer
z The 1-dimensional array of logical blocks is mapped onto the sectors of the disk sequentially
  z Sector 0 is the first sector of the first track (top platter) on the outermost cylinder
  z Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost
Disk Access Time

z Two major components
  z Seek time - the time for the disk to move the heads to the cylinder containing the desired sector
    z Typically 5-10 milliseconds
  z Rotational latency - the additional time waiting for the disk to rotate the desired sector under the disk head
    z Typically 2-4 milliseconds
z One minor component
  z Read/write time or transfer time - the actual time to transfer a block, less than a millisecond
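As a worked example with assumed spindle speeds: at 7200 RPM one rotation takes 60000 / 7200 = 8.33 ms, so the average rotational latency (half a rotation) is about 4.2 ms; at 15000 RPM it drops to about 2 ms - matching the 2-4 ms range quoted above.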
Disk Scheduling

z Should ensure fast access time and high disk bandwidth
z Fast access
  z Minimize the total seek time of a group of requests
  z If requests are for different cylinders, the average rotational latency has to be incurred for each anyway, so minimizing it is not the primary goal (though some scheduling is possible if there are multiple requests for the same cylinder)
z Seek time is roughly proportional to seek distance
  z Main goal: reduce the total seek distance for a group of requests
  z Auxiliary goal: fairness in waiting times for the requests
z Disk bandwidth - the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer
Disk Scheduling (Cont.)

z Several algorithms exist to schedule the servicing of disk I/O requests
z We illustrate them with a request queue (cylinders 0-199):
  98, 183, 37, 122, 14, 124, 65, 67
  Head pointer at cylinder 53

FCFS

z Service requests in the order they come
z Fair to all requests
z Can cause very large total seek time over all requests if the load is moderate to high
z Illustration shows total head movement of 640 cylinders
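Checking the 640-cylinder figure: the head visits 53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67, so the movement is 45 + 85 + 146 + 85 + 108 + 110 + 59 + 2 = 640 cylinders.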
SSTF

z Selects the request with the minimum seek time from the current head position
z SSTF scheduling is a form of SJF scheduling
  z May cause starvation of some requests, like SJF
  z But not optimal, unlike SJF
z Minimizes seek time, but not fair
z May work well if the load is not high

SSTF (Cont.)

z Total head movement = 236 cylinders
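Checking the 236-cylinder figure: from 53, the nearest-first order is 53 -> 65 -> 67 -> 37 -> 14 -> 98 -> 122 -> 124 -> 183, giving 12 + 2 + 30 + 23 + 84 + 24 + 2 + 59 = 236 cylinders.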
SCAN

z The disk arm starts at one end of the disk and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues
z Sometimes called the elevator algorithm

SCAN (Cont.) (figure)
C-SCAN

z Provides a more uniform wait time than SCAN
z The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip
z Treats the cylinders as a circular list that wraps around from the last cylinder to the first one

C-SCAN (Cont.) (figure)
C-LOOK

z Version of C-SCAN
z The arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk

C-LOOK (Cont.) (figure)
Selecting a Disk-Scheduling Algorithm

z SSTF is common and has a natural appeal
z SCAN and C-SCAN perform better for systems that place a heavy load on the disk
z Performance depends on the number and types of requests
z Requests for disk service can be influenced by the file-allocation method
z The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary
z Either SSTF or C-LOOK is a reasonable choice for the default algorithm (depending on load)
Disk Management

z Low-level formatting, or physical formatting - dividing a disk into sectors that the disk controller can read and write
z To use a disk to hold files, the operating system still needs to record its own data structures on the disk
  z Partition the disk into one or more groups of cylinders
  z Logical formatting, or "making a file system"
z Boot block initializes the system
  z The bootstrap is stored in ROM
  z Bootstrap loader program
z Methods such as sector sparing are used to handle bad blocks
Operating System Issues

z Major OS jobs are to manage physical devices and to present a virtual machine abstraction to applications
z For hard disks, the OS provides two abstractions:
  z Raw device - an array of data blocks
  z File system - the OS queues and schedules the interleaved requests from several applications
Application Interface

z Most OSs handle removable disks almost exactly like fixed disks - a new cartridge is formatted and an empty file system is generated on the disk
z Tapes are presented as a raw storage medium, i.e., an application does not open a file on the tape, it opens the whole tape drive as a raw device
z Usually the tape drive is reserved for the exclusive use of that application
z Since the OS does not provide file system services, the application must decide how to use the array of blocks
z Since every application makes up its own rules for how to organize a tape, a tape full of data can generally only be used by the program that created it
CPU Scheduling

Linux Scheduler History

z We will be talking about the O(1) scheduler

SMP Support in 2.4 and 2.6 Versions

(figure: scheduling across CPU1, CPU2, CPU3 in the 2.4 kernel vs. the 2.6 kernel)
Linux Scheduling

z 3 scheduling classes
  z SCHED_FIFO and SCHED_RR are real-time classes
  z SCHED_OTHER is for the rest
z 140 priority levels
  z 1-100: RT priorities
  z 101-140: user task priorities
z Three different scheduling policies
  z One for normal tasks
  z Two for real-time tasks
z Pre-emptive, priority-based scheduling
  z When a process with higher real-time priority (rt_priority) wishes to run, all other processes with lower real-time priority are thrust aside
  z In SCHED_FIFO, a process runs until it relinquishes control or another with higher real-time priority wishes to run
  z A SCHED_RR process, in addition to this, is also interrupted when its time slice expires or there are processes of the same real-time priority (RR between processes of this class)
  z SCHED_OTHER is also round-robin, with a lower time slice
SCHED_OTHER: Normal Tasks

z Each task is assigned a Nice value
  z Nice value between -20 and +19
  z Static priority = 120 + Nice
z Assigned a time slice
z Tasks at the same priority are round-robined
  z Ensures priority + fairness
Basic Philosophies

z Priority is the primary scheduling mechanism
z Priority is dynamically adjusted at run time
  z Processes denied access to the CPU get their priority increased
  z Processes running a long time get their priority decreased
z Try to distinguish interactive processes from non-interactive ones
  z Bonus or penalty reflecting whether I/O-bound or compute-bound
z Use large quanta for important processes
  z Modify quanta based on CPU use
z Associate processes to CPUs
z Do everything in O(1) time
The Runqueue

z 140 separate queues, one for each priority level
z Actually, two sets: active and expired
z Priorities 0-99 for real-time processes
z Priorities 100-139 for normal processes; value set via the nice()/setpriority() system calls

Linux 2.6 Scheduler Runqueue Structure (figure)
Scheduler Runqueue

z A scheduler runqueue is a list of tasks that are runnable on a particular CPU
z An rq structure maintains a linked list of those tasks
z The runqueues are maintained as an array, runqueues, indexed by the CPU number
z The rq keeps a reference to its idle task
  z The idle task for a CPU is never on the scheduler runqueue for that CPU (it's always the last choice)
z Access to a runqueue is serialized by acquiring and releasing rq->lock
Basic Scheduling Algorithm

z Find the highest-priority queue with a runnable process
z Find the first process on that queue
z Calculate its quantum size
z Let it run
z When its time is up, put it on the expired list
  z Recalculate its priority first
z Repeat
Process Descriptor Fields Related to the Scheduler

z thread_info->flags
z thread_info->cpu
z state
z prio
z static_prio
z run_list
z array
z sleep_avg
z timestamp
z last_ran
z activated
z policy
z cpus_allowed
z time_slice
z first_time_slice
z rt_priority
The Highest Priority Process

z There is a bit map indicating which queues have processes that are ready to run
z Find the first bit that's set:
  z 140 queues => 5 32-bit integers
  z Only a few compares to find the first one that is non-zero
  z Hardware instruction to find the first 1-bit
    z bsfl on Intel
z Time depends on the number of priority levels, not the number of processes
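A minimal sketch of the bitmap search in C (the same idea as, but not identical to, the kernel's sched_find_first_bit):

#include <stdint.h>

#define NUM_PRIO 140
#define NWORDS ((NUM_PRIO + 31) / 32)   /* = 5 words */

/* Returns the highest priority (lowest index) with a runnable task,
   or -1 if every queue is empty. */
int highest_runnable_prio(const uint32_t bitmap[NWORDS])
{
    for (int w = 0; w < NWORDS; w++) {   /* at most 5 compares */
        if (bitmap[w] != 0)
            /* __builtin_ffs compiles down to a single find-first-bit
               instruction (bsfl on x86) */
            return w * 32 + __builtin_ffs((int)bitmap[w]) - 1;
    }
    return -1;
}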
Scheduling Components

z Static priority
z Sleep average
z Bonus
z Dynamic priority
z Interactivity status
Static Priority

z Each task has a static priority that is set based upon the nice value specified by the task
  z static_prio in task_struct
  z Value between 0 and 139 (between 100 and 139 for normal processes)
z Each task also has a dynamic priority that is set based upon a number of factors
  z Tries to increase the priority of interactive jobs
Sleep Average

z Interactivity heuristic: sleep ratio
  z Mostly sleeping: I/O bound
  z Mostly running: CPU bound
z Sleep ratio approximation
  z sleep_avg in the task_struct
  z Range: 0 .. MAX_SLEEP_AVG
z When a process wakes up (is made runnable), recalc_task_prio adds in how many ticks it was sleeping (blocked), up to some maximum value (MAX_SLEEP_AVG)
z When a process is switched out, schedule subtracts the number of ticks that the task actually ran (without blocking)
z sleep_avg is scaled to a bonus value
Average Sleep Time and Bonus Values

Average sleep time            Bonus
>= 0 but < 100 ms             0
>= 100 ms but < 200 ms        1
>= 200 ms but < 300 ms        2
>= 300 ms but < 400 ms        3
>= 400 ms but < 500 ms        4
>= 500 ms but < 600 ms        5
>= 600 ms but < 700 ms        6
>= 700 ms but < 800 ms        7
>= 800 ms but < 900 ms        8
>= 900 ms but < 1000 ms       9
1 second                      10
Bonus and Dynamic Priority

z Dynamic priority (prio in task_struct) is calculated from the static priority and the bonus:
  prio = max(100, min(static_priority - bonus + 5, 139))
Calculating Time Slices

z time_slice in the task_struct
z Calculate the quantum as:
  z If (SP < 120): Quantum = (140 - SP) * 20 ms
  z If (SP >= 120): Quantum = (140 - SP) * 5 ms
  z where SP is the static priority
z Higher-priority processes get longer quanta
  z Basic idea: important processes should run longer
  z Other mechanisms are used for quick interactive response

Nice Value vs. Static Priority and Quantum (figure)
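Worked examples of the formula: nice -20 gives SP = 100, so Quantum = (140 - 100) * 20 = 800 ms; nice 0 gives SP = 120, so Quantum = (140 - 120) * 5 = 100 ms; nice +19 gives SP = 139, so Quantum = (140 - 139) * 5 = 5 ms.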
Interactive Processes

z A process is considered interactive if
  bonus - 5 >= (static priority / 4) - 28
  z (static priority / 4) - 28 is called the interactive delta
z Low-priority processes have a hard time becoming interactive:
  z A high static priority (100) process becomes interactive when its average sleep time is greater than 200 ms
  z A default static priority process becomes interactive when its sleep time is greater than 700 ms
  z The lowest priority (139) can never become interactive
z The higher the bonus the task is getting and the higher its static priority, the more likely it is to be considered interactive
Using Quanta

z At every time tick (in scheduler_tick), decrement the quantum of the currently running process (time_slice)
z If the time goes to zero, the process is done
z Check interactive status:
  z If non-interactive, put it aside on the expired list
  z If interactive, put it at the end of the active list
z Exceptions: don't put it on the active list if:
  z A higher-priority process is on the expired list
  z An expired task has been waiting more than STARVATION_LIMIT
z If there's nothing else at that priority, it will run again immediately
z Of course, by running so much, its bonus will go down, and so will its priority and its interactive status
Avoiding Starvation

z The system only runs processes from active queues, and puts them on expired queues when they use up their quanta
z When a priority level of the active queue is empty, the scheduler looks for the next-highest priority queue
z After running all of the active queues, the active and expired queues are swapped
z There are pointers to the current arrays; at the end of a cycle, the pointers are switched
The Priority Arrays

struct prio_array {
    unsigned int nr_active;        /* total runnable tasks in this array */
    unsigned long bitmap[5];       /* one bit per priority level */
    struct list_head queue[140];   /* one task list per priority level */
};

struct rq {
    spinlock_t lock;               /* serializes access to this runqueue */
    unsigned long nr_running;
    struct prio_array *active, *expired;
    struct prio_array arrays[2];
    struct task_struct *curr, *idle;
    /* ... */
};
Swapping Arrays

/* At the end of a cycle: if the active array is empty,
   exchange the active and expired pointers - O(1) */
struct prio_array *array = rq->active;
if (array->nr_active == 0) {
    rq->active = rq->expired;
    rq->expired = array;
}
Why Two Arrays?

z Why is it done this way?
z It avoids the need for traditional aging
z Why is aging bad?
  z It's O(n) at each clock tick
Linux is More Efficient

z Processes are touched only when they start or stop running
z That's when we recalculate priorities, bonuses, quanta, and interactive status
z There are no loops over all processes, or even over all runnable processes
Real-Time Scheduling

z Linux has soft real-time scheduling
  z No hard real-time guarantees
z All real-time processes are higher priority than any conventional processes
z Processes with priorities [0, 99] are real-time
  z Saved in rt_priority in the task_struct
  z Scheduling priority of a real-time task is 99 - rt_priority
z A process can be converted to real-time via the sched_setscheduler system call
Real-Time Policies

z First-in, first-out: SCHED_FIFO
  z Static priority
  z Process is only preempted for a higher-priority process
  z No time quanta; it runs until it blocks or yields voluntarily
z Round-robin: SCHED_RR
  z RR within the same priority level
  z As above, but with a time quantum (800 ms)
z Normal processes have the SCHED_OTHER scheduling policy
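For illustration, the sched_setscheduler call mentioned above can be used like this (the priority value 50 is an arbitrary choice, and the call typically requires root privileges):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 means the calling process */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    /* from here on, the process runs under the SCHED_FIFO policy */
    return 0;
}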
Multiprocessor Scheduling

z Each processor has a separate run queue
z Each processor only selects processes from its own queue to run
z Yes, it's possible for one processor to be idle while others have jobs waiting in their run queues
z Periodically, the queues are rebalanced: if one processor's run queue is too long, some processes are moved from it to another processor's queue
Locking Runqueues

z To rebalance, the kernel sometimes needs to move processes from one runqueue to another
z This is actually done by special kernel threads
z Naturally, the runqueue must be locked before this happens
z The kernel always locks runqueues in order of increasing indexes
  z Why? Deadlock prevention!
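A self-contained userspace analogue of the ordering rule in C (pthread mutexes stand in for the kernel's spinlocks; the struct is illustrative):

#include <pthread.h>

struct rq {
    pthread_mutex_t lock;
    int index;                    /* position in the runqueues array */
};

/* Lock two runqueues without risking deadlock: any two CPUs racing over
   the same pair acquire the locks in the same (increasing-index) order. */
void lock_two_runqueues(struct rq *rq1, struct rq *rq2)
{
    if (rq1 == rq2) {
        pthread_mutex_lock(&rq1->lock);
    } else if (rq1->index < rq2->index) {   /* lower index first */
        pthread_mutex_lock(&rq1->lock);
        pthread_mutex_lock(&rq2->lock);
    } else {
        pthread_mutex_lock(&rq2->lock);
        pthread_mutex_lock(&rq1->lock);
    }
}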
Processor Affinity

z Each process has a bitmask saying what CPUs it can run on
z Normally, of course, all CPUs are listed
z Processes can change the mask
z The mask is inherited by child processes (and threads), thus tending to keep them on the same CPU
z Rebalancing does not override affinity
Load Balancing

z To keep all CPUs busy, load balancing pulls tasks from busy runqueues to idle runqueues
z If schedule finds that a runqueue has no runnable tasks (other than the idle task), it calls load_balance
z load_balance is also called via a timer
  z scheduler_tick calls rebalance_tick
  z Every tick when the system is idle
  z Every 100 ms otherwise
Load Balancing (Cont.)

z load_balance looks for the busiest runqueue (most runnable tasks) and takes a task that is (in order of preference):
  z inactive (likely to be cache cold)
  z high priority
z load_balance skips tasks that are:
  z likely to be cache warm (hasn't run for cache_decay_ticks time)
  z currently running on a CPU
  z not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct)
Linux 2.6 CFS Scheduler

z Was merged into the 2.6.23 release
z Uses a red-black tree structure instead of multilevel queues
z Tries to run the task with the "gravest need" for CPU time

Red-Black Tree in CFS (figure)

Red-Black Tree Properties

z Self-balancing
z Insertion and deletion operations in O(log n)
  z With a proper implementation, its performance is almost the same as the O(1) algorithm's!
The switch_to Macro

z switch_to() performs a process switch from the prev process (descriptor) to the next process (descriptor)
z switch_to is invoked by schedule() and is one of the most hardware-dependent kernel routines
  z See kernel/sched.c and include/asm*/system.h for more details
Ext3 Filesystem

Introduction

z Common file system on Linux
z Introduced in 2001
z Supports a max file size of 16 GB to 2 TB
z Max filesystem size can be from 2 TB to 32 TB
z Maximum of 32000 subdirectories under a directory
z Options for block size from 1 KB to 4 KB
Block Groups

z Disk is partitioned into equal-sized block groups
  z Same number of inodes per block group
  z Same number of data blocks per block group
z Each block group has its data blocks and inodes stored in adjacent tracks
z Files are usually allocated within a single block group
  z Inodes and data blocks are then close together
  z Reduces average seek time

Partition Layout (figure)
Superblock

z Located 1024 bytes from the start of the file system, and is 1024 bytes in size
z Backup copies are typically stored in the first file data block of each block group
  z Only the one in block group 0 is usually looked at
z Contains a description of the basic size and shape of this file system

Some Superblock Fields

z Block group no. of the group storing this superblock
z Block size
z Total no. of blocks
z No. of free blocks
z No. of inodes
z Total no. of free inodes
z First inode (for /)
z Many others; it's a long list
The Ext3 Group Descriptor

z One group descriptor data structure for every block group
z All the group descriptors for all of the block groups are duplicated in each block group, in case of file system corruption
z The group descriptor contains the following:
  z Block bitmap: block number of the block allocation bitmap
  z Inode bitmap: block number of the inode allocation bitmap
  z Inode table: the block number of the starting block of the inode table for this block group
  z Free blocks count: number of data blocks free in the group
  z Free inodes count: number of inodes free in the group
  z Used directory count: number of inodes allocated to directories
Bitmaps

z The block bitmap manages the allocation status of the blocks in the group
z The inode bitmap manages the allocation status of the inodes in the group
z Both bitmaps must fit into one block each
  z Fixes the maximum no. of blocks in a block group at 8 times the block size in bytes
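For example, with a 4 KB block size the block bitmap holds 4096 x 8 = 32768 bits, so a block group can contain at most 32768 blocks, i.e. 32768 x 4 KB = 128 MB of data blocks per group.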
Inodes

z Inode table: contains the inodes that describe the files in the group
z Inodes:
  z Each inode corresponds to one file, and stores the file's primary metadata, such as the file's size, ownership, and temporal information
  z An inode is typically 128 bytes in size and is allocated to each file and directory
  z 12 direct links, one single, one double, and one triple indirect link

Ext2 inode (figure)
Some Inode Fields

z File type
z Access rights
z File length
z Time of last file access
z Time of last change of inode
z Time of last change of file
z Hard links counter
z Number of data blocks
z Pointers to data blocks
z Access control lists
Directories

z An Ext3 directory is just like a regular file, except that it has a special type value
z The content of a directory is a list of directory entry data structures, each describing a file/subdirectory name and inode address
z The length of a directory entry varies from 1 to 255 bytes
z Fields in the directory entry:
  z Inode: inode no. of the file/directory
  z Name length: the length of the file name
  z Record length: the length of this directory entry
    z Tells where the next entry starts
  z File type
  z Name: file/subdirectory name

Indexing and Directories

z When Ext3 wants to delete a directory entry, it just increases the record length of the previous entry to cover the deleted entry
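A C sketch of the on-disk entry (field names follow the slide; the real kernel definition, ext3_dir_entry_2, is close but not identical):

#include <stdint.h>

struct dir_entry {
    uint32_t inode;      /* inode no. of the file/subdirectory */
    uint16_t rec_len;    /* record length: where the next entry starts */
    uint8_t  name_len;   /* length of the name */
    uint8_t  file_type;
    char     name[];     /* up to 255 bytes, not NUL-terminated */
};

/* Deletion, as described above: fold the dead entry into its predecessor.
   Directory scans then simply skip over the reclaimed bytes. */
void delete_entry(struct dir_entry *prev, struct dir_entry *dead)
{
    prev->rec_len += dead->rec_len;
}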
Allocating Inodes

z If a new inode is for a non-directory file, Ext3 allocates an inode in the same block group as the parent directory
z If that group has no free inode or block:
  z Quadratic search: search in block groups i mod n, (i+1) mod n, (i+1+2) mod n, (i+1+2+4) mod n, ...
  z If the quadratic search fails, Ext3 uses exhaustive linear search to find a free inode
z If a new inode is for a directory, Ext3 tries to place it in a group that has not been used much
  z Using the total number of free inodes and blocks in the superblock, Ext3 calculates the average free inodes and blocks per group
  z Ext3 searches each of the groups and uses the first one whose free inodes and blocks are greater than the average
  z If the previous search fails, the group with the smallest number of directories is used
Allocating Data Blocks

z First goal: get the new block near the last block allocated to the file
z Preallocation: allocate a number of contiguous blocks (usually 8) even if only one block is asked for
  z Preallocated blocks are freed when the file is closed, or when a write operation is not sequential with respect to the write operations that triggered the preallocation
z Every allocation request has a goal block
  z If the current block and the previously allocated block have consecutive file block numbers, goal = disk block no. of the previous block + 1
    z Tries to keep consecutive file blocks adjacent on disk
  z Else, if at least one block was preallocated earlier, goal = that block
  z Else, goal = first block in the block group holding the file's inode
z Aim: allocate a physical block = goal block
  z If the goal block is not free, try the next one
  z If not available, search all block groups starting from the one containing the goal block
    z Look for a group of 8 adjacent free blocks
    z If none, look for a single free block
Indexing and Directories

z Directory entry allocation:
  z Ext3 starts at the beginning of the directory and examines each directory entry
  z It calculates the actual record length needed and compares it with the record length field
  z If they are different (the entry has slack space), Ext3 can insert the new directory entry at the end of the current entry
  z Else, Ext3 appends the new entry to the end of the entry list
Memory Data Structures

z Superblock and group descriptors are always cached
z Bitmaps (block and inode), inodes, and data blocks are cached dynamically as needed, when the corresponding object is in use
  z The page cache mitigates some of the cost of reading from disk
z Main change in Ext3 over Ext2: added journaling support for recovery
  z Not to be discussed today