
CS30002: Operating Systems
Arobinda Gupta
Spring 2012

General Information

- Textbook:
  - Operating System Concepts, 8th Ed, by Silberschatz, Galvin, and Gagne
  - I will use materials from other books as and when needed

- Course Webpage: http://cse.iitkgp.ac.in/~agupta/OS

Grading Policy

- Midsem: 30%
- Endsem: 50%
- TA: 20% (two class tests, may also have assignments)

Introduction

What is an Operating System?

- User-centric definition
  - A program that acts as an intermediary between a user of a computer and the computer hardware
  - Defines an interface for the user to use services provided by the system
  - Provides a view of the system to the user

- System-centric definition
  - Resource allocator: manages and allocates resources
  - Control program: controls the execution of user programs and operations of I/O devices

Computer System Components

1. Hardware: provides basic computing resources (CPU, memory, I/O devices).
2. Operating system: controls and coordinates the use of the hardware among the various application programs for the various users.
3. Application programs: define the ways in which the system resources are used to solve the computing problems of the users (compilers, databases, games, ...).
4. Users (people, machines, other computers).

Abstract View of System Components

Types of Systems

- Batch Systems
  - Multiple jobs, but only one job in memory at one time and executed (till completion) before the next one starts

- Multiprogrammed Batch Systems
  - Multiple jobs in memory, CPU is multiplexed between them
  - CPU-bound vs. I/O-bound jobs

- Time-sharing Systems
  - Multiple jobs in memory and on disk, CPU is multiplexed among jobs in memory, jobs swapped between disk and memory
  - Allows interaction with users

- Personal Computers
  - Dedicated to a single user at one time

- Multiprocessing Systems
  - More than one CPU in a single machine to allocate jobs to
  - Symmetric Multiprocessing, NUMA machines, Multicore

- Other Parallel Systems, Distributed Systems, Clusters
  - Different types of systems with multiple CPUs/Machines

- Real Time Systems
  - Systems to run jobs with time guarantees

- Other types possible depending on resources in the machine, types of jobs to be run

- OS design depends on the type of system it is designed for

- Our primary focus in this course:
  - Uniprocessor, time-sharing systems running general purpose jobs from users
  - Effect of multicore/multiprocessors
  - Will discuss some other topics at end

Resources Managed by OS

- Physical
  - CPU, Memory, Disk, I/O devices like keyboard, monitor, printer

- Logical
  - Process, File, ...

Main Components of an OS

- Resource-centric view
  - Process Management
  - Main Memory Management
  - File Management
  - I/O System Management
  - Secondary Storage Management
  - Security and Protection System
  - Networking (this is now integrated with most OSs, but will be covered in the Networks course)

- User-centric view
  - System Calls
  - Command Interpreter (not strictly a part of an OS)

Process Management

- A process is a program in execution
- Needs certain resources to accomplish its task
  - CPU time, memory, files, I/O devices

- OS responsibilities
  - Process creation and deletion
  - Process suspension and resumption
  - Provide mechanisms for:
    - process synchronization
    - interprocess communication

Main-Memory Management

- OS responsibilities
  - Keep track of which parts of memory are currently being used and by whom
  - Decide which processes to load when memory space becomes available
  - Allocate and deallocate memory space as needed

File Management

- OS responsibilities
  - File creation, deletion, modification
  - Directory creation, deletion, modification
  - Support of primitives for manipulating files and directories
  - Mapping files onto secondary storage
  - File backup on stable (nonvolatile) storage media

I/O System Management

- The I/O system consists of:
  - A buffer-caching system
  - Device driver interface
  - Drivers for specific hardware devices

Secondary-Storage Management

- Most modern computer systems use disks as the principal on-line storage medium, for both programs and data
- OS responsibilities
  - Free space management
  - Storage allocation
  - Disk scheduling

Security and Protection System

- Protection refers to a mechanism for controlling access by programs, processes, or users to both system and user resources
- The protection mechanism must:
  - distinguish between authorized and unauthorized usage
  - specify the controls to be imposed
  - provide a means of enforcement

System Calls

- System calls provide the interface between a running program and the OS
  - Think of it as a set of functions available to the program to call (but somewhat different from normal functions, we will see why)
  - Generally available as assembly-language instructions
  - Most common languages (e.g., C, C++) have APIs that call system calls underneath

Passing parameters to system calls

- Pass parameters in registers
- Store the parameters in a table in memory, and pass the table address as a parameter in a register
- Push (store) the parameters onto the stack by the program, and pop them off the stack by the operating system
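As a concrete (Linux-specific, illustrative) look at the register-passing method, the C library's syscall() wrapper places its arguments in registers before trapping into the kernel; programs normally use the write() API that sits on top of this. A minimal sketch:

/* Illustrative sketch: invoking the write system call directly on Linux. */
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello via syscall\n";
    /* fd, buffer, and length are placed in registers before the trap */
    syscall(SYS_write, 1, msg, sizeof msg - 1);
    return 0;
}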

Command-Interpreter System

- Strictly not a part of the OS, but always there: the shell
- Allows the user to give commands to the OS; interprets the commands and executes them
  - Calls appropriate functions/system calls
- You will write one in your lab

Process Management

What is a Process?

- Process: an instance of a program in execution
  - Multiple instances of the same program are different processes
- A process has resources allocated to it by the OS during its execution
  - CPU time
  - Memory space for code, data, stack
  - Open files
  - Signals
  - Data structures to maintain different information about the process
- Each process is identified by a unique, positive integer id (process id)

Process Control Block (PCB)

- The primary data structure maintained by the OS that contains information about a process
- One PCB per process
- OS maintains a list of PCBs for all processes

Typical Contents of PCB

- Process id, parent process id
- Process state
- CPU state: CPU register contents, PSW
- Priority and other scheduling info
- Pointers to different memory areas
- Open file information
- Signals and signal handler info
- Various accounting info like CPU time used etc.
- Many other OS-specific fields can be there
  - Linux PCB (task_struct) has 100+ fields

Process States (5-state model)

- As a process executes, it changes state
  - new: The process is being created
  - running: Instructions are being executed
  - waiting: The process is waiting for some event (needed for its progress) to occur
  - ready: The process is waiting to be assigned to a CPU
  - terminated: The process has finished execution

Process State Transitions

Main Operations on a Process

- Process creation
  - Data structures like PCB set up and initialized
  - Initial resources allocated and initialized if needed
  - Process added to ready queue (queue of processes ready to run)

- Process scheduling
  - CPU is allotted to the process, process runs

- Process termination
  - Process is removed
  - Resources are reclaimed
  - Some data may be passed to parent process (ex. exit status)
  - Parent process may be informed (ex. SIGCHLD signal in UNIX)

Process Creation

- A process can create another process
  - By making a system call (a function to invoke the service of the OS, ex. fork())
  - Parent process: the process that invokes the call
  - Child process: the new process created

- The new process can in turn create other processes, forming a tree of processes

- The first process in the system is handcrafted
  - No system call, because the OS is still not running fully (not open for service)
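A minimal UNIX sketch of parent and child (illustrative; the slides only name fork()):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t pid = fork();          /* parent makes the system call */
    if (pid < 0) {
        perror("fork");
        exit(1);
    } else if (pid == 0) {
        printf("child: pid %d\n", getpid());  /* new process runs here */
        exit(0);                 /* exit status passed back to parent */
    } else {
        waitpid(pid, NULL, 0);   /* parent waits until child terminates */
        printf("parent: child %d done\n", pid);
    }
    return 0;
}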

Process Creation (contd.)

- Resource sharing possibilities
  - Parent and children share all resources
  - Children share a subset of the parent's resources
  - Parent and child share no resources

- Execution possibilities
  - Parent and children execute concurrently
  - Parent waits until children terminate

- Memory address space possibilities
  - Address space of child is a duplicate of the parent
  - Child has a new program loaded into it

Processes Tree on a UNIX System

Process Termination

- Process executes last statement and asks the operating system to terminate it (ex. exit/abort)
- Process encounters a fatal error
  - Can be for many reasons like arithmetic exception etc.
- Parent may terminate execution of children processes (ex. kill). Some possible reasons:
  - Child has exceeded allocated resources
  - Task assigned to child is no longer required
  - Parent is exiting
    - Some operating systems may not allow a child to continue if its parent terminates

Process Scheduling

- Ready queue: queue of all processes residing in main memory, ready and waiting to execute (links to PCBs)
- Scheduler/Dispatcher picks up a process from the ready queue according to some algorithm (CPU scheduling policy) and assigns it the CPU
- Selected process runs till
  - It needs to wait for some event to occur (ex. a disk read)
  - CPU time allotted to it expires (timesharing systems)
  - The CPU scheduling policy dictates that it be stopped
    - Arrival of a higher priority process
  - When it is ready to run again, it goes back to the ready queue
- Scheduler is invoked again to select the next process from the ready queue

Representation of Process Scheduling

Schedulers

- Long-term scheduler (or job scheduler)
  - Selects which processes should be brought into the ready queue
  - Controls the degree of multiprogramming (no. of jobs in memory)
  - Invoked infrequently (seconds, minutes)
  - May not be present in an OS (ex. Linux/Windows do not have one)

- Short-term scheduler (or CPU scheduler)
  - Selects which process should be executed next and allocates the CPU
  - Invoked very frequently (milliseconds), must be fast

What if all processes do not fit in memory?

- Partially executed jobs in secondary memory (swapped out)
  - Copy the process image to some pre-designated area in the disk (swap out)
  - Bring it in again later and add to the ready queue

Addition of Medium Term Scheduling

Other Questions

- How does the scheduler get scheduled? (Suppose we have only one CPU)
  - As part of execution of an ISR (ex. timer interrupt in a time-sharing system)
  - Called directly by an I/O routine/event handler after blocking the process making the I/O or event request

- What does it do with the running process?
  - Save its context

- How does it start the new process?
  - Load the saved context of the new process chosen to be run
  - Start the new process

Context of a Process

- Information that is required to be saved to be able to restart the process later from the same point
- Includes:
  - CPU state: all register contents, PSW
  - Program counter
  - Memory state: code, data
  - Stack
  - Open file information
  - Pending I/O and other event information

Context Switch

- When the CPU switches to another process, the system must save the state of the old process and load the saved state for the new process
- Context-switch time is overhead; the system does no useful work while switching
- Time dependent on hardware support

Handling Interrupts

- H/w saves PC, PSW
- Jump to ISR
- ISR should first save the context of the process
- Execute the ISR
- Before leaving, ISR should restore the context of the process being executed
- Return from ISR restores the PC
- ISR may invoke the dispatcher, which may load the context of a new process, which runs when the interrupt returns instead of the original process interrupted

CPU Switch From Process to Process

Example: Timesharing Systems

- Each process has a time quantum T allotted to it
- Dispatcher starts process P0, loads an external counter (timer) to count down from T to 0
- When the timer expires, the CPU is interrupted
- The ISR invokes the dispatcher
- The dispatcher saves the context of P0
  - PCB of P0 tells where to save
- The dispatcher selects P1 from the ready queue
  - The PCB of P1 tells where the old state, if any, is saved
- The dispatcher loads the context of P1
- The dispatcher reloads the counter (timer) with T
- The ISR returns, restarting P1 (since P1's PC is now loaded as part of the new context loaded)
- P1 starts running

CPU Scheduling

Types of jobs

- CPU-bound vs. I/O-bound
  - Maximum CPU utilization obtained with multiprogramming
- Batch, interactive, real time
  - Different goals, affects scheduling policies

CPU Scheduler

- Selects from among the processes in memory that are ready to execute, and allocates the CPU to one of them
- CPU scheduling decisions may take place when a process:
  1. Switches from running to waiting state
  2. Switches from running to ready state
  3. Switches from waiting to ready
  4. Terminates
- Scheduling under 1 and 4 is nonpreemptive. All other scheduling is preemptive.

Dispatcher

- Dispatcher module gives control of the CPU to the process selected by the CPU scheduler; this involves:
  - switching context
  - switching to user mode
  - jumping to the proper location in the user program to restart that program

- Dispatch latency: time it takes for the dispatcher to stop one process and start another running

Scheduling Criteria

- CPU utilization: keep the CPU as busy as possible
- Throughput: # of processes that complete their execution per time unit
- Turnaround time: amount of time to execute a particular process
- Waiting time: amount of time a process has been waiting in the ready queue
- Response time: amount of time from when a request was submitted until the first response is produced, not the full output (for time-sharing environments)

Optimization Criteria

- Max CPU utilization
- Max throughput
- Min turnaround time
- Min waiting time
- Min response time

First-Come, First-Served (FCFS) Scheduling

Process  Burst Time
P1       24
P2       3
P3       3

Suppose that the processes arrive in the order: P1, P2, P3.
The Gantt chart for the schedule is:

|   P1   | P2 | P3 |
0        24   27   30

- Waiting time for P1 = 0; P2 = 24; P3 = 27
- Average waiting time: (0 + 24 + 27)/3 = 17

FCFS Scheduling (Cont.)

Suppose that the processes arrive in the order: P2, P3, P1.
The Gantt chart for the schedule is:

| P2 | P3 |   P1   |
0    3    6        30

- Waiting time for P1 = 6; P2 = 0; P3 = 3
- Average waiting time: (6 + 0 + 3)/3 = 3
- Much better than previous case
- Convoy effect: short process behind long process

Shortest-Job-First (SJF) Scheduling

- Associate with each process the length of its next CPU burst. Use these lengths to schedule the process with the shortest time
- Two schemes:
  - nonpreemptive: once the CPU is given to the process, it cannot be preempted until it completes its CPU burst
  - preemptive: if a new process arrives with CPU burst length less than the remaining time of the currently executing process, preempt. This scheme is known as Shortest-Remaining-Time-First (SRTF)
- SJF is optimal: gives minimum average waiting time for a given set of processes

Example of Non-Preemptive SJF

Process  Arrival Time  Burst Time
P1       0.0           7
P2       2.0           4
P3       4.0           1
P4       5.0           4

SJF (non-preemptive):

| P1 | P3 | P2 | P4 |
0    7    8    12   16

Average waiting time = (0 + 6 + 3 + 7)/4 = 4

Example of Preemptive SJF

Process  Arrival Time  Burst Time
P1       0.0           7
P2       2.0           4
P3       4.0           1
P4       5.0           4

SJF (preemptive):

| P1 | P2 | P3 | P2 | P4 | P1 |
0    2    4    5    7    11   16

Average waiting time = (9 + 1 + 0 + 2)/4 = 3

Determining Length of Next CPU Burst

- Can only estimate the length
- Can be done by using the length of previous CPU bursts, using exponential averaging:

    tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n),  0 <= alpha <= 1

  where t(n) is the actual length of the nth CPU burst and tau(n) is the predicted value for the nth CPU burst
Properties of Exponential Averaging

- alpha = 0
  - tau(n+1) = tau(n)
  - Recent history does not count
- alpha = 1
  - tau(n+1) = t(n)
  - Only the actual last CPU burst counts
- If we expand the formula, each successive term has less weight than its predecessor
  - Recent history has more weight than old history
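A small C sketch of the estimator (the variable names, initial guess, and alpha value are illustrative, not from the slides):

/* Exponential averaging of CPU burst lengths:
   tau(n+1) = alpha * t(n) + (1 - alpha) * tau(n) */
static double tau = 10.0;           /* assumed initial guess */
static const double alpha = 0.5;    /* assumed weighting factor */

double next_burst_estimate(double actual_burst) {
    tau = alpha * actual_burst + (1.0 - alpha) * tau;
    return tau;                     /* prediction for the next burst */
}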

Priority Scheduling

- A priority number (integer) is associated with each process
- The CPU is allocated to the process with the highest priority (smallest integer = highest priority)
  - Preemptive
  - Nonpreemptive
- SJF is priority scheduling where priority is the predicted next CPU burst time
- Problem: Starvation. Low priority processes may never execute
- Solution: Aging. As time progresses, increase the priority of the process

Round Robin (RR)

- Each process gets a small unit of CPU time (time quantum), usually 10-100 milliseconds. After this time has elapsed, the process is preempted and added to the end of the ready queue
- If there are n processes in the ready queue and the time quantum is q, then each process gets 1/n of the CPU time in chunks of at most q time units at once. No process waits more than (n-1)q time units
- Performance
  - q large: behaves like FIFO
  - q small: q must be large with respect to context switch time, otherwise overhead is too high

Example of RR with Time Quantum = 20

Process  Burst Time
P1       53
P2       17
P3       68
P4       24

The Gantt chart is:

| P1 | P2 | P3 | P4 | P1 | P3 | P4 | P1 | P3 | P3 |
0    20   37   57   77   97   117  121  134  154  162

Typically, higher average turnaround than SJF, but better response.

Multilevel Queue

- Ready queue is partitioned into separate queues: foreground (interactive) and background (batch)
- Each queue has its own scheduling algorithm
  - foreground: RR
  - background: FCFS
- Scheduling must be done between the queues
  - Fixed priority scheduling (i.e., serve all from foreground, then from background). Possibility of starvation.
  - Time slice: each queue gets a certain amount of CPU time which it can schedule amongst its processes; e.g., 80% to foreground in RR, 20% to background in FCFS

Multilevel Feedback Queue

- A process can move between the various queues; aging can be implemented this way
- A multilevel-feedback-queue scheduler is defined by the following parameters:
  - number of queues
  - scheduling algorithm for each queue
  - method used to determine when to upgrade a process
  - method used to determine when to demote a process
  - method used to determine which queue a process will enter when that process needs service

Example of Multilevel Feedback Queue

- Three queues:
  - Q0: time quantum 8 milliseconds
  - Q1: time quantum 16 milliseconds
  - Q2: FCFS

- Scheduling
  - A new job enters queue Q0, which is served FCFS. When it gains the CPU, the job receives 8 milliseconds. If it does not finish in 8 milliseconds, the job is moved to queue Q1.
  - At Q1 the job is again served FCFS and receives 16 additional milliseconds. If it still does not complete, it is preempted and moved to queue Q2.

Multilevel Feedback Queues

Process Coordination

Why is it needed?

- Processes may need to share data
  - More than one process reading/writing the same data (a shared file, a database record, ...)
  - Output of one process being used by another
  - Needs mechanisms to pass data between processes

- Ordering executions of multiple processes may be needed to ensure correctness
  - Process X should not do something before process Y does something, etc.
  - Need mechanisms to pass control signals between processes

Interprocess Communication (IPC)

- Mechanism for processes P and Q to communicate and to synchronize their actions
  - Establish a communication link

- Fundamental types of communication links
  - Shared memory
    - P writes into a shared location, Q reads from it, and vice-versa
  - Message passing
    - P and Q exchange messages

- We will focus on shared memory, and discuss issues with message passing later

Implementation Questions

- How are links established?
- Can a link be associated with more than two processes?
- How many links can there be between every pair of communicating processes?
- What is the capacity of a link?
- Is the size of a message that the link can accommodate fixed or variable?
- Is a link unidirectional or bi-directional?

Producer-Consumer Problem

- Paradigm for cooperating processes
  - producer process produces information that is consumed by a consumer process
  - unbounded-buffer: places no practical limit on the size of the buffer
  - bounded-buffer: assumes that there is a fixed buffer size

- Basic synchronization requirement
  - Producer should not write into a full buffer
  - Consumer should not read from an empty buffer
  - All data written by the producer must be read exactly once by the consumer

Bounded-Buffer Shared-Memory Solution

Shared data:

#define BUFFER_SIZE 10
typedef struct {
    ...
} item;
item buffer[BUFFER_SIZE];
int in = 0;
int out = 0;

We will see how to create such shared memory between processes in the lab

Bounded-Buffer: Producer Process

item nextProduced;
while (1) {
    /* produce an item in nextProduced */
    while (((in + 1) % BUFFER_SIZE) == out)
        ; /* do nothing: buffer full */
    buffer[in] = nextProduced;
    in = (in + 1) % BUFFER_SIZE;
}

Bounded-Buffer: Consumer Process

item nextConsumed;
while (1) {
    while (in == out)
        ; /* do nothing: buffer empty */
    nextConsumed = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    /* consume the item in nextConsumed */
}

- The solution allows at most n - 1 items in the buffer (of size n) at the same time. A solution where all n buffer slots are used is not simple
- Suppose we modify the producer-consumer code by adding a variable counter, initialized to 0 and incremented each time a new item is added to the buffer

Shared data

#define B_SIZE 10
typedef struct {
    ...
} item;
item buffer[B_SIZE];
int in = 0;
int out = 0;
int counter = 0;

Will this work?

Producer process

item nextProduced;
while (1) {
    while (counter == B_SIZE)
        ; /* do nothing */
    buffer[in] = nextProduced;
    in = (in + 1) % B_SIZE;
    counter++;
}

Consumer process

item nextConsumed;
while (1) {
    while (counter == 0)
        ; /* do nothing */
    nextConsumed = buffer[out];
    out = (out + 1) % B_SIZE;
    counter--;
}

The Problem with this Solution

- The statement counter++ may be implemented in machine language as:
    register1 = counter
    register1 = register1 + 1
    counter = register1
- The statement counter-- may be implemented as:
    register2 = counter
    register2 = register2 - 1
    counter = register2
- If both the producer and consumer attempt to update counter concurrently, the assembly language statements may get interleaved
- Interleaving depends upon how the producer and consumer processes are scheduled

An Illustration

Assume counter is initially 5. One interleaving of statements is:

producer: register1 = counter        (register1 = 5)
producer: register1 = register1 + 1  (register1 = 6)
consumer: register2 = counter        (register2 = 5)
consumer: register2 = register2 - 1  (register2 = 4)
producer: counter = register1        (counter = 6)
consumer: counter = register2        (counter = 4)

The value of counter may be either 4 or 6, where the correct result should be 5.

Race Condition

- A scenario in which the final output is dependent on the relative speed of the processes
  - Example: The final value of the shared data counter depends upon which process finishes last

- Race conditions must be prevented
  - Concurrent processes must be synchronized
  - Final output should be what is specified by the program, and should not change due to relative speeds of the processes

Atomic Operation

- An operation that is either executed fully without interruption, or not executed at all
- The operation can be a group of instructions
  - Ex. the instructions for counter++ and counter--
- Note that the producer-consumer solution above works if counter++ and counter-- are made atomic
- In practice, the process may be interrupted in the middle of an atomic operation, but the atomicity should ensure that no process uses the effect of the partially executed operation until it is completed

The Critical Section Problem

- n processes all competing to use some shared data (in general, some shared resource)
- Each process has a section of code, called the critical section, in which the shared data is accessed
- Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section, irrespective of the relative speeds of the processes
- Also known as the Mutual Exclusion Problem, as it requires that access to the critical section is mutually exclusive

Requirements for Solution to the Critical-Section Problem

1. Mutual Exclusion: If process Pi is executing in its critical section, then no other processes can be executing in their critical sections.
2. Progress: If no process is executing in its critical section and there exist some processes that wish to enter their critical sections, then the selection of the process that will enter the critical section next cannot be postponed indefinitely.
3. Bounded Waiting/No Starvation: A bound must exist on the number of times that other processes are allowed to enter their critical sections after a process has made a request to enter its critical section and before that request is granted.

Entry and Exit Sections

- Entry section: a piece of code executed by a process just before entering a critical section
- Exit section: a piece of code executed by a process just after leaving a critical section
- General structure of a process Pi:

    ...
    entry section
    critical section
    exit section
    remainder section  /* remaining code */
    ...

- Solutions vary depending on how these sections are written

Peterson's Solution

- Only 2 processes, P0 and P1
- Processes share some common variables to synchronize their actions
  - int turn; initially turn = 0
    - turn == i means it is Pi's turn to enter its critical section
  - boolean flag[2]; initially flag[0] = flag[1] = false
    - flag[i] = true means Pi is ready to enter its critical section

Process Pi (j = 1 - i is the other process):

do {
    flag[i] = true;
    turn = j;
    while (flag[j] && turn == j) ;
    critical section
    flag[i] = false;
    remainder section
} while (1);

- Meets all three requirements; solves the critical-section problem for two processes
- Can be extended to n processes by pairwise mutual exclusion, but that is too costly

Solution for n Processes: Bakery Algorithm

- Before entering its critical section, a process receives a number. The holder of the smallest number enters the critical section.
- If processes Pi and Pj receive the same number: if i < j, then Pi is served first; else Pj is served first.
- The numbering scheme always generates numbers in increasing order of enumeration; i.e., 1,2,3,3,3,3,4,5,...

Bakery Algorithm (Notation)

- Notation: < is lexicographical order on (ticket #, process id #)
  - (a,b) < (c,d) if a < c, or if a == c and b < d
  - max(a0, ..., a(n-1)) is a number k such that k >= ai for i = 0, ..., n - 1

- Shared data:
    boolean choosing[n];
    int number[n];
  Data structures are initialized to false and 0, respectively

Bakery Algorithm

do {
    choosing[i] = true;
    number[i] = max(number[0], number[1], ..., number[n-1]) + 1;
    choosing[i] = false;
    for (j = 0; j < n; j++) {
        while (choosing[j]) ;
        while ((number[j] != 0) &&
               ((number[j], j) < (number[i], i))) ;
    }
    critical section
    number[i] = 0;
    remainder section
} while (1);

Hardware Instruction Based Solutions

- Some architectures provide special instructions that can be used for synchronization

- TestAndSet: test and modify the content of a word atomically

boolean TestAndSet(boolean &target) {
    boolean rv = target;
    target = true;
    return rv;
}

- Swap: atomically swap two variables

void Swap(boolean &a, boolean &b) {
    boolean temp = a;
    a = b;
    b = temp;
}

Mutual Exclusion with Test-and-Set

Shared data:
    boolean lock = false;

Process Pi:
do {
    while (TestAndSet(lock)) ;
    critical section
    lock = false;
    remainder section
} while (1);

Mutual Exclusion with Swap

Shared data (initialized to false):
    boolean lock;
    boolean waiting[n];  /* used only in the bounded-waiting extension */

Process Pi (key is a local variable):
do {
    key = true;
    while (key == true)
        Swap(lock, key);
    critical section
    lock = false;
    remainder section
} while (1);
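On modern hardware the same test-and-set idea is exposed through, for example, C11's atomic_flag; a minimal spinlock sketch (illustrative, not from the slides):

#include <stdatomic.h>

atomic_flag lock = ATOMIC_FLAG_INIT;   /* shared; clear == unlocked */

void enter_cs(void) {
    /* spin until the previous value was clear: classic test-and-set loop */
    while (atomic_flag_test_and_set(&lock))
        ;                              /* busy-wait */
}

void exit_cs(void) {
    atomic_flag_clear(&lock);          /* lock = false */
}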

Semaphore

- Widely used synchronization tool
- Does not require busy-waiting
  - CPU is not held unnecessarily while the process is waiting
- A semaphore S is
  - A data structure with an integer variable S.value and a queue S.q of processes
  - The data structure can only be accessed by two atomic operations, wait(S) and signal(S) (also called P(S) and V(S))
- Value of the semaphore S = value of the integer S.value

wait and signal Operations

wait(S):   if (S.value > 0) S.value--;
           else {
               add the process to S.q;
               block the process;
           }

signal(S): if (S.q is not empty)
               choose a process from S.q and unblock it;
           else S.value++;

Note: which process is picked for unblocking may depend on policy. Also, implementations can allow S.value < 0 (change the wait and signal code appropriately).

Solution of n-Process Critical Section using Semaphores

Shared data:
    semaphore mutex;  /* initially mutex = 1 */

Process Pi:
do {
    wait(mutex);
    critical section
    signal(mutex);
    remainder section
} while (1);
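The same pattern with POSIX semaphores, as an illustrative mapping (sem_wait corresponds to wait/P, sem_post to signal/V; initialization shown in a comment):

#include <semaphore.h>

sem_t mutex;   /* during setup: sem_init(&mutex, 0, 1); */

void critical(void) {
    sem_wait(&mutex);   /* wait(mutex) */
    /* critical section: access shared data */
    sem_post(&mutex);   /* signal(mutex) */
}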

Ordering Execution of Processes using Semaphores

- Execute statement B in Pj only after statement A is executed in Pi
- Use a semaphore flag initialized to 0
- Code:

    Pi:                 Pj:
    ...                 ...
    A                   wait(flag)
    signal(flag)        B

- Multiple such points of synchronization can be enforced using one or more semaphores

Pitfalls

Use carefully to avoid:
- Deadlock: two or more processes are waiting indefinitely for an event that can be caused by only one of the waiting processes
- Starvation: indefinite blocking. A process may never be removed from the semaphore queue in which it is suspended

Example of Deadlock

Let S and Q be two semaphores initialized to 1

    P0:              P1:
    wait(S);         wait(Q);
    wait(Q);         wait(S);
    ...              ...
    signal(S);       signal(Q);
    signal(Q);       signal(S);

Two Types of Semaphores

- Binary semaphore: integer value can range only between 0 and 1; can be simpler to implement
- Counting semaphore: value can be any positive integer
  - Useful in cases where there are multiple copies of resources
  - l-exclusion problem: at most l processes can be in their critical section at the same time
- Can implement a counting semaphore using a binary semaphore easily (do it yourself)

Internal Implementations of Semaphores

- How do we make wait and signal atomic?
  - Should we use another semaphore? Then who makes that atomic?
- Different solutions possible
  - Interrupts: disable interrupts just before a wait or a signal call, enable them just after that
    - Works fine for uniprocessors, but not for multiprocessors
  - Use s/w-based or h/w-instruction-based solutions to put entry and exit sections around the wait/signal code
    - Since the wait/signal code is small, it won't busy-wait for too long

Classical Problems of Synchronization

- Bounded-Buffer (Producer-Consumer) Problem
- Readers and Writers Problem
- Dining-Philosophers Problem

Bounded-Buffer Problem

Shared data:
    semaphore full, empty, mutex;
Initially:
    full = 0, empty = n, mutex = 1

Bounded-Buffer Problem: Producer Process

do {
    /* produce an item in nextp */
    wait(empty);
    wait(mutex);
    /* add nextp to buffer */
    signal(mutex);
    signal(full);
} while (1);

Bounded-Buffer Problem: Consumer Process

do {
    wait(full);
    wait(mutex);
    /* remove an item from buffer to nextc */
    signal(mutex);
    signal(empty);
    /* consume the item in nextc */
} while (1);
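For reference, the pseudocode above maps directly onto POSIX threads and semaphores. This sketch is illustrative (buffer size, item type, and item counts are assumptions); compile with -pthread:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define N 10
int buffer[N], in = 0, out = 0;
sem_t full, empty, mutex;              /* full = 0, empty = N, mutex = 1 */

void *producer(void *arg) {
    for (int item = 0; item < 100; item++) {
        sem_wait(&empty);              /* wait(empty) */
        sem_wait(&mutex);              /* wait(mutex) */
        buffer[in] = item; in = (in + 1) % N;
        sem_post(&mutex);              /* signal(mutex) */
        sem_post(&full);               /* signal(full) */
    }
    return NULL;
}

void *consumer(void *arg) {
    for (int i = 0; i < 100; i++) {
        sem_wait(&full);
        sem_wait(&mutex);
        int item = buffer[out]; out = (out + 1) % N;
        sem_post(&mutex);
        sem_post(&empty);
        printf("%d\n", item);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&full, 0, 0);
    sem_init(&empty, 0, N);
    sem_init(&mutex, 0, 1);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}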

Readers-Writers Problem

- A common shared data
  - Reader process: only reads the data
  - Writer process: only writes the data
- Synchronization requirements
  - Writers should have exclusive access to the data
    - No other reader or writer can access the data at that time
  - Multiple readers should be allowed to access the data if there is no writer accessing the data

Solution using Semaphores

Shared data:
    semaphore mutex, wrt;
    int readcount;
Initially:
    mutex = 1, wrt = 1, readcount = 0

Writer:
    wait(wrt);
    /* perform write */
    signal(wrt);

Reader:
    wait(mutex);
    readcount++;
    if (readcount == 1)
        wait(wrt);
    signal(mutex);
    /* perform read */
    wait(mutex);
    readcount--;
    if (readcount == 0)
        signal(wrt);
    signal(mutex);

Dining-Philosophers Problem

Shared data:
    semaphore chopstick[5];
Initially all values are 1

Philosopher i:
do {
    wait(chopstick[i]);
    wait(chopstick[(i+1) % 5]);
    /* eat */
    signal(chopstick[i]);
    signal(chopstick[(i+1) % 5]);
    /* think */
} while (1);

Other Synchronization Constructs

- Programming constructs
  - Specify critical sections or shared data to be protected by mutual exclusion in the program using special keywords
  - Compiler can then insert appropriate code to enforce the conditions (for ex., put wait/signal calls in appropriate places in the code)
- Examples
  - Critical regions, Monitors, Barriers, ...

Memory Management

Goals of Memory Management

- Allocate available memory efficiently to multiple processes
- Main functions
  - Allocate memory to processes when needed
  - Keep track of what memory is used and what is free
  - Protect one process's memory from another

Memory Allocation

- Contiguous Allocation
  - Each process allocated a single contiguous chunk of memory
- Non-contiguous Allocation
  - Parts of a process can be allocated noncontiguous chunks of memory

In this part, we assume that the entire process needs to be in memory for it to run

Contiguous Allocation

- Fixed Partition Scheme
  - Memory broken up into fixed size partitions
    - But the sizes of two partitions may be different
  - Each partition can have exactly one process
  - When a process arrives, allocate it a free partition
    - Can apply different policies to choose a partition
  - Easy to manage
  - Problems:
    - Maximum size of process bound by max. partition size
    - Large internal fragmentation possible

Contiguous Allocation (Cont.)

- Variable Partition Scheme
  - Hole: block of available memory; holes of various sizes are scattered throughout memory
  - When a process arrives, it is allocated memory from a hole large enough to accommodate it
  - Operating system maintains information about:
    a) allocated partitions   b) free partitions (holes)

[Figure: successive memory snapshots showing the OS area and holes being created and filled as processes (e.g., process 2, 5, 8, 9, 10) enter and leave]

Dynamic Storage-Allocation Problem

How to satisfy a request of size n from a list of free holes?

- First-fit: Allocate the first hole that is big enough
- Next-fit: Similar to first-fit, but start from the last hole allocated
- Best-fit: Allocate the smallest hole that is big enough; must search the entire list, unless ordered by size. Produces the smallest leftover hole.
- Worst-fit: Allocate the largest hole; must also search the entire list. Produces the largest leftover hole.
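A minimal first-fit sketch over a linked list of holes (node layout and the shrink-from-the-front splitting policy are illustrative):

#include <stddef.h>

struct hole {
    size_t start, size;
    struct hole *next;
};

/* Returns the start address of an allocated block, or (size_t)-1. */
size_t first_fit(struct hole **free_list, size_t n) {
    for (struct hole **pp = free_list; *pp; pp = &(*pp)->next) {
        struct hole *h = *pp;
        if (h->size >= n) {        /* first hole big enough */
            size_t addr = h->start;
            h->start += n;         /* shrink the hole from the front */
            h->size  -= n;
            if (h->size == 0)      /* hole fully consumed: unlink it */
                *pp = h->next;     /* (node memory management omitted) */
            return addr;
        }
    }
    return (size_t)-1;             /* no hole large enough */
}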

Fragmentation

- External Fragmentation: total memory space exists to satisfy a request, but it is not contiguous
- Internal Fragmentation: allocated memory may be larger than requested memory; this size difference is memory internal to a partition, but not being used
- Reduce external fragmentation by compaction
  - Shuffle memory contents to place all free memory together in one large block
  - Costly

Keeping Track of Free Partitions

- Bitmap method
  - Define some basic fixed allocation unit size
  - 1 bit maintained for each allocation unit
    - 0: unit is free, 1: unit is allocated
  - Bitmap: bitstring of the bits of all allocation units
  - To allocate space of size n allocation units, find a run of n consecutive 0s in the bitmap (see the sketch below)

- Maintain a linked list of free partitions
  - Each node contains start address, size, and pointer to the next free block
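A small sketch of the bitmap search (sizes and helper names are illustrative):

#include <stdint.h>

#define UNITS 1024
static uint8_t bitmap[UNITS / 8];   /* bit i == allocation unit i */

static int bit(int i) { return (bitmap[i / 8] >> (i % 8)) & 1; }

/* Returns the first unit of a run of n zero bits, or -1 if none exists. */
int find_run(int n) {
    int run = 0;
    for (int i = 0; i < UNITS; i++) {
        run = bit(i) ? 0 : run + 1;  /* extend or reset the current run */
        if (run == n)
            return i - n + 1;
    }
    return -1;
}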

Non-contiguous Allocation

- Paging
- Segmentation

Memory Abstraction

- What does the programmer see as memory?
- Simplest: no abstraction
  - Programmer sees the physical memory
  - Compiler generates absolute physical memory addresses

Abstraction: Address Spaces

- A set of addresses that the process can use to address memory
- Each process has its own address space

The Case of No Abstraction

- Addresses generated by the compiler (instruction and data) refer to exact physical memory addresses
  - Compile time binding
  - Instructions and data must be loaded in exactly the same physical memory locations
- Advantage: Fast execution
  - No address translation overhead during actual memory access
- Problem: Unrelated processes may read/write from/to each other's address space
- Multiple processes can still be run
  - If the behavior of the processes is well-known and they use different ranges of physical addresses
    - Possible in some closed systems with known processes
  - Swapping
    - Keep one process in memory at one time
    - Copy the memory space of the process to disk when another process is to be run
    - Copy the memory space back from the disk when the process needs to be rerun
- Not good for general purpose multiprogramming systems

Memory Abstraction: Logical or Virtual Addresses

- Each process has its own address space (Logical Address Space)
- Translating to physical address: Load Time or Run Time

- Load time binding
  - Compiler generates addresses in the process's address space
  - Loader changes addresses during loading depending on where in physical memory the process is loaded
  - Advantage: No address translation overhead during running
  - Problem: total memory requirement of a process needs to be known a priori
  - Problem: Process cannot be moved during execution
  - Problem: Rogue process can still overwrite another process's memory by writing out of bounds; no runtime check

- Load time binding with runtime check
  - Addresses bound at load time, but checked at run time if within bound
  - Solves the problem of overwriting another process's memory, but increases cost of access
  - One simple method
    - H/w provided base and limit registers
      - Accessible only by OS
    - Base register loaded with the beginning physical memory address of the process, given at load time
    - Limit register loaded with the length of memory given to the process
    - On every access, hardware checks if the limit register is exceeded
      - Aborts program if the limit is exceeded

Logical or Virtual Address (contd.)

- Execution/Run time binding
  - Physical address corresponding to a logical address is found only when the logical address is used
  - Process can be moved during its execution
  - CPU generates logical addresses
  - Memory Management Unit (MMU): hardware that converts a generated logical address to a physical address before access
  - Advantage: Processes can be moved during execution, protects one process from another, can grow process memory at run time
  - Problem: Address translation overhead at run time

- The user program deals with logical addresses; it never sees the real physical addresses
- The same logical address in the address spaces of two processes must always map to different physical addresses at runtime
  - How to ensure this for run time binding?

A Simple Solution

- H/w provided base and limit registers
  - Accessible only by OS
- Programs loaded in consecutive memory locations without relocation during load
- Base register loaded with beginning physical memory address of the process
- Limit register loaded with the length of the process
  - Must be known a priori
- On every access, the MMU adds the base register to the logical address, and then checks if the limit register is exceeded
  - Aborts program if the limit is exceeded
- Hard to grow memory if needed, but possible

A Better Solution: Paging

- Allows processes to grow memory as and when needed
- Logical/Virtual address space of a process can be noncontiguous; process is allocated physical memory whenever the latter is available
- Allows multiple processes to reside in memory at the same time

Paging

- Divide physical memory into fixed-sized (power of 2) blocks called frames
- Divide logical memory into blocks of the same size called pages
- Keep track of all free frames
- To run a program of size n pages, need to find n free frames and load the program
- Page table: used to translate logical to physical addresses
  - One page table per process

Page Table

- One entry for each page in the logical address space
- Contains the base address of the page frame where the page is stored
- Also contains a valid bit
  - If set, the logical address is valid and has physical memory allocated to it
  - If not set, the logical address is invalid

Address Translation Scheme

- Address generated by CPU is divided into:
  - Page number (p): used as an index into the page table, which contains the base address of the corresponding page frame in physical memory
  - Page offset (d): combined with the base address to define the physical memory address that is sent to the memory unit

- Translation steps (see the sketch below):
  - Use the page number to index the page table
  - Get the page frame start address
  - Add the offset to that to get the actual physical memory address
  - Access the memory
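A minimal sketch of these steps in C, assuming 4 KB pages and a made-up single-level page table (the valid-bit check is omitted here; it reappears under demand paging):

#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                        /* assumed 4 KB pages */
#define PAGE_SIZE (1u << PAGE_BITS)

/* Hypothetical page table: entry i holds the frame number for page i. */
uint32_t page_table[16] = { 5, 9, 2, 7 };   /* made-up mappings */

uint32_t translate(uint32_t logical) {
    uint32_t p = logical >> PAGE_BITS;      /* page number */
    uint32_t d = logical & (PAGE_SIZE - 1); /* page offset */
    uint32_t frame = page_table[p];         /* page-table lookup */
    return (frame << PAGE_BITS) | d;        /* frame base + offset */
}

int main(void) {
    printf("0x%x\n", translate(0x1234));    /* page 1 -> frame 9: 0x9234 */
    return 0;
}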

Address Translation Architecture

Implementation of Page Table

- Page table is kept in main memory
- Page-table base register (PTBR) points to the page table
- Page-table length register (PTLR) indicates the size of the page table
- In this scheme every data/instruction access requires two memory accesses: one for the page table and one for the data/instruction
- The two-memory-access problem can be solved by the use of a special fast-lookup hardware cache called the translation look-aside buffer (TLB)

Paging Hardware With TLB

Effective Access Time

- TLB lookup time = epsilon time units
- Assume memory cycle time is 1 time unit
- Hit ratio (alpha): percentage of times that a page number is found in the TLB
- Effective Access Time (EAT):

    EAT = (1 + epsilon) * alpha + (2 + epsilon) * (1 - alpha)
        = 2 + epsilon - alpha
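For illustration (numbers assumed, not from the slides): with epsilon = 0.2 and alpha = 0.8, EAT = 2 + 0.2 - 0.8 = 1.4 memory cycles, i.e., about 40% slower than a bare memory access; raising the hit ratio to alpha = 0.98 brings EAT down to 1.22 cycles.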

Page Table Structure

- Hierarchical Paging
- Hashed Page Tables
- Inverted Page Tables

Hierarchical Page Tables

- Break up the logical address space into multiple page tables
- A simple technique is a two-level page table

Two-Level Paging Example

- A logical address (on a 32-bit machine with 4K page size) is divided into:
  - a page number consisting of 20 bits
  - a page offset consisting of 12 bits
- Since the page table is paged, the page number is further divided into:
  - a 10-bit page number (p1, index into the outer page table)
  - a 10-bit page offset (p2, index within a page of the page table)
- Thus, a logical address is as follows:

    |  p1 (10)  |  p2 (10)  |  d (12)  |
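A small sketch of extracting the three fields (bit positions follow the 10/10/12 layout above; names are illustrative):

#include <stdint.h>

void split(uint32_t addr, uint32_t *p1, uint32_t *p2, uint32_t *d) {
    *p1 = addr >> 22;            /* top 10 bits: index into outer table */
    *p2 = (addr >> 12) & 0x3FF;  /* next 10 bits: index into inner table */
    *d  = addr & 0xFFF;          /* low 12 bits: offset within the page */
}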

Two-Level Page-Table Scheme

Address-Translation Scheme

- Address-translation scheme for a two-level 32-bit paging architecture

Hashed Page Tables

- Common in address spaces > 32 bits
- The virtual page number is hashed into a page table. This page table contains a chain of elements hashing to the same location.
- Virtual page numbers are compared in this chain searching for a match. If a match is found, the corresponding physical frame is extracted.

Hashed Page Table

Inverted Page Table

- One entry for each real page of memory (page frame)
- Entry consists of the virtual address of the page stored in that real memory location, with information about the process that owns that page
- Decreases memory needed to store each page table, but increases time needed to search the table when a page reference occurs
- Use a hash table to limit the search to one or at most a few page-table entries

Inverted Page Table Architecture

Protection

- A protection bit can be kept with each page in the page table
  - Ex. read-only page
  - Bits set by OS
- MMU can check for access type when translating the address
  - Traps if illegal access
- More elaborate protections possible with h/w support

Shared Pages

- Example: Shared code
  - One copy of read-only code shared among processes (i.e., text editors, compilers, window systems)
  - Store the shared page in a single page frame
  - Map it to the logical address spaces of the processes by inserting appropriate entries in their page tables that all point to the shared page frame

Segmentation

- Memory-management scheme that supports the user view of memory
- A program is a collection of segments. A segment can be any logical unit
  - code, global variables, heap, stack, ...
- Segment sizes may be different

Segmentation Architecture

- Logical address consists of a two-tuple:
    <segment-number, offset>
- Segment table maps two-dimensional logical addresses to one-dimensional physical addresses; each table entry has:
  - base: contains the starting physical address where the segment resides in memory
  - limit: specifies the length of the segment
- Segment-table base register (STBR) points to the segment table's location in memory
- Segment-table length register (STLR) indicates the number of segments used by a program; segment number s is legal if s < STLR

Segmentation Architecture (Cont.)

- Protection. With each entry in the segment table associate:
  - validation bit = 0 => illegal segment
  - read/write/execute privileges
- Protection bits associated with segments; code sharing occurs at segment level
- Since segments vary in length, memory allocation is a dynamic storage-allocation problem

Segmentation Hardware

Example of Segmentation

Sharing of Segments

Virtual Memory

Basic Concept

- Usually, only part of the program needs to be in memory for execution
- Allow the logical address space to be larger than the physical memory size
- Bring only what is needed into memory, when it is needed
- Virtual memory implementation
  - Demand paging
  - Demand segmentation

Virtual Memory That is Larger Than Physical Memory

Demand Paging

- Bring a page into memory only when it is needed (on demand)
  - Less I/O needed to start a process
  - Less memory needed
  - Faster response
  - More users
- Page is needed => reference to it
  - invalid reference => abort
  - not-in-memory => bring to memory

Transfer of a Paged Memory to Contiguous Disk Space

Some Questions

- How to know if a page is in memory?
  - Valid bit
- If not present, what happens during a reference to a logical address in that page?
  - Page fault
- If not present, where in disk is it?
- If a new page is brought to memory, where should it be placed?
  - Any free page frame
- What if there is no free page frame?
  - Page replacement policies
- Is it always necessary to copy a page back to disk on replacement?
  - Dirty/Modified bit

Valid-Invalid Bit

- With each page table entry a valid-invalid bit is associated (1: in-memory, 0: not-in-memory)
- Initially, the valid-invalid bit is set to 0 on all entries
- Set to 1 when memory is allocated for a page (page brought into memory)
- Address translation steps:
  - Use the page no. in the address to index into the page table
  - Check the valid bit
  - If set, get the page frame start address, add the offset to get the memory address, access memory
  - If not set, page fault

Page Table When Some Pages Are Not in Main Memory

Page Fault

- If there is ever a reference to a page not in memory, the first reference will trap to the OS: page fault
- OS looks at another table to decide:
  - Invalid reference => abort
  - Just not in memory => get from disk
- Get an empty frame
- Get the disk address of the page (in the PTE)
- Swap the page into the frame from disk (context switch for I/O wait)
- Modify the PTE entry for the page with the frame no., set valid bit = 1
- Restart the instruction that caused the fault; tricky cases:
  - block move
  - auto increment/decrement location

Steps in Handling a Page Fault

Performance of Demand Paging

- Page Fault Rate p, 0 <= p <= 1.0
  - if p = 0, no page faults
  - if p = 1, every reference is a fault
- Effective Access Time (EAT):

    EAT = (1 - p) x memory access time
        + p x (page fault overhead
               + [swap page out]
               + swap page in
               + restart overhead)

Demand Paging Example

- Memory access time = 150 ns
- 50% of the time the page being replaced has been modified and therefore needs to be swapped out
- Swap page time = 10 msec = 10^7 ns
- EAT = (1 - p) x 150 + p x (1.5 x 10^7) ns
      ~ 150 + p x (1.5 x 10^7) ns
- For EAT = 200 ns, we need p = 0.0000033!
- For EAT = 165 ns (10% loss), we need p = 0.000001, or 1 fault in 1,000,000 accesses

Page in Disk

- Swap space: part of disk divided into page-sized slots
  - First slot has swap space management info such as free slots etc.
- Pages can be swapped out from memory to a free page slot
  - Address of page: <swap partition no., page slot no.>
  - Address stored in PTE
- Read-only pages can be read from the file system directly using memory-mapped files

What happens if there is no free frame?

- Page replacement: find some page in memory, but not really in use, and swap it out
  - algorithm?
  - performance: want an algorithm which will result in the minimum number of page faults
- The same page may be brought into memory several times

Page Replacement

- Prevent over-allocation of memory by modifying the page-fault service routine to include page replacement
- Use the modified (dirty) bit to reduce the overhead of page transfers
  - Set to 0 when a page is brought into memory
  - Set to 1 when some location in the page is changed
  - Copy the page to disk on replacement only if the bit is set to 1
- Page replacement completes the separation between logical memory and physical memory: a large virtual memory can be provided on a smaller physical memory

Need For Page Replacement

Basic Page Replacement

1. Find the location of the desired page on disk
2. Find a free frame
   - If there is a free frame, use it
   - If there is no free frame, use a page replacement algorithm to select a victim frame
3. Copy the to-be-replaced frame to disk if needed
4. Read the desired page into the (newly) free frame; update the page and frame tables
5. Restart the process

Page Replacement

Page Replacement Algorithms

- Want the lowest page-fault rate
- Evaluate an algorithm by running it on a particular string of memory references (reference string of pages referenced) and computing the number of page faults on that string

First-In-First-Out (FIFO) Algorithm

- Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- 3 frames (3 pages can be in memory at a time per process): 9 page faults

FIFO Page Replacement

Belady's Anomaly

- The number of page faults may increase with an increase in the number of page frames for FIFO
  - Counter-intuitive
- Consider 4 page frames on the same reference string: 10 page faults

Belady's Anomaly

Optimal Algorithm

- Replace the page that will not be used for the longest period of time
- 4 frames example, reference string 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5: 6 page faults
- How do you know this?
- Used for measuring how well your algorithm performs

Optimal Page Replacement

Least Recently Used (LRU) Algorithm

- Reference string: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5
- Counter implementation
  - Every page entry has a counter; every time the page is referenced through this entry, copy the clock into the counter
  - When a page needs to be replaced, look at the counters to find the least recently used page
LRU Page Replacement

LRU Algorithm (Cont.)

- Stack implementation: keep a stack of page numbers in a doubly linked form
  - Page referenced:
    - move it to the top
    - requires 6 pointers to be changed
  - No search for replacement

Use Of A Stack to Record The Most Recent Page References
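A minimal sketch of the counter (timestamp) implementation (frame count and the logical clock are illustrative):

#include <stdint.h>

#define NFRAMES 4
static uint64_t last_used[NFRAMES]; /* "clock" copied on each reference */
static uint64_t clock_now = 0;

void on_reference(int frame) {
    last_used[frame] = ++clock_now; /* record time of use */
}

int lru_victim(void) {
    int victim = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (last_used[i] < last_used[victim]) /* oldest timestamp */
            victim = i;
    return victim;
}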

LRU Approximation Algorithms

- Reference bit
  - 1 bit per page frame, initially = 0
  - When the page frame is referenced, the bit is set to 1
  - Periodically reset to 0
  - Replace a frame whose bit is 0 (if one exists). We do not know the order, however

- Additional Reference Bits algorithm (see the sketch below)
  - K bits, marked 1 to K from the MSB
  - The ith bit indicates if the page was accessed in the ith most recent interval
  - At every interval (timer interrupt), right shift the bits (LSB drops off), shift the reference bit into the MSB, and reset the reference bit to 0
  - To replace, find the frame with the smallest value of the K bits
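A small sketch of the additional-reference-bits (aging) bookkeeping, assuming K = 8 and a simulated reference bit array (names are illustrative):

#include <stdint.h>

#define NFRAMES 64

static uint8_t age[NFRAMES];     /* K = 8 history bits per frame */
static uint8_t ref_bit[NFRAMES]; /* hardware-set reference bit (simulated) */

/* Called from the (hypothetical) timer interrupt handler. */
void age_tick(void) {
    for (int i = 0; i < NFRAMES; i++) {
        age[i] = (age[i] >> 1) | (uint8_t)(ref_bit[i] << 7); /* ref bit -> MSB */
        ref_bit[i] = 0;                                      /* reset for next interval */
    }
}

/* Victim = frame with smallest age value (least recently used, approx). */
int pick_victim(void) {
    int victim = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (age[i] < age[victim])
            victim = i;
    return victim;
}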

Second Chance

- Needs a reference bit
- Clock replacement
- If the page to be replaced (in clock order) has reference bit = 1, then:
  - set reference bit 0
  - leave the page in memory
  - replace the next page (in clock order), subject to the same rules

Second-Chance (Clock) Page-Replacement Algorithm

Counting Algorithms

- Keep a counter of the number of references that have been made to each page
- LFU Algorithm: replaces the page with the smallest count
- MFU Algorithm: based on the argument that the page with the smallest count was probably just brought in and has yet to be used

Allocation of Frames

- Each process needs some minimum number of pages
- No process should use up nearly all page frames
- Two major allocation schemes
  - fixed allocation
  - priority allocation

Fixed Allocation

- Equal allocation: e.g., if 100 frames and 5 processes, give each 20 pages
- Proportional allocation: allocate according to the size of the process

Priority Allocation

- Use a proportional allocation scheme using priorities rather than size
- If process Pi generates a page fault,
  - select for replacement one of its frames, or
  - select for replacement a frame from a process with lower priority number

Global vs. Local Allocation

- Global replacement: a process selects a replacement frame from the set of all frames; one process can take a frame from another
- Local replacement: each process selects from only its own set of allocated frames

Why does paging work?

- Locality of reference
  - Processes tend to access locations which are close to each other (spatial locality) or which were accessed in the recent past (temporal locality)
- Locality of reference implies that once a set of pages is brought in for a process, there is less chance of page faults by the process for some time
- Also, the TLB hit ratio will be high
- Working set: the set of pages currently needed by a process
- The working set of a process changes over time
  - But remains the same for some time due to locality of reference

Thrashing

- If a process does not have enough pages, the page-fault rate is very high. This leads to:
  - low CPU utilization
  - operating system thinks that it needs to increase the degree of multiprogramming
  - another process is added to the system
  - this adds to the problem, as even fewer page frames are available for each process
- Thrashing: a process is busy swapping pages in and out
- Thrashing occurs when
    sum of the working sets of all processes > total memory size

Working-Set Model

- Delta = working-set window = a fixed number of page references
  - Example: 10,000 instructions
- WSSi (working set size of process Pi) = total number of pages referenced in the most recent Delta (varies in time)
  - if Delta is too small, it will not encompass the entire locality
  - if Delta is too large, it will encompass several localities
  - if Delta = infinity, it will encompass the entire program
- D = sum of all WSSi = total demand for frames
- If D > m (total number of frames), thrashing occurs
- Policy: if D > m, then suspend one of the processes

Working-set model

Keeping Track of the Working Set

- Approximate with an interval timer + a reference bit
- Example: Delta = 10,000
  - Timer interrupts after every 5000 time units
  - Keep in memory 2 bits for each page
  - Whenever the timer interrupts, copy and then set the values of all reference bits to 0
  - If one of the bits in memory = 1, the page is in the working set
- Why is this not completely accurate?
- Improvement: 10 bits and interrupt every 1000 time units

Page-Fault Frequency Scheme

- Establish an acceptable page-fault rate
  - If the actual rate is too low, the process loses a frame
  - If the actual rate is too high, the process gains a frame

What should the page size be?

- Reduce internal fragmentation: smaller is better
- Reduce page table size: larger is better
- Capture locality: smaller is better
- Reduce I/O overhead: larger is better
- Needs to be chosen judiciously

Other Considerations

- Prepaging
  - Bring in pages not referenced yet
- TLB Reach
  - The amount of memory accessible from the TLB
  - TLB Reach = (TLB Size) x (Page Size)
  - Ideally, the working set of each process is stored in the TLB; otherwise there is a high degree of page faults

Other Considerations (Cont.)

- Program structure
    int A[][] = new int[1024][1024];
  Each row is stored in one page

  Program 1:
    for (j = 0; j < A.length; j++)
        for (i = 0; i < A.length; i++)
            A[i][j] = 0;
  1024 x 1024 page faults!!

  Program 2:
    for (i = 0; i < A.length; i++)
        for (j = 0; j < A.length; j++)
            A[i][j] = 0;
  1024 page faults

Other Considerations (Cont.)

- I/O Interlock: pages must sometimes be locked into memory
  - Pages that are used for copying a file from a device must be locked from being selected for eviction by a page replacement algorithm
  - Some OS pages need to be in memory all the time
  - Use a lock bit to indicate that the page is locked and cannot be replaced

File Management

Two Parts

- Filesystem Interface
  - Interface the user sees
    - Organization of the files as seen by the user
    - Operations defined on files
    - Properties that can be read/modified
- Filesystem Design
  - Implementing the interface

Filesystem Interface

Basic Topics

- File Concept
- Access Methods
- Directory Structure
- File System Mounting
- File Sharing
- Protection

File Concept

- Logical unit of information on secondary storage
- Named collection of related info on secondary storage
- Smallest unit of allocation on disk
  - All info must be in at least one file
- Abstracts out the secondary storage details by presenting a common logical storage view

File Types

- Data
  - Text, binary, ...
- Program
- Regular files: store information
- Directory: stores information about file(s)
- Device files: represent different devices

File Structure

- None: sequence of words, bytes
- Simple record structure
  - Lines
  - Fixed length
  - Variable length
- Complex structures
  - Formatted document
  - Relocatable load file

Important File Attributes

- Name: only information kept in human-readable form
- Type: needed for systems that support different types
- Location: pointer to file location on device
- Size: current file size
- Protection: controls who can do reading, writing, executing
- Time, date, and user identification: data for protection, security, and usage monitoring

Information about files is kept in the directory structure, which is maintained on the disk

File Operations

- Create
- Write
- Read
- Reposition within file (file seek)
- Delete
- Truncate
- Open(Fi): search the directory structure on disk for entry Fi, and move the content of the entry to memory
- Close(Fi): move the content of entry Fi in memory to the directory structure on disk

Access Methods

- Sequential Access
    read next
    write next
    reset
- Direct Access (n = relative block number)
    read n
    write n
    position to n
    read next
    write next

Sequential-access File

Example of Index and Relative


Files

Directory Structure

z A collection of nodes containing information about all files
(figure: a directory whose entries point to files F1, F2, F3, F4, ..., Fn)

A Typical File-system Organization (figure)
Information in a Device Directory

z Name
z Type
z Address
z Current length
z Maximum length
z Date last accessed (for archival)
z Date last updated (for dump)
z Owner ID (who pays)
z Protection information (discussed later)
Operations Performed on a Directory

z Search for a file
z Create a file
z Delete a file
z List a directory
z Rename a file
z Traverse the file system
Organize the Directory (Logically) to Obtain

z Efficiency - locating a file quickly
z Naming - convenient to users
  z Two users can have the same name for different files
  z The same file can have several different names
z Grouping - logical grouping of files by properties (e.g., all Java programs, all games, ...)
Single-Level Directory

z A single directory for all users
z Problems
  z Naming problem
  z Grouping problem
Two-Level Directory

z Separate directory for each user
z Path names
z Can have the same file name for different users
z Efficient searching
z No grouping capability
Tree-Structured Directories (figure)

Tree-Structured Directories (Cont.)

z Efficient searching
z Grouping capability
z Current directory (working directory)
  z cd /spell/mail/prog
  z type list
Tree-Structured Directories (Cont.)

z Absolute or relative path names
z Creating a new file is done in the current directory
z Delete a file: rm <file-name>
z Creating a new subdirectory is done in the current directory: mkdir <dir-name>
Acyclic-Graph Directories

z Have shared subdirectories and files

Acyclic-Graph Directories (Cont.)

z Two different names (aliasing)
z If dict deletes list => dangling pointer
z Solutions:
  z Backpointers, so we can delete all pointers
    z Variable size records are a problem
  z Backpointers using a daisy chain organization
  z Entry-hold-count solution
General Graph Directory (figure)

General Graph Directory (Cont.)

z How do we guarantee no cycles?
  z Allow only links to files, not subdirectories
  z Garbage collection
  z Every time a new link is added, use a cycle detection algorithm to determine whether it is OK
File System Mounting

z A filesystem must be mounted before it can be accessed
z One filesystem is designated as the root filesystem
z The root directory of the root filesystem is the system root directory
z Parts of other filesystems are added to the directory tree under root by mounting them onto a directory in the root filesystem
z The directory onto which a filesystem is mounted is called the mount point
z The previous contents of the mount point become inaccessible
(figure: a root filesystem containing /usr, /sys, /dev, /etc, /bin, and a filesystem fs1 containing /local, /users, /bin, /adm, mounted on /usr)

z Accessing /usr/adm/... now actually accesses /adm/... in filesystem fs1
z /usr in the root file system is the mount point
z Anything under /usr in the root filesystem becomes inaccessible until fs1 is unmounted
z Mounting can now be done on any other mount point, including any directory on an earlier mounted filesystem
  z Ex. we can now mount some other filesystem fs2 on /usr/adm; this will hide all files under /adm of fs1, and access to /usr/adm will go to the corresponding part of fs2
z Need not mount / always; can mount any subtree of a filesystem on a mount point to add only part of a filesystem (but it has to be a complete subtree)
File Sharing

z Create links to files
  z The same file accessed from two different places in the directory structure, using possibly different names
z Soft links vs. hard links
Protection

z File owner/creator should be able to control:
  z what can be done
  z by whom
z Types of access
  z Read
  z Write
  z Execute
  z Append
  z Delete
  z List
Filesystem Implementation

Basic Topics
z Data Structures for File Access
z Disk Layout of Filesystems
z Allocating Storage for Files
z Directory Implementation
z Free-Space Management
z Virtual Filesystems
z Efficiency and Performance
z Recovery
Data Structures for File Access

z File Control Block (FCB)
  z One per file
  z Contains file attributes and the location of the disk blocks of the file
  z Stored on disk, usually brought to memory when the file is opened
z Open File Table
  z In-memory table with one entry per open file
  z Each entry points to the FCB of the file (on disk, or usually to a copy in memory)
  z Can be hierarchical
    z Per-process table with entries pointing to entries in a single system-wide table
    z System-wide table points to the FCB of the file

A Typical File Control Block (figure)

In-Memory Open File Tables (figure)
Disk Layout

z Files are stored on disks. Disks are broken up into one or more partitions, with a separate filesystem on each partition
z Sector 0 of the disk is the Master Boot Record (MBR)
  z Used to boot the computer
z The end of the MBR holds the partition table, with the starting and ending addresses of each partition
z One of the partitions is marked active in the master boot record

Disk Layout (contd.)

z Boot computer => BIOS reads/executes MBR
z MBR finds the active partition and reads in its first block (the boot block)
z The program in the boot block locates the OS for that partition and reads it in
z All partitions start with a boot block
One Possible Example (figure)

z Superblock contains info about the fs (e.g. type of fs, number of blocks, ...)
z i-nodes contain info about files
  z "i-node" is the common Unix name for the FCB
Allocation Methods

z An allocation method refers to how disk blocks are allocated for files
z Possibilities
  z Contiguous allocation
  z Linked allocation
  z Indexed allocation
Contiguous Allocation

z Each file occupies a set of contiguous blocks on the disk
z Easy to implement - only the starting location (block #) and length (number of blocks) are required
z Random access
z Wasteful of space (dynamic storage-allocation problem)
  z Fragmentation possible
z Files cannot grow

Contiguous Allocation of Disk Space (figure)
Extent-Based Systems

z Many newer file systems use a modified contiguous allocation scheme
z Extent-based file systems allocate disk blocks in extents
z An extent is a contiguous sequence of disk blocks. Extents are allocated for file allocation; a file consists of one or more extents
Linked Allocation

z Each file is a linked list of disk blocks; blocks may be scattered anywhere on the disk
z Simple - need only the starting address
z Free-space management system - no waste of space
z No random access

Linked Allocation (figure)
Linked List Allocation

Linked lists using a table in


memory
z
z
z

Put pointers in table in memory


File Allocation Table (FAT)
Still have to traverse pointers, but now in
memory
But table becomes really big
z
z

200 GB disk with 1 KB blocks needs a 600 MB table


Growth of the table size is linear with the growth of the
disk size

File-Allocation Table

Indexed Allocation

z Brings all pointers together into the index block
z Logical view: a per-file index table whose entries point to the data blocks

Example of Indexed Allocation (figure)
Indexed Allocation (Cont.)

z Need index table
z Random access
z Dynamic access without external fragmentation, but with the overhead of the index block
z Mapping from logical to physical in a file of maximum size 256K words and block size of 512 words: we need only 1 block for the index table

Indexed Allocation Mapping (Cont.)

z Mapping from logical to physical in a file of unbounded length (block size of 512 words)
z Linked scheme - link blocks of the index table together (no limit on size)
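Worked mapping for the bounded case above (assuming one-word block pointers): 256K words / 512 words per block = 512 data blocks, and one 512-word index block holds exactly 512 pointers, so a single index block suffices. A logical address LA then maps as

  index entry = LA div 512,  offset within block = LA mod 512

i.e. 9 bits of index plus 9 bits of offset cover the full 2^18 = 256K words.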
Two-level Indexing

z Two-level index: maximum file size is 512^3 words
  outer-index -> index table -> file
i-nodes

z FCB in Unix
z Contains file attributes and the disk addresses of blocks
z One block can hold only a limited number of disk block addresses, which limits the size of a file
z Solution: use some of the blocks to hold the addresses of blocks holding the addresses of disk blocks of files
  z Can take this to more than one level

i-node with one-level indirection (figure)
Unix i-node

z File attributes
z 12 direct pointers
z 1 singly indirect pointer
  z Points to a block that has disk block addresses
z 1 doubly indirect pointer
  z Points to a block that points to blocks that have disk block addresses
z 1 triply indirect pointer
  z Points to a block that points to blocks that point to blocks that have disk block addresses
z What is the max. file size possible??
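A worked answer under assumed parameters (4 KB blocks and 4-byte block addresses, so 1024 addresses per block; other choices change the numbers):

  direct:   12 x 4 KB       =  48 KB
  single:   1024 x 4 KB     =   4 MB
  double:   1024^2 x 4 KB   =   4 GB
  triple:   1024^3 x 4 KB   =   4 TB
  total:    ~ 4 TB (dominated by the triply indirect tree)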
Directory Implementation

z Linear list of file names, each with a pointer to its data blocks
  z Address of first block (contiguous)
  z Number of first block (linked)
  z Number of i-node
  z Simple to program
  z Time-consuming to execute
z Hash table - linear list with a hash data structure
  z Decreases directory search time
  z Collisions - situations where two file names hash to the same location
  z Fixed size
Free-Space Management

z Bit vector (n blocks): bit[i] = 1 if block[i] is free, 0 if block[i] is occupied
z Block number calculation for the first free block:
  (number of bits per word) * (number of 0-value words) + offset of first 1 bit
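A minimal C sketch of that calculation, following the formula literally (32-bit words; names are illustrative):

#include <stdint.h>

#define BITS_PER_WORD 32

/* Returns the block number of the first free block (bit = 1 means free),
   or -1 if no bit is set anywhere in the bitmap. */
int first_free_block(const uint32_t *bitmap, int nwords)
{
    int zero_words = 0;
    while (zero_words < nwords && bitmap[zero_words] == 0)
        zero_words++;                       /* number of 0-value words */
    if (zero_words == nwords)
        return -1;

    uint32_t word = bitmap[zero_words];
    int offset = 0;
    while (((word >> offset) & 1) == 0)     /* offset of first 1 bit */
        offset++;

    return BITS_PER_WORD * zero_words + offset;
}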
Free-Space Management (Cont.)

z Bit map requires extra space. Example:
  block size = 2^12 bytes
  disk size = 2^30 bytes (1 gigabyte)
  n = 2^30 / 2^12 = 2^18 bits (or 32K bytes)
z Easy to get contiguous files
z Linked list (free list)
  z Cannot get contiguous space easily
  z No waste of space
  z May need a number of disk accesses to find a free block
z Grouping
z Counting
Free-Space Management (Cont.)

z Need to protect:
  z Pointer to free list
  z Bit map
    z Must be kept on disk
    z Copy in memory and disk may differ

Linked Free Space List on Disk (figure)
Virtual File Systems

z Virtual File Systems (VFS) provide an object-oriented way of implementing file systems
z VFS allows the same system call interface (the API) to be used for different types of file systems
z The API is to the VFS interface, rather than to any specific type of file system

Schematic View of Virtual File System (figure)
How VFS Works

z A file system registers with the VFS (e.g. at boot time)
z At registration time, the fs provides the list of addresses of the function calls the VFS wants
z The VFS gets info from the new fs i-node and puts it in a v-node
z Makes an entry in the fd table for the process
z When a process issues a call (e.g. read), function pointers direct it to the concrete function calls

(figure: a simplified view of the data structures and code used by the VFS and a concrete file system to do a read)
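A minimal sketch of the registration idea in C - a concrete filesystem hands the VFS a table of function pointers, and the VFS dispatches through it. All names here (vfs_ops, vfs_register, vfs_read) are illustrative, not the real Linux interface:

#include <stddef.h>

struct vfs_ops {                        /* what the VFS wants from each fs */
    int  (*open)(const char *path);
    long (*read)(int fd, void *buf, size_t n);
    int  (*close)(int fd);
};

static const struct vfs_ops *registered_fs;   /* one slot, for simplicity */

void vfs_register(const struct vfs_ops *ops)  /* called by the fs at boot */
{
    registered_fs = ops;
}

long vfs_read(int fd, void *buf, size_t n)    /* system-call entry point */
{
    /* dispatch through the pointer the concrete fs registered */
    return registered_fs->read(fd, buf, n);
}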
Efficiency and Performance

z Efficiency dependent on:
  z disk allocation and directory algorithms
  z types of data kept in a file's directory entry
z Performance
  z disk cache - separate section of main memory for frequently used blocks
  z free-behind and read-ahead - techniques to optimize sequential access
  z improve PC performance by dedicating a section of memory as a virtual disk, or RAM disk

Various Disk-Caching Locations (figure)
Page Cache

z A page cache caches pages rather than disk blocks, using virtual memory techniques
z Memory-mapped I/O uses a page cache
z Routine I/O through the file system uses the buffer (disk) cache

I/O Without a Unified Buffer Cache (figure)

Unified Buffer Cache

z A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary file system I/O

I/O Using a Unified Buffer Cache (figure)
Recovery

z Consistency checking - compares data in the directory structure with the data blocks on disk, and tries to fix inconsistencies
z Use system programs to back up data from disk to another storage device (floppy disk, magnetic tape)
z Recover a lost file or disk by restoring data from backup
Disk Management

Physical Disk Structure (figure)

Disk Structure

z Disk drives are addressed as large 1-dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer
z The 1-dimensional array of logical blocks is mapped onto the sectors of the disk sequentially
  z Sector 0 is the first sector of the first track (top platter) on the outermost cylinder
  z Mapping proceeds in order through that track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost
Disk Access Time

z Two major components
  z Seek time - the time for the disk to move the heads to the cylinder containing the desired sector
    z Typically 5-10 milliseconds
  z Rotational latency - the additional time waiting for the disk to rotate the desired sector under the disk head
    z Typically 2-4 milliseconds
z One minor component
  z Read/write time or transfer time - the actual time to transfer a block, less than a millisecond
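As a worked example with assumed spindle speeds: at 7200 RPM one rotation takes 60000 / 7200 = 8.33 ms, so the average rotational latency (half a rotation) is about 4.2 ms; at 15000 RPM it drops to about 2 ms - matching the 2-4 ms range quoted above.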
Disk Scheduling

z Should ensure fast access time and high disk bandwidth
z Fast access
  z Minimize the total seek time of a group of requests
  z If requests are for different cylinders, the average rotational latency has to be incurred for each anyway, so minimizing it is not the primary goal (though some scheduling is possible if there are multiple requests for the same cylinder)
z Seek time is roughly proportional to seek distance
  z Main goal: reduce the total seek distance for a group of requests
  z Auxiliary goal: fairness in waiting times for the requests
z Disk bandwidth - the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer
Disk Scheduling (Cont.)

z Several algorithms exist to schedule the servicing of disk I/O requests
z We illustrate them with a request queue (cylinders 0-199):
  98, 183, 37, 122, 14, 124, 65, 67
  Head pointer at cylinder 53

FCFS

z Service requests in the order they come
z Fair to all requests
z Can cause very large total seek time over all requests if the load is moderate to high
z Illustration shows total head movement of 640 cylinders
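Checking the 640-cylinder figure: the head visits 53 -> 98 -> 183 -> 37 -> 122 -> 14 -> 124 -> 65 -> 67, so the movement is 45 + 85 + 146 + 85 + 108 + 110 + 59 + 2 = 640 cylinders.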
SSTF

z Selects the request with the minimum seek time from the current head position
z SSTF scheduling is a form of SJF scheduling
  z May cause starvation of some requests, like SJF
  z But not optimal, unlike SJF
z Minimizes seek time, but not fair
z May work well if the load is not high

SSTF (Cont.)

z Total head movement = 236 cylinders
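Checking the 236-cylinder figure: from 53, the nearest-first order is 53 -> 65 -> 67 -> 37 -> 14 -> 98 -> 122 -> 124 -> 183, giving 12 + 2 + 30 + 23 + 84 + 24 + 2 + 59 = 236 cylinders.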
SCAN

z The disk arm starts at one end of the disk and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues
z Sometimes called the elevator algorithm

SCAN (Cont.) (figure)
C-SCAN

z Provides a more uniform wait time than SCAN
z The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip
z Treats the cylinders as a circular list that wraps around from the last cylinder to the first one

C-SCAN (Cont.) (figure)
C-LOOK

z Version of C-SCAN
z The arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk

C-LOOK (Cont.) (figure)
Selecting a Disk-Scheduling Algorithm

z SSTF is common and has a natural appeal
z SCAN and C-SCAN perform better for systems that place a heavy load on the disk
z Performance depends on the number and types of requests
z Requests for disk service can be influenced by the file-allocation method
z The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary
z Either SSTF or C-LOOK is a reasonable choice for the default algorithm (depending on load)
Disk Management

z Low-level formatting, or physical formatting - dividing a disk into sectors that the disk controller can read and write
z To use a disk to hold files, the operating system still needs to record its own data structures on the disk
  z Partition the disk into one or more groups of cylinders
  z Logical formatting, or "making a file system"
z Boot block initializes the system
  z The bootstrap is stored in ROM
  z Bootstrap loader program
z Methods such as sector sparing are used to handle bad blocks
Operating System Issues

z Major OS jobs are to manage physical devices and to present a virtual machine abstraction to applications
z For hard disks, the OS provides two abstractions:
  z Raw device - an array of data blocks
  z File system - the OS queues and schedules the interleaved requests from several applications
Application Interface

z Most OSs handle removable disks almost exactly like fixed disks - a new cartridge is formatted and an empty file system is generated on the disk
z Tapes are presented as a raw storage medium, i.e., an application does not open a file on the tape, it opens the whole tape drive as a raw device
z Usually the tape drive is reserved for the exclusive use of that application
z Since the OS does not provide file system services, the application must decide how to use the array of blocks
z Since every application makes up its own rules for how to organize a tape, a tape full of data can generally only be used by the program that created it
CPU Scheduling

Linux Scheduler History

z We will be talking about the O(1) scheduler

SMP Support in 2.4 and 2.6 Versions

(figure: scheduling across CPU1, CPU2, CPU3 in the 2.4 kernel vs. the 2.6 kernel)
Linux Scheduling

z 3 scheduling classes
  z SCHED_FIFO and SCHED_RR are real-time classes
  z SCHED_OTHER is for the rest
z 140 priority levels
  z 1-100: RT priorities
  z 101-140: user task priorities
z Three different scheduling policies
  z One for normal tasks
  z Two for real-time tasks
z Pre-emptive, priority-based scheduling
  z When a process with higher real-time priority (rt_priority) wishes to run, all other processes with lower real-time priority are thrust aside
  z In SCHED_FIFO, a process runs until it relinquishes control or another with higher real-time priority wishes to run
  z A SCHED_RR process, in addition to this, is also interrupted when its time slice expires or there are processes of the same real-time priority (RR between processes of this class)
  z SCHED_OTHER is also round-robin, with a lower time slice
SCHED_OTHER: Normal Tasks

z Each task is assigned a Nice value
  z Nice value between -20 and +19
  z Static priority = 120 + Nice
z Assigned a time slice
z Tasks at the same priority are round-robined
  z Ensures priority + fairness
Basic Philosophies

z Priority is the primary scheduling mechanism
z Priority is dynamically adjusted at run time
  z Processes denied access to the CPU get their priority increased
  z Processes running a long time get their priority decreased
z Try to distinguish interactive processes from non-interactive ones
  z Bonus or penalty reflecting whether I/O-bound or compute-bound
z Use large quanta for important processes
  z Modify quanta based on CPU use
z Associate processes to CPUs
z Do everything in O(1) time
The Runqueue

z 140 separate queues, one for each priority level
z Actually, two sets: active and expired
z Priorities 0-99 for real-time processes
z Priorities 100-139 for normal processes; value set via the nice()/setpriority() system calls

Linux 2.6 Scheduler Runqueue Structure (figure)
Scheduler Runqueue

z A scheduler runqueue is a list of tasks that are runnable on a particular CPU
z An rq structure maintains a linked list of those tasks
z The runqueues are maintained as an array, runqueues, indexed by the CPU number
z The rq keeps a reference to its idle task
  z The idle task for a CPU is never on the scheduler runqueue for that CPU (it's always the last choice)
z Access to a runqueue is serialized by acquiring and releasing rq->lock
Basic Scheduling Algorithm

z Find the highest-priority queue with a runnable process
z Find the first process on that queue
z Calculate its quantum size
z Let it run
z When its time is up, put it on the expired list
  z Recalculate its priority first
z Repeat
Process Descriptor Fields Related to the Scheduler

z thread_info->flags
z thread_info->cpu
z state
z prio
z static_prio
z run_list
z array
z sleep_avg
z timestamp
z last_ran
z activated
z policy
z cpus_allowed
z time_slice
z first_time_slice
z rt_priority
The Highest Priority Process

z There is a bit map indicating which queues have processes that are ready to run
z Find the first bit that's set:
  z 140 queues => 5 32-bit integers
  z Only a few compares to find the first one that is non-zero
  z Hardware instruction to find the first 1-bit
    z bsfl on Intel
z Time depends on the number of priority levels, not the number of processes
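A minimal sketch of the bitmap search in C (the same idea as, but not identical to, the kernel's sched_find_first_bit):

#include <stdint.h>

#define NUM_PRIO 140
#define NWORDS ((NUM_PRIO + 31) / 32)   /* = 5 words */

/* Returns the highest priority (lowest index) with a runnable task,
   or -1 if every queue is empty. */
int highest_runnable_prio(const uint32_t bitmap[NWORDS])
{
    for (int w = 0; w < NWORDS; w++) {   /* at most 5 compares */
        if (bitmap[w] != 0)
            /* __builtin_ffs compiles down to a single find-first-bit
               instruction (bsfl on x86) */
            return w * 32 + __builtin_ffs((int)bitmap[w]) - 1;
    }
    return -1;
}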
Scheduling Components

z Static priority
z Sleep average
z Bonus
z Dynamic priority
z Interactivity status
Static Priority

z Each task has a static priority that is set based upon the nice value specified by the task
  z static_prio in task_struct
  z Value between 0 and 139 (between 100 and 139 for normal processes)
z Each task also has a dynamic priority that is set based upon a number of factors
  z Tries to increase the priority of interactive jobs
Sleep Average

z Interactivity heuristic: sleep ratio
  z Mostly sleeping: I/O bound
  z Mostly running: CPU bound
z Sleep ratio approximation
  z sleep_avg in the task_struct
  z Range: 0 .. MAX_SLEEP_AVG
z When a process wakes up (is made runnable), recalc_task_prio adds in how many ticks it was sleeping (blocked), up to some maximum value (MAX_SLEEP_AVG)
z When a process is switched out, schedule subtracts the number of ticks that the task actually ran (without blocking)
z sleep_avg is scaled to a bonus value
Average Sleep Time and Bonus Values

Average sleep time            Bonus
>= 0 but < 100 ms             0
>= 100 ms but < 200 ms        1
>= 200 ms but < 300 ms        2
>= 300 ms but < 400 ms        3
>= 400 ms but < 500 ms        4
>= 500 ms but < 600 ms        5
>= 600 ms but < 700 ms        6
>= 700 ms but < 800 ms        7
>= 800 ms but < 900 ms        8
>= 900 ms but < 1000 ms       9
1 second                      10
Bonus and Dynamic Priority

z Dynamic priority (prio in task_struct) is calculated from the static priority and the bonus:
  prio = max(100, min(static_priority - bonus + 5, 139))
Calculating Time Slices

z time_slice in the task_struct
z Calculate the quantum as:
  z If (SP < 120): Quantum = (140 - SP) * 20 ms
  z If (SP >= 120): Quantum = (140 - SP) * 5 ms
  z where SP is the static priority
z Higher-priority processes get longer quanta
  z Basic idea: important processes should run longer
  z Other mechanisms are used for quick interactive response

Nice Value vs. Static Priority and Quantum (figure)
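Worked examples of the formula: nice -20 gives SP = 100, so Quantum = (140 - 100) * 20 = 800 ms; nice 0 gives SP = 120, so Quantum = (140 - 120) * 5 = 100 ms; nice +19 gives SP = 139, so Quantum = (140 - 139) * 5 = 5 ms.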
Interactive Processes

z A process is considered interactive if
  bonus - 5 >= (static priority / 4) - 28
  z (static priority / 4) - 28 is called the interactive delta
z Low-priority processes have a hard time becoming interactive:
  z A high static priority (100) process becomes interactive when its average sleep time is greater than 200 ms
  z A default static priority process becomes interactive when its sleep time is greater than 700 ms
  z The lowest priority (139) can never become interactive
z The higher the bonus the task is getting and the higher its static priority, the more likely it is to be considered interactive
Using Quanta

z At every time tick (in scheduler_tick), decrement the quantum of the currently running process (time_slice)
z If the time goes to zero, the process is done
z Check interactive status:
  z If non-interactive, put it aside on the expired list
  z If interactive, put it at the end of the active list
z Exceptions: don't put it on the active list if:
  z A higher-priority process is on the expired list
  z An expired task has been waiting more than STARVATION_LIMIT
z If there's nothing else at that priority, it will run again immediately
z Of course, by running so much, its bonus will go down, and so will its priority and its interactive status
Avoiding Starvation

z The system only runs processes from active queues, and puts them on expired queues when they use up their quanta
z When a priority level of the active queue is empty, the scheduler looks for the next-highest priority queue
z After running all of the active queues, the active and expired queues are swapped
z There are pointers to the current arrays; at the end of a cycle, the pointers are switched
The Priority Arrays

struct prio_array {
    unsigned int nr_active;        /* total runnable tasks in this array */
    unsigned long bitmap[5];       /* one bit per priority level */
    struct list_head queue[140];   /* one task list per priority level */
};

struct rq {
    spinlock_t lock;               /* serializes access to this runqueue */
    unsigned long nr_running;
    struct prio_array *active, *expired;
    struct prio_array arrays[2];
    struct task_struct *curr, *idle;
    /* ... */
};
Swapping Arrays

/* At the end of a cycle: if the active array is empty,
   exchange the active and expired pointers - O(1) */
struct prio_array *array = rq->active;
if (array->nr_active == 0) {
    rq->active = rq->expired;
    rq->expired = array;
}
Why Two Arrays?

z Why is it done this way?
z It avoids the need for traditional aging
z Why is aging bad?
  z It's O(n) at each clock tick
Linux is More Efficient

z Processes are touched only when they start or stop running
z That's when we recalculate priorities, bonuses, quanta, and interactive status
z There are no loops over all processes, or even over all runnable processes
Real-Time Scheduling

z Linux has soft real-time scheduling
  z No hard real-time guarantees
z All real-time processes are higher priority than any conventional processes
z Processes with priorities [0, 99] are real-time
  z Saved in rt_priority in the task_struct
  z Scheduling priority of a real-time task is 99 - rt_priority
z A process can be converted to real-time via the sched_setscheduler system call
Real-Time Policies

z First-in, first-out: SCHED_FIFO
  z Static priority
  z Process is only preempted for a higher-priority process
  z No time quanta; it runs until it blocks or yields voluntarily
z Round-robin: SCHED_RR
  z RR within the same priority level
  z As above, but with a time quantum (800 ms)
z Normal processes have the SCHED_OTHER scheduling policy
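For illustration, the sched_setscheduler call mentioned above can be used like this (the priority value 50 is an arbitrary choice, and the call typically requires root privileges):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    /* pid 0 means the calling process */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    /* from here on, the process runs under the SCHED_FIFO policy */
    return 0;
}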
Multiprocessor Scheduling

z Each processor has a separate run queue
z Each processor only selects processes from its own queue to run
z Yes, it's possible for one processor to be idle while others have jobs waiting in their run queues
z Periodically, the queues are rebalanced: if one processor's run queue is too long, some processes are moved from it to another processor's queue
Locking Runqueues

z To rebalance, the kernel sometimes needs to move processes from one runqueue to another
z This is actually done by special kernel threads
z Naturally, the runqueue must be locked before this happens
z The kernel always locks runqueues in order of increasing indexes
  z Why? Deadlock prevention!
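A self-contained userspace analogue of the ordering rule in C (pthread mutexes stand in for the kernel's spinlocks; the struct is illustrative):

#include <pthread.h>

struct rq {
    pthread_mutex_t lock;
    int index;                    /* position in the runqueues array */
};

/* Lock two runqueues without risking deadlock: any two CPUs racing over
   the same pair acquire the locks in the same (increasing-index) order. */
void lock_two_runqueues(struct rq *rq1, struct rq *rq2)
{
    if (rq1 == rq2) {
        pthread_mutex_lock(&rq1->lock);
    } else if (rq1->index < rq2->index) {   /* lower index first */
        pthread_mutex_lock(&rq1->lock);
        pthread_mutex_lock(&rq2->lock);
    } else {
        pthread_mutex_lock(&rq2->lock);
        pthread_mutex_lock(&rq1->lock);
    }
}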
Processor Affinity

z Each process has a bitmask saying what CPUs it can run on
z Normally, of course, all CPUs are listed
z Processes can change the mask
z The mask is inherited by child processes (and threads), thus tending to keep them on the same CPU
z Rebalancing does not override affinity
Load Balancing

z To keep all CPUs busy, load balancing pulls tasks from busy runqueues to idle runqueues
z If schedule finds that a runqueue has no runnable tasks (other than the idle task), it calls load_balance
z load_balance is also called via a timer
  z scheduler_tick calls rebalance_tick
  z Every tick when the system is idle
  z Every 100 ms otherwise
Load Balancing (Cont.)

z load_balance looks for the busiest runqueue (most runnable tasks) and takes a task that is (in order of preference):
  z inactive (likely to be cache cold)
  z high priority
z load_balance skips tasks that are:
  z likely to be cache warm (hasn't run for cache_decay_ticks time)
  z currently running on a CPU
  z not allowed to run on the current CPU (as indicated by the cpus_allowed bitmask in the task_struct)
Linux 2.6 CFS Scheduler

z Was merged into the 2.6.23 release
z Uses a red-black tree structure instead of multilevel queues
z Tries to run the task with the "gravest need" for CPU time

Red-Black Tree in CFS (figure)

Red-Black Tree Properties

z Self-balancing
z Insertion and deletion operations in O(log n)
  z With a proper implementation, its performance is almost the same as the O(1) algorithm's!
The switch_to Macro

z switch_to() performs a process switch from the prev process (descriptor) to the next process (descriptor)
z switch_to is invoked by schedule() and is one of the most hardware-dependent kernel routines
  z See kernel/sched.c and include/asm*/system.h for more details
Ext3 Filesystem

Introduction

z Common file system on Linux
z Introduced in 2001
z Supports a max file size of 16 GB to 2 TB
z Max filesystem size can be from 2 TB to 32 TB
z Maximum of 32000 subdirectories under a directory
z Options for block size from 1 KB to 4 KB
Block Groups

z Disk is partitioned into equal-sized block groups
  z Same number of inodes per block group
  z Same number of data blocks per block group
z Each block group has its data blocks and inodes stored in adjacent tracks
z Files are usually allocated within a single block group
  z Inodes and data blocks are then close together
  z Reduces average seek time

Partition Layout (figure)
Superblock

z Located 1024 bytes from the start of the file system, and is 1024 bytes in size
z Backup copies are typically stored in the first file data block of each block group
  z Only the one in block group 0 is usually looked at
z Contains a description of the basic size and shape of this file system

Some Superblock Fields

z Block group no. of the group storing this superblock
z Block size
z Total no. of blocks
z No. of free blocks
z No. of inodes
z Total no. of free inodes
z First inode (for /)
z Many others; it's a long list
The Ext3 Group Descriptor

z One group descriptor data structure for every block group
z All the group descriptors for all of the block groups are duplicated in each block group, in case of file system corruption
z The group descriptor contains the following:
  z Block bitmap: block number of the block allocation bitmap
  z Inode bitmap: block number of the inode allocation bitmap
  z Inode table: the block number of the starting block of the inode table for this block group
  z Free blocks count: number of data blocks free in the group
  z Free inodes count: number of inodes free in the group
  z Used directory count: number of inodes allocated to directories
Bitmaps

z The block bitmap manages the allocation status of the blocks in the group
z The inode bitmap manages the allocation status of the inodes in the group
z Both bitmaps must fit into one block each
  z Fixes the maximum no. of blocks in a block group at 8 times the block size in bytes
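For example, with a 4 KB block size the block bitmap holds 4096 x 8 = 32768 bits, so a block group can contain at most 32768 blocks, i.e. 32768 x 4 KB = 128 MB of data blocks per group.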
Inodes

z Inode table: contains the inodes that describe the files in the group
z Inodes:
  z Each inode corresponds to one file, and stores the file's primary metadata, such as the file's size, ownership, and temporal information
  z An inode is typically 128 bytes in size and is allocated to each file and directory
  z 12 direct links, one single, one double, and one triple indirect link

Ext2 inode (figure)
Some Inode Fields

z File type
z Access rights
z File length
z Time of last file access
z Time of last change of inode
z Time of last change of file
z Hard links counter
z Number of data blocks
z Pointers to data blocks
z Access control lists
Directories

z An Ext3 directory is just like a regular file, except that it has a special type value
z The content of a directory is a list of directory entry data structures, each describing a file/subdirectory name and inode address
z The length of a directory entry varies from 1 to 255 bytes
z Fields in the directory entry:
  z Inode: inode no. of the file/directory
  z Name length: the length of the file name
  z Record length: the length of this directory entry
    z Tells where the next entry starts
  z File type
  z Name: file/subdirectory name

Indexing and Directories

z When Ext3 wants to delete a directory entry, it just increases the record length of the previous entry to cover the deleted entry
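A C sketch of the on-disk entry (field names follow the slide; the real kernel definition, ext3_dir_entry_2, is close but not identical):

#include <stdint.h>

struct dir_entry {
    uint32_t inode;      /* inode no. of the file/subdirectory */
    uint16_t rec_len;    /* record length: where the next entry starts */
    uint8_t  name_len;   /* length of the name */
    uint8_t  file_type;
    char     name[];     /* up to 255 bytes, not NUL-terminated */
};

/* Deletion, as described above: fold the dead entry into its predecessor.
   Directory scans then simply skip over the reclaimed bytes. */
void delete_entry(struct dir_entry *prev, struct dir_entry *dead)
{
    prev->rec_len += dead->rec_len;
}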
Allocating Inodes

z If a new inode is for a non-directory file, Ext3 allocates an inode in the same block group as the parent directory
z If that group has no free inode or block:
  z Quadratic search: search in block groups i mod n, (i+1) mod n, (i+1+2) mod n, (i+1+2+4) mod n, ...
  z If the quadratic search fails, Ext3 uses exhaustive linear search to find a free inode
z If a new inode is for a directory, Ext3 tries to place it in a group that has not been used much
  z Using the total number of free inodes and blocks in the superblock, Ext3 calculates the average free inodes and blocks per group
  z Ext3 searches each of the groups and uses the first one whose free inodes and blocks are greater than the average
  z If the previous search fails, the group with the smallest number of directories is used
Allocating Data Blocks

z First goal: get the new block near the last block allocated to the file
z Preallocation: allocate a number of contiguous blocks (usually 8) even if only one block is asked for
  z Preallocated blocks are freed when the file is closed, or when a write operation is not sequential with respect to the write operations that triggered the preallocation
z Every allocation request has a goal block
  z If the current block and the previously allocated block have consecutive file block numbers, goal = disk block no. of the previous block + 1
    z Tries to keep consecutive file blocks adjacent on disk
  z Else, if at least one block was preallocated earlier, goal = that block
  z Else, goal = first block in the block group holding the file's inode
z Aim: allocate a physical block = goal block
  z If the goal block is not free, try the next one
  z If not available, search all block groups starting from the one containing the goal block
    z Look for a group of 8 adjacent free blocks
    z If none, look for a single free block
Indexing and Directories

z Directory entry allocation:
  z Ext3 starts at the beginning of the directory and examines each directory entry
  z It calculates the actual record length needed and compares it with the record length field
  z If they are different (the entry has slack space), Ext3 can insert the new directory entry at the end of the current entry
  z Else, Ext3 appends the new entry to the end of the entry list
Memory Data Structures

z Superblock and group descriptors are always cached
z Bitmaps (block and inode), inodes, and data blocks are cached dynamically as needed, when the corresponding object is in use
  z The page cache mitigates some of the cost of reading from disk
z Main change in Ext3 over Ext2: added journaling support for recovery
  z Not to be discussed today