
FILE STRUCTURES

UNIT -1
INTRODUCTION TO FILE STRUCTURES
Lecture notes
1.1 The heart of file structure design
DISKs
o have enormous storage capacity
o are non volatile
o cost less than memory
o but are very slow when compared to memory
ANALOGY
o RAM access time: 120 ns
o Disk access time: 30 ms (250,000 times slower)
o If finding something in a book in hand takes 20 sec, then looking it up in the library
instead, at the same ratio as memory access to disk access, would take 5 million sec,
or almost 58 days.
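The arithmetic behind the analogy can be checked with a short sketch. The function names are illustrative and not part of the notes; the figures are the ones quoted above.

```cpp
// Ratio of disk access time to RAM access time.
// RAM time is given in nanoseconds, disk time in milliseconds.
double access_ratio(double ram_ns, double disk_ms) {
    return (disk_ms * 1e6) / ram_ns;   // 1 ms = 1,000,000 ns
}

// Scale a task that takes task_s seconds "in memory" by the disk/RAM ratio.
double scaled_seconds(double task_s, double ram_ns, double disk_ms) {
    return task_s * access_ratio(ram_ns, disk_ms);
}

// Convert seconds to days (86,400 seconds per day).
double seconds_to_days(double seconds) {
    return seconds / 86400.0;
}
```

With 120 ns, 30 ms and a 20 sec lookup, the ratio is 250,000 and the scaled time is 5,000,000 sec, a little under 58 days.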
A disk's relatively slow access time, combined with its enormous, nonvolatile capacity, is the driving force
behind file structure (FS) design.
A good FS design should give access to all that capacity without making the application spend a lot of time
waiting for the disk.
A file structure is a combination of representations for data in files and of operations for accessing the data.
o It allows applications to read, write and modify data
o Also finding the data
o Or reading the data in a particular order
The efficiency of an FS design for a particular application depends on
o The details of the representation of the data
o The implementation of the operations
The large variety in the types of data and in the needs of applications makes FS design important.
What is best for one situation may be terrible for another.

1.2 A Short History of FS Design


The general goals of FS design
o One access to the disk to get the desired information.
o Structures that take us to the information with as few accesses as possible. Two or three
trips to disks...
o Group information so that we can get everything we need in one trip to the disk; name,
address, phone number and account balance... all at once...

All of these are easy to achieve if the files never change, grow or shrink. When information is
added or deleted, it is much more difficult.
Initially the storage device was tape,
o Access was sequential
o Accessing cost was directly proportional to the size of the file.
Then came disk drives
o Indexes were added to files
o A list of keys and pointers was kept in a smaller file that could be searched easily
o With a key and its pointer, the file could be accessed directly even if it was very large.
o As the indexes grew, they too became difficult to manage.
Early 1960s
o Idea of applying tree structures emerged.
o But trees can grow unevenly as records are added or deleted.
o Resulting in long searches requiring many disk accesses to find a record.
In 1963
o The AVL tree was developed, a self-adjusting binary tree structure for data in
memory.
o Some researchers implemented the AVL tree structure for files.
o The problem was that dozens of accesses were required to find a record in even
moderate-sized files
o A method was required to keep a tree balanced when each node of the tree was not a
single record, as in a binary tree, but a file block containing dozens or hundreds of
records.
B-tree
o After ten years of design work, the B-tree emerged
o An AVL tree grows top-down, whereas a B-tree grows bottom-up
o Provides excellent access performance
o Sequential access is not efficient in a B-tree
B+ trees
o Solved the problem of sequential access in the B-tree
o Added a linked list at the bottom level of the B-tree.
B-trees and B+ trees became the basis for many commercial file systems
They provide access times that grow in proportion to log_k N, where
o N is the number of entries in the file
o k is the number of entries indexed in a single block of the B-tree structure.
In practice, you can find one file entry among millions of others with only three or four trips to
the disk.
B-trees guarantee that performance stays about the same even as entries are added or deleted.
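To see why log_k N stays so small, the sketch below counts how many levels an index needs when every block read narrows the search by a factor of k. This is a simplified model that assumes every block is full; the function name is illustrative only.

```cpp
#include <cstdint>

// Smallest number of levels L such that k^L >= n, i.e. ceil(log_k n),
// computed with integer arithmetic to avoid floating-point rounding.
// Assumes k >= 2 and n >= 1.
int index_levels(std::uint64_t n, std::uint64_t k) {
    int levels = 0;
    std::uint64_t reach = 1;   // entries distinguishable after `levels` block reads
    while (reach < n) {
        if (reach > UINT64_MAX / k) {   // reach * k would overflow, so it exceeds n
            ++levels;
            break;
        }
        reach *= k;
        ++levels;
    }
    return levels;
}
```

For one million entries, a binary tree (k = 2) needs 20 levels, while blocks that index 100 entries each need only 3, which matches the "three or four trips" figure above.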
Hashing is a good way to get what we want with a single request, but only for files that do not change greatly in size.
In the early days, hashed indexes were used to provide fast access to files.
Extendible, dynamic hashing retrieves the information with one or, at most, two disk accesses no
matter how big the file becomes.

1.3 A Conceptual Toolkit


Development of file structures over the last three decades
o Sequential
o Tree structures
o Direct access
Design tools keep emerging
o Decrease the number of disk accesses by collecting data into
Buffers
Blocks
Buckets
o Manage the growth of these collections by splitting them
This requires us to find ways to increase the address or index space.
These are called the conceptual tools.
o They are methods of framing and addressing a design problem
1.4 Fundamental File Operations: Physical Files and Logical Files
A physical file
o Refers to particular collection of bytes stored on a disk or tape.
o It physically exists on the disk.
o A disk drive might contain hundreds and thousands of these physical files.
A logical file
o To the program
A file is somewhat like a telephone line connected to the telephone network
The program can send or receive bytes through this phone line
Where do these bytes go, or where do they come from? The program does not know.
It knows nothing about the other end of the line.
A single program is usually limited to using about 20 files at a time.
o The application relies on the operating system to take care of the telephone switching
system.
o Bytes coming into the program can come down the line from a physical file, the
keyboard or some other input device.
o Bytes going out of the program might end up in a file or appear on the terminal screen.
o The program knows which line to use to get bytes in and to send bytes out.
o This line is called the logical file
To open a file for use
o The operating system must receive instructions to link the logical file with a physical
file on the disk or with a device.
In response, the operating system returns a number (the logical name), which the program
uses to refer to the file.
ANALOGY
o My office phone is connected to six telephone lines.
o When I receive a call I get an intercom message such as "You have a call on line three."
o The receptionist does not say, "You have a call from 814-789-1903."
o I need to have the call identified logically, not physically.
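The idea of a logical file can be sketched in C++ with a function that only sees a stream. The names report_balance and demo, the sample values and the file path are hypothetical; the point is that the writing code cannot tell what is at the other end of the line.

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Writes one line to whatever "logical file" it is handed. The function
// cannot tell whether the bytes end up on disk, on the screen or in memory.
void report_balance(std::ostream& line, const std::string& name, double balance) {
    line << name << ' ' << balance << '\n';
}

// The same call works for three different "other ends of the line".
// Returns what was written to the in-memory destination.
std::string demo(const std::string& path) {
    report_balance(std::cout, "Jones", 32.5);   // terminal screen

    std::ofstream disk(path);                    // physical file on disk
    report_balance(disk, "Jones", 32.5);

    std::ostringstream memory;                   // in-memory buffer
    report_balance(memory, "Jones", 32.5);
    return memory.str();
}
```

Linking the logical file to a physical one happens only where the stream is constructed, which corresponds to the open step described above.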

1.4 Opening Files


C++ Files and Streams
o C++ views each file as a sequence of bytes.
o Each file ends with an end-of-file marker.
o When a file is opened, an object is created and a stream is associated with the object.
o To perform file processing in C++, the header files <iostream> and <fstream> must
be included (<iostream.h> and <fstream.h> on pre-standard compilers).
o <fstream> defines the ifstream and ofstream classes.
Creating a sequential file
// Create a sequential file
#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
using namespace std;

int main()
{
    // ofstream constructor opens the file for output
    ofstream outClientFile( "clients.dat", ios::out );
    if ( !outClientFile ) { // overloaded ! operator: true if the open failed
        cerr << "File could not be opened" << endl;
        exit( 1 ); // prototype in <cstdlib>
    }
    cout << "Enter the account, name, and balance.\n"
         << "Enter end-of-file to end input.\n? ";
    int account;
    string name; // std::string avoids overflowing a fixed-size char array
    float balance;
    // read one record per iteration until end-of-file or bad input
    while ( cin >> account >> name >> balance ) {
        outClientFile << account << ' ' << name
                      << ' ' << balance << '\n';
        cout << "? ";
    }
    return 0; // ofstream destructor closes file
}

Question.
o What does the above program do?
How to open a file in C++?
o ofstream outClientFile( "clients.dat", ios::out );
OR
o ofstream outClientFile;
o outClientFile.open( "clients.dat", ios::out );
File Open Modes
o ios::app - (append) write all output to the end of the file
o ios::ate - (at end) move to the end of the file when it is opened; data can then be written anywhere in the file
o ios::binary - read/write data in binary format
o ios::in - (input) open a file for input
o ios::out - (output) open a file for output
o ios::trunc - (truncate) discard the file's contents if it exists
o ios::nocreate - if the file does NOT exist, the open operation fails (pre-standard; not part of standard C++)
o ios::noreplace - if the file exists, the open operation fails (pre-standard; added to the standard only in C++23)
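The difference between the most common modes can be observed with a small sketch. The helper names write_text and read_all are illustrative, not part of any library.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Open `path` with the given mode, write `text`, and let the ofstream
// destructor close the file when `out` goes out of scope.
bool write_text(const std::string& path, const std::string& text,
                std::ios::openmode mode) {
    std::ofstream out(path, mode);
    if (!out) return false;
    out << text;
    return static_cast<bool>(out);
}

// Read the whole file back as a string (empty if it cannot be opened).
std::string read_all(const std::string& path) {
    std::ifstream in(path, std::ios::in);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

Writing "first\n" with ios::out | ios::trunc and then "second\n" with ios::app leaves both lines in the file, while a later write with plain ios::out discards them, because opening an ofstream for output alone truncates an existing file.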

1.5 Closing Files


The file is closed implicitly when the destructor of the corresponding stream object is called,
or explicitly by calling the member function close:
o outClientFile.close();
1.6 Reading and Writing Files
The <ostream> member function write
o The <ostream> member function write outputs a fixed number of bytes, beginning at a
specific location in memory, to the specified stream. When the stream is associated with a
file, the data is written beginning at the location in the file specified by the put file
pointer.
The write function expects a first argument of type const char *, hence
reinterpret_cast<const char *> is used to convert the address of the record (blankClient, or
client in the program below) to a const char *.
The second argument of write is an integer of type size_t specifying the number of bytes to be
written, hence sizeof( clientData ).
Writing data randomly to a random file
#include <iostream>
#include <fstream>
#include <cstdlib>
#include "clntdata.h" // defines the clientData record
using namespace std;

int main()
{
    // Open the existing file for in-place update. With a standard ofstream,
    // ios::ate (or ios::out) alone would truncate the file on open.
    ofstream outCredit( "credit.dat", ios::in | ios::out | ios::binary );
    if ( !outCredit ) {
        cerr << "File could not be opened." << endl;
        exit( 1 );
    }
    cout << "Enter account number "
         << "(1 to 100, 0 to end input)\n? ";
    clientData client;
    cin >> client.accountNumber;
    while ( client.accountNumber > 0 &&
            client.accountNumber <= 100 ) {
        cout << "Enter lastname, firstname, balance\n? ";
        cin >> client.lastName >> client.firstName
            >> client.balance;
        // move the put pointer to the start of this account's record
        outCredit.seekp( ( client.accountNumber - 1 ) *
                         sizeof( clientData ) );
        outCredit.write(
            reinterpret_cast<const char *>( &client ),
            sizeof( clientData ) );
        cout << "Enter account number\n? ";
        cin >> client.accountNumber;
    }
    return 0;
}

Reading data from a random file
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstdlib>
#include "clntdata.h" // defines the clientData record
using namespace std;

void outputLine( ostream&, const clientData & );

int main()
{
    // records were written as raw bytes, so open in binary mode
    ifstream inCredit( "credit.dat", ios::in | ios::binary );
    if ( !inCredit ) {
        cerr << "File could not be opened." << endl;
        exit( 1 );
    }
    // column headings
    cout << setiosflags( ios::left ) << setw( 10 ) << "Account"
         << setw( 16 ) << "Last Name" << setw( 11 )
         << "First Name" << resetiosflags( ios::left )
         << setw( 10 ) << "Balance" << endl;
    clientData client;
    // read the first record, then keep reading until end-of-file
    inCredit.read( reinterpret_cast<char *>( &client ),
                   sizeof( clientData ) );
    while ( inCredit && !inCredit.eof() ) {
        if ( client.accountNumber != 0 ) // skip empty records
            outputLine( cout, client );
        inCredit.read( reinterpret_cast<char *>( &client ),
                       sizeof( clientData ) );
    }
    return 0;
}

// print one record as a formatted line
void outputLine( ostream &output, const clientData &c )
{
    output << setiosflags( ios::left ) << setw( 10 )
           << c.accountNumber << setw( 16 ) << c.lastName
           << setw( 11 ) << c.firstName << setw( 10 )
           << setprecision( 2 ) << resetiosflags( ios::left )
           << setiosflags( ios::fixed | ios::showpoint )
           << c.balance << '\n';
}

The <istream> member function read

inCredit.read( reinterpret_cast<char *>( &client ), sizeof( clientData ) );

The <istream> member function read inputs a specified number of bytes (here sizeof( clientData )) from the
current position of the specified stream into an object.
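The two programs above depend on clntdata.h, which these notes do not show. The sketch below is a self-contained round trip with write and read, using a hypothetical ClientRecord layout and illustrative helper names; it is not the layout from the original header.

```cpp
#include <cstring>
#include <fstream>
#include <string>

// Hypothetical fixed-length record (assumed field sizes).
struct ClientRecord {
    int   accountNumber;
    char  lastName[15];
    char  firstName[10];
    float balance;
};

// Build a record; names that do not fit are truncated and stay null-terminated.
ClientRecord make_record(int account, const char* last, const char* first,
                         float balance) {
    ClientRecord r = {};   // zero-fill all fields
    r.accountNumber = account;
    std::strncpy(r.lastName, last, sizeof r.lastName - 1);
    std::strncpy(r.firstName, first, sizeof r.firstName - 1);
    r.balance = balance;
    return r;
}

// Byte offset of record number `slot` (0-based) in a file of fixed-length records.
std::streamoff record_offset(int slot) {
    return static_cast<std::streamoff>(slot) *
           static_cast<std::streamoff>(sizeof(ClientRecord));
}

// Create (or truncate) a file holding `count` blank records.
bool create_blank_file(const std::string& path, int count) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    const ClientRecord blank = {};
    for (int i = 0; i < count; ++i)
        out.write(reinterpret_cast<const char*>(&blank), sizeof blank);
    return static_cast<bool>(out);
}

// Overwrite record `slot` in place; the file must already exist.
bool write_record(const std::string& path, int slot, const ClientRecord& rec) {
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file) return false;
    file.seekp(record_offset(slot), std::ios::beg);
    file.write(reinterpret_cast<const char*>(&rec), sizeof rec);
    return static_cast<bool>(file);
}

// Read record `slot`; returns false if a full record could not be read.
bool read_record(const std::string& path, int slot, ClientRecord& rec) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return false;
    in.seekg(record_offset(slot), std::ios::beg);
    in.read(reinterpret_cast<char*>(&rec), sizeof rec);
    return in.gcount() == static_cast<std::streamsize>(sizeof rec);
}
```

Because every record has the same size, record number n always starts at byte n * sizeof(ClientRecord), so one seek reaches it directly. Raw struct bytes written this way are only portable between programs built with the same compiler and platform.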

1.7 Seeking
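Reading and printing a sequential file
// Reading and printing a sequential file
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cstdlib>
using namespace std;

void outputLine( int, const char *, double );

int main()
{
    // ifstream constructor opens the file
    ifstream inClientFile( "clients.dat", ios::in );
    if ( !inClientFile ) {
        cerr << "File could not be opened\n";
        exit( 1 );
    }
    // The rest of main is not shown in these notes. A minimal sketch, assuming
    // each record is "account name balance" as written by the earlier program:
    int account;
    char name[ 30 ];
    double balance;
    while ( inClientFile >> account >> setw( 30 ) >> name >> balance )
        outputLine( account, name, balance );
    return 0;
}

// Assumed definition; only the prototype appears in the notes.
void outputLine( int acct, const char *name, double bal )
{
    cout << acct << ' ' << name << ' ' << bal << '\n';
}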
File position pointer
o The <istream> and <ostream> classes provide member functions for repositioning the file
position pointer (the byte number of the next byte in the file to be read or written).
o These member functions are:
seekg (seek get) for the istream class
seekp (seek put) for the ostream class
Examples of moving a file pointer
o inClientFile.seekg( 0 ) - repositions the file get pointer to the beginning of the file
o inClientFile.seekg( n, ios::beg ) - repositions the file get pointer to the n-th byte of the file
o inClientFile.seekg( -m, ios::end ) - repositions the file get pointer m bytes back from the
end of the file (offsets relative to ios::end are negative)
o inClientFile.seekg( 0, ios::end ) - repositions the file get pointer to the end of the file
o The same operations can be performed with the <ostream> member function seekp.
Member functions tellg() and tellp()
o Member functions tellg and tellp return the current locations of the get
and put pointers, respectively.
o long location = inClientFile.tellg();
o To move the pointer relative to the current location, use ios::cur
o inClientFile.seekg( n, ios::cur ) - moves the file get pointer n bytes forward.
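The calls listed above can be combined as in the following sketch, which uses seekg with ios::beg and ios::end together with tellg. The function names are illustrative.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Size of the file in bytes: move the get pointer to the end and ask where it is.
// Returns -1 if the file cannot be opened.
long file_size(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in) return -1;
    in.seekg(0, std::ios::end);
    return static_cast<long>(in.tellg());
}

// Read up to `count` bytes starting `offset` bytes from the beginning of the file.
std::string read_at(const std::string& path, long offset, long count) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::string bytes(static_cast<std::size_t>(count), '\0');
    in.seekg(offset, std::ios::beg);
    in.read(&bytes[0], count);
    bytes.resize(static_cast<std::size_t>(in.gcount()));   // keep only what was read
    return bytes;
}

// Read the last `count` bytes: a negative offset relative to ios::end.
// Returns an empty string if the file is shorter than `count` bytes.
std::string read_tail(const std::string& path, long count) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::string bytes(static_cast<std::size_t>(count), '\0');
    in.seekg(-count, std::ios::end);
    in.read(&bytes[0], count);
    bytes.resize(static_cast<std::size_t>(in.gcount()));
    return bytes;
}
```

For a file containing the ten bytes 0123456789, read_at(path, 3, 4) yields 3456 and read_tail(path, 3) yields 789.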


Updating a sequential file
o Data that is formatted and written to a sequential file cannot be modified easily without
the risk of destroying other data in the file.

o If we want to modify a record of data, the new data may be longer than the old data, and
it could then overwrite parts of the record following it.
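The overwrite problem is easy to reproduce. In the sketch below the record contents and helper names are made up for illustration: a longer name is written over the first record of a two-record text file.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Write `text` to `path`, replacing any previous contents.
void create_file(const std::string& path, const std::string& text) {
    std::ofstream out(path, std::ios::out | std::ios::binary | std::ios::trunc);
    out << text;
}

// Overwrite the bytes at the start of the file with `record`. The data that
// follows is not moved: a file cannot "make room" in the middle.
void overwrite_first_record(const std::string& path, const std::string& record) {
    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    file.seekp(0, std::ios::beg);
    file << record;
}

// Return the whole file as a string.
std::string read_all(const std::string& path) {
    std::ifstream in(path, std::ios::in | std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}
```

Starting from the two records "300 White 0.00" and "400 Jones 32.87", replacing the first with "300 Worthington 0.00" leaves "nes 32.87" as the second line: the start of the second record has been destroyed.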
Problems with sequential files
o Sequential files are inappropriate for so-called instant access applications in which a
particular record of information must be located immediately.
o These applications include banking systems, point-of-sale systems, airline reservation
systems, (or any data-base system.)
Random access files
o Instant access is possible with random access files.
o Individual records of a random access file can be accessed directly (and quickly) without
searching many other records.
1.8 Special Characters in Files
All computer systems reserve a number of characters for specific system functions.
Examples:
o Control-Z often indicates end-of-file in MS-DOS programs
o Control-D often indicates end-of-file in Unix programs
o The CR (carriage return) and LF (line feed) characters together indicate end-of-line in MS-DOS; Unix uses LF alone
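A program that reads text files from both kinds of systems has to cope with both end-of-line conventions. A minimal sketch, with illustrative function names:

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read all lines from `in`, treating both LF and CR LF as end-of-line.
std::vector<std::string> read_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {            // getline stops at LF and discards it
        if (!line.empty() && line.back() == '\r')
            line.pop_back();                    // drop the CR of a CR LF pair
        lines.push_back(line);
    }
    return lines;
}

// Convenience wrapper for text that is already in memory.
std::vector<std::string> split_lines(const std::string& text) {
    std::istringstream in(text);
    return read_lines(in);
}
```

The stray CR only appears when the bytes reach the program untranslated, for example when an MS-DOS file is read on Unix or the stream is opened in binary mode.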
1.9 Directory Structures
Files are stored in directories. Thus directories are collections of files
Most modern systems maintain a tree directory structure.
1.10 Physical Devices and Logical Files
I/O Redirection
o I/O redirection allows the source of input to be changed so that it comes from a file instead of the
keyboard:
program < file /* program reads input from a file instead of the keyboard */
o I/O redirection also allows the output to go to a file instead of the screen:
program > file /* program writes to a file instead of the screen */
1.12 Pipes
o The output of one program can be used as the input to another program by using pipes:
o Example:
program1 | program2
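Redirection and pipes work with any program that reads standard input and writes standard output. The sketch below shows the core of such a filter; the names upcase_filter and upcase_text are illustrative.

```cpp
#include <cctype>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copy every byte from `in` to `out`, upper-casing letters on the way.
void upcase_filter(std::istream& in, std::ostream& out) {
    char c;
    while (in.get(c))
        out.put(static_cast<char>(std::toupper(static_cast<unsigned char>(c))));
}

// In-memory helper for trying the filter without a shell.
std::string upcase_text(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    upcase_filter(in, out);
    return out.str();
}
```

A main function that simply calls upcase_filter( std::cin, std::cout ) can then be run as program < file, program > file, or on the right-hand side of a pipe, without any change to the code.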
1.11 Secondary Storage Management
Secondary storage devices:
o have much longer access time than main memory
o have access times that vary from one access to another (some accesses are relatively
fast and other accesses are slower on the same device)
o have a lot more storage than main memory
o have storage that is non-volatile

Disks
o Types of commonly used disks
hard disks
floppy disks
Iomega ZIP disks
Jaz disks
The Organization of Disks
o Data is stored on the surface of one or more platters.
o Disk storage units are:
tracks and cylinders
sectors
Disk Storage Capacity
o The amount of data that can be held on a disk depends on how densely bits can be
stored on the disk surface
o The capacity of the disk is a function of:
the number of cylinders
the number of tracks per cylinder
the capacity of a track
o Track capacity = number of sectors per track × bytes per sector
o Cylinder capacity = number of tracks per cylinder × track capacity
o Drive capacity = number of cylinders × cylinder capacity
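The three capacity formulas translate directly into code. The struct and function names below are illustrative; the sample geometry in the note is the example drive specification given later in these notes.

```cpp
#include <cstdint>

// Geometry of a disk drive (all counts, except bytes_per_sector which is a size).
struct DiskGeometry {
    std::uint64_t bytes_per_sector;
    std::uint64_t sectors_per_track;
    std::uint64_t tracks_per_cylinder;
    std::uint64_t cylinders;
};

// Track capacity = sectors per track * bytes per sector
std::uint64_t track_capacity(const DiskGeometry& g) {
    return g.sectors_per_track * g.bytes_per_sector;
}

// Cylinder capacity = tracks per cylinder * track capacity
std::uint64_t cylinder_capacity(const DiskGeometry& g) {
    return g.tracks_per_cylinder * track_capacity(g);
}

// Drive capacity = cylinders * cylinder capacity
std::uint64_t drive_capacity(const DiskGeometry& g) {
    return g.cylinders * cylinder_capacity(g);
}
```

With 512 bytes per sector, 63 sectors per track, 16 tracks per cylinder and 4092 cylinders, the track holds 32,256 bytes, the cylinder 516,096 bytes and the drive 2,111,864,832 bytes, which is about 2 GB.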
How is the Data Read from or Written to a Disk?
o The operating system sends control signals to the disk via a disk driver to read or
write data at a given sector of a given cylinder.
o The disk rotates until the needed sector is under the read/write head (rotational
delay).
o The read/write head moves to the needed cylinder (seek time).
Specification of Disk Drives
o Capacity: e.g. 2 GB
o Minimum (track-to-track) seek time: e.g. 1 msec (milliseconds)
o Average seek time: e.g. 12 msec
o Maximum seek time: e.g. 22 msec
o Spindle speed: e.g. 5200 rpm (rotations per minute)
o Average rotational delay: e.g. 6 msec
o Maximum transfer rate: e.g. 2796 bytes/msec = 2730 KB/sec
o Bytes per sector: e.g. 512
o Sectors per track: e.g. 63
o Tracks per cylinder: e.g. 16
o Cylinders: e.g. 4092
Organizing Data by Sectors
o Physically adjacent sectors are sometimes not consecutive logically
o This is called sector interleaving

In the early 1990s, controller speeds improved so that disks can now offer noninterleaving (also known as 1:1 interleaving)

Clusters
o A cluster is a fixed number of consecutive (logical) disk sectors.
o Some operating systems view each file as a series of clusters.
o Clusters are designed to improve performance since all sectors in one cluster can be
accessed without an additional seek.
o Extents
Extents of a file are those parts of the file which are stored in contiguous
clusters.
It is very beneficial to store the whole file in one extent (seek time is minimized).

Fragmentation
o Fragmentation is the disk space wasted because the smallest organizational
unit of a disk is one sector.
o If the sector size is 512 bytes, then even if we need to store only one byte, we have to
allocate one whole sector to it. Thus 511 bytes are wasted.
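The wasted space follows from rounding the file size up to a whole number of sectors. A minimal sketch with illustrative function names:

```cpp
#include <cstdint>

// Number of whole sectors needed to hold `file_bytes` bytes (rounded up).
std::uint64_t sectors_needed(std::uint64_t file_bytes, std::uint64_t sector_bytes) {
    return (file_bytes + sector_bytes - 1) / sector_bytes;
}

// Bytes allocated to the file but left unused in its last sector.
std::uint64_t wasted_bytes(std::uint64_t file_bytes, std::uint64_t sector_bytes) {
    return sectors_needed(file_bytes, sector_bytes) * sector_bytes - file_bytes;
}
```

For 512-byte sectors, a 1-byte file wastes 511 bytes, a 513-byte file also wastes 511 bytes (it spills into a second sector), and a file of exactly 512 bytes wastes none.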
Blocks
o Some disks allow data to be stored in user-defined blocks instead of sectors.
o When the data on a disk is organized in blocks, this usually means that the amount of
data transferred in a single I/O operation can vary.
o Blocks can be either variable or fixed length.
o Block organization can be more efficient than sector organization but it is much more
complex.
Non-data Overhead
o Non-data overhead at the beginning of each sector includes:
sector address
track address
sector usability
The Cost of Disk Access
o Seek time
the time required to move the r/w head to the correct cylinder
o Rotational delay
the time required to rotate the disk so that the correct sector is positioned
under the r/w head
o Transfer time
the time required to transfer the data:
Transfer time = (number of bytes transferred / number of bytes on a track) × rotation time
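The three cost components can be combined into a rough estimate of one random access. The sketch assumes that the average rotational delay is half a rotation; the function names are illustrative.

```cpp
// Time for one full rotation, in milliseconds (60,000 ms per minute).
double rotation_time_ms(double rpm) {
    return 60000.0 / rpm;
}

// On average the wanted sector is half a rotation away from the head.
double average_rotational_delay_ms(double rpm) {
    return rotation_time_ms(rpm) / 2.0;
}

// Transfer time = (bytes transferred / bytes on a track) * rotation time
double transfer_time_ms(double bytes, double bytes_per_track, double rpm) {
    return (bytes / bytes_per_track) * rotation_time_ms(rpm);
}

// Rough cost of one random access: seek + rotational delay + transfer.
double access_time_ms(double seek_ms, double bytes,
                      double bytes_per_track, double rpm) {
    return seek_ms + average_rotational_delay_ms(rpm) +
           transfer_time_ms(bytes, bytes_per_track, rpm);
}
```

At 5200 rpm one rotation takes about 11.5 msec, so the average rotational delay is just under 6 msec. Reading one 512-byte sector from a 32,256-byte track with a 12 msec average seek then costs roughly 18 msec, almost all of it spent positioning rather than transferring.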

Disks as Bottlenecks
o Disk speeds lag far behind the speeds of
CPU
main memory
local network
o Computer programs spend most of their time waiting for data from the disk
Improving Disk Performance
o Disk striping
splitting the parts of a single file on several drives
o RAID
Redundant Array of Inexpensive Disks
o RAM disk
o Disk caching
o Buffering

1.12 Magnetic Tapes


Tapes provide sequential data access.
Tapes are mostly used as backup devices.
Current tape drives:
o big variety
o prices range from $150 to $150,000
o data transfer rates range from 0.5 MB/sec to 10 MB/sec
o capacities range from 200 MB to 50 GB
Tape Organization
o Tapes store data sequentially.
o The logical position of a byte within a file corresponds directly to its physical position
relative to the start of the file
o Data is usually stored on 9 parallel tracks, 8 data bits and 1 parity bit (one such 9-bit slice is called a
frame)
o Frames are grouped into blocks separated by interblock gaps.
o Tapes are read one block at a time.
1.13 Disks versus Tape
Disks provide both random access and sequential access; tapes provide only sequential access.

Buffering of disk data in main memory reduces the time spent seeking, so disks are commonly used for
sequential file processing too.
Tapes are still the most common medium for long-term archival storage.

1.14 CD-ROM

CD-ROM = Compact Disc, Read-Only Memory
CD-ROM data can be recorded only once, but it can be read many times.
CD-ROM technology started in the early 1970s as videodisc technology.
The first modern CD-ROMs were built in 1985.
CD-ROM Technology
o CD-ROMS are encoded using laser technology.
o Data is recorded by copying a master disk.
o The master is formed by using the data to be encoded to turn a powerful laser on and
off very quickly.
o CLV (Constant Linear Velocity) format (spiral format)
CD-ROM Strengths
o high storage capacity (600 MB)
o low cost
o durability
o read-only access (a great advantage from the design point of view)
CD-ROM Weaknesses
o long seek time (0.5 sec to 1 sec!)
o slow transfer rate (1x = 150 KB/sec)
approx. 5 times faster than a floppy disk
orders of magnitude slower than a good hard disk
o asymmetric writing and reading
once data has been recorded, no write access is allowed; consequently, other
devices (such as a hard disk) must be available to allow for interactive use of
programs on CD-ROMs.
1.15 Computer Storage Hierarchy
Primary (registers, RAM)
o fastest, smallest, most expensive
Secondary (hard disk)
o slower, larger, less expensive
Offline (removable disks, tapes)
o slowest, largest, least expensive (see fig 3.17 p. 84 of your text)
1.16 A Journey of a Byte

Application program (e.g. Word)
Operating System (e.g. Windows NT)
File manager (part of operating system)
I/O processor
Disk controller
Disk

1.17 Buffer Management


File managers buffer data in main memory.
This is called buffering.
Buffering strategies used:
o multiple buffering (the CPU uses one buffer while the I/O processor uses another)
o move mode and locate mode (in locate mode, instead of separate application and system
buffers, only one buffer is used for both purposes)
o scatter/gather I/O (reading or writing disk blocks while separating headers and
data into different buffers)
