
CS 165 Database Systems Lesson 1

Introduction to File Structures


Objectives

At the end of the lesson, the student should be able to:

• Differentiate logical and physical files
• Determine the advantages of using primary over secondary storage devices
• Identify the characteristics, capacities and access costs of different storage devices
• Explain the concepts behind file organization and file access
• Identify the different indexing mechanisms
• Use file compression techniques to improve space utilization and file access
performance

Issues in Physical File Organization

• Slow disk access:

• It takes approximately 120 nanoseconds to access information in RAM, but about
30 milliseconds to access it on disk.

• This is analogous to spending 20 seconds looking up a topic in the index of a book
versus spending 58 days looking for the same topic in a library.

• Transfer of data from disks to RAM and vice versa takes more time than calculations
made in the Central Processing Unit of a computer.

• The reasons for RAM being faster are:

• RAM chips are located on the motherboard so the distance the electrical
signals have to travel from the CPU to RAM or in the opposite direction is
much shorter compared to the distance between the CPU and secondary
storage devices. The shorter the distance, the faster the processing.

• It takes a few nanoseconds for the CPU to access RAM but several
milliseconds to access secondary storage. In addition, working with secondary
storage involves mechanical operations such as spinning the disk.
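
The 20-second versus 58-day analogy follows from the ratio of the two access times quoted above. The sketch below only redoes that arithmetic; the constant and function names are illustrative and not part of the lesson.

```python
# Access times quoted in the text (names are illustrative assumptions).
RAM_ACCESS_S = 120e-9   # about 120 nanoseconds
DISK_ACCESS_S = 30e-3   # about 30 milliseconds

def scaled_wait_days(ram_task_seconds):
    """Scale a RAM-sized wait by the disk/RAM ratio and express it in days."""
    ratio = DISK_ACCESS_S / RAM_ACCESS_S
    return ram_task_seconds * ratio / 86_400  # 86,400 seconds in a day

ratio = DISK_ACCESS_S / RAM_ACCESS_S
print(f"disk access is about {ratio:,.0f} times slower than RAM access")
print(f"20 s in the index corresponds to {scaled_wait_days(20):.1f} days in the library")
```

The ratio is about 250,000, so 20 seconds scales to roughly 57.9 days, which the text rounds to 58.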

• Volatile primary storage: when the power is off, all contents of RAM are lost. That is why data
from RAM is saved as files on secondary storage, which is non-volatile and almost permanent
(it eventually wears out or becomes outdated technology).

• Virtually infinite secondary storage: when you run out of space on one disk, you use another.
On the contrary there is a limited amount of RAM that can be accessed by the CPU. Some
programs will not run on a particular computer system because there is not enough RAM
available.


• Cheaper secondary storage: secondary storage is cheaper than RAM in terms of cost per unit
of data. Therefore, we need a file structure design that exploits both the speed of RAM and the
capacity of disks.

• Dynamic nature of data: insertions, deletions, searching and updates of information are
common.

• Need for fast searching and retrieval: data must be found in a collection of items in the least
time, even after a considerable amount of information has been deleted or inserted, and without
reorganizing the data. File organization must adjust gracefully to these operations.

Ideally, we would like to get the information we need with one access to the disk. If that is not
possible, we want to get it with as few accesses as possible. We also want file structures that group
information into what we call records, so that a bulk of information can be retrieved in just one
access to the disk.

A file structure design that meets these goals is easy to come up with when files never change.
In reality, however, we add, delete and update the contents of files, which makes reaching these
goals much more difficult.

Solutions Made

• Tapes were the early storage devices used where access was sequential.
• Indexes were then introduced so that a key could be searched quickly before accessing the
actual information on the disk.
• Binary trees were applied in the 1960s, but insertions and deletions make a tree uneven, which
results in long searches requiring more disk accesses.
• Tree-structured file organization: B-tree, B*-tree and B+-tree.
• Direct access design in the form of hashing.

Logical vs. Physical Files

A file on a disk, which means it exists physically, is called a physical file. A file as viewed by the
user is called a logical file. Consider the following Pascal code:

assign(input_file, 'input.txt');

This statement asks the operating system to find the physical file 'input.txt' on the disk and to
hook it up to the logical filename input_file. The logical file is what we use to refer to the physical
file inside the program.
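
The same distinction can be sketched in Python, used here only for illustration. The temporary directory and the file contents are assumptions added so that the example runs on its own; the variable input_file plays the role of the logical file.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    # Hypothetical physical file, created only so the sketch is runnable.
    physical_name = os.path.join(directory, "input.txt")
    with open(physical_name, "w") as f:
        f.write("hello\n")

    # open() asks the operating system to locate the physical file and
    # returns a file object; input_file is the logical file the program uses.
    with open(physical_name) as input_file:
        first_line = input_file.readline()

print(first_line.strip())
```

The program never manipulates the disk location directly; every read goes through the logical name.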

Secondary Storage Devices

Secondary storage devices hold files that are not currently being used. For a file to be used it must
be copied to main memory first. After any modifications files must be saved to secondary storage.

In designing, we are always responsive to the constraints of the medium and the environment.
The point here is we should know the constraints of the medium we are using.


Types of Storage Devices

• Direct Access Storage Devices (DASDs) ⇒ also known as random access devices. The system
maintains a list of data locations, so the required piece of data can be found quickly. The most
common DASD media are magnetic disks (hard disks and floppy disks) and optical disks.

• Serial Devices ⇒ use media such as magnetic tapes that permit only serial access.

Goal: Since accessing secondary storage devices is a bottleneck, we need to have a good design
to arrange data in ways that minimize access cost.

Disks

• These devices use magnetism to store the data on a magnetic surface.
• Advantages: high storage capacity, reliable, gives direct access to data
• A drive spins the disk very quickly underneath a read/write head, which does what its
name says. It reads data from a disk and writes data to a disk.

Organization of Disks

All magnetic disks are similarly formatted, or divided into areas, called tracks, sectors and
cylinders.

• Platters
• Tracks ⇒ concentric rings on one surface of a platter, each consisting of sectors.
• Sectors ⇒ the smallest addressable portion of a disk.
• Disk pack ⇒ the set of platters on a spindle when a disk drive uses a number of platters.
• Cylinder ⇒ the set of tracks that are directly above and below one another.
⇒ Accessing data on tracks in the same cylinder does not require additional seek
time.

Seeking ⇒ moving the arm that holds the read/write head to the right location on the disk.
⇒ Usually the slowest part of reading information.

Disk Capacity

• Amount of data on a track depends upon how densely bits can be stored: a low-density disk
can hold about 40 kilobytes per track and 35 tracks per surface. A top-of-the-line disk can hold
about 50 kilobytes per track and can have more than 1,000 tracks on a surface.

• To compute track, cylinder and drive capacity:

CT = Track Capacity = #sectors/track x #bytes/sector
CC = Cylinder Capacity = #tracks/cylinder x CT
CD = Drive Capacity = #cylinders/drive x CC


Example: What is the capacity of the drive with the following characteristics?

#bytes/sector = 512
#sectors/track = 40
#tracks/cylinder = 11
#cylinders = 1331
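
The handout leaves this example unanswered. One way to work it out is to apply the CT, CC and CD formulas in turn; the function names below are illustrative, not from the lesson.

```python
def track_capacity(sectors_per_track, bytes_per_sector):
    """CT = #sectors/track x #bytes/sector"""
    return sectors_per_track * bytes_per_sector

def cylinder_capacity(tracks_per_cylinder, ct):
    """CC = #tracks/cylinder x CT"""
    return tracks_per_cylinder * ct

def drive_capacity(cylinders, cc):
    """CD = #cylinders/drive x CC"""
    return cylinders * cc

ct = track_capacity(40, 512)      # bytes per track
cc = cylinder_capacity(11, ct)    # bytes per cylinder
cd = drive_capacity(1331, cc)     # bytes per drive
print(ct, cc, cd)
```

With the given characteristics this yields 20,480 bytes per track, 225,280 bytes per cylinder and 299,847,680 bytes for the drive, or roughly 300 million bytes.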

Cost of a Disk Access (Latency)

Three main factors affecting disk access time: seek time [ts], rotational delay [tr], block transfer
time [tbt]

Step               Description                                 Measured as (in ms)
1. seek            move the head to the proper track           seek time
2. rotate          rotate disk under the head to the           rotational delay
                   correct sector
3. settle          head lowers to disk; wait for vibrations    settling time
                   from moving to stop (actually touches
                   only on floppies)
4. data transfer   copy data to main memory                    data transfer time

Seek Time

• Time it takes to move the arm to the correct cylinder
• Depends on the physical characteristics of the disk drive as well as on the #cylinders.
• Assumption: all cylinders are equally likely to be sources or destinations.

minimum seek time ⇒ time it takes to move the access arm from a track to its adjacent track.
maximum seek time ⇒ time it takes to move the access arm from the outermost to the innermost
track.
average seek time ⇒ the average of the minimum and maximum seek times.


Rotational Delay

• Time it takes for the head to reach the right place on the track.
• Assumption: Correct block can be recognized by some flag or block identifier at the beginning.

rotational time (tr ) ⇒ time needed for a disk pack to complete one whole revolution.

• Average latency time: [tr/2] = time it takes the disk drive to make ½ revolution.

tr/2 = (1/2) x (60 x 1000) / rpm   (in ms)

where rpm is revolutions per minute

or, if tr is given:

tr/2 = tr / 2

• Hard disks usually rotate at about 5,400 rpm, or 1 revolution per 11.1 ms, so the average
rotational delay is about 5.56 ms.

• For a floppy disk rotating at 360 rpm, tr/2 = 83.3 ms.
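
The two figures above can be checked with a few lines of Python. This is only a sketch of the formula tr/2 = (1/2)(60 x 1000)/rpm; the function names are illustrative.

```python
def revolution_time_ms(rpm):
    """tr: time for one full revolution, in milliseconds."""
    return 60 * 1000 / rpm

def average_rotational_delay_ms(rpm):
    """tr/2: on average the disk must turn half a revolution."""
    return revolution_time_ms(rpm) / 2

for rpm in (5400, 360):
    print(rpm, "rpm:",
          round(revolution_time_ms(rpm), 2), "ms per revolution,",
          round(average_rotational_delay_ms(rpm), 2), "ms average delay")
```

At 5,400 rpm one revolution takes about 11.11 ms, so the average delay is about 5.56 ms; at 360 rpm it is about 83.33 ms.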

Block Transfer Time

• Time it takes for the head to pass over a block
• The bytes of data can be transferred from disk to RAM and vice versa.
• Depends on rotational speed, recording density (#bytes/track), and #bytes in a block.
• Let
tbt = block transfer time
rbt = block transfer rate
B = #bytes transferred

tbt = B / rbt

Example: Assume a block size of 1,000 bytes and that the blocks are stored randomly. The disk
drive has the following characteristics:
ts/2 = 30 ms
tr = 16.67 ms
tr/2 = 8.3 ms
rbt = 806 KB/sec
= 825,344 bytes/sec

• What is the average access time per block?

Average access time per block
= seek + rotational latency + transfer
= ts/2 + tr/2 + tbt
= 30 ms + 8.3 ms + (1,000 bytes / 825,344 bytes/sec)
= (30 + 8.3 + 1.21) ms
= 39.51 ms per block
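
The same calculation, written as a small sketch so that other block sizes or transfer rates can be tried. The function names are illustrative; the inputs are the ones given in the example.

```python
def block_transfer_time_ms(block_bytes, rate_bytes_per_sec):
    """tbt = B / rbt, converted from seconds to milliseconds."""
    return block_bytes / rate_bytes_per_sec * 1000

def average_access_time_ms(seek_ms, rotational_ms, block_bytes, rate_bytes_per_sec):
    """Average seek + average rotational delay + block transfer time."""
    return seek_ms + rotational_ms + block_transfer_time_ms(block_bytes, rate_bytes_per_sec)

rbt = 806 * 1024                       # 806 KB/sec expressed in bytes/sec
tbt = block_transfer_time_ms(1000, rbt)
total = average_access_time_ms(30, 8.3, 1000, rbt)
print(round(tbt, 2), round(total, 2))
```

The transfer itself accounts for only about 1.21 ms of the 39.51 ms; seek time and rotational delay dominate, which is why file structures try to minimize the number of separate accesses.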


Magnetic Tape

• A sequential-access storage device in which blocks of data are stored serially along the length
of the tape and can only be accessed in a serial manner
• Logical position of a byte within a file corresponds directly to its physical location relative to the
start of the file.
• Tape is a good medium for archival storage and for transporting data.
• Tape drives come in many shapes, sizes and speeds. Performance of each drive can be
measured in terms of the following quantities:
• Tape (or recording) density ⇒ number of characters or bytes of data that can be
stored per inch (bpi)
• Tape Speed ⇒ commonly 30 to 200 inches per second (ips)

Disk vs. Tape

DISK                                    TAPE
Random access                           Sequential access
Used for short-term storage of files    Used for long-term storage of files
Generally serves many processes         Dedicated to one process
Expensive                               Less expensive
Used as main secondary storage          Considered to be tertiary storage


RAID: Improving File Access Performance by Parallel Processing

• Redundant Array of Inexpensive Disks
• A set of physical disk drives that appear to the database users and programs as if they
form one large logical storage unit.
• Parallel database processing speeds up writes and reads, and can be accomplished
by using a RAID. RAID does not change the logical or physical structure of
application programs or database queries.

RAID Levels


Fundamental File Structure Concepts

Data is stored in physical storage for later retrieval. In order to organize the data for easy
retrieval, the physical file organization and the access mechanism have to be determined.

Field and Record Organization

Field and record organization refers to the physical structure of the records to be stored. Records
can be fixed or variable in length.

A Stream File

PROGRAM: writstrm
    get output file name and open it with the logical name OUTPUT
    get LAST name as input
    while (LAST name has a length > 0)
        get FIRST name, ADDRESS, CITY, STATE, and ZIP as input
        write LAST to the file OUTPUT
        write FIRST to the file OUTPUT
        write ADDRESS to the file OUTPUT
        write CITY to the file OUTPUT
        write STATE to the file OUTPUT
        write ZIP to the file OUTPUT
        get LAST name as input
    endwhile
    close OUTPUT
end PROGRAM

Given an input, this is written to the file precisely as specified: as a stream of bytes containing no
added information. Once we put all that information together as a single byte stream, there is no
way to get it apart again. The integrity of the fundamental organizational units of the input data is
also lost.
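
The loss of structure is easy to see when the writstrm idea is sketched in Python. The two sample people below are made-up data, and an in-memory buffer stands in for the OUTPUT file; both are assumptions made so that the example is self-contained.

```python
import io

def write_stream(output, people):
    """Write every field back to back, with no delimiters or length information."""
    for person in people:
        for field in person:
            output.write(field)

# Made-up sample input: (LAST, FIRST, ADDRESS, CITY, STATE, ZIP)
people = [
    ("Ames", "John", "123 Maple", "Stillwater", "OK", "74075"),
    ("Mason", "Alan", "90 Eastgate", "Ada", "OK", "74820"),
]

output = io.StringIO()   # stands in for the file opened as OUTPUT
write_stream(output, people)
stream = output.getvalue()
print(stream)
```

The result is one undivided run of characters. Nothing in it says where a last name ends and a first name begins, or where the first record stops and the second starts.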

Field Structures

Field
• A conceptual tool used in file processing.
• Does not necessarily exist in any physical sense, yet it is important to the file’s structure.
• Smallest logically meaningful unit of information in a file.
• A subdivision of a record containing a single attribute of the entity the record describes.


Methods of Field Organization

The following are the methods most commonly used to organize a record into fields:

1. Fixed-length fields
• In this method, we can pull the fields back out of the file simply by counting our way
to the end of each field.
• One disadvantage is that this method makes the file larger.
• Data larger than the size of the field will not fit in it. A solution is to fix the size of the field
to cover all cases, but this results in internal fragmentation.
• Appropriate when the fields are fixed in length or when there is little variation in
field length.
• In C:
struct {
    char last[10];
    char first[10];
    char address[15];
    char city[15];
    char state[2];
    char zip[9];
} set_of_fields;

2. Length indicator at the beginning of each field
• Stores the length of each field ahead of its data.

3. Delimiters to separate fields
• Uses a character or set of characters, not used in the fields, as delimiters between
fields.
• Choice of delimiter is important since it must not get in the way of processing.

4. “keyword = value” expression to identify fields
• Each field provides information about itself.
• A good format for dealing with missing fields.
• Used in combination with delimiters.
• Wastes a lot of space because of the space occupied by the keywords.
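
The four methods can be compared side by side on a single record. The sketch below is illustrative only: the sample values, the two-digit ASCII length prefix and the "|" delimiter are assumptions, while the field widths are the ones used in the C struct above.

```python
names = ["last", "first", "address", "city", "state", "zip"]
fields = ["Ames", "John", "123 Maple", "Stillwater", "OK", "74075"]  # made-up data
widths = [10, 10, 15, 15, 2, 9]   # same sizes as the C struct

def fixed_length(fields, widths):
    """Pad each field with blanks to its fixed width (longer data is cut off)."""
    return "".join(f.ljust(w)[:w] for f, w in zip(fields, widths))

def length_prefixed(fields):
    """Store a two-digit length ahead of each field."""
    return "".join(f"{len(f):02d}{f}" for f in fields)

def delimited(fields, delimiter="|"):
    """End each field with a delimiter that never appears in the data."""
    return "".join(f + delimiter for f in fields)

def keyword_value(names, fields, delimiter="|"):
    """Write each field as keyword=value, separated by a delimiter."""
    return "".join(f"{n}={v}{delimiter}" for n, v in zip(names, fields))

for encoded in (fixed_length(fields, widths), length_prefixed(fields),
                delimited(fields), keyword_value(names, fields)):
    print(len(encoded), repr(encoded))
```

For this record the fixed-length form always takes 61 characters regardless of the data, which illustrates the space cost noted above, and the keyword form is the longest of the variable-length forms because the keywords are stored with every record.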

Record Structures

Records
• Just like fields, another conceptual tool.
• Set of fields that belong together when the file is viewed in terms of a higher level of
organization.


Methods of Record Organization

The following are the methods most commonly used to organize a file into records:

1. Fixed-length records
• All records contain the same number of bytes.
• Most commonly used method for organizing files.
• Having a fixed length does not imply that the sizes or number of fields in the record
must be fixed: a record may hold fixed-sized fields, variable-sized fields, or a
combination of both.

2. Predictable number of fields
• Instead of fixed size, the number of fields in the record is what is fixed in this method.

3. Length indicator at the start of each record
• Each record begins with a field containing the number of bytes in it.

4. Use of index to keep track of addresses
• Another file is used as an index, which contains the byte offset of each record.

5. Delimiter at the end of each record
• Similar to delimiters in fields.
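
Methods 3 and 4 combine naturally, as the following sketch suggests. It assumes a 2-byte little-endian length indicator and keeps the index as an in-memory list rather than a second file; the sample records are made up.

```python
import struct

def pack_records(records):
    """Return (file_bytes, index): each record is preceded by a 2-byte length,
    and index[i] holds the byte offset at which record i starts."""
    data = bytearray()
    index = []
    for record in records:
        payload = record.encode("ascii")
        index.append(len(data))                      # byte offset of this record
        data += struct.pack("<H", len(payload)) + payload
    return bytes(data), index

def read_record(data, offset):
    """Read the length indicator at offset, then exactly that many bytes."""
    (length,) = struct.unpack_from("<H", data, offset)
    start = offset + 2
    return data[start:start + length].decode("ascii")

records = ["Ames|John|OK|", "Mason|Alan|OK|", "Lee|Ann|TX|"]  # made-up data
data, index = pack_records(records)
print(index)
print(read_record(data, index[2]))
```

The length indicator lets a reader step from one variable-length record to the next, while the offsets in the index allow a jump straight to any record without reading the ones before it.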

File Organization

The following are the main types of file organization:


• Sequential
• Relative

Sequential File Organization

Sequential file organization is the oldest type of file organization since during the 1950s and
1960s, the foundation of many information systems was sequential processing.

Sequential Files

• Files that are read from beginning to end.
• Developed because of the heavy use of magnetic tapes, which were the first secondary storage
devices available.

Physical Characteristics

• Physical order of the records is the same as the logical order


• Logical representation: (record 1) ⇒ (record 2) ⇒ (record 3) ⇒ … ⇒ (record n)

Two Types of Sequential Files

• Unordered pile files


• Sorted sequential files


Relative File Organization

• With relative file organization, there exists a predictable relationship between the key used to
identify a record and that record’s absolute address on an external file.

• Relative addressing allows access to a record directly given only the key, regardless of the
position of the record in the file.

• The file is characterized as providing random access because the logical organization of the
file need not correspond to its physical organization.

• The simplest relative file organization is when the key value corresponds directly to the
physical location of the record in the file.

o This method is useful for dense keys, i.e., values of consecutive keys differ only by one.

o If the key collection is not dense, space may be wasted. This can be solved by
mapping the large range of non-dense key values into a smaller range of record positions in
the file. ⇒ The key is then no longer the address of the record in the file. Hashing or indexing
may be used.
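
For the simplest case, where the key is the record position, the byte address is just the key multiplied by the record size. The sketch below assumes 16-byte fixed-length records, dense keys starting at 0, and an in-memory buffer standing in for the external file.

```python
import io

RECORD_SIZE = 16   # bytes per fixed-length record (an arbitrary choice)

def write_record(f, key, text):
    """Store the record at byte offset key * RECORD_SIZE, padded with blanks."""
    f.seek(key * RECORD_SIZE)
    f.write(text.encode("ascii").ljust(RECORD_SIZE)[:RECORD_SIZE])

def read_record(f, key):
    """Jump directly to the record for this key; no other record is read."""
    f.seek(key * RECORD_SIZE)
    return f.read(RECORD_SIZE).decode("ascii").rstrip()

f = io.BytesIO()   # stands in for the file on disk
for key, name in enumerate(["Ames", "Mason", "Lee", "Cruz"]):  # made-up data
    write_record(f, key, name)

print(read_record(f, 2))
```

With keys such as 1001, 2034 and 5123 the same scheme would leave most of the file empty, which is the wasted-space problem that hashing or indexing is meant to solve.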

File Access

File access refers to the manner in which data is retrieved.

There are two general file access methods:

• Sequential Access
  o Data is accessed sequentially, starting at the beginning of the file.
  o O(n) – expensive if done directly on the disk
  o Not advisable for most serious retrieval situations, but it is still the best choice for
    some applications, such as a file with few records, a search for some pattern, or a
    search where a large number of matches is expected.

• Relative Access
  o Addresses of records can be obtained directly from a key.
  o Uses an indexing technique that allows a user to locate a record in a file with as few
    accesses as possible, ideally with just one access.
  o The indexing scheme could be one of the following:
    • Direct addressing
    • Binary search tree
    • B-trees
    • Multiple-key indexing
    • Hashing

If access is sequential, file organization used will not have significant effect on the access cost.
However, relative access entails fixing the size of records or using an index.


Indexing Mechanisms

Index
• Consists of keys and reference fields
• Works by indirection, i.e., lets you impose order on a file without actually rearranging the
file.
• Used to provide multiple access paths to a file.
• Gives keyed access to variable-length record files.

Key
• Identifies a record based on the record’s contents, not on the sequence of the records.

Primary keys
• Keys that uniquely identify a single record.
• Primary keys should be unchanging.

Secondary keys
• Used to overcome the shortcomings of the primary key.
• Used to access records according to data content.

Indexing is a mechanism for:

• Separating logical organization of files from their physical organization. The physical file may
be unorganized, but there may be logically ordered indexes for accessing it.
• Conducting efficient searches on the files
• Indexes hold information about location of records with specific values
• Index structures (hopefully) fit in main memory so index searching is fast

Types of Indexes

• Primary Index on Unordered Files


• Primary Index on Ordered Files
• Clustering Index on Ordered Files
• Secondary Index

Primary Index on Unordered Files

• The index is a simple array of structures that contain the keys and reference fields. It allows
binary search on the index table and access to the physical records/blocks directly from the
index table.
• The physical records are not ordered; the index provides the order.
• Physical files are entry-sequenced.
• To create an index of this type:
  • Append records to the physical file as they are inserted
  • Build an index (primary and/or secondary) on this file
• Deletion or update of physical records requires reorganization of the file and of the
primary index
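
A minimal sketch of this arrangement, assuming integer keys and Python lists standing in for the data file and the index table. Records are appended in arrival order, and only the index is kept sorted so that it can be binary searched.

```python
from bisect import bisect_left

data_file = []   # entry-sequenced: records stay in the order they arrived
index = []       # (key, record number) pairs, kept sorted by key

def insert(key, record):
    """Append the record to the data file and add its key to the sorted index."""
    data_file.append(record)
    record_number = len(data_file) - 1
    # (key,) sorts just before any (key, n), so this finds the slot for key.
    position = bisect_left(index, (key,))
    index.insert(position, (key, record_number))

def lookup(key):
    """Binary search the index, then fetch the record it points to."""
    position = bisect_left(index, (key,))
    if position < len(index) and index[position][0] == key:
        return data_file[index[position][1]]
    return None

insert(30, "record for key 30")
insert(10, "record for key 10")
insert(20, "record for key 20")
print(index)
print(lookup(20))
```

The data file keeps the records in the order 30, 10, 20, while the index lists the keys as 10, 20, 30: the index provides the order without moving any record.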


Primary Index on Ordered Files

• Physical records are ordered based on the primary key
• The index is ordered but holds only one index record for each block
• Reduces the size of the index and enables binary search over the index values without
having to read the entire file.


Clustering Index on Ordered Files

Secondary Index

• Used to facilitate faster access to commonly queried non-primary fields
• Secondary indexes typically point to the primary index
  Advantage: Record deletion and update cause less work
  Disadvantage: Less efficient

Retrieval Using Combination of Secondary Keys

• Use boolean AND operation, specifying the intersection of two subsets of the data file.
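
As a sketch of this intersection, suppose each secondary index maps a secondary-key value to the set of primary keys of the matching records. The index contents below are hypothetical.

```python
# Hypothetical secondary indexes: secondary-key value -> set of primary keys.
by_course = {"CS 165": {101, 102, 104}, "CS 130": {103, 104}}
by_year = {1: {101, 103}, 2: {102, 104}}

def match_all(*key_sets):
    """Boolean AND: keep only the primary keys present in every subset."""
    result = set(key_sets[0])
    for keys in key_sets[1:]:
        result &= keys
    return sorted(result)

# Students taking CS 165 AND in year 2.
print(match_all(by_course["CS 165"], by_year[2]))
```

Only the primary keys that survive the intersection need to be looked up in the data file.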


Hashing

• Used to obtain addresses from keys. A hash function is used to map a range of key values into
a smaller range of relative addresses.

• Hashing is like indexing in that it involves associating a key with a relative record address.

• Unlike indexing, with hashing there is no obvious connection between the key and the address
generated, since the function "randomly selects" a relative address for a specific key value
without regard to the physical sequence of the records in the file. Thus it is also referred to as
a randomizing scheme.

• Problem in hashing: collisions. A collision happens when two or more input keys
generate the same address under the same hash function.
• Solution: use a collision resolution technique or a perfect hash function.

                Common Hashing Techniques              Perfect Hashing Techniques

                Hash functions which are able to       Hash functions which are able to generate a
                generate random addresses (not         perfectly uniform distribution of addresses
                necessarily unique)

                • Prime Number Division Method         • Quotient Reduction
                • Digit Extraction                     • Remainder Reduction
                • Folding                              • Associated Value
                • Mid-Square                           • Reciprocal Hashing
                • Radix Conversion

Advantages      Easier to implement                    Provide 1-1 mapping of keys into addresses
                                                       (no collisions)

Disadvantages   • Records may collide                  • Requires knowledge of the set of key values
                • Records may be unevenly                in order to generate a perfectly uniform
                  distributed. In the worst case,        distribution of keys
                  all records hash into a single       • Quite complicated to use
                  address                              • Has a lot of pre-requisites
                                                       • It is hard to find a function that produces no
                                                         collision
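
As an illustration of the first common technique, the prime number division method takes the remainder of the key divided by a prime table size. The table size of 11 and the sample keys are assumptions chosen so that one collision appears; no collision resolution is attempted here.

```python
TABLE_SIZE = 11   # a prime number of relative addresses (arbitrary choice)

def hash_address(key):
    """Prime number division method: address = key mod prime."""
    return key % TABLE_SIZE

def group_by_address(keys):
    """Map each generated address to the list of keys that hash to it."""
    buckets = {}
    for key in keys:
        buckets.setdefault(hash_address(key), []).append(key)
    return buckets

keys = [1001, 2034, 3050, 4711, 5123]   # made-up, non-dense key values
buckets = group_by_address(keys)
collisions = {address: ks for address, ks in buckets.items() if len(ks) > 1}
print(buckets)
print(collisions)
```

Keys 3050 and 4711 both map to address 3, so a collision resolution technique would be needed before both records could be stored.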

Other Indexing Mechanisms

• Binary Search Trees: BST, Height-Balanced (AVL, Red-Black)


• B-Trees

Indexes That Are Too Large to Hold in Memory

If the index to a file is too large to be stored in the main memory, index access and maintenance
must be done on secondary storage. The disadvantage of this approach is that binary searching
on secondary storage requires several seeks. Also, index rearrangement requires shifting or
sorting records on secondary storage, which is expensive.
• Alternatives: use hashing for direct access, or a tree-structured index for both keyed access
and sequential access


Organizing Files for Performance

It is not sufficient to organize files merely to store data. To provide efficient access, we must know
how to organize files to improve performance. Data compression and reclaiming unused space are
some of the ways to improve space utilization and file access times.

Data Compression

Data compression is the process of making files smaller. It involves encoding the information in a
file in such a way that it takes up less space.

Since smaller files use less storage, compression results in cost savings. In addition, smaller files
can be transmitted faster and processed faster sequentially.

Several methods for data compression exist, and in this course we'll cover:
• Using a different notation
• Suppressing repeated sequences, and
• Assigning variable-length codes

Redundancy Reduction
⇒ Compression by reducing redundancy in data representation
⇒ The three techniques that follow use this method.

Using a Different Notation

Compact Notation ⇒ a compression technique in which we decrease the number of bits used in data
representation.

• Fixed-length fields of a record are good candidates for compression
o Choosing a minimal length that is still conservative enough
o Using alternative values: for example, course code instead of course name

Suppressing Repeated Sequences

Run-Length Encoding (RLE)

• A compression technique in which runs of repeated codes are replaced by a count of the
number of repetitions of the code, followed by the code that is repeated

• Images with repeated pixels are good candidates for RLE

The algorithm:
• Read through the pixels in the image, copying the values to the file in sequence, except
where the same pixel value occurs more than once in succession.
• Substitute the following 3 bytes for the pixel values that occur in succession:
  o Run-length code indicator;
  o Pixel value that is repeated; and
  o The number of times it is repeated.


For example, we want to compress the following sequence, where 0xFF does not occur as a pixel
value in the image and so can serve as the run-length code indicator:
22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24

The compressed version is

22 23 FF 24 07 25 FF 26 06 25 24

• RLE does not guarantee any particular amount of space savings. It may even result in larger
files.
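
The algorithm and the example can be sketched as follows. The sketch treats the pixel values as hexadecimal bytes, replaces any run of two or more equal values (the min_run default, an assumption that follows the wording of the algorithm), and caps a run at 255 so that the count fits in one byte.

```python
INDICATOR = 0xFF   # run-length code indicator; must not occur as a pixel value

def rle_encode(pixels, min_run=2):
    """Replace each run of at least min_run equal values by
    (indicator, value, count); copy everything else unchanged."""
    out = []
    i = 0
    while i < len(pixels):
        run = 1
        while i + run < len(pixels) and pixels[i + run] == pixels[i] and run < 255:
            run += 1
        if run >= min_run:
            out += [INDICATOR, pixels[i], run]
        else:
            out.append(pixels[i])
        i += run
    return out

def rle_decode(codes):
    """Expand every (indicator, value, count) triple back into a run."""
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == INDICATOR:
            out += [codes[i + 1]] * codes[i + 2]
            i += 3
        else:
            out.append(codes[i])
            i += 1
    return out

pixels = [0x22, 0x23] + [0x24] * 7 + [0x25] + [0x26] * 6 + [0x25, 0x24]
encoded = rle_encode(pixels)
print(" ".join(f"{b:02X}" for b in encoded))
```

The 18 input bytes shrink to the 11 bytes shown above. A sequence such as 01 01 02 02, on the other hand, grows from 4 bytes to 6, which is one way a file can end up larger.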

Assigning Variable-Length Codes

This is another redundancy reduction technique of data compression, in which the most frequently
used pieces of data are assigned the shortest codes.

Example 1: Morse Code

E  •        T  -
I  ••       M  --
S  •••      O  ---
H  ••••

However, Morse code still needs some delimiter to separate the characters.

Example 2: Huffman Code

Suppose we have an alphabet consisting of only seven characters:

Letter       a    b    c    d     e     f     g
Probability  0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code         1    010  011  0000  0001  0010  0011

• Each letter here occurs with the probability indicated

• Letter a has the greatest probability of occurring, so it is assigned the one-bit code

• So the string abacde is represented as ‘1010101100000001’

• Seven letters can be stored using only three bits each, but in this example as many as four bits are
used to ensure that the distinct codes can be stored together without delimiters and can still
be recognized
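
The table can be exercised with a short sketch. It does not build the Huffman tree; it takes the codes and probabilities from the table as given and shows that the bit string can be decoded without delimiters, because no code is a prefix of another.

```python
CODES = {"a": "1", "b": "010", "c": "011", "d": "0000",
         "e": "0001", "f": "0010", "g": "0011"}
PROBABILITY = {"a": 0.4, "b": 0.1, "c": 0.1, "d": 0.1,
               "e": 0.1, "f": 0.1, "g": 0.1}

def encode(text):
    """Concatenate the code of each letter; no delimiters are written."""
    return "".join(CODES[letter] for letter in text)

def decode(bits):
    """Read bits until they match a code, emit that letter, then start over."""
    letters = {code: letter for letter, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in letters:
            out.append(letters[current])
            current = ""
    return "".join(out)

average_bits = sum(PROBABILITY[letter] * len(CODES[letter]) for letter in CODES)
print(encode("abacde"))
print(decode(encode("abacde")))
print(round(average_bits, 2))
```

Although some letters take four bits, the expected length works out to 2.6 bits per letter, less than the three bits a fixed-length code would need.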
