SNU-OOPSLA-LAB
File Structures SNU-OOPSLA Lab. 1
Chapter Objectives
Describe the problem solved by extendible hashing and related approaches
Explain how extendible hashing works; show how it combines tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets
Contents
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
12.1 Introduction
Dynamic files
Static hashing
described in chapter 11 (direct hashing)
typically worse than B-Tree for dynamic files
eventually requires file reorganization
Extendible hashing
hashing for dynamic files
Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
Overview(1)
Direct access (hashing) files have a static size, so they are not suitable for files whose size is unknown in advance
A dynamic file structure is desired that retains fast retrieval by primary key and that also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)
Similar motivation!
Overview(2)
Extendible Hashing
Primary key
Hashing function
H(key)
The branching factor of the tree is equal to the number of alternative symbols in each position of the key
e.g., radix-26 trie: able, abrahms, adams, anderson, andrews, baird
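[Figure: radix-26 trie for these keys; the root branches on a and b, and the a subtree branches on b, d, and n down to leaves such as anderson and andrews]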
Extendible Hashing
H maps keys to a fixed address space whose size is the largest prime less than a power of 2 (e.g., 65521 < 2^16)
File pointers point to blocks of records known as buckets; an entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts
[Figure: binary trie with 0/1 branches leading to buckets A, B, and C]
(2) Retrieving from secondary storage the buckets containing keys, instead of individual keys
[Figure: the trie for buckets A, B, and C redrawn as a directory with the 2-bit addresses 00, 01, 10, and 11]
Representation of Trie(2)
Directory entry : a pointer to the associated bucket Given an address beginning with the bits 10, the 210
directory entries
Retrieve a record
find H(given key)
extract the first d bits of H(given key)
use this value as an index into the directory to find a pointer
use this pointer to read a bucket into primary memory
locate the desired record within the bucket (scan)
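The retrieval steps can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the `Record`, `Bucket`, and `Directory` structs are hypothetical in-memory stand-ins for the on-disk structures, the hash value is taken to be 16 bits wide, and the caller supplies H(key) as `hashValue`. It reads "first d bits" literally as the high-order bits; an implementation may choose its address bits differently.

```cpp
#include <string>
#include <vector>

// Hypothetical in-memory stand-ins for the file structures (not the classes
// used later in the chapter).
struct Record { std::string key; int recAddr; };
struct Bucket { std::vector<Record> records; };

struct Directory {
    int depth = 0;                // d: number of address bits in use
    std::vector<int> cells;       // 2^d entries, each holding a bucket number
    std::vector<Bucket> buckets;  // stands in for the bucket file
};

// First `depth` bits of a 16-bit hash value (assumes 0 <= depth <= 16).
int firstBits(unsigned hashValue, int depth) {
    return static_cast<int>((hashValue & 0xFFFFu) >> (16 - depth));
}

// Returns the record address stored for `key`, or -1 if the key is absent.
// `hashValue` is H(key), computed by the caller.
int search(const Directory& dir, const std::string& key, unsigned hashValue) {
    int cell = firstBits(hashValue, dir.depth);           // index into the directory
    const Bucket& bucket = dir.buckets[dir.cells[cell]];  // the one bucket "read"
    for (const Record& r : bucket.records)                // scan within the bucket
        if (r.key == key) return r.recAddr;
    return -1;
}
```

With the directory in memory, the only step that would touch secondary storage is fetching the bucket itself.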
A pair of adjacent buckets with the same value of d that share a common value of the first d-1 bits of H(key) can be combined if their average load is below 50%, so that all of their records fit into one bucket
File contraction is the reverse of expansion: the directory can be compacted and d decremented whenever every pair of adjacent pointers holds the same value
d=2  B00    H(key)=00..
d=2  B01    H(key)=01..
d=4  B1000  H(key)=1000..
d=4  B1001  H(key)=1001..
d=3  B101   H(key)=101..
d=2  B11    H(key)=11..
Bucket B100 overflows, so d increases to 4
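When a bucket that already uses every directory bit overflows, the directory must double before the bucket can split. A minimal sketch, assuming the directory is held in memory as a vector of bucket numbers indexed by the d-bit address (the function name `doubleDirectory` is hypothetical):

```cpp
#include <cstddef>
#include <vector>

// Doubles the directory: old cell i becomes new cells 2i and 2i+1, both still
// referring to the same bucket. The address depth d grows by one bit.
std::vector<int> doubleDirectory(const std::vector<int>& cells) {
    std::vector<int> doubled(cells.size() * 2);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        doubled[2 * i] = cells[i];
        doubled[2 * i + 1] = cells[i];
    }
    return doubled;
}
```

After doubling, only the pair of cells belonging to the overflowing bucket needs to change: one of the two is repointed to the newly created bucket, and every other pair keeps sharing its old bucket.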
[Figures, pages 16-19: buckets A, B, and C are addressed through a 2-bit directory (00, 01, 10, 11); a split adds bucket D, and the directory is then doubled to the 3-bit addresses 000 through 111, with buckets B and D as neighbors]
Creating Address
Function hash(KEY)
Fold/Add hashing algorithm
Do not MOD the hash value by the address space size, since no fixed address space exists
Output from the hash function for a number of keys:
key        hash value
bill       0000 0011 0110 1100
lee        0000 0100 0010 1000
pauline    0000 1111 0110 0101
alan       0100 1100 1010 0010
julie      0010 1110 0000 1001
mike       0000 0111 0100 1101
elizabeth  0010 1100 0110 1010
mark       0000 1010 0000 0111
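#include <cstring>

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len ++; // make len even; the extra character is the terminating '\0'
    for (int j = 0; j < len; j += 2) // fold the key two characters at a time
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}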
Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit
int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits: the low-order bits of hashVal become the high-order bits of retval
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}
Figure 12.9 Function MakeAddress(key,depth)
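To see the two functions working together, the sketch below restates them in self-contained form and can be checked by hand. It assumes the loop in Hash advances two characters per step (written here as `j += 2`) and takes `const char*` keys for convenience; both are choices of this sketch rather than something the figures state.

```cpp
#include <cstring>

// Fold-and-add hash: combines the key two characters at a time.
int Hash(const char* key) {
    int sum = 0;
    int len = static_cast<int>(std::strlen(key));
    if (len % 2 == 1) len++;  // odd length: pair the last character with '\0'
    for (int j = 0; j < len; j += 2)
        sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
    return sum;
}

// Builds a depth-bit address from the low-order bits of the hash, in reverse
// order, so the lowest hash bit becomes the highest address bit.
int MakeAddress(const char* key, int depth) {
    int retval = 0;
    int hashVal = Hash(key);
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);
        hashVal >>= 1;
    }
    return retval;
}
```

For the key bill, Hash returns 876, which is 0000 0011 0110 1100 in binary. Its four low-order bits are 1100, so MakeAddress("bill", 4) reverses them to 0011, that is 3, and MakeAddress("bill", 3) gives 001, that is 1.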
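class Bucket: protected TextIndex
{
protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split (); // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd); // address range of the new bucket
    int Redistribute (Bucket & newBucket); // move keys between this bucket and newBucket
    int FindBuddy (); // find the buddy of this bucket
    int TryCombine (); // attempt to combine this bucket with its buddy
    int Combine (Bucket * buddy, int buddyIndex); // combine two buckets
    int Depth; // number of address bits shared by the keys in this bucket
    Directory & Dir; // directory that contains the bucket
    int BucketAddr; // address of the bucket in the file
    friend class Directory;
    friend class BucketBuffer;
};
Figure 12.10 Main members of class Bucket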
class Directory
{
public:
    Directory (..);
    ~Directory ();
    int Open (..);
    int Create ();
    int Close ();
    int Insert ();
    int Delete ();
    int Search ();
protected:
    int DoubleSize ();
    int Collapse ();
    int InsertBucket (..);
    int Find ();
    int StoreBucket ();
    int LoadBucket ();
    ..
};
Figure 12.11 Definition of class Directory
12.4 Deletion
Buddy buckets: buckets that are siblings at the leaf level of the tree ("buddy" means something like friend)
e.g., B and D on page 19 are buddy buckets
Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory
Buddy Bucket
Given a bucket with an address uvwxy, where u, v, w, x, and y each have the value 0 or 1, the buddy bucket, if it exists, has the address uvwxz, such that
z = y XOR 1
If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
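The buddy rule amounts to flipping the last address bit. A minimal sketch, in which the function name and parameters are hypothetical; it also assumes that a buddy exists only when the bucket uses every bit of the current directory depth, which is one reading of "if it exists":

```cpp
// Returns the directory address of the buddy bucket, or -1 if there is none.
// bucketAddress:  the bucket's address (uvwxy), as an integer
// bucketDepth:    number of address bits the bucket uses
// directoryDepth: number of address bits the directory uses
int findBuddy(int bucketAddress, int bucketDepth, int directoryDepth) {
    if (directoryDepth == 0) return -1;           // a single cell has no buddy
    if (bucketDepth < directoryDepth) return -1;  // bucket is shared by several cells
    return bucketAddress ^ 1;                     // z = y XOR 1
}
```

For example, the bucket at address 10110 has its buddy at 10111, and applying the rule to 10111 leads back to 10110.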
Collapse condition
If the directory has only a single cell, downsizing is impossible
If any pair of directory cells does not point to the same bucket, collapsing is impossible
Allocating space
Allocate half the size of the original directory
Copy the bucket reference shared by each cell pair to a single cell in the new directory
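The collapse test and the copy into a half-size directory can be sketched together. This assumes the directory is an in-memory vector of bucket numbers whose size is a power of two, with cells 2i and 2i+1 forming a pair; the function name `collapseDirectory` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Tries to halve the directory in place. Returns true if it collapsed,
// false if the directory was left unchanged.
bool collapseDirectory(std::vector<int>& cells) {
    if (cells.size() <= 1) return false;           // single cell: cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false; // some pair differs
    std::vector<int> half(cells.size() / 2);        // half the original size
    for (std::size_t i = 0; i < half.size(); ++i)
        half[i] = cells[2 * i];                     // one cell per former pair
    cells = half;
    return true;
}
```

A directory {0, 0, 1, 1} collapses to {0, 1}; a second call fails because the remaining pair points to two different buckets.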
Time : O(1)
If the directory can be kept in RAM: a single access
Otherwise: two accesses are necessary
With uniformly distributed addresses, all the buckets tend to fill up at the same time and therefore split at the same time
As buckets fill up: 90%
After a concentrated series of splits: 50%
N ~= r / (b ln 2), where r is the number of records, b the bucket size, and N the number of buckets
Utilization = r / (bN) ~= ln 2 = 0.69
Average utilization of 69%
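The 69% figure follows from a single substitution. As a worked step, taking the bucket-count estimate as N ≈ r/(b ln 2), with r the number of records, b the bucket size, and N the number of buckets (the symbol meanings are stated here as assumptions):

```latex
\[
\text{Utilization} \;=\; \frac{r}{bN}
\;\approx\; \frac{r}{b \cdot \dfrac{r}{b \ln 2}}
\;=\; \ln 2 \;\approx\; 0.69
\]
```

Both r and b cancel, so under this estimate the average utilization does not depend on the file size or the bucket size.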
Use a directory to track bucket addresses
Extend the directory through the use of tries
Start with a hash function that covers an address space of a fixed size
As buckets in that address space overflow, they split, forming the leaves of a trie that grows down from the original address node
External node: references a data bucket
Internal node: points to two child index nodes
When a node has split children, it changes from an external node to an internal node
Apply the first hash function to reach a node in the original address space; if an external node is found, the search is complete
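[Figure, panels (a)-(c): growth of the dynamic hashing index as address nodes split (node labels 1, 2, 3, 4, then 20, 21, 40, 41, then 410, 411)]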
Overflow handling
Both schemes extend the hash function locally, as a binary search trie
Space Utilization
Growth of directory
Dynamic hashing: slower, more gradual growth
Extendible hashing: extends the directory by doubling it
A node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)
Page fault
Dynamic hashing: more than one page fault (with a linked structure for the directory)
Extendible hashing: single page fault
Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.
The actual address space is extended one bucket at a time as buckets overflow
Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands
No directories: avoids the additional seek resulting from an additional layer
Use more bits of the hashed value
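Without a directory, the address has to be computed directly from the hash value, using one extra bit for buckets that have already been split. A minimal sketch of that computation, assuming addresses are taken from the low-order bits of the hash; the names `linearAddress`, `depth`, and `next` are hypothetical:

```cpp
#include <cstdint>

// hash:  full hash value of the key
// depth: number of address bits in use before the current round of splits
// next:  address of the next bucket due to be split; buckets below it have
//        already been split in this round
// Assumes depth + 1 < 32.
std::uint32_t linearAddress(std::uint32_t hash, unsigned depth, std::uint32_t next) {
    std::uint32_t addr = hash & ((1u << depth) - 1u);  // low `depth` bits
    if (addr < next)                                    // bucket already split:
        addr = hash & ((1u << (depth + 1)) - 1u);       // use one more bit
    return addr;
}
```

With depth 2 and next equal to 1, only bucket 00 has been split: a hash ending in 100 maps to the new bucket 100, a hash ending in 000 stays in bucket 000, and hashes ending in 01, 10, or 11 still use two bits.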
[Figure, panels (a)-(e): growth of a linear hashing file. Starting from buckets a, b, c, and d at addresses 00, 01, 10, and 11, the address space grows one bucket at a time as A (100), B (101), C (110), and D (111) are added; x and y label overflow buckets.]
B-Tree: redistribution rather than splitting
Hashing: placing records in chains of overflow buckets to postpone splitting
Linear hashing
A split occurs every time any bucket overflows
The bucket that splits is not necessarily the overflowing bucket
Litwin (1980): control splitting by the overall load factor of the file
Use chaining overflow buckets
Avoid doubling directory space
1.1 seeks, 76% ~ 81% storage utilization