Hashing Best Study

File Structures by Folk, Zoellick and Riccardi
Chap12. Extendible Hashing
SNU-OOPSLA-LAB
File Structures SNU-OOPSLA Lab. 1
Chapter Objectives
Describe the problem solved by extendible hashing and related approaches
Explain how extendible hashing works; show how it combines tries with conventional, static hashing
Use the buffer, file, and index classes of previous chapters to implement extendible hashing, including deletion
Review studies of extendible hashing performance
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets
Contents
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches
12.1 Introduction
Dynamic files
  undergo a lot of growth
Static hashing
  described in Chapter 11 (direct hashing)
  typically worse than a B-tree for dynamic files
  eventually requires file reorganization
Extendible hashing
  hashing for dynamic files
  Fagin, Nievergelt, Pippenger, and Strong (ACM TODS 1979)
Overview(1)
Direct access (hashing) files have a static size, so they are not suitable for files whose size is unknown in advance
A dynamic file structure is desired that retains fast retrieval by primary key and also expands and contracts as the number of records in the file fluctuates (without reorganizing the whole file)
Similar motivation!
  Indexed-sequential file ==> B-tree
  Hashing ==> Extendible hashing
Overview(2)
Extendible Hashing
Primary key ==> Hashing function H(key) ==> Extract first d bits ==> Directory index ==> Table look-up ==> File pointer
12.2 How Extendible Hashing works
Idea from tries (radix searching)
  The branching factor of the tree equals the number of alternative symbols in each position of the key
  e.g.) Radix 26 trie for the keys able, abrahms, adams, anderson, andrews, baird
  Use the first n characters for branching
[Figure: radix 26 trie. The root branches on a and b; below a, the second letter branches on b, d, and n; the leaves hold able, abrahms, adams, anderson, andrews, and baird.]
Extendible Hashing

H maps keys to a fixed address space whose size is the largest prime less than a power of 2 (e.g., 65521 < 2^16)
File pointers point to blocks of records known as buckets; an entire bucket is read by one physical data transfer, and buckets may be added to or removed from the file dynamically
The first d bits of H(key) are used as an index into a directory array containing 2^d entries, which usually resides in primary memory
The value d, the directory size (2^d), and the number of buckets change automatically as the file expands and contracts
Extendible Hashing Example

Directory with d=3 and 4 buckets
Directory entries (d=3)   Bucket
000, 001, 010, 011        B0    (d=1, H(key)=0)
100                       B100  (d=3, H(key)=100)
101                       B101  (d=3, H(key)=101)
110, 111                  B11   (d=2, H(key)=11)
Turning the trie into a directory
Using a trie for extendible hashing
(1) Use a radix 2 trie:
  Keys in A: beginning with 0
  Keys in B: beginning with 10
  Keys in C: beginning with 11
[Figure: radix 2 trie. Branch 0 from the root leads to bucket A; branch 1 leads to a node whose branches 0 and 1 lead to buckets B and C.]
(2) Retrieve from secondary storage the buckets containing keys, instead of individual keys
Representation of Trie (1)

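A tree is not preferable (the directory is not big); use a flattened array
  1. Make a complete (full) binary tree
  2. Collapse it into the directory structure
[Figure: the trie extended to a complete binary tree, then collapsed into a directory: 00 ==> A, 01 ==> A, 10 ==> B, 11 ==> C.]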
Representation of Trie(2)
Directory is a complete binary tree
  Directory entry: a pointer to the associated bucket
  Given an address beginning with the bits 10, directory entry 10 (binary) gives the pointer to the associated bucket
Hashing is introduced for uniform distribution of the addresses
Retrieve a record
Steps in retrieving a record with a given key
  1. Find H(given key)
  2. Extract the first d bits of H(given key)
  3. Use this value as an index into the directory to find a pointer
  4. Use this pointer to read a bucket into primary memory
  5. Locate the desired record within the bucket (scan)
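The retrieval steps above can be sketched in C++ roughly as follows. This is a minimal in-memory illustration, not the textbook's implementation: `Record`, `Bucket`, `Directory`, `ToyHash`, `FirstBits`, and `Search` are hypothetical names, the hash is a placeholder assumed to be 16 bits wide, and reading a bucket from disk is replaced by following a pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical in-memory stand-ins for the on-disk structures.
struct Record { std::string key; int recAddr; };
struct Bucket { std::vector<Record> records; };

struct Directory {
    int depth;                    // d: number of address bits in use
    std::vector<Bucket*> cells;   // 2^d pointers to buckets
};

// Placeholder 16-bit hash (an assumption, not the Hash function of Figure 12.7).
inline std::uint16_t ToyHash(const std::string& key) {
    std::uint32_t h = 0;
    for (unsigned char c : key) h = h * 31 + c;
    return static_cast<std::uint16_t>(h ^ (h >> 16));
}

// Steps 1-2: keep the first (most significant) d bits of a 16-bit hash value.
inline int FirstBits(std::uint16_t hash, int d) {
    return hash >> (16 - d);
}

// Steps 3-5: index the directory, follow the pointer, scan the bucket.
inline const Record* Search(const Directory& dir, const std::string& key) {
    int index = FirstBits(ToyHash(key), dir.depth);
    const Bucket* bucket = dir.cells[index];
    for (const Record& r : bucket->records)
        if (r.key == key) return &r;
    return nullptr;               // key is not in the file
}
```

With the directory held in memory, only the bucket read in step 4 would touch secondary storage.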
Expansion & Contraction(1)
A pair of adjacent (buddy) buckets with the same value of d, which share a common value of the first d-1 bits of H(key), can be combined if their average load is below 50%, so that all records fit into one bucket
File contraction is the reverse of expansion; the directory can be compacted and d decremented whenever all pairs of pointers have the same values

Expansion & Contraction(2)
Bucket B0 overflows, then splits into B00 and B01
Directory entries (d=3)   Bucket
000, 001                  B00   (d=2, H(key)=00..)
010, 011                  B01   (d=2, H(key)=01..)
100                       B100  (d=3, H(key)=100..)
101                       B101  (d=3, H(key)=101..)
110, 111                  B11   (d=2, H(key)=11..)
Expansion & Contraction(3)
Bucket B100 overflows, d increases to 4
Directory entries (d=4)   Bucket
0000, 0001, 0010, 0011    B00   (d=2, H(key)=00..)
0100, 0101, 0110, 0111    B01   (d=2, H(key)=01..)
1000                      B1000 (d=4, H(key)=1000..)
1001                      B1001 (d=4, H(key)=1001..)
1010, 1011                B101  (d=3, H(key)=101..)
1100, 1101, 1110, 1111    B11   (d=2, H(key)=11..)
Splitting to Handle Overflow (1)
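When overflow occurs
e.g. 1) Overflowing of bucket A
  Split A into A and D
  Make use of the additional, previously unused bit
  No need to expand the directory
[Figure: directory before the split: 00 ==> A, 01 ==> A, 10 ==> B, 11 ==> C; after the split: 00 ==> A, 01 ==> D, 10 ==> B, 11 ==> C.]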
Splitting to Handle Overflow(2)
e.g. 2) Overflowing of bucket B
  There are no additional unused bits, so the directory must be expanded
  1. Divide B using 3 bits of the hash address
  2. Make a complete (full) binary tree
  3. Collapse it into the directory structure
[Figure: directory before the split: 00 ==> A, 01 ==> A, 10 ==> B, 11 ==> C.]
[Figure, three panels.
1. Result of overflow of bucket B: the trie holds A under 0, B under 100, D under 101, and C under 11.
2. Complete binary tree of depth 3.
3. Directory: 000, 001, 010, 011 ==> A; 100 ==> B; 101 ==> D; 110, 111 ==> C.]
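The two overflow cases (a split that reuses an unused bit, and a split that forces the directory to double) can be sketched as follows. This is an illustrative in-memory sketch under stated assumptions, not the textbook's `Bucket::Split` or `Directory::DoubleSize`: the `Bucket` and `Directory` types here are hypothetical, hash values are assumed to be 16 bits wide, the directory is indexed by the leading bits as in the figures, and a bucket stores only hash values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

constexpr int kHashBits = 16;   // assumed hash width

struct Bucket {
    int depth = 0;                        // leading bits shared by all keys in this bucket
    std::vector<std::uint16_t> hashes;    // stand-in for the stored records
};

struct Directory {
    int depth = 0;                        // d
    std::vector<std::shared_ptr<Bucket>> cells{std::make_shared<Bucket>()};

    int Index(std::uint16_t h) const { return h >> (kHashBits - depth); }

    // Complete the binary tree and collapse it: cell i becomes cells 2i and 2i+1.
    void DoubleSize() {
        std::vector<std::shared_ptr<Bucket>> bigger(cells.size() * 2);
        for (std::size_t i = 0; i < cells.size(); ++i)
            bigger[2 * i] = bigger[2 * i + 1] = cells[i];
        cells = std::move(bigger);
        ++depth;
    }

    // Split the bucket that currently holds hash value h.
    void Split(std::uint16_t h) {
        std::shared_ptr<Bucket> old = cells[Index(h)];
        if (old->depth == depth) DoubleSize();   // no unused bit left in the directory
        auto fresh = std::make_shared<Bucket>();
        fresh->depth = ++old->depth;
        // Cells of the old bucket whose newly used bit is 1 now point to the new bucket.
        int shift = depth - old->depth;
        for (std::size_t i = 0; i < cells.size(); ++i)
            if (cells[i] == old && ((i >> shift) & 1)) cells[i] = fresh;
        // Redistribute the stored hash values between the two buckets.
        std::vector<std::uint16_t> keep;
        for (std::uint16_t k : old->hashes)
            (cells[Index(k)] == fresh ? fresh->hashes : keep).push_back(k);
        old->hashes = std::move(keep);
    }
};
```

Starting from the directory 00 ==> A, 01 ==> A, 10 ==> B, 11 ==> C, splitting A only repoints cell 01, while splitting B first doubles the directory to eight cells and then repoints cell 101.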
Creating Address
Function hash(KEY)
  Fold/Add hashing algorithm
  Do not MOD the hash value by the address space, since no fixed address space exists
Output from the hash function for a number of keys
  bill       0000 0011 0110 1100
  lee        0000 0100 0010 1000
  pauline    0000 1111 0110 0101
  alan       0100 1100 1010 0010
  julie      0010 1110 0000 1001
  mike       0000 0111 0100 1101
  elizabeth  0010 1100 0110 1010
  mark       0000 1010 0000 0111
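#include <cstring>   // strlen

int Hash (char * key)
{
    int sum = 0;
    int len = strlen(key);
    if (len % 2 == 1) len++;   // make len even; the terminating '\0' pads an odd-length key
    for (int j = 0; j < len; j += 2)   // fold in two characters at a time
        sum = (sum + 100 * key[j] + key[j+1]) % 19937;
    return sum;
}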
Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15 bit
int MakeAddress (char * key, int depth)
{
    int retval = 0;
    int hashVal = Hash(key);
    // reverse the bits: the lowest bit of the hash becomes the highest bit of the address
    for (int j = 0; j < depth; j++)
    {
        retval = retval << 1;
        int lowbit = hashVal & 1;
        retval = retval | lowbit;
        hashVal = hashVal >> 1;
    }
    return retval;
}
Figure 12.9 Function MakeAddress(key,depth)
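As a worked example of the bit reversal in MakeAddress, the sketch below applies the same loop to a hash value passed in directly. `ReverseLowBits` is a hypothetical helper name introduced here for illustration; the value 876 is the hash of bill from the table of hash outputs (0000 0011 0110 1100).

```cpp
#include <cassert>

// Same loop as MakeAddress, but taking the hash value as a parameter:
// the low-order `depth` bits of hashVal are emitted in reverse order.
inline int ReverseLowBits(int hashVal, int depth) {
    int retval = 0;
    for (int j = 0; j < depth; j++) {
        retval = (retval << 1) | (hashVal & 1);   // append the current lowest bit
        hashVal >>= 1;                            // move on to the next bit
    }
    return retval;
}
```

For bill, the three lowest bits are 100, so the depth 3 address is 001; at depth 4 the lowest bits 1100 give the address 0011.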
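class Bucket : protected TextIndex
{
protected:
    Bucket (Directory & dir, int maxKeys = defaultMaxKeys);
    int Insert (char * key, int recAddr);
    int Remove (char * key);
    Bucket * Split ();                              // split the bucket and redistribute the keys
    int NewRange (int & newStart, int & newEnd);    // range of directory cells for a new (split) bucket
    int Redistribute (Bucket & newBucket);
    int FindBuddy ();                               // find the buddy of this bucket
    int TryCombine ();                              // attempt to combine with the buddy
    int Combine (Bucket * buddy, int buddyIndex);
    int Depth;                                      // number of address bits shared by the keys in this bucket
    Directory & Dir;                                // directory that contains the bucket
    int BucketAddr;                                 // address of the bucket in the file
    friend class Directory;
    friend class BucketBuffer;
};
Figure 12.10 Main members of class Bucket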
class Directory
{
public:
    Directory (..);
    ~Directory ();
    int Open (..);
    int Create ();
    int Close ();
    int Insert ();
    int Delete ();
    int Search ();
protected:
    int DoubleSize ();
    int Collapse ();
    int InsertBucket (.);
    int Find ();
    int StoreBucket ();
    int LoadBucket ();
    ..
};
Figure 12.11 Definition of class Directory
12.4 Deletion
When to combine buckets
Buddy buckets: buckets that are siblings at the leaf level of the tree ("buddy" means something like "friend")
e.g., B and D in the example of the overflow of bucket B are buddy buckets
Examine the directory to see if we can make changes there
Shrink the directory if none of the buckets requires the depth of address information that is currently available in the directory
Buddy Bucket
Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the value uvwxz, such that
z = y XOR 1
If enough keys are deleted, the contents of buddy buckets can be combined into a single bucket
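The buddy rule can be sketched as a small function. This is an illustration rather than the textbook's `Bucket::FindBuddy`: `BuddyIndex` and its parameters are hypothetical names, and the sketch assumes that a bucket has a buddy only when it uses the full depth of the directory, returning -1 otherwise.

```cpp
#include <cassert>

// Returns the directory index of the buddy bucket, or -1 if there is none.
// bucketAddress: the address bits uvwxy of the bucket, as an integer
// bucketDepth:   number of address bits the bucket uses
// dirDepth:      number of address bits the directory uses
inline int BuddyIndex(int bucketAddress, int bucketDepth, int dirDepth) {
    if (dirDepth == 0) return -1;            // a single-cell directory has no pairs
    if (bucketDepth < dirDepth) return -1;   // bucket is shared by several cells: no buddy
    return bucketAddress ^ 1;                // flip the last bit: z = y XOR 1
}
```

For example, a bucket at address 100 in a depth 3 directory has its buddy at 101, and the reverse also holds.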
Collapsing the Directory
Collapse condition
  If the directory is a single cell, downsizing is impossible
  If there is a pair of directory cells that do not both point to the same bucket, collapsing is impossible
Allocating space
  Allocate half the size of the original directory
  Copy the bucket reference shared by each cell pair to a single cell in the new directory
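The collapse test and the copy step can be sketched as follows. This is a minimal sketch, not the textbook's `Directory::Collapse`: the directory is modeled as a vector of bucket addresses (plain integers), and the function name and signature are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Halve the directory if every pair of cells (2i, 2i+1) points to the same bucket.
// Returns true if the directory was collapsed.
inline bool Collapse(std::vector<int>& cells, int& depth) {
    if (depth == 0) return false;                        // single cell: cannot shrink
    for (std::size_t i = 0; i < cells.size(); i += 2)
        if (cells[i] != cells[i + 1]) return false;      // some pair differs: cannot shrink
    std::vector<int> smaller(cells.size() / 2);          // allocate half the original size
    for (std::size_t i = 0; i < smaller.size(); ++i)
        smaller[i] = cells[2 * i];                       // one reference per cell pair
    cells = std::move(smaller);
    --depth;
    return true;
}
```

Calling it repeatedly until it returns false shrinks the directory as far as the current buckets allow.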
12.5 Extendible Hashing Performance
Time: O(1)
  If the directory can be kept in RAM: a single access
  Otherwise: two accesses are necessary
Space utilization of the buckets
  r (# of records), b (block size), N (# of blocks)
  Utilization = r / (bN)
  Average utilization ==> 0.69
Space utilization for the directory
  How large a directory should we expect to have, given an expected number of keys?
  Expected value for the directory size by Flajolet (1983):
  Estimated directory size = (3.92 / b) x r^(1 + 1/b)
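The directory size estimate can be evaluated directly. The sketch below assumes the formula reads (3.92 / b) r^(1 + 1/b), with r the number of records and b the bucket size in records; `EstimatedDirectorySize` is a name introduced here for illustration.

```cpp
#include <cassert>
#include <cmath>

// Flajolet's estimate of the number of directory cells,
// for r records and buckets that hold b records each.
inline double EstimatedDirectorySize(double r, double b) {
    return (3.92 / b) * std::pow(r, 1.0 + 1.0 / b);
}
```

Because the exponent 1 + 1/b approaches 1 as b grows, larger buckets give a directory that grows almost linearly with r and is smaller overall.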

Space utilization for buckets
Periodic and fluctuating
  With uniformly distributed addresses, all the buckets tend to fill up at the same time -> split at the same time
  As buckets fill up: 90%
  After a concentrated series of splits: 50%
  N ~= r / (b ln 2)
  Utilization = r / (bN) ~= ln 2 = 0.69
  Average utilization of 69%
  r: # of records, b: block size, N: # of blocks
B-tree space utilization
  Normal B-tree: 67%; B-tree with redistribution on insertion: 85%
12.6 Alternative Approaches(1): Dynamic Hashing
Similar to extendible hashing
  Use a directory to track bucket addresses
  Extend the directory through the use of tries
Start with a hash function that covers an address space of a fixed size
When overflow occurs
  the bucket splits, and the new buckets form the leaves of a trie that grows down from the original address node
Alternative Approaches(2): Dynamic Hashing
Two kinds of nodes
  External node: references a data bucket
  Internal node: points to two child index nodes
  When a node is split and gets children, it changes from an external node to an internal node
Two hash functions
  Apply the first hash function to find a node in the original address space
  If an external node is found: the search is complete
  If an internal node is found: apply the second hash function
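The search through the two kinds of nodes can be sketched as below. The slides do not say how the second hash function is used; this sketch assumes it supplies a sequence of bits that selects the left or right child at each internal node. `Node`, `SplitNode`, and `FindBucket` are hypothetical names, and buckets are represented by integer labels.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Node {
    int bucket = -1;                   // bucket label, meaningful for external nodes
    std::unique_ptr<Node> child[2];    // both set once the node has become internal
    bool IsExternal() const { return !child[0]; }
};

// Turn an external node into an internal node with two external children.
inline void SplitNode(Node& n, int bucket0, int bucket1) {
    n.child[0].reset(new Node);
    n.child[1].reset(new Node);
    n.child[0]->bucket = bucket0;
    n.child[1]->bucket = bucket1;
    n.bucket = -1;
}

// h1 selects a node in the original address space; the bits of h2
// (assumed: lowest bit first) steer the walk down the trie.
inline int FindBucket(const std::vector<std::unique_ptr<Node>>& roots,
                      unsigned h1, unsigned h2) {
    const Node* n = roots[h1 % roots.size()].get();
    while (!n->IsExternal()) {
        n = n->child[h2 & 1].get();
        h2 >>= 1;
    }
    return n->bucket;
}
```

Each step of the walk follows a pointer, which is why a linked directory of this kind may cost more than one page fault per search.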
[Figure: growth of the index in dynamic hashing. (a) Original address space. (b) Node 4 has split into 40 and 41. (c) Further splits: node 2 into 20 and 21, and node 41 into 410 and 411.]
Dynamic Hashing vs. Extendible Hashing(1)
Overflow handling
  Both schemes extend the hash function locally, as a binary search trie
Both schemes use a directory structure
  Dynamic hashing: a linked structure
  Extendible hashing: a perfect tree, expressible as an array
Space utilization
  The same for both schemes (space utilization: 69%)
Dynamic Hashing and Extendible Hashing(2)
Growth of directory
  Dynamic hashing: slower, more gradual growth
  Extendible hashing: extends the directory by doubling it
Actual size of an index node
  A node in dynamic hashing is larger than a directory cell in extendible hashing (because of pointers)
Page fault
  Dynamic hashing: more than one page fault (with a linked structure for the directory)
  Extendible hashing: a single page fault

Alternative Approaches(3): Linear Hashing
Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory.
The actual address space is extended one bucket at a time as buckets overflow
Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands
No directory: avoids the additional seek resulting from an additional layer
Use more bits of the hashed value
  h_d(k): depth d hashing function (using function make_address)
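Address computation in linear hashing can be sketched as follows: a key uses h_d(k) unless that bucket has already been split in the current round, in which case it uses h_(d+1)(k). As a simplifying assumption, h_d here takes the low-order d bits of the hash value, which makes bucket 00 split into 000 and 100; the slides build h_d from make_address instead. `Hd` and `LinearAddress` are hypothetical names, and n counts the buckets already split in the current round.

```cpp
#include <cassert>

// h_d: keep the low-order d bits of the hash value (assumption for this sketch).
inline unsigned Hd(unsigned hash, int d) {
    return hash & ((1u << d) - 1);
}

// n is the number of buckets already split in this round, so buckets
// 0 .. n-1 are addressed with d + 1 bits and the rest with d bits.
inline unsigned LinearAddress(unsigned hash, int d, unsigned n) {
    unsigned a = Hd(hash, d);
    return a < n ? Hd(hash, d + 1) : a;
}
```

With d = 2 and one bucket split (n = 1), a hash ending in 100 goes to the new bucket 100, a hash ending in 000 stays in bucket 000, and a hash ending in 01 is still addressed with two bits.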
The growth of address space in linear hashing
[Figure, five panels. a, b, c, d are the original buckets, A, B, C, D are the buckets added as the address space grows, and w, x, y label overflow buckets.
(a) Buckets a (00), b (01), c (10), d (11).
(b) a (000), b (01), c (10), d (11), and the new bucket A (100).
(c) Bucket B (101) is added.
(d) Bucket C (110) is added.
(e) Bucket D (111) is added, giving eight buckets.]
Alternative Approaches(4): Approaches to Controlling Splitting
Postpone splitting: increase space utilization
  B-tree: redistribution rather than splitting
  Hashing: placing records in chains of overflow buckets to postpone splitting
Triggering event for splitting
  Linear hashing: a split every time any bucket overflows; the bucket that is split is not necessarily the overflowing bucket
  Litwin (1980): use the overall load factor of the file
  Below 2 seeks, 75% ~ 80% storage utilization
Alternative Approaches(5) :Approaches to Controlling Splitting
Postpone splitting for extendible hashing
  Use chained overflow buckets
  Avoid doubling the directory space
  1.1 seeks, 76% ~ 81% storage utilization
Let's Review!
12.1 Introduction
12.2 How extendible hashing works
12.3 Implementation
12.4 Deletion
12.5 Extendible hashing performance
12.6 Alternative approaches