ECE750 Lecture 5: Hash functions, Hash
tables, Bloom filters, Hash grouping,
Discrepancy, Probabilistic counting
Todd Veldhuizen
tveldhui@acm.org
Electrical & Computer Engineering
University of Waterloo
Canada
Oct. 12, 2007
Part I
Hashing
Hash Tables
$\frac{\sqrt{5}-1}{2}$.
Footnote 2: A particularly terrible choice would be m = 256, which (for a hash of the form k mod m) would hash
objects based only on their lowest 8 bits; e.g., the hash of a string
would depend only on its last character.
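The effect of m = 256 can be checked with a small sketch. It assumes a division-style hash k mod m and assumes each string is turned into a key by reading its bytes as one big-endian integer; the helper names `key_from_string` and `division_hash` are illustrative and not from the lecture.

```python
def key_from_string(s: str) -> int:
    """Read the bytes of s as one big-endian integer (an assumed encoding)."""
    return int.from_bytes(s.encode("utf-8"), "big")


def division_hash(k: int, m: int) -> int:
    """Division-style hash: the remainder of k modulo the table size m."""
    return k % m


words = ["hash", "bush", "wish", "mesh"]

# With m = 256 only the lowest 8 bits (the last character) matter.
slots_256 = [division_hash(key_from_string(w), 256) for w in words]

# With a prime table size such as 251, the earlier characters contribute too.
slots_251 = [division_hash(key_from_string(w), 251) for w in words]

print(slots_256)
print(slots_251)
```

All four words end in "h", so with m = 256 they land in the same slot, while m = 251 separates at least some of them.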
Multiplication hash functions: Example
Example of multiplication hash function using multiplier $\alpha = \frac{\sqrt{5}-1}{2} \approx 0.618034$, and hash
table with m = 100 slots:

key $k$    $\{k\alpha\}$    $\lfloor m\{k\alpha\} \rfloor$
 1         0.618034         61
 2         0.236068         23
 3         0.854102         85
 4         0.472136         47
 5         0.090170          9
 6         0.708204         70
 7         0.326238         32
 8         0.944272         94
 9         0.562306         56
10         0.180340         18
11         0.798374         79
12         0.416408         41
13         0.034442          3
14         0.652476         65
15         0.270510         27
16         0.888544         88
17         0.506578         50

The idea is that the third column (the hash slots) looks like a random
sequence.
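The example table can be reproduced with a few lines of code. This is a minimal sketch, assuming the hash is the floor of m times the fractional part of k times the multiplier (sqrt(5) - 1)/2; the names `ALPHA`, `frac` and `mult_hash` are chosen here for illustration.

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2  # multiplier from the example, about 0.618034


def frac(x: float) -> float:
    """Fractional part {x} of a non-negative real x."""
    return x - math.floor(x)


def mult_hash(k: int, m: int) -> int:
    """Multiplication-style hash: floor(m * {k * ALPHA})."""
    return math.floor(m * frac(k * ALPHA))


# Hash slots for keys 1..17 in a table with m = 100 slots.
slots = [mult_hash(k, 100) for k in range(1, 18)]
print(slots)
```

Consecutive keys land far apart in the table, which is the behaviour the example is meant to show.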
Multiplication hash functions I
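For irrational $\alpha$ and suitable $f$ (footnote 3):
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} f(\{k\alpha\}) = \int_0^1 f(x)\,dx$$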
Just like a uniform distribution on (0, 1)!
See [3]. Variously called Weyl's ergodic principle, Weyl's
equidistribution theorem.
However, the sequence of fractional parts $\{k\alpha\}$ is emphatically not a random sequence.
Footnote 3: Continuously differentiable and periodic with period 1.
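The equidistribution statement can be checked numerically. The sketch below assumes the multiplier (sqrt(5) - 1)/2 and picks, purely for illustration, the test function f(x) = cos^2(2 pi x), which is smooth, has period 1, and integrates to 1/2 over [0, 1].

```python
import math

ALPHA = (math.sqrt(5) - 1) / 2  # an irrational multiplier (assumed for illustration)


def f(x: float) -> float:
    """A smooth test function with period 1; its integral over [0, 1] is 1/2."""
    return math.cos(2 * math.pi * x) ** 2


def weyl_average(n: int) -> float:
    """Average of f({k * ALPHA}) over k = 1..n."""
    return sum(f((k * ALPHA) % 1.0) for k in range(1, n + 1)) / n


average = weyl_average(10_000)
print(average)  # should be close to the integral, 0.5
```

The average matching the integral says nothing about randomness: the sequence is fully deterministic.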
Hash Functions
$$D(P \,\|\, U) = \log_2 m + \sum_{i=1}^{m} p_i \log_2 p_i$$
Convention: $0 \log_2 0 = 0$.
The smaller $D(P \,\|\, U)$, the closer $P$ is to the uniform distribution $U$.
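As a small worked check of the divergence from uniform, the sketch below evaluates log2(m) plus the sum of p_i log2(p_i) for a list of slot probabilities. The function name `kl_to_uniform` and the two sample distributions are illustrative assumptions.

```python
import math


def kl_to_uniform(p: list[float]) -> float:
    """Divergence from the uniform distribution, in bits, for slot probabilities p."""
    m = len(p)
    # Terms with p_i == 0 are skipped, matching the convention 0 * log2(0) = 0.
    return math.log2(m) + sum(pi * math.log2(pi) for pi in p if pi > 0)


uniform = [1 / 4] * 4            # every slot equally likely
skewed = [1.0, 0.0, 0.0, 0.0]    # every key lands in one slot

print(kl_to_uniform(uniform), kl_to_uniform(skewed))
```

A perfectly uniform hash gives 0 bits, and the worst case over m slots gives log2(m) bits.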
Two cases: If $n^2 \ll m$ then $\ln p \to 0$. If $n^2 \gg m$ then
$\ln p \to -\infty$.
Collision Strategy 1: Pick m big IV
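Recall that if
$$\ln p = x + \epsilon$$
then
$$p = e^{x+\epsilon} = e^{x} e^{\epsilon} = e^{x} \underbrace{\left(1 + \epsilon + \tfrac{\epsilon^2}{2} + \cdots\right)}_{\text{Taylor series}} = e^{x}\,(1 + O(\epsilon)) \quad \text{if } \epsilon \in o(1)$$
Applied to
$$\ln p = -\frac{n(n-1)}{2m} + O\!\left(\frac{n^3}{m^2}\right)$$
this gives
$$p = e^{-\frac{n(n-1)}{2m}} \left(1 + O\!\left(\frac{n^3}{m^2}\right)\right)$$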
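To see how tight the leading-order approximation is, the sketch below compares exp(-n(n-1)/(2m)) with the exact probability that n keys, hashed independently and uniformly into m slots, all land in distinct slots. The uniform-hashing assumption and the function names are ours; n = 23 and m = 365 is the familiar birthday setting.

```python
import math


def exact_no_collision(n: int, m: int) -> float:
    """Probability that n keys hashed uniformly into m slots are all distinct."""
    p = 1.0
    for i in range(n):
        p *= 1 - i / m
    return p


def approx_no_collision(n: int, m: int) -> float:
    """Leading-order approximation exp(-n(n-1)/(2m))."""
    return math.exp(-n * (n - 1) / (2 * m))


exact = exact_no_collision(23, 365)
approx = approx_no_collision(23, 365)
print(exact, approx)
```

The two values differ only in the third decimal place here, consistent with a relative error of order n^3/m^2.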
Interpretation:
Collision Strategy 1: Pick m big V
If $m \in \omega(n^2)$ there are no collisions (almost surely).
If $m \in o(n^2)$ there is a collision (almost surely).
$m = \frac{1}{2} n^2$ is an example of a threshold function:
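[Figure: threshold behaviour plotted against hash table size (m); the horizontal axis is marked at $n$, $n^2$ and $n^3$, and the vertical axis runs from 0 to 1.]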
Collision Strategy 1: pick m big
$$\frac{n}{2m(m-1)}\,(1 + o(1))$$
Geometric distribution:
Mean: $p^{-1}$
Collision Strategy 2: pick h carefully II
Let $\alpha = \frac{n}{m}$ be the load factor: the average number of
items per hash table slot.
Let $N_i$ be a random variable indicating the number of
items landing in slot $i$.
$$E[N_i] = \alpha$$
$$\mathrm{Var}[N_i] = n \cdot \underbrace{\frac{1}{m}\left(1 - \frac{1}{m}\right)}_{\text{Bernoulli variance}}$$
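$$E[N_i^2] = \mathrm{Var}[N_i] + E[N_i]^2 = n \cdot \frac{1}{m}\left(1 - \frac{1}{m}\right) + \frac{n^2}{m^2}$$
$$\sum_{i=1}^{m} E[N_i^2] = \frac{n^2}{m} + n - \frac{n}{m}$$
Plus space $\Theta(m)$ for the primary hash table $= \Theta\!\left(m + \frac{n^2}{m} + n\right)$. Choosing $m = \Theta(n)$ yields linear space.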
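The expected sum of squared slot occupancies can be checked by simulation. This is a sketch under the assumption that each of the n items lands in one of the m slots independently and uniformly at random; the function names, the seed and the trial count are arbitrary choices made here.

```python
import random


def sum_squared_loads(n: int, m: int, rng: random.Random) -> int:
    """Throw n items into m slots uniformly at random; return the sum of N_i^2."""
    loads = [0] * m
    for _ in range(n):
        loads[rng.randrange(m)] += 1
    return sum(c * c for c in loads)


def expected_sum_squared(n: int, m: int) -> float:
    """Closed form n^2/m + n - n/m for the expected sum of N_i^2."""
    return n * n / m + n - n / m


rng = random.Random(750)  # fixed seed so the run is repeatable
n, m, trials = 1000, 1000, 200
simulated = sum(sum_squared_loads(n, m, rng) for _ in range(trials)) / trials
predicted = expected_sum_squared(n, m)
print(simulated, predicted)
```

With m equal to n the predicted total is just under 2n, which is the linear-space case.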
Footnote 6: The max(·) deals with the possibility that $\alpha < 1$, in which case
$\log \alpha < 0$.
Collision Strategy 4: open addressing I
n) time
[7].
cryptography
combinatorics
data mining
computational geometry
databases
$N_{\max} \ge n$: prior estimate on $n$
Elements of A are all the same: will get one hash table
slot with all the elements!
Uses $\Theta(N_{\max})$ bits, e.g., on the order of card(A) bits.
Uses $\Theta(m \log N_{\max})$ storage, and delivers accuracy of
$O(m^{-1/2})$
Probabilistic Counting
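[Figure: chain of counter states; successive states advance with probability 1/4, 1/8, 1/16 and stay put with probability 3/4, 7/8, 15/16.]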
Need log N states, which can be encoded in log log N
bits.
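A counter with these transition probabilities can be sketched as follows. This assumes the diagram depicts a Morris-style approximate counter, where state c advances with probability 2^-c and the count is estimated as 2^c - 1. The function names, seed and number of repetitions are illustrative choices.

```python
import random


def morris_increment(state: int, rng: random.Random) -> int:
    """Advance the stored exponent with probability 2**-state, else leave it."""
    if rng.random() < 2.0 ** -state:
        return state + 1
    return state


def morris_estimate(state: int) -> int:
    """Estimate of the number of events seen so far: 2**state - 1."""
    return 2 ** state - 1


def count_events(n: int, rng: random.Random) -> int:
    """Feed n events to a fresh counter and return its final state."""
    state = 0
    for _ in range(n):
        state = morris_increment(state, rng)
    return state


rng = random.Random(5)  # fixed seed so the run is repeatable
states = [count_events(1000, rng) for _ in range(400)]
mean_estimate = sum(morris_estimate(s) for s in states) / len(states)
print(max(states), mean_estimate)
```

Each counter only stores a small exponent, yet the estimates average out near the true count of 1000. A single counter is noisy, which is why the sketch averages many independent ones.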
Hash grouping & discrepancy I
Let $V = \{v_1, \ldots, v_n\}$ be a set of elements (e.g., people).
Let $S = \{S_1, \ldots, S_m\}$ be a collection of subsets of $V$
(e.g., live in Ontario, work in IT, etc.).
$$\sum_{v_j \in S_i} \chi(v_j)$$
Hash grouping & discrepancy III
Examples:
Suppose:
Observation:
Yes: have two iterators, one for each tree. Initially the
two iterators point at the smallest key in their tree.
Easy: a min-heap.
= 1 +c log n.
)).
E.g., if we choose the parameter to be $\frac{1}{2}$ we are twice as fast as a BST
for searching, and require O(
(1 + log n)
(Hash)   0      1      1000000000   100%
         1/8    4.7     355237568   35%
         1/4   16.0      47654705   4.7%
         1/2   31.9        504341   0.05%
         3/4   45.8          4165   0.0004%
         7/8   53.3           362   0.00004%
(BST)    1     60.8            31   0.000003%