ECE750 Lecture 5: Hash functions, Hash tables, Bloom filters,
Hash grouping, Discrepancy, Probabilistic counting

Todd Veldhuizen
tveldhui@acm.org

Electrical & Computer Engineering
University of Waterloo
Canada

Oct. 12, 2007
Part I
Hashing
Hash Tables

- Suppose we wanted to represent the following set:

      M = {35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657}

  Given some x ∈ N, we want to quickly test whether x ∈ M.

- Binary search trees require following a path through a tree, which is perhaps not fast enough for our problem.

- Super fast way: allocate an array A of 4657 bytes. Set A[i] = 1 if i ∈ M, and A[i] = 0 if i ∉ M.

  Then, on a RAM, we can test whether x ∈ M with a single memory access to A[x] (a constant amount of time).

- However, the space required by this strategy is O(sup M).
Hash Tables

- Obviously the array A would contain mostly empty space. Can we somehow compress the array but still support fast access?

- Yes: allocate a much smaller table B of length k. Define a function h : [1, 4657] → [1, k] that maps indices of A to indices of B, can be computed quickly, and ensures that if x, y ∈ M and x ≠ y, then h(x) ≠ h(y), i.e., no two elements of M have the same index in B.

- Then, x ∈ M if and only if B[h(x)] = x.


Hash Tables

- For our example, h(x) = x mod 17 does the trick. Here is the array B:

      j   B[j]      j    B[j]      j    B[j]
      0      0      6       0      12      0
      1     35      7       0      13      0
      2      0      8    1691      14      0
      3    139      9    1760      15   3789
      4    395      10   1795      16   4657
      5      0      11   3632

- e.g., x = 1691: h(x) = 8, and B[8] = 1691, so x ∈ M.

- e.g., x = 1692: h(x) = 9, and B[9] = 1760 ≠ 1692, so x ∉ M.

- This is a hash table. h(x) = x mod 17 is called a hash function.
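As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this table; the names B, h, and member are chosen only for this example:

```python
# Sketch of the example above: a collision-free hash table for M with h(x) = x mod 17.
M = [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]

m = 17                       # table size, chosen so that h is injective on M
B = [0] * m                  # 0 marks an empty slot (no key in M is 0)

def h(x):
    return x % m             # the hash function h(x) = x mod 17

for key in M:
    assert B[h(key)] == 0    # no collisions happen for this particular M
    B[h(key)] = key

def member(x):
    return B[h(x)] == x      # x is in M iff its slot holds x itself

print(member(1691))   # True:  B[8] == 1691
print(member(1692))   # False: B[9] == 1760 != 1692
```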
Hash Functions

- A hash function is a map h : K → H from some (usually large) key space K to some (usually small) set of hash values H. In our example, we were mapping from K = [0, 4657] to H = [0, 16].

- If the set M ⊆ K is chosen uniformly at random, keys are uniformly distributed (i.e., each k ∈ K has the same probability of appearing in a set to represent). In this case the hash function should distribute the keys evenly amongst the elements of H, i.e., we want that |h⁻¹(y)| ≈ |h⁻¹(z)| for y, z ∈ H.¹

- For a nonuniform distribution on keys, one just wants to choose h so that the distribution induced on H is close to uniform.

¹ Recall that for a function f : R → S, f⁻¹(s) ≝ {r : f(r) = s}.
Hash Functions

- We will describe some hash functions where K = N (keys are nonnegative integers). These are easily adapted to other kinds of keys (e.g., strings) by interpreting the binary representation of the key as an integer.

  Some commonly used hash functions are the following:

  1. Division: use h(k) = k mod m, where m = |H| is usually chosen to be a prime number far away from any power of 2.²
     For long bit strings, use Horner's rule for evaluating polynomials in Z/mZ (will explain).

  2. Multiplication: use h(k) = ⌊m{kα}⌋, where 0 < α < 1 is an irrational number and {x} ≝ x − ⌊x⌋. A popular choice of α is α = (√5 − 1)/2.

² A particularly terrible choice would be m = 256, which would hash objects based only on their lowest 8 bits; e.g., the hash of a string would depend only on its last character.
Multiplication hash functions: Example
Example of a multiplication hash function using α = (√5 − 1)/2, and a hash table with m = 100 slots:

      key k    {kα}        ⌊m{kα}⌋
      1        0.618034    61
      2        0.236068    23
      3        0.854102    85
      4        0.472136    47
      5        0.090170     9
      6        0.708204    70
      7        0.326238    32
      8        0.944272    94
      9        0.562306    56
      10       0.180340    18
      11       0.798374    79
      12       0.416408    41
      13       0.034442     3
      14       0.652476    65
      15       0.270510    27
      16       0.888544    88
      17       0.506578    50

The idea is that the third column (the hash slots) looks like a random sequence.
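A small Python sketch of the multiplication method, assuming α = (√5 − 1)/2 and m = 100 as in the table above; the function name mult_hash is invented for this example:

```python
import math

# Multiplication method: h(k) = floor(m * frac(k * alpha)).
alpha = (math.sqrt(5) - 1) / 2
m = 100

def mult_hash(k):
    frac = (k * alpha) % 1.0        # fractional part {k * alpha}
    return int(m * frac)            # slot number in 0..m-1

for k in range(1, 6):
    print(k, round((k * alpha) % 1.0, 6), mult_hash(k))
# Matches the table: 1 -> 61, 2 -> 23, 3 -> 85, 4 -> 47, 5 -> 9
```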
Multiplication hash functions I

- The reason why h(k) = ⌊m{kα}⌋ is a reasonable hash function is interesting.

- The short answer is that the sequence {kα} for k = 1, 2, 3, ... kind of behaves like a sequence of random reals drawn from (0, 1). So h(k) = ⌊m{kα}⌋ looks like a randomly chosen hash function.

  A less sketchy explanation:

  1. {kα} is uniformly distributed on (0, 1): asymptotically, the proportion of {kα} falling in an interval (a, b) ⊆ (0, 1) is b − a. (Just like a uniform distribution on (0, 1).)
Multiplication hash functions II
  2. {kα} satisfies an ergodic theorem: if we sample a suitably well-behaved³ function f at the points {kα} and average, this converges to the integral:

         (1/m) Σ_{k=1}^{m} f({kα}) → ∫₀¹ f(x) dx

     Just like a uniform distribution on (0, 1)!

     See [3]. Variously called Weyl's ergodic principle or Weyl's equidistribution theorem.

- However, {kα} is emphatically not a random sequence.

³ Continuously differentiable and periodic with period 1.
Hash Functions

- To evaluate whether a hash function is a good choice for a set of data S ⊆ K, one can see how the observed distribution of keys into hash table slots compares to a uniform distribution.

- Suppose there are n keys and m hash slots. Compute the observed distribution of the keys:

      p̂ᵢ = |{k : h(k) = i}| / n

- To measure how far from uniform, compute

      D(P̂ || U) = log₂ m + Σ_{i=1}^{m} p̂ᵢ log₂ p̂ᵢ

  Convention: 0 log₂ 0 = 0.

- This is the Kullback-Leibler divergence of the observed distribution P̂ from the uniform distribution U. It may be thought of as the "distance" from P̂ to U.

- The smaller D(P̂ || U), the better the hash function.
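One possible way to compute this diagnostic is sketched below in Python; the function kl_from_uniform and the toy key sets are assumptions made for this example, not part of the lecture:

```python
import math
from collections import Counter

def kl_from_uniform(keys, h, m):
    """D(P_hat || U) in bits for the slot distribution of `keys` under hash
    function `h` reduced into `m` slots (0*log2(0) is treated as 0)."""
    n = len(keys)
    counts = Counter(h(k) % m for k in keys)
    d = math.log2(m)
    for c in counts.values():
        p = c / n
        d += p * math.log2(p)        # empty slots contribute nothing
    return d

# Example: 10,000 even keys hashed by k mod m for two table sizes.
keys = [2 * i for i in range(10_000)]
print(kl_from_uniform(keys, lambda k: k, 64))  # ~1.0 bit: only the 32 even slots are used
print(kl_from_uniform(keys, lambda k: k, 61))  # ~0.0 bits: a prime m spreads these keys evenly
```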


Horner's Rule I

- Horner's rule gives an efficient method for evaluating hash functions for sequences, e.g., strings.

- Consider a hash function of the form h(k) = k mod m.

- If we wish to hash a string such as "hello", we can interpret it as a long binary number: in ASCII, "hello" is

      01101000 01100101 01101100 01101100 01101111
         h        e        l        l        o

- As a sequence of integers, "hello" is [104, 101, 108, 108, 111]. We want to compute

      (104·2³² + 101·2²⁴ + 108·2¹⁶ + 108·2⁸ + 111·2⁰) mod m
Horner's Rule II

- Horner's rule is a general trick for evaluating a polynomial. We write

      ax³ + bx² + cx + d = (ax² + bx + c)x + d
                         = ((ax + b)x + c)x + d

  so that instead of computing x³, x², ... we have only multiplications:

      t₁ = ax + b
      t₂ = t₁x + c
      t₃ = t₂x + d

- Trivia: some early CPUs included an instruction opcode for applying Horner's rule. May be making a comeback!
Horner's Rule III

- To use Horner's rule for hashing: to compute (a·2²⁴ + b·2¹⁶ + c·2⁸ + d) mod m,

      t₁ = (a·2⁸ + b) mod m
      t₂ = (t₁·2⁸ + c) mod m
      t₃ = (t₂·2⁸ + d) mod m

  Note that multiplying by 2ᵏ is simply a shift by k bits.

- Why this works. In short, algebra. The integers Z form a ring under multiplication and addition. The hash function h(k) = k mod m can be interpreted as a homomorphism from the ring Z of integers to the ring Z/mZ of integers modulo m. Homomorphisms preserve structure in the following sense: if we write + for integer addition, and ⊕ for addition modulo m,

      h(a + b) = h(a) ⊕ h(b)

  i.e., it doesn't matter whether we compute (a + b) mod m, or compute (a mod m) and (b mod m) and add with modular
Horner's Rule IV

  arithmetic: we get the same answer either way. Similarly, if we write · for multiplication in Z, and ⊗ for multiplication in Z/mZ,

      h(a · b) = h(a) ⊗ h(b)

  Horner's rule works precisely because h : Z → Z/mZ is a homomorphism:

      h(((a·2⁸ + b)·2⁸ + c)·2⁸ + d)
        = (((h(a) ⊗ h(2⁸) ⊕ h(b)) ⊗ h(2⁸) ⊕ h(c)) ⊗ h(2⁸) ⊕ h(d))

  This can be optimized to use fewer applications of h, as above.

  In this form it is obvious why m = 2⁸ is a horrible choice for a hash table size: 2⁸ mod 2⁸ = 0, so

      (((h(a) ⊗ h(2⁸) ⊕ h(b)) ⊗ h(2⁸) ⊕ h(c)) ⊗ h(2⁸) ⊕ h(d))
        = (((h(a) ⊗ 0 ⊕ h(b)) ⊗ 0 ⊕ h(c)) ⊗ 0 ⊕ h(d))
        = h(d)
Horner's Rule V

  i.e., the hash value depends only on the last byte. Similarly, if we used m = 2¹⁶, we would have h(2¹⁶) = 0, which would remove all but the last two bytes from the hash value computation.

  For background on algebra see, e.g., [2, 11, 9].
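A hedged Python sketch of Horner-rule hashing for strings; the function name horner_hash and the choice m = 1009 are illustrative, not from the slides:

```python
def horner_hash(s, m):
    """Hash an ASCII string by Horner's rule, reducing mod m at every step so
    intermediate values stay small; this equals interpreting s as one big
    base-256 number and taking it mod m."""
    h = 0
    for byte in s.encode("ascii"):
        h = (h * 256 + byte) % m    # multiply by 2^8 (a shift) and add the next byte, mod m
    return h

m = 1009                            # a prime far from a power of 2
print(horner_hash("hello", m))
# With m = 256 only the last byte survives: the result is just ord('o') = 111.
print(horner_hash("hello", 256), ord("o"))
```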
Collisions

- A collision occurs when two keys map to the same location in the hash table, i.e., there are distinct x, y ∈ M such that h(x) = h(y).

- Strategies for handling collisions:

  1. Pick a value of m large enough so that collisions are rare and can be easily dealt with, e.g., by maintaining a short overflow list of items whose hash slot is already occupied.
  2. Pick the hash function h to avoid collisions.
  3. Put a secondary data structure in each hash table slot (a list, tree, or another hash table).
  4. If a hash slot is full then try some other slots in some fixed sequence (open addressing).
Collision Strategy 1: Pick m big I

- Let's see how big m must be for the probability of collisions to be small.

- Two cases:

  - n > m: then there must be a collision, by the pigeonhole principle.⁴
  - n ≤ m: there may or may not be a collision.

- The birthday problem: what is the probability that amongst n people, at least two share the same birthday?

- This is a hashing problem: people are keys, days of the year are slots, and h maps people to their birthdays.

- If n ≥ 23, then the probability of two people having the same birthday is > 1/2. (Counterintuitive, but true.)

- The birthday problem analysis is straightforward to adapt to hashing.
Collision Strategy 1: Pick m big II

- Suppose the hash function h and the distribution of keys cooperate to produce a uniform distribution of keys into hash table slots.

- Recall that with a uniform distribution, probability may be computed by simple counting:

      Pr(event E happens) = (# outcomes in which E happens) / (# outcomes)

- First we count the number of hash functions without collisions:

  - There are m choices of where to put the first key; m − 1 choices of where to put the second key; ...; m − n + 1 choices of where to put the n-th key.
  - The number of hash functions with no collisions is the falling power m·(m − 1)···(m − n + 1) = m!/(m − n)!. (Note⁵.)

- Next we count the number of hash functions allowing collisions:
Collision Strategy 1: Pick m big III

- There are m choices of where to put the first key; m choices of where to put the second key; ...; m choices of where to put the n-th key.

- The number of hash functions allowing collisions is mⁿ.

- The probability of a collision-free arrangement is

      p = m! / ((m − n)! · mⁿ)

- Asymptotic estimate of ln p, assuming m ≫ n:

      ln p ≈ −n²/(2m) + n/(2m) + O(n³/m²)          (1)

  Here we have used Stirling's approximation and

      ln(m − n) = ln m − n/m − O(n²/m²)

  Two cases: if n² ≪ m then ln p → 0. If n² ≫ m then ln p → −∞.
Collision Strategy 1: Pick m big IV

- Recall that if

      ln p = x + ε

  then

      p = e^(x+ε) = eˣ · e^ε = eˣ (1 + ε + ε²/2! + ···)     (Taylor series)
        = eˣ (1 + O(ε))   if ε ∈ o(1)

- The probability of a collision-free arrangement is

      p ≈ e^(−n(n−1)/(2m)) + O( (n³/m²) · e^(−n(n−1)/(2m)) )

- Interpretation:
Collision Strategy 1: Pick m big V

- If m ∈ ω(n²) there are no collisions (almost surely).

- If m ∈ o(n²) there is a collision (almost surely).

- i.e., if we want a low probability of collisions, our hash table has to be quadratic (or more) in the number of items.

⁴ If m + 1 pigeons are placed in m pigeonholes, there must be two pigeons in the same hole. (Replace pigeons with keys, and pigeonholes with hash slots.)

⁵ The handy notation m^n̲ (with an underlined exponent) for m(m − 1)···(m − n + 1) is called a falling power [10].
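As a quick numerical check of this analysis, here is a small Python sketch (the function names are illustrative, not from the slides) comparing the exact collision probability with the e^(−n(n−1)/(2m)) estimate:

```python
import math

def p_no_collision_exact(n, m):
    """Exact probability that n uniformly hashed keys land in m slots with no
    collision: m!/((m-n)! m^n), computed as a running product."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

def p_no_collision_approx(n, m):
    """Leading-order estimate exp(-n(n-1)/(2m)) from the slide."""
    return math.exp(-n * (n - 1) / (2 * m))

# Birthday problem: 23 people, 365 "slots".
print(1 - p_no_collision_exact(23, 365))            # ~0.507: a shared birthday is likely
# Hash table sizing: n = 1000 keys, m = 5,000,000 slots.
print(1 - p_no_collision_exact(1000, 5_000_000))    # ~0.095 collision probability
print(1 - p_no_collision_approx(1000, 5_000_000))   # ~0.095 from the asymptotic estimate
```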
Threshold functions

- m = ½n² is an example of a threshold function:

  - ≪ the threshold, the asymptotic probability of the event is 0;
  - ≫ the threshold, the asymptotic probability of the event is 1.

  [Figure: probability of no collision (from 0 to 1) plotted against the hash table size m, on a scale showing n, n², n^(2+ε), n³; the curve rises sharply from 0 to 1 around the threshold m ≈ n².]
Collision Strategy 1: pick m big

- Picking m big is not an effective strategy for handling collisions.

- For n = 1000 elements, this table shows how big m must be so that the probability p of a collision is as small as desired:

      p       m
      0.1     5000000
      0.01    50000000
      10⁻⁶    500000000000
      10⁻⁹    500000000000000
Collision Strategy 1: pick m big

- The analysis of collisions in hashing demonstrates two pigeonhole principles.

- The simplest pigeonhole principle states that if you put m + 1 pigeons in m holes, there must be one hole with ≥ 2 pigeons.

- With respect to hash tables, the pigeonhole applies as follows: if a hash table with m slots is used to store ≥ m + 1 elements, there is a collision.

- The probability-of-collision analysis of the previous slides demonstrates a probabilistic pigeonhole principle: if you put ω(√n) pigeons in n holes, there is a hole with ≥ 2 pigeons almost surely (i.e., with probability converging to 1 as n → ∞).
Collision Strategy 2: pick h carefully I

- Can we pick our hash function h to avoid collisions?

- For example, if we use hash functions of the form h(k) = ⌊m{kα}⌋, we could try random values of α ∈ (0, 1) until we found one that was collision-free.

- We have a probability of success

      p ≈ e^(−n(n−1)/(2m)) (1 + o(1))

- Geometric distribution:

  - Probability of success p, probability of failure 1 − p.
  - Each trial is independent, identically distributed.
  - Probability that k tries are needed for success = (1 − p)^(k−1) p.
  - Mean: p⁻¹.
Collision Strategy 2: pick h carefully II

- Number of values of α we expect to try before we find a collision-free hash table for n = 1000:

      m        Expected failures before success
      1000     10²¹⁷
      2000     10¹⁰⁹
      10000    10²²
      100000   147

- Picking hash functions randomly in this manner is unlikely to be practical.

- There are better strategies: see [8, 4].


Collision Strategy 3: secondary data structures I

- By far the most common technique for handling collisions is to put a secondary data structure in each hash table slot (see the sketch after this list):

  - a linked list (chaining),
  - a binary search tree (BST),
  - another hash table.

- Let α = n/m be the load factor: the average number of items per hash table slot.

- Assuming a uniform distribution of keys into slots:

  - Linked lists require 1 + α steps (on average) to find a key.
  - Suitable BSTs require 1 + max(c log α, 0) steps (on average).⁶
  - Using secondary hash tables of size quadratic in the number of elements in the slot, one can achieve O(1) lookups on average, and require only Θ(n) space.
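A minimal Python sketch of chaining, with plain lists standing in for the per-slot secondary structure; the class name and table size are assumptions made for this example:

```python
class ChainedHashTable:
    """Collision strategy 3: one small chain (here a Python list) per slot."""

    def __init__(self, m=101):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._h(key)]
        if key not in chain:                  # expected O(1 + alpha) scan of the chain
            chain.append(key)

    def contains(self, key):
        return key in self.slots[self._h(key)]

t = ChainedHashTable()
for word in ["to", "be", "or", "not", "to", "be"]:
    t.insert(word)
print(t.contains("not"), t.contains("nothing"))   # True False
```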
Collision Strategy 3: secondary data structures II

- Analysis of secondary hash tables:

  - Let Nᵢ be a random variable indicating the number of items landing in slot i.
  - E[Nᵢ] = α
  - Var[Nᵢ] = n · (1/m)(1 − 1/m)   (n times a Bernoulli variance)

- The space required for the secondary hash tables is proportional to

      E[ Σ_{1≤i≤m} Nᵢ² ] = Σ_{1≤i≤m} E[Nᵢ²]
                         = Σ_{1≤i≤m} ( Var[Nᵢ] + α² )
                         = m ( n·(1/m)(1 − 1/m) + n²/m² )
                         ≈ n²/m + n − n/m

  Plus space Θ(m) for the primary hash table, giving Θ(m + n²/m + n). Choosing m = Θ(n) yields linear space.

⁶ The max(·) deals with the possibility that α < 1, in which case log α < 0.
Collision Strategy 4: open addressing I

- Open addressing is a family of techniques for resolving collisions that do not require secondary data structures. This has the advantage of not requiring any dynamic memory allocation.

- In the simplest scenario we have a function s : H → H that is ideally a permutation of the hash values, for example the linear probing function

      s(x) = (x + 1) mod m

- When we attempt to insert a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until an empty slot is found.

- To find a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until either k or an empty slot is found.
Collision Strategy 4: open addressing II

- However, the use of permutations performs badly as the hash table becomes fuller: we tend to get "clumps" or clusters, i.e., long sequences h(k), s(h(k)), s(s(h(k))), ... where all the slots are occupied (see e.g. [12]).

- Performance can be good for not-very-full tables, e.g., α < 2/3. As α → 1, operations begin to take Θ(√n) time [7].

- Quadratic probing offers less clumping: try slots h₀(k), h₁(k), ..., where

      hᵢ(k) = (h(k) + i²) mod m

  and h(k) is an initial fixed hash function. If m is prime, the first ⌈m/2⌉ probe positions are distinct, so an empty slot is found whenever the table is at most half full.
Collision Strategy 4: open addressing III

- Double hashing uses two hash functions, h₁ and h₂:

      hᵢ(k) = (h₁(k) + i·h₂(k)) mod m

  h₁(k) gives an initial slot to try; h₂(k) gives a stride (this reduces to linear probing when h₂(k) = 1).

- Under favourable conditions, an open addressing scheme behaves like a geometric distribution when searching for an open slot: the probability of finding an empty slot is 1 − α, so the expected number of trials is 1/(1 − α). Note the catastrophe when α → 1. (A small sketch of open addressing follows.)
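A small Python sketch of open addressing with double hashing (linear probing is the special case h₂ ≡ 1); the class and the particular choices of h₁ and h₂ are illustrative assumptions:

```python
class OpenAddressingTable:
    """Probe slots h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... mod m until an empty
    slot (insert) or the key or an empty slot (search) is found."""

    def __init__(self, m=13):          # m prime, so any stride visits every slot
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        h1 = key % self.m
        h2 = 1 + (key % (self.m - 1))  # stride in 1..m-1, never 0
        for i in range(self.m):
            yield (h1 + i * h2) % self.m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return
        raise RuntimeError("table full")   # alpha = 1: open addressing breaks down

    def contains(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return False
            if self.slots[j] == key:
                return True
        return False

t = OpenAddressingTable()
for k in [35, 139, 395, 1691, 1760]:
    t.insert(k)
print(t.contains(1691), t.contains(1692))   # True False
```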
Summary of collision strategies
      Strategy                 E[access time]          Space
      Choose m big             O(1)                    Θ(n²)
      Linked list              1 + α                   O(n + m)
      Binary search tree       1 + max(c log α, 0)     O(n + m)
      Secondary hash tables    O(1)                    O(n)
      Open addressing          1/(1 − α)               O(m)

- Open addressing can be quite effective if α ≪ 1, but fails catastrophically as α → 1.
Summary of collision strategies

- If unexpectedly n ≫ m (e.g., we have far more data than we designed for), then α → ∞. For example, if m ∈ O(1) and n ∈ ω(1):

  - a linked list has O(n) accesses;
  - BSTs have O(log n) accesses, offering a gentler failure mode.

- If the hash function is badly nonuniform:

  - a linked list can be O(n);
  - a BST will have O(log n);
  - secondary hash tables may require O(n²) space.

- To summarize: hash table + BST will give fast search times, and let you sleep at night.

- To maintain O(1) access times as n → ∞, it is necessary to maintain m ∝ n. This can be done by choosing an allowable interval [c₁, c₂] for α; when α > c₂, resize the hash table to make α = c₁. So long as c₂ > c₁, this strategy adds O(1) amortized time per insertion, as in dynamic arrays.
Universal hashing I

- Suppose we are implementing a cellphone application that permits phones to store pieces of data on a server (e.g., address book entries).

- We decide for performance reasons to use a hash table with linked lists to resolve collisions.

- If a malicious hacker knows what hash function we have chosen, then they could launch a denial-of-service attack by programming cellphones to repeatedly store and retrieve addresses that hash to the same slot: since searching the linked list is an O(n) operation, the server might be made too busy to service other requests in reasonable time.

- A solution is to choose the hash function randomly, so that it would be difficult for the malicious hacker to find values that would hash to the same slot.
Universal hashing II

- This is called universal hashing.

- e.g., we could choose a random value α ∈ (0, 1) and use the hash function h(k) = ⌊m{kα}⌋.
Applications of hashing

- Hashing is a ubiquitous concept, used not just for maintaining collections but also for

  - cryptography
  - combinatorics
  - data mining
  - computational geometry
  - databases
  - router traffic analysis

- Some algorithms/data structures that use hashing:

  - Bloom filters [1]: approximate set membership with no false negatives; with 10 bits per element one can achieve a 1% false positive rate.
  - Interpolation search + hashing = O(log log n) expected search times, with no unused space.
  - An example: probabilistic counting.


Part II
Applications of hashing
Probabilistic Counting

- Problem: estimate the number of unique elements in a LARGE collection (e.g., a database, a data stream) without requiring much working space.

- Useful for query optimization in databases [13]:

  - e.g., to evaluate a join A ⋈ B ⋈ C we can do either A ⋈ (B ⋈ C) or (A ⋈ B) ⋈ C;
  - one of these might be very fast, one very slow;
  - having rough estimates of |B ⋈ C| vs. |A ⋈ B| lets us decide which strategy will be faster.
Probabilistic Counting

- A less serious (but more readily understood) example: Shakespeare's complete works:

  - N = 884,647 words (or so)
  - n = 28,239 unique words (or so)
  - w = average word length
  - N_max ≥ n: a prior upper bound on n

- Problem: estimate n, the number of unique words used. Approaches:

  1. Sorting: put all 884,647 words in a list and sort, then count. (Time O(Nw log N), space O(Nw).)
  2. Trie: scan through the words and build a trie, with counters at each node; requires O(nw) space (neglecting the size of counters).
  3. Super-LogLog probabilistic counting [5]: use 128 bytes of space, obtain an estimate of 30,897 words (error 9.4%).
Probabilistic Counting

- Inputs: a multiset A of elements, possibly with many duplicates (e.g., Shakespeare's plays).

- Problem: estimate card(A), the number of unique elements in A (e.g., the number of distinct words Shakespeare used).

- Simple starting idea: hash the objects into an m-element hash table. Instead of storing keys, just count the number of elements landing in each hash slot.

- Extreme cases to illustrate the principle:

  - Elements of A are all different: we will get an even distribution in the hash table.
  - Elements of A are all the same: we will get one hash table slot with all the elements!

- The shape of the hash table distribution reflects the frequency of duplicates.
Probabilistic Counting

- Linear Counting [13]:

  - Compute hash values in the range [0, N_max).
  - Maintain a bitmap representing which slots of the hash table would be occupied, and estimate n from the sparsity of the hash table.
  - Uses Θ(N_max) bits, e.g., on the order of card(A) bits.

- Room for improvement: the precise sparsity pattern doesn't matter, just the number of full vs. empty slots.
Probabilistic Counting

- Probabilistic Counting [6]:

  - Compute hash values in the range [0, N_max).
  - Instead of counting hash values directly, count the occurrence of hash values matching certain patterns:

        Pattern      Expected occurrences
        xxxxxxx1     2⁻¹ card(A)
        xxxxxx10     2⁻² card(A)
        xxxxx100     2⁻³ card(A)
        xxxx1000     2⁻⁴ card(A)
        ...          ...

  - Use these counts to estimate card(A).
  - To improve accuracy, use m different hash functions (a simplified sketch follows below).
  - Uses Θ(m log N_max) storage, and delivers accuracy of O(m^(−1/2)).
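A simplified Python sketch of this style of probabilistic counting: it tracks, for several salted hash functions, the deepest ...1000-type pattern seen. It is not the exact estimator of [6] or [5] (those apply a bias-correction constant), so treat the names and the constant-free estimate as illustrative assumptions:

```python
import random, statistics

def rho(x):
    """Number of trailing zero bits of x (the length of the ...1000...0 pattern)."""
    z = 0
    while x & 1 == 0 and z < 32:
        x >>= 1
        z += 1
    return z

def probabilistic_count(items, num_hashes=64, seed=0):
    """For each of num_hashes salted hash functions, remember the largest
    rho(h(x)) seen; 2^R tracks log2 of the number of distinct items, and
    averaging R over the hash functions reduces the variance."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    R = [0] * num_hashes
    for x in items:
        for j, salt in enumerate(salts):
            hv = hash((salt, x)) & 0xFFFFFFFF
            R[j] = max(R[j], rho(hv))
    return 2 ** statistics.mean(R)

stream = [i % 5000 for i in range(100_000)]   # 100,000 items, 5,000 distinct
print(probabilistic_count(stream))            # close to 5,000, up to a small constant factor
```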
Probabilistic Counting

- Super-LogLog [5] requires Θ(log log N_max) bits. With 1.28 kB of memory it can estimate card(A) to within an accuracy of 2.5% for N_max up to about 130 million.

- Probabilistic counters: count to N using about log log N bits. The counter is a Markov chain: from state k it advances to state k + 1 with probability 2⁻ᵏ (transition probabilities 1/2, 1/4, 1/8, 1/16, ...) and otherwise stays put (probabilities 1/2, 3/4, 7/8, 15/16, ...). We need about log N states, which can be encoded in log log N bits. (A sketch of such a counter follows below.)
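A hedged sketch of such a probabilistic (Morris-style) counter in Python; the class name and the exact estimate formula 2^k − 1 are assumptions of this illustration rather than a quotation of [5]:

```python
import random

class MorrisCounter:
    """Probabilistic counter: state k stands for a count of roughly 2^k - 1,
    so counting up to N needs only about log2(log2(N)) bits of state.
    On each increment, advance from state k with probability 2^-k."""

    def __init__(self, seed=0):
        self.k = 0
        self.rng = random.Random(seed)

    def increment(self):
        if self.rng.random() < 2.0 ** -self.k:
            self.k += 1

    def estimate(self):
        return 2 ** self.k - 1       # unbiased for this increment rule, but high variance

    # Usage: count 100,000 events with a handful of bits of state.
c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.k, c.estimate())             # k is typically around 16-17; the estimate is of the right order
```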
Hash grouping & discrepancy I

- Hashing is a useful way to divide objects into randomly chosen groups.

- e.g., for a distributed database, we want to distribute records evenly among the servers: use a hash table.

- If the hash function is randomly chosen (or otherwise suitable), then we get a random division of the objects into groups.

- Discrepancy theory [3] tells us that we will get a roughly even split of many different characteristics across these groups.

- e.g., if we use hashing to split a large database containing information about people, we might want to estimate quantities such as: how many people live in Ontario, work in the information technology industry, and subscribe to magazine X?
Hash grouping & discrepancy II

- We can estimate this reasonably well just by looking at the data in one group, provided the groups satisfy some conditions.

- Basic ideas of discrepancy:

  - Let V = {v₁, ..., vₙ} be a set of elements (e.g., people).
  - Let S = {S₁, ..., Sₘ} be a collection of subsets of V (e.g., "lives in Ontario", "works in IT", etc.).
  - (V, S) is called a set system.

- A colouring of V is a function χ : V → {−1, +1}, where χ(vᵢ) is the colour of vᵢ. (E.g., think of −1 and +1 as red and blue.)

- The discrepancy of a set Sᵢ is the imbalance between red and blue:

      χ(Sᵢ) = Σ_{vⱼ ∈ Sᵢ} χ(vⱼ)
Hash grouping & discrepancy III

- With a randomly chosen colouring, with high probability,

      χ(Sᵢ) = O( √(|Sᵢ| ln(2m)) )

  i.e., across each set Sᵢ we have a discrepancy that is sublinear in the size of the set. (A small simulation sketch follows below.)

- This means that assigning objects randomly into groups gives us single groups that are representative of the characteristics of the entire population of objects, so long as m (the number of subgroups/characteristics) is small.
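A small simulation sketch (purely illustrative; the population size, the subset model, and the variable names are assumptions) showing that a random colouring keeps every subset's imbalance far below its size:

```python
import math, random

rng = random.Random(1)
n, m = 50_000, 20                        # 50,000 elements, 20 overlapping subsets
V = range(n)
# Each "characteristic" S_i is a random subset containing about half the population.
S = [{v for v in V if rng.random() < 0.5} for _ in range(m)]

# A random 2-colouring, i.e. a random split into two groups (chi = -1 or +1).
chi = {v: rng.choice((-1, 1)) for v in V}

for i, Si in enumerate(S[:5]):
    imbalance = abs(sum(chi[v] for v in Si))
    bound = math.sqrt(len(Si) * math.log(2 * m))
    print(f"S_{i}: size {len(Si)}, discrepancy {imbalance}, sqrt(|S_i| ln 2m) ~ {bound:.0f}")
# Typical imbalances are a few hundred at most: tiny relative to |S_i| ~ 25,000.
```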
Theme: Design Tradeoffs

- Tradeoffs between design parameters are a recurring theme in algorithms & data structures.

- Examples:

  - By making a hash table bigger, we can decrease α (the load factor) and achieve faster search times. (A tradeoff between space and time.)
  - In designing circuits to add n-bit integers, we can obtain very low delays (the maximum number of gates between inputs and outputs) by increasing the number of gates: trading time (delay) for area (number of gates).
  - In many tasks we can trade the precision of an answer for time and space, e.g., responding quickly to database queries with an estimate of the answer, rather than the exact answer.
Theme: Design Tradeoffs

- Design tradeoffs are often parameterizable.

- For example, in speed/accuracy tradeoffs we don't usually have to choose either speed or accuracy. Instead we have a parameter, the allowable error ε, that we can adjust.

  - With large ε we get fast (but possibly not very accurate) answers.
  - As ε → 0 we get very accurate answers that take longer to compute.

- Let's look at an example of a tradeoff in the design of data structures.
Design Tradeoff: Hash tables vs. BSTs

- Consider representing a collection of n keys drawn from an ordered structure ⟨K, ≤⟩.

- A (balanced) binary search tree (BST) has Θ(log n) search times.

- A hash table has Θ(1) search (if we keep the size ∝ the number of elements, and choose an appropriate hash function).

- Difference between these two data structures:

  - A BST allows us to iterate through the elements in order, using Θ(log n) working space. The Θ(log n) space is used to record the path from the root to the iterator position in a stack.
  - Items in a hash table are not stored in order; if we want to iterate through them in order, we need extra space and time, e.g., Θ(n) space for a temporary array and Θ(n log n) time to sort the items.
Design Tradeoff: Hash tables vs. BSTs

- We can view BSTs and hash tables as two points in a design space:

      Data structure         Search time    Working space for ordered iteration
      Hash table             Θ(1)           Θ(n)
      Binary search tree     Θ(log n)       Θ(log n)

- Suppose:

  - We have a very large (n = 10⁹) collection of keys that barely fits in memory.
  - It is dynamic: keys are added and removed frequently.
  - We need fast search, fast insert, fast remove.
  - Red-black tree: the height is 2 log n ≈ 61 levels.
  - We need to be able to iterate through the collection in order.
  - There is not enough room in memory to create a temporary array for sorting; also, this would be prohibitively slow.
Design Tradeoff: Hash tables vs. BSTs

- Let's make a simple data structure that will offer a smoother tradeoff between search time and the working space required for an ordered iteration.

- If you think of BST + hash table as two points in a design space, we want a structure that will interpolate smoothly between them.

  [Figure: design space with axes "search time" and "working space for ordered iteration"; the binary search tree sits at roughly (c log n, log n), the hash table at (1, n), and we want to interpolate between them.]
Design Tradeoff: Hash tables vs. BSTs I

- Consider a hash table of m slots, using a BST in each slot to resolve collisions.

  [Figure: a hash table whose slots each contain a small binary search tree.]

- Observation:

  - When m = 1 we have a degenerate hash table with a single slot.
  - All the keys are put in a single BST.


Design Tradeoff: Hash tables vs. BSTs II

- So choosing m = 1 essentially gives us a BST: we can iterate through the keys in order, and search requires c log n steps, where c reflects the average tree depth.

- What about the case m = 2?

  - We have a hash table with two slots. If the hash function is good, we get two BSTs of roughly n/2 keys apiece.
  - Search time is about c log(n/2).
  - Can we iterate through the keys in order? Yes: have two iterators, one for each tree. Initially the two iterators point at the smallest key in their tree.
  - At each step of the iteration, choose the iterator that is pointing at the smaller of the two keys. Retrieve that key, and advance the iterator.

- Generalize: if we choose an arbitrary m,

  - we will have m BSTs of average size n/m;
  - search times will be around c log(n/m), assuming m ≤ n;
  - to iterate through the keys in order:


Design Tradeoff: Hash tables vs. BSTs III

  - obtain m iterators, one for each tree;
  - at each step, choose the iterator pointing at the smallest key, retrieve that key, and advance the iterator.

- To do this efficiently, we need a fast way to maintain a collection (of iterators) that lets us quickly obtain the one with the smallest value (the iterator pointing at the smallest key).

- Easy: a min-heap.

- Our algorithm for ordered iteration will look like this (see the sketch after the steps below):

  1. Create an array of m BST iterators, one for each hash table slot.
  2. Turn this array into a min-heap, ordering iterators by the key they are pointing at. (The heap can be built in O(m) time.)
  3. To obtain the next element,
     3.1 Remove the least element from the min-heap. (This takes O(log m) time.)
     3.2 Obtain its key, and advance the iterator. (Advancing a BST iterator requires O(1) amortized time.)
Design Tradeoff: Hash tables vs. BSTs IV

     3.3 Put the iterator back into the min-heap. (This takes O(log m) time.)
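A minimal Python sketch of this ordered iteration, with sorted lists standing in for the per-slot BSTs; heapq, the function name, and the slot layout are assumptions of this example:

```python
import heapq

def ordered_iteration(table):
    """Ordered iteration over a hash table whose m slots each hold a sorted
    structure: build a min-heap of one iterator per slot, repeatedly pop the
    smallest head, advance that iterator, and push it back.
    O(m) to build the heap, O(log m) per key thereafter."""
    heap = []
    for slot in table:
        it = iter(sorted(slot))          # a BST iterator would already be in order
        first = next(it, None)
        if first is not None:
            heap.append((first, id(it), it))   # id() breaks ties between equal keys
    heapq.heapify(heap)                  # O(m)
    while heap:
        key, tie, it = heapq.heappop(heap)     # O(log m)
        yield key
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, tie, it))

# Tiny example: m = 4 slots filled by key mod 4.
table = [[] for _ in range(4)]
for k in [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]:
    table[k % 4].append(k)
print(list(ordered_iteration(table)))
# [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]
```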
Design Tradeoff: Hash tables vs. BSTs

- We can iterate through the keys in order in time O(n(1 + log m)):

  - O(m) time to obtain the iterators and build the heap;
  - O(1 + log m) time per key to adjust the heap, times n keys = O(n log m) (the "1 +" handles the case m = 1);
  - overall, O(n(1 + log m)) time, assuming m ≤ n.

- The space required for iterating through the keys in order is O(m(1 + log(n/m))):

  - We need m iterators, one per hash table slot.
  - Each iterator requires space O(1 + log(n/m)), on average, for a stack recording its position in the tree. (The "1 +" handles the case where n = m.)

- The number of steps for searching is on average 1 + c log(n/m), where c is a constant depending on the kind of BST we choose. The constant 1 is added to reflect visiting the correct slot in the hash table, and to handle the case where m = n, in which case c log(n/m) = 0, and having 0 search steps doesn't make sense.
Design Tradeoff: Hash tables vs. BSTs

- Looking at these complexities, a sensible parameterization is m = n^(1−ε).

  - When ε = 0, m = n and we get a hash table;
  - when ε = 1, m = 1 and we get a BST.

- Space and time:

  - The number of search steps is 1 + c log(n/m) = 1 + c log(n/n^(1−ε)) = 1 + c log n^ε = 1 + cε log n.
  - ε directly multiplies our search time: choosing ε = 1/2 halves our search time.
  - The working space for ordered iteration is O(m(1 + log(n/m))) = O(n^(1−ε)(1 + ε log n)).
  - E.g., if we choose ε = 1/2 we are twice as fast as a BST for searching, and require O(√n · log n) working space for ordered iteration.

- The amount of extra space we need for ordered iteration, relative to the space needed to store the keys, is

      n^(1−ε)(1 + ε log n) / n = n^(−ε)(1 + ε log n)

  NB: if ε > 0 the relative space overhead for supplying ordered iteration → 0.
Design Tradeoff: Hash tables vs. BSTs

- Let's look at some real-life numbers. Take n = 10⁹ keys.

- Assume we use red-black trees, so that the average depth of keys in a tree of n/m elements is ≈ 2 log(n/m).

      Parameter ε   #Search steps     Space for iter.          Space overhead
                    1 + 2ε log n      n^(1−ε)(1 + ε log n)     n^(−ε)(1 + ε log n)
      (Hash) 0      1                 1000000000               100%
      1/8           4.7               355237568                35%
      1/4           16.0              47654705                 4.7%
      1/2           31.9              504341                   0.05%
      3/4           45.8              4165                     0.0004%
      7/8           53.3              362                      0.00004%
      (BST) 1       60.8              31                       0.000003%

- e.g., choosing ε = 1/4, we can get searches 4 times faster than the plain red-black tree, and have only a 4.7% space overhead for ordered iteration.

- Choosing ε = 1/2, we can get searches twice as fast as a plain red-black tree, with a 0.05% space overhead for ordered iteration.
Bibliography I
[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, 1970.

[2] Stanley Burris and H. P. Sankappanavar. A Course in Universal Algebra. Springer-Verlag, 1981.

[3] Bernard Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
Bibliography II
[4] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, and Friedhelm Meyer auf der Heide. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994.

[5] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, ESA, volume 2832 of Lecture Notes in Computer Science, pages 605–617. Springer, 2003.

[6] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, September 1985.
Bibliography III
[7] Philippe Flajolet, Patricio V. Poblete, and Alfredo Viola. On the analysis of linear probing hashing. Algorithmica, 22(4):490–515, 1998.

[8] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.

[9] Joseph A. Gallian. Contemporary Abstract Algebra. D. C. Heath and Company, Toronto, 3rd edition, 1994.
Bibliography IV
[10] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.

[11] Saunders MacLane and Garrett Birkhoff. Algebra. Chelsea Publishing Co., New York, third edition, 1988.

[12] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, 1996.
Bibliography V
[13] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.