ECE750 Lecture 5: Hash functions, Hash tables, Bloom filters,
Hash grouping, Discrepancy, Probabilistic counting

Todd Veldhuizen
tveldhui@acm.org

Electrical & Computer Engineering
University of Waterloo
Canada

Oct. 12, 2007
Part I
Hashing
Hash Tables

- Suppose we wanted to represent the following set:

      M = {35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657}

  Given some x ∈ N, we want to quickly test whether x ∈ M.

- Binary search trees require following a path through a tree, which is perhaps not fast enough for our problem.

- Super fast way: allocate an array A of 4657 bytes. Set A[i] = 1 if i ∈ M, and A[i] = 0 if i ∉ M.

  Then, on a RAM, we can test whether x ∈ M with a single memory access to A[x] (a constant amount of time).

- However, the space required by this strategy is O(sup M).
Hash Tables

- Obviously the array A would contain mostly empty space. Can we somehow compress the array but still support fast access?

- Yes: allocate a much smaller table B of length k. Define a function h : [1, 4657] → [1, k] that maps indices of A to indices of B, can be computed quickly, and ensures that if x, y ∈ M and x ≠ y, then h(x) ≠ h(y), i.e., no two elements of M have the same index in B.

- Then, x ∈ M if and only if B[h(x)] = x.


Hash Tables

- For our example, h(x) = x mod 17 does the trick. Here is the array B:

      j   B[j]      j    B[j]      j    B[j]
      0      0      6       0      12      0
      1     35      7       0      13      0
      2      0      8    1691      14      0
      3    139      9    1760      15   3789
      4    395      10   1795      16   4657
      5      0      11   3632

- e.g., x = 1691: h(x) = 8, and B[8] = 1691, so x ∈ M.

- e.g., x = 1692: h(x) = 9, and B[9] = 1760 ≠ 1692, so x ∉ M.

- This is a hash table. h(x) = x mod 17 is called a hash function.
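As a concrete illustration (not part of the original slides), here is a minimal Python sketch of this table; the names B, h, and member are chosen only for this example:

```python
# Sketch of the example above: a collision-free hash table for M with h(x) = x mod 17.
M = [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]

m = 17                       # table size, chosen so that h is injective on M
B = [0] * m                  # 0 marks an empty slot (no key in M is 0)

def h(x):
    return x % m             # the hash function h(x) = x mod 17

for key in M:
    assert B[h(key)] == 0    # no collisions happen for this particular M
    B[h(key)] = key

def member(x):
    return B[h(x)] == x      # x is in M iff its slot holds x itself

print(member(1691))   # True:  B[8] == 1691
print(member(1692))   # False: B[9] == 1760 != 1692
```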
Hash Functions

- A hash function is a map h : K → H from some (usually large) key space K to some (usually small) set of hash values H. In our example, we were mapping from K = [0, 4657] to H = [0, 16].

- If the set M ⊆ K is chosen uniformly at random, keys are uniformly distributed (i.e., each k ∈ K has the same probability of appearing in a set to represent). In this case the hash function should distribute the keys evenly amongst the elements of H, i.e., we want that |h⁻¹(y)| ≈ |h⁻¹(z)| for y, z ∈ H.¹

- For a nonuniform distribution on keys, one just wants to choose h so that the distribution induced on H is close to uniform.

¹ Recall that for a function f : R → S, f⁻¹(s) ≝ {r : f(r) = s}.
Hash Functions

- We will describe some hash functions where K = N (keys are nonnegative integers). These are easily adapted to other kinds of keys (e.g., strings) by interpreting the binary representation of the key as an integer.

  Some commonly used hash functions are the following:

  1. Division: use h(k) = k mod m, where m = |H| is usually chosen to be a prime number far away from any power of 2.²
     For long bit strings, use Horner's rule for evaluating polynomials in Z/mZ (will explain).

  2. Multiplication: use h(k) = ⌊m{kα}⌋, where 0 < α < 1 is an irrational number and {x} ≝ x − ⌊x⌋. A popular choice of α is α = (√5 − 1)/2.

² A particularly terrible choice would be m = 256, which would hash objects based only on their lowest 8 bits; e.g., the hash of a string would depend only on its last character.
Multiplication hash functions: Example
Example of a multiplication hash function using α = (√5 − 1)/2, and a hash table with m = 100 slots:

      key k    {kα}        ⌊m{kα}⌋
      1        0.618034    61
      2        0.236068    23
      3        0.854102    85
      4        0.472136    47
      5        0.090170     9
      6        0.708204    70
      7        0.326238    32
      8        0.944272    94
      9        0.562306    56
      10       0.180340    18
      11       0.798374    79
      12       0.416408    41
      13       0.034442     3
      14       0.652476    65
      15       0.270510    27
      16       0.888544    88
      17       0.506578    50

The idea is that the third column (the hash slots) looks like a random sequence.
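A small Python sketch of the multiplication method, assuming α = (√5 − 1)/2 and m = 100 as in the table above; the function name mult_hash is invented for this example:

```python
import math

# Multiplication method: h(k) = floor(m * frac(k * alpha)).
alpha = (math.sqrt(5) - 1) / 2
m = 100

def mult_hash(k):
    frac = (k * alpha) % 1.0        # fractional part {k * alpha}
    return int(m * frac)            # slot number in 0..m-1

for k in range(1, 6):
    print(k, round((k * alpha) % 1.0, 6), mult_hash(k))
# Matches the table: 1 -> 61, 2 -> 23, 3 -> 85, 4 -> 47, 5 -> 9
```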
Multiplication hash functions I

- The reason why h(k) = ⌊m{kα}⌋ is a reasonable hash function is interesting.

- The short answer is that the sequence {kα} for k = 1, 2, 3, ... kind of behaves like a sequence of random reals drawn from (0, 1). So h(k) = ⌊m{kα}⌋ looks like a randomly chosen hash function.

  A less sketchy explanation:

  1. {kα} is uniformly distributed on (0, 1): asymptotically, the proportion of {kα} falling in an interval (a, b) ⊆ (0, 1) is b − a. (Just like a uniform distribution on (0, 1).)
Multiplication hash functions II
  2. {kα} satisfies an ergodic theorem: if we sample a suitably well-behaved³ function f at the points {kα} and average, this converges to the integral:

         (1/m) Σ_{k=1}^{m} f({kα}) → ∫₀¹ f(x) dx

     Just like a uniform distribution on (0, 1)!

     See [3]. Variously called Weyl's ergodic principle or Weyl's equidistribution theorem.

- However, {kα} is emphatically not a random sequence.

³ Continuously differentiable and periodic with period 1.
Hash Functions

- To evaluate whether a hash function is a good choice for a set of data S ⊆ K, one can see how the observed distribution of keys into hash table slots compares to a uniform distribution.

- Suppose there are n keys and m hash slots. Compute the observed distribution of the keys:

      p̂ᵢ = |{k : h(k) = i}| / n

- To measure how far from uniform, compute

      D(P̂ || U) = log₂ m + Σ_{i=1}^{m} p̂ᵢ log₂ p̂ᵢ

  Convention: 0 log₂ 0 = 0.

- This is the Kullback-Leibler divergence of the observed distribution P̂ from the uniform distribution U. It may be thought of as the "distance" from P̂ to U.

- The smaller D(P̂ || U), the better the hash function.
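One possible way to compute this diagnostic is sketched below in Python; the function kl_from_uniform and the toy key sets are assumptions made for this example, not part of the lecture:

```python
import math
from collections import Counter

def kl_from_uniform(keys, h, m):
    """D(P_hat || U) in bits for the slot distribution of `keys` under hash
    function `h` reduced into `m` slots (0*log2(0) is treated as 0)."""
    n = len(keys)
    counts = Counter(h(k) % m for k in keys)
    d = math.log2(m)
    for c in counts.values():
        p = c / n
        d += p * math.log2(p)        # empty slots contribute nothing
    return d

# Example: 10,000 even keys hashed by k mod m for two table sizes.
keys = [2 * i for i in range(10_000)]
print(kl_from_uniform(keys, lambda k: k, 64))  # ~1.0 bit: only the 32 even slots are used
print(kl_from_uniform(keys, lambda k: k, 61))  # ~0.0 bits: a prime m spreads these keys evenly
```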


Horner's Rule I

- Horner's rule gives an efficient method for evaluating hash functions for sequences, e.g., strings.

- Consider a hash function of the form h(k) = k mod m.

- If we wish to hash a string such as "hello", we can interpret it as a long binary number: in ASCII, "hello" is

      01101000 01100101 01101100 01101100 01101111
         h        e        l        l        o

- As a sequence of integers, "hello" is [104, 101, 108, 108, 111]. We want to compute

      (104·2³² + 101·2²⁴ + 108·2¹⁶ + 108·2⁸ + 111·2⁰) mod m
Horner's Rule II

- Horner's rule is a general trick for evaluating a polynomial. We write

      ax³ + bx² + cx + d = (ax² + bx + c)x + d
                         = ((ax + b)x + c)x + d

  so that instead of computing x³, x², ... we have only multiplications:

      t₁ = ax + b
      t₂ = t₁x + c
      t₃ = t₂x + d

- Trivia: some early CPUs included an instruction opcode for applying Horner's rule. May be making a comeback!
Horner's Rule III

- To use Horner's rule for hashing: to compute (a·2²⁴ + b·2¹⁶ + c·2⁸ + d) mod m,

      t₁ = (a·2⁸ + b) mod m
      t₂ = (t₁·2⁸ + c) mod m
      t₃ = (t₂·2⁸ + d) mod m

  Note that multiplying by 2ᵏ is simply a shift by k bits.

- Why this works. In short, algebra. The integers Z form a ring under multiplication and addition. The hash function h(k) = k mod m can be interpreted as a homomorphism from the ring Z of integers to the ring Z/mZ of integers modulo m. Homomorphisms preserve structure in the following sense: if we write + for integer addition, and ⊕ for addition modulo m,

      h(a + b) = h(a) ⊕ h(b)

  i.e., it doesn't matter whether we compute (a + b) mod m, or compute (a mod m) and (b mod m) and add with modular
Horner's Rule IV

  arithmetic: we get the same answer either way. Similarly, if we write · for multiplication in Z, and ⊗ for multiplication in Z/mZ,

      h(a · b) = h(a) ⊗ h(b)

  Horner's rule works precisely because h : Z → Z/mZ is a homomorphism:

      h(((a·2⁸ + b)·2⁸ + c)·2⁸ + d)
        = (((h(a) ⊗ h(2⁸) ⊕ h(b)) ⊗ h(2⁸) ⊕ h(c)) ⊗ h(2⁸) ⊕ h(d))

  This can be optimized to use fewer applications of h, as above.

  In this form it is obvious why m = 2⁸ is a horrible choice for a hash table size: 2⁸ mod 2⁸ = 0, so

      (((h(a) ⊗ h(2⁸) ⊕ h(b)) ⊗ h(2⁸) ⊕ h(c)) ⊗ h(2⁸) ⊕ h(d))
        = (((h(a) ⊗ 0 ⊕ h(b)) ⊗ 0 ⊕ h(c)) ⊗ 0 ⊕ h(d))
        = h(d)
Horner's Rule V

  i.e., the hash value depends only on the last byte. Similarly, if we used m = 2¹⁶, we would have h(2¹⁶) = 0, which would remove all but the last two bytes from the hash value computation.

  For background on algebra see, e.g., [2, 11, 9].
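A hedged Python sketch of Horner-rule hashing for strings; the function name horner_hash and the choice m = 1009 are illustrative, not from the slides:

```python
def horner_hash(s, m):
    """Hash an ASCII string by Horner's rule, reducing mod m at every step so
    intermediate values stay small; this equals interpreting s as one big
    base-256 number and taking it mod m."""
    h = 0
    for byte in s.encode("ascii"):
        h = (h * 256 + byte) % m    # multiply by 2^8 (a shift) and add the next byte, mod m
    return h

m = 1009                            # a prime far from a power of 2
print(horner_hash("hello", m))
# With m = 256 only the last byte survives: the result is just ord('o') = 111.
print(horner_hash("hello", 256), ord("o"))
```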
Collisions

- A collision occurs when two keys map to the same location in the hash table, i.e., there are distinct x, y ∈ M such that h(x) = h(y).

- Strategies for handling collisions:

  1. Pick a value of m large enough so that collisions are rare and can be easily dealt with, e.g., by maintaining a short overflow list of items whose hash slot is already occupied.
  2. Pick the hash function h to avoid collisions.
  3. Put a secondary data structure in each hash table slot (a list, tree, or another hash table).
  4. If a hash slot is full then try some other slots in some fixed sequence (open addressing).
Collision Strategy 1: Pick m big I

- Let's see how big m must be for the probability of collisions to be small.

- Two cases:

  - n > m: then there must be a collision, by the pigeonhole principle.⁴
  - n ≤ m: there may or may not be a collision.

- The birthday problem: what is the probability that amongst n people, at least two share the same birthday?

- This is a hashing problem: people are keys, days of the year are slots, and h maps people to their birthdays.

- If n ≥ 23, then the probability of two people having the same birthday is > 1/2. (Counterintuitive, but true.)

- The birthday problem analysis is straightforward to adapt to hashing.
Collision Strategy 1: Pick m big II

- Suppose the hash function h and the distribution of keys cooperate to produce a uniform distribution of keys into hash table slots.

- Recall that with a uniform distribution, probability may be computed by simple counting:

      Pr(event E happens) = (# outcomes in which E happens) / (# outcomes)

- First we count the number of hash functions without collisions:

  - There are m choices of where to put the first key; m − 1 choices of where to put the second key; ...; m − n + 1 choices of where to put the n-th key.
  - The number of hash functions with no collisions is the falling power m·(m − 1)···(m − n + 1) = m!/(m − n)!. (Note⁵.)

- Next we count the number of hash functions allowing collisions:
Collision Strategy 1: Pick m big III

- There are m choices of where to put the first key; m choices of where to put the second key; ...; m choices of where to put the n-th key.

- The number of hash functions allowing collisions is mⁿ.

- The probability of a collision-free arrangement is

      p = m! / ((m − n)! · mⁿ)

- Asymptotic estimate of ln p, assuming m ≫ n:

      ln p ≈ −n²/(2m) + n/(2m) + O(n³/m²)          (1)

  Here we have used Stirling's approximation and

      ln(m − n) = ln m − n/m − O(n²/m²)

  Two cases: if n² ≪ m then ln p → 0. If n² ≫ m then ln p → −∞.
Collision Strategy 1: Pick m big IV

- Recall that if

      ln p = x + ε

  then

      p = e^(x+ε) = eˣ · e^ε = eˣ (1 + ε + ε²/2! + ···)     (Taylor series)
        = eˣ (1 + O(ε))   if ε ∈ o(1)

- The probability of a collision-free arrangement is

      p ≈ e^(−n(n−1)/(2m)) + O( (n³/m²) · e^(−n(n−1)/(2m)) )

- Interpretation:
Collision Strategy 1: Pick m big V

- If m ∈ ω(n²) there are no collisions (almost surely).

- If m ∈ o(n²) there is a collision (almost surely).

- i.e., if we want a low probability of collisions, our hash table has to be quadratic (or more) in the number of items.

⁴ If m + 1 pigeons are placed in m pigeonholes, there must be two pigeons in the same hole. (Replace pigeons with keys, and pigeonholes with hash slots.)

⁵ The handy notation m^n̲ (with an underlined exponent) for m(m − 1)···(m − n + 1) is called a falling power [10].
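As a quick numerical check of this analysis, here is a small Python sketch (the function names are illustrative, not from the slides) comparing the exact collision probability with the e^(−n(n−1)/(2m)) estimate:

```python
import math

def p_no_collision_exact(n, m):
    """Exact probability that n uniformly hashed keys land in m slots with no
    collision: m!/((m-n)! m^n), computed as a running product."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

def p_no_collision_approx(n, m):
    """Leading-order estimate exp(-n(n-1)/(2m)) from the slide."""
    return math.exp(-n * (n - 1) / (2 * m))

# Birthday problem: 23 people, 365 "slots".
print(1 - p_no_collision_exact(23, 365))            # ~0.507: a shared birthday is likely
# Hash table sizing: n = 1000 keys, m = 5,000,000 slots.
print(1 - p_no_collision_exact(1000, 5_000_000))    # ~0.095 collision probability
print(1 - p_no_collision_approx(1000, 5_000_000))   # ~0.095 from the asymptotic estimate
```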
Threshold functions

- m = ½n² is an example of a threshold function:

  - ≪ the threshold, the asymptotic probability of the event is 0;
  - ≫ the threshold, the asymptotic probability of the event is 1.

  [Figure: probability of no collision (from 0 to 1) plotted against the hash table size m, on a scale showing n, n², n^(2+ε), n³; the curve rises sharply from 0 to 1 around the threshold m ≈ n².]
Collision Strategy 1: pick m big

- Picking m big is not an effective strategy for handling collisions.

- For n = 1000 elements, this table shows how big m must be so that the probability p of a collision is as small as desired:

      p       m
      0.1     5000000
      0.01    50000000
      10⁻⁶    500000000000
      10⁻⁹    500000000000000
Collision Strategy 1: pick m big

- The analysis of collisions in hashing demonstrates two pigeonhole principles.

- The simplest pigeonhole principle states that if you put m + 1 pigeons in m holes, there must be one hole with ≥ 2 pigeons.

- With respect to hash tables, the pigeonhole applies as follows: if a hash table with m slots is used to store ≥ m + 1 elements, there is a collision.

- The probability-of-collision analysis of the previous slides demonstrates a probabilistic pigeonhole principle: if you put ω(√n) pigeons in n holes, there is a hole with ≥ 2 pigeons almost surely (i.e., with probability converging to 1 as n → ∞).
Collision Strategy 2: pick h carefully I

- Can we pick our hash function h to avoid collisions?

- For example, if we use hash functions of the form h(k) = ⌊m{kα}⌋, we could try random values of α ∈ (0, 1) until we found one that was collision-free.

- We have a probability of success

      p ≈ e^(−n(n−1)/(2m)) (1 + o(1))

- Geometric distribution:

  - Probability of success p, probability of failure 1 − p.
  - Each trial is independent, identically distributed.
  - Probability that k tries are needed for success = (1 − p)^(k−1) p.
  - Mean: p⁻¹.
Collision Strategy 2: pick h carefully II

- Number of values of α we expect to try before we find a collision-free hash table for n = 1000:

      m        Expected failures before success
      1000     10²¹⁷
      2000     10¹⁰⁹
      10000    10²²
      100000   147

- Picking hash functions randomly in this manner is unlikely to be practical.

- There are better strategies: see [8, 4].


Collision Strategy 3: secondary data structures I

- By far the most common technique for handling collisions is to put a secondary data structure in each hash table slot (see the sketch after this list):

  - a linked list (chaining),
  - a binary search tree (BST),
  - another hash table.

- Let α = n/m be the load factor: the average number of items per hash table slot.

- Assuming a uniform distribution of keys into slots:

  - Linked lists require 1 + α steps (on average) to find a key.
  - Suitable BSTs require 1 + max(c log α, 0) steps (on average).⁶
  - Using secondary hash tables of size quadratic in the number of elements in the slot, one can achieve O(1) lookups on average, and require only Θ(n) space.
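A minimal Python sketch of chaining, with plain lists standing in for the per-slot secondary structure; the class name and table size are assumptions made for this example:

```python
class ChainedHashTable:
    """Collision strategy 3: one small chain (here a Python list) per slot."""

    def __init__(self, m=101):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return hash(key) % self.m

    def insert(self, key):
        chain = self.slots[self._h(key)]
        if key not in chain:                  # expected O(1 + alpha) scan of the chain
            chain.append(key)

    def contains(self, key):
        return key in self.slots[self._h(key)]

t = ChainedHashTable()
for word in ["to", "be", "or", "not", "to", "be"]:
    t.insert(word)
print(t.contains("not"), t.contains("nothing"))   # True False
```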
Collision Strategy 3: secondary data structures II

- Analysis of secondary hash tables:

  - Let Nᵢ be a random variable indicating the number of items landing in slot i.
  - E[Nᵢ] = α
  - Var[Nᵢ] = n · (1/m)(1 − 1/m)   (n times a Bernoulli variance)

- The space required for the secondary hash tables is proportional to

      E[ Σ_{1≤i≤m} Nᵢ² ] = Σ_{1≤i≤m} E[Nᵢ²]
                         = Σ_{1≤i≤m} ( Var[Nᵢ] + α² )
                         = m ( n·(1/m)(1 − 1/m) + n²/m² )
                         ≈ n²/m + n − n/m

  Plus space Θ(m) for the primary hash table, giving Θ(m + n²/m + n). Choosing m = Θ(n) yields linear space.

⁶ The max(·) deals with the possibility that α < 1, in which case log α < 0.
Collision Strategy 4: open addressing I

- Open addressing is a family of techniques for resolving collisions that do not require secondary data structures. This has the advantage of not requiring any dynamic memory allocation.

- In the simplest scenario we have a function s : H → H that is ideally a permutation of the hash values, for example the linear probing function

      s(x) = (x + 1) mod m

- When we attempt to insert a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until an empty slot is found.

- To find a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until either k or an empty slot is found.
Collision Strategy 4: open addressing II

- However, the use of permutations performs badly as the hash table becomes fuller: we tend to get "clumps" or clusters, i.e., long sequences h(k), s(h(k)), s(s(h(k))), ... where all the slots are occupied (see e.g. [12]).

- Performance can be good for not-very-full tables, e.g., α < 2/3. As α → 1, operations begin to take Θ(√n) time [7].

- Quadratic probing offers less clumping: try slots h₀(k), h₁(k), ..., where

      hᵢ(k) = (h(k) + i²) mod m

  and h(k) is an initial fixed hash function. If m is prime, the first ⌈m/2⌉ probe positions are distinct, so an empty slot is found whenever the table is at most half full.
Collision Strategy 4: open addressing III

- Double hashing uses two hash functions, h₁ and h₂:

      hᵢ(k) = (h₁(k) + i·h₂(k)) mod m

  h₁(k) gives an initial slot to try; h₂(k) gives a stride (this reduces to linear probing when h₂(k) = 1).

- Under favourable conditions, an open addressing scheme behaves like a geometric distribution when searching for an open slot: the probability of finding an empty slot is 1 − α, so the expected number of trials is 1/(1 − α). Note the catastrophe when α → 1. (A small sketch of open addressing follows.)
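A small Python sketch of open addressing with double hashing (linear probing is the special case h₂ ≡ 1); the class and the particular choices of h₁ and h₂ are illustrative assumptions:

```python
class OpenAddressingTable:
    """Probe slots h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... mod m until an empty
    slot (insert) or the key or an empty slot (search) is found."""

    def __init__(self, m=13):          # m prime, so any stride visits every slot
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        h1 = key % self.m
        h2 = 1 + (key % (self.m - 1))  # stride in 1..m-1, never 0
        for i in range(self.m):
            yield (h1 + i * h2) % self.m

    def insert(self, key):
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j] == key:
                self.slots[j] = key
                return
        raise RuntimeError("table full")   # alpha = 1: open addressing breaks down

    def contains(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return False
            if self.slots[j] == key:
                return True
        return False

t = OpenAddressingTable()
for k in [35, 139, 395, 1691, 1760]:
    t.insert(k)
print(t.contains(1691), t.contains(1692))   # True False
```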
Summary of collision strategies
      Strategy                 E[access time]          Space
      Choose m big             O(1)                    Θ(n²)
      Linked list              1 + α                   O(n + m)
      Binary search tree       1 + max(c log α, 0)     O(n + m)
      Secondary hash tables    O(1)                    O(n)
      Open addressing          1/(1 − α)               O(m)

- Open addressing can be quite effective if α ≪ 1, but fails catastrophically as α → 1.
Summary of collision strategies

- If unexpectedly n ≫ m (e.g., we have far more data than we designed for), then α → ∞. For example, if m ∈ O(1) and n ∈ ω(1):

  - a linked list has O(n) accesses;
  - BSTs have O(log n) accesses, offering a gentler failure mode.

- If the hash function is badly nonuniform:

  - a linked list can be O(n);
  - a BST will have O(log n);
  - secondary hash tables may require O(n²) space.

- To summarize: hash table + BST will give fast search times, and let you sleep at night.

- To maintain O(1) access times as n → ∞, it is necessary to maintain m ∝ n. This can be done by choosing an allowable interval [c₁, c₂] for α; when α > c₂, resize the hash table to make α = c₁. So long as c₂ > c₁, this strategy adds O(1) amortized time per insertion, as in dynamic arrays.
Universal hashing I

- Suppose we are implementing a cellphone application that permits phones to store pieces of data on a server (e.g., address book entries).

- We decide for performance reasons to use a hash table with linked lists to resolve collisions.

- If a malicious hacker knows what hash function we have chosen, then they could launch a denial-of-service attack by programming cellphones to repeatedly store and retrieve addresses that hash to the same slot: since searching the linked list is an O(n) operation, the server might be made too busy to service other requests in reasonable time.

- A solution is to choose the hash function randomly, so that it would be difficult for the malicious hacker to find values that would hash to the same slot.
Universal hashing II

- This is called universal hashing.

- e.g., we could choose a random value α ∈ (0, 1) and use the hash function h(k) = ⌊m{kα}⌋.
Applications of hashing

- Hashing is a ubiquitous concept, used not just for maintaining collections but also for

  - cryptography
  - combinatorics
  - data mining
  - computational geometry
  - databases
  - router traffic analysis

- Some algorithms/data structures that use hashing:

  - Bloom filters [1]: approximate set membership with no false negatives; with 10 bits per element one can achieve a 1% false positive rate.
  - Interpolation search + hashing = O(log log n) expected search times, with no unused space.
  - An example: probabilistic counting.


Part II
Applications of hashing
Probabilistic Counting

- Problem: estimate the number of unique elements in a LARGE collection (e.g., a database, a data stream) without requiring much working space.

- Useful for query optimization in databases [13]:

  - e.g., to evaluate a join A ⋈ B ⋈ C we can do either A ⋈ (B ⋈ C) or (A ⋈ B) ⋈ C;
  - one of these might be very fast, one very slow;
  - having rough estimates of |B ⋈ C| vs. |A ⋈ B| lets us decide which strategy will be faster.
Probabilistic Counting

- A less serious (but more readily understood) example: Shakespeare's complete works:

  - N = 884,647 words (or so)
  - n = 28,239 unique words (or so)
  - w = average word length
  - N_max ≥ n: a prior upper bound on n

- Problem: estimate n, the number of unique words used. Approaches:

  1. Sorting: put all 884,647 words in a list and sort, then count. (Time O(Nw log N), space O(Nw).)
  2. Trie: scan through the words and build a trie, with counters at each node; requires O(nw) space (neglecting the size of counters).
  3. Super-LogLog probabilistic counting [5]: use 128 bytes of space, obtain an estimate of 30,897 words (error 9.4%).
Probabilistic Counting

- Inputs: a multiset A of elements, possibly with many duplicates (e.g., Shakespeare's plays).

- Problem: estimate card(A), the number of unique elements in A (e.g., the number of distinct words Shakespeare used).

- Simple starting idea: hash the objects into an m-element hash table. Instead of storing keys, just count the number of elements landing in each hash slot.

- Extreme cases to illustrate the principle:

  - Elements of A are all different: we will get an even distribution in the hash table.
  - Elements of A are all the same: we will get one hash table slot with all the elements!

- The shape of the hash table distribution reflects the frequency of duplicates.
Probabilistic Counting

- Linear Counting [13]:

  - Compute hash values in the range [0, N_max).
  - Maintain a bitmap representing which slots of the hash table would be occupied, and estimate n from the sparsity of the hash table.
  - Uses Θ(N_max) bits, e.g., on the order of card(A) bits.

- Room for improvement: the precise sparsity pattern doesn't matter, just the number of full vs. empty slots.
Probabilistic Counting

- Probabilistic Counting [6]:

  - Compute hash values in the range [0, N_max).
  - Instead of counting hash values directly, count the occurrence of hash values matching certain patterns:

        Pattern      Expected occurrences
        xxxxxxx1     2⁻¹ card(A)
        xxxxxx10     2⁻² card(A)
        xxxxx100     2⁻³ card(A)
        xxxx1000     2⁻⁴ card(A)
        ...          ...

  - Use these counts to estimate card(A).
  - To improve accuracy, use m different hash functions (a simplified sketch follows below).
  - Uses Θ(m log N_max) storage, and delivers accuracy of O(m^(−1/2)).
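A simplified Python sketch of this style of probabilistic counting: it tracks, for several salted hash functions, the deepest ...1000-type pattern seen. It is not the exact estimator of [6] or [5] (those apply a bias-correction constant), so treat the names and the constant-free estimate as illustrative assumptions:

```python
import random, statistics

def rho(x):
    """Number of trailing zero bits of x (the length of the ...1000...0 pattern)."""
    z = 0
    while x & 1 == 0 and z < 32:
        x >>= 1
        z += 1
    return z

def probabilistic_count(items, num_hashes=64, seed=0):
    """For each of num_hashes salted hash functions, remember the largest
    rho(h(x)) seen; 2^R tracks log2 of the number of distinct items, and
    averaging R over the hash functions reduces the variance."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    R = [0] * num_hashes
    for x in items:
        for j, salt in enumerate(salts):
            hv = hash((salt, x)) & 0xFFFFFFFF
            R[j] = max(R[j], rho(hv))
    return 2 ** statistics.mean(R)

stream = [i % 5000 for i in range(100_000)]   # 100,000 items, 5,000 distinct
print(probabilistic_count(stream))            # close to 5,000, up to a small constant factor
```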
Probabilistic Counting

- Super-LogLog [5] requires Θ(log log N_max) bits. With 1.28 kB of memory it can estimate card(A) to within an accuracy of 2.5% for N_max up to about 130 million.

- Probabilistic counters: count to N using about log log N bits. The counter is a Markov chain: from state k it advances to state k + 1 with probability 2⁻ᵏ (transition probabilities 1/2, 1/4, 1/8, 1/16, ...) and otherwise stays put (probabilities 1/2, 3/4, 7/8, 15/16, ...). We need about log N states, which can be encoded in log log N bits. (A sketch of such a counter follows below.)
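A hedged sketch of such a probabilistic (Morris-style) counter in Python; the class name and the exact estimate formula 2^k − 1 are assumptions of this illustration rather than a quotation of [5]:

```python
import random

class MorrisCounter:
    """Probabilistic counter: state k stands for a count of roughly 2^k - 1,
    so counting up to N needs only about log2(log2(N)) bits of state.
    On each increment, advance from state k with probability 2^-k."""

    def __init__(self, seed=0):
        self.k = 0
        self.rng = random.Random(seed)

    def increment(self):
        if self.rng.random() < 2.0 ** -self.k:
            self.k += 1

    def estimate(self):
        return 2 ** self.k - 1       # unbiased for this increment rule, but high variance

    # Usage: count 100,000 events with a handful of bits of state.
c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.k, c.estimate())             # k is typically around 16-17; the estimate is of the right order
```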
Hash grouping & discrepancy I

- Hashing is a useful way to divide objects into randomly chosen groups.

- e.g., for a distributed database, we want to distribute records evenly among the servers: use a hash table.

- If the hash function is randomly chosen (or otherwise suitable), then we get a random division of the objects into groups.

- Discrepancy theory [3] tells us that we will get a roughly even split of many different characteristics across these groups.

- e.g., if we use hashing to split a large database containing information about people, we might want to estimate quantities such as: how many people live in Ontario, work in the information technology industry, and subscribe to magazine X?
Hash grouping & discrepancy II

- We can estimate this reasonably well just by looking at the data in one group, provided the groups satisfy some conditions.

- Basic ideas of discrepancy:

  - Let V = {v₁, ..., vₙ} be a set of elements (e.g., people).
  - Let S = {S₁, ..., Sₘ} be a collection of subsets of V (e.g., "lives in Ontario", "works in IT", etc.).
  - (V, S) is called a set system.

- A colouring of V is a function χ : V → {−1, +1}, where χ(vᵢ) is the colour of vᵢ. (E.g., think of −1 and +1 as red and blue.)

- The discrepancy of a set Sᵢ is the imbalance between red and blue:

      χ(Sᵢ) = Σ_{vⱼ ∈ Sᵢ} χ(vⱼ)
Hash grouping & discrepancy III

- With a randomly chosen colouring, with high probability,

      χ(Sᵢ) = O( √(|Sᵢ| ln(2m)) )

  i.e., across each set Sᵢ we have a discrepancy that is sublinear in the size of the set. (A small simulation sketch follows below.)

- This means that assigning objects randomly into groups gives us single groups that are representative of the characteristics of the entire population of objects, so long as m (the number of subgroups/characteristics) is small.
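A small simulation sketch (purely illustrative; the population size, the subset model, and the variable names are assumptions) showing that a random colouring keeps every subset's imbalance far below its size:

```python
import math, random

rng = random.Random(1)
n, m = 50_000, 20                        # 50,000 elements, 20 overlapping subsets
V = range(n)
# Each "characteristic" S_i is a random subset containing about half the population.
S = [{v for v in V if rng.random() < 0.5} for _ in range(m)]

# A random 2-colouring, i.e. a random split into two groups (chi = -1 or +1).
chi = {v: rng.choice((-1, 1)) for v in V}

for i, Si in enumerate(S[:5]):
    imbalance = abs(sum(chi[v] for v in Si))
    bound = math.sqrt(len(Si) * math.log(2 * m))
    print(f"S_{i}: size {len(Si)}, discrepancy {imbalance}, sqrt(|S_i| ln 2m) ~ {bound:.0f}")
# Typical imbalances are a few hundred at most: tiny relative to |S_i| ~ 25,000.
```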
Theme: Design Tradeoffs

- Tradeoffs between design parameters are a recurring theme in algorithms & data structures.

- Examples:

  - By making a hash table bigger, we can decrease α (the load factor) and achieve faster search times. (A tradeoff between space and time.)
  - In designing circuits to add n-bit integers, we can obtain very low delays (the maximum number of gates between inputs and outputs) by increasing the number of gates: trading time (delay) for area (number of gates).
  - In many tasks we can trade the precision of an answer for time and space, e.g., responding quickly to database queries with an estimate of the answer, rather than the exact answer.
Theme: Design Tradeoffs

- Design tradeoffs are often parameterizable.

- For example, in speed/accuracy tradeoffs we don't usually have to choose either speed or accuracy. Instead we have a parameter, the allowable error ε, that we can adjust.

  - With large ε we get fast (but possibly not very accurate) answers.
  - As ε → 0 we get very accurate answers that take longer to compute.

- Let's look at an example of a tradeoff in the design of data structures.
Design Tradeoff: Hash tables vs. BSTs

- Consider representing a collection of n keys drawn from an ordered structure ⟨K, ≤⟩.

- A (balanced) binary search tree (BST) has Θ(log n) search times.

- A hash table has Θ(1) search (if we keep the size ∝ the number of elements, and choose an appropriate hash function).

- Difference between these two data structures:

  - A BST allows us to iterate through the elements in order, using Θ(log n) working space. The Θ(log n) space is used to record the path from the root to the iterator position in a stack.
  - Items in a hash table are not stored in order; if we want to iterate through them in order, we need extra space and time, e.g., Θ(n) space for a temporary array and Θ(n log n) time to sort the items.
Design Tradeoff: Hash tables vs. BSTs

- We can view BSTs and hash tables as two points in a design space:

      Data structure         Search time    Working space for ordered iteration
      Hash table             Θ(1)           Θ(n)
      Binary search tree     Θ(log n)       Θ(log n)

- Suppose:

  - We have a very large (n = 10⁹) collection of keys that barely fits in memory.
  - It is dynamic: keys are added and removed frequently.
  - We need fast search, fast insert, fast remove.
  - Red-black tree: the height is 2 log n ≈ 61 levels.
  - We need to be able to iterate through the collection in order.
  - There is not enough room in memory to create a temporary array for sorting; also, this would be prohibitively slow.
Design Tradeoff: Hash tables vs. BSTs

- Let's make a simple data structure that will offer a smoother tradeoff between search time and the working space required for an ordered iteration.

- If you think of BST + hash table as two points in a design space, we want a structure that will interpolate smoothly between them.

  [Figure: design space with axes "search time" and "working space for ordered iteration"; the binary search tree sits at roughly (c log n, log n), the hash table at (1, n), and we want to interpolate between them.]
Design Tradeoff: Hash tables vs. BSTs I

- Consider a hash table of m slots, using a BST in each slot to resolve collisions.

  [Figure: a hash table whose slots each contain a small binary search tree.]

- Observation:

  - When m = 1 we have a degenerate hash table with a single slot.
  - All the keys are put in a single BST.


Design Tradeoff: Hash tables vs. BSTs II

- So choosing m = 1 essentially gives us a BST: we can iterate through the keys in order, and search requires c log n steps, where c reflects the average tree depth.

- What about the case m = 2?

  - We have a hash table with two slots. If the hash function is good, we get two BSTs of roughly n/2 keys apiece.
  - Search time is about c log(n/2).
  - Can we iterate through the keys in order? Yes: have two iterators, one for each tree. Initially the two iterators point at the smallest key in their tree.
  - At each step of the iteration, choose the iterator that is pointing at the smaller of the two keys. Retrieve that key, and advance the iterator.

- Generalize: if we choose an arbitrary m,

  - we will have m BSTs of average size n/m;
  - search times will be around c log(n/m), assuming m ≤ n;
  - to iterate through the keys in order:


Design Tradeoff: Hash tables vs. BSTs III

  - obtain m iterators, one for each tree;
  - at each step, choose the iterator pointing at the smallest key, retrieve that key, and advance the iterator.

- To do this efficiently, we need a fast way to maintain a collection (of iterators) that lets us quickly obtain the one with the smallest value (the iterator pointing at the smallest key).

- Easy: a min-heap.

- Our algorithm for ordered iteration will look like this (see the sketch after the steps below):

  1. Create an array of m BST iterators, one for each hash table slot.
  2. Turn this array into a min-heap, ordering iterators by the key they are pointing at. (The heap can be built in O(m) time.)
  3. To obtain the next element,
     3.1 Remove the least element from the min-heap. (This takes O(log m) time.)
     3.2 Obtain its key, and advance the iterator. (Advancing a BST iterator requires O(1) amortized time.)
Design Tradeoff: Hash tables vs. BSTs IV

     3.3 Put the iterator back into the min-heap. (This takes O(log m) time.)
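A minimal Python sketch of this ordered iteration, with sorted lists standing in for the per-slot BSTs; heapq, the function name, and the slot layout are assumptions of this example:

```python
import heapq

def ordered_iteration(table):
    """Ordered iteration over a hash table whose m slots each hold a sorted
    structure: build a min-heap of one iterator per slot, repeatedly pop the
    smallest head, advance that iterator, and push it back.
    O(m) to build the heap, O(log m) per key thereafter."""
    heap = []
    for slot in table:
        it = iter(sorted(slot))          # a BST iterator would already be in order
        first = next(it, None)
        if first is not None:
            heap.append((first, id(it), it))   # id() breaks ties between equal keys
    heapq.heapify(heap)                  # O(m)
    while heap:
        key, tie, it = heapq.heappop(heap)     # O(log m)
        yield key
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, tie, it))

# Tiny example: m = 4 slots filled by key mod 4.
table = [[] for _ in range(4)]
for k in [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]:
    table[k % 4].append(k)
print(list(ordered_iteration(table)))
# [35, 139, 395, 1691, 1760, 1795, 3632, 3789, 4657]
```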
Design Tradeoff: Hash tables vs. BSTs

- We can iterate through the keys in order in time O(n(1 + log m)):

  - O(m) time to obtain the iterators and build the heap;
  - O(1 + log m) time per key to adjust the heap, times n keys = O(n log m) (the "1 +" handles the case m = 1);
  - overall, O(n(1 + log m)) time, assuming m ≤ n.

- The space required for iterating through the keys in order is O(m(1 + log(n/m))):

  - We need m iterators, one per hash table slot.
  - Each iterator requires space O(1 + log(n/m)), on average, for a stack recording its position in the tree. (The "1 +" handles the case where n = m.)

- The number of steps for searching is on average 1 + c log(n/m), where c is a constant depending on the kind of BST we choose. The constant 1 is added to reflect visiting the correct slot in the hash table, and to handle the case where m = n, in which case c log(n/m) = 0, and having 0 search steps doesn't make sense.
Design Tradeoff: Hash tables vs. BSTs

- Looking at these complexities, a sensible parameterization is m = n^(1−ε).

  - When ε = 0, m = n and we get a hash table;
  - when ε = 1, m = 1 and we get a BST.

- Space and time:

  - The number of search steps is 1 + c log(n/m) = 1 + c log(n/n^(1−ε)) = 1 + c log n^ε = 1 + cε log n.
  - ε directly multiplies our search time: choosing ε = 1/2 halves our search time.
  - The working space for ordered iteration is O(m(1 + log(n/m))) = O(n^(1−ε)(1 + ε log n)).
  - E.g., if we choose ε = 1/2 we are twice as fast as a BST for searching, and require O(√n · log n) working space for ordered iteration.

- The amount of extra space we need for ordered iteration, relative to the space needed to store the keys, is

      n^(1−ε)(1 + ε log n) / n = n^(−ε)(1 + ε log n)

  NB: if ε > 0 the relative space overhead for supplying ordered iteration → 0.
Design Tradeoff: Hash tables vs. BSTs

- Let's look at some real-life numbers. Take n = 10⁹ keys.

- Assume we use red-black trees, so that the average depth of keys in a tree of n/m elements is ≈ 2 log(n/m).

      Parameter ε   #Search steps     Space for iter.          Space overhead
                    1 + 2ε log n      n^(1−ε)(1 + ε log n)     n^(−ε)(1 + ε log n)
      (Hash) 0      1                 1000000000               100%
      1/8           4.7               355237568                35%
      1/4           16.0              47654705                 4.7%
      1/2           31.9              504341                   0.05%
      3/4           45.8              4165                     0.0004%
      7/8           53.3              362                      0.00004%
      (BST) 1       60.8              31                       0.000003%

- e.g., choosing ε = 1/4, we can get searches 4 times faster than the plain red-black tree, and have only a 4.7% space overhead for ordered iteration.

- Choosing ε = 1/2, we can get searches twice as fast as a plain red-black tree, with a 0.05% space overhead for ordered iteration.
Bibliography I
[1] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, 1970.

[2] Stanley Burris and H. P. Sankappanavar. A Course in Universal Algebra. Springer-Verlag, 1981.

[3] Bernard Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, Cambridge, 2000.
Bibliography II
[4] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, and Friedhelm Meyer auf der Heide. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4):738–761, 1994.

[5] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, ESA, volume 2832 of Lecture Notes in Computer Science, pages 605–617. Springer, 2003.

[6] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, September 1985.
Bibliography III
[7] Philippe Flajolet, Patricio V. Poblete, and Alfredo Viola. On the analysis of linear probing hashing. Algorithmica, 22(4):490–515, 1998.

[8] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.

[9] Joseph A. Gallian. Contemporary Abstract Algebra. D. C. Heath and Company, Toronto, 3rd edition, 1994.
Bibliography IV
[10] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition, 1994.

[11] Saunders MacLane and Garrett Birkhoff. Algebra. Chelsea Publishing Co., New York, third edition, 1988.

[12] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, 1996.
Bibliography V
[13] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2):208–229, 1990.