
Huffman Coding

Hashmat Rohian CSE 6111

Motivation
To compress or not to compress, that is the question!

Reducing the space required to store files on disk or tape.

Reducing the time to transmit large files.

Image source: plus.maths.org/issue23/features/data/data.jpg

Example:
A file with 100K characters
Character   Frequency (in 1000s)   Fixed-length codeword
a           45                     000
b           13                     001
c           12                     010
d           16                     011
e           9                      100
f           5                      101

Space =

(45*3 + 13*3 + 12*3 + 16*3 + 9*3 + 5*3) * 1000

= 300K bits

Can we do better? YES!


Use variable-length codes instead. Give frequent characters short codewords, and infrequent characters long codewords.
Character   Frequency (in 1000s)   Fixed-length codeword   Variable-length codeword
a           45                     000                     0
b           13                     001                     101
c           12                     010                     100
d           16                     011                     111
e           9                      100                     1101
f           5                      101                     1100

Space =

(45*1 + 13*3 + 12*3 + 16*3 + 9*4 + 5*4) * 1000

= 224K bits

(Savings = 25%)
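The space figures above can be checked directly. A quick sketch using the slide's frequency table (variable names are illustrative):

```python
freq = {'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}  # in thousands
fixed_len = {c: 3 for c in freq}                              # 3 bits per character
var_len = {'a': 1, 'b': 3, 'c': 3, 'd': 3, 'e': 4, 'f': 4}    # variable codeword lengths

fixed_bits = sum(freq[c] * fixed_len[c] for c in freq)  # kilobits
var_bits = sum(freq[c] * var_len[c] for c in freq)      # kilobits
savings = 1 - var_bits / fixed_bits

print(fixed_bits, var_bits, round(savings * 100))  # 300 224 25
```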

PREFIX-FREE CODE: no codeword is also a prefix of any other codeword.

No Ambiguity !!


Representation:

The Huffman code is represented as a binary tree, where each edge represents either 0 ("go to the left child") or 1 ("go to the right child").
Tree for the example code (edge labels: 0 = left, 1 = right):

root 100:  0 -> a:45,  1 -> 55
55:        0 -> 25,    1 -> 30
25:        0 -> c:12,  1 -> b:13
30:        0 -> 14,    1 -> d:16
14:        0 -> f:5,   1 -> e:9

Each leaf corresponds to a particular codeword.

Cost of the tree: B(T) = Σ_{c ∈ C} f(c) d_T(c), where f(c) is the frequency of character c and d_T(c) is the depth of c's leaf in T (the length of its codeword).

Optimal Code

(Figure: the fixed-length code tree, cost 300, shown next to the optimal variable-length tree, cost 224; both have leaves a:45, b:13, c:12, d:16, e:9, f:5.)
An optimal code is always represented by a full binary tree, with one leaf for each letter of the alphabet.

Constructing a Huffman code: build the tree T in a bottom-up manner. Begin with a set of |C| leaves and perform |C| - 1 "merging" operations.

Greedy choice: the two nodes of smallest weight are merged at each step.
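The bottom-up construction can be sketched with a binary min-heap (Python's heapq here; the tie-breaking counter is an implementation detail, and a different tie-breaking order would yield different but equally optimal codes):

```python
import heapq

def huffman_codes(freqs):
    # One heap entry per symbol; the counter breaks weight ties so that
    # tuples never compare the (unorderable) tree payloads.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                    # |C| - 1 merging operations
        w1, _, left = heapq.heappop(heap)   # the two smallest weights:
        w2, _, right = heapq.heappop(heap)  # the greedy choice
        heapq.heappush(heap, (w1 + w2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: 0 left, 1 right
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
        else:                               # leaf: record the codeword
            codes[node] = prefix or '0'
    walk(heap[0][2], '')
    return codes

print(huffman_codes({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
```

With the slide's frequencies this happens to reproduce the table's codewords: a=0, b=101, c=100, d=111, e=1101, f=1100.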

The steps of Huffman's algorithm


Start with the leaves: f:5, e:9, c:12, b:13, d:16, a:45

1. Merge f:5 and e:9   -> 14
2. Merge c:12 and b:13 -> 25
3. Merge 14 and d:16   -> 30
4. Merge 25 and 30     -> 55
5. Merge 55 and a:45   -> 100 (the root)

Running Time Analysis

Q is implemented as a binary min-heap. The merge operation is executed exactly n - 1 times, and each heap operation requires O(log n) time, for a total of O(n log n).

Huffman code

Code table:  E = 01, I = 00, C = 10, A = 111, H = 110

Input:  ACE
Output: (111)(10)(01) = 1111001

Huffman Coding Example

Decoding
1. Read the compressed file and the binary tree.
2. Use the binary tree to decode the file: for each symbol, follow the path from the root to a leaf.
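The root-to-leaf walk can be sketched with a nested-dict tree built from the example's code table (E=01, I=00, C=10, A=111, H=110); the dict representation is an illustrative choice:

```python
codes = {'E': '01', 'I': '00', 'C': '10', 'A': '111', 'H': '110'}

# Build the decoding tree: internal nodes are dicts keyed by '0'/'1',
# leaves are the decoded symbols.
root = {}
for sym, code in codes.items():
    node = root
    for bit in code[:-1]:
        node = node.setdefault(bit, {})
    node[code[-1]] = sym

def decode(bits):
    out, node = [], root
    for b in bits:
        node = node[b]             # follow one edge per bit
        if isinstance(node, str):  # reached a leaf: emit, restart at root
            out.append(node)
            node = root
    return ''.join(out)

print(decode('1111001'))  # ACE
```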

Huffman Code Algorithm Overview

(Figure: decoding tree with leaves A:3, H:2, C:5, E:8, I:7 and internal weights 5, 10, 15, 25.)

Decoding 1111001 bit by bit:

Bits 1,1,1 -> leaf A   (output "A")
Bits 1,0   -> leaf C   (output "AC")
Bits 0,1   -> leaf E   (output "ACE")

Induction on the number of codewords: the Huffman algorithm finds an optimal code for n = 1. Suppose that the Huffman algorithm finds an optimal code for alphabets of size n; now consider an alphabet of size n + 1 . . .

Correctness proof

Greedy Choice Proof


Let x and y be the two characters of lowest frequency, and let a and b be sibling leaves of maximum depth in an optimal tree T. Assume f[a] <= f[b] and f[x] <= f[y]. Since f[x] and f[y] are the two lowest frequencies, f[x] <= f[a] and f[y] <= f[b]. Exchanging x with a (giving T') and then y with b (giving T'') cannot increase the cost, so some optimal tree has x and y as deepest sibling leaves.

Finish the induction proof:

Let T be the tree constructed by Huffman and X any code tree; show C(T) <= C(X). By the greedy-choice argument we may modify X into X', with C(X') <= C(X), so that the two minimum-weight leaves x and y are deepest siblings in X' (T already has this form). Let T'' and X'' be the trees with x and y removed (replaced by their merged parent). Then

C(X'') = C(X') - x - y
C(T'') = C(T) - x - y

By the induction hypothesis C(T'') <= C(X''), so

C(T) = C(T'') + x + y <= C(X'') + x + y = C(X') <= C(X)

What is our next step?

Challenges and how to tackle them?

Static Huffman coding requires two passes over the data: one pass to collect frequency counts of the letters, and a second pass to encode and transmit the letters based on the static tree structure.

Problems: Delay (network communication, file compression applications) Extra disk accesses slow down the algorithm.

We need one-pass methods, in which letters are encoded on the fly.

Dynamic Huffman codes

Algorithm FGK: the next letter of the message is encoded on the basis of a Huffman tree for the previous letters. Encoding length <= 2S + t, where S is the encoding length under a static Huffman code and t is the number of letters in the original message.

Sender and receiver start with the same initial tree and use the same algorithm to modify the tree after each letter is processed; thus they always have equivalent copies of it.

Sibling Property:
A binary tree with p leaves of nonnegative weight is a Huffman tree iff: the p leaves have nonnegative weights w1, . . . , wp; the weight of each internal node is the sum of the weights of its children; and the 2p - 1 nodes can be numbered in non-decreasing order by weight, so that nodes 2j - 1 and 2j are siblings for 1 <= j <= p - 1, and their common parent node is higher in the numbering.
(Figure: a Huffman tree of total weight 32 with leaves a:2, b:3, c:5, d:5, e:6, f:11, numbered so that the sibling property holds.)
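The sibling property can be checked mechanically. A minimal sketch, assuming the tree is given as a list of (weight, parent-number) pairs indexed by node number (this representation and the small example are illustrative, not from the slides):

```python
def has_sibling_property(nodes):
    """nodes[j] holds (weight, parent_number) for node number j + 1;
    the root's parent is 0. Checks the conditions stated above."""
    p = (len(nodes) + 1) // 2                # number of leaves
    weights = [w for w, _ in nodes]
    if weights != sorted(weights):           # non-decreasing by node number
        return False
    for j in range(1, p):
        wl, pl = nodes[2*j - 2]              # node 2j - 1
        wr, pr = nodes[2*j - 1]              # node 2j
        if pl != pr or pl <= 2*j:            # siblings; parent numbered higher
            return False
        if nodes[pl - 1][0] != wl + wr:      # parent weight = sum of children
            return False
    return True

# Huffman tree for weights 1, 2, 3: merge 1+2 -> 3, then 3+3 -> 6.
tree = [(1, 4), (2, 4), (3, 5), (3, 5), (6, 0)]
print(has_sibling_property(tree))  # True
```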

Difficulty
Suppose that M_t = a_{i1}, a_{i2}, . . . , a_{it} has already been processed. a_{i(t+1)} is encoded and decoded using the Huffman tree for M_t. How can this tree be modified quickly to obtain a Huffman tree for M_{t+1}?

E.g., assume t = 32, a_{i(t+1)} = b.


(Figure: the example tree of weight 32 before the update; simply incrementing the weights on b's path yields a tree of weight 33 with b:4 that violates the sibling property, marked X.)

First phase
Begin with the leaf of a_{i(t+1)} as the current node. Repeatedly interchange the contents of the current node (including the subtree rooted there) with those of the highest-numbered node of the same weight, then make the parent of the latter node the new current node. Halt when the root is reached.

Eg. Assume t = 32, a_{i(t+1)} = b.

(Figure: the first-phase interchanges applied to the example tree; the total weight stays 32 while b's leaf and the nodes on its path are swapped with the highest-numbered nodes of equal weight.)

Second phase
We turn this tree into the desired Huffman tree for M_{t+1} by incrementing the weights of a_{i(t+1)}'s leaf and its ancestors by 1.

(Figure: the resulting tree of total weight 33, with b:4, which again satisfies the sibling property.)
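The two phases above can be sketched in Python. This is an illustrative sketch only: the Node class, the `order` list (the sibling-property numbering), and the tiny example tree are assumptions, weights are assumed positive, and a full FGK implementation would also maintain a 0-weight leaf for not-yet-seen symbols:

```python
class Node:
    def __init__(self, weight, symbol=None):
        self.weight, self.symbol = weight, symbol
        self.parent = self.left = self.right = None

def _swap_subtrees(u, v, order):
    # Exchange the subtrees rooted at u and v, and their positions in the
    # sibling-property numbering `order` (u and v are never ancestor-related
    # here, because all weights are positive).
    pu, pv = u.parent, v.parent
    su = 'left' if pu.left is u else 'right'
    sv = 'left' if pv.left is v else 'right'
    setattr(pu, su, v)
    setattr(pv, sv, u)
    u.parent, v.parent = pv, pu
    i, j = order.index(u), order.index(v)
    order[i], order[j] = order[j], order[i]

def fgk_update(order, leaf):
    # First phase: walk up from the leaf, interchanging each visited node
    # with the highest-numbered node of the same weight.
    cur = leaf
    while cur.parent is not None:
        q = next(n for n in reversed(order) if n.weight == cur.weight)
        if q is not cur:
            _swap_subtrees(cur, q, order)
        cur = cur.parent  # the parent of the interchanged position
    # Second phase: increment the weights of the leaf and its ancestors.
    n = leaf
    while n is not None:
        n.weight += 1
        n = n.parent

# Example tree for leaf weights 1, 2, 3 (node numbers = positions in `order`):
a, b, c = Node(1, 'a'), Node(2, 'b'), Node(3, 'c')
p, root = Node(3), Node(6)
p.left, p.right, a.parent, b.parent = a, b, p, p
root.left, root.right, c.parent, p.parent = c, p, root, root
order = [a, b, c, p, root]

fgk_update(order, a)
print([n.weight for n in order])  # [2, 2, 3, 4, 7]
```

A second `fgk_update(order, a)` triggers an actual swap (a exchanges with b, the highest-numbered weight-2 node) before the increments, keeping the numbering in non-decreasing weight order.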

Huffman savings are between 20% and 90%.
Dynamic Huffman coding is optimal and efficient.
The optimal data compression achievable by a character code can always be achieved with a prefix code.
Better compression is possible (depending on the data) using other approaches (e.g., a pattern dictionary).

Conclusions

Thank you for your attention!
