
Huffman Coding: An Application of Binary Trees and Priority Queues (Probability-Based) for Encoding and Compression of Data

Purpose of Huffman Coding


Proposed by Dr. David A. Huffman in 1952 in "A Method for the Construction of Minimum-Redundancy Codes"

Applicable to many forms of data transmission


Our example: text files

The Basic Algorithm


Huffman coding is a form of statistical coding. Not all characters occur with the same frequency! Yet in a fixed-length encoding all characters are allocated the same amount of space:
1 char = 1 byte, whether it is a very common character or a very rare one.

The Basic Algorithm


Are there any savings in tailoring codes to the frequency of characters? Code word lengths are no longer fixed as in ASCII: they vary, and they are shorter for the more frequently used characters.

The Basic Algorithm


1. Scan the text to be compressed and tally the occurrences of all characters.
2. Sort or prioritize the characters based on their number of occurrences in the text.
3. Build the Huffman code tree based on the prioritized list.
4. Perform a traversal of the tree to determine all code words.
5. Scan the text again and create the new file using the Huffman codes.

Building a Tree
Scan the original text

Consider the following short text:


Eerie eyes seen near lake.

Count up the occurrences of all characters in the text

Building a Tree
Scan the original text

Eerie eyes seen near lake. What characters are present?


E e r i space y s n a l k .

Building a Tree
Scan the original text

Eerie eyes seen near lake.


What is the frequency of each character in the text?

Char    Freq.        Char    Freq.        Char    Freq.
E       1            y       1            k       1
e       8            s       2            .       1
r       2            n       2
i       1            a       2
space   4            l       1
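This tally is mechanical; here is a minimal sketch in Python (the language is our choice for illustration, not the slides'):

```python
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)           # maps each character to its occurrence count

for ch, count in freq.most_common():
    print(repr(ch), count)     # 'e' 8, ' ' 4, 'r' 2, ... '.' 1
```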

Building a Tree
Prioritize characters

Create a binary tree node for each character, holding the character and its frequency. Place the nodes in a priority queue:
the lower the occurrence count, the higher the priority in the queue.

Building a Tree
The queue after inserting all nodes

E:1   i:1   y:1   l:1   k:1   .:1   r:2   s:2   n:2   a:2   sp:4   e:8

(Null pointers are not shown.)

Building a Tree
While the priority queue contains two or more nodes:
Create a new node.
Dequeue a node and make it the left subtree.
Dequeue the next node and make it the right subtree.
The frequency of the new node equals the sum of the frequencies of its left and right children.
Enqueue the new node back into the queue.
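A minimal sketch of this loop in Python, using heapq as the priority queue (the Node layout and the tie-breaking counter are our assumptions, not prescribed by the slides):

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right

def build_tree(text):
    # One leaf per character; the running index breaks frequency ties so
    # that heapq never has to compare Node objects directly.
    heap = [(freq, i, Node(freq, char=ch))
            for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:                        # two or more nodes left
        f1, _, left = heapq.heappop(heap)       # dequeue -> left subtree
        f2, _, right = heapq.heappop(heap)      # dequeue next -> right subtree
        heapq.heappush(heap,                    # enqueue the merged node
                       (f1 + f2, tick, Node(f1 + f2, left=left, right=right)))
        tick += 1
    return heap[0][2]                           # root; freq == len(text)
```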

Building a Tree

Repeatedly merging the two lowest-frequency nodes produces the following sequence of steps:

1. Dequeue E:1 and i:1; enqueue their new parent node of frequency 2.
2. Dequeue y:1 and l:1; enqueue a new node of frequency 2.
3. Dequeue k:1 and .:1; enqueue a new node of frequency 2.
4. Dequeue r:2 and s:2; enqueue a new node of frequency 4.
5. Dequeue n:2 and a:2; enqueue a new node of frequency 4.
6. Dequeue the (E, i) and (y, l) nodes; enqueue a new node of frequency 4.
7. Dequeue the (k, .) node and sp:4; enqueue a new node of frequency 6.

What is happening to the characters with a low number of occurrences? They sink toward the bottom of the tree, so they end up with the longest code words.

8. Dequeue the (r, s) and (n, a) nodes; enqueue a new node of frequency 8.
9. Dequeue the (E, i, y, l) node of frequency 4 and the node of frequency 6; enqueue a new node of frequency 10.
10. Dequeue e:8 and the node of frequency 8; enqueue a new node of frequency 16.
11. Dequeue the nodes of frequency 10 and 16; enqueue the root node of frequency 26.

[Diagrams: the queue and the partial trees after each step]

After enqueueing this last node there is only one node left in the priority queue.

Building a Tree

Dequeue the single node left in the queue. This tree contains the new code words for each character.

[Diagram: the final Huffman tree, with leaves E, i, y, l, k, . (frequency 1 each), r, s, n, a (frequency 2 each), sp (4), and e (8)]

The frequency of the root node should equal the number of characters in the text: "Eerie eyes seen near lake." has 26 characters.

Encoding the File

Traverse Tree for Codes

Perform a traversal of the tree to obtain the new code words. Going left is a 0, going right is a 1; a code word is complete only when a leaf node is reached.

[Diagram: the Huffman tree with its edges labeled 0 (left) and 1 (right)]
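The traversal is a short recursive walk over the tree built by the earlier sketch (same assumptions):

```python
def assign_codes(node, prefix="", table=None):
    # Left edges contribute '0', right edges '1'; a code word is
    # complete only when a leaf (a node holding a character) is reached.
    if table is None:
        table = {}
    if node.char is not None:              # leaf node
        table[node.char] = prefix or "0"   # degenerate one-symbol alphabet
    else:
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table
```

Tie-breaking in the priority queue determines the exact code words; any choice yields the same total encoded length.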

Encoding the File

Traverse Tree for Codes

Char    Code
E       0000
i       0001
y       0010
l       0011
k       0100
.       0101
space   011
e       10
r       1100
s       1101
n       1110
a       1111

[Diagram: the Huffman tree from which these code words are read]

Encoding the File

Rescan the text and encode the file using the new code words:

Eerie eyes seen near lake.

0000101100000110011
100010101101011
110110101110011
11101011111100011
001111110100100101

(84 bits in total)

Why is there no need for a separator character? Because the code has the prefix property: no code word is a prefix of any other, so each code word is recognized as soon as its last bit arrives.
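Encoding is then one table lookup per character, with the code words simply concatenated; continuing the earlier sketches:

```python
def encode(text, table):
    # No separator is needed: the code has the prefix property.
    return "".join(table[ch] for ch in text)

root = build_tree("Eerie eyes seen near lake.")
table = assign_codes(root)
bits = encode("Eerie eyes seen near lake.", table)
print(len(bits))   # 84 bits, versus 8 * 26 = 208 for one-byte ASCII
```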

Encoding the File

Results

Have we made things any better? The Huffman code takes 84 bits to encode the text, whereas ASCII would take 8 * 26 = 208 bits. If instead a modified fixed-length code were used, 4 bits per character would suffice for the 12 distinct characters, for a total of 4 * 26 = 104 bits; compared with that, the savings are not as great.

Decoding the File

How does the receiver know what the codes are? There are several possibilities.

A tree can be constructed for each text file and sent along with it, either by sending the lookup tree directly or by sending the frequency of each character so the receiver can rebuild the tree. This is a big hit on compression, especially for smaller files.

Alternatively, the tree can be predetermined for the application, based on a statistical analysis of typical text files or file types, or on previously stored records.

Either way, the final data transmission is bit-based instead of byte-based.

Decoding the File

Once the receiver has the tree, it scans the incoming bit stream: 0 means go left, 1 means go right. Each time a leaf is reached, its character is output and scanning resumes at the root.

[Diagram: the Huffman tree being walked while reading the encoded bit stream above]
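The same walk, continuing the earlier sketches:

```python
def decode(bits, root):
    out, node = [], root
    for b in bits:
        node = node.left if b == "0" else node.right  # 0: go left, 1: go right
        if node.char is not None:                     # reached a leaf:
            out.append(node.char)                     # emit its character
            node = root                               # and restart at the root
    return "".join(out)

assert decode(bits, root) == "Eerie eyes seen near lake."
```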

Huffman coding
Binary, non-binary, and adaptive variants exist.

A related technique is Shannon-Fano coding.

Arithmetic Coding

Model-Based Approach

The model is a way of calculating, in any given context, the distribution of probabilities for the next input symbol. The decoder must have access to the same model and must be able to regenerate the same input string from the encoded string. Arithmetic coding blends the message, encoding a frame or set of symbols together.

The Basic Algorithm

1. We begin with a "current interval" [L, H) initialized to [0, 1).
2. For each symbol, we perform two steps (see the sketch after this list):
   (a) We subdivide the current interval into subintervals, one for each possible alphabet symbol. The size of a symbol's subinterval is proportional to the estimated probability that the symbol will be the next symbol in the file, according to the model of the input.
   (b) We select the subinterval corresponding to the symbol that actually occurs next in the file and make it the new current interval.
3. We output enough bits to distinguish the final current interval from all other possible final intervals.
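A minimal float-based sketch of steps 1 and 2 in Python (production coders use incremental integer arithmetic; the model here is the fixed table from the example further below):

```python
# Model: symbol -> [low, high) subrange of the unit interval.
MODEL = {"a": (0.0, 0.2), "e": (0.2, 0.5), "i": (0.5, 0.6),
         "o": (0.6, 0.8), "u": (0.8, 0.9), "!": (0.9, 1.0)}

def arith_encode(message):
    low, high = 0.0, 1.0                 # current interval [L, H)
    for sym in message:
        s_low, s_high = MODEL[sym]
        width = high - low               # subdivide proportionally and select
        high = low + width * s_high      # the subinterval of the symbol that
        low = low + width * s_low        # actually occurs as the new interval
    return low, high

# Expect roughly [0.23354, 0.2336), up to float rounding.
print(arith_encode("eaii!"))
```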

Encoding
The message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, its interval becomes smaller, and the number of bits needed to specify that interval grows. The reduction of the interval's size is governed by the symbol's probability (generated by the model): a more common symbol reduces the range less and therefore adds fewer bits to the message.

A simple example: encoding the message eaii!

The model:

Symbol   Probability   Range
a        0.2           [0, 0.2)
e        0.3           [0.2, 0.5)
i        0.1           [0.5, 0.6)
o        0.2           [0.6, 0.8)
u        0.1           [0.8, 0.9)
!        0.1           [0.9, 1)

The current interval narrows as each symbol is encoded:

After e:   [0.2, 0.5)
After a:   [0.2, 0.26)
After i:   [0.23, 0.236)
After i:   [0.233, 0.2336)
After !:   [0.23354, 0.2336)

[Diagram: the unit interval repeatedly subdivided among a, e, i, o, u, !]

The size of the final range is

0.2336 - 0.23354 = 0.00006,

which is exactly the product of the probabilities of the five symbols in the message eaii!:

(0.3) * (0.2) * (0.1) * (0.1) * (0.1) = 0.00006.

It takes 5 decimal digits to encode the message.


According to Shannon, the best compression code has an output length containing a contribution of -log p bits from the encoding of each symbol whose probability of occurrence is p.

The entropy of eaii! (using base-10 logarithms, so the result is in decimal digits) is

-log 0.3 - log 0.2 - log 0.1 - log 0.1 - log 0.1 = -log 0.00006 = 4.22
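The same figure falls out of a one-line computation:

```python
import math

# Entropy of "eaii!" under the model, in decimal digits (base-10 logs).
digits = -sum(math.log10(p) for p in (0.3, 0.2, 0.1, 0.1, 0.1))
print(round(digits, 2))   # 4.22
```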

A few remarks..
5 decimal digits seems a lot to encode a message comprising 4 vowels! Our example ended up expanding rather than compressing. Different models will give different entropies.

Decoding
The decoder receives the final range, or a value from that range, and finds the corresponding symbol. The decoder deduces the message from the first symbol to the last, according to which symbol's probability interval the value belongs to.

Each message ends with an EOF symbol to avoid ambiguity: otherwise a single number such as 0.0 could represent any of a, aa, aaa, ...

Back to the example

The decoder gets the final range: [0.23354, 0.2336). The range lies entirely within the space the model allocates to e, so the first character was e! Initially the interval is [0, 1); after e it is [0.2, 0.5).

Now the decoder can simulate the operation of the encoder, narrowing the interval symbol by symbol (e, a, ...) until it reaches [0.23354, 0.2336).
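That simulation can be sketched in the same float-based style (the message length stands in here for an explicit EOF symbol):

```python
def arith_decode(value, length):
    out, low, high = [], 0.0, 1.0
    for _ in range(length):
        width = high - low
        target = (value - low) / width           # position inside [0, 1)
        for sym, (s_low, s_high) in MODEL.items():
            if s_low <= target < s_high:         # the symbol whose range holds it
                out.append(sym)
                high = low + width * s_high      # narrow the interval exactly
                low = low + width * s_low        # as the encoder did
                break
    return "".join(out)

print(arith_decode(0.23355, 5))   # -> eaii!
```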

Two major difficulties


The shrinking current interval requires the use of high-precision arithmetic. No output is produced until the entire message has been encoded.
