Data Compression
Building a Tree

Scan the original text, for example: "Eerie eyes seen near lake."
Count the frequency of each character:

Char:  E  i  y  l  k  .  r  s  n  a  sp  e
Freq:  1  1  1  1  1  1  2  2  2  2  4   8
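The frequency scan can be sketched in a few lines of Python using the standard library's `Counter` (a minimal sketch of the counting step, not part of the original slides):

```python
# Tally character frequencies for the example text
# "Eerie eyes seen near lake."
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)

# 'e' is the most common character (8 occurrences),
# followed by the space (4); the rest occur once or twice.
print(freq.most_common())
```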
Building a Tree

Prioritize characters:
Create a binary tree node for each character, storing the character and its frequency.
Place the nodes in a priority queue.
The lower the frequency, the higher the priority in the queue.
Building a Tree

The queue after inserting all nodes (lowest frequency first):

E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8
Building a Tree

While the priority queue contains two or more nodes:
- Create a new node.
- Dequeue a node and make it the left subtree.
- Dequeue the next node and make it the right subtree.
- The frequency of the new node equals the sum of the frequencies of its left and right children.
- Enqueue the new node back into the queue.
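The loop above can be sketched with Python's `heapq` as the priority queue (a sketch, not the slides' own code; tie-breaking between equal frequencies may differ from the diagrams, but the sequence of merged frequencies is the same):

```python
# Build a Huffman tree by repeatedly merging the two
# lowest-frequency nodes. Heap entries are (frequency,
# tiebreak, node); a node is a character or a (left, right) pair.
import heapq
from itertools import count

def build_huffman_tree(freqs):
    tiebreak = count()  # unique counter so the heap never compares nodes
    heap = [(f, next(tiebreak), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    merge_order = []  # frequencies of the internal nodes, in creation order
    while len(heap) >= 2:
        f1, _, left = heapq.heappop(heap)   # lowest frequency
        f2, _, right = heapq.heappop(heap)  # next lowest
        merge_order.append(f1 + f2)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    return heap[0], merge_order

freqs = {"E": 1, "i": 1, "y": 1, "l": 1, "k": 1, ".": 1,
         "r": 2, "s": 2, "n": 2, "a": 2, " ": 4, "e": 8}
root, merges = build_huffman_tree(freqs)
print(merges)   # internal-node frequencies created, in order
print(root[0])  # 26, the total character count
```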
Building a Tree

Repeatedly dequeuing the two lowest-frequency nodes and enqueueing their parent produces, step by step:

1. E(1) + i(1) -> 2
2. y(1) + l(1) -> 2
3. k(1) + .(1) -> 2
4. (E,i):2 + (y,l):2 -> 4
5. r(2) + s(2) -> 4
6. n(2) + a(2) -> 4
7. (k,.):2 + sp(4) -> 6
8. (r,s):4 + (n,a):4 -> 8
9. (E,i,y,l):4 + (k,.,sp):6 -> 10
10. e(8) + (r,s,n,a):8 -> 16
11. (E,i,y,l,k,.,sp):10 + (e,r,s,n,a):16 -> 26
After enqueueing this node there is only one node left in the priority queue.
Building a Tree

Dequeue the single node left in the queue. This tree contains the new code words for each character: the path from the root to a leaf gives that character's code (0 for left, 1 for right). The root's frequency, 26, equals the number of characters in the original text.
Char   E     i     y     l     k     .     space  e   r     s     n     a
Code   0000  0001  0010  0011  0100  0101  011    10  1100  1101  1110  1111
This is a big win for compression, especially for smaller files, because the final data transmission is bit-based instead of byte-based.
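The gain is easy to quantify from the code table above (a quick check, assuming 8 bits per character for the uncompressed text):

```python
# Encoded size = sum over characters of frequency x code length.
freqs = {"E": 1, "i": 1, "y": 1, "l": 1, "k": 1, ".": 1,
         " ": 4, "e": 8, "r": 2, "s": 2, "n": 2, "a": 2}
codes = {"E": "0000", "i": "0001", "y": "0010", "l": "0011",
         "k": "0100", ".": "0101", " ": "011", "e": "10",
         "r": "1100", "s": "1101", "n": "1110", "a": "1111"}

encoded_bits = sum(freqs[ch] * len(codes[ch]) for ch in freqs)
raw_bits = sum(freqs.values()) * 8  # one byte per character

print(encoded_bits, "bits vs", raw_bits, "bits")  # 84 vs 208
```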
Huffman coding
Variants: binary, non-binary, adaptive
Arithmetic Coding

The Model: a way of calculating, in any given context, the probability distribution for the next input symbol. The decoder must have access to the same model and must be able to regenerate the same input string from the encoded string. Symbols are blended together: a whole message, frame, or set of symbols is encoded as one unit.
Encoding

The message is represented by an interval of real numbers between 0 and 1. As the message becomes longer, its interval becomes smaller, and the number of bits needed to specify that interval grows. Each symbol reduces the interval's size in proportion to its probability (generated by the model), so a more common symbol narrows the range less and adds fewer bits to the message.
Symbol   Probability   Range
a        .2            [0.0, 0.2)
e        .3            [0.2, 0.5)
i        .1            [0.5, 0.6)
o        .2            [0.6, 0.8)
u        .1            [0.8, 0.9)
!        .1            [0.9, 1.0)
Encoding the message "eaii!" narrows the interval step by step:

Start   [0, 1)
e       [0.2, 0.5)
a       [0.2, 0.26)
i       [0.23, 0.236)
i       [0.233, 0.2336)
!       [0.23354, 0.2336)
The width of the final interval, 0.2336 − 0.23354 = 0.00006, is exactly the product of the probabilities of the five symbols in the message "eaii!":

(0.3) × (0.2) × (0.1) × (0.1) × (0.1) = 0.00006

According to Shannon, the information content of the message is

−log 0.3 − log 0.2 − log 0.1 − log 0.1 − log 0.1 = −log 0.00006 ≈ 4.22 decimal digits
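The 4.22 figure is a one-line check (base-10 logarithms, since we are counting decimal digits):

```python
# Information content in decimal digits: sum of -log10(p)
# over the symbols of "eaii!".
import math

probs = [0.3, 0.2, 0.1, 0.1, 0.1]  # e, a, i, i, !
digits = sum(-math.log10(p) for p in probs)
print(round(digits, 2))  # 4.22
```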
A few remarks

5 decimal digits seems a lot to encode a message comprising 4 vowels and a terminator! Our example ended up expanding rather than compressing. Different models will give different entropies.
Decoding

The decoder receives the final range, or a value from that range, and finds the corresponding symbols. It deduces the message from the first symbol to the last, according to which symbol's probability interval the value belongs to at each step.

Each message ends with an EOF symbol to avoid ambiguity: otherwise a single number such as 0.0 could represent any of a, aa, aaa, ...
For example, any value in the final interval [0.23354, 0.2336) decodes back to e, a, i, i, !.