November 2003
Sanath Jayasena 10-1
What is Data Compression?
• Transformation of data into a more
compact form
– takes less space than before
Simple Example
• Suppose ASCII code of a char. is 1 byte
• Suppose we have a text file containing
one hundred instances of ‘a’
– File size would be about 100 bytes
• Let us store this as “100a” in a new file to
convey the same information
– New file size would be 4 bytes
– New file is 4/100 of the original size: a 96% saving
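The "100a" idea is a simple form of run-length encoding. A minimal Python sketch follows, assuming the text itself contains no digit characters (otherwise counts and data could not be told apart). The names `rle_encode` and `rle_decode` are illustrative and do not come from the lecture.

```python
from itertools import groupby
import re

def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with '<count><char>'."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

def rle_decode(encoded: str) -> str:
    """Invert rle_encode; assumes the original text had no digits."""
    pairs = re.findall(r"(\d+)(\D)", encoded)
    return "".join(ch * int(count) for count, ch in pairs)

original = "a" * 100
encoded = rle_encode(original)
print(encoded, len(encoded))            # 100a 4
print(rle_decode(encoded) == original)  # True
```

Note that this scheme only pays off when runs are long: a text with no repeated neighbours would double in size.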
Lossless Data Compression
• Last example shows “lossless”
compression
– Can retrieve original data by decompression
• Lossless compression used when data
integrity is important
• Example software and formats
– WinZip, gzip, compress, bzip (tools); GIF (image format)
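The lossless round trip can be checked directly. The sketch below uses Python's standard `zlib` module, which implements the same DEFLATE method that gzip uses; the 100-byte input is just the earlier example, chosen for illustration.

```python
import zlib

original = b"a" * 100                   # the text file from the simple example
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))   # compressed size is far below 100
print(restored == original)             # True: every byte is recovered
```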
Lossy Data Compression
• “Lossy” means the original cannot be retrieved exactly
– Reduces size by permanently eliminating
certain information
– After decompression, only part of the
original information remains (but the user
may not notice the difference)
• When can we use lossy compression?
– For audio, images, video
– E.g., jpeg, mpeg
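A toy illustration of the lossy idea, assuming 8-bit samples (values 0 to 255) that are quantized to 16 levels so each fits in 4 bits. This is only a sketch of "permanently eliminating information"; real codecs such as JPEG and MPEG are far more elaborate, and the function names here are hypothetical.

```python
def quantize(samples, step=16):
    """Lossy step: keep only which bucket of width `step` each sample is in."""
    return [s // step for s in samples]

def reconstruct(levels, step=16):
    """Approximate the originals using the midpoint of each bucket."""
    return [q * step + step // 2 for q in levels]

samples = [0, 7, 100, 103, 250, 255]
levels = quantize(samples)      # each level needs 4 bits instead of 8
approx = reconstruct(levels)
print(levels)                   # [0, 0, 6, 6, 15, 15]
print(approx)                   # [8, 8, 104, 104, 248, 248]
```

The reconstructed values are close to the originals but not equal to them, and no decoder can recover the difference.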
Codes
• Ways to represent information
• The code for a character is a “codeword”
• We consider binary codes
– Each character represented by a unique
binary codeword
• Fixed-length coding
– Length of codeword of each character same
– E.g., ASCII, Unicode
Fixed-Length Coding
• Suppose there are n characters
• What is the minimum number of bits
needed for fixed-length coding?
$\lceil \log_2 n \rceil$
• Example: {a, b, c, d, e}; 5 characters
$\lceil \log_2 5 \rceil = \lceil 2.32\ldots \rceil = 3$ bits per character
– We can have codewords: a=000, b=001,
c=010, d=011, e=100
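The codeword assignment above can be generated mechanically. A minimal sketch, assuming the characters are numbered in the order given and each number is written in binary with leading zeros; `fixed_length_code` is an illustrative name.

```python
from math import ceil, log2

def fixed_length_code(alphabet):
    """Assign each symbol a distinct codeword of ceil(log2 n) bits."""
    bits = max(1, ceil(log2(len(alphabet))))   # at least 1 bit, even if n = 1
    return {sym: format(i, f"0{bits}b") for i, sym in enumerate(alphabet)}

code = fixed_length_code("abcde")
print(code)  # {'a': '000', 'b': '001', 'c': '010', 'd': '011', 'e': '100'}
```

With 3 bits there are 8 possible codewords, so three of them (101, 110, 111) go unused for this alphabet.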
Variable-Length Coding
• Length of codewords may differ from character
to character
• Frequent characters get short codewords
• Infrequent ones get long codewords
• Example
Character   a    b    c    d    e     f
Frequency   46   13   12   16   8     5
Codeword    0    101  100  111  1101  1100
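To see what the variable-length code in the example buys, the sketch below computes its average codeword length and encodes and decodes a short string. The frequencies and codewords are taken from the example; the helper names and the test string "face" are illustrative.

```python
freq = {"a": 46, "b": 13, "c": 12, "d": 16, "e": 8, "f": 5}
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def encode(text):
    return "".join(code[ch] for ch in text)

def decode(bits):
    """No codeword is a prefix of another, so emit a symbol on first match."""
    inverse = {cw: ch for ch, cw in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)

total = sum(freq.values())
avg_bits = sum(freq[ch] * len(code[ch]) for ch in freq) / total
print(avg_bits)                 # 2.21
print(encode("face"))           # 110001001101
print(decode(encode("face")))   # face
```

A fixed-length code for six characters needs 3 bits per character, so this code averages 2.21 bits against 3.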
Variable-Length Coding …contd
Entropy …contd
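Character   a   b   c   d   e
Frequency   4   5   2   1   1

$$H = -\left( \frac{4}{13}\log_2\frac{4}{13} + \frac{5}{13}\log_2\frac{5}{13} + \frac{2}{13}\log_2\frac{2}{13} + \frac{1}{13}\log_2\frac{1}{13} + \frac{1}{13}\log_2\frac{1}{13} \right)$$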
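The entropy for these frequencies can be evaluated numerically. A short sketch, assuming the usual definition in which each character's probability is its frequency divided by the total of 13:

```python
from math import log2

freq = {"a": 4, "b": 5, "c": 2, "d": 1, "e": 1}
total = sum(freq.values())   # 13

# H = -sum(p * log2(p)) over all characters, with p = frequency / total
entropy = -sum((f / total) * log2(f / total) for f in freq.values())
print(round(entropy, 2))     # 2.04
```

So about 2.04 bits per character on average, compared with the 3 bits per character that a fixed-length code needs for five characters.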
Huffman Coding Algorithm
• Huffman invented a greedy method to
construct an optimal prefix-free variable-
length code
– Code based on frequency of occurrence
• Optimal code given by a full binary tree
– Every internal node has 2 children
– If |C| is the size of the alphabet, there are |C|
leaves and |C|-1 internal nodes
Huffman Coding Algorithm …contd
Huffman Algorithm
Input: Alphabet C and frequencies f [ ]
Result: Optimal coding tree for C
Character   a    b    c    d    e    f
Frequency   45   13   12   16   9    5
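A common way to realise the greedy construction is with a min-priority queue: repeatedly remove the two least frequent subtrees and merge them until one tree remains. The sketch below is one such implementation using Python's `heapq`, not necessarily the pseudocode from the lecture. It assumes the characters are strings (tuples are reserved for internal nodes), and the left = 0, right = 1 labelling is a convention chosen here.

```python
import heapq
from itertools import count

def huffman_code(freq):
    """Build a prefix-free code by repeatedly merging the two rarest subtrees."""
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent subtree
        f2, _, right = heapq.heappop(heap)   # second least frequent subtree
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    _, _, root = heap[0]

    code = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            code[node] = prefix or "0"       # single-symbol alphabet gets "0"
    walk(root, "")
    return code

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = huffman_code(freq)
cost = sum(freq[ch] * len(code[ch]) for ch in freq)
print(code)
print(cost)   # 224 bits in total for these 100 characters
```

Each of the |C|-1 merges creates one internal node, which matches the leaf and internal-node counts of a full binary tree.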
• Next lecture
– Dynamic Programming