
Mini Project 2 ------ Practice on C programming

Please email your submission to my TA:


Due: 2010-12-15
In computer science and information theory, Huffman coding is an entropy encoding
algorithm used for lossless data compression. The term refers to the use of a
variable-length code table for encoding a source symbol (such as a character in a file)
where the variable-length code table has been derived in a particular way based on the
estimated probability of occurrence for each possible value of the source symbol. It
was developed by David A. Huffman while he was a Ph.D. student at MIT, and
published in the 1952 paper "A Method for the Construction of
Minimum-Redundancy Codes". For more detailed information see
http://en.wikipedia.org/wiki/Huffman_coding.

In this project, we aim to put Huffman coding into practice and use it to implement a
compress/uncompress utility. With this utility, we can compress a regular file, or
uncompress a file that was previously compressed by the same utility.

Basic idea:
1. Scan the file and collect statistics for each character (the number of times it occurs in the file).
2. Build the Huffman tree from these statistics.
3. Compute the Huffman code of each character from the tree built in step 2.
4. Encode the source file: write the Huffman code of each character into the
compressed file. For example:
The binary value of character 'A' is 01000001, which has 8 bits. Suppose the
Huffman code of 'A' is 0110. Then we only need to write 4 bits instead of 8 bits to
represent 'A', so we save four bits. Since each byte has 8 bits, we use these saved 4
bits to store another character's Huffman code. Suppose the Huffman code of 'B' is
110 and the binary value of character 'B' is 01000010. In the original file it takes
two bytes to store 'A' and 'B', but now we only need 7 bits. However, since each byte
has 8 bits, you need to make up one more bit: it is either the first bit of the next
character's Huffman code, or a padding 0 if the end of the file has been reached.
(A bit-packing sketch in C is given after the diagram below.)

Original file:    01000001 01000010 ...
                     'A'      'B'

Compressed file:  0110 110 x ...
                  'A'  'B'   (x = first bit of the next code, or a padding 0 at end of file)
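
The following is a minimal sketch in C of how the bit packing in step 4 could be done. The names (BitWriter, bw_put_bits, bw_flush) are our own and not prescribed by the assignment; the idea is simply to accumulate bits in a byte and write the byte out when it is full, padding the final byte with 0 bits as described above.

#include <stdio.h>

/* Minimal bit-writer sketch for step 4: accumulates bits in a byte and
 * writes the byte out when it is full. Names are illustrative only. */
typedef struct {
    FILE *out;
    unsigned char buf;   /* partially filled byte */
    int nbits;           /* number of bits currently held in buf */
} BitWriter;

/* Append the 'len' lowest bits of 'code', most significant bit first. */
static void bw_put_bits(BitWriter *bw, unsigned code, int len) {
    for (int i = len - 1; i >= 0; i--) {
        bw->buf = (unsigned char)((bw->buf << 1) | ((code >> i) & 1u));
        if (++bw->nbits == 8) {      /* byte is full: write it out */
            fputc(bw->buf, bw->out);
            bw->buf = 0;
            bw->nbits = 0;
        }
    }
}

/* At end of file, pad the last byte with 0 bits, as described above. */
static void bw_flush(BitWriter *bw) {
    if (bw->nbits > 0) {
        bw->buf = (unsigned char)(bw->buf << (8 - bw->nbits));
        fputc(bw->buf, bw->out);
        bw->buf = 0;
        bw->nbits = 0;
    }
}

For the example in the diagram, bw_put_bits(&bw, 0x6, 4) emits 0110 for 'A', bw_put_bits(&bw, 0x6, 3) emits 110 for 'B', and bw_flush pads the last byte with a 0 if the file ends there.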

When you create the compressed file, you also need to store the encoding information
in it, so that it can be used when uncompressing the file.

When uncompressing the file, first read the encoding information and reconstruct the
Huffman tree, then decode the file based on that tree.
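
One simple way to store the encoding information (a suggestion, not a required format) is to write the 256 character counts at the start of the compressed file; the decompressor reads them back and rebuilds the identical tree before decoding. A sketch in C, with illustrative names:

#include <stdio.h>

/* Sketch of one possible header format: the 256 character counts are
 * written raw (native unsigned long) at the start of the compressed file.
 * This layout is an assumption for illustration, not a requirement. */
static int write_header(FILE *out, const unsigned long freq[256]) {
    return fwrite(freq, sizeof freq[0], 256, out) == 256 ? 0 : -1;
}

/* The decompressor reads the same counts back and rebuilds the tree. */
static int read_header(FILE *in, unsigned long freq[256]) {
    return fread(freq, sizeof freq[0], 256, in) == 256 ? 0 : -1;
}

Writing the raw counts costs a few hundred bytes per file but keeps the format trivial; storing code lengths or a serialized tree would be more compact.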

Huffman Code: Example

The following example is based on a data source that uses a set of five different symbols.
The symbols' frequencies are:

Symbol   Frequency
A        24
B        12
C        10
D        8
E        8
----> total: 62 symbols × 3 bits per fixed-length code word = 186 bits

The two rarest symbols 'E' and 'D' are merged first, followed by 'C' and 'B'. The
new parent nodes have frequencies 16 and 22 respectively and are merged in the
next step. The resulting node and the remaining symbol 'A' are attached to the
root node, which is created in the final step.

Code Tree according to Huffman


Symbol   Frequency   Code   Code Length   Total Length
A        24          0      1             24
B        12          100    3             36
C        10          101    3             30
D        8           110    3             24
E        8           111    3             24
-------------------------------------------------------
Total with a fixed 3-bit code: 186 bits    Total with the Huffman code: 138 bits
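
A sketch in C of the tree construction described above, using a simple selection of the two smallest nodes in each round (a priority queue would be faster, but is unnecessary for at most 256 symbols). All names here (Node, new_node, build_tree, assign_codes) are illustrative, not prescribed by the assignment.

#include <stdlib.h>
#include <string.h>

typedef struct Node {
    unsigned long freq;
    int symbol;                    /* byte value, or -1 for internal nodes */
    struct Node *left, *right;
} Node;

static Node *new_node(unsigned long freq, int symbol, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);   /* error checking omitted in this sketch */
    n->freq = freq; n->symbol = symbol; n->left = l; n->right = r;
    return n;
}

/* Repeatedly merge the two lowest-frequency nodes until one root remains. */
static Node *build_tree(Node **nodes, int count) {
    while (count > 1) {
        int a = 0, b = 1;                       /* indices of the two smallest */
        if (nodes[b]->freq < nodes[a]->freq) { a = 1; b = 0; }
        for (int i = 2; i < count; i++) {
            if (nodes[i]->freq < nodes[a]->freq)      { b = a; a = i; }
            else if (nodes[i]->freq < nodes[b]->freq) { b = i; }
        }
        Node *parent = new_node(nodes[a]->freq + nodes[b]->freq, -1,
                                nodes[a], nodes[b]);
        nodes[a] = parent;                      /* parent replaces one child */
        nodes[b] = nodes[--count];              /* last node fills the other slot */
    }
    return nodes[0];
}

/* Walk the tree: a left edge adds '0' to the code, a right edge adds '1'. */
static void assign_codes(const Node *n, char *prefix, int depth,
                         char codes[256][64]) {
    if (n->symbol >= 0) {                       /* leaf: record its code string */
        prefix[depth] = '\0';
        strcpy(codes[n->symbol], depth ? prefix : "0");
        return;
    }
    prefix[depth] = '0'; assign_codes(n->left,  prefix, depth + 1, codes);
    prefix[depth] = '1'; assign_codes(n->right, prefix, depth + 1, codes);
}

Calling assign_codes(root, prefix, 0, codes) with a zero-initialized char codes[256][64] and a scratch char prefix[64] fills in one code string per symbol. On the frequencies above it reproduces the code lengths 1, 3, 3, 3, 3 and the 138-bit total, although the exact 0/1 assignment may differ from the table depending on how ties are broken.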

Basic Requirement:
1. Compute the Huffman code of each character based on the character statistics of a file.
2. Encode the file using these Huffman codes; you may output the plain Huffman code of
each character (as '0' and '1' characters).
3. Decode an "encoded" file.
Example:
Text:
Abbdc
Huffman code:
'A': 10
'b': 01
'c': 11
'd': 00
Your "compressed" file should show:
1001010011

If the input is 101010110000,
your output should be:
AAAcdd
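
For this basic requirement, where both the code table and the "compressed" data are plain '0'/'1' characters, decoding can be sketched as repeated prefix matching against the code table. The function name and the codes[256][64] layout are assumptions for illustration only.

#include <stdio.h>
#include <string.h>

/* Sketch: decode a '0'/'1' string using a code table such as
 * codes['A'] = "10", codes['b'] = "01", codes['c'] = "11", codes['d'] = "00"
 * (unused entries are empty strings). */
static void decode_string(const char codes[256][64], const char *bits, FILE *out) {
    size_t pos = 0, total = strlen(bits);
    while (pos < total) {
        int matched = 0;
        for (int c = 0; c < 256 && !matched; c++) {
            size_t len = strlen(codes[c]);
            if (len > 0 && strncmp(bits + pos, codes[c], len) == 0) {
                fputc(c, out);                  /* emit the decoded character */
                pos += len;
                matched = 1;
            }
        }
        if (!matched)
            break;                              /* no code matches: malformed input */
    }
}

Because Huffman codes are prefix-free, at most one code can match at any position, so the scan is unambiguous; with the table above, decode_string(codes, "101010110000", stdout) prints AAAcdd, matching the expected output.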

Bonus (extra 10 points on your overall grade):


1. Implement a real compress/uncompress utility that can compress and uncompress
actual files.
2. Compare the compression ratios of different types of files, for example text files,
image files, and so on (at least three types).
