
Text Data Compression Algorithms
Hardeep Singh
Vatsal Thakar
Kalyan Boppana

Table of Contents
Introduction
Compression Methods
Algorithms
Comparison of Algorithms
Conclusion

Introduction
Storing and sending compressed bits
Data Compression Methods:

Compression Methods
Lossless
Data remains intact after the file is compressed and decompressed
Uses
Spreadsheet Files
Financial Data
Lossy
Compresses by discarding information judged less important
Decompression yields an approximation, not the exact original file
Uses
Image
Video and Sound
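
The lossless property above can be checked with a round trip. This is a minimal sketch using Python's standard `zlib` module (DEFLATE, which combines LZ77 with Huffman coding) purely as a stand-in for any lossless compressor; the function name `lossless_round_trip` and the sample record are illustrative assumptions, not part of the original slides.

```python
import zlib


def lossless_round_trip(data: bytes):
    """Compress and decompress `data`; return (compressed size, restored bytes)."""
    compressed = zlib.compress(data)
    restored = zlib.decompress(compressed)
    return len(compressed), restored


if __name__ == "__main__":
    # Hypothetical, highly repetitive "financial" records.
    sample = b"2016-11-07,ACME,100.00\n" * 200
    size, restored = lossless_round_trip(sample)
    print(len(sample), "bytes ->", size, "bytes")
    print("identical after round trip:", restored == sample)
```

A lossy codec would fail the final equality check by design, which is why it suits images, video, and sound but not spreadsheets or financial data.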

Algorithm
Huffman Coding
Shannon-Fano Code
Arithmetic

Huffman Coding
Developed by David Huffman
A greedy algorithm that constructs an optimal prefix code,
called a Huffman code
One of the most commonly used compression algorithms
Operating system programs

Two Steps
The symbols that occur most often get shorter codes than the
symbols that occur least often.
Following from step 1, the two symbols that occur least often get codes of the same length.

ASCII to Binary

Implementation
Given Input:
Count the number of occurrences of each character
Arrange the characters from most frequent to least
frequent
Assign each character a leaf node and push it into a priority queue
Build a binary tree by repeatedly combining the two least frequent
nodes
Repeat until only the root node (parent) remains
Label right children with 1 and left children with 0
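
The steps above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function name `huffman_codes`, the use of `heapq` as the priority queue, and the tie-breaking order are all assumptions made here.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a dict mapping each character of `text` to its Huffman bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}

    # Heap entries are (weight, tie_breaker, tree). A tree is either a
    # character (leaf) or a (left, right) tuple (internal node). The unique
    # tie_breaker keeps Python from ever comparing two trees.
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)

    # Repeatedly combine the two least frequent nodes until one root remains.
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (left, right)))
        next_id += 1

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # left child labelled 0
            walk(node[1], prefix + "1")  # right child labelled 1
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


if __name__ == "__main__":
    for symbol, code in sorted(huffman_codes("abracadabra").items()):
        print(repr(symbol), code)
```

The exact bit patterns depend on how ties between equal frequencies are broken, but the total encoded length is the same for every valid Huffman tree of a given input.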

Algorithm

Example

Coding

Encoding

Decoding
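
Encoding and decoding with a prefix code can be sketched as below. The code table `EXAMPLE_CODES` is a hypothetical prefix code chosen for illustration, not the table from the original slides; the helper names `encode` and `decode` are likewise assumptions. Because no codeword is a prefix of another, the decoder can emit a symbol as soon as the bits read so far match a codeword.

```python
# Hypothetical prefix code: no codeword is a prefix of any other.
EXAMPLE_CODES = {"a": "0", "b": "10", "c": "110", "d": "111"}


def encode(text, codes):
    """Concatenate the codeword of every character in `text`."""
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever a codeword matches."""
    inverse = {code: ch for ch, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete codeword")
    return "".join(out)


if __name__ == "__main__":
    bits = encode("abacad", EXAMPLE_CODES)
    print(bits)
    print(decode(bits, EXAMPLE_CODES))
```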

Shannon-Fano
Developed by Claude Shannon and Robert Fano
Very similar to Huffman coding

Steps
Count the occurrences of each symbol and determine its probability
Divide the list into two parts whose total probabilities are as close to equal as possible

Complexity
Suboptimal

Implementation
1. Parse the input and count the occurrences of each
symbol
2. Determine the probability of each symbol
3. Sort the symbols by probability
4. Generate a leaf node for each symbol
5. Divide the list into two parts with roughly equal
total probability
6. Assign 0 to the left part and 1 to the right part
7. Repeat steps 5 and 6 on each part until every node is a leaf in the
tree
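
The numbered steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the function name `shannon_fano_codes` is hypothetical, raw counts are used in place of probabilities (the split point is the same), and ties are broken by symbol order.

```python
from collections import Counter


def shannon_fano_codes(text):
    """Return a dict mapping each character of `text` to its Shannon-Fano code."""
    freq = Counter(text)
    # Steps 1-4: count, then sort by descending count (ties by symbol).
    symbols = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    codes = {sym: "" for sym, _ in symbols}

    def split(group):
        if len(group) < 2:
            return  # a single symbol is already a leaf
        total = sum(weight for _, weight in group)
        # Step 5: find the cut that makes the two halves as balanced as possible.
        running, best_index, best_diff = 0, 1, None
        for i in range(1, len(group)):
            running += group[i - 1][1]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_index, best_diff = i, diff
        left, right = group[:best_index], group[best_index:]
        # Step 6: 0 for the left part, 1 for the right part.
        for sym, _ in left:
            codes[sym] += "0"
        for sym, _ in right:
            codes[sym] += "1"
        # Step 7: recurse on each part.
        split(left)
        split(right)

    split(symbols)
    if len(symbols) == 1:
        codes[symbols[0][0]] = "0"  # a lone symbol still needs one bit
    return codes


if __name__ == "__main__":
    sample = "A" * 15 + "B" * 7 + "C" * 6 + "D" * 6 + "E" * 5
    print(shannon_fano_codes(sample))
```

For the sample counts 15, 7, 6, 6, 5 this split rule uses 89 bits in total, whereas a Huffman code for the same counts needs 87, which illustrates why Shannon-Fano is described as suboptimal.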

Generic Example

Example

Encoding and decoding are similar to Huffman coding

Arithmetic
Generates variable-length codes
Encodes a stream of symbols as a whole rather than individual
symbols
Replaces the input with a single fractional number in the interval [0, 1)

Uses
Small alphabets
Binary sources
Color compression

Modeling and coding aspects are kept separate in this technique

Implementation

Encoding Algorithm

Encoding Example

Decoding Algorithm

Decoding Example
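
The interval-narrowing idea behind arithmetic encoding and decoding can be sketched as follows. This is a teaching sketch under stated assumptions, not the algorithm from the original slides: it uses a static model built from the message itself, passes the message length to the decoder instead of an end-of-message symbol, and uses exact `Fraction` arithmetic rather than floating point so that long inputs do not lose precision. The names `build_model`, `arithmetic_encode`, and `arithmetic_decode` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def build_model(text):
    """Map each symbol to its half-open sub-interval [low, high) of [0, 1)."""
    counts = Counter(text)
    total = sum(counts.values())
    model, low = {}, Fraction(0)
    for sym in sorted(counts):
        high = low + Fraction(counts[sym], total)
        model[sym] = (low, high)
        low = high
    return model


def arithmetic_encode(text, model):
    """Narrow [0, 1) once per symbol and return a number inside the result."""
    low, width = Fraction(0), Fraction(1)
    for sym in text:
        sym_low, sym_high = model[sym]
        low = low + width * sym_low
        width = width * (sym_high - sym_low)
    return low + width / 2  # midpoint of the final interval


def arithmetic_decode(value, model, length):
    """Recover `length` symbols by locating `value` and rescaling each time."""
    out = []
    for _ in range(length):
        for sym, (sym_low, sym_high) in model.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)


if __name__ == "__main__":
    message = "abracadabra"
    model = build_model(message)
    code = arithmetic_encode(message, model)
    print(code)
    print(arithmetic_decode(code, model, len(message)))
```

Note that the model (`build_model`) and the coder (`arithmetic_encode`, `arithmetic_decode`) are separate functions, mirroring the separation of modeling and coding in this technique. Practical coders use fixed-width integers with renormalization instead of exact fractions.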

Arithmetic vs. Huffman/Shannon-Fano


Huffman coding is easier to implement
Arithmetic coding achieves a higher compression ratio across different input
sizes
The higher compression ratio of arithmetic coding comes at the cost of
longer execution time
The compression ratio advantage of arithmetic coding grows as the input
size increases
In large collections of text and images, Huffman coding is likely
to be used for the text, and arithmetic coding for the images.
In text, the space character is the most common, with a probability of about
18%, so Huffman redundancy is quite small
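
Huffman redundancy is the gap between the average code length and the entropy of the source. The sketch below compares the two for two made-up distributions; the probabilities and the function names `entropy`, `huffman_lengths`, and `redundancy` are illustrative assumptions and do not come from the cited studies. It is meant to show why a source with no dominant symbol (such as text, where the most common character is around 18%) leaves Huffman coding little to lose, while a heavily skewed source leaves room for arithmetic coding to do better.

```python
import heapq
import math


def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    """Return the Huffman codeword length for each probability."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1  # every merge adds one bit to each member
        heapq.heappush(heap, (p1 + p2, next_id, group1 + group2))
        next_id += 1
    return lengths


def redundancy(probs):
    """Average Huffman code length minus entropy, in bits per symbol."""
    lengths = huffman_lengths(probs)
    average = sum(p * n for p, n in zip(probs, lengths))
    return average - entropy(probs)


if __name__ == "__main__":
    print("balanced source:", redundancy([0.5, 0.25, 0.125, 0.125]))
    print("skewed source:  ", redundancy([0.9, 0.1]))
```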

Comparison Cont.

Conclusion
Data compression is the technique, or art, of finding short descriptions for
long strings.
Every compression algorithm can be decomposed into zero or more
transforms, a model, and a coder.
There is no general procedure for finding good models or prediction
algorithms.
Compression is therefore both an art and an artificial intelligence problem.

In terms of decompressed file size, Huffman gives better results than the other
two algorithms.
However, the performance of each algorithm varies with the content of the
original file.

Questions

