Common Applications
Text compression: lossless. gzip uses Lempel-Ziv coding; 3:1 compression, better than Huffman.
Audio compression: lossy. MPEG, 3:1 to 24:1 compression. MPEG = Moving Picture Experts Group.
Image compression: lossy. JPEG, 3:1 compression. JPEG = Joint Photographic Experts Group.
Video compression: lossy. MPEG, 27:1 compression.
Text Compression
Prefix code: one of many approaches
  no code is a prefix of any other code
Constraint: lossless
Tasks
  encode: text (string) -> code
  decode: code -> text
Main goal: maximally reduce storage, measured by compression ratio
Minor goals
  simplicity
  efficiency: time and space
Some methods require a code dictionary or 2 passes over the data
Example
Send A, C, T, G, each occurring 1/4 of the time
Code: A = 00, C = 01, T = 10, G = 11
2 bits per letter: no surprise
Average message length: prob(A)*codelength(A) + prob(C)*codelength(C) + ... = 1/4*2 + 1/4*2 + 1/4*2 + 1/4*2 = 2 bits
Now suppose prob(A) = 13/16 and the others are 1/16 each
Codes: A = 1, C = 00, G = 010, T = 011 (prefix code)
13/16*1 + 1/16*2 + 1/16*3 + 1/16*3 = 21/16 ≈ 1.31 bits
What is the best possible result? Part of the answer: the information content! But how do we get it?
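The two averages in this example can be checked with a few lines of Python. This is only a sketch: the helper name average_code_length is made up for illustration, and exact fractions are used so the result prints as 21/16 rather than a rounded decimal.

```python
from fractions import Fraction

def average_code_length(probs, codes):
    """Expected bits per symbol: sum over symbols of prob * code length."""
    return sum(probs[s] * len(codes[s]) for s in probs)

# Case 1: four equally likely letters, fixed 2-bit codes.
uniform_probs = {s: Fraction(1, 4) for s in "ACTG"}
fixed_codes = {"A": "00", "C": "01", "T": "10", "G": "11"}

# Case 2: A is very common, so it gets the 1-bit code.
skewed_probs = {"A": Fraction(13, 16), "C": Fraction(1, 16),
                "G": Fraction(1, 16), "T": Fraction(1, 16)}
prefix_codes = {"A": "1", "C": "00", "G": "010", "T": "011"}

fixed_avg = average_code_length(uniform_probs, fixed_codes)
prefix_avg = average_code_length(skewed_probs, prefix_codes)
print(fixed_avg, prefix_avg)  # 2 21/16
```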
Understanding Entropy/Information
Suppose a set S is divided into k classes
Let ni be the number of elements in class i
Let N be the sum of all ni
Let pi = ni/N (the frequency of class i)
Entropy(S) = -p1*log(p1) - p2*log(p2) - ... - pk*log(pk)
Note: if k = 2, same as before
If all classes are equally likely (pi = 1/k), then
Entropy(S) = -1/k*log(1/k) - ... - 1/k*log(1/k) = -log(1/k) = log(k)
If k is a power of 2 (with log taken base 2), this is the number of bits needed to distinguish all classes
Intuitively, entropy gives the right answers. Learning hint: to understand equations, try special cases.
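Following the learning hint, the entropy formula can be tried on special cases in code. The sketch below assumes base-2 logarithms (so the result is in bits) and takes class counts ni as input; the function name entropy is an illustrative choice.

```python
import math

def entropy(counts):
    """Entropy in bits of a set split into classes with the given sizes ni."""
    n = sum(counts)
    # pi = ni / N; classes with zero members contribute nothing.
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

h_uniform4 = entropy([1, 1, 1, 1])   # 4 equally likely classes: log2(4) = 2 bits
h_uniform8 = entropy([1] * 8)        # 8 equally likely classes: log2(8) = 3 bits
h_skewed = entropy([13, 1, 1, 1])    # the 13/16, 1/16, 1/16, 1/16 example

print(h_uniform4, h_uniform8, round(h_skewed, 3))
```

For the skewed ACTG distribution the entropy comes out just under 1 bit per letter, so the 21/16-bit prefix code from the example is not yet at the information-content limit.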
Shannon-Fano Tree
[Figure: binary prefix-code tree with leaves a = 00, b = 01, c = 10, d = 110, e = 111. Note the prefix property.]
Code Encode(character)
Again, can use a binary prefix tree
For encode and decode, could use hashing
  yields O(1) encode/decode time per character
  O(N) space cost (N is the size of the alphabet)
For compression, the main goal is reducing storage size
  in the example, it's the total number of bits
  code size for a single character = depth of the character in the tree
  code size for a document = sum of (frequency of char * depth of char)
Different trees yield different storage efficiency
What's the best tree?
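A minimal sketch of hash-based encode and decode, using Python dictionaries as the hash tables. The code table is the prefix code from the ACTG example (A = 1, C = 00, G = 010, T = 011); the function names and the test string are illustrative assumptions.

```python
# Prefix code from the ACTG example; a dict gives O(1) lookup per character.
codes = {"A": "1", "C": "00", "G": "010", "T": "011"}
decode_table = {bits: ch for ch, bits in codes.items()}

def encode(text):
    """Concatenate the codeword of every character."""
    return "".join(codes[ch] for ch in text)

def decode(bits):
    """Read bits left to right; the prefix property means the first
    codeword that matches is the only one that can match."""
    out, current = [], ""
    for b in bits:
        current += b
        if current in decode_table:
            out.append(decode_table[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

encoded = encode("GATTACA")
print(encoded, len(encoded))   # 01010110111001 14
print(decode(encoded))         # GATTACA
```

The document cost here is 14 bits, which is the sum of the code lengths (tree depths) of the seven characters.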
Huffman Code
Provably optimal among prefix codes: i.e. yields minimum storage cost
Algorithm: CodeTree huff(document)
1. Compute the frequency of each char and create a leaf node for it
   leaf node has a count field and a character
2. Remove the 2 nodes with the least counts and add a new node whose count is the sum of their counts and whose sons are the removed nodes
   internal node has 2 node pointers and a count field
3. Repeat step 2 until only 1 node is left
4. That's it!
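The four steps can be sketched in Python with a priority queue. This is one possible rendering, not the slides' own code: it assumes heapq for "remove the 2 nodes with least counts", represents an internal node as a (left, right) tuple instead of a record with pointers, and returns the code table rather than the tree. The name huffman_codes and the tie-breaking counter are illustrative.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(document):
    """Return {char: bit string} built by repeatedly merging the two
    least-frequent nodes."""
    freq = Counter(document)                 # step 1: frequency per char
    tie = count()                            # tie-breaker so nodes are never compared
    heap = [(n, next(tie), ch) for ch, n in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                       # single distinct char: give it 1 bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:                     # steps 2 and 3
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (n1 + n2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: two sons
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a character
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

doc = "A" * 13 + "CGT"                       # frequencies 13, 1, 1, 1
table = huffman_codes(doc)
total_bits = sum(len(table[ch]) for ch in doc)
print(total_bits)                            # 21
```

On the 13/16, 1/16, 1/16, 1/16 example this gives 21 bits for 16 characters, matching the 21/16 average of the hand-built prefix code.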
Tree, a la Huffman
[Figure: Huffman tree built by repeatedly merging the two lowest-frequency nodes; leaves a, s, e, t, i.]
Tree Cost
[Figure: tree cost, showing bits per node and total bits.]
Analysis
Intuition: the least frequent chars get the longest codes, or equivalently the most frequent chars get the shortest codes
Let T be a minimal-cost code tree (proof by induction)
All internal nodes have 2 sons (by construction)
Lemma: if c1 and c2 are the two least frequently used chars, then they can be taken to be at the deepest depth
Proof: if they are not the deepest nodes, exchange them with the deepest nodes; the total cost (number of bits) cannot go up, and goes down if the frequencies differ
Analysis (continued)
Sk: the Huffman algorithm on k chars produces an optimal code
S2: obvious
Sk => Sk+1:
  Let T be an optimal code tree on k+1 chars
  By the lemma, the two least frequent chars are deepest
  Replace the two least frequent chars by a new char with frequency equal to their sum
  Now have a tree with k leaves
  By induction, Huffman yields an optimal tree
Lempel-Ziv
Input: string of characters
Internal: dictionary of (codeword, word) pairs
Output: string of codewords and characters
Codewords are distinct from characters
In the algorithm, w is a string, c is a character, and w+c means concatenation
When adding a new word to the dictionary, a new codeword needs to be assigned
Lempel-Ziv Algorithm
w = NIL;
while ( read a character c ) {
    if w+c exists in the dictionary
        w = w+c;
    else {
        add w+c to the dictionary;
        output the code for w;
        w = c;
    }
}
output the code for w;
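A runnable sketch of this loop in Python, together with the matching decoder. It makes some assumptions the slide leaves open: the dictionary is preloaded with the 256 single-character strings (so characters are the codes 0 to 255 and new codewords start at 256), input characters have code points below 256, and the dictionary is allowed to grow without limit. The function names are illustrative.

```python
def lzw_compress(text):
    """Return a list of integer codes: 0-255 are characters, 256+ are codewords."""
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    output = []
    for c in text:
        if w + c in dictionary:
            w = w + c
        else:
            output.append(dictionary[w])
            dictionary[w + c] = next_code    # assign a new codeword
            next_code += 1
            w = c
    if w:
        output.append(dictionary[w])         # flush the last word
    return output

def lzw_decompress(codes):
    """Rebuild the same dictionary on the fly; no code book is transmitted."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    w = dictionary[codes[0]]
    out = [w]
    for k in codes[1:]:
        if k in dictionary:
            entry = dictionary[k]
        elif k == next_code:                 # code defined by the step in progress
            entry = w + w[0]
        else:
            raise ValueError("invalid code")
        out.append(entry)
        dictionary[next_code] = w + entry[0]
        next_code += 1
        w = entry
    return "".join(out)

packed = lzw_compress("ABABABA")
print(packed)                    # [65, 66, 256, 258]
print(lzw_decompress(packed))    # ABABABA
```

Seven characters become four codes here because the repeated pattern AB, then ABA, is found in the dictionary.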
Adaptive Encoding
Webster has 157,000 entries: could encode in X bits
  but only works for this document
Don't want to do two passes
Adaptive Huffman
  modify the model on the fly
Lempel-Ziv (LZ), 1977
Lempel-Ziv-Welch (LZW), 1984
  used in compress (UNIX)
  uses a dictionary method
  variable number of symbols to a fixed-length code
  better with large documents: finds repetitive patterns
Audio Compression
Sounds can be represented as a vector-valued function
At any point in time, a sound is a combination of different frequencies of different strengths
For example, each note on a piano yields a specific frequency
Also, our ears have cilia that respond to specific frequencies, much as piano strings do
Just as sin(x) can be approximated by a small number of terms, e.g. x - x^3/6 + x^5/120, so can sound
Transforming a sound into its spectrum is done mathematically by a Fourier transform
The spectrum can be played back, as on computers with sound cards
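To make the idea of a spectrum concrete, here is a small sketch that applies a naive discrete Fourier transform to a pure tone. It is an O(n^2) textbook version in plain Python, chosen for readability rather than speed; the function name dft_magnitudes, the 64-sample length, and the 5-cycle tone are illustrative assumptions.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Strength of each frequency bin k in the sampled signal (naive DFT)."""
    n = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples)))
            for k in range(n)]

n = 64
# A pure tone that completes exactly 5 cycles in the sampled window.
tone = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mags = dft_magnitudes(tone)

# Only look at the first half: for real input the second half mirrors it.
peak = max(range(n // 2), key=lambda k: mags[k])
print(peak)   # 5
```

All of the tone's energy lands in one bin, so storing "frequency 5, strength 32" describes the whole 64-sample signal.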
Audio
Using many frequencies, as in CDs, yields a good approximation
Using few frequencies, as in telephones, yields a poor approximation
Sampling frequencies yields compression ratios between 6 and 24, depending on sound and quality
High-priced electronic pianos store and reuse samples of concert pianos
High filter: removes/reduces high frequencies (loss of high frequencies is a common problem with aging)
Low filter: removes/reduces low frequencies
Can use differential methods: only report changes in the sound
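The differential idea can be sketched as delta encoding of integer samples: store the first value and then only the change from one sample to the next. This is a lossless illustration under the assumption of integer samples; real audio codecs combine it with quantization, and the function names are made up for this example.

```python
def delta_encode(samples):
    """Replace each sample by its difference from the previous one."""
    prev = 0
    deltas = []
    for s in samples:
        deltas.append(s - prev)
        prev = s
    return deltas

def delta_decode(deltas):
    """Running sum of the differences restores the original samples."""
    total = 0
    out = []
    for d in deltas:
        total += d
        out.append(total)
    return out

samples = [100, 102, 103, 103, 101]
deltas = delta_encode(samples)
print(deltas)                  # [100, 2, 1, 0, -2]
print(delta_decode(deltas))    # [100, 102, 103, 103, 101]
```

When neighboring samples are close, the differences are small numbers that need fewer bits than the raw values.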
Image Compression
With or without loss, mostly with
  who cares about what the eye can't see
Black-and-white images can be regarded as functions from the plane (R^2) into the reals (R), as in old TVs
  positions vary continuously, but our eyes can't see the discreteness at around 100 pixels per inch
Color images can be regarded as functions from the plane into R^3, the RGB space
  colors vary continuously, but our eyes sample colors with only 3 different receptors (RGB)
Mathematical theories yield close approximations
  there are spatial analogues to Fourier transforms
Image Compression (continued)
Faces can be done with eigenfaces
  images can be regarded as points in R^(big)
  choose good bases and use the most important vectors
  i.e. approximate with fewer dimensions
JPEG, MPEG, and GIF are compressed image formats
Video Compression
Uses DCT (discrete cosine transform)
Note: nice functions can be approximated by
  a sum of x, x^2, ... with appropriate coefficients
  a sum of sin(x), sin(2x), ... with the right coefficients
  almost any infinite sum of functions
DCT is good because few terms give good results on images
Differential methods are used: only report changes in the video
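A small sketch of why few DCT terms can be enough. It applies an unnormalized one-dimensional DCT-II to a row of 8 smoothly varying pixel values, keeps only the first 4 coefficients, and reconstructs. The row of values, the choice of 4 coefficients, and the function names are illustrative assumptions; real image and video codecs work on 2-D blocks and quantize the coefficients rather than simply dropping them.

```python
import math

def dct(x):
    """Unnormalized DCT-II: X_k = sum_t x_t * cos(pi * (t + 1/2) * k / n)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]

def idct(coeffs):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(coeffs)
    return [coeffs[0] / n
            + (2 / n) * sum(coeffs[k] * math.cos(math.pi * (t + 0.5) * k / n)
                            for k in range(1, n))
            for t in range(n)]

row = [10, 12, 14, 16, 18, 20, 22, 24]     # a smooth brightness ramp
coeffs = dct(row)

exact = idct(coeffs)                        # all 8 terms: exact up to rounding
approx = idct(coeffs[:4] + [0.0] * 4)       # only the 4 lowest-frequency terms

exact_error = max(abs(a - b) for a, b in zip(row, exact))
approx_error = max(abs(a - b) for a, b in zip(row, approx))
print(exact_error < 1e-9, approx_error < 0.3)   # True True
```

Half of the coefficients reproduce every pixel of this row to within 0.3 of a brightness level, which is the sense in which few terms give good results on smooth image data.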
Summary
Issues:
  Context: what problem are you solving, and what is an acceptable solution?
  Evaluation: compression ratios; fidelity, if lossy
  Approximation, quantization, transforms, differential methods
  Adaptive, if on-the-fly, e.g. movies, TV
  Different sources yield different best approaches
    cartoons versus cities versus outdoors
  Code book separate or not
  Fixed- or variable-length codes