
The 3rd IASTED International Conference on COMMUNICATIONS, INTERNET, AND INFORMATION TECHNOLOGY ~CIIT 2004~, November 22-24, 2004, St. Thomas, US Virgin Islands

Improvement of Huffman Coding Algorithm using Interpolation Polynomials

Valeri Pougatchev
Lecturer, Programme Director
School of Computing & Information Technology, 237 Old Hope Road
University of Technology, Kingston 6, Jamaica W.I.
VPougatc@UTech.edu.jm

Abstract

The research described in this paper concerns the possibility of applying lossless Huffman compression to files for which Huffman compression alone does not give an appropriate result – files where the distribution of probabilities of occurrence of tokens is uniform or close to uniform. Using the techniques described in this paper, it is possible to use the Huffman method in an improved way. An algorithm for building binary trees for Huffman compression is also developed further.

Key words

Huffman Coding Algorithm, Binary Tree, Interpolation Polynomial, Orthogonal Polynomials.

1. Introduction

The popular Huffman coding algorithm developed by David Huffman [1] is one of several lossless algorithms for compressing information in computers and is still used for data transfer. The codes generated using this technique are called Huffman codes. The technique is based on the idea of assigning shorter codewords to the tokens within a file that have a higher probability of occurrence than to the tokens that occur less frequently. By a token, as used in this paper, we mean one byte of information coded in ASCII, the American Standard Code for Information Interchange. Notice that the ASCII code uses the same number of bits, 8, to represent each symbol. If we use fewer bits to represent the tokens that occur more often, then on average we use fewer bits per token. The average number of bits per token is often known as the rate of the code.

The founder of modern information theory, Claude Elwood Shannon, showed [2] that the entropy of information can be used as a measure of the average number of binary symbols needed to code the output of a source. Shannon demonstrated that the best a lossless compression scheme can achieve is to encode the output of a source with an average number of bits equal to the entropy of the source.

Consider a file for compression as a sequence of tokens {X1, X2, X3, ...}. Supposing each element of this sequence is independent and identically distributed (iid), we can treat the relative frequency of occurrence of each token i in the file as its probability of occurrence P(i). The entropy can then be calculated as

H = -\sum_{i=1}^{k} P(i) \log_2 P(i)    (1)

where k is the number of unique tokens in the file. Based on the above, a measure of the efficiency of a code is its redundancy – the difference between the average length of the outcome code and the entropy.
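Formula (1) is simple to compute directly. The following minimal C++ sketch (an illustration, not the authors' program) calculates the entropy of a byte buffer by counting token frequencies:

#include <cmath>
#include <cstddef>
#include <vector>

// Entropy H = -sum P(i) * log2 P(i) over all byte values that occur,
// as in formula (1); P(i) is the relative frequency of token i.
double entropy(const std::vector<unsigned char>& data) {
    if (data.empty()) return 0.0;
    std::size_t counts[256] = {0};
    for (unsigned char b : data) ++counts[b];
    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;                 // only the k tokens that occur contribute
        double p = static_cast<double>(c) / data.size();
        h -= p * std::log2(p);
    }
    return h;
}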
The effectiveness of Huffman's method depends on two main factors: firstly, the design of an optimal binary outcome code, and secondly, the distribution of frequencies of occurrence of tokens in the file. It may seem that, for better compression, the process of Huffman coding can simply be repeated in further steps. Unfortunately, that is not true: after Huffman compression, the distribution of frequencies of occurrence of the new tokens in the file approaches the uniform distribution, and there Huffman's method does not work.

Below we consider a method that transforms the set of tokens obtained after Huffman compression into a new set of tokens with a new distribution of frequencies of occurrence, using techniques based on building interpolation polynomials on the set of tokens. Before that, however, we would like to show the process of designing a unique binary code for Huffman compression, which was developed within this research. All algorithms described in this paper are implemented in computer programs written in C++ and Visual Basic 6.

2. Design of a Huffman binary code

For simplicity's sake, consider the method of building a unique binary code (a "binary tree") on a simple example. Let us analyze the text file

aaabbbbcccccddddddeefgggggggh(0A)(0D)    (2)

where (0A) and (0D) are the hexadecimal values of two special symbols – the linefeed and carriage return characters, which usually end a text file. The probabilities of the symbols of that file are:
P(a)=0.1, P(b)=0.13, P(c)=0.16, P(d)=0.20, P(e)=0.06, P(f)=0.03, P(g)=0.23, P(h)=0.03, P(0A)=0.03, P(0D)=0.03

The first line of Figure 1 below shows the symbols to be encoded, sorted in ascending order of probability. The second line shows their equivalent hexadecimal values. The third line shows the frequencies of occurrence of the symbols (in circles), and the fourth line the relative frequency of occurrence of each symbol.

The algorithm for building the binary tree starts at step 1 and continues until step 9. In each step we take the two symbols with the lowest occurrence, assign them the binary values "1" and "0" respectively, and then combine both into a new symbol that sums their frequencies of occurrence. After this, we sort the new set of symbols and repeat the previous procedure. We continue until only two symbols remain, with assigned values "1" and "0"; the sum of the frequencies of occurrence of these two symbols equals the number of symbols in the file.

To read off the binary value for each source symbol, we have to:
1. Follow the branch of the tree starting from the source symbol until we reach the root of the tree, writing down the binary value encountered at each level along the path.
2. Obtain the outcome value by writing the encountered code back to front.

[Figure 1 – Building the binary Huffman tree. The figure lists the source symbols in ascending order of frequency – h (68), (0D), f (66), (0A), e (65), a (61), b (62), c (63), d (64), g (67) – with frequencies 1, 1, 1, 1, 2, 3, 4, 5, 6, 7 and relative frequencies 0.03, 0.03, 0.03, 0.03, 0.06, 0.1, 0.13, 0.16, 0.2, 0.23, and shows steps 1-9 in which the two least frequent nodes are merged, "1" and "0" being assigned to their branches, until the root (13 + 18 = 31, the number of symbols in the file) is reached.]
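The merging procedure of Figure 1 can be sketched in C++ as follows. This is a minimal illustration of the described algorithm (repeatedly combining the two least frequent nodes), not the authors' original program; it assigns codes root-to-leaf instead of writing them back to front, which yields an equivalent optimal code, although tie-breaking may make individual codes differ from the table below.

#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    unsigned char sym;                       // meaningful only for leaves
    Node* left = nullptr;
    Node* right = nullptr;
};

struct ByFreq {                              // orders the priority queue as a min-heap on frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

// Walk the finished tree; the "1"/"0" labels along the path form each code.
static void collect(const Node* n, const std::string& path,
                    std::map<unsigned char, std::string>& codes) {
    if (!n->left && !n->right) { codes[n->sym] = path; return; }
    collect(n->left, path + "1", codes);     // "1" for the less frequent branch, as in Figure 1
    collect(n->right, path + "0", codes);
}

std::map<unsigned char, std::string> huffmanCodes(const std::map<unsigned char, long>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& [sym, f] : freq) pq.push(new Node{f, sym});
    while (pq.size() > 1) {                  // one merge per step, as in steps 1-9 of Figure 1
        Node* a = pq.top(); pq.pop();        // least frequent: branch "1"
        Node* b = pq.top(); pq.pop();        // next least frequent: branch "0"
        pq.push(new Node{a->freq + b->freq, 0, a, b});
    }
    std::map<unsigned char, std::string> codes;
    if (!pq.empty()) collect(pq.top(), "", codes);
    return codes;                            // tree nodes are intentionally leaked in this sketch
}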

In our case, we obtain the following binary codes for the source symbols:

Symbol (ASCII)  Hexadecimal code  Binary code
g               67                10
d               64                11
c               63                001
b               62                011
a               61                0000
e               65                0101
(0A)            0A                00010
f               66                00011
(0D)            0D                01000
h               68                01001

This algorithm, implemented in C++, works successfully for files of any size. We will use this algorithm below.
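As a quick check of the code table above: using the frequencies of Figure 1, the encoded body of file (2) occupies 2(7+6) + 3(5+4) + 4(3+2) + 5(1+1+1+1) = 93 bits, i.e. 3.0 bits per symbol on average against the 8 bits per symbol of plain ASCII, while the entropy (1) of file (2) is about 2.97 bits. The code lengths also satisfy the Kraft equality 2·2^{-2} + 2·2^{-3} + 2·2^{-4} + 4·2^{-5} = 1, confirming that the table describes a complete prefix code.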
Consider the motto of the University of Technology, Jamaica as the following example base text:
University of Technology, Jamaica: "Excellence through knowledge"    (3)

Based on this example we are going to show the idea of improving Huffman's method of compression using interpolation polynomials.

An analysis of the text (3) in Huffman binary code is shown in Table 1:

File size: 65 bytes
Number of different tokens (symbols): 29
Entropy of the file using formula (1): 4.5179
Average length of the new code: 4.55385

The last value shows that the redundancy of compression following our algorithm is 4.55385 - 4.5179 = 0.03595 bits per token.
Content of file:

ASCII code  Hexadecimal code  Frequency  Binary code  Length of code
e           65                7          101          3
(space)     20                6          0001         4
o           6F                5          0010         4
c           63                4          0011         4
n           6E                4          0101         4
l           6C                4          0111         4
i           69                3          00001        5
h           68                3          1110         4
a           61                3          00000        5
g           67                3          1111         4
t           74                2          11001        5
r           72                2          10001        5
y           79                2          10011        5
"           22                2          01001        5
d           64                1          110001       6
E           45                1          110101       6
,           2C                1          11011        5
:           3A                1          010001       6
f           66                1          011001       6
J           4A                1          100101       6
T           54                1          100001       6
m           6D                1          110100       6
s           73                1          110000       6
k           6B                1          011000       6
u           75                1          100100       6
v           76                1          010000       6
w           77                1          100000       6
x           78                1          011010       6
U           55                1          011011       6

Table 1 – Analysis of the text (3)
We developed a C++ program which implements the binary coding shown above. Based on this, a new compressed file is created with a completely new structure, as follows (a sketch of the bit-packing step is given after this list):
- Head of the file: information on the length of the head (in bytes) and, for each symbol (token), the ASCII value of the symbol, the length of its encoded binary code (in bits), and the binary code itself, in sequence.
- Body of the file: a continuous stream of bits without delimiters between symbols, extending from byte to byte in sequential order; a symbol's code carries over into the next byte until the end of the stream.
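The body can be produced with a simple bit writer. A minimal sketch, assuming the codes map produced by the huffmanCodes sketch above (again an illustration, not the authors' program):

#include <map>
#include <string>
#include <vector>

// Pack the binary codes of all input bytes into one continuous bit
// stream, most significant bit first, with no delimiters between symbols.
std::vector<unsigned char> packBody(const std::vector<unsigned char>& input,
                                    const std::map<unsigned char, std::string>& codes) {
    std::vector<unsigned char> out;
    unsigned char cur = 0;
    int nbits = 0;
    for (unsigned char sym : input) {
        for (char bit : codes.at(sym)) {     // a code may straddle a byte boundary
            cur = static_cast<unsigned char>((cur << 1) | (bit == '1'));
            if (++nbits == 8) { out.push_back(cur); cur = 0; nbits = 0; }
        }
    }
    if (nbits > 0)                           // pad the final partial byte with zeros
        out.push_back(static_cast<unsigned char>(cur << (8 - nbits)));
    return out;
}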
We will consider the body of the file after our first Huffman compression. Continued compression of this file by Huffman's algorithm is problematic, as shown in [4]: the distribution of the new symbols after this procedure is close to uniform, and for this reason it was shown in [3] that Huffman's method of compression cannot be applied further. The uniform distribution of symbols in the file is the main problem.

We improve this process using additional procedures which effectively skew the uniform distribution and allow lossless Huffman compression again. We use the technique of building interpolation polynomials based on mean-square approximation.

Let us consider the file to be compressed as a set of points on an X-Y graph: on the X axis, the byte number (first byte, second byte, etc.), and on the Y axis, the ASCII integer value of that byte (0-255). We now consider the process of building interpolation polynomials through this set of points.

3. Building Interpolation Polynomials

Let {y_i} ⊂ {0, 1, 2, ..., 255} be the set of integer values representing the ASCII values of the bytes. Let the interpolation function we have to find have the form φ(x, a_1, a_2, ..., a_m), where a_1, a_2, ..., a_m are some parameters. We can consider the system of equations

\varphi(x_1, a_1, a_2, \ldots, a_m) = y_1
\varphi(x_2, a_1, a_2, \ldots, a_m) = y_2
\ldots
\varphi(x_n, a_1, a_2, \ldots, a_m) = y_n    (4)

where n is the size of the file. Here, generally speaking, n > m, and the system of equations (4) usually has no solution. Instead, we have to find the set of parameters a_1, a_2, ..., a_m that minimizes the function

\rho(a_1, a_2, \ldots, a_m) = \sum_{i=1}^{n} \big( \varphi(x_i, a_1, \ldots, a_m) - y_i \big)^2

Suppose that the function φ we are looking for is a linear combination of some predefined functions φ_1(x), φ_2(x), ..., φ_m(x).
In this case, for φ we have

\varphi(x) = \sum_{j=1}^{m} a_j \varphi_j(x)

and

\rho(a_1, a_2, \ldots, a_m) = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} a_j \varphi_j(x_i) - y_i \Big)^2

3.2 Building the orthogonal polynomials


  j ( x1 )   y1  Let 0 ( x)  1 .
 
Let X j  ........... , j  1, 2,..., m , Y   ...  are
    Let 1 ( x)  x  1 , where 1 selected from condition
  j ( xn )  y 
   n 1,0   0 :
n
vectors in N-space. If we denote ( X , Y )  x y i i as 1,0   ( x  1,0 )  ( x,1)  1 (1,1) 
i 1 n
a inner product of vectors X and Y, we can write   xi  1n  0
function i 1
(a1 , a2 ,..., am ) as n

(a1 , a2 ,..., am ) || a1 X1  a2 X 2  ...  am X m  Y ||2 x i


hence 1  i 1
n
where X 1 , X 2 ,..., X m , Y are vectors. Let 2 ( x)  ( x   2 )1  10 ( x) , where  2 and 1
Let π – subspace, tighten on the vectors X1, X2, …, Xm defined from conditions of orthogonality
min || a1 X1  a2 X 2  ...  am X m  Y || can be reached 2 ,0   0 and 2 ,1   0
in the point of π – subspace, where a vector 2 ,0   ( x1,1)  2 (1,1)  1 (1,1) 
a1 X 1  a2 X 2  ...  am X m  Y is orthogonal to the π.
n

n  x  (x ) i 1 i
  xi1 ( xi )  1n  0 ,
For this is enough, that vector
1  i 1
a1 X 1  a2 X 2  ...  am X m  Y would be orthogonal to i 1 n
all vectors X1, X2, …, Xm of π. Hence
m 2 ,1   ( x1,1 )  2 (1,1 )  1 (0 ,1 ) 
( ai X i  Y , X j )  0 , for j=1,2,…,n n 2 n 2
i 1
  xi (1 ( xi ))   2  (1 ( xi ))  0
Last sentence creates a system of liner equations
i 1 i 1
m

a (X , X
n
)  (Y , X i ) , j=1 2,…,m
i 1
i i j (3)
 x ( ( x )) i 1 i
2

where hence 2  i 1
n
n n
( X i , X j )   i ( xk ) j ( xk ), (Y , X j )   j ( xk )yk  ( ( x ))
i 1
1 i
2

k 1 k 1
3.1 The mean-square approximation by orthogonal polynomials

To avoid solving the system of linear equations (5) directly, it is reasonable to build a system of polynomials q_0(x), q_1(x), q_2(x), ... that are orthogonal in the sense that their inner products vanish:

(q_i, q_j) = \sum_{k=1}^{n} q_i(x_k) q_j(x_k) = 0 \quad \text{for } i \neq j

With such a basis, the system (5) reduces to

\| q_k \|^2 a_k = (f, q_k), \quad k = 0, 1, \ldots, m

where f = (y_1, ..., y_n) is the vector of values to be approximated. Hence we get the mean-square approximation by the formula

\varphi(x) = \sum_{k=0}^{m} \frac{(q_k, f)}{\| q_k \|^2} \, q_k(x)    (6)

We will use this base formula to obtain the interpolation polynomial for the set of token (byte) values.

3.2 Building the orthogonal polynomials

Let q_0(x) = 1, and let q_1(x) = x - α_1, where α_1 is selected from the condition (q_1, q_0) = 0:

(q_1, q_0) = (x - \alpha_1, 1) = (x, 1) - \alpha_1 (1, 1) = \sum_{i=1}^{n} x_i - \alpha_1 n = 0

hence

\alpha_1 = \frac{1}{n} \sum_{i=1}^{n} x_i

Let q_2(x) = (x - α_2) q_1(x) - β_1 q_0(x), where α_2 and β_1 are defined from the conditions of orthogonality (q_2, q_0) = 0 and (q_2, q_1) = 0:

(q_2, q_0) = (x q_1, 1) - \alpha_2 (q_1, 1) - \beta_1 (1, 1) = \sum_{i=1}^{n} x_i q_1(x_i) - \beta_1 n = 0

using (q_1, 1) = 0, hence

\beta_1 = \frac{1}{n} \sum_{i=1}^{n} x_i q_1(x_i)

and

(q_2, q_1) = (x q_1, q_1) - \alpha_2 (q_1, q_1) - \beta_1 (q_0, q_1) = \sum_{i=1}^{n} x_i (q_1(x_i))^2 - \alpha_2 \sum_{i=1}^{n} (q_1(x_i))^2 = 0

hence

\alpha_2 = \frac{\sum_{i=1}^{n} x_i (q_1(x_i))^2}{\sum_{i=1}^{n} (q_1(x_i))^2}

Let q_3(x) = (x - α_3) q_2(x) - β_2 q_1(x). From the conditions of orthogonality (q_3, q_0) = 0, (q_3, q_1) = 0 and (q_3, q_2) = 0 we obtain

\alpha_3 = \frac{\sum_{i=1}^{n} x_i (q_2(x_i))^2}{\sum_{i=1}^{n} (q_2(x_i))^2}, \qquad \beta_2 = \frac{\sum_{i=1}^{n} x_i q_1(x_i) q_2(x_i)}{\sum_{i=1}^{n} (q_1(x_i))^2}
If we continue this process further, it is easy to prove by induction the recurrence formulas for obtaining the orthogonal polynomials:

q_{j+1}(x) = (x - \alpha_{j+1}) q_j(x) - \beta_j q_{j-1}(x)    (7)

where

\alpha_{j+1} = \frac{\sum_{i=1}^{n} x_i (q_j(x_i))^2}{\sum_{i=1}^{n} (q_j(x_i))^2}, \qquad \beta_j = \frac{\sum_{i=1}^{n} x_i q_{j-1}(x_i) q_j(x_i)}{\sum_{i=1}^{n} (q_{j-1}(x_i))^2}
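A minimal C++ sketch of this construction follows (an illustration under the paper's definitions, not the authors' program). It evaluates each q_k only at the data points, accumulates the coefficient K_k = (f, q_k) / ||q_k||^2 of formula (6), and advances the recurrence (7):

#include <vector>

// Mean-square fit using the orthogonal polynomials of recurrence (7).
// Returns the fitted values phi(x_i) of formula (6) for degree m.
std::vector<double> orthoFit(const std::vector<double>& x,
                             const std::vector<double>& y, int m) {
    const int n = static_cast<int>(x.size());
    std::vector<double> qPrev(n, 0.0), qCur(n, 1.0);    // q_{-1} = 0, q_0 = 1
    std::vector<double> fit(n, 0.0);
    for (int k = 0; k <= m; ++k) {
        double qq = 0.0, qf = 0.0, xqq = 0.0, xpq = 0.0, qqPrev = 0.0;
        for (int i = 0; i < n; ++i) {
            qq     += qCur[i] * qCur[i];                // ||q_k||^2
            qf     += qCur[i] * y[i];                   // (f, q_k)
            xqq    += x[i] * qCur[i] * qCur[i];         // numerator of alpha_{k+1} in (7)
            xpq    += x[i] * qPrev[i] * qCur[i];        // numerator of beta_k in (7)
            qqPrev += qPrev[i] * qPrev[i];              // ||q_{k-1}||^2
        }
        double K = qf / qq;                             // coefficient K_k, formula (6)
        for (int i = 0; i < n; ++i) fit[i] += K * qCur[i];
        if (k == m) break;
        double alpha = xqq / qq;                        // alpha_{k+1}
        double beta  = (k == 0) ? 0.0 : xpq / qqPrev;   // beta_k (q_{-1} = 0 at the first step)
        for (int i = 0; i < n; ++i) {                   // q_{k+1} = (x - alpha) q_k - beta q_{k-1}
            double next = (x[i] - alpha) * qCur[i] - beta * qPrev[i];
            qPrev[i] = qCur[i];
            qCur[i]  = next;
        }
    }
    return fit;
}

Because each K_k depends only on q_k, raising the order m simply adds terms without changing the coefficients already computed – the property noted in the conclusion below.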
3.3 Generating a non-uniformly distributed set of token values

Consider the text (3) as a starting point for compression, and consider each symbol of that text as the integer value of its ASCII code (65 symbols):

{85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32, 111, 102, 32, 84, 101, 99, 104, 110, 111, 108, 111, 103, 121, 44, 32, 74, 97, 109, 97, 105, 99, 97, 58, 32, 34, 69, 120, 99, 101, 108, 108, 101, 110, 99, 101, 32, 116, 104, 114, 111, 117, 103, 104, 32, 107, 110, 111, 119, 108, 101, 100, 103, 101, 34}    (8)

After Huffman compression, based on the algorithm for building a binary tree and the procedure for building a compressed file described above, we get a new set of symbol values (37 symbols):

{109, 66, 133, 142, 1, 202, 68, 153, 24, 105, 242, 147, 151, 167, 99, 40, 52, 0, 76, 8, 140, 245, 104, 235, 189, 83, 163, 158, 137, 73, 252, 44, 41, 64, 247, 31, 179}    (9)

This set of points is shown on Graph 1.
[Graph 1 – The set of points (9): byte number (1-37) on the X axis, token value (0-255) on the Y axis.]

The entropy of this set of values is 5.2. The set of values is almost uniformly distributed (see Graph 2), so continuing with Huffman compression here would be futile.

[Graph 2 – Distribution of the values of set (9), which is close to uniform.]

For the set of points (9) we therefore obtain an appropriate interpolation polynomial φ(x) and consider, as the new set of symbol values, the set of distances between the points of set (9) and φ(x).

We will use a third-order interpolation polynomial φ(x) with the following conditions:
- φ(x_1) = y_1 for the first symbol;
- φ(x_n) = y_n for the last symbol.

We developed a computer program which automatically finds the best interpolation polynomial, in the sense of the minimum entropy of the set of distances between the points of set (9) and φ(x):

\varphi(x) = K_0 q_0(x) + K_1 q_1(x) + K_2 q_2(x) + K_3 q_3(x)    (10)

where
- K_0, K_1, K_2, K_3 are the coefficients of the orthogonal polynomials, based on formula (6);
- q_0(x) = 1 is the first orthogonal polynomial;
- q_1(x) = 3.3(x - 17.25) is the second orthogonal polynomial;
- q_2(x) = -0.17((x - 20.5)(x - 17.25) - 201.2) is the third orthogonal polynomial;
- q_3(x) = -0.033((x - 21.25)((x - 20.5)(x - 17.25) - 201.2) - 80(x - 17.25)) is the fourth orthogonal polynomial.

Graph 3 shows a plot of the set of points (9) together with the interpolation polynomial (10).

[Graph 3 – The set of points (9) and the interpolation polynomial (10).]
This interpolation polynomial passes through the interpolation points {1, 109}, {7, 68}, {24, 235} and {37, 179}. The distances between the interpolation polynomial (10) and the points of set (9) are:

{0, 31, -49, -63, 79, -128, 0, -73, 56, -10, -149, -32, -32, -37, 44, 110, 104, 176, 107, 196, 66, -24, 126, 0, 54, 167, 90, 102, 126, 196, 7, 208, 199, 176, -28, 167, 0}    (11)
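A sketch of this distance transformation follows, assuming the orthoFit helper above. How negative distances are stored in a byte is an assumption here (wrap-around modulo 256), since the paper does not specify it; the paper's program additionally pins φ(x) to the first and last points and searches for the minimum-entropy fit, which this sketch omits.

#include <cmath>
#include <cstddef>
#include <vector>

// Replace each token value by its signed distance from the fitted
// third-order polynomial, as in (11), stored in a byte.
std::vector<unsigned char> toDistances(const std::vector<double>& x,
                                       const std::vector<double>& y) {
    std::vector<double> fit = orthoFit(x, y, 3);        // third order, as in (10)
    std::vector<unsigned char> dist(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        long d = std::lround(fit[i]) - std::lround(y[i]);
        dist[i] = static_cast<unsigned char>(d & 0xFF); // wrap negatives modulo 256 (assumption)
    }
    return dist;
}

One can then compare entropy(dist) against the entropy of the original values, as is done for sets (9) and (11) below.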
The entropy of the set of points (11) is 4.72, which is less than the entropy 5.2 of the starting set of points (9). This means that it is possible to apply Huffman compression to the set of points (11) once again and obtain further compression. The method can be invoked repeatedly to obtain further compression.

4. Conclusion

The process described in this paper allows further lossless Huffman compression of files consisting of tokens with uniform or close-to-uniform distributions, where that type of compression alone would not give any appropriate result. Building interpolation polynomials from orthogonal polynomials has two advantages:
- a new orthogonal polynomial of higher order can be added without recomputing the previous ones, using the convenient recurrence formulas (7);
- the method is simple to implement in a computer program.

5. Acknowledgements

This research was supported by a research grant from the University of Technology, Jamaica.

Bibliography

[1] D. A. Huffman. A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE, 40:1098-1101, 1952.

[2] C. E. Shannon. A Mathematical Theory of Communication. Bell System Technical Journal, 27:379-423, 623-656, 1948.

[3] V. Pougatchev, M. Verma. Frequency Technique for Compression of Information. The International Conference "Modern Problems of Functional Analysis and Differential Equations", devoted to the 50th anniversary of the Department of Functional Analysis at Voronezh University (founded in 1953 by Professor Mark Krasnoselsky), Voronezh, Russia, June 30 - July 4, 2003, pp. 48-52.

[4] V. Pougatchev, V. Rodin. The Frequency Principle of Information Volume Compression in ECM. Modeling and Stability, VI Ukrainian Conference, Kiev, Ukraine, May 15-19, 1995, p. 93.
