Valeri Pougatchev
University of Technology, Jamaica
distribution of probabilities of occurrences of tokens is uniform or close to uniform. Using the techniques described in this paper, it is possible to use the Huffman method in an improved way. An algorithm for building binary trees for Huffman compression is developed further.

where k is the number of unique tokens in the file.

Based on the above, we can say that a measure of the efficiency of this coding is the redundancy of the outcome code: the difference between the average length of the outcome code and the entropy.
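To make the redundancy measure concrete, here is a minimal C++ sketch (our own illustration; the function names are hypothetical, not from the paper's program). It computes the entropy $H = -\sum_i p_i \log_2 p_i$ of a token distribution and the redundancy of a code as the average code length minus the entropy.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Entropy of a discrete distribution: H = -sum p_i * log2(p_i).
double entropy(const std::vector<double>& p) {
    double h = 0.0;
    for (double pi : p)
        if (pi > 0.0) h -= pi * std::log2(pi);
    return h;
}

// Average code length: sum p_i * len_i (len_i in bits).
double averageLength(const std::vector<double>& p,
                     const std::vector<int>& len) {
    double avg = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) avg += p[i] * len[i];
    return avg;
}

// Redundancy of a code: average length minus entropy (non-negative for
// any uniquely decodable code, by the noiseless coding theorem).
double redundancy(const std::vector<double>& p,
                  const std::vector<int>& len) {
    return averageLength(p, len) - entropy(p);
}
```

For a uniform distribution over 8 tokens with 3-bit codes, the entropy is exactly 3 bits and the redundancy is 0; a uniform (or close to uniform) distribution leaves nothing for further Huffman coding to exploit, which is the problem this paper addresses.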
P(a)=0.1, P(b)=0.13, P(c)=0.16, P(d)=0.20, P(e)=0.06, P(f)=0.03, P(g)=0.23, P(h)=0.03, P(0A)=0.03, P(0D)=0.03

The first line of Figure 1 below shows the symbols to be encoded, sorted in ascending order of probability. The second line shows their equivalent hexadecimal values. The third line shows the frequencies of occurrence of the symbols, in circles. The fourth line shows the relative frequency of occurrence of each symbol.

The algorithm for building the binary tree starts at step 1 and continues until step 9. In each step, we take the two symbols with the lowest frequency of occurrence, assign them the binary values "1" and "0" respectively, and then combine both into a new symbol whose frequency of occurrence is the sum of theirs. After this, we sort the new set of symbols and repeat the previous procedure once again. We continue this procedure until we get only two symbols, with assigned values "1" and "0". The sum of the frequencies of occurrence of these two symbols is equal to the number of symbols in the file.

To read the binary value for each source symbol, we have to:
1. Follow the branch of the tree starting from the source symbol until we reach the root of the tree, writing down the binary value encountered at each level along the path.
2. The outcome value is obtained by writing the encountered bits back in reverse order.
[Figure 1: building the Huffman binary tree. Line 1 lists the symbols sorted by ascending probability: h (68), CR (0D), f (66), LF (0A), e (65), a (61), b (62), c (63), d (64), g (67). Line 2 gives their frequencies of occurrence: 1, 1, 1, 1, 2, 3, 4, 5, 6, 7, and line 3 their relative frequencies: 0.03, 0.03, 0.03, 0.03, 0.06, 0.1, 0.13, 0.16, 0.2, 0.23. Steps 1 to 8 each merge the two lowest-frequency nodes, labelling the pair "1" and "0"; the frequency sequences after each step are (1, 1, 2, 2, 3, 4, 5, 6, 7), (2, 2, 2, 3, 4, 5, 6, 7), (2, 3, 4, 4, 5, 6, 7), (4, 4, 5, 5, 6, 7), (5, 5, 6, 7, 8), (6, 7, 8, 10), (8, 10, 13) and (13, 18). Step 9 assigns "1" and "0" to the last two nodes, 13 and 18, whose sum (31) is the number of symbols in the file.]
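Since the resulting code (listed in the table that follows) is prefix-free, decoding the bit stream needs no delimiters: scan the bits left to right and emit a symbol as soon as the buffer matches a codeword. A small hedged sketch, with the decoder name decodeBits being our own:

```cpp
#include <map>
#include <string>

// Greedy decoder for a prefix-free code: grow a bit buffer until it
// matches a codeword, emit the symbol, and start over.
std::string decodeBits(const std::string& bits,
                       const std::map<std::string, char>& table) {
    std::string out, buf;
    for (char bit : bits) {
        buf += bit;
        auto it = table.find(buf);
        if (it != table.end()) {
            out += it->second;
            buf.clear();
        }
    }
    return out;  // trailing bits in buf (padding) are ignored
}

// The example code table from the text: g=10, d=11, c=001, b=011,
// a=0000, e=0101, LF(0A)=00010, f=00011, CR(0D)=01000, h=01001.
const std::map<std::string, char> kExampleCode = {
    {"10", 'g'},    {"11", 'd'},    {"001", 'c'},   {"011", 'b'},
    {"0000", 'a'},  {"0101", 'e'},  {"00010", '\n'}, {"00011", 'f'},
    {"01000", '\r'}, {"01001", 'h'}};
```

For example, decodeBits("011000011", kExampleCode) scans 011 -> b, 0000 -> a, 11 -> d and returns "bad".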
In our case, we have the following binary codes for the source symbols:

Symbol   ASCII hexadecimal code   Binary code
g        67                       10
d        64                       11
c        63                       001
b        62                       011
a        61                       0000
e        65                       0101
(0A)     0A                       00010
f        66                       00011
(0D)     0D                       01000
h        68                       01001

This algorithm, implemented in C++, works successfully for a file of any size. We will use this algorithm below.

Consider the motto of The University of Technology, Jamaica as the following example base text:
University of Technology, Jamaica:
"Excellence through knowledge"   (3)

Based on this example, we are going to show the idea of improving Huffman's method of compression using interpolation polynomials.

The analysis of file (3) in Huffman binary code is shown in Table 1.

File size: 65 bytes
Number of different tokens (symbols): 29
Entropy of this file, using formula (1): 4.5179
Average length of the new code: 4.55385

The last value shows us that the redundancy of compression following our algorithm is 0.03595.

Content of File:

ASCII code   Hexadecimal code   Frequency   Binary code   Length of code
e            65                 7           101           3
(space)      20                 6           0001          4
o            6F                 5           0010          4
c            63                 4           0011          4
n            6E                 4           0101          4
l            6C                 4           0111          4
i            69                 3           00001         5
h            68                 3           1110          4
a            61                 3           00000         5
g            67                 3           1111          4
t            74                 2           11001         5
r            72                 2           10001         5
y            79                 2           10011         5
"            22                 2           01001         5
d            64                 1           110001        6
E            45                 1           110101        6
,            2C                 1           11011         5
:            3A                 1           010001        6
f            66                 1           011001        6
J            4A                 1           100101        6
T            54                 1           100001        6
m            6D                 1           110100        6
s            73                 1           110000        6
k            6B                 1           011000        6
u            75                 1           100100        6
v            76                 1           010000        6
w            77                 1           100000        6
x            78                 1           011010        6
U            55                 1           011011        6

Table 1 - Analysis of file (3)

We developed a C++ program which implements the binary code shown above. Based on this, a new compressed file is created with a completely new structure, as follows:

- Head of the file, consisting of information on the length of the head of the file (in bytes) and information about each symbol (token): the ASCII value of the symbol, the length of its encoded binary code (in bits), and the binary code itself.
- Body of the file, consisting of a continuous stream of bits without delimiters between the symbols. The stream extends from byte to byte in sequential order, and a symbol's code within one byte carries on into the next byte, until the end of the stream.

We will consider the body of the file after our first Huffman compression. Continued compression of this file using Huffman's algorithm is problematic, as shown in [4]. The distribution of the new symbols after this procedure is close to uniform. Due to this fact, it was shown in [3] that it is impossible to use Huffman's method of compression further. The uniform distribution of symbols in the file is the main problem.

We improve this process using additional procedures, which effectively skew the uniform distribution and allow lossless Huffman compression again. We use the technique of creating interpolation polynomials based on mean-square approximations.

Let us consider the file to be compressed as a set of points on an X-Y graph. On axis X will be the byte number (first byte, second byte, etc.) and on axis Y the ASCII integer values of these bytes (0-255). We will consider the process of building interpolation polynomials through this set of points.

3. Building Interpolation Polynomials

Let $y_i \in \{0, 1, 2, \ldots, 255\}$ be a set of integer values, which represent the ASCII values of the bytes. Let the interpolation function that we have to find have the form $\varphi(x, a_1, a_2, \ldots, a_m)$, where $a_1, a_2, \ldots, a_m$ are some parameters. We can consider a system of equations

$$
\begin{cases}
\varphi(x_1, a_1, a_2, \ldots, a_m) = y_1 \\
\varphi(x_2, a_1, a_2, \ldots, a_m) = y_2 \\
\qquad \vdots \\
\varphi(x_n, a_1, a_2, \ldots, a_m) = y_n
\end{cases}
\qquad (4)
$$

where n is the size of the file. Here, generally speaking, n > m. The system of equations (4) usually does not have solutions. We have to find the set of parameters $a_1, a_2, \ldots, a_m$ that achieves the minimum value of the function

$$\Phi(a_1, a_2, \ldots, a_m) = \sum_{i=1}^{n} \big(\varphi(x_i, a_1, \ldots, a_m) - y_i\big)^2$$

Suppose that the function $\varphi$, which we are going to find, is a linear combination of some predefined functions
$\varphi_1(x), \varphi_2(x), \ldots, \varphi_m(x)$.

In this case, for $\varphi$ we have

$$\varphi(x) = \sum_{j=1}^{m} a_j \varphi_j(x)$$

and

$$\Phi(a_1, a_2, \ldots, a_m) = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} a_j \varphi_j(x_i) - y_i \Big)^2$$

For this it is enough that the vector $a_1 X_1 + a_2 X_2 + \ldots + a_m X_m - Y$, where $X_j = (\varphi_j(x_1), \ldots, \varphi_j(x_n))$ and $Y = (y_1, \ldots, y_n)$, be orthogonal to all vectors $X_1, X_2, \ldots, X_m$. Hence

$$\Big( \sum_{i=1}^{m} a_i X_i - Y,\; X_j \Big) = 0, \quad j = 1, 2, \ldots, m$$

The last statement creates a system of linear equations

$$\sum_{i=1}^{m} a_i (X_i, X_j) = (Y, X_j), \quad j = 1, 2, \ldots, m \qquad (3)$$

where

$$(X_i, X_j) = \sum_{k=1}^{n} \varphi_i(x_k)\, \varphi_j(x_k), \qquad (Y, X_j) = \sum_{k=1}^{n} \varphi_j(x_k)\, y_k$$

3.1 The Mean-Square Approximation by Orthogonal Polynomials

To avoid solving systems of linear equations (3), it is reasonable to build a system of orthogonal polynomials $\varphi_0(x), \varphi_1(x), \varphi_2(x), \ldots$ in terms of the equality of their inner products to zero. In this case each coefficient is found independently from

$$\|\varphi_k\|^2\, a_k = (f, \varphi_k), \quad k = 1, 2, \ldots, m$$

Hence, we can get the mean-square approximation using the formula

$$\varphi(x) = \sum_{k=1}^{m} \frac{(\varphi_k, f)}{\|\varphi_k\|^2}\, \varphi_k(x) \qquad (4)$$

We will use this base formula to get the subsequent interpolation polynomials for the set of values of the tokens (bytes).

We take $\varphi_0(x) = 1$ and

$$\varphi_1(x) = x - \alpha_1, \qquad \alpha_1 = \frac{\sum_{i=1}^{n} x_i\, \varphi_0^2(x_i)}{\sum_{i=1}^{n} \varphi_0^2(x_i)} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Next,

$$\varphi_2(x) = (x - \alpha_2)\, \varphi_1(x) - \beta_1\, \varphi_0(x)$$

where

$$\alpha_2 = \frac{\sum_{i=1}^{n} x_i\, (\varphi_1(x_i))^2}{\sum_{i=1}^{n} (\varphi_1(x_i))^2}, \qquad \beta_1 = \frac{\sum_{i=1}^{n} x_i\, \varphi_1(x_i)\, \varphi_0(x_i)}{\sum_{i=1}^{n} (\varphi_0(x_i))^2}$$

Let $\varphi_3(x) = (x - \alpha_3)\, \varphi_2(x) - \beta_2\, \varphi_1(x)$. From the conditions of orthogonality

$$(\varphi_3, \varphi_0) = 0, \quad (\varphi_3, \varphi_1) = 0, \quad (\varphi_3, \varphi_2) = 0$$

we obtain

$$\alpha_3 = \frac{\sum_{i=1}^{n} x_i\, (\varphi_2(x_i))^2}{\sum_{i=1}^{n} (\varphi_2(x_i))^2}, \qquad \beta_2 = \frac{\sum_{i=1}^{n} x_i\, \varphi_2(x_i)\, \varphi_1(x_i)}{\sum_{i=1}^{n} (\varphi_1(x_i))^2}$$
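The construction of $\varphi_1$ and $\varphi_2$ above can be verified numerically. The following C++ sketch (our own illustration, with hypothetical sample points x = 0, 1, 2, 3, 4) builds the polynomials from the formulas for $\alpha_1$, $\alpha_2$, $\beta_1$ and lets us check that the inner products $(\varphi_1, \varphi_0)$, $(\varphi_2, \varphi_0)$ and $(\varphi_2, \varphi_1)$ all vanish.

```cpp
#include <cstddef>
#include <vector>

// Inner product over the sample points: (f, g) = sum f(x_i) g(x_i).
double dot(const std::vector<double>& f, const std::vector<double>& g) {
    double s = 0.0;
    for (std::size_t i = 0; i < f.size(); ++i) s += f[i] * g[i];
    return s;
}

// Values of phi0, phi1, phi2 on the points x, built exactly as in the
// text: phi0 = 1, phi1 = x - alpha1, phi2 = (x - alpha2) phi1 - beta1 phi0.
struct OrthoBasis {
    std::vector<double> phi0, phi1, phi2;
};

OrthoBasis buildBasis(const std::vector<double>& x) {
    std::size_t n = x.size();
    OrthoBasis b;
    b.phi0.assign(n, 1.0);
    double alpha1 = dot(x, b.phi0) / dot(b.phi0, b.phi0);  // mean of x
    for (double xi : x) b.phi1.push_back(xi - alpha1);

    std::vector<double> xphi1(n);
    for (std::size_t i = 0; i < n; ++i) xphi1[i] = x[i] * b.phi1[i];
    double alpha2 = dot(xphi1, b.phi1) / dot(b.phi1, b.phi1);
    double beta1  = dot(xphi1, b.phi0) / dot(b.phi0, b.phi0);
    for (std::size_t i = 0; i < n; ++i)
        b.phi2.push_back((x[i] - alpha2) * b.phi1[i] - beta1 * b.phi0[i]);
    return b;
}
```

For these points, $\alpha_1 = 2$, $\alpha_2 = 2$ and $\beta_1 = 2$, giving $\varphi_1(x) = x - 2$ and $\varphi_2(x) = (x-2)^2 - 2$.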
If we continue this process further, it is easy to prove by induction the recurrent formulas for obtaining the orthogonal polynomials:

$$\varphi_{j+1}(x) = (x - \alpha_{j+1})\, \varphi_j(x) - \beta_j\, \varphi_{j-1}(x) \qquad (5)$$

where

$$\alpha_{j+1} = \frac{\sum_{i=1}^{n} x_i\, (\varphi_j(x_i))^2}{\sum_{i=1}^{n} (\varphi_j(x_i))^2}, \qquad \beta_j = \frac{\sum_{i=1}^{n} x_i\, \varphi_j(x_i)\, \varphi_{j-1}(x_i)}{\sum_{i=1}^{n} (\varphi_{j-1}(x_i))^2}$$

[Graph 2: y axis 50-250, x axis 5-35.]
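Formula (4) together with recurrence (5) gives a complete fitting procedure. Below is a compact C++ sketch of it (our own illustration; the name orthoFit is hypothetical): it builds $\varphi_0, \varphi_1, \ldots$ on the fly via (5) and accumulates $a_k\, \varphi_k(x_i)$ with $a_k = (f, \varphi_k)/\|\varphi_k\|^2$.

```cpp
#include <cstddef>
#include <vector>

// Mean-square polynomial fit over sample points using the three-term
// recurrence (5) for the orthogonal basis and the coefficient formula
// a_k = (f, phi_k) / ||phi_k||^2. Returns the fitted values at the points.
std::vector<double> orthoFit(const std::vector<double>& x,
                             const std::vector<double>& y, int degree) {
    std::size_t n = x.size();
    auto dot = [](const std::vector<double>& u, const std::vector<double>& v) {
        double s = 0.0;
        for (std::size_t i = 0; i < u.size(); ++i) s += u[i] * v[i];
        return s;
    };
    std::vector<double> prev(n, 0.0), cur(n, 1.0);  // phi_{-1} = 0, phi_0 = 1
    std::vector<double> fit(n, 0.0);
    for (int j = 0; j <= degree; ++j) {
        double a = dot(y, cur) / dot(cur, cur);  // a_j = (f,phi_j)/||phi_j||^2
        for (std::size_t i = 0; i < n; ++i) fit[i] += a * cur[i];
        // Recurrence (5): phi_{j+1} = (x - alpha_{j+1}) phi_j - beta_j phi_{j-1}
        std::vector<double> xcur(n);
        for (std::size_t i = 0; i < n; ++i) xcur[i] = x[i] * cur[i];
        double alpha = dot(xcur, cur) / dot(cur, cur);
        double beta  = (j == 0) ? 0.0 : dot(xcur, prev) / dot(prev, prev);
        std::vector<double> next(n);
        for (std::size_t i = 0; i < n; ++i)
            next[i] = (x[i] - alpha) * cur[i] - beta * prev[i];
        prev = cur;
        cur = next;
    }
    return fit;
}
```

A quick sanity check: fitting the values $y = x^2$ at x = 0..4 with degree 2 reproduces them exactly, since the target is itself a quadratic.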
3.3 Generating a Non-Uniformly Distributed Set of Values of Tokens

Consider the statement for compression (3) as a starting point. Let us consider each symbol of that statement as the integer value of its ASCII code (65 symbols):

{85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32, 111, 102, 32, 84, 101, 99, 104, 110, 111, 108, 111, 103, 121, 44, 32, 74, 97, 109, 97, 105, 99, 97, 58, 32, 34, 69, 120, 99, 101, 108, 108, 101, 110, 99, 101, 32, 116, 104, 114, 111, 117, 103, 104, 32, 107, 110, 111, 119, 108, 101, 100, 103, 101, 34}   (6)

After Huffman compression, based on the algorithm of building a binary tree and the procedure of building a compressed file as described above, we get a new set of values of symbols (37 symbols):

{109, 66, 133, 142, 1, 202, 68, 153, 24, 105, 242, 147, 151, 167, 99, 40, 52, 0, 76, 8, 140, 245, 104, 235, 189, 83, 163, 158, 137, 73, 252, 44, 41, 64, 247, 31, 179}   (7)

We can see the set of these points on Graph 1.

[Graph 1: the 37 token values of set (7), y axis 100-250, plotted against token position, x axis 5-35.]

We will use a third-order interpolation polynomial $\varphi(x)$ with the following conditions:

$\varphi(x_1) = y_1$ - for the first symbol
$\varphi(x_n) = y_n$ - for the last symbol

We developed a computer program which automatically finds the best interpolation polynomial, in terms of the minimum entropy of the set of distances between the points of set (7) and $\varphi(x)$:

$$\varphi(x) = K_0 \varphi_0(x) + K_1 \varphi_1(x) + K_2 \varphi_2(x) + K_3 \varphi_3(x) \qquad (8)$$

where $K_0, K_1, K_2, K_3$ are the coefficients of the orthogonal polynomials, based on formula (4), and

$\varphi_0(x) = 1$ - the first orthogonal polynomial
$\varphi_1(x) = 3.3(x - 17.25)$ - the second orthogonal polynomial
$\varphi_2(x) = -0.17((x - 20.5)(x - 17.25) - 201.2)$ - the third orthogonal polynomial
$\varphi_3(x) = -0.033((x - 21.25)((x - 20.5)(x - 17.25) - 201.2) - 80(x - 17.25))$ - the fourth orthogonal polynomial
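As a sanity check, set (6) is exactly the byte sequence of the base text (3); the C++ snippet below (our own illustration) confirms the correspondence.

```cpp
#include <cstring>

// The base text (3) and its expected ASCII values, set (6).
const char* kMotto =
    "University of Technology, Jamaica: \"Excellence through knowledge\"";

const int kSet6[65] = {
    85, 110, 105, 118, 101, 114, 115, 105, 116, 121, 32,  111, 102,
    32,  84,  101, 99,  104, 110, 111, 108, 111, 103, 121, 44,  32,
    74,  97,  109, 97,  105, 99,  97,  58,  32,  34,  69,  120, 99,
    101, 108, 108, 101, 110, 99,  101, 32,  116, 104, 114, 111, 117,
    103, 104, 32,  107, 110, 111, 119, 108, 101, 100, 103, 101, 34};

// True when every byte of the motto matches set (6).
bool mottoMatchesSet6() {
    if (std::strlen(kMotto) != 65) return false;
    for (int i = 0; i < 65; ++i)
        if (static_cast<unsigned char>(kMotto[i]) != kSet6[i]) return false;
    return true;
}
```

The motto is 65 characters long, and each character's ASCII value equals the corresponding entry of set (6).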
5. Acknowledgements