
A HUFFMAN-BASED TEXT ENCRYPTION ALGORITHM

Ruy Luiz Milidiú
Departamento de Informática
Pontifícia Universidade Católica (PUC-Rio)
milidiu@inf.puc-rio.br

Claudio Gomes de Mello
Departamento de Engenharia de Sistemas (DE/9)
Instituto Militar de Engenharia (IME)
cgmello@de9.ime.eb.br

José Rodrigues Fernandes
Faculdade de Informática
Universidade Católica de Petrópolis (UCP)
jose.rodrigues@ucp.br

Abstract. Huffman coding achieves compression by assigning shorter codes to the most frequent symbols and longer codes to the rare ones. This paper describes two cipher procedures added to Huffman coding to provide encryption besides its compression feature. We use multiple substitution to disguise symbols with “fake” codes and a stream cipher to encrypt these codes. The proposed scheme is simple and fast, and it generates encrypted documents with enough confusion and diffusion.

1 Introduction

In order to maintain secure communication between remote points, it is necessary to guarantee the integrity and confidentiality of both incoming and outgoing information. The communication cost is related to the volume of exchanged information; hence, information compression is essential. Besides that, one must also guarantee that sniffers are not able to decipher messages in transit.

To protect data against statistical analysis, Shannon [Shan49] suggested that the language redundancy should be reduced before encryption. We use the well-known Huffman codes to achieve this. Huffman codes have the optimal average number of bits per character among prefix-free codes. Besides that, Rivest et al [Rive96] tried to cryptanalyse a file that had been Huffman coded (and not encrypted) and found it to be “surprisingly difficult”.

First, let us introduce some concepts. Let P be the set of possible plaintexts, and S = {s1, s2, …, sn} the plaintext alphabet of X, X ∈ P, such that X = x1x2… where xi ∈ S. Let n be the number of symbols in S. If pi is the probability of si appearing in the plaintext X, then the entropy of X, defined by

H(X) = −Σi pi · log2 pi,

is the average number of bits needed to represent each symbol si ∈ S. Moreover, we say that H(X) leads to zero redundancy, that is, it is the exact number of bits necessary to represent S.

The encoding produced by Huffman’s algorithm is prefix-free and satisfies [Stin95]:

H(X) ≤ l(Huffman) < H(X) + 1,

where l is the weighted average codeword length.
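
To make these quantities concrete, the short sketch below (ours, not taken from the paper) computes H(X) from raw symbol counts. With the frequencies of the 25-symbol example used in section 2 it prints H(X) ≈ 1.757, while the Huffman code of figure 2 achieves a weighted average length of 44/25 = 1.76 bits, consistent with the bound above.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Computes H(X) = -SUMi pi * log2(pi) from raw symbol counts.
    double entropy(const std::vector<long>& counts) {
        long total = 0;
        for (long c : counts) total += c;
        double h = 0.0;
        for (long c : counts) {
            if (c == 0) continue;  // the term 0 * log 0 is taken as 0
            double p = static_cast<double>(c) / total;
            h -= p * std::log2(p);
        }
        return h;
    }

    int main() {
        // Frequencies of s1..s4 in the 25-symbol example plaintext of section 2.
        std::printf("H(X) = %.3f bits per symbol\n", entropy({12, 7, 3, 3}));
    }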

Here, we propose two cryptosystems based on Huffman coding.

Sometimes an alphabet provides multiple substitutions for a letter. Thus, a symbol xi of a plaintext X, instead of always being replaced by a codeword c, will be replaced by any codeword of a set (c1, c2, ...). The alternates used in such a multiple substitution are called homophones [Simm91].

Günther [Gunt88] introduced a coding technique using homophones whose encoding tables generate a stream of symbols which all have the same frequency. Then, Massey et al [Mass89] proposed a scheme, based on Günther’s homophonic substitution, that generates homophones by decomposing the probability of each symbol into a sum of negative powers of 2, generating new symbols.

Our first cipher is a multiple substitution procedure. We substitute each Huffman coded symbol by a string of “fake” codes followed by the symbol itself. It is a steganographic technique, that is, it disguises the symbol by mixing it with fake ones.

Each symbol has an identification bit (ID bit) that marks it as real (bit 0) or fake (bit 1).

Our second procedure is a stream cipher. It encodes the ID bit of each symbol by operating a XOR (exclusive-or) with a given secret-key.

The result is an encrypted Huffman coding and decoding that can be used in communication or in gigabyte-sized document collections as proposed in [Moff94].

In section 2, Huffman coding, decoding and their properties are introduced. In section 3, we describe the multiple substitution and the stream cipher used to modify Huffman codes to add the encryption feature. In section 4, we relate some experiments using our implementation to compress and encrypt some documents. Finally, in section 5, we conclude our work.

2 Huffman Codes

Suppose that a plaintext has a set of n different symbols S = {s1,...,sn}, n > 1, and that the frequency of each symbol in the plaintext is known. A code for each symbol is then required in order to compress the plaintext, restricted to prefix-free codes.

Prefix-free means that no codeword is the first part (prefix) of another codeword. When decoding, it is thus easy to recognize the end of a codeword without reading the next one.

That is, the code is given by a prefix-free binary tree, where each symbol is located at one of the tree’s leaves. A walk through the tree assigns the codeword of a symbol: the branch labels on the path from the root spell the codeword of the leaf that is reached.

Huffman codes were introduced by David Huffman [Huff52] in 1952. This coding scheme compresses texts by assigning shorter codes to the most used symbols and longer codes to the rare ones. Now, let us illustrate the approach of Huffman’s algorithm. Suppose the following plaintext with 25 symbols:

s2 s3 s2 s1 s2 s1 s4 s3 s1 s2 s1 s3 s1 s4 s1 s4 s1 s2 s1 s2 s1 s1 s1 s2 s1

The frequencies of each symbol are calculated, and a prefix-free tree labeled with 0 (left child) or 1 (right child), shown in figure 1, is created for these symbols.

Figure 1 – Prefix-free tree (s1 at depth 1; s2 at depth 2; s3 and s4 at depth 3)

Then, a walk in the tree through the leaves generates the codetable of figure 2.

Symbol   Frequency   Codeword
s1       12          0
s2       7           11
s3       3           100
s4       3           101

Figure 2 – Codetable

The plaintext is encoded with 44 bits as follows:

11100110110101100011010001010101011011000110

In a standard text coding, we would have 8 bits per symbol, and hence 200 bits for the above plaintext. With the codetable of figure 2, we achieve only 44 bits. This is due to the fact that the most frequent symbols in the plaintext have the smallest codeword lengths.
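
For concreteness, here is a minimal C++ sketch (our illustration, not the paper’s implementation) of the classical construction: repeatedly merge the two least frequent trees using a priority queue, then walk the result to read off the codewords. Between symbols of equal frequency the 0/1 labels may differ from figure 1, since any consistent labeling yields an equivalent prefix-free code.

    #include <cstdio>
    #include <queue>
    #include <string>
    #include <vector>

    // A node of the prefix-free (Huffman) tree.
    struct Node {
        long freq;
        int symbol;  // >= 0 for a leaf, -1 for an internal node
        Node *left = nullptr, *right = nullptr;
    };

    struct ByFreq {  // makes the priority queue a min-heap on frequency
        bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
    };

    // Collects the codeword of each leaf: left edges emit '0', right edges '1'.
    void walk(const Node* n, const std::string& code, std::vector<std::string>& table) {
        if (n->symbol >= 0) { table[n->symbol] = code; return; }
        walk(n->left, code + "0", table);
        walk(n->right, code + "1", table);
    }

    int main() {
        std::vector<long> freq = {12, 7, 3, 3};  // s1..s4, as in figure 2
        std::priority_queue<Node*, std::vector<Node*>, ByFreq> q;
        for (int i = 0; i < (int)freq.size(); ++i) q.push(new Node{freq[i], i});
        while (q.size() > 1) {  // merge the two least frequent trees
            Node* a = q.top(); q.pop();
            Node* b = q.top(); q.pop();
            q.push(new Node{a->freq + b->freq, -1, a, b});
        }
        std::vector<std::string> table(freq.size());
        walk(q.top(), "", table);  // nodes deliberately leaked for brevity
        for (int i = 0; i < (int)table.size(); ++i)
            std::printf("s%d -> %s\n", i + 1, table[i].c_str());
    }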

Huffman decoding is almost immediate. With the Huffman codes assigned to each symbol, we parse the bits of the encoded text while walking through the Huffman tree. Guided by the branch labels matching each bit, we traverse the coding tree until reaching a leaf. At this leaf, we find the encoded symbol.
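
A matching decoding sketch (again ours): the codetable of figure 2 is loaded into a binary trie, and the 44-bit string above is decoded branch by branch; it prints back the 25-symbol example plaintext.

    #include <cstdio>
    #include <string>

    // A binary trie rebuilt from the codetable of figure 2; each codeword
    // ends at a leaf that holds its symbol number.
    struct Trie {
        int symbol = -1;
        Trie* child[2] = {nullptr, nullptr};
    };

    void insert(Trie* node, const std::string& code, int symbol) {
        for (char bit : code) {
            Trie*& c = node->child[bit - '0'];
            if (!c) c = new Trie;
            node = c;
        }
        node->symbol = symbol;
    }

    int main() {
        Trie root;
        const char* code[] = {"0", "11", "100", "101"};  // s1..s4 (figure 2)
        for (int i = 0; i < 4; ++i) insert(&root, code[i], i + 1);

        // The 44-bit encoding of the example plaintext.
        std::string bits = "11100110110101100011010001010101011011000110";
        const Trie* node = &root;
        for (char b : bits) {
            node = node->child[b - '0'];  // follow the labeled branch
            if (node->symbol != -1) {     // reached a leaf: emit the symbol
                std::printf("s%d ", node->symbol);
                node = &root;
            }
        }
        std::printf("\n");
    }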

A Huffman tree is an optimal tree, but several other optimal trees exist. Some of them are easily obtained by exchanging the places of symbols si at the same level of the tree. This can be used to hide code information.

Wayner [Wayn88] proposed two methods for assigning a key to a tree. First, suppose we have a Huffman tree with N leaves. It is well known that for a strict binary tree with N leaves we have:

1. N − 1 internal nodes;
2. The depth H of the tree satisfies ⌈log2 N⌉ ≤ H ≤ N − 1.

Therefore, Wayner proposed that we can obtain a set of optimal Huffman trees by operating an XOR between the N − 1 branch labels of the tree and a control key of size N − 1. Alternatively, we can assign one bit of a key to each level of the tree; the key can be quite short in this case, O(log N). Milidiú et al [Mili97] show that one can efficiently implement prefix-free codes with length restrictions, obtaining very effective codes with little loss of compression.

These two procedures lead to cipher procedures.
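
As a toy illustration of the first method (our reading of it, not code from the paper), the sketch below spends one control-key bit per internal node of the figure 1 tree; a key bit 1 swaps that node’s 0/1 branches, yielding a different but equally optimal prefix-free code.

    #include <cstdio>
    #include <string>
    #include <vector>

    // One key bit per internal node: bit 1 swaps the 0/1 branches below it.
    struct Node {
        int symbol = -1;  // >= 0 at a leaf
        Node *left = nullptr, *right = nullptr;
    };

    void codes(const Node* n, const std::string& prefix, const std::string& key,
               int& k, std::vector<std::string>& table) {
        if (n->symbol >= 0) { table[n->symbol] = prefix; return; }
        bool swap = key[k++ % key.size()] == '1';
        const Node* zero = swap ? n->right : n->left;
        const Node* one = swap ? n->left : n->right;
        codes(zero, prefix + "0", key, k, table);
        codes(one, prefix + "1", key, k, table);
    }

    int main() {
        // The tree of figure 1: three internal nodes, leaves s1, s2, s3, s4.
        Node s1{0}, s2{1}, s3{2}, s4{3};
        Node m34{-1, &s3, &s4}, m2{-1, &m34, &s2}, root{-1, &s1, &m2};
        std::vector<std::string> table(4);
        int k = 0;
        codes(&root, "", "101", k, table);  // a 3-bit control key
        for (int i = 0; i < 4; ++i)
            std::printf("s%d -> %s\n", i + 1, table[i].c_str());
    }

With the key 101 it prints s1 -> 1, s2 -> 01, s3 -> 001, s4 -> 000: all codeword lengths, and hence optimality, are preserved.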

3 Encrypted Huffman

Here, we propose two cryptosystems to add encryption properties to Huffman codes.

A cryptosystem is a five-tuple (P, C, K, E, D), where the following conditions are satisfied:

1. P is a finite set of possible plaintexts;
2. C is a finite set of possible ciphertexts;
3. K, the keyspace, is a finite set of possible keys;
4. For each k in K, there is an encryption rule ek ∈ E and a corresponding decryption rule dk ∈ D. Each ek: P → C and dk: C → P are functions such that dk(ek(X)) = X for every plaintext X in P.

3.1 Multiple Substitution

Our first procedure is a multiple substitution cipher. We insert “null” symbols in the ciphertext. A “null” symbol means nothing; it is included only to prevent easy decoding of the text by unauthorized people. This null symbol η, which we call a fake symbol, is inserted in the ciphertext with multiple codes, also called homophones.

We use a set ∆ = {δ1, δ2, …, δm} of fake symbols generated to disguise the output of the effective null symbol. Next, we describe the method.

Our multiple substitution is a cryptosystem (P, C, K, L, F, E, D) where we additionally define:

1. L as a finite set of possible fake codes representing the fake symbol η. So, L is a subset of C and S+ = S ∪ {η}, where S+ is the alphabet of symbols plus the null symbol;
2. F = (f1, f2, …) as the fake code generator. For i ≥ 1, fi: P → P^(i−1) × L.

a. Codebook construction

The set ∆ of fake codes are indeed homophonic codes of η. These homophones can be generated according to several alternatives, such as the three below; a sketch of the alternative we adopt closes this subsection.

Alternative 1: After collecting the plaintext alphabet S = {s1, s2, …, sn} with its frequency values, create m other symbols, defined as fake symbols δj with j in [1,m], that are homophones of η. Generate random frequencies for each δj symbol and then construct the Huffman tree with the S and {δ1, δ2, ..., δm} symbols together in the same tree;

Alternative 2: Create m symbols δj with j in [1,m] with frequency values equal to those of the first m symbols of S in the Huffman tree. Then, construct a second, fake Huffman tree;

Alternative 3: After constructing the Huffman tree with the S symbols, let the codes of the first m symbols represent the m fake codes too.

In both alternatives 2 and 3 we need an extra identification bit (ID bit) to indicate whether a symbol code is real (bit 0) or fake (bit 1).

We call α the fake tree generation rate. It is a parameter that controls the expansion rate, say 30% for example, of fake symbols in the coding tree. With this rate we can configure the number of distinct fake symbols created as m = α · n, as shown in figure 3, which illustrates alternative 2.

Figure 3 – Fake tree construction: the rate α generates a new fake Huffman tree

We indeed use alternative 3, due to its savings in memory space and in the time otherwise needed to construct a Huffman tree with more symbols (alternative 1) or a second tree (alternative 2).

So, α is the symmetric key that defines the number of distinct fake symbols used, as shown in figure 3.
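
A sketch of alternative 3, under our reading of the scheme: no extra tree is built, and the codewords of the first m = α · n symbols simply double as the homophones of η (rounding m up to at least one fake code is our assumption; the paper does not state the rounding).

    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    // Alternative 3: reuse the first m = alpha * n codewords as fake codes.
    // The ID bit added by the coder tells real and fake occurrences apart.
    int main() {
        std::vector<std::string> codetable = {"0", "11", "100", "101"};  // s1..s4
        double alpha = 0.30;  // the fake tree generation rate
        int n = (int)codetable.size();
        int m = std::max(1, (int)std::ceil(alpha * n));  // number of fake codes
        std::vector<std::string> fake(codetable.begin(), codetable.begin() + m);
        std::printf("m = %d fake code(s):", m);
        for (const auto& c : fake) std::printf(" %s", c.c_str());
        std::printf("\n");  // prints: m = 2 fake code(s): 0 11
    }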

b. Coding

With the m fake codes defined, we use a function to generate a string of fake codes for each symbol xi of X = x1x2….

Let Li be the string of codes generated for xi. The codes of Li are a string of mi homophones of η, that is, fake codes δj with j in [1,m], followed by the symbol xi itself. So, Li = (δ1, δ2, ..., δmi, xi), as shown in figure 4. The δj codes are randomly chosen from the m fake codes; sequential selection could be used if desired.

Figure 4 – Generation of homophones of η followed by the effective symbol: fi maps xi to Li = (δ1, δ2, ..., δmi, xi)

Finally, let β be the fake symbol generation rate, that is, the probability of generating a fake code at each round before outputting the effective code.

So, the coding procedure is the following pseudo-code (a runnable sketch of it is given after the analysis below):

1. Choose a random number p, p in [0,1];
2. If p < β, then eRs(xi) = δj and return to step 1;
3. Else eRs(xi) = xi and i is increased by 1;
4. If i ≤ |X|, return to step 1;
5. End.

The parameter β is used to set the number of fake symbols generated between real symbol outputs. It is used to balance diffusion against text expansion.

So, the group emitted for each symbol (mi fake codes plus the effective one) follows a Geometric distribution, defined by the following expression:

P[mi + 1] = β^((mi+1)−1) · (1 − β) = β^mi · (1 − β)
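
The pseudo-code translates directly into the following C++ sketch (ours; the RNG choice, the seed and the placement of the ID bit as a one-bit prefix of each codeword are illustrative assumptions):

    #include <cstdio>
    #include <random>
    #include <string>
    #include <vector>

    // Multiple substitution coder: before each real symbol, fake codes are
    // emitted with probability beta per round (a geometric number of fakes),
    // and every codeword is prefixed with its ID bit (0 = real, 1 = fake).
    int main() {
        std::vector<std::string> code = {"0", "11", "100", "101"};  // s1..s4
        std::vector<std::string> fake = {"0", "11"};  // alternative 3, m = 2
        std::vector<int> plaintext = {1, 2, 0, 3};    // s2 s3 s1 s4
        double beta = 0.30;

        std::mt19937 rng(42);
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        std::uniform_int_distribution<int> pick(0, (int)fake.size() - 1);

        std::string out;
        for (int x : plaintext) {
            while (coin(rng) < beta)           // step 2: emit a fake code
                out += "1" + fake[pick(rng)];  // ID bit 1 marks it as fake
            out += "0" + code[x];              // step 3: the effective symbol
        }
        std::printf("%s\n", out.c_str());
    }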

The decoding procedure is very simple, as shown below and sketched right after:

1. If the codeword read is a symbol of S, then dk(s = xi) = xi;
2. Else dk(s = δj) = η, and the null symbol is skipped.
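
A decoder sketch under the same ID-bit assumption: read the ID bit, match the following codeword (prefix-freeness makes the match unique), and drop the codeword when it is flagged as fake.

    #include <cstdio>
    #include <string>
    #include <vector>

    // Multiple substitution decoder over the s1..s4 codetable of figure 2.
    int main() {
        const std::vector<std::string> code = {"0", "11", "100", "101"};
        // One fake "11", then the real symbols s2 and s1: 1|11 0|11 0|0
        const std::string in = "11101100";

        size_t i = 0;
        while (i < in.size()) {
            bool isFake = (in[i++] == '1');  // the ID bit
            for (int s = 0; s < (int)code.size(); ++s) {
                if (in.compare(i, code[s].size(), code[s]) == 0) {
                    i += code[s].size();         // consume the codeword
                    if (!isFake) std::printf("s%d ", s + 1);
                    break;                       // unique match: prefix-free
                }
            }
        }
        std::printf("\n");  // prints: s2 s1
    }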

By generating fake symbols, we insert new characters in the encoded text in order to flatten the overall distribution of the symbols of a given language.

c. Diffusion

A well-known cipher attack is statistical analysis. In any language, some characters are used more than others. Hence, an attack could be carried out by counting the frequencies of each symbol in a ciphertext and trying to assign each one to the character of the language whose distribution frequency matches the one found in the ciphertext.

With fake code generation, we achieve some diffusion in the distribution of code frequencies. Then, counting the frequencies of codes can lead to wrong assignments of symbols.

Obviously, a great disadvantage of this scheme is text expansion. But it has great benefits too: diffusion of the frequency distribution and the lack of correlation between symbols in the language. Both features lead to a more difficult statistical analysis.

d. Text Expansion

To estimate the text expansion of multiple substitution, we observe that:

(i) The average number of output ciphers generated is the mean of a Geometric distribution, that is, 1/(1 − β);
(ii) One additional bit per character is needed due to the ID bit. Since we have H(X) ≤ l(Huffman) < H(X) + 1, this leads to H(X) + 2.

The text expansion can be estimated by the average number of characters generated times the average number of bits. So, the average number of bits B per character is:

B ≤ [1/(1 − β)] · (H(X) + 2)

For example, using β = 0.30, and assuming H(X) = 4.19 for monogram parsing and an average HL = 1.25 for the entropy of the English language, we get 1/(1 − β) = 1/(1 − 0.30) ≈ 1.43, that is, 43% of text expansion due to fake codes, and:

B1 ≤ 1.43 · (4.19 + 2) = 8.85
BL ≤ 1.43 · (1.25 + 2) = 4.65

So, we should still achieve compression using word parsing in our encrypted Huffman when compared to the standard ASCII representation (8 bits per character).

3.2 Stream Cipher

Suppose that someone has the encoded text, the Huffman codetable and the number of fake codes defined by β. Then, decoding is immediate. Therefore, a secret-key is necessary to add confusion to the process. The secret-key we use is fully scalable: it can have any length we desire, 48 bits, 128 bits, 256 bits, etc.

So, we introduce a second procedure, a stream cipher.

A stream cipher is a cryptosystem (P, C, K, L, F, E, D) where we additionally define:

1. L as a finite set called the keystream alphabet;
2. F = (f1, f2, …) as the keystream generator. For i ≥ 1, fi: K × P^(i−1) → L.

a. Keystream

Let k ∈ K and X = x1x2…. The stream cipher procedure is defined as follows:

(i) Z = z1z2… is the keystream. We have a function fi that generates zi from k: zi = fi(k, x1, x2, …, xi−1);
(ii) zi is used to cipher xi such that yi = ezi(xi).

In general, there is a new key zi for each incoming xi, and this key zi is generated from the past z1, y1, z2, y2, …, zi−1, yi−1.

Our method consists of a single constant function such that zi = f(k, i).

If |k| < |X|, that is, if the size of the key is less than the size of the plaintext, we have two alternatives to generate the keystream:

Alternative 1: Cyclic keystream generation - when the key k ends, we return to its beginning, and so on. The key bit is then defined by zi = f(k, i) = k[i mod φ], where φ = |k|;

Alternative 2: Random keystream generator - a function f generates bit-streams using the key k as a seed.

We adopt alternative 1, as illustrated in figure 5 and sketched below. This way, we maintain Huffman's coding synchronism properties. Alternative 2 is not used since it creates a dependency on the past.

keystream k:   0 0 1 0 0 1 1 0 1 0
                     XOR
ID bits:       0 1 1 0 1   (the first bit of each codeword)

Figure 5 – XOR between the i-th bit of the key k and the ID bit of the codeword
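
Alternative 1 amounts to one line of code; the sketch below (ours) prints the first 15 keystream bits for the 10-bit key of figure 5.

    #include <cstdio>
    #include <string>

    // Cyclic keystream: zi is the key bit at position i mod |k|. No state is
    // carried between symbols, so Huffman's decoding synchronism is preserved.
    int main() {
        const std::string k = "0010011010";  // the key shown in figure 5
        for (int i = 0; i < 15; ++i)         // z0 .. z14
            std::printf("%c", k[i % k.size()]);
        std::printf("\n");                   // prints: 001001101000100
    }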

b. Coding

We define the coding and decoding procedures as XOR (exclusive-or, ⊕) operations between the bit zi and all the bits of the codeword. This is equivalent to exchanging the places of symbols in the Huffman tree:

yi = ek(xi) = (zi ⊕ xi,1, zi ⊕ xi,2, …, zi ⊕ xi,|xi|)
xi = dk(yi) = (zi ⊕ yi,1, zi ⊕ yi,2, …, zi ⊕ yi,|yi|)

However, we actually use a simpler procedure, defined by a XOR between the bit zi and only the first bit of the codeword. This is equivalent to disguising only the ID bit:

yi = ek(xi) = (zi ⊕ h1, xi,2, …, xi,|xi|)
xi = dk(yi) = (zi ⊕ h1′, yi,2, …, yi,|yi|)

where h1 and h1′ are the ID bits of xi and yi, respectively.



c. Confusion

With this XOR operation, we add concealment to our method. It is simple and has a low processing cost. The secret-key can have any length, and any kind of trial-and-error attack would be so time consuming as to make the analysis unfeasible.

In this scheme, we have indeed a composite key that contains α and the keystream seed k, as shown in figure 6.

Figure 6 – Composed key: the pair (α, k)

With this secret key included in the process, we have an encryption system to use over insecure communication channels.

d. Text Expansion

The stream cipher does not cause any additional text expansion.

4 Experiments

In table 1, we list some results obtained with a C++ implementation of our encrypted Huffman coding and decoding. We use the Brazilian Constitution and the Gutenberg Project collection.

We use α = 0.30 and monogram parsing; all sizes are expressed in bytes.

We first measure the storage space required to compress the document collection using standard Huffman coding.

Then, we use the encrypted Huffman and measure the difference between the two required spaces due to the encryption features, that is, the extra space needed.

Observe that the text expansion is very close to its expected value. Moreover, the relative additional time needed to introduce encryption into the compression process is about 5%.

                              Brazilian       Gutenberg
                              Constitution    Project
Plaintext size                7.004.160       39.059.456
Encoded text size (Huffman)   4.078.717       22.813.331
Ciphertext size               8.309.822       52.539.412
Expansion over plaintext      19%             34%
Expansion over encoded text   104%            130%

Table 1 – Extra-space results

5 Conclusions

Usually, one runs two serial procedures to first compress and then encrypt a file. In this work, we proposed simple modifications to Huffman codes that add encryption to their compression feature.

The result is a fast, low computational power scheme that provides enough confusion and diffusion in the ciphertext. That follows from both its theoretical and practical properties.

The diffusion and confusion are also controlled by parameters. The β factor controls the rate of fake code generation. With β we can set the ciphertext expansion due to null symbol insertion, which is directly associated with the security of the file, while the size of the key k is associated with the security of the confusion feature.

References

[Gunt88] Günther, C.G. 1988. A Universal Algorithm for Homophonic Coding. In Advances in Cryptology - Eurocrypt '88, LNCS, vol. 330.

[Huff52] Huffman, D. 1952. A Method for the Construction of Minimum-Redundancy Codes. Proc. IRE, 1098-1101.

[Mass89] Massey, J.L., Kuhn, Y.J.B., Jendal, H.N. 1989. An Information-Theoretic Treatment of Homophonic Substitution. In Advances in Cryptology - Eurocrypt '89, LNCS, vol. 434.

[Mili97] Milidiú, R.L., Laber, E.S. 1997. Improved Bounds on the Inefficiency of Length-Restricted Prefix Codes.

[Moff94] Moffat, A., Witten, I.H., Bell, T.C. 1994. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold.

[Rive96] Rivest, R.L., Mohtashemi, M., Gillman, D.W. 1996. On Breaking a Huffman Code. IEEE Transactions on Information Theory, vol. 42, no. 3.

[Shan49] Shannon, C. 1949. Communication Theory of Secrecy Systems. Bell Syst. Tech. J., vol. 28, no. 4, pp. 656-715.

[Simm91] Simmons, G. 1991. Contemporary Cryptology - The Science of Information Integrity. IEEE Press.

[Stin95] Stinson, D.R. 1995. Cryptography: Theory and Practice. CRC Press.

[Wayn88] Wayner, P. 1988. A Redundancy Reducing Cipher. Cryptologia, 107-112.
