
CREATIVE COMPONENT

THE SECRETS THAT LIE BENEATH:


THE KEYS TO SOLVING THE VIGENERE CIPHER

MATTHEW HENNEKE

Abstract. In the sixteenth century, Blaise de Vigenere formulated the
Vigenere Cipher as a form of encryption. Two individuals, Friedrich
Kasiski and William Friedman, derived methods for determining the
length of the keyword. Once the keyword length is known, there are
methods to decipher text that is encrypted by the Vigenere Cipher.
Dr. David Wright programmed Maple code that employs the decryption
methods of Kasiski and Friedman. The purpose of this paper is to
analyze the methods derived by Kasiski and Friedman and the Maple
code that is used for decryption of text.

Contents
1. Introduction
2. Basic Terminology
3. The Vigenere Cipher
4. Kasiski's Method for Finding the Length of the Keyword
5. Friedman's Method for Finding the Length of the Keyword
6. Constructing the Keyword
7. Recovery of the Plain Text Message
8. Appendix A: Maple Code
References

Date: August 2002.



1. Introduction
Written text has been a vital part of every civilization for thousands
of years. It has been used to record events, tell stories, and relay
information for both personal and public use. Through the years,
though, the need for text to be understandable to only a few select
people has been of the utmost importance. For government and military
officials, the ability to send messages to their colleagues without
detection by the public or the enemy was crucial to national security.
Thus, as time passed, individuals began to toy with the idea of hiding
the true message through intentional alterations and transformations
that appeared random to the general public. This is where the art of
cryptology emerged, and it still resides there today. There is evidence
that cryptology has been in existence since at least the first century
B.C., when Julius Caesar was using his own cipher system.
Through the years, cryptology has evolved to include many different
techniques and methods by which to encrypt written text. The process
of encryption begins with a piece of written text known as the plain
text (e.g., the lyrics of the Beatles' The Long and Winding Road),
which is composed from the plain alphabet. A rule of shifts,
substitutions, or other transformations is then applied to the plain
text. The alphabet that is used in conjunction with the rule is known
as the cipher alphabet. What results from the application of the rule
or rules is referred to as the cipher text. It is also possible to
apply the rule or rules in reverse, which deciphers the cipher text
and produces the original plain text. The rule by which text is
transformed can be as simple as a kid's decoder ring found as a prize
in a cereal box or as complex as a scheme involving modular arithmetic
and other numerical techniques.
Under the general category of cryptology, there are several methods
that head more specific subcategories of methods. One such method is
the shift cipher, in which a copy of the plain alphabet is shifted a
number of places to obtain the cipher alphabet. For instance, if the
cipher alphabet is a shift of five letters from the plain alphabet,
then the letter A is encrypted as the letter F, B as G, and so forth.
For example, the title, or plain text,
THE LONG AND WINDING ROAD
would be encrypted as the cipher text
YMJ QTSL FSI BNSINSL WTFI.
For the purpose of comparison, the plain text and cipher text are
aligned one above the other in Table 1.

Table 1. Aligned plain and cipher text of The Long and Winding Road
for 1 rule

plain:  T H E L O N G A N D W I N D I N G R O A D
cipher: Y M J Q T S L F S I B N S I N S L W T F I
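The single-rule shift of Table 1 can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the Maple code discussed later; the function names shift_encrypt and shift_decrypt are chosen here for convenience, and spaces are assumed to pass through unchanged.

```python
import string

ALPHABET = string.ascii_uppercase  # the plain alphabet A..Z


def shift_encrypt(plain, shift):
    """Encrypt with a single shift rule; non-letters pass through unchanged."""
    out = []
    for ch in plain.upper():
        if ch in ALPHABET:
            # Move 'shift' places along the alphabet, wrapping around after Z.
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def shift_decrypt(cipher, shift):
    """Reverse the rule by shifting back the same number of places."""
    return shift_encrypt(cipher, -shift)


print(shift_encrypt("THE LONG AND WINDING ROAD", 5))
print(shift_decrypt("YMJ QTSL FSI BNSINSL WTFI", 5))
```

Decryption reuses the same routine with the negated shift, which mirrors the statement that the rule can be applied in reverse to recover the plain text.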

Thomas Barr's Invitation to Cryptology tells how the Roman leader
Julius Caesar used a shift cipher of three letters during the Gallic
Wars; thus, the Caesar Cipher is named after his form of cryptology
[1].
Another general method is the substitution cipher, in which each
letter or symbol is replaced directly by another letter or symbol.
This is a very common method of cryptology, familiar to most people
from Cryptoquotes, decoder rings, and other code puzzles and games.
The Cryptoquote found in most newspapers alongside the crossword
puzzle is an example of a monoalphabetic substitution cipher.
Basically, there exists a one-to-one relationship in that each letter
in the plain text is represented by only one letter in the cipher
text. Say, for instance, that the letter Q is represented by the
letter Z; then no other plain text letter may be represented by Z.
Just as the Cryptoquote is solvable based on contextual clues such as
apostrophes, short or repeated words, repetitive letter patterns, and
word patterns, it is possible to solve monoalphabetic substitution
ciphers with minimal difficulty.
A new element of complexity was eventually introduced when the
one-to-one relationship was eliminated through the use of
polyalphabetic substitution ciphers. Now, any finite number of
substitution rules may be applied to encrypt plain text. Take the
title The Long and Winding Road from the shift cipher example above
and apply a polyalphabetic substitution cipher to it. For the purpose
of demonstration, let the cipher be composed of three rules that are
applied periodically, one rule per letter of the text. The first rule
is a shift of one, so A is substituted by B, B by C, . . . , and Z by
A. The second rule is a shift of two (A by C, B by D, . . .), and the
third rule is the Caesar Cipher (A by D, B by E, . . .). Again, the
plain text
THE LONG AND WINDING ROAD
would be encrypted as the cipher text
UJH MQQH CQE YLOFLOI UPCG.
As in the example above, the plain and cipher text are aligned below
in Table 2 for comparison.

Table 2. Aligned plain and cipher text of The Long and Winding Road
for a cipher of 3 rules

plain:  T H E L O N G A N D W I N D I N G R O A D
cipher: U J H M Q Q H C Q E Y L O F L O I U P C G
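The three-rule example of Table 2 can be sketched as follows. The function name periodic_shift_encrypt is hypothetical, and the sketch assumes that only letters consume a rule, so that spaces do not advance the cycle.

```python
from typing import Sequence


def periodic_shift_encrypt(plain: str, shifts: Sequence[int]) -> str:
    """Apply the shift rules in a repeating cycle, one rule per letter."""
    out = []
    letter_index = 0  # counts letters only, so spaces do not use up a rule
    for ch in plain.upper():
        if "A" <= ch <= "Z":
            shift = shifts[letter_index % len(shifts)]
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            letter_index += 1
        else:
            out.append(ch)
    return "".join(out)


# Rules: shift of one, shift of two, then the Caesar shift of three.
print(periodic_shift_encrypt("THE LONG AND WINDING ROAD", [1, 2, 3]))
```

With a single rule, [5], the same function reproduces the shift cipher of Table 1, which shows that the monoalphabetic shift is the special case of a cycle of length one.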

While it is convenient and very beneficial to know that there is a
plethora of methods for encrypting plain text, from very basic to very
complex, the other facet of the subject also needs to be addressed.
This other facet is the decryption of the cipher text to recover the
original plain text. For every rule that is used to encrypt text, a
procedure needs to be developed that is capable of finding the
pattern(s) within the cipher text. This generally proves to be the
more difficult task, especially when polyalphabetic ciphers are
involved.
The focus of this paper is to explore the intricacies of one such polyal-
phabetic substitution cipher. The cipher of interest is known as the
Vigenere Cipher. The secret of its success is that the pattern of succes-
sive shifts of the cipher alphabets is determined by a keyword. Each
letter of the alphabet has a numeric value (A=0, B=1, . . ., Z=25) that
determines the shift amount. Thus, the length of the keyword deter-
mines the number of rules that run in a periodic cycle by which the
cipher is governed. Therefore, the key to deciphering any text that has
been encrypted by the Vigenere Cipher is determining first the length
of the keyword and then the keyword itself.
During the course of this paper, the methods developed by Friedrich
Kasiski, a Prussian army officer, and William Friedman, a
Russian-born mathematician, for determining the keyword length will be
presented. Throughout the paper, three examples will be used to
illustrate the concepts being presented: the passage of scripture
Psalm 46:10, the lyrics to the Oklahoma State University (OSU) Alma
Mater, and the lyrics to the Beatles song The Long and Winding Road.
Once the keyword length is determined, the actual keyword must be
found, and the methods by which this is done are examined next. In
addition, the Maple code written by David Wright will be commented on
in each of the major sections.

2. Basic Terminology
In this section, the basic terminology of cryptology that is used
throughout this paper is explained. First, a message that is readable
to all people is known as a plain text. The alphabet that is used to
write the plain text is known simply as the plain alphabet. A cipher
system is a rule for transforming a plain text into some unreadable
text. This rule is usually defined by some information that is
referred to as the key. The transformation itself is known as
encryption, and the unreadable text it produces is called the cipher
text. The cipher alphabet is the set of symbols that is used to write
the cipher text. In addition, each cryptosystem must also have a rule,
defined by some key, by which the cipher text can be transformed back
into the plain text. This process, the opposite of encryption, is
known as decryption. Throughout this paper, it is assumed that all
cipher systems are symmetric in that the same key is used for both
encryption and decryption.
A cipher system that is a monoalphabetic substitution cipher
maintains a one-to-one relationship between the plain and cipher
alphabets. Thus, each symbol in the plain alphabet is represented
uniquely by a symbol in the cipher alphabet. A Caesar shift cipher is
a well-known example of a monoalphabetic substitution cipher in which
the plain alphabet is shifted a fixed number of positions to obtain
the cipher alphabet. On the other hand, a polyalphabetic substitution
cipher is a cipher system in which it is possible for the same plain
text letter to be encrypted as different cipher text letters at
different places in the cipher text. Similarly, the same cipher text
letter may be decrypted as different plain text letters in different
places.

3. The Vigenere Cipher


Of the many encryption methods that have come about through the
years, only a few have withstood the test of time and man's abilities.
The sixteenth-century French statesman and mathematician Blaise de
Vigenere set out the idea of the Vigenere Cipher in his treatise
Traite des chiffres ou secretes manieres d'escrire [3]. For several
hundred years, the Vigenere Cipher proved to be the bane of
cryptanalysts, and it was thus bestowed the French title Le Chiffre
Indechiffrable [5]. Translated into English, the title reads The
Undecipherable Cipher.
The Vigenere Cipher is a polyalphabetic substitution cipher that
utilizes 26 distinct cipher alphabets and a keyword. Encryption with
the Vigenere Cipher is carried out with either the Vigenere Square or
addition modulo 26. The Vigenere Square is an arrangement of letters
in which the plain alphabet constitutes the first row and the rows
below it are cipher alphabets, each a Caesar shift of one place
further than the row above. Each cipher alphabet is labeled either by
the letter of the alphabet that selects it through the keyword or by
that letter's numerical value, usually starting at 0 (Figure 1).
From a mathematical perspective, it is possible to view the Vigenere
Cipher as addition modulo 26. First, the question of how to view
letters numerically is addressed. The answer is found in assigning a
numerical value to each letter based on its position in the alphabet,
counting from 0. For instance, A is in position 0, so its numerical
value is 0. The same follows for each of the other 25 letters; thus, Z,
the 26th letter, has a numerical value of 25. In general, let $y$ be
the numerical value of a plain text letter and let $s$ be the value of
a shift cipher. Then the shift cipher can be expressed as
$(y + s) \bmod 26$. More specifically, let each Caesar shift be
represented by $s_i$, $i = 0, 1, \ldots, 25$. Therefore, a
mathematical perspective on the Vigenere Cipher would be
$$(y + s_i) \bmod 26, \quad i = 0, 1, \ldots, 25.$$
The actual process of encryption by the two methods mentioned above
is now explored in more detail. The first method takes advantage of
the Vigenere Square itself, and the second method does not require a
square or a table because it uses modular arithmetic. The first method
begins with choosing a keyword that is repeatedly spelled out over the
entire plain text. Then, using the Vigenere Square, the column that is
headed by the plain text letter and the row that is marked by the
keyword letter are found, and their point of intersection within the
table provides the respective cipher text letter. Each cipher text
letter is written below its respective plain text letter. Below, the
text of Psalm 46:10 is encrypted using the Vigenere Cipher with the
keyword God. Using the Vigenere Square in Figure 1, find the column
that is headed by B, the first letter of the plain text, and then find
the row that is headed by G, the first letter of the keyword. Trace
the respective row and column across and down to find their
intersection, in this case the cipher text letter H. This procedure
has been repeated for the other 76 letters of the plain text in
Table 3.
Therefore, the plain text,
Be still and know that I am God; I will be exalted
among the nations, I will be exalted in the earth.
with the use of the keyword God was encrypted as
HSVZW OROQJ YQUKW NOWOO PMCGO KLRZE KSAGZ WKRDS
CQMHK KBDZW RTGLC WORPH KLDRH HJWQZ VHKOU ZV.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
B C D E F G H I J K L M N O P Q R S T U V W X Y Z A
C D E F G H I J K L M N O P Q R S T U V W X Y Z A B
D E F G H I J K L M N O P Q R S T U V W X Y Z A B C
E F G H I J K L M N O P Q R S T U V W X Y Z A B C D
F G H I J K L M N O P Q R S T U V W X Y Z A B C D E
G H I J K L M N O P Q R S T U V W X Y Z A B C D E F
H I J K L M N O P Q R S T U V W X Y Z A B C D E F G
I J K L M N O P Q R S T U V W X Y Z A B C D E F G H
J K L M N O P Q R S T U V W X Y Z A B C D E F G H I
K L M N O P Q R S T U V W X Y Z A B C D E F G H I J
L M N O P Q R S T U V W X Y Z A B C D E F G H I J K
M N O P Q R S T U V W X Y Z A B C D E F G H I J K L
N O P Q R S T U V W X Y Z A B C D E F G H I J K L M
O P Q R S T U V W X Y Z A B C D E F G H I J K L M N
P Q R S T U V W X Y Z A B C D E F G H I J K L M N O
Q R S T U V W X Y Z A B C D E F G H I J K L M N O P
R S T U V W X Y Z A B C D E F G H I J K L M N O P Q
S T U V W X Y Z A B C D E F G H I J K L M N O P Q R
T U V W X Y Z A B C D E F G H I J K L M N O P Q R S
U V W X Y Z A B C D E F G H I J K L M N O P Q R S T
V W X Y Z A B C D E F G H I J K L M N O P Q R S T U
W X Y Z A B C D E F G H I J K L M N O P Q R S T U V
X Y Z A B C D E F G H I J K L M N O P Q R S T U V W
Y Z A B C D E F G H I J K L M N O P Q R S T U V W X
Z A B C D E F G H I J K L M N O P Q R S T U V W X Y

Figure 1. The Vigenere Square

Thus, the first method of encryption of a plain text message is a
step-by-step procedure using the Vigenere Square.
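The table lookup of the first method can be sketched in Python. This is a sketch under the assumption that the square is stored as one rotated alphabet per key letter, as in Figure 1; the names SQUARE, square_lookup, and square_encrypt are illustrative and do not come from the Maple code.

```python
import string

ALPHABET = string.ascii_uppercase

# The row for a key letter is the alphabet rotated left by that letter's
# numerical value, so row A is the plain alphabet and row G starts at G.
SQUARE = {key: ALPHABET[i:] + ALPHABET[:i] for i, key in enumerate(ALPHABET)}


def square_lookup(plain_letter, key_letter):
    """Intersect the column headed by the plain letter with the key row."""
    column = ALPHABET.index(plain_letter)
    return SQUARE[key_letter][column]


def square_encrypt(plain, keyword):
    """Spell the keyword repeatedly over the letters of the plain text."""
    letters = [c for c in plain.upper() if c in ALPHABET]
    key = keyword.upper()
    return "".join(
        square_lookup(p, key[i % len(key)]) for i, p in enumerate(letters)
    )


print(SQUARE["G"])               # the G row of the square
print(square_lookup("B", "G"))   # column B, row G
print(square_encrypt("Be still", "God"))
```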
Secondly, addition modulo 26 can be used as an encryption technique.
The following notation will be used throughout the paper when modular
addition is concerned. First, let the keyword length be $k$ and the
plain text message length be $n$. For the $k$ letters in the keyword,
let their respective numerical equivalents be
$q_0, q_1, \ldots, q_{k-1}$. Also, assign numerical values to the $n$
letters of the plain text to get $y_0, y_1, \ldots, y_{n-1}$. Repeat
for the corresponding letters of the cipher text to get
$x_0, x_1, \ldots, x_{n-1}$. Now, to encrypt the plain text message,
the following formula is used:
(1) $x_i = (y_i + q_{i \bmod k}) \bmod 26, \quad i = 0, 1, 2, \ldots, n-1.$
Table 3. Encryption of Psalm 46:10 using the Vigenere Cipher and the
keyword God

key:    G O D G O D G O D G O D G O D G
plain:  B E S T I L L A N D K N O W T H
cipher: H S V Z W O R O Q J Y Q U K W N

key:    O D G O D G O D G O D G O D G O
plain:  A T I A M G O D I W I L L B E E
cipher: O W O O P M C G O K L R Z E K S

key:    D G O D G O D G O D G O D G O D
plain:  X A L T E D A M O N G T H E N A
cipher: A G Z W K R D S C Q M H K K B D

key:    G O D G O D G O D G O D G O D G
plain:  T I O N S I W I L L B E E X A L
cipher: Z W R T G L C W O R P H K L D R

key:    O D G O D G O D G O D G O
plain:  T E D I N T H E E A R T H
cipher: H H J W Q Z V H K O U Z V

Now, the method of modular arithmetic will be demonstrated with the
text of Psalm 46:10 and the keyword God. Here $k = 3$ with $q_0 = 6$,
$q_1 = 14$, and $q_2 = 3$, and $n = 77$. From the first letter of the
plain text, B, it is determined that $y_0 = 1$. Thus,
$x_0 = (y_0 + q_{0 \bmod 3}) \bmod 26 = (1 + 6) \bmod 26 = 7$, and the
letter with this numerical value is the cipher text letter H. Now, let
us examine position 33, which is occupied by the plain text letter X.
Thus, $y_{32} = 23$. First, $32 \bmod 3 = 2$, so
$q_{32 \bmod 3} = q_2 = 3$. Therefore,
$x_{32} = (23 + 3) \bmod 26 = 26 \bmod 26 = 0$, and position 33 is
filled with the cipher letter A.
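Equation (1) translates almost directly into code. The following Python sketch is an illustration only; the helper names to_numbers, to_letters, and vigenere_encrypt are assumptions made here, and punctuation and spaces are assumed to be discarded before encryption, as in Table 3.

```python
def to_numbers(text):
    """Numerical equivalents A=0, ..., Z=25; non-letters are discarded."""
    return [ord(c) - ord("A") for c in text.upper() if "A" <= c <= "Z"]


def to_letters(values):
    """Inverse of to_numbers for values already reduced modulo 26."""
    return "".join(chr(v + ord("A")) for v in values)


def vigenere_encrypt(plain, keyword):
    """x_i = (y_i + q_(i mod k)) mod 26 for i = 0, ..., n-1."""
    y = to_numbers(plain)
    q = to_numbers(keyword)
    k = len(q)
    x = [(y_i + q[i % k]) % 26 for i, y_i in enumerate(y)]
    return to_letters(x)


PSALM = ("Be still and know that I am God; I will be exalted "
         "among the nations, I will be exalted in the earth.")

cipher = vigenere_encrypt(PSALM, "God")
print(to_numbers("God"))      # q_0, q_1, q_2
print(len(cipher))            # n
print(cipher[0], cipher[32])  # the two positions worked out in the text
print(cipher)
```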
In summary, there are two methods by which to encrypt a plain text
message, and both have been demonstrated using the Vigenere Cipher.
Next, the process of decryption will be briefly described. Just as the
foundation of encryption is the keyword, the process of decryption is
solely dependent on the keyword. When the keyword is known, it is
possible to decrypt cipher text by reversing either method presented
for encrypting plain text. Section 6 of this paper addresses in more
detail how to determine the keyword used in both encryption and
decryption.
The first method presented previously used the Vigenere Square.
First, the rows that correspond to the respective letters of the
keyword can be pulled from the Vigenere Square along with the first
row, which corresponds to A, for reference purposes. Next, arrange the
rows in the order of the keyword itself (see Figure 2), and then find
the cipher letter in the row of the first keyword letter and look up
to the A row for the plain text letter. Then, repeat the process for
each cipher text letter, cycling through the keyword rows, until the
plain text has been recovered. For example, the first cipher text
letter in Psalm 46:10 is H. Look in the G row, find the letter H, and
look up to find the corresponding plain text letter B. Repeat in the O
row for the cipher text letter S to find E. Continue the process for
the remaining 75 letters to unveil the plain text message.

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
G H I J K L M N O P Q R S T U V W X Y Z A B C D E F
O P Q R S T U V W X Y Z A B C D E F G H I J K L M N
D E F G H I J K L M N O P Q R S T U V W X Y Z A B C

Figure 2. The 3 arranged rows of the Vigenere Square corresponding to
G O D headed by row A
Second, addition modulo 26 as presented above can be used to decrypt
cipher text without the use of a table or square. Again, the notation
introduced above for modular addition will be utilized. Equation (1)
may be solved for the numerical value of the plain text to give
$y_i = (x_i - q_{i \bmod k}) \bmod 26$, $i = 0, 1, 2, \ldots, n-1$.
As a reminder from above, $k = 3$ with $q_0 = 6$, $q_1 = 14$, and
$q_2 = 3$, and $n = 77$. The first cipher text letter is H, so
$x_0 = 7$. Thus, $y_0 = (7 - 6) \bmod 26 = 1$, and this translates
into the plain text letter B. Also from above, the cipher text letter
A aligns with the keyword letter D in position 33, with
$32 \bmod 3 = 2$. Thus, $x_{32} = 0$ and $q_2 = 3$. Now,
$y_{32} = (0 - 3) \bmod 26 = 23$, since $-3 \equiv 23 \pmod{26}$; this
amounts to a move of three letters left from A, wrapping around to end
at X, the respective plain text letter. This process is continued
until the entire cipher text has been decrypted to reveal the plain
text message.
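The decryption formula can be sketched in the same way. The function name vigenere_decrypt is hypothetical; the cipher text is the one given above for Psalm 46:10. In Python the % operator already returns a non-negative remainder for a positive modulus, so (0 - 3) % 26 evaluates to 23 without any extra adjustment.

```python
def vigenere_decrypt(cipher, keyword):
    """y_i = (x_i - q_(i mod k)) mod 26 for i = 0, ..., n-1."""
    x = [ord(c) - ord("A") for c in cipher.upper() if "A" <= c <= "Z"]
    q = [ord(c) - ord("A") for c in keyword.upper() if "A" <= c <= "Z"]
    k = len(q)
    y = [(x_i - q[i % k]) % 26 for i, x_i in enumerate(x)]
    return "".join(chr(v + ord("A")) for v in y)


# Cipher text of Psalm 46:10 under the keyword God, in groups of five.
CIPHER = ("HSVZW OROQJ YQUKW NOWOO PMCGO KLRZE KSAGZ WKRDS "
          "CQMHK KBDZW RTGLC WORPH KLDRH HJWQZ VHKOU ZV")

plain = vigenere_decrypt(CIPHER, "God")
print((0 - 3) % 26)         # the wrap-around case from the text
print(plain[0], plain[32])  # positions 1 and 33
print(plain)
```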

4. Kasiski's Method for Finding the Length of the Keyword
Again, the primary task in deciphering text that has been encoded
using the Vigenere Cipher is determining the length of the keyword.
Once the keyword length is known, then the focus is turned to find-
ing the exact keyword. Today, the process of discovering information
about the keyword from parts of the cipher text if not from all of it is
known as cryptanalysis. The goal of the remainder of this paper is
to investigate this notion of cryptanalysis.
In his 1863 work Die Geheimschriften und die Dechiffrierkunst,
Friedrich Kasiski proposed that patterns might be found in the cipher
text as a result of a certain portion of the keyword periodically
coinciding with repeated sets of letters in the plain text, as
reported by O.L. Franksen in Mr. Babbage's Secret [3]. Once the
patterns are identified in the cipher text, the number of letters
separating the repeated sets of letters is counted. Once this is done
for all occurrences of patterns in the cipher text, the greatest
common divisor of these separations, or a divisor of it, could
potentially be the length of the keyword.
Before Kasiski's Method is explored more fully, it should be noted
that there is historical evidence that this method could just as well
be known as the Babbage Method. In Singh's The Code Book it is noted
that the method bearing Kasiski's name was published in 1863, yet
around 9 years earlier in England, Charles Babbage, a mathematician
and early computer pioneer, developed the same method. It is told that
the two methods were developed independently of each other. Two
reasons are offered as to why Babbage did not publish his findings.
First, Babbage was notorious for not finishing projects and
manuscripts, so this habit may have hindered the release of his
discovery. Second, printing his results could have jeopardized the
work of the British government, which may therefore have denied him
the right to publish. For more information on Babbage's work and life,
see Singh, pp. 66-78. Nonetheless, both gentlemen's discoveries were a
major breakthrough in the decipherment of the Vigenere Cipher.
Suppose a plain text message is to be encrypted by a keyword of
length $k$. Various words have a tendency of being repeated in the
English language. Due to this repetition, there is the possibility
that a word such as AND occurs twice in the message such that the
number of letters counted from the first A up to the letter just
before the second A is an integer multiple of $k$. When this occurs in
the plain text, both occurrences are encrypted by the same portion of
the keyword, which results in identical letter blocks in the cipher
text. Conversely, when examining the cipher text, if identical letter
blocks appear and their separations are all integer multiples of the
same number, then one might begin thinking that the keyword length is
either this number or a multiple of it. To help illustrate this idea,
a portion of the cipher text of The Long and Winding Road has been
superimposed with the keyword Liverpool in Table 4. Within this
portion of cipher text, the letter block EPZ appears three times, each
time beneath the portion LIV of the keyword.

Table 4. Select portion of the cipher text of The Long and Winding
Road to demonstrate the basics of Kasiski's Method

key:    L I V E R P O O L L I V E R P O O L
cipher: E P Z P F C U O Y O E D R U X B U C
key:    L I V E R P O O L L I V E R P O O L
cipher: Z I Y X Y P H Z P L L N X F N C I C
key:    L I V E R P O O L L I V E R P O O L
cipher: O W J V N X Z Z Y P D Z V U X G O A
. . .
key:    L I V E R P O O L L I V E R P O O L
cipher: E P Z V R X B K L D P Z H R L O M S
key:    L I V E R P O O L L I V E R P O O L
cipher: L A G I W I O D Z Z T J J K T O F D
key:    L I V E R P O O L L I V E R P O O L
cipher: N Z T M E V T C C E P Z H R N K V J

Now, three blocks of EPZ are encrypted by LIV of the keyword. If
these blocks are separated by integer multiples of some number, then
there is credible evidence that the keyword length is this number. It
so happens that the numbers of letters separating successive blocks
are 135 and 45. Both separations are integer multiples of 1, 3, 5, and
9 (among other common divisors). Thus, there is reason to believe that
one of these values could be the keyword length. Granted, the full
cipher text should be examined for additional appearances of EPZ and
for other repeated letter blocks. From all of the results, the
divisors should be found, and then the greatest common divisor most
likely will be the keyword length.
Since Kasiski's Method rests on coincidental occurrences, some
difficulties can arise while attempting to determine the keyword
length. First, if the plain text is short, it might limit the
repetition of letter patterns encrypted by the same part of the
keyword. In turn, this might reduce the number of candidates for the
keyword length. Additionally, the plain text message could be composed
such that patterns or repeated words are intentionally limited or
eliminated. This forced alteration of the message could then reduce
the number of identical letter blocks found in the cipher text.
Therefore, as long as a sufficient number of repeated letter blocks
appear in the cipher text, Kasiski's Method will be useful. If not,
one will either have to guess or attempt a more solid method.
Using the lyrics to the song The Long and Winding Road and to the OSU
Alma Mater, Kasiski's Method will be demonstrated below, where it can
be seen that it is often a reasonable approach to determining the
keyword length.

The long and winding road that leads to your door,
Will never disappear,
I've seen that road before It always leads me here,
leads me to your door.
The wild and windy night the rain washed away,
Has left a pool of tears crying for the day.
Why leave me standing here, let me know the way
Many times I've been alone and many times I've cried
Anyway you'll never know the many ways I've tried, but
Still they lead me back to the long and winding road
You left me standing here a long, long time ago
Don't leave me waiting here, lead me to you door
Da, da, da, da

Figure 3. Lyrics of The Long and Winding Road

The lyrics to The Long and Winding Road are provided in Figure 3 as
reference for the following example of estimating the keyword length
by Kasiski's Method.
To aid the process of analysis, each line of the cipher text is
composed of 8 groups of 5 letters per group. When referring to the
cipher text, the lines are numbered from 1 to 11 from top to bottom
and the groups are numbered 1 to 8 from left to right. The first
repeated letter combination found is ZKS in line 2, groups 4 and 5.
Its respective pairing is found in line 7, group 4. The cipher text is
presented below in Table 5.
Table 5. Cipher text for The Long and Winding Road

EPZPF CUOYO EDRUX BUCZI YXYPH ZPLLN XFNCI COWJV
NXZZY PDZVU XGOAA MVVZK SGPPV OLRIF CLOJZ JFGSW
ELTRE PHZSL OAHIY TFSWP IYWDT HCJZC MHFDF HSPED
PUPBR HTVYC EXUVE EPZVR XBKLD PZHRL OMSLA GIWIO
DZZTJ JKTOF DNZTM EVTCC EPZHR NKVJW MVZVB SGELV
YMEVV SCPTZ XDTYB ZHBCI NPMAL YGOMD TGWGP JZIEP
ZCYPI IHDPB METUZ WZKSQ CTMYE ENKOJ JWPPC CSJPC
SISNI VSXLV TARNG WGPBM MVSPI EDBDP CIVSJ WMVHD
TPONV BJXYT ZCYRI IHNXB RTYOM SRSMC FWMAX DTGHL
YLDRX WSFPL TJRXA CBREQ HIRVC RZYBG IRKSA PHIDX
ZCUVP CMGIR SASEZ GJYUD CFOLL VHRSO

Now, from the Z in the first letter combination to the W just before
the Z in the second letter combination there are 198 letters (not
including spaces). For the number 198, the divisors that could
possibly represent the keyword length include 1, 2, 3, 9, and 11.
Now, another pair of letter combinations will be examined. The next
one found is OMS in line 4, group 7, and its respective pairing is
found in line 9, groups 5 and 6. The number of letters between this
pair of letter combinations is 193, which happens to be a prime
number. Thus, it does not lend much evidence to the length of the
keyword; although the keyword could be a phrase of 193 letters, this
is very unlikely. The next letter combination pair to be found is JWMV
in line 5, groups 6 and 7, and line 8, groups 7 and 8. Between this
pair of letter combinations there are 126 letters, and the divisors of
126 include 1, 2, 3, 7, 9, and 14.
Two other pairs of letter combinations that occur in the cipher text
are GWGP and IVS. The locations for GWGP are line 6, group 7, and line
8, groups 3 and 4. Its separation is 63 letters, with divisors
including 1, 3, 7, and 9. The locations for IVS are line 8, groups 1
and 2, and again line 8, group 7. Its separation is 27 letters, with
divisors including 1, 3, and 9. Finally, one last letter combination,
EPZ, occurs in four different positions in the cipher text: line 1,
group 1; line 4, group 4; line 5, group 5; and line 6, group 8,
continuing into line 7, group 1. The number of letters separating the
first pair of EPZ is 135, with divisors including 1, 3, 5, and 9. The
number of letters between the second pair is 45, and its divisors
include 1, 3, 5, and 9. The third and fourth appearances of EPZ are
separated by 58 letters, and the only divisors of 58 other than 1 and
58 are 2 and 29. The separations of 193 and 58, which are prime or
nearly so (58 = 2 x 29), will be disregarded as sources of the keyword
length. Individually, one of them could potentially reflect the exact
keyword length, but the estimate is made based on the more substantial
evidence. Of the remaining separations, the greatest common divisor is
9. Therefore, there is very strong evidence that the length of the
keyword used to encipher the lyrics of the Beatles song The Long and
Winding Road is 9. Considering that the keyword used was Liverpool,
Kasiski's Method provided a reliable means by which to determine the
actual length of the keyword.
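The hand count above can be checked with a short Python sketch. It is not David Wright's Maple procedure; the names repeated_blocks and separations are chosen here for illustration, positions are counted from 0, and the cipher text is copied from Table 5 with the group spacing removed.

```python
from collections import defaultdict
from functools import reduce
from math import gcd

# Cipher text of Table 5 (keyword Liverpool), with the group spacing removed.
LWR_CIPHER = "".join("""
EPZPF CUOYO EDRUX BUCZI YXYPH ZPLLN XFNCI COWJV
NXZZY PDZVU XGOAA MVVZK SGPPV OLRIF CLOJZ JFGSW
ELTRE PHZSL OAHIY TFSWP IYWDT HCJZC MHFDF HSPED
PUPBR HTVYC EXUVE EPZVR XBKLD PZHRL OMSLA GIWIO
DZZTJ JKTOF DNZTM EVTCC EPZHR NKVJW MVZVB SGELV
YMEVV SCPTZ XDTYB ZHBCI NPMAL YGOMD TGWGP JZIEP
ZCYPI IHDPB METUZ WZKSQ CTMYE ENKOJ JWPPC CSJPC
SISNI VSXLV TARNG WGPBM MVSPI EDBDP CIVSJ WMVHD
TPONV BJXYT ZCYRI IHNXB RTYOM SRSMC FWMAX DTGHL
YLDRX WSFPL TJRXA CBREQ HIRVC RZYBG IRKSA PHIDX
ZCUVP CMGIR SASEZ GJYUD CFOLL VHRSO
""".split())


def repeated_blocks(cipher, length, min_count=2):
    """Map each block of the given length to its 0-based start positions,
    keeping only blocks that occur at least min_count times."""
    positions = defaultdict(list)
    for start in range(len(cipher) - length + 1):
        positions[cipher[start:start + length]].append(start)
    return {blk: pos for blk, pos in positions.items() if len(pos) >= min_count}


def separations(positions):
    """Letters between successive occurrences, counted start to start."""
    return [b - a for a, b in zip(positions, positions[1:])]


trigrams = repeated_blocks(LWR_CIPHER, 3)
tetragrams = repeated_blocks(LWR_CIPHER, 4)

for block in ("ZKS", "OMS", "IVS", "EPZ"):
    print(block, trigrams[block], separations(trigrams[block]))
for block in ("JWMV", "GWGP"):
    print(block, tetragrams[block], separations(tetragrams[block]))

# Separations kept in the text; 193 (OMS) and 58 (last EPZ gap) are set aside.
kept = (separations(trigrams["ZKS"]) + separations(tetragrams["JWMV"])
        + separations(tetragrams["GWGP"]) + separations(trigrams["IVS"])
        + separations(trigrams["EPZ"])[:2])
print(kept, reduce(gcd, kept))
```

Taking the greatest common divisor of the retained separations is the step that singles out 9; which separations to set aside remains a judgment made by the analyst, as in the text.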
In comparison to the last example, in which there was strong evidence
for the length of the keyword, analysis of the second cipher text,
that of the OSU Alma Mater, is not as indicative of the keyword
length. First, for the purpose of reference, the lyrics of the OSU
Alma Mater are presented in Figure 4.
There are three pairs of letter combinations found in the cipher text
that are presented as examples for analysis, and the cipher text is
given in Table 6. In addition, for this example let the system by
which the lines and groups were numbered in the last example be used
again.

Proud and immortal, Bright shines your name;
Oklahoma State, We herald your fame.
Ever you'll find us-loyal and true.
To our Alma Mater, O...S...U.

Figure 4. Lyrics of the OSU Alma Mater

Table 6. Cipher text of the OSU Alma Mater

EZGNR LCHBQ BWJMO WQVBK WBKAW YTWRS JZFTA PDOEE
WWETG EPXXA TPWKO WSCHY GNSFS PKIKC DCDET TCHNW
AWQTZ LCHMV JMLHC FGEEQ PUSMS CDWN

The first pair, GN, is found in line 1, group 1, and line 2, group 5.
The number of letters separating the pair is 58, and its divisors are
1, 2, 29, and 58. The second pair, LCH, is found in line 1, group 2,
and line 3, group 2. The number of letters between this pair is 80,
with divisors including 1, 2, 4, 5, 8, and 10. The last pair found in
the cipher text for this example is ET, in line 2, group 1, and line
2, group 7. The number of letters between the two occurrences of ET is
31, a prime number. Again dismissing the last pair because its
separation is a prime number, the greatest common divisor of 58 and 80
is 2. This is not a reasonable value for a keyword length because of
the weak level of security that a keyword of length two provides. So,
Kasiski's Method did as charged and determined a keyword length, but
most likely one would have to take into consideration the other
divisors of 80, such as 4, 5, 8, or 10.
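The divisor bookkeeping for this example is small enough to sketch directly. The separations are taken from the paragraph above; the helper name divisors is an assumption made for illustration.

```python
from math import gcd


def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]


# Separations read off Table 6 for the three repeated blocks.
osu_separations = {"GN": 58, "LCH": 80, "ET": 31}

for block, gap in osu_separations.items():
    print(block, gap, divisors(gap))

# ET is set aside because 31 is prime; the two remaining separations
# share only the divisors 1 and 2.
print(gcd(58, 80))
print(sorted(set(divisors(58)) & set(divisors(80))))
```

The output makes the weakness of this example visible: the only common divisors are 1 and 2, so the remaining candidates have to come from the divisors of a single separation.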
For large texts, the procedure just presented would be tedious and
prone to error in counting and in locating letter combinations. This
is where the ability of computers to do many operations or
calculations per second can be very beneficial. David Wright has
written Maple code that performs Kasiski's Method on a given text (see
Appendix A). In his code, the procedure for performing Kasiski's
Method is aptly named kasiski(msg,l,n). Its inputs are the message msg
(the cipher text), the letter block length l, and the minimum number n
of repetitions of a letter block. In essence, kasiski scans the cipher
text message for common blocks of letters of length l with at least n
repetitions. Based on these parameters, kasiski outputs each such
block of letters with its number of repetitions, and the number of
letters separating successive occurrences of the block is given in
prime-factored form. The cipher text of the song The Long and Winding
Road was analyzed by the Maple code, and the examples of output for
each procedure discussed here and in the remaining sections are based
on this analysis. For example, the first output below is based on a
block of length 3 with at least 3 repetitions, and the second group of
output is based on a block of length 4 with at least 2 repetitions.
Strong evidence of the keyword length is found in the second output
group, where the factor $3^2 = 9$ occurs for each letter block.
kasiski(lwr,3,3);

[[[EPZ,4],[(3)^{3}(5),(3)^{2}(5),(2)(29)]]]

kasiski(lwr,4,2);

[[[PZHR, 2], [(2)^{2}(3)^{2}]],
[[JWMV, 2], [(2)(3)^{2}(7)]],
[[GWGP, 2], [(3)^{2}(7)]]]
The last two parameters can easily be changed to find other
letter blocks that might be present in the cipher text message.
Therefore, searching for letter blocks in cipher text messages and
finding the number of letters that separate these blocks in
prime-factored form can be greatly simplified by the use of computer
programs such as David Wright's Maple code.
In conclusion, Kasiski's Method does provide a means by which to
determine potential values of the keyword length by examining the
greatest common divisor of the distances between repeated letter
combinations. Yet, since his method relies upon coincidental patterns
in the keyword's placement and how it aligns with the letters in the
plain text, shorter or purposely altered plain text will leave minimal
evidence in the respective cipher text, as seen in the
second example.

5. Friedman's Method for Finding the Length of the Keyword
Friedman's Method uses the ideas of counting, letter frequency, and
probability of letters contained in the cipher text to calculate
the keyword length. William Friedman was a pioneer in using probability
as a tool of decipherment in the area of cryptology. Kahn's The
Codebreakers notes the origins of Friedman's famous work: in 1920
Friedman wrote The Index of Coincidence and Its Applications in
Cryptography, published as Riverbank Publication No. 22. The Index of
Coincidence is the fundamental idea upon which Friedman's Method is
based. The Index of Coincidence is basically the probability of picking
two identical letters from a given text. Given this basic idea,
the Index of Coincidence can be used in general with all polyalphabetic
ciphers. It is not limited to the Vigenere Cipher, yet the Vigenere
Cipher is the focus of this paper. Therefore, it is presented with
regard to the Vigenere Cipher.
The Index of Coincidence is developed by counting how many different
ways two letters can be picked from a given number of letters, that is,
how many combinations can be formed. The combinatoric formula from
counting will be used to determine the number of combinations. Some
notation that will be used frequently in this section is introduced
here. In the cipher text, the frequencies of A, B, ..., Z are denoted
by $n_0, n_1, \ldots, n_{25}$. The total number of letters contained in a text is
denoted by $n$, so that $n = n_0 + n_1 + \cdots + n_{25}$. So, the number of ways
of picking two letters from the entire text is
\[ \binom{n}{2} = \frac{n(n-1)}{2}. \]
From the total number of letter pairs that can be formed from the
text, it is necessary to determine how many of them are a pair of the
same letter. For instance, the number of A's is $n_0$. Thus, the number
of pairs chosen from A's only is $\binom{n_0}{2} = \frac{n_0(n_0-1)}{2}$. The number of letter
pairs for each letter A to Z can be found the same way. Thus, by summing
the results from each of these calculations, the total number of
pairs of identical letters is
\[ \sum_{i=0}^{25} \frac{n_i(n_i-1)}{2}. \]
Therefore, the Index of Coincidence is formed by the number of
pairs of identical letters over the total number of letter pairs found in
the text. Thus,
\[ c = \frac{\sum_{i=0}^{25} \frac{n_i(n_i-1)}{2}}{\frac{n(n-1)}{2}}, \]
which then simplifies to
\[ c = \sum_{i=0}^{25} \frac{n_i(n_i-1)}{n(n-1)}. \tag{2} \]
HSVZW OROQJ YQUKW NOWOO PMCGO KLRZE KSAGZ WKRDS
CQMHK KBDZW RTGLC WORPH KLDRH HJWQZ VHKOU ZV

Figure 5. The cipher text for Psalm 46:10.

For example, the Index of Coincidence will be calculated for the text
of Psalm 46:10, and for the purpose of referencing, the ciphertext for
this passage is found in Figure 5. For each letter, the frequency $n_i$ is
reported in Table 7 along with the respective value $n_i(n_i - 1)$.
This value $n_i(n_i - 1)$ is twice the number of combinations of
picking two identical letters given there are $n_i$ of a particular
letter in the set; the factor of two cancels against the same factor
in $n(n - 1)$.

Table 7. Frequency of each letter in Psalm 46:10

Letter          A   B   C   D   E   F   G   H   I   J   K   L   M
n_i             1   1   3   3   1   0   3   6   0   2   8   3   2
n_i(n_i - 1)    0   0   6   6   0   0   6  30   0   2  56   6   2

Letter          N   O   P   Q   R   S   T   U   V   W   X   Y   Z
n_i             1   8   2   4   6   3   1   2   3   7   0   1   6
n_i(n_i - 1)    0  56   2  12  30   6   0   2   6  42   0   0  30

There are 77 letters in the ciphertext of interest, so $n(n - 1) = 77 \cdot 76 = 5852$.
Using the information in the above table, the Index of Coincidence is
\[ c = \sum_{i=0}^{25} \frac{n_i(n_i-1)}{n(n-1)} = \frac{300}{5852} = 0.05126. \]
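The calculation in Table 7 can be reproduced with a few lines of Python. This is a minimal sketch of Equation 2 and not the Maple procedure of Appendix A; the function names are illustrative choices.

```python
from collections import Counter

# Cipher text for Psalm 46:10 from Figure 5, with the spaces removed.
CIPHERTEXT = (
    "HSVZWOROQJYQUKWNOWOOPMCGOKLRZEKSAGZWKRDS"
    "CQMHKKBDZWRTGLCWORPHKLDRHHJWQZVHKOUZV"
)


def coincidence_counts(text):
    """Return (sum of n_i(n_i - 1), n(n - 1)) for the letters of text."""
    counts = Counter(ch for ch in text.upper() if ch.isalpha())
    n = sum(counts.values())
    numerator = sum(n_i * (n_i - 1) for n_i in counts.values())
    return numerator, n * (n - 1)


def index_of_coincidence(text):
    """Index of Coincidence as defined in Equation 2."""
    numerator, denominator = coincidence_counts(text)
    return numerator / denominator


print(coincidence_counts(CIPHERTEXT))
print(round(index_of_coincidence(CIPHERTEXT), 5))
```

For the Psalm 46:10 cipher text the two counts are 300 and 5852, which gives the value 0.05126 found by hand.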
Now that the Index of Coincidence has been developed, it is necessary
to relate it to the length of the keyword. First, there are two
probability values that are used in this relationship. The first value,
0.065, is the probability of selecting two identical letters in ordinary
English from a large-sized text. Since the text size is large, the number
of repeated words would be high, thus allowing for a slightly higher
chance of picking an identical letter pair. This value is derived from
the estimated probabilities of the occurrence of each letter in ordinary
English. Let $p_0, p_1, \ldots, p_{25}$ represent the respective probabilities of
picking an A, B, ..., Z from this large pool. To calculate these values,
pick a large English text of more than a thousand letters and count
the frequency of each letter in the text. Then divide each frequency
by the total number of letters in the text to get the respective
probability of a letter occurring in an ordinary English text. Table 8
gives values of natural letter frequencies for typical English based
on a count of 4 million characters, as given in Bauer's Decrypted Secrets
[2, p. 270].

Table 8. Letter frequencies in common English

Letter A B C D E F G H I
Percent 8.04 1.54 3.06 3.99 12.51 2.30 1.96 5.49 7.26
Letter J K L M N O P Q R
Percent 0.16 0.67 4.14 2.53 7.09 7.60 2.00 0.11 6.12
Letter S T U V W X Y Z
Percent 6.54 9.25 2.71 0.99 1.92 0.19 1.73 0.09

Some of the known tables of values will have minimal variations in
the values due to the characteristics of the English text used. These
values can also be found for any respective language or symbolic system.
Say the letter H is chosen from the pool with a respective probability
value of $p_7$. Assuming the pool is very large, the probability of picking
a second H would be similar, so it also would be $p_7$. The choice of
the second letter from the pool is assumed to be independent of the
choice of the first letter. Thus, $p_7^2$ is the probability that a pair of
H's would be selected. The same would hold true for any other pair
of identical letters to be picked from the pool. Now, the event of
picking a pair of H's from the pool is mutually exclusive from the event
of picking another pair of identical letters from the pool, so the
probabilities add. Therefore,
using estimated values of the probability of picking a random ordinary
English letter,
\[ 0.065 \approx \sum_{i=0}^{25} p_i^2. \]
The second value of interest to the relationship is 0.03846, and it
represents the probability of picking a pair of identical letters when
each of the 26 letters is equally likely to occur. Imagine that there
are two complete sets of alphabet tiles lying on a table. The
probability of randomly picking one tile from each set such that the two
are identical is this aforementioned value. The idea here is that if
each letter is equally likely to occur, then
it would be similar to drawing letters from an urn with replacement.
The identity of each letter would be dictated by chance alone. Thus,
this value is calculated from the fact that there are $26^2 = 676$ total
combinations of randomly picking two letters from the table. Of these
676 possible pairings, only 26 are identical pairs of letters. Therefore,
\[ \frac{26}{676} = 0.03846 \]
is the probability of interest for this situation.
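Both probability values can be recomputed directly. The Python sketch below is an illustration only: it copies the percentages of Table 8 and sums the squared probabilities, once for English and once for 26 equally likely letters. The function name match_probability is a hypothetical choice.

```python
# Letter percentages for English copied from Table 8 (A through Z).
TABLE_8_PERCENT = [
    8.04, 1.54, 3.06, 3.99, 12.51, 2.30, 1.96, 5.49, 7.26,
    0.16, 0.67, 4.14, 2.53, 7.09, 7.60, 2.00, 0.11, 6.12,
    6.54, 9.25, 2.71, 0.99, 1.92, 0.19, 1.73, 0.09,
]


def match_probability(probabilities):
    """Probability that two independent draws give the same letter."""
    return sum(p * p for p in probabilities)


english = match_probability([pct / 100 for pct in TABLE_8_PERCENT])
uniform = match_probability([1 / 26] * 26)
print(round(english, 4), round(uniform, 5))
```

With the Table 8 percentages the English value comes out near 0.066, slightly above the 0.065 quoted, which is in keeping with the small variations between frequency tables; the uniform value is 1/26, or 0.03846.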


Using notation that Friedman devised and that is noted in Kahn's
The Codebreakers, the values 0.03846 and 0.065 are referred to by the
respective symbols $\kappa_r$ (for random) and $\kappa_p$ (for plaintext) [4, p. 378].
In Invitation to Cryptology, Thomas A. Barr states, "Thus for a
polyalphabetic encipherment of English, the Index of Coincidence will be
no less than 0.03846 and no more than about 0.065." Furthermore,
Thomas A. Barr concludes,
If this number [Index of Coincidence] is close to the
probability of selecting identical letters in ordinary Eng-
lish, 0.065, then the cipher is likely to be monoalpha-
betic. On the other hand, if the Index of Coincidence is
close to the probability of picking two identical letters
from a large collection where the letters are evenly dis-
tributed, then the cipher is likely to be polyalphabetic
[1].
Finally, the relationship between the Index of Coincidence of a cipher
text obtained from English by the Vigenere Cipher with keyword length
$k$ and the actual value of $k$ will be explored. There are two different yet
similar relationships that will be presented. The first is fairly basic in
its structure and stems from discussion with Dr. David Wright. First,
let $k$ be the length of the keyword. Then $\frac{1}{k}$ is the chance that the
two letters in a pair have been encoded by the same key letter. Since
the value of 0.065 is calculated from the respective probabilities of all
letters, it is associated with the value $\frac{1}{k}$. Thus, the value of 0.03846
is associated with the complement, $1 - \frac{1}{k}$ [7]. Now,
\[ 0.065\left(\frac{1}{k}\right) + 0.03846\left(1 - \frac{1}{k}\right) = c \]
is solved algebraically for $k$: collecting the terms in $\frac{1}{k}$ gives
$\frac{0.065 - 0.03846}{k} = c - 0.03846$, which results in
\[ k = \frac{0.02654}{c - 0.03846}. \tag{3} \]
Thus, the first relationship between the Index of Coincidence and the
keyword length is established. For the example done above, where it
was found that $c = 0.05126$, the calculated keyword length, $k$, is
\[ k = \frac{0.02654}{0.05126 - 0.03846} = 2.0734. \]
Thus, the two integer values that bound the value 2.0734 are 2 and 3.
Therefore, these are two potential keyword lengths, and considering that
the keyword used for this example was three letters in length, this is
fairly accurate in providing an estimate for the keyword length. It is
nonetheless a place from which to begin analyzing the ciphertext.
From Barr's Invitation to Cryptology, the second relationship between
the Index of Coincidence and the keyword length is somewhat
more involved, using ideas of counting and probability [1, pp. 134-138].
First, assume that there is an English text composed of $n$ total letters
that has been Vigenere encrypted by a keyword of length $k$. Now, if $n$
is not an exact multiple of $k$ (that is, $n \neq mk$ for every
$m = 1, 2, 3, \ldots$), then the remainder from $\frac{n}{k}$ is ignored so that $n$
can be treated as an exact multiple of $k$. Now, the
letters of the cipher text are arranged into an array of $\frac{n}{k}$ rows and $k$
columns. With regard to this array of letters, the proportional values
of choosing a pair of letters in both different columns and in the same
column are calculated first.
First, the proportional value of picking a pair of letters from different
columns is developed. Since there are $k$ columns, the number of
combinations of picking two columns is $\binom{k}{2} = \frac{k(k-1)}{2}$. Now, in each
column there are $\frac{n}{k}$ letters, so the total number of pairs of letters in
two given columns is $\frac{n}{k} \cdot \frac{n}{k}$. Thus, the total number of ways of picking a pair
of letters from two different columns is
\[ \frac{k(k-1)}{2} \cdot \frac{n}{k} \cdot \frac{n}{k} = \frac{n^2(k-1)}{2k}. \]
Since the number of columns in this array of letters is the number of
letters in the keyword, then each column represents a random shift of
the cipher alphabet. When choosing pairs of letters in this manner, it is
very similar to choosing them from a controlled situation such as when
there is a known number of alphabet sets. Thus, the earlier calculated
probability 0.03846 provides a proportional number of ways that a pair
of identical letters can be chosen from two different columns. This
proportional number of ways is
\[ 0.03846 \cdot \frac{n^2(k-1)}{2k}. \tag{4} \]
Next, the proportional value of picking a pair of
letters from the same column is developed. Since there are $\frac{n}{k}$ letters in
each column, the number of combinations of picking two letters in the
same column is
\[ \binom{n/k}{2} = \frac{1}{2} \cdot \frac{n}{k}\left(\frac{n}{k} - 1\right). \]
Since there are $k$ columns in the array, the total number of ways
of picking a pair of letters from the same column is
\[ k \cdot \frac{1}{2} \cdot \frac{n}{k}\left(\frac{n}{k} - 1\right) = \frac{n(n-k)}{2k}. \]
The manner in which these pairs of letters are chosen resembles picking
a pair of identical letters from an ordinary English text. Therefore, the
proportional number of ways to select a pair of identical letters from
the same column is
\[ 0.065 \cdot \frac{n(n-k)}{2k}. \tag{5} \]
Earlier, the Index of Coincidence was defined as the probability of
picking a pair of identical letters from a given text. So far, the
proportional number of ways of choosing a pair of identical letters from
both different columns and from the same column has been determined. The
sum of Equations 4 and 5 accounts for the total number of ways of
picking a pair of identical letters from a given text. And the total
number of ways of picking a pair of letters from a text of $n$ letters
is $\binom{n}{2} = \frac{n(n-1)}{2}$.
Therefore, the Index of Coincidence is approximately
\[ c \approx \frac{0.03846 \cdot \frac{n^2(k-1)}{2k} + 0.065 \cdot \frac{n(n-k)}{2k}}{\frac{n(n-1)}{2}}. \]
Simplify the above estimate of the Index of Coincidence to produce
\[ c \approx \frac{0.03846\,n(k-1) + 0.065(n-k)}{k(n-1)}. \tag{6} \]
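The two pair counts behind Equations 4 and 5 can be confirmed by brute force. The Python sketch below is an illustration under the stated assumption that n is an exact multiple of k; the sizes n = 72 and k = 9 and the function names are hypothetical choices, not values from the text.

```python
from itertools import combinations


def pair_counts(n, k):
    """Count letter pairs in the same column and in different columns
    when n letters are written row by row into k columns."""
    same = different = 0
    for i, j in combinations(range(n), 2):
        if i % k == j % k:
            same += 1
        else:
            different += 1
    return same, different


def approx_index_of_coincidence(n, k):
    """Right-hand side of Equation 6."""
    return (0.03846 * n * (k - 1) + 0.065 * (n - k)) / (k * (n - 1))


n, k = 72, 9  # hypothetical sizes with n a multiple of k
same, different = pair_counts(n, k)
print(same, n * (n - k) // (2 * k))           # same-column pairs
print(different, n * n * (k - 1) // (2 * k))  # different-column pairs
print(same + different, n * (n - 1) // 2)     # all pairs
```

Each printed line shows the brute-force count next to the closed form. With a keyword of length 1 every pair lies in the same column, and Equation 6 reduces to 0.065 as expected.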
Now, an approximation of the Index of Coincidence based on the
keyword length, $k$, and the text length, $n$, has been developed. Yet,
the length of the keyword is still the goal so that it is possible to
determine the keyword and then decipher the Vigenere encrypted text.
Therefore, solve Equation 6 for $k$ to derive an estimate of the keyword
length. First, multiply both sides of Equation 6 by $k(n - 1)$ to get
\[ k(n-1)c \approx 0.03846\,n(k-1) + 0.065(n-k). \]
Now, perform basic algebra to produce the desired estimate of $k$,
\[ k(nc - c) \approx 0.03846\,nk - 0.03846\,n + 0.065\,n - 0.065\,k \]
\[ k(nc - c - 0.03846\,n + 0.065) \approx 0.02654\,n \]
\[ k\bigl(n(c - 0.03846) + (0.065 - c)\bigr) \approx 0.02654\,n. \]
Therefore, the second keyword length estimate based on the Index of
Coincidence is
\[ k \approx \frac{0.02654\,n}{n(c - 0.03846) + (0.065 - c)}. \tag{7} \]
Earlier, the Index of Coincidence value of 0.05126 was calculated for
the passage Psalm 46:10, which is composed of 77 letters, and so from
Equation 7, the estimated keyword length is
\[ k \approx \frac{0.02654 \cdot 77}{77(0.05126 - 0.03846) + (0.065 - 0.05126)} = 2.045. \]
This is very similar to the estimate calculated from Equation 3. Thus,
there is consistent evidence that the keyword length is around two or
maybe three letters. Again, the actual keyword length is three letters.
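The two estimates can be placed side by side in code. The Python sketch below is a minimal illustration of Equations 3 and 7 with the constants used in the text; the names KAPPA_P, KAPPA_R, and the two function names are assumptions made for this example and do not come from the Maple code.

```python
KAPPA_P = 0.065    # chance of a match in ordinary English
KAPPA_R = 0.03846  # chance of a match for equally likely letters


def keyword_length_simple(ic):
    """Equation 3: estimate from the Index of Coincidence alone."""
    return (KAPPA_P - KAPPA_R) / (ic - KAPPA_R)


def keyword_length_with_text_size(ic, n):
    """Equation 7: estimate that also uses the text length n."""
    return (KAPPA_P - KAPPA_R) * n / (n * (ic - KAPPA_R) + (KAPPA_P - ic))


# Psalm 46:10 example: Index of Coincidence 0.05126 and 77 letters.
print(round(keyword_length_simple(0.05126), 4))
print(round(keyword_length_with_text_size(0.05126, 77), 3))
```

Both functions assume an Index of Coincidence above 0.03846; at or below that value the denominators vanish or change sign and the estimates lose their meaning.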
The Maple code provided in Appendix A that has been written
for Friedman's Method includes separate procedures that
use the two different approximations of the keyword length based on
the Index of Coincidence. The first procedure is fittingly named
index_of_coincidence(ciphertext), and its only input is the cipher text
message. The procedure then does by machine what would be done by
hand to calculate the Index of Coincidence (see Equation 2). Thus,
the only output is the value of the Index of Coincidence, $c$, and for
the example cipher text,
index_of_coincidence(lwr);
0.04174120453.
The next two procedures, friedman(cipher) and friedman2(cipher),
have only the cipher text message as their input. Both of these
procedures internally rely upon the output from the previous procedure
index_of_coincidence to aid in estimating the keyword length. First,
friedman uses the form of Equation 7, with the constants rounded
slightly differently (0.027 and 0.038 in place of 0.02654 and 0.03846),
to determine its estimate of the keyword length, such as for our example
friedman(lwr);
7.114071982.
Then a second procedure, labeled friedman2, uses the simpler keyword
length estimate based on the Index of Coincidence (see Equation 3).
The resulting output for our example is
friedman2(lwr);
7.216927003.
When the output from kasiski(lwr,4,2) is compared to that of either
friedman(lwr) or friedman2(lwr), the results do not match: the Kasiski
output points to a keyword length of 9, while both Friedman estimates
are close to 7. The difference is found in the keyword LIVERPOOL. When
it is repeatedly spelled out, the pair of O's and the pair of L's each
count as just one position, in that the same Caesar shift cipher is used
in each of the respective positions. Therefore, Friedman's Method in
effect counts the distinct letters of the keyword, of which LIVERPOOL
has seven, rather than every letter including those that repeat.
Nevertheless, the use of these Maple procedures would decrease the time
needed for analysis of a cipher text and
increase accuracy when compared to hand calculations.
Thus, Friedman's Method calls upon the techniques of counting,
probability, and letter frequency to establish the fundamental Index
of Coincidence. Then, the relationship between the Index of Coincidence
and the keyword length is manipulated to achieve an estimate
of the keyword length used to encrypt text using the Vigenere Cipher.
Whether Kasiski's Method or Friedman's Method is used to calculate
the keyword length, it is only the first task in the process to decipher a
Vigenere encrypted text. The second task is to actually determine the
keyword.

6. Constructing the Keyword


Thus far in this paper, two methods have been presented by which to
guess the keyword length when faced with a cipher text that has been
encrypted with the Vigenere Cipher. The first vital bit of information
regarding the keyword has been discovered, but now the remaining
information about the keyword has to be discovered which would be the
actual keyword. There are several methods by which one can unearth
the actual keyword used during the encryption process. First, though,
the cipher text has to be organized based on the keyword length.
Assume that the cipher text has $n$ letters and it was determined
that the keyword length is $k$ letters. During the process of encryption,
when the keyword is repeatedly spelled over the plain text, the plain
text letters assigned to the same keyword letter were separated by $k$
letters. Thus, it is reasonable to observe those plain text letters that
were assigned to the same keyword letter separately. Therefore, arrange
the letters of the cipher text into $k$ columns such that column $i$,
$i = 1, \ldots, k$, is composed of the letters found in positions
$kx + i$, $x = 0, 1, 2, \ldots, \lfloor \frac{n-i}{k} \rfloor$, of the cipher text. From this
arrangement of the letters in $k$ columns, the number of rows will be
either $\frac{n}{k}$ or $\lfloor \frac{n}{k} \rfloor + 1$.
Note that if $n$ is an exact multiple of $k$ then there are $\frac{n}{k}$ rows, but if
there is a remainder, then there are $\lfloor \frac{n}{k} \rfloor + 1$ rows. Continuing to use
The Long and Winding Road as an example, it was estimated that
$k = 9$, so the cipher text would be arranged into 9 columns labeled
$k_1, k_2, \ldots, k_9$ (see Table 9). Thus, $k_1$ is composed of the cipher text
letters in positions 1, 10, 19, ..., 424.
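The arrangement into columns amounts to taking every k-th letter. The Python sketch below is an illustration only; the function name split_into_columns is a hypothetical choice, and the sample string is the first five rows of Table 9 read row by row.

```python
def split_into_columns(ciphertext, k):
    """Column i (counting from 0) holds the letters at 0-based
    positions i, i + k, i + 2k, ... of the cipher text."""
    return [ciphertext[i::k] for i in range(k)]


# First five rows of Table 9, read row by row.
sample = "EPZPFCUOY" "OEDRUXBUC" "ZIYXYPHZP" "LLNXFNCIC" "OWJVNXZZY"
columns = split_into_columns(sample, 9)
print(columns[0], columns[8])
```

The first column of the sample reads EOZLO and the ninth reads YCPCY, matching the columns labeled k1 and k9 in Table 9. A text of 430 letters split this way gives 48 letters in the first column and 47 in the last, which is the case where the number of letters is not an exact multiple of the keyword length.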
Table 9. Cipher text of The Long and Winding Road
sorted into 9 columns for each key letter

k1  k2  k3  k4  k5  k6  k7  k8  k9
E   P   Z   P   F   C   U   O   Y
O   E   D   R   U   X   B   U   C
Z   I   Y   X   Y   P   H   Z   P
L   L   N   X   F   N   C   I   C
O   W   J   V   N   X   Z   Z   Y
:   :   :   :   :   :   :   :   :
C   M   G   I   R   S   A   S   E
Z   G   J   Y   U   D   C   F   O
L   L   V   H   R   S   O

Now, the cipher text has been subdivided into k columns or subse-
quences of letters if the whole cipher text is viewed as a sequence of
letters. Because of this subdivision of the cipher text letters, each col-
umn now represents a particular row of the Vigenere Square or more
specifically a Caesar shift. Therefore, the task at hand is to find the key
for each of the k individual Caesar shifts. For these Caesar shifts, the
key will just be the shift value that will correspond to an equivalent
letter value for each keyword letter. The techniques by which these
individual keys may be discovered are now presented.
Once the cipher text has been arranged into the subsequences $k_i$,
$i = 1, 2, \ldots, k$, an analysis of the letter frequency and proportion is
performed for each individual subsequence. For $k_1$ of the example, the
frequency of each letter, $f_i$, $i = 0, 1, \ldots, 25$, and the respective
proportion out of 48 letters, $p_i$, $i = 0, 1, \ldots, 25$ (expressed as a
percentage), are given below in Table 10.

Table 10. Frequency and proportion values of k1 for
The Long and Winding Road

Letter      A      B      C      D      E      F      G      H      I
f_i         1      0      2      2      4      0      0      2      0
p_i (%)     2.08   0.00   4.17   4.17   8.33   0.00   0.00   4.17   0.00

Letter      J      K      L      M      N      O      P      Q      R
f_i         1      0      7      0      1      4      8      0      1
p_i (%)     2.08   0.00   14.6   0.00   2.08   8.33   16.7   0.00   2.08

Letter      S      T      U      V      W      X      Y      Z
f_i         0      3      0      1      3      0      4      4
p_i (%)     0.00   6.25   0.00   2.08   6.25   0.00   8.33   8.33
The first method requires a $1 \times 26$ vector of calculated proportions
from the first subsequence of the cipher text and a $1 \times 26$ vector of
known proportion values calculated from a large sample of letters.
Depending on the language system used, this vector of values will differ.
The premise of this first method is to match proportion values of the
subsequence to values of the sample of letters. There are several
landmarks of peaks, valleys, and plateaus that are used for the purpose
of matching, and they will be discussed next. The number of shifts
required to match these two vectors is recorded as the key for the
respective Caesar shift of the subsequence being analyzed. Then, this
process would be repeated for each of the keyword positions to finally
discover the actual keyword.
The aforementioned landmarks are found from examining proportion
values for a large sample of English text. Foremost, E stands alone
with the highest registered proportion value. Two different groups of
letters, N-O and R-S-T, are plateaus of values that are slightly below
that of E. Since its letters are not very commonly used in writing, the
very end of the alphabet, especially X-Z, produces a valley of values
near zero. One last notable landmark is Q. It creates a valley that
bottoms near zero with the surrounding values being much higher. Not
all of the above-mentioned observations will be satisfied or apparent
when analyzing a given text. Therefore, the most accurate match
is the goal, and from this will come the correct key for the Caesar shift.
Above in Table 10, the letter frequencies of The Long and Winding
Road are reported for $k_1$. Looking at these values, how far should they
be shifted to match the known frequencies for English based on a sample
of 4,000,000 letters? Their initial alignment is shown in Table 11 for
the purpose of comparison.

Table 11. Known letter proportions from a sample of
4,000,000 letters and calculated letter proportions of k1
for The Long and Winding Road

Letter         A      B      C      D      E      F      G      H      I
Sample Text    8.04   1.54   3.06   3.99   12.51  2.30   1.96   5.49   7.26
Example Text   2.08   0.00   4.17   4.17   8.33   0.00   0.00   4.17   0.00

Letter         J      K      L      M      N      O      P      Q      R
Sample Text    0.16   0.67   4.14   2.53   7.09   7.60   2.00   0.11   6.12
Example Text   2.08   0.00   14.6   0.00   2.08   8.33   16.7   0.00   2.08

Letter         S      T      U      V      W      X      Y      Z
Sample Text    6.54   9.25   2.71   0.99   1.92   0.19   1.73   0.09
Example Text   0.00   6.25   0.00   2.08   6.25   0.00   8.33   8.33

Overall, the letter P occurred the most in $k_1$, with 8 occurrences or
16.67%. Going by this first mark, P would need a shift of 11 units to
the left to match the peak of E. The letters C-D-E seem to create a
plateau of higher values. This could be a potential match to the R-S-T
group. These two letter groups match after a shift of 11 units to the
left, which lines C up with R. In addition, the Y-Z-A-B group is a
potential match to the N-O-P-Q group, being that Y and Z each occur four
times, A once, and B not at all. Again, this alignment would be a shift
to the left of 11 units. Therefore, there seems to be substantial
evidence that the key for the first Caesar shift is 11. Finally, the
calculated proportion values have been shifted to the left 11 units and
aligned with the known proportion values in Table 12. From this
information, the first keyword letter would be L, which lies in position
12 of the alphabet. This does
indeed coincide with the actual keyword of Liverpool. This method does
provide a means by which to determine the key for each Caesar shift,
but there are more mathematically based methods left to explore.

Table 12. Known letter proportions from a sample of
4,000,000 letters and calculated letter proportions of k1
for The Long and Winding Road shifted 11 units to
the left

Letter         A      B      C      D      E      F      G      H      I
Sample Text    8.04   1.54   3.06   3.99   12.51  2.30   1.96   5.49   7.26
Example Text   14.6   0.00   2.08   8.33   16.7   0.00   2.08   0.00   6.25

Letter         J      K      L      M      N      O      P      Q      R
Sample Text    0.16   0.67   4.14   2.53   7.09   7.60   2.00   0.11   6.12
Example Text   0.00   2.08   6.25   0.00   8.33   8.33   2.08   0.00   4.17

Letter         S      T      U      V      W      X      Y      Z
Sample Text    6.54   9.25   2.71   0.99   1.92   0.19   1.73   0.09
Example Text   4.17   8.33   0.00   0.00   4.17   0.00   2.08   0.00

The second method involves the use of mathematics in the form
of the $L^p$ norm. In general, the $L^1$ norm can be expressed as
$|x|_1 = \sum_{i=1}^{N} |x_i|$, where $x$ is a vector in $N$ dimensions. For each
of the $L$ norms presented, $x$ will be a vector in $N$ dimensions.
For this method let $p^{(m)}$, $m = 0, 1, \ldots, 25$, be the vector of calculated
proportion values for a subsequence. The index $(m)$ indicates the
shift value of this particular vector. Let the vector of known proportion
values be denoted by $\theta$. The main objective of this method is to
calculate the distance between each corresponding pair of elements of
$p^{(m)}$ and $\theta$, and then to sum the absolute values of the distances
for each of the 26 pairs of elements. Mathematically, this process can
be performed based on Equation 8.
\[ \sum_{j=0}^{25} \left| p_j^{(m)} - \theta_j \right|, \quad m = 0, 1, \ldots, 25 \tag{8} \]
This is then repeated 25 times, each time shifting the vector
of calculated proportion values by one more position. Once all 26
calculations are finished,
the goal is to find the smallest sum of distances, which would imply that
the calculated proportion values closely match the known proportion
values. The value of the shift that produces this minimal value is also
the value of the key for the Caesar shift of the respective subsequence.
The calculated proportion values from $k_1$ of The Long and Winding
Road will be used in examples to find actual values based on the
methods. Thus, for the $L^1$ norm, the smallest value calculated was
46.72, at which $m = 11$. Thus, a key of 11 belongs to the Caesar shift of
$k_1$. Then, as before, the overall procedure is repeated for each element
of the keyword.
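The search over the 26 shifts described by Equation 8 can be sketched in Python. This is an illustration only and not the Maple code of Appendix A: the data are copied from Tables 8 and 10, proportions are kept as percentages, and the function names are hypothetical.

```python
# English letter percentages from Table 8 (A through Z).
KNOWN = [
    8.04, 1.54, 3.06, 3.99, 12.51, 2.30, 1.96, 5.49, 7.26,
    0.16, 0.67, 4.14, 2.53, 7.09, 7.60, 2.00, 0.11, 6.12,
    6.54, 9.25, 2.71, 0.99, 1.92, 0.19, 1.73, 0.09,
]
# Letter counts for column k1 from Table 10 (A through Z, 48 letters).
K1_COUNTS = [1, 0, 2, 2, 4, 0, 0, 2, 0, 1, 0, 7, 0, 1, 4, 8, 0, 1,
             0, 3, 0, 1, 3, 0, 4, 4]


def shifted_percentages(counts, m):
    """Percentages of the column after shifting m units to the left."""
    total = sum(counts)
    return [100 * counts[(j + m) % 26] / total for j in range(26)]


def l1_distance(counts, m):
    """Sum of absolute differences from the known percentages (Equation 8)."""
    shifted = shifted_percentages(counts, m)
    return sum(abs(p - q) for p, q in zip(shifted, KNOWN))


best_shift = min(range(26), key=lambda m: l1_distance(K1_COUNTS, m))
print(best_shift, round(l1_distance(K1_COUNTS, best_shift), 2))
```

The smallest sum occurs at a shift of 11. Because this sketch uses the unrounded proportions, its minimum is close to, but not exactly, the 46.72 reported above.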
The third method involves an idea similar to the one
used in the previous method. In this method, the $L^2$ norm is used.
The $L^2$ norm is also well known as the Euclidean
distance, $|x|_2 = \left( \sum_{i=1}^{N} x_i^2 \right)^{1/2}$. Thus, in this method, the goal is to
minimize the distance between the vector $p^{(m)}$ of calculated
proportion values and the vector $\theta$ of known proportion values.
Therefore, the minimal value of
\[ \| \theta - p^{(m)} \| \tag{9} \]
is desired with respect to the shift value $m$. It is helpful to first
simplify this statement so as to determine the manner by which it can be
minimized. Thus,
\[ \| \theta - p^{(m)} \|^2 = \left( \theta - p^{(m)} \right) \cdot \left( \theta - p^{(m)} \right) = \| p^{(m)} \|^2 + \| \theta \|^2 - 2\, p^{(m)} \cdot \theta. \]
Since both $\| p^{(m)} \|^2$ and $\| \theta \|^2$ are constant, positive values, the
minimal value of Equation 9 will occur when the maximum value of
$p^{(m)} \cdot \theta$ occurs. Now, $p^{(m)} \cdot \theta$ is a dot product of two vectors, which will
produce a scalar answer. Thus, the dot product must be taken for
each respective value of $m$ that corresponds to shifting the vector of
calculated proportions. Of the 26 calculated dot products, the desired
result is the dot product that results in the maximum value of $p^{(m)} \cdot \theta$.
Therefore, the key to the Caesar shift is the value $m$. For the $L^2$ norm,
the maximum value of $p^{(m)} \cdot \theta$ calculated was 117.384 and the minimal
value of $\| \theta - p^{(m)} \|$ calculated was 12.5697. Both of these values were
calculated when the shift value was 11. Thus, the first letter of the
keyword would be L. For the remaining $k - 1$ positions in the keyword,
the method is repeated to determine the actual keyword.
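The equivalence between minimizing the Euclidean distance and maximizing the dot product can be checked numerically. The Python sketch below is an illustration only, with data copied from Tables 8 and 10, proportions kept as percentages, and hypothetical function names.

```python
from math import sqrt

# English letter percentages from Table 8 (A through Z).
KNOWN = [
    8.04, 1.54, 3.06, 3.99, 12.51, 2.30, 1.96, 5.49, 7.26,
    0.16, 0.67, 4.14, 2.53, 7.09, 7.60, 2.00, 0.11, 6.12,
    6.54, 9.25, 2.71, 0.99, 1.92, 0.19, 1.73, 0.09,
]
# Letter counts for column k1 from Table 10 (A through Z, 48 letters).
K1_COUNTS = [1, 0, 2, 2, 4, 0, 0, 2, 0, 1, 0, 7, 0, 1, 4, 8, 0, 1,
             0, 3, 0, 1, 3, 0, 4, 4]


def shifted_percentages(counts, m):
    """Percentages of the column after shifting m units to the left."""
    total = sum(counts)
    return [100 * counts[(j + m) % 26] / total for j in range(26)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_distance(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


shifts = range(26)
by_dot = max(shifts, key=lambda m: dot(shifted_percentages(K1_COUNTS, m), KNOWN))
by_distance = min(
    shifts, key=lambda m: l2_distance(shifted_percentages(K1_COUNTS, m), KNOWN)
)
print(by_dot, by_distance)
```

Both searches select a shift of 11, and the smallest distance found is close to the 12.5697 reported above; the small difference comes from using unrounded proportions.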
The last method to be presented is based on the $L^\infty$ norm. It can
be expressed as $|x|_\infty = \max\{ |x_i| , 1 \le i \le N \}$. It merely picks out
the largest absolute valued element of $x$. Therefore, the $L^\infty$ norm will
determine the largest proportion value found in the vector of calculated
proportions. Thus,
\[ |x|_\infty = \max\{ |p_i^{(0)}| , 1 \le i \le 26 \}. \tag{10} \]
Now, it would be expected that the position of the largest proportion
value would match the position of E in the known proportion
values. Let $t$ represent the position in the vector that contains the
largest element. Thus, the value of the key of the Caesar shift would
be $m = t - 5$. For the calculated values of proportions from $k_1$,
$|x|_\infty = 16.67\%$, and 16.67% resides in position 16 of the alphabet.
Therefore, the Caesar shift associated with $k_1$ has a key of
$16 - 5 = 11$, which equates to the letter L.
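This last rule is short enough to state in a few lines of Python. The sketch below is an illustration only; it uses the counts of Table 10 and the hypothetical function name key_from_peak, and it assumes that the most frequent letter of the column stands for E.

```python
# Letter counts for column k1 from Table 10 (A through Z, 48 letters).
K1_COUNTS = [1, 0, 2, 2, 4, 0, 0, 2, 0, 1, 0, 7, 0, 1, 4, 8, 0, 1,
             0, 3, 0, 1, 3, 0, 4, 4]


def key_from_peak(counts):
    """Return m = t - 5, where t is the 1-based position of the most
    frequent letter and E is the fifth letter of the alphabet."""
    t = max(range(26), key=lambda i: counts[i]) + 1
    return (t - 5) % 26


key = key_from_peak(K1_COUNTS)
letter = chr(ord("A") + key)  # key 0 corresponds to A, key 11 to L
print(key, letter)
```

This rule relies on a single peak, so it is the easiest of the methods to mislead when a column is short or when E is not its most common plain text letter.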
By subdividing the cipher text letters into respective columns, the
process of determining the exact keyword is reduced to finding the key
value of the associated Caesar shift for each letter position of the key-
word. Overall, there are several methods that can be used to determine
the key value of the Caesar shift.
Two procedures have been programmed in Maple to determine
guesses at the actual keyword (see Appendix A). The first of the two
procedures is guesskey(s,n), which is based on the $L^\infty$ norm. The
input parameters of this procedure are the cipher text message and
the length of the keyword. The output is the procedure's best estimate
of the keyword. For example, the keyword of length 9 that guesskey
estimated for The Long and Winding Road is given below.
guesskey(lwr,9);
LIVENPOOL
Based on the $L^2$ norm, the second procedure coded is
guesskey2(cipher,l). Its input parameters are the same as for guesskey:
the cipher text message and the keyword length. Again, it
will determine the best estimate of the actual keyword and return it as
output, as demonstrated below for The Long and Winding Road with
a keyword of 9 letters.
guesskey2(lwr,9);
LIVENPOOL
Both estimates differ from the actual keyword LIVERPOOL only in the
fifth letter.
7. Recovery of the Plain Text Message


Over the preceding pages, the keyword length has been calculated using
mathematics and probability via Kasiski's and Friedman's methods.
From there, frequency analysis was used to determine the shifts that
correspond to each letter position, finally revealing the keyword. One
last detail remains to be discussed: the actual recovery of the plain
text message. Again, the cipher text is separated into $k$
subsequences, where $k$ is the keyword length. The keyword letter $l_i$
is the final key to the $i$-th subsequence, and it can be used in two
equivalent ways. First, it reveals the row of the Vigenere Square that
was used to encrypt the plain text letters in positions $kx + i$,
$x = 0, 1, 2, \ldots, \lfloor (n-i)/k \rfloor$, where $n$ is the length
of the message; for the first keyword letter these are the positions
$kx + 1$, $x = 0, 1, 2, \ldots, \lfloor (n-1)/k \rfloor$. For each
letter in these positions, the cipher text letter is found in that row
of the Vigenere Square, and the column in which it lies is noted. The
letter that heads this column is the original letter of the plain text
message. Second, the keyword letter reveals the value of the Caesar
shift that corresponds to that row of the Vigenere Square. Shifting
each cipher text letter back by this amount reveals the plain text
letter. Both methods are explained more fully in Section 3 of this
paper. This procedure is repeated for each of the $k$ subsequences.
Eventually, after some work, the original plain text message can be
read in its entirety.
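
The Caesar shift version of this recovery can be sketched in Python as
follows. This is an illustration only and is not the Maple procedure
vigen_dec of Appendix A. It assumes the keyword consists only of
letters and uses the convention A = 0, ..., Z = 25 for the shifts. The
name vigenere_decrypt and the sample message are hypothetical.

```python
import string

ALPHABET = string.ascii_uppercase

def vigenere_decrypt(ciphertext: str, keyword: str) -> str:
    """Shift each cipher letter back by the Caesar shift of the
    keyword letter that lines up with its position."""
    shifts = [ALPHABET.index(k) for k in keyword.upper()]
    # Keep letters only, ignoring case, spaces, and punctuation.
    letters = [c for c in ciphertext.upper() if c in ALPHABET]
    plain = []
    for position, c in enumerate(letters):
        # The keyword repeats, so position p uses keyword letter p mod k.
        shift = shifts[position % len(shifts)]
        plain.append(ALPHABET[(ALPHABET.index(c) - shift) % 26])
    return "".join(plain)

print(vigenere_decrypt("LXFOPVEFRNHR", "LEMON"))  # ATTACKATDAWN
```

In the sample, the first cipher letter L is shifted back by 11 (the
keyword letter L) to give A, the second letter X is shifted back by 4
(the keyword letter E) to give T, and so on through the message.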

8. Appendix A: Maple Code


with(linalg): with(stats): with(plots): with(plottools):

# s is a string of letters and other symbols.
# numbers returns the vector of numbers corresponding to the
# letters, ignoring case and the other symbols.
numbers:=proc(s)
local alph,v,n,m,i,c;
alph:="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
n:=length(s);
v:=[];
for i from 1 to n do
c:=substring(s,i..i);
m:=searchtext(c,alph);
if m > 0 then
v:=[op(v), (m -1 mod 26) + 1] fi;


od;
RETURN(v)
end:

# letters converts a vector of numbers to a
# string of uppercase letters
letters:=proc(v)
local alph,i,msg;
alph:="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
msg:="";
for i from 1 to nops(v) do
msg:=cat(msg,substring(alph,v[i]..v[i]) );
od;
RETURN(msg);
end:

# upcase converts a string to a standard form of only uppercase letters
upcase:=w->letters(numbers(w)):

# vigen_enc encrypts plaintext with the Vigenere Cipher under keyword
vigen_enc:=proc(plaintext,keyword)
local v, k, l, w, i;
v:=numbers(plaintext);
k:=numbers(keyword);
l:=nops(k);
w:=[seq( (v[i]+k[ (i-1 mod l) +1]-2 mod 26) +1,
i=1..nops(v) )];
RETURN(letters(w))
end:

# vigen_dec decrypts ciphertext with the Vigenere Cipher under keyword
vigen_dec:=proc(ciphertext,keyword)
local v, k, l, w, i;
v:=numbers(ciphertext);
k:=numbers(keyword);
l:=nops(k);
w:=[seq( (v[i]-k[ (i-1 mod l) +1] mod 26) +1,
i=1..nops(v) )];
RETURN(letters(w))
end:

# Index of coincidence of the letters in ciphertext
index_of_coincidence:=proc(ciphertext)
local v,x,n,u,ans;
v:=numbers(ciphertext);
n:=nops(v);
u:=[seq( nops(select(has,v,x)), x=1..26 )];
ans:=(evalm(u &* u)-n)/(n*(n-1));
RETURN( evalf(ans) );
end:

# Friedman's formula for estimating the length of the keyword
friedman:=proc(ciphertext)
local k,n,v;
v:=numbers(ciphertext);
n:=nops(v);
k:=index_of_coincidence(ciphertext);
RETURN( 0.027*n/( (n-1)*k -0.038*n +0.065) );
end:

# Limit of Friedman's formula as the message length n grows large
friedman2:=proc(ciphertext)
local k;
k:=index_of_coincidence(ciphertext);
RETURN( 0.027/( k -0.038) );
end:

# Counts of each letter A..Z in msg, as a list of 26 numbers
numfreq:=proc(msg)
local v,x;
v:=numbers(msg);
RETURN([seq( nops(select(has,v,x)), x=1..26 )]);
end:

# Frequencies of letters in msg
freq:=proc(msg)
local alph,v,ans,x;
alph:="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
v:=numfreq(msg);
RETURN( evalm(transpose(
[seq( [substring(alph,x..x),v[x]] , x=1..26)] )) );
end:

# Frequencies of letters in subsequence of letters at
# positions k[1]*x +k[2] (k must be a list.)
subfreq:=proc(s,k::list)
local v,n,u,w,x;
v:=numbers(s);
n:=nops(v);
u:=[seq( v[k[1]*x+k[2]], x=0..floor( (n-k[2])/k[1] ) )];
w:=letters(u);
RETURN( freq(w) );
end:

# Letter frequencies sorted from high to low.
sortfreq:=proc(msg)
local alph,v,ans,x;
alph:="ABCDEFGHIJKLMNOPQRSTUVWXYZ";
v:=sort(numbers(msg));
ans:=[];
for x from 1 to 26 do
ans:=[op(ans), [substring(alph,x..x) , nops(select(has,v,x))] ];
od;
ans:=sort(ans,(x,y)->evalb(x[2]>y[2]));
RETURN(evalm(transpose(ans)));
end:

# Highest frequency letter in the subsequence at positions k[1]*x +k[2]
topfreq:=proc(s,k)
local v,n,u,w,x;
v:=numbers(s);
n:=nops(v);
u:=[seq( v[k[1]*x+k[2]], x=0..floor( (n-k[2])/k[1] ) )];
w:=letters(u);
RETURN( col(sortfreq(w),1) );
end:

# Guess the keyword based on length n and highest frequency letters.
guesskey:=proc(s,n)
local i, ans,u,v,m,w;
ans:="";
v:=[];
for i from 1 to n do
u:=topfreq(s,[n,i]);
m:=numbers(u[1]);
v:=[op(v),m[1]];
od;
w:=map(x-> ( x-5 mod 26) + 1, v);
RETURN(letters(w));
end:
basefreq:=[["A","B","C","D","E","F","G","H","I","J","K","L","M",
"N","O","P","Q","R","S","T","U","V","W","X","Y","Z"],
[8.167,1.492,2.782,4.253,12.702,2.228,2.015,6.094,6.966,
0.153,0.772,4.025,2.406,6.749,7.507,1.929,0.095,5.987,
6.327,9.056,2.758,0.978,2.360,0.150,1.974,0.074]]:

basechart:=display( seq(
rectangle( [x-0.25,basefreq[2,x]],[x+0.25,0],color=blue),
x=1..26) ):

# Counts of each block of m consecutive characters in s
blockfreq:=proc(s,m)
local v,n,i,ans,block,x,y;
n:=length(s);
v:=sort([seq(substring(s,i..(i+m-1)),i=1..(n-m+1))]);
ans:=[];
block:=v[1];
y:=1;
for x from 2 to n-m+1 do
if v[x] = block then
y:=y+1;
continue;
else
ans:=[op(ans),[block,y]];
block:=v[x];
y:=1;
fi;
od;
ans:=[op(ans),[block,y]];
RETURN(ans)
end:

# Block counts sorted from high to low.
blockfreqsort:=proc(s,m)
local A,x,y;
A:=blockfreq(s,m);
sort( A, (x,y)->evalb(x[2]>y[2]))
end:

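# For each block of length l that occurs at least n times in msg,
# factor the distances between its consecutive occurrences.
kasiski:=proc(msg,l,n)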
local v,x,y,ans,u,facs,i;
v:=select((x,y)->evalb(x[2]>=y), blockfreqsort(msg,l),n);
ans:=[];
for x from 1 to nops(v) do
u:=findstring(msg,v[x][1]);
facs:=[seq( ifactor(u[i+1]-u[i]), i=1..nops(u)-1 )];
ans:=[op(ans), [v[x],facs] ];
od;
RETURN(ans)
end:

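# Guess the keyword of length l by matching the letter counts of each
# subsequence against the 26 cyclic shifts of the reference frequencies
# freq0 (freq0 and cycle are not defined in this appendix).
guesskey2:=proc(cipher,l)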
global freq0;
local nums, i, p, inds, m, j, x, y;
nums:=[];
for i from 1 to l do
p:=convert(row(subfreq(cipher, [l,i]), 2), list);
inds:= [seq( evalf( evalm(p &* cycle(freq0, -j))/norm(p,2) ),
j=0..25)];
m:=max(op(inds));
for j from 1 to 26 do
if inds[j] >= m then break fi;
od;
nums:=[op(nums), j ];
od;
RETURN(letters(nums))
end:
References
[1] Thomas H. Barr, Invitation to Cryptology, Prentice Hall, Upper Saddle River,
NJ, 2002.
[2] F. L. Bauer, Decrypted Secrets: Methods and Maxims of Cryptology, Springer-
Verlag, Berlin, 1997.
[3] Ole Immanuel Franksen, Mr. Babbage's Secret: The Tale of a Cypher and APL,
Prentice-Hall, Englewood Cliffs, NJ, 1984.
[4] David Kahn, The Codebreakers: The Story of Secret Writing, Scribner, New
York, NY, 1996.
[5] Simon Singh, The Code Book: The Science of Secrecy From Ancient Egypt to
Quantum Cryptography, Anchor Books, New York, NY, 1999.
[6] Douglas R. Stinson, Cryptography: Theory and Practice, CRC Press, Boca
Raton, FL, 1995.
[7] David Wright, Project Instructional Meetings, Oklahoma State University,
Stillwater, OK, 2002.
