# Bulletin of Mathematical Biology (2007) 69: 215–243 DOI 10.1007/s11538-006-9119-3

ORIGINAL ARTICLE

An Extended RNA Code and its Relationship to the Standard Genetic Code: An Algebraic and Geometrical Approach

Marco V. José a,∗, Eberto R. Morgado b, Tzipe Govezensky a

a Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, México D.F. 04510, México
b Facultad de Matemática, Física y Computación, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba

Received: 19 September 2005 / Accepted: 23 February 2006 / Published online: 2 November 2006
© Society for Mathematical Biology 2006

Abstract An algebraic and geometrical approach is used to describe the primaeval RNA code and a proposed Extended RNA code. The former consists of all codons of the type RNY, where R means purines, Y pyrimidines, and N any of them. The latter comprises the 16 codons of the type RNY plus codons obtained by considering the RNA code but in the second (NYR type) and the third (YRN type) reading frames. In each of these reading frames, there are 16 triplets that altogether complete a set of 48 triplets, which specify 17 out of the 20 amino acids, including AUG, the start codon, and the three known stop codons. The other 16 codons do not pertain to the Extended RNA code and constitute the union of the triplets YYY and RRR that we define as the RNA-less code. The codons in each of the three subsets of the Extended RNA code are represented by a four-dimensional hypercube and the set of codons of the RNA-less code is portrayed as a four-dimensional hyperprism. Remarkably, the union of these four symmetrical pairwise disjoint sets comprises precisely the already known six-dimensional hypercube of the Standard Genetic Code (SGC) of 64 triplets. These results suggest a plausible evolutionary path from which the primaeval RNA code could have originated the SGC, via the Extended RNA code plus the RNA-less code. We argue that the life forms that probably obeyed the Extended RNA code were intermediate between the ribo-organisms of the RNA World and the last common ancestor (LCA) of the Prokaryotes, Archaea, and Eucarya, that is, the cenancestor. A general encoding function, E, which maps each codon to its corresponding amino acid or the stop signal is also derived. In 45 out of the 64 cases, this function takes the form of a linear transformation F, which projects the whole six-dimensional hypercube onto a four-dimensional hyperface formed by all triplets that end in cytosine. In the remaining 19 cases the function E adopts the form of an affine transformation, i.e., the composition of F with a particular translation. Graphical representations of the four local encoding functions and E are illustrated and discussed. For every amino acid and for the stop signal, a single triplet, among those that specify it, is selected as a canonical representative. From this mapping a graphical representation of the 20 amino acids and the stop signal is also derived. We conclude that the general encoding function E represents the SGC itself.

Keywords Primaeval RNA code · Standard genetic code · Evolution of the genetic code · Extended RNA codes · Algebra and geometry

∗ Corresponding author. E-mail address: marcojose@biomedicas.unam.mx (M. V. José).

1. Introduction

The current genetic code is considered to be nearly universal. This code is written in an alphabet of four letters (C, A, U, G), grouped into words three letters long, called triplets or codons. Each of the 64 codons specifies one of the 20 amino acids or else serves as a punctuation mark signaling the end of a message. Given 64 codons and 20 amino acids plus a punctuation mark there are 21^64 ≈ 4 × 10^84 possible genetic codes. Is there something special about the one code that governs all life on Earth? Francis Crick (1968) argued that the Standard Genetic Code (SGC) need not be special at all; it could be nothing more than a “frozen accident.” This concept is not far away from the idea that sometime there was an age of miracles. However, when the SGC was compared to a computer generated random sample of one million alternatives, the natural code emerged as superior to every random permutation with a single exception (Freeland and Hurst, 1998). Recently, numerical experiments with hand-crafted genetic codes analyzed in silico showed statistical properties (such as information content, scaling and autocorrelation properties) inferior to those of the SGC (García et al., 2004).

It is widely accepted that there was an age in the origin of life in which RNA played the role of both genetic material and main agent of catalytic activity (e.g. Woese, 1967; Crick, 1968; Kenneth and Ellington, 1995). This period is known as the RNA World (Gilbert, 1986; Gesteland et al., 1999). Investigations on the minimal gene set that is necessary and sufficient to sustain the existence of cellular life are consistent with the notion that the last common ancestor (LCA) of the three primary kingdoms (Archaea, Eucarya, and Prokaryotes) had an RNA genome (Mushegian and Koonin, 1996; Hutchinson et al., 1999; Gil et al., 2002).
However, the quasi-species concept of Eigen and Schuster (1977) demonstrated that the accuracy of replication placed limits on the size of the genome that can be maintained by selection. The higher the error rate during replication, the smaller the maximum permissible genome size. Thus, replication fidelity was a strong limiting feature in the RNA World. On the other hand, sequence similarities shared by many ancient, large proteins found in all three kingdoms of life suggest that considerable fidelity already existed in the operative genetic system of their LCA, but such fidelity is unlikely, given Eigen’s limit, to be found in RNA-based genetic systems (Lazcano, 1995; Lazcano and Miller, 1996). The cenancestor probably had a DNA genome (Becerra et al., 1997).
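The count of possible codes quoted at the start of the Introduction can be checked with exact integer arithmetic. The short sketch below is only an illustration (the variable name `n_codes` is ours, not the paper's), and it assumes the count is taken over all assignments of one of 21 meanings to each of 64 codons, including assignments that leave some amino acid unused.

```python
# Number of ways to give each of the 64 codons one of 21 meanings
# (20 amino acids plus the stop signal). Python integers are exact,
# so no overflow or rounding occurs in the power itself.
n_codes = 21 ** 64

digits = str(n_codes)
print(len(digits))        # number of decimal digits of 21^64
print(digits[0])          # leading digit
print(f"{n_codes:.1e}")   # scientific notation, one decimal place
```

A number with 85 decimal digits and leading digit 4 lies between 4 × 10^84 and 5 × 10^84, which agrees with the order of magnitude given in the text.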

To our knowledge, there have been no systematic studies that relate the RNA code (Crick, 1968; Eigen and Schuster, 1977; Konecny et al., 1995) with the SGC. The main question that we address in this work is whether our algebraic and geometrical approach can shed some light on the problem of how the primaeval RNA code could have evolved to generate the SGC. To this end, we search for symmetries and patterns in both the SGC and the RNA code. We hypothesize that in order to allow further evolution of the RNA genetic code, and given that translational and transcriptional errors were probably of great importance early in the history of life, when the machinery of protein synthesis was imprecise, the constraint of having an intact message in only one reading frame has to be relaxed. Therefore, we consider not a strict comma-less code as proposed by Crick et al. (1957) but rather an RNA code which can be translated in the first, second and third reading frames.

The article is organized as follows. First, we show that each reading frame can be represented as a four-dimensional hypercube (A3.1) as derived by concepts of combinatorial geometry. Interestingly, the three four-dimensional hypercubes and the right hyperprism (A9.3) associated to the RNA-less code can be inserted as pairwise disjoint four-dimensional affine subspaces in the six-dimensional hypercube (Coxeter, 1973; Fig. 5). Next, we define an encoding function for each of the three four-dimensional hypercubes, associated to the Extended RNA code, and an encoding function for the RNA-less code. We also derive the general encoding function of the SGC, and we show that this function is an integration of the different encoding functions of the above-mentioned four sets. This encoding function adopts different forms for different subsets of codons, and it represents the SGC code itself. Finally, we discuss our findings in terms of the origin and evolution of the SGC. For the interested reader, an Appendix which is referred to throughout this work, about the concepts of algebra and geometry, is provided at the end of this article.

1.1. Theoretical background

The standard table of codon assignments derives from the obvious representation of the triplet code as a 4 × 4 × 4 cube. Several authors (e.g., Jiménez-Montaño et al., 1996; Sánchez et al., 2005), observing that 64 is equal not only to 4^3 but also to 2^6, suggested to organize the codon table as a six-dimensional hypercube or six-dimensional vector space over the binary field. In previous works (Sánchez et al., 2004a,b, 2005), two algebras were presented to reflect the relationship between codon assignment and the physicochemical aspects of the amino acids. Here, we used the following ordered assignment of the nucleotide bases: C ↔ 00, A ↔ 01, U ↔ 10, G ↔ 11. With this ordering, purines A and G are represented by the odd numbers 1 and 3, whereas the pyrimidines C and U are represented by the even numbers 0 and 2.

This assignment of the duplets 00, 01, 10, 11, which also represent the integers 0, 1, 2, 3 in the binary numerical system, to the letters C, A, U, G, may be done in 24 = 4! ways. Given that C and G, A and U are complementary to each other in double stranded DNA, it is convenient to select the ordering in such a way that 00 is complementary of 11, and 01 is to 10. Hence, we have only eight possible selections: (C, A, U, G), (C, U, A, G), (A, C, G, U), (A, G, C, U), (U, C, G, A), (U, G, C, A), (G, A, U, C), and (G, U, A, C). From these orderings, we have selected the first, but it is possible to show that the remaining ones lead to the same results.

The 64 sextuples of zeros and ones generate the so-called six-dimensional hypercube, which is a vector space over the binary field Z2 = {0, 1} (A3.1). The vector space (Z2)^6 is an orthotope, according to the terminology of Coxeter (1973) (A3). In the hypercube, as in every orthotope, all the angles between adjacent edges are supposed to be right angles, but in our pictures many of them look acute or obtuse. This is so because the picture is a projection of a six-dimensional figure over a plane. The same happens with some of the angles of a three-dimensional cube, when they are portrayed in a plane.

In the vector space structure of the hypercube the addition is the so-called sum modulo 2, bit by bit. This sum, also called XOR operation, widely used in symbolic logic, for the elements of the field Z2, is defined in Table 1. When Table 1 is translated to the nucleotide bases it gives Table 2, which is the operation table of a known abelian group of order 4, the so-called Klein Four Group. This group is isomorphic (A3) to the group of all the symmetries of a plane rectangle (not including the square).

Table 1 Sum modulo 2 in the field Z2

| + | 0 | 1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |

Table 2 Sum modulo 2 of nucleotide bases

| + | C | A | U | G |
|---|---|---|---|---|
| C | C | A | U | G |
| A | A | C | G | U |
| U | U | G | C | A |
| G | G | U | A | C |

2. Results

2.1. The RNA code and its three reading frames: The extended code

A primaeval RNA World was proposed (Gilbert, 1986) which comprises the codons with an RNY pattern (R: purine, Y: pyrimidine, N: any nucleotide). Since the primitive translational apparatus may have been imperfect, shifts in the reading frame probably occurred, and codons with NYR and YRN patterns emerged. These three patterns altogether are here defined as the codons of the Extended RNA code. This extended code includes 48 triplets which specify 17 amino acids, including AUG, the start codon, and the three known stop codons. The remaining 16 codons can only be obtained by mutations other than by frame-shift readings, and we defined this set as the RNA-less code or complementary code of the Extended RNA code.

Now we will consider the combinatorial geometry, associated to the structure of the vector space. For any vector space the associated geometry is defined as the family of all the affine subspaces or linear varieties of the given space (A2). In this context, the 64 triplets XYZ of the SGC, where X, Y, and Z belong to the set {C, A, U, G}, can be partitioned into four sets of 16 triplets each. The first set corresponds to the so-called primaeval RNA code (RNY pattern), the second (NYR pattern) and the third (YRN pattern) sets correspond to readings of the primaeval RNA code at the second and the third reading frames, respectively, and the fourth set corresponds to the RNA-less code (YYY and RRR patterns). Each of the first three sets can be partitioned into two subsets by replacing the N by Y or R. For example, for the set RNY we obtain the two subsets RYY and RRY. The fourth set is the union of two subsets: YYY and RRR. The eight elements of a subset correspond to a three-dimensional coordinated affine subspace (A4.1), that is, an ordinary cube, contained in the six-dimensional hypercube. From the eight subsets, only one is a vector subspace (A1.1) or vectorial cube, and that is the subset YYY, which contains the null vector CCC. The other seven subsets are three-dimensional affine subspaces obtained from YYY by translations (A9.1). For each of the eight subsets, we may consider two subclasses, each of them formed by triplets having the same central nucleotide. These subclasses correspond to coordinated planes or faces of the cube. Finally, each of the subclasses is partitioned into two pairs, every pair consisting of triplets that have not only the same central nucleotide, but also the same first nucleotide. They constitute coordinated lines or edges of the hypercube. In most cases, these pairs of codons specify the same amino acid.

In order to describe algebraically and geometrically all sets, we introduce the following notational convention. For every vector ei of the canonical base, we denote as ti the associated translation: v → v + ei. For every pair (i, j), we denote as tij the composed function ti ◦ tj. For every triplet (i, j, k), we denote as tijk the composed function ti ◦ tj ◦ tk and so on (A9.1).

2.2. First reading frame

The 16 triplets of the RNY pattern code for eight amino acids and are considered as the primaeval RNY code (Crick, 1968; Eigen and Schuster, 1978). We say that the reading and translation of sequences of codons of the form RNY defines the first reading frame (FRF) in the RNA World. The set of codons of the form RNY is the union of the cubes RYY (Fig. 1a) and RRY (Fig. 1b) in which every amino acid corresponds to an edge of the associated cube. This is so since the Hamming distance (He et al., 2004; A6.1) between triplets on the same edge is 1. RYY is obtained from YYY by the addition of the triplet ACC, that is, the addition of the vector e2 which involves a transversion (A10.2) in the first nucleotide. RRY is obtained from YYY by the addition of the triplet AAC, that is, the vector e2 + e4, which represents transversions in the first and the second nucleotides. Each cube is obtained from the other by means of the translation t4, associated to the canonical vector e4, that is, the addition of the codon CAC to each triplet. The Hamming distance between a vertex and its image under the translation t4 is equal to 1. It means that the set of codons of the form RNY constitutes a four-dimensional hypercube, which is a coordinated affine subspace (A4), or a four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 1c). The hypercube is a generalization of a three-dimensional cube in n dimensions, also called a measure polytope. It is a symmetrical regular polytope with mutually orthogonal (A8) sides, and is therefore an orthotope (Coxeter, 1973).

Fig. 1 RNY code. Graphical representation of the subsets (a) RYY and (b) RRY. (c) First Reading Frame: RNY = RYY ∪ RRY. Note that in each cube the four amino acids correspond to 4 (solid edges) out of the 12 edges of the cube.

2.3. The second reading frame

If in a sequence of triplets of the form RNY, the reading starts at the second nucleotide N, ignoring the first nucleotide R, due to a slippage, then triplets of the form NYR will be translated. The set of triplets of the form NYR is disjoint from the set RNY. The 16 triplets of the form NYR specify eight amino acids, five of which were also found in the FRF. Then, the triplets of the form NYR give rise to only three new amino acids. In this set, the start codon AUG ensues (Konecny et al., 1995). We say that the reading and translation of sequences of codons of the form NYR defines the second reading frame (SRF) in the RNA World. We consider the union of the sets of codons of the form RNY and NYR, that is, those of the FRF and the SRF, as an extension of the original RNY code, and their correspondence with their 11 amino acids. The set of codons of the form NYR is the union of the cubes YYR (Fig. 2a) and RYR (Fig. 2b). Note that in the cube YYR the amino acid Leu corresponds to a face and in the cube RYR not every amino acid corresponds to an edge of the associated cube. YYR is obtained from YYY by addition of the codon CCA, that is, of the vector e6 which represents a transversion in the third nucleotide. RYR is obtained from YYY by the addition of the triplet ACA, that is, the addition of the vector e2 + e6, which involves transversions in the first and third nucleotides. Each cube can be derived from the other by means of the translation t2, associated to the canonical vector e2, that is, by the addition of the codon ACC to every triplet. The Hamming distance between a vertex and its image, under the translation t2, is equal to 1. It means that the set of codons of the form NYR constitutes also a four-dimensional hypercube, which is a coordinated affine subspace, or a four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 2c). The hypercube NYR is the image of the RNY under the non-singular linear transformation e1 → e5, e2 → e6, e3 → e1, e4 → e2, e5 → e3, e6 → e4, which is defined by an even permutation of the canonical base. This function maps every triplet XYZ onto the triplet YZX. The matrix of this linear transformation is orthogonal with determinant 1, and it can also be interpreted as a double rotation in the six-dimensional Euclidean vector space R^6 (A7).

Fig. 2 NYR code. Graphical representation of the subsets (a) YYR and (b) RYR. (c) Second Reading Frame: NYR = YYR ∪ RYR. Note that in (a), two amino acids correspond to 2 (solid edges) out of the 12 edges of the cube, and Leu corresponds to a face (four connected solid edges). In (b) not all the amino acids correspond to edges of the cube, rather there are two amino acids (Met and Ile) which correspond to single vertices.

2.4. The third reading frame

If in a sequence of triplets of the form RNY, the reading starts at the third nucleotide Y, ignoring the first (R) and the second (N) nucleotides, due to a slippage, then triplets of the form YRN will be translated. The set of triplets of the form YRN is disjoint from the sets RNY and NYR. Thirteen out of the 16 triplets of the form YRN code for six new amino acids, without repetitions of any of those found in the FRF or the SRF. The other three codons are the so-called stop codons. Then, the triplets of the form YRN give rise to six new amino acids, which complete a set of 17 out of the 20 primary amino acids. We say that the reading and translation of sequences of codons of the form YRN defines the third reading frame (TRF) in the RNA World. We will consider the union of the sets of codons of the form RNY, NYR and YRN, as an extension of the primaeval RNY code, and their correspondence with 17 amino acids and stop codons. The set of codons of the form YRN is the union of the cubes YRY (Fig. 3a) and YRR (Fig. 3b). In the YRY cube every amino acid corresponds to an edge of the associated cube but this is not the case in the YRR cube. YRY is obtained from YYY by addition of the codon CAC, that is, of the vector e4 that represents a transversion in the second nucleotide. YRR is obtained from YYY by the addition of the triplet CAA, that is, the addition of the vector e4 + e6 which involves transversions in the second and third nucleotides. Each cube is obtained from the other by the translation t6, associated to the canonical vector e6, that is, the addition of the codon CCA to every triplet. The Hamming distance between a vertex and its image, under the translation t6, is equal to 1. It means that the set of codons of the form YRN constitutes also a four-dimensional hypercube, which is a coordinated affine subspace, or four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 3c). The hypercube YRN is the image of the NYR under the non-singular linear transformation e1 → e5, e2 → e6, e3 → e1, e4 → e2, e5 → e3, e6 → e4, which is the same basis permutation, or double rotation, which converts RNY into NYR.

In summary, the set of 48 codons of the form RNY, NYR and YRN, that is, those associated to the first, second and third reading frames, comprises the Extended RNA code, where 45 out of them specify 17 of the primary amino acids, and the other three codons correspond to a termination signal.
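The construction of Sections 1.1 to 2.4 can be checked mechanically. The following is a minimal sketch, not the authors' own code: it assumes the bit assignment C = 00, A = 01, U = 10, G = 11 chosen in the text and the standard genetic code table, and every helper name (`to_int`, `add`, `hamming`, `pattern`, `shift`, `amino_acids`) is hypothetical, introduced here only for illustration.

```python
from itertools import product

# Base-to-bits assignment chosen in the text: C=00, A=01, U=10, G=11.
BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}
BASE = {bits: base for base, bits in BITS.items()}

# Standard genetic code, codons enumerated in the usual U, C, A, G table
# order ('*' marks a stop codon).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AA)}

def to_int(codon):
    """Codon -> integer 0..63 via its six-bit representation."""
    return int("".join(BITS[b] for b in codon), 2)

def to_codon(n):
    """Integer 0..63 -> codon (inverse of to_int)."""
    bits = format(n, "06b")
    return "".join(BASE[bits[i:i + 2]] for i in (0, 2, 4))

def add(c1, c2):
    """Sum modulo 2, bit by bit (XOR), of two codons."""
    return to_codon(to_int(c1) ^ to_int(c2))

def hamming(c1, c2):
    """Number of bit positions in which two codons differ."""
    return bin(to_int(c1) ^ to_int(c2)).count("1")

def pattern(codon):
    """Purine/pyrimidine pattern, e.g. 'AUG' -> 'RYR'."""
    return "".join("R" if b in "AG" else "Y" for b in codon)

def shift(codon):
    """Basis permutation XYZ -> YZX (reading one position later)."""
    return codon[1:] + codon[0]

def amino_acids(codons):
    """One-letter amino acids specified by a set of codons (stops dropped)."""
    return {CODE[c] for c in codons} - {"*"}

ALL = set(CODE)
RNY = {c for c in ALL if pattern(c) in ("RYY", "RRY")}
NYR = {c for c in ALL if pattern(c) in ("YYR", "RYR")}
YRN = {c for c in ALL if pattern(c) in ("YRY", "YRR")}
RNA_LESS = ALL - RNY - NYR - YRN           # the YYY and RRR cubes
EXTENDED = RNY | NYR | YRN

print(len(EXTENDED), len(amino_acids(EXTENDED)))       # codons, amino acids
print(sorted(amino_acids(NYR) - amino_acids(RNY)))     # new in the SRF
print({shift(c) for c in RNY} == NYR, {shift(c) for c in NYR} == YRN)
print({hamming(c, add(c, "CAC")) for c in RNY})        # translation t4
print({hamming(c, add(c, "AAA")) for c in RNA_LESS})   # translation t246
```

Under these assumptions the script reports 48 codons and 17 amino acids for the Extended RNA code, the three amino acids (Leu, Met, Pro) added by the second reading frame, and confirms that the map XYZ → YZX carries RNY onto NYR and NYR onto YRN. Adding CAC moves every RNY codon by Hamming distance 1, whereas adding AAA moves every RNA-less codon by distance 3.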

Fig. 3 YRN code. Graphical representation of the subsets (a) YRY and (b) YRR. (c) Third Reading Frame: YRN = YRY ∪ YRR. Note that in (a), four amino acids correspond to 4 (solid edges) out of the 12 edges of the cube. In (b) two amino acids correspond to an edge of the cube (Arg and Gln), Trp corresponds to only one vertex, and the three stop codons belong to this cube connected by two solid edges.

2.5. The codons of the RNA-less World

The remaining 16 triplets belong to the cubes YYY or RRR, that is, those composed by only pyrimidines or only purines, but they do not enter into the composition of the Extended RNA code. For this reason, we will call them the triplets of the RNA-less World. If we consider the reading and translation of sequences of codons of the form YYY or RRR, these triplets code for eight amino acids, five out of which are repetitions of those found in the Extended RNA code. Then, the triplets of the form YYY or RRR give rise to only three new amino acids. This addition completes the set of the 20 amino acids. The union of the Extended RNA code and its complementary RNA-less code constitutes the six-dimensional hypercube of the 64 triplets with the 20 amino acids and a termination mark.

The 16 codons of the RNA-less code represent the union of the cubes YYY (Fig. 4a) and RRR (Fig. 4b). Recall that YYY is a vector subspace of the whole vector space. Each cube could be obtained from the other by the translation t246 associated to the vector e2 + e4 + e6, that is, the addition of the codon AAA to every triplet, which performs a transversion in every nucleotide component. In contrast to the hypercubes of the Extended RNA code, the Hamming distance between a vertex and its image, under the translation t246, is equal to 3. It means that the set of codons of the RNA-less code does not lead to a four-dimensional hypercube. The set is a four-dimensional subspace of the six-dimensional hypercube, but it is not, as in the other cases, a four-dimensional hyperface but rather it is a right hyperprism of height 3 (Fig. 4c and A9.3). Interestingly, the result of the union of the pairwise disjoint sets of the Extended RNA code plus the RNA-less code is the six-dimensional hypercube of the 64 codons of the SGC (Fig. 5).

Fig. 4 The RNA-less code: YYY ∪ RRR. Graphical representation of the subsets (a) YYY and (b) RRR. (c) RNA-less code. Note that, in both cubes (a) and (b), the four amino acids correspond also to 4 (solid edges) out of the 12 edges of the cube.

Fig. 5 A graph representation diagram of the Boole lattice of the 64 triplets of the SGC. The six-dimensional hypercube can be envisaged as composed by: yellow • RNY (primaeval RNA code), orange • NYR (RNA code in the SRF), purple • YRN (RNA code in the TRF), black • YYY and RRR (RNA-less code).

3. Encoding functions

So far, it has been customary to represent in various graphical ways (e.g., the six-dimensional hypercube (Jiménez-Montaño et al., 1996), the icosahedron or dodecahedron (White undated), or the standard table of the genetic code) the 64 codons of the SGC but without any reference to an encoding function that maps each triplet with its corresponding amino acid or stop signal. Actually, none of these representations is the genetic code. Herein, we derive encoding functions for the Extended RNA code, the RNA-less code, and the SGC.

3.1. An auxiliary linear function for the different encoding functions

In Table 3, the correspondence of the ordered set of triplets that specify every amino acid is illustrated. The ordering of the set of triplets is the linear order determined by the selected order of the bases (C, A, U, G) so that the triplet at the left-most position is the one with the minimum value. We select the representation of every amino acid by only one triplet, that is, the first. For example, the representative of Ala is the triplet GCC. For the three stop codons we select as representative the triplet UAA.

Table 3 The correspondence between the ordered set of triplets and every amino acid and stop codons

| Group | Cube | Amino acid | Symbol | Sextuple | Set of triplets |
|---|---|---|---|---|---|
| 1st RF | 1 | Threonine | Thr | 010000 | ACC, ACA, ACU, ACG |
| 1st RF | 1 | Isoleucine | Ile | 011000 | AUC, AUA, AUU |
| 1st RF | 1 | Alanine | Ala | 110000 | GCC, GCA, GCU, GCG |
| 1st RF | 1 | Valine | Val | 111000 | GUC, GUA, GUU, GUG |
| 1st RF | 2 | Asparagine | Asn | 010100 | AAC, AAU |
| 1st RF | 2 | Serine | Ser | 011100 | AGC, AGU, UCC, UCA, UCU, UCG |
| 1st RF | 2 | Aspartic acid | Asp | 110100 | GAC, GAU |
| 1st RF | 2 | Glycine | Gly | 111100 | GGC, GGA, GGU, GGG |
| 2nd RF | 3 | Proline | Pro | 000000 | CCC, CCA, CCU, CCG |
| 2nd RF | 3 | Leucine | Leu | 001000 | CUC, CUA, CUU, CUG, UUA, UUG |
| 2nd RF | 4 | Methionine | Met | 011011 | AUG |
| 3rd RF | 5 | Histidine | His | 000100 | CAC, CAU |
| 3rd RF | 5 | Arginine | Arg | 001100 | CGC, CGA, CGU, CGG, AGA, AGG |
| 3rd RF | 5 | Tyrosine | Tyr | 100100 | UAC, UAU |
| 3rd RF | 5 | Cysteine | Cys | 101100 | UGC, UGU |
| 3rd RF | 6 | Glutamine | Gln | 000101 | CAA, CAG |
| 3rd RF | 6 | Stop | Stop | 100101 | UAA, UAG, UGA |
| 3rd RF | 6 | Tryptophan | Trp | 101111 | UGG |
| RNA-less | 7 | Lysine | Lys | 010101 | AAA, AAG |
| RNA-less | 7 | Glutamic acid | Glu | 110101 | GAA, GAG |
| RNA-less | 8 | Phenylalanine | Phe | 101000 | UUC, UUU |

Biologically, it is well known that most mutations in the third base do not change the corresponding amino acid (the wobble hypothesis) and therefore we started with a function that maps the third nucleotide of a codon to zero. Hence, we consider the linear transformation, or endomorphism F (A9.2), of the vector space (Z2)^6, F : (x, y, z, t, u, v) → (x, y, z, t, 0, 0), which carries over every triplet XYZ to the triplet XYC, that is, it changes the third nucleotide by the nucleotide C. The endomorphism F belongs to a class of endomorphisms, which are called projections and are characterized by the property of idempotency, that is, F ◦ F = F. The kernel of F is the two-dimensional subspace of vectors of the form (0, 0, 0, 0, u, v), that is, the triplets of the form CCZ. Consequently, the kernel of F is a face of the hypercube. The image of F, denoted by Im(F), is the set of triplets of the form XYC, that is, those that end with the nucleotide C. It is a four-dimensional subspace, in fact, a vectorial four-dimensional hypercube or a hyperface of the six-dimensional hypercube. It is the solution subspace of the homogeneous linear system u = 0, v = 0, with unknowns x, y, z, t, u, v which are the components of a generic vector in (Z2)^6. This set is a four-dimensional coordinated vector subspace, in fact, a four-dimensional hyperface of the whole six-dimensional hypercube generated by the canonical vectors e1, e2, e3, e4. We denote this hypercube as NNC (Fig. 6).

Fig. 6 The hypercube NNC, image of the function F, which projects the whole six-dimensional hypercube onto its subspace of triplets of the form NNC. Note that the blue three-dimensional cube is the image of F1.

3.2. The encoding function for the primaeval RNY code

In the four-dimensional hypercube of codons of the form RNY, representing the FRF of the primaeval RNA code, we define a function F1, which projects the whole hypercube onto the cube RNC. In other words, the function F1 assigns to each of the 16 triplets of the set RNY the same triplet if it ends with C, or else, while preserving the first and second nucleotides, the third one is changed to C if it is U. The function F1 is the restriction, to the hypercube RNY, of the linear transformation F : XYZ → XYC. This means that for every edge associated to the same amino acid, its vertex that ends with the nucleotide C is selected. The triplet that is selected as the canonical representative of each amino acid is the one which belongs to the cube RNC. This representative triplet in the cube RNC, if compared with the other representative, is the minor, according to the linear order of the set of triplets as derived from the selected ordering in the set {C, A, U, G}. Note that the cube RNC, image of the function F1 in the set RNY, is the intersection of the four-dimensional hypercubes RNY and NNC. Here, we are taking the cube RNC as a canonical representation of the set of eight amino acids specified by the FRF. The explicit definition of the function F1 is:

F1:
ACC, ACU → ACC (for Threonine)
AUC, AUU → AUC (for Isoleucine)
GCC, GCU → GCC (for Alanine)
GUC, GUU → GUC (for Valine)
AAC, AAU → AAC (for Asparagine)
AGC, AGU → AGC (for Serine)
GAC, GAU → GAC (for Aspartic acid)
GGC, GGU → GGC (for Glycine)

3.3. The encoding function for the triplets of the second reading frame

In the hypercube of codons of the form NYR, associated to the SRF of the Extended RNA code, we define a function F2, which assigns to every set of triplets, encoding for the same amino acid, the minor, according to the linear order in the whole set of triplets. The explicit definition of the function F2 is:

F2:
CUA, CUG, UUA, UUG → CUA (for Leucine)
UCA, UCG → UCA (for Serine)
CCA, CCG → CCA (for Proline)
GUA, GUG → GUA (for Valine)
AUG → AUG (for Methionine)
AUA → AUA (for Isoleucine)
GCA, GCG → GCA (for Alanine)
ACA, ACG → ACA (for Threonine)

We note that all of the images of the function F2, with the only exception of AUG, belong to the cube NYA, which is a three-dimensional hyperface of the hypercube NYR. But the function F2 is not exactly a linear projection of NYR onto NYA, as is the case of F1 in the hypercube RNY. The function F2 is related to the linear transformation F, described above, in the following way. In most cases, F2 coincides with the restriction to NYR of the affine transformation t6 ◦ F, the composition of F with the translation t6. Actually, it is so for 13 out of the 16 triplets, with only three exceptions: UUG, UUA and AUG. For UUG and UUA, F2 behaves as the composition t16 ◦ F, whereas for AUG it acts as the composition t56 ◦ F, where t16 and t56 denote the composed translations t1 ◦ t6 and t5 ◦ t6, respectively.

3.4. The encoding function for the triplets of the third reading frame

In the hypercube of codons of the form YRN, associated to the TRF of the Extended RNA code, we define a function F3, which assigns to every set of triplets, encoding for the same amino acid, the minor, according to the linear order in the whole set of triplets. The explicit definition of the function F3 is:

F3:
UGC, UGU → UGC (for Cysteine)
CGC, CGA, CGU, CGG → CGC (for Arginine)
UAC, UAU → UAC (for Tyrosine)
CAC, CAU → CAC (for Histidine)
UGG → UGG (for Tryptophan)
UAA, UAG, UGA → UAA (for Stop codons)
CAA, CAG → CAA (for Glutamine)

Note that four out of the seven images for the function F3 belong to the cube YRC. In fact, for 10 triplets, the function F3 coincides with the restriction to YRN of the linear transformation F : XYZ → XYC. For UAG, UAA, CAG and CAA, F3 coincides with the restriction to YRN of the affine transformation t6 ◦ F. For UGA, F3 coincides with the restriction to YRN of the affine transformation t36 ◦ F. For UGG, F3 coincides with the restriction to YRN of the affine transformation t56 ◦ F.

3.5. The encoding function for the triplets of the RNA-less code

In the vector subspace of the codons of the form RRR or YYY, associated to the RNA-less code, we can define a function F4, which assigns to every set of triplets, encoding for the same amino acid, the minor, according to the linear order in the whole set of triplets. The explicit definition of the function F4 is:

F4:
CCC, CCU → CCC (for Proline)
CUC, CUU → CUC (for Leucine)
UCC, UCU → UCC (for Serine)
UUC, UUU → UUC (for Phenylalanine)
AAA, AAG → AAA (for Lysine)
AGA, AGG → AGA (for Arginine)
GAA, GAG → GAA (for Glutamic acid)
GGA, GGG → GGA (for Glycine)

For the triplets CCU, CCC, CUU, CUC, UCU, UCC, UUU, UUC, the function F4 acts as the restriction to the set YYY of the linear transformation F. However, for the triplets AAG, AAA, AGG, AGA, GAG, GAA, GGG, GGA, F4 acts as the restriction to RRR of the affine transformation t6 ◦ F.

A graphical representation of the encoding functions F1, F2, F3, and F4 is shown in Fig. 7. The abscissa represents each of the 64 codons and they are mapped according to the encoding functions of each subset, which results in their respective representative codons. In order to build this plot, we have considered the lexicographic order of the triplets used above and we have assigned to this order the corresponding integer values from 0 to 63. The linearity of the encoding functions is apparent even considering few departures. However, note that several codons have only

one representative codon but some of them appear in two (e.g. Pro appears when F2 and F4 are applied) or even three (e.g. Ser appears when F1, F2 and F4 are applied) of the subsets, hence these amino acids have two or three representative codons. Then a given amino acid will appear at different ordinate values.

Fig. 7 Local encoding functions F1, F2, F3 and F4. A graphical representation of the local encoding functions F1, F2, F3 and F4. (Plot titled "Encoding Function F1, F2, F3 and F4"; abscissa: codon (lexicographic order); ordinate: representative codon; one series per function.)

3.6. The general encoding function for amino acids and stop codons in the hypercube of 64 triplets

As it is well known, 61 out of the 64 triplets code for the 20 primary amino acids that are the building blocks of proteins. We have defined in the set of 64 triplets a structure of a vector space over the binary field Z2 = {0, 1}, also called six-dimensional hypercube, as well as a combinatorial geometry, associated to the vectorial structure. It is done by means of the assignment C ↔ 00, A ↔ 01, U ↔ 10, G ↔ 11, which leads to the representation of each triplet as a sextuple of zeros and ones, that is, as a vector of the six-dimensional vector space (Z2)^6. This correspondence involves the addition operation and the vectorial algebraic structure in the set of triplets. Herein, we derive an algebraic function, which assigns to every triplet its associated amino acid or its termination mark. In Table 3, the triplets
which code for each amino acid and the stop signal are listed according to their lexicographic order. The first triplet in each row, that is, the one which is in the left-most position (marked in bold characters), is taken as the canonical representative of the corresponding amino acid. We define the encoding function as that which assigns to every triplet the left-most triplet in every row. Thus, it turns out that the endomorphism F coincides, in most of the cases, with the desired encoding function denoted by E. In fact, it coincides for 45 (grey characters) out of the 64 triplets. The 45 triplets encode for 15 amino acids; their canonical codons are of the type NNC. The other 19 triplets (black characters) are those for which the encoding function E takes the form of an affine transformation, that is, the composition of F with a suitable translation. Note that for 8 out of the 20 amino acids there are special subsets of the sets of their associated triplets for which F alone is not the encoding function E, as shown in Table 4. For the stop signal, there are also two special sets. In Table 4, we show the encoding functions for every special set.

Table 4 Triplets for which the function F requires an affine transformation

Amino acid   Canonical triplet   Special set           Encoding function
Arg          CGC                 AGA, AGG              t2 ◦ F
Gln          CAA                 CAA, CAG              t6 ◦ F
Glu          GAA                 GAA, GAG              t6 ◦ F
Leu          CUC                 UUA, UUG              t1 ◦ F
Lys          AAA                 AAA, AAG              t6 ◦ F
Met          AUG                 AUG                   t56 ◦ F
Ser          AGC                 UCC, UCU, UCA, UCG    t1234 ◦ F
Trp          UGG                 UGG                   t56 ◦ F
Stop         UAA                 UAA, UAG              t6 ◦ F
Stop         UAA                 UGA                   t36 ◦ F

In order to summarize the function E, a graphical representation built in the same way as Fig. 7 is shown in Fig. 8. In the abscissa, values from 0 to 63 were assigned to each codon according to the selected lexicographic order, and in the ordinate the value corresponding to the canonical codon for each amino acid or the termination mark (image of the function E) is given. In this encoding function a single canonical codon corresponds to each amino acid in contrast to the above-mentioned four encoding functions. The 45 codons that specify 15 amino acids which are directly mapped by the linear function F are represented by circles. The remaining 19 codons, represented by crosses, require a particular translation. Eight out of the 19 codons specify the remaining five amino acids and their canonical codons are of the type NNA or NNG. Three out of the 19 codons specify the stop signal and the remaining eight codons correspond to the three amino acids encoded by six codons whose canonical representatives are of the type NNC. The overall shape is still linear and we remark that this function represents the actual SGC. There are five amino acids and the stop signal whose canonical codons end in A or G and therefore they require specific translations. There are three amino acids encoded by six codons, some of which are mapped directly by F but others require a translation. The latter is mainly due to a change in the first nucleotide of their codons.
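The general encoding function E can be illustrated with a short Python sketch. It assumes the bit assignment C = 00, A = 01, U = 10, G = 11 and the special sets of Table 4; the helper names (`to_vector`, `to_codon`, `translate`, `SPECIAL_SETS`) are illustrative choices, not notation from the paper.

```python
# Bit pairs as assigned in the text: C=00, A=01, U=10, G=11.
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
BASE = {bits: base for base, bits in BITS.items()}


def to_vector(codon):
    """Represent a codon as a 6-tuple of zeros and ones."""
    return tuple(bit for base in codon for bit in BITS[base])


def to_codon(vector):
    """Inverse of to_vector."""
    return "".join(BASE[(vector[i], vector[i + 1])] for i in (0, 2, 4))


def F(vector):
    """Linear map XYZ to XYC: set the last two coordinates to zero."""
    return vector[:4] + (0, 0)


def translate(vector, indices):
    """Translation t_ij...: add 1 (mod 2) at the listed 1-based coordinates."""
    return tuple((bit + 1) % 2 if position in indices else bit
                 for position, bit in enumerate(vector, start=1))


# Special sets and translation indices transcribed from Table 4.
SPECIAL_SETS = [
    (("AGA", "AGG"), (2,)),                        # Arg, canonical CGC
    (("CAA", "CAG"), (6,)),                        # Gln, canonical CAA
    (("GAA", "GAG"), (6,)),                        # Glu, canonical GAA
    (("UUA", "UUG"), (1,)),                        # Leu, canonical CUC
    (("AAA", "AAG"), (6,)),                        # Lys, canonical AAA
    (("AUG",), (5, 6)),                            # Met, canonical AUG
    (("UCC", "UCU", "UCA", "UCG"), (1, 2, 3, 4)),  # Ser, canonical AGC
    (("UGG",), (5, 6)),                            # Trp, canonical UGG
    (("UAA", "UAG"), (6,)),                        # Stop, canonical UAA
    (("UGA",), (3, 6)),                            # Stop, canonical UAA
]
TRANSLATION = {codon: idx for codons, idx in SPECIAL_SETS for codon in codons}


def E(codon):
    """Canonical codon: apply F, then the Table 4 translation if one is listed."""
    return to_codon(translate(F(to_vector(codon)), TRANSLATION.get(codon, ())))
```

For example, `E("GCA")` returns `"GCC"` through F alone, while `E("UGA")` returns `"UAA"` through t36 ◦ F.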

Fig. 8 Plot of the general encoding function E in its two main forms. The graphical representation of the general encoding function E of the SGC. (Plot titled "General Encoding function E"; abscissa: codon (lexicographic order); ordinate: canonical codon; series: mapped by F, and mapped by F and a translation.)

4. The graph representation diagram of amino acids and the stop signal

Note from Table 3 that the vertexes of the four-dimensional hypercube NNC (Fig. 6) correspond to triplets that code for 15 out of the 20 amino acids. Amongst the 16 triplets which are labels of the vertexes, there is only one, the triplet UCC, which is not the canonical representative of any amino acid. It codes for Ser, but this amino acid is already represented by the triplet AGC, which is the first in its list. For this reason, we deleted the vertex with label UCC and its four adjacent edges in the last graph diagram. The other five amino acids, which are not represented in the hypercube NNC, are Gln, Glu, Lys, Met, and Trp, whose canonical triplets are, respectively, CAA, GAA, AAA, AUG, and UGG (Table 3). The first three, CAA, GAA, and AAA, belong to the hypercube NNA, which is the image of NNC under the translation t6, that is, the addition of the triplet CCA. These three triplets are the unitary images of the faces CAN, GAN and AAN, under the affine transformation t6 ◦ F. The other two triplets, AUG and UGG, belong to the hypercube NNG, which is the image of NNC under the translation t56, that is, the addition of the triplet CCG. As before, these two triplets are the unitary images of the faces AUN and UGN, under the affine transformation t56 ◦ F. In Fig. 9, we show a graph representation diagram of
the 20 amino acids, with an additional vertex, which represents the stop signal. It is a phenotypic graphical image of the hypercube NNC, without the vertex UCC, nor its adjacent edges, with the addition of six external vertexes, five of which correspond to amino acids, whose canonical representative triplets do not belong to the hypercube NNC, and another vertex, that represents the class of the three stop codons. We observe that the vertex which represents the stop signal is adjacent to the amino acids Glu, Gln and Tyr, and that the distance between the amino acids Met and Trp is equal to 3.

Fig. 9 The phenotypic graph of the 20 amino acids and the stop signal. A graph representation diagram of the 20 amino acids and the stop signal.

5. Discussion

In this work, we propose an Extended RNA code as derived from the RNA code as originally proposed by Eigen (1977) and later used by several authors (e.g., Konecny et al., 1995). The canonical RNA code consists of only RNY codons that comprises 16 out of the 64 possible triplets and which codify for eight amino acids. By allowing readings starting at the SRF and TRF positions of the RNY code, codons of the type NYR and the YRN appear. Each of these types corresponds to one set of 16 elements. These three sets are disjoint when they are pairwise compared. Altogether these three sets comprise 45 triplets which code for 17 out of the
20 amino acids, and they also include the three stop codons. This is what we call the Extended RNA code. The remaining 16 triplets, which we call RNA-less code or complementary code of the Extended RNA code, can not be derived by frame-shift readings but rather by other types of mutations such as insertions, deletions and substitutions. These 16 triplets constitute two disjoint sets, YYY and RRR, each of them being a three-dimensional cube.

The RNY code can be graphically represented as a four-dimensional hypercube that results from the union of the disjoint sets RYY and RRY each of them being a three-dimensional cube. The NYR and YRN sets are also represented by a four-dimensional hypercube that result from the union of the disjoint sets YYR and RYR, and the union of the disjoint sets YRY and YRR, respectively, each of them being a three-dimensional cube. Notably, each four-dimensional hypercube is isomorphic and isometric (A.6.2) to each other as affine subspace of the whole six-dimensional hypercube, and they are pairwise disjoint. The union of the cubes YYY and RRR produces a four-dimensional vector subspace which is not a hyperface of the six-dimensional hypercube. However, this subspace is isomorphic as affine subspace to the three hypercubes of the Extended RNA code but it is not isometric to any of them. Interestingly, the union of the three hypercubes of the Extended RNA code and the vector subspace of the RNA-less code, makes up the whole six-dimensional hypercube of 64 triplets. Thus, we can decompose the six-dimensional hypercube as consisting of the patterns RNY (primaeval RNA code), plus NYR, and YRN (Extended RNA code), and YYY and RRR (RNA-less code) (Fig. 5). These results suggest a plausible evolutionary path from which the primaeval RNA code could have originated, via the Extended RNA code and the addition of the RNA-less code, the six-dimensional hypercube. Our present results do not offer any clue about a chronological order in which the different encoding subsets could have led to the current SGC. Alternatively, the steps RNY plus RNR and YNY could also form another extended RNA code, since in fact, it has been found that the order of triplet frequencies RNY > RNR > YNY > YNR is a general attribute of coding sequences (Eigen et al., 1985), and they proposed that this order may reflect the evolution of the genetic code from an RNY structure, providing a comma-free readout via wobble intermediates to the present form.

Natural selection does not generate novelties from scratch. It works on what already exists, either transforming a system to give it new functions or combining several systems to produce a more elaborate one. It innovates with what it has at hand and this process has been recognized as the evolutionary tinker (Jacob, 1977). As a consequence, frame-reading mistranslations conferred obviously evolutionary advantages. Thus, we can hypothesize that the point in which genetically encoded protein translation started to evolve corresponds most likely to a breakthrough organism obeying an Extended RNA code after the RNA World and prior to the cenancestor.

Given the RNA code, by allowing reading slippages in the other two reading frames, 32 new triplets (which codify for nine new amino acids) and three stop triplets (which specify a stop signal), emerge. In the RNY code every amino acid is coded by two neighbor triplets located in an edge whereas in the NYR and YRN codes there are departures from this regularity: some amino acids are now encoded by four triplets and others are encoded by only one. Conversely, some amino
acids of the RNY are also coded by triplets that appear in the SRF, and there are new triplets which altogether specify nine new amino acids. Konecny et al. (1995) first noted that by allowing reading slippages there were two hidden messages in the RNY code which are AUG and CAU which are found in the SRF and TRF, respectively. It is also worth mentioning that in present-day DNA viruses such as vaccinia virus, genes are transcribed in a frame-shift fashion (Keck et al., 1990). In the vaccinia virus, a single mRNA can be translated in three different reading frames which encode three trans-activators that are required for late transcription (Keck et al., 1990). In other words, a single messenger RNA (mRNA) is able to encode three different proteins because messages contain three distinct putative translation initiation sites.

To our knowledge, this is the first time in which the SGC is expressed as a mathematical function which maps each triplet onto its corresponding amino acid or stop signal. Given the uneven degeneracy of the genetic code it is appealing that the general encoding function is almost linear. The phenotypic graph of amino acids is also a novel finding whose image resembles, in part, any of the three four-dimensional hypercubes of the Extended RNA code. In the context of the frozen concept, we can say that considering the symmetries of both the Extended RNA code and the RNA-less code, the primaeval RNY code was already frozen and that it evolved like a replicating and growing icicle. In the search of producing synthetic life in the laboratory (Hutchinson III et al., 1999; Szathmáry, 2005) our encoding functions may be used as a guide to understand the difference between a tinkered-together genome and an engineered one.

Appendix A. Mathematical and biological background

We assume that the reader is familiar with the concept of a vector space over a scalar field K, also called a K-vector space, and with that of vector subspace of a vector space, as well as with the concept of a linear transformation or linear endomorphism of a vector space.

A.1. Concept of affine space and its dimension

Definition A.1.1. For a vector space V we call affine subspace of V, or linear variety contained in V, every subset of the form v + W, where W is a vector subspace and v is a fixed vector. The subspace W is called the associated vector subspace of the linear variety v + W.

Definition A.1.2. The dimension of a linear variety is defined as the dimension of its associated vector subspace.

Remarks. The associated vector subspace W, for a linear variety, is unique, but the vector v may be any of the elements of the set v + W. We recall that every vector space is an abelian group for the addition operation. The set v + W is also called a coset or adjoint class of the subgroup W, and it contains the element v. When two vectors u and v define the same linear variety, for a fixed subspace W, it means that their
difference vector u − v belongs to W. The linear varieties or affine subspaces, for a fixed linear subspace W, are the equivalent classes of the equivalence relation RW, defined as the set of all the ordered pairs (u, v) ∈ V × V, such that u − v ∈ W. If we take v as the null vector 0, or as some vector which belongs to W, the identity v + W = W is obtained. Hence, all the vector subspaces are also affine subspaces, in fact, the affine subspaces that contain the null vector.

A.2. The associated geometry to a vector space

Definition A.2.1. We call the associated geometry of a K-vector space V, the set or family of all the affine subspaces or linear varieties contained in V.

A.2.1. Concepts of point, line, and plane in a geometry

Definition A.2.2. The points, lines and planes in a geometry are, respectively, the affine subspaces of dimension 0, 1, and 2.

Thus, all the unitary sets are affine subspaces of dimension zero, also called points of the associated geometry. The affine subspaces of dimension 1 are called the lines of the geometry and those of dimension 2 are called the planes of the geometry. The other affine subspaces, the k-dimensional for 2 < k < n, are generically called hyperplanes of the geometry. The only affine subspace of dimension n is the whole space V, if n is its dimension.

A.2.2. The concept of parallelism

Definition A.2.3. We say that 2 affine subspaces v + W and u + U are parallel if W is a subspace of U, or U is a subspace of W. In particular if W = U, that is, if the affine subspaces have the same associated subspaces, having the same dimension, they are parallel.

According to this definition we can talk of parallel lines, parallel planes, a line parallel to a plane, parallel cubes, etc.

A.3. The concept of n-dimensional hypercube

Let us consider a vector space of the form V = K^n, where K is a field. In this case, the vectors are represented as n-tuples (x1, x2, ..., xn), where the xi are elements of K. The elements xi, for i ∈ {1, 2, ..., n}, are called the coordinates of the vector. The canonical base of K^n is the ordered set of vectors e1 = (1, 0, ..., 0), e2 = (0, 1, ..., 0), ..., en = (0, 0, ..., 1). It is easy to show that these vectors are linearly independent and that they generate the space, that is, they form a base of the vector space. As it is well known, every n-dimensional K-vector space is isomorphic to the space K^n, K being a field. The isomorphism may be defined by the matching of any of the bases of V with the canonical base of K^n. Then we say that the space has been coordinatized, that is, provided of a coordinate system. Consequently, from standard courses of linear algebra, it is known that every affine subspace v + W, in an n-dimensional vector space V, is the solution set of a linear system
of n equations with n unknowns, being the linear subspace W the solution set of the associated homogeneous system, i.e., the system with the same left parts and the right members equal to zero. The dimension of W is equal to n − r, where r denotes the rank of the associated matrix.

Definition A.3.1. In the particular case where K is the binary field Z2 = {0, 1}, of two elements, the vector space K^n is called n-dimensional hypercube.

The name hypercube comes from analogy with the three-dimensional case, where the eight triplets of zeros and ones represent the points which are the vertexes of a cube or regular hexahedron, one of whose vertexes is the null vector 0 = (0, 0, 0), and the vertexes which are adjacent to it coincide with the extremes of the vectors e1, e2 and e3 of the canonical base, and all the edges have length 1. The n-dimensional hypercube (Z2)^n is a regular polytope (Coxeter, 1973). It is also called an orthotope because the angle of two adjacent edges is a right angle. In Fig. 5 of the main text, the six-dimensional hypercube of the 64 codons is illustrated.

A.4. The concept of coordinated affine subspaces

Definition A.4.1. In a vector space K^n we will call coordinated affine subspace every affine subspace whose associated subspace is generated by one or some vectors of the canonical base.

In fact, a coordinated affine subspace is the solution set of a system whose associated matrix is diagonal. The vectorial coordinated lines are usually called coordinated axes and the vectorial coordinated planes simply coordinated planes.

Definition A.4.2. A coordinated affine subspace of the hypercube is called a hyperface of the hypercube (Z2)^6.

The zero-dimensional hyperfaces are the vertexes of the hypercube whereas the one-dimensional hyperfaces are the edges of it. The two-dimensional hyperfaces are simply the faces of the hypercube.

A.5. The concept of norm or length of a vector in the spaces (Zp)^n

Let us consider a vector space of the form (Zp)^n, where (Zp) denotes the Galois Field of p elements, being p a prime number, which are remainders of the entire division by p. The assignment to every integer of its defect remainder in the entire division by p is the so-called reduction module p.

Definition A.5.1. We will call norm or length of a vector v = (x1, x2, ..., xn) the ordinary sum |v| = x1^2 + x2^2 + ... + xn^2, taken in the non-negative integers, which is always a nonnegative integer.

In the case of the binary field (Z2), the norm is simply the number of ones that appear in the n-tuple. In this case, it is also called the weight of the vector.
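The notions of hyperface and weight can be made concrete with a minimal Python sketch. It assumes the bit assignment of the main text (C = 00, A = 01, U = 10, G = 11); the names `hyperface`, `weight` and `to_codon` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

# Bit pairs assumed from the main text: C=00, A=01, U=10, G=11.
BASE = {(0, 0): "C", (0, 1): "A", (1, 0): "U", (1, 1): "G"}


def to_codon(vector):
    """Read a 6-tuple of bits as a triplet of nucleotides."""
    return "".join(BASE[(vector[i], vector[i + 1])] for i in (0, 2, 4))


def weight(vector):
    """Norm of a binary vector: the number of ones it contains."""
    return sum(vector)


def hyperface(fixed, n=6):
    """Vectors of (Z2)^n whose 1-based coordinates listed in `fixed` take the given values.

    Fixing k coordinates leaves a coordinated affine subspace of dimension n - k.
    """
    return [v for v in product((0, 1), repeat=n)
            if all(v[i - 1] == bit for i, bit in fixed.items())]


# The four-dimensional hyperface NNC: the last two coordinates are fixed to 0.
NNC = [to_codon(v) for v in hyperface({5: 0, 6: 0})]
```

Fixing five coordinates instead of two yields a one-dimensional hyperface, that is, an edge with two vertexes.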

In the more general case of a vector space (Zp)^n, we can define the concept of scalar or inner product in the following way:

Definition A.5.2. We call scalar or inner product of the vectors v = (x1, x2, ..., xn) and w = (y1, y2, ..., yn) the integer ⟨v, w⟩ = x1 y1 + x2 y2 + ... + xn yn.

Note that the norm or length of a vector coincides with the scalar product of the vector with itself. It can be proved that the inner product and the length or norm are related in the following way. In the hypercube (Z2)^n, for every pair (u, v) ∈ V × V, the following equality holds: |u + v| = |u| + |v| − 2⟨u, v⟩.

A.6.1. The concept of distance in the spaces (Zp)^n

Definition A.6.1. We define the distance between the vectors v and w as the norm of the difference vector v − w, being the subtraction the inverse operation of the addition in the vector space (Zp)^n.

In the case of the hypercube (Z2)^n, this distance is the so-called Hamming distance, widely used in coding theory and in cryptology. Actually, the Hamming distance is the number of places in which both vectors differ, and geometrically, visualizing the space as a graph, it means the minimal number of edges between the two vertexes represented by the n-tuples. In vector spaces of the type (Z2)^n, the so-called hypercubes, the affine coordinated subspaces are called hyperfaces of the hypercube. In general, the affine coordinated lines are usually called edges, and the affine coordinated planes are called faces of the hypercube.

A.6.2. Transformations that preserve the distance

Definition A.6.2. A transformation f of the vector space V = (Z2)^n in itself is called isometric if it preserves the distance between every two elements, i.e., if d(f(u), f(v)) = d(u, v) for every pair (u, v) of arbitrary vectors u and v. An isometric linear transformation will be called a linear isometry of the vector space.

It is easy to show that a linear isometry preserves the length or norm of every element of the space.

A.7. The concept of permutation transform or multiple rotation in the vector space V = (Zp)^n

As the only vectors of length 1 in the vector space are the canonical vectors e1, e2, ..., en, every linear isometry performs a permutation on the set {e1, e2, ..., en}. For this reason, the linear isometries will also be called permutation transforms. The matrix of a linear isometry with respect to the canonical base is a matrix obtained from the identity matrix by a permutation of its columns. This kind of matrices, called permutation matrices, have the property that the inverse of any of them is equal to its transposed.
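As an illustration of the Hamming distance and of a distance-preserving map, the following Python sketch may help. It assumes the bit assignment of the main text (C = 00, A = 01, U = 10, G = 11); `hamming`, `permute` and `is_isometry` are illustrative names, and the exhaustive check over all 64 × 64 pairs is simply the most direct way to test the property on such a small space.

```python
from itertools import product

# Bit pairs assumed from the main text: C=00, A=01, U=10, G=11.
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}


def to_vector(codon):
    """Represent a codon as a 6-tuple of zeros and ones."""
    return tuple(bit for base in codon for bit in BITS[base])


def hamming(u, v):
    """Number of coordinates in which the two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def permute(vector, order):
    """Coordinate permutation: position k of the result takes coordinate order[k] (0-based)."""
    return tuple(vector[i] for i in order)


def is_isometry(f, n=6):
    """Check d(f(u), f(v)) == d(u, v) for every pair of vectors of (Z2)^n."""
    space = list(product((0, 1), repeat=n))
    return all(hamming(f(u), f(v)) == hamming(u, v) for u in space for v in space)
```

With these helpers, `hamming(to_vector("AUG"), to_vector("UGG"))` evaluates to 3, and any coordinate permutation passes `is_isometry`, whereas the projection XYZ → XYC does not.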

As every permutation is a composition of pairwise disjoint cycles, every linear isometry can be interpreted as a composition of local rotations, that is, a multiple rotation in the vector space.

Definition A.7.1. The linear isometries of the vector space (Zp)^n will also be called permutation transforms or multiple rotations.

As the set (Zp)^n can be viewed as a subset of the R-vector space R^n, being R the field of real numbers, a multiple rotation in (Zp)^n can also be interpreted as a multiple rotation in R^n.

A.8. The concept of orthogonality in a vector space (Zp)^n

Definition A.8.1. We say that two vectors v and w of the space (Zp)^n are orthogonal or perpendicular, one to each other, if their scalar product is equal to zero.

According to this definition the vectors of the canonical base, in any vector space of the form (Zp)^n, are orthogonal, and they are, additionally, of length 1. This kind of base is usually called orthonormal base. It can be proved that a linear isometry preserves not only the distance between vectors, but also the scalar product of any two vectors of the vector space. Hence, a linear isometry preserves the orthogonality of any pair of orthogonal vectors of the vector space.

A.9. The concept of translation and affine transformations in a vector space

Definition A.9.1. Given a vector v of a vector space V, we call a translation associated to the vector v the bijective function tv : u → u + v, from V to V.

For translations tv and tw, associated to the vectors v and w, the composition tv ◦ tw is the translation tv+w, associated to the sum v + w of both vectors. It means that the set T of all the translations of the vector space V is a group. T is, in fact, an abelian group which is a subgroup of the symmetric group S(V) of all the bijections of the set V with itself, isomorphic to the additive group of V, being the isomorphism, from V onto T, given by the function: v → tv.

Special notation. In the hypercube (Z2)^n, we will denote as ti the translation associated to the vector ei of the canonical base, as tij the translation associated to the vector ei + ej, as tijk the translation associated to the vector ei + ej + ek, and so on, where i, j, k ∈ {1, 2, ..., n}.

Definition A.9.2. We call affine transformation every composed function of the form tv ◦ F, where F is a linear transformation or linear endomorphism of the vector space and tv is the associated translation of a fixed vector v.

It is clear that an affine transformation tv ◦ F is bijective, i.e., one to one, if, and only if, its linear component F is also bijective. It is easy to prove that every affine transformation carries a linear variety v + W onto another linear variety, whose dimension is less than or equal to that of v + W. If the affine transformation is bijective, and only in this case, the dimension is preserved.
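The translation notation ti, tij, tijk can be sketched in Python as follows. This is only an illustrative model of translations in (Z2)^6 under the stated definitions; `unit_sum`, `add` and `translation` are hypothetical names, and indices are 1-based to match the notation of the text.

```python
from itertools import product


def unit_sum(indices, n=6):
    """Vector e_i + e_j + ... of (Z2)^n for the 1-based indices given."""
    return tuple(1 if k in indices else 0 for k in range(1, n + 1))


def add(u, v):
    """Addition in (Z2)^n: coordinate by coordinate, modulo 2."""
    return tuple((a + b) % 2 for a, b in zip(u, v))


def translation(indices, n=6):
    """Return the translation t_ij... as a function u -> u + (e_i + e_j + ...)."""
    v = unit_sum(indices, n)
    return lambda u: add(u, v)


# t16 is the composed translation t1 after t6, as used for F2 in the main text.
t1, t6, t16 = translation({1}), translation({6}), translation({1, 6})
```

Over Z2 every translation is its own inverse, so applying `t16` twice returns the original vector.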

If v + W is a k-dimensional hyperface and u is any vector of the hypercube (Z2)^n, then the set u + v + W is also a k-dimensional hyperface. It can be proved that the union of the k-dimensional hyperfaces v + W and ei + v + W, when ei does not belong to W, is a (k + 1)-dimensional hyperface.

Definition A.9.3. The union of the hyperfaces v + W and u + v + W, when u is orthogonal to W and has length h greater than 1, is called a right hyperprism of height h.

A right hyperprism is obtained from a hyperface by the adjunction of a translation of it. If the translation is to a distance equal to 1, a (k + 1)-dimensional hyperface is obtained. A right hyperprism of height h, with h greater than 1, that is, obtained by a translation to a distance h greater than 1, is an affine subspace, but it is not a hyperface of the hypercube.

A.10. Some biological concepts and their mathematical representation

In our work, we have assigned to every nucleotide a numerical pair in the following way: C = 00, A = 01, U = 10, and G = 11. It follows that every triplet XYZ is represented as a sextuple (x, y, z, t, u, v) of zeros and ones, that is, as an element of the hypercube (Z2)^6.

A.10.1. Mutations

Definition A.10.1. The substitution in a triplet of one or more of its three nucleotide bases within the set {C, A, U, G} is called a mutation. We say that the mutation is simple if the substitution involves only one base of the triplet.

According to our representation of triplets as elements of the vector space (Z2)^6, every mutation may be represented by means of a translation tv, associated to some vector v. For instance, if the triplet CUA changes to the triplet CUG, by substitution of the base A by the base G, it means that the vector (0, 0, 1, 0, 0, 1) is changed to the vector (0, 0, 1, 0, 1, 1) and this transformation may be represented by the addition of the vector e5 = (0, 0, 0, 0, 1, 0), that is, by the translation t5. In this particular case the mutation is simple, since it consists in the change of only one base.

A.10.2. Transitions and transversions

Definition A.10.2. The nucleotides A and G are called purines, and the nucleotides C and U are called pyrimidines. This terminology is due to their chemical composition. A codon mutation is called a transition when it changes a base to a base of the same class, that is, purine to purine or pyrimidine to pyrimidine, and it is called a transversion when the change is from one class to another, that is, purine to pyrimidine or pyrimidine to purine.

Then, for instance, if the triplet CUA is changed for the triplet CUG, the nucleotide A is replaced by the nucleotide G, and it is a transition, since A and G are both purines. On the other hand, if CUA is changed to CUU, then the purine A is changed to the pyrimidine U, that is, it is a transversion.
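The representation of a mutation as a translation, and the distinction between transitions and transversions, can be sketched in Python. The sketch assumes the numerical pairs C = 00, A = 01, U = 10, G = 11; the function names `mutation_vector` and `classify_base` are illustrative and not part of the paper.

```python
# Numerical pairs assumed from the text: C=00, A=01, U=10, G=11.
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
PURINES = {"A", "G"}


def to_vector(codon):
    """Represent a codon as a 6-tuple of zeros and ones."""
    return tuple(bit for base in codon for bit in BITS[base])


def mutation_vector(before, after):
    """Vector v such that the translation t_v carries `before` onto `after`."""
    return tuple(a ^ b for a, b in zip(to_vector(before), to_vector(after)))


def classify_base(old, new):
    """Classify a single-base change as 'none', 'transition' or 'transversion'."""
    if old == new:
        return "none"
    same_class = (old in PURINES) == (new in PURINES)
    return "transition" if same_class else "transversion"
```

For the example of the text, `mutation_vector("CUA", "CUG")` is the vector e5 and `classify_base("A", "G")` is a transition, while `classify_base("A", "U")` is a transversion. Under this encoding a base change is a transition exactly when only the first bit of its pair is altered.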

A.11. Algebraic representations of transitions and transversions

Notice that the basic transitions in the hypercube $(Z_2)^6$ are represented by the translations $t_1$, $t_3$, and $t_5$, that is, those associated to the canonical vectors $e_1$, $e_3$ and $e_5$, while the basic transversions are represented by the translations $t_2$, $t_4$, and $t_6$. The basic transitions, represented by the translations $t_1$, $t_3$ and $t_5$, which have odd subindexes, are obtained as additions of the triplets UCC, CUC and CCU, respectively, whereas the basic transversions, represented by the translations $t_2$, $t_4$, and $t_6$ of even subindexes, are obtained by additions of the triplets ACC, CAC and CCA, respectively, according to the addition operation illustrated in Table 2.

The set of mutations which are transitions is closed under composition, and the identity function may be considered as a transition. Hence, the set of all the transitions is a group of order 8, a subgroup of the group $T$ of 64 possible mutations. The eight transitions are represented by the translations $t_0$, $t_1$, $t_3$, $t_5$, $t_{13}$, $t_{15}$, $t_{35}$ and $t_{135}$, where $t_0$ is the identity function. As $T$ is an abelian group, the subgroup of pure transitions is a normal subgroup of $T$. All the other mutations are transversions or compositions of transversions with transitions.

A.12. Complementary bases

For biological reasons, the purine A and the pyrimidine U are considered to be complements of each other. When U is replaced by T (thymine), the base A is paired with T in the double helix structure of DNA. For this reason, they have been represented by the complementary numerical pairs 01 (A) and 10 (U). Analogously, the pyrimidine C and the purine G are considered complementary bases, one to each other, and are represented by the numerical pairs 00 and 11, respectively. They also form a pair in the double helix structure. According to our numerical representation, the Hamming distance between a base and its complementary is always equal to 2, for they differ in two of their components.

In the vector space of the 64 triplets we can define a bijective function, which assigns to every triplet XYZ the triplet X'Y'Z', where the components X', Y', and Z' are the complements of X, Y, and Z, respectively. This triplet is usually called the complementary codon or the anticodon of XYZ. The mentioned function is the translation $t_M$, defined by the vector $M = (1, 1, 1, 1, 1, 1)$, that is, the addition to the vector $(x, y, z, t, u, v)$ associated to the triplet of the vector $M$, which is the sum $e_1 + e_2 + e_3 + e_4 + e_5 + e_6$ of the six vectors of the canonical base. From the algebraic point of view, it means the addition of the triplet GGG to XYZ. In the Boolean structure of the hypercube $(Z_2)^6$ this transformation is the so-called Boolean negation, and it consists in the substitution of every zero by a one and every one by a zero. The translation $t_M$ performs a transversion in each of the three components of the triplet.

The sum of a triplet with its complementary gives, as a result, the triplet GGG, that is, the triplet associated to the vector $M$. Notice that the Hamming distance between a codon and its complementary is always equal to 6, that is, the maximum of all the possible distances in the six-dimensional hypercube.
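The group of transitions and the complement translation can be verified numerically. The following is a minimal sketch assuming the encoding C = 00, U = 10, A = 01, G = 11; helper names such as `add`, `complement` and `TRANSITIONS` are hypothetical and serve only to illustrate the statements above.

```python
from itertools import product

# Assumed base encoding: C = 00, U = 10, A = 01, G = 11.
BASE_TO_BITS = {"C": (0, 0), "U": (1, 0), "A": (0, 1), "G": (1, 1)}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode(codon):
    """Map a triplet to its sextuple over Z2."""
    return tuple(bit for base in codon for bit in BASE_TO_BITS[base])


def decode(vector):
    """Map a sextuple over Z2 back to its triplet."""
    return "".join(BITS_TO_BASE[(vector[i], vector[i + 1])] for i in (0, 2, 4))


def add(u, v):
    """Componentwise addition mod 2 (the translation by v applied to u)."""
    return tuple(a ^ b for a, b in zip(u, v))


def hamming(u, v):
    """Number of components in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))


M = (1, 1, 1, 1, 1, 1)  # e1 + e2 + ... + e6, the vector of the triplet GGG


def complement(codon):
    """Complementary codon, obtained by the translation t_M."""
    return decode(add(encode(codon), M))


# Pure transitions: vectors whose even-subindex components (2, 4, 6) are zero.
TRANSITIONS = [(a, 0, b, 0, c, 0) for a, b, c in product((0, 1), repeat=3)]

codons = ["".join(p) for p in product("CUAG", repeat=3)]
print(complement("AUG"))                                       # UAC
print(all(hamming(encode(c), encode(complement(c))) == 6 for c in codons))
print(len(TRANSITIONS))                                        # 8
print(all(add(s, t) in TRANSITIONS for s in TRANSITIONS for t in TRANSITIONS))
```

The closure check covers all 64 pairs of transitions, and the Hamming check runs over all 64 codons.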

S. C... H. 1973.. ´ Sanchez. D. Mol. A. Crick. Moss..J. A. USA 93. Cell 61. Silva... Naturwissenschaften 64.. Sabater-Munoz. 2002. S. Leguina. MATCH Commun. 117–125. Mushegian. J. Crick. C. Hamming distance and stochastic matrices. A principle of natural self-organization. White.. ˜ Gil. 4454–4458.E. 2004a. Science 286.. Poschel... 2437–2441.
. A. 1995.. R. E. S. R. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Mol. Morgado. Fraser. ˜ ¨ Jimenez-Monta no. 38.R.. Bobadilla. Atkins.. New York. J. Orig. 341–369. T.F. A. 1990.S. Acad.. 1998. S. 618. Life Evol. Evolution and tinkering. Naturwissenschaften 64.. Schuster. A principle of natural self-organization. The origin of the genetic code... 1999. Mexico. Evol. Bull. Proc. Lazcano.. Baldick. Gesteland. ´ ˜ M. Flores. The hypercycle. 1977.. 45.E. Mol.O. 173. Pattern analysis of 5S rRNA. The genetic code is one in a million. the pre-RNA World. J.. 25. Gill. The genetic code Boolean lattice... ´ tiproject: Tecnolog´ ıas para la Universidad de la Informacion ´ UNAM. Orgel. Cell 85. de la Mora-Basanez. M. R. Winkler-Oswatitsch. M. Schuster...G.. Ellington. J. Proc. 115–118. Science 196. J. 1977. References
Becerra. 2165–2169. We thank Giselle Morgado Avin ﬁgures of the manuscript. E.. 2004.C.V. Math.A. S. Lazcano. Statistical Garc´ ıa. Keck. Codes without commas. Cold Spring Harbor Laboratory Press. S. Jacob. Hofacker. Jose.242
Bulletin of Mathematical Biology (2007) 69: 215–243
Acknowledgments M.. Part C: The realisitic hypercycle. Lindemann. 1405–1421. Natl. T. 793–798. Regular Polytopes. 1996. L. Theor. Alvarez.. 52. 1996.H. Biol.. 1995. Part A: Emergence of the hypercycle. Polyphyletic gene losses can bias backtrack characterizations of the cenancestor.. The RNA World. J. Smith. The origin and early evolution of life: Prebiotic chemistry.V. Sci. F. 1161–1166.F. Physica A 342. Gilbert.D. Freeland. Comput. 1996. J. Dover Publication Inc.. Math. M.. Biosystems 39. 1957.H.T. M. B...M.. Eigen.. Coxeter. Sci. 801–809. was supported by Universidad Central ”Marta Abreu” de ˜o ´ for preparing the Las Villas. C. Sci. was ﬁnancially supported by PAPIIT-IN216106.. E. The hypercycle. USA 99.. USA 82. Natl. S. L. Moya. Acad. Lazcano. Koonin. 288–293.. 1997.A.. Proc.. analysis of the distribution of amino acids in Borrelia burgdorferi genome under different genetic codes.. C. Chem.V. Silva. 238–248. B. 47. Acad. Hutchinson III. Cellular evolution during the early Archean: What happened between the progenote and the cenancestor? Microbiologia SEM 11. The search for missing links between self-replicating nucleic acids and the RNA world.. A. Biosph. Natl.. 1978. Complementary coding to the primeval commaless code. 1999.. 1986. 66. S. 1968. Grifﬁth. P. Grau... Eigen. 1–13. J. Biol. and by the Mul´ y la Computacion.. He. Sci.N. The hypercube structure of the genetic code explains conservative and non-conservative amino acid substitutions in vivo and in vitro. J.R. Peterson.. A.. A. W. 263–270. Natl. 2004. R.D. Cuba. Cech. 29–46.J.. 417–421. Miller.J. Role of DNA replication in Vaccinia virus gene expression: A naked template is required for transcription of three late trans-activator genes.. O. E. Venter. Latorre.. M. Hurst.. F. J.R. A.J. Clarke.A.. Evol. 541–565. 515–530. P. R. B. Global transposon mutagenesis and a minimal mycoplasma genome. and time. Islas. J. H. T.I. Cline. ¨ Konecny. Kenneth. 367–379.C..V. F.R. Acad.. USA 43.J.. 
We also thank Rafael Camacho for helpful suggestions.C. Govezensky. Santa Clara. Schoniger. Eigen... Proc. Nature 319. P. Ricci. 3rd edition.. R. A. The RNA World.L. 1985. C. 1995.. M. L. J. S.H.. UNAM. ´ M.. Genetic code. Biol.. Extreme genome reduction in Buchnera spp.: Toward the minimal genome needed for symbiotic life.. 10268–10273. Petoukhov.R.
