Bulletin of Mathematical Biology (2007) 69: 215–243 DOI 10.1007/s11538-006-9119-3
ORIGINAL ARTICLE
An Extended RNA Code and its Relationship to the Standard Genetic Code: An Algebraic and Geometrical Approach
Marco V. José a,*, Eberto R. Morgado b, Tzipe Govezensky a
a Theoretical Biology Group, Instituto de Investigaciones Biomédicas, Universidad Nacional Autónoma de México, México D.F. 04510, México
b Facultad de Matemática, Física y Computación, Universidad Central Marta Abreu de Las Villas, Santa Clara, Cuba
Received: 19 September 2005 / Accepted: 23 February 2006 / Published online: 2 November 2006 © Society for Mathematical Biology 2006
Abstract An algebraic and geometrical approach is used to describe the primaeval RNA code and a proposed Extended RNA code. The former consists of all codons of the type RNY, where R means purines, Y pyrimidines, and N any of them. The latter comprises the 16 codons of the type RNY plus the codons obtained by reading the RNA code in the second (NYR type) and the third (YRN type) reading frames. In each of these reading frames there are 16 triplets, which altogether complete a set of 48 triplets that specify 17 out of the 20 amino acids, including AUG, the start codon, and the three known stop codons. The other 16 codons do not belong to the Extended RNA code and constitute the union of the triplets YYY and RRR, which we define as the RNA-less code. The codons in each of the three subsets of the Extended RNA code are represented by a four-dimensional hypercube, and the set of codons of the RNA-less code is portrayed as a four-dimensional hyperprism. Remarkably, the union of these four symmetrical, pairwise disjoint sets comprises precisely the already known six-dimensional hypercube of the Standard Genetic Code (SGC) of 64 triplets. These results suggest a plausible evolutionary path along which the primaeval RNA code could have given rise to the SGC, via the Extended RNA code plus the RNA-less code. We argue that the life forms that probably obeyed the Extended RNA code were intermediate between the ribo-organisms of the RNA World and the last common ancestor (LCA) of the Prokaryotes, Archaea, and Eucarya, that is, the cenancestor. A general encoding function, $E$, which maps each codon to its corresponding amino acid or to the stop signal, is also derived. In 45 out of the 64 cases, this function takes the form of a linear transformation $F$, which projects the whole six-dimensional hypercube onto a four-dimensional hyperface formed by all triplets that end in cytosine. In the remaining 19 cases the function $E$ adopts the form of an affine
* Corresponding author. E-mail address: marcojose@biomedicas.unam.mx (M. V. José).
transformation, i.e., the composition of $F$ with a particular translation. Graphical representations of the four local encoding functions and of $E$ are illustrated and discussed. For every amino acid and for the stop signal, a single triplet, among those that specify it, is selected as a canonical representative. From this mapping a graphical representation of the 20 amino acids and the stop signal is also derived. We conclude that the general encoding function $E$ represents the SGC itself.
Keywords Primaeval RNA code · Standard genetic code · Evolution of the genetic code · Extended RNA codes · Algebra and geometry
1. Introduction
The current genetic code is considered to be nearly universal. This code is written in an alphabet of four letters (C, A, U, G), grouped into words three letters long, called triplets or codons. Each of the 64 codons specifies one of the 20 amino acids or else serves as a punctuation mark signaling the end of a message. Given 64 codons and 20 amino acids plus a punctuation mark, there are $21^{64} \approx 4 \times 10^{84}$ possible genetic codes. Is there something special about the one code that governs all life on Earth? Francis Crick (1968) argued that the Standard Genetic Code (SGC) need not be special at all; it could be nothing more than a frozen accident. This concept is not far from the idea that there was once an age of miracles. However, when the SGC was compared to a computer-generated random sample of one million alternatives, the natural code emerged as superior to every random permutation with a single exception (Freeland and Hurst, 1998). Recently, numerical experiments with hand-crafted genetic codes analyzed in silico showed that these codes have inferior statistical properties (such as information content, scaling and autocorrelation properties) compared with the SGC (García et al., 2004). It is widely accepted that there was an age in the origin of life in which RNA played the role of both genetic material and main agent of catalytic activity (e.g. Woese, 1967; Crick, 1968; Kenneth and Ellington, 1995). This period is known as the RNA World (Gilbert, 1986; Gesteland et al., 1999). Investigations on the minimal gene set that is necessary and sufficient to sustain the existence of cellular life are consistent with the notion that the last common ancestor (LCA) of the three primary kingdoms (Archaea, Eucarya, and Prokaryotes) had an RNA genome (Mushegian and Koonin, 1996; Hutchinson et al., 1999; Gil et al., 2002).
However, the quasi-species concept of Eigen and Schuster (1977) demonstrated that the accuracy of replication places limits on the size of the genome that can be maintained by selection. The higher the error rate during replication, the smaller the maximum permissible genome size. Thus, replication fidelity was a strong limiting feature in the RNA World. On the other hand, sequence similarities shared by many ancient, large proteins found in all three kingdoms of life suggest that considerable fidelity already existed in the operative genetic system of their LCA, but such fidelity is unlikely, given Eigen's limit, to be found in RNA-based genetic systems (Lazcano, 1995; Lazcano and Miller, 1996). The cenancestor probably had a DNA genome (Becerra et al., 1997).
To our knowledge, there have been no systematic studies that relate the RNA code (Crick, 1968; Eigen and Schuster, 1977; Konecny et al., 1995) to the SGC. The main question that we address in this work is whether our algebraic and geometrical approach can shed some light on how the primaeval RNA code could have evolved to generate the SGC. To this end, we search for symmetries and patterns in both the SGC and the RNA code. We hypothesize that, in order to allow further evolution of the RNA genetic code, the constraint of having an intact message in only one reading frame has to be relaxed. Therefore, we consider not a strict comma-less code as proposed by Crick et al. (1957) but rather an RNA code that can be translated in the first, second and third reading frames, given that translational and transcriptional errors were probably of great importance early in the history of life, when the machinery of protein synthesis was imprecise. For the interested reader, an Appendix on the concepts of algebra and geometry, which is referred to throughout this work, is provided at the end of this article. The article is organized as follows. First, we show that each reading frame can be represented as a four-dimensional hypercube (A3.1), as derived from concepts of combinatorial geometry. Interestingly, the three four-dimensional hypercubes and the right hyperprism (A9.3) associated to the RNA-less code can be inserted as pairwise disjoint four-dimensional affine subspaces in the six-dimensional hypercube (Coxeter, 1973; A1.1; A3.1; Fig. 5). Next, we define an encoding function for each of the three four-dimensional hypercubes associated to the Extended RNA code, and an encoding function for the RNA-less code. We also derive the general encoding function of the SGC, and we show that this function is an integration of the different encoding functions of the above-mentioned four sets.
This encoding function adopts different forms for different subsets of codons, and it represents the SGC itself. Finally, we discuss our findings in terms of the origin and evolution of the SGC.

1.1. Theoretical background
The standard table of codon assignments derives from the obvious representation of the triplet code as a $4 \times 4 \times 4$ cube. Several authors (e.g. Jiménez-Montaño et al., 1996; Sánchez et al., 2005), observing that 64 is equal not only to $4^3$ but also to $2^6$, suggested organizing the codon table as a six-dimensional hypercube or six-dimensional vector space over the binary field. In previous works (Sánchez et al., 2004a,b; 2005), two algebras were presented to reflect the relationship between codon assignment and the physicochemical aspects of the amino acids. Here, we use the following ordered assignment of the nucleotide bases: $C \leftrightarrow 00$, $A \leftrightarrow 01$, $U \leftrightarrow 10$, $G \leftrightarrow 11$, which also represent the integers 0, 1, 2, 3 in the binary numerical system. With this ordering, the purines A and G are represented by the odd numbers 1 and 3, whereas the pyrimidines C and U are represented by the even numbers 0 and 2. The assignment of the duplets 00, 01, 10, 11 to the letters C, A, U, G may be done in $4! = 24$ ways. Given that C and G, and A and U, are complementary to each other in double-stranded nucleic acids, it is convenient to select the ordering in such a way that 00 is complementary to 11, and 01 to 10. Hence, we have only eight possible selections: (C, A, U, G); (C, U, A, G); (A, C, G, U); (A, G, C, U);
Table 1 Sum modulo 2 in the field $\mathbb{Z}_2$
+    0    1
0    0    1
1    1    0
(U, C, G, A); (U, G, C, A); (G, A, U, C); and (G, U, A, C). From these orderings we have selected the first, but it is possible to show that the remaining ones lead to the same results. The 64 sextuples of zeros and ones generate the so-called six-dimensional hypercube, which is a vector space over the binary field $\mathbb{Z}_2 = \{0, 1\}$ (A3.1). The vector space $(\mathbb{Z}_2)^6$ is an orthotope, according to the terminology of Coxeter (1973) (A3). In the hypercube, as in every orthotope, all the angles between adjacent edges are right angles, but in our pictures many of them appear acute or obtuse. This is so because the picture is a projection of a six-dimensional figure onto a plane. The same happens with some of the angles of a three-dimensional cube when it is portrayed in a plane. In the vector space structure of the hypercube the addition is the so-called sum modulo 2, bit by bit, also called the XOR operation, widely used in symbolic logic. This sum, for the elements of the field $\mathbb{Z}_2$, is defined in Table 1. When Table 1 is translated to the nucleotide bases it gives Table 2, which is the operation table of a known abelian group of order 4, the so-called Klein four-group. This group is isomorphic (A3) to the group of all the symmetries of a plane rectangle that is not a square.

2. Results
2.1. The RNA code and its three reading frames: The extended code
A primaeval RNA World was proposed (Gilbert, 1986) whose code comprises the codons with an RNY pattern (R: purine, Y: pyrimidine, N: any nucleotide). Since the primitive translational apparatus may have been imperfect, shifts in the reading frame probably occurred, and codons with the NYR and YRN patterns emerged. These three patterns altogether are here defined as the codons of the Extended RNA code. This extended code includes 48 triplets which specify 17 amino
Table 2 Sum modulo 2 of nucleotide bases
+    C    A    U    G
C    C    A    U    G
A    A    C    G    U
U    U    G    C    A
G    G    U    A    C
acids, including AUG, the start codon, and the three known stop codons. The remaining 16 codons can only be obtained by mutations, not by frame-shifted readings, and we define them as the RNA-less code or complementary code of the Extended RNA code. In this context, the 64 triplets XYZ of the SGC, where X, Y, and Z belong to the set {C, A, U, G}, can be partitioned into four sets of 16 triplets each. The first set corresponds to the so-called primaeval RNA code (RNY pattern); the second (NYR pattern) and the third (YRN pattern) sets correspond to readings of the primaeval RNA code in the second and the third reading frames, respectively; and the fourth set corresponds to the RNA-less code (YYY and RRR patterns). Each of the first three sets can be partitioned into two subsets by replacing the N by Y or R. For example, for the set RNY we obtain the two subsets RYY and RRY. The fourth set is the union of two subsets: YYY and RRR. The eight elements of a subset correspond to a three-dimensional coordinated affine subspace (A4.1), that is, an ordinary cube, contained in the six-dimensional hypercube. This is so since the Hamming distance (He et al., 2004; A6.1) between triplets on the same edge is 1. Of the eight subsets, only one is a vector subspace (A1.1) or vectorial cube, that is, one that contains the null vector CCC: the subset YYY. The other seven subsets are three-dimensional affine subspaces obtained from YYY by translations (A9.1). For each of the eight subsets, we may consider two subclasses, each of them formed by triplets having the same central nucleotide; these subclasses correspond to coordinated planes or faces of the cube. Finally, each of the subclasses is partitioned into two pairs, every pair consisting of triplets that have not only the same central nucleotide but also the same first nucleotide. They constitute coordinated lines or edges of the hypercube. In most cases, these pairs of codons specify the same amino acid.
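The partition described above can be checked with a short script. The following is a minimal sketch, not part of the original article: it assumes the assignment C = 00, A = 01, U = 10, G = 11 given in the text, and the helper names (`encode`, `decode`, `add`, `pattern`, `SET_OF_CUBE`) are illustrative choices of ours.

```python
from itertools import product

# Base-to-bits assignment used in the text: C=00, A=01, U=10, G=11.
BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}
BASES = "CAUG"  # position in this string = integer value of the two bits

def encode(codon):
    """Codon -> integer whose 6-bit binary form is the sextuple."""
    return int("".join(BITS[b] for b in codon), 2)

def decode(value):
    """Inverse of encode: integer in 0..63 -> codon."""
    bits = format(value, "06b")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in (0, 2, 4))

def add(c1, c2):
    """Bitwise sum modulo 2 (XOR) of two codons, as in Table 2."""
    return decode(encode(c1) ^ encode(c2))

def pattern(codon):
    """Purine/pyrimidine pattern of a codon, e.g. 'GCU' -> 'RYY'."""
    return "".join("R" if b in "AG" else "Y" for b in codon)

# Each of the eight cubes belongs to one of the four sets of 16 triplets.
SET_OF_CUBE = {"RYY": "RNY", "RRY": "RNY", "YYR": "NYR", "RYR": "NYR",
               "YRY": "YRN", "YRR": "YRN", "YYY": "RNA-less", "RRR": "RNA-less"}

codons = ["".join(t) for t in product(BASES, repeat=3)]
sets = {}
for c in codons:
    sets.setdefault(SET_OF_CUBE[pattern(c)], set()).add(c)

for name in ("RNY", "NYR", "YRN", "RNA-less"):
    print(name, len(sets[name]))

# Translating the vectorial cube YYY by ACC (the vector e2) yields the cube RYY.
yyy = {c for c in codons if pattern(c) == "YYY"}
ryy = {add(c, "ACC") for c in yyy}
print(sorted(ryy))
```

Because every codon has exactly one purine/pyrimidine pattern, the four sets are pairwise disjoint by construction; the script only confirms that each holds 16 triplets and that a translation carries one cube onto another.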
Now we consider the combinatorial geometry associated to the structure of the vector space. For any vector space the associated geometry is defined as the family of all the affine subspaces or linear varieties of the given space (A2). In order to describe all sets algebraically and geometrically, we introduce the following notational convention. For every vector $e_i$ of the canonical basis, we denote by $t_i$ the associated translation $v \mapsto v + e_i$. For every pair $(i, j)$, we denote by $t_{ij}$ the composed function $t_i \circ t_j$. For every triplet $(i, j, k)$, we denote by $t_{ijk}$ the composed function $t_i \circ t_j \circ t_k$, and so on (A9.1).

2.2. First reading frame
The 16 triplets of the RNY pattern code for eight amino acids and are considered as the primaeval RNY code (Crick, 1968; Eigen and Schuster, 1978). We say that the reading and translation of sequences of codons of the form RNY defines the first reading frame (FRF) in the RNA World. The set of codons of the form RNY is the union of the cubes RYY (Fig. 1a) and RRY (Fig. 1b), in which every amino acid corresponds to an edge of the associated cube. RYY is obtained from YYY by the addition of the triplet ACC, that is, the addition of the vector $e_2$, which involves a transversion (A10.2) in the first nucleotide. RRY is obtained from YYY by the addition of the triplet AAC, that is, the vector $e_2 + e_4$, which represents transversions in the first and the second nucleotides. Each cube is obtained from the other by means of the translation $t_4$, associated to the canonical vector $e_4$, that
Fig. 1 RNY code. Graphical representation of the subsets (a) RYY and (b) RRY. Note that in each cube the four amino acids correspond to 4 (solid edges) out of the 12 edges of the cube. (c) First Reading Frame: RNY = RYY ∪ RRY.
is, the addition of the codon CAC to each triplet. The Hamming distance between a vertex and its image under the translation $t_4$ is equal to 1. This means that the set of codons of the form RNY constitutes a four-dimensional hypercube, which is a coordinated affine subspace (A4), or a four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 1c). The hypercube is a generalization of the three-dimensional cube to $n$ dimensions, also called a measure polytope. It is a symmetrical regular polytope with mutually orthogonal (A8) sides, and is therefore an orthotope (Coxeter, 1973).

2.3. The second reading frame
If, in a sequence of triplets of the form RNY, the reading starts at the second nucleotide N due to a slippage, ignoring the first nucleotide R, then triplets of the form NYR will be translated. The set of triplets of the form NYR is disjoint from the set RNY. The 16 triplets of the form NYR specify eight amino acids, five of which were also found in the FRF. Then, the triplets of the form NYR give
Fig. 2 NYR code. Graphical representation of the subsets (a) YYR and (b) RYR. Note that in (a), two amino acids correspond to 2 (solid edges) out of the 12 edges of the cube, and Leu corresponds to a face (four connected solid edges). In (b) not all the amino acids correspond to edges of the cube; rather, there are two amino acids (Met and Ile) which correspond to single vertices. (c) Second Reading Frame: NYR = YYR ∪ RYR.
rise to only three new amino acids. We say that the reading and translation of sequences of codons of the form NYR defines the second reading frame (SRF) in the RNA World. We consider the union of the sets of codons of the form RNY and NYR, that is, those of the FRF and the SRF, and their correspondence with their 11 amino acids, as an extension of the original RNY code. In this set, the start codon AUG appears (Konecny et al., 1995). The set of codons of the form NYR is the union of the cubes YYR (Fig. 2a) and RYR (Fig. 2b). Note that in the cube YYR the amino acid Leu corresponds to a face, and in the cube RYR not every amino acid corresponds to an edge of the associated cube. YYR is obtained from YYY by addition of the codon CCA, that is, of the vector $e_6$, which represents a transversion in the third nucleotide. RYR is obtained from YYY by the addition of the triplet ACA, that is, the addition of the vector $e_2 + e_6$, which involves transversions in the first and third nucleotides. Each cube can be derived
from the other by means of the translation $t_2$, associated to the canonical vector $e_2$, that is, by the addition of the codon ACC to every triplet. The Hamming distance between a vertex and its image under the translation $t_2$ is equal to 1. This means that the set of codons of the form NYR also constitutes a four-dimensional hypercube, which is a coordinated affine subspace, or four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 2c). The hypercube NYR is the image of RNY under the non-singular linear transformation $e_1 \mapsto e_5$, $e_2 \mapsto e_6$, $e_3 \mapsto e_1$, $e_4 \mapsto e_2$, $e_5 \mapsto e_3$, $e_6 \mapsto e_4$, which is defined by an even permutation of the canonical basis. This function maps every triplet XYZ onto the triplet YZX. The matrix of this linear transformation is orthogonal with determinant 1, and it can also be interpreted as a double rotation in the six-dimensional Euclidean vector space $\mathbb{R}^6$ (A7).

2.4. The third reading frame
If, in a sequence of triplets of the form RNY, the reading starts at the third nucleotide Y due to a slippage, ignoring the first (R) and the second (N) nucleotides, then triplets of the form YRN will be translated. The set of triplets of the form YRN is disjoint from the sets RNY and NYR. Thirteen out of the 16 triplets of the form YRN code for six new amino acids, without repetitions of any of those found in the FRF or the SRF. Then, the triplets of the form YRN give rise to six new amino acids, which complete a set of 17 out of the 20 primary amino acids. The other three codons are the so-called stop codons. We say that the reading and translation of sequences of codons of the form YRN defines the third reading frame (TRF) in the RNA World. We consider the union of the sets of codons of the form RNY, NYR and YRN, that is, those associated to the first, second and third reading frames, and their correspondence with 17 amino acids and stop codons, as an extension of the primaeval RNY code.
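The amino acid counts quoted for the three reading frames can be reproduced from the standard codon table. The sketch below is illustrative only: the string `AA` is the standard genetic code written in the conventional U, C, A, G enumeration (an encoding choice of ours, not the ordering used in this article), and `matches` and `summary` are hypothetical helper names.

```python
from itertools import product

# Standard genetic code in the conventional U, C, A, G enumeration ('*' = stop).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AA)}

def matches(codon, frame):
    """True if codon fits a pattern such as 'RNY' (R purine, Y pyrimidine, N any)."""
    kind = {"A": "R", "G": "R", "C": "Y", "U": "Y"}
    return all(f == "N" or f == kind[b] for f, b in zip(frame, codon))

seen = set()     # amino acids found in the frames processed so far
summary = {}     # frame -> (codons, amino acids, new amino acids, stop codons)
for frame in ("RNY", "NYR", "YRN"):
    codons = [c for c in CODE if matches(c, frame)]
    amino = {CODE[c] for c in codons} - {"*"}
    stops = [c for c in codons if CODE[c] == "*"]
    new = amino - seen
    seen |= amino
    summary[frame] = (len(codons), len(amino), len(new), len(stops))
    print(frame, summary[frame], "new:", "".join(sorted(new)))

print("amino acids covered by the three frames:", len(seen))
```

The tuples in `summary` correspond to the statements in the text: eight amino acids in the first frame, three new ones in the second, and six new ones plus the three stop codons in the third.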
The set of codons of the form YRN is the union of the cubes YRY (Fig. 3a) and YRR (Fig. 3b). In the YRY cube every amino acid corresponds to an edge of the associated cube, but this is not the case in the YRR cube. YRY is obtained from YYY by addition of the codon CAC, that is, of the vector $e_4$, which represents a transversion in the second nucleotide. YRR is obtained from YYY by the addition of the triplet CAA, that is, the addition of the vector $e_4 + e_6$, which involves transversions in the second and third nucleotides. Each cube is obtained from the other by the translation $t_6$, associated to the canonical vector $e_6$, that is, the addition of the codon CCA to every triplet. The Hamming distance between a vertex and its image under the translation $t_6$ is equal to 1. This means that the set of codons of the form YRN also constitutes a four-dimensional hypercube, which is a coordinated affine subspace, or a four-dimensional hyperface of the six-dimensional hypercube of the 64 triplets (Fig. 3c). The hypercube YRN is the image of NYR under the non-singular linear transformation $e_1 \mapsto e_5$, $e_2 \mapsto e_6$, $e_3 \mapsto e_1$, $e_4 \mapsto e_2$, $e_5 \mapsto e_3$, $e_6 \mapsto e_4$, which is the same basis permutation, or double rotation, that converts RNY into NYR. In summary, the set of 48 codons of the form RNY, NYR and YRN, of which 45 specify 17 of the primary amino acids and the other three correspond to a termination signal, comprises the Extended RNA code.
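The basis permutation used in Sections 2.3 and 2.4 can be applied directly to the sextuples. The following sketch assumes the bit assignment of Section 1.1; the names `PERM`, `rotate` and `frame_set` are ours, and the sign of the permutation is computed from its inversion count as one simple way of confirming that it is even.

```python
from itertools import product

BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
BASE_OF = {bits: base for base, bits in BITS.items()}
PERM = {1: 5, 2: 6, 3: 1, 4: 2, 5: 3, 6: 4}  # e_i is sent to e_PERM[i]

def to_vector(codon):
    """Codon -> list of six bits (x1, ..., x6)."""
    return [bit for base in codon for bit in BITS[base]]

def to_codon(vec):
    """List of six bits -> codon."""
    return "".join(BASE_OF[(vec[i], vec[i + 1])] for i in (0, 2, 4))

def rotate(codon):
    """Apply the linear map defined by PERM to the sextuple of a codon."""
    image = [0] * 6
    for i, bit in enumerate(to_vector(codon), start=1):
        image[PERM[i] - 1] = bit  # coefficient of e_i moves to e_PERM[i]
    return to_codon(image)

def matches(codon, frame):
    kind = {"A": "R", "G": "R", "C": "Y", "U": "Y"}
    return all(f == "N" or f == kind[b] for f, b in zip(frame, codon))

def frame_set(frame):
    all_codons = ("".join(t) for t in product("CAUG", repeat=3))
    return {c for c in all_codons if matches(c, frame)}

# Sign of the permutation from its inversion count (even count -> sign +1).
images = [PERM[i] for i in range(1, 7)]
inversions = sum(a > b for k, a in enumerate(images) for b in images[k + 1:])
sign = -1 if inversions % 2 else 1

print(rotate("AUG"), sign)
print({rotate(c) for c in frame_set("RNY")} == frame_set("NYR"))
print({rotate(c) for c in frame_set("NYR")} == frame_set("YRN"))
```

Applying `rotate` a third time returns YRN to RNY, so the three hypercubes of the Extended RNA code form a single orbit under this map.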
Fig. 3 YRN code. Graphical representation of the subsets (a) YRY and (b) YRR. Note that in (a), four amino acids correspond to 4 (solid edges) out of the 12 edges of the cube. In (b) two amino acids (Arg and Gln) correspond to an edge of the cube. Trp corresponds to only one vertex, and the three stop codons belong to this cube, connected by two solid edges. (c) Third Reading Frame: YRN = YRY ∪ YRR.
2.5. The codons of the RNA-less World
The remaining 16 triplets belong to the cubes YYY or RRR, that is, those composed of only pyrimidines or only purines; they do not enter into the composition of the Extended RNA code. For this reason, we call them the triplets of the RNA-less World. If we consider the reading and translation of sequences of codons of the form YYY or RRR, these triplets code for eight amino acids, five of which are repetitions of those found in the Extended RNA code. Then, the triplets of the form YYY or RRR give rise to only three new amino acids. This addition completes the set of the 20 amino acids. The union of the Extended RNA code and its complementary RNA-less code constitutes the six-dimensional hypercube of the 64 triplets with the 20 amino acids and a termination mark. The 16 codons
Fig. 4 The RNA-less code: YYY ∪ RRR. Graphical representation of the subsets (a) YYY and (b) RRR. Note that, in both cubes (a) and (b), the four amino acids also correspond to 4 (solid edges) out of the 12 edges of the cube. (c) RNA-less code.
of the RNA-less code represent the union of the cubes YYY (Fig. 4a) and RRR (Fig. 4b). Each cube can be obtained from the other by the translation $t_{246}$ associated to the vector $e_2 + e_4 + e_6$, that is, the addition of the codon AAA to every triplet, which performs a transversion in every nucleotide component. Recall that YYY is a vector subspace of the whole vector space. In contrast to the hypercubes of the Extended RNA code, the Hamming distance between a vertex and its
image under the translation $t_{246}$ is equal to 3. This means that the set of codons of the RNA-less code does not form a four-dimensional hypercube. The set is a four-dimensional subspace of the six-dimensional hypercube, but it is not, as in the other cases, a four-dimensional hyperface; rather, it is a right hyperprism of height 3 (Fig. 4c and A9.3). Interestingly, the union of the pairwise disjoint sets of the Extended RNA code plus the RNA-less code is the six-dimensional hypercube of the 64 codons of the SGC (Fig. 5).
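The difference between the three hyperfaces and the hyperprism can be made concrete by asking which coordinates are constant on each set. This is a minimal sketch under the bit assignment of Section 1.1; `fixed_coordinates`, `closed_under_sum` and `weight` are illustrative names of ours.

```python
from itertools import product

BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}

def encode(codon):
    """Codon -> integer whose 6-bit binary form is the sextuple."""
    return int("".join(BITS[b] for b in codon), 2)

def matches(codon, frame):
    kind = {"A": "R", "G": "R", "C": "Y", "U": "Y"}
    return all(f == "N" or f == kind[b] for f, b in zip(frame, codon))

def fixed_coordinates(vectors):
    """1-based coordinates (x1..x6) that take a single value on the whole set."""
    return [i for i in range(1, 7)
            if len({(v >> (6 - i)) & 1 for v in vectors}) == 1]

def closed_under_sum(vectors):
    """True if the set is closed under the sum modulo 2 (a vector subspace)."""
    return all((u ^ v) in vectors for u in vectors for v in vectors)

def weight(codon):
    """Hamming distance between a vertex and its translate by this codon."""
    return bin(encode(codon)).count("1")

codons = ["".join(t) for t in product("CAUG", repeat=3)]
frames = {f: {encode(c) for c in codons if matches(c, f)}
          for f in ("RNY", "NYR", "YRN")}
rna_less = set(map(encode, codons)) - set().union(*frames.values())

for name, vectors in {**frames, "RNA-less": rna_less}.items():
    print(name, len(vectors), fixed_coordinates(vectors), closed_under_sum(vectors))

print(weight("CAC"), weight("ACC"), weight("CCA"), weight("AAA"))
```

Each set of the Extended RNA code fixes exactly two coordinates, which is what makes it a coordinated four-dimensional hyperface. The RNA-less set fixes none, yet it is closed under the sum, and its two cubes are joined by a translation of Hamming weight 3 instead of 1.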
3. Encoding functions
So far, it has been customary to represent the 64 codons of the SGC in various graphical ways (e.g. the six-dimensional hypercube (Jiménez-Montaño et al., 1996), the icosahedron or dodecahedron (White, undated), or the standard table of the genetic code), but without any reference to an encoding function that maps each triplet to its corresponding amino acid or stop signal. Actually, none of these
Fig. 5 A graph representation of the Boolean lattice of the 64 triplets of the SGC. The six-dimensional hypercube can be envisaged as composed of: yellow, RNY (primaeval RNA code); orange, NYR (RNA code in the SRF); purple, YRN (RNA code in the TRF); black, YYY and RRR (RNA-less code).
Table 3 The correspondence between the ordered set of triplets and every amino acid and stop codons
Group      Amino acid      Cube  Symbol  Sextuple  Set of triplets
1st RF     Threonine       1     Thr     010000    ACC, ACA, ACU, ACG
1st RF     Isoleucine      1     Ile     011000    AUC, AUA, AUU
1st RF     Alanine         1     Ala     110000    GCC, GCA, GCU, GCG
1st RF     Valine          1     Val     111000    GUC, GUA, GUU, GUG
1st RF     Asparagine      2     Asn     010100    AAC, AAU
1st RF     Serine          2     Ser     011100    AGC, AGU, UCC, UCA, UCU, UCG
1st RF     Aspartic acid   2     Asp     110100    GAC, GAU
1st RF     Glycine         2     Gly     111100    GGC, GGA, GGU, GGG
2nd RF     Proline         3     Pro     000000    CCC, CCA, CCU, CCG
2nd RF     Leucine         3     Leu     001000    CUC, CUA, CUU, CUG, UUA, UUG
2nd RF     Methionine      4     Met     011011    AUG
3rd RF     Histidine       5     His     000100    CAC, CAU
3rd RF     Arginine        5     Arg     001100    CGC, CGA, CGU, CGG, AGA, AGG
3rd RF     Tyrosine        5     Tyr     100100    UAC, UAU
3rd RF     Cysteine        5     Cys     101100    UGC, UGU
3rd RF     Glutamine       6     Gln     000101    CAA, CAG
3rd RF     Stop            6     Stop    100101    UAA, UAG, UGA
3rd RF     Tryptophan      6     Trp     101111    UGG
RNA-less   Lysine          7     Lys     010101    AAA, AAG
RNA-less   Glutamic acid   7     Glu     110101    GAA, GAG
RNA-less   Phenylalanine   8     Phe     101000    UUC, UUU
representations is the genetic code. Herein, we derive encoding functions for the Extended RNA code, the RNA-less code, and the SGC.

3.1. An auxiliary linear function for the different encoding functions
In Table 3, the correspondence of the ordered set of triplets that specify every amino acid is illustrated. The ordering of the set of triplets is the linear order determined by the selected order of the bases (C, A, U, G), so that the triplet at the left-most position is the one with the minimum value. We represent every amino acid by only one triplet, the first. For example, the representative of Ala is the triplet GCC. For the three stop codons we select as representative the triplet UAA. Biologically, it is well known that most mutations in the third base do not change the corresponding amino acid (the wobble hypothesis), and therefore we start with a function that maps the third nucleotide of a codon to zero. Consequently, we consider the linear transformation, or endomorphism, $F$ (A9.2) of the vector space $(\mathbb{Z}_2)^6$, $F: (x, y, z, t, u, v) \mapsto (x, y, z, t, 0, 0)$, which carries every triplet XYZ to the triplet XYC, that is, it replaces the third nucleotide by the nucleotide C. The endomorphism $F$ belongs to a class of endomorphisms which are called projections and are characterized by the property of idempotency, that is, $F \circ F = F$. The kernel of $F$ is the two-dimensional subspace of vectors of the form $(0, 0, 0, 0, u, v)$, that is, the triplets of the form CCZ. Hence, the kernel of $F$ is a face of the hypercube. The image of $F$, denoted by $\mathrm{Im}(F)$, is the set of triplets of the form XYC, that is, those that end with the nucleotide C. It is a four-dimensional subspace, in fact, a vectorial four-dimensional hypercube or a hyperface of the six-dimensional hypercube. It is the solution subspace of the homogeneous linear
Fig. 6 The hypercube NNC, image of the function $F$. Note that the blue three-dimensional cube is the image of $F_1$.
system $u = 0$, $v = 0$, with unknowns $x, y, z, t, u, v$, which are the components of a generic vector in $(\mathbb{Z}_2)^6$. We denote this hypercube as NNC (Fig. 6).

3.2. The encoding function for the primaeval RNY code
In the four-dimensional hypercube of codons of the form RNY, representing the FRF of the primaeval RNA code, we define a function $F_1$, which projects the whole hypercube onto the cube RNC. This means that for every edge associated to the same amino acid, its vertex that ends with the nucleotide C is selected. In other words, the function $F_1$ assigns to each of the 16 triplets of the set RNY the same triplet if it ends with C; otherwise, while preserving the first and second nucleotides, the third one is changed from U to C. The triplet that is selected as the canonical representative of each amino acid is the one which belongs to the cube RNC. This representative is the smaller of the two triplets on its edge, according to the linear order of the set of triplets derived from the selected ordering of the set {C, A, U, G}. The function $F_1$ is the restriction, to the hypercube RNY, of the linear transformation $F: XYZ \mapsto XYC$ of the whole six-dimensional hypercube, which projects it onto its subspace of triplets of the form NNC. This set is a four-dimensional coordinated vector subspace, in fact, a four-dimensional hyperface of the whole six-dimensional hypercube, generated by the canonical vectors $e_1$, $e_2$, $e_3$, $e_4$. Note that the cube RNC, image of the function $F_1$ in the set RNY, is the intersection of the four-dimensional hypercubes RNY
and NNC. Here, we take the cube RNC as a canonical representation of the set of eight amino acids specified by the FRF. The explicit definition of the function $F_1$ is:
$F_1$:
ACC, ACU $\mapsto$ ACC (for Threonine)
AUC, AUU $\mapsto$ AUC (for Isoleucine)
GCC, GCU $\mapsto$ GCC (for Alanine)
GUC, GUU $\mapsto$ GUC (for Valine)
AAC, AAU $\mapsto$ AAC (for Asparagine)
AGC, AGU $\mapsto$ AGC (for Serine)
GAC, GAU $\mapsto$ GAC (for Aspartic acid)
GGC, GGU $\mapsto$ GGC (for Glycine)
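The stated properties of the projection $F$ and of its restriction $F_1$ can be verified exhaustively on the 64 sextuples. In the sketch below, $F$ is implemented as a bit mask on the integer form of the sextuple (an implementation choice of ours), and the codon table string `AA` is the standard genetic code in the conventional U, C, A, G enumeration.

```python
from itertools import product

BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}
BASES = "CAUG"
# Standard genetic code in the conventional U, C, A, G enumeration ('*' = stop).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AA)}

def encode(codon):
    return int("".join(BITS[b] for b in codon), 2)

def decode(value):
    bits = format(value, "06b")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in (0, 2, 4))

def F(v):
    """Projection (x, y, z, t, u, v) -> (x, y, z, t, 0, 0): clear the last two bits."""
    return v & 0b111100

space = range(64)
idempotent = all(F(F(v)) == F(v) for v in space)
linear = all(F(u ^ v) == F(u) ^ F(v) for u in space for v in space)
kernel = sorted(decode(v) for v in space if F(v) == 0)
image = sorted({decode(F(v)) for v in space})

# F1 is the restriction of F to the 16 codons of the form RNY.
rny = [c for c in CODE if c[0] in "AG" and c[2] in "CU"]
F1 = {c: decode(F(encode(c))) for c in rny}
same_amino_acid = all(CODE[c] == CODE[F1[c]] for c in rny)

print(idempotent, linear, kernel, len(image))
print(F1["GCU"], same_amino_acid, len(set(F1.values())))
```

The last line confirms that, on RNY, the projection never changes the encoded amino acid and that its image consists of eight representatives, one per amino acid of the first reading frame.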
3.3. The encoding function for the triplets of the second reading frame
In the hypercube of codons of the form NYR, associated to the SRF of the Extended RNA code, we define a function $F_2$, which assigns to every set of triplets encoding the same amino acid the smallest one, according to the linear order in the whole set of triplets. The explicit definition of the function $F_2$ is:
$F_2$:
CUA, CUG, UUA, UUG $\mapsto$ CUA (for Leucine)
UCA, UCG $\mapsto$ UCA (for Serine)
CCA, CCG $\mapsto$ CCA (for Proline)
GUA, GUG $\mapsto$ GUA (for Valine)
AUG $\mapsto$ AUG (for Methionine)
AUA $\mapsto$ AUA (for Isoleucine)
GCA, GCG $\mapsto$ GCA (for Alanine)
ACA, ACG $\mapsto$ ACA (for Threonine)
We note that all of the images of the function $F_2$, with the only exception of AUG, belong to the cube NYA, which is a three-dimensional hyperface of the hypercube NYR. But the function $F_2$ is not exactly a linear projection of NYR onto NYA, as is the case for $F_1$ in the hypercube RNY. The function $F_2$ is related to the linear transformation $F$, described above, in the following way. In most cases, $F_2$ coincides with the restriction to NYR of the affine transformation $t_6 \circ F$, the composition of $F$ with the translation $t_6$. This holds for 13 out of the 16 triplets, with only three exceptions: UUG, UUA and AUG. For UUG and UUA, $F_2$ behaves as the composition $t_{16} \circ F$, whereas for AUG it acts as the composition $t_{56} \circ F$, where $t_{16}$ and $t_{56}$ are the composed translations $t_1 \circ t_6$ and $t_5 \circ t_6$, respectively.
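The decomposition of $F_2$ into $F$ followed by a translation can be checked codon by codon. The following sketch rebuilds $F_2$ from its definition (the smallest NYR codon with the same amino acid) and compares it with $t_6 \circ F$; the helper `t`, which builds a translation from basis indices, and the function `agrees` are hypothetical names introduced for illustration.

```python
from itertools import product

BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}
BASES = "CAUG"
# Standard genetic code in the conventional U, C, A, G enumeration ('*' = stop).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AA)}

def encode(codon):
    return int("".join(BITS[b] for b in codon), 2)

def decode(value):
    bits = format(value, "06b")
    return "".join(BASES[int(bits[i:i + 2], 2)] for i in (0, 2, 4))

def F(v):
    """Projection onto the triplets ending in C."""
    return v & 0b111100

def t(*indices):
    """Translation by e_i + e_j + ... (1-based indices, e1 = leftmost bit)."""
    shift = 0
    for i in indices:
        shift ^= 1 << (6 - i)
    return lambda v: v ^ shift

nyr = [c for c in CODE if c[1] in "CU" and c[2] in "AG"]

def F2(codon):
    """Smallest NYR codon (order C < A < U < G) with the same amino acid."""
    return min((c for c in nyr if CODE[c] == CODE[codon]), key=encode)

def agrees(codon, translation):
    """True if translation(F(codon)) equals F2(codon)."""
    return decode(translation(F(encode(codon)))) == F2(codon)

exceptions = sorted(c for c in nyr if not agrees(c, t(6)))
print(exceptions)
print(all(agrees(c, t(1, 6)) for c in ("UUA", "UUG")), agrees("AUG", t(5, 6)))
```

The list `exceptions` contains exactly the three triplets named in the text, and the second line confirms the translations $t_{16}$ and $t_{56}$ that handle them.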
3.4. The encoding function for the triplets of the third reading frame
In the hypercube of codons of the form YRN, associated to the TRF of the Extended RNA code, we define a function $F_3$, which assigns to every set of triplets
encoding the same amino acid the smallest one, according to the linear order in the whole set of triplets. The explicit definition of the function $F_3$ is:
$F_3$:
UGC, UGU $\mapsto$ UGC (for Cysteine)
CGC, CGA, CGU, CGG $\mapsto$ CGC (for Arginine)
UAC, UAU $\mapsto$ UAC (for Tyrosine)
CAC, CAU $\mapsto$ CAC (for Histidine)
UGG $\mapsto$ UGG (for Tryptophan)
UAA, UAG, UGA $\mapsto$ UAA (for Stop codons)
CAA, CAG $\mapsto$ CAA (for Glutamine)
Note that four out of the seven images of the function $F_3$ belong to the cube YRC. In fact, for 10 triplets, the function $F_3$ coincides with the restriction to YRN of the linear transformation $F: XYZ \mapsto XYC$. For UGG, $F_3$ coincides with the restriction to YRN of the affine transformation $t_{56} \circ F$. For UAG, UAA, CAG and CAA, $F_3$ coincides with the restriction to YRN of the affine transformation $t_6 \circ F$. For UGA, $F_3$ coincides with the restriction to YRN of the affine transformation $t_{36} \circ F$.

3.5. The encoding function for the triplets of the RNA-less code
In the vector subspace of the codons of the form RRR or YYY, associated to the RNA-less code, we define a function $F_4$, which assigns to every set of triplets encoding the same amino acid the smallest one, according to the linear order in the whole set of triplets. The explicit definition of the function $F_4$ is:
$F_4$:
CCC, CCU $\mapsto$ CCC (for Proline)
CUC, CUU $\mapsto$ CUC (for Leucine)
UCC, UCU $\mapsto$ UCC (for Serine)
UUC, UUU $\mapsto$ UUC (for Phenylalanine)
AAA, AAG $\mapsto$ AAA (for Lysine)
AGA, AGG $\mapsto$ AGA (for Arginine)
GAA, GAG $\mapsto$ GAA (for Glutamic acid)
GGA, GGG $\mapsto$ GGA (for Glycine)
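For the third reading frame, the translation that must follow $F$ can be recovered for each codon as the sum modulo 2 of $F_3(\text{codon})$ and $F(\text{codon})$. The sketch below groups the 16 YRN codons by that translation; it assumes the same encoding as before, and `translation_indices` and `by_translation` are illustrative names of ours.

```python
from itertools import product

BITS = {"C": "00", "A": "01", "U": "10", "G": "11"}
# Standard genetic code in the conventional U, C, A, G enumeration ('*' = stop).
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product("UCAG", repeat=3), AA)}

def encode(codon):
    return int("".join(BITS[b] for b in codon), 2)

def F(v):
    """Projection onto the triplets ending in C."""
    return v & 0b111100

yrn = [c for c in CODE if c[0] in "CU" and c[1] in "AG"]

def F3(codon):
    """Smallest YRN codon (order C < A < U < G) with the same meaning."""
    return min((c for c in yrn if CODE[c] == CODE[codon]), key=encode)

def translation_indices(codon):
    """Indices i such that F3(codon) = F(codon) + sum of the e_i."""
    diff = encode(F3(codon)) ^ F(encode(codon))
    return tuple(i for i in range(1, 7) if (diff >> (6 - i)) & 1)

by_translation = {}
for c in sorted(yrn):
    by_translation.setdefault(translation_indices(c), []).append(c)

for indices, members in sorted(by_translation.items()):
    label = "F" if not indices else "t" + "".join(map(str, indices)) + " o F"
    print(label, len(members), members)
```

An empty index tuple means that $F$ alone already gives the representative; the other keys reproduce the translations $t_6$, $t_{56}$ and $t_{36}$ listed above.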
For the triplets CCU, CCC, CUU, CUC, UCU, UCC, UUU and UUC, the function F4 acts as the restriction to the set YYY of the linear transformation F. However, for the triplets AAG, AAA, AGG, AGA, GAG, GAA, GGG and GGA, F4 acts as the restriction to RRR of the affine transformation t6 ∘ F. A graphical representation of the encoding functions F1, F2, F3 and F4 is shown in Fig. 7. In order to build this plot, we have considered the lexicographic order of the triplets used above and we have assigned to this order the corresponding integer values from 0 to 63. The abscissa represents each of the 64 codons, and each codon is mapped, according to the encoding function of its subset, to its representative codon. The linearity of the encoding functions is apparent despite a few departures. However, note that several amino acids have only one representative codon, but some of them appear in two of the subsets (e.g. Pro appears when F2 and F4 are applied) or even three (e.g. Ser appears when F1, F2 and F4 are applied); hence these amino acids have two or three representative codons. A given amino acid will then appear at different ordinate values.

Fig. 7 Local encoding functions F1, F2, F3 and F4. A graphical representation of the local encoding functions F1, F2, F3 and F4. Abscissa: codon (lexicographic order); ordinate: representative codon.
3.6. The general encoding function for amino acids and stop codons in the hypercube of 64 triplets

As is well known, 61 of the 64 triplets code for the 20 primary amino acids that are the building blocks of proteins. Herein, we derive an algebraic function which assigns to every triplet its associated amino acid or its termination mark. We have defined on the set of 64 triplets a structure of a vector space over the binary field Z2 = {0, 1}. This is done by means of the assignment C → 00, A → 01, U → 10, G → 11, which leads to the representation of each triplet as a sextuple of zeros and ones, that is, as a vector of the six-dimensional vector space (Z2)^6, also called the six-dimensional hypercube. This correspondence carries over to the set of triplets the addition operation and the vector space structure, as well as the combinatorial geometry associated with that structure.

In Table 3, the triplets which code for each amino acid and the stop signal are listed according to their lexicographic order. The first triplet in each row, that is, the one in the left-most position (marked in bold characters), is taken as the canonical representative of the corresponding amino acid. We define the encoding function as that which assigns to every triplet the left-most triplet of its row. It turns out that the endomorphism F coincides, in most cases, with the desired encoding function, denoted by E. In fact, it coincides for 45 (grey characters) of the 64 triplets. These 45 triplets encode 15 amino acids. The other 19 triplets (black characters) are those for which the encoding function E takes the form of an affine transformation, the composition of F with a suitable translation, as shown in Table 4. Note that for 8 of the 20 amino acids there are special subsets of their associated triplets for which F alone is not the encoding function E. For the stop signal, there are also two special sets. In Table 4, we show the encoding function for every special set. There are five amino acids and the stop signal whose canonical codons end in A or G, and which therefore require specific translations. There are three amino acids encoded by six codons, some of which are mapped directly by F while others require a translation. The latter is mainly due to a change in the first nucleotide of their codons.

In order to summarize the function E, a graphical representation built in the same way as Fig. 7 is shown in Fig. 8. In this encoding function a single canonical codon corresponds to each amino acid, in contrast to the four encoding functions described above. The overall shape is still linear, and we remark that this function represents the actual SGC.
In the abscissa, values from 0 to 63 were assigned to each codon according to the selected lexicographic order, and in the ordinate the value corresponding to the canonical codon for each amino acid or the termination mark (image of the function E) is given. The 45 codons that specify 15 amino acids and are directly mapped by the linear function F are represented by circles; their canonical codons are of the type NNC. The remaining 19 codons, represented by crosses, require a particular translation. Eight of the 19 codons specify the remaining five amino acids, and their canonical codons are of the type NNA or NNG. Three of the 19 codons specify the stop signal, and the remaining eight codons correspond to the three amino acids encoded by six codons, whose canonical representatives are of the type NNC.
Table 4 Triplets for which the function F requires an affine transformation

Amino acid   Canonical triplet   Special set          Encoding function
Arg          CGC                 AGA, AGG             t2 ∘ F
Gln          CAA                 CAA, CAG             t6 ∘ F
Glu          GAA                 GAA, GAG             t6 ∘ F
Leu          CUC                 UUA, UUG             t1 ∘ F
Lys          AAA                 AAA, AAG             t6 ∘ F
Met          AUG                 AUG                  t56 ∘ F
Ser          AGC                 UCC, UCA, UCU, UCG   t1234 ∘ F
Trp          UGG                 UGG                  t56 ∘ F
Stop         UAA                 UAA, UAG             t6 ∘ F
Stop         UAA                 UGA                  t36 ∘ F
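As an illustration of how Table 4 combines with the projection F, the following Python sketch builds the general encoding function E and compares it with the minimal triplet of each amino acid under the order C < A < U < G. The codon-to-amino-acid string is the standard genetic code written out here from general knowledge (one-letter symbols, bases in U, C, A, G order), not copied from Table 3; all function and variable names are hypothetical.

```python
from itertools import product

ORDER = "CAUG"  # linear order on the bases: C < A < U < G
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
BASE = {bits: base for base, bits in BITS.items()}

def to_vec(triplet):
    """Triplet -> list of six bits."""
    return [bit for base in triplet for bit in BITS[base]]

def to_triplet(vec):
    """Sequence of six bits -> triplet."""
    return "".join(BASE[tuple(vec[i:i + 2])] for i in (0, 2, 4))

def F(triplet):
    """Linear projection F: XYZ -> XYC."""
    return triplet[:2] + "C"

def translate(triplet, indexes):
    """Apply the translation t_indexes (1-based coordinates to flip)."""
    vec = to_vec(triplet)
    for i in indexes:
        vec[i - 1] ^= 1
    return to_triplet(vec)

# Special sets of Table 4: codon -> indexes of the translation applied after F.
SPECIAL = {
    "AGA": (2,), "AGG": (2,),
    "CAA": (6,), "CAG": (6,), "GAA": (6,), "GAG": (6,),
    "AAA": (6,), "AAG": (6,), "UAA": (6,), "UAG": (6,),
    "UUA": (1,), "UUG": (1,),
    "AUG": (5, 6), "UGG": (5, 6),
    "UCC": (1, 2, 3, 4), "UCA": (1, 2, 3, 4),
    "UCU": (1, 2, 3, 4), "UCG": (1, 2, 3, 4),
    "UGA": (3, 6),
}

def E(triplet):
    """General encoding function: F, followed by a translation if needed."""
    return translate(F(triplet), SPECIAL.get(triplet, ()))

# Standard genetic code, assumed here; '*' marks the stop signal.
CODONS = ["".join(c) for c in product("UCAG", repeat=3)]
AMINO = dict(zip(CODONS, "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
                         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"))

def canonical(triplet):
    """Minimal synonymous triplet under the order C < A < U < G."""
    synonyms = [c for c in CODONS if AMINO[c] == AMINO[triplet]]
    return min(synonyms, key=lambda c: [ORDER.index(b) for b in c])

agrees = all(E(c) == canonical(c) for c in CODONS)
mapped_by_F_alone = sum(E(c) == F(c) for c in CODONS)
print(agrees, mapped_by_F_alone, len(SPECIAL))
```

Under these assumptions, E reproduces the canonical representative of every one of the 64 triplets, with F alone sufficing for the triplets outside the special sets.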
Fig. 8 Plot of the general encoding function E in its two main forms (codons mapped by F alone, and codons mapped by F and a translation). The graphical representation of the general encoding function E of the SGC. Abscissa: codon (lexicographic order); ordinate: canonical codon.
4. The graph representation diagram of amino acids and the stop signal

Note from Table 3 that the vertices of the four-dimensional hypercube NNC (Fig. 6) correspond to triplets that code for 15 of the 20 amino acids. Amongst the 16 triplets which label the vertices, there is only one, the triplet UCC, which is not the canonical representative of any amino acid. It codes for Ser, but this amino acid is already represented by the triplet AGC, which is the first in its list. For this reason, we deleted the vertex with label UCC and its four adjacent edges in the last graph diagram. The other five amino acids, which are not represented in the hypercube NNC, are Gln, Glu, Lys, Met and Trp, whose canonical triplets are, respectively, CAA, GAA, AAA, AUG and UGG (Table 3). The first three, CAA, GAA and AAA, belong to the hypercube NNA, which is the image of NNC under the translation t6, that is, the addition of the triplet CCA. These three triplets are the one-element images of the faces CAN, GAN and AAN under the affine transformation t6 ∘ F. The other two triplets, AUG and UGG, belong to the hypercube NNG, which is the image of NNC under the translation t56, that is, the addition of the triplet CCG. As before, these two triplets are the one-element images of the faces AUN and UGN under the affine transformation t56 ∘ F.

In Fig. 9, we show a graph representation diagram of the 20 amino acids, with an additional vertex that represents the class of the three stop codons. It is a phenotypic graphical image of the hypercube NNC, without the vertex UCC and its adjacent edges, and with the addition of six external vertices: five correspond to amino acids whose canonical representative triplets do not belong to the hypercube NNC, and the sixth represents the stop signal. We observe that the vertex which represents the stop signal is adjacent to the amino acids Glu, Gln and Tyr, and that the distance between the amino acids Met and Trp is equal to 3.

Fig. 9 The phenotypic graph of the 20 amino acids and the stop signal. A graph representation diagram of the 20 amino acids and the stop signal.
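The two observations about the stop signal and about Met and Trp can be compared with Hamming distances between canonical triplets. The sketch below is only a plausibility check: it assumes that the distances meant are Hamming distances under the encoding C = 00, A = 01, U = 10, G = 11, which may not be exactly how every edge of Fig. 9 is drawn. The dictionary of canonical triplets follows the text (the vertices of NNC except UCC, plus CAA, GAA, AAA, AUG, UGG and UAA); the names are hypothetical.

```python
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}

def to_vec(triplet):
    """Triplet -> tuple of six bits."""
    return tuple(bit for base in triplet for bit in BITS[base])

def hamming(a, b):
    """Number of coordinates in which two triplets differ."""
    return sum(x != y for x, y in zip(to_vec(a), to_vec(b)))

# Canonical triplets of the 20 amino acids and the stop signal.
CANONICAL = {
    "CCC": "Pro", "CAC": "His", "CUC": "Leu", "CGC": "Arg",
    "ACC": "Thr", "AAC": "Asn", "AUC": "Ile", "AGC": "Ser",
    "UAC": "Tyr", "UUC": "Phe", "UGC": "Cys",
    "GCC": "Ala", "GAC": "Asp", "GUC": "Val", "GGC": "Gly",
    "CAA": "Gln", "GAA": "Glu", "AAA": "Lys",
    "AUG": "Met", "UGG": "Trp", "UAA": "Stop",
}

# Canonical triplets at Hamming distance 1 from the stop representative UAA.
stop_neighbours = sorted(
    name for codon, name in CANONICAL.items() if hamming("UAA", codon) == 1
)
met_trp_distance = hamming("AUG", "UGG")
print(stop_neighbours, met_trp_distance)
```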
5. Discussion

In this work, we propose an Extended RNA code derived from the RNA code originally proposed by Eigen (1977) and later used by several authors (e.g. Konecny et al., 1995). The canonical RNA code consists only of RNY codons, which comprise 16 of the 64 possible triplets and codify for eight amino acids. By allowing readings that start at the second (SRF) and third (TRF) reading frame positions of the RNY code, codons of the types NYR and YRN appear. Each of these types corresponds to one set of 16 elements. These three sets are pairwise disjoint. Altogether these three sets comprise 48 triplets, which code for 17 of the 20 amino acids, and they also include the three stop codons. This is what we call the Extended RNA code.

The RNY code can be graphically represented as a four-dimensional hypercube that results from the union of the disjoint sets RYY and RRY, each of them being a three-dimensional cube. The NYR and YRN sets are also represented by four-dimensional hypercubes that result from the union of the disjoint sets YYR and RYR, and the union of the disjoint sets YRY and YRR, respectively, each of them being a three-dimensional cube. Interestingly, the three four-dimensional hypercubes are isomorphic and isometric (Definition A.6.2) to one another as affine subspaces of the whole six-dimensional hypercube, and they are pairwise disjoint. The remaining 16 triplets, which we call the RNA-less code or complementary code of the Extended RNA code, cannot be derived by frame-shift readings but rather by other types of mutations such as insertions, deletions and substitutions. These 16 triplets constitute two disjoint sets, YYY and RRR, each of them being a three-dimensional cube. The union of the cubes YYY and RRR produces a four-dimensional vector subspace which is not a hyperface of the six-dimensional hypercube. As a consequence, this subspace is isomorphic as an affine subspace to the three hypercubes of the Extended RNA code, but it is not isometric to any of them. Notably, the union of the three hypercubes of the Extended RNA code and the vector subspace of the RNA-less code makes up the whole six-dimensional hypercube of 64 triplets. Conversely, we can decompose the six-dimensional hypercube into the patterns RNY (primaeval RNA code), NYR and YRN (which complete the Extended RNA code), and YYY and RRR (RNA-less code) (Fig. 5). These results suggest a plausible evolutionary path by which the primaeval RNA code could have originated, via the Extended RNA code and the addition of the RNA-less code, the six-dimensional hypercube.
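The decomposition of the 64 triplets into the patterns RNY, NYR, YRN and RRR or YYY can be enumerated directly. The following Python sketch is a minimal check, with R = {A, G}, Y = {C, U} and N any base; the function and variable names are hypothetical.

```python
from itertools import combinations, product

def matches(triplet, template):
    """True if the triplet fits a template over the letters R, Y and N."""
    for base, symbol in zip(triplet, template):
        if symbol == "R" and base not in "AG":
            return False
        if symbol == "Y" and base not in "CU":
            return False
    return True

TRIPLETS = ["".join(t) for t in product("CAUG", repeat=3)]

SUBSETS = {
    "RNY": {t for t in TRIPLETS if matches(t, "RNY")},
    "NYR": {t for t in TRIPLETS if matches(t, "NYR")},
    "YRN": {t for t in TRIPLETS if matches(t, "YRN")},
    "RNA-less": {t for t in TRIPLETS
                 if matches(t, "RRR") or matches(t, "YYY")},
}

sizes = {name: len(members) for name, members in SUBSETS.items()}
pairwise_disjoint = all(
    SUBSETS[a].isdisjoint(SUBSETS[b]) for a, b in combinations(SUBSETS, 2)
)
extended = SUBSETS["RNY"] | SUBSETS["NYR"] | SUBSETS["YRN"]
whole = extended | SUBSETS["RNA-less"]
print(sizes, pairwise_disjoint, len(extended), len(whole))
```

Each subset has 16 triplets, the Extended RNA code has 48, and the four subsets together cover all 64 triplets without overlap.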
Our present results do not offer any clue about the chronological order in which the different encoding subsets could have led to the current SGC. However, we can hypothesize that the point at which genetically encoded protein translation started to evolve corresponds most likely to a breakthrough organism obeying an Extended RNA code, after the RNA World and prior to the cenancestor. Alternatively, it has been found that the order of triplet frequencies RNY > RNR > YNY > YNR is a general attribute of coding sequences (Eigen et al., 1985), and these authors proposed that this order may reflect the evolution of the genetic code from an RNY structure, providing a comma-free readout via wobble intermediates to the present form. Thus, the steps RNY plus RNR and YNY could also form another extended RNA code. Given the RNA code, frame-reading mistranslations plausibly conferred evolutionary advantages since, by allowing reading slippages in the other two reading frames, 32 new triplets (which codify for nine new amino acids) and three stop triplets (which specify a stop signal) emerge. Natural selection does not generate novelties from scratch. It works on what already exists, either transforming a system to give it new functions or combining several systems to produce a more elaborate one. It innovates with what it has at hand, and this process has been described as evolutionary tinkering (Jacob, 1977). In the RNY code every amino acid is coded by two neighbouring triplets located on an edge, whereas in the NYR and YRN codes there are departures from this regularity: some amino acids are now encoded by four triplets and others by only one; some amino acids of the RNY code are also coded by triplets that appear in the SRF; and there are new triplets which altogether specify nine new amino acids. Konecny et al. (1995) first noted that, by allowing reading slippages, there were two hidden messages in the RNY code, AUG and CAU, which are found in the SRF and TRF, respectively. It is also worth mentioning that in present-day DNA viruses such as vaccinia virus, genes are transcribed in a frame-shift fashion (Keck et al., 1990). In the vaccinia virus, a single messenger RNA (mRNA) is able to encode three different proteins because messages contain three distinct putative translation initiation sites. In other words, a single mRNA can be translated in three different reading frames, which encode three trans-activators that are required for late transcription (Keck et al., 1990).

To our knowledge, this is the first time that the SGC has been expressed as a mathematical function which maps each triplet onto its corresponding amino acid or stop signal. Given the uneven degeneracy of the genetic code, it is appealing that the general encoding function is almost linear. The phenotypic graph of amino acids is also a novel finding, whose image resembles, in part, any of the three four-dimensional hypercubes of the Extended RNA code. In the search for synthetic life in the laboratory (Hutchinson III et al., 1999; Szathmary, 2005), our encoding functions may be used as a guide to understand the difference between a tinkered-together genome and an engineered one. In the context of the frozen concept, we can say that, considering the symmetries of both the Extended RNA code and the RNA-less code, the primaeval RNY code was already frozen and that it evolved like a replicating and growing icicle.
Appendix A. Mathematical and biological background

We assume that the reader is familiar with the concept of a vector space over a scalar field K, also called a K-vector space, and with that of a vector subspace of a vector space, as well as with the concept of a linear transformation or linear endomorphism of a vector space.

A.1. Concept of affine space and its dimension

Definition A.1.1. For a vector space V, we call an affine subspace of V, or linear variety contained in V, every subset of the form v + W, where W is a vector subspace and v is a fixed vector. The set v + W is also called a coset or adjoint class of the subgroup W, and it contains the element v. We recall that every vector space is an abelian group under the addition operation. The subspace W is called the associated vector subspace of the linear variety v + W.

Definition A.1.2. The dimension of a linear variety is defined as the dimension of its associated vector subspace.

Remarks. The associated vector subspace W of a linear variety is unique, but the vector v may be any of the elements of the set v + W. When two vectors u and v define the same linear variety, for a fixed subspace W, it means that their difference vector u − v belongs to W. The linear varieties or affine subspaces, for a fixed linear subspace W, are the equivalence classes of the equivalence relation R_W, defined as the set of all the ordered pairs (u, v) ∈ V × V such that u − v ∈ W. If we take v as the null vector 0, or as some vector which belongs to W, the identity v + W = W is obtained. Hence, all the vector subspaces are also affine subspaces; in fact, they are the affine subspaces that contain the null vector. Likewise, all the one-element sets are affine subspaces of dimension zero, also called points of the associated geometry. The affine subspaces of dimension 1 are called the lines of the geometry and those of dimension 2 are called the planes of the geometry. The only affine subspace of dimension n is the whole space V, if n is its dimension. The other affine subspaces, that is, the k-dimensional ones for 2 < k < n, are generically called hyperplanes of the geometry.

A.2. The associated geometry to a vector space. Concepts of point, line, and plane in a geometry. The concept of parallelism

Definition A.2.1. We call the associated geometry of a K-vector space V, K being a field, the set or family of all the affine subspaces or linear varieties contained in V.

Definition A.2.2. The points, lines and planes in a geometry are, respectively, the affine subspaces of dimension 0, 1, and 2.

Definition A.2.3. We say that two affine subspaces v + W and u + U are parallel if W is a subspace of U, or U is a subspace of W. In particular, if W = U, that is, if the affine subspaces have the same associated subspace, they are parallel and have the same dimension. According to this definition we can talk of parallel lines, parallel planes, a line parallel to a plane, parallel cubes, etc.

A.3. The concept of n-dimensional hypercube

Let us consider a vector space of the form V = K^n, where K is a field. In this case, the vectors are represented as n-tuples (x_1, x_2, ..., x_n), where the x_i are elements of K. The elements x_i, for i ∈ {1, 2, ..., n}, are called the coordinates of the vector. The canonical base of K^n is the ordered set of vectors e_1 = (1, 0, ..., 0), e_2 = (0, 1, 0, ..., 0), ..., e_n = (0, 0, ..., 0, 1). It is easy to show that these vectors are linearly independent and that they generate the space. Thus, they form a base of the vector space. As is well known, every n-dimensional K-vector space V is isomorphic to the space K^n. The isomorphism may be defined by matching any of the bases of V with the canonical base of K^n. Then we say that the space has been coordinatized, that is, provided with a coordinate system. From standard courses of linear algebra, it is known that every affine subspace v + W, in an n-dimensional vector space V, is the solution set of a linear system
of n equations with n unknowns, the linear subspace W being the solution set of the associated homogeneous system, i.e., the system with the same left-hand sides and with right-hand sides equal to zero. The dimension of W is equal to n − r, where r denotes the rank of the associated matrix.

Definition A.3.1. In the particular case where K is the binary field Z2 = {0, 1} of two elements, the vector space K^n is called the n-dimensional hypercube. The n-dimensional hypercube (Z2)^n is a regular polytope (Coxeter, 1973). It is also called an orthotope because the angle between two adjacent edges is a right angle, and all the edges have length 1.

The name hypercube comes from analogy with the three-dimensional case, where the eight triplets of zeros and ones represent the points which are the vertices of a cube or regular hexahedron, one of whose vertices is the null vector 0 = (0, 0, 0), and the vertices adjacent to it coincide with the end points of the vectors e_1, e_2 and e_3 of the canonical base. In Fig. 5 of the main text, the six-dimensional hypercube of the 64 codons is illustrated.

A.4. The concept of coordinated affine subspaces

Definition A.4.1. In a vector space K^n we call a coordinated affine subspace every affine subspace whose associated subspace is generated by one or more vectors of the canonical base. In fact, a coordinated affine subspace is the solution set of a system whose associated matrix is diagonal. The vectorial coordinated lines are usually called coordinate axes and the vectorial coordinated planes simply coordinate planes.

Definition A.4.2. A coordinated affine subspace of the hypercube (Z2)^6 is called a hyperface of the hypercube. The zero-dimensional hyperfaces are the vertices of the hypercube, whereas the one-dimensional hyperfaces are its edges. The two-dimensional hyperfaces are simply the faces of the hypercube.

A.5. The concept of norm or length of a vector in the spaces (Z_p)^n

Let us consider a vector space of the form (Z_p)^n, where Z_p denotes the Galois field of p elements, p being a prime number, i.e., the field of non-negative integers which are remainders of the integer division by p. The assignment to every integer of its non-negative remainder in the integer division by p is the so-called reduction modulo p.

Definition A.5.1. We call the norm or length of a vector v = (x_1, x_2, ..., x_n) the ordinary sum |v| = x_1^2 + x_2^2 + ... + x_n^2, which is always a non-negative integer. In the case of the binary field Z2, the norm is simply the number of ones that appear in the n-tuple. In this case, it is also called the weight of the vector.
In the more general case of a vector space (Z_p)^n, we can define the concept of scalar or inner product in the following way:

Definition A.5.2. We call the scalar or inner product of the vectors v = (x_1, x_2, ..., x_n) and w = (y_1, y_2, ..., y_n) the integer <v, w> = x_1 y_1 + x_2 y_2 + ... + x_n y_n. Note that the norm or length of a vector coincides with the scalar product of the vector with itself.

A.6. The concept of distance in the spaces (Z_p)^n. Transformations that preserve the distance

Definition A.6.1. We define the distance between the vectors v and w as the norm of the difference vector v − w, subtraction being the inverse operation of the addition in the vector space (Z_p)^n. In the case of the hypercube (Z2)^n, this distance is the so-called Hamming distance, widely used in coding theory and in cryptology. The Hamming distance is the number of places in which the two vectors differ and, geometrically, it is the minimal number of edges between the two vertices represented by the n-tuples, visualizing the space as a graph.

In vector spaces of the type (Z2)^n, the so-called hypercubes, the affine coordinated lines are usually called edges, and the affine coordinated planes are called faces of the hypercube. In general, the affine coordinated subspaces are called hyperfaces of the hypercube. It can be proved that, in the hypercube (Z2)^n, the inner product and the length or norm are related in the following way. For every pair (u, v) of arbitrary vectors u and v, the following equality holds: |u + v| = |u| + |v| − 2<u, v>.

Definition A.6.2. A transformation f of the vector space V = (Z2)^n into itself is called isometric if it preserves the distance between every two elements, i.e., if d(f(u), f(v)) = d(u, v) for every pair (u, v) ∈ V × V. An isometric linear transformation will be called a linear isometry of the vector space. It is easy to show that a linear isometry preserves the length or norm of every element of the space.

A.7. The concept of permutation transform or multiple rotation in the vector space V = (Z_p)^n

As the only vectors of length 1 in the vector space are the canonical vectors e_1, e_2, ..., e_n, every linear isometry performs a permutation on the set {e_1, e_2, ..., e_n}. For this reason, the linear isometries will also be called permutation transforms. The matrix of a linear isometry with respect to the canonical base is a matrix obtained from the identity matrix by a permutation of its columns. Matrices of this kind, called permutation matrices, have the property that the inverse of any of them is equal to its transpose.
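The notions of Hamming distance and permutation transform can be illustrated on the six-dimensional hypercube. The Python sketch below is a minimal example under stated assumptions: the particular permutation `PERM` (a cyclic shift of the three pairs of coordinates) is chosen only for illustration, and all names are hypothetical. It checks that the permutation matrix has its transpose as inverse and that the associated map preserves the distance between every pair of vertices.

```python
from itertools import product

N = 6
CUBE = list(product((0, 1), repeat=N))  # the 64 vertices of (Z2)^6

def distance(u, v):
    """Hamming distance: number of coordinates in which u and v differ."""
    return sum(x != y for x, y in zip(u, v))

def permutation_matrix(perm):
    """Identity matrix with permuted columns: column j is e_(perm[j])."""
    n = len(perm)
    return [[1 if perm[j] == i else 0 for j in range(n)] for i in range(n)]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def matmul(a, b):
    """Matrix product over Z2."""
    return [[sum(x * y for x, y in zip(row, col)) % 2 for col in zip(*b)]
            for row in a]

def apply(matrix, v):
    """Matrix-vector product over Z2."""
    return tuple(sum(x * y for x, y in zip(row, v)) % 2 for row in matrix)

# Illustrative permutation (0-based): e1 -> e3, e2 -> e4, ..., e6 -> e2.
PERM = (2, 3, 4, 5, 0, 1)
P = permutation_matrix(PERM)
IDENTITY = permutation_matrix(tuple(range(N)))

inverse_is_transpose = matmul(P, transpose(P)) == IDENTITY
is_isometry = all(
    distance(apply(P, u), apply(P, v)) == distance(u, v)
    for u in CUBE for v in CUBE
)
print(inverse_is_transpose, is_isometry)
```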
As every permutation is a composition of pairwise disjoint cycles, every linear isometry can be interpreted as a composition of local rotations, i.e., a multiple rotation in the vector space. As the set (Z_p)^n can be viewed as a subset of the R-vector space R^n, R being the field of real numbers, a multiple rotation in (Z_p)^n can also be interpreted as a multiple rotation in R^n.

Definition A.7.1. The linear isometries of the vector space (Z_p)^n will also be called permutation transforms or multiple rotations.

A.8. The concept of orthogonality in a vector space (Z_p)^n

Definition A.8.1. We say that two vectors v and w of the space (Z_p)^n are orthogonal or perpendicular if their scalar product is equal to zero.

According to this definition, the vectors of the canonical base, in any vector space of the form (Z_p)^n, are orthogonal to one another and are, additionally, of length 1. This kind of base is usually called an orthonormal base. It can be proved that a linear isometry preserves not only the distance between vectors, but also the scalar product of any two vectors of the vector space. Hence, a linear isometry preserves the orthogonality of any pair of orthogonal vectors of the vector space.

A.9. The concept of translation and affine transformations in a vector space

Definition A.9.1. Given a vector v of a vector space V, we call the translation associated to the vector v the bijective function t_v: u → u + v, from V to V.

For translations t_v and t_w, associated to the vectors v and w, the composition t_v ∘ t_w is the translation t_{v+w}, associated to the sum v + w of both vectors. It means that the set T of all the translations of the vector space V is a group, isomorphic to the additive group of V, the isomorphism being given by the function v → t_v, from V onto T. T is, in fact, an abelian group which is a subgroup of the symmetric group S(V) of all the bijections of the set V with itself.

Special notation. In the hypercube (Z2)^n, we will denote by t_i the translation associated to the vector e_i of the canonical base; by t_ij the translation associated to the vector e_i + e_j; by t_ijk the translation associated to the vector e_i + e_j + e_k, where i, j, k ∈ {1, 2, ..., n}; and so on.

Definition A.9.2. We call an affine transformation every composed function of the form t_v ∘ F, where F is a linear transformation or linear endomorphism of the vector space and t_v is the translation associated to a fixed vector v.

It is clear that an affine transformation t_v ∘ F is bijective, that is, one to one, if and only if its linear component F is also bijective. It is easy to prove that every affine transformation carries a linear variety v + W onto another linear variety, whose dimension is less than or equal to that of v + W. If the affine transformation is bijective, and only in this case, the dimension is preserved.
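The behaviour of translations and affine transformations can be seen on a small example. The Python sketch below assumes the six-dimensional hypercube of the main text and takes as linear component the projection F: XYZ → XYC, which in coordinates sets the last two bits to zero; the helper names are hypothetical. It checks that composing two translations gives the translation by the sum, and that the affine map t6 ∘ F is not bijective because F is not.

```python
from itertools import product

CUBE = list(product((0, 1), repeat=6))  # the 64 vertices of (Z2)^6

def add(u, v):
    """Vector addition in (Z2)^6."""
    return tuple((x + y) % 2 for x, y in zip(u, v))

def translation(v):
    """The translation t_v: u -> u + v."""
    return lambda u: add(u, v)

def compose(f, g):
    """The composition f o g: first g, then f."""
    return lambda u: f(g(u))

def F(u):
    """Projection XYZ -> XYC in coordinates: the last base becomes C = 00."""
    return u[:4] + (0, 0)

# Two arbitrary vectors chosen for illustration.
V = (1, 0, 0, 0, 0, 0)   # e1
W = (0, 0, 0, 0, 0, 1)   # e6

composed = compose(translation(V), translation(W))
direct = translation(add(V, W))
composition_rule_holds = all(composed(u) == direct(u) for u in CUBE)

t6_after_F = compose(translation(W), F)
translation_image_size = len({translation(V)(u) for u in CUBE})
affine_image_size = len({t6_after_F(u) for u in CUBE})
print(composition_rule_holds, translation_image_size, affine_image_size)
```

The translation alone reaches all 64 vertices, whereas the image of t6 ∘ F has only 16 elements, the hyperface of triplets ending in A.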
If v + W is a k-dimensional hyperface and u is any vector of the hypercube (Z2)^n, then the set u + v + W is also a k-dimensional hyperface. It can be proved that the union of the k-dimensional hyperfaces v + W and e_i + v + W, when e_i does not belong to W, is a (k + 1)-dimensional hyperface.

Definition A.9.3. The union of the hyperfaces v + W and u + v + W, when u is orthogonal to W and has length h greater than 1, is called a right hyperprism of height h.

A right hyperprism is obtained from a hyperface by adjoining to it a translate of itself at a distance h greater than 1. If the translation is to a distance equal to 1, a (k + 1)-dimensional hyperface is obtained. A right hyperprism of height h, with h greater than 1, is an affine subspace, but it is not a hyperface of the hypercube.
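The difference between a hyperface and a right hyperprism can be checked on the sets of the main text. The Python sketch below is an illustration under the encoding C = 00, A = 01, U = 10, G = 11, in which a base is a purine exactly when the second bit of its pair is 1; the two tests `is_hyperface` and `is_affine_subspace` are hypothetical helpers written for this example. A set of vertices is a hyperface when it is a full subcube, and a non-empty set over Z2 is an affine subspace when it contains the sum of any three of its elements.

```python
from itertools import product

CUBE = list(product((0, 1), repeat=6))

def add(u, v):
    return tuple((x + y) % 2 for x, y in zip(u, v))

def is_hyperface(points):
    """Full subcube: size equals 2^(number of coordinates that vary)."""
    varying = [i for i in range(6) if len({p[i] for p in points}) == 2]
    return len(points) == 2 ** len(varying)

def is_affine_subspace(points):
    """Over Z2: closed under the sum of any three of its elements."""
    return all(add(add(u, v), w) in points
               for u in points for v in points for w in points)

# Purine/pyrimidine class is carried by coordinates 2, 4 and 6 (0-based 1, 3, 5).
YYY = {p for p in CUBE if (p[1], p[3], p[5]) == (0, 0, 0)}
RRR = {p for p in CUBE if (p[1], p[3], p[5]) == (1, 1, 1)}
RNY = {p for p in CUBE if p[1] == 1 and p[5] == 0}
PRISM = YYY | RRR

U = (0, 1, 0, 1, 0, 1)          # e2 + e4 + e6
height = sum(U)                 # length of the translation vector
rrr_is_translate = RRR == {add(U, p) for p in YYY}

print(is_hyperface(YYY), is_hyperface(RNY))
print(is_affine_subspace(PRISM), is_hyperface(PRISM), height)
```

Under this encoding, RRR is the translate of the cube YYY by e2 + e4 + e6, a vector of length 3 orthogonal to e1, e3 and e5, so that the union is a four-dimensional affine subspace without being a hyperface.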
Some biological concepts and their mathematical representation

A.10. Mutations

Definition A.10.1. The substitution in a triplet of one or more of its three nucleotide bases within the set {C, A, U, G} is called a mutation. We say that the mutation is simple if the substitution involves only one base of the triplet.

In our work, we have assigned to every nucleotide a numerical pair in the following way: C = 00, A = 01, U = 10, and G = 11. It follows that every triplet XYZ is represented as a sextuple (x, y, z, t, u, v) of zeros and ones, that is, as an element of the hypercube (Z2)^6. According to our representation of triplets as elements of the vector space (Z2)^6, every mutation may be represented by means of a translation t_v, associated to some vector v. For instance, if the triplet CUA is changed to the triplet CUG, the vector (0, 0, 1, 0, 0, 1) is changed to the vector (0, 0, 1, 0, 1, 1), and this transformation may be represented by the addition of the vector e_5 = (0, 0, 0, 0, 1, 0), that is, by the translation t5 associated to it. In this particular case the mutation is simple, since it consists of the change of only one base: the nucleotide A is replaced by the nucleotide G.

Transitions and transversions. The nucleotides A and G are called purines, and the nucleotides C and U are called pyrimidines. This terminology is due to their chemical composition.

Definition A.10.2. A codon mutation is called a transition when it changes a base to a base of the same class, that is, purine to purine or pyrimidine to pyrimidine, and it is called a transversion when the change is from one class to the other, that is, purine to pyrimidine or pyrimidine to purine.

For instance, if the triplet CUA changes to the triplet CUG, by substitution of the base A by the base G, it is a transition, since A and G are both purines. On the other hand, if CUA is changed to CUU, then the purine A is changed to the pyrimidine U; it is therefore a transversion.
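The two examples above can be reproduced with a few lines of Python. This is a minimal sketch assuming the assignment C = 00, A = 01, U = 10, G = 11; the function names are hypothetical. It returns the coordinates of the translation that carries one triplet to another and classifies a mutation by comparing the purine or pyrimidine class of each changed base.

```python
BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
PURINES = {"A", "G"}

def to_vec(triplet):
    """Triplet -> tuple of six bits."""
    return tuple(bit for base in triplet for bit in BITS[base])

def translation_indexes(before, after):
    """1-based coordinates of the vector v with after = before + v."""
    pairs = zip(to_vec(before), to_vec(after))
    return [i for i, (x, y) in enumerate(pairs, start=1) if x != y]

def classify(before, after):
    """'transition' if every changed base keeps its class, else 'transversion'."""
    changed = [(b, a) for b, a in zip(before, after) if b != a]
    if not changed:
        return "identity"
    same_class = all((b in PURINES) == (a in PURINES) for b, a in changed)
    return "transition" if same_class else "transversion"

examples = [("CUA", "CUG"), ("CUA", "CUU")]
results = [(translation_indexes(b, a), classify(b, a)) for b, a in examples]
for (before, after), (indexes, kind) in zip(examples, results):
    print(before, "->", after, "t" + "".join(map(str, indexes)), kind)
```

For CUA to CUG the translation is t5, a transition; for CUA to CUU it is t56, which changes the class of the third base and is therefore a transversion.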
A.11. Algebraic representations of transitions and transversions

Notice that the basic transitions in the hypercube (Z2)^6 are represented by the translations t1, t3 and t5, that is, those associated to the canonical vectors e_1, e_3 and e_5, which have odd subindexes, while the basic transversions are represented by the translations t2, t4 and t6, of even subindexes. The set of mutations which are transitions is closed under composition, and the identity function may be considered as a transition. Hence, the set of all the transitions is a group of order 8, a subgroup of the group T of 64 possible mutations. The eight transitions are represented by the translations t0, t1, t3, t5, t13, t15, t35 and t135, t0 being the identity function. All the other mutations are transversions or compositions of transversions with transitions. As T is an abelian group, the subgroup of pure transitions is a normal subgroup of T. The basic transitions, represented by the translations t1, t3 and t5, are obtained, respectively, as additions of the triplets UCC, CUC and CCU, according to the addition operation illustrated in Table 2, whereas the basic transversions, represented by the translations t2, t4 and t6, are obtained by additions of the triplets ACC, CAC and CCA (see Table 2).

A.12. Complementary bases

For biological reasons, the purine A and the pyrimidine U are considered to be complements of each other. When U is replaced by T (thymine), the base A is paired with T in the double helix structure of DNA. For this reason, they have been represented by the complementary numerical pairs 10 and 01. Analogously, the pyrimidine C and the purine G are considered complementary bases and are represented by the numerical pairs 00 and 11. They also form a pair in the double helix structure. According to our numerical representation, the Hamming distance between a base and its complement is always equal to 2, for they differ in both of their components.
In the vector space of the 64 triplets we can define a bijective function which assigns to every triplet XYZ the triplet X'Y'Z', where the components X', Y' and Z' are the complements of X, Y and Z, respectively. This triplet is usually called the complementary codon or the anticodon of XYZ. From the algebraic point of view, it means the addition of the triplet GGG to XYZ, that is, the addition to the vector (x, y, z, t, u, v) associated to the triplet of the vector M = (1, 1, 1, 1, 1, 1), which is the sum e_1 + e_2 + e_3 + e_4 + e_5 + e_6 of the six vectors of the canonical base. This function is the translation t_M, defined by the vector M, and it consists of the substitution of every zero by a one and every one by a zero. In the Boolean structure of the hypercube (Z2)^6 this transformation is the so-called Boolean negation. The sum of a triplet with its complement gives, as a result, the triplet GGG, that is, the one associated to the vector M. Notice that the Hamming distance between a codon and its complement is always equal to 6, that is, the maximum of all the possible distances in the six-dimensional hypercube. The translation t_M performs a transversion in each of the three components of the triplet.
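The complement map can be written as the addition of GGG in a few lines. The Python sketch below assumes the same assignment C = 00, A = 01, U = 10, G = 11; the function names are hypothetical. It checks, over all 64 triplets, that a codon and its complement add up to GGG and lie at Hamming distance 6.

```python
from itertools import product

BITS = {"C": (0, 0), "A": (0, 1), "U": (1, 0), "G": (1, 1)}
BASE = {bits: base for base, bits in BITS.items()}

def to_vec(triplet):
    """Triplet -> tuple of six bits."""
    return tuple(bit for base in triplet for bit in BITS[base])

def to_triplet(vec):
    """Tuple of six bits -> triplet."""
    return "".join(BASE[tuple(vec[i:i + 2])] for i in (0, 2, 4))

def add_triplets(a, b):
    """Addition of triplets as vectors of (Z2)^6."""
    return to_triplet(tuple(x ^ y for x, y in zip(to_vec(a), to_vec(b))))

def complement(triplet):
    """The translation t_M: addition of GGG, i.e. Boolean negation."""
    return add_triplets(triplet, "GGG")

def hamming(a, b):
    return sum(x != y for x, y in zip(to_vec(a), to_vec(b)))

TRIPLETS = ["".join(t) for t in product("CAUG", repeat=3)]

sums_to_GGG = all(add_triplets(t, complement(t)) == "GGG" for t in TRIPLETS)
always_distance_6 = all(hamming(t, complement(t)) == 6 for t in TRIPLETS)
is_involution = all(complement(complement(t)) == t for t in TRIPLETS)
print(complement("CUA"), sums_to_GGG, always_distance_6, is_involution)
```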
Acknowledgments

M.V.J. was financially supported by PAPIIT-IN216106, UNAM, and by the Multiproject: Tecnologías para la Universidad de la Información y la Computación, UNAM, Mexico. E.M. was supported by Universidad Central Marta Abreu de Las Villas, Santa Clara, Cuba. We thank Giselle Morgado Avinó for preparing the figures of the manuscript. We also thank Rafael Camacho for helpful suggestions.

References
Becerra, A., Islas, S., Leguina, J.I., Silva, E., Lazcano, A., 1997. Polyphyletic gene losses can bias backtrack characterizations of the cenancestor. J. Mol. Evol. 45, 115–118.
Coxeter, H.S.M., 1973. Regular Polytopes, 3rd edition. Dover Publications Inc., New York.
Crick, F.H.C., 1968. The origin of the genetic code. J. Mol. Biol. 38, 367–379.
Crick, F.H.C., Griffith, J.S., Orgel, L.E., 1957. Codes without commas. Proc. Natl. Acad. Sci. USA 43, 417–421.
Eigen, M., Schuster, P., 1977. The hypercycle. A principle of natural self-organization. Part A: Emergence of the hypercycle. Naturwissenschaften 64, 541–565.
Eigen, M., Schuster, P., 1978. The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65, 341–369.
Eigen, M., Lindemann, B., Winkler-Oswatitsch, R., Clarke, C.H., 1985. Pattern analysis of 5S rRNA. Proc. Natl. Acad. Sci. USA 82, 2437–2441.
Freeland, S.J., Hurst, L.D., 1998. The genetic code is one in a million. J. Mol. Evol. 47, 238–248.
García, J.A., Alvarez, S., Flores, A., Govezensky, T., Bobadilla, J.R., José, M.V., 2004. Statistical analysis of the distribution of amino acids in Borrelia burgdorferi genome under different genetic codes. Physica A 342, 288–293.
Gesteland, R.F., Cech, T.R., Atkins, J.F., 1999. The RNA World. Cold Spring Harbor Laboratory Press.
Gil, R., Sabater-Muñoz, B., Latorre, A., Silva, S.J., Moya, A., 2002. Extreme genome reduction in Buchnera spp.: Toward the minimal genome needed for symbiotic life. Proc. Natl. Acad. Sci. USA 99, 4454–4458.
Gilbert, W., 1986. The RNA World. Nature 319, 618.
He, M., Petoukhov, S.V., Ricci, P.E., 2004. Genetic code, Hamming distance and stochastic matrices. Bull. Math. Biol. 66, 1405–1421.
Hutchinson III, C.A., Peterson, S.N., Gill, S.R., Cline, R.T., White, O., Fraser, C., Smith, H.O., Venter, J.C., 1999. Global transposon mutagenesis and a minimal mycoplasma genome. Science 286, 2165–2169.
Jacob, F., 1977. Evolution and tinkering. Science 196, 1161–1166.
Jiménez-Montaño, M.A., de la Mora-Basáñez, C.R., Pöschel, T., 1996. The hypercube structure of the genetic code explains conservative and non-conservative amino acid substitutions in vivo and in vitro. Biosystems 39, 117–125.
Keck, J.G., Baldick, C.J., Moss, B., 1990. Role of DNA replication in Vaccinia virus gene expression: A naked template is required for transcription of three late trans-activator genes. Cell 61, 801–809.
Kenneth, D.J., Ellington, A.D., 1995. The search for missing links between self-replicating nucleic acids and the RNA world. Orig. Life Evol. Biosph. 25, 515–530.
Konecny, J., Schöniger, M., Hofacker, L., 1995. Complementary coding to the primeval comma-less code. J. Theor. Biol. 173, 263–270.
Lazcano, A., 1995. Cellular evolution during the early Archean: What happened between the progenote and the cenancestor? Microbiologia SEM 11, 1–13.
Lazcano, A., Miller, S.L., 1996. The origin and early evolution of life: Prebiotic chemistry, the pre-RNA World, and time. Cell 85, 793–798.
Mushegian, A.R., Koonin, E.V., 1996. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc. Natl. Acad. Sci. USA 93, 10268–10273.
Sánchez, R., Morgado, E., Grau, R., 2004a. The genetic code Boolean lattice. MATCH Commun. Math. Comput. Chem. 52, 29–46.
Sánchez, R., Morgado, E., Grau, R., 2004b. Genetic code Boolean algebras. WSEAS Transactions of the International Conference on Biology and Biomedicine, vol. 1, issue 2, pp. 190–197. Corfu, Greece, ISSN 1109-9518.
Sánchez, R., Morgado, E., Grau, R., 2005. A genetic code Boolean structure I. The meaning of Boolean deductions. Bull. Math. Biol. 67, 1–14.
Szathmáry, E., 2005. In search of the simplest cell. Nature 433, 469–470.
White, M., Undated. Maximum symmetry in the genetic code: The Rafiki map. Unpublished manuscript. http://www.codefun.com/Images/genetic/Max/Sym300pi.pdf.
Woese, C., 1967. The Genetic Code, Ch. 7. Harper and Row, New York.
