
International Journal of Advanced Computer Science, Vol. 3, No. 5, Pp. 244-248, May, 2013.

Searching for the Relics of Primitive Codons


Satoshi Mizuta & Taro Mori
Manuscript received: 28 Mar. 2013; revised: 30 Mar. 2013; accepted: 8 Apr. 2013; published: 15 Apr. 2013

Keywords
codon, evolution, doublet, protein, gene

Abstract
A protein making up living organisms is synthesized as a one-dimensional array of amino acids based on the information stored in a gene, in which each of the 20 types of amino acids is encoded by a codon, a triplet of four types of nucleotides. Considering that evolution progresses from simple forms toward complicated ones, each amino acid might have been encoded by a single nucleotide or a pair of nucleotides at the early stage of life, at the cost of the deterministic nature of the coding system. In this study, we searched for the relics of such primitive coding systems in genomic sequences and in protein sequences. As a result, we obtained two candidates in genomic sequences and three in proteins. In particular, 50S ribosomal protein L32 appears among the candidates of both kinds and therefore seems to be one of the prime candidates for the relics of primitive coding systems.

This work was supported by JSPS KAKENHI Grant Number 24650152. Satoshi Mizuta is with the Graduate School of Science and Technology, Hirosaki University, Japan (slmizu@cc.hirosaki-u.ac.jp). Taro Mori is with the Graduate School of Science and Technology, Hirosaki University, Japan (gs11520@eit.hirosaki-u.ac.jp).

1. Introduction
Proteins making up living organisms are synthesized as one-dimensional arrays of amino acids through two biological processes, transcription and translation, and the blueprints of the proteins are stored in genome sequences as genes composed of nucleotides. To encode each of the 20 types of amino acids deterministically with four types of nucleotides, a group of three nucleotides is required (two nucleotides give only 16 combinations, whereas three give 64). In a gene, each amino acid is indeed encoded by a triplet of nucleotides called a codon. However, it is unlikely that such a rational coding system existed at the early stage of life, when the mechanisms of transcription and translation were first established; it must have been acquired in the course of evolution. Considering that evolution progresses from simple forms toward complicated ones, it seems natural to think that, at the early stage of life, a single nucleotide or a pair of nucleotides encoded more than one amino acid at the cost of the deterministic nature of the coding system, and that one amino acid among them was chosen stochastically. The current coding system has been formed so that arrays of amino acids useful for living organisms can be preferentially synthesized. However, the formation process is not clear at present.

In this study, we search for the relics of the primitive coding system with regard to the following two possibilities. One possibility is that remnants of the primitive coding system are left in the genomic sequences of some species, and the other is that more than one protein originating from a common gene at the early stage of life survives today. In fact, some amino acids are fully specified by the pair of the first and second bases of a codon even in the current codon table. These coding patterns seem to be remains of the primitive coding system. We searched for more convincing evidence for the primitive coding system by exploiting the methods of bioinformatics.

2. Methods
A. Scenario of Codon Evolution
Fig.1 shows the scenario of codon evolution assumed in this study. More than one amino acid was encoded by a single nucleotide or a pair of nucleotides at the early stage of life. Here we define a singlet codon and a doublet codon as a single nucleotide and a pair of nucleotides coding for amino acids, respectively, and we refer to them together as primitive codons. We also refer to a gene composed of primitive codons as a primitive gene. Because the number of variations of a singlet and a doublet codon is 4 and 16, respectively, whereas the number of amino acids constituting proteins is 20, some primitive codons must each have coded for more than one amino acid, among which one amino acid was chosen stochastically during translation. The chosen amino acids, in turn, constituted a variety of sequences, most of which must have been of no use or even fatal to sustaining life. However, some sequences useful for life might have been synthesized by chance, and an additional nucleotide was added to each primitive codon during the course of evolution so that the useful sequences could be synthesized deterministically (see the right panel of Fig.1).

Fig.1 Scenario of codon evolution. At the early stage of life (left panel), more than one amino acid sequence was synthesized stochastically from a common primitive gene. During the course of evolution, an additional nucleotide was added to each primitive codon so that the amino acid sequences useful for life could be synthesized deterministically (right panel).

B. Hypothetical Primitive Codon Table
Table 1 shows the current RNA codon table, in which a snapshot of the primitive codons can seemingly be recognized: five amino acids, valine (Val), proline (Pro), threonine (Thr), alanine (Ala), and glycine (Gly), which we call specific amino acids in this paper, have a one-to-one relationship to a pair of the first and second bases, and the other amino acids, except for STOP, show at most two-fold variation in the pair of the first and second bases. We may consider this coding pattern, known as degeneracy, to be a remnant of the primitive coding system based on the doublet type of primitive codons. From the above observation, we can straightforwardly construct a hypothetical primitive codon table by eliminating the third bases of the codons in Table 1. The constructed codon table is shown in Table 2. Note that we consider only the doublet type of primitive codons in the subsequent sections of this paper unless otherwise noted.
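The construction of the hypothetical primitive codon table can be illustrated with a short Python sketch. It assumes the standard genetic code (the paper's Table 1 is taken to be the standard RNA codon table), and the function names `codon_table`, `doublet_table`, and `specific_amino_acids` are hypothetical names chosen for this illustration, not part of the original work. Under these assumptions, dropping the third base and asking which amino acids correspond one-to-one to a single doublet recovers the five specific amino acids named in the text.

```python
from collections import defaultdict

BASES = "UCAG"
# Standard genetic code, with codons ordered by first, second, and third base
# in U, C, A, G order ("*" marks STOP). Assumed to match the paper's Table 1.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table():
    """Return the triplet codon table as a dict, e.g. 'GCU' -> 'A'."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, AMINO_ACIDS))


def doublet_table(table):
    """Eliminate the third base: map each doublet to the set of residues it covers."""
    doublets = defaultdict(set)
    for codon, aa in table.items():
        doublets[codon[:2]].add(aa)
    return dict(doublets)


def specific_amino_acids(doublets):
    """Amino acids in a one-to-one relationship with a single doublet."""
    owners = defaultdict(set)  # amino acid -> doublets that can encode it
    for doublet, residues in doublets.items():
        for aa in residues:
            owners[aa].add(doublet)
    return sorted(
        aa
        for aa, ds in owners.items()
        if len(ds) == 1 and len(doublets[next(iter(ds))]) == 1
    )


if __name__ == "__main__":
    table2 = doublet_table(codon_table())
    for doublet in sorted(table2):
        print(doublet, "".join(sorted(table2[doublet])))
    print("specific:", specific_amino_acids(table2))
```

Leucine, serine, and arginine each fill a complete four-codon box as well, but they are also encoded by a second doublet, which is why the one-to-one test excludes them.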
TABLE 1 RNA CODON TABLE
Amino acids in colored cells have a one-to-one relationship to a pair of the first and the second bases.

TABLE 2 HYPOTHETICAL PRIMITIVE CODON TABLE

C. Relics of Primitive Codons
In this study, we searched for the relics of primitive codons in two targets: genomic DNA sequences and amino acid sequences of proteins.

1) Relics in Genomic DNA Sequences: If a primitive gene was duplicated on a genome sequence at the early stage of life, it is possible that one copy has evolved into a conventional gene at present and the other is conserved in non-coding regions of some genomic sequences (see Fig.2).

Fig.2 Relics of primitive codons in genomic DNA sequences. One copy of a duplicated primitive gene is possibly conserved in non-coding regions of some genomic sequences.

The difference between the present gene and the corresponding relic in non-coding regions, both originating from a common primitive gene, is that the latter lacks the third base of each codon compared to its counterpart. Therefore, we searched for the nucleotide sequences constructed from existing protein genes by eliminating the third bases of their codons (see Fig.3). In this study, we chose all the protein genes of Mycoplasma genitalium (483 sequences in total) as the original sequences from which the queries are constructed, because their products have fundamental functions for living organisms and the genes must also have existed at the early stage of life. On the other hand, we chose Arabidopsis thaliana as the target genome, because eukaryotic genomes have large non-coding regions. The primitive genes must be non-functional at present. Therefore, their relics, if they exist, would be detected in non-coding regions (intergenic regions of genome sequences or introns of genes).

Fig.3 A nucleotide sequence to be searched for. It is constructed from an existing gene by eliminating the third bases of the codons. Only the doublet type of primitive codons is considered in this paper.
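The query construction of Fig.3 amounts to dropping every third base of a coding sequence. The following Python sketch shows one way to do this; the function name `primitive_query` and the requirement that the input length be a multiple of three are assumptions made for this illustration.

```python
def primitive_query(cds):
    """Build a doublet-type query by removing the third base of every codon.

    `cds` is a coding sequence read in frame from its first base. Its length
    is required to be a multiple of three (an assumption of this sketch).
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return "".join(cds[i:i + 2] for i in range(0, len(cds), 3))


if __name__ == "__main__":
    # ATG GCT TAA -> AT GC TA
    print(primitive_query("ATGGCTTAA"))
```

A gene of 58 codons yields a 116-base query under this construction. That would be consistent with the 116 bp reported in Table 3 for 50S ribosomal protein L32 (57 residues in Table 4) if the stop codon is included, although the paper does not state this explicitly.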


International Journal Publishers Group (IJPG)


TABLE 3 RESULTS OF THE SEARCH FOR PRIMITIVE GENES

Query(a) | SL(b) | Target(c) | SP(d) | Score | Identity (%) | p-value | Location(e)
50S ribosomal protein L32 | 116 | Chr. IV | 6554978 | 335 | 64.0 | 9.9e-03 | Intergenic region
50S ribosomal protein L36 | 76 | Chr. III | 2688124 | 241 | 69.3 | 2.6e-03 | Intron

a Originating from M. genitalium
b Sequence length of the constructed primitive gene (bp)
c Originating from A. thaliana
d Start position of the overlap region in the target sequence
e Location of the overlap region in the target sequence

The search was performed as follows. A subsequence 1.5 times as long as the query sequence was cut out from the beginning of the target genome, and the alignment score between the query and the subsequence was computed with the Needleman-Wunsch algorithm[3] (global for the query and local for the subsequence), using a match reward of +5, a mismatch penalty of 4, and a gap penalty of 2. We repeated this procedure, sliding the cutting window by half the length of the query sequence, until the window reached the end of the target genome.
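The windowed search above can be sketched in Python as follows. This is a minimal illustration under several assumptions rather than the authors' implementation: the penalties are applied as negative contributions (-4 and -2), the gap penalty is taken to be linear per gapped position, "global for the query and local for the subsequence" is read as free end gaps on the window side only, and the function names `semiglobal_score` and `scan_genome` are hypothetical. The sketch reports only the score and the window start; the sequence identity used in the paper would require a traceback, which is omitted here.

```python
MATCH, MISMATCH, GAP = 5, -4, -2  # scores from the text; linear gaps assumed


def semiglobal_score(query, window):
    """Best score aligning the whole query against any stretch of the window.

    Variant of Needleman-Wunsch in which unaligned prefixes and suffixes of
    the window cost nothing, while every query base must be aligned or gapped.
    """
    prev = [0] * (len(window) + 1)  # row 0: skipping a window prefix is free
    for i, q in enumerate(query, start=1):
        curr = [i * GAP]  # query bases placed before the window starts
        for j, w in enumerate(window, start=1):
            diagonal = prev[j - 1] + (MATCH if q == w else MISMATCH)
            curr.append(max(diagonal, prev[j] + GAP, curr[j - 1] + GAP))
        prev = curr
    return max(prev)  # last row: skipping a window suffix is free


def scan_genome(query, genome):
    """Slide a 1.5x-query window by half a query length; return (score, start)."""
    width = int(1.5 * len(query))
    step = max(1, len(query) // 2)
    best_score, best_start = None, None
    start = 0
    while True:
        score = semiglobal_score(query, genome[start:start + width])
        if best_score is None or score > best_score:
            best_score, best_start = score, start
        if start + width >= len(genome):  # window has reached the genome end
            break
        start += step
    return best_score, best_start


if __name__ == "__main__":
    print(scan_genome("ACGT", "GGGGGGGGACGTGGGG"))
```

Because consecutive windows overlap by a full query length, any occurrence of the query that is no longer than the query itself falls entirely inside at least one window, which is presumably the reason for the 1.5x window and half-length step.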

2) Relics in Amino Acid Sequences: If more than one amino acid sequence originating from a common primitive gene acquired useful functions for life by chance, it is possible that all or some of them evolved into extant proteins. We refer to them as quasi-homolog proteins in this paper (see Fig.4). Because a specific amino acid is uniquely translated from a primitive codon, the positions of specific amino acids on the quasi-homolog proteins originating from a common primitive gene are all the same, as long as there is no insertion or deletion. Therefore, we can identify quasi-homolog proteins by searching for protein sequence pairs with as many matches of specific amino acids as possible (see Fig.5).

Fig.4 Quasi-homolog proteins. If more than one amino acid sequence originating from a common primitive gene happened to acquire useful functions for living organisms, they would possibly still survive today as quasi-homolog proteins.

Fig.5 Quasi-homolog protein search. We searched for protein sequence pairs with as many matches of specific amino acids as possible.

Moreover, there is one more significant feature specific to quasi-homolog proteins. If more than one amino acid sequence originating from a common primitive gene acquired similar functions, all but one sequence must have been swept away in the course of evolution due to the redundancy. Therefore, it is probable that quasi-homolog proteins, if they survive today, have different functions from each other.

We chose all the protein sequences of M. genitalium (483 sequences in total) as the queries for the same reason as in the case of primitive genes, and chose all the protein sequences in UniProt[4] as the target sequences (531473 sequences in total), except for the proteins shorter than the query. The search was performed as follows. Sliding a window of the same size as the query sequence from the beginning of the target sequence to its end, we computed the maximum number of matches of specific amino acids between the query and each target sequence. We applied this procedure to all the target proteins.

D. Data Sets
All the gene and protein sequences analyzed in this study were downloaded from the web sites of GenBank[1] and UniProt[4], respectively.
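The quasi-homolog search, which slides a query-length window along each target protein and counts positions where both sequences carry the same specific amino acid, can be sketched in Python as below. The names `SPECIFIC`, `count_specific_matches`, and `best_window` are hypothetical, and positions are reported 0-based here, whereas the start positions in the paper's tables are presumably 1-based.

```python
SPECIFIC = frozenset("AGPTV")  # Ala, Gly, Pro, Thr, Val (one-letter codes)


def count_specific_matches(query, segment):
    """Count aligned positions where both sequences share a specific amino acid."""
    return sum(1 for q, t in zip(query, segment) if q == t and q in SPECIFIC)


def best_window(query, target):
    """Return (max matches, 0-based start) over all ungapped query-length windows.

    Targets shorter than the query are skipped, as in the text, and reported
    as (0, None).
    """
    n = len(query)
    if len(target) < n:
        return 0, None
    best, best_pos = -1, None
    for pos in range(len(target) - n + 1):
        matches = count_specific_matches(query, target[pos:pos + n])
        if matches > best:
            best, best_pos = matches, pos
    return best, best_pos


if __name__ == "__main__":
    print(best_window("AGKV", "LLAGRVLL"))
```

No gaps are allowed inside a window, which mirrors the text's condition that positions of specific amino acids coincide only as long as there is no insertion or deletion.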

3. Results
A. Primitive Genes
Table 3 shows the results of the search for primitive genes. It lists the two hits that combine a high alignment score with a high sequence identity and that lie in non-coding regions.


TABLE 4 RESULTS OF THE SEARCH FOR QUASI-HOMOLOG PROTEINS
The numbers of specific amino acids, A, G, P, T, and V, in the overlap regions are shown.

Protein (species) | A | G | P | T | V | SL(a) | SP(b) | p-value
Query: 50S ribosomal protein L32 (M. genitalium) | 3 | 3 | 0 | 2 | 4 | 57 | |
Target: UDP-N-acetylmuramate dehydrogenase (Magnetococcus marinus) | 3/12(c) | 3/10 | 0/1 | 2/3 | 1/4 | 311 | 85 | 3e-07
Target: Collagen alpha-2(I) chain (Human) | 3/7 | 3/19 | 0/10 | 1/2 | 2/3 | 1366 | 834 | 1e-07
Target: Protein smf (Escherichia coli K-12) | 2/8 | 1/8 | 0/0 | 2/4 | 4/7 | 374 | 109 | 6e-07

a Sequence length (aa)
b Start position of the overlap region in the target sequence
c Match/Total

We estimated the p-values as follows. We randomly shuffled a query sequence to generate ten thousand random sequences preserving its base composition. For each random sequence, we performed the search in the same way as described in the Methods section and computed the maximum alignment score between the random sequence and the target genome. Let $n$ be the number of random sequences whose maximum alignment score and identity are both equal to or higher than the values detected for the original query sequence; the p-value is then calculated as $n/10^{4}$.

B. Quasi-Homolog Proteins
Table 4 shows the results of the search for quasi-homolog proteins. The sequence pairs that attain the maximum total number of matches (9) of the specific amino acids Ala (A), Gly (G), Pro (P), Thr (T), and Val (V) are shown. Note that, based on the annotations in UniProt, proteins that are apparently homologous in the usual sense were removed from the results.

We estimated the p-values as follows. Shuffling a whole target sequence, we generated ten million random sequences preserving its amino acid composition. Then, we measured the maximum number of total matches of specific amino acids between the query sequence and each random sequence in the same way as described in the Methods section. The p-value is calculated as $k/10^{7}$, where $k$ is the number of random sequences whose maximum number of matches is equal to or greater than nine.

The annotations of the proteins in Table 4 indicate that they have different functions from each other. To investigate the relationships between their functions further, we examined the three-dimensional structures of the proteins. Although none of them has been experimentally determined, some have been predicted by homology modelling in the SWISS-MODEL Repository[2]. Fig.6 shows the predicted 3D structures of the proteins except for Collagen alpha-2(I) chain. They look quite different, and they have different numbers of α-helices and β-strands. This appears to confirm that the three proteins have significantly different functions from each other.
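Both p-value estimates follow the same shuffle-and-count pattern, which can be sketched generically in Python. The function name `empirical_p_value`, its parameters, and the fixed random seed are assumptions made for this illustration. The caller supplies `statistic`, for example the maximum alignment score against the genome or the maximum number of specific-amino-acid matches against the query. Note that the primitive-gene estimate in the text requires both the score and the identity to reach the observed values, whereas this sketch compares a single statistic.

```python
import random


def empirical_p_value(observed, sequence, statistic, n_shuffles=10_000, seed=0):
    """Fraction of composition-preserving shuffles scoring at least `observed`.

    `statistic` maps a sequence string to a number. Shuffling permutes the
    symbols of `sequence`, so the base or amino acid composition is unchanged.
    """
    rng = random.Random(seed)
    symbols = list(sequence)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(symbols)
        if statistic("".join(symbols)) >= observed:
            hits += 1
    return hits / n_shuffles


if __name__ == "__main__":
    # Toy statistic: number of positions where the shuffle agrees with "AGTV".
    agree = lambda s: sum(a == b for a, b in zip(s, "AGTV"))
    print(empirical_p_value(4, "AGTV", agree, n_shuffles=1000))
```

With ten thousand or ten million shuffles, the smallest nonzero estimate is 1e-04 or 1e-07, respectively, which bounds the resolution of p-values obtained this way.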

Fig.6 Predicted 3D structures: (1) 50S ribosomal protein L32 (1vsaY 1-52), (2) UDP-N-acetylmuramate dehydrogenase (1hskA 72-128), and (3) Protein smf (3uqzA 28-84). Model protein names and their displayed regions are shown in parentheses.


4. Conclusion
We searched for the relics of primitive codons in two forms, primitive genes and quasi-homolog proteins. Two regions in the genome of A. thaliana were obtained as candidates for primitive genes, and three pairs of proteins were detected as candidates for quasi-homolog proteins. The estimated p-values of the results for primitive genes are $9.9 \times 10^{-3}$ and $2.6 \times 10^{-3}$ (Table 3) for 50S ribosomal protein L32 and 50S ribosomal protein L36, respectively. On the other hand, those for quasi-homolog proteins are at most $10^{-6}$ for the three pairs of proteins (Table 4). These small p-values indicate that the search results are statistically significant.

One notable feature that distinguishes quasi-homolog proteins from ordinary homologous proteins is that the former may have quite different functions from each other. The predicted 3D structures, as well as the annotations, of the detected candidates for quasi-homolog proteins are very dissimilar. This observation strongly suggests that the proteins have functions significantly different from each other.

Moreover, it is worth noting that 50S ribosomal protein L32 appears among the candidates in both cases. This result makes the protein and the corresponding gene one of the prime candidates for the relics of primitive codons, although a final decision requires further analysis of the relationships between the query and the target proteins in terms of their functions, the mutual relationships between the genes from which the proteins originate, and so on.

References
[1] GenBank. http://www.ncbi.nlm.nih.gov/.
[2] SWISS-MODEL Repository. http://swissmodel.expasy.org/repository.
[3] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins, (1970) J. Mol. Biol., vol. 48, pp. 443-453.
[4] UniProt. http://www.ebi.ac.uk/uniprot/.
