

"Politehnica" University of Bucharest Bio-Medical Engineering Centre pcristea@dsp.pub.ro

Abstract: Converting DNA sequences into digital signals using a base four representation of the nucleotides converts the codons into numbers in the range 0-63 and the amino acids, together with the terminator, into numbers in the range 0-20. This transforms the symbolic DNA sequences into digital genetic signals of nucleotides, codons or amino acids and makes it possible to apply a whole range of powerful signal processing methods to their analysis. The paper proposes a new representation of the Genetic Code that better reflects its structure and degeneracy. Optimal symbolic-to-digital mappings for nucleotides and amino acids are proposed on this basis, and some preliminary investigation of the resulting genetic signals is described. The use of Independent Component Analysis (ICA) for identifying control sequences in the DNA that does not directly code for proteins is proposed.

Keywords: Genetic Code, Genetic Signals, Genome Analysis, Projection Pursuit, ICA

Special attention is given to justifying the choice of a "natural" correspondence between the nucleotides and the digits in base four (Thymine = 0, Cytosine = 1, Adenine = 2, Guanine = 3).

Independent Component Analysis (ICA) is a special case of the Blind Source Separation (BSS) method. Its goal is to recover statistically independent source signals from some available linearly mixed signals produced by an unknown medium. Many applications are actively being developed, including speech recognition, telecommunications, and bio-medical signal and image processing. For complex real-life problems, the computational load becomes excessive, especially because unknown relative shifts of the independent components have to be considered. The paper proposes an ICA approach to the search for independent components in the extra-genic DNA [4].

The almost complete sequencing of the human genome, as well as the public access to most of its content [1, 2], offer tremendous opportunities to explore its content in depth and to data mine this unique information depository. The classic approach of representing DNA as symbolic sequences of nitrogenous bases, or of symbolic codons encoding polypeptide chains, essentially limits the methodology of handling the information to pattern matching procedures. Converting the DNA sequences into digital signals using a base 4 representation of the nucleotides converts the codons into numbers in the range 0-63 and the amino acids, together with the terminator, into numbers in the range 0-20. DNA sequences thus become digital genetic signals, which opens the possibility of applying a whole range of powerful signal processing methods to their analysis.

Currently, only about 32,000 genes containing the instructions to make proteins, but representing less than 5 percent of the human genome, are considered of interest. The vast majority of the genome is considered junk [3], as it has been discovered that it contains a large amount of mobile (transposable) elements that bear a close resemblance to the DNA of independent entities like viruses and bacteria. Like the mitochondria, which also started their ancestral life as independent entities and became the main energy suppliers of the eukaryotic cells, a significant part of the extra-genic chromosomal DNA very probably has an important role in the control of protein synthesis.

The Genetic Code is universal, as it is used by all known organisms, with only small variations in mitochondria and certain microbes. In any case, the Genetic Code applies to all known nuclear genetic material, DNA, mRNA and tRNA, and encompasses animals (including humans), plants, fungi, archaea, bacteria, and viruses. The main genetic material is represented by the DNA molecules, which have a basically simple linear un-branched structure formed by nucleotide chains. The repetitive unit, the nucleotide, has three components: phosphate, deoxyribose and a nitrogenous base. Only four kinds of nitrogenous base are found in DNA: thymine (T) and cytosine (C), which are pyrimidines, and adenine (A) and guanine (G), which are purines. The DNA molecule has two complementary chains that form a double helix in which a pyrimidine in one chain faces a purine in the other, only the base pairs T-A and C-G existing.

Proteins are the main contributors to cell structure and, as enzymes, catalyze the chemical reactions specific to the functioning of the cells. The primary structure of a protein is given by polypeptide chains formed of amino acid sequences. The coiling (secondary), folding (tertiary) and aggregation (quaternary) of the polypeptides generate the complex spatial structure of a protein, essential for its functioning. There are only twenty different amino acids in the proteins. A sequence of three nitrogenous bases encodes an amino acid according to the Genetic Code in two steps: transcription, in which one strand of DNA is copied into a complementary mRNA (messenger) molecule, and translation, in which the language of nitrogenous bases is transformed by ribosomes into the language of amino acids.

Only certain limited regions of the genome, the genes, give information to make proteins. Human genes are few and far apart. There are about 12 genes per million bases of human DNA. Genes are divided into exons, the sections of the coding sequence, interrupted by introns, the noncoding spacers. Human genes have many small exons, some just 19 bases long, separated by introns of an average length of about 3,300 bases, but with a large dispersion. Most introns are only 87 bases long, while some are over 10,000. Genes are rich in C and G, while noncoding DNA sequences are rich in T and A. The Genetic Code is given in Table 1 [5], while Table 2 lists the amino acids. There are 64 codons, out of which 61 encode the 20 amino acids, while 3 correspond to terminators, the "end" sequences. Consequently, there is a degeneracy of the genetic code, most amino acids being inserted into a growing polypeptide chain in response to two or more different triplets in the mRNA. Fig. 1 gives the classic 3D Cartesian representation of the codons and their translation to amino acids.

Table 1. Genetic Code

Table 2. Amino acid short names

Ala   Alanine
Arg   Arginine
Asn   Asparagine
Asp   Aspartic acid
Cys   Cysteine
Gln   Glutamine
Glu   Glutamic acid
Gly   Glycine
Ile   Isoleucine
His   Histidine
Leu   Leucine
Lys   Lysine
Met   Methionine
Phe   Phenylalanine
Pro   Proline
Ser   Serine
Thr   Threonine
Trp   Tryptophan
Tyr   Tyrosine
Val   Valine
Ter   Terminator

This representation does not capture the characteristic symmetries and degeneracy of the Genetic Code. We propose instead the tetrahedron representation in Fig. 2. Each nitrogenous base defines a direction in the representation space, towards one of the corners of a tetrahedron.

Figure 1. Classic Cartesian representation of the Genetic Code


Figure 2. Tetrahedral representation of the Genetic Code

The first base in a codon selects one of the four first order 16-codon tetrahedrons that compose the zero order tetrahedron of the overall Genetic Code. The second base in the codon selects one of the second order 4-codon tetrahedrons that compose the first order tetrahedron. Finally, the third base identifies one of the vertices. Degeneracy is basically restricted to the second order tetrahedrons, and most pairs of interchangeable bases are distributed on the edges along the pyrimidine and purine directions. This construction also has the advantage of naturally suggesting the putative ancestral coding sequences by simply passing to a lower level tetrahedron.

The existence of four different nitrogenous bases strongly suggests mapping the nucleotides to the digits {0, 1, 2, 3} and interpreting the three-base codons as three-digit numbers written in base four, thus mapping the codons along the linear DNA strands to the numbers {0, 1, 2, ..., 63}. Actually, a whole DNA sequence can be seen as a huge number written in base four. Nevertheless, it is more natural to interpret each codon as a distinct sample of a digital genetic signal distributed along the DNA strands. There are 4! = 24 distinct choices for attaching the digits 0-3 to the bases A, C, G, T. The optimal choice, given in Table 3, results from the condition of a minimally non-monotonous correspondence between the codons 0-63 and the amino acids plus the terminator 0-20 (see Fig. 7), which leads to the best auto-correlated extra-genic genetic signals.

Table 3. Mapping of Nucleotides to Digits in Base Four

Pyrimidines   Thymine = T = 0   Cytosine = C = 1
Purines       Adenine = A = 2   Guanine = G = 3

Figures 3 to 6 show the four first order 16-codon tetrahedrons, the numerical codes attached to the vertices, i.e., to the codons, and the encoded amino acids. It can be noticed that there are one to one correspondences. Correspondingly, Tables 3 to 6 also give the numerical codes which result for the amino acids from the order of their first reference. There are only two one codon - one amino acid (non-degenerate) mappings, for Tryptophan and Methionine, but ten double, three triple, six quadruple, and two sextuple degeneracies. From the frequency of the amino acids in the proteins, it follows that the Genetic Code has the features of an entropic coding. On the other hand, a higher degeneracy suggests an ancestral amino acid. This allows building models of ancestral proteins. Table 7 summarizes the proposed optimal correspondence of numerical codons to amino acids, while Fig. 7 represents the dependence of the numeric codes attached to the amino acids on the numeric codons.

Figure 3. Symbolic to Digital Mapping of Codons -- Thymine Tetrahedron

Table 3. Codes of the Amino Acids in the Thymine Tetrahedron

Figure 4. Symbolic to Digital Mapping of Codons -- Cytosine Tetrahedron

Table 4. Codes of the Amino Acids in the Cytosine Tetrahedron

Figure 5. Symbolic to Digital Mapping of Codons -- Adenine Tetrahedron

Table 5. Codes of the Amino Acids in the Adenine Tetrahedron

The minimum non-monotonic dependence has only three reversals of the normal order: for a terminator sequence and for the two sextuple degeneracies, serine and arginine.
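The base-four reading of codons described in the text can be illustrated with a minimal Python sketch. It assumes the digit assignment of Table 3 (T = 0, C = 1, A = 2, G = 3); the function names and the `frame` parameter are hypothetical and introduced here only for illustration.

```python
# Assumed digit assignment, taken from Table 3: T=0, C=1, A=2, G=3.
BASE_TO_DIGIT = {"T": 0, "C": 1, "A": 2, "G": 3}


def codon_to_number(codon: str) -> int:
    """Read a three-base codon as a three-digit base-four number (0-63)."""
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    value = 0
    for base in codon.upper():
        # Shift the digits read so far one base-four place to the left.
        value = 4 * value + BASE_TO_DIGIT[base]
    return value


def codon_signal(dna: str, frame: int = 0) -> list[int]:
    """Split a DNA string into consecutive codons and map each to 0-63.

    `frame` (hypothetical) is the reading frame offset; a trailing
    incomplete codon is dropped.
    """
    dna = dna.upper()
    return [
        codon_to_number(dna[i:i + 3])
        for i in range(frame, len(dna) - 2, 3)
    ]


print(codon_signal("ATGTTTGGGTAA"))  # one sample per codon
```

Each codon becomes one sample of the signal, so a sequence of 3N bases yields N samples; the choice of reading frame is left to the caller.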


Table 7. Proposed Optimal Correspondence of Numerical Codons to Amino Acids

Figure 6. Symbolic to Digital Mapping of Codons -- Guanine Tetrahedron

Table 6. Codes of the Amino Acids in the Guanine Tetrahedron


An exhaustive search over all 24 possible correspondences between the nitrogenous bases and the digits 0-3 has shown that no more monotonic dependence exists. The proposed coding gives a piece-wise constant function, with only the three mentioned reversals of the order.
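A search of this kind can be sketched as follows. This is not the procedure used in the paper, only one possible reading of it, under several assumptions: amino acids are numbered by order of first appearance along the codon numbers 0-63, a "reversal" is counted whenever the amino acid code decreases from one codon to the next, and the standard genetic code is written with one-letter amino acid symbols ('*' for the terminator). Under a different definition of reversal the ranking may differ, so the sketch is not guaranteed to reproduce the result stated above.

```python
from itertools import permutations

# Standard genetic code with bases ordered T, C, A, G (first base varies
# slowest); one-letter amino acid symbols, '*' marks a terminator.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reversals(order: str) -> int:
    """Count decreases in the codon number -> amino acid code dependence.

    `order` lists the bases assigned to the digits 0, 1, 2, 3. Amino acids
    are numbered by first appearance while walking codon numbers 0..63
    (an assumed reading of "order of their first reference").
    """
    codes: dict[str, int] = {}
    previous = -1
    count = 0
    for n in range(64):
        # Decode the three base-four digits of n back into a codon.
        codon = order[n // 16] + order[(n // 4) % 4] + order[n % 4]
        code = codes.setdefault(CODE[codon], len(codes))
        if code < previous:
            count += 1
        previous = code
    return count


# Evaluate all 4! = 24 base-to-digit assignments, fewest reversals first.
ranking = sorted(
    (reversals("".join(p)), "".join(p)) for p in permutations(BASES)
)
for count, order in ranking[:5]:
    print(order, count)
```

The string passed to `reversals` names the bases in digit order, so `"TCAG"` corresponds to the assignment of Table 3.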

Figure 7. Proposed Optimal (Minimally Non-Monotonous) Correspondence of Numerical Codons to Amino Acid Codes

Figure 8. An excerpt from a Codon Digital Genetic Signal


Fig. 8 represents an excerpt of 150 samples from a Codon Digital Genetic Signal, while Fig. 9 shows the corresponding Amino Acid Digital Genetic Signal. It is significant that the genetic signals built from genes show low auto-correlation, even for neighboring samples. This is a feature usually associated with noise and is consistent with the fact that the functionality of a protein is not given directly by its first order structure, i.e., the sequence of amino acids, but by its higher order spatial structure. On the other hand, the extra-genic genetic signals obtained from non-coding DNA sequences have many features typical of piecewise smooth "natural" signals, such as a good correlation of close neighbors that decreases abruptly with the distance.
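The auto-correlation property discussed above can be checked on any numeric signal with a short Python sketch. The normalized sample estimator below is an assumption, since the text does not say which estimator was used, and the two test signals are synthetic stand-ins rather than genetic data.

```python
def autocorrelation(signal: list[float], lag: int) -> float:
    """Normalized sample autocorrelation of `signal` at a non-negative lag."""
    n = len(signal)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(signal)")
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    energy = sum(c * c for c in centered)
    if energy == 0:
        raise ValueError("autocorrelation is undefined for a constant signal")
    # Correlate the centered signal with a copy of itself shifted by `lag`.
    cross = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return cross / energy


# Synthetic stand-ins: a smooth ramp and a rapidly alternating sequence.
smooth = [float(i) for i in range(100)]
alternating = [float(i % 2) for i in range(100)]
print(autocorrelation(smooth, 1), autocorrelation(alternating, 1))
```

Applied to a codon or amino acid signal, values close to zero at lag 1 would correspond to the noise-like behavior described for genes, while values close to one would correspond to the smooth behavior described for extra-genic sequences.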

The paper proposes the Tetrahedron Representation of the Genetic Code, which better reflects its structure and degeneracy. Optimal symbolic-to-digital mappings for nucleotides and amino acids are proposed on this basis. Some features of the resulting genetic signals are described. It is suggested that applying the Projection Pursuit approach, specifically Independent Component Analysis (ICA), to the genetic signals derived from extra-genic DNA sequences, which do not encode proteins, could reveal signals that control the functioning of the genes, i.e., the synthesis of the proteins.

[1] Venter, J.C. et al., A New Strategy for Genome Sequencing, NATURE, 381, (May 30, 1996), pp. 364-366.
[2] Venter, J.C. et al., Shotgun Sequencing of the Human Genome, SCIENCE, 280, (June 5, 1998), pp. 1540-1542.
[3] Gee, H., Junk Science, Draft of A Journey into the Genome: What's There, NATURE, www.nature.com.
[4] Cristea, P., Independent Component Analysis for Genetic Signals, SPIE Conference BiOS 2001 International Biomedical Optics Symposium, SC316, Short Course, San Jose, USA, 20-26 January 2001.
[5] Venter, J.C. et al., Draft Analysis of the Human Genome by Celera Genomics, SCIENCE, 291, (16 February 2001), pp. 1304-1351, www.sciencemag.org.
[6] Myers, E.W. et al., A Whole-Genome Assembly of Drosophila, SCIENCE, 287, (March 24, 2000), pp. 2196-2204.
[7] Doolittle, W.F., Phylogenetic classification and the universal tree, SCIENCE, 284, (June 25, 1999), pp. 2124-2128.
[8] Andersson, J.O. & Nesbø, C.L., Are there bugs in our genome?, SCIENCE EXPRESS, (May 17, 2001).
[9] Davis, R.H., Weller, S.G., The Gist of Genetics, Jones & Bartlett Publishers, 1996, 1998.

Figure 9. The Amino Acid Digital Genetic Signal corresponding to Codons in Fig. 8
