Fig. 2 – Prediction of exon regions in the DNA sequence BOVLYSOZMB using classic Viterbi and ORF sliding window

Fig. 3 – Prediction of exon regions in the DNA sequence ACU08131 using classic Viterbi and ORF sliding window

The advantage of this method is that the exons are predicted with some degree of confidence even when the DNA sequence contains errors (some missing base pairs in the first exon, for example), because missing one exon should not affect the prediction of the following exons.

As can be seen in Tables 3, 4, 5, and 6, the method that employs the ORF sliding window outperforms the classic method, in which the Viterbi algorithm is applied once to each sequence. The DNA sequences used for the comparison are ACU08131, AGGGLINE, AMU12025, AOIRHODOPS, and BOVLYSOZMB.

Average sensitivity at the nucleotide level for these five sequences is 70% for the ORF sliding window method, compared with 46% for the classic method.

Table 3 – Nucleotide accuracy when using Viterbi algorithm

Sequence      Sn     Sp     CC     P(All)
ACU08131      0.96   1      0.97   0.99
AGGGLINE      0.71   0.97   0.82   0.98
AMU12025      0.32   0.97   0.51   0.86
AOIRHODOPS    0.34   0.50   0.36   0.89
BOVLYSOZMB    0      0      0.02   0.95
AVERAGE       0.46   0.69   0.54   0.94

Table 4 – Nucleotide accuracy when using ORF sliding window of length 2500 bp

Sequence      Sn     Sp     CC     P(All)
ACU08131      1      0.96   0.97   0.99
AGGGLINE      1      0.99   0.99   0.99
AMU12025      0.56   0.87   0.64   0.90
AOIRHODOPS    0.38   0.46   0.36   0.89
BOVLYSOZMB    0.54   0.75   0.62   0.97
AVERAGE       0.70   0.81   0.72   0.95

Table 5 – Exact exon accuracy when using Viterbi algorithm

Sequence      Sn     Sp     1ME    Ov
ACU08131      0.83   0.83   1      1
AGGGLINE      0.67   0.67   0.67   0.67
AMU12025      0.33   0.67   0.33   0.33
AOIRHODOPS    0.20   0.17   0.60   0.60
BOVLYSOZMB    0      0      0      0
AVERAGE       0.41   0.47   0.52   0.52

Table 6 – Exact exon accuracy when using ORF sliding window of length 2500 bp

Sequence      Sn     Sp     1ME    Ov
ACU08131      0.83   0.71   1      1
AGGGLINE      0.67   0.67   1      1
AMU12025      0.33   0.33   0.83   0.83
AOIRHODOPS    0.40   0.33   0.60   0.60
BOVLYSOZMB    0.25   0.25   0.75   0.75
AVERAGE       0.50   0.46   0.84   0.84

The ORF sliding window method is computationally intensive because the Viterbi algorithm has to be applied to every ORF window of the same sequence (for example, sequence BOVLYSOZMB has 168 ORF windows). The ORF sliding window method also tends to give more false positives than the classic method.

6. CONCLUSIONS

In this study a new method was introduced for the computational identification of exons (protein coding regions). The method provides a general framework for finding the exons in a given anonymous DNA sequence. It employs the ORF sliding window to predict the exons with higher accuracy than the method that applies the Viterbi algorithm only once to the entire DNA sequence. The method can also be adapted and used as a starting point for other gene-finding methods.

Demonstrating the full potential of the new method in terms of accuracy requires further testing on a large number of DNA sequences. For the few sequences used here to find exons, the ORF sliding window method improves accuracy over the method in which the Viterbi algorithm is applied once to an entire DNA sequence.

7. REFERENCES

[1] M. Burset, R. Guigo, "Evaluation of gene structure prediction programs", Genomics, Vol. 34, pp. 353-367, 1996.

[2] L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, Vol. 77(2), pp. 257-286, 1989.

[3] J. Henderson, S. Salzberg, K.H. Fasman, "Finding Genes in DNA with a Hidden Markov Model", Journal of Computational Biology, Vol. 4(2), pp. 127-142, 1997.

[4] Supplementary data-sets for M. Burset and R. Guigo, "Evaluation of gene structure prediction programs", [1], <http://www1.imim.es/datasets/genomics96/>

[5] R.V. Davuluri, I. Grosse, M.Q. Zhang, "Computational identification of promoters and first exons in the human genome", Nature Genetics, Vol. 29, pp. 412-417, 2001.

[6] S. Salzberg, A.L. Delcher, K.H. Fasman, J. Henderson, "A Decision Tree System for Finding Genes in DNA", Journal of Computational Biology, Vol. 5(4), pp. 667-680, 1998.

[7] P. Bernaola-Galvan, I. Grosse, P. Carpena, J.L. Oliver, R. Roman-Roldan, H.E. Stanley, "Finding Borders between Coding and Noncoding DNA Regions by an Entropic Segmentation Method", Physical Review Letters, Vol. 85(6), pp. 1342-1345, 2000.

[8] A.A. Mironov, J.W. Fickett, M.S. Gelfand, "Frequent Alternative Splicing of Human Genes", Genome Research, Vol. 9, pp. 1288-1293, 1999.

[9] S.E. Cawley, Statistical models for DNA sequencing and analysis, Ph.D. Thesis, Univ. of California, Berkeley, 2000.

[10] A.I. Wirth, A Plasmodium Falciparum Genefinder, Honours research project, Department of Mathematics and Statistics, University of Melbourne, Parkville, 1999.

[11] M. Pertea, X. Lin, S.L. Salzberg, "GeneSplicer: a new computational method for splice site prediction", Nucleic Acids Research, Vol. 29(5), pp. 1185-1190, 2001.
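
The Sn, Sp, and CC columns of the nucleotide accuracy tables (Tables 3 and 4) are not defined in this section. As a minimal sketch, assuming they follow the usual nucleotide-level definitions of Burset and Guigo (Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the correlation coefficient between actual and predicted coding labels), they could be computed as shown below. The function name, the 0/1 label encoding, and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not details taken from the study.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (Sn, Sp, CC) for two equal-length label sequences.

    Each element is truthy for a coding (exon) nucleotide and falsy otherwise.
    Assumed definitions (not confirmed by the text of this section):
      Sn = TP / (TP + FN)
      Sp = TP / (TP + FP)
      CC = (TP*TN - FN*FP) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    # Count the four outcomes nucleotide by nucleotide.
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    # Illustrative convention: report 0.0 when a ratio is undefined.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    # Toy 10-nucleotide example: the predicted exon is shifted by one base.
    actual = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
    predicted = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
    sn, sp, cc = nucleotide_accuracy(actual, predicted)
    print(round(sn, 2), round(sp, 2), round(cc, 2))  # prints: 0.75 0.75 0.58
```

Under these definitions Sp is the fraction of predicted coding nucleotides that are truly coding, so a method that produces more false positives has a lower Sp even when its Sn is higher.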