
COMPUTATIONAL IDENTIFICATION OF EXONS IN DNA WITH A HIDDEN MARKOV MODEL

Daniel Nicorici, Jaakko Astola, Ioan Tabus

Tampere International Center for Signal Processing


Tampere University of Technology
P.O. Box 553, FIN-33101 Tampere, Finland
E-mail: {Daniel.Nicorici, Jaakko.Astola, Ioan.Tabus}@tut.fi

ABSTRACT

The number of available DNA sequences has been growing fast. Computational methods for finding genes in DNA sequences exist, but more accurate algorithms are still needed. This study describes a new method for finding protein-coding regions in anonymous sequences of DNA; more specifically, it identifies the exon regions. The idea of the new method is to use the framework of Hidden Markov Models together with the Viterbi algorithm applied on an ORF sliding window. The new method was evaluated on a few vertebrate sequences. An analysis of the accuracy of the new method is presented, and it is shown that the new method has a higher accuracy than the method which uses the Viterbi algorithm once for an entire DNA sequence.

1. INTRODUCTION

The computational identification of genes is a major goal for molecular biology, especially for the Human Genome Project. One of the main goals of the Human Genome Project is to provide a complete list of annotated genes which will be used in biomedical research.

The reliable detection of genes is critical for the success of computational gene discovery from genomic sequences. Also, methods that can reliably identify genes in anonymous sequences of DNA, generated by genome projects, can speed up the process.

A number of such methods exist, but their predictive performance for finding genes is still not satisfactory [1]. In general, there are four types of features that must be found to identify a protein-encoding gene in DNA: the start of the gene (codon ATG), the end of the gene (codons TAA, TGA, or TAG), the donor site, and finally the acceptor site in each exon region. In most eukaryotes the 5' boundary of an intron (donor site) usually contains the dinucleotide GT, and the 3' boundary (acceptor site) contains the dinucleotide AG. Reducing the number of potential sites considered by a gene finder algorithm significantly reduces the number of alternative ways of parsing a DNA sequence into exons and introns, and therefore makes the overall gene prediction easier.

2. HIDDEN MARKOV MODEL

Hidden Markov Models (HMMs) provide a good probabilistic method for modeling discrete sequences of data such as DNA sequences (alphabet of four letters: A, C, G, and T).

The Hidden Markov Model system developed in this study incorporates several distinct smaller HMMs that model (capture) the exon, intron, intergenic, start-gene, and stop-gene regions. A more detailed description of HMMs and of HMMs in gene finding can be found in Rabiner's paper [2] and Henderson's paper [3]. The HMM models for splice sites are simple chains with multiple stages. In principle, the exon, intron, donor, and acceptor models used in this study are very similar to the ones in Henderson's paper [3]. The donor model has 9 stages (the first 3 belong to the exon region and the last 6 belong to the intron region). The acceptor model has 15 stages (the first 14 belong to the intron region and the last one belongs to the exon region). The start-gene model differs from the one in Henderson's paper [3]: it has 15 stages (the first 9 belong to the intergenic region and the last 6 belong to the exon region). All smaller HMMs are integrated into a combined HMM as shown in Figure 1.

Fig. 1 – HMM system combining smaller HMMs
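The chain-structured site models described above can be illustrated with a short sketch. It assumes that each stage emits one nucleotide from its own distribution and that these distributions are estimated as relative frequencies from aligned example sites. The names `SITE_LAYOUT`, `train_chain`, and `log_score`, the pseudocount, and the toy donor sequences are hypothetical and are not taken from the paper; only the stage counts come from the text.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

# Stage counts taken from the text: (stages before the boundary, stages after it).
SITE_LAYOUT = {
    "donor": (3, 6),      # 3 exon stages followed by 6 intron stages
    "acceptor": (14, 1),  # 14 intron stages followed by 1 exon stage
    "start": (9, 6),      # 9 intergenic stages followed by 6 exon stages
}


def train_chain(aligned_sites, pseudocount=1.0):
    """Estimate one emission distribution per stage from aligned example sites.

    The pseudocount is an assumption of this sketch; it only keeps unseen
    nucleotides from receiving probability zero.
    """
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all training sites must have the same length")
    total = len(aligned_sites) + pseudocount * len(ALPHABET)
    chain = []
    for pos in range(length):
        counts = Counter(site[pos] for site in aligned_sites)
        chain.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return chain


def log_score(chain, segment):
    """Log-probability of a segment passing through the chain stage by stage."""
    if len(segment) != len(chain):
        raise ValueError("segment length must equal the number of stages")
    return sum(math.log(stage[base]) for stage, base in zip(chain, segment))


# Toy donor examples (invented for illustration): 3 exon bases, then GT and 4 intron bases.
toy_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA"]
donor_chain = train_chain(toy_donors)
```

With such a chain, a candidate segment that carries GT at the exon/intron boundary scores higher than one that does not, which is the behaviour a splice-site sub-model needs before it is embedded in the combined HMM.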
The HMM parameters were set to the frequencies of occurrence of the respective events in the training set, and no retraining method, such as Baum-Welch, has been used. Even without a retraining algorithm the prediction accuracy is good (Tables 1 and 2). After the initialization step, new sequences are parsed using the Viterbi algorithm.

The Viterbi algorithm is a dynamic programming algorithm which finds the most likely sequence of hidden states through the model for the given sequence, and computes the probability of the model producing the sequence via that path. Because the combined HMM contains all states representing start-gene, splice sites, and end-gene, this alignment directly gives where the exons start and end. The Viterbi algorithm was modified to build only paths which contain genes whose length is a multiple of three (i.e. the stop codon is in the reading frame). By default, the Viterbi algorithm gives the single best alignment of a sequence for the respective model.

3. DATA SET OF DNA SEQUENCES

In this study a data sub-set of 444 sequences of vertebrate DNA is used. It is extracted from the database of Burset and Guigo [1, 4], which consists of 570 complete sequences of vertebrate DNA. Each sequence contains exactly one gene with at least one intron.

If the start codon (ATG) occurs very close to the donor site (GT) (for example, in the extreme case ATG GT), the coding portion may be too small (or may not exist at all) to be identified by currently available gene-finding algorithms. In such cases there is no way of knowing where the promoter is located or where the 5' end of the gene is likely to reside without further experimental data, such as full-length 5' UTR sequences.

We remove from the data set of Burset and Guigo [1, 4] the DNA sequences in which the intergenic region upstream or downstream of the gene is shorter than 30 bp (base pairs), and those which contain exons shorter than 30 bp, based on the considerations presented in Davuluri's paper [5]. After removal, the data set contains 444 DNA sequences, each containing one gene. The HMM system developed in this study is used on this data set of 444 sequences, which were randomly permuted. Then the data set is partitioned into four sets of 89 sequences and one set of 88 sequences. The HMM system is initialized from four sets and tested on the remaining one.

Table 1 – Nucleotide accuracy for new HMMs and VEIL

Model       No. of seq. in data set   Sn     Sp     CC     P(All)
New HMMs    444                       0.76   0.75   0.71   0.92
VEIL        570                       0.74   0.72   0.68   0.92

The results obtained with our HMMs are presented in Tables 1 and 2, and they compare favorably with the results of other gene finders [1], especially with VEIL [3]. The final results in Tables 1 and 2 are the averages of the results obtained from the training and test sets for each of the five set partitions of the cross validation.

Table 2 – Exact exon accuracy for new HMMs and VEIL

Model       No. of seq. in data set   Sn     Sp     1ME    Ov
New HMMs    444                       0.52   0.56   0.68   0.73
VEIL        570                       0.46   0.50   0.63   0.70

Each parse of a DNA sequence was scored against its true state parse using counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In Table 1, the sensitivity (Sn) for nucleotides is defined as the fraction of coding nucleotides correctly labeled as coding. Specificity (Sp) for nucleotides is the fraction of nucleotides labeled as coding that were really coding. P(All) is the fraction of all bases predicted correctly. The correlation coefficient (CC) is defined as:

$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}}$$

In Table 2, sensitivity (Sn) for exons is the fraction of whole exons which are predicted exactly. Specificity (Sp) for exons is the fraction of predicted exons that are exactly correct. One matching edge (1ME) is the fraction of exons for which the algorithm predicted at least one of the edges exactly. Ov is the fraction of exons for which we predicted overlapping exons (including exact matches).

As can be seen in Tables 1 and 2, the reduced data set with 444 DNA sequences has the same characteristics as the one with 570 sequences from the study of Burset and Guigo.

4. ORF SLIDING WINDOW METHOD

Here a new method is presented which brings an improvement specifically in finding and locating the exons. For this, a sliding window is used together with the Viterbi algorithm. The sliding windows are chosen so that each contains an ORF (Open Reading Frame, with at least one start and one stop codon) whose length is as close as possible to 2500 bp, but not less (plus 15 bp before the first start codon and 15 bp after the last stop codon). The reason for taking sliding windows which form ORFs, rather than classical sliding windows, is to reduce the computational cost of the method, which is quite high even in simpler cases. In this way, the number of sliding windows to which the Viterbi algorithm is applied is reduced dramatically compared with the number of non-restricted sliding windows. The high content of start codons (ATG) and stop codons (TAA, TAG, TGA), especially in protein coding regions, ensures that enough ORF sliding windows are available for applying the Viterbi algorithm, and in this way the entire DNA sequence is
covered as uniformly as possible. The major problem for other gene-finding algorithms is the high number of possible start and stop codons, while this high number is an advantage for our method using ORF sliding windows.

This method brings an improvement over the classical method, where the Viterbi algorithm is applied only once for the entire sequence, while in our method the Viterbi algorithm is applied many times for the same DNA sequence. Because it is applied many times to regions of the same DNA sequence, which may overlap partially (ORF sliding windows), it is possible to track which regions have been predicted as exons, and how many times. Thus, if an exon is missed in one window, it may still be predicted in the following ones. If a certain region has been predicted to be an exon many times, that region has a high probability of being an exon. With this method the performance of finding exons can be improved. The drawback is that an entire gene cannot be directly predicted, because the first exon or the last exon of the gene may be missed. In order to predict the entire gene from an anonymous sequence of DNA, other methods have to be employed.

This method for finding exons makes it possible to estimate with some degree of confidence which regions are more likely to be exons than others. In this way, the exons are located taking into consideration the neighbouring regions, and not only the splice sites. Also, the information regarding which regions are more likely to be exons, and which splice sites are more likely to take part in transcription, can be employed in future methods for locating genes.

Fig. 2 – Prediction of exon-regions in the DNA sequence BOVLYSOZMB using classic Viterbi and ORF sliding window

The advantage of this method is that the exons are predicted with some degree of confidence even when the DNA sequence contains errors (some missing base pairs in the first exon, for example), because missing one exon should not affect the prediction of the following exons of the gene. In Figure 2 it can be seen that even when the second exon is not predicted (with the ORF sliding window method), the prediction of the following exons is not affected.

5. COMPARISONS

The HMM system is initialized over the whole data set, which is composed of 444 DNA sequences. It uses an ORF sliding window of length 2500 bp. The prediction of exons is performed for 5 sequences (longer than 5000 bp) chosen from the 444 DNA sequences. The method in which only the Viterbi algorithm is applied is compared to the new method which uses the ORF sliding window. The results for predicting the exons (with both methods) for sequence BOVLYSOZMB are given in Figure 2. The classic Viterbi algorithm, applied to the entire sequence, fails to predict any exon, whereas when the Viterbi algorithm is used with the ORF sliding window, three of the four exons are predicted correctly and only one is missed (its splice sites are still predicted, but with very weak confidence). Surprisingly, the exon which is missed has a length that is a multiple of three and is perhaps skipped because of alternative splicing. The length of the ORF sliding window is in this case smaller than the length of the genes, and the ORF sliding window method should be used for long DNA sequences. Figure 3 shows the predictions of exons for sequence ACU08131, where both methods perform well, showing (at least for this case) that the ORF sliding window method performs equally well when the classic Viterbi algorithm also makes good predictions.

Fig. 3 – Prediction of exon-regions in the DNA sequence ACU08131 using classic Viterbi and ORF sliding window

As can be seen in Tables 3, 4, 5, and 6, the method where the ORF sliding window is employed outperforms the classic method, in which the Viterbi algorithm is used once for every sequence. The DNA sequences used for the prediction comparison are: ACU08131, AGGGLINE, AMU12025, AOIRHODOPS, and BOVLYSOZMB.
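The windowing and counting steps of the ORF sliding window method (Section 4) can be sketched as follows. This is only an illustration under stated assumptions: a window is opened at every ATG in any frame and closed at the nearest stop codon that gives a length of at least `min_len`, and the Viterbi parse of a window is represented by a caller-supplied function `predict_exons`. The function names and the window selection rule are hypothetical; the paper does not specify these details.

```python
import bisect

START_CODON = "ATG"
STOP_CODONS = ("TAA", "TAG", "TGA")


def codon_positions(seq, codons):
    """Indices at which any of the given codons begins (all three frames)."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] in codons]


def orf_windows(seq, min_len=2500, flank=15):
    """Half-open (lo, hi) windows, one per start codon for which a stop exists.

    Assumption of this sketch: the window ends at the nearest stop codon that
    makes the start-to-stop span at least min_len long, and is then extended
    by `flank` bases on both sides (clipped to the sequence).
    """
    stop_ends = [i + 3 for i in codon_positions(seq, STOP_CODONS)]  # ascending
    windows = []
    for start in codon_positions(seq, (START_CODON,)):
        k = bisect.bisect_left(stop_ends, start + min_len)
        if k == len(stop_ends):
            continue  # no stop codon far enough downstream
        end = stop_ends[k]
        windows.append((max(0, start - flank), min(len(seq), end + flank)))
    return windows


def exon_votes(seq_len, windows, predict_exons):
    """Count, per nucleotide, in how many windows it was predicted as exon.

    predict_exons(lo, hi) stands in for one Viterbi parse of the window and
    must return half-open exon intervals in absolute sequence coordinates.
    """
    votes = [0] * seq_len
    for lo, hi in windows:
        for a, b in predict_exons(lo, hi):
            for p in range(max(a, 0), min(b, seq_len)):
                votes[p] += 1
    return votes
```

Regions with many votes correspond to the exons predicted "many times" in Section 4; how the counts are turned into a final exon call is left open here, as it is in the text.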
Sensitivity at nucleotide level for the five sequences above is 70% for the ORF sliding window method, which is better than the 46% obtained with the classic method.

Table 3 – Nucleotide accuracy when using Viterbi algorithm

Sequence      Sn     Sp     CC     P(All)
ACU08131      0.96   1      0.97   0.99
AGGGLINE      0.71   0.97   0.82   0.98
AMU12025      0.32   0.97   0.51   0.86
AOIRHODOPS    0.34   0.50   0.36   0.89
BOVLYSOZMB    0      0      0.02   0.95
AVERAGE       0.46   0.69   0.54   0.94

Table 4 – Nucleotide accuracy when using ORF sliding window of length 2500 bp

Sequence      Sn     Sp     CC     P(All)
ACU08131      1      0.96   0.97   0.99
AGGGLINE      1      0.99   0.99   0.99
AMU12025      0.56   0.87   0.64   0.90
AOIRHODOPS    0.38   0.46   0.36   0.89
BOVLYSOZMB    0.54   0.75   0.62   0.97
AVERAGE       0.70   0.81   0.72   0.95

Table 5 – Exact exon accuracy when using Viterbi algorithm

Sequence      Sn     Sp     1ME    Ov
ACU08131      0.83   0.83   1      1
AGGGLINE      0.67   0.67   0.67   0.67
AMU12025      0.33   0.67   0.33   0.33
AOIRHODOPS    0.20   0.17   0.60   0.60
BOVLYSOZMB    0      0      0      0
AVERAGE       0.41   0.47   0.52   0.52

Table 6 – Exact exon accuracy when using ORF sliding window of length 2500 bp

Sequence      Sn     Sp     1ME    Ov
ACU08131      0.83   0.71   1      1
AGGGLINE      0.67   0.67   1      1
AMU12025      0.33   0.33   0.83   0.83
AOIRHODOPS    0.40   0.33   0.60   0.60
BOVLYSOZMB    0.25   0.25   0.75   0.75
AVERAGE       0.50   0.46   0.84   0.84

The ORF sliding window method is computationally intensive because the Viterbi algorithm has to be applied to every ORF window of the same sequence (for example, sequence BOVLYSOZMB has 168 ORF windows). Also, the ORF sliding window method tends to give more false positives than the classic method.

6. CONCLUSIONS

In this study a new method was introduced for the computational identification of exons (protein coding regions). This method provides a general framework for finding the exons in a given anonymous sequence of DNA and employs the ORF sliding window to predict the exons with a higher accuracy than the method using the Viterbi algorithm only once for the entire DNA sequence. Also, this method can be adapted and used as a starting point for other gene-finding methods.

To establish the full potential of the new method in terms of accuracy, more testing on a large number of DNA sequences is needed. For the few sequences on which exon finding has been tested, the ORF sliding window method brings an improvement in accuracy over the method in which the Viterbi algorithm is applied once for an entire DNA sequence.

7. REFERENCES

[1] M. Burset, R. Guigo, "Evaluation of gene structure prediction programs", Genomics, Vol. 34, pp. 353-367, 1996.

[2] L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition", Proceedings of the IEEE, Vol. 77(2), pp. 257-286, 1989.

[3] J. Henderson, S. Salzberg, K.H. Fasman, "Finding Genes in DNA with a Hidden Markov Model", Journal of Computational Biology, Vol. 4(2), pp. 127-142, 1997.

[4] Supplementary data-sets for M. Burset and R. Guigo, "Evaluation of gene structure prediction programs", [1], <http://www1.imim.es/datasets/genomics96/>

[5] R.V. Davuluri, I. Grosse, M.Q. Zhang, "Computational identification of promoters and first exons in the human genome", Nature Genetics, Vol. 29, pp. 412-417, 2001.

[6] S. Salzberg, A.L. Delcher, K.H. Fasman, J. Henderson, "A Decision Tree System for Finding Genes in DNA", Journal of Computational Biology, Vol. 5(4), pp. 667-680, 1998.

[7] P. Bernaola-Galvan, I. Grosse, P. Carpena, J.L. Oliver, R. Roman-Roldan, H.E. Stanley, "Finding Borders between Coding and Noncoding DNA Regions by an Entropic Segmentation Method", Physical Review Letters, Vol. 85(6), pp. 1342-1345, 2000.

[8] A.A. Mironov, J.W. Fickett, M.S. Gelfand, "Frequent Alternative Splicing of Human Genes", Genome Research, Vol. 9, pp. 1288-1293, 1999.

[9] S.E. Cawley, Statistical models for DNA sequencing and analysis, Ph.D. Thesis, Univ. of California, Berkeley, 2000.

[10] A.I. Wirth, A Plasmodium Falciparum Genefinder, Honours research project, Department of Mathematics and Statistics, University of Melbourne, Parkville, 1999.

[11] M. Pertea, X. Lin, S.L. Salzberg, "GeneSplicer: a new computational method for splice site prediction", Nucleic Acids Research, Vol. 29(5), pp. 1185-1190, 2001.
