Editor-in-Chief
I. Tabus, Tampere University of Technology, Finland
Associate Editors
Jaakko Astola, Finland
Junior Barrera, Brazil
Michael L. Bittner, USA
Michael R. Brent, USA
Yidong Chen, USA
Paul Dan Cristea, Romania
Aniruddha Datta, USA
Bart De Moor, Belgium
Edward R. Dougherty, USA
J. Garcia-Frias, USA
Debashis Ghosh, USA
John Goutsias, USA
Roderic Guigo, Spain
Yufei Huang, USA
Seungchan Kim, USA
John Quackenbush, USA
Jorma Rissanen, Finland
Stephane Robin, France
Contents
Information Theoretic Methods for Bioinformatics, Jorma Rissanen, Peter Grünwald,
Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu
Volume 2007, Article ID 79128, 2 pages
Compressing Proteomes: The Relevance of Medium Range Correlations, Dario Benedetto,
Emanuele Caglioti, and Claudia Chica
Volume 2007, Article ID 60723, 8 pages
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence
Classification, Chris Hemmerich and Sun Kim
Volume 2007, Article ID 87356, 9 pages
Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates,
Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama,
and Wojciech Szpankowski
Volume 2007, Article ID 14741, 11 pages
Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information, Arvind Rao,
Alfred O. Hero III, David J. States, and James Douglas Engel
Volume 2007, Article ID 13853, 13 pages
Splitting the BLOSUM Score into Numbers of Biological Significance, Francesco Fabris,
Andrea Sgarro, and Alessandro Tossi
Volume 2007, Article ID 31450, 18 pages
Aligning Sequences by Minimum Description Length, John S. Conery
Volume 2007, Article ID 72936, 14 pages
MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress,
Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin,
and Andrew S. Torres
Volume 2007, Article ID 43670, 16 pages
Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among
Bacteria, Haruo Suzuki, Rintaro Saito, and Masaru Tomita
Volume 2007, Article ID 61374, 7 pages
Information-Theoretic Inference of Large Transcriptional Regulatory Networks, Patrick E. Meyer,
Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi
Volume 2007, Article ID 79879, 9 pages
NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks,
Petri Kontkanen, Hannes Wettig, and Petri Myllymäki
Volume 2007, Article ID 90947, 11 pages
Editorial
Information Theoretic Methods for Bioinformatics
Jorma Rissanen, Peter Grünwald, Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu
tandem repeats, which are useful in, for instance, genetic profiling. In both cases, the techniques used are based on mutual information.
The objective in the paper by A. Rao et al. is to discover long-range regulatory elements (LREs) that determine
tissue-specific gene expression. Their methodology is based
on the concept of directed information, a variant of mutual
information introduced originally in the 1970s. It is shown
that directed information can be successfully used for selecting motifs that discriminate between tissue-specific and nonspecific LREs. In particular, the performance of directed information is better than that of mutual information.
F. Fabris et al. present an in-depth study of BLOSUM block substitution matrix scores. They propose a decomposition of the BLOSUM score into three components: the mutual information of the two compared sequences, the divergence of the observed amino acid co-occurrence frequencies from the probabilities in the substitution matrix, and the background frequency divergence, which measures the stochastic distance of the observed amino acid frequencies from the marginals of the substitution matrix. The authors show how the result of the decomposition, called the BLOSpectrum, can be used to analyze questions about the correctness of the chosen BLOSUM matrix, the degree of typicality of the compared sequences or their alignment, and the presence of weak or concealed correlations in alignments with low BLOSUM scores.
The paper by J. Conery presents a new framework for
biological sequence alignment that is based on describing
pairs of sequences by simple regular expressions. These regular expressions are given in terms of right-linear grammars,
and the best grammar is found by use of the MDL principle. Essentially, when two sequences contain similar substrings, this similarity can be exploited to describe the sequences with fewer bits. The precise codelengths are determined with a substitution matrix that provides conditional
probabilities for the event that a particular symbol is replaced by another particular symbol. One advantage of such
a grammar-based approach is that gaps are not needed to
align sequences of varying length. The author experimentally
compares the alignments found by his method with those
found by CLUSTALW. In a second experiment, he measures
the accuracy of his method on pairwise alignments taken
from the BAliBASE benchmark.
S. C. Evans et al. explore miRNA sequences based on
MDLcompress, an MDL-based grammar inference algorithm that is an extension of the optimal symbol compression ratio (OSCR) algorithm published earlier. Using MDLcompress, they analyze the relationship between miRNAs,
single nucleotide polymorphisms (SNPs) and breast cancer. Their results suggest that MDLcompress outperforms
other grammar-based coding methods, such as DNA sequitur, while retaining a two-part code that highlights biologically significant phrases. The ability to quantify cost in
bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological
activity.
The partially redundant third position of codons
(protein-coding nucleotide triplets) tends to have a strongly
biased distribution. The amount of bias is known to be
Research Article
Compressing Proteomes: The Relevance of
Medium Range Correlations
Dario Benedetto,1 Emanuele Caglioti,1 and Claudia Chica2
1 Dipartimento di Matematica, Università di Roma La Sapienza, Piazzale Aldo Moro 5, 00185 Roma, Italy
2 Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany
1. INTRODUCTION
Table 1: The eight studied proteomes.

Abbreviation | Organism | Proteome length | Number of proteins
Mj | Methanococcus jannaschii | 448 779 | 1680
Hi | Haemophilus influenzae | 509 519 | 1657
Vc | Vibrio cholerae | 870 500 | 2988
Ec | Escherichia coli | 1 578 496 | 5339
Sc | Saccharomyces cerevisiae | 2 900 352 | 5835
Dm | Drosophila melanogaster | 5 818 330 | 11 592
Ce | Caenorhabditis elegans | 6 874 562 | 17 456
Hs | Homo sapiens | 3 295 751 | 5733
Correlations
As a first approximation to the general trends in residue distribution, we study the co-occurrence of amino acids. More precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a appear at distance k along the whole sequence,

C_k = \frac{1}{20} \sum_a C_{aa}^k,  (1)
Table 2: Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−) is higher than that of not corresponding halves (+).

[Figure 1: Correlation C(k) at distances k of up to 1000 residues, for the proteomes Mj, Vc, Hi, Dm, Ce, and Sc.]
with

C_{aa}^k = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\, \chi(\sigma_{i+k} = a) - f_a^2,  (2)
where N is the sequence length, \chi(\sigma_i = a) is the characteristic function of finding residue a at position i, and f_a is the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that, at distance k, pairs of equal amino acids are more frequent than expected from their frequencies in the proteome. The resulting correlation function for the 8 proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones. Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k as large as 800 or 1000, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute the proteins, the medium range correlations are lost, both in eukaryotes and prokaryotes.
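The pair-correlation statistic of Eqs. (1)-(2) can be sketched directly from the definitions. The following is an illustrative implementation, not the authors' code; the 4-letter toy alphabet and periodic sequence are stand-ins for the 20-letter alphabet and a real proteome:

```python
from collections import Counter

def pair_correlation(seq, k, alphabet):
    """C_k of Eq. (1): the average over the alphabet of C^k_aa, the
    frequency of equal residues at distance k minus the f_a^2 baseline."""
    n = len(seq)
    f = {a: c / n for a, c in Counter(seq).items()}
    total = 0.0
    for a in alphabet:
        # count positions i where residue a occurs at both i and i+k
        co = sum(1 for i in range(n - k) if seq[i] == a and seq[i + k] == a)
        total += co / (n - k) - f.get(a, 0.0) ** 2
    return total / len(alphabet)

# Toy check: a period-4 sequence correlates positively at k = 4
# and negatively at k = 1.
seq = "ACDE" * 50
print(pair_correlation(seq, 4, "ACDE"))
print(pair_correlation(seq, 1, "ACDE"))
```

On real proteomes one would concatenate the protein sequences in their genomic order, since the text notes that randomly permuting the proteins destroys the medium range correlations.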
The medium range correlations imply that, in proteomes, the amino acid distribution of neighboring proteins tends to be more similar than that of distant ones. This fact can be related to the process of duplication, recognized as the dominant force in the evolution of protein function [16]. As protein repeats have been related to duplication at different scales (genome, gene, or exon) [17], it is possible that the amino acid patterns responsible for the observed medium range correlation have the same evolutionary origin.
Due to the correlation definition used, the medium range correlations could be caused either by pairs of amino acids belonging to the same protein or by pairs belonging to different proteins. Therefore, we split the nonlocal correlation into two groups and analyse them separately: interprotein correlations (between 2 contiguous proteins) and intraprotein correlations (inside
Interprot corr− | Interprot corr+
0.050381 | 0.050231
0.045588 | 0.039246
0.063712 | 0.041780
0.080064 | 0.069980
0.032501 | 0.018606
0.095722 | 0.056176
0.122692 | 0.077690
the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves of the same protein and the interprotein correlation between corresponding and noncorresponding halves of two contiguous proteins: first half with first half (corr−) and second half with first half (corr+).
These correlations are defined as follows. Let N_p be the number of proteins, let \phi_i^-(a) and \phi_i^+(a) be the relative frequencies of the residue a in the first and the second half of the i-th protein, respectively, and let \bar{\phi}(a) be the corresponding mean value. We define, for instance,

\gamma_{i,j}^{--} = \frac{1}{20} \sum_a \left(\phi_i^-(a) - \bar{\phi}(a)\right) \left(\phi_j^-(a) - \bar{\phi}(a)\right),  (3)

\gamma_{i,j}^{+-} = \frac{1}{20} \sum_a \left(\phi_i^+(a) - \bar{\phi}(a)\right) \left(\phi_j^-(a) - \bar{\phi}(a)\right).  (4)

We also define

\sigma_i^- = \sqrt{\gamma_{i,i}^{--}}, \qquad \sigma_i^+ = \sqrt{\gamma_{i,i}^{++}},  (5)

C_{intra} = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{\gamma_{i,i}^{+-}}{\sigma_i^- \sigma_i^+},  (6)

C_{inter}^- = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\gamma_{i,i+1}^{--}}{\sigma_i^- \sigma_{i+1}^-}, \qquad
C_{inter}^+ = \frac{1}{N_p - 1} \sum_{i=1}^{N_p - 1} \frac{\gamma_{i,i+1}^{+-}}{\sigma_i^+ \sigma_{i+1}^-}.  (7)
The correlation values in Table 2 have the same trend for all
the proteomes: intraprotein correlation is always higher than
interprotein correlation.
The correlations defined by means of the \gamma_{i,j} are different from the traditional correlation C_{aa}^k, which is the correlation of the symbol a at distance k, where k is the number of residues: we have calculated the corresponding correlation functions of the half-protein frequencies at a distance of k proteins,

C^-(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\gamma_{i,i+k}^{--}}{\sigma_i^- \sigma_{i+k}^-}, \qquad
C^+(k) = \frac{1}{N_p - k} \sum_{i=1}^{N_p - k} \frac{\gamma_{i,i+k}^{+-}}{\sigma_i^+ \sigma_{i+k}^-}.  (8)

[Figure 2: Correlations C(k) at distances k (number of proteins) of up to 30.]

2.2.
In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids defined by an a priori classification. The Shannon entropy was then estimated from the entropies of the blocks of n characters. The authors did not find enough evidence to support the existence of short range correlations between the amino acids of protein sequences.

Conversely, given the above evidence of medium range correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20 amino acids. We calculate C_{ab}^k, the correlation between all amino acid pairs ab at distance k, in the same way we calculate C_{aa}^k in the previous section:

C_{ab}^k = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi(\sigma_i = a)\, \chi(\sigma_{i+k} = b) - f_a f_b.  (9)
The quality of a partition G of the 20 amino acids into N_g groups g_1, ..., g_{N_g} is measured by the total intragroup correlation over medium range distances,

F(G) = \sum_{i=1}^{N_g} \sum_{a,b \in g_i} \sum_{k=1}^{200} C_{ab}^k.  (10)
[Figure 3: Correlation between the 20 amino acids for Hi. Positive (black) and negative (grey) correlations determine amino acid groups.]
Table 3: Amino acid groups for the studied proteomes (Mj, Sc, Hs, ...):
LIFWSY | VMGATP | NQHKRDEC
LIFWNSY | VMQHGATCP | KRDE
LIMFWCY | NQHSTP | KRDE | VGA
VLIMFWNY | HSTC | QKDE | RGAP
its group. If F(G′) > F(G), the algorithm accepts the new partition. Iterating this procedure, we would reach a local maximum, which may not be the absolute maximum. In order to avoid being trapped in a local maximum, the algorithm accepts, with a small probability P, a new partition G′ for which F(G′) ≤ F(G). The value of this probability P slowly decreases to zero as the number of iterations increases, in such a way that the convergence of the algorithm to the absolute maximum of F is guaranteed.
The number and the structure of the groups chosen are those with the highest value of F(G) and represent an equilibrated partition of the 20 amino acids; that is, groups with only one element are not accepted.
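The annealing scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a precomputed matrix corr[a][b] holding the summed correlations of Eq. (10), uses single-symbol reassignment moves, and a simple linear temperature schedule (the paper specifies neither the move set nor the schedule):

```python
import math
import random
from collections import Counter

def anneal_partition(symbols, corr, n_groups, steps=20000, seed=0):
    """Search for a partition G maximizing F(G) = sum over groups g of
    sum_{a,b in g} corr[a][b], occasionally accepting worsening moves
    with a probability that decays to zero (simulated annealing [19])."""
    rng = random.Random(seed)
    group_of = {s: i % n_groups for i, s in enumerate(symbols)}

    def score():
        return sum(corr[a][b] for a in symbols for b in symbols
                   if group_of[a] == group_of[b])

    cur = best = score()
    best_assign = dict(group_of)
    for t in range(steps):
        s = rng.choice(symbols)
        old, new = group_of[s], rng.randrange(n_groups)
        if new == old:
            continue
        group_of[s] = new
        sizes = Counter(group_of.values())
        # equilibrated partitions only: no empty or one-element groups
        if len(sizes) < n_groups or min(sizes.values()) < 2:
            group_of[s] = old
            continue
        cand = score()
        temp = max(1e-6, 1.0 - t / steps)  # acceptance decays to zero
        if cand > cur or rng.random() < math.exp((cand - cur) / temp):
            cur = cand
            if cur > best:
                best, best_assign = cur, dict(group_of)
        else:
            group_of[s] = old
    return best_assign, best
```

On a toy correlation matrix with two clean clusters of symbols, the search recovers them reliably; for the real problem, corr would be built from the C^k_ab of Eq. (9) summed over k.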
The idea behind our grouping scheme is to simplify the amino acid pattern mining by taking advantage of their correlations.

To estimate the entropy of a proteome sequence \sigma_1 \cdots \sigma_N, we compute

S = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P_i(\sigma_i),  (11)

using the model to calculate the probability P_i(\sigma_i) of character \sigma_i at position i. The better the model, the lower the estimated value of the sequence entropy. We construct three models to estimate the probability of each character, considering the previous ones and taking into account both short and medium range correlations. For each model, we find parameters that minimise the sequence entropy. The S_min value obtained is taken as an estimate of the compression rate of a running arithmetic codification [25] of the proteomes and is used to compare our results with other compression algorithms (Table 4).
Previous works on protein sequence compression like [5]
are based on short range Markovian models. In those models,
the probability of each amino acid is calculated as a function
of the context in which it appears, considering the frequency
Table 4: Compression rates in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.

Algorithm | Hi | Mj | Sc | Hs
One-character entropy | 4.155 | 4.068 | 4.165 | 4.133
CP, Nevill-Manning and Witten 1999 [5] | 4.143 | 4.051 | 4.146 | 4.112
lza-CTW, Matsumoto et al. 2000 [6] | 4.118 | 4.028 | 3.951 | 3.920
ProtComp, Cao et al. 2007 [7] | 4.108 | 4.008 | 3.938 | 3.824
XM, Cao et al. 2007 [7] | 4.102 | 4.000 | 3.885 | 3.786
Model 1 | 4.111 | 4.017 | 3.963 | 3.978
Model 2 | 4.102 | 4.005 | 3.948 | 3.933
Model 3 | 4.100 | 4.002 | 3.945 | 3.931
ProtComp, Hategan and Tabus 2004 [8] (*) | 2.330 | 3.910 | 3.440 | 3.910
BWT/SCP, Adjeroh and Nan 2006 [9] (*) | 2.546 | 2.273 | 3.111 | 3.435
(*) Results obtained with a different set of proteomes.

Estimation

Model 1: p_i(a) = \frac{1 + \sum_{k=0}^{N_c} \lambda_k F_k^i(a)}{\sum_b \left(1 + \sum_{k=0}^{N_c} \lambda_k F_k^i(b)\right)}.
argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models indicate that local conformations of protein structures have different sequence specificity [29].
The intra/interprotein correlations identified in the previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take these fluctuations into account in our second model (model 2 in Table 4):
Model 2: p_i(a) = \frac{1 + R_L^i(a) + \sum_{k=0}^{N_c} \lambda_k F_k^i(a)}{\sum_b \left(1 + R_L^i(b) + \sum_{k=0}^{N_c} \lambda_k F_k^i(b)\right)}.  (14)

Here we added the term

R_L^i(a) = \frac{i}{L} \sum_{j=i-L}^{i-1} \chi(\sigma_j = a).  (15)
This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position i − L. The factor i/L guarantees that \sum_a R_L^i(a) = i, so that it increases with i in the same way as the other terms of the sum (e.g., \sum_a F_0^i(a) = i). The parameter L is optimised as the \lambda_k are. The optimal values for L found during the entropy minimisation stage are 190 for Hi, 163 for Mj, 105 for Sc, and 115 for Hs.
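The entropy estimate of Eq. (11) under an adaptive model in the spirit of models 1 and 2 can be sketched as below. This is a simplified illustration, not the authors' implementation: it keeps only the order-0 term F_0^i(a) (a running count of a in the prefix) plus the medium-range term R_L^i(a) of Eq. (15), with the weights fixed to 1 rather than optimised:

```python
import math
from collections import Counter

def entropy_estimate(seq, alphabet, L=100):
    """S = -(1/N) sum_i log2 p_i(s_i), in bits per character, with
    p_i(a) proportional to 1 + F0(a) + R_L(a): F0(a) counts a in the
    prefix, and R_L(a) = (i/L) * (count of a in the last L positions)."""
    f0 = Counter()
    bits = 0.0
    for i, c in enumerate(seq):
        wc = Counter(seq[max(0, i - L):i])  # medium-range window counts

        def weight(a):
            return 1.0 + f0[a] + (i / L) * wc[a]

        z = sum(weight(a) for a in alphabet)
        bits -= math.log2(weight(c) / z)
        f0[c] += 1  # update the running counts after coding position i
    return bits / len(seq)

# A composition-skewed sequence codes well below log2(4) = 2 bits per
# character; a uniform one stays close to 2.
print(entropy_estimate("AAAC" * 200, "ACGT"))
print(entropy_estimate("ACGT" * 200, "ACGT"))
```

The resulting per-character codelength is what an adaptive arithmetic coder [25] driven by the same model would approximately achieve.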
Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability of the residue belonging to a certain group, and then the conditional probability of the residue once the group is given:

Model 3: p_i(a) = \frac{1 + G_L^i(g_a) f^i(a) + \sum_{k=0}^{N_c} \lambda_k F_k^i(a)}{\sum_b \left(1 + G_L^i(g_b) f^i(b) + \sum_{k=0}^{N_c} \lambda_k F_k^i(b)\right)},  (16)

where G_L^i(g_a) is the analogue of R_L^i for the group g_a containing a, and f^i(a) is the conditional frequency of a within its group.
For this model, the optimal values of the parameter L are 129
for Hi, 94 for Mj, 77 for Sc, and 100 for Hs.
As one can see in Table 4, the capability of our statistical
model to represent the nonrandom information contained
in proteomes is comparable to those models that consider
repeated amino acid patterns at both short and medium scale
[6, 7].
The improvement in the performance of models 2 and 3
is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid
frequencies at a protein length range. This demonstrates that
both correlation types are informative and that the statistical
significance of repetitions at those scales is enough to model
the amino acid probabilities.
The compression rate achieved when the medium range correlations are modelled with the frequency of amino acid groups (model 3) is almost equivalent to the compression rate of model 2. From a biological perspective, this indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might come from three-dimensional structure constraints.
According to our results, there is an important difference in the compressibility rates of the eukaryotic and prokaryotic proteomes, which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than
those of H. influenzae and M. jannaschii; correspondingly,
the correlation functions of Sc and Hs remain positive for
longer distances than Hi and Mj. This additional redundancy
could be related to the presence, in eukaryotic proteomes, of
paralogous proteins with very similar distribution of synonymous amino acids, but different function. There is evidence suggesting that paralogous genes have been recruited during evolution of different metabolic pathways and are related to
the organism adaptability to environmental changes [16]. On
the other hand, the lower compressibility of the Hi and Mj
proteomes is in agreement with the reduction of prokaryotic
genome size as an adaptation to fast metabolic rates [30, 31].
3. CONCLUSIONS
In this article, we show that the correlation function gathers evolutionary and structural information of proteomes.
Even if proteins are highly complex sequences, at a proteome
scale, it is possible to identify correlations between characters at short and medium ranges. This confirms that protein sequences are not completely random; indeed, they present
redundancy. This was already known and generally modelled
using Markov models. In our opinion, sequence duplication
[11] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423 and 623-656, 1948.
[12] J. Cleary and I. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, no. 4, pp. 396-402, 1984.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653-664, 1995.
[14] Integr8 web portal, ftp://ftp.ebi.ac.uk/pub/databases/integr8/, 2006.
[15] J. Abel, The data compression resource on the internet, http://www.datacompression.info/, 2005.
[16] C. A. Orengo and J. M. Thornton, "Protein families and their evolution: a structural perspective," Annual Review of Biochemistry, vol. 74, pp. 867-900, 2005.
[17] J. Heringa, "The evolution and recognition of protein sequence repeats," Computers & Chemistry, vol. 18, no. 3, pp. 233-243, 1994.
[18] M. A. Andrade, C. Petosa, S. I. O'Donoghue, C. W. Müller, and P. Bork, "Comparison of ARM and HEAT protein repeats," Journal of Molecular Biology, vol. 309, no. 1, pp. 1-18, 2001.
[19] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671-680, 1983.
[20] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function," Journal of Molecular Biology, vol. 291, no. 1, pp. 177-196, 1999.
[21] M. A. Huynen, P. F. Stadler, and W. Fontana, "Smoothness within ruggedness: the role of neutrality in adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 1, pp. 397-401, 1996.
[22] S. Karlin, "Statistical signals in bioinformatics," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 38, pp. 13355-13362, 2005.
[23] K. A. Dill, "Dominant forces in protein folding," Biochemistry, vol. 29, no. 31, pp. 7133-7155, 1990.
[24] B. Rost, "Did evolution leap to create the protein universe?" Current Opinion in Structural Biology, vol. 12, no. 3, pp. 409-416, 2002.
[25] J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM Journal of Research and Development, vol. 23, no. 2, pp. 149-162, 1979.
[26] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, "Microbial gene identification using interpolated Markov models," Nucleic Acids Research, vol. 26, no. 2, pp. 544-548, 1998.
[27] V. P. Turutina, A. A. Laskin, N. A. Kudryashov, K. G. Skryabin, and E. V. Korotkov, "Identification of latent periodicity in amino acid sequences of protein families," Biochemistry (Moscow), vol. 71, no. 1, pp. 18-31, 2006.
[28] E. V. Korotkov and M. A. Korotkova, "Enlarged similarity of nucleic acid sequences," DNA Research, vol. 3, no. 3, pp. 157-164, 1996.
[29] A. C. Camproux and P. Tuffery, "Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity," Biochimica et Biophysica Acta, vol. 1724, no. 3, pp. 394-403, 2005.
[30] S. D. Bentley and J. Parkhill, "Comparative genomic structure of prokaryotes," Annual Review of Genetics, vol. 38, pp. 771-791, 2004.
Research Article
A Study of Residue Correlation within Protein Sequences and
Its Application to Sequence Classification
Chris Hemmerich1 and Sun Kim2
1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN 47405-3700, USA
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408-3912, USA
1. INTRODUCTION
A protein can be viewed as a string composed from the 20-symbol amino acid alphabet or, alternatively, as the sum of its structural properties, for example, residue-specific interactions or hydropathy (hydrophilic/hydrophobic) interactions. Protein sequences contain sufficient information to construct secondary and tertiary protein structures. Most methods for predicting protein structure rely on primary sequence information by matching sequences representing unknown structures to those with known structures. Thus, researchers have investigated the correlation of amino acids within and across protein sequences [1-3]. Despite all this, in terms of character strings, proteins can be regarded as slightly edited random strings [1].
Previous research has shown that residue correlation can
provide biological insight, but that MI calculations for protein sequences require careful adjustment for sampling errors. An information-theoretic analysis of amino acid contact potential pairings with a treatment of sampling biases
has shown that the amount of amino acid pairing information is small, but statistically significant [2]. Another recent
study by Martin et al. [3] showed that normalized mutual information can be used to search for coevolving residues.
From the literature surveyed, it was not clear what significance the correlation of amino acid pairings holds for pro-
tein structure. To investigate this question, we used the family and sequence alignment information from Pfam-A [4]. To
model sequences, we defined and used the mutual information vector (MIV) where each entry represents the MI estimation for amino acid pairs separated by a particular distance in
the primary structure. We studied two different properties of
sequences: amino acid identity and hydropathy.
In this paper, we report three important findings.
(1) MI scores for the majority of 1000 real protein sequences sampled from Pfam are statistically significant
(as defined by a P value cutoff of .05) as compared to
random sequences of the same character composition,
see Section 4.1.
(2) MIV has significantly better modeling power of proteins than MI, as demonstrated in the protein sequence
classification experiment, see Section 5.2.
(3) The best classification results are provided by MIVs
containing scores generated from both the amino acid
alphabet and the hydropathy alphabet, see Section 5.2.
In Section 2, we briefly summarize the concept of MI
and a method for normalizing MI content. In Section 3, we
formally define the MIV and its use in characterizing protein sequences. In Section 4, we test whether MI scores for
protein sequences sampled from the Pfam database are statistically significant compared to random sequences of the
same residue composition. We test the ability of MIV to classify sequences from the Pfam database in Section 5, and in
Section 6, we examine correlation with MIVs and further investigate the eects of alphabet size in terms of information
theory. We conclude with a discussion of the results and their
implications.
2.
We use MI content to estimate correlation in protein sequences to gain insight into the prediction of secondary and
tertiary structures. Measuring correlation between residues
is problematic because sequence elements are symbolic variables that lack a natural ordering or underlying metric [5].
Residues can be ordered in certain properties such as hydropathy, charge, and molecular weight. Weiss and Herzel [6]
analyzed several such correlation functions.
MI is a measure of correlation from information theory
[7] based on entropy, which is a function of the probability
distribution of residues. We can estimate entropy by counting residue frequencies. Entropy is maximal when all residues
appear with the same frequency. MI is calculated by systematically extracting pairs of residues from a sequence and calculating the distribution of pair frequencies weighted by the
frequencies of the residues composing the pairs.
By defining a pair as adjacent residues in the protein sequence, MI estimates the correlation between the identities
of adjacent residues. We later define pairs using nonadjacent
residues, and physical properties rather than residue identities.
MI has been proven useful in multiple studies of biological sequences. It has been used to predict coding regions
in DNA [8], and has been used to detect coevolving residue
pairs in protein multiple sequence alignments [3].
2.1. Mutual information
The entropy of a random variable X, H(X), represents the
uncertainty of the value of X. H(X) is 0 when the identity of
X is known, and H(X) is maximal when all possible values
of X are equally likely. The mutual information of two variables MI(X, Y ) represents the reduction in uncertainty of X
given Y , and conversely, MI(Y , X) represents the reduction
in uncertainty of Y given X:
MI(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).  (1)
When X and Y are independent, H(X | Y ) simplifies to
H(X), so MI(X, Y ) is 0. The upper bound of MI(X, Y ) is the
lesser of H(X) and H(Y ), representing complete correlation
between X and Y :
H(X | Y ) = H(Y | X) = 0.
(2)
H(X) = -\sum_{i \in A} P(x_i) \log_2 P(x_i),  (3)

MI(X, Y) = \sum_{i \in A} \sum_{j \in A} P(x_i, x_j) \log_2 \frac{P(x_i, x_j)}{P(x_i) P(x_j)},  (4)

H(X, Y) = -\sum_{i \in A} \sum_{j \in A} P(x_i, x_j) \log_2 P(x_i, x_j),  (5)

\frac{MI(X, Y)}{H(X, Y)}.  (6)
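A plug-in estimate of these quantities from observed symbol pairs can be sketched as follows; this is an illustrative helper, not the authors' code, and it estimates the probabilities by simple counting:

```python
import math
from collections import Counter

def mi_pairs(pairs):
    """Return (MI, joint entropy) in bits, estimated from an iterable
    of (x, y) symbol pairs via the empirical pair and marginal counts."""
    pairs = list(pairs)
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    hxy = -sum((c / n) * math.log2(c / n) for c in pxy.values())
    return mi, hxy

# Perfectly correlated pairs: MI = H(X) = 1 bit; MI / H(X,Y) = 1.
mi, hxy = mi_pairs([("A", "A"), ("B", "B")] * 50)
print(mi, hxy, mi / hxy)
```

Dividing MI by the joint entropy H(X, Y) gives a normalized score in [0, 1]; this normalization is the one referred to later as "normalized by joint entropy".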
We calculate the MI of a sequence to characterize the structure of the resulting protein. The structure is affected by different types of interactions, and we can modify our methods to consider different biological properties of a protein sequence. To improve our characterization, we combine these different methods to create a vector of MI scores.

Using the flexibility of MI and existing knowledge of protein structures, we investigate several methods for generating MI scores from a protein sequence. We can calculate the pair probability P(x_i, x_j) using any relationship that is defined for all amino acid identities i, j \in A. In particular, we examine distance between residue pairings, different types of residue-residue interactions, classical and normalized MI scores, and three methods of interpreting gap symbols in Pfam alignments.
3.1. Distance MI vectors
[Examples (4), (5), and (6): residue pairings in the sequence DEIPCPFCGC.]

Amino acids in the two hydropathy classes:
hydrophobic: C, I, M, F, W, Y, V, L
hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G

MI(d) = \sum_{i \in A} \sum_{j \in A} P_d(x_i, x_j) \log_2 \frac{P_d(x_i, x_j)}{P(x_i) P(x_j)}.  (8)

MI is then calculated for A. H is transformed to G using the same method. With gap symbols present, the probabilities over A need not sum to one:

\sum_{i \in A} P(x_i) < 1,  (11)

\sum_{i, j \in A} P(x_i, x_j) < 1.  (12)
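The distance-d scores and the alphabet transformation can be sketched together as below. This is an illustrative sketch: the hydropathy classes follow the two rows listed above, treating d = 0 as adjacent residues is an assumption consistent with Section 2, and the example sequence is a toy repetition of DEIPCPFCGC:

```python
import math
from collections import Counter

HYDROPHOBIC = set("CIMFWYVL")  # one hydropathy class; the rest are hydrophilic

def to_hydropathy(seq):
    """Transform the 20-letter alphabet A into the 2-letter alphabet H."""
    return "".join("h" if c in HYDROPHOBIC else "p" for c in seq)

def mi_at_distance(seq, d):
    """MI(d) as in Eq. (8): plug-in MI of residue pairs separated by d
    intervening positions (d = 0 means adjacent residues)."""
    pairs = [(seq[i], seq[i + d + 1]) for i in range(len(seq) - d - 1)]
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

seq = "DEIPCPFCGC" * 30
miv_a = [mi_at_distance(seq, d) for d in range(5)]                 # alphabet A
miv_h = [mi_at_distance(to_hydropathy(seq), d) for d in range(5)]  # alphabet H
print(miv_a)
print(miv_h)
```

Concatenating scores like miv_a and miv_h over a range of distances yields a mutual information vector (MIV) of the kind used for classification later in the paper.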
Table: MI(d) scores for the amino acid (A) and hydropathy (H) alphabets, for four Pfam families.

d | Globin A | Globin H | Ferrochelatase A | Ferrochelatase H | DUF629 A | DUF629 H | Big 2 A | Big 2 H
0 | 1.34081 | 0.42600 | 0.95240 | 0.13820 | 0.70611 | 0.04752 | 1.26794 | 0.21026
1 | 1.20553 | 0.23740 | 0.93240 | 0.03837 | 0.63171 | 0.00856 | 0.92824 | 0.05522
2 | 1.07361 | 0.12164 | 0.90004 | 0.02497 | 0.63330 | 0.00367 | 0.95326 | 0.07424
3 | 0.92912 | 0.02704 | 0.87380 | 0.03133 | 0.66955 | 0.00575 | 0.99630 | 0.04962
4 | 0.97230 | 0.00380 | 0.90400 | 0.02153 | 0.62328 | 0.00587 | 1.00100 | 0.08373
5 | 0.91082 | 0.00392 | 0.78479 | 0.02944 | 0.68383 | 0.00674 | 0.98737 | 0.03664
6 | 0.90658 | 0.01581 | 0.81559 | 0.00588 | 0.63120 | 0.00782 | 1.06852 | 0.05216
7 | 0.87965 | 0.02435 | 0.91757 | 0.00822 | 0.67433 | 0.00172 | 1.04627 | 0.12002
8 | 0.83376 | 0.01860 | 0.87615 | 0.01247 | 0.63719 | 0.00495 | 1.00784 | 0.05221
9 | 0.88404 | 0.01000 | 0.90823 | 0.00721 | 0.61597 | 0.00411 | 0.97119 | 0.04002
10 | 0.88685 | 0.01353 | 0.89673 | 0.00611 | 0.60790 | 0.00718 | 1.02660 | 0.02240
11 | 0.90792 | 0.01719 | 0.94314 | 0.02195 | 0.66750 | 0.00867 | 0.92858 | 0.02261
12 | 0.95955 | 0.00231 | 0.87247 | 0.01027 | 0.64879 | 0.00805 | 0.98879 | 0.03156
13 | 0.88584 | 0.01387 | 0.85914 | 0.00733 | 0.66959 | 0.00607 | 1.09997 | 0.04766
14 | 0.93670 | 0.01490 | 0.88250 | 0.00335 | 0.66033 | 0.00106 | 1.06989 | 0.01286
15 | 0.86407 | 0.02052 | 0.94592 | 0.00548 | 0.62171 | 0.01363 | 1.27002 | 0.06204
16 | 0.89004 | 0.04024 | 0.92664 | 0.01398 | 0.63445 | 0.00314 | 1.05699 | 0.03154
17 | 0.91409 | 0.01706 | 0.80241 | 0.00108 | 0.67801 | 0.00536 | 1.06677 | 0.02136
18 | 0.89522 | 0.01691 | 0.85366 | 0.00719 | 0.65903 | 0.00898 | 1.05439 | 0.03310
19 | 0.92742 | 0.03319 | 0.90928 | 0.01334 | 0.70176 | 0.00151 | 1.17621 | 0.01902

4. ANALYSIS OF CORRELATION IN PROTEIN SEQUENCES
In theory, a random string contains no correlation between characters, so we expect a slightly edited random string to exhibit little correlation. In practice, finite random strings usually have a nonzero MI score. This overestimation of MI in finite sequences is a function of the length of the string, the alphabet size, and the frequencies of the characters that make up the string. We investigated the significance of this error for our calculations and methods for reducing or correcting for the error.

To confirm the significance of our MI scores, we used a permutation-based technique. We compared known coding sequences to random sequences in order to generate a P value signifying the chance that our observed MI score or higher would be obtained from a random sequence of residues. Since MI scores are dependent on sequence length and residue frequency, we used the shuffle command from the HMMER package to conserve these parameters in our random sequences.
We sampled 1000 sequences from our subset of Pfam-A. A simple random sample was performed without replacement from all sequences between 100 and 1000 residues in length. We calculated MI(0) for each sequence sampled. We then generated 10 000 shuffled versions of each sequence and calculated MI(0) for each.

We used three scoring methods to calculate MI(0):
(1) A with literal gap interpretation,
(2) A normalized by joint entropy with literal gap interpretation,
(3) H with literal gap interpretation.
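The permutation test can be sketched as follows. In this illustration, random.shuffle stands in for HMMER's shuffle command (both preserve length and residue composition), and the MI(0) scorer is a plain adjacent-pair estimate rather than the paper's three scoring variants:

```python
import math
import random
from collections import Counter

def mi0(seq):
    """MI(0): mutual information (bits) of adjacent-residue pairs."""
    pairs = list(zip(seq, seq[1:]))
    n = len(pairs)
    pxy, px, py = Counter(pairs), Counter(seq[:-1]), Counter(seq[1:])
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def permutation_pvalue(seq, n_shuffles=1000, seed=0):
    """P = x / N: the fraction of composition-preserving shuffles whose
    MI(0) is greater than or equal to the observed score."""
    rng = random.Random(seed)
    observed = mi0(seq)
    chars = list(seq)
    x = 0
    for _ in range(n_shuffles):
        rng.shuffle(chars)  # preserves length and residue frequencies
        if mi0("".join(chars)) >= observed:
            x += 1
    return x / n_shuffles

# A strongly patterned sequence should look significant (small P).
print(permutation_pvalue("ACAC" * 40, n_shuffles=200))
```

With 10 000 shuffles, as in the experiment, a sequence is labeled significant at the .05 cutoff when at most 50 shuffles reach the observed score.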
[Figure 1: Mean MI(0) of shuffled sequences versus sequence length (100-1000 residues), for the A literal, A literal normalized, and H literal methods.]
In all three cases, the MI(0) score for a shuffled sequence of infinite length would be 0; therefore, the calculated scores represent the error introduced by sample-size effects. Figure 1, mean MI(0) of shuffled sequences, shows the average shuffled sequence scores (i.e., sampling error) in bits for each method. This figure shows that, as expected, the sampling error tends to decrease as the sequence length increases.
4.1. Significance of MI(0) for protein sequences
To compare the amount of error in each method, we normalized the mean MI(0) scores from Figure 1 by dividing the mean MI(0) score by the MI(0) score of the sequence used to generate the shuffles. This ratio estimates the portion of the sequence MI(0) score attributable to sample-size effects.
Figure 2, normalized MI(0) of shuffled sequences, compares the effectiveness of our two corrective methods in minimizing the sample-size effects. This figure shows that normalization by joint entropy is not as effective as Figure 1 suggests. Despite a large reduction in bits, in most cases the portion of the score attributed to sampling effects shows only a minor improvement. H still shows a significant reduction in sample-size effects for most sequences.
Figures 1 and 2 provide insight into trends for the three methods, but do not answer our question of whether or not the MI scores are significant. For a given sequence S, we estimated the P value as

P = x / N, (13)

where N is the number of random shuffles and x is the number of shuffles whose MI(0) was greater than or equal to
MI(0) for S. For this experiment, we chose a significance cutoff of .05. For a sequence to be labeled significant, no more than 50 of the 10 000 shuffled versions may have an MI(0) score equal to or larger than that of the original sequence. We repeated
[Table fragment (counts): 657, 309, 106, 60, 894, 783, 368, 162, 117]
Classification of feature vectors is a well-studied problem with many available strategies. A good introduction to many methods is available in [11], and the method chosen can significantly affect performance. Since the focus of this experiment is to compare methods of calculating MIVs, we only used the well-established and versatile nearest neighbor classifier in conjunction with Euclidean distance [12].
5.1. Classification implementation
For classification, we used the WEKA package [11]. WEKA uses the instance-based 1 (IB1) algorithm [13] to implement nearest neighbor classification. This is an instance-based learning algorithm derived from the nearest neighbor pattern classifier and is more efficient than the naive implementation.
The results of this method can differ from the classic nearest neighbor classifier in that the range of each attribute is normalized. This normalization ensures that each attribute contributes equally to the calculation of the Euclidean distance. As shown in Table 3, MI scores calculated from A have a larger magnitude than those calculated from H. This normalization allows the two alphabets to be used together.
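The effect of range normalization on the distance calculation can be sketched as follows. This is our own illustration of the idea behind IB1's behavior, not WEKA code; the function names are assumptions.

```python
import math

def normalize_columns(rows):
    """Rescale each attribute (column) to [0, 1] so that attributes with
    large magnitudes (e.g., A-based MI scores) do not dominate the
    Euclidean distance over small ones (e.g., H-based scores)."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def nearest_neighbor(train, labels, query):
    """Return the label of the training vector closest to the query."""
    dists = [math.dist(t, query) for t in train]
    return labels[dists.index(min(dists))]
```

In practice the query vector would be rescaled with the training minima and maxima before classification; the sketch assumes all vectors are normalized together.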
5.2. Sequence classification with MIV
In this experiment, we explore the effectiveness of classifications made using the correlation measurements outlined in Section 3.
Each experiment was performed on a random sample of
50 families from our subset of the Pfam database. We then
used leave-one-out cross-validation [14] to test each of our
classification methods on the chosen families.
In leave-one-out validation, the sequences from all 50
families are placed in a training pool. In turn, each sequence
is extracted from this pool and the remaining sequences are
used to build a classification model. The extracted sequence
is then classified using this model. If the sequence is placed
in the correct family, the classification is counted as a success. Accuracy for each method is measured as

accuracy = no. of correct classifications / no. of classification attempts. (14)
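The validation loop of Eq. (14) can be sketched as follows. This is a minimal illustration with a plain 1-NN stand-in for IB1; the names are our own assumptions.

```python
import math

def _nn_label(train, labels, query):
    # 1-nearest neighbor by Euclidean distance
    dists = [math.dist(t, query) for t in train]
    return labels[dists.index(min(dists))]

def leave_one_out_accuracy(vectors, families):
    """Accuracy per Eq. (14): each vector is held out in turn, the
    remaining vectors form the training pool, and a success is a
    correct family label for the held-out vector."""
    correct = 0
    for i, (vec, fam) in enumerate(zip(vectors, families)):
        rest_v = vectors[:i] + vectors[i + 1:]
        rest_f = families[:i] + families[i + 1:]
        if _nn_label(rest_v, rest_f, vec) == fam:
            correct += 1
    return correct / len(vectors)
```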
Table 5: Classification results for MI(0) and MIV20 methods. SD represents the standard deviation of the experiment accuracies.

MIV20 rank  Method                              MI(0) mean  MI(0) SD  MIV20 mean  MIV20 SD
1           Hybrid-H                            26.73%      2.59      85.14%      2.06
2           Normalized hybrid-H                 26.20%      4.16      85.01%      2.19
3           Literal-H                           22.92%      3.41      79.51%      2.79
4           Normalized literal-H                23.45%      3.88      78.86%      2.79
5           Normalized Hybrid-H w/Pfam prior    26.31%      3.95      77.21%      2.94
6           Literal-H w/Pfam prior              22.73%      4.90      76.89%      2.91
7           Normalized Literal-H w/Pfam prior   22.45%      4.89      76.29%      2.96
8           Hybrid-H w/Pfam prior               22.81%      2.97      71.57%      3.15
9           Normalized literal-A                17.76%      3.21      66.69%      4.14
10          Hybrid-A                            17.16%      3.06      64.09%      4.36
11          Normalized literal-A w/Pfam prior   19.60%      3.67      63.39%      4.05
12          Literal-A                           16.36%      2.84      61.97%      4.32
13          Literal-A w/Pfam prior              19.95%      2.84      61.82%      4.12
14          Hybrid-A w/Pfam prior               23.09%      3.36      58.07%      4.28
15          Normalized hybrid-A                 18.10%      3.08      41.76%      4.59
16          Normalized hybrid-A w/Pfam prior    23.32%      3.65      40.46%      4.04
17          Strict-H w/Pfam prior               12.97%      2.85      29.96%      3.89
18          Normalized strict-H w/Pfam prior    13.01%      2.72      29.81%      3.87
19          Normalized strict-A w/Pfam prior    19.77%      3.52      29.73%      3.93
20          Normalized strict-A                 18.27%      2.92      29.20%      3.65
21          Strict-H                            11.22%      2.33      29.09%      3.60
22          Normalized strict-H                 11.15%      2.52      28.85%      3.58
23          Strict-A w/Pfam prior               19.25%      3.38      28.44%      3.91
24          Strict-A                            16.27%      2.75      25.80%      3.60
Table 6: Top scoring combinations of MIV methods. All combinations of two MIV methods were tested, with these five performing the most accurately. SD represents the standard deviation of the experiment accuracies.

Rank  First method  Second method                     Mean accuracy  SD
1     Hybrid-H      Normalized hybrid-A w/Pfam prior  90.99%         1.44
2     Hybrid-H      Normalized strict-A w/Pfam prior  90.66%         1.47
3     Hybrid-H      Literal-A w/Pfam prior            90.30%         1.48
4     Hybrid-H      Literal-A                         90.24%         1.73
5     Hybrid-H      Strict-A w/Pfam prior             90.08%         1.57

6.

In this section, we examine the results of our different methods of calculating MIVs for Pfam sequences. We first use correlation within the MIV as a metric to compare several of our scoring methods. We then take a closer look at the effect of reducing our alphabet size when translating from A to H.
The results strengthen our observations from the classification experiment. Methods that performed well in classification exhibit less redundancy between MIV indexes. In particular, the advantage of methods using H is clear. In each case, correlation decreases as the distance between indexes increases. For short distances, A methods exhibit this to a lesser degree; however, after index 10, the scores are highly correlated.
Effect of alphabets
[Figure 3 heat maps: (a) Literal-A, Normalized literal-A, Hybrid-A, Normalized hybrid-A; (b) Literal-H, Normalized literal-H, Hybrid-H, Normalized hybrid-H; axes are MIV indexes, color scale approximately 0.2-0.8.]
Figure 3: Pearson's correlation analysis of scoring methods. Note the reduced correlation in the methods based on H, which all performed very well in classification tests.
Table 7: Comparison of measured entropy to expected entropy values for 1000 amino acid sequences. Each sequence is 100 residues long and was generated by a Bernoulli scheme.

Alphabet  Alphabet size  Theoretical entropy  Mean measured entropy
A         20             4.322                4.178
H         2              0.971                0.964
CONCLUSIONS
[Figure: mean MIV for the A and H alphabets versus residue distance d.]
Research Article
Identifying Statistical Dependence in Genomic Sequences via
Mutual Information Estimates
Hasan Metin Aktulga,1 Ioannis Kontoyiannis,2 L. Alex Lyznik,3 Lukasz Szpankowski,4
Ananth Y. Grama,1 and Wojciech Szpankowski1
1 Department
1. INTRODUCTION

Thus motivated, we propose to develop precise and reliable methodologies for quantifying and identifying such dependencies,
based on the information-theoretic notion of mutual information.
Biomolecules store information in the form of monomer strings such as deoxyribonucleotides, ribonucleotides, and amino acids. As a result of numerous genome and protein sequencing efforts, vast amounts of sequence data are now available for computational analysis. While basic tools such as BLAST provide powerful computational engines for identification of conserved sequence motifs, they are less suitable for detecting potential hidden correlations without experimental precedence (higher-order substitutions).
The application of analytic methods for finding regions of statistical dependence through mutual information has been illustrated through a comparative analysis of the 5′ untranslated regions of DNA coding sequences [4]. It has been known that eukaryotic translational initiation requires the consensus sequence around the start codon defined as the Kozak motif [5]. By screening at least 500 sequences, an unexpected correlation between positions −2 and −1 of the Kozak sequence was observed, thus implying a novel translational initiation signal for eukaryotic genes. This pattern was discovered using mutual information, and was not detected by analyzing single-nucleotide conservation. In other relevant work, neighbor-dependent substitution matrices were applied to estimate the average mutual information content of the core promoter regions from five different organisms [6, 7]. Such comparative analyses verified the importance of TATA-boxes and transcriptional initiation. A similar methodology elucidated patterns of sequence conservation at the 3′ untranslated regions of orthologous genes from human, mouse, and rat genomes [8], making them potential targets for experimental verification of hidden functional signals.
Statistical dependence techniques also find important applications in the analysis of gene expression data. Typically, the basic underlying assumption in such analyses is that genes expressed similarly under divergent conditions share functional domains of biological activity. Establishing dependency or potential relationships between sets of genes from their expression profiles holds the key to the identification of novel functional elements. Statistical approaches to estimation of mutual information from gene expression datasets have been investigated in [1].
Protein engineering is another important area where statistical dependency tools are utilized. Reliable predictions of protein secondary structures based on long-range dependencies may enhance functional characterizations of proteins [9]. Since secondary structures are determined by both short- and long-range interactions between single amino acids, the application of comparative statistical tools based on consensus sequence algorithms or short amino acid sequences centered on the prediction sites is far from optimal. Analyses that incorporate mutual information estimates may provide more accurate predictions.
In this work we focus on developing reliable and precise information-theoretic methods for determining whether two biosequences are likely to be statistically dependent. Our main goal is to develop efficient algorithmic tools that can be easily applied to large data sets, mainly, though not exclusively, as a rigorous exploratory tool. In fact, as discussed in detail below, our findings are not the final word on the experiments we performed, but, rather, the first step in the process of identifying segments of interest. Another motivating factor for this project, which is more closely related to ideas from information theory, is the question of determining whether there are error correction mechanisms built into large molecules, as argued by Battail; see [10] and the references therein. We choose to work with protein coding exons and noncoding introns. While exons are well-conserved parts of DNA, introns have much greater variability. They are dispersed on strings of biopolymers and still they have to be precisely identified in order to produce biologically relevant information. It seems that there is no external source of information but the structure of RNA molecules themselves to generate functional templates for protein synthesis. Determining potential mutual relationships between exons
2. THEORETICAL BACKGROUND
In this section, we outline the theoretical basis for the mutual information estimators we will later apply to biological
sequences.
Suppose we have two strings of unequal lengths,

X_1^n = (X_1, X_2, ..., X_n),   Y_1^M = (Y_1, Y_2, Y_3, ..., Y_M), (1)

sum_{x,y in A} V(x, y) log [ V(x, y) / ( P(x) Q(y) ) ]. (2)
Similarly, let p-hat(x) and q-hat_j(y) denote the empirical distributions of X_1^n and Y_j^{j+n-1}, respectively. We define the empirical mutual information

I-hat_j(n) = sum_{x,y in A} p-hat_j(x, y) log [ p-hat_j(x, y) / ( p-hat(x) q-hat_j(y) ) ]. (3)
We propose to use the following simple test for detecting dependence between X_1^n and Y_1^M. Choose and fix a threshold tau > 0, and compute the empirical mutual information I-hat_j(n) between X_1^n and each contiguous substring Y_j^{j+n-1} of length n from Y_1^M. If I-hat_j(n) is larger than tau for some j, declare that the strings X_1^n and Y_j^{j+n-1} are dependent; otherwise, declare that they are independent.
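A minimal sketch of this test follows. It is our own illustration, with a plug-in MI estimator in bits; the function names and the choice of reporting all windows exceeding the threshold are assumptions.

```python
import math
from collections import Counter

def empirical_mi(x, y):
    """Plug-in mutual information (bits) between two equal-length
    strings, from their empirical joint and marginal distributions."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum(c / n * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def dependence_scan(x, y, tau):
    """Slide x (length n) along y, computing I-hat_j(n) for every
    contiguous window y[j:j+n]; report windows exceeding threshold tau."""
    n = len(x)
    scores = [empirical_mi(x, y[j:j + n]) for j in range(len(y) - n + 1)]
    hits = [j for j, s in enumerate(scores) if s > tau]
    return scores, hits
```

For example, embedding a copy of a 20-base string inside an unrelated sequence yields a window with MI equal to the full alphabet entropy (2 bits for four equiprobable symbols), while windows over a constant region score 0.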
Before examining the issue of selecting the value of the threshold tau, we note that this statistic is identical to the (normalized) log-likelihood ratio between the above two hypotheses. To see this, observe that expanding the definition of p-hat_j(x, y) in I-hat_j(n), we can simply rewrite

I-hat_j(n) = sum_{x,y in A} p-hat_j(x, y) log [ p-hat_j(x, y) / ( p-hat(x) q-hat_j(y) ) ]
           = (1/n) sum_{i=1}^{n} sum_{x,y in A} I{(X_i, Y_{j+i-1}) = (x, y)} log [ p-hat_j(x, y) / ( p-hat(x) q-hat_j(y) ) ], (4)

so that

I-hat_j(n) = (1/n) sum_{i=1}^{n} log [ p-hat_j(X_i, Y_{j+i-1}) / ( p-hat(X_i) q-hat_j(Y_{j+i-1}) ) ]
           = (1/n) log prod_{i=1}^{n} [ p-hat_j(X_i, Y_{j+i-1}) / ( p-hat(X_i) q-hat_j(Y_{j+i-1}) ) ]. (5)
For large n, under the independence hypothesis,

(2 ln 2) n I-hat(n) ≈ Z ~ chi-square with (|A| - 1)^2 degrees of freedom, (7)

where Z is as before. Therefore, for large n the error probability P_{e,1} decays like the tail of the chi-square distribution function,

P_{e,1} ≈ Gamma(k, (tau ln 2) n) / Gamma(k), k = (|A| - 1)^2 / 2, (8)

where Gamma(k, x) denotes the upper incomplete gamma function. Under the dependence hypothesis, I-hat(n) converges to the true value

I = I(X; Y) = sum_{x,y in A} p(x) W(y | x) log [ W(y | x) / Q(y) ] (11)

of the mutual information, but, as we show below, the rate of this convergence is slower than the 1/n rate of scenario (i): here, I-hat(n) -> I with probability one, but only at rate 1/sqrt(n),

sqrt(n) ( I-hat(n) - I ) ≈ T ~ N(0, sigma^2), sigma^2 = Var [ log ( W(Y | X) / Q(Y) ) ], (12)

where the last approximation sign indicates equality to first order in the exponent. Thus, despite the fact that I-hat(n) converges at different speeds in the two scenarios, both error probabilities P_{e,1} and P_{e,2} decay exponentially with the sample size n.
To see why (10) holds it is convenient to use the alternative expression for I-hat(n) given in (5). Using this, and recalling that I-hat(n) = I-hat_1(n), we obtain

sqrt(n) [ I-hat(n) - I ] = sqrt(n) [ (1/n) sum_{i=1}^{n} log ( p-hat_1(X_i, Y_i) / ( p-hat(X_i) q-hat_1(Y_i) ) ) - I ]. (13)

Since the empirical distributions converge to the corresponding true distributions, for large n it is straightforward to justify the approximation

sqrt(n) [ I-hat(n) - I ] ≈ sqrt(n) [ (1/n) sum_{i=1}^{n} log ( P(X_i) W(Y_i | X_i) / ( P(X_i) Q(Y_i) ) ) - I ]. (14)
[Figure 1 schematic: DNA structure of zmSRp32 showing exons, introns, the 3′ UTR, start and stop sites, and pre-mRNA processing into mRNA structures with alternative exons and an alternative intron.]
Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5′ and 3′ untranslated regions (white boxes). RNA transcripts (pre-mRNA) are processed to yield mRNA molecules used as templates for protein synthesis. Alternative pre-mRNA splicing generates different mRNA templates from the same transcripts, by selecting either alternative exons or alternative introns. The regions discussed in the text are identified by indices corresponding to the nucleotide position in the original DNA sequence.
P_{e,2} ≈ exp( - ( (I - tau)^2 / (2 sigma^2) ) n ); (15)
3. EXPERIMENTAL RESULTS
In this section, we apply the mutual information test described above to biological data. First we show that it can be used effectively to identify statistical dependence between regions of the maize zmSRp32 gene that may be involved
3.1.
All of our experiments were performed on the maize zmSRp32 gene [11]. This gene belongs to a group of genes that are functionally homologous to the human ASF/SF2 alternative splicing factor. Interestingly, these genes encode alternative splicing factors in maize and yet are themselves also alternatively spliced. The gene zmSRp32 is coded by 4735 nucleotides and has four alternative splicing variants. Two of these four variants are due to different splicings of this gene, between positions 1-369 and 3243-4220, respectively, as shown in Figure 1. The results given here are primarily from experiments on these segments of zmSRp32.
In order to understand and quantify the amount of correlation between different parts of this gene, we computed the mutual information between all functional elements including exons, introns, and the 5′ untranslated region. As before, we denote the shorter sequence of length n by X_1^n = (X_1, X_2, ..., X_n) and the longer one of length M by Y_1^M = (Y_1, Y_2, ..., Y_M). We apply the simple mutual information estimator I-hat_j(n) defined in (3) to estimate the mutual information between X_1^n and Y_j^{j+n-1} for each j = 1, 2, ..., M - n + 1, and we plot the dependency graph of I-hat_j = I-hat_j(n) versus j; see Figure 2. The threshold is computed, according
Figure 2: Estimated mutual information between the exon located between bases 1-369 and each contiguous subsequence of length 369 in the intron between bases 3243-4220. The estimates were computed both for the original sequences in the standard four-letter alphabet {A, C, G, T} (shown in (a)), as well as for the corresponding transformed sequences for the two-letter purine/pyrimidine grouping {AG, CT} (shown in (b)).
Figure 3: Dependency graph of I-hat_j versus j for the zmSRp32 gene, using different alphabet groupings: in (a) and (b), we plot the estimated mutual information between the exon found between bases 1-78 and each subsequence of length 78 in the intron located between bases 3243-4220. Plot (a) shows estimates over the original four-letter alphabet {A, C, G, T}, and (b) shows the corresponding estimates over the Watson-Crick pairs {AT, CG}. Similarly, plots (c) and (d) contain the estimated mutual information between the intron located in bases 79-268 and all corresponding subsequences of the intron between bases 3243-4220. Plot (c) shows estimates over the original alphabet, and plot (d) over the two-letter purine/pyrimidine grouping {AG, CT}. Plots (e) and (f) show the estimated mutual information between the 5′ untranslated region and all corresponding subsequences of the intron between bases 3243-4220, for the four-letter alphabet (in (e)), and for the two-letter purine/pyrimidine grouping {AG, CT} (in (f)).
3.2. Application to tandem repeats
Here we further explore the utility of the mutual information statistic, and we examine its performance on the problem of detecting short tandem repeats (STRs) in genomic sequences. STRs, usually found in noncoding regions, are made of back-to-back repetitions of a sequence which is at least two bases long and generally shorter than 15 bases. The period of an STR is defined as the length of the repetition sequence in it. Owing to their short lengths, STRs survive mutations well, and can easily be amplified using PCR without producing erroneous data. Although there are many well-identified STRs in the human genome, interestingly, the number of repetitions at any specific locus varies significantly among individuals; that is, they are polymorphic DNA fragments. These properties make STRs suitable tools for determining genetic profiles, and STR typing has become a prevalent method in forensic investigations. Long repetitive sequences have also been observed in genomic sequences, but have not gained as much attention since they cannot survive environmental degradation and do not produce high-quality data from PCR analysis.
Several algorithms have been proposed for detecting STRs in long DNA strings with no prior knowledge about the size and the pattern of repetition. These algorithms are mostly based on pattern matching, and they all have high time-complexity. Finding short repetitions in a long sequence is a challenging problem. When the query string is a DNA segment that contains many insertions, deletions, or substitutions due to mutations, the problem becomes even harder. Exact- and approximate-pattern matching algorithms need to be modified to account for these mutations, and this renders them complex and inefficient. To overcome these limitations, we propose a statistical approach using an adaptation of the method described in the previous sections.
In the United States, the FBI has decided on 13 loci to be used as the basis for genetic profile analysis, and they continue to be the standard in this area. To demonstrate how our approach can be used for STR detection, we chose to use sequences from the FBI's combined DNA index system (CODIS): the SE33 locus contained in the GenBank sequence V00481, and the VWA locus contained in the GenBank sequence M25858. The periods of STRs found in CODIS typically range from 2 to bases, and do not exhibit enough variability to demonstrate how our approach would perform under divergent conditions. For this reason, we used the V00481 sequence as is, but on M25858 we artificially introduced an STR with period 11, by substituting bases 2821-2920 (where we know that there are no other repeating sequences) with 9 tandem repeats of ACTTTGCCTAT. We also introduced base substitutions, deletions, and insertions on our artificial STR to imitate mutations.
Let Y_1^M = (Y_1, Y_2, ..., Y_M) denote the DNA sequence in which we are looking for STRs. The gist of our approach is simply to choose a periodic probe sequence of length n, say, X_1^n = (X_1, X_2, ..., X_n) (typically much shorter than Y_1^M), and then to calculate the empirical mutual information I-hat_j = I-hat_j(n) between X_1^n and each of its possible alignments with Y_1^M. In order to detect the presence of STRs, the values of the empirical mutual information in regions where STRs do appear
Figure 4: Dependency graph of the GenBank sequence Y_1^M = V00481, for a probe sequence X_1^n which is a repetition of AGGT, of length (a) 12, or (b) 60. The sequence Y_1^M contains STRs that are repetitions of the pattern AAAG, in the following regions: (i) there is a repetition of AAAG between bases 62-108; (ii) AAAG is intervened by AG and AAGG until base 138; (iii) again between 138-294 there are repetitions of AAAG, some of which are modified by insertions and substitutions. In (a) our probe is too short, and it is almost impossible to distinguish the SE33 locus from the rest. However, in (b) the SE33 locus is singled out by the two big peaks in the mutual information estimates; the shorter peak between the two larger ones is due to the interventions described above. Note that the STRs were identified by a probe sequence that was a repetition of a pattern different from that of the repeating part of the STRs themselves, but of the same period.
Figure 5: Dependency graph of the VWA locus contained in GenBank sequence M25858 for a probe sequence X_1^n with n = 12, which is a repetition of (a) TCTA, an exactly matching probe, or (b) GTGC, a completely different probe, but of the exact same period. In both cases, we have chosen X_1^n to be long enough to suppress unrelated information. Note that the results in (a) and (b) are almost identical. The VWA locus contains an STR of TCTA between positions 44-123. This STR is apparent in both dependency graphs, forming a periodic curve with high correlation.
Figure 6: In these charts we use the modified GenBank sequence M25858, which contains the VWA locus in CODIS between positions 1683-1762 and the artificial STR introduced by us at 2821-2920. The repeat sequence of the VWA locus is TCTA, and the repeat sequence of the artificial STR is ACTTTGCCTAT. In (a), the probe X_1^n has length n = 88 and consists of repetitions of AGGT. Here the repeating sequence of the VWA locus (which has period 4) is clearly indicated by the peak, whereas the artificial tandem repeat (which has period 11) does not show up in the results. The small peak around position 2100 is due to a very noisy STR, again with a 4-base period. In (b), the probe X_1^n again has length n = 88, and it consists of repetitions of CATAGTTCGGA. This produces the opposite result: the artificial STR is clearly identified, but there is no indication of the STR present at the VWA locus.
4. CONCLUSIONS
[21] J. Åberg, Yu. M. Shtarkov, and B. J. M. Smeets, "Multialphabet coding with separate alphabet description," in Proceedings of the International Conference on Compression and Complexity of Sequences, pp. 56-65, Positano, Italy, June 1997.
[22] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 2954-2964, 2006.
Research Article
Motif Discovery in Tissue-Specific Regulatory Sequences
Using Directed Information
Arvind Rao,1 Alfred O. Hero III,1 David J. States,2 and James Douglas Engel3
1 Departments
1. INTRODUCTION
transcriptional start site (TSS). The basal transcriptional machinery at the promoter, coupled with the transcription factor complexes at these distal, long-range regulatory elements (LREs), is collectively involved in directing tissue-specific expression of genes.
One of the current challenges in the post-genomic era is the principled discovery of such LREs genome-wide. Recently, there has been a community-wide effort (http://www.genome.gov/ENCODE) to find all regulatory elements in 1% of the human genome. The examination of the discovered elements would reveal characteristics typical of most enhancers, which would aid their principled discovery and examination on a genome-wide scale. Some characteristics of experimentally identified distal regulatory elements [1, 2] are as follows.
(i) Noncoding elements: distal regulatory elements are noncoding and can be either intronic or intergenic regions on the genome. Hence, previous models for gene

[Figure schematic: distal enhancers, proximal promoter with TATA box and RNA Pol. II, exon.]
Another practical reason for the examination of promoters is that their locations (and genomic sequences) are more clearly delineated in genome databases (like UCSC or Ensembl). Sufficient data (http://symatlas.gnf.org) on the expression of genes are also publicly available for analysis. Sequence motif discovery is set up as a feature extraction problem from these tissue-specific promoter sequences. Subsequently, a support vector machine (SVM) classifier is used to classify new promoters into specific and nonspecific categories based on the identified sequence features (motifs). Using the SVM classifier algorithm, 90% of tissue-specific genes are correctly classified based upon their upstream promoter region sequences alone.
(ii) Known long-range regulatory element (LRE) motifs: to analyze the motifs in LRE elements, we examine the results of the above approach on the Enhancer Browser dataset (http://enhancer.lbl.gov), which has results of expression of ultraconserved genomic elements in transgenic mice [8]. An examination of these ultraconserved enhancers is useful for the extraction of discriminatory motifs to distinguish the regulatory elements from the nonregulatory (neutral) ones. Here the results indicate that up to 95% of the sequences can be correctly classified using these identified motifs.
We note that some of the identified motifs might not be transcription factor binding motifs, and would need to be functionally characterized. This is an advantage of our method: instead of constraining ourselves to the degeneracy present in TF databases (like TRANSFAC/JASPAR), we look for all sequences of a fixed length.
2. CONTRIBUTIONS
[Flowchart: sequences (promoters/enhancers) from the Tissue Expression Atlas are split into tissue-specific and neutral training sequences, leading to biological interpretation of top-ranking motifs.]
OVERALL METHODOLOGY

5.
M_{i,k} = 1 if g_{i,k} >= 2 g_{i,[0.5T]}, and M_{i,k} = 0 otherwise. (1)
Ensembl Gene ID     AAAAAA  AAAAAG  AAAAAT  AAAACA
ENSG00000155366     0       0       1       4
ENSG000001780892    6       5       5       6
ENSG00000189171     1       2       1       0
ENSG00000168664     6       3       8       0
ENSG00000160917     4       1       4       2
ENSG00000163655     2       4       0       1
ENSG000001228844    8       6       10      7
ENSG00000176749     0       0       0       0
ENSG00000006451     5       2       2       1
PREPROCESSING
From the above, N_train,+1 x 1000 and N_train,-1 x 1000 dimensional cooccurrence matrices are available for the tissue-specific and nonspecific data, both for the promoter and enhancer sequences. Before proceeding to the feature (hexamer motif) selection step, the counts of the M = 1000 hexamers in each training sample need to be normalized to account for variable sequence lengths. In the cooccurrence matrix, let gc_{i,k} represent the absolute count of the kth hexamer, k in {1, 2, ..., M}, in the ith gene. Then, for each gene g_i, the quantile-labeled matrix has X_{i,k} = l if gc_{i,[((l-1)/K)M]} <= gc_{i,k} < gc_{i,[(l/K)M]}, with K = 4. Matrices of dimension N_train,+1 x 1001 and N_train,-1 x 1001 for the specific and nonspecific training samples are now obtained. Each matrix contains the quantile label assignments for the 1000 hexamers (X_i, i in (1, 2, ..., 1000)), as stated above, and the last column has the corresponding class label (Y = -1/+1).
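The quantile labeling step can be sketched as follows. This is our own illustration of the K = 4 binning; the cut-point and tie conventions are assumptions, since the bracketed order-statistic indices above leave them open.

```python
def quantile_labels(counts, K=4):
    """Assign each hexamer count a quantile label 1..K within one
    sequence: label l when the count falls between the (l-1)/K and l/K
    empirical quantiles of that sequence's counts."""
    m = len(counts)
    ranked = sorted(counts)
    # cut points taken at the (l/K)-th order statistics
    cuts = [ranked[min(int(l / K * m), m - 1)] for l in range(1, K)]
    labels = []
    for c in counts:
        l = 1 + sum(c >= cut for cut in cuts)
        labels.append(min(l, K))
    return labels
```

Applied per sequence, this turns raw hexamer counts into length-independent labels, which is the point of the normalization.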
7.

I(X_i^N -> Y^N) = sum_{n=1}^{N} I(X_i^n; Y_n | Y^{n-1}). (2)

Using a stationarity assumption over a finite-length memory of the training samples, a correspondence with the setup in [22, 23] can be seen. As already known [24], the mutual information is I(X^N; Y^N) = H(X^N) - H(X^N | Y^N), where H(X^N) and H(X^N | Y^N) are the Shannon entropy of X^N and the conditional entropy of X^N given Y^N, respectively. Correspondingly,

I(X^N -> Y^N) = sum_{n=1}^{N} [ H(X^n | Y^{n-1}) - H(X^n | Y^n) ]
              = sum_{n=1}^{N} [ H(X^n, Y^{n-1}) - H(Y^{n-1}) - H(X^n, Y^n) + H(Y^n) ]. (3)
We use, in this methodology, a Voronoi tessellation approach for entropy estimation, because of the higher performance guarantees as well as the relative ease of implementation of such a procedure.
The above method is used to estimate the true DI between a given hexamer and the class label for the entire training set. Feature selection comprises finding all those hexamers (X_i) for which I(X_i^N -> Y^N) is the highest. From the definition of DI, we know that 0 <= I(X_i^N -> Y^N) <= I(X_i^N; Y^N) < infinity. To make a meaningful comparison of the strengths of association between different hexamers and the class label, we use a normalized score to rank the DI values. This normalized measure should map this large range ([0, infinity)) to [0, 1]. Following [29], an expression for the normalized DI is given by

DI~ = 1 - e^{-2 I(X^N -> Y^N)} = 1 - e^{-2 sum_{i=1}^{N} I(X^i; Y_i | Y^{i-1})}. (4)
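Eq. (4) is straightforward to apply once raw DI estimates are available (a sketch; the function names are our own):

```python
import math

def normalized_di(raw_di):
    """Normalized DI of Eq. (4): 1 - exp(-2 * DI), mapping the raw
    range [0, inf) monotonically onto [0, 1)."""
    return 1.0 - math.exp(-2.0 * raw_di)

def rank_hexamers(di_by_hexamer):
    """Rank hexamers by normalized DI, strongest association first."""
    return sorted(di_by_hexamer,
                  key=lambda h: normalized_di(di_by_hexamer[h]),
                  reverse=True)
```

Because the mapping is monotone, ranking by normalized DI preserves the raw-DI ordering; the normalization only makes scores comparable across hexamers.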
in the hyperplane {x : f(x) = x^T beta + beta_0 = 0}, subject to y_i (x_i^T beta + beta_0) >= 1 - xi_i and xi_i >= 0 for all i, with sum_i xi_i <= constant [33].
10.
Our proposed approach is as follows. Here, the term sequence can pertain to either tissue-specific promoters or LRE sequences, obtained from the GNF SymAtlas and Ensembl databases or the Enhancer Browser.
(1) The sequence is parsed to obtain the relative counts/frequencies of occurrence of each hexamer in that sequence and to build the hexamer-sequence frequency matrix. The seqinr package in R is used for this purpose. This is done for all the sequences in the specific (class +1) and nonspecific (class −1) categories. The matrix thus has N = N_{train,+1} + N_{train,−1} rows and 4^6 = 4096 columns.
(2) The obtained hexamer-sequence frequency matrix is preprocessed by assigning quantile labels for each hexamer within the ith sequence. A hexamer-sequence matrix is thus obtained where the (i, j)th entry has the quantile label of the jth hexamer in the ith sequence. This is done for all the N training sequences, consisting of examples from the −1 and +1 class labels.
(3) Thus, two submatrices corresponding to the two class labels are built. One matrix contains the hexamer-sequence quantile labels for the positive training examples and the other matrix is for the negative training examples.
(4) To select hexamers that are most different between the positive and negative training examples, a t-test is performed for each hexamer, between the ts and nts groups. Ranking the corresponding t-test P-values yields those hexamers that are most different distributionally between the positive and negative training samples. The top 1000 of these hexamers are chosen for further analysis. This step is only necessary to reduce the computational complexity of the overall procedure: computing the DI between each of the 4096 hexamers and the class label is relatively expensive.
(5) For the top K = 1000 hexamers which are most significantly different between the positive and negative training examples, I(X_k^N → Y^N) and I(X_k^N; Y^N) reveal the degree of association for each of the k ∈ {1, 2, ..., K} hexamers. The entropy terms in the directed information and mutual information expressions are found using a higher-order entropy estimator. Using the procedure of Section 7, the raw DI values are converted into their normalized versions. Since the goal is to maximize I(X_k → Y), we can rank the DI values in descending order.
(6) The significance of the DI estimate is obtained based
on the bootstrapping methodology. For every hexamer, a P = 0.05 significance with respect to its
bootstrapped null distribution yields potentially discriminative hexamers between the two classes. The
Benjamini-Hochberg procedure is used for multiple-testing correction. Ranking the significant hexamers
by decreasing DI value yields features that can be used
for classifier (SVM) training.
(7) Train the support vector machine (SVM) classifier on
the top d features from the ranked DI list(s). For comparison with the MI-based technique, we use the hexamers which have the top d (normalized) MI values.
The accuracy of the trained classifier is plotted as a
function of the number of features (d), after ten-fold
cross-validation. As we gradually consider higher d, we
move down the ranked list. In the plots below, the misclassification fraction is reported instead. A fraction of
0.1 corresponds to 10% misclassification.
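Steps (4) and (6) above can be sketched as follows (hypothetical helpers; ranking by |t| approximately orders features the same way as ranking the corresponding P-values, and the Benjamini-Hochberg step-up rule is implemented directly rather than taken from a statistics package):

```python
import numpy as np

def t_statistics(pos, neg):
    """Per-column Welch t statistics between positive and negative groups.

    pos, neg: (samples x hexamers) quantile-label matrices.
    """
    m1, m2 = pos.mean(0), neg.mean(0)
    v1, v2 = pos.var(0, ddof=1), neg.var(0, ddof=1)
    return (m1 - m2) / np.sqrt(v1 / len(pos) + v2 / len(neg))

def benjamini_hochberg(pvals, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up rule."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    thresh = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    return order[:k]

rng = np.random.default_rng(1)
pos = rng.integers(1, 5, (30, 8)).astype(float)
neg = rng.integers(1, 5, (30, 8)).astype(float)
pos[:, 0] += 2  # make hexamer 0 clearly different between the classes
t = t_statistics(pos, neg)
top = np.argsort(-np.abs(t))  # most distributionally different hexamers first
print(top[0])
```

The screening step only prunes the candidate set; the DI/MI ranking of step (5) is then computed on the survivors.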
Note. An important point concerns the training of the SVM classifier with the top d features selected using DI or MI (step (7) above). Since the feature selection step is decoupled from the classification step, it is preferred that the top d motifs are consistently ranked high among multiple draws of the data, so as to warrant their inclusion in the classifier. However, this does not yield expected results on this data set. Briefly, a Kendall rank correlation coefficient [34] was computed between the rankings of the motifs across multiple data draws (obtained by sampling a subset of the entire dataset), for both MI- and DI-based feature selection. It is observed that this coefficient is very low for both MI and DI, indicating a highly variable ranking. This is likely due to the high variability in data distribution across these multiple draws (due to the limited number of data points), as well as the sensitivity of the data-dependent entropy estimation procedure to the range of the samples in the draw. To circumvent this problem of inconsistency in motif rank, a median DI/MI value is computed across these various draws, and the top d features based on the median DI/MI value across these draws are picked for SVM training [20].
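The rank-instability diagnosis and the median-based workaround can be sketched as follows (toy DI scores; all numbers hypothetical):

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two score vectors (no tie handling)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
        concordant += s > 0
        discordant += s < 0
    n = len(a)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical DI scores of 6 motifs over 3 data draws (rows = draws).
di_draws = np.array([[0.9, 0.1, 0.5, 0.4, 0.2, 0.7],
                     [0.2, 0.8, 0.6, 0.3, 0.1, 0.9],
                     [0.7, 0.2, 0.4, 0.6, 0.3, 0.8]])
print(kendall_tau(di_draws[0], di_draws[1]))  # low tau => unstable ranking

# Median DI across draws gives a stable criterion for picking the top-d motifs.
median_di = np.median(di_draws, axis=0)
d = 2
top_d = np.argsort(-median_di)[:d]
print(top_d)
```

A tau near zero between draws is the symptom described above; the median aggregation damps the draw-to-draw variability before the top-d cut is made.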
11. RESULTS

[Figure: GC-content distributions for housekeeping (hkg) and brain promoters, and misclassification fraction versus the number of top-ranking features (MI and DI) used for classification.]
[Figure: GC-content distributions for housekeeping (hkg) and heart promoters, and misclassification fraction versus the number of top-ranking features (MI and DI) used for classification.]

Table 2: Comparison of high-ranking motifs (by DI) across different data sets. The ( ) sign indicates tissue-specific expression of the corresponding TF gene.
Brain promoters: Ahr-ARNT ( ), Tcf11-MafG ( ), c-ETS ( ), FREAC-4, T3R-alpha1
Heart promoters: Pax2, Tcf11-MafG ( ), XBP1 ( ), Sox-17 ( ), FREAC-4, GATA ( )
Brain enhancers: HNF-4 ( ), Nkx2, AML1, c-ETS ( ), Elk1 ( )
A very interesting question emerges from the results presented above: what if a motif of interest is not present in the ranked hexamer list for a particular tissue-specific set? As an example, consider MyoD, a transcription factor which is expressed in muscle and is also active in heart-specific genes [39]. In fact, a variant of its consensus motif, CATTTG, is indeed in the top-ranking hexamer list. The DI-based framework further permits investigation of the directional association of the canonical MyoD motif (CACCTG) for the discrimination of heart-specific genes versus housekeeping genes. This is shown in Figure 10. As is observed, MyoD has a significant directional influence on the heart-specific versus neutral sequence class label. This, in conjunction with the expression-level characteristics of MyoD, indicates that the motif CACCTG is potentially relevant for distinguishing heart-specific from neutral sequences.
[Figure: GC-content distributions for housekeeping and brain enhancers, and misclassification fraction versus the number of top-ranking features (MI and DI) used for classification.]
Another theme picks up on something quite traditionally done in bioinformatics research: finding key TF regulators underlying tissue-specific expression. Two major questions emerge from this theme.
(1) Which putative regulatory TFs underlie the tissue-specific expression of a group of genes?
(2) For the TFs found using tools like TOUCAN [12], can we examine the degree of influence that the particular TF motif has in directing tissue-specific expression?
To address the first question, we examine the TFs revealed by DI/MI motif selection and compare these to the
TFs discovered from TOUCAN [12], underlying the expres-
[Figure 10: empirical CDF, F(x) versus the DI of MyoD → heart-specific promoters (x).]
11.4. Observations

With regard to the feature selection and classification results, in both studies (enhancers and promoters), we observe that about 100 hexamers are enough to discriminate the tissue-specific from the neutral sequences. Furthermore, some sequence features of these motifs at the promoter/enhancer emerge.
(i) There is higher sequence variability at the promoter, since it has to act in concert with LREs of different tissue types during gene regulation.
(ii) Since the enhancer/LRE acts with the promoter to confer expression in only one tissue type, these sequences are more specific, and hence their mining identifies motifs that are probably more indicative of tissue-specific expression.
We reiterate, however, that the enhancer dataset we study uses hsp68-lacZ as the promoter driven by the ultraconserved elements. Hence there is no promoter specificity in this context. Though this is a disadvantage and might not reveal all key motifs, it is the best that can be done in the absence of any other comprehensive repository.
The second aspect of the presented results highlights two important points. Firstly, the identified motifs have strong predictive value, as suggested by the cross-validation results as well as Table 2. Moreover, DI provides a principled methodology to investigate any given motif for tissue-specificity, as well as for identifying expression-level relationships between the TFs and their target genes (Section 11.3).
12. CONCLUSIONS
In this work, a framework for the identification of hexamer motifs to discriminate between two kinds of sequences (tissue-specific promoters or regulatory elements
versus nonspecific elements) is presented. For this feature se-
13. FUTURE WORK
Several opportunities for future work exist within this proposed framework. Multiple sequence alignment of promoter/regulatory sequences across species would be a useful preprocessing step to reduce false detection of discriminatory motifs. The hexamers can also be identified based on other metrics exploiting distributional divergence between the samples of the +1 and −1 classes. Furthermore, there is a need for consistent high-dimensional entropy estimators within the small-sample regime. A particularly interesting direction is the formulation of a stepwise hexamer selection algorithm, using the directed information for maximal-relevance selection and mutual information for minimizing between-hexamer redundancy [18]. This analysis is beyond the scope of this work, but an implementation
is available from the authors for further investigation. (The
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support of the NIH
under Award 5R01-GM028896-21 for J. D. Engel. They
would like to thank Professor Sandeep Pradhan and Mr.
Ramji Venkataramanan for useful discussions on directed
information. They are extremely grateful to Professor Erik
Learned-Miller and Dr. Damian Fermin for sharing their
code for high-dimensional entropy estimation and ENSEMBL sequence extraction, respectively. They also thank
the anonymous reviewers and the corresponding editor for
helping them improve the quality of the manuscript through
insightful comments and suggestions. The material in this
paper was presented in part at the IEEE Statistical Signal Processing Workshop 2007 (SSP07).
REFERENCES
[1] K. D. MacIsaac and E. Fraenkel, "Practical strategies for discovering regulatory DNA sequence motifs," PLoS Computational Biology, vol. 2, no. 4, p. e36, 2006.
[2] G. Kreiman, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes," Nucleic Acids Research, vol. 32, no. 9, pp. 2889–2900, 2004.
[3] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," Journal of Molecular Biology, vol. 268, no. 1, pp. 78–94, 1997.
[4] Q. Li, G. Barkess, and H. Qian, "Chromatin looping and the probability of transcription," Trends in Genetics, vol. 22, no. 4, pp. 197–202, 2006.
[5] D. A. Kleinjan and V. van Heyningen, "Long-range control of gene expression: emerging mechanisms and disruption in disease," The American Journal of Human Genetics, vol. 76, no. 1, pp. 8–32, 2005.
[6] L. A. Pennacchio, G. G. Loots, M. A. Nobrega, and I. Ovcharenko, "Predicting tissue-specific enhancers in the human genome," Genome Research, vol. 17, no. 2, pp. 201–211, 2007.
[7] D. C. King, J. Taylor, L. Elnitski, F. Chiaromonte, W. Miller, and R. C. Hardison, "Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences," Genome Research, vol. 15, no. 8, pp. 1051–1060, 2005.
[8] L. A. Pennacchio, N. Ahituv, A. M. Moses, et al., "In vivo enhancer analysis of human conserved non-coding sequences," Nature, vol. 444, no. 7118, pp. 499–502, 2006.
[9] K. Kadota, J. Ye, Y. Nakai, T. Terada, and K. Shimizu, "ROKU: a novel method for identification of tissue-specific genes," BMC Bioinformatics, vol. 7, p. 294, 2006.
[10] J. Schug, W.-P. Schuller, C. Kappen, J. M. Salbaum, M. Bucan, and C. J. Stoeckert Jr., "Promoter features related to tissue specificity as measured by Shannon entropy," Genome Biology, vol. 6, no. 4, p. R33, 2005.
[11] T. Werner, "Regulatory networks: linking microarray data to systems biology," Mechanisms of Ageing and Development, vol. 128, no. 1, pp. 168–172, 2007.
[12] S. Aerts, P. Van Loo, G. Thijs, et al., "TOUCAN 2: the all-inclusive open source workbench for regulatory sequence
Research Article
Splitting the BLOSUM Score into Numbers of Biological Significance
Francesco Fabris,1,2 Andrea Sgarro,1,2 and Alessandro Tossi3
1 Dipartimento di Matematica e Informatica, Università degli Studi di Trieste, via Valerio 12b, 34127 Trieste, Italy
2 Centro di Biomedicina Molecolare, AREA Science Park, Strada Statale 14, Basovizza, 34012 Trieste, Italy
3 Dipartimento di Biochimica, Biofisica, e Chimica delle Macromolecole, Università degli Studi di Trieste, via Licio Giorgieri 1, 34127 Trieste, Italy
1. INTRODUCTION
Substitution matrices have been in use since the introduction of the Needleman and Wunsch algorithm [1], and are referred to, either implicitly or explicitly, in several other papers from the seventies: McLachlan [2], Sankoff [3], Sellers [4], Waterman et al. [5], Dayhoff et al. [6]. These are the conceptual tools at the basis of several methods for attributing a similarity score to two aligned protein sequences. Any amino acid substitution matrix, which is a 20 × 20 table, has a scoring method that is implicitly associated with a set of target frequencies p(i, j) [7, 8], pertaining to the pair i, j of amino acids that are paired in the alignment. An important approach to obtaining the score associated with the paired amino acids i, j was that suggested by Dayhoff et al. [6], who developed a stochastic model of protein evolution called PAM (points of accepted mutations). In this model, the frequencies m(i, j) indicate the probability of change from one amino acid i to another amino acid j, in homologous protein sequences with at least 85% identity, during short-term evolution. The matrix M, relating each amino acid to each of the other 19, with an evolutionary distance of 1, would have entries m(i, j) close to 1 on the main diagonal (i = j) and close to 0 elsewhere. At evolutionary distance k, the score associated with the pair i, j is

s(i, j) = log [ m^k(i, j) / (p(i) p(j)) ],   (1)

where p(i) and p(j) are the observed frequencies of the amino acids.
S. Henikoff and J. G. Henikoff introduced the BLOck SUbstitution Matrix (BLOSUM) [9]. While the scoring method is always based on a log odds ratio, as seems natural in any kind of substitution matrix [7], the method for deriving the target frequencies is quite different from PAM: one needs to evaluate the joint target frequencies p(i, j) of finding the amino acids i and j paired in alignments among homologous proteins with a controlled rate of percent identity. This joint probability is compared with p(i)p(j), the product of the background frequencies of amino acids i and j, derived from the amino acid probability distribution P = {p_1, p_2, ..., p_20}.
The score for the pair i, j is then

s(i, j) = log [ p(i, j) / (p(i)p(j)) ],   (2)

and the total score of an alignment of length n is

Σ_{h=1}^{n} s(x_h, y_h) = Σ_{i,j} n(i, j) log [ p(i, j) / (p(i)p(j)) ],   (3)

where n(i, j) is the number of occurrences of the pair i, j inside the aligned sequences. This equation weighs the log ratio associated to the i, j entry of the BLOSUM matrix with the occurrences of the pair i, j, and seems intuitive following a heuristic approach, as any reasonable substitution matrix is implicitly of this form [7]. In order to compute the necessary target and background frequencies p(i, j) and p(i)p(j), S. Henikoff and J. G. Henikoff used the database BLOCKS (http://blocks.fhcrc.org/index.html), which contains sets of proteins with a controlled maximum rate λ of percent identity that defines the BLOSUM-λ matrix, so that BLOSUM-62 refers to λ = 62%, and so forth.

Scoring substitution matrices, such as PAM or BLOSUM, are used in modern web tools (BLAST, PSI-BLAST, and others) for performing database searches; the search is accomplished by finding all sequences that, when compared to a given query sequence, sum up a score over a certain threshold. The aim is usually that of discovering biological correlation among different sequences, often belonging to different organisms, which may be associated with a similar biological function. In most cases, this correlation is quite evident when proteins are associated with genes that have duplicated, or organisms that have diverged from one another relatively recently, and leads to high values of the BLOSUM (or PAM) score. But in some cases, a relevant biological correlation may be obscured by phenomena that reduce the score, making it difficult to capture. Those that limit the efficiency of the scoring method in finding concealed or weakly correlated sequences are well documented in the literature, the most relevant being:

(1) Gaps: insertions or deletions (of one or more residues) in one or both of the aligned sequences cause loss of synchronization, significantly decreasing the score;
(2) Bad λ: using a BLOSUM-λ matrix tailored for a particular evolutionary distance on sequences with a different evolutionary distance leads to a misleading score [7, 12, 13];
(3) Divergence in background distribution: standard substitution matrices, such as BLOSUM-λ, are truly appropriate only for comparison of proteins with standard background frequency distributions of amino acids [11].

We have set out to inspect, in more depth and by use of mathematical tools, what the BLOSUM score really measures from a biological point of view; the aim was to split the score into components, the BLOSpectrum, that provide insight on the above described phenomena and other biological information regarding the compared sequences, once the alignment has been made using the classical methods (BLAST, FASTA, etc.). We do not propose an alternative alignment algorithm or a method for increasing the performance of the available ones; nor do we suggest new methods for inserting gaps so as to maximize the score (see, e.g., [14, 15]). Ours is simply a diagnostic tool to reveal the following:

(1) if, for an available algorithm, the chosen scoring matrix is correct;
(2) whether the aligned sequences are typical protein sequences or not.

2. METHODS

The first tool we use is the mutual information I(X, Y) of two random variables X and Y,

I(X, Y) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ],   (4)
where p(i, j), p(i), p(j) are, respectively, the joint probability distribution and the marginals associated to the random variables X and Y. We can adapt (4) to the comparison of two sequences if we interpret p(i, j) as the relative frequency of finding amino acids i and j paired in the X and Y sequences, and p(i) (p(j)) as that of finding amino acid i (j) in sequence X (Y). Following this approach, in a biological setting, mutual information (MI) becomes a measure of the stochastic correlation between two sequences. It can be shown (see the appendix) that I(X, Y) ≤ log 20 ≈ 4.3219.
The second tool is the informational divergence D(P//Q) between two probability distributions P = {p_1, p_2, ..., p_K} and Q = {q_1, q_2, ..., q_K} [18], where

D(P//Q) = Σ_{i=1}^{K} p(i) log [ p(i) / q(i) ].   (5)
It is immediate to verify that

I(X, Y) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ] = D(P_XY // P_X P_Y) ≥ 0,   (6)

so that MI is really a special kind of ID, one that measures the distance between the joint probability distribution P_XY and the product P_X P_Y of the two marginals P_X and P_Y.
Given two amino acid sequences X and Y, the corresponding BLOSUM (unscaled) normalized score S_N(X, Y), measured in bits, is computed as

S_N(X, Y) = (1/n) Σ_{h=1}^{n} s(x_h, y_h) = Σ_{i,j} f(i, j) log [ p(i, j) / (p(i)p(j)) ],   (7)

where f(i, j) is the observed relative frequency of the pair i, j in the alignment. Consider also the quantity

I(A, B) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ]   (8)
as the mutual information, or relative entropy, of the target and background frequencies associated to the database BLOCKS, or to any other protein model used to find the target frequencies. Here A and B are dummy random variables taken to have generated the data of the database. The quantity I(A, B) was in effect used by Altschul in the case of PAM matrices [7], and by S. Henikoff and J. G. Henikoff [9] for the BLOSUM matrices, and in both cases it can be interpreted as
the average exchange of information associated with a pair
of aligned amino acids of the data bank, or as the expected
average score associated to pairs of amino acids, when they
are put into correspondence in alignments that adhere to
the protein model over which the matrices are computed.
From the perspective of an aligning method, we can state that
I(A, B) measures the average information available for each
position in order to distinguish the alignment from chance,
so that the higher its value, the shorter the fragments whose
alignment can be distinguished from chance [7]. Equation
(6) (or (A.4) in the appendix) ensures also that this average
score is always greater than or equal to zero.
On the other hand, if we compute the expected score when two amino acids i and j are picked at random in an independence setting model, given as

E(A, B) = Σ_{i,j} p(i)p(j) log [ p(i, j) / (p(i)p(j)) ] = −D(P_X P_Y // P_XY) ≤ 0,   (9)

the classical assumptions made in constructing a scoring matrix [7] require that this expected score is lower than or equal to zero. Note that all these quantities pertain to the database BLOCKS (in the case of BLOSUM), that is, to the particular protein model used.
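The sign constraints on (8) and (9) can be checked on a toy target model (hypothetical two-letter alphabet and frequencies, standing in for the 20-letter BLOCKS-derived model):

```python
import math

# Toy target model: joint target frequencies p(i,j) and background
# marginals p(i) (hypothetical numbers).
p_joint = {('a', 'a'): 0.4, ('a', 'b'): 0.1,
           ('b', 'a'): 0.1, ('b', 'b'): 0.4}
p = {'a': 0.5, 'b': 0.5}

# Eq. (8): average score when pairs follow the target model.
I_AB = sum(pij * math.log2(pij / (p[i] * p[j]))
           for (i, j), pij in p_joint.items())

# Eq. (9): expected score when residues are paired at random.
E_AB = sum(p[i] * p[j] * math.log2(pij / (p[i] * p[j]))
           for (i, j), pij in p_joint.items())

print(I_AB, E_AB)
```

Any model built this way yields a positive per-position information I(A, B) and a nonpositive chance-level score E(A, B), which is exactly the classical scoring-matrix requirement.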
Analogously, the mutual information of the two aligned sequences themselves is computed from the observed frequencies:

I(X, Y) = Σ_{i,j} f(i, j) log [ f(i, j) / (f_X(i) f_Y(j)) ],   (10)
2.2.

The BLOSUM normalized score can then be split as

S_N(X, Y) = I(X, Y) − D(F_XY // P_AB) + D(F_X // P_A) + D(F_Y // P_B).   (11)
3.1.
consequence that only alignments characterized by remarkable values of I(X, Y) will emerge.
There are therefore essentially three cases of biological interest, which we can now analyze in terms of the correspondence between the mathematical and biological meaning of the terms.

Case 1. The joint observed frequencies F_XY are typical,1 that is, they are very close to the target frequencies, F_XY ≈ P_AB. In this case, D(F_XY//P_AB) ≈ 0 and also D(F//P) ≈ 0.

Case 2. The joint observed frequencies F_XY are not typical (F_XY ≠ P_AB), but the marginals are typical (F_X ≈ P, F_Y ≈ P). In this case, D(F_XY//P_AB) ≫ 0, but D(F//P) ≈ 0.

Case 3. Both the joint observed F_XY and the marginals F_X, F_Y are not typical, that is, F_XY ≠ P_AB, F_X ≠ P, F_Y ≠ P. In this case, D(F_XY//P_AB) ≫ 0, but also D(F//P) ≫ 0.
Case 1 is straightforward: two similar protein sequences with a typical background amino acid distribution, and amino acids paired in a way that complies with the protein model implicit in BLOCKS, result in a high score. This is frequently the case for two firmly correlated sequences, belonging to the same family of proteins with standard amino acid content, associated with organisms that diverged only recently.
Case 2 is rather more interesting: the amino acid distribution is close to the background distribution (these are typical protein sequences), but the score is highly penalized, as the observed joint frequencies are different from the target frequencies implicit in the BLOCKS database. This can have different causes. For example, the chosen BLOSUM matrix may be incorrectly matched to the evolutionary distance of the sequences, or the sequences may have diverged under a nonstandard evolutionary process. For high-scoring alignments involving unrelated sequences, the target frequency divergence D(F_XY//P_AB) will tend to be low, due to the second theorem of Karlin and Altschul [8], when the target frequencies associated to the scoring matrix in use are the correct ones for the aligned sequences being analyzed.2 This is because any set of target frequencies in any particular amino acid substitution matrix, such as BLOSUM-λ, is tailored to a particular degree of evolutionary divergence between the sequences, generally measured by relative entropy (8) [7], and related with the controlled maximum rate λ of percent identity. So a low D(F_XY//P_AB) ≈ 0 is evidence that the BLOSUM-λ matrix we are using is the correct one, as a precise consequence of a mathematical theorem, while conversely for positive (or almost positive) scoring alignments with large target frequency divergence, the sequences may be
1 Recall that the concept of typicality always refers to the adherence of the various probability distributions to that of the protein model associated to the database BLOCKS.
2 Note that in general, choosing the (λ parameter associated with the) smallest D(F_XY//P_AB) is different from choosing the minimum E-value associated with different λ parameters. Recall that E = m·n·2^(−S), where S is the score and m and n are the sequence lengths.
related at a different evolutionary distance than that of the substitution matrix in use. Trying several scoring matrices until something interesting is found is a common practice in protein sequence alignment [20]. In our case, scanning the λ range could thus lead to a significant decrease in D(F_XY//P_AB), as detected in the BLOSpectrum, and improve the score [7, 12, 13], taking it back to Case 1. This could in turn result in a better capacity to discriminate weakly correlated sequences from those correlated by chance. If, on the other hand, tuning λ does not greatly affect D(F_XY//P_AB), and we are comparing typical sequences (low background frequency divergence) with an appropriate λ parameter, the large target frequency divergence indicates that some nonstandard evolutionary process (regarding the substitution of amino acids) is at work. This cannot adequately be captured by the standard BLOCKS database and BLOSUM substitution matrices. Under these circumstances, Case 2 can never lead to high scores, due to the penalization of the target frequency divergence. We are here likely in the grey area of weakly correlated sequences with a very old common ancestor, or of portions of proteins with strong structural properties that do not require the conservation of the entire sequence. Note that, unfortunately, we are not able to assess the statistical significance when our method finds a suspected concealed correlation; however, the method still gives us useful information that helps guide our judgment on the possible existence of such correlation, which needs to be investigated further in depth, exploiting other biological information such as 3D structure and biological function.
Case 3 accounts for the situation in which we have two nontypical sequences, with high values of both target and background frequency divergence. This applies, for example, to some families of antimicrobial peptides that are unusually rich in certain amino acids (such as Pro and Arg, Gly, or Trp residues). This means that the high penalty arising from the subtracted D(F_XY//P_AB) is (at least partially) compensated by the positive D(F_X//P_A) and D(F_Y//P_B), and the global score does not collapse to negative values, even if it is usually low. In effect, the background frequency divergence acts as a compensation factor that prevents excessive penalties for those sequences which, even though related by nonstandard amino acid substitutions, also have a nontypical background distribution of the amino acids inside the sequences themselves. In other words, the nontypicality of F_XY is (at least in part) forced by the anomalous background frequencies of the amino acids. This compensation is welcome, since it avoids missing biologically related sequences pertaining to nontypical protein families, and mathematically corroborates the robustness of the BLOSUM scoring method.
The problem of evaluating the best method for scoring nonstandard sequences has been recently tackled by Yu et al. [11, 21], who showed that standard substitution matrices are not truly appropriate in this case, and developed a method for obtaining compositionally adjusted matrices. In general, background frequencies differing markedly from those implicit in the substitution matrix (i.e., a high background frequency divergence) is one case in which using a standard matrix is nonoptimal. Another is
[Table (caption lost in extraction): ranges of the BLOSpectrum components]
I(X, Y):        <0.9    0.9–1.1    >1.1
D(F_XY//P_AB):  <1.1    1.1–1.5    >1.5
D(F//P):        <0.3    0.3–0.7    >0.7
the algebraic sum of the four terms, together with the rough BLOSUM score, directly obtained by summing up the integer values of the BLOSUM-λ matrix. As already observed in Section 2.2, the pairs containing a gap, such as (−, j) or (i, −), are not considered in the computation, since their contribution to the score is zero when one assumes independence between a gap and the paired amino acid.
There are essentially two ways of employing the BLOSpectrum. The first one is that of performing a BLAST or FASTA search inside a database, given a query sequence. The result is a set of h possible matches, ordered by score, in which the query sequence and the corresponding match are paired for lengths that are, respectively, n_1, n_2, ..., n_h. The user can extract all matches of interest within the output set and compare them with the query sequence by using the BLOSpectrum software. The second one is that of comparing two assigned sequences with a program such as BLAST2, so as to find the best gapped alignment. Also in this case we can use BLOSpectrum on the two portions of the query sequences that are paired by BLAST2 and that have the same length n. It is obvious that the next step would be that of integrating the BLOSpectrum tool inside a widely used database search engine.
Even if the correct way of using the BLOSpectrum software is that of supplying it with two sequences of the same length, derived from preceding queries of BLAST, BLAST2, FASTA, or others, the BLOSpectrum applet also accepts two sequences of different lengths n and m > n; in this case the program merely computes the scores associated to all possible alignments of n over m, showing the highest one, but it does not insert gaps.
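The applet's handling of sequences of different lengths, as described above, amounts to scanning all ungapped placements of the shorter sequence along the longer one (a sketch with a toy match/mismatch score standing in for a BLOSUM lookup):

```python
def best_ungapped_offset(short, long_, score):
    """Slide the shorter sequence along the longer one (no gaps) and return
    the offset and score of the best ungapped alignment (illustrative sketch).

    `score(a, b)` is any pairwise scoring function.
    """
    n, m = len(short), len(long_)
    best = max(range(m - n + 1),
               key=lambda off: sum(score(a, b)
                                   for a, b in zip(short, long_[off:off + n])))
    best_score = sum(score(a, b) for a, b in zip(short, long_[best:best + n]))
    return best, best_score

# Toy match/mismatch score in place of a real substitution matrix.
score = lambda a, b: 1 if a == b else -1
offset, s = best_ungapped_offset("acg", "ttacgtt", score)
print(offset, s)
```

This is exactly "all possible alignments of n over m": m − n + 1 candidate placements, with the highest-scoring one reported and no gap insertion.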
3.3. Biological examples
To illustrate the behavior of the BLOSpectrum under the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank (http://www.expasy.uniprot.org) (see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related, even if aligning with rather modest scores.
The first set contains sequences from the related Hepatocyte nuclear factor 4 (HNF4-α), Hepatocyte nuclear factor 6 (HNF6), and GATA binding protein 1 (globin transcription factor 1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.
The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree, constructed according to the multiple alignment for all members of this family [23], is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes. Another example pertaining to weakly correlated sequences that show distant relationships is the one originally used by

Table 2: The three sets of protein families used in testing the BLOSpectrum. The UniProt ID is furnished (with the sequence length). For the defensins and Pro-rich peptides, only the mature peptide sequences were used in alignments. In the following tables, sequences are indicated by the corresponding numbers 1–4.
First set
HNF4-α: P41235 (465), H. sapiens; P49698 (465), Mus musculus; P22449 (465), Rattus norv.
HNF6: Q9UBC0 (465), H. sapiens; O08755 (465), Mus musculus; P70512 (465), Rattus norv.
GAT1: P15976 (413), H. sapiens; P17679 (413), Mus musculus; P43429 (413), Rattus norv.

Second set
Serine proteases: P07477 (247), H. sapiens trypsin; P17538 (263), H. sapiens chymotrypsin; Q9UNI1 (258), H. sapiens elastase1; P00775 (259), Streptomyces griseus trypsin; P35049 (248), Fusarium oxysporum trypsin
Hemoglobins: P02232 (92), Vicia faba leghemoglobin I; S06134 (92), P. chilensis hemoglobin I
Transposons: A26491 (41), D. mauritiana mariner transposon; NP493808 (41), C. elegans transposon TC1
Beta defensins: BD01 (36), H. sapiens; BD02 (41), H. sapiens; BD03 (39), H. sapiens; BD04 (50), H. sapiens

Third set
Pro/Arg-rich peptides: PF (82), pig
does not appear, at first sight, to be too abnormal. The sequence comparisons score are modest at best, even though
members are known to be biologically correlated.
The third set contains sequences that are expected to fall
in Case 3. These are members of the Bactenecins family of linear antimicrobial peptides, with an unusually high content
of Pro and Arg residues, and an identity of about 35% [27],
representing sequences with a highly atypical amino acid frequency distribution.
If we analyze the alignments within all these sets of protein families, we effectively find examples for each of the three cases illustrated in the preceding section. The alignments of human and mouse HNF4-α sequences (as illustrated in Table 3), and the BLOSpectrum of the HNF4-α, HNF6, and GAT1 sequence comparisons (see Figure 1), are clear examples of Case 1, with high correlation between all respective couples of sequences and a target frequency divergence that is strongly sensitive to the BLOSUM clustering parameter, so we stop the scoring procedure at step 5.
For example, the HNF4-α alignment has a target frequency divergence that varies from 2.41 to 0.93 when passing from BLOSUM-35 (a matrix tailored for a wrong
Table 3: BLOSUM decomposition for intrafamily alignments for proteins of the first set.

HNF4-α human versus HNF4-α mouse
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 3.939 | 0.929 | 0.050 | 0.057 | 3.118 | 2833 | 95.9
80 | 3.939 | 1.297 | 0.046 | 0.053 | 2.741 | 2537 | 95.9
62 | 3.939 | 1.582 | 0.046 | 0.052 | 2.456 | 2330 | 95.9
50 | 3.939 | 1.861 | 0.043 | 0.050 | 2.171 | 3003 | 95.9
40 | 3.939 | 2.226 | 0.039 | 0.047 | 1.800 | 3381 | 95.9
35 | 3.939 | 2.414 | 0.036 | 0.044 | 1.605 | 2982 | 95.9

HNF4-α (BLOSUM-100)
Sequences | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
1-3 | 3.955 | 0.930 | 0.050 | 0.056 | 3.132 | 2846 | 96.3
2-3 | 4.141 | 1.008 | 0.057 | 0.056 | 3.246 | 2952 | 99.5

[Figure 1: BLOSpectrum (BLOSUM-100) for intrafamily comparisons of the first set: HNF4-α human versus mouse, GAT1 human versus mouse, HNF6 human versus mouse. Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]
evolutionary distance) to BLOSUM-100 (the matrix tailored for a correct evolutionary distance), so that minimizing the frequency divergence (rows in italic) helps identify the best parameter for comparing the analyzed sequences; it corresponds to BLOSUM-100, coherent with the high percent identity (86-96%). In this case, the compensation factor D(FX//P) + D(FY//P), corresponding to the background frequency divergence, is almost zero, since the observed background and target frequencies are very near to those implicit in the BLOCKS database, leading to the conclusion that these are typical sequences that correspond closely to the protein model associated with BLOCKS. The global (normalized) score is high (3.12 in the HNF4-α example), due to a high degree of stochastic similarity (I(X, Y) ≈ 3.94), which is not
[Figure 2 panels (ungapped and gapped): chymotrypsin human versus S. griseus trypsin; Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I; D. mauritiana mariner transposon versus C. elegans transposon TC1; BD01 human versus BD02 human. Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]
Figure 2: BLOSpectrum for (ungapped and gapped) sequences of the second set.
Table 4: BLOSUM decomposition for ungapped and gapped serine proteases.

Serine proteases, ungapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 1.014 | 2.023 | 0.134 | 0.132 | -0.742 | -398 | 11.5
80 | 1.014 | 1.739 | 0.141 | 0.137 | -0.446 | -230 | 11.5
62 | 1.014 | 1.570 | 0.146 | 0.145 | -0.264 | -121 | 11.5
50 | 1.014 | 1.437 | 0.134 | 0.141 | -0.147 | -120 | 11.5
40 | 1.014 | 1.321 | 0.132 | 0.138 | -0.035 | -42 | 11.5
35 | 1.014 | 1.305 | 0.136 | 0.145 | -0.008 | – | 11.5

Serine proteases, gapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 1.645 | 1.213 | 0.164 | 0.156 | 0.753 | 326 | 35.9
80 | 1.645 | 1.138 | 0.170 | 0.164 | 0.842 | 382 | 35.9
62 | 1.645 | 1.149 | 0.178 | 0.171 | 0.845 | 416 | 35.9
50 | 1.645 | 1.176 | 0.171 | 0.159 | 0.800 | 557 | 35.9
40 | 1.645 | 1.270 | 0.170 | 0.158 | 0.703 | 640 | 35.9
35 | 1.645 | 1.346 | 0.177 | 0.163 | 0.640 | 584 | 35.9
background frequency divergences to remarkably lower values (0.237 and 0.226), neutralizing the compensation (see
Table 6 and Figure 2, third column).
In both the preceding examples, we are in the situation
where the parameter of the substitution matrix is appropriate for the sequence divergence of the sequences in question,
the background frequency divergence is small, but the target
frequency divergence is still large: this is a signal that we are
dealing with weakly related sequences, characterized by several events of substitution that occurred during evolution. It
is usually difficult to capture these weakly related sequences using standard scoring matrices, such as BLOSUM or PAM, since the common ancestor could be very old. As a matter of fact, this difficulty was used to test, respectively, the PAM-250 versus PAM-120 matrices (Altschul [7], hemoglobin) and the BLOSUM-62 versus PAM-160 matrices (S. Henikoff and J. G. Henikoff [9], transposons). Here, we cannot remove the cause of mismatching, and we leave the scoring procedure at step 6.
The last example from this group derives from the human beta defensins; even if these sequences are known to be evolutionarily related, some couples actually show a negative normalized score (1-4, 2-3, 2-4; see Table 7 and Figure 2, last column), suggesting that they are not. In fact, a normal BLOSUM-62 BLAST search using the human beta defensin 1 sequence picks up several homologues from other mammalian species, whereas those with the three paralogous human sequences are below the cutoff score. BLOSpectrum analysis reveals a high stochastic correlation I(X, Y) (2.00-3.03), neutralized by an even higher penalty factor due to the target frequency divergence (3.28-3.56), partly compensated by the substantial background frequency divergences (0.54-0.79), and with little effect of the BLOSUM clustering parameter, or of introducing gaps. These are fairly typical proteins, whose
[Table 5: BLOSUM decomposition for ungapped and gapped hemoglobins.]

Hemoglobins, ungapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 1.839 | 2.478 | 0.264 | 0.207 | -0.166 | -31 | 15.2
80 | 1.839 | 2.240 | 0.264 | 0.199 | 0.063 | 12 | 15.2
62 | 1.839 | 2.128 | 0.260 | 0.192 | 0.163 | 35 | 15.2
50 | 1.839 | 2.077 | 0.255 | 0.185 | 0.203 | 54 | 15.2
40 | 1.839 | 2.051 | 0.255 | 0.194 | 0.237 | 83 | 15.2
35 | 1.839 | 2.070 | 0.263 | 0.202 | 0.235 | 82 | 15.2

Hemoglobins, gapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 1.597 | 1.962 | 0.166 | 0.172 | -0.026 | -10 | 18.1
80 | 1.597 | 1.759 | 0.161 | 0.163 | 0.162 | 40 | 18.1
62 | 1.597 | 1.661 | 0.154 | 0.153 | 0.243 | 65 | 18.1
50 | 1.597 | 1.618 | 0.145 | 0.145 | 0.268 | 104 | 18.1
40 | 1.597 | 1.606 | 0.145 | 0.155 | 0.291 | 152 | 18.1
35 | 1.597 | 1.623 | 0.154 | 0.163 | 0.283 | 148 | 18.1
P02232:  2 FTEKQEALVNSSSQLFKQNPSNYSVLFYTIILQKAPTAKAMFSFLK--DSAGVVDSPKLGAHAEKVF 68
S06134: 12 LTLAQKKIVRKTWHQLMRNKTSFVTDVFIRIFAYDPSAQNKFPQMAGMSASQLRSSRQMQAHAIRVS 78
frequency divergences. Note that in two cases, a mildly positive score could suggest a distant relationship. Analysis of the
BLOSpectrum helps in evaluating this possibility. The PF12
versus GAT1 alignment is simply a case of overcompensation
for a nontypical sequence (the background frequency divergence for one of the sequences is very high). In the second
case, however, the I(X, Y ) value for the BD04 versus GAT1
human alignment is surprisingly quite high, suggesting that
a closer look might be appropriate.
4. CONCLUSIONS
Table 6: BLOSUM decomposition for ungapped and gapped transposons.

Transposons, ungapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 2.339 | 2.926 | 0.740 | 0.531 | 0.685 | 55 | 34.1
80 | 2.339 | 2.849 | 0.733 | 0.531 | 0.754 | 60 | 34.1
62 | 2.339 | 2.800 | 0.724 | 0.526 | 0.789 | 67 | 34.1
50 | 2.339 | 2.831 | 0.721 | 0.516 | 0.746 | 90 | 34.1
40 | 2.339 | 2.935 | 0.716 | 0.509 | 0.630 | 104 | 34.1
35 | 2.339 | 2.969 | 0.714 | 0.505 | 0.590 | 92 | 34.1

Transposons, gapped
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 1.991 | 2.244 | 0.244 | 0.243 | 0.235 | 40 | 25.0
80 | 1.991 | 2.110 | 0.246 | 0.234 | 0.362 | 67 | 25.0
62 | 1.991 | 2.021 | 0.245 | 0.227 | 0.443 | 91 | 25.0
50 | 1.991 | 2.009 | 0.237 | 0.226 | 0.445 | 123 | 25.0
40 | 1.991 | 2.043 | 0.227 | 0.228 | 0.404 | 152 | 25.0
35 | 1.991 | 2.066 | 0.226 | 0.229 | 0.381 | 144 | 25.0
[Table 7: BLOSUM decomposition for the beta defensins.]

BD01 versus BD02
BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 3.030 | 3.566 | 0.564 | 0.618 | 0.646 | 45 | 41.6
80 | 3.030 | 3.453 | 0.568 | 0.623 | 0.768 | 58 | 41.6
62 | 3.030 | 3.438 | 0.604 | 0.652 | 0.849 | 65 | 41.6
50 | 3.030 | 3.418 | 0.615 | 0.663 | 0.891 | 99 | 41.6
40 | 3.030 | 3.378 | 0.577 | 0.626 | 0.855 | 129 | 41.6
35 | 3.030 | 3.320 | 0.539 | 0.588 | 0.837 | 120 | 41.6

Pairwise defensin comparisons
Sequences | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
1-3 | 2.731 | 3.325 | 0.539 | 0.751 | 0.697 | 101 | 30.5
1-4 | 2.532 | 3.658 | 0.539 | 0.728 | 0.141 | 22 | 16.6
2-3 | 2.009 | 3.466 | 0.794 | 0.616 | -0.045 | -10 | 10.2
2-4 | 2.334 | 3.522 | 0.609 | 0.568 | -0.009 | – | 12.1
3-4 | 2.122 | 3.286 | 0.794 | 0.655 | 0.286 | 44 | 20.5
[Table: BLOSUM decomposition for the Pro/Arg-rich peptides of the third set.]

BLOSUM | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
100 | 0.424 | 4.935 | 2.329 | 2.460 | 0.279 | 28 | 34.8
80 | 0.424 | 4.724 | 2.317 | 2.449 | 0.467 | 42 | 34.8
62 | 0.424 | 4.637 | 2.301 | 2.430 | 0.518 | 37 | 34.8
50 | 0.424 | 4.533 | 2.264 | 2.389 | 0.544 | 68 | 34.8
40 | 0.424 | 4.407 | 2.221 | 2.338 | 0.576 | 97 | 34.8
35 | 0.424 | 4.368 | 2.199 | 2.301 | 0.556 | 98 | 34.8

Sequences | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
1-3 | 0.516 | 4.434 | 2.095 | 2.205 | 0.382 | 63 | 30.9
1-4 | 0.446 | 4.491 | 2.199 | 2.488 | 0.643 | 110 | 39.5
2-3 | 0.584 | 4.156 | 2.095 | 2.257 | 0.780 | 133 | 47.6
2-4 | 0.406 | 4.350 | 2.256 | 2.251 | 0.563 | 134 | 37.2
3-4 | 0.609 | 4.260 | 2.095 | 2.347 | 0.792 | 132 | 45.2
APPENDIX

Proof of (11). By multiplying inside the log function of (7) by f(i, j)/f(i, j) and by f(i)f(j)/f(i)f(j) and rearranging the terms, we obtain

S_N(X, Y) = Σ_{i,j} f(i, j) log [p(i, j) f(i, j) f(i) f(j)] / [p(i) p(j) f(i, j) f(i) f(j)]
          = Σ_{i,j} f(i, j) log f(i, j)/[f(i) f(j)] - Σ_{i,j} f(i, j) log f(i, j)/p(i, j) + Σ_{i,j} f(i, j) log [f(i) f(j)]/[p(i) p(j)]
[Figure 3: BLOSpectrum (BLOSUM-35) for sequences of the third set: BCT5 bovin versus BCT7 bovin, BCT5 bovin versus PR39PRC pig, BCT7 bovin versus PR39PRC pig. Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]
[Table: BLOSUM decomposition for alignments of noncorrelated sequences.]

Sequences | I(X,Y) | D(FXY//PAB) | D(FX//P) | D(FY//P) | SN(X,Y) | Score | % Identity
1-1 | 0.578 | 0.986 | 0.036 | 0.205 | -0.165 | -312 | 5.37
– | 0.712 | 1.033 | 0.038 | 0.193 | -0.088 | -144 | 8.71
– | 0.622 | 1.122 | 0.230 | 0.193 | -0.076 | -143 | 8.47
– | 1.010 | 3.887 | 0.460 | 2.220 | -0.195 | -36 | 10.0
– | 0.686 | 3.486 | 2.243 | 0.465 | -0.091 | -24 | 18.2
– | 0.709 | 3.033 | 2.182 | – | -0.136 | -25 | 12.0
          = I(X, Y) - D(FXY//PAB) + D(FX//PA) + D(FY//PB).   (A.1)
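The identity (A.1) can also be verified numerically. The sketch below (function name and distributions are illustrative) draws strictly positive joint distributions f(i, j) and p(i, j) over a small alphabet and checks that the direct expansion of S_N(X, Y) equals the four-term decomposition:

```python
import math, random

def identity_check(K=4, seed=1):
    rng = random.Random(seed)
    # Arbitrary strictly positive joint distributions f(i,j) and p(i,j).
    f = [[rng.random() + 0.1 for _ in range(K)] for _ in range(K)]
    p = [[rng.random() + 0.1 for _ in range(K)] for _ in range(K)]
    zf = sum(map(sum, f)); zp = sum(map(sum, p))
    f = [[v / zf for v in row] for row in f]
    p = [[v / zp for v in row] for row in p]
    fx = [sum(f[i]) for i in range(K)]                       # marginals of f
    fy = [sum(f[i][j] for i in range(K)) for j in range(K)]
    px = [sum(p[i]) for i in range(K)]                       # marginals of p
    py = [sum(p[i][j] for i in range(K)) for j in range(K)]
    # Left-hand side: S_N(X,Y) expanded directly.
    sn = sum(f[i][j] * math.log2(p[i][j] / (px[i] * py[j]))
             for i in range(K) for j in range(K))
    # Right-hand side: I(X,Y) - D(FXY//PAB) + D(FX//PA) + D(FY//PB).
    i_xy = sum(f[i][j] * math.log2(f[i][j] / (fx[i] * fy[j]))
               for i in range(K) for j in range(K))
    d_t = sum(f[i][j] * math.log2(f[i][j] / p[i][j])
              for i in range(K) for j in range(K))
    d_x = sum(fx[i] * math.log2(fx[i] / px[i]) for i in range(K))
    d_y = sum(fy[j] * math.log2(fy[j] / py[j]) for j in range(K))
    return sn, i_xy - d_t + d_x + d_y

lhs, rhs = identity_check()
assert abs(lhs - rhs) < 1e-9   # the two sides agree to machine precision
```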
0.465
16
[Figure: BLOSpectrum (BLOSUM-35) for noncorrelated sequences: HNF4 human versus HNF6 human, HNF6 human versus GAT1 human, HNF4 human versus GAT1 human, PF12 pig versus GAT1 human, BD04 human versus BCT7 bovin, BD04 human versus GAT1 human. Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.]
D(P//Q) = Σ_{i=1}^{K} p(i) log [p(i)/q(i)],   (A.2)

0 ≤ D(P//Q) ≤ +∞   (= 0 when P ≡ Q; = +∞ when there exists i such that p(i) > 0 and q(i) = 0).   (A.3)
Since D(P//Q) = 0 if and only if P ≡ Q, this allows us to interpret the ID as a measure of (pseudo)distance between probability distributions. It is only a pseudodistance (from the mathematical point of view), since the concept of distance is well defined in mathematics and also requires symmetry between the variables and the validity of the so-called triangle inequality. The ID lacks both of these properties: in general, D(P//Q) ≠ D(Q//P) (it is asymmetric), and, if R is a third probability distribution, we are not sure that D(P//R) + D(R//Q) is greater than D(P//Q) (the triangle inequality does not hold). We underline that such a distance is not symmetric (so the order in which P and Q are specified does matter); that is, it is a distance from rather than a distance between.
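The asymmetry is easy to exhibit numerically; a small sketch (the distributions are chosen purely for illustration):

```python
import math

def kl(p, q):
    """Informational divergence D(P//Q), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.05, 0.05]
Q = [1 / 3, 1 / 3, 1 / 3]
assert kl(P, P) == 0                       # D(P//P) = 0
assert kl(P, Q) > 0 and kl(Q, P) > 0       # nonnegative in both directions
assert abs(kl(P, Q) - kl(Q, P)) > 0.1      # but clearly asymmetric
```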
Suppose now that PX = {pX(1), pX(2), ..., pX(K)} and PY = {pY(1), pY(2), ..., pY(K)} are the probability distributions associated with the (random) variables X and Y, which take their values in the same alphabet A. Here, pX(i) = Pr{X = i} denotes the probability that the variable X assumes the value i. In our framework, X and Y are two protein sequences of the same length n, and pX(2) = Pr{X = 2} = 0.09 (e.g.) is interpreted as the relative frequency of the second amino acid of the alphabet A; so, the overall occurrence of the 2nd amino acid in sequence X is equal to 0.09n. In this context, we can also introduce a joint probability distribution associated with the sequences, PXY = {pXY(i, j), i, j ∈ A}, where pXY(i, j) = Pr{X = i, Y = j} corresponds to the relative frequency of finding the amino acids i, j paired in a certain position of the alignment between X and Y. It is well known that Σ_{i,j} pXY(i, j) = 1 (PXY is a probability distribution) and that the sum of the joint probabilities over one variable gives the marginal of the other variable: Σ_j pXY(i, j) = pX(i). For example, given that the ninth and the fifth amino acids in the alphabet are arginine and leucine, respectively, pXY(9, 5) = pXY(Arg, Leu) = 0.01 means that the relative frequency of finding Arg in X paired with Leu in Y is equal to 0.01. In practice, we avoid the use of the subscripts and use the simpler notation p(i) and p(i, j) instead of pX(i) and pXY(i, j).
Since the condition of independence between two variables (protein sequences) X and Y is fixed by the formula pXY(i, j) = pX(i)pY(j) (for each pair i, j ∈ A), then, once a certain PXY is assigned, it could be interesting to evaluate the distance of PXY from the condition of independence between the variables. Making use of the ID (A.2), we need to evaluate the quantity D(PXY//PX PY), that is, the stochastic distance between the joint PXY and the product of the marginals PX PY. If we have independence, then PXY ≡ PX PY, and the divergence equals zero. On the contrary, if X and Y are tied by a certain degree of dependence, this can be measured by

D(PXY//PX PY) = Σ_{i,j} p(i, j) log [p(i, j)/(p(i)p(j))] = I(X, Y) ≥ 0.   (A.4)
This quantity is also called the mutual information (or relative entropy) I(X, Y) between the random variables (the protein sequences, in our setting) X and Y. It is symmetric in its variables (I(X, Y) = I(Y, X)) and is always nonnegative, since it is an informational divergence. Note also that the MI is upper bounded by the logarithm of the alphabet cardinality, that is, I(X, Y) ≤ log 20 [18]. Moreover, since it equals zero if and only if the joint probability distribution coincides with the product of the marginals, that is, when the two variables are independent, we can interpret the mutual information (MI) as a measure of stochastic dependence between X and Y. From another point of view, we can say that independence is equivalent to the situation in which the variables X and Y do not exchange information. So, I(X, Y) can also be read as the degree of dependence between the variables, or as the average information exchanged between them. Mutual information is one of the pillars of Shannon information theory, and was introduced in Shannon's seminal paper [16, 17].
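A direct way to see these properties is to compute I(X, Y) from a joint table. The sketch below uses a toy two-letter alphabet; the same function applies unchanged to a 20 x 20 amino acid table, where the bound I(X, Y) ≤ log2 20 holds:

```python
import math

def mutual_information(joint):
    """I(X,Y) = D(P_XY // P_X P_Y), in bits, from a joint distribution
    given as a nested list: joint[i][j] = p(i, j)."""
    px = [sum(row) for row in joint]          # marginal of X
    py = [sum(col) for col in zip(*joint)]    # marginal of Y
    return sum(p * math.log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

dependent   = [[0.4, 0.1], [0.1, 0.4]]       # toy 2-letter joint distribution
independent = [[0.25, 0.25], [0.25, 0.25]]
assert mutual_information(independent) == 0              # independence <=> I = 0
assert 0 < mutual_information(dependent) <= math.log2(2) # bounded by log |A|
```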
ACKNOWLEDGMENTS
The authors thank Jorja Henikoff, who provided the matrices of joint probability distributions associated with the BLOCKS database, and an anonymous referee of a previous version of this paper, who made several key remarks. This work has been supported by the Italian Ministry of Research (PRIN 2003 and FIRB 2003 Grants), by the Istituto Nazionale di Alta Matematica (INdAM, 2003 Grant), and by the Regione Friuli Venezia Giulia (2005 Grants).
REFERENCES
[1] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.
[2] A. D. McLachlan, "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551," Journal of Molecular Biology, vol. 61, no. 2, pp. 409-424, 1971.
[3] D. Sankoff, "Matching sequences under deletion-insertion constraints," Proceedings of the National Academy of Sciences of the United States of America, vol. 69, no. 1, pp. 4-6, 1972.
[4] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM Journal on Applied Mathematics, vol. 26, no. 4, pp. 787-793, 1974.
[5] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some biological sequence metrics," Advances in Mathematics, vol. 20, no. 3, pp. 367-387, 1976.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed., vol. 5, supplement 3, pp. 345-352, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[7] S. F. Altschul, "Amino acid substitution matrices from an information theoretic perspective," Journal of Molecular Biology, vol. 219, no. 3, pp. 555-565, 1991.
[8] S. Karlin and S. F. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 6, pp. 2264-2268, 1990.
[9] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915-10919, 1992.
[10] W. Feller, An Introduction to Probability Theory and Its Applications, John Wiley & Sons, New York, NY, USA, 1968.
[11] Y.-K. Yu, J. C. Wootton, and S. F. Altschul, "The compositional adjustment of amino acid substitution matrices," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15688-15693, 2003.
[12] S. F. Altschul, "A protein alignment scoring system sensitive at all evolutionary distances," Journal of Molecular Evolution, vol. 36, no. 3, pp. 290-300, 1993.
[13] D. J. States, W. Gish, and S. F. Altschul, "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices," Methods, vol. 3, no. 1, pp. 66-70, 1991.
[14] S. R. Sunyaev, G. A. Bogopolsky, N. V. Oleynikova, P. K. Vlasov, A. V. Finkelstein, and M. A. Roytberg, "From analysis of protein structural alignments toward a novel approach to align protein sequences," Proteins: Structure, Function, and Bioinformatics, vol. 54, no. 3, pp. 569-582, 2004.
[15] M. A. Zachariah, G. E. Crooks, S. R. Holbrook, and S. E. Brenner, "A generalized affine gap model significantly improves protein sequence alignment accuracy," Proteins: Structure, Function, and Bioinformatics, vol. 58, no. 2, pp. 329-338, 2005.
[16] C. E. Shannon, "A mathematical theory of communication, part I," Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[17] C. E. Shannon, "A mathematical theory of communication, part II," Bell System Technical Journal, vol. 27, pp. 623-656, 1948.
[18] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, NY, USA, 1981.
[19] A. A. Schäffer, L. Aravind, T. L. Madden, et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Research, vol. 29, no. 14, pp. 2994-3005, 2001.
[20] F. Frommlet, A. Futschik, and M. Bogdan, "On the significance of sequence alignments when using multiple scoring matrices," Bioinformatics, vol. 20, no. 6, pp. 881-887, 2004.
[21] S. F. Altschul, J. C. Wootton, E. M. Gertz, et al., "Protein database searches using compositionally adjusted substitution matrices," FEBS Journal, vol. 272, no. 20, pp. 5101-5109, 2005.
[22] A. A. Schäffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul, "IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, vol. 15, no. 12, pp. 1000-1011, 1999.
[23] W. R. Rypniewski, A. Perrakis, C. E. Vorgias, and K. S. Wilson, "Evolutionary divergence and conservation of trypsin," Protein Engineering, vol. 7, no. 1, pp. 57-64, 1994.
[24] A. L. Hughes, "Evolutionary diversification of the mammalian defensins," Cellular and Molecular Life Sciences, vol. 56, no. 1-2, pp. 94-103, 1999.
[25] F. Bauer, K. Schweimer, E. Kluver, et al., "Structure determination of human and murine β-defensins reveals structural conservation in the absence of significant sequence similarity," Protein Science, vol. 10, no. 12, pp. 2470-2479, 2001.
[26] A. Tossi and L. Sandri, "Molecular diversity in gene-encoded, cationic antimicrobial polypeptides," Current Pharmaceutical Design, vol. 8, no. 9, pp. 743-761, 2002.
[27] R. Gennaro, M. Zanetti, M. Benincasa, E. Podda, and M. Miani, "Pro-rich antimicrobial peptides from animals: structure, biological functions and mechanism of action," Current Pharmaceutical Design, vol. 8, no. 9, pp. 763-778, 2002.
[28] M. E. Selsted, M. J. Novotny, W. L. Morris, Y.-Q. Tang, W. Smith, and J. S. Cullor, "Indolicidin, a novel bactericidal tridecapeptide amide from neutrophils," Journal of Biological Chemistry, vol. 267, no. 7, pp. 4292-4295, 1992.
[29] S. Kullback, Information Theory and Statistics, Dover, Mineola, NY, USA, 1997.
Research Article
Aligning Sequences by Minimum Description Length
John S. Conery
Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA
Received 26 February 2007; Revised 6 August 2007; Accepted 16 November 2007
Recommended by Peter Grünwald
This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses
a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the
original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall
alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on
conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced
with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy
of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
Copyright © 2007 John S. Conery. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. INTRODUCTION
Sequence alignment is a fundamental operation in bioinformatics, used in a wide variety of applications ranging
from genome assembly, which requires exact or nearly exact
matches between ends of small fragments of DNA sequences
[1], to homology search in sequence databases, which involves pairwise local alignment of DNA or protein sequences
[2], to phylogenetic inference and studies of protein structure
and function, which depend on multiple global alignments
of protein sequences [3-5].
These diverse applications all use the same basic definition of alignment: a character in one sequence corresponds
either to a character from the other sequence or to a gap
character that represents a space in the middle of the other
sequence. Alignment is often described informally as a process of writing a set of sequences in such a way that matching
characters are displayed within the same column, and gaps
are inserted in strings in order to maximize the similarity
across all columns. More formally, alignments can be defined
by a matrix M, where Mij is 1 if character i of one sequence is aligned with character j of the other sequence, or, in some cases, Mij is a probability, for example, the posterior probability of aligning letters i and j [6].
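The matrix view can be made concrete with a short sketch; the helper and the toy gapped alignment below are illustrative, not taken from [6]:

```python
def alignment_matrix(row_a, row_b, gap="-"):
    """Build the 0/1 matrix M: M[i][j] = 1 iff ungapped letter i of the
    first sequence is aligned with ungapped letter j of the second."""
    la = len(row_a) - row_a.count(gap)
    lb = len(row_b) - row_b.count(gap)
    M = [[0] * lb for _ in range(la)]
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            M[i][j] = 1            # both columns hold real letters
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return M

# A hypothetical gapped alignment of PPP with PLFSP:
M = alignment_matrix("PP---P", "P-LFSP")
assert M[0][0] == 1 and M[2][4] == 1   # first and last letters aligned
assert sum(map(sum, M)) == 2           # the middle P is left unaligned
```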
This paper introduces a new framework for describing
the similarities and dierences in a set of sequences. The idea
is to construct a special-purpose grammar for the strings that
for recurring patterns in protein and DNA sequences [9].
These applications of MDL are examples of machine learning, where the system uses the data as a training set and the
goal is to infer a general description that can be applied to
other data. The goal of the sequence alignment algorithm
presented here is simply to find the best description for the
data at hand; there is no attempt to create a general grammar
that may apply to other sequences.
Grammars have been used previously to describe the
structure of biological sequences [10-12], and regular expressions are a well-known technique for describing patterns
that define families of proteins [13]. But as with previous
work on MDL and grammars, these other applications use
grammars and regular expressions to describe general patterns that may be found in sequences beyond those used to
define the pattern, whereas for alignment the goal is to find a
grammar that describes only the input data.
Grammars have the potential to describe a wide variety
of relationships among sequences. For example, a top-level rule might specify several different ways to partition the sequences into smaller groups, and then specify separate alignments for each group. In this case, the top-level rules are effectively a representation of a phylogenetic tree that shows the evolutionary history of the sequences. This paper focuses on one very restricted type of grammar that is capable of describing only the simplest correspondence between sequences. The algorithm presented here assumes that only two sequences are being aligned, and that the goal is to describe similarity over the entire length of both input sequences; that is, the algorithm is for pairwise global alignment. For this application, the simplest type of formal grammar (a right linear grammar) is sufficient to describe the alignment. Since every right linear grammar has an equivalent regular expression, and because regular expressions are simpler to explain (and are more commonly used in bioinformatics), the remainder of this paper will use regular expression syntax when discussing grammars for a pair of sequences.
Current alignment algorithms are highly sensitive to the choice of gap parameters [14-17]; for example, Reese and Pearson showed that the choice of gap penalties can influence the score for alignments made during a database search by an order of magnitude [18]. One of the advantages of the grammar-based framework is that gaps are not needed to align sequences of varying length. Instead, the parts of regular expressions that correspond to regions of unaligned positions will have a different number of characters from each input sequence.
Previous work using information theory in sequence alignment has been within the general framework of Needleman-Wunsch global alignment or Smith-Waterman local alignment. Allison et al. [19] used minimum message length to consider the cost of different sequences of edit operations in global alignment of DNA; Schmidt [20] studied the information content of gapped and ungapped alignments, and Aynechi and Kuntz [21] used information theory to study the distribution of gap sizes. The work described here takes a different approach altogether, since gap characters are not used to make the alignments.
One of the main applications of sequence alignment is comparison of protein sequences. The inputs to the algorithm are
sets of strings, where each letter corresponds to one of the 20
amino acids found in proteins. The goal of the alignment is
to identify regions in each of the input sequences that are
parts of the same structural or functional elements or are descended from a common ancestor.
Figure 1(b) shows the evolution of fragments of three
hypothetical proteins starting from a 9-nucleotide DNA sequence. The labels below the leaves of the tree are the amino
acids corresponding to the DNA sequences at the leaves. The
only change along the left branch is a single substitution
which changes the first amino acid from P to T, and an alignment algorithm should have no problem finding the correspondences between the two short sequences (Figure 1(c)).
The sequence on the right branch of the tree is the result of a mutation that inserted six nucleotides in the middle
of the original sequence. In order to align the resulting sequence with one of its shorter cousins, a standard alignment
algorithm inserts a gap, represented by a sequence of one or
more dashes, to mark where it thinks the insertion occurred.
Figure 1: (a) The genetic code specifies how triplets of DNA letters (known as codons) are translated into single amino acids when a cell
manufactures a protein sequence from a gene. (b) A tree showing the evolution of a short DNA sequence. Labels below the leaves are the
corresponding amino acid sequences. (c) Alignment of the two shorter sequences. (d) and (e) Two ways to align the longer sequence with
one of the shorter ones.
Regular expressions are widely used for pattern matching, where the expression describes the general form of a
string and an application can test whether a given string
matches the pattern. To see how a regular expression is an
alternative to a standard gap-based alignment consider the
following pattern, which describes the two sequences in Figures 1(d) and 1(e):
P(P | LFS)P.
(1)
Here the vertical bar means or and the parentheses are used
to mark the ends of the alternatives. The pattern described
by this expression is the set of strings that start with a P, then
have either another P or the string LFS, and end in a P. In
this example, the letters enclosed in parentheses correspond
to a variable region: the pattern simply says these letters are not aligned, and no attempt is made to say why they are not aligned or what the source of the difference is. The regular expression is an abstract description, covering both the alignments of Figures 1(d) and 1(e) (and a third, biologically less plausible, alignment in which the top string would be PP P).
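That the expression in (1) accepts exactly these strings can be checked with an ordinary regular-expression engine:

```python
import re

# The pattern from (1): start with P, then either P or LFS, then end with P.
pattern = re.compile(r"P(P|LFS)P")

assert pattern.fullmatch("PPP")             # the short sequence
assert pattern.fullmatch("PLFSP")           # the sequence with the insertion
assert pattern.fullmatch("PLFP") is None    # other strings are rejected
```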
For a more realistic example, consider the two sequence
fragments in Figure 2(a), which are from the beginning of
two of the protein sequences used to test the alignment application. Substrings of 15 characters near the front of each
sequence are similar to each other. A regular expression that
describes this similarity would have three groups, showing
letters before and after the region of similarity as well as the
region itself (Figure 2(b)).
Any pair of sequences can be described by a regular expression of this form. The expression consists of a series of
segments, written one after another, where each segment has
two substrings separated by the vertical bar. But this standard
Figure 2: (a) Strings from the start of two of the amino acid sequences used to test the alignment algorithm. The substrings in blue are
similar to the corresponding substring in the other sequence. (b) A regular expression that makes explicit the boundaries of the region of
similarity. (c) The canonical form representation of the regular expression. The canonical form has the same groupings of letters, but displays
the letters in a dierent order and uses marker symbols instead of parentheses to specify group boundaries. A # means the sequence segments
are blocks, where the ith letter from one sequence has been aligned with the ith letter in the other sequence. A > designates the start of a
variable region of unaligned letters.
Figure 3: Schematic representation of an expression rewriting operation. A canonical form expression with a single variable region
is transformed into a new expression with two variable regions surrounding a block. The number of sequence letters does not change,
but four new marker symbols are added to specify the boundaries
of the block.
Two # symbols mark the locations of the start of the block (one in each input sequence) and two > symbols mark the end of the block. As a
special case, the block might be at the beginning or end of
the expression; if so only two new # markers are added to the
expression.
Since the alignment algorithm uses the minimum description length principle to search for the simplest expression, this transformation appears to be a step in the wrong
direction because the complexity of the expression, in terms
of the number of symbols used, has increased. The key point
is that MDL operates at the level of the encoding of the expression, that is, it prefers the expression that can be encoded
in the fewest number of bits. As will be shown in this section,
blocks of similar sequence letters have shorter encodings. If
the number of bits saved by placing similar letters in a block
is greater than the cost of encoding the symbols that mark the
ends of the block, the transformed expression is more compact.
The code length function that assigns a number of bits
to each symbol in a canonical form sequence expression has
three components:
(i) a protocol that defines the general structure of an expression and the representation of alignment parameters;
(ii) a method for assigning a number of bits to each letter
from the set of input sequences;
(iii) a method for determining the number of bits to use
for the marker symbols that identify the boundaries
between blocks and variable regions.
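The accept/reject decision described above can be sketched numerically. This is a hypothetical illustration, not the paper's implementation; the numbers are chosen to echo the magnitudes quoted later in the text's Figure 5 discussion.

```python
from math import log2

# Sketch of the MDL accept/reject test: a block transformation is kept
# only when the bits saved by conditional encoding exceed the cost of
# the four new marker symbols it introduces.

def block_gain(unaligned_bits, aligned_bits, marker_bits, new_markers=4):
    """Net reduction in encoded size (positive = transformation accepted)."""
    saved = unaligned_bits - aligned_bits
    return saved - new_markers * marker_bits

# Numbers in the spirit of the Figure 5 example (129.508 bits unaligned
# vs 91.381 bits aligned, and a marker probability of 0.02):
gain = block_gain(129.508, 91.381, -log2(0.02))
print(round(gain, 3))  # 15.552: the block version is more compact
```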
3.1. Communication protocol
A common exercise in information theory is to imagine that
a compressed data set is going to be sent to a receiver in
binary form, and the receiver needs to recover the original
data. This exercise ensures that all the necessary information
is present in the compressed data: if the receiver cannot reconstruct the original data, it may be because essential information was not encoded by the compression algorithm. In
the case of the MDL alignment algorithm, the idea is to compress a set of sequences by creating a representation of a regular expression that describes the structure of the sequences.
The receiver recovers the original sequence data by expanding the expression to generate every sequence that matches
the expression.
A communication protocol that specifies the type of information contained in a message and the order in which the
pieces of the message are transmitted is an essential part of
the encoding. The representation of a sequence expression
begins with a preamble that contains information about the
structure of the expression and the encoding of alignment
parameters.
A canonical form sequence expression is an alternating
series of blocks and variable regions, where the marker symbols (# and >) inserted into the input sequences identify the
boundaries between segments. The communication protocol allows the transmitter to simplify the expression as it is
compressed by putting a single bit in the preamble to specify the type of the first segment. Then the only thing that is
required is a single type of symbol to specify the locations of
the remaining markers. For the example sequences shown in
Figure 2, the expression can be transformed into the following string:
> MNNNNYIF.MNSYKP.ENENPILYNTNEGEE.
ENENPVLYNYKEDEE.NRSS.SSHI
(2)
(3)
benefit(x, y) = log2 [ p(x, y) / ( p(x) p(y) ) ].
(4)
Table 1: Cost (in bits) of aligning pairs of letters. Sx,y is the score
for letters x and y in the PAM100 substitution matrix. c(x) + c(y)
is the sum of the costs of the two letters, which is incurred when
the letters are in a variable region. c(x) + c(y | x) is the cost of the
same letters when they are aligned in a block. The benefit of aligning two letters is the dierence between the unaligned cost and the
aligned cost: a positive benefit results from aligning similar letters,
a negative benefit from aligning dissimilar letters.
x    y    Sx,y    c(x) + c(y)    c(x) + c(y|x)    benefit(y, x)
W    W     12     6.36 + 6.36    6.36 + 0.44       5.92
I    I      6     3.65 + 3.65    3.65 + 1.25       2.40
L    L      6     3.09 + 3.09    3.09 + 0.72       2.37
M    L      3     4.97 + 3.09    4.97 + 2.26       0.83
L    I      1     3.09 + 3.65    3.09 + 3.66      −0.01
L    Q     −2     3.09 + 5.02    3.09 + 6.09      −1.07
L    C     −6     3.09 + 5.78    3.09 + 9.38      −3.60
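The costs in Table 1 are Shannon code lengths, c(x) = −log2 p(x) and c(y|x) = −log2 p(y|x). A small sketch; the probabilities below are back-derived from the table's W/W row for illustration, not taken from PAM100 directly.

```python
from math import log2

def cost(p):
    # Shannon code length, in bits
    return -log2(p)

# Hypothetical probabilities chosen so the costs match Table 1's W/W row:
p_w = 2 ** -6.36          # marginal probability of W  -> c(W) = 6.36 bits
p_w_given_w = 2 ** -0.44  # conditional probability    -> c(W|W) = 0.44 bits

unaligned = cost(p_w) + cost(p_w)            # letters in a variable region
aligned = cost(p_w) + cost(p_w_given_w)      # letters inside a block
benefit = unaligned - aligned
print(round(benefit, 2))  # 5.92, matching the table's W/W row
```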
(5)
Figure 4: The items in blue correspond to information added to a string to specify the locations of marker symbols. (a) Indexed representation. The preamble contains two tables of m − 1 numbers to specify the locations of the m marker symbols (the first marker is always
at the front of the string) in each sequence. Each table entry has k = ⌈log2 n⌉ bits to specify a location in a string of length n. (b) Tagged
representation. A one-bit tag added to each symbol identifies the symbol class (letter or marker), and is followed by the bits that represent
the symbol itself. (c) Scaled representation. The number of bits for each symbol x is simply −log2 q(x), where q(x) is the probability of the
symbol based on a distribution that includes the probability of a marker. (d) Given a probability ε for marker symbols, the joint probabilities
for the letter pairs are scaled by 1 − ε (q(x, y) = (1 − ε) p(x, y)) so that the sum of probabilities over all symbols is 1.0.
(ii) use one bit to specify the type of the first segment
(which will be the same for both sequences);
(iii) use ⌈log2 s⌉ bits to specify which one of the s substitution matrices was used to encode letters and letter
pairs;
(iv) use 2⌊log2 n⌋ + 1 bits to specify n, the length of the first
input sequence. This number also allows the receiver
to determine k = ⌈log2 n⌉, the number of bits required to
represent a single marker table entry;
(v) the next 2⌊log2 m⌋ + 1 bits specify m, the number of
marker symbols in each sequence;
(vi) create a table of size mk bits for the locations of the
m markers in the first sequence, followed by another
table of the same size for the markers of the second
sequence.
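The preamble size implied by this protocol is easy to tally. The sketch below is an illustration under stated assumptions: a gamma-style self-delimiting integer code of 2⌊log2 v⌋ + 1 bits for n and m, and ceilings for the fixed-width fields; the function names are ours, not the paper's.

```python
from math import ceil, floor, log2

def gamma_bits(v):
    """Bits to transmit a positive integer v self-delimitingly
    (assumed 2*floor(log2 v) + 1 form, per the counts quoted in the text)."""
    return 2 * floor(log2(v)) + 1

def preamble_bits(n, m, s):
    """n: length of the first sequence, m: markers per sequence,
    s: number of available substitution matrices."""
    k = ceil(log2(n))              # bits per marker-table entry
    return (1                      # type of the first segment
            + ceil(log2(s))        # which substitution matrix
            + gamma_bits(n)        # sequence length n
            + gamma_bits(m)        # marker count m
            + 2 * m * k)           # two marker tables of m entries each

print(preamble_bits(n=600, m=6, s=8))  # 148
```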
Following the preamble, the body of the message simply
consists of the encoding of the letters defined in the previous
section. Since the receiver knows the length of the first sequence, there is no need to include an end-of-string marker
after the first sequence. This location becomes a de facto
marker for the start of the second sequence.
Figure 4(a) shows how the start of the two example sequences would be encoded with the indexed representation.
The numbers in blue are indices between 0 and the length of
the longer of the two sequences.
The advantage of this representation is that no additional
parameters are required to align a pair of sequences: the only
alignment parameter is the substitution matrix, which deter-
mines the individual probability for each letter and the joint
probability for each letter pair.
3.3.2. Tagged representation
There are two drawbacks to the indexed representation. The
first is that the number of bits used to represent a marker
grows (albeit very slowly) with the length of the input sequences. That means one might get a different alignment for
the same two substrings of sequence letters in different contexts; if the substrings are embedded in longer sequences,
the number of bits per marker will increase, and the alignment algorithm might decide on a dierent placement for
the markers in the middle of the substrings.
The second disadvantage is that in many cases marker
symbols identify the locations of insertions and deletions,
which are evolutionary events. The number of bits used to
represent a marker should correspond to the likelihood of an
insertion or deletion, but not the length of the sequence. If
anything, longer sequences are more likely to have had insertions or deletions, so the number of bits representing those
events should be lower, not higher.
The tagged representation addresses these problems by
defining a prefix code for markers and embedding the marker
codes in the appropriate locations within each sequence
string. This method requires the user to specify a value for a
new parameter: the number of bits required to represent a marker. Each symbol in the expression is preceded by
(6)
q(x) = Σy q(x, y) = Σy (1 − ε) p(x, y) = (1 − ε) Σy p(x, y) = (1 − ε) p(x).
(7)
q(y | x) = q(x, y) / q(x) = (1 − ε) p(x, y) / [ (1 − ε) p(x) ] = p(x, y) / p(x) = p(y | x).
(8)
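The invariance in (8) is easy to confirm numerically: scaling every joint probability by (1 − ε) to make room for marker symbols leaves the conditional probabilities unchanged. The toy two-letter joint distribution below is hypothetical.

```python
from math import isclose

eps = 0.02
p = {('A', 'A'): 0.4, ('A', 'B'): 0.1,
     ('B', 'A'): 0.2, ('B', 'B'): 0.3}

# Scale the joint distribution so eps probability mass is left for markers:
q = {pair: (1 - eps) * pxy for pair, pxy in p.items()}

for x in 'AB':
    px = sum(p[(x, y)] for y in 'AB')   # marginal p(x)
    qx = sum(q[(x, y)] for y in 'AB')   # marginal q(x) = (1 - eps) p(x)
    for y in 'AB':
        # q(y|x) == p(y|x): conditional costs in blocks are unaffected
        assert isclose(q[(x, y)] / qx, p[(x, y)] / px)
print('conditionals preserved')
```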
Example

4. EXPERIMENTAL RESULTS

4.1. Plasmodium orthologs
(a)
(b)
Figure 5: Cost of alternative expressions for the example sequences using the PAM20 substitution matrix and a marker probability of 0.02. The cost for each
marker symbol is −log2 0.02 = 5.644 bits. (a) The cost for the null hypothesis is the sum of all the individual letter costs plus the cost of the
two marker symbols. (b) When the letters in blue are aligned with one another, the costs of the letters in the second sequence are computed
with conditional probabilities. This reduces the cost of the letters in the block by 129.508 − 91.381 = 38.127 bits. The transformed grammar
has four additional markers, but the reduction in cost afforded by using the block outweighs the cost of the new markers (4 × 5.644 =
22.576 bits), so the expression with one block has a lower overall cost.
(a)
(b)
(c)
                      Untrim    Trim
Aligned by both        0.473    0.469
Aligned by neither     0.147    0.258
CLUSTALW only          0.380    0.267
Realign only          <0.001    0.006
Figure 6: Alignment of sequences MAL7P1.11 and Pv087705 from ApiDB [35]. (a) Comparison of CLUSTALW alignment (top two lines of
text) and the regular expression alignment (bottom two lines). Background colors indicate whether the two algorithms agree. Green: columns
aligned by both algorithms; blue: letters aligned by neither algorithm; white: letters aligned by CLUSTALW but appearing in variable regions
in the regular expression; red: letters aligned in the regular expression but not by CLUSTALW. (b) Same as (a), but comparing the trimmed
CLUSTALW alignment with regular expression alignment. The middle row of two lines shows the result of the alignment trimming algorithm;
an asterisk identifies a column from the CLUSTALW alignment that was removed by gap expansion. (c) Proportion of each type of column
averaged over all 3909 alignments.
ing BLAST to search for reciprocal best hits. Since P. falciparum diverged from P. vivax approximately 200 MYA [36],
all the alignments used the PAM20 substitution matrix. The
realign alignments were made using the scaled representation for marker symbols with a marker probability of 0.02, since insertion and
deletion events are relatively rare at this short evolutionary
time scale.
Figure 6 shows a detailed comparison of the alignments
for one pair of genes (MAL7P1.11 and Pv087705). The top
two lines in Figure 6(a) are the alignment produced by
CLUSTALW, and the bottom two are the regular expression
alignment. To make it easier to compare the alignments, the
marker symbols have been deleted, and the letters in variable
regions printed in italics to distinguish them from letters
in blocks. The four background colors indicate the level of
agreement between the two alignments: a pair can be aligned
by both programs, aligned by neither, or aligned by one but
not the other.
Researchers often apply an alignment trimming algorithm to the output of an alignment algorithm to identify
suspect columns in an alignment [37]. An example of a suspect column is the one shown in Figure 1 where an insertion occurred in the middle of a codon. Figure 6(b) shows
the alignment of the Plasmodium genes after an alignment
trimming operation [38] was applied to the CLUSTALW alignments. The middle two lines in this figure show the results
of the trimming application: an X indicates a letter that was
left in the alignment, and a '-' indicates a position that was
originally aligned but has now been converted to a gap. In
this example, the alignment trimming algorithm agreed with
the regular expression alignment: columns that were previously shown as aligned (white background color) are now
unaligned (blue).
Over all the 3909 pairs of sequences, the two alignment
methods agreed on 62% of the letters (top two rows of
Figure 6(c)). The disagreement was almost entirely due to the
fact that in 38% of the columns, the regular expression alignment was more conservative and placed characters in an unaligned region when CLUSTALW aligned those same letters.
There are very few instances where realign put letters in
an aligned block and CLUSTALW did not. Applying the alignment trimming algorithm increases the level of agreement:
approximately one fourth of the columns originally considered aligned by CLUSTALW were reclassified as unaligned, in
agreement with realign. The number of columns aligned
only by realign also increased, but that is simply due to the
fact that the alignment trimming algorithm used here [38] is
very conservative and also trims away the last character in an
aligned region (as shown by the red columns at the ends of
blocks in Figure 6(b)).
These results show that for sequences with a high degree
of similarity (separated by only 200 MY of evolution), the
MDL method implemented in realign does a credible job
of global alignment. A more detailed analysis of genes with
known alignments, preferably including structural and functional alignment, would be required to determine whether
the 25% of the letter pairs aligned by CLUSTALW should in
fact be aligned, or whether realign was correct in leaving
them in variable regions.
4.2. BAliBASE reference alignments
The main parameter of the regular expression alignment
method is the substitution matrix, which defines the probabilities for amino acid letters. A second parameter, the number of bits to use for a marker symbol or the probability associated with a marker symbol, is required if expressions are
encoded with the tagged or scaled representations, respectively. To illustrate the effects of these parameters, an experiment evaluated the accuracy of realign alignments compared to known reference alignments from the BAliBASE
[34] benchmark suite.
(a)
(b)
(c)
Figure 7: Portions of alignments of sequences 1aho and 1bmr from the BAliBASE alignment benchmark (Release 3) [34]. (a) The reference
alignment from BAliBASE. Letters in core blocks are highlighted in blue. (b) Alignment from realign, using PAM20 and a marker probability of 0.2. (c) Same
as (b) but using PAM250. In (b) and (c), lines starting with % are comments that show the degree of similarity of corresponding letters in the
preceding block: identical (=), similar (+), or dissimilar (−). Sequence letters in blue are correctly aligned core blocks. Red letters are core
block columns that should have been aligned but were left in variable regions. The circled numbers highlight changes in the alignment (see
text).
This paper has shown that regular expressions provide useful descriptions of alignments of pairs of sequences. The expressions are simple concatenations of alternating blocks and
variable regions, where blocks are equal-length substrings
Figure 8: The effect of the scaling parameter on alignments of pairs of sequences from BAliBASE [34] test set BB12007. There are eight
sequences in the set; the data points are based on averages over all (8 × 7)/2 = 28 pairs of sequences. (a) Mean cost (in bits) of alignments
as a function of the marker probability. (b) Mean compression (the difference between the cost of the null hypothesis and the lowest cost alignment for each pair
of sequences) is indicated by open circles. The mean accuracy of the alignments (proportion of core blocks correctly aligned) is indicated by
closed circles (scale shown on the right axis).
but that number drops to one half if the CLUSTALW alignments are treated with an alignment trimming algorithm
to remove ambiguous regions. A more detailed case-by-case
analysis would be required to determine if the remaining unaligned characters should remain unaligned (i.e., alignment
trimming should be more ambitious) or if they need to be
aligned (i.e., the regular expression approach is not aligning
some characters that should be aligned).
A second set of experiments compared the output of the
regular expression method with known reference alignments
from the BAliBASE alignment benchmark. Since the benchmark is designed to test multiple alignment algorithms, and
it is generally accepted that multiple alignment is more accurate than simple pairwise alignment [28], it is not possible
to say whether the regular expression approach is as accurate
as recent multiple alignment methods, but the overall accuracy of over 80% for sequences with 20% to 40% identity is
encouraging.
One direction for future research is to try to automatically determine, for each substitution matrix, the best value
for the parameters that determine the number of bits
per marker symbol. Based on extensive investigation (e.g.,
[39]) of different combinations of substitution matrix and
other parameters, BLAST, CLUSTALW, and other applications
set default values for gap penalties based on the choice of substitution matrix. A similar analysis, perhaps based on insertion and deletion mutation rates, might be used to match a
substitution matrix with a marker parameter setting for regular expression alignments.
A second direction for future research is to expand the
method to perform multiple alignment of more than two
sequences. One approach would be to use pairwise local
alignments produced by realign as anchors for DIALIGN
[22, 23], a progressive multiple alignment program that joins
consistent sets of ungapped local alignments into a complete multiple alignment. A different approach would align
all the sequences at the same time, using sum-of-pairs or
some other method to average conditional costs based on
each of the n(n − 1)/2 pairs of sequences.
A third direction for future research is to extend the
canonical sequence expressions or the equivalent grammar
to include other forms of descriptions of regions of similarity.
One idea is to use PROSITE blocks [40] as subroutines that
can be embedded in blocks. For example, PROSITE block
PS00007 is [RK]-x(2, 3)-[DE]-x(2, 3)-Y, using a notation similar to a regular expression where a string in brackets means
any one of these letters and x(2, 3) means any sequence
between 2 and 3 letters long. A string that matches this pattern, RDIKDPEY, occurs in one of the Plasmodium sequences
discussed in Section 4.1. A block for the region containing
this pattern might include a reference to the PROSITE block,
for example, instead of
#DLLRDIKDPEYSYT
(9)
(10)
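The PROSITE notation maps mechanically onto ordinary regular expressions: a bracketed class stays a character class, and x(m, n) becomes a bounded wildcard. A sketch (the translation below is written by hand for this one pattern, not produced by a library):

```python
import re

# PROSITE PS00007, [RK]-x(2,3)-[DE]-x(2,3)-Y, as a regular expression:
# [RK] -> character class, x(2,3) -> .{2,3}, literal letters unchanged.
ps00007 = re.compile(r'[RK].{2,3}[DE].{2,3}Y')

# The match reported in the text from one of the Plasmodium sequences:
assert ps00007.fullmatch('RDIKDPEY')

# The pattern can also be found embedded in a longer block of letters:
m = ps00007.search('DLLRDIKDPEYSYT')
print(m.group(0))  # RDIKDPEY
```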
[9] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen, "Pattern discovery in biosequences," in International Conference on Grammar Inference (ICGI '98), V. Honavar and G. Slutski, Eds., vol. 1433 of Lecture Notes in Artificial Intelligence, pp. 257–270, Springer, Ames, Iowa, USA, 1998.
[10] L. Cai, R. L. Malmberg, and Y. Wu, "Stochastic modeling of RNA pseudoknotted structures: a grammatical approach," Bioinformatics, vol. 19, suppl. 1, pp. i66–i73, 2003.
[11] D. B. Searls, "The computational linguistics of biological sequences," in Artificial Intelligence and Molecular Biology, pp. 47–120, American Association for Artificial Intelligence, Menlo Park, Calif, USA, 1993.
[12] D. B. Searls, "Linguistic approaches to biological sequences," Computer Applications in the Biosciences, vol. 13, no. 4, pp. 333–344, 1997.
[13] A. Bairoch, "PROSITE: a dictionary of sites and patterns in proteins," Nucleic Acids Research, vol. 20, pp. 2013–2018, 1992.
[14] M. Vingron and M. S. Waterman, "Sequence alignment and penalty choice. Review of concepts, case studies and implications," Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, 1994.
[15] S. Henikoff, "Scores for sequence searches and alignments," Current Opinion in Structural Biology, vol. 6, no. 3, pp. 353–360, 1996.
[16] G. Giribet and W. C. Wheeler, "On gaps," Molecular Phylogenetics and Evolution, vol. 13, no. 1, pp. 132–143, 1999.
[17] Y. Nozaki and M. Bellgard, "Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties," Bioinformatics, vol. 21, no. 8, pp. 1421–1428, 2005.
[18] J. T. Reese and W. R. Pearson, "Empirical determination of effective gap penalties for sequence comparison," Bioinformatics, vol. 18, no. 11, pp. 1500–1507, 2002.
[19] L. Allison, C. S. Wallace, and C. N. Yee, "Finite-state models in the alignment of macromolecules," Journal of Molecular Evolution, vol. 35, no. 1, pp. 77–89, 1992.
[20] J. P. Schmidt, "An information theoretic view of gapped and other alignments," in Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), pp. 561–572, Maui, Hawaii, USA, January 1998.
[21] T. Aynechi and I. D. Kuntz, "An information theoretic approach to macromolecular modeling: I. Sequence alignments," Biophysical Journal, vol. 89, no. 5, pp. 2998–3007, 2005.
[22] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment," Bioinformatics, vol. 15, no. 3, pp. 211–218, 1999.
[23] M. Brudno, M. Chapman, B. Gottgens, S. Batzoglou, and B. Morgenstern, "Fast and sensitive multiple alignment of large genomic sequences," BMC Bioinformatics, vol. 4, p. 66, 2003.
[24] T. D. Schneider, "Information content of individual genetic sequences," Journal of Theoretical Biology, vol. 189, no. 4, pp. 427–441, 1997.
[25] N. Krasnogor and D. A. Pelta, "Measuring the similarity of protein structures by means of the universal similarity metric," Bioinformatics, vol. 20, no. 7, pp. 1015–1021, 2004.
[26] J. S. Conery, "Realign: grammar-based sequence alignment," University of Oregon, http://teleost.cs.uoregon.edu/realign.
[27] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345–352, Washington, DC, USA, 1978.
Research Article
MicroRNA Target Detection and Analysis for Genes Related to
Breast Cancer Using MDLcompress
Scott C. Evans,1 Antonis Kourtidis,2 T. Stephen Markham,1 Jonathan Miller,3
Douglas S. Conklin,2 and Andrew S. Torres1
1 GE
1. INTRODUCTION
(Flowchart: compute the SCR; if SCR < 1 and Gain > Gmin, encode and finish, otherwise continue. Surface plot: SCR as a function of symbol length and number of repeats.)

Phrase        Length   Repeats   Locations
GAAGTGCAGT    10       2         1, 11
AGTG          4        5         3, 8, 13, 18, 24   (best OSCR phrase)
Figure 1: The OSCR algorithm. Phrases that recursively contribute most to sequence compression are added to the model first. The motif
AGTG is the first selected and added to OSCR's MDL model. A longest match algorithm would not call out this motif.
code for protein. Informatics techniques designed to identify protein-coding sequences, transcription factors, or other
known classes of sequence did not resolve the distinctive signatures of miRNA hairpin loops or their target sites in the
3′ UTRs of protein-coding genes. In this sense, apart from
comparative genomics, sequence analysis methods tend to be
best at identifying classes of sequence whose biological significance is already known.
Minimum description length (MDL) principles [9] offer a general approach to de novo identification of biologically meaningful sequence information with a minimum of
assumptions, biases, or prejudices. Their advantage is that
they explicitly account for model cost, permitting data analysis
without overfitting. The challenge of incorporating MDL
into sequence analysis lies in (a) quantification of appropriate model costs and (b) tractable computation of model inference. A grammar inference algorithm that infers a two-part minimum description length code was introduced in
[10], applied to the problem of information security in [11]
and to miRNA target detection in [12]. This optimal symbol
compression ratio (OSCR) algorithm produces meaningful
models in an MDL sense while achieving a combination of
model and data whose descriptive size together represents an
estimate of the Kolmogorov complexity of the dataset [13].
We anticipate that this capacity for capturing the regularity
of a data set within compact, meaningful models will have
wide application to DNA sequence analysis.
MDL principles were successfully applied to segment
DNA into coding, noncoding, and other regions in [14].
The normalized maximum likelihood model (an MDL algorithm) [15] was used to derive a regression that also
achieves near state-of-the-art compression. Further MDL-related approaches include the greedy offline GREEDY
algorithm [16] and DNA Sequitur [17, 18]. While these
(Figure content: nested sets containing the target string, from the full set {128-bit strings} with 2^128 = 3.4 × 10^38 members, through an intermediate set of 2^124 strings (examples built from patterns such as 1111 0000, 1100 1100, 1001 1001, 1010 1010), down to the two-member set {128-bit strings alternating 1 and 0}.)
Figure 2: Two-part representations of a 128-bit string. As the length of the model increases, the size of the set including the target string
decreases.
K(x) = min { l(p) : U(p) = x },
(1)
where U is a universal computer.
(2)
high, since the numbers of ones and zeros are equal. However, intuitively the regularity of the string makes it seem
strange to call it random. By considering the model cost, as
well as the data costs of a string, MDL theory provides a formal methodology that justifies objectively classifying a string
as something other than a member of the set of all 128-bit
binary strings. These concepts can be extended beyond the class of
models that can be constructed using finite sets to all computable functions [22].
The size of the model (the number of bits allocated to
spelling out the members of set S) is related to the Kolmogorov structure function h (see [23]). The function h_k defines the smallest set, S, that can be described in at most k bits and contains
a given string x of length n,
h_k( x^n | n ) = min_{ p : l(p) < k, U(p, n) = S } log2 |S| .
(3)
Cover [23] has interpreted this function as a minimum sufficient statistic, which has great significance from an MDL perspective. This
concept is shown graphically in Figure 3. The
cardinality of the set containing string x of length n starts out
as equal to n when k = 0 bits is used to describe set S (restrict
its size). As k increases, the cardinality of the set containing
string x can be reduced until a critical value k* is reached,
which is referred to as the Kolmogorov minimum sufficient
statistic, or algorithmic minimum sufficient statistic [22]. At
k*, the size of the two-part description of string x equals
K(x) within a constant. Increasing k beyond k* will continue to make possible a two-part code of size K(x), eventually resulting in a description of a set containing the single
element x. However, beyond k*, the increase in the descriptive cost of the model, while reducing the cardinality of the
set to which x belongs, does not decrease the string's overall
descriptive cost.
The optimal symbol compression ratio (OSCR) algorithm is a grammar inference algorithm that infers a two-part
minimum description length code and an estimate of the algorithmic minimum sufficient statistic [10, 11]. OSCR produces meaningful models in an MDL sense, while achieving a combination of model plus data whose descriptive size
together estimate the Kolmogorov complexity of the data set.
OSCRs capability for capturing the regularity of a data set
into compact, meaningful models has wide application for
sequence analysis. The deep recursion of our approach combined with its two-part coding nature makes our algorithm
uniquely able to identify meaningful sequences without limiting assumptions.
The entropy of a distribution of symbols defines the average per-symbol compression bound in bits per symbol for
a prefix-free code. Huffman coding and other strategies can
produce an instantaneous code approaching the entropy in
the limit of infinite message length when the distribution is
known. In the absence of knowledge of the model, one way
to proceed is to measure the empirical entropy of the string.
However, empirical entropy is a function of the partition and
depends on what substrings are grouped together to be considered symbols. Our goal is to optimize the partition (the
number of symbols, their length, and distribution) of a string
such that the compression bound for an instantaneous code
(the total number of encoded symbols R times the entropy Hs)
plus the codebook size is minimized. We define the approximate model descriptive cost M to be the sum of the lengths
of unique symbols, and the total descriptive cost Dp as follows:
M = Σi li ,    Dp = M + R · Hs .
(4)
While not exact (symbol delimiting comma costs are ignored in the model, while possible redundancy advantages
are not considered either), these definitions provide an approximate means of breaking out MDL costs on a per symbol
basis. The analysis that follows can easily be adapted to other
model cost assumptions.
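These definitions can be sketched on a toy partition; the symbol set and string below are hypothetical, and the code follows the approximations in the text (no delimiter costs):

```python
from math import log2

def descriptive_cost(symbols):
    """symbols: a string rewritten as a list of (possibly multi-letter)
    symbols. Returns (M, D_p) under the approximate definitions above."""
    counts = {}
    for s in symbols:
        counts[s] = counts.get(s, 0) + 1
    R = len(symbols)                          # total encoded symbols
    M = sum(len(s) for s in counts)           # sum of unique symbol lengths
    # empirical entropy Hs of the symbol distribution:
    H = -sum((r / R) * log2(r / R) for r in counts.values())
    return M, M + R * H                       # D_p = M + R * Hs

# Partition "abababcd" as the repeated digram "ab" plus single letters:
M, Dp = descriptive_cost(['ab', 'ab', 'ab', 'c', 'd'])
print(M, round(Dp, 2))  # 4 10.85
```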
2.1.
Hs = −(1/R) Σi ri log2 (ri / R).
(5)
Thus, we have

Dp = R log2 (R) + Σi ( li − ri log2 ri ),
(6)

with Σi ri log2 (R) = log2 (R) Σi ri = R log2 (R).
(Figure: SCR for candidate phrases of the string x = "a rose is a rose is a rose". A phrase of length 2 with 3 repeats leaves R = 26 − 3 = 23 symbols and SCR = 1.023; length 6 with 3 repeats leaves R = 26 − 3(5) = 11 and SCR = 0.5; length 7 with 2 repeats leaves R = 26 − 2(6) = 14 and SCR = 0.7143.)
enables a per-symbol formulation for Dp and results in a conservative approximation for R log2 (R) over the likely range of
R. The per-symbol descriptive cost can now be formulated:

di = ri ( log2 (R) − log2 ri ) + li .
(7)

Dividing by the total number of characters covered by symbol i, Li = li ri , gives the symbol compression ratio:

SCRi = di / Li = [ ri ( log2 (R) − log2 ri ) + li ] / ( li ri ).
(8)
(2) Calculate the SCR for all substrings. Select the substring from this set with the smallest SCR and add it
to the model M.
(3) Replace all occurrences of the newly added substring
with a unique character.
(4) Repeat steps 1 through 3 until no suitable substrings
are found.
(5) When a full partition has been constructed, use Huffman coding or another coding strategy to encode the
distribution, p, of symbols.
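A compact sketch of the greedy loop in steps (2)-(4), scoring candidates with the per-symbol compression ratio of (8). This is an illustrative simplification, not the authors' implementation: repeats are counted crudely (non-overlapping substring counts), ties go to the first candidate found, and the final Huffman stage is omitted.

```python
from math import log2

def scr(l, r, R):
    # Per-symbol compression ratio, eq. (8): lower is better; >= 1 means
    # the phrase does not compress the string.
    return (r * (log2(R) - log2(r)) + l) / (l * r)

def infer_phrases(s, min_len=2, max_rounds=10):
    phrases, next_sym = [], 0xE000   # private-use codepoints as new symbols
    for _ in range(max_rounds):
        R = len(s)
        best = None
        for l in range(min_len, len(s) // 2 + 1):
            for i in range(len(s) - l + 1):
                sub = s[i:i + l]
                r = s.count(sub)     # crude non-overlapping repeat count
                if r > 1:
                    # R - r*(l-1): symbols left after replacing each repeat
                    c = scr(l, r, R - r * (l - 1))
                    if best is None or c < best[0]:
                        best = (c, sub)
        if best is None or best[0] >= 1:  # nothing compresses any further
            break
        phrases.append(best[1])
        s = s.replace(best[1], chr(next_sym))
        next_sym += 1
    return phrases

print(infer_phrases('a rose is a rose is a rose')[0])  # a rose
```

On the running example, the first phrase selected is "a rose", matching the grammar shown in the text.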
OSCR ALGORITHM
Example
(9)
Model (set):
S1 → a rose     f (S1) = 1
S2 → is S1      f (S2) = 2
Encoded string: S1 S2 S2
In [12], we described our initial application of the OSCR algorithm to the identification of miRNA target sites. We selected a family of genes from Drosophila (fruit fly) that contain in their 3′ UTRs conserved sequence structures previously described by Lai [24]. These authors observed that
a highly-conserved 8-nucleotide sequence motif, known as
a K-box (sense = 5′ cUGUGAUa 3′; antisense = 5′ uAUCACAg 3′) and located in the 3′ UTRs of Brd and bHLH gene
families, exhibited strong complementarity to several fly
miRNAs, among them miR-11. These motifs exhibited a role
in posttranscriptional regulation that was at the time unexplained.
The OSCR algorithm constructed a phrasebook consisting of nine motifs, listed in Figure 7 (top) to optimally partition the adjacent set of sequences, in which the motifs
are color coded. The OSCR algorithm correctly identified
the most redundant antisense sequence (AUCACA) from the
several examples it was presented.
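The redundancy OSCR exploits here is visible even to a naive k-mer count. Using three of the input sequences from Figure 7 (all of which contain the K-box antisense motif):

```python
from collections import Counter

seqs = ['GGUCACAUCACAGAUACU',
        'CUCGUCAUCACAGUUGGA',
        'CGAUUAAUCACAAUGAGU']

# Count every 6-nucleotide window across the three sequences:
counts = Counter(s[i:i + 6] for s in seqs for i in range(len(s) - 5))
motif, n = counts.most_common(1)[0]
print(motif, n)  # AUCACA 3 -- the K-box antisense motif, once per sequence
```

OSCR goes further than such a fixed-k count: it weighs phrases of every length against their model cost, which is how it also recovers shorter motifs such as GUU.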
The input data for this analysis consists of 19 sequences,
each 18 nucleotides in length (Figure 7). From these sequences, OSCR generated a model consisting of grammar
variables S1 through S4 that map to individual nucleotides
(grammar terminals), the variable S5 that maps to the nucleotide sequence, AUCACA, and four shorter motifs S6 S9 .
The phrase S5 turns out to be a putative target of several different miRNAs, including miR-2a, miR-2b, miR-6, miR-13a,
miR-13b, and miR-11. OSCR identified as S9 a 2 nucleotide
sequence (5 GU 3 ) that is located immediately downstream
of the K-box motif. The new consensus sequence would read
5 AUCACAGU 3 and has a greater degree of homology
to miR-6 and miR-11 than to other D. melanogaster miRNAs. In vivo studies performed subsequent to the original
Lai paper demonstrated the specificity of miR-11 activity
on the Bob-A,B,C, E(spl)ma, E(spl)m4, and E(spl)md genes
[25].
In a separate analysis, we applied OSCR to the sequence
of an individual fruit fly gene transcript, BobA (accession
NM 080348; Figure 7, bottom). Only the BobA transcript
OSCR analysis of Brd family and bHlH repressor
Motif: AUCACA first phrase added
GUU second phrase added
CU, AU, and GU also called out
S1 → G, S2 → U, S3 → C, S4 → A, S5 → AUCACA, S6 → GUU, S7 → CU, S8 → AU, S9 → GU
GGUCACAUCACAGAUACU
CUCGUCAUCACAGUUGGA
CGAUUAAUCACAAUGAGU
UCCUCGAUCACAGUUGGA
GGUGCUAUCACAAUGUUU
UGUUUUAUCACAAUAUCU
AUUAGUAUCACAUCAACA
AAAUGUAUCACAAUUUUU
GUUGAUAUCACAAAUGUA
AAGACUAUCACACUUGGU
UACAAAAUCACAGCUGAA
AGGAACAUCACAUCAUAU
AGAACUAUCACAGGAACA
UUAGUUAUCACAUGAACU
AGUUAUAUCACAGUUGAA
CAGGCCAUCACACGGGAG
UGCCCUAUCACAGACUUA
UGGGCUAUCACAGAUGCG
GUUGCCAUCACAGUUGGG
aacaguucuccauccgagcagaucauaaguaaccaaccugcaaaauguucaccgaaaccg
cucuuguuuccaacuucaauggagugacagagaagaaaucucuuaccggcgccuccacca
accugaagaagcugcugaagaccaucaagaaggucuucaagaacuccaagccuucgaagg
agauuccgauccccaacaucaucuacucuugcaauacugaggaggagcaccagaauuggc
ucaacgaacaacuggaggccauggcaauccaucuucacugaguucuucugggacaucccc
cuccaucgaguaucugugaugugacccgaucaaaaggucuauaaaucggcacuccggcuu
uaauauccaacugugaugacgagaacacaagacugacugacuugugugccuuggagguga
caaaguucgucgccucugccaacuguacauaucaaacuagcugcuaaaaugucuucaauu
augcuuuaauguagucuaaguuaguauuaucauugucuuccauuaguuuaagaaaaucau
ugucuuccauguuuguuuguuaggguaaaaaaaacuagcuuaagaauaaaaaucccucgc
ggaaagaaaacaau
Figure 7: Motif analysis of 19 sequences, each of which is believed to contain a single target site for miR-11 from fruit fly. (Top) OSCR adds the variable S5, the K-box motif, to its MDL codebook; this motif has been shown to be a miRNA target site for miR-11. (Bottom) Full sequence of the BobA gene transcript with K-box and GY-box motifs underlined in blue text. The K-box motif (CUGUGAUG) is a target site for miR-11 and the GY-box motif (UGUCUUCCAU) is a target site for miR-7.
5.
MDLcompress
The new MDLcompress algorithmic tool retains the fundamental element of OSCR, deeply recursive heuristic-based grammar inference, while trading computational complexity for space complexity to decrease execution time. The compression, and hence the ability of the algorithm to identify specific motifs (which we hypothesize to be of potential biological significance), has been enhanced by new heuristics and an architecture that searches not only the sequence but also the model for candidate phrases. The performance has been improved by gathering statistics about potential code words in a single pass and by forming and maintaining simple matrix structures to simplify heuristic calculations. Additional gains in compression are achieved by tuning the algorithm to take advantage of sequence-specific features such as palindromes, regions of local similarity, and SNPs.
5.1.
encoded by this symbol in the phrasebook. We previously defined the SCR for a candidate phrase i as

ρ_i = (C_m + C_o) / C_p,  (10)

where

C_m = l_i,  (11)

C_p = l_i R_i,  (12)

C_o = Σ_{j=1}^{l_i} log (R / r_j),  (13)

where R is the total number of symbols without the formation of the candidate phrase and r_j is the frequency of the jth symbol in the candidate phrase. Model costs require a method for not only spelling out the candidate phrase but also the cost of encoding the length of the phrase to be described. We estimate this cost as

C_m = M(l_i) + Σ_{j=1}^{l_i} log (R / r_j),  (14)

C_o = (R − R_i) log ((L + 2) / (L + 1)),  (15)

SCR = (C_m + C_h + C_o) / C_p.  (16)

(Figure: SCR plotted as a function of phrase length and number of repeats.)
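Under the simplified costs C_m = l_i, C_p = l_i R_i, and C_o = Σ log2(R/r_j), the heuristic can be sketched as follows (the function name and toy sequence are illustrative, not from the MDLcompress implementation):

```python
import math
from collections import Counter

def scr(sequence, phrase):
    """Estimate the symbol compression ratio (SCR) of a candidate phrase.

    R   : total number of symbols in the sequence
    R_i : number of (non-overlapping) occurrences of the phrase
    l_i : phrase length
    r_j : frequency of the jth symbol of the phrase in the sequence
    """
    R = len(sequence)
    R_i = sequence.count(phrase)          # simplistic occurrence count
    l_i = len(phrase)
    freq = Counter(sequence)
    C_m = l_i                             # model cost: spell out the phrase
    C_p = l_i * R_i                       # compression payoff: symbols covered
    C_o = sum(math.log2(R / freq[s]) for s in phrase)  # data cost of the symbols
    return (C_m + C_o) / C_p

seq = "GGUCACAUCACAGAUACUCUCGUCAUCACAGUUGGA"
# the repeated AUCACA motif yields a lower (better) ratio than a one-off phrase
print(scr(seq, "AUCACA") < scr(seq, "GAUACU"))
```

A lower ratio marks a phrase whose substitution by a grammar variable is expected to shorten the two-part description, which is why frequent motifs such as AUCACA are selected first.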
Additional heuristics
(Figure 9 content: the input text "Pease porridge hot, pease porridge cold, pease porridge in the pot, nine days old. Some like it hot, some like it cold, some like it in the pot, nine days old." with total compression (TC) decreasing over iterations 1–4, and the inferred grammar, with variables such as S1 = "pease porridge", S2 = "<CR>some like it", S3 = "in the pot,<CR>nine days old.", S4 = "cold,", S5 = "e", S6 = "<CR>", and S7 = "ot,".)
Figure 9: MDLcompress model-inferred grammar for the "pease porridge" input sequence using the total compression (TC) and longest match (LM) heuristics. Both the SCR and TC heuristics achieve the same total compression, and both exceed the performance of LM. Subsequent iterations enable MDLcompress to identify phrases, yielding further compression of the TC grammar model.
(Figure: model cost, description cost, and total cost plotted against phrase length.)
(Figure 11 content: the input "a rose is a rose is a rose." before and after substitution of S1 for "a rose". Initially phraseArray(1) = {length: 6, verboselength: 6, chararray: "a rose", startindices: [1 11 21], frequency: 3} and phraseArray(2) = {length: 10, verboselength: 10, chararray: "a rose is", startindices: [1 11], frequency: 2}. The phrase array has all the information necessary to update the other candidates after each phrase is added to the model, yielding the updated entries {length: 1, startindices: [1 6 11]} and {length: 5, startindices: [1 6]}.)
Figure 11: The data structures used in MDLcompress allow constant-time selection and replacement of candidate phrases. The top of the figure shows the initial index matrix and phrase array. After adding "a rose" to the model, MDLcompress can generate the new index box and phrase array, shown in the bottom half, in constant time.
During the phrase selection part of each iteration, MDLcompress only has to search through the phrase array, calculating the heuristic for each entry. Once a phrase is selected, the matrix is used to identify overlapping phrases, which will have their frequency reduced by the substitution of a new symbol for the selected substring. While there may be many phrases in the array that are updated, only local sections of the matrix are altered, so overall only a small percentage of the data structure is updated. This technique is what allows MDLcompress to execute efficiently even with long input sequences, such as DNA.
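The bookkeeping of Figure 11 can be sketched with a toy phrase-array entry in Python (field names mirror the figure but use 0-based indices; the real MDLcompress structures avoid the full rescan shown here):

```python
def phrase_entry(text, phrase):
    """Record start indices and frequency of a candidate phrase
    using a non-overlapping left-to-right scan."""
    starts, i = [], text.find(phrase)
    while i != -1:
        starts.append(i)
        i = text.find(phrase, i + len(phrase))
    return {"chararray": phrase, "length": len(phrase),
            "startindices": starts, "frequency": len(starts)}

def substitute(text, phrase, symbol):
    """Replace non-overlapping occurrences of `phrase` with a grammar symbol."""
    return text.replace(phrase, symbol)

text = "a rose is a rose is a rose."
print(phrase_entry(text, "a rose"))    # indices [0, 10, 20], frequency 3
reduced = substitute(text, "a rose", "S1")
print(reduced)                         # "S1 is S1 is S1."
print(phrase_entry(reduced, "S1 is"))  # candidate statistics after substitution
```

After the substitution, overlapping candidates such as "a rose is" shrink in encoded length while keeping their frequency, which is exactly the update the phrase array tracks incrementally.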
5.4.
Performance bounds
Table 1

Genes        DNACompress          Sequitur   DNASequitur   MDLcompress
             (bits/nucleotide)
HUMDYSTROP   1.91                 2.34       2.2           1.95
HUMGHCSA     1.03                 1.86       1.74          1.49
HUMHBB       1.79                 2.20       2.05          1.92
HUMHDABCD    1.80                 2.26       2.12          1.92
HUMPRTB      1.82                 2.22       2.14          1.92
CHNTXX       1.61                 2.24       2.12          1.95

6.
REVERSE-COMPLEMENT MATCHES
As in DNA Sequitur, the search for and grammar encoding of reverse-complement matches is readily implemented by adding the reverse-complement of a phrase to the MDLcompress codebook.
7.
POST PROCESSING
After the MDLcompress model has been created, two possibilities for further compression are the following.
(1) Regions of local similarity: it is sometimes most efficient to define a phrase as a concatenation of multiple shorter and adjacent phrases already in the codebook.
(2) Single nucleotide polymorphisms (SNPs): it is sometimes most efficient to define a phrase as a single-nucleotide alteration to another phrase already in the codebook.
8.
(Figure 12 content: schematic of the mRNA (5′ UTR, CDS, 3′ UTR) and the phrases found, with positions in the 3′ UTR: (1) aaaaaaaaaaaa at 433, 445; (2) agcacttatt at 262, 362; (3) aaacaggac at 155, 172.)
Figure 12: Validation of MDLcompress performance. MDLcompress identifies the miRNA-372/373 target motif (AGCACTTATT) in the LATS2 tumor suppressor gene as its second phrase.
Table 2: 3′ UTR MDLcompress phrases from 144 ErbB2-positive-related gene mRNA sequences.
Accession number   Number of repeats   Length   Phrase          Locations
NM 000442          2                   13       tttctcttttcct   2835, 3091
NM 004265          2                   10       tcagggaggg      2274, 2667
NM 004265          2                   10       ccccccagct      2954, 3021
NM 004265          2                   10       gcagaggcag      2255, 3051
NM 005324          2                   12       ttttatttataa    1292, 1802
NM 005324          2                   10       cagtttcctt      997, 1991
NM 005324          2                   9        tttataata       627, 1055
NM 005930          2                   11       tatttcaattt     2903, 2932
NM 005930          2                   11       tatttttgctc     2733, 3809
NM 005930          2                   10       gacaaatgtg      3064, 3250
NM 005930          2                   10       cttttttttc      3425, 3689
NM 005930          2                   10       ttggaacact      3750, 3787
NM 006148          2                   13       gtgtgtgagtgtg   1951, 3654
NM 006148          2                   12       ccccagtctcca    647, 1651
NM 006148          2                   11       acttcttggtt     1067, 1290
NM 006148          2                   11       cctcctgccca     1186, 1503
NM 006148          2                   11       ccccatctctg     2147, 2302
NM 006148          2                   11       ggaagcacagc     1545, 2447
NM 006148          2                   11       tgtgggtgggg     2014, 2776
NM 006148          2                   11       cctttctggcc     2812, 3759
NM 006148          2                   10       ctccctcctc      1035, 1408
NM 006148          2                   10       cagctaccgg      525, 1591
NM 006148          2                   10       tcccctcccc      1464, 1828
NM 006148          2                   10       gtggaggaag      2159, 2267
NM 006276          2                   11       agatcaagatc     1010, 1091
(Figure 13 content: (a) alignment within the 3′ UTR of the OSCR sequence AGAUCAAGAUC with hsa-miR-218, rno-miR-218, and xtr-miR-218, each with sequence UGUACCAAUCUAGUUCGUGUU; (b) comparison of BT474 and HMEC cells.)
Figure 13: A miRNA target site relevant to breast cancer is identified by OSCR. (a) Proposed interaction between miRNAs (human, rat, frog) and the OSCR phrase. (b) Downregulation of SFRS7 by RNAi specifically inhibits the proliferation of the breast cancer cell line BT474 and not of normal cells. These miRNAs may be implicated in breast cancer.
(Figure 14(a) content: schematic of the overlap of 13 genes between the SNP500 database (500 genes) and the test set.)

Name    Accession    MDL sequence   Position           SNP
ESR1    NM 000125    GATATGTTTA     4023, 5325         4029 T → C
PTGS2   NM 000963    CAAAATGC       2179, 2717, 3097   3103 G → A
EGFR    NM 005228    TTTTACTTC      4233, 4967         4975 C → T
(b)
Figure 14: MDLcompress directly identifies putative miRNA target sequences that may be implicated in breast cancer. (a) Schematic of
overlap between SNP500 database and potential miRNA sequences identified by MDLcompress in the test set. (b) Potential miRNA sites
identified by MDLcompress with disease-related polymorphisms identified by SNP analysis. These miRNA targets may be implicated in
breast cancer.
binding sites of predicted miRNAs and identified 2490 Texellike mutations and 483 mutations that potentially result in
loss of miRNA binding.
We performed a similar analysis on the 144 overexpressed
gene mRNA sequences from the BT474 breast cancer cell
line [30, 31] to identify which of these genes possess diseaserelated Texel-like mutations. By cross-referencing with the
SNP500 database [40], SNPs were found in 13 of the 144
overexpressed gene mRNA sequences from the BT474 breast
cancer cell line, all in the 3 UTR region. The initial comparison of the 93 MDLcompress code words from the 144 genes
discussed previously did not match with any SNP phrases.
We then relaxed the strict constraint that a phrase must lead
to compression at every step and asked MDLcompress in
longest match to identify the top 10 candidates in each gene
mRNA sequence that would most likely lead to compression.
Strikingly, 3 of these genes, ESR1, PTGS2, and EGFR, have SNPs in the set of the first 10 code word candidates identified by MDLcompress when run on each of these genes' respective mRNA sequences (Figure 14). These three sequences were selected out of the 13 because they fulfill the criteria we used for Figure 13(a): based on sequence analysis (similarity to miRNA sequences and intra- and interspecies sequence conservation), they are putative miRNA targets.
These motifs are localized to the 3′ UTR and have not been predicted to interact with any known miRNAs in the literature. Although further validation studies are required, these observations suggest that MDLcompress may be capable of directly identifying potential miRNA target sequences with roles in breast cancer.
Our hypothesis regarding the significance of MDL phrases that are added to the MDLcompress model motivates a search of these phrases for SNPs related to cancer. As shown in Figure 10, an SNP identified in the PTGS2 gene [40] colocalizes with the MDLcompress-identified phrase caaaatgc in the 3′ UTR of PTGS2 and yields a disproportionate change in the descriptive cost of the sequence under the MDLcompress model generated for the original sequence. Altering a
(Figure 15 content: cost per nucleotide over positions 2700–2750, covering the sequence taaaacttccttttaaatcaaaatgccaaatttattaaggtggtggagcc, with the SNP g → a marked.)
Figure 15: Cost per nucleotide for PTGS2. The blue curve shows the cost per nucleotide of the original sequence, based upon an MDLcompress model developed using the total compression heuristic and the first 15 phrases to be selected. The cost per nucleotide under the SNP g → a is shown in red.
single nucleotide typically yields a very small change in descriptive cost, in most cases less than a bit; however, the SNP
in the phrase shown in Figure 15 yields a change in descriptive cost on the order of 4 bits, suggesting that this phrase
is in fact meaningful. Future work will elaborate on this potential relationship between meaningful phrases identified by
MDLcompress and disease, and explore the capability of using MDLcompress models to predict sites where SNPs are especially likely to cause pathology.
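The disproportionate-cost effect can be illustrated with a toy zero-order coder (the function and the short sequences below are illustrative; this is not the MDLcompress coder nor the actual PTGS2 3′ UTR):

```python
import math
from collections import Counter

def description_cost(seq, phrases):
    """Toy two-part cost: greedily tokenize known phrases as single symbols,
    then charge each token its zero-order code length -log2(frequency/total)."""
    tokens, i = [], 0
    while i < len(seq):
        for p in phrases:
            if seq.startswith(p, i):
                tokens.append(p)
                i += len(p)
                break
        else:                      # no phrase matched: emit a single nucleotide
            tokens.append(seq[i])
            i += 1
    freq = Counter(tokens)
    total = sum(freq.values())
    return sum(-math.log2(freq[t] / total) for t in tokens)

phrases = ["caaaatgc"]                       # the MDL phrase from the PTGS2 example
orig    = "tcaaaatgccaaatttcaaaatgcatt"      # toy context containing two phrase copies
snp_in  = "tcaaaatgccaaatttcaaaatacatt"      # substitution inside the second copy
snp_out = "tcaaaatgccgaatttcaaaatgcatt"      # substitution outside any phrase
base = description_cost(orig, phrases)
print(description_cost(snp_in, phrases) - base)   # large jump: phrase no longer matches
print(description_cost(snp_out, phrases) - base)  # comparatively small change
```

A substitution that breaks a modeled phrase forces its symbols back into the literal stream, so the cost change is much larger than for a substitution in unmodeled sequence, mirroring the roughly 4-bit jump reported for the PTGS2 SNP.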
11.
CONCLUSIONS
[5] B. P. Lewis, C. B. Burge, and D. P. Bartel, "Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets," Cell, vol. 120, no. 1, pp. 15-20, 2005.
[6] V. Rusinov, V. Baev, I. N. Minkov, and M. Tabler, "MicroInspector: a web tool for detection of miRNA binding sites in an RNA sequence," Nucleic Acids Research, vol. 33, web server issue, pp. W696-W700, 2005.
[7] G. A. Calin, C.-G. Liu, C. Sevignani, et al., "MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 32, pp. 11755-11760, 2004.
[8] A. Esquela-Kerscher and F. J. Slack, "Oncomirs: microRNAs with a role in cancer," Nature Reviews Cancer, vol. 6, no. 4, pp. 259-269, 2006.
[9] P. Grunwald, I. J. Myung, and M. Pitt, Eds., Advances in Minimum Description Length: Theory and Applications, MIT Press, Cambridge, Mass, USA, 2005.
[10] S. C. Evans, Kolmogorov complexity estimation and application for information system security, Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, NY, USA, 2003.
[11] S. C. Evans, B. Barnett, S. F. Bush, and G. J. Saulnier, "Minimum description length principles for detection and classification of FTP exploits," in Proceedings of IEEE Military Communications Conference (MILCOM '04), vol. 1, pp. 473-479, Monterey, Calif, USA, October-November 2004.
[12] S. C. Evans, A. Torres, and J. Miller, "MicroRNA target motif detection using OSCR," Tech. Rep. GRC223, GE Research, Niskayuna, NY, USA, 2006.
[13] M. Li and P. Vitanyi, An Introduction to Kolmogorov Complexity and Its Applications, Springer, New York, NY, USA, 1997.
[14] W. Szpankowski, W. Ren, and L. Szpankowski, "An optimal DNA segmentation based on the MDL principle," International Journal of Bioinformatics Research and Applications, vol. 1, no. 1, pp. 3-17, 2005.
[15] I. Tabus, G. Korodi, and J. Rissanen, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," in Proceedings of Data Compression Conference (DCC '03), pp. 253-262, Snowbird, Utah, USA, March 2003.
[16] A. Apostolico and S. Lonardi, "Some theory and practice of greedy off-line textual substitution," in Proceedings of Data Compression Conference (DCC '98), pp. 119-128, Snowbird, Utah, USA, March 1998.
[17] C. G. Nevill-Manning and I. H. Witten, "Identifying hierarchical structure in sequences: a linear-time algorithm," Journal of Artificial Intelligence Research, vol. 7, pp. 67-82, 1997.
[18] N. Cherniavsky and R. Ladner, "Grammar-based compression of DNA sequences," in DIMACS Working Group on the Burrows-Wheeler Transform, Piscataway, NJ, USA, August 2004.
[19] X. Chen, M. Li, B. Ma, and J. Tromp, "DNACompress: fast and effective DNA sequence compression," Bioinformatics, vol. 18, no. 12, pp. 1696-1698, 2002.
[20] B. Behzadi and F. Le Fessant, "DNA compression challenge revisited: a dynamic programming approach," in The 16th Annual Symposium on Combinatorial Pattern Matching (CPM '05), vol. 3537 of Lecture Notes in Computer Science, pp. 190-200, Jeju Island, Korea, 2005.
[21] S. C. Evans, T. S. Markham, A. Torres, A. Kourtidis, and D. Conklin, "An improved minimum description length learning algorithm for nucleotide sequence analysis," in Proceedings of IEEE 40th Asilomar Conference on Signals, Systems and Computers, 2006.
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
[33]
[34]
[35]
[36]
[37]
[38]
[39]
[40]
Research Article
Variation in the Correlation of G + C Composition with
Synonymous Codon Usage Bias among Bacteria
Haruo Suzuki, Rintaro Saito, and Masaru Tomita
Institute for Advanced Biosciences, Keio University, Yamagata 997-0017, Japan
Received 31 January 2007; Accepted 4 June 2007
Recommended by Teemu Roos
G + C composition at the third codon position (GC3) is widely reported to be correlated with synonymous codon usage bias. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we applied Shannon entropy from information theory to measure the degree of GC3 bias and that of synonymous codon usage bias of each gene. The strength of the correlation of GC3 with synonymous codon usage bias, quantified by a correlation coefficient, varied widely among bacterial genomes, ranging from 0.07 to 0.95. Previous analyses suggesting that the relationship between GC3 and synonymous codon usage bias is independent of species are thus inconsistent with the more detailed analyses obtained here for individual species.
Copyright 2007 Haruo Suzuki et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1.
INTRODUCTION
2.
2.1. Software
All analyses were conducted by using G-language genome
analysis environment software [13], available at http://www
.g-language.org. Graphs such as the histogram and scatter
plot were generated in the R statistical computing environment [14], available at http://www.r-project.org.
2.2. Sequences
We tested data from 371 bacterial genomes (see Additional
Table 1 for a comprehensive list (available online at http://
www2.bioinfo.ttck.keio.ac.jp/genome/haruo/BSB ST1.pdf)).
Complete genomes in GenBank format [15] were downloaded from the NCBI repository site (ftp://ftp.ncbi.nih.gov/
genomes/Bacteria). Protein coding sequences containing letters other than A, C, G, or T, and those containing amino acids with fewer residues than their degree of codon degeneracy, were discarded. From each coding sequence, the start and stop codons were excluded.
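This filtering step can be sketched as follows (the function name is hypothetical, and the codon-degeneracy filter is omitted for brevity):

```python
def preprocess(cds_list):
    """Keep coding sequences made only of A/C/G/T with whole codons,
    then drop the start codon and the stop codon before counting usage."""
    kept = []
    for cds in cds_list:
        cds = cds.upper()
        if set(cds) <= set("ACGT") and len(cds) % 3 == 0 and len(cds) >= 9:
            kept.append(cds[3:-3])   # exclude start and stop codons
    return kept

seqs = ["ATGGCTGCTTAA",   # valid: ATG GCT GCT TAA
        "ATGGCNGCTTAA"]   # contains an ambiguous base N: discarded
print(preprocess(seqs))   # ['GCTGCT']
```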
2.3. Analyses

R_ij = n_ij / Σ_{j=1}^{k_i} n_ij,  (1)

H_i = −Σ_{j=1}^{k_i} R_ij log2 R_ij,  (2)

H_i′ = H_i / H_i,max = H_i / log2 k_i,  (3)

E_w = Σ_{i=1}^{s} w_i E_i,  (4)

r = Σ_{g=1}^{m} (x_g − x̄)(y_g − ȳ) / sqrt( Σ_{g=1}^{m} (x_g − x̄)^2 · Σ_{g=1}^{m} (y_g − ȳ)^2 ),
with x̄ = (1/m) Σ_{g=1}^{m} x_g and ȳ = (1/m) Σ_{g=1}^{m} y_g.  (6)
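Equations (1)-(3) amount to a normalized Shannon entropy over synonymous codon counts; a minimal sketch, with illustrative counts:

```python
import math

def normalized_entropy(codon_counts):
    """R_ij = n_ij / sum_j n_ij; H_i = -sum_j R_ij log2 R_ij;
    normalized by H_max = log2 k_i, where k_i is the number of
    synonymous codons of amino acid i."""
    total = sum(codon_counts.values())
    k = len(codon_counts)
    H = -sum((n / total) * math.log2(n / total)
             for n in codon_counts.values() if n > 0)
    return H / math.log2(k)

# counts of the four synonymous glycine codons in one hypothetical gene
uniform = {"GGU": 5, "GGC": 5, "GGA": 5, "GGG": 5}
biased  = {"GGU": 17, "GGC": 1, "GGA": 1, "GGG": 1}
print(normalized_entropy(uniform))  # 1.0: no codon usage bias
print(normalized_entropy(biased))   # well below 1.0: strong bias
```

Values near 1 indicate uniform synonymous codon usage; values near 0 indicate strong bias, which is the quantity the E_w index aggregates over amino acids.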
3.
RESULTS
To investigate whether the correlation of GC3 with synonymous codon usage bias (the r value of HGC3 versus Ew ) was
related to species characteristics, we compared the r values
with genomic features such as genomic G + C content and
tRNA gene copy number. Among the 371 genomes analyzed
here, genomic G + C content ranged from 23% to 73% and
tRNA gene copy number varied from 28 to 145.
We constructed scatter plots of the r values of HGC3 with
Ew plotted against genomic G + C content and tRNA gene
copy number for 371 genomes (Figure 5). The relationship
between the r value of HGC3 and the tRNA gene copy number
was unclear (Figure 5(b)). In contrast, the r values of HGC3
tended to be high in G + C-poor or G + C-rich genomes, revealing a nonlinear relationship between the r value of HGC3
and genomic G+C content (Figure 5(a)). The highest r value
(Figure 2 residue: panel axes and rank correlations, including Ew versus HGC1 (r = 0.25) and HGC2 (r = 0.01) for G. metallireducens, and Ew versus HGC1 (r = 0.06), HGC2 (r = 0.08), and HGC3 (r = 0.07) for S. degradans.)
Figure 2: Scatter plots of Ew plotted against (a) HGC1, (b) HGC2, and (c) HGC3 for Geobacter metallireducens GS-15 genes, and against (d) HGC1, (e) HGC2, and (f) HGC3 for Saccharophagus degradans 2-40 genes. The extent of the correlation between HGC1, HGC2, and HGC3 and Ew is represented by Spearman's rank correlation coefficient (r).
of HGC3 (0.95) was found in G. metallireducens, with a genomic G+C content of 60% (Figure 2(c)). The lowest r value
of HGC3 (0.07) was found in S. degradans, with a genomic
G + C content of 46% (Figure 2(f)). The mean and standard
(Figure residue: histograms of the number of genomes by r value for HGC1, HGC2, and HGC3, and scatter plots of the r values of HGC3 against those of HGC1 and HGC2.)
deviation were 0.86 and 0.04, respectively. Thus, the r values of HGC3 for G + C-poor bacteria tended to be lower than those for G + C-rich bacteria.
4.
DISCUSSION
Figure 5: Scatter plots of the r values of HGC3 with Ew plotted against (a) genomic G+C content and (b) tRNA gene number for 371 bacterial
genomes.
ABBREVIATIONS
A: Adenine
T: Thymine
G: Guanine
C: Cytosine
GC1: G + C content at the first codon position
GC2: G + C content at the second codon position
GC3: G + C content at the third codon position
HGC1: Entropy of GC1
HGC2: Entropy of GC2
HGC3: Entropy of GC3
Ew: Weighted sum of relative entropy
r: Spearman's rank correlation coefficient
ACKNOWLEDGMENTS
The authors thank Dr. Kazuharu Arakawa (Institute for Advanced Biosciences, Keio University) for his technical advice on the G-language genome analysis environment, and Kunihiro Baba (Faculty of Policy Management, Keio University) for his technical advice on the R statistical computing environment. This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan Grant-in-Aid for the 21st Century Centre of Excellence (COE) Program entitled "Understanding and Control of Life via Systems Biology" (Keio University).
REFERENCES
[1] M. D. Ermolaeva, "Synonymous codon usage in bacteria," Current Issues in Molecular Biology, vol. 3, no. 4, pp. 91-97, 2001.
[2] A. Carbone, F. Kepes, and A. Zinovyev, "Codon bias signatures, organization of microorganisms in codon space, and lifestyle," Molecular Biology and Evolution, vol. 22, no. 3, pp. 547-561, 2005.
[3] A. Carbone, A. Zinovyev, and F. Kepes, "Codon adaptation index as a measure of dominating codon bias," Bioinformatics, vol. 19, no. 16, pp. 2005-2015, 2003.
[4] R. D. Knight, S. J. Freeland, and L. F. Landweber, "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes," Genome Biology, vol. 2, no. 4, pp. research0010.1-research0010.13, 2001.
[5] J. R. Lobry and A. Necsulea, "Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes," Gene, vol. 385, pp. 128-136, 2006.
[6] D. J. Lynn, G. A. C. Singer, and D. A. Hickey, "Synonymous codon usage is subject to selection in thermophilic bacteria," Nucleic Acids Research, vol. 30, no. 19, pp. 4272-4277, 2002.
[7] G. A. C. Singer and D. A. Hickey, "Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content," Gene, vol. 317, no. 1-2, pp. 39-47, 2003.
[8] H. Suzuki, R. Saito, and M. Tomita, "A problem in multivariate analysis of codon usage data and a possible solution," FEBS Letters, vol. 579, no. 28, pp. 6499-6504, 2005.
[9] X.-F. Wan, D. Xu, A. Kleinhofs, and J. Zhou, "Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes," BMC Evolutionary Biology, vol. 4, p. 19, 2004.
[10] F. Wright, "The effective number of codons used in a gene," Gene, vol. 87, no. 1, pp. 23-29, 1990.
[11] B. Zeeberg, "Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes," Genome Research, vol. 12, no. 6, pp. 944-955, 2002.
[12] H. Suzuki, R. Saito, and M. Tomita, "The weighted sum of relative entropy: a new index for synonymous codon usage bias," Gene, vol. 335, no. 1-2, pp. 19-23, 2004.
[13] K. Arakawa, K. Mori, K. Ikeda, T. Matsuzaki, Y. Kobayashi, and M. Tomita, "G-language genome analysis environment: a workbench for nucleotide sequence data mining," Bioinformatics, vol. 19, no. 2, pp. 305-306, 2003.
[14] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2006.
[15] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, "GenBank," Nucleic Acids Research, vol. 35, supplement 1, pp. D21-D25, 2007.
[16] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[17] A. Muto and S. Osawa, "The guanine and cytosine content of genomic DNA and bacterial evolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 1, pp. 166-169, 1987.
[18] N. Sueoka, "On the genetic basis of variation and heterogeneity of DNA base composition," Proceedings of the National Academy of Sciences of the United States of America, vol. 48, no. 4, pp. 582-592, 1962.
[19] S. Garcia-Vallve, A. Romeu, and J. Palau, "Horizontal gene transfer in bacterial and archaeal complete genomes," Genome Research, vol. 10, no. 11, pp. 1719-1725, 2000.
[20] R. J. Grocock and P. M. Sharp, "Synonymous codon usage in Pseudomonas aeruginosa PA01," Gene, vol. 289, no. 1-2, pp. 131-139, 2002.
[21] J. O. McInerney, "Replicational and transcriptional selection on codon usage in Borrelia burgdorferi," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 18, pp. 10698-10703, 1998.
[22] P. M. Sharp, E. Bailes, R. J. Grocock, J. F. Peden, and R. E. Sockett, "Variation in the strength of selected codon usage bias among bacteria," Nucleic Acids Research, vol. 33, no. 4, pp. 1141-1153, 2005.
[23] P. M. Sharp, M. Stenico, J. F. Peden, and A. T. Lloyd, "Codon usage: mutational bias, translational selection, or both?" Biochemical Society Transactions, vol. 21, no. 4, pp. 835-841, 1993.
Research Article
Information-Theoretic Inference of Large Transcriptional
Regulatory Networks
Patrick E. Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi
ULB Machine Learning Group, Computer Science Department, Universite Libre de Bruxelles, 1050 Brussels, Belgium
Received 26 January 2007; Accepted 12 May 2007
Recommended by Juho Rousu
The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
Copyright 2007 Patrick E. Meyer et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1.
INTRODUCTION
Two important issues in computational biology are the extent to which it is possible to model transcriptional interactions by large networks of interacting elements and how these interactions can be effectively learned from measured expression data [1]. The reverse engineering of transcriptional regulatory networks (TRNs) from expression data alone is far from trivial because of the combinatorial nature of the problem and the poor information content of the data [1]. An additional problem is that, by focusing only on transcript data, the inferred network should not be considered as a biochemical regulatory network but as a gene-to-gene network, where many physical connections between macromolecules might be hidden by shortcuts.
In spite of these evident limitations, the bioinformatics community has made important advances in this domain over the last few years. Examples are methods like Boolean networks, Bayesian networks, and association networks [2].
This paper will focus on information-theoretic approaches [3-6], which typically rely on the estimation of mutual information from expression data in order to measure the statistical dependence between variables (the terms "variable" and "feature" are used interchangeably in this paper). Such methods have recently held the attention of the bioinformatics community.
This section reviews some state-of-the-art methods for network inference which are based on information-theoretic notions.
These methods require at first the computation of the mutual information matrix (MIM), a square matrix whose i, j element

MIM_ij = I(X_i; X_j) = Σ_{x_i ∈ X} Σ_{x_j ∈ X} p(x_i, x_j) log ( p(x_i, x_j) / (p(x_i) p(x_j)) )  (1)

is the mutual information between X_i and X_j, where X_i ∈ X, i = 1, ..., n, is a discrete random variable denoting the expression level of the ith gene.

2.1. Chow-Liu tree

The Chow and Liu approach consists in finding the maximum spanning tree (MST) of a complete graph, where the weights of the edges are the mutual information quantities between the connected nodes [3]. The construction of the MST with Kruskal's algorithm has an O(n^2 log n) cost. The main drawbacks of this method are (i) the spanning tree typically has a low number of edges, even for nonsparse target networks, and (ii) no parameter is provided to calibrate the size of the inferred network.

2.2. Relevance network (RELNET)

The relevance network approach [4] has been introduced in gene clustering problems and successfully applied to infer relationships between RNA expression and chemotherapeutic susceptibility [15]. The approach consists in inferring a genetic network where a pair of genes {X_i, X_j} is linked by an edge if the mutual information I(X_i; X_j) is larger than a given threshold.

2.3. CLR algorithm

The CLR algorithm [6] is an extension of RELNET. This algorithm computes the mutual information (MI) for each pair of genes and derives a score related to the empirical distribution of these MI values. In particular, instead of considering the information I(X_i; X_j) between genes X_i and X_j, it takes into account the score z_ij = sqrt(z_i^2 + z_j^2), where

z_i = max( 0, (I(X_i; X_j) − μ_i) / σ_i )  (2)

and μ_i and σ_i are the mean and standard deviation of the empirical distribution of the mutual information values of gene X_i.

2.4. ARACNE

The algorithm for the reconstruction of accurate cellular networks (ARACNE) [5] is based on the data processing inequality [16]. This inequality states that, if gene X_1 interacts with gene X_3 through gene X_2, then

I(X_1; X_3) ≤ min( I(X_1; X_2), I(X_2; X_3) ).  (3)
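A minimal sketch of DPI-based pruning over a mutual information matrix (this toy version drops the weakest edge of every fully connected triplet and omits ARACNE's tolerance parameter):

```python
from itertools import combinations

def aracne_prune(mim):
    """Apply the data processing inequality: in every triplet (i, j, k),
    mark the edge with the smallest mutual information for removal."""
    n = len(mim)
    removed = set()
    for i, j, k in combinations(range(n), 3):
        edges = [(mim[i][j], (i, j)), (mim[i][k], (i, k)), (mim[j][k], (j, k))]
        removed.add(min(edges)[1])   # weakest edge of the triplet
    return [(i, j) for i, j in combinations(range(n), 2)
            if (i, j) not in removed]

# X1 -> X2 -> X3: the indirect X1-X3 dependence is the weakest
mim = [[0.0, 0.9, 0.3],
       [0.9, 0.0, 0.8],
       [0.3, 0.8, 0.0]]
print(aracne_prune(mim))  # the indirect edge (0, 2) is pruned
```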
(Figure 1 content: a network and data generator produces an artificial dataset from the original network; an entropy estimator yields the mutual information matrix; the inference method produces the inferred network; and a validation procedure outputs precision-recall curves and F-scores.)
Figure 1: An artificial microarray dataset is generated from an original network. The inferred network can then be compared to this true
network.
3.
We propose to infer a network using the maximum relevance/minimum redundancy (MRMR) feature selection
method. The idea consists in performing a series of supervised MRMR gene selection procedures, where each gene in
turn plays the role of the target output.
The MRMR method has been introduced in [11, 12] together with a best-first search strategy for performing filter selection in supervised learning problems. Consider a supervised learning task where the output is denoted by Y and V is the set of input variables. The method ranks the set V of inputs according to a score that is the difference between the mutual information with the output variable Y (maximum relevance) and the average mutual information with the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e., the most informative variables for the target Y) should be well ranked, whereas indirect interactions (i.e., the ones with redundant information with the direct ones) should be badly ranked by the method.
The greedy search starts by selecting the variable Xi having
the highest mutual information to the target Y . The second
selected variable X j will be the one with a high information
I(X j ; Y ) to the target and at the same time a low information
I(X j ; Xi ) to the previously selected variable. In the following
steps, given a set S of selected variables, the criterion updates
S by choosing the variable
X_j^MRMR = arg max_{Xj ∈ V\S} (uj − rj),
(4)

where the relevance term

uj = I(Xj; Y)
(5)

quantifies the mutual information of Xj with the target, and the redundancy term

rj = (1/|S|) Σ_{Xk ∈ S} I(Xj; Xk)
(6)
measures the average redundancy of Xj to each already selected variable Xk ∈ S. At each step of the algorithm, the selected variable is expected to allow an efficient trade-off
between relevance and redundancy. It has been shown in
[12] that the MRMR criterion is an optimal pairwise approximation of the conditional mutual information between
any two genes X j and Y given the set S of selected variables
I(X j ; Y | S).
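The greedy ranking described above can be sketched directly on a precomputed mutual information matrix. This is a simplified illustration, not the incremental best-first implementation of [17]; the matrix layout and the fixed number of selected variables are assumptions made here for clarity.

```python
import numpy as np

def mrmr_rank(mim, target, n_select):
    """Greedy MRMR ranking of predictors for one target, eqs. (4)-(6).

    mim: symmetric matrix of pairwise mutual information estimates.
    Returns a list of (gene index, score u_j - r_j) in selection order.
    """
    candidates = [j for j in range(mim.shape[0]) if j != target]
    selected, ranking = [], []
    for _ in range(n_select):
        best_j, best_s = None, -np.inf
        for j in candidates:
            u = mim[j, target]                              # relevance, eq. (5)
            r = np.mean(mim[j, selected]) if selected else 0.0  # redundancy, eq. (6)
            if u - r > best_s:                              # criterion, eq. (4)
                best_j, best_s = j, u - r
        selected.append(best_j)
        ranking.append((best_j, best_s))
        candidates.remove(best_j)
    return ranking
```

Note how a variable that is informative about the target but redundant with an already selected variable receives a low score, matching the rationale given above.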
The MRNET approach consists in repeating this selection procedure for each target gene by putting Y = Xi and
V = X \ {Xi }, i = 1, . . . , n, where X is the set of the expression levels of all genes. For each pair {Xi , X j }, MRMR returns
two (not necessarily equal) scores si and s j according to (5).
The score of the pair {Xi , X j } is then computed by taking the
maximum of si and s j . A specific network can then be inferred by deleting all the edges whose score lies below a given
threshold I0 (as in RELNET, CLR, and ARACNE). Thus, the
algorithm infers an edge between Xi and X j either when Xi is
a well-ranked predictor of Xj (si > I0) or when Xj is a well-ranked predictor of Xi (sj > I0).
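The final symmetrization and thresholding step can be sketched as follows, assuming the per-target MRMR scores have already been collected into a (generally non-symmetric) matrix S, with S[i, j] the score of gene j as a predictor of target i; this layout is a hypothetical choice made here for illustration.

```python
import numpy as np

def mrnet_edges(S, I0):
    """Score each pair {Xi, Xj} by max(si, sj) and threshold at I0."""
    W = np.maximum(S, S.T)      # symmetric pair scores
    A = W > I0                  # adjacency of the inferred network
    np.fill_diagonal(A, False)  # no self-loops
    return W, A
```

Varying I0 then traces out the precision-recall curves used in the experiments below.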
An effective implementation of the MRMR best-first search is available in [17]. This implementation has O(fn) complexity for selecting f features using a best-first search strategy. It follows that MRNET has O(fn²) complexity, since the feature selection step is repeated for each of the n genes. In other terms, the complexity ranges between O(n²) and O(n³) according to the value of f. Note that the lower the value of f, the lower the number of incoming edges per node to infer, and consequently the lower the resulting complexity.
Note that since mutual information is a symmetric measure, it is not possible to derive the direction of the edge from
its weight. This limitation is common to all the methods presented so far. However, this information could be provided
by edge orientation algorithms (e.g., IC) commonly used in
Bayesian networks [7].
4. EXPERIMENTS

Our experimental framework consists of four steps: the generation of artificial microarray data, the estimation of the mutual information matrix, the inference of the network, and the validation of the results. This section details each step of the approach.
4.1. Network and data generation
In order to assess the results returned by our algorithm and
compare it to other methods, we created a set of benchmarks
on the basis of artificially generated microarray datasets. In
spite of the evident limitations of using synthetic data, this
makes possible a quantitative assessment of the accuracy,
thanks to the availability of the true network underlying the
microarray dataset (see Figure 1).
We used two different generators of artificial gene expression data: the data generator described in [18] (hereafter referred to as the sRogers generator) and the SynTReN generator [19]. The two generators, whose implementations are freely available on the World Wide Web, are sketched in the following paragraphs.
sRogers generator
The sRogers generator produces the topology of the genetic network according to an approximate power-law distribution on the number of regulatory connections out of each gene. The normal steady state of the system is evaluated by integrating a system of differential equations. The generator offers the possibility to obtain 2k different measures (k wild-type and k knock-out experiments). These measures can be replicated R times, yielding a total of N = 2kR samples. After the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.
SynTReN generator
The SynTReN generator generates a network topology by selecting subnetworks from E. coli and S. cerevisiae source networks. Then, transition functions and their parameters are assigned to the edges in the network. Eventually, mRNA expression levels for the genes in the network are obtained by simulating equations based on Michaelis-Menten and Hill kinetics under different conditions. As for the previous generator, after the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.
Generation
The two generators were used to synthesize thirty datasets.
Table 1 reports for each dataset the number n of genes, the
number N of samples, and the Gaussian noise intensity (expressed as a percentage of the signal variance).
4.2. Mutual information matrix estimation

In order to benchmark MRNET versus RELNET, CLR, and ARACNE, the same MIM is used for the four inference approaches. Several estimators of mutual information have been proposed in the literature.
4.3. Validation

Precision p and recall r are defined as

p = TP / (TP + FP),
(8)

r = TP / (TP + FN),
(9)

where TP, FP, and FN denote the numbers of true positive, false positive, and false negative edges, respectively (see Table 2).
Table 1: Datasets with n the number of genes and N the number of samples.

Dataset  Generator  Topology        n     N     Noise
RN1      sRogers    Power-law tail  700   700   0%
RN2      sRogers    Power-law tail  700   700   5%
RN3      sRogers    Power-law tail  700   700   10%
RN4      sRogers    Power-law tail  700   700   20%
RN5      sRogers    Power-law tail  700   700   30%
RS1      sRogers    Power-law tail  700   100   0%
RS2      sRogers    Power-law tail  700   300   0%
RS3      sRogers    Power-law tail  700   500   0%
RS4      sRogers    Power-law tail  700   800   0%
RS5      sRogers    Power-law tail  700   1000  0%
RV1      sRogers    Power-law tail  100   700   0%
RV2      sRogers    Power-law tail  300   700   0%
RV3      sRogers    Power-law tail  500   700   0%
RV4      sRogers    Power-law tail  700   700   0%
RV5      sRogers    Power-law tail  1000  700   0%
SN1      SynTReN    S. cerevisiae   400   400   0%
SN2      SynTReN    S. cerevisiae   400   400   5%
SN3      SynTReN    S. cerevisiae   400   400   10%
SN4      SynTReN    S. cerevisiae   400   400   20%
SN5      SynTReN    S. cerevisiae   400   400   30%
SS1      SynTReN    S. cerevisiae   400   100   0%
SS2      SynTReN    S. cerevisiae   400   200   0%
SS3      SynTReN    S. cerevisiae   400   300   0%
SS4      SynTReN    S. cerevisiae   400   400   0%
SS5      SynTReN    S. cerevisiae   400   500   0%
SV1      SynTReN    S. cerevisiae   100   400   0%
SV2      SynTReN    S. cerevisiae   200   400   0%
SV3      SynTReN    S. cerevisiae   300   400   0%
SV4      SynTReN    S. cerevisiae   400   400   0%
SV5      SynTReN    S. cerevisiae   500   400   0%
Table 2: Confusion matrix.

                    Actual positive   Actual negative
Inferred positive         TP                FP
Inferred negative         FN                TN

The F-score is the harmonic mean of precision and recall:

F = 2pr / (r + p).
(10)

To compare two inference algorithms A and B, we use McNemar's test, which relies on the fact that if two algorithms A and B have the same error rate, then

( |N_AB − N_BA| − 1 )² / ( N_AB + N_BA )
(11)

approximately follows a χ² distribution with one degree of freedom, where N_AB denotes the number of edges misclassified by A but not by B (and conversely for N_BA).
A thorough comparison would require displaying the PR-curves (Figure 2) for each dataset. For reasons of space, we summarize the PR-curve information by the maximum F-score in Table 3. Note that for each dataset, the accuracy of the best methods (i.e., those whose score is not significantly lower than the highest one according to McNemar's test) is typed in boldface.
We may summarize the results as follows.
Figure 2: PR-curves for the RS3 dataset using the Miller-Madow estimator. The curves are obtained by varying the rejection/acceptance threshold.

(Plot: F-score versus number of samples for RELNET, MRNET, CLR, and ARACNE.)

Figure 3: Influence of the number of variables on accuracy (SynTReN SV datasets, 400 samples, Miller-Madow estimator).
5.1.
Table 3: Maximum F-scores for each inference method using two different mutual information estimators. The best methods (those having a score not significantly weaker than the best score, i.e., P-value < .05) are typed in boldface. Average performances on SynTReN and sRogers datasets are reported, respectively, in the S-AVG and R-AVG lines.

                  Miller-Madow                       Gaussian
Dataset  RELNET  CLR   ARACNE  MRNET     RELNET  CLR   ARACNE  MRNET
SN1      0.22    0.24  0.27    0.27      0.21    0.24  0.3     0.26
SN2      0.23    0.26  0.29    0.29      0.21    0.25  0.31    0.25
SN3      0.23    0.25  0.24    0.26      0.21    0.25  0.31    0.26
SN4      0.22    0.24  0.26    0.26      0.21    0.25  0.28    0.26
SN5      0.21    0.23  0.24    0.24      0.2     0.25  0.27    0.24
SS1      0.21    0.22  0.22    0.23      0.19    0.24  0.24    0.23
SS2      0.21    0.24  0.28    0.29      0.2     0.24  0.27    0.25
SS3      0.21    0.24  0.27    0.28      0.2     0.24  0.28    0.25
SS4      0.22    0.24  0.27    0.27      0.21    0.24  0.3     0.26
SS5      0.22    0.24  0.28    0.29      0.21    0.24  0.3     0.26
SV1      0.32    0.36  0.41    0.39      0.3     0.4   0.44    0.38
SV2      0.25    0.28  0.35    0.33      0.25    0.35  0.36    0.32
SV3      0.21    0.24  0.3     0.28      0.21    0.28  0.3     0.27
SV4      0.22    0.24  0.27    0.27      0.21    0.24  0.3     0.26
SV5      0.24    0.23  0.29    0.29      0.22    0.24  0.31    0.26
S-AVG    0.23    0.25  0.28    0.28      0.21    0.26  0.30    0.27
RN1      0.59    0.65  0.6     0.61      0.89    0.87  0.92    0.93
RN2      0.5     0.57  0.5     0.49      0.89    0.87  0.92    0.92
RN3      0.5     0.55  0.5     0.52      0.89    0.87  0.92    0.92
RN4      0.46    0.51  0.47    0.47      0.89    0.87  0.92    0.91
RN5      0.42    0.46  0.41    0.4       0.88    0.86  0.91    0.91
RS1      0.1     0.11  0.09    0.1       0.19    0.19  0.19    0.18
RS2      0.35    0.32  0.31    0.31      0.45    0.44  0.47    0.46
RS3      0.38    0.32  0.36    0.38      0.58    0.56  0.6     0.6
RS4      0.47    0.54  0.47    0.5       0.75    0.75  0.8     0.79
RS5      0.58    0.68  0.6     0.64      0.9     0.86  0.93    0.93
RV1      0.52    0.38  0.46    0.46      0.72    0.75  0.72    0.72
RV2      0.49    0.53  0.49    0.53      0.71    0.71  0.71    0.71
RV3      0.45    0.5   0.45    0.48      0.69    0.69  0.71    0.71
RV4      0.47    0.51  0.48    0.48      0.69    0.7   0.74    0.72
RV5      0.47    0.52  0.47    0.48      0.7     0.68  0.74    0.73
R-AVG    0.45    0.48  0.44    0.46      0.72    0.71  0.74    0.74
Tot-AVG  0.34    0.36  0.36    0.37      0.47    0.49  0.52    0.51
(Plot: F-score versus noise intensity for the empirical and Gaussian estimators; diagram of variables Xi and Xj.)
This behavior is colloquially referred to as the explaining-away effect in the Bayesian network literature [7]. Selecting variables, such as Xi, that take part in indirect interactions reduces the accuracy of the network inference task. However, since MRMR relies only on pairwise interactions, it does not take into account the gain in information due to conditioning. In our example, the MRMR algorithm, after having selected Xj, computes the score si = I(Xi; Y) − I(Xi; Xj), where I(Xi; Y) = 0 and I(Xi; Xj) > 0. This score is negative, and Xi is therefore likely to be badly ranked. As a result, the MRMR feature selection criterion is less exposed to the drawbacks of most feature selection techniques while sharing their interesting properties. Further experiments will focus on this aspect.
(Plot: F-score versus number of samples for the empirical and Gaussian estimators.)

6. CONCLUSION
A new network inference method, MRNET, has been proposed. This method relies on an effective information-theoretic feature selection method, MRMR. Similarly to other network inference methods, MRNET relies on pairwise interactions between genes, making possible the inference of large networks (up to several thousands of genes).
Another advantage of MRNET, which could be exploited
in future work, is its ability to benefit explicitly from a priori
knowledge.
MRNET was compared experimentally to three state-of-the-art information-theoretic network inference methods, namely RELNET, CLR, and ARACNE, on thirty inference tasks. The microarray datasets were generated artificially with two different generators in order to effectively assess the inference power of the methods. Also, two different mutual information estimation methods were used. The experimental results showed that MRNET is competitive with the benchmarked information-theoretic methods.
Future work will focus on three main axes: (i) the assessment of additional mutual information estimators, (ii) the
validation of the techniques on the basis of real microarray
data, (iii) a theoretical analysis of which conditions should
be met for MRNET to reconstruct the true network.
ACKNOWLEDGMENT
This work was partially supported by the Communauté Française de Belgique under ARC Grant no. 04/09-307.
[17] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, no. 2, pp. 197–213, 2002.
[18] S. Rogers and M. Girolami, "A Bayesian regression approach to the inference of regulatory networks from gene expression data," Bioinformatics, vol. 21, no. 14, pp. 3131–3137, 2005.
[19] T. Van den Bulcke, K. Van Leemput, B. Naudts, et al., "SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms," BMC Bioinformatics, vol. 7, p. 43, 2006.
[20] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[21] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. van der Meulen, "Nonparametric entropy estimation: an overview," International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[22] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proceedings of the 12th International Conference on Machine Learning (ML '95), pp. 194–202, Lake Tahoe, Calif, USA, July 1995.
[23] F. J. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 445–453, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[24] J. Bockhorst and M. Craven, "Markov networks for detecting overlapping elements in sequence data," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds., pp. 193–200, MIT Press, Cambridge, Mass, USA, 2005.
[25] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[26] K. B. Hwang, J. W. Lee, S.-W. Chung, and B.-T. Zhang, "Construction of large-scale Bayesian networks by local to global search," in Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI '02), pp. 375–384, Tokyo, Japan, August 2002.
[27] I. Tsamardinos, C. Aliferis, and A. Statnikov, "Algorithms for large scale Markov blanket discovery," in Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference (FLAIRS '03), pp. 376–381, St. Augustine, Fla, USA, May 2003.
[28] I. Tsamardinos and C. Aliferis, "Towards principled feature selection: relevancy, filters and wrappers," in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AI&Stats '03), Key West, Fla, USA, January 2003.
Research Article
NML Computation Algorithms for Tree-Structured
Multinomial Bayesian Networks
1. INTRODUCTION
this definition involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is
called the regret or parametric complexity, and it can be interpreted as the amount of complexity of the model class. If the
data is continuous, the sum is replaced by the corresponding
integral.
The NML distribution has several theoretical optimality
properties, which make it a very attractive candidate for performing model class selection and related tasks. It was originally [8, 10] formulated as the unique solution to a minimax
problem presented in [9], which implied that NML is the
minimax optimal universal model. Later [11], it was shown
that NML is also the solution to a related problem involving
expected regret. See Section 2 and [10–13] for more discussion on the theoretical properties of the NML.
Typical bioinformatic problems involve large discrete
datasets. In order to apply NML for these tasks one needs to
develop suitable NML computation methods, since the normalizing sum or integral in the definition of NML is typically difficult to compute directly. In this paper, we present algorithms for efficient computation of NML for both one- and
multidimensional discrete data. The model families used in
the paper are so-called Bayesian networks (see, e.g., [14]) of
varying complexity. A Bayesian network is a graphical representation of a joint distribution. The structure of the graph
corresponds to certain conditional independence assumptions. Note that despite the name, having Bayesian network
models does not necessarily imply using Bayesian statistics,
and the information-theoretic approach of this paper cannot
be considered Bayesian.
The problem of computing NML for discrete data has
been studied before. In [15] a linear-time algorithm for
the one-dimensional multinomial case was derived. A more
complex case involving a multidimensional model family,
called naive Bayes, was discussed in [16]. Both these cases
are also reviewed in this paper.
The paper is structured as follows. In Section 2, we discuss the basic properties of the MDL principle and the NML
distribution. In Section 3, we instantiate the NML distribution for the multinomial case and present a linear-time computation algorithm. The topic of Section 4 is the naive Bayes
model family. NML computation for an extension of naive
Bayes, the so-called Bayesian forests, is discussed in Section 5.
Finally, Section 6 gives some concluding remarks.
2.

The model classes considered in this paper are of the form

M(γ) = {P(· | θ) : θ ∈ Θ_γ},
(2)

and together they constitute the model family F = {M(γ) : γ ∈ Γ}.

The NML distribution for a model class M(γ) is defined as

P_NML(x^n | M(γ)) = P(x^n | θ̂(x^n), M(γ)) / C(M(γ), n),
(3)

where θ̂(x^n) denotes the maximum likelihood parameters for the data x^n, and the normalizing term C(M(γ), n) in the case of discrete data is given by

C(M(γ), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n), M(γ)),
(4)

where the sum goes over the space of data samples of size n. If the data is continuous, the sum is replaced by the corresponding integral.

The stochastic complexity of the data x^n, given a model class M(γ), is defined via the NML distribution as

SC(x^n | M(γ)) = −log P_NML(x^n | M(γ)) = −log P(x^n | θ̂(x^n), M(γ)) + log C(M(γ), n),
(5)
and the term log C(M(γ), n) is called the (minimax) regret or parametric complexity. The regret can be interpreted as measuring the logarithm of the number of essentially different (distinguishable) distributions in the model class. Intuitively, if two distributions assign high likelihood to the same data samples, they do not contribute much to the overall complexity of the model class, and the distributions should not be counted as different for the purposes of statistical inference. See [18] for more discussion on this topic.
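For small discrete sample spaces, the normalizing sum (4) and the stochastic complexity (5) can be evaluated by direct enumeration. The following sketch instantiates this for a single K-valued multinomial variable, a concrete case chosen here for illustration (the general definition applies to any model class).

```python
from itertools import product
from math import log

def parametric_complexity(n, K):
    """C(M, n): sum over all K-ary sequences of length n of the
    likelihood maximized over the model parameters, eq. (4)."""
    total = 0.0
    for seq in product(range(K), repeat=n):
        lik = 1.0
        for v in range(K):
            c = seq.count(v)
            if c:
                lik *= (c / n) ** c   # ML plug-in probability of seq
        total += lik
    return total

def stochastic_complexity(x, K):
    """SC(x^n | M) = -log P(x^n | theta_hat) + log C(M, n), eq. (5)."""
    n = len(x)
    lik = 1.0
    for v in range(K):
        c = list(x).count(v)
        if c:
            lik *= (c / n) ** c
    return -log(lik) + log(parametric_complexity(n, K))
```

The enumeration is exponential in n, which is exactly why the efficient algorithms surveyed in the following sections are needed.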
The NML distribution (3) has several important theoretical optimality properties. The first is that NML provides a unique solution to the minimax problem

min_P̂ max_{x^n} log [ P(x^n | θ̂(x^n), M(γ)) / P̂(x^n | M(γ)) ],
(6)

where P̂ ranges over all distributions. The minimizing P̂ makes the regret

log P(x^n | θ̂(x^n), M(γ)) − log P̂(x^n | M(γ))
(7)

constant over all data x^n, which means that the NML distribution represents (or mimics) the behavior of all the distributions in the model class M(γ). Note that the NML distribution itself does not have to belong to the model class, and typically it does not.

A related property of NML involving expected regret was proven in [11]. This property states that NML is also a unique solution to

min_q max_g E_g log [ P(x^n | θ̂(x^n), M(γ)) / q(x^n | M(γ)) ],
(8)

where the expectation is taken with respect to a worst-case data-generating distribution g.
3.

The multinomial model family is

F_MN = {M(K) : K ∈ Γ_MN},
(9)

where the model classes are indexed by the number of values K of a single discrete variable,

M(K) = {P(· | θ) : θ ∈ Θ_K},
(10)

with the parameter space given by the simplex

Θ_K = {(θ1, ..., θK) : θk ≥ 0, θ1 + ··· + θK = 1}.
(11)

Denoting by hk the number of times value k occurs in the data x^n, the NML distribution becomes

P_NML(x^n | M(K)) = [ Π_{k=1}^{K} (hk/n)^{hk} ] / C(M(K), n),
(12)

where

C(M(K), n) = Σ_{y^n ∈ X^n} P(y^n | θ̂(y^n), M(K))
(13)
= Σ_{h1+···+hK = n} [ n! / (h1! ··· hK!) ] Π_{k=1}^{K} (hk/n)^{hk}.
(14)

In [16, 20], a recursion formula for removing the exponentiality of C_MN(K, n) was presented. This formula is given by

C_MN(K, n) = Σ_{r1+r2 = n} [ n! / (r1! r2!) ] (r1/n)^{r1} (r2/n)^{r2} C_MN(K*, r1) C_MN(K − K*, r2),
(15)

where K* = 1, ..., K − 1.

3.3.

Although the previous algorithms have succeeded in removing the exponentiality of the computation of the multinomial NML, they are still superlinear with respect to n. In [15], a linear-time algorithm based on the mathematical technique of generating functions was derived for the problem.

The starting point of the derivation is the generating function B defined by

B(z) = 1 / (1 − T(z)) = Σ_{n≥0} (n^n / n!) z^n,
(16)

where T(z) is the Cayley tree function satisfying T(z) = z e^{T(z)}. It can be shown that

B(z)^K = Σ_{n≥0} [ Σ_{h1+···+hK=n} ( n! / (h1! ··· hK!) ) Π_{k=1}^{K} (hk/n)^{hk} ] (n^n/n!) z^n
(17)
= Σ_{n≥0} (n^n / n!) C_MN(K, n) z^n.
(18)

Differentiating (18) with respect to z and comparing coefficients (see the appendix) yields the recurrence

C_MN(K + 2, n) = C_MN(K + 1, n) + (n/K) C_MN(K, n),
(19)

which allows computing C_MN(K, n) for a fixed n and all K up to a given maximum in linear time.
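The linear-time computation via the recurrence (19), seeded with the closed form of C_MN(2, n) from (14), can be sketched as follows, together with a brute-force check against the composition sum (14):

```python
from math import comb, factorial

def c_mn_brute(K, n):
    """C_MN(K, n) by direct summation over compositions, eq. (14)."""
    def compositions(total, parts):
        if parts == 1:
            yield (total,)
            return
        for h in range(total + 1):
            for rest in compositions(total - h, parts - 1):
                yield (h,) + rest
    result = 0.0
    for hs in compositions(n, K):
        term = float(factorial(n))
        for h in hs:
            term /= factorial(h)
            if h:
                term *= (h / n) ** h
        result += term
    return result

def c_mn(K, n):
    """C_MN(K, n) via the linear recurrence (19)."""
    if K == 1:
        return 1.0
    C = {1: 1.0,
         2: sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                for r in range(n + 1))}
    for k in range(1, K - 1):
        C[k + 2] = C[k + 1] + (n / k) * C[k]   # recurrence (19)
    return C[K]
```

The recurrence needs only the two previous values of K, so memory use is constant and the running time is linear in n plus K.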
4.1.
Let us assume that our problem domain consists of m primary variables X1 , . . . , Xm and a special variable X0 , which
can be one of the variables in our original problem domain or it can be latent. Assume that the variable Xi has
Ki values and that the extra variable X0 has K0 values. The
data xn = (x1 , . . . , xn ) consist of observations of the form
x_j = (x_j0, x_j1, ..., x_jm) ∈ X, where
X = {1, ..., K0} × {1, ..., K1} × ··· × {1, ..., Km}.
(21)

The corresponding model family is

F_NB = {M(γ) : γ ∈ Γ_NB}.
(22)

We also note that the multinomial normalizing sum admits an accurate asymptotic approximation:

log C_MN(K, n) = ((K − 1)/2) log(n/2) + log( √π / Γ(K/2) )
  + ( √2 K Γ(K/2) / (3 Γ(K/2 − 1/2)) ) (1/√n)
  + ( (3 + K(K − 2)(2K + 1))/36 − Γ(K/2)² K² / (9 Γ(K/2 − 1/2)²) ) (1/n)
  + O(1/n^{3/2}).
(20)

Since the error term of (20) goes down with the rate O(1/n^{3/2}), the approximation converges very rapidly. In [20], the accuracy of (20) and two other approximations (Rissanen's asymptotic expansion [8] and the Bayesian information criterion (BIC) [27]) were tested empirically. The results show that (20) is significantly better than the other approximations and accurate already with very small sample sizes. See [20] for more details.
The naive Bayes model assumes that

P_NB(X0 = x0, X1 = x1, ..., Xm = xm | θ) = P(X0 = x0 | θ) Π_{i=1}^{m} P(Xi = xi | X0 = x0, θ).
(24)

The parameters satisfy the simplex conditions

α1 + ··· + αK0 = 1,  θik1 + ··· + θikKi = 1,  i = 1, ..., m, k = 1, ..., K0,
(25)

and are defined by αk = P(X0 = k), θikl = P(Xi = l | X0 = k).

Assuming i.i.d. data, the NML distribution for the naive Bayes model can now be written as (see [16])
P_NML(x^n | M(K0, K1, ..., Km)) = [ Π_{k=1}^{K0} (hk/n)^{hk} Π_{i=1}^{m} Π_{l=1}^{Ki} (fikl/hk)^{fikl} ] / C(M(K0, K1, ..., Km), n),
(26)

where hk is the number of times X0 takes value k in x^n and fikl is the number of times Xi takes value l when X0 = k. The normalizing term is given by

C(M(K0, K1, ..., Km), n) = Σ_{h1+···+hK0=n} [ n! / (h1! ··· hK0!) ] Π_{k=1}^{K0} [ (hk/n)^{hk} Π_{i=1}^{m} C_MN(Ki, hk) ].
(27)
To simplify notation, from now on we write C(M(K0, K1, ..., Km), n) in the abbreviated form C_NB(K0, n).
The following recursion then holds:

C_NB(K0, n) = Σ_{r1+r2=n} [ n! / (r1! r2!) ] (r1/n)^{r1} (r2/n)^{r2} C_NB(K*, r1) C_NB(K0 − K*, r2),
(28)

where K* = 1, ..., K0 − 1.

Proof. See the appendix.
In many practical applications of the naive Bayes, the quantity K0 is unknown. Its value is typically determined as a part of the model class selection process. Consequently, it is necessary to compute NML for model classes M(K0, K1, ..., Km), where K0 has a range of values, say, K0 = 1, ..., K_max. The process of computing NML for this case is described in Algorithm 2. The time complexity of the algorithm is O(n² K_max). If the value of K0 is fixed, the time complexity drops to O(n² log K0). See [16] for more details.
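Recursion (28), with the base case C_NB(1, n) = Π_i C_MN(Ki, n) that follows from (27), can be sketched as below. The splitting point K* = ⌊K0/2⌋ and the memoization layout are implementation choices made here for illustration, not prescribed by the text.

```python
from math import comb
from functools import lru_cache

def c_mn(K, n):
    """Multinomial normalizing sum via the linear recurrence (19)."""
    if n == 0 or K == 1:
        return 1.0
    C = {1: 1.0,
         2: sum(comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
                for r in range(n + 1))}
    for k in range(1, K - 1):
        C[k + 2] = C[k + 1] + (n / k) * C[k]
    return C[K]

def c_nb(K0, n, Ks):
    """Naive Bayes normalizing sum C_NB(K0, n) via recursion (28).

    Ks: list of value counts K_1, ..., K_m of the primary variables."""
    Ks = tuple(Ks)

    @lru_cache(maxsize=None)
    def rec(k0, size):
        if k0 == 1:
            out = 1.0
            for Ki in Ks:              # C_NB(1, size) = prod_i C_MN(K_i, size)
                out *= c_mn(Ki, size)
            return out
        split = k0 // 2                # K* in eq. (28)
        total = 0.0
        for r1 in range(size + 1):
            r2 = size - r1
            w = comb(size, r1)
            if size:
                w *= (r1 / size) ** r1 * (r2 / size) ** r2
            total += w * rec(split, r1) * rec(k0 - split, r2)
        return total

    return rec(K0, n)
```

With no primary variables the naive Bayes sum collapses to the plain multinomial sum, which gives a convenient sanity check.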
5.

The Bayesian forest model family is

F_BF = {M(γ) : γ ∈ Γ_BF},
(29)

with Γ_BF = {1, ..., |G|} × {1, 2, 3, ...}^m, where each forest G is associated with an integer according to some enumeration of all Bayesian forests on (X1, ..., Xm). As the Ki are assumed fixed, we can abbreviate the corresponding model classes by M(G) := M(G, K1, ..., Km).
Given a forest model class M(G), we index each model by a parameter vector θ in the corresponding parameter space Θ_G:

Θ_G = { θ = (θikl) : θikl ≥ 0, Σ_l θikl = 1, i = 1, ..., m, k = 1, ..., K_pa(i), l = 1, ..., Ki },
(30)

where we define K_∅ := 1 in order to unify notation for root and non-root nodes. Each such θikl defines a probability

θikl = P(Xi = l | X_pa(i) = k),
(31)

and the joint probability of a data vector x factorizes as

P(x | M(G), θ) = Π_{i=1}^{m} θ_{i, x_pa(i), x_i}.
(32)
Let fikl denote the number of data vectors in which Xi = l and X_pa(i) = k, and define the marginal frequencies

fil := Σ_{k=1}^{K_pa(i)} fikl.
(33)

The likelihood of the data x^n then becomes

P(x^n | M(G), θ) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{Ki} θikl^{fikl},
(34)

which is maximized at

θ̂ikl(x^n, M(G)) = fikl / f_pa(i),k,
(35)

where f_pa(i),k is the number of data vectors with X_pa(i) = k. The maximized likelihood is thus

P(x^n | θ̂(x^n), M(G)) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{Ki} ( fikl / f_pa(i),k )^{fikl},
(36)

and the normalizing term of the NML distribution is

C(M(G), n) = Σ_{x^n ∈ X^n} P(x^n | θ̂(x^n), M(G)).
(37)

For each node Xi with frequency vector fi, define

Ci(M(G), n | fi) := Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P(x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)))
(38)
to be the corresponding sum with fixed root instantiation, summing only over the attribute space spanned by the descendants of Xi.
Note that we use fi on the left-hand side and x^n_i on the right-hand side of the definition. This needs to be justified. Interestingly, while the terms in the sum depend on the ordering of x^n_i, the sum itself depends on x^n_i only through its frequencies fi. To see this, pick any two representatives x^n_i and x̃^n_i of fi and find, for example, after lexicographical ordering of the elements, that
{ (x^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) } = { (x̃^n_i, x^n_dsc(i)) : x^n_dsc(i) ∈ X^n_dsc(i) }.
(39)
Next, we need to define corresponding sums over X_sub(i) with the frequencies at the subtree root parent X_pa(i) given. For any f_pa(i) corresponding to some x^n_pa(i) ∈ X^n_pa(i), define
Li(M(G), n | f_pa(i)) := Σ_{x^n_sub(i) ∈ X^n_sub(i)} P(x^n_sub(i) | x^n_pa(i), θ̂(x^n_sub(i), x^n_pa(i)), M(G_sub(i))).
(40)

Again, this is well defined since any other representative x̃^n_pa(i) of f_pa(i) yields summing the same terms modulo their ordering.
After having introduced this notation, we now briefly outline the algorithm; the following subsections give a more detailed description of the steps involved. As stated before, we go through G bottom-up. At each inner node Xi, we receive Lj(M(G), n | fi) from each child Xj, j ∈ ch(i). Correspondingly, we are required to send Li(M(G), n | f_pa(i)) up to the parent X_pa(i). At each component tree root Xi, we then calculate the sum Ci(M(G), n) for the whole connectivity component, and finally we combine these sums to get the normalizer C(M(G), n) for the complete forest G.
5.2.1. Leaves

For a leaf node Xi, the sum (40) factorizes over the values of X_pa(i), and we obtain

Li(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(Ki, f_pa(i),k).
(41)

The terms C_MN(Ki, n') (for n' = 0, ..., n) can be precalculated using recurrence (19) as in Algorithm 1.
5.2.2. Inner nodes

At an inner node Xi, the incoming messages are combined as follows:

Ci(M(G), n | fi)
= Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P(x^n_dsc(i), x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)))
(42)
= Σ_{x^n_dsc(i) ∈ X^n_dsc(i)} P(x^n_i | θ̂(x^n_dsc(i), x^n_i), M(G_sub(i))) Π_{j∈ch(i)} P(x^n_dsc(i)|sub(j) | x^n_i, θ̂(x^n_dsc(i), x^n_i), M(G_sub(i)))
(43)
= Π_{l=1}^{Ki} (fil/n)^{fil} Π_{j∈ch(i)} Σ_{x^n_sub(j) ∈ X^n_sub(j)} P(x^n_sub(j) | x^n_i, θ̂(x^n_sub(j), x^n_i), M(G_sub(j)))
(44)
= Π_{l=1}^{Ki} (fil/n)^{fil} Π_{j∈ch(i)} Lj(M(G), n | fi),
(45)

where x^n_dsc(i)|sub(j) is the restriction of x^n_dsc(i) to columns corresponding to nodes in Gj. We have used (38) for (42), (32) for (43) and (44), and finally (36) and (40) for (45).

Now we need to calculate the outgoing messages Li(M(G), n | f_pa(i)) from the incoming messages we have just combined into Ci(M(G), n | fi). This is the most demanding part of the algorithm, for we need to list all possible conditional frequencies, of which there are O(n^{Ki K_pa(i) − 1}) many, the −1 being due to the sum-to-n constraint. For fixed i, we arrange the conditional frequencies fikl into a matrix F = (fikl) and define its marginals

σ(F) := ( Σ_k fik1, ..., Σ_k fikKi ),  ρ(F) := ( Σ_l fi1l, ..., Σ_l fiK_pa(i)l ),
(46)

so that σ(F) collects the frequencies of the values of Xi and ρ(F) those of X_pa(i). Then

Li(M(G), n | f_pa(i)) = Σ_{F : ρ(F) = f_pa(i)} Ci(M(G), n | σ(F)).
(47)
5.2.3. Component roots

At the root Xi of a tree component, we obtain

Ci(M_G, n) = Σ_{fi} [ n! / (fi1! ··· fiKi!) ] Ci(M_G, n | fi),
(48)

where the Ci(M_G, n | fi) are calculated from (45). The summation goes over all nonnegative integer vectors fi summing to n. The above is trivially true since we sum over all instantiations x^n_i of Xi and group like terms, corresponding to the same frequency vector fi, while keeping track of their respective count, namely n!/(fi1! ··· fiKi!).
5.2.4. The algorithm

For the complete forest G we simply multiply the sums over its tree components, since these are independent of each other; see (49) below. The overall procedure is summarized in Algorithm 3 (the initial steps tabulate the required multinomial terms):

 4:   Compute C_MN(k, n') as in Algorithm 1
 5: end for
 6: for each node Xi in some bottom-up order do
 7:   if Xi is a leaf then
 8:     for each frequency vector f_pa(i) of X_pa(i) do
 9:       Compute Li(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(Ki, f_pa(i),k)
10:     end for
11:   else if Xi is an inner node then
12:     for each frequency vector fi of Xi do
13:       Compute Ci(M(G), n | fi) = Π_{l=1}^{Ki} (fil/n)^{fil} Π_{j∈ch(i)} Lj(M(G), n | fi)
14:     end for
15:     initialize Li ≡ 0
16:     for each non-negative Ki × K_pa(i) integer matrix F with entries summing to n do
17:       Li(M(G), n | ρ(F)) += Ci(M(G), n | σ(F))
18:     end for
19:   else if Xi is a component tree root then
20:     Compute Ci(M(G), n) = Σ_{fi} [n!/(fi1! ··· fiKi!)] Π_{l=1}^{Ki} (fil/n)^{fil} Π_{j∈ch(i)} Lj(M(G), n | fi)
21:   end if
22: end for
23: Compute C(M(G), n) = Π_{i∈ch(∅)} Ci(M(G), n)
24: Output P_NML(x^n | M(G)) = P(x^n | θ̂(x^n), M(G)) / C(M(G), n)

Algorithm 3: The algorithm for computing P_NML(x^n | M(G)) for a Bayesian forest G.
Since the tree components of G are independent, the complete normalizer factorizes as

C(M_G, n) = Π_{i∈ch(∅)} Ci(M_G, n).
(49)
6.
CONCLUSION
The normalized maximum likelihood (NML) offers a universal, minimax optimal approach to statistical modeling. In this paper, we have surveyed efficient algorithms for computing the NML in the case of discrete datasets. The model
families used in our work are Bayesian networks of varying
complexity. The simplest model we discussed is the multinomial model family, which can be applied to problems related
to density estimation or discretization. In this case, the NML
can be computed in linear time. The same result also applies
to a network of independent multinomial variables, that is, a
Bayesian network with no arcs.
For the naive Bayes model family, the NML can be computed in quadratic time. Models of this type have been
used extensively in clustering or classification domains with
good results. Finally, to be able to represent more complex dependencies between the problem domain variables,
we also considered tree-structured Bayesian networks. We
showed how to compute the NML in this case in polynomial time with respect to the sample size, but the order of
the polynomial depends on the number of values of the domain variables, which makes our result impractical for some
domains.
The methods presented are especially suitable for problems in bioinformatics, which typically involve multidimensional discrete datasets. Furthermore, unlike the
Bayesian methods, information-theoretic approaches such
as ours do not require a prior for the model parameters.
This is the most important aspect, as constructing a reasonable parameter prior is a notoriously difficult problem, particularly in bioinformatical domains involving novel types of data with little background knowledge. All in all, information theory has been found to offer a natural and
successful theoretical framework for biological applications
in general, which makes NML an appealing choice for
bioinformatics.
In the future, our plan is to extend the current work
to more complex cases such as general Bayesian networks,
which would allow the use of NML in even more involved modeling tasks. Another natural area of future work
is to apply the methods of this paper to practical tasks
involving large discrete databases and compare the results to other approaches, such as those based on Bayesian
statistics.
APPENDIX

PROOFS OF THEOREMS

We first derive recurrence (19). From T(z) = z e^{T(z)} we obtain, by differentiation,

zT'(z)(1 − T(z)) = z e^{T(z)} = T(z),

that is, zT'(z) = T(z)/(1 − T(z)).

Differentiating the expansion (18) of 1/(1 − T(z))^K term by term and multiplying by z gives

z (d/dz) Σ_{n≥0} (n^n/n!) C_MN(K, n) z^n = Σ_{n≥1} n (n^n/n!) C_MN(K, n) z^n.
(A.4)

On the other hand,

z (d/dz) [ 1/(1 − T(z))^K ] = z K T'(z) / (1 − T(z))^{K+1}
(A.5)
= K T(z) / (1 − T(z))^{K+2}
(A.6)
= K [ 1/(1 − T(z))^{K+2} − 1/(1 − T(z))^{K+1} ]
(A.7)
= K Σ_{n≥0} (n^n/n!) C_MN(K+2, n) z^n − K Σ_{n≥0} (n^n/n!) C_MN(K+1, n) z^n,
(A.8)

where (A.6) follows from Lemma 3. Comparing the coefficients of z^n in (A.4) and (A.8), we get n C_MN(K, n) = K ( C_MN(K+2, n) − C_MN(K+1, n) ), which is recurrence (19).

For the proof of the theorem, split the values of X0 into a first group of K* values and a second group of K0 − K* values, and write r1 = h1 + ··· + hK* and r2 = hK*+1 + ··· + hK0. Starting from (27),

C_NB(K0, n)
= Σ_{h1+···+hK0=n} [ n! / (h1! ··· hK0!) ] Π_{k=1}^{K0} [ (hk/n)^{hk} Π_{i=1}^{m} C_MN(Ki, hk) ]
(A.9)
= Σ_{r1+r2=n} [ n! / (r1! r2!) ] (r1/n)^{r1} (r2/n)^{r2}
  × Σ_{h1+···+hK*=r1} [ r1! / (h1! ··· hK*!) ] Π_{k=1}^{K*} [ (hk/r1)^{hk} Π_{i=1}^{m} C_MN(Ki, hk) ]
  × Σ_{hK*+1+···+hK0=r2} [ r2! / (hK*+1! ··· hK0!) ] Π_{k=K*+1}^{K0} [ (hk/r2)^{hk} Π_{i=1}^{m} C_MN(Ki, hk) ]
= Σ_{r1+r2=n} [ n! / (r1! r2!) ] (r1/n)^{r1} (r2/n)^{r2} C_NB(K*, r1) C_NB(K0 − K*, r2),
(A.10)

where we used Π_{k=1}^{K0} (hk/n)^{hk} = (r1/n)^{r1} (r2/n)^{r2} Π_{k≤K*} (hk/r1)^{hk} Π_{k>K*} (hk/r2)^{hk} together with n!/(h1! ··· hK0!) = [n!/(r1! r2!)] [r1!/(h1! ··· hK*!)] [r2!/(hK*+1! ··· hK0!)], and the proof follows.
ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
and Jorma Rissanen for useful comments. This work was
supported in part by the Academy of Finland under the
project Civi and by the Finnish Funding Agency for Technology and Innovation under the projects Kukot and PMMA. In
addition, this work was supported in part by the IST Programme of the European Community, under the PASCAL
Network of Excellence, IST-2002-506778. This publication
only reflects the authors' views.
REFERENCES
[1] G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Transactions on Information Systems, vol. 23, no. 1, pp. 3–34, 2005.
[2] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and B. Brown, "Clustering methods for the analysis of DNA microarray data," Tech. Rep., Department of Health Research and Policy, Stanford University, Stanford, Calif, USA, 1999.
[3] W. Pan, J. Lin, and C. T. Le, "Model-based cluster analysis of microarray gene-expression data," Genome Biology, vol. 3, no. 2, pp. 1–8, 2002.
[4] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.
[5] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, "Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks," in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[6] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[7] J. Rissanen, "Stochastic complexity," Journal of the Royal Statistical Society, Series B, vol. 49, no. 3, pp. 223–239, 1987, with discussions, 223–265.
[8] J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[9] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[10] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
[11] J. Rissanen, "Strong optimality of the normalized ML models as universal codes and information in data," IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1712–1717, 2001.
[12] P. Grünwald, The Minimum Description Length Principle, The MIT Press, Cambridge, Mass, USA, 2007.
[13] J. Rissanen, Information and Complexity in Statistical Modeling, Springer, New York, NY, USA, 2007.
[14] D. Heckerman, "A tutorial on learning with Bayesian networks," Tech. Rep. MSR-TR-95-06, Microsoft Research, Advanced Technology Division, Redmond, Wash, USA, 1996.
[15] P. Kontkanen and P. Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, vol. 103, no. 6, pp. 227–233, 2007.