
EURASIP Journal on Bioinformatics and Systems Biology

Information Theoretic Methods for Bioinformatics

Guest Editors: Jorma Rissanen, Peter Grünwald, Jukka Heikkonen,
Petri Myllymäki, Teemu Roos, and Juho Rousu


Copyright 2007 Hindawi Publishing Corporation. All rights reserved.


This is a special issue published in volume 2007 of EURASIP Journal on Bioinformatics and Systems Biology. All articles are open
access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief
I. Tabus, Tampere University of Technology, Finland

Associate Editors
Jaakko Astola, Finland
Junior Barrera, Brazil
Michael L. Bittner, USA
Michael R. Brent, USA
Yidong Chen, USA
Paul Dan Cristea, Romania
Aniruddha Datta, USA
Bart De Moor, Belgium
Edward R. Dougherty, USA

J. Garcia-Frias, USA
Debashis Ghosh, USA
John Goutsias, USA
Roderic Guigo, Spain
Yufei Huang, USA
Seungchan Kim, USA
John Quackenbush, USA
Jorma Rissanen, Finland
Stephane Robin, France

Paola Sebastiani, USA


Erchin Serpedin, USA
Ilya Shmulevich, USA
A. H. Tewfik, USA
Sabine Van Huffel, Belgium
Z. Jane Wang, Canada
Yue Wang, USA

Contents
Information Theoretic Methods for Bioinformatics, Jorma Rissanen, Peter Grünwald,
Jukka Heikkonen, Petri Myllymäki, Teemu Roos, and Juho Rousu
Volume 2007, Article ID 79128, 2 pages
Compressing Proteomes: The Relevance of Medium Range Correlations, Dario Benedetto,
Emanuele Caglioti, and Claudia Chica
Volume 2007, Article ID 60723, 8 pages
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence
Classification, Chris Hemmerich and Sun Kim
Volume 2007, Article ID 87356, 9 pages
Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates,
Hasan Metin Aktulga, Ioannis Kontoyiannis, L. Alex Lyznik, Lukasz Szpankowski, Ananth Y. Grama,
and Wojciech Szpankowski
Volume 2007, Article ID 14741, 11 pages
Motif Discovery in Tissue-Specific Regulatory Sequences Using Directed Information, Arvind Rao,
Alfred O. Hero III, David J. States, and James Douglas Engel
Volume 2007, Article ID 13853, 13 pages
Splitting the BLOSUM Score into Numbers of Biological Significance, Francesco Fabris,
Andrea Sgarro, and Alessandro Tossi
Volume 2007, Article ID 31450, 18 pages
Aligning Sequences by Minimum Description Length, John S. Conery
Volume 2007, Article ID 72936, 14 pages
MicroRNA Target Detection and Analysis for Genes Related to Breast Cancer Using MDLcompress,
Scott C. Evans, Antonis Kourtidis, T. Stephen Markham, Jonathan Miller, Douglas S. Conklin,
and Andrew S. Torres
Volume 2007, Article ID 43670, 16 pages
Variation in the Correlation of G + C Composition with Synonymous Codon Usage Bias among
Bacteria, Haruo Suzuki, Rintaro Saito, and Masaru Tomita
Volume 2007, Article ID 61374, 7 pages
Information-Theoretic Inference of Large Transcriptional Regulatory Networks, Patrick E. Meyer,
Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi
Volume 2007, Article ID 79879, 9 pages
NML Computation Algorithms for Tree-Structured Multinomial Bayesian Networks,
Petri Kontkanen, Hannes Wettig, and Petri Myllymäki
Volume 2007, Article ID 90947, 11 pages

Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 79128, 2 pages
doi:10.1155/2007/79128

Editorial
Information Theoretic Methods for Bioinformatics
Jorma Rissanen,¹,² Peter Grünwald,³ Jukka Heikkonen,⁴ Petri Myllymäki,²,⁵ Teemu Roos,²,⁵ and Juho Rousu⁵

¹ Computer Learning Research Center, University of London, Royal Holloway TW20 0EX, UK
² Helsinki Institute for Information Technology, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland
³ Centrum voor Wiskunde en Informatica (CWI), P.O. Box 94079, 1090 GB Amsterdam, The Netherlands
⁴ Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9203, 02015 HUT, Finland
⁵ Department of Computer Science, University of Helsinki, P.O. Box 68, 00014 Helsinki, Finland

Received 24 December 2007; Accepted 24 December 2007


Copyright 2007 Jorma Rissanen et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

The ever-ongoing growth in the amount of biological data,
the development of genome-wide measurement technologies, and the gradual, inevitable shift in molecular biology
from the study of individual genes to the systems view: all
these factors contribute to the need to study biological systems by statistical and computational means. In this task, we
are facing a dual challenge: on the one hand, biological systems and hence their models are inherently complex; on
the other hand, the measurement data, while genome-wide, are typically scarce in terms of sample sizes (the large
p, small n problem) and noisy.
This means that the traditional statistical approach,
where the model is viewed as a distorted image of something
called a true distribution which the statistician is trying to
estimate, is poorly justified. This lack of rationality is particularly striking when one tries to learn the structure of the data
by testing for the truth of hypotheses in a collection where
none of them is true. Similarly, the Bayesian approaches that
require prior knowledge, which is either nonexistent or vague
and difficult to express in terms of a distribution for the parameters, are subject to modeling assumptions which may
bias the results in an unintended manner.
It was the editors' intent and hope to encourage applications of techniques for model fitting influenced by information theory, originally created for communication theory but
more recently expanded to cover algorithmic information
theory and applicable to statistical modeling. In this view,
the objective in modeling is to learn structures and properties in data by simply fitting models without requiring any of
them to be true. The performance is not measured by any
distance to the nonexistent truth but in terms of the probability the models assign to the data, which is equivalent to the code
length with which the data can be encoded, taking advantage
of the regular features the model prescribes to the data. This
task requires information and coding theoretic means. Similarly, frequently used distance measures like the Kullback-Leibler divergence and the mutual information express mean
codelength differences.
D. Benedetto et al. study correlations and compressibility of proteome sequences. They identify dependencies at the
range of 10 to 100 amino acids. The source of such dependencies is not entirely clear. One contributing factor in the
case of interprotein dependencies is likely to be sequence duplication. The dependencies can be exploited in compression
of proteome sequences. Furthermore, they seem to have a
role in evolutionary and structural analysis of proteomes.
C. M. Hemmerich and S. Kim also use information theory for studying the correlations in protein sequences. They
base their method on computing the mutual information of
nonadjacent residues lying at a fixed distance d apart, where
the distance is varied from zero to a fixed upper bound. The
mutual information vector formed by these statistics is used
to train a nearest-neighbor classifier to predict membership
in protein families with results indicating that the correlations between nonadjacent residues are predictive of protein
family.
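As a rough illustration of the statistic described (a sketch of ours, not the authors' implementation), the empirical mutual information between residues a fixed distance apart, and the resulting feature vector over distances 1..D, could be computed as:

```python
from collections import Counter
from math import log2

def mi_at_distance(seq: str, d: int) -> float:
    """Empirical mutual information (bits) between residues d positions apart."""
    pairs = [(seq[i], seq[i + d]) for i in range(len(seq) - d)]
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) p(b)) )
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mi_vector(seq: str, max_d: int) -> list:
    """Feature vector of MI values for distances 1..max_d, e.g. to feed a
    nearest-neighbor classifier over protein families."""
    return [mi_at_distance(seq, d) for d in range(1, max_d + 1)]
```

A perfectly periodic sequence yields high MI at the matching distance, while a constant sequence yields zero, which is the qualitative signal the classifier exploits.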
H. M. Aktulga et al. detect statistically dependent genomic sequences. Their paper addresses two applications.
First, they identify different parts of a gene (maize zmSRp32)
that are mutually dependent, without appealing to the usual
assumption that dependencies are revealed by a considerable
amount of exact matches. It is discovered that dependencies
exist between the 5′ untranslated region and its alternatively
spliced exons. As a second application, they discover short

tandem repeats, which are useful in, for instance, genetic profiling. In both cases, the techniques used are based on mutual
information.
The objective in the paper by A. Rao et al. is to discover long-range regulatory elements (LREs) that determine
tissue-specific gene expression. Their methodology is based
on the concept of directed information, a variant of mutual
information introduced originally in the 1970s. It is shown
that directed information can be successfully used for selecting motifs that discriminate between tissue-specific and nonspecific LREs. In particular, the performance of directed information is better than that of mutual information.
F. Fabris et al. present an in-depth study of the BLOSUM
block substitution matrix scores. They propose a decomposition of the BLOSUM score into three components: the mutual information of two compared sequences, the divergence
of observed amino acid co-occurrence frequencies from the
probabilities in the substitution matrix, and the background
frequency divergence measuring the stochastic distance of
the observed amino acid frequencies from the marginals in
the substitution matrix. The authors show how the result
of the decomposition, called BLOSpectrum, can be used to
analyze questions about the correctness of the chosen BLOSUM matrix, the degree of typicality of compared sequences
or their alignment, and the presence of weak or concealed
correlations in alignments with low BLOSUM scores.
The paper by J. Conery presents a new framework for
biological sequence alignment that is based on describing
pairs of sequences by simple regular expressions. These regular expressions are given in terms of right-linear grammars,
and the best grammar is found by use of the MDL principle. Essentially, when two sequences contain similar substrings, this similarity can be exploited to describe the sequences with fewer bits. The precise codelengths are determined with a substitution matrix that provides conditional
probabilities for the event that a particular symbol is replaced by another particular symbol. One advantage of such
a grammar-based approach is that gaps are not needed to
align sequences of varying length. The author experimentally
compares the alignments found by his method with those
found by CLUSTALW. In a second experiment, he measures
the accuracy of his method on pairwise alignments taken
from the BAliBASE benchmark.
S. C. Evans et al. explore miRNA sequences based on
MDLcompress, an MDL-based grammar inference algorithm that is an extension of the optimal symbol compression ratio (OSCR) algorithm published earlier. Using MDLcompress, they analyze the relationship between miRNAs,
single nucleotide polymorphisms (SNPs) and breast cancer. Their results suggest that MDLcompress outperforms
other grammar-based coding methods, such as DNA sequitur, while retaining a two-part code that highlights biologically significant phrases. The ability to quantify cost in
bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological
activity.
The partially redundant third position of codons
(protein-coding nucleotide triplets) tends to have a strongly
biased distribution. The amount of bias is known to be



correlated with G+C (guanine-cytosine) composition in the
genome. In their paper, H. Suzuki et al. quantify the correlation of G+C composition with synonymous codon usage
bias, where the bias is measured by the entropy of the third
codon position. They show that the correlation depends on
various genomic features and varies among different species.
This raises several interesting questions about the different
evolutionary forces causing the codon usage bias.
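As an illustration of the bias measure described (a sketch of ours; the paper's exact estimator may differ), the entropy of the third codon position of a coding sequence can be computed directly from nucleotide counts:

```python
from collections import Counter
from math import log2

def third_position_entropy(cds: str) -> float:
    """Shannon entropy (bits) of the nucleotide distribution at the third
    codon position of a protein-coding sequence whose length is a multiple
    of 3. Low entropy indicates strong codon-usage bias; the maximum is
    2 bits (uniform use of A, C, G, T)."""
    thirds = [cds[i] for i in range(2, len(cds), 3)]
    n = len(thirds)
    return -sum((c / n) * log2(c / n) for c in Counter(thirds).values())
```

For example, a sequence using all four nucleotides equally often in the third position gives the maximal 2 bits, while one always ending codons in A gives 0 bits.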
The paper by P. E. Meyer et al. tackles the challenging
problem of inferring large gene regulatory networks using information theory. Their MRNET method extends the maximum relevance/minimum redundancy (MRMR) feature selection technique to networks by formulating the network inference problem as a series of input/output supervised gene
selection procedures. Empirical results are competitive with
the state-of-the-art methods.
P. Kontkanen et al. study the problem of computing the
normalized maximum likelihood (NML) universal model for
Bayesian networks, which are important tools for modeling
discrete data in biological applications. The most advanced
MDL method for model selection between such networks is
based on comparing the NML distributions for each network
under consideration, but the naive computation of these distributions requires exponential time with respect to the given
data sample size. Utilizing certain computational tricks, and
building on earlier work with multinomial and Naive Bayes
models, the authors show how the computation can be performed efficiently for tree-structured Bayesian networks.
ACKNOWLEDGMENTS
We thank the Editor-in-Chief for the opportunity to prepare
this special issue, and the staff of Hindawi for their assistance.
The greatest credit is of course to the authors, who submitted contributions of the highest quality. We also thank the
reviewers who have had a crucial role in the selection and
editing of the ten papers appearing in the special issue.
Jorma Rissanen
Peter Grünwald
Jukka Heikkonen
Petri Myllymäki
Teemu Roos
Juho Rousu

Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 60723, 8 pages
doi:10.1155/2007/60723

Research Article
Compressing Proteomes: The Relevance of
Medium Range Correlations
Dario Benedetto,¹ Emanuele Caglioti,¹ and Claudia Chica²

¹ Dipartimento di Matematica, Università di Roma "La Sapienza", Piazzale Aldo Moro 5, 00185 Roma, Italy
² Structural and Computational Biology Unit, EMBL Heidelberg, Meyerhofstraße 1, 69117 Heidelberg, Germany

Received 14 January 2007; Revised 28 May 2007; Accepted 10 September 2007


Recommended by Teemu Roos
We study the nonrandomness of proteome sequences by analysing the correlations that arise between amino acids at short and
medium range, more specifically, between amino acids located 10 or 100 residues apart, respectively. We show that statistical models that consider these two types of correlation are more likely to capture the information contained in protein sequences and thus
achieve good compression rates. Finally, we propose that the cause for this redundancy is related to the evolutionary origin of
proteomes and protein sequences.
Copyright 2007 Dario Benedetto et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

Protein sequences have been considered for a long time as


nearly random or highly complex sequences, from the informational content point of view. The main reason for this is
the local complexity of amino acid composition, that is, the
type and number of amino acids found in a sequence segment, especially inside the globular domains [1]. This complexity could be related to the so-called randomness of coding sequences in DNA, already pointed out in a pioneering
work [2] and explained by evolutionary models [3]. Studies
on protein sequence compression show that proteins behave
as sequences of independent characters and have a very low
compressibility, around 1% [4]. The ordered set of protein
sequences belonging to one organism, the proteome, was also
considered to be incompressible due to this weak Markov
dependency [5]. Improvements were obtained in [6, 7]. However, later studies [8–10] suggest that proteomes contain different sources of regularities and can be compressed at rates
around 30%. For a relevant discussion on the validity of these
results see Cao et al. [7].
In this work, we focus on the statistical study of proteome
sequences, using the concept of entropy brought into information theory by Shannon [11]. The Shannon entropy is related to the amount of information of a sequence emitted by
a certain source. The entropy h of a sequence is the limit of
the average amount of information per character, when the
length of the sequence tends to infinity. In particular, for a
finite sequence of length L, the informational content in bits
is approximately Lh, and so Lh is the minimum length in bits
of any sequence that contains the same information. In this
way Lh provides a theoretical lower bound for the compression of the sequence. A compression algorithm is intended to code a
sequence into a shorter one, from which the original can be recovered unequivocally. In practice, one cannot compress at a rate equal to the Shannon entropy of the given
sequence. Nonetheless, it is possible to approach such a
limit using an efficient compression algorithm.
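As a concrete illustration of the Lh bound (a sketch of ours, not taken from the paper), the order-0 empirical entropy can be estimated from symbol frequencies; L times this value is the code length a memoryless model could achieve, and any model that exploits correlations can only do better:

```python
from collections import Counter
from math import log2

def entropy_per_char(seq: str) -> float:
    """Order-0 (memoryless) estimate of the entropy h, in bits per character."""
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())

# Toy amino acid sequence (illustrative only).
seq = "MALWMRLLPLLALLALWGPDPAAA" * 10
h = entropy_per_char(seq)
bound_bits = len(seq) * h   # approximate informational content L*h in bits
```

A constant sequence has h = 0, and a sequence drawn uniformly from an alphabet of size 20 approaches h = log2(20) ≈ 4.32 bits per residue.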
Statistical compression algorithms achieve their goal by
assigning shorter code words to the most probable characters; their efficiency depends on the accuracy of the model
used to estimate each character's probability. Models try to
take advantage of the correlations between characters by considering, for example, how the preceding characters, that is,
the character's context, determine the probability of the next
one, as in the prediction by partial matching (PPM) scheme
[12].
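A minimal flavour of such context modeling (an illustrative sketch of ours with Laplace smoothing, far simpler than full PPM with its escape mechanism) assigns each symbol a code length of -log2 of the probability predicted by its one-character context:

```python
from collections import defaultdict, Counter
from math import log2

def adaptive_order1_code_length(seq: str, alphabet: str) -> float:
    """Total code length (bits) of `seq` under an adaptive order-1 context
    model with Laplace smoothing: each symbol costs -log2 of the probability
    its single-character context assigns to it, after which the context's
    counts are updated."""
    counts = defaultdict(Counter)
    bits = 0.0
    a = len(alphabet)
    for prev, cur in zip(seq, seq[1:]):
        ctx = counts[prev]
        p = (ctx[cur] + 1) / (sum(ctx.values()) + a)  # smoothed estimate
        bits += -log2(p)
        ctx[cur] += 1
    return bits
```

The more predictable the next character is from its context, the fewer bits the model spends, which is exactly the mechanism the compression algorithms discussed below exploit.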
Most successful algorithms for proteome compression
are based on the identification of duplicated sequences or
repeats. The compress protein (CP) algorithm [5], for example, considers that duplicated sequences in proteomes are
similar but not identical because of mutation and evolutionary divergence. CP uses a modified PPM that includes
the probability of amino acid substitutions when estimating
each residue's probability. The ProtComp algorithm [8] optimises the use of approximate repeats by updating the amino

acid substitution matrix as the repeated similar blocks appear
along the sequence. The context-tree weighting (CTW) [13]
is another context-based method that has been applied to
biological sequence compression. In [6] the authors present a
CTW-based algorithm that predicts the probability of a character by weighting the importance of short and long contexts,
considering as well the occurrence of approximate repeats or
palindromes in those contexts. XM [7] is a statistical algorithm which combines, via a Bayesian average, the probability of an amino acid calculated on a local scale with the
probability of that same residue being part of a duplicated
region of the proteome.

Nonstatistical approaches, based on the Burrows-Wheeler
transform (BWT) [9], have also been used for identifying
overlapping and distant repeats in proteomes, and efficiently
use them in compression. Even simpler models, that rely on a
block code representation of the protein sequences [10], have
proved to be successful in some cases.

Table 1: Proteome sequences.

Abbreviation | Organism | Proteome length | Number of proteins
Mj | Methanococcus jannaschii | 448 779 | 1680
Hi | Haemophilus influenzae | 509 519 | 1657
Vc | Vibrio cholerae | 870 500 | 2988
Ec | Escherichia coli | 1 578 496 | 5339
Sc | Saccharomyces cerevisiae | 2 900 352 | 5835
Dm | Drosophila melanogaster | 5 818 330 | 11 592
Ce | Caenorhabditis elegans | 6 874 562 | 17 456
Hs | Homo sapiens | 3 295 751 | 5733

All the algorithms discussed above put into evidence
the existence and importance of redundancy in proteome sequences. Here we present a purely statistical study of 8 eukaryotic and prokaryotic proteomes. Firstly, we analyse the
correlation function of the whole sequences and find evidence of medium range correlations, between amino acids
located 100 residues apart. Then we calculate the amino acid
correlations considering the protein boundaries and identify the role of the intra/interprotein scale in determining
the medium range correlations. Furthermore, we generate
groups of amino acids using their pair correlations at distance 100, which reveal the structural meaning of the medium
range correlations. Using the results of proteome correlations, we propose a statistical model for the distribution of
amino acids in 4 proteomes: Haemophilus influenzae (bacteria), Methanococcus jannaschii (archaea), Saccharomyces
cerevisiae (eukarya) and Homo sapiens (eukarya), and we estimate their compression rate to compare our results against
previous works.

The sources of nonrandomness studied fall into two
scales: the medium range correlations between amino acids
of the same and neighboring sequences, at distances of order
100, and the short range Markovian correlations between the
contiguous residues up to distance 10. Previous studies [9]
show that proteomes present repeated subsequences at very
long distances (50–300). In this article, we do not consider
these long-range correlations of the order of the proteome
length. Protein length range correlations are in agreement


with the process of sequence duplication, as it has been previously suggested for long-range correlations [9]; in addition
to that, we show that they also contain information about
the three-dimensional structure of the proteins. Short range
correlations might instead relate to the local constraints on
amino acid distribution due to secondary structure requirements.
2. RESULTS AND DISCUSSION

For our statistical analysis, we used the proteomes of 4


prokaryotic and 4 eukaryotic organisms shown in Table 1.
They were retrieved from the database of the Integr8 web
portal [14], with exception of the Hi, Mj, Sc, and Hs proteomes that were obtained from the protein corpus in [15],
for the sake of comparison of our compression rate results
with previous studies on the same proteomes. The proteomes
are not complete (in particular the version of Hs in the protein corpus) but they represent a natural set of proteins where
the redundancy has a biological meaning. It is important to
remark that the sequence of the proteins in the proteome files
of the Integr8 database is not the natural one. Those files are
not useful for our analysis. Nevertheless, using the additional
information available in the database, it is possible to order
the proteins as they are found in the chromosomes. The proteome files of the protein corpus do not present this problem,
but the sequence of the proteins is not available. Therefore,
for the analysis shown in Table 2 and in Figure 2, we have
used the version of Hi, Mj, Sc in the Integr8 database. For the
same reason, the data for Hs is missing in Table 2 since the
protein order is not obtainable at the Integr8 site.
2.1. Correlations

As a first approximation to the general trends in residue distribution, we study the co-occurrence of amino acids. More
precisely, we calculate the pair correlations at different distances, that is, the average number of times equal residues a
appear at distance k along the whole sequence:

$$C(k) = \frac{1}{20} \sum_{a} C_{aa}^{k}, \tag{1}$$
Figure 1: Correlation function C(k) versus distance k for the 8 proteomes. Notice that the function remains positive for distances up to 1000 and that eukaryotic proteomes tend to present higher values.

with

$$C_{aa}^{k} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi\bigl(\sigma_i = a\bigr)\,\chi\bigl(\sigma_{i+k} = a\bigr) - f_a^2, \tag{2}$$

where N is the sequence length, χ(σ_i = a) is the characteristic function of finding residue a at position i, and f_a is
the relative frequency of amino acid a in the proteome. According to this definition, a positive correlation means that,
for a distance k, pairs of equal amino acids are more frequent than expected from their frequencies in
the proteome. The resulting correlation function for the 8
proteomes we studied (Figure 1) shows that eukaryotic sequences have stronger correlations than prokaryotic ones.
Moreover, for all the proteomes, the correlation remains positive at a medium range, for values of k bigger than 800 or
1000, depending on the proteome. We notice that the natural order of proteins in the proteomes, given by the succession of genes in the chromosomes, is relevant: when we randomly permute proteins, the medium range correlations are
lost, both in eukaryotes and prokaryotes.
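In code, definition (1)-(2) amounts to the following (a direct sketch of ours, not the authors' implementation):

```python
from collections import Counter

def pair_correlation(seq: str, k: int, alphabet: str) -> float:
    """C(k) of Eqs. (1)-(2): for each residue a, the empirical probability
    that positions i and i+k both hold a, minus f_a**2, averaged over the
    alphabet (size 20 for proteins)."""
    n = len(seq)
    f = {a: seq.count(a) / n for a in alphabet}
    # count, per residue a, the positions i where seq[i] == seq[i+k] == a
    matches = Counter(seq[i] for i in range(n - k) if seq[i] == seq[i + k])
    return sum(matches[a] / (n - k) - f[a] ** 2 for a in alphabet) / len(alphabet)
```

On a strictly periodic toy sequence the statistic is positive at the period and negative off-period, the same qualitative signature read off Figure 1 for real proteomes.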
The medium range correlations imply that, in proteomes,
the amino acid distribution of neighboring proteins tends
to be more similar than that of distant ones. This fact can
be related to the process of duplication, recognised as the
dominant force in the evolution of protein function [16]. As
protein repeats have been related to duplication at different
scales (genome, gene, or exon) [17], it is possible that the
amino acid patterns responsible for the observed medium
range correlation have the same evolutionary origin.
Due to the correlation definition used, the medium range
correlations could be caused either by pairs of amino acids
belonging to the same protein, or to different ones. Therefore, we split the nonlocal correlation into two groups and
analyse them separately: interprotein correlations (between
2 contiguous proteins) and intraprotein correlations (inside

the same protein sequence). In Table 2, we present the results for the intraprotein correlation between the two halves
of the same protein and the interprotein correlation between
corresponding and noncorresponding halves of two contiguous proteins: first half with first half (corr−) and second half
with first half (corr+).

Table 2: Intra- and interprotein correlation. Intraprotein correlation is always higher than interprotein correlation, and correlation between matching halves (−) is higher than that of noncorresponding halves (+).

Proteome | Intraprot corr | Interprot corr− | Interprot corr+
Mj | 0.271914 | 0.050381 | 0.050231
Hi | 0.265803 | 0.045588 | 0.039246
Vc | 0.256386 | 0.063712 | 0.041780
Ec | 0.271597 | 0.080064 | 0.069980
Sc | 0.270560 | 0.032501 | 0.018606
Dm | 0.295940 | 0.095722 | 0.056176
Ce | 0.288071 | 0.122692 | 0.077690
These correlations are defined as follows. Let N_p be the number of proteins, let ν_i^−(a) and ν_i^+(a) be the relative frequencies of the residue a in the first and the second half of the ith protein, respectively, and let ν̄(a) be the corresponding mean value. We define

$$\rho_{i,j}^{--} = \frac{1}{20} \sum_a \bigl(\nu_i^-(a) - \bar\nu(a)\bigr)\bigl(\nu_j^-(a) - \bar\nu(a)\bigr) \tag{3}$$

and, for instance,

$$\rho_{i,j}^{+-} = \frac{1}{20} \sum_a \bigl(\nu_i^+(a) - \bar\nu(a)\bigr)\bigl(\nu_j^-(a) - \bar\nu(a)\bigr). \tag{4}$$

We also define

$$\sigma_i^- = \sqrt{\rho_{i,i}^{--}}, \qquad \sigma_i^+ = \sqrt{\rho_{i,i}^{++}}. \tag{5}$$

The intraprotein correlation is

$$C_{\mathrm{intra}} = \frac{1}{N_p} \sum_{i=1}^{N_p} \frac{\rho_{i,i}^{+-}}{\sigma_i^- \sigma_i^+}. \tag{6}$$

The two interprotein correlations are

$$C_{\mathrm{inter}}^- = \frac{1}{N_p-1} \sum_{i=1}^{N_p-1} \frac{\rho_{i,i+1}^{--}}{\sigma_i^- \sigma_{i+1}^-}, \qquad
C_{\mathrm{inter}}^+ = \frac{1}{N_p-1} \sum_{i=1}^{N_p-1} \frac{\rho_{i,i+1}^{+-}}{\sigma_i^+ \sigma_{i+1}^-}. \tag{7}$$
The correlation values in Table 2 have the same trend for all
the proteomes: intraprotein correlation is always higher than
interprotein correlation.
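A sketch of the intraprotein statistic (6) (our transcription; the half-splitting convention and the definition of the mean ν̄ as the average over all halves are our assumptions):

```python
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def half_freqs(protein: str):
    """nu_minus, nu_plus: residue frequency profiles of the two halves."""
    mid = len(protein) // 2
    out = []
    for h in (protein[:mid], protein[mid:]):
        cnt = Counter(h)
        out.append({a: cnt[a] / len(h) for a in AA})
    return out

def intraprotein_correlation(proteins):
    """C_intra of Eq. (6): normalised covariance between the amino acid
    frequency profiles of the two halves of each protein, averaged over
    the N_p proteins."""
    halves = [half_freqs(p) for p in proteins]
    all_halves = [h for pair in halves for h in pair]
    mean = {a: sum(h[a] for h in all_halves) / len(all_halves) for a in AA}

    def rho(u, v):  # Eq. (3)/(4)-style covariance
        return sum((u[a] - mean[a]) * (v[a] - mean[a]) for a in AA) / 20

    return sum(rho(plus, minus) / (rho(minus, minus) * rho(plus, plus)) ** 0.5
               for minus, plus in halves) / len(proteins)
```

Proteins whose two halves have identical composition score 1; unrelated halves score near 0, matching the contrast seen in Table 2.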
The correlations defined by means of ρ_{i,j} are different
from the traditional correlation C_{aa}^k, which is the correlation of the symbol a at distance k, where k is the number of
residues: here we have calculated the correlation function of the
frequencies of the amino acids at the distance of one protein.
In Figure 2, we also analyse how the interprotein correlations
between matching and nonmatching protein halves vary with
the number k of proteins separating the two halves. We compare

$$C^-(k) = \frac{1}{N_p-k} \sum_{i=1}^{N_p-k} \frac{\rho_{i,i+k}^{--}}{\sigma_i^- \sigma_{i+k}^-}, \qquad
C^+(k) = \frac{1}{N_p-k} \sum_{i=1}^{N_p-k} \frac{\rho_{i,i+k}^{+-}}{\sigma_i^+ \sigma_{i+k}^-}. \tag{8}$$

Figure 2: Correlation function, at a distance of k proteins, between amino acids belonging to corresponding (corr−) and noncorresponding (corr+) halves; S. cerevisiae proteome. Correlation between corresponding halves is higher, suggesting that structural requirements modulate the evolution of protein sequences by maintaining certain amino acid patterns.

As an extension of the results in Table 2, we find that the
correlation between matching halves is kept higher than that
of noncorresponding halves along the proteome. Analogous
results to Table 2 and Figure 2 hold for second-second and
first-second halves.

Gene duplication can explain both the existence and order dependence of interprotein correlation, but it is not
enough to justify why intraprotein correlations remain high,
because high interprotein correlations can also appear in a
low intraprotein correlation context. Indeed, the presence of
intraprotein correlations indicates a nonrandom distribution
of amino acids at a protein length scale. This nonrandomness
can be related to segmental duplication, that is, duplication
of segments inside the same protein; likewise, it can reflect
the maintenance of amino acid patterns during the protein
divergence that follows gene duplication, as a consequence of
the structural constraints imposed upon protein sequences.

As an example, extensive searches of protein databases
[18] reveal the high frequency of tandemly repeated sequences of approximately 50 amino acids, ARM and HEAT,
in eukaryotic proteins. Moreover, those repeats present a core
of strongly conserved hydrophobic residues even when the
other residues start to differ at several other positions.

The evidence obtained from the correlation analysis does
not allow us to clarify the nature of the structural constraints
measured: do they reflect the modular repetition of secondary structure elements, caused by duplication, or, perhaps, do they depend on the conservation of higher order tertiary structure units like domains? We try to address this
question by defining amino acid groups as explained in the
next section.

2.2. Grouping of amino acids

In a previous study [4], the complexity of large sets of nonredundant protein sequences was measured using a reduced alphabet approximation, that is, using groups of amino acids
defined by an a priori classification. The Shannon entropy
was then estimated from the entropies of blocks of n characters. The authors did not find enough evidence to support the existence of short range correlations between the
amino acids of protein sequences.
Conversely, given the above evidence of medium range
correlations in proteome sequences, we build groups of correlated amino acids using the correlations between the 20
amino acids. We calculate C_{ab}^k, the correlation between all
amino acid pairs ab at distance k, in the same way we calculated C_{aa}^k in the previous section:

$$C_{ab}^{k} = \frac{1}{N-k} \sum_{i=1}^{N-k} \chi\bigl(\sigma_i = a\bigr)\,\chi\bigl(\sigma_{i+k} = b\bigr) - f_a f_b. \tag{9}$$

A quick look at the resulting 20 × 20 matrix for k = 100
(Figure 3), which presumably includes both intraprotein and
interprotein correlation, puts in evidence that the signs of the
matrix elements, and thus the positive and negative correlations, are not distributed randomly among residues but, instead, in a grouped fashion: some amino acids present positive or negative correlations with the same subset of residues.
Then, we construct groups of amino acids in such a way
that they maximise the positive medium range correlation;
in practical terms it means that amino acids which are more
likely to appear at distances of order 100 would be grouped
together.
For a given partition of the set of amino acids in N_g
groups, we calculate the sum of the correlation function between any pair of residues ab belonging to the same group.
More precisely, groups are obtained by maximising the following quantity:

$$F(G) = \sum_{i=1}^{N_g} \sum_{a,b \in g_i} \sum_{k=1}^{200} C_{ab}^{k}, \tag{10}$$

which is function of a partition G of the amino acids in Ng


disjoint sets gi . Due to the huge number of possible choices
for the groups, we maximise this value using a simulated annealing algorithm, a Monte Carlo algorithm used for optimisation [19]. For a given partition G, we construct a new partition G′ by choosing a residue at random and changing its group. If F(G′) > F(G), the algorithm accepts the new partition. Iterating this procedure, we would reach a local maximum which may not be the absolute maximum. In order to avoid being trapped in a local maximum, the algorithm accepts, with a small probability P, a new partition G′ for which F(G′) ≤ F(G). The value of this probability P slowly decreases to zero as the number of iterations increases, in such a way that the convergence of the algorithm to the absolute maximum of F is guaranteed.

The number and the structure of the groups are chosen to have the highest value of F(G) while representing an equilibrated partition of the 20 amino acids; that is, groups with only one element are not accepted.

Figure 3: Correlation between the 20 amino acids for Hi. Positive (black) and negative (grey) correlations determine amino acid groups. (Axes: V L I M F W N Q H K R D E G A S T C Y P.)

Table 3: Groups of amino acids determined by maximisation of the positive medium range correlation. Amino acids that are more likely to appear at 200 residues distance are grouped together.

Proteome   Groups
Hi         LIFWSY; VMGATP; NQHKRDEC
Mj         LIFWNSY; VMQHGATCP; KRDE
Sc         LIMFWCY; NQHSTP; KRDE; VGA
Hs         VLIMFWNY; HSTC; QKDE; RGAP
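The grouping search described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the correlation matrix, the number of groups, the step count, and the linear cooling schedule are all stand-in choices.

```python
import math
import random

def anneal_groups(C, n_groups=3, steps=20000, t0=1.0, seed=0):
    """Partition residues to maximise the summed within-group
    correlation F(G) of equation (10), by simulated annealing.

    C[a][b] is the correlation between residues a and b, assumed
    to be already summed over the distances k = 1..200."""
    rng = random.Random(seed)
    residues = sorted(C)
    # Start from an arbitrary balanced partition.
    group = {a: i % n_groups for i, a in enumerate(residues)}

    def F(g):
        return sum(C[a][b] for a in residues for b in residues if g[a] == g[b])

    cur_F = F(group)
    best, best_F = dict(group), cur_F
    for step in range(steps):
        t = t0 * (1.0 - step / steps)   # temperature slowly decreasing to zero
        a = rng.choice(residues)
        new_g = rng.randrange(n_groups)
        if new_g == group[a]:
            continue
        trial = dict(group)
        trial[a] = new_g
        if len(set(trial.values())) < n_groups:
            continue                    # keep the partition equilibrated
        trial_F = F(trial)
        # Always accept improvements; accept worsenings with a small,
        # temperature-dependent probability.
        if trial_F > cur_F or rng.random() < math.exp((trial_F - cur_F) / max(t, 1e-9)):
            group, cur_F = trial, trial_F
            if cur_F > best_F:
                best, best_F = dict(group), cur_F
    return best, best_F
```

On a toy correlation matrix with two positively correlated blocks of residues, the search recovers the block structure.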
The idea behind our grouping scheme is to simplify
the amino acid pattern mining by taking advantage of their

synonymous relationships. It is well known that mutations


between amino acids sharing geometrical and/or physicochemical properties are the basis of neutral evolution at a
molecular level [20]; this fact also explains why there is
not a one-to-one relationship between protein sequences
and structures [21]. Moreover, structurally neighboring
residues have been found to distribute differentially (proximally/distally) in the protein sequences, depending on their
physico-chemical properties [22].
Indeed, the groups defined from the pair correlations at
a medium range (Table 3) almost correspond with the natural classification based on their physico-chemical properties:
hydrophobic, polar, charged, small, and ambiguous. In particular, the fact that hydrophobic amino acids group together
allows us to think that the correlation function is gathering
some of the three-dimensional information contained in the
protein sequence, more precisely tertiary structure information, as hydrophobic interactions are considered the driving
forces of the protein folding process [23].
Therefore, the reason why intraprotein correlations remain high is not only related to the repetition of secondary
structure units, but also to the conservation of the amino
acids responsible for the protein tertiary structure.
Besides this, it is important to notice that, even if the amino acid usage in eukaryotes and prokaryotes is very similar [24], the amino acid correlations are not, as they collect part of the structural information contained in the sequences. The number of groups is also different: 3 for H. influenzae and M. jannaschii, 4 for S. cerevisiae and H. sapiens.
This could indicate a higher interchangeability of residues in
some proteomes, but further analysis is needed to confirm
this hypothesis.
2.3. Sequence entropy estimation

In order to quantify the capability that a statistical model has to identify the nonrandomness of a sequence, one can use it to construct an arithmetic coding compressor [25]. We estimate the compression rate of such a compressor with the sequence entropy

\[ S = -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_i(\sigma_i), \tag{11} \]

using the model to calculate the probability p_i(σ_i) of character σ_i at position i. The better the model, the lower the
estimated value of the sequence entropy. We construct three
models to estimate the probability of each character, considering the previous ones and taking into account both short
and medium range correlations. For each model, we find parameters that minimise the sequence entropy. The Smin value
obtained is taken as an estimate of the compression rate of
a running arithmetic codification [25] of the proteomes and
is used to compare our results with other compression algorithms (Table 4).
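As a concrete illustration of equation (11), the sketch below estimates the compression rate of a sequence under a sequential probability model; the Laplace-smoothed order-0 model used here is only a stand-in for the paper's models 1-3.

```python
import math

def sequence_entropy(seq, alphabet):
    """Estimate S = -(1/N) * sum_i log2 p_i(s_i), equation (11),
    with an adaptive Laplace model p_i(a) = (1 + n_i(a)) / (|A| + i),
    where n_i(a) counts occurrences of a before position i."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for i, s in enumerate(seq):
        p = (1 + counts[s]) / (len(alphabet) + i)
        bits -= math.log2(p)
        counts[s] += 1
    return bits / len(seq)
```

An arithmetic coder driven by the same sequential probabilities would achieve essentially this rate, which is why S serves as a compression-rate estimate: a highly repetitive sequence scores far below the one-character entropy, while a uniform sequence stays near log2 of the alphabet size.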
Previous works on protein sequence compression like [5]
are based on short range Markovian models. In those models,
the probability of each amino acid is calculated as a function
of the context in which it appears, considering the frequency with which this amino acid happens to be after the l previous residues.

Table 4: Compression rate in bits per character for the studied proteomes. One-character entropy is the entropy of the sequences considering that their residues are independently distributed.

Algorithm                                   Hi      Mj      Sc      Hs
One-character entropy                       4.155   4.068   4.165   4.133
CP, Nevill-Manning and Witten 1999 [5]      4.143   4.051   4.146   4.112
lza-CTW, Matsumoto et al. 2000 [6]          4.118   4.028   3.951   3.920
ProtComp, Cao et al. 2007 [7]               4.108   4.008   3.938   3.824
XM, Cao et al. 2007 [7]                     4.102   4.000   3.885   3.786
Model 1                                     4.111   4.017   3.963   3.978
Model 2                                     4.102   4.005   3.948   3.933
Model 3                                     4.100   4.002   3.945   3.931
ProtComp, Hategan and Tabus 2004 [8]        2.330   3.910   3.440   3.910
BWT/SCP, Adjeroh and Nan 2006 [9]           2.546   2.273   3.111   3.435

Estimation.
Results obtained with a different set of proteomes.
Following this idea, we start our statistical description
of proteome sequences taking into account the information
given by the neighboring residues using a variation of the interpolated Markov models [26]. In order to predict the probability of the ith character, we consider the contexts up to a
length Nc (number of contexts) that precede it, that is, the
substrings σ_{i−k} · · · σ_{i−1} for k = 0, . . . , N_c. For any character a, we count the number F^i_k(a) of previous occurrences of the substring σ_{i−k} · · · σ_{i−1} a. The conditional frequency of finding character a after the context σ_{i−k} · · · σ_{i−1} is obtained by dividing by the sum over all amino acids b at position i:

\[ \frac{F^i_k(a)}{\sum_b F^i_k(b)}. \tag{12} \]
Our model 1 predicts the probability of character a at position i with

\[ \text{Model 1:} \quad p_i(a) = \frac{1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \left( 1 + \sum_{k=0}^{N_c} \lambda_k F^i_k(b) \right)}. \tag{13} \]

We remark that the main difference between our short range approach and CTW is that we give a weight to the different contexts, while in [6] a weight is given to their corresponding conditional probabilities. We find that the most informative positions were the previous 8; this length is in qualitative agreement with the results found in [6]. Model 1 in Table 4 indicates the results obtained considering only the short range correlations for N_c = 8.

The model depends on the parameters λ_k that are optimised, using standard algorithms for minimisation, in order to achieve the best estimate of the compression rate. This entropy minimisation stage is very time-consuming. In a real compression procedure, those parameters should be specified and therefore would contribute to the estimated entropy. In our case this contribution is negligible.

The short range correlations support the existence of periodic patterns in protein sequences. They can be caused by the alternation of alpha-beta secondary structure units, as argued in other works on latent periodicity of protein sequences [27, 28]. From the point of view of protein sequence evolution, the short range parameters can also reflect the existence of constraints on the distribution of residues. Protein sequences are modified by mutation, but still have to cope with folding requirements that determine a nonrandom positioning of key residues, depending on their geometrical and physico-chemical properties. In fact, structural alphabets derived from hidden Markov models denote that local conformations of protein structures have different sequence specificity [29].

The intra/interprotein correlations identified in previous sections suggest that the frequencies of the single residues have nonnegligible fluctuations on the medium range. We take these fluctuations into account in our second model (model 2 in Table 4):

\[ \text{Model 2:} \quad p_i(a) = \frac{1 + \mu R^i_L(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \left( 1 + \mu R^i_L(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b) \right)}. \tag{14} \]

Here we added

\[ R^i_L(a) = \bigl( \text{number of } a \text{ in } \sigma_{i-L} \cdots \sigma_{i-1} \bigr) \cdot \frac{i}{L}. \tag{15} \]

This quantity is proportional to the frequency of the amino acid a in the subsequence of length L, with L a distance of medium scale, starting from the position i − L. The factor i/L guarantees that Σ_a R^i_L(a) = i, so that it increases with i in the same way as the other terms of the sum (e.g., Σ_a F^i_0(a) = i). The parameter μ is optimised as the λ_k are. The optimal values for L found during the entropy minimisation stage are 190 for Hi, 163 for Mj, 105 for Sc, and 115 for Hs.
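A toy version of the model 2 estimator of equation (14) can be written as follows. The weights (lambdas, mu) are fixed illustrative values here, whereas the paper obtains them by entropy minimisation, and the brute-force context counting is far from the incremental bookkeeping a real compressor would use.

```python
def model2_probability(seq, i, a, alphabet, lambdas, mu, L):
    """p_i(a) blending context counts F_k^i(a) for k = 0..Nc with the
    medium-range frequency term R_L^i(a), in the spirit of equation (14)."""
    def F(k, c):
        # occurrences in seq[:i] of the current length-k context followed by c
        ctx = seq[i - k:i]
        return sum(1 for j in range(k, i) if seq[j - k:j] == ctx and seq[j] == c)

    def R(c):
        # frequency of c in the last L residues, rescaled by i/L
        return seq[max(0, i - L):i].count(c) * i / L

    def score(c):
        return 1.0 + mu * R(c) + sum(lam * F(k, c) for k, lam in enumerate(lambdas))

    return score(a) / sum(score(b) for b in alphabet)
```

By construction the probabilities sum to 1 over the alphabet, and a character that is frequent both in its contexts and in the recent window receives the largest share.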
Finally, in model 3, we use the groups found in Section 2.2 (see Table 3). In particular, a contribution to the probability of a given residue is obtained by computing the probability of the residue to belong to a certain group and then the conditional probability of the residue once the group is given:

\[ \text{Model 3:} \quad p_i(a) = \frac{1 + \mu\, G^i_L(g_a)\, f^i(a) + \sum_{k=0}^{N_c} \lambda_k F^i_k(a)}{\sum_b \left( 1 + \mu\, G^i_L(g_b)\, f^i(b) + \sum_{k=0}^{N_c} \lambda_k F^i_k(b) \right)}, \tag{16} \]

where g_a is the group of a, f^i(a) is the relative frequency of a in its group, as measured up to the position i − 1, and

\[ G^i_L(g) = \bigl( \text{number of amino acids of the group } g \text{ in } \sigma_{i-L} \cdots \sigma_{i-1} \bigr) \cdot \frac{i}{L}. \tag{17} \]

For this model, the optimal values of the parameter L are 129 for Hi, 94 for Mj, 77 for Sc, and 100 for Hs.
As one can see in Table 4, the capability of our statistical
model to represent the nonrandom information contained
in proteomes is comparable to those models that consider
repeated amino acid patterns at both short and medium scale
[6, 7].
The improvement in the performance of models 2 and 3
is due to the fact that they identify the short range correlations and separate them from the fluctuations of amino acid
frequencies at a protein length range. This demonstrates that
both correlation types are informative and that the statistical
significance of repetitions at those scales is enough to model
the amino acid probabilities.
The compression rate achieved when the medium range
correlations are modelled with the frequency of amino acid
groups (model 3) is almost equivalent to the compression
rate of model 2. From a biological perspective, it indicates that groups of amino acids are meaningful, and that the redundant information at medium scale has a structural component that might come from the three-dimensional structure constraints.
According to our results, there is an important difference
in the compressibility rates of the eukaryotic and prokaryotic
proteomes which is in agreement with the correlation function in Figure 1. The sequences of S. cerevisiae and H. sapiens are more redundant, and thus more compressible, than
those of H. influenzae and M. jannaschii; correspondingly,
the correlation functions of Sc and Hs remain positive for
longer distances than Hi and Mj. This additional redundancy
could be related to the presence, in eukaryotic proteomes, of
paralogous proteins with very similar distribution of synonymous amino acids, but dierent function. There is evidence
suggesting that paralogous genes have been recruited during
evolution of dierent metabolic pathways and are related to
the organism adaptability to environmental changes [16]. On
the other hand, the lower compressibility of the Hi and Mj
proteomes is in agreement with the reduction of prokaryotic
genome size as an adaptation to fast metabolic rates [30, 31].
3. CONCLUSIONS
In this article, we show that the correlation function gathers evolutionary and structural information of proteomes.
Even if proteins are highly complex sequences, at a proteome
scale, it is possible to identify correlations between characters at short and medium ranges. It confirms that protein
sequences are not completely random, indeed they present
repeated amino acid patterns at those two scales. The alternation of secondary structure units can determine the local
redundancy. This was already known and generally modelled
using Markov models. In our opinion, sequence duplication

is a reasonable explanation for the interprotein correlation.


However, it does not account for the intraprotein correlations; this can instead be related to the maintenance of the
amino acid patterns responsible for the three-dimensional
structure, as the segregation between hydrophobic and polar
amino acids indicates. More elaborately, the sampling of the
space of structures during proteome evolution is determined
by the duplication processes but it is highly constrained by
the structural and functional requirements that protein sequences have to meet inside a living system.
Prokaryotic proteomes show lower correlation values, especially for distances under 100 residues, and a smaller compressibility than eukaryotic proteomes. These characteristics
point at a higher redundancy of eukaryotic proteome sequences, and suggest that the increase of proteome size does
not imply de novo generation of protein sequences with a completely different amino acid distribution.
ACKNOWLEDGMENTS
The authors would like to thank Toby Gibson for reading and
commenting on the manuscript, and the reviewers for their constructive criticism that helped to improve the quality of the
paper.
REFERENCES
[1] J. C. Wootton, "Non-globular domains in protein sequences: automated segmentation using complexity measures," Computers & Chemistry, vol. 18, no. 3, pp. 269–285, 1994.
[2] B. E. Blaisdell, "A prevalent persistent global nonrandomness that distinguishes coding and non-coding eucaryotic nuclear DNA sequences," Journal of Molecular Evolution, vol. 19, no. 2, pp. 122–133, 1983.
[3] Y. Almirantis and A. Provata, "An evolutionary model for the origin of non-randomness, long-range order and fractality in the genome," BioEssays, vol. 23, no. 7, pp. 647–656, 2001.
[4] O. Weiss, M. A. Jimenez-Montano, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379–386, 2000.
[5] C. G. Nevill-Manning and I. H. Witten, "Protein is incompressible," in Proceedings of the Data Compression Conference (DCC '99), pp. 257–266, Snowbird, Utah, USA, March 1999.
[6] T. Matsumoto, K. Sadakane, and H. Imai, "Biological sequence compression algorithms," Genome Informatics, vol. 11, pp. 43–52, 2000.
[7] M. D. Cao, T. I. Dix, L. Allison, and C. Mears, "A simple statistical algorithm for biological sequence compression," in Proceedings of the Data Compression Conference (DCC '07), pp. 43–52, Snowbird, Utah, USA, March 2007.
[8] A. Hategan and I. Tabus, "Protein is compressible," in Proceedings of the 6th Nordic Signal Processing Symposium (NORSIG '04), pp. 192–195, Espoo, Finland, June 2004.
[9] D. Adjeroh and F. Nan, "On compressibility of protein sequences," in Proceedings of the Data Compression Conference (DCC '06), pp. 422–434, Snowbird, Utah, USA, March 2006.
[10] G. Sampath, "A block coding method that leads to significantly lower entropy values for the proteins and coding sections of Haemophilus influenzae," in Proceedings of the IEEE Bioinformatics Conference (CSB '03), pp. 287–293, Stanford, Calif, USA, August 2003.
[11] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379–423 and 623–656, 1948.
[12] J. Cleary and I. Witten, "Data compression using adaptive coding and partial string matching," IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
[13] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653–664, 1995.
[14] Integr8 web portal, ftp://ftp.ebi.ac.uk/pub/databases/integr8/, 2006.
[15] J. Abel, The data compression resource on the internet, http://www.datacompression.info/, 2005.
[16] C. A. Orengo and J. M. Thornton, "Protein families and their evolution – a structural perspective," Annual Review of Biochemistry, vol. 74, pp. 867–900, 2005.
[17] J. Heringa, "The evolution and recognition of protein sequence repeats," Computers & Chemistry, vol. 18, no. 3, pp. 233–243, 1994.
[18] M. A. Andrade, C. Petosa, S. I. O'Donoghue, C. W. Müller, and P. Bork, "Comparison of ARM and HEAT protein repeats," Journal of Molecular Biology, vol. 309, no. 1, pp. 1–18, 2001.
[19] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, "Optimization by simulated annealing," Science, vol. 220, no. 4598, pp. 671–680, 1983.
[20] L. A. Mirny and E. I. Shakhnovich, "Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function," Journal of Molecular Biology, vol. 291, no. 1, pp. 177–196, 1999.
[21] M. A. Huynen, P. F. Stadler, and W. Fontana, "Smoothness within ruggedness: the role of neutrality in adaptation," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 1, pp. 397–401, 1996.
[22] S. Karlin, "Statistical signals in bioinformatics," Proceedings of the National Academy of Sciences of the United States of America, vol. 102, no. 38, pp. 13355–13362, 2005.
[23] K. A. Dill, "Dominant forces in protein folding," Biochemistry, vol. 29, no. 31, pp. 7133–7155, 1990.
[24] B. Rost, "Did evolution leap to create the protein universe?" Current Opinion in Structural Biology, vol. 12, no. 3, pp. 409–416, 2002.
[25] J. Rissanen and G. G. Langdon Jr., "Arithmetic coding," IBM Journal of Research and Development, vol. 23, no. 2, pp. 149–162, 1979.
[26] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, "Microbial gene identification using interpolated Markov models," Nucleic Acids Research, vol. 26, no. 2, pp. 544–548, 1998.
[27] V. P. Turutina, A. A. Laskin, N. A. Kudryashov, K. G. Skryabin, and E. V. Korotkov, "Identification of latent periodicity in amino acid sequences of protein families," Biochemistry (Moscow), vol. 71, no. 1, pp. 18–31, 2006.
[28] E. V. Korotkov and M. A. Korotkova, "Enlarged similarity of nucleic acid sequences," DNA Research, vol. 3, no. 3, pp. 157–164, 1996.
[29] A. C. Camproux and P. Tuffery, "Hidden Markov model-derived structural alphabet for proteins: the learning of protein local shapes captures sequence specificity," Biochimica et Biophysica Acta, vol. 1724, no. 3, pp. 394–403, 2005.
[30] S. D. Bentley and J. Parkhill, "Comparative genomic structure of prokaryotes," Annual Review of Genetics, vol. 38, pp. 771–791, 2004.
[31] J. Raes, J. O. Korbel, M. J. Lercher, C. von Mering, and P. Bork, "Prediction of effective genome size in metagenomic samples," Genome Biology, vol. 8, no. 1, p. R10, 2007.

Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 87356, 9 pages
doi:10.1155/2007/87356

Research Article
A Study of Residue Correlation within Protein Sequences and
Its Application to Sequence Classification
Chris Hemmerich1 and Sun Kim2
1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington 47405-3700, Indiana, USA
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington 47408-3912, Indiana, USA

Received 28 February 2007; Revised 22 June 2007; Accepted 31 July 2007


Recommended by Juho Rousu
We investigate methods of estimating residue correlation within protein sequences. We begin by using mutual information (MI)
of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long range
correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific
interactions. Finally, in experiments of family classification tests, the modeling power of MIV was shown to be significantly better
than the classic MI method, reaching the level where proteins can be classified without alignment information.
Copyright 2007 C. Hemmerich and S. Kim. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

A protein can be viewed as a string composed from the 20-symbol amino acid alphabet or, alternatively, as the sum of their structural properties, for example, residue-specific interactions or hydropathy (hydrophilic/hydrophobic) interactions. Protein sequences contain sufficient information to construct secondary and tertiary protein structures. Most methods for predicting protein structure rely on primary sequence information by matching sequences representing unknown structures to those with known structures. Thus, researchers have investigated the correlation of amino acids within and across protein sequences [1-3]. Despite all this, in
terms of character strings, proteins can be regarded as slightly
edited random strings [1].
Previous research has shown that residue correlation can
provide biological insight, but that MI calculations for protein sequences require careful adjustment for sampling errors. An information-theoretic analysis of amino acid contact potential pairings with a treatment of sampling biases
has shown that the amount of amino acid pairing information is small, but statistically significant [2]. Another recent
study by Martin et al. [3] showed that normalized mutual information can be used to search for coevolving residues.
From the literature surveyed, it was not clear what significance the correlation of amino acid pairings holds for protein structure. To investigate this question, we used the family and sequence alignment information from Pfam-A [4]. To
model sequences, we defined and used the mutual information vector (MIV) where each entry represents the MI estimation for amino acid pairs separated by a particular distance in
the primary structure. We studied two different properties of
sequences: amino acid identity and hydropathy.
In this paper, we report three important findings.
(1) MI scores for the majority of 1000 real protein sequences sampled from Pfam are statistically significant
(as defined by a P value cutoff of .05) as compared to
random sequences of the same character composition,
see Section 4.1.
(2) MIV has significantly better modeling power of proteins than MI, as demonstrated in the protein sequence
classification experiment, see Section 5.2.
(3) The best classification results are provided by MIVs
containing scores generated from both the amino acid
alphabet and the hydropathy alphabet, see Section 5.2.
In Section 2, we briefly summarize the concept of MI
and a method for normalizing MI content. In Section 3, we
formally define the MIV and its use in characterizing protein sequences. In Section 4, we test whether MI scores for
protein sequences sampled from the Pfam database are statistically significant compared to random sequences of the


same residue composition. We test the ability of MIV to classify sequences from the Pfam database in Section 5, and in
Section 6, we examine correlation with MIVs and further investigate the eects of alphabet size in terms of information
theory. We conclude with a discussion of the results and their
implications.
2. MUTUAL INFORMATION (MI) CONTENT

We use MI content to estimate correlation in protein sequences to gain insight into the prediction of secondary and
tertiary structures. Measuring correlation between residues
is problematic because sequence elements are symbolic variables that lack a natural ordering or underlying metric [5].
Residues can be ordered in certain properties such as hydropathy, charge, and molecular weight. Weiss and Herzel [6]
analyzed several such correlation functions.
MI is a measure of correlation from information theory
[7] based on entropy, which is a function of the probability
distribution of residues. We can estimate entropy by counting residue frequencies. Entropy is maximal when all residues
appear with the same frequency. MI is calculated by systematically extracting pairs of residues from a sequence and calculating the distribution of pair frequencies weighted by the
frequencies of the residues composing the pairs.
By defining a pair as adjacent residues in the protein sequence, MI estimates the correlation between the identities
of adjacent residues. We later define pairs using nonadjacent
residues, and physical properties rather than residue identities.
MI has been proven useful in multiple studies of biological sequences. It has been used to predict coding regions
in DNA [8], and has been used to detect coevolving residue
pairs in protein multiple sequence alignments [3].
2.1. Mutual information
The entropy of a random variable X, H(X), represents the
uncertainty of the value of X. H(X) is 0 when the identity of
X is known, and H(X) is maximal when all possible values
of X are equally likely. The mutual information of two variables MI(X, Y ) represents the reduction in uncertainty of X
given Y , and conversely, MI(Y , X) represents the reduction
in uncertainty of Y given X:
MI(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X). (1)
When X and Y are independent, H(X | Y ) simplifies to
H(X), so MI(X, Y ) is 0. The upper bound of MI(X, Y ) is the
lesser of H(X) and H(Y ), representing complete correlation
between X and Y :
H(X | Y) = H(Y | X) = 0. (2)

We can measure the entropy of a protein sequence S as

\[ H(S) = -\sum_{i \in \mathcal{A}} P(x_i) \log_2 P(x_i), \tag{3} \]

where A is the alphabet of amino acid residues and P(x_i) is the marginal probability of residue i. In Section 3.3, we discuss several methods for estimating this probability.

From the entropy equations above, we derive the MI equation for a protein sequence X = (x_1, . . . , x_N):

\[ \mathrm{MI} = \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}} P(x_i, x_j) \log_2 \frac{P(x_i, x_j)}{P(x_i) P(x_j)}, \tag{4} \]

where the pair probability P(xi , x j ) is the frequency of two


residues being adjacent in the sequence.
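Equation (4) can be computed directly from pair counts. A minimal sketch, estimating both marginal and pair probabilities from the sequence itself (the authors' default method; Section 3.3 discusses alternatives):

```python
import math
from collections import Counter

def mutual_information(seq):
    """MI of adjacent residue pairs, equation (4)."""
    pairs = list(zip(seq, seq[1:]))
    pair_counts = Counter(pairs)
    char_counts = Counter(seq)
    n_pairs, n_chars = len(pairs), len(seq)
    mi = 0.0
    for (x, y), c in pair_counts.items():
        pxy = c / n_pairs
        mi += pxy * math.log2(pxy / ((char_counts[x] / n_chars) * (char_counts[y] / n_chars)))
    return mi
```

A strictly alternating two-letter sequence gives close to 1 bit (complete correlation between adjacent symbols), while a constant sequence gives 0.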
2.2. Normalization by joint entropy

Since MI(X, Y ) represents a reduction in H(X) or H(Y ), the


value of MI(X, Y ) can be altered significantly by the entropy
in X and Y . The MI score we calculate for a sequence is also
aected by the entropy in that sequence. Martin et al. [3] propose a method of normalizing the MI score of a sequence
using the joint entropy of a sequence. The joint entropy, or
H(X, Y ), can be defined as
H(X, Y ) =

 

iA j A

P xi , x j log 2 P xi , x j

(5)

and is related to MI(X, Y ) by the equation


MI(X, Y ) = H(X) + H(Y ) H(X, Y ).

(6)

The complete equation for our normalized MI measurement is

\[ \frac{\mathrm{MI}(X, Y)}{H(X, Y)} = \frac{\sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}} P(x_i, x_j) \log_2 \left( P(x_i, x_j) / \left( P(x_i) P(x_j) \right) \right)}{-\sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}} P(x_i, x_j) \log_2 P(x_i, x_j)}. \tag{7} \]
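A sketch of the normalized measurement of equation (7), dividing the adjacent-pair MI by the joint entropy of the pairs; the guard for zero joint entropy (a constant sequence) is our addition.

```python
import math
from collections import Counter

def normalized_mi(seq):
    """Adjacent-pair MI divided by the pair joint entropy, equation (7)."""
    pairs = list(zip(seq, seq[1:]))
    pair_counts = Counter(pairs)
    char_counts = Counter(seq)
    n_pairs, n_chars = len(pairs), len(seq)
    mi, joint_h = 0.0, 0.0
    for (x, y), c in pair_counts.items():
        pxy = c / n_pairs
        joint_h -= pxy * math.log2(pxy)
        mi += pxy * math.log2(pxy / ((char_counts[x] / n_chars) * (char_counts[y] / n_chars)))
    return mi / joint_h if joint_h else 0.0  # constant sequences have zero joint entropy
```

The normalized score lands near 1 for a perfectly alternating sequence and near 0 when adjacent symbols carry no information about each other.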
3. MUTUAL INFORMATION VECTOR (MIV)

We calculate the MI of a sequence to characterize the structure of the resulting protein. The structure is affected by different types of interactions, and we can modify our methods to consider different biological properties of a protein sequence. To improve our characterization, we combine these different methods to create a vector of MI scores.
Using the flexibility of MI and existing knowledge of protein structures, we investigate several methods for generating
MI scores from a protein sequence. We can calculate the pair
probability P(xi , x j ) using any relationship that is defined for
all amino acid identities i, j ∈ A. In particular, we examine distance between residue pairings, different types of residue-residue interactions, classical and normalized MI scores, and
three methods of interpreting gap symbols in Pfam alignments.
3.1. Distance MI vectors

Protein exists as a folded structure, allowing nonadjacent


residues to interact. Furthermore, these interactions help to
determine that structure. For this reason, we use MIV to
characterize nonadjacent interactions. Our calculation of MI
for adjacent pairs of residues is a specific case of a more general relationship, separation by exactly d residues in the sequence.


Table 1: MI(3) residue pairings of distance 3 for the sequence DEIPCPFCGC (the two paired residues are bracketed).

(1) [D]EIP[C]PFCGC
(2) D[E]IPC[P]FCGC
(3) DE[I]PCP[F]CGC
(4) DEI[P]CPF[C]GC
(5) DEIP[C]PFC[G]C
(6) DEIPC[P]FCG[C]

Table 2: Amino acid partition primarily based on hydropathy.

Hydrophobic: C, I, M, F, W, Y, V, L
Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G

Definition 1. For a sequence S = (s_1, . . . , s_N), mutual information of distance d, MI(d), is defined as

\[ \mathrm{MI}(d) = \sum_{i \in \mathcal{A}} \sum_{j \in \mathcal{A}} P_d(x_i, x_j) \log_2 \frac{P_d(x_i, x_j)}{P(x_i) P(x_j)}. \tag{8} \]

The pair probabilities, P_d(x_i, x_j), are calculated using all combinations of positions s_m and s_n in sequence S such that

\[ m + (d + 1) = n, \qquad n \le N. \tag{9} \]

A sequence of length N will contain N − (d + 1) pairs.


Table 1 shows how to extract pairs of distance 3 from the
sequence DEIPCPFCGC.
Definition 2. The mutual information vector of length k for
a sequence X, MIVk (X), is defined as a vector of k entries,
MI(0), . . . , MI(k − 1).
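Definitions 1 and 2 translate directly into code. A minimal sketch, with marginals estimated from the sequence itself (the function names are ours):

```python
import math
from collections import Counter

def mi_distance(seq, d):
    """MI(d) of equation (8): pairs (s_m, s_n) with m + (d + 1) = n."""
    pairs = [(seq[m], seq[m + d + 1]) for m in range(len(seq) - (d + 1))]
    pair_counts = Counter(pairs)
    char_counts = Counter(seq)
    n_pairs, n_chars = len(pairs), len(seq)
    mi = 0.0
    for (x, y), c in pair_counts.items():
        pxy = c / n_pairs
        mi += pxy * math.log2(pxy / ((char_counts[x] / n_chars) * (char_counts[y] / n_chars)))
    return mi

def miv(seq, k):
    """MIV_k of Definition 2: the vector [MI(0), ..., MI(k - 1)]."""
    return [mi_distance(seq, d) for d in range(k)]
```

For a periodic sequence, every distance that lines up with the period produces a high MI(d) entry, which is exactly the kind of nonadjacent structure the vector is meant to capture.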

Our second method is to use a common prior probability


distribution for all sequences. Since all of our sequences are
part of the Pfam database, we use residue frequencies calculated from Pfam as our prior. In our results, we refer to this
method as the Pfam prior. The large sample size allows the
frequency to more accurately estimate the probability. However, since Pfam contains sequences from many organisms,
the probability distribution is less accurate.
3.4.

Interpreting gap symbols

The Pfam sequence alignments contain gap information,


which presents a challenge for our MIV calculations. The
gap character does not represent a physical element of the
sequence, but it does provide information on how to view
the sequence and compare it to others. Because of this contradiction, we compared three strategies for processing gap
characters in the alignments.
The strict method
This method removes all gap symbols from a sequence before performing any calculations, operating on the protein
sequence rather than an alignment.
The literal method
Gaps are a proven tool in creating alignments between related sequences and searching for relationships between sequences. This method expands the sequence alphabet to include the gap symbol. For A we define and use a new alphabet:

A = A {}.

3.2. Sequence alphabets


The alphabet chosen to represent the protein sequence has
two eects on our calculations. First, by defining the alphabet, we also define the type of residue interactions we are
measuring. By using the full amino acid alphabet, we are
only able to find correlations based on residue-specific interactions. If we instead use an alphabet based on hydropathy,
we make correlations based on hydrophilic/hydrophobic interactions. Second, altering the size of our alphabet has a significant eect on our MI calculations. This eect is discussed
in Section 6.2.
In our study, we used two dierent alphabets: a set of 20
amino acids residues, A , and a hydropathy-based alphabet,
H , derived from grammar complexity and syntactic structure of protein sequences [9] (see Table 2 for mapping A to
H ).
3.3. Estimating residue marginal probabilities
To calculate the MIV for a sequence, we estimate the
marginal probabilities for the characters in the sequence alphabet. The simplest method is to use residue frequencies
from the sequence being scored. This is our default method.
Unfortunately, the quality of the estimation suers from the
short length of protein sequences.

(10)


MI is then calculated for 
A . H is transformed to G using
the same method.

The hybrid method

This method is a compromise between the previous two methods.
Gap symbols are excluded from the sequence alphabet when
calculating MI. Occurrences of the gap symbol are still counted
when calculating the total number of symbols. For a
sequence containing one or more gap symbols,

∑_{i∈A} P_i < 1.  (11)

Pairs containing any gap symbols are also excluded, so for a
gapped sequence,

∑_{i,j∈A} P_{ij} < 1.  (12)

These adjustments result in a negative MI score for some
sequences, unlike classical MI, where a minimum score of 0
represents independent variables.
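As a sketch of this hybrid scoring (our own simplified implementation of MI at residue distance d; the authors' code may differ in details), gap symbols are dropped from the alphabet but still counted in the totals, so the sums in (11) and (12) fall below 1 and the score can become negative:

```python
import math
from collections import Counter

GAP = "-"

def hybrid_mi(sequence, d):
    """MI between positions i and i+d, excluding gaps from the alphabet
    but keeping them in the totals (hybrid gap interpretation)."""
    n = len(sequence)
    # Marginals: gap occurrences still count toward the total length,
    # so the non-gap probabilities may sum to less than 1.
    p = {a: c / n for a, c in Counter(sequence).items() if a != GAP}
    # Joint distribution over pairs at distance d; pairs containing a gap
    # are excluded from the sum but still counted in the number of pairs.
    pairs = [(sequence[i], sequence[i + d]) for i in range(n - d)]
    total_pairs = len(pairs)
    mi = 0.0
    for (a, b), c in Counter(pairs).items():
        if GAP in (a, b):
            continue
        pab = c / total_pairs
        mi += pab * math.log2(pab / (p[a] * p[b]))
    return mi
```

On a gapless sequence this reduces to the usual plug-in MI estimate; on heavily gapped sequences the deflated probabilities can push the score below 0.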

EURASIP Journal on Bioinformatics and Systems Biology


Table 3: MIV examples calculated for four sequences from Pfam. All methods used literal gap interpretation.

        Globin MI(d)          Ferrochelatase MI(d)
 d      A        H            A        H
 0      1.34081  0.42600      0.95240  0.13820
 1      1.20553  0.23740      0.93240  0.03837
 2      1.07361  0.12164      0.90004  0.02497
 3      0.92912  0.02704      0.87380  0.03133
 4      0.97230  0.00380      0.90400  0.02153
 5      0.91082  0.00392      0.78479  0.02944
 6      0.90658  0.01581      0.81559  0.00588
 7      0.87965  0.02435      0.91757  0.00822
 8      0.83376  0.01860      0.87615  0.01247
 9      0.88404  0.01000      0.90823  0.00721
 10     0.88685  0.01353      0.89673  0.00611
 11     0.90792  0.01719      0.94314  0.02195
 12     0.95955  0.00231      0.87247  0.01027
 13     0.88584  0.01387      0.85914  0.00733
 14     0.93670  0.01490      0.88250  0.00335
 15     0.86407  0.02052      0.94592  0.00548
 16     0.89004  0.04024      0.92664  0.01398
 17     0.91409  0.01706      0.80241  0.00108
 18     0.89522  0.01691      0.85366  0.00719
 19     0.92742  0.03319      0.90928  0.01334

3.5. MIV examples

Table 3 shows eight examples of MIVs calculated from the
Pfam database. A sequence was taken from each of four random
families, and the MIV was calculated using the literal gap
method for both H and A. All scores are in bits. The scores
generated from A are significantly larger than those from
H. We investigate this observation further in Sections 4.1
and 6.2.

3.6. MIV concatenation

The previous sections have introduced several methods for
scoring sequences that can be used to generate MIVs. Just
as we combined MI scores to create an MIV, we can further
concatenate MIVs. Any number of vectors calculated by any
methods can be concatenated in any order. However, for two
vectors to be comparable, they must be the same length and
must agree on the feature stored at every index.

Definition 3. Any two MIVs, MIV_j(A) and MIV_k(B), can be
concatenated to form MIV_{j+k}(C).
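Definition 3 amounts to simple vector concatenation; a minimal sketch (the function names and the placeholder scoring function are ours):

```python
def mutual_information_vector(sequence, k, mi):
    """Build MIV_k: one MI score per residue distance d = 0..k-1."""
    return [mi(sequence, d) for d in range(k)]

def concatenate(miv_a, miv_b):
    """Definition 3: MIV_j(A) and MIV_k(B) combine into MIV_{j+k}(C)."""
    return miv_a + miv_b

# Example with a toy scoring function (a placeholder, not the paper's MI):
toy_mi = lambda s, d: float(d)
miv = mutual_information_vector("ACDE", 3, toy_mi)  # [0.0, 1.0, 2.0]
combined = concatenate(miv, miv)                    # an MIV of length 6
```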
4. ANALYSIS OF CORRELATION IN PROTEIN SEQUENCES

In [1], Weiss states that protein sequences can be regarded
as slightly edited random strings. This presents a significant
challenge for successfully classifying protein sequences based
on MI.

Table 3 (continued).

        DUF629 MI(d)          Big 2 MI(d)
 d      A        H            A        H
 0      0.70611  0.04752      1.26794  0.21026
 1      0.63171  0.00856      0.92824  0.05522
 2      0.63330  0.00367      0.95326  0.07424
 3      0.66955  0.00575      0.99630  0.04962
 4      0.62328  0.00587      1.00100  0.08373
 5      0.68383  0.00674      0.98737  0.03664
 6      0.63120  0.00782      1.06852  0.05216
 7      0.67433  0.00172      1.04627  0.12002
 8      0.63719  0.00495      1.00784  0.05221
 9      0.61597  0.00411      0.97119  0.04002
 10     0.60790  0.00718      1.02660  0.02240
 11     0.66750  0.00867      0.92858  0.02261
 12     0.64879  0.00805      0.98879  0.03156
 13     0.66959  0.00607      1.09997  0.04766
 14     0.66033  0.00106      1.06989  0.01286
 15     0.62171  0.01363      1.27002  0.06204
 16     0.63445  0.00314      1.05699  0.03154
 17     0.67801  0.00536      1.06677  0.02136
 18     0.65903  0.00898      1.05439  0.03310
 19     0.70176  0.00151      1.17621  0.01902

In theory, a random string contains no correlation between characters, so we expect a slightly edited random
string to exhibit little correlation. In practice, finite
random strings usually have a nonzero MI score. This overestimation of MI in finite sequences is a function of the length
of the string, the alphabet size, and the frequency of the characters
that make up the string. We investigated the significance of
this error for our calculations and methods for reducing or
correcting it.
To confirm the significance of our MI scores, we used
a permutation-based technique. We compared known coding sequences to random sequences in order to generate a
P value signifying the chance that our observed MI score
or higher would be obtained from a random sequence of
residues. Since MI scores are dependent on sequence length
and residue frequency, we used the shuffle command from
the HMMER package to conserve these parameters in our
random sequences.
We sampled 1000 sequences from our subset of Pfam-A. A simple random sample was performed without replacement from all sequences between 100 and 1000 residues in
length. We calculated MI(0) for each sequence sampled. We
then generated 10 000 shuffled versions of each sequence and
calculated MI(0) for each.
We used three scoring methods to calculate MI(0):
(1) A with literal gap interpretation,
(2) A normalized by joint entropy with literal gap interpretation,
(3) H with literal gap interpretation.
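This permutation test can be sketched as follows (our own illustration: we stand in for HMMER's shuffle with random.shuffle, which likewise conserves sequence length and residue composition, and the default shuffle count here is illustrative):

```python
import random

def permutation_p_value(sequence, score, n_shuffles=1000, seed=0):
    """Estimate P(score(shuffled) >= score(original)) by shuffling residues,
    which conserves sequence length and residue frequencies."""
    rng = random.Random(seed)
    observed = score(sequence)
    residues = list(sequence)
    exceed = 0
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        if score("".join(residues)) >= observed:
            exceed += 1
    # Equation (13): P = x / N.
    return exceed / n_shuffles
```

A score that depends only on residue composition (e.g., sequence length) ties on every shuffle and yields P = 1; a score sensitive to residue order, such as a count of adjacent identical residues, yields small P for strongly ordered sequences.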

C. Hemmerich and S. Kim

[Figure 1: Mean MI(0) of shuffled sequences. Mean MI(0) for shuffles (bits) versus sequence length (residue count), for A literal; A literal, normalized; and H literal.]

[Figure 2: Normalized MI(0) of shuffled sequences. Mean MI(0) for shuffles divided by MI(0) of the original sequence, versus sequence length (residue count), for the same three methods.]

In all three cases, the MI(0) score for a shuffled sequence of infinite length would be 0; therefore, the calculated
scores represent the error introduced by sample-size effects.
Figure 1 shows the average shuffled-sequence scores (i.e., sampling error) in bits for
each method. As expected, the sampling error tends to decrease as the sequence length increases.
4.1. Significance of MI(0) for protein sequences

To compare the amount of error in each method, we normalized the
mean MI(0) scores from Figure 1 by dividing the mean MI(0) score by the MI(0) score of the sequence used to
generate the shuffles. This ratio estimates the portion of the
sequence MI(0) score attributable to sample-size effects.
Figure 2 compares the effectiveness of our two corrective methods in minimizing sample-size effects. It shows that normalization by joint entropy is not as effective as Figure 1 suggests: despite a large reduction in bits, in most cases the portion of the score attributed to sampling effects shows only a
minor improvement. H, in contrast, shows a significant reduction in
sample-size effects for most sequences.
Figures 1 and 2 provide insight into trends for the three
methods, but do not answer our question of whether or not
the MI scores are significant. For a given sequence S, we estimated the P value as

P = x / N,  (13)
where N is the number of random shuffles and x is the number of shuffles whose MI(0) was greater than or equal to
MI(0) for S. For this experiment, we chose a significance
cutoff of .05. For a sequence to be labeled significant, no more
than 50 of the 10 000 shuffled versions may have an MI(0)
score equal to or larger than that of the original sequence. We repeated
this experiment for MI(1), MI(5), MI(10), and MI(15) and
summarized the results in Table 4.
These results suggest that despite the low MI content of
protein sequences, we are able to detect significant MI in a
majority of our sampled sequences at MI(0). The number of
significant sequences decreases for MI(d) as d increases. The
results for the classic MI method are significantly affected by
sampling error. Normalization by joint entropy reduces this
error slightly for most sequences, and using H is a much
more effective correction.
5. MEASURING MIV PERFORMANCE THROUGH PROTEIN CLASSIFICATION

We used sequence classification to evaluate the ability of MI
to characterize protein sequences and to test our hypothesis that MIV characterizes a protein sequence better than MI. As
such, our objective is to measure the difference in accuracy
between the methods, rather than to reach a specific classification accuracy.
We used the Pfam-A dataset to carry out this comparison. The families contained in the Pfam database vary in
sequence count and sequence length. We removed all families containing any sequence of fewer than 100 residues due to
complications with calculating MI for small strings. We also
limited our study to families with more than 10 sequences
and no more than 200 sequences. After filtering Pfam-A based on our requirements, we were left with 2392 families
to consider in the experiment.
Sequence similarity is the most widely used method of
family classification. BLAST [10] is a popular tool incorporating this method. Our method differs significantly, in
that classification is based on a vector of numerical features,
rather than the protein's residue sequence.


Table 4: Sequence significance calculated for significance cutoff .05. Entries are the number of significant sequences (of 1000).

Scoring method          MI(0)  MI(1)  MI(5)  MI(10)  MI(15)
Literal-A               762    630    277    103     54
Normalized literal-A    777    657    309    106     60
Literal-H               894    783    368    162     117

Classification of feature vectors is a well-studied problem with many available strategies. A good introduction to
many methods is available in [11], and the method chosen
can significantly affect performance. Since the focus of this
experiment is to compare methods of calculating MIVs, we
only used the well-established and versatile nearest neighbor
classifier in conjunction with Euclidean distance [12].
5.1. Classification implementation

For classification, we used the WEKA package [11]. WEKA
uses the instance-based 1 (IB1) algorithm [13] to implement nearest neighbor classification. This is an instance-based learning algorithm derived from the nearest neighbor
pattern classifier and is more efficient than the naive implementation.
The results of this method can differ from the classic
nearest neighbor classifier in that the range of each attribute
is normalized. This normalization ensures that each attribute
contributes equally to the calculation of the Euclidean distance. As shown in Table 3, MI scores calculated from A
have a larger magnitude than those calculated from H. This
normalization allows the two alphabets to be used together.
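The effect of this normalization can be sketched directly (a minimal reimplementation of min-max attribute scaling in our own code, not WEKA's): each attribute is rescaled to [0, 1] by its range over the data set before Euclidean distances are computed:

```python
import math

def normalize_columns(vectors):
    """Min-max normalize each attribute over the data set, so that every
    attribute contributes comparably to the Euclidean distance."""
    lo = [min(col) for col in zip(*vectors)]
    hi = [max(col) for col in zip(*vectors)]
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(v, lo, hi)]
        for v in vectors
    ]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# After normalization, both attributes span [0, 1], so the large-magnitude
# A scores no longer dominate the small H scores in the distance.
data = normalize_columns([[1.3, 0.42], [0.9, 0.02], [1.1, 0.22]])
```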
5.2. Sequence classification with MIV

In this experiment, we explore the effectiveness of classifications made using the correlation measurements outlined in
Section 3.
Each experiment was performed on a random sample of
50 families from our subset of the Pfam database. We then
used leave-one-out cross-validation [14] to test each of our
classification methods on the chosen families.
In leave-one-out validation, the sequences from all 50
families are placed in a training pool. In turn, each sequence
is extracted from this pool and the remaining sequences are
used to build a classification model. The extracted sequence
is then classified using this model. If the sequence is placed
in the correct family, the classification is counted as a success. Accuracy for each method is measured as

accuracy = no. of correct classifications / no. of classification attempts.  (14)
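The leave-one-out procedure and the accuracy of equation (14) can be sketched as follows (our own toy illustration; in the actual experiment the vectors are MIV20s labelled by Pfam family):

```python
import math

def nearest_neighbor_label(query, pool):
    """Classify a query vector by the label of its closest vector in the pool."""
    return min(pool, key=lambda item: math.dist(query, item[0]))[1]

def leave_one_out_accuracy(dataset):
    """Accuracy = correct classifications / classification attempts (eq. (14))."""
    correct = 0
    for i, (vec, label) in enumerate(dataset):
        # Hold out one sequence; the rest form the classification model.
        pool = dataset[:i] + dataset[i + 1:]
        if nearest_neighbor_label(vec, pool) == label:
            correct += 1
    return correct / len(dataset)

# Two tight, well-separated families are classified perfectly:
toy = [([0.0, 0.1], "fam1"), ([0.1, 0.0], "fam1"),
       ([5.0, 5.1], "fam2"), ([5.1, 5.0], "fam2")]
```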

We repeated this process 100 times, using a new sampling
of 50 families from Pfam each time. Results are reported for
each method as the mean accuracy of these repetitions. For
each of the 24 combinations of scoring options outlined in
Section 3, we evaluated classification based on MI(0) as well
as MIV20. The results of these experiments are summarized
in Table 5.
All MIV20 methods were more accurate than their MI(0)
counterparts. The best method was H with hybrid gap scoring, with a mean accuracy of 85.14%. The eight best-performing methods used H, with the best method based on A having a mean accuracy of only 66.69%. Another important observation is that strict gap interpretation performs poorly in
sequence classification. The best strict method had a mean
accuracy of 29.96%, much lower than the other gap methods.
Our final classification attempts were made using concatenations of previously generated MIV20 scores. We evaluated all combinations of methods. The five combinations
most accurate at classification are shown in Table 6. The best
method combinations are over 90% accurate, with the best
being 90.99%. The classification power of H with hybrid
gap interpretation is demonstrated, as this method appears
in all five results. Surprisingly, two strict scoring methods appear in the top five, despite their poor performance when used
alone.
Based on our results, we made the following observations.
(1) The correlation of non-adjacent pairs as measured
by MIV is significant. Classification based on every
method improved significantly for MIV compared to
MI(0). The highest accuracy achieved for MI(0) was
26.73%, and for MIV it was 85.14% (see Table 5).
(2) Normalized MI had an insignificant effect on scores generated from H. Both methods reduce the sample-size
error in estimating entropy and MI for sequences. A
possible explanation for the lack of further improvement through normalization is that H is a more effective corrective measure than normalization. We explore this possibility further in Section 6.2, where we
consider entropy for both alphabets.
(3) For the most accurate methods, using the Pfam prior decreased accuracy. Despite our concerns about using the
frequency of a short sequence to estimate the marginal
residue probabilities, the results show that these estimates better characterize the sequences than the
Pfam prior probability distribution. However, four of
the five best combinations contain a method utilizing
the Pfam prior, showing that the two methods for estimating marginal probabilities are complementary.
(4) As with sequence-based classification, introducing gaps
improves accuracy. For all methods, removing gap characters with the strict method drastically reduced accuracy. Despite this, two of the five best combinations included a strict scoring method.
(5) The best scoring concatenated MIVs included both alphabets. The inclusion of A is significant: all eight
nonstrict H methods scored better than any A
method (see Table 5). This shows that A
provides information not included in H and
strengthens our assertion that the different alphabets
characterize different forces affecting protein structure.


Table 5: Classification results for MI(0) and MIV20 methods. SD represents the standard deviation of the experiment accuracies.

MIV20                                            MI(0) accuracy    MIV20 accuracy
rank   Method                                    Mean     SD       Mean     SD
1      Hybrid-H                                  26.73%   2.59     85.14%   2.06
2      Normalized hybrid-H                       26.20%   4.16     85.01%   2.19
3      Literal-H                                 22.92%   3.41     79.51%   2.79
4      Normalized literal-H                      23.45%   3.88     78.86%   2.79
5      Normalized hybrid-H w/Pfam prior          26.31%   3.95     77.21%   2.94
6      Literal-H w/Pfam prior                    22.73%   4.90     76.89%   2.91
7      Normalized literal-H w/Pfam prior         22.45%   4.89     76.29%   2.96
8      Hybrid-H w/Pfam prior                     22.81%   2.97     71.57%   3.15
9      Normalized literal-A                      17.76%   3.21     66.69%   4.14
10     Hybrid-A                                  17.16%   3.06     64.09%   4.36
11     Normalized literal-A w/Pfam prior         19.60%   3.67     63.39%   4.05
12     Literal-A                                 16.36%   2.84     61.97%   4.32
13     Literal-A w/Pfam prior                    19.95%   2.84     61.82%   4.12
14     Hybrid-A w/Pfam prior                     23.09%   3.36     58.07%   4.28
15     Normalized hybrid-A                       18.10%   3.08     41.76%   4.59
16     Normalized hybrid-A w/Pfam prior          23.32%   3.65     40.46%   4.04
17     Strict-H w/Pfam prior                     12.97%   2.85     29.96%   3.89
18     Normalized strict-H w/Pfam prior          13.01%   2.72     29.81%   3.87
19     Normalized strict-A w/Pfam prior          19.77%   3.52     29.73%   3.93
20     Normalized strict-A                       18.27%   2.92     29.20%   3.65
21     Strict-H                                  11.22%   2.33     29.09%   3.60
22     Normalized strict-H                       11.15%   2.52     28.85%   3.58
23     Strict-A w/Pfam prior                     19.25%   3.38     28.44%   3.91
24     Strict-A                                  16.27%   2.75     25.80%   3.60

Table 6: Top scoring combinations of MIV methods. All combinations of two MIV methods were tested, with these five methods performing
the most accurately. SD represents the standard deviation of the experiment accuracies.

Rank   First method   Second method                       Mean accuracy   SD
1      Hybrid-H       Normalized hybrid-A w/Pfam prior    90.99%          1.44
2      Hybrid-H       Normalized strict-A w/Pfam prior    90.66%          1.47
3      Hybrid-H       Literal-A w/Pfam prior              90.30%          1.48
4      Hybrid-H       Literal-A                           90.24%          1.73
5      Hybrid-H       Strict-A w/Pfam prior               90.08%          1.57

6. FURTHER MIV ANALYSIS

In this section, we examine the results of our different methods of calculating MIVs for Pfam sequences. We first use correlation within the MIV as a metric to compare several of our
scoring methods. We then take a closer look at the effect of
reducing our alphabet size when translating from A to H.

6.1. Correlation within MIVs

We calculated MIVs for 120 276 Pfam sequences using each
of our methods and measured the correlation within each
method using Pearson's correlation. The results of this analysis are presented in Figure 3. Each method is represented by
a 20 × 20 grid containing each pairing of entries within that
MIV.
The results strengthen our observations from the classification experiment. Methods that performed well in classification exhibit less redundancy between MIV indexes. In particular, the advantage of methods using H is clear. In each
case, correlation decreases as the distance between indexes
increases. For short distances, A methods exhibit this to a
lesser degree; however, after index 10, the scores are highly
correlated.
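This redundancy analysis can be reproduced in miniature (our own sketch): given the MIVs of many sequences, compute Pearson's r between every pair of vector indexes to obtain one grid of Figure 3:

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def miv_correlation_grid(mivs):
    """Pairwise Pearson correlation between MIV indexes across sequences."""
    k = len(mivs[0])
    columns = list(zip(*mivs))  # one column of scores per MIV index
    return [[pearson(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
```

High off-diagonal values mean that two MIV indexes carry largely the same information across the sequence collection, i.e., the vector is redundant.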

6.2. Effect of alphabets

Not all intraprotein interactions are residue specific. Cline
[2] explored information attributed to hydropathy, charge,
disulfide bonding, and burial. Hydropathy, an alphabet composed of two symbols, was found to contain half as much information as the 20-element amino acid alphabet. However,

with only two symbols, the alphabet should be more resistant
to the underestimation of entropy and overestimation of MI
caused by finite-sequence effects [15].

[Figure 3: Pearson's correlation analysis of scoring methods. Each panel is a 20 × 20 grid of pairwise correlations between MIV indexes (color scale roughly 0.2 to 0.8): (a) Literal-A, Normalized literal-A, Hybrid-A, Normalized hybrid-A; (b) Literal-H, Normalized literal-H, Hybrid-H, Normalized hybrid-H. Note the reduced correlation in the methods based on H, which all performed very well in classification tests.]

For this method, a protein sequence is translated using
the process given in Section 3.2. It is important to remember that the scores generated for entropy and MI are actually
estimates based on finite samples. Because of the reduced alphabet size of H, we expected to see increased accuracy in
entropy and MI estimates. To confirm this, we examined
the effects of converting random sequences of 100 residues
(a length representative of those found in the Pfam database)
into H.
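This experiment is easy to reproduce in outline (our own sketch; the alphabet strings, trial count, and seed are illustrative choices): draw uniform random sequences of length 100 and compare the plug-in entropy estimate with the theoretical maximum log2 N. The bias is far larger for the 20-symbol alphabet than for the 2-symbol one:

```python
import math
import random
from collections import Counter

def plug_in_entropy(sequence):
    """Naive (maximum-likelihood) entropy estimate in bits."""
    n = len(sequence)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(sequence).values())

def mean_entropy(alphabet, length=100, trials=200, seed=0):
    """Mean plug-in entropy of uniform random sequences over the alphabet."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        seq = "".join(rng.choice(alphabet) for _ in range(length))
        total += plug_in_entropy(seq)
    return total / trials

# Underestimation of entropy relative to the theoretical value log2 N:
bias20 = math.log2(20) - mean_entropy("ACDEFGHIKLMNPQRSTVWY")
bias2 = math.log2(2) - mean_entropy("HP")
# bias20 is on the order of 0.14 bits; bias2 is a few thousandths of a bit.
```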
We generated each sequence from a Bernoulli scheme.
Each position in the sequence is selected independently of
any residues selected before it, and all selections are made
randomly from a uniform distribution. Therefore, for every
position in the sequence, all residues are equally likely to occur.
By sampling residues from a uniform distribution, the
Bernoulli scheme maximizes entropy for the alphabet size N:

H = −log2(1/N) = log2 N.  (15)

Since all positions are independent of the others, MI is 0.
Knowing the theoretical values of both entropy and MI, we
can compare the calculated estimates for a finite sequence to
the theoretical values to determine the magnitude of finite-sequence effects.
We estimated entropy and MI for each of these sequences
and then translated the sequences to H. The translated
sequences are no longer Bernoulli sequences because the
residue partitioning is not equal: eight residues fall into one
category and twelve into the other. Therefore, we estimated
the entropy for the new alphabet using this probability distribution. The positions remain independent, so the expected
MI remains 0.

Table 7: Comparison of measured entropy to expected entropy values for 1000 amino acid sequences. Each sequence is 100 residues
long and was generated by a Bernoulli scheme.

Alphabet   Alphabet size   Theoretical entropy   Mean measured entropy
A          20              4.322                 4.178
H          2               0.971                 0.964

Table 7 shows the measured and expected entropies for
both alphabets. The entropy for A is underestimated by
.144, and the entropy for H is underestimated by only
.007. The effect of H on MI estimation is much more pronounced. Figure 4 shows the dramatic overestimation of MI
in A and the high standard deviation around the mean. The
overestimation of MI for H is negligible in comparison.

7. CONCLUSIONS

We have shown that residue correlation information can be
used to characterize protein sequences. To model sequences,
we defined and used the mutual information vector (MIV),
where each entry represents the mutual information content
between two amino acids at the corresponding distance. We
have shown that the MIV of a protein is significantly different
from that of random sequences of the same character composition
when the distance between residues is considered. Furthermore,
we have shown that the MIV values of proteins are significant
enough to determine the family membership of a protein sequence with an accuracy of over 90%. What we have shown is
simply that the MIV score of a protein is significant enough

for family classification; MIV is not a practical alternative to
similarity-based family classification methods.

[Figure 4: Comparison of MI overestimation in protein sequences generated from Bernoulli schemes for gap distances d from 0 to 19 residues (mean MIV for A and mean MIV for H). The full residue alphabet greatly overestimates MI; reducing the alphabet to two symbols approximates the theoretical value of 0.]

There are a number of interesting questions to be answered. In particular, it is not clear how to interpret a vector
of mutual information values. It would also be interesting
to study the effect of distance in computing mutual information in relation to protein structures, especially in terms
of secondary structures. In our experiment (see Table 4), we
have observed that normalized MIV scores exhibit more information content than nonnormalized MIV scores. However, in the classification task, normalized MIV scores did
not always achieve better classification accuracy than nonnormalized MIV scores. We hope to investigate this issue in
the future.

ACKNOWLEDGMENTS

This work is partially supported by NSF DBI-0237901 and
the Indiana Genomics Initiative (INGEN). The authors also
thank the Center for Genomics and Bioinformatics for the
use of computational resources.

REFERENCES

[1] O. Weiss, M. A. Jimenez-Montano, and H. Herzel, "Information content of protein sequences," Journal of Theoretical Biology, vol. 206, no. 3, pp. 379-386, 2000.
[2] M. S. Cline, K. Karplus, R. H. Lathrop, T. F. Smith, R. G. Rogers Jr., and D. Haussler, "Information-theoretic dissection of pairwise contact potentials," Proteins: Structure, Function and Genetics, vol. 49, no. 1, pp. 7-14, 2002.
[3] L. C. Martin, G. B. Gloor, S. D. Dunn, and L. M. Wahl, "Using information theory to search for co-evolving residues in proteins," Bioinformatics, vol. 21, no. 22, pp. 4116-4124, 2005.
[4] A. Bateman, L. Coin, R. Durbin, et al., "The Pfam protein families database," Nucleic Acids Research, vol. 32, Database issue, pp. D138-D141, 2004.
[5] W. R. Atchley, W. Terhalle, and A. Dress, "Positional dependence, cliques, and predictive motifs in the bHLH protein domain," Journal of Molecular Evolution, vol. 48, no. 5, pp. 501-516, 1999.
[6] O. Weiss and H. Herzel, "Correlations in protein sequences and property codes," Journal of Theoretical Biology, vol. 190, no. 4, pp. 341-353, 1998.
[7] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[8] I. Grosse, H. Herzel, S. V. Buldyrev, and H. E. Stanley, "Species independence of mutual information in coding and noncoding DNA," Physical Review E, vol. 61, no. 5, pp. 5624-5629, 2000.
[9] M. A. Jimenez-Montano, "On the syntactic structure of protein sequences and the concept of grammar complexity," Bulletin of Mathematical Biology, vol. 46, no. 4, pp. 641-659, 1984.
[10] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
[11] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, San Francisco, Calif, USA, 2nd edition, 2005.
[12] T. M. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.
[13] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37-66, 1991.
[14] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI '95), vol. 2, pp. 1137-1145, Montreal, Quebec, Canada, August 1995.
[15] H. Herzel, A. O. Schmitt, and W. Ebeling, "Finite sample effects in sequence analysis," Chaos, Solitons & Fractals, vol. 4, no. 1, pp. 97-113, 1994.

Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 14741, 11 pages
doi:10.1155/2007/14741

Research Article
Identifying Statistical Dependence in Genomic Sequences via
Mutual Information Estimates
Hasan Metin Aktulga,1 Ioannis Kontoyiannis,2 L. Alex Lyznik,3 Lukasz Szpankowski,4
Ananth Y. Grama,1 and Wojciech Szpankowski1
1 Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA
2 Department of Informatics, Athens University of Economics & Business, Patission 76, 10434 Athens, Greece
3 Pioneer Hi-Bred International, Johnston, IA, USA
4 Bioinformatics Program, University of California, San Diego, CA 92093, USA
Received 26 February 2007; Accepted 25 September 2007

Recommended by Petri Myllymäki
Questions of understanding and quantifying the representation and amount of information in organisms have become a central
part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of
information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated.
We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical
as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of
dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the
identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between
the 5′ untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet
unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's combined DNA index system
(CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats, an
application of importance in genetic profiling.
Copyright © 2007 Hasan Metin Aktulga et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

Questions of quantification, representation, and description
of the overall flow of information in biosystems are of central importance in the life sciences. In this paper, we develop statistical tools based on information-theoretic ideas,
and demonstrate their use in identifying informative parts
of biomolecules. Specifically, our goal is to detect statistically
dependent segments of biosequences, hoping to reveal potentially important biological phenomena. It is well known
[1-3] that various parts of biomolecules, such as DNA, RNA,
and proteins, are significantly (statistically) correlated. Formal measures and techniques for quantifying these correlations are topics of current investigation. The biological implications of these correlations are deep, and they themselves
remain unresolved. For example, statistical dependencies between exons carrying protein coding sequences and noncoding introns may indicate the existence of as-yet unknown error correction mechanisms or structural scaffolds. Thus motivated, we propose to develop precise and reliable methodologies for quantifying and identifying such dependencies,
based on the information-theoretic notion of mutual information.
Biomolecules store information in the form of monomer
strings such as deoxyribonucleotides, ribonucleotides, and
amino acids. As a result of numerous genome and protein
sequencing efforts, vast amounts of sequence data are now
available for computational analysis. While basic tools such
as BLAST provide powerful computational engines for identification of conserved sequence motifs, they are less suitable
for detecting potential hidden correlations without experimental precedence (higher-order substitutions).
The application of analytic methods for finding regions
of statistical dependence through mutual information has
been illustrated through a comparative analysis of the 5′ untranslated regions of DNA coding sequences [4]. It has been
known that eukaryotic translational initiation requires the
consensus sequence around the start codon defined as the
Kozak's motif [5]. By screening at least 500 sequences, an
unexpected correlation between positions −2 and −1 of the
Kozak sequence was observed, thus implying a novel translational initiation signal for eukaryotic genes. This pattern
was discovered using mutual information, and was not detected
by analyzing single-nucleotide conservation. In other relevant work, neighbor-dependent substitution matrices were
applied to estimate the average mutual information content of the core promoter regions from five different organisms [6, 7]. Such comparative analyses verified the importance of TATA-boxes and transcriptional initiation. A similar
methodology elucidated patterns of sequence conservation
at the 3′ untranslated regions of orthologous genes from human, mouse, and rat genomes [8], making them potential
targets for experimental verification of hidden functional signals.
In a different kind of application, statistical dependence
techniques find important uses in the analysis of gene
expression data. Typically, the basic underlying assumption
in such analyses is that genes expressed similarly under divergent conditions share functional domains of biological activity. Establishing dependency or potential relationships between sets of genes from their expression profiles holds the
key to the identification of novel functional elements. Statistical approaches to the estimation of mutual information from
gene expression datasets have been investigated in [1].
Protein engineering is another important area where statistical dependency tools are utilized. Reliable predictions of
protein secondary structures based on long-range dependencies may enhance functional characterizations of proteins [9]. Since secondary structures are determined by both
short- and long-range interactions between single amino
acids, the application of comparative statistical tools based
on consensus sequence algorithms or short amino acid sequences centered on the prediction sites is far from optimal.
Analyses that incorporate mutual information estimates may
provide more accurate predictions.
In this work we focus on developing reliable and precise information-theoretic methods for determining whether
two biosequences are likely to be statistically dependent. Our
main goal is to develop efficient algorithmic tools that can
be easily applied to large data sets, mainly, though not
exclusively, as a rigorous exploratory tool. In fact, as discussed in detail below, our findings are not the final word on
the experiments we performed, but, rather, the first step in
the process of identifying segments of interest. Another motivating factor for this project, more closely related to
ideas from information theory, is the question of determining whether there are error correction mechanisms built into
large molecules, as argued by Battail; see [10] and the references therein. We chose to work with protein coding exons and noncoding introns. While exons are well-conserved
parts of DNA, introns have much greater variability. They
are dispersed on strings of biopolymers and still they have
to be precisely identified in order to produce biologically relevant information. It seems that there is no external source
of information but the structure of the RNA molecules themselves to generate functional templates for protein synthesis.
Determining potential mutual relationships between exons

and introns may justify additional search for still unknown
factors affecting RNA processing.
The complexity and importance of the RNA processing
system are emphasized by the largely unexplained mechanisms
of alternative splicing, which provide a source of substantial
diversity in gene products. The same sequence may be recognized as an exon or an intron, depending on a broader context of splicing reactions. The information that is required
for the selection of a particular segment of RNA molecules is
very likely embedded into either exons or introns, or both.
Again, it seems that the splicing outcome is determined
by structural information carried by RNA molecules themselves, unless the fundamental dogma of biology (the unidirectional flow of information from DNA to proteins) is to be
questioned.
Finally, the constant evolution of genomes introduces
certain polymorphisms, such as tandem repeats, which are an
important component of genetic profiling applications. We
also study these forms of statistical dependencies in biological sequences using mutual information.
In Section 2 we develop some theoretical background,
and we derive a threshold function for testing statistical significance. This function admits a dual interpretation either
as the classical log-likelihood ratio from hypothesis testing,
or as the empirical mutual information.
Section 3 contains our experimental results. In Section
3.1 we present our empirical findings for the problem of detecting statistical dependency between different parts of a
DNA sequence. Extensive numerical experiments were carried out on certain regions of the maize zmSRp32 gene [11],
which is functionally homologous to the human ASF/SF2 alternative splicing factor. The efficiency of the empirical mutual information in this context is demonstrated. Moreover,
our findings suggest the existence of a biological connection
between the 5′ untranslated region in zmSRp32 and its alternatively spliced exons.
Finally, in Section 3.2, we show how the empirical mutual information can be utilized in the difficult problem of searching DNA sequences for short tandem repeats (STRs), an important task in genetic profiling. We extend the simple hypothesis test of the previous sections to a methodology for testing a DNA string against different probe sequences, in order to detect STRs both accurately and efficiently. Experimental results on DNA sequences from the FBI's combined DNA index system (CODIS) are presented, showing that the
empirical mutual information can be a powerful tool in this
context as well.
2. THEORETICAL BACKGROUND
In this section, we outline the theoretical basis for the mutual information estimators we will later apply to biological
sequences.
Suppose we have two strings of unequal lengths,

    X_1^n = X_1, X_2, . . . , X_n,
    Y_1^M = Y_1, Y_2, Y_3, . . . , Y_M,                          (1)
Hasan Metin Aktulga et al.

where M ≥ n, taking values in a common finite alphabet A. In most of our experiments, M is significantly larger than n; typical values of interest are n ≈ 80 and M ≈ 300. Our main goal is to determine whether or not there is some form of statistical dependence between them. Specifically, we assume that the string X_1^n consists of independent and identically distributed (i.i.d.) random variables X_i with common distribution P(x) on A, and that the random variables Y_i are also i.i.d. with a possibly different distribution Q(y). Let {W(y | x)} be a family of conditional distributions, or channel, with the property that, when the input distribution is P, the output has distribution Q, that is, ∑_{x∈A} P(x)W(y | x) = Q(y) for all y. We wish to differentiate between the following two scenarios:
(i) Independence: X_1^n and Y_1^M are independent.
(ii) Dependence: first X_1^n is generated, then an index J ∈ {1, 2, . . . , M − n + 1} is chosen in an arbitrary way, and Y_J^{J+n−1} is generated as the output of the discrete memoryless channel W with input X_1^n; that is, for each j = 1, 2, . . . , n, the conditional distribution of Y_{j+J−1} given X_1^n is W(y | X_j). Finally, the rest of the Y_i's are generated i.i.d. according to Q. (To avoid the trivial case where both scenarios are identical, we assume that the rows of W are not all equal to Q, so that in the second scenario X_1^n and Y_J^{J+n−1} are actually not independent.)
It is important at this point to note that although neither of these two cases is biologically realistic as a description of the elements in a genomic sequence, it turns out that
this set of assumptions provides a good operational starting
point: the experimental results reported in Section 3 clearly
indicate that, in practice, the resulting statistical methods obtained under the present assumptions can provide accurate
and biologically relevant information. Of course, the natural next step in any application is the careful examination of
the corresponding findings, either through purely biological
considerations or further testing.
To distinguish between (i) and (ii), we look at every possible alignment of X_1^n with Y_1^M, and we estimate the mutual information between them. Recall that for two random variables X, Y with marginal distributions P(x), Q(y), respectively, and joint distribution V(x, y), the mutual information between X and Y is defined as

    I(X; Y) = ∑_{x,y∈A} V(x, y) log [ V(x, y) / (P(x)Q(y)) ].        (2)

Recall also that I(X; Y) is always nonnegative, and it equals zero if and only if X and Y are independent. The logarithms above and throughout the paper are taken to base 2, log = log_2, so that I(X; Y) can be interpreted as the number of bits of information that each of these two random variables carries about the other (cf. [12]).
In order to distinguish between the two scenarios above, we compute the empirical mutual information between X_1^n and each contiguous substring of Y_1^M of length n: for each j = 1, 2, . . . , M − n + 1, let p̂_j(x, y) denote the joint empirical distribution of (X_1^n, Y_j^{j+n−1}); that is, let p̂_j(x, y) be the proportion of the n positions in (X_1, Y_j), (X_2, Y_{j+1}), . . . , (X_n, Y_{j+n−1}) where (X_i, Y_{j+i−1}) equals (x, y). Similarly, let p̂(x) and q̂_j(y) denote the empirical distributions of X_1^n and Y_j^{j+n−1}, respectively. We define the empirical (per-symbol) mutual information Î_j(n) between X_1^n and Y_j^{j+n−1} by applying (2) to the empirical instead of the true distributions, so that

    Î_j(n) = ∑_{x,y∈A} p̂_j(x, y) log [ p̂_j(x, y) / (p̂(x)q̂_j(y)) ].    (3)

The law of large numbers implies that as n → ∞, we have p̂(x) → P(x), q̂_j(y) → Q(y), and p̂_j(x, y) converges to the true joint distribution of X, Y. Clearly, this implies that in scenario (i), where X_1^n and Y_1^M are independent, Î_j(n) → 0, for any fixed j, as n → ∞. On the other hand, in scenario (ii), Î_J(n) converges to I(X; Y) > 0, where the two random variables X, Y are such that X has distribution P and the conditional distribution of Y given X = x is W(y | x).
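As a concrete illustration, the per-symbol estimator in (3) can be computed directly from pair counts. The following is a minimal Python sketch (the function and variable names are ours, not the paper's):

```python
from collections import Counter
from math import log2

def empirical_mi(x, y):
    """Per-symbol empirical mutual information, in bits, between two
    equal-length strings, following eq. (3): a sum over observed pairs
    of p_j(a,b) * log2( p_j(a,b) / (p(a) q_j(b)) )."""
    if len(x) != len(y):
        raise ValueError("strings must be aligned, i.e. of equal length")
    n = len(x)
    joint = Counter(zip(x, y))   # counts of aligned symbol pairs
    px = Counter(x)              # marginal counts for the first string
    qy = Counter(y)              # marginal counts for the second string
    # p_ab / (p_a q_b) = (c/n) / ((px/n)(qy/n)) = c*n / (px*qy)
    return sum(c / n * log2(c * n / (px[a] * qy[b]))
               for (a, b), c in joint.items())
```

For two identical strings with uniform symbol frequencies over the four-letter DNA alphabet the estimate is 2 bits per symbol; for a constant string aligned against anything it is 0, since the joint counts then factor exactly into the marginals.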
In passing we should point out that there are other methods of checking statistical (in)dependence, for instance, the randomization or permutation tests discussed in [13, 14].
2.1. An independence test based on mutual information
We propose to use the following simple test for detecting dependence between X_1^n and Y_1^M. Choose and fix a threshold τ > 0, and compute the empirical mutual information Î_j(n) between X_1^n and each contiguous substring Y_j^{j+n−1} of length n from Y_1^M. If Î_j(n) is larger than τ for some j, declare that the strings X_1^n and Y_j^{j+n−1} are dependent; otherwise, declare that they are independent.
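In code, the sliding-window test of this subsection might look as follows. This is a self-contained sketch with our own naming; it returns 0-based alignment positions, whereas the text indexes j from 1:

```python
from collections import Counter
from math import log2

def scan_for_dependence(x, y, tau):
    """Slide the probe x over y and compute the empirical per-symbol
    mutual information of eq. (3) at every alignment; return the
    alignments whose estimate exceeds the threshold tau."""
    n = len(x)
    px = Counter(x)              # probe marginal counts, fixed across j
    hits = []
    for j in range(len(y) - n + 1):
        w = y[j:j + n]
        joint, qy = Counter(zip(x, w)), Counter(w)
        mi = sum(c / n * log2(c * n / (px[a] * qy[b]))
                 for (a, b), c in joint.items())
        if mi > tau:
            hits.append((j, mi))
    return hits
```

An embedded exact copy of a periodic probe produces a peak of 2 bits per symbol at the matching alignment; cyclically shifted alignments inside the repeat also score high, which is consistent with the pattern-insensitivity discussed in Section 3.2.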
Before examining the issue of selecting the value of the threshold τ, we note that this statistic is identical to the (normalized) log-likelihood ratio between the above two hypotheses. To see this, observe that expanding the definition of p̂_j(x, y) in Î_j(n), we can simply rewrite

    Î_j(n) = ∑_{x,y∈A} (1/n) ∑_{i=1}^n I_{{(X_i, Y_{j+i−1})}}(x, y) log [ p̂_j(x, y) / (p̂(x)q̂_j(y)) ]
           = (1/n) ∑_{i=1}^n ∑_{x,y∈A} I_{{(X_i, Y_{j+i−1})}}(x, y) log [ p̂_j(x, y) / (p̂(x)q̂_j(y)) ],    (4)

where the indicator function I_{{(X_i, Y_{j+i−1})}}(x, y) equals 1 if (X_i, Y_{j+i−1}) = (x, y) and is equal to zero otherwise. Then,


    Î_j(n) = (1/n) ∑_{i=1}^n log [ p̂_j(X_i, Y_{j+i−1}) / (p̂(X_i)q̂_j(Y_{j+i−1})) ]
           = (1/n) log { [∏_{i=1}^n p̂_j(X_i, Y_{j+i−1})] / ([∏_{i=1}^n p̂(X_i)][∏_{i=1}^n q̂_j(Y_{j+i−1})]) },    (5)

which is exactly the normalized logarithm of the ratio between the joint empirical likelihood ∏_{i=1}^n p̂_j(X_i, Y_{j+i−1}) of the two strings, and the product of their empirical marginal likelihoods [∏_{i=1}^n p̂(X_i)][∏_{i=1}^n q̂_j(Y_{j+i−1})].
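The identity between the sum form (3) and the likelihood-ratio form (5) is easy to confirm numerically; the sketch below computes both (names are ours):

```python
from collections import Counter
from math import log2, prod

def mi_sum_form(x, y):
    """Eq. (3): sum over symbol pairs of p_j log2( p_j / (p q_j) )."""
    n = len(x)
    joint, px, qy = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum(c / n * log2(c * n / (px[a] * qy[b]))
               for (a, b), c in joint.items())

def mi_llr_form(x, y):
    """Eq. (5): (1/n) * log2 of the joint empirical likelihood over the
    product of the two marginal empirical likelihoods."""
    n = len(x)
    joint, px, qy = Counter(zip(x, y)), Counter(x), Counter(y)
    num = prod(joint[a, b] / n for a, b in zip(x, y))
    den = prod(px[a] / n for a in x) * prod(qy[b] / n for b in y)
    return log2(num / den) / n
```

Both functions agree to floating-point precision on any pair of short aligned strings, as the grouping argument leading to (5) predicts.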


2.2. Probabilities of error


There are two kinds of errors this test can make: declaring that two strings are dependent when they are not, and vice versa. The actual probabilities of these two types of errors depend on the distribution of the statistic Î_j(n). Since this distribution is independent of j, we take j = 1 and write Î(n) for the normalized log-likelihood ratio Î_1(n). The next two subsections present some classical asymptotics for Î(n).

Scenario (i): independence

We already noted that in this case Î(n) converges to zero as n → ∞, and below we shall see that this convergence takes place at a rate of approximately 1/n. Specifically, Î(n) → 0 with probability one, and a standard application of the multivariate central limit theorem for the joint empirical distribution p̂_j shows that nÎ(n) converges in distribution to a (scaled) χ² random variable. This is a classical result in statistics [15, 16], and, in the present context, it was rederived by Hagenauer et al. [17, 18]. We have

    (2 ln 2) n Î(n) → Z ∼ χ²((|A| − 1)²),                        (6)

where Z has a χ² distribution with (|A| − 1)² degrees of freedom, and where |A| denotes the size of the data alphabet. Therefore, for a fixed threshold τ > 0 and large n, we can estimate the probability of error as

    P_{e,1} = Pr{declare dependence | independent strings}
            = Pr{Î(n) > τ | independent strings}
            ≈ Pr{Z > (2 ln 2)nτ},                                (7)

where Z is as before. Therefore, for large n the error probability P_{e,1} decays like the tail of the χ² distribution function,

    P_{e,1} ≈ Γ(k, (τ ln 2)n) / Γ(k),                            (8)

where k = (|A| − 1)²/2, and Γ(·), Γ(·, ·) denote the Gamma function and the incomplete Gamma function, respectively. Although this is fairly implicit, we know that the tail of the χ² distribution decays like e^{−x/2} as x → ∞; therefore,

    P_{e,1} ≈ exp{−(τ ln 2)n},                                   (9)

where this approximation is to first order in the exponent.

Scenario (ii): dependence

In this case, the asymptotic behavior of the test statistic Î(n) is somewhat different. Suppose as before that the random variables X_1^n are i.i.d. with distribution P, and that the conditional distribution of each Y_i given X_1^n is W(Y_i | X_i), for some fixed family of conditional distributions W(y | x); this makes the random variables Y_1^n i.i.d. with distribution Q. We mentioned in the last section that under the second scenario, Î(n) converges to the true underlying value I = I(X; Y) of the mutual information, but, as we show below, the rate of this convergence is slower than the 1/n rate of scenario (i): here, Î(n) → I with probability one, but only at rate 1/√n, in that √n[Î(n) − I] converges in distribution to a Gaussian,

    √n [Î(n) − I] → T ∼ N(0, σ²),                                (10)

where the resulting variance σ² is given by

    σ² = Var( log [W(Y | X)/Q(Y)] )
       = ∑_{x,y∈A} P(x)W(y | x) ( log [W(y | x)/Q(y)] − I )².    (11)

An outline of the proof of (10) is given below; for another derivation see [19]. Therefore, for any fixed threshold τ < I and large n, the probability of error satisfies

    P_{e,2} = Pr{declare independence | W-dependent strings}
            = Pr{Î(n) ≤ τ | W-dependent strings}
            ≈ Pr{T ≤ −(I − τ)√n}
            ≈ exp{ −((I − τ)² / (2σ²)) n },                      (12)

where the last approximation sign indicates equality to first order in the exponent. Thus, despite the fact that Î(n) converges at different speeds in the two scenarios, both error probabilities P_{e,1} and P_{e,2} decay exponentially with the sample size n.
To see why (10) holds, it is convenient to use the alternative expression for Î(n) given in (5). Using this, and recalling that Î(n) = Î_1(n), we obtain

    √n [Î(n) − I] = √n [ (1/n) ∑_{i=1}^n log ( p̂_1(X_i, Y_i) / (p̂(X_i)q̂_1(Y_i)) ) − I ].    (13)

Since the empirical distributions converge to the corresponding true distributions, for large n it is straightforward to justify the approximation

    √n [Î(n) − I] ≈ √n [ (1/n) ∑_{i=1}^n log ( P(X_i)W(Y_i | X_i) / (P(X_i)Q(Y_i)) ) − I ].    (14)

The fact that this indeed converges in distribution to a N(0, σ²), as n → ∞, easily follows from the central limit theorem, upon noting that the mean of the logarithm in (14) equals I and its variance is σ².
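For a known input distribution P and channel W, the limiting value I and the variance σ² in (11) can be tabulated directly. A small sketch, using dictionaries as distributions (our own naming, not the paper's):

```python
from math import log2

def channel_stats(P, W):
    """Return (I, sigma2): the mutual information of eq. (2) and the
    variance of eq. (11) for log2 W(Y|X)/Q(Y), given an input
    distribution P (x -> P(x)) and channel W (x -> {y: W(y|x)})."""
    # Output distribution Q(y) = sum_x P(x) W(y|x)
    Q = {}
    for x, px in P.items():
        for y, w in W[x].items():
            Q[y] = Q.get(y, 0.0) + px * w
    # Weight and log-ratio of every (x, y) pair with positive mass
    pairs = [(P[x] * W[x][y], log2(W[x][y] / Q[y]))
             for x in P for y in W[x] if W[x][y] > 0]
    I = sum(p * l for p, l in pairs)
    sigma2 = sum(p * (l - I) ** 2 for p, l in pairs)
    return I, sigma2
```

A noiseless binary channel with uniform input gives I = 1 bit and σ² = 0, consistent with (10): the Gaussian fluctuations vanish when the log-ratio is constant; a channel whose rows all equal Q gives I = 0.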
Discussion
From the above analysis it follows that in order for both
probabilities of error to decay to zero for large n (so that we
rule out false positives as well as making sure that no dependent segments are overlooked) the threshold needs to be

Figure 1: Alternative splicings of the zmSRp32 gene in maize. The gene consists of a number of exons (shaded boxes) and introns (lines) flanked by the 5′ and 3′ untranslated regions (white boxes). RNA transcripts (pre-mRNA) are processed to yield mRNA molecules used as templates for protein synthesis. Alternative pre-mRNA splicing generates different mRNA templates from the same transcripts, by selecting either alternative exons or alternative introns. The regions discussed in the text are identified by indices corresponding to the nucleotide position in the original DNA sequence.

strictly between 0 and I = I(X; Y). For that, we need to have some prior information about the value of I, that is, of the level of dependence we are looking for. If the value of I were actually known and a fixed threshold τ ∈ (0, I) was chosen independent of n, then both probabilities of error would decay exponentially fast, but with typically very different exponents:

    P_{e,1} ≈ exp{−(τ ln 2)n},    P_{e,2} ≈ exp{ −((I − τ)² / (2σ²)) n };    (15)

recall the expressions in (9) and (12). Clearly, balancing the two exponents also requires knowledge of the value of σ² in the case when the two strings are dependent, which, in turn, requires full knowledge of the marginal distribution P and the channel W. Of course this is unreasonable, since we cannot specify in advance the exact kind and level of dependence we are actually trying to detect in the data.
A practical (and standard) approach is as follows: since the probability of error of the first kind P_{e,1} only depends on τ (at least for large n), and since in practice declaring false positives is much more undesirable than overlooking potential dependence, in our experiments we decide on an acceptably small false-positive probability ε, and then select τ based on the above approximation, by setting P_{e,1} ≈ ε in (7).
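If, instead of inverting the χ² tail in (7) (which requires an incomplete-gamma inverse), one is content with the cruder first-order approximation (9), the threshold solving exp{−(τ ln 2)n} = ε has a closed form. A one-line sketch under that simplifying assumption:

```python
from math import log2

def threshold_first_order(eps, n):
    """Threshold tau whose first-order false-positive probability is eps,
    from eq. (9): exp(-(tau ln 2) n) = eps  =>  tau = log2(1/eps) / n."""
    return log2(1.0 / eps) / n
```

For ε = 0.001 and a probe of length n = 80, this gives τ ≈ 0.125 bits per symbol; the threshold shrinks like 1/n, matching the 1/n convergence rate of Î(n) under independence.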
3. EXPERIMENTAL RESULTS

In this section, we apply the mutual information test described above to biological data. First we show that it can be used effectively to identify statistical dependence between regions of the maize zmSRp32 gene that may be involved in alternative processing (splicing) of pre-mRNA transcripts. Then we show how the same methodology can be easily adapted to the problem of identifying tandem repeats. We present experimental results on DNA sequences from the FBI's combined DNA index system (CODIS), which clearly indicate that the empirical mutual information can be a powerful tool for this computationally intensive task.

3.1. Detecting DNA sequence dependencies

All of our experiments were performed on the maize zmSRp32 gene [11]. This gene belongs to a group of genes that are functionally homologous to the human ASF/SF2 alternative splicing factor. Interestingly, these genes encode alternative splicing factors in maize and yet are themselves also alternatively spliced. The gene zmSRp32 is coded by 4735 nucleotides and has four alternative splicing variants. Two of these four variants are due to different splicings of this gene, between positions 1–369 and 3243–4220, respectively, as shown in Figure 1. The results given here are primarily from experiments on these segments of zmSRp32.
In order to understand and quantify the amount of correlation between different parts of this gene, we computed the mutual information between all functional elements, including exons, introns, and the 5′ untranslated region. As before, we denote the shorter sequence of length n by X_1^n = (X_1, X_2, . . . , X_n) and the longer one of length M by Y_1^M = (Y_1, Y_2, . . . , Y_M). We apply the simple mutual information estimator Î_j(n) defined in (3) to estimate the mutual information between X_1^n and Y_j^{j+n−1} for each j = 1, 2, . . . , M − n + 1, and we plot the dependency graph of Î_j = Î_j(n) versus j; see Figure 2. The threshold τ is computed, according



0.06

0.08

0.05

0.06

Mutual information

Mutual information

0.07

0.05
0.04
0.03
0.02

0.04
0.03
0.02
0.01

0.01
0
3200 3300 3400 3500 3600 3700 3800 3900
Base position on zmSRp32 gene sequence
(a)

0
3200 3300 3400 3500 3600 3700 3800 3900
Base position on zmSRp32 gene sequence
(b)

Figure 2: Estimated mutual information between the exon located between bases 1369 and each contiguous subsequence of length 369
in the intron between bases 32434220. The estimates were computed both for the original sequences in the standard four-letter alphabet
{A, C, G, T } (shown in (a)), as well as for the corresponding transformed sequences for the two-letter purine/pyrimidine grouping {AG, CT }
(shown in (b)).

to (7), by setting ε, the probability of false positives, equal to 0.001; it is represented by a (red) straight horizontal line in the figures.
In order to amplify the effects of regions of potential dependency in various segments of the zmSRp32 gene, we computed the mutual information estimates Î_j on the original strings over the regular four-letter alphabet {A, C, G, T}, as well as on transformed versions of the strings where pairs of letters were grouped together, using either the Watson-Crick pairing {AT, CG} or the purine-pyrimidine pairing {AG, CT}. In our results we observed that such groupings are often helpful in identifying dependency; this is clearly illustrated by the estimates shown in Figures 2 and 3. Sometimes the {AT, CG} grouping produces better results, while in other cases the purine-pyrimidine grouping finds new dependencies.
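The two groupings are simple alphabet projections, which can be applied before estimating mutual information. A one-line sketch using Python's str.translate (the R/Y and W/S output labels are our own shorthand, not the paper's):

```python
# Purine/pyrimidine grouping {AG, CT}: A, G -> R and C, T -> Y.
PURINE_PYRIMIDINE = str.maketrans("AGCT", "RRYY")
# Watson-Crick grouping {AT, CG}: A, T -> W and C, G -> S.
WATSON_CRICK = str.maketrans("ATCG", "WWSS")

def group(seq, table):
    """Project a DNA string onto a two-letter alphabet."""
    return seq.translate(table)
```

Reducing |A| from 4 to 2 also shrinks the degrees of freedom (|A| − 1)² in (6) from 9 to 1, which lowers the significance threshold for the same false-positive level.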
Figure 2 strongly suggests that there is significant dependence between the bases in positions 1–369 and certain substrings of the bases in positions 3243–4220. While the 1–369 region contains the 5′ untranslated sequences, an intron, and the first protein coding exon, the 3243–4220 sequence encodes an intron that undergoes alternative splicing. After narrowing down the mutual information calculations to the 5′ untranslated region (5′ UTR) in positions 1–78 and the 5′ UTR intron in positions 78–268, we found that the initially identified dependency was still present; see Figure 3. A close inspection of the resulting mutual information graphs indicates that the dependency is restricted to the alternative exons embedded into the intron sequences, in positions 3688–3800 and 3884–4254.
These findings suggest that there might be a deeper connection between the 5′ UTR DNA sequences and the DNA sequences that undergo alternative splicing. The UTRs are multifunctional genetic elements that control gene expression by determining mRNA stability and the efficiency of mRNA translation. As in the zmSRp32 maize gene, they can provide multiple alternatively spliced variants for more complex regulation of mRNA translation [20]. They also contain a number of regulatory motifs that may affect many aspects of mRNA metabolism. Our observations can therefore be interpreted as suggesting that the maize zmSRp32 5′ UTR contains information that could be utilized in the process of alternative splicing, yet another important aspect of mRNA metabolism. The fact that the value of the empirical mutual information between the 5′ UTR and the DNA sequences that encode alternatively spliced elements is significantly greater than zero clearly points in that direction. Further experimental work could be carried out to verify the existence, and further explore the meaning, of these newly identified statistical dependencies.
We should note that there are many other sequence matching techniques, the most popular of which is probably the celebrated BLAST algorithm. BLAST's working principles are very different from those underlying our method. As a first step, BLAST searches a database of biological sequences for various small words found in the query string. It identifies sequences that are candidates for potential matches, and thus eliminates a huge portion of the database containing sequences unrelated to the query. In the second step, small word matches in every candidate sequence are extended by means of a Smith-Waterman-type local alignment algorithm. Finally, these extended local alignments are combined with some scoring schemes, and the highest scoring alignments obtained are returned. Therefore, BLAST requires a considerable fraction of exact matches to find sequences related to each other. However, our approach does not enforce any such requirements. For example, if two sequences do not have any exact matches at all, but the characters in one sequence are a characterwise encoding of the ones in the other sequence, then BLAST would fail to produce any significant matches (without corresponding substitution matrices), while our algorithm would detect a high degree of dependency. This is illustrated by the results in the following section, where the presence of certain repetitive patterns in Y_1^M is revealed through matching it to a probe sequence X_1^n which does not contain the repetitive pattern, but is statistically similar to the pattern sought.

Figure 3: Dependency graph of Î_j versus j for the zmSRp32 gene, using different alphabet groupings: in (a) and (b), we plot the estimated mutual information between the exon found between bases 1–78 and each subsequence of length 78 in the intron located between bases 3243–4220. Plot (a) shows estimates over the original four-letter alphabet {A, C, G, T}, and (b) shows the corresponding estimates over the Watson-Crick pairs {AT, CG}. Similarly, plots (c) and (d) contain the estimated mutual information between the intron located in bases 79–268 and all corresponding subsequences of the intron between bases 3243–4220. Plot (c) shows estimates over the original alphabet, and plot (d) over the two-letter purine/pyrimidine grouping {AG, CT}. Plots (e) and (f) show the estimated mutual information between the 5′ untranslated region and all corresponding subsequences of the intron between bases 3243–4220, for the four-letter alphabet (in (e)), and for the two-letter purine/pyrimidine grouping {AG, CT} (in (f)).

3.2. Application to tandem repeats

Here we further explore the utility of the mutual information statistic, and we examine its performance on the problem of detecting short tandem repeats (STRs) in genomic sequences. STRs, usually found in noncoding regions, are made of back-to-back repetitions of a sequence which is at least two bases long and generally shorter than 15 bases. The period of an STR is defined as the length of the repeated sequence in it. Owing to their short lengths, STRs survive mutations well, and can easily be amplified using PCR without producing erroneous data. Although there are many well-identified STRs in the human genome, interestingly, the number of repetitions at any specific locus varies significantly among individuals; that is, STRs are polymorphic DNA fragments. These properties make STRs suitable tools for determining genetic profiles, and STR typing has become a prevalent method in forensic investigations. Long repetitive sequences have also been observed in genomic sequences, but have not gained as much attention since they cannot survive environmental degradation and do not produce high-quality data from PCR analysis.
Several algorithms have been proposed for detecting STRs in long DNA strings with no prior knowledge about the size and the pattern of the repetition. These algorithms are mostly based on pattern matching, and they all have high time-complexity. Finding short repetitions in a long sequence is a challenging problem. When the query string is a DNA segment that contains many insertions, deletions, or substitutions due to mutations, the problem becomes even harder. Exact- and approximate-pattern matching algorithms need to be modified to account for these mutations, and this renders them complex and inefficient. To overcome these limitations, we propose a statistical approach using an adaptation of the method described in the previous sections.
In the United States, the FBI has decided on 13 loci to be used as the basis for genetic profile analysis, and they continue to be the standard in this area. To demonstrate how our approach can be used for STR detection, we chose to use sequences from the FBI's combined DNA index system (CODIS): the SE33 locus contained in the GenBank sequence V00481, and the VWA locus contained in the GenBank sequence M25858. The periods of the STRs found in CODIS typically cover only a narrow range, and do not exhibit enough variability to demonstrate how our approach would perform under divergent conditions. For this reason, we used the V00481 sequence as is, but on M25858 we artificially introduced an STR with period 11, by substituting bases 2821–2920 (where we know that there are no other repeating sequences) with 9 tandem repeats of ACTTTGCCTAT. We also introduced base substitutions, deletions, and insertions on our artificial STR to imitate mutations.
Let Y_1^M = (Y_1, Y_2, . . . , Y_M) denote the DNA sequence in which we are looking for STRs. The gist of our approach is simply to choose a periodic probe sequence of length n, say, X_1^n = (X_1, X_2, . . . , X_n) (typically much shorter than Y_1^M), and then to calculate the empirical mutual information Î_j = Î_j(n) between X_1^n and each of its possible alignments with Y_1^M. In order to detect the presence of STRs, the values of the empirical mutual information in regions where STRs do appear should be significantly larger than zero, where "significantly" means larger than the corresponding estimates in ordinary DNA fragments containing no STRs. Obviously, the results will depend heavily on the exact form of the probe sequence. Therefore, it is critical to decide on the method for selecting (a) the length, and (b) the exact contents of X_1^n. The length of X_1^n is crucial; if it is too short, then X_1^n itself is likely to appear often in Y_1^M, producing many large values of the empirical mutual information and making it hard to distinguish between STRs and ordinary sequences. Moreover, in that case there is little hope that the analysis of the previous section (which was carried out for long sequences X_1^n) will provide useful estimates for the probability of error. If, on the other hand, X_1^n is too long, then any alignment of the probe X_1^n with Y_1^M will likely also contain too many irrelevant base pairs. This will produce negligibly small mutual information estimates, again making it impossible to detect STRs. These considerations are illustrated by the results in Figure 4.
As for the contents of the probe sequence X_1^n, the best choice would be to take a segment X_1^n containing an exact match to an STR present in Y_1^M. But in most of the interesting applications, this is of course unavailable to us. A second best choice might be a sequence X_1^n that contains a segment of the same pattern as the STR present in Y_1^M, where we say that two sequences have the same pattern if each one can be obtained from the other via a permutation of the letters in the alphabet (cf. [21, 22]). For example, TCTA and GTGC have the same pattern, whereas TCTA and CTAT do not (although they do have the same empirical distribution). Thus, if X_1^n contains the exact same pattern as the periodic part of the STR to be detected, and X̃_1^n has the same pattern as X_1^n, then, a priori, either choice should be equally effective at detecting the STR under consideration; see Figure 5. (This observation also shows that a single probe X_1^n may in fact be appropriate for locating more than a single STR, e.g., STRs with the same pattern as X_1^n, as in Figure 5, or with the same period, as in Figure 4.) The problem with this choice is, again, that the exact patterns of STRs present in a DNA sequence are not available to us in advance, and we cannot expect all STRs in a given sequence to be of the same pattern. Even though both of the above choices for X_1^n are usually not practically feasible, if the sequence Y_1^M is relatively short and contains a single STR whose contents are known, then either choice would produce high-quality data, from which the STR contained in Y_1^M can easily be detected; see Figure 5 for an illustration.
In practice, in addition to the fact that the contents of STRs are not known in advance, there is also the issue that in a long DNA sequence there are often many different STRs, and a unique probe will not match all of them exactly. But since STRs usually have a period between 2 and 15 bases, we can actually run our method for all possible choices of repetition sequences, and detect all STRs in the given query sequence Y_1^M. The number of possible probes X_1^n can be drastically reduced by observing that (1) we only need one repeating sequence of each possible pattern, and (2) it suffices to only consider repetition patterns whose period is prime. Note that in view of the earlier discussion and the results shown in Figure 4, the period of the repeating part of X_1^n is likely to

Figure 4: Dependency graph of the GenBank sequence Y_1^M = V00481, for a probe sequence X_1^n which is a repetition of AGGT, of length (a) 12, or (b) 60. The sequence Y_1^M contains STRs that are repetitions of the pattern AAAG, in the following regions: (i) there is a repetition of AAAG between bases 62–108; (ii) AAAG is intervened by AG and AAGG until base 138; (iii) again between 138–294 there are repetitions of AAAG, some of which are modified by insertions and substitutions. In (a) our probe is too short, and it is almost impossible to distinguish the SE33 locus from the rest. However, in (b) the SE33 locus is singled out by the two big peaks in the mutual information estimates; the shorter peak between the two larger ones is due to the interventions described above. Note that the STRs were identified by a probe sequence that was a repetition of a pattern different from that of the repeating part of the STRs themselves, but of the same period.

[Figure 5 appears here: two panels, (a) and (b), plotting mutual information against position (roughly 0 to 250) within the sequence.]

Figure 5: Dependency graph of the VWA locus contained in GenBank sequence M25858 for a probe sequence X1n with n = 12, which is a repetition of (a) TCTA, an exactly matching probe, (b) GTGC, a completely different probe, but of the exact same period. In both cases, we have chosen X1n to be long enough to suppress unrelated information. Note that the results in (a) and (b) are almost identical. The VWA locus contains an STR of TCTA between positions 44-123. This STR is apparent in both dependency graphs by forming a periodic curve with high correlation.

be more important than the actual contents. For example, if


we were to apply our method for finding STRs in Y1M with a
probe X1n whose period is 5 bases long, then many STRs with
a period that is a multiple of 5 should peak in the dependency
chart, thus allowing us to detect their approximate positions
in Y1M . Clearly, probes that consist of very short repeats, such
as AAA . . . , should be avoided. The importance of choosing
an X1n with the correct period is illustrated in Figure 6.
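The probe-reduction argument above is easy to make concrete. The sketch below is ours, not the authors' implementation, and the helper names are illustrative; it enumerates the prime periods in the 2-15 base range and the candidate repeating patterns of a given period:

```python
from itertools import product

def is_prime(p):
    # trial division is adequate for the tiny periods involved here
    return p >= 2 and all(p % d for d in range(2, int(p ** 0.5) + 1))

# STR periods of interest are 2-15 bases; only prime periods need probes,
# since an STR whose period is a multiple of p still peaks against a
# period-p probe (see the discussion of the period-5 example above)
prime_periods = [p for p in range(2, 16) if is_prime(p)]

def candidate_patterns(period):
    """All 4^period repeating patterns of a given period over A, C, G, T."""
    return ("".join(s) for s in product("ACGT", repeat=period))
```

Every period between 2 and 15 has a prime factor in the resulting list {2, 3, 5, 7, 11, 13}, so an STR of any such period should still respond to a probe built from one of these prime periods.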
The results in Figures 4, 5, and 6 clearly indicate that the
proposed methodology is very effective at detecting the presence of STRs, although at first glance it may appear that it
cannot provide precise information about their start-end positions and their repeat sequences. But this final task can easily be accomplished by reevaluating Y1M near the peak in the
dependency graph, for example, by feeding the relevant parts
separately into one of the standard string matching-based
tandem repeat algorithms. Thus, our method can serve as an
initial filtering step which, combined with an exact pattern
matching algorithm, provides a very accurate and efficient
method for the identification of STRs.
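As an illustration of this filtering step, the following sketch slides a repeating probe along the query sequence and computes the plug-in (empirical) mutual information between aligned bases in each window; the function names and estimator details are our assumptions, not the authors' exact implementation:

```python
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) from two aligned symbol sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in joint.items())

def dependency_graph(y, pattern, n):
    """Empirical MI between a length-n repetition of `pattern` and every
    length-n window of the query sequence y (one value per alignment)."""
    probe = (pattern * (n // len(pattern) + 1))[:n]
    return [empirical_mi(probe, y[i:i + n]) for i in range(len(y) - n + 1)]
```

Each window costs O(n), and the scan itself is linear in the query length M, consistent with the running-time discussion in the text.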
In terms of its practical implementation, note that our
approach has a linear running time O(M), where M is the
length of Y1M . The empirical mutual information of course
needs to be evaluated for every possible alignment of Y1M and
X1n , with each such calculation done in O(n) steps, where n is
the length of X1n . But n is typically no longer than a few hundred bases, and, at least to first-order, it can be considered
constant. Also, repeating this process for all possible repeat

[Figure 6 appears here: two panels, (a) and (b), plotting mutual information against base position (0 to 6000) on the GenBank M25858 sequence.]

Figure 6: In these charts we use the modified GenBank sequence M25858, which contains the VWA locus in CODIS between positions 1683-1762 and the artificial STR introduced by us at 2821-2920. The repeat sequence of the VWA locus is TCTA, and the repeat sequence of the artificial STR is ACTTTGCCTAT. In (a), the probe X1n has length n = 88 and consists of repetitions of AGGT. Here the repeating sequence of the VWA locus (which has period 4) is clearly indicated by the peak, whereas the artificial tandem repeat (which has period 11) does not show up in the results. The small peak around position 2100 is due to a very noisy STR, again with a 4-base period. In (b), the probe X1n again has length n = 88, and it consists of repetitions of CATAGTTCGGA. This produces the opposite result: the artificial STR is clearly identified, but there is no indication of the STR present at the VWA locus.

periods does not affect the complexity of our method by
much, since the number of such periods is quite small and
can also be considered to be constant. And, as mentioned
above, choosing probes X1n only containing repeating segments with a prime period further improves the running
time of our method.
We, therefore, conclude that (a) the empirical mutual information appears in this case to be a very effective tool for
detecting STRs; and (b) selecting the length and repetition
period of the probe sequence X1n is crucial for identifying tandem repeats accurately.

4.

CONCLUSIONS

Biological information is stored in the form of monomer strings composed of conserved biomolecular sequences. According to Manfred Eigen, "The differentiable characteristic of living systems is information. Information assures the controlled reproduction of all constituents, thereby ensuring conservation of viability." Hoping to reveal novel, potentially important biological phenomena, we employ information-theoretic tools, especially the notion of mutual information, to detect statistically dependent segments of biosequences. The biological implications of the existence of such correlations are deep, and they themselves remain unresolved. The proposed approach may provide a powerful key to fundamental advances in understanding and quantifying biological information.

This work addresses two specific applications based on the proposed tools. From the experimental analysis carried out on regions of the maize zmSRp32 gene, our findings suggest the existence of a biological connection between the 5' untranslated region in zmSRp32 and its alternatively spliced exons, potentially indicating the presence of novel alternative splicing mechanisms or structural scaffolds. Secondly, through extensive analysis of CODIS data, we show that our approach is particularly well suited for the problem of discovering short tandem repeats, an application of importance in genetic profiling studies.

ACKNOWLEDGMENTS

This research was supported in part by the NSF Grants CCF-0513636 and DMS-0503742, and the NIH Grant R01 GM068959-01.

REFERENCES
[1] R. Steuer, J. Kurths, C. O. Daub, J. Weise, and J. Selbig, "The mutual information: detecting and evaluating dependencies between variables," Bioinformatics, vol. 18, supplement 2, pp. S231-S240, 2002.
[2] Z. Dawy, B. Goebel, J. Hagenauer, C. Andreoli, T. Meitinger, and J. C. Mueller, "Gene mapping and marker clustering using Shannon's mutual information," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 3, no. 1, pp. 47-56, 2006.
[3] E. Segal, Y. Fondufe-Mittendorf, L. Chen, et al., "A genomic code for nucleosome positioning," Nature, vol. 442, no. 7104, pp. 772-778, 2006.
[4] Y. Osada, R. Saito, and M. Tomita, "Comparative analysis of base correlations in 5' untranslated regions of various species," Gene, vol. 375, no. 1-2, pp. 80-86, 2006.
[5] M. Kozak, "Initiation of translation in prokaryotes and eukaryotes," Gene, vol. 234, no. 2, pp. 187-208, 1999.
[6] D. A. Reddy and C. K. Mitra, "Comparative analysis of transcription start sites using mutual information," Genomics, Proteomics and Bioinformatics, vol. 4, no. 3, pp. 189-195, 2006.
[7] D. A. Reddy, B. V. L. S. Prasad, and C. K. Mitra, "Comparative analysis of core promoter region: information content from mono and dinucleotide substitution matrices," Computational Biology and Chemistry, vol. 30, no. 1, pp. 58-62, 2006.
[8] S. A. Shabalina, A. Y. Ogurtsov, I. B. Rogozin, E. V. Koonin, and D. J. Lipman, "Comparative analysis of orthologous eukaryotic mRNAs: potential hidden functional signals," Nucleic Acids Research, vol. 32, no. 5, pp. 1774-1782, 2004.
[9] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, "Exploiting the past and the future in protein secondary structure prediction," Bioinformatics, vol. 15, no. 11, pp. 937-946, 1999.
[10] G. Battail, "Should genetics get an information-theoretic education? Genomes as error-correcting codes," IEEE Engineering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 34-45, 2006.
[11] H. Gao, W. J. Gordon-Kamm, and L. A. Lyznik, "ASF/SF2-like maize pre-mRNA splicing factors affect splice site utilization and their transcripts are alternatively spliced," Gene, vol. 339, no. 1-2, pp. 25-37, 2004.
[12] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[13] P. I. Good, Resampling Methods, Birkhäuser, Boston, Mass, USA, 2005.
[14] B. Manly, Randomization, Bootstrap and Monte Carlo Methods in Biology, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1997.
[15] E. L. Lehmann and J. P. Romano, Testing Statistical Hypotheses, Springer, New York, NY, USA, 3rd edition, 2005.
[16] M. J. Schervish, Theory of Statistics, Springer, New York, NY, USA, 1995.
[17] J. Hagenauer, Z. Dawy, B. Göbel, P. Hanus, and J. Mueller, "Genomic analysis using methods from information theory," in Proceedings of IEEE Information Theory Workshop (ITW '04), pp. 55-59, San Antonio, Tex, USA, October 2004.
[18] B. Goebel, Z. Dawy, J. Hagenauer, and J. C. Mueller, "An approximation to the distribution of finite sample size mutual information estimates," in Proceedings of IEEE International Conference on Communications (ICC '05), vol. 2, pp. 1102-1106, Seoul, Korea, May 2005.
[19] M. Hutter, "Distribution of mutual information," in Advances in Neural Information Processing Systems 14, pp. 399-406, MIT Press, Cambridge, Mass, USA, 2002.
[20] T. A. Hughes, "Regulation of gene expression by alternative untranslated regions," Trends in Genetics, vol. 22, no. 3, pp. 119-122, 2006.
[21] J. Åberg, Yu. M. Shtarkov, and B. J. M. Smeets, "Multialphabet coding with separate alphabet description," in Proceedings of the International Conference on Compression and Complexity of Sequences, pp. 56-65, Positano, Italy, June 1997.
[22] A. Orlitsky, N. P. Santhanam, K. Viswanathan, and J. Zhang, "Limit results on pattern entropy," IEEE Transactions on Information Theory, vol. 52, no. 7, pp. 2954-2964, 2006.


Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 13853, 13 pages
doi:10.1155/2007/13853

Research Article
Motif Discovery in Tissue-Specific Regulatory Sequences
Using Directed Information
Arvind Rao,1 Alfred O. Hero III,1 David J. States,2 and James Douglas Engel3
1 Departments of Electrical Engineering and Computer Science and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
2 Departments of Bioinformatics and Human Genetics, University of Michigan, Ann Arbor, MI 48109, USA
3 Department of Cell and Developmental Biology, University of Michigan, Ann Arbor, MI 48109, USA
Received 1 March 2007; Revised 23 June 2007; Accepted 17 September 2007
Recommended by Teemu Roos
Motif discovery for the identification of functional regulatory elements underlying gene expression is a challenging problem. Sequence inspection often leads to discovery of novel motifs (including transcription factor sites) with previously uncharacterized
function in gene expression. Coupled with the complexity underlying tissue-specific gene expression, there are several motifs that
are putatively responsible for expression in a certain cell type. This has important implications in understanding fundamental biological processes such as development and disease progression. In this work, we present an approach to the identification of motifs
(not necessarily transcription factor sites) and examine its application to some questions in current bioinformatics research. These
motifs are seen to discriminate tissue-specific gene promoter or regulatory regions from those that are not tissue-specific. There
are two main contributions of this work. Firstly, we propose the use of directed information for such classification-constrained
motif discovery, and then use the selected features with a support vector machine (SVM) classifier to find the tissue specificity
of any sequence of interest. Such analysis yields several novel interesting motifs that merit further experimental characterization.
Furthermore, this approach leads to a principled framework for the prospective examination of any chosen motif as a discriminatory motif for a group of coexpressed/coregulated genes, thereby integrating sequence and expression perspectives. We hypothesize
that the discovery of these motifs would enable the large-scale investigation for the tissue-specific regulatory role of any conserved
sequence element identified from genome-wide studies.
Copyright © 2007 Arvind Rao et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Understanding the mechanisms underlying regulation of


tissue-specific gene expression remains a challenging question. While all mature cells in the body have a complete copy
of the human genome, each cell type only expresses those
genes it needs to carry out its assigned task. This includes
genes required for basic cellular maintenance (often called
housekeeping genes) and those genes whose function is
specific to the particular tissue type that the cell belongs to.
Gene expression by way of transcription is the process of
generation of messenger RNA (mRNA) from the DNA template representing the gene. It is the intermediate step before
the generation of functional protein from messenger RNA.
During gene expression (see Figure 1), transcription factor
(TF) proteins are recruited at the proximal promoter of the
gene as well as at sequence elements (enhancers/silencers)
which can lie several hundreds of kilobases from the gene's

transcriptional start site (TSS). The basal transcriptional machinery at the promoter coupled with the transcription factor complexes at these distal, long-range regulatory elements
(LREs) are collectively involved in directing tissue-specific
expression of genes.
One of the current challenges in the post-genomic era
is the principled discovery of such LREs genome-wide. Recently, there has been a community-wide effort (http://
www.genome.gov/ENCODE) to find all regulatory elements
in 1% of the human genome. The examination of the discovered elements would reveal characteristics typical of most
enhancers which would aid their principled discovery and
examination on a genome-wide scale. Some characteristics
of experimentally identified distal regulatory elements [1, 2]
are as follows.
(i) Noncoding elements: distal regulatory elements are
noncoding and can either be intronic or intergenic regions on the genome. Hence, previous models for gene

finding [3] are not directly applicable. With over 98% of the annotated genome being noncoding, the precise localization of regulatory elements that underlie tissue-specific gene expression is a challenging problem.
(ii) Distance/orientation independent: an enhancer can act from variable genomic distances (hundreds of kilobases) to regulate gene expression in conjunction with the proximal promoter, possibly via a looping mechanism [4]. These enhancers can lie upstream or downstream of the actual gene along the genomic locus.
(iii) Promoter dependent: since the action at a distance of these elements involves the recruitment of TFs that direct tissue-specific gene expression, the promoter that they interact with is critical.

[Figure 1 appears here: schematic of a gene locus showing a TF complex and RNA pol. II at the TATA box of the proximal promoter, distal enhancers on either side, the TSS, an exon, and an intron.]

Figure 1: Schematic of transcriptional regulation. Sequence motifs at the promoter and the distal regulatory elements together confer specificity of gene expression via TF binding.

Although there are instances where a gene harbors tissue-specific activity at the promoter itself, the role of long-range elements (LREs) remains of interest, for example, for a detailed understanding of their regulatory role in gene expression during biological processes like organ development and disease progression [5]. We seek to develop computational strategies to find novel LREs genome-wide that govern tissue-specific expression for any gene of interest. A common approach for their discovery is the use of motif-based sequence signatures. Any sequence element can then be scanned for such a signature and its tissue specificity can be ascertained [6].

Thus, our primary question in this regard is: is there a discriminating sequence property of LRE elements that determines tissue-specific gene expression; more particularly, are there any sequence motifs in known regulatory elements that can aid discovery of new elements [7]? To answer this, we examine known tissue-specific regulatory elements (promoters and enhancers) for motifs that discriminate them from a background set of neutral elements (such as housekeeping gene promoters). For this study, the datasets are derived from the following sources.

(i) Promoters of tissue-specific genes: before the widespread discovery of long-range regulatory elements (LREs), it was hypothesized that promoters alone governed gene expression. There is substantial evidence for the binding of tissue-specific transcription factors at the promoters of expressed genes. This suggests that in spite of newer information implicating the role of LREs, promoters also have interesting motifs that govern tissue-specific expression. Another practical reason for the examination of promoters is that their locations (and genomic sequences) are more clearly delineated on genome databases (like UCSC or Ensembl). Sufficient data (http://symatlas.gnf.org) on the expression of genes is also publicly available for analysis. Sequence motif discovery is set up as a feature extraction problem from these tissue-specific promoter sequences. Subsequently, a support vector machine (SVM) classifier is used to classify new promoters into specific and nonspecific categories based on the identified sequence features (motifs). Using the SVM classifier algorithm, 90% of tissue-specific genes are correctly classified based upon their upstream promoter region sequences alone.

(ii) Known long-range regulatory element (LRE) motifs: to analyze the motifs in LRE elements, we examine the results of the above approach on the Enhancer Browser dataset (http://enhancer.lbl.gov), which has results of expression of ultraconserved genomic elements in transgenic mice [8]. An examination of these ultraconserved enhancers is useful for the extraction of discriminatory motifs to distinguish the regulatory elements from the nonregulatory (neutral) ones. Here the results indicate that up to 95% of the sequences can be correctly classified using these identified motifs.

We note that some of the identified motifs might not be transcription factor binding motifs, and would need to be functionally characterized. This is an advantage of our method: instead of constraining ourselves to the degeneracy present in TF databases (like TRANSFAC/JASPAR), we look for all sequences of a fixed length.
2. CONTRIBUTIONS

Using microarray gene expression data, [9, 10] propose an


approach to assign genes into tissue-specific and nonspecific
categories using an entropy criterion. Variation in expression
and its divergence from ubiquitous expression (uniform distribution across all tissue types) is used to make this assignment. Based on such an assignment, several features, like CpG
island density and the frequency of transcription factor motif occurrence, can be examined to potentially discriminate these two
groups. Other work has explored the existence of key motifs (transcription factor binding sites) in the promoters of
tissue-specific genes (see [11, 12]). Based on the successes
reported in these methods, it is expected that a principled
examination and characterization of every sequence motif
identified to be discriminatory might lead to improved insight into the biology of gene regulation. For example, such
a strategy might lead to the discovery of newer TFBS motifs,
as well as those underlying epigenetic phenomena.
For the purpose of identifying discriminative motifs from
the training data (tissue-specific promoters or LREs), our approach is as follows.
(i) Variable selection: firstly, sequence motifs that discriminate between tissue-specific and non-specific elements are discovered. In machine learning, this is
a feature selection problem with features being the

Arvind Rao et al.

counts of sequence motifs in the training sequences.


Without loss of generality, six-nucleotide motifs (hexamers) are used as motif features. This is based on
the observation that most transcription factor binding
motifs have a 5-6 nucleotide core sequence with degeneracy at the ends of the motif. A similar setup has
been introduced in [13-15]. The motif search space
is, therefore, a 4^6 = 4096-dimensional one. The presented approach, however, does not depend on motif length and can be scaled according to biological
knowledge. For variable (motif) selection, a novel feature selection approach (based on an information theoretic quantity called directed information (DI)) is proposed. The improved performance of this criterion
over using mutual information for motif selection is
also demonstrated.
(ii) Classifier design: after discovering discriminating motifs using the above DI step, an SVM classifier that
separates the samples between the two classes (specific
and nonspecific) from this motif space is constructed.
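The two-step pipeline (motif selection, then SVM classification) can be sketched as follows. In practice a standard SVM package would be used; the minimal hinge-loss subgradient trainer below (Pegasos-style, a dependency-free stand-in we chose for illustration, with hypothetical function names) operates on motif-count feature vectors:

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=300, seed=0):
    """Pegasos-style subgradient training of a linear SVM (hinge loss).
    X: list of feature vectors (selected motif counts); y: labels in {-1, +1}."""
    rng = random.Random(seed)
    d = len(X[0])
    w, b, t = [0.0] * d, 0.0, 0
    for _ in range(epochs):
        for i in rng.sample(range(len(X)), len(X)):  # random pass over samples
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            w = [(1 - eta * lam) * wj for wj in w]   # regularization shrinkage
            if margin < 1:                           # hinge-loss subgradient step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Here +1 plays the role of the tissue-specific ("ts") class and -1 the nonspecific ("nts") class.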
Apart from this novel feature selection approach, several
questions pertaining to bioinformatics methodology can be
potentially answered using this frameworksome of these
are as follows.
(i) Are there common motifs underlying tissue-specific
expression that are identified from tissue-specific promoters and enhancers? In this paper, an examination of motifs (from promoters and enhancers) corresponding to brain-specific expression is done to address this question.
(ii) Do these motifs correspond to known motifs (transcription factor binding sites)? We show that several
motifs are indeed consensus sites for transcription factor binding, although their real role can only be identified in conjunction with experimental evidence.
(iii) Is it possible to relate the motif information from the
sequence and expression perspectives to understand
regulatory mechanisms? This question is addressed in
Section 11.3.
(iv) How useful are these motifs in predicting new tissuespecific regulatory elements? This is partly explained
from the results of SVM classification.
This work differs from that in [13, 14] in several aspects.
We present the DI-based feature selection procedure as part
of an overall unified framework to answer several questions
in bioinformatics, not limited to finding discriminating motifs between two classes of sequences. Particularly, one of
the advantages is the ability to examine any particular motif as a potential discriminator between two classes. Also,
this work accounts for the notion of tissue-specificity of
promoters/enhancers (in line with more recent work in [8-10, 16, 17]). Also, this framework enables the principled integration of various data sources to address the above questions. These are clarified in Section 11.
3. RATIONALE
The main approaches to finding common motifs driving
tissue-specific gene regulation are summarized in [1, 2]. The

[Figure 2 appears here: flowchart of the proposed approach. Sequences (promoters/enhancers) are examined from a tissue expression atlas; the training data are split into tissue-specific and neutral sequences; the sequences are parsed to obtain relative counts; preprocessing builds co-occurrence matrices for the training data; feature (motif) selection (DI/MI) and classification (SVM) follow; top-ranking motifs are interpreted biologically.]

Figure 2: An overview of the proposed approach. Each of the steps is outlined in the following sections.

most common approach is to look for TFBS motifs that are


statistically over-represented in the promoters of the coexpressed genes based on a background (binomial or Poisson)
distribution of motif occurrence genomewide.
In this work, the problem of motif discovery is set up as
follows. Using two annotated groups of genes, tissue-specific
(ts) and nontissue-specific (nts), hexamer motifs that
best discriminate these two classes are found. The goal would
be to make this set of motifs as small as possible, that is, to
achieve maximal class partitioning with the smallest feature
subset.
Several metrics have been proposed to find features with
maximal class label association. From information theory,
mutual information is a popular choice [18]. This is a symmetric association metric and does not resolve the direction of dependency (i.e., if features depend on the class label or vice versa). It is important to find features that induce
the class label. Feature selection from data implies selection
(control) of a feature subset that maximally captures the underlying character (class label) of the data. There is no control over the label (a purely observational characterization).
With this motivation, a new metric for discriminative
hexamer subset selection, termed directed information
(DI), is proposed. Based on the selected features, a classifier
is used to classify sequences to tissue-specific or nontissuespecific categories. The performance of this DI-based feature
selection metric is subsequently evaluated in the context of
the SVM classifier.
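For reference, the symmetric baseline mentioned above, the empirical mutual information between one motif feature and the class label, can be computed as follows; this is an illustrative sketch with names of our choosing, assuming discretized (e.g., quantile-binned) feature values:

```python
from collections import Counter
from math import log2

def feature_label_mi(feature_vals, labels):
    """Empirical I(X;Y) between one motif feature and the class label.
    This symmetric score cannot tell whether the feature induces the
    label or vice versa, which is the limitation DI addresses."""
    n = len(labels)
    joint = Counter(zip(feature_vals, labels))
    px, py = Counter(feature_vals), Counter(labels)
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())
```

Ranking hexamers by this score gives the MI-based feature selection that the DI metric is compared against.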
4. OVERALL METHODOLOGY

The overall schematic of the proposed procedure is outlined


in Figure 2.
Below we present our approach to find promoter-specific
or enhancer-specific motifs.

5. MOTIF ACQUISITION

Table 1: The motif frequency matrix for a set of gene promoters. The first column is their ENSEMBL gene identifiers and the other 4 columns are the motifs. A cell entry denotes the number of times a given motif occurs in the upstream (-2000 to +1000 bp from TSS) region of each corresponding gene.

5.1. Promoter motifs


5.1.1. Microarray analysis
Raw microarray data is available from the Novartis Foundation (GNF) [http://symatlas.gnf.org]. Data is normalized using RMA from the bioconductor packages for R
[http://cran.r-project.org]. Following normalization, replicate samples are averaged together. Only 25 tissue types
are used in our analysis including: adrenal gland, amygdala,
brain, caudate nucleus, cerebellum, corpus callosum, cortex,
dorsal root ganglion, heart, HUVEC, kidney, liver, lung, pancreas, pituitary, placenta, salivary, spinal cord, spleen, testis,
thalamus, thymus, thyroid, trachea, and uterus.
In this context, the notion of tissue specificity of a gene
needs clarification. Suppose there are N genes, g_1, g_2, . . . , g_N,
and T tissue types (in GNF: T = 25); we construct an
N x T tissue specificity matrix M = [0]_{N x T}. For each gene
g_i, 1 <= i <= N, let g_{i,[0.5T]} = median(g_{i,k}) over all k in {1, 2, . . . , T},
g_{i,k} being the expression level of gene i in tissue k. Define
each entry M_{i,k} as

M_{i,k} = 1 if g_{i,k} >= 2 g_{i,[0.5T]}, and M_{i,k} = 0 otherwise.  (1)


Now consider the N-dimensional vector m with entries m_i = sum_{k=1}^{T} M_{i,k}, 1 <= i <= N, that is, summing all the columns of each row. The interquartile range of m can be used for ts/nts assignment. Gene indices i that are in quartile 1 (m_i <= 3) are labeled as ts, and those in quartile 4 (m_i >= 22) are labeled as nts.
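The labeling rule above can be sketched as follows; reading the quartile cutoffs as thresholds m_i <= 3 and m_i >= 22 is our interpretation of the notation, and the function and parameter names are ours:

```python
import statistics

def specificity_labels(expr, low=3, high=22):
    """expr: N x T matrix of expression values (genes x tissues).
    Returns per-gene labels 'ts', 'nts', or None (middle quartiles),
    following the median-doubling rule of (1)."""
    labels = []
    for row in expr:
        med = statistics.median(row)
        m_i = sum(1 for g in row if g >= 2 * med)  # row sum of M_{i,k}
        labels.append("ts" if m_i <= low else "nts" if m_i >= high else None)
    return labels
```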
With this approach, a total of 1924 probes representing 1817 genes were classified as tissue-specific, while 2006
probes representing 2273 genes were classified as nontissue-specific. In this work, genes which are either heart-specific or
brain-specific are considered. From the tissue-specific genes
obtained from the above approach, 45 brain-specific gene
promoters and 118 heart-specific gene promoters are obtained. As mentioned in Section 2, one of the objectives is
to find motifs that are responsible for brain/heart specific
expression and also correlate them with binding profiles of
known transcription factor binding motifs.
5.1.2. Sequence analysis

Genes (ts or nts) associated with candidate probes are identified using the Ensembl Ensmart [http://www.ensembl.org] tool. For each gene, sequence from 2000 bp upstream to 1000 bp downstream, up to the start of the first exon relative to the reported TSS, is extracted from the Ensembl Genome Database (Release 37). The relative counts of each of the 4^6 hexamers are computed within each gene promoter sequence of the two categories (ts and nts) using the seqinr library in the R environment. A t-test is performed between the relative counts of each hexamer between the two expression categories (ts and nts), and the top 1000 significant hexamers (H = {H_1, H_2, . . . , H_1000}) are obtained. The relative counts of these hexamers are recomputed for each gene
Ensembl Gene ID      AAAAAA  AAAAAG  AAAAAT  AAAACA
ENSG00000155366         0       0       1       4
ENSG000001780892        6       5       5       6
ENSG00000189171         1       2       1       0
ENSG00000168664         6       3       8       0
ENSG00000160917         4       1       4       2
ENSG00000163655         2       4       0       1
ENSG000001228844        8       6      10       7
ENSG00000176749         0       0       0       0
ENSG00000006451         5       2       2       1

individually. This results in two hexamer-gene cooccurrence
matrices: one for the ts class (dimension N_train,+1 x 1000)
and the other for the nts class (dimension N_train,-1 x 1000).
Here N_train,+1 and N_train,-1 are the number of positive training
and negative training samples, respectively.
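The paper performs the counting and per-hexamer t-tests with the seqinr library in R; the Python sketch below is an illustrative stand-in for those two steps, with the pooled-variance form of the t statistic being our assumption about the test's exact variant:

```python
from collections import Counter
import statistics

def hexamer_counts(seq, k=6):
    """Sliding-window counts of all k-mers present in a promoter sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def t_statistic(xs, ys):
    """Two-sample t statistic (pooled variance) comparing one hexamer's
    relative counts between the ts and nts groups."""
    nx, ny = len(xs), len(ys)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vx, vy = statistics.variance(xs), statistics.variance(ys)
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)
    return (mx - my) / (sp2 * (1 / nx + 1 / ny)) ** 0.5
```

Ranking hexamers by the significance of this statistic and keeping the top 1000 yields the reduced co-occurrence matrices described above.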
The input to the feature selection procedure is a gene
promoter-motif frequency table (Table 1). The genes relevant
to each class are identified from tissue microarray analysis,
following steps in Section 5.1.1 and the frequency table is
built by parsing the gene promoters for the presence of each
of the 4^6 = 4096 possible hexamers.
5.2. LRE motifs

To analyze long range elements which confer tissue-specific


expression, the Mouse Enhancer database (http://enhancer
.lbl.gov) is examined. This database has a list of experimentally validated ultraconserved elements which have been
tested for tissue specific expression in transgenic mice [8],
and can be searched for a list of all elements which have
expression in a tissue of interest. In this work, we consider
expression in tissues relating to the developing brain. According to the experimental protocol, the various regions are
cloned upstream of a heat shock protein promoter (hsp68lacz), thereby not adhering to the idea of promoter specificity
in tissue-specific expression. Though this is of concern in
that there is loss of some gene-specific information, we work
with this data since we are more interested in tissue expression and also due to a paucity of public promoter-dependent
enhancer data.
This database also has a collection of ultraconserved elements that do not have any transgenic expression in vivo.
This is used as the neutral/background set of data which corresponds to the nts (nontissue-specific class) for feature selection and classifier design.
As in the above (promoter) case, these sequences (seventy-four enhancers for brain-specific expression) are parsed for the absolute counts of the 4096 hexamers, a cooccurrence matrix (N_train,+1 = 74) is built, and then t-test P-values are used to find the top 1000 hexamers (H' = {H'_1, H'_2, . . . , H'_1000}) that are maximally different between the two classes (brain-specific and brain-nonspecific).
The next three sections clarify the preprocessing, feature
selection, and classifier design steps to mine these cooccurrence matrices for hexamer motifs that are strongly associated with the class label. We note that though this work is illustrated using two class labels, the approach can be extended
in a straightforward way to the multiclass problem.
[Figure 3 appears here: a directed graphical model in which feature nodes (X1, X2, . . .) point to the class label Y.]

6. PREPROCESSING

From the above, Ntrain,+1 × 1000 and Ntrain,−1 × 1000 dimensional cooccurrence matrices are available for the tissue-specific and nonspecific data, both for the promoter and enhancer sequences. Before proceeding to the feature (hexamer motif) selection step, the counts of the M = 1000 hexamers in each training sample need to be normalized to account for variable sequence lengths. In the cooccurrence matrix, let gc_{i,k} represent the absolute count of the kth hexamer, k ∈ {1, 2, . . . , M}, in the ith gene. Then, for each gene g_i, the quantile-labeled matrix has X_{i,k} = l if gc_{i,[((l−1)/K)M]} ≤ gc_{i,k} < gc_{i,[(l/K)M]}, with K = 4. Matrices of dimension Ntrain,+1 × 1001 and Ntrain,−1 × 1001 for the specific and nonspecific training samples are now obtained. Each matrix contains the quantile label assignments for the 1000 hexamers (X_i, i ∈ {1, 2, . . . , 1000}), as stated above, and the last column has the corresponding class label (Y = −1/+1).
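The counting and quantile-labeling steps above can be sketched in a few lines of Python. This is an illustrative reading of the rule, not the authors' code: `hexamer_counts` and `quantile_labels` are hypothetical helper names, and ties in the counts are resolved by assigning the highest applicable rank, a detail the text leaves open.

```python
from collections import Counter

def hexamer_counts(seq, k=6):
    """Count overlapping k-mers (hexamers by default) in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def quantile_labels(counts, hexamers, K=4):
    """Map each tracked hexamer's count to a quantile label in {1, ..., K}.

    A hexamer gets label l when its count falls in the l-th K-quantile bin
    of the counts of all tracked hexamers, mirroring the normalization
    described in the text (ties take the highest applicable rank).
    """
    vals = sorted(counts.get(h, 0) for h in hexamers)
    M = len(vals)
    labels = {}
    for h in hexamers:
        c = counts.get(h, 0)
        rank = sum(v <= c for v in vals)          # rank in 1..M
        labels[h] = min(K, 1 + (rank - 1) * K // M)
    return labels
```

For instance, on a short repeat sequence, the rarest hexamer lands in the lowest quantile bin while the most frequent ones land in the highest.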
7. DIRECTED INFORMATION AND FEATURE SELECTION

The primary goal in feature selection is to find the minimal subset of features (from the hexamers H̃) that leads to maximal discrimination of the class label (Yi ∈ {−1, +1}), using each of the i ∈ {1, 2, . . . , (Ntrain,+1 + Ntrain,−1)} genes during training. We are looking for a subset of the variables (Hi,1 , . . . , Hi,1000 ) that is directionally associated with the class label (Yi ). These hexamers putatively influence/induce
the class label (see Figure 3). As can be seen from [19],
there is considerable interest in discovering such dependencies from expression and sequence data. Following [20], we
search for features (in measurement space) that induce the
class label (in observation space).
One way to interpret the feature selection problem is the
following: nature is trying to communicate a source symbol (Y ∈ {−1, +1}), corresponding to the gene class label (nts/ts), to us. In this setup, an encoder that extracts frequencies of a particular hexamer (Hi) maps the source symbol (Y) to Hi(Y). The decoder outputs the source reconstruction Ŷ based on the received codeword ci(Y) = Hi(Y).
We observe that there are several possible encoding
schemes ci (Y ) that the encoder could potentially use (i =
1, 2, . . . , 1000), each corresponding to feature extraction via
a different hexamer Hi . An encoder is the mapping rule
ci : Y Hi . The ideal encoding scheme is one which induces
the most discriminative partitioning of the code (feature)
space, for successful reconstruction of Y by the decoder. The
ranking of each encoder's performance over all possible mappings yields the most discriminative mapping. This measure

Figure 3: Causal feature discovery for two-class discrimination, adapted from [20]. Here the variables X1 and X2 discriminate Y, the class label.

of performance is the amount of information flow from the mapping (hexamer) to the class label. Using mutual information as one such measure indeed identifies the best features
[18], but fails to resolve the direction of dependence due to its
symmetric nature I(Hi ; Y ) = I(Y ; Hi ). The direction of dependence is important since it pinpoints those features that
induce the class label (not vice versa). This is necessary since
these class labels are predetermined (given to us by biology)
and the only control we have is the feature space onto which
we project the data points, for the purpose of classification.
This loosely parallels the use of directed edges in Bayesian
networks for inference of feature-class label associations [20].
Unlike mutual information (MI), directed information
(DI) is a metric to quantify the directed flow of information. It was originally introduced in [21, 22] to examine the
transfer of information from encoder to decoder under feedback/feedforward scenarios and to resolve directivity during bidirectional information transfer. Given its utility in the
encoding of sources with memory (correlated sources), this
work demonstrates it to be a competitive metric to MI for
feature selection in learning problems. DI answers which of
the encoding schemes (corresponding to each hexamer Hi )
leads to maximal information transfer from the hexamer labels to the class labels (i.e., directed dependency).
The DI is a measure of the directed dependence between two vectors Xi = [X1,i , X2,i , . . . , Xn,i ] and Y = [Y1 , Y2 , . . . , Yn ]. In this case, X_{j,i} is the quantile label for the frequency of hexamer i ∈ {1, 2, . . . , 1000} in the jth training sequence, and Y = [Y1 , Y2 , . . . , Yn ] are the corresponding class labels (−1, +1). For a block length N, the DI is given by [22]

    I(Xi^N → Y^N) = Σ_{n=1}^{N} I(Xi^n ; Yn | Y^{n−1}).    (2)

Using a stationarity assumption over a finite-length memory of the training samples, a correspondence with the setup
in [22, 23] can be seen. As already known [24], the mutual
information is I(X^N ; Y^N) = H(X^N) − H(X^N | Y^N), where H(X^N) and H(X^N | Y^N) are the Shannon entropy of X^N and the conditional entropy of X^N given Y^N, respectively. With this definition of mutual information, the directed information simplifies to

    I(X^N → Y^N) = Σ_{n=1}^{N} [H(X^n | Y^{n−1}) − H(X^n | Y^n)]
                 = Σ_{n=1}^{N} [H(X^n , Y^{n−1}) − H(Y^{n−1}) − H(X^n , Y^n) + H(Y^n)].    (3)

Using (3), the directed information is expressed in terms of individual and joint entropies of X^n and Y^n. This expression implies the need for higher-order entropy estimation from a moderate sample size. A Voronoi-tessellation-based [25] adaptive partitioning of the observation space can handle N = 5/6 without much complexity.
The relationship between MI and DI is given by [22]

    DI: I(X^N → Y^N) = Σ_{i=1}^{N} I(X^i ; Yi | Y^{i−1}),
    MI: I(X^N ; Y^N) = Σ_{i=1}^{N} I(X^N ; Yi | Y^{i−1}) = I(X^N → Y^N) + I(0Y^{N−1} → X^N).

To clarify, I(X^N → Y^N) is the directed information from X to Y, whereas I(0Y^{N−1} → X^N) is the directed information from a (one-sample) delayed version of Y^N to X^N. From [23], it is clear that DI resolves the direction of information transfer (feedback or feedforward). If there is no feedback/feedforward, I(X^N → Y^N) = I(X^N ; Y^N).
From the above chain-rule formulations for DI and MI, it is clear that the expression for DI is permutation-variant (i.e., the value of the DI is different for a different ordering of the random variables). Thus, we instead find I_p(X^N → Y^N), the DI measure for a particular ordering p of the N random variables (r.v.s). The DI value for our purpose, I(X^N → Y^N), is an average over all possible sample permutations, given by

    I(X^N → Y^N) = (1/N!) Σ_{p=1}^{N!} I_p(X^N → Y^N).

For MI, however, I_p(X^N ; Y^N) = I(X^N ; Y^N), because MI is permutation-invariant (i.e., independent of the r.v. ordering). As can be readily observed, this averaging is combinatorially complex, and hence a Monte Carlo sampling strategy (1000 trials) is used for computing I(X^N → Y^N). This is because we find that about 1000 trials yields a DI confidence interval (CI) that is only 20% wider than the corresponding CI obtained from 10000 trials of the data, a far more exhaustive number.
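The Monte Carlo permutation averaging can be sketched as below. This is not the authors' estimator: for tractability it assumes stationarity with memory length 1, so each random ordering contributes a pooled plug-in estimate of I(X_n; Y_n | Y_{n−1}) rather than the full chain-rule sum, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def plugin_entropy(samples):
    """Plug-in Shannon entropy (bits) of a list of hashable symbols."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_mi(x, y, z):
    """Plug-in estimate of I(X; Y | Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (plugin_entropy(list(zip(x, z))) + plugin_entropy(list(zip(y, z)))
            - plugin_entropy(list(zip(x, y, z))) - plugin_entropy(list(z)))

def di_rate(xs, ys):
    """Memory-1, stationary plug-in estimate of the per-step directed
    information I(X_n; Y_n | Y_{n-1}), pooled over consecutive samples."""
    return conditional_mi(xs[1:], ys[1:], ys[:-1])

def mc_directed_information(xs, ys, trials=200, seed=0):
    """Monte Carlo average of the order-dependent DI estimate over random
    orderings of the (x, y) pairs, as in the permutation-averaging step."""
    rng = random.Random(seed)
    pairs = list(zip(xs, ys))
    total = 0.0
    for _ in range(trials):
        rng.shuffle(pairs)
        px, py = zip(*pairs)
        total += di_rate(px, py)
    return total / trials
```

On toy data where the label copies the feature, the averaged estimate is large; for an independent label it stays near zero, which is the behavior the ranking step relies on.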
To select features, we maximize I(X^N → Y^N) over the possible pairs (X̃, Y). This feature selection problem for the training set then reduces to identifying which hexamer (k ∈ {1, 2, . . . , 4096}) has the highest I(X_k → Y).
The higher-dimensional entropy can be estimated using
order statistics of the observed samples [25] by iterative partitioning of the observation space until nearly uniform partitions are obtained. This method lends itself to a partitioning
scheme that can be used for entropy estimation even for a
moderate number of samples in the observation space of the
underlying probability distribution. Several such algorithms
for adaptive density estimation have been proposed (see [26
28]) and can find potential application in this procedure. In this methodology, a Voronoi tessellation approach is used for entropy estimation because of its higher performance guarantees as well as the relative ease of implementation of such a procedure.
The above method is used to estimate the true DI between a given hexamer and the class label for the entire training set. Feature selection comprises finding all those hexamers (Xi) for which I(Xi^N → Y^N) is the highest. From the definition of DI, we know that 0 ≤ I(Xi^N → Y^N) ≤ I(Xi^N ; Y^N) < ∞. To make a meaningful comparison of the strengths of association between different hexamers and the class label, we use a normalized score to rank the DI values. This normalized measure should map the large range [0, ∞) to [0, 1]. Following [29], an expression for the normalized DI is given by

    DI_norm = 1 − e^{−2 I(X^N → Y^N)} = 1 − e^{−2 Σ_{i=1}^{N} I(X^i ; Yi | Y^{i−1})}.    (4)

Another point of consideration is to estimate the significance of the DI value compared to a null distribution on the DI value (i.e., what is the chance of finding the DI value by chance from the N-length series Xi and Y). This is done using
confidence intervals after permutation testing (Section 8).
8. BOOTSTRAPPED CONFIDENCE INTERVALS

In the absence of knowledge of the true distribution of the DI estimate, an approximate confidence interval for the DI estimate (Î(X^N → Y^N)) is found using bootstrapping [30]. Density estimation is based on kernel smoothing over the bootstrapped samples [31].
The kernel density estimate for the bootstrapped DI (with n = 1000 samples), Z ≡ I_B(X^N → Y^N), becomes

    f_h(z) = (1/(nh)) Σ_{i=1}^{n} (3/4)[1 − ((zi − z)/h)^2] · 1{|(zi − z)/h| ≤ 1},

with h ≈ 2.67 σ̂_z and n = 1000. I_B(X^N → Y^N) is obtained by finding the DI for each random permutation of the X, Y series, and performing this permutation B times. As is clear from the above expression, the Epanechnikov kernel is used for density estimation from the bootstrapped samples. The choice of this kernel is based on its excellent characteristics: a compact region of support, the lowest asymptotic mean integrated squared error (AMISE), and a favorable bias-variance tradeoff [31].
We denote the cumulative distribution function (over the bootstrap samples) of Î(X^N → Y^N) by F_{I_B(X^N → Y^N)}. Let the mean of the bootstrapped null distribution be Ī_B(X^N → Y^N), and denote by t_{1−α} the (1 − α)th quantile of this distribution, that is, {t_{1−α} : P([(I_B(X^N → Y^N) − Ī_B(X^N → Y^N))/σ̂] ≤ t_{1−α}) = 1 − α}. Since we need the true Î(X^N → Y^N) to be significant and close to 1, we need Î(X^N → Y^N) ≥ [Ī_B(X^N → Y^N) + t_{1−α} σ̂], with σ̂ being the standard error of the bootstrapped distribution,

    σ̂ = sqrt( Σ_{b=1}^{B} [I_b(X^N → Y^N) − Ī_B(X^N → Y^N)]^2 / (B − 1) );

B is the number of bootstrap samples.



This hypothesis test is done for each of the 1000 motifs, in order to select the top d motifs based on DI value,
which is then used for classifier training subsequently. This
leads to a need for multiple-testing correction. Because the
Bonferroni correction is extremely stringent in such settings,
the Benjamini-Hochberg procedure [32], which has a higher
false positive rate but a lower false negative rate, is used in
this work.
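The Benjamini-Hochberg step-up procedure used here is easy to state in code. A minimal sketch (hypothetical function name), returning the indices of rejected hypotheses at FDR level q:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return sorted indices of hypotheses rejected by the
    Benjamini-Hochberg step-up procedure at FDR level q: find the
    largest rank k with p_(k) <= k*q/m and reject the k smallest."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    return sorted(order[:k])
```

Unlike Bonferroni, which would compare every P-value to q/m, the step-up rule lets later ranks rescue earlier borderline hypotheses, which matches the "higher false positive rate but lower false negative rate" tradeoff described above.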
9. SUPPORT VECTOR MACHINES

From the top d features identified from the ranked list of features having high DI with the class label, a support vector machine classifier in these d dimensions is designed. An SVM is a hyperplane classifier which operates by finding a maximum-margin linear hyperplane to separate two different classes of data in high-dimensional (D > d) space. The training data consists of N(= Ntrain,+1 + Ntrain,−1) pairs (x1 , y1 ), (x2 , y2 ), . . . , (xN , yN ), with xi ∈ R^d and yi ∈ {−1, +1}.
An SVM is a maximum-margin hyperplane classifier in a nonlinearly extended high-dimensional space. For extending the dimensions from d to D > d, a radial basis kernel is used.
The objective is to minimize ‖β‖ for the hyperplane {x : f(x) = x^T β + β0 = 0}, subject to yi (xi^T β + β0 ) ≥ 1 − ξi ∀i, ξi ≥ 0, Σ_i ξi ≤ constant [33].
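The paper trains an RBF-kernel SVM; as a self-contained illustration of the soft-margin objective above, here is a Pegasos-style subgradient solver for the linear case. This is a sketch under simplifying assumptions (linear kernel, toy step-size schedule), not the authors' implementation, and the function names are hypothetical.

```python
import random

def train_linear_svm(xs, ys, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic subgradient descent on the soft-margin
    objective (lam/2)||w||^2 + (1/N) sum max(0, 1 - y_i (w.x_i + b)).
    xs: list of feature tuples; ys: labels in {-1, +1}."""
    rng = random.Random(seed)
    d = len(xs[0])
    w = [0.0] * d
    b = 0.0
    idx = list(range(len(xs)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            t += 1
            eta = 1.0 / (lam * t)                      # decaying step size
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            w = [wj * (1 - eta * lam) for wj in w]     # shrink (regularizer)
            if margin < 1:                             # hinge-loss subgradient
                w = [wj + eta * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += eta * ys[i]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

On a small separable toy set the learned hyperplane classifies the training points correctly; in practice one would use a library SVM with an RBF kernel, as the paper does.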
10. SUMMARY OF OVERALL APPROACH

Our proposed approach is as follows. Here, the term sequence can pertain to either tissue-specific promoters or
LRE sequences, obtained from the GNF SymAtlas and Ensembl databases or the Enhancer Browser.
(1) The sequence is parsed to obtain the relative counts/
frequencies of occurrence of the hexamer in that sequence and to build the hexamer-sequence frequency
matrix. The seqinr package in R is used for this purpose. This is done for all the sequences in the specific
(class +1) and nonspecific (class −1) categories.
The matrix thus has N = Ntrain,+1 + Ntrain,−1 rows and
4^6 = 4096 columns.
(2) The obtained hexamer-sequence frequency matrix is
preprocessed by assigning quantile labels for each hexamer within the ith sequence. A hexamer-sequence
matrix is thus obtained where the (i, j)th entry has the
quantile label of the jth hexamer in the ith sequence.
This is done for all the N training sequences consisting
of examples from the −1 and +1 class labels.
(3) Thus, two submatrices corresponding to the two class
labels are built. One matrix contains the hexamersequence quantile labels for the positive training examples and the other matrix is for the negative training
examples.
(4) To select hexamers that are most different between the positive and negative training examples, a t-test is performed for each hexamer, between the "ts" and "nts" groups. Ranking the corresponding t-test P-values yields those hexamers that are most different distributionally between the positive and negative training samples. The top 1000 of these hexamers are chosen for further analysis. This step is only necessary to reduce the computational complexity of the overall procedure: computing the DI between each of the 4096 hexamers and the class label is relatively expensive.
(5) For the top K = 1000 hexamers which are most significantly different between the positive and negative training examples, I(Xk^N → Y^N) and I(Xk^N ; Y^N) reveal the degree of association for each of the k ∈ {1, 2, . . . , K} hexamers. The entropy terms in the directed information and mutual information expressions are found using a higher-order entropy estimator. Using the procedure of Section 7, the raw DI values are converted into their normalized versions. Since the goal is to maximize I(Xk → Y), we can rank the DI values in descending order.
(6) The significance of the DI estimate is obtained based
on the bootstrapping methodology. For every hexamer, a P = 0.05 significance with respect to its
bootstrapped null distribution yields potentially discriminative hexamers between the two classes. The
Benjamini-Hochberg procedure is used for multiple-testing correction. Ranking the significant hexamers
by decreasing DI value yields features that can be used
for classifier (SVM) training.
(7) Train the support vector machine (SVM) classifier on
the top d features from the ranked DI list(s). For comparison with the MI-based technique, we use the hexamers which have the top d (normalized) MI values.
The accuracy of the trained classifier is plotted as a
function of the number of features (d), after ten-fold
cross-validation. As we gradually consider higher d, we
move down the ranked list. In the plots below, the misclassification fraction is reported instead. A fraction of
0.1 corresponds to 10% misclassification.
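The screening in step (4) can be sketched with Welch's t statistic. This is an illustrative stand-in, not the authors' code: ranking by |t| is used here as a proxy for ranking by P-value (the two orderings agree for comparable degrees of freedom), and the function names are hypothetical.

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two samples (e.g., counts of one hexamer
    in the +1 and -1 training groups)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def top_hexamers_by_t(matrix_pos, matrix_neg, hexamers, top=1000):
    """Rank hexamers by |t| between the positive and negative groups and
    keep the top ones, mirroring the P-value screening of step (4)."""
    scores = []
    for j, h in enumerate(hexamers):
        a = [row[j] for row in matrix_pos]
        b = [row[j] for row in matrix_neg]
        scores.append((abs(welch_t(a, b)), h))
    scores.sort(reverse=True)
    return [h for _, h in scores[:top]]
```

A hexamer whose counts clearly separate the two groups dominates the ranking, while a hexamer with identical group distributions scores near zero.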
Note. An important point concerns the training of the SVM
classifier with the top d features selected using DI or MI (step
(7) above). Since the feature selection step is decoupled from
the classification step, it is preferred that the top d motifs are
consistently ranked high among multiple draws of the data,
so as to warrant their inclusion in the classifier. However,
this does not yield expected results on this data set. Briefly,
a Kendall rank correlation coefficient [34] was computed between the rankings of the motifs across multiple data draws (by sampling a subset of the entire dataset), for both MI- and DI-based feature selection. It is observed that this coefficient is very low for both MI and DI, indicating a highly variable ranking. This is likely due to the high variability in the data distribution across these multiple draws (due to the limited number of data points), as well as the sensitivity of the data-dependent entropy estimation procedure to the range of the
samples in the draw. To circumvent this problem of inconsistency in rank of motifs, a median DI/MI value is computed
across these various draws and the top d features based on the
median DI/MI value across these draws are picked for SVM
training [20].
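The median-across-draws remedy in the note can be sketched generically. This is a hedged illustration, not the authors' code: `score_fn` stands in for any per-feature scorer (DI or MI), and the draw size and count are arbitrary choices.

```python
import random

def median_score_across_draws(data, score_fn, n_draws=20, frac=0.7, seed=0):
    """For each feature, compute its score on several random subsets of the
    data and return the per-feature median, the rank-stabilizing device
    described in the note above."""
    rng = random.Random(seed)
    k = max(2, int(frac * len(data)))
    per_feature = {}
    for _ in range(n_draws):
        draw = rng.sample(data, k)
        for feat, s in score_fn(draw).items():
            per_feature.setdefault(feat, []).append(s)
    med = {}
    for feat, vals in per_feature.items():
        vals.sort()
        m = len(vals)
        med[feat] = vals[m // 2] if m % 2 else 0.5 * (vals[m // 2 - 1] + vals[m // 2])
    return med
```

The top d features are then taken from this median score rather than from any single draw's ranking, which damps the draw-to-draw variability reported above.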

11. RESULTS

11.1. Tissue-specific promoters
We use DI to find hexamers that discriminate brain-specific and heart-specific expression from neutral sequences. The
negative training sets are sequences that are not brain or
heart-specific, respectively. Results using the MI and DI
methods are given below (see Figures 5 and 7). The plots
indicate the SVM cross-validated misclassification accuracy
(ideally 0) for the data as the number of features using the
metric (DI or MI) is gradually increased. We can see that for
any given classification accuracy, the number of features using DI is less than the corresponding number of features using MI. This translates into a lower misclassification rate for
DI-based feature selection. We also observe that as the number of features d is increased, the performance of MI is the
same as DI. This is expected since, as we gather more features using MI or DI, the differences in MI versus DI ranking
are compensated.
An important point needs to be clarified here. There
is a possibility of sequence composition bias in the tissue-specific and neutral sequences used during training. This has
been reported in recent work [15]. To avoid detecting GC
rich sequences as hexamer features, it is necessary to confirm
that there is no significant GC-composition bias between the
specific and neutral sets in each of the case studies. This is
demonstrated in Figures 4, 6, and 8. In each case, it is observed that the mean GC-composition is almost the same for the
specific versus neutral set. However, in such studies, it is necessary to select for sequences that do not exhibit such bias.
In Figures 6 and 8, even the distribution of GC-composition
is similar among the samples. For Figure 4, even though the
distributions are slightly different, the box plots indicate similarity in mean GC-content.
Next, some of the motifs that discriminate between
tissue-specific and nonspecific categories for the brain promoter, heart promoter, and brain enhancer cases, respectively, are listed in Table 2. Additionally, if the genes encoding for these TFs are expressed in the corresponding tissue [35], a (∗) sign is appended. In some cases,
the hexamer motifs match the consensus sequences of
known transcription factors (TFs). This suggests a potential role for that particular TF in regulating expression
of tissue-specific genes. This matching of hexamer motifs
with TFBS consensus sites is done using the MAPPER engine (http://bio.chip.org/mapper). It is to be noted that a
hexamer-TFBS match does not necessarily imply the functional role of the TF in the corresponding tissue (brain or
heart). However, such information would be useful to guide
focused experiments to confirm their role in vivo (using techniques such as chromatin immunoprecipitation).
As is clear from the above results, there are several
other motifs which are novel or correspond to nonconsensus motifs of known transcription factors. Hence, each of
the identified hexamers merit experimental investigation.
Also, though we identify as many as 200 hexamers in this
work (please see Supplementary Material available online at

Figure 4: GC sequence composition for brain-specific promoters and housekeeping (hkg) promoters.

Figure 5: Misclassification accuracy for the MI versus DI case (brain promoter set). Accuracy of classification is approximately 93%.
doi: 10.1155/2007/13853), we have reported only a few due to space constraints.
In the context of the heart-specific genes, we consider the cardiac troponin gene (cTNT, ENSEMBL:
ENSG00000118194), which is present in the heart promoter
set. An examination of the high-DI motifs for the heart-specific set yields motifs with the GATA consensus site, as
well as matches with the MEF2 transcription factor. It has
been established earlier that GATA-4 and MEF2 are indeed involved in transcriptional activation of this gene [36], and the results have been confirmed by ChIP [37].

Table 2: Comparison of high-ranking motifs (by DI) across different data sets. The (∗) sign indicates tissue-specific expression of the corresponding TF gene.

Brain promoters  | Heart promoters  | Brain enhancers
Ahr-ARNT (∗)     | Pax2             | HNF-4 (∗)
Tcf11-MafG (∗)   | Tcf11-MafG (∗)   | Nkx2
c-ETS (∗)        | XBP1 (∗)         | AML1
FREAC-4          | Sox-17 (∗)       | c-ETS (∗)
T3R-alpha1       | FREAC-4          | Elk1 (∗)
                 | GATA (∗)         |

Figure 6: GC sequence composition for heart-specific promoters and housekeeping (hkg) promoters.

Figure 7: Misclassification accuracy for the MI versus DI case (heart promoter set).

11.2. Enhancer DB

Additionally, all the brain-specific regulatory elements profiled in the mouse Enhancer Browser database (http://enhancer.lbl.gov) are examined for discriminating motifs. Figure 8 shows that the two classes have similar GC-composition. Again, the plot of misclassification accuracy
versus number of features in the MI and DI scenarios reveals the superior performance of the DI-based hexamer selection compared to MI (see Figure 9).
In this case, the enhancer sequences are ultraconserved,
that is, obtained after alignment across multiple species. The
examination of these sequences identified motifs that are
potentially selected for regulatory function across evolutionary distances. Using alignment as a prefiltering strategy helps remove bias conferred by sequence elements that
arise via random mutation but might be over-represented.
This is permitted in programs like Toucan [12] and rVISTA
(http://rvista.dcode.org).
As in the previous case, some of the top ranking motifs
from this dataset are also shown in Table 2. The (∗)-signed
TFs indicate that some of these discovered motifs indeed
have documented high expression in the brain. The occurrence of such tissue-specific transcription factor motifs in
these regulatory elements gives credence to the discovered
motifs. For example, ELK-1 is involved in neuronal differentiation [38]. Also, some motifs matching consensus sites
of TEF1 and ETS1 are common to the brain-enhancer and
brain-promoter set. Though this is interesting, an experiment to confirm the enrichment of such transcription factors in the population of brain-specific regulatory sequences
is necessary.
11.3. Quantifying sequence-based TF influence

A very interesting question emerges from the above presented results. What if one is interested in a motif that is
not present in the above ranked hexamer list for a particular tissue-specific set? As an example, consider the case for
MyoD, a transcription factor which is expressed in muscle
and has an activity in heart-specific genes too [39]. In fact, a
variant of its consensus motif CATTTG is indeed in the top
ranking hexamer list. The DI-based framework further permits investigation of the directional association of the canonical MyoD motif (CACCTG) for the discrimination of heartspecific genes versus housekeeping genes. This is shown in
Figure 10. As is observed, MyoD has a significant directional
influence on the heart-specific versus neutral sequence class
label. This, in conjunction with the expression level characteristics of MyoD, indicates that the motif CACCTG is
potentially relevant to make the distinction between heartspecific and neutral sequences.

Figure 8: GC sequence composition for brain-specific enhancers and neutral noncoding regions.

Figure 9: Misclassification accuracy for the MI versus DI case (brain enhancer set).

Another theme picks up on something traditionally done in bioinformatics research: finding key TF regulators underlying tissue-specific expression. Two major questions emerge from this theme.
(1) Which putative regulatory TFs underlie the tissuespecific expression of a group of genes?
(2) For the TFs found using tools like TOUCAN [12], can
we examine the degree of influence that the particular
TF motif has in directing tissue-specific expression?
To address the first question, we examine the TFs revealed by DI/MI motif selection and compare these to the
TFs discovered from TOUCAN [12], underlying the expres-

0.1
0.2
0.3
0.4
0.5
0.6
DI of MyoDheart-specific promoters (x)

0.7

Figure 10: Cumulative distribution function for bootstrapped


I(M yoD motif: CACCTGY ); Y is the class label (heart-specific

Y ) = 0.4977.
versus housekeeping). True I(CACCTG

sion of genes expressed on day e14.5 in the degenerating


mesonephros and nephric duct (TS22). This set has about
43 genes (including Gata2). These genes are available in the
Supplementary Material.
Using TOUCAN, the set of module TFs is combinations
of the following TFs: E47, HNF3B, HNF1, RREB1, HFH3,
CREBP1, VMYB, GFI1. These were obtained by aligning the
promoters of these 43 genes (2000 bp upstream to +200 bp
from the TSS), and looking for over-represented TF motifs based on the TRANSFAC/JASPAR databases. Using the
DI-based motif selection, a set of 200 hexamers are found
that discriminate these 43 gene promoter sequences from
the background housekeeping promoter set. They map to
the consensus sites of several known TFs, such as (identified from http://bio.chip.org/mapper) Nkx, Max1, c-ETS,
FREAC4, Ahr-ARNT, CREBP2, E2F, HNF3A/B, NFATc, Pax2,
LEF1, Max1, SP1, Tef1, Tcf11-MafG; many of which are expressed in the developing kidney (http://www.expasy.org).
Moreover, we observe that the TFs that are common between
the TOUCAN results and the DI-based approach: FREAC4,
Max1, HNF3a/b, HNF1, SP1, CREBP, RREB1, HFH3, are
mostly kidney-specific. Thus, we believe that this observation makes a case for finding all (possibly degenerate) TF
motif searches from TRANSFAC, and filtering them based on
tissue-specific expression subsequently. Such a strategy yields
several more TF candidates for testing and validation of biological function.
For the second question, we examine the following scenario. The Gata3 gene is observed to be expressed in the
developing ureteric bud (UB) during kidney development.
To find UB specific TF regulators, conserved TF modules
can be examined in the promoters of UB-specific genes.
These experimentally annotated UB-specific genes are obtained from the Mouse Genome Informatics database at
http://www.informatics.jax.org. Several programs are used
for such analysis, like Genomatix [11] or Toucan [12]. Using

Toucan, the promoters of the various UB-specific genes are aligned to discover related modules. The top-ranking module in Toucan contains AHR-ARNT, Hox13, Pax2, Tal1alpha-E47, and Oct1. Again, the power of these motifs to discriminate UB-specific and nonspecific genes, based on DI, can be investigated.
For this purpose, we check whether the Pax2 binding motif (GTTCC [40]) indeed induces kidney-specific expression by looking at the strength of DI between the GTTCC motif and the class label (+1) indicating UB expression (see Figure 11). This once again adds computational evidence for the true role of Pax2 in directing ureteric bud-specific expression [40]. The main implication here is that, from sequence data, there is strong evidence for the Pax2 motif being a useful feature for UB-specific genes. This is especially relevant given the documented role of Pax2 (see [41]) in directing ureteric-bud expression of the Gata3 gene, one of the key modulators of kidney morphogenesis. Both the MyoD and Pax2 studies indicate the relevance of principled data integration using expression [35, 42] and sequence modalities.

Figure 11: Cumulative distribution function for bootstrapped I(Pax2 motif GTTCC → Y); Y is the class label (UB/non-UB). True I(GTTCC → Y) = 0.9792.

11.4. Observations

With regard to the feature selection and classification results, in both studies (enhancers and promoters), we observe that about 100 hexamers are enough to discriminate the tissue-specific from the neutral sequences. Furthermore, some sequence features of these motifs at the promoter/enhancer emerge.
(i) There is higher sequence variability at the promoter, since it has to act in concert with LREs of different tissue types during gene regulation.
(ii) Since the enhancer/LRE acts with the promoter to confer expression in only one tissue type, these sequences are more specific, and hence their mining identifies motifs that are probably more indicative of tissue-specific expression.
We reiterate, however, that the enhancer dataset we study uses hsp68-lacZ as the promoter driven by the ultraconserved elements. Hence there is no promoter specificity in this context. Though this is a disadvantage and might not reveal all key motifs, it is the best that can be done in the absence of any other comprehensive repository.
The second aspect of the presented results highlights two important points. Firstly, the identified motifs have strong predictive value, as suggested by the cross-validation results as well as Table 2. Moreover, DI provides a principled methodology to investigate any given motif for tissue-specificity, as well as for identifying expression-level relationships between the TFs and their target genes (Section 11.3).

12. CONCLUSIONS

In this work, a framework for the identification of hexamer motifs to discriminate between two kinds of sequences (tissue-specific promoters or regulatory elements versus nonspecific elements) is presented. For this feature selection problem, a new metric, the directed information (DI), is proposed. In conjunction with a support vector machine classifier, this method was shown to outperform the
state-of-the-art method employing undirected mutual information. We also find that only a subset of the discriminating
motifs correlate with known transcription factor motifs and
hence the other motifs might be potentially related to nonconsensus TF binding or underlying epigenetic phenomena
governing tissue-specific gene expression. The superior performance of the directed-information-based variable selection suggests its utility to more general learning problems.
As per the initial motivation, the discovery of these motifs
can aid in the prospective discovery of other tissue-specific
regulatory regions.
We have also examined the applicability of DI to prospectively resolve the functional role of any TF motif in a biological process, integrating other sources (literature, expression
data, module searches).
13. FUTURE WORK

Several opportunities for future work exist within this proposed framework. Multiple sequence alignment of promoter/regulatory sequences across species would be a useful preprocessing step to reduce false detection of discriminatory motifs. The hexamers can also be identified based on other metrics exploiting distributional divergence between the samples of the +1 and -1 classes. Furthermore, there is a need for consistent high-dimensional entropy estimators within the small sample regime. A particularly interesting direction is the formulation of a stepwise hexamer selection algorithm, using the directed information for maximal relevance selection and mutual information for minimizing between-hexamer redundancy [18]. This analysis is beyond the scope of this work, but an implementation is available from the authors for further investigation. (The source code of the analysis tools in R 2.0 and MATLAB 6.1 is available on request.)
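The stepwise relevance/redundancy scheme just described can be sketched in a few lines. The sketch below is not the authors' R/MATLAB implementation: it substitutes plain empirical mutual information for the directed information (whose estimator is not specified in this excerpt), and the hexamer names and 0/1 occurrence vectors are invented for illustration.

```python
import math

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete vectors."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    # sum p(x,y) log2[ p(x,y) / (p(x) p(y)) ], with counts folded into one ratio
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def stepwise_select(features, labels, k):
    """Greedy max-relevance / min-redundancy selection, mRMR-style [18].
    `features` maps a hexamer name to its per-gene 0/1 occurrence vector;
    here plain MI stands in for the directed-information relevance term."""
    selected = []
    while len(selected) < k and len(selected) < len(features):
        best, best_score = None, -float("inf")
        for name, vec in features.items():
            if name in selected:
                continue
            relevance = mutual_information(vec, labels)
            redundancy = (sum(mutual_information(vec, features[s])
                              for s in selected) / len(selected)
                          if selected else 0.0)
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        selected.append(best)
    return selected
```

On a toy class vector where one (hypothetical) hexamer's occurrences coincide with the UB/non-UB labels, that hexamer is selected first, mirroring the high bootstrapped information reported for the GTTCC motif.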
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support of the NIH
under Award 5R01-GM028896-21 for J. D. Engel. They
would like to thank Professor Sandeep Pradhan and Mr.
Ramji Venkataramanan for useful discussions on directed
information. They are extremely grateful to Professor Erik
Learned-Miller and Dr. Damian Fermin for sharing their
code for high-dimensional entropy estimation and ENSEMBL sequence extraction, respectively. They also thank
the anonymous reviewers and the corresponding editor for
helping them improve the quality of the manuscript through
insightful comments and suggestions. The material in this
paper was presented in part at the IEEE Statistical Signal Processing Workshop 2007 (SSP '07).

REFERENCES
[1] K. D. MacIsaac and E. Fraenkel, "Practical strategies for discovering regulatory DNA sequence motifs," PLoS Computational Biology, vol. 2, no. 4, p. e36, 2006.
[2] G. Kreiman, "Identification of sparsely distributed clusters of cis-regulatory elements in sets of co-expressed genes," Nucleic Acids Research, vol. 32, no. 9, pp. 2889–2900, 2004.
[3] C. Burge and S. Karlin, "Prediction of complete gene structures in human genomic DNA," Journal of Molecular Biology, vol. 268, no. 1, pp. 78–94, 1997.
[4] Q. Li, G. Barkess, and H. Qian, "Chromatin looping and the probability of transcription," Trends in Genetics, vol. 22, no. 4, pp. 197–202, 2006.
[5] D. A. Kleinjan and V. van Heyningen, "Long-range control of gene expression: emerging mechanisms and disruption in disease," The American Journal of Human Genetics, vol. 76, no. 1, pp. 8–32, 2005.
[6] L. A. Pennacchio, G. G. Loots, M. A. Nobrega, and I. Ovcharenko, "Predicting tissue-specific enhancers in the human genome," Genome Research, vol. 17, no. 2, pp. 201–211, 2007.
[7] D. C. King, J. Taylor, L. Elnitski, F. Chiaromonte, W. Miller, and R. C. Hardison, "Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences," Genome Research, vol. 15, no. 8, pp. 1051–1060, 2005.
[8] L. A. Pennacchio, N. Ahituv, A. M. Moses, et al., "In vivo enhancer analysis of human conserved non-coding sequences," Nature, vol. 444, no. 7118, pp. 499–502, 2006.
[9] K. Kadota, J. Ye, Y. Nakai, T. Terada, and K. Shimizu, "ROKU: a novel method for identification of tissue-specific genes," BMC Bioinformatics, vol. 7, p. 294, 2006.
[10] J. Schug, W.-P. Schuller, C. Kappen, J. M. Salbaum, M. Bucan, and C. J. Stoeckert Jr., "Promoter features related to tissue specificity as measured by Shannon entropy," Genome Biology, vol. 6, no. 4, p. R33, 2005.
[11] T. Werner, "Regulatory networks: linking microarray data to systems biology," Mechanisms of Ageing and Development, vol. 128, no. 1, pp. 168–172, 2007.
[12] S. Aerts, P. Van Loo, G. Thijs, et al., "TOUCAN 2: the all-inclusive open source workbench for regulatory sequence analysis," Nucleic Acids Research, vol. 33, Web Server issue, pp. W393–W396, 2005.
[13] B. Y. Chan and D. Kibler, "Using hexamers to predict cis-regulatory motifs in Drosophila," BMC Bioinformatics, vol. 6, p. 262, 2005.
[14] G. B. Hutchinson, "The prediction of vertebrate promoter regions using differential hexamer frequency analysis," Computer Applications in the Biosciences, vol. 12, no. 5, pp. 391–398, 1996.
[15] P. Sumazin, G. Chen, N. Hata, A. D. Smith, T. Zhang, and M. Q. Zhang, "DWE: discriminating word enumerator," Bioinformatics, vol. 21, no. 1, pp. 31–38, 2005.
[16] G. Lakshmanan, K. H. Lieuw, K.-C. Lim, et al., "Localization of distant urogenital system-, central nervous system-, and endocardium-specific transcriptional regulatory elements in the GATA-3 locus," Molecular and Cellular Biology, vol. 19, no. 2, pp. 1558–1568, 1999.
[17] M. Khandekar, N. Suzuki, J. Lewton, M. Yamamoto, and J. D. Engel, "Multiple, distant Gata2 enhancers specify temporally and tissue-specific patterning in the developing urogenital system," Molecular and Cellular Biology, vol. 24, no. 23, pp. 10263–10276, 2004.
[18] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
[19] Proceedings of NIPS 2006 Workshop on Causality Feature Selection, http://research.ihost.com/cws2006/.
[20] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," The Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
[21] H. Marko, "The bidirectional communication theory: a generalization of information theory," IEEE Transactions on Communications, vol. COM-21, no. 12, pp. 1345–1351, 1973.
[22] J. Massey, "Causality, feedback and directed information," in Proceedings of the International Symposium on Information Theory and Its Applications (ISITA '90), pp. 303–305, Waikiki, Hawaii, USA, November 1990.
[23] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: rate-distortion theorems and error exponents for a general source," IEEE Transactions on Information Theory, vol. 53, no. 6, pp. 2154–2179, 2007.
[24] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1991.
[25] E. G. Miller, "A new class of entropy estimators for multi-dimensional densities," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, pp. 297–300, Hong Kong, April 2003.
[26] R. M. Willett and R. D. Nowak, "Complexity-regularized multiresolution density estimation," in Proceedings of the International Symposium on Information Theory (ISIT '04), pp. 303–305, Chicago, Ill, USA, June-July 2004.
[27] I. Nemenman, F. Shafee, and W. Bialek, "Entropy and inference, revisited," in Advances in Neural Information Processing Systems 14, T. G. Dietterich, S. Becker, and Z. Ghahramani, Eds., MIT Press, Cambridge, Mass, USA, 2002.
[28] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[29] H. Joe, "Relative entropy measures of multivariate dependence," Journal of the American Statistical Association, vol. 84, no. 405, pp. 157–164, 1989.

Arvind Rao et al.


[30] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap, Monographs on Statistics and Applied Probability, Chapman & Hall/CRC, Boca Raton, Fla, USA, 1994.
[31] J. O. Ramsay and B. W. Silverman, Functional Data Analysis, Springer Series in Statistics, Springer, New York, NY, USA, 1997.
[32] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society, Series B, vol. 57, no. 1, pp. 289–300, 1995.
[33] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, NY, USA, 2001.
[34] M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.
[35] NCBI PubMed URL, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi.
[36] A. M. Murphy, W. R. Thompson, L. F. Peng, and L. Jones II, "Regulation of the rat cardiac troponin I gene by the transcription factor GATA-4," Biochemical Journal, vol. 322, part 2, pp. 393–401, 1997.
[37] A. Azakie, J. R. Fineman, and Y. He, "Myocardial transcription factors are modulated during pathologic cardiac hypertrophy in vivo," The Journal of Thoracic and Cardiovascular Surgery, vol. 132, no. 6, pp. 1262–1271.e4, 2006.
[38] P. Vanhoutte, J. L. Nissen, B. Brugg, et al., "Opposing roles of Elk-1 and its brain-specific isoform, short Elk-1, in nerve growth factor-induced PC12 differentiation," Journal of Biological Chemistry, vol. 276, no. 7, pp. 5189–5196, 2001.
[39] E. N. Olson, "Regulation of muscle transcription by the MyoD family: the heart of the matter," Circulation Research, vol. 72, no. 1, pp. 1–6, 1993.
[40] G. R. Dressler and E. C. Douglass, "Pax-2 is a DNA-binding protein expressed in embryonic kidney and Wilms tumor," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 4, pp. 1179–1183, 1992.
[41] D. Grote, A. Souabni, M. Busslinger, and M. Bouchard, "Pax2/8-regulated Gata3 expression is necessary for morphogenesis and guidance of the nephric duct in the developing kidney," Development, vol. 133, no. 1, pp. 53–61, 2006.
[42] A. Rao, A. O. Hero, D. J. States, and J. D. Engel, "Inference of biologically relevant gene influence networks using the directed information criterion," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 2, pp. 1028–1031, Toulouse, France, May 2006.


Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 31450, 18 pages
doi:10.1155/2007/31450

Research Article
Splitting the BLOSUM Score into Numbers of
Biological Significance
Francesco Fabris,1,2 Andrea Sgarro,1,2 and Alessandro Tossi3
1 Dipartimento di Matematica e Informatica, Università degli Studi di Trieste, via Valerio 12b, 34127 Trieste, Italy
2 Centro di Biomedicina Molecolare, AREA Science Park, Strada Statale 14, Basovizza, 34012 Trieste, Italy
3 Dipartimento di Biochimica, Biofisica e Chimica delle Macromolecole, Università degli Studi di Trieste, via Licio Giorgieri 1, 34127 Trieste, Italy

Received 2 October 2006; Accepted 30 March 2007


Recommended by Juho Rousu
Mathematical tools developed in the context of Shannon information theory were used to analyze the meaning of the BLOSUM
score, which was split into three components termed the BLOSUM spectrum (or BLOSpectrum). These relate, respectively, to the
sequence convergence (the stochastic similarity of the two protein sequences), to the background frequency divergence (typicality
of the amino acid probability distribution in each sequence), and to the target frequency divergence (compliance of the amino acid
variations between the two sequences to the protein model implicit in the BLOCKS database). This treatment sharpens the protein sequence comparison, providing a rationale for the biological significance of the obtained score, and helps to identify weakly
related sequences. Moreover, the BLOSpectrum can guide the choice of the most appropriate scoring matrix, tailoring it to the
evolutionary divergence associated with the two sequences, or indicate if a compositionally adjusted matrix could perform better.
Copyright © 2007 Francesco Fabris et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

Substitution matrices have been in use since the introduction of the Needleman and Wunsch algorithm [1], and are referred to, either implicitly or explicitly, in several other papers from the seventies: McLachlan [2], Sankoff [3], Sellers [4], Waterman et al. [5], Dayhoff et al. [6]. These are the conceptual tools at the basis of several methods for attributing a similarity score to two aligned protein sequences. Any amino acid substitution matrix, which is a 20 × 20 table, has a scoring method that is implicitly associated with a set of target frequencies p(i, j) [7, 8], pertaining to the pair i, j of amino acids that are paired in the alignment. An important approach to obtaining the score associated with the paired amino acids i, j was that suggested by Dayhoff et al. [6], who developed a stochastic model of protein evolution called PAM (points of accepted mutations). In this model, the frequencies m(i, j) indicate the probability of change from one amino acid i to another amino acid j, in homologous protein sequences with at least 85% identity, during short-term evolution. The matrix M, relating each amino acid to each of the
other 19, with an evolutionary distance of 1, would have entries m(i, j) close to 1 on the main diagonal (i = j) and close to 0 off the main diagonal (i ≠ j). An M^k matrix, which estimates the expected probability of changes at a distance of k evolutionary units, is then obtained by multiplying the M matrix by itself k times. Each M^k matrix is then associated to the scoring matrix PAM-k, whose entries are obtained on the basis of the log ratio

s(i, j) = log [ m^k(i, j) / (p(i)p(j)) ],    (1)

where p(i) and p(j) are the observed frequencies of the amino acids.
S. Henikoff and J. G. Henikoff introduced the BLOck SUbstitution Matrix (BLOSUM) [9]. While the scoring method is always based on a log-odds ratio, as seems natural in any kind of substitution matrix [7], the method for deriving the target frequencies is quite different from PAM; one needs to evaluate the joint target frequencies p(i, j) of finding the amino acids i and j paired in alignments among homologous proteins with a controlled rate of percent identity. This joint probability is compared with p(i)p(j), the product of the background frequencies of amino acids i and j, derived from the amino acid probability distribution P = {p1, p2, ..., p20}.


The target and background frequencies are tied by the equality p(i) = Σ_{j=1}^{20} p(i, j), so that the background probability distribution is the marginal of the joint target frequencies [10]. The product p(i)p(j) reflects the likelihood of the independence setting, namely that the amino acids i and j are paired by pure chance. If p(i, j) > p(i)p(j), then the presence of i stochastically induces the presence of j, and vice versa (i and j are "attractive"), while if p(i, j) < p(i)p(j), then the presence of i stochastically prevents the presence of j, and vice versa (i and j are "repulsive"). The log ratio (taken to the base 2)

s(i, j) = log [ p(i, j) / (p(i)p(j)) ]    (2)

furnishes the score associated with the pair of amino acids i, j, when these are found in a certain position h of an assigned protein alignment; it is positive when p(i, j) > p(i)p(j), and negative when the opposite occurs. The i, j entry of the BLOSUM matrix is the score of the pair i, j (or j, i, which is the same since the sequences are not ordered; for a different approach see Yu et al. [11]) multiplied by a suitable scale factor (4 for BLOSUM-35 and BLOSUM-40, 3 for BLOSUM-50, and 2 for the remaining). The value so obtained is then rounded to the nearest integer, and the (unscaled) global score of two sequences X = x1, x2, ..., xn and Y = y1, y2, ..., yn of length n is given by summing up the scores relative to each position

S(X, Y) = Σ_{h=1}^{n} s(x_h, y_h) = Σ_{i,j} n(i, j) log [ p(i, j) / (p(i)p(j)) ],    (3)

where n(i, j) is the number of occurrences of the pair i, j inside the aligned sequences. This equation weighs the log ratio associated to the i, j entry of the BLOSUM matrix with the occurrences of the pair i, j, and seems intuitive following a heuristic approach, as any reasonable substitution matrix is implicitly of this form [7]. In order to compute the necessary target and background frequencies p(i, j) and p(i)p(j), S. Henikoff and J. G. Henikoff used the database BLOCKS (http://blocks.fhcrc.org/index.html), which contains sets of proteins with a controlled maximum rate λ of percent identity that defines the BLOSUM-λ matrix, so that BLOSUM-62 refers to λ = 62%, and so forth.

Scoring substitution matrices, such as PAM or BLOSUM, are used in modern web tools (BLAST, PSI-BLAST, and others) for performing database searches; the search is accomplished by finding all sequences that, when compared to a given query sequence, sum up a score over a certain threshold. The aim is usually that of discovering biological correlation among different sequences, often belonging to different organisms, which may be associated with a similar biological function. In most cases, this correlation is quite evident when proteins are associated with genes that have duplicated, or organisms that have diverged from one another relatively recently, and leads to high values of the BLOSUM (or PAM) score. But in some cases, a relevant biological correlation may be obscured by phenomena that reduce the score, making it difficult to capture. Those that limit the efficiency of the scoring method in finding concealed or weakly correlated sequences are well documented in the literature, the most relevant being:

(1) Gaps: insertions or deletions (of one or more residues) in one or both the aligned sequences cause loss of synchronization, significantly decreasing the score;
(2) Bad λ: using a BLOSUM-λ matrix tailored for a particular evolutionary distance on sequences with a different evolutionary distance leads to a misleading score [7, 12, 13];
(3) divergence in background distribution: standard substitution matrices, such as BLOSUM-λ, are truly appropriate only for comparison of proteins with standard background frequency distributions of amino acids [11].

We have set out to inspect, in more depth and by use of mathematical tools, what the BLOSUM score really measures from a biological point of view; the aim was to split the score into components, the BLOSpectrum, that provide insight on the above described phenomena and other biological information regarding the compared sequences, once the alignment has been made using the classical methods (BLAST, FASTA, etc.). We do not propose an alternative alignment algorithm or a method for increasing the performance of the available ones; nor do we suggest new methods for inserting gaps so as to maximize the score (see, e.g., [14, 15]). Ours is simply a diagnostic tool to reveal the following:

(1) if, for an available algorithm, the chosen scoring matrix is correct;
(2) whether the aligned sequences are typical protein sequences or not;
(3) whether the alignment itself is typical with respect to the BLOCKS database; and
(4) the possible presence of a weak or concealed correlation also for alignments resulting in a relatively low BLOSUM score, that might otherwise be neglected.

The method is associated with the use of a BLOSUM matrix that has been developed within the context of local (ungapped) alignment statistics [7, 8, 11]. To allow a critical evaluation of our method, we furnish an online software package that provides values for each component of the BLOSpectrum for two aligned sequences (http://bioinf.dimi.uniud.it/software/software/blosumapplet). Providing a rationale about the biological significance of an obtained score sharpens the comparison of weakly related sequences, and can reveal that comparable scores actually conceal completely different biological relationships. Furthermore, our decomposition helps in selecting the matrix that is correctly tailored for the actual evolutionary divergence associated to the two sequences one is going to compare, or in deciding if a compositionally adjusted matrix might not perform better. Although we have used the BLOSUM scoring method for our analyses, since it is the most widely used by web tools measuring protein similarities, our decomposition is applicable, in principle, to any scoring matrix in the form of (3), and confirms that the usefulness of this type of matrix has a solid mathematical justification.
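As a minimal numerical illustration of the log-odds machinery of (1)-(3), the sketch below uses an invented three-letter alphabet in place of the 20 amino acids; the target and background frequency values are illustrative numbers, not actual BLOSUM or PAM entries.

```python
import math

# Illustrative symmetric target frequencies p(i, j) over a toy
# three-letter "amino acid" alphabet {A, B, C}; a real BLOSUM table
# is 20 x 20 and derived from the BLOCKS database.
target = {
    ("A", "A"): 0.20, ("B", "B"): 0.15, ("C", "C"): 0.15,
    ("A", "B"): 0.10, ("B", "A"): 0.10,
    ("A", "C"): 0.05, ("C", "A"): 0.05,
    ("B", "C"): 0.10, ("C", "B"): 0.10,
}
# Background frequencies p(i) are the marginals of the joint target
# frequencies, exactly as the text states: p(i) = sum_j p(i, j).
background = {a: sum(p for (i, j), p in target.items() if i == a)
              for a in "ABC"}

def pair_score(i, j):
    """Log-odds score of (2), in bits: positive for 'attractive' pairs,
    negative for 'repulsive' ones."""
    return math.log2(target[(i, j)] / (background[i] * background[j]))

def global_score(x, y):
    """Unscaled global score of (3): the sum of per-position pair scores
    over an (ungapped) alignment of equal-length sequences."""
    return sum(pair_score(a, b) for a, b in zip(x, y))
```

With these numbers, pair_score("A", "A") is positive (the pair occurs more often than chance) while pair_score("A", "C") is negative, matching the attractive/repulsive reading of (2).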
2. METHODS

2.1. Mathematical analysis of the BLOSUM score


The BLOSUM score (3) can be analyzed from a mathematical
perspective using well-known tools developed by Shannon
in his seminal paper that laid the foundation for Information
Theory [16, 17]. The first of these is the Mutual Information
I(X, Y ) (or relative entropy) between two random variables
X and Y ,
I(X, Y) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ],    (4)

where p(i, j), p(i), p( j) are, respectively, the joint probability distribution and the marginals associated to the random variables X and Y . We can adapt (4) to the comparison of two sequences if we interpret p(i, j) as the relative
frequency of finding amino acids i and j paired in the X
and Y sequences, and p(i) (p( j)) of finding amino acid i
( j) in sequence X (Y ). Following this approach, in a biological setting, mutual information (MI) becomes a measure
of the stochastic correlation between two sequences. It can be
shown (see the appendix) that I(X, Y) ≤ log 20 ≈ 4.3219.
The second tool is the informational divergence D(P//Q) between two probability distributions P = {p1, p2, ..., pK} and Q = {q1, q2, ..., qK} [18], where

D(P//Q) = Σ_{i=1}^{K} p(i) log [ p(i) / q(i) ].    (5)

The informational divergence (ID) can be interpreted as


a measure of the nonsymmetrical distance between two
probability distributions. A more detailed mathematical
treatment of the properties associated with MI and ID is provided in the appendix. Here, we simply indicate that ID and
MI are nonnegative quantities, and that they are tied by the
formula
I(X, Y) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ] = D(PXY // PX PY) ≥ 0,    (6)
so that MI is really a special kind of ID, that measures the
distance between the joint probability distributions PXY
and the product PX PY of the two marginals PX and PY .
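The relation (6) can be checked numerically. The sketch below, using an arbitrary toy joint distribution, implements the informational divergence (5) and the mutual information (4), and confirms that MI is the divergence between the joint distribution and the product of its marginals.

```python
import math

def divergence(p, q):
    """Informational divergence D(P//Q) of (5), in bits, for two
    distributions given as aligned lists over a common support."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """MI of (4) from a joint distribution given as a dict {(i, j): p};
    the marginals are accumulated from the joint, as in the text."""
    px, py = {}, {}
    for (i, j), p in joint.items():
        px[i] = px.get(i, 0) + p
        py[j] = py.get(j, 0) + p
    return sum(p * math.log2(p / (px[i] * py[j]))
               for (i, j), p in joint.items() if p > 0)
```

For the toy joint {(0,0): 0.4, (0,1): 0.1, (1,0): 0.1, (1,1): 0.4}, whose marginals are uniform, mutual_information(joint) coincides with divergence(joint, product-of-marginals), as (6) asserts.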
Given two amino acid sequences, X and Y, the corresponding BLOSUM (unscaled) normalized score SN(X, Y), measured in bits, is computed as

SN(X, Y) = (1/n) Σ_{h=1}^{n} s(x_h, y_h) = Σ_{i,j} f(i, j) log [ p(i, j) / (p(i)p(j)) ],    (7)

where f(i, j) = n(i, j)/n is the relative frequency of the pair i, j observed on the aligned sequences X and Y. Because one usually deals with sequences that could have remarkably different lengths, we report the normalized per-residue score to permit a coherent comparison. It is important to stress the fact that while f(i, j) is the observed frequency pertaining to the sequences under inspection, the target frequencies p(i, j), together with the background marginals p(i) and p(j), pertain to the database BLOCKS. In a sense, they constitute the model of the typical behaviour of a protein, since p(i) or p(j) is in fact the typical probability distribution of amino acids as observed in most proteins, while p(i, j) is the typical probability of finding the amino acids i and j positionally paired in two protein sequences with a percent identity depending on λ. From an evolutionary point of view, we can say that if p(i, j) is greater than in the case of independence, then it is very likely that i and j are biologically correlated.

Equation (7) is in fact quite similar to (4), which specifies mutual information, the only difference being the use of f(i, j) instead of p(i, j) as the multiplying factor for the logarithmic term, so that the normalized score is a kind of mixed mutual information. As a matter of fact, we can define

I(A, B) = Σ_{i,j} p(i, j) log [ p(i, j) / (p(i)p(j)) ]    (8)

as the mutual information, or relative entropy, of the target and background frequencies associated to the database BLOCKS, or to any other protein model used to find the target frequencies. Here A and B are dummy random variables taken to have generated the data of the database. The quantity I(A, B) was in effect used by Altschul in the case of PAM matrices [7], and by S. Henikoff and J. G. Henikoff [9] for the BLOSUM matrices, and in both cases it can be interpreted as the average exchange of information associated with a pair of aligned amino acids of the data bank, or as the expected average score associated to pairs of amino acids, when they are put into correspondence in alignments that adhere to the protein model over which the matrices are computed. From the perspective of an aligning method, we can state that I(A, B) measures the average information available for each position in order to distinguish the alignment from chance, so that the higher its value, the shorter the fragments whose alignment can be distinguished from chance [7]. Equation (6) (or (A.4) in the appendix) ensures also that this average score is always greater than or equal to zero.

On the other hand, if we compute the expected score when two amino acids i and j are picked at random in an independence setting model, given as

E(A, B) = Σ_{i,j} p(i)p(j) log [ p(i, j) / (p(i)p(j)) ] = -D(PX PY // PXY) ≤ 0,    (9)

the classical assumptions made in constructing a scoring matrix [7] require that this expected score is lower than or equal to zero. Note that all these quantities pertain to the database BLOCKS (in the case of BLOSUM), that is, to the particular protein model used.


To solely evaluate the stochastic similarity between two sequences X and Y, the identity

I(X, Y) = Σ_{i,j} f(i, j) log [ f(i, j) / (fX(i) fY(j)) ],    (10)

which measures the degree of stochastic dependence between the protein sequences, would suffice (here fX(i) = n(i)/n and fY(j) = n(j)/n are the relative frequencies of amino acid i observed in sequence X and amino acid j observed in sequence Y). But this is not so interesting from the biological point of view, as one has to take into account the possibility that, even if similar from the stochastic point of view, two sequences are far from being an example of a typical protein-to-protein matching (or evolutionary transition). In other words, we need to inspect this stochastic similarity under the lens of the protein model used in the BLOCKS database (or by the PAM model, for that matter).

Subjecting the (unscaled) normalized score SN(X, Y) of (7) to simple mathematical manipulations (see the appendix for details), we can split SN(X, Y) into the following terms:

SN(X, Y) = I(X, Y) - D(FXY // PAB) + D(FX // PA) + D(FY // PB).    (11)

Here, FXY is the joint frequency distribution of the amino acid pairs in the sequences (observed target frequencies), while FX and FY are, respectively, the distributions of the amino acids inside X and Y (observed background frequencies). PAB instead is the joint probability distribution associated to the BLOCKS database, and is the vector of target frequencies. Note also that PA = PB = P are the probability distributions of the amino acids inside the same database BLOCKS, that is, the database background frequencies; they are equal as a consequence of the symmetry of the BLOSUM matrix entries, since p(i, j) = p(j, i). We define the set {I(X, Y), D(FXY//PAB), D(FX//P), D(FY//P)} to be the BLOSUM spectrum of the aligned sequences (or BLOSpectrum). Notice that (11) holds also when the BLOSUM matrix is compositionally adjusted following the approach described in Yu et al. [11], that is, when the background frequencies are different (PA ≠ PB).

The terms constituting the BLOSpectrum have a different order of magnitude, as D(FX//P) and D(FY//P) act with a cardinality of 20, when compared to the joint divergences I(X, Y) and D(FXY//PAB), that act on probability distributions whose cardinality is 20 × 20 = 400. From a practical point of view, this means that the contribution of I(X, Y) and D(FXY//PAB) to the score is expected to be roughly double that of D(FX//P) and D(FY//P). Actually, under the hypothesis of a Bernoullian process (i.e., stationary and memoryless), we have D(P^2 // Q^2) = 2 D(P//Q) [18] (as in our case 20^2 = 400), and the sum of the two terms D(FX//P) + D(FY//P) compensates the order of magnitude of the joint divergences.

Finally, it should be recalled that the score actually obtained by using the BLOSUM matrices, whose entries are multiplied by the constant c and rounded to the nearest integer, is an approximation of the exact score SN(X, Y) of (11), once it has been scaled. The difference is usually quite small (about 2-3% if the score is high), but it becomes more and more significant as the score approaches zero.

2.2. Taking gaps into account

An important consideration regarding our mathematical analysis is that it does not formally take gaps into account.
From a mathematical perspective, the only way to account
correctly for gaps would be to use a 21 × 21 scoring matrix, in which the gap is treated as equivalent to a 21st amino acid, so that pairs of the form (i, -) or (-, j), where the symbol - represents the gap, are also contemplated; but from a biological perspective this might not be acceptable, since a gap is not
a real component of a sequence. We can nevertheless extend
our analysis to a gapped score if we admit the independence
between each gap and any residue paired with it. Biologically,
independence may be questionable, and would need to be
determined case by case, as each gap is due to a chance deletion or insertion event subsequently acted on by natural selection (which may be neutral or positive). Moreover, there
is no certainty as to the correct positioning of a gap in any
given alignment, as it is introduced a posteriori as the product of an alignment algorithm that takes the two sequences
X and Y , and tries to minimize (by an exact procedure, or
by a heuristic approach) the number of changes, insertions
or deletions that allow to transform X into Y (or vice versa).
In practice, we consider quite reasonable the idea that gaps
in a given position should imply a degree of independence as
to which amino acids might occur there in related proteins;
this is accepted also in PSI-BLAST [19]. The consequence of
assuming independence is that p(-, j) = p(-)p(j) leads to a null contribution of the corresponding score, since s(-, j) = log[p(-, j)/(p(-)p(j))] = 0 (see (3)), so that for gapped sequences, we simply assign a score equal to zero whenever an amino acid is paired with a gap. Note that this does not mean that we reduce a gapped alignment to an ungapped one, but that we simply ignore the gap and the corresponding residue, since the pair is not affecting the BLOSpectrum, due to its
zero contribution to the score. Moreover, it is conceivable
that for distant sequence correlations, the use of different algorithms, or of different gap penalty schemes for any given algorithm, could result in a different pattern of gaps and consequently in different sequence alignments, each with a corresponding BLOSpectrum. In this case, the likelihood of each alignment might be tested by exploiting the BLOSpectrum, which might be quite different even if the numerical scores have
approximately the same value; this can help identify the most
appropriate one.
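A compact sketch of the BLOSpectrum computation combines the decomposition (11) with the gap convention just described (positions pairing a residue with a gap contribute zero score and are skipped). The two-letter model tables below are illustrative stand-ins for the BLOCKS-derived frequencies, and the function assumes every observed pair has an entry in the model.

```python
import math

def blospectrum(x, y, p_joint, p_bg):
    """Split the per-residue score (7) into the BLOSpectrum terms of (11):
    S_N = I(X, Y) - D(F_XY//P_AB) + D(F_X//P) + D(F_Y//P).
    Positions where either sequence carries the gap symbol '-' are skipped,
    following Section 2.2; at least one non-gap pair is assumed."""
    pairs = [(a, b) for a, b in zip(x, y) if a != "-" and b != "-"]
    n = len(pairs)
    f_xy, f_x, f_y = {}, {}, {}
    for a, b in pairs:  # observed target and background frequencies
        f_xy[(a, b)] = f_xy.get((a, b), 0) + 1 / n
        f_x[a] = f_x.get(a, 0) + 1 / n
        f_y[b] = f_y.get(b, 0) + 1 / n
    log2 = math.log2
    i_xy = sum(f * log2(f / (f_x[a] * f_y[b])) for (a, b), f in f_xy.items())
    d_t = sum(f * log2(f / p_joint[(a, b)]) for (a, b), f in f_xy.items())
    d_x = sum(f * log2(f / p_bg[a]) for a, f in f_x.items())
    d_y = sum(f * log2(f / p_bg[b]) for b, f in f_y.items())
    s_n = sum(f * log2(p_joint[(a, b)] / (p_bg[a] * p_bg[b]))
              for (a, b), f in f_xy.items())
    return s_n, (i_xy, d_t, d_x, d_y)
```

Running this on a toy gapped alignment confirms numerically that the normalized score equals I(X, Y) - D(FXY//PAB) + D(FX//P) + D(FY//P), as claimed in (11).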
3. RESULTS AND DISCUSSION

3.1. Meaning and biological implications of the BLOSpectrum terms

Let us now analyze the meaning of the terms in (11).


(i) The mutual information I(X, Y ) is the sequence convergence, which measures the degree of stochastic dependence (or stochastic correlation) between aligned
sequences X and Y ; the greater its value, the more statistically correlated are the two. It is highly correlated
with, but not identical to, the percent identity of the
alignment, as it also includes the propensity of finding
certain amino acids paired, even if dierent.
This term enhances the overall BLOSUM score, since
it is taken with the plus sign.
(ii) The target frequency divergence D(FXY//PAB) measures the difference between the observed target frequencies and the target frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between FXY and PAB, that is, the distance between the way amino acids are paired in the X and Y sequences and the way they are paired inside the protein model implicit in the BLOCKS database. When the vector of observed frequencies FXY is far from the vector of target frequencies PAB exhibited by the protein model, the divergence is high, so that starting from X we obtain a Y (or vice versa) that is not the one we would expect on the basis of the target frequencies of the database; in other words, the amino acids are paired following relative frequencies that are not the standard ones. The term D(FXY//PAB) is a penalty factor in (11), since it is taken with the minus sign.

(iii) The background frequency divergence D(FX//PA) (or D(FY//PB)) of the sequence X (or Y) measures the difference between the observed background frequencies and the background frequencies implicit in the substitution matrix. In mathematical terms, it measures the stochastic distance between the observed frequencies FX (or FY) and the vector P = PA = PB of background frequencies of the amino acids inside the database BLOCKS. The greater its value, the more different the observed frequencies are from the background frequencies exhibited by a typical protein sequence. This term enhances the score, since it is taken with the plus sign.
Note that the quantities that constitute the decomposition of the BLOSUM score are not independent of one another. For example, D(FXY//PAB) ≈ 0 implies low values for D(F//P) as well. This is because when FXY ≈ PAB (or D(FXY//PAB) ≈ 0; see the appendix), the observed marginals FX and FY are also forced to approach the background marginal, that is, FX ≈ P and FY ≈ P, which implies D(F//P) ≈ 0. This is a consequence of the tie between a joint probability distribution and its marginals [10]. For the same reason, if D(F//P) ≫ 0, then D(FXY//PAB) will also be large, although the opposite is not necessarily the case. This leads to an (at least partial) compensation of the effects, due to the minus sign of the target frequency divergence, so that −D(FXY//PAB) + D(FX//PA) + D(FY//PB) has a small value. This implies that a significant BLOSUM score can be obtained only when the aligned sequences are statistically correlated, that is, when I(X, Y) has a high value. Since when performing an alignment we are mainly interested in positive or almost positive global scores, it is a straightforward consequence that only alignments characterized by remarkable values of I(X, Y) will emerge.
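The decomposition just described can be made concrete with a small sketch. The following Python code is a toy illustration, not the authors' BLOSpectrum software: it computes the four terms of (11) for an ungapped alignment of equal-length sequences, and the three-letter alphabet with a uniform, independent target model used in the example is purely hypothetical (the real computation would use the 20 × 20 BLOCKS-derived frequencies).

```python
import math
from collections import Counter

def kl(f, q):
    """Kullback-Leibler divergence D(f // q) in bits; zero-frequency terms vanish."""
    return sum(v * math.log2(v / q[k]) for k, v in f.items() if v > 0)

def blospectrum(x, y, pab, p):
    """Decompose the normalized score of an ungapped, equal-length alignment
    following (11): S_N = I(X,Y) - D(FXY//PAB) + D(FX//P) + D(FY//P)."""
    n = len(x)
    fxy = {k: v / n for k, v in Counter(zip(x, y)).items()}
    fx = {k: v / n for k, v in Counter(x).items()}
    fy = {k: v / n for k, v in Counter(y).items()}
    # Mutual information of the observed joint frequencies.
    mi = sum(v * math.log2(v / (fx[a] * fy[b])) for (a, b), v in fxy.items())
    d_xy, d_x, d_y = kl(fxy, pab), kl(fx, p), kl(fy, p)
    return {"I": mi, "D_XY": d_xy, "D_X": d_x, "D_Y": d_y,
            "S_N": mi - d_xy + d_x + d_y}

# Hypothetical background and target frequencies over a toy 3-letter alphabet:
P = {a: 1 / 3 for a in "ARN"}
PAB = {(a, b): 1 / 9 for a in "ARN" for b in "ARN"}
```

For two identical sequences over this toy model, I(X, Y) is maximal while the background divergences vanish, matching the qualitative behavior of the terms discussed above.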
There are therefore essentially three cases of biological interest, which we can now analyze in terms of the correspondence between the mathematical and the biological meaning of the terms.

Case 1. The joint observed frequencies FXY are typical,1 that is, they are very close to the target frequencies, FXY ≈ PAB. In this case, D(FXY//PAB) ≈ 0 and also D(F//P) ≈ 0.

Case 2. The joint observed frequencies FXY are not typical (FXY ≠ PAB), but the marginals are typical (FX ≈ P, FY ≈ P). In this case, D(FXY//PAB) ≫ 0, but D(F//P) ≈ 0.

Case 3. Both the joint observed FXY and the marginals FX, FY are not typical, that is, FXY ≠ PAB, FX ≠ P, FY ≠ P. In this case, D(FXY//PAB) ≫ 0, but also D(F//P) ≫ 0.
Case 1 is straightforward: two similar protein sequences, with a typical background amino acid distribution and amino acids paired in a way that complies with the protein model implicit in BLOCKS, result in a high score. This is frequently the case for two firmly correlated sequences, belonging to the same family of proteins with standard amino acid content, associated with organisms that diverged only recently.

Case 2 is rather more interesting: the amino acid distribution is close to the background distribution (these are typical protein sequences), but the score is heavily penalized because the observed joint frequencies differ from the target frequencies implicit in the BLOCKS database. This can have different causes. For example, the chosen BLOSUM matrix may be incorrectly matched to the evolutionary distance of the sequences, or the sequences may have diverged under a nonstandard evolutionary process. For high-scoring alignments involving unrelated sequences, the target frequency divergence D(FXY//PAB) will tend to be low, due to the second theorem of Karlin and Altschul [8], when the target frequencies associated with the scoring matrix in use are the correct ones for the aligned sequences being analyzed.2 This is because any set of target frequencies in any particular amino acid substitution matrix, such as BLOSUM-λ, is tailored to a particular degree of evolutionary divergence between the sequences, generally measured by relative entropy (8) [7], and related to the controlled maximum rate λ of percent identity. So a low D(FXY//PAB) ≈ 0 is evidence that the BLOSUM-λ matrix we are using is the correct one, as a precise consequence of a mathematical theorem, while conversely, for positive (or almost positive) scoring alignments with large target frequency divergence, the sequences may be
1 Recall that the concept of typicality always refers to the adherence of the various probability distributions to that of the protein model associated with the database BLOCKS.
2 Note that, in general, choosing the λ parameter associated with the smallest D(FXY//PAB) is different from choosing the minimum E-value associated with different λ parameters. Recall that E = mn2^(−S), where S is the score and m and n are the sequence lengths.

related at a different evolutionary distance than that of the substitution matrix in use. Trying several scoring matrices until something interesting is found is a common practice in protein sequence alignment [20]. In our case, scanning the λ range could thus lead to a significant decrease in D(FXY//PAB), as detected in the BLOSpectrum, and improve the score [7, 12, 13], taking it back to Case 1. This could in turn result in a better capacity to discriminate weakly correlated sequences from those correlated by chance. If, on the other hand, tuning λ does not greatly affect D(FXY//PAB), and we are comparing typical sequences (low background frequency divergence) with an appropriate λ parameter, the large target frequency divergence indicates that some nonstandard evolutionary process (regarding the substitution of amino acids) is at work. This cannot adequately be captured by the standard BLOCKS database and BLOSUM substitution matrices. Under these circumstances, Case 2 can never lead to high scores, due to the penalization of the target frequency divergence. We are here likely in the grey area of weakly correlated sequences with a very old common ancestor, or of portions of proteins with strong structural properties that do not require the conservation of the entire sequence. Note that, unfortunately, we are not able to assess the statistical significance when our method finds a suspected concealed correlation; however, the method still gives useful information that helps guide our judgment on the possible existence of such a correlation, which then needs to be investigated further in depth, exploiting other biological information such as 3D structure and biological function.
Case 3 accounts for the situation in which we have two nontypical sequences, with high values of both target and background frequency divergence. This applies, for example, to some families of antimicrobial peptides that are unusually rich in certain amino acids (such as Pro and Arg, Gly, or Trp residues). This means that the high penalty arising from the subtracted D(FXY//PAB) is (at least partially) compensated by the positive D(FX//PA) and D(FY//PB), and the global score does not collapse to negative values, even if it is usually low. In effect, the background frequency divergence acts as a compensation factor that prevents excessive penalties for those sequences which, even though related by nonstandard amino acid substitutions, also have a nontypical background distribution of the amino acids inside the sequences themselves. In other words, the nontypicality of FXY is (at least in part) forced by the anomalous background frequencies of the amino acids. This compensation is welcome, since it avoids missing biologically related sequences pertaining to nontypical protein families, and mathematically corroborates the robustness of the BLOSUM scoring method.
The problem of evaluating the best method for scoring nonstandard sequences has recently been tackled by Yu et al. [11, 21], who showed that standard substitution matrices are not truly appropriate in this case, and developed a method for obtaining compositionally adjusted matrices. In general, when background frequencies differ markedly from those implicit in the substitution matrix (i.e., the background frequency divergence is high), using a standard matrix is nonoptimal. Another such case is when the background frequencies vary, and the scale factor λ = log(p(i, j)/(p(i)p(j)))/s(i, j) appropriate for normalizing nominal scores varies as well [8]. If the real λ is lower than the standard one, then the uncorrected nominal score can appear much too high [19, 22]. Our approach offers a different perspective on the problem, that is, the possibility of gaining insight about biological sequence correlation directly from the BLOSUM score. Moreover, the background frequency divergence components of the BLOSpectrum indicate whether compositionally adjusted matrices could be useful in the case under inspection. Since [21] illustrates three criteria for invoking compositional adjustment (length ratio, compositional distance, and compositional angle), we suggest that the occurrence of Case 3 in the BLOSpectrum could be thought of as an additional fourth criterion. The background divergence of the BLOSpectrum decomposition offers a further rationale confirming the effectiveness of the procedure proposed by Yu et al., since a large background divergence D(F//P) forces the target frequency divergence D(FXY//PAB) to be unnaturally large; compositionally adjusted matrices, which minimize the background frequency divergence, tend to remove this effect, leaving the target divergence free to assume the value associated with the correct degree of evolutionary divergence between the sequences under inspection.
As a consequence of the three cases discussed above, we
can suggest the following procedure for analyzing the score
obtained from an alignment between two given sequences
of the same length, or resulting from a BLAST or FASTA
(gapped or ungapped) database search.
Scoring analysis procedure

(1) Given the two sequences, evaluate the components of (11) by inserting the sequences in the available software to obtain the BLOSpectrum (http://bioinf.dimi.uniud.it/software/software/blosumapplet).
(2) Evaluate the target frequency divergence D(FXY//PAB) for each λ.
(3) Choose the λ value that minimizes D(FXY//PAB).
(4) Determine whether the alignment falls in Case 1, 2, or 3 as described.
(5) If the alignment falls in Case 1, we have two strictly correlated proteins.
(6) If, even after tuning λ, the alignment falls in Case 2 (D(FXY//PAB) is high, but D(F//P) is low), then we may have a concealed or weak correlation between the sequences.
(7) If the alignment falls in Case 3 (both D(FXY//PAB) and D(F//P) are high), we may have correlated sequences belonging to a nontypical family. In this case, the use of compositionally adjusted matrices may provide a sharper score [11, 21].
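Steps (2)-(4) of the procedure can be sketched in a few lines once the BLOSpectrum components have been computed for each available λ. The dictionary layout and the "high" thresholds used below (taken from the rule-of-thumb ranges of Table 1) are illustrative assumptions on our part, not part of the authors' software.

```python
def choose_lambda_and_case(spectra):
    """spectra maps each lambda to its BLOSpectrum components, e.g.
    {100: {"D_XY": 0.93, "D_bg": 0.11}, ...}, where D_XY stands for
    D(FXY//PAB) and D_bg for the total background divergence D(F//P).
    Returns the lambda minimizing D(FXY//PAB) (step 3) and the Case
    number (step 4), using the 'high' thresholds of Table 1."""
    lam = min(spectra, key=lambda l: spectra[l]["D_XY"])
    s = spectra[lam]
    if s["D_XY"] <= 1.5:        # target divergence not high -> Case 1
        case = 1
    elif s["D_bg"] <= 0.7:      # high target, low background -> Case 2
        case = 2
    else:                       # both divergences high -> Case 3
        case = 3
    return lam, case
```

With components resembling the HNF4-α example discussed below, the minimization selects λ = 100 and classifies the alignment as Case 1.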
In analyzing the parameters that compose the BLOSpectrum, so as to decide among Cases 1, 2, and 3, we find it useful to use an indicative, if somewhat arbitrary, set of guidelines, as summarized in Table 1.
We assign a range of values for each parameter (tag L = Low, tag M = Medium, tag H = High). These values have been derived from a rule-of-thumb approach when analyzing the results of the experiments described in the following sections; they will obviously need to be tuned as soon as new experimental evidence becomes available.

Table 1: Rule-of-thumb guidelines to decide among low (L), medium (M), and high (H) values of the parameters.

Parameter       L        M          H
I(X, Y)         <0.9     0.9-1.1    >1.1
D(FXY//PAB)     <1.1     1.1-1.5    >1.5
D(F//P)         <0.3     0.3-0.7    >0.7
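The L/M/H tagging of Table 1 amounts to a small threshold lookup, sketched below; the parameter keys are hypothetical names chosen here for illustration.

```python
# Rule-of-thumb (low, high) boundaries from Table 1.
RANGES = {
    "I":    (0.9, 1.1),   # I(X, Y)
    "D_XY": (1.1, 1.5),   # D(FXY//PAB)
    "D_bg": (0.3, 0.7),   # D(F//P)
}

def tag(param, value):
    """Return 'L', 'M', or 'H' for a BLOSpectrum parameter value."""
    low, high = RANGES[param]
    if value < low:
        return "L"
    return "M" if value <= high else "H"
```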
The final consideration is that, when comparing biologically related sequences, one has to choose the correct scoring matrix, if necessary by means of a compositional adjustment. If, as a result, the background and target frequency divergences have low values, the mutual information or sequence convergence I(X, Y) remains as the effective parameter that measures protein similarity. If, after considering the above possibilities, one still observes a residual persistence of the target frequency divergence, then two weakly correlated sequences have presumably been identified, which derived from a remote common ancestor after several substitution events.
3.2. Practical implementation of the method

As stated in the Introduction, we recall that the analysis based on the BLOSpectrum evaluation is not aimed at increasing the performance of available alignment algorithms, nor at suggesting new methods for inserting gaps so as to maximize the score. The BLOSpectrum only gives added information of biological and operative interest, and only once two sequences have already been aligned using current algorithms, such as BLAST, BLAST2, FASTA, or others. The ultimate biological goal of the method is to reveal the possible presence of a weak or concealed correlation for alignments resulting in a relatively low BLOSUM score, which might otherwise be neglected. Another operative merit is that knowledge of the target frequency divergence helps identify the best scoring matrix, that is, the one tailored for the correct evolutionary distance.

In order to perform automatic computation of the four terms of (11), we have developed the software BLOSpectrum, freely available at http://bioinf.dimi.uniud.it/software/software/blosumapplet. Given two sequences of the same length, with or without gaps, the software derives the vectors FX, FY, and FXY by computing the relative frequencies f(i) = n(i)/n, f(j) = n(j)/n, and f(i, j) = n(i, j)/n, that is, the relative frequency of amino acid i observed in sequence X, of amino acid j observed in sequence Y, and the relative frequency of the pair i, j. The vectors PAB = {p(i, j)}i,j and P = {p(i)}i, needed to decompose the score, are those derived from the BLOCKS database and used by S. Henikoff and J. G. Henikoff [9] to extract the score entries of the 20 × 20 BLOSUM matrices (35, 40, 50, 62, 80, 100); they have been kindly provided by these authors on request. The software also computes the exact BLOSUM normalized score, that is, the algebraic sum of the four terms, together with the rough BLOSUM score, directly obtained by summing up the integer values of the BLOSUM-λ matrix. As already observed in Section 2.2, pairs containing a gap, such as (−, j) or (i, −), are not considered in the computation, since their contribution to the score is zero when one assumes independence between a gap and the paired amino acid.
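The frequency vectors just described can be sketched as follows; the gap-skipping mirrors the convention of Section 2.2, and the gap character used here is an assumption.

```python
from collections import Counter

def observed_frequencies(x, y, gap="-"):
    """Relative frequencies f(i), f(j), and f(i, j) from two aligned sequences
    of equal length; alignment columns containing a gap are skipped, in line
    with their zero contribution to the score."""
    pairs = [(a, b) for a, b in zip(x, y) if gap not in (a, b)]
    n = len(pairs)
    fxy = {k: c / n for k, c in Counter(pairs).items()}
    fx = {k: c / n for k, c in Counter(a for a, _ in pairs).items()}
    fy = {k: c / n for k, c in Counter(b for _, b in pairs).items()}
    return fx, fy, fxy
```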
There are essentially two ways of employing the BLOSpectrum. The first is to perform a BLAST or FASTA search inside a database, given a query sequence. The result is a set of h possible matches, ordered by score, in which the query sequence and the corresponding match are paired for lengths n1, n2, ..., nh, respectively. The user can extract all matches of interest within the output set and compare them with the query sequence by using the BLOSpectrum software. The second is to compare two assigned sequences with a program such as BLAST2, so as to find the best gapped alignment. Also in this case, we can use the BLOSpectrum on the two portions of the query sequences that are paired by BLAST2 and that have the same length n. The obvious next step would be to integrate the BLOSpectrum tool inside a widely used database search engine.

Although the correct way of using the BLOSpectrum software is to supply it with two sequences of the same length, derived from preceding queries of BLAST, BLAST2, FASTA, or others, the BLOSpectrum applet also accepts two sequences of different lengths n and m > n; in this case the program merely computes the scores associated with all possible alignments of n over m, showing the highest one, but it does not insert gaps.
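The different-length fallback just described amounts to scanning all m − n + 1 ungapped placements of the shorter sequence along the longer one. A minimal sketch, with a hypothetical pairwise scoring function standing in for a real BLOSUM matrix lookup:

```python
def best_ungapped_placement(short, long, score):
    """Score every ungapped placement of `short` along `long` (no gaps are
    inserted) and return (offset, score) of the best one."""
    n, m = len(short), len(long)
    best = None
    for off in range(m - n + 1):
        s = sum(score(a, b) for a, b in zip(short, long[off:off + n]))
        if best is None or s > best[1]:
            best = (off, s)
    return best

# Toy scoring function in place of a BLOSUM lookup:
match = lambda a, b: 1 if a == b else -1
```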
3.3. Biological examples

To illustrate the behavior of the BLOSpectrum from the perspective of the above three cases, we have chosen groups of proteins from several established protein families present in the SWISSPROT data bank (http://www.expasy.uniprot.org; see Table 2), together with some specific examples of sequences, taken from the literature, that are known to be biologically related even if they align with rather modest scores.
The first set contains sequences from the related hepatocyte nuclear factor 4 (HNF4-α), hepatocyte nuclear factor 6 (HNF6), and GATA-binding protein 1 (globin transcription factor 1, GATA1) families. These represent typical protein families coupled by standard target frequencies. Furthermore, sequences within each family are quite similar to one another, with a percent identity greater than 85%. All these proteins are expected to fall in Case 1.

The second set of sequences is expected to fall in Case 2. A first example is taken from the serine protease family, containing paralogous proteins such as trypsin, elastase, and chymotrypsin, whose phylogenetic tree, constructed according to the multiple alignment for all members of this family [23], is consistent with a continuous evolutionary divergence from a common ancestor of both prokaryotes and eukaryotes. Another example pertaining to weakly correlated sequences that show distant relationships is the one originally used by
Table 2: The three sets of protein families used in testing the BLOSpectrum. The UniProt ID is furnished (with the sequence length). For the defensins and Pro-rich peptides, only the mature peptide sequences were used in alignments. In the following tables, sequences are indicated by the corresponding numbers 1-4.

First set
  HNF4-α: P41235 (465) H. sapiens; P49698 (465) Mus musculus; P22449 (465) Rattus norv.
  HNF6: Q9UBC0 (465) H. sapiens; O08755 (465) Mus musculus; P70512 (465) Rattus norv.
  GATA1: P15976 (413) H. sapiens; P17679 (413) Mus musculus; P43429 (413) Rattus norv.

Second set
  Serine proteases: P07477 (247) H. sapiens trypsin; P17538 (263) H. sapiens chymotrypsin; Q9UNI1 (258) H. sapiens elastase 1; P00775 (259) Streptomyces griseus trypsin; P35049 (248) Fusarium oxysporum trypsin
  Hemoglobins: P02232 (92) Vicia faba leghemoglobin I; S06134 (92) P. chilensis hemoglobin I
  Transposons: A26491 (41) D. mauritiana mariner transposon; NP493808 (41) C. elegans transposon TC1
  Beta defensins: BD01 (36) H. sapiens; BD02 (41) H. sapiens; BD03 (39) H. sapiens; BD04 (50) H. sapiens

Third set
  Pro/Arg-rich peptides: PR39PRC (42) pig; PF (82) pig; BCT5 (43) bovin; BCT7 (59) bovin
Altschul [7] to compare PAM-250 with PAM-120 matrices, that is, the 92-residue Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I, characterized by a very poor percent identity (about 15%), with pairs of identical amino acid residues spread fairly evenly along the alignment. A further example considers the sequences associated with the Drosophila mauritiana mariner transposon and the Caenorhabditis elegans transposon TC1, with a length of 41 residues, used by S. Henikoff and J. G. Henikoff [9] to test the performance of their BLOSUM scoring matrices. The last example derives from human beta defensins. This family of host defense peptides has arisen by gene duplication followed by rapid divergence driven by positive selection, a common occurrence in proteins involved in immunity [24]. They are characterized by the presence of six highly conserved cysteine residues, which determines folding to a conserved tertiary structure, while the rest of the sequence seems to have been relatively free of structural constraints during evolution [25, 26]. Even if clearly related, these peptides have a percentage sequence identity of less than 40%.

All these families represent the case of nonstandard target frequencies, while the amino acid frequency distribution does not appear, at first sight, to be too abnormal. The sequence comparison scores are modest at best, even though members are known to be biologically correlated.

The third set contains sequences that are expected to fall in Case 3. These are members of the Bactenecins family of linear antimicrobial peptides, with an unusually high content of Pro and Arg residues, and an identity of about 35% [27], representing sequences with a highly atypical amino acid frequency distribution.
If we analyze the alignments inside all these sets of protein families, we effectively find examples for each of the three cases illustrated in the preceding section. The alignments of the human and mouse HNF4-α sequences (as illustrated in Table 3), and the BLOSpectrum of the HNF4-α, HNF6, and GATA1 sequence comparisons (see Figure 1), are clear examples of Case 1, with high correlation between all respective couples of sequences and a target frequency divergence that is strongly sensitive to the BLOSUM-λ parameter, so we stop the scoring procedure at step 5.

For example, the HNF4-α alignment has a target frequency divergence that varies from 2.41 to 0.93 when passing from BLOSUM-35 (a matrix tailored for a wrong
Table 3: BLOSUM decomposition for intrafamily alignments for proteins of the first set.

HNF4-α human versus HNF4-α mouse
BLOSUM λ   I(X, Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X, Y)   Score   % Identity
100        3.939     0.929         0.050      0.057      3.118      2833    95.9
80         3.939     1.297         0.046      0.053      2.741      2537    95.9
62         3.939     1.582         0.046      0.052      2.456      2330    95.9
50         3.939     1.861         0.043      0.050      2.171      3003    95.9
40         3.939     2.226         0.039      0.047      1.800      3381    95.9
35         3.939     2.414         0.036      0.044      1.605      2982    95.9

HNF4-α (BLOSUM-100)
Sequences  I(X, Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X, Y)   Score   % Identity
1-3        3.955     0.930         0.050      0.056      3.132      2846    96.3
2-3        4.141     1.008         0.057      0.056      3.246      2952    99.5

[Figure 1: BLOSpectrum (BLOSUM-100) for sequences of the first set: HNF4-α human versus HNF4-α mouse, GATA1 human versus GATA1 mouse, and HNF6 human versus HNF6 mouse; bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), and (5) Score.]

evolutionary distance) to BLOSUM-100 (the matrix tailored for the correct evolutionary distance), so that minimizing the frequency divergence (rows in italics) helps identify the best parameter for comparing the analyzed sequences; it corresponds to λ = 100, coherent with the high percent identity (86-96%). In this case, the compensation factor D(FX//P) + D(FY//P) corresponding to the background frequency divergence is almost zero, since the observed background and target frequencies are very near to those implicit in the BLOCKS database, leading to the conclusion that these are typical sequences that correspond closely to the protein model associated with BLOCKS. The global (normalized) score is high (3.12 in the HNF4-α example), due to a high degree of stochastic similarity (I(X, Y) ≈ 3.94), which is not greatly penalized. Other members of the HNF4-α, HNF6, or GATA1 families behave similarly (see Figure 1).

The situation changes considerably when we compute the BLOSUM decomposition for the different examples listed for the second set, for example, comparing human trypsin, elastase, and chymotrypsin to one another, or comparing these enzymes in distantly related species, such as human, Streptomyces griseus (a bacterium), and Fusarium oxysporum (a fungus). Following the Scoring Procedure, and starting with ungapped alignments, we have a case of high target frequency divergence with a low level of background frequency divergence, corresponding to the situation outlined in step 6. However, as soon as we use gapped alignments, we observe a remarkable increment in the score, due to a reduced


[Figure 2: BLOSpectrum for (ungapped and gapped) sequences of the second set: chymotrypsin human versus S. griseus trypsin; Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I; D. mauritiana mariner transposon versus C. elegans transposon TC1; BD01 human versus BD02 human. Bars show (1) I(X, Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), and (5) Score, for the indicated BLOSUM-λ matrices.]

penalization factor associated with the target frequency divergence (see Figure 2, first column, and Table 4). This is the obvious case in which the bad matching is a consequence of deletions and/or insertions that occurred during evolution, which is resolved once gaps are introduced, so that the sequence comparison falls into Case 1.

A different situation occurs when aligning Vicia faba leghemoglobin I and Paracaudina chilensis hemoglobin I. D(FXY//PAB) minimization (step 3) leads to a narrower spread of values (2.48-2.07) when passing from BLOSUM-100 to BLOSUM-35, with a minimum (2.05) at λ = 40, which is consequently the best parameter to compare the sequences. The global score (0.24) is rather low, despite these sequences being clearly evolutionarily related. In fact, the BLOSpectrum shows that the stochastic correlation I(X, Y) is quite high (1.84), but is killed by the heavy penalty derived from the negative contribution of D(FXY//PAB), while the compensation factors due to the background frequency divergence are less significant (0.25 and 0.19, resp.), as the sequences are typical proteins under the BLOCKS model. Furthermore, extending the size of the alignment or including gaps does not significantly alter the spectrum (see Table 5 and Figure 2, second column), so we leave the Scoring Procedure at step 6; we simply have weakly related sequences.

The Drosophila mauritiana and Caenorhabditis elegans transposons provide a similar example, with only a weak minimization at λ = 62 (D(FXY//PAB) = 2.80). The other BLOSpectrum components are, respectively, I(X, Y) = 2.34, D(FX//P) = 0.53, and D(FY//P) = 0.72. The sequences thus have a high stochastic correlation, but the target frequencies are rather atypical, so that the divergence entirely kills the contribution derived from the mutual information, and if the score is weakly positive (0.79) it is only due to the terms associated with the background frequency divergence. In fact, the biological relationship of these atypical sequence fragments is effectively captured only thanks to the presence of this compensation factor. In this case, a gapped alignment including a wider portion of the sequences actually reduces the

Table 4: BLOSUM decomposition for ungapped and gapped serine proteases.

Human chymotrypsin versus Streptomyces griseus trypsin (ungapped)
BLOSUM λ   I(X, Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X, Y)   Score   % Identity
100        1.014     2.023         0.134      0.132      -0.742     -398    11.5
80         1.014     1.739         0.141      0.137      -0.446     -230    11.5
62         1.014     1.570         0.146      0.145      -0.264     -121    11.5
50         1.014     1.437         0.134      0.141      -0.147     -120    11.5
40         1.014     1.321         0.132      0.138      -0.035     -42     11.5
35         1.014     1.305         0.136      0.145      -0.008     —       11.5

Human chymotrypsin versus Streptomyces griseus trypsin (gapped)
BLOSUM λ   I(X, Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X, Y)   Score   % Identity
100        1.645     1.213         0.164      0.156      0.753      326     35.9
80         1.645     1.138         0.170      0.164      0.842      382     35.9
62         1.645     1.149         0.178      0.171      0.845      416     35.9
50         1.645     1.176         0.171      0.159      0.800      557     35.9
40         1.645     1.270         0.170      0.158      0.703      640     35.9
35         1.645     1.346         0.177      0.163      0.640      584     35.9

background frequency divergences to remarkably lower values (0.237 and 0.226), neutralizing the compensation (see Table 6 and Figure 2, third column).

In both the preceding examples, we are in the situation where the λ parameter of the substitution matrix is appropriate for the divergence of the sequences in question, the background frequency divergence is small, but the target frequency divergence is still large: this is a signal that we are dealing with weakly related sequences, characterized by several events of substitution that occurred during evolution. It is usually difficult to capture these weakly related sequences using standard scoring matrices, such as BLOSUM or PAM, since the common ancestor could be very old. As a matter of fact, this difficulty was used to test PAM-250 versus PAM-120 matrices (Altschul [7], hemoglobins) and BLOSUM-62 versus PAM-160 matrices (S. Henikoff and J. G. Henikoff [9], transposons), respectively. Here, we cannot remove the cause of the mismatching, and we leave the Scoring Procedure at step 6.

The last example from this group derives from the human beta defensins, and even if these sequences are known to be evolutionarily related, some couples actually show a negative normalized score (1-4, 2-3, 2-4; see Table 7 and Figure 2, last column), suggesting that they are not. In fact, a normal BLOSUM-62 BLAST search using the human beta defensin 1 sequence picks up several homologues from other mammalian species, whereas the alignments with the three paralogous human sequences are below the cutoff score. BLOSpectrum analysis reveals a high stochastic correlation I(X, Y) (2.00-3.03), neutralized by an even higher penalty factor due to the target frequency divergence (3.28-3.56), partly compensated by the substantial background frequency divergences (0.54-0.79), and with little effect of the BLOSUM-λ parameter, or of introducing gaps. These are fairly typical proteins, whose score is heavily penalized by a remarkable target frequency divergence. Only the compensation factor induced by the background frequency divergence can, in some cases, sustain the score over positive values, allowing the identification of a biological correlation that would otherwise have been lost.

The third set of sequences are the Pro/Arg-rich antimicrobial peptides of the Bactenecins family, with about 35% identity [27, 28]. The obtained scores are clearly positive, despite the poor stochastic correlation (0.40-0.60, see Table 8 and Figure 3).

The penalty factor due to the target frequency divergence is remarkably high in this case (4.15-4.49) and should drag the score to quite negative values, but the compensation factor due to the background frequency divergence is even greater and fully compensates it. We thus leave the scoring procedure at step 7. This is the typical case of poorly conserved sequences with singular key structural aspects that are nevertheless highly preserved (cf. the pattern of proline and arginine residues). As the background frequencies FX and FY are far from the standard background P associated with the BLOCKS database, the evaluation of a more realistic score for these sequences passes through the use of a compositionally adjusted BLOSUM matrix [11]. Such matrices are built in such a way as to reduce the background frequency divergence, so as to eliminate the portion of target divergence that is induced by it. In this way, the residual target divergence accounts only for the effective evolutionary divergence between the sequences.

As a final example, we obtained BLOSUM spectra also for sequences from obviously uncorrelated families. The results are reported in Table 9 and Figure 4. In these cases, we generally obtain a poor stochastic correlation I(X, Y) and a high value for the penalty factor D(FXY//PAB), leading to a globally negative score, which is not compensated by the background


Table 5: BLOSUM decomposition for ungapped and gapped hemoglobins.
P02232: 49 SAGVVDSPKLGAHAEKVFGMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKE 115
++ +

S ++ AHA +V

++ +

+L +

H+ +

+ L++ ++

S06134: 61 ASQLRSSRQMQAHAIRVSSIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQA 127

P02232: 116 ASGDKWSEELSAAWEVAYDGLATAI 140


G

++E+

AW

A+

S06134: 128 ELGSDFNEKTRDAWAKAFSIVQAVL 152


Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (ungapped)

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      1.839    2.478         0.264      0.207      -0.166    -31     15.2
80       1.839    2.240         0.264      0.199      0.063     12      15.2
62       1.839    2.128         0.260      0.192      0.163     35      15.2
50       1.839    2.077         0.255      0.185      0.203     54      15.2
40       1.839    2.051         0.255      0.194      0.237     83      15.2
35       1.839    2.070         0.263      0.202      0.235     82      15.2

Vicia faba leghemoglobin I versus Paracaudina chilensis hemoglobin I (gapped)

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      1.597    1.962         0.166      0.172      -0.026    -10     18.1
80       1.597    1.759         0.161      0.163      0.162     40      18.1
62       1.597    1.661         0.154      0.153      0.243     65      18.1
50       1.597    1.618         0.145      0.145      0.268     104     18.1
40       1.597    1.606         0.145      0.155      0.291     152     18.1
35       1.597    1.623         0.154      0.163      0.283     148     18.1

P02232:  2 FTEKQEALVNSSSQLFKQNPSNYSVLFYTIILQKAPTAKAMFSFLK--DSAGVVDSPKLGAHAEKVF 68
S06134: 12 LTLAQKKIVRKTWHQLMRNKTSFVTDVFIRIFAYDPSAQNKFPQMAGMSASQLRSSRQMQAHAIRVS 78

P02232: 69 GMVRDSAVQLRATGEVVLDGKDGSIHIQKGVLDPHFVVVKEALLKTIKEASGDKWSEELSAAWEVAY 135
S06134: 79 SIMSEYVEELDSDILPELLATLARTHDLNKVGADHYNLFAKVLMEALQAELGSDFNEKTRDAWAKAF 145

frequency divergences. Note that in two cases, a mildly positive score could suggest a distant relationship. Analysis of the BLOSpectrum helps in evaluating this possibility. The PF12 versus GAT1 alignment is simply a case of overcompensation for a nontypical sequence (the background frequency divergence for one of the sequences is very high). In the second case, however, the I(X,Y) value for the BD04 versus GAT1 human alignment is surprisingly high, suggesting that a closer look might be appropriate.

4. CONCLUSIONS

A standard use of scoring substitution matrices, such as the BLOSUM family, is often insufficient for discovering concealed correlations between weakly related sequences. Among other causes, this can derive from (i) the introduction of gaps during evolution, (ii) the use of a BLOSUM matrix tailored for a different evolutionary distance than that pertaining to the aligned sequences, and/or (iii) the use of standard matrices

Francesco Fabris et al.

Table 6: BLOSUM decomposition for ungapped and gapped transposons.

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIEH 283
A26491:    245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDY 285


Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (ungapped)

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      2.339    2.926         0.740      0.531      0.685     55      34.1
80       2.339    2.849         0.733      0.531      0.754     60      34.1
62       2.339    2.800         0.724      0.526      0.789     67      34.1
50       2.339    2.831         0.721      0.516      0.746     90      34.1
40       2.339    2.935         0.716      0.509      0.630     104     34.1
35       2.339    2.969         0.714      0.505      0.590     92      34.1

Drosophila mauritiana mariner transposon versus C. elegans transposon TC1 (gapped)

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      1.991    2.244         0.244      0.243      0.235     40      25.0
80       1.991    2.110         0.246      0.234      0.362     67      25.0
62       1.991    2.021         0.245      0.227      0.443     91      25.0
50       1.991    2.009         0.237      0.226      0.445     123     25.0
40       1.991    2.043         0.227      0.228      0.404     152     25.0
35       1.991    2.066         0.226      0.229      0.381     144     25.0

NP_493808: 243 VFQQDNDPKHTSLHVRSWFQRRHVHLLDWPSQSPDLNPIE-HLWEELERRLGGIRASNAD 301
A26491:    245 IFLHDNAPSHTARAVRDTLETLNWEVLPHAAYSPDLAPSDYHLFASMGHALAEQRFDSYE 304

NP_493808: 302 AKFNQLENAWKAIPMSVIHKLIDSMPRRCQAVIDANG 338
A26491:    305 SVKKWLDEWFAAKDDEFYWRGIHKLPERWEKCVASDG 341

for comparison of proteins with nonstandard background frequency distributions of amino acids. All these well-known effects can be better evidenced and quantified by the decomposition of the BLOSUM score (BLOSpectrum) according to (11). This equation highlights the core of the biological correlation measured by the BLOSUM score, that is, mutual information I(X,Y), or sequence convergence. If gaps are taken into account (as in BLAST), if the correct parameter is chosen with the help of the BLOSpectrum, and if the background frequencies of the sequences are near to the standard ones, then the global score is given by sequence convergence plus a residual penalization factor due to target frequency divergence. This residual value implicitly takes into account that numerous substitution events may have occurred during sequence evolution, and so it is a coherent measure of the biological relationship and distance between the sequences. If the background frequencies of the sequences are not standard, then we have shown that the BLOSUM scoring method has an in-built capacity to correct for anomalies in amino acid distributions, using background frequency divergence as a compensation factor. One can also choose to compositionally adjust the matrix, so as to reduce the compensation factor together with the component of target frequency divergence that is induced by a bad background frequency distribution. This systematic method is illustrated in the scoring analysis procedure of Section 2.

Our decomposition becomes important when we consider sequences for which the BLOSUM score indicates a weak correlation or none at all. A critical evaluation of the BLOSpectrum components can help corroborate or identify an underlying biological correlation, and indicate whether the matrices being used are the most appropriate ones for measuring it. In other words, when considering the grey area of BLOSUM scores with marginal significance, it could help to


Table 7: The BLOSUM terms for beta defensins.

BD01 human versus BD02 human

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      3.030    3.566         0.564      0.618      0.646     45      41.6
80       3.030    3.453         0.568      0.623      0.768     58      41.6
62       3.030    3.438         0.604      0.652      0.849     65      41.6
50       3.030    3.418         0.615      0.663      0.891     99      41.6
40       3.030    3.378         0.577      0.626      0.855     129     41.6
35       3.030    3.320         0.539      0.588      0.837     120     41.6

Human beta defensins (BLOSUM-35)

Sequences   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
1-3         2.731    3.325         0.539      0.751      0.697     101     30.5
1-4         2.532    3.658         0.539      0.728      0.141     22      16.6
2-3         2.009    3.466         0.794      0.616      -0.045    -10     10.2
2-4         2.334    3.522         0.609      0.568      -0.009    …       12.1
3-4         2.122    3.286         0.794      0.655      0.286     44      20.5

Table 8: The BLOSUM terms for Pro/Arg-rich peptides.

BCT5 bovin versus BCT7 bovin

BLOSUM   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
100      0.424    4.935         2.329      2.460      0.279     28      34.8
80       0.424    4.724         2.317      2.449      0.467     42      34.8
62       0.424    4.637         2.301      2.430      0.518     37      34.8
50       0.424    4.533         2.264      2.389      0.544     68      34.8
40       0.424    4.407         2.221      2.338      0.576     97      34.8
35       0.424    4.368         2.199      2.301      0.556     98      34.8

Pro/Arg-rich peptides (BLOSUM-35)

Sequences   I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
1-3         0.516    4.434         2.095      2.205      0.382     63      30.9
1-4         0.446    4.491         2.199      2.488      0.643     110     39.5
2-3         0.584    4.156         2.095      2.257      0.780     133     47.6
2-4         0.406    4.350         2.256      2.251      0.563     134     37.2
3-4         0.609    4.260         2.095      2.347      0.792     132     45.2

decide if an evolutionary relationship actually exists. We provide online software at http://bioinf.dimi.uniud.it/software/blosumapplet which integrates a BLOSpectrum histogram with the score obtained by a classical BLAST engine working on two input sequences, allowing an immediate visual analysis of the score components. The systematic use of BLOSpectrum parameters to permit a more sensitive filtering of scores inside a BLAST-like engine would be the logical next operative step. We have provided several biological examples indicating the potential of our method, but it clearly requires extensive biological experimentation to fully establish its usefulness.

APPENDIX

Proof of (11). Multiplying inside the log function of (7) by f(i,j)/f(i,j) and by f(i)f(j)/(f(i)f(j)) and rearranging the terms, we obtain

S_N(X,Y) = \sum_{i,j} f(i,j) \log \frac{p(i,j)\, f(i,j)\, f(i) f(j)}{p(i) p(j)\, f(i,j)\, f(i) f(j)}

         = \sum_{i,j} f(i,j) \log \frac{f(i,j)}{f(i) f(j)}
           - \sum_{i,j} f(i,j) \log \frac{f(i,j)}{p(i,j)}
           + \sum_{i,j} f(i,j) \log \frac{f(i) f(j)}{p(i) p(j)}


Figure 3: BLOSpectrum for sequences of the third set (BCT5 bovin versus BCT7 bovin, BCT5 bovin versus PR39PRC pig, and BCT7 bovin versus PR39PRC pig; BLOSUM-35). Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.


Table 9: Some examples of BLOSUM-35 terms for sequences belonging to noncorrelated families.

Sequences                                 I(X,Y)   D(FXY//PAB)   D(FX//P)   D(FY//P)   SN(X,Y)   Score   % Identity
HNF4-α human versus HNF6 human (1-1)      0.578    0.986         0.036      0.205      -0.165    -312    5.37
HNF4-α human versus GAT1 human (1-1)      0.712    1.033         0.038      0.193      -0.088    -144    8.71
HNF6 human versus GAT1 human (1-1)        0.622    1.122         0.230      0.193      -0.076    -143    8.47
BD04 human versus BCT7 bovin (4-2)        1.010    3.887         0.460      2.220      -0.195    -36     10.0
PF12 pig versus GAT1 human (4-1)          0.686    3.486         2.182      0.709      0.091     24      18.2
BD04 human versus GAT1 human (4-1)        2.243    3.033         0.460      0.465      0.136     25      12.0



         = I(X,Y) - D(F_{XY}//P_{AB})
           + \sum_{i,j} f(i,j) \log \frac{f(i)}{p(i)}
           + \sum_{i,j} f(i,j) \log \frac{f(j)}{p(j)}

         = I(X,Y) - D(F_{XY}//P_{AB}) + D(F_X//P_A) + D(F_Y//P_B).        (A.1)
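The identity just derived can also be checked numerically. The sketch below builds two made-up joint distributions over a four-letter alphabet, standing in for the observed pair frequencies f(i,j) and the target model p(i,j) (these are random toy inputs, not BLOCKS-derived data), and verifies that the four BLOSpectrum terms recombine into the normalized score:

```python
import math
import random

def blospectrum(f, p, letters):
    """Compute the terms of (11) from a joint pair-frequency table f(i,j)
    and a target model p(i,j); returns (S_N, I, D_xy, D_x, D_y)."""
    fX = {i: sum(f[i, j] for j in letters) for i in letters}
    fY = {j: sum(f[i, j] for i in letters) for j in letters}
    pX = {i: sum(p[i, j] for j in letters) for i in letters}
    pY = {j: sum(p[i, j] for i in letters) for j in letters}
    S   = sum(f[i, j] * math.log2(p[i, j] / (pX[i] * pY[j]))
              for i in letters for j in letters)
    I   = sum(f[i, j] * math.log2(f[i, j] / (fX[i] * fY[j]))
              for i in letters for j in letters)
    Dxy = sum(f[i, j] * math.log2(f[i, j] / p[i, j])
              for i in letters for j in letters)
    Dx  = sum(fX[i] * math.log2(fX[i] / pX[i]) for i in letters)
    Dy  = sum(fY[j] * math.log2(fY[j] / pY[j]) for j in letters)
    return S, I, Dxy, Dx, Dy

def random_joint(letters, rng):
    # Strictly positive weights, normalized to a probability distribution.
    w = {(i, j): rng.random() + 0.01 for i in letters for j in letters}
    t = sum(w.values())
    return {k: v / t for k, v in w.items()}

rng = random.Random(1)
letters = "ACDE"
S, I, Dxy, Dx, Dy = blospectrum(random_joint(letters, rng),
                                random_joint(letters, rng), letters)
assert abs(S - (I - Dxy + Dx + Dy)) < 1e-9   # (11) holds term by term
```

The same recombination can be read off the tables above; for instance, in Table 5 (ungapped, BLOSUM-62), 1.839 - 2.128 + 0.260 + 0.192 ≈ 0.163 = SN(X,Y).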

A fuller understanding of the mathematical tools used in Section 2 requires some definitions and mathematical properties pertaining to ID and MI; they are summarized as follows.

Let us start by considering some probability distributions [10] over an alphabet A with K symbols, for example, P = {p_1, p_2, ..., p_K}, Q = {q_1, q_2, ..., q_K}, and so on. In our context, K = 20, as there are 20 amino acids, and the alphabet letters correspond to the standard 1-letter amino acid code (D = Asp, E = Glu, W = Trp, etc.). If we imagine the space of all possible K-dimensional probability distributions, it is natural to ask what is the distance from P to Q (or vice

Figure 4: BLOSpectrum for noncorrelated sequences (HNF4-α human versus HNF6 human, HNF6 human versus GAT1 human, HNF4-α human versus GAT1 human, PF12 pig versus GAT1 human, BD04 human versus BCT7 bovin, and BD04 human versus GAT1 human; BLOSUM-35). Bars: (1) I(X,Y), (2) D(FXY//PAB), (3) D(FX//P), (4) D(FY//P), (5) Score.

versa). The most popular (pseudo-)distance is the informational divergence D(P//Q),

D(P//Q) \triangleq \sum_{i=1}^{K} p(i) \log \frac{p(i)}{q(i)},        (A.2)

introduced by Kullback in 1954 in the context of statistics [29]; here p(i) \geq 0 and q(i) > 0. It is easy to verify [18] that the informational divergence (ID) is nonnegative, and that it is equal to 0 if and only if P coincides with Q (P \equiv Q). Furthermore, ID is not bounded, since D(P//Q) \to +\infty if an i exists such that q(i) \to 0. All this can be summarized in the following way:

0 \leq D(P//Q) \leq +\infty  (= 0 when P \equiv Q)  (= +\infty when there exists i such that q(i) = 0).        (A.3)

Note that ID is a sum of positive and negative terms, and the fact that the sum is always nonnegative is not obvious (it is a consequence of the convexity property of the logarithm). Since D(P//Q) = 0 if and only if P \equiv Q, we can interpret the ID as a measure of (pseudo)distance between probability distributions. It is only a pseudodistance from the mathematical point of view, since the concept of distance is well defined in mathematics and requires symmetry between the variables as well as the validity of the so-called triangular inequality. ID lacks both of these properties: in general, D(P//Q) \neq D(Q//P) (it is asymmetric), and, if R is a third probability distribution, we are not sure that D(P//R) + D(R//Q) is greater than D(P//Q) (the triangular inequality does not hold). We underline that such a distance is not symmetric, so the order in which P and Q are specified does matter; that is, it is a distance "from" rather than a distance "between".
Suppose now that PX = {pX(1), pX(2), ..., pX(K)} and PY = {pY(1), pY(2), ..., pY(K)} are the probability distributions associated with the (random) variables X and Y, which take their values in the same alphabet A. Here, pX(i) = Pr{X = i} is the probability that the variable X assumes the value i. In our framework, X and Y are two protein sequences of the same length n, and pX(2) = Pr{X = 2} = 0.09 (e.g.) is interpreted as the relative frequency of the second amino acid of the alphabet A; so the overall number of occurrences of the 2nd amino acid in sequence X is equal to 0.09n. In this context, we can also introduce a joint probability distribution associated with the sequences, PXY = {pXY(i,j), i,j \in A}, with pXY(i,j) = Pr{X = i, Y = j}, where pXY(i,j) corresponds to the relative frequency of finding the amino acids i, j paired in a certain position of the alignment between X and Y. It is well known that \sum_{i,j} pXY(i,j) = 1 (PXY is a probability distribution) and that the sum of the joint probabilities over one variable gives the marginal of the other variable, \sum_j pXY(i,j) = pX(i). For example, given that the ninth and the fifth amino acids in the alphabet are arginine and leucine, respectively, pXY(9,5) = pXY(Arg, Leu) = 0.01 means that the relative frequency of finding Arg in X paired with Leu in Y is equal to 0.01. In practice, we avoid the use of the subscripts and use the simpler notation p(i) and p(i,j) instead of pX(i) and pXY(i,j).
Since the condition of independence between two variables (protein sequences) X and Y is fixed by the formula pXY(i,j) = pX(i) pY(j) (for each pair i, j \in A), then, once a certain PXY is assigned, it is interesting to evaluate the distance of PXY from the condition of independence between the variables. Making use of the ID (A.2), we need to evaluate the quantity D(PXY//PX PY), that is, the stochastic distance between the joint PXY and the product of the marginals PX PY. If we have independence, then PXY \equiv PX PY, and the divergence equals zero. If, on the contrary, X and Y are tied by a certain degree of dependence, this can be measured by

D(P_{XY}//P_X P_Y) = \sum_{i,j} p(i,j) \log \frac{p(i,j)}{p(i) p(j)} \triangleq I(X,Y) \geq 0.        (A.4)

This quantity is also called the mutual information (or relative entropy) I(X,Y) between the random variables (the protein sequences, in our setting) X and Y. It is symmetric in its variables (I(X,Y) = I(Y,X)) and is always nonnegative, since it is an informational divergence. Note also that MI is upper bounded by the logarithm of the alphabet cardinality, that is, I(X,Y) \leq log 20 [18]. Moreover, since it equals zero if and only if the joint probability distribution coincides with the product of the marginals, that is, when the two variables are independent, we can interpret the mutual information (MI) as a measure of stochastic dependence between X and Y. From another point of view, we can say that independence is equivalent to the situation in which the variables X and Y do not exchange information. So, I(X,Y) can also be read as the degree of dependence between the variables, or as the average information exchanged between them. Mutual information is one of the pillars of Shannon information theory and was introduced in Shannon's seminal paper [16, 17].
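Both extreme cases, independence and perfect coupling, can be checked with a small self-contained sketch; the joint distributions below are toy examples over a two-letter alphabet, not protein data:

```python
import math

def mutual_information(pxy):
    """I(X,Y) = D(P_XY // P_X P_Y) in bits, from a joint distribution
    given as a {(i, j): probability} mapping."""
    px, py = {}, {}
    for (i, j), p in pxy.items():
        px[i] = px.get(i, 0.0) + p
        py[j] = py.get(j, 0.0) + p
    return sum(p * math.log2(p / (px[i] * py[j]))
               for (i, j), p in pxy.items() if p > 0)

# Independent variables exchange no information...
indep = {(i, j): 0.25 for i in "ab" for j in "ab"}
# ...while two perfectly coupled binary variables share exactly 1 bit.
coupled = {("a", "a"): 0.5, ("b", "b"): 0.5}

assert abs(mutual_information(indep)) < 1e-12
assert abs(mutual_information(coupled) - 1.0) < 1e-12
```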

ACKNOWLEDGMENTS

The authors thank Jorja Henikoff, who provided the matrices of joint probability distributions associated with the BLOCKS database, and an anonymous referee of a previous version of this paper, who made several key remarks. This work has been supported by the Italian Ministry of Research (PRIN 2003 and FIRB 2003 Grants), by the Istituto Nazionale di Alta Matematica (INdAM, 2003 Grant), and by the Regione Friuli Venezia Giulia (2005 Grants).
REFERENCES

[1] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[2] A. D. McLachlan, "Tests for comparing related amino-acid sequences. Cytochrome c and cytochrome c551," Journal of Molecular Biology, vol. 61, no. 2, pp. 409–424, 1971.
[3] D. Sankoff, "Matching sequences under deletion-insertion constraints," Proceedings of the National Academy of Sciences of the United States of America, vol. 69, no. 1, pp. 4–6, 1972.
[4] P. H. Sellers, "On the theory and computation of evolutionary distances," SIAM Journal on Applied Mathematics, vol. 26, no. 4, pp. 787–793, 1974.
[5] M. S. Waterman, T. F. Smith, and W. A. Beyer, "Some biological sequence metrics," Advances in Mathematics, vol. 20, no. 3, pp. 367–387, 1976.
[6] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, M. O. Dayhoff, Ed., vol. 5, supplement 3, pp. 345–352, National Biomedical Research Foundation, Washington, DC, USA, 1978.
[7] S. F. Altschul, "Amino acid substitution matrices from an information theoretic perspective," Journal of Molecular Biology, vol. 219, no. 3, pp. 555–565, 1991.
[8] S. Karlin and S. F. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 6, pp. 2264–2268, 1990.
[9] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[10] W. Feller, An Introduction to Probability and Its Applications, John Wiley & Sons, New York, NY, USA, 1968.
[11] Y.-K. Yu, J. C. Wootton, and S. F. Altschul, "The compositional adjustment of amino acid substitution matrices," Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 26, pp. 15688–15693, 2003.
[12] S. F. Altschul, "A protein alignment scoring system sensitive at all evolutionary distances," Journal of Molecular Evolution, vol. 36, no. 3, pp. 290–300, 1993.
[13] D. J. States, W. Gish, and S. F. Altschul, "Improved sensitivity of nucleic acid database searches using application-specific scoring matrices," Methods, vol. 3, no. 1, pp. 66–70, 1991.
[14] S. R. Sunyaev, G. A. Bogopolsky, N. V. Oleynikova, P. K. Vlasov, A. V. Finkelstein, and M. A. Roytberg, "From analysis of protein structural alignments toward a novel approach to align protein sequences," Proteins: Structure, Function, and Bioinformatics, vol. 54, no. 3, pp. 569–582, 2004.
[15] M. A. Zachariah, G. E. Crooks, S. R. Holbrook, and S. E. Brenner, "A generalized affine gap model significantly improves protein sequence alignment accuracy," Proteins: Structure, Function, and Bioinformatics, vol. 58, no. 2, pp. 329–338, 2005.
[16] C. E. Shannon, "A mathematical theory of communication, part I," Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[17] C. E. Shannon, "A mathematical theory of communication, part II," Bell System Technical Journal, vol. 27, pp. 623–656, 1948.
[18] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, NY, USA, 1981.
[19] A. A. Schäffer, L. Aravind, T. L. Madden, et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucleic Acids Research, vol. 29, no. 14, pp. 2994–3005, 2001.
[20] F. Frommlet, A. Futschik, and M. Bogdan, "On the significance of sequence alignments when using multiple scoring matrices," Bioinformatics, vol. 20, no. 6, pp. 881–887, 2004.
[21] S. F. Altschul, J. C. Wootton, E. M. Gertz, et al., "Protein database searches using compositionally adjusted substitution matrices," FEBS Journal, vol. 272, no. 20, pp. 5101–5109, 2005.
[22] A. A. Schäffer, Y. I. Wolf, C. P. Ponting, E. V. Koonin, L. Aravind, and S. F. Altschul, "IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices," Bioinformatics, vol. 15, no. 12, pp. 1000–1011, 1999.
[23] W. R. Rypniewski, A. Perrakis, C. E. Vorgias, and K. S. Wilson, "Evolutionary divergence and conservation of trypsin," Protein Engineering, vol. 7, no. 1, pp. 57–64, 1994.
[24] A. L. Hughes, "Evolutionary diversification of the mammalian defensins," Cellular and Molecular Life Sciences, vol. 56, no. 1-2, pp. 94–103, 1999.
[25] F. Bauer, K. Schweimer, E. Klüver, et al., "Structure determination of human and murine β-defensins reveals structural conservation in the absence of significant sequence similarity," Protein Science, vol. 10, no. 12, pp. 2470–2479, 2001.
[26] A. Tossi and L. Sandri, "Molecular diversity in gene-encoded, cationic antimicrobial polypeptides," Current Pharmaceutical Design, vol. 8, no. 9, pp. 743–761, 2002.
[27] R. Gennaro, M. Zanetti, M. Benincasa, E. Podda, and M. Miani, "Pro-rich antimicrobial peptides from animals: structure, biological functions and mechanism of action," Current Pharmaceutical Design, vol. 8, no. 9, pp. 763–778, 2002.
[28] M. E. Selsted, M. J. Novotny, W. L. Morris, Y.-Q. Tang, W. Smith, and J. S. Cullor, "Indolicidin, a novel bactericidal tridecapeptide amide from neutrophils," Journal of Biological Chemistry, vol. 267, no. 7, pp. 4292–4295, 1992.
[29] S. Kullback, Information Theory and Statistics, Dover, Mineola, NY, USA, 1997.


Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 72936, 14 pages
doi:10.1155/2007/72936

Research Article
Aligning Sequences by Minimum Description Length
John S. Conery
Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA
Received 26 February 2007; Revised 6 August 2007; Accepted 16 November 2007
Recommended by Peter Grünwald
This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses
a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the
original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall
alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on
conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced
with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy
of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
Copyright 2007 John S. Conery. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Sequence alignment is a fundamental operation in bioinformatics, used in a wide variety of applications ranging from genome assembly, which requires exact or nearly exact matches between ends of small fragments of DNA sequences [1], to homology search in sequence databases, which involves pairwise local alignment of DNA or protein sequences [2], to phylogenetic inference and studies of protein structure and function, which depend on multiple global alignments of protein sequences [3–5].

These diverse applications all use the same basic definition of alignment: a character in one sequence corresponds either to a character from the other sequence or to a gap character that represents a space in the middle of the other sequence. Alignment is often described informally as a process of writing a set of sequences in such a way that matching characters are displayed within the same column, and gaps are inserted in strings in order to maximize the similarity across all columns. More formally, alignments can be defined by a matrix M, where M_ij is 1 if character i of one sequence is aligned with character j of the other sequence, or, in some cases, M_ij is a probability, for example, the posterior probability of aligning letters i and j [6].
This paper introduces a new framework for describing the similarities and differences in a set of sequences. The idea is to construct a special-purpose grammar for the strings that represent the sequences. If there are segments in each input sequence that are similar to corresponding segments in the other sequences, the grammar will have a single rule that directly generates the characters for these segments.

An alignment algorithm based on this new framework will consider different sets of rules to include in the grammar it produces. The focus of this paper is on the use of minimum description length (MDL) [7] as the basis of the alignment algorithm. The MDL principle argues that the best alignment will be the one described by the shortest grammar, where the length of a grammar is measured in terms of the number of bits needed to encode it.

The key idea is to use conditional probabilities to encode letters in aligned regions. If a grammar has a rule that aligns letter x in one sequence with letter y in another sequence, the encoding of the rule will be based on p(y | x), and if the alignment is accurate, the resulting encoding is shorter than the one that encodes x and y separately in an unaligned region. But there is a tradeoff: adding a new rule to a grammar requires adding new symbols for the rule structure, and the number of bits required to encode these symbols adds to the total size of the encoded grammar. The alignment algorithm must determine the net benefit of each potential aligned region and choose the set of aligned regions that provides the overall shortest encoding.
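This tradeoff can be made concrete with a back-of-the-envelope calculation; the probabilities below are invented for illustration and are not the paper's trained model:

```python
import math

def bits(p):
    """Ideal code length, in bits, for an event of probability p."""
    return -math.log2(p)

# Hypothetical probabilities (illustration only):
p_x = 0.05           # background probability of letter x
p_y = 0.05           # background probability of letter y
p_y_given_x = 0.30   # p(y | x) for letters a substitution matrix calls similar

unaligned = bits(p_x) + bits(p_y)          # encode x and y independently
aligned   = bits(p_x) + bits(p_y_given_x)  # encode x, then y conditioned on x

saving = unaligned - aligned               # bits saved per aligned pair
assert saving > 0
```

The accumulated per-pair saving must exceed the cost of the extra rule-structure symbols before an aligned region is worth adding, which is how the MDL criterion determines where aligned regions begin and end.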
MDL has been used to infer grammars for large collections of natural language sentences [8] and to search for recurring patterns in protein and DNA sequences [9].
These applications of MDL are examples of machine learning, where the system uses the data as a training set and the goal is to infer a general description that can be applied to other data. The goal of the sequence alignment algorithm presented here is simply to find the best description for the data at hand; there is no attempt to create a general grammar that may apply to other sequences.

Grammars have been used previously to describe the structure of biological sequences [10–12], and regular expressions are a well-known technique for describing patterns that define families of proteins [13]. But as with previous work on MDL and grammars, these other applications use grammars and regular expressions to describe general patterns that may be found in sequences beyond those used to define the pattern, whereas for alignment the goal is to find a grammar that describes only the input data.

Grammars have the potential to describe a wide variety of relationships among sequences. For example, a top-level rule might specify several different ways to partition the sequences into smaller groups, and then specify separate alignments for each group. In this case, the top-level rules are effectively a representation of a phylogenetic tree that shows the evolutionary history of the sequences. This paper focuses on one very restricted type of grammar that is capable of describing only the simplest correspondence between sequences. The algorithm presented here assumes that only two sequences are being aligned, and that the goal is to describe similarity over the entire length of both input sequences; that is, the algorithm is for pairwise global alignment. For this application, the simplest type of formal grammar, a right linear grammar, is sufficient to describe the alignment. Since every right linear grammar has an equivalent regular expression, and because regular expressions are simpler to explain (and are more commonly used in bioinformatics), the remainder of this paper will use regular expression syntax when discussing grammars for a pair of sequences.

Current alignment algorithms are highly sensitive to the choice of gap parameters [14–17]; for example, Reese and Pearson showed that the choice of gap penalties can influence the score for alignments made during a database search by an order of magnitude [18]. One of the advantages of the grammar-based framework is that gaps are not needed to align sequences of varying length. Instead, the parts of regular expressions that correspond to regions of unaligned positions will have a different number of characters from each input sequence.

Previous work using information theory in sequence alignment has been within the general framework of a Needleman-Wunsch global alignment or Smith-Waterman local alignment. Allison et al. [19] used minimum message length to consider the cost of different sequences of edit operations in global alignment of DNA; Schmidt [20] studied the information content of gapped and ungapped alignments, and Aynechi and Kuntz [21] used information theory to study the distribution of gap sizes. The work described here takes a different approach altogether, since gap characters are not used to make the alignments.



Regular expression alignments are similar to the alignments produced by DIALIGN [22, 23], a program that creates consistent sets of ungapped local alignments. The main differences are that fragments in DIALIGN are defined by a Smith-Waterman alignment based on finding a locally optimal score and including neighboring letters until the score drops below a threshold, and that DIALIGN uses a minimum length parameter to exclude short random matches. The method presented in this paper uses the MDL criterion to find the ends of aligned regions: if adding a pair of letters is less costly than leaving the letters in a variable region, then the letters are included in the aligned region.

Other methods that consider only ungapped local alignments are also similar to regular expression alignments. Schneider [24] used information theory as the basis of a multiple alignment algorithm for small ungapped DNA sequences and successfully applied it to binding sites. More recently, Krasnogor and Pelta [25] described a method for evaluating the similarity of pairs of proteins, but their analysis describes a global similarity metric without actually aligning the substrings responsible for the similarity.

The next section of this paper provides some background information on sequence alignment and explains in more detail how a regular expression can be used to capture the essential information about the similarity in a set of sequences. The details of the MDL encoding for sequence letters and other symbols found in expressions are given in Section 3. Results of two sets of experiments designed to test the method are presented in Section 4.

The regular expression alignment method described in this paper has been implemented in a program named realign. The source code, which is written in C++ and has been tested on OS X and Linux systems, is freely available under an open source license and can be downloaded from the project web site [26].
2. ALIGNMENTS AND REGULAR EXPRESSIONS

One of the main applications of sequence alignment is comparison of protein sequences. The inputs to the algorithm are
sets of strings, where each letter corresponds to one of the 20
amino acids found in proteins. The goal of the alignment is
to identify regions in each of the input sequences that are
parts of the same structural or functional elements or are descended from a common ancestor.
Figure 1(b) shows the evolution of fragments of three
hypothetical proteins starting from a 9-nucleotide DNA sequence. The labels below the leaves of the tree are the amino
acids corresponding to the DNA sequences at the leaves. The
only change along the left branch is a single substitution
which changes the first amino acid from P to T, and an alignment algorithm should have no problem finding the correspondences between the two short sequences (Figure 1(c)).
The sequence on the right branch of the tree is the result of a mutation that inserted six nucleotides in the middle
of the original sequence. In order to align the resulting sequence with one of its shorter cousins, a standard alignment
algorithm inserts a gap, represented by a sequence of one or
more dashes, to mark where it thinks the insertion occurred.

John S. Conery


Figure 1: (a) The genetic code specifies how triplets of DNA letters (known as codons) are translated into single amino acids when a cell
manufactures a protein sequence from a gene. (b) A tree showing the evolution of a short DNA sequence. Labels below the leaves are the
corresponding amino acid sequences. (c) Alignment of the two shorter sequences. (d) and (e) Two ways to align the longer sequence with
one of the shorter ones.

This alignment is complicated by the fact that the insertion occurred in the middle of a codon; the single CCC that corresponded to a P in the ancestral sequence is now part of two codons, CCT and TTC. Figures 1(d) and 1(e) show two different ways of doing the alignment; the difference between the two is the placement of the gap, which can go either before or after the middle P of the short sequence.
A key parameter in the alignment of protein sequences is the choice of a substitution matrix, a 20 × 20 array S in which Si,j is a score for aligning amino acid i with amino acid j. The PAM matrices [27] were created by analyzing hand alignments of a carefully chosen set of sequences that were known to be descended from a common ancestor. PAM matrices are identified by a number that indicates the degree to which sequences have changed; a unit of 1 PAM is roughly the amount of sequence divergence that can be expected in 10 million years [28], so the PAM20 matrix could be used to align a set of sequences where the common ancestor lived around 200 million years ago. Other common substitution matrices are the BLOSUM family [29] and the Gonnet matrix [30].
Substitution matrices give higher scores to pairs of letters that are expected to be found in alignments, and lower (negative) scores to pairings that are rare. For example, the PAM100 matrix has positive scores on the main diagonal, to use when aligning letters with themselves; the highest score is 12, for the pair W/W, since tryptophan (W) is highly conserved. Smaller positive scores are for letters that frequently substitute for one another; for example, leucine (L) and isoleucine (I) are both hydrophobic, and the matrix entry for the pair I/L is 1. Histidine (H) is hydrophilic, and the matrix entry for I/H is −4. The pair P/L has a score of −4 and the pair P/S has a score of 0, so an algorithm using PAM100 would prefer the alignment shown in Figure 1(e).

Regular expressions are widely used for pattern matching, where the expression describes the general form of a
string and an application can test whether a given string
matches the pattern. To see how a regular expression is an
alternative to a standard gap-based alignment consider the
following pattern, which describes the two sequences in Figures 1(d) and 1(e):
P(P | LFS)P.

(1)

Here the vertical bar means or and the parentheses are used
to mark the ends of the alternatives. The pattern described
by this expression is the set of strings that start with a P, then
have either another P or the string LFS, and end in a P. In
this example, the letters enclosed in parentheses correspond
to a variable region: the pattern simply says these letters are
not aligned and no attempt is made to say why they are not
aligned or what the source of the difference is. The regular
expression is an abstract description, covering both the alignments of Figures 1(d) and 1(e) (and a third, biologically less
plausible, alignment in which the top string would be PP
P).
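As a concrete illustration (using Python's re module, not anything from the paper's implementation), the pattern in expression (1) can be checked directly:

```python
import re

# Expression (1): a P, then either another P or the string LFS, then a P.
pattern = re.compile(r"P(P|LFS)P")

# PPP and PLFSP both match; a string such as PLP does not.
for s in ["PPP", "PLFSP", "PLP"]:
    print(s, "matches" if pattern.fullmatch(s) else "does not match")
```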
For a more realistic example, consider the two sequence
fragments in Figure 2(a), which are from the beginning of
two of the protein sequences used to test the alignment application. Substrings of 15 characters near the front of each
sequence are similar to each other. A regular expression that
describes this similarity would have three groups, showing
letters before and after the region of similarity as well as the
region itself (Figure 2(b)).
Any pair of sequences can be described by a regular expression of this form. The expression consists of a series of
segments, written one after another, where each segment has
two substrings separated by the vertical bar. But this standard

EURASIP Journal on Bioinformatics and Systems Biology


Figure 2: (a) Strings from the start of two of the amino acid sequences used to test the alignment algorithm. The substrings in blue are
similar to the corresponding substring in the other sequence. (b) A regular expression that makes explicit the boundaries of the region of
similarity. (c) The canonical form representation of the regular expression. The canonical form has the same groupings of letters, but displays
the letters in a different order and uses marker symbols instead of parentheses to specify group boundaries. A # means the sequence segments
are blocks, where the ith letter from one sequence has been aligned with the ith letter in the other sequence. A > designates the start of a
variable region of unaligned letters.

notation introduces a problem: how does one distinguish segments describing aligned characters from segments for
unaligned characters? The following convention solves the
problem of distinguishing between the types of segments and
reduces the number of symbols to a minimum. In a canonical
form sequence expression,
(i) each open parenthesis is replaced with a symbol that
specifies the type of the segment that starts at that location. An aligned segment starts with #, an unaligned
segment starts with >;
(ii) the vertical bar separating the two parts of a segment is
replaced by the symbol used at the start of the segment;
thus if the segment starts with #, the two parts of the
segment are separated by a second #;
(iii) the closing parenthesis marking the end of a segment
can just be deleted since it is redundant (every closing
parenthesis is either followed by an opening parenthesis or comes at the end of the expression);
(iv) to make an expression easier to read, it is displayed by
starting a new line for each # or >, with the understanding that white space breaking the expression
into new lines is for formatting purposes only and is
not part of the expression itself.
The canonical form of the expression describing the alignment of the initial parts of the two example genes is shown
in Figure 2(c).
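To make rules (i)-(iv) concrete, here is a small sketch (a hypothetical helper, not part of the realign source) that renders a list of segments as a canonical form expression:

```python
# Each segment is (kind, top, bottom); kind is "block" or "variable".
# Blocks are marked with '#', variable regions with '>', one line per string.
def canonical_form(segments):
    lines = []
    for kind, top, bottom in segments:
        mark = "#" if kind == "block" else ">"
        lines.append(mark + top)
        lines.append(mark + bottom)
    return "\n".join(lines)

# The expression P(P|LFS)P from Section 2 rendered in canonical form:
expr = canonical_form([("block", "P", "P"),
                       ("variable", "P", "LFS"),
                       ("block", "P", "P")])
print(expr)
```

Note that the block lines always pair equal-length substrings, while the variable region pairs P with LFS.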
In the literature on sequence alignment, an ungapped local alignment is often referred to as a block. In the canonical
form sequence expression, a block corresponds to a pair of
lines starting with #; pairs of lines starting with > are called
variable regions. Note that the substrings in blocks always
have the same number of sequence letters, and always have

at least one letter. Substrings in variable regions can have any number of sequence letters, and one of the strings can have
zero letters. Since # and > define the boundaries of blocks
they are referred to as marker symbols.
Sequence expressions can easily be extended to describe
a multiple alignment of n > 2 sequences. Each segment in
an expression would have n substrings separated by vertical
bars, and the corresponding canonical form would have n
lines in each block and in each variable region. The MDL
code length function and the alignment algorithm in the following section assume there are only two sequences; possible
extensions for multiple alignment will be discussed in the final section.
3. ALIGNMENT USING MINIMUM DESCRIPTION LENGTH

It is easy to see there is at least one canonical form sequence expression for every pair of sequences: simply create a single variable region, writing the string for each complete sequence to the right of a > symbol. This default expression is
the null hypothesis that the sequences have nothing in common.
The goal of an alignment algorithm is to generate alternative hypotheses, in the form of expressions that have
one or more blocks containing equal-length substrings from
the input sequences. The alignment process can be viewed
as a series of rewrite operations applied to variable regions.
A rewrite step that creates a block splits a variable region
into three parts: a variable region for characters before the
block, the block itself, and a variable region for characters
following the block (Figure 3). The transformation adds four
marker symbols to the expression: two # symbols identify


Figure 3: Schematic representation of an expression rewriting operation. A canonical form expression with a single variable region
is transformed into a new expression with two variable regions surrounding a block. The number of sequence letters does not change,
but four new marker symbols are added to specify the boundaries
of the block.

the locations of the start of the block (one in each input sequence) and two > symbols mark the end of the block. As a
special case, the block might be at the beginning or end of
the expression; if so only two new # markers are added to the
expression.
Since the alignment algorithm uses the minimum description length principle to search for the simplest expression, this transformation appears to be a step in the wrong
direction because the complexity of the expression, in terms
of the number of symbols used, has increased. The key point
is that MDL operates at the level of the encoding of the expression, that is, it prefers the expression that can be encoded
in the fewest number of bits. As will be shown in this section,
blocks of similar sequence letters have shorter encodings. If
the number of bits saved by placing similar letters in a block
is greater than the cost of encoding the symbols that mark the
ends of the block, the transformed expression is more compact.
The code length function that assigns a number of bits
to each symbol in a canonical form sequence expression has
three components:
(i) a protocol that defines the general structure of an expression and the representation of alignment parameters;
(ii) a method for assigning a number of bits to each letter
from the set of input sequences;
(iii) a method for determining the number of bits to use
for the marker symbols that identify the boundaries
between blocks and variable regions.
3.1. Communication protocol
A common exercise in information theory is to imagine that
a compressed data set is going to be sent to a receiver in
binary form, and the receiver needs to recover the original
data. This exercise ensures that all the necessary information
is present in the compressed data: if the receiver cannot reconstruct the original data, it may be because essential information was not encoded by the compression algorithm. In
the case of the MDL alignment algorithm, the idea is to compress a set of sequences by creating a representation of a regular expression that describes the structure of the sequences.

The receiver recovers the original sequence data by expanding the expression to generate every sequence that matches
the expression.
A communication protocol that specifies the type of information contained in a message and the order in which the
pieces of the message are transmitted is an essential part of
the encoding. The representation of a sequence expression
begins with a preamble that contains information about the
structure of the expression and the encoding of alignment
parameters.
A canonical form sequence expression is an alternating
series of blocks and variable regions, where the marker symbols (# and >) inserted into the input sequences identify the
boundaries between segments. The communication protocol allows the transmitter to simplify the expression as it is
compressed by putting a single bit in the preamble to specify the type of the first segment. Then the only thing that is
required is a single type of symbol to specify the locations of
the remaining markers. For the example sequences shown in
Figure 2, the expression can be transformed into the following string:
> MNNNNYIF.MNSYKP.ENENPILYNTNEGEE.
ENENPVLYNYKEDEE.NRSS.SSHI

(2)

Here the >, represented by a single bit, indicates the type of the first region. The periods identify the locations of the markers. Since the regions alternate between # and >, the receiver infers that the first period represents another >, the next two periods are #, and so on.
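The receiver's inference can be sketched as follows (illustrative code, not the paper's implementation); given the one-bit type of the first region, every later marker type follows from the alternation, remembering that each region contributes one segment per sequence:

```python
# Given the type of the first region and the number of marker positions,
# recover the marker type at each position: regions alternate between
# '>' and '#', and each region spans two consecutive segments.
def marker_types(n_markers, first=">"):
    types, cur = [], first
    for i in range(n_markers):
        types.append(cur)
        if i % 2 == 1:                      # both segments of this region seen
            cur = "#" if cur == ">" else ">"
    return types

# For the string in (2): two variable segments, a block, then a variable region.
print(marker_types(6))   # ['>', '>', '#', '#', '>', '>']
```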
The key parameter in every alignment is the substitution
matrix used to define joint probabilities for each letter pair
and single (marginal) probabilities for each individual letter.
If the transmitter and receiver agree beforehand to restrict
the set of substitution matrices to a set of n commonly used
matrices, each matrix can be assigned an integer ID and the
preamble simply contains a single integer encoded in log2 n
bits to identify the matrix. If an arbitrary matrix is allowed,
the protocol would have to include a representation for the
substitution matrix.
The rest of the information contained in the preamble depends on the method used to represent the marker symbols. Three different methods are presented below in Section 3.3, and each uses a different combination of parameters; for example, the indexed representation requires the
transmitter to send the length of the longest sequence, and
the tagged representation requires the transmitter to send the
number of bits used in the encoding of marker symbols. For
numeric parameters, the transmitter can simply encode the
parameter in the fewest number of bits and include the encoding as part of the preamble. A standard technique for representing a number that can be encoded in k bits is to send k
0s, a 1, and then the k bits that encode the number itself.
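This standard technique can be sketched directly (illustrative code, not from the paper):

```python
# Self-delimiting encoding of a number: k zeros, a one, then the k bits
# of the number itself, so the receiver knows where the value ends.
def encode_number(n):
    bits = format(n, "b")
    return "0" * len(bits) + "1" + bits

def decode_number(stream):
    k = stream.index("1")                   # count the leading zeros
    return int(stream[k + 1 : 2 * k + 1], 2), stream[2 * k + 1 :]

code = encode_number(13)                    # "0000" + "1" + "1101"
value, rest = decode_number(code + "011")   # trailing bits are untouched
print(code, value, rest)                    # 000011101 13 011
```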
In general a regular expression can be expanded into
more than just the original sequence strings. For example,
suppose the two input strings are AB and CD, and the regular
expression representing their alignment is of the form
(A | C) (B | D).

(3)


A receiver can expand this expression into the two original input strings, but the expression also matches AD and CB. Thus the protocol needs a method for telling the receiver how to link together the substrings from different segments so that it will reconstruct AB and CD but not AD or CB.
One solution would be to encode sequence IDs with the
substrings so the receiver correctly pieces together a sequence
using a consistent set of IDs. But if a simple convention is
followed, the receiver can infer the sequence IDs from the
order in which the sequences are transmitted. For canonical
form sequence expressions, the protocol requires that every
region has exactly two strings, and that within a region, the
strings need to be given in the same order each time.
3.2. Encoding sequence letters
The standard technique used in information theory of encoding symbols according to their probability distribution can be used to encode sequence letters. If a letter x occurs with probability p(x), the encoding of x requires −log2 p(x) bits.
The probability distribution for letters is based on the
substitution matrix being used for the alignment. Scores in
a substitution matrix are log odds ratios of the form
s(x, y) = (1/λ) log ( p(x, y) / ( p(x) p(y) ) ),    (4)

where p(x, y) is the joint probability of observing x aligned with y, p(x) and p(y) are the background probabilities of x and y, and λ is a scaling factor [31]. The realign program uses a program named lambda [32] as a preprocessor that takes an arbitrary substitution matrix as input, solves for λ, and saves a table of background probabilities for each single letter and joint probabilities for each letter pair.
The number of bits used to encode a letter in a canonical sequence expression depends on whether the letter is in
a block or in a variable region. For a letter x in a variable
region the encoding is straightforward: simply use the background probability of x according to the transformed substitution matrix.
For a block, the encoding considers pairs of letters x and
y that occur in the same relative position in the block. The
number of bits to encode the letter x in one sequence is based
on p(x), the same as in a variable region, but for the letter y
in the other sequence, the conditional probability p(y | x) is
used to reflect the fact that x and y are aligned. Since by definition p(y | x) = p(x, y)/ p(x), the substitution matrix provides the necessary information to compute the conditional
probabilities.
To summarize, the cost, in bits, of encoding letters in a canonical form sequence expression is defined as follows:
(i) for a letter x in a variable region or in the first line of a block, the code length is a function of p(x), the marginal probability of observing x: c(x) = −log2 p(x);
(ii) for a letter y in the second line of a block, the code length is a function of p(y | x), the conditional probability of seeing y in this location given character x in the same position in the first line: c(y | x) = −log2 p(y | x).

Table 1: Cost (in bits) of aligning pairs of letters. Sx,y is the score for letters x and y in the PAM100 substitution matrix. c(x) + c(y) is the sum of the costs of the two letters, which is incurred when the letters are in a variable region. c(x) + c(y | x) is the cost of the same letters when they are aligned in a block. The benefit of aligning two letters is the difference between the unaligned cost and the aligned cost: a positive benefit results from aligning similar letters, a negative benefit from aligning dissimilar letters.
x    y    Sx,y    c(x) + c(y)    c(x) + c(y | x)    benefit(y, x)
W    W     12     6.36 + 6.36    6.36 + 0.44          5.92
I    I      6     3.65 + 3.65    3.65 + 1.25          2.40
L    L      6     3.09 + 3.09    3.09 + 0.72          2.37
M    L      3     4.97 + 3.09    4.97 + 2.26          0.83
L    I      1     3.09 + 3.65    3.09 + 3.66         −0.01
L    Q     −2     3.09 + 5.02    3.09 + 6.09         −1.07
L    C     −6     3.09 + 5.78    3.09 + 9.38         −3.60

When x and y are the same letter, or similar according to the substitution matrix being used, the cost using the conditional probability will be lower. For any two letters x and y, the benefit of aligning y with x is the difference between the cost of placing the two letters in a variable region versus their cost in a block:


benefit(y, x) = ( c(x) + c(y) ) − ( c(x) + c(y | x) ) = c(y) − c(y | x).    (5)

In general, there is a positive benefit for pairs of letters that have positive scores in a substitution matrix. On the other hand, a negative benefit is incurred when an algorithm tries to align two dissimilar letters. Table 1 shows a few examples of pairs of letters, the cost of placing them unaligned in a variable region, and the benefit gained from aligning them in a block.
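The arithmetic behind Table 1 can be sketched with made-up probabilities (the values below are illustrative, not the PAM100-derived ones used by realign):

```python
import math

def cost(p):
    # Code length in bits for a symbol of probability p.
    return -math.log2(p)

# Assumed background and joint probabilities for a letter pair (x, y).
p_x, p_y, p_xy = 0.05, 0.05, 0.01

unaligned = cost(p_x) + cost(p_y)           # letters in a variable region
aligned = cost(p_x) + cost(p_xy / p_x)      # p(y|x) = p(x,y)/p(x) in a block
benefit = unaligned - aligned               # equation (5): c(y) - c(y|x)
print(round(benefit, 3))                    # 2.0: p(x,y) > p(x)p(y), aligning pays
```

The benefit reduces to log2( p(x, y) / (p(x) p(y)) ), which is positive exactly when the pair is over-represented in real alignments, i.e. has a positive substitution score.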
3.3. Encoding marker symbols

Three different methods for encoding the marker symbols that identify the boundaries between blocks and variable regions are illustrated in Figure 4. All three methods are based on the transformation in which the # and > symbols have been replaced by periods. The difference between the three methods is in the representation of each marker and the additional information included in the preamble.
3.3.1. Indexed representation
The indexed representation for marker symbols is based on
the observation that it is not necessary to include the marker
symbols themselves, but only their locations in each string. If
an expression has m segments, the transmitter can construct
a table of (m 1) entries for each string. The number of bits
for each table entry depends on n, the length of the corresponding input sequence. Using this technique, the preamble
of a message is constructed as follows:
(i) order the input sequences so the longest sequence is
the first one in the message;

[Figure 4: panels (a)-(d); panel (d) shows q(x, y) = (1 − ε) p(x, y), with Σ p(x, y) = 1 and ε the probability assigned to a marker symbol.]

Figure 4: The items in blue correspond to information added to a string to specify the locations of marker symbols. (a) Indexed representation. The preamble contains two tables of m − 1 numbers to specify the locations of the m marker symbols (the first marker is always at the front of the string) in each sequence. Each table entry has k = log2 n bits to specify a location in a string of length n. (b) Tagged representation. A one-bit tag added to each symbol identifies the symbol class (letter or marker), and is followed by the bits that represent the symbol itself. (c) Scaled representation. The number of bits for each symbol x is simply −log2 q(x), where q(x) is the probability of the symbol based on a distribution that includes the probability of a marker. (d) Given a probability ε for marker symbols, the joint probabilities for the letter pairs are scaled by 1 − ε so the sum of probabilities over all symbols is 1.0.

(ii) use one bit to specify the type of the first segment
(which will be the same for both sequences);
(iii) use log2 s bits to specify which one of the s substitution matrices was used to encode letters and letter
pairs;
(iv) use 2log2 n + 1 bits to specify n, the length of the first
input sequence. This number also allows the receiver
to determine k = log2 n, the number of bits required to
represent a single marker table entry;
(v) the next 2log2 m + 1 bits specify m, the number of
marker symbols in each sequence;
(vi) create a table of size mk bits for the locations of the
m markers in the first sequence, followed by another
table of the same size for the markers of the second
sequence.
Following the preamble, the body of the message simply
consists of the encoding of the letters defined in the previous
section. Since the receiver knows the length of the first sequence, there is no need to include an end-of-string marker
after the first sequence. This location becomes a de facto
marker for the start of the second sequence.
Figure 4(a) shows how the start of the two example sequences would be encoded with the indexed representation.
The numbers in blue are indices between 0 and the length of
the longer of the two sequences.
The advantage of this representation is that no additional
parameters are required to align a pair of sequences: the only
alignment parameter is the substitution matrix, which determines the individual probability for each letter and the joint
probability for each letter pair.
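Under the layout listed above, the preamble size can be tallied as in the following sketch (the function name and the ceiling roundings are my assumptions; the paper leaves them implicit):

```python
import math

# Bits in the indexed-representation preamble: first-segment type (1 bit),
# substitution matrix ID (log2 s bits), self-delimiting encodings of n and m
# (2k + 1 bits each), and two marker tables of m entries of k bits apiece.
def preamble_bits(s, n, m):
    k = math.ceil(math.log2(n))             # bits per marker-table entry
    bits = 1                                # type of the first segment
    bits += math.ceil(math.log2(s))         # which substitution matrix
    bits += 2 * k + 1                       # the length n itself
    bits += 2 * math.ceil(math.log2(m)) + 1 # the marker count m
    bits += 2 * m * k                       # marker tables, one per sequence
    return bits

print(preamble_bits(s=4, n=1024, m=3))      # 89
```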
3.3.2. Tagged representation
There are two drawbacks to the indexed representation. The first is that the number of bits used to represent a marker grows (albeit very slowly) with the length of the input sequences. That means one might get a different alignment for the same two substrings of sequence letters in different contexts; if the substrings are embedded in longer sequences, the number of bits per marker will increase, and the alignment algorithm might decide on a different placement for the markers in the middle of the substrings.
The second disadvantage is that in many cases marker
symbols identify the locations of insertions and deletions,
which are evolutionary events. The number of bits used to
represent a marker should correspond to the likelihood of an
insertion or deletion, but not the length of the sequence. If
anything, longer sequences are more likely to have had insertions or deletions, so the number of bits representing those
events should be lower, not higher.
The tagged representation addresses these problems by defining a prefix code for markers and embedding the marker codes in the appropriate locations within each sequence string. This method requires the user to specify a value for a new parameter, μ, the number of bits required to represent a marker. Each symbol in the expression is preceded by


a one-bit tag that identifies the type of symbol, for example, a zero for a marker and a one for a sequence letter. Following the tag is the representation of the symbol itself: μ bits for markers, and c(x) bits for a letter x using the cost function defined in the previous section.
The preamble of a message based on the tagged representation is much simpler: it only contains the single bit designating whether the first segment is a block or a variable region, the substitution matrix ID, and the value of μ. The tagged representation of the alignment of the example sequences is shown in Figure 4(b).
3.3.3. Scaled representation
The additional bits attached to each symbol in the tagged
representation result in a rather awkward code from an information theoretic point of view, where the number of bits
used to represent a symbol should depend on the probability
of observing that symbol.
In order to define the number of bits for each symbol s as −log2 q(s), where s is either a sequence letter or a marker symbol, one can scale each element in the joint probability matrix by a constant factor 1 − ε (where 0 < ε < 1) and then define the number of bits in the representation of a marker as μ = −log2 (ε) (Figure 4(d)). Now the body of the message is simply the representation of each symbol, encoded according to the modified probability matrix (see also Figure 4(c)):

c(x) = −log2 q(x),
c(y | x) = −log2 q(y | x),    (6)
c(#) = c(>) = −log2 (ε).


The preamble of a message encoded with the scaled representation is the same as the preamble for a tag-based message, except that the additional parameter is ε instead of μ.
Since the probability of each single letter is the marginal
probability summed over a row of the joint probability matrix, and each matrix entry was multiplied by a constant scale
factor, the single-letter probabilities are also scaled by this
same amount:
q(x) = Σ_y (1 − ε) p(x, y) = (1 − ε) Σ_y p(x, y) = (1 − ε) p(x).    (7)

But note that conditional probabilities are not affected by the scaling, since the scale factors cancel out:
q(y | x) = q(x, y) / q(x) = ( (1 − ε) p(x, y) ) / ( (1 − ε) p(x) ) = p(x, y) / p(x) = p(y | x).    (8)

Recall from Section 3.2 that a pair of letters will be included in a block if there is a positive benefit from aligning them, that is, if c(y) − c(y | x) > 0. In the scaled representation, this calculation compares a cost based on a scaled probability with a cost defined by an unscaled probability. Since the scaled probabilities are lower than the original probabilities, the scaled costs of single letters are higher, and some letter pairs that had a negative benefit according to the original probabilities will now have a positive benefit. For example, in the PAM matrices, letter pairs with scores of 0 or higher have a positive benefit using unscaled probabilities, but when scaled with 1 − ε = 0.75, pairs of slightly dissimilar amino acids with scores of −1 have a positive benefit.
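These relationships are easy to check numerically; the sketch below uses the ε = 0.02 value from Section 3.4 together with made-up letter probabilities:

```python
import math

eps = 0.02
mu = -math.log2(eps)                        # cost of a marker symbol
print(round(mu, 3))                         # 5.644 bits

# Scaling by (1 - eps) raises single-letter costs, but the conditional
# probability, and hence c(y | x), is unchanged (equation (8)).
p_x, p_xy = 0.05, 0.01                      # illustrative probabilities
q_x, q_xy = (1 - eps) * p_x, (1 - eps) * p_xy
print(abs(q_xy / q_x - p_xy / p_x) < 1e-12) # True: the scale factor cancels
```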
3.4. Example

Two different alignments of the sequences of Figure 2 are shown in Figure 5. The alignments were made using the scaled representation with the PAM20 substitution matrix and ε = 0.02. The code length for the null hypothesis (a single variable region containing all letters from the two productions) is 240.279 bits. The code length of the expression with two variable regions and one block is 224.728 bits. The cost of the expression with the block is less because the net benefit from using conditional probabilities to compute the costs of the aligned letters (129.508 − 91.381 = 38.127 bits) outweighs the cost of introducing four marker symbols (4 × 5.644 = 22.576 bits) for the boundaries of the block.
4. EXPERIMENTAL RESULTS

To evaluate the feasibility of aligning pairs of sequences by finding the minimum cost sequence expression, a simple
graph search algorithm was developed and implemented in a
program named realign. The algorithm creates a directed
acyclic graph where nodes represent candidate blocks defined by equal-length substrings from each input sequence.
Weights assigned to nodes represent the cost in bits of the
corresponding block, and weights on edges connecting two
nodes are defined by the cost of a variable region for the
characters between the two blocks. The minimum cost path
through the graph corresponds to the optimal alignment.
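The search can be sketched as dynamic programming over a topologically ordered DAG (an illustrative reconstruction, not the actual realign code):

```python
# Nodes are candidate blocks (plus a source and sink); node_cost holds the
# bit cost of each block, edge_cost the cost of the variable region between
# two blocks. Processing nodes in topological order yields the cheapest path.
def min_cost_alignment(order, node_cost, edges, edge_cost):
    best = {n: float("inf") for n in order}
    best[order[0]] = node_cost[order[0]]
    for n in order:
        for m in edges.get(n, ()):
            cand = best[n] + edge_cost[(n, m)] + node_cost[m]
            if cand < best[m]:
                best[m] = cand
    return best[order[-1]]

# Toy instance with made-up bit costs.
order = ["src", "blockA", "blockB", "sink"]
node_cost = {"src": 0.0, "blockA": 5.0, "blockB": 9.0, "sink": 0.0}
edges = {"src": ["blockA", "blockB"], "blockA": ["sink"], "blockB": ["sink"]}
edge_cost = {("src", "blockA"): 2.0, ("src", "blockB"): 1.0,
             ("blockA", "sink"): 3.0, ("blockB", "sink"): 1.0}
print(min_cost_alignment(order, node_cost, edges, edge_cost))   # 10.0
```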
In one set of experiments, alignments produced by
realign were compared to pairwise alignments generated
by CLUSTALW [33], one of the most widely used alignment
programs. In a second experiment, realign was used to
align pairs of sequences from the BaliBase benchmark suite
[34].
4.1. Plasmodium orthologs

An important concept in evolutionary biology is homology, defined to be similarity that derives from common ancestry. In molecular genetics, two genes in different organisms are said to be orthologs if they are both derived from a single gene in the most recent common ancestor.
In genome-scale computational experiments, a simple
strategy known as reciprocal best hit is often used to identify pairs of orthologous genes. For each gene a from organism A, do a BLAST search [2] to find the gene b from organism B that is most similar to a. If a search in the other
direction, using BLAST to find the gene most similar to b in





[Figure 5 data:]
(a) Cost of null hypothesis: 228.99 + 2μ = 240.279 bits
(b) c(x) + c(y) for letters in the block: 129.508 bits
    c(x) + c(y | x) for the block: 91.381 bits
    Cost of the expression with one block: 64.272 + 91.381 + 35.211 + 6μ = 224.728 bits

Figure 5: Cost of alternative expressions for the example sequences using the PAM20 substitution matrix and ε = 0.02. The cost for each marker symbol is μ = −log2 ε = 5.644 bits. (a) The cost for the null hypothesis is the sum of all the individual letter costs plus the cost of the two marker symbols. (b) When the letters in blue are aligned with one another, the costs of the letters in the second sequence are computed with conditional probabilities. This reduces the cost of the letters in the block by 129.508 − 91.381 = 38.127 bits. The transformed grammar has four additional markers, but the reduction in cost afforded by using the block outweighs the cost of the new markers (4 × 5.644 = 22.576 bits), so the expression with one block has a lower overall cost.

[Figure 6: panels (a) and (b) show the compared alignments; panel (c) data:]

                     Untrimmed   Trimmed
Aligned by both        0.473      0.469
Aligned by neither     0.147      0.258
CLUSTALW only          0.38       0.267
realign only          <0.001      0.006

Figure 6: Alignment of sequences MAL7P1.11 and Pv087705 from ApiDB [35]. (a) Comparison of CLUSTALW alignment (top two lines of
text) and the regular expression alignment (bottom two lines). Background colors indicate whether the two algorithms agree. Green: columns
aligned by both algorithms; blue: letters not aligned by both algorithms; white: letters aligned by CLUSTALW but appearing in variable regions
in the regular expression; red: letters aligned in the regular expression but not by CLUSTALW. (b) Same as (a), but comparing the trimmed
CLUSTALW alignment with regular expression alignment. The middle row of two lines shows the result of the alignment trimming algorithm;
an asterisk identifies a column from the CLUSTALW alignment that was removed by gap expansion. (c) Proportion of each type of column
averaged over all 3909 alignments.

organism A, reveals that a is most similar to b, then a and b are most likely orthologs.
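The reciprocal best hit test itself reduces to a few lines; the sketch below substitutes made-up similarity scores for real BLAST output:

```python
# scores_ab[a][b] is the similarity of gene a (organism A) to gene b
# (organism B); scores_ba is the reverse search. Genes a and b are
# reciprocal best hits when each is the other's top-scoring match.
def best_hit(gene, scores):
    return max(scores[gene], key=scores[gene].get)

def reciprocal_best_hits(scores_ab, scores_ba):
    return [(a, best_hit(a, scores_ab))
            for a in scores_ab
            if best_hit(best_hit(a, scores_ab), scores_ba) == a]

scores_ab = {"a1": {"b1": 90, "b2": 40}, "a2": {"b1": 50, "b2": 60}}
scores_ba = {"b1": {"a1": 88, "a2": 45}, "b2": {"a1": 30, "a2": 70}}
print(reciprocal_best_hits(scores_ab, scores_ba))   # [('a1', 'b1'), ('a2', 'b2')]
```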
Once pairs of genes are identified as reciprocal best hits,
a more detailed comparison is done using a global alignment
algorithm such as CLUSTALW [33]. To see how well the regular expression-based alignment algorithm performs on real
sequences, a series of alignments of orthologous genes made
with realign were compared to the CLUSTALW alignments
of the same genes. The complete set of genes from Plasmodium falciparum, the parasite that causes malaria, and a close
relative known as Plasmodium vivax were downloaded from
ApiDB, the model organism database for this family of organisms [35]. A set of 3909 orthologs were identified by using BLAST to search for reciprocal best hits. Since P. falciparum diverged from P. vivax approximately 200 MYA [36], all the alignments used the PAM20 substitution matrix. The realign alignments were made using the scaled representation for marker symbols with ε = 0.02 since insertion and
deletion events are relatively rare at this short evolutionary
time scale.
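The reciprocal-best-hit criterion itself fits in a few lines (a sketch; the best-hit tables below are hypothetical stand-ins for parsed BLAST output):

```python
def reciprocal_best_hits(best_in_b, best_in_a):
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    return [(a, b) for a, b in best_in_b.items() if best_in_a.get(b) == a]

# Hypothetical best-hit tables for genes from two genomes A and B.
best_in_b = {"a1": "b1", "a2": "b2", "a3": "b9"}   # best hit in B for each gene of A
best_in_a = {"b1": "a1", "b2": "a7", "b9": "a3"}   # best hit in A for each gene of B
print(reciprocal_best_hits(best_in_b, best_in_a))  # [('a1', 'b1'), ('a3', 'b9')]
```

Here a2 is dropped because its best hit b2 points back to a different gene, so the relationship is not reciprocal.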
Figure 6 shows a detailed comparison of the alignments
for one pair of genes (MAL7P1.11 and Pv087705). The top
two lines in Figure 6(a) are the alignment produced by
CLUSTALW, and the bottom two are the regular expression
alignment. To make it easier to compare the alignments, the
marker symbols have been deleted, and the letters in variable regions are printed in italics to distinguish them from letters
in blocks. The four background colors indicate the level of
agreement between the two alignments: a pair can be aligned
by both programs, aligned by neither, or aligned by one but
not the other.
Researchers often apply an alignment trimming algorithm to the output of an alignment algorithm to identify
suspect columns in an alignment [37]. An example of a suspect column is the one shown in Figure 1 where an insertion occurred in the middle of a codon. Figure 6(b) shows
the alignment of the Plasmodium genes after an alignment
trimming operation [38] was applied to the CLUSTALW alignments. The middle two lines in this figure show the results
of the trimming application: an X indicates a letter that was
left in the alignment, and an asterisk (*) indicates a position that was
originally aligned but has now been converted to a gap. In
this example, the alignment trimming algorithm agreed with
the regular expression alignment: columns that were previously shown as aligned (white background color) are now
unaligned (blue).
Over all the 3909 pairs of sequences, the two alignment
methods agreed on 62% of the letters (top two rows of
Figure 6(c)). The disagreement was almost entirely due to the
fact that in 38% of the columns, the regular expression alignment was more conservative and placed characters in an unaligned region when CLUSTALW aligned those same letters.
There are very few instances where realign put letters in
an aligned block and CLUSTALW did not. Applying the alignment trimming algorithm increases the level of agreement:
approximately one fourth of the columns originally considered aligned by CLUSTALW were reclassified as unaligned, in
agreement with realign. The number of columns aligned
only by realign also increased, but that is simply due to the
fact that the alignment trimming algorithm used here [38] is
very conservative and also trims away the last character in an
aligned region (as shown by the red columns at the ends of
blocks in Figure 6(b)).
These results show that for sequences with a high degree
of similarity (separated by only 200 MY of evolution), the
MDL method implemented in realign does a credible job
of global alignment. A more detailed analysis of genes with
known alignments, preferably including structural and functional alignment, would be required to determine whether
the 25% of the letter pairs aligned by CLUSTALW should in
fact be aligned, or whether realign was correct in leaving
them in variable regions.
4.2. BAliBASE reference alignments
The main parameter of the regular expression alignment
method is the substitution matrix, which defines the probabilities for amino acid letters. A second parameter, the number of bits to use for a marker symbol or the probability associated with a marker symbol, is required if expressions are
encoded with the tagged or scaled representations, respectively. To illustrate the effects of these parameters, an experiment evaluated the accuracy of realign alignments compared to known reference alignments from the BAliBASE
[34] benchmark suite.

EURASIP Journal on Bioinformatics and Systems Biology


Sequences in BAliBASE are organized in a collection of different test sets. The sets were designed to provide different challenges to multiple alignment programs; for example,
all sequences in a test are equally distant, or sequences are in
two distinct subgroups. Sequences in each set have known 3D
structures, and each test set was manually curated to identify conserved core blocks within each multiple alignment.
The accuracy of an alignment algorithm can be assessed by
comparing how it aligns amino acids in the core blocks. The
comparisons reported here were made by aligning all pairs of
sequences in each test set.
Figure 7 illustrates how the choice of a substitution matrix affects the accuracy of an alignment. The blocks in
Figure 7(b) are from an alignment based on PAM20, and the
blocks in Figure 7(c) are from the same pair of sequences
aligned with PAM250. Letters shown in blue are accurate
pairings of letters in core blocks in the reference alignment,
and letters in red are misaligned: either they are placed in
variable regions, or, if they are in blocks, they are aligned with
the wrong letter from the other sequence (e.g., the letters in
the block marked with (2)). The overall accuracy is higher for
the PAM250 alignment, which is not surprising since these
two sequences are only about 40% identical, and sequences
with this low level of similarity have probably diverged for
much more than 200 MY.
The block marked with a (3) in Figure 7 is an example
of how a less strict substitution matrix leads to longer blocks.
The letters Q and G are dissimilar in PAM20, and the block
ends at this letter pair. But with PAM250, there is a slight benefit to aligning Q with G (c(G|Q) < c(G)), so these two letters
are aligned.
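The test c(G|Q) < c(G) is a comparison of two ideal code lengths derived from the substitution matrix. A sketch with invented probabilities (not the actual PAM20 or PAM250 entries):

```python
import math

def cost(p):
    """Ideal code length, in bits, for a symbol with probability p."""
    return -math.log2(p)

# Invented probabilities: background frequency of G, and conditional
# probabilities of G given that it is aligned with Q under two matrices.
p_background_G = 0.07
p_G_given_Q_pam250 = 0.09   # slightly favored under a loose matrix
p_G_given_Q_pam20 = 0.01    # strongly disfavored under a strict matrix

# Under the looser matrix the conditional code is shorter, so the Q-G
# pair joins the block; under the strict matrix the block ends before it.
print(cost(p_G_given_Q_pam250) < cost(p_background_G))  # True
print(cost(p_G_given_Q_pam20) < cost(p_background_G))   # False
```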
Note that in the region indicated by (1) in Figure 7, the
letters G and F still have a negative benefit with PAM250. But
they are included in a longer block in the PAM250 alignment
because they are surrounded on both sides by runs of similar letters, and it was less expensive for the algorithm to keep
them in this block than to break them out into a short variable region.
Varying the alignment parameter that determines the
number of bits used to represent a marker symbol also has
an effect on accuracy. The longer sequences evolve, the more
likely it is that an insertion or deletion mutation occurs in
one or both sequences, and the regular expression alignment
algorithm will need the flexibility to insert more marker symbols. When aligning pairs of sequences from BAliBASE, small
values of the marker cost, either specified directly when the tagged representation is used or computed as −log2 of the marker probability for the scaled representation, yield the most accurate alignments.
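The two parameterizations are interchangeable: a marker probability p corresponds to a marker cost of −log2 p bits, so the probability 0.02 used for the Plasmodium alignments implies markers of about 5.644 bits. A minimal sketch:

```python
import math

def marker_cost_bits(p):
    """Bits needed for a marker symbol with probability p (scaled representation)."""
    return -math.log2(p)

def marker_probability(bits):
    """Inverse mapping: the probability implied by a fixed marker cost in bits."""
    return 2.0 ** -bits

print(round(marker_cost_bits(0.02), 3))   # 5.644
print(round(marker_probability(1.5), 3))  # 0.354
```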
Since the goal of the alignment algorithm is to find the
sequence expression that can be represented in the fewest
number of bits, a natural question is whether the algorithm
should try to search for the value of the marker probability that leads to the overall lowest cost expression. A related question, for sequences
which have a known reference alignment, is whether the expression with the shortest encoding also corresponds to the
most accurate alignment.
Unfortunately, the answers to these questions are not
straightforward. The plots in Figure 8 show the results of a
set of experiments that measure the effect of the marker probability on the number

John S. Conery


(a)

(b)

(c)

Figure 7: Portions of alignments of sequences 1aho and 1bmr from the BAliBASE alignment benchmark (Release 3) [34]. (a) The reference
alignment from BAliBASE. Letters in core blocks are highlighted in blue. (b) Alignment from realign, using PAM20 and a marker probability of 0.2. (c) Same
as (b) but using PAM250. In (b) and (c), lines starting with % are comments that show the degree of similarity of corresponding letters in the
preceding block: identical (=), similar (+), or dissimilar (−). Sequence letters in blue are correctly aligned core blocks. Red letters are core
block columns that should have been aligned but were left in variable regions. The circled numbers highlight changes in the alignment (see
text).

of bits needed to encode a set of sequence expressions and the accuracy of the alignments. To make sure the alignment
algorithm had enough data to work with, the alignments
were done on the longest set of sequences in BAliBASE. There
are eight sequences in this test set (BB12007), ranging in
length from 994 to 1084 letters, with a mean length of 1020
letters. A total of 28 pairwise alignments were created, using all possible pairs of sequences from the set.
Figure 8(a) shows that the number of bits required to represent an alignment increases as the marker probability increases. There is a
very slight decrease in cost near a probability of 0.02. At smaller values,
the cost of representing a marker symbol (−log2 of the probability) is too
high for the algorithm to include any blocks. Near 0.02,
a few blocks are found and the overall cost is lowered. But
as the probability increases, the cost of the sequence letters increases, since
their probabilities are scaled by a factor of one minus the marker probability. There are typically far
more letter symbols than marker symbols in a sequence expression, and the increase in the size of each letter outweighs
any gain from a shorter representation for marker symbols.
One could argue that for a given value of the marker probability, it is not
the total size of a sequence expression that is important,
but rather the amount of compression that results from that
value, where compression is the difference in the number
of bits required to encode the null hypothesis (that the sequences have nothing in common) and the number of bits to

encode the shortest sequence expression. Figure 8(b) shows a plot of the change in compression as a function of the marker probability;
there is a peak in the range from roughly 0.07 to 0.1. Superimposed
on this graph is a plot of the accuracy of the best alignment,
also as a function of the marker probability. The peak in this plot is an accuracy of
69%, at a probability of 0.05.
The most accurate alignments, with a mean accuracy of
80%, were created using the tagged representation and very
small marker costs, between 1.25 and 1.75 bits (including the
tag bit). To obtain a comparable ratio between the cost of a
marker symbol and a sequence letter, the marker probability in the scaled
representation would have to be around 0.25. But because the scaled
representation requires the algorithm to compare letter probabilities scaled by one minus the marker probability with unscaled conditional probabilities, the accuracy deteriorates at higher probabilities. This
distortion might be the reason the peak in the accuracy curve
does not correspond more closely to the peak in the compression curve in Figure 8(b).
5. SUMMARY AND FUTURE WORK

This paper has shown that regular expressions provide useful descriptions of alignments of pairs of sequences. The expressions are simple concatenations of alternating blocks and
variable regions, where blocks are equal-length substrings



[Figure 8: two plots. (a) Mean total cost (bits, 10^2 to 10^3, log scale) versus the marker probability (0.05 to 0.2). (b) Mean compression (bits, open circles) and mean accuracy (closed circles) versus the marker probability (0.05 to 0.2).]

Figure 8: The effect of the marker probability on alignments of pairs of sequences from BAliBASE [34] test set BB12007. There are eight
sequences in the set; the data points are based on averages over all (8 × 7)/2 = 28 pairs of sequences. (a) Mean cost (in bits) of alignments
as a function of the marker probability. (b) Mean compression (the difference between the cost of the null hypothesis and the lowest cost alignment for each pair
of sequences) is indicated by open circles. The mean accuracy of the alignments (proportion of core blocks correctly aligned) is indicated by
closed circles (scale shown on the right axis).

from each input sequence and variable regions are strings of unaligned characters.
Alignment via regular expressions is an application of information theory: a hypothetical sender constructs a regular
expression that describes the sequences, compresses the expression by encoding blocks with conditional probabilities,
and transmits the encoded expression to a receiver, who can
recover the original sequences by generating every string that
matches the expression. The only parameter that is required
is a substitution matrix, which sets the background probabilities for unaligned letters and the conditional probabilities
for pairs of aligned letters. For greater flexibility, an optional
second parameter specifies the number of bits to use for the
marker symbols that denote block boundaries. This information theoretic framework does not use gaps to align variable-length sequences (instead, a global alignment of sequences
of different length will have at least one variable region with
a different number of letters from the input sequences), and
thus finesses issues associated with gap penalties.
Accurate alignment of biological sequences needs to take
into account the amount of time the sequences have been
changing since they diverged from their most recent common ancestor. The two parameters that affect the encoding of regular expressions (the choice of substitution matrix
and the number of bits to use for marker symbols) are related to the two main types of mutations that can occur since

the input sequences diverged. The substitution matrix is the basis for computing the probability of aligning pairs of letters, and generally reflects the probability that one of the letters changed via point mutation into the other letter. Marker
symbols typically denote block boundaries that are the result
of insertion or deletion mutations, and for very diverse sequences a smaller number of bits per marker reflect a higher
probability of an insertion or deletion.
An alignment algorithm based on this approach can be
seen as a process that begins with a default null hypothesis
that the sequences are unrelated, represented by an expression that has all characters in a single unaligned region. The
algorithm searches for candidate blocks, consisting of equallength substrings from each input sequence, and checks to
see if the encoding of an expression that includes a block is
shorter than the encoding without the block. The tradeoff
that must be taken into account is that blocks of similar letters will have denser encodings due to the use of conditional
probabilities, but adding a block means increasing the number of marker symbols that denote the edges of blocks.
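The accept/reject decision described above can be sketched as follows (a toy version with invented probabilities in place of a real substitution matrix, assuming each new block adds four marker symbols as in the earlier grammar example):

```python
import math

BG = 1 / 20          # uniform background probability per amino acid (invented)
MARKER_BITS = 5.644  # cost of one marker symbol, i.e., -log2(0.02)

def block_delta(s_block, t_block, cond_prob):
    """Change in total encoding size (bits) if these equal-length substrings
    become a block; negative means the block makes the expression shorter."""
    # Null hypothesis: every letter in both substrings at background cost.
    old = 2 * len(s_block) * -math.log2(BG)
    # With the block: first letter at background cost, second coded conditionally.
    new = sum(-math.log2(BG) - math.log2(cond_prob.get((a, b), BG))
              for a, b in zip(s_block, t_block))
    return (new + 4 * MARKER_BITS) - old

# Invented conditional model: identical letters are strongly favored.
cond = {(a, a): 0.4 for a in "ACDEFGHIKLMNPQRSTVWY"}

print(block_delta("MKTAYIAK", "MKTAYIAK", cond) < 0)  # True: block pays for itself
print(block_delta("MK", "MK", cond) < 0)              # False: too short to pay for markers
```

The denser conditional coding saves a fixed number of bits per similar pair, so only sufficiently long runs of similar letters overcome the marker overhead.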
A comparison of this new method with CLUSTALW,
a widely used standard for sequence alignment, shows
that the regular expression alignments generally agree with
CLUSTALW on regions included in blocks in the regular expression. Approximately three quarters of the characters left
unaligned in a regular expression are aligned by CLUSTALW,


but that number drops to one half if the CLUSTALW alignments are treated with an alignment trimming algorithm
to remove ambiguous regions. A more detailed case-by-case
analysis would be required to determine if the remaining unaligned characters should remain unaligned (i.e., alignment
trimming should be more ambitious) or if they need to be
aligned (i.e., the regular expression approach is not aligning
some characters that should be aligned).
A second set of experiments compared the output of the
regular expression method with known reference alignments
from the BAliBASE alignment benchmark. Since the benchmark is designed to test multiple alignment algorithms, and
it is generally accepted that multiple alignment is more accurate than simple pairwise alignment [28], it is not possible
to say whether the regular expression approach is as accurate
as recent multiple alignment methods, but the overall accuracy of over 80% for sequences with 20% to 40% identity is
encouraging.
One direction for future research is to try to automatically determine, for each substitution matrix, the best value
for the marker cost or marker probability, the parameters that determine the number of bits
per marker symbol. Based on extensive investigation (e.g.,
[39]) of different combinations of substitution matrix and
other parameters, BLAST, CLUSTALW, and other applications
set default values for gap penalties based on the choice of substitution matrix. A similar analysis, perhaps based on insertion and deletion mutation rates, might be used to match a
substitution matrix with a setting of the marker cost or probability for regular expression alignments.
A second direction for future research is to expand the
method to perform multiple alignment of more than two
sequences. One approach would be to use pairwise local
alignments produced by realign as anchors for DIALIGN
[22, 23], a progressive multiple alignment program that joins
consistent sets of ungapped local alignments into a complete multiple alignment. A different approach would align
all the sequences at the same time, using sum-of-pairs or
some other method to average conditional costs based on
each of the n(n − 1)/2 pairs of sequences.
A third direction for future research is to extend the
canonical sequence expressions or the equivalent grammar
to include other forms of descriptions of regions of similarity.
One idea is to use PROSITE blocks [40] as subroutines that
can be embedded in blocks. For example, PROSITE block
PS00007 is [RK]-x(2,3)-[DE]-x(2,3)-Y, using a notation similar to a regular expression where a string in brackets means
"any one of these letters" and x(2,3) means "any sequence
between 2 and 3 letters long." A string that matches this pattern, RDIKDPEY, occurs in one of the Plasmodium sequences
discussed in Section 4.1. A block for the region containing
this pattern might include a reference to the PROSITE block,
for example, instead of
#DLLRDIKDPEYSYT

(9)

the block would be something like


#DLL ps00007 (R, DIK, D, PE) SYT,

(10)

where the arguments to the procedure call are pieces of the sequence to plug in to the pattern. A benefit from using PROSITE or other predefined collections of patterns is that blocks can be encoded in fewer bits. Where the pattern
specifies one of a small set of k letters, only log2 k bits are
required to encode one of these letters, assuming they are
equally probable in this context. In particular, constants in
the pattern require zero bits, since the receiver knows these
letters as soon as the pattern is specified. A second benefit is
that PROSITE blocks allow the expression to describe small
amounts of variability in the length of a region without introducing a new variable region. Of course these benefits are
offset by the additional complexity of an encoding that allows
for rule names and parameter delimiters.
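Patterns of this restricted form translate directly into ordinary regular expressions, which makes it easy to verify that RDIKDPEY matches PS00007 (a sketch handling only the bracket and x(m,n) constructs used in this example):

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles bracketed alternatives like [RK] and variable gaps like x(2,3);
    a full translator would also need plain x, repeats, and {...} exclusions.
    """
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"x\((\d+),\s*(\d+)\)", element)
        if m:
            parts.append(".{%s,%s}" % (m.group(1), m.group(2)))
        else:
            parts.append(element)  # [RK], [DE], or a literal letter such as Y
    return "".join(parts)

ps00007 = "[RK]-x(2,3)-[DE]-x(2,3)-Y"
print(prosite_to_regex(ps00007))                                  # [RK].{2,3}[DE].{2,3}Y
print(bool(re.fullmatch(prosite_to_regex(ps00007), "RDIKDPEY")))  # True
```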
As the last example shows, regular expressions and grammars are very flexible, with many different rule structures
able to describe the same set of sequences. The different rule
structures convey different information about the strings
generated by the grammars, and the goal will be to see if minimum description length encoding of these alternative structures, and selection of the shortest encoding, accurately provides the best description of the relationships between the
sequences.
ACKNOWLEDGMENTS
The anonymous reviewers made several valuable comments.
The indexed representation for marker symbols was suggested by one of the reviewers, and the scaled representation
is due to Peter Grünwald. The author gratefully acknowledges support by grants from the National Science Foundation (MCB-0342431) and the National Institutes of Health
(5R01RR020833-02), and an E. T. S. Walton Visitors Award from
Science Foundation Ireland.
REFERENCES
[1] E. W. Myers, "The fragment assembly string graph," Bioinformatics, vol. 21, suppl. 2, pp. ii79–ii85, 2005.
[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
[3] A. J. Phillips, "Homology assessment and molecular sequence alignment," Journal of Biomedical Informatics, vol. 39, no. 1, pp. 18–33, 2006.
[4] J. O. Wrabl and N. V. Grishin, "Gaps in structurally similar proteins: towards improvement of multiple sequence alignment," Proteins, vol. 54, no. 1, pp. 71–87, 2004.
[5] K. Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," Bioinformatics, vol. 20, no. 2, pp. 170–179, 2004.
[6] B.-J. M. Webb, J. S. Liu, and C. E. Lawrence, "BALSA: Bayesian algorithm for local sequence alignment," Nucleic Acids Research, vol. 30, no. 5, pp. 1268–1277, 2002.
[7] J. Rissanen, "Modelling by the shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[8] P. Grünwald, "A minimum description length approach to grammar inference," in Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, vol. 1040 of Lecture Notes in Computer Science, pp. 203–216, Springer, Berlin, Germany, 1996.

[9] A. Brazma, I. Jonassen, J. Vilo, and E. Ukkonen, "Pattern discovery in biosequences," in International Conference on Grammar Inference (ICGI '98), V. Honavar and G. Slutski, Eds., vol. 1433 of Lecture Notes in Artificial Intelligence, pp. 257–270, Springer, Ames, Iowa, USA, 1998.
[10] L. Cai, R. L. Malmberg, and Y. Wu, "Stochastic modeling of RNA pseudoknotted structures: a grammatical approach," Bioinformatics, vol. 19, suppl. 1, pp. i66–i73, 2003.
[11] D. B. Searls, "The computational linguistics of biological sequences," in Artificial Intelligence and Molecular Biology, pp. 47–120, American Association for Artificial Intelligence, Menlo Park, Calif, USA, 1993.
[12] D. B. Searls, "Linguistic approaches to biological sequences," Computer Applications in the Biosciences, vol. 13, no. 4, pp. 333–344, 1997.
[13] A. Bairoch, "PROSITE: a dictionary of sites and patterns in proteins," Nucleic Acids Research, vol. 20, pp. 2013–2018, 1992.
[14] M. Vingron and M. S. Waterman, "Sequence alignment and penalty choice. Review of concepts, case studies and implications," Journal of Molecular Biology, vol. 235, no. 1, pp. 1–12, 1994.
[15] S. Henikoff, "Scores for sequence searches and alignments," Current Opinion in Structural Biology, vol. 6, no. 3, pp. 353–360, 1996.
[16] G. Giribet and W. C. Wheeler, "On gaps," Molecular Phylogenetics and Evolution, vol. 13, no. 1, pp. 132–143, 1999.
[17] Y. Nozaki and M. Bellgard, "Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties," Bioinformatics, vol. 21, no. 8, pp. 1421–1428, 2005.
[18] J. T. Reese and W. R. Pearson, "Empirical determination of effective gap penalties for sequence comparison," Bioinformatics, vol. 18, no. 11, pp. 1500–1507, 2002.
[19] L. Allison, C. S. Wallace, and C. N. Yee, "Finite-state models in the alignment of macromolecules," Journal of Molecular Evolution, vol. 35, no. 1, pp. 77–89, 1992.
[20] J. P. Schmidt, "An information theoretic view of gapped and other alignments," in Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98), pp. 561–572, Maui, Hawaii, USA, January 1998.
[21] T. Aynechi and I. D. Kuntz, "An information theoretic approach to macromolecular modeling: I. Sequence alignments," Biophysical Journal, vol. 89, no. 5, pp. 2998–3007, 2005.
[22] B. Morgenstern, "DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment," Bioinformatics, vol. 15, no. 3, pp. 211–218, 1999.
[23] M. Brudno, M. Chapman, B. Göttgens, S. Batzoglou, and B. Morgenstern, "Fast and sensitive multiple alignment of large genomic sequences," BMC Bioinformatics, vol. 4, p. 66, 2003.
[24] T. D. Schneider, "Information content of individual genetic sequences," Journal of Theoretical Biology, vol. 189, no. 4, pp. 427–441, 1997.
[25] N. Krasnogor and D. A. Pelta, "Measuring the similarity of protein structures by means of the universal similarity metric," Bioinformatics, vol. 20, no. 7, pp. 1015–1021, 2004.
[26] J. S. Conery, "Realign: grammar-based sequence alignment," University of Oregon, http://teleost.cs.uoregon.edu/realign.
[27] M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A model of evolutionary change in proteins," in Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, pp. 345–352, Washington, DC, USA, 1978.



[28] D. W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, New York, NY, USA, 2nd edition, 2004.
[29] S. Henikoff and J. G. Henikoff, "Amino acid substitution matrices from protein blocks," Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, 1992.
[30] G. H. Gonnet, M. A. Cohen, and S. A. Benner, "Exhaustive matching of the entire protein sequence database," Science, vol. 256, no. 5062, pp. 1443–1445, 1992.
[31] S. Karlin and S. F. Altschul, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proceedings of the National Academy of Sciences of the United States of America, vol. 87, no. 6, pp. 2264–2268, 1990.
[32] S. R. Eddy, "Where did the BLOSUM62 alignment score matrix come from?" Nature Biotechnology, vol. 22, no. 8, pp. 1035–1036, 2004.
[33] J. D. Thompson, D. G. Higgins, and T. J. Gibson, "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice," Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[34] J. D. Thompson, F. Plewniak, and O. Poch, "A comprehensive comparison of multiple sequence alignment programs," Nucleic Acids Research, vol. 27, no. 13, pp. 2682–2690, 1999.
[35] C. Aurrecoechea, M. Heiges, H. Wang, et al., "ApiDB: integrated resources for the apicomplexan bioinformatics resource center," Nucleic Acids Research, vol. 35, pp. D427–D430, 2007.
[36] R. Carter, "Speculations on the origins of Plasmodium vivax malaria," Trends in Parasitology, vol. 19, no. 5, pp. 214–219, 2003.
[37] M. Cline, R. Hughey, and K. Karplus, "Predicting reliable regions in protein sequence alignments," Bioinformatics, vol. 18, no. 2, pp. 306–314, 2002.
[38] J. S. Conery and M. Lynch, "Nucleotide substitutions and the evolution of duplicate genes," in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), pp. 167–178, Big Island of Hawaii, Hawaii, USA, January 2001.
[39] W. R. Pearson, "Comparison of methods for searching protein sequence databases," Protein Science, vol. 4, no. 6, pp. 1145–1160, 1995.
[40] N. Hulo, A. Bairoch, V. Bulliard, et al., "The PROSITE database," Nucleic Acids Research, vol. 34, pp. D227–D230, 2006.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 43670, 16 pages
doi:10.1155/2007/43670

Research Article
MicroRNA Target Detection and Analysis for Genes Related to
Breast Cancer Using MDLcompress
Scott C. Evans,1 Antonis Kourtidis,2 T. Stephen Markham,1 Jonathan Miller,3
Douglas S. Conklin,2 and Andrew S. Torres1
1 GE Global Research, One Research Circle, Niskayuna, NY 12309, USA
2 Gen*NY*Sis Center for Excellence in Cancer Genomics, University at Albany, State University of New York, One Discovery Drive, Rensselaer, NY 12144, USA
3 Human Genome Sequencing Center, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA

Received 1 March 2007; Revised 12 June 2007; Accepted 23 June 2007


Recommended by Peter Grünwald
We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast
this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this
tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm
outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify
biologically meaningful sequence without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL
model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our
previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression)
through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has
identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.
Copyright © 2007 General Electric Company. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

The discovery of RNA interference (RNAi) [1] and certain of its endogenous mediators, the microRNAs (miRNAs), has
catalyzed a revolution in biology and medicine [2, 3]. MiRNAs are transcribed as long (∼1000 nt) pri-miRNAs, cut
into small (∼70 nt) stem-loop precursors, exported into
the cytoplasm of cells, and processed into short (∼20 nt)
single-stranded RNAs, which interact with multiple proteins
to form a superstructure known as the RNA-induced silencing complex (RISC). The RISC binds to sequences in the
3′ untranslated region (3′ UTR) of mature messenger RNA
(mRNA) that are partially complementary to the miRNA.
Binding of the RISC to a target mRNA induces inhibition
of protein translation by either (i) inducing cleavage of the
mRNA or (ii) blocking translation of the mRNA. MiRNAs
therefore represent a nonclassical mechanism for regulation
of gene expression.
MiRNAs can be potent mediators of gene expression, and
this fact has led to large-scale searches for the full complement of miRNAs and the genes that they regulate. Although it is believed that all information about a miRNA's
targets is encoded in its sequence, attempts to identify targets
by informatics methods have met with limited success, and
the requirements on a target site for a miRNA to regulate a
cognate mRNA are not fully understood. To date, over 500
distinct miRNAs have been discovered in humans, and estimates of the total number of human miRNAs range well into
the thousands. Complex algorithms to predict which specific
genes these miRNAs regulate often yield dozens or hundreds
of distinct potential targets for each miRNA [4–6]. Because
of the technical difficulty of testing all potential targets of a
single miRNA, there are few, if any, miRNAs whose activities
have been thoroughly characterized in mammalian cells. This
problem is of singular importance because of evidence suggesting links between miRNA expression and human disease,
for example, chronic lymphocytic leukemia and lung cancer
[7, 8]; however, the genes affected by these changes in miRNA
expression remain unknown.
MiRNA genes themselves were opaque to standard informatics methods for decades in part because they are
primarily localized to regions of the genome that do not



[Figure 1: flowchart of the OSCR algorithm (start with the initial sequence; search for the best SCR grammar rule; if its gain exceeds Gmin, update the codebook and check for descendents; otherwise encode and stop), together with a plot of SCR versus symbol length and number of repeats for the example sequence GAAGTGCAGTGAAGTGCAGTGTCAGTGCT. The phrase GAAGTGCAGT (length 10) repeats twice, at positions 1 and 11; the best OSCR phrase AGTG (length 4) repeats 5 times, at positions 3, 8, 13, 18, and 24, giving the segmentation GA AGTG C AGTG A AGTG C AGTG TC AGTG CT.]
Figure 1: The OSCR algorithm. Phrases that recursively contribute most to sequence compression are added to the model first. The motif
AGTG is the first selected and added to OSCR's MDL model. A longest-match algorithm would not call out this motif.

code for protein. Informatics techniques designed to identify protein-coding sequences, transcription factors, or other
known classes of sequence did not resolve the distinctive signatures of miRNA hairpin loops or their target sites in the
3′ UTRs of protein-coding genes. In this sense, apart from
comparative genomics, sequence analysis methods tend to be
best at identifying classes of sequence whose biological significance is already known.
Minimum description length (MDL) principles [9] offer a general approach to de novo identification of biologically meaningful sequence information with a minimum of
assumptions, biases, or prejudices. Their advantage is that
they explicitly address the cost of model complexity, enabling data analysis
without overfitting. The challenge of incorporating MDL
into sequence analysis lies in (a) quantification of appropriate model costs and (b) tractable computation of model inference. A grammar inference algorithm that infers a two-part minimum description length code was introduced in
[10], applied to the problem of information security in [11]
and to miRNA target detection in [12]. This optimal symbol
compression ratio (OSCR) algorithm produces meaningful
models in an MDL sense while achieving a combination of
model and data whose descriptive size together represents an
estimate of the Kolmogorov complexity of the dataset [13].
We anticipate that this capacity for capturing the regularity
of a data set within compact, meaningful models will have
wide application to DNA sequence analysis.
MDL principles were successfully applied to segment
DNA into coding, noncoding, and other regions in [14].
The normalized maximum likelihood model (an MDL algorithm) [15] was used to derive a regression that also
achieves near state-of-the-art compression. Further MDL-related approaches include the greedy offline GREEDY
algorithm [16] and DNA Sequitur [17, 18]. While these

grammar-based codes do not achieve the compression of
DNACompress [19] (see [20] for a comparison and an additional approach using dynamic programming), the structure
of these algorithms is attractive for identifying biologically
meaningful phrases. The compression achieved by our algorithm exceeds that of DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. Differences between MDLcompress and GREEDY will be discussed later. The deep recursion of our approach combined
with its two-part coding makes our algorithm uniquely able
to identify biologically meaningful sequence de novo with a
minimal set of assumptions. In processing a gene transcript,
we selectively identify (i) sequences that are short but occur frequently (e.g., codons, each 3 nucleotides) and (ii) sequences that are relatively long but occur only a small number of times (e.g., miRNA target sites, each 20 nucleotides
or more). An example is shown in Figure 1, where given
the input sequence shown, OSCR highlights the short motif
AGTG that occurs five times, over a longer sequence that occurs only twice. Other model inference strategies would bypass this short motif.
In this paper, we describe initial results of miRNA analysis using OSCR and introduce improvements to OSCR that
reduce execution time and enhance its capacity to identify biologically meaningful sequence. These modifications,
some of which were first introduced in [21], retain the deep
recursion of the original algorithm but exploit novel data
structures that make more efficient use of time and memory by gathering phrase statistics in a single pass and subsequently selecting multiple codebook phrases. Our data structure incorporates candidate phrase frequency information
and pointers identifying the location of candidate phrases in
the sequence, enabling efficient computation. MDL model
inference refinement is achieved by improving heuristics,

Scott C. Evans et al.
[Figure 2 appears here: three nested set descriptions of a 128-bit string: the two 128-bit strings alternating 1 and 0 (1010…10 and 0101…01); the set of all 2^128 ≈ 3.4 × 10^38 128-bit strings; and the roughly 2^124 128-bit strings containing exactly 64 ones.]
Figure 2: Two-part representations of a 128-bit string. As the length of the model increases, the size of the set including the target string
decreases.

harnessing redundancies associated with palindrome data,


and taking advantage of local sequence similarity. Since it
now employs a suite of heuristics and MDL compression
methods, including but not limited to the original symbol
compression ratio (SCR) measure, we refer to this improved
algorithm as MDLcompress, reflecting its ability to apply
MDL principles to infer grammar models through multiple
heuristics.
We hypothesized that MDL models could discover biologically meaningful phrases within genes, and after summarizing briefly our previous work with OSCR, we present
here the outcome of an MDLcompress analysis of 144 genes
overexpressed in the breast cancer cell line BT474. Our algorithm has identified novel motifs including potential miRNA
binding sites that are being considered for in vitro validation
studies. We further introduce a bits-per-nucleotide MDL
weighting derived from MDLcompress models and their inherent biologically meaningful phrases. Using this weighting, susceptible areas of sequence can be identified where an SNP disproportionately affects MDL cost, indicating an atypical and
potentially pathological change in genomic information content.
2. MINIMUM DESCRIPTION LENGTH (MDL) PRINCIPLES AND KOLMOGOROV COMPLEXITY
MDL is deeply related to Kolmogorov complexity, a measure
of descriptive complexity contained in an object. It refers to
the minimum length l of a program such that a universal
computer can generate a specific sequence [13]. Kolmogorov
complexity can be described as follows, where φ represents a
universal computer, p represents a program, and x represents
a string:

K_φ(x) = min_{φ(p)=x} l(p).   (1)

As discussed in [22], an MDL decomposition of a binary
string x considering finite set models can be separated into
two parts,

K_φ(x) = K(S) + log2 |S| (to within an additive constant),   (2)

where again K_φ(x) is the Kolmogorov complexity for string x
on universal computer φ, and S represents a finite set of which x
is a typical (equally likely) element. The minimum possible
sum of the descriptive cost for set S (the model cost encompassing all regularity in the string) and the log of the set's cardinality (the cost required to enumerate the equally likely set
elements) corresponds to an MDL two-part description for
string x: a model portion that describes all redundancy in the
string, and a data portion that uses the model to define the
specific string. Figure 2 shows how these concepts are manifest in three two-part representations of the 128-bit string
1010…10. In this representation, the model is defined as
English-language text that describes a set, and the log2 of the
number of elements in the defined set is the data portion
of the description. One representation would be to identify
this string by an index into the set of all possible 128-bit strings. This involves a very small model description, but a data description
of 128 bits, so no compression of descriptive cost is achieved.
A second possibility is to use additional model description to
restrict the set to contain only strings with equal numbers of ones and zeros, which saves a few bits of data description.
A more promising approach uses still more
model description to identify the set of alternating patterns of
ones and zeros, which contains only two strings. Among
all possible two-part descriptions of this string, the combination that minimizes the two-part descriptive cost is the MDL
description.
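The three nested-set descriptions in Figure 2 can be checked numerically. The sketch below is our own illustration; the paper leaves model costs as informal English text, so only the data portions log2 |S| are computed:

```python
from math import comb, log2

def two_part_cost(model_bits: float, set_size: int) -> float:
    """Total descriptive cost = model cost + log2(cardinality of the set)."""
    return model_bits + log2(set_size)

n = 128  # length of the target string, in bits

# (i) Model "all 128-bit strings": tiny model, but the data portion is n bits.
data_all = log2(2 ** n)              # 128.0 bits

# (ii) Model "128-bit strings with exactly 64 ones": saves only a few bits.
data_balanced = log2(comb(128, 64))  # about 124 bits

# (iii) Model "128-bit strings alternating 1 and 0": the set has 2 members.
data_alternating = log2(2)           # 1.0 bit

assert data_all == 128.0
assert 124 < data_balanced < 125
assert data_alternating == 1.0
```

Whichever model minimizes the sum of its own description length and the corresponding data portion is the MDL description.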
This example points out a major difference between
Shannon entropy and Kolmogorov complexity. The first-order empirical entropy of the string 1010…10 is very

[Figure 3 appears here: a plot of the Kolmogorov structure function, h_k(x | n) = log |S_k| (bits), versus model size k (bits), bounded by K(x).]

Figure 3: This figure shows the Kolmogorov structure function. As
the model size k is allowed to increase, the size of the set containing string x as an equally likely element decreases. k* indicates the value of the Kolmogorov minimum sufficient statistic.

high, since the numbers of ones and zeros are equal. However, intuitively the regularity of the string makes it seem
strange to call it random. By considering the model cost, as
well as the data cost of a string, MDL theory provides a formal methodology that justifies objectively classifying a string
as something other than a typical member of the set of all 128-bit
binary strings. These concepts can be extended beyond the class of
models that can be constructed using finite sets to all computable functions [22].
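This difference is easy to demonstrate in a few lines of Python (an illustrative check, not taken from the paper): the alternating string attains the maximum first-order empirical entropy of 1 bit per symbol, yet a generic compressor shrinks it dramatically.

```python
from collections import Counter
from math import log2
import zlib

def first_order_entropy(s: str) -> float:
    """Empirical first-order entropy of a string, in bits per symbol."""
    n = len(s)
    return -sum(c / n * log2(c / n) for c in Counter(s).values())

x = "10" * 64  # the alternating 128-symbol string

# Equal numbers of ones and zeros give the maximal first-order entropy.
assert first_order_entropy(x) == 1.0

# Yet the string is highly regular: even a generic compressor exploits it.
assert len(zlib.compress(x.encode())) < len(x)
```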
The size of the model (the number of bits allocated to
spelling out the members of set S) is related to the Kolmogorov structure function h (see [23]). h_k defines the smallest set S that can be described in at most k bits and contains
a given string x of length n,

h_k(x | n) = min_{p : l(p) < k, U(p, n) = S} log2 |S|.   (3)

Cover [23] has interpreted this function as a minimum sufficient statistic, which has great significance from an MDL perspective. This concept is shown graphically in Figure 3. The
log-cardinality of the set containing string x of length n starts out
equal to n when k = 0 bits are used to describe set S (restrict
its size). As k increases, the cardinality of the set containing
string x can be reduced until a critical value k* is reached,
which is referred to as the Kolmogorov minimum sufficient
statistic, or algorithmic minimum sufficient statistic [22]. At
k*, the size of the two-part description of string x equals
K_φ(x) to within a constant. Increasing k beyond k* will continue to make possible a two-part code of size K_φ(x), eventually resulting in a description of a set containing the single
element x. However, beyond k*, the increase in the descriptive cost of the model, while reducing the cardinality of the
set to which x belongs, does not decrease the string's overall
descriptive cost.
The optimal symbol compression ratio (OSCR) algorithm is a grammar inference algorithm that infers a two-part
minimum description length code and an estimate of the algorithmic minimum sufficient statistic [10, 11]. OSCR produces meaningful models in an MDL sense, while achieving a combination of model plus data whose descriptive size
together estimates the Kolmogorov complexity of the data set.
OSCR's capability for capturing the regularity of a data set
into compact, meaningful models has wide application for
sequence analysis. The deep recursion of our approach combined with its two-part coding nature makes our algorithm
uniquely able to identify meaningful sequences without limiting assumptions.
The entropy of a distribution of symbols defines the average per-symbol compression bound in bits per symbol for
a prefix-free code. Huffman coding and other strategies can
produce an instantaneous code approaching the entropy in
the limit of infinite message length when the distribution is
known. In the absence of knowledge of the model, one way
to proceed is to measure the empirical entropy of the string.
However, empirical entropy is a function of the partition and
depends on what substrings are grouped together to be considered symbols. Our goal is to optimize the partition (the
number of symbols, their length, and distribution) of a string
such that the compression bound for an instantaneous code
(the total number of encoded symbols R times the entropy H_s)
plus the codebook size is minimized. We define the approximate model descriptive cost M to be the sum of the lengths
of unique symbols, and the total descriptive cost D_p as follows:
M = Σ_i l_i,   D_p = M + R H_s.   (4)

While not exact (symbol-delimiting comma costs are ignored in the model, while possible redundancy advantages
are not considered either), these definitions provide an approximate means of breaking out MDL costs on a per-symbol
basis. The analysis that follows can easily be adapted to other
model cost assumptions.
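Equations (4) and (5) can be exercised directly. In the sketch below (our illustration; the partition and counts are invented, not the paper's worked example), a partition is given as a map from each unique symbol to its repetition count r_i:

```python
from math import log2

def partition_costs(partition):
    """partition: dict mapping each unique symbol to its repetition count r_i.
    Returns (M, H_s, D_p) following equations (4) and (5):
        M   = sum of the lengths of the unique symbols,
        H_s = log2(R) - (1/R) * sum_i r_i log2(r_i),
        D_p = M + R * H_s.
    """
    R = sum(partition.values())
    M = sum(len(sym) for sym in partition)
    H_s = log2(R) - sum(r * log2(r) for r in partition.values()) / R
    return M, H_s, M + R * H_s

# Illustrative partition of "a rose is a rose is a rose"-style text:
M, H_s, D_p = partition_costs({"a rose": 3, " is ": 2})
assert M == 10             # len("a rose") + len(" is ")
assert D_p == M + 5 * H_s  # R = 3 + 2 = 5
```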
2.1. Symbol compression ratio

In seeking to partition the string so as to minimize the total
string descriptive length D_p, we consider the length that the
presence of each symbol adds to the total descriptive length
and the amount of coverage of the total string length L that it
provides. Since the probability of each symbol, p_i, is a function of the number of repetitions of each symbol, it can be
easily shown that the empirical entropy for this distribution
reduces to

H_s = log2(R) − (1/R) Σ_i r_i log2(r_i).   (5)

Thus, we have

D_p = R log2(R) + Σ_i [l_i − r_i log2(r_i)],  with
R log2(R) = Σ_i r_i log2(R) = log2(R) Σ_i r_i,   (6)

where log2(R) is a constant for a given partition of symbols. Computing this estimate based on the partition in hand

Scott C. Evans et al.

5
[Figures 4 and 5 appear here. Figure 4 plots SCR versus symbol length (bits) for a 1024-bit string, with curves for 10, 20, 40, and 60 repeats. Figure 5 shows OSCR statistics for the string x = "a rose is a rose is a rose": SCR is computed from the length l and frequency r of each candidate phrase, e.g. l = 2, r = 3 gives R = 26 − 3 = 23 and SCR = 1.023; l = 6, r = 3 gives R = 26 − 3(5) = 11 and SCR = 0.5; l = 7, r = 2 gives R = 26 − 2(6) = 14 and SCR = 0.7143.]

Figure 4: SCR versus symbol length for 1024-bit string.

Figure 5: OSCR example.

enables a per-symbol formulation for D_p and results in a conservative approximation for R log2(R) over the likely range of
R. The per-symbol descriptive cost can now be formulated:

d_i = r_i [log2(R) − log2(r_i)] + l_i.   (7)

Thus, we have a heuristic that conservatively estimates the
descriptive cost of any possible symbol in a string considering
both model and data (entropy) costs. A measure of the compression ratio for a particular symbol is simply its descriptive length divided by the length of the string
covered by this symbol. We define the symbol compression
ratio (SCR) as

ρ_i = d_i / L_i = [r_i (log2(R) − log2(r_i)) + l_i] / (l_i r_i).   (8)

This heuristic describes the compression work a candidate
symbol will perform in a possible partition of a string. Examining SCR in Figure 4, it is clear that a good symbol compression ratio arises in general when symbols are long and
repeated often. But clearly, selection of some symbols as part
of the partition is preferred to others. Figure 4 shows how
the symbol compression ratio varies with the length of symbols
and number of repetitions for a 1024-bit string.
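Equation (8) is straightforward to experiment with. The sketch below (our own illustration) reproduces the trend visible in Figure 4: SCR falls, i.e. improves, as symbols get longer and repeat more often.

```python
from math import log2

def scr(l_i: int, r_i: int, R: int) -> float:
    """Symbol compression ratio, equation (8): the per-symbol descriptive
    cost d_i = r_i * (log2(R) - log2(r_i)) + l_i, divided by the covered
    length L_i = l_i * r_i. R is the total number of symbols in the
    candidate partition."""
    d_i = r_i * (log2(R) - log2(r_i)) + l_i
    return d_i / (l_i * r_i)

# Longer symbols compress better (smaller SCR)...
assert scr(l_i=20, r_i=40, R=1024) < scr(l_i=5, r_i=40, R=1024)
# ...and so do more frequently repeated ones.
assert scr(l_i=20, r_i=40, R=1024) < scr(l_i=20, r_i=10, R=1024)
```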
3. OSCR ALGORITHM

The optimal symbol compression ratio (OSCR) algorithm
forms a partition of string S into symbols that have the best
symbol compression ratio (SCR) among possible symbols
contained in S. The algorithm is as follows.

(1) Starting with an initial alphabet, form a list of substrings contained in S, possibly with user-defined constraints on minimum frequency and/or maximum
length, and note the frequency of each substring.
(2) Calculate the SCR for all substrings. Select the substring from this set with the smallest SCR and add it
to the model M.
(3) Replace all occurrences of the newly added substring
with a unique character.
(4) Repeat steps 1 through 3 until no suitable substrings
are found.
(5) When a full partition has been constructed, use Huffman coding or another coding strategy to encode the
distribution, p, of symbols.

The following comments apply.

(1) This algorithm progressively adds symbols that do the
most "compression work" among all the candidates
to the code space. Replacement of these symbols leftmost-first will alter the frequency of remaining symbols.
(2) A less exhaustive search for the optimal SCR candidate
is possible by concentrating on the tree branches that
dominate the string or searching only certain phrase
sizes.
(3) The initial alphabet of terminals is user supplied.
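The five steps above can be sketched in a few dozen lines of Python. This is a minimal illustration of the loop, not the authors' implementation: it uses non-overlapping counts via str.count, takes R to be the length of the current string, stops when no candidate reaches SCR < 1, and omits the final entropy-coding step.

```python
from math import log2

def scr(l, r, R):
    # Symbol compression ratio, equation (8).
    return (r * (log2(R) - log2(r)) + l) / (l * r)

def oscr(s, max_len=12, min_freq=2):
    """Minimal OSCR sketch: repeatedly add the smallest-SCR substring to the
    model and replace its occurrences with a fresh nonterminal symbol."""
    model = {}
    next_id = 0
    while True:
        # Step 1: substring frequencies (non-overlapping counts).
        stats = {}
        for i in range(len(s)):
            for j in range(i + 2, min(i + 1 + max_len, len(s) + 1)):
                sub = s[i:j]
                if sub not in stats:
                    stats[sub] = s.count(sub)
        R = len(s)  # every character of the current string is one symbol
        # Step 2: smallest-SCR candidate, kept only if it helps (SCR < 1).
        cands = [(scr(len(p), r, R), p) for p, r in stats.items() if r >= min_freq]
        if not cands:
            break
        best_scr, best = min(cands)
        if best_scr >= 1:
            break
        # Step 3: replace occurrences with a unique character (private-use
        # codepoints serve as nonterminal symbols).
        sym = chr(0xE000 + next_id)
        next_id += 1
        model[sym] = best
        s = s.replace(best, sym)
        # Step 4: loop; step 5 (entropy coding of the result) is omitted.
    return model, s

model, compressed = oscr("a rose is a rose is a rose")
assert len(compressed) < len("a rose is a rose is a rose")
assert any("rose" in phrase for phrase in model.values())
```

On the "a rose" example this sketch first selects "a rose", mirroring the behavior described in the worked example below.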
3.1. Example

Consider the phrase "a rose is a rose is a rose" with ASCII
characters as the initial alphabet. The initial tree statistics and
ρ calculations provide the metrics shown in Figure 5. The
numbers across the top indicate the frequency of each symbol, while the numbers along the left indicate the frequency
of phrases.

Here we see that the initial string consists of seven terminals {a, ' ', r, o, s, e, i}. Expanding the tree with substrings
beginning with the terminal a shows that there are 3 occurrences of the substrings

"a", "a ", "a r", "a ro", "a ros", "a rose",   (9)

but only 2 occurrences of longer substrings, for each of which
ρ values consequently increase, leaving the phrase {a rose}
the candidate with the smallest ρ. Here we see the unique
nature of the heuristic, which does not necessarily choose

[Figure 6 appears here:]

Grammar:
S1 → a rose
S2 → is S1
S → S1 S2 S2

Model (set):
S1 → a rose, f(S1) = 1
S2 → is S1, f(S2) = 2

Equally likely "musings" (the typical set):
S1 S2 S2 = a rose is a rose is a rose
S2 S1 S2 = is a rose a rose is a rose
S2 S2 S1 = is a rose is a rose a rose

Figure 6: OSCR grammar example model summary.

the most frequently repeating symbol, or the longest match,
but rather a combination of length and redundancy. A second iteration of the algorithm produces the model described
in Figure 6. Our grammar rules enable the construction of a
typical set of strings where each phrase has the frequency shown in
the model block of Figure 6. One can think of MDL principles applied in this way as analogous to the problem of
finding an optimal compression code for a given dataset x
with the added constraint that the descriptive cost of the
codebook must also be considered. Thus, the cost of sending priors (a codebook or other modeling information)
is considered in the total descriptive cost in addition to
the descriptive cost of the final compressed data given the
model.

The challenge of incorporating MDL in sequence analysis lies in the quantification of appropriate model costs and
tractable computation of model inference. Hence, OSCR has
been improved and optimized through additional heuristics
and a streamlined architecture, and renamed MDLcompress,
which will be described in detail in later sections. MDLcompress forms an estimate of the string's algorithmic minimum
sufficient statistic by adding bits to the model until no additional compression can be realized. MDLcompress retains
the deep recursion of the original algorithm but improves
speed and memory use through novel data structures that
allow gathering of phrase statistics in a single pass and subsequent selection of multiple codebook phrases with minimal
computation.

MDLcompress and OSCR are not alone in the grammar
inference domain. GREEDY, developed by Apostolico and
Lonardi [16], is similar to MDLcompress and OSCR, but differs in three major areas.

(1) MDLcompress is deeply recursive in that the algorithm
does not remove phrases from consideration for compression after they have been added to the model. The
loss of compressibility inherent in adding a phrase
to the model was one of the motivations for developing
the SCR heuristic: preventing an overly greedy absorption of phrases from precluding optimal total compression. With MDLcompress, since we look in the model
as well for phrases to compress, we find that generally
the total compression heuristic at each phase gives the
best performance, as will be discussed later.
(2) MDLcompress was designed with the express intent of
estimating the algorithmic minimum sufficient statistic, and thus has more stringent separation of model
and data costs and more specific model cost calculations, resulting in greater specificity.
(3) As described in [21] and discussed in later
sections, the computational architecture of MDLcompress differs from the suffix tree with counts architecture of GREEDY. Specifically, MDLcompress gathers
statistics in a single pass and then updates the data
structure and statistics after selecting each phrase, as
opposed to GREEDY's practice of reforming the suffix
tree with counts data structure at each iteration.

Another comparable grammar-based code is Sequitur, a
linear-time grammar inference algorithm [17, 18]. In this paper, we show MDLcompress to exceed Sequitur's ability to
compress. However, it does not match Sequitur's linear run
time performance.
4. MIRNA TARGET DETECTION USING OSCR

In [12], we described our initial application of the OSCR algorithm to the identification of miRNA target sites. We selected a family of genes from Drosophila (fruit fly) that contain in their 3′ UTRs conserved sequence structures previously described by Lai [24]. These authors observed that
a highly conserved 8-nucleotide sequence motif, known as
a K-box (sense = 5′-cUGUGAUa-3′; antisense = 5′-uAUCACAg-3′) and located in the 3′ UTRs of the Brd and bHLH gene
families, exhibited strong complementarity to several fly
miRNAs, among them miR-11. These motifs exhibited a role
in posttranscriptional regulation that was at the time unexplained.
The OSCR algorithm constructed a phrasebook consisting of nine motifs, listed in Figure 7 (top), to optimally partition the adjacent set of sequences, in which the motifs
are color coded. The OSCR algorithm correctly identified
the most redundant antisense sequence (AUCACA) from the
several examples it was presented.
The input data for this analysis consists of 19 sequences,
each 18 nucleotides in length (Figure 7). From these sequences, OSCR generated a model consisting of grammar
variables S1 through S4 that map to individual nucleotides
(grammar terminals), the variable S5 that maps to the nucleotide sequence AUCACA, and four shorter motifs S6–S9.
The phrase S5 turns out to be a putative target of several different miRNAs, including miR-2a, miR-2b, miR-6, miR-13a,
miR-13b, and miR-11. OSCR identified as S9 a 2-nucleotide
sequence (5′-GU-3′) that is located immediately downstream
of the K-box motif. The new consensus sequence would read
5′-AUCACAGU-3′ and has a greater degree of homology
to miR-6 and miR-11 than to other D. melanogaster miRNAs. In vivo studies performed subsequent to the original
Lai paper demonstrated the specificity of miR-11 activity
on the Bob-A,B,C, E(spl)ma, E(spl)m4, and E(spl)md genes
[25].

In a separate analysis, we applied OSCR to the sequence
of an individual fruit fly gene transcript, BobA (accession
NM 080348; Figure 7, bottom). Only the BobA transcript

[Figure 7 appears here.]

(Top) OSCR analysis of the Brd family and bHLH repressor. Motif AUCACA is the first phrase added; GUU is the second; CU, AU, and GU are also called out. Grammar variables: S1 = G, S2 = U, S3 = C, S4 = A, S5 = AUCACA, S6 = GUU, S7 = CU, S8 = AU, S9 = GU. The 19 input sequences:

GGUCACAUCACAGAUACU
CUCGUCAUCACAGUUGGA
CGAUUAAUCACAAUGAGU
UCCUCGAUCACAGUUGGA
GGUGCUAUCACAAUGUUU
UGUUUUAUCACAAUAUCU
AUUAGUAUCACAUCAACA
AAAUGUAUCACAAUUUUU
GUUGAUAUCACAAAUGUA
AAGACUAUCACACUUGGU
UACAAAAUCACAGCUGAA
AGGAACAUCACAUCAUAU
AGAACUAUCACAGGAACA
UUAGUUAUCACAUGAACU
AGUUAUAUCACAGUUGAA
CAGGCCAUCACACGGGAG
UGCCCUAUCACAGACUUA
UGGGCUAUCACAGAUGCG
GUUGCCAUCACAGUUGGG

(Bottom) BobA gene from Drosophila melanogaster with K-box and GY-box motifs highlighted. The BobA gene is potentially regulated by miR-11 (K-box specificity) and miR-7 (GY-box specificity). For clarity of exposition, stop and start codons are underlined in red. Transcript sequence:

1   aacaguucuccauccgagcagaucauaaguaaccaaccugcaaaauguucaccgaaaccg
61  cucuuguuuccaacuucaauggagugacagagaagaaaucucuuaccggcgccuccacca
121 accugaagaagcugcugaagaccaucaagaaggucuucaagaacuccaagccuucgaagg
181 agauuccgauccccaacaucaucuacucuugcaauacugaggaggagcaccagaauuggc
241 ucaacgaacaacuggaggccauggcaauccaucuucacugaguucuucugggacaucccc
301 cuccaucgaguaucugugaugugacccgaucaaaaggucuauaaaucggcacuccggcuu
361 uaauauccaacugugaugacgagaacacaagacugacugacuugugugccuuggagguga
421 caaaguucgucgccucugccaacuguacauaucaaacuagcugcuaaaaugucuucaauu
481 augcuuuaauguagucuaaguuaguauuaucauugucuuccauuaguuuaagaaaaucau
541 ugucuuccauguuuguuuguuaggguaaaaaaaacuagcuuaagaauaaaaaucccucgc
601 ggaaagaaaacaau

Figure 7: Motif analysis of 19 sequences each of which is believed to contain a single target site for miR-11 from fruit fly. (Top) OSCR adds
the variable S5, the K-box motif, to its MDL codebook; this motif has been shown to be a miRNA target site for miR-11. (Bottom) Full sequence
of the BobA gene transcript with K-box and GY-box motifs underlined in blue text. The K-box motif (CUGUGAUG) is a target site for miR-11
and the GY-box motif (UGUCUUCCAU) is a target site for miR-7.

itself entered this second analysis, which was performed
independently of the multisequence analysis described in the
paragraph above. The sense sequence of BobA is displayed
in Figure 7 with the 5′ UTR indicated in green; the 237 nucleotides (79 codons) of the coding sequence in red; and
the 3′ UTR in blue. OSCR identified the underlined motifs
(cugugaug) and (ugucuuccau). These two motifs turn out
not only to be conserved among multiple Drosophila subspecies, but also to be targets of two distinct miRNAs: the K-box motif (cugugaug) is a target of miR-11 and the GY-box
(ugucuuccau) a target of miR-7. Although we did not perform OSCR analysis on any additional genes, this motif had
been identified previously in several 3′ UTRs, including those
of BobA, E(spl)m3, E(spl)m4, E(spl)m5, and Tom [23, 24].
The BobA gene is particularly sensitive to miR-7. Mutants
of the BobA gene with base-pair-disrupting substitutions at
both sites of interaction with miR-7 yielded nearly complete
loss of miR-7 activity [25] both in vivo and in vitro. These
observations are consistent with studies from [26, 27] that
reveal specific sequence-matching requirements for effective
miRNA activity in vitro.
In summary, the OSCR algorithm identified (i) a
previously known 8-nucleotide sequence motif in 19 different sequences and (ii), in an entirely independent analysis,
2 sequence motifs, the K-box and GY-box, within
the BobA gene transcript. We now describe innovative refinements to our MDL-based DNA compression algorithm
with the goal of improved identification and analysis of biologically meaningful sequence, particularly miRNA targets
related to breast cancer.

5. MDLcompress

The new MDLcompress algorithmic tool retains the fundamental element of OSCR, deeply recursive heuristic-based grammar inference, while trading computational complexity for space complexity to decrease execution time. The
compression, and hence the ability of the algorithm to identify specific motifs (which we hypothesize to be of potential
biological significance), has been enhanced by new heuristics and an architecture that searches not only the sequence
but also the model for candidate phrases. The performance
has been improved by gathering statistics about potential
code words in a single pass and forming and maintaining
simple matrix structures to simplify heuristic calculations.
Additional gains in compression are achieved by tuning the
algorithm to take advantage of sequence-specific features
such as palindromes, regions of local similarity, and SNPs.
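A single-pass statistics structure of this kind can be illustrated as follows (our guess at the spirit of the bookkeeping, not the authors' exact layout): one scan of the sequence records every occurrence position of every candidate phrase, so frequencies and locations are available afterwards without re-reading the sequence. The example uses the Figure 1 sequence.

```python
from collections import defaultdict

def gather_phrase_stats(seq: str, max_len: int = 8, min_freq: int = 2):
    """One pass over seq: map each candidate phrase (length 2..max_len) to
    the list of positions where it occurs, then keep only phrases frequent
    enough to be compression candidates."""
    locations = defaultdict(list)
    for i in range(len(seq)):
        for l in range(2, max_len + 1):
            if i + l <= len(seq):
                locations[seq[i:i + l]].append(i)
    return {p: locs for p, locs in locations.items() if len(locs) >= min_freq}

stats = gather_phrase_stats("GAAGTGCAGTGAAGTGCAGTGTCAGTGCT", max_len=6)
# The motif AGTG from Figure 1 occurs five times (0-based positions):
assert stats["AGTG"] == [2, 7, 12, 17, 23]
# The longer phrase GAAGTG repeats only twice:
assert len(stats["GAAGTG"]) == 2
```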
5.1. Improved SCR heuristic

MDLcompress uses steepest-descent stochastic-gradient
methods to infer grammar-based models based upon phrases
that maximize compression. It estimates an algorithmic minimum sufficient statistic via a highly recursive algorithm
that identifies those motifs enabling maximal compression.
A critical innovation in the OSCR algorithm was the use of
a heuristic, the symbol compression ratio (SCR), to select
phrases. A measure of the compression ratio for a particular
symbol is simply its descriptive length divided
by the number of symbols (grammar variables and terminals)
encoded by this symbol in the phrasebook. We previously defined the SCR for a candidate phrase i as
ρ_i = d_i / L_i = [r_i (log2(R) − log2(r_i)) + l_i] / (l_i r_i),   (10)

for a phrase of length l_i, repeated r_i times in a string of total
length L, with R denoting the total number of symbols in the
candidate partition. The numerator in the equation above
consists of the MDL descriptive cost of the phrase if added to
the model and encoded, while the denominator consists of an
estimate of the unencoded descriptive cost of the candidate
phrase. This heuristic encapsulates the net gain in compression per symbol that a candidate phrase would contribute if
it were to be added to the model.
While (10) represents a general heuristic for determining the partition of a sequence that provides the best compression, important effects are not taken into account by
this measure. For example, adding new symbols to a partition increases the coding costs of other symbols by a small
amount. Furthermore, for any given length and frequency,
certain symbols ought to be preferred over others because
of probability distribution effects. Thus, we desire an SCR
heuristic that more accurately estimates the potential symbol
compression of any candidate phrase.
To this end, we can separate the costs accounted for in
(10) into three parameters: (i) entropy costs (costs to represent the new phrase in the encoded string); (ii) model costs
(costs to add the new phrase to the model); and (iii) previous costs (costs to represent the substring in the string previously). The SCR of [10, 11, 28] breaks these costs down as
follows:

C_h = R_i log2(R / R_i),   (11)

C_m = l_i,   C_p = l_i R_i,   (12)

where R is the length of the string after substitution, l_i is the
length of the code phrase, L is the length of the model, and
R_i is the frequency of the code phrase in the string. An improved version of this heuristic, SCR 2006, provides a more
accurate description of the compression work by eliminating
some of the simplifying assumptions made earlier. Entropy
costs (11) remain unchanged. However, increased accuracy
can be achieved with more specific costs for the model and previous costs. For previous costs we consider the sum of the
costs of the substrings that comprise the candidate phrase,
C_p = R_i Σ_{j=1}^{l_i} log2(R′ / r_j),   (13)

where R′ is the total number of symbols without the formation of the candidate phrase and r_j is the frequency of the
jth symbol in the candidate phrase. Model costs require a
method for not only spelling out the candidate phrase but

[Figure 8 appears here: a surface plot of SCR versus phrase length and number of repeats.]

Figure 8: Symbol compression ratio (vertical axis) as a function of
phrase length and number of occurrences (horizontal axes) for the
first phrase encountered of a given length and frequency. The variation indicates that our improved heuristic is providing benefit by considering the descriptive cost of specific phrases based on the grammar variables
and terminals contained in the phrase, not just length and number
of occurrences.

also the cost of encoding the length of the phrase to be described. We estimate this cost as

C_m = M(l_i) + Σ_{j=1}^{l_i} log2(R′ / r_j),   (14)

where M(l_i) is the shortest prefix encoding for the length
of the phrase. In this way we achieve both a practical method for
spelling out the model for implementation and an online
method for determining model costs that relies only on
known information. Since new symbols will add to the cost
of other symbols simply by increasing the number of symbols
in the alphabet, we specify an additional cost that reflects the
change in costs of substrings that are not covered by the candidate phrase. The effect is estimated by

C_o = (R − R_i) log2((L + 2) / (L + 1)).   (15)

This provides a new, more accurate heuristic as follows:

SCR 2006 = (C_m + C_h + C_o) / C_p.   (16)

Figure 8 shows a plot of SCR 2006 versus length and number
of repeats for a specific sequence, where the first phrase of a
given length and number of repeats is selected. Notice that
the lowest-SCR phrase is primarily a function of number of
repeats and length, but also includes some variation due to
other effects. Thus, we have improved the SCR heuristic to
yield a better choice of phrase to add at each iteration.
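The components of SCR 2006 fit together as in the sketch below. The prefix code M(l_i) for the phrase length is not pinned down in the text, so an Elias-gamma-style length is assumed here, and the example numbers are purely illustrative:

```python
from math import log2

def elias_gamma_len(n: int) -> int:
    """Bit length of an Elias-gamma code for n >= 1 (an assumed choice
    for the prefix length code M(l_i))."""
    return 2 * (n.bit_length() - 1) + 1

def scr_2006(l_i, R_i, R, R_prime, sub_freqs, L):
    """SCR 2006 per equations (11) and (13)-(16).
    sub_freqs: frequencies r_j of the l_i symbols making up the candidate;
    R_prime: total symbol count without forming the candidate phrase;
    L: model length."""
    Ch = R_i * log2(R / R_i)                          # entropy cost, (11)
    spell = sum(log2(R_prime / r_j) for r_j in sub_freqs)
    Cm = elias_gamma_len(l_i) + spell                 # model cost, (14)
    Cp = R_i * spell                                  # previous cost, (13)
    Co = (R - R_i) * log2((L + 2) / (L + 1))          # cost to others, (15)
    return (Cm + Ch + Co) / Cp                        # (16)

# Illustrative numbers: a 4-symbol phrase repeated 5 times.
ratio = scr_2006(l_i=4, R_i=5, R=14, R_prime=29,
                 sub_freqs=[6, 8, 7, 8], L=10)
assert 0 < ratio < 1  # the candidate is predicted to compress
```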
5.2. Additional heuristics

In addition to SCR, two alternative heuristics are evaluated to


determine the best phrase for MDL learning: longest match

[Figure 9 appears here: the input sequence, a plot of total compression (TC, in bits) versus iteration, and the grammars inferred under the TC and LM heuristics.]

Input sequence:
Pease porridge hot,
pease porridge cold,
pease porridge in the pot,
nine days old.
Some like it hot,
some like it cold,
some like it in the pot,
nine days old.

Total compression model inference:
S1 → pease porridge (= peasS5 porridgS5)
S2 → <CR>some like it (= S6 somS5 likS5 it)
S3 → in the pot,<CR>nine days old. (= in thS5 pS7 S6 ninS5 days old.)
S4 → cold,
S5 → e
S6 → <CR>
S7 → ot,
S → S1 hS7 S6 S1 S4 S6 S1 S3 S6 S2 hS7 S2 S4 S2 S3

Longest match model inference:
S1 → in the pot,<CR>nine days old.
S2 → ,<CR>pease porridge
S3 → <CR>some like it
S → pease porridge hot, S2 cold, S2 S1 S3 hot, S3 cold, S2 S1

Figure 9: MDLcompress model-inferred grammar for the input sequence "pease porridge" using the total compression (TC) and longest
match (LM) heuristics. Both the SCR and TC heuristics achieve the same total compression and both exceed the performance of LM.
Subsequent iterations enable MDLcompress to identify phrases, yielding further compression of the TC grammar model.

(LM) and total compression (TC). Both of these heuristics leverage the gains described above by considering the entropy of specific variables and terminals when selecting candidate phrases. In LM, the longest phrase is selected for substitution, even if it is repeated only once. This heuristic can be useful when it is anticipated that the importance of a codeword is proportional to its length. MDLcompress can apply LM to greater advantage than other compression techniques because of its deep recursion: when a long phrase is added to the codebook, its subphrases, rather than being disqualified, remain potential candidates for subsequent phrases. For example, if the longest phrase merely repeats the second longest phrase three times, MDLcompress will nevertheless identify both phrases.
In TC, the phrase that leads to maximum compression at the current iteration is chosen. This greedy process does not necessarily increase the SCR, and may lead to the elimination of smaller phrases from the codebook. MDLcompress, as explained above, helps temper this misbehavior by including the model in the search space of future iterations. Because of this deep recursion, in which phrases in both the model and data portions of the sequence are considered as candidate codewords at each iteration, MDLcompress yields improved performance over the GREEDY algorithm [16]. As with all MDL criteria, the best heuristic for a given sequence is the approach that best compresses the data. The TC gain is the improvement in compression achieved by selecting a candidate phrase and can be derived from the SCR heuristic by removing the normalization factor. Examples of MDLcompress operating under different heuristics or combinations of heuristics are shown in Figures 9 and 10. Under our improved architecture, the best compression seems usually to be achieved in TC mode, which we attribute to the fact that we search the model as well as the remaining sequence for candidate phrases, reducing the need for and benefit from the SCR heuristic. By comparison, SEQUITUR [17] forms a grammar of 13 rules consisting of 74 symbols. Thus, using MDLcompress TC we achieve better compression with a grammar model of approximately half the size.

[Figure 10: The compression characteristic of MDLcompress using the hybrid heuristics: longest match, followed by total compression after the longest match heuristic ceases to provide compression. Model cost, description cost, and total cost are shown.]
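The difference between the two heuristics can be illustrated with a toy selector. This sketch is ours, not the paper's code; in particular, the TC gain estimate (symbols saved minus a rough spell-out overhead) is a simplifying assumption standing in for the full cost accounting of (14)-(16).

```python
def select_phrase(candidates, heuristic):
    """Pick the next phrase from a dict {phrase: frequency}.

    "LM": longest candidate, even if repeated only once more.
    "TC": largest estimated compression gain at this iteration
          (assumed proxy: symbols removed minus phrase spell-out cost).
    """
    if heuristic == "LM":
        return max(candidates, key=len)
    if heuristic == "TC":
        def gain(p):
            # (freq - 1) * len(p) symbols saved by substitution, minus
            # len(p) + 1 symbols to spell the phrase and its nonterminal.
            return (candidates[p] - 1) * len(p) - (len(p) + 1)
        return max(candidates, key=gain)
    raise ValueError(heuristic)
```

With candidates {"abcd": 2, "ab": 5}, LM picks "abcd" while TC picks "ab", whose five repeats save more symbols overall, illustrating how the two heuristics can diverge.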



[Figure 11: The data structures used in MDLcompress allow constant-time selection and replacement of candidate phrases. The top of the figure shows the initial index matrix and phrase array for the input "a rose is a rose is a rose." (for example, "a rose" has length 6, frequency 3, and start indices 1, 11, 21). After adding "a rose" to the model, MDLcompress can generate the new index box and phrase array, shown in the bottom half, in constant time.]

5.3. Data structures


A second improvement of MDLcompress over OSCR is a reduction in execution time that allows analysis of much longer input strings, such as DNA sequences. This is achieved by trading off memory usage against runtime: matrix data structures store enough information about each candidate phrase to calculate the heuristic and to update the data structures of all remaining candidate phrases. This allows us to maintain the fundamental advantage of OSCR and of algorithms such as GREEDY [16], namely that compression is performed based upon the global structure of the sequence, rather than by the phrases that happen to be processed first, as in schemes such as Sequitur, DNA Sequitur, and Lempel-Ziv. We also maintain an advantage over the GREEDY algorithm by including phrases added to our MDL model, and the model space itself, in our recursive search space.
During the initial pass of the input, MDLcompress generates an l_max by L matrix, where entry M_{i,j} represents the substring of length i beginning at index j. This is a sparse matrix with entries only at locations that represent candidates for the model. Thus, substrings with no repeats and substrings that only ever appear as part of a longer substring are represented with a 0. Matrix locations with positive entries hold the index into an array with many more details for that specific substring. In the example in Figure 11, "a rose" appears three times in the input. Each location of the matrix corresponding to this substring holds a 1, and the first element in the phrase array has the length, frequency, and starting index for all occurrences of the substring. A similar element exists for "a rose is" but not for "rose", since the latter only appears as a substring of the first candidate.

During the phrase selection part of each iteration, MDLcompress only has to search through the phrase array, calculating the heuristic for each entry. Once a phrase is selected, the matrix is used to identify overlapping phrases, which will have their frequency reduced by the substitution of a new symbol for the selected substring. While there may be many phrases in the array that are updated, only local sections of the matrix are altered, so overall only a small percentage of the data structure is updated. This technique is what allows MDLcompress to execute efficiently even with long input sequences, such as DNA.
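The first pass described above can be sketched as follows. This is a simplified dictionary-based stand-in for the sparse matrix and phrase array (the actual implementation stores an l_max by L index matrix); the maximality filter, which drops substrings that only occur inside a one-symbol-longer repeat with the same frequency, reproduces the "a rose" versus "rose" behavior of Figure 11.

```python
from collections import defaultdict

def build_phrase_array(s, lmax):
    """Collect repeated substrings of length 2..lmax with their start
    indices, dropping any substring that only ever occurs inside a
    one-symbol-longer repeat of the same frequency (e.g. "rose" inside
    "a rose")."""
    occ = defaultdict(list)
    for length in range(2, min(lmax, len(s)) + 1):
        for i in range(len(s) - length + 1):
            occ[s[i:i + length]].append(i)
    repeated = {p: idx for p, idx in occ.items() if len(idx) >= 2}

    def is_maximal(p):
        # p is redundant if some repeated extension of length len(p)+1
        # contains it and occurs exactly as often.
        return not any(len(q) == len(p) + 1 and p in q
                       and len(idx) == len(repeated[p])
                       for q, idx in repeated.items())

    return {p: idx for p, idx in repeated.items() if is_maximal(p)}
```

For "a rose is a rose is a rose." the result contains "a rose" with start indices [0, 10, 20] (0-indexed here, versus 1-indexed in Figure 11) but no entry for "rose".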
5.4. Performance bounds

The execution of MDLcompress is divided into two parts: the single pass to gather statistics about each phrase, and the subsequent iterations of phrase selection and replacement. Since simple matrix operations are used to perform phrase selection and replacement, the first pass of statistics gathering almost entirely dominates both the memory requirements and the runtime.
For strings with input length L and maximum phrase length l_max, the memory requirements of the first pass are bounded by the product L · l_max, and subsequent passes require less memory as phrases are replaced by (new) individual symbols. Since the user can define a constraint on l_max, memory use can be restricted to as little as O(L), and will never exceed O(L^2). On platforms with limited memory where long phrases are expected to exist, the LM heuristic can be used in a simple preprocessing pass to identify and replace any phrases longer than the system can handle in the standard matrix described above. Because MDLcompress

inspects the model when searching for subsequent phrases, this technique has minimal negative effect on overall compression.
The runtime of the first pass depends directly on L, l_max, the average phrase length l_avg, and the average number of repeats of selected phrases, r_avg. The unclear relationship between l_max, l_avg, r_avg, and L makes deriving guaranteed performance bounds difficult. As a simple upper bound, we can note that the product l_avg · r_avg must be less than L, and the maximum phrase length must be less than L/2, yielding a performance bound of O(L^3). In practice, a memory constraint limits l_max to a constant independent of L, and l_avg · r_avg was approximately constant and much smaller than L. Thus, the practical performance bound was O(L).
The runtime of the second part of the algorithm, the selection and replacement of compressible phrases, is simply the sum of the time to identify the best phrase and the time to update the matrices for the next iteration, multiplied by the number of iterations. An upper bound on these is O(L^2), but again practical performance is much better. In the DNA application reported here, where 144 genes were analyzed, the number of candidate phrases, the average number of affected phrases, and the number of iterations were all independent of input length, and the selection and replacement phase ran in constant time.

Table 1: Compression results (bits per nucleotide).

Genes        DNACompress   Sequitur   DNASequitur   MDLcompress
HUMDYSTROP   1.91          2.34       2.2           1.95
HUMGHCSA     1.03          1.86       1.74          1.49
HUMHBB       1.79          2.20       2.05          1.92
HUMHDABCD    1.80          2.26       2.12          1.92
HUMPRTB      1.82          2.22       2.14          1.92
CHNTXX       1.61          2.24       2.12          1.95

5.5. Enhancements for DNA compression

When a symbol sequence is already known to be DNA, several priors can be incorporated into the model inference algorithm that may lead to improved compression performance. These assumptions relate to types of structure that are typical of naturally occurring DNA sequences. By tuning our algorithm to efficiently code for these mechanisms, we are essentially incorporating these priors into our model inference algorithm by hand. We consider these assumptions to be small and within the big-O constant inherent in translating between universal computers.

6. REVERSE-COMPLEMENT MATCHES

As in DNA Sequitur, the search for and grammar encoding of reverse-complement matches is readily implemented by adding the reverse-complement of a phrase to the MDLcompress model and taking account of the frequency of the phrase and its reverse-complement in motif selection.
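The reverse-complement bookkeeping can be sketched as below. This is our illustration, not the paper's code, and the combined_frequency helper is hypothetical; it only shows how a phrase and its reverse-complement can be pooled when counting occurrences for motif selection.

```python
def reverse_complement(phrase):
    # Watson-Crick pairing, read in reverse; lowercase nucleotides
    # as in Table 2.
    comp = {"a": "t", "t": "a", "c": "g", "g": "c"}
    return "".join(comp[b] for b in reversed(phrase))

def combined_frequency(seq, phrase):
    """Count occurrences of a phrase and of its reverse-complement,
    pooled for motif selection (hypothetical helper)."""
    rc = reverse_complement(phrase)
    count = sum(1 for i in range(len(seq)) if seq.startswith(phrase, i))
    if rc != phrase:
        count += sum(1 for i in range(len(seq)) if seq.startswith(rc, i))
    return count
```

For instance, the LATS2 motif agcacttatt has reverse-complement aataagtgct, and in the sequence "acgtacgt" the phrase "ac" plus its reverse-complement "gt" occur four times in total.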
7. POST PROCESSING

After the MDLcompress model has been created, two possibilities for further compression are the following.
(1) Regions of local similarity: it is sometimes most efficient to define a phrase as a concatenation of multiple shorter and adjacent phrases already in the codebook.
(2) Single nucleotide polymorphisms (SNPs): it is sometimes most efficient to define a phrase as a single-nucleotide alteration of another phrase already in the codebook.
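The second post-processing step can be sketched as a search for codebook entries one substitution away. This helper is hypothetical; it only illustrates the idea of coding a phrase as "codebook entry + position + replacement nucleotide" rather than spelling it out from scratch.

```python
def as_snp_variant(phrase, codebook):
    """If phrase differs from some codebook entry by exactly one
    nucleotide, return (entry, position, new_nucleotide); else None.
    Encoding that triple is typically far cheaper than spelling the
    whole phrase."""
    for base in codebook:
        if len(base) != len(phrase):
            continue
        diffs = [i for i, (a, b) in enumerate(zip(base, phrase)) if a != b]
        if len(diffs) == 1:
            i = diffs[0]
            return base, i, phrase[i]
    return None
```

With caaaatgc in the codebook, the variant caaaatac is encoded as that entry plus a substitution at position 6.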
8. COMPARISON TO OTHER GRAMMAR-BASED CODES

We compare MDLcompress with the state of the art in grammar-based compression: DNA Sequitur [18]. DNA Sequitur improves the Sequitur algorithm by enabling it to harness the advantages of palindromes and by considering other grammar-based encoding techniques, as discussed in [20]. Results are summarized in Table 1.
While compression is ultimately the best measure of an algorithm's capacity to approximate Kolmogorov complexity, an additional feature of grammar-based codes is their two-part encoding, which separates the meaningful model from the data elements, an advantage we will discuss in more detail later. The results above make use of the total compression heuristic and harness the advantage of considering palindromes. Although we exceeded the compression of DNA Sequitur, DNACompress still achieves better compression; however, it does not yield the two-part grammar code that identifies biologically significant phrases, which we will discuss next in the context of breast-cancer-related genes.
9. IDENTIFICATION OF MIRNA TARGETS USING MDLCOMPRESS

As shown in Figure 7, MDL algorithms can be used to identify miRNA target sites. We have also tested MDLcompress for the ability to identify miRNA target sites in known disease-related genes. The general approach is to analyze mRNA transcripts to identify short sequences that are



[Figure 12: MDLcompress and LATS2: sequence elements in the long 3′ UTR. LOCUS NM 014572, definition: Homo sapiens LATS, large tumor suppressor, homolog 2 (Drosophila) (LATS2), mRNA. MDLcompress output sequences from the 3′ UTR, with positions: (1) aaaaaaaaaaaa at 433, 445; (2) agcacttatt at 262, 362; (3) aaacaggac at 155, 172. Validation of MDLcompress performance: MDLcompress identifies the miRNA-372 and -373 target motif (AGCACTTATT) in the LATS2 tumor suppressor gene as its second phrase.]

repeated and localized to the 3′ UTR. Comparative genomics can be applied to increase our confidence that MDL phrases in fact represent candidate miRNA target sites, even if there are no known cognate miRNAs that will bind to that site.
As a test, we sought to determine whether MDLcompress would have identified the miRNA binding site in the 3′ UTR of the tumor suppressor gene LATS2. A recent study, which used a function-based approach to miRNA target site identification, determined that LATS2 is regulated by miRNAs 372 and 373 [29]. Increased expression of these miRNAs led to downregulation of LATS2 and to tumorigenesis. The miRNA 372 and 373 target sequence (AGCACTTATT) is located in the 3′ UTR of LATS2 mRNA and is repeated twice, but it was not identified with computation-based miRNA target identification techniques. Using the 3′ UTR of LATS2 mRNA as an input, three code words were added to the MDLcompress model using longest match mode, as shown in Figure 12: the poly(A) tail, the miRNA 372 and 373 target sequence (AGCACTTATT), and a third phrase (AAACAGGAC) that we do not identify with any particular biological function at this time. This shows that analyzing genes of interest a priori with MDLcompress can produce highly relevant sequence motifs.
Since miRNAs regulate genes important for tumorigenesis and MDLcompress is able to identify these targets, it follows that MDLcompress could be used to directly identify genes that are important for tumorigenesis. To test this, we used a target-rich set of 144 genes known to have increased expression patterns in ErbB2-positive breast cancer [30, 31] and compressed each gene's mRNA sequence with MDLcompress running in longest match mode. A total of 93 phrases were added to MDLcompress codebooks, resulting in compression of these genes. Of these phrases, 25 were found exclusively in the 3′ UTRs of these genes. Since miRNAs interact more frequently with the 3′ UTRs of mRNAs [32], we focused our analysis on these phrases, shown in Table 2.
The 25 3′ UTR phrases were run through BLAST [33] searches of a database of 3′ UTRs [34, 35] to determine their level of conservation in human and other genomes. The phrases were also run against the miRBase database [36] using SSEARCH [37] to detect possible sequence similarities to known miRNAs. Finally, genes containing these phrases were targeted with shRNA constructs in an ErbB2-positive breast cancer cell line (BT474), as well as in normal mammary epithelial cells (HMEC), in order to identify their potential role in breast tumorigenicity. One MDLcompress phrase, AGAUCAAGAUC, found in the 3′ UTR of the splicing factor arginine/serine-rich 7 (SFRS7) gene, (a) was highly conserved, (b) resulted in miRBase matches to a small number of miRNAs that fulfill the minimum requirements of putative miRNA targets [32] (Figures 13(a) and 13(b)), and (c) yielded in vitro data implicating this gene in breast cancer progression. More specifically, downregulation of SFRS7 by shRNAs in BT474 cells yielded a significant decrease in the proliferation marker alamarBlue (Biosource), but not in normal mammary epithelial cells (HMEC) (Figure 13(b)). In this experiment, cells were transiently transfected with miRNA-based-structure shRNA constructs [38] targeting the coding sequence of SFRS7, using a lipid-based reagent (FuGENE 6, Roche). A plasmid construct expressing green fluorescent protein (MSCV-GFP) was cotransfected into the cells to normalize transfection efficiency [3]. shRNAs against the firefly luciferase gene were used as a negative control. Although regulation by the specific miRNAs identified in our bioinformatics analysis still requires validation, these results suggest possible differential regulation of this gene in breast cancer by a miRNA and that this gene is significant in cell proliferation, underscoring the potential for OSCR to identify sequences of biological interest.
10. ANALYSIS OF SINGLE NUCLEOTIDE POLYMORPHISMS

By definition, mutation of an essential nucleotide within a given miRNA's target sequence within an mRNA is expected to have a strong effect on the activity of the given miRNA on the target. If a nucleotide that is required for interaction of a miRNA with the mRNA is altered, the miRNA may cease to regulate that target, thereby enhancing expression of the mRNA and the protein it encodes. Alternatively, a

Table 2: 3′ UTR MDLcompress phrases from the 144 ErbB2-positive-related gene mRNA sequences.

Accession number   Number of repeats   Length   Phrase          Locations
NM 000442          2                   13       tttctcttttcct   2835, 3091
NM 004265          2                   10       tcagggaggg      2274, 2667
NM 004265          2                   10       ccccccagct      2954, 3021
NM 004265          2                   10       gcagaggcag      2255, 3051
NM 005324          2                   12       ttttatttataa    1292, 1802
NM 005324          2                   10       cagtttcctt      997, 1991
NM 005324          2                   9        tttataata       627, 1055
NM 005930          2                   11       tatttcaattt     2903, 2932
NM 005930          2                   11       tatttttgctc     2733, 3809
NM 005930          2                   10       gacaaatgtg      3064, 3250
NM 005930          2                   10       cttttttttc      3425, 3689
NM 005930          2                   10       ttggaacact      3750, 3787
NM 006148          2                   13       gtgtgtgagtgtg   1951, 3654
NM 006148          2                   12       ccccagtctcca    647, 1651
NM 006148          2                   11       acttcttggtt     1067, 1290
NM 006148          2                   11       cctcctgccca     1186, 1503
NM 006148          2                   11       ccccatctctg     2147, 2302
NM 006148          2                   11       ggaagcacagc     1545, 2447
NM 006148          2                   11       tgtgggtgggg     2014, 2776
NM 006148          2                   11       cctttctggcc     2812, 3759
NM 006148          2                   10       ctccctcctc      1035, 1408
NM 006148          2                   10       cagctaccgg      525, 1591
NM 006148          2                   10       tcccctcccc      1464, 1828
NM 006148          2                   10       gtggaggaag      2159, 2267
NM 006276          2                   11       agatcaagatc     1010, 1091

[Figure 13: A miRNA target site relevant to breast cancer is identified by OSCR. (a) Proposed interaction between miRNAs (human, rat, frog) and the OSCR phrase: OSCR sequence AGAUCAAGAUC aligned against hsa-miR-218, rno-miR-218, and xtr-miR-218 (each UGUACCAAUCUAGUUCGUGUU). (b) Downregulation of SFRS7 by RNAi (SFRS7 shRNA versus luciferase shRNA control) specifically inhibits the proliferation of the breast cancer cell line BT474 and not normal cells (HMEC). These miRNAs may be implicated in breast cancer.]

single-nucleotide change to a target of one miRNA may yield a target sequence for a distinct miRNA. A report published in 2006 demonstrated this SNP effect in a mammal. The study found that Texel sheep, which are known for their meatiness, possess a mutation in the 3′ UTR of the myostatin gene that results in an illegitimate interaction of miRNAs 1 and 206 with the myostatin mRNA [39]. Mutations that yield such interactions between mutant mRNA and miRNAs are called "Texel-like." The authors performed a preliminary analysis of known human SNPs and their potential for perturbing


[Figure 14: MDLcompress directly identifies putative miRNA target sequences that may be implicated in breast cancer. (a) Schematic of the overlap between the SNP500 database (500 genes) and potential miRNA sequences identified by MDLcompress in the BT474 overexpression test set (144 genes); 13 genes fall in the overlap. (b) Potential miRNA sites identified by MDLcompress with disease-related polymorphisms identified by SNP analysis:

Name    Accession   MDL sequence   Position           SNP
ESR1    NM 000125   GATATGTTTA     4023, 5325         4029 T→C
PTGS2   NM 000963   CAAAATGC       2179, 2717, 3097   3103 G→A
EGFR    NM 005228   TTTTACTTC      4233, 4967         4975 C→T

These miRNA targets may be implicated in breast cancer.]

binding sites of predicted miRNAs, and identified 2490 Texel-like mutations and 483 mutations that potentially result in loss of miRNA binding.
We performed a similar analysis on the 144 overexpressed gene mRNA sequences from the BT474 breast cancer cell line [30, 31] to identify which of these genes possess disease-related Texel-like mutations. By cross-referencing with the SNP500 database [40], SNPs were found in 13 of the 144 overexpressed gene mRNA sequences from the BT474 breast cancer cell line, all in the 3′ UTR region. The initial comparison of the 93 MDLcompress code words from the 144 genes discussed previously did not match any SNP phrases. We then relaxed the strict constraint that a phrase must lead to compression at every step and asked MDLcompress in longest match mode to identify the top 10 candidates in each gene mRNA sequence that would most likely lead to compression. Strikingly, 3 of these genes (ESR-1, PTGS2, and EGFR) have SNPs in the set of the first 10 code word candidates identified by MDLcompress when run on each of these genes' respective mRNA sequences (Figure 14). These three sequences were selected out of the 13 because they fulfill the criteria we used for Figure 13(a): based on sequence analysis (similarity to miRNA sequences and intra- and interspecies sequence conservation), they are putative miRNA targets.
These motifs are localized to the 3′ UTR and have not been predicted to interact with any known miRNAs in the literature. Although further validation studies are required, these observations suggest that MDLcompress may be capable of directly identifying potential miRNA target sequences with roles in breast cancer.
Our hypothesis regarding the significance of MDL phrases that are added to the MDLcompress model motivates a search of these phrases for SNPs related to cancer. As shown in Figure 15, an SNP identified in the PTGS2 gene [40] colocalizes with the MDLcompress-identified phrase caaaatgc in the 3′ UTR of PTGS2 and yields a disproportionate change in the descriptive cost of the sequence under the MDLcompress model generated for the original sequence. Altering a

[Figure 15: MDLcompress cost per nucleotide for PTGS2 with the SNP, over the region around positions 2700-2750 (sequence taaaacttccttttaaatcaaaatgccaaatttattaaggtggtggagcc). The blue curve identifies the cost per nucleotide of the original sequence, based upon an MDLcompress model developed using the total compression heuristic and the first 15 phrases to be selected. The cost per nucleotide under the SNP g→a is shown in red.]

single nucleotide typically yields a very small change in descriptive cost, in most cases less than a bit; however, the SNP
in the phrase shown in Figure 15 yields a change in descriptive cost on the order of 4 bits, suggesting that this phrase
is in fact meaningful. Future work will elaborate on this potential relationship between meaningful phrases identified by
MDLcompress and disease, and explore the capability of using MDLcompress models to predict sites where SNPs are especially likely to cause pathology.
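The effect just described can be reproduced with a toy parser. This is our sketch, not the paper's coder: codebook phrases carry an assumed per-use cost, uncovered nucleotides an assumed 2-bit literal cost, and a SNP that breaks a codebook phrase then forces several literals, producing a multi-bit jump in descriptive cost of the kind shown in Figure 15.

```python
def description_cost(seq, codebook):
    """Greedy longest-match parse. codebook maps phrase -> bits per use;
    any uncovered nucleotide costs 2 bits (log2 of a 4-letter alphabet).
    Both costs are simplifying assumptions."""
    bits, i = 0.0, 0
    phrases = sorted(codebook, key=len, reverse=True)
    while i < len(seq):
        for p in phrases:
            if seq.startswith(p, i):
                bits += codebook[p]
                i += len(p)
                break
        else:
            bits += 2.0
            i += 1
    return bits
```

With the codebook {"caaaatgc": 3.0}, the fragment "ttcaaaatgcaa" costs 11 bits, while its SNP variant "ttcaaaatacaa" costs 24 bits, since the mutated phrase no longer matches the codebook entry.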
11. CONCLUSIONS

MDLcompress yields compression of DNA sequences that is superior to any other existing grammar-based coding algorithm. It enables automatic detection of model granularity, leading to the identification of interesting variable-length motifs. These motifs include miRNA target sequences that may play a role in the development of disease, including breast cancer, introducing a novel method of identifying microRNA targets without specifying the sequence (or, in particular, the seed) of the microRNA that is supposed to bind them. Additionally, we have used our algorithm here to study SNPs found in overexpressed genes in the breast cancer cell line BT474, and we identified 3 SNPs that may alter the ability of microRNAs to target their sequence neighborhood.
In future work, MDL specificity will be improved through windowing and segmentation, concepts described in Figure 4. Running MDLcompress on consecutive windows of sequence will enable the detection of change points, such as the transition from noncoding to coding sequence, and permit the use of multiple codebooks, enhancing specificity for each region of a gene. For example, the optimal MDL codebook for a coding region is unlikely to be the same as that for a 3′ UTR, and applying the same model over an entire gene reduces the effectiveness of the MDL compression algorithm in identifying biologically significant motifs. This improvement of MDLcompress to detect and take advantage of change points will enable the detection of nonadjacent regions of the genome that are similar. The execution time of MDLcompress will be further reduced by means of a novel data structure that augments a suffix tree with counts and pointers, enabling deep recursion of model inference without intractable computation. With this structure, when a phrase is selected for the MDLcompress codebook, simple operations can update the structure to facilitate selection of the next phrase by leveraging known information. The suffix tree with counts and pointers architecture will enable near-linear time processing of the windowed segments.
ACKNOWLEDGMENTS

This work was funded by the U.S. Army Medical Research Acquisition Activity, 820 Chandler Street, Fort Detrick, DM 217-5014, in Grants W81XWH-0-1-0501 (to SE and AT) and W8IWXH-04-1-0474 (to DSC). The content and information do not necessarily reflect the position or policy of the government, and no official endorsement should be inferred.
REFERENCES

[1] A. Fire, S. Xu, M. K. Montgomery, S. A. Kostas, S. E. Driver, and C. C. Mello, "Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans," Nature, vol. 391, no. 6669, pp. 806-811, 1998.
[2] G. J. Hannon and J. J. Rossi, "Unlocking the potential of the human genome with RNA interference," Nature, vol. 431, no. 7006, pp. 371-378, 2004.
[3] A. Kourtidis, C. Eifert, and D. S. Conklin, "RNAi applications in target validation," in Systems Biology: Applications and Perspectives, P. Bringmann, E. C. Butcher, G. Parry, and B. Weiss, Eds., vol. 61 of Ernst Schering Foundation Symposium Proceedings, pp. 1-21, Springer, New York, NY, USA, 2007.
[4] B. P. Lewis, I.-H. Shih, M. W. Jones-Rhoades, D. P. Bartel, and C. B. Burge, "Prediction of mammalian microRNA targets," Cell, vol. 115, no. 7, pp. 787-798, 2003.
[5] B. P. Lewis, C. B. Burge, and D. P. Bartel, "Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets," Cell, vol. 120, no. 1, pp. 15-20, 2005.
[6] V. Rusinov, V. Baev, I. N. Minkov, and M. Tabler, "MicroInspector: a web tool for detection of miRNA binding sites in an RNA sequence," Nucleic Acids Research, vol. 33, web server issue, pp. W696-W700, 2005.
[7] G. A. Calin, C.-G. Liu, C. Sevignani, et al., "MicroRNA profiling reveals distinct signatures in B cell chronic lymphocytic leukemias," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 32, pp. 11755-11760, 2004.
[8] A. Esquela-Kerscher and F. J. Slack, "Oncomirs: microRNAs with a role in cancer," Nature Reviews Cancer, vol. 6, no. 4, pp. 259-269, 2006.
[9] P. Grünwald, I. J. Myung, and M. Pitt, Eds., Advances in Minimum Description Length: Theory and Applications, MIT Press, Cambridge, Mass, USA, 2005.
[10] S. C. Evans, Kolmogorov complexity estimation and application for information system security, Ph.D. dissertation, Rensselaer Polytechnic Institute, Troy, NY, USA, 2003.
[11] S. C. Evans, B. Barnett, S. F. Bush, and G. J. Saulnier, "Minimum description length principles for detection and classification of FTP exploits," in Proceedings of IEEE Military Communications Conference (MILCOM '04), vol. 1, pp. 473-479, Monterey, Calif, USA, October-November 2004.
[12] S. C. Evans, A. Torres, and J. Miller, "MicroRNA target motif detection using OSCR," Tech. Rep. GRC223, GE Research, Niskayuna, NY, USA, 2006.
[13] M. Li and P. Vitányi, Introduction to Kolmogorov Complexity and Applications, Springer, New York, NY, USA, 1997.
[14] W. Szpankowski, W. Ren, and L. Szpankowski, "An optimal DNA segmentation based on the MDL principle," International Journal of Bioinformatics Research and Applications, vol. 1, no. 1, pp. 3-17, 2005.
[15] I. Tabus, G. Korodi, and J. Rissanen, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," in Proceedings of Data Compression Conference (DCC '03), pp. 253-262, Snowbird, Utah, USA, March 2003.
[16] A. Apostolico and S. Lonardi, "Some theory and practice of greedy off-line textual substitution," in Proceedings of Data Compression Conference (DCC '98), pp. 119-128, Snowbird, Utah, USA, March 1998.
[17] C. G. Nevill-Manning and I. H. Witten, "Identifying hierarchical structure in sequences: a linear-time algorithm," Journal of Artificial Intelligence Research, vol. 7, pp. 67-82, 1997.
[18] N. Cherniavsky and R. Ladner, "Grammar-based compression of DNA sequences," in DIMACS Working Group on the Burrows-Wheeler Transform, Piscataway, NJ, USA, August 2004.
[19] X. Chen, M. Li, B. Ma, and J. Tromp, "DNACompress: fast and effective DNA sequence compression," Bioinformatics, vol. 18, no. 12, pp. 1696-1698, 2002.
[20] B. Behzadi and F. Le Fessant, "DNA compression challenge revisited: a dynamic programming approach," in Proceedings of the 16th Annual Symposium on Combinatorial Pattern Matching (CPM '05), vol. 3537 of Lecture Notes in Computer Science, pp. 190-200, Jeju Island, Korea, 2005.
[21] S. C. Evans, T. S. Markham, A. Torres, A. Kourtidis, and D. Conklin, "An improved minimum description length learning algorithm for nucleotide sequence analysis," in Proceedings of IEEE 40th Asilomar Conference on Signals, Systems and

Computers (ACSSC '06), pp. 1843-1850, Pacific Grove, Calif, USA, October-November 2006.
[22] P. Gács, J. T. Tromp, and P. M. B. Vitányi, "Algorithmic statistics," IEEE Transactions on Information Theory, vol. 47, no. 6, pp. 2443-2463, 2001.
[23] T. M. Cover and J. A. Thomas, Elements of Information Theory, Wiley-Interscience, New York, NY, USA, 1991.
[24] E. C. Lai, "MicroRNAs are complementary to 3′ UTR sequence motifs that mediate negative post-transcriptional regulation," Nature Genetics, vol. 30, no. 4, pp. 363-364, 2002.
[25] E. C. Lai, B. Tam, and G. M. Rubin, "Pervasive regulation of Drosophila Notch target genes by GY-box-, Brd-box-, and K-box-class microRNAs," Genes & Development, vol. 19, no. 9, pp. 1067-1080, 2005.
[26] J. G. Doench and P. A. Sharp, "Specificity of microRNA target selection in translational repression," Genes & Development, vol. 18, no. 5, pp. 504-511, 2004.
[27] J. Brennecke, A. Stark, R. B. Russell, and S. M. Cohen, "Principles of microRNA-target recognition," PLoS Biology, vol. 3, no. 3, p. e85, 2005.
[28] S. C. Evans, G. J. Saulnier, and S. F. Bush, "A new universal two part code for estimation of string Kolmogorov complexity and algorithmic minimum sufficient statistic," in DIMACS Workshop on Complexity and Inference, Piscataway, NJ, USA, June 2003.
[29] P. M. Voorhoeve, C. le Sage, M. Schrier, et al., "A genetic screen implicates miRNA-372 and miRNA-373 as oncogenes in testicular germ cell tumors," Cell, vol. 124, no. 6, pp. 1169-1181, 2006.
[30] A. Mackay, C. Jones, T. Dexter, et al., "cDNA microarray analysis of genes associated with ERBB2 (HER2/neu) overexpression in human mammary luminal epithelial cells," Oncogene, vol. 22, no. 17, pp. 2680-2688, 2003.
[31] F. Bertucci, N. Borie, C. Ginestier, et al., "Identification and validation of an ERBB2 gene expression signature in breast cancers," Oncogene, vol. 23, no. 14, pp. 2564-2575, 2004.
[32] L. P. Lim, N. C. Lau, P. Garrett-Engele, et al., "Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs," Nature, vol. 433, no. 7027, pp. 769-773, 2005.
[33] S. F. Altschul, T. L. Madden, A. A. Schäffer, et al., "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[34] F. Mignone, G. Grillo, F. Licciulli, et al., "UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs," Nucleic Acids Research, vol. 33, database issue, pp. D141-D146, 2005.
[35] http://microrna.sanger.ac.uk/sequences/index.shtml.
[36] S. Griffiths-Jones, R. J. Grocock, S. van Dongen, A. Bateman, and A. J. Enright, "miRBase: microRNA sequences, targets and gene nomenclature," Nucleic Acids Research, vol. 34, database issue, pp. D140-D144, 2006.
[37] X. Huang, R. C. Hardison, and W. Miller, "A space-efficient algorithm for local similarities," Computer Applications in the Biosciences, vol. 6, no. 4, pp. 373-381, 1990.
[38] P. J. Paddison, J. M. Silva, D. S. Conklin, et al., "A resource for large-scale RNA-interference-based screens in mammals," Nature, vol. 428, no. 6981, pp. 427-431, 2004.
[39] A. Clop, F. Marcq, H. Takeda, et al., "A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep," Nature Genetics, vol. 38, no. 7, pp. 813-818, 2006.
[40] http://snp500cancer.nci.nih.gov/.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 61374, 7 pages
doi:10.1155/2007/61374

Research Article
Variation in the Correlation of G + C Composition with
Synonymous Codon Usage Bias among Bacteria
Haruo Suzuki, Rintaro Saito, and Masaru Tomita
Institute for Advanced Biosciences, Keio University, Yamagata 997-0017, Japan
Received 31 January 2007; Accepted 4 June 2007
Recommended by Teemu Roos
G + C composition at the third codon position (GC3) is widely reported to be correlated with synonymous codon usage bias. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we applied Shannon entropy from information theory to measure the degree of GC3 bias and that of synonymous codon usage bias of each gene. The strength of the correlation of GC3 with synonymous codon usage bias, quantified by a correlation coefficient, varied widely among bacterial genomes, ranging from 0.07 to 0.95. Previous analyses suggesting that the relationship between GC3 and synonymous codon usage bias is independent of species are thus inconsistent with the more detailed analyses obtained here for individual species.
Copyright 2007 Haruo Suzuki et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Most amino acids can be encoded by more than one codon (i.e., a triplet of nucleotides); such codons are described as being synonymous and usually differ by one nucleotide in the third position. In many organisms, alternative synonymous codons are not used with equal frequency. Various factors have been proposed to contribute to synonymous codon usage bias, including G + C composition, replication strand bias, and translational selection [1]. Here, we focus on the contribution of G + C composition to synonymous codon usage bias.

G + C composition has been widely reported to be correlated with synonymous codon usage bias [2-11]. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. It would be useful to be able to quantify the strength of the correlation of G + C composition with synonymous codon usage bias in such a way that the estimates could be compared among genomes.

Different methods have been used to analyse the relationships between G + C composition and synonymous codon usage. Multivariate analysis methods, such as correspondence analysis [5-7] and principal component analysis [8], have been widely used to construct measures accounting for the largest fractions of the total variation in synonymous codon usage among genes. Carbone et al. [2, 3] used the codon adaptation index as a universal measure of dominating codon usage bias. The measures obtained by these methods can be interpreted as having different features (e.g., G + C composition bias, replication strand bias, and translationally selected codon bias), depending on the gene groups analyzed. Therefore, these methods would be useful for exploratory data analysis but not for the analysis of interest here. By contrast, measures such as the effective number of codons [10] and Shannon entropy from information theory [11] are well defined; these measures can be regarded as representing the degree of deviation from equal usage of synonymous codons, independently of the genes analyzed. Previous analyses of the relationships between G + C composition and synonymous codon usage bias using these measures have had two problems. First, these measures of synonymous codon usage bias have failed to take into account all three aspects of amino acid usage (i.e., the number of different amino acids, their relative frequency, and their codon degeneracy), and therefore are affected by amino acid usage bias, which may mask the effects directly linked to synonymous codon usage bias. Second, previous analyses have compared the degree of synonymous codon usage bias with G + C content [defined as (G + C)/(A + T + G + C)], and have therefore yielded a nonlinear U-shaped relationship (a gene with a very low or very high G + C content has a high degree of synonymous codon usage bias) [9-11]; it is thus difficult to quantify the nonlinear relationship.

To overcome the first of these problems, we use the weighted sum of relative entropy (Ew) as a measure of synonymous codon usage bias [12]. This measure takes into account all three aspects of amino acid usage enumerated above, and indeed is little affected by amino acid usage biases. To overcome the second problem, we compare the degree of synonymous codon usage bias (Ew) with the degree of G + C content bias (entropy) instead of simply the G + C content; this step can provide a linear relationship. The strength of the linear relationship can be easily quantified by using a correlation coefficient.

The approach of quantifying the strength of the correlation of G + C composition with synonymous codon usage bias by using the entropy and correlation coefficient is applied to bacterial species for which whole genome sequences are available.

2. MATERIALS AND METHODS

2.1. Software

All analyses were conducted by using the G-language genome analysis environment software [13], available at http://www.g-language.org. Graphs such as the histogram and scatter plot were generated in the R statistical computing environment [14], available at http://www.r-project.org.

2.2. Sequences

We tested data from 371 bacterial genomes (see Additional Table 1 for a comprehensive list, available online at http://www2.bioinfo.ttck.keio.ac.jp/genome/haruo/BSB ST1.pdf). Complete genomes in GenBank format [15] were downloaded from the NCBI repository site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria). Protein coding sequences containing letters other than A, C, G, or T, and those containing amino acids with fewer residues than their degree of codon degeneracy, were discarded. Start and stop codons were excluded from each coding sequence.

2.3. Analyses

2.3.1. Measure of the degree of synonymous codon usage bias

The relative frequency of the jth synonymous codon for the ith amino acid (Rij) is defined as the ratio of the number of occurrences of a codon to the sum of all synonymous codons:

    Rij = nij / Σ_{j=1..ki} nij,    (1)

where nij is the number of occurrences of the jth codon for the ith amino acid, and ki is the degree of codon degeneracy for the ith amino acid.

The degree of bias in synonymous codon usage of the ith amino acid (Hi) was quantified with a measure of uncertainty (entropy) in Shannon's information theory [16]:

    Hi = − Σ_{j=1..ki} Rij log2 Rij.    (2)

Hi can take values from 0 (maximum bias, where only one codon is used and all other synonyms are not present) to a maximum value Hi max = −ki((1/ki) log2(1/ki)) = log2 ki (no bias, where alternative synonymous codons are used with equal frequency; that is, for every j, Rij = 1/ki).

The relative entropy of the ith amino acid (Ei) is defined as the ratio of the observed entropy to the maximum possible in the amino acid:

    Ei = Hi / Hi max = Hi / log2 ki.    (3)

Ei ranges from 0 (maximum bias when Hi = 0) to 1 (no bias when Hi = log2 ki).

To obtain an estimate of the overall bias in synonymous codon usage of a gene, we combined estimates of the bias from different amino acids, as follows. First, to take account of the difference in the degree of codon degeneracy (ki) between different amino acids, we used the relative entropy (Ei) instead of the entropy (Hi) as an estimate of the bias of each amino acid. Second, to take account of the difference in relative frequency between different amino acids in the protein, we calculated the sum of the relative entropy of each amino acid weighted by its relative frequency in the protein. The measure of synonymous codon usage bias, designated the weighted sum of relative entropy (Ew) [12], is given by

    Ew = Σ_{i=1..s} wi Ei,    (4)

where s is the number of different amino acid species in the protein and wi is the relative frequency of the ith amino acid in the protein, used as a weighting factor. Ew ranges from 0 (maximum bias) to 1 (no bias).

2.3.2. Measure of the degree of G + C composition bias

The entropy was calculated to quantify the degree of bias in G + C composition at the first, second, and third codon positions of a gene (HGC1, HGC2, and HGC3, resp.):

    Hp = − p log2 p − (1 − p) log2 (1 − p),    (5)

where p is the G + C content (defined as (G + C)/(A + T + G + C)) at the first, second, or third codon position in the nucleotide sequence (GC1, GC2, or GC3).

The entropy (H) for G + C composition (and for usage of two-fold degenerate codons, coding for asparagine, aspartic acid, cysteine, glutamic acid, glutamine, histidine, lysine, phenylalanine, or tyrosine) with values p and 1 − p is plotted in Figure 1 as a function of p.
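As an illustration, the measures of (1)-(5) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the standard genetic code table is assumed, and stop codons and the single-codon amino acids (Met, Trp) are simply skipped here; [12] specifies the exact treatment.

```python
from collections import Counter
from math import log2

# Standard genetic code (DNA codons -> one-letter amino acids; '*' = stop).
CODON_TABLE = {
    'TTT':'F','TTC':'F','TTA':'L','TTG':'L','CTT':'L','CTC':'L','CTA':'L','CTG':'L',
    'ATT':'I','ATC':'I','ATA':'I','ATG':'M','GTT':'V','GTC':'V','GTA':'V','GTG':'V',
    'TCT':'S','TCC':'S','TCA':'S','TCG':'S','CCT':'P','CCC':'P','CCA':'P','CCG':'P',
    'ACT':'T','ACC':'T','ACA':'T','ACG':'T','GCT':'A','GCC':'A','GCA':'A','GCG':'A',
    'TAT':'Y','TAC':'Y','TAA':'*','TAG':'*','CAT':'H','CAC':'H','CAA':'Q','CAG':'Q',
    'AAT':'N','AAC':'N','AAA':'K','AAG':'K','GAT':'D','GAC':'D','GAA':'E','GAG':'E',
    'TGT':'C','TGC':'C','TGA':'*','TGG':'W','CGT':'R','CGC':'R','CGA':'R','CGG':'R',
    'AGT':'S','AGC':'S','AGA':'R','AGG':'R','GGT':'G','GGC':'G','GGA':'G','GGG':'G',
}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)

def weighted_sum_relative_entropy(cds):
    """E_w of eq. (4): sum over amino acids of w_i * H_i / log2(k_i)."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)                                  # n_ij of eq. (1)
    aa_counts = Counter(CODON_TABLE[c] for c in codons)
    total = sum(n for aa, n in aa_counts.items() if aa != '*')
    ew = 0.0
    for aa, n_aa in aa_counts.items():
        k = len(SYNONYMS[aa])                                 # degeneracy k_i
        if aa == '*' or k == 1:   # stops and Met/Trp skipped (assumption)
            continue
        n_syn = sum(counts[c] for c in SYNONYMS[aa])
        h = -sum((counts[c] / n_syn) * log2(counts[c] / n_syn)
                 for c in SYNONYMS[aa] if counts[c] > 0)      # H_i, eq. (2)
        ew += (n_aa / total) * h / log2(k)                    # eqs. (3)-(4)
    return ew

def gc_entropy(p):
    """H_p of eq. (5) for a G + C content p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gc3_entropy(cds):
    """H_GC3: eq. (5) applied to the G + C content at third codon positions."""
    third = cds[2::3]
    return gc_entropy(sum(b in 'GC' for b in third) / len(third))
```

A gene in which every synonymous codon family is used uniformly gives Ew = 1; a gene restricted to a single codon per amino acid gives Ew = 0.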

Figure 1: Entropy (H) of G + C composition and usage of two-fold degenerate codons with values p and 1 − p, plotted as a function of p.

2.3.3. Estimation of the correlation of G + C composition with synonymous codon usage bias

Spearman's rank correlation coefficient (r) was calculated to quantify the strength of the correlation between G + C composition bias (HGC1, HGC2, and HGC3) and synonymous codon usage bias (Ew):

    r = Σ_{g=1..m} (xg − x̄)(yg − ȳ) / sqrt( Σ_{g=1..m} (xg − x̄)² · Σ_{g=1..m} (yg − ȳ)² ),
    x̄ = (1/m) Σ_{g=1..m} xg,  ȳ = (1/m) Σ_{g=1..m} yg,    (6)

where xg is the rank of the x-axis value (HGC1, HGC2, or HGC3) for the gth gene, yg is the rank of the y-axis value (Ew) for the gth gene, and m is the number of genes in the genome. The r value can vary from −1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation).
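Equation (6) is a Pearson correlation computed on ranks; a small stdlib-only sketch (an illustration, not the code used in the study; ties are given their average rank, a common convention):

```python
from math import sqrt

def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Eq. (6): Pearson correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    m = len(x)
    mx, my = sum(rx) / m, sum(ry) / m
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den
```

Any monotonically increasing relationship between x and y yields r = 1, regardless of its (possibly nonlinear) shape, which is why the entropies rather than the raw G + C contents are compared.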
3. RESULTS

3.1. Correlation of G + C composition with synonymous codon usage bias (r value)

We investigated the correlation between the degree of G + C composition bias (HGC1, HGC2, and HGC3) and that of synonymous codon usage bias (Ew) within each genome.
Figure 2 shows scatter plots of Ew plotted against HGC1, HGC2, and HGC3 for Geobacter metallireducens GS-15 genes and for Saccharophagus degradans 2-40 genes as examples, together with the Spearman's rank correlation coefficient (r) calculated from each plot. In G. metallireducens, the value of Ew was much better correlated with HGC3 (Figure 2(c)) than with HGC1 (Figure 2(a)) or HGC2 (Figure 2(b)), indicating that GC3 contributed more to synonymous codon usage bias than GC1 and GC2. In S. degradans, the value of Ew was not correlated with HGC1 (Figure 2(d)), HGC2 (Figure 2(e)), or HGC3 (Figure 2(f)), indicating that neither GC1, GC2, nor GC3 contributed to synonymous codon usage bias.

To compare the contributions of GC1, GC2, and GC3 to synonymous codon usage bias, we produced pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 genomes (Figure 3).

In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC1 (x-axis) (Figure 3(a)), 362 points (97.6% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC1 in most of the genomes analyzed. In the scatter plot of the r values of HGC3 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(b)), 367 points (98.9% of the total) are on the upper left of the line y = x, indicating that GC3 contributed more to synonymous codon usage bias than did GC2 in most genomes analyzed. In the scatter plot of the r values of HGC1 (y-axis) plotted against those of HGC2 (x-axis) (Figure 3(c)), the distribution of points is diffuse: 186 points (50.1% of the total) are on the upper left of the line y = x, indicating that the relative contributions of GC1 and GC2 to synonymous codon usage bias varied widely from genome to genome.

We constructed histograms showing the distribution of r values of HGC1, HGC2, and HGC3 with Ew for the 371 bacterial genomes (Figure 4). The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) were distributed evenly between positive and negative values, whereas those of HGC3 (Figure 4(c)) were distributed towards positive values. The ranges [minimum, maximum] of the r values of HGC1, HGC2, and HGC3 were [−0.51, 0.46], [−0.28, 0.39], and [0.07, 0.95], respectively. The r values of HGC1 (Figure 4(a)) and HGC2 (Figure 4(b)) exhibited a monomodal distribution, whereas those of HGC3 (Figure 4(c)) exhibited a multimodal distribution.
Figure 2: Scatter plots of Ew plotted against (a) HGC1, (b) HGC2, and (c) HGC3 for Geobacter metallireducens GS-15 genes and against (d) HGC1, (e) HGC2, and (f) HGC3 for Saccharophagus degradans 2-40 genes. The extent of the correlation between HGC1, HGC2, and HGC3 and Ew is represented by Spearman's rank correlation coefficient (r).

3.2. Correlation of r value with genomic features

To investigate whether the correlation of GC3 with synonymous codon usage bias (the r value of HGC3 versus Ew) was related to species characteristics, we compared the r values with genomic features such as genomic G + C content and tRNA gene copy number. Among the 371 genomes analyzed here, genomic G + C content ranged from 23% to 73% and tRNA gene copy number varied from 28 to 145.

We constructed scatter plots of the r values of HGC3 with Ew plotted against genomic G + C content and tRNA gene copy number for the 371 genomes (Figure 5). The relationship between the r value of HGC3 and the tRNA gene copy number was unclear (Figure 5(b)). In contrast, the r values of HGC3 tended to be high in G + C-poor or G + C-rich genomes, revealing a nonlinear relationship between the r value of HGC3 and genomic G + C content (Figure 5(a)). The highest r value

of HGC3 (0.95) was found in G. metallireducens, with a genomic G + C content of 60% (Figure 2(c)). The lowest r value of HGC3 (0.07) was found in S. degradans, with a genomic G + C content of 46% (Figure 2(f)). The mean and standard deviation of the r values of HGC3 for G + C-poor bacteria (with genomic G + C contents less than 40%) were 0.58 and 0.12, respectively. The corresponding values for G + C-rich bacteria (with genomic G + C contents greater than 60%) were 0.86 and 0.04. Thus, the r values of HGC3 for G + C-poor bacteria tended to be lower than those for G + C-rich bacteria.

Figure 3: Pairwise scatter plots of the r values of HGC1, HGC2, and HGC3 with Ew for 371 bacterial genomes. Comparison of the correlation with Ew of (a) HGC3 and HGC1, (b) HGC3 and HGC2, and (c) HGC1 and HGC2.

Figure 4: Histograms of the distribution of r values of (a) HGC1, (b) HGC2, and (c) HGC3 with Ew for 371 bacterial genomes.

4. DISCUSSION

Other investigators have reported that G + C composition is correlated with synonymous codon usage bias in many organisms. However, no quantitative attempt has been made to compare the extent of this correlation among different genomes. Here, we quantified the strength of the correlation of G + C composition bias (HGC1, HGC2, and HGC3) with synonymous codon usage bias (Ew) by using a correlation coefficient (r). This approach allowed us to quantitatively compare the strength of this correlation among different genomes.

Figure 5: Scatter plots of the r values of HGC3 with Ew plotted against (a) genomic G + C content and (b) tRNA gene number for 371 bacterial genomes.

In a previous analysis of the relationships between G + C composition and synonymous codon usage bias, Wan et al. [9] stated that GC3 was the most important factor in codon bias among GC, GC1, GC2, and GC3. This is quantitatively supported by the pairwise comparison of the r values of HGC1, HGC2, and HGC3 (Figure 3). However, the statement by Wan et al. that GC3 is the key factor driving synonymous codon usage and that this mechanism is independent of species differs from our conclusion that the strength of the correlation of GC3 with synonymous codon usage bias (the r value of HGC3) varies widely among species (Figure 4(c)). This discordance appears to have arisen because Wan et al. combined the genes from different genomes into a single dataset for their analysis. Such an analysis of combined data from different genomes masks the presence of genomes in which the correlation of GC3 with synonymous codon usage bias is negligible (such as that of S. degradans; Figure 2(f)); the results are thus inconsistent with those of the more detailed analyses obtained here for individual genomes.

Three factors, G + C composition, replication strand bias, and translational selection, are well documented to shape synonymous codon usage bias [1].

First, in bacteria with extreme genomic G + C compositions (either G + C-rich or A + T-rich), synonymous codon usage could be dominated by strong mutational bias (toward G + C or A + T) [17, 18]. The data in Figure 5(a) indicate that, although genomic G + C content was nonlinearly correlated with the r value of HGC3, there are some exceptions; for example, Nanoarchaeum equitans Kin4-M and Mycoplasma genitalium G37 had identical genomic G + C contents of 32% but very different r values of HGC3 (0.34 and 0.87, resp.), and Thermococcus kodakarensis KOD1 had a genomic G + C content of around 50% but a high r value of HGC3 (0.86). The existence of these outliers suggests that, although mutational biases have a major influence on the correlation of GC3 with synonymous codon usage bias, other evolutionary factors may play a part. For example, horizontal gene transfer among bacteria with different genomic G + C contents can contribute to intragenomic variation in G + C content [19, 20].

Second, the spirochaete Borrelia burgdorferi exhibits a strong base usage skew between the leading and lagging strands of replication (generally inferred as reflecting strand-specific mutational bias): genes on the leading strand tend to preferentially use G- or T-ending codons [21]. The r values of HGC3 for genes on the leading and lagging strands are similar (0.65 and 0.63, resp.). This suggests that strand bias has little influence on the correlation of GC3 with synonymous codon usage bias in B. burgdorferi.

Third, in bacteria with more tRNA genes, synonymous codon usage could be subject to stronger translational selection [22]. Figure 5(b) shows that tRNA gene copy number was not correlated with the r value of HGC3. This suggests that translational selection has little influence on the correlation of GC3 with synonymous codon usage bias. Sharp et al. [22] showed that the S value, a measure of translationally selected codon usage bias, is highly correlated with tRNA gene copy number but is not correlated with genomic G + C content. Thus, the r value of HGC3 can be used as a measure complementary to the S value.

The most accepted hypothesis for the unequal usage of synonymous codons in bacterial genomes is that it is the result of a very complex balance among different evolutionary forces (mutation and selection) [23]. The combined use of the r value and other methods (e.g., the S value) will improve our understanding of the relative contributions of different evolutionary forces to synonymous codon usage bias.

ABBREVIATIONS

A: Adenine
T: Thymine
G: Guanine
C: Cytosine
GC1: G + C content at the first codon position
GC2: G + C content at the second codon position
GC3: G + C content at the third codon position
HGC1: Entropy of GC1
HGC2: Entropy of GC2
HGC3: Entropy of GC3
Ew: Weighted sum of relative entropy
r: Spearman's rank correlation coefficient

ACKNOWLEDGMENTS

The authors thank Dr Kazuharu Arakawa (Institute for Advanced Biosciences, Keio University) for his technical advice on the G-language genome analysis environment, and Kunihiro Baba (Faculty of Policy Management, Keio University) for his technical advice on the R statistical computing environment. This work was supported by the Ministry of Education, Culture, Sports, Science, and Technology of Japan Grant-in-Aid for the 21st Century Centre of Excellence (COE) Program entitled "Understanding and Control of Life via Systems Biology" (Keio University).
REFERENCES

[1] M. D. Ermolaeva, "Synonymous codon usage in bacteria," Current Issues in Molecular Biology, vol. 3, no. 4, pp. 91-97, 2001.
[2] A. Carbone, F. Kepes, and A. Zinovyev, "Codon bias signatures, organization of microorganisms in codon space, and lifestyle," Molecular Biology and Evolution, vol. 22, no. 3, pp. 547-561, 2005.
[3] A. Carbone, A. Zinovyev, and F. Kepes, "Codon adaptation index as a measure of dominating codon bias," Bioinformatics, vol. 19, no. 16, pp. 2005-2015, 2003.
[4] R. D. Knight, S. J. Freeland, and L. F. Landweber, "A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes," Genome Biology, vol. 2, no. 4, pp. research0010.1-research0010.13, 2001.
[5] J. R. Lobry and A. Necsulea, "Synonymous codon usage and its potential link with optimal growth temperature in prokaryotes," Gene, vol. 385, pp. 128-136, 2006.
[6] D. J. Lynn, G. A. C. Singer, and D. A. Hickey, "Synonymous codon usage is subject to selection in thermophilic bacteria," Nucleic Acids Research, vol. 30, no. 19, pp. 4272-4277, 2002.
[7] G. A. C. Singer and D. A. Hickey, "Thermophilic prokaryotes have characteristic patterns of codon usage, amino acid composition and nucleotide content," Gene, vol. 317, no. 1-2, pp. 39-47, 2003.
[8] H. Suzuki, R. Saito, and M. Tomita, "A problem in multivariate analysis of codon usage data and a possible solution," FEBS Letters, vol. 579, no. 28, pp. 6499-6504, 2005.
[9] X.-F. Wan, D. Xu, A. Kleinhofs, and J. Zhou, "Quantitative relationship between synonymous codon usage bias and GC composition across unicellular genomes," BMC Evolutionary Biology, vol. 4, p. 19, 2004.
[10] F. Wright, "The effective number of codons used in a gene," Gene, vol. 87, no. 1, pp. 23-29, 1990.
[11] B. Zeeberg, "Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes," Genome Research, vol. 12, no. 6, pp. 944-955, 2002.
[12] H. Suzuki, R. Saito, and M. Tomita, "The weighted sum of relative entropy: a new index for synonymous codon usage bias," Gene, vol. 335, no. 1-2, pp. 19-23, 2004.
[13] K. Arakawa, K. Mori, K. Ikeda, T. Matsuzaki, Y. Kobayashi, and M. Tomita, "G-language genome analysis environment: a workbench for nucleotide sequence data mining," Bioinformatics, vol. 19, no. 2, pp. 305-306, 2003.
[14] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2006.
[15] D. A. Benson, I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and D. L. Wheeler, "GenBank," Nucleic Acids Research, vol. 35, supplement 1, pp. D21-D25, 2007.
[16] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, pp. 379-423, 1948.
[17] A. Muto and S. Osawa, "The guanine and cytosine content of genomic DNA and bacterial evolution," Proceedings of the National Academy of Sciences of the United States of America, vol. 84, no. 1, pp. 166-169, 1987.
[18] N. Sueoka, "On the genetic basis of variation and heterogeneity of DNA base composition," Proceedings of the National Academy of Sciences of the United States of America, vol. 48, no. 4, pp. 582-592, 1962.
[19] S. Garcia-Vallve, A. Romeu, and J. Palau, "Horizontal gene transfer in bacterial and archaeal complete genomes," Genome Research, vol. 10, no. 11, pp. 1719-1725, 2000.
[20] R. J. Grocock and P. M. Sharp, "Synonymous codon usage in Pseudomonas aeruginosa PA01," Gene, vol. 289, no. 1-2, pp. 131-139, 2002.
[21] J. O. McInerney, "Replicational and transcriptional selection on codon usage in Borrelia burgdorferi," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, no. 18, pp. 10698-10703, 1998.
[22] P. M. Sharp, E. Bailes, R. J. Grocock, J. F. Peden, and R. E. Sockett, "Variation in the strength of selected codon usage bias among bacteria," Nucleic Acids Research, vol. 33, no. 4, pp. 1141-1153, 2005.
[23] P. M. Sharp, M. Stenico, J. F. Peden, and A. T. Lloyd, "Codon usage: mutational bias, translational selection, or both?" Biochemical Society Transactions, vol. 21, no. 4, pp. 835-841, 1993.

Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 79879, 9 pages
doi:10.1155/2007/79879

Research Article
Information-Theoretic Inference of Large Transcriptional
Regulatory Networks
Patrick E. Meyer, Kevin Kontos, Frederic Lafitte, and Gianluca Bontempi
ULB Machine Learning Group, Computer Science Department, Universite Libre de Bruxelles, 1050 Brussels, Belgium
Received 26 January 2007; Accepted 12 May 2007
Recommended by Juho Rousu
The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR, and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousands of genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
Copyright 2007 Patrick E. Meyer et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Two important issues in computational biology are the extent to which it is possible to model transcriptional interactions by large networks of interacting elements, and how these interactions can be effectively learned from measured expression data [1]. The reverse engineering of transcriptional regulatory networks (TRNs) from expression data alone is far from trivial because of the combinatorial nature of the problem and the poor information content of the data [1]. An additional problem is that, by focusing only on transcript data, the inferred network should not be considered as a biochemical regulatory network but as a gene-to-gene network, where many physical connections between macromolecules might be hidden by shortcuts.

In spite of these evident limitations, the bioinformatics community has made important advances in this domain over the last few years. Examples are methods like Boolean networks, Bayesian networks, and association networks [2].

This paper will focus on information-theoretic approaches [3-6], which typically rely on the estimation of mutual information from expression data in order to measure the statistical dependence between variables (the terms variable and feature are used interchangeably in this paper). Such methods have recently held the attention of the bioinformatics community for the inference of very large networks [4-6].
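The mutual information estimate underlying all of these approaches can be sketched in a few lines. This is a minimal illustrative (maximum-likelihood) estimator on discretized expression profiles, not one of the estimators actually benchmarked in the paper:

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) in bits between two equal-length
    sequences of discrete symbols (e.g., discretized expression levels)."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # I(X;Y) = sum over observed pairs of p(a,b) * log2(p(a,b)/(p(a)p(b))).
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mutual_info_matrix(genes):
    """All pairwise mutual informations for {gene_name: discrete profile},
    returned as a dict keyed by gene-name pairs (upper triangle only)."""
    names = list(genes)
    return {(a, b): mutual_information(genes[a], genes[b])
            for i, a in enumerate(names) for b in names[i + 1:]}
```

Two identical profiles give I(X; Y) = H(X), while two independent profiles give a value near zero (exactly zero in the idealized case below).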
The adoption of mutual information in probabilistic model design can be traced back to the Chow-Liu tree algorithm [3] and its extensions proposed by [7, 8]. Later, [9, 10] suggested improving network inference by using another information-theoretic quantity, namely multi-information.

This paper introduces an original information-theoretic method, called MRNET, inspired by a recently proposed feature selection technique, the maximum relevance/minimum redundancy (MRMR) algorithm [11, 12]. This algorithm has been used with success in supervised classification problems to select a set of nonredundant genes which are explicative of the targeted phenotype [12, 13]. The MRMR selection strategy consists in selecting a set of variables that have a high mutual information with the target variable (maximum relevance) and at the same time are mutually maximally independent (minimum redundancy between relevant variables). The advantage of this approach is that redundancy among selected variables is avoided and that the trade-off between relevance and redundancy is properly taken into account.

Our proposed MRNET strategy, preliminarily sketched in [14], consists of (i) formulating the network inference problem as a series of input/output supervised gene selection procedures, where one gene at a time plays the role of the target output, and (ii) adopting the MRMR principle to perform the gene selection for each supervised gene selection procedure.

The paper benchmarks MRNET against three state-of-the-art information-theoretic network inference methods, namely relevance networks (RELNET), CLR, and ARACNE. The comparison relies on thirty artificial microarray datasets synthesized by two public-domain generators. The extensive simulation setting allows us to study the effect of the number of samples, the number of genes, and the noise intensity on the inferred network accuracy. Also, the sensitivity of the performance to two alternative entropy estimators is assessed.

The outline of the paper is as follows. Section 2 reviews the state-of-the-art network inference techniques based on information theory. Section 3 introduces our original approach based on MRMR. The experimental framework and the results obtained on artificially generated datasets are presented in Sections 4 and 5, respectively. Section 6 concludes the paper.
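The MRMR selection strategy described in the introduction can be sketched as a greedy loop that scores each candidate by relevance minus mean redundancy (one common MRMR variant; an illustrative sketch with a minimal empirical MI helper included for self-containment, not the authors' implementation):

```python
from collections import Counter
from math import log2

def mi(x, y):
    """Empirical mutual information (bits) between discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_select(candidates, target, k):
    """Greedy MRMR: repeatedly pick the candidate maximizing relevance
    I(X; target) minus mean redundancy with the already-selected set."""
    selected = []
    remaining = dict(candidates)          # {name: discrete profile}
    while remaining and len(selected) < k:
        def score(name):
            relevance = mi(remaining[name], target)
            redundancy = (sum(mi(remaining[name], candidates[s])
                              for s in selected) / len(selected)
                          if selected else 0.0)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected
```

In MRNET, a loop of this kind would be run once per gene, with that gene playing the role of the target and all other genes as candidates.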
2. INFORMATION-THEORETIC NETWORK INFERENCE: STATE OF THE ART

This section reviews some state-of-the-art methods for network inference which are based on information-theoretic
notions.

These methods require at first the computation of the
mutual information matrix (MIM), a square matrix whose
(i, j) element

    MIM_{ij} = I(X_i; X_j) = \sum_{x_i \in X} \sum_{x_j \in X} p(x_i, x_j) \log \frac{p(x_i, x_j)}{p(x_i) p(x_j)}    (1)

is the mutual information between X_i and X_j, where X_i \in
X, i = 1, ..., n, is a discrete random variable denoting the
expression level of the ith gene.

2.1. Chow-Liu tree

The Chow and Liu approach consists in finding the maximum spanning tree (MST) of a complete graph, where the
weights of the edges are the mutual information quantities
between the connected nodes [3]. The construction of the
MST with Kruskal's algorithm has an O(n^2 log n) cost. The
main drawbacks of this method are that (i) the spanning tree
typically has a low number of edges, even for nonsparse target
networks, and (ii) no parameter is provided to calibrate the
size of the inferred network.

2.2. Relevance network (RELNET)

The relevance network approach [4] was introduced for
gene clustering problems and successfully applied to infer relationships between RNA expression and chemotherapeutic
susceptibility [15]. The approach consists in inferring a genetic network where a pair of genes {X_i, X_j} is linked by an
edge if the mutual information I(X_i; X_j) is larger than a given
threshold I_0. The complexity of the method is O(n^2) since all
pairwise interactions are considered.

Note that this method is prone to infer false positives in
the case of indirect interactions between genes. For example,
if gene X_1 regulates both gene X_2 and gene X_3, a high mutual information between the pairs {X_1, X_2}, {X_1, X_3}, and
{X_2, X_3} would be present. As a consequence, the algorithm
would infer an edge between X_2 and X_3, although these two
genes interact only through gene X_1.

2.3. CLR algorithm

The CLR algorithm [6] is an extension of RELNET. This algorithm computes the mutual information (MI) for each pair
of genes and derives a score related to the empirical distribution of these MI values. In particular, instead of considering
the information I(X_i; X_j) between genes X_i and X_j, it takes
into account the score z_{ij} = \sqrt{z_i^2 + z_j^2}, where

    z_i = \max\left(0, \frac{I(X_i; X_j) - \mu_i}{\sigma_i}\right)    (2)

and \mu_i and \sigma_i are, respectively, the mean and the standard
deviation of the empirical distribution of the mutual information values I(X_i; X_k), k = 1, ..., n. The CLR algorithm
was successfully applied to decipher the E. coli TRN [6]. Note
that, like RELNET, CLR demands an O(n^2) cost to infer the
network from a given MIM.

2.4. ARACNE

The algorithm for the reconstruction of accurate cellular networks (ARACNE) [5] is based on the data processing inequality [16]. This inequality states that if gene X_1 interacts
with gene X_3 through gene X_2, then

    I(X_1; X_3) \leq \min( I(X_1; X_2), I(X_2; X_3) ).    (3)

The ARACNE procedure starts by assigning to each pair of
nodes a weight equal to their mutual information. Then, as
in RELNET, all edges for which I(X_i; X_j) < I_0 are removed,
where I_0 is a given threshold. Eventually, the weakest edge
of each triplet is interpreted as an indirect interaction and is
removed if the difference between the two lowest weights is
above a threshold W_0. Note that by increasing I_0 we decrease
the number of inferred edges, while we obtain the opposite
effect by increasing W_0.

If the network is a tree and only pairwise interactions
are present, the method guarantees the reconstruction of the
original network, once it is provided with the exact MIM.
ARACNE's complexity for inferring the network is O(n^3)
since the algorithm considers all triplets of genes. In [5], the
method was able to recover components of the TRN in
mammalian cells and appeared to outperform Bayesian networks and relevance networks on several inference tasks.

Patrick E. Meyer et al.

[Figure 1 depicts the experimental pipeline: a network and data generator produces the original network and an artificial dataset; an entropy estimator yields the mutual information matrix; an inference method produces the inferred network; a validation procedure compares the two networks via precision-recall curves and F-scores.]

Figure 1: An artificial microarray dataset is generated from an original network. The inferred network can then be compared to this true network.

3. OUR PROPOSAL: MINIMUM REDUNDANCY NETWORKS (MRNET)

We propose to infer a network using the maximum relevance/minimum redundancy (MRMR) feature selection
method. The idea consists in performing a series of supervised MRMR gene selection procedures, where each gene in
turn plays the role of the target output.

The MRMR method was introduced in [11, 12] together with a best-first search strategy for performing filter
selection in supervised learning problems. Consider a supervised learning task, where the output is denoted by Y and V
is the set of input variables. The method ranks the set V of
inputs according to a score that is the difference between the
mutual information with the output variable Y (maximum
relevance) and the average mutual information with the previously ranked variables (minimum redundancy). The rationale is that direct interactions (i.e., the variables most informative about the target Y) should be well ranked, whereas indirect interactions (i.e., those whose information is redundant
with the direct ones) should be badly ranked by the method.
The greedy search starts by selecting the variable X_i having
the highest mutual information with the target Y. The second
selected variable X_j will be the one with a high information
I(X_j; Y) with the target and at the same time a low information
I(X_j; X_i) with the previously selected variable. In the following
steps, given a set S of selected variables, the criterion updates
S by choosing the variable

    X_j^{MRMR} = \arg\max_{X_j \in V \setminus S} (u_j - r_j)    (4)

that maximizes the score

    s_j = u_j - r_j,    (5)

where u_j is a relevance term and r_j is a redundancy term.


More precisely,

    u_j = I(X_j; Y)    (6)

is the mutual information of X_j with the target variable Y,
and

    r_j = \frac{1}{|S|} \sum_{X_k \in S} I(X_j; X_k)    (7)

measures the average redundancy of X_j with each already selected variable X_k \in S. At each step of the algorithm, the
selected variable is expected to allow an efficient trade-off
between relevance and redundancy. It has been shown in
[12] that the MRMR criterion is an optimal pairwise approximation of the conditional mutual information
I(X_j; Y | S) between any two genes X_j and Y given the set S of selected variables.
The MRNET approach consists in repeating this selection procedure for each target gene by setting Y = X_i and
V = X \ {X_i}, i = 1, ..., n, where X is the set of the expression levels of all genes. For each pair {X_i, X_j}, MRMR returns
two (not necessarily equal) scores s_i and s_j according to (5).
The score of the pair {X_i, X_j} is then computed by taking the
maximum of s_i and s_j. A specific network can then be inferred by deleting all the edges whose score lies below a given
threshold I_0 (as in RELNET, CLR, and ARACNE). Thus, the
algorithm infers an edge between X_i and X_j either when X_i is
a well-ranked predictor of X_j (s_i > I_0) or when X_j is a well-ranked predictor of X_i (s_j > I_0).

An effective implementation of the MRMR best-first
search is available in [17]. This implementation demands an
O(f × n) complexity for selecting f features using a best-first
search strategy. It follows that MRNET has an O(f × n^2) complexity since the feature selection step is repeated for each of
the n genes. In other terms, the complexity ranges between
O(n^2) and O(n^3) according to the value of f. Note that the
lower the value of f, the lower the number of incoming edges
to infer per node and, consequently, the lower the resulting
complexity.
Note that since mutual information is a symmetric measure, it is not possible to derive the direction of the edge from
its weight. This limitation is common to all the methods presented so far. However, this information could be provided
by edge orientation algorithms (e.g., IC) commonly used in
Bayesian networks [7].
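The two stages of MRNET can be sketched as follows (illustrative Python working on a precomputed MIM; a simplified stand-in for the best-first implementation of [17], where `f` is the number of features selected per target):

```python
import numpy as np

def mrmr_scores(mim, target, f):
    """Greedy MRMR forward selection using only pairwise MI values taken
    from a precomputed MIM. Returns {gene: s_j = u_j - r_j} for the f
    genes selected as predictors of the target gene."""
    candidates = [g for g in range(mim.shape[0]) if g != target]
    selected, scores = [], {}
    for _ in range(f):
        best, best_s = None, -np.inf
        for j in candidates:
            u = mim[j, target]  # relevance u_j = I(X_j; Y)
            # redundancy r_j = mean MI with the already selected variables
            r = np.mean([mim[j, k] for k in selected]) if selected else 0.0
            if u - r > best_s:
                best, best_s = j, u - r
        selected.append(best)
        candidates.remove(best)
        scores[best] = best_s
    return scores

def mrnet(mim, f):
    """MRNET: run MRMR once per target gene, then score each pair {i, j}
    with the maximum of the two directed scores. Pairs never selected in
    either direction keep a -inf score and are pruned by any threshold."""
    n = mim.shape[0]
    s = np.full((n, n), -np.inf)  # s[t, j] = score of X_j as predictor of X_t
    for target in range(n):
        for j, sj in mrmr_scores(mim, target, f).items():
            s[target, j] = sj
    return np.maximum(s, s.T)
```

On the toy hub MIM of Section 2 (X_1 regulating X_2 and X_3), the pair {X_2, X_3} receives a negative score and is removed by any nonnegative threshold, illustrating how redundancy penalization prunes indirect interactions.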

4. EXPERIMENTS

The experimental framework consists of four steps (see
Figure 1): the artificial network and data generation, the
computation of the mutual information matrix, the
inference of the network, and the validation of the results.
This section details each step of the approach.
4.1. Network and data generation
In order to assess the results returned by our algorithm and
compare it to other methods, we created a set of benchmarks
on the basis of artificially generated microarray datasets. In
spite of the evident limitations of using synthetic data, this
makes a quantitative assessment of the accuracy possible,
thanks to the availability of the true network underlying the
microarray dataset (see Figure 1).

We used two different generators of artificial gene expression data: the data generator described in [18] (hereafter referred to as the sRogers generator) and the SynTReN generator [19]. The two generators, whose implementations are
freely available on the World Wide Web, are sketched in the
following paragraphs.
sRogers generator
The sRogers generator produces the topology of the genetic
network according to an approximate power-law distribution on the number of regulatory connections out of each
gene. The normal steady state of the system is evaluated by
integrating a system of differential equations. The generator
offers the possibility to obtain 2k different measures (k wild-type and k knockout experiments). These measures can be
replicated R times, yielding a total of N = 2kR samples. After
the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.
SynTReN generator
The SynTReN generator creates a network topology by selecting subnetworks from E. coli and S. cerevisiae source networks. Then, transition functions and their parameters are
assigned to the edges in the network. Eventually, mRNA expression levels for the genes in the network are obtained by
simulating equations based on Michaelis-Menten and Hill
kinetics under different conditions. As for the previous generator, after the optional addition of noise, a dataset containing normalized and scaled microarray measurements is returned.
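As a toy illustration of the kind of transition function involved (a generic Hill activation curve with arbitrary illustrative parameters, not SynTReN's actual equations):

```python
def hill_activation(x, v_max=1.0, k=0.5, h=2.0):
    """Hill-kinetics response: expression rate of a target gene as a
    saturating function of the concentration x of an activating regulator.
    v_max is the maximal rate, k the half-activation constant, and h the
    Hill coefficient (all values here are hypothetical)."""
    return v_max * x ** h / (k ** h + x ** h)
```

The response is sigmoidal: negligible for x well below k, half-maximal at x = k, and saturating toward v_max for large x, which is what gives the simulated expression data its nonlinear regulator-target dependencies.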
Generation
The two generators were used to synthesize thirty datasets.
Table 1 reports for each dataset the number n of genes, the
number N of samples, and the Gaussian noise intensity (expressed as a percentage of the signal variance).
4.2. Mutual information matrix estimation
In order to benchmark MRNET versus RELNET, CLR, and
ARACNE, the same MIM is used for the four inference
approaches. Several estimators of mutual information have
been proposed in the literature [5, 6, 20, 21]. Here, we test
the Miller-Madow entropy estimator [20] and a parametric
Gaussian density estimator. Since the Miller-Madow method
requires quantized values, we pretreated the data with the
equal-sized intervals algorithm [22], using l = \sqrt{N} intervals.
The parametric Gaussian estimator is directly computed by
I(X_i; X_j) = (1/2) log(\sigma_{ii} \sigma_{jj} / |C|), where |C| is the determinant of the covariance matrix C. Note that the complexity of
both estimators is O(N), where N is the number of samples. This means that, since the whole MIM costs O(N × n^2),
the MIM computation could be the bottleneck of the whole
network inference procedure for a large number of samples
(N ≫ n). We deem, however, that at the current state of the
technology this should not be considered a major issue,
since the number of samples is typically much smaller than
the number of measured features.
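The two estimators can be sketched as follows (illustrative Python; equal-width binning is used here as a simple stand-in for the equal-sized-intervals discretization, and the Miller-Madow correction adds (m − 1)/(2N) to the plug-in entropy, with m the number of non-empty bins):

```python
import numpy as np

def gaussian_mi(x, y):
    """Parametric estimator: I = (1/2) log(sigma_xx * sigma_yy / |C|),
    where C is the 2x2 sample covariance matrix of (X, Y)."""
    c = np.cov(x, y)
    return 0.5 * np.log(c[0, 0] * c[1, 1] / np.linalg.det(c))

def mm_entropy(counts):
    """Plug-in (maximum-likelihood) entropy of a histogram plus the
    Miller-Madow bias correction (m - 1)/(2N)."""
    counts = np.asarray(counts, dtype=float).ravel()
    n = counts.sum()
    p = counts[counts > 0] / n
    return -(p * np.log(p)).sum() + (len(p) - 1) / (2 * n)

def miller_madow_mi(x, y):
    """Discretize into l = sqrt(N) bins per axis, then estimate
    I = H(X) + H(Y) - H(X, Y) with Miller-Madow-corrected entropies."""
    l = int(np.sqrt(len(x)))
    joint = np.histogram2d(x, y, bins=l)[0]
    return (mm_entropy(joint.sum(axis=1)) + mm_entropy(joint.sum(axis=0))
            - mm_entropy(joint))
```

For jointly Gaussian data the parametric estimator converges to the true value −(1/2) log(1 − ρ²); the histogram-based estimator is more general but noisier, consistent with the sensitivity reported in Section 5.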
4.3. Validation

A network inference problem can be seen as a binary decision
problem, where the inference algorithm plays the role of a
classifier: for each pair of nodes, the algorithm either adds
an edge or does not. Each pair of nodes is thus assigned a
positive label (an edge) or a negative one (no edge).
A positive label (an edge) predicted by the algorithm is
considered a true positive (TP) or a false positive (FP)
depending on the presence or absence, respectively, of the
corresponding edge in the underlying true network. Analogously, a
negative label is considered a true negative (TN) or a false
negative (FN) depending on whether the corresponding edge
is absent or present, respectively, in the underlying true network.

The decisions made by the algorithm can be summarized
in a confusion matrix (see Table 2).
It is generally recommended [23] to use receiver operating characteristic (ROC) curves when evaluating binary decision problems in order to avoid effects related to the chosen
threshold. However, ROC curves can present an overly optimistic view of an algorithm's performance if there is a large skew
in the class distribution, as typically encountered in TRN inference because of sparseness.
To tackle this problem, precision-recall (PR) curves have
been cited as an alternative to ROC curves [24]. Let the precision

    p = \frac{TP}{TP + FP},    (8)

measure the fraction of real edges among those classified
as positive, and the recall

    r = \frac{TP}{TP + FN},    (9)

also known as the true positive rate, denote the fraction of real
edges that are correctly inferred. These quantities depend on
the threshold chosen to return a binary decision. The PR
curve is a diagram which plots the precision p versus the recall
r for different values of the threshold on a two-dimensional
coordinate system.

Table 1: Datasets with n the number of genes and N the number of samples.

Dataset   Generator   Topology         n      N      Noise
RN1       sRogers     Power-law tail   700    700    0%
RN2       sRogers     Power-law tail   700    700    5%
RN3       sRogers     Power-law tail   700    700    10%
RN4       sRogers     Power-law tail   700    700    20%
RN5       sRogers     Power-law tail   700    700    30%
RS1       sRogers     Power-law tail   700    100    0%
RS2       sRogers     Power-law tail   700    300    0%
RS3       sRogers     Power-law tail   700    500    0%
RS4       sRogers     Power-law tail   700    800    0%
RS5       sRogers     Power-law tail   700    1000   0%
RV1       sRogers     Power-law tail   100    700    0%
RV2       sRogers     Power-law tail   300    700    0%
RV3       sRogers     Power-law tail   500    700    0%
RV4       sRogers     Power-law tail   700    700    0%
RV5       sRogers     Power-law tail   1000   700    0%
SN1       SynTReN     S. cerevisiae    400    400    0%
SN2       SynTReN     S. cerevisiae    400    400    5%
SN3       SynTReN     S. cerevisiae    400    400    10%
SN4       SynTReN     S. cerevisiae    400    400    20%
SN5       SynTReN     S. cerevisiae    400    400    30%
SS1       SynTReN     S. cerevisiae    400    100    0%
SS2       SynTReN     S. cerevisiae    400    200    0%
SS3       SynTReN     S. cerevisiae    400    300    0%
SS4       SynTReN     S. cerevisiae    400    400    0%
SS5       SynTReN     S. cerevisiae    400    500    0%
SV1       SynTReN     S. cerevisiae    100    400    0%
SV2       SynTReN     S. cerevisiae    200    400    0%
SV3       SynTReN     S. cerevisiae    300    400    0%
SV4       SynTReN     S. cerevisiae    400    400    0%
SV5       SynTReN     S. cerevisiae    500    400    0%
Table 2: Confusion matrix.

                    Actual positive   Actual negative
Inferred positive   TP                FP
Inferred negative   FN                TN

Note that a compact representation of the PR diagram is
returned by the maximum of the F-score quantity

    F = \frac{2pr}{r + p},    (10)

which is the harmonic mean of precision and recall.
The following section will present the results by means of PR
curves and F-scores.
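A minimal sketch combining (8)–(10), computing one point of the PR curve and its F-score from the confusion counts:

```python
def pr_f(tp, fp, fn):
    """Precision p = TP/(TP+FP), recall r = TP/(TP+FN), and the F-score
    F = 2pr/(p + r), the harmonic mean summarizing the PR point."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```

Sweeping the acceptance threshold and calling this function on the resulting confusion counts traces out the PR curve; the maximum F-score over the sweep is the summary used in Table 3.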
Also, in order to assess the significance of the results, a McNemar test can be performed. The McNemar test [25] states
that if two algorithms A and B have the same error rate, then

    P\left( \frac{(|N_{AB} - N_{BA}| - 1)^2}{N_{AB} + N_{BA}} > 3.841459 \right) < 0.05,    (11)

where N_{AB} is the number of incorrect edges of the network
inferred by algorithm A that are correct in the network
inferred by algorithm B, and N_{BA} is the counterpart.
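A minimal sketch of this test (3.841459 is the 0.95 quantile of the chi-square distribution with one degree of freedom; the statistic includes the usual continuity correction):

```python
def mcnemar(n_ab, n_ba):
    """McNemar statistic with continuity correction. A value above
    3.841459 (chi-square 0.95 quantile, 1 d.o.f.) rejects the hypothesis
    of equal error rates at the 0.05 level."""
    stat = (abs(n_ab - n_ba) - 1) ** 2 / (n_ab + n_ba)
    return stat, stat > 3.841459
```

For example, if algorithm A gets 30 edges wrong that B gets right while B gets only 10 wrong that A gets right, the statistic is 361/40 ≈ 9.03 and the difference is significant; a 12-versus-10 split is not.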
5. RESULTS AND DISCUSSION

A thorough comparison would require the display of the PR curves (Figure 2) for each dataset. For reasons of space, we
instead summarize the PR-curve information by the maximum F-score in Table 3. Note that for each dataset, the accuracy of the best methods (i.e., those whose score is not significantly lower than the highest one according to the McNemar
test) is typed in boldface.

We may summarize the results as follows.

[Figure 2: PR-curves for the RS3 dataset using the Miller-Madow estimator. The curves are obtained by varying the rejection/acceptance threshold.]

Accuracy sensitivity to the number of variables.

The number of variables ranges from 100 to 1000 for the
datasets RV1, RV2, RV3, RV4, and RV5, and from 100 to
500 for the datasets SV1, SV2, SV3, SV4, and SV5. Figure 3
shows that the accuracy and the number of variables of the
network are weakly negatively correlated. This appears to be
true independently of the inference method and of the MI
estimator.

[Figure 3: Influence of the number of variables on accuracy (SynTReN SV datasets, Miller-Madow estimator).]

Accuracy sensitivity to the number of samples.

The number of samples ranges from 100 to 1000 for the
datasets RS1, RS2, RS3, RS4, and RS5, and from 100 to 500
for the datasets SS1, SS2, SS3, SS4, and SS5. Figure 4 shows
how the accuracy is strongly and positively correlated with the
number of samples.

[Figure 4: Influence of the number of samples on accuracy (sRogers RS datasets, Gaussian estimator).]

Accuracy sensitivity to the noise intensity.

The intensity of noise ranges from 0% to 30% for the datasets
RN1, RN2, RN3, RN4, and RN5, and for the datasets SN1,
SN2, SN3, SN4, and SN5. The performance of the methods
using the Miller-Madow entropy estimator decreases significantly with increasing noise, whereas the Gaussian estimator appears to be more robust (see Figure 5).

Accuracy sensitivity to the MI estimator.

We can observe in Figure 6 that the Gaussian parametric estimator gives better results than the Miller-Madow estimator.
This is particularly evident with the sRogers datasets.

Accuracy sensitivity to the data generator.

The SynTReN generator produces datasets for which the inference task appears to be harder, as shown in Table 3.

Accuracy of the inference methods.

Table 3 supports the following three considerations: (i) MRNET is competitive with the other approaches, (ii) ARACNE
outperforms the other approaches when the Gaussian estimator is used, and (iii) MRNET and CLR are the two best
techniques when the nonparametric Miller-Madow estimator is used.

5.1. Feature selection techniques in network inference

As shown experimentally in the previous section, MRNET
is competitive with the state-of-the-art techniques. Furthermore, MRNET benefits from some additional properties

Table 3: Maximum F-scores for each inference method using two different mutual information estimators. The best methods (those whose
score is not significantly weaker than the best score according to the McNemar test at the .05 level) are typed in boldface. Average performances
on SynTReN and sRogers datasets are reported in the S-AVG and R-AVG lines, respectively.

                  Miller-Madow                         Gaussian
Dataset     RELNET   CLR    ARACNE   MRNET      RELNET   CLR    ARACNE   MRNET
SN1         0.22     0.24   0.27     0.27       0.21     0.24   0.30     0.26
SN2         0.23     0.26   0.29     0.29       0.21     0.25   0.31     0.25
SN3         0.23     0.25   0.24     0.26       0.21     0.25   0.31     0.26
SN4         0.22     0.24   0.26     0.26       0.21     0.25   0.28     0.26
SN5         0.21     0.23   0.24     0.24       0.20     0.25   0.27     0.24
SS1         0.21     0.22   0.22     0.23       0.19     0.24   0.24     0.23
SS2         0.21     0.24   0.28     0.29       0.20     0.24   0.27     0.25
SS3         0.21     0.24   0.27     0.28       0.20     0.24   0.28     0.25
SS4         0.22     0.24   0.27     0.27       0.21     0.24   0.30     0.26
SS5         0.22     0.24   0.28     0.29       0.21     0.24   0.30     0.26
SV1         0.32     0.36   0.41     0.39       0.30     0.40   0.44     0.38
SV2         0.25     0.28   0.35     0.33       0.25     0.35   0.36     0.32
SV3         0.21     0.24   0.30     0.28       0.21     0.28   0.30     0.27
SV4         0.22     0.24   0.27     0.27       0.21     0.24   0.30     0.26
SV5         0.24     0.23   0.29     0.29       0.22     0.24   0.31     0.26
S-AVG       0.23     0.25   0.28     0.28       0.21     0.26   0.30     0.27
RN1         0.59     0.65   0.60     0.61       0.89     0.87   0.92     0.93
RN2         0.50     0.57   0.50     0.49       0.89     0.87   0.92     0.92
RN3         0.50     0.55   0.50     0.52       0.89     0.87   0.92     0.92
RN4         0.46     0.51   0.47     0.47       0.89     0.87   0.92     0.91
RN5         0.42     0.46   0.41     0.40       0.88     0.86   0.91     0.91
RS1         0.10     0.11   0.09     0.10       0.19     0.19   0.19     0.18
RS2         0.35     0.32   0.31     0.31       0.45     0.44   0.47     0.46
RS3         0.38     0.32   0.36     0.38       0.58     0.56   0.60     0.60
RS4         0.47     0.54   0.47     0.50       0.75     0.75   0.80     0.79
RS5         0.58     0.68   0.60     0.64       0.90     0.86   0.93     0.93
RV1         0.52     0.38   0.46     0.46       0.72     0.75   0.72     0.72
RV2         0.49     0.53   0.49     0.53       0.71     0.71   0.71     0.71
RV3         0.45     0.50   0.45     0.48       0.69     0.69   0.71     0.71
RV4         0.47     0.51   0.48     0.48       0.69     0.70   0.74     0.72
RV5         0.47     0.52   0.47     0.48       0.70     0.68   0.74     0.73
R-AVG       0.45     0.48   0.44     0.46       0.72     0.71   0.74     0.74
Tot-AVG     0.34     0.36   0.36     0.37       0.47     0.49   0.52     0.51

which are common to all the feature selection strategies for
network inference [26, 27], as follows.

(1) Feature selection algorithms can often deal with thousands of variables in a reasonable amount of time. This
makes inference scalable to large networks.

(2) Feature selection algorithms may easily be parallelized, since each of the n selection tasks is independent.

(3) Feature selection algorithms may be made faster by a
priori knowledge. For example, knowing the list of regulator
genes of an organism improves the selection speed and the
inference quality by limiting the search space of the feature
selection step to this small list of genes. The knowledge of
existing edges can also improve the inference. For example,
in a sequential selection process, such as the forward selection
used with MRMR, the next variable is selected given the already selected features. As a result, the performance of the selection can be strongly improved by conditioning on known
relationships.

However, there is a disadvantage in using a feature selection technique for network inference. The objective of feature selection is to select, among a set of input variables, the
ones that will lead to the best predictive model. It has been

[Figure 5: Influence of the noise on MRNET accuracy for the two MIM estimators (sRogers RN datasets).]

[Figure 6: Influence of the MI estimator on MRNET accuracy for the two MIM estimators (sRogers RS datasets).]

[Figure 7: Example of indirect relationship between X_i and Y.]

proved in [28] that the minimum set that achieves optimal
classification accuracy under certain general conditions is the
Markov blanket of a target variable. The Markov blanket of
a target variable is composed of the variable's parents, the
variable's children, and the variable's children's parents [7].
The latter are indirect relationships. In other words, these
variables have a conditional mutual information with the target variable Y higher than their mutual information. Let us
consider the following example. Let Y and X_i be independent random variables, and let X_j = X_i + Y (see Figure 7). Since
the variables are independent, I(X_i; Y) = 0, and the conditional mutual information is higher than the mutual information, that is, I(X_i; Y | X_j) > 0. It follows that X_i carries some
information about Y given X_j but no information about Y taken
alone. This behavior is colloquially referred to as the explaining-away effect in the Bayesian network literature [7]. Selecting
variables, like X_i, that take part in indirect interactions reduces the accuracy of the network inference task. However,
since MRMR relies only on pairwise interactions, it does not
take into account the gain in information due to conditioning. In our example, the MRMR algorithm, after having selected X_j, computes the score s_i = I(X_i; Y) − I(X_i; X_j), where
I(X_i; Y) = 0 and I(X_i; X_j) > 0. This score is negative, and X_i is
likely to be badly ranked. As a result, the MRMR feature selection criterion is less exposed to this inconvenience of most
feature selection techniques while sharing their interesting
properties. Further experiments will focus on this aspect.

6. CONCLUSION AND FUTURE WORK

A new network inference method, MRNET, has been proposed. This method relies on an effective method of
information-theoretic feature selection called MRMR. Similarly to other network inference methods, MRNET relies on
pairwise interactions between genes, making possible the inference of large networks (up to several thousands of genes).
Another advantage of MRNET, which could be exploited
in future work, is its ability to benefit explicitly from a priori
knowledge.

MRNET was compared experimentally to three state-of-the-art information-theoretic network inference methods, namely RELNET, CLR, and ARACNE, on thirty inference tasks. The microarray datasets were generated artificially with two different generators in order to effectively
assess their inference power. Also, two different mutual information estimation methods were used. The experimental
results showed that MRNET is competitive with the benchmarked information-theoretic methods.

Future work will focus on three main axes: (i) the assessment of additional mutual information estimators, (ii) the
validation of the techniques on the basis of real microarray
data, and (iii) a theoretical analysis of which conditions should
be met for MRNET to reconstruct the true network.

ACKNOWLEDGMENT

This work was partially supported by the Communauté
Française de Belgique under ARC Grant no. 04/09-307.

REFERENCES

[1] E. P. van Someren, L. F. A. Wessels, E. Backer, and M. J. T. Reinders, "Genetic network modeling," Pharmacogenomics, vol. 3, no. 4, pp. 507–525, 2002.
[2] T. S. Gardner and J. J. Faith, "Reverse-engineering transcription control networks," Physics of Life Reviews, vol. 2, no. 1, pp. 65–88, 2005.
[3] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees," IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467, 1968.
[4] A. J. Butte and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," in Pacific Symposium on Biocomputing, pp. 418–429, 2000.
[5] A. A. Margolin, I. Nemenman, K. Basso, et al., "ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context," BMC Bioinformatics, vol. 7, supplement 1, p. S7, 2006.
[6] J. J. Faith, B. Hayete, J. T. Thaden, et al., "Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles," PLoS Biology, vol. 5, no. 1, p. e8, 2007.
[7] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[8] J. Cheng, R. Greiner, J. Kelly, D. Bell, and W. Liu, "Learning Bayesian networks from data: an information-theory based approach," Artificial Intelligence, vol. 137, no. 1-2, pp. 43–90, 2002.
[9] E. Schneidman, S. Still, M. J. Berry II, and W. Bialek, "Network information and connected correlations," Physical Review Letters, vol. 91, no. 23, Article ID 238701, 4 pages, 2003.
[10] I. Nemenman, "Multivariate dependence, and genetic network inference," Tech. Rep. NSF-KITP-04-54, KITP, UCSB, Santa Barbara, Calif, USA, 2004.
[11] G. D. Tourassi, E. D. Frederick, M. K. Markey, and C. E. Floyd Jr., "Application of the mutual information criterion for feature selection in computer-aided diagnosis," Medical Physics, vol. 28, no. 12, pp. 2394–2402, 2001.
[12] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data," Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.
[13] P. E. Meyer and G. Bontempi, "On the use of variable complementarity for feature selection in cancer classification," in Applications of Evolutionary Computing: EvoWorkshops, F. Rothlauf, J. Branke, S. Cagnoni, et al., Eds., vol. 3907 of Lecture Notes in Computer Science, pp. 91–102, Springer, Berlin, Germany, 2006.
[14] P. E. Meyer, K. Kontos, and G. Bontempi, "Biological network inference using redundancy analysis," in Proceedings of the 1st International Conference on Bioinformatics Research and Development (BIRD '07), pp. 916–927, Berlin, Germany, March 2007.
[15] A. J. Butte, P. Tamayo, D. Slonim, T. R. Golub, and I. S. Kohane, "Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks," Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 22, pp. 12182–12186, 2000.
[16] T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, USA, 1990.
[17] P. Merz and B. Freisleben, "Greedy and local search heuristics for unconstrained binary quadratic programming," Journal of Heuristics, vol. 8, no. 2, pp. 197–213, 2002.
[18] S. Rogers and M. Girolami, "A Bayesian regression approach to the inference of regulatory networks from gene expression data," Bioinformatics, vol. 21, no. 14, pp. 3131–3137, 2005.
[19] T. van den Bulcke, K. van Leemput, B. Naudts, et al., "SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms," BMC Bioinformatics, vol. 7, p. 43, 2006.
[20] L. Paninski, "Estimation of entropy and mutual information," Neural Computation, vol. 15, no. 6, pp. 1191–1253, 2003.
[21] J. Beirlant, E. J. Dudewicz, L. Györfi, and E. van der Meulen, "Nonparametric entropy estimation: an overview," International Journal of Mathematical and Statistical Sciences, vol. 6, no. 1, pp. 17–39, 1997.
[22] J. Dougherty, R. Kohavi, and M. Sahami, "Supervised and unsupervised discretization of continuous features," in Proceedings of the 12th International Conference on Machine Learning (ML '95), pp. 194–202, Lake Tahoe, Calif, USA, July 1995.
[23] F. J. Provost, T. Fawcett, and R. Kohavi, "The case against accuracy estimation for comparing induction algorithms," in Proceedings of the 15th International Conference on Machine Learning (ICML '98), pp. 445–453, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[24] J. Bockhorst and M. Craven, "Markov networks for detecting overlapping elements in sequence data," in Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou, Eds., pp. 193–200, MIT Press, Cambridge, Mass, USA, 2005.
[25] T. G. Dietterich, "Approximate statistical tests for comparing supervised classification learning algorithms," Neural Computation, vol. 10, no. 7, pp. 1895–1923, 1998.
[26] K. B. Hwang, J. W. Lee, S.-W. Chung, and B.-T. Zhang, "Construction of large-scale Bayesian networks by local to global search," in Proceedings of the 7th Pacific Rim International Conference on Artificial Intelligence (PRICAI '02), pp. 375–384, Tokyo, Japan, August 2002.
[27] I. Tsamardinos, C. Aliferis, and A. Statnikov, "Algorithms for large scale Markov blanket discovery," in Proceedings of the 16th International Florida Artificial Intelligence Research Society Conference (FLAIRS '03), pp. 376–381, St. Augustine, Fla, USA, May 2003.
[28] I. Tsamardinos and C. Aliferis, "Towards principled feature selection: relevancy, filters and wrappers," in Proceedings of the 9th International Workshop on Artificial Intelligence and Statistics (AI&Stats '03), Key West, Fla, USA, January 2003.

Hindawi Publishing Corporation


EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 90947, 11 pages
doi:10.1155/2007/90947

Research Article
NML Computation Algorithms for Tree-Structured
Multinomial Bayesian Networks

Petri Kontkanen, Hannes Wettig, and Petri Myllymäki


Complex Systems Computation Group (CoSCo), Helsinki Institute for Information Technology (HIIT),
P.O. Box 68 (Department of Computer Science), FIN-00014 University of Helsinki, Finland
Received 1 March 2007; Accepted 30 July 2007
Recommended by Peter Grünwald
Typical problems in bioinformatics involve large discrete datasets. Therefore, in order to apply statistical methods in such domains,
it is important to develop efficient algorithms suitable for discrete data. The minimum description length (MDL) principle is a
theoretically well-founded, general framework for performing statistical inference. The mathematical formalization of MDL is
based on the normalized maximum likelihood (NML) distribution, which has several desirable theoretical properties. In the case
of discrete data, straightforward computation of the NML distribution requires exponential time with respect to the sample size,
since the definition involves a sum over all the possible data samples of a fixed size. In this paper, we first review some existing algorithms for efficient NML computation in the case of multinomial and naive Bayes model families. Then we proceed by extending
these algorithms to more complex, tree-structured Bayesian networks.

Copyright © 2007 Petri Kontkanen et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.

1. INTRODUCTION

Many problems in bioinformatics can be cast as model class selection tasks, that is, as tasks of selecting among a set of competing mathematical explanations the one that best describes a given sample of data. Typical examples of this kind of problem are DNA sequence compression [1], microarray data clustering [2–4], and modeling of genetic networks [5]. The minimum description length (MDL) principle developed in the series of papers [6–8] is a well-founded, general framework for performing model class selection and other types of statistical inference. The fundamental idea behind the MDL principle is that any regularity in data can be used to compress the data, that is, to find a description or code of it such that this description uses fewer symbols than it takes to describe the data literally. The more regularities there are, the more the data can be compressed. According to the MDL principle, learning can be equated with finding regularities in data. Consequently, we can say that the more we are able to compress the data, the more we have learned about them.

MDL model class selection is based on a quantity called stochastic complexity (SC), which is the description length of a given data sample relative to a model class. The stochastic complexity is defined via the normalized maximum likelihood (NML) distribution [8, 9]. For multinomial (discrete) data, this definition involves a normalizing sum over all the possible data samples of a fixed size. The logarithm of this sum is called the regret or parametric complexity, and it can be interpreted as the amount of complexity of the model class. If the data is continuous, the sum is replaced by the corresponding integral.

The NML distribution has several theoretical optimality properties, which make it a very attractive candidate for performing model class selection and related tasks. It was originally [8, 10] formulated as the unique solution to a minimax problem presented in [9], which implied that NML is the minimax optimal universal model. Later [11], it was shown that NML is also the solution to a related problem involving expected regret. See Section 2 and [10–13] for more discussion on the theoretical properties of the NML.

Typical bioinformatic problems involve large discrete datasets. In order to apply NML to these tasks, one needs to develop suitable NML computation methods, since the normalizing sum or integral in the definition of NML is typically difficult to compute directly. In this paper, we present algorithms for efficient computation of NML for both one- and multidimensional discrete data. The model families used in the paper are so-called Bayesian networks (see, e.g., [14]) of varying complexity. A Bayesian network is a graphical representation of a joint distribution. The structure of the graph


corresponds to certain conditional independence assumptions. Note that despite the name, having Bayesian network models does not necessarily imply using Bayesian statistics, and the information-theoretic approach of this paper cannot be considered Bayesian.

The problem of computing NML for discrete data has been studied before. In [15], a linear-time algorithm for the one-dimensional multinomial case was derived. A more complex case involving a multidimensional model family, called naive Bayes, was discussed in [16]. Both these cases are also reviewed in this paper.

The paper is structured as follows. In Section 2, we discuss the basic properties of the MDL principle and the NML distribution. In Section 3, we instantiate the NML distribution for the multinomial case and present a linear-time computation algorithm. The topic of Section 4 is the naive Bayes model family. NML computation for an extension of naive Bayes, the so-called Bayesian forests, is discussed in Section 5. Finally, Section 6 gives some concluding remarks.
2. PROPERTIES OF THE MDL PRINCIPLE AND THE NML MODEL

The MDL principle has several desirable properties. Firstly, it automatically protects against overfitting in the model class selection process. Secondly, this statistical framework does not, unlike most other frameworks, assume that there exists some underlying "true" model. The model class is only used as a technical device for constructing an efficient code for describing the data. MDL is also closely related to Bayesian inference, but there are some fundamental differences, the most important being that MDL does not need any prior distribution; it only uses the data at hand. For more discussion on the theoretical motivations behind the MDL principle see, for example, [8, 10–13, 17].

MDL model class selection is based on minimization of the stochastic complexity. In the following, we give the definition of the stochastic complexity and then proceed by discussing its theoretical properties.

2.1. Model classes and families

Let xⁿ = (x₁, …, xₙ) be a data sample of n outcomes, where each outcome x_j is an element of some space of observations X. The n-fold Cartesian product X × ⋯ × X is denoted by Xⁿ, so that xⁿ ∈ Xⁿ. Consider a set Θ ⊆ R^d, where d is a positive integer. A class of parametric distributions indexed by the elements of Θ is called a model class. That is, a model class M is defined as

    M = { P(· | θ) : θ ∈ Θ },                                         (1)

and the set Θ is called the parameter space.

Consider a set Φ ⊆ R^e, where e is a positive integer. Define a set F by

    F = { M(φ) : φ ∈ Φ }.                                             (2)

The set F is called a model family, and each of the elements M(φ) is a model class. The associated parameter space is denoted by Θ_φ. The model class selection problem can now be defined as the process of finding the parameter vector φ which is optimal according to some predetermined criteria. In Sections 3–5, we discuss three specific model families, which will make these definitions more concrete.

2.2. The NML distribution

One of the most theoretically and intuitively appealing model class selection criteria is the stochastic complexity. Denote first the maximum likelihood estimate of data xⁿ for a given model class M(φ) by θ̂(xⁿ, M(φ)), that is, θ̂(xⁿ, M(φ)) = arg max_{θ ∈ Θ_φ} { P(xⁿ | θ) }. The normalized maximum likelihood (NML) distribution [9] is now defined as

    P_NML(xⁿ | M(φ)) = P(xⁿ | θ̂(xⁿ, M(φ))) / C(M(φ), n),             (3)

where the normalizing term C(M(φ), n) in the case of discrete data is given by

    C(M(φ), n) = Σ_{yⁿ ∈ Xⁿ} P(yⁿ | θ̂(yⁿ, M(φ))),                    (4)

and the sum goes over the space of data samples of size n. If the data is continuous, the sum is replaced by the corresponding integral.

The stochastic complexity of the data xⁿ, given a model class M(φ), is defined via the NML distribution as

    SC(xⁿ | M(φ)) = −log P_NML(xⁿ | M(φ))
                  = −log P(xⁿ | θ̂(xⁿ, M(φ))) + log C(M(φ), n),       (5)

and the term log C(M(φ), n) is called the (minimax) regret or parametric complexity. The regret can be interpreted as measuring the logarithm of the number of essentially different (distinguishable) distributions in the model class. Intuitively, if two distributions assign high likelihood to the same data samples, they do not contribute much to the overall complexity of the model class, and the distributions should not be counted as different for the purposes of statistical inference. See [18] for more discussion on this topic.

The NML distribution (3) has several important theoretical optimality properties. The first is that NML provides a unique solution to the minimax problem

    min_P̂ max_{xⁿ} log [ P(xⁿ | θ̂(xⁿ, M(φ))) / P̂(xⁿ | M(φ)) ],      (6)

as posed in [9]. The minimizing P̂ is the NML distribution, and the minimax regret

    log P(xⁿ | θ̂(xⁿ, M(φ))) − log P̂(xⁿ | M(φ))                      (7)

is given by the parametric complexity log C(M(φ), n). This means that the NML distribution is the minimax optimal universal model. The term universal model in this context means that the NML distribution represents (or mimics) the behavior of all the distributions in the model class M(φ). Note that the NML distribution itself does not have to belong to the model class, and typically it does not.

A related property of NML involving expected regret was proven in [11]. This property states that NML is also the unique solution to

    max_g min_q E_g log [ P(xⁿ | θ̂(xⁿ, M(φ))) / q(xⁿ | M(φ)) ],      (8)

where the expectation is taken over xⁿ with respect to g, and the minimizing distribution q equals g. The maximin expected regret is thus also given by log C(M(φ), n).
3. NML FOR MULTINOMIAL MODELS

In the case of discrete data, the simplest model family is the multinomial. The data are assumed to be one-dimensional and to have only a finite set of possible values. Although simple, the multinomial model family has practical applications. For example, in [19] multinomial NML was used for histogram density estimation, and the density estimation problem was regarded as a model class selection task.
3.1. The model family

Assume that our problem domain consists of a single discrete random variable X with K values, and that our data xⁿ = (x₁, …, xₙ) are multinomially distributed. The space of observations X is now the set {1, 2, …, K}. The corresponding model family F_MN is defined by

    F_MN = { M(φ) : φ ∈ Φ_MN },                                       (9)

where Φ_MN = {1, 2, 3, …}. Since the parameter vector φ is in this case a single integer K, we denote the multinomial model classes by M(K) and define

    M(K) = { P(· | θ) : θ ∈ Θ_K },                                    (10)

where Θ_K is the simplex-shaped parameter space

    Θ_K = { (θ₁, …, θ_K) : θ_k ≥ 0, θ₁ + ⋯ + θ_K = 1 },               (11)

with θ_k = P(X = k), k = 1, …, K.

Assume the data points x_j are independent and identically distributed (i.i.d.). The NML distribution (3) for the model class M(K) is now given by (see, e.g., [16, 20])

    P_NML(xⁿ | M(K)) = [ Π_{k=1}^{K} (h_k/n)^{h_k} ] / C(M(K), n),    (12)

where h_k is the frequency (number of occurrences) of value k in xⁿ, and

    C(M(K), n) = Σ_{yⁿ} P(yⁿ | θ̂(yⁿ, M(K)))                          (13)
               = Σ_{h₁+⋯+h_K=n} [ n! / (h₁! ⋯ h_K!) ] Π_{k=1}^{K} (h_k/n)^{h_k}.   (14)

To make the notation more compact and consistent in this section and the following sections, C(M(K), n) is from now on denoted by C_MN(K, n).

It is clear that the maximum likelihood term in (12) can be computed in linear time by simply sweeping through the data once and counting the frequencies h_k. However, the normalizing sum C_MN(K, n) (and thus also the parametric complexity log C_MN(K, n)) involves a sum over an exponential number of terms. Consequently, the time complexity of computing the multinomial NML is dominated by (14).

3.2. The quadratic-time algorithm

In [16, 20], a recursion formula for removing the exponentiality of C_MN(K, n) was presented. This formula is given by

    C_MN(K, n) = Σ_{r₁+r₂=n} [ n! / (r₁! r₂!) ] (r₁/n)^{r₁} (r₂/n)^{r₂} C_MN(K*, r₁) C_MN(K − K*, r₂),   (15)

which holds for all K* = 1, …, K − 1. A straightforward algorithm based on this formula was then used to compute C_MN(K, n) in time O(n² log K). See [16, 20] for more details. Note that in [21, 22] the quadratic-time algorithm was improved to O(n log n log K) by writing (15) as a convolution-type sum and then using the fast Fourier transform algorithm. However, the relevance of this result is unclear due to the severe numerical instability problems it easily produces in practice.

3.3. The linear-time algorithm

Although the previous algorithms have succeeded in removing the exponentiality of the computation of the multinomial NML, they are still superlinear with respect to n. In [15], a linear-time algorithm based on the mathematical technique of generating functions was derived for the problem.

The starting point of the derivation is the generating function B defined by

    B(z) = Σ_{n≥0} (nⁿ/n!) zⁿ = 1 / (1 − T(z)),                       (16)

where T is the so-called Cayley's tree function [23, 24]. It is easy to prove (see [15, 25]) that the function B^K generates the sequence ((nⁿ/n!) C_MN(K, n))_{n=0}^{∞}, that is,

    B^K(z) = Σ_{n≥0} (nⁿ/n!) [ Σ_{h₁+⋯+h_K=n} (n! / (h₁! ⋯ h_K!)) Π_{k=1}^{K} (h_k/n)^{h_k} ] zⁿ
           = Σ_{n≥0} (nⁿ/n!) C_MN(K, n) zⁿ,                           (17)

which by using the tree function T can be written as

    B^K(z) = 1 / (1 − T(z))^K.                                        (18)

The properties of the tree function T can be used to prove the following theorem.

Theorem 1. The C_MN(K, n) terms satisfy the recurrence

    C_MN(K + 2, n) = C_MN(K + 1, n) + (n/K) C_MN(K, n).               (19)

Proof. See the appendix.

It is now straightforward to write a linear-time algorithm for computing the multinomial NML P_NML(xⁿ | M(K)) based on Theorem 1. The process is described in Algorithm 1. The time complexity of the algorithm is clearly O(n + K), which is a major improvement over the previous methods. The algorithm is also very easy to implement and does not suffer from any numerical instability problems.

3.4. Approximating the multinomial NML

In practice, it is often not necessary to compute the exact value of C_MN(K, n). A very general and powerful mathematical technique called singularity analysis [26] can be used to derive an accurate, constant-time approximation for the multinomial regret. The idea of singularity analysis is to use the analytical properties of the generating function in question by studying its singularities, which then leads to the asymptotic form of the coefficients. See [25, 26] for details. For the multinomial case, the singularity analysis approximation was first derived in [25] in the context of memoryless sources, and later [20] re-introduced in the MDL framework. The approximation is given by

    log C_MN(K, n) = ((K − 1)/2) log(n/2) + log( √π / Γ(K/2) )
        + ( √2 K Γ(K/2) / (3 Γ(K/2 − 1/2)) ) · (1/√n)
        + ( (3 + K(K − 2)(2K + 1))/36 − (Γ²(K/2) K²) / (9 Γ²(K/2 − 1/2)) ) · (1/n)
        + O(1/n^{3/2}).                                               (20)

Since the error term of (20) goes down at the rate O(1/n^{3/2}), the approximation converges very rapidly. In [20], the accuracy of (20) and of two other approximations (Rissanen's asymptotic expansion [8] and the Bayesian information criterion (BIC) [27]) was tested empirically. The results show that (20) is significantly better than the other approximations and accurate already with very small sample sizes. See [20] for more details.

4. NML FOR THE NAIVE BAYES MODEL

The one-dimensional case discussed in the previous section is not adequate for many real-world situations, where data are typically multidimensional, involving complex dependencies between the domain variables. In [16], a quadratic-time algorithm for computing the NML for a specific multivariate model family, usually called naive Bayes, was derived. This model family has been very successful in practice in mixture modeling [28], clustering of data [16], case-based reasoning [29], classification [30, 31], and data visualization [32].

4.1. The model family

Let us assume that our problem domain consists of m primary variables X₁, …, X_m and a special variable X₀, which can be one of the variables in our original problem domain or can be latent. Assume that the variable X_i has K_i values and that the extra variable X₀ has K₀ values. The data xⁿ = (x₁, …, xₙ) consist of observations of the form x_j = (x_j0, x_j1, …, x_jm) ∈ X, where

    X = {1, 2, …, K₀} × {1, 2, …, K₁} × ⋯ × {1, 2, …, K_m}.           (21)

The naive Bayes model family F_NB is defined by

    F_NB = { M(φ) : φ ∈ Φ_NB }                                        (22)

with Φ_NB = {1, 2, 3, …}^{m+1}. The corresponding model classes are denoted by M(K₀, K₁, …, K_m):

    M(K₀, K₁, …, K_m) = { P_NB(· | θ) : θ ∈ Θ_{K₀,K₁,…,K_m} }.        (23)

The basic naive Bayes assumption is that, given the value of the special variable, the primary variables are independent. Consequently, we have

    P_NB(X₀ = x₀, X₁ = x₁, …, X_m = x_m | θ)
        = P(X₀ = x₀ | θ) Π_{i=1}^{m} P(X_i = x_i | X₀ = x₀, θ).       (24)

Furthermore, we assume that the distribution of P(X₀ | θ) is multinomial with parameters (π₁, …, π_{K₀}), and each P(X_i | X₀ = k, θ) is multinomial with parameters (θ_ik1, …, θ_ikK_i). The whole parameter space is then

    Θ_{K₀,K₁,…,K_m} = { (π₁, …, π_{K₀}, θ₁₁₁, …, θ₁₁K₁, …, θ_mK₀1, …, θ_mK₀K_m) :
        π_k ≥ 0, θ_ikl ≥ 0, π₁ + ⋯ + π_{K₀} = 1,
        θ_ik1 + ⋯ + θ_ikK_i = 1, i = 1, …, m, k = 1, …, K₀ },         (25)

and the parameters are defined by π_k = P(X₀ = k) and θ_ikl = P(X_i = l | X₀ = k).

Assuming i.i.d. data, the NML distribution for the naive Bayes model can now be written as (see [16])

    P_NML(xⁿ | M(K₀, K₁, …, K_m))
        = [ Π_{k=1}^{K₀} (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_ikl/h_k)^{f_ikl} ] / C(M(K₀, K₁, …, K_m), n),   (26)

where h_k is the number of times X₀ has value k in xⁿ, f_ikl is the number of times X_i has value l when the special variable has value k, and C(M(K₀, K₁, …, K_m), n) is given by (see [16])

    C(M(K₀, K₁, …, K_m), n)
        = Σ_{h₁+⋯+h_{K₀}=n} [ n! / (h₁! ⋯ h_{K₀}!) ] Π_{k=1}^{K₀} [ (h_k/n)^{h_k} Π_{i=1}^{m} C_MN(K_i, h_k) ].   (27)

To simplify the notation, from now on we write C(M(K₀, K₁, …, K_m), n) in the abbreviated form C_NB(K₀, n).


1: Count the frequencies h₁, …, h_K from the data xⁿ
2: Compute the likelihood P(xⁿ | θ̂(xⁿ, M(K))) = Π_{k=1}^{K} (h_k/n)^{h_k}
3: Set C_MN(1, n) = 1
4: Compute C_MN(2, n) = Σ_{r₁+r₂=n} (n!/(r₁! r₂!)) (r₁/n)^{r₁} (r₂/n)^{r₂}
5: for k = 1 to K − 2 do
6:   Compute C_MN(k + 2, n) = C_MN(k + 1, n) + (n/k) C_MN(k, n)
7: end for
8: Output P_NML(xⁿ | M(K)) = P(xⁿ | θ̂(xⁿ, M(K))) / C_MN(K, n)

Algorithm 1: The linear-time algorithm for computing P_NML(xⁿ | M(K)).
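The procedure above can be sketched in Python as follows. This is our illustration, not code from the paper, and the function name is ours; it builds the normalizer with the Theorem 1 recurrence in O(n + K) arithmetic operations:

```python
import math

def multinomial_nml(data, K):
    """P_NML(x^n | M(K)) following Algorithm 1: the maximized likelihood
    divided by C_MN(K, n), computed with the Theorem 1 recurrence
    C_MN(K+2, n) = C_MN(K+1, n) + (n/K) * C_MN(K, n)."""
    n = len(data)
    # Steps 1-2: frequencies and maximized likelihood prod_k (h_k/n)^(h_k)
    counts = {}
    for x in data:
        counts[x] = counts.get(x, 0) + 1
    lik = 1.0
    for h in counts.values():
        lik *= (h / n) ** h
    # Steps 3-4: base cases C_MN(1, n) = 1 and the binomial sum for C_MN(2, n)
    c_prev = 1.0
    c_curr = sum(
        math.comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
        for r in range(n + 1)
    )
    if K == 1:
        return lik / c_prev
    # Steps 5-7: lift from C_MN(2, n) up to C_MN(K, n)
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return lik / c_curr
```

For example, for the data [0, 1] with K = 2, the likelihood is 1/4 and C_MN(2, 2) = 5/2, giving an NML probability of 0.1.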

4.2. The quadratic-time algorithm

It turns out [16] that the recursive formula (15) can be generalized to the naive Bayes model family case.

Theorem 2. The terms C_NB(K₀, n) satisfy the recurrence

    C_NB(K₀, n) = Σ_{r₁+r₂=n} [ n! / (r₁! r₂!) ] (r₁/n)^{r₁} (r₂/n)^{r₂} C_NB(K*, r₁) C_NB(K₀ − K*, r₂),   (28)

where K* = 1, …, K₀ − 1.

Proof. See the appendix.

In many practical applications of the naive Bayes model, the quantity K₀ is unknown. Its value is typically determined as a part of the model class selection process. Consequently, it is necessary to compute NML for model classes M(K₀, K₁, …, K_m), where K₀ has a range of values, say, K₀ = 1, …, K_max. The process of computing NML for this case is described in Algorithm 2. The time complexity of the algorithm is O(n² K_max). If the value of K₀ is fixed, the time complexity drops to O(n² log K₀). See [16] for more details.
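As an illustration of how Theorem 2 is used in Algorithm 2, the following Python sketch (ours, not code from the paper; all function names are hypothetical) fills the tables C_MN(·, j) and C_NB(K₀, j) for j = 0, …, n, using the K* = 1 instance of the recurrence:

```python
import math

def cmn_table(K, n):
    """C_MN(K, j) for j = 0..n, built with the Theorem 1 recurrence."""
    rows = [[1.0] * (n + 1)]  # k = 1: C_MN(1, j) = 1
    if K == 1:
        return rows[0]
    # k = 2 by the binomial sum; j = 0 gives the empty-data convention 1
    rows.append([1.0] + [
        sum(math.comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
            for r in range(j + 1))
        for j in range(1, n + 1)])
    # C_MN(k+2, j) = C_MN(k+1, j) + (j/k) C_MN(k, j)
    for k in range(1, K - 1):
        rows.append([rows[k][j] + (j / k) * rows[k - 1][j]
                     for j in range(n + 1)])
    return rows[K - 1]

def cnb_table(K0, Ks, n):
    """C_NB(K0, j) for j = 0..n via the Theorem 2 recurrence with K* = 1,
    as in Algorithm 2; Ks holds the value counts K_1..K_m."""
    mn = [cmn_table(Ki, n) for Ki in Ks]
    # Algorithm 2, line 7: C_NB(1, j) = prod_i C_MN(K_i, j)
    base = [math.prod(col[j] for col in mn) for j in range(n + 1)]
    prev = base[:]
    for _ in range(K0 - 1):  # Algorithm 2, line 9, repeated up to K0
        prev = [1.0] + [
            sum(math.comb(j, r) * (r / j) ** r * ((j - r) / j) ** (j - r)
                * base[r] * prev[j - r] for r in range(j + 1))
            for j in range(1, n + 1)]
    return prev
```

With no primary variables (m = 0), C_NB(K₀, ·) collapses to the plain multinomial normalizer C_MN(K₀, ·), which gives a quick sanity check of the recurrence.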
5. NML FOR BAYESIAN FORESTS

The naive Bayes model discussed in the previous section has been successfully applied in various domains. In this section, we consider tree-structured Bayesian networks, which include the naive Bayes model as a special case but can also represent more complex dependencies.
5.1. The model family

As before, we assume m variables X₁, …, X_m with given value cardinalities K₁, …, K_m. Since the goal here is to model the joint probability distribution of the m variables, there is no need to mark a special variable. We assume a data matrix xⁿ = (x_ji) ∈ Xⁿ, 1 ≤ j ≤ n, 1 ≤ i ≤ m, as given.

A Bayesian network structure G encodes independence assumptions so that if each variable X_i is represented as a node in the network, then the joint probability distribution factorizes into a product of local probability distributions, one for each node, conditioned on its parent set. We define a Bayesian forest to be a Bayesian network structure G on the node set X₁, …, X_m which assigns at most one parent X_pa(i) to any node X_i. Consequently, a Bayesian tree is a connected Bayesian forest, and a Bayesian forest breaks down into component trees, that is, connected subgraphs. The root of each such component tree lacks a parent, in which case we write pa(i) = ∅.

The parent set of a node X_i thus reduces to a single value pa(i) ∈ {1, …, i−1, i+1, …, m, ∅}. Let further ch(i) denote the set of children of node X_i in G, and let ch(∅) denote the children of "none", that is, the roots of the component trees of G.

The corresponding model family F_BF can be indexed by the network structure G and the corresponding attribute value counts K₁, …, K_m:

    F_BF = { M(φ) : φ ∈ Φ_BF }                                        (29)

with Φ_BF = {1, …, |G|} × {1, 2, 3, …}^m, where G is associated with an integer according to some enumeration of all Bayesian forests on (X₁, …, X_m). As the K_i are assumed fixed, we abbreviate the corresponding model classes by M(G) := M(G, K₁, …, K_m).

Given a forest model class M(G), we index each model by a parameter vector θ in the corresponding parameter space Θ_G:

    Θ_G = { θ = (θ_ikl) : θ_ikl ≥ 0, Σ_l θ_ikl = 1,
            i = 1, …, m, k = 1, …, K_pa(i), l = 1, …, K_i },          (30)

where we define K_∅ := 1 in order to unify the notation for root and non-root nodes. Each such θ_ikl defines a probability

    θ_ikl = P(X_i = l | X_pa(i) = k, M(G), θ),                        (31)

where we interpret X_∅ = 1 as a null condition.

The joint probability that a model M = (G, θ) assigns to a data vector x = (x₁, …, x_m) becomes

    P(x | M(G), θ) = Π_{i=1}^{m} P(X_i = x_i | X_pa(i) = x_pa(i), M(G), θ)
                   = Π_{i=1}^{m} θ_{i, x_pa(i), x_i}.                 (32)


1: Compute C_MN(k, j) for k = 1, …, V_max and j = 0, …, n, where V_max = max{K₁, …, K_m}
2: for K₀ = 1 to K_max do
3:   Count the frequencies h₁, …, h_{K₀} and f_ik1, …, f_ikK_i for i = 1, …, m, k = 1, …, K₀ from the data xⁿ
4:   Compute the likelihood P(xⁿ | θ̂(xⁿ, M(K₀, K₁, …, K_m))) = Π_{k=1}^{K₀} (h_k/n)^{h_k} Π_{i=1}^{m} Π_{l=1}^{K_i} (f_ikl/h_k)^{f_ikl}
5:   Set C_NB(K₀, 0) = 1
6:   if K₀ = 1 then
7:     Compute C_NB(1, j) = Π_{i=1}^{m} C_MN(K_i, j) for j = 1, …, n
8:   else
9:     Compute C_NB(K₀, j) = Σ_{r₁+r₂=j} (j!/(r₁! r₂!)) (r₁/j)^{r₁} (r₂/j)^{r₂} C_NB(1, r₁) C_NB(K₀ − 1, r₂) for j = 1, …, n
10:  end if
11:  Output P_NML(xⁿ | M(K₀, K₁, …, K_m)) = P(xⁿ | θ̂(xⁿ, M(K₀, K₁, …, K_m))) / C_NB(K₀, n)
12: end for

Algorithm 2: The algorithm for computing P_NML(xⁿ | M(K₀, K₁, …, K_m)) for K₀ = 1, …, K_max.

For a sample xⁿ = (x_ji) of n vectors x_j, we define the corresponding frequencies as

    f_ikl := |{ j : x_ji = l ∧ x_j,pa(i) = k }|,
    f_il := |{ j : x_ji = l }| = Σ_{k=1}^{K_pa(i)} f_ikl.             (33)

By definition, for any component tree root X_i, we have f_il = f_i1l. The probability assigned to a sample xⁿ can then be written as

    P(xⁿ | M(G), θ) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} θ_ikl^{f_ikl},   (34)

which is maximized at

    θ̂_ikl(xⁿ, M(G)) = f_ikl / f_pa(i),k,                             (35)

where we define f_∅,1 := n. The maximum data likelihood thereby is

    P(xⁿ | θ̂(xⁿ, M(G))) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} (f_ikl / f_pa(i),k)^{f_ikl}.   (36)

5.2. The algorithm

The goal is to calculate the NML distribution P_NML(xⁿ | M(G)) defined in (3). This consists of calculating the maximum data likelihood (36) and the normalizing term C(M(G), n) given in (4). The former involves frequency counting, one sweep through the data, and multiplication of the appropriate values; this can be done in time O(n + Σ_i K_i K_pa(i)). The latter involves a sum exponential in n, which clearly makes it the computational bottleneck of the algorithm.

Our approach is to break up the normalizing sum in (4) into terms corresponding to subtrees with given frequencies in either their root or its parent. We then calculate the complete sum by sweeping through the graph once, bottom-up. Let us now introduce some necessary notation.

Let G be a given Bayesian forest. For any node X_i, denote the subtree rooting in X_i by G_sub(i) and the forest built up by all descendants of X_i by G_dsc(i). The corresponding data domains are X_sub(i) and X_dsc(i), respectively. Denote the sum over all n-instantiations of a subtree by

    C_i(M(G), n) := Σ_{xⁿ_sub(i) ∈ Xⁿ_sub(i)} P(xⁿ_sub(i) | θ̂(xⁿ_sub(i), M(G_sub(i))), M(G_sub(i))),   (37)

and for any vector x_iⁿ ∈ X_iⁿ with frequencies f_i = (f_i1, …, f_iK_i), define

    C_i(M(G), n | f_i) := Σ_{xⁿ_dsc(i) ∈ Xⁿ_dsc(i)} P(xⁿ_dsc(i), x_iⁿ | θ̂(xⁿ_dsc(i), x_iⁿ, M(G_sub(i))), M(G_sub(i)))   (38)

to be the corresponding sum with fixed root instantiation, summing only over the attribute space spanned by the descendants of X_i.

Note that we use f_i on the left-hand side and x_iⁿ on the right-hand side of the definition. This needs to be justified. Interestingly, while the terms in the sum depend on the ordering of x_iⁿ, the sum itself depends on x_iⁿ only through its frequencies f_i. To see this, pick any two representatives x_iⁿ and x̄_iⁿ of f_i and find, for example after lexicographical ordering of the elements, that

    { (x_iⁿ, xⁿ_dsc(i)) : xⁿ_dsc(i) ∈ Xⁿ_dsc(i) } = { (x̄_iⁿ, xⁿ_dsc(i)) : xⁿ_dsc(i) ∈ Xⁿ_dsc(i) }.   (39)

Next, we need to define the corresponding sums over X_sub(i) with the frequencies at the subtree root parent X_pa(i) given.

For any f_pa(i) corresponding to some xⁿ_pa(i) ∈ Xⁿ_pa(i), define

    L_i(M(G), n | f_pa(i)) := Σ_{xⁿ_sub(i) ∈ Xⁿ_sub(i)} P(xⁿ_sub(i) | xⁿ_pa(i), θ̂(xⁿ_sub(i), xⁿ_pa(i), M(G_sub(i))), M(G_sub(i))).   (40)

Again, this is well defined, since any other representative x̄ⁿ_pa(i) of f_pa(i) yields a sum over the same terms modulo their ordering.

Having introduced this notation, we now briefly outline the algorithm; the following subsections give a more detailed description of the steps involved. As stated before, we go through G bottom-up. At each inner node X_i, we receive the messages L_j(M(G), n | f_i) from each child X_j, j ∈ ch(i). Correspondingly, we are required to send L_i(M(G), n | f_pa(i)) up to the parent X_pa(i). At each component tree root X_i, we then calculate the sum C_i(M(G), n) for the whole connectivity component, and finally we combine these sums to get the normalizer C(M(G), n) for the complete forest G.

5.2.1. Leaves

For a leaf node X_i, we can calculate L_i(M(G), n | f_pa(i)) without listing its own frequencies f_i. As in (27), f_pa(i) splits the n data vectors into K_pa(i) subsets of sizes f_pa(i),1, …, f_pa(i),K_pa(i), and each of them can be modeled independently as a multinomial; we have

    L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k).   (41)

The terms C_MN(K_i, n′) (for n′ = 0, …, n) can be precalculated using the recurrence (19), as in Algorithm 1.

5.2.2. Inner nodes

For inner nodes X_i, we divide the task into two steps. First, we collect the child messages L_j(M(G), n | f_i) sent by each child X_j ∈ ch(i) into partial sums C_i(M(G), n | f_i) over X_dsc(i), and then we lift these to sums L_i(M(G), n | f_pa(i)) over X_sub(i), which are the messages to the parent.

The first step is simple. Given an instantiation x_iⁿ at X_i or, equivalently, the corresponding frequencies f_i, the subtrees rooting in the children ch(i) of X_i become independent of each other. Thus we have

    C_i(M(G), n | f_i)
      = Σ_{xⁿ_dsc(i) ∈ Xⁿ_dsc(i)} P(xⁿ_dsc(i), x_iⁿ | θ̂(xⁿ_dsc(i), x_iⁿ, M(G_sub(i))), M(G_sub(i)))   (42)
      = Σ_{xⁿ_dsc(i) ∈ Xⁿ_dsc(i)} P(x_iⁿ | θ̂(xⁿ_dsc(i), x_iⁿ, M(G_sub(i)))) Π_{j ∈ ch(i)} P(xⁿ_dsc(i)|sub(j) | x_iⁿ, θ̂(xⁿ_dsc(i), x_iⁿ, M(G_sub(i))))   (43)
      = P(x_iⁿ | θ̂(xⁿ_dsc(i), x_iⁿ, M(G_sub(i)))) Π_{j ∈ ch(i)} Σ_{xⁿ_sub(j) ∈ Xⁿ_sub(j)} P(xⁿ_sub(j) | x_iⁿ, θ̂(xⁿ_sub(j), x_iⁿ, M(G_sub(j))))   (44)
      = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j ∈ ch(i)} L_j(M(G), n | f_i),   (45)

where xⁿ_dsc(i)|sub(j) is the restriction of xⁿ_dsc(i) to the columns corresponding to nodes in G_sub(j). We have used (38) for (42), (32) for (43) and (44), and finally (36) and (40) for (45).

Now we need to calculate the outgoing message L_i(M(G), n | f_pa(i)) from the incoming messages we have just combined into C_i(M(G), n | f_i). This is the most demanding part of the algorithm, for we need to list all possible conditional frequencies, of which there are O(n^{K_i K_pa(i) − 1}) many, the −1 being due to the sum-to-n constraint. For fixed i, we arrange the conditional frequencies f_ikl into a matrix F = (f_ikl) and define its marginals

    σ(F) := ( Σ_k f_ik1, …, Σ_k f_ikK_i ),
    ρ(F) := ( Σ_l f_i1l, …, Σ_l f_iK_pa(i)l )                          (46)

to be the vectors obtained by summing the rows of F and the columns of F, respectively. Each such matrix then corresponds to a term C_i(M(G), n | σ(F)) and a term L_i(M(G), n | ρ(F)). Formally, we have

    L_i(M(G), n | f_pa(i)) = Σ_{F : ρ(F) = f_pa(i)} C_i(M(G), n | σ(F)).   (47)

5.2.3. Component tree roots

For a component tree root X_i ∈ ch(∅), we do not need to pass any message upward. All we need is the complete sum over the component tree,

    C_i(M_G, n) = Σ_{f_i} [ n! / (f_i1! ⋯ f_iK_i!) ] C_i(M_G, n | f_i),   (48)

where the C_i(M_G, n | f_i) are calculated from (45). The summation goes over all nonnegative integer vectors f_i summing to n. The above holds trivially, since we sum over all instantiations x_iⁿ of X_i and group like terms, corresponding to the same frequency vector f_i, while keeping track of their respective count, namely n!/(f_i1! ⋯ f_iK_i!).

5.2.4. The algorithm

For the complete forest G, we simply multiply the sums over its tree components.


1: Count all frequencies f_ikl and f_il from the data xⁿ
2: Compute P(xⁿ | θ̂(xⁿ, M(G))) = Π_{i=1}^{m} Π_{k=1}^{K_pa(i)} Π_{l=1}^{K_i} (f_ikl / f_pa(i),k)^{f_ikl}
3: for k = 1, …, K_max := max_{i: X_i is a leaf} {K_i} and n′ = 0, …, n do
4:   Compute C_MN(k, n′) as in Algorithm 1
5: end for
6: for each node X_i in some bottom-up order do
7:   if X_i is a leaf then
8:     for each frequency vector f_pa(i) of X_pa(i) do
9:       Compute L_i(M(G), n | f_pa(i)) = Π_{k=1}^{K_pa(i)} C_MN(K_i, f_pa(i),k)
10:    end for
11:  else if X_i is an inner node then
12:    for each frequency vector f_i of X_i do
13:      Compute C_i(M(G), n | f_i) = Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
14:    end for
15:    Initialize L_i ≡ 0
16:    for each nonnegative K_i × K_pa(i) integer matrix F with entries summing to n do
17:      L_i(M(G), n | ρ(F)) += C_i(M(G), n | σ(F))
18:    end for
19:  else if X_i is a component tree root then
20:    Compute C_i(M(G), n) = Σ_{f_i} [ n!/(f_i1! ⋯ f_iK_i!) ] Π_{l=1}^{K_i} (f_il/n)^{f_il} Π_{j∈ch(i)} L_j(M(G), n | f_i)
21:  end if
22: end for
23: Compute C(M(G), n) = Π_{i∈ch(∅)} C_i(M(G), n)
24: Output P_NML(xⁿ | M(G)) = P(xⁿ | θ̂(xⁿ, M(G))) / C(M(G), n)

Algorithm 3: The algorithm for computing P_NML(xⁿ | M(G)) for a Bayesian forest G.
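To make the bottom-up computation concrete, consider the smallest nontrivial forest: a single tree with root X₁ (K₁ values) and one leaf child X₂ (K₂ values). Then only the leaf message (41) and the root sum (48) are needed. The following Python sketch is our illustration, not code from the paper, and the function names are hypothetical:

```python
import math

def cmn(K, n):
    """C_MN(K, n) via the Theorem 1 recurrence (Algorithm 1, steps 3-7)."""
    if K == 1 or n == 0:
        return 1.0
    prev, curr = 1.0, sum(
        math.comb(n, r) * (r / n) ** r * ((n - r) / n) ** (n - r)
        for r in range(n + 1))
    for k in range(1, K - 1):
        prev, curr = curr, curr + (n / k) * prev
    return curr

def compositions(n, parts):
    """All nonnegative integer vectors of length `parts` summing to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest

def forest_normalizer(K1, K2, n):
    """C(M(G), n) for the two-node tree X1 -> X2: for each root frequency
    vector f_1, the leaf contributes L_2(n | f_1) = prod_k C_MN(K2, f_1k)
    as in Eq. (41), and the root sum (48) weighs each f_1 by its
    multinomial count n!/(f_11! ... f_1K1!)."""
    total = 0.0
    for f1 in compositions(n, K1):
        coef = math.factorial(n) / math.prod(math.factorial(h) for h in f1)
        lik_root = math.prod((h / n) ** h for h in f1)
        leaf_msg = math.prod(cmn(K2, h) for h in f1)
        total += coef * lik_root * leaf_msg
    return total
```

Setting K₁ = 1 makes the root degenerate and recovers the plain multinomial normalizer C_MN(K₂, n); likewise, a leaf with K₂ = 1 reduces the sum to C_MN(K₁, n). Both serve as quick sanity checks.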

Since these are independent of each other, in analogy to (42)–(45), we have

    C(M_G, n) = Π_{i ∈ ch(∅)} C_i(M_G, n).                            (49)

Algorithm 3 collects all of the above into pseudocode. The time complexity of this algorithm is O(n^{K_i K_pa(i) − 1}) for each inner node X_i, O(n(n + K_i)) for each leaf, and O(n^{K_i − 1}) for a component tree root of G. When all m′ < m inner nodes are binary, it runs in O(m′ n³), independently of the number of values of the leaf nodes. This is polynomial with respect to the sample size n, while applying (4) directly for computing C(M(G), n) requires exponential time. The order of the polynomial depends on the attribute cardinalities: the algorithm is exponential with respect to the number of values a non-leaf variable can take.

Finally, note that we can speed up the algorithm when G contains multiple copies of some subtree. Also, we have C_i/L_i(M_G, n | f_i) = C_i/L_i(M_G, n | π(f_i)) for any permutation π of the entries of f_i. However, this does not lead to a considerable gain, at least not in order of magnitude. We can also see that in line 16 of the algorithm we enumerate all frequency matrices F, while in line 17 we sum the same terms whenever the marginals of F are the same. Unfortunately, counting the nonnegative integer matrices with given marginals is a #P-hard problem already when one of the matrix dimensions is fixed to 2, as proven in [33]. This suggests that for this task there may not exist an algorithm that is polynomial in all input quantities. The algorithm presented here is polynomial both in the sample size n and in the graph size m; for attributes with relatively few values, the polynomial running time is tolerable.

6. CONCLUSION

The normalized maximum likelihood (NML) offers a universal, minimax optimal approach to statistical modeling. In this paper, we have surveyed efficient algorithms for computing the NML in the case of discrete datasets. The model families used in our work are Bayesian networks of varying complexity. The simplest model we discussed is the multinomial model family, which can be applied to problems related to density estimation or discretization. In this case, the NML can be computed in linear time. The same result also applies to a network of independent multinomial variables, that is, a Bayesian network with no arcs.
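As a concrete illustration of this linear-time computation, the following Python sketch (our own; the function name is not from the paper) evaluates the multinomial normalizer C_MN(K, n) using the recursion C_MN(K+2, n) = C_MN(K+1, n) + (n/K) C_MN(K, n) established in Theorem 1, with the base cases C_MN(1, n) = 1 and a direct O(n) sum for C_MN(2, n).

```python
from math import lgamma, exp, log

def multinomial_normalizer(K, n):
    """C_MN(K, n): the NML normalizing sum ("parametric complexity") of a
    multinomial with K values and sample size n, in O(n + K) time."""
    if n == 0 or K == 1:
        return 1.0
    # Base case C_MN(2, n): direct summation over the splits r1 + r2 = n,
    # with each term evaluated in the log domain for numerical stability.
    c_prev = 1.0  # C_MN(1, n)
    c_curr = 0.0  # C_MN(2, n), accumulated below
    for r1 in range(n + 1):
        r2 = n - r1
        log_term = lgamma(n + 1) - lgamma(r1 + 1) - lgamma(r2 + 1)
        if r1 > 0:
            log_term += r1 * (log(r1) - log(n))
        if r2 > 0:
            log_term += r2 * (log(r2) - log(n))
        c_curr += exp(log_term)
    # Recursion of Theorem 1: C_MN(k+2, n) = C_MN(k+1, n) + (n/k) C_MN(k, n).
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr  # = C_MN(K, n)
```

As a sanity check, `multinomial_normalizer(3, 1)` gives 3.0, matching the defining sum: with a single observation, each of the three categories contributes one term of weight 1.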
For the naive Bayes model family, the NML can be computed in quadratic time. Models of this type have been
used extensively in clustering or classification domains with
good results. Finally, to be able to represent more complex dependencies between the problem domain variables,
we also considered tree-structured Bayesian networks. We
showed how to compute the NML in this case in polynomial time with respect to the sample size, but the order of
the polynomial depends on the number of values of the domain variables, which makes our result impractical for some
domains.

Petri Kontkanen et al.

The methods presented are especially suitable for problems in bioinformatics, which typically involve multidimensional discrete datasets. Furthermore, unlike the Bayesian methods, information-theoretic approaches such as ours do not require a prior for the model parameters. This is the most important aspect, as constructing a reasonable parameter prior is a notoriously difficult problem, particularly in bioinformatical domains involving novel types of data with little background knowledge. All in all, information theory has been found to offer a natural and successful theoretical framework for biological applications in general, which makes NML an appealing choice for bioinformatics.
In the future, our plan is to extend the current work
to more complex cases such as general Bayesian networks,
which would allow the use of NML in even more involved modeling tasks. Another natural area of future work
is to apply the methods of this paper to practical tasks
involving large discrete databases and compare the results to other approaches, such as those based on Bayesian
statistics.

APPENDIX

PROOFS OF THEOREMS

In this section, we provide detailed proofs of two theorems presented in the paper.

Proof of Theorem 1 (multinomial recursion)

We start by proving the following lemma.

Lemma 3. For the tree function T(z) we have

    z T'(z) = \frac{T(z)}{1 - T(z)}.    (A.1)

Proof. A basic property of the tree function is the functional equation T(z) = z e^{T(z)} (see, e.g., [23]). Differentiating this equation yields T'(z) = e^{T(z)} + T(z) T'(z), which gives

    z T'(z) (1 - T(z)) = z e^{T(z)} = T(z),    (A.2)

from which (A.1) follows.

Now we can proceed to the proof of the theorem. We start by multiplying and differentiating (17) as follows:

    z \frac{d}{dz} \sum_{n \ge 0} \frac{n^n}{n!} C_{MN}(K, n) z^n
        = z \sum_{n \ge 1} n \frac{n^n}{n!} C_{MN}(K, n) z^{n-1}    (A.3)
        = \sum_{n \ge 0} n \frac{n^n}{n!} C_{MN}(K, n) z^n.    (A.4)

On the other hand, by manipulating (18) in the same way, we get

    z \frac{d}{dz} \frac{1}{(1 - T(z))^K}
        = \frac{z K T'(z)}{(1 - T(z))^{K+1}}    (A.5)
        = \frac{K}{(1 - T(z))^{K+1}} \cdot \frac{T(z)}{1 - T(z)}    (A.6)
        = K \left( \frac{1}{(1 - T(z))^{K+2}} - \frac{1}{(1 - T(z))^{K+1}} \right)    (A.7)
        = K \left( \sum_{n \ge 0} \frac{n^n}{n!} C_{MN}(K+2, n) z^n
            - \sum_{n \ge 0} \frac{n^n}{n!} C_{MN}(K+1, n) z^n \right),    (A.8)

where (A.6) follows from Lemma 3. Comparing the coefficients of z^n in (A.4) and (A.8), we get

    n C_{MN}(K, n) = K \left( C_{MN}(K+2, n) - C_{MN}(K+1, n) \right),    (A.9)

from which the theorem follows.

Proof of Theorem 2 (naive Bayes recursion)

We have

    C_{NB}(K_0, n)
        = \sum_{h_1 + \cdots + h_{K_0} = n} \frac{n!}{h_1! \cdots h_{K_0}!}
          \prod_{k=1}^{K_0} \left( \frac{h_k}{n} \right)^{h_k}
          \prod_{i=1}^{m} \prod_{k=1}^{K_0} C_{MN}(K_i, h_k)
        = \sum_{h_1 + \cdots + h_{K_0} = n} \frac{n!}{n^n}
          \prod_{k=1}^{K_0} \frac{h_k^{h_k}}{h_k!}
          \prod_{i=1}^{m} \prod_{k=1}^{K_0} C_{MN}(K_i, h_k)
        = \sum_{r_1 + r_2 = n}
          \sum_{\substack{h_1 + \cdots + h_{K^*} = r_1 \\ h_{K^*+1} + \cdots + h_{K_0} = r_2}}
          \frac{n!}{n^n} \frac{r_1^{r_1} r_2^{r_2}}{r_1! \, r_2!}
          \frac{r_1!}{r_1^{r_1}} \prod_{k=1}^{K^*} \frac{h_k^{h_k}}{h_k!}
          \frac{r_2!}{r_2^{r_2}} \prod_{k=K^*+1}^{K_0} \frac{h_k^{h_k}}{h_k!}
          \prod_{i=1}^{m} \prod_{k=1}^{K_0} C_{MN}(K_i, h_k)
        = \sum_{r_1 + r_2 = n} \frac{n!}{r_1! \, r_2!}
          \left( \frac{r_1}{n} \right)^{r_1} \left( \frac{r_2}{n} \right)^{r_2}
          \left[ \sum_{h_1 + \cdots + h_{K^*} = r_1} \frac{r_1!}{h_1! \cdots h_{K^*}!}
            \prod_{k=1}^{K^*} \left( \frac{h_k}{r_1} \right)^{h_k}
            \prod_{i=1}^{m} C_{MN}(K_i, h_k) \right]
          \left[ \sum_{h_{K^*+1} + \cdots + h_{K_0} = r_2} \frac{r_2!}{h_{K^*+1}! \cdots h_{K_0}!}
            \prod_{k=K^*+1}^{K_0} \left( \frac{h_k}{r_2} \right)^{h_k}
            \prod_{i=1}^{m} C_{MN}(K_i, h_k) \right]
        = \sum_{r_1 + r_2 = n} \frac{n!}{r_1! \, r_2!}
          \left( \frac{r_1}{n} \right)^{r_1} \left( \frac{r_2}{n} \right)^{r_2}
          C_{NB}(K^*, r_1) \, C_{NB}(K_0 - K^*, r_2),    (A.10)

and the proof follows.
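The recursion (A.10) translates directly into a simple dynamic program. The sketch below (our own illustration; the helper names are not from the paper) repeatedly splits off one root value (K* = 1), so that C_NB(1, r), the case of an effectively constant root, is just a product of independent multinomial normalizers.

```python
from math import lgamma, exp, log

def c_mn(K, n):
    """C_MN(K, n) via the recursion of Theorem 1, with the base case
    C_MN(2, n) computed by direct summation in the log domain."""
    if n == 0 or K == 1:
        return 1.0
    c_prev, c_curr = 1.0, 0.0  # C_MN(1, n), then C_MN(2, n)
    for r1 in range(n + 1):
        r2 = n - r1
        t = lgamma(n + 1) - lgamma(r1 + 1) - lgamma(r2 + 1)
        if r1: t += r1 * (log(r1) - log(n))
        if r2: t += r2 * (log(r2) - log(n))
        c_curr += exp(t)
    for k in range(1, K - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr

def c_nb(K0, n, leaf_cards):
    """C_NB(K0, n) for a naive Bayes model with root cardinality K0 and
    leaf cardinalities leaf_cards, using recursion (A.10) with K* = 1."""
    # C_NB(1, r): with a one-valued root, the leaves are independent
    # multinomials on r data points.
    base = [1.0] * (n + 1)
    for r in range(n + 1):
        for Ki in leaf_cards:
            base[r] *= c_mn(Ki, r)
    cur = base[:]                # C_NB(1, r) for r = 0..n
    for _ in range(K0 - 1):      # grow the root cardinality one value at a time
        nxt = [0.0] * (n + 1)
        for m in range(n + 1):
            for r1 in range(m + 1):
                r2 = m - r1
                w = lgamma(m + 1) - lgamma(r1 + 1) - lgamma(r2 + 1)
                if r1: w += r1 * (log(r1) - log(m))
                if r2: w += r2 * (log(r2) - log(m))
                nxt[m] += exp(w) * base[r1] * cur[r2]
        cur = nxt
    return cur[n]
```

With no leaves (`leaf_cards=[]`) the model reduces to a plain multinomial on the root, so `c_nb(K0, n, [])` agrees with `c_mn(K0, n)`, which provides a convenient sanity check; each of the K0 − 1 steps costs O(n^2), matching the quadratic-time result quoted in the conclusion.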

ACKNOWLEDGMENTS
The authors would like to thank the anonymous reviewers
and Jorma Rissanen for useful comments. This work was
supported in part by the Academy of Finland under the
project Civi and by the Finnish Funding Agency for Technology and Innovation under the projects Kukot and PMMA. In
addition, this work was supported in part by the IST Programme of the European Community, under the PASCAL
Network of Excellence, IST-2002-506778. This publication only reflects the authors' views.
REFERENCES
[1] G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Transactions on Information Systems, vol. 23, no. 1, pp. 3–34, 2005.
[2] R. Tibshirani, T. Hastie, M. Eisen, D. Ross, D. Botstein, and B. Brown, "Clustering methods for the analysis of DNA microarray data," Tech. Rep., Department of Health Research and Policy, Stanford University, Stanford, Calif, USA, 1999.
[3] W. Pan, J. Lin, and C. T. Le, "Model-based cluster analysis of microarray gene-expression data," Genome Biology, vol. 3, no. 2, pp. 1–8, 2002.
[4] G. J. McLachlan, R. W. Bean, and D. Peel, "A mixture model-based approach to the clustering of microarray expression data," Bioinformatics, vol. 18, no. 3, pp. 413–422, 2002.
[5] A. J. Hartemink, D. K. Gifford, T. S. Jaakkola, and R. A. Young, "Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks," in Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01), pp. 422–433, The Big Island of Hawaii, Hawaii, USA, January 2001.
[6] J. Rissanen, "Modeling by shortest data description," Automatica, vol. 14, no. 5, pp. 465–471, 1978.
[7] J. Rissanen, "Stochastic complexity," Journal of the Royal Statistical Society, Series B, vol. 49, no. 3, pp. 223–239, 1987, with discussions, pp. 223–265.
[8] J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, pp. 40–47, 1996.
[9] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, no. 3, pp. 175–186, 1987.
[10] A. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Transactions on Information Theory, vol. 44, no. 6, pp. 2743–2760, 1998.
[11] J. Rissanen, "Strong optimality of the normalized ML models as universal codes and information in data," IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1712–1717, 2001.
[12] P. Grünwald, The Minimum Description Length Principle, The MIT Press, Cambridge, Mass, USA, 2007.
[13] J. Rissanen, Information and Complexity in Statistical Modeling, Springer, New York, NY, USA, 2007.
[14] D. Heckerman, "A tutorial on learning with Bayesian networks," Tech. Rep. MSR-TR-95-06, Microsoft Research, Advanced Technology Division, One Microsoft Way, Redmond, Wash, USA, 98052, 1996.
[15] P. Kontkanen and P. Myllymäki, "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, vol. 103, no. 6, pp. 227–233, 2007.



[16] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri, "An MDL framework for data clustering," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., The MIT Press, Cambridge, Mass, USA, 2006.
[17] Q. Xie and A. R. Barron, "Asymptotic minimax regret for data compression, gambling, and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 431–445, 2000.
[18] V. Balasubramanian, "MDL, Bayesian inference, and the geometry of the space of probability distributions," in Advances in Minimum Description Length: Theory and Applications, P. Grünwald, I. J. Myung, and M. Pitt, Eds., pp. 81–98, The MIT Press, Cambridge, Mass, USA, 2006.
[19] P. Kontkanen and P. Myllymäki, "MDL histogram density estimation," in Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS '07), San Juan, Puerto Rico, USA, March 2007.
[20] P. Kontkanen, W. Buntine, P. Myllymäki, J. Rissanen, and H. Tirri, "Efficient computation of stochastic complexity," in Proceedings of the 9th International Conference on Artificial Intelligence and Statistics, C. Bishop and B. Frey, Eds., pp. 233–238, Society for Artificial Intelligence and Statistics, Key West, Fla, USA, January 2003.
[21] M. Koivisto, "Sum-Product Algorithms for the Analysis of Genetic Risks," Tech. Rep. A-2004-1, Department of Computer Science, University of Helsinki, Helsinki, Finland, 2004.
[22] P. Kontkanen and P. Myllymäki, "A fast normalized maximum likelihood algorithm for multinomial data," in Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), Edinburgh, Scotland, August 2005.
[23] D. E. Knuth and B. Pittel, "A recurrence related to trees," Proceedings of the American Mathematical Society, vol. 105, no. 2, pp. 335–349, 1989.
[24] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, "On the Lambert W function," Advances in Computational Mathematics, vol. 5, no. 1, pp. 329–359, 1996.
[25] W. Szpankowski, Average Case Analysis of Algorithms on Sequences, John Wiley & Sons, New York, NY, USA, 2001.
[26] P. Flajolet and A. M. Odlyzko, "Singularity analysis of generating functions," SIAM Journal on Discrete Mathematics, vol. 3, no. 2, pp. 216–240, 1990.
[27] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[28] P. Kontkanen, P. Myllymäki, and H. Tirri, "Constructing Bayesian finite mixture models by the EM algorithm," Tech. Rep. NC-TR-97-003, ESPRIT Working Group on Neural and Computational Learning (NeuroCOLT), Helsinki, Finland, 1997.
[29] P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "On Bayesian case matching," in Proceedings of the 4th European Workshop on Advances in Case-Based Reasoning (EWCBR '98), B. Smyth and P. Cunningham, Eds., vol. 1488 of Lecture Notes in Computer Science, pp. 13–24, Springer, Dublin, Ireland, September 1998.
[30] P. Grünwald, P. Kontkanen, P. Myllymäki, T. Silander, and H. Tirri, "Minimum encoding approaches for predictive modeling," in Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI '98), G. Cooper and S. Moral, Eds., pp. 183–192, Morgan Kaufmann, Madison, Wis, USA, July 1998.
[31] P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P. Grünwald, "On predictive distributions and Bayesian networks," Statistics and Computing, vol. 10, no. 1, pp. 39–54, 2000.



[32] P. Kontkanen, J. Lahtinen, P. Myllymäki, T. Silander, and H. Tirri, "Supervised model-based visualization of high-dimensional data," Intelligent Data Analysis, vol. 4, no. 3-4, pp. 213–227, 2000.
[33] M. Dyer, R. Kannan, and J. Mount, "Sampling contingency tables," Random Structures and Algorithms, vol. 10, no. 4, pp. 487–506, 1997.

