BioMed Central
Open Access
Software
doi:10.1186/1471-2148-7-214
Abstract
Background: The evolutionary analysis of molecular sequence variation is a statistical enterprise.
This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple
sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible
software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree.
A large number of popular stochastic models of sequence evolution are provided and tree-based
models suitable for both within- and between-species sequence data are implemented.
Results: BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81
packages. It provides models for DNA and protein sequence evolution, highly parametric
coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical
alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the
GNU LGPL license.
Conclusion: BEAST is a powerful and flexible evolutionary analysis package for molecular
sequence variation. It also provides a resource for the further development of new models and
statistical methods of evolutionary analysis.
Background
Evolution and statistics are two common themes that pervade the modern analysis of molecular sequence variation. It is now widely accepted that most questions
regarding molecular sequences are statistical in nature and
should be framed in terms of parameter estimation and
hypothesis testing. Similarly, it is evident that an evolutionary
perspective is required to produce models that accurately describe molecular sequence variation.
The BEAST software package is an ambitious attempt to
provide a general framework for parameter estimation
and hypothesis testing of evolutionary models from
Page 1 of 8
(page number not for citation purposes)
http://www.biomedcentral.com/1471-2148/7/214
tion so that researchers can tailor their evolutionary analyses to their specific set of questions.
The ambition of this project requires teamwork, and we
hope that by making the source code of BEAST freely
available, the range of models implemented, while
already large, will continue to grow and diversify.
Substitution model: The substitution model is a homogeneous Markov process that defines the relative rates at
which different substitutions occur along a branch in the
tree.
Rate model among sites: The rate model among sites
defines the distribution of relative rates of evolutionary
change among sites.
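As a minimal sketch of how a substitution model and a rate model among sites combine, the following Python uses the Jukes-Cantor (JC69) model. JC69 is chosen here only because its transition probabilities have a simple closed form; the code is not taken from the BEAST source, and the function name and the rate values are hypothetical.

```python
import math


def jc69_transition_prob(same: bool, branch_length: float, site_rate: float = 1.0) -> float:
    """Transition probability along one branch under JC69.

    branch_length is in expected substitutions per site.
    site_rate is a relative rate multiplier for the site (assumed to
    average 1 across sites); it simply rescales the branch length.
    Returns the probability of ending in the same nucleotide (same=True)
    or in one specific different nucleotide (same=False).
    """
    d = branch_length * site_rate
    decay = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay


# Illustrative (made-up) relative rates for three codon positions, mean 1.
codon_position_rates = [0.5, 0.3, 2.2]

for position, rate in enumerate(codon_position_rates, start=1):
    p_same = jc69_transition_prob(True, 0.1, rate)
    p_diff = jc69_transition_prob(False, 0.1, rate)
    # Each row of the transition matrix sums to one: p_same + 3 * p_diff.
    print(position, round(p_same, 4), round(p_diff, 4), round(p_same + 3 * p_diff, 4))
```

Faster sites (larger multipliers) behave as if they sat on a longer branch, so the probability of observing a change is higher for them.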
If the sequence data are all from one time point, then the
overall evolutionary rate must be specified with a strong
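\[ \mathrm{BF} = \frac{p(D \mid M_1)}{p(D \mid M_2)} \tag{1} \]
\[ p(D \mid M) = \int_{\theta} \Pr(D \mid \theta, M)\, p(\theta \mid M)\, d\theta \tag{2} \]
\[ m_{\mathrm{HM}}(D \mid M) = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\Pr(D \mid \theta^{(i)}, M)} \right)^{-1}; \quad \theta^{(i)} \sim p(\theta \mid D, M) \tag{3} \]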
This estimator does not always behave well, but there
are a number of modifications that can be used to stabilize
it, and bootstrapping can be employed to assess the uncertainty in the estimated marginal likelihoods. In general, a
BF > 20 is strong support for the favoured model (M1 in
equation 1).
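The harmonic mean estimator of equation 3 is usually evaluated on log-likelihoods, because the raw likelihoods underflow. The Python below is a minimal sketch of that calculation, assuming that the input is a list of log-likelihood values sampled from the posterior of each model. The function names and the sample values are hypothetical, and none of the stabilizing modifications mentioned above is included.

```python
import math


def log_sum_exp(values):
    """Numerically stable ln(sum(exp(v))) for a list of floats."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values))


def log_harmonic_mean(log_likelihoods):
    """ln of the harmonic mean of the likelihoods (equation 3)."""
    n = len(log_likelihoods)
    # ln[(1/N) * sum_i 1/L_i] = logsumexp(-ln L_i) - ln N; negate to invert.
    return -(log_sum_exp([-ll for ll in log_likelihoods]) - math.log(n))


def log_bayes_factor(log_liks_m1, log_liks_m2):
    """ln BF (equation 1) with each marginal likelihood estimated by equation 3."""
    return log_harmonic_mean(log_liks_m1) - log_harmonic_mean(log_liks_m2)


# Made-up posterior log-likelihood samples for two models.
samples_m1 = [-3655.2, -3657.9, -3656.4, -3654.8]
samples_m2 = [-3750.1, -3752.6, -3751.0, -3749.7]

ln_bf = log_bayes_factor(samples_m1, samples_m2)
# BF > 20 corresponds to ln BF > ln 20.
print(ln_bf > math.log(20))
```

Because the harmonic mean is dominated by the smallest likelihoods in the sample, a single poorly fitting draw can shift the estimate noticeably, which is one reason the estimator can behave badly.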
Example
We demonstrate some of the key features of a Bayesian
analysis on a sample of 17 dengue virus serotype 4
sequences, isolated at dates ranging from 1956 to 1994
(see [30] for details). Like many RNA viruses, dengue virus
evolves at a rapid rate and as a result the sampling times
of the 17 isolates can be used by BEAST as calibrations to
estimate the overall substitution rate and the divergence
times in years. We analyzed the data under both a codon-position-specific substitution model (GTR + CP), in which
each codon position had a separate relative substitution
rate parameter, and the standard GTR + Γ + I model.
Both of these models have the same number of free
parameters. We also investigated two different models of
rate variation among branches: the strict clock and the
uncorrelated lognormal-distributed relaxed molecular
clock. We used the constant population size coalescent as
the tree prior. For each model the MCMC was run for
10,000,000 steps and sampled every 500 steps. The first
100,000 steps of each run were discarded as burn-in. This
resulted in effective sample sizes for the posterior probability of much more than 1000 for all four analyses (see
Additional files 1, 2, 3 and 4 for the BEAST XML input of all four
runs).
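The run settings above imply a specific number of retained samples, and the effective sample size (ESS) discounts that number for autocorrelation between successive samples. The sketch below works through the arithmetic and shows one simple ESS estimator that truncates the autocorrelation sum at the first non-positive lag. This estimator is an assumption made for illustration and is not claimed to be the one used by BEAST or Tracer; the function name and the synthetic trace are hypothetical.

```python
import random

chain_length = 10_000_000
sample_every = 500
burn_in = 100_000

# Samples kept after burn-in (ignoring whether the initial state is logged).
retained_samples = (chain_length - burn_in) // sample_every
print(retained_samples)


def effective_sample_size(trace):
    """ESS = N / (1 + 2 * sum of positive-lag autocorrelations).

    The sum is truncated at the first lag whose autocorrelation is
    not positive, a simple and commonly used heuristic.
    """
    n = len(trace)
    mean = sum(trace) / n
    centered = [x - mean for x in trace]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        return float(n)
    rho_sum = 0.0
    for lag in range(1, n):
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = cov / variance
        if rho <= 0.0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)


# Synthetic, strongly autocorrelated trace (AR(1) process), not real BEAST output.
random.seed(1)
trace = [0.0]
for _ in range(1999):
    trace.append(0.9 * trace[-1] + random.gauss(0.0, 1.0))

print(len(trace), round(effective_sample_size(trace)))
```

For an autocorrelated trace the ESS is well below the raw sample count, which is why the ESS rather than the number of logged states is the quantity to check.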
Substitution Model          Marginal Likelihood    Distinct tree topologies
(a) GTR + CP + strict       -3656.13 ± 0.11        38
(b) GTR + CP + relaxed      -3655.33 ± 0.11        57
(c) GTR + Γ + I + strict    -3751.37 ± 0.11        289
(d) GTR + Γ + I + relaxed   -3750.23 ± 0.11        469
The marginal likelihoods, the number of distinct tree topologies in the 50% credible set and the mean tree height (± stderr) of the four substitution
models that were analyzed in the example. The large improvement in marginal likelihood clearly indicates that the two codon-position substitution
models (CP) are substantially superior to the models in which rate heterogeneity among sites is modeled by a Γ-distribution and a proportion of
invariant sites. In contrast, in this example there is little difference in fit to the data between the strict clock and the relaxed clock analyses,
suggesting that these data are clock-like.
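The comparison described in this caption can be checked with a few lines of arithmetic. The sketch below assumes that the tabulated marginal likelihoods are natural logarithms (an assumption, although values near -3656 strongly suggest it), so that a log Bayes factor is a simple difference. The dictionary keys are hypothetical labels, with "G" standing in for the gamma distribution.

```python
import math

# Marginal likelihoods from the table, assumed to be on the natural-log scale.
log_marginal = {
    "GTR+CP+strict": -3656.13,
    "GTR+CP+relaxed": -3655.33,
    "GTR+G+I+strict": -3751.37,
    "GTR+G+I+relaxed": -3750.23,
}


def log_bf(model_1, model_2):
    """ln Bayes factor in favour of model_1 over model_2."""
    return log_marginal[model_1] - log_marginal[model_2]


threshold = math.log(20)  # BF > 20 expressed on the log scale

# Codon-position model versus gamma + invariant sites, both with a strict clock.
cp_vs_gamma = log_bf("GTR+CP+strict", "GTR+G+I+strict")
# Relaxed clock versus strict clock, both with the codon-position model.
relaxed_vs_strict = log_bf("GTR+CP+relaxed", "GTR+CP+strict")

print(round(cp_vs_gamma, 2), cp_vs_gamma > threshold)
print(round(relaxed_vs_strict, 2), relaxed_vs_strict > threshold)
```

Under that assumption the codon-position models are favoured by roughly 95 log units, far beyond ln 20 (about 3.0), while the relaxed clock gains less than one log unit over the strict clock.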
Conclusion
BEAST is a flexible analysis package for evolutionary
parameter estimation and hypothesis testing. The component-based nature of model specification in BEAST means
that the number of different evolutionary models possible
is very large and therefore difficult to summarize. However,
a number of published uses of the BEAST software already
serve to highlight the breadth of application the software
enjoys [6,8,34,37,41].
BEAST is an actively developed package and enhancements for the next version include (1) birth-death priors
for tree shape, (2) faster and more flexible codon-based
substitution models, (3) the structured coalescent to
model subdivided populations with migration, (4) models of continuous character evolution and (5) new relaxed
clock models based on random local molecular clocks.
Methods
The overall architecture of the BEAST software package is
a file-mediated pipeline. The core program takes, as input,
an XML file describing the data to be analyzed, the models
to be used and technical details of the MCMC algorithm
such as the proposal distribution (operators), the chain
Additional material
Additional file 1
Dengue4-GTR-CP-strict. The BEAST input XML file for the GTR + CP
+ strict clock analysis.
[http://www.biomedcentral.com/content/supplementary/1471-2148-7-214-S1.xml]
Additional file 2
Dengue4-GTR-CP-relaxed. The BEAST input XML file for the GTR +
CP + relaxed clock analysis.
[http://www.biomedcentral.com/content/supplementary/1471-2148-7-214-S2.xml]
Additional file 3
Additional file 4
Acknowledgements
We would like to thank Roald Forsberg, Joseph Heled, Philippe Lemey,
Gerton Lunter, Sidney Markowitz, Oliver Pybus, Beth Shapiro, Korbinian
Strimmer and Marc Suchard for invaluable contributions. AJD was partially
supported by the Wellcome Trust and AR was supported by the Royal
Society.
Authors' contributions
AJD and AR designed and implemented all versions of
BEAST up to the current (version 1.4.6), which was developed between June 2002 and October 2007. Portions of
the BEAST source code are based on an original Markov
chain Monte Carlo program developed by AJD (called
MEPI) during his PhD at Auckland University between the
References
1. Huelsenbeck JP, Ronquist F: MrBayes: Bayesian inference of phylogenetic trees. Bioinformatics 2001, 17:754-755.
2. Beaumont M: Detecting population expansion and decline using microsatellites. Genetics 1999, 153:2013-2029.
3. Drummond AJ, Nicholls G, Rodrigo A, Solomon W: Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics 2002, 161:1307-1320.
4. Wilson I, Weale M, Balding D: Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. J Royal Stat Soc A-Statistics in Society 2003, 166:155-188.
5. Rannala B, Yang Z: Bayes estimation of species divergence times and ancestral population sizes using DNA sequences from multiple loci. Genetics 2003, 164:1645-1656.
6. Pybus O, Drummond AJ, Nakano T, Robertson B, Rambaut A: The epidemiology and iatrogenic transmission of hepatitis C virus in Egypt: a Bayesian coalescent approach. Mol Biol Evol 2003, 20:381-387.
Drummond AJ, Rambaut A, Shapiro B, Pybus O: Bayesian coalescent inference of past population dynamics from molecular sequences. Mol Biol Evol 2005, 22:1185-1192.
Wilson A, Sarich V: A molecular time scale for human evolution. Proc Natl Acad Sci USA 1969, 63:1088-1093.
Thorne J, Kishino H, Felsenstein J: An evolutionary model for maximum likelihood alignment of DNA sequences. Journal of Molecular Evolution 1991, 33:114-124.
Lemey P, Pybus O, Rambaut A, Drummond AJ, Robertson D, Roques P, Worobey M, Vandamme A: The molecular population genetics of HIV-1 group O. Genetics 2004, 167:1059-1068.
Newton M, Raftery A: Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society, Series B 1994, 56:3-48.
Shapiro B, Rambaut A, Drummond AJ: Choosing appropriate substitution models for the phylogenetic analysis of protein-coding sequences. Mol Biol Evol 2006, 23:7-9.
Huelsenbeck J, Rannala B: Frequentist properties of Bayesian posterior probabilities of phylogenetic trees under simple and complex substitution models. Systematic Biology 2004, 53:904-913.
Shapiro B, Drummond AJ, Rambaut A, Wilson MC, Matheus PE, Sher AV, Pybus OG, Gilbert MTP, Barnes I, Binladen J, Willerslev E, Hansen AJ, Baryshnikov GF, Burns JA, Davydov S, Driver JC, Froese DG, Harington CR, Keddie G, Kosintsev P, Kunz ML, Martin LD, Stephenson RO, Storer J, Tedford R, Zimov S, Cooper A: Rise and fall of the Beringian steppe bison. Science 2004, 306:1561-1565.
Suchard M, Redelings B: BAli-Phy: simultaneous Bayesian inference of alignment and phylogeny. Bioinformatics 2006, 22:2047-2048.
Rambaut A, Drummond AJ: Tracer [computer program]. 2003 [http://beast.bio.ed.ac.uk/tracer].