
METHODS

IN

MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:


http://www.springer.com/series/7651


Evolutionary Genomics
Statistical and Computational Methods, Volume 2

Edited by

Maria Anisimova
Department of Computer Science, Swiss Federal Institute of Technology (ETHZ),
Zürich, Switzerland
Swiss Institute of Bioinformatics, Lausanne, Switzerland

Editor
Maria Anisimova, Ph.D.
Department of Computer Science
Swiss Federal Institute of Technology (ETHZ)
Zurich, Switzerland
Swiss Institute of Bioinformatics
Lausanne, Switzerland

The photo used for the book cover was taken by one of the authors of the book, Wojciech Makałowski.

ISSN 1064-3745
e-ISSN 1940-6029
ISBN 978-1-61779-584-8
e-ISBN 978-1-61779-585-5
DOI 10.1007/978-1-61779-585-5
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2012931005
© Springer Science+Business Media, LLC 2012
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of
the publisher (Humana Press, c/o Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013,
USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified
as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Humana Press is part of Springer Science+Business Media (www.springer.com)

Preface
Discovery of genetic material propelled the power of classical evolutionary studies across
the diversity of living organisms. Together with early theoretical work in population
genetics, the debate on sources of genetic makeup initiated by proponents of the neutral
theory made a solid contribution to the spectacular growth in statistical methodologies for
molecular evolution. The methodology developed focused primarily on inferences from
single genes or noncoding DNA segments: mainly on reconstructing the evolutionary
relationships between lineages and estimating evolutionary and selective forces. Books
offering a comprehensive coverage of such methodologies have already appeared, with
Joe Felsenstein's Inferring Phylogenies and Ziheng Yang's Computational Molecular
Evolution among the favorites.
This volume is intended to review more recent developments in the statistical methodology and the challenges that followed as a result of rapidly improving sequencing
technologies. While the first sequenced genome (the RNA bacteriophage MS2 in
1976) was not even 4,000 nucleotides long, sequencing progress culminated with
the completion of the human genome of about 3.3 × 10^9 base pairs and advanced to
sequencing many other species' genomes, heading ambitiously towards population sequencing projects such as the 1,000 genomes projects for humans and Drosophila melanogaster.
Next-generation sequencing (NGS) technologies sparked the genomics revolution,
which triggered a renewed effort towards the development of statistical and computational
methods capable of coping with the flood of genomic data and its inherent complexity.
The challenge of analyzing and understanding the dynamics of large-system data can
be met only through an integration of organismal, molecular, and mathematical disciplines.
This requires commitment to an interdisciplinary approach to science, where both experimental and theoretical scientists from a variety of fields understand each other's needs and
join forces. Evidently, there remains a gap to be bridged. This book presents works by top
scientists from a variety of disciplines, each of whom embodies the interdisciplinary spirit of
evolutionary genomics. The collection includes a wide spectrum of articles, encompassing
theoretical works and hands-on tutorials, as well as many reviews with much biological
insight.
The evolutionary approach is clearly gaining ground in genomic studies, for it enables
inferences about patterns and mechanisms of genetic change. Thus, the theme of evolution
streams through each chapter of the book, with statistical models presented alongside their basic assumptions and illustrated with appealing biological examples. This book is intended for a wide
scientific audience interested in a compressed overview of the cutting-edge statistical
methodology in evolutionary genomics. Equally, this book may serve as a comprehensive
guide for graduate or advanced undergraduate students specializing in the fields of genomics or bioinformatics. The presentation of the material in this volume is aimed to equally
suit both a novice in biology with strong statistics and computational skills and a molecular
biologist with a good grasp of standard mathematical concepts. To cater for differences in
reader backgrounds, Part I of Volume 1 is composed of educational primers to help with
fundamental concepts in genome biology (Chapters 1 and 2), probability and statistics
(Chapter 3), and molecular evolution (Chapter 4). As these concepts reappear repeatedly
throughout the books, the first four chapters will help the neophyte to stay afloat.


The exercises and questions offered at the end of each chapter serve to deepen the
understanding of the material. Additional materials and some solutions to exercises can
be found online: http://www.evolutionarygenomics.net.
Part II of Volume 1 reviews state-of-the-art techniques for genome assembly (Chapter
5), gene finding (Chapter 6), sequence alignment (Chapters 7 and 8), and inference of
orthology, paralogy (Chapter 9), and laterally transferred genes (Chapter 10). Part III opens
with a comparative review of genome evolution in different breeding systems (Chapter 11)
and then discusses genome evolution in model organisms based on the studies of transposable elements (Chapters 12 and 13), gene families, synteny (Chapter 14), and gene order
(Chapters 15 and 16).
Part I of Volume 2 is the evidence that, since embracing Darwin's tree-like representation of evolution and pondering over the universal Tree of Life, the field has moved on.
Nowadays, evolutionary biologists are well aware of numerous evolutionary processes
that distort the tree, complicating the statistical description of models and increasing
computational complexity, often to prohibitive levels. Each taking a different angle, the
chapters of Part I, Volume 2 discuss how to overcome problems with phylogenetic
discordance, as the Tree of Life turns out to be more like a forest (Chapter 3).
The multispecies coalescent model offers one solution to reconciling phylogenetic discord
between gene and species trees (Chapter 1); others pursue probabilistic reconciliation
for gene families based on a birth-death model along a species phylogeny (Chapter 2).
By some perspectives, constraining the understanding of evolution solely with tree-like
structures omits many important biological processes that are not tree-like (Chapter 4).
Most fundamental questions in genome biology strive to disentangle the evolutionary
forces shaping species genomes, inferring evolutionary history, and understanding how
molecular changes affect genomic and phenotypic characteristics. To this end, Part II
of Volume 2 introduces methods for detecting selection (Chapters 5
and 6) and recombination (Chapters 9 and 10), and discusses the mechanisms for the
origins of new genes (Chapter 7) and the evolution of protein domain architectures
(Chapter 8). The role of natural selection in shaping genomes is a pinnacle of the classical
neutralist-selectionist debate and sets an important theme of the book; the neo-selectionist model of genome evolution is tested on many counts. This theme is also
apparent in Part III dedicated to population genomics, which starts by discussing models
for genetic architectures of complex disease and the power of genome-wide association
studies (GWAS) for finding susceptibility variants (Chapter 11). With the availability of
multiple genomes from closely related species, gleaning the ancestral population history
also became possible, as is illustrated in the following chapter (Chapter 12). Most population
genetics problems rely on ancestral recombination graphs (ARG), and reducing the redundancy of the ARG structure helps to reduce the computational complexity (Chapter 13).
Entering the era of postgenomics biology, recent years have seen rapid growth of
complementary genomic data, such as data on expression and regulation, chemical and
metabolic pathways, gene interactions and networks, disease associations, and more.
Considering the genome as a uniform collection of coding and noncoding molecular
sequences is no longer an option. To address this, great efforts are currently dedicated to
embracing the complexity of biological systems through the emerging -omics disciplines,
the focus of Part IV of this volume. Chapter 14 discusses ways to study the evolution of
gene expression and regulation based on data from "old-fashioned" microarrays as well
as transcriptomics data obtained with NGS, such as RNA-seq and ChIP-seq. Interactomics
is the focus of the next chapter. Indeed, better understanding of genes, their diversity


and regulation comes from studies of interaction between their protein products and
networks of interacting elements (Chapter 15). Further topics include metabolomics
(Chapter 16), metagenomics (Chapter 17), epigenomics (Chapter 18), and the newly
reinvented discipline with a mysterious name: genetical genomics (Chapter 19). Despite
the effort, complex dependencies and causative effects are difficult to infer. A way forward
must be in the integration of complementary -omics information with genomic sequence
data to understand the fundamentals of systems biology in living organisms. This cannot be
achieved without studying how such information changes over time and across various
conditions. Vast amounts of multifaceted data promise a big future for machine learning,
pattern recognition and discovery, and efficient data mining techniques, as can be seen
from many chapters of this book.
Finally, Part V of the second volume focuses on challenges and approaches for large
and complex data representation and storage (Chapter 20). The rapid pace of computational genomics, as well as research transparency and efficiency, exacerbates the need for
sharing of data and programming resources. Fortunately, some solutions already exist
(Chapter 21). Handling ever-increasing amounts of computation requires efficient computing strategies, which are discussed in the closing chapter of the book (Chapter 22).
For a novice in the field, this book is certainly a treasure chest of state-of-the-art
methods to study genomic and omics data. I hope that this collection will motivate
both young and experienced readers to join the interdisciplinary field of evolutionary
genomics. But even the experienced bioinformatician reader is certain to find a few
surprises. On behalf of all authors, I hope that this book will become a source of inspiration
and new ideas for our readers. Wishing you pleasant reading!

Zürich, Switzerland

Maria Anisimova, Ph.D.

Acknowledgments
The foremost gratitude goes to the authors of this book who came together to make this
resource possible and who were enthusiastic and encouraging about the whole project.
Over 100 reviewers have contributed to improving the quality and the clarity of the
presentation with their constructive and detailed comments. Some reviewers agreed
to be acknowledged by name. With great pleasure, I list them here:
Tyler Alioto, Peter Andolfatto, Miguel Andrade, Irena Artamonova, Richard M.
Badge, David Balding, Mark Beaumont, Chris Beecher, Robert Beiko, Adam Boyko,
Katarzyna Bryc, Kevin Bullaughey, Margarida Cardoso-Moreira, Julian Catchen, Annie
Chateau, Karen Cranston, Karen Crow, Tal Dagan, Dirk-Jan de Koning, Christophe
Dessimoz, Mario dos Reis, Katherine Dunn, Julien Y. Dutheil, Toni Gabaldón, Nicolas
Galtier, Mikhail Gelfand, Josefa González, Maja Greminger, Stéphane Guindon, Michael
Hackenberg, Carolin Kosiol, Mary Kuhner, Anne Kupczok, Nicolas Lartillot, Adam
Leaché, Gerton Lunter, Thomas Mailund, William H. Majoros, James McInerney,
Gabriel Musso, Pjotr Prins, David A. Ray, Igor Rogozin, Mikkel H. Schierup, Adrian
Schneider, Daniel Schoen, Cathal Seoighe, Erik Sonnhammer, Andrea Splendiani, Tanja
Stadler, Manuel Stark, Krister Swenson, Adam M. Szalkowski, Gergely J. Szöllősi, Jijun
Tang, Todd Treangen, Oswaldo R. Trelles Salazar, Albert Vilella, Rutger Vos, Tom
Williams, Carsten Wiuf, Yuri Wolf, Xuhua Xia, S. Stanley Young, Olga Zhaxybayeva, and
Stefan Zoller.
My colleagues from the Computational Biochemistry Research Group at ETH Zurich
deserve much credit for being a constant source of inspiration and for providing such an
enjoyable working environment. Finally, but no less importantly, I would like to thank my
family for their love and for tolerating the overtime that this project required.


Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     v
Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  xiii

PART I    PHYLOGENOMICS

1   Tangled Trees: The Challenge of Inferring Species Trees
    from Coalescent and Noncoalescent Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .     3
    Christian N.K. Anderson, Liang Liu, Dennis Pearl, and Scott V. Edwards
2   Modeling Gene Family Evolution and Reconciling Phylogenetic Discord . . . . .    29
    Gergely J. Szöllősi and Vincent Daubin
3   Genome-Wide Comparative Analysis of Phylogenetic Trees:
    The Prokaryotic Forest of Life . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    53
    Pere Puigbò, Yuri I. Wolf, and Eugene V. Koonin
4   Philosophy and Evolution: Minding the Gap Between Evolutionary
    Patterns and Tree-Like Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    81
    Eric Bapteste, Frédéric Bouchard, and Richard M. Burian

PART II   NATURAL SELECTION, RECOMBINATION, AND INNOVATION
          IN GENOMIC SEQUENCES

5   Selection on the Protein-Coding Genome . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   113
    Carolin Kosiol and Maria Anisimova
6   Methods to Detect Selection on Noncoding DNA . . . . . . . . . . . . . . . . . . . . . .   141
    Ying Zhen and Peter Andolfatto
7   The Origin and Evolution of New Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   161
    Margarida Cardoso-Moreira and Manyuan Long
8   Evolution of Protein Domain Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . .   187
    Kristoffer Forslund and Erik L.L. Sonnhammer
9   Estimating Recombination Rates from Genetic Variation in Humans . . . . . . .   217
    Adam Auton and Gil McVean
10  Evolution of Viral Genomes: Interplay Between Selection, Recombination,
    and Other Forces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   239
    Sergei L. Kosakovsky Pond, Ben Murrell, and Art F.Y. Poon

PART III  POPULATION GENOMICS

11  Association Mapping and Disease: Evolutionary Perspectives . . . . . . . . . . . . .   275
    Søren Besenbacher, Thomas Mailund, and Mikkel H. Schierup
12  Ancestral Population Genomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   293
    Julien Y. Dutheil and Asger Hobolth
13  Nonredundant Representation of Ancestral Recombination Graphs . . . . . . . .   315
    Laxmi Parida

PART IV   THE -OMICS

14  Using Genomic Tools to Study Regulatory Evolution . . . . . . . . . . . . . . . . . . .   335
    Yoav Gilad
15  Characterization and Evolutionary Analysis of Protein-Protein
    Interaction Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   363
    Gabriel Musso, Andrew Emili, and Zhaolei Zhang
16  Statistical Methods in Metabolomics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   381
    Alexander Korman, Amy Oh, Alexander Raskind, and David Banks
17  Introduction to the Analysis of Environmental Sequences: Metagenomics
    with MEGAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   415
    Daniel H. Huson and Suparna Mitra
18  Analyzing Epigenome Data in Context of Genome Evolution
    and Human Diseases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   431
    Lars Feuerbach, Konstantin Halachev, Yassen Assenov, Fabian Müller,
    Christoph Bock, and Thomas Lengauer
19  Genetical Genomics for Evolutionary Studies . . . . . . . . . . . . . . . . . . . . . . . . .   469
    Pjotr Prins, Geert Smant, and Ritsert C. Jansen

PART V    HANDLING GENOMIC DATA: RESOURCES AND COMPUTATION

20  Genomics Data Resources: Frameworks and Standards . . . . . . . . . . . . . . . . . .   489
    Mark D. Wilkinson
21  Sharing Programming Resources Between Bio* Projects
    Through Remote Procedure Call and Native Call Stack Strategies . . . . . . . . .   513
    Pjotr Prins, Naohisa Goto, Andrew Yates, Laurent Gautier, Scooter Willis,
    Christopher Fields, and Toshiaki Katayama
22  Scalable Computing for Evolutionary Genomics . . . . . . . . . . . . . . . . . . . . . . .   529
    Pjotr Prins, Dominique Belhachemi, Steffen Möller, and Geert Smant

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   547

Contributors
CHRISTIAN N.K. ANDERSON  Department of Organismic and Evolutionary Biology &
Museum of Comparative Zoology, Harvard University, Cambridge, MA, USA
PETER ANDOLFATTO  Department of Ecology and Evolutionary Biology,
The Lewis-Sigler Institute for Integrative Genomics, Princeton University,
Princeton, NJ, USA
MARIA ANISIMOVA  Department of Computer Science, Swiss Federal Institute of
Technology (ETHZ), Zurich, Switzerland; Swiss Institute of Bioinformatics,
Lausanne, Switzerland
YASSEN ASSENOV  Max Planck Institute, Saarbrücken, Germany
ADAM AUTON  Wellcome Trust Centre for Human Genetics, Oxford, UK
DAVID BANKS  Department of Statistical Science, Duke University, Durham, NC, USA
ERIC BAPTESTE  UMR CNRS 7138, UPMC, Paris, France
DOMINIQUE BELHACHEMI  Section of Biomedical Image Analysis, Department of
Radiology, University of Pennsylvania, Philadelphia, PA, USA
SØREN BESENBACHER  deCODE Genetics, Reykjavik, Iceland; Bioinformatics Research
Center, Aarhus University, Aarhus, Denmark
CHRISTOPH BOCK  Max Planck Institute, Saarbrücken, Germany; Broad Institute,
Cambridge, MA, USA
FRÉDÉRIC BOUCHARD  Département de Philosophie, Université de Montréal,
Station Centre-ville, Montréal, Québec, Canada
RICHARD M. BURIAN  Department of Philosophy, Virginia Tech, Blacksburg, VA, USA
MARGARIDA CARDOSO-MOREIRA  Department of Molecular Biology and Genetics,
Cornell University, Ithaca, NY, USA
VINCENT DAUBIN  UMR CNRS 5558, LBBE, Biométrie et Biologie Évolutive,
UCB Lyon 1, Villeurbanne, France
JULIEN Y. DUTHEIL  Institut des Sciences de l'Évolution Montpellier (ISE-M),
UMR 5554, CNRS, Université Montpellier, Montpellier, France
SCOTT V. EDWARDS  Department of Organismic and Evolutionary Biology & Museum
of Comparative Zoology, Harvard University, Cambridge, MA, USA
ANDREW EMILI  Banting and Best Department of Medical Research, Donnelly Centre
for Cellular and Biomolecular Research, Department of Medical Genetics and
Microbiology, University of Toronto, Toronto, ON, Canada
LARS FEUERBACH  Max Planck Institute, Saarbrücken, Germany
CHRISTOPHER FIELDS  Institute for Genomic Biology, The University of Illinois,
Urbana, IL, USA
KRISTOFFER FORSLUND  Stockholm Bioinformatics Centre, Stockholm University,
Stockholm, Sweden
LAURENT GAUTIER  Department of Systems Biology, DMAC, Center for Biological
Sequence Analysis, Technical University of Denmark, Lyngby, Denmark
YOAV GILAD  Department of Human Genetics, The University of Chicago, Chicago,
IL, USA

NAOHISA GOTO  Department of Genome Informatics, Genome Information Research
Center, Research Institute for Microbial Diseases, Osaka University, Osaka, Japan
KONSTANTIN HALACHEV  Max Planck Institute, Saarbrücken, Germany
ASGER HOBOLTH  Bioinformatics Research Center (BiRC), Aarhus University,
Aarhus, Denmark
DANIEL H. HUSON  Center for Bioinformatics, Tübingen University, Tübingen,
Germany
RITSERT C. JANSEN  Groningen Bioinformatics Centre, University of Groningen,
Groningen, The Netherlands
TOSHIAKI KATAYAMA  Laboratory of Genome Database, Human Genome Center,
Institute of Medical Science, University of Tokyo, Tokyo, Japan
EUGENE V. KOONIN  National Center for Biotechnology Information, National
Library of Medicine, National Institutes of Health, Bethesda, MD, USA
ALEXANDER KORMAN  Department of Statistical Science, Duke University, Durham,
NC, USA
SERGEI L. KOSAKOVSKY POND  Department of Medicine, University of California,
San Diego, CA, USA
CAROLIN KOSIOL  Institute of Population Genetics, Vetmeduni Vienna, Austria
THOMAS LENGAUER  Max Planck Institute, Saarbrücken, Germany
MANYUAN LONG  Department of Ecology and Evolution, University of Chicago,
Chicago, IL, USA
LIANG LIU  Department of Agriculture and Natural Resources, Delaware State
University, Dover, DE, USA
THOMAS MAILUND  Bioinformatics Research Center, Aarhus University, Aarhus,
Denmark
GIL MCVEAN  Wellcome Trust Centre for Human Genetics, Oxford, UK
SUPARNA MITRA  Center for Bioinformatics, Tübingen University, Tübingen, Germany
STEFFEN MÖLLER  Department of Dermatology, University Clinics of Schleswig-Holstein, formerly University of Lübeck, Institute for Neuro- and Bioinformatics,
Lübeck, Germany
FABIAN MÜLLER  Max Planck Institute, Saarbrücken, Germany; Broad Institute,
Cambridge, MA, USA
BEN MURRELL  Computer Science Division, Department of Mathematical Sciences,
University of Stellenbosch, Stellenbosch, South Africa; Biomedical Informatics
Research, Medical Research Council, Tygerberg, South Africa
GABRIEL MUSSO  Cardiovascular Division, Brigham & Women's Hospital, Boston,
MA, USA; Harvard Medical School, Boston, MA, USA
AMY OH  Department of Statistical Science, Duke University, Durham, NC, USA
LAXMI PARIDA  IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA
DENNIS PEARL  Department of Statistics, The Ohio State University, Columbus,
OH, USA
ART F.Y. POON  BC Centre for Excellence in HIV/AIDS, Vancouver, BC, Canada
PJOTR PRINS  Laboratory of Nematology, Wageningen University, Wageningen,
The Netherlands; Groningen Bioinformatics Centre, University of Groningen,
Groningen, The Netherlands


PERE PUIGBÒ  National Center for Biotechnology Information, National Library
of Medicine, National Institutes of Health, Bethesda, MD, USA
ALEXANDER RASKIND  Department of Pathology, University of Michigan, Ann Arbor,
MI, USA
MIKKEL H. SCHIERUP  Bioinformatics Research Center, Aarhus University, Aarhus,
Denmark
GEERT SMANT  Laboratory of Nematology, Wageningen University, Wageningen,
The Netherlands
ERIK L.L. SONNHAMMER  Stockholm Bioinformatics Centre, Stockholm University,
Stockholm, Sweden; Swedish eScience Research Center, Stockholm, Sweden
GERGELY J. SZÖLLŐSI  UMR CNRS 5558, LBBE, Biométrie et Biologie Évolutive,
UCB Lyon 1, Villeurbanne, France
MARK D. WILKINSON  Department of Medical Genetics, University of British Columbia
and PI Bioinformatics, Heart + Lung Institute at St. Paul's Hospital, Vancouver,
BC, Canada
SCOOTER WILLIS  Department of Computer & Information Science & Engineering,
University of Florida, Gainesville, FL, USA
YURI I. WOLF  National Center for Biotechnology Information, National Library
of Medicine, National Institutes of Health, Bethesda, MD, USA
ANDREW YATES  European Bioinformatics Institute, Wellcome Trust Genome Campus,
Hinxton, Cambridge, UK
ZHAOLEI ZHANG  Banting and Best Department of Medical Research, Donnelly
Centre for Cellular and Biomolecular Research, Department of Medical Genetics
and Microbiology, University of Toronto, Toronto, ON, Canada
YING ZHEN  Department of Ecology and Evolutionary Biology, The Lewis-Sigler
Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA

Part I
Phylogenomics

Chapter 1
Tangled Trees: The Challenge of Inferring Species Trees
from Coalescent and Noncoalescent Genes
Christian N.K. Anderson, Liang Liu, Dennis Pearl, and Scott V. Edwards
Abstract
Phylogenies based on different genes can conflict with one another; methods that resolve such
ambiguities are becoming more popular and offer a number of advantages for phylogenetic analysis.
We review so-called species tree methods and the biological forces that can undermine them by violating
important aspects of the underlying models. Such forces include horizontal gene transfer, gene duplication,
and natural selection. We review ways of detecting loci influenced by such forces and offer suggestions for
identifying or accommodating them. The way forward involves identifying outlier loci, as is done in
population genetic analysis of neutral and selected loci, and removing them from further analysis, or
developing more complex species tree models that can accommodate such loci.
Key words: Species tree, Gene tree discordance, Non-coalescent genes, Outlier analysis

1. Introduction
The concept of a species tree, a bifurcating dendrogram graphically
depicting the relationships of species to each other, is one of the
oldest and most powerful icons in all of biology (Figs. 1 and 2). After
Charles Darwin sketched the first species tree (in Transmutation of
Species, Notebook B, 1837), he remained fascinated by the image for
22 years, eventually including a species tree as the only figure in On
the Origin of Species (1859). Though species trees reached their
aesthetic apogee with Ernst Haeckel's Tree of Life in 1886, the
pursuit of ever-more scientifically accurate trees has kept phylogenetics a vibrant discipline for the 150 years since.
Because the direct evolution of species is not observable
(not even in the fossil record), relationships are often inferred by
shared characteristics among extant taxa. Until the 1970s, this was
done almost exclusively by using morphological characters.

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_1,
© Springer Science+Business Media, LLC 2012

[Figure 1, three panels: "Set of 9 Gene Trees" (5,000 generations), "Superimposed Gene Trees", and "Inferred Species Tree", with tips A-D and timescales in generations.]

Fig. 1. An example showing the utility of multiple gene trees in producing species tree topologies. (A) Nine unlinked loci are
simulated (or inferred without error) from a species group with substantial amounts of incomplete lineage sorting. Note that
no single gene recovers the correct relationship between clades. Furthermore, despite identical conditions for all nine
simulations, no two genes agree on the correct topology, let alone the correct divergence times. (B) Superimposing the
nine gene trees on top of each other clarifies the relationships. It can be (correctly) inferred that the true tree is perfectly
ordered, with (ABC) diverging from D about 1,500 generations ago, the (AB)-C split occurring at 800, and A diverging from
B about 600 generations ago. Also, the amount of crossbreeding within the recently diverged taxa implies (correctly) that C
has the smallest effective population size.

Although this approach had many successes, the paucity of characters and the challenges of comparing species with no obvious morphological homologies were persistent problems (1). When
molecular techniques were developed in the late 1960s, it soon
became clear that the sheer volume of molecular data that could
be collected would represent a vast improvement. When DNA
sequences became widely available for a range of species (2), molecular comparisons quickly became de rigueur (3–6). Nonetheless, it
was recognized early on that molecular phylogenies had their own

[Figure 2, two panels: "Deep coalescence" and "Branch length heterogeneity", each showing a species tree with gene lineages inside and simplified gene trees (tips A-D) below.]

Fig. 2. The relationship between gene trees and species trees. Lines within the species trees indicate gene lineages.
Simplified gene trees are shown below each species tree. Whereas gene trees on the left vary due to deep coalescence,
gene trees on the right are topologically concordant but vary slightly in branch lengths due to the coalescent. Modified with
permission from Edwards (2009).

suite of problems; the concept that not all gene tree topologies
would match the true species tree topology (i.e., would not be
speciodendric sensu Rosenberg (7)) was implicit in studies as early
as the 1960s ((8), see also ref. 9). However, it was generally assumed
that the idiosyncratic genealogical history of any one gene, as
reconstructed from extant mutations, was an acceptable approximation for the true history of the species given the potentially overwhelming quantity and seductive utility of molecular data (10–14).
By and large, the ensuing decades of molecular phylogenetics
has fulfilled much of this potential, revolutionizing taxonomies and
resolving conundrums previously considered intractable (15).
However, as the amount of genetic data per species becomes ever-more voluminous, it has become clear that individual genes can
conflict with each other and with the overarching species tree, both
in topology and branch lengths (16–19). In the meantime, the
term "phylogeny" frequently became conflated with "gene tree,"
the entity produced by many of the leading phylogenetics packages
of the day. The term "species tree," in use since the late 1970s to
emphasize the distinction between lineage histories and gene histories (13, 16), was only gradually acknowledged, despite the fact
that species trees are the rightful heirs to the term "phylogeny" and
better encapsulate the true goals of molecular and morphological
systematics (20).
At first, some researchers treated this phenomenon as though it
were an "information problem": when working with only a few
mutations, you were bound to occasionally be unlucky and
sequence a gene whose random signal of evolution did not
match that of the taxa being studied. The reasoning was that surely
more and/or longer sequences would fix that problem and cause
gene trees to converge. However, as more genes were sequenced
and as the properties of gene lineages within populations were
studied in detail (21), the twin realities of gene tree heterogeneity
and fuzzy genetic distinctions between recently diverged taxa
(incomplete lineage sorting) became clear (Figs. 1 and 2). The
probability of an event such as incomplete lineage sorting (which,
if considered alone, would lead to inferring the wrong species
tree) was worked out theoretically for the four-individual/two-species case first (10), followed by the three-individual/three-species case (5, 12), and then the generalized case (11). This last
study was among those that proposed one class of solution: simply
acquire more gene sequences, and the central tendency of this
gene set will point to the correct relationships. On the empirical
side, researchers adopted two general approaches. Pamilo and Nei (11) suggested a "democratic vote" method, where each gene was allowed to propose its own tree, and the topology with the most votes was declared the winner, and therefore the true phylogeny. This method was used in theoretical and empirical work, particularly on primate data sets (22). However, though generally reliable for three-species cases, it can sometimes produce the wrong topology with four or more species. In fact, we now know that there is an "anomaly zone" for species trees with short branch lengths, in which the addition of more genes is guaranteed to lead to the wrong species tree topology under the democratic vote method (23, 24). (Branches here are measured in coalescent
units, which are equivalent to t/Ne, where t is the number of
generations since divergence and Ne is the effective population
size of the lineage (25).) Though it is not clear whether real
species trees possess branch lengths short enough to enter the
anomaly zone (26), the potential remains theoretically disconcerting. In addition, because the number of possible tree topologies
increases as the double factorial of the number of tips, for species
trees with more than four tips a very large number of genes is
required to determine which gene tree is in fact the most frequent.
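The three-species case admits a simple illustration. Under the multispecies coalescent, a gene tree matches the species tree ((A,B),C) with probability 1 - (2/3)exp(-T), where T is the internal branch length in coalescent units, and each of the two discordant topologies arises with probability (1/3)exp(-T). The sketch below (pure Python, illustrative parameters only) samples gene tree topologies from these probabilities and applies the democratic vote; note that with one individual per species and only three species there is no anomaly zone, so the vote is consistent in this setting.

```python
import random
from collections import Counter
from math import exp

def sample_gene_topologies(T, n_genes, rng):
    """Sample gene tree topologies for species tree ((A,B),C) with an
    internal branch of T coalescent units. P(concordant) = 1 - (2/3)exp(-T);
    each of the two discordant topologies has probability (1/3)exp(-T)."""
    p_discord_each = exp(-T) / 3.0
    topologies = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]
    weights = [1.0 - 2.0 * p_discord_each, p_discord_each, p_discord_each]
    return rng.choices(topologies, weights=weights, k=n_genes)

def democratic_vote(gene_trees):
    """Return the most frequent topology among the sampled gene trees."""
    return Counter(gene_trees).most_common(1)[0][0]

rng = random.Random(42)
trees = sample_gene_topologies(T=1.0, n_genes=200, rng=rng)
print(democratic_vote(trees))  # with T = 1 and 200 genes, almost surely ((A,B),C)
```

Shrinking T makes discordant gene trees increasingly common, which is the intuition behind the anomaly zone that emerges once four or more species are involved.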
A large number of advanced consensus methods (27, 28) have
recently been introduced, and circumvent some of the problems
of the democratic vote by using novel methods of combining gene
trees, such as rooted triple consensus (29), greedy consensus (30),
and supertree methods (31, 32). One recent approach, Bayesian
Concordance Analysis (33), acknowledges the possibility of valid
discordance (due to any of the other potential confounders discussed in Subheading 4), and, rather than establishing consensus,
seeks instead to quantify how much discordance exists among
gene trees (34).

1 Tangled Trees: Noncoalescent Genes

The second empirical approach to the problem of conflicting gene trees was to bypass it altogether. Concatenation methods appended one gene's sequence onto the next to create long alignments or "supermatrices" (35), a technique that in some situations was superior to standard consensus methods in resolving
discordance or achieving statistical consistency (36). But some
researchers, including those who questioned the "total evidence" approach to systematics (37), advocated against concatenation
when, for whatever reason, gene trees appeared to conflict with
one another. One problem with the concatenation approach was
that it assumed full linkage across the supermatrix, a situation that
would obviously not be the case if genes were on different chromosomes. Even when the branch lengths in a species tree are long such
that gene tree topologies are congruent, the branch lengths of trees
of genes on different chromosomes will differ subtly from one
another due to the stochasticity of the coalescent process. The
early implementations of the supermatrix method also assumed
the same distribution of mutation rates across the sequence,
which was clearly not the case if the matrix included coding and
noncoding regions. Like democratic vote methods, concatenation
of many genes was sometimes defended as sufficient to override the
conflicting signal across genes (38, 39), despite widespread
acknowledgment that gene tree heterogeneity is ubiquitous and
that concatenation can sometimes give the wrong answer (40, 41).
Another problem is that, in a strict sense, concatenation also does
not generate species trees, which are derived by reconciling conflicts among gene trees; instead, it generates a single "supergene" tree that is assumed to be equivalent to the species tree (20). Finally,
concatenation approaches also suffer from the same problem as
democratic vote methods; in certain trees with short branches,
more data can lead to the wrong answer with increasing confidence
(41). Nevertheless, concatenation still remains popular by default
(42), particularly among phylogenetic studies of higher taxa, where
incomplete lineage sorting is assumed to be rare.
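Mechanically, concatenation is straightforward, which partly explains its persistence. A minimal sketch of supermatrix construction (Python; the taxon names and '?' padding convention are illustrative, not taken from any particular package):

```python
def concatenate(alignments):
    """Build a supermatrix by appending each gene's alignment onto the next.
    `alignments` is a list of dicts mapping taxon -> aligned sequence; taxa
    missing from a gene are padded with '?' so every row stays equal length."""
    taxa = sorted({t for aln in alignments for t in aln})
    supermatrix = {t: [] for t in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for t in taxa:
            supermatrix[t].append(aln.get(t, "?" * length))
    return {t: "".join(parts) for t, parts in supermatrix.items()}

genes = [
    {"A": "ACGT", "B": "ACGA", "C": "ACTA"},
    {"A": "GGTTCC", "B": "GGTACC"},  # this gene was not sampled in taxon C
]
print(concatenate(genes))  # taxon C's second block is '??????'
```

Note that the result is a single alignment: any downstream analysis of it implicitly assumes one shared history (full linkage) across all included loci, which is exactly the assumption criticized above.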
In the end, the concatenation method will remain popular until
there are software alternatives that are robust, efficient, and easy to
use. As a result, researchers are in something of a double bind: either use just one gene, and risk inferring the wrong species tree due to a lack of statistical power and incongruence with the underlying species tree, or use many genes, and risk inferring the wrong
species tree due to gene tree heterogeneity or short branches in
some of the gene tree topologies. One solution is to use models for
species trees that are consistent with what is known about biological
heritability. One such model is the multispecies coalescent (4346).
It is this model that provides the basis for a recent flurry of
promising methods that permit efficient and consistent estimation
of species trees under a variety of conditions.


2. The Multispecies Coalescent Model
A plausible probabilistic model for analyzing multilocus sequences
should involve not only the phylogenetic relationship of species
(species tree), but also the genealogical history of each gene (gene
tree), and allow different genes to have different histories. Unlike
concatenation, such a model explains the evolutionary history of
multilocus sequences through a two-stage process: from species tree to gene tree, and from gene tree to sequences (44). Construction of the two-stage model requires an explicit description of
how gene trees evolve in the species tree and how sequences evolve
on gene trees. As the second question has been extensively studied
in traditional phylogenetic analyses for estimating gene trees, the
key is to address the first question adequately. With a few exceptions
(described below), the genealogical relationship (gene tree) of neutral alleles can be simply depicted by a coalescence process in which
lineages randomly coalesce with each other backward in time. The
coalescence model is simple in the sense that it assumes little or no
effect of evolutionary forces such as selection, recombination, and
gene flow, instead giving a prominent role to random genetic drift.
Despite these seemingly oversimplified assumptions, the pure coalescent model is fundamental in explaining the gene treespecies tree
relationship because it forms a baseline for incorporating additional
evolutionary forces on top of random drift (25). More importantly,
the pure coalescent model provides an analytic tool to detect the
evolutionary forces responsible for the deviation of the observed data
(molecular sequences) from those expected from the model.
The coalescent process works, in effect, by randomly choosing
ancestors from the population backward through time for each
sequence in the original sample. Eventually, two of these lineages
share a common ancestor, and the lineages are said to coalesce.
The process continues until all lineages have coalesced at the most
recent common ancestor (MRCA). Book-length treatments of the
process are available, and readers interested in the mathematical
details can find them in several sources (e.g., Refs. 28, 47-49).
Multispecies coalescence works the same way but places constraints
on how recently the coalescences occur, corresponding to the species divergence times. Given a species tree, the probability density
function of each gene tree is evaluated, and these density functions
are combined to evaluate the likelihood of the species tree. In this
way, multispecies coalescent methods are the converse of consensus
methods; rather than each locus proposing a potentially divergent
species tree, a common species tree is assumed and evaluated in light
of the sometimes-divergent patterns observed across loci (30).
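The single-population coalescent that underlies this machinery is easy to simulate: while k lineages remain, the waiting time to the next coalescence is exponentially distributed with rate k(k-1)/2 in coalescent units. A minimal sketch (Python; illustrative only, no species-tree constraints):

```python
import random

def coalescent_times(n, rng):
    """Simulate the standard single-population coalescent for n sampled
    lineages. Time is in coalescent units (Ne generations); while k lineages
    remain, the next coalescence waits an exponential time with rate
    k(k-1)/2. Returns the successive coalescence times, ending at the MRCA."""
    t, k, times = 0.0, n, []
    while k > 1:
        t += rng.expovariate(k * (k - 1) / 2.0)
        times.append(t)
        k -= 1  # two random lineages merge into their common ancestor
    return times

rng = random.Random(1)
times = coalescent_times(5, rng)
print(times[-1])  # depth of the MRCA; its expectation is 2*(1 - 1/n) = 1.6 here
```

The multispecies version differs only in constraining which lineages may coalesce, and when, according to the species tree's divergence times.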
A number of implementations of this idea have been developed
(20). The BATWING package (50) was originally developed to generalize error estimates on a species tree from a single locus or group of 100% linked microsatellite loci (50). Several packages are
available for moving from already estimated gene trees to species
trees, including Minimization of Deep Coalescence (16, 51), STEM
(52), JIST (53, 54), GLASS (55), STAR, and STEAC (45). The
MCMCcoal package (56) originally required a species tree topology
a priori to approximate divergence times and population sizes, but
now can infer species tree topologies as well with the bpp package
(57), and can also operate in a pseudo-MLE framework (58).
Several other full packages infer gene trees from DNA sequences,
and then species trees from the inferred gene trees, given a priori
assignment of the sequences to species groups. These include ESPCOAL (18), AUGIST (within the Mesquite environment (59)),
BEST (44, 60), and *BEAST (61). Reviews describing these methods
in more detail are available (45). BUCKy (34) is notable for making
Bayesian inferences without assuming coalescence and performs
relatively better than some alternatives in the presence of horizontal
gene transfer (62), though when applied to coalescent data the
resulting analysis is generally not as accurate (63).
The multispecies coalescent can under some circumstances be
more efficient than concatenation (64), and can recover the correct
species tree even in the anomaly zone, where concatenation methods fail (65). One drawback is that the estimation of larger numbers
of parameters (population sizes and divergence times in addition to
topologies) can slow computation and does not necessarily improve
accuracy because of the many sources of error (66). Another ambiguous aspect of species tree methods and multispecies coalescent models is that they appear to be less susceptible to the overconfidence in topology that was attributed to Bayesian analyses early on (67).
We have wondered (64) whether such inflation is in fact due to
traditional model misspecifications, such as incorrect substitution
matrices for DNA sequences, or to concatenation, which of course
can be viewed as a misspecification of the coalescent model because
it rejects independent assortment of loci. While the lower confidence values obtained from species trees are not deficiencies per se,
they are also not conducive to the adoption of this new family of
phylogenetic models by the empirical community!

3. Sources of Gene Tree/Species Tree Discordance and Violations of the Multispecies Coalescent Model

3.1. Population Processes

The standard and most common reason why gene trees are not speciodendritic is incomplete lineage sorting, i.e., lineages have not yet been reproductively isolated for long enough for drift to cause complete genetic divergence in the form of reciprocal monophyly of gene trees (68). This source of gene tree heterogeneity is guaranteed to be ubiquitous, if only because it arises from the finite population sizes of all species that have ever come into existence. Almost all the techniques and software packages discussed above are designed to approximate uncertainties in species tree topology arising from this phenomenon.
3.1.1. Accurate Delimitation of Species and Diverging Lineages

For recent divergences, the definition of "species" can become problematic for species tree methods (53, 54), and the challenge
of delimiting species has, if anything, increased now that the
overly conservative strictures of gene tree monophyly as a delimiter of species have been mostly abandoned. This fundamental issue in a phylogenetic study (whether the extent of divergence among lineages warrants species status) has not gone away in the
species tree era. Researchers are often faced with a dilemma when
deciding how deep a node must lie in a phylogeny in order to
demonstrate genuine speciation. Each DNA sequence represents
only one allele (and in some cases, only one mitochondrion within
one cell of one individual (69)), and because genetic diversity
within species can be substantial a few unfortunately selected
representatives or an undersampling of a given species can lead
to spurious species assignments, which in turn can lead to a high degree of confidence in a mistaken species tree topology (70).
Simply avoiding the problem by calling groups of related individuals something else (such as operational taxonomic units
(OTUs) or populations) does not address the issue because the
key point is not so much whether the OTUs in a species tree study
are genuine species, but whether or not gene flow has ceased
(at least temporarily) at the time of sampling. Species trees need
not use "good" species as OTUs; they work perfectly well on
lineages that have recently diverged and ceased exchanging
genes, but nonetheless are not sufficiently divergent as to be called
species by other criteria. One common solution is to define speciation as occurring when both taxa in question are completely and
reciprocally genetically isolated. However, this criterion is generally considered too conservative (71), and fails to account for
situations in which genetic introgression occurs via a different
mechanism than incomplete reproductive isolation.
The problem of species delimitation may ultimately be solved
by data other than genetics, and today few species concepts use
strictly genetic criteria (72). Some have suggested that the line
between a population-level difference and a species-level difference
can be drawn empirically and with consistency in well-studied taxa,
such as birds, using morphological, environmental, and behavioral
data simultaneously (73). Thus, there is some hope that species
delimitation can be performed rigorously a priori in some cases.
Researchers who opt for delimiting species primarily with molecular
data have a wide array of techniques and prior examples available to
them (e.g., STRUCTURE (74), STRUCTURAMA (75), Brownie
(53), rjMCMC (57, 76); BEST-STEM approach (77, 78)).


Recent progress in species delimitation is motivated by the conceptual transition from "biological/reproductive isolation" species to the traditional "phylogenetic" species requiring gene tree monophyly, and ultimately to the "lineage" species concept,
which defines species not in terms of monophyly of gene lineages
but as population lineage segments in the species tree (71). Under
that recently expanded concept, boundaries of species (i.e., lineages
in the species tree) can be estimated from a collection of gene trees
in the framework of the multispecies coalescent model (57, 77).
3.1.2. Gene Flow

There are a number of other situations in which the assumptions of the coalescent are violated. A key assumption in most species tree methods developed thus far is that no gene flow occurs between the taxa in a radiating clade. If some small amount of
gene flow continues between species after divergence, then the
multispecies coalescent can quickly destabilize, especially for a
small number of loci and as the rate of genetic introgression
increases (79, 80). Further studies of the effect of gene flow on
species tree inference are needed to determine the parameter space
in which it is and is not a significant problem, and how sampling or
analysis might ameliorate it.

3.2. Molecular Processes

In addition to species delimitation and gene flow, there are at least three mechanisms that generate discordance on the molecular level
(Fig. 3). These include horizontal gene transfer (HGT), which
violates the assumptions of the coalescent in such a way that it can
pose a serious risk to phylogenetic analysis with some methods;
gene duplication, whose risks can be avoided by certain models; and
natural selection, which generally poses no direct threat but,
depending on its mode of action and consequences for DNA and
protein sequences, can be the most challenging of all.

3.2.1. Horizontal Gene Transfer

HGT is now known to be so widespread in prokaryotes that a Tree of Life, even with reticulation, has been rejected by some authors as an inappropriate paradigm for these domains (81-83), though many others feel that this is an overreaction (84-88). Though generally
ignored in eukaryotes, evidence increasingly shows that eukaryotic
genomes contain substantial amounts of uploaded genetic material
from Bacteria, Archaea, viruses, and even fellow eukaryotes. Though
eukaryotic gene sharing is most widespread between protists, it is also
reasonably common between plant lineages (89), and has been documented for animals, fungi, and interdomain transfers as well. For
example, Wolbachia have inserted their entire genomes (~1 Mb)
into the germ lines of at least eight species of nematodes and insects
(90). Transposable elements, such as helitrons, are continuously
being shared among widely divergent eukaryotic lineages, including
fish and mammals, possibly using viruses as vectors (91). Even
though good techniques are not yet widely available for detecting

HGT in eukaryotes, enough individual cases have been accidentally discovered that reviewers have given up trying to list them all (92).

[Fig. 3 graphic: paired "true history" versus "inferred history" diagrams for three scenarios: gene duplication (copy 1 vs. copy 2), convergent evolution, and horizontal gene transfer.]

Fig. 3. Three examples of noncoalescent gene histories. (a) A duplication event that precedes a speciation event can lead to incorrect inference of divergence times in the species tree if copy 1 is compared to copy 2. This can be particularly difficult if one of the gene copies has been lost or not sequenced by the researcher. (b) Convergent evolution can occur at the molecular level, for example in certain genes under environmental selection if both taxa move into the same environment. It tends to bring distantly related taxa into a jumbled polyphyletic clade, and is likely to be given additional false support by morphological data. (c) Horizontal gene transfer causes difficulties in current species tree methods because it establishes a spurious lower bound to divergence times. Though rare in eukaryotes, it is by no means unknown, and is likely to become a more difficult problem in the future when species trees are based on tens of thousands of loci.
The implications of HGT for species tree research are substantial. For example, following the standard assumption in coalescent
theory that allelic divergences must occur earlier in time than the
divergences of species harboring those alleles, many species tree
techniques (56, 60) assume that the gene tree exhibiting the most
recent divergence between taxon A and taxon B establishes a hard
upper limit on the divergence time of those species in the species
tree. For small sets of genes in taxa where HGT is rare, a researcher
would need to be quite unlucky to choose a horizontally transferred gene for analysis. However, as the genomic era advances, it
becomes likely that at least one of the thousands of genes studied
will have been transferred horizontally and inadvertently establish
a spurious upper bound for clade divergence at the species level.
For example, if even one gene has been transferred between
humans and fruit flies in the last 910 million years (93) or
uploaded into the two lineages from the same pathogen, then
the date of this transfer event will be taken as the maximum
plausible divergence time for those species despite thousands of
other genes implying a much deeper split.
Although HGT is clearly a problem for some current methodologies, if transferred genes can first be identified, then they
could be extremely useful as genomic markers for monophyletic
groups that have inherited such genes and would otherwise be
difficult to resolve (94). Unfortunately, current methods to detect
such events rely both on having the true species tree already in hand
and also on the absence of other mechanisms causing gene tree
discordance (9597). For many types of comparisons, such as those
among major groups of animals or vertebrates, the data show
enough congruence to make identification of HGT events straightforward, and HGT appears to be infrequent among closely related
species of eukaryotes, although data is sparse. HGT poses particular
challenges for phylogenetically understudied groups for which the
expected shape of gene trees is not known.
3.2.2. Gene Duplication

Gene duplication presents another violation of the coalescent model; like HGT, its potential problems are worst when they go
unrecognized. Imagine a taxon where a gene of interest duplicated
10 Mya into copy a and copy b; the taxon then splits 5 Mya into
species 1 and 2. A researcher investigating the daughter species
would, therefore, sequence four homologous gene copies, with the potential to compare a1 to b2 and b1 to a2 and thus generate two gene trees in which the estimated split time was 10 Mya, rather than 5 Mya. Such a situation is easily recognized if copies a and b have diverged sufficiently since their duplication, and a
number of methods of phylogenetic analysis have incorporated
gene duplication (e.g., Refs. 91, 98). Additionally, failure to recognize the situation may not have drastic consequences for phylogenetic analysis if the paralogs had not diverged much, in
which case the estimated gene coalescence would be approximately
correct no matter which comparison was made. However, if one of
the copies has been lost and only one of the remaining copies is
sequenced, then the chances of inferring an inappropriately long
period of genetic isolation are larger, and increase as the size of the
family of paralogs increases. This problem tends to overestimate
gene coalescence times, and some species tree methods depend on
minimum isolation times among a large set of genes. In addition,
these deep coalescences might spuriously increase inferred ancestral
population sizes.
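The inflation caused by hidden paralogy can be illustrated with a toy calculation. The sketch below (Python; the substitution rate, times, and the no-multiple-hits distance are simplifying assumptions, not a realistic model) compares a naive divergence-time estimate from an ortholog pair (a1 vs. a2) with one from an unrecognized paralog pair (a1 vs. b2):

```python
import random

def simulate_divergence(rng, rate=1e-3, t_dup=10.0, t_split=5.0, n_sites=10_000):
    """Toy hidden-paralogy illustration (assumed per-site rate per My, no
    multiple hits). The gene duplicates at t_dup Mya; the species split at
    t_split Mya. Orthologs (a1 vs. a2) have accumulated 2*t_split My of
    independent change; paralogs (a1 vs. b2) have accumulated 2*t_dup My."""
    def naive_time_estimate(t_total):
        # observed fraction of differing sites ~ rate * t_total
        d = sum(rng.random() < rate * t_total for _ in range(n_sites)) / n_sites
        return d / (2 * rate)  # invert d ~ 2 * rate * t to get an age in My
    return naive_time_estimate(2 * t_split), naive_time_estimate(2 * t_dup)

rng = random.Random(7)
t_ortho, t_para = simulate_divergence(rng)
print(round(t_ortho, 1), round(t_para, 1))  # ~5 My vs. an inflated ~10 My
```

If only the paralog comparison is available (e.g., because one copy was lost in each species), the ~10 My estimate is indistinguishable from a genuinely deep split, which is precisely the failure mode described above.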
3.2.3. Natural Selection

Natural selection causes yet another violation of the multispecies coalescent model. Selection can cause serious problems in some
cases, although in other circumstances it is predicted not to cause
problems of phylogenetic analysis (99). The usual stabilizing
selection can be helpful to taxonomists working at high levels
because it slows the substitution rate; likewise, selective sweeps,
directional selection, and genetic surfing (100) tend to clarify
phylogenetic relationships by accelerating reciprocal monophyly
for genes in rapidly diverging clades. However, challenges to
phylogenetic inference are posed by convergent neutral mutations
(homoplasy), balancing selection, and selection-driven convergent evolution. Given a finite number of sites at a neutral locus,
occasional homoplasies occur, and are exacerbated by increased
variation in mutation rate among sites. In the absence of other
mechanisms, however, the addition of more informative and less
noisy loci often compensates for homoplasies at other loci.
Because balancing selection tends to preserve beneficial alleles at
a gene, two divergent taxa appear interdigitated at that locus and
reticulated through time if ancient DNA is available. Again,
including loci that are not under strong balancing selection, or
removing loci influenced by balancing selection from the data set,
should resolve this problem. Finally, convergent molecular evolution can occur across some genomic regions, at least in the mitochondrial genome, due to parallel selection on distantly related
taxa (e.g., Ref. 101). This "insidious" form of evolution (99) is particularly difficult to resolve mathematically, entrapping tree-building algorithms on false topologies because of strong support
for local optima or producing an excess of evidence favoring
incorrect phylogenies. It can also be difficult to detect, since the
synonymous/nonsynonymous mutation ratio might suggest
other types of selection, such as stabilizing selection, that in
themselves do not pose problems for phylogenetic analysis.

4. Detecting Violations of the Multispecies Coalescent Model

Many of the instances of violations of the coalescent model will occur at individual genes, and usually will not dominate the signal of the entire suite of genes sampled for phylogenetic analysis. Thus, we can think of such genes as "phylogenetic outliers": genes whose phylogenetic signal differs significantly from that of the remainder of the data set. This in turn raises the possibility of
developing statistical tests to identify such outliers, prior to, during, or after phylogenetic analysis, so that they can ultimately be
removed or downweighted. There is a robust history of detecting
outliers in phylogenetics, for example detecting cases of incongruence (102) or genes subject to HGT (95, 103). However, there
has been little work to our knowledge in detecting outliers while
simultaneously accounting for the variation among genes introduced by the multispecies coalescent. In addition, with or without
the context of the multispecies coalescent, there has been little
work on detecting phylogenetic outliers due to forces other than HGT, for example, natural selection.
4.1. Detecting Population Genetic Outliers

Detection of outliers has recently come to the fore in the field of population genomics, and recent years have seen a flurry of studies analyzing hundreds, if not thousands, of genetically independent
loci, especially in surveys of model species, such as humans and
Drosophila. For example, there exist Bayesian methods to detect
loci that differ significantly from the dominant signal as measured
by Fst or some other metric of population divergence (104). In the
case of Fst, some means of correcting for the average heterozygosity among markers is necessary because the extent of differentiation of loci with higher average heterozygosity is expected to have a higher variance than that of markers with low heterozygosity. The variance in differentiation among loci is useful to set up a null hypothesis for the test
statistic and genes falling outside this expected variance are deemed
outliers. In general, the construction of a valid null hypothesis for the average locus in a given multilocus data set (incorporating as many sources of variance as possible, including coalescent variance) can be useful in erecting statistical tests of outliers. We first mention some
ways in which phylogenetic outliers can be identified using traditional methods in molecular evolution. We then outline several
approaches that we suggest might be useful in identifying outliers
in the multispecies coalescent model, and provide an example of a
test that may prove useful to the community.
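As a cartoon of such a test, the sketch below computes a simple two-population Fst per locus and flags loci whose value lies far from the mean of the remaining loci (a leave-one-out z-score). This is only a stand-in for the model-based Bayesian tests cited above: it ignores the heterozygosity correction and the coalescent variance discussed in the text, and the thresholds and frequencies are invented.

```python
from statistics import mean, stdev

def fst_two_pops(p1, p2):
    """Wright-style Fst for a biallelic locus from allele frequencies in two
    populations: Fst = (Ht - Hs)/Ht, with Hs the mean within-population
    heterozygosity and Ht the heterozygosity at the pooled frequency."""
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2.0
    pbar = (p1 + p2) / 2.0
    ht = 2 * pbar * (1 - pbar)
    return 0.0 if ht == 0 else (ht - hs) / ht

def flag_outliers(freq_pairs, z_cut=3.0):
    """Flag locus i when its Fst sits more than z_cut standard deviations
    from the mean of all *other* loci (leave-one-out, so one extreme locus
    cannot inflate its own null distribution)."""
    fsts = [fst_two_pops(p1, p2) for p1, p2 in freq_pairs]
    flagged = []
    for i, f in enumerate(fsts):
        rest = fsts[:i] + fsts[i + 1:]
        mu, sd = mean(rest), stdev(rest)
        if sd > 0 and abs(f - mu) > z_cut * sd:
            flagged.append(i)
    return flagged

loci = [(0.50, 0.55), (0.42, 0.40), (0.61, 0.58),
        (0.48, 0.52), (0.55, 0.50), (0.10, 0.95)]  # last locus: strong differentiation
print(flag_outliers(loci))  # -> [5]
```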
4.2. Detecting Phylogenetic Outliers

Synonymous/nonsynonymous mutation ratio: One method of detecting
potentially problematic forms of selection is to look for loci with
unusual dN/dS ratios. According to neutral theory, most loci should
be under stabilizing selection, and hence have many more mutations


in the third codon position than in positions one and two. Regions
under balancing selection should have higher nonsynonymous
mutation rates. However, using the dN/dS ratio as a means of
detecting phylogenetic outliers presents some difficulties. Of course,
such a test would only be applicable to coding regions (see Chap. 5
of this Volume; ref. 122). Additionally, although such genes may
exhibit anomalous behavior at the amino-acid level, they may not be
anomalous in their phylogenetic signal, which is our primary concern. Finally, many coding loci may undergo substitutions more
freely than expected due to canalization (sensu Waddington (105))
or genomic redundancy. Many genes exhibit a slight excess
of nonsynonymous substitutions within populations because
even strong directional selection rarely purges all such alleles from
populations (106).
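A first-pass screen along these lines need not even translate codons: tabulating mismatches by codon position already indicates whether differences are concentrated at (mostly synonymous) third positions. A minimal sketch with made-up aligned sequences (in this example all four differences happen to be synonymous third-position changes):

```python
def diffs_by_codon_position(seq1, seq2):
    """Count mismatches between two aligned, in-frame coding sequences by
    codon position (1, 2, 3). Under predominantly purifying selection, third
    positions should accumulate the most differences; an excess at positions
    1-2 is a crude flag for a locus deserving closer dN/dS scrutiny."""
    counts = {1: 0, 2: 0, 3: 0}
    for i, (a, b) in enumerate(zip(seq1, seq2)):
        if a != b:
            counts[i % 3 + 1] += 1
    return counts

s1 = "ATGGCTAGCGATCTA"  # Met-Ala-Ser-Asp-Leu
s2 = "ATGGCAAGTGACCTG"  # same protein: all differences are third-position
print(diffs_by_codon_position(s1, s2))  # -> {1: 0, 2: 0, 3: 4}
```

A proper dN/dS analysis (see Chap. 5 of this Volume) additionally requires a genetic code table and corrections for multiple hits; this positional tally is only the cheapest possible proxy.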
GC ratio and DNA word frequencies: Regions of the genome that
have been acquired from another domain of life (such as a eukaryote with DNA from viruses, bacteria, or archaea) often have an
unusual GC composition relative to the rest of the genome. Indeed,
focusing on genomic regions with anomalous GC content is a
common method for identifying genes that have undergone
HGT. More complex consequences of base composition and mutation patterns, such as the frequencies of DNA oligonucleotides
("words") in coding or noncoding regions, have also been used
to flag potential HGT genes, particularly in bacteria (107, 108).
Like the test above, the results of GC or DNA word frequency
analysis should be considered suggestive, but not conclusive. There
are other reasons for unusual GC content (e.g., leucine zipper
motifs, a GC microsatellite, etc.), which are likely to occur by
chance in a large genome. Again, the phylogenetic consequences
of such deviations in evolutionary pattern are paramount. In this
regard, high variation in GC content among genes can cause strong
deviations in resulting phylogenies, although distinguishing the
true gene tree from the tree suggested by the variation can be
challenging (e.g., using LogDet distances (109)).
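A minimal version of the compositional screen is a sliding-window GC scan against the genome-wide mean; the window size and threshold below are illustrative, and as noted above a hit is only suggestive, never proof of HGT.

```python
def gc(seq):
    """Fraction of G/C bases in a sequence."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

def gc_outlier_windows(genome, window=1000, step=500, delta=0.15):
    """Return (start, GC) for windows whose GC content deviates from the
    genome-wide mean by more than `delta` -- the simple compositional screen
    often used as a first pass for candidate horizontally acquired regions.
    Thresholds are illustrative, not calibrated."""
    base = gc(genome)
    hits = []
    for start in range(0, len(genome) - window + 1, step):
        w = genome[start:start + window]
        if abs(gc(w) - base) > delta:
            hits.append((start, round(gc(w), 3)))
    return hits

host = "AATGC" * 400    # 2,000 bp of "native" sequence at 40% GC
insert = "GGCCA" * 200  # 1,000 bp insertion at 80% GC, a compositional outlier
genome = host[:1000] + insert + host[1000:]
print(gc_outlier_windows(genome))  # -> [(1000, 0.8)]
```

The same skeleton generalizes to k-mer ("word") frequency profiles by replacing `gc` with a distance between a window's k-mer spectrum and the genome-wide spectrum.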
4.3. Statistical Tests to Detect Phylogenetic Outliers

When faced with a surprising or nonconvergent species tree, one possibility is that an unusual gene tree is to blame. Though techniques for dealing with violations of the coalescent model are in
their infancy, researchers do have a few options. Below, we list
several ideas, some borrowed from classical phylogenetics or from
methods used in bioinformatics. It is likely that the several tests
constructed to detect phylogenetic outliers in classical phylogenetics can be extended slightly to incorporate the additional variation among genes expected due to the coalescent process. Of
course, with larger data sets, single anomalous genes may have
little effect on the resulting species tree, particularly in species tree
methods utilizing summary statistics (e.g., STAR/STEAC (45)).
However, as pointed out above, species tree methods, such as BEST, that rely on hard boundaries for the species tree set by individual genes could be derailed due to the anomalous behavior of even a single gene.
Jackknifing: A straightforward approach to detecting phylogenetic
outliers under the multispecies coalescent model is to rerun the
analysis n times, where n is the number of loci in the study, leaving
one locus out each time. An outlier can then be identified if the
analysis that does not include that gene differs from the remaining
analyses in which that gene is included. This approach has been
applied successfully in fruit flies by Wong et al. (19), who considered their problem resolved when the elimination of one of ten
genes unambiguously resolved a polytomy. There may be other
metrics of success that are more robust or sensitive or do not
depend as strongly on a priori beliefs about the relationships
among taxa. Because some duplications or horizontal transfers
may affect only one taxon, whole-tree topology summary statistics
are unlikely to be sensitive enough to detect recent events. However, the cophenetic distance of each taxon to its nearest neighbor
in the complete species tree could be compared across jackknife
results. This procedure produces a distribution of typical distances, and significance can therefore be assigned to highly divergent results. The drawback to such an approach is the
computational demand. Species tree analyses on their own can be
extremely time consuming to run even once, so jackknifing may
prove intractable for studies involving many species and loci.
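As a toy version of this jackknife, suppose the species divergence estimator simply takes the minimum gene divergence time across loci (the kind of hard boundary used by BEST-like methods discussed above); leaving out one locus at a time immediately exposes a single anomalously recent gene. The times and threshold here are invented for illustration.

```python
def jackknife_outliers(locus_times, rel_change=0.5):
    """Leave-one-locus-out jackknife on a toy species-divergence estimator
    that takes the *minimum* gene divergence time across loci. A locus is
    flagged when dropping it shifts the estimate by more than a fraction
    `rel_change` of the full-data estimate."""
    full = min(locus_times)
    suspects = []
    for i in range(len(locus_times)):
        reduced = locus_times[:i] + locus_times[i + 1:]
        if abs(min(reduced) - full) > rel_change * full:
            suspects.append(i)
    return suspects

# Five loci (divergence times in My); the last one's anomalously recent
# coalescence is the signature expected of an HGT or introgressed gene.
times = [5.2, 4.8, 5.0, 5.5, 0.7]
print(jackknife_outliers(times))  # -> [4]
```

In a real analysis each jackknife replicate is a full species-tree run rather than a `min()`, which is exactly why the computational cost noted above can become prohibitive.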
4.4. Species Tree Methods Accommodating Anomalous Loci

One attractive prospect is to develop algorithms for species tree construction that are less susceptible to the effects of single genes.
STAR and STEAC are two approaches that use summary statistics
(average ranks or coalescence times across genes) to reconstruct
species trees. These methods are powerful and fast, yet they do not
utilize all the information in the data, and hence can be less accurate
than Bayesian or likelihood methods (45). A recently introduced
likelihood method based on gene tree triples also seems relatively
immune to events like HGT that compromise the signal in single
genes (58). Nonetheless, it would be desirable to have a fully
Bayesian or likelihood method that can resist bias introduced by
individual genes. For example, rather than basing clade divergence
times on the minimum gene tree split times, as done in BEST,
species divergence times could be chosen from the joint posterior
distribution of divergence times produced across gene trees. This
means that noncoalescent events would be incorporated into a
coalescent analysis only as often as they actually occur in the data,
given a sufficiently long MCMC run, and their effect on the final
result would be diluted. However, an alternative to the standard
Felsenstein likelihood (110; see also Ref. 61) would be required to
evaluate the likelihood of the species tree, since the Felsenstein
likelihood will always be extremely low for recent HGT events.

C.N.K. Anderson et al.

It is possible to run MCMC chains without this likelihood using a summary statistic or epsilon-kernel approach (111), but software implementing the praxis is not yet available.
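The likelihood-free idea behind (111) is easiest to see in its simple rejection form: draw parameters from the prior, simulate, and keep draws whose summary statistic lands within epsilon of the observed value. The simulator and prior below are toy stand-ins, not the actual species-tree machinery:

```python
import random

def abc_rejection(observed_stat, simulate, prior_draw, eps, n_draws, rng):
    """Keep parameter draws whose simulated summary statistic falls
    within eps of the observed statistic (an epsilon-kernel acceptance)."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)
        if abs(simulate(theta, rng) - observed_stat) < eps:
            accepted.append(theta)
    return accepted

# Toy stand-in: recover the mean of an exponential from the mean of 50 draws.
rng = random.Random(1)
simulate = lambda theta, r: sum(r.expovariate(1.0 / theta) for _ in range(50)) / 50
posterior = abc_rejection(2.0, simulate, lambda r: r.uniform(0.1, 10.0),
                          eps=0.3, n_draws=2000, rng=rng)
```

The accepted draws approximate the posterior without ever evaluating a likelihood, which is what makes the approach attractive when the Felsenstein likelihood is uninformative for recent HGT.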
Alternatives to coalescent models: Models using macroevolutionary processes other than the coalescent have merit, although a key question is whether the observed variation in gene trees could be accommodated by coalescent variance. For example, Galtier's HGT software (112) does not assume multispecies coalescence, but allows HGT events across a phylogenetic tree. Though the software simulates data rather than analyzing it, a set of simulation results can be compared to actual data (e.g., the correlation between sequence length and gene tree concordance) to determine how likely HGT is to be affecting a real data set. Galtier's method has the advantage of allowing HGT to occur only between contemporaneous lineages. By contrast, Suchard's (114) stochastic MCMC model has been criticized because HGT events are simulated through a Subtree-Prune-and-Regraft move, which does not preserve ultrametricity in rooted phylogenies and therefore may allow genes to be transferred between lineages that do not coexist temporally. However, both Galtier's method and kinetic models that determine equilibrium amounts of foreign DNA in genomes subject to HGT and gene duplication can sometimes yield surprising results: Galtier's method, for example, suggested greater HGT in eukaryotic than in bacterial data sets, and the kinetic model can overestimate the amount of foreign DNA present in a genome, even in the two-species case (113).
Huson and Bryant approach the problem from a network
theory perspective with SplitsTree4 (115). Here, a free network is
fit to genetic data, and then analyzed for non-tree-like reticulations.
Though useful in detecting phylogenetic outliers, the software suffers from the same potential for time-travelling lineages as Suchard's model. One of the more attractive alternatives is
conditioned reconstruction (116), which uses a Markov model to
allow genes to appear and disappear in lineages similar to the way
single nucleotides change in traditional mutation models. The
software is designed to detect whole-genome fusion events, meaning that the fundamental macroevolutionary model is a ring with
branches rather than a tree. Finally, Bayesian Concordance Analysis
(33) sidesteps the issue of alternative models by instead quantifying
how much vertical vs. horizontal signature is present in a multilocus
data set.
This last technique has recently been extended and proposed as
a way to reject the coalescent, with its assumption that the only
source of discordance is incomplete lineage sorting, as a sufficient
model for rooted three-species topologies (117). One does this by
comparing concordance factors (CFs) for conflicting topologies;
for example, if the CF for two trees is exactly 50%, then it is likely
that the common sister group in these two trees was produced

1 Tangled Trees: Noncoalescent Genes

through hybridization or whole-genome fusion. Alternatively, low levels of CF can be compared to the theoretical expectation of the CF for an incorrect topology, (1/3)exp(-t), under incomplete lineage sorting. In the future, it may be theoretically and computationally possible to generalize this test to n-species topologies.
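The expectation used here follows directly from the multispecies coalescent: each of the two discordant rooted triples has expected CF (1/3)exp(-t), where t is the internal branch length in coalescent units. A small illustrative function (not part of any cited package) makes the comparison explicit:

```python
import math

def expected_cfs(t):
    """Expected concordance factors for rooted triples under the
    multispecies coalescent; t is the internal branch length of the
    species tree in coalescent units."""
    discordant = math.exp(-t) / 3.0        # each of the two minor topologies
    return 1.0 - 2.0 * discordant, discordant

# t = 0 recovers the 1/3 : 1/3 : 1/3 hard polytomy; a long branch leaves
# essentially no discordance.  A minor-topology CF above 1/3 (e.g., two
# conflicting trees at 50% each) cannot be produced by any t, pointing to
# hybridization or fusion rather than incomplete lineage sorting.
```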
Outlier analysis: One other option for multilocus studies is to
construct either histograms of genetic distance or regressions of
molecular divergences between taxa in which each point represents
one locus, thereby allowing visual or statistical identification of outlier
loci. From a pragmatic and computational point of view, this is an
attractive option because genetic distances between taxa already need
to be calculated in most species tree software; thus, a second step
analyzing these distances would be computationally cheap. Such a
method also has the benefit of being able to detect both duplication
events and HGTs. Below, we provide a simulated example.
Example: We simulate a ten-species phylogeny (Fig. 4) with normally distributed divergence times (since species trees generally do

Fig. 4. HGT can be detected by comparing the diversity of genes in all taxa to the diversity
of genes in pairs of taxa. Transfer events should appear as anomalies in regressions or
histograms in each pair of species, in this case locus 21. In the example pair above, 1 of
the 20 normal loci also lies outside the 95% confidence band as expected, but this
locus would not be expected to lie outside the confidence band in all pairs. This particular
locus highlights another hazard of such an analysis: the locus has saturated (100
segregating sites in a 100-bp locus) and thus shows a positive deviation from expectation
in closely related taxa.


not exhibit the exponentially increasing divergence times of a coalescent model). We then sprinkle Jukes-Cantor mutations
on this tree with mutation rates spanning two orders of magnitude (more than is commonly observed in nature, to provide a rigorous model test) to generate 20 loci of 100 nucleotides each (a fairly
modest total of 2,000 base pairs). The key component of this test is
the use of multiple loci to establish a pattern that can possibly be
violated by HGT. Finally, a 21st gene is simulated on a species tree
in which one taxon has acquired the gene laterally from another at
some point in the past. We then need an appropriate statistic with
which to quantify the phylogenetic patterns and divergences
among gene trees. Though many statistics are available, here we
simply count the number of variable sites displayed by a given pair
of species for clarity. Regressing the number of variable sites across
all ten taxa versus the number of variable sites between pairs of taxa
clearly demonstrates both the presence and direction of HGT
(Fig. 4). The recipient taxon can be easily distinguished because it
is anomalous in all pairwise comparisons. The donor taxon can be
identified as the closest relative of the recipient in that gene tree,
who is also a distant relative in all other gene trees. Since the HGT
event should be detectable by pairing the recipient taxon with any
other taxon in the tree, one test that should provide substantial
power is to count the number of times a locus lies outside the 95%
confidence band for each pairwise comparison. An HGT event that
occurs between internal nodes would appear in even more comparisons, though events that occur just after an actual lineage split may
not be detectable.
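A bare-bones version of this pairwise band test might look like the following, with per-locus counts supplied as plain lists rather than output from any particular program:

```python
def count_band_violations(total_sites, pair_sites, z=1.96):
    """For one species pair, regress per-locus pairwise variable sites on
    per-locus variable sites across all taxa, and return the indices of
    loci whose residual lies outside the approximate 95% band."""
    n = len(total_sites)
    mx = sum(total_sites) / n
    my = sum(pair_sites) / n
    sxx = sum((x - mx) ** 2 for x in total_sites)
    sxy = sum((x - mx) * (y - my) for x, y in zip(total_sites, pair_sites))
    slope = sxy / sxx
    resid = [y - (my + slope * (x - mx))
             for x, y in zip(total_sites, pair_sites)]
    sd = (sum(r * r for r in resid) / (n - 2)) ** 0.5
    return [i for i, r in enumerate(resid) if abs(r) > z * sd]

# Synthetic example: 10 loci, pairwise counts proportional to totals,
# with extra variable sites injected at locus 3 to mimic an HGT signal.
totals = [float(t) for t in range(10, 101, 10)]
pair = [0.5 * t for t in totals]
pair[3] += 30.0
flagged = count_band_violations(totals, pair)
```

Running this over every taxon pair and tallying how often each locus is flagged implements the counting test described above: an HGT recipient shows up in (nearly) all of its pairwise comparisons, while a locus flagged once or twice is just the expected 5% noise.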

5. Future Directions
Species tree methods are likely to continue to gain ascendancy as
the strongest evidence of taxonomic relationship in phylogenetic
research. As with any form of evidence, the conclusions of a species
tree analysis are fallible, with each method susceptible to certain
biases in exceptional cases. In the future, we hope that these biases
and susceptibilities can be overcome, and that species tree methods
will continue to multiply. Because the most robust techniques rely
heavily on a coalescent paradigm, the field needs a method for
detecting loci that violate the assumptions of coalescent theory.
A few ideas for how to do this have been outlined above, but they certainly need rigorous theoretical and empirical testing to establish their effectiveness in phylogenetic inference.
Detection is just the first step. Currently, when such loci are
discovered, researchers have two options: they can use methods
that are sufficiently robust (hopefully) to overcome the faulty


assumptions of coalescence or remove the loci from the analysis set.


These solutions, though adequate, are not best-case scenarios. As
discussed above, it would be preferable to develop methods that use
the information contained in noncoalescent events to further support phylogenetic inference. Such a program, widely applied,
would have the potential to not only solidify our understanding
of the genetic relationships of all organisms, but also provide
invaluable insight into the prevalence and significance of nonstandard evolutionary modes.

6. Practice Problems
1. Consider the following discordant set of gene trees: {Gene 1: (A:10,(B:8,C:8):2); Gene 2: (B:9,(A:6,C:6):3); Gene 3: ((A:4,B:4):4,C:8)}. Assuming that these genes perfectly delimit the time of genetic divergence and the only cause of discordance is deep coalescence, what is the correct species tree?
2. In a study of five closely related species, you sequence five short loci and obtain the following matrix of variable sites between taxon pairs (above the diagonal: per-gene counts for the five loci; below the diagonal: totals across loci).

              Species A   Species B   Species C   Species D   Species E
   Species A      -       2,3,6,4,1   3,7,6,9,1   4,7,6,9,1   4,7,6,9,1
   Species B     16           -       4,7,1,9,1   4,7,5,9,1   4,6,5,9,1
   Species C     26          22           -       3,6,5,8,1   4,7,5,9,1
   Species D     27          26          23           -       1,2,2,3,0
   Species E     27          27          26           8           -
Which gene is the most likely to have been horizontally transferred, and between which two taxa?

Appendix A: Simulating Gene Trees in Species Trees

Many researchers have found it useful to simulate the evolution of genes over a species tree topology. This can be done to test mathematical models, to get a feel for the amount of divergence expected in real data, or (as described below) to rigorously compare the ability of alternative species histories to account for data in hand. The program produces expected amounts of isolation due to drift, and in the context of Bayesian analysis can be used to infer other


Fig. 5. The species tree simulated in the Appendix. Branch lengths are in units of
generations, and branch widths (population sizes) are in units of individuals. This
particular tree has the constraint that ancestral population sizes are the sum of the
population sizes of descendent lineages, but of course one can simulate without these
constraints using either Serial SimCoal or Phybase.

parameters regarding the demographic processes occurring at scales finer than the species group. A simple example of how this could be accomplished in Bayesian Serial SimCoal (118, 119) is described below. The suite of tools available through Arlequin (120) and the R scripts in Phybase (121) can be used to further analyze the output of BayeSSC.
Although species trees can be simulated from a birth-death process using the R package TreeSim (http://cran.r-project.org/web/packages/TreeSim/index.html), researchers often adopt a fixed species tree on which to simulate gene trees. Imagine a species tree with ten individuals, four species (with 4, 2, 3, and 1 representatives, respectively), and with known (or previously inferred) split times among taxa. In addition, we will assume for this example that the effective population size Ne of each contemporary species is 1,000, and that the size of ancestral populations is the sum of the sizes of their respective descendent populations. This situation is analogous to that depicted in Fig. 5. The corresponding NEXUS-formatted species tree is:

(D:1500,(C:800,(B:500,A:500):300):700);

Here, branch lengths are in units of generations, which is commensurate with using units of individuals for the population sizes (other simulation methods use units of τ = μt and θ = 4Neμ, in units of substitutions per site, instead of t and Ne, respectively).
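The unit conversion between the two conventions is a one-liner; the mutation rate below is a hypothetical per-site, per-generation value matching the 0.0001 used in the .par example:

```python
def to_mutation_units(t_generations, ne, mu):
    """Convert generations and individuals to mutation-scaled units:
    tau = mu * t (expected substitutions per site) and theta = 4 * Ne * mu."""
    return mu * t_generations, 4.0 * ne * mu

# The 1,500-generation root branch of the example tree, Ne = 1,000, mu = 1e-4.
tau, theta = to_mutation_units(1500, 1000, 1e-4)
```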


A simple forward simulation can be run in any version of SimCoal using the following .par file:
//Species tree input file; 10 taxa, 4 sp
4 demes
//Deme sizes (arbitrary in this case)
1000
1000
1000
1000
//Number of samples per deme
4
2
3
1
//Growth rates
0
0
0
0
//Number of migration matrices
0
//Historical events: date from to %mig new_N new_r migmat
3 events
500 1 0 1 2.00 0 0
800 2 0 1 1.50 0 0
1500 3 0 1 1.33 0 0
//Mutations per generation for the whole sequence
0.0001
//Number of loci
10
//Data type: DNA, RFLP, or MICROSAT
DNA
//Mutation rates: Gamma parameters, theta and k
0 0
In this case, the tree was perfectly ordered, so all populations could simply fuse with deme 0, readjusting the population size each time. Of course, there is no need to assume that all populations have the same effective size, nor that the Ne of ancestral populations was the sum of the Ne values of their descendants. If we wished to infer the size of clade AB at the time of the split, for example, we could replace the 2.00 in the first historical event with {U:0.5,3.0}, which would allow the program to infer the posterior probabilities of clade AB having an Ne from 500 to 3,000 individuals. Similarly,


if the mutation rate of the gene in question was unknown, or if a range of mutation rates would better match the desiderata, then the mutation rate constant, set in the example above at 0.0001, could be replaced with {E:0.0001}, creating an exponential distribution of mutation rates whose mean was 0.0001. Full documentation on the parameter files, and on Bayesian inference using priors instead of constants, can be found at the BayeSSC Web site: http://www.stanford.edu/group/hadlylab/ssc/.
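The model underlying such simulations can also be sketched directly. The following is an illustration of the multispecies coalescent on the Fig. 5 tree, not the SimCoal engine: time runs in generations, and the waiting time to the next coalescence among k lineages in a population of diploid size Ne is exponential with rate k(k-1)/(4Ne).

```python
import random

def coalesce_epoch(lineages, ne, t_start, t_end, rng):
    """Coalesce `lineages` within one population of diploid size `ne`
    between t_start and t_end (generations); returns surviving lineages."""
    lineages = list(lineages)
    t = t_start
    while len(lineages) > 1:
        k = len(lineages)
        t_next = t + rng.expovariate(k * (k - 1) / (4.0 * ne))  # C(k,2)/(2Ne)
        if t_next >= t_end:
            break
        t = t_next
        i, j = sorted(rng.sample(range(k), 2), reverse=True)
        a, b = lineages.pop(i), lineages.pop(j)
        lineages.append(((a, b), t))  # internal node tagged with its time
    return lineages

def simulate_gene_tree(rng):
    """One gene tree on (D:1500,(C:800,(B:500,A:500):300):700), with
    Ne = 1,000 per extant species and ancestral Ne = sum of descendants."""
    a = [("A%d" % i, 0.0) for i in range(1, 5)]
    b = [("B%d" % i, 0.0) for i in range(1, 3)]
    c = [("C%d" % i, 0.0) for i in range(1, 4)]
    d = [("D1", 0.0)]
    a = coalesce_epoch(a, 1000, 0.0, 500.0, rng)
    b = coalesce_epoch(b, 1000, 0.0, 500.0, rng)
    c = coalesce_epoch(c, 1000, 0.0, 800.0, rng)
    ab = coalesce_epoch(a + b, 2000, 500.0, 800.0, rng)     # ancestor of A, B
    abc = coalesce_epoch(ab + c, 3000, 800.0, 1500.0, rng)  # ancestor of A, B, C
    root, = coalesce_epoch(abc + d, 4000, 1500.0, float("inf"), rng)
    return root

def leaves(node):
    label, _ = node
    return [label] if isinstance(label, str) else leaves(label[0]) + leaves(label[1])
```

Repeating simulate_gene_tree many times yields the gene-tree distribution that BayeSSC samples with its own machinery; deep coalescences appear whenever a lineage fails to find a partner before its epoch ends and is carried into the ancestral population.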
Note that the suite of Bayesian tools available at the Web site
can be used to evaluate the relative strength of different species
topologies. For example, the correspondence between output from
the parameter file above with a perfectly ordered tree (((AB)C)D)
and real data can be mathematically compared to the correspondence from a second file, where the tree is balanced with, say,
topology ((AB)(CD)) instead.

References
1. Hillis DM (1987) Molecular Versus Morphological Approaches to Systematics. Annu Rev
Ecol Syst 18:2342
2. Kocher TD, Thomas WK, Meyer A et al
(1989) Dynamics of mitochondrial DNA evolution in animals: amplification and sequencing with conserved primers. Proc Natl Acad
Sci USA 86:61966200
3. Miyamoto MM, Cracraft J (1991) Phylogeny
inference, DNA sequence analysis, and the
future of molecular systematics. In: Miyamoto
MM, Cracraft J (eds) Phylogenetic Analysis of
DNA Sequences. Oxford Univ. Press, New
York
4. Swofford DL, Olsen GJ, Waddell PJ et al
(1996) Phylogenetic inference. In: Hillis
DM MC, Mable BK (ed) Molecular Systematics. Sinauer Associates, Sunderland MA
5. Nei M (1987) Molecular Evolutionary Genetics, Columbia University Press, New York
6. Nei M, Kumar S (2000) Molecular Evolution
and Phylogenetics, Oxford University Press,
New York
7. Rosenberg NA (2002) The Probability of
Topological Concordance of Gene Trees and
Species Trees. Theor Popul Biol 61:225247
8. Cavalli-Sforza LL (1964) Population structure and human evolution. Proc R Soc
Lond, Ser B: Biol Sci 164:362379
9. Avise JC, Arnold J, Ball RM et al (1987) Intraspecific phylogeography: the mitochondrial
DNA bridge between population genetics and
systematics. Annu Rev Ecol Syst 18:489522

10. Tajima F (1983) Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460
11. Pamilo P, Nei M (1988) Relationships
between gene trees and species trees. Molecular Biological Evolution 5:568583
12. Takahata N (1989) Gene genealogy in three
related populations: consistency probability
between gene and population trees. Genetics
122:957966
13. Avise JC (1994) Molecular markers, natural
history and evolution, Chapman and Hall,
New York
14. Wollenberg K, Avise JC (1998) Sampling properties of genealogical pathways underlying population pedigrees. Evolution 52:957-966
15. Gould SJ (2001) The Book of Life: An illustrated history of the evolution of life on
earth, W. W. Norton & Co., New York
16. Maddison WP (1997) Gene trees in species
trees. Syst Biol 46:523536
17. Jennings WB, Edwards SV (2005) Speciational history of Australian grass finches (Poephila) inferred from thirty gene trees.
Evolution 59:20332047
18. Carstens BC, Knowles LL (2007) Estimating
species phylogeny from gene-tree probabilities despite incomplete lineage sorting: An
example from melanoplus grasshoppers. Syst
Biol 56(3):400411
19. Wong A, Jensen JD, Pool JE et al (2007) Phylogenetic incongruence in the Drosophila melanogaster species group. Mol Phylogen Evol 43:1138-1150
20. Edwards SV (2009) Is a new and general
theory of molecular systematics emerging?
Evolution 63:119
21. Neigel JE, Avise JC (1986) Phylogenetic
relationships of mitochondrial DNA under
various demographic models of speciation.
In: Karlin S, Nevo E (eds) Evolutionary processes and theory. Academic Press, New York
22. Satta Y, Klein J, Takahata N (2000) DNA
Archives and Our Nearest Relative: The Trichotomy Problem Revisited. Mol Phylogen
Evol 14(2):259275
23. Degnan JH, Rosenberg NA (2006) Discordance of Species Trees with Their Most Likely
Gene Trees. PLoS Genet 2(5):e68
24. Rosenberg NA, Tao R (2008) Discordance of
species trees with their most likely gene trees:
the case of five taxa. Syst Biol 57:131140
25. Degnan JH, Rosenberg NA (2009) Gene tree
discordance, phylogenetic inference and the
multispecies coalescent. Trends Ecol Evol
24:332340
26. Huang H, Knowles LL (2009) What Is the
Danger of the Anomaly Zone for Empirical
Phylogenetics? Syst Biol 58(5):527536
27. Bryant D (2003) A Classification of Consensus Methods for Phylogenetics. In: Janowitz
MF, Lapointe F-J, McMorris FR, Mirking B,
Roberts FS (eds) Bioconsensus. American
Mathematical Society, Providence RI
28. Felsenstein J (2004) Inferring Phylogenies,
Sinauer Associates, Sunderland MA
29. Ewing GB, Ebersberger I, Schmidt HA et al
(2008) Rooted triple consensus and anomalous gene trees. BMC Evol Biol 8:118
30. Degnan JH, DeGiorgio M, Bryant D et al
(2009) Properties of Consensus Methods for
Inferring Species Trees from Gene Trees. Syst
Biol
31. Steel M, Rodrigo A (2008) Maximum Likelihood Supertrees. Syst Biol 57(2):243250
32. Ranwez V, Criscuolo A, Douzery EJP (2010)
SUPERTRIPLETS: a triplet-based supertree
approach to phylogenomics. Bioinformatics
26(12):i115-i123
33. Ané C, Larget B, Baum DA et al (2007) Bayesian Estimation of Concordance among Gene Trees. Mol Biol Evol 24:412-426
34. Larget BR, Kotha SK, Dewey CN et al
BUCKy: Gene tree/species tree reconciliation
with Bayesian concordance analysis. Bioinformatics 26:29102911

35. Wiens JJ (2003) Missing data, incomplete taxa, and phylogenetic accuracy. Syst Biol 52:528-538
36. Gadagkar SR, Rosenberg MS, Kumar S
(2005) Inferring species phylogenies from
multiple genes: concatenated sequence tree
versus consensus gene tree. Journal of Experimental Zoology B 304(1):6474
37. Bull JJ, Huelsenbeck JP, Cunningham CW
et al (1993) Partitioning and Combining
Data in Phylogenetic Analysis. Syst Biol
43:384397
38. Rokas A, Williams BL, Carroll NKSB et al
(2003) Genome-scale approaches to resolving
incongruence in molecular phylogenies.
Nature 425:798804
39. Driskell AC, Ané C, Burleigh JG et al (2004) Prospects for Building the Tree of Life from Large Sequence Databases. Science 306:1172-1174
40. Rokas A (2006) Genomics and the Tree of
Life. Science 313:18971899
41. Kubatko LS, Degnan JH (2007) Inconsistency of Phylogenetic Estimates from Concatenated Data under Coalescence. Syst Biol 56
(1):1724
42. Wu M, Eisen JA (2008) A simple, fast, and
accurate method of phylogenomic inference.
Genome Biology 9:R151
43. Degnan JH, Salter LA (2005) Gene tree distributions under the coalescent process. Evolution 59:2437
44. Liu L (2008) BEST: Bayesian estimation of
species trees under the coalescent model. Bioinformatics 24(21):25422543
45. Liu L, Yu L, Kubatko LS et al (2009) Coalescent methods for estimating phylogenetic
trees. Mol Phylogen Evol 53:320328
46. Castillo-Ramirez S, Liu L, Pearl DK et al
(2010) Bayesian estimation of species trees:
a practical guide to optimal sampling and
analysis. In: Knowles LL, Kubatko LS (eds)
Estimating species trees: Practical and theoretical aspects. Hoboken NJ, John Wiley and
Sons
47. Gillespie JH (2004) Population Genetics: A
Concise Guide, 2nd edn. The Johns Hopkins
University Press, Baltimore, MD
48. Wakeley J (2009) Coalescent Theory: An
Introduction, Roberts & Co. Publishers,
Greenwood Village, CO
49. Hartl DL, Clark AG (2006) Principles of Population Genetics, 4th edn. Sinauer Associates,
Inc., Sunderland, MA


50. Wilson IJ, Weale ME, Balding DJ (2003) Inferences from DNA data: population histories, evolutionary processes and forensic match probabilities. Journal of the Royal Statistical Society: Series A 166:155-158
51. Maddison WP, Knowles LL (2006) Inferring
phylogeny despite incomplete lineage sorting.
Syst Biol 55:2130
52. Kubatko LS, Carstens BC, Knowles LL
(2009) STEM: species tree estimation using
maximum likelihood for gene trees under coalescence. Bioinformatics 25(7):971973
53. O'Meara BC (2010) New Heuristic Methods for Joint Species Delimitation and Species Tree Inference. Syst Biol 59(1):59-73
54. O'Meara BC (2008) Using trees: Myrmecocystus phylogeny and character evolution and new methods for investigating trait evolution and species delimitation
55. Mossel E, Roch S (2007) Incomplete Lineage
Sorting: Consistent Phylogeny Estimation
From Multiple Loci. [mss]
56. Rannala B, Yang Z (2003) Bayes Estimation of
Species Divergence Times and Ancestral Population Sizes Using DNA Sequences From
Multiple Loci. Genetics 164:16451656
57. Yang Z, Rannala B (2010) Bayesian species
delimitation using multilocus sequence data.
Proc Natl Acad Sci USA 107:92649269
58. Liu L, Yu L, Edwards SV (2010) A maximum
pseudo-likelihood approach for estimating
species trees under the coalescent model.
BMC Evol Biol 10:302
59. Oliver JC (2008) AUGIST: inferring species
trees while accommodating gene tree uncertainty. Bioinformatics 24:29322933
60. Liu L, Pearl DK (2007) Species Trees from
Gene Trees: Reconstructing Bayesian Posterior Distributions of a Species Phylogeny
Using Estimated Gene Tree Distributions.
Syst Biol 56(3):504514
61. Heled J, Drummond AJ (2010) Bayesian
Inference of Species Trees from Multilocus
Data. Mol Biol Evol 27:570580
62. Chung Y, Ané C (2011) Comparing Two Bayesian Methods for Gene Tree/Species Tree Reconstruction: Simulations with Incomplete Lineage Sorting and Horizontal Gene Transfer. Syst Biol 60:261-275
63. Leaché AD, Rannala B The Accuracy of Species Tree Estimation under Simulation: A Comparison of Methods. Syst Biol
64. Edwards SV, Liu L, Pearl DK (2007) High-resolution species trees without concatenation. Proc Natl Acad Sci USA 104:5936-5941

65. Liu L, Edwards SV (2009) Phylogenetic Analysis in the Anomaly Zone. Syst Biol 58:452-460
66. Huang H, He Q, Kubatko LS et al (2010)
Sources of Error Inherent in Species-Tree
Estimation: Impact of Mutational and Coalescent Effects on Accuracy and Implications for
Choosing among Different Methods. Syst
Biol 59(5):573583
67. Suzuki Y, Glazko GV, Nei M (2002)
Overcredibility of molecular phylogenies
obtained by Bayesian phylogenetics. Proc
Natl Acad Sci USA 99:1613816143
68. Avise JC, Ball RM (1990) Principles of genealogical concordance in species concepts and
biological taxonomy. Oxford Surveys in Evolutionary Biology 7:4567
69. He Y, Wu J, Dressman DC et al (2010) Heteroplasmic mitochondrial DNA mutations in normal and tumour cells. Nature 464:610-614
70. Leaché AD (2009) Species Tree Discordance Traces to Phylogeographic Clade Boundaries in North American Fence Lizards (Sceloporus). Syst Biol 58:547-559
71. De Queiroz K (2007) Species Concepts and
Species Delimitation. Syst Biol 56:879886
72. Hudson RR, Coyne JA (2002) Mathematical
consequences of the genealogical species concept. Evolution 56:15571565
73. Tobias JA, Seddon N, Spottiswoode CN et al
(2010) Quantitative criteria for species delimitation. Ibis 152(4):724746
74. Pritchard JK, Stephens M, Donnelly P (2000)
Inference of population structure using multilocus genotype data. Genetics 155:945959
75. Huelsenbeck JP, Andolfatto P (2007) Inference of Population Structure Under a Dirichlet Process Model. Genetics 175:1787-1802
76. Leaché AD, Fujita MK (2010) Bayesian species delimitation in West African forest geckos (Hemidactylus fasciatus). Proc R Soc Lond B 277:3071-3077
77. Knowles LL, Carstens BC (2007) Delimiting
Species without Monophyletic Gene Trees.
Syst Biol 56(6):887895
78. Carstens BC, Dewey TA (2010) Species
Delimitation Using a Combined Coalescent
and Information-Theoretic Approach: An
Example from North American Myotis Bats.
Syst Biol 59:400414
79. Wakeley J (2000) The effects of subdivision
on the genetic divergence of populations and
species. Evolution 54:10921101
80. Eckert AJ, Carstens BC (2008) Does gene flow destroy phylogenetic signal? The performance of three methods for estimating species phylogenies in the presence of gene flow. Mol Phylogen Evol 49:832-842
81. Doolittle WF, Bapteste E (2007) Pattern pluralism and the Tree of Life hypothesis. Proc
Natl Acad Sci USA 104:20432049
82. Boto L (2010) Horizontal gene transfer in
evolution: facts and challenges. Proc Roy Soc
Lond B 277:819827
83. Rivera MC, Lake JA (2004) The ring of life
provides evidence for a genome fusion origin
of eukaryotes. Nature 431:152155
84. Kurland CG, Canback B, Berg OG (2003)
Horizontal gene transfer: A critical view. Proc
Natl Acad Sci USA 100:96589662
85. Hodkinson TR, Parnell JAN (2006) Introduction to the Systematics of Species Rich
Groups. In: Hodkinson TR, Parnell JAN
(eds) Reconstructing the tree of life: taxonomy and systematics of species rich taxa. CRC
Press, Boca Raton, FL
86. Eisen JA (2000) Horizontal gene transfer
among microbial genomes: new insights
from complete genome analysis. Curr Opin
Genet Dev 10:606611
87. Jain R, Rivera MC, Lake JA (1999) Horizontal gene transfer among genomes: The complexity hypothesis. Proc Natl Acad Sci USA 96:3801-3806
88. Galtier N, Daubin V (2008) Dealing with
incongruence in phylogenomic analyses. Philosophical Transactions of the Royal Society B:
Biological Sciences 363:40234029
89. Andersson JO (2005) Lateral gene transfer in
eukaryotes. Cell Mol Life Sci 62:11821197
90. Hotopp JCD, Clark ME, Oliveira DCSG et al
(2007) Widespread Lateral Gene Transfer
from Intracellular Bacteria to Multicellular
Eukaryotes. Science 317:17531756
91. Thomas J, Schaack S, Pritham EJ (2010) Pervasive Horizontal Transfer of Rolling-Circle
Transposons among Animals. Genome Biology and Evolution 2:656664
92. Keeling PJ, Palmer JD (2008) Horizontal
gene transfer in eukaryotic evolution. Nature
Reviews Genetics 9:605618
93. Blair JE (2009) Animals: Metazoa. In:
Hedges SB, Kumar S (eds) The Timetree of
Life. Oxford University Press, New York
94. Huang J, Gogarten JP (2006) Ancient horizontal gene transfer can benefit phylogenetic reconstruction. Trends Genet 22:361-366
95. Linz S, Radtke A, von Haeseler A et al (2007) A Likelihood Framework to Measure Horizontal Gene Transfer. Mol Biol Evol 24:1312-1319
96. Rasmussen MD, Kellis M (2007) Accurate
gene-tree reconstruction by learning geneand species-specific substitution rates across
multiple complete genomes. Genome Res
17:19321942
97. Rasmussen MD, Kellis M (2011) A Bayesian
Approach for Fast and Accurate Gene Tree
Reconstruction. Mol Biol Evol 28:273290
98. Sanderson MJ, McMahon MM (2007) Inferring angiosperm phylogeny from EST data
with widespread gene duplication. BMC
Evol Biol 7:S1-S3
99. Edwards SV (2009) Natural selection and
phylogenetic analysis. Proc Natl Acad Sci
USA 106:87998800
100. Ray N, Excoffier L (2009) Inferring Past
Demography Using Spatially Explicit Population Genetic Models. Human Biology
81:141157
101. Castoe TA, Koning APJd, Kim H-M et al
(2009) Evidence for an ancient adaptive episode of convergent molecular evolution. Proc
Natl Acad Sci USA 106:89868991
102. Swofford DL (1991) When are phylogeny
estimates from molecular and morphological
data incongruent? Pp. 295333 In: Miyamoto MM, Cracraft J (eds) Phylogenetic analysis of DNA sequences. Oxford Univ. Press,
New York
103. Roettger M, Martin W, Dagan T (2009) A
Machine-Learning Approach Reveals That
Alignment Properties Alone Can Accurately
Predict Inference of Lateral Gene Transfer
from Discordant Phylogenies. Mol Biol Evol
26:19311939
104. Beaumont MA, Balding DJ (2004) Identifying adaptive genetic divergence among populations from genome scans. Mol Ecol
13:969980
105. Waddington CH (1942) Canalization of
development and the inheritance of acquired
characters. Nature 150:563565
106. Burke MK, Dunham JP, Shahrestani P et al
(2010) Genome-wide analysis of a long-term
evolution experiment with Drosophila. Nature
467:587590
107. Medrano-Soto A, Moreno-Hagelsieb G,
Vinuesa P et al (2004) Successful lateral transfer requires codon usage compatibility
between foreign genes and recipient genomes. Mol Biol Evol 21:18841894
108. Dufraigne C, Fertil B, Lespinats S et al (2005)
Detection and characterization of horizontal
transfers in prokaryotes using genomic
signature. Nucleic Acid Research 33:e6


109. Lockhart PJ, Steel MA, Hendy MD et al (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol 11:605-612
110. Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368-376
111. Marjoram P, Molitor J, Plagnol V et al (2003)
Markov chain Monte Carlo without likelihoods.
Proc Natl Acad Sci USA 100:1532415328
112. Galtier N (2007) A Model of Horizontal
Gene Transfer and the Bacterial Phylogeny
Problem. Syst Biol 56:633642
113. Koslowski T, Zehender F (2005) Towards a
quantitative understanding of horizontal
gene transfer: A kinetic model. J Theor Biol
237:2329
114. Suchard MA (2005) Stochastic Models for
Horizontal Gene Transfer: Taking a Random
Walk Through Tree Space. Genetics 170:
419431
115. Huson DH, Bryant D (2006) Application of
Phylogenetic Networks in Evolutionary Studies. Mol Biol Evol 23:254267
116. Lake JA, Rivera MC (2004) Deriving the
Genomic Tree of Life in the Presence of Horizontal Gene Transfer: Conditioned Reconstruction. Mol Biol Evol 21:681690

117. Ané C (2010) Reconstructing concordance trees and testing the coalescent model from genome-wide data sets. In: Knowles LL, Kubatko LS (eds) Estimating Species Trees: Practical and Theoretical Aspects. Wiley-Blackwell, Hoboken, NJ
118. Excoffier L, Novembre J, Schneider S (2000) SIMCOAL: a general coalescent program for the simulation of molecular data in interconnected populations with arbitrary demography. J Hered 91:506-509
119. Anderson CNK, Ramakrishnan U, Chan YL
et al (2005) Serial SimCoal: A population
genetics model for data from multiple populations and points in time. Bioinformatics
21:17331734
120. Schneider S, Roessli D, Excoffier L (2005)
Arlequin (version 3.0): An integrated software
package for population genetics data analysis.
Evolutionary Bioinformatics 1:4750
121. Liu L, Yu L (2010) Phybase: an R package
for species tree analysis. Bioinformatics
26:962963
122. Kosiol C, Anisimova M (2012) Selection on the
protein coding genome. In: Anisimova M
(ed) Evolutionary genomics: statistical and
computational methods (volume 2). Methods
in Molecular Biology, Springer Science+Business
Media New York

Chapter 2
Modeling Gene Family Evolution and Reconciling
Phylogenetic Discord
Gergely J. Szollosi and Vincent Daubin
Abstract
Large-scale databases are available that contain homologous gene families constructed from hundreds of
complete genome sequences from across the three domains of life. Here, we discuss the approaches
of increasing complexity aimed at extracting information on the pattern and process of gene family
evolution from such datasets. In particular, we consider the models that invoke processes of gene birth
(duplication and transfer) and death (loss) to explain the evolution of gene families.
First, we review birth-and-death models of family size evolution and their implications in light of the
universal features of family size distribution observed across different species and the three domains of life.
Subsequently, we proceed to recent developments on models capable of more completely considering
information in the sequences of homologous gene families through the probabilistic reconciliation of
the phylogenetic histories of individual genes with the phylogenetic history of the genomes in which they
have resided.
To illustrate the methods and results presented, we use data from the HOGENOM database, demonstrating that the distributions of homologous gene family sizes in the genomes of the eukaryota, archaea, and bacteria exhibit remarkably similar shapes. We show that these distributions are best described by models of
gene family size evolution, where for individual genes the death (loss) rate is larger than the birth
(duplication and transfer) rate but new families are continually supplied to the genome by a process of
origination. Finally, we use probabilistic reconciliation methods to take into consideration additional
information from gene phylogenies, and find that, for prokaryotes, the majority of birth events are the
result of transfer.
Key words: Gene family evolution, Gene duplication, Gene loss, Horizontal gene transfer,
Birth-and-death models, Reconciliation

1. Introduction
The strongest evidence for the universal ancestry of all life on Earth
comes from two sources: (1) the shared molecular characters essential to the functioning of the cell, such as fundamental biological
polymers, core metabolism, and the nearly universal genetic
code; (2) sequence similarity between functionally related proteins in the bacteria, archaea, and eukaryota (1, 2). However, the majority of functionally related genes, like other phylogenetic characters, exhibit a more restricted distribution and, consequently, taken separately can only provide phylogenetic information on finer scales. Nonetheless, considered together, the ensemble of
related sequences carry a comprehensive record of the evolutionary
history and mechanisms that have generated them (3). Sequence
similarity on these finer scales has been used to construct large-scale
databases of putative sets of sequences of common ancestry, in
particular homologous proteins and protein domains. At present,
such databases constructed from hundreds of complete genome
sequences from across the three domains of life are available. Here,
we discuss the methods capable of extracting information on the
pattern and process of genome evolution from large-scale datasets
composed of homologous gene families.

2. Birth-and-Death Processes and the Shape of the Protein Universe

The majority of bacterial, archaeal, and eukaryotic genes belong to homologous families (4), which together contain a potential treasure trove of information on the pattern and process of descent of
these genes, and the genomes in which they reside. A qualitative
examination of the number of family members in genomes and the
phylogenetic distribution of the families reveals two important
patterns: (1) the distribution of the majority of homologous gene
families is not universal, but phylogenetically limited and (2) many
families contain multiple members from the same genomes while at
the same time being characterized by a patchy distribution. These
observations imply that (1) some process of gene origination must
exist that results in the ongoing generation of sequences sufficiently
different to be seen as a novel gene family and (2) processes of gene
birth capable of creating new genes with recognizable homology
from the existing ones must also exist in parallel with processes of
gene death leading to the loss of existing genes.
Considering the latter case first, several molecular mechanisms
are known to be involved in the creation of new gene structures in a
genome. Among eukaryotes, a range of mechanisms are known to
be capable of producing gene-sized duplications of genetic material.
These mechanisms include exon shuffling, reverse transcription of
expressed genes, and the action of mobile elements; for reviews, see
refs. 5, 6. In the case of prokaryotes, mechanisms for duplication are
less well understood and horizontally transferred genes are believed
to be an important, perhaps dominant, source of new gene structures entering the genome (7). Note that transfer of DNA into the prokaryotic cell can occur primarily by three means: (1) transduction by viruses, (2) conjugation by plasmids, and (3) natural genetic transformation: the ability of some bacteria to take up DNA fragments released by another cell. For details, see ref. 8.
While we expect duplication to produce gene copies with recognizable homology, whether transfer is seen as gene origination or
gene birth in the context of a particular genome depends on the
presence of recognizable homologs. In contrast to duplication
and transfer, the loss of genes is thought to most frequently
result from a cascade of small deletion events with small or no
fitness effect, which follow the initial inactivation of a gene
(the emergence of a pseudogene). As in the case of pseudogenization, molecular mechanisms can generate new gene structures
or lead to the loss of existing ones in the genomes of individual
cells; the fate of these genomic changes, whether they will fix or
be lost in the population, will be determined by their selective
effects and population genetic parameters, such as effective population size.
On the broadest scale, the strength of genetic drift has been
hypothesized to be a dominant factor influencing genome size
across all three domains of life (9). As we see in the following
section, the pattern of the distribution of homologous gene family
sizes in and among genomes can, to a large extent, also be described
in terms of essentially neutral stochastic birth-and-death processes.
Birth (duplication and transfer) and death (loss) in the context of
these models correspond to the addition and removal of genes to
homologous gene families over evolutionary timescales that are
long compared to the mutational and population genetic timescales.
The question of mechanisms responsible for the origination of
gene families is not well understood. A significant fraction of genes
in genomes from all three domains of life appears to be of very
recent origin in so far as they are restricted to a particular genome
and possess no known homologs. By some counts, such orphan
genes constitute, e.g., one-third of the genes in the human genome
(6) and 14% in a survey of 60 bacterial genomes (10). While there
are signs that a large fraction of orphan genes in prokaryotic genomes may have viral origin (11), our understanding of where these
genes come from and more generally what the dominant processes
of gene origination are remain largely unresolved fundamental
questions. Nevertheless, as we show below using birth-and-death
processes as models, the continuous presence and significance of
origination during the course of genome evolution is readily apparent from the record it has in the pattern of gene homologous family
sizes, i.e., in the shape of the protein universe.
2.1. The Distribution of Homologous Gene Family Sizes

The frequency distributions of gene family sizes in the complete genomes of organisms from all three domains exhibit remarkably
similar shapes with characteristic long, slowly decaying tails
(12-14). These distributions all have a power-law shape; for large family size n, the frequency of families f(n) falls off as f(n) ∝ n^γ with some γ < 0.

Fig. 1. Distribution of homologous gene family sizes across the three domains. The distribution of homologous gene family sizes was derived from version 5 of the HOGENOM database (17). The results for the three domains are derived from data for the complete genomes of 820 bacteria, 62 archaea, and 64 eukaryotes, and correspond to the average of the frequencies of family sizes across species in the domain. Dashed lines indicate fits with different origination-duplication-loss (ODL) models. The linear model corresponds to the model of Reed et al. and the nonlinear one is that proposed by Karev et al.; see text for details. The bottom row presents the relative rate of duplication as a function of family size corresponding to the fits of the nonlinear model of Eq. 2 in the two rows above it.

This power-law shape is apparent in the log-log plots of Fig. 1 and corresponds to an excess of large and very large
families compared to what would be expected based on the size
of the average gene family. Even more remarkable is the similarity of
the family size distributions between species from a single domain
(columns in Fig. 1), and even between domains (rows in Fig. 1).
This similarity implies that the processes that have generated these
distributions may share universal features across species and across
the three domains. Here, we focus on the information that can be
inferred under the assumption that particular forms of birth-and-death processes have shaped these distributions, and will not consider potential connections with power-law scaling in functional
genome content (15) or homology networks and their connection
to other biological networks with similar characteristics (16).
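The tail exponent γ can be estimated directly from a list of family sizes. The sketch below is not from the chapter: it uses a Hill-type maximum-likelihood estimator for a discrete power-law tail (with the standard -0.5 offset approximation) and checks it on synthetic sizes drawn from f(n) ∝ n^-2.5, i.e., an exponent in the range reported here.

```python
import math
import random

def powerlaw_exponent(sizes, n_min=2):
    """Hill-type maximum-likelihood estimate of the exponent gamma of a
    discrete power-law tail f(n) ~ n^gamma for n >= n_min, using the
    continuous approximation with the (n_min - 0.5) offset."""
    tail = [n for n in sizes if n >= n_min]
    alpha = 1.0 + len(tail) / sum(math.log(n / (n_min - 0.5)) for n in tail)
    return -alpha

# Synthetic check: family sizes drawn from f(n) proportional to n^-2.5.
random.seed(1)
support = list(range(1, 20001))
weights = [n ** -2.5 for n in support]
sizes = random.choices(support, weights=weights, k=100000)
gamma = powerlaw_exponent(sizes)
print(gamma)  # close to the true exponent of -2.5
```

Unlike a least-squares fit of log-frequencies, which is strongly biased by the sparsely populated large-size bins, the likelihood estimator uses every tail observation directly.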

2.2. Interpreting the Pattern of Gene Family Sizes
Huynen and van Nimwegen were the first to describe and interpret
a widespread pattern of a slowly decaying asymptotic power law in
the distribution of homologous gene family sizes. They examined a
diverse set of genomes spanning the bacteria, archaea, eukaryota,
and viruses (12). They found that a simple, but relatively abstract,
stochastic birth-and-death process, one where the duplication and
loss events are correlated within a family, produces power-law distributions (for details, see below). They found the exponent γ to be between -2 and -4 in their studies. In fact, a value of γ between -2 and -3, consistent with these results, has been observed in all subsequent studies and can easily be read off from Fig. 1. In the
context of Huynen and van Nimwegen's model, this indicates that
the origination rate (in general, a combination of gain resulting
from transfer, and the birth of new families with no homologs in
other genomes) that is required to compensate for the stochastic
loss of families must be significant.
Subsequent work has shown that for models, where the birth
and death of genes in a gene family are considered independent,
the asymptotic decay of the distribution of gene family sizes can
also become a power law, albeit such behavior is only exhibited by
a certain specific subclass of origination-duplication-loss-type
birth-and-death models. As demonstrated by Karev et al. (14),
this is the case for nonlinear models (see below) in which the death
rate approaches the birth rate for large families but is considerably
greater than the birth rate for small families (see bottom row of
Fig. 1). Karev et al. have been able to accurately reproduce the
distributions of gene (and domain) family sizes for a range of
analyzed genomes. The origination rates necessary to fit empirical
family size distributions were found to be relatively high, and
comparable, at least in small prokaryotic genomes, to the overall
intragenomic duplication rate. This has been interpreted as support for the key role of horizontal gene transfer (HGT) in these
genomes (14, 18, 19).
At about the same time as the work of Karev and colleagues
appeared, Reed et al. demonstrated (20) that a very simple birth-and-death process can also exhibit an asymptotic power law. They
considered a model, where the birth and death of genes are independent of each other and family size, and origination occurs
randomly with a uniform rate (see below), and found asymptotic
power-law behavior under the condition that the rate of birth
(duplication) is larger than the rate of death (loss). In Fig. 1, we
show comparisons of the fits of the linear model of Reed et al. and
the nonlinear model of Karev et al. to gene family size distributions
for the three domains. We can see that despite its relative simplicity,
considering data from individual species (top row of Fig. 1), the
linear model (described by three parameters) provides fits of quality comparable to the model of Karev et al. (described by five parameters). If we consider, however, the fits to distributions averaged
over the three domains, we can observe that the nonlinear model
clearly provides a better fit (second row of Fig. 1). As the functions
being fit are discrete probability distributions, one can easily calculate the probability of the observed empirical distribution given
values of the model parameters, and subsequently perform fitting
by maximizing the likelihood of the model parameters. For the averaged distributions, this method allows a clear interpretation of the fit: it corresponds to the hypothesis of a birth-and-death
process with identical parameter values across all species in the
domain having generated the observed distribution.
Perhaps more conclusively, the parameter values obtained in
the case of the linear model, corresponding to a birth-to-death ratio
of between roughly 2 and 5 (δ/λ = 4.9 for the human dataset with
the best apparent fit), are qualitatively at odds with empirical estimates of the recent duplication and loss rates in eukaryotic genomes, which unanimously indicate a value much smaller than one
(see Table 1 in ref. 6).
2.3. The Theory of Birth-and-Death Processes

Historically, the biological application of birth-and-death processes, starting with the seminal work of Yule (22) in the 1920s and continuing in the following decades (23-26), was the construction of stochastic models that can furnish a means for interpreting random fluctuations in the population size with time. The
application of birth-and-death process to sizes of gene families is
more recent. The realization that the sizes of gene families can be
compared with the aim of better understanding adaptive evolutionary processes and organismal phylogeny began with the work
of Hughes and Nei (27, 28) and others (29) in the context of
the debate on whether differences in the copy number of major
histocompatibility complex genes across species have evolved due
to adaptive or stochastic forces. As described above, recent work
has focused on explaining the distribution of the number of
genes in homologous gene families in genomes as the result of
stochastic birth-and-death processes (see also Chap. 3 of ref. 6).
A birth-and-death process is a stochastic process in which
transitions between states labeled by integers (representing the
number of individuals, cells, lineages, etc.) are only allowed to
neighboring states (see Fig. 2). An increase by one of the number
of individuals (or genes in a gene family) constitutes birth, whereas
decrease by one is a death. More formally, the dynamics of a
population (of individuals, or of genes in a gene family) is represented by a Markov process, i.e., the state of the population at time
t is described by the value of a random variable satisfying the
Markov property (for an accessible review, see ref. 18). In general,
for each state, the probability of both birth, a transition from state
n to n + 1, and of death, a transition from state n to n - 1, is described by a birth rate δ_n and a death rate λ_n. A third

[Fig. 2 diagrams: three birth-and-death chains over states 0 genes, 1 gene, 2 genes, ..., i genes, i+1 genes, ..., with arrows marking the allowed transitions between neighboring states: (a) origination, duplication (rates δ_1, δ_2, ..., δ_i+1), and loss; (b) gain and loss; (c) gain, duplication, and loss.]
Loss
Fig. 2. Birth-and-death models of homologous gene family evolution. A birth-and-death process is a stochastic process in which transitions between states labeled by integers (representing the number of individuals, cells, lineages, etc.) are only allowed to neighboring states. A jump to the right constitutes a birth, whereas a jump to the left is a death. In the context of birth-and-death processes that model the evolution of homologous gene families, the number of representatives a homologous gene family has in a given genome corresponds to the model state. Birth represents the addition of a gene to a family in a genome as a result of (1) origination of a new family with a single member, (2) duplication of an existing gene, or (3) gain of a gene by means of horizontal transfer of a gene from the same family from a different genome. The three models pictured above have been used in different contexts to model observed patterns of gene family size: (a) the stationary distribution of nonlinear origination-duplication-loss-type models is able to reproduce the general shape and in particular the power-law-like tail of the distribution of homologous gene family sizes (cf. Subheading 2 and ref. 14), while transient distributions of linear origination-duplication-loss models can be used to construct models of gene family size evolution along a phylogeny, modeling the inparalog, i.e., vertically evolving, component of the family size distribution (21); (b) and (c) linear gain-loss and gain-duplication-loss-type models are used to model the nonvertically evolving, so-called xenolog, component of the family size distribution along a branch of a phylogenetic tree.

elementary process besides birth and death that is relevant in the
context of gene family size evolution is origination. As described above, not all gene families are of the same age; consequently, to model the process of origination of new families, families with a single gene are allowed to originate at some constant rate O, as shown in Fig. 2. Considering a similar rate of influx into each state can be regarded as a model of HGT, cf. Fig. 2.
The simplest type of birth-and-death processes with biological
relevance are linear birth-and-death processes. Linear birth-and-death processes are described by a single birth rate δ and a single death rate λ, from which the state-wise rates can be derived by the following first-order rate law:

    δ_n = δ·n  and  λ_n = λ·n.    (1)

In other words, a gene (individual) in a gene family (population) gives birth to a new gene at a rate δ and undergoes death at a rate λ, independent of the size of the gene family. The stationary distribution of a linear birth-and-death process with origination at some rate O can be shown to (1) be a stretched exponential if δ < λ, i.e., if the birth rate is smaller than the death rate, or (2) exhibit asymptotic power-law behavior with exponent γ = -(O/(δ - λ) + 1) (30) if δ > λ. The transient distribution
can be analytically expressed for the linear version of all three
processes shown in Fig. 2. These distributions are important in
deriving the probability of observing a particular pattern of family
sizes at the leaves of a phylogeny, as well as in estimating branch-wise duplication, transfer, and loss parameters from a forest of gene trees that have been mapped, using a series of duplication, transfer, and loss events, to the branches of a species phylogeny
(see Subheading 4).
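The ensemble mechanism behind the linear case can be illustrated with a small simulation. This is a sketch with assumed rates (δ = 0.6 > λ = 0.5, chosen only for illustration), not a fit to real data: families originate at a constant rate, each then follows an independent linear birth-and-death process, and the sizes observed at a fixed time mix exponential growth over a spread of family ages, producing a heavy-tailed size distribution.

```python
import random
import statistics

random.seed(2)
DELTA, LAM, T = 0.6, 0.5, 50.0  # assumed per-gene duplication and loss rates

def family_size_at(age):
    """Gillespie simulation of a linear birth-death process started
    from a single founder gene and run for the given age."""
    n, t = 1, 0.0
    while n > 0:
        t += random.expovariate((DELTA + LAM) * n)  # waiting time to next event
        if t > age:
            break
        n += 1 if random.random() < DELTA / (DELTA + LAM) else -1
    return n

# Constant-rate origination observed at time T means family ages are
# uniform on [0, T]; extinct families (n = 0) are never observed.
sizes = [family_size_at(random.uniform(0.0, T)) for _ in range(1000)]
sizes = [n for n in sizes if n > 0]
print(len(sizes), statistics.median(sizes), max(sizes))
```

With these rates most extant families remain small while a few old survivors grow very large, the qualitative signature of the heavy tail.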
A succession of more complex nonlinear models can be constructed, the simplest proposed (14) being a model with a family
size-dependent duplication and loss rate parameterized by a pair of
constants a and b:
    δ_n = δ(n)·n = δ′·((n + a)/n)·n  and  λ_n = λ(n)·n = λ′·((n + b)/n)·n,    (2)

where we have not simplified by n to emphasize the relationship with the linear model above. For this class of models, asymptotic power laws are obtained only if δ′ < λ′ (14), i.e., if the birth rate is smaller than the death rate. It is important to note that the linear origination-duplication-loss-type model of Reed et al. (20) differs
from those of Karev et al. (14) in details related to how origination
is considered and in how the space of possible states (family sizes)
and hence the stationary state is defined. While Hughes and Reed
consider gene families to originate at a constant rate and family
size to be unbounded, Karev et al. assume that family sizes are

Modeling Gene Family Evolution and Reconciling Phylogenetic Discord

37

bounded and consider reflecting boundary conditions. Discrete-time models that are closely related to the continuous-time models considered by Karev et al. were presented by Wojtowicz and Tiuryn (31).
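For any birth-and-death chain with given state-wise rates, the stationary distribution follows from detailed balance, π_{n+1} = π_n · δ_n / λ_{n+1}. A minimal sketch for rates of the form in Eq. 2, with illustrative assumed values (δ′ = 0.99, λ′ = 1.0, a = 1, b = 3, so the death rate exceeds the birth rate for small families but approaches it for large ones) and a reflecting upper boundary:

```python
import math

# Assumed illustrative parameters for a nonlinear (Karev-type) model.
d0, l0, a, b, N = 0.99, 1.0, 1.0, 3.0, 500

def delta(n): return d0 * (n + a)   # total birth rate of a family of size n
def lam(n):   return l0 * (n + b)   # total death rate of a family of size n

# Detailed balance on states 1..N with reflecting boundaries.
pi = [1.0]
for n in range(1, N):
    pi.append(pi[-1] * delta(n) / lam(n + 1))
Z = sum(pi)
pi = [p / Z for p in pi]            # pi[i] = stationary P(size = i + 1)

# Local log-log slope over intermediate sizes, before the geometric
# factor (d0/l0)^n takes over.
slope = (math.log(pi[49]) - math.log(pi[9])) / (math.log(50) - math.log(10))
print(round(slope, 2))  # close to a - b - 1 = -3
```

The resulting distribution behaves like n^(a - b - 1) over a wide intermediate range, i.e., power-law-like, even though the tail is ultimately cut off exponentially because δ′ < λ′.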
A different more abstract type of birth-and-death process was
historically the first to be proposed to model the distribution of
gene family sizes (12). Similarly to the above model, a gene family
is founded by a single ancestor, and the size of the family may
change as a result of duplications and losses (birth and death).
However, in contrast to the birth-and-death models considered so
far, duplications and losses are considered to act coherently on
genes within one gene family. That is, if a certain gene is likely to
duplicate (be lost), then all genes of its family are likely to duplicate (be lost). More formally, denoting the size of a gene family at
time t by n_t,

    n_t = α_t · n_{t-1},    (3)

where α_t is a random multiplication factor, giving the instantaneous ratio of birth to death, that is drawn independently at each time step from some distribution P(α). The distribution of gene family
sizes that is the result of many such processes can be shown to have
a power-law distribution, provided the further important condition
that some form of origination be present is met. The exponent of
the power-law asymptotic followed by the family size distribution is
in this case independent of the exact nature of origination (independent, e.g., of whether one considers reflecting boundary conditions or random influx) and is given by γ = -1 + μ_α/σ_α², where μ_α = ⟨log α⟩ is the mean of the logarithm of the random variable α and σ_α² = ⟨log² α⟩ - μ_α² is its variance (12). Interestingly, this implies that birth-and-death models with coherent noise (also called multiplicative noise) produce a power-law asymptotic regardless of whether the birth rate is smaller or larger than the death rate. The value of the exponent, however, can give an indication of their relative values. The reason is that, since σ_α² is positive, γ < -1 implies μ_α = ⟨log α⟩ < 0, which can be shown to be equivalent to the geometric mean of α, i.e., the instantaneous ratio of birth to death, being smaller than unity.
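The moments entering the exponent can be computed numerically from samples of α. A sketch with an assumed lognormal distribution for α (an illustrative choice, with geometric mean below one so that loss outweighs birth):

```python
import math
import random

random.seed(3)
# Assumed per-step multiplication factor: lognormal with log-mean -0.2
# and log-standard-deviation 0.5 (illustrative parameters).
alphas = [random.lognormvariate(-0.2, 0.5) for _ in range(100000)]

mu = sum(math.log(x) for x in alphas) / len(alphas)               # <log a>
sigma2 = sum(math.log(x) ** 2 for x in alphas) / len(alphas) - mu ** 2
gamma = -1.0 + mu / sigma2       # predicted power-law exponent

geo_mean = math.exp(mu)          # geometric mean of the birth-to-death ratio
print(round(gamma, 2), geo_mean < 1.0)  # gamma near -1.8 for these parameters
```

As the text states, γ < -1 holds exactly when the geometric mean of α is below unity, so an observed exponent between -2 and -3 points to death outweighing birth.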
2.4. Birth and Death Along a Species Phylogeny

So far, we have only considered the distribution of homologous gene family sizes in genomes of individual species and the average of such distributions across domains. The distributions of gene family sizes
between species are, however, not independent, but rather reflect
correlated histories related by common descent along a species phylogeny. The phylogenetic profile of a gene family, consisting of the
number of homologs within the same family in each genome, encodes
this information. Such phylogenetic profiles can be informative even
though they neglect a large part of the information present in gene
sequences. Nonetheless, profile datasets have been used both to
construct organismal phylogenies (32-36) and reconstruct ancestral
gene content (37). These methods have, however, proved sensitive to
methods of homology inference and have relatively poor performance
as methods of phylogenetic analysis. This can be explained, in the case
of prokaryotes, by high levels of homoplasy resulting from both
HGT and extensive parallel loss of gene families in certain bacterial genomes (30). (Remember that homoplasy, also called convergent evolution, describes the acquisition of the same biological trait, in this case genes from the same family, in unrelated lineages.)
The primary advantage of the above attempts at reconstructing
phylogeny is their relative ease of implementation and computational tractability on large datasets derived from complete genomes. They, however, suffer two major shortcomings: (1) they
lack an explicit model of evolution and consequently provide at
best indirect information on processes and (2) they disregard a
great deal of phylogenetically relevant information present in
homologous sequences by considering only presenceabsence or
at most the gene copy number in genomes.
The first of these shortcomings can be overcome by considering
phylogenetic profiles as observations at the branches of a species
tree generated by a birth-and-death process of sufficient complexity. Csűrös and Miklós have recently developed an efficient algorithm for calculating the probability of observing a given phylogenetic profile as a function of branch-wise parameters of duplication, gain, and loss along a species tree (21). Their model assumes that gene families evolve according to a linear birth-and-death process along the branches of the species tree. Each branch is
characterized by a duplication rate, a gain rate, and a loss rate.
A gene family evolves along the tree from the root toward the leaves
according to the birth-and-death process. At internal nodes of the
tree, families are instantaneously copied to evolve independently
along descendant branches. Transient distributions of the linear
version of processes presented in Fig. 2 give the expected change
in the number of vertically inherited genes (inparalogs) and
recently acquired ones (xenologs) (38). Leading up to the work of Csűrös and Miklós, other groups had also developed likelihood-based methodologies. These either only considered duplication and
loss (39) or relied on heuristic restrictions on maximal ancestral
family size for computational tractability (40, 41).
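The pruning logic underlying such profile likelihoods can be sketched for the simplest possible case: a two-state (absent/present) gain-loss model on a fixed species tree. This is a toy version, not the Csűrös-Miklós algorithm (which handles full family-size distributions); the tree, branch lengths, and rates below are assumptions made for illustration.

```python
import math

def p_matrix(gain, loss, t):
    """Transition probabilities of a two-state (0 = absent, 1 = present)
    gain-loss Markov chain over a branch of length t."""
    r = gain + loss
    decay = math.exp(-r * t)
    p01 = (gain / r) * (1.0 - decay)   # absent -> present
    p10 = (loss / r) * (1.0 - decay)   # present -> absent
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]

def conditional_likelihoods(node, gain, loss):
    """Felsenstein pruning.  A leaf is (state,); an internal node is
    (left, right, t_left, t_right).  Returns [P(tips | 0), P(tips | 1)]."""
    if len(node) == 1:                 # leaf with an observed 0/1 state
        return [1.0 if s == node[0] else 0.0 for s in (0, 1)]
    left, right, tl, tr = node
    Ll = conditional_likelihoods(left, gain, loss)
    Lr = conditional_likelihoods(right, gain, loss)
    Pl, Pr = p_matrix(gain, loss, tl), p_matrix(gain, loss, tr)
    return [sum(Pl[s][x] * Ll[x] for x in (0, 1)) *
            sum(Pr[s][y] * Lr[y] for y in (0, 1)) for s in (0, 1)]

# Four-species tree ((A,B),(C,D)) with unit branch lengths; the profile
# says the family is present in A and B and absent from C and D.
tree = (((1,), (1,), 1.0, 1.0), ((0,), (0,), 1.0, 1.0), 1.0, 1.0)
gain, loss = 0.1, 0.5
root = conditional_likelihoods(tree, gain, loss)
# Weight the root states by the stationary distribution of the chain.
pi1 = gain / (gain + loss)
likelihood = (1.0 - pi1) * root[0] + pi1 * root[1]
print(likelihood)
```

Exactly as with substitution models, maximizing this quantity over the branch-wise rates, summed over many profiles, yields the rate estimates; richer models replace the 0/1 state space with family counts.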
Using the above approach, it is possible to search for the
branch-wise duplication, gain, and loss rates that maximize the
likelihood given a set of observed profiles (derived from complete
genome sequences) and a species phylogeny. Conceptually, this is
no different than searching for branch-wise substitution rates that
maximize the likelihood given a set of homologous sites (see, for
instance, Chap. 16 of ref. 42). Columns of an alignment in the latter case correspond to the phylogenetic profiles of individual gene families in the former. In Table 1, we present results

Table 1
Relative rates of duplication, gain, and loss for prokaryotic phyla obtained by maximum likelihood using COUNT (43)

Phylum name                    Loss   Duplication   Gain    # of genomes
Actinobacteria                 0.75   0.23          0.010   31
Alphaproteobacteria            0.85   0.13          0.008   47
Bacillales                     0.52   0.42          0.048   16
Bacteroidetes/chlorobi         0.59   0.38          0.024   10
Betaproteobacteria             0.63   0.32          0.037   32
Chlamydiae/verrucomicrobia     0.70   0.24          0.043
Clostridia                     0.57   0.37          0.055   11
Cyanobacteria                  0.68   0.28          0.027   14
Deltaproteobacteria            0.64   0.33          0.024   13
Epsilonproteobacteria          0.54   0.29          0.158
Gammaproteobacteria            0.88   0.10          0.009   70
Lactobacillales                0.66   0.29          0.036   21
Mollicutes                     0.49   0.47          0.023   14
Spirochetes                    0.79   0.19          0.014
Crenarchaeota                  0.69   0.28          0.018   11
Euryarchaeota                  0.66   0.31          0.016   25

Rooted reference trees were obtained from concatenates of universal and near-universal genes and phylogenetic profiles extracted from version 4 of the HOGENOM database (17). Relative rates correspond to the ratio of the average of the branch-wise rates (of duplication, gain, and loss) to the average branch-wise sum of the three rates

obtained in this manner using COUNT (43), software that provides an implementation of this calculation. The results in Table 1 lend further support to both the observation that birth-and-death rates are similar across the tree of life (although here we
have only considered prokaryotes) and the pattern of death (loss)
rates being on average significantly larger than birth (duplication
and gain) rates. Similar to what was observed for 28 archaeal
genomes (21), duplications are inferred to account for the majority of birth events.
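The normalization used in Table 1 (relative rate = average branch-wise rate divided by the average branch-wise sum of the three rates) can be made concrete with a small sketch; the per-branch rate triples below are made up purely for illustration.

```python
# Branch-wise (duplication, gain, loss) rates for a hypothetical
# four-branch species tree (made-up numbers, for illustration only).
branches = [(0.20, 0.02, 0.55), (0.31, 0.01, 0.60),
            (0.25, 0.05, 0.71), (0.18, 0.03, 0.49)]

n = len(branches)
avg = [sum(b[i] for b in branches) / n for i in range(3)]  # per-process averages
avg_total = sum(sum(b) for b in branches) / n              # average branch-wise sum
relative = [a / avg_total for a in avg]                    # as in the Table 1 footnote
print(relative)  # loss dominates, as in Table 1
```

By construction the three relative rates sum to one, which is why each row of Table 1 can be read as a breakdown of the total event rate.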


3. The Ubiquity of Phylogenetic Discord and the Joint Reconstruction of Pattern and Process

3.1. Phylogenetic Discord Among Homologous Gene Families

In order to extract as much information as possible, we must step beyond phylogenetic profiles and consider in more detail the phylogenetic information contained in the sequences of homologous
gene families. This can be done by using some model of sequence
evolution to infer a gene phylogeny from the multiple sequence
alignment (MSA) of the family. Because gene families evolve
through not only the genome level process of speciation but also
the gene level processes of origination, duplication, transfer, and
loss described above, the phylogenies of individual families constructed in this manner reflect intricate individual genic histories.
Differences in the histories of individual families inevitably lead to
phylogenetic discord among gene families. The amount of phylogenetic conflict reflects the extent of HGT among genomes, and
consequently the profusion of phylogenetic discord that we observe
among prokaryotes (see below) is interpreted as reflecting large
rates of transfer.
Independent of the degree of HGT, however, the existence
of gene level processes of birth and death makes it necessary to
extend the implicit model behind the tree of species. This extension
consists of taking into consideration the processes of gene origination, birth, and death described above. The classic concept of the
species tree implicitly assumes that all genes evolve along a strictly
shared track: the branches of the species tree. The presence of duplications, transfers, and losses obliges us to replace this model by a tree whose branches can best be visualized as tubes: tubes within which genes may duplicate and be lost, and among which they can be transferred. This tree of genomes is a straightforward extension of the classic tree of species with its branches
characterized by rates of duplication, transfer, and loss.
For this tree of genomes to be useful, however, methods based
on statistical models capable of considering data from complete
genome sequences and inferring such a tree need to be developed.
Below, we describe recent progress in the construction of tractable
models of genome evolution that are full, probabilistic models of all
variables, in particular in our case of branch-wise duplication, transfer, and loss rates and the species tree topology.
Apparent phylogenetic conflict can result from different processes.
First of all, inferred gene tree topologies can differ from the
species tree, and hence from each other, in the absence of any
biological process, as a result of reconstruction errors. Such errors can result from
stochastic differences caused by, e.g., insufficient sequence length
and, more problematically, from systematic reconstruction artifacts

Modeling Gene Family Evolution and Reconciling Phylogenetic Discord


due to departures from model assumptions (44). More informatively,
phylogenetic discord can result from three important biological
processes (summarized in Fig. 3): lineage sorting, HGT, and
hidden paralogy.
Galtier and Daubin (45) analyzed the level of phylogenetic
conflict between genes in several datasets extracted from the
HOGENOM (17) database. Their aim was to ascertain the relative
contribution of HGT to the amount of phylogenetic discord by
comparing metazoan datasets (where HGT can be assumed to be
rare) to prokaryotic ones. Their results were consistent with expectations,
as the level of discord measured for metazoan sequences was
smaller than for any of the bacterial datasets considered. Interestingly,
however, the differences in the amount of discord among the
bacterial datasets were also measured to be large (see Table 1 of
ref. 45). These large differences in the amount of discord, presumably
caused by differences in rates of transfer, stand in stark contrast to
the broadly similar rates of gene birth and death implied by the
similarity of the gene family size distributions.
A further finding of the study of Galtier and Daubin was that,
even in the case of Actinobacteria (the prokaryotic dataset with the
highest degree of self-conflict), more than 75% of the genes did not
significantly reject the consensus tree. While it is clear that including more and more species would cause this particular measure to
converge to a much smaller value, a series of more careful studies
have demonstrated that there exists a strong signal of vertical
inheritance in prokaryotic genomes despite persistent HGT
(46–49) (see also Chap. 3 of this volume, ref. 50).
3.2. Reconciling Phylogenetic Discord

The detection and measurement of phylogenetic discord among a
group of phylogenetic trees can be accomplished relatively easily,
for instance, by using some measure of distance between trees
(see Chap. 30 of ref. 42 for an introduction on distance measures).
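For instance, the widely used Robinson–Foulds (split) distance counts the bipartitions of the leaf set induced by one tree but not the other. The sketch below is a minimal illustration with toy trees written as nested tuples; it assumes binary trees with unique leaf labels and is not drawn from any particular package:

```python
def leaf_set(t):
    """All leaf labels below a node; trees are nested tuples, leaves are strings."""
    return frozenset([t]) if isinstance(t, str) else leaf_set(t[0]) | leaf_set(t[1])

def tree_splits(t, taxa, ref):
    """Non-trivial bipartitions induced by the tree's internal edges.
    Each split is stored as the side that excludes `ref`, so the two
    complementary sides of one split compare equal."""
    out = set()
    def walk(sub):
        if isinstance(sub, str):
            return
        below = leaf_set(sub)
        side = below if ref not in below else taxa - below
        if 1 < len(side) < len(taxa) - 1:   # skip trivial (single-leaf) splits
            out.add(side)
        walk(sub[0])
        walk(sub[1])
    walk(t)
    return out

def rf_distance(t1, t2):
    """Robinson-Foulds distance: splits present in exactly one of the trees."""
    taxa = leaf_set(t1)
    assert taxa == leaf_set(t2), "trees must share the same leaf set"
    ref = min(taxa)
    return len(tree_splits(t1, taxa, ref) ^ tree_splits(t2, taxa, ref))

t1 = ((("A", "B"), "C"), ("D", "E"))   # gene tree grouping A with B
t2 = ((("A", "C"), "B"), ("D", "E"))   # gene tree grouping A with C
print(rf_distance(t1, t2))  # 2
print(rf_distance(t1, t1))  # 0
```

Dividing by the number of possible splits gives a normalized distance, which is convenient when averaging discord over many gene families.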
A different and harder problem consists of constructing a reconciliation between two trees, i.e., of proposing a set of evolutionary
events (such as speciations, duplications, transfers, and losses) that
correspond to an evolutionary scenario, where one of the trees (the
gene tree) has resulted from evolution along the other tree (the
species tree). In Fig. 3, we present three different reconciliations
involving different sets of events for the same gene tree. The set
of events considered in the context of the reconciliation problem
has, until recently, been limited to speciation, duplication and
loss events, and lineage sorting, as discussed in Chap. 1 of this
Volume (ref. 64), and respectively in Chaps. 29 and 25 of ref. 42.
Goodman (51) was the first to describe an algorithm to find the
reconciliation that minimizes the number of duplication and loss
events, followed more recently by several others (see ref. 42 for
citations). If transfers are also considered, the problem of reconciliation becomes difficult from a combinatorial perspective for two


G.J. Szollosi and V. Daubin


Fig. 3. Evolutionary processes behind phylogenetic discord. Phylogenetic incongruences can be the result of three major
evolutionary processes (45): (1) deep coalescence resulting from incomplete lineage sorting (see previous chapter);
(2) hidden paralogy (resulting from duplication and differential loss); and (3) horizontal gene transfer (HGT). Incomplete
lineage sorting occurs when an ancestral species undergoes two speciation events in rapid succession. If, for a given gene,
the ancestral polymorphism has not been fully resolved into two monophyletic lineages at the time of the second
speciation, then, with a probability determined by the effective population size, the gene tree will differ from the species tree.
A potential source of incongruence relevant over wider phylogenetic scales is hidden paralogy. If a gene family contains
paralogous copies (genes that are related by a duplication event, e.g., the dashed and grey lines above), the gene
phylogeny will partly reflect the duplication history of the gene, which is independent of the species divergence history. The third
process is HGT. If genetic exchanges occur between species, then the phylogeny of individual genes will be influenced by
the number and nature of the transfers they have undergone. In the above figure, we illustrate how a particular gene tree
topology can be explained by each process. Depending on the parameters (duplication, transfer, and loss rates and
effective population size) describing the branches of the species tree, the three different scenarios have different
probabilities.


reasons: (1) the difficulty of restricting the set of events to ones
that respect the partial order of evolution imposed by speciation
events on the species tree (52); this corresponds to forbidding the
transfer of genes from a species (branch of the species tree) to species
from which it has descended (ancestral branches of the species tree),
i.e., forbidding transfers that go backward in time; (2) if transfer
events are considered where the acquisition of a homologous copy
implies the loss of the extant copy, the problem of identifying the
minimum number of such events can easily be shown to correspond
to the problem of finding the shortest path between two trees using
subtree prune and regraft (SPR) operations, which is known to be
NP-complete (see Chaps. 4 and 30 of ref. 42).
The latter process of replacement of genes by HGT is biologically motivated by the elevated probability of functional redundancy in the case of homologous genes (53). Such replacement is
particularly relevant in modeling genes that are present in a single
copy in all or most genomes. A variety of approaches have been put
forward to solve the problem of tree reconciliation for the case
when the replacement of genes is relevant (53–55). These
approaches offer heuristic algorithms to find approximate solutions
to the SPR and the closely related maximum agreement forest
(MAF) problems efficiently. However, they are all limited to single-label trees, i.e., trees of families that do not have multiple
members in any of the genomes considered.
The former problem of considering only transfers that respect
the partial time order implied by the species tree can be resolved by
fully specifying the time order of speciation events. As shown by
Tofigh (56), and described below, this allows the construction of a
dynamic programming algorithm that is able to efficiently traverse
all possible reconciliations, allowing the calculation of the sum of
the probabilities of all reconciliations given a gene tree, of the most
parsimonious reconciliation (57, 58), or of the reconciliation with
the highest likelihood.
3.3. The Probability of a Gene Tree Given a Species Tree and Rates of Duplication, Transfer, and Loss

Tofigh et al. consider the forest of gene trees to be generated by a
common birth-and-death process taking place on a shared species
(or genome) tree. They derive the probability p(G | S′, M_BD, r) of a
gene tree topology G given a reconciliation r, where M_BD is a
birth-and-death process taking place on S′, a species tree for which the
order of speciation events is fully specified. Provided the process
M_BD is linear, the probability of gene tree topology G can be
expressed given a reconciliation r that maps the branches and nodes
of G to S′, using the events considered in M_BD.
This calculation requires two functions: (1) the probability of
extinction Q_e(t), i.e., the probability that a gene observed on branch
e at time t evolves such that it is not observed in any extant genome
(at time t = 0); and (2) the propagator Q_ef(t, t′), which gives the
probability that a gene observed on branch e at time t evolves such
that it has a descendant present on branch f at time t′, and, furthermore,
that any descendants of the gene observed at the leaves (at time t = 0) of
S′ descend from this copy. These functions can be obtained
numerically from the systems of differential equations given in ref. 56.
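To give a feel for how such functions can be obtained numerically, the sketch below integrates the extinction probability of a gene on a single branch under a linear birth-death process with duplication rate `dup` and loss rate `loss` (transfers, which in ref. 56 couple these equations across the branches of S′, are deliberately omitted, and the function names are ours):

```python
import math

def extinction_prob(dup, loss, t, steps=10000):
    """Probability that a gene present t time units before the present
    leaves no descendants at time 0, under a linear birth-death process.
    Integrates dQ/ds = loss - (dup + loss)*Q + dup*Q^2 with RK4, Q(0) = 0."""
    f = lambda q: loss - (dup + loss) * q + dup * q * q
    h, q = t / steps, 0.0
    for _ in range(steps):
        k1 = f(q)
        k2 = f(q + 0.5 * h * k1)
        k3 = f(q + 0.5 * h * k2)
        k4 = f(q + h * k3)
        q += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return q

def extinction_prob_exact(dup, loss, t):
    """Closed-form solution of the same equation, used as a check."""
    e = math.exp((dup - loss) * t)
    return loss * (e - 1.0) / (dup * e - loss)

print(extinction_prob(0.5, 1.0, 1.0))        # ~0.5648
print(extinction_prob_exact(0.5, 1.0, 1.0))  # ~0.5648
```

The closed form exists only for this simple single-branch case; in the full DTL model, the coupled system is integrated numerically, time slice by time slice.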
As illustrated in Fig. 4, the same gene tree can be reconciled in
different ways with the species tree. The probability of extinction
Q_e(t) and the propagator Q_ef(t, t′), together with rates of
origination, duplication, and transfer, can be used to calculate the
probability of a gene tree topology for an arbitrary reconciliation (here, we
present an example with a rooted gene tree; however, the position of
the root can be considered part of the reconciliation without
changing the complexity of the dynamic programming algorithm).
For this probability to be useful, however, we must be able either to
sum over all reconciliations,

    p(G | S′, M_BD) = Σ_{r ∈ Ω} p(G | S′, M_BD, r),        (4)

to obtain the probability of G given S′ and M_BD, or, alternatively,
to be able to find the most likely reconciliation, allowing the
calculation of

    p_max(G | S′, M_BD) = max_r p(G | S′, M_BD, r).        (5)

The probability of a reconciliation can be hierarchically decomposed into the product of probabilities of the reconstructions of
subtrees of G. This allows the construction of a dynamic programming algorithm that can efficiently sum or take the maximum over
reconciliations, allowing the calculation of both Eqs. 4 and 5.
Furthermore, the same dynamic programming scheme can be
used to calculate the most parsimonious reconciliation given costs
of the possible events with reduced complexity (57).
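To convey the flavor of such recursions, the sketch below computes the classic most parsimonious duplication-loss reconciliation via last-common-ancestor (LCA) mapping, for a toy species tree ((A,B),C). Transfers and the probabilistic machinery of refs. 56 and 57 are omitted, so this illustrates the general idea rather than those algorithms:

```python
# Species tree ((A,B),C) encoded by parent pointers and depths.
parent = {"A": "X", "B": "X", "C": "R", "X": "R", "R": None}
depth = {"R": 0, "X": 1, "C": 1, "A": 2, "B": 2}

def species_lca(x, y):
    """Walk the deeper node up until the two meet."""
    while x != y:
        if depth[x] < depth[y]:
            y = parent[y]
        elif depth[y] < depth[x]:
            x = parent[x]
        else:
            x, y = parent[x], parent[y]
    return x

def reconcile(gene_tree):
    """Return (mapping, duplications, losses) for a gene tree given as
    nested tuples whose leaves are labelled by their species."""
    if isinstance(gene_tree, str):          # leaf: maps to its species
        return gene_tree, 0, 0
    mL, dL, lL = reconcile(gene_tree[0])
    mR, dR, lR = reconcile(gene_tree[1])
    m = species_lca(mL, mR)
    dups, losses = dL + dR, lL + lR
    if m in (mL, mR):                       # a child maps to the same branch: duplication
        dups += 1
        losses += (depth[mL] - depth[m]) + (depth[mR] - depth[m])
    else:                                   # speciation node
        losses += (depth[mL] - depth[m] - 1) + (depth[mR] - depth[m] - 1)
    return m, dups, losses

# Gene tree ((a,b),(a,c)), with leaves labelled by the species they belong to:
print(reconcile((("A", "B"), ("A", "C"))))  # ('R', 1, 2)
```

For this gene tree, the mapping places a duplication at the root of the species tree followed by two losses, matching the scenario one would draw by hand.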
3.4. Hierarchical Probabilistic Models of Duplication, Transfer, and Loss

Using the above dynamic programming algorithm, it is possible to
calculate the likelihood of a species tree topology S′ and the parameters
describing M_BD, i.e., the rates of duplication, transfer, and loss
on its branches, given a forest of gene trees obtained from homologous
gene families:
    L(S′, M_BD | {G_f}) = Π_{f ∈ families} p(G_f | S′, M_BD),        (6)

where

    G_f = argmax_G L(G | MSA of f),

and the product goes over the set of most likely gene trees {G_f}
encoding the sequence information in the families of homologous
genes composing a set of genomes. This expression can be thought
of as being similar to the classic likelihood of a gene tree topology
G under some model of sequence evolution M_seq. with parameters,

Fig. 4. Probabilistic DTL model. If we consider gene trees to be generated by a linear birth-and-death process M_BD taking
place on a tree S′ with the order of speciation events fully specified, we can express the probability of a gene tree topology
G given a reconciliation. Specifying the order of speciation events corresponds to constructing time slices, which
decompose the branches of the species tree into pieces, yielding the tree S′. For example, the branch leading to Genome
A is decomposed into three branches labeled 2, 4, 7 (for a formal definition, see ref. 56). Transfers are only possible between
branches in the same time slice, e.g., between 7 and 9, but not between 4 and 9. A reconciliation consists of mapping the branches
and nodes of G to the branches and nodes of S′. For a given gene tree, there are many possible reconciliations. For G, we can
construct (1) a transfer scenario, where node g of G is a speciation at the root of S′, e is a transfer from 4 to 9, f is a speciation
at the end of 3, and the branch below f traverses the speciation at the end of 6, implying at least one loss, and also (2) a
duplication scenario, where e maps to the root, g is a duplication above it, the position of f is unchanged, but at least four
losses have occurred. The probability of extinction Q_e(t) and the propagator Q_ef(t, t′) can be used to construct the probability
of a given reconciliation, as shown for the black subtree of G. Because the probability of a reconciliation can be hierarchically
decomposed into the product of the probabilities of the reconstructions of the subtrees of G, a dynamic programming
algorithm can be derived that is able to calculate the sum or maximum of the probability over all reconciliations.


[Fig. 5 panels: Cyanobacteria (14 genomes, 3,887 gene trees) and Lactobacillales (21 genomes, 2,838 gene trees); histograms of the relative rates of duplication, transfer, and loss, with frequency in sample on the y-axis and relative rate on the x-axis.]

Fig. 5. Relative rates of duplication, transfer, and loss for two prokaryotic phyla. The results were obtained by maximum
likelihood using reference trees inferred from concatenated alignments of universal and near-universal genes and all
homologous gene families with trees available in version 4 of the HOGENOM database (17). These results show that, while
the ratio of birth to death is practically identical, when phylogenetic information from gene trees is taken into consideration,
the majority of birth events are inferred to have resulted from transfer and not duplication, in contrast to results obtained from
phylogenetic profiles (see Table 1). The histograms correspond to results obtained for 1,000 jackknife samples of 20% of all
trees (see Chap. 20 of ref. 42 for a discussion of resampling). The calculation was implemented using results from refs. 56 and 57.
We kept the species tree topology fixed and maximized Eq. 6 over the space of possible time orders of speciations and
uniform rate parameters. We assumed each branch of S′ to have branch lengths compatible with the time order of
speciations, with all time slices being of equal width, and inferred global rates of duplication, transfer, and loss.

such as branch-wise substitution rates, given a multiple sequence
alignment:

    L(G, M_seq. | MSA) = Π_{i ∈ sites} p(column i of MSA | G, M_seq.),        (7)

where in this case the product goes over the columns of homologous sites
composing the MSA. In Fig. 5, we present results obtained using such
an approach, where we have kept the species tree topology fixed and
maximized the likelihood given by Eq. 6 over the space of possible
orders in time of speciations and uniform rate parameters. We can see
that the inferred ratio of birth to death is in good agreement with that
obtained from phylogenetic profiles (see Table 1). In contrast, taking
into consideration additional information from the sequences of the
proteins in homologous families in the form of gene tree topologies,
we infer for both phyla considered the majority of birth events to be
the result of transfer.
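The resampling scheme behind the histograms in Fig. 5 can be sketched as follows (the function and its defaults are our own illustration; 3,887 stands in for the Cyanobacteria gene trees):

```python
import random

def jackknife_samples(items, fraction=0.2, n_samples=1000, seed=1):
    """Repeatedly draw `fraction` of the items without replacement;
    re-estimating rates on each subsample gives the spread of the
    histograms shown in Fig. 5."""
    rng = random.Random(seed)
    k = max(1, int(round(fraction * len(items))))
    return [rng.sample(items, k) for _ in range(n_samples)]

trees = list(range(3887))                 # stand-ins for the gene trees
samples = jackknife_samples(trees)
print(len(samples), len(samples[0]))      # 1000 777
```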
This scheme has two shortcomings. First, instead of complete
sequence information, only the most likely gene tree topologies are
considered. Second, global information on how likely different gene
tree topologies are given S′ and M_BD is not considered. Both of
these shortcomings can be addressed by combining Eqs. 6 and 7 in a
hierarchical likelihood framework. Using such a framework allows
us to use global information on the species phylogeny and the
birth-and-death process, together with sequence information from
each family, to improve gene trees, while at the same time inferring the
species phylogeny and the parameters of the birth-and-death process.
Such a hierarchical framework was first suggested by Maddison (59)
and has recently been implemented using a duplication and loss
model (excluding transfer) (60) and models of transfer (excluding
duplication and loss) (61, 62). The dynamic programming
approach presents the first opportunity to construct a hierarchical
model that considers all three processes. That is, we can express the
likelihood of S′, {G_f}, and M_BD given a set of homologous gene
families as
    L(S′, {G_f}, M_BD | families) = Π_{f ∈ families} p(G_f | S′, M_BD) · L(G_f | MSA of f).        (8)
It is important to note that this hierarchical likelihood function
is amenable to parallel computation, because the p(G_f | S′, M_BD) ×
L(G_f | MSA of f) terms can be computed independently by client
nodes. It is possible to implement an efficient optimization scheme
consisting of a hierarchical optimization loop, wherein clients
optimize the G_f-s using the independent terms in the hierarchical
likelihood product while keeping S′ and M_BD fixed; once conditionally
optimal G_f-s are attained, S′ and M_BD can in turn be optimized.
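Because each family contributes an independent factor to Eq. 8, the log-likelihood is a sum of per-family terms that can be farmed out to workers. The sketch below uses a trivial placeholder for the per-family computation purely to illustrate the parallel structure (a real implementation would run the reconciliation dynamic program and a phylogenetic likelihood in each worker, typically in separate processes or on cluster nodes rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def family_log_likelihood(family):
    """Placeholder for log p(G_f | S', M_BD) + log L(G_f | MSA of f)."""
    gene_tree_term, alignment_term = family
    return gene_tree_term + alignment_term

def total_log_likelihood(families, workers=4):
    """Eq. 8 in log space: the per-family terms are independent given
    S' and M_BD, so they can be evaluated concurrently and summed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(family_log_likelihood, families))

families = [(-12.3, -45.6), (-7.8, -30.1), (-9.9, -25.0)]
print(total_log_likelihood(families))  # ≈ -130.7
```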

4. Conclusion
In conclusion, the distributions of homologous gene family sizes in
the genomes of the Eukaryota, Archaea, and Bacteria show astonishingly similar shapes. These distributions are best described by models of gene family size evolution, where the loss rates of individual
genes are larger than their duplication rates, but new families are
continually supplied to the genome by a process of origination
that in general includes both transfer and the generation of new
gene families. This picture is supported by analysis of phylogenetic
profiles using maximum likelihood. Taking into consideration additional information from the sequences of the proteins in homologous families in the form of gene tree topologies, the inferred ratio
of birth to death is found to be in good agreement with that
obtained from phylogenetic profiles; however, in prokaryotes, the
majority of birth events is inferred to be the result of transfer.

It has not been demonstrated to date that a single tree can
adequately describe the evolution of entire genomes across the
diversity of life, and certainly no such tree has been inferred. However,
recent advances in the construction and implementation of hierarchical
probabilistic models of duplication, transfer, and loss presented
here have the potential to allow us to undertake this project: to infer
genome trees based on sequence information from complete
genomes. While this task is currently computationally daunting, the
use of parallel computing and recent advances in algorithms hold
the promise of making it feasible in the foreseeable future.
From a biological perspective, birth-and-death models of gene
family size evolution are essentially neutral models of evolution.
They ignore completely the individuality of gene families and any
potential selective forces that make some of them expendable and
others indispensable. The fact that they accurately reproduce the
observed family size distributions, nonetheless, suggests that
genome evolution, at least on this coarse scale of observation,
might be in large part the result of a stochastic process, which is
only modulated by selection (6, 19). Even so, as soon as we are able
to better reconstruct the pattern and process of duplication, transfer,
and loss, we can expect to observe more and more of
this modulation by selection and, by proxy, to start to learn more
about the biology of genome evolution over large timescales and to
better understand the population genetic, biochemical, and
ecological constraints and opportunities that govern the evolution
of genomes in general and the transfers of genes in particular. This
requires integrating information reconstructed from ancestral genomes and DTL events with system-level models of phenotype, such
as metabolic networks (3, 63).

5. Exercises
1. Using log-log axes on the range [0.1, 10^6], plot the following
functions: e^(−x), e^(−x/10), e^(−x/100), e^(−x/1000), x^(−1), x^(−3), and x^(−9), and
observe how power-law-like tails decay much more slowly than any
exponential function.
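Exercise 1 can be attempted, for example, with matplotlib (a non-interactive backend is used here so the script also runs without a display; the output file name is arbitrary):

```python
import math
import matplotlib
matplotlib.use("Agg")                      # render off-screen; no display needed
import matplotlib.pyplot as plt

xs = [10 ** (i / 50.0) for i in range(-50, 301)]   # 0.1 ... 1e6, log-spaced
funcs = {
    "exp(-x)":      lambda x: math.exp(-x),
    "exp(-x/10)":   lambda x: math.exp(-x / 10),
    "exp(-x/100)":  lambda x: math.exp(-x / 100),
    "exp(-x/1000)": lambda x: math.exp(-x / 1000),
    "x^-1":         lambda x: x ** -1,
    "x^-3":         lambda x: x ** -3,
    "x^-9":         lambda x: x ** -9,
}
fig, ax = plt.subplots()
for label, f in funcs.items():
    ax.loglog(xs, [f(x) for x in xs], label=label)
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.set_ylim(1e-45, 10)
ax.legend(fontsize=8)
fig.savefig("tails.png")                   # arbitrary output file name

# Even the steepest power law eventually dominates the slowest exponential:
print(1e6 ** -9 > math.exp(-1e6 / 1000))   # True
```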
2. Using both the COG (http://www.ncbi.nlm.nih.gov/COG)
and the HOGENOM (http://pbil.univ-lyon1.fr/databases/
HOGENOM) databases, construct the histogram in Fig. 1 of
the frequency of homologous gene family sizes in the human
genome, i.e., the fraction f_n of times you see a family of size n
among all homologous gene families in the human genome.
3. Using the result that the stationary distribution p_n of family sizes
is reached exponentially fast, and assuming that this occurs according
to the relationship |p_n(t) − p_n| ∝ e^(−(λ+δ)t), considering the
rates of duplication and loss from Table 1 of ref. 6, estimate the
amount of time (in units of percentage of divergence at silent
sites) that the distribution of family sizes needs to reach the
stationary distribution following a perturbation. Is this number
large or small? Which organisms can be described by such divergence
in comparison to the human genome?
4. Using the form of the transient distribution for the linear
duplication–loss process given in Table 1 of ref. 38, express the
duplication rate λ and the loss rate δ using the fraction of families with
0 genes and the mean number of genes in a family.
5. Write down the differential equation giving p_n(t), the probability
of families of size n at time t, using only the probabilities
p_{n−1}(t), p_n(t), and p_{n+1}(t) and the rates of duplication λ_n = λn and
loss δ_n = δn (note that the case p_0(t) needs to be treated differently;
the solution can be found in ref. 20).
6. Using the transient distribution of the duplication–loss process
from Table 1 and the results of Lemma 1 in ref. 38, and assuming
the species tree to be ((A:y, B:y):x, C:x + y) with branch lengths
x, y in arbitrary units of time (see ref. 42 for a description of the
Newick format), a duplication rate of λ, a loss rate of δ, and
assuming the probability of the number of genes in a family at
the root of the tree to be given by a Poisson distribution with
mean n_0, further limiting the number of genes at internal nodes
to a maximum of M genes, derive the probability of observing a
profile {n_A, n_B, n_C}.
7. In what respect would including gain introduce significant new
complications in the above calculations?
8. Considering only duplications and losses (excluding transfer),
express Q_ef(t, t′) using the transient distributions from ref. 38 and the
extinction probability Q_e(t).
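For exercises 3–6, it can help to check analytical results against direct simulation. The sketch below is a simple Gillespie simulation of the linear duplication–loss process (total rates λn and δn for a family of size n); its sample mean can be compared with the expectation n_0 · e^((λ−δ)t):

```python
import random

def simulate_family_size(dup, loss, t_end, n0=1, rng=None):
    """Gillespie simulation of the linear duplication-loss process: a
    family of n genes duplicates at total rate dup*n and loses a gene
    at total rate loss*n. Returns the family size at time t_end."""
    rng = rng or random
    n, t = n0, 0.0
    while n > 0:
        t += rng.expovariate(n * (dup + loss))   # waiting time to the next event
        if t >= t_end:
            break
        n += 1 if rng.random() < dup / (dup + loss) else -1
    return n

rng = random.Random(42)                          # fixed seed for reproducibility
runs = [simulate_family_size(0.3, 0.5, 2.0, rng=rng) for _ in range(20000)]
mean = sum(runs) / len(runs)
print(mean)  # expectation is exp((0.3 - 0.5) * 2) = exp(-0.4), about 0.67
```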
References
1. Crick, F. H. (1968) The origin of the genetic code. J Mol Biol, 38, 367–79.
2. Theobald, D. L. (2010) A formal test of the theory of universal common ancestry. Nature, 465, 219–22.
3. Boussau, B. and Daubin, V. (2010) Genomes as documents of evolutionary history. Trends Ecol Evol, 25, 224–32.
4. Koonin, E. V. and Wolf, Y. I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res, 36, 6688–719.
5. Long, M., Betran, E., Thornton, K., and Wang, W. (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet, 4, 865–75.
6. Lynch, M. (2007) The origins of genome architecture. Sinauer Associates.
7. Lerat, E., Daubin, V., Ochman, H., and Moran, N. A. (2005) Evolutionary origins of genomic repertoires in bacteria. PLoS Biol, 3, e130.
8. Gogarten, J. P. and Townsend, J. P. (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol, 3, 679–87.
9. Lynch, M. and Conery, J. S. (2003) The origins of genome complexity. Science, 302, 1401–4.
10. Siew, N. and Fischer, D. (2003) Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins, 53, 241–51.
11. Daubin, V. and Ochman, H. (2004) Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli. Genome Res, 14, 1036–42.
12. Huynen, M. A. and van Nimwegen, E. (1998) The frequency distribution of gene family sizes in complete genomes. Mol Biol Evol, 15, 583–9.
13. Qian, J., Luscombe, N. M., and Gerstein, M. (2001) Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model. J Mol Biol, 313, 673–81.
14. Karev, G. P., Wolf, Y. I., Rzhetsky, A. Y., Berezovskaya, F. S., and Koonin, E. V. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol, 2, 18.
15. Molina, N. and van Nimwegen, E. (2009) Scaling laws in functional genome content across prokaryotic clades and lifestyles. Trends Genet, 25, 243–7.
16. Koonin, E. V., Wolf, Y. I., and Karev, G. P. (2006) Power laws, scale-free networks and genome biology. Molecular Biology Intelligence Unit, Landes Bioscience/Eurekah.com.
17. Penel, S., Arigon, A.-M., Dufayard, J.-F., Sertier, A.-S., Daubin, V., Duret, L., Gouy, M., and Perrière, G. (2009) Databases of homologous gene families for comparative genomics. BMC Bioinformatics, 10 Suppl 6, S3.
18. Novozhilov, A. S., Karev, G. P., and Koonin, E. V. (2006) Biological applications of the theory of birth-and-death processes. Brief Bioinform, 7, 70–85.
19. Koonin, E. V., Wolf, Y. I., and Karev, G. P. (2002) The structure of the protein universe and genome evolution. Nature, 420, 218–23.
20. Reed, W. J. and Hughes, B. D. (2004) A model explaining the size distribution of gene and protein families. Math Biosci, 189, 97–102.
21. Csűrös, M. and Miklós, I. (2009) Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Mol Biol Evol, 26, 2087–95.
22. Yule, G. U. (1925) A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B, 213, 21–87.
23. Feller, W. (1939) Die Grundlagen der Volterraschen Theorie des Kampfes ums Dasein in wahrscheinlichkeitstheoretischer Behandlung. Acta Biotheoretica, Series A, 5, 11–39.
24. Kendall, D. G. (1948) On the generalized birth-and-death process. The Annals of Mathematical Statistics, 19, 1–15.
25. Bartholomay, A. (1958) On the linear birth and death processes of biology as Markoff chains. Bulletin of Mathematical Biology, 20, 97–118.

26. Takacs, L. (1962) Introduction to the theory of queues. Oxford University Press.
27. Ota, T. and Nei, M. (1994) Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol Biol Evol, 11, 469–82.
28. Nei, M., Gu, X., and Sitnikova, T. (1997) Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc Natl Acad Sci U S A, 94, 7799–806.
29. Yanai, I., Camacho, C. J., and DeLisi, C. (2000) Predictions of gene family distributions in microbial genomes: evolution by gene duplication and modification. Phys Rev Lett, 85, 2641–4.
30. Hughes, A. L., Ekollu, V., Friedman, R., and Rose, J. R. (2005) Gene family content-based phylogeny of prokaryotes: the effect of criteria for inferring homology. Syst Biol, 54, 268–76.
31. Wojtowicz, D. and Tiuryn, J. (2007) Evolution of gene families based on gene duplication, loss, accumulated change, and innovation. J Comput Biol, 14, 479–95.
32. Fitz-Gibbon, S. T. and House, C. H. (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res, 27, 4218–22.
33. Snel, B., Bork, P., and Huynen, M. A. (1999) Genome phylogeny based on gene content. Nat Genet, 21, 108–10.
34. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., and Koonin, E. V. (2002) Genome trees and the tree of life. Trends Genet, 18, 472–9.
35. Deeds, E. J., Hennessey, H., and Shakhnovich, E. I. (2005) Prokaryotic phylogenies inferred from protein structural domains. Genome Res, 15, 393–402.
36. Lienau, E. K., DeSalle, R., Rosenfeld, J. A., and Planet, P. J. (2006) Reciprocal illumination in the gene content tree of life. Syst Biol, 55, 441–53.
37. Mirkin, B. G., Fenner, T. I., Galperin, M. Y., and Koonin, E. V. (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol, 3, 2.
38. Csűrös, M. and Miklós, I. (2009) Mathematical framework for phylogenetic birth-and-death models. arXiv, 0902.0970.
39. Hahn, M. W., De Bie, T., Stajich, J. E., Nguyen, C., and Cristianini, N. (2005) Estimating the tempo and mode of gene family evolution from comparative genomic data. Genome Res, 15, 1153–60.
40. Spencer, M., Susko, E., and Roger, A. J. (2006) Modelling prokaryote gene content. Evol Bioinform Online, 2, 157–78.
41. Iwasaki, W. and Takagi, T. (2007) Reconstruction of highly heterogeneous gene-content evolution across the three domains of life. Bioinformatics, 23, i230–9.
42. Felsenstein, J. (2004) Inferring phylogenies. Sinauer Associates.
43. Csűrös, M. (2010) Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics, 26, 1910–2.
44. Jeffroy, O., Brinkmann, H., Delsuc, F., and Philippe, H. (2006) Phylogenomics: the beginning of incongruence? Trends Genet, 22, 225–31.
45. Galtier, N. and Daubin, V. (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci, 363, 4023–9.
46. Daubin, V., Moran, N. A., and Ochman, H. (2003) Phylogenetics and the cohesion of bacterial genomes. Science, 301, 829–32.
47. Ochman, H., Lerat, E., and Daubin, V. (2005) Examining bacterial species under the specter of gene transfer and exchange. Proc Natl Acad Sci U S A, 102 Suppl 1, 6595–9.
48. Beiko, R. G., Harlow, T. J., and Ragan, M. A. (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci USA, 102, 14332–7.
49. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2009) Search for a tree of life in the thicket of the phylogenetic forest. J Biol, 8, 59.
50. Puigbò, P., Wolf, Y. I., and Koonin, E. V. (2012) Genome-wide comparative analysis of phylogenetic trees: the prokaryotic forest of life. In Anisimova, M. (ed.), Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York.
51. Goodman, M., Czelusniak, J., Moore, W., Herrera, R., and Matsuda, G. (1979) Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms constructed from globin sequences. Systematic Zoology, 28, 132–163.
52. Hallett, M., Lagergren, J., and Tofigh, A. (2004) Simultaneous identification of duplications and lateral transfers. RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biology, New York, NY, USA, pp. 347–356, ACM.
53. Abby, S. S., Tannier, E., Gouy, M., and Daubin, V. (2010) Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics, 11, 324.
54. Nakhleh, L., Ruths, D., and Wang, L.-S. (2005) RIATA-HGT: a fast and accurate heuristic for reconstructing horizontal gene transfer. In Wang, L. (ed.), Computing and Combinatorics, vol. 3595 of Lecture Notes in Computer Science, pp. 84–93, Springer Berlin/Heidelberg.
55. Beiko, R. G. and Hamilton, N. (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol, 6, 15.
56. Tofigh, A. (2009) Using Trees to Capture Reticulate Evolution: Lateral Gene Transfers and Cancer Progression. Ph.D. thesis, KTH, School of Computer Science and Communication.
57. Doyon, J.-P., Scornavacca, C., Gorbunov, K. Y., Szöllősi, G. J., Ranwez, V., and Berry, V. (2010) An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. Proceedings of RECOMB Comparative Genomics, to appear.
58. David, L. A. and Alm, E. J. (2011) Rapid evolutionary innovation during an Archaean genetic expansion. Nature, 469, 93–6.
59. Maddison, W. P. (1997) Gene trees in species trees. Systematic Biology, 46, 523–536.
60. Akerborg, O., Sennblad, B., Arvestad, L., and Lagergren, J. (2009) Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci USA, 106, 5714–9.
61. Suchard, M. A. (2005) Stochastic models for horizontal gene transfer: taking a random walk through tree space. Genetics, 170, 419–31.
62. Bloomquist, E. W. and Suchard, M. A. (2010) Unifying vertical and nonvertical evolution: a stochastic ARG-based framework. Syst Biol, 59, 27–41.
63. Wagner, A. (2009) Evolutionary constraints permeate large metabolic networks. BMC Evol Biol, 9, 231.
64. Anderson, C., Liu, L., Pearl, D., and Edwards, S. V. (2012) Tangled trees: the challenge of inferring species trees from coalescent and non-coalescent genes. In Anisimova, M. (ed.), Evolutionary genomics: statistical and computational methods.

Chapter 3

Genome-Wide Comparative Analysis of Phylogenetic Trees: The Prokaryotic Forest of Life

Pere Puigbó, Yuri I. Wolf, and Eugene V. Koonin
Abstract
Genome-wide comparison of phylogenetic trees is becoming an increasingly common approach in evolutionary
genomics, and a variety of approaches for such comparison have been developed. In this article, we
present several methods for comparative analysis of large numbers of phylogenetic trees. To compare
phylogenetic trees taking into account the bootstrap support for each internal branch, the Boot-Split
Distance (BSD) method is introduced as an extension of the previously developed Split Distance
method for tree comparison. The BSD method implements the straightforward idea that comparison
of phylogenetic trees can be made more robust by treating tree splits differentially depending on the
bootstrap support. Approaches are also introduced for detecting tree-like and net-like evolutionary
trends in the phylogenetic Forest of Life (FOL), i.e., the entirety of the phylogenetic trees for
conserved genes of prokaryotes. The principal method employed for this purpose includes mapping
quartets of species onto trees to calculate the support of each quartet topology and so to quantify
the tree and net contributions to the distances between species. We describe the application of these
methods to analyze the FOL and the results obtained with these methods. These results support
the concept of the Tree of Life (TOL) as a central evolutionary trend in the FOL as opposed to the
traditional view of the TOL as a species tree.
Key words: Forest of life, Tree of life, Phylogenomic methods, Tree comparison, Map of quartets

Abbreviations

CMDS  Classical multidimensional scaling
COG   Clusters of orthologous genes
BSD   Boot-split distance
FOL   Forest of life
HGT   Horizontal gene transfer
ND    Nodal distance
NUTs  Nearly universal trees
QT    Quartet topology
TNT   Tree-net trend
TOL   Tree of life
SD    Split distance
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_3,
# Springer Science+Business Media, LLC 2012

1. Introduction
With the advances of genomics, phylogenetics entered a new era that is marked by the availability of extensive collections of phylogenetic trees for thousands of individual genes. Examples of such tree collections are the phylomes that encompass trees for all sufficiently widespread genes in a given genome (1–4) or the Forest of Life (FOL) that consists of all trees for widespread genes in a representative set of organisms (5). It has been known since the early days of
phylogenetics that trees built on the same set of species often
have different topologies, especially when the set includes distant
species, most notably, in prokaryotes (6, 7). The availability of
forests consisting of numerous phylogenetic trees exacerbated
the problem as an enormous diversity of tree topologies has been
revealed. The inconsistency between trees has several major sources:
(1) problems with ortholog identification caused primarily by cryptic paralogy; (2) various artifacts of phylogenetic analysis, such as
long branch attraction (LBA); (3) horizontal gene transfer (HGT);
and (4) other evolutionary processes distorting the vertical, tree-like pattern, such as incomplete lineage sorting and hybridization (1, 8–10). In order to obtain robust results in genome-level phylogenetic analysis, for instance, to classify phylogenetic trees into clusters
with (partially) congruent topologies or to identify common trends
among multiple trees, reliable methods for comparing trees are
indispensable.
The number and diversity of tree comparison methods
and software have substantially increased in the last few years.
The tree comparison methods variously use tree bipartitions, such
as partition, or symmetric difference metrics (11) and split distance
(SD) (12); distance between nodes, such as the path length metrics
(13), nodal distance (12, 14), and nodal distance for rooted
trees (15); comparison of evolutionary units, such as triplets and
quartets (16); subtransfer operations, such as subtree transfer distance (17), nearest-neighbor interchanging (18), Subtree Prune
and Regraft (SPR) using a rooted reference tree (19), SPR for
unrooted trees (20), and Tree Bisection and Reconnection (TBR)
(17); (dis)agreement methods, such as agreement subtrees (21),
disagree (12), corresponding mapping (22), and congruence
index (23); tree reconciliation (24); and topological and branch
lengths methods, such as K-tree score (25). Several algorithms
have been proposed to analyze multifamily trees. For example,
the FMTS algorithm systematically prunes each gene copy from a
multifamily tree to obtain all possible single-gene trees (12) and an
algorithm implemented in TreeKO prunes nodes from the input
rooted trees in which duplication and speciation events are labeled
(26). However, to the best of our knowledge, none of the available
metrics for tree comparison takes into account the robustness of the

branches, a feature that appears important to minimize the impact of artifacts (unreliable parts of a tree) on the outcome of comparative tree analysis. Here, we present the Boot-Split Distance (BSD)
method that calculates distances between phylogenetic trees with
weighting based on bootstrap values. This method is implemented
in the program TOPD/FMTS (12). In our recent research, we
used the BSD method combined with classical multidimensional
scaling (CMDS) analysis to explore the main trends in the phylogenetic FOL and to explore the Tree of Life (TOL) concept in light
of comparative genomics (5, 27).
Since the time (ca. 1838) when Darwin drew the famous sketch of an evolutionary tree in his notebook on transmutation of species, with the legend "I think . . .", the thinking on the TOL has evolved
substantially. The first phylogenetic revolution, brought about by
the pioneering work of Zuckerkandl and Pauling (28), and later
Woese and coworkers (29), was the establishment of molecular
sequences as the principal material for phylogenetic tree construction. The second revolution has been triggered by the advent of
comparative genomics when it has been realized that HGT, at least
among prokaryotes, was much more common than previously
suspected. The first revolution was a triumph of the tree thinking,
when a well-resolved TOL started to appear within reach. The
second revolution undermines the very foundation of the TOL
concept and threatens to destroy it altogether (30–32).
The current views of evolutionary biologists on the TOL span
the entire range from acceptance to complete rejection, with a host
of moderate positions. The following rough classification may be
used to summarize these positions: (a) acceptance of the TOL as
the dominant trend in evolution: HGT is considered to be rare and
overhyped, and most of the observed transfers are deemed to be
artifacts (33–36); (b) the TOL is the common history of the (nearly) nontransferable core of genes, surrounded by vines of HGT (37–48); (c) each gene has its own evolutionary history blending HGT and vertical inheritance; a statistical trend might exist in the maze of gene histories, and it could even be tree-like (5, 49–51); and (d) the ubiquity of HGT renders the TOL concept totally obsolete (prokaryotic species and higher taxa do not exist, and microbial taxonomy is created by a pattern of biased HGT) (30, 32, 52–57).
We found that, although different trends and patterns have to
be invoked to describe the FOL in its entirety, the main, most
robust trend is the statistical TOL, i.e., the signal of coherent
topology that is discernible in a large fraction of the trees in the
FOL, in particular among the Nearly Universal Trees (NUTs).
Recently, we further explored the FOL by analysis of species
quartets (58). A quartet is a group of four species which is
the minimum evolutionary unit in unrooted phylogenetic trees;
each quartet can assume three unrooted tree topologies (16).

We described a quantitative measure of the tree and net signals in
evolution that is derived from an analysis of all quartets of species in
all trees of the FOL. The results of this analysis indicate that,
although diverse routes of net-like evolution jointly dominate the
FOL, the pattern of tree-like evolution that recapitulates the consensus topology of the NUTs is the single most prominent, coherent trend. Here, we report an extended version of these
methodologies introduced to analyze the FOL and its trends, as
well as new concepts of prokaryotic evolution under the FOL
perspective (Supplementary Fig. S1).

2. Materials
2.1. The Forest of Life and Nearly Universal Trees

We analyzed the set of 6,901 phylogenetic trees from ref. 5 that were obtained as follows. Clusters of orthologous genes were
obtained from the COG (59) and EggNOG (60) databases from
100 prokaryotic species (59 bacteria and 41 archaea). The species
were selected to represent the taxonomic diversity of Archaea and
Bacteria (for the complete list of species, see ref. 5). The BeTs
algorithm (59) was used to identify the orthologs with the highest
mean similarity to other members of the same cluster (index
orthologs), so the final clusters contained 100 or fewer genes,
with no more than one representative of each species. The
sequences in each cluster were aligned using the Muscle program
(61) with default parameters and refined using Gblocks (62). The
program Multiphyl (63), which selects the best of 88 amino acid
substitution models, was used to reconstruct the maximum likelihood tree of each cluster. The NUTs are defined as trees from
COGs that are represented in more than 90% of the species
included in the study.

3. Methods

3.1. Boot-Split Distance: A Method to Compare Phylogenetic Trees Taking into Account Bootstrap Support

3.1.1. Boot-Split Distance

The BSD method compares trees based on the original Split Distance (12) method. Both methods work by collecting all possible binary splits of the two compared trees and calculating the
fraction of equal splits, i.e., those splits that are present in both trees (different splits are those that are present in only one of the two trees). Instead of considering all branches as being equal, as is
the case in SD, the BSD method takes into account the bootstrap
values to increase or decrease the SD value proportionally to the
robustness of individual internal branches. The BSD value is
the average of the BSD in the equal splits (eBSD) and the BSD in


the different splits (Eq. 1). Equations 2 and 3 give the formulas to
calculate the eBSD and dBSD values, respectively.
BSD = (eBSD + dBSD) / 2,    (1)

eBSD = 1 − (e / a) × Me,    (2)

dBSD = (d / a) × Md.    (3)

Here, e is the sum of bootstrap values of equal splits, d is the sum of bootstrap values of different splits, a is the sum of all bootstrap values, Me is the mean bootstrap value of equal splits, and Md is the mean bootstrap value of different splits.
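Eqs. 1–3 translate directly into a few lines of code. A minimal sketch (the function name is ours, not part of TOPD/FMTS; the numbers are the worked example reported in Fig. 1):

```python
def boot_split_distance(e, d, a, me, md):
    """Boot-Split Distance (Eqs. 1-3).

    e  -- sum of bootstrap values of equal splits
    d  -- sum of bootstrap values of different splits
    a  -- sum of all bootstrap values
    me -- mean bootstrap value of equal splits
    md -- mean bootstrap value of different splits
    """
    ebsd = 1.0 - (e / a) * me   # Eq. 2: equal-split component
    dbsd = (d / a) * md         # Eq. 3: different-split component
    return (ebsd + dbsd) / 2.0  # Eq. 1: average of the two components

# Worked example from Fig. 1a: a = 3.140, e = 1.750, d = 1.390,
# Me = 0.875, Md = 0.3475 -> eBSD = 0.512, dBSD = 0.154, BSD = 0.333
print(round(boot_split_distance(1.750, 1.390, 3.140, 0.875, 0.3475), 3))  # 0.333
```

Note that the formula reproduces the figure's values exactly, which is a convenient sanity check when reimplementing the method.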
The BSD algorithm proceeds in four basic steps to compare
pairs of trees (Supplementary Fig. S2). The first step is to obtain
all possible splits from both trees. This procedure implies a binary
split of the tree at each internal branch so that the tree is partitioned into two parts, each of which contains at least two species.
Then, the common set of leaves between the two trees is obtained,
that is, the set of shared species. Only trees with a common leaf set
of at least four species can be compared. The third step consists in
pruning all splits to the common leaf set of species; at this step,
species that are present in only one of the two compared trees are
removed from the split list. After this procedure, in partially overlapping trees, the algorithm checks whether each of the splits
remains a valid partition, that is, a partition that separates at
least two species from the rest of the tree. If a split is not a valid
partition, it is removed. Finally, the algorithm calculates the BSD
using Eqs. 1–3.
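The pruning steps (2–3) can be sketched as follows; this is a toy illustration, not the TOPD/FMTS implementation, and the split representation and function name are ours:

```python
def prune_splits(splits, common):
    """Steps 2-3 of the BSD algorithm: each split is given as the frozenset of
    leaves on one side of an internal branch; restrict it to the shared leaf
    set and keep it only if both sides still contain at least two species
    (i.e., it remains a valid partition)."""
    common = frozenset(common)
    pruned = set()
    for side in splits:
        side = frozenset(side) & common
        other = common - side
        if len(side) >= 2 and len(other) >= 2:
            pruned.add(min(side, other, key=sorted))  # canonical orientation
    return pruned

# A 5-leaf tree ((A,B),(C,(D,E))) has internal splits AB|CDE and DE|ABC.
# Pruned to the common leaf set {A,B,C,D}, only AB|CD remains valid:
print(prune_splits([frozenset("AB"), frozenset("DE")], "ABCD"))
```

The DE|ABC split is discarded because, after pruning, only species D remains on one side, so it no longer separates at least two species from the rest.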
3.1.2. The BSD Algorithm

There are three possible types of comparisons for trees that do not
include paralogs, that is, include one and only one sequence from
each of the constituent species (Fig. 1). In the first case, the two
trees completely overlap, that is, consist of the same set of species
(Fig. 1a). In this case, step 2, the pruning procedure, is not necessary, and the comparison involves only obtaining all possible splits
and the calculation of the BSD. In the second case, one of the
compared trees is a subset of the other tree (Fig. 1b). In this case,
the splits are only pruned and occasionally removed from the bigger
tree. In the third case, when the two trees partially overlap or when
a tree is a subset of another tree, a pruning procedure is required.
In the example shown in Fig. 2, after the pruning procedure
(step 3), there is only one remaining split (split: AB|CD) that is
repeated several times in both trees. The remaining AB|CD split in
Tree 1 is separated by four nodes that have different bootstrap
values. In this case, the bootstrap of the remaining split is calculated
using Eq. 4, where n is the total number of nodes between the

[Fig. 1 here; tree drawings omitted. The panels report, among other values: SD = 0.667, BSD = 0.333, eBSD = 0.512, dBSD = 0.154 (p = 2, q = 4, m = 6, a = 3.140, e = 1.750, d = 1.390, Me = 0.875, Md = 0.3475) for one pair of trees, and SD = 1.000, BSD = 0.681, eBSD = 1.000, dBSD = 0.363 (p = 0, q = 4, m = 4, a = 1.450, e = 0.000, d = 1.450, Me = 0, Md = 0.3625) for another.]
Fig. 1. Examples of the BSD algorithm in single-family trees. (a) Two trees of the same size. (b) Tree 1 is a subtree of tree 2. (c) Two trees that partially overlap. SD split distance, BSD boot-split distance, eBSD BSD of equal splits, dBSD BSD of different splits, p number of equal splits, q number of different splits, m total number of splits, a sum of bootstraps in all splits, e sum of bootstraps in equal splits, d sum of bootstraps in different splits, Ma mean bootstrap value, Me mean bootstrap value in equal splits, Md mean bootstrap value in different splits.

two sides of the split and BSi is the bootstrap value (adjusted to the 0–1 range) of node i:

Bootstrap = 1 − ∏(i = 1..n) (1 − BSi).    (4)

The bootstrap value associated with a particular branch of a binary tree is taken as a measure of the probability that the four
subtrees on the opposite ends of this branch are partitioned correctly. To estimate the probability of the correct partitioning of an
arbitrary set of four subtrees, the internal branch of the quartet tree
is mapped onto each of the internal branches of the original tree.
The quartet is considered to be resolved correctly if it is resolved
correctly relative to any of these branches. Under the assumption
that bootstrap probabilities on individual branches are independent,
Eq. 4 is obtained as the estimate of the bootstrap probability for the
internal branch of the quartet tree.
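Eq. 4 in code form (a sketch; the function name is ours, and the example reproduces the Fig. 2 value from branch supports of 90, 10, 10, and 10 percent):

```python
from math import prod

def combined_bootstrap(branch_supports):
    """Eq. 4: probability that a pruned split is resolved correctly relative
    to at least one of the n original branches separating its two sides,
    assuming the per-branch bootstrap probabilities BSi (on a 0-1 scale)
    are independent."""
    return 1.0 - prod(1.0 - bs for bs in branch_supports)

# Fig. 2 example: four nodes with supports 0.9, 0.1, 0.1, 0.1
print(round(combined_bootstrap([0.9, 0.1, 0.1, 0.1]), 2))  # 0.93
```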

[Fig. 2 here; tree drawings omitted. In the example, the combined bootstrap for the shared split is 1 − (0.1 × 0.9 × 0.9 × 0.9) = 0.93.]


Fig. 2. Calculation of BSD for trees with an unequal number of species. The larger tree (Tree 1) is pruned prior to the calculation of BSD. The bootstrap value for the only shared internal branch is calculated according to Eq. 4.

3.1.3. Using a Bootstrap Threshold: Pros and Cons
The key question regarding the BSD method is: What is the best
approach to phylogenetic tree comparison: using all branches, reliable
or not, with the appropriate weighting or using only branches supported by high bootstrap values? The first option is illustrated in
Fig. 1, whereas Fig. 3 shows an example of a tree comparison that
employs a bootstrap threshold of 70, i.e., only branches supported by
a higher bootstrap are taken into account in the comparison. The
second procedure appears reasonable and can be recommended in
some cases. However, it is not advisable as a general approach
because, when two large trees with varying bootstrap values are

[Fig. 3 here; tree drawings omitted. With a bootstrap cutoff of 70, the example gives SD = 0.600, BSD = 0.536, eBSD = 0.619, dBSD = 0.454 (p = 2, q = 3, m = 5, a = 4.160, e = 1.780, d = 2.380, Ma = 0.832, Me = 0.890, Md = 0.793).]

Fig. 3. Example of the BSD algorithm using a bootstrap cutoff. The figure shows the comparison of two phylogenetic trees
that takes into account only those branches with bootstrap support greater than 70. SD split distance, BSD boot-split
distance, eBSD BSD of equal splits, dBSD BSD of different splits, p number of equal splits, q number of different splits,
m total number of splits, a sum of bootstraps in all splits, e sum of bootstraps in equal splits, d sum of bootstraps in different
splits, Ma mean bootstrap value, Me mean bootstrap value in equal splits, Md mean bootstrap value in different splits.

compared, using a strict threshold restricts the comparison to a small subset of robust branches, resulting in an artificially low BSD value. In
other words, this procedure artificially inflates the similarity between
the two trees by depreciating a large fraction of the branches. In
addition, before considering the use of only the most supported branches, one should take into account that the BSD method already
uses bootstrap values to adjust the distance between trees; so if two
trees are topologically similar (low SD) but supported by low bootstrap, the distance value increases (higher BSD), which is one of the
advantages of the BSD method (see Eqs. 2 and 3).
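If a hard cutoff is nevertheless desired, the filtering step itself is simple (a sketch using an invented `(side, support)` split representation; recall the caveat above that a strict threshold can artificially deflate the BSD on large trees):

```python
def filter_by_bootstrap(splits, cutoff=0.70):
    """Keep only splits whose bootstrap support exceeds the cutoff.
    Splits are represented here as (side, support) pairs, where side is a
    frozenset of the leaves on one side of the branch."""
    return [(side, bs) for side, bs in splits if bs > cutoff]

splits = [(frozenset("AB"), 0.98),
          (frozenset("ABC"), 0.37),
          (frozenset("ABCD"), 0.77)]
print(len(filter_by_bootstrap(splits)))  # 2
```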
3.1.4. Testing the BSD Method

The performance of the BSD method was compared with that of
the original SD method implemented in the TOPD/FMTS program (12). Supplementary Fig. S3 shows the correlation of SD and
BSD for trees with a number of species from 4 to 15 (a) and from
16 to 100 (b) from a recent large-scale analysis of the FOL (5). The
three-way comparison of SD, BSD, and tree size (number of species) shows a positive correlation between SD and BSD for all tree
sizes (R² = 0.8613 for trees with 4–16 species and R² = 0.7055 for trees with 16–100 species) (Supplementary Fig. S3c). However,
the SD follows a discrete distribution, which obviously is most
conspicuous in the comparisons of small trees (Supplementary
Fig. S3a), whereas, thanks to the use of the bootstrap values, the
BSD distribution is continuous (Fig. 4).
Figure 4 shows an example of the comparison (all against all) of
three trees with six species, each of which differs in 1, 2, and 3 splits,
resulting in SD values of 0.33, 0.67, and 1, respectively (Fig. 4a).
Also, each tree was compared to itself resulting in an SD of 0. Then,
bootstrap values were assigned randomly to the trees in order to

Fig. 4. Comparisons of trees with six taxa. Bootstrap values were assigned randomly in each comparison. [Tree drawings and plot omitted; panel (b) plots the BSD over 1,000 repetitions for tree pairs with SD = 0, 0.33, 0.67, and 1.]

compare the trees using the BSD method, and this procedure was
repeated 1,000 times. The resulting plot (Fig. 4b) shows that, for
the comparison of trees with SD of 0 and 1, the BSD values ranged
from 0 to 0.5 and from 0.5 to 1, respectively, and in principle, could
assume all intermediate values. In the case of the comparisons that differed in one split (SD = 0.33), the BSD value was greater than 0.33 in 75% of the comparisons, whereas for the comparisons that

differed in two splits (SD = 0.67), 25% of the BSD values were
greater than 0.67. Thus, the BSD method for tree comparison
offers a better resolution than the SD method, especially for trees
with a small number of species.
Figure 5a shows the results of analysis of six simulated alignments
with an increasing level of noise (divergence with respect to the initial alignment) in each alignment, i.e., from alignment 0 (without
noise and producing trees with bootstrap values of 100) to alignment
5 with the maximum level of noise. For each alignment, a tree was
constructed using the UPGMA method from the Web server DendroUPGMA (http://genomes.urv.cat/UPGMA). Distances were
calculated using the Jaccard coefficient, and bootstraps were generated from 100 replicates. The results of the tree comparison (Fig. 5b)
using three different methods, namely, Nodal Distance (ND), SD,
and BSD, show that the BSD method presents a continuous distribution resulting in a better resolution of the distances than the other
two methods. Indeed, the SD and ND methods fail to discern the
similarity between trees after six changes, whereas the BSD method
still reports discernible similarity (Fig. 5b). In order to compare the
three tree comparison methods, the distance reported by each
method was normalized to the maximum value in each case, i.e.,
after 46 changes (maximum number of changes in the simulation),
the distance to the initial tree is 1.41, 0.30, and 0.42 for ND, SD, and
BSD, respectively. All three distance values indicate that the trees are
similar far above the random expectation, supporting the robustness
of all methods, but the BSD method presents a better resolution in
the tree comparison.
3.1.5. Analysis of Random Trees and the Significance of BSD Results

To assess the significance of the tree comparison by the BSD method, we performed several tree comparisons using random trees containing between 4 and 100 species (Fig. 6). Each test is an all-against-all
comparison of 1,000 random trees (for complete results, see
Additional file 1 at ftp://ftp.ncbi.nih.gov/pub/koonin/FOL/).
The results from random tree comparison have to be used to determine whether the detected similarities or differences between trees
are significantly different from chance (12). Figure 6 shows that the
distance between random trees monotonically increases with the tree
size up to a value of approximately 0.75 for BSD and approximately
0.999 for SD. In other words, although BSD is an extension of the
SD method, the results obtained by the two methods are not directly
comparable. Therefore, to assess whether the similarity between two
trees is better than chance, one must consider the method used for
the tree comparison (e.g., SD or BSD) and the size of the tree. For
example, consider two trees with 15 species each for which the SD
method reports a distance of 0.75. This value is far below randomness
(Fig. 6), so the conclusion would be that the two trees are nonrandomly similar. However, if the same distance value (0.75) is reported
by the BSD method, the conclusion would be the opposite, namely,
that the two trees are no more similar than two random trees of
15 species.

[Fig. 5 here; plot omitted. Alignments 1–5 contain 3, 6, 12, 26, and 46 changes, respectively; the plot shows the normalized distance to the initial tree for SD, ND, and BSD.]

Fig. 5. Comparison of six trees constructed from alignments with increasing noise levels. (a) Comparison of trees from
six simulated alignments. The UPGMA tree from each alignment was reconstructed with the Web server DendroUPGMA (http://
genomes.urv.cat/UPGMA) using the Jaccard coefficient as the measure of distance and generating 100 bootstrap replicates.
Alignment 0 corresponds to the initial alignment without noise that perfectly separates all branches, resulting in a tree with



Fig. 6. Random BSD and SD depending on the tree size. Results of the tree comparison of random trees (with different
sizes ranging from 4 to 100 species) show that the BSD and SD increase up to 0.75 and 0.999, respectively.

Another, and probably the most important, problem of the comparison of phylogenetic trees is how to interpret the results
from a biological perspective. To address this issue, we generated
random trees containing from 4 to 100 species and performed 1 to
100 permutations (swap of a pair of branches) in each tree. The
resulting tree was then compared with the source tree (Fig. 7a, b).
The results show the number of permutations required to obtain a
particular BSD value for different tree sizes (number of species).
For instance, BSD = 0.3 in the comparison of two trees with 20 species indicates that the two trees are separated by one permutation, whereas BSD = 0.6 indicates that the trees are separated by approximately 9 permutations (for the complete listing of equivalences between BSD, SD, and the number of permutations, see
Additional file 2 at ftp://ftp.ncbi.nih.gov/pub/koonin/FOL/).
Considering that each permutation corresponds to an HGT
event, the BSD may be construed as the measure of the extent of
HGT contributing to the topological difference between the compared trees. Given the discrete distribution of SD values, this measure cannot be used to infer the number of permutations with the
same precision as BSD.

Fig. 5. (continued) bootstrap values of 100 for all internal nodes. Alignments 1–5 correspond to the derivatives of the initial alignment with increasing noise levels at each step. (b) Results of the comparison of each tree (1 to 5) with the initial tree (0). The trees were compared using three methods: Split Distance (SD), Nodal Distance (ND), and Boot-Split Distance (BSD). For the purpose of comparison, the results obtained with each of the three methods were normalized to the maximum value in each case.

Fig. 7. The number of permutations and the BSD. (a) BSD depending on the number
of permutations and tree size. (b) Mean and standard deviation of the BSD for up to
100 permutations for trees with 20 species.

3.2. Analysis of Topological Trends in a Set of Phylogenetic Trees: Calculation of the Tree Inconsistency

A key characteristic of the FOL is the degree of the topological (in)consistency between the constituent trees. To quantify this trend, we introduced the inconsistency score (IS), which is the fraction of the times that the splits from a given tree are found in all N trees that comprise the FOL. Thus, the IS may be naturally taken as a measure of how representative of the entire FOL is the topology


of the given tree. The IS is calculated using Eqs. 5–7, where N is the
total number of trees, X is the number of splits in the given tree,
and Y is the number of times the splits from the given tree are found
in all trees of the FOL.
IS = (1/Y − ISmin) / ISmax,    (5)

ISmin = 1 / (X × N),    (6)

ISmax = 1/X − ISmin.    (7)
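A sketch of Eqs. 5–7 with the two boundary cases (the function name is ours): a tree whose X splits occur in all N trees gives IS = 0, and a tree whose splits occur in no other tree gives IS = 1.

```python
def inconsistency_score(Y, X, N):
    """Eqs. 5-7.  X = number of splits in the given tree, N = number of trees
    in the forest, Y = total number of times the tree's splits are found
    across all N trees (Y = X*N for a perfectly consistent tree, Y = X if
    each split occurs only in the tree itself)."""
    is_min = 1.0 / (X * N)              # Eq. 6
    is_max = 1.0 / X - is_min           # Eq. 7
    return (1.0 / Y - is_min) / is_max  # Eq. 5

# Boundary cases for a tree with 5 splits in a forest of 1,000 trees:
print(inconsistency_score(5 * 1000, 5, 1000))  # 0.0 (fully consistent)
print(inconsistency_score(5, 5, 1000))         # 1.0 (unique topology)
```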

In addition to the calculation of a single value of IS for a given tree by comparing its topology to the topologies of the rest of the trees in the FOL, IS can be calculated along the depth of the trees, namely, split
depth and phylogenetic depth. The split depth was calculated for each
unrooted tree according to the number of splits from the tips to the
center of the tree. The value of split depth ranged from 1 to 49 ([100
species/2] − 1). The phylogenetic depth was obtained from the
branch lengths of a rescaled ultrametric tree, rooted between archaeal
and bacterial species, and ranged from 0 to 1. The topology of the
ultrametric tree was obtained from the supertree of the 102 NUTs
using the CLANN program (64). The branch lengths from each of
the 6,901 trees were used to calculate the average distance between
each pair of species. The obtained matrix was used to calculate
the branch lengths of the supertree of the NUTs. This supertree
with branch lengths was then used to construct an ultrametric tree
using the program KITSCH from the Phylip package (65) and
rescaled to the depth range from 0 to 1. The resulting ultrametric
tree was used for the analysis of the dependence of tree inconsistency
on phylogenetic depth.
3.2.1. Classical Multidimensional Scaling Analysis

The CMDS, also known as principal coordinate analysis, is the multifactorial method best suited to analyze matrices obtained
from tree comparison methods like BSD and identify the main
trends in a large set of phylogenetic trees. The CMDS embeds
n data points implied by an [n × n] distance matrix into an m-dimensional space (m < n) such that, for any k ∈ [1, m], the
embedding into the first k dimensions is the best in terms of
preserving the original distances between the points (66, 67). In
our analysis, the data points are distances between trees obtained
using the BSD method. The choice of the optimal number of
clusters is made using the gap statistics algorithm (68). The number
of clusters for which the value of the gap function for cluster k + 1
is not significantly higher than that for cluster k (z-score below 1.96, corresponding to a 0.05 significance level) is considered
optimal. The CMDS analysis was performed using the kmeans

Genome-Wide Comparative Analysis of Phylogenetic Trees. . .

67

function of the R package that implements the K-means algorithm. The CMDS approach has been previously employed by Hillis et al. for phylogenetic tree comparison, with the distances between trees calculated using the Robinson–Foulds distance (69).
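Classical MDS itself amounts to double-centering the squared distance matrix and taking the leading eigenvectors. A minimal numpy sketch (not the R code used in the study; the function name is ours):

```python
import numpy as np

def classical_mds(D, m=2):
    """Embed the n points implied by an n x n distance matrix D into m
    dimensions, preserving the original distances as well as possible."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:m]   # keep the m largest eigenvalues
    vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(vals)

# Three collinear "trees" at positions 0, 1, 3: a 1-D embedding recovers
# the pairwise distances exactly.
D = np.array([[0., 1., 3.], [1., 0., 2.], [3., 2., 0.]])
X = classical_mds(D, m=1)
print(np.allclose(np.abs(X[:, 0, None] - X[:, 0]), D))  # True
```

In the actual analysis, D would be the matrix of pairwise BSD values between all trees, and clusters would then be sought in the low-dimensional embedding.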
3.3. Analysis of Quartets of Species

3.3.1. Definition of Quartets and Mapping Quartets onto Trees

The minimum evolutionary unit in unrooted phylogenetic trees is defined by groups of four species (or quartets), and each quartet can assume three possible unrooted tree topologies (Supplementary Fig. S4a). A quartet defined by the set
of species A, B, C, and D has three possible unrooted topologies:
(1) AB|CD, (2) AC|BD, and (3) AD|BC. To analyze which quartet
topology (QT) best represents the relationships among the four
species in a quartet, each quartet was compared against the entire
set of phylogenetic trees from 100 species (the FOL).
For 100 species, there are 3,921,225 quartets, and accordingly
11,763,675 topologies (Supplementary Fig. S4b). A mapping of
quartets onto trees is produced using the SD method (12).
A binary version of this method was employed to compare quartets
and trees (a quartet is represented in a tree when SD = 0 and not
represented when SD > 0). Figure 8a shows an example of quartet
mapping onto a set of ten trees. Here, q1 is a resolved quartet, with
the topology q1t1 supported by eight of the ten trees. By contrast,
for q2, three quartet topologies are equally supported, i.e., the
topology of this quartet remains unresolved.
To analyze which of the three possible topologies best represents the almost 4 million quartets in the FOL, each quartet topology was compared with the entire set of 6,901 trees, resulting in a
total of 8.12 × 10^10 tree comparisons (Supplementary
Fig. S4b), and the number of trees that support each quartet
topology was counted for the entire FOL or for the set of 102
NUTs (Supplementary Fig. S4b).
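This mapping can be sketched as follows. Here a tree is reduced to its set of nontrivial splits, and a quartet topology counts as represented in a tree (the binary SD = 0 criterion) when some split places two of the four species on one side and the remaining two on the other; the taxon names and the three-tree forest are made up for illustration.

```python
from math import comb

# for 100 species: C(100, 4) quartets, three topologies each
assert comb(100, 4) == 3_921_225
assert 3 * comb(100, 4) == 11_763_675

def quartet_topology(splits, quartet):
    """Quartet topology induced by an unrooted tree given as a set of
    nontrivial splits (pairs of frozensets); None if unresolved."""
    for side1, _side2 in splits:
        pair = {t for t in quartet if t in side1}
        if len(pair) == 2:
            rest = set(quartet) - pair
            return frozenset([frozenset(pair), frozenset(rest)])
    return None

def count_support(forest, quartet):
    """Number of trees in the forest supporting each topology."""
    counts = {}
    for splits in forest:
        topo = quartet_topology(splits, quartet)
        if topo is not None:
            counts[topo] = counts.get(topo, 0) + 1
    return counts

fs = frozenset
t1 = {(fs("AB"), fs("CDE")), (fs("ABC"), fs("DE"))}   # (((A,B),C),D,E)
t2 = {(fs("AC"), fs("BDE")), (fs("ABC"), fs("DE"))}   # (((A,C),B),D,E)
support = count_support([t1, t1, t2], ("A", "B", "C", "D"))
# AB|CD is supported by two trees, AC|BD by one
```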

3.3.2. Distance Matrices and Heat Maps

Using the quartet support values for each quartet, a 100 × 100 between-species distance matrix was calculated as d_ij = 1 - S_ij/Q_ij, where d_ij is the distance between two species, S_ij is the number of trees containing quartets in which the two species are neighbors, and Q_ij is the total number of quartets containing the given two
species. Then, this distance matrix was used to construct different
heat maps using the matrix2png Web server ((70), Fig. 8b). In contrast to the BSD method, which is best suited for the analysis of the
evolution of individual genes, the distance matrices derived from
maps of quartets are used to analyze the evolution of species and to
disambiguate tree-like evolutionary relationships and highways
(preferential routes) of HGT.
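The distance matrix then follows directly from the support counts; the 4 × 4 values of S and Q below are invented for illustration (the real matrices are 100 × 100).

```python
import numpy as np

# S[i, j]: trees containing quartets in which species i and j are neighbors
# Q[i, j]: total quartets containing both species (toy values)
S = np.array([[ 0., 80., 10., 10.],
              [80.,  0., 10., 10.],
              [10., 10.,  0., 80.],
              [10., 10., 80.,  0.]])
Q = np.full((4, 4), 100.0)
np.fill_diagonal(Q, 1.0)       # avoid division by zero on the diagonal

D = 1.0 - S / Q                # d_ij = 1 - S_ij / Q_ij
np.fill_diagonal(D, 0.0)       # a species is at distance 0 from itself
# D can now be rendered as a heat map (matrix2png in the original study)
```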

3.3.3. The Tree-Net Trend

The quartet-based between-species distances were used to calculate the Tree-Net Trend (TNT) score. The TNT score is calculated by rescaling each matrix of quartet distances to a 0–1 scale

Fig. 8. Mapping quartets. (a) Mapping quartets onto a set of ten trees. (b) A schematic of the procedure used to reconstruct
a species matrix from the map of quartets.

between the supertree-derived matrix (which is taken to represent solely the tree-like evolution signal, hence the distance of 0) and
the matrix obtained from permuted trees, with distance values
around the random expectation of 0.67 (Supplementary
Fig. S5). Two situations may occur in the calculation of the TNT score depending on the relationship between the distance in the supertree matrix (Ds) and the distance in the random matrix (Dr = 0.67). When Ds > Dr (e.g., in comparisons of archaea versus bacteria), S_TNT = (d - Dr)/(Ds - Dr), where S_TNT is the TNT score and d is the distance between the two compared species in the matrix. When Ds < Dr (in comparisons between closely related species), S_TNT = 1 - (d - Ds)/(Dr - Ds).
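The two formulas translate directly into code; the function below is an illustrative sketch applied per matrix entry, with d, Ds, and Dr as defined above.

```python
def tnt_score(d, ds, dr=0.67):
    """Tree-Net Trend score for one species pair.

    d:  observed quartet-based distance between the two species
    ds: distance in the supertree-derived matrix (pure tree signal)
    dr: distance expected from permuted trees (pure net signal)
    A score of 1 indicates fully tree-like, 0 fully net-like evolution.
    """
    if ds > dr:                        # e.g., archaea vs. bacteria
        return (d - dr) / (ds - dr)
    return 1.0 - (d - ds) / (dr - ds)  # closely related species

# d equal to the supertree distance gives a fully tree-like score
score = tnt_score(0.8, ds=0.8)
```

For example, with Ds = 0.8 and Dr = 0.67, an observed distance of 0.8 scores 1 (fully tree-like) and 0.67 scores 0.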

4. Phylogenetic Concepts in Light of Pervasive Horizontal Gene Transfer

4.1. Patterns in the Phylogenetic Forest of Life

The reconstruction of the evolutionary trends in the FOL is based on the idea that prokaryotes, effectively, share a common gene
pool. This gene pool consists of genes with widely different ranges
of phyletic spread, from universal to rare ones only present in a
few species (71). Thus, genes, as the elements of this gene pool,
have their distinct evolutionary histories blending HGT and
vertical inheritance (Fig. 9). In principle, the FOL encompasses
the complete set of phylogenetic trees for all genes from all genomes.
However, a comprehensive analysis of the entire FOL is computationally prohibitive (with over 1,000 archaeal and bacterial genomes
now available and the computational resources accessible to the
authors, estimation of the phylogenetic tree for each gene represented in all these genomes would take weeks of computer time), so a
representative subset of the trees needs to be selected and analyzed.
Previously (5), we defined such a subset by selecting 100 archaeal
and bacterial genomes, which are representative of all major prokaryote groups, and building 6,901 maximum likelihood (ML) trees
for all genes with a sufficient number of homologs and sufficient level
of sequence conservation in this set of genomes; for brevity, we refer
to this set of trees as the FOL. In this set of almost 7,000 trees, only a
very small portion of the forest is represented by NUTs (Fig. 9).
Furthermore, bacterial and archaeal universal trees are rare as well, as
reflected in Fig. 9 by the small peaks around 41 and 59 species, i.e., all
archaea and all bacteria, respectively. The dominant pattern in the
major part of the FOL is completely different: the FOL is best
represented by numerous small trees, with about 2/3 of the trees
including <20 species (Fig. 9).

Fig. 9. The Forest of Life (FOL). The distribution of the trees in the FOL by the number of
species. Modified from ref. 5.


Fig. 10. Distribution of the gene functions among the NUTs. The functional classification of
genes was from the COG database (59).

4.2. The Nearly Universal Trees

We define the NUTs as trees for those COGs that were represented
in more than 90% of the included prokaryotes. This definition
yielded 102 NUTs. Not surprisingly, the great majority of the
NUTs are genes encoding proteins involved in translation and the
core aspects of transcription (Fig. 10). Among the NUTs, only 14
corresponded to COGs that consist of strict 1:1 orthologs (all of
them ribosomal proteins), whereas the rest of NUTs included
paralogs in some organisms (only the most conserved paralogs
were used for tree construction (5)). The 1:1 NUTs were similar
to the rest of the NUTs in terms of the connectivity in tree similarity
(1-BSD) networks and their positions in the single cluster of NUTs
obtained using CMDS.
The 102 NUTs were compared to trees produced by analysis of
concatenations of universal proteins (47). The results showed that
most of the NUTs were topologically similar to a tree obtained by
the concatenation of 31 universal orthologous genes (5); in other words, the Universal TOL constructed by Ciccarelli et al. (47) was statistically indistinguishable from the NUTs and showed properties of a consensus topology. Not surprisingly, the 1:1 ribosomal
protein NUTs were even more similar to the universal tree than the
rest of the NUTs, in part because these proteins were used for the
construction of the universal tree and, in part, presumably because
of the low level of HGT among ribosomal proteins.

4.3. The Tree of Life as a Central Trend in the FOL

We analyzed the matrix of all-against-all tree comparisons of the NUTs by embedding them into a 30-dimensional tree space using
the CMDS procedure (66, 67). The gap statistics analysis (68)
reveals a lack of significant clustering among the NUTs in the tree
space. Thus, all the NUTs seem to belong to one unstructured
cloud of points scattered around a single centroid. This organization of the tree space is best compatible with individual trees randomly
deviating from a single, dominant topology (which may be denoted
the TOL), apparently as a result of random HGT (but in part
possibly due to random errors in the tree-construction procedure).
Therefore, there is an unequivocal general trend among the NUTs.
Although the topologies of the NUTs were, for the most part, not
identical so that the NUTs could be separated by their degree of
inconsistency (a proxy for the amount of HGT), the overall
high consistency level indicated that the NUTs are scattered in the
close vicinity of a consensus tree, with HGT events distributed
randomly (5).
Thus, the NUTs present a unique and strong signal of unity
that seems to reflect the TOL pattern of evolution. The inconsistency score among the NUTs ranged from 1.4 to 4.3%, whereas the
mean IS value for an equivalent set (102) of randomly generated
trees with the same number of species was approximately 80%,
indicating that the topologies of the NUTs are highly consistent
and nonrandom (5).
To further assess the potential contribution of phylogenetic
analysis artifacts to observed inconsistencies between the NUTs,
we analyzed these trees with different bootstrap support thresholds
(that is, only splits supported by bootstrap values above the respective threshold value were compared). Particularly low IS levels were
detected for splits with high bootstrap support, but the inconsistency was never eliminated completely, suggesting that HGT is a
significant contributor to the observed inconsistency among the
NUTs (IS ranges from 0.3 to 2.1% and 0.3 to 1.8% for splits with a
bootstrap value higher than 70 and 90, respectively) (5).
Analysis of the supernetwork built from the 102 NUTs (5)
showed that the incongruence among these trees is mainly concentrated at the deepest levels, with a much greater congruence at
shallow phylogenetic depths. The major exception is the unambiguous archaeal–bacterial split that is observed despite the apparent substantial interdomain HGT. Evidence of probable HGT between
archaea and bacteria was obtained for approximately 44% of the
NUTs (13% from archaea to bacteria, 23% from bacteria to archaea,
and 8% in both directions), with the implication that HGT is likely
to be even more common between the major branches within the
archaeal and bacterial domains (5). These results are compatible
with previous reports on the apparently random distribution of
HGT events in the history of highly conserved genes, in particular
those encoding proteins involved in translation (72, 73), and on
the difficulty of resolving the phylogenetic relationships between
the major branches of bacteria (7476) and archaea (5, 77, 78).
More specifically, archaeal–bacterial HGT has been inferred for 83%
of the genes encoding aminoacyl-tRNA synthetases (compared
with the overall 44%), essential components of the translation
machinery that are known for their horizontal mobility (40, 79).

In contrast, no HGT has been predicted for any of the ribosomal proteins, which belong to an elaborate molecular complex, the ribosome, and hence appear to be nonexchangeable between the two prokaryotic domains (40, 73). In addition to the aminoacyl-tRNA synthetases and in agreement with many previous observations ((80) and references therein), evidence of HGT between
archaea and bacteria was seen also for the few metabolic enzymes
that belonged to the NUTs, including undecaprenyl pyrophosphate synthase, glyceraldehyde-3-phosphate dehydrogenase, nucleoside diphosphate kinase, thymidylate kinase, and others.
4.4. The NUTs Topologies as the Central Trend and Detection of Distinct Evolutionary Patterns in the FOL

Using the BSD method, we compared the topologies of the NUTs to those of the rest of the trees in the FOL. Notably, 2,615 trees (~38% of the FOL) showed a greater than 50% similarity (P-value < 0.05) to at least one of the NUTs, with the mean similarity of these trees to the NUTs being approximately 50% (Fig. 11). For a set of 102
randomized trees of the same size as the NUTs, only about 10% of
the trees in the FOL showed the same or greater similarity, indicating that the NUTs were strongly and nonrandomly connected to
the rest of the FOL.
We then analyzed the structure of the FOL by embedding the
3,789 COG trees into a 669-dimensional space using the CMDS
procedure (66, 67). A CMDS clustering of the entire set of 6,901
trees in the FOL was beyond the capacity of the R software package
used for this analysis; however, the set of COG trees included most of
the trees with a large number of species for which the topology
comparison is most informative. A gap statistics analysis (68) of

Fig. 11. Topological similarity between the NUTs and the rest of the FOL. Percentage of trees connected to the NUTs at different levels of similarity (modified from Puigbò et al. 2009).

Fig. 12. Clusters and patterns in the FOL. The seven clusters identified in the FOL using the CMDS method and the mean similarity values between the 102 NUTs and all trees from each of the 7 clusters are shown: (1) 42.43%*, (2) 63.34%*, (3) 62.11%**, (4) 56.21%**, (5) 50.17%**, (6) 48.6%**, (7) 49.66%**; *p = 0.0014, **p < 0.000001 (modified from Puigbò et al. 2009).

K-means clustering of these trees in the tree space revealed distinct clusters of trees in the forest. The FOL is optimally partitioned into
seven clusters of trees (the smallest number of clusters for which the
gap function did not significantly increase with the increase of the
number of clusters) (Fig. 12). Clusters 1, 4, 5, and 6 were enriched
for bacterial-only trees, all archaeal-only trees belonged to clusters
2 and 3, and cluster 7 consisted entirely of mixed archaeal–bacterial trees; notably, all the NUTs form a compact group inside cluster 6.
The results of the CMDS clustering (Fig. 12) support the existence of several distinct attractors in the FOL. However, we have to emphasize caution in the interpretation of this clustering because trivial separation of the trees by size could be an important contributing factor. The approaches to the delineation of distinct groves within the
forest merit further investigation. The most salient observation for the
purpose of the present study is that all the NUTs occupy a compact
and contiguous region of the tree space and, unlike the complete set of
the trees, are not partitioned into distinct clusters by the CMDS
procedure. Taken together with the high mean topological similarity
between the NUTs and the rest of the FOL, these findings indicate
that the NUTs represent a valid central trend in the FOL.

4.5. The Tree and Net Components of Prokaryote Evolution

The TNT map of the NUTs was dominated by the tree-like signal (green in Fig. 13a): the mean TNT score for the NUTs was 0.63 (Fig. 14b), so the evolution of the nearly universal genes of prokaryotes appears to be almost two-thirds tree-like (i.e., reflects the topology of the supertree). The rest of the FOL stood in stark contrast to the NUTs, being dominated by net-like evolution, with the mean TNT value of 0.39 (Fig. 14c) (about 60% net-like).
Remarkably, areas of tree-like evolution were interspersed with areas
of net-like evolution across different parts of the FOL (Fig. 13b).
The major net-like areas observed among the NUTs were retained,
but additional ones became apparent including Crenarchaeota that
showed a pronounced signal of a non-tree-like relationship with
diverse bacteria as well as some Euryarchaeota (Fig. 13b). The
distribution of the tree and net evolutionary signals among different
groups of prokaryotes showed a striking split among the NUTs:
among the archaea, the tree signal was heavily dominant (mean TNT_NUTs_Archaea = 0.80 ± 0.20), whereas among bacteria the contributions of the tree and net signals were nearly equal (mean TNT_NUTs_Bacteria = 0.51 ± 0.38). Among the rest of the trees in the FOL, archaea also showed a stronger tree signal than bacteria, but the difference was much less pronounced than it was among the NUTs (mean TNT_FOL_Archaea = 0.47 ± 0.11 and mean TNT_FOL_Bacteria = 0.34 ± 0.08). The conclusions on the tree-like
and net-like components of evolution made here are based on the
assumption that the supertree of the NUTs represents the tree-like
(vertical) signal. We did not perform direct tests of the robustness of
these conclusions to the supertree topology. However, observations
presented previously (5) suggest that the results are likely to be
robust given the coherence of the NUTs topologies as well as the
similarity of the supertree topology and the topologies of the individual NUTs to the TOL obtained from concatenated sequences
of universally conserved ribosomal proteins (47).

5. Conclusions
The analysis of the phylogenetic FOL is a logical strategy for
studying the evolution of prokaryotes because each set of orthologous genes presents its own evolutionary history and no single
topology may represent the entire forest. Thus, the methods introduced in this article that compare trees without the use of a preconceived representative topology for the entire FOL may be of
wide utility in phylogenomics.
We have shown that, although no single topology may represent
the entire FOL and several distinct evolutionary trends are detectable,
the NUTs contain a strong tree-like signal. Although the tree-like
signal is quantitatively weaker than the sum total of the signals from
HGT, it is the most pronounced single pattern in the entire FOL.

Fig. 13. The Tree/Network Trend (TNT) score heat maps. (a) The 102 NUTs. (b) The FOL without the NUTs (6,799 trees).
The TNT increases from red (low score, close to random, an indication of net-like evolution) to green (high score, close to
the supertree topology, an indication of tree-like evolution). The species are ordered according to the topology of the
supertree of the 102 NUTs. In (a), the major groups of archaea and bacteria are denoted (modified from Puigbò et al. 2010).

Fig. 14. The Tree/Network Trends in the FOL and in the NUTs. (a) A hypothetical
equilibrium between the tree and net trends. (b) A schematic representation of the tree
tendency in the NUTs. (c) A schematic representation of the net tendency in the FOL.

Under the FOL perspective, the traditional TOL concept (a single true tree topology) is invalidated and should be replaced
by a statistical definition. In other words, the TOL only makes sense as
a central trend in the phylogenetic forest.

6. Exercises
1. Calculate the split distance SD and BSD of the following two
trees (the trees are in the Newick format):
(((A,B)61,C)53,D,E);(((A,C)76,B)38,D,E).
2. Calculate the Inconsistency Score of the tree X in the forest of
trees Y.
X = (((A,B),C),D,E);
Y = (((A,B),C),D,E); (A,B,(E,D)); (((A,C),B),D,E); (A,C,(B,D)); (A,B,(C,D)); (A,B,(C,E)); (A,E,(B,D)); (((A,C),D),E,F); (((A,B),D),E,C); (((E,F),A),B,C).


Acknowledgments
The authors' research is supported by the Department of Health and Human Services intramural program (NIH, National Library of Medicine).
References
1. Huerta-Cepas, J., Dopazo, H., Dopazo, J., and Gabaldon, T. (2007) The human phylome. Genome Biol 8, R109.
2. Huerta-Cepas, J., Bueno, A., Dopazo, J., and Gabaldon, T. (2008) PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 36, D491–496.
3. Frickey, T., and Lupas, A. N. (2004) PhyloGenie: automated phylome generation and analysis. Nucleic Acids Res 32, 5231–5238.
4. Sicheritz-Ponten, T., and Andersson, S. G. (2001) A phylogenomic approach to microbial evolution. Nucleic Acids Res 29, 545–552.
5. Puigbo, P., Wolf, Y. I., and Koonin, E. V. (2009) Search for a Tree of Life in the thicket of the phylogenetic forest. J Biol 8, 59.
6. Felsenstein, J. (2004) Inferring Phylogenies. Sunderland, MA: Sinauer Associates.
7. Nei, M., and Kumar, S. (2001) Molecular Evolution and Phylogenetics. Oxford: Oxford University Press.
8. Castresana, J. (2007) Topological variation in single-gene phylogenetic trees. Genome Biol 8, 216.
9. Soria-Carrasco, V., and Castresana, J. (2008) Estimation of phylogenetic inconsistencies in the three domains of life. Mol Biol Evol 25, 2319–2329.
10. Marcet-Houben, M., and Gabaldon, T. (2009) The tree versus the forest: the fungal tree of life and the topological diversity within the yeast phylome. PLoS ONE 4, e4357.
11. Robinson, D. F., and Foulds, L. R. (1981) Comparison of phylogenetic trees. Math Biosci 53, 131–147.
12. Puigbo, P., Garcia-Vallve, S., and McInerney, J. O. (2007) TOPD/FMTS: a new software to compare phylogenetic trees. Bioinformatics 23, 1556–1558.
13. Steel, M. A., and Penny, D. (1993) Distribution of tree comparison metrics - some new results. Systematic Biol 42, 126–141.
14. Bluis, J., and Shin, D.-G. (2003) Nodal distance algorithm: calculating a phylogenetic tree comparison metric. In: Proceedings of the third IEEE symposium on bioInformatics and bioEngineering. IEEE Computer Society, 87–94.

15. Cardona, G., Llabres, M., Rossello, F., and Valiente, G. (2009) Nodal distances for rooted phylogenetic trees. J Math Biol.
16. Estabrook, G. F., McMorris, F. R., and Meachan, A. (1985) Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Syst Zool 34, 193–200.
17. Allen, L., and Steel, M. (2001) Subtree Transfer Operations and Their Induced Metrics on Evolutionary Trees. Annals of Combinatorics 5, 1–15.
18. Waterman, M. S., and Steel, M. (1978) On the similarity of dendrograms. J Theor Biol 73, 789–800.
19. Beiko, R. G., and Hamilton, N. (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6, 15.
20. Hickey, G., Dehne, F., Rau-Chaplin, A., and Blouin, C. (2008) SPR Distance Computation for Unrooted Trees. Evol Bioinform Online 4, 17–27.
21. Kubicka, E., Kubicki, G., and McMorris, F. R. (1995) An algorithm to find agreement subtrees. J Classification 12, 91–99.
22. Nye, T. M., Lio, P., and Gilks, W. R. (2006) A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22, 117–119.
23. de Vienne, D. M., Giraud, T., and Martin, O. C. (2007) A congruence index for testing topological similarity between trees. Bioinformatics 23, 3119–3124.
24. Cotton, J. A., and Page, R. D. (2002) Going nuclear: gene family evolution and vertebrate phylogeny reconciled. Proc Biol Sci 269, 1555–1561.
25. Soria-Carrasco, V., Talavera, G., Igea, J., and Castresana, J. (2007) The K tree score: quantification of differences in the relative branch length and topology of phylogenetic trees. Bioinformatics 23, 2954–2956.
26. Marcet-Houben, M., and Gabaldon, T. (2011) TreeKO: a duplication-aware algorithm for the comparison of phylogenetic trees. Nucleic Acids Res 39, e66.

27. Koonin, E. V., Wolf, Y. I., and Puigbo, P. (2009) The phylogenetic forest and the quest for the elusive tree of life. Cold Spring Harb Symp Quant Biol 74, 205–213.
28. Zuckerkandl, E., and Pauling, L. (1962) Molecular evolution. In: Horizons in Biochemistry. Edited by Kasha, M., and Pullman, B. New York: Academic Press; 189–225.
29. Woese, C. R. (1987) Bacterial evolution. Microbiol Rev 51, 221–271.
30. Bapteste, E., O'Malley, M. A., Beiko, R. G., Ereshefsky, M., Gogarten, J. P., Franklin-Hall, L., et al. (2009) Prokaryotic evolution and the tree of life are two different things. Biol Direct 4, 34.
31. Doolittle, W. F. (2000) Uprooting the tree of life. Sci Am 282, 90–95.
32. Doolittle, W. F., and Bapteste, E. (2007) Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci U S A 104, 2043–2049.
33. Kurland, C. G., Canback, B., and Berg, O. G. (2003) Horizontal gene transfer: A critical view. Proc Natl Acad Sci U S A 100, 9658–9662.
34. Kurland, C. G. (2005) What tangled web: barriers to rampant horizontal gene transfer. Bioessays 27, 741–747.
35. Logsdon, J. M., and Faguy, D. M. (1999) Thermotoga heats up lateral gene transfer. Curr Biol 9, R747–751.
36. Genereux, D. P., and Logsdon, J. M., Jr. (2003) Much ado about bacteria-to-vertebrate lateral gene transfer. Trends Genet 19, 191–195.
37. Kunin, V., Goldovsky, L., Darzentas, N., and Ouzounis, C. A. (2005) The net of life: reconstructing the microbial phylogenetic network. Genome Res 15, 954–959.
38. Daubin, V., Moran, N. A., and Ochman, H. (2003) Phylogenetics and the cohesion of bacterial genomes. Science 301, 829–832.
39. Lerat, E., Daubin, V., and Moran, N. A. (2003) From Gene Trees to Organismal Phylogeny in Prokaryotes: The Case of the gamma-Proteobacteria. PLoS Biol 1, E19.
40. Woese, C. R., Olsen, G. J., Ibba, M., and Soll, D. (2000) Aminoacyl-tRNA synthetases, the genetic code, and the evolutionary process. Microbiol Mol Biol Rev 64, 202–236.
41. Fitz-Gibbon, S. T., and House, C. H. (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27, 4218–4222.
42. Hanage, W. P., Fraser, C., and Spratt, B. G. (2006) Sequences, sequence clusters and bacterial species. Philos Trans R Soc Lond B Biol Sci 361, 1917–1927.

43. Eisen, J. A., and Fraser, C. M. (2003) Phylogenomics: intersection of evolution and genomics. Science 300, 1706–1707.
44. Salzberg, S. L., White, O., Peterson, J., and Eisen, J. A. (2001) Microbial genes in the human genome: lateral transfer or gene loss? Science 292, 1903–1906.
45. Galtier, N. (2007) A model of horizontal gene transfer and the bacterial phylogeny problem. Syst Biol 56, 633–642.
46. Galtier, N., and Daubin, V. (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci 363, 4023–4029.
47. Ciccarelli, F. D., Doerks, T., von Mering, C., Creevey, C. J., Snel, B., and Bork, P. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287.
48. Choi, I. G., and Kim, S. H. (2007) Global extent of horizontal gene transfer. Proc Natl Acad Sci U S A 104, 4489–4494.
49. Koonin, E. V., Wolf, Y. I., and Puigbo, P. (2009) The Phylogenetic Forest and the Quest for the Elusive Tree of Life. Cold Spring Harb Symp Quant Biol.
50. Dagan, T., and Martin, W. (2009) Getting a better picture of microbial evolution en route to a network of genomes. Philos Trans R Soc Lond B Biol Sci 364, 2187–2196.
51. Boucher, Y., Douady, C. J., Papke, R. T., Walsh, D. A., Boudreau, M. E., Nesbo, C. L., et al. (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37, 283–328.
52. Bucknam, J., Boucher, Y., and Bapteste, E. (2006) Refuting phylogenetic relationships. Biol Direct 1, 26.
53. Schliep, K., Lopez, P., Lapointe, F. J., and Bapteste, E. (2011) Harvesting evolutionary signals in a forest of prokaryotic gene trees. Mol Biol Evol 28, 1393–1405.
54. Beiko, R. G., Doolittle, W. F., and Charlebois, R. L. (2008) The impact of reticulate evolution on genome phylogeny. Syst Biol 57, 844–856.
55. Doolittle, W. F., and Zhaxybayeva, O. (2009) On the origin of prokaryotic species. Genome Res 19, 744–756.
56. Gogarten, J. P., and Townsend, J. P. (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 3, 679–687.
57. Gogarten, J. P., Doolittle, W. F., and Lawrence, J. G. (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226–2238.
58. Puigbo, P., Wolf, Y. I., and Koonin, E. V. (2010) The tree and net components of prokaryote evolution. Genome Biol Evol 2, 745–756.

59. Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
60. Jensen, L. J., Julien, P., Kuhn, M., von Mering, C., Muller, J., Doerks, T., et al. (2008) eggNOG: automated construction and annotation of orthologous groups of genes. Nucleic Acids Res 36, D250–254.
61. Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32, 1792–1797.
62. Castresana, J. (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol Biol Evol 17, 540–552.
63. Keane, T. M., Naughton, T. J., and McInerney, J. O. (2007) MultiPhyl: a high-throughput phylogenomics webserver using distributed computing. Nucleic Acids Res 35, W33–37.
64. Creevey, C. J., and McInerney, J. O. (2005) Clann: investigating phylogenetic information through supertree analyses. Bioinformatics 21, 390–392.
65. Felsenstein, J. (1996) Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol 266, 418–427.
66. Torgerson, W. S. (1958) Theory and Methods of Scaling. New York: Wiley.
67. Gower, J. C. (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325–328.
68. Tibshirani, R., Walther, G., and Hastie, T. (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63, 411–423.
69. Hillis, D. M., Heath, T. A., and St John, K. (2005) Analysis and visualization of tree space. Syst Biol 54, 471–482.
70. Pavlidis, P., and Noble, W. S. (2003) Matrix2png: a utility for visualizing matrix data. Bioinformatics 19, 295–296.

71. Koonin, E. V., and Wolf, Y. I. (2008) Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res 36, 6688–6719.
72. Ge, F., Wang, L. S., and Kim, J. (2005) The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol 3, e316.
73. Brochier, C., Bapteste, E., Moreira, D., and Philippe, H. (2002) Eubacterial phylogeny based on translational apparatus proteins. Trends Genet 18, 1–5.
74. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., and Koonin, E. V. (2002) Genome trees and the tree of life. Trends Genet 18, 472–479.
75. Wolf, Y. I., Rogozin, I. B., Grishin, N. V., Tatusov, R. L., and Koonin, E. V. (2001) Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology 1.
76. Creevey, C. J., Fitzpatrick, D. A., Philip, G. K., Kinsella, R. J., O'Connell, M. J., Pentony, M. M., et al. (2004) Does a tree-like phylogeny only exist at the tips in the prokaryotes? Proc Biol Sci 271, 2551–2558.
77. Brochier-Armanet, C., Boussau, B., Gribaldo, S., and Forterre, P. (2008) Mesophilic Crenarchaeota: proposal for a third archaeal phylum, the Thaumarchaeota. Nat Rev Microbiol 6, 245–252.
78. Elkins, J. G., Podar, M., Graham, D. E., Makarova, K. S., Wolf, Y., Randau, L., et al. (2008) A korarchaeal genome reveals new insights into the evolution of the Archaea. Proc Natl Acad Sci USA, in press.
79. Wolf, Y. I., Aravind, L., Grishin, N. V., and Koonin, E. V. (1999) Evolution of aminoacyl-tRNA synthetases - analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res 9, 689–710.
80. Koonin, E. V. (2003) Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Rev Microbiol 1, 127–136.

Chapter 4
Philosophy and Evolution: Minding the Gap Between
Evolutionary Patterns and Tree-Like Patterns
Eric Bapteste, Frederic Bouchard, and Richard M. Burian
Abstract
Ever since Darwin, the familiar genealogical pattern known as the Tree of Life (TOL) has been prominent in
evolutionary thinking and has dominated not only systematics, but also the analysis of the units of
evolution. However, recent findings indicate that the evolution of DNA, especially in prokaryotes and
such DNA vehicles as viruses and plasmids, does not follow a unique tree-like pattern. Because evolutionary
patterns track a greater range of processes than those captured in genealogies, genealogical patterns are in
fact only a subset of a broader set of evolutionary patterns. This fact suggests that evolutionists who
focus exclusively on genealogical patterns are blocked from providing a significant range of genuine
evolutionary explanations. Consequently, we highlight challenges to tree-based approaches, and point
the way toward more appropriate methods to study evolution (although we do not present them
in technical detail). We argue that there is significant benefit in adopting wider range of models, evolutionary representations, and evolutionary explanations, based on an analysis of the full range of evolutionary
processes. We introduce an ecosystem orientation into evolutionary thinking that highlights the importance
of type 1 coalitions (functionally related units with genetic exchanges, aka friends with genetic benefits), type 2 coalitions (functionally related units without genetic exchanges), communal interactions,
and emergent evolutionary properties. On this basis, we seek to promote the study of (especially
prokaryotic) evolution with dynamic evolutionary networks, which are less constrained than the TOL,
and to provide new ways to analyze an expanded range of evolutionary units (genetic modules, recombined
genes, plasmids, phages and prokaryotic genomes, pangenomes, microbial communities) and evolutionary
processes. Finally, we discuss some of the conceptual and practical questions raised by such network-based representations.
Key words: Network, Lateral gene transfer, Horizontal gene transfer, Evolution, Prokaryotes,
Philosophy of biology, Units of evolution
Is the phylogenetic or a definitely nonphylogenetic system (e.g., an idealistic-morphological system) better suited to serve as a general reference
system, or does one of these systems for intrinsic reasons demand this
precedence over all others? (1)

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_4,
# Springer Science+Business Media, LLC 2012


1. Genealogical Patterns and Evolutionary Patterns Are Two Different Things

Decades of phylogenetic research and practice provided Hennig's followers with a firm answer to his question: they held that the
phylogenetic system should be preferred for the study of evolution
and that such work allows the reconstruction of a Tree of Life
(TOL). For his supporters, a TOL provides a universal, natural,
practical, and heuristic framework for evolutionary research (2–5).
One of the key arguments in favor of this position is that nonphylogenetic systems (i.e., evolutionary studies that do not give priority to the reconstruction of a common genealogical tree)
cannot provide adequate heuristics for adaptive explanations. In
this chapter, we argue that this claim is wrong because units not
recognized in the TOL are required in many adaptive explanations,
and because the assumption that the units of evolution are supplied
by phylogenetic genealogies forecloses the understanding of key
evolutionary processes. The appeal of genealogical modeling
depends on the uniformity and relative simplicity of its explanatory
structure, based on a single TOL, in which, in the absence of
extinction, diversity increases over time and there is no reticulation
between branches. Although tree-based practices and the virtue
of the uniformity and structural simplicity of the TOL have been
explicit in evolutionary thinking since Darwin, many recent findings show that no single phylogenetic tree can represent the evolutionary history of many microbes or such DNA vehicles as viruses
and plasmids (6–18). Furthermore, to make a more theoretical
point, even when a tree obtains in some parts of the macrobial
world, it does so for purely contingent reasons. Thus, although we
grant that tree-shaped patterns correctly characterize some sections
of evolutionary history, we argue that this genealogical canalization
is contingent. Tree-like modes of evolution result from some but
not all of the evolutionary processes at play (e.g., cell division,
preferential mating); other evolutionary processes are also relevant
to model evolution (19). Familiar examples of processes that do not
respect phylogenetic boundaries are introgression across genera in
plants, resulting in reticulated evolution, and incorporation of viral
DNA, often with additional exogenous DNA, into both prokaryotic and eukaryotic genomes (17, 20, 21). Other such processes are
discussed below.
By privileging mostly (or exclusively) nicely contained genealogical patterns and the constraints fashioning them, the phylogenetic
system is a priori blind to the other patterns and constraints that are an
integral part of evolution. Purely genealogical explanations of the
patterns of life do not include many microbial adaptations. To cite
one example in passing, adaptation to high temperature (>50°C) in
archaea and bacteria involves multiple and important exchanges of
genetic material between these distantly related organisms (22).


[Fig. 1 diagram: GP (left): evolutionary phenomena associated with the genealogy; splitting events only; evolutionary relationships = genealogical relationships; evolutionary units = genealogical units. EP (right): GP + evolutionary phenomena that do not match the genealogy; splitting & clumping events; evolutionary relationships = genealogical relationships + other relationships (ecological, functional, genetic partnerships); evolutionary units = genealogical units + other evolving units.]

Fig. 1. Relationships between genealogical pattern (GP) (black) and evolutionary pattern (EP) (grey). Evolutionary patterns encompass genealogical patterns but not the reverse.

Thus, the adaptive hyperthermophile and thermophile phenotypes cannot be tracked solely by their genealogy. Yet, no evolutionist studying microbes would assert that this adaptation is an epiphenomenon. On the basis of theoretical considerations and by use of several
studying microbes would assert that this adaptation is an epiphenomenon. On the basis of theoretical considerations and by use of several
examples along these lines, we argue that comprehensive evolutionary analyses should take a variety of evolutionary processes that are
not captured by conventional genealogical thinking into account.
Genealogical patterns (GPs) and evolutionary patterns (EPs) can be
two different things, two distinct outcomes of evolution, that can be
summarized by distinct drawings (see Fig. 1). In this figure, the trees
are temporally oriented: the vertical axis in the top left (GP) and top
right (EP) diagrams is time with earlier at the bottom, later at the top.
GP (as in the left-hand diagram) considers only splitting lineages and
no interactions across lineages while EP (as in the right-hand diagram) considers both. Therefore, EP can be broader because not only
reticulations of various kinds (symbioses, genetic partnerships, etc.)
are important, but also because these interactions are crucial to
evolutionary fates and contents of lineages.


Proponents of the TOL hold that (some) monophyletic groups on the TOL provide a fruitful representation of (all of the) natural
groups and thus provide a fruitful representation of (all important)
evolutionary scenarios (see, e.g., refs. 23, 24). But when GP and EP
differ, this approach suffers from two significant limitations whose
importance is becoming widely recognized. By definition, the
TOL can only represent branching processes and it focuses solely
and explicitly on subsets of evolutionary processes, namely, the
evolution of (monophyletic) species understood as reproductively
isolated populations. The proponents of the notion of species,
defined as the least inclusive monophyletic groups on an appropriately scaled or constructed tree, have identified some of the limitations of that notion. As Mayr and Ghiselin separately note,
many plants do not fit this account, and therefore would deserve to be distinguished from the other species, and called instead "paraspecies" or "pseudospecies." For the former, "only sexually reproducing organisms qualify as species. Some other terminology, for instance paraspecies, will have to be found for uniparentally reproducing forms" (25). For the latter, asexual lineages do not form reproductive populations, and have to be considered pseudospecies (26). We are not claiming that clonal plants or bacteria
cannot be accommodated in the TOL, but that an understanding of
evolutionary patterns focused on clearly demarcated, fully encapsulated, monophyletic groups leads to counterintuitive claims about how lineages are formed and maintained (e.g., clonally, sexually, etc.) (27). These limits of tree-based approaches are the basis for
insisting on the importance of providing a less constrained way of
modeling and interpreting more (and ideally, all) of the fundamental evolutionary processes. In consequence, we point the way
toward more appropriate methods, although we cannot present
them in technical detail. We will, no doubt, fall short of persuading
all readers of our approach, but we at least show that greater
inclusivity yields a considerable improvement in the modeling of
evolutionary patterns and processes.
Our primary motivation, the idea that evolutionary patterns
encompass genealogical patterns but not the reverse, is illustrated in
Fig. 1. For phylogeneticists, GPs are the bedrock of evolutionary
thinking (23), but many evolutionary biologists have come to
accept that at least some adaptations do not translate into one
clean genealogical pattern (12, 27). Here, we restrict the argument
to adaptations, by concentrating to some extent on adaptations of
prokaryotes, but in fact we think it holds for a much broader class of
phenomena (e.g., typically traits that emerge from multilevel selection, carried on mobile elements). Consider the spread of antibiotic
drug resistance in prokaryotes: drug-resistant phenotypes result
from the action of a wide diversity of mechanisms that move
DNA between distantly related organisms: plasmids, phages, integrons, transformation, cell–cell fusion, activation of the SOS system, successful gene expression after a lateral gene transfer, etc. (28–32). Most of these mechanisms yield regularities discordant
with phylogeny; therefore, GPs certainly do not explain the EPs
that result from the acquisition and loss of antibiotic resistance
in microbes.
Given the broad acceptance of adaptive traits emerging from
multilevel selection in the prokaryotic world (33–40), the historical
reduction of the process of evolution of natural groups to tree-like
patterns is no longer fully satisfactory. Recent findings force evolutionists to entertain a richer set of patterns (19, 41, 42). Because
EPs are broader in scope than GPs, it may not be the best explanatory strategy to go from a limited pattern (the evolution of monophyletic groups in genealogical relationships) to a universal
characterization of evolutionary processes. This concern motivates
some evolutionary studies that explicitly attempt to accommodate
heterogeneous evolutionary models for evolving natural groups,
instead of trying to constrain evolutionary patterns to match the
branching genealogical patterns of the TOL (9, 11, 13, 43–47).
The studies we have cited focus on microbial evolution using
alternative approaches to classic tree-based approaches. Importantly, they are not only justified by the question of which patterns
are broader and more encompassing (EP > GP or EP < GP).
Indeed, the deeper problem is that genealogical patterns and evolutionary patterns do not track the same processes; rather, they aim
to capture distinct phenomena. The fact that there is a gap between
those patterns suggests that we are missing out on a lot of genuine
evolutionary explanations when exclusively adopting GP. Minding
the gap could have profound consequences.

2. What Does the Gap Between Genealogical Patterns and Evolutionary Patterns Imply?

In genealogical patterns, the basic explanatory unit has been species or monophyletic groupings. Since isomorphy of evolutionary and
genealogical patterns (or convergence of EP on GP) was assumed,
it has also been assumed that the basic explanatory unit for evolutionary patterns is species or monophyletic groupings (Fig. 1)
(1, 23). The assumed superiority of genealogical thinking is in
part a function of this perceived isomorphy between monophyletic
groups as the sole unit of evolution and monophyletic groups as the
sole unit of evolutionary explanation. By contrast, the gap between
GP and EP shows us that monophyletic groupings may not be the
only (or best) explanatory unit in evolutionary patterns (Fig. 1).
Evolutionists may need other units.
There is a connection here with important methodological issues
recently discussed by philosophers of biology (48–53). The studies
that seek potential units of evolution are exploratory in character, deploying some of the methods of traditional natural history together
with the laboratory-intense methods of molecular biology and bioinformatics. This combination requires exploratory use of sequence
databases, such as those used in recent -omic sciences in combination with the molecular tools (e.g., those that allow replacement of
one gene by another) and new computer methods designed to sample
and analyze protein and gene sequences from various natural and
experimental contexts. Thus, exploratory experimentation does not
follow the standard methods of hypothesis testing; instead, it deploys
a variety of means for varying parameters to examine what follows
from, e.g., the incorporation of a novel plasmid into a population of
microbes or from changing the timing of a developmental switch, and to
extract surprising patterns from a "hypothesis-neutral" data set
(which, of course, cannot have been gathered in the absence of
hypotheses). The patterns unraveled in these exploratory approaches
are important because they capture certain (molecular) sequelae of
some event or process. The spirit of such exploratory experiments,
characteristic of much new work in the -omic sciences and in systems
biology, could be embraced to improve evolutionary studies by identifying additional evolutionary units and the processes that generated
them, without depending on the central hypothesis of a TOL.
It is one thing to show the incompleteness of existing evolutionary explanations based on the TOL (12) and quite another to show
that one could step outside the TOL to recognize additional units of
evolution of diverse sorts. Defenders of the TOL might argue that
existing explanations, although incomplete, are powerful enough to
encompass the majority of additional evolutionary patterns as outliers, as acceptable noise. We disagree because the inclusion of evolutionary processes and units in evolutionary representations and
explanations beyond those envisaged in the TOL entails an inescapable pluralism. Yet, as we argue, the additional units are required to
recognize the importance of interactions among hierarchical processes at several levels in bringing about evolutionary change. For us,
the gap between EP and GP encourages conceptual and practical
developments aimed at capturing all the adaptations in which the
phylogeneticist is interested, as well as other adaptations, objects, and
processes beyond those revealed by studies restricted to the usual monophyletic groups relevant to phylogenetic studies (54).
What are these additional evolutionary objects? Consider, for
instance, the impact of lateral gene transfer (LGT) and recombination, which produce evolutionary modules (genes, groups of genes,
operons) with their own individual fates. One example based on LGT
is the suite of coevolving genes coding for gas vesicles in cyanobacteria
and haloarchaea; this suite of genes defines a functional and evolutionary unit (55). This genetic module codes for a clear adaptive
phenotype, conferring buoyancy to its hosts, and can be inherited
by LGT and vertical descent from ancestors to descendants. These
(adaptive) genes and groups of genes are distributed across prokaryotes and mobile genetic elements in ways that do not match
species genealogies. LGT and recombination also create phylogenetically mosaic entities (e.g., recombined genes (56), recombined
plasmids (10), viral (16) or prokaryotic genomes (22, 57)). Quite
generally, microbial genomes harbor genes with multiple distinct
phylogenetic affinities and from distantly related sources. These processes thus impact the size of bacterial pangenomes (i.e., the overall
gene pool of a set of organisms considered as belonging to a single
species) (58). Consequently, pangenomes of various sizes, composition, and origins are also remarkable evolving entities that are outcomes of evolution. Finally, LGT and recombination are also greatly
involved in the evolution of microbial communities (59, 60). These
ecologically shuffled evolutionary units are often phylogenetically
composite: they associate distinct DNA donors and hosts (also
referred to as genetic partners (41)) in a genetic network (9),
mixing both mobile elements and cell lineages. Many examples
beyond that of antibiotic resistance mentioned above are known: for example, communities of cyanobacteria, cyanophages, and plasmids in the ocean (61–64), natural communities in acid mine drainage (56), or gut microbiomes of various metazoans (65, 66). All
include many ecologically shared genetic partners that do not occupy
a single branch in a TOL. Evolution of microbes and their mobile
elements is greatly affected by such a communal lifestyle.
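The genetic-network representation invoked here lends itself to simple computational treatment. As a minimal sketch (the units and gene families below are invented for illustration; real analyses, as in ref. 9, infer shared families from all-against-all sequence similarity searches), one can link any two evolutionary units, cellular or mobile, that share a gene family, and then read candidate coalitions off the connected components of the resulting network:

```python
from itertools import combinations

# Hypothetical gene-family content of a few evolutionary units
# (cell genomes and mobile elements); names are illustrative only.
gene_families = {
    "Cyanobacterium_A": {"psbA", "psbD", "rbcL", "gvpA"},
    "Cyanophage_P1":    {"psbA", "psbD", "capsid"},
    "Haloarchaeon_H":   {"gvpA", "gvpC", "bop"},
    "Plasmid_X":        {"gvpA", "gvpC", "repA"},
    "Deltaproteo_D":    {"dsrA", "dsrB"},
}

def gene_sharing_network(units):
    """Link any two units that share at least one gene family."""
    edges = {u: set() for u in units}
    for u, v in combinations(units, 2):
        if units[u] & units[v]:
            edges[u].add(v)
            edges[v].add(u)
    return edges

def coalitions(edges):
    """Connected components = candidate 'friends with genetic benefits'."""
    seen, groups = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(edges[node] - group)
        seen |= group
        groups.append(group)
    return groups

net = gene_sharing_network(gene_families)
for group in coalitions(net):
    print(sorted(group))
```

In this toy data set, the cyanobacterium, its phage, the haloarchaeon, and the plasmid fall into one connected component (linked by shared photosystem and gas-vesicle gene families), while the lone sulfate reducer stays apart; such a network, unlike a tree, places mobile elements and cell lineages in the same representation.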
By focusing anew on the evolutionary processes in these and
other cases, we may be able to model additional evolutionary
patterns that cannot appear within genealogical patterns. Species
and monophyletic groups as the sole units of evolution are not as
explanatorily exhaustive as many evolutionary biologists would like
to believe, a fact that should be reflected in our explanatory models.
For many, this has led to efforts to redefine species in order to make
the concept refer to something that is simultaneously an evolutionary, a classificatory, a functional, and an explanatory unit (67).
In our view, this effort cannot succeed. In fact, to reduce the gap
between model and phenomena, i.e., to improve explanations of
evolutionary processes when EP and GP are not isomorphic, evolutionists may wish to reexamine the units of explanation they
employ and ask whether additional units of evolution are
involved in the processes underlying the patterns they have found.

3. Richer Conceptualization and Representation of Evolution

The biological world is not easily carved up at its joints. The use of
species/monophyletic groups as the primary unit of evolutionary
change assumes a strong form of uniformity and continuity in what
evolves. LGT is but one of many processes that transgress these
frontiers; it serves us as one indicator that this assumption does not
always obtain. Speciation patterns are of course patterns of increased discontinuity. But various indicators suggest that many
processes distinct from lineage splitting yield clumping patterns
(7–11, 13, 16, 43, 68); such patterns are found at many levels
(from infracellular to supraspecific) in evolution. Thus, evolutionists
need to study the dynamics of the many sorts of clumping and splitting
that occur in evolution, far beyond those provided in standard
genealogical studies (Fig. 1).
A first step toward a broader conceptualization and representation of evolution consists in recognizing that evolution by natural
selection is not necessarily a linear transformation within a lineage;
it often involves the intersection of many processes across many different types of entities. Thus, LGT and recombination cause differential rates of recombination in various regions of prokaryotic and
eukaryotic genomes. For example, in prokaryotes, gene evolution
varies between genomic islands and the rest of the chromosome.
Recent data indicate that environmental Vibrio differentiate rapidly
into endemic subpopulations by tapping into a local gene pool as
they acquire and express new local gene cassettes by LGT in their integrons (105). However, most of their gene content
outside the integron remains unchanged. Thus, a gene's occurrence in the chromosome of a Vibrio is not a sufficient indicator of whether it will be conserved or recombined; the canalization that stabilizes Vibrio chromosomes intersects with another process, namely the mechanistic processes that yield a higher rate of recombination between integron gene cassettes and a local environmental pool of integrons than between bacterial chromosomes. Processes affecting organisms at a higher level
of organization also intersect with the genealogical canalization.
Bacteria living in dynamic and genetically diverse environments,
with many partners, typically have larger pangenomes than obligate
intracellular pathogens (58).
In such contexts, the concept of a coalition may be more useful
than that of a species or monophyletic group. This concept enables us
to focus on functionally related units that swap functions and sometimes parts (e.g., segments of DNA) within or across communities
and populations. Metazoan species are coalitions, for the functional relations that count for building a coalition include reproductive relations; but for many biological systems a more fluid category
than species is needed to reflect how evolutionary change occurs. We
distinguish two kinds of coalitions, depending on the type of material
that is swapped. In type 1 coalitions, some of the swapped material is
DNA; therefore, members of a given coalition can be seen as "friends with genetic benefits." For example, cyanobacteria and cyanophages
sometimes form such a coalition. The genes encoding the photosystem-II (PSII) or the photosystem-I (PSI) reaction center have been
found in many cyanophage genomes, and some phages, like plants
and cyanobacteria, even contain both PSII and PSI genes and NADH
dehydrogenase genes. As these viruses infect their cyanobacterial

4 Philosophy and Evolution: Minding the Gap Between Evolutionary Patterns. . .

89

host, they can use different options to maximize their survival and that
of their host by enhancing either cyanobacterial photosynthesis or
ATP production (69). Similarly, phylogenetically heterogeneous
communities known as gut microbiomes, comprising archaea and
bacteria, converge in their repertoires of carbohydrate-active enzymes
to adapt to shared challenges, in large part thanks to LGT mediated
by mobile elements rather than gene family expansion (70).
Gut microbiomes of metazoans are full of "friends with genetic benefits." Last but not least, although the chimeric nature of many eukaryotic genomes is often underappreciated in deep eukaryotic
phylogenetics, type 1 coalitions can also be observed in eukaryotes.
Using the diatoms as an example, Moustafa et al. (71) found that 16%
of the P. tricornutum nuclear genes may have green algal origins (72).
Ignoring the probability that additional genes have been contributed to the genome over time in a nonvertical manner, this means at least one in six of this diatom's genes could be expected to produce a
phylogenetic signal at odds with vertically inherited genes due to
endosymbioses followed by gene transfer to the host nucleus.
On the other hand, tight functional interactions between phylogenetically unrelated partners in symbioses, consortia, etc. can also
occur with few if any gene exchanges. We refer to functionally
related units with a shared evolutionary fate in which no genetic
material is swapped between communities and populations as type
2 coalitions. Many biologists might find that evolutionary studies
of type 2 coalitions do not require new models of evolution that
go beyond the TOL. However, consideration of these type 2 coalitions shows that the evolutionary fate of various subgroups depends on what others (often, members of other species or other types of partner) in the community do, a phenomenon that cannot be represented with a genealogical tree
alone. Consider the oft-studied Vibrio fischeri–Hawaiian bobtail squid interaction, where bioluminescence of the squid allows it to avoid predators. Bioluminescence is generated by quorum sensing of the bacteria in the constrained environment (i.e., high-density conditions) of the squid's mantle that they colonize. The fitness
gain from bioluminescence is not obvious for the Vibrio sans
symbiosis and the squid alone cannot generate light, but as a
coalition they allow for novel adaptations for both the squid and
the Vibrio. To put things a bit simply: Vibrio do not need to glow,
and squids cannot glow, but they have coevolved the adaptations
of bioluminescence and those required for their cooperative behaviors. This illustrates our claim that we should not expect EP to
match GP, since it is the ecological interaction that allows for these
adaptations to occur, not the genealogical confinement alone (73).
Many cases of genuine coevolution (74), e.g., between pollinators
and plants or hosts and parasites, support this same conclusion.
Cases of type 2 coalitions are also well known in prokaryotes.
An example is the interspecific associations of anaerobic methane-oxidizing archaea (ANME) and sulfate-reducing
bacteria (Desulfosarcina, Desulfobulbaceae, Desulfobacteriaceae,
Betaproteobacteria, and Alphaproteobacteria) (75). These consortia, in which the archaeal member oxidizes methane and
shuttles reduced compounds to the sulfate-reducing bacteria,
are globally distributed. This metabolic cooperation enables
the partners to thrive on low-energy carbon sources, which
neither partner could utilize on its own (40). Together,
ANME–sulfate reducer coalitions are estimated to be responsible for more than 80% of the consumption of methane in the
oceans. Another obvious microbial coalition, Chlorochromatium aggregatum, an interspecific phototrophic consortium
with worldwide distribution, may constitute as much as 2/3
of bacterial biomass at the oxic/anoxic interface in stratified
lakes (60). These are tight associations of green sulfur bacterial
epibionts which surround a central, motile, chemotrophic
bacterium. The epibionts act as light sensors and control the
carbon uptake of the central bacterium, which confers motility
to the consortium, assuring that the coalition occupies a niche
in which it will grow (76). The cell division of these bacterial
partners is highly coordinated and it was estimated by proteomics and transcriptomics that 352 genes are likely to be
involved in sustaining the coalition (77). Many intricate cases
of mutualism and commensalism display similar emergent adaptations in type 2 coalitions. Importantly, such emergent adaptations have more than one genealogical origin, and hence require
other models to be thoroughly analyzed.
More precisely, a second step in proposing new models of evolution
rests on the recognition that the interactions between many
processes and entities are structured, and that their frequent intersections should be modeled carefully. After all, this is exactly why
the populational approach was adopted in preference to a typological approach: pre-Darwinian concepts treated species as fixed
types with fixed characteristics. Transformist theories forced biologists to think about species as malleable. Mayr devised the nondimensional Biological Species concept (BSC) as part of his
effort to reconcile an established biological category, species,
which had implied stable properties from Aristotle to Linnaeus,
with a view of evolution hinted at by Darwin and developed in
population genetics, that species are metapopulations of populations of genealogically related individuals with diverse traits.
Because of the shuffling of individuals and the impact of selection,
the frequency of traits within populations changed through time;
the BSC picks out the suprapopulational entity composed of all
potentially interbreeding individuals as of a given time or short
stretch of time. Although it has no essential properties, it has a
separate evolutionary fate because of the limitations on interbreeding with members of other species. The subpopulation trajectories determine the distribution of attributes within populations and
therefore within the species, thus ultimately affecting its fate. But,
moving beyond Mayr's development of the BSC, one needs to
realize that such intersections go beyond the ebb and flow of
populational mixings. Populational approaches implicitly adopt a
network approach in that individuals and subpopulations exchange
genes in ways that are spatially determined. Take a population of
deer. Their spatial distribution determines which ones can reproductively interact with which others. Ecological constraints
(mountain range, rivers, etc.) determine the placement of nodes,
i.e., of bottlenecks delimiting subpopulations within which gene
change occurs. Real populations have a clustered topology. This is
often abstracted away in population models, but it is a fact that
should remain in the forefront of our understanding of the processes involved (see, for instance, Sewall Wright's shifting balance
theory). To fully account for this natural clustered topology, evolutionists should provide better accounts of the motley crew of types
of partners and the very diverse class of types of interactions
between partners (41).
For convenience, the evolutionarily significant interactions can
be classified as genetic, structural, and functional. The first type of
interaction is most prevalent in monophyletic groups of metazoa
(which has led many to assume that EP and GP are the same thing).
Nonetheless, one should not be surprised to find genuine functional interactions among nonrelated groups that lead to adaptive
change, as observed in microbial evolutionary studies. Such findings force us to broaden our understanding of what to count as an
efficacious partner in a coalition. The two prokaryotic coalitions
(ANMEsulfate reducer and Chlorochromatium aggregatum)
described above clearly associate organisms that are phylogenetically
distant but nonetheless bona fide functional partners. And they are
not exceptional. There are many cases of communal evolution with
traits that GPs cannot properly describe because they involve both
distinct phylogenetic microbial lineages and mobile elements. These
are reported with increasing frequency in the metagenomic literature,
and strongly supported by molecular data (see Fig. 2). For such
communities, evolution is often coevolution, and functional, structural, and genetic interactions matter. Such coalitions cannot be
neglected. For instance, type 1 coalitions of cyanobacteria and cyanophages play a central role in marine photosynthesis, global carbon
cycle, and the world oxygen supply. Type 2 coalitions, such as the one
observed between Glomerales and 60–80% of the land plants for at
least 460 million years (78–81), positively affected plant performance, nutrient mobilization from soil minerals, fixation of atmospheric nitrogen, and protection of plants against root pathogens,
and thus determined many aspects of community and ecosystem
functioning. Overall, the impact of coalitions (be they genetic or
not) should make communal interactions (and their resulting

[Fig. 2 bar chart: Y-axis, "% in the dataset" (0–30%); X-axis, functional categories A through S; black bars, plasmids; white bars, phages.]
Fig. 2. Distribution of genes of various functional categories in genomes of mobile elements. All functional categories of
genes, except genes of nuclear structure, can be found in mobile elements, many of which should benefit communal
evolution since expression of genes with cellular functions increases the fitness of cells containing the mobile elements,
which, in turn, increases the likelihood of the mobile elements being carried forward to the next cellular generation. Bars
for plasmids are in black; bars for phages are in white. The X-axis corresponds to the functional categories defined by
clusters of orthologous groups (COGs) (100). The Y-axis indicates the percentage of occurrences of these categories in an
unpublished data set of 148,864 plasmids and 79,413 phage sequences, annotated using RAMMCAP (101). Functional
categories are sorted as follows: (1) Information storage and processing; A: RNA processing and modification; B: chromatin
structure and dynamics; J: translation; K: transcription; L: replication and repair; (2) cellular processes; D: cell cycle control
and mitosis; Y: nuclear structure; V: defense mechanisms; T: signal transduction; M: cell wall/membrane/envelope
biogenesis; N: cell motility; Z: cytoskeleton; W: extracellular structures; U: intracellular trafficking, secretion, and vesicular
transport; O: posttranslational modification, protein turnover, and chaperone functions; (3) metabolism; C: energy
production and conversion; E: amino acid metabolism and transport; F: nucleotide metabolism and transport;
G: carbohydrate metabolism and transport; H: coenzyme metabolism and transport; I: lipid metabolism and transport;
P: inorganic ion transport and metabolism; Q: secondary metabolites biosynthesis, transport, and catabolism; (4) poorly
characterized; R: general functional prediction only; S: function unknown.

emergent evolutionary properties) essential features of evolutionary models, narratives, and explanations, besides monophyletic groups.
Finally, a third step to improve our model of evolution is
to acknowledge that these coalitions evolved in ecosystems. Odenbaugh (82) offers a detailed analysis of the concepts of community
and ecosystem, most helpful to understand the latter. A community
corresponds to the assemblage of most or all interacting species
(populations) in a given area, ecological niche, or environment.

4 Philosophy and Evolution: Minding the Gap Between Evolutionary Patterns. . .

93

Communities are defined solely by the biotic entities that they include. Some think communities need to be functionally
integrated (83), but this view is arguably the minority view in
contemporary ecology. An ecosystem corresponds to the functional
assemblage of all communities as well as their abiotic (physical,
chemical, geological, climatic) environment. Tansley (84) offered
an early defense of such a view, according to which community is
best considered a populational term focusing on the demographic
distribution of the biotic individuals in a given context (e.g.,
predator–prey population interactions), whereas ecosystem is a
functional term focusing on the functional integration between biotic
and abiotic subsystems in a given context. The possibility that whole
ecosystems can be said to evolve has recently been gaining some
traction (85, 86). But even if one rejects that possibility, the ecosystem perspective improves on the evolutionary models of a purely
populational-community perspective by highlighting functional integration and natural clustered topology over shared genealogical
history.
To sum up, many sorts of processes and types of entities that
intersect during evolution should have at least three consequences
for evolutionary models and methods. First, understanding evolution should often mean understanding coalitions. Second, understanding coalitions requires understanding the functional, genetic,
and material interchanges that structure communal interactions
among partners. Third, the interchanges underlying communal
interactions in coalitions are better understood by considering the
ecosystems in which evolution occurs. According to this point of
view, a more complete representation of (prokaryotic) evolution
corresponds to a dynamic topology (Fig. 3) rather than a TOL,
tracking only the genealogical relationships. The various -omics are
very good ways to define additional edges in dynamic evolutionary
networks, as they capture aspects of these diverse relationships
between evolving entities. Phylogenomics provides a phylogenetic
distance between genes, genomes, and other operational taxonomic units (OTUs) of interest (e.g., these units may correspond
to terminal taxa of a phylogenetic tree, such as species, genera,
individuals, etc., and to any biotic nodes in the network). Comparative genomics produces estimates (e.g., percentages of identity,
average nucleotide identity distances (87), etc.) based on the DNA
shared between genomes and OTUs. It also provides physical distances between genes (e.g., by measuring their physical distance on
chromosomes and organelles). Transcriptomics proposes coexpression matrices for genes, which can serve as bases for distances of
genetic coregulation, within cells and within environments; similarly, proteomics provides measures of the physical and functional
interactions of proteins within cells and within environments.
Last but not least, metagenomics leads to identification of genetic
partnerships (and incompatibilities) between and within


[Figure 3 appears here: (a) a schematic dynamic evolutionary network linking a squid genome and two Vibrio genomes through genetic and functional edges within an ecological organism; (b) highways of LGT exchanges deduced from a phylogenetic forest.]

Fig. 3. Theoretical scheme of a dynamic evolutionary network and real polarized network of genetic partnerships between
Archaea and Bacteria. (a) Nodes are apparent entities that can be selected during evolution. Various -omics help determine
the various edges in such network in order to describe covariation of fitness between nodes. Note that nodes can contain
other nodes (nodes are multilevel). Smaller grey nodes are genes. Some of these genes have phylogenetic affinities
indicated by long, dashed black edges, and others connected by plain thin edges are coexpressed. Collectively, some of
these gene associations define larger units (here, the two Vibrio genomes or ecological organisms, like the Vibrio–squid
emergent ecological individual). Some of these genes and genomes interact functionally with the products of other genes
and other genomes defining coalitions (dashed grey lines). In many coalitions, the interaction between partners may be
transient, ephemeral, and not the result of a long coevolution, yet the adaptations they display still deserve evolutionary
analysis. Thus, edge length corresponds to the temporal stability of the association (closer nodes are in a more stable
relationship over time). (b) Network adapted from ref. 47 computed from gene trees, including only Archaea and a single
bacterial OTU in a phylogenetic forest of 6,901 gene trees with 59 species of Archaea and 41 species of Bacteria.
The isolated bacterial OTU (that can differ in different trees) is odd, since the rest of the tree comprises only archaeal
lineages. For this reason, the single odd taxon is called an intruder (47). Archaea are represented by squares, and Bacteria
are represented by circles. Edges are colored based on the lifestyle distance between the pairs of partners, from 0 (darkest
edges, same lifestyle) to 4 (lightest edges, 50% similar lifestyle). The largest lifestyle distance in that analysis was 8, so
the organisms with the greatest numbers of LGTs all had close to moderately distant lifestyles. Edge length is inversely
proportional to the number of transferred genes: the greater the number of shared genes between distantly related
organisms, the shorter the edge on the graph. The networks are polarized by arrows pointing from donors to hosts, here
showing LGT from Archaea to Bacteria.

environmental genes, populations, etc. The important claim here is
that if evolutionists intend to do so, they can represent coalitions,
functional integration, and natural topologies along with genealogy
in evolutionary studies.

4. Exploiting Dynamic Evolutionary Networks

When evolutionists reconstruct the dynamic evolutionary networks
described above, they face a plethora of relations between biotic
entities rather than a simple unitary TOL. The patterns of evolution
also reflect the impact of a wide range of disparate processes that link
together the fates of entities at different levels, with varying degrees


and kinds of connection to each other. Note that even though the
examples described above mainly concern the evolution of organisms,
the biotic entities entering coalitions, partnerships, and ecosystems
can be of many types, e.g., genes, operons, plasmids, genomes, organisms, coalitions, communities, etc. Whereas multilevel selection is
usually focused on the very different levels at each of which entities
of the same type interact (i.e., genes with genes, cells with cells,
organisms with organisms, etc.), a coalition approach is open to the
possibility that entities at different levels of organization can and do
interact. The Vibrio–squid symbiosis is such an example, where a
single organism interacts not with one individual organism but with
a group of individuals (i.e., a bacterial colony). Gut flora in many
metazoa has a similar profile: in those cases, an individual organism
interacts with a community of different microbial species. However, a
network-based representation of this complexity raises serious conceptual and practical questions. How could evolutionists make sense
of such dynamic evolutionary networks (except by reconstructing a
TOL) (13, 17, 88)? It is one thing to claim that whole ecosystems qua
ecosystems can evolve; it is another to try to model interactions, where
the monophyletic groups that are functional parts of those ecosystems
are not the only relevant units that one needs to model to track
evolutionary change. In the dynamic evolutionary networks
approach, it is an open question: Which units of evolution deserve
tracking and which explanatory units should be used in models?
To answer such questions, we need to think about the relation
between units of evolution (i.e., what actually evolves in response
to natural selection) and units of explanation (i.e., the conceptual
objects that should be used to model this change). In the GP
approach, it was largely assumed that representations of the changes
in the evolutionary units of the TOL were sufficient to provide the
explanatory units of evolutionary explanations. Monophyletic
genealogical relationships served both as evolutionary and explanatory units. We, like many others, have argued that while this representation may be appropriate for the evolution of some
monophyletic groups (especially monophyletic groups of eukaryotes), it is woefully inadequate for many microbes and is ruled
out by definition in the evolution of more complex biological
arrangements that we called coalitions (19, 41, 73). Let us now
see how other additional units of evolution and units of explanation
play out in this coalition world.
4.1. Searching for Clusters in Networks

Since we do not wish to rule out any type of organization as
possibly being a coalition or a member of a coalition, we suggest
investigating clusters in our topologies as a first way to
identify coalitions (9, 11, 89). See Box 1 for a description of how
such genetic networks are reconstructed with sequence data and
the ways by which they are dynamically maintained. Our working
hypothesis is that we will be able to identify and track coalitions. We


have shown that clusters in networks, for instance in genome networks, are areas where nodes show a greater number of connections
among themselves than with the other nodes of the graph. We
expect to demonstrate that such patterns might be the result of
evolution, as we explain below.
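The working definition of a cluster used here, a set of nodes more densely connected among themselves than to the rest of the graph, can be made concrete with a small sketch. This is only an illustration, not the authors' actual method: the toy genome network, its node names, and the simple density comparison are all invented for the example.

```python
from itertools import combinations

def internal_density(graph, nodes):
    """Fraction of possible within-cluster edges that are present."""
    nodes = set(nodes)
    pairs = list(combinations(sorted(nodes), 2))
    present = sum(1 for u, v in pairs if v in graph[u])
    return present / len(pairs) if pairs else 0.0

def external_density(graph, nodes):
    """Fraction of possible cluster-to-outside edges that are present."""
    nodes = set(nodes)
    outside = set(graph) - nodes
    possible = len(nodes) * len(outside)
    present = sum(1 for u in nodes for v in graph[u] if v in outside)
    return present / possible if possible else 0.0

def is_candidate_coalition(graph, nodes):
    """Nodes connect more among themselves than to the rest of the graph."""
    return internal_density(graph, nodes) > external_density(graph, nodes)

# Toy undirected shared-gene network as adjacency sets: two bacteria and
# a plasmid share many DNA families; a phage is only loosely attached.
toy = {
    "bact1":    {"bact2", "plasmid1"},
    "bact2":    {"bact1", "plasmid1"},
    "plasmid1": {"bact1", "bact2", "phage1"},
    "phage1":   {"plasmid1"},
    "phage2":   set(),
}

print(is_candidate_coalition(toy, {"bact1", "bact2", "plasmid1"}))  # True
```

Real analyses would use proper community-detection methods on much larger graphs; the point of the sketch is only the density contrast that makes a group of nodes a candidate coalition.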
But first, let us stress that looking for such clusters is consistent
with the natural inclination of biologists to favor significant groupings of phenomena. In tree pattern analysis, the search for clusters is
also central, and it has translated into the classic problems of ranking
and grouping (90). The problem of grouping has been solved by
privileging a single unified type of relation, namely, the genealogical
relation exhibited by nodes. This allowed objective pairs of nodes
shown to share a last common ancestor in a data set to be grouped
together and shown to be distally related. Ranking (e.g., the decision to classify a genealogical group as a species instead of a genus, an
order, etc.) was never truly solved and remains largely arbitrary
(91). This point was explicitly made by Darwin himself in Chapter
1 of the Origin. It is, therefore, somewhat ironic that evolutionary
explanations have reified clusters as real encapsulated (bounded)
evolutionary units by privileging genealogical relations. That is,
evolutionary explanations have treated evolutionary clusters as if
they were stable unitary units impervious to interference from other
clusters, apart from the change in the selective environment caused
by changes in the abiotic environment and the changes that any one
group causes in the other groups with which it interacts. Genealogical explanations have given absolute ontological priority to genealogical change of a certain type and been blind to other natural
processes that have deep consequences in the process of adaptation.
It behooves us to look at the neglected branches created by LGT,
hybridization, and other means of genetic exchange, coevolution,
and reticulation between branches in order to reexamine the adequacy of models that focus exclusively on well-compartmentalized
(i.e., modular) monophyletic groups. By looking at these usual
outliers in shared gene networks for instance, we identify new
clusters, some of which, we argue, are created and maintained by
selective pressures and evolutionary processes. Figure 4 illustrates
how clusters of partners of different types (e.g., clusters of bacteria
and plasmids, bacteria and phages, plasmids and phages) can unravel
the presence of groups of entities affected by processes of conjugation, transduction, and/or recombination, respectively. These entities are candidate friends with genetic benefits.
Importantly, as the ecosystems approach to microbial evolution
has taught us, the networks representing evolutionary dynamics
should not be purely genealogical; they should also be structural
and functional. Ecosystems involve both biotic and abiotic processes. Abiotic processes do not have genealogies (after all, they
are not genetic systems) and the arrangements of species in communities can be initiated or reorganized in ways that do not reflect

[Figure 4 appears here: (a) a connected component of a shared genome network; (b) three separate genetic worlds of phages and plasmids, with edges labeled as conjugation events, transduction events, recombination between plasmids, and exchanges between phages and plasmids.]

Fig. 4. Remarkable patterns and processes in shared genome networks. (a) Schematic diagram of a connected component,
showing a candidate coalition of friends with genetic benefits, where each node represents a genome, cellular (white for
bacterial chromosome), plasmidic (grey ), or phage (black ). Data are real and were kindly provided by S. Halary and P. Lopez (9).
Two nodes are connected by an edge if they share homologous DNA (reciprocal best BLAST hit with an E-value of at most 1e-20,
and 100% minimum identity). Edges are weighted by the number of shared DNA families. The layout was produced by
Cytoscape using an edge-weighted spring-embedded model, meaning that genomes sharing more DNA families are closer on
the display (102). Clusters of bacteria and plasmids suggest events of conjugation; clusters of bacteria and phages suggest
events of transduction; clusters of phages and plasmids suggest exchange of DNA between classes of mobile elements, etc.
(b) Three connected components corresponding to three genetic worlds, defined by displaying connections between genomes
(same color code) for a reciprocal best BLAST hit with an E-value of at most 1e-20, and a minimum of 20% identity. Their three
gene pools are absolutely distinct, which suggests that some mechanisms and barriers structure the genetic diversity and the
genetic evolution outside the TOL. These real data were also kindly provided by S. Halary and P. Lopez (9).

or require deep evolutionary histories. Increasingly comprehensive
pattern analyses of ecosystems then require an increasing number of
types of edges and types of nodes as compared to the genome
network of Fig. 4. Some of the edges (those involved in abiotic
processes) are of a physico-chemical nature (92) while others may
(but will not necessarily) track more traditional biological relationships. Given the seemingly incommensurable nature of the possible
types of relationships, it may appear that clustering in salient units
becomes incredibly arduous.
Yet, the fact that analyses of comprehensive evolutionary
networks are difficult does not mean they are impossible or useless.
It merely relativizes the import of the conclusions that evolutionists
may draw from their attempts at clustering vastly heterogeneous
networks. If nature is not neatly cut at the joints, we should be
suspicious of any overly simple model (e.g., a TOL) that assumes
such simplicity. A pluralistic approach to clustering seems necessary
to track the complex, messy, and sometimes transient nature of
evolutionary dynamics. The work of an evolutionary modeler


goes from tracking simple monophyletic groups (which we now
know do not yield the universal history that they were expected to
for most of the twentieth century) to analyzing the possible ways in
which structural constraints and functional possibilities interact
with hereditary systems in selective environments. It is not that
genealogy is insignificant, but rather that it becomes one tool
(among others) to track evolutionary change.
But how are evolutionists to identify the relevant interesting
explanatory clusters? This chapter is an initial salvo in a broad project
to reconceptualize evolution by natural selection. To describe the
dynamics of the changes in both units and relationships, evolutionists need to think about how the evolution of the processes translates
into changes in the topology of dynamic evolutionary networks.
Figure 4 is but the tip of the iceberg of interesting EPs that demand
to be accommodated in our models. We know, for instance, how
processes of conjugation and transduction translate into a topology
of shared gene networks, as they generate remarkable clusters of
bacteria and plasmids on the one hand and of bacteria and phages on
the other hand along lines suggested schematically in Fig. 4 (9, 11,
13, 16). Evolutionists need to learn how these and other processes
translate into even more comprehensive dynamic evolutionary networks that include biotic and abiotic components.
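How gene sharing translates into the topology of a shared gene network can be sketched at the level of network construction. The following is a minimal sketch, assuming that each detected homologous DNA family has already been reduced to the set of genomes carrying a member of it; the family table and genome names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

def shared_gene_network(families, min_shared=1):
    """Build a weighted genome network from a table mapping each
    homologous DNA family to the set of genomes in which it occurs.
    Returns {(genome_a, genome_b): number_of_shared_families}."""
    weights = defaultdict(int)
    for genomes in families.values():
        # Every pair of genomes carrying the same family gains one
        # shared-family count on its edge.
        for u, v in combinations(sorted(genomes), 2):
            weights[(u, v)] += 1
    return {edge: w for edge, w in weights.items() if w >= min_shared}

# Invented table: a plasmid bridges two bacterial chromosomes (a
# conjugation-like signature) and also shares one family with a phage
# (a mobile-element exchange signature), as in the clusters of Fig. 4.
families = {
    "fam1": {"bact1", "bact2", "plasmid1"},
    "fam2": {"bact1", "plasmid1"},
    "fam3": {"plasmid1", "phage1"},
}

net = shared_gene_network(families)
print(net[("bact1", "plasmid1")])  # 2 shared families -> edge weight 2
```

Clusters in such a graph, like the bacteria–plasmid triangle above, are the kind of pattern the text identifies as candidate friends with genetic benefits.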
4.2. Searching for Correlations in Networks

Our second suggestion for identifying units that could play a
significant role in evolutionary explanations is to display and
compare multiple networks, including the same objects but
connected according to different rules (e.g., functional similarity,
genetic similarity, physical interactions, etc.), to look for their
common features. This approach is also consistent with scientific
practice (see, for instance, the ongoing National Geographic-sponsored Genographic project that studies human evolution by
searching for correlations between molecular analyses and nonmolecular analyses of diverse traits that can be fairly well tracked
(such as similarities of single-nucleotide polymorphisms (SNPs) in
genomes, disease susceptibilities, gut flora, linguistic patterns, and
ecological neighbors)).
Importantly, the richness and great diversity of the biological
world has always been perceived as a significant methodological
research opportunity as well as a genuine problem. As Hennig has
rightly pointed out,
each organism may be conceived as a member of the totality of all
organisms in a great variety of ways, depending on whether this
totality is investigated as a living community, as a community of
descent, as the bearer of the physiological characters of life, as a
chorologically differentiated unit, or in still other ways. The classification of organisms or specific groups of organisms as parasites,
saprophytes, blood suckers, predators, carnivores, phytophages, etc.;
into lung-, trachea-, or gill-breathers, etc.; into diggers of the digging


wasp type, mole type, and earthworm type; into homoiothermous
or poikilothermous; into inhabitants of the Palearctic, Neotropical,
Ethiopian regions, etc. are partial pieces of such systematic presentations that have been carried out for different dimensions of the
multidimensional multiplicity (1).

However, for Hennig and the many evolutionists that his thinking influenced, this multiplicity was in part reducible, since one
dimension (the genealogical) provided the best proxy for all the
others. As Hennig put it: making the phylogenetic system the
general reference system for special systematics has the inestimable
advantage that the relations to all other conceivable biological systems can be most easily represented through it. This is because the
historical development of organisms must necessarily be reflected in
some way in all relationships between organisms. Consequently,
direct relations extend from the phylogenetic system to all other
possible systems, whereas there are often no such direct relations
between these other systems (1). However, the -omic disciplines
reveal that the number of processes, interactions, systems, and
relationships affecting evolution (and the various entities that
are, in fact, units of evolution) are astonishingly more diverse than
Hennig (and for that matter, Darwin) recognized. Phylogenomics
also provides a strong case that the TOL is a poor proxy for all the
features of biodiversity (93), as it would explain only the history of
1% of the genes in a complete tree for prokaryotes (12) or of about
10–15% at the level of bacterial phyla (94, 95), and, by definition,
none of the emergent and communal microbial properties. Likewise, some functional analyses of metagenomic data show that the
functional signal is, in some cases, stronger than the genealogical
signal in portions of the genome, showing that the presence of
genetic material with a given function matters more than the presence of a given genealogical lineage in some ecosystems (90).
Thus, the claim that one system has precedence over the others
deserves empirical reassessment. We maintain that such reassessment has the potential to unravel important hidden correlations in the
relationships between evolving entities, overlooked thus far when
they were not consistent with the genealogy.
Network approaches (in contrast to branching genealogical
representations) are precisely the right tool to use for this purpose;
they are better suited to the evolutionary modeling needed here in
that they are agnostic about the structure of the relevant topologies.
Network-based studies can easily represent the multiplicity of relationships discovered by -omics approaches, and test whether,
indeed, one system (i.e., one of the networks) is a better proxy
than the others. In fact, all sorts of relationships between evolving
entities can be represented on these graphs. Proteomics allows one
to draw connections based on proteinprotein interaction and functional associations. Metagenomics proposes environmental and
functional connections. Correlation studies between multiple

[Figure 5 appears here: for each of two organisms/environments i and j, four networks over the same genes: phylogenetic, functional, physical, and regulatory.]
Fig. 5. Schematic correlations between -omics networks. Each node corresponds to one individual gene. Four networks
illustrate the relationships inferred by -omics for these genes: black edges between nodes indicate the shortest distances
in terms of phylogenetics, functional interaction, physical distance, and regulatory distances for these genes. The question
whether one of these networks is a better proxy for all the others (within an organism or an environment or between
organisms or environments) is an open (empirical) question. Shaded edges indicate paths that are identical between more
than two networks of a single organism; bold edges indicate paths that are identical between comparable networks of
distinct organisms. For instance, in this graph, a cluster of three interconnected genes showed functional, physical, and
regulatory coherence both in organisms/environments i and j. However, this pattern was not captured by their phylogenetic
affinities in gene trees.

networks reconstructed for the same objects (e.g., thousands of
genes) by using different rules with respect to connections should
expose, without preconceptions, which networks (e.g., functional,
regulatory, genetic) and parts of networks can be placed in direct
relation to each other. Evolutionary studies can then examine the
shared connections (paths, edges, modules) present in these networks (Figs. 5 and 6), e.g., to identify units that are worthy of note
for their shared functional, structural, and genetic features and for
the possibility that these are the result of evolutionarily significant
interactions.
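The comparison of networks built over the same objects under different connection rules reduces, in its simplest form, to edge-set operations. The sketch below is purely illustrative: the three toy networks, the gene names, and the use of a Jaccard index as the correlation measure are our assumptions, not a method from the chapter.

```python
def shared_edges(*nets):
    """Edges present in every network; each network is a set of
    frozensets {u, v} representing undirected edges over the same genes."""
    common = nets[0]
    for net in nets[1:]:
        common = common & net
    return common

def edge_jaccard(net_a, net_b):
    """Jaccard similarity of two edge sets: 1.0 means the two -omics
    relations connect exactly the same gene pairs, 0.0 means none agree."""
    union = net_a | net_b
    return len(net_a & net_b) / len(union) if union else 1.0

edge = frozenset  # shorthand for an undirected edge

# Three invented networks over the same genes, one per connection rule
# (cf. Fig. 5: phylogenetic, functional, regulatory).
phylogenetic = {edge({"g1", "g2"}), edge({"g2", "g3"})}
functional   = {edge({"g1", "g2"}), edge({"g1", "g3"}), edge({"g2", "g3"})}
regulatory   = {edge({"g1", "g2"})}

print(edge_jaccard(phylogenetic, functional))  # 2 shared edges / 3 total
print(shared_edges(phylogenetic, functional, regulatory))
```

A high Jaccard value between two -omics networks would suggest that one relation is a reasonable proxy for the other; edges shared across all networks mark the kind of multiply supported unit discussed in the text.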
Correlation analyses of this sort have in fact already been initiated
for organisms for which metabolic networks, proteinprotein
interaction networks, and phylogenetic information are available.
For instance, Cotton and McInerney (45) recently showed that the
phylogenetic origin of eukaryotic genes (e.g., from archaea or from
bacteria) is correlated with the centrality of these genes in metabolic
network (e.g., genes of archaeal origin occupy less terminal positions
in yeast metabolic network). This result suggests that eukaryotes
evolved as bits of bacterial metabolisms were added to a backbone


Fig. 6. Functional networks of shared genes for plasmids, phages, and prokaryotes. Four functional genome networks,
including 2,209 genomes of plasmids, 3,477 genomes of phages, and 116 prokaryotic chromosomes (from the same data
set as Fig. 2), were reconstructed by displaying only edges that correspond to the sharing of genetic material involved in
each of these functions on a separated graph. Here, we only showed the giant connected components of four functional
genomes network: (a) for J: translation, (b for c) energy production, (c) for T: signal transduction, and (d) for U: intracellular
trafficking. Bacterial genomes are in black, archaeal genomes in white, plasmids in light grey, and phages in dark grey. It is
clear that these functional networks are quite different because the histories of the genes coding for these functions were
distinct. However, some local correspondence can be found between the GCC of these functional graphs, suggesting that
some functional categories underwent the same evolutionary history in some groups of genomes, sometimes consistently
with the taxonomy (e.g., translation and energy production in bacteria and archaea), sometimes not. The layout was
produced by Cytoscape (102).

of archaeal pathways. Also, Dilthey and Lercher characterized spatially and metabolically coherent clusters of genes in gamma-proteobacteria. Though these genes share connections in spatial and
metabolic networks, they present multiple inconsistent phylogenetic
origins with the rest of the genes of the genomes hosting them. This
lack of correlation between the genealogical affinities of genes otherwise displaying remarkable shared connections in their spatial and
functional interactions suggests that analyses of correlations in these
particular networks could be used to predict LGT of groups of tightly
associated genes (Dilthey and Lercher, in prep.). Here, additional
evolutionary units (gene coalitions), consistent with the selfish
operon theory, could be identified (110).
Our more general point is that if, at some level of evolutionary
analysis, no network is an objectively better proxy for all the others,
local parts of different networks could still show significant


correlations, useful for elaborating evolutionary scenarios (e.g., involving genetic modules, pathway evolution, etc.). Just as Dilthey and
Lercher suggested for clusters of metabolic genes, locally common
paths between physical and functional networks reconstructed for
many organisms could define clusters of genes with physical and
functional interactions that are found in multiple taxa. If the genes
making these clusters are distantly related in terms of phylogeny,
such findings suggest that these genes may have been laterally
transferred, possibly between distantly related members of a type 1
coalition. With further investigation, the physical and functional
associations observed between these genes, in multiple taxa, could
be interpreted as emerging phenotypes owing to LGT.
Correlations between networks based on transcriptomics, proteomics, and metagenomics could also inform evolutionists about the
robustness of coalitions (e.g., the presence of resilient and recurring
edges in various OTUs/coalitions/environments/over time). Think
of a trophic cycle in a given ecosystem. Various species can play the
same functional role, but the cycle remains. A species can be replaced
(via competition, migration, etc.) within a trophic cycle. Representing
this in networks, we would observe that some clusters have changed (a
network focused on genealogical relationships) while others are stable
(those focused on functional properties). The fact that some functional relationships persist longer than some genealogical ones may be
an indication of an evolutionary cluster that cannot be tracked by GP
alone (97), i.e., when the functional composition of a community
remains stable over longer times than the taxonomic composition.
Again, this is typically observed in gut flora: the functional network
and the phylogenetic network are not always well correlated, since the
composition and diversity of microbial populations change within the
gut, even if the microbes keep thriving on a shared gene pool (96).
It would also be observed in natural geochemical cycles (92), which
have the potential to introduce functional, genetic, and environmental
signatures in evolution that might outlive genealogical ones.
Since this search for correlation between networks does not
impose an a priori dominant pattern on biodiversity, it could offer
an improved and finer-grained representation of some aspects of
evolution. In particular, this approach would facilitate the recognition of evolutionary units not revealed in analyses based solely on
monophyletic groupings. The evaluation of the evolutionary
importance of such units cannot properly begin until they are
made into explicit objects of evolutionary study. If significant correlations reveal a pattern worth naming and deserving evolutionary
explanation, they will thus have opened up pathways in the study of
evolutionary origins not accessible in a strictly phylogenetic evolutionary system (Fig. 6).


5. Conclusion
We suggest that in nature coalitions (both friends with genetic
benefits and type 2 coalitions) are an important category of evolving
entities. Developing the tools (e.g., of network analysis) to analyze
the evolutionary impact of the processes into which coalitions enter
and the various roles that coalitions (and their evolutionarily interesting components) play will provide an improved basis for the study
of evolution, one that can include but also go beyond what can be
achieved with TOL-based modeling. We also suggest that modeling
of evolutionary adaptive processes can be significantly improved by
examining the evolutionary dynamics of coalitions, in particular by
including parameters informative about the topology and structure
of the components of the networks classified in various ways, including their evolutionary roots. Such modeling is open to various types
of assortments of partners (whereas GPs focus on the same types of
associations), various durations of association (whereas GPs focus
on the long term relative to organismal scale), and all the degrees of
functional integration (whereas GPs focus almost exclusively on the
maximally integrated associations, such as mitochondria, or on the
shallow associations of coevolution). Because genealogical patterns
and evolutionary patterns are not isomorphic, evolutionists should
not be too strict in maintaining the ontological superiority of genealogical patterns. In genealogical patterns, evolutionists had (rightly
or not) an intuition about what persisted through time: species and
monophyletic groups. This allowed for the changing of parts while
maintaining continuity of some entity (which was assumed to
be what evolution was about). In the broader (and a priori less
constrained) perspective for which we argued, i.e., in ecosystem-oriented evolutionary thinking, what persists through evolution
needs to be pinned down more carefully since monophyletic groups
are not the exclusive units and do not provide all of the ways of
carving out the patterns. In particular, studies of the correlations and
clusters in evolutionary dynamic networks could offer a possible
future alternative approach to complete the TOL perspective.

Box 1
Reconstructing Genome and Gene Networks
The various networks described in this chapter can easily be
reconstructed, for instance using genetic similarities.
For genome networks, a set of protein and/or nucleic acid
sequences from complete genomes must be retrieved from a relevant
database (e.g., the NCBI, http://www.ncbi.nlm.nih.gov/Entrez).
All these sequences are then BLASTed against one another, and for
each pair of sequences the best BLAST hits with an e-value of at
most 1e-20 are stored in a MySQL database. To define homologous
DNA families, sequences must then be clustered, for instance using
a single-linkage algorithm or MCL. With the former approach, a
sequence is added to a cluster if it shares a reciprocal best-BLAST
hit (RBBH) relationship with at least one of the sequences of the
cluster. We call the DNA families so defined clusters of homologous
DNA families (CHDs). A requirement that RBBH pairs share a
minimum sequence identity, in addition to BLAST homology, can
also be imposed when defining the CHDs.
Thus, distinct sets of CHDs can be produced, e.g., for various
identity thresholds (from 100%, to study recent events, down to 20%,
to study events of all evolutionary ages). Based on these sets of
CHDs and their distribution in the genomes, genome networks
can be built to summarize the DNA-sharing relationships between
the genomes under study, as illustrated in Fig. 7. A network
layout can be produced by the Cytoscape software using an edge-weighted spring-embedded model.
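The clustering steps above can be sketched in Python (our illustration, not part of the original protocol; the hit table, genome mapping, and function names are hypothetical — a real pipeline would parse tabular BLAST output and query a database):

```python
from collections import defaultdict

def rbbh_pairs(hits, genome_of):
    """hits: (query, subject, pct_identity, evalue) rows from an
    all-against-all BLAST; genome_of maps each sequence to its genome.
    Returns {frozenset({seq_a, seq_b}): pct_identity} for reciprocal
    best hits between genomes."""
    best = {}  # (query, subject_genome) -> (subject, identity, evalue)
    for q, s, ident, ev in hits:
        if q == s:
            continue
        key = (q, genome_of[s])
        if key not in best or ev < best[key][2]:
            best[key] = (s, ident, ev)
    pairs = {}
    for (q, _), (s, ident, _) in best.items():
        back = best.get((s, genome_of[q]))
        if back is not None and back[0] == q:  # hit is reciprocal
            pairs[frozenset((q, s))] = ident
    return pairs

def chd_clusters(hits, genome_of, min_identity=20.0):
    """Single-linkage clustering into CHDs: a sequence joins a cluster
    if it has an RBBH link, above the identity threshold, to any
    current member (implemented with union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for pair, ident in rbbh_pairs(hits, genome_of).items():
        if ident >= min_identity:
            a, b = tuple(pair)
            parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return list(clusters.values())
```

Raising `min_identity` splits off weakly similar sequences into separate (or no) CHDs, mirroring the identity thresholds discussed above.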
Several different evolutionary gene networks (EGNs) can be
reconstructed to be contrasted with protein-protein interaction
networks or networks of metabolic pathways. For instance,
an EGN based on sequence similarity can be reconstructed in which
each node in the graph corresponds to a sequence. Two nodes
are connected by edges if their sequences show significant similarity, as assessed by BLAST. Hundreds of thousands of DNA
(or protein) sequences can, thus, be all BLASTed against each
other. The results of these BLASTs (the best BLAST scores
between two sequences, their percent of identity, the length over
which they align, etc.) are stored in databases. Groups of homologous sequences are then inferred using clustering algorithms (such
as the single-linkage algorithm). The BLAST score or the percentage of identity between each pair of sequences, or in fact any
evolutionary distance inferred from the comparison of the two
sequences, can then be used to weight the corresponding edges.
Most similar sequences can then be displayed closer on the EGN.
The more permissive the BLAST cutoff (e.g., an e-value of 1e-5
rather than 1e-20), the more inclusive the EGNs. Since not all gene forms resemble one another, however, discontinuous variations structure the graph.
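As a toy illustration of how the cutoff controls inclusiveness (our sketch, not from the chapter; sequence names and hit values are hypothetical):

```python
def build_egn(hits, evalue_cutoff=1e-5):
    """Evolutionary gene network: nodes are sequences; an edge links a
    pair whose BLAST e-value passes the cutoff. The edge weight stores
    the percent identity, so more similar sequences can be displayed
    closer together."""
    edges = {}
    for a, b, ident, evalue in hits:
        if a != b and evalue <= evalue_cutoff:
            key = tuple(sorted((a, b)))
            edges[key] = max(edges.get(key, 0.0), ident)  # keep best hit
    return edges

hits = [("x", "y", 90.0, 1e-40), ("y", "z", 35.0, 1e-8), ("z", "w", 25.0, 1e-3)]
loose = build_egn(hits, evalue_cutoff=1e-5)    # x-y and y-z pass
strict = build_egn(hits, evalue_cutoff=1e-30)  # only x-y passes
```

The permissive cutoff connects three sequences; the strict one keeps only the strongly similar pair, so the graph fragments into more components.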
Finally, clusters in genome and gene networks can be found
by computing modules, using packages for graph analysis, such
as the MCODE 1.3 Cytoscape plugin (default parameters) and
igraph (98), or by modularity maximization (as described in refs.
11 and 99).
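For intuition about modularity maximization (ref. 99), the sketch below (ours; the data are hypothetical) computes Newman's modularity Q for a toy network of two dense clusters bridged by a single edge; the two-module partition scores higher than the trivial one:

```python
def modularity(edges, communities):
    """Newman's modularity Q: the fraction of edges falling within
    communities minus the fraction expected under random rewiring
    that preserves node degrees."""
    m = float(len(edges))
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    label = {n: i for i, c in enumerate(communities) for n in c}
    q = sum(1.0 / m for a, b in edges if label[a] == label[b])
    q -= sum((sum(degree[n] for n in c) / (2.0 * m)) ** 2
             for c in communities)
    return q

# two triangles bridged by one edge
edges = [("ph1", "ph2"), ("ph2", "ph3"), ("ph1", "ph3"),
         ("chr1", "chr2"), ("chr2", "plas1"), ("chr1", "plas1"),
         ("ph3", "chr1")]
two_modules = [{"ph1", "ph2", "ph3"}, {"chr1", "chr2", "plas1"}]
one_module = [{"ph1", "ph2", "ph3", "chr1", "chr2", "plas1"}]
```

Tools such as MCODE and igraph search over partitions for (approximately) maximal Q rather than scoring a fixed one, but the objective is the same.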


[Figure 7 layout: phage (ph1-ph4), plasmid (plas1-plas2), and chromosome (chr1-chr2) sequences pass through BLAST/clustering; a matrix of presence/absence of gene families across the genomes is then displayed as a global network whose connected components include a giant connected component.]
Fig. 7. Illustration for Box 1. Genes found in each type of DNA vehicle and belonging to the same homologous DNA family
are represented by a similar dash. The distribution of DNA families in mobile elements and cellular chromosomes can be
summarized by a presence/absence matrix, which can be used to reconstruct a network. With real data, the network of
genetic diversity is disconnected yet highly structured. It presents multiple connected components.
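The reconstruction summarized in Fig. 7 can be sketched in a few lines of Python (our illustration; genome names and family identifiers are hypothetical): link genomes that share DNA families, then read off the connected components.

```python
from itertools import combinations

def genome_network(presence):
    """presence: {genome: set of homologous DNA family ids}.
    Links two genomes if they share at least one family; the edge
    weight counts the shared families."""
    edges = {}
    for a, b in combinations(sorted(presence), 2):
        shared = presence[a] & presence[b]
        if shared:
            edges[(a, b)] = len(shared)
    return edges

def connected_components(nodes, edges):
    """Read off the connected components of the genome network."""
    neighbors = {n: set() for n in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for n in neighbors:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                stack.extend(neighbors[x] - comp)
        seen |= comp
        components.append(comp)
    return components
```

With real data the resulting network is disconnected yet structured, exactly as the caption describes: several components, one of them typically giant.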

6. Exercises
1. What are the computational steps required to reconstruct a
genome network?
2. Cite four examples of communal evolution.
3. Cite three examples of coalitions.
4. In your opinion, is the genealogical pattern the best proxy for all
evolutionary patterns? What aspects of evolution in particular
cannot be described by a TOL only? Are there aspects of
evolution that can be described by the TOL that cannot be
captured in a network-based approach?
5. Are genes from all functional categories found in the genomes
of mobile elements?

106

E. Bapteste et al.

Acknowledgments
This paper was made possible through a series of meetings funded
by the Leverhulme Trust ("Perspectives on the Tree of Life"),
organized by Maureen O'Malley, whom we want to thank dearly.
We also thank P. Lopez, S. Halary, and K. Schliep for help with
some analyses and figures, and P. Lopez and L. Bittner for critical
discussions.
References
1. Hennig, W. (1966) Phylogenetic systematics. Urbana.
2. Daubin, V., Moran, N.A., and Ochman, H. (2003) Phylogenetics and the cohesion of bacterial genomes. Science 301, 829–832.
3. Galtier, N., and Daubin, V. (2008) Dealing with incongruence in phylogenomic analyses. Philos Trans R Soc Lond B Biol Sci 363, 4023–4029.
4. Ciccarelli, F.D., Doerks, T., von Mering, C., Creevey, C.J., Snel, B., and Bork, P. (2006) Toward automatic reconstruction of a highly resolved tree of life. Science 311, 1283–1287.
5. Kurland, C.G., Canback, B., and Berg, O.G. (2003) Horizontal gene transfer: a critical view. Proc Natl Acad Sci USA 100, 9658–9662.
6. Lawrence, J.G., and Retchless, A.C. (2009) The interplay of homologous recombination and horizontal gene transfer in bacterial speciation. Methods Mol Biol 532, 29–53.
7. Retchless, A.C., and Lawrence, J.G. (2010) Phylogenetic incongruence arising from fragmented speciation in enteric bacteria. Proc Natl Acad Sci USA 107, 11453–11458.
8. Retchless, A.C., and Lawrence, J.G. (2007) Temporal fragmentation of speciation in bacteria. Science 317, 1093–1096.
9. Halary, S., Leigh, J.W., Cheaib, B., Lopez, P., and Bapteste, E. (2010) Network analyses structure genetic diversity in independent genetic worlds. Proc Natl Acad Sci USA 107, 127–132.
10. Brilli, M., Mengoni, A., Fondi, M., Bazzicalupo, M., Lio, P., and Fani, R. (2008) Analysis of plasmid genes by phylogenetic profiling and visualization of homology relationships using Blast2Network. BMC Bioinformatics 9, 551.
11. Dagan, T., Artzy-Randrup, Y., and Martin, W. (2008) Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proc Natl Acad Sci USA 105, 10039–10044.
12. Dagan, T., and Martin, W. (2006) The tree of one percent. Genome Biology 7, 118.
13. Dagan, T., and Martin, W. (2009) Getting a better picture of microbial evolution en route to a network of genomes. Philos Trans R Soc Lond B Biol Sci 364, 2187–2196.
14. Doolittle, W.F., Nesbo, C.L., Bapteste, E., and Zhaxybayeva, O. (2007) Lateral Gene Transfer. In: Evolutionary Genomics and Proteomics. Sinauer.
15. Doolittle, W.F., and Bapteste, E. (2007) Pattern pluralism and the Tree of Life hypothesis. Proc Natl Acad Sci USA 104, 2043–2049.
16. Lima-Mendez, G., Van Helden, J., Toussaint, A., and Leplae, R. (2008) Reticulate representation of evolutionary and functional relationships between phage genomes. Mol Biol Evol 25, 762–777.
17. Ragan, M.A., McInerney, J.O., and Lake, J.A. (2009) The network of life: genome beginnings and evolution. Introduction. Philos Trans R Soc Lond B Biol Sci 364, 2169–2175.
18. Boucher, Y., Douady, C.J., Papke, R.T., Walsh, D.A., Boudreau, M.E., Nesbo, C.L., Case, R.J., and Doolittle, W.F. (2003) Lateral gene transfer and the origins of prokaryotic groups. Annu Rev Genet 37, 283–328.
19. Bapteste, E., O'Malley, M., Beiko, R.G., Ereshefsky, M., Gogarten, J.P., Franklin-Hall, L., Lapointe, F.J., Dupre, J., Dagan, T., Boucher, Y., and Martin, W. (2009) Prokaryotic evolution and the tree of life are two different things. Biology Direct 4, 34.
20. Lopez, P., and Bapteste, E. (2009) Molecular phylogeny: reconstructing the forest. C R Biol 332, 171–182.
21. Brussow, H. (2009) The not so universal tree of life or the place of viruses in the living world. Philos Trans R Soc Lond B Biol Sci 364, 2263–2274.

22. Zhaxybayeva, O., Swithers, K.S., Lapierre, P., Fournier, G.P., Bickhart, D.M., DeBoy, R.T., Nelson, K.E., Nesbo, C.L., Doolittle, W.F., Gogarten, J.P., and Noll, K.M. (2009) On the chimeric nature, thermophilic origin, and phylogenetic placement of the Thermotogales. Proc Natl Acad Sci USA 106, 5865–5870.
23. O'Hara, R.J. (1997) Population thinking and tree thinking in systematics. Zoologica Scripta 26, 323–329.
24. Kuntner, M., and Agnarsson, I. (2006) Are the linnean and phylogenetic nomenclatural systems combinable? Recommendations for biological nomenclature. Syst Biol 55, 774–784.
25. Mayr, E. (1987) The ontological status of species. Biology and Philosophy 2, 145–166.
26. Ghiselin, M.T. (1987) Species concepts, Individuality, and Objectivity. Biology and Philosophy 4, 127–143.
27. Doolittle, W.F., and Zhaxybayeva, O. (2009) On the origin of prokaryotic species. Genome Res 19, 744–756.
28. Tsvetkova, K., Marvaud, J.C., and Lambert, T. (2010) Analysis of the mobilization functions of the vancomycin resistance transposon Tn1549, a member of a new family of conjugative elements. J Bacteriol 192, 702–713.
29. D'Auria, G., Jimenez-Hernandez, N., Peris-Bondia, F., Moya, A., and Latorre, A. (2010) Legionella pneumophila pangenome reveals strain-specific virulence factors. BMC Genomics 11, 181.
30. Barlow, M. (2009) What antimicrobial resistance has taught us about horizontal gene transfer. Methods Mol Biol 532, 397–411.
31. Manson, J.M., Hancock, L.E., and Gilmore, M.S. (2010) Mechanism of chromosomal transfer of Enterococcus faecalis pathogenicity island, capsule, antimicrobial resistance, and other traits. Proc Natl Acad Sci USA 107, 12269–12274.
32. Davies, J., and Davies, D. (2010) Origins and evolution of antibiotic resistance. Microbiol Mol Biol Rev 74, 417–433.
33. Krakauer, D.C., and Komarova, N.L. (2003) Levels of selection in positive-strand virus dynamics. J Evol Biol 16, 64–73.
34. Lee, H.H., Molla, M.N., Cantor, C.R., and Collins, J.J. (2010) Bacterial charity work leads to population-wide resistance. Nature 467, 82–85.
35. Dupre, J., and O'Malley, M.A. (2007) Metagenomics and biological ontology. Stud Hist Philos Biol Biomed Sci 38, 834–846.

36. Shah, S.A., and Garrett, R.A. (2010) CRISPR/Cas and Cmr modules, mobility and evolution of adaptive immune systems. Res Microbiol.
37. Lyon, P. (2007) From quorum to cooperation: lessons from bacterial sociality for evolutionary theory. Stud Hist Philos Biol Biomed Sci 38, 820–833.
38. Koonin, E.V., and Wolf, Y.I. (2009) Is evolution Darwinian or/and Lamarckian? Biol Direct 4, 42.
39. Van Melderen, L., and Saavedra De Bast, M. (2009) Bacterial toxin-antitoxin systems: more than selfish entities? PLoS Genet 5, e1000437.
40. DeLong, E.F. (2007) Microbiology. Life on the thermodynamic edge. Science 317, 327–328.
41. Bapteste, E., and Burian, R.M. (2010) On the Need for Integrative Phylogenomics, and Some Steps Toward its Creation. Biology and Philosophy 25, 711–736.
42. Valas, R.E., and Bourne, P.E. (2010) Save the tree of life or get lost in the woods. Biol Direct 5, 44.
43. Dagan, T., and Martin, W. (2009) Microbiology. Seeing green and red in diatom genomes. Science 324, 1651–1652.
44. Dagan, T., Roettger, M., Bryant, D., and Martin, W. (2010) Genome networks root the tree of life between prokaryotic domains. Genome Biol Evol 2, 379–392.
45. Cotton, J.A., and McInerney, J.O. (2010) Eukaryotic genes of archaebacterial origin are more important than the more numerous eubacterial genes, irrespective of function. Proc Natl Acad Sci USA.
46. Lapointe, F.J., Lopez, P., Boucher, Y., Koenig, J., and Bapteste, E. (2010) Clanistics: a multi-level perspective for harvesting unrooted gene trees. Trends Microbiol 18, 341–347.
47. Schliep, K., Lopez, P., Lapointe, F.J., and Bapteste, E. (2010) Harvesting Evolutionary Signals in a Forest of Prokaryotic Gene Trees. Mol Biol Evol, ahead of print.
48. Franklin, L.R. (2005) Exploratory experiments. Philosophy of Science 72, 888–899.
49. Burian, R.M. (2007) On microRNA and the need for exploratory experimentation in post-genomic molecular biology. History and Philosophy of the Life Sciences 29(3), 285–312.
50. Elliott, K.C. (2007) Varieties of exploratory experimentation in nanotoxicology. History and Philosophy of the Life Sciences 29(3), 313–336.

51. O'Malley, M.A. (2007) Exploratory experimentation and scientific practice: Metagenomics and the proteorhodopsin case. History and Philosophy of the Life Sciences 29(3), 337–360.
52. Strasser, B.J. (2008) GenBank - Natural History in the 21st Century? Science 322, 537–538.
53. Strasser, B.J. (2010) Laboratories, Museums, and the Comparative Perspective: Alan A. Boyden's Serological Taxonomy, 1925–1962. Historical Studies in the Natural Sciences 40(2), 149–182.
54. Bapteste, E., and Boucher, Y. (2008) Lateral gene transfer challenges principles of microbial systematics. Trends Microbiol 16, 200–207.
55. Walsby, A.E. (1994) Gas vesicles. Microbiol Rev 58, 94–144.
56. Lo, I., Denef, V.J., Verberkmoes, N.C., Shah, M.B., Goltsman, D., DiBartolo, G., Tyson, G.W., Allen, E.E., Ram, R.J., Detter, J.C., Richardson, P., Thelen, M.P., Hettich, R.L., and Banfield, J.F. (2007) Strain-resolved community proteomics reveals recombining genomes of acidophilic bacteria. Nature 446, 537–541.
57. Nesbo, C.L., Bapteste, E., Curtis, B., Dahle, H., Lopez, P., Macleod, D., Dlutek, M., Bowman, S., Zhaxybayeva, O., Birkeland, N.K., and Doolittle, W.F. (2009) The genome of Thermosipho africanus TCF52B: lateral genetic connections to the Firmicutes and Archaea. J Bacteriol 191, 1974–1978.
58. Wilmes, P., Simmons, S.L., Denef, V.J., and Banfield, J.F. (2009) The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev 33, 109–132.
59. Vogl, K., Wenter, R., Dressen, M., Schlickenrieder, M., Ploscher, M., Eichacker, L., and Overmann, J. (2008) Identification and analysis of four candidate symbiosis genes from Chlorochromatium aggregatum, a highly developed bacterial symbiosis. Environ Microbiol 10, 2842–2856.
60. Wanner, G., Vogl, K., and Overmann, J. (2008) Ultrastructural characterization of the prokaryotic symbiosis in Chlorochromatium aggregatum. J Bacteriol 190, 3721–3730.
61. Lindell, D., Jaffe, J.D., Coleman, M.L., Futschik, M.E., Axmann, I.M., Rector, T., Kettler, G., Sullivan, M.B., Steen, R., Hess, W.R., Church, G.M., and Chisholm, S.W. (2007) Genome-wide expression dynamics of a marine virus and host reveal features of co-evolution. Nature 449, 83–86.
62. Lindell, D., Sullivan, M.B., Johnson, Z.I., Tolonen, A.C., Rohwer, F., and Chisholm, S.W. (2004) Transfer of photosynthesis genes to and from Prochlorococcus viruses. Proc Natl Acad Sci USA 101, 11013–11018.
63. Palenik, B., Ren, Q., Tai, V., and Paulsen, I.T. (2009) Coastal Synechococcus metagenome reveals major roles for horizontal gene transfer and plasmids in population diversity. Environ Microbiol 11, 349–359.
64. Zeidner, G., Bielawski, J.P., Shmoish, M., Scanlan, D.J., Sabehi, G., and Beja, O. (2005) Potential photosynthesis gene recombination between Prochlorococcus and Synechococcus via viral intermediates. Environ Microbiol 7, 1505–1513.
65. Gill, S.R., Pop, M., Deboy, R.T., Eckburg, P.B., Turnbaugh, P.J., Samuel, B.S., Gordon, J.I., Relman, D.A., Fraser-Liggett, C.M., and Nelson, K.E. (2006) Metagenomic analysis of the human distal gut microbiome. Science 312, 1355–1359.
66. Qu, A., Brulc, J.M., Wilson, M.K., Law, B.F., Theoret, J.R., Joens, L.A., Konkel, M.E., Angly, F., Dinsdale, E.A., Edwards, R.A., Nelson, K.E., and White, B.A. (2008) Comparative metagenomics reveals host specific metavirulomes and horizontal gene transfer elements in the chicken cecum microbiome. PLoS One 3, e2945.
67. Simpson, G.G. (1961) Principles of Animal Taxonomy. New York: Columbia Univ Press.
68. Lane, C.E., and Archibald, J.M. (2008) The eukaryotic tree of life: endosymbiosis takes its TOL. Trends Ecol Evol 23, 268–275.
69. Alperovitch-Lavy, A., Sharon, I., Rohwer, F., Aro, E.M., Glaser, F., Milo, R., Nelson, N., and Beja, O. (2010) Reconstructing a puzzle: existence of cyanophages containing both photosystem-I and photosystem-II gene suites inferred from oceanic metagenomic datasets. Environ Microbiol.
70. Lozupone, C.A., Hamady, M., Cantarel, B.L., Coutinho, P.M., Henrissat, B., Gordon, J.I., and Knight, R. (2008) The convergence of carbohydrate active gene repertoires in human gut microbes. Proc Natl Acad Sci USA 105, 15076–15081.
71. Moustafa, A., Beszteri, B., Maier, U.G., Bowler, C., Valentin, K., and Bhattacharya, D. (2009) Genomic footprints of a cryptic plastid endosymbiosis in diatoms. Science 324, 1724–1726.
72. Lane, C.E., and Durnford, D. (2010) Endosymbiosis and the evolution of plastids. In: Molecular Phylogeny of Microorganisms. Oren, A., and Papke, R.T. eds. Norwich: Horizon Press.

73. Bouchard, F. (2010) Symbiosis, Lateral Function Transfer and the (many) saplings of life. Biology and Philosophy 25, 623–641.
74. Janzen, D.H. (1980) When is it coevolution? Evolution 34, 611–612.
75. Pernthaler, A., Dekas, A.E., Brown, C.T., Goffredi, S.K., Embaye, T., and Orphan, V.J. (2008) Diverse syntrophic partnerships from deep-sea methane vents revealed by direct cell capture and metagenomics. Proc Natl Acad Sci USA 105, 7052–7057.
76. Overmann, J. (2010) The phototrophic consortium Chlorochromatium aggregatum - a model for bacterial heterologous multicellularity. Adv Exp Med Biol 675, 15–29.
77. Wenter, R., Hutz, K., Dibbern, D., Li, T., Reisinger, V., Ploscher, M., Eichacker, L., Eddie, B., Hanson, T., Bryant, D.A., and Overmann, J. (2010) Expression-based identification of genetic determinants of the bacterial symbiosis Chlorochromatium aggregatum. Environ Microbiol.
78. Ehinger, M., Koch, A.M., and Sanders, I.R. (2009) Changes in arbuscular mycorrhizal fungal phenotypes and genotypes in response to plant species identity and phosphorus concentration. New Phytol 184, 412–423.
79. Scheublin, T.R., Sanders, I.R., Keel, C., and van der Meer, J.R. (2010) Characterisation of microbial communities colonising the hyphal surfaces of arbuscular mycorrhizal fungi. ISME J 4, 752–763.
80. Hijri, I., Sykorova, Z., Oehl, F., Ineichen, K., Mader, P., Wiemken, A., and Redecker, D. (2006) Communities of arbuscular mycorrhizal fungi in arable soils are not necessarily low in diversity. Mol Ecol 15, 2277–2289.
81. Kuhn, G., Hijri, M., and Sanders, I.R. (2001) Evidence for the evolution of multiple genomes in arbuscular mycorrhizal fungi. Nature 414, 745–748.
82. Odenbaugh, J. (2007) Seeing the Forest and the Trees: Realism about Communities and Ecosystems. Philosophy of Science 74, 628–641.
83. Hutchinson, G.E. (1948) Circular Causal Systems in Ecology. Annals of the New York Academy of Sciences 50, 221–246.
84. Tansley, A.G. (1935) The Use and Abuse of Vegetational Terms and Concepts. Ecology 16, 284–307.
85. Swenson, W., Wilson, D.S., and Elias, R. (2000) Artificial Ecosystem Selection. Proceedings of the National Academy of Science 97, 9110–9114.

86. Bouchard, F. (2011) How ecosystem evolution strengthens the case for functional pluralism. In: Functions: selection and mechanisms. Huneman, P. ed.: Synthese Library, Springer.
87. Konstantinidis, K.T., and Tiedje, J.M. (2005) Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci USA 102, 2567–2572.
88. Doolittle, W.F. (2009) Eradicating Typological Thinking in Prokaryotic Systematics and Evolution. Cold Spring Harb Symp Quant Biol.
89. Popa, O., Hazkani-Covo, E., Landan, G., Martin, W., and Dagan, T. (2011) Directed networks reveal genomic barriers and DNA repair bypasses to lateral gene transfer among prokaryotes. Genome Res 21(4), 599–609.
90. Broogard, B. (2004) Species as Individuals. Biology and Philosophy 19, 223–242.
91. Ereshefsky, M. (2010) Mystery of mysteries: Darwin and the species problem. Cladistics 26, 1–13.
92. Falkowski, P.G., Fenchel, T., and Delong, E.F. (2008) The microbial engines that drive Earth's biogeochemical cycles. Science 320, 1034–1039.
93. Doolittle, W.F., and Zhaxybayeva, O. (2010) Metagenomics and the Units of Biological Organization. Bioscience 60, 102–112.
94. Lerat, E., Daubin, V., and Moran, N.A. (2003) From gene trees to organismal phylogeny in prokaryotes: the case of the gamma-Proteobacteria. PLoS Biol 1, E19.
95. Touchon, M., Hoede, C., Tenaillon, O., Barbe, V., Baeriswyl, S., Bidet, P., Bingen, E., Bonacorsi, S., Bouchier, C., Bouvet, O., Calteau, A., Chiapello, H., Clermont, O., Cruveiller, S., Danchin, A., Diard, M., Dossat, C., Karoui, M.E., Frapy, E., Garry, L., Ghigo, J.M., Gilles, A.M., Johnson, J., Le Bouguenec, C., Lescat, M., Mangenot, S., Martinez-Jehanne, V., Matic, I., Nassif, X., Oztas, S., Petit, M.A., Pichon, C., Rouy, Z., Ruf, C.S., Schneider, D., Tourret, J., Vacherie, B., Vallenet, D., Medigue, C., Rocha, E.P., and Denamur, E. (2009) Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet 5, e1000344.
96. Dinsdale, E.A., Edwards, R.A., Hall, D., Angly, F., Breitbart, M., Brulc, J.M., Furlan, M., Desnues, C., Haynes, M., Li, L., McDaniel, L., Moran, M.A., Nelson, K.E., Nilsson, C., Olson, R., Paul, J., Brito, B.R., Ruan, Y., Swan, B.K., Stevens, R., Valentine, D.L., Thurber, R.V., Wegley, L., White, B.A., and Rohwer, F. (2008) Functional metagenomic profiling of nine biomes. Nature 452, 629–632.
97. Bouchard, F. (2008) Causal Processes, Fitness and the Differential Persistence of Lineages. Philosophy of Science 75, 560–570.
98. Csardi, G., and Nepusz, T. (2006) The igraph software package for complex network research. InterJournal Complex Systems, 1695.
99. Newman, M.E. (2006) Finding community structure in networks using the eigenvectors of matrices. Phys Rev E Stat Nonlin Soft Matter Phys 74, 36104.
100. Tatusov, R.L., Koonin, E.V., and Lipman, D.J. (1997) A genomic perspective on protein families. Science 278, 631–637.
101. Li, W. (2009) Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinformatics 10, 359.
102. Killcoyne, S., Carter, G.W., Smith, J., and Boyle, J. (2009) Cytoscape: a community-based framework for network modeling. Methods Mol Biol 563, 219–239.
103. Boucher, Y., Cordero, O.X., Takemura, A., Hunt, D.E., Schliep, K., Bapteste, E., Lopez, P., Tarr, C.L., and Polz, M.F. (2011) Local mobile gene pools rapidly cross species boundaries to create endemicity within global Vibrio cholerae populations. MBio 2(2). pii: e00335-10. doi:10.1128/mBio.00335-10.
104. Lawrence, J. (1999) Selfish operons: the evolutionary impact of gene clustering in prokaryotes and eukaryotes. Curr Opin Genet Dev 9(6), 642–648. Review.

Part II
Natural Selection, Recombination, and Innovation
in Genomic Sequences

Chapter 5
Selection on the Protein-Coding Genome
Carolin Kosiol and Maria Anisimova
Abstract
Populations evolve as mutations arise in individual organisms and, through hereditary transmission, may
become fixed (shared by all individuals) in the population. Most mutations are lethal or have negative
fitness consequences for the organism. Others have essentially no effect on organismal fitness and can
become fixed through the neutral stochastic process known as random drift. However, mutations may also
produce a selective advantage that boosts their chances of reaching fixation. Regions of genes where new
mutations are beneficial, rather than neutral or deleterious, tend to evolve more rapidly due to positive
selection. Genes involved in immunity and defense are a well-known example; rapid evolution in these
genes presumably occurs because new mutations help organisms to prevail in evolutionary arms races
with pathogens. In recent years, genome-wide scans for selection have enlarged our understanding of the
evolution of the protein-coding regions of the various species. In this chapter, we focus on the methods to
detect selection in protein-coding genes. In particular, we discuss probabilistic models and how they have
changed with the advent of new genome-wide data now available.
Key words: Conserved and accelerated regions, Positive selection scans, Codon models, Time and
space heterogeneity of genome evolution, Phylo-HMMs, Selection-mutation models

1. Introduction
Protein-coding genes are the DNA sequences used as templates for
the production of a functional protein. Such sequences consist of
nucleotide triplets called codons. During the protein production
phase, codons are transcribed and then translated into amino acids
(AAs) according to the organism's genetic code. In the past, selection studies on coding DNA mainly focused on the analysis of
particular proteins of interest. With the availability of comparative
genomic data, the emphasis has shifted from the study of individual
proteins to genome-wide scans for selection. The overview of genomic data underlying the genome-wide analysis of protein-coding
genes is included in Subheading 2.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_5,
© Springer Science+Business Media, LLC 2012

113

114

C. Kosiol and M. Anisimova

The analysis of coding sequences can be performed on three
different levels: using DNA, AA, or codon sequences. The mutational
processes at these three levels can be described by probabilistic
models, which set the basis for evaluating selective pressures and
selection tests. The fundamental properties of these models are
summarized in Subheading 3.1.
There is accumulating evidence that the evolutionary process
varies between sites in biological sequences. Even in nonfunctional
genomic regions, there appears to be variation in the mutational
process. This variation is even more pronounced in active genomic
segments. In protein-coding sequences, changes that impede function are unlikely to be accepted by selection (e.g., a mutation in an active
site), while those altering less vital areas are under lower selective
constraints (e.g., mutation in nonfunctional loop regions). Furthermore, systematic studies have shown that variability is not
determined exclusively by selection on protein structure and function, but is also affected by the genomic position of the encoding
genes, their expression patterns, their position in biological networks and their robustness to mistranslation (see ref. 1 for a review
of these factors).
In Fig. 1, we summarize the different levels of modeling selection on protein-coding sequences. The wedges represent the three
data types: DNA, AA, and codons. Temporal heterogeneity is
represented by the tree branches from lineage-specific models to
analyses considering genealogies and population properties, such as
the effective population size and the distribution of selective coefficients. For example, temporal heterogeneity is included in models
that detect accelerated regions in DNA, rate shifts in
AA sequences, or the branch-specific codon models.
Furthermore, the concentric layers in Fig. 1 describe different
levels of modeling spatial heterogeneity in cDNA, such as
phylogenetic hidden Markov models (phylo-HMMs) for DNA or
branch-site models for codon sequences. Within the Methods

Fig. 1. A diagram illustrating the different data levels to analyze protein-coding sequences
and the relationship of the various approaches modeling spatial and temporal heterogeneity.

5 Selection on the Protein-Coding Genome

115

section, Subheadings 3.2–3.4 are devoted to models allowing for temporal and spatial heterogeneity and give an overview of state-of-the-art methods to analyze selection on protein-coding regions.
Subheadings 4.1–4.5 discuss possible sources of errors in
genome-wide analyses. Finally, we conclude with the Discussion
section, which provides insights into emerging directions in studying selection at the genomic level.

2. Comparative
Genome Data
Several whole-genome sequence data sets are now available
for selection scans. Mammalian genomes are well represented
(in particular primates), and insect genomes are becoming more
numerous (in particular Drosophila). These data can be downloaded
as orthologous alignments from the Ensembl (2) and UCSC (3)
browsers. Methods for constructing orthologous sets of genes are
reviewed in Chapter 9 of Volume 1 (4).
In light of recent advances in DNA sequencing, with the
so-called next-generation sequencing (NGS) technologies that
have dramatically reduced the cost and time needed to sequence
an organism's entire genome, large-scale sequencing projects (involving many organisms) have been and are currently being undertaken. In particular, genome projects resequencing 1000 Human,
1000 Drosophila melanogaster, and 1001 Arabidopsis individuals
are ongoing. These polymorphism data from multiple individuals
from several species enable us to detect very recent selection.
Together with the progress in sequencing technologies,
algorithmic advances now allow the de novo assembly of genomes
from NGS data (see Chapter 5 in Volume 1 (5)), including complex
mammalian genomes (e.g., giant panda genome (6)). Announced
shortly after the Human 1000 Genomes Project, the 1000 Plant
Genomes Project is yet another similarly large-scale genomics
endeavor to take advantage of the speed and efficiency of NGS.
The Genome 10K project aims to assemble a "genomic zoo": a
collection of DNA sequences representing the genomes of 10,000
vertebrate species, approximately one for every vertebrate genus.
All these genomes can be subject to scans for selection, for which
we outline methods below.

3. Methods
3.1. Probabilistic
Models for Genome
Evolution

The statistical modeling of the evolutionary process is of great
importance when performing selection studies. When comparing
reasonably divergent sequences, counting raw sequence differences
(the percentage of sites with observed changes) underestimates
the amount of evolution that has occurred because, by chance
alone, some sites will have incurred multiple substitutions. In this
chapter, we discuss maximum likelihood (ML) and Bayesian methods to detect selection based on probabilistic models of character
evolution. Such substitution models provide more accurate evolutionary distance estimates by accounting for these unobserved
changes, and explicitly model the selection pressure at the protein-coding level.
One of the primary assumptions made in defining probabilistic
substitution models is that future evolution is only dependent on its
current state and not on previous (ancestral) states. Statistical processes with this lack of memory are called Markov processes. The
assumption itself is reasonable because, during evolution, mutation and natural selection can only act upon the molecules present
in an organism and have no knowledge of what came previously.
However, some large-scale mutational events, such as recombination (7), gene conversion (e.g., see refs. 8 and 9), or horizontal
transfer (10), might not satisfy this memoryless condition.
To reduce the complexity of evolutionary models, it is often
further assumed that each site in a sequence evolves independently
from all other sites. There is evidence that the independence of sites
assumption is violated. In real proteins, chemical interactions
between neighboring sites or the protein structure affect how
other sites in the sequence change. Steps have been made toward
context-dependent models, where the specific characters at neighboring sites affect the site's evolution (e.g., see refs. 11 and 12).
The Markov model asserts that one protein sequence is derived
from another by a series of independent substitutions, each changing one character in the first sequence to another character in the
second during evolution. Thereby, we assume independence of evolution at different sites. A continuous-time Markov process is fully defined by its instantaneous rate matrix Q = {qij}, i, j = 1, ..., N.
The diagonal elements of Q are defined by a mathematical
requirement that the rows sum up to zero. For multiple sequence
alignments, the substitution process runs in continuous time over a
tree representing phylogenetic relations between the sequences.
The transition probability matrix P(t) = {pij(t)} = exp(Qt) consists of transition probabilities from residue i to residue j over time t, and is found as a solution of the differential equation dP(t)/dt = P(t)Q
with P(0) being the identity matrix. In order for tree branches to be
measured by the expected number of substitutions per site,
the matrix Q is scaled so that the average substitution rate at
equilibrium equals 1.
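As a concrete sketch of these definitions (the exchangeabilities and frequencies below are arbitrary toy numbers, not estimates from data), the scaling of Q and the computation of P(t) = exp(Qt) can be written as:

```python
import numpy as np
from scipy.linalg import expm

# Toy 4-state example; S and pi are arbitrary illustrative numbers.
pi = np.array([0.1, 0.2, 0.3, 0.4])            # equilibrium frequencies
S = np.array([[0.0, 1.0, 2.0, 1.0],            # symmetric exchangeabilities
              [1.0, 0.0, 1.0, 2.0],
              [2.0, 1.0, 0.0, 1.0],
              [1.0, 2.0, 1.0, 0.0]])
Q = S * pi                                     # q_ij proportional to pi_j
np.fill_diagonal(Q, -Q.sum(axis=1))            # rows must sum to zero

# Scale Q so the expected rate at equilibrium, sum_i pi_i * (-q_ii), is 1;
# branch lengths are then measured in expected substitutions per site.
Q /= (pi * -np.diag(Q)).sum()

P = expm(Q * 0.5)                              # P(t) = exp(Qt) at t = 0.5
assert np.allclose(P.sum(axis=1), 1.0)         # each row of P(t) is a distribution
assert np.allclose(pi @ P, pi)                 # pi is the stationary distribution
```

Because the exchangeability matrix S is symmetric, this Q is also time-reversible, which is the assumption discussed next.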
5 Selection on the Protein-Coding Genome

As a matter of mathematical and computational convenience rather than biological reality, several simplifying assumptions are usually made. Standard substitution models allow any state to change into any other. Such a Markov process is called irreducible and has a unique stationary distribution corresponding to the equilibrium codon frequencies π = {πi}. Time reversibility implies
that the direction of the change between two states, i and j, is indistinguishable, so that πi pij(t) = πj pji(t). This assumption helps to reduce the number of model parameters and is convenient when calculating the matrix exponential (the matrix Q of a reversible process has only real eigenvectors and eigenvalues (13)). The fully unrestrained matrix Q for N characters defines an irreversible model with [N(N - 1) - 1] free parameters, while for a reversible process this number is [(N(N + 1)/2) - 2].
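These two parameter counts are easy to check numerically; a small sketch (61 is the number of sense codons under the universal genetic code):

```python
# Free-parameter counts for an N-state substitution model, as given in the
# text; the "- 1" in each count comes from fixing the overall rate scale.
def n_params_irreversible(N):
    return N * (N - 1) - 1                 # all off-diagonal rates, minus scaling

def n_params_reversible(N):
    # N(N-1)/2 exchangeabilities + (N-1) free frequencies - 1 for scaling,
    # which simplifies to (N(N+1)/2) - 2.
    return N * (N - 1) // 2 + (N - 1) - 1

assert n_params_reversible(61) == (61 * 62) // 2 - 2   # matches [(N(N+1)/2) - 2]
print(n_params_irreversible(61), n_params_reversible(61))
```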
By comparing how well substitution models explain sequence
evolution and by examining the parameters estimated from data,
ML and Bayesian inference can be used to address many biologically important questions. In this section, we focus on probabilistic
models that are used to detect selection.
3.2. Detecting Regions of Accelerated Genome Evolution

Understanding the forces shaping the evolution of specific lineages
is one of the most exciting areas in evolutionary genomics. In
particular, regions of accelerated evolution in mammalian and
insect species have been studied (e.g., see ref. 14). To eliminate
nonfunctional regions, one strategy is to begin with a search for
regions that are conserved through the mammalian history or
longer. A likelihood ratio test (LRT) may be used to detect acceleration of rates in a lineage of interest, for example, the human lineage. Such an LRT compares the likelihood of the alignment data under two probabilistic models. The null model has a single scale parameter representing shortening (more conserved) and lengthening (less conserved) of all branches of the tree. The alternative model has an additional parameter for the human lineage, which is constrained to be ≥ 1. This extra parameter allows the human branch
to be relatively longer (accelerated) than the branches in the rest of
the tree.
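Mechanically, such an LRT reduces to comparing twice the log-likelihood difference with a chi-square distribution. A minimal sketch with made-up log-likelihood values (not from any real alignment); note that the boundary constraint on the extra parameter can make the plain one-degree-of-freedom null only approximate:

```python
from scipy.stats import chi2

# Placeholder log-likelihoods; in practice these come from fitting the two
# models to the same alignment (e.g., with phylogenetic software).
lnL_null = -1234.7   # single scale parameter shared by all branches
lnL_alt = -1228.2    # extra scale parameter on the lineage of interest

lrt = 2.0 * (lnL_alt - lnL_null)
p_value = chi2.sf(lrt, df=1)     # one additional parameter => 1 df
print(lrt, p_value)
```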
For example, this approach was used to identify genomic
regions that are conserved in most vertebrates, but have evolved
rapidly in humans. Interestingly, the majority of the human accelerated regions (HARs) were noncoding and many were located
near protein-coding genes with protein functions related to the
nervous system (14).
In contrast, the majority of D. melanogaster-accelerated regions
(DMARs) are found in protein-coding regions and primarily result
from rapid adaptive change at synonymous sites (15). This could be
because flies have much more compact genomes compared to
humans; however, even after considering the genomic content, in
Drosophila, a significant excess of DMARs occurs in protein-coding regions. Furthermore, Holloway and colleagues observed a mutational bias from G|C to A|T; the accelerated divergence in DMARs might therefore be attributed to a shift in codon usage and the fixation of many suboptimal codons.


In a similar manner, amino acid-based models search for site- or
lineage-specific rate accelerations and residues subject to altered
functional constraints. Such sites are likely to be contributing to
the change in protein function over time. The advantage of amino
acid-based models is that they might be suitable for the analysis of
deep divergences of fast-evolving genes, where sequences rapidly
saturate over time. Also amino acid methods are not influenced by
the effects of codon bias, a topic that is discussed at the end of this
chapter. The idea is that adaptive change on the level of amino acid
sequences may not necessarily correspond to an adaptive change in
protein function but rather to peaks in the protein-adaptive landscape reflecting the optimization of the protein function in a
particular species to long-term environmental changes. One class
of methods for detecting functional divergence searches for a
lineage-specific change in the shape parameter of the gamma distribution that is used to model rate heterogeneity (see refs. 16-18
and 19). Other methods search for evidence of clade-specific rate
shifts at individual sites (see refs. 20-25 and 26). For example,
Gu (21) proposed a simple stochastic model for estimating the
degree of divergence between two prespecified clusters. The statistical significance was tested using site-specific profiles based on an
HMM, which was used to identify amino acids responsible for these
functional differences between two gene clusters. More flexible
evolutionary models were incorporated in the maximum likelihood
approach applicable to the simultaneous analysis of several gene
clusters (27). This was extended (28) to evaluate site-specific shifts
in amino acid properties, in comparison with site-specific rate shifts.
Pupko and Galtier (24) used the LRT to compare ML estimates of
the replacement rate at an amino acid site in distinct subtrees.
3.3. Phylogenetic Hidden Markov Models

Phylo-HMMs are probabilistic models that consider not only the
way substitutions occur along an evolutionary history represented
by a tree, but also the way this process changes from site to site in a
genome. Phylo-HMMs describe evolution as a combination of two
Markov processes: one that operates in the dimension of space
(along the genome) and one that operates in the dimension of
time (along the branches of a phylogenetic tree). In the assumed
process, a character is drawn at random from the background
distribution and assigned to the root of the tree. Character
substitutions occur randomly along the tree branches from root
to leaves. The characters that are found at the leaves when the
process has been completed define an alignment column having a
correlation structure that reflects the phylogeny and the substitution process. The different phylogenetic models associated with the
states of the phylo-HMM may reflect different overall rates of
substitution (for example, conserved and nonconserved as in
Fig. 2) and different patterns of substitution or background distributions (as in different codon positions). The idea is to identify highly conserved genomic regions indicating purifying selection or accelerated regions indicating positive selection in a set of multiple aligned sequences. Such regions are good candidates for further selection analysis and they are likely to be functionally important. Hence, the identification of regions through phylo-HMMs has become a subject of considerable interest in comparative genomics (see refs. 29 and 30).

Fig. 2. Visualization of an example phylo-HMM showing the probabilistic graph and the input alignment. The grey columns represent the conserved state; the white columns the fast state. At each time step, a new state is visited according to the transition probabilities (m and n parameters on arcs) and a multiple alignment column is emitted according to the conserved and nonconserved phylogenetic models Cc and Cn. Thereby, the phylogenetic models include the parameters describing the tree and the pattern of substitution.
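The conserved/fast two-state phylo-HMM can be sketched with the standard forward algorithm. The per-column emission likelihoods below are made-up numbers standing in for values that would come from Felsenstein's pruning algorithm under the two phylogenetic models:

```python
import numpy as np

# Minimal two-state phylo-HMM sketch (states: conserved, fast).
col_lik = np.array([[0.9, 0.1],     # one row per alignment column:
                    [0.8, 0.3],     # [P(column | conserved), P(column | fast)]
                    [0.2, 0.7],
                    [0.1, 0.9]])
trans = np.array([[0.95, 0.05],     # state-transition probabilities
                  [0.10, 0.90]])
start = np.array([0.5, 0.5])        # initial state distribution

# Forward algorithm: total likelihood summed over all state paths.
f = start * col_lik[0]
for e in col_lik[1:]:
    f = (f @ trans) * e
total_likelihood = f.sum()
print(total_likelihood)
```

The same forward variables, normalized, are what conservation scorers use to assign a posterior state probability to each column.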
3.4. Codon Models: Site, Branch, and Branch-Site Specificity

3.4.1. Basic Codon Models

In protein-coding sequences, nucleotide sites at different codon
positions usually evolve with highly heterogeneous patterns (e.g.,
see ref. 31). Thus, DNA substitution models fail to account for this
heterogeneity unless the sequences are partitioned by codon positions for the analysis. But even then, DNA models do not model the
structure of genetic code or selection at the protein level. Indeed,
one advantage of studying protein-coding sequences at the codon
level is the ability to distinguish between nonsynonymous (AA
replacing) and synonymous (silent) codon changes. Based on this
distinction, the selective pressure on the protein-coding level can
be measured by the ratio ω = dN/dS of the nonsynonymous-to-synonymous substitution rates. The nonsynonymous substitution rate may be higher than the synonymous rate, and thus ω > 1, due to fitness advantages associated with recurrent AA changes in the protein, i.e., positive selection on the protein. In contrast, purifying
selection acts to preserve the protein sequence so that the nonsynonymous substitution rate is lower than the synonymous rate, with ω < 1. Neutrally evolving sequences exhibit similar nonsynonymous and synonymous rates, with ω ≈ 1.
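The nonsynonymous/synonymous distinction can be illustrated by classifying the nine single-nucleotide neighbors of a codon. The code table below is only a small fragment of the standard genetic code, enough for this toy example; a full analysis would use a complete codon table:

```python
# Fragment of the standard genetic code covering TTT and its neighbors.
code = {"TTT": "F", "TTC": "F", "TTA": "L", "TTG": "L",
        "TCT": "S", "TAT": "Y", "TGT": "C", "CTT": "L",
        "ATT": "I", "GTT": "V"}

def neighbor_changes(codon):
    """Split the single-nucleotide neighbors of `codon` into synonymous
    and nonsynonymous changes."""
    syn, nonsyn = [], []
    for pos in range(3):
        for nt in "ACGT":
            if nt == codon[pos]:
                continue
            mut = codon[:pos] + nt + codon[pos + 1:]
            if mut in code:  # skip codons outside this toy table
                (syn if code[mut] == code[codon] else nonsyn).append(mut)
    return syn, nonsyn

syn, nonsyn = neighbor_changes("TTT")
print(syn, nonsyn)
```

For TTT (Phe), only the third-position change to TTC is synonymous; the other eight neighbors all change the amino acid.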
The first methods that used the ω ratio as a criterion to detect
positive selection were based on pairwise estimation of dN and dS
rates with counting methods (e.g., see ref. 32). However, ML
estimates of pairwise dN and dS based on a codon model were
shown to outperform all other approaches (33). Moreover, a
Markov codon model is naturally extended to multiple sequence
alignments, unlike the counting methods. This, together with the
benefits of the probabilistic framework within which codon models
are defined, made codon models very popular in studies of positive
selection in protein-coding genes.
The first two codon models were proposed simultaneously in
the same issue of Molecular Biology and Evolution ((34) and (35)).
The model of Goldman and Yang (34) included the transition/transversion rate ratio κ, and modeled the selective effect indirectly using a multiplicative factor based on Grantham (36) distances, but was later simplified to estimate the selective pressure explicitly using the ω parameter (37). The main distinction between the first codon
models concerns the way to describe the instantaneous rates
with respect to equilibrium frequencies: (1) proportional to the
equilibrium frequency of a target codon (as in Goldman and Yang
(34)) or (2) proportional to the frequency of a target nucleotide
(as in Muse and Gaut (35)).
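A sketch of how such instantaneous rates are typically assembled, following convention (1) above (rate proportional to the target-codon frequency); the function and the toy translate stub are illustrative, not taken from any package:

```python
# Sketch of a Goldman-Yang-style instantaneous rate between two codons:
# zero for multi-nucleotide changes, multiplied by kappa for transitions
# and by omega for nonsynonymous changes, proportional to the frequency
# pi_j of the target codon. `translate` maps a codon to its amino acid.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def rate(ci, cj, kappa, omega, pi_j, translate):
    diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
    if len(diffs) != 1:
        return 0.0                    # single-nucleotide changes only
    q = pi_j                          # proportional to target-codon frequency
    if diffs[0] in TRANSITIONS:
        q *= kappa                    # transition/transversion ratio
    if translate(ci) != translate(cj):
        q *= omega                    # nonsynonymous change: dN/dS factor
    return q

# Toy usage with a three-codon translate stub:
aa = {"TTT": "F", "TTC": "F", "TTA": "L"}.get
print(rate("TTT", "TTC", kappa=2.0, omega=0.5, pi_j=0.02, translate=aa))  # synonymous transition
print(rate("TTT", "TTA", kappa=2.0, omega=0.5, pi_j=0.01, translate=aa))  # nonsynonymous transversion
```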
Recently, empirical codon models have been estimated (see refs.
38 and 39) that summarize substitution patterns from large quantities of protein-coding gene families. In contrast to the parametric
codon models that estimate gene-specific parameters (e.g., transition/transversion κ, selective pressure ω, etc.), the empirical codon
models do not explicitly consider distinct factors that shape protein
evolution. Standard parametric models assume that protein evolution proceeds only by successive single-nucleotide substitutions.
However, empirical codon models indicate that model accuracy is
significantly improved by incorporating instantaneous doublet and
triplet changes. Kosiol et al. (39) also found that the affiliations among a codon, the amino acid it encodes, and the physicochemical properties of that amino acid are the main driving factors of the process of
codon evolution. Neither multiple nucleotide changes nor the
strong influence of the genetic code nor amino acid properties
form a part of the standard parametric models.
On the other hand, parametric models have been very successful
in applications studying biological forces shaping protein evolution
of individual genes. Thus, combining the advantages of parametric
and empirical approaches offers a promising direction. Kosiol,
Holmes, and Goldman (39) explored a number of combined
codon models that incorporated empirical AA exchangeabilities
from ECM while using parameters to study selective pressure,
transition/transversion biases, and codon frequencies. Similarly,
AA exchangeabilities from (suitable) empirical AA matrices may be
used to alter probabilities of nonsynonymous changes, together
with traditional parameters ω, κ, and codon frequencies πj (40).
Such an approach accommodates site-specific variation of selective
pressure and can be further extended to include lineage-specific
variation. Combined empirical and parametric models will, therefore, become more frequent in selection studies. However, selecting
an appropriate model is of utmost importance and needs further
study. In particular, parameter interpretations may change with
different model definitions, since empirical exchangeabilities
already include average selective factors and other biases (39).
Thus, selection among alternative parameterizations requires detailed attention.
3.4.2. Accounting for Variability of Selective Pressures

The first codon models assumed constant nonsynonymous and synonymous rates among sites and over time. Although most proteins
evolve under purifying selection most of the time, positive selection
may drive the evolution in some lineages. During episodes of
adaptive evolution, only a small fraction of sites in the protein
have the capacity to increase the fitness of the protein via AA
replacements. Thus, approaches assuming constant selective pressure over time and over sites lack power in detecting genes affected
by positive selection. Consequently, various scenarios of variation in
selective pressure were incorporated in codon models, making
them more powerful at detecting positive selection, and short
episodes of adaptive evolution in particular. Evidence of positive
selection on a gene can be obtained by an LRT comparing two
nested models: a model that does not allow positive selection
(constraining ω ≤ 1 to represent the null hypothesis) and a model that allows positive selection (ω > 1 is allowed in the alternative hypothesis). Positive selection is detected if the model allowing ω > 1 fits the data significantly better than the model restricting ω ≤ 1 at all sites and lineages. However, the asymptotic null distribution may vary from the standard χ² due to boundary problems or if some parameters are not estimable (e.g., see refs. 41 and 42).

3.4.3. Case Study: Application of a Genome-Wide Scan of Positive Selection on Six Mammalian Genomes

In 2006, six high-coverage genome assemblies became available for
eutherian mammals. The increased phylogenetic depth of this data set permitted Kosiol and colleagues (43) to perform several new lineage- and clade-specific tests using branch-site codon models. Of ~16,500
human genes with high-confidence orthologs in at least two other
species, 544 genes showed significant evidence of positive selection
using branch-site codon models and standard LRTs.
Interestingly, several pathways were found to be strongly
enriched in genes with positive selection, suggesting possible
coevolution of interacting genes. A striking example is the
complement immunity system, a biochemical cascade responsible
for the elimination of pathogens. This system consists of several
small proteins found in the blood that cooperate to kill target cells
by disrupting their plasma membranes. Of 28 genes associated with
this pathway in KEGG (see http://www.genome.jp/kegg-bin/
show_pathway?map04610 for the complement cascades), 9 were
under positive selection (FDR < 0.05) and 5 others had nominal
P < 0.05. Most of the genes under positive selection are inhibitors
(DAF, CFH, CFI) and receptors (C5AR1, CR2), but some are part
of the membrane attack complex (C7, C9, C8B), which punctures
cell membranes to initiate cell lysis. Here, we focus on the analysis
of these proteins of the membrane attack complex.
First, we calculate the gene-averaged ω value using the basic M0 model (34). The ML estimates of ω < 1 (ω = 0.31 for C7, ω = 0.25 for C8B, and ω = 0.44 for C9) indicate that most sites
in these genes are under purifying selection. However, selection
pressure could be variable at different locations of the membrane
proteins and we, therefore, continue our analysis by applying models that allow for variation in selective pressure across sites.
3.4.4. Selective Variability Among Codons: Site Models

The simplest site models use the general discrete distribution with a
prespecified number of site classes. Each site class i has an independent parameter ωi estimated by ML together with proportions of sites pi in each class. Since a large number of site categories requires
many parameters, three categories are usually used (requiring five
independent parameters). To test for positive selection, several pairs
of nested site models were defined to represent the null and alternative hypotheses in LRTs. For example, model M1a includes two
site classes, one with ω0 < 1 and another with ω1 = 1, representing the neutral model of evolution (the null hypothesis). The alternative model M2a extends M1a by adding an extra site class with ω2 ≥ 1 to accommodate sites evolving under positive selection. Significance of the LRT is tested using the χ² distribution with 2 degrees of freedom for the M1a vs. M2a comparison. We test the C7 gene for positive
selection by the LRT comparing nested models M1a and M2a
(Table 1).
Model M2a has two additional parameters compared to model
M1a. The resulting LRT statistic is 2 × (log L2 - log L1) = 2 × (-6369.67 - (-6377.35)) = 2 × 7.68 = 15.36. This is much greater than the critical value of the chi-square distribution, χ²(df = 2, at 5%) = 5.99, and we calculate a p-value of P = 5.0e-04. However, the M1a vs. M2a comparison for genes
C8B and C9 is not significant.
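The arithmetic of this LRT is easy to reproduce; a sketch using the log-likelihoods from Table 1 (with these rounded values the computed p-value is about 4.6e-4, in line with the quoted P = 5.0e-04):

```python
from scipy.stats import chi2

lnL_M1a = -6377.35    # null model (nearly neutral), from Table 1
lnL_M2a = -6369.67    # alternative model with a positive-selection class

lrt = 2.0 * (lnL_M2a - lnL_M1a)       # = 15.36
p_value = chi2.sf(lrt, df=2)          # M2a adds two free parameters
assert lrt > chi2.ppf(0.95, df=2)     # exceeds the 5% critical value 5.99
print(round(lrt, 2), p_value)
```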
Another LRT can be performed on the basis of the modified model M8 with two site classes: one with sites where the ω ratio is drawn from the beta distribution (with 0 ≤ ω ≤ 1, describing the neutral scenario), and a second, discrete class with ω ≥ 1. Constraining ω = 1 for this second class provides a sufficiently
Table 1
Parameter estimates and log likelihoods for an LRT of positive selection for the complement immunity component C7

M1a (nearly neutral)
  Site class:       0               1
  Proportion:       p0 = 0.69       (p1 = 1 - p0 = 0.31)
  ω ratio:          ω0 = 0.07       (ω1 = 1)
  Log likelihood:   L1 = -6377.35

M2a (selection)
  Site class:       0               1               2
  Proportion:       p0 = 0.70       p1 = 0.29       (p2 = 1 - p0 - p1 = 0.01)
  ω ratio:          ω0 = 0.08       (ω1 = 1)        ω2 = 10.89
  Log likelihood:   L2 = -6369.67

The model M2a is the alternative model with a class of sites with ω2 ≥ 1. The null hypothesis M1a is the same model but with ω2 = 1 fixed.

flexible null hypothesis, whereby all evolution can be explained by sites with ω from the beta distribution or from a discrete site class with ω = 1. Significance of the LRT is tested using the mixture ½χ₀² + ½χ₁² for the M8 (ω = 1) vs. M8 comparison. If the LRT for
positive selection is found to be significant, specific sites under
positive selection may be predicted based on their posterior probabilities (PPs) of belonging to the site class under positive selection (usually, PP > 0.95, but see refs. 44 and 45). Such posterior
probabilities are estimated using the naïve empirical Bayes (NEB) approach (46), the full hierarchical Bayesian approach (47), or a mid-way approach, the Bayes empirical Bayes (BEB) (45). For a discussion of these approaches, see Scheffler and Seoighe (48) and
Aris-Brosou (49). Alternatively, Massingham and Goldman (50)
proposed a site-wise likelihood ratio estimation to detect sites
under purifying or positive selection.
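The NEB calculation itself is a one-line Bayes update: the posterior probability of each site class is the ML class proportion times the site's likelihood under that class, renormalized. A sketch with made-up numbers:

```python
import numpy as np

# Naive empirical Bayes sketch. Both vectors below are illustrative:
# props mimics M2a-style ML mixture proportions, and site_lik stands in
# for P(site data | class k) computed on the fitted tree.
props = np.array([0.70, 0.29, 0.01])
site_lik = np.array([1e-6, 4e-6, 9e-5])

post = props * site_lik
post /= post.sum()
print(post)   # last entry: PP that the site is in the positive-selection class
```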
For the C7 gene, using BEB, we identified several amino acid
sites to be putatively under selection: residue R at position 223
(PP = 0.94), H at position 239 (PP = 0.93), and N at position 331 (PP = 0.93). Unfortunately, the crystal structures of C7
(as well as C8B and C9) are not known, and we cannot relate the
location of amino acids in the protein sequence to relevant 3D data,
such as sites of protein-protein interaction or binding sites of the
protein. If such structural information were known, it would also
be possible to use this biological knowledge in a model that is aware
of the position of the different structural elements.
Site models that do not use a priori partitioning of codons
(such as those described above) are known as random-effect (RE) models. In contrast, fixed-effect (FE) models categorize sites based on
prior knowledge, e.g., according to tertiary structure for single
genes, or by gene category for multigene data. Site partitions for FE models can also be defined based on inferred recombination breakpoints, useful for inferences of positive selection from recombining sequences (see refs. 51 and 52), although the uncertainty of breakpoint inference is ignored in this way. FE models with each site being a partition should be avoided, as they lead to the "infinitely many parameters" trap (e.g., see ref. 53). Given a
biologically meaningful a priori partitioning, FE models are useful
to study heterogeneity among partitions. However, a priori information is not always available.
3.4.5. Selective Variability Over Time: Branch Models

A simple way to include the variation of the selective pressure over time is by using separate parameters ω for each branch of a phylogeny (known as the free-ratio model (37)). Compared with the one-ratio model (which assumes constant selection over time), the free-ratio model requires 2T - 4 additional ω parameters for T species.
Figure 3 shows the estimates of the free-ratio model for the C8B
gene. Although the ML estimates of ω on the rodent lineages are visibly higher than those on the primate lineages, none of the branches has ω > 1.
Other branch models can be defined by constraining different
sets of branches of a tree to have an individual ω. LRTs are used to

decide (1) whether selective pressure is significantly different on a prespecified set of branches and (2) whether these branches are under positive selection.

Fig. 3. An estimate of ω for each branch of a six-species phylogeny (human, chimp, macaque, mouse, rat, dog). Shown is the maximum likelihood estimate for the gene C8B. Each branch is labeled with the corresponding estimate of ω.
However, branch models have relatively poor power to detect
selection (54) in comparison to branch-site models that are discussed in the next section. Also note that testing of multiple
hypotheses on the same data requires a correction, so the overall
false-positive rate is kept at the required level (most often 5%).
Correction for multiple testing further reduces the power of the
method, especially when many hypotheses are tested simultaneously (see discussion later).
3.4.6. Temporal and Spatial Variation of Selective Pressure

Several solutions were proposed to simultaneously account for
differences in selective constraints among codons and the episodic
nature of molecular evolution at individual sites. One of the first
models, model MA (45), assumes four site classes. Two classes contain sites evolving constantly over time: one under purifying selection with ω0 < 1 and another with ω1 = 1. The other two site classes allow selective pressure at a site to change over time on a prespecified set of branches, known as the foreground. The two variable classes are derived from the constant classes so that sites typically evolving with ω0 < 1 or ω1 = 1 are allowed to be under positive selection with ω2 ≥ 1 on the foreground. Testing for positive selection on the rodent clade involves an LRT comparing a constrained version of MA (with ω2 = 1) vs. an unconstrained
MA model. Compared to branch models, the branch-site formulation improves the chance of detecting short spells of adaptive
pressure in the past even if these occurred at a small fraction of sites.
Returning to our example of gene C8B of the complement
pathway, we perform a branch-site LRT for positive selection by comparing the constrained MA model (ω2 = 1) with the unconstrained MA model. Thereby, we take the mouse and the rat lineage, in turn, as foreground branches, and all other branches as background branches. Significance of the LRT is tested using the mixture ½χ₀² + ½χ₁², with critical value 2.71 at the 5% level. For the C8B gene, we calculate 2 × (log L2 - log L1) = 2 × 2.23 = 4.46 for the mouse lineage and 11.2 for the rat lineage; both values exceed 2.71, indicating positive selection on both rodent lineages.
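P-values under this boundary mixture are straightforward to compute; a sketch (the helper function name is ours, purely illustrative):

```python
from scipy.stats import chi2

def mixture_pvalue(lrt_stat):
    # Null is the mixture (1/2)*chi2_0 + (1/2)*chi2_1: a point mass at
    # zero plus half the tail weight of a 1-df chi-square.
    return 0.5 * chi2.sf(lrt_stat, df=1) if lrt_stat > 0 else 1.0

for lineage, stat in [("mouse", 4.46), ("rat", 11.2)]:
    print(lineage, mixture_pvalue(stat))

# Both statistics exceed the 5% critical value 2.71 quoted in the text.
assert mixture_pvalue(4.46) < 0.05 and mixture_pvalue(11.2) < 0.05
```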
A major drawback of the described branch-site models is their
reliance on a biologically viable a priori hypothesis. In the context
of detecting sites and lineages affected by positive selection, one
possible solution is to perform multiple branch-site LRTs, each
setting a different branch at the foreground (55). In the example
of six species (Fig. 3), a total of nine tests (for an unrooted tree) are
necessary in the absence of an a priori hypothesis. Multiple test
correction has to be applied to control excessive false inferences.
This strategy tends to be conservative but can be sufficiently powerful in detecting episodic instances of adaptation. As with all
model-based techniques, precautions are necessary for data with
unusual heterogeneity patterns, which may cause deviations from
the asymptotic null distribution and thus result in an elevated false-positive rate.
In the case of episodic selection where any combination of
branches of a phylogeny can be affected, a Bayesian approach in
lieu of the standard LRTs and multiple testing has been suggested.
The multiple LRT approach is most concerned with controlling the
false-positive rate of selection inference, and is less suited to infer
the best-fitting selection history. In the hypothetical example
(Fig. 3), a total of 2⁹ - 1 = 511 selection histories (excluding
the history without selection on any branch) need to be considered.
The Bayesian analysis allows a probability distribution over possible
selection histories to be computed, and therefore permits estimates
of prevalence of positive selection on individual branches and
clades. Such approach evaluates uncertainty in selection histories
using their posterior probabilities and allows robust inference of
interesting parameters, such as the switching probabilities for gains
and losses of positive selection (43).
Other models (e.g., with dS variation among sites (56)) may also be extended to allow changes of selective regimes on different
branches. This is achieved by adding further parameters, one per
branch, describing the deviation of selective pressure on a branch
from the average level on the whole tree under the site model. Such a model is parameter-rich and can be used for exploratory purposes
on data with long sequences, but does not provide a robust way of
testing whether o > 1 on a branch is due to positive selection on a
lineage or due to inaccuracy of the ML estimation.
Kosakovsky Pond and Frost (56) suggested detecting lineage-specific variation in selective pressure using a genetic algorithm (GA), a computational analogue of evolution by natural selection.
The GA approach was successfully applied to phylogenetic reconstruction (see refs. 57, 58, and 59). In the context of detecting
lineage-specific positive selection, GA does not require an a priori
hypothesis. Instead, the algorithm samples regions of the whole
hypothesis space according to their fitness, measured by AICc.
The branch-model selection with GA may also be adapted to incorporate dN and dS among-site variation, although this imposes a
much heavier computational burden.
In branch and branch-site models, change in selection regime is
always associated with nodes of a tree, but the selective pressure
remains constant over the length of each branch. Guindon et al. (60)
proposed a Markov-modulated model, where switches of selection
regimes may occur at any site and any time on the phylogeny. In a
covarion-like manner, this codon model combines two Markov
processes: one governs the codon substitution while the other
specifies rates of switches between selective regimes. These models
can be used to study the patterns of the changes in selective pressures over time and across sites by estimating the relative rates of
changes between different selective regimes (purifying, neutral, and positive).
3.5. Software

The software package PHylogenetic Analysis with Space/Time models (PHAST) includes several phylo-HMM-based programs. Two programs in PHAST are particularly interesting in the context of
selection studies: PhastCons is a program for conservation scoring
and identification of conserved elements (61). PhyloP is designed
to compute p-values for conservation or acceleration, either lineage
specific or across all branches (62). PHAST is designed for use on
DNA sequences only.
A variety of codon models to detect selection, including
branch-site models and the recent selection-mutation model, are
implemented in the CODEML program of PAML (63). HYPHY
is another implementation that includes a large variety of codon
models (64). FitModel is the ML implementation of the switching codon model (60). Selecton Web server (65) offers several
site models as well as the combined model described in Doron-Faigenboim and Pupko (40).
Xrate (66) is a generic tool to implement complex probabilistic
models in the form of context-free stochastic grammars. Grammars
for codon models can be defined such that they lead to estimates consistent with those of PAML, but also allow for features of particular proteins (e.g., see the analysis of transmembrane proteins (67)). However,
Xrate is slower than PAML.

4. Notes/Discussion
With the wider use of codon models to detect selection, some
questioned the statistical basis of testing based on branch-site models.
In 2004, Zhang found that the original branch-site test (68)
produced excessive false positives when its assumptions were not
met. The modified branch-site test was shown to be more robust to
model violations (see refs. 45 and 69), and is now commonly used in
genome-wide selection scans (e.g., see ref. 70). Recently, however,
another simulation study by Nozawa et al. (71) suggested that this
modification also showed an excess of false positives. Yang and
Dos Reis (54) defended the branch-site test by examining the null
distribution and showing that Nozawa and colleagues (71) misinterpreted their simulation results. However, it is clear that even tests
with good statistical properties are affected by data quality and the
extent of model violations. Below, we list factors that can affect
the test, and so should be taken into account when analyzing
genome-wide data.

C. Kosiol and M. Anisimova

4.1. Quality of Multiple Alignments
The impact of sequence and alignment quality is a major concern when performing positive selection scans. For example, in their analysis of 12 genomes, Markova-Raina and Petrov (72) found that the results were highly sensitive to the choice of alignment method. Furthermore, visual analysis indicated that most sites inferred as positively selected were in fact misaligned at the codon level. The rate of false positives ranged from ~50% upward, depending on the aligner used. Some of these results can be ascribed to the high divergence of the 12 Drosophila species, and could be addressed by better filtering of the data. Nevertheless, problems have been observed even in mammals, where alignment is easier.
Bakewell et al. (73) used the branch-site test to analyze ~14,000
genes from the human, chimpanzee, and macaque, and detected
more genes to be under positive selection on the chimpanzee lineage than on the human lineage (233 vs. 154). The same pattern was
also observed by Arbiza et al. (74) and Gibbs et al. (75). Mallick
et al. (76) reexamined 59 genes detected to be under positive
selection on the chimpanzee lineage by Bakewell et al. (73), using
more stringent filters to remove less reliable nucleotides and using
synteny information to remove misassembled and misaligned
regions. They found that with improved data quality, the signal of
positive selection disappeared in most of the cases when the branch-site test was applied. It now appears that, as suggested by
Mallick et al. (76), the earlier discovery of more frequent positive
selection on the chimpanzee lineage than on the human lineage is
an artifact of the poorer quality of the chimpanzee genomic
sequence. This interpretation is also consistent with a few recent
studies analyzing both real and simulated data, which suggest that
sequence and alignment errors may cause excessive false positives
(see refs. 77 and 78). Indeed, most commonly used alignment
programs tend to place nonhomologous codons or amino acids
into the same column (see refs. 79 and 80), generating the wrong
impression that multiple nonsynonymous substitutions occurred at
the same site and misleading the codon models into detecting
positive selection (78).
It appears very challenging to develop a test of positive selection that is robust to errors in the sequences or alignments. Instead, we advise carefully checking the alignments of genes found to be putatively under selection by any of the methods described here.

4.2. Overlapping Reading Frames

Another line of development in modeling the evolution of protein-coding genes concerns evaluating selective pressures on overlapping reading frames (ORFs). In particular, viruses are known to frequently encode genes with overlapping ORFs to maximize the information content of their short genomes. This may increase codon bias and affect evolutionary constraints on overlapping regions. Indeed, regions of genes that encode several protein products evolve under the constraints imposed on each frame, which is disregarded in standard codon models. Although less common, overlapping ORFs are also found in eukaryotic genomes.
Some solutions for modeling overlapping regions have been proposed. A nonstationary model can fully accommodate complex site dependencies caused by ORFs and other effects, such as methylation, but requires a conditional Markov process of a higher order with a 61^N x 61^N instantaneous rate matrix, so that instantaneous rates at a base depend on the neighboring nucleotide states (see refs. 81 and 82). ML parameter estimation is analytically intractable for such a model. When applied only to pairs of sequences, the model requires MCMC for parameter estimation. To speed up the computation under such a site-dependent model, an approximate estimation method can be used, based on the pseudo-likelihood via the expectation-maximization (EM) algorithm (83). The process of context-dependent substitution may be extended to a general phylogeny at the expense of limiting the full process-based Jensen-Pedersen model (84). A second-order Markov process running at the tips of a tree is an approximation, since interdependencies in the ancestral sequences are ignored. The likelihood is calculated with a modified pruning algorithm and optimized with EM.
Instead, computationally simple approximations may be used. For example, Sabath, Landan, and Graur (85) extended the simple GY codon model to accommodate different average selective pressures in two overlapping genes, using an additional ω-parameter for the second gene. This model, however, assumes a multiplicative selective effect in the overlap and uniform selective pressure within each gene. Another alternative is to define a phylo-HMM whose hidden classes are the degeneracy classes describing the possible outcomes of overlapping ORFs (see refs. 86, 87, and 88). Such a phylo-HMM also assumes that selective pressure is constant over time and along the sequence, and that the degeneracy of a site is constant over time. It is not known whether these assumptions are more detrimental to estimates of selective pressure in overlapping genes than those made in the model of Sabath et al. (85). Further improvements in codon models are needed to describe the evolution of overlapping ORFs more realistically and to provide more accurate estimates of selection in gene regions with overlapping frames.
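To make the notion of frame-dependent constraint concrete, the toy sketch below (the sequence, position, and frame offset are invented for illustration) classifies a single nucleotide change as synonymous or nonsynonymous separately in each of two overlapping reading frames:

```python
# Standard genetic code built from the canonical TCAG codon ordering.
bases = "TCAG"
aas = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {b1 + b2 + b3: aas[i]
               for i, (b1, b2, b3) in enumerate(
                   (x, y, z) for x in bases for y in bases for z in bases)}

def effect(seq, pos, new_base, frame):
    """Classify a point mutation at 0-based pos as syn/nonsyn in a frame."""
    start = (pos - frame) // 3 * 3 + frame   # start of the affected codon
    if start < 0 or start + 3 > len(seq):
        return "outside"
    old = seq[start:start + 3]
    new = old[:pos - start] + new_base + old[pos - start + 1:]
    return "syn" if codon_table[old] == codon_table[new] else "nonsyn"

# Toy example: two reading frames of the same strand, offset by one base.
seq = "ATGGCTTGGAAA"
pos, new_base = 5, "C"   # T -> C at position 5
print(effect(seq, pos, new_base, 0), effect(seq, pos, new_base, 1))
```

In this example the change is silent in frame 0 (GCT to GCC, both Ala) but amino acid altering in frame 1 (CTT to CCT, Leu to Pro), illustrating why standard single-frame codon models misjudge constraints in overlapping regions.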
4.3. Recombination

Most codon models assume a single phylogeny and a constant synonymous rate among sites, implying that rate variation among codons is solely due to variation of the nonsynonymous rate. Recent studies question whether such assumptions are generally realistic; for example, ref. 89 suggested that failure to account for synonymous rate variation may be one of the reasons why LRTs for positive selection are vulnerable on data with high recombination rates. Some selection scans try to control this problem by checking putatively selected genes for recombination, either manually or automatically with traditional detection software (e.g., RDP (90)). Drummond and Suchard (91) have also recently developed a Bayesian approach to detect recombination within a gene.
Another approach is to consider recombination explicitly. For example, Scheffler, Martin, and Seoighe (92) extended codon models with both dN and dS site variation and allowed changes of topology at the detected recombination breakpoints. Certainly, fast-evolving pathogens (such as viruses) undergo frequent recombination, which often changes either the whole shape of the underlying tree or only the apparent branch lengths. While the efficiency of the approach depends on the success of inferring recombination breakpoints, the study demonstrated that taking alternative topologies into account achieves a substantial decrease in false-positive inferences of selection while maintaining reasonable power. In a related development, Wilson and McVean (93) used an approximation to a population genetics coalescent with selection and recombination. Inference was performed on both parameters simultaneously using a Bayesian approach with reversible-jump MCMC.
4.4. Biased Gene Conversion

Mutation rate variation can also cause genomic regions to have different substitution rates without any change in fixation rate. Recent studies of guanine and cytosine (GC) isochores in the mammalian genome have suggested the importance of another selectively neutral evolutionary process that affects nucleotide evolution. As described in the work of Laurent Duret and others (see refs. 94 and 95), biased gene conversion (BGC) is a mechanism caused by the mutagenic effects of recombination combined with the preference in recombination-associated DNA repair toward strong (GC) versus weak (adenine and thymine [AT]) nucleotide pairs at non-Watson-Crick heterozygous sites in heteroduplex DNA during crossover in meiosis. Thus, beginning with random mutations, BGC results in an increased probability of fixation of G and C alleles. In particular, methods looking for accelerated regions in coding DNA, but also codon models, cannot distinguish positive selection from BGC (see refs. 96 and 97). Therefore, putatively selected genes should be checked for GC content and for closeness to recombination hot spots and telomeres. A recent study by Yap et al. (98) suggests that modeling nucleotide target frequencies conditional on the other nucleotides in the codon should help to alleviate codon-dependent biases, such as BGC and CpG biases.

4.5. Selection on Synonymous Sites

Most selection studies to date have focused on detecting selection on the protein, since synonymous changes are often presumed neutral and thus unaffected by selective pressures. However, selection on synonymous sites was documented more than a decade ago. Codon usage bias is known to affect the majority of genes and species. In his seminal work, Akashi (99) demonstrated purifying selection on genes of D. melanogaster, where strong codon bias favoring certain (optimal) codons serves to increase translational accuracy. Pressure to optimize for translational efficiency, robustness, and kinetics leads to synonymous codon bias, which has been shown to widely affect mammalian genes (100), as well as genes of fast-evolving pathogens such as viruses (101). Positive selection on synonymous sites was unheard of until recently, when Resch et al. (102) conducted a large-scale study of selection on synonymous sites in mammalian genes. They measured selection by comparing the average rate of synonymous substitutions (dS) to the average substitution rate in the corresponding introns (dI). While purifying selection was found to affect 28% of genes (dS/dI < 1), 12% of genes were found to have been affected by positive selection on synonymous sites (dS/dI > 1). The signal of positive selection correlated with lower predicted mRNA stability compared to genes with negative selection on synonymous sites, suggesting that mRNA destabilization (affecting mRNA levels and translation) could be driving positive selection on synonymous sites.
An increasing number of experimental studies may now explain how synonymous mutations can be affected by positive or negative selection. Codon bias matching skews in tRNA abundances may influence translation (103). Changes at silent sites can disrupt splicing control elements and create new cryptic splice sites; mRNA and transcript stability can also be affected through preference for, or avoidance of, certain sequence motifs (see refs. 104 and 100). Silent changes may affect gene regulation via constraints for efficient binding of miRNA to sense mRNA (see refs. 105 and 100). The co-translational protein folding hypothesis suggests that speed-dependent protein folding may be another source of selective pressure (106), because slower production could cause the protein to take an altered final form (as has been shown for multidrug resistance-1 (107)). Finally, synonymous changes may act to modulate expression by altering mRNA secondary structure, affecting protein abundance (108).
Models of codon evolution currently provide the best approach for studying selection on silent sites. In particular, models with variable synonymous rates (see refs. 64 and 109) may be applied to evaluate the extent of variability of synonymous rates in a gene and to predict the positions of the most conserved and most variable synonymous sites (for example, see ref. 101). Whether or not a site has been affected by selection requires further testing. For example, Zhou, Gu, and Wilke (110) suggested distinguishing two types of synonymous substitution rates: the rate of conserving synonymous changes dSC (between preferred codons or between rare codons) and the rate of nonconserving synonymous changes dSN (between codons from the two different groups, rare and preferred). Silent sites with dSN/dSC > 1 may be considered to be under positive selection, and significance can be tested based on an LRT. Alternatively, synonymous rates at sites may be compared to the mean substitution rate in the corresponding intron, which can be implemented in a joint codon and DNA model, similar to the approach proposed by Wong and Nielsen (111).
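The conserving/nonconserving distinction can be sketched as a small classifier; the preferred-codon set below is hypothetical and would in practice be estimated from codon usage in highly expressed genes of the organism:

```python
# Hypothetical preferred-codon set, for illustration only.
preferred = {"CTG", "GCC", "AAG"}   # e.g. optimal Leu, Ala, Lys codons

def synonymous_class(old_codon, new_codon):
    """Conserving: both codons preferred, or both rare; nonconserving:
    a change between the preferred and rare groups."""
    return ("conserving"
            if (old_codon in preferred) == (new_codon in preferred)
            else "nonconserving")

print(synonymous_class("CTG", "CTC"))  # preferred -> rare
print(synonymous_class("CTT", "CTC"))  # rare -> rare
```

Counts of such changes along a phylogeny feed the dSN and dSC estimates, whose ratio is then tested against 1 with an LRT.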
While selection on codon usage bias is typically studied with various codon adaptation indices (see ref. 112 for a review), several codon models have been developed for this task (see refs. 113, 114, and 115). The mutation-selection models include selective and mutational effects separately and allow estimating the fitness effects of various codon changes. The relative rate of substitution of selected mutations to neutral mutations is given by ω = 2γ/(1 − e^(−2γ)), where γ = 2Ns is the scaled selection coefficient (see Exercise 3 for a derivation). Nielsen et al. (114) assumed that all changes between preferred and rare codons have the same fitness effect (and so the same selection coefficient). They used one selection coefficient for optimal codon usage for each branch of a phylogeny, and estimated these jointly with the ω-ratio by ML. Using this approach to study ancestral codon usage bias, Nielsen et al. (114) confirmed the reduction in selection for optimal codon usage in D. melanogaster. In contrast, Yang and Nielsen (2008) estimated individual codon fitness parameters and used them to estimate optimal codon frequencies for a gene across multiple species. An LRT is used to test whether the codon bias is due to mutational bias alone. Finally, one remarkable contribution of the mutation-selection models is the connection they make between interspecific and population parameters. Exploiting this further should provide insights into how changing demographic factors influence observed intraspecific patterns.

5. Outlook: Selection Scans Using Population Data

By modeling genome evolution as a process by which a single genome sequence mutates along the branches of a species phylogeny, standard phylogenetic methods reduce entire populations to single points in genotypic space. In reality, each population consists of many individuals (or, more precisely, chromosomes from these individuals) that are related by trees of genetic ancestry known as genealogies. With the publication of large amounts of genome-wide polymorphism data, it is now possible to study the role of advantageous mutations. Many population genomic techniques can be applied to noncoding and coding regions. Here, we focus on scans for selection acting on protein-coding genes. Methods for the analysis of noncoding regions are discussed in Chapter 6 of this volume (116).

5.1. Neutrality Tests with a Focus on Protein-Coding Genes

Many methods have been proposed for population data. Tajima's D-test (for DNA data) compares the estimate of the population-scaled mutation rate based on the number of pairwise differences with that based on the number of segregating sites in a sample (117). Under neutrality, Tajima's D ≈ 0, and significant deviations may indicate a selective sweep (D < 0) or balancing selection (D > 0). Other neutrality tests are based on a similar idea but use different summary statistics (e.g., see refs. 118 and 119). The Hudson-Kreitman-Aguadé (HKA) test for DNA data evaluates the neutral hypothesis by comparing variability within and between species for two or more loci (120). Under neutrality, levels of polymorphism (variability within species) and divergence (variability between species) should be proportional to the mutation rate, resulting in a constant polymorphism-to-divergence ratio. Tests of selective neutrality based solely on simple summary statistics are successful at rejecting the strictly neutral model, but are sensitive to demographic assumptions, such as constant population size, no population structure, and no migration (see refs. 121 and 122). While simple neutrality tests are not specific to coding data, performing such tests separately for synonymous and nonsynonymous changes can potentially help to separate selective and demographic effects. Indeed, the popular McDonald-Kreitman (MK) test for protein-coding data exploits the underlying idea of the HKA test, but classifies the observed changes into synonymous and nonsynonymous (123). The MK test compares the ratio of nonsynonymous (amino acid altering) to synonymous (silent) substitutions within and between species, which should be the same in the absence of selection. This test is more robust to demographic assumptions, as the effect of the demographic model should be the same for both nonsynonymous and synonymous sites (122). Whereas the population demographic process is expected to affect all genomic loci, selection should be nonuniform. Several studies (see refs. 124, 125, and 126) took a genomic approach and confirmed that polymorphism-to-divergence ratios differed significantly only for a few genes, although the high amounts of inferred adaptation exceeded expectations.
Apart from biasing the mutation frequency distribution, selection may also affect the distribution of genealogical shapes in population data. Drummond and Suchard (91) proposed a Bayesian test for neutrality that takes into account the distribution of genealogical shapes and can test for both selection and recombination. Such a test should be particularly relevant for protein-coding sequences, where most selection is expected to operate. More generally, methods that use information from both the mutation frequency spectrum and the shape of the genealogies are expected to be more powerful than either source of information used individually.
Unlike neutrality tests that do not explicitly model selection, the Poisson random-field framework (see refs. 127-131) enables estimation of mutation and selection parameters in various population genetics scenarios. The rationale behind the approach is that natural selection alters the site-frequency spectrum, making it possible to estimate the strength of selection that has contributed to the observed deviation from neutrality. Boyko et al. (132) estimated that ~10% of amino acid changes in humans are adaptive, while the proportion of adaptively driven substitutions is higher than 50% in some microorganisms and in Drosophila (see refs. 125, 133, and 134). Moreover, current estimates might be biased downward in the presence of slightly deleterious mutations and decreasing population size (135).
Recently, Gutenkunst et al. (136) have developed methods for multidimensional site-frequency spectra. These allow the joint inference of the demographic history of multiple populations. Nielsen et al. (137) used a two-dimensional site-frequency spectrum to study the Darwinian and demographic forces acting on protein-coding genes from two human populations. In the future, we can expect to study selection on protein-coding genes in more populations and more species, as new sequencing technologies and new methods for detecting selection in population data are developed.

6. Exercises
Q1. Amino acid and codon substitution models: How many
parameters need to be estimated in the instantaneous rate matrix
Q defining a reversible empirical AA model? How many such parameters are necessary to estimate for a reversible empirical codon
model? How many parameters are to be estimated in both cases if a
model is nonreversible?
Q2. Positive selection scans: Go to the UCSC genome browser (http://genome.ucsc.edu). Search for HAVCR1 (hepatitis A virus cellular receptor 1) in the human genome (assembly NCBI36/hg18) belonging to the mammalian clade.
Genome browser tracks provide a summary of previous analyses of coding regions. Switch "Pos Sel Genes" under "Genes and Gene Prediction Tracks" to "full" and collect information on the LRTs that were performed for the six-species scan. Next, switch "17-Way Cons" under "Comparative Genomics" to "full". Why are only a few bases in the HAVCR1 gene conserved? Is this consistent with the results obtained by the LRTs?
Click on the "Conservation" track to retrieve the multiple sequence alignment for the HAVCR1 gene. Use the PAML software (http://abacus.gene.ucl.ac.uk/software/paml.html) to test the models for positive selection on any lineage of the mammalian tree by comparing models M1a and M2a with an LRT.

Use PAML to identify sites under positive selection by using the BEB approach. Do you find the same sites to be under selection as in Fig. 3 of Kosiol et al. (43)?
Q3. Selection-mutation models: Models incorporating selection and mutation rely on a theoretical relationship between the nonsynonymous-synonymous rate ratio ω and the scaled selection coefficient γ = 2Ns. The probability that a new mutation eventually becomes fixed is

Pr(fixation) = (1 − e^(−2s)) / (1 − e^(−4Ns)) ≈ 2s / (1 − e^(−4Ns)),

if we assume that the selection coefficient s is small and that N, the effective population size, is large and constant in time (138). Furthermore, assume that synonymous substitutions are neutral and that all nonsynonymous substitutions have equal (and small) selection coefficients. Derive the relationship

ω = 4Ns / (1 − e^(−4Ns)) = 2γ / (1 − e^(−2γ)),

which combines phylogenetic with population genetic quantities and is crucial for mutation-selection models.
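Once derived, the relationship can be checked numerically (a sketch, not part of the derivation): ω tends to 1 as γ approaches 0 (neutrality), exceeds 1 for positive γ, and falls below 1 for negative γ:

```python
from math import exp

def omega(gamma):
    """Expected dN/dS under the mutation-selection model, gamma = 2Ns."""
    if gamma == 0.0:
        return 1.0  # limit of 2*gamma / (1 - exp(-2*gamma)) as gamma -> 0
    return 2.0 * gamma / (1.0 - exp(-2.0 * gamma))

for g in (-2.0, -0.001, 0.001, 2.0):
    print(g, omega(g))
```

The printed values show ω well below 1 for strongly deleterious changes, near 1 for nearly neutral ones, and above 1 for advantageous ones.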

Acknowledgments
C.K. is supported by the University of Veterinary Medicine Vienna.
M.A. is supported by the ETH Zurich and also receives funding from
the Swiss National Science Foundation (grant 31003A_127325).
References

1. Pal C, Papp B, Lercher MJ (2006) An integrated view on protein evolution. Nature Rev Genet 7:337-348
2. Flicek P, Aken BL, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Gräf S, Haider S, Hammond M, Howe K, Jenkinson A, Johnson N, Kahari A, Keefe D, Keenan S, Kinsella R, Kokocinski F, Koscielny G, Kulesha E, Lawson D, Longden I, Massingham T, McLaren W, Megy K, Overduin B, Pritchard B, Rios D, Ruffier M, Schuster M, Slater G, Smedley D, Spudich G, Tang YA, Trevanion S, Vilella A, Vogel J, White S, Wilder SP, Zadissa A, Birney E, Cunningham F, Dunham I, Durbin R, Fernandez-Suarez XM, Herrero J, Hubbard TJ, Parker A, Proctor G, Smith J, Searle SM (2010) Ensembl's 10th year. Nucleic Acids Res 38:D557-D562
3. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ (2011) The UCSC Genome Browser database: update 2011. Nucleic Acids Res 39:D876-D882
4. Altenhoff AM, Dessimoz C (2012) Inferring orthology and paralogy. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
5. Lee H, Tang H (2012) Next generation sequencing technology and fragment assembly algorithms. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
6. Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Xuan Z, Wang W, Li J et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463:311-317
7. Posada D, Crandall KA (2002) The effect of recombination on the accuracy of phylogenetic estimation. J Mol Evol 54:396-402
8. Sawyer S (1989) Statistical tests for detecting gene conversion. Mol Biol Evol 6:526-538
9. Semple C, Wolfe KH (1999) Gene duplication and gene conversion in the Caenorhabditis elegans genome. J Mol Evol 48:555-564
10. Doolittle WF (1999) Phylogenetic classification and the universal tree. Science 284:2124-2129
11. Robinson DM, Jones DT, Kishino H, Goldman N, Thorne JL (2003) Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20:1692-1704
12. Choi SC, Hobolth A, Robinson DM, Kishino H, Thorne JL (2007) Quantifying the impact of protein tertiary structure on molecular evolution. Mol Biol Evol 24:1769-1782
13. Keilson J (1979) Markov chain models: rarity and exponentiality. Springer, New York
14. Pollard KS, Salama SR, King B, Kern AD, Dreszer T, Katzman S, Siepel A, Pedersen JS, Bejerano G, Baertsch R, Rosenbloom KR, Kent J, Haussler D (2006) Forces shaping the fastest evolving regions in the human genome. PLoS Genetics 2(10):e168
15. Holloway AK, Begun DJ, Siepel A, Pollard K (2008) Accelerated sequence divergence of conserved genomic elements in Drosophila melanogaster. Genome Res 18:1592-1601
16. Miyamoto MM, Fitch WM (1995) Testing the covarion hypothesis of molecular evolution. Mol Biol Evol 12:503-513
17. Lockhart PJ, Steel MA, Barbrook AC, Huson DH, Charleston MA, Howe CJ (1998) A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol 15:1183-1188
18. Penny D, McComish BJ, Charleston MA, Hendy MD (2001) Mathematical elegance with biochemical realism: the covarion model of molecular evolution. J Mol Evol 53:711-753
19. Siltberg J, Liberles DA (2002) A simple covarion-based approach to analyse nucleotide substitution rates. J Evol Biol 15:588-594
20. Lichtarge O, Bourne HR, Cohen FE (1996) An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257:342-358
21. Gu X (1999) Statistical methods for testing functional divergence after gene duplication. Mol Biol Evol 16:1664-1674
22. Armon A, Graur D, Ben-Tal N (2001) ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 307:447-463
23. Gaucher EA, Gu X, Miyamoto MM, Benner SA (2002) Predicting functional divergence in protein evolution by site-specific rate shifts. Trends Biochem Sci 27:315-321
24. Pupko T, Galtier N (2002) A covarion-based method for detecting molecular adaptation: application to the evolution of primate mitochondrial genomes. Proc Biol Sci 269:1313-1316
25. Blouin C, Boucher Y, Roger AJ (2003) Inferring functional constraints and divergence in protein families using 3D mapping of phylogenetic information. Nucleic Acids Res 31:790-797
26. Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N (2005) ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 33:W299-W302
27. Gu X (2001) Maximum-likelihood approach for gene family evolution under functional divergence. Mol Biol Evol 18:453-464
28. Gu X (2006) A simple statistical method for estimating type-II (cluster-specific) functional divergence of protein sequences. Mol Biol Evol 23:1937-1945
29. Siepel A, Haussler D (2004) Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol 11:413-428
30. Siepel A, Haussler D (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:468-488
31. Bofkin L, Goldman N (2007) Variation in evolutionary processes at different codon positions. Mol Biol Evol 24:513-521


32. Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335:167-170
33. Yang Z, Nielsen R (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17:32-43
34. Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725-736
35. Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11:715-724
36. Grantham R (1974) Amino acid difference formula to help explain protein evolution. Science 185:862-864
37. Yang Z (1998) Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol Biol Evol 15:568-573
38. Schneider A, Cannarozzi GM, Gonnet GH (2005) Empirical codon substitution matrix. BMC Bioinformatics 6:134
39. Kosiol C, Holmes I, Goldman N (2007) An empirical codon model for protein sequence evolution. Mol Biol Evol 24:1464-1479
40. Doron-Faigenboim A, Pupko T (2007) A combined empirical and mechanistic codon model. Mol Biol Evol 24:388-397
41. Whelan S, Goldman N (1999) Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol Biol Evol 16:1292-1299
42. Anisimova M, Bielawski JP, Yang Z (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585-1592
43. Kosiol C, Vinar T, Da Fonseca RR, Hubisz MJ, Bustamante CD, Nielsen R, Siepel A (2008) Patterns of positive selection in six mammalian genomes. PLoS Genet 4:e1000144
44. Anisimova M, Bielawski JP, Yang Z (2002) Accuracy and power of Bayes prediction of amino acid sites under positive selection. Mol Biol Evol 19:950-958
45. Yang Z, Wong WS, Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107-1118
46. Yang Z, Nielsen R, Goldman N, Pedersen AMK (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449

47. Huelsenbeck JP, Dyer KA (2004) Bayesian estimation of positively selected sites. J Mol Evol 58:661-672
48. Scheffler K, Seoighe C (2005) A Bayesian model comparison approach to inferring positive selection. Mol Biol Evol 22:2531-2540
49. Aris-Brosou S, Bielawski JP (2006) Large-scale analyses of synonymous substitution rates can be sensitive to assumptions about the process of mutation. Gene 378:58-64
50. Massingham T, Goldman N (2005) Detecting amino acid sites under positive selection and purifying selection. Genetics 169:1753-1762
51. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD (2006) GARD: a genetic algorithm for recombination detection. Bioinformatics 22:3096-3098
52. Kosakovsky Pond SL, Posada D, Gravenor MB, Woelk CH, Frost SD (2006) Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol 23:1891-1901
53. Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland, Massachusetts
54. Yang Z, Dos Reis M (2011) Statistical properties of the branch-site test of positive selection. Mol Biol Evol 28:1217-1228
55. Anisimova M, Yang Z (2007) Multiple hypothesis testing to detect lineages under positive selection that affects only a few sites. Mol Biol Evol 24:1219-1228
56. Kosakovsky Pond SL., and Frost SD (2005) A
genetic algorithm approach to detecting lineage-specific variation in selection pressure.
Mol Biol Evol 22:478485
57. Lemmon AR, and Milinkovitch MC (2002)
The metapopulation genetic algorithm: An
efficient solution for the problem of large
phylogeny estimation. Proc Natl Acad Sci
U S A 99:1051610521
58. Jobb G, von Haeseler A, and Strimmer K
(2004) TREEFINDER: a powerful graphical
analysis environment for molecular phylogenetics. BMC Evol Biol 4:18
59. Zwickl DJ (2006) Genetic algorithm
approaches for the phylogenetic analysis of
large biological sequence datasets under the
maximum likelihood criterion. PhD dissertation, The University of Texas, Austin.
60. Guindon S.A, Rodrigo G, Dyer KA, Huelsenbeck JP (2004) Modeling the site-specific variation of selection patterns along lineages.
Proc Natl Acad Sci U S A 101:1295712962
C. Kosiol and M. Anisimova

61. Siepel A, Bejerano G, Pedersen JS, Hinrichs A, Hou M, Rosenbloom K, Clawson H, Spieth J, Hillier LW, Richards S, Weinstock GM, Wilson RK, Gibbs RA, Kent WJ, Miller W, Haussler D (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034–1050
62. Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A (2010) Detection of non-neutral substitution rates on mammalian phylogenies. Genome Res 20:110–121
63. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591
64. Kosakovsky Pond SL, Muse SV (2005) Site-to-site variation of synonymous substitution rates. Mol Biol Evol 22:2375–2385
65. Stern A, Doron-Faigenboim A, Erez E, Martz E, Bacharach E, Pupko T (2007) Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach. Nucleic Acids Res 35:W506–511
66. Klosterman PS, Uzilov AV, Bendana YR, Bradley RK, Chao S, Kosiol C, Goldman N, Holmes I (2006) XRate: a fast prototyping, training and annotation tool for phylo-grammars. BMC Bioinformatics 7:428
67. Heger A, Ponting CP, Holmes I (2009) Accurate estimation of gene evolutionary rates using XRATE, with an application to transmembrane proteins. Mol Biol Evol 26:1715–1721
68. Yang Z, Nielsen R (2002) Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol 19:908–917
69. Zhang J, Nielsen R, Yang Z (2005) Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol 22:2472–2479
70. Vamathevan JJ, Hasan S, Emes RD, Amrine-Madsen H, Rajagopalan D, Topp SD, Kumar V, Word M, Simmons MD, Foord SM, Sanseau P, Yang Z, Holbrook JD (2008) The role of positive selection in determining the molecular cause of species differences in disease. BMC Evol Biol 8:273
71. Nozawa M, Suzuki Y, Nei M (2009) Reliabilities of identifying positive selection by the branch-site and site-prediction methods. Proc Natl Acad Sci USA 106:6700–6705
72. Markova-Raina P, Petrov D (2011) High sensitivity to aligner and high rate of false positives in the estimates of positive selection in 12 Drosophila genomes. Genome Res. doi:10.1101/gr.115949.110
73. Bakewell MA, Shi P, Zhang J (2007) More genes underwent positive selection in chimpanzee than in human evolution. Proc Natl Acad Sci USA 104:E97
74. Arbiza L, Dopazo J, Dopazo H (2006) Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome. PLoS Comput Biol 2:e38
75. Gibbs RA, Rogers J, Katze MG, Bumgarner R, Weinstock GM, Mardis ER, Remington KA, Strausberg RL, Venter JC, Wilson RK et al. (2007) Evolutionary and biomedical insights from the macaque genome. Science 316:222–234
76. Mallick S, Gnerre S, Muller P, Reich D (2009) The difficulty of avoiding false positives in genome scans for natural selection. Genome Res 19:922–933
77. Schneider A, Souvorov A, Sabath N, Landan G, Gonnet GH (2009) Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol 1:114–118
78. Fletcher W, Yang Z (2010) The effect of insertions, deletions and alignment errors on the branch-site test of positive selection. Mol Biol Evol 27:2257–2267
79. Löytynoja A, Goldman N (2005) An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A 102:10557–10562
80. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents error in sequence alignment and evolutionary analysis. Science 320:1632–1635
81. Jensen JL, Pedersen AK (2000) Probabilistic models of DNA sequence evolution with context dependent rates of substitution. Adv Appl Probab 32:499–517
82. Pedersen AK, Jensen JL (2001) A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. Mol Biol Evol 18:763–776
83. Christensen OF, Hobolth A, Jensen JL (2005) Pseudo-likelihood analysis of context dependent codon substitution models. J Comp Biol 12:1166–1182
84. Siepel A, Haussler D (2004) Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. Mol Biol Evol 21:468–488
85. Sabath N, Landan G, Graur D (2008) A method for the simultaneous estimation of selection intensities in overlapping genes. PLoS One 3:e3996

5 Selection on the Protein-Coding Genome

86. De Groot S, Mailund T, Hein J (2007) Comparative annotation of viral genomes with non-conserved gene structure. Bioinformatics 23:1080–1089
87. McCauley S, Hein J (2006) Using hidden Markov models (HMMs) and observed evolution to annotate ssRNA viral genomes. Bioinformatics 22:1308–1316
88. McCauley S, de Groot S, Mailund T, Hein J (2007) Annotation of selection strength in viral genomes. Bioinformatics 23:2978–2986
89. Anisimova M, Nielsen R, Yang Z (2003) Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164:1229–1236
90. Martin DP, Williamson C, Posada D (2005) RDP2: recombination detection and analysis of sequence alignments. Bioinformatics 21:260–262
91. Drummond AJ, Suchard MA (2008) Fully Bayesian tests of neutrality using genealogical summary statistics. BMC Genet 9:68
92. Scheffler K, Martin DP, Seoighe C (2006) Robust inference of positive selection from recombining coding sequences. Bioinformatics 22:2493–2499
93. Wilson DJ, McVean G (2006) Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172:1411–1425
94. Duret L, Semon M, Piganeau G, Mouchiroud D, Galtier N (2002) Vanishing GC-rich isochores in mammalian genomes. Genetics 162:1837–1847
95. Meunier J, Duret L (2004) Recombination drives the evolution of GC content in the human genome. Mol Biol Evol 21:984–990
96. Berglund J, Pollard KS, Webster MT (2009) Hotspots of biased nucleotide substitutions in human genes. PLoS Biology 7:e26
97. Ratnakumar A, Mousset S, Glemin S, Berglund J, Galtier N, Duret L, Webster MT (2010) Detecting positive selection within genomes: the problem of biased gene conversion. Phil Trans Roy Soc B 365:2571–2580
98. Yap B, Lindsay H, Easteal S, Huttley G (2010) Estimates of the effect of natural selection on protein-coding content. Mol Biol Evol 27:726–734
99. Akashi H (1994) Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics 136:927–935
100. Chamary JV, Parmley JL, Hurst LD (2006) Hearing silence: non-neutral evolution at synonymous sites in mammals. Nat Rev Genet 7:98–108
101. Ngandu N, Scheffler K, Moore P, Woodman Z, Martin D, Seoighe C (2009) Extensive purifying selection acting on synonymous sites in HIV-1 Group M sequences. Virol J 5:160
102. Resch AM, Carmel L, Marino-Ramirez L, Ogurtsov AY, Shabalina SA, Rogozin IB, Koonin EV (2007) Widespread positive selection in synonymous sites of mammalian genes. Mol Biol Evol 24:1821–1831
103. Cannarozzi GM, Faty M, Schraudolph NN, Roth A, von Rohr P, Gonnet P, Gonnet GH, Barral Y (2010) A role for codons in translational dynamics. Cell 141:355–367
104. Hurst LD, Pal C (2001) Evidence of purifying selection acting on silent sites in BRCA1. Trends Genet 17:62–65
105. Chamary JV, Hurst LD (2005) Biased usage near intron-exon junctions: selection on splicing enhancers, splice site recognition or something else? Trends Genet 21:256–259
106. Komar AA (2008) Protein translational rates and protein misfolding: is there any link? In: O'Doherty CB, Byrne AC (eds) Protein misfolding: new research. Nova Science Publishers Inc, New York
107. Kimchi-Sarfaty C, Oh JM, Kim IW, Sauna ZE, Calcagno AM, Ambudkar SV, Gottesman MM (2007) A "silent" polymorphism in the MDR1 gene changes substrate specificity. Science 315:525–528
108. Nackley AG, Shabalina SA, Tchivileva IE, Satterfield K, Korchynskyi O, Makarov SS, Maixner W, Diatchenko L (2006) Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science 314:1930–1933
109. Mayrose I, Doron-Faigenboim A, Bacharach E, Pupko T (2007) Towards realistic codon models: among site variability and dependency of synonymous and non-synonymous rates. Bioinformatics 23:i319–327
110. Zhou T, Gu W, Wilke CO (2010) Detecting positive and purifying selection at synonymous sites in yeast and worm. Mol Biol Evol 27:1912–1922
111. Wong WSW, Nielsen R (2004) Detecting selection in non-coding regions of nucleotide sequences. Genetics 167:949–958
112. Roth A, Anisimova M, Cannarozzi GM (2011) Measuring codon usage bias. In: Cannarozzi G, Schneider A (eds) Codon evolution: mechanisms and models. Oxford University Press
113. Nielsen R, Yang Z (2003) Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNA. Mol Biol Evol 20:1231–1239
114. Nielsen R, Bauer DuMont VL, Hubisz MJ, Aquadro CF (2007) Maximum likelihood estimation of ancestral codon usage bias parameters in Drosophila. Mol Biol Evol 24:228–235
115. Yang Z, Nielsen R (2008) Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol 25:568–579
116. Zhen Y, Andolfatto P (2012) Detecting selection on non-coding genomic regions. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
117. Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595
118. Fu YX, Li WH (1993) Statistical tests of neutrality of mutations. Genetics 133:693–709
119. Fay JC, Wu CI (2000) Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413
120. Hudson RR, Kreitman M, Aguade M (1987) A test of neutral molecular evolution based on nucleotide data. Genetics 116:153–159
121. Wayne ML, Simonsen K (1998) Statistical tests of neutrality in the age of weak selection. Trends Ecol Evol 13:1292–1299
122. Nielsen R (2001) Statistical tests of selective neutrality in the age of genomics. Heredity 86:641–647
123. McDonald JH, Kreitman M (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652–654
124. Fay JC, Wyckoff GJ, Wu CI (2001) Positive and negative selection on the human genome. Genetics 158:1227–1234
125. Eyre-Walker A (2002) Changing effective population size and the McDonald–Kreitman test. Genetics 162:2017–2024
126. Smith NG, Eyre-Walker A (2002) Adaptive protein evolution in Drosophila. Nature 415:1022–1024
127. Sawyer SA, Hartl DL (1992) Population genetics of polymorphism and divergence. Genetics 132:1161–1176
128. Hartl DL, Moriyama EN, Sawyer SA (1994) Selection intensity for codon bias. Genetics 138:227–234
129. Akashi H (1999) Inferring the fitness effects of DNA mutations from polymorphism and divergence data: statistical power to detect directional selection under stationarity and free recombination. Genetics 151:221–238
130. Bustamante CD, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL (2002) The cost of inbreeding: fixation of deleterious genes in Arabidopsis. Nature 416:531–534
131. Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Todd-Hubisz M, Glanowski S, Hernandez R, Civello D, Tanebaum DM, White TJ, Sninsky JJ, Adams MD, Cargill M, Clark AG (2005) Natural selection on protein coding genes in the human genome. Nature 437:1153–1157
132. Boyko AR, Williamson SH, Indap AR, Degenhardt JD, Hernandez RD, Lohmueller KE, Adams MD, Schmidt S, Sninsky JJ, Sunyaev SR, White TJ, Nielsen R, Clark AG, Bustamante CD (2008) Assessing the evolutionary impact of amino acid mutations in the human genome. PLoS Genetics 4(5):e1000083
133. Bierne N, Eyre-Walker A (2004) Genomic rate of adaptive amino acid substitution in Drosophila. Mol Biol Evol 21:1350–1360
134. Welch JJ (2006) Estimating the genome-wide rate of adaptive protein evolution in Drosophila. Genetics 173:821–837
135. Eyre-Walker A, Keightley PD (2009) Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change. Mol Biol Evol 26:2097–2108
136. Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from SNP data. PLoS Genetics 5:e1000695
137. Nielsen R, Hubisz MJ, Hellmann I, Torgerson D, Andres AM, Albrechtsen A, Gutenkunst R, Adams MD, Cargill M, Boyko A, Indap A, Bustamante CD, Clark AG (2009) Darwinian and demographic forces affecting human protein coding genes. Genome Res 19:838–849
138. Kimura M, Ohta T (1969) The average number of generations until fixation of a mutant gene in a finite population. Genetics 61:763–771

Chapter 6
Methods to Detect Selection on Noncoding DNA
Ying Zhen and Peter Andolfatto
Abstract
Vast tracts of noncoding DNA contain elements that regulate gene expression in higher eukaryotes.
Describing these regulatory elements and understanding how they evolve represent major challenges for
biologists. Advances in the ability to survey genome-scale DNA sequence data are providing unprecedented
opportunities to use evolutionary models and computational tools to identify functionally important
elements and the mode of selection acting on them in multiple species. This chapter reviews some of the
current methods that have been developed and applied on noncoding DNA, what they have shown us, and
how they are limited. Results of several recent studies reveal that a significantly larger fraction of noncoding
DNA in eukaryotic organisms is likely to be functional than previously believed, implying that the functional
annotation of most noncoding DNA in these organisms is largely incomplete. In Drosophila, recent studies
have further suggested that a large fraction of noncoding DNA divergence observed between species may be
the product of recurrent adaptive substitution. Similar studies in humans have revealed a more complex
pattern, with signatures of recurrent positive selection being largely concentrated in conserved noncoding
DNA elements. Understanding these patterns and the extent to which they generalize to other organisms
awaits the analysis of forthcoming genome-scale polymorphism and divergence data from more species.
Key words: Adaptive evolution, Neutrality test, Selective constraint, Deleterious mutations,
McDonald–Kreitman test, Population genetics

1. Introduction and Methods
The lion's share of higher eukaryotic genomes comprises noncoding
DNA, which encodes the information necessary to regulate the
level, timing, and spatial organization of the expression of
thousands of genes (1). A growing body of evidence supports the
view that the evolution of gene expression regulation is the primary
genetic mechanism behind the modular organization, functional
diversification, and origin of novel traits in higher organisms
(2–5). Historically, noncoding DNA has been little studied relative
to proteins and the lack of knowledge about its function has led to

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_6,
# Springer Science+Business Media, LLC 2012
it being viewed as mostly "junk". More recently, technological
advances have allowed researchers to probe noncoding DNA function in more detail, including the annotation of genomic elements
that regulate levels of DNA transcription and translation (6). The
complexity of regulation generally precludes the direct evaluation of
all functions of regulatory elements in noncoding DNA, or an
understanding of how genetic variation in regulation corresponds
to organismal fitness. Nonetheless, even in the absence of this
information, developments in evolutionary theory and computational biology, in conjunction with the increasing availability of
genome-scale data, are providing unprecedented insights into the
functional significance of noncoding DNA and its evolution. The
emerging picture, in many eukaryotic organisms, is that a much
larger fraction of noncoding DNA is functional and subject to both
positive and negative natural selection than previously believed.
These findings, in turn, have profound implications for our broader
understanding of the evolutionary processes underlying patterns of
genome evolution and how we should interpret patterns of genomic divergence between closely related species (7–10).
Here, we review some of the emerging evolutionary/computational methods for detecting and quantifying selection acting on
noncoding DNA, and how these might be used to identify functionally important elements in genomes and the mode of selection acting
on them. We focus on methods that have been developed or adapted
specifically for application to noncoding DNA rather than approaches
that can be more generically applied to genome sequences. For an
overview of the latter approaches, including tests for selection based
on genomic scans for high levels of population differentiation (e.g.,
Fst), linkage disequilibrium and haplotype structure, or reduced variation, Hahn (11), Oleksyk et al. (12), and Charlesworth and Charlesworth (13) offer recent reviews. In addition, our purpose here is to
highlight seminal papers and recent good examples rather than
exhaustively review what is quickly becoming a vast literature.
1.1. Phylogenetic Methods: Quantifying Functionality of Noncoding DNA via Constraint

What fraction of noncoding DNA in eukaryotic genomes is functional?
Modern functional genomics approaches, like ChIP-seq (14),
RNA-seq (15), and DNase I hypersensitivity mapping (16), will
likely provide at least part of the answer to this question. However,
the complete answer to this question is unlikely to come from direct
functional studies alone because they lack sensitivity given the vast
complexity of gene regulation (e.g., tissue or developmental specificity, environmental factors, context dependence, as yet undiscovered biology, etc.). A complementary guide to evaluating the
functional significance of noncoding DNA is the notion of measuring evolutionary constraint. This notion is perhaps most familiar
in its application to proteins. That is, codons defining a protein
sequence can be divided into discrete functional classes of sites:
nonsynonymous sites, at which a newly arising mutation will alter
the protein sequence, and synonymous sites, at which a newly arising
mutation will alter the codon used, but not the protein sequence.
If nonsynonymous sites and synonymous sites were functionally
equivalent, we would expect that the probability of a substitution
at either class of sites, defined as dN and dS, respectively, would be
the same. However, in comparisons of homologous proteins from
related species in a phylogenetic context, it is clear that dN is usually
considerably smaller than dS on average (17). If one considers that the vast majority of randomly occurring amino acid substitutions to a protein are detrimental to the protein's function, dN < dS is expected and consistent with the removal of deleterious nonsynonymous mutations by natural selection. Thus, the measure "constraint" in the context of protein evolution is defined as the fraction of newly arising nonsynonymous mutations in a protein that are deleterious enough to be removed by natural selection, and is measured as the deficit in divergence at nonsynonymous sites relative to expectations based on synonymous sites (18). If we assume that synonymous substitutions are neutral and that mutation rates at synonymous and nonsynonymous sites are equal, then a measure of constraint on protein sequences can be defined as 1 − (dN/dS). Even when reference sites are not truly neutral, such a comparative approach is a powerful way to detect purifying selection on a particular class of sites.
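In code, this bookkeeping is trivial. The sketch below uses invented dN and dS values, and the `constraint` helper is ours, not from any published package; it simply makes the two assumptions (neutral synonymous sites, equal mutation rates) explicit:

```python
# Toy sketch (not from the chapter): constraint estimated as 1 - dN/dS,
# assuming synonymous sites are a neutral reference with equal mutation rate.

def constraint(dn, ds):
    """Fraction of newly arising nonsynonymous mutations removed by selection."""
    if ds == 0:
        raise ValueError("dS must be nonzero to serve as a neutral reference")
    return 1.0 - dn / ds

# Invented divergence estimates (substitutions per site):
print(round(constraint(dn=0.02, ds=0.10), 3))  # 0.8: ~80% of mutations removed
```

A value near 1 indicates strong purifying selection; a value near 0 indicates evolution at the neutral reference rate.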
The same logic can be applied to comparisons of any class of
functional sites in the genome, and has been used to identify conserved noncoding (CNC) sequences. That is, using a class of sites in
the genome that can be regarded as neutral reference sites, assuming that differences in mutation rates can be accounted for and that
all newly arising mutations are deleterious, one can use levels of
divergence at these reference sites to estimate levels of constraint in
noncoding DNA as a proxy for its functional significance. Several
early applications of this approach suggested that the number of
functionally important nucleotides in noncoding DNA equals or
exceeds the number of functionally important coding nucleotide
sites in the genomes of nematodes, Drosophila, and mammals
(19–21), and more recent studies have generally pushed these estimates even higher (22–26). Looking at constraint in the context of
larger phylogenies and varying phylogenetic distances (23, 27, 28)
has sometimes been referred to as "phylogenetic footprinting" (29) or "phylogenetic shadowing" (30). Though the latter approaches
use essentially the same principles, they are more often used to
detect individual functional elements rather than to estimate genomic constraint in general.
Using constraint as a measure of functionality of noncoding
DNA is not without its difficulties. Typically, synonymous
sites, intronic DNA, or ancestral repeats are chosen as reference
sites. However, recent studies of divergence in Arabidopsis and
mammals have highlighted how the choice of reference sites can
add considerable uncertainty to estimates of constraint in intergenic
DNA (25, 26, 31). Of primary concern is the possibility that selection
on reference sites themselves leads to underestimates of constraint.
For example, selection on synonymous sites likely downwardly biases
estimates of constraint in Drosophila and humans (24, 26). Further,
there is no guarantee that ancient transposable element-derived
DNA, another popular source of reference sites, has not been functionally co-opted (32, 33). A first difficulty, thus, lies in identifying reliable reference sites in the genome. Halligan and Keightley (24)
suggested using the fastest evolving intronic (FEI) sites in the Drosophila genome, bases 8–20 of short introns, to calibrate estimates of
constraint, though the fact that they are the fastest evolving sites in
the genome does not guarantee that they are the most neutral
(see below).
A second potential source of uncertainty is mutation bias (25, 31, 34), which is particularly important when the reference and queried sites differ in base composition or, perhaps more
problematically, genomic location. Thirdly, the very notion of
constraint as an index of functionality depends on the assumption that newly arising beneficial mutations are exceedingly rare
and contribute negligibly to divergence between species (18, 35).
These assumptions have recently been challenged using other
approaches and population genetic data from Drosophila
(see below). Notably, if a substantial fraction of the divergence
observed between species is positively selected, rather than neutral
or slightly deleterious, constraint is difficult to interpret.
Finally, the notion of constraint on noncoding DNA is usually
thought of as a property of sites in the genome rather than, more
correctly, a property of possible mutations that occur at these sites.
For example, it is possible for a completely functionless piece of
noncoding DNA to exhibit constraint if some fraction of the
mutations that occur at these sites create spurious regulatory
sites that result in the misexpression of genes (36, 37). Another
example is that the functional status of some binding sites in an
enhancer may depend on the state at other binding sites (38).
Thus, while constraint may be a reasonable first approximation
to functionality in noncoding DNA, its interpretation can sometimes be difficult. In addition, a lack of evidence for selection may
be misleading about function, as suggested by the recent identification of functional transcriptional enhancers in the human
genome with little evidence of constraint (39).
Recently, a number of methods have been introduced to detect
noncoding sequences evolving faster than neutral reference sites
(4047), presumably due to the action of recurrent adaptive
substitution. Generally, these approaches have focused on lineagespecific accelerations in the rate of substitution in CNC sequences.
Lineage-specific changes in the rate of evolution can be caused by
recurrent positive selection, but also a simple relaxation in selective
constraint (e.g., loss of function). However, sequences exceeding
the rate of evolution at neutral reference sites can be inferred to be
the targets of recurrent positive selection (as for protein
sequences; see ref. 48). Using this logic, Pollard et al. (40) identified 202 genomic regions that are highly conserved in most vertebrates but evolve more rapidly in humans. Interestingly, most of
these regions (80.4%) localize to noncoding regions in the vicinity
of genes involved in transcription and DNA binding. Another
example is a similar study on Drosophila that identified 64 highly
conserved genomic regions that exhibited a recent rate acceleration
in the Drosophila melanogaster lineage (46). However, only a
fraction of these regions (28%) are found in noncoding DNA.
Kim and Pritchard (44) looked for heterogeneity in evolutionary
rates for CNCs across vertebrates and estimated that 32% of CNC
regions exhibit branch-specific rate changes. Prabhakar et al. (41)
found that CNC regions with rate accelerations in human and
chimpanzee are significantly enriched near genes with neurological
functions and (42) showed that accelerated CNCs in the human
lineage are associated with human-specific segmental duplications.
Using a similar approach, Hahn et al. (49) suggested comparing rates of substitution in putative functional sites (in this case,
transcription factor-binding sites, Kb) to intervening, nonfunctional sites (Ki). They found a significant excess of fixations in
putative binding sites in the 5′ noncoding region of the factor VII
locus of humans (i.e., Kb/Ki > 1); however, it is difficult in such a
test to rule out selective constraint on the intervening sites. Thus,
using such an approach alone, it is difficult to distinguish a relaxation of selection from positive selection.
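The arithmetic of such a Kb/Ki comparison is simple. The sketch below uses hypothetical substitution counts, not data from the factor VII study:

```python
# Illustrative Kb/Ki comparison with made-up counts: per-site substitution
# rate in putative transcription factor-binding sites (Kb) vs. intervening,
# putatively nonfunctional sites (Ki).

def per_site_rate(substitutions, sites):
    return substitutions / sites

kb = per_site_rate(substitutions=12, sites=120)   # putative binding sites
ki = per_site_rate(substitutions=30, sites=600)   # intervening sites

ratio = kb / ki
print(f"Kb/Ki = {ratio:.1f}")  # 2.0 for these invented counts
```

As the text notes, a ratio above 1 by itself cannot distinguish positive selection on the binding sites from selective constraint on the intervening sites.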
More generally, methods based on sequence divergence alone
lack power to detect selection because they tend to assume that a
given region of the genome is either negatively selected or positively
selected, whereas in most cases positively and negatively selected
sites may be interspersed. One notable exception is a study by
Lunter et al. (50) that used the distribution of small insertion and
deletion (indel) substitutions in putatively neutral reference
sequences to identify functional noncoding DNA (i.e., regions
resistant to indels were inferred to be under selective constraint).
Of the noncoding DNA sequences inferred to be functional, based
on the pattern of indel substitutions, those that evolve faster than
neutral reference sites with respect to the rate of nucleotide substitution were identified to be under positive selection. Using this
approach, Lunter et al. estimate that 2–3% of the human genome is functional, with 0.03% of sites being the targets of recent adaptive
substitution. While the model of Lunter et al. (50) does allow for
heterogeneous selective pressures on noncoding DNA (i.e., negative selection on indels and negative or positive selection on nucleotide substitutions), the model is still obviously limited in the way
that it can accommodate this heterogeneity. That is, there is no
reason to suppose that some fraction of indel substitutions is not
positively selected or that a particular region of noncoding DNA
must be either selectively constrained or positively selected at the
nucleotide level. Indeed, recent analyses in Drosophila have revealed
complex lineage-specific selection pressures on indel variation
(51, 52). In addition, like inferences of constraint, inferences of
recurrent positive selection on noncoding DNA using divergence-based approaches suffer from the limitation that it is difficult
or sometimes impossible to rule out variation in mutation rates
(or mutation bias) or selective constraint on the chosen reference
sites themselves.
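The two-step logic of such indel-based scans can be caricatured as follows. The per-window rates, thresholds, and window names below are all invented; the actual method uses likelihood models of neutral indel expectations rather than fixed cutoffs:

```python
# Caricature (invented numbers) of a Lunter et al.-style two-step scan:
# (1) windows unusually resistant to indels are inferred to be functional;
# (2) among those, windows whose nucleotide substitution rate exceeds the
#     neutral rate are flagged as candidates for positive selection.

NEUTRAL_INDEL_RATE = 0.010   # assumed neutral indels per site
NEUTRAL_SUB_RATE = 0.050     # assumed neutral substitutions per site

def classify(indel_rate, sub_rate):
    if indel_rate >= NEUTRAL_INDEL_RATE:
        return "no evidence of function"
    if sub_rate > NEUTRAL_SUB_RATE:
        return "functional, candidate positive selection"
    return "functional, constrained"

windows = {  # hypothetical (indel rate, substitution rate) per window
    "w1": (0.011, 0.049),
    "w2": (0.002, 0.020),
    "w3": (0.003, 0.080),
}
for name, (ir, sr) in windows.items():
    print(name, classify(ir, sr))
```

The caricature also exposes the limitation discussed above: each window is forced into a single category, even though constrained and positively selected sites may be interspersed within it.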
Another approach allowing for some degree of heterogeneity in
selection pressures is that proposed by Moses (53) to look at the
evolution of transcription factor-binding sites (TFBSs) in enhancers. The approach is to compute a null distribution of the effects of
random substitutions on the strength of binding affinity in TFBSs.
By comparing the effects of actual divergence to this distribution,
one can identify TFBSs that show a larger change than expected
under the null distribution, presumably due to negative or positive
selection to either weaken or strengthen the binding affinity. At the
moment, this method might be most successfully applied to well-characterized enhancers, where changes in binding site affinity lead
to concrete predictions about the output of the system. However,
the method may be difficult to apply to (or interpret) situations in
which the effects of substitutions are highly context dependent (38)
or to noncoding DNA with unknown function, as there may be as
much or more selection in favor of reducing binding site affinity as
increasing it.
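The sketch below illustrates this idea with a toy position weight matrix whose scores are invented; the actual method uses empirically estimated binding matrices and substitution models, not this simplistic uniform mutation scheme:

```python
# Hypothetical Moses-style test: build a null distribution of PWM (position
# weight matrix) binding-score changes under random single-base substitutions,
# then ask whether observed divergence changed the score more than expected.
import random

PWM = {  # invented log-odds scores for a toy 4-bp binding site
    0: {"A": 1.2, "C": -0.8, "G": -1.0, "T": 0.1},
    1: {"A": -0.5, "C": 1.0, "G": -0.9, "T": 0.2},
    2: {"A": 0.3, "C": -1.1, "G": 1.4, "T": -0.7},
    3: {"A": 1.0, "C": 0.0, "G": -0.6, "T": -1.2},
}

def score(site):
    return sum(PWM[i][b] for i, b in enumerate(site))

def null_score_changes(site, n=10000, seed=1):
    """Score changes caused by random single-base substitutions."""
    rng = random.Random(seed)
    changes = []
    for _ in range(n):
        pos = rng.randrange(len(site))
        new_base = rng.choice([b for b in "ACGT" if b != site[pos]])
        mutant = site[:pos] + new_base + site[pos + 1:]
        changes.append(score(mutant) - score(site))
    return changes

ancestral, derived = "ACGA", "TCGA"           # one observed substitution
observed = score(derived) - score(ancestral)  # observed binding-score change
null = null_score_changes(ancestral)
p = sum(abs(c) >= abs(observed) for c in null) / len(null)
print(f"observed change {observed:.1f}, two-sided empirical P = {p:.3f}")
```

A small empirical P would suggest the observed divergence altered binding affinity more than random substitutions typically would, consistent with selection on the site.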
Intricately tied to the issue of detecting and estimating selection
based on patterns of substitution, whether single-nucleotide
substitutions or indels, is the issue of uncertainty in alignment
(54–58). The implicit assumption in an alignment, from which
patterns of substitution are inferred, is that orthologous base
positions are being compared. Pollard et al. (58) compared the
performance of numerous tools that have been developed to align
noncoding sequences and predictably found that the accuracy of
alignments decreases with increasing divergence for all tools and
declines faster in the presence of indel substitutions. Keightley and
Johnson (57) proposed using empirical estimates of mutation
parameters (e.g., the observed distribution of indel substitutions)
to improve the quality of alignments, and a growing number of
studies (54, 55, 59, 60) propose approaches to estimate the degree
of certainty associated with particular alignments, which can in
turn be used to appropriately weight estimates of evolutionary
parameters (such as mutation and selection). Several recent
advances in alignment algorithms (61, 62) are aimed at reducing
errors associated with alignments by incorporating phylogenetic
information.

1.2. Population Genetic Approaches: The Distribution of Polymorphism Frequencies
As defined above, the detection and quantification of constraint
due to negative selection or accelerated evolution due to positive
selection are intrinsically tied to the estimation of evolutionary
distances. Doing this accurately can be challenging given differences in mutation rate or bias of nucleotides in different genomic
contexts. An alternative population genetic approach is to compare
the distribution of polymorphism frequencies (DPF) at a putatively
selected class of sites with that at a putatively neutral class of
reference sites (6366). This approach relies on the fact that purifying selection tends to decrease the frequencies of polymorphisms at
functional sites relative to neutral sites. This approach has the
advantage of being robust to the details of the mutation process,
provided that the method employed either does not depend on the
ancestral state (for example, the folded distribution (35)) or that
the ancestral state can be accurately reconstructed (67, 68).
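The contrast at the heart of this approach can be illustrated with a short script. The sketch below is an illustration only, not any published implementation; the alignment columns and site classes are invented. It tabulates the folded distribution of polymorphism frequencies for a putatively selected and a neutral reference class of sites; under purifying selection, the selected class is shifted toward rare variants:

```python
from collections import Counter

def folded_sfs(site_columns):
    """Tabulate the folded distribution of polymorphism frequencies.

    site_columns: aligned columns, one string per site, with one base
    per sampled chromosome. Returns a Counter mapping minor-allele
    count to the number of biallelic polymorphic sites with that count.
    """
    sfs = Counter()
    for column in site_columns:
        alleles = Counter(column)
        if len(alleles) != 2:            # keep biallelic polymorphisms only
            continue
        sfs[min(alleles.values())] += 1  # folding: count the minor allele
    return sfs

# Hypothetical columns for 6 sampled chromosomes at 5 sites each.
neutral_sites = ["AAAAAT", "CCCCCC", "GGGGTT", "AAAAAA", "ACCCCC"]
selected_sites = ["AAAAAT", "GGGGGT", "CCCCCT", "TTTTTA", "AAAAAA"]

print(folded_sfs(neutral_sites))   # Counter({1: 2, 2: 1})
print(folded_sfs(selected_sites))  # Counter({1: 4})
```

Because the spectrum is folded (minor-allele counts), no ancestral state inference is required, as noted above.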
Analysis of the distribution of polymorphism frequencies has
been used to demonstrate negative selection on amino acid variants in a variety of plant and animal species (22, 63, 69–72) and
certain classes of synonymous codon changes relative to others in
Drosophila (64, 73). The approach has also been extended to
demonstrate evidence for selective constraint on noncoding
DNA in Drosophila (22, 74–77), humans (69, 78–81), and
Arabidopsis (72). Ronald and Akey (82) and Emerson et al. (83)
extended this approach to look at the frequencies of polymorphisms underlying expression variation in yeast and were able to infer
that most polymorphisms affecting expression in cis and trans are
under purifying selection.
Recently, Kern and Haussler (84) developed a hidden Markov model (popGenHMM), similar to that developed by Siepel et al. (23), that uses the distribution of polymorphism frequencies (instead of divergence) to detect genomic regions experiencing negative or positive selection. In a scan of 7 Mb of the D. melanogaster genome, Kern and Haussler estimate that approximately 75% of sites in untranslated transcribed regions (UTRs) are under negative selection, which is comparable to estimates based on levels of constraint (22). Kern and Haussler's method does come with a number of important caveats. In particular, the assumption of independence among sites and the assumption of an equilibrium panmictic population can each lead to high false-positive rates. The authors recommend that simulations of genealogies with recombination and demography (85) be used to generate appropriate null distributions. Perhaps more problematic, like similar methods based on divergence (23), this method assumes that negatively and positively selected sites cluster into discrete elements rather than being interspersed. Studies in both Drosophila and humans suggest that, while more and less constrained elements can be identified, constraint appears to be widely dispersed throughout noncoding DNA in both genomes (22, 24, 79), and constrained and positively

Y. Zhen and P. Andolfatto

[Figure 1: curves of the expected DPF under four scenarios: neutral (2Ns = 0), positive selection (2Ns = +10), negative selection (2Ns = −10), and a mixture (50% neutral : 40% 2Ns = −10 : 10% 2Ns = +10); x-axis, frequency in a sample of n = 20; y-axis, proportion of polymorphisms.]

Fig. 1. The effect of directional selection on the distribution of polymorphism frequencies (DPFs). Plotted are the expected proportions of polymorphisms (y-axis) against frequency in a sample of 20 chromosomes, based on equations in Bustamante et al. (90). Selected variants are assumed to have additive effects on fitness. In brown is a mixture model that posits 50% of newly arising mutations being neutral, 40% being negatively selected, and 10% positively selected. The similarity of this mixture model to neutral expectations implies that it may be difficult to detect positive or negative selection in regions of the genome with pluralistic selective pressures based on the shape of the DPF alone.

selected sites may often be interdigitated. These caveats are likely to seriously limit the power and accuracy of this approach in both detecting and quantifying selection in noncoding DNA (see Fig. 1).
1.3. Population Genetic Approaches: Using Polymorphism and Divergence

The interdigitation of positively and negatively selected sites in genomes limits the power of approaches that assume a particular
form of selection acting on a genomic region. McDonald and Kreitman (86) proposed a statistical test (the MK test) to detect selection by utilizing information on both divergence and polymorphism. The method works by comparing two ways of estimating constraint at a class of putatively selected sites (X): one based on polymorphism within species (pX/pneutral) and one based on divergence between species (dX/dneutral). Under Kimura's neutral hypothesis (17), which
assumes that all mutations are either neutral or strongly negatively
selected, these two ratios should be equal. Departures from equality
can be informative about the direction and intensity of selection on
a class of putatively selected sites. That is, a divergence deficit
relative to polymorphism at putatively selected sites suggests that
some polymorphism is deleterious enough that it does not contribute to divergence. Conversely, an excess of divergence relative
to polymorphism at putatively selected sites is consistent with


recurrent adaptive substitution (86, 87) or a relaxation in the intensity of negative selection in the past (88). Several statistical approaches based on this framework have been developed to quantify the intensity of selection (65, 87, 89, 90) and the fraction of divergence in excess of neutral model predictions (77, 89, 91–94). As these are based on essentially the same statistical framework as first proposed by McDonald and Kreitman (86), we refer to these collectively as McDonald–Kreitman approaches.
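In its simplest form, such a comparison reduces to a 2 x 2 table of polymorphic and fixed site counts for the selected and neutral classes. The sketch below is a minimal illustration with invented counts, not the implementation used in any of the cited studies; it applies a G-test of independence to such a table:

```python
import math

def mk_test(p_sel, d_sel, p_neu, d_neu):
    """G-test of independence on the 2x2 McDonald-Kreitman table.

    p_*: polymorphic site counts within species; d_*: fixed differences
    between species, for the putatively selected (sel) and neutral
    reference (neu) classes. Returns (G, P-value).
    """
    table = [[p_sel, d_sel], [p_neu, d_neu]]
    total = p_sel + d_sel + p_neu + d_neu
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / total
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    # One degree of freedom: chi-square tail probability via erfc.
    return g, math.erfc(math.sqrt(g / 2.0))

# Invented counts: a divergence excess at the selected class
# (d_sel/d_neu > p_sel/p_neu) is consistent with adaptive substitution.
g_stat, p_value = mk_test(20, 40, 40, 40)
print(f"G = {g_stat:.2f}, P = {p_value:.3f}")
```

In practice, Fisher's exact test is often preferred for small counts; the G-test is shown here only to keep the sketch free of external dependencies.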
Though the McDonald–Kreitman test was originally applied to
proteins (i.e., comparing nonsynonymous to putatively neutral
synonymous sites), several authors have also applied modified versions of this test to noncoding DNA. Generally, this has been
applied in two ways. First, the test has been used to detect selection
at individual elements in the genome, for example, by comparing
functional noncoding DNA, such as TFBSs, to nonfunctional
noncoding DNA (95, 96). However, given the high levels of constraint found in noncoding DNA currently lacking annotated function (see above), this approach is expected to lack power because the putatively nonfunctional noncoding DNA used as a reference may in fact be functional. This has
prompted others to modify the approach to use synonymous sites as
a neutral reference to detect selection at individual noncoding
DNA elements (97).
Second, a variety of MK approaches have been used in more
broad-scale comparisons of classes of sites to infer the mode
of selection acting on noncoding DNA throughout the genome
(22, 75–77, 98, 99). Using this approach, Andolfatto (22) used
polymorphism data from D. melanogaster, and divergence to its
closest relative D. simulans, to show that there is a significant
divergence excess relative to polymorphism for almost all classes
of noncoding sequence, consistent with widespread recurrent adaptive substitution in noncoding DNA. In particular, Andolfatto
estimated that ~20% of nucleotide divergence in introns and intergenic regions and ~60% of divergence in UTRs are in excess of
neutral theory predictions. Similar conclusions are reached when
using polymorphism from D. simulans rather than D. melanogaster,
and lineage-specific estimates of divergence (75). Casillas et al. (76)
noted that purifying selection appears to be stronger in conserved
noncoding sequences in Drosophila while the inferred divergence
excess appears to be larger in less constrained sequences. In mice
and humans, the Drosophila-like patterns of widespread constraint
and a divergence excess relative to neutral expectations are not
generally observed (77), though there is some evidence for negative
and positive selection in CNCs (99). This might be expected given
the size of mammalian genomes. That is, regulatory elements may
be much more diffuse in noncoding DNA of mammals than in
organisms like Drosophila, making recurrent positive selection difficult to detect in most noncoding DNA, but easier to detect in
regions of the genome enriched for functional sites (such as CNCs


in mammals). In support of this view, Kousathanas et al. (100) estimate similar numbers of adaptive substitutions in coding
regions and upstream/downstream noncoding DNA in mice,
though the latter estimates are not significantly different from zero. Similarly, little evidence for constraint or positive selection has been documented in yeast, despite the expectation of a highly
streamlined genome. This said, sample sizes from yeast populations
have been very small (71), which limits the power of population
genetic approaches. In addition, yeast populations appear to be
highly structured and population sizes within demes appear to
be quite small (101), which may render many mutations that
would be deleterious in Drosophila effectively neutral in yeast.
Though MK approaches are expected to be more informative
about the direction and intensity of selection than divergence-alone
or polymorphism-alone methods, they also can be biased by several
factors. First, the approach is limited by an appropriate choice of
neutral reference sites. While synonymous sites are often chosen for
this purpose, weak purifying selection on these sites (which has
been documented in numerous taxa) can be expected to bias the
MK test in favor of detecting positive selection (22, 102), and bias
estimates of the divergence excess at putatively selected sites
upward (22, 92). Alternative choices of neutral reference sites,
such as the fastest evolving sites of short introns (24), have been
proposed, though levels of polymorphism and divergence at these
sites appear to be quite similar to synonymous sites, at least in D.
melanogaster (52).
A second concern is the presence of appreciable numbers of
weakly deleterious polymorphisms in the putatively selected class
of sites, which tend to limit the power of the MK test to detect a
divergence excess due to positive selection (103). To circumvent
this problem, it has been proposed that a frequency filter be used
(on both neutral and selected sites) to exclude low-frequency polymorphisms, which are enriched for substitutions that contribute to
polymorphism but not divergence (91, 104). An alternative
approach is to estimate the distribution of selective effects of deleterious mutations and use this estimate to infer the fraction of divergence in excess of neutral expectations (Fig. 2) (66, 77, 99, 105).
Importantly, these latter methods assume a particular distribution of
fitness effects of newly arising mutations (e.g., normal, exponential,
gamma, etc.), which may or may not be biologically meaningful.
A subset of the methods above (66, 77) also co-estimate a demographic model, the purpose of which is discussed below.
A third concern is that in comparisons of putatively selected
and neutral reference sites, the assumption of the MK test is that
these sites share the same genealogical history (86, 106). In general, this assumption works when there is either no recombination
between neutral and selected sites or selected and neutral sites are
close to evenly interdigitated. This assumption is rarely met in


[Figure 2: bar chart of the percentage of newly arising mutations (y-axis, 0–100%) in four bins of scaled selective effect N*E(s): 0–1, 1–10, 10–100, and >100.]

Fig. 2. Selective constraint and positive selection on noncoding DNA inferred using polymorphism and divergence. Shown is the inferred distribution of fitness effects of newly arising mutations and the fraction of divergence in excess of expectations (α) for a sample of intronic sites in D. melanogaster (from Table 6 of ref. 77). The method uses the DPF for synonymous sites to estimate parameters of a population size change model. The method then uses this demographic model, with the DPF and divergence at synonymous and intronic sites, to estimate selection on the latter class of sites. The implication is that 30% of newly arising mutations in these introns are subject to deterministic negative selection and that 20% of the nucleotide divergence observed between species is in excess of expectations under the neutral model. The error bars indicate standard errors on the estimates.

comparisons involving noncoding DNA, potentially leading to underestimates of confidence intervals on estimates of the divergence excess (22) or false positives in tests for selection at individual
genomic regions (106). This issue can be corrected by establishing
the appropriate significance level using parametric coalescent simulations to generate null distributions of the test statistic. A similar
issue stems from the practice of pooling sites across the genome,
which can induce biased estimates of adaptive evolution if there is a
negative correlation between levels of diversity and the extent of
divergence at putatively selected sites (107, 108). In fact, such a
correlation has been observed in patterns of polymorphism and
divergence for protein coding (108–113) and noncoding DNA
sequences in humans (112).
A final concern stems from the assumption that the current
level of selective constraint on a genomic region (recorded in
levels of polymorphism) has either remained constant over time
or is not different than the average level of constraint in the past
history of the species (recorded in levels of divergence). The
relative contribution of deleterious mutations to divergence is
determined by the distribution of deleterious selective effects of
mutations and the effective population size of the species (92,
114, 115). If the effective population size of a species changes
over time, as one might expect due to bottlenecks and expansions,
levels of constraint on selected sites could change over time,
leading to genome-wide biases in estimates of negative and positive selection (91, 116). The observation of positive selection in


noncoding DNA in Drosophila and mice appears to be robust to recent population expansion (77, 117). However, it may be
difficult to rule out the possibility of ancient bottlenecks that were
more severe. The extent of shared polymorphism in two species
(due to shared ancestry) may put useful limits on the severity of
past bottlenecks, as suggested by Andolfatto et al. (117).
A related issue is the possibility of shifting constraints on noncoding DNA over time. Such changes in constraint may arise from a period of relaxed selection due to, for example, duplication (creating a period of functional redundancy) or changes in the environment.
Another example is binding site turnover expected under simple
models of stabilizing selection for a regulatory element, which can
cause levels of selective constraint to shift within the element over
time (38). The extent to which these issues cause a problem for
inferences of positive and negative selection on noncoding elements
using MK approaches is in need of further investigation.
1.4. Prospects

Our understanding of the function of noncoding DNA and the population-level processes shaping its evolution is in its infancy.
Many approaches that have been applied to detect and quantify
selection on noncoding DNA are derivatives of approaches first
formulated for protein-coding genes (e.g., dN/dS, the MK test,
etc.); thus, many of the same limitations of these methods apply
equally to coding and noncoding DNA. The study of noncoding
DNA is also fraught with its own additional specific challenges.
Paramount among these is the comparative lack of functional
annotation of sites. Apart from knowledge of the putative binding
sites for a handful of transcription factors and regulatory RNAs, the
function of most noncoding DNA is unknown. The finding of
widespread selective constraint across the genomes of many eukaryotes suggests that we have much to learn about the functional
significance of most noncoding DNA in eukaryotic genomes. Some
of this constraint may be due to protein-coding and RNA genes yet
to be discovered (118, 119), though it is unclear to what extent this
can account for the widespread constraint patterns in unannotated
noncoding DNA of many organisms. The inability to form prior
hypotheses about function in noncoding DNA is a key factor
limiting the power of statistical methods to detect and quantify
selection. For example, where should we look for selection in
noncoding DNA and what sites in the genome constitute appropriate neutral reference sites? The answer to the latter question in
organisms with highly streamlined genomes and large population
sizes (which determines the efficacy of selection), like Drosophila or
Arabidopsis, might be very few sites indeed.
Much of the evidence for selection on noncoding DNA
currently comes from generalized genomic studies that benefit from
the statistical power afforded by looking at many sites in the genome.
One of the outstanding questions in this area of investigation is


whether the inferences of selection being made are robust to past changes in population size and structure. Another is how general
these findings are across different organisms; notably, signatures of positive selection observed in Drosophila noncoding DNA (albeit in multiple species) are not obvious in other organisms, such as yeast,
Arabidopsis, mice, and humans. Part of the explanation for this might
be that functional sites in noncoding DNA are more diffuse in very
large genomes. However, these species also differ in many other
aspects of biology that may play an important role in determining
patterns of selection in noncoding DNA, including population size,
population structure, and mating system (8, 120). Population genomic data from more species should shed light on the generality of this
pattern and perhaps point to important factors determining our
ability to detect positive and negative selection.
A second challenge is the ability to use any of the approaches
outlined above to reliably detect positive and negative selection at
individual regulatory elements in the genome. Genome-wide scans
for selection based on genetic hitchhiking patterns (e.g., haplotype structure, reduced variation, etc.) typically lack the resolution to definitively identify specific targets of positive selection in noncoding DNA (but see ref. 121). Another issue is that the
power to detect selection at a single locus is typically limited by the
number of informative substitutions and confidence in their frequencies (i.e., sample size). To date, polymorphism data has been
quite limited, particularly those involving samples of individuals
that are large enough to meaningfully estimate allele frequencies.
Forthcoming genome projects of large samples of genomes for
some organisms (e.g., http://browser.1000genomes.org; http://
www.1001genomes.org) should usher in a new era of progress in
detecting selection in the noncoding genome.

2. Exercises
Download the coding and noncoding polymorphism data of Andolfatto (22) from http://genomics.princeton.edu/AndolfattoLab/link_nature2005.html. The first sequence in each file is the
sequence for D. simulans (an appropriate outgroup). The next 12
sequences are from a Zimbabwean population of D. melanogaster.
You will need a script to extract polymorphism and divergence
statistics from this data.
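As a starting point, a script along the following lines can extract the basic counts. This is a sketch only; it assumes aligned FASTA input with the outgroup sequence first, and it skips sites containing gaps or ambiguous bases (the file name in the usage comment is illustrative):

```python
def read_fasta(path):
    """Parse a FASTA file into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks)))
                name, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def poly_div_counts(outgroup, ingroup):
    """Count polymorphic sites within the ingroup sample and fixed
    differences between the ingroup and the outgroup sequence."""
    poly = div = 0
    for i, out_base in enumerate(outgroup):
        column = [seq[i] for seq in ingroup]
        if "-" in column or "N" in column or out_base in "-N":
            continue                  # skip gaps and ambiguities
        if len(set(column)) > 1:
            poly += 1                 # segregating within species
        elif column[0] != out_base:
            div += 1                  # fixed difference
    return poly, div

# Usage with the downloaded data (file name is illustrative):
# records = read_fasta("locus1.fasta")
# outgroup = records[0][1]                   # D. simulans
# ingroup = [seq for _, seq in records[1:]]  # D. melanogaster sample
# print(poly_div_counts(outgroup, ingroup))
```

Classifying each site as synonymous, nonsynonymous, or noncoding before counting is left as part of the exercise.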
1. Compare the distribution of polymorphism frequencies for noncoding sites and fourfold synonymous sites of the D. melanogaster
sequences. Since both demography and selection can influence
polymorphism frequencies, how can you distinguish between
these processes based on this comparison? Katzman et al. (80)


compared the distribution of polymorphism frequencies in coding regions to CNCs, but used different population samples
for these two classes of sites. What is the danger of comparing the
distribution of polymorphism frequencies in this context?
2. Perform a McDonald–Kreitman test for each UTR locus using
pooled synonymous sites as a neutral reference and obtain a
distribution of p-values. What kinds of factors influence the
type-I error of this test when used in this way? Describe how
you might correct p-values for these factors.
3. Pooling UTR loci and using pooled synonymous sites as a neutral reference, estimate the fraction of UTR divergence in excess of neutral expectations (α) using the estimators of Fay et al. (91) and Eyre-Walker and Keightley (77) (see the DFE-alpha server http://homepages.ed.ac.uk/eang33/). According to the Eyre-Walker and Keightley approach, what fraction of newly arising mutations in noncoding sites is subject to weak negative selection? What factors make these two estimators of α different?

Acknowledgments
Thanks to Stephen Wright, Molly Przeworski, Kevin Bullaughey,
and anonymous reviewers for helpful discussion and comments on
the manuscript. This work was supported in part by NIH grant
R01-GM083228.
References
1. Lewin, B. (2007) Genes IX, Oxford University Press. p 892.
2. Stern, D. L. (2010) Evolution, development and the predictable genome. Roberts and Co. Publishing. p 264.
3. Wray, G., Hahn, M., Abouheif, E., Balhoff, J., Pizer, M., Rockman, M., and Romano, L. (2003) The evolution of transcriptional regulation in eukaryotes, Mol Biol Evol 20, 1377–1419.
4. Davidson, E. H. (2001) Genomic regulatory systems: development and evolution, Academic Press, San Diego.
5. Carroll, S. B. (2000) Endless forms: the evolution of gene regulation and morphological diversity, Cell 101, 577–580.
6. Sakabe, N. J., and Nobrega, M. A. (2010) Genome-wide maps of transcription regulatory elements, Wiley Interdiscip Rev Syst Biol Med 2, 422–437.
7. Charlesworth, B., Betancourt, A. J., Kaiser, V. B., and Gordo, I. (2009) Genetic recombination and molecular evolution, Cold Spring Harb Symp Quant Biol 74, 177–186.
8. Wright, S., and Andolfatto, P. (2008) The impact of natural selection on the genome: emerging patterns in Drosophila and Arabidopsis, Annu Rev Ecol Evol Syst 39, 193–213.
9. Keightley, P. D., and Eyre-Walker, A. (1999) Terumi Mukai and the riddle of deleterious mutation rates, Genetics 153, 515–523.
10. Kondrashov, A. S. (1988) Deleterious mutations and the evolution of sexual reproduction, Nature 336, 435–440.
11. Hahn, M. (2007) Detecting natural selection on cis-regulatory DNA, Genetica 129, 7–18.


12. Oleksyk, T. K., Smith, M. W., and O'Brien, S. J. (2010) Genome-wide scans for footprints of natural selection, Phil Trans Roy Soc B 365, 185–205.
13. Charlesworth, B., and Charlesworth, D. (2010) Elements of evolutionary genetics, Roberts and Co. Publishers.
14. Park, P. J. (2009) ChIP-seq: advantages and challenges of a maturing technology, Nat Rev Genet 10, 669–680.
15. Wang, Z., Gerstein, M., and Snyder, M. (2009) RNA-Seq: a revolutionary tool for transcriptomics, Nat Rev Genet 10, 57–63.
16. Shibata, Y., and Crawford, G. E. (2009) Mapping regulatory elements by DNaseI hypersensitivity chip (DNase-Chip), Methods Mol Biol 556, 177–190.
17. Kimura, M. (1983) The neutral theory of molecular evolution, Cambridge University Press, Cambridge.
18. Kondrashov, A. S., and Crow, J. F. (1993) A molecular approach to estimating the human deleterious mutation rate, Hum Mutat 2, 229–234.
19. Shabalina, S., and Kondrashov, A. (1999) Pattern of selective constraint in C. elegans and C. briggsae genomes, Genet Res 74, 23–30.
20. Shabalina, S., Ogurtsov, A., Kondrashov, V., and Kondrashov, A. (2001) Selective constraint in intergenic regions of human and mouse genomes, Trends in Genetics 17, 373–376.
21. Bergman, C., and Kreitman, M. (2001) Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences, Genome Res 11, 1335–1345.
22. Andolfatto, P. (2005) Adaptive evolution of non-coding DNA in Drosophila, Nature 437, 1149–1152.
23. Siepel, A., Bejerano, G., Pedersen, J., Hinrichs, A., Hou, M., Rosenbloom, K., Clawson, H., Spieth, J., Hillier, L., Richards, S., Weinstock, G., Wilson, R., Gibbs, R., Kent, W., Miller, W., and Haussler, D. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes, Genome Res 15, 1034–1050.
24. Halligan, D. L., and Keightley, P. D. (2006) Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison, Genome Res 16, 875–884.
25. Gaffney, D. J., and Keightley, P. D. (2006) Genomic selective constraints in murid noncoding DNA, PLoS Genetics 2, 1912–1923.

26. Eory, L., Halligan, D. L., and Keightley, P. D. (2010) Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes, Mol Biol Evol 27, 177–192.
27. Drosophila 12 Genomes Consortium. (2007) Evolution of genes and genomes on the Drosophila phylogeny, Nature 450, 203–218.
28. Cooper, G., Stone, E., Asimenos, G., Green, E., Batzoglou, S., and Sidow, A. (2005) Distribution and intensity of constraint in mammalian genomic sequence, Genome Res 15, 901–913.
29. Duret, L., and Bucher, P. (1997) Searching for regulatory elements in human noncoding sequences, Curr Opin Struc Biol 7, 399–406.
30. Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K., Ovcharenko, I., Pachter, L., and Rubin, E. (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome, Science 299, 1391–1394.
31. DeRose-Wilson, L. J., and Gaut, B. S. (2007) Transcription-related mutations and GC content drive variation in nucleotide substitution rates across the genomes of Arabidopsis thaliana and Arabidopsis lyrata, BMC Evol Biol 7, 66.
32. Britten, R. (1996) Cases of ancient mobile element DNA insertions that now affect gene regulation, Mol Phylogenet Evol 5, 13–17.
33. Nishihara, H., Smit, A. F. A., and Okada, N. (2006) Functional noncoding sequences derived from SINEs in the mammalian genome, Genome Res 16, 864–874.
34. Haddrill, P., Charlesworth, B., Halligan, D., and Andolfatto, P. (2005) Patterns of intron sequence evolution in Drosophila are dependent upon length and GC content, Genome Biology 6, R67.
35. Tajima, F. (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism, Genetics 123, 585–595.
36. Hahn, M., Stajich, J., and Wray, G. (2003) The effects of selection against spurious transcription factor binding sites, Mol Biol Evol 20, 901–906.
37. Clop, A., Marcq, F., Takeda, H., Pirottin, D., Tordoir, X., Bibe, B., Bouix, J., Caiment, F., Elsen, J., Eychenne, F., Larzul, C., Laville, E., Meish, F., Milenkovic, D., Tobin, J., Charlier, C., and Georges, M. (2006) A mutation creating a potential illegitimate microRNA target site in the myostatin gene affects muscularity in sheep, Nature Genetics 38, 813–818.

38. Bullaughey, K. (2011) Changes in selective effects over time facilitate turnover of enhancer sequences, Genetics 187, 567–582.
39. Blow, M. J., McCulley, D. J., Li, Z., Zhang, T., Akiyama, J. A., Holt, A., Plajzer-Frick, I., Shoukry, M., Wright, C., Chen, F., Afzal, V., Bristow, J., Ren, B., Black, B. L., Rubin, E. M., Visel, A., and Pennacchio, L. A. (2010) ChIP-Seq identification of weakly conserved heart enhancers, Nat Genet 42, 806–810.
40. Pollard, K. S., Salama, S. R., King, B., Kern, A. D., Dreszer, T., Katzman, S., Siepel, A., Pedersen, J. S., Bejerano, G., Baertsch, R., Rosenbloom, K. R., Kent, J., and Haussler, D. (2006) Forces shaping the fastest evolving regions in the human genome, PLoS Genetics 2, 1599–1611.
41. Prabhakar, S., Noonan, J. P., Paabo, S., and Rubin, E. M. (2006) Accelerated evolution of conserved noncoding sequences in humans, Science 314, 786.
42. Bird, C., Stranger, B., Liu, M., Thomas, D., Ingle, C., Beazley, C., Miller, W., Hurles, M., and Dermitzakis, E. (2007) Fast-evolving noncoding sequences in the human genome, Genome Biol 8, R118.
43. Haygood, R., Fedrigo, O., Hanson, B., Yokoyama, K.-D., and Wray, G. A. (2007) Promoter regions of many neural- and nutrition-related genes have experienced positive selection during human evolution, Nature Genetics 39, 1140–1144.
44. Kim, S. Y., and Pritchard, J. K. (2007) Adaptive evolution of conserved noncoding elements in mammals, PLoS Genetics 3, 1572–1586.
45. Wong, W., and Nielsen, R. (2004) Detecting selection in noncoding regions of nucleotide sequences, Genetics 167, 949–958.
46. Holloway, A. K., Begun, D. J., Siepel, A., and Pollard, K. S. (2008) Accelerated sequence divergence of conserved genomic elements in Drosophila melanogaster, Genome Res 18, 1592–1601.
47. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R., and Siepel, A. (2010) Detection of nonneutral substitution rates on mammalian phylogenies, Genome Res 20, 110–121.
48. Hurst, L. (2002) The Ka/Ks ratio: diagnosing the form of sequence evolution, Trends Genet 18, 486–487.
49. Hahn, M., Rockman, M., Soranzo, N., Goldstein, D., and Wray, G. (2004) Population genetic and phylogenetic evidence for positive selection on regulatory mutations at the Factor VII locus in humans, Genetics 167, 867–877.

50. Lunter, G., Ponting, C. P., and Hein, J. (2006) Genome-wide identification of human functional DNA using a neutral indel model, PLoS Comp Biol 2, 212.
51. Presgraves, D. C. (2006) Intron length evolution in Drosophila, Mol Biol Evol 23, 2203–2213.
52. Parsch, J., Novozhilov, S., Saminadin-Peter, S., Wong, K., and Andolfatto, P. (2010) On the utility of short intron sequences as a reference for the detection of positive and negative selection in Drosophila, Mol Biol Evol 27, 1226–1234.
53. Moses, A. M. (2009) Statistical tests for natural selection on regulatory regions based on the strength of transcription factor binding sites, BMC Evol Biol 9, 286.
54. Satija, R., Pachter, L., and Hein, J. (2008) Combining statistical alignment and phylogenetic footprinting to detect regulatory elements, Bioinformatics 24, 1236–1242.
55. Lunter, G., Rocco, A., Mimouni, N., Heger, A., Caldeira, A., and Hein, J. (2008) Uncertainty in homology inferences: assessing and improving genomic sequence alignment, Genome Res 18, 298–309.
56. Wang, J., Keightley, P. D., and Johnson, T. (2006) MCALIGN2: faster, accurate global pairwise alignment of non-coding DNA sequences based on explicit models of indel evolution, BMC Bioinformatics 7, 292.
57. Keightley, P. D., and Johnson, T. (2004) MCALIGN: stochastic alignment of noncoding DNA sequences based on an evolutionary model of sequence evolution, Genome Res 14, 442–450.
58. Pollard, D. A., Bergman, C. M., Stoye, J., Celniker, S. E., and Eisen, M. B. (2004) Benchmarking tools for the alignment of functional noncoding DNA, BMC Bioinformatics 5, 6.
59. Landan, G., and Graur, D. (2007) Heads or tails: a simple reliability check for multiple sequence alignments, Mol Biol Evol 24, 1380–1383.
60. Satija, R., Hein, J., and Lunter, G. A. (2010) Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques, Bioinformatics 26, 2116–2120.
61. Liu, K., Raghavan, S., Nelesen, S., Linder, C. R., and Warnow, T. (2009) Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees, Science 324, 1561–1564.
62. Loytynoja, A., and Goldman, N. (2010) webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser, BMC Bioinformatics 11, 579.
63. Sawyer, S. A., Dykhuizen, D. E., and Hartl, D. L. (1987) Confidence interval for the number of selectively neutral amino acid polymorphisms, Proc Natl Acad Sci U S A 84, 6225–6228.
64. Akashi, H., and Schaeffer, S. (1997) Natural selection and the frequency distributions of silent DNA polymorphism in Drosophila, Genetics 146, 295–307.
65. Keightley, P. D., and Eyre-Walker, A. (2007) Joint inference of the distribution of fitness effects of deleterious mutations and population demography based on nucleotide polymorphism frequencies, Genetics 177, 2251–2261.
66. Boyko, A. R., Williamson, S. H., Indap, A. R., Degenhardt, J. D., Hernandez, R. D., Lohmueller, K. E., Adams, M. D., Schmidt, S., Sninsky, J. J., Sunyaev, S. R., White, T. J., Nielsen, R., Clark, A. G., and Bustamante, C. D. (2008) Assessing the evolutionary impact of amino acid mutations in the human genome, PLoS Genetics 4, e1000083.
67. Hernandez, R. D., Williamson, S. H., and Bustamante, C. D. (2007) Context dependence, ancestral misidentification, and spurious signatures of natural selection, Mol Biol Evol 24, 1792–1800.
68. Baudry, E., and Depaulis, F. (2003) Effect of misoriented sites on neutrality tests with outgroup, Genetics 165, 1619–1622.
69. Kryukov, G., Schmidt, S., and Sunyaev, S. (2005) Small fitness effect of mutations in highly conserved non-coding regions, Human Molecular Genetics 14, 2221–2229.
70. Foxe, J. P., Dar, V.-u.-N., Zheng, H., Nordborg, M., Gaut, B. S., and Wright, S. I. (2008) Selection on amino acid substitutions in Arabidopsis, Mol Biol Evol 25, 1375–1383.
71. Doniger, S. W., Kim, H. S., Swain, D., Corcuera, D., Williams, M., Yang, S. P., and Fay, J. C. (2008) A catalog of neutral and deleterious polymorphism in yeast, PLoS Genet 4, e1000183.
72. Kim, S., Plagnol, V., Hu, T. T., Toomajian, C., Clark, R. M., Ossowski, S., Ecker, J. R., Weigel, D., and Nordborg, M. (2007) Recombination and linkage disequilibrium in Arabidopsis thaliana, Nat Genet 39, 1151–1155.
73. Zeng, K., and Charlesworth, B. (2010) Studying patterns of recent evolution at synonymous sites and intronic sites in Drosophila melanogaster, J Mol Evol 70, 116–128.

157

74. Bachtrog, D., and Andolfatto, P. (2006) Selection, recombination and demographic history in Drosophila miranda, Genetics 174, 2045–2059.
75. Haddrill, P., Bachtrog, D., and Andolfatto, P. (2008) Positive and negative selection on noncoding DNA in Drosophila simulans, Mol Biol Evol 25, 1825–1834.
76. Casillas, S., Barbadilla, A., and Bergman, C. (2007) Purifying selection maintains highly conserved noncoding sequences in Drosophila, Mol Biol Evol 24, 2222–2234.
77. Eyre-Walker, A., and Keightley, P. D. (2009) Estimating the rate of adaptive molecular evolution in the presence of slightly deleterious mutations and population size change, Mol Biol Evol 26, 2097–2108.
78. Drake, J., Bird, C., Nemesh, J., Thomas, D., Newton-Cheh, C., Reymond, A., Excoffier, L., Attar, H., Antonarakis, S., Dermitzakis, E., and Hirschhorn, J. (2006) Conserved noncoding sequences are selectively constrained and not mutation cold spots, Nature Genetics 38, 223–227.
79. Asthana, S., Noble, W., Kryukov, G., Grant, C., Sunyaev, S., and Stamatoyannopoulos, J. (2007) Widely distributed noncoding purifying selection in the human genome, Proc Natl Acad Sci U S A 104, 12410–12415.
80. Katzman, S., Kern, A. D., Bejerano, G., Fewell, G., Fulton, L., Wilson, R. K., Salama, S. R., and Haussler, D. (2007) Human genome ultraconserved elements are ultraselected, Science 317, 915.
81. Chen, K., and Rajewsky, N. (2006) Natural selection on human microRNA binding sites inferred from SNP data, Nat Genet 38, 1452–1456.
82. Ronald, J., and Akey, J. M. (2007) The evolution of gene expression QTL in Saccharomyces cerevisiae, PLoS One 2, e678.
83. Emerson, J. J., Hsieh, L. C., Sung, H. M., Wang, T. Y., Huang, C. J., Lu, H. H., Lu, M. Y., Wu, S. H., and Li, W. H. (2010) Natural selection on cis and trans regulation in yeasts, Genome Res 20, 826–836.
84. Kern, A., and Haussler, D. (2010) A population genetic Hidden Markov Model for detecting genomic regions under selection, Mol Biol Evol 27, 1673–1685.
85. Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral model of genetic variation, Bioinformatics 18, 337–338.
86. McDonald, J. H., and Kreitman, M. (1991) Adaptive protein evolution at the Adh locus in Drosophila, Nature 351, 652–654.

Y. Zhen and P. Andolfatto
87. Sawyer, S. A., and Hartl, D. L. (1992) Population genetics of polymorphism and divergence, Genetics 132, 1161–1176.
88. Ohta, T. (1993) Amino acid substitution at the Adh locus of Drosophila is facilitated by small population size, Proc Natl Acad Sci U S A 90, 4548–4551.
89. Sawyer, S. A., Parsch, J., Zhang, Z., and Hartl, D. L. (2007) Prevalence of positive selection among nearly neutral amino acid replacements in Drosophila, Proc Natl Acad Sci U S A 104, 6504–6510.
90. Bustamante, C. D., Wakeley, J., Sawyer, S., and Hartl, D. L. (2001) Directional selection and the site-frequency spectrum, Genetics 159, 1779–1788.
91. Fay, J. C., Wyckoff, G. J., and Wu, C. I. (2001) Positive and negative selection on the human genome, Genetics 158, 1227–1234.
92. Eyre-Walker, A., Keightley, P. D., Smith, N. G., and Gaffney, D. (2002) Quantifying the slightly deleterious mutation model of molecular evolution, Mol Biol Evol 19, 2142–2149.
93. Bierne, N., and Eyre-Walker, A. (2004) The genomic rate of adaptive amino acid substitution in Drosophila, Mol Biol Evol 21, 1350–1360.
94. Welch, J. J. (2006) Estimating the genome-wide rate of adaptive protein evolution in Drosophila, Genetics 173, 821–837.
95. Jenkins, D. L., Ortori, C. A., and Brookfield, J. F. (1995) A test for adaptive change in DNA sequences controlling transcription, Proc Biol Sci 261, 203–207.
96. Ludwig, M. Z., and Kreitman, M. (1995) Evolutionary dynamics of the enhancer region of even-skipped in Drosophila, Mol Biol Evol 12, 1002–1011.
97. Holloway, A., Lawniczak, M., Mezey, J., Begun, D., and Jones, C. (2007) Adaptive gene expression divergence inferred from population genomics, PLoS Genetics 3, 2007–2013.
98. Kohn, M., Fang, S., and Wu, C. (2004) Inference of positive and negative selection on the 5' regulatory regions of Drosophila genes, Mol Biol Evol 21, 374–383.
99. Torgerson, D., Boyko, A., Hernandez, R., Indap, A., Hu, X., White, T., Sninsky, J., Cargill, M., Adams, M., Bustamante, C., and Clark, A. (2009) Evolutionary processes acting on candidate cis-regulatory regions in humans inferred from patterns of polymorphism and divergence, PLoS Genetics 5, e1000592.

100. Kousathanas, A., Oliver, F., Halligan, D. L., and Keightley, P. D. (2010) Positive and negative selection on non-coding DNA close to protein-coding genes in wild house mice, Mol Biol Evol 28, 1183–1191.
101. Elyashiv, E., Bullaughey, K., Sattath, S., Rinott, Y., Przeworski, M., and Sella, G. (2010) Shifts in the intensity of purifying selection: an analysis of genome-wide polymorphism data from two closely related yeast species, Genome Res 20, 1558–1573.
102. Akashi, H. (1995) Inferring weak selection from patterns of polymorphism and divergence at silent sites in Drosophila DNA, Genetics 139, 1067–1076.
103. Templeton, A. R. (1996) Contingency tests of neutrality using intra/interspecific gene trees: the rejection of neutrality for the evolution of the mitochondrial cytochrome oxidase II gene in the hominoid primates, Genetics 144, 1263–1270.
104. Charlesworth, J., and Eyre-Walker, A. (2006) The rate of adaptive evolution in enteric bacteria, Mol Biol Evol 23, 1348–1356.
105. Sawyer, S. A., Kulathinal, R. J., Bustamante, C. D., and Hartl, D. L. (2003) Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection, J Mol Evol 57 Suppl 1, S154–164.
106. Andolfatto, P. (2008) Controlling type-I error of the McDonald-Kreitman test in genome-wide scans for selection on noncoding DNA, Genetics 180, 1767–1771.
107. Smith, N. G., and Eyre-Walker, A. (2002) Adaptive protein evolution in Drosophila, Nature 415, 1022–1024.
108. Shapiro, J. A., Huang, W., Zhang, C., Hubisz, M. J., Lu, J., Turissini, D. A., Fang, S., Wang, H. Y., Hudson, R. R., Nielsen, R., Chen, Z., and Wu, C. I. (2007) Adaptive genic evolution in the Drosophila genomes, Proc Natl Acad Sci U S A 104, 2271–2276.
109. Andolfatto, P. (2007) Hitchhiking effects of recurrent beneficial amino acid substitutions in the Drosophila melanogaster genome, Genome Res 17, 1755–1762.
110. Macpherson, J., Sella, G., Davis, J., and Petrov, D. (2007) Genomewide spatial correspondence between nonsynonymous divergence and neutral polymorphism reveals extensive adaptation in Drosophila, Genetics 177, 2083–2099.
111. Bachtrog, D. (2008) Similar rates of protein adaptation in Drosophila miranda and D. melanogaster, two species with different current effective population sizes, BMC Evol Biol 8, 334.
112. Cai, J., Macpherson, J., Sella, G., and Petrov, D. (2009) Pervasive hitchhiking at coding and regulatory sites in humans, PLoS Genetics 5, e1000336.
113. Ingvarsson, P. K. (2009) Natural selection on synonymous and nonsynonymous mutations shapes patterns of polymorphism in Populus tremula, Mol Biol Evol 27, 650–660.
114. Fay, J. C., and Wu, C. I. (2001) The neutral theory in the genomic era, Curr Opin Genet Dev 11, 642–646.
115. Eyre-Walker, A., and Keightley, P. D. (2007) The distribution of fitness effects of new mutations, Nature Reviews Genetics 8, 610–618.
116. Eyre-Walker, A. (2002) Changing effective population size and the McDonald-Kreitman test, Genetics 162, 2017–2024.
117. Andolfatto, P., Wong, K. M., and Bachtrog, D. (2011) Effective population size and the efficacy of selection on the X chromosomes of two closely related Drosophila species, Genome Biol Evol 3, 114–128.
118. Hanada, K., Zhang, X., Borevitz, J. O., Li, W. H., and Shiu, S. H. (2007) A large number of novel coding small open reading frames in the intergenic regions of the Arabidopsis thaliana genome are transcribed and/or under purifying selection, Genome Res 17, 632–640.
119. Pickrell, J. K., Marioni, J. C., Pai, A. A., Degner, J. F., Engelhardt, B. E., Nkadori, E., Veyrieras, J. B., Stephens, M., Gilad, Y., and Pritchard, J. K. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing, Nature 464, 768–772.
120. Sella, G., Petrov, D., Przeworski, M., and Andolfatto, P. (2009) Pervasive natural selection in the Drosophila genome?, PLoS Genetics 5, e1000495.
121. Kudaravalli, S., Veyrieras, J. B., Stranger, B. E., Dermitzakis, E. T., and Pritchard, J. K. (2009) Gene expression levels are a target of recent natural selection in the human genome, Mol Biol Evol 26, 649–658.

Chapter 7
The Origin and Evolution of New Genes
Margarida Cardoso-Moreira and Manyuan Long
Abstract
New genes are a major source of genetic innovation in genomes. However, until recently, understanding
how new genes originate and how they evolve was hampered by the lack of appropriate genetic datasets.
The advent of the genomic era brought about a revolution in the amount of data available to study new
genes. For the first time, decades-old theoretical principles could be tested empirically and novel and
unexpected avenues of research opened up. This chapter explores how genomic data can be, and is being, used to study both the origin and evolution of new genes, and the surprising discoveries made thus far.
Key words: New genes, Gene duplication, Retrogenes, Gene rearrangements, De novo genes, Genetic
novelty, Copy number variation

1. Introduction
In the 1940s, geneticists were immersed in a debate over the nature
of genetic innovation and organismal complexity (reviewed in
ref. 1). The debate centered on which class of mutations is responsible for the predominant changes observed
between the primordial amoeba and men. Are men and amoeba
separated only by mutations in preexisting genes or have increases
in gene number been a fundamental component of the history of
these two lineages? Fifty years onward, we find ourselves in the
genomic era, and in possession of the genomes of not only a great
number of species, but also of different individuals within the same
species. A comparison of the (several) amoeba and human genomes leaves no doubt that the origination of new genes is one of the most important sources of evolutionary change.
Most theoretical treatments of the population genetics and
molecular evolution of new genes focused on the particular class
of gene duplication and preceded the genomic revolution by several
decades (e.g., see refs. 24). When sequencing technology became
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_7,
© Springer Science+Business Media, LLC 2012

M. Cardoso-Moreira and M. Long

readily available in the 1980s, researchers were finally able to empirically study new genes. Initially, only a limited number of new genes
were studied in detail, and these were discovered mainly serendipitously (5, 6). In spite of the small sample size, the first examples of
new genes began to bring into question long-held views on the
mutational processes that generate new genes and on the evolutionary forces that act upon their formation (5, 7). With the onset
of the genomic era and the many technologies that it fostered (e.g.,
in situ hybridization, microarray technology), whole-genome
surveys of new genes became feasible. These data allowed researchers to start addressing decades-old questions regarding the early
stages of the evolution of new genes. Genome-wide surveys of new
genes confirmed several of the previous theoretical predictions and
provided a wealth of novel and unexpected observations.
This chapter discusses both the origin and early evolution of
new eukaryotic genes, predominantly focusing on the research of
the last 10 years that addresses both topics using genome-wide
approaches. This chapter is divided into two main sections. The
first section explores the different pathways that generate new genes
and how the different classes of new genes can be identified from
genomic data. The second section focuses on the evolutionary
trajectories of new genes. The techniques employed in different
studies are described, and the results that are relevant to understanding the evolutionary forces driving the fixation and preservation of new genes in genomes are examined.

2. Origin of New Genes

2.1. Mechanisms of New Gene Origination

New genes are created by a variety of molecular processes, and not all of them are present or are equally active in all genomes. Different
molecular pathways generate different classes of new genes, each
with distinct molecular signatures that can be recognized from
genomic sequence data. Different strategies can be used to date
the origin of a new gene, and depending on the class of new gene it
might be straightforward or impossible to determine which copy is
the original gene (henceforth called parental gene) and which copy
is the new gene (henceforth called offspring).

2.1.1. Gene Duplication

Gene duplication is arguably one of the most important sources of evolutionary change, and the study of its functional and evolutionary
consequences can be traced back to as early as 1911 (1, 8). Duplication events can vary dramatically in size, ranging from a few base pairs
to encompassing the complete genome. This review focuses on the
smaller class of duplication events, those smaller than a chromosome
and larger than a few hundred base pairs, where one or a few new
genes are introduced in genomes. Whole-genome duplications

7 Origin and Evolution of New Genes

(WGDs) are, however, a very important source of genetic
novelties (9), and the readers are encouraged to read Chapter 14,
Volume 1 by Kuraku and Meyer (10) of this book, where this phenomenon is discussed. For the purpose of this review, it is important to
note that new genes created by small-scale duplications and WGDs
differ not only in how they originate, but also in their early evolutionary trajectories. As a consequence, some classes of genes that tend to
be fixed after small-scale duplications are not retained in genomes
after whole-genome duplication events, and vice versa (9, 11, 12).
As genomes were being sequenced, it became clear that a
sizeable portion of all genes (ranging from 17% in some bacteria
to 65% in the plant Arabidopsis) could be recognized as being
duplicates (13). The first whole-genome study of the process of
gene duplication was published in 2000 by Lynch and Conery (14)
using the then recently fully sequenced fly (Drosophila melanogaster), nematode (Caenorhabditis elegans), and yeast (Saccharomyces
cerevisiae) genomes, and the large sequence data already available
for the Arabidopsis (A. thaliana), mouse (Mus musculus), and
human (Homo sapiens) genomes. This was a pioneering study
whose methods are still relevant today. Lynch and Conery used
gapped BLAST on all translated open reading frames to identify
similar sequence pairs within each genome. They then produced
nucleotide sequence alignments for all gene pairs and from them
they estimated the fraction of synonymous nucleotide substitutions.
Assuming a molecular clock (see Chapter 4, Volume 1 of this book;
ref. 112), dS (i.e., divergence at synonymous sites) can be used as a
crude estimate of the age of a duplication. Lynch and Conery
calculated the rate of gene duplication using the following data: (1)
number of highly similar gene pairs (dS < 0.01, i.e., divergence
lower than 1%); (2) estimated number of genes in each of the
genomes; and (3) independent estimates of the amount of time
needed for two duplicated genes to attain a divergence of 1%. The
authors estimated the rate of gene duplication to be between 0.002
(for Drosophila) and 0.02 (for the nematode) per gene, per million
years. These results were unexpected because they suggested a high
rate of gene duplication, on the same order of magnitude as the
mutation rate for nucleotide substitutions. With these same data,
they also estimated the rate of duplicate gene loss. Lynch and Conery
reasoned that if genes are created at a constant rate and if there is no
gene loss, when the youngest duplicates (dS < 0.25) are binned into
different values of dS, one should find a similar number of genes in
each bin. However, if there is gene loss, one would find instead a
decreasing number of genes with increasing dS. Lynch and Conery
found evidence for pervasive gene loss, with more than 90% of gene
duplicates disappearing from genomes after only 50 million years,
providing an average half-life of 3–7 million years (14).
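The arithmetic behind the Lynch and Conery estimates can be sketched in a few lines. The sketch below is only an illustration of the logic described above, not their actual pipeline; the dS cutoff, bin width, and the calibration of years per 1% synonymous divergence are placeholder assumptions.

```python
import math

def duplication_rate(ds_values, n_genes, years_per_1pct):
    """Duplicates born per gene per year: count pairs younger than
    dS = 0.01 and divide by gene number and by the years needed to
    accumulate 1% synonymous divergence."""
    young = sum(1 for ds in ds_values if ds < 0.01)
    return young / (n_genes * years_per_1pct)

def half_life(ds_values, bin_width=0.01, max_ds=0.25, years_per_1pct=5e6):
    """Bin young duplicate pairs by dS and fit a line to log(count)
    versus approximate age; a declining histogram implies gene loss,
    and the fitted decay slope converts to a half-life in years."""
    nbins = int(max_ds / bin_width)
    counts = [0] * nbins
    for ds in ds_values:
        if ds < max_ds:
            counts[int(ds / bin_width)] += 1
    # age of each bin centre, converting dS to years via the clock
    xs = [(i + 0.5) * bin_width / 0.01 * years_per_1pct
          for i, c in enumerate(counts) if c > 0]
    ys = [math.log(c) for c in counts if c > 0]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.log(2) / -slope  # exponential decay slope -> half-life
```

A flat histogram of dS bins would return an essentially infinite half-life (no loss); a steeply declining one, as Lynch and Conery observed, yields a half-life of a few million years.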
One limitation of this analysis is its reliance on the molecular
clock to estimate the ages of gene duplicates. Although it is

reasonable to use the molecular clock to make sequence data
comparisons between two species, the model may not hold for
duplicate genes as a result of gene conversion (for more details on
gene conversion, see Chapter 2, Volume 1 by Budd (15) in this
book). If gene conversion is relatively common, older duplicates will
falsely appear to be young (reduced dS), thereby leading to an
overestimation of the rate of gene duplication (which is calculated
using the number of very young genes (dS < 0.001)). An alternative
and more reliable method to the molecular clock is to use a species
phylogeny and parsimony to assign gene duplication events to the
intervals between the nodes of the phylogenetic tree. Whole-genome sequence data across a species phylogeny became available in 2003 with the published genome sequences of six S. cerevisiae relatives (16, 17). Using these data, Gao and Innan (18) recalculated the age distribution of gene duplicates (originating from WGD
and small-scale duplications) using the species tree and arrived at a
rate of gene duplication two orders of magnitude lower than the one
reported by Lynch and Conery (14). The discrepancy between the
two studies suggests that gene conversion plays an important role in
the evolution of gene duplicates in yeast genomes, and consequently
that the phylogenetic approach is more reliable than relying on the
molecular clock (18). In 2007, whole-genome sequence data
became available for 12 Drosophila species (19), providing a second
opportunity to estimate the rate of gene duplication without resorting to the molecular clock. The results of this analysis have, however,
been inconclusive. Using the data for all 12 genomes, Hahn and
colleagues (20) estimated the rate of gene duplication to be similar
to the one calculated by Lynch and Conery (21), thereby suggesting
that gene conversion plays a minor role on gene duplicates across the
Drosophila phylogeny. However, Osaka and Innan (22), using the
same data for the D. melanogaster subgroup (which corresponds to
4 of the 12 species), arrived at a lower estimate for the rate of gene
duplication (but to a lesser degree than the difference found for the
yeast genomes), and further found evidence for widespread gene
conversion among recent gene duplicates. Despite the disagreement
between these two studies on the importance of gene conversion for
the evolution of gene duplicates in Drosophila, the phylogenetic
approach should be robust to the effects of gene conversion and
consequently should be favored if the necessary data are available.
Another advantage of the phylogenetic approach is that it also
avoids the problem of variation in the evolutionary rate at synonymous sites that can also affect the dating of duplicate genes (23).
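The parsimony assignment described above can be sketched as follows: the duplication is placed on the branch leading to the most recent common ancestor of the species that carry the new copy (assuming no secondary losses). The toy yeast tree, encoded as a child-to-parent map with made-up internal node names, is purely illustrative.

```python
# Toy species tree encoded as child -> parent (node names illustrative).
TREE = {
    "S.cer": "n1", "S.par": "n1",   # n1 = cerevisiae/paradoxus ancestor
    "n1": "n2", "S.mik": "n2",      # n2 adds S. mikatae
    "n2": "root", "S.bay": "root",  # root adds S. bayanus
}

def path_to_root(node, tree):
    """Leaf-to-root list of nodes, root included."""
    path = [node]
    while node in tree:
        node = tree[node]
        path.append(node)
    return path

def duplication_branch(species_with_copy, tree):
    """Parsimony dating: the duplication maps to the branch leading to
    the most recent common ancestor of the species carrying the new
    copy (assuming the copy was never secondarily lost)."""
    paths = [path_to_root(s, tree) for s in sorted(species_with_copy)]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    for node in paths[0]:  # first shared node walking from the leaf up
        if node in shared:
            return node
```

A duplicate found only in one species maps to that terminal branch; one shared by S. cerevisiae and S. paradoxus maps to their common ancestor, independent of any molecular-clock assumption.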
Duplication events do not have to be restricted to single
genes, and quite often encompass multiple genes. As a result, it
makes sense to search for the complete stretch of DNA sequence
that was duplicated (segmental duplication) instead of only
searching for individual gene duplicates. There are two main
advantages to this approach: (1) the rate of gene duplication is

not overestimated by a single duplication event being counted multiple times, and (2) information is gathered on the molecular
pathways that generated that mutation. The identification of segmental duplications can also be carried out using the BLAST suite
of programs (or similar algorithms). However, instead of using
individual gene sequences (amino acid and/or nucleotides), an
all-by-all nucleotide genome comparison is required, usually
followed by filtering steps aimed at distinguishing duplication
events from transposable element sequences, microsatellites, and
other repeats (for an example, see ref. 24).
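A minimal version of such a filtering step might look like the sketch below. The hit tuples, repeat intervals, and thresholds are illustrative assumptions, not the actual pipeline of ref. 24.

```python
def overlaps_repeat(chrom, start, end, repeats):
    """True if [start, end) intersects any annotated repeat on chrom."""
    return any(s < end and start < e for (c, s, e) in repeats if c == chrom)

def candidate_segdups(hits, repeats, min_len=1000, min_ident=0.90):
    """Filter an all-by-all genome self-comparison down to putative
    segmental duplications: non-self, long, high-identity alignments
    not explained by transposons, microsatellites, or other repeats."""
    out = []
    for h in hits:  # h: (chrom1, s1, e1, chrom2, s2, e2, identity)
        c1, s1, e1, c2, s2, e2, ident = h
        if (c1, s1, e1) == (c2, s2, e2):
            continue  # trivial self-alignment of a region to itself
        if e1 - s1 < min_len or ident < min_ident:
            continue  # too short or too diverged
        if overlaps_repeat(c1, s1, e1, repeats) or \
                overlaps_repeat(c2, s2, e2, repeats):
            continue  # likely a repeat family, not a segmental duplication
        out.append(h)
    return out
```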
Additional challenges are faced in the detection of duplications
that are still polymorphic or that were only recently fixed. These
very young duplications have diverged so little between each other
that they can be collapsed together when genomes are assembled.
As a consequence, the number of very young duplicates may be
underestimated from most current genome assemblies. Bailey and
colleagues (25) showed that this was an appreciable concern in the
human genome by estimating that at least 5% of the human genome
is composed of segmental duplications. Bailey and colleagues
cleverly reasoned that if they mapped the available whole-genome
shotgun reads against the reference genome sequence, the regions
that correspond to collapsed segmental duplications should show
an increase in read depth resulting from paralogous reads aligning
to the same region. Read depth can be calculated using sliding
windows along chromosomes, and after segmental duplications
are detected their breakpoints can be refined using small-sized
windows around the predicted breakpoints (25). This strategy has
proven to be relatively successful in identifying segmental duplications in several mammalian genomes (26) and is now routinely used
to detect polymorphic duplications using next-generation sequencing data (e.g., see ref. 27). A main caveat of this approach is that the
genomic location of the extra copies cannot be retrieved from the
analysis.
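The read-depth logic can be sketched directly. The window size and fold-change cutoff below are arbitrary illustrative choices; real analyses calibrate them against the genome-wide depth distribution.

```python
def window_depth(depth, window=100):
    """Mean read depth in non-overlapping windows along a chromosome."""
    return [sum(depth[i:i + window]) / window
            for i in range(0, len(depth) - window + 1, window)]

def collapsed_duplication_windows(depth, window=100, fold=1.5):
    """Flag windows whose depth exceeds `fold` times the chromosome-wide
    mean -- the read-depth signature of a collapsed duplication, where
    reads from two paralogous copies pile onto one assembled region."""
    means = window_depth(depth, window)
    genome_mean = sum(means) / len(means)
    return [i for i, m in enumerate(means) if m > fold * genome_mean]
```

In practice the flagged windows would then be merged and their breakpoints refined with smaller windows, as Bailey and colleagues did.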
Determining which of the duplicate copies is the parental gene
and which is the offspring can be difficult (Fig. 1). For dispersed
duplicates (located distantly from each other), the parent–offspring relationship can be established by combining phylogenetic and syntenic information (Fig. 1). For tandem duplications accompanied by inversions, phylogenetic information combined with gene orientation can also determine the parent–offspring relationship. However,
for tandem gene duplications, it may be impossible to distinguish
which copy is the parental gene and which copy is the offspring.
Fig. 1. Schematic depiction of (a) complete, (b) partial, and (c) dispersed gene duplication events as seen in a phylogenetic context. Please note that for complete and partial tandem duplications (a and b) it may be impossible to distinguish the ancestral from the derived copies. In the case of dispersed duplications (c), the parent–offspring relationship can be inferred by combining phylogenetic and syntenic information.

There are two main sources of large duplications (and deletions): the imperfect repair of DNA double-strand breaks and DNA replication errors (28). Multiple cellular processes can generate DNA double-strand breaks (e.g., oxidative stress, replication), and since these are highly pathogenic they have to be readily repaired (29). Cells use two main DNA repair pathways to fix these breaks, one that is homology dependent (homologous recombination or HR) and another that is homology independent
(nonhomologous end joining or NHEJ) (29, 30). Both HR and
NHEJ have been implicated in creating copy number changes (i.e.,
duplications and deletions). HR can generate duplications (and
deletions) when the repair utilizes nonallelic sequences of high
sequence identity (instead of the corresponding allele in the sister
chromatid or in the homologous chromosome) in a process known
as nonallelic homologous recombination (NAHR) (28, 30). Transposable elements, segmental duplications (older duplications
already fixed in the species), and other classes of repeats can all
mediate NAHR (28, 30). As a result, for young duplications, the
role of NAHR can be inferred directly by determining if the duplicated region is flanked by sequences of high sequence identity.
In the absence of these sequences, NHEJ or DNA replication errors
are assumed to be the underlying mechanism. It has been proposed
that DNA replication errors underlie the more complex class of
rearrangements (i.e., regions exhibiting multiple structural variants)
but it is currently unknown what is its contribution to the formation
of simple duplications (and deletions) (28, 30).
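This inference rule can be phrased as a small decision function. The identity and length cutoffs are illustrative, and the ungapped comparison of equal-length flanks is a simplification of what a real breakpoint analysis would do.

```python
def percent_identity(a, b):
    """Ungapped identity between two equal-length flanking sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def infer_mechanism(left_flank, right_flank, min_ident=0.90, min_len=100):
    """Crude mechanism call for a young duplication: highly similar
    sequences flanking the two breakpoints point to NAHR; otherwise
    NHEJ or a replication-based error is assumed by default."""
    if len(left_flank) >= min_len and len(left_flank) == len(right_flank) \
            and percent_identity(left_flank, right_flank) >= min_ident:
        return "NAHR"
    return "NHEJ/replication-based"
```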

Fig. 2. Schematic depiction of how different classes of genomic rearrangements (deletions, inversions, and translocations) can create fusion genes by juxtaposing sequences from two previously independent genes. All these rearrangements can be preceded by a duplication event, which would allow the creation of a new gene without disrupting the parental genes. The dashed lines represent the area that is mutated (deleted, inverted, or translocated to another genomic location).
All examples would create a novel chimeric gene structure.

2.1.2. Genomic Rearrangements

Inversions, translocations, and deletions all have the potential to create new genes by juxtaposing the sequences of two previously
independent genes. One example is gene fusion, where two previously distinct genes are fused together in the same transcript creating a novel protein (Fig. 2). Although gene fusions may not be a
dominant source of new genes in natural populations (though there
are several known examples (31)), they play an important role in
many human cancers as gain-of-function mutations (32). Another
example of joining distinct genic sequences is exon shuffling,
which, as the name suggests, corresponds to recombination-mediated rearrangement of exons between different genes. Exon
shuffling is likely to play a major role in the formation of novel
protein domains (33, 34). If a duplication precedes the genomic
rearrangement, a new gene can be formed while maintaining the
parental gene intact. This is expected to increase the probability of
the new gene not being deleterious, thus increasing its probability
of being fixed.

2.1.3. Retroposition

Retroposition is a class of gene duplication (often called RNA-level duplication or retroduplication) with many distinctive features that distinguish it from the classical model of gene duplication and so merits independent consideration. Retrogenes are created when a messenger RNA is reverse transcribed and inserted back into the genome.

Fig. 3. Schematic representation of how retrogenes are created (a) and how they can be identified using a phylogenetic approach (b). In (a), a retrogene is created after the messenger RNA from the parental gene, intronless and containing a poly-A tail, is reinserted back into the genome. A new regulatory element is then recruited by the new retrogene. A retroposition event can be clearly identified and dated using phylogenetic information (b).

Retrogenes are readily identifiable in genome sequences
due to several clear hallmarks: (1) absence of introns, (2) presence
of a poly-A tail, and (3) flanking short direct repeats. The direct
repeats and poly-A tail may not be detectable for older retrogenes,
but the presence in a genome of two duplicate genes, one with
introns and the other intronless, strongly suggests that the latter
was created by retroposition (Fig. 3). The ease with which retrogenes and their parental genes are identified in whole-genome data
has made them a model system with which to study new gene
formation and evolution (5, 35).
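The intron-presence hallmark translates into a simple screen over annotated paralog pairs, as sketched below. The input formats and gene names are hypothetical; a real screen would also check for the poly-A tract and flanking direct repeats before calling a retrogene.

```python
def retrogene_candidates(exon_counts, paralog_pairs):
    """Flag paralog pairs in which one copy is multi-exon and the other
    is intronless: the intronless copy is the putative retrogene and
    the multi-exon copy its parental gene."""
    calls = []
    for g1, g2 in paralog_pairs:
        n1, n2 = exon_counts[g1], exon_counts[g2]
        if n1 > 1 and n2 == 1:
            calls.append({"parent": g1, "retrogene": g2})
        elif n2 > 1 and n1 == 1:
            calls.append({"parent": g2, "retrogene": g1})
    return calls  # pairs where both copies have introns are ignored
```

Unlike tandem duplicates, the parent–offspring relationship here is unambiguous: the intronless copy is always the derived one.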
Using the different dating strategies highlighted above, the
rates of functional retrogene formation have been estimated
for the fly, human, and rice genomes to be 0.5, 1, and 17 new
retrogenes per million years, respectively (3638). However,
retrogenes are not present in all genomes, and if present their
abundance can vary greatly between organisms. This is because in
order for retroposition to occur two important conditions have to
be met: (1) the genome has to possess enzymes capable of reverse
transcribing messenger RNAs and integrating the cDNAs back into
the genome and (2) those enzymes have to be active in the germ
line (in order for retrogenes to be heritable). This may help explain why, while the fly and mammalian genomes are very rich in retrogenes, the nonmammalian vertebrate genomes sequenced so far
seem to be lacking them (35, 39).
An important feature of retroposition is that it frequently
(though not always) generates new genes without regulatory elements. For this reason, retroposition was long believed to be

inconsequential for the origin of new genes.

Fig. 4. In a lateral gene transfer event, a gene present in one species is horizontally transferred to another species, creating a situation where the gene tree disagrees with the known species tree.

However, a growing number of studies are demonstrating that there are vast numbers of
functional retrogenes and that they have been able to recruit regulatory elements through several means (35). For example, retrogenes are often inserted within or near other genes,
allowing them to share their regulatory machinery. They can also
recruit regulatory elements from nearby retrotransposons, from
CpG dinucleotides, as well as evolving de novo regulatory elements.
Finally, when retrogenes are created from genes with multiple
transcript start sites, regulatory elements from the parental gene
are also part of the newly formed retrogene (35).
2.1.4. Lateral Gene Transfer

Lateral (or horizontal) gene transfer occurs when a gene is transferred between different organisms (as opposed to being vertically
transmitted through the germ line). The laterally transferred gene
and its ortholog in the parental lineage are often called xenologs
(40). Lateral gene transfer has been shown to be rampant among
certain prokaryotic taxa, where it is associated with gains of new
genes with many distinct novel functions that contribute dramatically to the evolution of those taxa (41, 42). Lateral gene transfer
events can be recognized from genome sequence data in several
ways. A lateral gene transfer event generates anomalous or incongruent phylogenetic trees, whereby a given gene may share the highest
sequence similarity with a gene in a distantly related species (Fig. 4).
Without resorting to phylogenetic trees, genes that have been laterally transferred can be identified in genomes when there are contigs
(or sequence reads) that contain sequences readily identified as
belonging to different genomes (for example, the presence of

M. Cardoso-Moreira and M. Long

both bacterial and eukaryotic gene sequences in the genome of
a eukaryote) (43). See Chapter 10, Volume 1 by Lawrence and
Azad (44) in this book for more details on how to detect lateral gene
transfer events.
Although prokaryote-prokaryote lateral gene transfers are
considered to be fairly abundant, prokaryote-eukaryote (and
eukaryote-prokaryote) transfers are believed to be much rarer, and
eukaryote-eukaryote transfers rarer still. Noteworthy examples of lateral
gene transfers between prokaryotes and eukaryotes are the several
genes in eukaryotic nuclear genomes that originated from the
mitochondrial and plastid genomes (45). Several examples of lateral
gene transfers between the bacterial endosymbiont Wolbachia and
several insect and nematode species have also been documented
(prokaryote-eukaryote lateral gene transfer) (46), as have lateral
gene transfers from eukaryotes to prokaryotes (47). A remarkable
example of lateral gene transfer was found in the pea aphid (Acyrthosiphon pisum) genome. When this genome was sequenced in 2010,
the authors detected more than ten events of lateral gene transfer
from bacteria to this eukaryotic genome (48). However, a limitation
of the study's design was that it identified laterally transferred genes
of bacterial origin only. Intriguingly, a subsequent study demonstrated that aphids get their orange and red colorations from a set of
genes created by duplication events that followed an initial lateral
gene transfer from the genome of a fungus (eukaryote-eukaryote
lateral gene transfer) (49). The detection of laterally transferred
genes should become easier as more sequence data from many
different groups of organisms become available. These data should also
make it possible to quantify the extent of lateral gene transfer
between different taxa.
2.1.5. De Novo Gene Origination

De novo genes refer to events where a coding region originates
from a previously noncoding region. De novo genes were thought
for a very long time to be, at most, rare, even though it was
acknowledged that new exons could possibly be added this
way (i.e., de novo exons) (5). However, in 2006, Levine and colleagues (50) reported the existence of five new genes in the
D. melanogaster genome, all derived from noncoding DNA. This
exciting observation was confirmed in subsequent studies on the
origin of new genes in Drosophila (51, 52) and by discoveries of de
novo genes in several other genomes (53–56). In order for a new
gene to be classified as a de novo gene, the orthologous noncoding
region in the genome of a close relative should be identified. This is
required to show that indeed coding sequence evolved from a
previously noncoding sequence (Fig. 5). The presence of a gene
in a genome and its absence in the genomes of close relatives does
not necessarily imply that that gene evolved de novo. For example,
that gene could have been lost from all other genomes or it could
still be present in those genomes but in regions that are hard to
sequence and/or assemble (e.g., heterochromatic regions).

Fig. 5. A gene can be created de novo when mutations generate a new open reading frame (a) and new regulatory sequences, through acquisition of a promoter and expression (b). Although a de novo gene will only be present in the lineage where it was created, orthologous noncoding sequences will be present in closely related taxa.

2.2. New Noncoding Genes

The repertoire of genes in genomes is not limited to protein-coding
genes, but also includes several classes of noncoding RNA genes,
such as microRNAs, Piwi-interacting RNAs, and long noncoding
RNAs. However, the origin and evolution of noncoding RNA genes
are still poorly understood. This reflects the fact that these classes of
genes were unknown until recently, but also that they are difficult to
detect and present significant challenges for testing functionality.
The first studies aimed at investigating the origin of new noncoding genes focused on gene duplication. These studies revealed
an important role for gene duplication in generating microRNAs
(57) and Piwi-interacting RNAs (58). However, evidence is still
lacking for the role of gene duplication in the formation of long
noncoding RNAs (59). Intriguingly, studies of individual long noncoding RNAs, such as Xist in mammals and spx in flies, showed
that these were created from protein-coding genes, suggesting that
this could be a potentially important pathway for the formation of
this class of genes (60, 61). Transposable elements are often
involved in the formation of new genes: by mediating duplication
events (e.g., see ref. 62), by being incorporated into new
protein-coding genes as exons, and/or by providing the enzymes
needed for retroposition to occur (35). They may play an even more
important role in the origination of new noncoding genes as several
small RNA genes seem to have emerged from transposable elements
(63, 64) as well as some long noncoding RNAs (65). The study of
the origin and evolution of novel noncoding genes will likely flourish in the next couple of years, propelled by a better understanding
of the molecular biology of these genes.

2.3. Evidence of Functionality in New Genes

The term "new gene" is not indiscriminately applied to any type of
novel coding sequence. It is reserved for those gene structures that
show evidence of functionality. By definition, a new gene should have
an open reading frame, free of any disabling mutations, such as
premature stop codons or frameshift mutations. It is important to
note, however, that the presence of disabling mutations is only
suggestive of the absence of functionality. For example, after a gene
duplication event, a mutation that truncates the encoded
protein could potentially generate a new functional
protein. More informative is determining if a new gene is evolving
under selective constraint (as expected if that gene is functional) or if
it is evolving neutrally (as expected from nonfunctional sequences).
Information on the selective forces acting on gene pairs can be
gathered by determining the rate of synonymous nucleotide substitutions (dS) and the rate of nonsynonymous (i.e., amino acid replacement) substitutions (dN) per site. dN/dS ratios are commonly
calculated between orthologous genes, where a dN/dS ratio significantly smaller than 1 suggests that the gene pair is under purifying
selection while a dN/dS ratio close to 1 suggests that the genes are
evolving under no or very little constraint. A third possibility is a
dN/dS ratio significantly higher than 1, which is suggestive of positive
selection (66). See Chapter 5 by Kosiol and Anisimova (67) in this
volume for details on estimating dN/dS. A similar test can be applied
to paralogs with a small change. If the parental gene is evolving
under functional constraint but the offspring is evolving under no
constraint, the pairwise dN/dS ratio (which averages over both
lineages) will be significantly smaller than 1 but greater than 0.5 (68).
Hence, for new genes, evidence of constraint using a dN/dS ratio
should conservatively require it to be smaller
than 0.5 instead of simply 1, because only the former guarantees that
the offspring gene is also under purifying selection. In addition to
tests of evolutionary constraint, evidence for transcription and translation of the novel coding sequence provides strong evidence that
a putative new gene is functional. However, it is important to note
that evidence that a novel coding sequence is expressed is not
enough to infer functionality because often bona fide pseudogenes
are transcribed (69). Evidence that the new gene is actually translated
into a protein constitutes much stronger evidence of functionality
(52). Ideally, inferring that a new gene is functional should
require several lines of evidence. Moreover, particular classes of new
genes may require additional or different lines of evidence to show
evidence of functionality, as is the case with de novo genes and
new noncoding genes.
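The dN/dS logic above, including the stricter 0.5 cutoff for parent-offspring pairs, can be sketched in a few lines. This is an illustrative point-estimate classifier only: the tolerance band and the labels are assumptions made here for demonstration, and in practice these hypotheses are tested with likelihood ratio tests (e.g., in PAML's codeml).

```python
def classify_pair(dn, ds, paralogs=False, tol=0.1):
    """Crude point-estimate interpretation of a dN/dS ratio.

    The `tol` band around the neutral expectation is illustrative;
    real analyses use likelihood ratio tests rather than cutoffs.
    """
    if ds == 0:
        return "undefined (no synonymous substitutions)"
    omega = dn / ds
    # For a parent-offspring pair, one constrained copy (omega ~ 0) plus
    # one unconstrained copy (omega ~ 1) averages to ~0.5, so constraint
    # on BOTH copies conservatively requires omega < 0.5.
    neutral_expectation = 0.5 if paralogs else 1.0
    if omega < neutral_expectation - tol:
        return ("purifying selection on both copies"
                if paralogs else "purifying selection")
    if omega > 1.0 + tol:
        return "candidate positive selection"
    return "little or no constraint"

print(classify_pair(0.08, 0.40))                 # orthologs, omega = 0.2
print(classify_pair(0.24, 0.40, paralogs=True))  # paralogs, omega = 0.6
```

Note how the same omega of 0.6 that would indicate constraint between orthologs is, for a duplicate pair, compatible with one copy evolving free of constraint.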

2.4. Lessons from Genome-Wide Surveys of New Genes

Zhou and colleagues (51) generated the first comprehensive survey
of all classes of recently generated new genes for the D. melanogaster
species subgroup (which comprises four Drosophila species). By
taking advantage of the 12 Drosophila genomes, their well-known
phylogeny, and estimated divergence times, they detected all novel

7 Origin and Evolution of New Genes

173

genes generated after the split of the D. melanogaster species
subgroup and dated each event (51). Both sequence similarity
and syntenic information were used to infer orthology. Zhou and
colleagues found that tandem gene duplications correspond to the
vast majority (~80%) of new lineage-specific genes (i.e., genes
present in only one species). However, they found a different
pattern for older new genes (those shared by multiple species and
more likely to be functional): 44% are dispersed gene duplicates
(i.e., located distantly from each other) while only 34% occur as
tandem duplications. Ten percent of the remaining new genes were
created by retroposition, and a surprising twelve percent were
created de novo. No lateral gene transfers were detected. Using
this subset of older new genes, Zhou and colleagues estimated the
rate of new gene origination to range between 0.0004 and 0.0009
per gene per million years, which translates into 5–11 new genes
added to the Drosophila genome every million years.
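The back-of-the-envelope conversion behind that 5–11 figure can be reproduced directly. The ~12,000 protein-coding genes assumed here for the Drosophila genome is a hypothetical round number chosen for illustration, not a value reported in the survey itself.

```python
# Convert a per-gene origination rate into genome-wide gene gains.
# GENES_IN_GENOME is an assumed round figure for D. melanogaster,
# used only to illustrate the arithmetic.
GENES_IN_GENOME = 12_000

low, high = 0.0004, 0.0009  # new genes per gene per million years
gains = [round(rate * GENES_IN_GENOME) for rate in (low, high)]
print(f"~{gains[0]}-{gains[1]} new genes per million years")  # ~5-11
```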
One of the most surprising results coming from surveys of new
genes in different genomes is the large number of chimeric gene
structures found. A new gene is considered chimeric if it recruits
novel sequence from nearby regions. For example, retrogenes are
expected to recruit novel regulatory sequences as the transposition
event often leads to the loss of all regulatory sequences from the
parental gene. Similarly, gene fusions and exon shuffling generate
chimeric gene structures (70). However, gene duplication, which is
the mechanism responsible for the creation of most new genes, was
thought for a long time to generate two fully redundant copies of a
gene (4). As discussed in the next section, population genetic
models of the evolution of gene duplicates usually assume this to
be the case.
The highest rate of new chimeric gene formation was observed in
grass genomes (37), where 7 chimeric genes are fixed every million
years, a rate 50 times higher than the one found for humans (36). In
the survey of new genes in Drosophila mentioned above, Zhou and
colleagues (51) found that only 41% of new genes specific to
D. melanogaster have their coding sequence completely duplicated
and that this percentage is even lower for older new genes (16%).
They also found that ~30% of all new genes recruit additional flanking
sequence. Previous studies on new genes created by gene duplication
in the nematode C. elegans also suggested that as much as 50% of all
new genes have recruited novel sequences and that most gene duplication events do not encompass the complete gene structure but are
instead partial gene duplications (71, 72). Better insight into the
mutational processes generating new genes can be gained by looking
at the youngest class of all new genes that are still segregating as
polymorphisms. Surveys of polymorphic duplications and deletions
in both flies and humans (collectively called copy number variants, or
CNVs) found that most duplications are indeed partial, with only a
minority encompassing complete genes (73, 74).

By comparing new genes of different ages, insight can be
gained into the characteristics that increase their probability of
being preserved in genomes. Both the distance between the two
copies of a gene and the recruitment of other genomic sequences
(i.e., creation of chimeric gene structures) seem to increase the
probability of a new gene being preserved in a genome for a longer
period of time (75).
When knowledge of new genes was limited to individual case
studies, two patterns began to emerge. The first was that many new
genes were found on the X chromosome. The second was that most
new genes were proposed to have male-biased functions, with
evidence coming from both expression and functional data (e.g.,
see refs. 76, 77). Genome-wide surveys of new genes emphatically
confirmed both patterns (51). They further showed that these patterns hold even for less conventional classes of new genes, such as
de novo genes (50, 78). Recent studies have shown that both the
distribution of new genes among chromosomes and their expression patterns are dynamic processes. In both fly and mammalian
genomes, the youngest class of new genes is enriched on the X
chromosome and exhibits male-biased expression (52, 79). However, for older classes of new genes, both patterns change: these
genes are less likely to reside on the X chromosome and to have
male-biased functions (52, 79). One explanation is that new genes
with male-biased expression move progressively through time out
of the X chromosome and into the autosomes, leading to an overall
paucity of male-biased genes on the X chromosome (52, 79–81).
The movement of new genes out of the X chromosome and into the
autosomes was first described in Drosophila for retrogenes (77),
later confirmed in the mouse and human genomes (68) and further
shown to also be true for genes created by gene duplication (82).
More work is required to determine the actual proportion
of retrogenes (and new genes in general) that are formed on the X
chromosome and then translocated to the autosomes (50).
Global analysis of gene expression of both parental and offspring
genes in flies and mammals suggests that meiotic X chromosome
inactivation is one of the driving forces behind the movement of
new male-biased genes away from the X chromosome (83, 84).

3. The Evolutionary Trajectories of New Genes

Just like any other mutation, new genes can be neutral, deleterious,
or advantageous. Except in extremely small populations,
if a new gene is deleterious it will be kept at low
frequency in the population, never reaching fixation (i.e., never
becoming present in all individuals of the species). Examples of
deleterious new genes are duplications of dosage-sensitive genes,

where the new copy of the gene leads to a deleterious change of
gene expression (85). If a new gene is neutral or advantageous, then
it has a chance of becoming fixed. The probability of fixation and
the time to fixation depend on the strength of selection. The higher
the selective advantage, the likelier it is for the new gene to be fixed
and the shorter the time to fixation. It is important to note that the
most likely fate for neutral (and even advantageous) new genes is
removal from the population (86). Once a new gene is fixed, its
subsequent evolution dictates its probability of being retained in
the genome for long periods of time (86, 87). Three main evolutionary fates have been suggested for new gene duplicates and these
can be extended to other classes of new genes. They are discussed in
detail below.
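The dependence of fixation probability on the strength of selection can be made concrete with Kimura's diffusion approximation for a new mutation arising as a single copy in a diploid population. This is a standard population genetics formula rather than one given in the chapter, and the population size and selection coefficients below are arbitrary illustrative values.

```python
import math

def p_fix(N, s):
    """Kimura's fixation probability for a new mutation with selective
    advantage s, starting at frequency 1/(2N) in a diploid population
    of size N (effective size assumed equal to census size)."""
    if s == 0:
        return 1.0 / (2 * N)  # neutral limit
    return (1 - math.exp(-2 * s)) / (1 - math.exp(-4 * N * s))

N = 10_000
for s in (0.0, 0.001, 0.01):
    print(f"s = {s:<5}  P(fix) = {p_fix(N, s):.6f}")
# For small positive s, P(fix) is close to 2s: even a 1% advantage
# leaves a new gene with only a ~2% chance of fixation, echoing the
# point that most new genes, advantageous or not, are lost.
```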
3.1. Possible Evolutionary Fates for New Genes

3.1.1. Pseudogenization (Nonfunctionalization)

The most likely outcome for a new gene is to become a pseudogene
due to the accumulation of inactivating mutations. It has been
estimated that there is one pseudogene for every eight functional
genes in the C. elegans genome (88) and as many as one pseudogene
for every two functional genes in the human genome (89).
It is important to emphasize that not all pseudogenes are derived
from new genes. Many genes that were functional for long periods
of time become pseudogenes because of changes in the evolutionary
pressures acting on them. For example, it is thought that the
reduced use of olfaction in hominoids contributed to the large
percentage of pseudogenes in the family of human olfactory
receptors (90). The reason that pseudogenization is the most likely
outcome for a new gene is that the vast majority of mutations that
can occur in a new gene (or in any other genomic sequence) are
either neutral or deleterious. Hence, if a new gene is not evolving
under constraint, it will sooner or later accumulate enough mutations
to render it nonfunctional.

3.1.2. Neofunctionalization
New genes will be preserved in genomes for long periods of time if
they confer a novel (advantageous) function. The classical neofunctionalization model advocated by Ohno proposed that after a gene
duplication event there would be two redundant copies of the same
gene, which would relax selective constraints in one of the copies
allowing it to accumulate mutations (4). Although advantageous
mutations are rare, if one occurred in one of the copies of the gene
it could provide it with a novel function, thereby preserving the new
duplicate in the genome. A now classical example of neofunctionalization is the duplication of a pancreatic ribonuclease gene in
leaf-eating monkeys. After the duplication, one of the copies evolved
rapidly under positive selection for a more efficient digestive function
in a new microenvironment (91). Remarkably, this same gene was
suggested to have been duplicated independently in Asian and African leaf-eating monkeys and in both monkeys one of the copies
evolved under positive selection for enhanced digestive function (92).

The very large number of duplicates preserved in genomes
suggested to some that neofunctionalization could not be responsible
for the preservation of all or even most of them (93, 94). This
is because the balance between the number of deleterious and
advantageous mutations tilts strongly toward the former. This led
different authors to propose alternative models, namely, the different subfunctionalization models described below. However, recent
genomic data suggests that novel functions may be more common
than previously thought and that they can often be created at the
time the new gene is formed. With the exception of complete gene
duplications, all other processes that create new genes do not generate two fully redundant copies of the same gene. Partial gene
duplications, gene fusions, exon shuffling, retrogenes, and de novo
genes all create novel gene structures that often recruit nearby
genomic sequences. Even if a novel gene structure is not created,
the presence of the new gene in a different chromatin environment
from its parental gene could potentially already endow it with a new
function (e.g., by being able to be expressed under different conditions). Of course, only a small fraction of these novel gene structures
are likely to provide a novel function and are thus likely to be fixed
and preserved by positive selection (51). Surveys of new genes
support the idea that novel gene structures and/or different genomic locations contribute disproportionately to the fraction of new
genes that end up being preserved in genomes (51, 62).
3.1.3. Subfunctionalization

The concept that a pair of duplicate genes can share the same
function of the ancestral gene is old (1). More recently, this concept
has been formalized into distinct models. One of them is called the
duplication, degeneration, complementation (DDC) model (93).
It posits that after a gene duplication event that generates two fully
redundant copies, selection is relaxed for both copies and mutations
are allowed to accumulate. A mutation that would be deleterious
when there was only one copy of the gene is now rendered neutral
due to the presence of the other copy. This allows both copies to
accumulate degenerative and complementary mutations, which
result in the two genes being necessary to fulfill the functions of
the original gene. Importantly, this model of subfunctionalization
requires only neutral substitutions (as opposed to beneficial mutations) and applies to the partitioning of functions coded both in
protein and regulatory sequences. An alternative subfunctionalization model is called the escape from adaptive conflict (EAC) (9, 94,
95). This model assumes that the original gene is capable of two or
more distinct functions that cannot be simultaneously optimized by
selection due to pleiotropic effects. Gene duplication would allow
each of the copies to perform one of the functions that could now be
optimized by positive selection. The DDC and EAC models differ in
that in the DDC the mutations that cause the subfunctionalization
are explicitly neutral, whereas in the EAC they are adaptive.

Neofunctionalization and subfunctionalization are not mutually
exclusive. After a subfunctionalization event that preserves the two
duplicates in the genome, an advantageous mutation can still occur
and create a novel function in one of the duplicates. Subfunctionalization could greatly increase the probability of neofunctionalization
by extending the period of time available for an advantageous mutation to occur (96).
3.2. Methods to Detect the Evolutionary Forces Acting on New Genes

3.2.1. Determining the Selective Forces Responsible for the Fixation of New Genes

Understanding the fixation process of new genes requires either the
study of recently fixed new genes or the study of new genes that are
still polymorphic in the population. When a new gene is fixed,
either by neutral genetic drift or by positive selection, it exhibits
reduced levels of polymorphism because all individuals in the population share the same recently originated new gene. However, the
degree of reduction of polymorphism in the new gene (and also in
the parental gene if they are linked) depends on the strength of
selection. The stronger the selection, the lower the levels of polymorphism. Positive selection also leads to reduced levels of polymorphism in the sequences surrounding the new gene, a
phenomenon referred to as selective sweep (97). The stronger the
selection, the more reduced the levels of polymorphism will be and
the larger the area surrounding the new gene that exhibits low
levels of polymorphism. After the fixation, patterns of polymorphism in both the new gene and the surrounding sequences return
to the levels observed before the mutation event, thereby erasing
the signature of the selective force responsible for this process (97).
Very few studies to date have addressed the fixation process of new
genes (a remarkable exception being ref. 98). This is likely to change in
the next few years with the proliferation of population genomic
data for different species (e.g., the 1000 Genomes Project, various
Drosophila population genomics projects, and the Arabidopsis
population genomics project).
Polymorphic new genes can also provide important information
about the process of fixation of new genes. Surveys of CNVs in
different species have already identified several candidates to be
under positive selection (73, 74). Evidence comes from analyzing
patterns of polymorphism surrounding the CNVs as described
above and by looking at population differentiation (99, 100).
Most CNV studies so far identify polymorphic duplications but
often cannot determine the exact number of new copies, their
location, or their actual sequence. As next-generation sequencing
methods are more widely applied to detect CNVs, these limitations
should disappear and detailed sequence analysis of both the polymorphic duplications and their flanking sequences will be available.
CNVs can also help elucidate how often new genes are fixed by
positive selection due to changes in gene dosage. The combination
of expression data and sequence polymorphism can address this
question directly.

3.2.2. Identifying the Evolutionary Fates Responsible for the Retention of New Genes

The different models proposed for the fates of new genes make
different predictions regarding the early stages of the evolution of
new genes. The neofunctionalization model proposed by Ohno
predicts that in a duplicate gene pair one member experiences a
period of relaxed constraint, followed by a period of positive selection (after the occurrence of the mutation that confers a new
function), while the other member continuously experiences purifying selection (4). According to this model, there should be an
asymmetric rate of evolution between the two duplicates. This same
asymmetry should also be detected for those new genes whose
origination immediately confers a new advantageous function.
In this case, there should not be any period of relaxed constraint.
Instead, the new genes are expected to be driven to fixation by
positive selection, which is expected to continue to act for some
period of time. Meanwhile, the parental gene is expected to evolve
under purifying selection. New genes that are identical to their parental genes could be immediately favored by positive selection due to
changes in gene dosage, as numerous examples have demonstrated
(e.g., see refs. 99, 101). When this occurs, the new gene is fixed by
positive selection, but in this case both parental and offspring genes
are expected to be under purifying selection and exhibit a symmetrical rate of evolution.
The subfunctionalization models do not make clear predictions
regarding whether gene duplicates are expected to diverge symmetrically or asymmetrically because the functions of the ancestral gene
could potentially be divided equally or unequally between the two
duplicates. However, at least in its earlier stages, the DDC model
would predict both genes to experience relaxed constraint and
during this stage their evolution should be symmetrical. The
DDC and EAC models can be distinguished from each other
because the latter predicts both parental and offspring genes to
experience a period of positive selection.
As mentioned above, subfunctionalization and neofunctionalization are not mutually exclusive. New genes may experience an
initial stage of subfunctionalization (DDC model) followed by a
period of neofunctionalization. This would be translated into an
initial period of evolution under relaxed constraints for both genes
followed by a symmetrical or asymmetrical period of evolution
under positive selection depending on whether the latter acts on
one or both duplicates. Another alternative scenario is the fixation
of a duplicate by positive selection for dosage alteration that then
subsequently evolves a novel function. This scenario would create
an initial period of positive selection driving the duplication to
fixation, followed by a period of symmetrical evolution, where
both members are under purifying selection, and finally another
period of positive selection created by the mutation that confers the
novel function. The fact that different scenarios can be hypothesized and that the different models do not make explicit enough

assumptions to allow for their clear distinction has hampered
our ability to determine the dominant modes of
evolution for new genes (97, 102).
3.2.3. Detecting the Modes of Selection Acting on Parent-Offspring Gene Pairs

Advantageous mutations capable of conferring a new function on a
new gene can occur in both coding and noncoding (regulatory)
regions. The different methods available to detect positive selection
acting on both types of sequence are reviewed in detail in Chapters
5–6 of this volume (67, 103), and can be readily applied to new
genes. One such method is using the dN/dS ratio to infer if a gene
is evolving under purifying selection, neutrality, or positive selection. As discussed below, this method has been applied extensively
to the study of new genes and so it is important to note two of its
limitations. First, positive selection is of an episodic nature and
is followed by a period of purifying selection that can erase the
sequence patterns suggestive of positive selection. Therefore, tests
based on the dN/dS ratio have more power when applied to young
genes. Although several techniques have been proposed to detect
signs of positive selection in older parentoffspring pairs (reviewed
in ref. 104), it is very hard to distinguish among the different
evolutionary scenarios for old genes. Second, positive selection
may only act on a small subset of the gene with the remaining
sequence evolving under purifying selection. In this case, the
dN/dS ratio also fails to detect positive selection (104). As described
in Chapter 5 of this volume by Kosiol and Anisimova (67), there are
different techniques that can be used to detect positive selection
acting on a subset of the protein sequence.
Distinguishing between the different models proposed for the
early evolution of new genes requires determining if the parental
gene and its offspring are evolving symmetrically or asymmetrically. Relative rate tests use an outgroup sequence (i.e., an ortholog of the parental gene in a closely related species) to determine if one
of the genes is evolving at a faster rate (104). A faster rate of
evolution in one of the genes is compatible with two scenarios:
(1) one of the genes is evolving under relaxed constraints while
the other is under purifying selection or (2) one of the genes is
evolving under positive selection while the other is under purifying selection. Additional data has to be collected to distinguish
between these two scenarios. For older new genes that have
already had time to accumulate several additional mutations, polymorphism and divergence data can be combined to show that if
that gene was evolving neutrally then inactivating mutations
would already have had time to accumulate. In this case, the
presence of extensive amino acid changes without disruption of
the protein-coding sequence is only compatible with positive
selection (and not with relaxed selection). For younger genes,
the number of nucleotide substitutions is usually not enough to
distinguish between the two scenarios.
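A minimal version of the relative rate test described above is Tajima's (1993) test, which needs only site counts from a three-way alignment of the parental copy, the offspring copy, and the outgroup ortholog. The sketch below assumes a gap-free nucleotide alignment and uses toy sequences; real analyses would work from curated, codon-aware alignments.

```python
def tajima_relative_rate(seq_a, seq_b, outgroup):
    """Tajima's (1993) relative rate test on a gap-free alignment.

    m1 counts sites where only seq_a differs from the outgroup,
    m2 counts sites where only seq_b differs; under equal rates the
    statistic (m1 - m2)**2 / (m1 + m2) is ~chi-square with 1 df.
    """
    m1 = m2 = 0
    for a, b, o in zip(seq_a, seq_b, outgroup):
        if a != b:
            if b == o:      # substitution unique to lineage A
                m1 += 1
            elif a == o:    # substitution unique to lineage B
                m2 += 1
    chi2 = (m1 - m2) ** 2 / (m1 + m2) if (m1 + m2) else 0.0
    return m1, m2, chi2

# Toy alignment: parental copy, offspring copy, outgroup ortholog.
m1, m2, chi2 = tajima_relative_rate("ACGTACGTAC",
                                    "ACGTACGAAC",
                                    "ACGTACGTAC")
print(m1, m2, chi2)  # chi2 > 3.84 would reject rate equality at 5%
```

Under rate equality the two unique-substitution counts have equal expectations, so a large chi-square indicates that one lineage, typically the offspring copy, is evolving faster.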

Evidence for asymmetrical evolution can also be gathered from
expression data. A novel function or a partition of functions among
duplicates can be detected at the expression level by comparing the
patterns of expression of the parental gene, the offspring, and the
ortholog of the parental gene in a closely related species (e.g., see
ref. 105). Studying the patterns of evolution of pairs of parent-offspring genes of different ages could provide a dynamic picture of
the early stages of the evolution of new genes. However, caution
has to be taken when making this type of comparison. Certain
trends that emerge from such analyses may be due to
the differential features of preserved vs. nonpreserved gene pairs
instead of reflecting the changes through time experienced by
preserved gene pairs (96).
3.2.4. Insights from Genome-Wide Surveys of the Early Evolution of New Genes

The first large-scale surveys of the forces acting on duplicated
genes found little evidence for positive selection (14, 106). Lynch
and Conery (14) calculated dN/dS ratios for pairs of gene duplicates
in six eukaryotic genomes and found that the vast majority was
under purifying selection. The youngest class of gene duplicates
showed signs of being under purifying selection even though they
were more likely to tolerate amino acid changes than older genes
(which could be a sign of relaxed constraints or positive selection)
(14). Kondrashov and colleagues (106) applied the same dN/dS
approach to gene duplicates in 26 bacterial, 6 archaeal, and 7
eukaryotic genomes and also found purifying selection to be the
dominant force. They further used an outgroup sequence to compare the rate of evolution between the two duplicates and found
that paralogs typically evolve symmetrically (106). Conant and
Wagner (107) used a codon-based model that distinguishes
between silent substitutions and amino acid replacements when
testing for potential asymmetries in protein sequence divergence.
This time, evidence was found supporting asymmetrical evolution
for 2030% of duplicate gene pairs in four different eukaryotic
genomes. They also found evidence for relaxed selective constraints
in those genes evolving asymmetrically with a minority exhibiting
signs of being under positive selection (107). As discussed above, in
older duplicates, the earlier signs of a period of asymmetrical evolution may have been obliterated by the subsequent period of purifying selection. Hence, it is noteworthy that when Zhang and
colleagues focused on young duplicates in the human genome
they found that ~60% were evolving asymmetrically (108).
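The dN/dS comparisons described above can be illustrated with a deliberately simplified counting scheme in the spirit of such analyses (this is a sketch, not the actual estimators used in refs. 14, 106, or 107): synonymous and nonsynonymous sites and differences are tallied without any correction for multiple substitutions, and codons differing at more than one position are scored along a single arbitrary mutation path. The two aligned coding sequences are invented for illustration.

```python
# Simplified, uncorrected dN/dS (pN/pS) between two aligned coding sequences.
# Assumptions: standard genetic code, no gaps, no multiple-hit correction.

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, indexed by position of each base in BASES
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def syn_sites(codon):
    """Number of synonymous sites in a codon (each position contributes
    the fraction of its 3 possible mutations that preserve the amino acid)."""
    s = 0.0
    for pos in range(3):
        for b in BASES:
            if b != codon[pos]:
                alt = codon[:pos] + b + codon[pos + 1:]
                if CODON_TABLE[alt] == CODON_TABLE[codon]:
                    s += 1 / 3
    return s

def dn_ds(seq1, seq2):
    """Uncorrected pN/pS; assumes at least one synonymous difference."""
    S = N = Sd = Nd = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        s = (syn_sites(c1) + syn_sites(c2)) / 2  # average over both sequences
        S, N = S + s, N + (3 - s)
        for pos in range(3):  # classify each differing position
            if c1[pos] != c2[pos]:
                mutated = c1[:pos] + c2[pos] + c1[pos + 1:]
                if CODON_TABLE[mutated] == CODON_TABLE[c1]:
                    Sd += 1
                else:
                    Nd += 1
    return (Nd / N) / (Sd / S)

ratio = dn_ds("TTTGCTAAA", "TTCGCTAGA")  # ~0.26, consistent with purifying selection
```

A ratio well below 1, as here, is the purifying-selection signature reported for most duplicates; a ratio above 1 would point to accelerated protein evolution, which, as discussed above, cannot by itself distinguish positive selection from relaxed constraint.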
Since some new genes are identical to their parental genes (i.e.,
complete tandem gene duplications) while others are not (i.e.,
retrogenes, dispersed duplicates), it merits asking if the percentage
of genes evolving at asymmetrical rates is the same for the two
classes of genes. Cusack and Wolfe (109) found that the degree of
asymmetry in the rate of evolution is greater for gene pairs where
parent and offspring genes differ from each other than for those

7 Origin and Evolution of New Genes

181

gene pairs where parent and offspring genes are identical. Han and
colleagues (110) found a similar result when studying lineage-specific duplicates in the human, macaque, mouse, and rat genomes. By focusing on very young duplicates, they also aimed at
detecting signs of positive selection before they were masked by the
purifying selection that follows. Approximately 10% of all lineage-specific genes showed signs of positive selection acting on their
protein sequences. Furthermore, they showed that for gene duplicates where parental and offspring genes are located in different
genomic locations, in 80% of the cases with evidence for
positive selection that evidence came from the offspring copy. This was true
whether the offspring was a retrogene or was created by the classical
model of gene duplication (110).
When divergence data is combined with polymorphism data,
further insight can be gained into the evolutionary forces acting on
new genes. More precisely, combining both types of data allows
distinguishing between the two scenarios that can cause accelerated
rates of protein evolution: relaxation of selective constraints and
positive selection. Cai and Petrov (111) combined human polymorphism data with human–chimp divergence data and found
strong evidence that the elevated rates of protein evolution found
for younger genes are mostly due to relaxed selective constraints,
and weaker evidence that younger genes experience adaptive
evolution more frequently than older genes.
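The logic of combining polymorphism and divergence can be sketched as a McDonald–Kreitman-style 2×2 contrast (the actual analysis in ref. 111 is considerably more elaborate); all counts below are invented for illustration. Under neutrality, the nonsynonymous/synonymous ratio should be similar for polymorphisms (within species) and fixed differences (between species); a neutrality index below 1 indicates an excess of nonsynonymous divergence, suggesting positive selection, whereas a value above 1 suggests segregating weakly deleterious variants or relaxed constraint.

```python
# McDonald-Kreitman-style contrast on hypothetical counts.

def neutrality_index(pn, ps, dn, ds):
    """NI = (Pn/Ps) / (Dn/Ds); assumes all counts are nonzero."""
    return (pn / ps) / (dn / ds)

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts for a young duplicate gene:
pn, ps = 2, 10   # polymorphic nonsynonymous / synonymous variants
dn, ds = 15, 12  # fixed nonsynonymous / synonymous differences
ni = neutrality_index(pn, ps, dn, ds)  # 0.16: excess nonsynonymous divergence
chi2 = chi2_2x2(pn, ps, dn, ds)        # ~5.1, above the 3.84 critical value (1 df, 5%)
```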

4. Future Perspectives
It is unquestionable that the wealth of genomic data collected in the
past 10 years dramatically changed our understanding of how new
genes are created. But more than answering long-standing questions, the genomics revolution brought about a brand new set of
questions. Only recently have we learned that new genes can be
created de novo (50–56), and we still lack the proper tools to
study how selection acts on this group of genes. Also, now that we
know that non-protein-coding genes are an important component of genomes, we have to devise more sensitive techniques to detect them and study their evolution. And, perhaps the greatest challenge of all, we have to go beyond simply
describing the sequence and evolution of new genes and determine
the novel functions these genes encode. Although genomic data
can help us determine whether a gene is functional, determining its
actual function requires a multidisciplinary effort that combines
genomics and proteomics with a multitude of functional assays.
As more genomes are sequenced, phylogenies will become
more and more complete and our capability of detecting new
genes, dating them, and understanding how they are formed will

increase. As we move from sequencing genomes of different species
to sequencing many genomes from the same species, we will be able
to combine divergence and polymorphism data on a genome-wide
scale and finally be able to better describe the evolutionary
forces acting on new genes. We will also move from detecting
polymorphic new genes using microarray technology to using
next-generation sequencing, and with it we will obtain the detailed
sequence information on the new genes, their location, and breakpoint information that we are currently lacking. As genomic data
continues to accumulate, so will our understanding of how new
genes are formed, how they are fixed in populations, and why they
are preserved in genomes.

5. Questions
1. Count the number of genes in the human and chimpanzee
genomes. Does the difference suggest the gain or the loss of
some genes in one lineage? How can you distinguish between
the two possibilities?
2. Imagine the genome sequences of 12 bee species (the phylogeny is known) have just been released. The 12 genomes have
been annotated using both experimental and computational
approaches. What would be the steps needed to find all lineage-specific genes, i.e., genes present in only one of the species?
What genomic hallmarks would you use to distinguish the
different classes of new genes?

Acknowledgments
We thank J. Roman Arguello, Maria Vibranovski, three anonymous
reviewers, and our editor, Maria Anisimova, for comments and
critical reading of the manuscript.
References
1. Taylor JS, Raes J (2004) Duplication and divergence: the evolution of new genes and old ideas. Annu Rev Genet 38:615–643
2. Haldane JBS (1932) The causes of evolution. Princeton Science Library
3. Bridges CB (1936) The Bar "gene" a duplication. Science 83:210–211
4. Ohno S (1970) Evolution by gene duplication. Springer-Verlag
5. Long M, Betran E, Thornton K et al (2003) The origin of new genes: glimpses from the young and old. Nat Rev Genet 4:865–875
6. Presgraves DC (2005) Evolutionary genomics: new genes for new jobs. Curr Biol 15:R52–53
7. Long M, Langley CH (1993) Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science 260:91–95


8. Kuwada Y (1911) Meiosis in the pollen mother cells of Zea Mays L. Bot Mag 25:16–33
9. Conant GC, Wolfe KH (2008) Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9:938–950
10. Kuraku S, Meyer A (2012) Detection and phylogenetic assessment of conserved synteny derived from whole genome duplications. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
11. Wapinski I, Pfeffer A, Friedman N et al (2007) Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61
12. Maere S, De Bodt S, Raes J (2005) Modeling gene and genome duplications in eukaryotes. Proc Natl Acad Sci U S A 102:5454–5459
13. Zhang J (2003) Evolution by gene duplication: an update. Trends Ecol Evol 18:292–298
14. Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155
15. Budd A (2012) Diversity of genome organization. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
16. Cliften P, Sudarsanam P, Desikan A et al (2003) Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301:71–76
17. Kellis M, Patterson N, Endrizzi M et al (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423:241–254
18. Gao LZ, Innan H (2004) Very low gene duplication rate in the yeast genome. Science 306:1367–1370
19. Drosophila 12 Genomes Consortium (2007) Evolution of genes and genomes on the Drosophila phylogeny. Nature 450:203–218
20. Hahn MW, Han MV, Han SG (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genet 3:e197
21. Lynch M, Conery JS (2003) The evolutionary demography of duplicate genes. J Struct Funct Genomics 3:35–44
22. Osada N, Innan H (2008) Duplication and gene conversion in the Drosophila melanogaster genome. PLoS Genet 4:e1000305
23. Long M, Thornton K (2001) Gene duplication and evolution. Science 293:1551
24. Fiston-Lavier AS, Anxolabehere D, Quesneville H (2007) A model of segmental duplication formation in Drosophila melanogaster. Genome Res 17:1458–1470
25. Bailey JA, Gu Z, Clark RA et al (2002) Recent segmental duplications in the human genome. Science 297:1003–1007
26. Marques-Bonet T, Girirajan S, Eichler EE (2009) The origins and impact of primate segmental duplications. Trends Genet 25:443–454
27. Medvedev P, Stanciu M, Brudno M (2009) Computational methods for discovering structural variation with next-generation sequencing. Nat Methods 6:S13–S20
28. Gu W, Zhang F, Lupski JR (2008) Mechanisms for human genomic rearrangements. Pathogenetics 1:4
29. Aguilera A, Gomez-Gonzalez B (2008) Genome instability: a mechanistic view of its causes and consequences. Nat Rev Genet 9:204–217
30. Hastings PJ, Lupski JR, Rosenberg SM et al (2009) Mechanisms of change in gene copy number. Nat Rev Genet 10:551–564
31. Rogers RL, Bedford T, Hartl DL (2009) Formation and longevity of chimeric and duplicate genes in Drosophila melanogaster. Genetics 181:313–322
32. Stratton MR, Campbell PJ, Futreal PA (2009) The cancer genome. Nature 458:719–724
33. Long M, Rosenberg C, Gilbert W (1995) Intron phase correlations and the evolution of the intron/exon structure of genes. Proc Natl Acad Sci U S A 92:12495–12499
34. Patthy L (1999) Genome evolution and the evolution of exon-shuffling – a review. Gene 238:103–114
35. Kaessmann H, Vinckenbosch N, Long M (2009) RNA-based gene duplication: mechanistic and evolutionary insights. Nat Rev Genet 10:19–31
36. Marques AC, Dupanloup I, Vinckenbosch N et al (2005) Emergence of young human genes after a burst of retroposition in primates. PLoS Biol 3:e357
37. Wang W, Zheng H, Fan C et al (2006) High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell 18:1791–1802
38. Bai Y, Casola C, Feschotte C et al (2007) Comparative genomics reveals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila. Genome Biol 8:R11
39. Kaessmann H (2010) Origins, evolution, and phenotypic impact of new genes. Genome Res 20:1313–1326


40. Patterson C (1988) Homology in classical and molecular biology. Mol Biol Evol 5:603–625
41. Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–304
42. Gogarten JP, Townsend JP (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 3:679–687
43. Zhaxybayeva O (2009) Detection and quantitative assessment of horizontal gene transfer. Methods Mol Biol 532:195–213
44. Lawrence J, Azad R (2012) Detecting lateral gene transfer. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
45. Martin W, Herrmann RG (1998) Gene transfer from organelles to the nucleus: how much, what happens, and why? Plant Physiol 118:9–17
46. Dunning Hotopp JC, Clark ME, Oliveira DC et al (2007) Widespread lateral gene transfer from intracellular bacteria to multicellular eukaryotes. Science 317:1753–1756
47. Doolittle RF, Feng DF, Anderson KL et al (1990) A naturally occurring horizontal gene transfer from a eukaryote to a prokaryote. J Mol Evol 31:383–388
48. The International Aphid Genomics Consortium (2010) Genome sequence of the pea aphid Acyrthosiphon pisum. PLoS Biol 8:e1000313
49. Moran NA, Jarvik T (2010) Lateral transfer of genes from fungi underlies carotenoid production in aphids. Science 328:624–627
50. Levine MT, Jones CD, Kern AD et al (2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression. Proc Natl Acad Sci U S A 103:9935–9939
51. Zhou Q, Zhang G, Zhang Y et al (2008) On the origin of new genes in Drosophila. Genome Res 18:1446–1455
52. Zhang YE, Vibranovski MD, Krinsky BH et al (2010) Age-dependent chromosomal distribution of male-biased genes in Drosophila. Genome Res 20:1526–1533
53. Cai J, Zhao R, Jiang H et al (2008) De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics 179:487–496
54. Knowles DG, McLysaght A (2009) Recent de novo origin of human protein-coding genes. Genome Res 19:1752–1759
55. Toll-Riera M, Bosch N, Bellora N et al (2009) Origin of primate orphan genes: a comparative genomics approach. Mol Biol Evol 26:603–612
56. Xiao W, Liu H, Li Y et al (2009) A rice gene of de novo origin negatively regulates pathogen-induced defense response. PLoS One 4:e4603
57. Hertel J, Lindemeyer M, Missal K et al (2006) The expansion of the metazoan microRNA repertoire. BMC Genomics 7:25
58. Assis R, Kondrashov AS (2009) Rapid repetitive element-mediated expansion of piRNA clusters in mammalian evolution. Proc Natl Acad Sci U S A 106:7079–7082
59. Ponting CP, Oliver PL, Reik W (2009) Evolution and functions of long noncoding RNAs. Cell 136:629–641
60. Duret L, Chureau C, Samain S et al (2006) The Xist RNA gene evolved in eutherians by pseudogenization of a protein-coding gene. Science 312:1653–1655
61. Wang W, Brunet FG, Nevo E et al (2002) Origin of sphinx, a young chimeric RNA gene in Drosophila melanogaster. Proc Natl Acad Sci U S A 99:4448–4453
62. Yang S, Arguello JR, Li X et al (2008) Repetitive element-mediated recombination as a mechanism for new gene origination in Drosophila. PLoS Genet 4:e3
63. Smalheiser NR, Torvik VI (2005) Mammalian microRNAs derived from genomic repeats. Trends Genet 21:322–326
64. Piriyapongsa J, Mariño-Ramírez L, Jordan IK (2007) Origin and evolution of human microRNAs from transposable elements. Genetics 176:1323–1337
65. Brosius J (1999) RNAs from all categories generate retrosequences that may be exapted as novel genes or regulatory elements. Gene 238:115–134
66. Wagner A (2002) Selection and gene duplication: a view from the genome. Genome Biol 3:reviews1012
67. Kosiol C, Anisimova M (2012) Selection in protein coding regions. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
68. Emerson JJ, Kaessmann H, Betran E et al (2004) Extensive gene traffic on the mammalian X chromosome. Science 303:537–540
69. Vinckenbosch N, Dupanloup I, Kaessmann H (2006) Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci U S A 103:3220–3225
70. Arguello JR, Fan C, Wang W et al (2007) Origination of chimeric genes through DNA-level recombination. Genome Dyn 3:131–146


71. Katju V, Lynch M (2003) The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics 165:1793–1803
72. Katju V, Lynch M (2006) On the formation of novel genes by duplication in the Caenorhabditis elegans genome. Mol Biol Evol 23:1056–1067
73. Emerson JJ, Cardoso-Moreira M, Borevitz JO et al (2008) Natural selection shapes genome-wide patterns of copy-number polymorphism in Drosophila melanogaster. Science 320:1629–1631
74. Conrad DF, Pinto D, Redon R et al (2010) Origins and functional impact of copy number variation in the human genome. Nature 464:704–712
75. Zhou Q, Wang W (2008) On the origin and evolution of new genes – a genomic and experimental perspective. J Genet Genomics 35:639–648
76. Arguello JR, Chen Y, Yang S et al (2006) Origination of an X-linked testes chimeric gene by illegitimate recombination in Drosophila. PLoS Genet 2:e77
77. Betran E, Thornton K, Long M (2002) Retroposed new genes out of the X in Drosophila. Genome Res 12:1854–1859
78. Begun DJ, Lindfors HA, Kern AD et al (2007) Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics 176:1131–1137
79. Zhang YE, Vibranovski MD, Landback P et al (2010) Chromosomal redistribution of male-biased genes in mammalian evolution with two bursts of gene gain on the X chromosome. PLoS Biol 8:e1000494
80. Ranz JM, Castillo-Davis CI, Meiklejohn CD et al (2003) Sex-dependent gene expression and evolution of the Drosophila transcriptome. Science 300:1742–1745
81. Parisi M, Nuttall R, Naiman D et al (2003) Paucity of genes on the Drosophila X chromosome showing male-biased expression. Science 299:697–700
82. Vibranovski MD, Zhang Y, Long M (2009) General gene movement off the X chromosome in the Drosophila genus. Genome Res 19:897–903
83. Vibranovski MD, Lopes HF, Karr TL et al (2009) Stage-specific expression profiling of Drosophila spermatogenesis suggests that meiotic sex chromosome inactivation drives genomic relocation of testis-expressed genes. PLoS Genet 5:e1000731


84. Potrzebowski L, Vinckenbosch N, Marques AC et al (2008) Chromosomal gene movements reflect the recent origin and biology of therian sex chromosomes. PLoS Biol 6:e80
85. Conrad B, Antonarakis SE (2007) Gene duplication: a drive for phenotypic diversity and cause of human disease. Annu Rev Genomics Hum Genet 8:17–35
86. Otto SP, Yong P (2002) The evolution of gene duplicates. Adv Genet 46:451–483
87. Kondrashov FA, Kondrashov AS (2005) Role of selection in fixation of gene duplications. J Theor Biol 239:141–151
88. Harrison PM, Echols N, Gerstein MB (2001) Digging for dead genes: an analysis of the characteristics of the pseudogene population in the Caenorhabditis elegans genome. Nucleic Acids Res 29:818–830
89. Harrison PM, Hegyi H, Balasubramanian S et al (2002) Molecular fossils in the human genome: identification and analysis of the pseudogenes in chromosomes 21 and 22. Genome Res 12:272–280
90. Rouquier S, Blancher A, Giorgi D (2000) The olfactory receptor gene repertoire in primates and mouse: evidence for reduction of the functional fraction in primates. Proc Natl Acad Sci U S A 97:2870–2874
91. Zhang J, Zhang YP, Rosenberg HF (2002) Adaptive evolution of a duplicated pancreatic ribonuclease gene in a leaf-eating monkey. Nat Genet 30:411–415
92. Zhang J (2006) Parallel adaptive origins of digestive RNases in Asian and African leaf monkeys. Nat Genet 38:819–823
93. Force A, Lynch M, Pickett FB et al (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545
94. Hughes AL (1994) The evolution of functionally novel proteins after gene duplication. Proc Biol Sci 256:119–124
95. Piatigorsky J, Wistow G (1991) The recruitment of crystallins: new functions precede gene duplication. Science 252:1078–1079
96. Lynch M, Katju V (2004) The altered evolutionary trajectories of gene duplicates. Trends Genet 20:544–549
97. Innan H, Kondrashov F (2010) The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet 11:97–108
98. Moore RC, Purugganan MD (2003) The early stages of duplicate gene evolution. Proc Natl Acad Sci U S A 100:15682–15687


99. Perry GH, Dominy NJ, Claw KG et al (2007) Diet and the evolution of human amylase gene copy number variation. Nat Genet 39:1256–1260
100. Schrider DR, Hahn MW (2010) Gene copy-number polymorphism in nature. Proc Biol Sci 277:3213–3221
101. Schmidt JM, Good RT, Appleton B et al (2010) Copy number variation and transposable elements feature in recent, ongoing adaptation at the Cyp6g1 locus. PLoS Genet 6:e1000998
102. Hahn MW (2010) Distinguishing among evolutionary models for the maintenance of gene duplicates. J Hered 100:605–617
103. Zhen Y, Andolfatto P (2012) Detecting selection on non-coding genomic regions. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
104. Raes J, Van de Peer Y (2003) Gene duplication, the evolution of novel gene functions, and detecting functional divergence of duplicates in silico. Appl Bioinformatics 2:91–101
105. Huminiecki L, Wolfe KH (2004) Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res 14:1870–1879
106. Kondrashov FA, Rogozin IB, Wolf YI et al (2002) Selection in the evolution of gene duplications. Genome Biol 3:RESEARCH0008
107. Conant GC, Wagner A (2003) Asymmetric sequence divergence of duplicate genes. Genome Res 13:2052–2058
108. Zhang P, Gu Z, Li WH (2003) Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol 4:R56
109. Cusack BP, Wolfe KH (2007) Not born equal: increased rate asymmetry in relocated and retrotransposed rodent gene duplicates. Mol Biol Evol 24:679–686
110. Han MV, Demuth JP, McGrath CL et al (2009) Adaptive evolution of young gene duplicates in mammals. Genome Res 19:859–867
111. Cai JJ, Petrov DA (2010) Relaxed purifying selection and possibly high rate of adaptation in primate lineage-specific genes. Genome Biol Evol 2:393–409
112. Aris-Brosou S, Rodrigue N (2012) The essentials of computational molecular evolution. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York

Chapter 8
Evolution of Protein Domain Architectures
Kristoffer Forslund and Erik L.L. Sonnhammer
Abstract
This chapter reviews the current research on how protein domain architectures evolve. We begin by
summarizing work on the phylogenetic distribution of proteins, as this directly impacts which domain
architectures can be formed in different species. Studies relating domain family size to occurrence have
shown that they generally follow power law distributions, both within genomes and larger evolutionary
groups. These findings were subsequently extended to multidomain architectures. Genome evolution
models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence
for selective pressure to expand certain domain families more than others. Each domain has an intrinsic
combinatorial propensity, and the effects of this have been studied using measures of domain versatility or
promiscuity. Next, we study the principles of protein domain architecture evolution and how these have
been inferred from distributions of extant domain arrangements. Following this, we review inferences of
ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms
that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture
can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly).
Key words: Protein domain, Protein domain architecture, Superfamily, Monophyly, Polyphyly,
Convergent evolution, Domain evolution, Kingdoms of life, Domain co-occurrence network, Node
degree distribution, Power law, Parsimony

1. Introduction
1.1. Overview

By studying the domain architectures of proteins, we can understand
their evolution as a modular phenomenon, with high-level events
enabling significant changes to take place in a time span much shorter
than required by point mutations only. This research field has become
possible only now in the -omics era of science, as both identifying many
domain families in the first place and acquiring enough data to chart
their evolutionary distribution require access to many completely
sequenced genomes. Likewise, the conclusions drawn generally
consider properties averaged for entire species or organism groups or
entire classes of proteins, rather than properties of single genes.

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_8,
© Springer Science+Business Media, LLC 2012


K. Forslund and E.L.L. Sonnhammer

We begin by introducing the basic concepts of domains and
domain architectures, as well as the biological mechanisms by
which these architectures can change. The remainder of the chapter
is an attempt at answering, from the recent literature, the question
of which forces shape domain architecture evolution and in what
direction. The underlying issue concerns whether it is fundamentally a random process or whether it is primarily a consequence of
selective constraints.
1.2. Protein Domains

Protein domains are high-level parts of proteins that either occur
alone or together with partner domains on the same protein chain.
Most domains correspond to tertiary structure elements, and are
able to fold independently. All domains exhibit evolutionary conservation, and many either perform specific functions or contribute
in a specific way to the function of their proteins. The word domain
strictly refers to a distinct region of a specific protein, an instance of
a domain family. However, domain and domain family are often
used interchangeably in the literature.

1.3. Domain Databases

By identifying recurring elements in experimentally determined
protein 3D structures, the various domain families in structural
domain databases, such as SCOP (1) and CATH (2), were gathered. New 3D structures are assigned to these classes by
semiautomated inspection. The SUPERFAMILY (3) database
assigns SCOP domains to all protein sequences by matching them
to Hidden Markov Models (HMMs) that were derived from SCOP
superfamilies, i.e., proteins whose evolutionary relationship is evidenced structurally. The Gene3D (4) database is similarly constructed, but based on domain families from CATH.
This approach resembles the methodology used in pure
sequence-based domain databases, such as Pfam (5). In these databases, conserved regions are identified from sequence analysis and
background knowledge to make multiple sequence alignments. From
these, HMMs are built that are used to search new sequences for the
presence of the domain represented by each HMM. All such instances
are stored in the database. The HMM framework ensures stability
across releases and high quality of alignments and domain family
memberships. The stability allows annotation to be stored along
with the HMMs and alignments. The INTERPRO database (6) is a
metadatabase of domains combining the assignments from several
different source databases, including Pfam. The Conserved Domain
Database (CDD) is a similar metadatabase that also contains additional domains curated by the NCBI (7). SMART (8) is a manually
curated resource focusing primarily on signaling and extracellular
domains. ProDom (9) is a comprehensive domain database automatically generated from sequences in UniProt (10). Likewise,
ADDA (11) is automatically generated by clustering subsequences
of proteins from the major sequence databases. It is currently being

8 Evolution of Protein Domain Architectures

used for generating Pfam-B families, low-fidelity sets of putative
domains which may provide starting points for new Pfam-A families.
Such automatic approaches, however, inevitably produce low-quality
domain definitions and alignments, and lack annotation.
Since the domain definitions from different databases only
partially overlap, results from analyses often cannot be directly
compared. In practice, however, choice of database appears to
have little effect on the main trends reported by the studies
described here.
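As a caricature of how these databases match sequences against their models, a position-specific scoring scan can be sketched as follows. This deliberately ignores the insert and delete states that make profile HMMs far more powerful than a simple score matrix, and the tiny three-column "profile", default score, and threshold are all invented for illustration.

```python
# Minimal position-specific scoring scan (a caricature of profile search,
# NOT a profile HMM: no insert/delete states, no probabilistic gap model).

# Hypothetical 3-column log-odds profile: profile[i][residue] = score
PROFILE = [
    {"G": 2.0, "A": 0.5},  # column 1 prefers glycine
    {"K": 1.5, "R": 1.2},  # column 2 prefers a basic residue
    {"S": 1.8, "T": 1.0},  # column 3 prefers a hydroxyl residue
]
DEFAULT = -1.0  # penalty for residues not listed in a column

def scan(sequence, profile, threshold=3.0):
    """Slide the profile along the sequence; report windows above threshold."""
    w = len(profile)
    hits = []
    for start in range(len(sequence) - w + 1):
        score = sum(profile[i].get(sequence[start + i], DEFAULT) for i in range(w))
        if score >= threshold:
            hits.append((start, score))
    return hits

hits = scan("MAGKSAGRT", PROFILE)  # windows "GKS" (pos 2) and "GRT" (pos 6) match
```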
1.4. Domain Architectures

The term domain architecture or domain arrangement generally
refers to the domains in a protein and their order, reported in N- to
C-terminal direction along the amino acid chain. Another recurring
term is domain combinations. This refers to pairs of domains co-occurring in proteins, either anywhere in the protein (the bag-of-domains model) or specifically pairs of domains being adjacent on an
amino acid chain, in a specific N- to C-terminal order (12). The latter
concept is expanded to triplets of domains, which are subsequences of
three consecutive domains, with the N- and C-termini used as
dummy domains. A domain X occurring on its own in a protein,
thus, produces the triplet N-X-C (13).
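The three representations above can be made concrete with a short sketch; the example architecture (a kinase with SH2 and SH3 domains) is chosen only for illustration.

```python
# Extracting domain "combinations" from one architecture under the three
# representations described above: unordered co-occurring pairs
# (bag-of-domains), ordered adjacent pairs, and N/C-padded triplets.
from itertools import combinations

def bag_of_domain_pairs(arch):
    """Unordered pairs of distinct domain families co-occurring anywhere."""
    return set(frozenset(p) for p in combinations(set(arch), 2))

def adjacent_pairs(arch):
    """Ordered pairs of adjacent domains, N- to C-terminal."""
    return [(arch[i], arch[i + 1]) for i in range(len(arch) - 1)]

def triplets(arch):
    """Consecutive triplets with N/C termini as dummy domains, so a
    single-domain protein [X] yields the triplet ("N", "X", "C")."""
    padded = ["N"] + list(arch) + ["C"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

arch = ["Kinase", "SH2", "SH3"]
# triplets(arch) -> [("N","Kinase","SH2"), ("Kinase","SH2","SH3"), ("SH2","SH3","C")]
```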

1.5. Mechanisms for Domain Architecture Change

Most mutations are point mutations: substitutions, insertions, or
deletions of single nucleotides. While conceivably enough of these
might create a new domain from an old one or noncoding sequence
or remove a domain from a protein, in practice we are interested in
mechanisms whereby the domain architecture of a protein changes
instantly or nearly so. Figure 1 shows some examples of ways in
which domain architectures may mutate. In general, adding or
removing domains requires genetic recombination events. These
can occur either through errors made by systems for repairing DNA
damage, such as homologous (14, 15) or nonhomologous (illegitimate) (16, 17) recombination, or through the action of mobile
genetic elements, such as DNA transposons (18) or retrotransposons (19, 20). Recombination can cause loss or duplication of parts
of genes, entire genes, or much longer chromosomal regions.
In organisms that have introns, exon shuffling (21, 22) refers
to the integration of an exon from one gene into another, for
instance through chromosomal crossover, gene conversion, or
mobile genetic elements. Exons could also be moved around by
being brought along by mobile genetic elements, such as retrotransposons (22, 23).
Two adjacent genes can be fused into one if the first one loses
its transcription stop signals. Point mutations can cause a gene to
lose a terminal domain by introducing a new stop codon, after
which the lost domain slowly degrades through point mutations
as it is no longer under selective pressure (24). Alternatively, a
multidomain gene might be split into two genes if both a start


Fig. 1. Examples of mutations that can change domain architectures. Adapted from Buljan and Bateman (BioMed Central,
2010). (a) Gene fusion by a mobile element. LINE refers to a Long Interspersed Nuclear repeat Element, a retrotransposon.
The reverse transcriptase encoded within the LINE causes its mRNA to be reverse transcribed into DNA and integrated into
the genome, making the domain-encoding blue exon from the donor gene integrate along with it in the acceptor gene.
(b) Gene fusion by loss of a stop signal or deletion of much of the intergenic region. Genes 1 and 2 are joined together into a
single, longer gene. (c) Domain insertion through recombination. The blue domain from the donor gene is inserted within
the acceptor gene by either homologous or illegitimate recombination. (d) Right: Gene fission by introduction of
transcription stop (the letter O) and start (the letter A). Left: Domain loss by introduction of a stop codon (exclamation
mark) with subsequent degeneration of the now untranslated domain.

and a stop signal are introduced between the domains. Novel
domains could arise, for instance, through exonization, whereby
an intronic or intergenic region becomes an exon, after which
subsequent mutations would fine tune its folding and functional
properties (23, 25).

2. Distribution of the Sizes of Domain Families

Domain architectures are fundamentally the realizations of how
domains combine to form multidomain proteins with complex
functions. Understanding how these combinations come to be
requires first that we understand how common the constituent
domains of those architectures are, and whether there are selective

8 Evolution of Protein Domain Architectures


pressures determining their abundances. Because of this, the body
of work concerning the sizes and species distributions of domain
families becomes important to us.
Comprehensive studies of the distributions and evolution of
protein domains and domain architectures are possible as genome
sequencing technologies have made many entire proteomes
available for bioinformatic analysis. Initial work (26–28) focused
on the number of copies that a protein family, either single domain
or multidomain, has in a species. Most conclusions from these early
studies appear to hold true for domains, supradomains (see below),
and domain architectures (29–31). In particular, these all exhibit a
dominance of the population by a selected few (28), i.e., a small
number of domain families are present in a majority of the proteins
in a genome, whereas most domain families are found only in a
small number of proteins.
Looking at the frequency N of families of size X (defined as the number of members in the genome), in the earliest studies this frequency was modeled as the power law

N = cX^(−a),

where a is a slope parameter. The power law is a special case of the generalized Pareto distribution (GPD) (32):

N = c(i + X)^(−a).
Power law distributions arise in a vast variety of contexts: human income distributions, connectivity of Internet routers, word usage in languages, and many other situations ((27, 28, 34, 35), see
also ref. 36 for a conflicting view). Luscombe et al. (28) described a
number of other genomic properties that also follow power law
distributions, such as the occurrence of DNA words, pseudogenes, and levels of gene expression. These distributions fit much
better than the alternative they usually are contrasted against, an
exponential decay distribution. The most important difference
between exponential and power law distributions in this context
concerns the fact that the latter has a fat tail, that is, while most
domain families occur only a few times in each proteome, most
domains in the proteome still belong to one of a small number of
families.
Later work ((32, 37), see also ref. 38) demonstrated that
proteome-wide domain occurrence data fit the general GPD better
than the power law, but that it also asymptotically fits a power law as
X ≫ i. The deviation from strict power law behavior depends on
proteome size in a kingdom-dependent manner (37). Regardless, it
is mostly appropriate to treat the domain family size distribution as
approximately (and asymptotically) power law like, and later studies
typically assume this.
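As an illustration, the slope parameter a can be estimated by an ordinary least-squares fit in log-log space. The sketch below uses toy family sizes and a naive regression, rather than the maximum-likelihood estimators a careful study would use:

```python
import math
from collections import Counter

def fit_power_law_slope(family_sizes):
    """Least-squares fit of log N(X) = log c - a*log X, where N(X) is the
    number of families with exactly X members.  Returns the slope a."""
    freq = Counter(family_sizes)
    pts = sorted(freq.items())
    xs = [math.log(x) for x, _ in pts]
    ys = [math.log(n) for _, n in pts]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope  # a in N = c * X**(-a)

# Toy proteome: many small families, a few large ones
sizes = [1] * 80 + [2] * 20 + [4] * 5 + [8] * 1
a = fit_power_law_slope(sizes)
```

For this toy input the fitted slope comes out close to 2, in the range reported for real proteomes.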


The power law, but not the GPD, is scale free in the sense of fulfilling the condition

f(ax) = g(a)f(x),

where f and g are some functions of a variable x, and a is a scaling parameter; that is, studying the data at a different scale does not change the shape of the function. For the power law N = cX^(−a), for instance, scaling X by a factor k gives c(kX)^(−a) = k^(−a) · cX^(−a), so the condition holds with g(k) = k^(−a). This property has been extensively studied in the literature and is connected to other attributes,
notably when it occurs in network degree distributions (i.e., frequency distributions of edges per node). Here, it has been associated with properties, such as the presence of a few central and
critical hubs (nodes with many edges to other nodes), the similarity
between parts and the whole (as in a fractal), and the growth
process called preferential attachment, under which nodes are
more likely to gain new links the more links they already have.
However, the same power law distribution may be generated
from many different network topologies with different patterns of
connectivity. In particular, they may differ in the extent that hubs
are connected to each other (36). It is possible to extend the
analysis by taking into account the distribution of degree pairs
along network edges, but this is normally not done.
What kind of evolutionary mechanisms give rise to this kind of
distribution of gene or domain family sizes within genomes? In one
model by Huynen and van Nimwegen (26), every gene within a
gene family is more or less likely to duplicate, depending on the
utility of the function of that gene family within the particular
lineage of organisms studied, and they showed that such a model
matches the observed power laws. While they claimed that any
model that explains the data must take into account family-specific
probabilities of duplication fixation, Yanai and coworkers (39) proposed a simpler model using uniform duplication probability for all
genes in the genome, and also reported a good fit with the data.
Later, more complex birth-death (37) and birth-death-and-innovation models (BDIM) (27, 32) were introduced to explain the
observed distributions, and from investigating which model parameter
ranges allow this fit the authors were able to draw several far-ranging
conclusions. First, the asymptotic power law behavior requires that
the rates of domain gain and loss are asymptotically equal. Karev et al.
(32) interpreted this as support for a punctuated equilibrium-type
model of genome evolution, where domain family size distributions
remain relatively stable for long periods of time but may go through
stages of rapid evolution, representing a shift between different
BDIM evolutionary models and significant changes in genome complexity. Like Huynen and van Nimwegen (26), they concluded that the
likelihood of fixated domain duplications or losses in a genome directly
depends on family size. The family, however, only grows as long as
new copies can find new functional niches and contribute to a net
benefit for survival, i.e., as long as selection favors it.
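The flavor of such birth-death-and-innovation dynamics can be conveyed by a deliberately minimal toy simulation; this is a sketch with arbitrary parameters, not the analytical BDIM framework of Karev et al.:

```python
import random

def simulate_bdim(steps=20000, innovation=0.02, seed=1):
    """Toy birth-death-and-innovation process on domain family sizes.
    Each step: with probability `innovation` a new single-member family
    arises; otherwise a domain copy is picked (so large families change
    more often) and duplicates or is lost with equal probability, the
    asymptotically balanced regime discussed in the text."""
    rng = random.Random(seed)
    families = [1]  # one founding family with one member
    for _ in range(steps):
        if rng.random() < innovation or sum(families) == 0:
            families.append(1)  # de novo creation or horizontal influx
            continue
        # degree of change is proportional to family size (per-copy rates)
        (i,) = rng.choices(range(len(families)), weights=families)
        families[i] += 1 if rng.random() < 0.5 else -1
    return sorted((f for f in families if f > 0), reverse=True)

sizes = simulate_bdim()
```

Plotting the resulting size frequencies on log-log axes gives a heavy-tailed distribution reminiscent of, though not identical to, the proteome data discussed above.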


Aside from Huynen and van Nimwegen's, none of the models
discussed depend very strongly on family-specific selection to
explain the abundances of individual gene families, nor do they
exclude such selection. Some domains may be highly useful to
their host organism's lifestyle, such as cell-cell connectivity domains to an organism beginning to develop multicellularity.
Expansion of these domain families might, therefore, become
more likely in some lineages than in others. To what extent these
factors actually affect the size of domain families remains to be fully
explored. Karev et al. (32) suggested that the rates of domain-level change events themselves (domain duplication and loss rates, as well as the rate of influx of novel domains from other species or de novo creation) must be evolutionarily adapted, as only some such parameters allow the observed distributions to be stable. van
Nimwegen (40) investigated how the number of genes increases in
specific functional categories as total genome size increases. He
found that the relationship matches a power law, with different
coefficients for each functional class remaining valid over many
bacterial lineages. Ranea et al. (41) found similar results. Also,
Ranea et al. (42) showed that, for domain superfamilies inferred
to be present in the last universal common ancestor (LUCA),
domains associated with metabolism have significantly higher
abundance than those associated with translation, further supporting a connection between the function of a domain family and how
likely it is to expand.
Extending the analysis to multidomain architectures, Apic et al.
(30) showed that the frequency distribution of multidomain family
sizes follows a power law curve similar to that reported for individual domain families. It, therefore, seems likely that the basic underlying mechanisms should be similar in both cases, i.e., duplication
of genes, and thus their domain architectures, is the most important type of event affecting the evolution of domain architectures.
Have the trends described above stood the test of time as more
genomes have been sequenced and more domain families have been
identified? We considered the 1,503 complete proteomes in version
24.0 of Pfam, and plotted the frequency Y of domain families that
have precisely X members as a function of X, and fit a power
law curve to this. Figure 2a shows the resulting plots for three
representative species, one complex eukaryote (Homo sapiens),
one simple eukaryote (Saccharomyces cerevisiae), and one prokaryote (Escherichia coli). Figure 2b shows the corresponding plots for
all domains in all complete eukaryotic, bacterial, and archaeal
proteomes. The power law curve fits decently well, with slopes
becoming less steep for the more complex organisms, whose distributions have relatively more large families. The power law-like
behavior suggests that complex organisms with large proteomes
were formed by heavily duplicating domains from relatively few
families. Figure 3a and b show equivalent plots, not for single


Fig. 2. (a) Distribution of domain family sizes in three selected species. Power law distributions were fitted to these curves such that, for frequency f of families of size X, f = cX^(−a). For Saccharomyces cerevisiae, a = 1.8; for Escherichia coli, a = 1.7; and for Homo sapiens, a = 1.5. (b) Distribution of domain family sizes across the three kingdoms. Power law distributions were fitted to these curves such that, for frequency f of families of size X, f = cX^(−a). For bacteria, a = 2.4; for archaea, a = 2.4; and for eukaryotes, a = 1.8.

domains but for entire multidomain architectures. The curve shapes as well as the relationship between both species and organism groups are similar, indicating that the evolution of these distributions has been similar.

3. Kingdom and Age Distribution of Domain Families and Architectures

How old are specific domain families or domain architectures? With knowledge of which organism groups they are found in, it is possible to draw conclusions about their age, and whether lineage-specific selective pressures have determined their kingdom-specific
abundances. Domain families as well as their combinations have
arisen throughout evolutionary history, presumably by new combinations of preexisting elements that may have diverged beyond
recognition or by processes, such as exonization. We can estimate
the age of a domain family by finding the largest clade of organisms
within which it is found, excluding organisms with only xenologs,


Fig. 3. (a) Distribution of multidomain (architecture) family sizes in three selected species. Power law distributions were
fitted to these curves such that, for frequency f of families of size X, f cX a. For Saccharomyces cerevisiae, a 2.0,
for Escherichia coli, a 1.8, and for Homo sapiens, a 1.7. (b) Distribution of multidomain (architecture) family
sizes across the three kingdoms. Power law distributions were fitted to these curves such that, for frequency f of families of
size X, f cX a. For bacteria, a 2.5, for archaea, a 3.4, and for eukaryotes, a 2.2.

i.e., horizontally transferred genes (13). The age of this lineage's root is the likely age of the family. The same holds true for domain
combinations and entire domain architectures. This methodology
allows us to determine how changing conditions at different points
in evolutionary history, or in different lineages, have affected the
evolution of domain architectures.
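The age-assignment procedure just described can be sketched as follows. The species tree and its node names here are hypothetical, and a real analysis would use a curated taxonomy:

```python
def family_age(species_with_family, parent):
    """Root of the smallest clade containing every species that has the
    family (the caller should first drop species with only xenologs).
    `parent` maps each node of the species tree to its parent (root -> None)."""
    def lineage(node):
        path = []
        while node is not None:
            path.append(node)
            node = parent[node]
        return path
    it = iter(species_with_family)
    common = lineage(next(it))  # ancestors of the first carrier, leaf to root
    for sp in it:
        anc = set(lineage(sp))
        common = [n for n in common if n in anc]
    return common[0]  # deepest ancestor shared by all carriers

# Hypothetical mini species tree (names are illustrative only)
parent = {"human": "mammals", "mouse": "mammals", "fly": "animals",
          "yeast": "eukaryotes", "mammals": "animals",
          "animals": "eukaryotes", "eukaryotes": None}
clade = family_age({"human", "mouse"}, parent)  # -> "mammals"
```

A family found only in human and mouse is dated to the mammalian root, whereas one also found in yeast is dated to the eukaryotic root.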
Apic et al. (29) analyzed the distribution of SCOP domains
across 40 genomes from archaea, bacteria, and eukaryotes. They
found that a majority of domain families are common to all three
kingdoms of life, and thus likely to be ancient. Kuznetsov et al. (37)
performed a similar analysis using INTERPRO domains, and found
that only about one-fourth of all such domains were present in all
three kingdoms, but a majority was present in more than one of
them. Lateral gene transfer or annotation errors can cause a domain
family to be found in one or a few species in a kingdom without
actually belonging to that kingdom. To counteract this, one can


require that a family must be present in at least a reasonable fraction
of the species within a kingdom for it to be considered anciently
present there. For instance, using Gene3D assignments of
CATH domains to 114 complete genomes, mainly bacterial,
Ranea et al. (42) isolated protein superfamily domains that were
present in at least 90% of all the genomes and also at least 70% of the
archaeal and eukaryotic genomes. Under these stringent cutoffs for
considering a domain to be present in a kingdom, 140 domains,
15% of the CATH families found in at least 1 prokaryote genome,
were inferred to be ancient. Chothia and Gough (43) performed a
similar study on 663 SCOP superfamily domains evaluated at many
different thresholds, and found that while 516 (78%) superfamilies
were common to all three kingdoms at a threshold of 10% of species in each kingdom, only 156 (24%) superfamilies were common to all
three kingdoms at a threshold of 90%. They also showed that for
prokaryotes a majority of domain instances (i.e., not domain
families but actual domain copies) belong to common superfamilies
at all thresholds below 90%.
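A minimal sketch of such a cutoff-based test, with hypothetical species and family names; the thresholds and data structures are illustrative, not those of the cited studies:

```python
from collections import Counter

def ancient_families(presence, kingdom_of, threshold=0.9):
    """Families present in at least `threshold` of the species of every
    kingdom, in the spirit of the stringent cutoffs of Ranea et al. (42)
    and the threshold scan of Chothia and Gough (43).

    presence:   family -> set of species containing it
    kingdom_of: species -> kingdom"""
    totals = Counter(kingdom_of.values())  # species count per kingdom
    ancient = set()
    for family, species in presence.items():
        hits = Counter(kingdom_of[sp] for sp in species)
        if all(hits[k] / totals[k] >= threshold for k in totals):
            ancient.add(family)
    return ancient

# Toy data: P-loop is ubiquitous, SH3 is eukaryote-only
kingdom_of = {"ecoli": "bacteria", "bsub": "bacteria",
              "yeast": "eukaryota", "human": "eukaryota"}
presence = {"P-loop": {"ecoli", "bsub", "yeast", "human"},
            "SH3": {"yeast", "human"}}
old = ancient_families(presence, kingdom_of)  # -> {"P-loop"}
```

Varying `threshold` reproduces the qualitative effect reported in the text: the stricter the cutoff, the fewer families qualify as ancient.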
Extending to domain combinations, Apic et al. (29) reported
that a majority of SCOP domain pairs are unique to each kingdom,
but also that more kingdom-specific domain combinations than
expected were composed only of domain families shared between
all three kingdoms. This would imply a scenario, where the independent evolution of the three kingdoms mainly involved creating
novel combinations of domains that existed already in their common ancestor.
Several studies have reported interesting findings on domain
architecture evolution in lineages closer to ourselves: in metazoa
and vertebrates. Ekman et al. (44) claimed that new metazoa-specific domains and multidomain architectures have arisen roughly once every 0.1–1 million years in this lineage. According to their
results, most metazoa-specific multidomain architectures are a
combination of ancient and metazoa-specific domains. The latter
category is, however, mostly found as novel single-domain proteins. Many of the novel metazoan multidomain architectures
involve domains that are versatile (see below) and exon bordering
(allowing for their insertion through exon shuffling). The novel
domain combinations in metazoa are enriched for proteins associated with functions required for multicellularity: regulation, signaling, and functions involved in newer biological systems, such as
immune response or development of the nervous system, as previously noted by Patthy (21). They also showed support for exon
shuffling as an important mechanism in the evolution of metazoan
domain architectures. Itoh et al. (45) added that animal evolution
differs significantly from other eukaryotic groups in that lineage-specific domains played a greater part in creating new domain
combinations.


Fig. 4. (a) Kingdom distribution of unique domains. Values are given as percentages of the total 7,270 domains.
(b) Kingdom distribution of unique domain pairs. Values are given as percentages of the total 6,270 domain pairs.
(c) Kingdom distribution of unique domain triplets. Values are given as percentages of the total 20,396 domain triplets.
(d) Kingdom distribution of unique multidomain architectures. Values are given as percentages of the total 7,862
multidomain architectures.

In the most recent datasets, what is the distribution of domains
and domain combinations across the three kingdoms of life? Looking at the set of complete proteomes in version 24.0 of Pfam, the
distribution of domains across the three kingdoms is as displayed in
the Venn diagram of Fig. 4a. Figure 4b and c shows the equivalent
distributions of immediate neighbors and triplets of domains,
respectively, and Fig. 4d shows the distribution of multidomain
architectures across kingdoms. The numbers are somewhat biased
toward bacteria as 90% of the complete proteomes are from this
kingdom. However, with this high coverage of all kingdoms
(76 eukaryotic, 68 archaeal, and 1,359 bacterial proteomes), the
results should be robust in this respect. Compared to most previous
reports, we see a striking difference in that a much smaller portion
of domains are shared between all kingdoms. There are some
potential artifacts which could affect this analysis. If lateral gene
transfer is very widespread, we may overestimate the number of
families present in all three kingdoms. Moreover, there are cases,


where separate Pfam families are actually distant homologs of each
other, which could lead to underestimation of the number of
ancient families. To counteract this, we make use of Pfam clans,
considering domains in the same clan to be equivalent. While not all
distant homologies have yet been registered in the clan system,
performing the analysis on the clan level reduces the risk of such
underestimation.
Our finding that 11% of all Pfam-A domains are present in all
kingdoms is strikingly lower than in the earlier works, and is even
lower than reported by Ranea et al. (42), who used very stringent
cutoffs. However, a direct comparison of statistics for Pfam
domains/clans and CATH superfamilies is difficult. The decrease
in ancient families that we observe may be a consequence of the
massive increase in sequenced genomes and/or that the recent
growth of Pfam has added relatively more kingdom-specific
domains. We further found that only 2–3% of all domains or
domain combinations are unique to archaea, suggesting that
known representatives of this lineage have undergone very little
independent evolution and/or that most archaeal gene families
have been horizontally transferred to other kingdoms. The trend
when going from domain via domain combinations to whole architectures is clear: the more complex patterns are less shared
between the kingdoms. In other words, each kingdom has used a
common core of domains to construct its own unique combinations of multidomain architectures.

4. Domain Co-occurrence Networks

A multidomain architecture connects individual domains with
each other. There are several ways to derive these connections
and quantify the level of co-occurrence. The simplest method is
to consider all domains on the same amino acid chain to be
connected, but we can also limit the set of co-occurrences we
consider to, e.g., immediate neighbor pairs or triplets. Regardless
of which method is used, the result is a domain co-occurrence
network, where nodes represent domains and where edges represent the existence of proteins in which members of these families
co-occur. Figure 5 shows an example of such a network and the set
of domain architectures which defines it. This type of explicit
network representation is explored in several studies, notably by
Itoh et al. (45), Przytycka et al. (46), and Kummerfeld and Teichmann (12). It is advantageous as it allows the introduction of
powerful analysis tools developed within the engineering sciences
for use with artificial network structures, such as the World Wide
Web. The patterns of co-occurrences that we observe should be a
direct consequence of the constraints and conditions under which


Fig. 5. Example of protein domain co-occurrence network, adapted from Kummerfeld and
Teichmann (BioMed Central, 2009). (a) Sample set of domain architectures. The lines
represent proteins, and the boxes their domains in N- to C-terminal order. (b) Resulting
domain co-occurrence (neighbor) network. Nodes correspond to domains, and are linked
by an edge if at least one protein exists in which the two domains are found adjacent to
each other along the amino acid chain.

domain architectures evolve, and because of this the study of these patterns becomes relevant for understanding such factors.
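A minimal sketch of building the immediate-neighbor variant of such a network, with hypothetical domain names:

```python
def neighbor_network(architectures):
    """Undirected domain co-occurrence network in the style of Fig. 5:
    an edge joins two domains found immediately adjacent (N- to
    C-terminal) in at least one protein."""
    edges = set()
    for arch in architectures:  # arch: list of domains, N- to C-terminal
        for a, b in zip(arch, arch[1:]):
            if a != b:  # skip tandem repeats of the same domain
                edges.add(frozenset((a, b)))
    return edges

# Toy architectures with hypothetical domain names
archs = [["SH3", "Kinase"], ["SH3", "Kinase", "PDZ"], ["PDZ", "PDZ"]]
net = neighbor_network(archs)  # 2 edges: SH3-Kinase and Kinase-PDZ
```

Considering all domains on the same chain instead of immediate neighbors only would simply replace `zip(arch, arch[1:])` with all pairs from each architecture.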
The frequency distribution of node degrees in the domain co-occurrence network has been fitted to a power law (29) and a more
general GPD as well (34). The closer this approximation holds, the
more the network will have the scale-free property. This property
can be thought of as a hierarchy in the network, where the more
centrally connected nodes link to more peripheral nodes with the
same relative frequency at each level. In the context of domains, this


means that a small number of domains co-occur with a high number of other domains, whereas most domains only have a few neighbors, usually some of the highly connected hubs. The most
highly connected domains are referred to as promiscuous (47),
mobile, or versatile (13, 48, 49). Many such hub domains are
involved in intracellular or extracellular signaling, protein-protein
interactions and catalysis, and transcription regulation. In general,
these are domains that encode a generic function, e.g., phosphorylation, that is reused in many contexts by additional domains that
confer substrate specificity or localization. Table 1 shows the
domains (or clans) with the highest numbers of immediate neighbors in Pfam 24.0.
One way of evolving a domain co-occurrence network that
follows a power law is by preferential attachment (33, 46). This
means that new edges (corresponding to proteins, where two
domains co-occur) are added with a probability that is higher the
more edges these nodes (domains) already have, resulting in a
power law distribution.
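A sketch of preferential attachment using the standard endpoint-list trick; this is a toy growth process, not a model of any particular proteome:

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, seed=0):
    """Each new node links to an existing node drawn uniformly from the
    list of all past edge endpoints; since a node appears there once per
    incident edge, the draw is exactly degree-proportional."""
    rng = random.Random(seed)
    edges = [(0, 1)]     # seed network: one edge between two nodes
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges

edges = preferential_attachment(200)
degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1
```

Tallying `degree` values shows the expected pattern: most nodes keep the single link they were born with, while a few early nodes accumulate many.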
Apic et al. (30) considered a null model for random domain
combination, in which a proteome contains domain combinations
with a probability based on the relative abundances of the domains
only. They showed that this model does not hold, and that far fewer
domain combinations than expected under it are actually seen.
If most domain duplication events are gene duplication events
that do not change domain architecture (or at the very least do not disrupt domain pairs), then this finding is not unexpected, nor
does it require or exclude any particular selective pressure to keep
these domains together in proteins. There is growing support for
the idea that separate instances of a given domain architecture in
general descend from a single ancestor with that architecture (50),
with polyphyletic evolution of domain architectures occurring only
in a small fraction of cases (46, 51, 52).
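The spirit of this comparison can be sketched with a small Monte Carlo version of the abundance-only null model (toy data; the cited study used analytical expectations on real proteomes):

```python
import random
from collections import Counter

def pair_diversity(architectures, trials=200, seed=0):
    """Distinct adjacent domain pairs actually observed, versus the number
    expected if the same count of pairs were drawn at random with
    abundance weights only (a Monte Carlo take on the null model of
    Apic et al. (30))."""
    rng = random.Random(seed)
    abundance = Counter(d for arch in architectures for d in arch)
    domains = list(abundance)
    weights = [abundance[d] for d in domains]
    pairs = [frozenset(p) for arch in architectures
             for p in zip(arch, arch[1:]) if p[0] != p[1]]
    observed = len(set(pairs))
    expected = sum(
        len({frozenset(rng.choices(domains, weights=weights, k=2))
             for _ in pairs})
        for _ in range(trials)) / trials
    return observed, expected

# Toy proteome that heavily reuses one combination
archs = [["A", "B"]] * 50 + [["A", "C"]] * 2
obs, exp = pair_diversity(archs)  # obs == 2, exp noticeably larger
```

Even on this toy input the random draws generate more distinct combinations than are observed, the direction of the effect Apic et al. reported.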
Itoh et al. (45) performed reconstruction of ancestral domain
architectures using maximum parsimony, as described in the next
section. This allowed them to study the properties of the ancestral
domain co-occurrence network, and thus explore how network connectivity has altered over evolutionary time. Among other things,
they found increased connectivity in animals, particularly of animal-specific domains, and suggest that this phenomenon explains the high
connectivity for eukaryotes reported by Wuchty (34). For nonanimal
eukaryotes, they reported a correlation between connectivity and age
such that older domains had relatively higher connectivity, with
domains preceding the divergence of eukaryotes and prokaryotes
being the most highly connected, followed by early eukaryotic
domains. In other words, early eukaryotic evolution saw the emergence of some key hub proteins while the most prominent eukaryotic
hubs emerged in the animal lineage.


Table 1
The 20 most densely connected hubs with regards to immediate domain neighbors, according to Pfam 24.0

Identifier | Name                                                            | Number of different immediate neighbors
CL0123     | Helix-turn-helix clan                                           | 202
CL0023     | P-loop containing nucleoside triphosphate hydrolase superfamily | 166
CL0063     | FAD/NAD(P)-binding Rossmann fold superfamily                    | 155
CL0159     | Ig-like fold superfamily (E-set)                                | 71
CL0036     | Common phosphate-binding site TIM barrel superfamily            | 71
CL0016     | Protein kinase superfamily                                      | 62
CL0172     | Thioredoxin like                                                | 52
CL0202     | Galactose-binding domain-like superfamily                       | 50
CL0058     | TIM barrel glycosyl hydrolase superfamily                       | 50
CL0125     | Peptidase clan CA                                               | 46
CL0028     | Alpha/beta hydrolase fold                                       | 45
CL0304     | CheY-like superfamily                                           | 44
CL0137     | HAD superfamily                                                 | 42
PF00571    | CBS domain                                                      | 41
CL0219     | Ribonuclease H-like superfamily                                 | 41
CL0010     | Src homology-3 domain                                           | 41
CL0300     | Twin-arginine translocation motif                               | 40
CL0261     | NUDIX superfamily                                               | 40
CL0025     | His Kinase A (phospho-acceptor) domain                          | 39
CL0183     | PAS domain clan                                                 | 38

What is the degree distribution of current domain co-occurrence
networks? We again used the domain architectures from all complete
proteomes in version 24.0 of Pfam, and considered the network
of immediate neighbor relationships, i.e., nodes (domains) have an


Fig. 6. (a) Distribution of domain co-occurrence network node degrees in three selected species. Power law distributions were fitted to these curves such that, for frequency f of nodes of degree X, f = cX^(−a). For Saccharomyces cerevisiae, a = 2.7; for Escherichia coli, a = 2.1; and for Homo sapiens, a = 2.3. (b) Distribution of domain co-occurrence network node degrees across the three kingdoms. This corresponds to a network where two domains are connected if any species within the kingdom has a protein where these domains are immediately adjacent. Power law distributions were fitted to these curves such that, for frequency f of nodes of degree X, f = cX^(−a). For bacteria, a = 1.8; for archaea, a = 2.1; and for eukaryotes, a = 2.1.

edge between them if there is a protein, where they are adjacent. Each
domain was assigned a degree as its number of links to other domains.
We then counted the frequency with which each degree occurs in the
co-occurrence network. Figure 6a shows this relationship for the set
of domain architectures found in the same species as for Fig. 2a, and
Fig. 6b shows the equivalent plots for the three kingdoms as found
among the complete proteomes in Pfam. Regressions to a power law
have been added to the plots. The presence of a power law-like
behavior of this type implies that few domains have very many immediate neighbors while most domains have few immediate neighbors.
Note that the observed degrees in our dataset were strongly reduced
by removing all sequences with a stretch longer than 50 amino acids
lacking domain annotation.
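Computing this degree distribution from an edge list is straightforward; a sketch with a hypothetical toy network:

```python
from collections import Counter

def degree_frequencies(edges):
    """Node degrees and the frequency of each degree value, the quantity
    plotted against degree in Fig. 6."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree, Counter(degree.values())

# Toy network: one hub plus one detached pair (hypothetical names)
edges = [("hub", x) for x in ("A", "B", "C", "D", "E")] + [("X", "Y")]
degree, freq = degree_frequencies(edges)  # freq: {1: 7, 5: 1}
```

The resulting frequency-versus-degree pairs are what a power law would then be regressed against, as in the fits reported above.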

5. Supradomains and Conserved Domain Order

As we have seen, whole multidomain architectures or shorter
stretches of adjacent domains are often repeated in many proteins.
These only cover a small fraction of all possible domain combinations.
Are the observed combinations somehow special? We would expect
selective pressure to retain some domain combinations but not
others, since only some domains have functions that would synergize
together in one protein. Often, co-occurring domains require each
other structurally or functionally, for instance in transcription factors,
where the DNA-binding domain provides substrate specificity,
whereas the trans-activating domain recruits other components of
the transcriptional machinery (53). Vogel et al. (31) identified series
of domains co-occurring as a fixed unit with conserved N- to
C-terminal order but flanked by different domain architectures, and
termed them supradomains. By investigating their statistical overrepresentation relative to the frequency of the individual domains in the
set of nonredundant domain architectures (where nonredundant is
crucial, as otherwise, e.g., whole-gene duplication would bias the
results), they identified a number of such supradomains. Many
ancient domain combinations (shared by all three kingdoms) appear
to be such selectively preserved supradomains.
How conserved is the order of domains in multidomain architectures? In a recent study, Kummerfeld and Teichmann (12) built a
domain co-occurrence network with directed edges, allowing it to
represent the order in which two domains are found in proteins. As in
other studies, the distribution of node degrees fits a power law well.
Most domain pairs were only found in one orientation. This does not
seem required for functional reasons, as flexible linker regions should
allow the necessary interface to form also in the reversed case (50),
but may rather be an indication that most domain combinations are
monophyletic. Weiner and Bornberg-Bauer (54) analyzed the evolutionary mechanisms underlying a number of reversed domain order
cases and concluded that independent fusion/fission is the most
frequent scenario. Although domain reversals occur in only a few
proteins, it actually happens more often than was expected from
randomizing a co-occurrence network (12). That study also observed
that the domain co-occurrence network is more clustered than
expected by a random model, and that these clusters are also functionally more coherent than would be expected by chance.

6. Domain Mobility, Promiscuity, or Versatility

While some protein domains co-occur with a variety of other
domains, some are always seen alone or in a single architecture in
all proteomes where they are found. A natural explanation is that
some domains are more likely to end up in a variety of architectural


contexts than others due to some intrinsic property they possess.
Is such domain versatility or promiscuity a persistent feature of a
given domain, and does it correlate with certain functional or
biological properties of the domain?
Several ways of measuring domain versatility have been suggested. One measure, NCO (34), counts the number of other
domains found in any architectures, where the domain of interest
is found. Another measure, NN (30), instead counts the number of
distinct other domains that a domain is found adjacent to. Yet
another measure, NTRP (55), counts the number of distinct triplets of consecutive domains, where the domain of interest is found
in the middle. All of these measures can be expected to be higher
for common domains than for rare domains, i.e., variations in
domain abundance (the number of proteins a domain is found in)
can hide the intrinsic versatility of domains. Therefore, three different studies (13, 48, 56) formulated relative domain versatility
indices that aim to measure versatility independently of abundance.
It is worth noting that most studies have considered only immediately adjacent domain neighbors in these analyses, a restriction
based on the assumption that those are more likely to interact
functionally than domains far apart on a common amino acid chain.
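The counting behind NN and NTRP can be sketched as follows (hypothetical architectures; the published relative indices additionally normalize for abundance):

```python
def versatility_measures(architectures, domain):
    """Two of the versatility counts from the text: NN, the number of
    distinct immediate neighbors of `domain`, and NTRP, the number of
    distinct consecutive triplets with `domain` in the middle."""
    neighbors, triplets = set(), set()
    for arch in architectures:  # arch: list of domains, N- to C-terminal
        for i, d in enumerate(arch):
            if d != domain:
                continue
            if i > 0:
                neighbors.add(arch[i - 1])
            if i + 1 < len(arch):
                neighbors.add(arch[i + 1])
            if 0 < i < len(arch) - 1:
                triplets.add((arch[i - 1], d, arch[i + 1]))
    neighbors.discard(domain)  # ignore self-adjacency
    return len(neighbors), len(triplets)

# Toy architectures with hypothetical domain names; "K" is the query
archs = [["A", "K", "B"], ["C", "K"], ["A", "K", "B"], ["K", "D"]]
nn, ntrp = versatility_measures(archs, "K")  # nn == 4, ntrp == 1
```

Note that the repeated ["A", "K", "B"] architecture adds nothing to either count, which is why abundance normalization matters for the relative indices discussed next.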
The first relative versatility study was presented by Vogel et al.
(56), who used as their domain dataset the SUPERFAMILY database applied to 14 eukaryotic, 14 bacterial, and 14 archaeal proteomes. They modeled the number of unique immediate neighbor
domains as a power law function of domain abundance, performed
a regression on this data, and used the resulting power law exponent as a relative versatility measure. Basu et al. (48) used Pfam and
SMART (8) domains and measured relative domain versatility for
28 eukaryotes as the immediate neighbor pair frequency normalized by domain frequency. They then defined promiscuous
domains as a class according to a bimodality in the distribution of
the raw numbers of unique domain immediate-neighbor pairs.
Weiner et al. (13) used Pfam domains for 10,746 species in all
kingdoms, and took as their relative versatility measure the logarithmic regression coefficient for each domain family across genomes, meaning that it is not defined within single proteomes.
To what extent is high versatility an intrinsic property of a
certain domain? Vogel et al. (56) only examined large groups of
domains together and therefore did not address this question for
single domains. Basu et al. (48) and Weiner et al. (13) instead
analyzed each domain separately and concluded that there are
strong variations in relative versatility at this level. Their results
are very different in detail, however, as reflected by the fact that only
one domain family (PF00004, AAA ATPase family) is shared
between the ten most versatile domains reported in the two studies.
As they used fairly similar domain datasets, it would appear that the
results strongly depend on the definition of relative versatility.

8 Evolution of Protein Domain Architectures

Another potential reason for the different results is that Basu's list
was based on eukaryotes only while Weiner's analysis was heavily
biased toward prokaryotes. Furthermore, the top ten lists in
Basu et al. (48) and their follow-up paper (49) only overlap by
four domains; yet the main difference is that in the latter study all
28 eukaryotes were considered while the former study was limited
to the subset of 20 animal, plant, and fungal species. The choice of
species, thus, seems pivotal for the results when using this method.
They also used different methods for calculating the average value
of relative versatility across many species, which may influence
the results.
Does domain versatility vary between different functional
classes of domains? Vogel et al. (56) found no difference in
relative versatility between broad functional or process categories
or between SCOP structural classes. In contrast to this,
Basu et al. (48) reported that high versatility was associated with
certain functional categories in eukaryotes. However, no test for
the statistical significance of these results was performed. Weiner
et al. (13) also noted some general trends, but found no significant enrichment of Gene Ontology terms in versatile domains.
This does not necessarily mean that no such correlation exists, but
more research is required to convincingly demonstrate its strength
and its nature.
Another important question is to what extent domain versatility varies across evolutionary lineages. Vogel et al. (56) reported
no large differences in average versatility for domains in different
kingdoms. The versatility measure of Basu et al. (48) can be
applied within individual genomes, which means that according
to this measure domains may be versatile in one organism group
but not in another, as well as gain or lose versatility across evolutionary time. They found that more domains were highly versatile
in animals than in other eukaryotes. Modeling versatility as a
binary property defined for domains in extant species, they further
used a maximum parsimony approach to study the persistence of
versatility for each domain across evolutionary time, and concluded that both gain and loss of versatility are common during
evolution. Weiner et al. (13) divided domains into age categories
based on distribution across the tree of life, and reported that the
versatility index is not dependent on age, i.e., domains have equal
chances of becoming versatile at different times in evolution. This
is consistent with the observation by Basu et al. (48) that versatility is a fast-evolving and varying property. When measuring versatility as a regression within different organism groups, Weiner
et al. (13) found slightly lower versatility in eukaryotes, which is
in conflict with the findings of Basu et al. (48). Again, this underscores how strongly the results depend on the method and
the dataset.

K. Forslund and E.L.L. Sonnhammer

Further properties reported to correlate with domain versatility
include sequence length, where Weiner et al. (13) found that longer
domains are significantly more versatile within the framework of
their study while at the same time shorter domains are more abundant, and hence may have more domain neighbors in absolute
numbers. Basu et al. (48) further reported that more versatile
domains have more structural interactions than other domains.
To determine which of these reported correlations genuinely reflect
universal biological trends, further comprehensive studies are
needed using more data and uniform procedures. This would
hopefully allow the results from the studies described here to be
validated, and any conflicts between them to be resolved.
Basu et al. (48) further analyzed the phylogenetic spread of all
immediate domain neighbor pairs (bigrams) containing domains
classified as promiscuous. The main observation this yielded was
that although most such combinations occurred in only a few
species, most promiscuous domains are part of at least one combination that is found in a majority of species. They interpreted this as
implying the existence of a reservoir of evolutionarily stable domain
combinations from which lineage-specific recombination may draw
promiscuous domains to form unique architectures.
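This spread analysis amounts to recording, for every immediate-neighbor domain pair, the set of species whose proteomes contain it; one can then check whether a promiscuous domain takes part in at least one widely conserved pair. A minimal sketch with hypothetical species and domain names:

```python
from collections import defaultdict

def bigram_spread(proteomes):
    """Map each immediate domain neighbor pair (bigram) to the set of
    species in whose proteomes it occurs.
    proteomes: {species: [architecture tuples]}."""
    spread = defaultdict(set)
    for species, archs in proteomes.items():
        for arch in archs:
            for a, b in zip(arch, arch[1:]):
                spread[(a, b)].add(species)
    return spread

proteomes = {
    "sp1": [("A", "B"), ("A", "C")],
    "sp2": [("A", "B")],
    "sp3": [("A", "B"), ("B", "D")],
}
spread = bigram_spread(proteomes)
# the bigram (A, B) is found in all three species, while (A, C) is
# species-specific; domain A thus belongs to at least one widespread pair
```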

7. Principles of Domain Architecture Evolution

What mutation events can generate new domain architectures, and
what is their relative predominance? The question can be approached
by comparing protein domain architectures of extant proteins. This is
based on the likely realistic assumption that most current domain
architectures evolved from ancestral domain architectures that can
still be found unchanged in other proteins. Because of this, in pairs of
most similar extant domain architectures, one can assume that one of
them is ancestral. This agrees well with results indicating that most
groups of proteins with identical domain architectures are monophyletic. By comparing the most similar proteins, several studies have
attempted to chart the relative frequencies of different architecture-changing mutations.
Bjorklund et al. (57) used this particular approach and came
to several conclusions. First, changes to domain architecture are
much more common by the N- and C-termini than internally in
the architecture. This is consistent with several mechanisms for
architecture changes, such as introduction of new start or stop
codons or mergers with adjacent genes, and similar results have
been found in several other studies (23, 24, 58). Furthermore,
insertions or deletions of domains (indels) are more common
than substitutions of domains, and the events in question mostly
concern just single domains, except in cases with repeats
expanding with many domains in a row (59). In a later study, the
same group made use of phylogenetic information as well, allowing them to infer directionality of domain indels (44). They then
found that domain insertions are significantly more common than
domain deletions.
Weiner et al. (24) performed a similar analysis on domain loss
and found compatible results: most changes occur at the termini.
Moreover, they demonstrated that terminal domain loss seldom
involves losing only part of a domain, or rather that such partial
losses quickly progress into loss of the entire domain.
There is some support (21, 60, 61) for exon shuffling to have
played an important part in domain evolution, and there are a
number of domains that match intron borders well, for example
structural domains in extracellular matrix proteins. While it may not
be a universal mechanism, exon shuffling is suggested to have been
particularly important for vertebrate evolution (21).

8. Inferring Ancestral Domain Architectures

The above analyses, based on pairwise comparison of extant protein
domain architectures, cannot tally ancestral evolutionary events
nearer the root of the tree of life. With ancestral architectures, one
can directly determine which domain architecture changes have
taken place during evolution and precisely chart how mechanisms
of domain architecture evolution operate, as well as gauge their
relative frequency. A drawback is that since we can only infer
ancestral domain architectures from extant proteins, the result
depends somewhat on our assumptions about evolutionary
mechanisms. On the upside, it should be possible to test how well
different assumptions fit the observed modern-day protein domain
architecture patterns.
Attempts at such reconstructions have been made using parsimony. Given a gene tree and the domain architectures at the leaves,
dynamic programming can be used in order to find the assignment
of architectures to internal nodes that requires the smallest number
of domain-level mutation events. This simple model can be elaborated by weighting loss and gain differently or requiring that a
domain or an architecture can only be gained at most once in a
tree (Dollo parsimony) (62).
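A minimal version of such a dynamic program, here for the presence or absence of a single domain on a rooted tree with separate gain and loss costs (a Sankoff-style small parsimony; forbidding more than one gain would give Dollo parsimony), might look as follows. The tree encoding and cost values are illustrative assumptions.

```python
def min_events(node, tree, leaf_state, gain=2.0, loss=1.0):
    """Minimum total cost of explaining one domain's presence (1) or
    absence (0) at the leaves. tree: {internal node: (children)};
    leaf_state: {leaf: 0 or 1}. Returns the pair
    (cost if node lacks the domain, cost if node has it)."""
    if node in leaf_state:                    # leaf: the state is observed
        s = leaf_state[node]
        return (0.0, float("inf")) if s == 0 else (float("inf"), 0.0)
    total_absent = total_present = 0.0
    for child in tree[node]:
        ca, cp = min_events(child, tree, leaf_state, gain, loss)
        total_absent += min(ca, cp + gain)    # domain gained on this branch
        total_present += min(cp, ca + loss)   # domain lost on this branch
    return total_absent, total_present

# toy tree: root -> (x, C), x -> (A, B); the domain is present in A and B
tree = {"root": ("x", "C"), "x": ("A", "B")}
leaves = {"A": 1, "B": 1, "C": 0}
costs = min_events("root", tree, leaves)
# cheapest scenario overall: the root already has the domain and C lost
# it (cost 1.0); if the root lacks it, one gain on the branch to x costs 2.0
```

Weighting gains above losses, as here, encodes the intuition that independent re-invention of the same domain presence should be penalized more than loss.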
An early study of Snel et al. (63) considered 252 gene trees
across 17 fully sequenced species and used parsimony to minimize
the number of gene fission and fusion events occurring along the
species tree. Their main conclusion, that gene fusions are more
common than gene fissions, was subsequently supported by a larger
study by Kummerfeld and Teichmann (64), where fusions were
found to be about four times as common as fissions in a most
parsimonious reconstruction. Fong et al. (65) followed a similar
procedure on yet more data and concluded that fusion was 5.6
times as likely as fission.
Buljan and Bateman (58) performed a similar maximum parsimony reconstruction of ancestral domain architectures. They too
observed that domain architecture changes primarily take place at
the protein termini, and the authors suggested that this might
largely occur because terminal changes to the architecture are less
likely to disturb the overall protein structure. Moreover, they concluded from reconciliation of gene and species trees that domain
architecture changes were more common following gene duplications than following speciation, but that these cases did not differ
with respect to the relative likelihood of domain losses or gains.
Recently, Buljan et al. (23) presented a new ancestral domain
architecture reconstruction study which assumed that gain of a
domain should take place only once in each gene tree, i.e., Dollo
parsimony (62). Their results also support gene fusion as a major
mechanism for domain architecture change. The fusion is generally
preceded by a duplication of either of the fused genes. Intronic
recombination and insertion of exons are observed, but relatively
rarely. They also found support for de novo creation of disordered
segments by exonization of previously noncoding regions.

9. Polyphyletic Domain Architecture Evolution

There appears to be a grammar for how protein domains are
allowed to be combined. If nature continuously explores all possible domain combinations, one would expect that the allowed combinations would be created multiple times throughout evolution.
Such independent creation of the same domain architecture can be
called convergent or polyphyletic evolution, whereas a single original creation event for all extant examples of an architecture would
be called divergent or monophyletic evolution. This is relevant for
several reasons, not least because it determines whether or not we
can expect two proteins with identical domain architectures to have
the same history along their entire length.
A graph theoretical approach to answer this question was taken
by Przytycka et al. (46), who analyzed the set of all proteins containing a given superfamily domain. The domain architectures of these
proteins define a domain co-occurrence network, where edges connect two domains both found in a protein, regardless of sequential
arrangement. The proteins of such a set can also be placed in an
evolutionary tree, and the evolution of all multidomain architectures containing the reference domain can be expressed in terms of
insertions and deletions of other domains along this tree to form the
extant domain architectures. The question, then, is whether or not
all leaf nodes sharing some domain arrangement (up to and including
an entire architecture) stem from a single ancestral node possessing
this combination of domains. For monophyly to be true for all
architectures containing the reference domain, the same companion
domain cannot have been inserted in more than one place along the
tree describing the evolution of the reference domain. By application
of graph theory and Dollo parsimony (62), they showed that monophyly is only possible if the domain co-occurrence network defined by
all proteins containing the reference domain is chordal, i.e., it
contains no chordless cycles longer than three edges.
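Chordality can be tested with the classic maximum cardinality search of Tarjan and Yannakakis: a graph is chordal exactly when the reverse of an MCS ordering is a perfect elimination ordering. The sketch below is a generic implementation for a co-occurrence network given as an adjacency dictionary, not the code used in the study.

```python
def is_chordal(adj):
    """Chordality test via maximum cardinality search (MCS).
    adj: {vertex: set of neighbor vertices} (undirected)."""
    weight = {v: 0 for v in adj}
    order, numbered = [], set()
    while len(order) < len(adj):
        # pick the unnumbered vertex with the most numbered neighbors
        v = max((u for u in adj if u not in numbered), key=lambda u: weight[u])
        order.append(v)
        numbered.add(v)
        for u in adj[v]:
            if u not in numbered:
                weight[u] += 1
    order.reverse()                      # candidate perfect elimination order
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if later:
            w = min(later, key=lambda u: pos[u])   # earliest later neighbor
            # for a perfect elimination order, w must be adjacent to all
            # other later neighbors of v
            if any(u != w and u not in adj[w] for u in later):
                return False
    return True

# a 4-cycle A-B-C-D contains a chordless cycle of length four ...
square = {"A": {"B", "D"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"A", "C"}}
# ... while adding the chord A-C makes the graph chordal
chorded = {"A": {"B", "C", "D"}, "B": {"A", "C"},
           "C": {"A", "B", "D"}, "D": {"A", "C"}}
```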
Przytycka et al. (46) then evaluated this criterion for all superfamily domains in a large-scale dataset. For all domains where the
co-occurrence network contained fewer than 20 nodes (domains),
the chordal property held, and hence any domain combinations or
domain architectures containing these domains could potentially
be monophyletic. By comparing actual domain co-occurrence networks with a preferential attachment null model, they showed that
far more architectures are potentially monophyletic than would be
expected under a pure preferential attachment process. This finding
is analogous to the observation by Apic et al. (30) that most domain
combinations are duplicated more frequently (or reshuffled less)
than expected by chance. In other words, gene duplication is much
more frequent than domain recombination (56). However, for
many domains that co-occurred with more than 20 other different
domains, particularly for domains previously reported as promiscuous, the chordal property was violated, meaning that multiple
independent insertions of the same domain, relative to the reference domain phylogeny, must be assumed.
A more direct approach is to do complete ancestral domain
architecture reconstruction of protein lineages and to search for
concrete cases that agree with polyphyletic architecture evolution.
There are two conceptually different methodologies for this type of
analysis. Either one only considers architecture changes between
nodes of a species tree or one considers any node in a reconstructed
gene tree. The advantage of using a species tree is that one avoids
the inherent uncertainty of gene trees, but on the other hand only
events that take place between examined species can be observed.
Gough (51) applied the former species tree-based methodology to SUPERFAMILY domain architectures, and concluded that
polyphyletic evolution is rare, occurring in 0.4-4% of architectures.
The value depends on methodological details, with the lower
bound considered more reliable.
The latter gene tree-based methodology was applied by Forslund
et al. (52) to the Pfam database. Ancestral domain architectures were
reconstructed through maximum parsimony of single-domain phylogenies which were overlaid for multidomain proteins. This strategy
yielded a higher figure, ranging between 6 and 12% of architectures
depending on dataset and whether or not incompletely annotated
proteins were removed. The two different approaches, thus, give very
different results. The detection of polyphyletic evolution is in both
frameworks dependent on the data that is used: its quality, coverage,
filtering procedures, etc. The studies used different datasets, which
makes them hard to compare directly. However, given that their domain annotations are more or less comparable, the major difference ought to be
the ability of the gene-tree method to detect polyphyly at any point
during evolution, even within a single species. It should be noted that
domain annotation is by no means complete (only a little less than
half of all residues are assigned to a domain (5)), and this is clearly a
limiting factor for detecting architecture polyphyly. The numbers
may, thus, be adjusted considerably upward when domain annotation
reaches higher coverage.
Future work will be required to provide more reliable estimates
of how common polyphyletic evolution of domain architectures is.
Any estimate will depend on the studied protein lineage, versatility
of the domains, and methodological factors. A comprehensive and
systematic study using more complex phylogenetic methods than
the fairly ad hoc parsimony approach, as well as effective ways to
avoid overestimating the frequency of polyphyletic evolution due to
incorrect domain assignments or hidden homology between different domain families, may be the way to go. At this point, all that can
be said is that polyphyletic evolution of domain architectures definitely does happen, but relatively rarely, and that it is more frequent
for complex architectures and versatile domains.

10. Conclusions
As access to genomic data and computing power has grown
during the last decade, so has our knowledge of
the overall patterns of domain architecture evolution. Still, no study
is better than its underlying assumptions, and differences in the
representation of data and hypotheses mean that results often
cannot be directly compared. Overall, however, the current state
of the field appears to support some broad conclusions.
Domain and multidomain family sizes, as well as numbers of
co-occurring domains, all approximately follow power laws, which
implies a scale-free hierarchy. This property is associated with many
biological systems in a variety of ways. In this context, it appears to
reflect how a relatively small number of highly versatile components
have been reused again and again in novel combinations to create a
large part of the domain and domain architecture repertoire
of organisms. Gene duplication is the most important factor
in generating multidomain architectures, and as it outweighs
domain recombination, only a small fraction of all possible domain
combinations is actually observed. This is probably further
modulated by family-specific selective pressure, though more work
is required to demonstrate to what extent. Most of the time, all
proteins with the same architecture or domain combination stem
from a single ancestor, where it first arose, but there remains a
fraction of cases, particularly with domains that have very many
combination partners, where this does not hold.
Most changes to domain architectures occur following a gene
duplication, and involve the addition of a single domain to either
protein terminus. The main exceptions to this occur in repeat regions.
Exon shuffling played an important part in animals by introducing a
great variety of novel multidomain architectures, reusing ancient
domains as well as domains introduced in the animal lineage.
In this chapter, we have reexamined with the most up-to-date
datasets many of the analyses done previously on less data, and
found that the earlier conclusions still hold true. Even though we
are at the brink of amassing vastly more genome and
proteome data thanks to the new generation of sequencing technology, there is no reason to believe that this will alter the fundamental observations we can make today on domain architecture
evolution. However, it will permit a more fine-grained analysis, and
also a greater chance of finding rare events, such as
independent creation of domain architectures. Furthermore, careful application of more complex models of evolution with and
without selection pressure may allow us to determine more closely
to what extent the process of domain architecture evolution was
shaped by selective constraints.

11. Materials and Methods

Updated statistics were generated from the data in Pfam 24.0.
All Uniprot proteins belonging to any of the full proteomes
covered in Pfam 24.0 were included. These include 1,359 bacteria,
76 eukaryotes, and 68 archaea. All Pfam-A domains regardless of
type were included. However, as stretches of repeat domains are
highly variable, consecutive subsequences of the same domain were
collapsed into a single pseudo-domain if the domain was classified as type
Motif or Repeat, as in several previous works (44, 52, 56, 65).
Domains were ordered within each protein based on their
sequence start position. In the few cases of domains being inserted
within other domains, this was represented as the outer domain
followed by the nested domain, resulting in a linear sequence of
domain identifiers. As long regions without domain assignments are
likely to represent the presence of as-yet uncharacterized domains, we
excluded any protein with unassigned regions longer than 50 amino
acids (more than 95% of Pfam-A domains are longer than this). This
approach is similar to that taken in previous works (51, 52, 57).
Other studies (44, 59) have instead performed additional more
sensitive domain assignment steps, such as clustering the unassigned
regions to identify unknown domains within them.
Pfam domains are sometimes organized in clans, where clan members are considered homologous. A transition from a domain to
another of the same clan is, thus, less likely to be a result of domain
swapping of any kind, and more likely to be a result of sequence
divergence from the same ancestor. Because of this, we replaced all
Pfam domains that are clan members with the corresponding clan.
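A sketch of this preprocessing, assuming each protein's domain assignments come as (name, start, end) tuples sorted by start coordinate. The clan mapping, repeat collapsing, and 50-residue gap filter follow the steps described above; the identifiers in the example are illustrative, and a real pipeline would take domain types and clans from the Pfam metadata.

```python
def architecture(domains, protein_length, repeat_types, clan_map, max_gap=50):
    """Convert domain assignments into a linear architecture.
    domains: list of (name, start, end), 1-based inclusive, sorted by start.
    Returns a tuple of domain/clan identifiers, or None if the protein
    contains an unassigned stretch longer than max_gap residues."""
    prev_end = 0
    for name, start, end in domains:
        if start - prev_end - 1 > max_gap:    # long gap before this domain
            return None
        prev_end = max(prev_end, end)
    if protein_length - prev_end > max_gap:   # long unassigned C-terminal tail
        return None
    arch = []
    for name, start, end in domains:
        name = clan_map.get(name, name)       # replace clan members by their clan
        # collapse consecutive copies of the same Repeat/Motif domain
        if arch and arch[-1] == name and name in repeat_types:
            continue
        arch.append(name)
    return tuple(arch)

doms = [("LRR_1", 5, 30), ("LRR_2", 31, 55), ("Pkinase", 60, 300)]
arch = architecture(doms, protein_length=310,
                    repeat_types={"CL_LRR"},
                    clan_map={"LRR_1": "CL_LRR", "LRR_2": "CL_LRR"})
# both repeats map to the same clan and collapse: ("CL_LRR", "Pkinase")
```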
The statistics and plots were generated using a set of Perl, R,
and GnuPlot scripts, which are available upon request. Power law
regressions were done using the Marquardt-Levenberg nonlinear
least squares algorithm as implemented in GnuPlot and allowed to
continue until the convergence criterion was met (for least squares
sum X_i following the ith iteration, (X_i - X_{i+1})/X_i should not
exceed 10^-5). For reasons of scale, the regression for a power
law relation, such as

N = c * X^(-a),

was performed on the equivalent relationship

log X = (1/a)(log c - log N),

for the parameters a and c, with the exception of the data for Fig. 6,
where instead the relationship

log N = log c - a * log X

was used. Moreover, because species or organism group datasets
were of very different size, raw counts of domains were converted
to frequencies before the regression was performed.
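As an illustration, the log-space fit can be reproduced with an ordinary linear least squares regression (numpy's polyfit, rather than the GnuPlot Marquardt-Levenberg routine used here), converting raw counts to frequencies first as described:

```python
import numpy as np

def fit_power_law(x, counts):
    """Fit N = c * X**(-a) by a linear least squares fit in log space,
    log N = log c - a * log X, after converting counts to frequencies."""
    freq = np.asarray(counts, dtype=float)
    freq = freq / freq.sum()                 # raw counts -> frequencies
    slope, intercept = np.polyfit(np.log(x), np.log(freq), 1)
    return -slope, np.exp(intercept)         # a, c

x = np.array([1.0, 2.0, 4.0, 8.0])
counts = np.array([800, 400, 200, 100])      # exact N proportional to 1/X
a, c = fit_power_law(x, counts)
# recovers the exponent a = 1 and c = 800/1500 (the frequency at X = 1)
```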

12. Online Domain Database Resources
For further studies or research into this field, the first and most
important stop will be the domain databases. Table 2 presents a
selection of domain databases in current use.

13. Exercises/Questions

- Which aspects of domain architecture evolution follow from properties of nature's repertoire of mutational mechanisms, and which follow from selective constraints?

- What trends have characterized the evolution of domain architectures in animals?

Table 2
A selection of protein domain databases

ADDA (http://ekhidna.biocenter.helsinki.fi/sqgraph/pairsdb): Automatic clustering of protein domain sequences.

CATH (http://www.cathdb.info): Based solely on experimentally determined 3D structures.

CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml): Metadatabase joining together domain assignments from many different sources, as well as some unique domains.

Gene3D (http://gene3d.biochem.ucl.ac.uk): Bioinformatical assignment of sequences to CATH domains using hidden Markov models.

INTERPRO (http://www.ebi.ac.uk/interpro): Metadatabase joining together domain assignments from many different sources.

Pfam (http://pfam.sanger.ac.uk): Domain families are defined from manually curated multiple alignments, and represented using hidden Markov models.

PRODOM (http://prodom.prabi.fr): Automatically derived domain families from proteins in UniProt.

SCOP (http://scop.mrc-lmb.cam.ac.uk): Based solely on experimentally determined 3D structures.

SMART (http://smart.embl-heidelberg.de): Domain families are defined from manually curated multiple alignments, and represented using hidden Markov models.

SUPERFAMILY (http://supfam.cs.bris.ac.uk): Bioinformatical assignment of sequences to SCOP domains using hidden Markov models trained on the sequences of domains in SCOP.

- Discuss approaches to handle limited sampling of species with completely sequenced genomes. How can one draw general conclusions or test the robustness of the results? Apply, e.g., to the observed frequency of domain architectures that have emerged multiple times independently in a given dataset.

- Describe the principle of preferential attachment for evolving networks. In what protein domain-related contexts does this seem to model the evolutionary process, and what distribution of node degrees does it produce?

- What protein properties correlate with domain versatility? Can the versatility of a domain be different in different species (groups) and change over evolutionary time?

- What protein domain-related properties differ between prokaryotes and eukaryotes?

References

1. Andreeva A, Howorth D, Chandonia JM, Brenner SE, Hubbard TJ, Chothia C and Murzin AG. (2008) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 36(Database issue):D419-425.
2. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J and Orengo CA. (2009) The CATH classification revisited - architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 37(Database issue):D310-314.
3. Wilson D, Pethica R, Zhou Y, Talbot C, Vogel C, Madera M, Chothia C and Gough J. (2009) SUPERFAMILY - sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37(Database issue):D380-386.
4. Lees J, Yeats C, Redfern O, Clegg A and Orengo C. (2010) Gene3D: merging structure and function for a Thousand genomes. Nucleic Acids Res. 38(1):D296-D300.
5. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunesekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR and Bateman A. (2010) The Pfam protein families database. Nucleic Acids Research, Database Issue 38:D211-222.
6. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJ, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH and Yeats C. (2009) InterPro: the integrative protein signature database. Nucleic Acids Res. 37(Database issue):D211-5.
7. Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N and Bryant SH. (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 37(Database issue):D205-210.
8. Letunic I, Doerks T and Bork P. (2009) SMART 6: recent updates and new developments. Nucleic Acids Res. 37(Database issue):D229-232.
9. Bru C, Courcelle E, Carrère S, Beausse Y, Dalmar S and Kahn D. (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33(Database issue):D212-215.
10. UniProt Consortium. (2010) The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 38(Database issue):D142-148.
11. Heger A, Wilton CA, Sivakumar A and Holm L. (2005) ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res. 33(Database issue):D188-191.
12. Kummerfeld SK and Teichmann SA. (2009) Protein domain organisation: adding order. BMC Bioinformatics 10(39).
13. Weiner J 3rd, Moore AD and Bornberg-Bauer E. (2008) Just how versatile are domains? BMC Evolutionary Biology 8(285).
14. del Carmen Orozco-Mosqueda M, Altamirano-Hernandez J, Farias-Rodriguez R, Valencia-Cantero E and Santoyo G. (2009) Homologous recombination and dynamics of rhizobial genomes. Research in Microbiology 160(10):733-741.
15. Heyer WD, Ehmsen KT and Liu J. (2010) Regulation of Homologous Recombination in Eukaryotes. Annu. Rev. Genet. 44:113-139.
16. Brissett NC and Doherty AJ. (2009) Repairing DNA double-strand breaks by the prokaryotic non-homologous end-joining pathway. Biochemical Society Transactions 37:539-545.
17. van Rijk A and Bloemendal H. (2003) Molecular mechanisms of exon shuffling: illegitimate recombination. Genetica 118:245-249.
18. Feschotte C and Pritham EJ. (2007) DNA transposons and the evolution of eukaryotic genomes. Annu Rev Genet. 41:331-368.
19. Cordaux R and Batzer MA. (2009) The impact of retrotransposons on human genome evolution. Nature Reviews Genetics 10:691-703.
20. Gogvadze E and Buzdin A. (2009) Retroelements and their impact on genome evolution and functioning. Cell Mol Life Sci. 66(23):3727-3742.
21. Patthy L. (2003) Modular assembly of genes and the evolution of new functions. Genetica 118(2-3):217-231.
22. Liu M and Grigoriev A. (2004) Protein domains correlate strongly with exons in multiple eukaryotic genomes - evidence of exon shuffling? Trends Genet. 20(9):399-403.
23. Buljan M, Frankish A and Bateman A. (2010) Quantifying the mechanisms of domain gain in animal proteins. Genome Biol. 11(7):R74.
24. Weiner J 3rd, Beaussart F and Bornberg-Bauer E. (2006) Domain deletions and substitutions in the modular protein evolution. FEBS Journal 273:2037-2047.
25. Schmidt EE and Davies CJ. (2007) The origins of polypeptide domains. Bioessays 29(3):262-270.
26. Huynen MA and van Nimwegen E. (1998) The Frequency Distribution of Gene Family Sizes in Complete Genomes. Mol. Biol. Evol. 15(5):583-589.
27. Qian J, Luscombe NM and Gerstein M. (2001) Protein Family and Fold Occurrence in Genomes: Power-law Behaviour and Evolutionary Model. J. Mol. Biol. 313:673-681.
28. Luscombe NM, Qian J, Zhang Z, Johnson T and Gerstein M. (2002) The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol 3:RESEARCH0040.
29. Apic G, Gough J and Teichmann SA. (2001) Domain Combinations in Archaeal, Eubacterial and Eukaryotic Proteomes. J. Mol. Biol. 310:311-325.
30. Apic G, Huber W and Teichmann SA. (2003) Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. Journal of Structural and Functional Genomics 4:67-78.
31. Vogel C, Berzuini C, Bashton M, Gough J and Teichmann SA. (2004) Supra-domains: Evolutionary Units Larger than Single Protein Domains. J. Mol. Biol. 336:809-823.
32. Karev GP, Wolf YI, Rzhetsky AY, Berezovskaya FS and Koonin EV. (2002) Birth and death of protein domains: a simple model of evolution explains power law behavior. BMC Evol Biol. 2(1):18.
33. Barabasi AL and Albert R. (1999) Emergence of scaling in random networks. Science 286(5439):509-512.
34. Wuchty S. (2001) Scale-free Behavior in Protein Domain Networks. Mol. Biol. Evol. 18(9):1694-1702.
35. Rzhetsky A and Gomez SM. (2001) Birth of scale-free molecular networks and the number of distinct DNA and protein domains per genome. Bioinformatics 17(10):988-996.
36. Li L, Alderson D, Tanaka R, Doyle JC and Willinger W. (2005) Towards a Theory of Scale-Free Graphs: Definition, Properties, and Implications. Internet Mathematics 2(4):431-523.
37. Kuznetsov V, Pickalov V, Senko O and Knott G. (2002) Analysis of the evolving proteomes: Predictions of the number of protein domains in nature and the number of genes in eukaryotic organisms. J. Biol. Syst. 10(4):381-407.
38. Koonin EV, Wolf YI and Karev GP. (2002) The structure of the protein universe and genome evolution. Nature 420:218-223.
39. Yanai I, Camacho CJ and DeLisi C. (2000) Predictions of Gene Family Distributions in Microbial Genomes: Evolution by Gene Duplication and Modification. Phys. Rev. Let. 85(12):2641-2644.
40. van Nimwegen E. (2005) Scaling laws in the functional content of genomes. Annu. Rev. Biochem. 74:867-900.
41. Ranea JAG, Buchan DWA, Thornton JM and Orengo CA. (2004) Evolution of Protein Superfamilies and Bacterial Genome Size. J. Mol. Biol. 336:871-887.
42. Ranea JAG, Sillero A, Thornton JM and Orengo CA. (2006) Protein superfamily evolution and the last universal common ancestor (LUCA). Journal of Molecular Evolution 63(4):513-525.
43. Chothia C and Gough J. (2009) Genomic and structural aspects of protein evolution. Biochem. J. 419:15-28.
44. Ekman D, Bjorklund AK and Elofsson A. (2007) Quantification of the Elevated Rate of Domain Rearrangements in Metazoa. J. Mol. Biol. 372:1337-1348.
45. Itoh M, Nacher JC, Kuma K, Goto S and Kanehisa M. (2007) Evolutionary history and functional implications of protein domains and their combinations in eukaryotes. Genome Biol. 8(6):R121.
46. Przytycka T, Davis G, Song N and Durand D. (2006) Graph theoretical insights into evolution of multidomain proteins. J Comput Biol. 13(2):351-363.
47. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO and Eisenberg D. (1999) Detecting protein function and protein-protein interactions from genome sequences. Science 285(5428):751-753.
48. Basu MK, Carmel L, Rogozin IB and Koonin EV. (2008) Evolution of protein domain promiscuity in eukaryotes. Genome Res. 18:449-461.
49. Basu MK, Poliakov E and Rogozin IB. (2009) Domain mobility in proteins: functional and evolutionary implications. Briefings in Bioinformatics 10(3):205-216.
50. Bashton M and Chothia C. (2002) The Geometry of Domain Combination in Proteins. J. Mol. Biol. 315:927-939.
51. Gough J. (2005) Convergent evolution of domain architectures (is rare). Bioinformatics 21(8):1464-1471.
52. Forslund K, Hollich V, Henricson A and Sonnhammer ELL. (2008) Domain Tree Based Analysis of Protein Architecture Evolution. MBE 25:254-264.
53. Brivanlou AH and Darnell JE. (2002) Signal Transduction and the Control of Gene Expression. Science 295(5556):813-818.
54. Weiner J 3rd and Bornberg-Bauer E. (2006) Evolution of Circular Permutations in Multidomain Proteins. Mol. Biol. Evol.
23(4):734743.
55. Tordai H, Nagy A, Farkas K, Banyai L, Patthy
L. (2005) Modules, multidomain proteins and
organismic complexity. FEBS J 272
(19):50645078.
56. Vogel C, Teichmann SA and Pereira-Leal J.
(2005) The Relationship Between Domain
Duplication and Recombination. J. Mol. Biol.
346:355365.
57. Bjorklund AK, Ekman D, Light S, Frey-Skott J
and Elofsson A. (2005) Domain Rearrangements in Protein Evolution. J. Mol. Biol.
353:911923.

58. Buljan M and Bateman A. (2009) The


evolution of protein domain families. Biochem.
Soc. Trans. 37:751755.
59. Bjorklund AK, Ekman D and Elofsson A.
(2006) Expansion of Protein Domain Repeats.
PLoS Comput Biol 2(8):114.
60. Doolittle RD and Bork P (1993) Evolutionary
mobile modules in proteins. Scient Am
Oct:3440.
K, Ekman D,
61. Moore AD, Bjorklund A
Bornberg-Bauer E and Elofsson A. (2008)
Arrangements in the modular evolution
of proteins. Trends Biochem Sci. 33
(9):444151.
62. Farris JS. (1977). Phylogenetic analysis under
Dollo s Law. Systematic Zoology 26: 7788.
63. Snel B, Bork P and Huynen M. (2000)
Genome evolution. Gene fusion versus gene
fission. Trends Genet. 16(1):911.
64. Kummerfeld SK and Teichmann SA. (2005)
Relative rates of gene fusion and fission in
multi-domain proteins. Trends in Genetics 21
(1):2530.
65. Fong JH, Geer LY, Panchenko AR and Bryant
SH. (2007) Modeling the Evolution of Protein
Domain Architectures Using Maximum Parsimony. J Mol Biol. 366(1):307315.

Chapter 9
Estimating Recombination Rates from Genetic
Variation in Humans
Adam Auton and Gil McVean
Abstract
Recombination acts to shuffle the existing genetic variation within a population, leading to various
approaches for detecting its action and estimating the rate at which it occurs. Here, we discuss the principal
methodological and analytical approaches taken to understanding the distribution of recombination across
the human genome. We first discuss the detection of recent crossover events in both well-characterised
pedigrees and larger populations with extensive recent shared ancestry. We then describe approaches for
learning about the fine-scale structure of recombination rate variation from patterns of genetic variation in
unrelated individuals. Finally, we show how related approaches using individuals of admixed ancestry can
provide an alternative approach to analysing recombination. Approaches differ not only in the statistical
methods used, but also in the resolution of inference, the timescale over which recombination events
are detected, and the extent to which inter-individual variation can be identified.
Key words: Recombination, Pedigree analysis, Linkage disequilibrium, Admixture

1. Introduction
Genetic recombination is of fundamental importance not only in the
generation of gametes within eukaryotes, but also in the process of
evolution. Specifically, while mutation provides a mechanism by
which novel variants are generated, it is recombination that allows
new combinations of variants to be exposed to natural selection.
Despite this importance, it is only recently that the key mechanisms
by which recombination is distributed along the human genome have
begun to be understood. For example, while it has been known for
some time that recombination rates vary at the broad scale (1, 2),
recent advances in experimental and statistical techniques have
revealed a complex landscape of recombination at the fine scale as
well (3–5). In fact, we now know that the majority of recombination
occurs in localized regions of roughly 2 kb in width (6, 7), where the
recombination rate can be thousands of times that of the surrounding
sequence. These recombination hotspots are a ubiquitous feature of
the human genome, with at least 30,000 identified by statistical
methods (8). Understanding the processes that lead to the formation
of hotspots has led to important discoveries about the biology and
evolution of meiotic recombination (9–12).

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_9,
© Springer Science+Business Media, LLC 2012
Knowledge of the distribution of recombination across the
human genome has, beyond the biological significance, some practical importance. For example, local recombination rates are used in
linkage mapping, imputation-aided association studies (13, 14),
and admixture analysis (15). Recombination is a well-known confounder in the analysis of signals of natural selection (16, 17) and
determines the ability to fine-map signals of association. Furthermore, a number of medical conditions are directly associated with
incorrect resolution of recombination events (18, 19).
In this chapter, we describe the various approaches that can be
taken to characterise the genetic map of humans by studying patterns of genetic variation among individuals, each with its own
strengths, limitations, and challenges. We aim to give a brief overview of the key insights underlying each approach and the statistical
methods used to extract the relevant signal. In addition, we aim to
characterise the resolution of different approaches and indicate how
they are sensitive to recombination events happening over different
timescales, a factor that can be important given the inter-individual
variation in recombination (10, 20) and its rapid evolution (7).
Finally, it should be noted that other methods for characterising
recombination rates exist. For example, one of the most powerful
has been the analysis of crossover events in sperm, which led to the
initial characterisation of hotspots (6, 21) and the discovery of
hotspot polymorphism (22, 23) among others. However, because
this approach has focused on the characterisation of specific hotspots and is currently impracticable for the large-scale analysis of
whole chromosomes, we do not discuss it further. Also, even
though there are many forms of recombination (including gene
conversion and non-allelic events), we only consider allelic crossing-over, and will use the term recombination as synonymous with
this process.

2. Pedigree Analysis
The first whole-genome measurements of recombination in
humans were obtained in the 1980s by using individuals with a
known ancestral relationship to track the inheritance of genetic
alleles through the genealogical tree or pedigree (24). To give
an example of how transmission of alleles from one generation to
the next is informative about recombination, consider the simple

Fig. 1. (a) Transmission of alleles in a single family quartet. In this diagram, a recombination event has occurred during the
transmission from the mother to child 2, as indicated by the line shading. In practice, only the genotypes are observed,
and while it remains possible to determine that a recombination event has happened, it is not possible to resolve in which
individual it occurred without additional data. (b) An example of a simple pedigree. In each non-baseline generation,
each parent can have at most one mate, and only one parent can have ancestry within the pedigree. Individuals without
ancestry within the pedigree are indicated by shaded shapes. In this example, all individuals have been genotyped at two
bi-allelic sites.

pedigree (a quartet of two parents and two children) in Fig. 1a.


Even though the haplotype phase of the alleles in the mother is not
known, we can infer that she must have had at least one recombination event between the markers to generate both genotypes in
the children (though note that we do not know which child has the
recombinant haplotype).
A single quartet can be used to detect the 25–40 crossover
events that are expected to occur in a single meiosis (25). However,
the locations of these events are scattered over the whole genome,
and hence a single meiosis is unlikely to provide much information
regarding the recombination rate in a given region. In order to
obtain a reliable measure of the recombination rate over a given
interval, it is necessary to observe and localize multiple events,
which requires information from many families and/or generations. For such larger pedigrees, recombination events could be
detected by dividing the data into independent quartets and treating each separately. While this would be a valid approach, it is
generally inefficient as large amounts of information can be gained
by considering more of the pedigree simultaneously.
In order to perform inference of the recombination rate within
a pedigree, it is desirable to calculate the likelihood of the data as a
function of the recombination rate. In principle, this calculation can
be performed by exhaustively considering all possible haplotypes
within the pedigree, although this is impractical for all but the
smallest datasets. Elston and Stewart first devised an algorithm
that allows practical calculation of the likelihood in 1971 (26) for
the case of a simple pedigree which consists of a single pair of
initial founders and no consanguineous unions (Fig. 1b and Box 1).

If the dataset consists of S bi-allelic loci and M non-founder individuals, then the calculation can be performed in at most O(M × 2^(6S))
operations (27), meaning that the Elston–Stewart algorithm was
suitable for large pedigrees, but with relatively few loci.

Box 1
The Elston–Stewart Algorithm
Although first described for use in disease linkage studies, the
Elston–Stewart algorithm allows efficient calculation of the likelihood of a given recombination rate from large pedigrees. In
order for the assumptions of the algorithm to be satisfied, the
pedigree must start with a single founder nuclear family, with
every other nuclear family containing exactly one parent with
ancestry within the pedigree and one parent with no ancestry
within the pedigree. There can be no multiple marriages within
the pedigree, and no consanguineous unions.
The Elston–Stewart algorithm works by summing over all
possible data configurations that are compatible with the inheritance structure defined by the pedigree. When using genotype
data to estimate recombination rates, this means summing over
the possible haplotype configurations that are consistent with
the observed genotypes as they are transmitted from parents to
offspring.
We wish to compute the likelihood as a function of the
recombination rate, L(R), in the absence of disease data. Let
the ith individual have a set of compatible haplotype pairs Hi
(i.e. all possible pairs of haplotypes that are consistent with the
individual's genotype data). For n individuals in the pedigree
and a given recombination rate, R, the likelihood can, in a very
general way, be written as

$$L(R) = \sum_{H_1} \cdots \sum_{H_n} \prod_{\{k,l,m\}} \Pr(H_m \mid H_k, H_l, R).$$

In the above equation, {k, l, m} defines the set of all parent–offspring trios within the pedigree. The transmission probability
Pr(Hm|Hk, Hl, R) represents the probability that parents with
haplotype pairs Hk and Hl produce a child with the haplotype
pair Hm, given the recombination rate R. The insight of the
Elston–Stewart algorithm was to note that this computation can
be done efficiently given the restrictions on the pedigree structure described above. Given this likelihood, the recombination
rate can be estimated by, say, finding the recombination rate that
maximizes the likelihood.
As an example, consider the simple two-generation pedigree
shown in Fig. 1b. This pedigree consists of one family quartet and
a trio family with one parent having ancestry within the trio.
Consider the trio family first. In this family, there are two possible
haplotype configurations (arising from the indeterminate phase of
the heterozygotic sites in individual 4). Without knowing the
phase of the parents, it is not clear if the child has inherited a
recombinant type or not.
Now consider the quartet family. As there are three
individuals with ambiguous phase in the quartet, there are
2^3 = 8 possible haplotype configurations. However, given
a haplotype configuration for the quartet, the haplotype configuration of the trio is also determined, and the probability of the
whole pedigree can be calculated by taking the product of the
transmission probabilities.
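To make the Box 1 computation concrete, here is a deliberately tiny brute-force sketch for a single quartet genotyped at two bi-allelic sites. It is our own illustration rather than code from the chapter, and the function names and example genotypes are illustrative: a heterozygous mother whose two children jointly require a recombinant maternal gamete.

```python
from itertools import product

def haplotype_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) consistent with an unphased
    two-site genotype, given as a tuple of allele pairs, e.g. ((0, 1), (0, 1))."""
    pairs = []
    for h1 in product(*[sorted(set(site)) for site in genotype]):
        # The second haplotype carries the remaining allele at each site.
        h2 = tuple(a + b - x for (a, b), x in zip(genotype, h1))
        pairs.append((h1, h2))
    return pairs

def gamete_prob(parent, gamete, p):
    """Probability that a parent with phased haplotypes `parent` transmits
    `gamete`, with recombination probability p between the two sites."""
    total = 0.0
    for c1, c2 in product((0, 1), repeat=2):  # chromosome of origin per site
        if parent[c1][0] == gamete[0] and parent[c2][1] == gamete[1]:
            total += 0.5 * (p if c1 != c2 else 1.0 - p)
    return total

def quartet_likelihood(gm, gf, children, p):
    """L(p): average over parental phasings of the probability of producing
    every child's genotype -- the Box 1 sum, specialised to one quartet."""
    hm_pairs, hf_pairs = haplotype_pairs(gm), haplotype_pairs(gf)
    total = 0.0
    for hm in hm_pairs:
        for hf in hf_pairs:
            term = 1.0
            for gc in children:
                term *= sum(gamete_prob(hm, cm, p) * gamete_prob(hf, cf, p)
                            for cm, cf in haplotype_pairs(gc))
            total += term
    return total / (len(hm_pairs) * len(hf_pairs))

# Mother 0/1-0/1, father 1/1-1/1, children 1/1-1/1 and 1/1-0/1: whichever
# phase the mother has, one of the two maternal gametes must be recombinant.
gm, gf = ((0, 1), (0, 1)), ((1, 1), (1, 1))
children = [((1, 1), (1, 1)), ((1, 1), (0, 1))]
print(quartet_likelihood(gm, gf, children, 0.5))  # maximised at p = 0.5
print(quartet_likelihood(gm, gf, children, 0.0))  # zero without recombination
```

For this toy dataset the likelihood works out to 0.25 p(1 − p), so it vanishes at p = 0: as the text notes, a single quartet can demonstrate that a recombination occurred even when the recombinant child cannot be identified.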

In an attempt to solve the computational limitations of the
Elston–Stewart algorithm, Lander and Green proposed a new
approach in 1987 (27). The Lander–Green algorithm redefines
the likelihood so that summations are performed over loci rather
than individuals (Box 2). The algorithm considers all individuals
simultaneously on a locus-by-locus basis, treating the inheritance
pattern along the genome as a Hidden Markov process, with transitions between states caused by recombination events. This algorithm
scales as O(S × 2^(2M)), so is more suitable for datasets consisting of
many sites, but fewer meioses, although subsequent work has
reduced the computational burden of larger pedigrees to some
extent (28).

Box 2
The Lander–Green Algorithm
The Lander–Green algorithm calculates the likelihood of
pedigree data using a commonly used statistical model known
as a Hidden Markov Model (HMM). To describe this model, let
Xj denote the genotypes of all individuals within the pedigree at
site j. The genotypes of children within the pedigree are determined by the alleles transmitted from the parents, and this
information is represented in an inheritance vector, which
records which alleles are transmitted from parent to child.
As an example, consider the pedigree in Fig. 1a. At the first
site, the genotype vector is X0 = ({1,0}, {1,1}, {1,1}, {1,1}),
where entries in curly brackets represent the genotypes of the
mother, father, and two children, respectively. The inheritance
vector for the children is I0 = ({0,1}, {0,0}), with 0 indicating
that the allele from the first parental chromosome was inherited,
and 1 indicating that the allele from the second chromosome was
inherited. In this example, child 1 inherited the allele from the
first maternal chromosome, and the allele from the second paternal chromosome. Conversely, child 2 inherited the alleles from
the first chromosome of both parents. Following this logic to the
second site would give us X1 = ({1,0}, {1,1}, {1,1}, {0,1}) and
I1 = ({0,1}, {1,0}). Given the inheritance vector at a site, we can
calculate the probability of obtaining the observed genotypes,
Pr(Xj|Ij).
In the absence of recombination, there would be a single
inheritance vector for all sites in our data. However, recombination between sites causes the inheritance vector to transition to a
new state as we move from site to site. The probability of
transitioning from one inheritance vector at one site to a different inheritance vector at the next site depends on the probability
of recombination between sites, p_r. Assuming the state of the
inheritance vector at site j + 1 only depends on the state at site j,
the probability of transitioning from one vector to the next is
written as Pr(I_{j+1} | I_j). For a single meiosis, there are only two
possible inheritance vectors (either the parent's first allele is
transmitted or the second is). Hence, the probability of transitioning to a new inheritance vector is:

$$\Pr(I_{j+1} \mid I_j) = \begin{cases} 1 - p_r & \text{if } I_{j+1} = I_j \\ p_r & \text{otherwise.} \end{cases}$$

For a pedigree containing two meioses (such as a family
trio), the possible inheritance vectors can be separated by
R = 0, 1, or 2 recombination events. In this case, the transition
probabilities are:

$$\Pr(I_{j+1} \mid I_j) = \begin{cases} (1 - p_r)^2 & \text{if } R = 0 \\ (1 - p_r)\, p_r & \text{if } R = 1 \\ p_r^2 & \text{if } R = 2. \end{cases}$$

A recursive formula can be used to calculate the transition
probabilities between inheritance vectors for any number of
meioses, although the number of possible transitions becomes
quite large for more than a few meioses.
In practice, only the genotypes are observed in the data; the
inheritance vector at each site is unknown and hence treated as a
hidden state and has to be summed over when calculating the
likelihood. For m sites, the likelihood can be written in a general
form as
$$L = \sum_{I_1} \cdots \sum_{I_m} \Pr(I_1) \prod_{i=2}^{m} \Pr(I_i \mid I_{i-1}) \prod_{i=1}^{m} \Pr(X_i \mid I_i).$$

However, using standard HMM methods (specifically, the
forward part of the forward–backward algorithm), the above
calculation can be performed efficiently and the recombination
rate estimated.
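A minimal sketch of this computation follows. It is our own illustration (not code from the chapter): the transition probability is factorised over meioses as in Box 2, and a toy emission table stands in for Pr(Xj|Ij), which in a real pedigree would be computed from the genotype data.

```python
from itertools import product

def transition_prob(i_from, i_to, p):
    """Pr(I_{j+1} | I_j): each meiosis switches its chromosome of origin
    independently, with recombination probability p between the sites."""
    prob = 1.0
    for a, b in zip(i_from, i_to):
        prob *= p if a != b else 1.0 - p
    return prob

def forward_likelihood(emissions, p):
    """Forward algorithm over inheritance vectors.
    emissions[j][I] = Pr(X_j | I) for site j and inheritance vector I."""
    n_meioses = len(next(iter(emissions[0])))
    states = list(product((0, 1), repeat=n_meioses))
    # Uniform prior over inheritance vectors at the first site.
    alpha = {I: emissions[0][I] / len(states) for I in states}
    for em in emissions[1:]:
        alpha = {I: em[I] * sum(alpha[J] * transition_prob(J, I, p)
                                for J in states)
                 for I in states}
    return sum(alpha.values())

# One meiosis, two sites; the data pin the inheritance vector to (0,) at the
# first site and (1,) at the second, so a recombination is forced between them.
em = [{(0,): 1.0, (1,): 0.0},
      {(0,): 0.0, (1,): 1.0}]
print(forward_likelihood(em, 0.1))  # 0.05 = 0.5 * p: one forced switch
```

Because each meiosis recombines independently, the transition matrix for M meioses never has to be enumerated by R; the product over meioses reproduces the two-meiosis case shown in Box 2 automatically.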
The Lander–Green algorithm (and variants thereof) was used
to generate many of the early large-scale genetic maps (1, 2).
The genome average recombination rate was measured to be
1.13 cM/Mb, although considerable broad-scale variation was
observed, with recombination rates observed as high as 3 cM/Mb
in certain regions and as low as 0.1 cM/Mb in others (2).
The resolution of pedigree studies is determined by both the
number of available families contributing informative meioses and
the number of markers that allow the location of recombination
events to be determined. For a number of years, the highest resolution achieved by pedigree studies remained at the megabase scale.
However, a 2010 study by deCODE Genetics significantly
improved the resolution by genotyping thousands of individuals
at over 300,000 SNPs (20). In contrast to traditional pedigree
studies, this new study did not genotype all members of a given
family, but only genotyped a single parent and child from each
family. As described above, at least four individuals are required
within a single pedigree in order to detect recombination events
from genotype data. However, if haplotype phase can be assigned
unambiguously, recombination events can be determined even by
considering a single parent and child (Fig. 1a).
The key innovation of the 2010 deCODE study was to exploit
the high degree of relatedness that exists among members of the
Icelandic population in order to phase the samples. In human
populations, it is often possible to identify regions of an individual's
genome that are very similar, if not identical, to another individual
in the population. Such a pair of individuals is said to have a region
of the genome that has identity by state (IBS). In deCODE's study,
individuals were collected from the Icelandic population and were
often from relatively closely related families. In this situation, a high
level of IBS between two individuals is indicative of a shared recent

common ancestor, and the two individuals share a common haplotype. In this case, the shared region of the genome is said to have
identity by descent (IBD).
Long-range IBD can be used to obtain highly accurate phasing
of the genotyped individuals (29). First, an individual is selected for
phasing, known as the proband. If the genotypes of both parents of
this individual were known, it would be relatively trivial to phase the
proband individual by identifying which allele was inherited from
each parent (with the exception that this is not possible at those
sites where the child and both parents are heterozygous).
For example, in Fig. 1a, it is possible to identify the haplotypes
transmitted from each parent to child 2. However, in the deCODE
study, the genotypes of either one or both of the parents were
generally not known. To overcome this, the authors divided
the genome of the proband into sections, and for each section
identified a separate pair of individuals within the study showing
high levels of relatedness, or IBD, with the proband. The authors
were able to use the selected individuals as surrogate parents, and
phase the proband as if the parents were known. By exploiting the
relatedness between individuals in the study, the authors were able
to obtain near-perfect phasing for thousands of individuals over
many megabases of the genome (29). Furthermore, because it is
possible to select many surrogate parents on each side, the fraction
of sites that can be phased unambiguously is much higher because
only one of the surrogate parents on each side needs to be homozygous in order to determine transmission.
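The per-site logic of phasing a child against (surrogate) parental genotypes can be sketched as follows. This is a simplification of the long-range phasing actually used (29), and the function name is ours:

```python
def phase_site(child, mother, father):
    """Return the child's (maternal, paternal) alleles at one bi-allelic
    site, or None when phase cannot be resolved because child and both
    parents are heterozygous. Genotypes are unordered pairs like (0, 1)."""
    if child[0] == child[1]:        # homozygous child: trivially phased
        return child
    if mother[0] == mother[1]:      # a homozygous mother fixes the phase
        return (mother[0], child[0] + child[1] - mother[0])
    if father[0] == father[1]:      # a homozygous father fixes the phase
        return (child[0] + child[1] - father[0], father[0])
    return None                     # all three heterozygous: uninformative

print(phase_site((0, 1), (0, 0), (0, 1)))  # (0, 1): maternal 0, paternal 1
print(phase_site((0, 1), (0, 1), (0, 1)))  # None: everyone heterozygous
```

The `None` case is exactly why having several surrogate parents per side helps: the site is phasable as soon as any one surrogate on that side is homozygous.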
Using the above method, the 2010 deCODE study was able to
obtain highly accurate phasing for parentoffspring pairs yielding a
total of 15,257 meioses. This number of meioses represented an
order-of-magnitude increase over previous studies, and in combination with
the increased marker density, the resolution to detect recombination events was improved from ~5 Mb in previous studies to
approximately 10 kb.
An advantage of large-scale pedigree studies is that detected
recombination events can generally be assigned to a specific individual, and it is therefore possible to identify differences in recombination rate between groups of individuals. For example, the 2010
deCODE study compared recombination rates in males and
females and revealed that approximately 15% of hotspots appear
to be sex specific (20). The mechanism of sex-specific hotspot
formation is currently unknown.
Despite the success of pedigree studies, their large-scale nature
means that they cannot be practically applied in many cases; for
example, in many non-human species, the cost may be prohibitive,
and even with thousands of meioses the resolution remains relatively low. Furthermore, the resulting recombination rate estimates
are obtained by averaging across many individuals, as each family
can only provide evidence of a handful of recombination events.

3. Linkage Disequilibrium Based Approaches
An alternative source of information regarding recombination can
be found in samples of genetic data taken from unrelated individuals, sampled randomly from a population, and genotyped or
sequenced over some or all of the genome.
Due to the shared ancestral history between individuals, the
alleles at nearby loci are often correlated: knowing the allele at a
given locus is often informative of the allele at a second, nearby
locus. This non-random association of alleles is known as Linkage
Disequilibrium (LD). Historically, information about LD has been
summarised through the use of two-locus measures of LD, such as
D′ (30) and r² (31). Consider a pair of loci with alleles A/a at
the first and B/b at the second. If fAB is the frequency of haplotypes
with alleles A and B, fA is the frequency of haplotypes with the A
allele at the first locus, fB is the frequency of haplotypes with the
B allele at the second locus, and so on, then these statistics can be
calculated as:

$$D = f_{AB} - f_A f_B$$

$$r^2 = \frac{D^2}{f_A f_B f_a f_b}$$

$$D' = \begin{cases} \dfrac{D}{\min(f_A f_b,\; f_a f_B)} & \text{if } D \geq 0 \\[1ex] \dfrac{D}{\min(f_A f_B,\; f_a f_b)} & \text{if } D < 0. \end{cases}$$

The D′ statistic is a measure of LD defined as the difference
between the frequency of a two-locus haplotype and the product of
the frequencies of the component alleles, divided by the most extreme possible value
given the marginal allele frequencies. Alternatively, the r² statistic is
the squared correlation coefficient of gene frequencies between
two loci.
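These definitions translate directly into code. The following is a straightforward sketch (the function name is ours), taking the AB haplotype frequency and the two marginal allele frequencies:

```python
def ld_stats(f_AB, f_A, f_B):
    """Two-locus LD summaries D, r^2, and D' from the AB haplotype
    frequency and marginal allele frequencies; assumes both loci are
    polymorphic (0 < f_A < 1 and 0 < f_B < 1)."""
    f_a, f_b = 1.0 - f_A, 1.0 - f_B
    D = f_AB - f_A * f_B
    r2 = D * D / (f_A * f_B * f_a * f_b)
    # Normalise D by its most extreme possible value given the margins.
    d_max = min(f_A * f_b, f_a * f_B) if D >= 0 else min(f_A * f_B, f_a * f_b)
    d_prime = D / d_max if d_max > 0 else 0.0
    return D, r2, d_prime

# Two loci in complete LD: only the AB and ab haplotypes segregate.
print(ld_stats(0.5, 0.5, 0.5))   # (0.25, 1.0, 1.0)
```

At linkage equilibrium (f_AB = f_A f_B) all three statistics are zero, while complete LD gives r² = D′ = 1, matching the informal relationship to recombination described above.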
These very simple statistics can, at least informally, be related to
the underlying recombination rate, as recombination events tend to
break down the amount of LD between loci, leading to lower values
of r² and D′. High values of r² and D′ are typically indicative of low
levels of recombination, and vice versa. However, the relationship
is not perfect (e.g. see ref. 32) and because these statistics are also
influenced by evolutionary processes, such as mutation, selection,
genetic drift, and demographic parameters (33), it is not possible to
use their values to estimate the recombination rate in a reliable fashion.
There are, however, other measures of LD that relate more
directly to recombination. The simplest such approach is known
as the four-gamete test (34), which detects recombination events
by locating pairs of segregating sites that cannot have arisen without either recombination or a repeat mutation (Box 3).

Box 3
The Four-Gamete Test
The four-gamete test aims to identify patterns of population genetic data that are indicative of historical recombination events. In the absence of recombination and reverse
mutation, four haplotype sequences with two bi-allelic sites can be related by the five
possible ancestral histories shown below. Each possible ancestral history corresponds to a
specific haplotype configuration. Note that the labelling of which allele is the mutant is
arbitrary, as is the ordering of sites, and hence all possible haplotype configurations
(without recombination) can be classified into one of the configurations shown here.

However, if all four haplotypes are observed in a sample, as shown above, a simple
tree cannot represent the ancestry of the sample. In the absence of reverse mutation, only
recombination could have generated the observed pattern. The four-gamete test calls a
recombination event between sites if this situation is observed.
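The test itself reduces to counting the distinct gametes observed at a pair of sites. A minimal sketch (function names are ours):

```python
def four_gamete_test(haplotypes, i, j):
    """True if sites i and j jointly show all four gametes, which (barring
    repeat mutation) implies at least one recombination between them."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4

def detected_intervals(haplotypes):
    """Adjacent-site intervals in which the four-gamete test fires."""
    n_sites = len(haplotypes[0])
    return [(i, i + 1) for i in range(n_sites - 1)
            if four_gamete_test(haplotypes, i, i + 1)]

# Sites 0 and 1 show all of 00, 01, 10, 11; sites 1 and 2 show only three.
haps = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
print(detected_intervals(haps))  # [(0, 1)]
```

Note this only localises events between sites; more elaborate non-parametric bounds combine such intervals into a minimum number of recombination events.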

The four-gamete test is appropriate in situations where the
possibility of reverse mutation can be discounted. In humans, the
genome is sufficiently large and the mutation rate sufficiently low that
the probability of a single site receiving two mutations is quite small.
However, in organisms where the mutation rate is considerably
higher such as viruses, this assumption is not appropriate.
The relative simplicity of the four-gamete test means that it is
easy to apply to large datasets. In principle, a large sample ensures
greater power to detect recombination events, as there is a greater
chance of sampling a rare haplotype that is indicative of recombination. However, the statistical power of the four-gamete test to
detect recombination is low, and increasing the sample size is

inefficient, as the number of detectable recombination events
increases only with the log of the log (sic) of the sample size (35).

Fig. 2. (a) Example of a coalescent tree for six samples. The topology of the tree indicates
the relatedness between samples, with mutations indicated by circles. (b) An example of
an ancestral recombination graph (ARG) for four samples, with three mutations. There is a
single recombination event, indicated by the splitting of the ancestral lineage of the third
chromosome as it is followed backwards in time.
Aside from the four-gamete test, there are a number of other
non-parametric tests for detecting recombination from population
genetic data, many of which are more powerful and/or sophisticated (for example, see refs. 36, 37). However, these methods
generally only provide a lower bound on the number of recombination events in the history of the sample, and do not inform about
the time or rate at which they occurred.
In order to use population genetic data to learn about the rate
of recombination, it is necessary to use model-based approaches.
Specifically, it is necessary to model the evolutionary process by
which the population genetic data were generated. A commonly
studied model of the evolutionary process is that of the Coalescent.
We briefly introduce the Coalescent here, but the interested reader
is directed towards full introductions to coalescent theory elsewhere (e.g. see refs. 38–40).
Coalescent theory models the evolutionary history of a sample
of population genetic data. In the absence of recombination, the
history of a sample can be described by a genealogical tree structure,
which determines relatedness between samples (Fig. 2a). Variants
within the population originate as mutations that occur on the
branches of trees. Coalescent theory provides a framework that
describes the structure of these trees. For example, the model
describes the rate at which branches join (or coalesce) relative to
the rate at which mutations appear on the branches.

In the presence of recombination, a tree structure is not sufficient to describe the ancestry of a sample as it is possible for the
ancestral history to differ between loci. In this case, the ancestry of
the sample can be represented in the form of a graph known as the
Ancestral Recombination Graph (ARG, Fig. 2b) (41). As with a
coalescent tree, branches in the ARG coalesce and contain mutations. However, the ARG also contains recombination events,
which are represented by a bifurcation of a given branch representing a point in the history in which loci to the left of the recombination event follow a different ancestry to those to the right of the
recombination event. As with the basic coalescent, the shape of a
typical ARG is determined by the relative rate at which mutation,
coalescence, and recombination events occur.
Within the context of the ARG, it is not possible to make
inference of the per-generation recombination rate, r, directly.
Rather, in coalescent theory, the rate of recombination is measured
in terms of the population recombination rate, ρ. The population
recombination rate is related to the per-generation recombination
rate by the formula ρ = 4Ner, where Ne is known as the effective
population size, and depends on a number of factors, such as the
demographic history of the population. In order to infer r from ρ, it
is necessary to obtain an independent estimate of Ne, which can be
achieved by comparison with existing genetic maps or from diversity estimates. In humans, Ne has generally been estimated in the
range of 10,000–18,000 (5, 42).
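As a concrete illustration of the conversion (the numbers below are round figures consistent with the values quoted in this chapter, not estimates from the original study):

```python
def rho_to_r(rho, n_e):
    """Per-generation recombination rate r from the population rate,
    via rho = 4 * Ne * r."""
    return rho / (4.0 * n_e)

def r_to_cm_per_mb(r):
    """Convert a per-bp, per-generation recombination probability to cM/Mb
    (1 cM = 0.01 crossovers per generation; 1 Mb = 1e6 bp)."""
    return r * 100.0 * 1e6

# With Ne = 10,000, a population rate of 4e-4 per bp corresponds to
# r = 1e-8 per bp per generation, i.e. 1 cM/Mb.
r = rho_to_r(4e-4, 10_000)
print(r, r_to_cm_per_mb(r))  # 1e-08 1.0
```

The same ρ therefore implies different per-generation rates under different assumed Ne, which is why an independent Ne estimate is needed.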
Given a specific genealogy, the resulting genetic dataset is
uniquely determined. Furthermore, the probability of obtaining
the genealogy from the coalescent model can be calculated for a
given mutation, coalescence, and recombination rate. Hence, if the
genealogy is known, it is possible to calculate the probability, or
likelihood, of obtaining the observed data.
However, the converse is not true; knowing the genetic dataset
does not uniquely determine the genealogy. Typically, there is no
record of the genealogy of the sample: the genealogy is missing
data. In order to calculate the likelihood of our data, it is therefore
necessary to integrate over all possible genealogies. Unfortunately,
the number of possible genealogies is infinite, and even by restricting
the allowed genealogies to those that conform to the infinite-sites
model, and those with non-trivial recombination events, the number
of genealogies increases at a fantastic rate as the sample size increases.
For example, a dataset with just seven sequences and five SNP sites could have been generated by over 9.1 × 10^16 genealogies, an infeasible number to sum over even using modern supercomputers (43).
It is, therefore, difficult to calculate the likelihood of the data
under the coalescent model. While it is possible to estimate the
likelihood over a range of recombination rates for a single pair of
SNPs, the calculations do not scale with the number of sites, and
hence full likelihood inference is not practical for all but the smallest of datasets. To overcome this problem, the full likelihood calculation
can be replaced with a composite likelihood, in which all pairs of SNPs
are treated as independent of each other (44, 45). If the data at site j
is Xj, the likelihood of a pair of SNPs i and j for a given recombination rate is written as L(ρ|Xi, Xj), where ρ is the population recombination rate. Then, as all SNP pairs are assumed to be independent, the composite likelihood for all SNPs is calculated as the product of the likelihood over all pairs of SNPs within some distance, L, of each other:

    CL(ρ) = ∏_{i, j : |i − j| ≤ L} L(ρ | Xi, Xj).

In practice, the composite likelihood is often calculated within windows of, say, L = 50 neighbouring SNPs, and hence SNPs with large intervening distances do not contribute to the calculation (4, 46). This version of the composite likelihood is known as
the truncated composite likelihood.
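In code, the truncated composite likelihood reduces to a double loop over nearby SNP pairs. A sketch, in which pair_loglik is a hypothetical stand-in for the pre-computed two-locus likelihood lookup table described below:

```python
def truncated_composite_loglik(pair_loglik, num_snps, rho, max_sep=50):
    """Truncated composite log-likelihood: sum pairwise log-likelihoods
    over all SNP pairs (i, j) with |i - j| <= max_sep.
    pair_loglik(i, j, rho) is a hypothetical pairwise log-likelihood,
    in practice read from a pre-computed lookup table."""
    total = 0.0
    for i in range(num_snps):
        for j in range(i + 1, min(i + max_sep + 1, num_snps)):
            total += pair_loglik(i, j, rho)
    return total

def mcle(pair_loglik, num_snps, rho_grid, max_sep=50):
    """Maximum composite-likelihood estimate of rho over a grid of values."""
    return max(rho_grid,
               key=lambda rho: truncated_composite_loglik(
                   pair_loglik, num_snps, rho, max_sep))
```

Because the pairwise likelihoods only need to be computed once per lookup-table entry, evaluating this product over a grid of ρ values is very fast, which is what makes the approach scale to genome-wide data.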
The composite likelihood is a fairly drastic approximation, as
interspersed SNP pairs are clearly not independent of each other.
However, the composite likelihood has some attractive features.
First, the maximum composite-likelihood estimate (MCLE) is
strongly correlated with the maximum full-likelihood estimate (45).
Second, at least for the truncated form of the composite likelihood,
the MCLE is consistent; that is, given enough data, the MCLE
converges on the true value of ρ (47). Third, the likelihood for site pairs can be pre-calculated over a wide range of possible values of ρ
and stored in lookup tables, allowing extremely rapid subsequent
calculation of the composite likelihood. On the negative side, the
composite likelihood does not use all of the information available in
a dataset, and tends to be overly peaked in comparison to the full
likelihood, making inference about the uncertainty in the recombination rate estimate difficult (45).
Nonetheless, the composite likelihood has been deployed with
great success. The ability to calculate the composite likelihood
extremely quickly means that it is possible to estimate recombination rates that are allowed to vary over a given interval. Such an
approach was developed in the LDhat package (4, 46), which uses
an MCMC method to explore possible recombination rate profiles.
In doing so, it is possible to obtain recombination rate estimates
that are comparable in resolution to those obtained via experimental methods such as sperm typing. Furthermore, it is possible
to estimate recombination rates using data from hundreds of
samples with millions of SNPs. This has allowed fine-scale recombination rate estimates to be obtained on a genome-wide scale,
while showing good broad-scale correlation with estimates
obtained from pedigree-based studies (4, 5, 42). Furthermore,
composite likelihood methods (4, 46) can be used to test specific
hypotheses about the presence or absence of recombination
hotspots.

In effect, LD-based studies make use of many (and in some
sense all) of the meioses that have occurred in the history of a
sample since the most recent common ancestor. As such, the number of effective meiotic events can number in the hundreds of thousands, and LD-based studies can achieve resolutions almost
comparable to those achieved by experimental methods (4, 46).
However, the major limitations of LD-based studies are first that
they require a model of population history (typically assumed to be
very simple) and second that the recombination events represent an
average over thousands of generations; hence, it is not possible to
use this information to detect differences in recombination rate
between individuals (or sexes).

4. Admixture
Pedigree and LD-based studies have provided complementary
insights into the genome-wide patterns of recombination. With
the growing amount of available data, these techniques will continue to improve in resolution. However, scope remains for
continued method development. One novel technique, which makes use of individuals with a history of recent genetic admixture, has recently been described (48), providing an additional resource for the measurement of recombination.
The principle of recombination detection via admixture is that
the genomes of admixed individuals are made of a mosaic of genetic
material inherited from differing ancestral populations (Fig. 3).
If the ancestral populations are sufficiently diverged from each
other, it is possible to detect the regions of the admixed genome
that have been inherited from one population or the other. The
break points between ancestral sections represent recombination
events that have occurred since the time of the admixture event.
The ability for admixture techniques to detect recombination
depends on accurate detection of break points between ancestral
haplotypes. In order to achieve this, a statistical model of the
relationship between haplotypes is needed. Such a model is available in the form of the Li and Stephens model, which is widely used in a number of areas of population genetics (49).
The Li and Stephens model is based on the idea that, if a number of haplotypes have already been observed, the next haplotype to be
sampled is likely to look quite similar to those already seen. The new
haplotype could be constructed as a mosaic of sections of the
previously observed haplotypes, allowing some level of mismatch or
mutation. In other words, the new haplotype is constructed by
copying sections of existing haplotypes, and hence traces a path
through the set of existing haplotypes (Box 4). The new haplotype
is modelled using an HMM, in which the hidden state defines which
of the existing haplotypes is being copied.


Fig. 3. Demographic history of admixed populations. The merging of two diverged populations creates an admixed population. The genomes of the resulting individuals are made up of a mosaic of genetic material inherited from each of the ancestral populations.

To use the Li and Stephens model in admixture detection, a set of reference haplotypes from the ancestral populations is needed. The
ancestry of a target individual can be determined for each site in the
genome by using the Li and Stephens model to calculate the probability that the haplotypes within the target individual copy from one
ancestral population rather than another. This method is the basis of
the HapMix algorithm (15), which can be used to obtain fine-scale
ancestry estimates from hundreds of individuals and to localise
breaks in ancestry indicative of recombination events. For example,
an African-American individual may have both African and European
genetic ancestry, and hence a switch in ancestry from African to
European (or vice versa) along a chromosome is likely to reflect a
recombination event that happened within the last 20 generations or
so (50–53).
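The time depth of these events can be made concrete with a small calculation: under a uniform rate of c cM/Mb, ancestry switch points accumulate at roughly g·c/100 per Mb over g generations, so the mean tract length shrinks as 1/g. A back-of-envelope sketch, ignoring chromosome ends and ancestry proportions:

```python
def expected_tract_length_mb(generations, rate_cm_per_mb=1.0):
    """Expected length (in Mb) of an unbroken ancestry segment g
    generations after admixture: with a rate of c cM/Mb, ancestry
    switches occur at about g * c / 100 per Mb, so the mean spacing
    between switches is 100 / (g * c) Mb."""
    return 100.0 / (generations * rate_cm_per_mb)

# Under these simplifying assumptions, tracts average ~5 Mb
# after 20 generations at 1 cM/Mb.
```

This is why recently admixed populations retain long, easily detectable ancestry tracts, while older admixture leaves only short segments.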
While the information regarding the location of an admixture
break point within a single individual can be quite weak, by combining information across multiple admixed individuals it is possible to
construct a genetic map (using methodology similar to that
employed in the analysis of LD data). As each admixed individual
can provide information regarding recombination events spanning a
significant number of generations, the achievable resolution is


Box 4
The Li and Stephens Model
The basic idea of the Li and Stephens model is that, if we have observed a set of haplotypes, the next haplotype we observe is likely to look similar to those we have already observed
due to their shared common ancestry. Suppose we have observed a collection of eight
haplotypes, h1 to h8, as in the diagram below.
h1
h2
h3
h4
h5
h6
h7
h8

h*

The Li and Stephens model considers the next haplotype, h*, given the set of
previously observed haplotypes. This is achieved by assuming that h* is constructed by
copying sections from the previously observed haplotypes, allowing some level of error.
In the diagram, an example of how h* could be constructed from h1 . . . h8 is indicated by
the path traced out by the arrows.
The path through the collection of haplotypes is unknown, and is therefore modelled
using an HMM, where the hidden state is the haplotype being copied from. Given k
haplotypes have been observed so far, the emission probabilities for possible alleles a at site
j in the next haplotype are given by:



    Pr(h*j = a | Xj = x, h1, ..., hk) =
        k/(k + θ) + (1/2)·θ/(k + θ)    if hx,j = a,
        (1/2)·θ/(k + θ)                if hx,j ≠ a,

where Xj defines the haplotype being copied at site j, hx,j is the allele of haplotype x at site j, and θ is the mutation parameter. The above probability captures the idea that a haplotype is more likely to have copied from a similar haplotype than a dissimilar one.
Transitions between hidden states (i.e. the haplotype being copied from) occur with
probability that depends on the recombination distance, ρj, between sites j and j + 1:

    Pr(Xj+1 = x′ | Xj = x) =
        e^(−ρj/k) + (1/k)·(1 − e^(−ρj/k))   if x′ = x,
        (1/k)·(1 − e^(−ρj/k))               otherwise.
Using standard HMM machinery (as for the Lander–Green algorithm), it is possible
to sum over all possible paths, and hence calculate the likelihood of obtaining the new
haplotype, given the set of existing haplotypes.
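The copying process just described can be turned into a working forward algorithm in a few lines. The sketch below is not part of the original chapter; it assumes haploid data and uses the population-scaled recombination distances ρj between adjacent sites:

```python
import math

def li_stephens_loglik(new_hap, haps, theta, rhos):
    """Forward algorithm for the Li and Stephens copying model.
    haps: list of k observed haplotypes (sequences of alleles);
    new_hap: the haplotype h* being modelled; theta: mutation
    parameter; rhos[j]: recombination distance between sites j and j+1."""
    k, sites = len(haps), len(new_hap)

    def emit(x, j):
        # Copying with error: high probability if the alleles match.
        base = 0.5 * theta / (k + theta)
        return k / (k + theta) + base if haps[x][j] == new_hap[j] else base

    # Copying starts from each of the k haplotypes with probability 1/k.
    fwd = [emit(x, 0) / k for x in range(k)]
    for j in range(1, sites):
        stay = math.exp(-rhos[j - 1] / k)
        switch = (1.0 - stay) / k
        total = sum(fwd)
        # Either keep copying the same haplotype, or switch to a
        # uniformly chosen one (possibly the current one again).
        fwd = [(fwd[x] * stay + total * switch) * emit(x, j)
               for x in range(k)]
    return math.log(sum(fwd))
```

Summing over paths in this way costs O(k) per site rather than O(k^2), because the transition probability to every non-copied haplotype is identical.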

potentially higher than that achieved by pedigree studies. In practice,
admixture studies have not yet been performed on the same scale as
the largest pedigree studies, and hence the resolution achieved by
admixture genetic maps to date is similar to that of pedigree studies.
However, admixture detection methods remain attractive, as publicly
available genetic data from unrelated admixed individuals is increasingly common.
However, admixture studies cannot generally determine when
a detected recombination event occurred. Like LD maps, admixture-based genetic maps, therefore, represent an average over a
number of generations (albeit considerably fewer and more recent),
and it is generally not possible to assign recombination events to
specific individuals.

5. Conclusion
Recombination detection methods have evolved rapidly over recent
years. The methods described here differ in terms of the achievable
resolution, the regions of the genome that can be analysed, and the
number of generations that recombination events are measured
over (Table 1). Direct experimental methods such as sperm-typing
continue to provide the highest-resolution insight into rate variation, but experimental challenges limit their widespread application, and they only provide rate estimates within males. LD studies can achieve similar resolution, but only offer rate estimates averaged over thousands of generations and cannot provide substantial information on differences between individuals. Between the two lie the
pedigree and admixture studies, which are today limited largely by
sample size, but which currently provide the best prospects for
detecting and understanding variation among individuals and
populations in both local and global rates of recombination.
In recent years, these methods have led to huge leaps in our
understanding of recombination. It is now accepted that recombination hotspots are a ubiquitous feature of the human genome, but
until a few years ago the mechanisms leading to hotspot formation
were largely unknown. This has started to change with the identification of a short DNA sequence motif found to be highly enriched


Table 1
Summary of described methods for recombination rate measurement, assuming typical parameters of studies to date

Method           | Approximate resolution | Size of analysed region | Approximate number of useful meioses | Generations analysed | Comments
Sperm typing     | ~300 bp–1 kb           | ~200 kb–2.5 Mb          | 700–22,000                           | 1                    | Provides excellent fine-scale, per-generation rate estimates, but experimentally challenging, limited to small regions of the genome, and male specific
Pedigree studies | 10 kb–5 Mb             | Genome wide             | 1,500–15,000                         | 1–10                 | Can obtain genome-wide, per-generation rate estimates for males and females separately, but resolution limited by sample size
LD studies       | 1–5 kb                 | Genome wide             | ~300,000                             | ~10,000              | Fine-scale genome-wide estimates, but estimates represent an average over many generations, and may be biased by population genetic history
Admixture        | 10–40 kb               | Genome wide             | 1,000–20,000                         | ~5–15                | Fine-scale rate estimates can be obtained with moderate sample sizes, but represent an average over a possibly unknown number of generations

in the sequence of hotspots (18). This in turn has led to the identification of a zinc-finger protein, PRDM9, which is suspected to bind to the DNA sequence motif and recruit other proteins that initiate a recombination event (9–12). This new understanding
could only have been gained via the improvements in recombination detection methods described in this chapter.
However, our understanding of recombination is far from
complete, and a number of questions remain. For example, there
is good evidence that recombination rates vary between males and
females, and do so at the fine scale (2, 20). There is also evidence
that recombination rates vary by age (25, 54). It is not known how
these differences between individuals arise. Likewise, there is good
evidence that recombination rates evolve on short timescales
(11, 55, 56), and this is strongly suggestive of powerful selection
forces at work that are yet to be fully elucidated.


6. Questions and Exercises
1. Is it possible to detect recombination events using genotype
data obtained from a single nuclear family trio? Explain your
answer.
2. Write down the haplotype configurations that are consistent
with the data shown in Fig. 1b. Convince yourself that at least
one recombination event is required in the pedigree.
3. Suppose you have sampled the following five haplotypes with
three segregating sites from a population:
Haplotype 1: 011
Haplotype 2: 000
Haplotype 3: 100
Haplotype 4: 010
Haplotype 5: 101
Using the four-gamete test, calculate the minimum number of
recombination events that have occurred in the population
history between sites 1 and 2. How about sites 2 and 3? And
finally, between sites 1 and 3?
4. Suppose an admixture event occurred between two populations
three generations ago. Assuming a recombination rate of
1 cM/Mb, what would the average ancestry track length be
in an individual sampled from the population today? How
about after seven generations?
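The four-gamete test invoked in exercise 3 is simple to implement; a sketch, assuming biallelic sites coded as the characters '0' and '1':

```python
def four_gamete_min_recombinations(haplotypes, site_a, site_b):
    """Four-gamete test: under the infinite-sites model, observing all
    four gametes (00, 01, 10, 11) at a pair of sites requires at least
    one recombination event between them; otherwise the data are
    compatible with no recombination between the two sites."""
    gametes = {(h[site_a], h[site_b]) for h in haplotypes}
    return 1 if len(gametes) == 4 else 0
```

Applying this function to each pair of segregating sites gives the bound asked for in the exercise; combining such pairwise bounds across a region underlies the Rmin statistic of Hudson and Kaplan (34).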
References
1. Broman, K.W., et al., Comprehensive human genetic maps: individual and sex-specific variation in recombination. Am J Hum Genet, 1998. 63(3): p. 861–9.
2. Kong, A., et al., A high-resolution recombination map of the human genome. Nat Genet, 2002. 31(3): p. 241–7.
3. The International HapMap Consortium, A haplotype map of the human genome. Nature, 2005. 437(7063): p. 1299–320.
4. McVean, G.A., et al., The fine-scale structure of recombination rate variation in the human genome. Science, 2004. 304(5670): p. 581–4.
5. Myers, S., et al., A fine-scale map of recombination rates and hotspots across the human genome. Science, 2005. 310(5746): p. 321–4.
6. Jeffreys, A.J., L. Kauppi, and R. Neumann, Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat Genet, 2001. 29(2): p. 217–22.
7. Jeffreys, A.J., et al., Human recombination hotspots hidden in regions of strong marker association. Nat Genet, 2005. 37(6): p. 601–6.
8. Myers, S., et al., The distribution and causes of meiotic recombination in the human genome. Biochem Soc Trans, 2006. 34(Pt 4): p. 526–30.
9. Baudat, F., et al., PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science, 2010. 327(5967): p. 836–40.
10. Berg, I.L., et al., PRDM9 variation strongly influences recombination hot-spot activity and meiotic instability in humans. Nat Genet, 2010. 42(10): p. 859–63.
11. Myers, S., et al., Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science, 2010. 327(5967): p. 876–9.
12. Parvanov, E.D., P.M. Petkov, and K. Paigen, Prdm9 controls activation of mammalian recombination hotspots. Science, 2010. 327(5967): p. 835.
13. Marchini, J., et al., A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet, 2007. 39(7): p. 906–13.
14. Abecasis, G.R., D. Ghosh, and T.E. Nichols, Linkage disequilibrium: ancient history drives the new genetics. Hum Hered, 2005. 59(2): p. 118–24.
15. Price, A.L., et al., Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet, 2009. 5(6): p. e1000519.
16. McVean, G. and C.C. Spencer, Scanning the human genome for signals of selection. Curr Opin Genet Dev, 2006. 16(6): p. 624–9.
17. Nielsen, R., et al., Recent and ongoing selection in the human genome. Nat Rev Genet, 2007. 8(11): p. 857–68.
18. Myers, S., et al., A common sequence motif associated with recombination hotspots and genome instability in humans. Nat Genet, 2008. 40(9): p. 1124–9.
19. Stankiewicz, P. and J.R. Lupski, Genome architecture, rearrangements and genomic disorders. Trends Genet, 2002. 18(2): p. 74–82.
20. Kong, A., et al., Fine-scale recombination rate differences between sexes, populations and individuals. Nature, 2010. 467(7319): p. 1099–103.
21. Jeffreys, A.J., A. Ritchie, and R. Neumann, High resolution analysis of haplotype diversity and meiotic crossover in the human TAP2 recombination hotspot. Hum Mol Genet, 2000. 9(5): p. 725–33.
22. Jeffreys, A.J. and R. Neumann, Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat Genet, 2002. 31(3): p. 267–71.
23. Jeffreys, A.J. and R. Neumann, Factors influencing recombination frequency and distribution in a human meiotic crossover hotspot. Hum Mol Genet, 2005. 14(15): p. 2277–87.
24. Botstein, D., et al., Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet, 1980. 32(3): p. 314–31.
25. Coop, G., et al., High-resolution mapping of crossovers reveals extensive variation in fine-scale recombination patterns among humans. Science, 2008. 319(5868): p. 1395–8.
26. Elston, R.C. and J. Stewart, A general model for the genetic analysis of pedigree data. Hum Hered, 1971. 21(6): p. 523–42.
27. Lander, E.S. and P. Green, Construction of multilocus genetic linkage maps in humans. Proc Natl Acad Sci U S A, 1987. 84(8): p. 2363–7.
28. Kruglyak, L., et al., Parametric and nonparametric linkage analysis: a unified multipoint approach. Am J Hum Genet, 1996. 58(6): p. 1347–63.
29. Kong, A., et al., Detection of sharing by descent, long-range phasing and haplotype imputation. Nat Genet, 2008. 40(9): p. 1068–75.
30. Lewontin, R.C., The interaction of selection and linkage. I. General considerations; heterotic models. Genetics, 1964. 49(1): p. 49–67.
31. Hill, W.G. and A. Robertson, Linkage disequilibrium in finite populations. Theor Appl Genet, 1968. 38(6): p. 226–231.
32. McVean, G., Linkage disequilibrium, recombination and selection, in The Handbook of Statistical Genetics, D.J. Balding, M. Bishop, and C. Cannings, Editors. 2008, Wiley. p. 909–940.
33. Ardlie, K.G., L. Kruglyak, and M. Seielstad, Patterns of linkage disequilibrium in the human genome. Nat Rev Genet, 2002. 3(4): p. 299–309.
34. Hudson, R.R. and N.L. Kaplan, Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 1985. 111(1): p. 147–64.
35. Myers, S., The Detection of Recombination Events Using DNA Sequence Data, in Department of Statistics, 2002, University of Oxford: Oxford.
36. Myers, S.R. and R.C. Griffiths, Bounds on the minimum number of recombination events in a sample history. Genetics, 2003. 163(1): p. 375–94.
37. Song, Y.S. and J. Hein, Constructing minimal ancestral recombination graphs. J Comput Biol, 2005. 12(2): p. 147–69.
38. Wakeley, J., Coalescent Theory: An Introduction, 2009, Greenwood Village, Colo.: Roberts & Co. Publishers. xii, 326 p.
39. Nordborg, M., Coalescent theory. 2000.
40. Hein, J., M.H. Schierup, and C. Wiuf, Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory, 2005, Oxford; New York: Oxford University Press. xiii, 276 p.
41. Griffiths, R.C. and P. Marjoram, Ancestral inference from samples of DNA sequences with recombination. J Comput Biol, 1996. 3(4): p. 479–502.
42. The International HapMap Consortium, A second generation human haplotype map of over 3.1 million SNPs. Nature, 2007. 449(7164): p. 851–61.
43. Song, Y.S., R. Lyngso, and J. Hein, Counting all possible ancestral configurations of sample sequences in population genetics. IEEE/ACM Trans Comput Biol Bioinform, 2006. 3(3): p. 239–51.
44. Hudson, R.R., Two-locus sampling distributions and their application. Genetics, 2001. 159(4): p. 1805–17.
45. McVean, G., P. Awadalla, and P. Fearnhead, A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics, 2002. 160(3): p. 1231–41.
46. Auton, A. and G. McVean, Recombination rate estimation in the presence of hotspots. Genome Res, 2007. 17(8): p. 1219–27.
47. Fearnhead, P., Consistency of estimators of the population-scaled recombination rate. Theor Popul Biol, 2003. 64(1): p. 67–79.
48. Hinch, A.G., et al., The landscape of recombination in African Americans. Nature, 2011. 476: p. 170–75.
49. Li, N. and M. Stephens, Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 2003. 165(4): p. 2213–33.
50. Pfaff, C.L., et al., Population structure in admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. Am J Hum Genet, 2001. 68(1): p. 198–207.
51. Patterson, N., et al., Methods for high-density admixture mapping of disease genes. Am J Hum Genet, 2004. 74(5): p. 979–1000.
52. Seldin, M.F., et al., Putative ancestral origins of chromosomal segments in individual African Americans: implications for admixture mapping. Genome Res, 2004. 14(6): p. 1076–84.
53. Tian, C., et al., A genomewide single-nucleotide-polymorphism panel with high ancestry information for African American admixture mapping. Am J Hum Genet, 2006. 79(4): p. 640–9.
54. Kong, A., et al., Recombination rate and reproductive success in humans. Nat Genet, 2004. 36(11): p. 1203–6.
55. Ptak, S.E., et al., Fine-scale recombination patterns differ between chimpanzees and humans. Nat Genet, 2005. 37(4): p. 429–34.
56. Winckler, W., et al., Comparison of fine-scale recombination rates in humans and chimpanzees. Science, 2005. 308(5718): p. 107–11.

Chapter 10
Evolution of Viral Genomes: Interplay Between Selection,
Recombination, and Other Forces
Sergei L. Kosakovsky Pond, Ben Murrell, and Art F.Y. Poon
Abstract
RNA viruses evolve very rapidly, often recombine, and are subject to strong host (immune response) and
anthropogenic (antiretroviral drugs) selective forces. Given their compact and extensively sequenced
genomes, comparative analysis of RNA viral data can provide important insights into the molecular
mechanisms of adaptation, pathogenicity, immune evasion, and drug resistance. In this chapter, we present
an example-based overview of recent advances in evolutionary models and statistical approaches that enable
screening viral alignments for evidence of adaptive change in the presence of recombination, detecting
bursts of directional adaptive evolution associated with phenotypic changes, and detecting coevolving sites in viral genes.
Key words: Viral evolution, Recombination, Natural selection, Epistasis, Machine learning, Bayesian
networks

1. Introduction
Whether one considers them to be living organisms or not, viruses are
the most extensively sequenced members of the natural world. Virus
genomes, especially those of RNA viruses, present many unique
challenges to genetic sequence analysis. Even though they are comparatively small in size (ranging approximately from 10^3 to 10^6 nucleotides in length) and contain a relatively small number of genes, they are also subject to a very high mutation rate that drives the accumulation
of extensive sequence variation (1). Combined with the extremely
rapid pace of evolution due to high mutation and recombination
rates, short generation times, and strong selection in host environments, viruses provide some of the clearest examples of natural selection in action. Detecting the site-specific signature of selection
in viruses by codon-based models of molecular evolution is one

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_10,
© Springer Science+Business Media, LLC 2012


of the great achievements of modern evolutionary biology (2, 3).


In this chapter, we cover some of the difficulties often encountered
in the analysis of virus genomes and how they may be overcome
by recently developed techniques in molecular evolution. Specifically,
we describe and demonstrate methods used to detect recombination,
selection, and epistasis from alignments of homologous protein-coding sequences from virus genomes. We also present a method
for identifying factors in the environment (agents of selection)
that are responsible for the fitness advantage of certain virus genotypes over others. The reader should be aware that phylogenetics is
a rapidly moving field and that many of the methods being presented
in this chapter are relatively new and experimental and consequently
have not had time to become well established in the field. However,
we believe that these are the methods that will be of greatest interest
to investigators dealing with virus genomic variation.

2. Example Data and Software
Datasets used as examples in this chapter can be downloaded from
http://www.hyphy.org/pubs/book2011/data. All computational
procedures described below are based on the HyPhy software package (4). A basic level of familiarity with the package is expected and
we recommend that readers peruse relevant package documentation, which can be found at http://www.hyphy.org.

3. Recombination
We start by presenting a method for detecting recombination from
an alignment of homologous sequences. This is not a conventional
ordering of topics because methods for detecting recombination are
generally predated by codon model-based methods for detecting
diversifying selection (see Subheading 4). However, we strongly
advocate screening an alignment for recombination before all else because recombination, which causes different regions of an alignment to be related by different phylogenies, can strongly affect the results of subsequent analyses, such as selection detection.
Recombination plays a key role in the evolution of many viral
pathogens. For instance, major pandemic strains of the influenza A
virus (IAV) have arisen through segmental reassortment, which can
be thought of as intergenic, or gene-preserving, recombination.
For example, the swine-origin H1N1 virus has undergone at least
two reassortment events, and carries genes from three different
ancestral IAV lineages (5).


In HIV-1, each viral particle packages two RNA genomes and
during reverse transcription (RT), the RT enzyme switches between
two RNA templates at rates as high as 2 × 10^−3 per nucleotide per
replication cycle (6), creating recombinant DNA templates, which
in turn give rise to recombinant progeny. If a single cell is infected
with multiple divergent HIV-1 viruses (this can occur in up to 10%
of infected hosts (7), depending on a variety of factors), then it is
possible that resulting recombinants will establish distinct and
novel viral lineages. Molecular epidemiology of HIV-1 is replete
with examples of such lineages, termed circulating recombinant
forms (CRFs), with over 40 characterized to date (8).
How frequently recombination occurs is strongly influenced by the viral type and species: Chare et al. (9) found evidence of recombination in 40% of the plant RNA genomes that they examined,
but in fewer than 10% of negative-sense RNA viruses (10). Apart
from its importance in generating novel or removing deleterious
genetic diversity and accelerating evolution (11), recombination
has a strong effect on many practical aspects of evolutionary analyses (12). As can be seen in Fig. 1a, the most apparent effect of
including recombinant sequences in a phylogenetic analysis is topological incongruence between trees inferred from different parts of
the alignment. In such instances, there is no single topology which
can correctly represent the shared ancestry of all the sequences in
the sample.
There are many computational approaches to finding evidence
of recombination in a sequence alignment (13); however at their
core, many such methods look for evidence of phylogenetic incongruence. Here, we discuss one such methodGenetic Algorithms
for Recombination Detection, GARDthat we have found to have
the best performance among a wide range of approaches on
simulated data (14). A genetic algorithm attempts to find an optimal
solution to a complex problem by mimicking processes of biological
evolution (mutation, recombination, and selection) in a population
of competing solutions. In this application of genetic algorithms, we
are evolving a population of chromosomes that specify different
numbers and locations of recombination breakpoints in the alignment with the objective of detecting topological incongruence, i.e.,
support for different phylogenies by separate regions of the alignment. The fitness of each chromosome is determined by using
maximum likelihood methods to evaluate a separate phylogeny for
each nonrecombinant fragment defined by the breakpoints (e.g., to
the left and right of a breakpoint in Fig. 1), and computing a
goodness of fit (small sample Akaike Information Criterion or
AICc) for each such model. The genetic algorithm searches for the
number and placement of breakpoints yielding the best AICc and
also reports confidence values for inferred breakpoint locations
based on the contribution of each considered model weighted by
how well the model fits the data. For computational expedience,


Fig. 1. (a) Phylogenetic incongruence caused by the presence of a recombinant sequence in an alignment. Sequence R is a
product of homologous recombination between sequences A and B. Phylogenies reconstructed from sequences A,B,R and
an outgroup sequence (O) differ based on which part of the alignment is being considered: to the left of the break point, R clusters with A, whereas to the right of the break point R clusters with B. (b) GARD analysis of the Cache Valley Fever Virus
glycoprotein.

Evolution of Viral Genomes: Interplay Between Selection. . .
the current implementation of GARD infers topologies for each
segment using Neighbor Joining (15) based on the TN93 pairwise
distance estimator (16) and then fits a user-specified nucleotide
evolutionary model using maximum likelihood to obtain AICc
scores.
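The genetic-algorithm search described above can be sketched in miniature. The snippet below is a toy illustration, not GARD's implementation: the fitness function is a stand-in (GARD scores each candidate by the AICc of per-segment phylogenies), and every name and number here is invented for the example.

```python
import random

random.seed(1)
ALN_LEN = 2000  # hypothetical alignment length (made up for this toy)

def score(chrom):
    # Stand-in fitness: GARD would compute -AICc from per-segment
    # phylogenies; here we simply reward proximity to two "true"
    # breakpoints placed at 700 and 1500 for illustration.
    truth = (700, 1500)
    return -sum(abs(a - b) for a, b in zip(sorted(chrom), truth))

def mutate(chrom):
    # Shift one breakpoint by a small random amount (mutation).
    c = list(chrom)
    i = random.randrange(len(c))
    c[i] = min(ALN_LEN - 1, max(1, c[i] + random.randint(-50, 50)))
    return tuple(sorted(c))

def recombine(a, b):
    # Mix breakpoints from two parent chromosomes (recombination).
    return tuple(sorted(random.choice(p) for p in zip(sorted(a), sorted(b))))

# Evolve a population of two-breakpoint "chromosomes" with elitist selection.
pop = [tuple(sorted(random.sample(range(1, ALN_LEN), 2))) for _ in range(20)]
for _ in range(200):
    pop.sort(key=score, reverse=True)
    parents = pop[:10]  # selection: keep the best half
    children = [mutate(recombine(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    pop = parents + children

best = max(pop, key=score)  # breakpoints should end up near 700 and 1500
```

Because the best solutions are always retained, the search only improves from one generation to the next, mirroring how GARD's score converges as breakpoints are added and repositioned.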
GARD is a computationally intensive method and typically
examines 10³–10⁵ competing models on a single dataset. There
are two free implementations of GARD, both of which require a
distributed computing environment (message passing interface,
MPI) in order to fit many models in parallel and speed up the
execution: in the HyPhy package (presented here) and on the
Datamonkey Web server (http://www.datamonkey.org, discussed
in (17)).
We demonstrate GARD using 13 glycoprotein sequences from
Cache Valley Fever virus (CVFv, file CVFg.fas). To execute a
GARD screen, launch HyPhy, select Recombination from the standard analyses menu and choose Screen an alignment using GARD/
GARD.bf batch file, locate the alignment file, and supply values for
the following options.
1. Please enter a 6-character model designation (e.g., 010010 defines HKY85): this option controls which nucleotide substitution model is to be used for the analysis, using PAUP* notational shorthand. The six-character shorthand allows the user to specify the entire spectrum from F81 (000000) to GTR (012345), which is a good default option for most analyses. For example, the abbreviation 012232 defines the model with four nucleotide substitution rates:
   θ_AC; θ_AG; θ_AT; θ_CG = θ_AT; θ_CT; θ_GT = θ_AT.
2. Rate variation options: determine how site-to-site rate variation should be modeled. Select None to discount site-to-site rate variation; this causes the analysis to run several times faster than other options, but creates the risk of mistaking rate heterogeneity for recombination. This option can only be recommended for alignments with three or four sequences. Choose General Discrete (the recommended default) to model rate variation using an N-bin general discrete distribution, and Beta-Gamma for an adaptively discretized Γ distribution (this is a more flexible version of the standard +Γ4 model).
3. How many distribution bins [2–32]: if rate variation is selected in the previous step, this option allows the user to decide how many different rate classes should be included in the model. We recommend using three rate classes by default since both General Discrete and Beta-Gamma distributions are very flexible and can capture the variability in the majority of alignments with only a few rate classes.

4. Save results to: supply a file name where HyPhy should write an HTML-formatted summary of the analysis. HyPhy generates several other files with names obtained by appending suffixes to the main result file. The _finalout file stores the
original alignment in NEXUS format with inferred nonrecombinant sections of the alignment saved in the ASSUMPTIONS
block and trees inferred for each partition in the TREES block;
this file can be input into many recombination-aware analyses
in HyPhy and other programs that can read NEXUS. The
_ga_details file contains two lines of information about
each model examined by the GA: its AICc score and the location of breakpoints in the model. Finally, the _ga_splits file
stores information about the location of breakpoints and trees
inferred for each alignment region under the best model found
by the GA.
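The mapping from a six-character model designation (option 1 above) to shared rate classes can be illustrated with a short helper; the function name is ours, not part of HyPhy.

```python
# The six characters map to the rate classes for the substitution pairs
# AC, AG, AT, CG, CT, GT (in that order); equal digits share one rate.
PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")

def rate_classes(designation):
    """Group nucleotide pairs by shared rate class, e.g., '012232'."""
    classes = {}
    for pair, digit in zip(PAIRS, designation):
        classes.setdefault(digit, []).append(pair)
    return classes

# '012232': four distinct rates; CG and GT share the AT rate.
print(rate_classes("012232"))
# {'0': ['AC'], '1': ['AG'], '2': ['AT', 'CG', 'GT'], '3': ['CT']}
print(len(rate_classes("010010")))  # HKY85: 2 classes (transitions vs. transversions)
```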
The HTML file generated by a GARD analysis (Fig. 1b) presents a summary of the results. In addition to basic model fitting metrics, such as log-likelihood, AICc, inferred nucleotide substitution rates, and site-to-site rate distribution (if selected as an
option), the page presents the best-scoring partitioning of the
alignment for a given number of breakpoints. For example,
among all models with two breakpoints in the Cache Valley Virus
glycoprotein alignment, the best model places them at nucleotides
1,491 and 1,693 and improves the AICc over the best model with a
single breakpoint (at 1,446) by 137.991 points. The score continues to improve until the number of breakpoints reaches 5, at
which point the program terminates and reports the best model
with 4 breakpoints. If GARD reports that the best model has
0 breakpoints, we may conclude that no evidence of recombination
has been found. Note that because genetic algorithms are stochastic
there is no guarantee that replicate runs will converge to exactly the
same quantitative results: for example, the difference in AICc values
between models. When there is a strong signal of recombination
breakpoints in the data, however, the qualitative results (number
and general location of breakpoints) should be fairly robust.
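The AICc values that drive these model comparisons can be computed directly. The sketch below uses the standard small-sample formula; the likelihoods and parameter counts are made up for illustration.

```python
def aicc(log_likelihood, k, n):
    """Small-sample Akaike Information Criterion (lower is better):
    AICc = 2k - 2*lnL + 2k(k+1)/(n - k - 1),
    for k estimated parameters and n observations (alignment sites)."""
    return 2 * k - 2 * log_likelihood + (2 * k * (k + 1)) / (n - k - 1)

# A breakpoint model is preferred when the likelihood gain outweighs the
# penalty for its extra parameters; values here are invented examples.
single = aicc(-10650.0, k=30, n=2000)
double = aicc(-10550.0, k=34, n=2000)
assert double < single  # the two-breakpoint model wins despite 4 more parameters
```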
GARD does not automatically check to ensure that the
improvement in model fit is due to a change in the tree topology.
For example, if one contiguous part of the alignment evolves at a
much higher rate than the remainder of the alignment (e.g., an
exposed loop) or if the rates of evolution vary among lineages
due to heterotachy, then a model which uses two trees with the
same topology but different branch lengths may be selected by
GARD. To confirm that the topologies differ between segments, it
is necessary to execute a postprocessing analysis implemented in
the Process GARD results/GARDProcessor.bf module. This analysis
does not require an MPI environment and must be provided
with the same alignment that GARD has been applied to and

the _ga_splits file generated by GARD. GARDProcessor.bf performs two tests for topological differences. The first test seeks
overall evidence of such differences: it compares the AICc score of
the best model found by GARD with the fit of the model that uses
the same set of breakpoints, but maintains the tree topology
inferred from the entire alignment for all partitions. For the
CVFg example, the GARD model is strongly preferred by this
test. This fact is reported as "Versus the single tree/multiple partition model: Delta AIC = 253.037." Secondly, the analysis examines whether the trees to the left and right of each breakpoint are topologically different using the Shimodaira-Hasegawa (SH) test (18), using the RELL approximation scheme (19) to speed up the calculations. For complete details, please refer to the original GARD manuscript.
In this case, three out of four breakpoints are confirmed using the SH test, with p-values ≤ 0.05 (corrected for multiple testing).
Break point   LHS raw p   LHS adjusted p   RHS raw p   RHS adjusted p
588           0.00060     0.00480          0.00140     0.01120
1,080         0.00260     0.02080          0.02130     0.17040
1,491         0.00010     0.00080          0.00010     0.00080
1,693         0.00010     0.00080          0.00010     0.00080

To understand the report (p-values differ slightly between runs because of the stochastic nature of SH resampling), consider the second line: the segment to the left (LHS) of breakpoint 1080 has a topology significantly different from that to the right (Bonferroni-corrected p-value of 0.0208), but the reverse is not true (RHS adjusted p = 0.1704); hence, this breakpoint may be attributed to processes other than recombination.
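The adjusted values above appear consistent with a plain Bonferroni correction with a factor of 8 (four breakpoints, each tested on its left and right flank), which can be checked directly:

```python
def bonferroni(p, m):
    # Bonferroni adjustment for m comparisons, capped at 1.
    return min(1.0, p * m)

# Raw (LHS, RHS) SH p-values for the four breakpoints reported in the table.
raw = {588: (0.0006, 0.0014), 1080: (0.0026, 0.0213),
       1491: (0.0001, 0.0001), 1693: (0.0001, 0.0001)}

# Four breakpoints, each tested on two flanks: m = 8 comparisons.
adjusted = {bp: tuple(bonferroni(p, 8) for p in pair)
            for bp, pair in raw.items()}
assert abs(adjusted[1080][0] - 0.0208) < 1e-9  # matches the reported LHS value
assert abs(adjusted[1080][1] - 0.1704) < 1e-9  # matches the reported RHS value
```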
GARD is geared toward mapping the breakpoints and detecting segments of the alignment which can be adequately described
by a single tree topology; as we discuss in the next section, this is
necessary to allow more complex analyses to handle alignments
with recombinant sequences. Because GARD allows arbitrary tree
changes across breakpoints, there are certain cases when it does not
perform well: for example, short alignments with many sequences.
GARD requires approximately four times as many sites as sequences to run; otherwise, the number of samples (sites) is less than the number of model parameters (branch lengths and rates).
Another case occurs when only a few sequences in a large alignment
have undergone recombination, in which instance the cost of adding many new branch length parameters for one or more trees will
likely outweigh the likelihood improvement due to several local
subtree rearrangements.

The latter case is common when viral sequences are subtyped. In
HIV-1 or IAV, for example, it is common to construct an alignment
and a phylogeny of reference sequences with known subtypes or
serotypes and then use one of many algorithms to thread a
sequence to be classified onto the reference topology. One such
algorithm is a modification of GARD, called subtype classification
using evolutionary algorithms (SCUEAL), developed in ref. 20.
Unlike GARD, SCUEAL assumes that the reference sequences can
be related by a single topology, which is fixed a priori. It is possible
to include recombinant sequences in the reference alignment (see
ref. 20 for details). A genetic algorithm searches for the breakpoints in the query sequences only and, for each sequence fragment defined by the breakpoints, the branch in the reference tree where the query sequence attaches. SCUEAL is implemented in
HyPhy, and all the necessary files to run it can be downloaded from
http://www.hyphy.org/pubs/SCUEAL/. The download includes
a prebuilt reference alignment for HIV-1 pol sequences and documentation on how to make custom reference alignments and screen
sequences against them.

4. Selection
Selection is the outcome of the variation in fitness induced by the
environment in which genetic variants are expressed. Based on
the excess number of nonsynonymous codon substitutions or a
change in allele frequencies, it is possible to identify sites within
protein-coding regions of a genome that have been targeted by
selection: some of the methods for accomplishing this are presented in preceding chapters. Diversifying (host-specific) selection on virus genome variation is dominated by the immune response mounted by the host. Jawed vertebrates, such as humans,
have, in addition to the innate immune system, an adaptive
immune system that is further partitioned into the humoral and
cellular immune responses (21). The humoral response takes place
in the extracellular environment and mounts an antibody-based
defense that attacks exposed surfaces of the virus particle. The
cellular response takes place within the infected cell and involves
the recognition and binding of peptides encoded by the virus
genome, which are displayed on the surface of the cell to trigger
the lysis of the cell by cytotoxic T-lymphocytes (CTLs). Both
components of the adaptive immune system play a crucial role in
managing a viral infection and thereby shaping the genetic variation of the virus population. In addition, many human pathogenic
viruses, particularly HIV-1, influenza virus, hepacivirus, and herpesvirus, are treated by antiviral agents that also target specific sites
of the virus genome (22).

10

5. Detecting
Selection in the
Presence of
Recombination

Evolution of Viral Genomes: Interplay Between Selection. . .

247

In order to infer selection in an alignment or at individual sites,


most algorithms estimate the rates of synonymous and nonsynonymous substitutions and test them for equality. It has long been
recognized that by confounding the phylogenetic signal, recombination can mislead rate estimation procedures and natural selection
tests, often severely (23). A simple illustration of this effect can be
seen in Fig. 2.
The simplest approach to guard against this undesirable behavior is to identify and remove recombinant sequences prior to running selection analyses. However, in addition to practical difficulties in reliably detecting which sequences have been subject to recombination, discarding sequence data lowers the power of analyses and could introduce

Fig. 2. The effect of recombination on inferring diversifying selection. Reconstructed evolutionary history of codon 516 of the Cache Valley Fever virus glycoprotein alignment is shown according to the GARD-inferred segment phylogeny (left) or a single phylogeny inferred from the entire alignment (right). Ignoring the confounding effect of recombination causes the number of nonsynonymous substitutions to be overestimated. A fixed effects likelihood (FEL (60)) analysis infers codon 516 to be under diversifying selection when recombination is ignored (p = 0.02), but not when it is corrected for using a partitioning approach (p = 0.28).

unanticipated biases. Scheffler et al. (24) proposed a PARtitioning approach for Robust Inference of Selection (PARRIS) that retains all the sequences, including recombinants, for selection testing; its stages are described below.
1. The input alignment is screened for evidence of recombination,
e.g., using GARD, and the number and location of breakpoints
are inferred.
2. A separate tree is constructed for each nonrecombinant
segment; for the CVFg alignment, this would generate five alignment segments and five corresponding trees.
3. A codon model (see previous sections or ref. 25 for further details) is defined using the following rate matrix, whose q_ij element describes the instantaneous rate of substitution between codon i and codon j:

   q_ij = α θ_ij π_ij,       a single-nucleotide synonymous change,
   q_ij = ω α θ_ij π_ij,     a single-nucleotide nonsynonymous change,
   q_ij = 0,                 a multiple-nucleotide change,
   q_ij = −Σ_{k≠i} q_ik,     i = j.

θ_ij parameterize the unequal substitution rates between nucleotides, π_ij are the frequency parameters, correcting for the nucleotide composition of the alignment, α is the synonymous substitution rate, and ω is the familiar ratio of nonsynonymous to synonymous substitution rates. For example, q_{ACA,ATA} = ω α θ_CT π_T^2, q_{ACT,ACA} = α θ_AT π_A^3, and q_{AAA,CCC} = 0. Notice that θ_ij = θ_ji (because of time reversibility of the process); π_n^m refers to the observed frequency of nucleotide n in codon position m (the MG frequency parameterization (25)). One key feature of this model is that both α and ω can vary from site to site; traditionally, it has been assumed that α is proportional to the mutation rate and is constant across all sites. There is increasing evidence that synonymous rates vary among sites as well, e.g., due to secondary structure of viral RNA and codon usage bias, and not accounting for such variation can cause misidentification of relaxed constraints as positive selection in some cases (e.g., see ref. 26).
4. All parameters of the codon model are estimated jointly
from all nonrecombinant data partitions while the tree topology and branch lengths are allowed to differ between partitions. In this way, recombination is accommodated (different
topologies and branch lengths), but the parameters of the
evolutionary process (e.g., o) are inferred from all sequences
jointly.
5. Two models with site-to-site rate variation are fitted to the data: the null model, which restricts ω ≤ 1, and the alternative model, which does not have this restriction. The models are analogous to M1a and M2a implemented in the PAML package (described in an earlier section), except that in PARRIS synonymous rates α are also variable, and drawn from a three-bin general discrete distribution.
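The case structure of the rate matrix in step 3 can be made concrete with a toy classifier. The symbolic output strings and the five-codon genetic-code fragment below are ours, purely for illustration; they are not HyPhy's implementation.

```python
# Minimal genetic-code fragment; only the codons used in the examples below.
CODE = {"ACA": "T", "ATA": "I", "ACT": "T", "AAA": "K", "CCC": "P"}

def q_class(i, j):
    """Classify the off-diagonal rate q_ij for codon i -> codon j:
    zero for multi-nucleotide changes, alpha*theta*pi terms otherwise."""
    diffs = [(p, a, b) for p, (a, b) in enumerate(zip(i, j)) if a != b]
    if len(diffs) != 1:
        return "0 (multiple-nucleotide change)"
    pos, a, b = diffs[0]
    theta = "theta_" + "".join(sorted(a + b))  # time reversibility: theta_ij = theta_ji
    pi = "pi_%s^%d" % (b, pos + 1)             # target-nucleotide frequency by codon position
    if CODE[i] == CODE[j]:                     # same amino acid: synonymous
        return "alpha * %s * %s" % (theta, pi)
    return "omega * alpha * %s * %s" % (theta, pi)

print(q_class("ACA", "ATA"))  # omega * alpha * theta_CT * pi_T^2 (Thr -> Ile)
print(q_class("ACT", "ACA"))  # alpha * theta_AT * pi_A^3 (Thr -> Thr)
print(q_class("AAA", "CCC"))  # 0 (multiple-nucleotide change)
```

The three printed cases reproduce the worked examples given for q_{ACA,ATA}, q_{ACT,ACA}, and q_{AAA,CCC} in step 3.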
To start a PARRIS analysis, launch HyPhy, select Selection/
Recombination from the standard analyses menu and choose
A PARtitioning approach for Robust Inference of Selection/PARRIS.bf batch file, supply an alignment file, and choose values for
each of the many analysis options. In addition to the models of
Scheffler et al. (24), PARRIS.bf implements those described by
Delport et al. (26), adapted for handling partitioned data, and a
number of unpublished or experimental options.
1. Choose Genetic Code: select the genetic code appropriate for the alignment under investigation.
2. How many datafiles are to be analyzed? PARRIS and other recombination-aware selection analyses (see Subheading 9) can read NEXUS-formatted files with multiple partitions encoded in the ASSUMPTIONS block and corresponding trees in the TREES block (e.g., the _finalout files output by GARD) or read individual partition and tree files. For the latter option, select the number (>1) of files to be input, and for the former enter 1.
3. Branch Lengths: to speed up calculations, HyPhy can use branch lengths estimated from a nucleotide model for the analysis, i.e., hold them constant while the codon model is being fitted (the Nucleotide Model option, suitable for initial screens, especially on larger alignments), or estimate them together with all other parameters (the Codon Model option, suggested to confirm results).
4. Options for handling equilibrium frequencies: select how to parameterize codon frequencies in the substitution model, Muse-Gaut (MG) vs. Goldman-Yang (GY). There are some reasons to prefer MG in general (see ref. 25 for a discussion).
5. Nucleotide Rate Matrix Options: specify the nucleotide bias component of the substitution model.
6. Options for multiple classes of non-synonymous substitutions: decide how the model will handle unequal substitution rates between different amino acids. Single specifies that a single ω rate applies to all nonsynonymous substitutions (this is by far the most common option). With Multi, the analysis prompts the user to select a file defining the protein analog of the (012232) string for nucleotide models as a 20 × 20 matrix (see the Multirate.mdl file for an example). There are a number of ways such a matrix can be obtained, including a

model selection process for codon data (27). NMulti allows
the specification of numerical substitution rates between
pairs of amino acids (much like the BLOSUM62 matrix used
by blastp).
7. Rate Variation Models: allow the rate variation models described in ref. 26 to use the partitioning approach. For PARRIS analyses, select the Dual option, where both synonymous and nonsynonymous rates vary from site to site.
8. Independent or multiplicative nonsynonymous rate: in the rate matrix defined above, we parameterized the nonsynonymous substitution rate as αω, i.e., via a multiplicative factor which modulates the synonymous rate (the Multiplicative option). It is also possible to parameterize this rate via an independent parameter β (the Independent option). The latter is generally more flexible, e.g., it allows both α = 0 and β > 0, which cannot be parameterized through a finite ω ratio, but makes testing for selection (i.e., ω > 1) difficult. The PARRIS analysis uses the Multiplicative option.
9. Codon or nucleotide level synonymous rate variation: this is an experimental (at the time of writing) option. Select Codon (syn1) to run PARRIS. The other option, Nucleotide (syn3), allows the model to vary synonymous substitution rates (α) based on the position of the codon where the substitution is taking place.
10. Distribution Options: determine which site-to-site rate variation models are fitted to the data. The PARRIS option runs the two discrete models needed to test for evidence of diversifying positive selection described in the original manuscript, while the others provide more choices, including discretized gamma distributions. Selecting Run All or Run Custom provides access to all or some of these models.
11. Initial Value Options: allow the optimization procedure to start from predefined values (Default) or from a random starting point (Randomized). The latter option is useful for checking convergence; if multiple runs of the analysis attain the same log likelihood and parameter values, then the procedure has converged.
12. Save summary result file: HyPhy writes the analysis summary (also echoed to the screen) to this file. It also creates three files for each fitted model by appending suffixes to the summary file name, much as in GARD. For PARRIS, the null model is named M1a and the alternative model M2a. The .model.fit file contains the fitted likelihood function for each model in NEXUS format, with the HYPHY block used to encode the model and parameter estimates. The .model.distributions file stores a text summary of the distributions of synonymous

and nonsynonymous rates inferred for the model, while the
.model.marginals file provides a detailed report for the
empirical Bayes analysis carried out by the program to identify
sites subject to negative and positive selection and posterior
distributions of o and a values at each site.
As an illustration, we run the PARRIS analysis with the REV nucleotide model, codon branch lengths, the MG frequency option, a single nonsynonymous rate class, the dual rate variation model, multiplicative nonsynonymous rate, codon-level synonymous rates, PARRIS distributions, and default starting values on the HepatitisE.nex (single partition) and HepatitisEgard.nex (GARD-inferred _finalout) files, containing an alignment of 21 capsid sequences from hepatitis E virus. PARRIS executed on the unpartitioned alignment provides the following summary output:

Model            Log likelihood   Synonymous CV   NS Exp and CV      N/S Exp and CV     p-Value     Prm   AIC
discr (3), M1a   -10618.85294     0.55189576      0.08138, 3.17446   0.24263, 5.48971   N/A         50    21,337.71
discr (3), M2a   -10613.84434     0.55501859      0.17868, 3.96245   0.66827, 7.00400   0.0066803   52    21,331.69

In this particular instance, allowing a proportion of sites to evolve with ω > 1 (the M2a model) provides a significantly improved fit compared to the null model, which only permits sites with ω ≤ 1, both according to the likelihood ratio test (p = 0.007 based on the χ² distribution with 2 degrees of freedom) and AIC (21,331.69 vs. 21,337.71). The other values reported in the table summarize means and coefficients of variation (CV) for the synonymous and nonsynonymous distributions of rates. In the .M2a.marginals file, four sites are reported to be under diversifying positive selection with posterior probabilities of 0.95 or greater (23, 109, 110, and 115).
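The reported p-value and AIC values can be reproduced from the table's log likelihoods: the LRT statistic is twice the log-likelihood difference, the χ² survival function with 2 degrees of freedom (52 − 50 parameters) is exp(−x/2), and AIC = 2k − 2 lnL.

```python
from math import exp

lnL_null, k_null = -10618.85294, 50  # M1a values from the summary table
lnL_alt, k_alt = -10613.84434, 52    # M2a values from the summary table

lrt = 2 * (lnL_alt - lnL_null)       # likelihood ratio test statistic
p = exp(-lrt / 2)                    # chi-squared survival function, df = 2
aic_null = 2 * k_null - 2 * lnL_null
aic_alt = 2 * k_alt - 2 * lnL_alt

assert abs(p - 0.0066803) < 1e-6          # matches the reported p-value
assert abs(aic_null - 21337.71) < 0.01    # matches the reported M1a AIC
assert abs(aic_alt - 21331.69) < 0.01     # matches the reported M2a AIC
```

The same arithmetic applied to the partitioned table below reproduces its reported p-value of about 0.078.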
GARD inferred 4 breakpoints (5 partitions) in this dataset, and
the corresponding summary table is as follows:

Model            Log likelihood   Synonymous CV   NS Exp and CV      N/S Exp and CV     p-Value     Prm   AIC
discr (3), M1a   -10457.49607     0.42057602      0.08316, 3.16349   0.20055, 7.23691   N/A         84    21,082.99
discr (3), M2a   -10454.94654     0.42703723      0.11230, 3.68455   0.28263, 8.72351   0.0781191   86    21,081.89

Notice that the evidence for positive selection is much weaker when recombination is taken into account: the LRT p-value is no longer significant, and the AIC improvement is much smaller compared to the unpartitioned analysis. The partitioned models have a much better AIC than their unpartitioned counterparts, indicating that the data are better explained by the former. Also, no positively selected sites with posterior probabilities of 0.95 or greater are found.
This example demonstrates that if recombination could
have shaped the evolutionary history of sequences being analyzed
it is prudent to use approaches which take it into consideration,
lest it be misinterpreted as another process, e.g., positive selection.
All selection analyses in HyPhy and Datamonkey accept partitioned data, thus allowing researchers to keep all the sequences
and correct for the confounding effects (see Subheading 9 for
another example).

6. Directional
Selection
HIV-1 replicates extremely rapidly, producing as many as 10¹⁰ viral particles per day. The fidelity of reverse transcription is low, with a rate of 3 × 10⁻⁵ errors per base per replication cycle. Together, this provides HIV-1 with a powerful means to escape the selective pressure introduced by antiretroviral therapy (ART), which suppresses HIV-1 replication by interfering with various stages of the viral life cycle, leading to drug resistance.
Some important features of the evolution of drug resistance
must be encoded by models of evolution to detect substitutions
under selective pressure induced by ART. For this discussion, we
are modeling evolution over a reverse transcriptase phylogeny that
has been constructed from treatment naive, as well as posttreatment
sequences (see Fig. 3). The first thing to notice is that the selective
pressure of interest is not constant over the entire phylogeny, but
rather restricted to a subset of branches: it is episodic. A second
critical property of the evolution of drug resistance is that once
ART is introduced, selection is directional, where only substitutions toward one or more target amino acids are favored. This can
be contrasted with diversifying selection, where nucleotide substitutions that change the amino acid are favored, regardless of the
amino acid. Diversifying selection approximates the continuously
shifting coevolutionary environment typified by host-pathogen
arms-race coevolution (28). The evolution of drug resistance,
on the other hand, is characterized by discrete major shifts of fitness
landscape with the introduction of therapies. The probability of the
emergence of particular amino acids contributing to drug resistance

Fig. 3. A phylogeny of reverse transcriptase sequences. Foreground branches which lead to posttreatment sequences are colored red.

increases inexorably with time, as long as viral replication is not
suppressed. Once treatment resistance emerges, selection becomes
purifying as long as the drug regimen is maintained.
A Model of Episodic Directional Selection (MEDS) models directional selection along a priori selected foreground branches while assuming that the background branches evolve in a nondirectional (but not necessarily neutral) manner. MEDS is a codon model, based on MG94 × REV (which combines a general time-reversible model of nucleotide substitution with separate synonymous and nonsynonymous rates, α and β), that extends two earlier models of coding sequence evolution: (1) the episodic component of MEDS is structurally identical to the Internal Fixed Effects Likelihood (IFEL) model proposed by Kosakovsky Pond et al. (29) and (2) the directional component is introduced in the same manner as in the model of directional selection proposed by Seoighe et al. (30).
Two separate codon models are used to model substitutions along foreground and background branches. A single synonymous rate α is shared between them, but each is allowed its own nonsynonymous substitution rate (β_F and β_B). Diversifying selection is, thus, allowed on both foreground and background branches. Directional selection along foreground branches is introduced with ω_T, which is multiplied onto the rates of all substitutions to a specified target amino acid T. Elevating ω_T thus increases the rate of substitutions to T. The analysis proceeds site by site. Branch lengths and nucleotide rate parameters are first estimated from

the whole alignment under a simpler model. For each site, we define the null model by setting ω_T = 1, a special case of the alternative model, where ω_T is free to vary. The null model has three free parameters per site: α, β_F, and β_B. The alternative model has a single additional parameter, ω_T, biasing substitutions toward T. To test for selection toward amino acid T at a specific site, we obtain maximum likelihood scores for the null and alternative models and perform a likelihood ratio test (LRT). Scanning a site for selection toward any possible amino acid T involves testing 20 hypotheses, and Bonferroni correction (31) is employed to control the sitewise Type I error rate.
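Per-site testing with the 20-way Bonferroni correction can be sketched as follows. The LRT values here are invented, not MEDS output; the χ² p-value with 1 degree of freedom (one extra parameter, ω_T) is computed via the complementary error function.

```python
from math import erfc, sqrt

def chi2_df1_pvalue(lrt):
    # Survival function of chi-squared with 1 df: P(X > x) = erfc(sqrt(x/2)).
    return erfc(sqrt(lrt / 2))

def site_scan(lrt_by_target):
    """Bonferroni-correct one site's per-target LRTs across 20 amino-acid
    hypotheses (illustrative sketch, not MEDS.bf itself)."""
    return {t: min(1.0, 20 * chi2_df1_pvalue(x))
            for t, x in lrt_by_target.items()}

# Hypothetical site: strong signal toward 'V', weak signal toward 'K'.
adjusted = site_scan({"V": 25.0, "K": 2.0})
assert adjusted["V"] < 0.05   # survives the 20-way correction
assert adjusted["K"] > 0.05   # does not survive the correction
```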
To run a MEDS analysis, an alignment and a rooted phylogeny are required. Furthermore, the foreground branches of the phylogeny must be labeled. To do this, {FG} is placed after the foreground node names (but before the colons) in the Newick tree string. In this example (sub)tree, Branch1 is labeled foreground: (Branch1{FG}:0.1, Branch2:0.1); for large trees, editing the Newick files by hand is inefficient. One solution is to use FigTree (http://tree.bio.ed.ac.uk/software/figtree) to color the foreground branches, and then replace the color tag (e.g., [&!color=#-64512]) in the resulting Newick string with {FG}.
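One way to automate this tag replacement is a one-line regular expression; the pattern below assumes FigTree colour annotations of the form [&!color=#...], and the exact tag format may vary between FigTree versions.

```python
import re

def tag_foreground(newick):
    """Replace FigTree colour annotations with the {FG} label MEDS expects,
    leaving the branch-length colons untouched."""
    return re.sub(r"\[&!color=#[0-9a-fA-F-]+\]", "{FG}", newick)

tree = "(Branch1[&!color=#ff0000]:0.1,Branch2:0.1);"
print(tag_foreground(tree))  # (Branch1{FG}:0.1,Branch2:0.1);
```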
Once the phylogeny is suitably annotated, from HyPhy execute Standard Analyses/Positive Selection/MEDS.bf, select the data and tree files, and specify an output .csv file. The output file contains the maximum likelihood parameter values and LRTs for each of the 20 amino acids at each site. To assist with interpreting such a large file, we provide a Web script (www.cs.sun.ac.za/~bmurrell/py/MEDSproc) that takes an output file and p-value threshold and summarizes all detected substitutions. In addition to the test for directional selection, MEDS.bf also performs a test for episodic diversifying selection along foreground branches. These results are included in the output and summary files.
Table 1 displays the results for MEDS on the reverse transcriptase alignment (HIV_RT.fasta) for the phylogeny in Fig. 3
(HIV_RT_tagged.tre). This alignment contains 26 sequences
from patients before the initiation of ART, and after failing ART,
obtained from the Stanford HIV Drug Resistance Database
(hivdb.stanford.edu). We tested for episodic directional selection with MEDS, episodic diversifying selection (which MEDS.bf
automatically tests for), and constant diversifying selection using
FEL from Datamonkey. Using a p-value threshold of 0.05,
MEDS detected seven substitutions under selection, six of which
are known drug resistance-associated mutations (DRAMs). The
test for episodic diversifying selection detected five sites under
selection, all known to be associated with drug resistance. The
test for constant diversifying selection detected only two sites,
both involved in drug resistance. On this alignment, the

Table 1
HIV-1 reverse transcriptase drug resistance: episodic directional, episodic diversifying, and constant diversifying selection

Site   Target   MEDS p-value   FEEDS p-value   FEL p-value   Resistance
41              <0.0001        –               –             NRTI
74     *        –              0.001           0.007         NRTI
83              0.0004         –               –             –
103             <0.0001        0.007           –             NNRTI
184             <0.0001        0.02            –             NRTI
210             <0.0001        0.008           –             NRTI
215             <0.0001        <0.0001         <0.0001       NRTI
219             0.0017         –               –             NRTI accessory

Note that the p-value for MEDS is obtained from a likelihood ratio test (LRT) for episodic directional selection; the FEEDS p-value is obtained from an LRT of the hypothesis β_F > α that tests for diversifying selection; and FEL is a test for constant diversifying selection run on Datamonkey. "–" denotes a nonsignificant (α = 0.05) p-value and an asterisk indicates no target residue because of lack of detection by MEDS

performance of MEDS and its accompanying test for episodic diversifying selection were similar, and both clearly outperformed
the FEL test for constant diversifying selection. On other datasets,
greater differences between MEDS and the test for episodic diversifying selection have been observed. In one case (32), on a much
larger RT phylogeny, MEDS detected 16 substitutions (13
DRAMs) while the test for episodic diversifying selection identified
only 4 sites (2 DRAMs). The factors contributing to the performance differences between datasets are still being explored.
Another model, Episodic Directional Evolution of Protein
Sequences (EDEPS), was proposed by Murrell et al. (32). EDEPS extends DEPS, proposed by Kosakovsky Pond et al. (33). EDEPS also
detects sites with increased substitution rates toward specific amino
acids, but it differs from MEDS in two ways: (1) EDEPS models
directional selection of amino acid rather than codon sequences and
(2) EDEPS uses a Random Effects Likelihood (REL) framework to
bias selection toward amino acids across all sites, relying on an empirical Bayes analysis to identify sites of interest. As in MEDS, accelerated
substitutions toward a target residue T are restricted to foreground
branches. Background branches evolve according to a baseline protein
substitution model, which, for this task, would be the HIV-Between
empirical model (34). It is well known that amino acid substitution
rates depend on the residues involved (e.g., see ref. 27), and specifying
a baseline model which includes unequal substitution rates provides a

256

S.L.K. Pond et al.

qualitative advance over MEDS. Conversely, because EDEPS works with protein sequences, the natural proxy of neutral evolution is not
available. The performance of EDEPS is similar to MEDS on most
datasets tested so far.
MEDS and EDEPS are applicable whenever a set of a priori
known branches on a phylogenetic tree are expected to be under
the same or similar kinds of selective pressure. The power to detect
directional selection on foreground branches is likely to decrease
with their quantity, although this has not been tested explicitly. Also
important is the arrangement of the foreground branches: if the
difference between amino acids on foreground and background
branches can be explained by a single substitution along a single
branch, then there will be little evidence for directional selection.
Paired HIV sequences sampled from the same patient before and
after therapy produce an ideal arrangement of disconnected foreground branches, although MEDS and EDEPS still perform well in
less ideal cases.

7. Epistasis
The effect of a mutation depends not only on the host environment, but also on the rest of the genome sequence in which it
occurs. Put another way, the rest of the genome comprises an
extremely significant part of the mutation's environment. The
dependence of a mutation's effect on other sites of the genome is
known as epistasis. Because epistasis is inherently nonlinear, it is
exceedingly difficult to model and hence to estimate from data.
In quantitative genetics, epistasis is assessed as a nonadditive component of variance attributable to interactions among genetic factors (35); however, this framework does not provide a means of
explicitly identifying those interactions. On the other extreme,
population genetics models tend to incorporate epistasis as a nonadditive term for the effects of mutant alleles at two loci (36). While
this scheme is mathematically convenient, it is not adequate for
the purpose of studying the evolution of genomes, even when they
are relatively small in size.
The comparative study of sequence variation offers a practical
approach to identifying which sites in the genome participate in
epistatic interactions. Literally hundreds of investigators across disjoint subdisciplines of biology have proposed various comparative
methods to accomplish this objective. Though we have not yet
encountered a comprehensive review, interested readers may find
useful references in (37–40). Essentially, all of these methods use
correlated patterns of substitution at different sites as evidence of
an interaction. Most methods apply some correlation test statistic

10

Evolution of Viral Genomes: Interplay Between Selection. . .

257

to pairs of amino acid sites (columns) in an alignment of protein sequences. (Göbel and colleagues (41) are often cited as
the first example of this approach, but they were in fact preceded by
Korber and colleagues (42).) While this is a convenient approach, it
ignores the biological reality that sequences are the product of
evolution. In other words, a significant correlation between sites
can easily be confounded by the evolutionary relationships among
the observed sequences, especially when certain combinations
of residues have been inherited by large numbers of descendants
without further modification (43). A statistical association between
codon sites in a gene would then be falsely attributed to some
functional interaction between residues (identity by state),
when it was in fact due to evolutionary relationships (identity by
descent).
To overcome the confounding effect of evolutionary history,
we have advocated and extended Felsenstein's approach of redirecting the focus of comparative study from patterns in the end
products of evolution to patterns in the process of evolution itself
(37, 44). We are looking for residues that coevolve such that a
substitution at one site accelerates the substitution rate at one or
more other sites. Let us proceed with an analysis of the example file
p24.seq, which contains an alignment of HIV-1 subtype C gag
p24 sequences and a tree. Because this analysis uses many of the
same functions as the standard selection analyses implemented in HyPhy, a template for detecting coevolution (epistasis)
can be accessed through the QuickSelectionDetection.bf batch
file listed in the Standard Analyses menu under the heading Positive Selection. In summary, a codon substitution model is fit to the
sequence alignment and a tree. Maximum likelihood parameter
estimates are subsequently used to reconstruct ancestral sequences
at the internal nodes of the tree using a fast algorithm proposed by
Pupko et al. (45). It is then straightforward to map mutations to
branches of the tree by comparing character states at each codon
site at the start and end of each branch. If the reconstructed/
observed character states are different, then a mutation must have
occurred at some point along the branch (46, 47). This mapping
procedure does not account for cases where more than one substitution occurs at the same codon site along a branch, but see ref. 48 for a
method that can account for these multiple hits. When the
distributions of mutations mapped to branches of the tree are
significantly correlated between two codon sites, then we interpret
this outcome as evidence of an epistatic interaction between
those sites.
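The mapping step itself is straightforward to sketch. The following is an illustrative Python reconstruction, not HyPhy's implementation; the tree and state structures (`parent`, `states`) are hypothetical. A branch is credited with a substitution whenever the character states at its two ends differ.

```python
# Illustrative sketch (not HyPhy's implementation) of mapping substitutions
# to branches. `parent` maps each node to its parent, and `states[node][site]`
# holds the observed or reconstructed amino acid at that node and site;
# both structures are invented for this example.

def substitution_map(parent, states, n_sites):
    """Return {site: set of branches (named by child node) carrying a substitution}."""
    sub_map = {site: set() for site in range(n_sites)}
    for child, par in parent.items():
        for site in range(n_sites):
            # A branch carries a substitution when the character states
            # at its two ends differ.
            if states[child][site] != states[par][site]:
                sub_map[site].add(child)
    return sub_map

# Toy rooted 4-taxon tree: root -> anc1 -> (A, B) and root -> anc2 -> (C, D),
# with two amino acid sites per sequence.
parent = {"anc1": "root", "anc2": "root", "A": "anc1", "B": "anc1",
          "C": "anc2", "D": "anc2"}
states = {"root": "KM", "anc1": "KM", "anc2": "NM",
          "A": "KM", "B": "KV", "C": "NV", "D": "NM"}

smap = substitution_map(parent, states, 2)
print(sorted(smap[0]))  # ['anc2'] -- one K->N event on an internal branch
print(sorted(smap[1]))  # ['B', 'C'] -- two independent M->V events
```

Sites whose substitution sets repeatedly involve the same branches across the tree are what the subsequent correlation and Bayesian network analyses screen for.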
We could then proceed by comparing mutational maps
between all pairs of codon sites in the alignment (49, 50). While
this is a convenient framework for identifying sites with interactions, it is subject to the problem of confounding.

To use a popular example in the artificial intelligence literature, a pairwise analysis finds a significant association between sales of
ice cream and the number of drownings at a public pool. A naïve
observer would then be led to believe that consuming ice cream
causes one to drown. Of course, what is actually happening is that
ice-cream sales and the number of swimmers at the pool are both
greater on warm and sunny days, and that drowning is more
frequent when more people are swimming. It is difficult to reconstruct this system of cause and effect by taking an approach that
evaluates only pairs of variables at a time. Following through with
our analogy, an agent of selection is like our "sunny days" variable
that is the common cause of multiple effects (different sites in
the genome, like the "ice-cream sales" and "swimming" variables).
If we are limited to evaluating pairs of variables at a time, we may
consequently be led to falsely report an epistatic interaction
between sites in the genome. Similarly, we could overestimate the
number of sites influenced by an agent of selection because of actual
epistatic interactions among those sites.
To account for this problem of confounding, we use Bayesian
networks to analyze the joint distributions of mutations that we
have mapped to the tree at all sites by maximum likelihood.
A Bayesian network is a graphical model giving a compact representation of the joint distribution of variables (51). It comprises
nodes that represent variables, and edges that connect nodes
to indicate that one variable is conditionally dependent on another.
In our context, the nodes correspond to mutational maps for
different columns (codon sites) in the alignment or the presence/
absence of a selective agent, and edges correspond to a statistical
association between maps. By evaluating the joint distribution, we
are assessing all the variables at once and are, therefore, able to
discriminate between real associations (sunny days and ice cream)
and those that are spurious (ice cream and drowning). Our ideal
objective is to find the Bayesian network that best explains the data.
However, the space of all possible Bayesian networks is so astronomically large for a modest number of variables that we must
abandon any hope of finding a single best network. Instead, we
take a Bayesian approach and endeavor to generate a random
sample of Bayesian networks from the posterior distribution that
is shaped by the data (52). (In fact, we follow ref. 52 and further
collapse the space of Bayesian networks into a new space over
permutations of "node orders", i.e., assertions about which
other nodes in the network a given node can be conditionally
dependent on. This transformation greatly reduces the size of
model space and smoothes the posterior probability surface as
well, resulting in better model convergence.) A random sample is
obtained by Markov chain Monte Carlo (MCMC) using
the Metropolis–Hastings sampling algorithm (53, 54), which

essentially explores the posterior probability surface by taking a biased random walk from one point in model space to another.
Execute the QuickSelectionDetection.bf batch file to analyze the
p24.seq file. This first set of options is the same as for analyses of
positive selection, and may be familiar to you.
1. Choose Genetic Code – Because HIV replicates in human hosts,
choose the Universal option.
2. New/Restore – This menu gives you the opportunity to reload a
nucleotide model fit from a previous execution of QuickSelectionDetection.bf on the same alignment and tree, which can
save computational time. Since this is probably the first time
you have run this batch file on these data, select New.
3. Select the p24.seq file.
4. Model Options – Specification of the nucleotide and codon substitution models. The Default is to fit the HKY85 nucleotide
substitution model to refine estimates of branch lengths in the
tree, followed by fitting the Muse–Gaut codon substitution
model crossed with the HKY85 model with branch lengths in
the tree constrained to scale by a factor that is estimated from
the data.
5. Enter "y" in the console window to use the tree included with
the FASTA file.
6. Specify a file name to export the maximum likelihood fit of the
nucleotide model.
7. dN/dS bias parameter options – Select Estimate dN/dS only.
At this point, the batch file branches into different types of
analyses with very different options. You should select the option
BGM co-evolution (Bayesian Graphical Model (BGM) is a
synonym for a Bayesian network) in order to perform ancestral
reconstruction and mutational mapping followed by Bayesian
network analysis. These are the analysis options that are raised by
the BGM coevolution pipeline.
1. Treatment of Ambiguities – Ambiguous nucleotide calls can be
interpreted according to one of two extreme assumptions:
either that they are all errors and should be resolved to the
predominant nucleotide in the alignment (Resolved) or that
they reflect genuine polymorphisms in the population (Averaged). This assumption affects how ancestral sequences are
reconstructed at internal nodes of the tree.
2. Substitution count cutoff – The HyPhy console window displays
some descriptive statistics of the distribution of the number of
branches to which substitutions have been mapped, across all sites in the
alignment. For our p24 example, the mean and median of this
distribution are 12.3 and 2, respectively. This indicates that

most codon sites have two or fewer substitutions mapped to the tree, while a small number of highly variable sites inflate the mean. This is useful information because it indicates
that we can ignore the majority of codon sites in the alignment
either because they are completely conserved or because there is
insufficient variation to detect coevolution (i.e., one or two
substitutions). As a very rough rule of thumb, we do not like
to have more variables sent to the Bayesian network than the
number of samples that we have observed. The p24 alignment
comprises 541 sequences; a cutoff of 10 mapped substitutions
per site leaves 51 codon sites to analyze.
3. Maximum parents – We assume that nodes cannot be conditionally dependent on more than X other nodes (parents),
where X is either 1 or 2 in this template batch file. For example,
the space of all possible networks includes those in which one
variable is asserted to depend on every other variable. Such
cases are either unlikely or not very informative, and this simplifying assumption drastically reduces the space of all possible
Bayesian networks. For a quick preliminary analysis, you should
select a limit of 1.
4. Duration of MCMC chain – This needs to be a large number in
order for the MetropolisHastings sampler to explore the
parameter space long enough to obtain an adequate sample,
given that autocorrelation in the chain sample is inevitable, i.e.,
that adjacent states in the chain are very similar. The default
value of 10^5 steps is adequate for most one-parent networks.
A two-parent network analysis requires a longer chain and
burn-in (see the next item).
5. Duration of MCMC burn-in – The number of steps that are
discarded as a burn-in period, i.e., for the sampler to move into
favorable regions of model space with a relatively high posterior
density. Note that this number must be smaller than the
previous setting (you cannot discard more steps than you
have run the chain for).
6. Sampling interval – The length of the interval for thinning
the chain. Because the chain is inevitably highly autocorrelated,
it is necessary to thin the chain sample by taking every nth step.
The default interval size of 1,000 steps is reasonable and results
in a final sample size of 90 under all default settings, i.e.,
(10^5 - 10^4)/1,000.
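The sample-size arithmetic under the default settings can be checked directly:

```python
# Number of thinned MCMC samples retained under the default BGM settings:
# a 10^5-step chain, a 10^4-step burn-in, and a 1,000-step thinning interval.
chain_length = 10**5
burn_in = 10**4
thinning = 1000

n_samples = (chain_length - burn_in) // thinning
print(n_samples)  # 90
```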
7. Ancestral resampling – The reconstruction of ancestral
sequences by maximum likelihood is increasingly uncertain as
we go deeper into the tree, i.e., further from the observed
sequences. To address this uncertainty, we provide an option
of using a nonparametric bootstrap procedure to resample
ancestral sequences from the posterior probability distributions

of character states at each internal node of the tree. The Bayesian network analysis can then be run on each of these bootstrap
samples. Because the computational time scales linearly with the
number of samples, we provide an MPI implementation that
can distribute MCMC runs across processors; however, this
requires that you are running an MPI-enabled command line
version of HyPhy, i.e., HYPHYMPI.
8. Output files – You will be prompted to identify two files to write
analytical results to. The first file contains the edge marginal
posterior probabilities, i.e., the proportion of the chain sample
that contains a given edge. The second file contains a graph
(encoded in the DOT language interpreted by GraphViz,
which is open-source software for rendering graphs that can
be downloaded from http://www.graphviz.org) comprising all
edges with marginal posterior probabilities exceeding a cutoff
of 95%.
When you have run through this analysis using the gag p24
example data, you will find that HyPhy has spawned two new
windows. One window is labeled MCMC Trace and displays
the chain sample that has been thinned down to 90 steps (Fig. 4).
By default, the chain is displayed as a scatterplot, but it is easy to
switch to a step plot by selecting this option from the drop-down
menu labeled Type. This plot is a convenient means for spotting
severe cases of autocorrelation, i.e., a clear trend of increase and/or
decrease over the length of the thinned sample. For example, if
there was a clear monotonic increase in the sample over time, then
it would be highly likely that the sample size was inadequate and the
model would need to be run for substantially longer duration.
A second window displays a histogram summarizing the edge
marginal posterior probabilities in the thinned sample (Fig. 4).
A U-shaped distribution indicates that there is sufficient data to
identify a minority of edges that are highly likely to be in the
Bayesian network from a background of edges that are highly
unlikely to occur. This distribution can be used to customize the
cutoff value (default 0.95) used to generate a consensus Bayesian
network (Fig. 5). For example, the histogram in Fig. 4 suggests that
there are many edges with marginal posterior probabilities between
0.85 and 0.95 that one might be interested in seeing included into
the default network.
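The quantities summarized in these diagnostics are easy to recompute from a sample of graphs. A minimal sketch follows; the four sampled networks are invented for illustration, each represented as a set of directed (parent, child) edges:

```python
# Sketch of deriving edge marginal posterior probabilities and a consensus
# network from a thinned MCMC sample of graphs. Each sampled network is a
# set of directed (parent, child) edges; the sample below is made up.
from collections import Counter

def edge_marginals(sampled_graphs):
    """Fraction of sampled graphs that contain each edge."""
    counts = Counter(edge for g in sampled_graphs for edge in g)
    n = len(sampled_graphs)
    return {edge: c / n for edge, c in counts.items()}

def consensus_network(marginals, cutoff=0.95):
    """Edges whose marginal posterior probability meets the cutoff."""
    return {edge for edge, p in marginals.items() if p >= cutoff}

sample = [{("45", "54"), ("54", "58")},
          {("45", "54"), ("54", "58")},
          {("45", "54")},
          {("45", "54"), ("120", "207")}]

marg = edge_marginals(sample)
print(marg[("45", "54")])                    # 1.0 -- present in every sample
print(consensus_network(marg, cutoff=0.95))  # {('45', '54')}
```

Customizing the cutoff suggested by the histogram (e.g., 0.9 instead of 0.95) is a one-argument change in this representation.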

8. Identifying Agents of Selection: The CTL Response

In previous sections, we have outlined several methods for detecting the signature of selection from an alignment of homologous
sequences. It is much more difficult to identify which aspects of the

Fig. 4. HyPhy BGM diagnostics. (Left) A graph depicting a thinned Markov chain Monte Carlo sample from the posterior
probability distribution of Bayesian networks given the HIV-1 p24 example data. (Caveat: Posterior values are labeled as
"LogL", which is an abbreviation of log-transformed likelihood.) (Right) A histogram summarizing the edge marginal
posterior probabilities from the same analysis.

host environment were responsible for favoring one variant of a genome over another. These external factors are generally known as
the "agents of selection" (55). For example, the cellular immune
response identifies infected host cells for lysis by cytotoxic T lymphocytes (CTLs), based on
the presence of peptides derived from virus proteins by the antigen-processing pathway in the cell. In human cells, 9-mer peptides are
recognized and bound by human leukocyte antigen (HLA)

Fig. 5. A graph depicting compensatory interactions inferred from the alignment of HIV-1 subtype C gag p24 sequences.
Each square node represents a position in the p24 protein sequence that participates in at least one interaction. The
arrows (edges) representing those interactions are annotated with the fraction of graphs in the chain sample that
contain the edge.

molecules that are encoded by the highly variable major histocompatibility complex (MHC) class I loci.
Consequently, many sites in a virus genome experience strong
selection for amino acid replacements because they encode components of a protein that are preferentially targeted by the antigen-processing pathway, such as the anchor residues that determine
HLA-binding specificities. We would like to know which regions
of a virus genome are enriched for sites targeted by the cellular
immune response – such regions can identify peptides to be
incorporated into anti-HIV vaccine candidates (56). However,
there are hundreds of alleles that have been described at the three
MHC class I loci (denoted A, B, and C) and each one can potentially target a different set of sites in the HIV-1 genome.
This is a situation that is amenable to being analyzed with a
Bayesian network because potential agents of selection in the host
environment can simply be handled as additional variables in the
graph (57). Simply put, we want to know whether substitutions tend to
occur more often than expected by chance on branches that represent hosts
that are presenting a particular agent of selection. The capacity
of Bayesian networks to find causal relationships in the midst of
potential confounding variables is an important strength of this
application. However, there is a catch – we cannot reconstruct
host environments in the virus phylogeny. This limits an analysis
of associations between agents of selection and site-specific rates of
virus evolution to the terminal branches of the tree: in other words,
the branches that are leading directly to observed virus sequences.
Unfortunately, that means that we must sacrifice a substantial
amount of valuable information on virus genome coevolution
that has been mapped to internal branches of the tree.
In order to accomplish such an analysis, we need to extract the
substitution map that has been generated by the QuickSelectionDetection batch file. The following is a code snippet that writes this
substitution map to a file.


The first column contains the sequence names, which you can
use to link each row of the substitution map to whatever agents of
selection (or even phenotypes) that you have obtained for these
sequences. When we were downloading HIV p24 sequences from
the LANL Web site, we happened to include HLA genotypes into
the sequence annotations. An example file containing a binary-state
matrix corresponding to amino acid substitutions mapped to terminal branches leading to each sequence, as well as columns indicating the presence or absence of common HLA serotypes, is
provided as a comma-delimited file named agents.csv. HLA serotypes are labeled in accordance with standard nomenclature, e.g.,
A24. (Note that codons in HIV p24 are numbered and prefixed
with an X in this example file, which was simply a consequence of
merging the serotype and codon data in the statistical programming environment R, which does not permit variable names to
begin with a number.)
In order to perform a BGM analysis outside of the QuickSelectionDetection batch file, we have provided a custom batch file called
BgmAnalysis.bf. The options for this batch file are very similar to
those raised by QuickSelectionDetection, with two important
exceptions. First, you need to specify a file containing a comma-delimited matrix, where each column represents an integer-valued
variable (substitution map at a given codon site, or the presence/
absence of an agent of selection, for example) and each row represents a terminal branch in the phylogeny. For each column, the
integer values must start at 0 and progress in increments of 1; in
other words, a variable cannot skip 1 and go directly to 2. In the
example matrix agents.csv, columns with HLA serotypes in the
header contain a 0 to indicate that the serotype is absent and a 1
to indicate that it is present in the corresponding host. Second,
the number of steps specified for a burn-in period is appended to
the length of the chain, rather than indicating the number of
steps in the chain to be discarded. For example, setting the
chain length to 100,000 steps and the burn-in to 10,000 steps
now results in a total of 110,000 steps, of which the first 10,000
are discarded before thinning. We recommend setting the

Fig. 6. A Bayesian network inferred from the joint distributions of codon site-specific substitutions mapped to terminal
branches of the HIV-1 p24 phylogeny (open nodes), and HLA serotypes presented by the corresponding host environments
(filled nodes). A marginal edge posterior cutoff of 0.9 was used to generate this consensus network. Edges between HLA
serotypes and HIV p24 codon sites are highlighted in bold.

maximum number of parents to 2 because this makes it easier to detect cases where the rate of evolution at one site is influenced by
both an agent of selection (HLA serotype) as well as a second site in
the genome.
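The integer-coding requirement can be checked before launching BgmAnalysis.bf. The following sketch is illustrative and not part of HyPhy; it assumes a header row followed by one row of integer codes per terminal branch, and the column names and values are invented:

```python
# Illustrative validation that each column of a BgmAnalysis-style input
# matrix is coded with consecutive integers starting at 0 (e.g., 0/1 for
# absence/presence of an HLA serotype). Assumed layout: a header row, then
# one row per terminal branch; names and values below are invented.
import csv, io

def invalid_columns(text):
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    bad = []
    for j, name in enumerate(header):
        levels = sorted({int(row[j]) for row in data})
        # Levels must be exactly 0, 1, ..., k: a variable cannot skip a value.
        if levels != list(range(len(levels))):
            bad.append(name)
    return bad

good = "X45,X54,A24\n0,1,0\n1,0,1\n0,2,1\n"
print(invalid_columns(good))  # []

bad = "X45,A24\n0,1\n2,0\n"   # X45 skips 1 and goes directly to 2
print(invalid_columns(bad))   # ['X45']
```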
Results from performing this analysis are displayed in Fig. 6.
Note that many epistatic interactions (edges between nodes representing codon sites) detected by our preliminary analysis (Fig. 5)
are recovered in this network as well: for example, 45/54/58.
Edges between nodes representing HLA serotypes can generally
be interpreted as alleles in linkage disequilibrium, i.e., MHC haplotypes, although we cannot rule out joint effects of the serotypes
on a third variable that has not been incorporated into the analysis.
We identify four edges between HLA serotypes and codon sites in
HIV-1 p24. First, HLA serotype B81 influences nonsynonymous
substitution rates at codon sites 45 and 54, which are within or
adjacent to the known p24 epitope TPQDLNTML (58). Second,
codon site 110 is influenced by HLA serotypes A2 and B58, which
is consistent with its membership in the known epitopes TSTLQEQIGW and STLQEQIGWM, respectively (59).

9. Exercises

9.1. Selection in the Presence of Recombination

In this exercise, we examine the CFVg.fas alignment and its GARD-partitioned counterpart CFVg-gard.nex for selection at
individual sites, with and without correcting for recombination,
using the fixed effects likelihood (FEL) approach (60).


Launch HyPhy, select Selection/Recombination from the standard analysis menu, and then choose QuickSelectionDetectionMF.bf.
Use Universal genetic code, New Analysis, Custom nucleotide
model, 012345 to specify the general time reversible model, 1 dataset to be analyzed, either CFVg.fas or CFVg-gard.nex for
the input alignment, Estimate dN/dS only, the FEL method, 0.1
for the significance level for LRTs, and All for the branch option. Save
results (comma-separated values) to a file (taking care to keep
partitioned and unpartitioned results in separate files).
As HyPhy performs the analysis, a typical output line may look
like this:
Site 195 dN/dS inf dN 4.9848 dS 0.0000 dS(dN) 2.3353
Full Log(L) -14.5463 LRT 3.9208 p-value 0.04769 *P

Here, codon 195 has the maximum likelihood synonymous rate (dS) inferred at 0, and the nonsynonymous rate (dN) at
4.9848 (their ratio is infinite). The log-likelihood of the site with
these parameters is -14.5463. The null model, which forces dN =
dS, infers the shared value at 2.3353. The LRT for non-neutral evolution
has the test statistic of 3.9208 and the p-value of 0.04769 (which is
significant at the specified level). The site is called positively selected
(*P) because the test is significant and dN > dS. First, compare the
list of sites reported as positively selected by the two analyses.
Second, load the two resulting .csv files into a plotting program
and draw a scatterplot of the p-values from the two analyses against
each other. Do you think that there is an effect depending on
whether or not we correct for the possible confounding caused by
recombination?
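The reported p-value can be reproduced from the LRT statistic: FEL compares twice the log-likelihood difference between the alternative and the dN = dS null model to a chi-squared distribution with 1 degree of freedom, whose tail area has the closed form erfc(sqrt(x/2)):

```python
# Recompute the FEL p-value for codon 195 from its LRT statistic, using the
# closed-form chi-squared (1 df) survival function erfc(sqrt(x / 2)).
from math import erfc, sqrt

lrt_statistic = 3.9208                    # reported by HyPhy for site 195
p_value = erfc(sqrt(lrt_statistic / 2))   # chi-squared (1 df) tail area
print(round(p_value, 5))                  # 0.04769, matching the output line
```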
9.2. Directional Selection

In this exercise, we use a Directional Evolution in Protein Sequences (DEPS (33)) analysis to identify sites subject to directional positive selection in an alignment of 26 reverse transcriptase
protein sequences (HIV_RT_AA_DEPS.nex) from HIV-infected
patients sampled before and after antiretroviral therapy.
Launch HyPhy, select Positive Selection from the standard analysis menu, and then choose DirectionalREL.bf. Choose Reload
(the initial model fitting takes about 10–15 min, so the file you
have downloaded contains the HIV-Between model (suitable for
the analysis of HIV sequences) prefitted to this alignment), and use
the navigation box to find HIV_RT_AA_DEPS.nex, Unknown root.
HyPhy goes to work and prints some text to the console window.
It tries every possible amino acid residue and computes a p-value
that some proportion of sites in the alignment are evolving
directionally toward that residue. Each of these models are written
to a fit file, e.g., HIV_RT_AA_DEPS.nex.A. For residues with
significant p-values (after a multiple test correction), individual

sites which may be evolving under directional selection are identified.
For instance, the text block below indicates that 1.8% of the
sites show strong bias toward N (p < 0.001). Rates to N are
32.576 times faster than under the HIV-Between protein
substitution model, leading to an expected 53.828% frequency increase of residue N over the length of the tree.
[PHASE 12.1] Model biased for N
[PHASE 12.2] Finished with the model biased for N. LogL = -2032.825
Bias term = 32.576
Proportion = 0.018
Exp freq increase = 53.828%
p-value = 0.000

Three residues show evidence of directional evolution: W, N, and V:
[Residues (and p-values) for which there is evidence
of directional selection
W: 2.96368e-06
N: 2.19281e-05
V: 0.000763188

Site 103 is evolving directionally toward N, 184 toward V, and 210 and 219 toward W:
The list of sites which show evidence of directional selection (Bayes Factor > 20) together with the target residues and inferred substitution counts

Site 103 (max BF = 2.78192e+07)
Preferred residues: N
Substitution counts:
K->N: 8/N->K: 0

Site 184 (max BF = 13813.7)
Preferred residues: V
Substitution counts:
M->V: 10/V->M: 0

Site 210 (max BF = 1.96712e+07)
Preferred residues: W
Substitution counts:
F->L: 0/L->F: 1
L->W: 5/W->L: 0

Site 219 (max BF = 135.304)
Preferred residues: W
Substitution counts:
K->Q: 1/Q->K: 0
K->W: 1/W->K: 0

9.3. Coevolution

Because we have gone through an example of using HyPhy to detect coevolution in HIV-1 p24, let us work through a different sort of
example using the same data. (If you have just run the one-parent
Bayesian network analysis on these data using QuickSelectionDetection.bf, you will probably already have the likelihood function in
memory under the name lf, and can skip the following.) Select
the template batch file AnalyzeCodonData.bf from the Standard
Analyses menu and choose the following options.
1. Choose Genetic Code – Select Universal.
2. Tree Topology – Select p24_tree, which corresponds to the
tree that was included with the alignment in the data file.
3. Select the p24.seq file.
4. Choose one of the standard models – Select MG94CUSTOM,
which corresponds to the Muse–Gaut codon substitution
model crossed with any nucleotide substitution model.
5. Model options – Select Global to estimate one set of model
parameters, such as the transition/transversion rate bias, for all
branches in the tree.
6. Enter a PAUP*-style model specification string. For example,
HKY85 is specified by the string 010010.
7. Use the tree included with the sequences in the p24.seq file
by typing y into the console window and hitting ENTER.
8. Branch Lengths – Select Proportional to input tree. Otherwise, we will be estimating over 1,000 branch length parameters in the tree, which is a very time-consuming analysis.
HyPhy fits a codon substitution model to these data. This
analysis takes at least a few minutes. When it is complete (HyPhy
spools a Newick tree string to the console), select the User Actions
icon in the bottom-right corner of the console window (the icon is
a pair of interlocked gears) and choose SimulateFromLF. You will
be asked how many replicates to simulate. Type 1 in the console
window and hit enter. What we will be doing is simulating the
evolution of codon sequences along the HIV-1 p24 phylogeny
using the model parameters that we have just estimated from the
data. Specify a file to save the simulated data to; it will be output in a
NEXUS format, so you may want to use a .nex file extension.
Note that the filename that you specify is used as a prefix for all

replicate simulations, which are distinguished from one another by an integer-valued suffix.
Now, perform a BGM coevolution analysis on the simulated
alignment by executing the QuickSelectionDetection.bf batch file
and following the instructions in Subheading 7 (use all the suggested default values). There should not be any sites identified as
participating in an epistatic interaction because the codon substitution model from which we simulated these sequences explicitly
assumes that the evolution of each codon site is independent.
Performing this analysis on simulated data is a useful negative
control and assesses the false-positive rate. When we performed
this analysis, we found that none of the edges in the network had
a marginal posterior probability exceeding the 0.95 cutoff, and only
two edges had probabilities greater than 0.9.
Now, open the simulated alignment in a HyPhy data panel by
selecting Open Data File. . . from the File drop-down menu. Select
all sites by choosing Select All from the Edit drop-down menu and
create a data partition object by choosing Selection -> Partition
from the Data drop-down menu. This object appears as a new row
in the bottommost field of the Data Panel. Set the Partition Type as
Codon and then translate the alignment into protein sequences
by selecting Aminoacid Translation from the Additional Info submenu within the Data drop-down menu. Click on All in the
window that appears to translate all sequences in the alignment and
Map to missing data in the following window to leave ambiguous
nucleotides as unresolved.
A new data panel appears with the protein sequence alignment.
Select a small range of amino acid sites (about 30–40) by clicking
and shift-clicking in the data panel and create a partition from this
selection. Click on the magnifying-glass icon to open the Data
Operations menu and select Association [Fisher exact]. Enter a
significance level of 0.05 in the window that appears. (You may
receive a warning that HyPhy needs to create X data partitions for
significant clusters identified by this association test statistic; if so,
hit Cancel.) This mimics a pairwise correlation analysis of protein
sequences that accounts for neither phylogenetic relationships nor
confounding. Depending on which sites you select, you will
observe some number of false positives; when we selected residues
40–80, we obtained 53 pairs with a p-value below 0.05, with some
as low as 10^-7.
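The naive pairwise test just performed can also be sketched outside the GUI. The following is an illustrative Python reimplementation, not HyPhy's code (all function names are ours): each pair of alignment columns is dichotomized on its most common residue and tested with a two-sided Fisher exact test, with no correction for phylogeny.

```python
# Illustrative sketch (not HyPhy's implementation) of the naive pairwise
# Fisher exact association test applied above, which ignores phylogeny.
from math import comb
from collections import Counter

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    n, r1, c1 = a + b + c + d, a + b, a + c
    def hyper(x):
        # Hypergeometric probability of a table with top-left cell x.
        return comb(r1, x) * comb(n - r1, c1 - x) / comb(n, c1)
    p_obs = hyper(a)
    lo, hi = max(0, r1 + c1 - n), min(r1, c1)
    # Two-sided: sum all tables at most as probable as the observed one.
    return sum(p for p in (hyper(x) for x in range(lo, hi + 1))
               if p <= p_obs + 1e-12)

def column_pair_pvalue(col1, col2):
    """Dichotomize each column on its most common residue, then test."""
    m1 = Counter(col1).most_common(1)[0][0]
    m2 = Counter(col2).most_common(1)[0][0]
    pairs = list(zip(col1, col2))
    a = sum(x == m1 and y == m2 for x, y in pairs)
    b = sum(x == m1 and y != m2 for x, y in pairs)
    c = sum(x != m1 and y == m2 for x, y in pairs)
    d = sum(x != m1 and y != m2 for x, y in pairs)
    return fisher_exact_2x2(a, b, c, d)

# Two perfectly covarying columns across eight sequences look highly
# "associated" even when the pattern is fully explained by shared ancestry:
p_value = column_pair_pvalue("AAAAVVVV", "LLLLFFFF")
```

Because all eight sequences could descend from a single substitution on one internal branch, such a small p-value is exactly the kind of phylogenetically confounded signal that the BGM analysis is designed to avoid.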
References
1. J W Drake, B Charlesworth, D Charlesworth, and J F Crow. Rates of spontaneous mutation. Genetics, 148(4):1667–86, Apr 1998.
2. R Nielsen and Z Yang. Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics, 148(3):929–36, Mar 1998.
3. E C Holmes. Comparative studies of RNA virus evolution. In Esteban Domingo, Colin Ross Parrish, and J J Holland, editors, Origin and evolution of viruses, chapter 5, pages 119–134. Elsevier, 2nd edition, 2008.
4. Sergei L Kosakovsky Pond, Simon D W Frost, and Spencer V Muse. HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21(5):676–9, Mar 2005.
5. Gavin J D Smith, Dhanasekaran Vijaykrishna, Justin Bahl, Samantha J Lycett, Michael Worobey, Oliver G Pybus, Siu Kit Ma, Chung Lam Cheung, Jayna Raghwani, Samir Bhatt, J S Malik Peiris, Yi Guan, and Andrew Rambaut. Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature, 459(7250):1122–5, Jun 2009.
6. Timothy E Schlub, Redmond P Smyth, Andrew J Grimm, Johnson Mak, and Miles P Davenport. Accurately measuring recombination between closely related HIV-1 genomes. PLoS Comput Biol, 6(4):e1000766, Apr 2010.
7. Davey M Smith, Susanne J May, Samantha Tweeten, Lydia Drumright, Mary E Pacold, Sergei L Kosakovsky Pond, Rick L Pesano, Yolanda S Lie, Douglas D Richman, Simon D W Frost, Christopher H Woelk, and Susan J Little. A public health model for the molecular surveillance of HIV transmission in San Diego, California. AIDS, 23(2):225–32, Jan 2009.
8. Barbara S Taylor, Magdalena E Sobieszczyk, Francine E McCutchan, and Scott M Hammer. The challenge of HIV-1 subtype diversity. N Engl J Med, 358(15):1590–602, Apr 2008.
9. E R Chare and E C Holmes. A phylogenetic survey of recombination frequency in plant RNA viruses. Arch Virol, 151(5):933–46, May 2006.
10. Elizabeth R Chare, Ernest A Gould, and Edward C Holmes. Phylogenetic analysis reveals a low rate of homologous recombination in negative-sense RNA viruses. J Gen Virol, 84(Pt 10):2691–703, Oct 2003.
11. M Worobey and E C Holmes. Evolutionary aspects of recombination in RNA viruses. J Gen Virol, 80(Pt 10):2535–43, Oct 1999.
12. D Posada, K A Crandall, and E C Holmes. Recombination in evolutionary genomics. Annual Review of Genetics, 36:75–97, 2002.
13. D Posada and K A Crandall. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci USA, 98(24):13757–62, Nov 2001.
14. Sergei L Kosakovsky Pond, David Posada, Michael B Gravenor, Christopher H Woelk, and Simon D W Frost. Automated phylogenetic detection of recombination using a genetic algorithm. Mol Biol Evol, 23(10):1891–901, Oct 2006.
15. N Saitou and M Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406–25, Jul 1987.
16. K Tamura and M Nei. Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol Biol Evol, 10(3):512–26, May 1993.
17. Art F Y Poon, Simon D W Frost, and Sergei L Kosakovsky Pond. Detecting signatures of selection from DNA sequences using Datamonkey. Methods Mol Biol, 537:163–83, 2009.
18. H. Shimodaira and M. Hasegawa. Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol, 16:1114–1116, 1999.
19. N Goldman, J P Anderson, and A G Rodrigo. Likelihood-based tests of topologies in phylogenetics. Syst Biol, 49(4):652–70, Dec 2000.
20. Sergei L Kosakovsky Pond, David Posada, Eric
Stawiski, Colombe Chappey, Art F Y Poon,
Gareth Hughes, Esther Fearnhill, Mike B
Gravenor, Andrew J Leigh Brown, and Simon
D W Frost. An evolutionary model-based algorithm for accurate phylogenetic breakpoint
mapping and subtype prediction in HIV-1.
PLoS Comput Biol, 5(11):e1000581, Nov
2009.
21. Kenneth P Murphy, Paul Travers, Mark Walport, and Charles Janeway. Janeway's immunobiology. Garland Science, New York, 7th edition, 2008.
22. M S Hirsch and R T Schooley. Resistance to antiviral drugs: the end of innocence. N Engl J Med, 320(5):313–4, Feb 1989.
23. Maria Anisimova, Rasmus Nielsen, and Ziheng Yang. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics, 164(3):1229–36, Jul 2003.
24. Konrad Scheffler, Darren P Martin, and Cathal Seoighe. Robust inference of positive selection from recombining coding sequences. Bioinformatics, 22(20):2493–9, Oct 2006.
25. Wayne Delport, Konrad Scheffler, and Cathal Seoighe. Models of coding sequence evolution. Briefings in Bioinformatics, 10(1):97–109, January 2009.
26. Sergei Kosakovsky Pond and Spencer V Muse. Site-to-site variation of synonymous substitution rates. Mol Biol Evol, 22(12):2375–85, Dec 2005.
27. Wayne Delport, Konrad Scheffler, Gordon Botha, Mike B Gravenor, Spencer V Muse, and Sergei L Kosakovsky Pond. CodonTest: modeling amino acid substitution preferences in coding sequences. PLoS Comput Biol, 6(8), 2010.
28. A. L. Hughes. Looking for Darwin in all the wrong places: the misguided quest for positive selection at the nucleotide sequence level. Heredity, 99(4):364–373, July 2007.
29. Sergei L. Kosakovsky Pond, Simon D. W.
Frost, Zehava Grossman, Michael B. Gravenor,
Douglas D. Richman, and Andrew J. Brown.
Adaptation to different human populations by
HIV-1 revealed by codon-based analyses. PLoS
Comput Biol, 2(6):e62, June 2006.
30. Cathal Seoighe, Farahnaz Ketwaroo, Visva Pillay, Konrad Scheffler, Natasha Wood, Rodger Duffet, Marketa Zvelebil, Neil Martinson, James McIntyre, Lynn Morris, and Winston Hide. A model of directional selection applied to the evolution of drug resistance in HIV-1. Mol Biol Evol, 24(4):1025–1031, April 2007.
31. William R. Rice. Analyzing tables of statistical tests. Evolution, 43(1):223–225, 1989.
32. Ben Murrell, Tulio de Oliveira, Chris Seebregts, Sergei L Kosakovsky Pond, and Konrad Scheffler. Modeling HIV-1 drug resistance as episodic directional selection. Mol Biol Evol, in revision, 2011.
33. Sergei L. Kosakovsky Pond, Art F. Y. Poon, Andrew J. Leigh Brown, and Simon D. W. Frost. A maximum likelihood method for detecting directional evolution in protein sequences and its application to influenza A virus. Mol Biol Evol, 25(9):1809–1824, September 2008.
34. David C Nickle, Laura Heath, Mark A Jensen, Peter B Gilbert, James I Mullins, and Sergei L Kosakovsky Pond. HIV-specific probabilistic models of protein evolution. PLoS One, 2(6):e503, 2007.
35. T F Hansen and G P Wagner. Modeling genetic architecture: a multilinear theory of gene interaction. Theor Popul Biol, 59(1):61–86, Feb 2001.
36. James F Crow and Motoo Kimura. An introduction to population genetics theory. Harper &
Row, New York, 1970.
37. Art F Y Poon, Fraser I Lewis, Sergei L Kosakovsky Pond, and Simon D W Frost. An evolutionary-network model reveals stratified
interactions in the V3 loop of the HIV-1 envelope. PLoS Comput Biol, 3(11):e231, Nov
2007.
38. David S Horner, Walter Pirovano, and Graziano Pesole. Correlated substitution analysis and the prediction of amino acid structural contacts. Brief Bioinform, 9(1):46–56, Jan 2008.
39. Francisco M Codoñer and Mario A Fares. Why should we care about molecular coevolution? Evol Bioinform Online, 4:29–38, 2008.
40. Christopher A Brown and Kevin S Brown. Validation of coevolving residue algorithms via pipeline sensitivity analysis: ELSC and OMES and ZNMI, oh my! PLoS One, 5(6):e10779, 2010.
41. U Göbel, C Sander, R Schneider, and A Valencia. Correlated mutations and residue contacts in proteins. Proteins, 18(4):309–17, Apr 1994.
42. B T Korber, R M Farber, D H Wolpert, and A S Lapedes. Covariation of mutations in the V3 loop of human immunodeficiency virus type 1 envelope protein: an information theoretic analysis. Proc Natl Acad Sci U S A, 90(15):7176–80, Aug 1993.
43. J. Felsenstein. Phylogenies and the comparative method. Am. Nat., 125(1):1–15, 1985.
44. Art F Y Poon, Fraser I Lewis, Simon D W Frost, and Sergei L Kosakovsky Pond. Spidermonkey: rapid detection of co-evolving sites using Bayesian graphical models. Bioinformatics, 24(17):1949–50, Sep 2008.
45. T Pupko, I Pe'er, R Shamir, and D Graur. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol, 17(6):890–6, Jun 2000.
46. P Tufféry and P Darlu. Exploring a phylogenetic approach for the detection of correlated substitutions in proteins. Mol Biol Evol, 17(11):1753–9, Nov 2000.
47. Rasmus Nielsen. Mapping mutations on phylogenies. Syst Biol, 51(5):729–39, Oct 2002.
48. Julien Dutheil, Tal Pupko, Alain Jean-Marie, and Nicolas Galtier. A model-based approach for detecting coevolving positions in a molecule. Mol Biol Evol, 22(9):1919–28, Sep 2005.
49. Beth Shapiro, Andrew Rambaut, Oliver G Pybus, and Edward C Holmes. A phylogenetic method for detecting positive epistasis in gene sequences and its application to RNA virus evolution. Mol Biol Evol, 23(9):1724–30, Sep 2006.
50. John P Huelsenbeck, Rasmus Nielsen, and Jonathan P Bollback. Stochastic mapping of morphological characters. Syst Biol, 52(2):131–58, Apr 2003.
51. Judea Pearl. Causality: models, reasoning, and inference. Cambridge University Press, Cambridge, U.K., 2000.
52. Nir Friedman and Daphne Koller. Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003. 10.1023/A:1020249912095.
53. N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, and A. H. Teller. Equation of state calculations by fast computing machines. J. Chem. Phys., 21(6):1087–1092, 1953.
54. W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
55. M J Wade and S Kalisz. The causes of natural selection. Evolution, 44(8):1947–1955, 1990.
56. Brian Gaschen, Jesse Taylor, Karina Yusim, Brian Foley, Feng Gao, Dorothy Lang, Vladimir Novitsky, Barton Haynes, Beatrice H Hahn, Tanmoy Bhattacharya, and Bette Korber. Diversity considerations in HIV-1 vaccine selection. Science, 296(5577):2354–60, Jun 2002.
57. Art F Y Poon, Sergei L Kosakovsky Pond, Douglas D Richman, and Simon D W Frost. Mapping protease inhibitor resistance to human immunodeficiency virus type 1 sequence polymorphisms within patients. J Virol, 81(24):13598–607, Dec 2007.
58. V Novitsky, H Cao, N Rybak, P Gilbert, M F McLane, S Gaolekwe, T Peter, I Thior, T Ndungu, R Marlink, T H Lee, and M Essex. Magnitude and frequency of cytotoxic T-lymphocyte responses: identification of immunodominant regions of human immunodeficiency virus type 1 subtype C. J Virol, 76(20):10155–68, Oct 2002.
59. J Lieberman, J A Fabry, D M Fong, and G R Parkerson, 3rd. Recognition of a small number of diverse epitopes dominates the cytotoxic T lymphocyte response to HIV type 1 in an infected individual. AIDS Res Hum Retroviruses, 13(5):383–92, Mar 1997.
60. Sergei L. Kosakovsky Pond and Simon D. W. Frost. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol, 22(5):1208–1222, May 2005.

Part III
Population Genomics

Chapter 11
Association Mapping and Disease: Evolutionary Perspectives
Søren Besenbacher, Thomas Mailund, and Mikkel H. Schierup
Abstract
In this chapter, we give a short introduction to the genetics of complex disease, with special emphasis on
evolutionary models for disease genes and the effect of different models on the genetic architecture, and
finally survey the state of the art of genome-wide association studies.
Key words: Complex diseases, Association mapping, Genome-wide association studies, Common
disease/common variant

1. Introduction
The phenotype of an individual is determined by a combination of
its genotype and its environment. The degree to which the phenotype is determined by genotype rather than environment (the
balance of nature versus nurture) varies from trait to trait, with
some traits essentially independent of genotype and determined by
the environment, and others highly influenced by the genotype and
independent of the environment.
A measure quantifying the importance of genotype as compared to the environment is the heritability: the fraction of
the total variance in the population (referred to as the phenotypic
variance) that is explained by variation in the genotype among the individuals in the population (1). An interesting trait, such as a common disease, that exhibits a nontrivial heritability invites a search
for the genetic explanation behind the trait, that is,
identifying the genetic polymorphisms affecting the trait. The first
step toward this is association mapping: searching for polymorphisms statistically associated with the trait. Polymorphisms associated

with the disease need not influence the trait directly, but it is among
those that we will find the polymorphisms that do.

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_11,
© Springer Science+Business Media, LLC 2012
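As a toy numerical illustration of the heritability definition above (ours, not the chapter's), the quantity can be recovered from simulated data as the ratio of genotypic to phenotypic variance:

```python
# Toy illustration of heritability as Var(G) / Var(P): a phenotype is
# simulated as a genotypic value plus independent environmental noise.
import random
from statistics import pvariance

random.seed(1)
n = 20_000
genetic = [random.gauss(0, 2) for _ in range(n)]      # Var(G) = 4
environ = [random.gauss(0, 1) for _ in range(n)]      # Var(E) = 1
phenotype = [g + e for g, e in zip(genetic, environ)]

# Broad-sense heritability: fraction of phenotypic variance that is genetic.
h2 = pvariance(genetic) / pvariance(phenotype)        # expect about 4/5
```

With independent genetic and environmental components, the estimate lands near 4/(4 + 1) = 0.8, illustrating that heritability is a population-level variance ratio, not a property of any single individual.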
The variants at the various polymorphisms in the genome are
correlated (they are in linkage disequilibrium, LD), so we need
not examine all polymorphisms. By analyzing a few hundred
thousand to a million evident polymorphisms, we can capture
most of the common variation in the entire genome (2–4). In
finding such polymorphisms associated with disease risk, we locate
a region of the genome that contains one or more polymorphisms
that affect disease risk, and by examining such a region in more
detail we may locate these.
In the following, we first discuss possible genetic architectures
of complex diseases (mainly based on theoretical considerations,
since little is known about this) and then describe the state of the
art in genome-wide association studies (GWASs).

2. The Allelic Architecture of Genetic Determinants for Disease

2.1. Theoretical Models for the Allelic Architecture of Common Diseases

Many complex diseases show a rather large heritability. Each
genetic variant that increases the risk of disease contributes to the
measured heritability of the disease. A fraction of heritability can,
thus, be attributed to each variant. Doing this for the disease-risk
variants known at present, however, we only explain a small fraction
of the total heritability (5). The allelic architecture of common
diseases (in terms of the number of variants, their frequency, and
the risk associated with each) is, thus, poorly understood.
To illustrate the difficulties of inferring the architecture, we
consider two hypotheses: the common disease/common variant
(CDCV) hypothesis and the common disease/rare variant (CDRV)
hypothesis. CDCV states that most of the heritability can be
explained by a few high-frequency variants with moderate effects,
while CDRV states that most of the heritability can be explained by
moderate- or low-frequency variants with large effects.
Arguments for the expected number of alleles, their frequency, and
the risk associated with each allele are based on population genetics
considerations. The frequency distribution of independent mutations under mutation–drift–selection balance in a stable population
can be derived from diffusion approximations (see, e.g., Wright
(6)). Central parameters are the mutation rate, u, and the selection
for or against an allele, measured by s, scaled with the effective
population size, N. Mutations enter a population at a rate determined by Nu, and subsequently their frequencies change in a stochastic manner. If a mutant allele functions like its origin, s = 0 and
the allele is selectively neutral. It then rises and falls with equal

Fig. 1. Mutation, drift, and selection. New mutations enter a population at stochastic
intervals, determined by the mutation rate, u, and the effective population size, N. For low
or high frequencies, where the range of such frequencies is determined by the selection
factor, s, and the effective population size, the frequency of a mutant allele changes
stochastically. At medium frequencies, on the other hand, the frequency of the allele
changes up or down, depending on s, in a practically deterministic fashion. If a positively
selected allele reaches moderate frequency, it will quickly be brought to high frequency,
at a speed also determined by s and N.

probability, while if it is under selection it has a higher probability of
increasing than decreasing in frequency under positive selection
(s > 0), and conversely under negative selection (s < 0).
At very high or very low frequencies, selection has a very small
effect on the change in frequency, and the system evolves essentially
completely stochastically (genetic drift). At moderate frequencies,
however, the effect of selection is more pronounced, and given
sufficiently strong selection (of an order Ns >> 1) the direction of
change in the allele frequency is almost deterministically set by the
direction of selection. An allele subject to sufficiently strong
selection that happens to reach moderate frequencies either halts
its increase and drifts back to a low frequency or continues to high
frequencies, where eventually the stochastic effects again dominate
(see Fig. 1).
The range of frequencies where drift or selection dominates is
determined by the strength of selection (Ns) and the genotypic
characteristics of selection, e.g., dominance relations between
alleles. For very strong selection or in very large populations, the
process is predominantly deterministic for most frequencies, while
for weak selection or a small population the process is highly
stochastic for most frequencies. The time an allele can spend at
moderate frequencies is also determined by Ns and the selection
characteristics.
Pritchard and Cox (7, 8) used diffusion arguments to show that
common diseases are generally expected to be caused by a large
number of different mutations in the genes, where damage conveys


Fig. 2. Accumulation of several rare frequencies. If selection works against a set of alleles,
each will be kept at a low frequency. Their accumulated frequency, however, can be high
in the population.

disease susceptibility. This implies that genes commonly involved in


susceptibility exert their effect through multiple independent
mutations rather than a single mutation identical by descent in all
carriers (see Fig. 2). Each mutation, if under weak purifying
selection, is unlikely to reach moderate frequencies, and since the
population will only have few carriers of the disease allele it can only

11

Association Mapping and Disease: Evolutionary Perspectives

279

Fig. 3. A population out of equilibrium following an expansion. In a transition period


following a population expansion, the allele frequency patterns are different from the
patterns in a stable population.

explain little of the heritability. The accumulated frequency of


several alleles, each kept to low frequency by selection, can, however, reach moderate frequencies. So the heritability can be
explained either by many recurrent mutations or many independent
loci affecting the disease: the CDRV hypothesis.
Implicitly, this model assumes a population in mutation–selection equilibrium, and this does not necessarily match the human
population. The human population has recently expanded considerably in size, and changes in lifestyle, e.g., from hunter-gatherers to
farmers, might have changed the adaptive landscape.
The number of variants at mutation–selection–drift balance is
lower in a small population than in a large population. Therefore, in a
large population (such as present-day humans), a deleterious mutation is not expected at high frequency unless the population has
recently grown dramatically (9). This is illustrated as the transient
period in Fig. 3, where common genetic variants may contribute
much more to disease than under stable demographic conditions.
Following an expansion, alleles that would otherwise be held at low
frequency by selection may be at moderate frequencies, and thus
contribute a larger part of the heritability: the CDCV hypothesis.
Similarly, a recent change in the selective landscape of a population might cause an allele previously held at low frequency to come
under positive selection and rise in frequency, while alleles previously at high frequencies can drop in frequency due to negative
selection (10). In this transition period, an allele may be at a
moderate frequency and therefore contribute significantly to the
heritability of disease susceptibility (see Fig. 4).


Fig. 4. A population out of equilibrium following changes in the selective landscape. If the
selection of an allele changes direction, so the positively selected allele becomes
negatively selected and vice versa, it will eventually move through moderate frequencies.
Following a change in the selective landscape, it is thus possible to find alleles at
moderate frequencies that would not otherwise be found.

Depending on which hypothesis is valid, different mapping
strategies are needed. Association mapping, however, has so far
mainly assumed the CDCV hypothesis for two practical reasons.
The first is caused by the fact that the LD patterns across the
genome greatly restrict examination to only a small fraction of the
total possible variation. This is also the effect that greatly reduces
the cost of genome-wide studies by allowing a subset of polymorphisms to reflect the actual genetic variation in the human
population due to polymorphisms segregating common alleles.
Statistical analysis of association between polymorphism and disease is rather straightforward for moderate-frequency alleles but has
far less power to detect association with low-frequency alleles.
Thus, so far, only the CDCV hypothesis has been testable and the
bulk of association studies have, therefore, used it as their working
hypothesis.
2.2. The Allelic Frequency Spectrum in Humans

Empirically, the allelic frequency spectrum of SNPs in the human
genome is known in great detail for relatively common
alleles (minor allele frequency, MAF, > 5%) from the HapMap
alleles (minor allele frequency, MAF, > 5%) from the HapMap
project (11). The recently completed pilot project for the 1000
genome project (12) expands the knowledge on uncommon alleles
(1% < MAF < 5%), which should all be identified during the next
phase of the project. This will allow estimation, if not identification,
of the number of rare variants and the number of variants carried by
single individuals. Identification of very rare alleles awaits the
sequencing of many thousand individuals for larger pieces of
DNA. This may be achieved sooner for exons than for the rest of


the genome. There are already clear indications that the number of
rare variants will be larger than a simple extrapolation from the
common SNPs, due to the complex demographic history of humans
(12–15). Further, recent sequencing of 200 exomes in Europeans
reported an enrichment of nonsynonymous variants over synonymous variants among rare polymorphisms (14), strongly suggesting that many nonsynonymous variants are kept at low frequency
by natural selection. The proportion of these variants that are
involved in complex diseases and perhaps selected against due to
this effect is currently unknown.
The European population, where most GWASs so far have been
carried out, reveals a site frequency distribution of synonymous
variants that is generally shifted toward more common alleles as compared to the African population. This is most likely due to a severe
bottleneck connected to the out-of-Africa expansion, but also to
the expected excess of rare variants in a demographically stable
population of the same effective size under selective neutrality.
Excess of low-frequency variants is a hallmark of recent population
growth and/or weak selection against rare alleles. The latter is
visible in the contrast between the frequency distribution for synonymous and nonsynonymous alleles as explained above.

3. The Basic GWAS


The first GWASs were published around 2006 (16, 17), when
Illumina and Affymetrix first introduced genotyping chips that
made it possible to test hundreds of thousands of SNPs quickly
and inexpensively. The GWAS approach to finding susceptibility
variants for diseases boils down to testing approximately 0.3–2 million SNPs (depending on chip type) for differences in allele frequencies between cases and controls, adjusting for the high number
of multiple tests. This is a wonderfully simple procedure that
requires no complicated statistics or algorithms but only classical,
well-known statistical tests and a minimum of computing power.
Despite the simplicity, a number of issues remain, such as faulty
genotype data and confounding factors that can result in erroneous
findings if not handled properly. The most important aspects of any
GWAS are, therefore, thorough quality control, to make sure that
only good-quality genotype data are used, and taking measures to
avoid and reduce the effect of confounding factors.
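The multiple-testing adjustment mentioned above is, in the simplest case, a Bonferroni correction (a sketch under that assumption; the chapter does not prescribe a specific method):

```python
# Bonferroni sketch of the multiple-testing step: with about a million
# SNPs, a study-wide alpha of 0.05 becomes the conventional per-SNP
# genome-wide significance threshold of roughly 5e-8.
def bonferroni_threshold(alpha, n_tests):
    return alpha / n_tests

def significant_hits(pvalues, alpha=0.05):
    """Indices of tests passing the Bonferroni-corrected threshold."""
    cutoff = bonferroni_threshold(alpha, len(pvalues))
    return [i for i, p in enumerate(pvalues) if p < cutoff]

threshold = bonferroni_threshold(0.05, 1_000_000)   # about 5e-8
```

This illustrates why GWAS hits must clear per-SNP p-values many orders of magnitude below the nominal 0.05 used for a single test.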
3.1. Statistical Tests

The primary analysis in an association study is usually testing each
marker separately under the assumption of an additive or multiplicative model. One way of doing that is by creating a 2 × 2 allelic


Table 1
Contingency table for allele counts in case/control data

            Allele A        Allele B        Total
Case        Ncase,A         Ncase,B         Ncases
Control     Ncontrol,A      Ncontrol,B      Ncontrols
Total       NA              NB              N

Table 2
Expected allele counts in case/control data

            Allele A                Allele B                Total
Case        (Ncases × NA)/N         (Ncases × NB)/N         Ncases
Control     (Ncontrols × NA)/N      (Ncontrols × NB)/N      Ncontrols
Total       NA                      NB                      N

contingency table, as shown in Table 1, by summing the number of
A and B alleles seen in all case individuals and all control individuals.
Be aware that we are counting alleles and not individuals in this
contingency table, so Ncases will be equal to two times the number
of case individuals, because each individual carries two copies of each
SNP unless we are looking at nonautosomal DNA. If there is no
association between the SNP and the disease in question, we would
expect the fraction of cases that have a particular allele to match the
fraction of controls that have that allele. In that case, the expected
allele counts (EN) would be as shown in Table 2.
To test whether the difference between the observed allele
counts (in Table 1) and the expected allele counts (in Table 2) is
significant, a Pearson chi-square statistic can be calculated:

  χ² = Σ_Phenotype Σ_Allele (N_Phenotype,Allele − EN_Phenotype,Allele)² / EN_Phenotype,Allele

This statistic approximately follows a χ² distribution with 1 degree of
freedom, but if the expected allele counts are very low (<10) the
approximation breaks down. This means that if the MAF is very low
or if the total sample size, N, is small, an exact test, such as
Fisher's exact test, should be applied. An alternative to the tests that
use the 2 × 2 allelic contingency table, and thereby assume a
multiplicative model, is the Cochran–Armitage trend test, which


assumes an additive risk model (18). The latter test is preferred by
some, since it does not require an assumption of Hardy–Weinberg
equilibrium in cases and controls combined (19).
While a 1 degree of freedom test that assumes an additive or
multiplicative model is usually the first analysis, many studies also
perform a test that would be better at picking up associations
following a dominant or recessive pattern, for instance by
performing a 2 degrees of freedom test of the null hypothesis of
no association between rows and columns in the 2 × 3 contingency
table that counts genotypes instead of alleles.
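The allelic test above can be written out in a few lines (a stdlib-only sketch; in practice one would use a statistics package):

```python
# Pearson chi-square statistic (1 df) for the 2x2 allelic contingency
# table of Table 1, using the expected counts of Table 2:
# E = (row total x column total) / N.
def allelic_chi2(case_a, case_b, control_a, control_b):
    table = [[case_a, case_b], [control_a, control_b]]
    n = case_a + case_b + control_a + control_b
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Identical allele frequencies in cases and controls give a statistic of 0;
# a 60/40 versus 40/60 allele split gives 8.0, beyond the 3.84 cutoff for
# a nominal 0.05 significance level with 1 degree of freedom.
null_stat = allelic_chi2(100, 100, 100, 100)
assoc_stat = allelic_chi2(60, 40, 40, 60)
```

Note that, as the text warns, this approximation should be replaced by an exact test when any expected cell count drops below about 10.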
3.2. Effect Estimates

A commonly used way of measuring the effect size of an association
is the allelic odds ratio (OR), which is the ratio of the odds of being
a case given that you carry n copies of allele A to the odds of being
a case if you carry n − 1 copies of allele A. Assuming a multiplicative model, this can be calculated as:
  OR = (Ncase,A / Ncontrol,A) / (Ncase,B / Ncontrol,B)
     = (Ncase,A × Ncontrol,B) / (Ncase,B × Ncontrol,A).

Another measure of effect size that is perhaps more intuitive is
the relative risk (RR), which is the disease risk in carriers divided
by the disease risk in noncarriers. This measure, however, suffers
from the weakness that it is harder to estimate. If our cases and
controls were sampled from the population in an unbiased way, the allelic RR could be calculated as:
  RR = (Ncase,A / NA) / (Ncase,B / NB),

but it is very rare to have an unbiased population sample in association studies, because studies are generally designed to deliberately oversample the cases to increase power. This oversampling
affects the RR as calculated by the formula above, but not the OR,
which is one of the reasons why the OR is usually reported in
association studies instead of the RR.
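Both effect estimates are one-liners from the Table 1 counts, and a small numerical example (ours) makes the oversampling argument concrete:

```python
# OR and RR from the allele counts of Table 1. The demo at the bottom
# shows why GWASs report the OR: doubling the case sample leaves the
# OR untouched but changes the naive RR estimate.
def allelic_odds_ratio(case_a, case_b, control_a, control_b):
    return (case_a * control_b) / (case_b * control_a)

def allelic_relative_risk(case_a, case_b, control_a, control_b):
    n_a = case_a + control_a      # all copies of allele A in the sample
    n_b = case_b + control_b      # all copies of allele B in the sample
    return (case_a / n_a) / (case_b / n_b)

or_original = allelic_odds_ratio(30, 20, 40, 60)      # 2.25
or_oversampled = allelic_odds_ratio(60, 40, 40, 60)   # still 2.25
rr_original = allelic_relative_risk(30, 20, 40, 60)
rr_oversampled = allelic_relative_risk(60, 40, 40, 60)
```

Doubling every case count multiplies both numerator and denominator of the OR by the same factor, so it cancels; in the RR formula the case counts also enter the allele totals, so the cancellation fails.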
3.3. Quality Control

Data quality problems can be either SNP specific or individual
specific, and inspection usually results in the removal of both problematic individuals and problematic SNPs from the data set.
Individual-specific problems can be caused by low DNA quality or contamination by foreign DNA. A sample of low DNA
quality results in a high rate of missing data, where particular
SNPs cannot be called, and there is a higher risk of miscalling
SNPs. It is, therefore, recommended that individuals lacking calls
in more than 2–3% of the SNPs are removed from the analysis.
Excess heterozygosity is an indicator of sample contamination,
and individuals displaying that should also be disregarded. Sex
checks and other kinds of phenotype tests might also be applied


to remove individuals, where the genotype information does not


match the phenotype information due to a sample mix-up (20).
For a given SNP, the data from an individual can be suspicious
in two ways: the SNP can fail to be called by the genotype-calling program,
or it can be miscalled. Typically, a conservative cutoff value is used
in the calling process, ensuring that most problems show up as
missing data rather than miscalls. Most problematic SNPs, therefore, reveal a high fraction of missing data, and SNPs with a fraction of missing calls
above a given threshold (typically, 1–5%) are removed. Miscalls
typically occur when the homozygotes are hard to distinguish
from the heterozygotes, so that some of the heterozygotes are
misclassified as homozygotes or vice versa. Both biases manifest as
deviations from Hardy–Weinberg equilibrium, and SNPs that show
large deviations from Hardy–Weinberg equilibrium within the controls should be removed (21).
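The per-SNP checks described above can be sketched in a few lines of Python. This is our own illustration, not part of the chapter's pipeline: the genotype counts are invented, and only a simple 1-df chi-square Hardy–Weinberg test is shown (real studies often use an exact test).

```python
def snp_qc(n_aa, n_ab, n_bb, n_missing, max_missing=0.05, max_hwe_chi2=3.84):
    """Flag a SNP whose missing-data fraction is too high, or whose control
    genotype counts deviate from Hardy-Weinberg proportions (1-df chi-square;
    3.84 corresponds to p = 0.05). Thresholds are illustrative only."""
    n_called = n_aa + n_ab + n_bb
    if n_missing / (n_called + n_missing) > max_missing:
        return "fail: missingness"
    p = (2 * n_aa + n_ab) / (2 * n_called)  # frequency of allele A in controls
    q = 1.0 - p
    expected = (p * p * n_called, 2 * p * q * n_called, q * q * n_called)
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip((n_aa, n_ab, n_bb), expected))
    return "fail: HWE" if chi2 > max_hwe_chi2 else "pass"

print(snp_qc(250, 500, 250, 10))   # counts consistent with HWE
print(snp_qc(400, 200, 400, 10))   # heterozygote deficit, e.g. miscalled hets
print(snp_qc(250, 500, 250, 100))  # too much missing data
```

A miscalling bias that converts heterozygotes into homozygotes (second call) produces exactly the Hardy–Weinberg deviation the text warns about.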
3.4. Confounding Factors

Confounding factors are differences between cases and controls that are unrelated to the disease. For instance, if cases are gathered primarily
from one part of a country and controls from another part, false
association signals could be created because of genetic differences
between the two parts of the country. This confounding error is
particularly likely to occur when samples mix different ethnicities. If
the source of the data is mainly samples from one population, then
samples not originating from that population should be excluded
whenever possible. Methods for inferring population substructure,
such as principal components analysis, are useful for detecting outliers that ought to be removed from the data (22). When a data set
includes individuals from distinct subpopulations, the association
analysis should be performed separately in each subpopulation and
subsequently combined using a procedure that does not assume
that the frequencies in the two subpopulations are the same.
The type and frequency of errors that may occur during sample
preparation and SNP calling are likely to vary through time and
space, so case and control samples should be completely randomized as early as possible in the genotyping procedure.
Failure to plan this aspect of an investigation carefully introduces
errors in the data that are hard, if not impossible, to detect, and
they may reduce interesting findings to mere artifacts.
Although analyzing distinct populations separately and excluding outliers goes a long way, it does not remove all problems caused
by population structure. A general inflation of test statistics due to
population substructure and cryptic relatedness should be kept in
mind, especially when analyzing diseases with high familial aggregation, since that causes cases to be more closely related than
controls. A useful way of visualizing such inflation of test statistics
is the so-called quantile–quantile (QQ) plot. In this plot, ranked
values of the test statistic are plotted against their expected distribution under the null hypothesis. In case of no true positives and

11

Association Mapping and Disease: Evolutionary Perspectives

285

Fig. 5. QQ plots from a χ2-distribution. (a) A QQ plot where the observations follow the expected distribution. (b) A QQ plot where the majority of observations follow the expected distribution, but where some have unexpectedly high values, i.e., are statistically significant. (c) A QQ plot where the observations all seem to be higher than expected, which is an indication that the observations are not following the expected distribution.

no inflation of the test statistic due to population structure or cryptic relatedness, the points of the plot lie on the x = y line
(see Fig. 5a). True positives show as an increase in values above
the line in the right tail of the distribution but do not affect the rest
of the points since only a small fraction of the SNPs are expected to
be true positives (Fig. 5b). Cryptic relatedness and population
stratification lead to a deviation from the null distribution across
the whole distribution and can, thus, be seen in the QQ plot as a
line with a slope larger than 1 (Fig. 5c). A standard adjustment for
the inflation of the test statistics is to shrink their
distribution so that the median coincides with its expected value.
This procedure is called genomic control (23).
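The inflation factor and its correction can be illustrated numerically. In the sketch below (simulated statistics, not a real GWAS), λ is estimated as the median observed statistic divided by the median of the 1-df χ² distribution (≈0.4549), and all statistics are then divided by λ so the median matches its null expectation.

```python
import random

random.seed(1)

def genomic_inflation(stats):
    """lambda_GC: the median observed statistic over the median of the
    1-df chi-square distribution (about 0.4549). Values well above 1
    suggest stratification or cryptic relatedness."""
    return sorted(stats)[len(stats) // 2] / 0.4549

# Null 1-df chi-square statistics, simulated as squared standard normals,
# then inflated by 20% to mimic population stratification.
null_stats = [random.gauss(0.0, 1.0) ** 2 for _ in range(100_000)]
inflated = [1.2 * x for x in null_stats]

lam = genomic_inflation(inflated)
print(round(genomic_inflation(null_stats), 2))  # close to 1.0
print(round(lam, 2))                            # close to 1.2

# Genomic control: rescale every statistic by lambda so that the median
# again matches its null expectation.
corrected = [x / lam for x in inflated]
print(round(genomic_inflation(corrected), 2))   # 1.0 by construction
```

The uniform 20% inflation mimics the whole-distribution shift seen in Fig. 5c, as opposed to the right-tail excess of true positives in Fig. 5b.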
Population substructure is not the only difference between
cases and controls that can cause trouble. The pooling of data
genotyped using different kinds of equipment may produce similar
problems. A recent study on longevity by Sebastiani et al. (24) serves
as a warning. The researchers applied two different kinds of chips
and failed to remove several SNPs that exhibited bad quality on only
one of the chips (25). If the fraction of the two different kinds of
chips had been the same in both cases and controls, that would
probably not have resulted in false signals, but unfortunately the
chip with the bad SNPs was used in twice as many cases as controls.
3.5. Replication

The best way to make sure that a finding is real is to replicate it. If
you find the same signal in another set of cases and controls, it
means that the association was not caused by a confounding factor
specific to your data set. Likewise, if you still see the association
after typing the markers using another genotyping method, it
means that it is not a false positive due to some artifact of the
genotyping method used.
When trying to replicate a finding, the best strategy is to try to
replicate it in a population of similar ancestry. A marker that is tagging a true causal variant in one population might not be tagging the same variant in a population of different ethnicity, where
the LD structure can be different. This is especially a problem when
trying to replicate an association found in a non-African population
in an African population (26). A marker might easily have 20
completely correlated markers in a European population, but no
good correlates in an African population. This means that if you see
a significant association with an SNP that has 20 equivalent SNPs in
the European population it is not enough to try to replicate only
that SNP, but in an African population you have to test all 20. This,
however, also offers a way to fine map the signal and possibly find
the causative variant (27).
Before spending time and effort to replicate an association
signal in a foreign cohort, it is a good idea to look for partial
replication of the marker within the existing data. Usually, a marker is
surrounded by several correlated markers on the genotyping chip,
and if one marker shows a significant association then the correlated
markers should show an association too. If a marker is significantly
associated with a disease but no other marker in the region is, then
it should be viewed as suspicious. Decisions in cases like this may be
further validated by investigating markers that, according to HapMap, are correlated with the marker in question.

4. Imputation: Squeezing More Information Out of Your Data

The current generation of SNP chips includes only 0.3–2 million
of the 9–10 million common SNPs in humans (that is, SNPs with
an MAF of more than 5%). Because of the correlation between SNPs
in LD, however, the SNP chips can still claim to assay most of the
common variants in the genome (in European populations, anyway).
Although the Illumina HumanHap300 chip only directly tests about
3% of the 10 million common SNPs, it still covers 77% of the SNPs
in HapMap with a squared correlation coefficient (r2) of at least 0.8
in a population of European ancestry (11). The corresponding
fraction in a population of African ancestry is only 33%, however.
These numbers expose two limitations of the basic GWAS
strategy. First, there is a substantial fraction of the common SNPs
that are not well covered by the SNP chips even in European
populations (23% in the case of the HumanHap300 chip). Secondly, we rely on tagging to test a large fraction of the common
SNPs, and the diluted signal from correlated SNPs inevitably causes
us to overlook true associations in many instances. An efficient way
of alleviating these limitations is genotype imputation, where genotypes that are not directly assayed are predicted using information
from a reference data set that contains data from a large number of

11

Association Mapping and Disease: Evolutionary Perspectives

287

SNPs. Such imputation improves the GWAS in multiple ways: it boosts the power to detect associations, gives a more precise location of an association, and makes it possible to do meta-analyses
between studies that used different SNP chips (28).
between studies that used different SNP chips (28).
4.1. Selection of Reference Data Set

The two important choices when performing imputation are the reference data set to use and the software to use. Usually, a publicly
available reference data set, such as the HapMap (11) or 1000
genomes project (12), is used. Alternatively, researchers type a part
of their study cohort on a larger SNP chip with denser coverage,
thus creating their own reference data set. The latter strategy has the
advantage that one can be certain that the ancestry of the reference
data matches the ancestry of the study cohort. It is important that
the reference data is from a population that is similar to the study
population. If the reference population is too distantly related to the
study population, the reliability of the imputed data will be reduced.
The quality and nature of the reference data also limit the quality of
the imputed data in other ways. A reference data set consisting of
only a few individuals is not able to reliably estimate the frequency of
rare variants and that in turn means that the imputation of rare
variants lacks in accuracy. This means that there is a natural limit to
how low a frequency a variant can have and still be reliably imputed.
The use of imputation methods not only offers the possibility of increased SNP coverage but, given the right reference
data, also eases the analysis of common non-SNP variation, such
as indels and copy number variations (CNVs).
Whole-genome sequencing projects, such as the 1000 genomes
project, coupled with imputation will soon make it possible to use
the SNP chips to test many structural variants that are not being
(routinely) tested today (29).

4.2. Imputation Software

The commonly applied genotype imputation methods, such as IMPUTE (30), MACH (31), and BIMBAM (32, 33), are all
based on hidden Markov models (HMMs). Comparisons of these
software packages have shown that they produce data of broadly
similar quality but that they are superior to imputation software
based on other methodological approaches (28, 34). The basic
HMMs used in these programs are similar to earlier HMMs developed to model LD patterns and estimate recombination rates.
Since imputation is based on probabilistic models, its output is
merely a probability for each genotype of an SNP unknown in a given
individual. That is, instead of reporting the genotype of an individual as
AG, the program reports that the probability of the genotype being
AA is 5%, that of being AG is 93%, and that of being GG is 2%. This
nature of the output data challenges the GWAS. The simplest way of
using imputed data is to take the best-guess genotype, i.e., assume the
genotype with the highest probability and ignore the others. In the
example above, the individual would be given the genotype AG at the
SNP in question, and usually an individual's genotype would be
considered missing if none of the genotypes has a probability
larger than a certain threshold (e.g., 90%). The use of best guess
genotype is problematic since it does not take the uncertainty of the
imputed genotypes into account, may introduce a systematic bias,
and may lead to false positives and false negatives. A better way is to
run a logistic regression on the expected allele count: in the example
above, the expected allele count for allele A would be 1.03 (2 ×
pAA + pAG). The method developed on this basis has proved to be
surprisingly robust, at least when the effect of the risk allele is small
(35), which is the case for most of the variants found using GWAS. An
even better solution is to use methods that fully account for the
uncertainty of the imputed genotypes (30, 35, 36).
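The two strategies for handling imputed genotype probabilities can be written out directly. This is our own sketch (function names and the 90% threshold as in the text's example), using the 5%/93%/2% probabilities quoted above.

```python
# Sketch of the two ways of using imputed genotype probabilities:
# a thresholded best-guess call versus the expected allele count (dosage).

def best_guess(p_aa, p_ag, p_gg, threshold=0.9):
    """Highest-probability genotype, or None (missing) below the threshold."""
    probs = {"AA": p_aa, "AG": p_ag, "GG": p_gg}
    call = max(probs, key=probs.get)
    return call if probs[call] >= threshold else None

def expected_allele_count(p_aa, p_ag, p_gg):
    """Dosage of allele A: 2 * P(AA) + P(AG)."""
    return 2 * p_aa + p_ag

print(best_guess(0.05, 0.93, 0.02))                       # AG
print(best_guess(0.40, 0.55, 0.05))                       # None (too uncertain)
print(round(expected_allele_count(0.05, 0.93, 0.02), 2))  # 1.03, as in the text
```

Unlike the best-guess call, the dosage retains the genotype uncertainty in a single number and never produces missing data.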

5. Current Status

Association mapping has for the last 5 years had a strong focus on
GWAS using SNP chips with 500,000–1,000,000 SNPs, based on
the HapMap identification of common human variation in Europe,
Asia, and Africa. Hundreds of SNPs have been found to be associated with common diseases in a discovery cohort of affected
individuals with matched controls and at least one further population for replication of the initial finding (37). These studies have
typically found increased risks of 5–20% for each variant. While
initially considered a great proof of concept for the CDCV model,
there is a growing awareness that these are unlikely to explain most
of the genetic effect unless there is a very large number of common
alleles with very small ORs that have escaped detection using the
current cohort sizes (typically, 2,000–20,000 individuals). Hence,
there is a renewed focus on the site frequency spectrum of disease
alleles as discussed in Subheading 2. A clear pattern observed in
findings so far is that most variants identified are common and that
the inferred ORs increase with rarity of the variant. This cannot be
taken as evidence that rare variants have higher ORs, though, since,
as demonstrated by Iles (38), we can only detect the rarer variants if
they have higher ORs. However, if analysis is restricted to nonsynonymous disease SNPs, then rare variants do seem to have a
generally larger OR (39). An analysis of the site frequency spectrum
as a function of functional classification of SNPs using PolyPhen also
found that rare variants should be more damaging (which through
selection would explain their rarity) (13). Thus, it cannot be ruled
out that the bulk of heritability not explained by GWASs so far could
be explained by many rare variants, each with ORs larger than
common variants identified (perhaps ORs in the 2–10 range, which are still sufficiently low to be easily missed by linkage studies).


6. Perspectives
Future data will provide identification of most, if not all, SNPs and
CNVs in a large set of individuals. To further our quest toward
understanding the genetic etiology of common diseases, methods
are needed that expand information on the role of rare variants and
variants of small effect. This requires statistically powerful ways of
handling information from rare SNPs, rare LD blocks, and amassing distributed local effects. Several promising methods have
recently emerged as ways to add signals together locally (39–43).
It will be of great interest to use these approaches even in cases
where association with common variants has been shown, since it is
possible that some of these associations are due to synthetic association, i.e., that several rare variants are accidentally associated with
the same more common variant (44). Searching for variants adding up to a risk in a certain gene does not identify any specific causal
variants, but it points to causal genes or regions that can then be
further scrutinized either statistically in replication cohorts, by
bioinformatics pathway and functional prediction, or experimentally. With full sequencing, we know that the causal variants have
been included, but many other variants will be associated due to
LD. LD will, thus, be more of a burden than of an asset in future
studies and populations with least LD should be most easily amenable for association mapping. However, other approaches to distinguish real from associated variants that are based on biological
information on their putative function will be very useful. A wealth
of annotation to each position in the genome will soon be available,
including the epigenetic context (e.g., nucleosome positioning and
modifications, transcription factor and enhancer binding, DNA
structure) and the structure of the protein, including its position in
biological pathways and interaction with other proteins. Thus, each
putative variant can be assigned a posterior probability of true
association, which can be used as a hypothesis generator as well as a
prior probability in replication studies.

7. Questions
1. How can you distinguish causal variants from other variants
when all variants have been typed? Is there any statistical way of
distinguishing between correlation and causality just from
genotype data? Could you use functional annotations?
2. Consider a GWAS data set, where in the top ten ranked statistics you have five markers that are close together and the
remaining five scattered across the genome. Would you consider the five close markers more or less likely to be a true positive? Why? If one of them is a false positive, what would you
think about the others?
3. Why is the RR but not the OR estimate affected by a biased
case/control sample?
4. How would you test for, e.g., dominant or recessive effects in a
contingency table?
References
1. Visscher PM, Hill WG, Wray NR (2008) Heritability in the genomics era: concepts and misconceptions. Nat Rev Genet 9: 255–266.
2. Cardon LR, Abecasis GR (2003) Using haplotype blocks to map human complex trait loci. Trends Genet 19: 135–140.
3. de Bakker PI, Yelensky R, Pe'er I, Gabriel SB, Daly MJ, et al. (2005) Efficiency and power in genetic association studies. Nat Genet 37: 1217–1223.
4. Daly MJ, Rioux JD, Schaffner SF, Hudson TJ, Lander ES (2001) High-resolution haplotype structure in the human genome. Nat Genet 29: 229–232.
5. Maher B (2008) Personal genomes: the case of the missing heritability. Nature 456: 18–21.
6. Wright S (1931) Evolution in Mendelian populations. Genetics 16: 97–159.
7. Pritchard JK (2001) Are rare variants responsible for susceptibility to complex diseases? Am J Hum Genet 69: 124–137.
8. Pritchard JK, Cox NJ (2002) The allelic architecture of human disease genes: common disease–common variant... or not? Hum Mol Genet 11: 2417–2423.
9. Reich DE, Lander ES (2001) On the allelic spectrum of human disease. Trends Genet 17: 502–510.
10. Di Rienzo A, Hudson RR (2005) An evolutionary framework for common diseases: the ancestral-susceptibility model. Trends Genet 21: 596–601.
11. Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, et al. (2007) A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
12. Durbin RM, Abecasis GR, Altshuler DL, Auton A, Brooks LD, et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.
13. Gorlov IP, Gorlova OY, Sunyaev SR, Spitz MR, Amos CI (2008) Shifting paradigm of association studies: value of rare single-nucleotide polymorphisms. Am J Hum Genet 82: 100–112.
14. Li Y, Vinckenbosch N, Tian G, Huerta-Sanchez E, Jiang T, et al. (2010) Resequencing of 200 human exomes identifies an excess of low-frequency non-synonymous coding variants. Nat Genet 42: 969–972.
15. Pelak K, Shianna KV, Ge D, Maia JM, Zhu M, et al. (2010) The characterization of twenty sequenced human genomes. PLoS Genet 6.
16. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, et al. (2005) Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–389.
17. Duerr RH, Taylor KD, Brant SR, Rioux JD, Silverberg MS, et al. (2006) A genome-wide association study identifies IL23R as an inflammatory bowel disease gene. Science 314: 1461–1463.
18. Lewis CM (2002) Genetic association studies: design, analysis and interpretation. Brief Bioinform 3: 146–153.
19. Balding DJ (2006) A tutorial on statistical methods for population association studies. Nat Rev Genet 7: 781–791.
20. WTCCC (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661–678.
21. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. (2008) Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet 9: 356–369.


22. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2: e190.
23. Devlin B, Roeder K (1999) Genomic control for association studies. Biometrics 55: 997–1004.
24. Sebastiani P, Solovieff N, Puca A, Hartley SW, Melista E, et al. (2010) Genetic signatures of exceptional longevity in humans. Science.
25. Alberts B (2010) Editorial expression of concern. Science 330(6006): 912. DOI: 10.1126/science.330.6006.912-b.
26. Teo YY, Small KS, Kwiatkowski DP (2010) Methodological challenges of genome-wide association analysis in Africa. Nat Rev Genet 11: 149–160.
27. Zaitlen N, Pasaniuc B, Gur T, Ziv E, Halperin E (2010) Leveraging genetic variability across populations for the identification of causal variants. Am J Hum Genet 86: 23–33.
28. Marchini J, Howie B (2010) Genotype imputation for genome-wide association studies. Nat Rev Genet 11: 499–511.
29. Sudmant PH, Kitzman JO, Antonacci F, Alkan C, Malig M, et al. (2010) Diversity of human copy number variation and multicopy genes. Science 330: 641–646.
30. Marchini J, Howie B, Myers S, McVean G, Donnelly P (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nat Genet 39: 906–913.
31. Li Y, Willer CJ, Ding J, Scheet P, Abecasis GR (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet Epidemiol 34: 816–834.
32. Scheet P, Stephens M (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629–644.
33. Servin B, Stephens M (2007) Imputation-based analysis of association studies: candidate regions and quantitative traits. PLoS Genet 3: e114.
34. Nothnagel M, Ellinghaus D, Schreiber S, Krawczak M, Franke A (2009) A comprehensive evaluation of SNP genotype imputation. Hum Genet 125: 163–171.
35. Guan Y, Stephens M (2008) Practical issues in imputation-based association mapping. PLoS Genet 4: e1000279.
36. Stephens M, Balding DJ (2009) Bayesian statistical methods for genetic association studies. Nat Rev Genet 10: 681–690.
37. Hindorff LA, Sethupathy P, Junkins HA, Ramos EM, Mehta JP, et al. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci U S A 106: 9362–9367.
38. Iles MM (2008) What can genome-wide association studies tell us about the genetics of common disease? PLoS Genet 4: e33.
39. Kryukov GV, Pennacchio LA, Sunyaev SR (2007) Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. Am J Hum Genet 80: 727–739.
40. Li B, Leal SM (2008) Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am J Hum Genet 83: 311–321.
41. Madsen BE, Browning SR (2009) A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genet 5: e1000384.
42. Morris AP, Zeggini E (2010) An evaluation of statistical approaches to rare variant analysis in genetic association studies. Genet Epidemiol 34: 188–193.
43. Price AL, Kryukov GV, de Bakker PI, Purcell SM, Staples J, et al. (2010) Pooled association tests for rare variants in exon-resequencing studies. Am J Hum Genet 86: 832–838.
44. Dickson SP, Wang K, Krantz I, Hakonarson H, Goldstein DB (2010) Rare variants create synthetic genome-wide associations. PLoS Biol 8: e1000294.

Chapter 12
Ancestral Population Genomics
Julien Y. Dutheil and Asger Hobolth
Abstract
The full genomes of several closely related species are now available, opening an emerging field of
investigation borrowing both from population genetics and phylogenetics. Provided we can properly
model sequence evolution within populations undergoing speciation events, this resource enables us to
estimate key population genetics parameters, such as ancestral population sizes and split times. Furthermore, we can enhance our understanding of the recombination process and investigate various selective
forces. We discuss the basic speciation models for closely related species, including the isolation and
isolation-with-migration models. A major point in our discussion is that only a few complete genomes
contain much information about the whole population. The reason being that recombination unlinks
genomic regions, and therefore a few genomes contain many segments with distinct histories. The
challenge of population genomics is to decode this mosaic of histories in order to infer scenarios of
demography and selection. We survey different approaches for understanding ancestral species from
analyses of genomic data from closely related species. In particular, we emphasize core assumptions and
working hypotheses. Finally, we discuss computational and statistical challenges that arise in the analysis of
population genomics data sets.
Key words: Coalescence, Demography, Selection, Divergence, Speciation, Markov model, Ancestral
population

1. Introduction
We are on the edge of the population genomics era, but the majority
of population genomics data sets, such as the 1000 human genomes
project (1) and the 1001 Arabidopsis genomes project (2), are still in
the production stage. The data currently available consist of alignments of fully sequenced and closely related genomes. In some
cases, the genomes are consensus genomes obtained by pooling
sequences from several individuals. Under these conditions, the
recent history of species is not available to the investigator (although

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_12,
© Springer Science+Business Media, LLC 2012



Divergence time

Ancestor

Speciation
Recombination event

Species 1

Species 2

Position along genome

Fig. 1. Left: Isolation model of two species. Right: The coalescent process along the genomes of the two species. By
comparing the two genomes, we obtain information about the split time of the species and the ancestral population size.
Furthermore, the break points along the genomes correspond to recombination events, so we also have information about
the recombination process.

in some cases information is available from heterozygous positions (3)). By comparing genomes from closely related species, we can,
however, obtain information about split times, ancestral population
sizes, ancestral recombination events, and selection in ancestral
species (see Fig. 1). In this chapter, we discuss various models for
obtaining this information.
Comparing homologous sequences available for a given locus
to infer their degree of relatedness reveals the
parental relationships of the sequences, depicted as a tree and thereby
named a genealogy. When one sequence sampled from one individual
of one species is compared with sequences taken from other species,
the resulting genealogy contains information about the history of
the species, the so-called phylogeny. The phylogeny summarizes the
relationships and the divergence times between the species.
Conversely, when sequences from several individuals within a
species are sampled, we have access to the genetic variation in
contemporary populations. The evolutionary forces that shape
genetic variation within a species are genetic drift, mutation,
recombination, and selection; their study is the subject of population
genetics. The key modeling tool in population genetics is coalescent theory. Classical coalescent theory describes the genetic ancestry of a sample of homologous DNA sequences from the same
species. This genealogical description includes times to common
ancestry, which are measured back into the past.
Molecular phylogenetics and population genetics have accumulated 30 years of specific methodological developments. The
convergence of these two fields and their key mathematical tools
is needed in order to fully understand genomic sequence alignments because comparing genealogies and phylogenies is at the
heart of the study of the speciation process (4).
We describe the interplay between population genetics and
phylogenetics by reviewing the methods and models that have been developed to understand evolutionary history from genomic data (see Table 1 for a comparative summary of all methods).

2. Coalescent Theory and Speciation

2.1. The Standard Coalescent Model

We start by describing the standard coalescent model within one population. The coalescent model describes the shape of the genealogy of several sequences sampled from a single population. For
more information on the coalescent, we refer to refs. 5 and 6. In
subsequent sections, we extend the standard model to include two
or more populations. In the case where multiple populations are
present, we describe both the isolation model and the isolation-with-migration (IM) model.
The standard coalescent model is a continuous-time approximation
of the neutral Wright–Fisher model. In the Wright–Fisher model,
the number of chromosomes 2N (we consider diploid organisms) is
fixed in each nonoverlapping generation. Each chromosome in a
new generation chooses its ancestor uniformly at random from the
previous generation.
Consider two chromosomes. The probability of the two chromosomes choosing the same ancestor is 1/(2N), and the probability
of the two chromosomes not finding a common ancestor is 1 − 1/(2N). Let R2 denote the number of generations back in time when
the two individuals find a most recent common ancestor (MRCA). By
repeating the argument above, the probability of the two chromosomes not finding a common ancestor r generations back in time is

\[ P(R_2 > r) = \left(1 - \frac{1}{2N}\right)^r. \]

If we scale time t in units of 2N, i.e., set r = 2Nt, we get

\[ P(R_2 > r) = \left(1 - \frac{1}{2N}\right)^r = \left(1 - \frac{1}{2N}\right)^{2Nt} \approx e^{-t}, \]

where the approximation is valid for large N. In coalescent time
units, the waiting time T2 = R2/(2N) before coalescence of two
individuals is, therefore, exponentially distributed with mean one.
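The quality of this geometric-to-exponential approximation can be checked numerically; N = 10,000 below is an arbitrary illustration, not a value used by the authors.

```python
import math

# Check that P(R2 > 2Nt) = (1 - 1/(2N))^(2Nt) is close to exp(-t) for large N.
N = 10_000  # illustrative diploid population size
for t in (0.5, 1.0, 2.0):
    exact = (1 - 1 / (2 * N)) ** (2 * N * t)
    print(t, round(exact, 4), round(math.exp(-t), 4))
```

For this N, the two values already agree to four decimal places at each t.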
These considerations can be extended to multiple individuals.
In general, the time Tn before two of n individuals coalesce is
exponentially distributed with rate \(\binom{n}{2}\).
The waiting time Wn for a sample of n individuals to find the
MRCA is given by

\[ W_n = T_n + T_{n-1} + \cdots + T_2, \]

Primates: 1 Mb alignment
Orangutans: Two full genomes

T1, T2, NA1, NA2,


r RAS
T, NA, r

Markov
process

Integrating over the discretized


distribution of divergence for a pair
of genomes

(20)

(10)

(9)

(12)

(11)

(14)

(25)

RAS: rate-across-sites model, assuming an a priori distribution of evolutionary rates (usually a discretized gamma distribution) over alignment positions. I: isolation model. IM: isolation-with-migration model.

Primates: 1 Mb alignment

T1, T2, NA1, NA2

Markov
process
Markov
process

Primates: 15,000 neutral loci


(7.4 Mb)

Integrating over a subset of candidate


genealogies using a hidden Markov
model

RAS + branch-specific
departure from
molecular clock

Primates: Same data as 12


restricted to human,
chimpanzee, gorilla,
and orangutan

T1, T2, T3,


T4, NA1, NA2,
NA3

Independent estimate
of rate

Drosophila

Independent I
loci

T1, T2, N1, N2,


NA, m1!2,
m2!1
T1, T2, N1, N2,
NA, m1!2,
m2!1

Primates

Independent estimate
of rate
RAS

T1, T2, NA

(17)

Primates

Correction with
outgroup

T1, NA

Reference

Primates: 53 random autosomal (28)


intergenic nonrepetitive DNA
segments of 220 kb

Data set

T, NA

Rate variation/
sequencing errors

Bayesian inference

Independent IM
loci

Independent I
loci
Independent IM
loci

Count alignment patterns, fit EM model


to infer parameters

Likelihood calculation under a


demographic model, numerical
integration over genealogies

Independent I
loci

Parameters
ARG Approx. Spec. estimated

Infer genealogy from independent loci, use


distribution of inferred divergence and
topology counts to estimate parameters

Principle

Table 1
Methods comparison. This table summarizes and compares existing ancestral population genomics methods. Parameters
correspond to the one in figure 4

296
J.Y. Dutheil and A. Hobolth

12 Ancestral Population Genomics

297

Fig. 2. Illustration of the coalescent process. The waiting time before two out of n
individuals coalesce is Tn, and the time before a sample of n individuals finds common
ancestry is Wn.

where the Tk are independent exponential random variables with
parameter $\binom{k}{2}$; see Fig. 2 for an illustration.
It follows that the mean of Wn is
$$E[W_n] = \sum_{k=2}^{n} E[T_k] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2 \sum_{k=2}^{n} \left( \frac{1}{k-1} - \frac{1}{k} \right) = 2\left(1 - \frac{1}{n}\right).$$
Note that E[Wn] → 2 for n → ∞.
The variance of Wn is
VarWn 

n
X

VarTk

k2

n  2
X
k
k2




n1
X
1
1
1
3
:
8
4 1
k2
n
n
k1

Note that VarWn " (8p2 / 6  12) 1.16 for n ! 1.


The consequence of these calculations is that when we only
sample within a population, we are limited to relatively recent
events. The expected time for a large sample to find its MRCA is
approximately 2 × 2N = 4N generations, with standard deviation
sqrt(1.16) × 2N ≈ 2.15N generations. As a consequence, a neutral
sample within a population contains little information beyond 6N
generations.
Humans have a generation time of approximately 20 years and
an effective population size of approximately N = 10,000; therefore,
6N generations correspond to approximately 1.2 million
years (My) for humans. Human diversity at neutral loci therefore
contains little demographic information beyond 1.2 My.
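These limits are easy to check by simulation. The sketch below (plain Python, written for this chapter; the function names are ours) draws Wn as a sum of exponential inter-coalescence times under the standard coalescent, with time in units of 2N generations, and compares the sample mean and variance with 2(1 − 1/n) and the ≈1.16 large-n limit derived above.

```python
import random

def sample_Wn(n, rng):
    """One draw of Wn = Tn + T(n-1) + ... + T2 under the standard coalescent.

    Time is in units of 2N generations; Tk is exponential with
    rate k(k-1)/2, i.e. "k choose 2".
    """
    w = 0.0
    for k in range(n, 1, -1):
        w += rng.expovariate(k * (k - 1) / 2.0)
    return w

def expected_Wn(n):
    """E[Wn] = 2 (1 - 1/n)."""
    return 2.0 * (1.0 - 1.0 / n)

rng = random.Random(42)
n, reps = 20, 100_000
draws = [sample_Wn(n, rng) for _ in range(reps)]
mean = sum(draws) / reps
var = sum((w - mean) ** 2 for w in draws) / (reps - 1)
# mean should be close to 1.9 and var close to the ~1.16 limit
```

For n = 20 the sample mean is already very close to the limiting value 2, illustrating why sampling more individuals adds little depth.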


2.2. Adding Mutations to the Standard Coalescent Model

Now, suppose mutations occur at a rate u per locus per generation.
In a lineage of r generations, we then expect ru mutations or, in
coalescent time units with r = 2Nt, we expect 2Ntu mutations. We
let θ = 4Nu be the mutation rate parameter. Since u is small, we
can make a Poisson approximation of the binomial number of
mutations in a lineage of r generations:
$$\mathrm{Bin}(r, u) = \mathrm{Bin}\left(2Nt, \frac{\theta/2}{2N}\right) \approx \mathrm{Pois}(t\theta/2).$$
We have, thus, arrived at the following two-step process for
generating samples under the coalescent: (a) generate the genealogy
by merging lineages uniformly at random, with waiting times
exponentially distributed with rate $\binom{n}{2}$ when n lineages are
present; (b) on each lineage in the tree, add mutations according to
a Poisson process with rate θ/2.
Another possibility is to scale the coalescent process such that
one mutation is expected in one time unit. In this case, the exponentially
distributed waiting times in (a) have rate $\binom{n}{2} \, 2/\theta$, and
in (b) the mutations are added with unit rate. We use the latter
version of the coalescent-with-mutations process below.
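The two-step process is straightforward to simulate. The sketch below (plain Python; function name is ours, rate θ/2 version) draws a genealogy, accumulates the total tree length, and drops a Poisson number of mutations on it. Under the infinite sites model, the mean number of mutations should match Watterson's expectation E[S] = θ Σ 1/k, which gives a sanity check.

```python
import math
import random

def segregating_sites(n, theta, rng):
    """Two-step coalescent-with-mutations simulation (infinite sites).

    (a) genealogy: with k lineages present, wait Exp(k(k-1)/2) before the
        next merger (only the total tree length matters for the count);
    (b) mutations: a Poisson process with rate theta/2 on each lineage,
        i.e. Poisson(theta/2 * total tree length) mutations in total.
    """
    total_length, k = 0.0, n
    while k > 1:
        t = rng.expovariate(k * (k - 1) / 2.0)
        total_length += k * t        # k lineages are extant during this epoch
        k -= 1
    lam = theta / 2.0 * total_length
    # draw Poisson(lam) by inversion of the CDF
    u, p, x = rng.random(), math.exp(-lam), 0
    c = p
    while u > c:
        x += 1
        p *= lam / x
        c += p
    return x

rng = random.Random(0)
n, theta = 10, 5.0
mean_s = sum(segregating_sites(n, theta, rng) for _ in range(50_000)) / 50_000
watterson = theta * sum(1.0 / k for k in range(1, n))  # E[S] under infinite sites
```

The simulated mean converges on Watterson's value, confirming that the two-step construction reproduces the expected amount of polymorphism.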
2.3. Taking Recombination into Account

For species where recombination occurs, different parts of the
genome come from distinct ancestors, and therefore have distinct
histories. Figure 3 exemplifies this phenomenon for two species.
It displays the genealogical relationships for two sequences which
underwent a single recombination event. In the presence of recombination,
each position of a genome alignment therefore has a
specific genealogy, and close positions are more likely to share the
same one (recall Fig. 1). The genome alignment can, therefore,
be described as an ordered series of genealogies, each spanning a
variable number of sites and changing because of a recombination
event (4). A single genome, thus, contains different samples

Fig. 3. Ancestral recombination graph for two species. (a) Genealogy of four sampled sequences from two species.
The bold line shows the divergence of the two sequences of interest. (b) A single recombination event happened between
the lineages of sequences 3 and 4 (horizontal line), so that in a part of the sequences the genealogy is as depicted by the
bold line and therefore displays an older divergence. (c) The corresponding ancestral recombination graph. Dotted lines
show the portions of lineages which are not present in the sample composed of sequences 1 and 3. When going backward
in time, a split corresponds to a recombination event and a merger is a coalescence event.


from the distribution of the age of the MRCA, and the distribution
contains information about the ancestral population size and
speciation time.

3. Models of Speciation

In this section, we extend the standard coalescent model. We
consider coalescent models with multiple species and introduce
population splits or speciation events. The models that we describe
are shown in Fig. 4 (see also Table 1) and include (a) the two-species
isolation model; (b) the two-species isolation-with-migration
model; (c) the three-species isolation model (and incomplete
lineage sorting); and (d) the three-species isolation-with-migration

Fig. 4. Speciation models and associated parameters. (a) Isolation model with two species; (b) isolation-migration model
with two species; (c) isolation model with three species; (d) isolation-migration model with three species. In all exemplified
models, the effective population size is constant between speciation events, represented by dashed lines. The timings of the
speciation events, denoted T, are parameters of the models, together with ancestral effective population sizes, denoted NA.
In some cases, contemporary population sizes can also be estimated, and are denoted Ni, where i is the index of the
population. Models with postdivergence genetic exchange have additional migration parameters labeled m(from→to). The
number of putative migration rates increases with the number of contemporary populations under study, and some models
might consider some of them to be equal, or possibly null, to reduce complexity.


model. We also discuss the general multiple-species isolation-with-migration
model. The two-species isolation model is introduced
in ref. 7 and the isolation-with-migration model is introduced in
ref. 8.
3.1. Isolation Model
with Two Species

If the sequences are sampled from two distinct species that have
diverged a time T ago (see Fig. 4a), then the distribution of the age
of the MRCA is shifted to the right by the amount T, resulting in
the distribution
$$f_{T_2}(t) = \begin{cases} 0 & \text{if } t < T \\ \dfrac{2}{\theta_A} e^{-\frac{2}{\theta_A}(t - T)} & \text{if } t > T. \end{cases}$$
The mean time to coalescence is E[T2] = T + θA/2, and the
average divergence time between two sequences is twice this quantity,
that is, 2T + θA. Since θA = 4NAu, it follows that the larger the
size of the ancestral population, the bigger the difference between
the speciation time and the divergence time.
The variance of the divergence time is Var[T2] = θA²/4. With
access to the distribution of divergence times, we could estimate the
speciation time and population size from the mean and variance of
the distribution. Unfortunately, the complete distribution of divergence
times is not available to us: long regions are needed for precise
divergence estimation, but long regions have experienced one or
more recombination events.
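Still, the moment equations are instructive. The sketch below (plain Python; names are ours) pretends we could observe per-locus coalescence times directly, which, as just discussed, real data do not permit, and inverts the two moments: θA = 2·sd and T = mean − θA/2. The parameter values are the HC-like values used in Fig. 5.

```python
import math
import random

def fit_isolation_model(times):
    """Method-of-moments fit of the two-species isolation model.

    Coalescence times are T + Exp(2/thetaA), so E[T2] = T + thetaA/2
    and Var[T2] = thetaA**2 / 4; invert these two equations.
    """
    n = len(times)
    mean = sum(times) / n
    var = sum((t - mean) ** 2 for t in times) / (n - 1)
    theta_a = 2.0 * math.sqrt(var)
    return mean - theta_a / 2.0, theta_a

rng = random.Random(1)
T_true, theta_true = 0.0038, 0.0062
times = [T_true + rng.expovariate(2.0 / theta_true) for _ in range(100_000)]
T_hat, theta_hat = fit_isolation_model(times)
```

With 100,000 idealized "loci", both the speciation time and the ancestral θ are recovered accurately; the practical difficulty lies entirely in obtaining the coalescence times themselves.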
3.2. Isolation Model with Three or More Species and Incomplete Lineage Sorting

Now, consider the isolation model with three species depicted in
Fig. 4c. Such a model is often used for the human-chimpanzee-gorilla
(HCG) triplet (e.g., refs. 9-11).
The density function for the time to coalescence between
sample 1 and sample 2 is given by
$$f_{T_2}(t) = \begin{cases} 0 & \text{if } t < T_1 \\ \dfrac{2}{\theta_{A1}} e^{-\frac{2}{\theta_{A1}}(t - T_1)} & \text{if } T_1 < t < T_{12} \\ P_{12} \dfrac{2}{\theta_{A2}} e^{-\frac{2}{\theta_{A2}}(t - T_{12})} & \text{if } t > T_{12}, \end{cases} \quad (1)$$
where
$$T_{12} = T_1 + T_2 \quad \text{and} \quad P_{12} = e^{-\frac{2}{\theta_{A1}}(T_{12} - T_1)}$$
is the probability of the two samples not coalescing in the ancestral
population of sample 1 and sample 2. In the upper right corner of
Fig. 5, we plot the density (Eq. 1) with parameters that resemble
the HCG triplet.
If sample 1 and sample 2 do not coalesce in the ancestral
population of sample 1 and sample 2, then the three trees

Fig. 5. Illustration of the density for coalescence in various models and data layouts. The panels combine the isolation model (top) and the isolation+migration model (bottom) with two species (left) and three species (right); the curves are probability density
functions of the coalescence time. In the most simple case with two species, a constant ancestral population size, and a punctual speciation (top left
panel), more genomic regions find a common ancestor close to the species split (the vertical line), while a few regions have a
more ancient common ancestor, distributed in an exponential manner (see Eq. 1). If speciation is not punctual and migration
occurred after isolation of the species, then some sequences have a common ancestor which is more recent than the
species split, and the distribution in the ancestor becomes more complex (bottom left panel, see Eqs. 4 and 6). When a third
species is added (right panels), another discontinuity appears and all distributions depend on additional parameters,
particularly when migration is allowed. We use θA1 = 0.0062, θA2 = 0.0033, and t1 = 0.0038 (the first vertical line),
t2 = 0.0062 (the second vertical line), corresponding to the HCG triplet. Ancestral population sizes are taken from the
simulation study in Table 6 in ref. 14: θ1 = 0.005 and θ2 = 0.003. Migration parameters are all set to 50.

((1,2),3), ((1,3),2), and ((2,3),1) are equally likely. The probability
of the gene tree being different from the species tree is, thus,
$$P(\text{incongruence}) = \frac{2}{3} P_{12} = \frac{2}{3} e^{-\frac{2}{\theta_{A1}}(T_{12} - T_1)}. \quad (2)$$
(2)

The event that the gene tree is different from the species tree is
called incomplete lineage sorting (ILS). ILS is important because
species tree incongruence often manifests itself as a relatively clear
signal in a sequence alignment and thereby allows for accurate
estimation of population parameters. In Fig. 6, we show the (in)


Fig. 6. Probability (Eq. 2) of the gene tree and species tree being incongruent, as a function of (T12 - T1)/θA1, together with
the complementary probability of congruence with the ((human,chimpanzee),gorilla) species tree. In the case of the HCG
triplet, we obtain (T12 - T1)/θA1 = (0.0062 - 0.0038)/0.0062 = 0.39, which corresponds to an incongruence probability of 30%.

congruence probability (Eq. 2). We also refer to Subheadings 7.1
and 7.2 for more discussion of ILS.
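Eq. 2 is a one-liner to evaluate; the sketch below (plain Python, function name ours) plugs in the HCG values quoted in the Fig. 6 caption.

```python
import math

def prob_incongruence(t1, t12, theta_a1):
    """Eq. 2: P(gene tree != species tree) = (2/3) exp(-2 (T12 - T1) / thetaA1)."""
    return (2.0 / 3.0) * math.exp(-2.0 * (t12 - t1) / theta_a1)

# HCG values from the Fig. 6 caption: about 30% incomplete lineage sorting
p_ils = prob_incongruence(t1=0.0038, t12=0.0062, theta_a1=0.0062)
```

As (T12 − T1)/θA1 → 0 the probability approaches its maximum of 2/3, i.e., a very short or very populous internal branch makes the three topologies nearly equally likely.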
In the three-species isolation model, the mean coalescence time
for a sample from population 1 and a sample from population 2 is
given by
$$E[T_2] = T_1 + (1 - P_{12}) \frac{\theta_{A1}}{2} + P_{12} \frac{\theta_{A2}}{2}. \quad (3)$$
Burgess and Yang (12) describe the speciation process for
humans (H), chimpanzees (C), gorillas (G), orangutans (O), and
macaques (M) using an isolation model with five species.
The HCGOM model contains four ancestral parameters θHC,
θHCG, θHCGO, and θHCGOM. In this case, Eq. 3 extends to
$$\begin{aligned} E[T_2] = T_{HC} &+ (1 - P_{HC}) \frac{\theta_{HC}}{2} + P_{HC} (1 - P_{HCG}) \frac{\theta_{HCG}}{2} \\ &+ P_{HC} P_{HCG} (1 - P_{HCGO}) \frac{\theta_{HCGO}}{2} + P_{HC} P_{HCG} P_{HCGO} (1 - P_{HCGOM}) \frac{\theta_{HCGOM}}{2}. \end{aligned}$$

3.3. Isolation with Migration Model with Two Species and Two Samples

The isolation-with-migration model with two species is shown
in Fig. 4b. The IM model has six parameters: the mutation rates
θ1, θ2, and θA, the migration rates m1 and m2, and the speciation
time T. We let Θ = (θ1, θ2, θA, m1, m2, T) be the vector of
parameters.


Wang and Hey (14) consider a situation with two genes. Before
time T, the system is in one of the following five states.
S11: Both genes are in population 1.
S22: Both genes are in population 2.
S12: One gene is in population 1 and the other is in population 2.
S1: The genes have coalesced and the single gene is in population 1.
S2: The genes have coalesced and the single gene is in population 2.
The instantaneous rate matrix Q, with rows and columns
ordered S11, S12, S22, S1, S2 (rows: current state; columns: next
state), is given by

           S11              S12            S22           S1     S2
  S11  -(2m2 + 2/θ1)        2m2             0           2/θ1    0
  S12        m1          -(m1 + m2)         m2            0     0
  S22         0              2m1       -(2m1 + 2/θ2)      0    2/θ2
  S1          0               0              0           -m2    m2
  S2          0               0              0            m1   -m1

With this ordering, mi is the rate at which a single lineage moves
into population i when the process is viewed backward in time.


Starting in state a, the density for coalescence in population 1 at
time t < T is given by (13)
$$f_1(t) = \left(e^{Qt}\right)_{a, S_{11}} \frac{2}{\theta_1}, \quad (4)$$
the density for coalescence in population 2 at time t < T is
$$f_2(t) = \left(e^{Qt}\right)_{a, S_{22}} \frac{2}{\theta_2}, \quad (5)$$
and the total density for a coalescence at time t < T is
$$f(t) = f_1(t) + f_2(t). \quad (6)$$

Here, $e^A = \sum_{i=0}^{\infty} A^i / i!$ is the matrix exponential of the matrix A, and
$(e^A)_{ij}$ is entry (i,j) in the matrix exponential.
After time T, the system only has two states: SAA, corresponding
to two genes in the ancestral population, and SA, corresponding to
one single gene in the ancestral population. The rate of going from
state SAA to state SA is 2/θA. The density for coalescence in the
ancestral population at time t > T is, therefore,
$$f(t) = \left[ \left(e^{QT}\right)_{a, S_{11}} + \left(e^{QT}\right)_{a, S_{12}} + \left(e^{QT}\right)_{a, S_{22}} \right] \frac{2}{\theta_A} e^{-\frac{2}{\theta_A}(t - T)}. \quad (7)$$
In Fig. 5, we illustrate the coalescent density in the two-species
isolation-with-migration model.
The likelihood for a pair of homologous sequences X is given by
$$P(X \mid \Theta) = L(\Theta \mid X) = \int_0^{\infty} P(X \mid t) f(t \mid \Theta)\, dt, \quad (8)$$


where f(t) = f(t|Θ), given by Eqs. 6 and 7, is the density of the two
sequences finding an MRCA at time t, and P(X|t) is the probability
of the two sequences given that they find an MRCA at time t.
The latter term is calculated using a distance-based method.
One possibility is to use the infinite sites model, where it is assumed
that substitutions happen at unique sites, i.e., there are no recurrent
substitutions. In this case, the number of differences between the
two sequences follows a Poisson distribution with rate 1.
For an application of the isolation-with-migration model with
two sequences, we refer to ref. 14; a discussion of their approach
can be found in ref. 15.
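Evaluating Eqs. 4-7 only requires a matrix exponential; in practice one would call scipy.linalg.expm, but the naive scaling-and-squaring exponential below keeps this sketch dependency-free (all function names are ours). With both migration rates set to zero, the density from state S11 must collapse to the single-population exponential (2/θ1)·exp(−2t/θ1), which provides a convenient sanity check.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=30, squarings=12):
    """exp(A) via scaling and squaring with a truncated Taylor series."""
    n = len(a)
    s = [[a[i][j] / 2.0 ** squarings for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, s)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def im_rate_matrix(theta1, theta2, m1, m2):
    """Q over the states S11, S12, S22, S1, S2 (see the matrix in the text)."""
    q = [
        [0.0, 2.0 * m2, 0.0, 2.0 / theta1, 0.0],   # S11
        [m1, 0.0, m2, 0.0, 0.0],                   # S12
        [0.0, 2.0 * m1, 0.0, 0.0, 2.0 / theta2],   # S22
        [0.0, 0.0, 0.0, 0.0, m2],                  # S1
        [0.0, 0.0, 0.0, m1, 0.0],                  # S2
    ]
    for i in range(5):
        q[i][i] = -sum(q[i])                       # diagonal = minus row sum
    return q

def coalescent_density(t, theta1, theta2, m1, m2, start=0):
    """Eqs. 4-6: f(t) = (e^{Qt})_{a,S11} 2/theta1 + (e^{Qt})_{a,S22} 2/theta2."""
    q = im_rate_matrix(theta1, theta2, m1, m2)
    p = mat_exp([[q[i][j] * t for j in range(5)] for i in range(5)])
    return p[start][0] * 2.0 / theta1 + p[start][2] * 2.0 / theta2
```

Plugging the result into a numerical quadrature over t then gives the integrand of Eq. 8 for a pair of sequences.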
3.4. Isolation with Migration Model with Three or More Species and Three or More Samples

Hey (16) considers the multipopulation isolation-with-migration
model. Recall from Fig. 4b that the two-population IM model
has six parameters: two present population sizes, one ancestral
population size, one speciation time, and two migration rates.
The three-population IM model in Fig. 4d has 15 parameters:
three present population sizes, two ancestral population sizes,
two speciation times, and eight migration rates. In general, a
k-population IM model has 3k - 2 + 2(k - 1)² parameters:

- k present population sizes
- (k - 1) ancestral population sizes
- (k - 1) speciation times
- 2(k - 1)² migration rates

See Subheading 7.3 for a derivation of the number of migration
rates in the general k-population model. For k = 5, 6, and 7, we
obtain 45, 66, and 91 parameters, respectively. Because the number
of parameters becomes very large even for small k, Hey (16) suggests
adding constraints to the migration rates, e.g., setting some
rates to zero or introducing symmetry conditions, where rates
between populations are the same.
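The parameter count above reduces to a one-line formula, sketched here (function name ours):

```python
def n_im_parameters(k):
    """Parameters of a k-population isolation-with-migration model:
    k present sizes + (k - 1) ancestral sizes + (k - 1) speciation times
    + 2 (k - 1)**2 migration rates = 3k - 2 + 2 (k - 1)**2."""
    return 3 * k - 2 + 2 * (k - 1) ** 2
```

The quadratic migration term dominates quickly, which is why constraints on the migration rates become necessary even for moderate k.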

4. Approximating the Ancestral Recombination Graph

In this section, we discuss three methods of taking recombination
into account. The three methods are visualized in Fig. 7c-e
and correspond to (1) independent loci, (2) site patterns, and (3)
hidden Markov models (HMMs).

4.1. The Independent Loci Approach: All Recombination Between, No Recombination Within

The simplest way to handle issues relating to the ancestral recombination
graph is to divide the data into presumably independent
loci. Such analyses are, therefore, restricted to candidate regions
that are not too large (to avoid including a recombination point)
and not too close to each other (to ensure that several recombination events


Fig. 7. The coalescent process along genomes. (a) Four archetypes of coalescence scenarios with three species,
exemplified with human, chimpanzee, and gorilla. In the first scenario, human and chimpanzee coalesce within the
human-chimpanzee common ancestor. In the three other scenarios, all sequences coalesce within the common ancestor
of all species, with probability 1/3 depending on which two sequences coalesce first. (b) Example of genealogical changes
along a piece of an alignment. The alignment was simulated using the true coalescent process and parameters
corresponding to the human-chimpanzee-orangutan history. The blue line depicts the variation along the genome of
the human-chimpanzee divergence. The background colors depict the changes in topology, red and yellow corresponding
to incomplete lineage sorting. Every change in color or break of the blue line is the result of a recombination event. (c-e)
Three possible ways of approximating the ancestral recombination graph. In (c), a number of small loci are analyzed
independently under an assumption of no recombination within loci, which allows one to estimate the probability distribution of
sequence divergence. In (d), the alignment is summarized in terms of counts of site patterns, and in (e) the data are analyzed
with a hidden Markov model along the sequence, with distinct genealogies featuring various divergence times
as hidden states. The underlying model includes transition probabilities between genealogies along the genome.
See Subheading 4 for more details.

happened between loci). Each region can, therefore, be described


by a single underlying tree, reducing the analytical and computational load. This approach cannot be used when the species under
study are too distantly related, as recombination events will have


fragmented the ARG up to a point where no region of any size
can be assumed free of recombination.
Using 15,000 loci at least 10 kb apart and totaling 7.4 Mb, and the
isolation model introduced above, Burgess and Yang (Table 2,
model (b) "Sequencing errors" in ref. 12) find θHC = 0.0062, θHCG = 0.0033,
θHCGO = 0.0061, and θHCGOM = 0.0118, and THC = 0.0038, THCG = 0.0062,
THCGO = 0.0137, and THCGOM = 0.0260. They get E[THC] =
0.0062 (corresponding to a 1.2% divergence between human and
chimpanzee) and THC = 0.0038. Therefore, 38/62 = 0.61, or 61%,
of the divergence between humans and chimpanzees is due to speciation,
and 39% is due to ancestral polymorphism. Converting those
estimates into time units requires an estimate of the substitution rate,
either absolute or deduced from a calibration point. Using u = 10^-9
substitutions per site per year, this leads to an estimate of 3.8 My
for the human-chimpanzee speciation, a very recent estimate. Using
the same data, Yang (11) showed that the isolation-with-migration
model was preferred. Yang finds a more ancient speciation time THC
= 0.0053 (5.3 My with u = 10^-9) when migration is accounted for
(was THC = 0.0044 without migration).
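The arithmetic in this paragraph is a direct unit conversion; the sketch below (plain Python, function name ours) reproduces it from the quoted estimates.

```python
def split_time_breakdown(t_split, mean_coal, mu_per_year):
    """Convert coalescent-scale estimates (in substitutions per site) into
    (i) the fraction of the expected coalescence time due to the species
    split and (ii) the split time in years, given a substitution rate
    per site per year."""
    return t_split / mean_coal, t_split / mu_per_year

# Burgess and Yang's human-chimp values: THC = 0.0038, E[THC] = 0.0062
frac_speciation, years = split_time_breakdown(0.0038, 0.0062, 1e-9)
# about 61% of the divergence reflects the split itself, dated to ~3.8 My
```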
4.2. Site Pattern
Analysis

Patterson et al. (17) used a different approach based on site patterns.


They sequenced fragments of DNA from a western lowland gorilla
and a spider monkey, which they combined with whole-genome reads
from the orangutan and macaque, and built a genome alignment
using the human scaffold. The resulting 20-Mb data set was
extended and/or used thereafter by refs. 9-12. Patterson et al.
counted the frequencies of all possible site patterns in the resulting
HCGOM alignment. These patterns can be sorted depending on
which genealogy they support: ((H,C),G),O, ((H,G),C),O, ((C,G),
H),O, etc. They introduced a model that allowed them to estimate
speciation time and ancestral population sizes from the frequencies
of the observed patterns, independently of the recombination rate.
The only requirement is that recombination occurred to enable
the various patterns to be observed, which is warranted by the
large genomic region they used. This method makes very few
assumptions about the data, particularly regarding recombination,
and uses ILS as its only source of signal for estimating population
parameters. However, it ignores alternative sources of signal, like
singletons, which carry information about the local sequence divergence. Such an approach is, therefore, limited to simple models of
speciation, and cannot easily be extended to more complex scenarios
like isolation with migration.
Patterson et al. inferred a recent speciation time for human and
chimpanzee, below 5.4 My. They also found a more recent divergence
on the X chromosome, which they interpret in terms of a
complex speciation event with hybridization. Alternative explanations
for this observation have been provided (18, 19).

4.3. The Markov Assumption Along Sites

The work by Hobolth et al. (9) used site patterns in a different way.
With a hidden Markov model, they used the correlation of patterns
along the genome to reconstruct the site-specific genealogy, including divergence times. They further used these divergence estimates
together with the inferred amount of incomplete lineage sorting
to compute the speciation times and ancestral population sizes.
In this approach, the recombination rate is embedded into the
transition matrix of the hidden Markov chain, which specifies the
probabilities of transition from one genealogy to the other along
the genome. Hobolth et al. showed that this matrix is constrained
by symmetric relationships, and estimated the remaining three
parameters together with the divergence parameters. Dutheil
et al. (10) extended this approach by identifying further constraints
on the parameters and fully expressing the divergence times and
probabilities of transition between genealogies as functions of the
speciation times, ancestral population sizes, and recombination
rate, therefore allowing their direct estimation. The analytical
expressions of the parameters as functions of population quantities
are, however, difficult to obtain, notably for the transition probabilities,
even in the simplest case.
Mailund et al. (20) used a different approach to compute these
for the two-species isolation model. They used a continuous-time
Markov chain to model the evolution of a pair of contiguous positions.
This model features two types of events: when going backward in
time, the two positions can either coalesce (with a rate inversely proportional
to the effective population size) or split (with a rate equal to the
recombination rate). The transition probabilities between genealogies
are immediately available from the joint pair of contiguous
positions and the Markov assumption. This approach can be
generalized to more species and potentially allows for more
realistic demographic scenarios, for instance allowing migration
between populations.
The coalescent HMM framework, thus, models recombination, which is assumed to be constant in all lineages and along the
alignment. The model further assumes that the probability of
switching from one genealogy to another when we walk along a
genome alignment only depends on the genealogy at the previous
position, that is, the process of genealogy change along the genome
is Markovian. This is an approximation of the true coalescent
process that greatly simplifies calculation (21). Dutheil et al. (10)
and Mailund et al. (20) used simulated data sets under a coalescent
process with recombination to show that this assumption had,
however, little influence on the parameter estimates. Using this
approach, Hobolth et al. estimated a speciation time between
human and chimpanzee around 4.1 My and a large ancestral effective population size of 60,000 for the human-chimpanzee ancestor. Dutheil et al. (10) found similar estimates with the same data
set while accounting for substitution rate variation across sites, and
estimated an average recombination rate of 1.7 cM/Mb.
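The Markov assumption is what makes likelihood computation linear in alignment length: it reduces to a standard HMM forward recursion over hidden genealogies. The sketch below is a toy illustration of that recursion only, not the actual model of refs. 9, 10 (which derives the transition matrix from the coalescent with recombination); the emission log-likelihoods, transition probabilities, and initial distribution are placeholders to be supplied by such a model.

```python
import math

def forward_loglik(obs_loglik, trans, init):
    """Forward algorithm along alignment columns.

    obs_loglik[i][s] = log P(column i | genealogy s);
    trans[s][t] = P(genealogy t at the next site | genealogy s);
    init[s] = P(genealogy s at the first site).
    Returns log P(alignment) under the Markov-along-sites approximation.
    (A production version would rescale or stay in log space throughout
    to avoid underflow on long alignments.)
    """
    states = range(len(init))
    alpha = [math.log(init[s]) + obs_loglik[0][s] for s in states]
    for col in obs_loglik[1:]:
        alpha = [math.log(sum(math.exp(alpha[s]) * trans[s][t] for s in states))
                 + col[t] for t in states]
    return math.log(sum(math.exp(a) for a in alpha))
```

With uniform emissions, transitions, and initial probabilities, the likelihood factorizes column by column, which gives an easy correctness check.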


5. Specific Issues Faced When Dealing with Genomic Data

In previous sections, we discussed population genetic models
for between-species comparisons and methods for parameter estimation.
We now describe several pitfalls encountered when analyzing
whole-genome data sets, including sequencing errors and
alignment errors, but also computational and statistical issues related
to the large data sets underlying genomic analyses.

5.1. Sequencing Errors and Rate Variation

Sequencing errors are a well-described source of bias in population
genetics analyses, resulting in an excess of singletons (22). When
full genome sequences are used, the issue becomes more complex, as
the error rate differs between and within sequences, not only due to
coverage variation but also due to properties of the genome (base
composition, repeated elements, etc.). Such errors result in a departure
from the molecular clock hypothesis, potentially leading to
biases in parameter estimates, such as asymmetries in genealogy
frequencies (23, 24). In this respect, data preprocessing becomes
a crucial step in any genomic analysis. Many methods would also
benefit from proper modeling of such errors.
Burgess and Yang noticed that sequencing errors can be seen as a
contemporary acceleration in external branches, resulting in an
extra branch length (12). Such an extra length can be easily accommodated
in many models. It has to be noted that only a differential
in error rates between lineages results in a departure from the molecular
clock, and in such approaches one still has to assume that at least
one sequence is error free. In addition, as noted by the authors,
assuming a constant error rate over all genomic positions may also
turn out to be inappropriate, and better models should allow this
rate to vary along the sequence. Such approaches remain to be
explored. Moreover, sequencing errors are not distinguishable
from lineage-specific acceleration (or deceleration in another
species). In that respect, sequence quality scores can be a valuable
source of information. They are currently used to preprocess
the data by removing doubtful regions, but could ultimately be
used within the modeling framework.
The rate of substitution also varies along the genome, which
potentially affects the reconstruction of sequence genealogies, a phenomenon
well known to phylogeneticists. Here, things are a bit
easier, as the tools developed for phylogenetic analysis can in most
cases be applied at a reasonable cost. This generally consists in
assuming a prior distribution of the site-specific rate and integrating
the likelihood over all possible rates (10, 12, 14). Alternatively, one
can also use one or more outgroup sequences to calibrate the rate,
as in refs. 17, 25.


5.2. Aligning Genomes

To sequencing errors, one should add assembly errors due to the
sequencing technology. Assembling reads can be error prone in
the case of repeated or duplicated regions, which can ultimately lead to
comparing nonorthologous regions. In addition to this technical issue,
genome data are intrinsically fragmented, firstly because of chromosomal
organization, but also because of rearrangements that prevent
molecule-to-molecule alignment from one species to another.
A genome data set is, therefore, a set of distinct alignments, one per
synteny block. Building the genome alignment, that is, recovering
the synteny structure, is, therefore, subject to potential issues
that are close in effect to assembly errors. Finally, as all comparative
methods rely on an input alignment, any artifact affecting the
alignment process itself is relevant. As population-level methods are
based on closely related species, alignment programs are, however,
expected to perform accurately, and alignment errors should be
negligible compared to other sources. So far, the only way to deal
with such errors is to restrict the analysis to regions where orthology
can be unambiguously resolved, mostly by removing short synteny
blocks and regions that contain a high proportion of repeated
elements, gaps, and duplications.

5.3. Computational Load

Dealing with genomic data heavily relies on computer performance.
Depending on the genome sizes and the method used,
the analysis may cover from millions to billions of genomic positions.
As most methods rely on maximum likelihood or Bayesian
inference, efficient algorithms and software implementations are
much needed. Fortunately, the data structure comes in handy here:
independent parts of the genomes, like chromosomes, synteny
blocks, or even loci, depending on the methodology used, can be
analyzed separately, therefore enabling easy parallelization on
computer grids. Aside from the computational issue, the genomic
era has also dramatically changed the structure of result tables.
While analyzing per-gene result sets, consisting of a few dozen
thousand rows, is still feasible with statistical software like R, it
becomes much more problematic when per-site result sets are
considered. As our understanding of genome evolution grows, we
are more keen on fishing out specific regions with a peculiar demographic
or selective history. Such data sets typically reach sizes
of several million rows. While they can still be loaded into the
memory of well-equipped computers, a single pass over
the table to retrieve information becomes prohibitive, which
is problematic when several sets are to be compared (for
instance, in order to compare a window-based calculation with
gene annotations). The only alternative currently available is to
use database engines with proper indexing algorithms. Such databases
are currently used in genome browsers, like the UCSC
genome browser. In that respect, cross-information storage and


retrieval, as well as Web-based services, will become even more


crucial for genome data analysis.
5.4. Statistical Challenges
The genetics-to-genomics shift also leads to new challenges in data
analysis. When tests are performed, for instance when comparing
models of speciation as in ref. 11, the global false discovery rate
has to be properly controlled. As genomes are not analyzed in
one single analysis (at least full chromosomes are analyzed independently,
but in most cases chromosomes are also split into several
parts), multiple testing issues occur. Multiple testing also matters
when scanning for candidate regions, for instance regions under a specific
selection regime. Verhoven et al. (26) offer a nice tutorial presenting
appropriate statistical methods for handling multiple testing.
A related matter, when performing several types of tests on a wide
set of genomic regions, is the so-called overoptimism issue, also
named data optimization (27). This concerns the selection of
data sets in order to increase the significance of results, resulting in a
potential bias. In genomics, the data set selection often takes the
form of an extensive filtering of the data in order to exclude regions
with potential paralogous sequences, low complexity, or known
functional roles. It therefore appears important to emphasize to
which particular regions of the genome the obtained conclusions
apply, and to report how they change when other
regions are included (see, for instance, ref. 12).
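As an illustration of the kind of correction discussed by Verhoven et al., the sketch below implements one standard FDR-controlling procedure, the Benjamini-Hochberg step-up (function name and example p-values are ours, not tied to any analysis in this chapter).

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha.

    Sort the m p-values, find the largest rank k with p_(k) <= k/m * alpha,
    and reject the hypotheses with the k smallest p-values.
    Returns the indices (into pvalues) of the rejected hypotheses.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

pvals = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
         0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0000]
rejected = benjamini_hochberg(pvals, alpha=0.05)
```

Unlike a Bonferroni correction, the threshold grows with the rank, which preserves power when many loci carry genuine signal, a typical situation in genome scans.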

6. Discussion
Studying the speciation process with genome data implies new
modeling challenges, as the basic configuration of a population
genetics data set is drastically changed: instead of having a few loci
sequenced in several individuals, we have an (almost) exhaustive set
of loci sequenced in one individual for a few species. The change
involves the spatial dimension, but also time, as the processes under
study occurred much further back in time than the ones that are
commonly studied with a standard population genetics data set.
The use of the spatial signal has a major consequence, namely, that
recombination has to be dealt with, even if it is not directly modeled.
Apart from these considerations, ancestral population genomics, like population genetics, heavily relies on the study of sequence
genealogy, its shape, as well as its variation. The underlying models
build on existing intraspecies population modeling, as they
only need to add the species divergence process, that is, a moment
in time where two populations stop exchanging genetic
material and evolve fully independently. The simplest isolation model assumes that the speciation is instantaneous while the
isolation-with-migration model assumes that the two neo-species

12 Ancestral Population Genomics

311

can still exchange some material, at least for a certain time after the
split. Such a model is not different from a pure isolation model,
where the ancestral population is structured into two subpopulations: in the first case, the speciation time is defined as the time of
the split while in the second case it is the time of the last genetic
exchange. Recent work on primates (11) suggests that the speciation of human and chimp was not instantaneous. If the average
divergence of human and chimpanzee is a bit more than 6 My
(using a widely accepted mutation rate), then the split of the two
species initiated around 5.5 My ago, and the last genetic exchange
can be dated around 4 My.
The fact that we sample a large number of positions in the genome thus appears to counterbalance the reduced sampling of individuals within populations, allowing
the estimation of demographic parameters in the ancestor. Nonetheless, complexity limits are rapidly reached when considering, for
example, three closely related species that can exchange migrants.
More complex demographic scenarios, incorporating for instance
variation in population sizes, will also add additional parameters
that might not all be identifiable.
If the ancient speciation processes have left signatures in the
contemporary genomes, we do not know yet how far back in time
this is true. Intuitively, the signal is maximal when the variation
in divergence due to polymorphism is large enough compared
to the total divergence. The divergence due to polymorphism is
proportional to the ancestral population size while the divergence
of species is only dependent on the time when it happened. So the
further back in time we are looking at, the bigger the population
sizes need to be so that the ancient polymorphism leaves a signature
in the total divergence time. In addition, one has to take into consideration sequence saturation, due to the large number of substitutions accumulated since the ancient split, and the fact that the complexity of demographic scenarios increases with time. For instance, when considering the evolution of a species over several million generations, the probability that at least one bottleneck occurred, resetting the signal from past events, is not negligible.
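The trade-off sketched above can be quantified with a back-of-the-envelope calculation (ours, under the simplest neutral model with no gene flow): for two genomes whose species split T generations ago from a diploid ancestor of constant size N, the expected pairwise coalescence time is T + 2N generations, so the fraction of the expected divergence contributed by ancestral polymorphism is 2N / (T + 2N).

```python
def ancestral_fraction(split_generations, n_anc):
    """Fraction of the expected sequence divergence contributed by
    coalescence in the ancestral population (on average 2N extra
    generations, for a diploid ancestor of constant size N)."""
    return 2 * n_anc / (split_generations + 2 * n_anc)

# Hominoid-like numbers: 20-year generations, N_anc = 50,000.
for split_my in (1, 5, 25):
    gens = split_my * 1e6 / 20
    frac = ancestral_fraction(gens, 50_000)
    print(f"split {split_my:>2} My: {100 * frac:.1f}% of divergence is ancestral polymorphism")
# prints roughly 66.7%, 28.6%, and 7.4%
```

The older the split, the smaller the ancestral contribution, which is exactly why very ancient speciations require very large ancestral populations to leave a detectable signature.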
The population genomics era is just ahead, when we will have full individual genomes for closely related species. Such data sets are the key to understanding the detailed evolutionary processes linked to the formation and evolution of species, as they will open windows onto new periods in time. Analyzing such data sets with the current methodologies, however, poses two major challenges: (1) developing appropriate computational tools able to handle such data sets on current machines (both in terms of processor speed and memory usage) and (2) designing realistic models with enough complexity to capture the most important historical events while remaining computationally tractable.

312

J.Y. Dutheil and A. Hobolth

7. Exercises
7.1. ILS in Primates

Assuming that there are 5 My between the speciation times of human with the gorilla and with the orangutan, and that the HG ancestral effective population size was 50,000, what is the expected amount of ILS among human, gorilla, and orangutan? Assuming that another 2.5 My separates the speciations of human with chimpanzee and with gorilla, with an HC effective ancestral population size of 50,000, what is the expected amount of ILS among human, chimpanzee, and gorilla? We assume a generation time of 20 years for all extant and ancestral primates.
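A sketch of the computation (helper function ours): by the standard coalescent result (see, e.g., ref. 5), the two lineages entering the ancestral population fail to coalesce along an internal branch of t generations with probability exp(-t / (2 Ne)), and two of the three equally likely topologies that then result are discordant, giving P(ILS) = (2/3) exp(-t / (2 Ne)).

```python
import math

def p_ils(branch_years, n_e, generation_time=20):
    """Expected ILS fraction for a species trio: probability that the
    two entering lineages do not coalesce on the internal branch,
    times 2/3 for ending up with a discordant topology."""
    t_gen = branch_years / generation_time        # internal branch in generations
    return (2.0 / 3.0) * math.exp(-t_gen / (2.0 * n_e))

# Exercise values: 20-year generations, Ne = 50,000 in both ancestors.
print(f"HG ancestor, 5 My branch:   {p_ils(5.0e6, 50_000):.3f}")   # ~0.055
print(f"HC ancestor, 2.5 My branch: {p_ils(2.5e6, 50_000):.3f}")   # ~0.191
```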

7.2. Estimating Ancestral Population Size from the Observed Amount of ILS

Given that 30% of incomplete lineage sorting is observed among human, chimpanzee, and gorilla, and assuming a generation time of 20 years and that 2.5 My separate the splits between human/chimpanzee and human-chimpanzee/gorilla, what is the effective ancestral population size compatible with this observed amount? Using Burgess and Yang's method (12), a researcher finds a higher estimate of Ne than expected. What could explain this discrepancy?
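Inverting the relation P(ILS) = (2/3) exp(-t / (2 Ne)) gives a point estimate of the ancestral effective size (a sketch, ours; it assumes ILS is measured as the fraction of discordant genealogies):

```python
import math

def n_e_from_ils(ils_fraction, branch_years, generation_time=20):
    """Solve (2/3) * exp(-t / (2 * Ne)) = ils_fraction for Ne,
    with t the internal branch length in generations."""
    t_gen = branch_years / generation_time
    return t_gen / (-2.0 * math.log(1.5 * ils_fraction))

# 30% ILS over a 2.5 My internal branch, 20-year generations:
print(f"Ne = {n_e_from_ils(0.30, 2.5e6):,.0f}")  # roughly 78,000
```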

7.3. Number of Migration Rates in the General k-Population IM Model

In this exercise, we show that a k-population IM model has 2(k − 1)² migration rates.
1. Starting at the bottom of the k-population IM model, argue that the number of migration rates at the level of the k extant populations is k(k − 1).
2. Moving up to the next level, where (k − 1) populations are present (one of them being an ancestral population; we assume that two speciation events are never simultaneous), argue that the new ancestral population introduces 2(k − 2) new migration rates.
3. Moving up yet another level, where (k − 2) populations are present, argue that the new ancestral population introduces 2(k − 3) new migration rates.
4. Show that the total number of migration rates is 2(k − 1)².
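The level-by-level count can be checked numerically against the closed form (a quick sanity check, ours, not part of the original exercise): k(k − 1) rates among the k extant populations, plus 2(j − 1) new rates involving the new ancestor at each higher level where j populations remain.

```python
def im_migration_rates(k):
    """Count migration rates in a k-population IM model by summing
    level by level up the species tree."""
    total = k * (k - 1)                 # among the k extant populations
    for j in range(k - 1, 1, -1):       # levels with j = k-1, ..., 2 populations
        total += 2 * (j - 1)            # new rates involving the new ancestor
    return total

for k in range(2, 7):
    assert im_migration_rates(k) == 2 * (k - 1) ** 2
print([im_migration_rates(k) for k in range(2, 7)])  # [2, 8, 18, 32, 50]
```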

Acknowledgments
The authors would like to thank Thomas Mailund for providing useful comments on this chapter. This publication is contribution 2011-035 of the Institut des Sciences de l'Évolution de Montpellier (UMR 5554-CNRS). This work was supported by the French Agence Nationale de la Recherche "Domaines Emergents" (ANR-08-EMER-011 "PhylAriane").


References
1. Siva, N. (2008), 1000 genomes project. Nature Biotechnology 26(3), 256
2. Weigel, D., Mott, R. (2009), The 1001 genomes project for Arabidopsis thaliana. Genome Biology 10(5), 107
3. Enard, D., Depaulis, F., Roest Crollius, H. (2010), Human and non-human primate genomes share hotspots of positive selection. PLoS Genet 6(2), e1000840
4. Siepel, A. (2009), Phylogenomics of primates and their ancestral populations. Genome Research 19(11), 1929–1941
5. Wakeley, J. (2008), Coalescent Theory: An Introduction, 1st edn. Roberts & Company Publishers
6. Tavaré, S. (2004), Ancestral inference in population genetics, vol. 1837, pp. 1–188. Springer Verlag, New York
7. Takahata, N., Nei, M. (1985), Gene genealogy and variance of interpopulational nucleotide differences. Genetics 110(2), 325–344
8. Nielsen, R., Wakeley, J. (2001), Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158(2), 885–896
9. Hobolth, A., Christensen, O.F., Mailund, T., Schierup, M.H. (2007), Genomic relationships and speciation times of human, chimpanzee, and gorilla inferred from a coalescent hidden Markov model. PLoS Genet 3(2), e7
10. Dutheil, J.Y., Ganapathy, G., Hobolth, A., Mailund, T., Uyenoyama, M.K., Schierup, M.H. (2009), Ancestral population genomics: the coalescent hidden Markov model approach. Genetics 183(1), 259–274
11. Yang, Z. (2010), A likelihood ratio test of speciation with gene flow using genomic sequence data. Genome Biol Evol 2, 200–211
12. Burgess, R., Yang, Z. (2008), Estimation of hominoid ancestral population sizes under Bayesian coalescent models incorporating mutation rate variation and sequencing errors. Molecular Biology and Evolution 25(9), 1979–1994
13. Tavaré, S. (1979), A note on finite homogeneous continuous-time Markov chains. Biometrics 35, 831–834
14. Wang, Y., Hey, J. (2010), Estimating divergence parameters with small samples from a large number of loci. Genetics 184(2), 363–379
15. Hobolth, A., Andersen, L.N., Mailund, T. (2011), On computing the coalescence time density in an isolation-with-migration model with few samples. Genetics 187(4), 1241–1243
16. Hey, J. (2010), Isolation with migration models for more than two populations. Mol Biol Evol 27(4), 905–920
17. Patterson, N., Richter, D.J., Gnerre, S., Lander, E.S., Reich, D. (2006), Genetic evidence for complex speciation of humans and chimpanzees. Nature 441(7097), 1103–1108
18. Barton, N.H. (2006), Evolutionary biology: how did the human species form? Curr Biol 16(16)
19. Wakeley, J. (2008), Complex speciation of humans and chimpanzees. Nature 452(7184), E3–E4; discussion E4
20. Mailund, T., Dutheil, J.Y., Hobolth, A., Lunter, G., Schierup, M.H. (2011), Estimating speciation time and ancestral effective population size of Bornean and Sumatran orangutan subspecies using a coalescent hidden Markov model. PLoS Genetics 7(3), e1001319
21. Marjoram, P., Wall, J.D. (2006), Fast coalescent simulation. BMC Genet 7(1)
22. Achaz, G. (2008), Testing for neutrality in samples with sequencing errors. Genetics 179(3), 1409–1424
23. Slatkin, M., Pollack, J.L. (2008), Subdivision in an ancestral species creates asymmetry in gene trees. Mol Biol Evol 25(10), 2241–2246
24. Hobolth, A., Dutheil, J.Y., Hawks, J., Schierup, M.H., Mailund, T. (2011), Incomplete lineage sorting patterns among human, chimpanzee and orangutan suggest recent orangutan speciation and widespread natural selection. Genome Research 21(3), 349–356
25. Yang, Z. (2002), Likelihood and Bayes estimation of ancestral population sizes in hominoids using data from multiple loci. Genetics 162(4), 1811–1823
26. Verhoeven, K.J., Simonsen, K.L., McIntyre, L.M. (2005), Implementing false discovery rate control: increasing your power. Oikos 108(3), 643–647
27. Boulesteix, A.L. (2010), Over-optimism in bioinformatics research. Bioinformatics 26(3), 437–439
28. Chen, F.C., Li, W.H. (2001), Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. American Journal of Human Genetics 68(2), 444–456

Chapter 13
Nonredundant Representation of Ancestral
Recombinations Graphs
Laxmi Parida
Abstract
The network structure that captures the common evolutionary history of a diploid population has been
termed an ancestral recombinations graph. When the structure is a tree, the number of internal nodes is usually O(K), where K is the number of samples. However, when the structure is not a tree, this number has
been observed to be very large. We explore the possible redundancies in this structure. This has implications
both in simulations and in reconstructability studies.
Key words: Ancestral recombinations graph, ARG, Redundancies, Minimal descriptor, Coalescent,
WrightFisher, Population simulators, Nonredundant

1. Introduction
In keeping with the theme of the book, we study in this chapter the
common evolutionary history of a diploid population. This common history is a phylogeny with the extant members at the terminal
or leaf nodes. The internal nodes of the topology are some common ancestors while the edges can be viewed as conduits for the
flow of genetic material. The direction on the edges represents the
direction of flow. A directed edge from node v1 to node v2 is to be
interpreted as v1 being an ascendant of v2 or v2 is a descendant of v1.
The topology has no cycles since, no matter what the underlying
model, a member is not an ancestor of itself. Thus, the topology is
always a directed acyclic graph (DAG). Under uni-parental (unilinear) transmission each member at a generation derives all its genetic
material from only one parent whereas under a biparental model a
member derives the material from two parents. Then does this
simple difference in inheritance in the two models have an effect

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_13,
# Springer Science+Business Media, LLC 2012

315

316

L. Parida

on the overall topology of the common evolutionary history?


Under uniparental model a unit has only one ancestor (ascendant)
in an earlier generation while under biparental model a unit can
have multiple ancestors. But in both models, a unit can have
multiple descendants at a future generation. Thus, the DAG for
only the uniparental model is guaranteed to be a tree.
One of the primary genetic events shaping an autosomal chromosome is recombination, a process that occurs during meiosis and results in offspring having different combinations of the homologous genes, or chromosomal segments, of the two parents. The topology incorporating this has been called the ancestral recombinations graph (ARG): an annotated network structure that captures the common evolutionary history of the extant haplotypes. This subject is also discussed in the chapter on Ancestral Population Genomics in this book. The random mathematical object, the ARG, was introduced in the context of modeling
population evolution in the field of population genetics (1, 2).
Thus, the ARG is not only used for modeling population evolution
(3), but is also the object of interest in the reconstruction of the
evolutionary history from the haplotypes of extant samples (4, 5). For
the latter, the ARG is viewed as a phylogeny of the extant samples.
The reader must keep this general view of ARG in mind for the
chapter.
In summary, the topology of the evolutionary history of a
diploid population is a rather complicated network that represents
the flow of the genetic material down to the extant units. See Fig. 1
for a visualization of the ARG that simulates the history of 210
samples or extant units (see the figure caption for details). The
complexity of this combinatorial structure prompts the following question: Is it possible to identify a substructure that really matters to the extant units? The problem addressed in this chapter is the extent of topological redundancies, if any, in such structures. This understanding of redundancy is useful both for reconstruction and for simulation studies: in the former, it makes it possible to obtain an algorithm-independent bound on the recoverability of the common history; in the latter, it has the potential to produce simpler simulation systems. In any case, the redundancy of a model is never an irrelevant mathematical question to ask.

2. Background
The ideal population or Wright–Fisher model assumes some properties of the evolving population such as constant population size
and nonoverlapping generations. While these conditions appear
nonrealistic at first glance, the assumptions are reasonable for the


Fig. 1. The terminal (leaf) nodes are as follows: the 60 brown nodes represent African samples, the 50 blue nodes African-American samples, the 50 yellow nodes Asian samples, and the 50 green nodes European samples. The internal cyan and red nodes are recombination nodes, and gray nodes are coalescent nodes. The simulation was generated with COSI (2) and the visualization using Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). The red recombination nodes are the ones reconstructed by the method in (1).

purposes of the study of the genetic variations at the population level. In fact, models with varying population size and/or overlapping generations can be reparameterized for an equivalent Wright–Fisher model (see texts such as refs. 3, 6). Yet another property of the evolving Wright–Fisher population is panmixia.
Panmictic means that there is no substructuring of the population
due to mating restrictions caused by mate selection, geography, or
any other such factors. Thus the model assumes equal sex ratio and
equal fecundity. Figure 2a shows the complete pedigree history of
four (K = 4) samples with a population size of eight males and eight females (N = 8). The network structure is a random graph written as GPG(K, N). An ARG, which tracks some fixed locus on all
the K samples, is a subgraph of this complete pedigree history and
an instance is shown in Fig. 2b. To mimic the genetic diversity
patterns seen in worldwide human populations, it is important to
also weave in other influencing factors such as different migration,
(site) selection, and expansion models.
As discussed earlier, if the locus under study is always transmitted from a single parent, then the topology of the evolutionary
history is a tree (i.e., no closed paths in the directed graph). The
mitochondrial genome and nonrecombining Y chromosome satisfy
this property. The former is always transmitted from the mother


Fig. 2. (a) The first ten generations of the relevant part of the complete pedigree graph GPG(K, N) with K = 4 and N = 8. The solid (blue) dots represent one gender, say males, and the hollow (red) dots represent the other gender (females). Each row is a generation, with the direction on edges indicating the flow of the genetic material; the four extant units are at the bottom row, i.e., row 0. Under the Wright–Fisher population model, there are equal numbers of males and females in each row, and the two distinct parents, one male and one female from the immediately preceding generation, are randomly chosen. (b) Tracking a locus gives a subgraph of (a).

and the latter from the father. However, if the locus is on the
autosome or even the X chromosome then the genetic material
may be transmitted from two parents. This implies that the topology
of the evolutionary history is no longer a tree, but a network (i.e., it
may have closed paths in the directed graph). Thus, due to the
occurrence of genetic exchange event, such as recombination, the
common evolutionary history can no longer be captured by a tree.
The network that captures both the genetic exchange event (such as
recombinations) and events that do not exchange genetic material
between parents (such as mutations) is the ARG. For simplicity of
exposition we call the class of latter events as nonexchange events.
Notice that this important distinction in the topological characteristics arises simply from the basic locus-inheritance model, that
is uniparental or biparental. The rest of the model characteristics
define the depth (or age) distribution of the nodes. Thus, it is
important to note the subtlety that an ARG is a random object
and there are many (infinite) instances of the ARG. Usually, when
we say that a topological property holds for the ARG, we mean that
the property that holds for every instance of the ARG, i.e., the
property holds with probability 1. Note that some may hold for a
subset of instances (such as unboundedness).
Focusing on the topology of the ARG and its effect on the
samples provides us with insights to identify vertices that do not
matter. Modeling these as missing nodes in the ARG leads to a
core that preserves the essential characteristics. The random object
ARG is defined by at least two parameters: K, the number of extant
samples and 2N, the population size at a generation. A Grand Most
Recent Common Ancestor (GMRCA) plays an important role in
restricting the zone of interest in the common evolutionary


structure. A GMRCA is defined as a unit whose genetic material is ancestral to all the genetic material in all the extant samples (6). Thus, while the relevant common evolutionary history of some K > 1 units is potentially unbounded, it is reasonable to bound this structure of interest with this single GMRCA. When a GMRCA exists, it is unique, and we say the ARG is bounded. When an ARG has no GMRCA, we call it unbounded.
The least common ancestor (LCA) of a set of vertices V in a graph
is defined as a common ancestor of V with no other common ancestor
of V on any path from the LCA to any vertex of V. A combinatorial
treatment, based on random graphs, of the ARG is presented in (7).
The directed graph representation is acyclic; a root is analogous to a GMRCA, and the leaf nodes to the extant samples. Though tantalizingly similar, the GMRCA and the LCA do not define the same entity in an ARG. The edges (or nodes) of the ARG must be annotated with the genetic material they transmit. The absence of any annotation leads to the "ancestor without ancestry" paradox: it is possible for an individual with a finite amount of genetic material to have an infinite number of unrelated (i.e., no genetic flow between any pair) ancestors. This paradox is averted by annotating the ARG (7).

3. A Combinatorial Definition of ARG
The random object ARG is usually parameterized by three essential
parameters: K the number of extant samples, 2N the population
size, and recombination rate r (see texts such as ref. 3 for a detailed
description). The following theorem is paraphrased from (7):
Theorem 1. Every ARG G on K > 1 extant samples is the topological union of some M ≥ 1 trees (or forests).

The alternative definition of an ARG suggested by this theorem is illustrated in Fig. 3. Here an ARG, defined on four (K = 4) extant samples, is decomposed into three (M = 3) trees. Note that M is the number of nonmixing or completely linked segments in the extant samples. In both models, all the samples are of the same length, say l; in the latter, the length of each of the M segments is additionally specified as l1, l2, . . ., lM, with l1 + l2 + . . . + lM = l.
We describe the graph G (ARG) here. Although the figures do
not show the direction of the edges to avoid clutter, the direction is
toward the more recent generation (or the leaves). In other words,
the leaf (extant) nodes have no outgoing edges and the root node
has no incoming edges. The edges of the ARG are annotated with
genetic events and these labels are displayed in the illustrations. See
Fig. 4a for an example. An edge in G is defined to have multiple
strands. In the illustrations, the multiple strands are shown as
distinct colors, each color corresponding to one of the component

320

L. Parida

Fig. 3. Here K = 4 and the extant samples are numbered 1, 2, 3, and 4. The hatched nodes are the genetic exchange nodes. (a) The topology of an ARG G, where the GMRCA is marked by an additional rectangle (on top). (b) A possible embedding of (a) by three trees (shown in green, red, and blue, respectively).

trees i, 1 ≤ i ≤ M. Between any pair of vertices v1 and v2, no two strands can be of the same color. Thus, the number of multiple strands, corresponding to an edge, between a pair of vertices can be no more than M. An i-path from node v1 to node v2 is a path where all the edges in the path are on the component tree i.
The annotations on the edges play a critical role since it is
these annotations that ultimately shape the units on the leaf nodes.
In this chapter, samples refer to extant samples. The two kinds of genetic events represented in the graph are (1) nonexchange and (2) exchange events. While the latter are modeled by the genetic-exchange nodes, the former are modeled by labels on the edges. To keep this discussion simple, let the nonexchange genetic event correspond to single nucleotide polymorphisms (SNPs). The set of labels of edge v1v2 is written as lbl(v1v2). Then xi ∈ lbl(v1v2) is a label on strand i of edge v1v2. For example, in Fig. 4a, the labels
on the green tree are the SNPs a, b, c, d. Also, the exact position of
the SNP on the genome does not matter. However, in the ARG, a
particular ordering of the M trees is assumed and hence the SNPs
of each of the M trees respect this order (this is reflected in the
sample definitions below where green is the leftmost segment and
blue the rightmost). Each strand of an edge is labeled by a set of
genetic events (SNPs), possibly empty. A node with multiple
ascendants (parents) is called a genetic-exchange node. A node
with multiple descendants (children) is a coalescent node. Note
that a node can be both a coalescent as well as a genetic-exchange
node. In the figure a genetic-exchange node is hatched.

13

Nonredundant Representation of Ancestral Recombinations Graphs

321

Fig. 4. (a) Genetic event labels on the edges. At each node the nonmixing segment corresponding to the embedded tree is
shown in the same color as that of the tree. The three embedded trees are shown separately in (b), (c), and (d).

Next, we define the samples represented by the graph instance G of the ARG. This is denoted S(G): a set of K sequences, where K is also the number of leaf nodes in G. Each sequence is obtained simply by flowing the genetic event labels of tree i, 1 ≤ i ≤ M, along paths of color i all the way down to the leaf (sample) units. In other words, for each extant unit u on G, let the corresponding sequence be s(u) (∈ S(G)). Each label is associated with a chromosomal position, but its exact location on the sequence does not really matter in this framework. However, we

322

L. Parida

Fig. 5. Example of an unbounded ARG. Here K = 3, corresponding to the samples numbered 1, 2, and 3, and M = 2, for the two segments colored red and green. The pattern of vertices and edges can be repeated along the dashed edges to give an unbounded structure.

use the value of the label to define the sequence s(u). Let P(s(u)) denote the elements of s(u). Then

P(s(u)) = ∪(i = 1..M) { xi | xi ∈ lbl(v1v2) and there exists an i-path from v2 to u }.

Although the exact location does not matter, the labels of a strand (tree or color) i are adjacent on the chromosome sequence s(u). Let s1, s2, s3, and s4 be the sequences corresponding to the extant units marked 1, 2, 3, and 4, respectively, in Fig. 4a. Assigning colors and a relative ordering to the strand labels, the four aligned samples are:

S(G) = { s1 = - b - - - - r - v w - z ;
         s2 = a b - - p - - - - - x - ;
         s3 = a b - - p q - - - - x - ;
         s4 = - - c d - - - s - - x - }     (1)

The '-' here is to be interpreted as the ancestral allele.
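The label flow that defines S(G) can be sketched with a toy encoding (ours, not from the chapter): represent each embedded tree as a map child → (parent, labels on that edge); the content of an extant unit's sequence is then the union, over the M trees, of the labels encountered on its leaf-to-root path.

```python
def sample_labels(trees, leaf):
    """Collect, per embedded tree, the genetic-event labels that flow
    down to `leaf` (i.e., labels on the leaf-to-root path)."""
    present = set()
    for tree in trees:
        node = leaf
        while node in tree:            # climb until the tree's root
            parent, labels = tree[node]
            present.update(labels)
            node = parent
    return present

# Two toy trees over leaves 1 and 2, sharing internal nodes "r" and "s".
tree_green = {1: ("r", {"a"}), 2: ("r", {"b"})}
tree_red   = {1: ("s", {"p", "q"}), 2: ("s", set())}

print(sorted(sample_labels([tree_green, tree_red], 1)))  # ['a', 'p', 'q']
print(sorted(sample_labels([tree_green, tree_red], 2)))  # ['b']
```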
To summarize,
1. An ARG G must satisfy the following
(a) (topology) Every node v in G must have multiple children
or multiple parents (since chains are not informative).
(b) (annotations) The nonexchange genetic event label (say,
SNP) corresponding to a position on the samples must
transmit down to at least one extant sample.
2. Further, a nontrivial G must encode at least M − 1 genetic exchange events.
It is quite possible to have unbounded ARGs, i.e., ARGs with
no GMRCA. Figure 5 shows such an example. See the Exercise
for other families of unbounded structures on the Wright–Fisher population.


4. Redundancies in an ARG
How do we identify redundancies in the topology of an ARG?
Studying the effect of the topology on the samples provides us
with insights to identify vertices that do not matter. Modeling
these as missing nodes in the ARG leads to a core that preserves the
essential characteristics.
To maintain biological relevance, a missing node is modeled by the following vertex removal operation. Note that in an ARG, each node has an implicit depth associated with it that reflects its age (in generations); an alternative view is that the edge length denotes the age. In the following, the ages of the nodes do not change, and the new edges get their edge lengths from the ages of the nodes they connect. Given G and a node v in G, G\{v} is obtained in the following steps. (This is not the only possible definition of vertex removal, but it is a simple and natural one and is the one used in this chapter.)
1. For each child vc,i of v that is in the embedded tree i, 1 ≤ i ≤ M:
(a) (adding new edges) This child is connected by a new edge to vp,i, a parent of v in i.
(b) (annotating the new edges) The new edge between vp,i and vc,i is annotated as follows: for each strand i, the label of the new edge is the union of the labels on the i-path from vp,i to vc,i. Next, if a label xi appears on multiple new outgoing edges of vp,i, then it is removed from all but one of the outgoing edges. (This is to avoid introducing parallel mutations, i.e., the same label appearing multiple times on the embedded tree i.)
2. The node v, with all the edges incident on it, is removed from G.
4.1. Samples-Preserving Transformation

Two distinct ARGs G and G′ are samples preserving if and only if S(G) = S(G′). When two instances are samples preserving, all the allele statistics, including allele frequencies, LD decay, and so on, are identical in the two.
A node v of G is called nonresolvable if S(G) = S(G\{v}). The intuition is that if removing the node v has no effect on the samples, then no algorithm can detect the node using only the samples. Node v is called resolvable if S(G) ≠ S(G\{v}). Again, the intuition is that some algorithm may be able to detect the node in this case.

4.2. Structure-Preserving Transformation

Next we identify the vertices in G that determine the topology (as well as the branch lengths) of the M embedded trees. Given G and G′, if each of the M embedded trees in G and G′ is identical in topology as well as branch lengths (in generations), then G′ preserves the structure of G and vice versa.


Note that the embedded trees (also called marginal trees) are very important in an ARG and critical in defining it: not just the topology but also the branch lengths, which represent the time (in generations) to the next coalescent event. Is it then possible to characterize a node whose removal leads to a structure-preserving transformation? A coalescent vertex in G is t-coalescent if and only if it is also a coalescent node in at least one of the M embedded trees. In fact, the following is proved in (8).

Theorem 2. If G′ = G\U and no t-coalescent vertex of G is in U, then G′ is structure-preserving.

In other words, if a set of coalescent nodes that are not t-coalescent is removed from G to obtain G′, then G and G′ are structure preserving. With this useful property, we are ready to zero in on a structure-preserving core.
4.3. Minimal Descriptor

We begin with the following theorem (8), which relates the t-coalescent property to resolvability.
Theorem 3. A resolvable coalescent node v is also t-coalescent in G.

The theorem shows that the vertices that ensure the invariance
of the branch lengths of each embedded tree are also resolvable,
leading to the following definitions.
1. An ARG G is a minimal descriptor if and only if every coalescent
vertex, except the GMRCA, is t-coalescent.
2. An ARG Gmd is a minimal descriptor of G if and only if (a) Gmd is a minimal descriptor, (b) Gmd preserves the structure of G, and (c) G and Gmd are samples preserving, i.e., S(G) = S(Gmd) holds.
Given G, let U be the set of all coalescent vertices in G, other than the GMRCA, that are not t-coalescent. Let G0 = G\U. By the definition of a minimal descriptor and the following statement, G0 is a minimal descriptor.
If v1 is a t-coalescent vertex in G and v2 is not, then v1 continues to be a t-coalescent vertex in G\{v2}. Further, if V1 is a set of t-coalescent vertices in G, and none of the vertices in V2 is, then each v ∈ V1 continues to be t-coalescent in G\V2.
The above gives a constructive description of a minimal descriptor. Let G0 be a minimal descriptor of G. Then G0 is biologically and evolutionarily relevant, as:
1. (Structure preserving) the embedded (marginal) trees of G and G0 are identical.
2. (Samples preserving) the allele statistics (including allele frequencies and LD decay) in the samples of both G and G0 are identical.

5. Properties of Minimal Descriptor

Although a minimal descriptor of an ARG is not unique (see Subheading 8), it nevertheless has very interesting properties. Figure 6 shows an example of a minimal descriptor of an ARG.
1. Boundedness. It is quite surprising that even an unbounded ARG G always has a bounded minimal descriptor. It takes some mathematical ingenuity to prove this, and the interested reader is directed to (8) for details; we just illustrate it through an example in Fig. 7a.
2. Overlap of genetic segments. This is a local property of a node that can potentially be used in designing sampling algorithms. Let v be a coalescent node, other than the GMRCA, in a minimal descriptor ARG, with descendants u1, u2, . . ., ul, for some l > 1. Then for each descendant ui of v there exists another descendant uj of v overlapping with ui, 1 ≤ i ≠ j ≤ l. Figure 7b shows an example. Note that it is adequate that the overlap is only pairwise.
3. Small size. The number of vertices in a minimal descriptor ARG is not just guaranteed to be finite (by 1 above) but is also quite small. Let nc be the number of coalescent events, ne the number of genetic exchange events, and nv the number of


Fig. 6. Overall picture: (a) A generic ARG and all its genetic flow, thus defining the samples S(G). The two marked nodes are
not t-coalescent. (b) A minimal descriptor Gmd that preserves the structure of G. Although the graphs are clearly topologically
very different, they define exactly the same samples, i.e., S(G) = S(Gmd).

L. Parida

Fig. 7. (a) Bounded Gmd of unbounded G of Fig. 5. (b) Pairwise overlap of genetic segments in the children of node v.

vertices, excluding the leaf nodes, in a nontrivial minimal descriptor ARG. Then

1 ≤ nc ≤ M(K − 1) + 1,
0 ≤ ne ≤ K(M − 1) + M(K − 1),
nv = O(MK).
This property is surprising, since most current simulators produce an extremely large number of internal nodes. It appears that
most of them have no effect either on the marginal tree structures
or on the samples. We end this discussion with this interesting
observation.
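Property 2 above (pairwise overlap among the genetic segments carried by the children of a coalescent node) translates directly into a checkable predicate. The half-open interval encoding of segments below is an illustrative assumption, not the chapter's notation.

```python
def overlap(s, t):
    """Two half-open genetic segments [a, b) overlap iff they share a position."""
    return max(s[0], t[0]) < min(s[1], t[1])

def pairwise_overlap_property(segments):
    """Check property 2 at a coalescent node v: every child segment u_i
    must overlap some other child segment u_j (1 <= i != j <= l).
    Pairwise overlap suffices; no common intersection is required."""
    return all(any(i != j and overlap(si, sj)
                   for j, sj in enumerate(segments))
               for i, si in enumerate(segments))

print(pairwise_overlap_property([(0, 5), (3, 8), (6, 10)]))  # True: a chain of pairwise overlaps
print(pairwise_overlap_property([(0, 2), (5, 8)]))           # False: disjoint children
```

The first example shows why only pairwise overlap is demanded: the three segments have no position in common, yet each overlaps a neighbor.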

6. Population Simulators
A model-less approach to simulation is to take an existing population sample S and perturb it to obtain S′ with properties similar
to S. Here, however, we discuss systems that explicitly model
populations evolving under the Wright-Fisher model (9).
It is important to point out that the literature abounds with population
simulation systems and the list of simulators mentioned here is by
no means complete. Rather, the attempt here is to classify them
based on their underlying approaches. The simulation systems are
aligned along two approaches: forward and backward. In the former, the simulation of events proceeds forward in time, that is,
from past to present. While this is the natural direction to proceed,
a trickier approach is to simulate backward in time, that is, from
present to past. In principle, this is more economical in space and
time. In both approaches an implicit phylogeny structure is constructed; we call the reduced version of this the ARG in Fig. 8.
An internal node in an ARG is either a coalescent node or a genetic
exchange node, but not both. A mathematically interesting
approach is to simulate the time to the next coalescent, or recombination, event without explicit simulation of every generation.
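As a concrete anchor for the forward scheme, a neutral haploid Wright-Fisher generation step can be simulated in a few lines. This is a generic illustration of the scheme, not the implementation of any simulator named in this chapter; the function name and the return encoding are assumptions.

```python
import random

def wright_fisher_forward(n, generations, seed=0):
    """Forward-in-time neutral Wright-Fisher simulation of n haploids:
    every individual in generation t+1 picks its parent uniformly at
    random from generation t. The returned per-generation parent
    indices implicitly encode the genealogy that backward schemes
    construct directly."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)]
            for _ in range(generations)]

history = wright_fisher_forward(n=5, generations=100)
# Tracing the parent lists backward from the present recovers the
# genealogy (a tree) of the sample; adding recombination would give
# each individual a second parent and turn the tree into an ARG.
```

The space cost is the weakness alluded to above: the whole population must be stored at every generation, whereas a backward scheme only tracks the lineages ancestral to the sample.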



Fig. 8. A classification of the model-based (hence an associated ARG) population evolution systems based on their
underlying architectures. The software systems are shown either in red or green. The systems in green additionally
incorporate selection and/or demographics to produce genetic diversity patterns that somewhat reflect the current
populations. Bottom to top: Backward and forward are the two basic schemes with hybrid as a combination of the two.
Coalescent is a mathematically interesting backward scheme whose ARG topology characterizes it as a binary ARG. A set
of simulators are listed here as approximate coalescent which are attempts at removing redundancies in the underlying
binary ARG. The minimal descriptor, by its definition, is a nonredundant representation of the ARGs resulting from all the
schemes (and additionally it is an exact coalescent model, hence the bifurcation in the coalescent lineage above).

The coalescent model captures this in the backward scheme. Figure 8
gives a classification of a few simulators along these lines.
The primary output of the simulators is the K sample (genetic)
sequences, given the population size N along with other parameters. The primary genetic exchange event captured in the simulators is recombination, although some simulators also incorporate
gene exchange. Realistic modeling of worldwide human populations requires
at least two more classes of parameters: (1) selection-related and (2) migration-related. Due to the inherent
complexity of the variations in the human population, the simulators generally handle populations at the level of continents, that is,
African, Asian, and European. Most of the programs do not make
the ARG available. The authors of COSI made the internal ARG
accessible to us (which has been visualized in Fig. 1).
6.1. Forward Simulators

Forward simulation is conceptually the simpler of the two
approaches. An advantage of this approach is its easy adaptability
to diverse evolutionary forces. simuPOP (10) is an individual-based
forward simulation environment. The system also allows for interactive evolution of populations. For ease of use, many basic


population genetics models are available through their "cookbooks." This is a suitable system for experimentation, since the
user can engineer complex evolutionary scenarios in the environment.
Next we discuss a few simulators that directly provide the
population samples based on a set of input parameters. SFS_CODE
(11) is a forward simulator that additionally handles effects of
migration, demographics, and selection. The migration model is
the general island model with complex demographic histories.
FREGENE (12) additionally incorporates selection, recombination
(crossovers and gene conversion), population size and structure,
and migration.
6.2. Backward Simulators

In the software GENOME (13), the authors simulate the coalescent
and recombination events at every generation, proceeding backward
in time. The standard coalescent model, in contrast, simulates the
time to the next event. GENOME thus models an evolutionary history more general than the standard coalescent model. In
the random-graphs framework of (7), the genetic exchange model or
mixed subgraph represented this more general model. In this chapter, to avoid confusion in terminology, such a general model is
simply called the generic ARG or just ARG. On the other hand, the
standard coalescent model is called the binary ARG, for reasons
discussed below.
FORWSIM (14) simulates a Wright-Fisher population of
constant size under natural selection at multiple sites, moving
forward in time. However, the authors describe it as a forward-backward simulator, since they simulate only those chromosomes in
the next generation that can potentially contribute to the future
population. This handling of multiple generations in a single step is
possible only with some backward insight. Hence, in Fig. 8, it is
classified as a hybrid scheme. Additionally, it also models self-fertilization, making it a possible candidate for plant populations.
The Standard Coalescent. Coalescent theory provides a continuous-time approximation for the history of a relatively small sample of
extant units from a large population. Under this framework, the
genealogy of a sample of DNA sequences is modeled backward in
time and (neutral) mutations are superposed on the structure to
generate sequence polymorphism data. Hudson introduced MS (9), the
seminal implementation, to sample sequences from a population
evolving under the Wright-Fisher model. COSI (15) is an implementation that adds human population
demographics to the coalescent model. In fact, the same parameters
were used in the forward simulator FREGENE discussed above.
SelSim (16) is yet another simulator based on the coalescent framework that incorporates natural selection. It is important to point
out a subtlety here. Usually under the coalescent model, the coalescence is between exactly two lineages and multiple genetic events


do not occur in the same generation in the common evolutionary
history. These simplifications help in defining the model as an
ordered sequence of events, as well as in estimating the time from
one event to the next. Thus in these simulators, every node has no
more than two descendants and no more than two ascendants;
hence this structure is called the binary ARG.
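The competing exponential clocks behind this scheme (the core of MS-style simulators) can be sketched as follows. The function name, the recombination parameter rho, and the time units (2N generations) are standard conventions stated here as assumptions, not quoted from any particular program.

```python
import random

def next_event(k, rho, rng):
    """Backward in time with k lineages, coalescence occurs at rate
    k(k-1)/2 and recombination at rate k*rho/2 (time in units of 2N
    generations). The waiting time to the next event is exponential
    with the total rate, and the event type is drawn in proportion to
    the two rates. Coalescence merges two lineages, recombination
    splits one, yielding the binary ARG described above."""
    coal_rate = k * (k - 1) / 2.0
    rec_rate = k * rho / 2.0
    wait = rng.expovariate(coal_rate + rec_rate)
    if rng.random() < coal_rate / (coal_rate + rec_rate):
        return wait, "coalescence", k - 1
    return wait, "recombination", k + 1

rng = random.Random(42)
k, t = 10, 0.0
while k > 1:  # run until a single lineage (the GMRCA) remains
    wait, kind, k = next_event(k, rho=1.0, rng=rng)
    t += wait
```

Because no individual generation is ever enumerated, the cost depends on the number of events, not on the population size N, which is the economy claimed for backward schemes above.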
Approximate Standard Coalescent. While the above methods generate events backward in time, an orthogonal approach, introduced
in (17), samples the events along the sequence. This is called the
Spatial Algorithm (SA), and one of its characteristic effects is that
the density of recombination breakpoints increases as one moves
along the sequence. Another (perhaps related) characteristic of SA
is that the process is not Markovian. The Sequentially Markov
Coalescent (18) introduces modifications to the process to make
the structure Markovian. Based on this model, in FastCoal (19), the
authors use an additional heuristic of retaining only a subset of local
trees while moving along the sequence. MaCS (20) is an implementation that includes human population demographics. It turns out that
all the models discussed here, including the Markovian structure,
only approximate the standard coalescent model. While each model
is defined algorithmically as a sequence of precise steps, the
reason for this lack of exactness is not understood well enough to provide
algorithmic modifications to close or reduce the gap with the
standard model. These simulators, which address redundancies, are
labeled approximate coalescent in Fig. 8.
6.2.1. Minimal Descriptor

The minimal descriptor is a compact version of the ARG that is
both samples preserving and structure preserving. It is a nonredundant structure that can be extracted from any ARG, no matter its
underlying model: the model could be based on forward or backward simulations, or even the backward coalescent. Notice that any
probability measure, such as the above, immediately induces (by
push forward) a measure on the space of minimal descriptors. Thus,
when the ARG is a binary coalescent, the minimal descriptor models the underlying standard coalescent exactly. Figure 8 illustrates this generality of the
minimal descriptor.
Assume that the true probability space of the ARGs is the one
implied by the Wright-Fisher model. In fact, the standard coalescent also does not exactly capture the Wright-Fisher model for high
enough recombination rates (see ref. 21). To address the issue of the
true probability space, Parida (7) defines a natural measurable space
over the combinatorial pedigree history structures and presents a
sampling algorithm based on it.
Any method that directly samples the space of minimal descriptors, such as in a statistical sampling setting, needs to (implicitly)
incorporate an underlying probability space. For instance, incorporation of the standard coalescent primarily manifests itself as the
problem of estimating branch lengths in the structures.


7. Conclusion
Population evolution models are important for understanding the differences and similarities among individual genomes, particularly given
the explosion of data in this area. While these models faithfully capture the
genetic dynamics of the evolving population, their structure is
usually very large, involving tens of thousands of internal nodes
for, say, a few hundred samples with a thousand SNPs each. The
complexity of this combinatorial structure raises the question of
redundancies within it. This chapter addressed this precise
question and gave a mathematical description of such a substructure.
This is important not only for simulation and reconstruction
purposes, but also opens the door to a comprehensive understanding of the genetic dynamics that ultimately shape the chromosomes.

8. Exercises
1. Construct an instance of GPG(4, 3) with no LCAs.
What is the probability of an instance of GPG(4, 3) having no
LCAs?
(Hint: see ref. 7 for the definition of a natural probability
measure).
2. (a) What is the difference in topology of a pedigree history
graph and ARG?
(Hint: How many parents must a diploid have?)
(b) When tracing a haploid, at most how many parents can the
extant unit have? Why? Does this hold for a unit at every
generation? (Hint: Fig. 9a.)
3. Is it possible to assign labels to the nodes of the ARGs in
Fig. 9b, c? Why?
4. Argue that the number of resolvable nodes decreases with
depth of the nodes.
5. Argue that an ARG may have multiple minimal descriptors.
(Hint: Fig. 10.)

Acknowledgments
I would like to thank Marc Pybus for generating the visualization of
the ARG produced by COSI to show the world populations
(Fig. 1). I am grateful to the anonymous referees whose comments
have substantially improved the exposition.


Fig. 9. (a) Tracking haploids in diploids. (b) and (c) The pattern of connectivity is repeated in both to produce infinite graphs.
Fig. 10. Gmd and G′md are minimal descriptors of G.



References
1. R. R. Hudson. Properties of a neutral allele model with intragenic recombination. Theoretical Population Biology, 23(2):183–201, April 1983.
2. R. C. Griffiths and P. Marjoram. An ancestral recombinations graph. In Progress in Population Genetics and Human Evolution (P. Donnelly and S. Tavare, Eds.), IMA Volumes in Mathematics and its Applications, 87:257–270, 1997.
3. Jotun Hein, Mikkel H. Schierup, and Carsten Wiuf. Gene Genealogies, Variation and Evolution: A Primer in Coalescent Theory. Oxford University Press, 2005.
4. Laxmi Parida, Marta Mele, Francesc Calafell, Jaume Bertranpetit, and the Genographic Consortium. Estimating the ancestral recombinations graph (ARG) as compatible networks of SNP patterns. Journal of Computational Biology, 15(9):1–22, 2008.
5. Marta Mele, Asif Javed, Marc Pybus, Francesc Calafell, Laxmi Parida, Jaume Bertranpetit, and the Genographic Consortium.
6. M. A. Jobling, M. Hurles, and C. Tyler-Smith. Human Evolutionary Genetics: Origins, Peoples and Disease. Mathematical and Computational Biology Series. Garland Publishing, 2004.
7. Laxmi Parida. Ancestral Recombinations Graph: A Reconstructability Perspective using Random-Graphs Framework. To appear in Journal of Computational Biology, 2010.
8. Laxmi Parida, Pier Palamara, and Asif Javed. A minimal descriptor of an ancestral recombinations graph. BMC Bioinformatics, 12(Suppl 1):S6, 2011. http://www.biomedcentral.com/1471-2105/12/S1/S6.
9. R. R. Hudson. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics, 18:337–338, Feb 2002.
10. Bo Peng and Marek Kimmel. simuPOP: a forward-time population genetics simulation environment. Bioinformatics, 21:3686–3687, 2005.
11. R. D. Hernandez. A flexible forward simulator for populations subject to selection and demography. Bioinformatics, 24:2786–2787, 2008.
12. Marc Chadeau-Hyam, Clive J. Hoggart, Paul F. O'Reilly, John C. Whittaker, Maria De Iorio, and David J. Balding. Fregene: Simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics, 9:364, 2008. doi:10.1186/1471-2105-9-364.
13. Liming Liang, Sebastian Zöllner, and Goncalo R. Abecasis. GENOME: a rapid coalescent-based whole genome simulator. Bioinformatics, 23(12):1565–1567, 2007.
14. Badri Padhukasahasram, Paul Marjoram, Jeffrey D. Wall, Carlos D. Bustamante, and Magnus Nordborg. Exploring Population Genetic Models With Recombination Using Efficient Forward-Time Simulations. Genetics, 178(4):2417–2427, 2008.
15. S. F. Schaffner, C. Foo, S. Gabriel, D. Reich, M. J. Daly, and D. Altshuler. Calibrating a coalescent simulation of human genome sequence variation. Genome Research, 15:1576–1583, Nov 2005.
16. C. C. Spencer and G. Coop. SelSim: a program to simulate population genetic data with natural selection and recombination. Bioinformatics, 20:3673–3675, 2004.
17. Carsten Wiuf and Jotun Hein. Recombination as a point process along sequences. Theoretical Population Biology, 55:248–259, 1999.
18. Gilean McVean and Niall Cardin. Approximating the coalescent with recombination. Phil. Trans. R. Soc. B, 360:1387–1393, Sep 2005.
19. P. Marjoram and J. D. Wall. Fast coalescent simulation. BMC Genetics, 7:16, Jan 2006.
20. G. K. Chen, P. Marjoram, and J. D. Wall. Fast and flexible simulation of DNA sequence data. Genome Research, 19:136–142, Jan 2009.
21. Joanna L. Davies, František Simančík, Rune Lyngsø, Thomas Mailund, and Jotun Hein. On recombination-induced multiple and simultaneous coalescent events. Genetics, 177:2151–2160, December 2007.

Part IV
The -omics

Chapter 14
Using Genomic Tools to Study Regulatory Evolution
Yoav Gilad
Abstract
Differences in gene regulation are thought to play an important role in speciation and adaptation.
Comparative genomic studies of gene expression levels have identified a large number of differentially
expressed genes among species, and, in a number of cases, also pointed to connections between interspecies
differences in gene regulation and differences in ultimate physiological or morphological phenotypes.
The mechanisms underlying changes in gene regulation are also being actively studied using comparative
genomic approaches. However, the relative importance of different regulatory mechanisms to interspecies
differences in gene expression levels is not yet well understood. In particular, it is often difficult to infer
causality between apparent differences in regulatory mechanisms and changes in gene expression levels, a
challenge that is compounded by the fact that the link between sequence variation and gene regulation is
not clear. Indeed, in certain cases, gene regulation can be conserved even when sequences at associated
regulatory elements have changed. In this chapter, I examine different genomic approaches to the study of
regulatory evolution and the underlying genetic and epigenetic regulatory mechanisms. I try to distinguish
between hypothesis-driven and exploratory studies, and argue that the latter class of studies provides
valuable information in its own right as well as necessary context for the former. I discuss issues related
to study designs and statistical analyses of genomic studies, and review the evidence for natural selection on
gene expression levels and associated regulatory mechanisms. Most of the issues that are discussed pertain
to the general nature of multivariate genomic data, and thus are often relevant regardless of the technology
that is used to collect high-throughput genomic data (for example, microarrays or massively parallel
sequencing).
Key words: Comparative genomics, Gene regulation, Evolution

1. What Can We Learn from Genomic-Scale Comparative Studies of Gene Regulation?

Genomic studies of gene regulatory phenotypes are only rarely
hypothesis driven. There are exceptions, for example studies that
focus on a difference in phenotypes between populations or species
(e.g., 1), and use a genome-wide approach to query regulatory
differences that might explain the observed difference in

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_14,
© Springer Science+Business Media, LLC 2012



phenotypes. However, most comparative genomic studies of gene
regulation are exploratory in nature. Thus, the results of such
studies cannot typically be evaluated by the standard metric of
considering whether a question was convincingly answered or a
hypothesis provided further support. In addition, most genomic
studies focus on steady-state gene regulatory phenotypes (such as
steady-state gene expression levels or transcription factor binding)
and cannot, mainly due to technological limitations, take into
account the detailed spatial and temporal dynamics of gene regulation. It is, therefore, important to consider the following question: What can we learn from nonhypothesis-driven comparative
genomic explorations of steady-state estimates of gene regulatory
phenotypes?
Comparative genomic regulatory studies typically address
three general aims. First, they provide a general description of
variation in gene expression levels, or variation in regulatory
interactions, within and between populations. In itself, such a
description is often of no particular interest. However, these
descriptions allow investigators to place hypotheses regarding
individual genes, as well as observations of differences
in regulatory phenotypes between individuals or across populations and species, in the appropriate context. For example, consider the observation that 20% of the
insulin/IGF-signaling pathway are differentially expressed
between human and chimpanzee livers (10). In order to assess
the significance of this observation, it needs to be interpreted in
the context of overall genome-wide variation in gene regulation
between species. In other words, genome-wide data are required
to test whether the observation that 20% of genes annotated in the
insulin/IGF-signaling pathway are differentially expressed
between the two species is indeed unexpected.
The second general aim of comparative genomic investigations
of gene regulation is to understand the relative importance of
changes in different regulatory mechanisms, and the associated
evolutionary pressures, which shape gene regulatory variation
within and between species. Functional studies of individual genes
are often able to link specific change in regulatory mechanism with
a shift in expression levels, which may underlie physiological or
morphological phenotypic variation. In some cases, these studies
are also able to obtain evidence for the action of natural selection on
gene regulation, especially when a strong prior hypothesis exists
(for example, in the case of genes related to skin pigmentation
and their associated cis regulatory elements (2)). However, while
studies of single genes illustrate the connection between regulatory
evolution and phenotypic variation, only genome-wide explorations can offer a wide enough perspective to address the more
general question of the relative importance of changes in different
molecular mechanisms to the evolution of gene regulation.


Similarly, a genome-wide perspective is required to study the overall
impact of natural selection on gene regulatory differences within
and between species.
The third aim of comparative genomic studies is to develop
specific hypotheses for follow-up functional experiments, which are
typically too demanding to be performed on a genome-wide scale.
For example, it can be shown, based on genome-wide comparative
data, that it is entirely unexpected (by chance) that 20% of the genes
annotated in the insulin/IGF-signaling pathway would be differentially expressed between humans and chimpanzees (10). Thus, it
may be reasonable to assume that the regulation of this pathway has
evolved under directional selection in either humans or chimpanzees (or both). The insulin/IGF-signaling pathway might, therefore, be a promising candidate for subsequent functional studies
and analysis. For example, one might choose to proceed by considering interspecies differences in the metabolic phenotypes associated with this pathway.
Beyond these three aims, comparative studies of gene regulation
are sometimes motivated by general hypotheses, for example when
used as tools to survey possible mechanisms that might explain
genetic associations (as in the context of genetic association studies
of human diseases (3, 4)). Comparative genomic investigations of
regulatory response phenotypes (for example, a response to infection) are another class of studies driven by a general hypothesis.

2. How to Compare Gene Expression Levels Across Species?

Comparative studies of gene expression levels involve related but
somewhat different challenges than those involved in studies of the
regulatory mechanisms underlying variation in gene expression
levels. In what follows, I therefore discuss these classes of studies
separately. I begin with a discussion of comparative studies of gene
expression levels.
With the advent of massively parallel high-throughput
sequencing technologies (next-generation sequencing), interspecies comparisons of gene expression levels, while still not
straightforward, became more feasible. Prior to the availability
of next-generation sequencing technologies, genome-wide comparisons of gene expression levels relied solely on DNA microarrays. Microarrays are still more cost-effective than sequencing for
genome-wide transcriptional profiling. Yet, with respect to interspecies comparisons, microarrays fall short. The principal problem
is that the collection of gene expression data using microarrays
relies on hybridization between the RNA samples being queried
and the probes on the arrays. Sequence mismatches between


target RNA samples and the microarray probes lead to attenuation
of the hybridization intensity, and result in biased estimates of
gene expression levels (5). Interspecies comparisons of gene
expression levels always involve the hybridization of RNA samples
with different sequences. The use of commonly available commercial microarrays, each designed based on the sequence information of only one species (typically, only model organisms and
humans), is therefore problematic. Species-specific and multispecies microarrays can be custom designed and used to compare
gene expression levels within and between species, without the
confounding effects of sequence mismatches on hybridization
intensities (e.g., 12). However, the design and manufacturing of
such custom arrays is costly, and one can only design arrays for
species for which a sequenced genome is available. Moreover, each
time another species is added to a comparative study, a new array
has to be designed and ordered, and the entire study repeated.
Ultimately, due to these considerations, sequencing is generally a
more cost-effective choice than microarrays for comparative genomic studies of gene expression levels. Thus, in what follows, I
mainly focus on methodological issues related to comparative
studies using sequencing.
2.1. Multispecies Comparisons of Gene Expression Levels Using RNAseq

Gene expression studies using RNA sequencing (RNAseq) are not
free of challenges related to the comparison of expression levels
across different species. However, the solutions typically lie in
proper analysis of the data rather than in development of new
empirical tools (by no means do I intend to argue that all challenges
involved in RNAseq data analysis have been solved, only that there
are fewer specific difficulties associated with comparative studies
when RNAseq is being used instead of microarrays, and most of
the remaining difficulties can be solved by proper and cautious
analysis). The first set of challenges relate to the requirement of
defining the transcriptome. This is necessary because comparisons
of estimates of expression levels can only be interpreted in the
context of defined transcriptional units (for example, comparison
of the expression levels of exons, specific transcripts, or genes).
When RNA is being sequenced from a species for which a well-annotated genome is available, RNAseq reads can simply be aligned
to the previously defined transcriptional units and expression
levels can be estimated based on the number of aligned reads.
The problem is that there are only a few well-annotated genomes
(such as the human and mouse genomes), and even these are not
perfectly annotated (indeed, studies continue to find additional
transcriptional units in the human and mouse genomes, such as
previously unrecognized exons, typically 5′ to annotated promoters, and novel small RNAs (6, 7)).
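In the well-annotated case, "counting aligned reads per transcriptional unit" reduces to interval assignment plus a normalization for gene length and sequencing depth. The toy coordinates and the RPKM normalization below are illustrative choices for this sketch; RPKM is one common normalization, not one prescribed by the chapter.

```python
def count_reads(genes, read_starts):
    """Assign each aligned read to the transcriptional unit whose
    half-open interval [start, end) contains its mapped start position
    (a deliberately crude rule; real pipelines must also handle
    spliced, overlapping, and multi-mapping reads)."""
    counts = {g: 0 for g in genes}
    for pos in read_starts:
        for g, (start, end) in genes.items():
            if start <= pos < end:
                counts[g] += 1
    return counts

def rpkm(counts, genes, total_mapped):
    """Reads per kilobase of transcript per million mapped reads."""
    return {g: counts[g] * 1e9 / ((end - start) * total_mapped)
            for g, (start, end) in genes.items()}

genes = {"VNN3-like": (0, 1000), "other": (2000, 2500)}  # toy coordinates
counts = count_reads(genes, read_starts=[10, 500, 999, 1500, 2100])
print(counts)  # {'VNN3-like': 3, 'other': 1}
print(rpkm(counts, genes, total_mapped=5))
```

The length normalization matters for cross-gene comparisons, and the depth normalization matters for cross-sample (and cross-species) comparisons, where library sizes differ.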


If one is sequencing RNA from a species for which a sequenced
genome is available yet is not well annotated, there are two general
alternatives for defining transcriptional units. First, one can rely on
the functional annotation of a closely related genome. Consider, for
example, a comparative study of gene expression levels among
humans, chimpanzees, and rhesus macaques using RNAseq.
Sequenced genomes are available for all three species, yet only the
human genome is well annotated. Because the three species are
closely related, it may seem relatively easy to use the functional
annotation of the human genome to define theoretical transcriptional units in the two nonhuman primate genomes. The challenge,
however, is to accurately define orthology. If one is conservative
(requires exceptionally high sequence similarity) in defining orthology, a large fraction of transcriptional units may be excluded from
the analysis. On the other hand, if one defines orthology using
relaxed criteria (accepting even weak evidence for homology),
falsely classified orthologous regions will often lead to the inclusion
of real transcriptional units in human, coupled to spuriously defined
transcriptional units in the nonhuman primates. This results in a
bias toward estimates of higher expression levels in humans compared to the other two species. Even if a balance is achieved
between the desire to include as many transcriptional units as
possible and the need to avoid falsely classified orthologous genomic regions, transcriptional units that are specific to the nonhuman
primates will never be included in an analysis anchored by annotations based on the human genome. Thus, ultimately this approach
will always result in a certain bias. For example, exons that are being
used frequently in alternatively spliced transcripts in chimpanzees
but not in humans might be excluded from a comparative analysis
based on functional annotation of the human genome (Fig. 1).
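The conservative-versus-relaxed trade-off can be made concrete with a percent-identity cutoff over candidate orthologous pairs. The identity measure over pre-aligned sequences and the threshold values here are simplifying assumptions for illustration only; real orthology calling uses far richer evidence.

```python
def percent_identity(a, b):
    """Identity of two pre-aligned sequences of equal length."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def keep_orthologs(candidates, threshold):
    """Retain candidate orthologous units whose aligned sequences from
    the two species meet a percent-identity threshold. A high threshold
    (conservative) excludes genuine orthologs from the analysis; a low
    one (relaxed) admits spurious pairings and biases expression
    estimates toward the annotated (human) genome."""
    return [name for name, (seq1, seq2) in candidates.items()
            if percent_identity(seq1, seq2) >= threshold]

candidates = {"geneA": ("ACGTACGT", "ACGTACGA"),   # 87.5% identical
              "geneB": ("ACGTACGT", "ACGTTTTT")}   # 62.5% identical
print(keep_orthologs(candidates, threshold=95))  # []
print(keep_orthologs(candidates, threshold=80))  # ['geneA']
```

Sweeping the threshold makes the bias described above visible: the stricter the cutoff, the smaller (but cleaner) the set of transcriptional units that survives into the comparison.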
The second alternative is to use the alignment of the RNAseq
reads to the available genomes of all studied species in order to
define, de novo, the expressed transcriptional units. This is far from
a trivial task, as it requires one to distinguish foreground expression
levels from the background (such as sequencing reads originating
from unspliced introns). At the time this chapter is being written,
there are only a handful of algorithms for de novo definition of
transcriptional units from aligned sequencing data (e.g., 8), and
their effectiveness is still being debated. That said, this is an area of
active research, and probably the most promising way to proceed.
Comparative gene expression studies that are based on de novo
definition of transcriptional units are not affected by biases due to
preexisting functional annotations.
When a sequenced genome itself is not available, a third
approach is to perform de novo assembly of the transcriptome.
This is the most difficult approach because it does not rely on an
alignment of the sequencing reads to a known genome. Currently,
there is no effective approach for performing de novo assembly of

340

Y. Gilad

Fig. 1. RNAseq data from human and chimpanzee liver samples are plotted along the Vanin-family protein 3 (VNN3) gene
region. The human gene structure is provided below each plot and indicates that there are seven annotated exons in this
gene (there is no independent annotation of the chimpanzee genome). The arrows indicate a cluster of sequencing reads
that does not correspond to any part of the human gene model. A de novo definition of transcriptional units clearly
classifies this as an additional exon. Arguably, there is yet another unannotated exon at the 5′ end of the region.

the transcriptome using RNAseq data. Such approaches can in
principle rely on successful existing algorithms for de novo assembly of entire genomes (Chap. 5 of Volume 1 of this book, ref. 54),
where the biggest challenge is typically to identify and resolve
repeats. However, de novo assembly of the transcriptome is challenging in a different way because one has to take into account the
broad distribution of copy numbers across transcriptional units
(namely, the different expression levels). With respect to comparisons of expression levels across species, data processed by using
effective de novo assembly of the transcriptome is expected to
have the same properties as data processed by de novo definition
of the transcriptional units based on aligned RNAseq reads.
However, assembly of the transcriptome is an attractive approach
because it allows one to perform comparative RNAseq studies
on any species, including species for which a sequenced genome is
not yet available. That said, with the rapid decrease in sequencing
costs and the corresponding increase in sequencing capacity, it
might be reasonable to expect that sequencing a new large (e.g.,
mammalian) genome may not be a prohibitive enterprise in the
near future.
For the remainder of the chapter, when issues pertaining to
RNAseq studies are discussed, it is assumed that the analysis is
being performed using the final dataset of reads that map to a
defined set of transcriptional units (regardless of the method

14 Using Genomic Tools to Study Regulatory Evolution
used). For simplicity, I will also henceforth refer generally to genes as examples of transcriptional units. It should be
kept in mind, however, that RNAseq data can be used to study the
expression levels of any transcriptional unit, including individual
exons, alternatively spliced transcripts, small RNAs, etc.
2.2. General Issues in Design of Comparative Gene Expression Studies

Genome-wide investigations of gene regulation need to take into
account a large number of potential confounding sources of variation. These can be technical, such as variation in sample quality and
batch effects, or biological, such as variation due to sex, age, and
circadian rhythm. Comparative studies of gene expression levels are
arguably even more sensitive to confounding effects because of the
large number of physical, morphological, and environmental differences between species. Differences in diet, for example, which may be unavoidable in a study of multiple species, can affect gene regulation.
One of the main goals of comparative studies of gene expression levels is to understand genetically determined interspecies differences. However, in many multispecies studies, the environmental
and genetic components affecting gene regulation are completely
confounded and cannot be distinguished. Similarly, differences in
developmental trajectories, organ size, cellular composition, and
life histories may all be inherently confounded with genetic effects
in a multispecies comparative study.
To some extent, many of these differences can be sidestepped
by limiting the investigation to model organisms that can be kept in
the lab. In that case, one can often ensure that tissue samples are
staged, namely, that samples are being collected from individuals of
the same age and sex, who have experienced similar life histories,
and that sample collection procedures are identical across individuals, regardless of species. In contrast, studies of non-model species can almost never obtain staged tissues, as in most cases the
sample collection is opportunistic in nature (for example, when
collecting samples from nonhuman apes that died in accidents,
fights, or due to other natural causes).
As a result, observations from comparative studies of gene
regulation, especially of non-model organisms, should be interpreted with caution. Some patterns are likely robust with respect
to the uncontrolled aspects of the study designs, and these can
readily be interpreted. For example, it is reasonable to assume
that interspecies differences in the environment and life histories
experienced by donor individuals will result in perturbation of gene
regulation and lead to increased variation in gene expression
levels across species. Thus, patterns of similarity (namely, low variation) of gene expression levels between individuals, regardless of
species, are probably robust with respect to environmental effects.
Fig. 2. Comparative liver gene expression profiles in primates (data from Blekhman et al. 2008). In all panels, the mean (± s.e.m.) log gene expression level (y-axis) of six individuals from each species (x-axis) is plotted relative to the human value (which was set to zero). Top panels: Though Blekhman et al. did not obtain staged tissues (the samples were collected opportunistically during postmortem procedures), the expression levels of each of these four genes are remarkably constant across individuals and species (importantly, these four genes are expressed at moderate to high levels, so the observed low interindividual variation is not due to lack of expression). Technical or environmental explanations for these patterns are unlikely. It is, therefore, reasonable to assume that the expression levels of these genes are tightly regulated (indeed, Blekhman and colleagues argue that the regulation of these genes has likely evolved under stabilizing selection in primates). Bottom panels: These genes have similar expression levels in chimpanzees and rhesus macaques, and a significantly different expression level in humans. In these four cases, explanations based on interspecies genetic or environmental differences are completely confounded.

One can conclude, therefore, with considerable confidence that
such patterns are genetically (or epigenetically) controlled (Fig. 2,
top panels).
In contrast, the observation of interspecies differences in gene
expression levels (Fig. 2, bottom panels) may always be difficult to
interpret, as environmental and genetic explanations can be
completely confounded. Arguably though, in some cases, the
mechanism underlying the observation of a regulatory difference
Fig. 3. Examples of strong concordance between expression levels measured using the multispecies arrays from Blekhman et al., 2008, and using the RNAseq data from Blekhman et al., 2009. Six genes are displayed, chosen at random from the data of Blekhman et al., 2008, conditional only on a significant (FDR < 0.05) difference in gene expression level between humans and chimpanzees (expression levels in the rhesus macaques were not considered for the selection process). For each gene, the expression estimate (mean ± s.e.m.) from the multispecies array (left) and normalized expression level (mean ± s.e.m.) from the RNAseq data (right) are shown for each species (H human, C chimpanzee, R rhesus macaque). Each study used different individual samples, yet the patterns are consistent across studies, suggesting that the relative estimates of gene expression levels based on six individuals from each species are mostly stable.

between species is of less importance as long as the difference
is indeed between the species rather than between the specific
sampled individuals. In that case, care needs to be taken to
ensure that a sufficient number of individuals have been sampled
to obtain a relatively stable estimate of gene expression levels
in the entire species, given specified conditions. Perhaps surprisingly, the number of individuals required to satisfy this criterion
can often be quite modest (on the order of a dozen individuals
(11); Fig. 3).
2.3. General Issues in the Analysis of Comparative Gene Expression Data

The challenges involved in the analysis of genome-wide gene expression data are common to nearly all multivariate high-throughput studies, and are not specific to comparative genomics
studies. General topics in multivariate analysis are discussed in
Chap. 3, Volume 1 (ref. 55) as well as covered in more detail in
many dedicated textbooks. Similarly, approaches for modeling gene
expression levels based on microarray or sequencing data are discussed elsewhere in detail (e.g., 7, 9, 10). Here, I focus on three
particular issues: first, on normalization of gene expression datasets;
second, on the relationship between gene length, absolute expression level, and the power to detect differences in gene expression
levels, as it pertains to RNAseq data; and third, on the arbitrary
nature of the choice of statistical cutoffs.
Normalization. Normalization of gene expression datasets can be
performed in a number of ways (e.g., linear shifts, nonlinear extrapolations, median corrections based on smoothing). Microarray
studies routinely use a normalization step as part of the low-level
analysis of the data. In contrast, most recently published RNAseq
studies (including two early studies from my own group) have
standardized read counts based on transcript length and the total
number of sequenced reads in each sample, but have not normalized the sequencing data across samples prior to modeling gene
expression levels. In this section, rather than explore particular
approaches for normalization, I discuss the reasons for which it is
necessary to apply a normalization step to RNAseq data (see refs.
39–41 for details on different normalization approaches).
A normalization step is generally required in genomic studies
of gene expression levels to correct for purely technical differences
among data from different samples, such as differences in overall
RNA quantity and/or quality, sample processing, and batch
effects. Arguably, most of these effects can be taken into account
in an RNAseq study by correcting gene-specific read counts by the
total number of reads sequenced in each sample. Note that
this standardization step relies on the assumption of no interacting technical confounding effects, which may or may not be a
reasonable assumption. Since I proceed by arguing that normalization is needed, I shall not continue to discuss the validity of this
assumption.
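The within-sample standardization just described, dividing each gene's read count by transcript length and by the total number of mapped reads in the sample, is commonly packaged as RPKM (reads per kilobase of transcript per million mapped reads). A minimal sketch, with invented numbers:

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    # Reads Per Kilobase of transcript per Million mapped reads:
    # standardizes a raw count for gene length and sequencing depth
    # within a single sample (it does not normalize across samples).
    return read_count / (transcript_length_bp / 1e3) / (total_mapped_reads / 1e6)

# Hypothetical gene: 500 reads on a 2 kb transcript, 10 million mapped reads.
value = rpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```

As the remainder of this section argues, this within-sample standardization is not a substitute for normalization across samples.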
A correction based on the total number of sequenced reads,
however, cannot account for differences in the distribution of gene
expression levels across samples (12, 13). This is a property that we
did not need to consider in microarray studies. In contrast to
microarrays, where each RNA type hybridizes (we can assume
independently) to a dedicated probe, estimates of gene expression
levels using RNAseq are based on the proportion of reads that
are sequenced from each gene relative to the total number of
sequenced reads in a sample. As the total number of reads
sequenced from a given sample is limited, by definition, the range
and distribution of gene expression values affect how often genes
with a given absolute expression level are being sampled (because
the fractions of reads mapped to individual genes must sum to one
in each sample). For example, assume that the number of genes
expressed in livers and kidneys is identical, but in livers all genes are
expressed at low to moderate levels while in kidneys a few genes
are expressed at extremely high levels and all other genes at low to
moderate levels. In that case, for a given number of RNAseq reads
per sample (and when reads are sampled at random), the probability
that a lowly expressed gene will be represented is higher in the liver
than in the kidney. Normalization of RNAseq data is, therefore,
necessary to take these differences into account.
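The liver/kidney thought experiment can be made concrete with a toy calculation (all numbers are invented). Under random sampling of reads, a gene's expected read count is its share of the sample's transcript pool multiplied by the sequencing depth:

```python
# Toy illustration of the compositional nature of RNAseq counts.
# Both tissues express 1,000 genes; a focal low-expressed gene
# contributes 10 transcript copies in each. In "kidney", 5 genes
# additionally dominate the pool with 100,000 copies apiece.
depth = 1_000_000                        # sequenced reads per sample

liver_pool = 1_000 * 10                  # all genes low to moderate
kidney_pool = 995 * 10 + 5 * 100_000     # a few genes dominate the pool

expected_liver = depth * 10 / liver_pool
expected_kidney = depth * 10 / kidney_pool

print(round(expected_liver))   # 1000 reads expected in liver
print(round(expected_kidney))  # ~20 reads expected in kidney
```

Identical absolute expression, very different counts: this is exactly the difference that across-sample normalization must absorb.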
Power to detect differentially expressed genes. Another important
property of RNAseq data is that the number of sequence reads
that map to a particular gene tends to be roughly proportional
to the expression level of the gene multiplied by the gene's
length (14). Thus, long genes tend to be represented by more
sequence reads than short genes expressed at the same level. As a
result, estimates of expression levels based on RNAseq data, though
they are standardized by gene length, tend to be less variable for
long genes than for shorter genes (or transcripts, or exons; this
property is not specific to a particular class of transcriptional units).
The ability to identify differentially expressed genes between samples is, therefore, strongly associated with the length of the transcript. Moreover, when overall sequence coverage is increased, the
corresponding increase in the power to detect differences in expression levels across samples is also associated with gene length because
the corresponding increase in the number of reads is greater for
long than for shorter genes. Microarray data are not susceptible to
this complex interaction between gene length and the power to
detect differences in expression levels because all probes on the
array are typically of the same length.
Since one of the most attractive features of RNAseq is the ability
to assay the expression of entire transcriptional units, it may be
undesirable to account for this length bias by restricting the analysis
to subsections of genes (such as the first n base pairs of 3′ UTRs).
The association between gene length and the power to detect expression differences may, therefore, be a constant property of RNAseq
studies, and its biasing effect on downstream analyses needs to be considered
(15). For example, ranking or testing for functional enrichments (for
example, by using gene ontology annotations) among genes that are
classified as differentially expressed between species based on RNAseq data might result in the spurious identification of enriched pathways or functional annotations that include mainly longer genes.

For that reason, analyses aimed at assessing whether an observation of an enrichment of regulatory differences in a particular pathway or a biological process is unusual need to take into
account a background of matching gene lengths or at least a
background of matching estimated expression levels. Consider
again the observation that 20% of the annotated genes in the
insulin/IGF-signaling pathway are differentially expressed
between human and chimpanzee livers. In contrast to our simplified discussion above, because of the power-related considerations, it is not appropriate to estimate whether this observation
is indeed unexpected by simply considering the overall fraction of
differentially expressed genes between the two species. Instead, a
proper null expectation should be developed by considering interspecies differences in expression levels in a proper background of
genes of similar length to the genes in the insulin/IGF-signaling pathway (15). Alternatively, one can develop a null expectation by sampling at random subsets of n genes (where n is the number of genes in the insulin/IGF-signaling pathway) while maintaining a similar distribution of expression levels.
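The resampling scheme just described can be sketched as follows; the binning of genes into expression classes, the gene labels, and the permutation count are all illustrative choices:

```python
import random
from collections import defaultdict

def resampled_null(pathway_genes, is_de, expr_bin, n_perm=1000, seed=0):
    # Null distribution for the number of differentially expressed (DE)
    # genes in a pathway: draw random gene sets that match the pathway's
    # distribution of expression bins (a rough proxy for detection power).
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for g in is_de:
        by_bin[expr_bin[g]].append(g)
    null = []
    for _ in range(n_perm):
        matched = [rng.choice(by_bin[expr_bin[g]]) for g in pathway_genes]
        null.append(sum(is_de[g] for g in matched))
    return null

# Hypothetical data: 100 genes, five expression bins, even-indexed genes DE.
genes = [f"g{i}" for i in range(100)]
is_de = {g: int(i % 2 == 0) for i, g in enumerate(genes)}
expr_bin = {g: i % 5 for i, g in enumerate(genes)}
null = resampled_null(genes[:10], is_de, expr_bin)
# Empirical p-value for observing 5 DE genes in a 10-gene "pathway".
p = sum(n >= 5 for n in null) / len(null)
```

The same skeleton accommodates length matching: simply bin genes by transcript length instead of (or in addition to) expression level.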
The choice of statistical cutoffs. Genome-wide studies typically use
statistical cutoffs to sort genes into different classes, for example to
classify genes as differentially expressed between cases and controls.
In many contexts, especially when genome-wide studies are used to
develop hypotheses for further testing (which typically involve
functional experiments that are time-consuming and costly), minimizing the number of false positives is nearly the only guiding
principle behind the choice of a statistical cutoff. However, comparative studies of gene regulation are often exploratory, and, as
such, one of the goals is typically to describe biological processes
and pathways that are enriched among different classes of genes,
such as those that are differentially expressed between species. The
challenge is to provide a description of such patterns that does not
rely on the exact choice of the statistical cutoff.
While the choice of cutoffs is nearly always arbitrary, it is often
possible to guide it by using prior information regarding related
properties of the data. For example, consider housekeeping
genes (the definition of housekeeping genes is controversial,
but for the purpose of this discussion, assume that we have an
established list of true housekeeping genes). A reasonable
assumption might be that housekeeping genes will be underrepresented among differentially expressed genes between species. In
that case, one approach is to choose a cutoff with which the overall
number of genes classified as differentially expressed is maximized
while the number of housekeeping genes classified as differentially
expressed is minimized. When two or more genomic datasets
are combined, the opportunity to leverage information to guide
the choice of statistical cutoffs increases. Consider the
combination of a transcription factor ChIPseq dataset with
genome-wide estimates of gene expression levels following perturbation of the same transcription factor dosage. Two cutoffs need
to be chosen: one to classify transcription factor promoter binding
events in the ChIPseq data and one to classify differences in gene
expression levels following the perturbation of the transcription
factor dosage. In choosing these cutoffs, the prior expectation of
enrichment in overlap between the two sets of observations can be
leveraged. Indeed, true regulatory targets of the transcription
factor are expected to be differentially expressed, as well as have
the transcription factor bound to their promoters.
Regardless of the type of analysis used or the ability to use prior
information to guide the choice of statistical cutoffs, the order of p-values rarely changes. For that reason, an analysis that indicates that
the conclusions are robust with respect to a wide range of arbitrary
choices always reinforces the study. One way to achieve this is to
perform the entire analysis using a range of alternative cutoffs.
A more formal way to test specific properties of interest is to use
approaches, such as gene set enrichment analysis (16), which rely
on the order of p-values rather than on specific choices of cutoffs.
Using these approaches, one can explore the overall dependence
between the choice of cutoff and the examined property of the data
(such as an enrichment of differentially expressed genes in a particular pathway).
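A minimal sketch of the simpler of these checks, repeating an analysis over a range of cutoffs; the p-values, gene labels, and tracked property (here, the fraction of transcription factors among genes called differentially expressed) are all invented:

```python
def property_across_cutoffs(pvalues, is_tf, cutoffs):
    # Recompute a property of interest at each candidate cutoff to see
    # whether a conclusion depends on the (arbitrary) threshold chosen.
    results = {}
    for c in cutoffs:
        de = [g for g, p in pvalues.items() if p < c]
        results[c] = sum(is_tf[g] for g in de) / len(de) if de else None
    return results

# Hypothetical p-values and transcription factor labels.
pvalues = {"g1": 0.001, "g2": 0.01, "g3": 0.04, "g4": 0.2, "g5": 0.6}
is_tf = {"g1": 1, "g2": 0, "g3": 1, "g4": 0, "g5": 0}
res = property_across_cutoffs(pvalues, is_tf, [0.005, 0.05, 0.25])
```

A conclusion that survives only at one cutoff in such a table is exactly the kind of fragile result the text warns against.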
Strong conclusions can only be based on properties that are
demonstrably robust with respect to the choice of statistical cutoffs.
For example, the specific number of genes classified as differentially
expressed between species obviously depends on the choice of a
statistical cutoff. However, the property that the fraction of genes
classified as differentially expressed between humans and chimpanzees is smaller than between either of those species and the
more distantly related rhesus macaques, is robust with respect to
the specific choice of cutoff (11, 12).

3. What Have We Learned from Comparative Genomic Studies of Gene Expression Levels?

At the time this chapter is being written, comparative studies of
gene expression levels are still mostly limited to exploration of
variation in gene regulation within and between species. A large
number of specific hypotheses have been raised based on the existing studies, but only a few have been followed up. We are still
working toward a better understanding of the evolutionary forces
that shape gene regulatory phenotypes, and this remains
the focus of most comparative studies of gene expression levels.

In the first large-scale study to investigate natural variation in
gene regulation, Oleksiak et al. (17) compared gene expression
levels in heart ventricles from 18 individual postreproductive
males from three populations: two of Fundulus heteroclitus (a saltwater fish) and one of its close relative, F. grandis. Despite low
migration rates between the two conspecific populations and across
the species boundary, fewer than 3% of the 907 genes surveyed
were classified as differentially expressed between populations. An
order of magnitude more genes were found to be differentially
expressed between individuals within populations. In other
words, there was little evidence of population structure at the
genome-wide expression level. In addition, patterns of variation
between populations were inconsistent with the neutral prediction
that phenotypic divergence should scale with genetic distance.
Instead, gene expression profiles were more similar for the southern
F. heteroclitus and F. grandis populations, suggesting that adaptation to different temperatures, rather than genetic drift, drove the
differentiation.
Rifkin et al. (18), who studied gene expression variation during Drosophila metamorphosis, took a more explicit quantitative
genetic approach to study selection pressures acting on gene
regulation. They measured average levels of gene expression in
four strains of the cosmopolitan species D. melanogaster and one
strain each of D. simulans and D. yakuba at the start of metamorphosis. To identify genes whose regulation evolves under different
selective pressures, Rifkin et al. analyzed the gene expression data
using a system of related linear models corresponding to the
expectations under three different evolutionary scenarios. Using
this approach, they could not reject overall low variation for 44%
of the expressed genes, could not reject species-specific gene
expression patterns for 39% of the genes, and could not reject a
model consistent with neutrality for the remaining 17% of genes.
They interpreted these results to indicate a dominant signature for
stabilizing selection in gene expression evolution with smaller, but
important, roles for directional selection and neutral evolution,
respectively.
In contrast to Rifkin et al., Lemos and colleagues (19) explicitly
tested a null neutral model of gene expression evolution by making
two key assumptions about variance in gene expression. First, they
used estimates of mutational variance in other quantitative traits as
a measure of the mutational variance that might be affecting gene
expression. Second, following Lynch (20), they assumed that environmental variance was half the within-population variance, i.e.,
that broad-sense heritability of gene expression patterns was at
most 50%. Using these estimates and based on the neutral model
of Lynch and Hill (21), they calculated the minimal and maximal
rates of gene expression diversification that would be consistent
with neutrality (i.e., evolution without constraint).
Lemos et al. (19) used their approach to perform a meta-analysis of available gene expression datasets from multiple species,
and found that the overwhelming majority of genes in all datasets
exhibited far less between species variation than expected under a
neutral model. They interpreted this pattern to be the result of
stabilizing selection acting on within-species gene expression. In
fact, Lemos et al. (19) estimated that even if the mutational input to
gene expression were two orders of magnitude lower than they had
assumed, levels of between-population differentiation in gene
expression would still be inconsistent with neutrality. Only in comparisons between mouse lab strains did an appreciable number of
genes evolve in a manner consistent with neutrality.
The conclusions of Lemos et al. were supported by several
studies that directly measured the mutational input of variation in
gene expression levels per generation in a number of model organisms (22–24). Mutational input can be estimated by measuring the
variance for a phenotypic trait among a set of initially homogeneous
lines maintained with minimally sized populations for many
generations. Natural selection is at its weakest under such conditions because genetic drift in such small populations is extremely
fast. In an extreme case, when a single, randomly chosen individual
propagates each line, the only mutations which can be selected
against are those that kill the organism before reproduction or that
eliminate fertility altogether. Otherwise, most mutations will be
effectively neutral and will quickly either drift to fixation or be lost.
As different lines fix different random mutations, the lines drift
apart. Variation between lines can then be used to estimate the
mutational variance.
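Under models in the spirit of Lynch and Hill, the among-line variance B in a mutation accumulation experiment is expected to grow at roughly 2Vm per generation, giving the back-of-the-envelope estimator Vm ≈ B/(2t). A minimal sketch; the line means and generation count are invented, and the factor of 2 is a modeling assumption, not a result reported in this chapter:

```python
from statistics import variance

def mutational_variance(line_means, generations):
    # Estimate the per-generation mutational variance Vm of a trait
    # from the among-line variance B after t generations of mutation
    # accumulation, assuming B ~ 2 * t * Vm.
    between_line = variance(line_means)  # sample variance among MA lines
    return between_line / (2 * generations)

# Invented expression means for five MA lines after 200 generations.
vm = mutational_variance([10.0, 10.4, 9.7, 10.2, 9.9], 200)
```

In practice, such estimates also require correcting for measurement error in the line means, which this sketch ignores.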
These mutation accumulation studies (22–24) provided the
first direct estimates of mutational variance in gene expression
levels. When comparative gene expression data were analyzed in
the context of these estimates (by applying a similar modeling
approach to the one used by Lemos et al.) in all systems studied
to date, it was concluded that stabilizing selection places severe
bounds on gene expression divergence.
3.1. Gene Expression in Apes

Understanding phenotypic evolution in primates is typically more
difficult than in model organisms because key experiments often
cannot be performed to distinguish between competing hypotheses
or to estimate important parameters. Moreover, material is often
scarce, leading to largely unknown and uncontrolled environmental
variance between samples. These limitations are particularly problematic for dynamic, environmentally sensitive traits, like gene expression.
Perhaps due to these difficulties, the first few studies that
examined the selection pressures that shape gene expression profiles
in humans and our extant close evolutionary relatives resulted in
somewhat conflicting conclusions (19, 15). However, more recent
work on interprimate comparisons of gene expression levels,
focusing on patterns of the data that should be robust with respect
to the uncontrolled aspects of the study design, indicates that, for
most genes, there is little evidence for change in expression levels
across primate species. These observations are consistent with widespread stabilizing selection on gene regulation in primates, in agreement with the observations in model organisms (18, 24, 26, 27).
Nonetheless, a subset of genes whose regulation appears to
have evolved under positive (directional) selection in the human
and chimpanzee lineages was identified. Intriguingly, among this
set of genes, there was a significant excess of transcription factors in
the human lineage. In addition to the rapid evolution of their
expression, genes encoding transcription factors have also been
shown to evolve rapidly in the human lineage at the coding
sequence level (28). Together, these findings raise the possibility
that the function and regulation of transcription factors have been
substantially modified in the human lineage, a change that could
have propagated to many downstream targets over a short evolutionary time frame. Interestingly, the opposite finding has emerged
from studies of closely related Drosophila species, in which the
expression levels of transcription factors appear to evolve more
slowly than the expression levels of genes encoding other types of
proteins (18, 22).

4. How to Compare Regulatory Mechanisms Across Species?

Beyond comparisons of gene expression levels across species, there
is a great interest in understanding the underlying regulatory
mechanisms. Specifically, we still know little about the relative
importance of changes in different regulatory mechanisms to interspecies differences in gene expression levels. Genomic technologies, in particular since the advent of next-generation sequencing
techniques, allow us to characterize genome-wide variation in a
larger number of genetic and epigenetic regulatory mechanisms
and regulatory interactions.
It is important to note at the outset of this discussion that
genomic studies can only rarely be used to directly test for causality.
Much more often, the inference of causality (for example, between
changes in a regulatory mechanism and ultimate differences in gene
expression levels) relies on the observation of correlations on a
genome-wide scale. Statistical correlation in itself, however, does
not provide strong evidence for causality, and, in any case, provides
no information about the direction of causality. Instead, most often,
inference of causality in comparative studies of gene regulation
relies on prior functional knowledge of regulatory mechanisms. For
example, enhancer transcription factors are known to bind to promoters of genes, precipitate the assembly of the transcriptional
machinery at those promoters, and increase the rate of transcription
of the associated genes. Based on this proposed mechanism (which
is strongly supported by a large body of independent studies), one
may be able to infer causality in a genome-wide study that correlates
variation in genome-wide transcription factor binding at promoters
and variation in gene expression levels.
4.1. Leveraging Different Sources of Information

Because inference of causality almost always relies on prior information, genome-wide studies of regulatory mechanisms should
aspire to build the strongest possible independent circumstantial
case for a relationship between variation in regulatory interactions
and changes in gene expression levels. This can often be done
by combining different sources of genome-wide information.
For example, consider the task of identifying the direct regulatory
targets of a transcription factor. To do so, empirical studies typically
use one of two main approaches: (1) expression profiling following a perturbation of the transcription factor dosage or (2)
chromatin immunoprecipitation followed by sequencing (ChIPseq) using a specific antibody against the transcription factor.
In the first approach, the dosage of the transcription factor
is perturbed in cells or in model organisms by a treatment of
either overexpression or knockdown (using, for example, siRNA
technology (29, 30)) of the transcription factor. Following the
treatment, the expression profiles of a large number of genes are
studied in order to identify the genes whose regulation has been
affected by the perturbation of the transcription factor dosage (29).
Typically, a large number of genes, often several thousand, are
found to be differentially expressed in such experiments (30, 31).
However, it is clear that not all the differentially expressed genes are
directly regulated by the transcription factor whose dosage was
perturbed. Indeed, a large proportion of the genes are expected
to be secondary targets (i.e., regulated by genes that are themselves
directly regulated by the transcription factor). In addition, a change
in the dosage of a transcription factor often affects the cellular
environment in ways that may trigger larger changes in the gene
expression profiles, not directly related to the regulatory effects of
the perturbed transcription factor (30).
In order to identify the subset of direct transcriptional targets
among all the differentially expressed genes, computational predictions of the transcription factor-binding sites are often used.
Namely, a gene is considered as a direct regulatory target only if it
is differentially expressed following the perturbation of the transcription factor and the binding motif of the transcription factor
can be found within the gene's putative promoter (30, 31). The
problem is that computational searches for transcription factor-binding sites are known to have a high error rate (32). In particular,
since transcription factor-binding sites are short (6–12-mers), a
large number of false positives are expected. In addition, it is
unclear how to assign significance to the identification of transcription factor-binding sites based on a single sequence (32).
An alternative approach is to use ChIPseq (33) to directly
identify all the sites in the genome to which the transcription
factor binds (e.g., refs. 34, 35). In these experiments, sequencing
is used to measure the abundance of chromatin that is first precipitated along with the transcription factor of interest. The goal is
to identify genomic regions with peaks of aligned sequencing
reads, which correspond to regions putatively bound by the transcription factor. When the transcription factor-binding locus is in
proximity to a known gene, it is assumed that the gene is being
regulated by the transcription factor (35, 36). However, even if
the antibody against the transcription factor is highly specific and
the number of falsely identified binding events is assumed to be
small (37), it is unclear how many binding events reflect a true
biological function. Namely, it is unclear how often a transcription
factor can bind to genomic regions near genes without participating in the regulation of those genes.
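For intuition about the peak-identification step, here is a deliberately naive sliding-window scan that flags windows whose read count is improbable under a genome-wide Poisson background. Published peak callers model local background, strand structure, and duplicate reads far more carefully; the coverage values below are toy numbers.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))

def call_peaks(coverage, window=5, alpha=1e-4):
    """Return start positions of windows whose summed read count is
    improbably high under a Poisson background fitted to the global mean."""
    lam = sum(coverage) / len(coverage) * window  # expected reads per window
    peaks = []
    for start in range(len(coverage) - window + 1):
        count = sum(coverage[start:start + window])
        if poisson_sf(count, lam) < alpha:
            peaks.append(start)
    return peaks

# Per-base read counts with an enriched stretch in the middle.
cov = [1, 0, 2, 1, 1, 9, 12, 11, 10, 8, 1, 0, 1, 2, 1]
peak_starts = call_peaks(cov)  # windows overlapping the enriched stretch
```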
Thus, ChIP-seq and dosage perturbation experiments, considered one at a time, suffer from high false-positive rates due to the
nonspecificity of the antibody, random binding of the transcription
factor in the case of the ChIP-seq experiment, or the ripple effect of
knocking down a transcription factor in the siRNA experiments.
Considered together, however, these approaches enable the reliable
identification of genes whose promoter regions are bound by the
transcription factor and whose regulation is affected by the perturbation of the transcription factor dosage. In other words, using this
paradigm, one can build a strong circumstantial case for classifying
direct regulatory targets of a specific transcription factor.
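In computational terms, this combined-evidence classification reduces to a set intersection. The gene identifiers below are invented placeholders:

```python
# Genes significantly differentially expressed after perturbing the TF's dosage.
de_genes = {"geneA", "geneB", "geneC", "geneG"}
# Genes with a ChIP-seq binding peak within their putative promoter region.
bound_genes = {"geneB", "geneC", "geneE", "geneH"}

direct_targets = de_genes & bound_genes          # supported by both lines of evidence
indirect_or_secondary = de_genes - bound_genes   # expression change, no nearby binding
bound_without_effect = bound_genes - de_genes    # binding, no detectable regulatory effect
```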
4.2. Statistical Challenges in Comparative Studies of Gene Regulation

Most of the statistical challenges involved in genomic studies of gene regulatory mechanisms are related to the multivariate nature
of the data. In many ways, therefore, these issues are similar to the
ones reviewed above for comparative studies of gene expression
levels. For example, effective study designs are still required to
test the hypothesis that the variation of regulatory mechanisms
between species is significantly larger than the variation between
individuals within a species (this seems worth mentioning because a
few recent comparative studies of regulatory mechanisms have
reported interspecies variation without including independent
biological replicates within species).
Similarly, investigations of regulatory mechanisms also rely on
mostly arbitrary choices of the statistical cutoffs used to classify the
observed patterns. As in most genome-wide studies, regardless of
whether the choice of cutoffs is guided to some extent by prior information, the main goal is typically to keep false positives to a minimum. However, comparisons of regulatory mechanisms
between species are in that sense more complex because controlling
the rate of false negatives is a crucial factor as well. The principal
issue is that the data supporting a regulatory mechanism need to be
interpreted in the context of each sample (or each species) before
variation across samples (or species) can be characterized.
For example, consider a genome-wide comparative study
of histone modifications using ChIP-seq, namely, a study aimed
at characterizing similarities and differences across species in the
locations of these epigenetic markers. This may be of interest in
order to study the extent to which interspecies variation in gene
expression levels can be explained by changes in histone modification profiles. The first step in such a study is to identify all the
genomic regions that are associated with histone modifications
in each species. The characterization of such genomic regions is
based on statistical analysis of the data. In the ChIP-seq example, the
goal is to identify peaks of aligned sequencing reads, which are
indicative of enriched chromatin that is associated with histone
modifications. In principle, once genomic regions associated with
histone modifications are identified in each species independently, a
comparison across species can be performed. Here, however, it
becomes a bit more challenging.
Typically, one would tend to choose stringent statistical cutoffs
to identify peaks of sequencing reads in each species independently,
namely, choose such cutoffs that minimize the false positive rate.
However, such an approach, while controlling the rate of falsely
identified genomic regions associated with histone modification in
each species, results in a high rate of spuriously identified differences in this epigenetic regulatory mechanism between species. For
example, assume that associations with histone modifications are
classified, in each species independently, at an FDR < 0.05 (this
would typically refer to the expected proportion of peaks with
similarly strong evidence in a negative-control ChIP-seq experiment). In that case, an observation of a genomic region associated
with histone modifications at an FDR of 0.049 in one species and
an FDR of 0.051 in the other species would be considered as
evidence for an interspecies difference in histone modifications at
this genomic region. Clearly, this would be a problem.
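Both the empirical FDR definition used above and this boundary artifact are easy to make concrete. The peak scores and FDR values below are invented for illustration:

```python
def empirical_fdr(threshold, chip_scores, control_scores):
    """FDR at a score threshold: the number of negative-control peaks at
    least this strong relative to the number of real-experiment peaks."""
    n_chip = sum(s >= threshold for s in chip_scores)
    n_ctrl = sum(s >= threshold for s in control_scores)
    return n_ctrl / n_chip if n_chip else 0.0

chip = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # peak scores, ChIP experiment
control = [5, 4, 3, 2, 1]                # peak scores, negative control
fdr_strict = empirical_fdr(6, chip, control)  # no control peak this strong
fdr_loose = empirical_fdr(3, chip, control)

# The artifact of species-wise thresholding: nearly identical evidence,
# opposite classifications, hence a spurious "interspecies difference".
fdr_species1, fdr_species2 = 0.049, 0.051
called_different = (fdr_species1 < 0.05) != (fdr_species2 < 0.05)
```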
To minimize the number of falsely identified interspecies differences in regulatory mechanisms, one should leverage information
from all samples. This can be done using a number of different
Bayesian approaches. In its simplest form, such an analysis (although
not strictly Bayesian) could involve the application of two statistical cutoffs. Considering the example of histone modifications, one can
assume that conditional on observing an associated genomic region
with high confidence in one species (namely, using a stringent cutoff)
the orthologous site in a closely related species is also likely to have the modification. Accordingly, one can relax the statistical cutoff for the classification of such secondary observations. Although the choice of statistical cutoffs may still be arbitrary, the distributions of FDR values can be used as a guide, especially with respect to the choice of the second cutoff (Fig. 4). The two-cutoff approach uses information across all studied species to increase the power to detect histone modification in any species. This approach is, therefore, conservative with respect to identifying differences across species.

Fig. 4. Example of how a distribution of FDR values can guide the choice of statistical cutoffs. (a) All ChIP-seq peaks with FDR ≤ 20% from a genomic study of histone modification in cell lines from three primate species; the chosen stringent 2% FDR cutoff is indicated with a dashed line. (b) Enrichment peaks with FDR ≤ 20% in each species, which also overlap peaks with FDR ≤ 2% in any of the other species; the chosen relaxed 5% FDR cutoff for a secondary observation is indicated with a dashed line.
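The two-cutoff rule can be sketched as follows, using the 2% and 5% thresholds from the Fig. 4 example; the species names and per-species FDR values are invented:

```python
STRINGENT, RELAXED = 0.02, 0.05

def modified_in(region_fdrs, species):
    """region_fdrs: dict mapping species -> FDR of the region in that species.
    A region counts as modified in a species if it passes the stringent
    cutoff there, or if it passes the relaxed cutoff while some other
    species passes the stringent one (the "secondary observation" case)."""
    if region_fdrs[species] <= STRINGENT:
        return True
    anchored = any(f <= STRINGENT for s, f in region_fdrs.items() if s != species)
    return anchored and region_fdrs[species] <= RELAXED

region = {"human": 0.01, "chimpanzee": 0.04, "rhesus": 0.30}
status = {sp: modified_in(region, sp) for sp in region}
# human is a primary hit, chimpanzee is rescued by the relaxed cutoff,
# and only rhesus is classified as lacking the modification.
```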

5. What Have We Learned from Comparative Studies of Regulatory Mechanisms?

Comparative studies of genetic mechanisms. In contrast to the relative abundance of comparative gene expression data from multiple
species, there are far fewer genomic-scale comparative datasets of
regulatory mechanisms. At the genetic level, the largest comparative
study of regulatory mechanisms to date is that of Schmidt and
colleagues (38), who used ChIP-seq to compare the genomic locations of binding sites of two transcription factors (CCAAT/enhancer-binding protein alpha and hepatocyte nuclear factor 4 alpha) in the livers of five vertebrate species (human, mouse, dog, short-tailed opossum, and chicken). Schmidt and colleagues found
that most transcription factor-binding locations are species specific,
and that orthologous binding locations present in all five species are
rare. Quite often, the sequences of orthologous binding loci were
identical across species, even when the binding event was inferred to
have been lost in one species. On the other hand, in many cases,
there was no evidence for conservation at the sequence level even
when the location of the transcription factor binding was shared
across species.
These observations suggest that interspecies differences in
genetic regulation by transcription factors are widespread. However, it should be noted that Schmidt and colleagues did not
analyze their data by leveraging information from all species, but
rather classified binding events independently in each species. As a
result, their analysis was not conservative with respect to classifying
differences in binding across species. It is reasonable to assume that
to some extent this study overestimated the proportion of differences in binding locations between species.
There are a few other (somewhat smaller in scale) published
comparative studies of transcription factor-binding locations
across species (39–43). These studies, quite intuitively, suggest
that the level of divergence in binding locations largely depends
on the specific transcription factor that is being studied (as well
as on the evolutionary distance between the species). Most of the
comparative ChIP-seq studies published to date have not yet been
coupled with genome-wide characterization of interspecies gene
expression differences. As a result, we still do not have an estimate
of the relative importance of changes in transcription factor-binding locations to overall gene expression differences between species. That said, a property that emerges from this collective body
of work is that we currently find very little correlation between
divergence of inferred transcription factor-binding sites and differences (or similarities) in the observed transcription factor binding.
In other words, without additional information, the study of
conservation of individual binding sites across species is not very
informative with respect to predicting conservation of transcription factor-binding locations.
Comparative studies of epigenetic mechanisms. Parallel surveys of
interspecies differences in genetic and epigenetic regulatory
mechanisms may provide context that allows us to better appreciate the relationship between differences in transcription factor
binding and sequence changes at transcription factor-binding
sites. To date, however, genome-wide comparative studies of epigenetic mechanisms have not yet been coupled with other sources
of data.

Studies of one class of epigenetic marker, DNA methylation,
have suggested that the role of DNA methylation in tissue-specific
gene regulation is generally conserved. For example, after identifying tissue-specific differentially methylated regions (T-DMRs (44))
in a number of tissues in mice, Kitamura and colleagues were able to
use the methylation status in orthologous human regions to distinguish between the corresponding human tissues (45). In turn,
Irizarry and colleagues (46), who studied genome-wide DNA
methylation patterns in spleen, liver, and brain tissues from
human and mouse, reported that 51% of T-DMRs are shared across
both species. However, there also are a large number of potentially
functional differences in methylation levels across species. In particular, in primates, Gama-Sosa and colleagues (47) found that
relative methylation levels within tissues generally differ between
species, with the exception of hypermethylation in the brain and
thymus, which were observed regardless of species. In addition,
Enard and colleagues (48), who compared methylation profiles of
36 genes in livers, brains, and lymphocytes from humans and
chimpanzees, reported significant interspecies methylation level
differences in 22 of the 36 genes in at least 1 tissue.
A somewhat different picture may be emerging from comparative studies of a different class of epigenetic markers, histone
modifications. Characterization of several types of histone modifications on human chromosomes 21 and 22, and the syntenic
chromosomes in mouse, indicated that the genomic locations of
these epigenetic markers at orthologous loci are strongly conserved, even in the absence of sequence conservation (39, 49).
Interestingly, the conservation of histone modification patterns
was highest in genomic regions proximal to annotated orthologous genes.
With few exceptions, however (e.g., with respect to DNA
methylation, ref. 50), genome-wide comparative studies of epigenetic regulatory mechanisms have also not yet explored the extent
to which changes in specific regulatory interactions underlie interspecies differences in gene expression levels. As a result, we still
cannot assess the relative importance of changes in different genetic
and epigenetic regulatory mechanisms to overall regulatory evolution. This status might change rapidly because the main limitation
for performing high-throughput investigations of epigenetic markers was technological. Massively parallel sequencing technologies
now facilitate comparative epigenetic studies using genome-wide
protocols, such as MeDIP and ChIP-seq.


6. Summary and Additional Topics
We have gained important insights from comparative genomic
studies of gene expression levels. We established that the regulation of most genes evolves under stabilizing selection (51, 52) and
described variation in gene expression levels within and between
species in sufficient detail that we can now use empirical
approaches to identify genes whose regulation likely evolved under
directional selection (53). These would be promising candidates
for further functional studies. Current efforts are moving beyond
the investigation of interspecies variation in gene expression
levels to studies of the underlying regulatory mechanisms. In
that respect, I did not mention in this chapter many of the types
of datasets that are currently being collected, such as measures of
chromatin accessibility (using DNase hypersensitive sites, for
example), different markers of enhancer elements (such as the
cofactors p300 and mediator), maps of nucleosome positions,
and expression levels of small regulatory RNA classes. Once we
combine different sources of comparative genomic data into a
unified model of gene regulation, we should obtain power to
truly dissect the genetic and epigenetic architecture of gene regulatory evolution.

7. Exercises
1. You are ready to design a large study to compare gene expression between species using RNA-seq. You know that you need
to take into account a large number of possible biological and
technical effects, but then you also learn that a certain physical
environment (such as temperature, humidity, amount of light,
etc.) might affect your results. You, therefore, decided to
design a pilot experiment to test the effect of this physical
environment on the measurements of gene expression level
using your platform of choice. Your design should not rely
on the availability of "gold standards" (namely, you are not
able to obtain samples for which the differences in gene
expression are known, neither a priori nor by using additional
techniques).
(a) Explain the study design that allows you to test for the effects of the physical environment of choice.
(b) What are the expected results if the physical environment of choice has no effect on the measurement of gene expression levels?
(c) What are the expected results if the effect of the physical environment of choice is random? In that case, how will you take this information into account when you design the larger study?
(d) What are the expected results if the effect of the physical environment of choice is nonrandom? In that case, how will you take this information into account when you design the larger study?

2. Design a study that will allow you to compare genome-wide RNA decay rates across species, using RNA-seq and a chemical agent that stops transcription in the cell.
(a) Explain your study design.
(b) As part of the low-level analysis of your data, do you
need to perform a normalization step? If so, how would
you normalize your data?
(c) Explain, in general terms, how the data would be analyzed to estimate gene-specific RNA decay rates.
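For concreteness, one conventional analysis for part (c) (though not the only defensible answer) is to assume first-order decay, so that expression follows A·exp(−kt) after transcription is blocked, and to fit a gene-specific k by least squares on log expression versus time. The time points and expression values below are synthetic:

```python
import math

def decay_rate(timepoints, expression):
    """Estimate a first-order decay constant k (expression ~ A*exp(-k*t))
    by least-squares regression of log expression on time."""
    logs = [math.log(e) for e in expression]
    n = len(timepoints)
    mean_t = sum(timepoints) / n
    mean_y = sum(logs) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(timepoints, logs))
    var = sum((t - mean_t) ** 2 for t in timepoints)
    slope = cov / var
    return -slope  # decay constant; half-life = ln(2) / k

# Expression measured at 0, 1, 2, and 4 h after transcription is blocked,
# simulated from 100 * exp(-0.5 * t).
t = [0.0, 1.0, 2.0, 4.0]
expr = [100.0, 60.65, 36.79, 13.53]
k = decay_rate(t, expr)
half_life = math.log(2) / k
```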
References
1. Gompel, N., B. Prud'homme, P.J. Wittkopp, V.A. Kassner, and S.B. Carroll (2005) Chance caught on the wing: cis-regulatory evolution and the origin of pigment patterns in Drosophila. Nature, 433(7025): p. 481–7.
2. Linnen, C.R., E.P. Kingsley, J.D. Jensen, and H.E. Hoekstra (2009) On the origin and spread of an adaptive allele in deer mice. Science, 325(5944): p. 1095–8.
3. Drake, T.A., E.E. Schadt, and A.J. Lusis (2006) Integrating genetic and gene expression data: application to cardiovascular and metabolic traits in mice. Mamm Genome, 17(6): p. 466–79.
4. Emilsson, V., G. Thorleifsson, B. Zhang, A.S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G.B. Walters, S. Gunnarsdottir, M. Mouy, V. Steinthorsdottir, G.H. Eiriksdottir, G. Bjornsdottir, I. Reynisdottir, D. Gudbjartsson, A. Helgadottir, A. Jonasdottir, A. Jonasdottir, U. Styrkarsdottir, S. Gretarsdottir, K.P. Magnusson, H. Stefansson, R. Fossdal, K. Kristjansson, H.G. Gislason, T. Stefansson, B.G. Leifsson, U. Thorsteinsdottir, J.R. Lamb, J.R. Gulcher, M.L. Reitman, A. Kong, E.E. Schadt, and K. Stefansson (2008) Genetics of gene expression and its effect on disease. Nature, 452(7186): p. 423–8.
5. Gilad, Y., S.A. Rifkin, P. Bertone, M. Gerstein, and K.P. White (2005) Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res, 15(5): p. 674–80.
6. Mortazavi, A., B.A. Williams, K. McCue, L. Schaeffer, and B. Wold (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5(7): p. 621–628.
7. Sultan, M., M.H. Schulz, H. Richard, A. Magen, A. Klingenhoff, M. Scherf, M. Seifert, T. Borodina, A. Soldatov, D. Parkhomchuk, D. Schmidt, S. O'Keeffe, S. Haas, M. Vingron, H. Lehrach, and M.L. Yaspo (2008) A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321(5891): p. 956–60.
8. Trapnell, C., L. Pachter, and S.L. Salzberg (2009) TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9): p. 1105–11.
9. Marioni, J.C., C.E. Mason, S.M. Mane, M. Stephens, and Y. Gilad (2008) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 18(9): p. 1509–17.
10. Bolstad, B.M., F. Collin, K.M. Simpson, R.A. Irizarry, and T.P. Speed (2004) Experimental design and low-level analysis of microarray data. Int Rev Neurobiol, 60: p. 25–58.

11. Bolstad, B.M., R.A. Irizarry, M. Åstrand, and T.P. Speed (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2): p. 185–93.
12. Robinson, M.D. and A. Oshlack (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol, 11(3): p. R25.
13. Bullard, J.H., E. Purdom, K.D. Hansen, and S. Dudoit (2010) Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics, 11: p. 94.
14. Oshlack, A. and M.J. Wakefield (2009) Transcript length bias in RNA-seq data confounds systems biology. Biol Direct, 4: p. 14.
15. Young, M.D., M.J. Wakefield, G.K. Smyth, and A. Oshlack (2010) Gene ontology analysis for RNA-seq: accounting for selection bias. Genome Biol, 11(2): p. R14.
16. Subramanian, A., P. Tamayo, V.K. Mootha, S. Mukherjee, B.L. Ebert, M.A. Gillette, A. Paulovich, S.L. Pomeroy, T.R. Golub, E.S. Lander, and J.P. Mesirov (2005) Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 102(43): p. 15545–50.
17. Oleksiak, M.F., G.A. Churchill, and D.L. Crawford (2002) Variation in gene expression within and among natural populations. Nat Genet, 32(2): p. 261–6.
18. Rifkin, S.A., J. Kim, and K.P. White (2003) Evolution of gene expression in the Drosophila melanogaster subgroup. Nat Genet, 33(2): p. 138–44.
19. Lemos, B., C.D. Meiklejohn, M. Cáceres, and D.L. Hartl (2005) Rates of divergence in gene expression profiles of primates, mice, and flies: stabilizing selection and variability among functional categories. Evolution, 59(1): p. 126–137.
20. Lynch, M. (1990) The Rate of Morphological Evolution in Mammals from the Standpoint of the Neutral Expectation. American Naturalist, 136(6): p. 727–741.
21. Lynch, M. and W.G. Hill (1986) Phenotypic Evolution by Neutral Mutation. Evolution, 40(5): p. 915–935.
22. Rifkin, S.A., D. Houle, J. Kim, and K.P. White (2005) A mutation accumulation assay reveals extensive capacity for rapid gene expression evolution. Nature, 438(7065): p. 220–3.
23. Keightley, P.D., U. Trivedi, M. Thomson, F. Oliver, S. Kumar, and M.L. Blaxter (2009) Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines. Genome Res, 19(7): p. 1195–201.
24. Denver, D.R., K. Morris, J.T. Streelman, S.K. Kim, M. Lynch, and W.K. Thomas (2005) The transcriptional consequences of mutation and natural selection in Caenorhabditis elegans. Nat Genet, 37(5): p. 544–8.
25. Khaitovich, P., G. Weiss, M. Lachmann, I. Hellmann, W. Enard, B. Muetzel, U. Wirkner, W. Ansorge, and S. Pääbo (2004) A neutral model of transcriptome evolution. PLoS Biol, 2(5): p. E132.
26. Lemon, B. and R. Tjian (2000) Orchestrated response: a symphony of transcription factors for gene control. Genes Dev, 14(20): p. 2551–69.
27. Landry, C.R., B. Lemos, S.A. Rifkin, W.J. Dickinson, and D.L. Hartl (2007) Genetic properties influencing the evolvability of gene expression. Science, 317(5834): p. 118–21.
28. Bustamante, C.D., A. Fledel-Alon, S. Williamson, R. Nielsen, M. Todd Hubisz, S. Glanowski, D.M. Tanenbaum, T.J. White, J.J. Sninsky, R. Hernandez, D. Civello, M.D. Adams, M. Cargill, and A.G. Clark (2005) Natural Selection on Protein Coding Genes in the Human Genome. Nature, 437(7062): p. 1153–7.
29. Panowski, S.H., S. Wolff, H. Aguilaniu, J. Durieux, and A. Dillin (2007) PHA-4/Foxa mediates diet-restriction-induced longevity of C. elegans. Nature, 447(7144): p. 550–5.
30. Murphy, C.T. (2006) The search for DAF-16/FOXO transcriptional targets: approaches and discoveries. Experimental Gerontology, doi:10.1016/j.exger.2006.06.040.
31. Chavez, V., A. Mohri-Shiomi, A. Maadani, L.A. Vega, and D.A. Garsin (2007) Oxidative Stress Enzymes Are Required for DAF-16-Mediated Immunity Due to Generation of Reactive Oxygen Species by Caenorhabditis elegans. Genetics, 176(3): p. 1567–77.
32. Vavouri, T. and G. Elgar (2005) Prediction of cis-regulatory elements using binding site matrices – the successes, the failures and the reasons for both. Curr Opin Genet Dev, 15(4): p. 395–402.
33. Nègre, N., S. Lavrov, J. Hennetin, M. Bellis, and G. Cavalli (2006) Mapping the distribution of chromatin proteins by ChIP on chip. Methods Enzymol, 410: p. 316–41.
34. Sandmann, T., J.S. Jakobsen, and E.E. Furlong (2006) ChIP-on-chip protocol for genome-wide analysis of transcription factor binding in Drosophila melanogaster embryos. Nat Protoc, 1(6): p. 2839–55.
35. Ceribelli, M., M. Alcalay, M.A. Vigano, and R. Mantovani (2006) Repression of new p53 targets revealed by ChIP on chip experiments. Cell Cycle, 5(10): p. 1102–10.
36. Lin, Z., S. Reierstad, C.C. Huang, and S.E. Bulun (2007) Novel estrogen receptor-alpha binding sites and estradiol target genes identified by chromatin immunoprecipitation cloning in breast cancer. Cancer Res, 67(10): p. 5017–24.
37. Qi, Y., A. Rolfe, K.D. MacIsaac, G.K. Gerber, D. Pokholok, J. Zeitlinger, T. Danford, R.D. Dowell, E. Fraenkel, T.S. Jaakkola, R.A. Young, and D.K. Gifford (2006) High-resolution computational models of genome binding events. Nat Biotechnol, 24(8): p. 963–70.
38. Schmidt, D., M.D. Wilson, B. Ballester, P.C. Schwalie, G.D. Brown, A. Marshall, C. Kutter, S. Watt, C.P. Martinez-Jimenez, S. Mackay, I. Talianidis, P. Flicek, and D.T. Odom (2010) Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science, 328(5981): p. 1036–40.
39. Wilson, M.D., N.L. Barbosa-Morais, D. Schmidt, C.M. Conboy, L. Vanes, V.L. Tybulewicz, E.M. Fisher, S. Tavaré, and D.T. Odom (2008) Species-specific transcription in mice carrying human chromosome 21. Science, 322(5900): p. 434–8.
40. Odom, D.T., R.D. Dowell, E.S. Jacobsen, W. Gordon, T.W. Danford, K.D. Macisaac, P.A. Rolfe, C.M. Conboy, D.K. Gifford, and E. Fraenkel (2007) Tissue-specific transcriptional regulation has diverged significantly between human and mouse. Nat Genet, 39(6): p. 730–732.
41. de Candia, P., R. Blekhman, A.E. Chabot, A. Oshlack, and Y. Gilad (2008) A combination of genomic approaches reveals the role of FOXO1a in regulating an oxidative stress response pathway. PLoS ONE, 3(2): p. e1670.
42. Bradley, R.K., X.Y. Li, C. Trapnell, S. Davidson, L. Pachter, H.C. Chu, L.A. Tonkin, M.D. Biggin, and M.B. Eisen (2010) Binding site turnover produces pervasive quantitative changes in transcription factor binding between closely related Drosophila species. PLoS Biol, 8(3): p. e1000343.
43. Wittkopp, P.J. (2010) Variable transcription factor binding: a mechanism of evolutionary change. PLoS Biol, 8(3): p. e1000342.
44. Rakyan, V.K., T.A. Down, N.P. Thorne, P. Flicek, E. Kulesha, S. Gräf, E.M. Tomazou, L. Bäckdahl, N. Johnson, M. Herberth, K.L. Howe, D.K. Jackson, M.M. Miretti, H. Fiegler, J.C. Marioni, E. Birney, T.J.P. Hubbard, N.P. Carter, S. Tavaré, and S. Beck (2008) An integrated resource for genome-wide identification and analysis of human tissue-specific differentially methylated regions (tDMRs). Genome Research, 18(9): p. 1518–29.
45. Makino, S., M. Adachi, Y. Ago, K. Akiyama, M. Baba, Y. Egashira, M. Fujimura, T. Fukuda, K. Furusho, Y. Iikura, H. Inoue, K. Ito, I. Iwamoto, J. Kabe, Y. Kamikawa, Y. Kawakami, N. Kihara, S. Kitamura, K. Kudo, K. Mano, T. Matsui, H. Mikawa, S. Miyagi, T. Miyamoto, Y. Morita, Y. Nagasaka, T. Nakagawa, S. Nakajima, T. Nakazawa, S. Nishima, K. Ohta, T. Okubo, H. Sakakibara, Y. Sano, K. Shinomiya, K. Takagi, K. Takahashi, G. Tamura, H. Tomioka, K. Yoyoshima, K. Tsukioka, N. Ueda, M. Yamakido, S. Hosoi, and H. Sagara (2005) Definition, diagnosis, disease types, and classification of asthma. Int Arch Allergy Immunol, 136 Suppl 1: p. 3–4.
46. Irizarry, R.A., C. Ladd-Acosta, B. Wen, Z. Wu, C. Montano, P. Onyango, H. Cui, K. Gabo, M. Rongione, M. Webster, H. Ji, J.B. Potash, S. Sabunciyan, and A.P. Feinberg (2009) The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature Genetics, 41(2): p. 178–86.
47. Gama-Sosa, M.A., R.M. Midgett, V.A. Slagel, S. Githens, K.C. Kuo, C.W. Gehrke, and M. Ehrlich (1983) Tissue-specific differences in DNA methylation in various mammals. Biochimica et Biophysica Acta, 740: p. 212–219.
48. Enard, W., A. Fassbender, F. Model, P. Adorjan, S. Pääbo, and A. Olek (2004) Differences in DNA methylation patterns between humans and chimpanzees. Current Biology, 14(4): p. R148–R149.
49. Bernstein, B.E., M. Kamal, K. Lindblad-Toh, S. Bekiranov, D.K. Bailey, D.J. Huebert, S. McMahon, E.K. Karlsson, E.J. Kulbokas, 3rd, T.R. Gingeras, S.L. Schreiber, and E.S. Lander (2005) Genomic maps and comparative analysis of histone modifications in human and mouse. Cell, 120(2): p. 169–81.
50. Farcas, R., E. Schneider, K. Frauenknecht, I. Kondova, R. Bontrop, J. Bohl, B. Navarro, M. Metzler, H. Zischler, U. Zechner, A. Daser, and T. Haaf (2009) Differences in DNA methylation patterns and expression of the CCRK gene in human and nonhuman primate cortices. Mol Biol Evol, 26(6): p. 1379–89.
51. Fay, J.C. and P.J. Wittkopp (2008) Evaluating the role of natural selection in the evolution of gene regulation. Heredity, 100(2): p. 191–9.
52. Whitehead, A. and D.L. Crawford (2006) Neutral and adaptive variation in gene expression. Proc Natl Acad Sci U S A, 103(14): p. 5425–30.
53. Gilad, Y., A. Oshlack, and S.A. Rifkin (2006) Natural selection on gene expression. Trends Genet, 22(8): p. 456–61.
54. Lee, H. and H. Tang (2012) Next generation sequencing technology and fragment assembly algorithms. In M. Anisimova (ed) Evolutionary Genomics: Statistical and Computational Methods. Methods in Molecular Biology, Springer Science+Business Media New York.
55. Beerenwinkel, N. and J. Siebourg (2012) Probability, statistics and computational science. In M. Anisimova (ed) Evolutionary Genomics: Statistical and Computational Methods. Methods in Molecular Biology, Springer Science+Business Media New York.

Chapter 15
Characterization and Evolutionary Analysis of Protein–Protein Interaction Networks
Gabriel Musso, Andrew Emili, and Zhaolei Zhang
Abstract
While researchers have known the importance of the protein–protein interaction for decades, recent
innovations in large-scale screening techniques have caused a shift in the paradigm of protein function
analysis. Where the focus was once on the individual protein, attention is now directed to the surrounding
network of protein associations. As protein interaction networks can provide useful insights into the
potential function of and phenotypes associated with proteins, the increasing availability of large-scale
protein interaction data suggests that molecular biologists can extract more meaningful hypotheses through
examination of these large networks. Further, increasing availability of high-quality protein interaction data
in multiple species has allowed interpretation of the properties of networks (i.e., the presence of hubs and
modularity) from an evolutionary perspective. In this chapter, we discuss major previous findings derived
from analyses of large-scale protein interaction data, focusing on approaches taken by landmark assays in
evaluating the structure and evolution of these networks. We then outline basic techniques for protein
interaction network analysis with the goal of pointing out the benefits and potential limitations of these
approaches. As the majority of large-scale protein interaction data has been generated in budding yeast,
literature described here focuses on this important model organism with references to other species
included where possible.
Key words: Protein interaction, Network, Modularity, Evolution, Hub, Scale free

1. Introduction: Mining Protein Interaction Networks

Although it has long been known that proteins elicit their function
through association, over the past few years it has become increasingly apparent that analyses of entire networks of protein interactions can provide useful information regarding protein function
and deletion consequence. An increase in the use of genome-scale
interaction detection techniques, such as tandem affinity purification (TAP) and yeast 2-hybrid (Y2H) screening (see Fig. 1), has
generated a wealth of protein–protein interaction (PPI) data in

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_15,
© Springer Science+Business Media, LLC 2012


Fig. 1. Protein interaction detection. Binary detection assays used for protein interaction screening typically employ the
reconstruction of a reporter when two recombinant proteins (each tethered to one component of the reporter's activator)
are in sufficiently close proximity. In the case of traditional Y2H screening (upper left), the DNA binding and activation
domains of GAL4 are tethered to a bait (B) and prey (P) protein, respectively; upon bait–prey interaction, GAL4 is
reconstructed and a reporter signal activated. Split ubiquitin screening (upper right) utilizes a variation of this concept
in which ubiquitin is reconstructed, cleaves an attached transcription factor, and subsequently causes reporter activation.
Alternately, detection of complexes typically involves some form of epitope tagging followed by affinity purification.
While there are multiple tags that can be used for affinity purification assay, traditional tandem affinity purification
(TAP; bottom half) uses a tag containing protein A, a
tobacco etch virus (TEV) cleavage site, and calmodulin-binding peptide for two successive rounds of purification based on
immobilization of the tagged bait. In either binary or affinity purification-based techniques, interactions are generally
confirmed through reciprocal assay.

multiple species, allowing a paradigm shift in which hypothesis generation and functional characterization are facilitated through
network analysis techniques. Specifically, topological properties of
these interaction networks have proven to be analytically useful. As
similarities in protein interaction network localization can be used
to infer function (1, 2), patterns of interactions have been used to
predict associated phenotypes (3), and network structure has been
shown to have prognostic value in determining disease progression
(4, 5). The question then of how these networks achieve their
hallmark properties through evolution has become one of great
interest.

15

Characterization and Evolutionary Analysis of Protein–Protein. . .

365

Table 1
Types of interactions used to generate networks

Interaction type: Genetic
Description: Generally, an observation of greater or lesser phenotypic consequence when disrupting two genes in the same organism than expected based on individual deletions. The majority of this type of evidence comes from budding yeast, where there is an ongoing effort to assay all pairwise gene deletions.
Potential sources: The BioGRID: http://thebiogrid.org; DRYGIN: http://drygin.ccbr.utoronto.ca; DroID: www.droidb.org

Interaction type: Protein
Description: Physical association between gene products, either transient in nature or indicative of comembership within a protein complex. Large-scale detection of protein interactions typically employs a recombination-based tagging or complementation strategy (e.g., tandem affinity purification, yeast 2-hybrid).
Potential sources: The BioGRID: http://thebiogrid.org; DIP: http://dip.doe-mbi.ucla.edu; IntAct: www.ebi.ac.uk/intact; HPRD: www.hprd.org

Interaction type: Functional similarity or data integration
Description: Several databases actively update gene functional annotations based on experimental evidence and computational prediction. Examining the proximity of genes in a network of functional linkage can indicate the extent of functional overlap. Some publicly available tools also integrate data from several sources to derive a score indicating the functional overlap of a pair or group of genes.
Potential sources: The Gene Ontology: www.geneontology.org; AmiGO: http://amigo.geneontology.org/cgi-bin/amigo/go.cgi; BioPixie: http://pixie.princeton.edu/pixie; FuncAssociate: http://llama.mshri.on.ca/funcassociate

Interaction type: Coexpression
Description: Similarity in patterns of expression is a good indication of both physical and genetic association and can be used to derive useful functional relationships.
Potential sources: NNN: http://quantbio-tools.princeton.edu/cgi-bin/nnn; Avadis: http://www.strandls.com/Avadis

Listed are four basic types of association used to draw inference regarding overlapping function of genes or gene products.

Networks of protein or gene interaction can be derived from any manner of association, from epistasis, to coexpression, to physical association (see Table 1 for a list of available association data sources). While in this review we focus specifically on PPI-derived
interaction networks, the analytical concepts described here could
effectively be applied to networks of any type. In the context of
protein interactions, we describe interaction networks using the


Fig. 2. Illustration of network types. Preferences in the attachment of edges during the
generation of a network greatly affect its topology. Both networks above contain seven
nodes connected by six edges; however, in the left graph, associations were distributed
uniformly, whereas on the right edges were preferentially attached to nodes with existing
edges. The right graph is an example of a small world design, as the presence of hubs
(black nodes) affords a structure in which any two nodes can be connected by a small
number of edges.

common nomenclature of graph theory in which an interaction graph is represented as a series of edges (here, a proxy for protein
interactions) connecting vertices (proteins; also referred to as
nodes). We begin with a brief review of some landmark analyses in
the field of protein interaction network analysis in Subheading 2,
focusing on the applied analytical techniques. In Subheading 3, we
then discuss progress in evaluating the potential means of evolution
for these networks, highlighting important work examining the
development of universally observable network characteristics. We then provide brief step-by-step instructions on how to perform a basic topological analysis of a PPI network in Subheading 4.
References to techniques not covered in depth in this section or
to reviews that discuss topics listed here in greater detail are given
where appropriate.

2. Major Works in Protein–Protein Interaction Network Analysis

2.1. Observation of Small World Properties in Protein Interaction Networks

One of the earliest noted observations regarding large-scale protein interaction networks was the uneven distribution of edges (6, 7).
More specifically, some proteins had only a few interactions while a
small number of proteins had a very large number of interactions.
Graphs that are organized in this fashion tend to be labeled as
having "small world" properties, since this particular type of organization allows any two nodes in the network to be connected by
very few links (8). Specifically, a defining characteristic of a small
world network is that the average minimum number of edges
required to connect two nodes increases logarithmically with the
number of nodes (see Fig. 2). For more specific definitions of small
world network types, see Amaral et al. (9) who identify three classes
of small world networks and contrast their respective properties.
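The skewed degree distribution that distinguishes small world graphs from uniformly wired ones is straightforward to tabulate from raw interaction pairs. The sketch below is a minimal plain-Python illustration; the hub-and-spoke edge list is a made-up toy, not real PPI data:

```python
from collections import defaultdict, Counter

def degree_distribution(edges):
    """Count interaction partners per node, then tally nodes per degree."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-interactions
            neighbors[a].add(b)
            neighbors[b].add(a)
    degrees = {n: len(p) for n, p in neighbors.items()}
    return degrees, Counter(degrees.values())

# Toy "hub-and-spoke" network: one hub (H) plus a short chain.
edges = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D"), ("D", "E")]
degrees, dist = degree_distribution(edges)
print(degrees["H"])   # 4: the hub carries most of the edges
print(dist[1])        # 4 nodes (A, B, C, E) have a single partner
```

On a real network, plotting `dist` on log-log axes is the usual first look at whether the degree distribution is heavy-tailed.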


Early analysis of PPI data further suggested that protein interaction networks fit the more stringent definition of having a scale-free
connectivity distribution (6, 7): a subset of small world networks in
which new edges are preferentially connected to highly connected
nodes, and consequently the number of edges incident on each node
follows a power-law distribution. This would have implications not
only for the topological properties of the network, but also in the
interpretation of its evolution, as this would suggest retention and
loss of interactions through specific mechanisms (10). Incorrectly
labeling an interaction graph as being scale free has additional analytic
ramifications. For example, Khanin and Wit (11) argue that this
results in the incorrect assumption that biological networks follow
the same design principles as those observed in the physical and social
sciences. In the past several years, the classification of virtually all
protein interaction networks as scale free has been contested based
on goodness of fit tests (11, 12), although this discrepancy may be
due to an incomplete sampling of the full interaction network (13).
While the presence of scale-free connectivity distributions may
still be a contentious issue, properties of small world networks
appear to be universally apparent in PPI networks. Two characteristics of the small world connection structure that are commonly
observed in PPI networks are the presence of highly connected
nodes (or hubs) and a simplified definition of cliques or subnetworks. Each of these properties is discussed in detail below.
2.2. Properties of Network Hubs

When analyzing a network composed mainly of Y2H interactions from Uetz et al. (14), Jeong et al. (6) showed that the most highly
connected nodes were also the most likely to be essential. Specifically, the authors demonstrated an overall lethality rate of 21% when
deleting genes with five or fewer interactions, but a 62% lethality
rate among genes with more than five interactions (6). This was the
first indication that the so-called hub proteins in protein interaction
networks might be uniquely important both in cellular function
and as therapeutic targets. Han et al. (15) would later further
subdivide hub proteins (again defined as having more than five
interactions, although this time using a network of protein interactions confirmed in any two of several combined Y2H and affinity
purification datasets) after correlating the expression of hub proteins with their respective interactors and noting a resulting
bimodal distribution. The so-called "party" and "date" hubs, expressed at similar and at different times as their interactors, respectively,
differed in their impact when artificially removed from the network,
with date hubs causing a greater fragmentation of the largest
connected subset of proteins. Paradoxically, however, Fraser examined the evolutionary rate (nonsynonymous-to-synonymous substitution ratio) of these hubs and concluded that party hubs were
under stricter evolutionary constraint, reasoning that the evolutionary lability of date hubs made them more important to the


overall network (16), despite their seemingly decreased importance to network structure.
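The centrality–lethality comparison at the start of this debate amounts to stratifying genes by degree and computing the fraction of essential genes in each stratum. A minimal sketch, with entirely hypothetical degrees and essentiality calls (the numbers are not the published rates):

```python
def lethality_by_hub_status(degrees, essential, cutoff=5):
    """Fraction of essential genes among hubs (degree > cutoff) vs. non-hubs."""
    hubs = [g for g in degrees if degrees[g] > cutoff]
    nonhubs = [g for g in degrees if degrees[g] <= cutoff]
    rate = lambda genes: sum(g in essential for g in genes) / len(genes)
    return rate(hubs), rate(nonhubs)

# Hypothetical degree counts and essentiality calls for six genes.
degrees = {"G1": 12, "G2": 8, "G3": 3, "G4": 1, "G5": 2, "G6": 9}
essential = {"G1", "G2", "G6"}
hub_rate, nonhub_rate = lethality_by_hub_status(degrees, essential)
print(hub_rate, nonhub_rate)   # 1.0 0.0 in this toy example
```

Varying `cutoff` (and the evidence used to build `degrees`) is exactly where analyses such as those discussed above begin to diverge.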
Assertions regarding the distinction between "party" and "date" hubs would later be directly contested by Batada et al., who claimed that the observations of hub deletion consequence and coexpression with interactors did not support a distinction, and were instead due to an incomplete sampling of the interaction
network. Batada et al. (17) made this conclusion using an interaction network that included TAP data and, while still requiring
interactions to appear by at least two lines of evidence, was nearly
four times as densely connected. Batada et al. would also argue
that the observation of stricter evolutionary conservation for
party hubs was eliminated when controlling for their abundance.
A corresponding second article published by each group would
argue both for and against the date–party hub distinction. Those
arguing for the distinction claimed that date/party hubs could
still be observed if the larger networks were more stringently
filtered (18) while those arguing against it suggested that even
within this more filtered network correlation distributions for
hubs with their interactors did not meet a more rigorous definition of bimodality (19). More recent analysis using a richer PPI
dataset has suggested that this apparent dichotomy was driven by
a small subset of hubs that are highly coexpressed with their
interactors (20), perhaps emphasizing the importance of considering biological overrepresentation underlying observed topological trends.
The date/party hub debate is a good illustration of how conclusions regarding the biological implications of interaction network structure can be impacted not only by analytical technique,
but also by the selection of an interaction network for study.
Despite the fact that they may be largely overlapping in coverage,
large-scale PPI datasets collected from various experimental sources
could differ in topology due to inherent biases (21). For example,
there are two major categories of large-scale protein interaction
detection assay: those that detect direct interactions between two
proteins or protein fragments and those that assay complexes and
may or may not infer interactions among all proteins retrieved by a
single bait (e.g., binary versus complex screening; see Fig. 1, and for
more detailed descriptions of large-scale PPI detection methods see
Musso et al. (22), Sanderson (23), and Cagney (24)). In their
recent large-scale Y2H screen, Yu et al. (25) examined datasets
resulting from these two types of assay in detail by determining
the overlap of their data with interactions from several other
sources. The authors concluded that data assembled by binary
and complex-centric interaction detection methods could be highly
accurate but still largely nonoverlapping, as the detected interactions tended to be complementary. Therefore, while it is still common practice with large-scale experimental datasets to comment


both on the topological nature of the resulting network and the presence/absence and function of hub proteins, this should generally be considered in the biological context of the experimental
technique.
2.3. Protein Interaction Network Modularity and Guilt by Association

Small world networks are particularly amenable to the definition of cliques or subnetworks. Whether defined as the organization of
higher eukaryotic organisms into multiple, distinct cell types or
the presence of identifiable units in a protein interaction network,
modularity permeates virtually all levels of systematic organization in molecular biology. Proteins elicit their effects through association into stable units, the coordinated assembly of which is essential for proper cell function (26). Conceptually, these units are not
in molecular biology. Proteins elicit their effects through association into stable units, the coordinated assembly of which are essential for proper cell function (26). Conceptually, these units are not
necessarily all constitutively bound associative units, such as the
ribosome, but are also often considered as modules of proteins
united by a common discrete function (27). From a graph theoretical perspective, these units are typically identified as areas with
more dense network connectivity among the nodes than with the
remainder of nodes of the graph. For an in-depth discussion of the
establishment of functional modules, see Pereira-Leal et al. (28),
who suggest that maintenance of these modules often requires
strict evolutionary conservation.
In perhaps the most pertinent example of the impact of module
detection on experimental results, two independent, large-scale TAP
surveys published in 2006 by Krogan et al. (1) and Gavin et al. (2)
sought to experimentally identify high-quality protein interactions
and then reconstitute protein complexes using clustering algorithms.
Specifically, Krogan et al. employed a graph clustering technique
(Markov Clustering, generally recognized as a fast and accurate
algorithm for protein complex detection (29, 30)) to transform
their interaction data into a list of protein clusters. Conversely,
in Gavin et al.'s study, complexes were derived using an iterative
implementation of a clustering algorithm, varying clustering parameters, and evaluating the resulting cluster sets for accuracy at each
iteration. The published list of clusters was ultimately an amalgam of
the iterations scoring above a sufficient cutoff for both coverage and
accuracy, with the varying representations of clusters in these iterations taken as isoforms.
The respective clustering techniques applied by the Krogan
et al. and Gavin et al. studies illustrate, to some extent, differences
in their biological interpretation of the complexosome (the entire
complement of protein complexes within the cell). Krogan et al.
applied a clustering technique that only allowed for exclusive membership, suggesting that while genes may be pleiotropic, ultimately
they have a representative function that can be used to group them
with other genes. This eased postanalysis and allowed extensive
characterization of gene function through guilt by association
(GBA) (more on this below). Alternately, Gavin et al. identified


stable core members of protein complexes (present in the majority of complex isoforms), as well as accessory members that tended to
have more transient membership. This allowed the investigators to
identify unifying properties of complex cores (e.g., frequent coexpression), and in the authors' view presented a more biologically
accurate representation of the biological circuitry. Despite obvious
differences in clustering technique, each of these two surveys
showed a high quality of protein complexes as benchmarked against
an external gold standard (31), ultimately illustrating the potential
subjectivity of module determination. Ramifications of cluster
definition for evolutionary analysis are discussed below; however,
for a more detailed discussion and comparison of clustering techniques commonly applied to interaction graphs, see Brohee and
van Helden (29).
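The core of Markov Clustering is easy to sketch: it alternates "expansion" (matrix squaring, which spreads flow along paths) with "inflation" (entrywise powering plus column renormalization, which strengthens intra-module flow) until the flow matrix stabilizes. The plain-Python version below is a bare-bones illustration of that idea, not the optimized MCL implementation used in the cited studies; the two-triangle example graph is hypothetical:

```python
from collections import deque

def mcl(nodes, edges, inflation=2.0, iters=30, tol=1e-4):
    """Bare-bones Markov Clustering on an undirected graph."""
    n = len(nodes)
    idx = {v: i for i, v in enumerate(nodes)}
    # Adjacency matrix with self-loops, made column-stochastic.
    m = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for a, b in edges:
        m[idx[a]][idx[b]] = m[idx[b]][idx[a]] = 1.0

    def normalize(mat):
        sums = [sum(mat[i][j] for i in range(n)) for j in range(n)]
        return [[mat[i][j] / sums[j] for j in range(n)] for i in range(n)]

    m = normalize(m)
    for _ in range(iters):
        # Expansion: matrix squaring spreads flow along longer paths.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: powering + renormalization favors strong (intra-module) flow.
        m = normalize([[x ** inflation for x in row] for row in m])

    # Read clusters off as connected components of surviving flow entries.
    adj = {i: {j for j in range(n)
               if i != j and (m[i][j] > tol or m[j][i] > tol)} for i in range(n)}
    seen, clusters = set(), []
    for i in range(n):
        if i in seen:
            continue
        comp, queue = set(), deque([i])
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        clusters.append(sorted(nodes[u] for u in comp))
    return sorted(clusters)

# Two triangles bridged by a single edge: the bridge flow dies out,
# and the two triangles emerge as separate clusters.
nodes = list("ABCDEF")
edges = [("A", "B"), ("A", "C"), ("B", "C"),
         ("D", "E"), ("D", "F"), ("E", "F"), ("C", "D")]
print(mcl(nodes, edges))
```

The `inflation` parameter controls cluster granularity: higher values break the graph into smaller, tighter modules, which is one concrete way the "biological interpretation of the complexosome" enters the analysis.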
Much like the concept of hub proteins, the notion that all
proteins in the cell can be grouped into discrete functional units is
an obvious oversimplification. However, assignment of gene function based on proximity in the interaction network has been shown to be an accurate means of inferring function. The so-called GBA
generally applies the concept that proteins close to one another in
the PPI network tend to share similar function. In the cases of
Krogan et al. and Gavin et al., assigned function based on GBA
could involve determination of a common function for a complex,
and then assignment of this function to all complex members.
Some noncluster-based techniques have also been developed for
GBA (see Sharan et al. (32) for a more comprehensive description
of GBA techniques), and range from tallying functions annotated
to proteins that associate with, or are close to, a protein of interest
and assigning function based on frequency (33, 34) to probabilistic
assignment based on Markov Random Fields (35, 36). While these
techniques are useful in assigning functional characterizations to
unassayed genes, they ultimately are still subject to error and should
be coupled with biological validation before being considered certain.
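The simplest neighbor-voting flavor of GBA can be sketched in a few lines; the protein names and annotations below are illustrative, not drawn from a real dataset:

```python
from collections import Counter

def guilt_by_association(protein, neighbors, annotations):
    """Predict functions for an unannotated protein by majority vote over
    the annotations of its direct interaction partners (the simplest GBA
    scheme; frequency-ranked, most supported function first)."""
    votes = Counter()
    for partner in neighbors.get(protein, ()):
        votes.update(annotations.get(partner, ()))
    return [func for func, _ in votes.most_common()]

# Hypothetical uncharacterized protein with three annotated partners.
neighbors = {"YFG1": {"RPL3", "RPL5", "CDC28"}}
annotations = {"RPL3": {"translation"}, "RPL5": {"translation"},
               "CDC28": {"cell cycle"}}
print(guilt_by_association("YFG1", neighbors, annotations))
# 'translation' (2 votes) ranks ahead of 'cell cycle' (1 vote)
```

Cluster-based GBA differs only in the voting pool: votes come from comembers of an inferred complex rather than from direct partners.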
2.4. Summary

Before moving on to describe methods for evolutionary analysis of protein interaction networks, we end this section with the warning
that care must be taken in analyses of network topology. The
homogenization of the interaction network that occurs during
large-scale network analysis has the advantage that it tends to be
robust against random noise, but, as alluded to throughout this
discussion, is also invariably an oversimplification of the nature of
protein associations. For example, due to variations in splicing or
posttranslational processing, any node in the network could effectively represent a collection of proteins with varying physical properties and domain structures (37). Similarly, edges could have been
generated by disparate experimental techniques and could indicate
anything from a stable, permanent association, to a transient interaction, to merely coassociation within the same protein complex.
The inherent variability in not only the proteins but also the types


of interactions represented in a network graph must be considered before asserting any conclusions. As we discuss in the next section,
this situation becomes all the more tenuous when comparing interactions among various organisms.

3. Evolutionary Comparisons of Protein Networks

3.1. Cross-Species Comparisons of Protein Interaction Networks

The majority of findings mentioned in the previous section were discovered in yeast; however, the increasing generation of high-quality interaction data in multiple species continues to allow more
accurate, direct comparisons of the resulting interaction networks.
As mentioned above, there is always some subjectivity in using
protein interaction networks to assign gene function; however,
cross-species comparisons have generated some useful insights
into the evolutionary process. In this section of the chapter, we
discuss the impact of evolution on the protein interaction network
by first examining approaches for comparison of protein networks
between species, and then move onward to specific network components, such as hubs and modules, and how they might arise.
Perhaps one of the most striking findings to follow from early
genome analysis was that the complexity of an organism does not
necessarily correlate with the number of genes in its genome. One
possible explanation for this apparent discrepancy is that while
higher organisms may not necessarily have more genes, there may
be more communication between proteins, which would be evident
in a denser protein interaction network. Given the varying extent to
which model organisms have been assayed for PPI (interaction
databases are generally dominated by experimental evidence from
S. cerevisiae), examining this question in an unbiased manner is not
easy. Thus, while it is common practice for large-scale interaction
screening studies to analyze and comment on the overall topology
of their generated networks (4, 38, 39), global comparisons of the
complexity of protein interaction networks between species are
virtually nonexistent.
Several analytical methods have emerged to examine and compare local topology in interaction networks (40, 41), and the field
of network alignment has demonstrated accurate detection of pathways across species (42). However, as we have seen in the examples
above, comparing interactions even within a single species can be
problematic due to inherent biases in datasets. Therefore, conclusions drawn between species become all the more questionable,
although incorporation of unbiased data can often help mitigate
this uncertainty. For example, Xia et al. (43) noted a correlation
between species complexity and the number of annotated protein
domains per protein when analyzing data for 19 species ranging


from yeast to human. Further, the authors noted that domain coverage (defined as the fraction of a given protein sequence length
contained in annotated domains) shows a strong correlation with
the number of protein interactions. However, despite the authors'
attempts to minimize species-specific biases (leave-one-out analyses
to ensure lack of overrepresentation of a given class of protein
domains, removal of all proteins without any known annotations
from analysis), there remained the possibility that knowledge of
protein domain structures was based on previous experimental
research, which might disproportionately favor some organisms.
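Both quantities that Xia et al. relate, domain coverage and interaction count, are simple to compute once domain intervals and degrees are in hand. The sketch below uses made-up proteins, assumes non-overlapping domain intervals, and pairs a coverage helper with a plain Pearson correlation coefficient (an illustration of the kind of comparison described, not the authors' exact analysis):

```python
from math import sqrt

def domain_coverage(length, domains):
    """Fraction of a protein's length covered by annotated domain intervals
    (intervals assumed non-overlapping for simplicity)."""
    return sum(end - start for start, end in domains) / length

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical proteins: (length, domain intervals, interaction count).
proteins = [(300, [(10, 110)], 2), (400, [(0, 300)], 9), (500, [(50, 450)], 11)]
coverage = [domain_coverage(length, doms) for length, doms, _ in proteins]
counts = [k for _, _, k in proteins]
print(pearson(coverage, counts))   # strongly positive in this toy set
```

A real analysis would of course also need the leave-one-out controls the authors describe to guard against annotation bias.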
3.2. Evolution of Network Hubs and Modules

Central to the understanding of protein network structure across species is the determination of how these networks add and remove
edges. Soon after the first large-scale Y2H screens of protein interactions were published, researchers attempted to answer the question of
how interactions may be gained or lost following gene duplications
through determination of the contribution of known duplicates to
the overall network structure. Presumably, immediately after gene
duplication, the two resulting paralogs share all interaction partners,
and then gain or lose interactions over evolutionary time. Early
analysis suggested that duplicated genes, as established through
sequence similarity, were not more likely than randomly selected
pairs to share interaction partners or be located in similar interaction
subnetworks (44). This finding implied that interactions that were
initially shared between paralogs were effectively being randomly
replaced with new interactions. However, more recent analyses using more densely populated interaction networks have alternately proposed preferential overlap in the retained interactions of
duplicates (4547), with retained function thought to contribute to
robustness of the interaction network. Consequently, mutation or
deletion of duplicated genes is generally associated with a lessened
decrease in fitness (48), as the cellular consequence is thought to be
buffered by the presence of an extant duplicate.
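Questions about retained versus lost interactions after duplication are often quantified as the overlap of the two paralogs' partner sets. A small sketch computing the Jaccard overlap on a hypothetical post-duplication network (names are illustrative):

```python
def partner_overlap(network, pair):
    """Jaccard overlap of interaction partners for a putative paralog pair,
    excluding the paralogs themselves. Values near 1 suggest retained shared
    interactions; values near 0 suggest divergence."""
    a, b = pair
    pa = network.get(a, set()) - {a, b}
    pb = network.get(b, set()) - {a, b}
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0

# Hypothetical paralogs P1/P2 sharing two partners and each keeping one unique.
network = {"P1": {"X", "Y", "Z"}, "P2": {"X", "Y", "W"}}
print(partner_overlap(network, ("P1", "P2")))   # 2 shared / 4 total = 0.5
```

Comparing such scores for paralog pairs against randomly chosen pairs is the basic test behind the analyses cited above.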
Despite observations of shared molecular function, however,
the concept of a correlation between centrality (the number of
interactions) and lethality (the effect upon deletion of a gene) (6)
seemed to argue against potential functional compensation for
hubs on behalf of paralogs. However, in recent work, Kafri et al.
(49) suggested that while not all duplicates buffer the phenotypic
consequences of deleting their sister paralog, those that do tend to
have higher connectivity in the network. The authors came to this
conclusion by dividing network hubs based on the presence or
absence of a duplicate and noting two distinct correlations between
fitness defect upon gene deletion and interaction network degree
(further confirmed through direct experimentation). What would
cause some hubs to retain duplicates over millions of years of
evolutionary time and not others is unclear, but may be related to
the mode of duplication.


Retention of interactions following gene duplication is ultimately mediated by selective pressures, which are known to vary depending
on duplication event type (50). While typical gene duplication events,
such as retrotranspositions and tandem duplications, add at minimum one extra node to the protein interaction network (herein
referred to as small-scale duplication, SSD), whole-genome duplication (WGD) events represent a near or complete doubling in genomic
content (see Fig. 3), duplicating entire complexes or pathways.

Fig. 3. Mechanisms of gene duplication. Depicted are several common mechanisms for both gene and genome duplication.
Beginning at top left and going clockwise, two well-described mechanisms for tandem duplication are unequal exchange
or crossing over occurring due to misalignment (indicated by small squares and dotted lines) during mitosis and meiosis,
respectively. Retrotransposition involves the reverse transcription of mRNA sequences into the genome as cDNA.
Allopolyploidy events involve the combination of the genomes of two species to increase the genetic complement (one
described case depicted). In contrast, autopolyploidies typically result from errors in the reduction of gametes among a
single species. Portions regarding auto and alloploidization adapted from Campbell and Reece (65), and regarding tandem
duplication adapted from Ohno (66).


Due to the fact that WGD events allow duplication of functional modules or complexes that may be sensitive to imbalance, the resulting functional bias of WGD-resultant paralogs (51) is thought to be
due at least in part to maintenance of dosage among proteins within a
complex or pathway, avoiding haploimbalance.
The term "haploimbalance" was originally coined to describe the
formation of complexes that were inactive due to either increased or
decreased dosage of one member (52). The subsequent dosage
balance hypothesis purported that duplicating a subcomponent of
a complex alters its inherent stoichiometry and is potentially harmful; in support, genes with dosage sensitivity were shown to be more than twice as likely to be involved in protein complexes (53), and many subunit pairs with associated fitness defects to be coexpressed (53). For example, ribosomal proteins, which are particularly sensitive to imbalance (54), are preferentially observed to increase in number following a WGD.
Pereira-Leal and Teichmann (55) have challenged the assertion
that entire functional modules or complexes could arise entirely
from large-scale duplication events, arguing instead that they
emerge gradually. The authors examined protein complexes in
multiple experimental and literature-curated PPI datasets, and
assigned a similarity score to pairs of complexes based on the
proportion of identical or similar (based on domain assignments
or sequence alignment) proteins therein. This allowed the authors
to conclude that most complexes arise through a stepwise process
since most of the complexes showing similarity (1) had only partially orthologous units; (2) did not have members from the same
chromosomal segment, and thus likely did not arise from a single
chromosomal duplication; and (3) were not known to be created
from the WGD event. Similarly, when fitting models of interaction
loss to approximate functional relationships among extant genes
created by the WGD event, Conant and Wolfe (56) noted a partitioning of genes based on coexpression, although not based on
PPIs. Thus, while suggesting a homogeneous mechanism for complex or module generation would be oversimplistic, it appears that
some element of duplication, either small or large scale, followed by
selective loss or restructuring of interaction partners is necessary in
the generation of novel complexes.
One consequent question then is how these module or complex members gain or lose interaction partners following a large-scale duplication event. There is a long-observed asymmetry in the
number of protein interaction partners for retained duplicates (44),
suggesting that loss of interaction partners may follow a particular
pattern. Zhang et al. (57) noted that the difference in degree
between duplicates followed a power-law distribution, implicating
"rich-get-richer" scenarios of generation, and suggested that symmetry in loss of interaction partners depended on the connectivity of
the ancestral gene, with highly connected ancestral genes giving rise


to duplicates that lose interactions in an asymmetric manner. While this finding does support previous conclusions gathered using
alternate evidence (58, 59), certainty regarding asymmetric divergence of paralogs would rely on the construction of an ancestral
network, which has no confirmable accuracy.
3.3. Summary

Evolutionary assertions regarding the development and maintenance of protein interaction network structures are ultimately conceptual
arguments, as the ancestral interaction network can never be reproduced with complete fidelity. However, determination of similarities
in extant networks illustrates their cohesion through establishment of
similar motifs and modules, suggesting a selectable advantage. As
information regarding protein interactions in various species continues to develop, our awareness and capacity to eliminate biases
from these networks will continually improve as will our understanding of what truly both unifies and separates these networks. The next
phase of network analysis can then be the understanding of how these
networks differentially respond to environmental cues and stresses,
and, by extension, what mediated their specific differentiations.

4. Hands-On Network Analysis

4.1. Determination of Network Properties

In this section, we present a basic analysis of the topological properties of a protein interaction network. Although the instructions
given in this section are meant to be generally applicable to any
dataset, the example results are derived using the human MAP
kinase protein interaction data published by Bandyopadhyay et al.
(60). This analysis calculates basic network properties (Table 2)
using the NetworkAnalyzer (61) plugin for the network visualization tool Cytoscape (62). As it is a publicly available multiplatform tool with a wealth of analytical features constantly being
added and refined by the Computational Biology community, we
strongly recommend the use of Cytoscape for all forms of network
analysis. While a description of the basic use of Cytoscape is beyond
the scope of this chapter, detailed information regarding the installation and functionality of Cytoscape can be found in the associated
wiki: http://cytoscape.wodaklab.org/wiki, as well as the protocol
written by Cline et al. (63). The NetworkAnalyzer plugin that is
used for this analysis can be downloaded from: http://med.bioinf.mpi-inf.mpg.de/netanalyzer.
This Web site contains further documentation describing the
full capabilities of the NetworkAnalyzer plugin as well as instructions for its implementation. Alternative tools that could provide
more in-depth analysis are Pajek (64) and the igraph network analysis and visualization package for R: http://igraph.sourceforge.net/doc/R/00Index.html.


Table 2
Network property description

Property: Clustering coefficient
Description: Describes either the global or local density of connections in a
network. A small-world network has a significantly higher global clustering
coefficient than a random graph.
Calculation: There are 3 potential edges connecting the neighbors of A. In the
example, 2 of these 3 edges exist, giving A a clustering coefficient score of 2/3.
For the global metric, average this value across all nodes.

Property: Characteristic path length
Description: The smallest number of edges required to link two nodes is known as
the minimum edge distance. The average minimum edge distance between nodes in the
network is the characteristic path length.
Calculation: The minimum number of edges required to connect nodes A and B is 2.
To calculate the characteristic path length, find the average minimum edge
distance across all permuted node pairs.

Property: Network centralization
Description: Centralization is the measure of the proportion of nodes to which a
single node can connect. When calculated for a graph, it indicates the extent to
which the graph is centered around a small number of highly connected nodes.
Calculation: Centralization is defined as the degree of each node divided by a
graph term known as the maximum possible sum of differences. This score is
designed to give highly centralized (star-like) graphs a maximum value (close
to 1).

Described are three network characteristics outputted by NetworkAnalyzer. For a detailed description of
the remaining metrics, see the tool's online help: http://med.bioinf.mpi-inf.mpg.de/netanalyzer/help/
2.7/index.html
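The three properties in Table 2 can also be computed outside Cytoscape. The
following pure-Python sketch implements the formulas described above; the
function and variable names are our own illustrations, not part of
NetworkAnalyzer, and the centralization shown is the standard Freeman degree
centralization (NetworkAnalyzer's exact normalization may differ slightly).

```python
# Pure-Python sketch of the three "simple parameters" in Table 2.
from collections import deque
from itertools import combinations

def neighbors(edges):
    """Build an adjacency map {node: set of neighbors} from undirected edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def clustering_coefficient(adj):
    """Mean local clustering coefficient: fraction of a node's
    neighbor pairs that are themselves connected."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # degree-0/1 nodes contribute 0
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def characteristic_path_length(adj):
    """Average minimum edge distance over all node pairs (connected graph)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:              # breadth-first search from each source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1    # each pair is counted in both directions
    return total / pairs

def degree_centralization(adj):
    """Freeman degree centralization: 1 for a star, near 0 for a regular graph."""
    n = len(adj)
    degrees = [len(nbrs) for nbrs in adj.values()]
    return sum(max(degrees) - d for d in degrees) / ((n - 1) * (n - 2))
```

For a toy network consisting of a triangle A-B-C with a pendant node D attached
to C, these functions return a mean clustering coefficient of 7/12, a
characteristic path length of 4/3, and a centralization of 2/3.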

4.2. Importing Network Data into Cytoscape

Interaction data used for this analysis can be obtained as the first
supplementary table published by Bandyopadhyay et al.: http://www.
nature.com/nmeth/journal/v7/n10/extref/nmeth.1506-S2.xls.
As downloaded, this file is in a 10-column format, with columns including names,
gene IDs, descriptions, and confidence information for each interaction. Only the
gene IDs are required for the purpose of this analysis, so columns 2 and 4 should
be copied to a new Excel file and saved without headers (the file should then have
2,272 rows). This file can be directly imported into Cytoscape using the "Import
Network from Table" command in the File menu.
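The column-extraction step above can also be scripted. The sketch below assumes
the supplementary table has first been exported from Excel to tab-delimited text
(reading the .xls directly would require an Excel library); the function name and
default column indices are our own illustrative choices.

```python
# Sketch: copy the two gene-ID columns (columns 2 and 4 of the 10-column
# table) into a headerless two-column file that Cytoscape's
# "Import Network from Table" command accepts.
import csv

def extract_gene_id_columns(infile, outfile, cols=(1, 3)):
    """Copy the selected (0-based) columns, dropping the header row."""
    with open(infile, newline="") as src, open(outfile, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            writer.writerow([row[c] for c in cols])
```

Running this on the exported table yields a plain bait-prey edge list, one
interaction per line.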

4.3. Determining Basic Properties of the Network

Perform an analysis of network properties by selecting "Analyze Network" under
the Network Analysis heading within the Plugins menu. As the inputted
interactions are bidirectional, select "Treat the network as undirected" in the
resulting dialog box. This generates a window displaying network properties,
such as the clustering coefficient, characteristic path length, and network
centralization, under the "Simple Parameters" heading (see Fig. 4 for the
expected output and Table 2 for an explanation of these properties). These
properties can be exported using the "Save Statistics" option, and are
outputted in a .netstats format which can be

15

Characterization and Evolutionary Analysis of ProteinProtein. . .

377

Fig. 4. Simple network parameters from NetworkAnalyzer. NetworkAnalyzer outputs a small number of basic network
parameters that can be saved for further analysis and comparison with other networks. A description of some of these
metrics can be found in Table 2.

viewed with any text editor. Selecting a subset of nodes and repeating this
analysis using the "Analyze subset of nodes" option allows comparison among a
specific subset of genes. This is useful, for example, to identify the local
properties of one gene family of interest.
Under the heading "Node Degree Distribution", we see a log-log plot of node
degree versus frequency of occurrence. The "Fit Power Law" function can be used
to determine whether the distribution of edges in this graph approximates a
power law (Fig. 5). The MAP kinase protein interaction network appears to fit
this definition (r = 0.955), which is to be expected since a small number of
baits with somewhat overlapping targets were screened in depth. Graphs
visualizing the distributions of network properties (degree, clustering
coefficient, and shortest path length) can be exported as image files by
selecting "Export Chart".
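The least-squares fit behind the "Fit Power Law" button can be sketched as
follows; NetworkAnalyzer's internal implementation may differ in detail, but the
idea is an ordinary linear regression of log(frequency) on log(degree). The
function name is our own, and degrees are assumed to be at least 1.

```python
# Sketch of a least-squares power-law fit on a log-log degree distribution:
# frequency ~ degree**b, so log(frequency) is linear in log(degree).
import math

def fit_power_law(degrees):
    """Return (exponent b, correlation coefficient r) for the fit."""
    freq = {}
    for k in degrees:                 # tabulate how often each degree occurs
        freq[k] = freq.get(k, 0) + 1
    xs = [math.log(k) for k in freq]
    ys = [math.log(f) for f in freq.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx                     # regression slope = power-law exponent
    r = sxy / math.sqrt(sxx * syy)    # Pearson correlation of the fit
    return b, r
```

For a degree list that follows an exact power law (for example, eight nodes of
degree 1, four of degree 2, two of degree 4, and one of degree 8), the fit
recovers an exponent of -1 with |r| = 1.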

5. Questions
1. Describe the major differences in filtering procedures applied in
the 2006 Krogan et al. and Gavin et al. studies. Discuss the
merits and disadvantages of defining clusters that only allow
exclusive membership.
2. Protein interaction data from the Krogan et al. and Gavin et al.
screens are freely available from BioGRID (http://thebiogrid.org).
A comprehensive list of yeast paralogs originating from the


Fig. 5. Fitting network edge distribution to a power law using NetworkAnalyzer. This graph was outputted directly from
NetworkAnalyzer and shows a strong correlation between the degree distribution of our network and a power-law function,
suggesting it to be a scale-free network. NetworkAnalyzer fits the power-law function to degree data using the least
squares technique.

single WGD event is available here: http://genome.cshlp.org/
content/suppl/2005/09/16/gr.3672305.DC1/Byrne_Supp_Table2.xls.
Calculate the average shortest path length between paralogs in the Krogan
et al. and Gavin et al. studies. What do these differences tell you regarding
the importance of the data filtering technique for evolutionary analysis?
3. Assuming that comprehensive protein interaction data were to become
available for multiple pre- and post-WGD yeast species, describe how you would
approach determining how protein complexes may have evolved.

Acknowledgments
AE and ZZ acknowledge a Team Grant from the Canadian Institutes of Health Research (CIHR MOP#82940).
References
1. Krogan NJ, G Cagney, et al. (2006). Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 440(7084): 637-643.
2. Gavin AC, P Aloy, et al. (2006). Proteome survey reveals modularity of the yeast cell machinery. Nature 440(7084): 631-636.
3. Fraser HB and JB Plotkin (2007). Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol 8(11): R252.
4. Goehler H, M Lalowski, et al. (2004). A protein interaction network links GIT1, an enhancer of huntingtin aggregation, to Huntington's disease. Mol Cell 15(6): 853-865.
5. Taylor IW, R Linding, et al. (2009). Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 27(2): 199-204.
6. Jeong H, SP Mason, et al. (2001). Lethality and centrality in protein networks. Nature 411(6833): 41-42.
7. Rain JC, L Selig, et al. (2001). The protein-protein interaction map of Helicobacter pylori. Nature 409(6817): 211-215.
8. Milgram S (1967). The small world problem. Psychology Today 2: 60-67.
9. Amaral LA, A Scala, et al. (2000). Classes of small-world networks. Proc Natl Acad Sci USA 97(21): 11149-11152.
10. van Noort V, B Snel, et al. (2004). The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Rep 5(3): 280-284.
11. Khanin R and E Wit (2006). How scale-free are biological networks. J Comput Biol 13(3): 810-818.
12. Tanaka R, TM Yi, et al. (2005). Some protein interaction data do not exhibit power law statistics. FEBS Lett 579(23): 5140-5144.
13. Han JD, D Dupuy, et al. (2005). Effect of sampling on topology predictions of protein-protein interaction networks. Nat Biotechnol 23(7): 839-844.
14. Uetz P, L Giot, et al. (2000). A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403(6770): 623-627.
15. Han JD, N Bertin, et al. (2004). Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature 430(6995): 88-93.
16. Fraser HB (2005). Modularity and evolutionary constraint on proteins. Nat Genet 37(4): 351-352.
17. Batada NN, T Reguly, et al. (2006). Stratus not altocumulus: a new view of the yeast protein interaction network. PLoS Biol 4(10): e317.
18. Bertin N, N Simonis, et al. (2007). Confirmation of organized modularity in the yeast interactome. PLoS Biol 5(6): e153.
19. Batada NN, T Reguly, et al. (2007). Still stratus not altocumulus: further evidence against the date/party hub distinction. PLoS Biol 5(6): e154.
20. Agarwal S, CM Deane, et al. (2010). Revisiting date and party hubs: novel approaches to role assignment in protein interaction networks. PLoS Comput Biol 6(6): e1000817.
21. Hakes L, DL Robertson, et al. (2005). Effect of dataset selection on the topological interpretation of protein interaction networks. BMC Genomics 6: 131.
22. Musso GA, Z Zhang, et al. (2007). Experimental and computational procedures for the assessment of protein complexes on a genome-wide scale. Chem Rev 107(8): 3585-3600.
23. Sanderson CM (2009). The Cartographer's toolbox: building bigger and better human protein interaction networks. Brief Funct Genomic Proteomic 8(1): 1-11.
24. Cagney G (2009). Interaction networks: lessons from large-scale studies in yeast. Proteomics 9(20): 4799-4811.
25. Yu H, P Braun, et al. (2008). High-quality binary protein interaction map of the yeast interactome network. Science 322(5898): 104-110.
26. Alberts B (1998). The cell as a collection of protein machines: preparing the next generation of molecular biologists. Cell 92(3): 291-294.
27. Hartwell LH, JJ Hopfield, et al. (1999). From molecular to modular cell biology. Nature 402(6761 Suppl): C47-52.
28. Pereira-Leal JB, ED Levy, et al. (2006). The origins and evolution of functional modules: lessons from protein complexes. Philos Trans R Soc Lond B Biol Sci 361(1467): 507-517.
29. Brohee S and J van Helden (2006). Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7: 488.
30. Vlasblom J and SJ Wodak (2009). Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC Bioinformatics 10: 99.
31. Mewes HW, D Frishman, et al. (2002). MIPS: a database for genomes and protein sequences. Nucleic Acids Res 30(1): 31-34.
32. Sharan R, I Ulitsky, et al. (2007). Network-based prediction of protein function. Mol Syst Biol 3: 88.
33. Schwikowski B, P Uetz, et al. (2000). A network of protein-protein interactions in yeast. Nat Biotechnol 18(12): 1257-1261.
34. Chua HN, WK Sung, et al. (2006). Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics 22(13): 1623-1630.
35. Deng M, Z Tu, et al. (2004). Mapping Gene Ontology to proteins based on protein-protein interaction data. Bioinformatics 20(6): 895-902.
36. Letovsky S and S Kasif (2003). Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics 19(Suppl 1): i197-204.
37. Tsai CJ, B Ma, et al. (2009). Protein-protein interaction networks: how can a hub protein bind so many different partners? Trends Biochem Sci 34(12): 594-600.
38. Rual JF, K Venkatesan, et al. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature 437(7062): 1173-1178.
39. Arifuzzaman M, M Maeda, et al. (2006). Large-scale identification of protein-protein interaction of Escherichia coli K-12. Genome Res 16(5): 686-691.
40. Liang Z, M Xu, et al. (2006). Comparison of protein interaction networks reveals species conservation and divergence. BMC Bioinformatics 7: 457.
41. Koyuturk M, W Szpankowski, et al. (2007). Assessing significance of connectivity and conservation in protein interaction networks. J Comput Biol 14(6): 747-764.
42. Srinivasan BS, NH Shah, et al. (2007). Current progress in network research: toward reference networks for key model organisms. Brief Bioinform 8(5): 318-332.
43. Xia K, Z Fu, et al. (2008). Impacts of protein-protein interaction domains on organism and network complexity. Genome Res 18(9): 1500-1508.
44. Wagner A (2001). The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol 18(7): 1283-1292.
45. Musso G, Z Zhang, et al. (2007). Retention of protein complex membership by ancient duplicated gene products in budding yeast. Trends Genet 23(6): 266-269.
46. Guan Y, MJ Dunham, et al. (2007). Functional analysis of gene duplications in Saccharomyces cerevisiae. Genetics 175(2): 933-943.
47. Wapinski I, A Pfeffer, et al. (2007). Natural history and evolutionary principles of gene duplication in fungi. Nature 449(7158): 54-61.
48. Gu Z, LM Steinmetz, et al. (2003). Role of duplicate genes in genetic robustness against null mutations. Nature 421(6918): 63-66.
49. Kafri R, O Dahan, et al. (2008). Preferential protection of protein interaction network hubs in yeast: evolved functionality of genetic redundancy. Proc Natl Acad Sci USA 105(4): 1243-1248.
50. Conant GC and KH Wolfe (2008). Turning a hobby into a job: how duplicated genes find new functions. Nat Rev Genet 9(12): 938-950.
51. Davis JC and DA Petrov (2005). Do disparate mechanisms of duplication add similar genes to the genome? Trends Genet 21(10): 548-551.
52. Veitia RA (2002). Exploring the etiology of haploinsufficiency. Bioessays 24(2): 175-184.
53. Papp B, C Pal, et al. (2003). Dosage sensitivity and the evolution of gene families in yeast. Nature 424(6945): 194-197.
54. Li B, J Vilardell, et al. (1996). An RNA structure involved in feedback regulation of splicing and of translation is critical for biological fitness. Proc Natl Acad Sci USA 93(4): 1596-1600.
55. Pereira-Leal JB and SA Teichmann (2005). Novel specificities emerge by stepwise duplication of functional modules. Genome Res 15(4): 552-559.
56. Conant GC and K Wolfe (2006). Functional partitioning of yeast co-expression networks after genome duplication. PLoS Biol 4(4): e109.
57. Zhang Z, ZW Luo, et al. (2005). Divergence pattern of duplicate genes in protein-protein interactions follows the power law. Mol Biol Evol 22(3): 501-505.
58. Conant GC and A Wagner (2003). Asymmetric sequence divergence of duplicate genes. Genome Res 13(9): 2052-2058.
59. Wagner A (2002). Asymmetric functional divergence of duplicate genes in yeast. Mol Biol Evol 19(10): 1760-1768.
60. Bandyopadhyay S, CY Chiang, et al. (2010). A human MAP kinase interactome. Nat Methods 7(10): 801-805.
61. Assenov Y, F Ramirez, et al. (2008). Computing topological parameters of biological networks. Bioinformatics 24(2): 282-284.
62. Shannon P, A Markiel, et al. (2003). Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11): 2498-2504.
63. Cline MS, M Smoot, et al. (2007). Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2(10): 2366-2382.
64. Batagelj V (1998). Pajek: A program for large network analysis. Connections 2: 47-57.
65. Campbell NA and JB Reece (2005). Biology. San Francisco, CA: Pearson.
66. Ohno S (1970). Evolution by Gene Duplication. Berlin: Springer-Verlag.

Chapter 16
Statistical Methods in Metabolomics
Alexander Korman, Amy Oh, Alexander Raskind, and David Banks
Abstract
Metabolomics is a relatively new field in bioinformatics that uses measurements
of metabolite abundance as a tool for disease diagnosis and other medical
purposes. Although closely related to proteomics, the statistical analysis is
potentially simpler, since biochemists have significantly more domain knowledge
about metabolites. This chapter reviews the challenges that metabolomics poses
in the areas of quality control, statistical metrology, and data mining.
Key words: ALS disease, Machine learning, Mass spectrometry, Metabolomics, Premature labor,
Quality control

1. Introduction
Metabolism may be defined as the complete set of chemical reactions that take
place in a living organism. This set is divided into two major branches:
anabolism (synthesis) and catabolism (breakdown). The subjects of these
reactions are metabolites: a very diverse group of chemicals comprising all
small (nonpolymeric) molecules found in living cells. Natural metabolites may be
roughly separated into two large groups: primary metabolites, which are directly
involved in normal growth, development, and reproduction; and secondary
metabolites, which are not directly involved in these processes, but may still
play a vital role in the organism's biochemistry. Artificial food components,
drugs, and the products of their breakdown constitute a third large group, often
referred to as xenobiotics (from the Greek xenos, "stranger", and biotic,
"related to living beings"). The collection of all metabolites of a cell,
tissue, organ, or organism is called the metabolome, in analogy with the genome,
proteome, and transcriptome.
In the living system, metabolites are connected by a complex
network of enzyme-assisted reactions. Logical components of this
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_16,
# Springer Science+Business Media, LLC 2012


network, such as energy production, synthesis, and breakdown
of lipids, amino acids, nucleotides, and so on, are called biochemical
pathways. Study of metabolism is not new. But the scientific
and technological advances of the last decades have made it possible
to raise it to a qualitatively different level. It has become a branch
of systems biology, combining high-throughput analytical methods
with advanced computing and bioinformatics, and is now
called metabolomics by analogy with other omics techniques that
emerged earlier: genomics, transcriptomics, and proteomics.
The subject of metabolomics is the study of the composition and
dynamics of the metabolome, but the ultimate goal is to give
a biologically meaningful explanation of some phenomena or prediction of system behavior under certain conditions, and thus
data analysis and interpretation are critical parts of any metabolomics project.
Currently, the main applications of metabolomics in human
studies are the following.
1. Early detection of diseases and health problems, such as necrosis,
amyotrophic lateral sclerosis (ALS), pre-eclampsia, prostate cancer, and preterm
labor (1-4): However, not all diseases have a clear or unique metabolic
signature.
2. Assessment of drug toxicity: liver toxicity and other metabolic side effects
are common barriers to drug approval.
3. Understanding the physiological effect of diet strategies (such
as the Atkins, Palm Beach, or rice diets): There is concern that
some diets, when adopted for a long period of time, may distort
normal metabolism (5).
4. Drug testing for athletes (wide-spectrum assays), employees
(narrow-spectrum assays), and for legal or forensic purposes.
5. Discovery of new biochemical pathways: Experts believe that although the
pathways shown in KEGG charts capture about 90% of the chemical mass, they show
only about 60% of the total number of pathways; cf. (6).
Nonhuman studies often address other purposes, such as
improved understanding of how organisms respond to environmental stress.
Although metabolomics shares much with other high-throughput technologies, there are several significant differences.
In spite of the enormous sequence diversity of proteins, DNA and
RNA, their chemical diversity is limited within each group, which
allows the use of uniform unbiased analysis methods, such as DNA
and protein sequencing and microarray techniques. The number of
different metabolites is much smaller. The current release of the
KEGG Ligand database contains about 14,000 compounds with
defined molecular structure (http://www.genome.jp/kegg/docs/
upd_ligand.html); the more conservative Human Metabolome
Project (http://www.hmdb.ca) lists only 7,900 compounds,
including drugs and food components. But the chemical diversity
of those few thousand compounds is much greater than that of the
millions of protein, DNA and RNA species. This makes it virtually
impossible to develop analytical methods that report with equal
sensitivity and accuracy the quantities for all metabolites in a specimen. However, the limited number of metabolites and our knowledge of their exact chemical nature allow the analyst to obtain pure
authentic standards for most of them and consequently to perform
reliable authentication. In contrast, with proteomics and transcriptomics experiments, there is often an element of ambiguity regarding highly homologous proteins and RNA. Authentic standards
also allow true quantitative analysis while transcriptomics experiments are semiquantitative at best and proteomics ones are most
often purely qualitative.
Metabolomics experiments have the same main stages as any
other large-scale analytical project: experimental design, sample
collection, sample preparation, data acquisition, and data analysis.
The purpose of experimental design is to ensure that the required
information is obtained, ensure estimability of key quantities, and
properly balance the amount and quality of this information with
effort and resources that are allocated for the experiment. Experimental design enables tight confidence intervals, eliminates confounding effects, and controls for known sources of variation.
Sample collection, preparation, and data acquisition are the phases
where systematic and nonsystematic errors actually occur, which
contribute to overall error of the experiment. These stages can be
controlled to a large extent through careful protocol design and
execution but nonetheless make the primary contribution to the
overall uncertainty (two main sources are insufficient understanding of the system and human error).
The subject of this chapter is statistical methods in metabolomics. In particular, it addresses statistical problems that occur in
designing appropriate quality control procedures, in the metrology
that leads to abundance estimates, and in the use of data mining
procedures that relate metabolic profiles to specific disease states.
Before turning to these three statistical topics, it is helpful to briefly
describe the technology that underlies most metabolomics platforms.

2. Technology
This section is intended as a brief overview of separation techniques used in
metabolomics. Readers interested in the details of the subject should refer to
other chapters in this book or to (7).

Modern metabolomics research concentrates on wide-spectrum
analyses trying to identify and quantify as many different metabolites in a tissue sample as possible, and then attempts to use a
combination of statistics and biochemical knowledge to infer the
mechanism of biological process, health or nutrition status, or other
properties.
Given the complexity of biological samples and low concentrations of individual compounds, analytical techniques with
extremely high resolution and sensitivity are required for metabolomics research. Consequently, all current metabolomic platforms
are based on different kinds of hyphenated methods, where during
a single analysis compounds are sequentially separated based on
two different physical principles. Primary separation is performed
by gas or liquid chromatography (GC or LC, respectively) or
capillary electrophoresis (CE); secondary separation, detection,
and quantification are done by mass spectrometry (MS). Naturally,
abbreviations combining primary and secondary separation
techniques are commonly used to identify analysis method; e.g.,
LC-MS is the combination of liquid chromatography with mass
spectrometry.
GC, LC, and CE separate compounds based on the combination of their physical properties and have very-high-resolution
power, so the complexity of the compound mixture entering the
mass spectrometer at any given time is much less than the complexity of the initial sample. In mass spectrometry, the molecules are
ionized (charged) and separated based on mass/charge ratio. Ionization may be mild, when apart from introducing the charge the
molecule remains intact (for example, electrospray ionization often
used in combination with LC and CE), or destructive, where the
molecule is broken into several fragments, some of which are
charged (electron-impact ionization after GC). Depending on the
ionization technique, it is possible to produce positive, negative, or
both types of ions, but since the mass spectrometer is capable of
detecting only ions of one polarity at any given time, negative- and
positive-mode mass spectrometry are distinguished. The most
common designs of mass analyzers (the part of the mass spectrometer that
actually separates different ions) are the linear quadrupole (Q), ion trap, time
of flight (TOF), and Fourier-transform ion cyclotron resonance (FT-ICR); the
resolution, sensitivity, and price of the systems increase in that order. Hybrid
mass spectrometers that have a combination of mass analyzers (Q-TOF, QQQ,
TOF-Q-TOF) are now common since they allow the use of much more sophisticated
analytical techniques and they obtain more information about the analytes.
Modern high-resolution/high-accuracy mass spectrometers (TOF, Q-TOF, FT-ICR)
allow mass determination with an accuracy of 1-5 ppm and sensitivity in the
nano- and even picomolar range of concentrations, so very complex mixtures of
compounds may be successfully resolved.

There are two main requirements on the analytical sample for
successful implementation of a hyphenated method: the compound
should be mobile (soluble in the case of LC and CE and volatile in
case of GC; CE also requires molecules to be charged in solution)
and the compound should be ionizable in order to be detected by
MS. These requirements govern the methods of sample preparation
and determine the scope of compounds that may be analyzed by
each technique. Because of the wide range of metabolite chemical
properties, no single method of analysis is sufficient to produce a
comprehensive snapshot of the metabolome; so data from the
positive- and negative-mode LC-MS and GC-MS must be
integrated into a single set. The general rule is that LC-MS works better for
polar compounds, while GC-MS is better for nonpolar ones, although many polar
nonvolatile compounds may be readily analyzed by GC-MS after chemical
modification. One of the serious drawbacks of GC is the high temperature
necessary for separation (80-350°C), which leads to breakdown of thermally
labile compounds. Another disadvantage of GC-MS is that the mass resolution and
mass accuracy of the MS is usually low, but ongoing progress in instrument
design is likely to close this gap between GC-MS and LC-MS in the near future.
It merits emphasis that quantitative information obtained by
MS reflects the concentration of ions, not the concentration of
compounds in the experiment. Since ionization efficiency of different compounds is different under the same experimental conditions
and also very sensitive to variations in those conditions, MS data do
not provide information about relative amounts of different chemical species in the sample. Absolute concentration measurements by
MS are possible only through calibration curves based on authentic
standards.
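The calibration-curve approach to absolute quantification can be sketched in a
few lines: fit signal against known standard concentrations by least squares,
then invert the fitted line to convert a sample's signal into a concentration
estimate. The function names and the assumption of a linear response are our own
illustrative simplifications.

```python
# Sketch of absolute quantification via a calibration curve built from
# authentic standards: signal = a + b * concentration.
def fit_calibration(concentrations, signals):
    """Least-squares fit of signal on concentration; returns (intercept a, slope b)."""
    n = len(concentrations)
    mx = sum(concentrations) / n
    my = sum(signals) / n
    b = sum((x - mx) * (y - my) for x, y in zip(concentrations, signals)) \
        / sum((x - mx) ** 2 for x in concentrations)
    a = my - b * mx
    return a, b

def estimate_concentration(signal, a, b):
    """Invert the calibration line to estimate a sample's concentration."""
    return (signal - a) / b
```

For example, standards at concentrations 0, 1, 2, and 4 giving signals 3, 5, 7,
and 11 yield a = 3 and b = 2, so an unknown with signal 9 is estimated at a
concentration of 3.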

3. Quality Control Issues
Good experimental design is essential to ensure quality checks on
each phase of the analysis. This requires executive commitment;
every set of samples should include multiple internal controls of
different kinds. Managers should anticipate that a significant proportion of the runs will be dedicated to quality control goals. There
is an implicit cost-benefit analysis in determining the trade-off
between resources spent on quality control and on analysis, better
equipment, curation, and so forth. Historically, most laboratories
have undervalued process control, which reduces data quality and
often increases total operational costs (8, 9).
The experimental setup includes biological sample collection,
storage, analytical sample preparation, and analysis (data acquisition) itself. Each of these steps contributes to the variance of the

final estimates. Data post-processing, although a purely mathematical
procedure, may also be an important source of variation.
The measurement process starts with sample collection.
According to multiple observations, natural biological variation is
the primary and unavoidable source of variance. That is, equivalent
cells, tissues, or biological fluids taken from different subjects of the
same species, variety, cultivar, strain, clone, etc., under equal conditions, will still not be identical. Another important source of
variance at the stage of sample collection is nonhomogeneity; for
example, tissue samples typically include different relative amounts
of several types of tissues (e.g., tumor and healthy cells in animals,
mesophyll and veins in plants). Reducing variance from this source
is almost exclusively a question of knowledge and the practical skill
of the person responsible for sample collection. In contrast, fluid
samples usually do not present nonhomogeneous variance.
Changes in metabolite concentrations due to conversion, degradation, and volatilization are a significant and frequently underestimated source of variance and bias at the stages of sample
collection, storage, and preparation for analysis. Biological interpretation of metabolomic data should rely on accurate snapshots of
concentrations reflecting true in vivo conditions. In living cell,
metabolites undergo rapid conversions, primarily assisted by
enzymes. Sample collection often disrupts the normal life cycle
and may cause sudden and significant changes in metabolite concentrations unless special precautions are in place. According to the
Arrhenius equation, the rate of chemical reactions approximately
doubles with temperature increases of 10°C, and thus it is essential
to collect and store samples at low temperature. Flash freezing in
liquid nitrogen is probably the method of choice if the sample
nature and size permit it, since it arrests both enzymatic and nonenzymatic reactions and prevents loss of volatile compounds. Deep
freezing (to -70°C or below) should be used for sample storage,
especially in the long term.
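The quoted rule of thumb (rates roughly doubling per 10°C, often called a Q10 of
2) gives a quick feel for why deep freezing matters. The sketch below is a
back-of-envelope illustration with an invented function name, not a formal
Arrhenius calculation.

```python
# Q10 sketch of the rule quoted above: a Q10 of 2 means reaction rates
# roughly double for every 10 °C increase in temperature.
def relative_rate(t_celsius, t_ref=25.0, q10=2.0):
    """Reaction rate at t_celsius relative to the rate at t_ref, assuming Q10."""
    return q10 ** ((t_celsius - t_ref) / 10.0)
```

Cooling a sample from 25°C to -70°C thus slows degradation reactions by a factor
of about 2**9.5, i.e., roughly 700-fold, under this approximation.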
Analytical sample preparation is often a multistage process that
is highly dependent on the biological sample type and the method
of analysis. Variation at this stage may result from reagent volume
errors, incomplete homogenization of the sample and/or solubilization of metabolites, and chemical instability mentioned in the
previous paragraph. Much of the chemical instability is due to
enzymatic reactions and may be reduced or eliminated by rapid
inactivation of enzymes, for example by ethanol precipitation.
Variance from homogenization and solubilization may be reduced
by refining the technical methodology, but what is more important
is that it is accurately estimated using recovery standards.
Recovery standards are a set of compounds added in known
amounts to each sample immediately before starting the preparation. Recovery standards should satisfy several requirements: they
should not occur in samples naturally, they should be chemically

stable, and they should span a wide range of chemical properties.
A set of natural metabolites labeled with a stable isotope (e.g., 13C)
may be a good example. Differences in concentrations of the
corresponding recovery standards in the final analysis provide a
benchmark that allows for correction of the sample loss during
preparation.
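The correction described above can be sketched as a simple normalization: if a
recovery standard spiked at a known amount is recovered at some fraction of that
amount, each metabolite's measured abundance in the sample is divided by that
fraction. The function and variable names are our own; real pipelines would use
several standards spanning the chemical-property range, as noted above.

```python
# Sketch of recovery-standard correction for sample loss during preparation.
def correct_for_recovery(abundances, standard_measured, standard_spiked):
    """Scale measured abundances by the recovery fraction of a spiked standard.

    abundances: {metabolite: measured abundance} for one sample
    standard_measured: amount of the recovery standard found in the final analysis
    standard_spiked: known amount of the standard added before preparation
    """
    recovery_fraction = standard_measured / standard_spiked
    return {met: value / recovery_fraction for met, value in abundances.items()}
```

For instance, if only half of the spiked standard survives preparation, all
measured abundances in that sample are doubled to estimate the pre-loss values.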
Variance in the data acquisition stage comes from injection
errors, matrix effects, carryover, instrument drift, and the high
dynamic range of metabolite concentrations. Ideally, equal volumes
of samples of approximately the same concentration should be
subjected to analysis to obtain comparable data. In case of injection
error, the actual sample volume is different from that which was
intended. The main causes for injection errors are equipment fault
or the presence of insoluble particles. Usually, this is a random
error; it can be reduced by improving the technical side of instrumentation and it can be estimated by using equal amounts of
injection standards in every sample during the last stage of sample
preparation. Estimation of the injection error allows the analyst to
correct estimates of abundance in the final analysis. The basic
requirements for injection standards are the same as for recovery
standards, leading to similar methods of data correction.
Matrix effects are generally differences in compound mobility
and/or ionization efficiency caused by other components of the
sample. They cannot be theoretically predicted and they may significantly complicate the data analysis; adjustment of the analysis
method would be required to reduce them.
Carryover errors may occur for different reasons; the result in
all cases is partial mixing of the samples during separation. It is
technically impossible to completely exclude carryover, but it may
be reduced and its significance may be estimated by including blank
runs (analyses without biological sample) between regular samples.
Instrument drift includes drift of the mass axis and a number of
electronic parameters for the mass spectrometer, and retention shifts
during the primary separation step. Mass axis drift is corrected by
periodic or continuous calibration against mass standards. Drift in
electronic parameters is corrected by periodic tuning according to
manufacturer-specified procedures. Retention drift is caused by
many factors and may be partially reduced by proper maintenance.
Using retention time locking on GC and a set of retention standards on all platforms allows correction for retention drift during
data processing. Recovery and/or injection standards may serve as
retention standards as well.
High dynamic range of concentrations leads to relatively large
errors on both lower and higher ends of the scale. Modern instruments are able to identify saturated signals and exclude them from
analysis. There are no clear rules, however, for determination of the
minimal signal. The signal-to-noise ratio is a good indicator of
quality, but determining its acceptable value is somewhat subjective
and depends upon the purpose of the experiment and the intended
depth of data analysis.
Most of the current analytical platforms are based on plate
formats, where samples are delivered from a multiwell plate. Plate
geometry may be an additional source of variation since it determines
to some extent the order in which the robotic mechanisms deposit
samples, calibrants, and chemical reagents. In some platforms, the
order in which wells are filled is random; but one should record the
time stamp at which a well is filled in order to allow estimation of
time-dependent biases, such as volatilization. It is good practice to
reserve the center well and the center wells in each plate quadrant
for a known complex calibrant (a process blank), which is identical
in composition and treatment to all other samples but does not
contain biological material. This enables direct correction across
multiple plates, with minimal noise. It also enables detection of
drift and estimation of systematic effects due to plate geometry, and
the complexity of the calibrant provides known anchors that enable
multivariate regression methods to de-bias other measurements. In
addition to process blanks, it is useful to reserve some wells for pure
blanks (or solvent blanks) which are used to estimate and correct for
carryover and to estimate background noise. These locations
should also be geometrically balanced across the plate so that
systematic measurement biases can be assessed.
The appropriate experimental design for assigning samples,
process blanks, and solvent blanks depends upon the geometry of
the plate and the number of replicates per sample. If the plate has
square geometry, then one would consider a Latin square or
Graeco-Latin square design (10). These allow the analyst to control
for two or three possible confounders, such as plate row, plate
column, and order in which the well is filled. If the plate has
rectangular geometry, then there are analogous Latin rectangle
designs (10). Depending upon the situation, it may be appropriate
to use a balanced incomplete block design or the more exotic
partially balanced incomplete block design with some specific number of associate classes (11).
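A Latin square of the kind mentioned above is easy to construct; the cyclic construction below is a minimal sketch (real designs would also randomize rows, columns, and symbols before use).

```python
def latin_square(n):
    """Cyclic n x n Latin square: entry (r, c) = (r + c) mod n.
    Each symbol appears exactly once in every row and every column,
    so treatments are balanced against both row and column effects."""
    return [[(r + c) % n for c in range(n)] for r in range(n)]

square = latin_square(4)
# every row and every column is a permutation of {0, 1, 2, 3}
```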
Besides plate geometry, experimental design issues will arise if
the tissue samples come from a research study. For example, one
common goal is to see whether two groups are different (say liver
tissue from sacrificed lab rats, some of whom received a new drug
and some of whom did not). In this case, the analyst should use
randomization for the order in which samples are run, and do
careful double blinding for all significant steps in the process.
Restricted randomization is sensible to do, but it can be hard to
explain. (With restricted randomization, not all possible randomizations are permitted; if the randomization happens to situate all or
most of the treatment group before the control group, it should
probably be excluded.) But most researchers prefer to write papers
which say that the samples were run in random order without any
footnotes. This is not a significant issue with large sample sizes, but
metabolomics research often must use relatively small sample sizes.
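The restricted randomization described above can be sketched as a rejection loop: shuffle the run order, and reject shuffles in which one group dominates the early runs. The group labels and the imbalance threshold below are illustrative choices, not from the text.

```python
import random

def restricted_order(labels, max_prefix_imbalance=0.75, rng=None):
    """Shuffle run order, rejecting randomizations in which the first
    half of the runs is dominated by one group (more than
    max_prefix_imbalance of the first half from a single group)."""
    rng = rng or random.Random()
    labels = list(labels)
    while True:
        rng.shuffle(labels)
        half = labels[: len(labels) // 2]
        frac_treated = half.count("treatment") / len(half)
        if 1 - max_prefix_imbalance <= frac_treated <= max_prefix_imbalance:
            return labels

order = restricted_order(["treatment"] * 6 + ["control"] * 6,
                         rng=random.Random(0))
```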
A second common type of analysis is time-course studies:
these look at trends over time within the same subject. For example, studies may examine metabolic changes in blood drawn at
hourly intervals after a drug is administered. It is good if one can
run all samples from the same subject on the same plate, but the
time order should be randomized intelligently. Crossover experiments also generate special structure that requires thought when
laying out the allocation of samples to wells. Often, a useful and
flexible heuristic is to run the samples in blocks, with the order of
the samples randomized within each block. The definition of a
block may vary according to the structure of the experiment.
Finding a good experimental design is not trivial and requires
specialized statistical expertise. For routine operation, it is probably
sufficient to select a good, robust design and use it for nearly all
runs. But if the problem has specific design structure (e.g., yeast
culture cultivated under crossed stress factors), then the operator
should have access to a competent statistician.
Data post-processing usually includes several steps: noise
removal, background subtraction (sometimes), signal deconvolution, and compound identification. Noise and background removal
are self-explanatory. Deconvolution is the most difficult step, and
produces most of the errors. Briefly, the purpose of deconvolution
is to separate the signals from different compounds which entered
the mass spectrometer simultaneously or with significant overlap, based
on the shape of their signals, the combination of mass values, and a
set of chemical rules. The complexity of the task is highlighted by
the fact that even the best software packages on the market often
make incorrect assignments during deconvolution. What is more
important is that deconvolution results may be inconsistent
between samples; very minor variations in the raw data can lead
to significant differences in results, since deconvolution output is
used for the ultimate identification and quantification of the metabolites. At present, extremely labor-intensive manual expert curation is an inevitable step if high-quality data are required.
A good metabolomics platform invests in quality. Strategies for
monitoring and improving quality include the following.
- Randomly assign several wells to hold the same known calibrant. The random assignment of the calibrant provides a measure of how the magnitude of the noise is affected by geometry. Some noise occurs because of periodic refill of solvents, being last in line for testing on a plate, or degraded robot fingers.

- Use multivariate CUSUM charts. Slow drifts are more likely on metabolomic platforms than abrupt changes, so CUSUM charts will signal more quickly than, say, a Shewhart chart (12).

- Test three or more aliquots of the same sample on a plate. If one tests just two, it is impossible to distinguish which is the outlier when there is substantial disagreement. With three aliquots per sample, one can flag outliers. Also, one gets a much better estimate of the variance.

- Four is better than three. With just three, it happens regularly that one of them is ruined by an assignable cause, so the fourth provides backup. Once a mature platform has been established, most of the cost is in curation. Therefore, one can cut costs by curating just three of the four samples.

- There is a trade-off between running multiple aliquots per biological sample and obtaining more biological samples. Since variation is typically larger across samples, multiple samples should be obtained if possible.

- Locate aliquot replicates randomly but with restrictions. Random well assignment prevents systematic errors that accrue from locating triplets or quadruplets in the same locations, run after run. But balanced random assignment is better because it avoids chance neighboring that may result in correlated noise. For example, one might randomize the placement subject to the constraint that there is one aliquot in each quadrant of the plate.

- Freeze and save some of the sample aliquots until after curation has been completed. Complex procedures mean that peculiar things can happen. A single reagent might be bad, or one of the internal calibrants might have been mixed incorrectly. Such mistakes can affect estimates of certain metabolites but not others.

This list is not necessarily exhaustive. The intent is to identify some of the larger issues that arise for nearly every metabolomics laboratory.
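The CUSUM idea mentioned in the list above can be illustrated with a univariate sketch (the multivariate version follows the same accumulate-and-threshold logic); the reference value k and decision limit h below are illustrative, not recommendations.

```python
def cusum(values, target, k=0.5, h=4.0):
    """One-sided upper CUSUM: accumulate excesses over target + k and
    flag the first index where the cumulative statistic exceeds h.
    Returns (alarm_index or None, list of cumulative statistics)."""
    s, stats, alarm = 0.0, [], None
    for i, x in enumerate(values):
        s = max(0.0, s + (x - target - k))
        stats.append(s)
        if alarm is None and s > h:
            alarm = i
    return alarm, stats

# A slow upward drift of one unit per run trips the chart quickly.
alarm, _ = cusum([0, 1, 2, 3, 4, 5], target=0.0)
# alarm == 3: the chart signals on the fourth run
```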
Robust platform management should rotate the focus of its QC
efforts. It must maintain regular attention everywhere, but scrutinize something in depth each day. Some QC problems show up
only with metabolites that are heavy, light, or water soluble or
contain specific chemical structures. Since one cannot afford to
track everything all the time, it is helpful to shift attention regularly.
Such management protocols should beware of false positives.
When there is intensive QC monitoring, there will be many spurious
warnings (especially with multivariate responses). There is a large
literature on such multiple testing situations. One strategy to address
the problem is to control the false discovery rate (13). A second
strategy, for a mature process, is to track the number of flags one sees
through a process control chart (12). If the number of warnings
significantly exceeds the statistically calculable expected number,
this is evidence that the measurement process is not in control.
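Controlling the false discovery rate, the first strategy above, is commonly done with the Benjamini-Hochberg step-up procedure; a minimal self-contained version (the p-values below are invented) is:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure: reject the hypotheses with
    the k smallest p-values, where k is the largest rank such that
    p_(k) <= k * q / m.  Returns a list of booleans (reject?)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

flags = benjamini_hochberg([0.001, 0.02, 0.04, 0.3, 0.9], q=0.05)
# flags == [True, True, False, False, False]
```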


Quality control methods are critical for accurate metabolomics
platforms. The appropriate suite of tools varies according to the
operational context; research platforms have different needs than
commercial platforms. But the core strategies are well established.
The major challenge for quality management in metabolomics is
that the measurement process requires simultaneous review and
control of many more standards than usually occur in the
multivariate process control literature (14).

4. Abundance Estimation
The primary purpose of metabolomics is abundance estimation.
The main steps for achieving this are locating the ion peaks in the
bivariate histogram, integrating the peaks to estimate total ion
counts, and then apportioning those counts to different metabolites.
A given metabolite compound usually has several distinct
fragmentation patterns, depending upon randomness in the ionization step. Sometimes, the molecule breaks at one bond, and
sometimes at another; but usually, breaks occur in a small number
of different ways. Therefore, the abundance signal is typically
distributed across multiple peaks in the two-dimensional histogram in which ion counts are plotted against elution time and
the mass/charge ratio. In general, the analyst knows the probability of each of the major fragmentation patterns and the location at
which the peaks should occur. But it must be borne in mind that
different metabolites may have some ion fragments in common, so
certain peaks are combinations of signal from several different
metabolites.
The first step is to locate the peaks. For nearly all the main
metabolites, the presumptive locations are known. This information is available from corporate or public libraries of fragmentation
outcomes. A prominent one is the NIST/EPA/NIH Mass Spectral
Library (15), initially developed by Steve Stein at the National
Institute of Standards and Technology. So, in principle, one
knows exactly where the peaks for each metabolite should appear
(this is importantly different from the case with proteomics).
However, platforms tend to drift, despite regular recalibration
and quality control. Thus, a particular run might have the peaks
slightly shifted, independently in both the elution time axis and the
mass-to-charge ratio axis. The amount of that shift may not be
constant across the entire range of the instrument; for example,
lightweight ions may be shifted a bit more than heavy ions. Also,
the amount of shift may be affected by the abundance; a dense
cloud of charged ions has internal electrodynamics that affects the
TOF measurements differently from a less dense cloud.


Since it is not possible that the two axis shifts could physically
interact, one can decompose the peak location problem into separate problems. The solution requires estimation of two warping
functions, f1(x) and f2(y), which fit the amount of shift at a given
location on each of the two axes (16). These functions must be
monotonic; if there is no shift, they perfectly prescribe the lines
f1(x) = x and f2(y) = y. If a warping function dips below that ideal
line, then the measurement axis is compressed at that location; if it
is above the line, then the measurement axis is stretched.
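A one-axis warping function of this kind can be sketched from matched peak pairs (observed versus library positions); this is our own minimal piecewise-linear illustration, and a real implementation would enforce monotonicity and smoothness more carefully.

```python
import numpy as np

def estimate_warp(observed, reference):
    """Given matched peak positions (observed vs. library reference),
    return a piecewise-linear warping function f mapping observed
    positions to reference positions.  Positions are sorted so that
    np.interp yields a monotone map when both sequences increase."""
    order = np.argsort(observed)
    obs = np.asarray(observed, dtype=float)[order]
    ref = np.asarray(reference, dtype=float)[order]
    return lambda x: np.interp(x, obs, ref)

# Toy drift: observed elution times are shifted +2 s relative to the library.
f = estimate_warp(observed=[12.0, 52.0, 102.0], reference=[10.0, 50.0, 100.0])
# f(32.0) -> 30.0: the warp removes the 2 s shift
```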
Few platforms or analysts explicitly calculate warping functions.
Most just use software that implements decision rules; i.e., it is
known that cholesterol produces a peak at a given location, so the
system looks for the nearest peak and declares that to be the
appropriate ion fragment of cholesterol. Although this piecemeal
approach is quick to code and avoids some technical mathematics, it
is less accurate than simultaneously warping both axes to best
accommodate all of the signals. One implication is that the curation
step takes longer and is, thus, more costly. Another is that one does
not learn as much about the performance of the platform as one
might.
Once the peak location has been identified, the second step is
to calculate the number of ions at that peak. There are two main
issues: the peak is slightly smeared, with respect to both axes,
and the peak may be an underestimate due to saturation of the
ion counter.
The smearing of the peak can be complex. Typically, the spread
in the elution time axis is greater than the spread in the mass/
charge ratio axis. However, the mass/charge ratio axis has special
structure. First, there are isotope shadows. These occur when the
chemical structure of an ion contains atoms that have distinct but
common isotopes. The instrumentation is now sensitive enough to
resolve these into distinct peaks, nearby but separated along the
mass/charge axis, but essentially simultaneous on the elution time
axis. Additionally, there are several common adducts which characteristically attach to certain ions; in this case, there will be a second
trail of isotope shadow peaks, a little further to the right on the
mass/charge ratio axis and perhaps slightly delayed on the elution
time axis (17).
As previously mentioned, undercount of the ions occurs when
the abundance is high and the ion detector becomes saturated. In
this case, there are two strategies: one can try to adjust for the
undercount or perhaps impute the count in the saturated peak from
an unsaturated isotope shadow. Ideally, a proper statistical analysis
would combine the multiple signals, but this requires some mathematics and a clear understanding of the measurement capability
function of the hybrid ion counter. In practice, the curation process
is used to address peak abundance estimation outside the dynamic
range of the instrument.


A different problem is that often there is spurious overcount for
ions with low mass/charge ratios. (This arises during the fragmentation of the metabolite into ions; for some compounds, ionization
can produce "chips", very light fragments whose pedigree is essentially impossible to determine.) Usually, one corrects for this by
subtracting out a baseline correction, which is a lower envelope
function, which finds a lower bound, or estimated true zero, for
the readings (17).
A strict lower envelope is a piecewise planar function that
connects the smallest nonzero counts at any given elution time
and mass/charge ratio. This ensures that there are no negative adjusted counts, but one typically needs to review the data to see
whether there are any downliers that should be excluded before
making the baseline correction. An example of a clear downlier is a
measured value that is below the detection threshold of the instrument, but others may be less obvious.
Since the lower envelope can be rough, some researchers prefer
to use a loess smoother with a very wide bin width to estimate the
baseline correction (18). As a result, the fitted surface typically lies
below all of the peaks, but some corrected measurements become
negative. Loess fits the model
E[Y] = y(x)′x,
where
ŷ(x) = arg min_{y ∈ R^p} Σ_{i=1}^{n} w(‖x − X_i‖) (Y_i − y′X_i)²
for Y_i the ion count at location X_i and w a weight function that
governs the influence of the ith datum according to the (possibly
Mahalanobis, cf. (19)) distance of X_i from x.
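A weighted local linear fit of this kind can be written directly in NumPy; this one-dimensional sketch with tricube weights is our own illustration (a baseline estimate would evaluate it over a grid of locations with a very wide bandwidth).

```python
import numpy as np

def local_linear(x0, X, Y, bandwidth):
    """Loess-style weighted local linear fit at x0 (one-dimensional):
    tricube weights for points within `bandwidth` of x0, then a
    weighted least-squares line; returns the fitted value at x0."""
    d = np.abs(X - x0) / bandwidth
    w = np.where(d < 1, (1 - d**3) ** 3, 0.0)        # tricube weight
    A = np.vstack([np.ones_like(X), X - x0]).T       # intercept + slope
    W = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * W[:, None], Y * W, rcond=None)
    return coef[0]                                   # fitted value at x0

X = np.linspace(0, 10, 101)
Y = 2.0 + 0.5 * X                                    # noiseless line
fit = local_linear(5.0, X, Y, bandwidth=3.0)         # recovers 4.5
```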
The final step is to use the estimated peak counts to construct
an imputed estimate of metabolite abundance. Current practice
relies upon proprietary black-box software that implements a decision tree; different platforms use different trees. These trees initially
identify single peaks that correspond to ion fragments specific to
single metabolites, and use these (in conjunction with the known
probabilities of different fragmentation patterns) to estimate the
abundance of the corresponding metabolites. The entire set of
fragmentation patterns is weighted by that estimated abundance
and subtracted from the bivariate histogram. Iterative extraction
leaves a bivariate histogram whose peaks correspond to noise and
metabolites that have no unique ion fragment. At this stage, some
systems revert to curation to estimate the remaining metabolites,
and others attempt to decompose the peaks that correspond to ions
which are produced by just two different metabolites. When
there are multiple metabolites that contribute to the peak, estimation encounters the knapsack problem, which is NP hard (20).


In practice, the software resolves this by using domain knowledge
about typical ratios of certain metabolites. But using typical ratios
can mislead inference when the patient is abnormal.
A more statistically principled alternative is to fit the entire
bivariate histogram at once using a mixture model. Each of the
metabolites has a known fragmentation pattern; suppose the ith
metabolite produces fragmentation pattern hi(x, y), where x is the
elution time and y is the mass-to-charge ratio. This pattern includes
the isotope shadows, lagged adduct peaks, and even the ill-defined
spray of ion chips. Then, one finds the weights p_1, . . ., p_p that
minimize
∫∫ [ h(x, y) − Σ_{i=1}^{p} p_i h_i(x, y) ]² dx dy.
These weights are the estimated metabolite abundances that
minimize the integrated mean squared error between the observed
bivariate histogram h(x, y) and the weighted mixture of known
fragmentation patterns. (Technically, this mixture model would
simultaneously also minimize with respect to the two warping
functions and the baseline correction, but that level of detail
obscures the strategy.) The minimization problem is computer
intensive, but not unduly difficult. Bayesian regression analysis
using Markov chain Monte Carlo (21) with wavelet basis functions
(22) appears to work well in test problems.
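On a discretized grid, the mixture-weight estimation reduces to a least-squares problem. The sketch below uses random matrices as stand-in fragmentation patterns and no noise; a realistic version would add nonnegativity constraints and the warping and baseline terms discussed earlier.

```python
import numpy as np

# Hypothetical fragmentation patterns h_i, discretized and flattened;
# the columns of H are the known patterns and h is the observed
# histogram.  The least-squares weights are the abundance estimates.
rng = np.random.default_rng(0)
H = rng.random((200, 3))                 # 3 known fragmentation patterns
true_p = np.array([5.0, 2.0, 0.5])      # true metabolite abundances
h = H @ true_p                           # observed (noise-free) histogram

p_hat, *_ = np.linalg.lstsq(H, h, rcond=None)
# p_hat recovers [5.0, 2.0, 0.5] up to numerical error
```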
To show the kind of impact that these data cleaning methods
have, consider Figs. 1 and 2. Figure 1 shows the raw data provided
from mass spectrometry of a process blank sample at the Beecher
Laboratory at the University of Michigan. (But bear in mind that
these data are not as raw as a statistician would want; proprietary
commercial software in the mass spectrometer has already performed baseline correction and made adjustments that account
for the calibration standards.) This is often called "profiled" data
and it contains a lot of noise.
Figure 2 shows centroided data. These data have been
deconvolved to obtain the estimated ion abundance after deblurring of the signal (i.e., due to measurement error, ions with the
same chemical structure may appear to have slightly different m/z
ratios or elution times; centroiding is one method for summing
over all the scatter associated with a single peak). As can be seen,
Fig. 2 is much cleaner. Commonly, this step would be followed by
thresholding, in which all estimated peaks with very small counts
are removed, under the assumption that they represent noise or
minor contamination.
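A toy version of centroiding followed by thresholding can be written with the standard library alone; the tolerance and intensity cutoff below are invented for illustration.

```python
def centroid_peaks(mz_values, intensities, tol=0.01, min_intensity=5.0):
    """Group readings whose m/z values fall within `tol` of each other
    into one peak (intensity-weighted centroid), then drop peaks whose
    summed intensity is below `min_intensity` (noise thresholding)."""
    pairs = sorted(zip(mz_values, intensities))
    groups, group = [], [pairs[0]]
    for mz, inten in pairs[1:]:
        if mz - group[-1][0] <= tol:
            group.append((mz, inten))
        else:
            groups.append(group)
            group = [(mz, inten)]
    groups.append(group)
    peaks = []
    for g in groups:
        total = sum(i for _, i in g)
        if total >= min_intensity:
            centroid = sum(m * i for m, i in g) / total
            peaks.append((centroid, total))
    return peaks

# Two smeared readings near m/z 100 merge; the lone reading at 150 is noise.
peaks = centroid_peaks([100.000, 100.005, 150.0], [10.0, 10.0, 1.0])
# peaks == [(100.0025, 20.0)]
```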
The data in Figs. 1 and 2 are available from the Web site at the
Beecher Laboratory, http://mctp-ap1.path.med.umich.edu:8010/
pub. This particular process blank contains two kinds of quality
control standards. One kind is a set of amino acids in which


Fig. 1. This graph shows nearly raw m/z, elution time, and abundance data from a process blank measurement. This
profiled data has been preprocessed by commercial software that is part of the mass spectrometer. In most applications,
researchers do not have direct access to nor detailed knowledge of that software, and must rely upon the capability of the
instrument's vendor, as ratified by regular calibration.

Carbon-13 (13C) has replaced the much more common Carbon-12
(12C accounts for about 99% of natural carbon, and 13C accounts
for about 1%; the only other isotope is 14C, which is very rare and,
since it is radioactive, unstable). The second kind of standard is
three variants of common amino acids; these variants do not occur
in natural tissue samples. The reason for using 13C and amino acid
variants is to ensure that these calibration standards do not get
conflated with real biological signals. It is notable that the instrumentation is sufficiently sensitive to accurately discriminate ions
with different carbon isotopes.
This section has reviewed the statistical issues that arise in the
core metrology problem in metabolomics. However, there are two
additional problems that deserve comment: propagation of error
formulae and cross-platform comparisons. The following subsections describe these in more detail.
4.1. Propagation of Error

As previously discussed, there are many sources of variance in the
final estimate of abundance. It is helpful to develop an end-to-end
uncertainty budget that indicates where the largest sources of variation arise and the role they play in the final estimate in order that
one can improve the measurement process.


[Fig. 2: three-dimensional plot of Intensity (×10^6) against Elution Time and Mass/Charge.]
Fig. 2. This graph shows centroided data, in which a measurement error model has been used to deblur the peaks. This
concentrates the smeared signals shown in Fig. 1 into single peaks, and thus provides a much cleaner representation of
the ion abundance.

In metrology, one representation of an uncertainty budget
is an expression of the confidence interval in terms of sources of
error, e.g.,
m̂ ± e₁ ± e₂ ± e₃.
Each e term captures a different kind of uncertainty. For example, in
physical science, it can happen that one error term corresponds to
instrumentation, one to the number of terms in a Taylor series
approximation, and one to the error in the Monte Carlo integration.
In metabolomics, one wants to decompose the total error into
attributable sources, with estimates of the variance due to each. The
National Institute of Standards and Technology advocates the following approach to this problem (23):
- Build a model for the error terms.
- Do a designed experiment with replicated measurements.
- Fit a measurement equation to the data.

However, the highly multivariate nature of metabolomics data
makes this somewhat difficult.

16

Statistical Methods in Metabolomics

397

Let z be the vector of raw time-stamped ion fragment counts,
and let x be the estimated metabolite abundances. An ideal measurement equation is
g(z) = x = m + e,
where
e = e_1 + · · · + e_r,
so that there are r distinct sources of independent variation.
In practice, such perfect decomposition of the error structure is
unattainable. Currently, the best work gives crude approximations.
But the rough approximations can be adequate: they identify the
dominant sources of uncertainty, and are able to pinpoint the
aspects of the process that most repay improvement. To illustrate
the method, consider estimating the abundance xi of only the ith
metabolite. The multivariate problem is harder, but not different.
A crude measurement equation is:
g_i(z) = ln Σ_i w_i ∫∫ [ s_m(z) − b(m/z, t) ] d(m/z) dt,

where wi picks out and weights the peaks that contribute to metabolite i, sm(z) smooths the raw bivariate histogram (i.e., accumulates
ion counts from all the isotope shadows and adducts), and b(m/z, t)
is the baseline correction subtracted during denoising.
In the previous equation, one usually takes the logarithm since
the main interest is ratios of abundances. (Using ratios eliminates
the effect of dilution, which can vary from sample to sample.) The
hope is that this measurement equation creates an independent
homoscedastic error term, and that components of variance analysis
(24) can ascribe a certain portion of the error to each of the
following sources: within-subject variation, within-tissue variation,
miscalibration of standards, measurement error, and so forth.
The law of propagation of error (also known as the delta method
(25)) says that the variance in the univariate estimate xi is approximately
Var(x_i) ≈ Σ_{j=1}^{p} (∂g_i/∂z_j)² Var(z_j) + Σ_{j≠k} (∂g_i/∂z_j)(∂g_i/∂z_k) Cov(z_j, z_k).
This equation takes technical liberties to achieve a compact display.


For example, instead of referring to the counts for the jth (m/z, t)
bucket as zj, it would be more correct to index the denoised peaks
and count those.
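The propagation-of-error calculation can be checked numerically. The sketch below evaluates the first-order variance approximation in matrix form, grad′ Cov grad; the gradient and covariance values are invented, for a hypothetical g(z) = ln(z1 + z2) evaluated at z = (50, 50).

```python
import numpy as np

def delta_method_var(grad, cov):
    """First-order propagation of error: Var(g(z)) ~ grad' Cov(z) grad,
    which expands into the squared-derivative variance terms plus the
    cross-covariance terms in the formula above."""
    grad = np.asarray(grad)
    return float(grad @ cov @ grad)

# g(z) = ln(z1 + z2): gradient is (1/(z1+z2), 1/(z1+z2)) = (0.01, 0.01)
grad = np.array([0.01, 0.01])
cov = np.array([[4.0, 1.0], [1.0, 4.0]])
var = delta_method_var(grad, cov)   # 0.0001 * (4 + 1 + 1 + 4) = 0.001
```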
4.2. Cross-Platform Comparisons

Cross-platform experiments are crucial for calibration. And regular
calibration is essential for managing a metabolomics platform.
Metrologists have learned that it is not meaningful to ask which
of the two measurement processes is more accurate (instead, the
focus is on which has the smallest variance and hence the greatest
replicability). In fact, it is fundamentally impossible to decide which
laboratory or platform gives the "right" answer; one can only
estimate the differences between laboratories. In statistical language, the true value of the measurand is not identifiable (25),
but contrasts between laboratories are identifiable. Therefore,
good metabolomics platforms are ones with small variance within
their range of measurement. Calibration can then tune the output
to match that from other systems.
Such cross-platform calibration is based on key comparison
designs (26). Here, the same samples (aliquots of Grob or some
tissue) are sent to multiple labs, and each lab produces its own
estimate and a corresponding estimate of uncertainty. There are
several prominent key comparison designs. In the star design, after
a laboratory measures the sample, it is returned to the starting point
for remeasurement (to ensure that transit has not altered the sample). In the circle design, the sample is not remeasured until all of
the participating laboratories have measured it. The latter is less
expensive, but if there is contamination during the process, it is
difficult to determine where along the exchange that happened.
The Mandel bundle-of-lines model (27) is a standard method
for the analysis of the key comparison designs. Here, the measurement Xij on sample j at laboratory i is modeled as
X_ij = a_i + b_i t_j + e_ij,
where t_j is the unknown true value of the sample, a_i and b_i determine the linear calibration for lab i, and e_ij ~ N(0, σ²_ij) is the
measurement error.
Because t_j is not estimable, one must impose constraints. A
frequentist would typically require that ā = 0, b̄ = 1, and t̄ = 0.
However, many other constraints would work, and forcing the
average t to some sensible but arbitrary value v_j can be convenient.
A Bayesian would put priors on the laboratory coefficients and the
error variance. Natural priors would be a_i ~ N(0, σ²_A), b_i ~ N(1, σ²_B), and t_j ~ N(v_j, σ²_T).
A multivariate version of the Mandel bundle-of-lines model
would best serve metabolomics needs. The strategy is straightforward, but to our knowledge it has not been developed. Instead,
people do one-at-a-time calibrations. Usually, they use the same
sample but consider the measurement on each metabolite separately, ignoring known correlations among the measurements.
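A minimal bundle-of-lines fit can be sketched as follows, assuming the sample values t_j are fixed at reference values (one way of imposing the identifiability constraints discussed above); the offset 0.5 and slope 1.2 are invented. Each lab's calibration line is then an ordinary linear regression.

```python
import numpy as np

# Bundle-of-lines sketch: with t_j fixed at reference values, each
# lab's (a_i, b_i) is a linear regression of its measurements on t.
t = np.array([0.0, 1.0, 2.0, 3.0])
X_lab = 0.5 + 1.2 * t                    # a lab with offset 0.5, slope 1.2
A = np.vstack([np.ones_like(t), t]).T    # design matrix: intercept + slope
(a_i, b_i), *_ = np.linalg.lstsq(A, X_lab, rcond=None)
# a_i ~= 0.5, b_i ~= 1.2: the lab's estimated calibration line
```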

5. Disease Diagnosis
Although metabolomics may serve many purposes, a key application is the diagnosis of disease. For most situations, this entails
the Curse of Dimensionality. When the data are high dimensional,
the inference becomes less accurate and the inaccuracy
increases faster than linearly in the dimension. The problem
is particularly acute when the number of covariates (metabolites)
is larger than the number of samples (patients). This is sometimes
called the "p > n" problem. (Since metabolites occur in
pathways, it may be that abundances of different metabolites
within the same pathway are strongly correlated. If properly
modeled, this would reduce the effective number of covariates
and be another potential advantage of metabolomics over
proteomics).
In terms of the sample size n and dimension p, the Curse of
Dimensionality has three nearly equivalent descriptions:
- For fixed n, as p increases, the data become sparse.
- As p increases, the number of possible models explodes.
- For large p, most datasets are multicollinear (or concurve, which is a nonparametric generalization). That is, the datasets tend to concentrate on an affine subspace or upon a smoothly curved manifold within the data space.

For more details, see (28). Essentially, these descriptions get at
different facets of the core problem with multivariate inference:
high-dimensional space is very large, so the amount of information
that is available for fitting a classification function locally is usually
insufficient for good predictive accuracy.
In disease diagnosis, the typical situation is that one has a
training sample of tissue from what is arguably a random sample
of diseased subjects and a random sample of healthy subjects. For
the ith specimen, one records the estimated amounts of each
metabolite as a vector of measurements xi. Then, one seeks some
mathematical combination of metabolite measurements that
enables a physician to classify, with high probability, the group to
which a subject belongs. The Curse of Dimensionality implies that
one is apt to fit a model that does well on the training sample but
performs poorly with tissue from a new subject.
There are two main strategies for building classification rules:

- Geometric, which includes discriminant analysis, flexible discriminant analysis, partial least squares, and recursive partitioning: These methods tend to be based on fairly specific models.

- Algorithmic, which includes neural nets, nearest neighbor rules, support vector machines, and random forests (RFs): Many of these are ensemble methods, in which many different models are weighted in producing the final inference.

An introduction to many of these methods, from a modern
machine learning perspective, can be found in (29).
Geometric rules started with Fisher's classification of iris species
(30). For two classes with a total of n specimens, the data are
x1, . . ., xn (in Fisher's example, xi was the sepal length, sepal width,
petal length, and petal width of the ith iris). Fisher's linear
discriminant analysis assumes that the two populations have
multivariate normal distributions with common unknown covariance
matrix and different unknown mean vectors. It assigns a new
observation x to the population whose mean has the smallest
Mahalanobis distance to the observation:

    d_M(x, x̄_j) = [(x − x̄_j)′ S⁻¹ (x − x̄_j)]^(1/2),

where x̄1 is the sample mean for the training sample vectors from
the first class, x̄2 is the sample mean for the training sample vectors
from the second class, and S is the sample covariance matrix.
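The assignment rule can be sketched in a few lines of code. The following illustrative implementation (pure Python, two dimensions, hypothetical toy data; all names are ours) computes the pooled covariance S, the squared Mahalanobis distances, and the resulting class assignment:

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def pooled_covariance(class1, class2):
    """Pooled sample covariance S for two classes (2-dimensional case)."""
    m1, m2 = mean(class1), mean(class2)
    n = len(class1) + len(class2)
    S = [[0.0, 0.0], [0.0, 0.0]]
    for sample, m in ((class1, m1), (class2, m2)):
        for v in sample:
            d = [v[0] - m[0], v[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    S[i][j] += d[i] * d[j]
    return [[S[i][j] / (n - 2) for j in range(2)] for i in range(2)]

def mahalanobis_sq(x, m, S):
    """Squared Mahalanobis distance (x - m)' S^{-1} (x - m), 2-d case."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    Sinv = [[S[1][1] / det, -S[0][1] / det],
            [-S[1][0] / det, S[0][0] / det]]
    d = [x[0] - m[0], x[1] - m[1]]
    w = [Sinv[0][0] * d[0] + Sinv[0][1] * d[1],
         Sinv[1][0] * d[0] + Sinv[1][1] * d[1]]
    return d[0] * w[0] + d[1] * w[1]

def classify_lda(x, class1, class2):
    """Assign x to the class whose mean is closest in Mahalanobis distance."""
    S = pooled_covariance(class1, class2)
    d1 = mahalanobis_sq(x, mean(class1), S)
    d2 = mahalanobis_sq(x, mean(class2), S)
    return 1 if d1 < d2 else 2

c1 = [[0, 0], [1, 0], [0, 1], [1, 1]]   # toy training sample, class 1
c2 = [[4, 4], [5, 4], [4, 5], [5, 5]]   # toy training sample, class 2
print(classify_lda([0.5, 0.2], c1, c2))  # → 1
```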
To analyze the effect of noise in linear discriminant analysis,
suppose one has a fixed sample size n and assume that the covariance
matrices are known to be σ²I. Write the estimates of the means as:

    μ̂1 = μ1 + (σ/√n) ν1,
    μ̂2 = μ2 + (σ/√n) ν2.

Also, write the new observation to classify as:

    x = μ1 + σν.

Fisher's classification rule assigns population 1 if

    d_M(x, μ̂1) < d_M(x, μ̂2),

and under our assumptions, this is equivalent to:

    (x − μ̂1)′(x − μ̂1) < (x − μ̂2)′(x − μ̂2).

Writing x, μ̂1, and μ̂2 in terms of ν1, ν2, and ν shows that this
criterion is equivalent to:

    (σν − (σ/√n)ν1)′(σν − (σ/√n)ν1) < (μ1 − μ2 + σν − (σ/√n)ν2)′(μ1 − μ2 + σν − (σ/√n)ν2)

or, after further simplification,

    (μ1 − μ2 − (σ/√n)(ν1 + ν2) + 2σν)′(μ1 − μ2 + (σ/√n)(ν1 − ν2)) > 0.

As n → ∞, this criterion converges to

    2σν′(μ1 − μ2) > −‖μ1 − μ2‖²,

and thus the asymptotic probability of misclassification is
P[ν′(μ2 − μ1) > ‖μ1 − μ2‖²/2σ]. Thus, the asymptotic error rate depends
only on the signal-to-noise ratio ‖μ1 − μ2‖²/2σ.

16  Statistical Methods in Metabolomics
A. Korman et al.

Now, consider the same problem from a Curse of Dimensionality
perspective. Without using asymptotics in the sample size n,
the rule assigning population 1 can be written as

    2σν′(μ1 − μ2 + √(2/n) σν1) > −‖μ1 − μ2‖² + (2σ²/n) ν1′ν2,

so that the probability of misclassification is

    1 − Φ[ (1/2σ) ‖μ1 − μ2‖ (1 + 2σ²p/(‖μ1 − μ2‖² n))^(−1/2) ],

where Φ is the cumulative distribution function for the standard
normal distribution. Since n does not go to infinity, the fraction p/n
does not go to zero (and for p ≫ n, as often happens in metabolomics,
it is quite large). Consequently, the probability of misclassifying the
observation is also large (31).
A common strategy to improve classification accuracy is variable selection. Here, one selects a relatively small number of covariates from among the many metabolites that are measured and
builds the classification function from those. Too often, people use
stepwise discriminant analysis (the classification analogue of stepwise regression (32)) to do variable selection. It performs poorly.
Better alternatives include the lasso (33), the elastic net (34), and
newer, more exotic methods, such as the Dantzig selector (35).
However, none of these methods is directly tuned to the types of
pathway-driven effects one expects in metabolomic data. If a classification or regression response depends upon all of a set of steps in a
pathway working properly, then different classification models are
needed.
In our experience, as reported in two case studies described in
Subheading 6, the best classifiers for metabolomics data are either a
version of support vector machines that uses variable selection or
random forests. This is no guarantee that these methods are always
the best, but we encourage analysts to consider them when using
metabolomics for disease diagnosis.
5.1. Support Vector Machines

Support vector machines (SVMs) were invented by Vapnik (36).
SVMs use optimization methods to find surfaces that best separate
the training sample classes in high-dimensional space. Their key
innovation is to express the separating surfaces in terms of a vastly
expanded set of basis functions, instead of just a linear combination
of the raw measurements (e.g., as in Fisher's linear discriminant
analysis).


Before describing this expanded set, we first consider the
simplest SVM. Suppose the n observations in the training sample
are {(xi, yi)}, where xi ∈ R^p and yi ∈ {−1, 1} are the labels
that indicate to which of the two categories the observation
belongs. And suppose one seeks a simple linear classification rule
of the form

    g(x) = sign(b^T x + b0),

where b is determined from the data. (Without loss of generality,
assume ‖b‖ = 1.) If the two classes are separable, then there are
many possible b that work. A natural strategy is to pick the b that
creates the biggest margin between the two classes. The margin is
twice the perpendicular distance from the closest value with label +1
to the separating hyperplane (or the sum of the two smallest
distances, one from each class).
Denote the margin by d. Then, the optimization problem is

    max_{b, b0, ‖b‖=1} d    subject to    yi(b^T xi + b0) ≥ d,    i = 1, . . ., n.

One can rewrite this as a convex optimization problem with a
quadratic criterion and linear constraints:

    min ‖b‖    subject to    yi(b^T xi + b0) ≥ 1,    i = 1, . . ., n,

where the requirement that the elements of b have unit norm is
dropped. It turns out that d = ‖b‖⁻¹. Note that this solution is not
equivalent to the one obtained from linear discriminant analysis.
Also, the solution depends only upon the three closest points in the
two classes.
To solve the rewritten problem, we can use a Lagrangian
multiplier:

    L = (1/2)‖b‖² − Σᵢ₌₁ⁿ λi [yi(xi′b + b0) − 1].

The goal is to minimize L with respect to b and b0 while
simultaneously requiring λi ≥ 0 and that the derivatives of L with
respect to the λi vanish. The Lagrangian formulation has two
advantages:
- The original constraints are replaced by constraints on the
  Lagrange multipliers, which are easier to handle.
- The training data only appear once, as dot products in a sum,
  which allows generalization to nonlinear machines.


The dual problem of the Lagrangian minimization is to
maximize L subject to:

    ∂L/∂bᵢ = 0    for all i,
    ∂L/∂b0 = 0,
    λi ≥ 0    for all i.

This is called the Wolfe dual. The zero-gradient requirement
generates equality constraints, leading to

    L_D = Σᵢ₌₁ⁿ λi − (1/2) Σ_{i,j} λi λj yi yj xi′xj,

which is maximized under the constraints that Σᵢ₌₁ⁿ λi yi = 0 and
λi ≥ 0 for all i. Training the SVM amounts to solving this
optimization problem.
Note that for each observation there is a Lagrange multiplier λi.
Those observations with λi > 0 are the support vectors and
determine the margin. For all the other observations, the λi are
zero. If the problem were reworked using only the support vectors'
data, the same solution would be found.
In practice, it usually happens that the two classes are not
separable. In that case, one wants to find the hyperplane that
minimizes the sum of the perpendicular distances for the data that
violate the rule. This leads to a slightly more advanced optimization
problem that includes slack variables. This can also be solved using
Lagrange multipliers. Cortes and Vapnik (37) developed SVMs to
extend this optimization strategy.
The mathematics in the SVM literature can be serious, but the
key idea is to find linear combinations of basis functions in Rp that
describe good separating surfaces. In real problems, one wants
more flexible separating surfaces than hyperplanes. Boser et al.
(38) showed how to create appropriate sets of surfaces in the
SVM context, drawing on (39). Their main innovation was to express
the surfaces in terms of a vastly expanded set of basis functions.
SVMs map the problem to a higher dimensional space, and then
solve the linear separation problem using nonlinear basis elements.
One implication is that if many of the covariates are not relevant,
there is considerable danger of overfit; the SVM will do well on the
training sample but poorly in prediction.
Consider the problem of building a classification rule for just
two types of objects, where each object has measurements in Rp .
One can find separating hyperplanes or quadratic surfaces, as in
linear or quadratic discriminant analysis, or one can fit more complex separating surfaces. In principle, one would like to be able to fit
very curvy surfaces, when the data warrant, that can separate complex structure in the two classes.

The SVM strategy is to greatly expand the dimension of the


input space beyond p, and then find hyperplanes in that expanded
space that classify the training sample. Next, one maps back down
to the low-dimensional space, and, in general, the linear separating
rules in the high-dimensional space become nonlinear rules in the
p-dimensional space.
Suppose one expands the set of inputs to include additional
functions of the inputs, say h(xi) = (h1(xi), . . ., hq(xi))^T. One can
show that the separating surface has the form

    g(x) = Σᵢ₌₁ⁿ βi <h(x), h(xi)> + β0,

where <,> denotes an inner product. Determining the functions


in h(x) turns out to be equivalent to selecting a basis for a particular
subspace. This depends upon a kernel function.
A kernel function is a positive semidefinite function

    K(x, x*) = <h(x), h(x*)> = Σⱼ₌₁^q hj(x) hj(x*).

This is related to reproducing kernel Hilbert spaces. Three
common choices for kernel functions in SVM applications are:
- K(x, x*) = exp(−‖x − x*‖²/c), known as the radial basis
- K(x, x*) = tanh(a1 <x, x*> + a2), the neural network basis
- K(x, x*) = (1 + <x, x*>)^r, the rth-degree polynomials
To see how this works, suppose p = 2 and use the rth-degree
polynomial basis with r = 2. Then,

    K(x, x*) = (1 + <x, x*>)²
             = (1 + x1 x1* + x2 x2*)²
             = 1 + 2x1 x1* + 2x2 x2* + (x1 x1*)² + (x2 x2*)² + 2x1 x2 x1* x2*.

Thus, q = 6, and with a little algebra one can show that

    h1(x) = 1,         h2(x) = √2 x1,
    h3(x) = √2 x2,     h4(x) = x1²,
    h5(x) = x2²,       h6(x) = √2 x1 x2.

So the SVM is looking for quadratic discriminant rules, and the
programming problem finds the best quadratic surface (in terms of
maximizing the margin in R^p) that separates the classes.
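The expansion above can be verified numerically: the kernel value K(x, x*) should equal the ordinary inner product of the six basis functions evaluated at x and x*. A small sketch with arbitrary test points:

```python
import math

def poly_kernel(x, z):
    """Degree-2 polynomial kernel K(x, z) = (1 + <x, z>)^2 for p = 2."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    """The q = 6 basis functions whose inner product reproduces K."""
    r2 = math.sqrt(2)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

x, z = [1.0, 2.0], [0.5, -1.0]
lhs = poly_kernel(x, z)                         # (1 + 0.5 - 2)^2 = 0.25
rhs = sum(a * b for a, b in zip(h(x), h(z)))    # same value via the basis
print(lhs, rhs)  # → 0.25 0.25 (up to rounding)
```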
SVMs do not automatically avoid the Curse of Dimensionality.
For example, as p gets large, the 2nd-degree polynomial basis
becomes very much like quadratic discriminant analysis, and
this is well known to suffer when p is large. For this reason, we
recommend a variant of SVMs that does both variable selection and
slightly modifies the misclassification penalty.
5.2. Random Forests

Random forests (40) are based upon classification and regression
trees (CARTs) or similar recursive partitioning procedures
(41–43). A classification tree starts with a training sample of n
cases with known categories. Case i has a vector of covariates xi,
and those are used to build a tree-structured classification rule. This
kind of recursive partitioning is one of the most popular data
mining tools, in large part because the tree-structured decision
rule is easy to represent and often easy to interpret.
Formally, recursive partitioning splits the training sample into
increasingly homogeneous groups, thus inducing a partition of the
space of explanatory variables R^p. At each step, the algorithm
considers three possible kinds of splits using the vector of
explanatory values x:
1. Is xi ≤ t? (univariate split)
2. Is Σᵢ₌₁^p wi xi ≤ t? (linear combination split)
3. Is xi ∈ S? (categorical split, used if xi is a categorical variable)
The algorithm searches over all possible values of t, all coefficients
{wi}, and all possible subsets S of the category values to find
the split that best separates the cases in the training sample into two
groups with maximum increase in overall homogeneity.
Different partitioning algorithms use different methods for
assessing improvement in homogeneity. Classical methods seek to
minimize Gini's index of diversity or use a twoing rule. Hybrid
methods can switch criteria as they move down the decision tree.
Similarly, some methods seek to find the greatest improvement on
both sides of the split, whereas other methods choose the split that
achieves maximum homogeneity on one side or the other. Some
methods grow elaborate trees, and then prune back to improve
predictive accuracy outside the training sample. (This is a partial
response to the kinds of overfit concerns that arise from the Curse of
Dimensionality.) Ultimately, the process produces a decision tree.
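A minimal sketch of the univariate split search, scoring each candidate threshold by the weighted Gini index of the two resulting groups. The toy data loosely echo the blood-pressure split in Fig. 3; all names and values here are illustrative:

```python
def gini(labels):
    """Gini's index of diversity: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_univariate_split(X, y, j):
    """Search thresholds t for the split 'is x_j <= t?' that minimizes
    the weighted Gini index of the two resulting groups."""
    n = len(y)
    best_t, best_score = None, float("inf")
    for t in sorted(set(row[j] for row in X)):
        left = [yi for row, yi in zip(X, y) if row[j] <= t]
        right = [yi for row, yi in zip(X, y) if row[j] > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Toy data: systolic blood pressure vs. a hypothetical risk label
X = [[90], [85], [120], [130], [88], [140]]
y = ["high", "high", "low", "low", "high", "low"]
print(best_univariate_split(X, y, 0))  # → (90, 0.0): a pure split
```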
The following figure shows a famous application (41) that
assesses whether or not an emergency room patient is at risk for a
second heart attack, based on a large number of medical measurements taken after admission. Moving down the tree, a patient
whose minimum systolic blood pressure is less than 91 is classified
as high risk (G); otherwise, the tree asks whether the patient is older
than 62.5 years. If not, then the patient is at low risk (F); if so, then
the tree asks about sinus tachycardia. If there is no tachycardia, the
patient is at low risk; see Fig. 3.
A random forest is a collection of identically distributed trees.
Each tree is constructed by applying some tree classification algorithm, such as CART, to a bootstrap sample from the training data.


Fig. 3. CART tree used to classify patients with respect to their risk of a heart attack. It is
based upon an example in (46).

(A bootstrap sample is taken by making n draws from the training
sample, with replacement. Thus, some members of the training
sample are selected multiple times, and some not at all. About
one-third of the training sample is not used in the construction of
any specific tree. See (44) for more details on bootstrapping.)
For each bootstrap sample, a classification tree is formed, and
there is no pruning: the tree grows until all terminal nodes are
pure. After the trees are grown, one drops a new case down each of
the trees. The classification that receives the majority vote is the
category that is assigned (40).
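The bootstrap-and-vote construction can be sketched directly. Note how roughly a third of the cases fall out of bag, matching the (1 − 1/n)^n → e⁻¹ ≈ 0.368 behavior of sampling with replacement; the seed and sample size below are arbitrary:

```python
import random

def bootstrap_sample(n, rng):
    """Draw n case indices with replacement; the cases never drawn
    form the out-of-bag set for this tree."""
    chosen = [rng.randrange(n) for _ in range(n)]
    out_of_bag = set(range(n)) - set(chosen)
    return chosen, out_of_bag

def majority_vote(votes):
    """Category receiving the most votes across the forest's trees."""
    return max(set(votes), key=votes.count)

rng = random.Random(0)
chosen, oob = bootstrap_sample(1000, rng)
print(len(oob) / 1000)  # roughly one-third, near e^-1 ~ 0.368
```

Each out-of-bag case can be classified by the trees that never saw it, which is what makes the out-of-bag error an honest estimate of predictive accuracy.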
The main points about the random forest method are the
following:
- Random forests are very good classifiers; empirical comparisons
  show that they are fully competitive with SVMs.
- The method generates an internal unbiased estimate of the
  predictive error (the out-of-bag estimate).
- It handles missing data very well, and can maintain high levels
  of accuracy when up to 80% of the data are missing at random.
- It provides estimates of the relative importance of each of the
  covariates in the classification rule.
- It computes proximities between pairs of cases that can be used
  in clustering, identifying outliers, and multidimensional scaling.
- It can rebalance the weights when the category proportions in
  the training data do not reflect the category proportions in the
  true population.
- Its logic seems well suited to metabolomics; recursive
  partitioning is a natural way to deal with data generated in
  pathways.

In the following case studies, random forests were consistently
superior to SVMs and other classification methods that we
considered. (A comparative study of machine learning classifiers in
the context of chemometrics is given in (45), but regrettably that
study did not include either SVMs or random forests.)

6. Case Studies
The following two metabolomics case studies make use of data
mining techniques. Also, they illustrate the inferential methods
discussed in Subheading 5.
6.1. Classifying ALS Patients

Amyotrophic lateral sclerosis (ALS) is also known as Lou
Gehrig's disease. It affects the portion of the central nervous system
that controls voluntary muscle movement. In an early metabolomics
study (46), blood was drawn from 32 healthy subjects and 31
ALS patients. Of the ALS patients, 9 were taking a new experimental
medication, and 22 were not taking medication. The goal of the
study was to use metabolomic profiles to discriminate between the
two ALS groups and the healthy group.
The study used measurements on 317 metabolites obtained
from gas chromatography followed by mass spectrometry. Obviously, this omits many metabolites; some could not be separated by
gas chromatography, some were below the detection limit, and
some were aliased (e.g., ionized fragments of sugars are essentially
the same, so it is not possible to directly determine how much of
each kind of sugar is present in the sample).
The analysis tried many different classifiers, and the best results
came from random forests. Among the training sample, all 9 ALS
patients taking the new drug were correctly classified, 20 more ALS
patients were correctly classified, and 29 of the healthy patients were
correctly classified. More usefully, the overall out-of-bag error rate
was 7.94%. (Recall that the out-of-bag error rate is an unbiased
estimate of predictive accuracy.) The random forest rule was calculated with the Breiman–Cutler code for random forests.


As discussed in the previous section, random forests allow one


to estimate the importance of each variable. This identified 20 of the
metabolites as important in the classification, of which 3 were clearly
dominant. Domain experts concurred that those three metabolites
were biochemically sensible in terms of their plausible connection to
ALS. (We regret that we cannot identify the metabolites specifically;
that information was held confidential by the researchers.)
Random forests can also detect outliers using proximity scores.
Proximity scores are calculated by dropping observations down each
tree (after the trees are slightly trimmed). If two observations reach
the same terminal node, this increments their proximity score. The
final step in the procedure is to normalize by dividing by the number of
trees. If an observation has low proximity with all other observations, then it may be considered an outlier. In this example, four
outliers were identified, but all had high proximity to each other.
These outliers came from the (nondrug) ALS patients, and the
domain experts speculate that ALS may have a subcluster in which
the disease manifests slightly differently (i.e., there are two diseases
with similar symptoms and possibly related mechanisms).
We also applied a number of different kinds of SVMs to this
data: the linear SVM, polynomial SVM (of degree 3), Gaussian
Kernel SVM, L1 SVM (47), and SCAD SVM (48). Both the L1
and SCAD SVMs do variable selection within the SVM framework.
Among all these methods, the SCAD SVM had the best performance. Its estimated predictive error was 14.3%, and it chose 18
metabolites as important (the three dominant metabolites from the
random forests study were among these).
The SCAD SVM is a modification of the L1 SVM. The L1 SVM
mimics the automatic variable selection of the lasso (33) by solving
the programming problem

    min_{β, β0} Σᵢ₌₁ⁿ [1 − yi(β0 + β^T xi)]₊ + λ Σⱼ₌₁^p |βj|,

where the first sum is over the observations and the second sum is
over the coefficients on the basis elements. The function [·]₊ is zero
when the argument is negative, and otherwise it equals the
argument. The L1 penalty encourages most of the coefficients to be
zero, and thus it performs variable selection.
The SCAD SVM replaces the L1 penalty with a non-convex
penalty that asymptotes to a constant. Thus, all large coefficients
tend to have nearly the same penalty, as opposed to having penalties
proportional to their absolute values. As a result, SCAD SVM
requires more computation, but it avoids overpenalizing coefficients that are large but necessary.
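To make the contrast concrete, here are the two penalties side by side. The piecewise expression is the standard SCAD form (with the customary a = 3.7); the numeric values are illustrative. For a coefficient of 10 with λ = 1, the L1 penalty is 10, while the SCAD penalty has already leveled off at its constant ceiling (a + 1)λ²/2 = 2.35:

```python
def l1_penalty(beta, lam):
    """Lasso-style penalty: grows linearly in |beta| forever."""
    return lam * abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """Standard SCAD penalty: matches L1 near zero, then bends over
    and asymptotes to the constant (a + 1) * lam^2 / 2."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return -(b * b - 2 * a * lam * b + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2

for beta in (0.5, 2.0, 10.0):
    print(beta, l1_penalty(beta, 1.0), scad_penalty(beta, 1.0))
```

Because the SCAD curve flattens, a coefficient that is large but genuinely needed is not dragged toward zero the way the L1 penalty would drag it.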
Several other analyses were performed but were not definitive.
Besides relatively standard methods of classification, we tried a
multiple tree analysis with FIRMPlus™ software from Golden Helix,
as well as visualization tools, such as parallel coordinate plots (49)
and GGobi (50). We also attempted a robust singular value
decomposition (51) that simultaneously clusters subjects and
metabolites.
6.2. Classifying Preterm Labor Outcomes

A more recent project concerned outcomes from preterm labor (4).
Dr. Roberto Romero at the National Institutes of Health wanted to
know whether metabolomic analysis of amniotic fluid samples from
women in preterm labor could classify them with respect to three
outcomes:
- The early labor subsides, and the pregnancy continues for the
  normal duration.
- There is premature birth, and the physician is able to attribute
  the cause to infection or inflammation.
- There is premature birth, and the cause is not infection or
  inflammation (i.e., unknown cause).

His initial analysis used data from 50 Peruvian women. He did


stepwise linear discriminant analysis with 73 metabolites (and
some additional covariates, such as age), and he was able to classify
the training sample with 96.3% accuracy. But since this is estimated
from the training data itself, it overestimates the predictive accuracy.
The ideal strategy to estimate predictive accuracy is to hold out
a random portion of the data, fit a model to the rest, and then use
the fitted model to predict the response values in the holdout
sample. This strategy gives a straightforward and unbiased estimate of predictive classification error. Unfortunately, this
strategy does not make full use of the data; researchers want to use
all of their sample in order to fit the best classification rule, rather
than sacrifice a large fraction (typically, about a third) for the
holdout sample. The problem is exacerbated in modern computer-intensive analyses, where many different rules are fit and
compared, requiring many different holdout samples. (If the same
holdout sample is reused, then the comparisons are not independent, and (worse) the model selection process will tend to choose a
rule that overfits the holdout sample, causing spurious optimism.)
Cross-validation (52) is a procedure that balances the need to
use data to select a model and the need to use data to assess
prediction. Specifically, the steps in K-fold cross-validation are as
follows:
1. Randomly divide the cases into K subsets of approximately
   equal size.
2. For i = 1, . . ., K, hold out portion i and fit the model from the
   rest of the data.
3. For i = 1, . . ., K, use the fitted model to predict the holdout
   sample.
4. Average the misclassification rates over the K different fits.


One repeats these steps (including the random division of the
sample!) each time a new model is assessed. The choice of the number
of folds requires judgment. If it equals the sample size n, then
cross-validation has low bias but possibly high variance, and
computation is lengthy. If it is small, say 4, then bias can be large.
A common compromise is tenfold cross-validation.
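The steps above can be sketched as follows; the fit and error_rate arguments stand in for whatever classifier is being assessed, and all names here are ours:

```python
import random

def kfold_indices(n_cases, k, rng):
    """Step 1: randomly partition case indices into k near-equal folds."""
    idx = list(range(n_cases))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, fit, error_rate, rng):
    """Steps 2-4: fit on each complement, score each held-out fold,
    and average the k misclassification rates."""
    folds = kfold_indices(len(y), k, rng)
    errors = []
    for fold in folds:
        hold = set(fold)
        train = [i for i in range(len(y)) if i not in hold]
        model = fit([X[i] for i in train], [y[i] for i in train])
        errors.append(error_rate(model,
                                 [X[i] for i in fold],
                                 [y[i] for i in fold]))
    return sum(errors) / k
```

Re-running this whole routine (fresh random division included) for every candidate model is what keeps the comparison honest.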
In Dr. Romero's initial work, the tenfold cross-validation estimate of predictive accuracy was quite poor and much less than the
96.3% obtained directly from the training sample. So he collected
more amniotic fluid samples from 168 Peruvian women in preterm
labor. The samples were run through the metabolomics platform in
random order, within a single week and given to the statisticians in
three increments. The first increment, which included samples from
55 subjects, served as training data for model selection. The second
increment, with 51 subjects, was used for parameter fitting and
confirmation of the selected model. Samples from the remaining 62
subjects comprised the third increment and were used as a pristine
holdout sample to assess predictive accuracy. (This strategy for allocating the sample among distinct inferential tasks was pioneered by
Ivakhnenko (53).) The statisticians did not know the outcomes in the
third increment, but it was known that there were nearly equal
numbers of patients in each of the three categories.
Measurements on 117 metabolites were available, as well
as information on age, health, and pregnancy history. The statisticians did a great deal of data cleaning, visualization, and outlier
assessment and attempted to cluster both cases and variables.
The case clustering was not very useful because the Curse
of Dimensionality prevented robust clustering. However, clustering of variables proved critical; the association analysis showed that
amino acids and sugars formed two groups, and thus we created
proxy variables for general amino acids and for general sugars.
We compared the previous SVM techniques using cross-validation to assess predictive accuracy based on the first two increments
of data. As before, the SCAD SVM did best. Random forests
produced even better results, and was the analysis of choice.
Other methods (boosting, nearest neighbor, flexible discriminant
analysis) did not lead to competitive predictions.
Since the ultimate choice of method was random forests and since
the random forest method has an unbiased estimate of predictive accuracy, we
decided to pool the second and third increments. This resulted in an
88.5% estimate of accuracy, which was about 0.75% lower than the
estimate obtained from just the third increment alone.
The confusion matrix shows that most of the errors arose from
classifying women who had preterm birth with inflammation as
women whose labor would subside. The diagonal counts (correct
classifications) were:

    True\Predicted     Term   Inflammation   No inflammation
    Term                 39              ·                 ·
    Inflammation          ·             32                 ·
    No inflammation       ·              ·                29

The physicians in the project reviewed the important variables


selected by random forests and their effect on the classifications.
Their interpretation was as follows.
- For those women who proceeded to regular term delivery, the
  amino acids were low and the carbohydrates (sugars) were high.
- For those who had preterm delivery and inflammation or
  infection, the carbohydrates were low and the amino acids were
  high.
- For those who had preterm delivery without inflammation or
  infection, both amino acids and carbohydrates were low.

Had we not done the initial variable clustering, we might have


missed achieving this level of accuracy and the corresponding intuitive medical interpretation. The necessary signal would have been
hidden among many amino acids and sugars, instead of collected
into two proxy variables.

7. Exercises
1. Using the profiled data shown in Fig. 1 that is available from the
Beecher Laboratory at http://mctp-ap1.path.med.umich.edu:8010/pub,
write a program to deconvolve the measurements, thereby producing cleaner data, such as that shown in
Fig. 2. Most analysts assume that the signal is blurred according
to a bivariate Gaussian distribution with mean centered at the
true value and covariance matrix given by the performance specifications for the instrument. A more thoughtful analysis might
assume a gamma distribution to model blur in the elution time
(because, for physical reasons, it is unlikely for an ion to arrive
early, but there are several mechanisms that might delay it) and an
independent univariate Gaussian distribution to model blur in
the m/z measurement.
2. Using the same data, write a program to perform baseline
correction of the profile data (for example, with Loess). In
principle, baseline correction has already been done by the
software in the mass spectrometer. But if the estimated correction is statistically significantly different from zero, this suggests
that the automatic baseline correction software is inadequate.
(Hint: To assess whether the new correction is significantly
different from zero, use the bootstrap.)


3. Go to the University of California at Irvine's Machine Learning
Repository at http://archive.ics.uci.edu/ml and access the
"Breast Cancer Wisconsin (Diagnostic)" dataset. Using random
forests and SVM methods, build two classification rules
from a randomly selected half-sample, and use those rules
from a randomly selected half-sample, and use those rules
to classify the remaining half. (Note: There are many choices
for SVM techniques, but radial basis function kernels often
work well.)
References
1. Rozen, S., Cudkowicz, M., Bogdanov, M.,
Matson, W., Kristal, B., Beecher, C., Harrison,
S., Vouros, P., Flarakos, J., Vigneau-Callahan,
K., Matson, T., Newhall, K., Beal, M. F.,
Brown, R. H. Jr., and Kaddurah-Daouk, R.
(2005) Metabolomic analysis and signatures in
motor neuron disease. Metabolomics, 1,
101–108.
2. Kenny, L., Dunn, W., Ellis, D., Myers, J.,
Baker, P., the GOPEC Consortium, and Kell,
D. (2005) Novel biomarkers for pre-eclampsia
detected using metabolomics and machine
learning. Metabolomics, 1, 227–234.
3. Murthy, A., Rajendiran, T., Poisson, L., Siddiqui, J., Lonigro, R., Alexander, D., Shuster, J.,
Beecher, C., Wei, J., Chinnaiyan, A., and Sreekumar, A. (2010) An alternative screening tool
for prostate adenocarcinoma: Biomarker discovery. MURJ, 19, 71–79.
4. Romero, R., Mazaki-Tovi, S., Vaisbuch, E.,
Kusanovic, J., Nien, J., Yoon, B., Mazor, M.,
Luo, J., Banks, D., Ryals, J., and Beecher, C.
(2010) Metabolomics in premature labor:
A novel approach to identify patients at
risk for preterm delivery. Journal of Maternal-Fetal and Neonatal Medicine, 23, 1344–1359.
5. Wishart, D. (2008) Metabolomics: Applications to food science and nutrition research.
Trends in Food Science and Technology, 19,
482–493.
6. Romero, P., Wagg, J., Green, M., Kaiser, D.,
Krummenacker, M., and Karp, P. (2004)
Computational prediction of human metabolic
pathways from the complete human genome.
Genome Biology, 6, R1–R17.
7. Dunn, W., and Ellis, D. (2005) Metabolomics:
Current analytical platforms and methodologies.
Trends in Analytical Chemistry, 24, 285–294.
8. Broadhurst, D., and Kell, D. (2007) Statistical
strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2, 171–196.
9. Baggerley, K., Morris, J., and Coombes, K.
(2004) Reproducibility of SELDI-TOF protein
patterns in serum: Comparing datasets from
different experiments. Bioinformatics, 20,
777–785.
10. Kempthorne, O. (1952) Design and Analysis
of Experiments, John Wiley & Sons, New York,
N.Y.
11. Bose, R., and Shimamoto, T. (1952) Classification and analysis of partially balanced incomplete block designs with two associate classes.
Journal of the American Statistical Association,
47, 151–184.
12. Montgomery, D. (1991) Statistical Quality
Control, Wiley, New York, N.Y.
13. Benjamini, Y., and Hochberg, Y. (1995)
Controlling the false discovery rate: A practical
and powerful approach to multiple testing.
Journal of the Royal Statistical Society, Series
B, 57, 289–300.
14. Liu, R. (1995). Control charts for multivariate
processes. Journal of the American Statistical
Association, 90, 1380–1387.
15. http://www.nist.gov/srd/nist1.cfm
16. Wang, K., and Gasser, T. (1997). Alignment of
curves by dynamic time warping. Annals of
Statistics, 25, 1251–1276.
17. Katajamaa, M., and Oresic, M. (2007) Data
processing for mass spectrometry-based metabolomics. Journal of Chromatography A,
1158, 318–328.
18. Xi, Y., and Rocke, D. (2008) Baseline correction for NMR spectroscopic metabolomics
data analysis. BMC Bioinformatics, 9, 1–10,
doi:10.1186/1471-2105-9-324.
19. Morrison, D. (1990). Multivariate Statistical
Methods, McGraw-Hill, New York, N.Y.
20. Martello, S., and Toth, P. (1990) Knapsack
Problems: Algorithms and Computer Implementation, John Wiley & Sons, New York,
N.Y.
21. Gilks, W., Richardson, S., and Spiegelhalter, D.
(1996) Markov Chain Monte Carlo in
Practice, Chapman & Hall/CRC, Boca
Raton, FL.

22. Vidakovic, B. (1999) Statistical Modeling by
Wavelets, Wiley, New York, N.Y.
23. Cameron, J. (1982) Error analysis. Encyclopedia of Statistical Sciences, vol. 2, 545–551,
Wiley, New York, N.Y.
24. Searle, S., Casella, G., and McCulloch, C.
(1992) Variance Components, Wiley, New
York, N.Y.
25. Casella, G., and Berger, R. (1990) Statistical
Inference, Duxbury Press, Belmont, CA.
26. Steele, A., Hill, K., and Douglas, R. (2002).
Data pooling and key comparison reference
values. Metrologia, 39, 269–277.
27. Milliken, G. A. and Johnson, D. E. (2000) The
Analysis of Messy Data, vol. II. Wiley.
28. Clarke, B., Fokoue, E., and Zhang, H. (2009).
Principles and Theory for Data Mining and
Machine Learning, Springer, New York, N.Y.
29. Hastie, T., Tibshirani, R., and Friedman, J.
(2009) The Elements of Statistical Learning,
Springer, New York, N.Y.
30. Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Eugenics,
7, 179188.
31. Raudys, S. and Young, D. (2004) Results in
statistical discriminant analysis: A review of
the former Soviet Union literature. Journal
of Multivariate Analysis, 89, 135.
32. Weisberg, S. (1980) Applied Linear Regression, Wiley, New York, N.Y.
33. Tibshirani, R. (1996). Regression shrinkage
and selection via the lasso. Journal of the
Royal Statistical Society, B, 58, 267288.
34. Zou, H. and Hastie, T. (2005). Regularization
and variable selection via the elastic net.
Journal of the Royal Statistical Society, B, 67,
301320.
35. Candes, E., and Tao, T. (2007). The Dantzig
selector: Statistical estimation when p is much
larger than n. Annals of Statistics, 35,
23132351.
36. Vapnik, V. (1996) The Nature of Statistical
Learning. Springer, New York, N.Y.
37. Cortes, C., and Vapnik, V. (1995), Supportvector networks, Machine Learning, 20,
273297.
38. Boser, B., Guyon, I., and Vapnik, V. (1992) A
training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, D.
Haussler, ed., pp. 144152. ACM Press, Pittsburgh, PA.
39. Aizerman, M., Braverman, E., and Rozonoer,
L. (1964) Theoretical foundations of the
potential function method in pattern recogni-

Statistical Methods in Metabolomics

413

tion learning. Automation and Remote Control, 25, 821837.


40. Breiman, L. (2001) Random forests. Machine
Learning, 45, 532.
41. Breiman, L., Friedman, J., Olshen, R., and
Stone, C. 1984) Classification and Regression
Trees. Wadsworth/Brooks Cole, Belmont,
CA.
42. Hawkins, D., Kass, G. (1982). Chapter 5:
Automatic interaction detection. In Topics in
Applied Multivariate Analysis, D. Hawkins, ed.,
pp. 269302. Cambridge University Press,
Cambridge, U.K.
43. Quinlan, J. R. (1992). C4.5 Programs for
Machine Learning, Morgan Kaufmann, San
Mateo, CA.
44. Efron, B., and Tibshirani, R. (1993). An Introduction to the Bootstrap. Chapman & Hall/
CRC, Boca Raton, FL.
45. Simmons, K., Kinney, J., Owens, A., Kleier, D.,
Bloch, K., Argentar, D., Walsh, A., and Vaidyanathan, G. (2008). Comparative study of
machine learning and chemometric tools for
analysis of in-vivo high-throughput screening
data. Journal of Chemical Information and
Modeling, 48, 16631668.
46. Truong, Y., Lin, X., Beecher, C., Cutler, A.
and Young, S. (2004) Learning a complex
dataset using random forests and support
vector machines. Proceedings fo the Tenth
ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining,
835840.
47. Bradley, P., and Mangasarian, O. (1998)
Feature selection via concave minimization
and support vector machines. International
Conference on Machine Learning 15, 8290.
48. Fan, J., and Li, R. (2001) Variable selection via
nonconcave penalized likelihood and its oracle
properties. Journal of the American Statistical
Association, 96, 13481360.
49. Wegman, E. (1990) Hyperdimensional data
analysis using parallel coordinates. Journal
of the American Statistical Association, 85,
664675.
50. http://www.ggobi.org
51. Liu, L., Hawkins, D., Ghosh, S., and Young, S.
(2003) Robust singular value decomposition
analysis of microarray data. Proceedings of the
National Academy of Sciences of the United
States of America, 100, 1316713172.
52. Stone, M. (1977) Asymptotics for and against
cross-validation. Biometrika, 64, 2935.
53. Ivahkenko, A. G. (1970). Heuristic selforganization in problems of engineering cybernetics. Automatica, 6, 207219.

Chapter 17
Introduction to the Analysis of Environmental
Sequences: Metagenomics with MEGAN
Daniel H. Huson and Suparna Mitra
Abstract
Metagenomics is the study of microbial organisms using sequencing applied directly to environmental
samples. Similarly, in metatranscriptomics and metaproteomics, the RNA and protein sequences of such
samples are studied. The analysis of these kinds of data often starts by asking the questions of "who is out there?", "what are they doing?", and "how do they compare?". In this chapter, we describe how these
computational questions can be addressed using MEGAN, the MEtaGenome ANalyzer program. We first
show how to analyze the taxonomic and functional content of a single dataset and then show how such
analyses can be performed in a comparative fashion. We demonstrate how to compare different datasets
using ecological indices and other distance measures. The discussion is conducted using a number of
published marine datasets comprising metagenomic, metatranscriptomic, metaproteomic, and 16S rRNA
data.
Key words: MEGAN, RMA-file, Taxonomic analysis, Functional analysis, Comparative metagenomics, 16S analysis, KEGG pathways, SEED subsystems

1. Introduction
In metagenomics, the aim is to understand the composition and
operation of complex microbial consortia in environmental samples through sequencing and analysis of their DNA. Similarly,
metatranscriptomics and metaproteomics target the RNA and
proteins contained in such samples. Technological advances in
next-generation sequencing methods are fueling a rapid increase
in the number and scope of environmental sequencing projects. In
consequence, there is a dramatic increase in the volume of
sequence data to be analyzed. The first three basic computational
tasks for such data are taxonomic analysis, functional analysis, and
comparative analysis. These are also known as the "who is out there?", "what are they doing?", and "how do they compare?" questions. They pose an immense conceptual and computational challenge, and there is a need for new bioinformatics tools and methods to address them.

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_17,
© Springer Science+Business Media, LLC 2012
In 2007, we published the first stand-alone analysis tool that
targets next-generation metagenomic data, called MEtaGenome
ANalyzer (MEGAN) (1). Initially, our aim was to provide a tool
for studying the taxonomic content of a single dataset. A
subsequent version of the program allowed the comparative taxonomic analysis of multiple datasets (MEGAN 2). In version 3 of the
program, we aimed at also providing a functional analysis of metagenome data, based on the GO ontology (2). Unfortunately, in our
hands the GO ontology proved ill-suited for this purpose.
In version 4 of MEGAN, the GO analyzer has been replaced by two
new functional analysis methods, one based on the SEED classification (3) and the other based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) (4).
To prepare a dataset for use with MEGAN, one must first
compare the given reads against a database of reference sequences,
for example, by performing a BLASTX search (5) against the
NCBI-NR database (6). The file of reads and the resulting
BLAST file(s) can then be directly imported into MEGAN. The
program will automatically calculate a taxonomic classification of
the reads and also, if desired, a functional classification, using
either the SEED or KEGG classification, or both. The results
can be interactively viewed and inspected. Multiple datasets can
be opened simultaneously in a single comparative document that
provides comparative views of the different classifications.
The goal of this chapter is to provide an introduction to
taxonomic and functional analysis of environmental sequences
using the new version 4 of MEGAN, which was released at the
beginning of 2011 (7). To this end, we use a number of published marine datasets as a running example. After discussing
how to get started, we illustrate how to perform a taxonomic
analysis of a single dataset, based on the NCBI taxonomy. We
then focus on how to perform a functional analysis using SEED,
and then KEGG. This is followed by a discussion of how to
compare the taxonomic and functional content of multiple datasets. While the main focus of this chapter is on the analysis
of metagenomic and metatranscriptomic data, in the final section
we briefly demonstrate that MEGAN can also be used to
analyze peptide sequences (metaproteomics) and 16S rRNA
sequences.
MEGAN is written in Java and requires a JRE version 1.5 or
newer. Installers for all major operating systems are available from
www-ab.informatik.uni-tuebingen.de/software/megan.


2. Getting Started
Throughout this chapter, we use eight published datasets from a controlled coastal ocean mesocosm study involving an induced phytoplankton bloom as a running example (8). Four are metagenomes (labeled "DNA") and four are metatranscriptomes (labeled "cDNA"). Four were sampled at the peak of the bloom (labeled "Time1") and the other four after the bloom had collapsed (labeled "Time2"). In each case we report on two replicates (labeled "Bag1" and "Bag6", respectively). Based on the mentioned labels, we use the following names for the datasets: DNA-Time1-Bag1, DNA-Time1-Bag2, DNA-Time2-Bag1, DNA-Time2-Bag2, cDNA-Time1-Bag1, cDNA-Time1-Bag2, cDNA-Time2-Bag1, and cDNA-Time2-Bag2.
2.1. BLAST Computation

Given a file of sequences, for example, obtained by sequencing an environmental sample using random shotgun sequencing (9, 10),
the first computational step is to compare the reads against one or
more reference databases using a tool such as BLAST. This is
usually the computationally most demanding step of an analysis of
metagenomic data. As a rough estimate, currently one giga-base of
sequence requires on the order of 10,000 CPU hours for a
BLASTX comparison against the NCBI-NR database. In a typical
study, the reads are compared against the NCBI-NR database using
BLASTX. However, MEGAN is not tied to any particular comparison method or database.
If one is only interested in a taxonomic analysis, but not a functional analysis, then an alternative to a BLAST-based approach is to
use a fast classifier that performs taxonomic assignment based on
compositional features such as k-mer counts (see, e.g., ref. 11, 12).
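In practice, this comparison step is usually scripted. The following Python sketch assembles such a BLASTX call (an illustration only: the file names are placeholders, and the flags shown are those of the NCBI BLAST+ `blastx` program, which must be installed separately):

```python
import os
import shutil
import subprocess

def blastx_command(reads_fasta, database, out_file, evalue=1e-3, threads=8):
    """Assemble a BLASTX call (NCBI BLAST+ syntax) for comparing reads
    against a protein reference database such as NCBI-NR."""
    return [
        "blastx",
        "-query", reads_fasta,        # FASTA file of sequencing reads
        "-db", database,              # formatted protein database, e.g. nr
        "-out", out_file,             # text output, later imported into MEGAN
        "-evalue", str(evalue),       # significance cutoff for reported hits
        "-num_threads", str(threads),
    ]

cmd = blastx_command("reads.fasta", "nr", "reads-vs-nr.blastx")
print(" ".join(cmd))
# Run only if BLAST+ and the input file are actually present:
if shutil.which("blastx") and os.path.exists("reads.fasta"):
    subprocess.run(cmd, check=True)
```

For large datasets, such a command is typically split over many compute nodes, each processing a chunk of the reads, with the resulting BLAST files imported into MEGAN together.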

2.2. MEGAN Analysis

Upon launch, MEGAN loads a full copy of the NCBI taxonomy into memory and then displays the top ranks of the taxonomy.
Once this step is completed, the user can start a new analysis by
importing a BLAST file using the "Import from BLAST" option.
MEGAN will parse the BLAST file (and the reads file, if present)
and will then automatically perform a taxonomic classification, and
if desired, also a functional classification, of the data. In a taxonomic
analysis, reads are mapped to nodes of the NCBI taxonomy, using
the LCA algorithm (1), which we describe in slightly more detail
below. The NCBI taxonomy is displayed as a tree and the size of
each node is scaled to indicate how many reads have been assigned
to the corresponding taxon.
In a SEED-based functional analysis, reads are mapped to
so-called functional roles, which in turn belong to one or more
functional subsystems (3). In MEGAN, the SEED classification is represented by a tree in a similar way to the taxonomic classification.


In a KEGG-based functional analysis, reads are mapped to so-called
KEGG orthology (KO) accession numbers, which in turn correspond to enzymes or genes that are present in different KEGG
pathways (4). The pathways are hierarchically organized in the
KEGG classification. In MEGAN, the KEGG classification is represented by a tree. For any given KEGG pathway, MEGAN provides a
visualization of the pathway in which enzymes are shaded to indicate
the number of reads assigned to them.
The results of the analyses (and also all reads and matches, if
desired) are saved by MEGAN to an RMA (read-match-archive)
file. RMA is a compressed binary format especially designed for
storing and accessing metagenomic data. The initial analysis of a
dataset by MEGAN can take a number of hours and may require up
to 8 GB of computer memory, depending on the size of the dataset.
However, once the initial analysis has been completed, opening and
working with multiple RMA files is very fast and memory efficient.
As an alternative to file-based processing, MEGAN 4 is also able to
communicate with a PostgreSQL database, running either locally
or on a server.
MEGAN has been tested on files containing millions of reads
and BLAST files of up to one terabyte in size. For a rough idea of
the program's parsing and processing speed, note that the initial
analysis of a dataset comprising 6 GB of reads and 750 GB of
BLAST matches takes less than 48 h on a standard desktop.

3. Taxonomic Analysis
Although the diversity of the microbial world is believed to be
huge, to date less than 6,000 microbial species have been named
(13), and most of these are represented by just one or a few
genes in public sequence databases. Current databases are biased
toward organisms of specific interest and were not explicitly populated to provide an unbiased representative sampling of the true
biodiversity. For this reason, at present, taxonomic analysis usually
cannot be based on high-similarity sequence matching, but rather
depends on the detection of remote homologies using more sensitive methods, such as BLASTX.
One type of approach is to use phylogenetic markers to distinguish between different species in a sample. The most widely used
marker is the SSU rRNA gene; others include RecA, EF-Tu, EF-G,
HSP70, and RNA polymerase B (RpoB) (14). A main advantage
of this type of approach is that such genes have been studied in
detail and there are large phylogenies of high quality available that
can be used to phylogenetically place reads. However, one problem
is that the universal primers used to target specific genes are not
truly universal and it can happen that only a portion of the actual
diversity is captured (15). While the use of a random shotgun
approach can overcome this problem, less than 1% of the reads in
a random shotgun dataset correspond to commonly used phylogenetic marker genes (16), and it seems wasteful that more than 99%
of the reads will remain unused (and unclassified).
A second type of method is based on analyzing the nucleotide
composition of reads. In a supervised approach (see, e.g., ref. 11,
12), the nucleotide composition of a collection of reference genomes is used to train a classifier, which is then used to place a given
set of reads into taxonomic bins. In an unsupervised approach (see,
e.g., ref. 17), reads are clustered by composition similarity and then
the resulting clusters are analyzed in an attempt to place the reads.
The approach adopted in MEGAN is to compare random
shotgun reads against the NCBI-NR database (or some other
appropriate database) to find homologous sequences, thus making
use of the fact that remote homologies are easier to detect on the
protein level. The program treats all sequence matches of high
significance as equally valid indications that the given read represents a gene that is present in the corresponding organism. In more
detail, each read is placed on the lowest common ancestor (in the
NCBI taxonomy) of all the organisms that are known to contain
the gene present in the read. So, in essence, the placement of a read
is governed by the gene content of the available reference genomes
and thus we refer to our method as the LCA gene-content approach.
An attractive feature of the LCA gene-content approach is that
it is inherently conservative and is more prone to err toward noninformative assignments of reads (to high-level nodes in the taxonomy) than toward false-positive assignments (placing reads from
one species onto the node of another species). In particular, genes
that are susceptible to horizontal gene transfer will not be assigned
to either of the participating species, if both donor and acceptor
species are represented in the reference database.
MEGAN provides a number of parameters to tune the LCA
algorithm. First, the min-score parameter allows one to set a minimum value that the bit score must attain so that a BLAST match is
considered by the LCA algorithm. Second, the top-percent parameter restricts the set of considered matches further to those whose bit
score lies within the given percentage of the highest score. Third,
the min-support parameter is used to specify the minimum number
of reads that must be assigned to a taxon before that taxon is
considered present. If the number of reads assigned to a node
does not meet the threshold, then the reads are moved up the
taxonomy until they reach a node that has the number of reads
required.
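These filtering and placement rules can be sketched in a few lines (a simplified illustration of the LCA idea, not MEGAN's actual implementation; the toy taxonomy and the parameter defaults are invented for the example, and the min-support step, which is applied across all reads afterwards, is omitted):

```python
def lca(taxa, parent):
    """Lowest common ancestor of a set of taxa in a tree given as a
    child -> parent map."""
    paths = []
    for t in taxa:
        path = [t]
        while t in parent:            # walk up to the root
            t = parent[t]
            path.append(t)
        paths.append(list(reversed(path)))
    # longest common prefix of all root-to-taxon paths
    node = paths[0][0]
    for level in zip(*paths):
        if len(set(level)) > 1:
            break
        node = level[0]
    return node

def assign_read(matches, parent, min_score=35.0, top_percent=10.0):
    """Place one read: keep matches passing min-score, then those within
    top-percent of the best bit score, and return the LCA of their taxa."""
    matches = [(taxon, s) for taxon, s in matches if s >= min_score]
    if not matches:
        return None                   # read remains unassigned
    best = max(s for _, s in matches)
    kept = {taxon for taxon, s in matches if s >= best * (1 - top_percent / 100)}
    return lca(kept, parent)

# Toy taxonomy: root -> Bacteria -> {E. coli, Salmonella}
parent = {"Bacteria": "root", "E. coli": "Bacteria", "Salmonella": "Bacteria"}
print(assign_read([("E. coli", 100.0), ("Salmonella", 95.0), ("Bacteria", 20.0)],
                  parent))   # -> Bacteria
```

Here the read matches two species with nearly equal scores, so it is conservatively placed on their common ancestor rather than on either species.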
If the program is given paired reads (i.e., pairs of reads each
sequenced from different ends of the same clone), then in its
paired-end mode MEGAN uses a modified version of the LCA
algorithm that boosts the bit score of any match for one read of the
pair that is confirmed by a match to the same reference species for
the other read, by adding an increment of 20% to the bit score.
Moreover, if one read is given a more specific assignment than the
other by the LCA algorithm, then both reads are assigned to the
more specific taxon.
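The score boost can be sketched as follows (our own simplified illustration of the rule described above; the taxa and scores are invented, and the second rule, propagating the more specific assignment to the mate, is not shown):

```python
def boost_paired_scores(matches, mate_matches, increment=0.20):
    """Raise the bit score of a match by 20% when the mate read also has a
    match to the same reference species (simplified paired-end rule)."""
    mate_species = {taxon for taxon, _ in mate_matches}
    return [(taxon, score * (1 + increment)) if taxon in mate_species
            else (taxon, score)
            for taxon, score in matches]

boosted = boost_paired_scores([("E. coli", 50.0), ("Vibrio", 40.0)],
                              [("E. coli", 45.0)])
print(boosted)   # E. coli confirmed by the mate: 50.0 -> 60.0
```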
In summary, MEGAN uses the NCBI taxonomy to bin all reads
of a given metagenome dataset. The NCBI taxonomy provides
names and accession numbers for over 670,000 taxa, including
approximately 287,000 eukaryota, 28,000 bacteria, and 62,000
viruses. The species are hierarchically classified at the levels of superkingdom, kingdom, phylum, class, order, family, genus, and species
(and some unofficial clades in between, such as groups and subspecies).
We now demonstrate how to perform a taxonomic analysis of
the marine sample DNA-Time1-Bag1 using MEGAN. The first
step is to compare the set of reads (in this case, approximately
200,000) against the NCBI-NR database using BLASTX, in this
case resulting in an 18-GB file containing approximately 30 million
high-scoring pairs (or BLAST hits). The second step is then to
process the BLAST file and reads using MEGAN to obtain an
RMA file DNA-Time1-Bag1.rma, which is about 5 GB in size, if
MEGAN is set to embed all reads and relevant BLAST hits in the
file.
MEGAN can then be used to interactively explore the dataset.
In Fig. 1, we show the assignment of reads to the NCBI taxonomy.
Each node is labeled by a taxon and the number of reads assigned to
it. The size of a node is scaled logarithmically to represent the
number of assigned reads. Optionally, the program can also display
the number of reads summarized by a node, that is, the number of
reads that are assigned to the node or to any of its descendants in
the taxonomy. The program allows one to interactively inspect the
assignment of reads to a specific node, to drill down to the individual BLAST hits that support the assignment of a read to a node, and
to export all reads (and their matches, if desired) that were assigned
to a specific part of the NCBI taxonomy. Additionally, one can
select a set of taxa and then use MEGAN to generate different
types of charts for them.

4. Functional Analysis
MEGAN 4 provides two different methods for analyzing the functional content of a dataset.
4.1. SEED Analysis with MEGAN

To perform a functional analysis using the SEED classification (3), MEGAN attempts to map each read to a SEED functional role,
using the highest scoring BLAST match to a protein sequence for which the functional role is known. The SEED classification is depicted as a rooted tree whose internal nodes represent the different subsystems and whose leaves represent the functional roles.

Fig. 1. Taxonomic analysis of 200,000 reads of a marine dataset (DNA-Time1-Bag1, (8)) by MEGAN. Different parts of the taxonomy have been expanded to different ranks. Each node is labeled by a taxon and the number of reads assigned to the taxon, or to any taxon below it in the taxonomy. The size of each node is scaled logarithmically to represent the number of assigned reads.

Note that the tree is multilabeled in the sense that different leaves
may represent the same functional role, if it occurs in different types
of subsystems. The current tree has about 13,000 nodes. Figure 2
shows a part of the SEED analysis of a marine metagenome sample.
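The best-hit rule described above can be sketched as follows (an illustration only; the accession-to-role table is hypothetical, and a real analysis derives it from the SEED data):

```python
def functional_role(matches, accession_to_role):
    """Assign a read to the functional role of its highest scoring match
    whose reference accession has a known role (best-hit rule)."""
    for accession, score in sorted(matches, key=lambda m: m[1], reverse=True):
        role = accession_to_role.get(accession)
        if role is not None:
            return role
    return None   # no informative match: the read stays unassigned

# Hypothetical accession-to-role table
roles = {"YP_001": "Mannose metabolism", "NP_002": "Flagellar motility"}
print(functional_role([("XP_999", 80.0), ("YP_001", 75.0)], roles))
# -> Mannose metabolism
```

Note that the top-scoring match is skipped here because its accession carries no known role; the assignment falls through to the best informative match.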
4.2. KEGG Pathway Analysis Using MEGAN

To perform a KEGG analysis (4), MEGAN attempts to match each read to a KEGG orthology (KO) accession number, using the best
hit to a reference sequence for which a KO accession number is
known. This information is then used to assign reads to enzymes
and pathways. The KEGG classification is represented by a rooted tree (with approximately 13,000 nodes) whose leaves represent different pathways. Each pathway can also be inspected visually, to see which reads were assigned to which enzymes.

Fig. 2. Part of a SEED-based functional analysis of 200,000 reads from a marine dataset (DNA-Time1-Bag1, (8)). Details of the Mannose Metabolism subtree of Carbohydrates are shown.
As an example, consider the citric acid cycle, which is of central
importance for cells that use oxygen as part of cellular respiration.
In Fig. 3 we show the citric acid cycle pathway. In such a drawing of
a pathway as provided by the KEGG database, different participating enzymes are represented by numbered rectangles. MEGAN
colors each such rectangle so as to indicate the number of reads
assigned to the corresponding enzyme.
All interactive features described above for the taxonomic analysis are also available for both types of functional analysis. In both
types of functional analysis, MEGAN uses so-called RefSeq accession numbers embedded in the BLAST matches to identify the
functional role or enzyme associated with the given gene.


Fig. 3. The citrate cycle KEGG pathway (4), as displayed by MEGAN. Numbered rectangles represent different enzymes
that are shaded on a scale from white (corresponding to 0 reads) to dark green (corresponding to 330 reads, for this
example) to indicate the number of reads assigned to each enzyme.

5. Comparing Datasets
Environmental samples are rarely studied in isolation and thus
the task of comparing different datasets is important. MEGAN
supports both visual and computational comparison of multiple
datasets.
5.1. Visual Comparison of Metagenomes

To facilitate the visual comparison of a collection of different datasets, MEGAN provides a comparison view that is displayed as a tree
in which each node shows the number of reads assigned to it for
each of the datasets. This can be done either as a pie chart, a bar
chart, or as a heat map. To construct such a view using MEGAN,
first the datasets must be individually opened in the program. Using a provided compare dialog, one can then set up a new comparison document containing the datasets of interest.

Fig. 4. Comparative visualization of eight marine datasets (8), displaying the bacterial part of the NCBI taxonomy down to the rank of Phylum. The number of reads assigned to a node is indicated by a logarithmically scaled bar chart. The node labeled "Chlamydiae/Verrucomicrobia group" is shown in a selected mode, in which both the number of reads assigned to the node (Ass) and summarized by the node (Sum) are listed for the eight datasets.
Figure 4 shows the taxonomic comparison of all eight marine
datasets. Here, each node in the NCBI taxonomy is shown as a bar
chart indicating the number of reads (normalized, if desired) from
each dataset that have been assigned to the node.
In a similar fashion, MEGAN supports the simultaneous analysis and comparison of the SEED functional content of multiple
metagenomes (see Fig. 5). Moreover, a comparative view of assignments to a KEGG pathway is also possible.
5.2. Computational Comparison of Metagenomes

MEGAN provides an analysis window for comparing multiple datasets. It allows one to compute a distance matrix for a collection
of datasets using a number of different ecological indices.


Fig. 5. Comparative visualization of eight marine datasets based on their functional content using SEED subsystems. Here,
MEGAN has been set to display the full subtree below the node representing Flagellar motility.

The calculation can be based on data from a taxonomic, SEED, or KEGG classification. If a set of nodes has been selected in the tree
representing the chosen classification, then the distances are
derived from the numbers of reads assigned to the selected nodes.
Otherwise, the program uses the numbers of reads assigned to all
leaves of the tree.
MEGAN supports a number of different methods for calculating a distance matrix, such as Goodall's ecological index (18), a simple version of UniFrac (19), and Euclidean distances. Such a
distance matrix can be visualized either using a split network (20)
calculated using the neighbor-net algorithm (21), or using a multidimensional scaling plot, see (22) for details. In Fig. 6, we show the
result of a comparison of the eight marine datasets based on the
taxonomic content of the datasets and computed using Goodall's index.
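As a minimal illustration of such a computation, the following sketch derives a Euclidean distance from normalized taxon counts (the profiles are invented; Goodall's index and UniFrac involve additional weighting, see refs. 18, 19, 22):

```python
import math

def normalize(counts):
    """Convert absolute read counts per taxon to relative frequencies."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def euclidean(p, q):
    """Euclidean distance between two normalized taxon profiles."""
    taxa = set(p) | set(q)
    return math.sqrt(sum((p.get(t, 0.0) - q.get(t, 0.0)) ** 2 for t in taxa))

# Toy read-count profiles for two datasets
d1 = normalize({"Proteobacteria": 900, "Bacteroidetes": 100})
d2 = normalize({"Proteobacteria": 500, "Bacteroidetes": 500})
print(round(euclidean(d1, d2), 3))   # -> 0.566
```

Normalizing first is essential, since the datasets being compared usually differ in sequencing depth.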

Fig. 6. Split network representing Goodall's index for the eight marine datasets, based on all leaves of the tree shown in Fig. 4, except for the "Not Assigned" and "No Hits" nodes.

6. Analyzing Other Types of Data
So far, our focus has been on metagenomic and metatranscriptomic data. However, it is easily possible to analyze metaproteomic data as well. We illustrate this using a set of 8,073 peptide
sequences recently published in (23). In a first analysis, one can
simply compare the sequences against the NCBI-NR database
using the BLASTP program. Because the peptides are very short,
only about 1,700 give rise to significant hits. In a more sophisticated two-stage approach described in (23), the peptide
sequences are first blasted against much longer environmental
sequences that are available from the Global Ocean Sampling
(GOS) project (24). Then the GOS sequences that are hit by
the peptide sequences are blasted against NR and the LCA
algorithm is applied to determine taxonomic assignments for
the reads.
Finally, we would like to demonstrate that MEGAN can also
be used to analyze sequencing reads obtained in an approach
targeted at 16S rRNA sequences (25). To illustrate this, we use a
set of 849 16S rRNA reads published in (23). The sequences were
compared against the Silva database (26) using BLASTN and then processed by MEGAN. All three analyses are compared
in Fig. 7.


Fig. 7. Comparative visualization of two different analyses of a set of 8,073 metaproteomic sequences (23). The data
labeled Peptides-NR-Morris2010 were obtained as a result of blasting the sequences against the NCBI-NR database. The
data labeled Peptides-GOS-CAMERA-Morris2010 were obtained in a more sophisticated two-stage approach, as described
in (23). In addition, we display the result of an analysis of 849 16S rRNA sequences, based on a BLASTN comparison
against the Silva database (26).

7. Discussion and Outlook
The main goal of MEGAN is to provide a powerful and easy-to-use
tool to explore, analyze, and compare the taxonomic and functional
content of multiple metagenome datasets. MEGAN is based on the
comparison of reads against a reference database. Unfortunately, at
present, publicly available sequence databases cover only a very
small percentage of the true microbial diversity believed to exist in
nature. While projects such as GEBA (27) and the Human Microbiome Project (28) aim at addressing this problem, progress in
sequencing new reference genomes will be slow and so the analysis
of complex environmental samples will remain very challenging.


Projects such as the Human Microbiome Project (http://www.hmpdacc.org), the Terragenome Consortium (http://www.terragenome.org), and the Earth Microbiome Project (http://www.earthmicrobiome.org) promise to generate petabases of
sequence that will pose substantial computational and conceptual
challenges.
As we continue to develop MEGAN, one of the main questions
that we are interested in is how to make it easy to compare large
numbers of metagenome datasets so that one can correlate changes
in taxonomic or functional composition with environmental parameters such as location, time of day, or disease state of the host.

8. Exercises
Download and install MEGAN from http://www-ab.informatik.uni-tuebingen.de/software/megan/welcome.html. Download four preprocessed mouse datasets (MEGAN's own RMA files) from http://www-ab2.informatik.uni-tuebingen.de/megan/rma/BookChap_data. These analyses are based on datasets described in (29). Using MEGAN, open the files.
1. Analyze the taxonomic content of mouse samples and compare
the results with the published results.
2. Analyze the functional content of mouse samples and compare
the results with the published results.
3. Compare all four mouse samples and try to identify differences
that are correlated with the different diets.

References
1. Huson DH, Auch AF, Qi J, Schuster SC (2007) MEGAN analysis of metagenomic data. Genome Res 17: 377–386.
2. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.
3. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, et al. (2005) The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 33: 5691–5702.
4. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27–30.
5. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
6. Benson D, Karsch-Mizrachi I, Lipman D, Ostell J, Wheeler D (2005) GenBank. Nucleic Acids Res 33: D34–D38.
7. Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC (2011) Integrative analysis of environmental sequences using MEGAN 4. Under revision.
8. Gilbert JA, Field D, Huang Y, Edwards R, Li W, et al. (2008) Detection of large numbers of novel sequences in the metatranscriptomes of complex marine microbial communities. PLoS One 3: e3042.

9. Mardis ER (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9: 387–402.
10. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26: 1135–45.
11. McHardy AC, Martin HG, Tsirigos A, Hugenholtz P, Rigoutsos I (2007) Accurate phylogenetic classification of variable-length DNA fragments. Nat Methods 4: 63–72.
12. Rosen GL, Reichenberger E, Rosenfeld A (2010) NBC: The naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics: Advance Access.
13. Kuever J, Rainey FA, Widdel F (2005) Bergey's Manual of Systematic Bacteriology. Springer, 1388 pp.
14. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, et al. (2004) Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 66–74.
15. Wu M, Eisen JA (2008) A simple, fast, and accurate method of phylogenomic inference. Genome Biol 9: R151.
16. von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, et al. (2007) Quantitative phylogenetic assessment of microbial communities in diverse environments. Science 315: 1126–30.
17. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, et al. (2004) Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 37–43.
18. Goodall DW (1966) A new similarity index based on probability. Biometrics 22: 882–907.
19. Lozupone C, Hamady M, Knight R (2006) UniFrac - an online tool for comparing microbial community diversity in a phylogenetic context. BMC Bioinformatics 7: 371.
20. Huson D, Bryant D (2006) Application of
phylogenetic networks in evolutionary studies.

429

Molecular Biology and Evolution 23:


254267.
21. Bryant D, Moulton V (2004) Neighbor-net:
An agglomerative method for the construction
of phylogenetic networks. Molecular Biology
and Evolution 21: 255265.
22. Mitra S, Gilbert JA, Field D, Huson DH
(2010) Comparison of multiple metagenomes
using phylogenetic networks based on ecological indices. ISME J 4: 12361242.
23. Morris RM, Nunn BL, Frazar C, Goodlett DR,
Ting YS, et al. (2010) Comparative metaproteomics reveals ocean-scale shifts in microbial
nutrient utilization and energy transduction.
ISME J 4: 673685.
24. Rusch DB, Halpern AL, Sutton G, Heidelberg
KB, Williamson S, et al. (2007) The Sorcerer II
Global Ocean Sampling expedition: northwest
Atlantic through eastern tropical Pacific. PLoS
Biol 5: e77.
25. Pace N, Stahl D, Olsen G, Lane D (1985)
Analyzing natural microbial populations by
rRNA sequences. American Society for Microbiology News 51: 412.
26. Pruesse E, Quast C, Knittel K, Fuchs B,
Ludwig W, et al. (2007) SILVA: a comprehensive online resource for quality checked
and aligned ribosomal RNA sequence data
compatible with ARB. Nuc Acids Res 35:
71887196.
27. Wu D, Hugenholtz P, Mavromatis K, Pukall R,
Dalin E, et al. (2009) A phylogeny-driven
genomic encyclopaedia of bacteria and archaea.
Nature 462: 10561060.
28. Turnbaugh PJ, Ley RE, Hamady M, Fraser-Liggett CM, Knight R, et al. (2007) The Human
Microbiome Project. Nature 449: 804810.
29. Turnbaugh PJ, Backhed F, Fulton L, Gordon
JI (2008) Diet-induced obesity is linked to
marked but reversible alterations in the mouse
distal gut microbiome. Cell Host Microbe 3:
213223.

Chapter 18
Analyzing Epigenome Data in Context of Genome
Evolution and Human Diseases
Lars Feuerbach, Konstantin Halachev, Yassen Assenov,
Fabian Müller, Christoph Bock, and Thomas Lengauer
Abstract
This chapter describes bioinformatic tools for analyzing epigenome differences between species and in
diseased versus normal cells. We illustrate the interplay of several Web-based tools in a case study of CpG
island evolution between human and mouse. Starting from a list of orthologous genes, we use the Galaxy
Web service to obtain gene coordinates for both species. These data are further analyzed in EpiGRAPH,
a Web-based tool that identifies statistically significant epigenetic differences between genome region sets.
Finally, we outline how the use of the statistical programming language R enables deeper insights into the
epigenetics of human diseases, which are difficult to obtain without writing custom scripts. In summary, our
tutorial describes how Web-based tools provide an easy entry into epigenome data analysis while also
highlighting the benefits of learning a scripting language in order to unlock the vast potential of public
epigenome datasets.
Key words: Epigenomics, Computational epigenetics, DNA methylation, CpG islands, Comparative
genomics, Galaxy, EpiGRAPH, R statistical programming language

1. Introduction
Readers who are new to the field of epigenetics may wonder why
DNA sequence alone is not sufficient to encode the information
required by a cell. To answer this question, imagine that the book
you are currently reading consisted of plain text only, without paragraphs, headlines, or any other markup. Finding specific pieces of
information would become a time-consuming task. Likewise, proteins, such as polymerases, need guidance to find gene promoters
among the billions of nucleotides in a mammalian genome. As this
cellular markup differs between cell types, an additional layer of
information is required on top of (which is one of the many
translations of the Greek word epi) the genomic DNA sequence.

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_18,
© Springer Science+Business Media, LLC 2012


This information must be heritable between cell generations of
the same type, but needs to be modified as cells differentiate.
DNA methylation and histone modification constitute the best-understood epigenetic mechanisms. Both mechanisms control the
access of soluble factors to the DNA. Histone modifications achieve
this by controlling the compaction level of the chromatin. Open
euchromatin increases the accessibility of the DNA for transcription
factors while tightly packed heterochromatin has the opposite
effect. In contrast, DNA methylation affects the chemical and steric properties of single nucleotides and, thus, increases or
decreases the binding affinity of specific proteins to them (1).
In vertebrates, DNA methylation occurs predominantly in the
form of a methyl group's covalent attachment to the 5-carbon atom
of a cytosine that is followed by a guanine in the DNA sequence. This
CpG methylation represents a direct link between the fields
of genomics and epigenomics. Notably, the CpG pattern is about
sixfold underrepresented in the human genome, but often colocalizes in CpG-rich islands with regulatory elements, such as gene
promoters. The unmethylated state of these CpG island (CGI) promoters is associated with transcriptional competence while the
methylated state correlates with robust transcriptional silencing
(2, 3). Notably, the expression of genes with CpG-poor promoters
is much less affected by DNA methylation, thus partitioning
promoters into two distinct groups of which one is directly coregulated by DNA methylation (CGI promoter) while the other is largely
insensitive to this modification (non-CGI promoter), and thus is
mainly regulated by alternative mechanisms, such as transcription
factor binding or enhancer/repressor activity.
As the methylation state of CpG dinucleotides in binding motifs
directly influences the affinity of transcription-associated proteins to
these sites, it functions as an epigenetic switch. While the position of
these switches can be identified with genomic sequencing, their DNA
methylation state can be determined by epigenetic and epigenomic
assays. Epigenomic methods gain importance in biomedical research.
For example, abnormal methylation patterns are associated with
a variety of diseases and can be used to diagnose functionally compromised cell states (4-7). Identifying among the 30 million CpG
dinucleotides in the human genome those that are associated with a
given cancer type is a nontrivial task, which is comparable to the
genome-wide association studies (GWASs) discussed in Chap. 11 of
this Volume (31) and uses similar methods for statistical analysis.
Furthermore, for the development of powerful biomarkers and
investigation of potential therapy options, it is essential that these
associations are studied in appropriate model systems. This is a
complex problem, as the conservation of a promoter in a model
organism is not necessarily equivalent to the conservation of its
epigenetic regulation machinery.


To effectively address this question, epigenetics has to be
integrated with comparative genomics methodology. Such comparative epigenomics approaches can exploit the fact that DNA methylation leaves a footprint in the genomic sequence, as the methylation
state of a cytosine directly influences its point mutation rate. More
precisely, the spontaneous deamination rate of 5-methylcytosine is
twice as high as for unmethylated cytosine (8). Furthermore, the
deamination product of plain cytosine is uracil while 5-methylcytosine decays into thymine. Thus, the higher fidelity of U:G mismatch
repair over T:G mismatch repair further contributes to the increased
substitution of CpG by TpG (or CpA when the substitution occurs
on the antisense strand) (9, 10). As a consequence, CpG methylation as an
epigenetic modification has a detectable influence on the local and
global nucleotide composition of the genome.
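This footprint is easy to quantify. The following Python sketch (our own illustration; the chapter's analyses themselves run in Galaxy, EpiGRAPH, and R, and the function name is an assumption) computes the observed-over-expected CpG ratio whose depressed genome-wide value reflects the roughly sixfold depletion mentioned above:

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: CpG dinucleotide count versus C*G/length."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    expected = c * g / n if n else 0.0
    return seq.count("CG") / expected if expected else 0.0

# A CpG-eroded sequence scores near 0; a CGI-like sequence near or above 0.65.
```

Applied in a sliding window across a chromosome, a ratio far below 1 marks regions where methylated CpGs have been eroded into TpG/CpA over evolutionary time.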
Our tutorial study exploits this phenomenon to analyze the
conservation of epigenetic regulation of promoters. Starting from a
sequence-based classification of human-mouse orthologous gene
promoters into two types, colocalizing or not with a CGI, we discriminate between merely conserved genes and those that also possess a
conserved promoter type. Actually, for a pair of orthologous genes in
human and mouse, in general the situation that one of the genes has
a promoter overlapping a CGI and the other does not can arise in
three different ways: first, in one species, the CGI has been lost by
mutation or genomic rearrangement and alternative regulation
mechanisms have become dominant; second, in the common ancestor, the gene was alternatively regulated, but then the promoter in
one species evolved into the DNA methylation coregulated CGI
type; and third, the CGI definition fails to correctly classify promoters that are close to violating the relevant constraints. These promoters have recently been described as intermediate CpG content
promoters (ICPs) (11). In this last case, even small fluctuations in the
general species-specific genome sequence composition would be
potent enough to push such a promoter above or below the thresholds of the CGI definition, thus leading to the wrong assumption
that a change in biological function has occurred.
In the first two cases, the epigenetic regulation differs drastically in the two species. This would, for instance, render results
obtained from a DNA methylation-based cancer therapy study in a
mouse model system unreliable for transfer into a human model
system. On the other hand, the last case is caused by lack of
sensitivity in the applied CGI definition and a comparative association study may still be promising.
In the genomics era, bioinformatic methods become increasingly
more relevant for analyzing large-scale epigenetic datasets (12).
However, no single tool solves all problems that an epigeneticist
frequently confronts. Therefore, we apply a number of tools in concert and, in this way, present a flexible pipeline for analyzing epigenome variation in the context of genome evolution and human disease.

This study exemplifies how comparative epigenomics can be
applied to study the conservation and epigenetic regulation of
promoters of orthologous genes in human and mouse. Furthermore, we demonstrate how to identify differentially methylated
regions (DMRs) in human cancer studies and how these results
can be combined to select those DMRs that can be studied in the
mouse model system.
This chapter is structured in three main sections. In the first
section, we determine the promoter type of human-mouse orthologous gene pairs. On the methodological level, this chapter introduces
Galaxy (13), a versatile online interface to an extensive collection
of tools for life science data analysis. The reader learns how to
import, format, and integrate genome annotations from external
databases, such as BioMart and the UCSC Genome Browser. Furthermore, techniques for porting annotations between different
genome versions and across species are introduced.
In the second section, we analyze the epigenomic context of
these promoter pairs, including DNA sequence features, DNA
structure predictions, and histone modifications. Special attention
is paid to CpG to TpG/CpA mutations. For this purpose, we
introduce the EpiGRAPH Web service (14). It can automatically
annotate a set of genome regions with genomic and epigenomic
information. Furthermore, given a dataset consisting of two distinct types of genomic regions, for example CGI promoters and
non-CGI promoters, it provides a statistical framework for identifying the most significant differences between the types.
The final section in this chapter describes a pipeline for the
analysis of publicly available disease-related methylation data. By
applying the statistical programming language R (http://www.
r-project.org), which is also touched briefly in the other sections,
to data obtained by the Illumina Infinium assay, we identify
candidates for gene promoters that are differentially methylated in
ovarian cancer (OV) and normal tissue. With the objective of
finding candidate genes that most likely are also coregulated by
DNA methylation in the mouse model system, we then filter these
lists for genes that are orthologous in human and mouse and
possess conserved epigenetic features in their promoters.
For the following step-by-step description of software tools,
text labels that are enclosed in quotation marks represent as closely
as possible the markup in which they are displayed on screen to
support visual pattern matching and reduce the reader's search
times. Conceptual notions pertaining to software components are
denoted in italic. Furthermore, for each section, intermediate
results are provided on the book's online repository to ensure
that the respective analyses can be performed independently from
each other.

2. Conservation Statistics on CpG Island Promoters

Fig. 1. The Galaxy interface.

The objective of the first analysis is to find a set of orthologous
human and mouse genes, determine their promoter regions, and
identify which of these promoters overlap with CGIs.
We perform this analysis using the online tool Galaxy. In order
to get started, visit http://galaxy.psu.edu and select the option
Use Galaxy. As displayed in Fig. 1, the front end of Galaxy is
divided into three main areas. The Tools panel is located on the left
side and structured in a two-level hierarchy. At the top level, names
of toolboxes are displayed that can be expanded into lists of tools by
clicking on them. The available datasets that can be manipulated
with these tools are displayed in the History panel on the right side
of the front end. Each application of a tool generates a new dataset
in this History, which contains its output. The central area displays
details about a selected tool or dataset and allows for its parameterization and inspection.
An analysis in genomics and epigenomics can start from a list of
manually curated genes that has been crafted by an external expert.
To emulate this scenario, our first analysis focuses on a set of 3,197
human-mouse orthologous gene pairs that has been manually
curated and analyzed in a recent study (15) comparing the distribution of CGIs in those promoters. This Jiang dataset can be
downloaded from http://mbe.oxfordjournals.org/cgi/content/
full/msm128/DC1 (first supplementary table).
For studies that cannot benefit from such preparatory work,
Exercise 1 (see below) outlines how the approach can be
generalized to arbitrary selection of species and gene sets. The
Galaxy analysis workflow is available online at http://main.g2.bx.
psu.edu/u/fmueller/w/conservation-of-cpg-island-promoters,
but it is recommended to perform the analysis manually to become
familiar with Galaxy.
2.1. Obtain Human Gene List from BioMart

To load a new dataset into the History panel, click on the Get
Data menu entry in the Tools panel. Several alternatives for data
acquisition are offered. In order to retrieve the human gene list,
we choose the BioMart Central server option. The Browser
opens the BioMart interface. From the -CHOOSE DATABASE- pull-down menu, we choose the recent Ensembl instance
(for this analysis, ENSEMBL GENES 58 (SANGER UK) was
applied, but the resource is constantly updated). The new pull-down menu -CHOOSE DATASET- is displayed. Select the
Homo sapiens genes (GRCh37) option. Galaxy loads the new
dataset and displays it in the left panel. To select the subset of
genes of interest, click on Filters. To limit the scope of the region
list on the genes from the Jiang dataset, choose Gene: from the
selection criteria on the right area and check the box ID list
limit. From the pull-down menu beside this box, we pick
HGNC symbol(s) (e.g., ZFY). We can now restrict the selection
of genes to those that match the gene symbols we enter into the
text area below.
Copy the human gene symbol column from the Jiang dataset
(H-M sheet of the Excel file) and paste it into the Human
official gene symbol field.
To specify which additional information we need for our analysis,
we now select the Attributes option in the left panel. In the Gene:
category, we first deselect both preselected attributes. Now, we
choose Chromosome Name, Gene Start (bp), Gene End
(bp), and Strand. Additionally, we expand the External: section
and check the HGNC symbol box in the External Reference
subsection. Note that the order in which the Attributes are selected
determines the format of the output file. For some steps downstream
in our pipeline, the order of the first three columns is important
(Chromosome Name, Gene Start, and Gene End).
Click on Results in the black top panel to export the complete
dataset to Galaxy. You will see a preview on the data that will be
exported. Galaxy is already selected as target. Check the box
Unique results only to exclude duplicates and press the Go
button. The browser returns to the Galaxy interface, which displays

18

Analyzing Epigenome Data in Context of Genome Evolution. . .

437

the new dataset in the History panel. The upload from BioMart
to Galaxy may take a few moments. Eventually, we obtain a
tab-separated table containing the data for the subset of the orthologous genes that were retrieved. Note that some genes from the
Jiang dataset are not included in the BioMart database and thus are
not imported into Galaxy.
2.2. Obtain Mouse Gene List from BioMart

To obtain the analogous dataset for the mouse genome, we repeat
the procedure with a few alterations. Select Mus musculus genes
(NCBIM37) as dataset from the Ensembl Genes in BioMart.
Furthermore, the gene symbols for mouse are called MGI symbol
instead of HGNC symbol. For simplicity, we will exploit later that
orthologous genes in HGNC and MGI annotations have the same
symbols, but in the case of human they are in upper case (16).
After transferring the mouse data to Galaxy, you may notice that
a small number of MGI symbols are contained multiple times in the
resulting dataset. This may influence the outcome of the analysis,
and thus we exclude those genes. In order to uniquely identify
each gene, we can add a column by selecting the corresponding
command from Galaxy's Text Manipulation toolbox. Leave
the numeric value 1 in the textbox. Be sure to select YES in the
Iterate drop-down box and the imported mouse genes as the
dataset to work on. Next, we group the resulting dataset (Join,
Subtract, and Group menu) on the symbol column (c5 if the dataset
was uploaded with the column specifications above). Before calculating, add an additional count operation on the newly generated
running number column. Then, hit Execute.
In order to obtain only unique symbols, use the Filter operation
(Filter and Sort toolbox) with the condition "c2==1" on the resulting dataset. Perform a Join two Queries on the original MGI dataset
retrieved from BioMart and the set obtained in the previous step. Use
the corresponding MGI symbol columns for this. Finally, for cleanup
purposes, go to the Text Manipulation menu and cut the chromosome, genomic start and end, strand, and MGI symbol column.
Optionally, you can rename the resulting dataset by clicking on
the small pen symbol next to the dataset in the History panel. The
dataset properties appear and you can add an appropriate name and
description. Click the Save button when done. It is advisable to
annotate every generated dataset in order to enhance the readability
of the dataset history.
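The add-column/group/count/filter/join sequence above implements one idea: discard every gene whose MGI symbol occurs more than once. A stand-alone Python sketch of that logic (illustration only; the five-column layout with the symbol in the last column is an assumption):

```python
from collections import Counter

def drop_duplicate_symbols(rows, symbol_col=4):
    """Keep only rows whose gene symbol occurs exactly once.

    Mirrors the Galaxy group/count/filter/join steps in a single pass."""
    counts = Counter(row[symbol_col] for row in rows)
    return [row for row in rows if counts[row[symbol_col]] == 1]
```

Dropping rather than arbitrarily picking one of the duplicates avoids biasing the later human-mouse join toward whichever duplicate happened to come first.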

2.3. Convert Chromosome Symbols and Strand Symbols to Achieve Compatibility

The next steps operate on genomic intervals. The formats of the
chromosome and the strand columns obtained from the BioMart
data are not compatible with intervals as defined in Galaxy and thus
need adjustment. The following changes need to be conducted for
both human and mouse gene sets. We start by adding the prefix chr
to the chromosome column via the Compute operation from the
Text Manipulation toolbox using the expression "chr" + c1


(assuming that the first column contains the chromosome name).


Next, we convert the strand information from 1 to "+" and from
-1 to "-". Column 4 is our strand column. Then, computing the
expression

c4==-1 and "-" or (c4==1 and "+" or "")

performs the conversion. As Galaxy currently does not provide if
statements in the compute operation, we implement the if statement via an equivalent expression: in many programming languages, including Python (which is used by Galaxy), "A and B
or C" is equivalent to "if A then B else C". Afterward, we perform
cleanup by applying cut on the chromosome name, genomic start
and end, strand, and symbol columns, in our case columns c6,
c2, c3, c7, and c5.
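The and/or idiom is worth a stand-alone look. The Python sketch below (illustrative only; Galaxy evaluates Compute expressions with Python itself) mirrors the strand conversion and notes the idiom's one pitfall:

```python
def strand_symbol(value):
    """Convert the BioMart strand encoding (1 / -1) to '+' / '-'.

    Same shape as the Galaxy Compute expression described in the text."""
    return value == -1 and "-" or (value == 1 and "+" or "")

# Caveat: "A and B or C" only equals "B if A else C" when B is truthy.
# Modern Python would write: "-" if value == -1 else ("+" if value == 1 else "")
```

Here the idiom is safe because "-" and "+" are both truthy strings.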
2.4. Lift Over GRCh37/hg19 to hg18

As ongoing improvements in the assemblies of genomes lead to the
refinement of the published canonical sequences, it sometimes
becomes necessary to convert coordinates from an older to a
newer assembly or vice versa to achieve compatibility with available
annotations. For this purpose, we use the LiftOver tool to transfer the human gene coordinates from the GRCh37/hg19 assembly
to the hg18 assembly.
By clicking on the small pen symbol next to the human dataset
in the History panel, we can check if the dataset is registered
correctly. The data type is currently tabular. Switch to interval
and press Save to update. Then, select the hg19 Database/
Build. Also, make sure that the Strand column option is
checked and set to the correct column number (e.g., c4). Press
the Save button again. By clicking on the name of the dataset, the
first lines are displayed in the history panel and you can verify if the
column names are selected correctly.
To perform the actual migration to hg18, select the Convert
genome coordinates operation in the Lift-Over toolbox. Then,
choose the dataset of human genes we have previously obtained
from BioMart, select hg18 in the To: pull-down menu, and
press Execute.
Two new datasets will be added to the History panel. The
[MAPPED COORDINATES] dataset contains all updated coordinates while [UNMAPPED COORDINATES] contains all
regions that could not be mapped to hg18. In the following
steps, we use the updated coordinates set.
Click once more on the small pen symbol of the resulting
dataset and verify that the genome assembly is indeed hg18. You
might need to select the strand and identifier column again.

2.5. Select Promoter Area around Gene Start

We now narrow down the gene coordinates to the promoter area
which we define here as 2 kb upstream to 1 kb downstream of the
transcription start site (TSS). These values are derived from empirical data on the location of functional elements in gene promoters,
but the exact threshold definitions remain debatable. Of course,
wider or narrower promoter assignments are possible and we
encourage the reader to explore the influence of different stringency
levels on the results of the analysis. First, ensure that the strand
column is registered correctly in both gene lists by repeating the
procedure described in the previous section.
In the Tools panel, choose the Operate on genomic intervals
toolbox and then the Get Flanks option. Select the human gene
set as input data and choose in the Region: pull-down menu the
Around Start option. Leave Location set to Upstream, but set
Offset to 1,000 and change Length of the flanking region(s):
to 3,000. Then, press Execute. Repeat the process for the mouse
gene list. When comparing the dataset before the operation with
the updated dataset, you will find that in the case of forward-strand
genes a window around the gene start is selected. The window
selected for genes on the reverse strand is around the gene end.
Now, update the name of the new datasets by using the pen symbol.
For example, add the Prefix Promoter_ to the name of the
original sets.
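The Get Flanks behavior just described (a window around the gene start for forward-strand genes, around the gene end for reverse-strand genes) is equivalent to taking 2 kb upstream to 1 kb downstream of the TSS. A sketch, assuming 0-based half-open coordinates (a convention we adopt here for illustration):

```python
def promoter_window(start, end, strand, upstream=2000, downstream=1000):
    """Return the promoter interval around the TSS, strand-aware."""
    if strand == "+":
        tss = start                      # forward strand: TSS is the gene start
        return (max(0, tss - upstream), tss + downstream)
    tss = end                            # reverse strand: TSS is the gene end,
    return (max(0, tss - downstream), tss + upstream)  # upstream points rightward
```

Adjusting the upstream/downstream arguments is the programmatic analog of exploring wider or narrower promoter assignments, as suggested above.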

2.6. Import Whole-Genome CpG Island Annotations

There are several CGI annotation programs and definitions for
CGIs. We choose annotations computed with the CgiHunter software. The main advantage of CgiHunter over similar programs is its
search algorithm, for which it was mathematically proven that it
does not miss any region that fulfills a given CGI definition. The
precomputed annotations can be obtained from the CgiHunter
Web site at http://cgihunter.bioinf.mpi-inf.mpg.de/annotations.
php. From the offered CGI tracks, we choose the widely used
Takai-Jones definition that requires a region to meet minimal
requirements of 500-bp region length, G + C content of 55%,
and a ratio of observed over expected CpG frequency of 0.65. It
has the benefit that it is stringent enough to exclude most of the
CpG-rich ALU repeats while it still captures most of the promoter
CGIs (17). The files are named CGIH_TJ_hg18.txt and
CGIH_TJ_mm9.txt for human and mouse, respectively. First,
download the CGI map of the hg18 and mm9 annotations. Then,
back in Galaxy, use the Upload file tool from the Get Data
toolbox. For each dataset, select the interval File Format, then add
the previously obtained datasets in the File: field, enter the
correct assembly name under Genome:, and press Execute.
Finally, verify in the History panel that both datasets are registered
correctly.
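The three Takai-Jones thresholds can be written down as a simple predicate on a candidate region's sequence. This Python sketch is only a checker for a given region; CgiHunter's exhaustive, provably complete window search is the hard part and is not reproduced here:

```python
def is_takai_jones_cgi(seq):
    """Takai-Jones thresholds: length >= 500 bp, GC >= 55%, CpG obs/exp >= 0.65."""
    seq = seq.upper()
    n = len(seq)
    if n < 500:
        return False
    c, g = seq.count("C"), seq.count("G")
    gc = (c + g) / n
    obs = seq.count("CG")                # observed CpG dinucleotides
    exp = c * g / n                      # expected under independence
    return gc >= 0.55 and exp > 0 and obs / exp >= 0.65
```

The stringency noted in the text comes from the conjunction: Alu repeats are CpG-rich but typically fail the combined length and obs/exp requirements.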
To familiarize yourself with the CGI datasets, it is often useful
to visualize some of their properties. As an example, here are a few


Fig. 2. Histogram of CpG island lengths.

lines of R code that generate histograms of the distribution of the
lengths of the islands in mouse and human. Plots are created on
linear and logarithmic scales (see Fig. 2) in order to obtain a more
refined perspective on our data.

Script 1.
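The R listing referred to as Script 1 is not reproduced in this extraction. The underlying computation, island lengths on linear and log scales, can be sketched as follows; the tab-separated chrom/start/end layout of the CgiHunter files is an assumption, and the actual plotting (R's hist(), or matplotlib) is left as a comment:

```python
import math

def island_lengths(lines):
    """Parse chrom<TAB>start<TAB>end interval lines into lengths and log10 lengths."""
    lengths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)
    log_lengths = [math.log10(length) for length in lengths]
    return lengths, log_lengths

# For the histograms themselves, feed the two lists to a hist() call
# (matplotlib.pyplot.hist in Python, or hist() in R as the chapter does).
```

The log-scale view is what reveals structure in the long right tail of island lengths that a linear histogram compresses into a single bin.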

2.7. Determine Genes with and without CpG Island Promoters

To determine which genes in both genomes overlap with CGIs, we
choose the Intersect tool from the Operate on Genomic Intervals toolbox. We are interested in the Overlapping Intervals of:
the promoter lists that intersect: the CGI annotation of the
respective genome assembly for at least 500 base pairs (bp).
This procedure ensures that the overlap of promoter and CGI is
at least equivalent in size to the minimal length constraint of the
Takai-Jones definition.
Then, we determine the set of genes without CGI promoters
by applying the Subtract tool. Be sure to select Subtract a
Whole Query from another Query from the Join, Subtract and
Group toolbox. We choose the results of the previous step in
the Subtract: field, and enter the dataset of all promoters in the
from: field. Perform these steps for the human and mouse datasets. Finally, give the four resulting datasets appropriate names.
2.8. Joining the Data

To integrate the previously obtained data into a single file, open the
Text Manipulation toolbox and select Add column. In the text
field Add this value, enter True. The to Query field should be set to
the dataset of the human gene promoters that overlap with CGIs.
Hit Execute. Repeat this step for the corresponding datasets in
mouse.
Similarly, we add a column with the value False to both
datasets of promoters not overlapping with CGIs.
To join the corresponding datasets for each genome, use the
Concatenate queries tool from the same toolbox: first, select both
human datasets using the Concatenate Query drop-down menu
and then the Add new Query button and the Select pull-down
menu. By pressing the Execute button, both queries are joined
head to tail. Repeat this step to concatenate the mouse datasets.
Finally, we want to integrate both sets in such a way that each
promoter line contains information on its genomic locations in both
genomes and indicators for the existence of CGIs in either species.
First, convert the MGI gene symbols to Upper case to match
the HGNC gene symbols by applying the Change Case to the
symbol column in the Text Manipulation toolbox. Choose the
combined dataset of the mouse genes, enter the column number of
the gene symbols in the Change case of columns: text field, check
that the correct option is selected in the To: pull-down menu,
and execute the operation.
Next, open the Join, Subtract and Group toolbox, choose
the Join two Queries tool for human and mouse datasets, and
select the corresponding column numbers of the upper case gene
symbols. By pressing the Execute button, a new dataset is generated. It contains only genes that appear in both lists and share
exactly the same uppercase gene symbol. Download the dataset by
clicking on the disk symbol in the History panel and name it
orthologous-genes.txt. Finally, open the file in a text editor and add
a row containing the column headers separated by tab characters.
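The case-change-plus-join step amounts to an inner join on uppercased gene symbols, matching mouse Zfy to human ZFY. A dictionary-based Python sketch (illustration only; column layout assumed as before):

```python
def join_orthologs(human_rows, mouse_rows, sym_col=4):
    """Inner-join two tables on uppercased gene symbol (HGNC vs. MGI)."""
    mouse_by_symbol = {row[sym_col].upper(): row for row in mouse_rows}
    joined = []
    for h in human_rows:
        m = mouse_by_symbol.get(h[sym_col].upper())
        if m is not None:                # keep only genes present in both lists
            joined.append(h + m)
    return joined
```

As in the Galaxy join, genes whose symbols do not match exactly after uppercasing simply drop out of the result.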
In order to obtain summary statistics on how many promoters
are included in each of the groups (CGI in human but not in
mouse, CGI in mouse but not in human, CGI in both organisms,
non-CGI in both organisms), use Galaxy's Count tool from the
Statistics toolbox. Choose the final dataset and select both indicator columns for human and mouse CGI promoters to operate
upon. The results are shown in Table 1.

Table 1
Conservation of human and mouse CpG island promoters

                      Mouse CGI promoter     Non-CGI promoter
                      Observed   Expected    Observed   Expected   Total
Human CGI promoter    1,820      1,425.8     284        678.2      2,104
Non-CGI promoter      152        546.2       654        259.8      806
Total                 1,972                  938                   2,910

Apparently, the null hypothesis that promoter types are independent for
homologous genes in human and mouse can be rejected.
The bulk of the promoters under investigation overlap with
CGIs. As expected, in the majority of the cases, the CGI attribute of
the promoters is conserved between human and mouse.
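The claim in the footnote of Table 1 can be checked with a Pearson chi-square test of independence on the 2x2 counts. The stdlib-only sketch below reproduces the expected counts printed in Table 1 and yields a statistic far above the 5% critical value of 3.84 for one degree of freedom:

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic and expected counts for a 2x2 table."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    grand = sum(row_totals)
    stat, expected = 0.0, []
    for i, r in enumerate(table):
        expected.append([])
        for j, o in enumerate(r):
            e = row_totals[i] * col_totals[j] / grand
            expected[i].append(e)
            stat += (o - e) ** 2 / e
    return stat, expected

observed = [[1820, 284], [152, 654]]     # counts from Table 1
stat, expected = chi_square_2x2(observed)
# expected[0][0] matches the 1,425.8 printed in Table 1
```

The statistic exceeds 1,000 here, so the independence hypothesis is rejected at any conventional significance level, in line with the table footnote.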
In this section, we have applied several Web-based tools and
databases to collect a set of orthologous genes and determined to
what extent their promoter type is conserved between human and
mouse. In the following section, we annotate the resulting four
groups of gene promoter pairs with multiple genomic and epigenomic properties and statistically analyze the similarities and differences between them.

3. Genomic Features Analysis with EpiGRAPH

Having identified a set of orthologous gene promoters, we now
want to obtain a more detailed picture of their epigenetic traits. The
online statistical analysis software, EpiGRAPH, is applied to annotate these genomic regions with a large number of genomic and
epigenetic features, such as GC content or histone modifications.
Subsequently, we partition the dataset into different subsets according to promoter type and host species. For each pair of subsets,
EpiGRAPH can perform a statistical test of whether an individual
feature or a group of features is overrepresented in one of those
subgroups. Additionally, multiple statistical learning approaches can
be applied to assess the prediction power of feature sets on the
defined response. More details on the basics and various approaches

in machine learning can be found in (18). In the following sections,


we apply this procedure with different subset combinations to identify footprints of epigenetic regulation.
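EpiGRAPH's exact testing procedure is not spelled out here; the flavor of such a test, whether a feature's mean differs between two sets of regions more than chance allows, can be illustrated with a label-permutation test (a conceptual sketch, not EpiGRAPH's implementation):

```python
import random

def permutation_p_value(values_a, values_b, n_perm=2000, seed=0):
    """Two-sided label-permutation test for a difference in group means."""
    rng = random.Random(seed)
    pooled = list(values_a) + list(values_b)
    n_a = len(values_a)
    observed = abs(sum(values_a) / n_a - sum(values_b) / len(values_b))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)              # break any association with group labels
        a, b = pooled[:n_a], pooled[n_a:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)     # add-one smoothing avoids p = 0
```

With hundreds of features tested, EpiGRAPH-style analyses additionally require multiple-testing correction before any single p-value is taken at face value.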
3.1. First Steps with EpiGRAPH

At http://epigraph.mpi-inf.mpg.de/WebGRAPH/faces/Login.jsp,
a free user account can be created that enables the use of EpiGRAPH's advanced custom analyses. Instructive video tutorials are
provided on the same site and in previously published tutorials (19).

3.2. Identifying Properties of Conserved and Nonconserved Promoters

In the previous section, we showed that the majority of orthologous promoters share the same CGI state. However, we also identified several gene promoters that overlap with a CGI only in one species. Several possible explanations for those orthologous loci with nonorthologous CGIs were discussed in the introductory section.
A possible mechanism for the loss of CGIs in promoter regions is a slow erosion process that is triggered by increased DNA methylation in the germ line, followed by subsequent loss of individual CpGs through spontaneous deamination. Such erosion has previously been observed for CGIs in the mouse genome (20). Here, we investigate whether this process is associated with the genomic properties of the promoter and can also be observed at the orthologous human loci (albeit at a slower pace). In an initial analysis, we search for genomic attributes of human CGI promoters that are predictive of the absence of CGIs in their orthologous mouse promoters.

3.2.1. Uploading the Dataset with Mapped Promoters in EpiGRAPH

To analyze the dataset from the previous section in detail, we first need to import it into EpiGRAPH.
Click on the "Upload Custom Attribute Dataset" button, which loads the Attribute View (Fig. 3). Then select the file location (1), specify the meta-information of the dataset (also referred to as attribute), such as the attribute name (2), and provide the service with the information on how the genomic locations as well as the desired response are stored in the file by specifying their respective column names (3). The attribute upload is completed by selecting the "Submit Attribute and Proceed" button; return to the overview page using the corresponding link.

3.2.2. Defining an EpiGRAPH Analysis

The objective of the first analysis is to identify features of human promoters that are predictive of the promoter type of the orthologous genes in mouse. The two types of promoters distinguished in this study are CGI-associated promoters and non-CGI-associated promoters. First, select "Define New Analysis Using This Website" from the EpiGRAPH Overview page. We arrive again at the Attribute View (Fig. 4). This time, select "Calculate Derived Attribute" (1), and from the list of available attributes select the hg18_orthologous_promoters attribute (2) uploaded in the previous step. In order to base the analysis only on human CGI orthologous promoters, define an inclusion filter. This is achieved by selecting the

444

L. Feuerbach et al.

Fig. 3. Attribute View for uploading a new user-defined dataset.

hg18_CGI column from the list (3), selecting the "Add Column" button (4) below the Inclusion Filter field, and appending the "True" statement at the end. Before continuing via "Submit Attribute and Proceed", make sure you have assigned an attribute label (5). Proceed to be taken to a view used for defining control sets. As a control set is not needed for this analysis, skip the next step by selecting the "Skip this Step" button.
The next screen (Fig. 5) is the Analysis View, in which the parameters for the actual analysis are specified. First, specify the target feature that is the basis of the analysis. Partitioning of the region set for all further statistical and machine learning analyses is based on the target feature, in this case mm9_CGI (1). Next, choose the additional genomic and epigenomic features for EpiGRAPH to inspect for each genomic region. These features include frequency counts for various DNA sequence patterns, predicted DNA structure, information on overlap with repeats, evolutionary history, population variation, and others. All of the above are automatically obtained and preprocessed from public sources and databases. A full list with detailed descriptions of the features and interpretation of the computed representative values can be found on the EpiGRAPH Web site
(http://epigraph.mpi-inf.mpg.de/WebGRAPH/faces/Background.html#attributes).

Fig. 4. The Attribute View used for computing an attribute based on an already existing dataset.

For the exploratory purposes of this analysis, choose the analysis to be performed on all default EpiGRAPH attributes by selecting the "Select All Default Attributes*" button (2). Next, enter a
name and a short description of the analysis (3). A useful option is to activate the e-mail notification that reports the completion of the analysis, as the annotation and the analysis might take several minutes. Finally, click the "Start Analysis" button to submit the analysis to the EpiGRAPH server (4).
3.2.3. Inspecting the Results

Once the analysis is complete, first inspect the results of the statistical analysis, which focuses on each computed feature separately. The values for each feature are split into two groups depending on the target feature. EpiGRAPH then uses a statistical method, the Wilcoxon rank-sum test (21), to assess the validity of the null hypothesis that these two sets of values come from the same distribution.


Fig. 5. The Analysis View allows the user to specify the settings of the analysis to be performed.

To be more specific, this is a nonparametric statistical test used as an alternative to the two-sample t-test when the underlying distribution of the data is not known. Its null hypothesis is that the observed values from the two groups are drawn from the same distribution. The method sorts all values, assigns a rank to each of them, and aggregates the ranks of the values in each sample group. Under the null hypothesis, the normalized rank sums of the sample groups are expected to be equal.
A nonparametric test is best suited for this purpose, as it is more universal and does not assume that the feature values come from a specific data distribution. EpiGRAPH uses this test for every computed feature and reports an uncorrected p-value. Because an EpiGRAPH analysis applies the same test to hundreds of features, there is a high probability that the statistical tests report a low p-value for some of those features by chance. To correct for such misleading p-values, multiple-testing correction is used: EpiGRAPH reports whether each p-value remains significant after correction with the Bonferroni and Benjamini/Hochberg methods (22).
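The per-feature procedure just described can be sketched in R. This is a minimal illustration with simulated values, not EpiGRAPH code; the feature values and the number of features are invented:

```r
# Wilcoxon rank-sum test on one feature, split by the target feature,
# followed by multiple-testing correction over many features.
set.seed(1)
conserved    <- rnorm(100, mean = 0.65, sd = 0.05)  # e.g., CpG obs/exp ratios
nonconserved <- rnorm(100, mean = 0.55, sd = 0.05)
p.one <- wilcox.test(conserved, nonconserved)$p.value

# Correct a whole vector of per-feature p-values:
p.all  <- c(p.one, runif(499))                # 499 hypothetical null features
p.bonf <- p.adjust(p.all, method = "bonferroni")
p.bh   <- p.adjust(p.all, method = "BH")      # Benjamini/Hochberg
```

Both correction methods only ever increase p-values, which is why a feature can look significant before correction but not after it.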
In the results table of the statistical analysis (Fig. 6a), the
features are displayed ranked according to p-value. The statistical


Fig. 6. Results from EpiGRAPH (a) Statistical analysis and (b) Machine learning.

test on the frequency of CpG dinucleotides (Pat_CG_freq) reports a very low p-value that remains significant after multiple-testing correction, indicating rejection of the null hypothesis, in our case that the frequency of CpG dinucleotides in human CGI promoters orthologous to mouse CGI promoters has the same distribution as in human CGI promoters orthologous to mouse non-CGI promoters. Another feature used to define CGIs, the observed versus expected ratio of CpGs within the regions (CpG_obs_vs_exp_ratio), behaves similarly. Also, a more complex measure of CGI strength that integrates the combined epigenetic score for bona fide CGI prediction (23) with DNA sequence features shows significantly higher values for the conserved CGIs. Furthermore, we notice that the H3K4me3 and H4K20me1 histone modifications are enriched in the human CGI promoters whose


orthologs resemble CGI promoters in mouse as well. These posttranslational histone modifications are generally associated with open chromatin and with CGIs that are especially enriched for CpGs. However, the experimental data for those histone modifications were obtained only from blood tissues (more information can be found in the EpiGRAPH documentation) and should be interpreted cautiously, as they do not necessarily correlate with histone modification states in other tissues. More precisely, the presence of these marks indicates that a promoter is subject to epigenetic regulation in at least one tissue, but their absence in one tissue does not rule out that the promoter is epigenetically regulated in other tissues.
Among the most significant sequence patterns are a measure of the ratio between the CpG frequency and the frequency of the spontaneous deamination products TpG and CpA (CpG_vs_TpG_v_CpA_ratio) and the CpA/TpG frequency (CA_freq; the search is performed on both strands and thus includes the reverse complement TpG as well). Both values indicate that deamination products are enriched in those promoters that lost their CGI status in mouse.
As previously mentioned, visual inspection of the data is an important step. The diagram generation module of EpiGRAPH allows the user to inspect the distribution of a feature with respect to the target. This is achieved by selecting the checkboxes of the features you would like to visualize and clicking "Calculate Selected Diagrams". The box plot presented in Fig. 7 indicates that for

Fig. 7. Diagram representing the distribution of the feature CpG_obs_vs_exp_ratio for promoters that are CpG islands in mouse (gray) and that are not (black).


human CGI promoters, the observed versus expected ratio of CpG counts is significantly lower for those orthologous to mouse non-CGI promoters than for those orthologous to mouse CGI promoters. Nonetheless, the substantial overlap of the two distributions in the range between 0.55 and 0.65 also indicates that this feature alone does not provide sufficient power to predict whether or not the orthologous mouse promoter of a human CGI promoter also contains a CGI.
These observations are supported and quantified by the machine learning analysis, which measures the predictive power of genomic features grouped by biological function (Fig. 6b). As its default statistical learning method, EpiGRAPH uses classification via support vector machines; however, it also allows the user to select from multiple other available methods (14). In short, EpiGRAPH partitions the set of genomic regions into two groups based on the value of a user-specified target variable. In the current analysis, this variable encodes whether or not the homologous mouse locus overlaps with a CGI. Each genetic and epigenetic property computed for all regions is used as an input feature for the classification algorithm, and the value of the target variable is the output. EpiGRAPH applies cross-validation to obtain two measures of prediction performance (prediction accuracy and the Pearson correlation coefficient). These numbers estimate how well EpiGRAPH predicts the value of the target variable for novel genomic locations. A correlation coefficient close to 1 indicates that EpiGRAPH will almost always be correct, while a correlation coefficient of 0 indicates that EpiGRAPH did not find any association between the features and the output. In the specific scenario, we observe that no group of features is exceptionally predictive of the type of a mouse promoter orthologous to a human CGI promoter.
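The cross-validation scheme can be illustrated with a self-contained R sketch. For simplicity, it uses logistic regression in place of EpiGRAPH's default support vector machine, and the feature and target values are simulated; only the cross-validation logic and the two reported measures follow the description above:

```r
set.seed(42)
n <- 200
feature <- c(rnorm(n / 2, mean = 1), rnorm(n / 2, mean = -1))  # one predictor
target  <- rep(c(1, 0), each = n / 2)                          # binary response
folds   <- sample(rep(1:5, length.out = n))                    # 5-fold split

pred <- numeric(n)
for (k in 1:5) {
  train <- folds != k
  fit <- glm(target[train] ~ feature[train], family = binomial)
  # Predicted probability for the held-out fold:
  pred[!train] <- plogis(coef(fit)[1] + coef(fit)[2] * feature[!train])
}
accuracy    <- mean((pred > 0.5) == target)  # prediction accuracy
correlation <- cor(pred, target)             # Pearson correlation coefficient
```

Because every region is predicted by a model that never saw it during training, the two measures estimate performance on novel genomic locations.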
3.2.4. Discussion

In this analysis, we tested and confirmed the hypothesis that human CGI promoters that do not overlap with CGIs at the homologous mouse loci display general properties of ICP-like CGIs (11), such as a lower CpG frequency and a lower CpG observed versus expected ratio, and furthermore show less evidence of open chromatin, such as H3K4me3 histone modifications.
Considering these observations, we can now reassess the three alternative explanations for a change in the CGI status of promoters. The first explanation asserts that the promoter is conserved in one species but lost in the other. The reduced amount of active histone marks indicates that epigenetic activation is, in general, weaker for those promoters. Thus, methylation-independent regulation may already play a more dominant role at these loci, or they may completely lack regulatory potential. From this observation, we can derive the hypothesis that the stronger islands are more likely to be conserved and are more epigenetically active, while the weaker islands are epigenetically less involved, therefore less protected from DNA methylation by their functional


architecture and also by positive selection, and, in consequence, more prone to getting lost in the course of evolution. Such a loss is most likely mediated by a loss of protection from DNA methylation, which then causes increased CpG decay by spontaneous deamination. The above-mentioned significant difference in the values of the TpG/CpA-related features indicates that this process is already observable at the human loci, but presumably at a slower rate than in mouse.
The latter observation also argues against the second possible explanation, namely, that the CGIs were newly formed in human, either by slow gain of CpGs or by insertion of a CpG-rich sequence. The former is improbable in a deamination-favoring environment. An indication of the latter could be the borderline-significant (feature rank 160 according to significance) higher presence of L1 repeats in the promoters with nonconserved CGI status. In individual cases, one can therefore inspect whether these overlapping L1 repeats are CpG-rich and also present in mouse, to test whether they inserted CGIs into individual human promoters by retrotransposition; however, this is unlikely to be a general explanation for our observations.
The third explanation argues that the Takai-Jones CGI definition could be too strict for the mouse genome. The previously mentioned CGI erosion process (20) has caused a loss of CpGs at the boundaries of many CGIs (15) and produced a somewhat shrunken CGI type in mouse. This would primarily affect weaker islands, as those require fewer mutations to be pushed below one of the three thresholds of the definition and, as a result, are no longer considered CGIs. Hence, most of the lost CGIs are explained not by a full change in promoter type, but by a slight evolutionary change in their structure that is not reflected in the CGI definition.
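For reference, the three thresholds of the Takai-Jones definition are a length of at least 500 bp, a G+C content of at least 55%, and a CpG observed/expected ratio of at least 0.65. They can be checked with a short R function (illustrative only; EpiGRAPH and Galaxy rely on precomputed CGI tracks rather than on such a function):

```r
# Check the three Takai-Jones criteria for a DNA sequence string.
is_takai_jones_cgi <- function(dna) {
  dna     <- toupper(dna)
  n       <- nchar(dna)
  bases   <- strsplit(dna, "")[[1]]
  gc      <- sum(bases %in% c("G", "C")) / n
  m       <- gregexpr("CG", dna, fixed = TRUE)[[1]]
  cpg_obs <- if (m[1] == -1) 0 else length(m)           # observed CpG count
  cpg_exp <- (sum(bases == "C") * sum(bases == "G")) / n  # expected CpG count
  n >= 500 && gc >= 0.55 && cpg_exp > 0 && (cpg_obs / cpg_exp) >= 0.65
}
```

A sequence hovering just above one of these thresholds needs only a few CpG losses to be reclassified, which is the point made above about weaker islands.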
To test these hypotheses in the context of more epigenetic data,
in the next section we inspect the DNA methylation properties of
the promoters in more detail.
3.3. Analyzing DNA Methylation State of Orthologous Promoters

In this section, we analyze the association of DNA methylation and CpG conservation in the context of orthologous gene promoters in human and mouse. For this purpose, we need to extract methylation information for all orthologous promoters in both human and mouse. For human, this is achieved by repeating the steps for defining the analysis from the previous section, with two changes. The first is to exclude the filtering step (inclusion filter) that is referred to as point (4) in the Attribute View in Fig. 4. This modification results in all promoters being processed rather than only CGI promoters. The second change in the analysis settings is to switch off the downsampling in the Analysis View by clicking the link above the textbox (referred to as point (5) in Fig. 5). Once the analysis is complete, access the analysis results and download the computed
data table to your machine under the name hg18-promoter-methylation.txt using the "Download Data Table" button. We also


need to repeat these steps, along with the two modifications, in the mouse context to obtain the corresponding methylation information for the mouse orthologous promoters. First, switch the genome version on the right panel of EpiGRAPH to mm9. Then, repeat the steps from the previous subsection by uploading the dataset again, but this time setting the columns defining the genomic coordinates (point (3) in Fig. 3) to the mouse coordinates. The remaining analysis is configured analogously to the human case above. After it is complete, store the output file locally as mm9-promoter-methylation.txt.
3.3.1. Summarizing the DNA Methylation Data

The attribute data computed in the previous paragraph include DNA methylation data obtained from Reduced Representation Bisulfite Sequencing (RRBS) experiments (24). RRBS allows for the assignment of a methylation score to every covered cytosine. Methylation scores range between 0 and 1, with 0 indicating entirely unmethylated and 1 (or 100%) indicating fully methylated CpG sites. To obtain a representative methylation score for a promoter, EpiGRAPH averages the methylation scores of the individual CpG sites within this promoter. We have to keep in mind that the RRBS technology predominantly enriches for CpGs within CpG-rich regions, and thus we might not have representative methylation information for some CpG-poor regions. We first inspect the distribution of these methylation scores by using an R script (Script 2) on the files computed in the previous paragraph.
The results (Fig. 8) indicate that the majority of the sites are

Fig. 8. Visualization of the promoter methylation obtained via RRBS for mouse and human. The black vertical lines indicate the thresholds chosen to identify methylated (>0.66) and unmethylated (<0.33) cases.


Table 2
Distribution of promoter methylation data visualized by genome and promoter CGI status

                          Unmethylated           Methylated
CpG island promoter       hg18       mm9         hg18       mm9
hg18 and mm9              1,746      1,759       28         10
hg18                      224        94          18         40
mm9                       14         119         28
Neither                   34         42          165        137

Only promoters with unambiguous methylation state are shown

unmethylated. Due to the RRBS bias toward CpG-rich regions, we observe only a few methylated regions. The distributions of the methylation scores in the mouse and human cases are similar.

Script 2.
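The original Script 2 is reproduced in the book as a figure and is not shown here; the following sketch illustrates what such a script plausibly does. The column name meth_ratio is an assumption about the EpiGRAPH output:

```r
# Read an EpiGRAPH data table and plot the promoter methylation scores.
read_meth_scores <- function(path, column = "meth_ratio") {
  read.delim(path)[[column]]
}
plot_meth_dist <- function(scores, title) {
  hist(scores, breaks = 20, main = title, xlab = "methylation score")
  abline(v = c(0.33, 0.66))  # thresholds used below for discrete calls
}
# Usage with the files saved above:
# par(mfrow = c(1, 2))
# plot_meth_dist(read_meth_scores("hg18-promoter-methylation.txt"), "human")
# plot_meth_dist(read_meth_scores("mm9-promoter-methylation.txt"), "mouse")
```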

Next, we want to inspect the distribution of the methylated and unmethylated promoters in the different groups of promoters with respect to their CGI status. The R script below (Script 3) produces the data shown in Table 2. In essence, the methylation information for every promoter is converted from a continuous value between 0 and 1 to a discrete state: methylated or unmethylated. This is a standard technique in methylation analysis, and the common choices are either to consider every region with a methylation score below 0.33 unmethylated and above 0.66 methylated, or to apply the stricter thresholds 0.25 and 0.75 (25). As indicated in Fig. 8,


in our case, the choice of cutoff values would not influence the results significantly. The reader may further wish to modify the R code from Script 1 to visualize the distributions separately for the True/True group, the False/True group, etc.

Script 3.
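The original Script 3 is likewise shown as a figure in the book; the following is an illustrative sketch of the discretization and cross-tabulation it performs. The column names hg18_CGI, mm9_CGI, and meth_ratio are assumptions:

```r
# Convert continuous methylation scores to discrete states; scores in the
# ambiguous middle range yield NA and are excluded from the table.
call_state <- function(score, lo = 0.33, hi = 0.66) {
  ifelse(score < lo, "unmethylated", ifelse(score > hi, "methylated", NA))
}
# Usage on a data frame 'promoters' with the assumed columns:
# promoters$state <- call_state(promoters$meth_ratio)
# table(hg18_CGI = promoters$hg18_CGI, mm9_CGI = promoters$mm9_CGI,
#       state = promoters$state)
```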

We observe that promoters in the True/False group (mm9 CGI promoters orthologous to non-CGI human promoters) are predominantly methylated in human. In contrast, the corresponding promoters in mouse (the False/True group) are predominantly unmethylated. In spite of the relatively small number of cases, this potentially indicates that in human most of these promoters have either lost their ability to be epigenetically regulated or are silenced by DNA methylation in the analyzed tissues. However, in mouse, the majority of these promoters appear to be still epigenetically active, although they do not meet the CGI criteria.
3.3.2. Analyzing the Epigenetically Interesting Group of Non-CGI Mouse Promoters Orthologous to CGI Human Promoters

First, ensure that you are working on the mouse dataset. Using the EpiGRAPH filtering options on the already computed datasets, we extract the mouse promoters that are CGI in human but not CGI in mouse. Repeat the steps of the "Defining an EpiGRAPH Analysis" section, defining the inclusion filter (point (4) in Fig. 4) to ensure that the hg18_CGI feature has the value True and the mm9_CGI feature has the value False. We also exclude all cases that do not have strong methylation scores by adding to the inclusion filter a restriction that the methylation score be either less than 0.33 or more than 0.66. We also add a new column that contains the


Fig. 9. Visualization of some of the most significant features differentiating between methylated and unmethylated mouse
non-CGI promoters orthologous to human CGI promoters.

methylation status of every promoter as a binary value by specifying the new column information (see Fig. 4) with a calculation formula of the sort int(round(%(meth_ratio)f)). We analyze the genetic properties of these promoters for significant differences between the methylated and unmethylated promoters. The results (Fig. 9) indicate that unmethylated non-CGI promoters in mouse have a significantly higher CpG frequency (Fig. 9a) and longer CpG-related patterns, as well as higher CpG observed versus expected ratios (Fig. 9b) and lower CpA and TpG frequencies (Fig. 9c), which indicate CpG decay. The unmethylated non-CGI promoters are either protected from this decay or it is considerably slower. Interestingly, the most significantly different feature is the standard deviation of the CpG content (CG_std) (Fig. 9d). This feature is


computed by partitioning the region into multiple consecutive subregions, estimating the CpG frequency in each of them, and computing the standard deviation of the obtained set of frequencies. It has low values when the distribution of CpGs is uniform along the whole region and higher values if certain parts have high CpG frequencies while others are CpG-poor. High values of CG_std are usually indicative of regions overlapping with a CGI. A possible explanation for the significantly elevated values of this feature in unmethylated non-CGI promoters is the previously described erosion process (15) that starts from the edges of the CGI. Alternatively, the mouse genome may possess smaller CGIs that are somewhat below the minimal length of human CGIs.
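The windowed computation behind CG_std can be sketched as follows; the window size and boundary handling are assumptions, and EpiGRAPH's exact parameters may differ:

```r
# Partition a sequence into consecutive windows, estimate the CpG
# frequency in each, and return the standard deviation of the frequencies.
cg_std <- function(dna, window = 100) {
  dna    <- toupper(dna)
  starts <- seq.int(1, nchar(dna) - window + 1, by = window)
  freqs  <- vapply(starts, function(s) {
    w <- substr(dna, s, s + window - 1)
    m <- gregexpr("CG", w, fixed = TRUE)[[1]]
    if (m[1] == -1) 0 else length(m) / window
  }, numeric(1))
  sd(freqs)
}
```

A homogeneous CpG-rich sequence yields a CG_std of 0, while a sequence with a CpG-dense core flanked by CpG-poor stretches yields a high value, matching the interpretation given above.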
3.3.3. Discussion

In summary, these results indicate that among the promoters that lost or never gained CGIs in mouse we observe two different classes. On the one hand, there are the methylated promoters, which apparently lose CpGs homogeneously due to the CpG decay effect. On the other hand, we have the unmethylated promoter type, which represents a shrunken type of CGI that dropped below the thresholds of the classical CGI definition but still shows many of the classical CGI characteristics.
To assess variation in the general evolutionary trends between
mouse and human, the next subsection compares the orthologous
promoters with unchanged CGI state. This provides an additional
background against which the results from this section can be
evaluated.

3.4. Differential Analysis of Human and Mouse Promoter Traits

As a follow-up analysis, we test which genomic features are significantly different between human and mouse promoters. For this purpose, we use the full attribute data computed at the beginning of the previous section for human and mouse promoters.

3.4.1. Preparing the Data

We download the data from both the human and the mouse analyses (see Subheading 3). This is done by selecting the button named "Download Data Table" on the analysis pages. We then run an R script (Script 4) that (1) reads the two datasets (the specific file names need to be set additionally); (2) selects only the properties that are common to both datasets; (3) builds the combined dataset with an additional column called Genome, which indicates the original genome source of each case; and finally (4) stores this new combined dataset in a new file. Once the file is prepared, we are ready to perform an EpiGRAPH analysis to identify significant differences between the properties for the different genomes.


Script 4.
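The original Script 4 appears as a figure; the combination step it performs can be sketched as below. The file names follow the text, but the implementation itself is illustrative:

```r
# Steps (2) and (3): keep only the shared columns and tag each row with
# its source genome before concatenating the two tables.
combine_genomes <- function(hs, mm) {
  shared <- intersect(colnames(hs), colnames(mm))
  rbind(cbind(hs[, shared, drop = FALSE], Genome = "hg18"),
        cbind(mm[, shared, drop = FALSE], Genome = "mm9"))
}
# Steps (1) and (4):
# hs <- read.delim("hg18-promoter-methylation.txt")
# mm <- read.delim("mm9-promoter-methylation.txt")
# write.table(combine_genomes(hs, mm), "combined-promoters.txt",
#             sep = "\t", quote = FALSE, row.names = FALSE)
```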
3.4.2. Statistical and Machine Learning Analysis of the Variation of Genome Features Between Human and Mouse

We define a standard analysis: in the Attribute View, we load the file by indicating the chromosome, start, and end columns. We further need to identify the class columns: hg18_CGI, mm9_CGI, and the new column Genome indicating the genome of origin. In the Analysis View, we choose the Genome column as the target attribute for the EpiGRAPH analysis and start the analysis. After the analysis is finished, we observe that the set of features can differentiate almost perfectly (prediction accuracy of 98%) between the mouse cases and the human cases. Furthermore, the features that distinguish most significantly are associated with G + C content and CpG markers as well as repeat content (Fig. 10).
We repeat the above analysis only for the True/True group, i.e., promoters that overlap with CGIs in both mouse and human. The results indicate similar predictive power (prediction accuracy of 98%) and show that the human CGI promoters have significantly more CpGs and a higher observed versus expected ratio in the context of only slightly higher G + C content. Furthermore, we notice that the TpG/CpA pattern is more frequent in mouse promoters. Both observations are in accordance with (15) and indicate that CGIs in mouse have lost CpGs, probably due to the CpG decay effect. We also observe significantly higher overlap with repeats for human promoters.

Fig. 10. List of genomic features that are most significantly different between human and mouse promoters.


As a third analysis in this subsection, we compare promoters that are CGI neither in human nor in mouse. The results point to a number of A + T-rich patterns indicative of the original genome of the regions. In all cases, the available features could almost perfectly distinguish between human and mouse promoters.
3.4.3. Discussion

In this subsection, we used EpiGRAPH to show that orthologous human and mouse promoters have significantly different genetic and epigenetic features. We showed that the corresponding CGI promoters in mouse have significantly weaker CpG patterns and enriched products of spontaneous deamination compared to human, while orthologous non-CGI promoters differ between mouse and human mainly in their A + T-rich patterns. This indicates that especially CGI promoters have lost CpG content in mouse compared to the orthologous human promoters.

3.5. Summary

We can conclude that there are two independent biological explanations for the loss of CGIs in some mouse promoters. First, a general trend in mouse was observed toward smaller but functional CGIs with fewer CpGs that are not captured by the Takai-Jones CGI definition, which leads to false-negative classifications in the assessment of the promoter type. Second, a number of genes actually differ in their promoter type, which is reflected by an increase in DNA methylation and a uniform loss of CpGs over the whole promoter region. While studying the epigenetic regulation of promoters from the first group in mouse may also grant insights into their regulation in humans, this is implausible for promoters from the second group.
In the next section, we use this knowledge to enhance candidate selection in a computational screen for methylation-associated
cancer markers in human. We search for those candidates that are
amenable to functional studies in the mouse model system.

4. Methylation and Disease

Epigenetic drugs can bring important progress to anticancer therapy; for example, the drug 5-azacytidine is applied in the treatment of myelodysplastic syndrome. With the acquired knowledge of enzymes involved in gene regulation, epigenetic therapies could provide an effective alternative to chemotherapy (26). In this section, we are going to identify genes that are differentially methylated in ovarian cancer and normal tissue. The epigenetic state of
some of these genes might have a causal effect on tumor progression and proliferation. Such genes are putative targets for epigenetic therapy. Finally, we are going to prioritize the list of potential


targets, focusing on genes that (when studied more extensively) are most likely to be relevant for clinical investigation. Keep in mind that the pipeline presented here is a simplified version of an epigenomic study. Important steps, such as type II error estimation and correction for confounding factors, such as biased patient selection, are omitted. These fundamental aspects of a genome-wide association study are discussed in greater detail in Chap. 11 of this Volume.
The Illumina methylation assay (27, 28) utilizes the HumanMethylation27 BeadChip technology to read out an array of 27,578 pairs of CpG methylation-specific probes complementary to bisulfite-converted DNA. It effectively measures the methylation status of over 27,000 CpGs mapping to the promoters of approximately 14,000 genes. A recent study (29) focused on the relationship between Polycomb Group proteins and genes with age-specific methylation. Within this study, the Illumina Infinium assay (28) was utilized to obtain methylation profiles of whole-blood samples from 540 postmenopausal women. Two hundred and sixty-six samples were taken from postmenopausal women with ovarian cancer (OV) and two hundred and seventy-four from age-matched healthy controls. We use this dataset to identify genes that exhibit OV-specific methylation, referring to them as OV-associated genes.

Fig. 11. (a) Table of methylation values, also referred to as beta values. They depict methylation degree and span the range
between 0 (completely unmethylated) and 1 (fully methylated). Row names are Infinium probe identifiers; column names
are sample identifiers. Note that only the first two rows and four columns are displayed. The full table contains 27,578 rows
and 540 columns. (b) Table of sample clinical information. Row names are sample identifiers. Note that only the first two
rows and two columns are displayed. The full table contains 540 rows and 14 columns. (c) Table summarizing the Illumina
Infinium platform specification. Row names are probe identifiers. Note that only the first two rows and four columns are
displayed. The full table contains 27,578 rows and 38 columns.


4.1. Obtaining the Data

The methylation dataset is deposited in the Gene Expression Omnibus (GEO) under the identifier GSE19711 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19711). The SOFT-formatted family file contains the methylation degrees of all probes, as well as clinical information about the samples. SOFT files are text files with a simple structure. A short script can be used to extract from this file the tables presented in Fig. 11 (an example R script is provided in the supplementary information). The mapping from Infinium probe identifiers to genes is available in GEO under the platform specification GPL8490 (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL8490). The table can be downloaded by clicking the button "Download full table...".
The information available in the tables shown in Fig. 11 enables us to compare the methylation states of healthy and cancerous samples. For every gene, we can obtain a set of 274 numbers depicting its methylation status in healthy samples, as well as a set of 266 numbers corresponding to its methylation status in ovarian cancer cells. A gene is then classified as OV associated if the two sets of numbers are significantly different. Several statistical tests can be applied for that purpose; we apply the Wilcoxon rank-sum test. Instead of a long script, the necessary R code to perform the analysis is presented as snippets within the associated subsections.
The tables presented in Fig. 11 are loaded from files named GSE19711-betas.txt (Fig. 11a), GSE19711-clinical.txt (Fig. 11b), and GPL8490-65.txt (Fig. 11c). The first two tables can be obtained after parsing the SOFT file in GEO record GSE19711. The third table can be downloaded from GEO record GPL8490.

4.2. Determining OV-Associated Genes

In this section, we perform a statistical test that checks if a gene's methylation state is related to the disease state of ovarian tissue. The test is applied independently to every gene and produces a p-value. The null hypothesis is that the gene is equally strongly methylated in patient samples and healthy controls. In case the resulting p-value is sufficiently low, we can reject this null hypothesis and conclude that the methylation state at this locus is correlated with the disease state.
The R objects created by the code snippets in this section are
summarized in Fig. 12.

4.2.1. Step 1

The first step is to load the table of methylation values for all
samples, the table with clinical information on the samples, as well
as a table containing information on the probes used in the Illumina
Infinium platform.

L. Feuerbach et al.

Fig. 12. R objects created in the analysis of disease association. Matrices are represented by tables, and lists by charts with inner borders only. Arrows indicate how the objects are derived. Numbers in the arrows correspond to the steps described in Subheading 4.2.

Script 5.
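Since the original script figure is not reproduced here, a minimal sketch of the loading step is shown below, using a tiny temporary file in place of the real GSE19711-betas.txt (the tab-separated layout with probe identifiers in the first column is an assumption):

```r
# Write a two-probe toy file and load it the way the full methylation
# table (27,578 probes x 540 samples) would be loaded.
tf <- tempfile()
writeLines(c("probe\tsample1\tsample2",
             "cg0001\t0.12\t0.87",
             "cg0002\t0.05\t0.91"), tf)
betas <- as.matrix(read.table(tf, header = TRUE, sep = "\t", row.names = 1))
dim(betas)   # 2 probes x 2 samples
```

The clinical table and the probe annotation table can be loaded with analogous read.table() calls.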
4.2.2. Step 2

Many genes are represented by more than one probe in the Infinium assay. CpG methylation is highly correlated over short
distances (30). Therefore, we can estimate the methylation level
of a gene promoter by averaging the methylation value for all


associated probes. The next code snippet performs this task and
creates a matrix of methylation values named betas.genes. In this
matrix, every row corresponds to a gene and every sample is
represented by a column.

Script 6.
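A minimal sketch of this averaging step, with a simulated probe matrix and an assumed probe-to-gene mapping (the real mapping comes from the GPL8490 platform table):

```r
# Toy probe-level matrix (probes x samples) and a probe-to-gene mapping.
betas <- matrix(c(0.1, 0.9,
                  0.2, 0.8,
                  0.5, 0.5),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("cg1", "cg2", "cg3"), c("s1", "s2")))
probe2gene <- c(cg1 = "GENE_A", cg2 = "GENE_A", cg3 = "GENE_B")
# Average all probes of each gene, per sample: genes end up in rows.
betas.genes <- apply(betas, 2,
                     function(col) tapply(col, probe2gene[rownames(betas)], mean))
betas.genes["GENE_A", ]   # 0.15 0.85
```

tapply() groups the probe values of one sample by gene and averages them; apply() repeats this for every sample column.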
4.2.3. Step 3

As a next step, we need to define the two sample sets: healthy and ovarian cancer. The code below creates a list of two index vectors. The first vector contains the column indices of cancer samples, and the second the column indices of healthy samples.

Script 7.
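A sketch of this step, deriving the two index vectors from a simulated disease-status column (in the real analysis, the status comes from the clinical table in GSE19711-clinical.txt):

```r
# One status entry per sample column of betas.genes (simulated here).
status <- c("cancer", "control", "cancer", "cancer", "control")
sample.sets <- list(cancer  = which(status == "cancer"),
                    control = which(status == "control"))
sample.sets$cancer    # column indices 1, 3, 4
```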
4.2.4. Step 4

With this definition of sets, the Wilcoxon rank sum test (introduced in Subheading 2) is applied to every row of betas.genes, that is, to every gene. The resulting p-values are stored in a matrix gene.p.values. Every column in the matrix corresponds to a gene. The first row of the matrix lists the p-values for hypomethylation, and the second row those for hypermethylation of the respective gene. A p-value reflects the probability of observing a difference at least as extreme by chance alone. If an individual gene is tested, a p-value of 0.001 is highly significant. If multiple genes are tested, this level loses its significance. For instance, in a set of 20,000 genes, we can expect to observe 20 genes with a p-value below 0.001 by chance alone. In Subheading 3, EpiGRAPH automatically provided two alternative multiple-testing correction procedures that vary in their strictness. As the objective of the current analysis is to filter only the most promising candidates, the more conservative Bonferroni method is applied in the last line of the code snippet.
Note that gene.p.values is transposed as a side effect of the last command. Therefore, genes are represented by rows and methylation states by columns.

Script 8.
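The following toy sketch mirrors the step described above: one-sided Wilcoxon tests per gene, a Bonferroni correction by multiplying with the number of tests, and a final t() that leaves genes in rows (simulated data; the original script may differ in detail):

```r
set.seed(1)
# Two toy genes: GENE_A is clearly hypomethylated in "cancer", GENE_B is not.
betas.genes <- rbind(GENE_A = c(runif(10, 0.0, 0.2), runif(10, 0.7, 0.9)),
                     GENE_B = runif(20, 0.4, 0.6))
cancer <- 1:10; healthy <- 11:20
# Rows of 'raw': hypo and hyper p-values; columns: genes.
raw <- apply(betas.genes, 1, function(b) c(
  hypo  = wilcox.test(b[cancer], b[healthy], alternative = "less")$p.value,
  hyper = wilcox.test(b[cancer], b[healthy], alternative = "greater")$p.value))
# Bonferroni correction, then transpose: genes in rows, states in columns.
gene.p.values <- t(pmin(raw * length(raw), 1))
gene.p.values["GENE_A", "hypo"] < 0.05
```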

By convention, corrected p-values below 0.05 are considered significant. The snippet below uses this threshold to create a list of all OV-associated genes. First, the matrix gene.p.values is extended, indicating for every gene whether it is disease associated (hypo- or hypermethylated).

Script 9.
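A sketch of collecting significant genes into disease.genes, with a hand-made gene.p.values matrix (genes in rows, columns hypo/hyper, as produced in Step 4):

```r
gene.p.values <- rbind(GENE_A = c(hypo = 0.001, hyper = 1.000),
                       GENE_B = c(hypo = 1.000, hyper = 0.200))
disease.genes <- list(
  hypo    = rownames(gene.p.values)[gene.p.values[, "hypo"]  < 0.05],
  hyper   = rownames(gene.p.values)[gene.p.values[, "hyper"] < 0.05],
  disease = rownames(gene.p.values)[apply(gene.p.values < 0.05, 1, any)])
disease.genes$disease   # "GENE_A"
```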

4.3. Relationship Between Disease and CpG Island Association

The list disease.genes contains three vectors of gene names: all hypomethylated, all hypermethylated, and all disease-associated genes. The third vector is effectively the union of the first two gene sets. In total, we identified 650 hypomethylated and 93 hypermethylated genes in the ovarian cancer samples.
Ovarian cancer association is a property of a gene: some genes have this property while others do not. An essential aspect of bioinformatics is determining relationships between seemingly different properties. In this section, we are going to test if the properties OV association and promoter CGI status are related in the group of orthologous genes. For this purpose, we use the list of orthologous genes obtained using Galaxy in the first section of this chapter.
First, we import the orthologous-genes.txt file into our R session. Next, we explicitly specify the data types of each of the 12 columns. Note that the attributes for human (first six columns) are in the same order as the corresponding attributes in the last six mouse columns; hence, the vector of data types passed to read.table() below contains only six elements. It is automatically recycled to specify the classes for all columns.

Script 10.
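The recycling behavior of colClasses can be sketched with a toy 12-column table read from a text connection (the column semantics below are assumptions, not the real file layout):

```r
txt <- c("h1 h2 h3 h4 h5 h6 m1 m2 m3 m4 m5 m6",
         "GENE1 chr1 100 200 CGI + Gene1 chr2 300 400 nonCGI +")
classes <- c("character", "character", "integer", "integer",
             "character", "character")
# The six classes are recycled over the 12 columns by read.table().
orthologs <- read.table(textConnection(paste(txt, collapse = "\n")),
                        header = TRUE, colClasses = classes)
sapply(orthologs, class)[3:4]   # "integer" "integer"
```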

As pointed out above, the Illumina Infinium assay covers a set of approximately 14,000 human genes. This set differs from the list of orthologous genes loaded in the code snippet above. In order to check for a correlation between CGI status and disease association, we need to inspect the genes common to both studies. Using the
code snippet below, we obtain a list of gene names common to the two groups. The command for sorting the names affects only the output, should the list of common genes be written out (e.g., stored in a file); sorting does not affect the results of the analyses that follow.

Table 3
Contingency table of the properties OV association and promoter CpG island status

                 OV associated          Not OV associated
             Observed  Expected       Observed  Expected      Total
  CGI            96       120           1,882     1,858       1,978
  Non-CGI        70        46             686       710         756
  Total         166                     2,568                 2,734

Values in the table are numbers of genes

Script 11.
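A toy sketch of the intersection step (both gene name vectors are simulated stand-ins for the Infinium gene set and the ortholog list):

```r
infinium.genes <- c("GENE_B", "GENE_A", "GENE_C")   # covered by the assay
ortholog.genes <- c("GENE_C", "GENE_D", "GENE_B")   # human-mouse orthologs
common.genes <- sort(intersect(infinium.genes, ortholog.genes))
common.genes   # "GENE_B" "GENE_C"
```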

The last step in this analysis performs Fisher's exact test in order to check for a significant correlation between CGI promoter status and OV association among the set of orthologous genes.

Script 12.
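The test can be sketched directly on the observed counts of Table 3; the one-sided alternative = "less" corresponds to the test for underrepresentation:

```r
# 2x2 contingency table of observed counts from Table 3.
tab <- matrix(c(96, 70, 1882, 686), nrow = 2,
              dimnames = list(CGI = c("CGI", "nonCGI"),
                              OV  = c("OV", "notOV")))
p <- fisher.test(tab, alternative = "less")$p.value
p   # about 2.3e-05, as reported in the text
```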

In the test for negative correlation (underrepresentation), the reported p-value of 2.3e-05 indicates strong statistical significance. This result suggests that differentially methylated genes in ovarian cancer tend to have promoters that do not overlap with CGIs. The exact numbers of genes with the studied properties are given in Table 3.
Genes with a CGI promoter that exhibit disease-specific promoter methylation provide potential diagnostic markers or targets of epigenetic drugs. Because the mouse is an advantageous model system for investigating the effect of epigenetic drugs on gene regulation, human disease-linked genes that have orthologs in mouse and possess a conserved promoter type are preferred candidates for in-depth functional studies. Therefore, we can build on the insights into the comparative epigenetics of these promoters gained in the last section.
The code snippet below extracts all OV-associated human genes that can be considered priority candidates for studying cancer-associated epigenetic deregulation and epigenetic drug treatment in the mouse model system. We use the information obtained in Subheading 3 about the methylation state in mouse, which was saved to mm9-promoter-methylation.txt. The resulting filtered list of genes is saved to a file named target-genes.txt.

Script 13.
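A toy sketch of this final filter (simulated stand-ins for disease.genes and the mouse methylation table; the real file and column names will differ):

```r
disease <- c("GENE_A", "GENE_C")                        # OV-associated genes
mouse   <- data.frame(human     = c("GENE_A", "GENE_B", "GENE_C"),
                      conserved = c(TRUE, TRUE, FALSE),  # promoter type conserved?
                      stringsAsFactors = FALSE)
targets <- intersect(disease, mouse$human[mouse$conserved])
writeLines(targets, tf <- tempfile())   # stands in for target-genes.txt
readLines(tf)   # "GENE_A"
```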

4.4. Summary

In this section, we applied the Wilcoxon rank sum test gene-wise, comparing the methylation profiles of healthy ovary and ovarian cancer tissues. We identified a list of differentially methylated gene promoters. The majority of these promoters are hypomethylated in cancer. We also observed that the methylation state of non-CGI promoters is preferentially modified in ovarian cancer cells. As a final outcome, we created a list of genes that are suitable candidates for studying epigenetic deregulation associated with ovarian cancer. These are the genes that exhibit significant hypo- or hypermethylation in ovarian cancer and have similarly regulated orthologs in mouse.

5. Concluding Remarks
In this chapter, we outlined methods and tools for comparative
epigenomics analysis in the context of genome evolution and
human diseases. The combination of Web-based tools is becoming
increasingly powerful and provides a productive start into epigenome data analysis. As users become more experienced, it is a natural
extension that they start learning a scripting language (e.g., R or
Python) that can often be combined with Web services to perform
more advanced and individualized data analysis tasks. A biologist
equipped with ever more powerful Web-based tools and basic
scripting skills will be in a good position to capitalize on the
increasing wealth of public epigenome datasets.
We point out that each of these tools has not only certain advantages but also drawbacks. Galaxy and EpiGRAPH offer easy access to powerful operations on genome-wide datasets, while simple text manipulations can usually be performed more efficiently in text editors or by manual programming in any common programming language. A statistical programming language like R, on the other hand, excels in the analysis of prepared datasets, while the integration of multiple annotations can be laborious and requires a certain level of expertise. For studies in computational biology in general, it is therefore advisable to familiarize oneself with a wide spectrum of tools and to gain expertise in transferring data between them.

6. Exercises
1. In Subheading 2 of this chapter, we used a published set of orthologous genes as the starting point of our analysis, and we used the similarity of the gene symbols in human and mouse to map them to each other. To generalize this approach, the reader should explore ways to repeat this step with a comprehensive set of human genes and use the LiftOver tool to map them to the mouse genome. This yields a larger but noisier set of putative homologous genes. How can we identify unconserved genes? Considering the lessons from Chap. 9 of volume 1 (32), can you discriminate between orthologous and paralogous genes? Does the larger gene set influence the statistics on promoter-type conservation?
2. In Subheading 3, we used loose thresholds for classifying a promoter as methylated or unmethylated. We then observed that gene promoters overlapping with a CGI in human, but not in mouse, still appear to be mostly unmethylated and epigenetically active in mouse. Use the script from this section to test whether this observation still holds if stricter thresholds of 0.25 and 0.75 are applied.
3. In Subheading 4, the methylation value for each gene in each sample is obtained by averaging over all probes that correspond to the gene's promoter region. However, the methylation values of the probes might differ drastically. In such a case, the average value is probably an unreliable estimate of the gene's promoter methylation. Write an R script that filters out every gene with multiple probes for which the methylation values in at least one sample differ by 0.5 or more.
4. In Subheading 4, the Wilcoxon rank sum test is applied in order to obtain a p-value for OV association for every gene. Use the R function ks.test to apply the Kolmogorov-Smirnov (KS) test instead. Inspect to what extent the resulting gene associations change compared to applying the Wilcoxon rank sum test. Why is the KS test inappropriate if we need to find genes with differential methylation?


Acknowledgment
The contribution of Y.A. was partially supported by the EU STREP CancerDIP (EU grant HEALTH-F2-2007-200620).
References
1. Jaenisch, R., and Bird, A. (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals, Nat Genet 33 Suppl, 245–254.
2. Bird, A. (2002) DNA methylation patterns and epigenetic memory, Genes Dev 16, 6–21.
3. Novik, K. L., Nimmrich, I., Genc, B., Maier, S., Piepenbrock, C., Olek, A., and Beck, S. (2002) Epigenomics: genome-wide study of methylation phenomena, Current Issues in Molecular Biology 4, 111–128.
4. Noushmehr, H., Weisenberger, D. J., Diefes, K., Phillips, H. S., Pujara, K., Berman, B. P., Pan, F., Pelloski, C. E., Sulman, E. P., Bhat, K. P., Verhaak, R. G., Hoadley, K. A., Hayes, D. N., Perou, C. M., Schmidt, H. K., Ding, L., Wilson, R. K., Van Den Berg, D., Shen, H., Bengtsson, H., Neuvial, P., Cope, L. M., Buckley, J., Herman, J. G., Baylin, S. B., Laird, P. W., and Aldape, K. (2010) Identification of a CpG island methylator phenotype that defines a distinct subgroup of glioma, Cancer Cell 17, 510–522.
5. Figueroa, M. E., Lugthart, S., Li, Y., Erpelinck-Verschueren, C., Deng, X., Christos, P. J., Schifano, E., Booth, J., van Putten, W., Skrabanek, L., Campagne, F., Mazumdar, M., Greally, J. M., Valk, P. J., Lowenberg, B., Delwel, R., and Melnick, A. (2010) DNA methylation signatures identify biologically distinct subtypes in acute myeloid leukemia, Cancer Cell 17, 13–27.
6. Yi, J. M., Dhir, M., Van Neste, L., Downing, S. R., Jeschke, J., Glockner, S. C., de Freitas Calmon, M., Hooker, C. M., Funes, J. M., Boshoff, C., Smits, K. M., van Engeland, M., Weijenberg, M. P., Iacobuzio-Donahue, C. A., Herman, J. G., Schuebel, K. E., Baylin, S. B., and Ahuja, N. (2011) Genomic and Epigenomic Integration Identifies a Prognostic Signature in Colon Cancer, Clin. Cancer Res. 17, 1535–1545.
7. Bock, C., Kiskinis, E., Verstappen, G., Gu, H., Boulting, G., Smith, Z. D., Ziller, M., Croft, G. F., Amoroso, M. W., Oakley, D. H., Gnirke, A., Eggan, K., and Meissner, A. (2011) Reference Maps of Human ES and iPS Cell Variation Enable High-Throughput Characterization of Pluripotent Cell Lines, Cell 144, 439–452.
8. Shen, J. C., Rideout III, W. M., and Jones, P. A. (1994) The rate of hydrolytic deamination of 5-methylcytosine in double-stranded DNA, Nucleic Acids Research 22, 972–976.
9. Pfeifer, G. (2006) Mutagenesis at Methylated CpG Sequences, in DNA Methylation: Basic Mechanisms, pp 259–281.
10. Chahwan, R., Wontakal, S. N., and Roa, S. (2010) Crosstalk between genetic and epigenetic information through cytosine deamination, Trends in Genetics 26, 443–448.
11. Weber, M., Hellmann, I., Stadler, M. B., Ramos, L., Pääbo, S., Rebhan, M., and Schübeler, D. (2007) Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome, Nat Genet 39, 457–466.
12. Bock, C., and Lengauer, T. (2008) Computational epigenetics, Bioinformatics 24, 1–10.
13. Blankenberg, D., Kuster, G. V., Coraor, N., Ananda, G., Lazarus, R., Mangan, M., Nekrutenko, A., and Taylor, J. (2010) Galaxy: A Web-Based Genome Analysis Tool for Experimentalists, Vol. 89, John Wiley & Sons, Inc.
14. Bock, C., Halachev, K., Büch, J., and Lengauer, T. (2009) EpiGRAPH: User-friendly software for statistical analysis and prediction of (epi-)genomic data, Genome Biol 10, R14.
15. Jiang, C., Han, L., Su, B., Li, W.-H., and Zhao, Z. (2007) Features and Trend of Loss of Promoter-Associated CpG Islands in the Human and Mouse Genomes, Molecular Biology and Evolution 24, 1991–2000.
16. Bruford, E. A., Lush, M. J., Wright, M. W., Sneddon, T. P., Povey, S., and Birney, E. (2008) The HGNC Database in 2008: a resource for the human genome, Nucleic Acids Research 36, D445–D448.
17. Takai, D., and Jones, P. A. (2002) Comprehensive analysis of CpG islands in human chromosomes 21 and 22, Proc Natl Acad Sci USA 99, 3740–3745.
18. Hastie, T., Tibshirani, R., and Friedman, J. H. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, New York.
19. Bock, C., Kuster, G. V., Halachev, K., Taylor, J., Nekrutenko, A., and Lengauer, T. (2009) Web-based analysis of (epi-)genome data using EpiGRAPH and Galaxy, Methods in Molecular Biology 628, 275–296.
20. Matsuo, K., Clay, O., Takahashi, T., Silke, J., and Schaffner, W. (1993) Evidence for erosion of mouse CpG islands during mammalian evolution, Somat Cell Mol Genet 19, 543–555.
21. Corder, G. W., and Foreman, D. I. (2009) Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach, John Wiley & Sons.
22. Shaffer, J. P. (1995) Multiple hypothesis testing, Annu. Rev. Psychol. 46, 561–584.
23. Bock, C., Walter, J., Paulsen, M., and Lengauer, T. (2007) CpG island mapping by epigenome prediction, PLoS Comput Biol 3, e110.
24. Gu, H., Bock, C., Mikkelsen, T. S., Jager, N., Smith, Z. D., Tomazou, E., Gnirke, A., Lander, E. S., and Meissner, A. (2010) Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution, Nat Meth 7, 133–136.
25. Meissner, A., Mikkelsen, T. S., Gu, H., Wernig, M., Hanna, J., Sivachenko, A., Zhang, X., Bernstein, B. E., Nusbaum, C., Jaffe, D. B., Gnirke, A., Jaenisch, R., and Lander, E. S. (2008) Genome-scale DNA methylation maps of pluripotent and differentiated cells, Nature 454, 766–770.
26. Yoo, C. B., and Jones, P. A. (2006) Epigenetic therapy of cancer: past, present and future, Nat Rev Drug Discov 5, 37–50.
27. Bibikova, M., Le, J., Barnes, B., Saedinia-Melnyk, S., Zhou, L., Shen, R., and Gunderson, K. L. (2009) Genome-wide DNA methylation profiling using Infinium assay, Epigenomics 1, 177–200.
28. Weisenberger, D. J., Berg, D. V. D., Pan, F., Berman, B. P., and Laird, P. W. (2008) Comprehensive DNA Methylation Analysis on the Illumina Infinium Assay Platform [http://www.illumina.com/downloads/InfMethylation_AppNote.pdf].
29. Teschendorff, A. E., Menon, U., Gentry-Maharaj, A., Ramus, S. J., Weisenberger, D. J., Shen, H., Campan, M., Noushmehr, H., Bell, C. G., Maxwell, A. P., Savage, D. A., Mueller-Holzner, E., Marth, C., Kocjan, G., Gayther, S. A., Jones, A., Beck, S., Wagner, W., Laird, P. W., Jacobs, I. J., and Widschwendter, M. (2010) Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer, Genome Research 20, 440–446.
30. Eckhardt, F., Lewin, J., Cortese, R., Rakyan, V. K., Attwood, J., Burger, M., Burton, J., Cox, T. V., Davies, R., Down, T. A., Haefliger, C., Horton, R., Howe, K., Jackson, D. K., Kunde, J., Koenig, C., Liddle, J., Niblett, D., Otto, T., Pettett, R., Seemann, S., Thompson, C., West, T., Rogers, J., Olek, A., Berlin, K., and Beck, S. (2006) DNA methylation profiling of human chromosomes 6, 20 and 22, Nat Genet 38, 1378–1385.
31. Besenbacher, S., Mailund, T., and Schierup, M. (2012) Association mapping and disease: evolutionary perspectives. In Anisimova, M. (ed.), Evolutionary Genomics: Statistical and Computational Methods (volume 1), Methods in Molecular Biology, Springer Science+Business Media, New York.
32. Altenhoff, A. M., and Dessimoz, C. (2012) Inferring orthology and paralogy. In Anisimova, M. (ed.), Evolutionary Genomics: Statistical and Computational Methods (volume 1), Methods in Molecular Biology, Springer Science+Business Media, New York.

Chapter 19
Genetical Genomics for Evolutionary Studies
Pjotr Prins, Geert Smant, and Ritsert C. Jansen
Abstract
Genetical genomics combines acquired high-throughput genomic data with genetic analysis. In this chapter, we discuss the application of genetical genomics for evolutionary studies, where new high-throughput molecular technologies are combined with mapping quantitative trait loci (QTL) on the genome in segregating populations.
The recent explosion of high-throughput data (measuring thousands of proteins and metabolites, deep sequencing, chromatin and methyl-DNA immunoprecipitation) allows the study of the genetic variation underlying quantitative phenotypes, together termed xQTL. At the same time, mining information is not getting easier. To deal with the sheer amount of information, powerful statistical tools are needed to analyze multidimensional relationships. In the context of evolutionary computational biology, a well-designed experiment may help dissect a complex evolutionary trait using proven statistical methods for associating phenotypic variation with genomic locations.
Evolutionary expression QTL (eQTL) studies of the last years focus on gene expression adaptations, mapping the gene expression landscape, and, tentatively, eQTL networks. Here, we discuss the possibility of introducing an evolutionary prior, in the form of gene families displaying evidence of positive selection, and using it in the context of an eQTL experiment for elucidating host-pathogen protein-protein interactions. Through the example of an experimental design, we discuss the choice of xQTL platform, analysis methods, and scope of results. The resulting eQTL can be matched, resulting in putative interacting genes and their regulators. In addition, a prior may help distinguish QTL causality from reactivity, or independence of traits, by creating QTL networks.
Key words: Genetical genomics, QTL, eQTL, xQTL, R-genes, Evolution, R/qtl, NGS, Genomics, Metabolomics, Network inference

1. Introduction
Genetics, as it is used here, concerns the study of quantitative, or
complex, traits. A quantitative trait is influenced by multiple factors,
including gene interactions and environmental factors, and typically
does not lead to discrete phenotypes. Many traits of interest, such as
milk production in cattle, response to fertilizer in crops, and most


human, animal, and plant diseases, are complex traits. Associating, or linking, complex traits with certain positions on the genome is achieved through the mapping of so-called quantitative trait loci (QTL).
Mapping QTL in experimental populations is possible when linkage and/or association information is available. When we have a population of individuals with known genotypes, it may be possible to link a phenotype with a certain genotype. To genotype individuals, marker maps are created first. A marker is a known genomic location where the genotype of an individual can be determined. In the early days, the genotype was determined with visible chromosome features, later with restriction fragment length polymorphisms (RFLP) and amplified fragment length polymorphisms (AFLP) (1), and these days increasingly with SNP/haplotype data (2). If, say, all individuals with genotype A at a marker location somewhere on the genome are susceptible to a disease and all individuals with genotype B are not, there is linkage/association, or a QTL. If the effect is clear-cut, it may even be a single-gene effect. When it is not a single-gene effect, significance statistics are required to link phenotype with genotype.
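The marker-phenotype logic described above can be sketched with a minimal simulation: individuals carrying genotype A at a marker have a shifted trait mean, and a simple test detects the linkage (this is only an illustration; dedicated packages such as R/qtl implement proper genome-wide QTL scans):

```r
# Simulate 100 individuals genotyped at one marker; genotype A shifts the trait.
set.seed(7)
geno  <- rep(c("A", "B"), each = 50)
pheno <- rnorm(100, mean = ifelse(geno == "A", 10, 8))
p <- t.test(pheno ~ geno)$p.value
p < 0.01   # significant marker-trait association at this locus
```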
It is also possible to map QTL in natural populations through linkage disequilibrium (LD). LD occurs when certain stretches of the genome (haplotypes) show nonrandom behavior, based on allele frequencies and recombination. Associating haplotype frequencies with phenotypes potentially renders QTL. Kim et al. (3) describe the genome-wide pattern of LD in a sample of 19 Arabidopsis thaliana accessions using SNP microarrays. Dixon et al., for example, test LD to globally map the effect of polymorphism on gene expression in 400 children from families recruited through a proband with asthma (4).
The use of the terms "association" and "linkage" can be confusing, even in the literature. Here, we use "association" for haplotypes in natural populations of unrelated individuals, and "linkage" for markers in experimental populations. Note that, in Dixon et al., individuals are related, i.e., some within-family linkage information is available for the 400 children from 206 families.
Statistical power can be increased by using experimental crosses instead of natural populations. For example, recombinant inbred lines (RILs) are homozygous at every genomic location, simplifying genetics and increasing statistical power at the same time. For model organisms, such as A. thaliana, Caenorhabditis elegans, Drosophila melanogaster, and Mus musculus, genotyped experimental crosses are available; i.e., for these species, it is not always necessary to generate a new cross. Compared with natural populations, experimental crosses may introduce some bias, for example with recessive lethal alleles. Also, these individuals are rarely 100% homozygous. Finally, populations that have been maintained for some time will likely contain genotyping errors; we have evidence that 4% line swaps can be expected due to human error, plus mutations over generations. Data analysis should account for such sources of bias (63).
Genetical genomics combines genetics with high-throughput molecular technologies. In 2001, Jansen and Nap (5) coined the term "genetical genomics" for mapping QTL in segregating populations with gene expression as a phenotype. Combining gene expression, as measured by microarray probes, with linkage leads to gene expression QTL (eQTL). Such eQTL studies elucidate how genotypic variation underlies, for example, morphological phenotypes, by using gene expression levels as intermediate molecular phenotypes. In other words, the expression level as measured by a microarray probe, or probe set, is treated as a phenotype, a gene expression trait. This phenotype is associated with the genome in the form of one or more eQTL. With microarrays, the probe represents a known gene and, therefore, a genomic location. Therefore, expression phenotype and probe connect two types of genomic information: eQTL location(s) and gene location. It is usually assumed that eQTL loci represent cis- or trans-transcription regulators of the target gene (6). If the eQTL is located close to the gene on the genome, the eQTL may point to a cis-regulator. If the eQTL is located far from the gene on the genome, the eQTL may point to a trans-regulator of a single gene or even trans-bands for multiple regulated genes (7).
In a similar fashion, the abundance of thousands of proteins and metabolites can be measured to map protein QTL (pQTL) and metabolite QTL (mQTL). Deep sequencing, chromatin, and methyl-DNA immunoprecipitation are just a few of the latest technologies that add to the arsenal of tools available for the study of the genetic variation underlying quantitative phenotypes. Together, eQTL, mQTL, and pQTL are referred to as xQTL. Different xQTL appear to confirm each other, for example, with the A. thaliana glucosinolate pathway (8). Such causal inference can lead to dissecting pathways and gene networks, currently an active field of research.
1.1. Evolutionary xQTL Studies

From the perspective of evolutionary biology, genetical genomics has been applied to elucidate evolutionary adaptations of transcript regulation. For example, Fraser et al. introduced a test for lineage-specific selection and analyzed the directionality of microarray eQTL for 112 haploid segregants of a genetic cross between two strains of the budding yeast Saccharomyces cerevisiae, reanalyzing the two-color cDNA microarray data of Brem and Kruglyak (9). They found that hundreds of gene expression levels have been subject to lineage-specific selection. Comparing these findings with independent population genetic evidence of selective sweeps suggests that this lineage-specific selection has resulted in recent sweeps at over a hundred genes, most of which led to increased transcript levels. Fraser et al. (10) suggest that adaptive evolution of gene expression is common in yeast, that regulatory adaptation can occur at the level of entire pathways, and that similar genome-wide scans may be possible in other species, including humans.


In another S. cerevisiae study, Zou et al. (11), by reanalyzing the same two-color cDNA microarray data, uncovered genetic regulatory network divergence between duplicate genes. They found evidence that the regulation of the ancestral gene has diverged since gene duplication.
Li et al. studied plasticity of gene expression in C. elegans using a set of 80 RILs generated from a cross of N2 (Bristol) and CB4856 (Hawaii), representing two genetic and ecological extremes of C. elegans. They found that differential expression induced in an RIL population by temperatures of 16°C and 24°C has a strong genetic component. With a group of trans-genes, there was prominent evidence for a common master regulator: a trans-band of 66 coregulated genes appeared at 24°C. The results suggest widespread genetic variation of differential expression responses to environmental impacts and demonstrate the potential of genetical genomics for mapping the molecular determinants of phenotypic plasticity (7), leading to a more generalized genetical genomics, where value is added from environmental perturbation (12).
Kliebenstein et al. detected significant gene network variation in 148 RILs originating from a cross between two A. thaliana accessions, Bay-0 and Shahdara. They were able to identify eQTL controlling network responses for 18 out of 20 a priori defined gene networks, representing 239 genes (13).
According to Gilad, eQTL studies show that (1) variation in
gene expression levels is both widespread and highly heritable; (2)
gene expression levels are highly amenable to genetic mapping; and
(3) most strong eQTL are found near the target gene, suggesting
that variation in cis-regulatory elements underlies much of the
observed variation in gene expression levels (14). Meanwhile,
Alberts et al. (15) suggest that sequence polymorphisms may
cause many false cis eQTL, which should be accounted for.
1.2. Adding a Prior

QTL link complex traits with one or more locations on the genome (Fig. 1). Such a location is a wide region, because a QTL is a statistical estimate and rarely a precise indicator. On the genome, a single QTL may represent tens, hundreds, or even thousands of real genes. Combining the QTL with high-throughput technologies, such as microarrays, can add information. To zoom in on the genes underlying a QTL, information from other sources can be utilized. Such a priori knowledge could consist of results from traditional linkage studies or association studies of, for example, human disease. That way, one can assign a specific regulatory role to polymorphic sites in a genomic region known to be associated with disease (14). Other useful priors can be the existing information on gene ontology terms, metabolic pathways, and protein-protein interactions, which can be used to identify genes and pathways (16), provided these databases are sufficiently informative.


Fig. 1. In this hypothetical and schematic example, related to mapped locations on a chromosome, prior information is combined with multiple phenotype-genotype QTL mappings to zoom in on genomic areas and to reason about causal relations between different layers of information. (a) The prior (red area on the chromosome) points out that certain sections are of interest; these sections consist of related genes with high homology showing evidence of positive selection, as discussed in the main text. The blue double arrow points out the confidence interval for each QTL, above the significance threshold (red dotted line). The accumulated evidence (light blue areas) leads to a narrowed-down section on the genome, where in this case the prior information is the most specific. In addition, A and B point to the exact gene locations (dotted line, based on the exact probe information). (b) To infer causal relationships, network inference is possible. On the left (vertical I), traits A, B, and D map to one hot spot, where A may be a regulator of B, as one QTL is shared. B causes metabolite C, again a shared QTL. Phenotype D matches A and B, and phenotype E matches A, B, and C. These causal relationships are drawn by arrows. The figure suggests that, while individual QTL are not very informative, accumulated evidence, including a prior, starts to paint a picture.

474

P. Prins et al.

Zou et al. (11), for example, used gene ontology as a prior and
concluded that trans-acting eQTL divergence between duplicate
pairs is related to a fitness defect under treatment conditions, but
not to fitness under normal conditions.
Chen et al. (17) identified strong candidate genes for resistance
to leaf rust in barley, and on the general pathogen response pathway,
using a custom barley microarray on 144 doubled haploid lines of
the St/Mx population. In total, 15,685 eQTL were mapped from
9,557 genes. Correlation analysis identified 128 genes that were
correlated with resistance, of which 89 had eQTL colocating with the
phenotypic QTL (phQTL) or classic QTL. Transcript abundance in
the parents and conservation of synteny with rice prioritized six
genes as candidates for Rphq11, the phQTL of largest effect (17).
1.3. Evidence of Positive Selection as the Prior

In this chapter, we discuss the steps needed to design an xQTL
experiment, to make the use of genetical genomics in evolutionary
studies more concrete. As the prior, we add information on plant
host genes showing evidence of positive selection.

2. Designing an Evolutionary xQTL Experiment

An experimental design based on genetical genomics can highlight
sections of the genome showing correlation with an evolutionary
trait. One such evolutionary trait of interest is plant resistance
against pathogens. Plants have developed mechanisms to defend
themselves against pests. When a pathogen, such as the potato blight
Phytophthora infestans, or a nematode, such as Meloidogyne hapla,
infects a plant, it uses a battery of so-called effectors to help
invade the plant. Some of these effector molecules act to dissolve
cellulose (18). Intriguingly, other molecules are involved in actively
reprogramming plant cells. Such plant pathogen effectors have
been shown to mimic plant transcription factors (19) and switch
on genes that help the pathogen (20). A susceptible plant allows the
pathogen to suppress defense mechanisms and to change cell
configuration. For example, the nematodes M. hapla and Globodera
rostochiensis transform plant cells so that they become elaborate
feeding structures. The genetics of this plant–pathogen interaction is
potentially even relevant for human medicine, as an increased
understanding of host–pathogen relationships may help in understanding
the workings of the innate immune system and helminth
immunomodulation, e.g., refs. 21, 22. The innate immune system
influences susceptibility to infections in all multicellular organisms
and is a much older evolutionary mechanism than the advanced
adaptive immune system of higher organisms. For more on the
family of plant resistance genes (R-genes), see Box 1.


Box 1
R-Genes
Plant resistance genes (R-genes) are a homologous family of
genes, formed by gene duplication events and hypothesized to
be involved in an evolutionary arms race with pathogen effectors.
R-genes are involved in recognizing specific pathogens with
cognate avirulence genes and in initiating defense signaling that
results in disease resistance (25). R-genes are characterized by a
molecular gene-for-gene interaction (26) in which a specific
allele of a disease resistance gene recognizes an avirulence protein
or pathogen allele. This specificity is often encoded, at least
in part, in a relatively fast-evolving leucine-rich-repeat (LRR)
region (27), which consists of a varying number of LRR modules.
Activation of at least some of these proteins is regulated in
trans, as has been shown for RPM1 and RPS2 (28).
A single A. thaliana plant has about 150 R-genes, representing
a subset of the R-genes in the overall population. The protein products
of R-genes are involved in molecular interactions. They generally
have a recognition site which can dock against, i.e., recognize,
one or more specific molecules. The proteins encoded
by the largest class of R-genes carry a nucleotide-binding site LRR
domain (NB-LRR, also referred to as NB-ARC-LRR and NBS-LRR).
NB-LRR R-genes can be further subdivided, based on their
N-terminal structural features, into TIR-NB-LRR, which have
homology to the Drosophila Toll and mammalian interleukin-1
receptors, and CC-NB-LRR, which contain a putative coiled-coil
motif (29). The LRR domain appears to mediate specificity in
pathogen recognition, while the N-terminal TIR, or coiled-coil
motif, is likely to play a role in downstream signaling (27). When
a molecule is docked, the R-protein is able to activate pathways in
the cell, resulting in, for example, a hypersensitive response causing
apoptosis and preventing spread of infection.
Meanwhile, one single R-protein only recognizes one type of
invading molecule. Therefore, through its R-genes, one individual
plant only recognizes a limited number of strains of invading
pathogens, as the individual pathogens have variation in their effectors
too. When a pathogen evolves to use nonrecognized effectors, the
plant becomes susceptible. The success of plant defense is determined
by both evolution and the variation of specificity in a
population. Unlike the evolved mammalian immune system, which
can change in a living organism and learn about invasions on the
fly (30), plant R-genes depend on the variation inside a gene pool
to provide the resistance against a pathogen; see, for example,
Holub (31). Even so, many genes involved in pathogen
recognition undergo rapid adaptive evolution (24), and studies
have found that A. thaliana R-genes show evidence of positive
selection, e.g., refs. 32–34.


In this chapter, we do not want to limit ourselves to (known)
R-genes. Plants have evolved a complex array of chemical and
enzymatic defenses, both constitutive and inducible, that are not
involved in pathogen detection but whose effectiveness influences
pathogenesis and disease resistance. The genes underlying these
defenses comprise a substantial portion of the host genome.
Based on genomic sequencing, it is estimated that some 14% of
the 21,000 genes in Arabidopsis are directly related to defense (23).
Most of these genes are not involved in pathogen detection, but
their products may interact molecularly and directly with pathogen
proteins or protein products. Among these proteins, for example,
are chitinases and endoglucanases that attack and degrade the
cell walls of pathogens, and which pathogens counterattack with
inhibitors. Such systems of antagonistically interacting proteins
provide the opportunity for molecular coevolution of individual
systems of attack and resistance (24).
Here, we design an experiment looking for all gene families
showing evidence of positive selection. This information is the prior
for eQTL analysis: combining known genomic locations of gene
families with eQTL locations derived from gene expression variation
in a host–pathogen interaction experiment, which hopefully
results in zooming in on gene families involved in plant resistance.
The prior adds statistical power in locating putative gene families
involved in host–pathogen coevolution (Fig. 1). Note that, in this
chapter, the term "interaction" is used in two ways. The first is QTL
interaction, where two QTL on the genome interact statistically.
The second is host–pathogen gene-for-gene interaction, where
gene products from different species interact physically.
2.1. Create a Prior with PAML

To create the prior, we use Yang's Codeml implementation of
phylogenetic analysis by maximum likelihood (PAML) (35).
PAML can find amino acid sites which show evidence of positive
selection using the dN/dS ratio, the ratio of nonsynonymous
to synonymous substitution rates; see also Chapter 5 of this volume
on "Selection on the Protein Coding Genome" (36). The calculation
of maximum likelihood for multiple evolutionary models is
computationally expensive, and executing PAML over an alignment
of a hundred sequences may take hours, sometimes days, on a PC.
To scale up calculations for a genome-wide search, we run calculations
in parallel on multiple PCs, as described in Chapter 22
of this volume (37). The software for generating the prior is prepackaged
on BioNode, including BLAST (38), Muscle (39), pal2nal
(40), PAML (35), and BioRuby (41). Data, software description,
and downloads are available in the supplemental data online.
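Codeml is driven by a plain-text control file. As a sketch (the file names are hypothetical, and only a handful of codeml's many control-file options are shown; the rest keep PAML defaults), a control file for the site-model runs used below could be generated like this:

```python
def codeml_ctl(seqfile, treefile, outfile, ns_sites):
    """Render a minimal codeml control file for PAML site-model runs.

    ns_sites: "0 3" for the M0 and M3 models, "7 8" for M7 and M8.
    """
    options = [
        ("seqfile", seqfile),    # codon alignment (e.g., produced by pal2nal)
        ("treefile", treefile),  # phylogenetic tree for the family
        ("outfile", outfile),    # codeml's main results file
        ("seqtype", "1"),        # 1 = codon sequences
        ("model", "0"),          # one omega ratio across branches
        ("NSsites", ns_sites),   # which site models to fit
    ]
    return "\n".join(f"{key} = {value}" for key, value in options) + "\n"


# Hypothetical file names for one gene family; write this string to
# codeml.ctl and run `codeml codeml.ctl` on the cluster.
ctl = codeml_ctl("family1.phy", "family1.tree", "family1.mlc", "7 8")
```

One control file per family and per model pair keeps the cluster jobs independent, which suits the parallel setup described above.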
It is possible to find nonoverlapping large gene families by using
blastclust, a tool that is part of the BLAST tool set (42). After
fetching the A. thaliana cDNA sequences from the Arabidopsis
Information Resource (TAIR) (43), convert the sequences to a


protein BLAST database format. Based on a homology criterion,
the identity score, genes are clustered into putative gene families by
running blastclust with 70% amino acid sequence identity. Note that
this percentage identity may not recover all families and will leave out
a number of genes; it is used here for demonstration purposes only.
For A. thaliana, such a genome-wide search finds at least 60 gene
families, including some R-gene families.
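The clustering step can be mimicked in a few lines (a sketch, not blastclust itself: we assume pairwise identity scores have already been computed, e.g., parsed from BLAST output, and apply single-linkage clustering at the 70% threshold):

```python
def cluster_families(names, identity, threshold=0.70):
    """Single-linkage clustering of sequences into putative families:
    two sequences join the same cluster when their pairwise identity
    meets the threshold (blastclust-style; 70% here).

    identity: dict mapping frozenset({a, b}) -> fraction identity.
    """
    parent = {n: n for n in names}

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pair, score in identity.items():
        if score >= threshold:
            a, b = pair
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=len, reverse=True)


# Toy input: g1-g2 and g2-g3 are similar enough; g4 stays isolated.
names = ["g1", "g2", "g3", "g4"]
identity = {
    frozenset({"g1", "g2"}): 0.85,
    frozenset({"g2", "g3"}): 0.72,
    frozenset({"g1", "g4"}): 0.30,
}
clusters = cluster_families(names, identity)
```

Single linkage is the reason the threshold choice matters: one borderline pair can merge two otherwise distinct families, which is why the 70% cutoff above is only a demonstration value.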
After aligning all family sequences, use PAML's Codeml to find
evidence of positive selection in the gene families. Muscle is used to
align the amino acid sequences and to create a phylogenetic tree.
Next, pal2nal creates codon alignments, which can be used by
PAML. Finally, run PAML's Codeml M0–M3 and M7–M8
tests in a computing cluster environment using, for example, BioNode
and the rq job scheduler, described in Chapter 22 of this
volume (37). An M0–M3 χ2 test finds that 43 gene families (out of
60) show significant evidence of positive selection; M7–M8, meanwhile,
finds 35 gene families. Therefore, based on the described
procedure, approximately half the families show significant evidence
of positive selection and can be considered candidate gene
families involved in host–pathogen interactions. Note that this
figure contains false positives because the evolutionary model may
be too simplistic; see also ref. 44. Nevertheless, these candidate gene
families can be used as an effective filter for further research.
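These model comparisons are likelihood ratio tests: twice the log-likelihood difference between the nested models is compared against a χ2 critical value (df = 4 for M0 vs. M3, df = 2 for M7 vs. M8). A sketch with hypothetical log-likelihoods from two codeml runs:

```python
# Chi-square critical values at alpha = 0.05 for the degrees of freedom
# used here: M0 vs. M3 has df = 4, M7 vs. M8 has df = 2.
CHI2_CRIT_05 = {2: 5.991, 4: 9.488}


def lrt_positive_selection(lnl_null, lnl_alt, df):
    """Likelihood ratio test: the statistic 2 * (lnL1 - lnL0) is
    compared against the chi-square critical value for df degrees
    of freedom."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, stat > CHI2_CRIT_05[df]


# Hypothetical log-likelihoods for one family: M7 (null) vs. M8.
stat, significant = lrt_positive_selection(-2310.4, -2301.2, df=2)
```

Running this per family, and keeping the families where the test is significant, yields the candidate list described above (subject to the false-positive caveat of ref. 44).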
When a gene family displays evidence of positive selection, its
genome locations can be used as a prior for genetical genomics
(Fig. 1). With the full genome sequence of A. thaliana available,
the location of gene families showing evidence of positive selection
is known. For example, in the Columbia-0 (Col-0) ecotype, the
majority of the 149 R-genes are grouped in clusters spanning 2 to
9 loci; the remaining 40 are isolated. Clusters are organized into
so-called superclusters (29, 45). Phylogenetic analysis shows that
such clusters are the result of both old segmental duplications and
recent chromosome rearrangements (29, 46).
2.2. Select a Suitable Experimental Population

To select a suitable experimental population, the choice of parents
is key. Here, we want a descriptive evolutionary prior based on
gene families with known genome locations. This means that one of
the parents has to have a sequenced genome. The choice of parents
for QTL analysis is normally based on large (classical) phenotypic
differences. For testing pathogen resistance, the choice would
ideally be one susceptible parent and one resistant (nonsusceptible)
parent. For eQTL, the phylogenetic distance can be used when
there is no obvious phenotype. In general, it is a good idea to use
common library strains based on, for example, Columbia (Col),
Landsberg erecta (Ler), Wassilewskija (Ws), or Kashmir (Kas) as
one of the parents, because experimental resources and online
information will be available. In addition, a reference genetic
background is provided in this way, which allows the comparison of


the effects of QTL and mutant alleles (47). A number of RIL
populations can be found through TAIR, a model organism database
providing a centralized, curated gateway to Arabidopsis biology,
research materials, and community (43).
2.3. Which xQTL Technology?

Most published xQTL studies are based on gene expression eQTL
because each gene expression probe provides a direct genomic link.
When it comes to selecting single-color or dual-color arrays, one
consideration may be that two-color arrays have higher efficiency
when using a distant pair design (48).

Deep sequencing technology (RNA-Seq) (49) will soon be
affordable for eQTL studies. The main advantage over microarrays is
improved signal-to-noise ratios, and possibly improved coverage.
Microarrays are noisy partly due to cross-hybridization, e.g., ref. 50,
and have limited signal on low expressors; both facts are detrimental
to significance. Deep sequencing is no panacea, however, since it
accentuates the high expressors. High expressors are expressed
thousands of times higher than low expressors. Low expressors
may lack significance for differential expression. Worse, because
deep sequencing is stochastic, many low expressors may even be
absent. Another point to consider is that currently at least 1 in 1,000
nucleotide base pairs is misread, which makes it harder to disentangle
sequencing error from genetic variation. Only when a sequence
polymorphism is measured many times (say, 20) can it be confirmed
as genetic variation.
The choice of eQTL technology may also take into account that,
in differential gene expression analysis, different
microarray platforms agree with each other, but the overlap between
microarray and deep sequencing results is much lower, suggesting a
technical bias (51). On deep sequencing, the jury is still out; results of
the first deep-sequencing eQTL RIL studies are expected in 2011.

For an example of a metabolite mQTL study, see Keurentjes
et al. (52) and Fu et al. (53). For a study integrating eQTL, pQTL,
mQTL, and classical phQTL, see Fu et al. (54) and Jansen et al. (8).

2.4. Sizing the Experimental Population

The size of the experimental population should be large enough to
give informative results. For classical QTL analysis, the sizing may
be assisted using estimates of the total environmental variance and the
total genetic variance derived from the accessions selected as parents.
Roughly, population sizes of 200 RILs, without replications,
allow detection of large-effect QTL with an explained variance of
10% in confidence intervals of 10–20 cM. Detection of small-effect
QTL, or mapping accuracy below 5%, requires increasing the
population size to at least 300 RILs (47). It is important to note
that QTL mapping accuracy is a function of both marker density and
the number of individuals tested. The promise of extremely dense marker
maps, such as those delivered by SNPs, does not automatically translate to
higher accuracy. It is the number of recombination events in the


population for a particular QTL that limits a QTL's interval size. In
fact, current marker maps, on the order of thousands of (evenly
spread) markers per genome, suit population sizes of a few hundred
RILs. It is a fallacy, for example, to expect higher mapping
power from combining an ultradense SNP map with just 20 individuals.
For high-throughput xQTL, the experimental population
should be sized against an acceptable false discovery rate (FDR).
This can be achieved using a permutation strategy to assess statistical
significance, maintaining the correlation of the expression traits
while destroying any genetic linkages, or associations in natural
populations: marker data is permuted while keeping the correlation
structure in the trait data, as presented by Breitling et al. (55).
Unfortunately, this information differs for every experiment and is
only available afterward! Analyzing a similar experiment, using the
same tissue and data acquisition technology, may give an indication
(54), but when no such material is available a crude estimate may be
obtained by taking the thresholds of a (classic) single-trait QTL experiment
and adjusting them for multiple testing with the Bonferroni correction.
Note that this results in a very conservative estimate.
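The permutation idea can be sketched as follows (a toy illustration, not the method of ref. 55: a simple difference-of-means statistic stands in for a real QTL statistic, and individuals' genotype rows are shuffled as a whole so that the correlation structure among traits is preserved while genotype–trait linkage is destroyed):

```python
import random


def genomewide_threshold(genotypes, traits, n_perm=200, quantile=0.95, seed=1):
    """Permutation threshold for a multi-trait QTL scan.

    genotypes: per-individual lists of 0/1 marker genotypes.
    traits: per-individual lists of trait values (same individual order).
    Returns the empirical `quantile` of the per-permutation maximum
    statistic over all markers and traits.
    """
    n = len(genotypes)
    n_markers = len(genotypes[0])
    n_traits = len(traits[0])

    def scan_max(order):
        # Maximum statistic over all marker/trait pairs for one
        # assignment of genotype rows to individuals.
        best = 0.0
        for m in range(n_markers):
            g = [genotypes[order[i]][m] for i in range(n)]
            for t in range(n_traits):
                a = [traits[i][t] for i in range(n) if g[i] == 1]
                b = [traits[i][t] for i in range(n) if g[i] == 0]
                if a and b:
                    best = max(best, abs(sum(a) / len(a) - sum(b) / len(b)))
        return best

    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)  # permute whole genotype rows, not traits
        null.append(scan_max(order))
    null.sort()
    return null[int(quantile * (len(null) - 1))]


# Tiny toy data set: 6 individuals, 2 markers, 2 traits.
genotypes = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]
traits = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.0], [0.4, 0.3], [0.0, 0.1], [0.5, 0.2]]
thr = genomewide_threshold(genotypes, traits, n_perm=50)
```

An observed statistic exceeding `thr` would then be called significant at the chosen genome-wide level; in practice the statistic would be a LOD score and the population far larger.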
2.5. Analyzing the xQTL Experiment with R/qtl

R/qtl is extensible, interactive, free software for the mapping of xQTL
in experimental crosses. It is implemented as an add-on package for
the widely used statistical language/software R (56). Since its
introduction, R/qtl has become a reference implementation, with an
extensive guide on QTL mapping in the Springer series (57).

R/qtl includes Multiple QTL Mapping (MQM) (58), also by
authors of this chapter, an automated procedure which combines
the strengths of generalized linear model regression with those of
interval mapping. MQM can handle missing data by analyzing
probable genotypes. MQM selects important marker cofactors by
multiple regression and backward elimination. QTL are moved
along the chromosomes using these preselected markers as cofactors.
QTL are interval mapped using the most informative model
through maximum likelihood. MQM for R/qtl brings the following
advantages to QTL mapping: (1) higher power, as long as the
QTL explain a reasonable amount of variation; (2) protection
against overfitting, because MQM fixes the residual variance from
the full model; (3) prevention of ghost QTL detection (between
two QTL in coupling phase); and (4) detection of negating QTL
(QTL in repulsion phase) (58).
MQM for R/qtl brings additional advantages to genetical
genomics data sets with hundreds to millions of traits: (5) a pragmatic
permutation strategy for control of the FDR and prevention
of locating false QTL hot spots, as discussed above; (6) high-performance
computing by scaling on multi-CPU computers, as
well as clustered computers, by calculating phenotypes in parallel
through the message passing interface (MPI) of the SNOW package
for R (59); (7) visualizations for exploring interactions in a


genomic circle plot and cis- and trans-regulation (see figure
cis–trans) (58). A 40-page tutorial for MQM is part of the software
distribution of R/qtl and is available online (60).
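To make the mapping step itself concrete, here is a minimal single-marker regression scan (a sketch only, and not R/qtl or MQM, which are far more sophisticated: it illustrates how a LOD score compares a no-QTL model against per-genotype means at each marker):

```python
import math


def lod_scan(genotypes, trait):
    """Single-marker QTL scan. At each marker, compare the residual
    sum of squares of a single-mean model (no QTL) against group
    means per genotype class, expressed as a LOD score:
        LOD = (n/2) * log10(RSS0 / RSS1)
    """
    n = len(trait)
    mean = sum(trait) / n
    rss0 = sum((y - mean) ** 2 for y in trait)  # no-QTL model
    lods = []
    for m in range(len(genotypes[0])):
        groups = {}
        for i in range(n):
            groups.setdefault(genotypes[i][m], []).append(trait[i])
        rss1 = sum(
            sum((y - sum(g) / len(g)) ** 2 for y in g)
            for g in groups.values()
        )
        if rss1 > 0:
            lods.append((n / 2.0) * math.log10(rss0 / rss1))
        else:
            lods.append(float("inf"))  # perfect fit
    return lods


# Toy RIL-like data: the trait is driven by marker 1, not marker 0.
genotypes = [[0, 0], [0, 1], [1, 0], [1, 1], [0, 0], [1, 1]]
trait = [1.0, 5.1, 0.9, 5.0, 1.1, 4.9]
lods = lod_scan(genotypes, trait)
```

The marker with the highest LOD is the best single-marker QTL candidate; MQM improves on this by using cofactors, interval mapping, and maximum likelihood, as described above.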
2.5.1. Matching the Prior

After detecting eQTL, we have a map of gene regulation in the form
of a cis–trans map. When taking a priori information into account,
i.e., genomic locations derived through other methods, we can
potentially match the genomic locations of genes and gene families
with the eQTL cis–trans map. Until now, there has been no
combined QTL and evolutionary study, involving PAML, for
host–pathogen relationships in plants, though such analyses have been
conducted separately.
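At its simplest, matching the prior amounts to intersecting genomic intervals. A sketch (the tuple layout and coordinates are hypothetical; real data would carry gene-family and probe identifiers and use the confidence intervals from the scan):

```python
def match_prior(prior_regions, eqtls):
    """Match eQTL confidence intervals against prior genomic regions,
    e.g., gene families showing evidence of positive selection.

    Both inputs are lists of (chromosome, start, end) tuples; two
    intervals on the same chromosome match when they overlap.
    """
    hits = []
    for chrom_p, s1, e1 in prior_regions:
        for chrom_q, s2, e2 in eqtls:
            # standard interval-overlap test on a shared chromosome
            if chrom_p == chrom_q and s1 <= e2 and s2 <= e1:
                hits.append(((chrom_p, s1, e1), (chrom_q, s2, e2)))
    return hits


# Hypothetical coordinates: one prior region overlaps one eQTL interval.
prior = [(1, 100, 200), (2, 50, 80)]
eqtls = [(1, 150, 400), (2, 300, 500)]
hits = match_prior(prior, eqtls)
```

Each hit is a candidate where positive selection and expression regulation point at the same stretch of genome (the narrowed-down region of Fig. 1a).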

2.5.2. Combining xQTL Results: Causality, Network Inference

In addition to identifying eQTL or xQTL, it is possible to think in
terms of grouping related traits by correlation. Molecular and
phenotypic traits can be informative for inferring underlying molecular
networks. When two traits share multiple QTL, something that
is not likely to happen at random, inference of a functional relationship
is possible (Fig. 1). Thus, distinguishing trait causality, reactivity,
or independence can be based upon logic involving the underlying
QTL. This was the basic idea in Jansen and Nap (5). Later, people
started to use the biological variation as an extra source for reasoning,
because biological variation in trait A is propagated to B, and not
vice versa, if A affects B. This assumes that there is no hidden trait C
affecting both A and B; see also Li et al. (61).

Mapping phenotypes for thousands of traits is the first step in
attempting to reconstruct gene networks. Not only can network
reconstruction be used within a particular layer, say within eQTL
analysis, i.e., transcript data only, but also across layers. Such interlevel
(system) analysis integrates transcript eQTL, protein pQTL,
metabolite mQTL, and classical QTL (8).
The examination of pairwise correlation between traits can lead
to the hypothesis of a functional relationship when that correlation
is high. Beyond the detected QTL, the correlation between residuals
among traits after accounting for QTL effects, or correlations between
traits conditional on other traits, is further evidence
for a network connection. To infer directional effects, it is necessary
to analyze the correlations among pairs of traits in detail. If trait A
maps to a subset of the QTL of trait B, then the common QTL can
be taken as evidence for their network connection, while the distinct
QTL can be used to infer the direction (see Fig. 1), unless all the
common QTL have widespread pleiotropic effects, i.e., when a
single gene influences multiple traits. If traits A and B have common
QTL, without QTL that are distinct, then the inference is
more complicated and further analysis is needed to discriminate
pleiotropy from any of the possible orderings among traits (8, 61).
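One simple reading of this shared/distinct-QTL logic can be put in code (a deliberate oversimplification: real inference also weighs pleiotropy, hidden traits, and measurement error, as discussed next):

```python
def classify_pair(qtl_a, qtl_b):
    """Classify the relationship between two traits from their QTL
    sets. If A's QTL are a proper subset of B's, A is a candidate
    upstream (causal) trait for B; identical sets leave pleiotropy
    and causality indistinguishable without further analysis; no
    shared QTL suggests independence."""
    shared = qtl_a & qtl_b
    if not shared:
        return "independent"
    if qtl_a < qtl_b:  # proper subset: A's variation propagates to B
        return "A candidate upstream of B"
    if qtl_b < qtl_a:
        return "B candidate upstream of A"
    if qtl_a == qtl_b:
        return "pleiotropy or causality: needs further analysis"
    return "shared and distinct QTL on both sides: connected, direction unclear"


# QTL sets as marker identifiers (hypothetical):
verdict = classify_pair({1, 2}, {1, 2, 3})
```

This is the skeleton behind Fig. 1b: shared QTL argue for a connection, distinct QTL supply the direction.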
Li et al. (61) point out that, despite the exciting possibilities of
correlation analysis, extreme caution is advised, especially in


intralevel analyses, owing to the potential impact of correlated
measurement error (leading to false-positive connections). By
introducing a prior, however, causal inference may become feasible
for realistic population sizes (61). The outcome of a causal
inference on two traits sharing a common QTL may be either
that one is causal for the other or that they are independent.
In the first case, QTL-induced variation is propagated from one
trait to the other, while in the latter case the two traits may even be
regulated by different genes or polymorphisms within the QTL
region, and their apparent relationship (correlation) is explained
by LD and not by a shared biological pathway (61).

3. Discussion
A QTL is a statistical property connecting genotype with phenotype.
In this chapter, we reviewed studies which, with various degrees of
success, combine some type of prior information with xQTL. We
propose that a search for genome-wide evidence of positive selection
can produce a valid and interesting prior for xQTL analysis. This is
achieved by tying genomic locations of putative gene families, possibly
involved in plant–pathogen interactions, to QTL locations
derived from a genetical genomics experiment. Both the eQTL
example and the search for genome-wide evidence of positive
selection pressure are essentially exploratory and result in a list of
putative genes, or gene families, with known genomic locations. The
combined information yields candidate genes and pathways that are
under positive selection pressure and, potentially, involved in
host–pathogen interactions. We explain that it is possible to design
an eQTL experiment using existing experimental populations, e.g.,
an A. thaliana RIL population, and to analyze the results with
existing free and open-source software, such as the R/qtl tool set.
Genetical genomics bridges the study of quantitative traits
with molecular biology and gives new impetus to QTL population
studies. Genetic variation at multiple loci, in combination with
environmental factors, can induce molecular or phenotypic variation.
Variation may manifest itself as linear patterns among traits at different
levels that can be deconstructed. Correlations can be attributed
to detectable QTL, and a logical framework based on common and
distinct QTL and the propagation of biological variation can be
used to infer network causality, reactivity, or independence (61).
Unexplained biological variation can be used to infer direction
between traits that share a common QTL and have no distinct
QTL, though it may be difficult to separate biological from technical
variation. Prior knowledge and complementary experiments, such
as deletion mapping followed by independent gene expression


studies between parental lines, may validate or disprove implicated
network connections (62).
Evolutionary genetical genomics can help dissect the underlying
genetics of pathogen susceptibility in plants. Where Evolutionary
Genetics describes how evolutionary forces shape biodiversity, as
observed in nature, Evolutionary Genetical Genomics describes how
phenotype variation in a population is formed by genotype variation
between, for example, host and pathogen involved in an evolutionary
arms race.
If you want to know more about eQTL, we suggest the review
by Gilad et al. (14), which also discusses eQTL in genome-wide
association studies (GWASs), useful in situations where experimental crosses are not available (such as with many pathogens and
humans). For further reading on R-gene evolution, we recommend
Bakker et al. (27). For R/qtl analysis, we recommend the R/qtl
guide (57) and our MQM tutorial online (60). For integrating
different xQTL methods and causal inference, we recommend
Li et al. (61) and Jansen et al. (8).

4. Questions
1. What is an eQTL, and why does it present two genomic locations?
2. Can a prior, as used here, really add statistical power, or is it no
more than circumstantial evidence?
3. When designing an evolutionary genetical genomics experiment, what are the steps to consider?
4. How can causal inference be used in QTL networks?

Acknowledgments
The European Commission's Integrated Project BIOEXPLOIT
(FOOD-2005-513959 to GS and PP); the Netherlands Organization
for Scientific Research/TTI Green Genetics (1CC029RP to
PP); the EU 7th Framework Programme under the Research Project
PANACEA (222936 to RJ).


References
1. Nandi S, Subudhi P K, Senadhira D et al. (1997) Mapping QTLs for submergence tolerance in rice by AFLP analysis and selective genotyping. Mol Gen Genet. 255:1–8
2. Meaburn E, Butcher L M, Schalkwyk L C & Plomin R (2006) Genotyping pooled DNA using 100K SNP microarrays: a step towards genomewide association scans. Nucleic Acids Res. 34:e27
3. Kim S, Plagnol V, Hu T T et al. (2007) Recombination and linkage disequilibrium in Arabidopsis thaliana. Nat Genet. 39:1151–1155. http://www.ncbi.nlm.nih.gov/pubmed/17676040
4. Dixon A L, Liang L, Moffatt M F et al. (2007) A genome-wide association study of global gene expression. Nat Genet. 39:1202–1207
5. Jansen R C & Nap J P (2001) Genetical genomics: the added value from segregation. Trends Genet. 17:388–391
6. Gibson G & Weir B (2005) The quantitative genetics of transcription. Trends Genet. 21:616–623
7. Li Y, Alvarez O A, Gutteling E W et al. (2006) Mapping determinants of gene expression plasticity by genetical genomics in C. elegans. PLoS Genet. 2:e222
8. Jansen R C, Tesson B M, Fu J, Yang Y & McIntyre L M (2009) Defining gene and QTL networks. Curr Opin Plant Biol. 12:241–246
9. Brem R B & Kruglyak L (2005) The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc Natl Acad Sci USA. 102:1572–1577
10. Fraser H B, Moses A M & Schadt E E (2010) Evidence for widespread adaptive evolution of gene expression in budding yeast. Proc Natl Acad Sci USA. 107:2977–2982
11. Zou Y, Su Z, Yang J, Zeng Y & Gu X (2009) Uncovering genetic regulatory network divergence between duplicate genes using yeast eQTL landscape. J Exp Zool B Mol Dev Evol. 312:722–733
12. Li Y, Breitling R & Jansen R C (2008) Generalizing genetical genomics: getting added value from environmental perturbation. Trends Genet. 24:518–524. http://www.ncbi.nlm.nih.gov/pubmed/18774198
13. Kliebenstein D J, West M A, van Leeuwen H et al. (2006) Identification of QTLs controlling gene expression networks defined a priori. BMC Bioinformatics. 7:308
14. Gilad Y, Rifkin S A & Pritchard J K (2008) Revealing the architecture of gene regulation: the promise of eQTL studies. Trends Genet. 24:408–415
15. Alberts R, Terpstra P, Li Y et al. (2007) Sequence polymorphisms cause many false cis eQTLs. PLoS One. 2:e622
16. Franke L, van Bakel H, Fokkens L et al. (2006) Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am J Hum Genet. 78:1011–1025
17. Chen X, Hackett C A, Niks R E et al. (2010) An eQTL analysis of partial resistance to Puccinia hordei in barley. PLoS One. 5:e8598
18. Qin L, Kudla U, Roze E H et al. (2004) Plant degradation: a nematode expansin acting on plants. Nature. 427:30. doi:10.1038/427030a
19. Saijo Y & Schulze-Lefert P (2008) Manipulation of the eukaryotic transcriptional machinery by bacterial pathogens. Cell Host Microbe. 4:96–99
20. Chen L Q, Hou B H, Lalonde S et al. (2010) Sugar transporters for intercellular exchange and nutrition of pathogens. Nature. 468:527–532
21. Hewitson J P, Grainger J R & Maizels R M (2009) Helminth immunoregulation: the role of parasite secreted proteins in modulating host immunity. Mol Biochem Parasitol. 167:1–11
22. Bird P I, Trapani J A & Villadangos J A (2009) Endolysosomal proteases and their inhibitors in immunity. Nat Rev Immunol. 9:871–882. doi:10.1038/nri2671
23. Bevan M, Bancroft I, Bent E et al. (1998) Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature. 391:485–488. doi:10.1038/35140
24. Bishop J G, Dean A M & Mitchell-Olds T (2000) Rapid evolution in plant chitinases: molecular targets of selection in plant–pathogen coevolution. Proc Natl Acad Sci USA. 97:5322–5327
25. Dangl J L & Jones J D (2001) Plant pathogens and integrated defence responses to infection. Nature. 411:826–833. doi:10.1038/35081161
26. Flor H (1956) The complementary genic systems in flax and flax rust. Advances in Genetics. 8:29–54. doi:10.1016/S0065-2660(08)60498-8
27. Bakker E G, Toomajian C, Kreitman M & Bergelson J (2006) A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell. 18:1803–1818. doi:10.1105/tpc.106.042614
28. Mackey D, Belkhadir Y, Alonso J M, Ecker J R & Dangl J L (2003) Arabidopsis RIN4 is a target


of the type III virulence effector AvrRpt2 and modulates RPS2-mediated resistance. Cell. 112:379–389
29. Richly E, Kurth J & Leister D (2002) Mode of amplification and reorganization of resistance genes during recent Arabidopsis thaliana evolution. Mol Biol Evol. 19:76–84
30. Medzhitov R & Janeway C A Jr (1997) Innate immunity: impact on the adaptive immune response. Curr Opin Immunol. 9:4–9
31. Holub E B (2001) The arms race is ancient history in Arabidopsis, the wildflower. Nat Rev Genet. 2:516–527. doi:10.1038/35080508
32. Xiao S, Emerson B, Ratanasut K et al. (2004) Origin and maintenance of a broad-spectrum disease resistance locus in Arabidopsis. Mol Biol Evol. 21:1661–1672. doi:10.1093/molbev/msh165
33. Mondragon-Palomino M, Meyers B C, Michelmore R W & Gaut B S (2002) Patterns of positive selection in the complete NBS-LRR gene family of Arabidopsis thaliana. Genome Res. 12:1305–1315. doi:10.1101/gr.159402
34. Sun X, Cao Y & Wang S (2006) Point mutations with positive selection were a major force during the evolution of a receptor-kinase resistance gene family of rice. Plant Physiol. 140:998–1008. doi:10.1104/pp.105.073080
35. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555–556. http://www.ncbi.nlm.nih.gov/pubmed/9367129
36. Kosiol C & Anisimova M (2012) Selection on the protein coding genome. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
37. Prins P, Belhachemi D, Moller S & Smant G (2012) Scalable computing in evolutionary genomics. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
38. Altschul S F, Madden T L, Schaffer A A et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402
39. Edgar R C (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32:1792–1797. doi:10.1093/nar/gkh340
40. Suyama M, Torrents D & Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34:W609–W612. doi:10.1093/nar/gkl315

41. Goto N, Prins P, Nakao M et al. (2010) BioRuby: bioinformatics software for the Ruby programming
language.
Bioinformatics.
26:26172619. doi:10.1093/bioinformatics/
btq475
42. Altschul S F, Madden T L, Schaffer A A et al.
(1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25:33893402
43. Rhee S Y, Beavis W, Berardini T Z et al. (2003)
The Arabidopsis Information Resource
(TAIR): a model organism database providing
a centralized, curated gateway to Arabidopsis
biology, research materials and community.
Nucleic Acids Res. 31:224228. http://www.
ncbi.nlm.nih.gov/pubmed/12519987
44. Anisimova M, Nielsen R & Yang Z (2003)
Effect of recombination on the accuracy of
the likelihood method for detecting positive
selection at amino acid sites. Genetics.
164:12291236
45. (2000) Analysis of the genome sequence of the
flowering plant Arabidopsis thaliana. Nature.
408:796815
46. Michelmore R W & Meyers B C (1998) Clusters of resistance genes in plants evolve by
divergent selection and a birth-and-death process. Genome Res. 8:11131130. http://www.
ncbi.nlm.nih.gov/pubmed/9847076
47. Salinas J & Sanchez-serrano J (2006) Arabidopsis protocols. Humana Pr Inc, Totowa, NJ
48. Fu J & Jansen R C (2006) Optimal design and
analysis of genetic studies on gene expression.
Genetics. 172:19931999. doi:10.1534/
genetics.105.047001
49. Mortazavi A, Williams B A, Mccue K, Schaeffer L
& Wold B (2008) Mapping and quantifying
mammalian transcriptomes by rna-seq. Nat
Methods.
5:621628.
doi:10.1038/
nmeth.1226
50. Eklund A C, Turner L R, Chen P et al. (2006)
Replacing cRNA targets with cDNA reduces
microarray cross-hybridization. Nat Biotechnol.
24:10711073. doi:10.1038/nbt0906-1071
51. Hoen P A, Ariyurek Y, Thygesen H H et al.
(2008) Deep sequencing-based expression
analysis shows major advances in robustness,
resolution and inter-lab portability over five
microarray platforms. Nucleic Acids Res. 36:
e141p. doi:10.1093/nar/gkn705
52. Keurentjes J J, Sulpice R, Gibon Y et al. (2008)
Integrative analyses of genetic variation in
enzyme activities of primary carbohydrate
metabolism reveal distinct modes of regulation
in Arabidopsis thaliana. Genome Biol. 9:
R129p. doi:10.1186/gb-2008-9-8-r129
53. Fu J, Swertz M A, Keurentjes J J & Jansen R C
(2007) Metanetwork: a computational

19
protocol for the genetic study of metabolic
networks.
Nat
Protoc.
2:685694.
doi:10.1038/nprot.2007.96
54. Fu J, Keurentjes J J, Bouwmeester H et al.
(2009) System-wide molecular evidence
for phenotypic buffering in Arabidopsis.
Nat Genet. 41:166167. doi:10.1038/ng.308
55. Breitling R, Li Y, Tesson B M et al. (2008)
Genetical genomics: spotlight on QTL hotspots.
PLoS
Genet.
4:e1000232p.
doi:10.1371/journal.pgen.1000232
56. Development core team R (2010) R: a language and environment for statistical computing. http://www.R-project.org
57. Broman K & Sen (2009) A guide to QTL
mapping with R/qtl. Springer Verlag, New
York, NY
58. Arends D, Prins P, Jansen R C & Broman K W
(2010) R/qtl: high-throughput multiple QTL
mapping. Bioinformatics. 26:29902992.
doi:10.1093/bioinformatics/btq565
59. Tierney L, Rossini A & Li N (2009) SNOW: a
parallel computing framework for the R system.

Evolutionary Genetical Genomics

485

International Journal of Parallel Programming.


37:7890
60. Arends D, Prins P, Broman K W & Jansen R C
(2010) Tutorial Multiple-QTL Mapping
(MQM) Analysis. http://www.rqtl.org/tutorials/MQM-tour.pdf
61. Li Y, Tesson B M, Churchill G A & Jansen R C
(2010) Critical reasoning on causal inference in
genome-wide linkage and association studies.
Trends Genet. 26:493498. doi:10.1016/
j.tig.2010.09.002
62. Wayne M L & Mcintyre L M (2002) Combining mapping and arraying: an approach to candidate gene identification. Proc Natl Acad Sci
USA.
99:1490314906.
doi:10.1073/
pnas.222549199
63. Westra HJ, Jansen RC, Fehrmann RS, te
Meerman GJ, van Heel D, Wijmenga C, Franke
L. (2011) MixupMapper: correcting sample mixups in genome-wide datasets increases power to
detect small genetic effects. Bioinformatics. Aug
1;27(15):210411. Epub 2011 Jun 7. http://
www.ncbi.nlm.nih.gov/pubmed/21653519

Part V
Handling Genomic Data: Resources and Computation

Chapter 20
Genomics Data Resources: Frameworks and Standards
Mark D. Wilkinson
Abstract
The emergence of genomics tools for the evolutionary and comparative biology community led to a rapid
explosion in the number of online resources targeted at this specialized community, including Web-based
comparative genomics software, such as the Artemis Comparison Tool (WebACT); databases, such as
PaleoDB, Global Biodiversity Information Facility, and TreeBase; and knowledge frameworks, such as
the Evolution Ontology. Unfortunately, these providers are largely independent of one another, and the
individual resources therefore do not share any centralized plan for how the data or tools would or
should be provided. As a result, a myriad of often incompatible technologies and frameworks
are in use by this community of providers. In this chapter, we explore approaches to online resource
publication: both those already in use by the community and new, emergent frameworks and
standards. An exploration of the strengths and weaknesses of each approach, together with a brief look
at the philosophy or informatics theory behind the varying approaches, will hopefully help readers as they
navigate this data space. The discussion lays the groundwork for an exploration of a
new global standard for data and knowledge representation, the Semantic Web, which holds promise of
providing solutions to many of the complexities users face in their attempts to discover and integrate
biodiversity data; examples are provided.
Key words: Interoperability, REST, Identifier systems, HTTP protocol, URI, URL, LSID, Web
services, Semantic Web

1. Introduction
Informatics is the field of study that examines technological
approaches that improve access to, and utilization of, information.
For bioinformaticians, informatics research and development provides the core computational communications standards and messaging syntaxes through which they can find the data they need, and then
integrate, organize, format, and analyze it. For biologists, informatics
technologies lie underneath the software applications they use day by
day that allow them to do relatively complex bioinformatics analyses
without necessarily having to become computer programmers
themselves. Informatics is also, arguably, the most broadly interdisciplinary research domain under the bioinformatics umbrella,
spanning the biological sciences, computer sciences, library sciences,
legal/ethical studies, and (increasingly) pure philosophy (1).

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_20,
© Springer Science+Business Media, LLC 2012
Broadly speaking, this chapter covers two main topics:
1. How do we name things, and what things are being named?
2. How do we get information about or analyze a named thing?
These topics are examined primarily in the context of the Web,
and, in particular, Web resources related to Evolutionary Genomics; however, the discussion occasionally extends to more general
themes, since the informatics issues faced by Evolution researchers
are shared with most other biological domains.

2. Naming
2.1. How Do We Name Things?

"Biologists would rather share a toothbrush than share their gene
names!" said Michael Ashburner (2), founder of the Gene Ontology consortium. The problem, however, is not restricted to gene
names and the biologists who coin them, but extends to all
biological objects and most individuals, institutions, consortia,
and even nations! Fundamentally, the only way in which information can be accurately and meaningfully integrated is if we are using
the same name to refer to precisely the same thing. As such,
disagreements over naming represent the single most disruptive
and destructive barrier to data integration and sharing, and one
that has so far not yielded to any purely technological solutioni.e.
it is a social problem, which includes aspects of ownership (3),
but is also influenced significantly by both language and culture.
Therefore, it is the first issue that we consider in this chapter, as it
can make or break a scalable information infrastructure.
It is probably an obvious point, but every piece of data should
have an identifier. When creating a data infrastructure, either for
personal use or for a public-facing data store, choices have to be
made about how identifiers will be coined and/or reused, and
what the policies will be around those identifiers. Similarly, when
constructing a research-based data warehouse of third-party data,
the kinds of identifiers used, and the policies around those identifiers, are an important consideration for the longevity and utility of
your research dataset, and reproducibility of your results. The
primary questions to consider are the following.
1. Is the identifier locally unique or globally unique? That is, does
the set of characters in this identifier exist anywhere in the
world to identify any other thing?

2. Are there other identifiers, locally or globally, that already
identify this piece of data?
3. Is the identifier stable? If not, under what circumstances does
the identifier change? For example, does it change when there
is a revision in the data? If it does, then is there a way to
distinguish between revisions of a record and completely different records?
4. Is the identifier permanent? For example, will the identifier for
that identical piece of data be the same next year as it is today? If
not, is there a way to track what the new identifier is?
5. Is the identifier transparent or opaque? For example, can
you tell, by looking at it, what kind of data the identifier
represents, who produced it, or when it was produced?
6. Does the identifier represent one thing or a set of things? If the
latter, is it useful/possible to refer to the members of the set
using another identifier?
7. How do I retrieve the data identified by that identifier?
8. How do I obtain metadata about the data identified (e.g.
authorship, time/date of production, algorithm used, database
version used, etc.)?
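When auditing a data provider, it can help to record the answers to these eight questions in a small, machine-readable structure. The sketch below is purely illustrative: the field names, the helper function, and the example provider's policy values are my assumptions, not statements from any real provider.

```python
from dataclasses import dataclass

@dataclass
class IdentifierPolicy:
    """Records a data provider's answers to the eight identifier questions."""
    provider: str
    globally_unique: bool           # Q1: unique beyond the provider's own records?
    aliases_exist: bool             # Q2: other identifiers for the same data?
    stable_across_revisions: bool   # Q3: identifier survives record revisions?
    permanent: bool                 # Q4: same identifier next year?
    transparent: bool               # Q5: can you read type/producer from the string?
    identifies_a_set: bool          # Q6: one thing, or a collection of things?
    resolution_protocol: str        # Q7: how to retrieve the identified data
    metadata_protocol: str          # Q8: how to retrieve metadata about the data

def longevity_risks(policy: IdentifierPolicy) -> list:
    """Flag policy answers that threaten the longevity of a dataset that links to this provider."""
    risks = []
    if not policy.globally_unique:
        risks.append("identifier may collide with other providers")
    if not policy.permanent:
        risks.append("identifier may change over time")
    if not policy.stable_across_revisions:
        risks.append("record revisions break the identifier")
    return risks

# Hypothetical policy for an imaginary provider, for illustration only.
example = IdentifierPolicy(
    provider="example.org", globally_unique=True, aliases_exist=False,
    stable_across_revisions=False, permanent=True, transparent=True,
    identifies_a_set=False, resolution_protocol="HTTP GET",
    metadata_protocol="none documented",
)
print(longevity_risks(example))  # ['record revisions break the identifier']
```

Filling in one such record per third-party provider makes the longevity and reproducibility trade-offs of a research dataset explicit rather than implicit.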
It is important to point out that there are no "right" answers to
the questions above; the answers are simply a matter of policy, and
vary from organization to organization. Thus, when linking to or
utilizing third-party data, it is important to know what the policy is
for each data provider you are linking to, since the longevity/
stability of your dataset depends on the answers to these questions.
For identifiers on the World Wide Web, best-practice answers
to many of these questions have been established by the Cool
URIs movement (4). The Uniform Resource Identifier (URI) is a
standard approved by the World Wide Web Consortium for naming
things on the Web (5). The most commonly recognizable type of
URI is the Uniform Resource Locator (URL)these are the Web
addresses that we type into our browsers to go to a particular Web
site. For example, http://purl.org/phylo/treebase/phylows/
study/TB2:S1787 is the URI (specifically, a URL) referring to
the TreeBase record of a study by Maddison, Zhang, and Bodner.
URIs have some particularly useful behaviours with respect to
the above questions. The structure of URIs (i.e. that they generally
contain the domain name for the institution) makes it much simpler
to ensure that a URI is globally unique (question 1), and URI
resolution schemas¹ provide standard ways of retrieving the data
identified by a URI (question 7). The Cool URIs movement has
also suggested that, by best practice, the URI for a piece of data
should not change (question 4), and that there is only one acceptable resolution schema, HTTP GET; this is the resolution system used by Web browsers when retrieving Web pages. However,
the remainder of the questions are not adequately answered by any
currently approved, widely adopted standard.

¹ The process of retrieving the data and/or metadata that is identified by any Web identifier is called "resolution";
therefore, URIs of all types are "resolved to data" or "resolved to metadata" by calling a server using a protocol
that is appropriate for that type of URI.
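The global uniqueness of a URL-style URI comes from its domain-name component, and resolution is simply an HTTP GET against that URL. A minimal sketch with Python's standard library, using the TreeBase URL quoted above (the GET request is constructed but deliberately not sent, since actual resolution requires network access):

```python
from urllib.parse import urlparse
from urllib.request import Request

# The TreeBase study URI quoted in the text; being a URL, HTTP GET resolves it.
uri = "http://purl.org/phylo/treebase/phylows/study/TB2:S1787"

parsed = urlparse(uri)
print(parsed.scheme)   # "http": tells software which resolution protocol to use
print(parsed.netloc)   # "purl.org": the domain that scopes global uniqueness
print(parsed.path)     # "/phylo/treebase/phylows/study/TB2:S1787"

# Resolution would be an HTTP GET on this URL; we build the request
# object here without sending it.
req = Request(uri, method="GET")
print(req.get_method())  # "GET"
```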
The evolutionary biology and biodiversity communities are
almost unique in the life sciences in that they adopted novel standards that addressed these outstanding questions/considerations as
far back as 2004. The Life Sciences Identifier (LSID) (6), a URI
standard distinct from URLs and with a distinct resolution
schema, was adopted by this community and is still being used,
for example in the Taxonomic Databases Working Group (TDWG)/
Global Biodiversity Information Facility (GBIF) and partner projects, such as the Biodiversity Collections Index, Global Names
Index, and the Encyclopedia of Life (EOL) (7).
Below is an example of an LSID, which identifies the species
Pternistis leucoscepus:
urn:lsid:ubio.org:namebank:11815
Notice that LSIDs look similar to URLs in some respects;
for example, they contain a domain name, such as ubio.org, but
they differ in that the various fields are separated by ":" characters
rather than "/" characters. More importantly, compared to a URL
prefix of "http://", the prefix for an LSID is "urn:lsid". This prefix
tells the underlying network software that LSIDs use a different
Web resolution schema than URLs in order to retrieve the
information about the entity being named. Without going into
extraneous technical detail, it is this alternative resolution protocol
that makes it possible for LSIDs to identify anything: not only Web
pages, but also species, genes, people, articles, concepts, etc.
Because LSIDs have the ability to identify anything, they also
provide a mechanism for explaining what kind of entity they
are describing (questions 5 and 6); moreover, LSIDs have an
explicit mechanism for managing and naming revisions of records
(questions 3 and 4), and a recommended mechanism for
discovering not only the information about the entity named by
the LSID, but also its curatorial history and relationships with other
data (question 8).
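The structure just described can be seen by splitting an LSID on its ":" separators; the TDWG browser proxy form discussed below is then a simple concatenation. A minimal sketch (the labels "authority", "namespace", and "object_id" are my informal names for the fields, not terms from the LSID specification):

```python
def parse_lsid(lsid: str) -> dict:
    """Split an LSID into its colon-separated fields."""
    parts = lsid.split(":")
    if parts[:2] != ["urn", "lsid"] or len(parts) < 5:
        raise ValueError("not a well-formed LSID: %s" % lsid)
    info = {
        "authority": parts[2],   # e.g. the domain name, ubio.org
        "namespace": parts[3],   # e.g. namebank
        "object_id": parts[4],   # e.g. 11815
    }
    if len(parts) > 5:           # optional revision field (questions 3 and 4)
        info["revision"] = parts[5]
    return info

lsid = "urn:lsid:ubio.org:namebank:11815"
print(parse_lsid(lsid))

# The TDWG resolver piggy-backs LSID resolution over HTTP for browsers:
proxy = "http://lsid.tdwg.org/summary/" + lsid
print(proxy)  # http://lsid.tdwg.org/summary/urn:lsid:ubio.org:namebank:11815
```

Note how the optional trailing field gives LSIDs their built-in handle on record revisions, something a bare URL does not offer.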
To explore how LSIDs are used, the TDWG project provides a
public LSID resolver, together with some example LSIDs at
http://lsid.tdwg.org. When visiting this site, it is important to
keep in mind, however, that in order to make the LSID information
available via your browser for these examples it becomes necessary
to piggy-back the LSID protocol (non-browser) over the HTTP
protocol (browser). As such, the example LSIDs on the TDWG
page look like, and act like, typical hyperlinks. To keep the distinction clear, it is necessary to understand that the hyperlink

http://lsid.tdwg.org/summary/urn:lsid:ubio.org:namebank:11815

displays the data obtained by resolving the following LSID:

urn:lsid:ubio.org:namebank:11815

Fig. 1. An RDF Graph, representing a small portion of the data from the LSID record for Pternistis leucoscepus from TDWG.
Notice that RDF is able to link URIs to textual content (dark rectangles), as well as linking URIs to other URIs (light ovals),
and moreover that these linkages themselves take the form of a URI. In this way, machines are able to traverse these vast
global graphs of inter-linked data while maintaining and understanding the context or meaning of each data linkage
without human intervention. Exercise 2 includes additional exploration of RDF retrieval and visualization.

To explore the LSID resolution system further, Exercise 1
provides additional tasks that clarify other aspects of how LSIDs
can be used to solve typical data integration and curatorial problems. Also see Section 2, and Fig. 1 in that section, for additional
discussion and visualizations of the data format that is returned by
LSID resolution.
Unfortunately, outside of the Biodiversity and BioMoby communities (BioMoby is discussed in more detail below), LSIDs did
not achieve widespread adoption. Despite their additional utility
beyond that of URLs, LSIDs have been rejected by the Cool URI
community (8) and thus should probably not be considered when
building public-facing data infrastructures. Nevertheless, LSIDs
have extremely useful behaviours and support for LSID at the
programming/software level is relatively strong, so their use for
purely internal data management infrastructures might still be considered. Moreover, LSIDs are purpose built (9) to take advantage of
the emergent Semantic Web technologies (described and demonstrated below); thus, Evolutionary Biology organizations, such as
TDWG, are already well positioned to take advantage of these new
powerful technologies as they become more prevalent.

Dryad (10) takes advantage of another well-known naming
protocol, the Digital Object Identifier (DOI). These days, most digital
objects available through online catalogues or repositories (articles,
movies, music, etc.) have a DOI. DOIs, like LSIDs, make certain
promises about stability and continuity, including that the DOI for a
digital object does not change, even if its Internet location changes.
Thus, DOIs have stability beyond that provided by URLs. While
there is a programmatic way to resolve the data identified by a DOI,
the most common approach is to append the DOI to the URL
address of a DOI proxy. The most commonly used DOI proxy is at
http://dx.doi.org/. Thus, the DOI: 10.5061/dryad.20 can be
resolved by appending it to the proxy URL, where http://dx.doi.org/10.5061/dryad.20 resolves to the data files for the 2007 study
on characiform fishes by Sidlauskas. Because DOIs come with a
certain promise of stability, DOIs (or their proxy URLs) provide a
greater level of stability and longevity to data infrastructures that
utilize them, versus those that store traditional URLs or database-specific identifiers.
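The DOI-to-URL mapping described above is a plain string concatenation, which is easy to sketch using the Dryad DOI quoted in the text (the helper function name is mine):

```python
DOI_PROXY = "http://dx.doi.org/"

def doi_to_url(doi: str) -> str:
    """Append a DOI to the proxy address to obtain a resolvable URL."""
    return DOI_PROXY + doi

# The Dryad deposit for the 2007 study on characiform fishes by Sidlauskas:
print(doi_to_url("10.5061/dryad.20"))  # http://dx.doi.org/10.5061/dryad.20
```

Because the proxy form is itself an ordinary URL, a data warehouse can store DOIs internally and construct stable, resolvable links on demand.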
2.2. What Thing Did We Name?

The next topic provides the groundwork for our subsequent
discussion of automated technologies for data integration, and
the kinds of problems that one often encounters when compiling
large genomics datasets. This topic has been receiving increased attention
in the biological informatics world in the past 2-3 years, and is
referred to as the issue of "denotation" (11).
In the context of URIs, denotation refers to the real-world
thing that is being identified by an identifier. Take, for example,
the GenBank reference genes 843928 (SKP1) and 839982 (UFO).
Examining the Web page for SKP1 (12), we find various bits of
information, including that the gene interacts with a variety of
other genes in two-hybrid experiments, and that the sequence
record has been curated and reviewed on June 8, 2010. If we
were to break-out the information from that page into individual
statements, we might derive (among others) the following factual
observations:
(a) GeneID:843928 interacts with GeneID:839982.
(b) GeneID:843928 was last updated on June 8, 2010.
What is being identified by GeneID:843928? In observation A,
we might initially suggest that it is the SKP1 gene; however, we
know from basic biology that genes do not interact with genes.
Therefore, it is more likely that GeneID:843928 is referring to the
protein product of the gene, rather than the gene itself. Yet even
that is perhaps misleading, since individual protein molecules interact with other individual protein molecules; so observation A might
lead us to conclude that GeneID:843928 is, in fact, referring to a
single instance of a protein molecule (But which one? In which cell?
In which plant?). Either way, observation B now becomes

troubling. Whether GeneID:843928 refers to a gene, an individual
protein, or a set of protein molecules, it is still difficult to conceive
how a gene or protein molecule can be curated or updated! So we
are, thus, led by observation B to believe that GeneID:843928
refers to the database record for SKP1.
Our innate ability, as humans, to disambiguate between these
various representations/abstractions of reality effortlessly, and
without even noticing that we are doing so, has allowed the Web
to flourish and thrive as a medium for human-to-human communication. However, the simplicity of packaging vastly different types
of information into a human-readable record poses significant problems when we ask machines to discover, integrate, and interpret
data on our behalf. This, then, is the motivation for the emergent
set of technologies described as "The Semantic Web".
The Semantic Web has been defined in various ways, but effectively describes a Web in which data is explicitly described, and
linked to other data through named relationships that are machine
readable and where each relationship has a precisely defined meaning. The two core technologies in the Semantic Web are:
- Resource Description Framework (RDF (13))
- Web Ontology Language (OWL (14))

Visualized in Fig. 1, RDF can be thought of as the simplest data model possible: a model within which any conceivable information can be represented. It consists of statements in the
form of Triples of [subject], [predicate], and [object]. An informal example might be the Triple [GeneID:843928, regulates,
GeneID:839982]. In RDF, each component of the Triple is a
URI; thus, RDF is tightly integrated with the Web, and Web
resolution protocols can be used to retrieve information about the
subject, the predicate, or the object. Thus, the important improvement of RDF over the traditional Web is that, rather than having
pages linked by arbitrary, human-readable hypertext, the [subject] and [object] URIs are linked by a [predicate] URI that has an
explicit and precise machine-readable meaning, and this meaning
can be obtained simply by resolving the predicate's URI. For
example, by resolving the predicate's URI, one might discover
that there are varying types of "regulates", such as "activates" or
"inhibits", or that all values ([object]) for the "regulates" predicate
must be of type "Gene". Thus, in RDF, both the data itself as well
as the connections between pieces of data are meaningful and can
be retrieved and interpreted by a machine, often in very detailed
ways. In large RDF datasets, the object of one triple can become the
subject of another, and vice versa. As such, information can be
interlinked into arbitrarily complex networks.
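The Triple model, and the way the object of one triple becomes the subject of another, can be sketched without any RDF library at all: below, triples are plain tuples of URI strings, and a traversal function walks the resulting graph. The URIs and the "regulates" predicate follow the informal example above and are illustrative only, not real GenBank assertions.

```python
# A graph is just a set of (subject, predicate, object) triples of URIs.
graph = {
    ("gene:843928", "ex:regulates", "gene:839982"),
    ("gene:839982", "ex:regulates", "gene:555555"),   # invented third node
    ("gene:843928", "ex:lastUpdated", "2010-06-08"),  # literal (textual) object
}

def objects(graph, subject, predicate):
    """All objects linked to `subject` by `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}

def reachable(graph, start, predicate):
    """Follow `predicate` links transitively: the object of one triple
    becomes the subject of the next, tracing a path through the graph."""
    seen, frontier = set(), {start}
    while frontier:
        node = frontier.pop()
        for o in objects(graph, node, predicate):
            if o not in seen:
                seen.add(o)
                frontier.add(o)
    return seen

print(objects(graph, "gene:843928", "ex:regulates"))    # {'gene:839982'}
print(reachable(graph, "gene:843928", "ex:regulates"))  # both downstream nodes
```

In real RDF every position would hold a full URI (or a typed literal), but the traversal logic, and hence the power of interlinked graphs, is exactly this simple.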
These large interlinked networks, also known as Graphs (see
Fig. 1), are at the core of the Linked Data movement (15), where
Linked Data simply refers to the act of publishing data in RDF
format, and attempting to link nodes in your Graph to nodes in

other published Graphs through well-defined RDF predicates. While
this is an extremely laudable and useful goal, it is important to note
that Linked Data is not synonymous with the Semantic Web; to
achieve a (useful) Semantic Web, there are certain constraints on
Linked Data that must be carefully addressed (one of them being
the Semantic Web's requirement for precise denotation and the
careful formatting of data records to follow these denotations).
OWL is a Description Logic, meaning that it is a language
within which logical assertions about the world can be constructed.
OWL is used to create machine-readable logical statements regarding how a machine should interpret the Triples/Graphs that it finds
on the Web.
For example, avoiding the somewhat arcane OWL syntax, the
curator of a database about birds might use OWL to declare that
there exists a Class of entities called "Birds"; further define this
Class by saying that Birds "haveAnatomicalPart Syrinx"; and further declare that any data entity on the Web that has the "taxonomicGroup Aves" is an example of a Bird. Such structured Class
definitions are called "ontologies", and using this small OWL
ontology, a piece of software that found the GBIF record for
P. leucoscepus could automatically determine that it was (a) a Bird,
and therefore (b) infer that P. leucoscepus must have a Syrinx.
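The bird example can be mimicked with a toy forward-chaining inferencer: one rule classifies any record annotated with "taxonomicGroup Aves" as a Bird, and a class axiom then attaches the Syrinx part. This is only a sketch of the reasoning pattern; real OWL reasoners operate on formal Description Logic axioms, and the dictionary-based representation and property names here are mine.

```python
# Class axiom: every member of the class carries these property/value pairs.
CLASS_AXIOMS = {
    "Bird": [("hasAnatomicalPart", "Syrinx")],
}
# Classification rule: data with this annotation belongs to this class.
MEMBERSHIP_RULES = {
    ("taxonomicGroup", "Aves"): "Bird",
}

def classify_and_infer(record: dict) -> dict:
    """Assign classes from annotations, then add the properties entailed
    by each class's axioms (a crude stand-in for an OWL reasoner)."""
    inferred = dict(record)
    classes = {cls for (prop, value), cls in MEMBERSHIP_RULES.items()
               if inferred.get(prop) == value}
    inferred["classes"] = classes
    for cls in classes:
        for prop, value in CLASS_AXIOMS.get(cls, []):
            inferred.setdefault(prop, value)   # only add, never overwrite
    return inferred

# A GBIF-style record for P. leucoscepus (annotation invented for illustration):
record = {"name": "Pternistis leucoscepus", "taxonomicGroup": "Aves"}
result = classify_and_infer(record)
print(result["classes"])            # {'Bird'}
print(result["hasAnatomicalPart"])  # Syrinx
```

The key design point mirrors the text: the record's publisher never said "Bird" or "Syrinx"; both were derived by applying someone else's class definitions to third-party data.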
The important aspects to note from this simple example are
that (a) OWL can be used to organize, and potentially reclassify or
reinterpret, data published by others into a form that meets your
own needs and (b) together, RDF and OWL can be used to
(relatively easily) bridge highly disparate datasets, such as biodiversity and anatomy data. The utilization of OWL in this way reveals
how, in the future, it will become possible to automate the extraction of data from a wide variety of third-party sources and reformat,
reinterpret, and integrate it according to the rules and definitions of
your local database (the SADI and SHARE projects, described in
Subheading 3.4, have already made progress along this path).
In exactly the same way, OWL can be used to detect inconsistencies: either inconsistencies in the data or inconsistencies
between sets of assertions about the data. For example, we could
express in OWL that proteins may interact with proteins or proteins
may interact with genes, but that genes do not act on proteins and
genes do not act on genes. With this logical framework in place, our
confusion with the Triple in statement A above (GeneID:843928
interacts with GeneID:839982) could automatically be flagged as
inconsistent and our software could then act in some predetermined way to handle that situation before loading it into our
database for analysis.
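The same kind of rule set can flag the inconsistent triple from observation A before it enters a database. The sketch below hard-codes the constraints stated in the text (proteins may interact with proteins or genes; genes may not act on proteins or genes); the type assignments are invented for illustration, and a real OWL reasoner would derive this from formal axioms rather than a lookup table.

```python
# Types asserted for each entity (invented for illustration).
TYPES = {"GeneID:843928": "gene", "GeneID:839982": "gene",
         "Protein:SKP1": "protein"}

# Constraints from the text: proteins may interact with proteins or genes,
# but genes do not act on proteins and genes do not act on genes.
ALLOWED_INTERACTIONS = {("protein", "protein"), ("protein", "gene")}

def inconsistent(triples):
    """Return the 'interacts_with' triples whose subject/object types
    violate the allowed-interaction constraints."""
    bad = []
    for s, p, o in triples:
        if p == "interacts_with":
            pair = (TYPES.get(s), TYPES.get(o))
            if pair not in ALLOWED_INTERACTIONS:
                bad.append((s, p, o))
    return bad

triples = [
    ("GeneID:843928", "interacts_with", "GeneID:839982"),  # observation A
    ("Protein:SKP1", "interacts_with", "GeneID:839982"),   # fine: protein-gene
]
print(inconsistent(triples))  # flags only the gene-gene triple
```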
Semantic Web technologies initially gained traction in the life
sciences communities precisely because problems with naming
and clarity of meaning are rampant in biology. The Gene Ontology project (16), established in 1998, is the most widely

recognized example of semantics and controlled-vocabulary
naming systems being applied to modern genomics datasets.
However, naming issues aside, biological data is highly interrelated and thus does not fit well into traditional relational database
schemas, but does fit nicely into the complex Graphs that are
generated by joining together RDF triples. Thus, despite the
fact that the software for managing these large RDF Graphs is
still quite slow and cumbersome, the life sciences community is
beginning to adopt RDF and OWL as their de facto data model
and interpretation layer (17) (respectively). In fact, at the time of
writing this article, there were already more than 40 billion triples
of life science data on the Semantic Web. In the next section, we
examine several of the most commonly used semantic resources in
Evolutionary Genomics.
2.3. Naming and Semantic Standards for Evolutionary Genomics

2.3.1. Comparative Data Analysis Ontology

The Comparative Data Analysis Ontology (CDAO) (18) was
designed to facilitate comparative evolutionary analysis at the trait
and molecular level by explicitly defining the relevant entities. Its
scope includes entities such as sequences, sequence differences,
taxonomic units, character data, trees, edge lengths, and so on.
To describe the algorithms used in these analyses, the CDAO
imports the myGrid ontology (19), which defines entities such as
DNA and protein sequence databases, the Smith-Waterman sequence
comparison algorithm, and various file formats. The CDAO is used
to describe both the data that flow through a comparative evolutionary analysis, as well as to be explicit about the process by which
that data is analyzed, manipulated, and evaluated. CDAO is published as an OWL ontology.

2.3.2. Darwin Core

The Darwin Core (DC) (20, 21) is designed to facilitate sharing of
data relating to biodiversity through promoting standardization of
the annotations associated with biodiversity data. Its scope primarily revolves around records of taxa and their distribution, and
provides an interoperable way to describe taxa, taxa observations,
specimens, and the nature of those specimens. The DC includes
entities, such as annotative entities that describe the observation,
such as Event, Location, Geological Context, as well as annotative
entities that describe the record itself, such as Still Image, Living
Specimen, and Preserved Specimen. The DC is published as an
RDF document defining all of the ontological terms, and the DC
organization provides recommended XML Schema documents to
suggest how data should be formatted when using Darwin Core
annotations to maximize interoperability. The DC is the largest and
most prominent vocabulary of the numerous annotation standards
created by the TDWG (22); other vocabularies include standards
for describing, for example institutions and individuals, and the
collections that they possess.

2.3.3. The Evolution Ontology

The Evolution Ontology (EO) (23) was created to formalize the
approach to describing how and why biological traits evolve. While
other ontologies describe the characters (CDAO) and their location or distribution (DC), the EO attempts to model the processes
and influences that shape these characters and distributions.
It includes categories, such as Evolving Entity (e.g. an organism),
Evolvable Property (e.g. a trait), Context (e.g. a habitat), and
Process (e.g. Heterosis). Though not available as of this writing,
the EO is to be published as an OWL ontology.

3. Analysing
3.1. How Do We Get Information About, or Analyze, a Named Thing?

In the previous section, we explored the various approaches to
identifying data entities, and the increasingly formal ways of interlinking data on the Web in machine-readable ways in anticipation of
the emergence of software that can automatically find and integrate
the data we require on demand. In this section, we turn our
attention to the problem of Web-based data analysis. We once
again take a historical perspective, from the early, manually accessed
Web interfaces, through to the modern, automated, Semantic Web-based approaches to analytical tool integration.
The Web is, of course, designed to facilitate information
retrieval, but lesser known (outside of the Web development community) are the additional features that facilitate information manipulation. The HTTP protocol consists of five core methods,
effectively operations that can be executed on a URL: GET, PUT,
POST, DELETE, and HEAD.² GET is the method invoked when
you type a URL into your browser, and this is the method that is used
to retrieve the document that is identified by that URL. Unfortunately, in many cases, data providers have created Web architectures
that do not provide their data records via URLs, and thus these
records are not available for direct retrieval by GET; rather, they
are only accessible by submitting data (e.g. an accession number) to
the provider's server via a Web Form (described in Subheading 3.2).
Moreover, for certain types of information retrieval, it is necessary to
submit large amounts of data that should be processed by an algorithm to generate novel information. Such large datasets cannot be
represented as a URL, and therefore URLs alone are not sufficient to
achieve our needs as biologists. In this section, we examine the
various types of data retrieval interfaces in rough order of their
evolution.

² The HTTP methods, GET, PUT, POST, and DELETE, roughly mimic the database operations of Retrieve,
Create, Update, and Delete. The fifth method, HEAD, is used to retrieve basic metadata about the page, such as
its expiry date, its size, or its date of creation.
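The five methods, and their rough database analogues from the footnote above, can be demonstrated with Python's standard library; the requests below are constructed but never sent, so no server is contacted (the URL is a placeholder).

```python
from urllib.request import Request

# Rough mapping, following the footnote: HTTP method -> database operation.
METHOD_TO_DB_OP = {
    "GET": "Retrieve",
    "PUT": "Create",
    "POST": "Update",
    "DELETE": "Delete",
    "HEAD": "Retrieve metadata",
}

url = "http://example.org/record/123"  # placeholder URL
for method in ("GET", "PUT", "POST", "DELETE", "HEAD"):
    req = Request(url, method=method)  # built, not sent
    print(req.get_method(), "->", METHOD_TO_DB_OP[req.get_method()])
```

This is why the text notes that URLs alone are insufficient: the method vocabulary is fixed, so anything beyond these five operations must be expressed through the data sent with the request.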

3.1.1. REST and GET

REpresentational State Transfer (REST (24)) is a specific type of
software design pattern in which all data/datasets have a precise
name (nouns), and there are a fixed and limited number of operations (verbs) that can be done on those named things. This distinguishes REST from the Web Service approach, described below,
where there are an unlimited number of operations, and the
(unnamed) data is passed as input to that operation in order to
obtain the result.
In RESTful software applications, every piece of data is uniquely identified, and each piece of data has a state. The application functions by changing the state of the data at the various identifiers. Thus, at any given time, the current output of the application can be observed by retrieving the data (the "state") of the identifier of interest. Put another way: while we commonly think of passing data through analytical tools to retrieve a desired output document, the REST philosophy talks about applying a tool to the data that resides at a particular address, and thus changing the state of that data. Effectively, the current state of the program is equivalent to the current state of all of the data at all addresses of interest.
The canonical example of REST is the HTTP protocol, effectively the Web itself.³ When software communicates with a Web server, it passes (a) the URL that it is interested in, (b) the operation/method that it wants to execute on that URL (GET, PUT, POST, DELETE), and (c) optionally, some additional data that the server needs to complete the task. As a result, the Web page at that address is either retrieved or changed in some way. The REST operation that is most familiar to us is GET, the method used to retrieve the state of a document at a particular URL, which is what happens when we resolve a URL in our browsers. These low-level protocol operations (GET, POST, etc.) are entirely hidden from the user by their browser, and so it is not possible to observe the structure of these requests without specialized "sniffing" or "tracing" software.
Though the Web was explicitly designed to support REST architectures, truly RESTful informatics interfaces, those that strictly follow the REST paradigm, are extremely rare in comparative biology, genomics, and bioinformatics; moreover, the REST architecture style is often misinterpreted as simply being synonymous with GET. It is not! This misunderstanding most commonly manifests itself as URLs that expose an underlying analytical or query interface and/or its parameters. As a result, verbs become part of the name, part of the "RESTful" URL.

³ Though there is no formal requirement for RESTful applications to be Web-based at all, REST is a design pattern, not a Web architecture. On the contrary, the Web follows the REST pattern, not the other way around.

500  M.D. Wilkinson

Take, for example, the "RESTful" interface offered by the PhyloWS API (25), made available via TreeBase. The PhyloWS API allows calls such as:

GET: http://purl.org/phylo/treebase/phylows/study/find?query=dcterms.contributor=Huelsenbeck

This URL contains the verb "find"; the "find" interface function of PhyloWS is, therefore, exposed by the URL, together with parameters needed by the find function call, such as the requirement for a contributor name. All of these become part of the name of that document, part of its URL, and this is not considered appropriate in REST.
In a true REST architecture, the same operation might be done by asking the REST interface to assign you a novel URL, a URL which (eventually) contains your query results. You would then use HTTP POST to send your query parameters (find: contributor Huelsenbeck) to this URL. This has the effect of updating (POST = Update) the state of the document identified by that URL such that it now contains the result of the query. These results can be obtained by calling GET on that URL. The "find" functionality is not exposed within any of the URLs themselves; rather, it is exposed by allowing you to POST a set of find-query parameters to a URL that was created specifically to identify/contain your result set.
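This POST-then-GET pattern can be simulated with a toy in-memory "server". Everything here (the class, the /results/ URL scheme, the echoed hit list) is invented for illustration and is not the PhyloWS API or any real service:

```python
# A toy illustration of the "true REST" pattern: the client is assigned a novel
# URL, POSTs query parameters to update the state of the document at that URL,
# and then GETs the current state. All names are illustrative placeholders.
import itertools

class ToyRestServer:
    def __init__(self):
        self._state = {}             # URL -> current document state
        self._ids = itertools.count(1)

    def new_resource(self) -> str:
        """Assign a novel URL whose state will eventually hold query results."""
        url = f"/results/{next(self._ids)}"
        self._state[url] = None
        return url

    def post(self, url: str, params: dict) -> None:
        """POST = update the state of the document identified by this URL."""
        # Pretend to run the query; here we simply echo back fixed results.
        self._state[url] = {"query": params, "hits": ["study-1", "study-2"]}

    def get(self, url: str):
        """GET = retrieve the current state of the document at this URL."""
        return self._state[url]

server = ToyRestServer()
url = server.new_resource()
server.post(url, {"find": {"contributor": "Huelsenbeck"}})
print(server.get(url)["hits"])    # ['study-1', 'study-2']
```

Note that the "find" operation never appears in the URL; only the result document is named.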
Nevertheless, while PhyloWS's "RESTful" interface is not truly RESTful, it is extremely clear how it should be used and what functionalities it has; moreover, the PhyloWS interface exposes all of its search and retrieval functionalities as GET-strings (URLs with parameters); thus, the most important parts of PhyloWS functionality, from the perspective of the comparative biologist, can be accessed via a Web browser. This easy accessibility clearly trumps the desire to create a philosophically pure REST interface, and, in fact, this type of faux-REST interface is almost ubiquitous in bioinformatics and life science Web frameworks for precisely this reason! For example, other GET-string-based interfaces are offered by the Dryad project (10), a repository of the data underlying scientific publications, and by the EOL (26).
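A sketch of how such a GET-string interface can be driven programmatically, using the PhyloWS example URL from the text (the exact parameter encoding is an assumption of this sketch):

```python
# Building and picking apart a "GET-string" of the kind faux-REST interfaces
# expose: the function name ("find") and its parameters live in the URL itself.
from urllib.parse import urlencode, urlparse, parse_qs

base = "http://purl.org/phylo/treebase/phylows/study/find"
url = base + "?" + urlencode({"query": "dcterms.contributor=Huelsenbeck"})

parsed = urlparse(url)
params = parse_qs(parsed.query)
print(parsed.path)       # /phylo/treebase/phylows/study/find
print(params["query"])   # ['dcterms.contributor=Huelsenbeck']
```

Because the whole request is a single URL, it can be bookmarked, shared, or pasted into a browser, which is exactly the accessibility argument made above.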
3.1.2. Web Forms

While the true REST philosophy does provide a means for executing analyses on data, the method for doing so (as described in the earlier example) is quite arcane. It is far more common (and intuitive) to simply send data to a computational tool, and be presented with a result. This functionality has traditionally been served by Web Forms: Web pages with fields that can be filled in or selected by users to achieve their desired outcome. Web Forms are specifically designed to be utilized by a human operator and are generally embedded within the content and visual layout elements of the HTML page. Moreover, because they are manual interfaces, Web Forms are most often of low throughput (e.g. a single sequence is entered, and a single BLAST report is returned). An example of a phylogeny and evolution resource that utilizes Web Forms as its primary interface is PaleoDB (27).
The requirements of high-throughput biology caused bioinformaticians to begin automating interactions with Web Forms by writing "wrapper" scripts. These were programs that automatically filled in the appropriate fields for a specific Web Form, and submitted the data. This approach, however, was fraught with problems (28). Most importantly, the output from submitting a Web Form is also intended for a human reader, and thus almost invariably contains significant amounts of HTML specifically for graphics and layout. The data would have to be automatically computationally extracted from these documents using predictable cues in the document to indicate where the data was located, a process known as "screen scraping". Since service providers often changed their layout and Web Form interface (to visually improve the end-user experience), such changes would often break the automated interactions encoded into these wrapper scripts. Moreover, since there was no way to computationally describe the Web Form, each wrapper had to be hand-coded for each resource, and then maintained as the underlying Web interface changed. The fragility of the workflows created using these wrapped Web Forms is well documented and is unacceptable for reproducible, high-throughput biology. As such, there has been a significant uptake of another Web technology known as Web Services (29).
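A minimal sketch of the screen-scraping approach, using only the Python standard library; the HTML fragment and the table-cell "cue" are invented, and illustrate why a provider's layout change silently breaks the wrapper:

```python
# "Screen scraping": the wrapper locates its data by a predictable layout cue
# (here, <td> table cells). If the provider redesigns the page, e.g. switching
# from a table to <div> elements, this scraper returns nothing.
from html.parser import HTMLParser

class ResultScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":                 # the fragile layout cue
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.values.append(data.strip())

page = "<html><table><tr><td>AC12345</td><td>2.3e-10</td></tr></table></html>"
scraper = ResultScraper()
scraper.feed(page)
print(scraper.values)   # ['AC12345', '2.3e-10']
```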
3.1.3. Web Services

While many algorithms and visualization tools are still available through Web-based forms, Web Services technologies have become increasingly prevalent, particularly in the past 5 years. Web Services act to formalize the way Web interfaces are described such that machines can determine the interface methods provided by the Service (i.e. verbs) and the data elements (effectively, the FORM fields) that are required by each of these methods.
Web Services consume and produce XML-formatted data, the structure of which is described in XML Schema, and these schemas are embedded in a Web Services Description Language (WSDL (30)) document. WSDL is a World Wide Web Consortium (W3C) standard that formally describes which XML schema elements are associated with each available Service method call, and what the transport protocol is for each of those methods (e.g. HTTP POST, FTP, or SMTP). Finally, Web Services (generally) utilize the additional functionalities offered by the Simple Object Access Protocol (SOAP), which creates an additional scaffolding of XML information around the input and output messages within which the service provider can include additional server-specific metadata or service-specific instructions (e.g. for long-running services, how long it will be until the result is ready). The goal of Web Services was to enhance interoperability between Web resources, effectively to make Web interfaces operating-system and language agnostic (see also Chapter 21 of this volume (ref. 43) on language interoperability). In addition, separating the programmatic interface from the visualization of that interface (i.e. keeping the data separate from the visual Web page) helped make these interfaces more stable. Most major bioinformatics providers now have Web Service interfaces into many of their data and analytical tools. As such, the ability to access these resources at the code level (e.g. using Bio* toolkits) has been greatly facilitated.
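The SOAP layering described above can be sketched as follows; the envelope is a simplified stand-in (element names invented, namespaces omitted), not a real SOAP message or WSDL contract:

```python
# A SOAP-style envelope wraps the actual payload in extra XML scaffolding:
# a Header for provider-specific metadata, and a Body for the response data.
# The consumer unwraps it with an ordinary XML parser.
import xml.etree.ElementTree as ET

envelope = """
<Envelope>
  <Header><ticket>job-42</ticket></Header>
  <Body>
    <getSequenceResponse><sequence>ATGGCG</sequence></getSequenceResponse>
  </Body>
</Envelope>
"""

root = ET.fromstring(envelope)
ticket = root.findtext("Header/ticket")                         # metadata
sequence = root.findtext("Body/getSequenceResponse/sequence")   # payload
print(ticket, sequence)   # job-42 ATGGCG
```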
Within the Evolutionary Genomics community, a particularly notable service, the TDWG Access Protocol for Information Retrieval (TAPIR (31)), takes yet another approach which resembles a hybrid of the Web Service and REST standards. In the TAPIR protocol, requests and responses are encoded in XML, governed by an XML Schema, similar to Web Services. Five verbs are allowed, as follows:

- Metadata: Default operation to retrieve basic information about the service
- Capabilities: Used to retrieve the essential settings to properly interact with the service
- Inventory: Used to retrieve distinct values of one or more concepts
- Search: Main operation to search and retrieve data
- Ping: Used for monitoring purposes to check service availability

These verbs are used either in a key/value pair system within a GET request or as a structured XML message in a POST request, and unlike standard Web Services do not utilize SOAP.
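The two TAPIR invocation styles can be sketched side by side; the endpoint, parameter names, and XML elements here are illustrative placeholders, not the actual TAPIR schema:

```python
# The same operation expressed both ways described above: as key/value pairs
# in a GET-string, and as a structured XML message for a POST body.
from urllib.parse import urlencode
from xml.etree.ElementTree import Element, SubElement, tostring

endpoint = "http://example.org/tapir"

# 1) key/value pair system within a GET request
get_url = endpoint + "?" + urlencode({"op": "ping"})

# 2) structured XML message in a POST request
request = Element("request")
SubElement(request, "operation").text = "ping"
post_body = tostring(request, encoding="unicode")

print(get_url)      # http://example.org/tapir?op=ping
print(post_body)    # <request><operation>ping</operation></request>
```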
While Web Services achieved the primary goal of making Web
interfaces platform and language neutral, this level of interoperability remains insufficient for most non-programmers. It is still difficult to make effective use of these Web resources and chain them
together into functional workflows, since each interface defines its
own fields using its own names for those fields, expects data to be
provided in particular syntaxes, and defines its own analytical operations using its own names for those operations. As such, it is
exceedingly difficult to reliably chain traditional Web Services into
the high-throughput workflows required by in silico biologists, and
therefore significant a priori knowledge about each Web Service
interface is required by the end-user to accomplish this task (32).
This limitation prompted the emergence of yet another layer of
technology, where semantics are applied to Web Service interfaces,
and these are known as Semantic Web Services.
3.1.4. Semantic Web Services

One of the first applications of semantics to Web Services in the bioinformatics domain was the Transparent Access to Multiple Bioinformatics Services (TAMBIS (33)) project. TAMBIS created a mediator between user queries and "wrapped" resources,
where the semantics of the domain of molecular biology (e.g. that SecondaryStructure is a feature of a Protein, and that Beta sheets are a type of protein SecondaryStructure) were explicitly encoded in an ontology. A user interface then guided the construction of queries by only allowing "sensible" questions to be asked based on the semantic constraints in the TAMBIS ontology; the query was then mapped onto a workflow of sub-queries over wrapped databases, and each sub-query result was integrated into the final result package. Among other things, TAMBIS suffered from the cost and complexity of its centralized ontology that attempted to generically model biological knowledge. The TAMBIS system is no longer available, and subsequent projects, led by this group and others, moved away from attempting to describe the biology behind genomics, and rather attempted to model something arguably more simplistic: genomics data types and analytical tool interfaces.
myGrid (34) and BioMoby (35) are contemporaneous projects, established independently in the early 2000s yet each with similar goals. myGrid, led by the original TAMBIS team, created an ontology that could be used to annotate the (WSDL) interfaces of bioinformatics Web Services such that each method and element was attached to a domain model of bioinformatics data types and the kinds of operations that could be done on them. While, unlike TAMBIS, this could not be used to fully automate the construction of workflows, myGrid also released a workflow design and enactment application, Taverna, that graphically guided the bioinformatics end-user in manually linking Web Services together, with the Feta semantic search tool aiding the end-user in discovering compatible Services based on their ontological annotations. Unfortunately, annotation of the WSDL documents ended up being done primarily by the myGrid team themselves, rather than the source providers of the Web Services/WSDL, and thus was costly and time consuming; moreover, there were no widely accepted standards (at that time) for how such annotations should be recorded or shared, and so support for these annotations was primarily from tools written by the myGrid project.
BioMoby chose a different approach to adding semantics to
biological data. In BioMoby, every data type was defined in an
ontology (as in myGrid); however, the BioMoby ontology also
dictated the syntax by which that data would be represented. Effectively, the BioMoby ontology acted as a novel type of XML Schema.
BioMoby adopted a mass-collaborative approach to its data
infrastructure. The BioMoby data-type ontology was an open,
publicly accessible resource, and was also end-user extensible,
such that new data types could be created by any user as new
genomics technologies emerged. Nevertheless, because these data
types were ontologically based, the interpretation of new data types


by existing software applications could reasonably accurately be


automated. Thus, unlike myGrid, BioMoby pushed the problem
of making Services interoperable squarely on to the Service providers by forcing them to use a strict data-typing system and publish
their interfaces using the BioMoby data-type ontology. Unfortunately, this approach suffered a variety of barriers and failures. First,
the BioMoby data-type syntax, while being highly self-describing,
was yet another syntax in a domain that already suffered from a
proliferation of largely equivalent and incompatible syntaxes (36);
moreover, in addition to being non-standard, the BioMoby XML
syntax was too flexible to be reliably described using traditional
XML Schema, meaning that BioMoby Service interfaces could not
be described using the WSDL standard.
In the end, both BioMoby and myGrid had about the same level
of community penetration, with approximately 1,500 Web Services
being available through each of the BioMoby and Feta search tools.
These data and analytical resources remain available for use through
the Taverna interface, and tools, such as the BioMoby plug-in to
Taverna, help to semi-automate the discovery and pipelining of
these Web Services by end-users.
The Simple Semantic Web Architecture and Protocol (SSWAP
(37)) project was an early offshoot from the BioMoby project,
again with the goal of semi- or fully automating the discovery and
pipelining of Web Services. Abandoning traditional XML and XML
Schema entirely, SSWAP used the emergent standards of RDF and
OWL to describe the Web Service interfaces. In SSWAP, a standardized OWL graph is used within which the Web Service interface
is defined. The graph contains information about the input and
output data types, the input node of the graph is filled in and passed
to the Service provider for service invocation, and the provider fills
in the output node of the graph with the output data; thus, the
OWL graph defines both the Service interface, as well as the
SSWAP project-specific Service messaging structure. All data is
represented in RDF format. The (arguably) unorthodox view of
OWL as a graph, and moreover a container for data transport (in
much the same way as SOAP provided an XML container for data
transport), provides certain barriers to the integration of SSWAP
with other Web Services and Semantic Web resources and tools,
though the tooling provided by the SSWAP project makes their
own resources compatible with one another. The primary users of
SSWAP are the iPlant Collaborative (38), and SSWAP hosts several
thousand services aimed at this community of researchers, though
it is by no means limited in its applicability to the wider genomics
community.
A recent addition to the Semantic Web Service landscape is
Semantic Automated Discovery and Integration (SADI (39)).
SADI does not define new standards, formats, or message structures, but rather defines a minimal set of best practices that enable


the kinds of Web Services commonly used in the bioinformatics


domain to be interoperable with one another. In SADI, the Service
provider defines (or points to) OWL Classes that describe their
input and output data. Within these Class definitions are the predicates (properties) that must exist in RDF data that is passed into
the service, and the properties that are added onto the incoming
data as a result of service execution. To interact with a SADI Service, data matching the OWL ontological definition is simply POST-ed, verbatim, to the SADI Service in RDF-XML format; this distinguishes SADI from SSWAP, where the OWL graph in SSWAP acts as a container for the input and output data.
The single caveat imposed by the SADI Framework is that,
through service invocation, the Subject URI of the input and output
RDF triples must remain the same. As such, in SADI, every service
must not only generate output data, but this output must be
connected to the input data through a meaningful relationship,
encoded in an RDF predicate. Thus, by passing input and output
data through a series of SADI Services, one can build a personalized
Linked-Data Graph representing all of the data related to the problem of interest. The advantage of this approach is that the Graph can
be reasoned over using OWL Description Logic reasoners to categorize or interpret the data (as described in Subheading 2), discover
new assertions, and, more importantly, discover additional SADI
services capable of operating on and analyzing those new combinations of data, to facilitate, or even anticipate, the researcher's questions. For example, if Species Distribution data were downloaded
from one source and Meteorological information were downloaded
from a second source, SADI-enabled software could automatically
detect that sufficient data was now available to do a correlation
analysis (if such a Service existed).
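The subject-URI caveat can be sketched with plain Python tuples standing in for RDF triples; the predicates and the mock service below are invented for illustration and are not part of the SADI specification:

```python
# Sketch of the SADI invariant: a service consumes triples and returns new
# triples about the *same* subject URI, so chained services accumulate one
# growing linked-data graph rather than disconnected outputs.
def annotation_service(triples):
    """A mock SADI-style service: attaches a new predicate to each subject."""
    out = []
    for subject, _pred, _obj in triples:
        out.append((subject, "ex:hasDescription", "a description"))
    return out

input_graph = [("http://example.org/allele/cho", "rdf:type", "ex:Allele")]
output = annotation_service(input_graph)

# The subject URI of the input and output triples remains the same:
assert output[0][0] == input_graph[0][0]

# Accumulate the output into a personalized linked-data graph:
graph = input_graph + output
print(len(graph))   # 2
```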
The semantic behaviours of SADI are, perhaps, best exemplified
by the early prototype SHARE client application (40). In SHARE,
queries are posed in the SPARQL-DL query language. SHARE
examines the query, and automatically constructs a workflow
through a series of SADI Web Services that generates the database
that is required to answer that query. The query is then solved and
the results are returned to the user. While this, in itself, is quite
powerful, SHARE also utilizes the semantics of OWL, where
SHARE decomposes an OWL class definition in order to construct a workflow capable of generating instances of that ontological
class (examples of SADI Service executions using the SHARE interface are available in Subheading 5). Thus, with SADI + SHARE, the
data required to answer ad hoc queries is automatically and dynamically discovered or generated as a result of the query being posed.
A plug-in is also available to provide semi-automated SADI workflow construction from within the Taverna environment.

3.2. Global Service and Analytical Workflow Repositories

While many of the Web Service and Semantic Web Service projects
have their own repositories, there has recently been a move towards
providing a single source for searching and browsing both Web
Services and Workflows, particularly within the bioinformatics and
genomics communities.
The BioCatalogue (41) is "The Life Science Web Service Registry", with functionality to discover, register, annotate, and monitor biological Web Services. A primary objective of the BioCatalogue project team is to overcome the frustrating lack of annotation that is currently true of most bioinformatics Web Service interfaces. They plan to achieve this by (a) creating a standard minimal set of annotation elements required to be a "well-behaved" Service provider and (b) opening up the annotation interface in Web 2.0 "open and collaborative" style, where any user can annotate any Web Service. The goal of this focus on annotation is to make Service discovery more accurate and complete, as well as assist end-users in correctly wiring together service input and output data components into meaningful functional workflows.
In parallel with BioCatalogue, and led by the same group, is the
myExperiment (42) project. Like BioCatalogue, myExperiment is a
Web 2.0-style repository, but with a focus on Workflows as the
primary deposition. myExperiment encourages sharing of and
social media-type discussion about workflows, as well as keeping
track of the edit history of workflows as they are reused and repurposed by varying end-users.

4. Summary
This chapter provided a high-level overview of the widely divergent
approaches to data and tool provision in evolutionary biology and
genomics. Technological, social, and philosophical decisions are
made by individual resource providers largely in response to the
specific needs of their target communities. Moreover, given limited resources, data and tool providers are often loath to buy in to new and potentially transient or flawed technologies. As a result, data integration (particularly automated data integration) from one resource to another can be difficult and error prone. While the
discussed technology has implications beyond evolutionary biology, even beyond bioinformatics, it is highly relevant because in
evolutionary biology we are dealing with complex data integration.
In discussing in some detail the issues related to data integration
in general and how the various Evolutionary Genomics projects have
dealt with these issues, it will hopefully be easier for the users of these
resources to utilize their offerings. In addition, the emergence of new
semantic technologies, technologies that are now starting to be adopted by the Evolutionary Genomics community of data and tool providers, gives new hope for a more integrative future!

5. Exercises
5.1. Exercise 1: LSIDs

Demonstrate that the LSID resolution specification is generic by using the TDWG LSID resolver to resolve an LSID from the BioMoby project, and then use the BioMoby LSID resolver to resolve an LSID from the TDWG project. Observe that, in exactly the same way the HTTP protocol ensures that any Browser retrieves the same Web page given the same URL, the LSID protocol ensures that any LSID-capable software can retrieve the data or metadata identified by an LSID.
The TDWG LSID resolver is at http://lsid.tdwg.org.
The BioMoby LSID resolver is at http://moby.ucalgary.ca/cgi-bin/LSID_Resolver.pl.
(Note that these manual LSID resolvers are intended primarily for demonstration purposes. LSIDs are meant to be resolved by computers, not by humans. As such, the information retrieved when you type an LSID into the text field and submit is formatted to be machine readable: it is returned in RDF-XML format, and is not intended to be "pretty" to the human eye!)
An example TDWG LSID: urn:lsid:taxonomy.org.au:TherevidMandala:MEI023602.
An example BioMoby LSID: urn:lsid:biomoby.org:objectclass:DNASequence.
When resolving this example BioMoby LSID to its metadata,
you will notice that the output indicates that there is a newer
version of the record available, and provides you the LSID of the
newer version attached by the predicate (latest). This was considered a best practice by the LSID community, as a way of handling
version changes in database records and communicating those
changes to the users.

5.2. Exercise 2: Exploring RDF

It is sometimes useful to explore the structure/content of an RDF document visually, and for this purpose the W3C provides a validation/visualization service for RDF.
1. First, obtain some RDF. For example, you can obtain the RDF metadata from a TDWG LSID from http://lsid.tdwg.org/urn:lsid:ubio.org:classificationbank:1164063. Your browser will now be displaying an XML document. This is the XML representation of RDF (there are a variety of ways to encode RDF, with XML being the most widely approved method). Use your mouse to select and copy all of this XML to your clipboard.

2. Surf to the W3C Validator at http://www.w3.org/RDF/Validator.
3. Remove the sample RDF in the Validator text area and paste
your RDF into the space.
4. In the "Display Results Options", select "Triples and Graph" from the drop-down menu.
5. Select "PNG Embedded" from the "Graph format" drop-down menu.
6. Click the "Parse RDF" button.
Your browser window will now display the RDF document broken down into individual Subject-Predicate-Object triples at the top of the display, and a graph (similar to Fig. 1) will be displayed at the bottom of the same page.
If you plan to utilize RDF with any regularity, a richer and more
robust RDF viewer is Cytoscape (http://cytoscape.org).
5.3. Exercise 3: Web Services and Semantic Web Services

The motivation for creating scientific Web Services, and combining them into workflows, is presented nicely in a video available on YouTube: http://www.youtube.com/watch?v=hmIErdZwFS0.
After watching the video, download a copy of Taverna from http://taverna.sourceforge.net, being sure to select the one appropriate for your operating system. Run the installer and then start Taverna.
There are numerous Taverna tutorials available online, and
you are encouraged to download these and follow along with the
exercises presented in them. Of particular relevance are the following.
How to use Taverna to access traditional Web Services:
http://wilkinsonlab.ca/ACGC/Taverna_beginner.ppt.
How to use Taverna to access BioMoby Semantic Web Services:
http://wilkinsonlab.ca/ACGC/taverna_biomoby_tutorial.ppt.
How to use Taverna to access SADI Semantic Web Services:
http://wilkinsonlab.ca/ACGC/SADI-tutorial.pptx.

5.4. Exercise 4: The SHARE Interface into SADI Semantic Web Services

Browse to the SADI Framework homepage at http://sadiframework.org. Click the "Show Me" tab, and then follow the link to the SHARE demonstration.
The SHARE demo presents SADI Semantic Web Services as if they represented a massive global database of bioinformatics information. The SHARE interface is simply a text box, which is where you type queries over this "database". The query language used is called SPARQL, the approved language for querying RDF data. Understanding SPARQL queries is quite straightforward.
1. The SELECT clause details the variables that you wish to be
filled with your query results.

2. The FROM clause points to an RDF-formatted dataset. This clause is optional in the SHARE application, but is normally required for most SPARQL query systems.
3. The WHERE clause contains one or more "triple patterns" of subject-predicate-object, where any of the three positions can contain a variable. Thus, a SPARQL query for the pattern {?x hasName "Michael"} is a query for all triples where the subject can be anything (?x) that has a predicate hasName and an object with value "Michael".
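The triple-pattern matching in step 3 can be sketched with Python tuples standing in for RDF triples (the data URIs below are invented):

```python
# A toy matcher for the pattern {?x hasName "Michael"}: names starting with
# "?" are variables and may appear in any of the three positions; everything
# else must match the triple exactly.
def match(pattern, triple):
    """Return variable bindings if the triple matches the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            bindings[p] = t
        elif p != t:
            return None
    return bindings

data = [
    ("urn:person:1", "hasName", "Michael"),
    ("urn:person:2", "hasName", "Maria"),
]
pattern = ("?x", "hasName", "Michael")
results = [b for t in data if (b := match(pattern, t)) is not None]
print(results)   # [{'?x': 'urn:person:1'}]
```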
What makes SHARE special is that, unlike a normal SPARQL
query interface, it is not necessary in a SHARE query to indicate the
address or location of a database. This is because SHARE is going
to automatically construct the database for you, based on the
information you have mentioned in the WHERE clause. SHARE
does this by examining the triple patterns in your query, and then
discovering SADI Semantic Web Services capable of generating that
data.
Follow the link to example queries and scroll down to query #9: "find me images and descriptions of the cho mutant of Antirrhinum majus (Snapdragon)".
The SPARQL query is as follows:

PREFIX pred: <http://sadiframework.org/ontologies/service_objects.owl#>
PREFIX drag: <http://lsrn.org/DragonDB_Allele:>
SELECT ?image ?desc
WHERE {drag:cho pred:visualizedByImage ?image .
       ?image pred:hasDescription ?desc}

The two PREFIX lines indicate the URL prefixes that are used to construct the triples in the query. This makes the query more readable. In this case, every time "pred:" appears in the WHERE clause, it means "http://sadiframework.org/ontologies/service_objects.owl#". In this way, each component of the subject-predicate-object triple pattern becomes a complete URL.
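The prefix expansion described above can be reproduced in a few lines; the helper name is ours, while the prefixes are taken verbatim from query #9:

```python
# Expand a prefixed name ("QName") like drag:cho into a complete URL using
# the PREFIX declarations of the query.
PREFIXES = {
    "pred": "http://sadiframework.org/ontologies/service_objects.owl#",
    "drag": "http://lsrn.org/DragonDB_Allele:",
}

def expand(qname: str) -> str:
    """Turn a prefixed name like 'drag:cho' into a complete URL."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local

print(expand("drag:cho"))
# http://lsrn.org/DragonDB_Allele:cho
print(expand("pred:visualizedByImage"))
# http://sadiframework.org/ontologies/service_objects.owl#visualizedByImage
```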
Copy and paste query #9 into the Query Form and click Submit. Within a few seconds, SHARE will have created the database necessary to answer this question by automatically discovering and executing SADI Services that operate over the A. majus model organism Web site at http://antirrhinum.net. The query results include a verbose description of each image, together with a hyperlink to the online image file (click the hyperlinks to obtain the images).
If you wish to examine the RDF that was automatically generated by these services, click the "View results as RDF" link directly under the query box. You might want to copy and paste this into the RDF Validator (see Subheading 5.2) to view the results as a graph image.


5.5. Exercise 5: myExperiment

In your browser, surf to http://myexperiment.org. Near the top of that page, there is a Search box. Enter the keywords "comparative genomics" as your search term. The search results are categorized into the following:

- Users: Those individuals who have contributed anything matching those keywords
- Workflows: Analytical pipelines, contributed by those users, that can be shared and edited, usually in Taverna
- Files: Any other kind of file contributed with those keywords (e.g. slideshows, etc.)

Under Workflows, scroll down to the workflow titled "Compare two genomes for similarity". Clicking on it brings you to a preview pane, where you can examine the workflow and see the contributor's comments about it. In this case, the workflow reads in two FASTA files (representing whole genomes) and then uses the M-GCAT algorithm to compare them.
Under the Download heading, click on the link and save the file to your desktop. Now open Taverna and load that file. You will see the workflow in Taverna's preview pane, and are now ready to run that analysis on your own FASTA-formatted data, following what you learned in the tutorials from Exercise 3.
References
1. Stein, L. (2003). Bioinformatics: Gone in 2012. O'Reilly Bioinformatics Technology Conference, 2003, San Diego, California, USA.
2. Pearson, H. (2001). Biology's name game. Nature 411 (7 June), 631–632.
3. Good, B.; Wilkinson, M. D. (2006). The Life Sciences Semantic Web is Full of Creeps! Briefings in Bioinformatics 7 (3), 275–286.
4. World Wide Web Consortium. Cool URIs. http://www.w3.org/TR/cooluris.
5. World Wide Web Consortium. URIs, URLs, and URNs: Clarifications and Recommendations 1.0. http://www.w3.org/TR/uri-clarification.
6. Clark, T.; Martin, S.; Liefeld, T. (2004). Globally distributed object identification for biological knowledgebases. Briefings in Bioinformatics 5 (1), 57–70.
7. Bafna, S.; Humphries, J.; Miranke, D. (2008). Schema driven assignment and implementation of life science identifiers (LSIDs). Journal of Biomedical Informatics 41 (5), 730–738.
8. Mendelsohn, N. My conversation with Sean Martin about LSIDs. http://lists.w3.org/Archives/Public/www-tag/2006Jul/0041.
9. Object Management Group Inc. Document dtc/04-10-08 (Final available Life Science Identifier specification). http://www.omg.org/cgi-bin/doc?dtc/04-10-08.
10. Vision, T. (2010). The Dryad Digital Repository: Published evolutionary data as part of the greater data ecosystem. Available from Nature Precedings: http://hdl.handle.net/10101/npre.2010.4595.1.
11. Booth, D. Denotation as a Two-Step Mapping in Semantic Web Architecture. http://dbooth.org/2009/denotation.
12. SKP1 (S phase Kinase-associated Protein 1). http://www.ncbi.nlm.nih.gov/sites/entrez?db=protein&cmd=Link&LinkName=protein_gene&from_uid=18410982.
13. World Wide Web Consortium (W3C). RDF Semantic Web Standards. http://www.w3.org/RDF.
14. World Wide Web Consortium. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features.
15. Bizer, C.; Heath, T.; Berners-Lee, T. (2009). Linked Data: The Story So Far. International Journal on Semantic Web and Information Systems 5 (3), 1–22.

20

Genomics Data Resources: Frameworks and Standards

16. Gene Ontology Consortium (2008). The


Gene Ontology project in 2008. Nucleic
Acids Res 36 (Database Issue), D440D444.
17. World Wide Web Consortium. Semantic Web
Health Care and Life Sciences (HCLS) Interest
Group. http://www.w3.org/blog/hcls.
18. Comparative Data Analysis Ontology. http://
www.evolutionaryontology.org.
19. Wolstencroft, K.; Alper, P.; Hull, D.; Wroe, C.;
Lord, P.; Stevens, R.; Goble, C. A. (2007).
The myGrid ontology: bioinformatics service
discovery. International Journal of Bioinformatics Research and Applications 3 (3),
303325.
20. Taxonomic Database Working Group. Darwin
Core. http://rs.tdwg.org/dwc.
21. Taxonomic Database Working Group. Darwin
Core Project site for discussion and development.
http://code.google.com/p/darwincore.
22. Taxonomic Database Working Group. TDWG
Homepage. http://www.tdwg.org.
23. Goldstein, A. M. (2009). The Evolution Ontology. Available from Nature Precedings: http://
dx.doi.org/10.1038/npre.2009.3557.1.
24. Fielding, R. T. Architectural Styles and the
Design of Network-based Software Architectures. http://www.ics.uci.edu/~fielding/pubs/
dissertation/top.htm.
25. NESCENT. PhyloWS/REST. https://www.
nescent.org/wg/evoinfo/index.php?titlePhyloWS/REST.
26. Encyclopedia of Life. http://www.eol.org/api.
27. The Paleobiology Database. http://paleodb.
org.
28. Stein, L. (2002). Creating a bioinformatics
nation. Nature 417, 119120.
29. World Wide Web Consortium. Web Services
Activity. http://www.w3.org/2002/ws.
30. World Wide Web Consortium. Web Services
Description Language (WSDL) 1.1. http://
www.w3.org/TR/wsdl.
31. Taxonomic Database Working Group. TAPIR
TDWG Access Protocol for Information
Retrieval.
http://www.tdwg.org/dav/sub
groups/tapir/1.0/docs/tdwg_tapir_specifi
cation_2010-05-05.htm.
32. Wroe, C.; Goble, C.; Goderis, A.; Lord, P.;
Miles, S.; Papay, J.; Alper, P.; Moreau, L.
(2007). Recycling workflows and services
through discovery and reuse. Concurrency
and Computation: Practice and Experience 19
(2), 181194.
33. Stevens, R.; Baker, P.; Bechhofer, S.; Ng, G.;
Jacoby, A.; Paton, N. W.; Goble, C. A.; Brass,

511

A. (2000). TAMBIS: Transparent Access to


Multiple Bioinformatics Information Sources.
Bioinformatics 16 (2), 184186.
34. myGrid Project. myGrid Home. http://www.
mygrid.org.uk.
35. The BioMoby Consortium (2008). Interoperability With Moby 1.0 Its Better Than Sharing
Your
Toothbrush!
Briefings
in
Bioinformatics 9 (3), 220.
36. Lord, P.; Bechhofer, S.; Wilkinson, M. D.;
Schiltz, G.; Gessler, D.; Hull, D.; Carole, G.;
Stein, L. (2004). Applying Semantic Web Services to Bioinformatics: Experiences Gained,
Lessons Learnt. Lecture Notes in Computer
Science, The Semantic Web ISWC 2004,
3298, 350364.
37. Gessler, D. D.; Schiltz, G. S.; May, G. D.;
Avraham, S.; Town, C.; Grant, D.; Nelson, R.
T. (2009). SSWAP: A Simple Semantic Web
Architecture and Protocol for semantic web
services. BMC Bioinformatics 10, 309.
38. The iPlant Collaborative. http://www.iplantcollaborative.org.
39. Wilkinson, M. D.; Vandervalk, B.; McCarthy,
L. (2009) SADI Semantic Web Services
cause you cant always GET what you want!
Services Computing Conference, APSCC
2009. IEEE Asia-Pacific, Singapore, 2009; pp
1318.
40. Vandervalk, B.; McCarthy, L.; Wilkinson, M.
D. (2010). SHARE & The Semantic Web
This Time its Personal! Proceedings of
OWLED 2010, San Francisco, California,
USA 2122 June 2010.
41. Bhagat, J.; Tanoh, F.; Nzuobontane, E.;
Laurent, T.; Orlowski, J.; Roos, M.; Wolstencroft,
K.; Aleksejevs, S.; Stevens, R.; Pettifer, S.;
Lopez, R.; Goble, C. A. (2010). BioCatalogue: a universal catalogue of web services for
the life sciences. Nucleic Acids Research 38
(suppl), W689W694.
42. Goble, C. A.; De Roure, D. (2007). myExperiment: social networking for workflow-using
e-scientists. Proceedings of the 2nd workshop
on Workflows in support of large-scale science;
High Performance Distributed Computing
2007; pp 12.
43. Prins, P., Belhachemi, D., Moller, S., Smant, G.
(2012) Scalable computing in evolutionary
genomics. In: Anisimova, M., (ed.), Evolutionary genomics: statistical and computational
methods (volume 1). Methods in Molecular
Biology, Springer Science+Business Media
New York.

Chapter 21
Sharing Programming Resources Between Bio*
Projects Through Remote Procedure Call and Native
Call Stack Strategies
Pjotr Prins, Naohisa Goto, Andrew Yates, Laurent Gautier,
Scooter Willis, Christopher Fields, and Toshiaki Katayama
Abstract
Open-source software (OSS) encourages computer programmers to reuse software components written by
others. In evolutionary bioinformatics, OSS comes in a broad range of programming languages, including
C/C++, Perl, Python, Ruby, Java, and R. To avoid writing the same functionality multiple times for different
languages, it is possible to share components by bridging computer languages and Bio* projects, such as
BioPerl, Biopython, BioRuby, BioJava, and R/Bioconductor. In this chapter, we compare the two principal
approaches for sharing software between different programming languages: either by remote procedure call
(RPC) or by sharing a local call stack. RPC provides a language-independent protocol over a network
interface; examples are RSOAP and Rserve. The local call stack provides a between-language mapping not
over the network interface, but directly in computer memory; examples are R bindings, RPy, and languages
sharing the Java Virtual Machine stack. This functionality provides strategies for sharing of software between
Bio* projects, which can be exploited more often. Here, we present cross-language examples for sequence
translation, and measure throughput of the different options. We compare calling into R through native R,
RSOAP, Rserve, and RPy interfaces, with the performance of native BioPerl, Biopython, BioJava, and
BioRuby implementations, and with call stack bindings to BioJava and the European Molecular Biology
Open Software Suite. In general, call stack approaches outperform native Bio* implementations and these, in
turn, outperform RPC-based approaches. To test and compare strategies, we provide a downloadable
BioNode image with all examples, tools, and libraries included. The BioNode image can be run on VirtualBox-supported operating systems, including Windows, OSX, and Linux.
Key words: Bioinformatics, R, BioPerl, BioRuby, Biopython, BioJava, Web services, Remote
procedure call, Java virtual machine

*Download: http://www.evolutionarygenomics.net/ and http://biobeat.org/bioprojects.


Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_21,
# Springer Science+Business Media, LLC 2012


1. Introduction
Bioinformatics has created its tower of Babel. The full set of
functionality for bioinformatics, including statistical and computational methods for evolutionary biology, is implemented in a range
of computer languages, including Java, C/C++, Perl, Python,
Ruby, and R. This comes as no surprise, as language design is the
result of multiple trade-offs, for example, in strictness, convenience,
and performance.
For example, Java is a statically typed compiled language, and R is a dynamically typed interpreted language. In principle, a compiled language is converted into machine code once by a compiler, whereas an interpreted language is translated at runtime, every time it is run by the interpreter. Static typing allows the compiler to optimize machine code for speed. Dynamic typing resolves variable and function types at runtime and is typically suited to an interpreter. These design decisions give Java stronger type checking and faster execution speed than R. Meanwhile, R offers sophisticated interactive analysis of data in an interpreted shell, which is not directly possible with Java. When comparing runtime performance of these languages, compiled statically typed languages, such as C and Java, outperform interpreted dynamically typed languages, such as Python, Perl, and R. For comparisons, see ref. 1.
Runtime performance, however, is not the only criterion for selecting a computer language. Another important criterion is conciseness. All the interpreted languages mentioned allow functionality to be written in fewer lines of code than Java. The number of lines matters, as it is often easier to grasp something expressed in a short and concise fashion, if done competently, leading to easier coding and maintenance of software, i.e., programmer productivity. In general, with R, Perl, Python, and Ruby, it takes fewer lines of code to write software than with C or Java; see also ref. 1. Based on the conciseness criterion, these languages fall into the same two groups as when split on performance. This may suggest a trade-off between execution speed and conciseness, or between execution speed and programmer productivity.
Discussing other important criteria for selecting a programming
language, such as ease of understanding, productivity, portability,
and the size and dynamics of the supporting Bio* project developer
communities, is beyond the scope of this book. The authors, who
have different individual preferences, wish to emphasize that every
language has characteristics driven by language design and there is
no single perfect all-purpose computer language.
In practice, the choice of a computer language depends mainly on the individuals involved in a project, partly due to the investment it takes to master a language. Researchers have prior investments and personal preferences, which have resulted in a wide range of computer languages used in the bioinformatics community.
Logically, to fully utilize the potential of existing and future
bioinformatics functionality, it is necessary to bridge between computer languages. Bioinformaticians cannot be expected to master
every language, and it is inefficient to write the same functionality
for every language. For example, R/Bioconductor (2) contains
unique and exhaustive functionality for statistical methods, such
as for microarray gene expression analysis. The singular implementation of this functionality in R has caused researchers to invest in
learning the R language. Others, meanwhile, have worked on
building bridges between languages. For example, RPy and Rserve
allow accessing R functionality from Python (3), and JRI and
Rserve allow accessing R functionality from Java (4, 5).
Contrasting with singular implementations, every mainstream
Bio* project, such as BioPerl (6), Biopython (7), BioRuby (8), R/
Bioconductor (2), BioJava (9), the European Molecular Biology
Open Software Suite (EMBOSS) (10), and Bio++ (11), contains
duplication of functionality. Every Bio* project consists of a group of volunteers collaborating to provide functionality for bioinformatics, genomics, and life science research under an open-source software (OSS) license. The BioPerl project does that for Perl, BioJava for Java, etc. Besides the language used, the total coverage of functionality, and perhaps the quality of implementation, differ between projects. Not only is there duplication of effort, both in writing and testing code, but there are also differences in implementation, completeness, correctness, and performance. Implementations differ between projects even for something as straightforward as codon translation, e.g., in the number of supported genetic codes and in support for translating ambiguous nucleotides. EMBOSS, uniquely, attempts to predict the final amino acid
in a sequence, even when there are only two nucleotides available
for the last codon.
Whereas Chapter 20 of this Volume discusses Internet data
resources (12) and how to share them, in this chapter we discuss
how to share functional resources by interfacing and bridging
functionality between different computer languages. This is highly
relevant to evolutionary biology as most classic phylogenetic
resources were written in C while nowadays phylogenetic routines
are written in Java, Perl, Python, Ruby, and R. Especially for communities with relatively few software developers, as is the case with
evolutionary biology, it is important to bridge these functional
resources from multiple languages.
1.1. Bridging
Functional Resources

The simplest way of interfacing software is by invoking one program from another. This strategy is often used in Bio* projects to invoke external programs; a typical subset would be PAML (13), HMMER (14), ClustalW (15), MAFFT (16), Muscle (17), BLAST (18), and MrBayes (19). The Bio* projects typically contain modules which invoke the external program and parse the results. This
approach has downsides. Loading a new instance of a program for every invocation incurs extra overhead. More importantly, nonstandard input and output make the interface fragile: what happens when output differs between two versions of a program? A further downside is that external programs do not offer fine-grained function access and have no support for error handling and exceptions.
What happens when the invoked program runs out of process
memory? A final complication is that such a program is an external
software deployment dependency, which may be hard to resolve for
an end user.
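To make the invocation pattern concrete, here is a minimal Python 3 sketch of wrapping an external program; the wrapper function is our own illustration, and a standard Unix utility (`tr`) stands in for a real bioinformatics tool:

```python
import subprocess

def run_external(tool, args, stdin_text):
    # Launch the external program once per invocation -- the startup
    # overhead and the text-only interface are exactly the downsides
    # described above.
    proc = subprocess.run([tool] + args, input=stdin_text,
                          capture_output=True, text=True)
    if proc.returncode != 0:
        # Errors arrive as an exit code and stderr text, not as
        # fine-grained exceptions from the wrapped functionality.
        raise RuntimeError(proc.stderr)
    return proc.stdout.strip().splitlines()

# Stand-in for a real tool: uppercase a nucleotide string with tr.
print(run_external("tr", ["acgt", "ACGT"], "atgtca\n"))
```

Any change to the tool's output format between versions silently breaks the parsing step, which is the fragility discussed above.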
In contrast, true cross-language interfacing allows one language to access functions and/or objects in another language, as if they were native function calls. To achieve transparent function calls between different computer languages, there are two principal approaches. The first is for one language to call directly into another language's function or method over a network interface, the so-called remote procedure call (RPC). The second
approach is to call into another language over a compatible local
call stack.
1.2. Remote
Procedure Call

In bioinformatics, cross-language RPC comes in the form of Web services and binary network protocols. A Web service application
programming interface (API) is exposed, and a function call gets
translated with its parameters into a language-independent format,
e.g., XML: a procedure called marshalling. After calling the
function on a server, the result is returned in XML and translated
back through unmarshalling. Examples of cross-language XML
protocols are SOAP (20) and XML/RPC (21).
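The marshalling step can be demonstrated with the XML-RPC support in the Python 3 standard library; this is a sketch only, and the method name simply mirrors the strTranslate examples used later in this chapter:

```python
import xmlrpc.client

dna = "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"

# Marshalling: the call and its parameters become language-independent XML.
payload = xmlrpc.client.dumps((dna,), methodname="strTranslate")
print(payload)

# Unmarshalling: the receiving side parses the XML back into native values.
params, method = xmlrpc.client.loads(payload)
print(method, params[0][:9])
```

The printed payload shows why XML-based RPC is verbose: a single short string argument is wrapped in several layers of markup.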
More techniques exist for Web service-type cross-language
RPC. For example, Representational State Transfer (REST), or
ReSTful (22), is a straightforward HTTP protocol, often preferred
over SOAP because of its simplicity. Another XML-based protocol
is Resource Description Framework (RDF), as part of the semantic
Web specification. Both REST and RDF can be used for RPC
solutions.
In addition, binary alternatives exist because XML-based
protocols are not efficient. XML is verbose, increasing the data
load, and requires parsing at both marshalling and unmarshalling
steps. In contrast, binary protocols are designed to reduce the data
transfer load and increase speed. Examples of binary protocols are
Rserve (4), which is specifically designed for R, and Google protocol buffers (23). Another software framework based on a binary
protocol is Thrift, by the Apache software foundation, designed for
scalable cross-language services development (24).
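The size difference is easy to see by framing the same call both ways; the binary framing below is a made-up illustration (a one-byte method id plus a length-prefixed string), not the actual Rserve or protocol-buffer wire format:

```python
import struct
import xmlrpc.client

dna = "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"

# XML marshalling of a single-argument call.
xml_payload = xmlrpc.client.dumps((dna,), methodname="strTranslate")

# A naive binary framing: 1-byte method id + 4-byte length + raw bytes.
bin_payload = struct.pack("!BI", 1, len(dna)) + dna.encode("ascii")

# The binary frame carries the same information in a fraction of the bytes.
print(len(xml_payload), len(bin_payload))
```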

1.3. Call Stack

The alternative to RPC is to create native local bindings from one language to another using a shared native call stack. With the call
stack, function calls do not run over the network, but over a stack
implementation in shared computer memory. In a single virtual
machine, such as the Java Virtual Machine (JVM), Parrot, or the
.NET framework, compiled code can share the same call stack,
which makes cross-language calling efficient. For example, the
languages Java, Jython, JRuby, Clojure, and Scala can transparently
call into each other when running on the same virtual machine.
Native call stack sharing is also supported at the lowest level by
the computer operating system through compiled shared libraries.
These shared libraries have an extension .so on Linux, .dylib on
OSX, and .dll on Windows. The shared libraries are designed so
that they contain code and data that provide services to independent programs, which allows the sharing and changing of code and
data in a modular fashion. Shared library interfaces are well defined
at the operating system level, and languages have a way of binding
them. Specialized interface bindings to shared libraries exist for every language, for example R's C modules, the Java Native Interface (JNI), the Parrot native compiler interface, and Perl XS.
With these (dynamic) shared libraries, certain algorithms can be
written in a low-level, high-performance, compiled computer language, such as C/C++ or FORTRAN. High-level languages, such
as Perl, Python, Ruby, R, and even Java, can access these algorithms. This way, trade-offs in language design can be exploited
optimally. Creating these shared library interfaces, however, can be
a tedious mechanical exercise, which calls for code generators. One
such generator is the Simplified Wrapper and Interface Generator
(SWIG) (25), which consists of a macro language, a C header file
parser, and the tools to bind low-level shared libraries to a wide
range of languages. For C/C++, SWIG can parse the header files
and generate the bindings for other languages, which, in turn, call
into these shared libraries.
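Scripting languages can also bind a shared library directly at runtime without generated wrapper code; here is a minimal Python sketch using the standard ctypes module against the C runtime (rather than a bioinformatics library):

```python
import ctypes
import ctypes.util

# Locate and load the C standard library as a shared library
# (.so on Linux, .dylib on OSX, .dll on Windows).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so arguments cross the call stack correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# The call happens in-process over the local call stack -- no network hop.
print(libc.strlen(b"atgtcaatg"))
```

Generators such as SWIG automate exactly this kind of signature declaration for entire libraries, which is what makes large bindings practical.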
While all this functionality for interfacing is available, the full
potential of creating cross-language adapters is not fully exploited
in bioinformatics. Rather than bridge two languages, researchers
often opt to duplicate functionality. This is possibly due to a lack of
information on the effort involved in creating an adapter. Also the
impact on performance is normally an unknown quantity. A further
complication is the need to understand, to some degree, both sides
of the equation; i.e., providing an R function to Python requires
some understanding of both R and Python, at least to the level of
reading the documentation of the shared module and creating a
working adapter. Likewise, binding Python to C using a call stack
approach requires some understanding of both Python and C.
Sometimes, binding of complex functions can be daunting and
deployment may be a concern; e.g., when creating shared library
bindings on Linux, they may not easily work on Windows or OS X.

1.4. Comparing Approaches

Here, we compare bridging code from one language to another using
the RPC approach and the call stack approach, in the form of short
experiments, which can be executed by the reader. To measure
performance between different approaches, we use codon translation
as an example of shared functionality between Bio* projects. Codon
translation is a straightforward algorithm with table lookups.
Sequence translation is often used with genome-sized data and
requires many function calls with small-sized parameters.
Examples and tests can be run on a computer running Windows, OS X, or Linux. To ease trials, we have added software to a readily built and downloadable BioNode image that
can be run in a virtual machine and supports all interfaces and
performance examples. BioNode is discussed in Chapter 22 of
this Volume (26).

2. Results
2.1. Calling into R
from Other
Languages

R is a free and OSS environment for statistical computing and graphics (27). R comes with a wide range of functionality, including
modules for bioinformatics, such as bundled in R/Bioconductor (2).
R is treated as a special citizen in this chapter as the language is widely
used and comes with statistical algorithms for evolutionary biology,
such as Ape (28) and SeqinR (29), both singularly available through
the comprehensive R archive network (CRAN). Meanwhile, the
interfacing techniques discussed here have wider applicability
as they can be used between other computer languages and Bio*
projects.
R defines a clear interface between the high-level language R
and low-level highly optimized C and FORTRAN libraries, some of
which have been around for a long time, such as linear regression
and linear algebra. In addition, the R environment successfully
handles cross-platform packaging of C, FORTRAN, and R code.
The combination of features has resulted in R becoming the open-source language of choice in statistics and in a number of disciplines in biology, e.g., R/Bioconductor for microarray analysis (2) and
R/qtl (30) and R/qtlbim (31) for QTL mapping. Not all is lost,
however, for those not comfortable with the R language itself. R
can effectively act as an intermediate between functionality and
high-level languages of interest. A number of libraries have been
created that interface to R from other languages, either providing a
form of RPC, through RSOAP or Rserve, or a call stack interface
calling into the R shared library and executing R commands, for
example RPy for Python, RSPerl for Perl, RSRuby for Ruby, and
JRI for Java. Of the call stack approaches, RPy currently has the
most complete implementation; see also ref. 3.

In this chapter, we compare different approaches for invoking
full R functionality from another language. To test cross-language calling, we elected to demonstrate codon translation. Codon-to-protein amino acid translation is representative of a relatively simple calculation that potentially happens thousands of times with genome-sized data. Every Bio* project includes such a translation
function, so it can be used to test for language interoperability and
performance. For data, we use a WormBase (32) Caenorhabditis
elegans cDNA FASTA file (33 Mb), containing 24,652 nucleotide
sequences, predicted to translate to protein (Fig. 1).
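For reference, codon translation itself is a table lookup. A minimal pure-Python sketch using the standard genetic code (simply dropping a trailing incomplete codon, with no handling of ambiguous nucleotides) reproduces the translation used in the examples below:

```python
# Standard genetic code in compact NCBI order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
BASES = "TCAG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                       AMINO_ACIDS))

def translate(dna):
    """Translate a nucleotide string to amino acids, one codon at a time."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3          # drop a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

print(translate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt"))
# -> MSMVRNVSNQSEKLEIL
```

The per-call work is tiny, which is why the cost of crossing a language boundary dominates the throughput measurements that follow.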
2.1.1. Using GeneR
with Plain R

The R/Bioconductor GeneR package (33) supports fast codon translation with the strTranslate function implemented in C.
GeneR supports the eukaryotic code and other major encoding
standards. The R usage is
R
library(GeneR)
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")
[1] "MSMVRNVSNQSEKLEIL"

The full R + GeneR script, named DNAtranslate_GeneR.R, parses the nucleotide FASTA input and outputs amino acid FASTA, and
can be downloaded and tried. The BioNode VM includes all the
tools and libraries for the examples in this chapter. More on BioNode can be found in Chapter 22 of this Volume (26).
Used directly from R, the throughput of the GeneR module is
about 393 sequences per second (Seq/s) on the test system, a 1.2-GHz Thinkpad 300 laptop with preloaded disk cache. When checking the implementation by reading the source code, we found that the GeneR FASTA parser is a huge bottleneck. This
FASTA parser implementation creates an index on disk and reloads
the index file completely for each individual sequence, thereby
incurring a large overhead for every single sequence.
To see if we could improve throughput, we replaced the slow
FASTA parser with a faster one, R + Biostrings, which reads FASTA
once into RAM using the R/Bioconductor BioStrings module, and
still uses GeneR to translate. This implementation is 1.6 times faster
with a sustained throughput of 622 Seq/s (see also Fig. 1). The second
script is named DNAtranslate_Biostrings.R.
2.1.2. Calling into R
from Other Languages
with RPC

RSOAP
Next, we added an R/SOAP (34) adapter for codon translation and invoked it from Python. RSOAP provides a SOAP interface for R. After starting up the R instance, which acts as a SOAP server, usage is


Fig. 1. Throughput of mRNA to protein translation using cross-language calling with a range of programming resources. Wormbase Caenorhabditis elegans-predicted protein
coding DNA was parsed in FASTA format and translated into amino acids. Measurements
were taken on a Thinkpad 300 1.2-GHz laptop, making sure that the effects of disk
caching and console speed were minimized. Different file sizes were used containing 500,
1,000, 5,000, 15,000, and 25,000 sequences (X-axis) and the number of seconds needed
for starting the software, parsing, and translation (Y-axis). For cross-language calling, the
triangles represent a network RPC protocol to bridge between programming languages,
and the squares represent a local stack approach. For further comparison, the circles
represent (native) Bio* libraries that come with each programming language. Broadly, the
figure shows that sustained throughput is reached quickly, and flattens out, for all types.
Of the RPC network protocols, SOAP performs poorly (Python + RSOAP + GeneR at
115 Seq/s) while Rserve (Python + Rserve + GeneR at 767 Seq/s) is at the level of
native Bio* libraries. The cross-language RPy2 (Python + RPy2 + GeneR at 3,247 Seq/s)
outperforms the others, except for the BioLib editions (Python + BioLib + EMBOSS at
10,490 Seq/s).
Python
from SOAPpy import *
RSOAPServer = SOAPProxy("http://localhost:9081")
RSessionURL = RSOAPServer.newServer()
RSession = SOAPProxy(RSessionURL)
RSession.call("library", "GeneR")
RSession.call("strTranslate", "atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")

MSMVRNVSNQSEKLEIL

With the example Python + RSOAP, Python is used for parsing FASTA and calling into RSOAP. As with the R example, GeneR is
used to translate the DNA. At 115 Seq/s, the RSOAP as an interface
is, by far, the slowest method of cross-language interfacing we tried.

Even the marshalling and unmarshalling of simple string objects
using XML over a local network interface take a lot of computational
resources (Fig. 1). The script is named DNAtranslate_RSOAP.py.
Rserve
Rserve (4) is a custom binary network protocol, more efficient than
XML-based protocols (4). R data types are converted into Rserve
binary data types. Rserve was originally written for Java, but nowadays
connectors exist for other languages. A Python example is
Python
import pyRserve
conn = pyRserve.rconnect()
conn('library(GeneR)')
conn('strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")')

MSMVRNVSNQSEKLEIL

where Biopython (7) is used for parsing FASTA, and the Rserve +
GeneR service translates. At 767 Seq/s, Python + Rserve's speed is
comparable to calling within R, and seven times faster than
Python + RSOAP (Fig. 1). The script is named DNAtranslate.py.
2.1.3. Calling into R from
Other Languages with the
Call Stack Approach

RPy2 executes R code from within Python over a local call stack (3).
Invoking the same GeneR functions from Python.
Python
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
importr("GeneR")
strTranslate = robjects.r["strTranslate"]
strTranslate("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")[0]
MSMVRNVSNQSEKLEIL

This example uses Biopython for parsing FASTA and invokes GeneR translation over a call stack handled by RPy2. At 3,247 Seq/s, throughput is the highest of our calling-into-R examples. The Python implementation outperforms the other FASTA parsers, and GeneR is fast too, when only the translation function is called (GeneR's strTranslate is actually written in C, not in R). The RPy2
call stack approach is efficient for passing data back and forth. The
script is named DNAtranslate_RPy2.py.
2.2. Native Bio*
Implementations

When dealing with cross-language transport comparisons, it is interesting to compare results with native language implementations. For example, Biopython (7) would be
Python
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
coding_dna = Seq("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt",
generic_dna)
coding_dna.translate()
Seq('MSMVRNVSNQSEKLEIL', ExtendedIUPACProtein())

A comparison reading the same FASTA files and translating DNA has Biopython, BioRuby (8), and BioPerl (6) on par at around 750 Seq/s (Fig. 1). The BioRuby performance was twice as fast with the Ruby 1.9 interpreter compared to the Ruby 1.8 interpreter. We can assume that the Biopython, BioPerl, and BioRuby implementations are reasonably optimized for performance. Therefore, throughput reflects the performance of these interpreted languages.
The BioJava3 (9) example, meanwhile, has a throughput
of 1,057 Seq/s. The current stable edition of BioJava, 1.8.1, is
approximately four times slower than the latest development
BioJava3 edition because of a major reimplementation changing
an object-based sequence translation to a string-based sequence
translation. For this chapter, we also wrote a more hand-optimized
translation function in Java which is 1.6 times faster. These three Java
implementations show that a choice of algorithm can limit performance, and is often more important than the underlying technology.
All examples can be found here, in the Biopython, BioRuby,
BioPerl, and BioJava subdirectories.
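The sequences-per-second figures quoted in this section come from the scripts on the Web site; the measurement itself is simple, and a sketch of such a harness (with a trivial stand-in for the translation callable) looks like this:

```python
import time

def throughput(translate, sequences):
    # Time repeated calls to a translation function and report Seq/s.
    start = time.perf_counter()
    for seq in sequences:
        translate(seq)
    elapsed = time.perf_counter() - start
    return len(sequences) / elapsed if elapsed > 0 else float("inf")

# Stand-in workload: 10,000 short sequences, "translated" by a no-op upper().
rate = throughput(str.upper, ["atgtcaatg"] * 10000)
print("%.0f Seq/s" % rate)
```

Substituting any of the cross-language translate calls discussed above for the stand-in makes per-call bridging overhead directly visible in the reported rate.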
2.3. Using the JVM
for Cross-Language
Support

The JVM is a byte code standard that represents a form of computer intermediate language. This language conceptually represents the
instruction set of a stack-oriented capability architecture. This
intermediate language, or byte code, is not tied to Java specifically,
and in the last 10 years a number of languages have appeared which
target the JVM, including JRuby (Ruby on the JVM), Jython
(Python on the JVM), Groovy (35), Clojure (36), and Scala (37).
These languages also compile into byte code and share the same
JVM stack. The shared JVM stack allows transparent function
calling between different languages.
An example of calling BioJava3 translation from Scala over a
shared JVM stack
Scala
import bio._
import org.biojava3.core.sequence.transcription.TranscriptionEngine
import org.biojava3.core.sequence._
val transcriber = TranscriptionEngine.getDefault()
val dna = new DNASequence("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt")
val rna = dna.getRNASequence(transcriber)
rna.getProteinSequence(transcriber)

A native Java function, such as getProteinSequence, is directly invoked from the other language without overheads (the passed-in transcriber object is passed by reference, just like in Java). In fact,

Scala compiles to byte code, which maps one to one to Java,
including the class definitions. The produced byte code is native
Java byte code; therefore, the performance of calling BioJava from
Scala or Java is exactly the same.
Besides the Scala and Jython examples, we have also included a
JRuby example that calls into BioJava3 on the JVM (see Fig. 1).
All examples can be found on the Web site here.
2.4. Shared C Library
Cross-Calling Using
EMBOSS Codon
Translation

Finally, EMBOSS is a free and OSS analysis package specially developed for the needs of the molecular biology user community,
mostly written in C (10). We used the SWIG code generator to map
the EMBOSS translation function to Python and Ruby. The
Python example reads
Python

import biolib.emboss as emboss
trnTable = emboss.ajTrnNewI(1)
ajpseq = emboss.ajSeqNewNameC("atgtcaatggtaagaaatgtatcaaatcagagcgaaaaattggaaattttgt", "Test sequence")
ajpseqt = emboss.ajTrnSeqOrig(trnTable, ajpseq, 1)
print emboss.ajSeqGetSeqCopyC(ajpseqt)
MSMVRNVSNQSEKLEILX

The Python and Ruby bindings of EMBOSS outperform all other methods at 10,490 and 8,385 Seq/s, respectively (Fig. 1). The high speed points out that (1) the invoked Biopython and BioRuby functions are efficient at parsing FASTA; (2) the SWIG-generated call stack is efficient for moving data over the local call stack; and (3) the EMBOSS transeq DNA to protein translation is optimal C code. All these examples can be found on the Web site.

3. Discussion
Cross-language interfacing is a topic of importance to evolutionary
genomics because computational biologists need to provide tools
that are capable of complex analysis and can cope with the amount of
biological data generated by the latest technologies. Cross-language
interfacing allows sharing of code. This means computer software can be written in the computer language of choice for a particular purpose. Flexibility in the choice of programming language allows optimization of computational resources and, perhaps even more important, of software developer resources in bioinformatics.
When some functionality is needed that exists in a different computer language than the one used for a project, a developer has the following options: either rewrite the code in the preferred language, essentially a duplication of effort, or bridge from one language to the other.

524  P. Prins et al.

For bridging, there are essentially two technical methods that allow full programmatic access to functionality: through RPC or a local call stack.
RPC function invocation, over a network interface, has the advantage of being language agnostic and even machine independent. A function can run on a different machine or even over the Internet, which is the basis of Web services and may be attractive even for running services locally. XML-based RPC technologies, however, are slow because of expensive parsing and high data load. Metrics suggest that it may be worth experimenting with binary protocols, such as Rserve.
When performance is critical, e.g., when much data needs to be
transported, or functions are invoked millions of times, a native call
stack approach may be preferred over RPC. Metrics suggest that the
EMBOSS C implementation performs well, and that binding to the
native C libraries with SWIG is efficient. Alternatively, it is possible
to use R as an intermediate to C libraries. Interestingly, calling R libraries, many of which are written in C, may give higher performance than calling into native Bio* implementations. For example, Python + RPy + GeneR is faster than Biopython's pure Python implementation of sequence translation.
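Besides generator tools such as SWIG, a native call stack binding can also be made by hand. As a sketch (assuming a standard C math library is installed), Python's ctypes module binds a compiled C function directly, with no generated wrapper code and no process start-up per call:

```python
# Binding to a native C library over the local call stack with ctypes.
# Assumes a standard C math library (libm) is available on the system.
import ctypes
import ctypes.util

# find_library locates e.g. "libm.so.6" on Linux; fall back to the
# current process's symbols if no separate libm is found.
libm = ctypes.CDLL(ctypes.util.find_library("m") or None)
libm.sqrt.restype = ctypes.c_double      # declare the C return type
libm.sqrt.argtypes = [ctypes.c_double]   # declare the C argument types

value = libm.sqrt(16.0)   # calls straight into compiled C code
```

The call crosses the language boundary on the local call stack, which is why this style of binding, like the SWIG-generated one, is so much faster than RPC.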
Even though RPC may perform less well than local stack-based approaches, RPC has some real advantages. For example, if you have a choice between calling a local BLAST library and calling into a remote, ready-made NCBI RPC interface, the latter lacks the deployment complexity. Also, the public resource may be more up to date than a copy running on a local server. This holds for many curated services that involve large databases, such as PDB (38), Pfam (39), KEGG (40), and UniProt (41). Chapter 20 of this volume gives a deeper treatment of these Internet resources (12).
From the examples given in this chapter, it may be clear that
actual invocation of functions through the different technologies is
similar, i.e., all listed Python scripts look similar, provided the
underlying dependencies on tools and libraries have been resolved.
The main difference between implementations is with the deployment of software, rather than invocation of functionality. The JVM
approach is of interest, as it makes bridging between supported
languages transparent and deployment straightforward. Not only
can languages be mixed, but also the advanced Java tool chain is
available, including debuggers, profilers, load distributors, and
build tools. Other shared virtual machines, such as .NET and
Parrot, potentially offer similar advantages, but are currently less
used in bioinformatics.
When striving for reliable and correct software solutions, the alternative strategy of calling computer programs as external units via the command line should be discouraged: not only is it less efficient, because a program gets started every time a function is called, but it also introduces a potential deployment nightmare. What happens when the program is not installed, when the interface has changed between versions, or when there is some other error? With the full programmatic interfaces discussed in this chapter, incompatibilities between functions are caught much earlier.
Whichever language and bridging technology is preferred, we think it important to test the performance of different ways of interfacing languages, as (1) there is a need for combining languages in bioinformatics and (2) it is not always clear what impact a choice of cross-language interface may have on performance. By testing different bridging technologies and functional implementations, the best solution should emerge for a specific scenario.
So far, we have focused on the performance of cross-language calling. Chapter 22 of this volume (26) discusses scaling up computation by programming for multiple processors and machines.

4. Questions
1. Install BioNode and run the different test scripts. Can you replicate the throughput differences?
2. Why is SOAP the slowest protocol?
3. What are the possible advantages of using a virtual machine,
such as the JVM?
4. If you were to bridge between your favorite language and an R
library, what options do you have?

Acknowledgments
We thank all OSS developers for creating such great tools and
libraries for the scientific community.
References
1. The computer language benchmarks game. http://shootout.alioth.debian.org
2. Gentleman R C, Carey V J, Bates D M et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5:R80. doi:10.1186/gb-2004-5-10-r80
3. Gautier L (2010) An intuitive Python interface for Bioconductor libraries demonstrates the utility of language translators. BMC Bioinformatics. 11 Suppl 12:S11. http://www.ncbi.nlm.nih.gov/pubmed/21210978
4. Urbanek S (2003) Rserve: a fast way to provide R functionality to applications. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria. http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Urbanek.pdf
5. Urbanek S (2009) How to talk to strangers: ways to leverage connectivity between R, Java and Objective C. Computational Statistics. 24:303-311. http://dx.doi.org/10.1007/s00180-008-0132-x
6. Stajich J E, Block D, Boulez K et al. (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 12:1611-1618. doi:10.1101/gr.361602
7. Cock P J, Antao T, Chang J T et al. (2009) Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 25:1422-1423. doi:10.1093/bioinformatics/btp163
8. Goto N, Prins P, Nakao M et al. (2010) BioRuby: bioinformatics software for the Ruby programming language. Bioinformatics. 26:2617-2619. doi:10.1093/bioinformatics/btq475
9. Holland R C, Down T A, Pocock M et al. (2008) BioJava: an open-source framework for bioinformatics. Bioinformatics. 24:2096-2097. doi:10.1093/bioinformatics/btn397
10. Rice P, Longden I & Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16:276-277. http://www.ncbi.nlm.nih.gov/pubmed/10827456
11. Dutheil J, Gaillard S, Bazin E et al. (2006) Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics. 7:188. doi:10.1186/1471-2105-7-188
12. Wilkinson M (2012) Genomics data resources: frameworks and standards. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
13. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci. 13:555-556
14. Eddy S R (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 4:e1000069. doi:10.1371/journal.pcbi.1000069
15. Larkin M A, Blackshields G, Brown N P et al. (2007) Clustal W and Clustal X version 2.0. Bioinformatics. 23:2947-2948. doi:10.1093/bioinformatics/btm404
16. Katoh K, Kuma K, Toh H & Miyata T (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res. 33:511-518. doi:10.1093/nar/gki198
17. Edgar R C (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics. 5:113. doi:10.1186/1471-2105-5-113

18. Altschul S F, Madden T L, Schaffer A A et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402
19. Ronquist F & Huelsenbeck J P (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 19:1572-1574
20. Box D, Ehnebuske D, Kakivaya G et al. (2000) Simple object access protocol (SOAP) 1.1. http://www.w3.org/TR/2000/NOTE-SOAP-20000508
21. St. Laurent S, Johnston J & Dumbill E (2001) Programming Web services with XML-RPC. O'Reilly Media, 213 pp
22. Richardson L & Ruby S (2007) RESTful web services. O'Reilly Media, xxiv + 419 pp
23. Muller J, Lorenz M, Geller F, Zeier A & Plattner H (2010) Assessment of communication protocols in the EPC network: replacing textual SOAP and XML with binary Google Protocol Buffers encoding. In: Proceedings of the 2010 IEEE 17th International Conference on Industrial Engineering and Engineering Management (IE&EM), IEEE, 404-409. doi:10.1109/ICIEEM.2010.5646586
24. Agarwal A, Slee M & Kwiatkowski M (2007) Thrift: scalable cross-language services implementation. http://thrift.apache.org/static/thrift-20070401.pdf
25. Beazley D (1996) SWIG: an easy to use tool for integrating scripting languages with C and C++. In: Proceedings of the 4th USENIX Tcl/Tk Workshop, USENIX Association, 15 pp. http://www.swig.org
26. Prins P, Goto N, Yates A, Gautier L, Willis S, Fields C & Katayama T (2012) Sharing programming resources between Bio* projects through remote procedure call and native call stack strategies. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media, New York
27. R Development Core Team (2010) R: a language and environment for statistical computing. http://www.R-project.org
28. Paradis E, Claude J & Strimmer K (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 20:289-290. http://www.ncbi.nlm.nih.gov/pubmed/14734327
29. Charif D, Thioulouse J, Lobry J R & Perriere G (2005) Online synonymous codon usage analyses with the ade4 and seqinR packages. Bioinformatics. 21:545-547. doi:10.1093/bioinformatics/bti037
30. Arends D, Prins P, Jansen R C & Broman K W (2010) R/qtl: high-throughput multiple QTL mapping. Bioinformatics. 26:2990-2992. doi:10.1093/bioinformatics/btq565
31. Yandell B S, Mehta T, Banerjee S et al. (2007) R/qtlbim: QTL with Bayesian interval mapping in experimental crosses. Bioinformatics. 23:641-643. doi:10.1093/bioinformatics/btm011
32. Harris T W, Antoshechkin I, Bieri T et al. (2010) WormBase: a comprehensive resource for nematode research. Nucleic Acids Res. 38:D463-D467. doi:10.1093/nar/gkp952
33. Cottret L, Lucas A, Marrakchi E et al. GeneR: R for genes and sequences analysis. http://www.bioconductor.org/help/bioc-views/release/bioc/html/GeneR.html
34. Warnes G (2004) RSOAP provides a SOAP interface for the open-source statistical package R. http://research.warnes.net/statcomp/projects/RStatServer/rsoap
35. Koenig D, Glover A, King P, Laforge G & Skeet J (2007) Groovy in action. Manning Publications Co., Greenwich, CT, USA


36. Halloway S (2009) Programming Clojure. Pragmatic Bookshelf
37. Odersky M, Altherr P, Cremet V et al. (2004) An overview of the Scala programming language. LAMP-EPFL
38. Berman H M, Battistuz T, Bhat T N et al. (2002) The Protein Data Bank. Acta Crystallogr D Biol Crystallogr. 58:899-907. http://www.ncbi.nlm.nih.gov/pubmed/12037327
39. Finn R D, Mistry J, Tate J et al. (2010) The Pfam protein families database. Nucleic Acids Res. 38:D211-D222. doi:10.1093/nar/gkp985
40. Kanehisa M & Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28:27-30. http://www.ncbi.nlm.nih.gov/pubmed/10592173
41. Bairoch A, Apweiler R, Wu C H et al. (2005) The universal protein resource (UniProt). Nucleic Acids Res. 33:D154-D159. doi:10.1093/nar/gki070

Chapter 22
Scalable Computing for Evolutionary Genomics*
Pjotr Prins, Dominique Belhachemi, Steffen Möller, and Geert Smant
Abstract
Genomic data analysis in evolutionary biology is becoming so computationally intensive that analysis of
multiple hypotheses and scenarios takes too long on a single desktop computer. In this chapter, we discuss
techniques for scaling computations through parallelization of calculations, after giving a quick overview of
advanced programming techniques. Unfortunately, parallel programming is difficult and requires special
software design. The alternative, especially attractive for legacy software, is to introduce "poor man's" parallelization by running whole programs in parallel as separate processes, using job schedulers. Such
pipelines are often deployed on bioinformatics computer clusters.
Recent advances in PC virtualization have made it possible to run a full computer operating system, with
all of its installed software, on top of another operating system, inside a "box", or virtual machine (VM).
Such a VM can flexibly be deployed on multiple computers, in a local network, e.g., on existing desktop
PCs, and even in the Cloud, to create a virtual computer cluster. Many bioinformatics applications in
evolutionary biology can be run in parallel, running processes in one or more VMs. Here, we show how a
ready-made bioinformatics VM image, named BioNode, effectively creates a computing cluster, and
pipeline, in a few steps. This allows researchers to scale up computations from their desktop, using available
hardware, anytime it is required.
BioNode is based on Debian Linux and can run on networked PCs and in the Cloud. Over 200
bioinformatics and statistical software packages, of interest to evolutionary biology, are included, such as
PAML, Muscle, MAFFT, MrBayes, and BLAST. Most of these software packages are maintained through
the Debian Med project. In addition, BioNode contains convenient configuration scripts for parallelizing
bioinformatics software. Where Debian Med encourages packaging free and open source bioinformatics
software through one central project, BioNode encourages creating free and open source VM images, for
multiple targets, through one central project.
BioNode can be deployed on Windows, OSX, Linux, and in the Cloud. Next to the downloadable
BioNode images, we provide tutorials online, which empower bioinformaticians to install and run BioNode
in different environments, as well as information for future initiatives, on creating and building such images.
Key words: BioNode, Bioinformatics, Evolutionary biology, Big data, Parallelization, MPI, Cloud
computing, Cluster computing, Virtual machine, Amazon EC2, OpenStack, PAML, MrBayes,
VirtualBox, Debian Linux

Availability: The 32-bit and 64-bit BioNode desktop images for VirtualBox and the BioNode Cloud images are
based on free and open source software and can be found at http://www.evolutionarygenomics.net/ and http://
biobeat.org/bionode.
Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5_22,
© Springer Science+Business Media, LLC 2012


1. Introduction
Investigative evolutionary biology, nowadays, includes comparative
analysis of genomes, transcriptomes, proteomes, and interactomes,
across individuals and even across species. The analysis of data,
generated by the latest acquisition technologies, is becoming so
computationally intensive that either an analysis won't run on a
desktop computer or it is so slow that it prevents researchers from
trying different scenarios and/or hypotheses.
Evolutionary genomics often requires lengthy computations in
a multidimensional search space. Examples of such expensive computations are Bayesian analysis, inference based on Hidden Markov
Models, and maximum likelihood analysis, implemented, e.g., by
MrBayes (1), HMMER (2), and phylogenetic analysis by maximum
likelihood (PAML) (3), respectively. Genome-sized data, or Big
Data (4), such as produced by next-generation sequencers, as well
as growing sample sets, such as from the 1,000 Genomes Project (5),
are exacerbating the computational time problem.
In addition to being computationally expensive, many implementations of major algorithms and tools in bioinformatics do not
scale automatically. An example of legacy software requiring
lengthy computation is Ziheng Yang's codeml implementation of
PAML (3). PAML can find amino acid sites which show evidence of
positive selection using dN/dS ratios, which is the ratio of nonsynonymous and synonymous substitution rate, see also Chapter 5
of this volume on selection on the protein coding genome (6).
Executing PAML over an alignment of a hundred sequences may take hours, sometimes days, even on a fast PC. PAML (version 4.x) is designed as a single-threaded process and can only utilize a single central processing unit (CPU) to complete a calculation. To
test hundreds of alignments, e.g., different gene families, PAML is
invoked hundreds of times in a serial fashion, possibly taking days
on a single computer. Here, we use PAML as an example, but the idea holds for any software program that is both CPU bound, i.e., the CPU speed determines total program execution time, and single threaded. A CPU bound program will show (close to) 100% usage for a single CPU. A large number of such legacy programs are CPU bound and do not scale by themselves.
Scaling up of computations may be possible through parallelization. Parallelization means the computational effort is distributed
among multiple CPUs. This can be among multiple cores within a
single processor, a multiprocessor system, or a network of computers, a so-called computing cluster. While CPUs are still getting faster, in recent years most of the gain in computational processing power has come from parallelization.


One recent advance is Cloud computing, which allows the utilization of additional CPUs on the Internet, and is playing an increasingly important role in bioinformatics. Where previously bioinformaticians had to physically install and maintain computer clusters to scale up computations, nowadays Cloud computing
allows renting CPUs and data storage on demand over the Internet,
thereby providing a flexible concept of on-demand computing (7).
Later in this chapter, after a quick introduction on parallelization,
we provide means of scaling up computations in the Cloud.
1.1. Understanding Parallelization

With parallel computing, computational problems are divided into
smaller ones and solved concurrently. Parallelism has been
employed for many years, mainly in high-performance computing
(HPC). With HPC, scientists from different disciplines use supercomputers and computer clusters to solve computational problems.
Now, why do so few software programs, inside and outside bioinformatics, take full advantage of parallel computing? The answer is that most software is single threaded, because writing parallelized multi-CPU software has proven to be difficult (8-10). Parallel programming, even with supporting tools and libraries, is complex. Typically, parallel programming implies complicated data and control flow; causes deadlocks, where dependent threads wait for each other forever; and causes race conditions, where the outcome depends on the unpredictable timing of threads. This complexity complicates software development, bug hunting, and code maintenance.
Not only is parallel programming intrinsically complicated, programmers also have to deal with communication overheads between parallel threads. MrBayes, a program for calculating phylogenetic trees based on Bayesian analysis, comes with MPI support. MPI is a message-based abstraction of parallelization, in the form of a binary communication protocol implemented in a C programming library (11). Sometimes the parallelized version is slower than the single CPU version. For example, the MPI version calculates each Markov chain in parallel, and the chains need to be synchronized with each other in a scatter and gather pattern. The chains spend time waiting for each other, in addition to the communication overheads introduced by MPI itself. Recent work on MrBayes combines coarse-grained use of OpenMPI with fine-grained use of pthreads or OpenMP, leading to improved scalability (e.g., ref. 12).
Another example of communication overhead is with the statistical programming language R. R does not have native threading support built into the language. Currently, the best option is to use the R/SNOW library (13), which is MPI based. Effectively, R/SNOW only allows coarse-grained parallelization from R, as each parallelized R thread starts up a full R instance, introducing large overheads, both in communication time and memory footprint. For a parallelized program to be faster than its single-threaded counterpart, these communication overheads have to be accounted for.


The need to scale up calculations on multi-CPU computers has increased the interest in a number of functional programming languages, such as Erlang (14), Haskell (15), and Scala (16). These languages ease the writing of parallel software by introducing abstractions of parallelization and immutable data, combined with automatic garbage collection (8, 17). For example, Actors are an abstraction of parallelization and make reasoning about fine-grained parallelization easier and therefore less error prone. Actors were introduced and explored by Erlang, a computer language originally designed for highly parallelized telecommunications computing. To the human programmer, each Actor appears as a linear piece of programming and is parallelized without the complexity of locks, mutexes, and semaphores. Actors allow for parallelization in a manageable way, where threads are guaranteed to be independent and each has a message queue, similar to MPI. Actors, however, are much faster, more intuitive, and, therefore, probably, safer than MPI. Immutable data, when used on a single multi-CPU computer, allows fast passing of data by reference between Actors. When a computer language supports the concept of immutable data, it guarantees data is not changed between parallel threads, again making programming less error prone. Actors with support for immutable data are implemented as an integral part of the programming language in Erlang, Haskell, Scala, and D (18).
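Although Python is not one of these languages, the Actor idea can be sketched with a thread that owns a mailbox; CounterActor and its messages are illustrative inventions, not the API of any of the languages above:

```python
# Minimal actor sketch: an independent thread owns a mailbox (message
# queue) and processes one message at a time, so the user code needs no
# locks, mutexes, or semaphores.
import threading
import queue

class CounterActor:
    def __init__(self):
        self.mailbox = queue.Queue()   # the actor's private message queue
        self.total = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        # The actor's event loop reads like linear, single-threaded code.
        while True:
            msg = self.mailbox.get()
            if msg == "stop":
                break
            self.total += msg          # only this thread touches the state

    def send(self, msg):
        self.mailbox.put(msg)          # asynchronous, thread-safe delivery

    def join(self):
        self._thread.join()

actor = CounterActor()
for i in range(10):                    # messages from the "outside world"
    actor.send(i)
actor.send("stop")
actor.join()
```

Because all state is confined to the actor's own thread and communication happens only through the mailbox, the usual shared-memory hazards simply do not arise.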
Another abstraction of parallelized programming is the introduction of goroutines, part of the Go programming language (19). Where MPI and Actors are related to a concept of message passing and mail boxes, goroutines are more related to Unix named pipes. Goroutines also aim to make reasoning about parallelization easier, by providing a pipe where data goes in and results come out, processed concurrently, without the use of mutexes. Goroutines are related to communicating sequential processes (CSP), described in the original paper by Tony Hoare in 1978 (20). It is important to note that the problems, ideas, and concepts of parallel programming are not recent: they have been an important part of HPC for decades. Meanwhile, recent practical implementations are driven by the ubiquity of cheap multicore computers and the need for scaling up. A Java implementation of CSP exists, named JCSP (21), and a Scala alternative named CSO (22). Go made goroutines intuitive and a central part of the strongly typed compiled language. We invite the bioinformatics reader interested in parallel programming to read up on the languages that have solid built-in support for high-level parallelization abstractions, in particular, Scala (16), Go (19), and D (18).
1.2. Parallelization with Cloud Tools

The basis of Cloud computing is the virtualization of hardware, hosted somewhere on the Internet. The underlying technology is abstracted away, while offering users a familiar environment.
Later in this chapter, we discuss implications of creating a scalable
virtual computing cluster in the Cloud. But first we mention a


number of specialized technologies offering forms of coarse-grained parallelization, which are evolving in Cloud computing.
MapReduce is a framework, originally by Google, for processing huge datasets on certain kinds of distributable problems using a large number of computers (23). The map step takes a dataset, splits it into parts, and distributes them to worker nodes. Worker nodes can further split and distribute data. At the reduce step, data is combined into a result. An application programming interface (API) is provided which allows programmers to access this functionality. The Apache Hadoop project, which develops open source software for reliable, scalable distributed computing, includes a MapReduce implementation and other tools, such as a distributed file system (24). These can be used with multiple Cloud providers, and also on private computer clusters. The advantage of Hadoop is that it goes beyond virtualization of hardware: it provides services where the programmer is further removed from the underlying operating system. The downside is that the programmer has less control over bottlenecks.
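The map, shuffle, and reduce steps can be sketched in a few lines of plain Python, here counting codon usage across toy sequences; a framework such as Hadoop distributes exactly these same steps over many machines:

```python
# The map/shuffle/reduce pattern in miniature: count codons across toy
# "documents" (sequences). A real framework distributes these same steps.
from collections import defaultdict

def mapper(seq):
    # map: emit (key, value) pairs, here (codon, 1)
    return [(seq[i:i + 3], 1) for i in range(0, len(seq) - 2, 3)]

def reducer(key, values):
    # reduce: combine all values emitted for one key
    return key, sum(values)

sequences = ["atgatg", "atgtaa"]       # two tiny input "documents"

shuffled = defaultdict(list)           # shuffle: group pairs by key
for seq in sequences:
    for key, value in mapper(seq):
        shuffled[key].append(value)

counts = dict(reducer(k, v) for k, v in shuffled.items())
```

Because mapper calls are independent of each other (and reducer calls likewise, per key), the framework is free to run them on as many worker nodes as are available.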
Essentially, both commercial providers and open source software programmers are writing a range of new tools suitable for the
Cloud. Bioinformaticians can utilize these facilities, provided they
write new software to use this functionality, i.e., such software has
to be designed specifically for the Cloud.
Cloud computing can have additional benefits when public
bioinformatics data resources get pooled and cached, e.g., in large
SQL databases, in Cloud centers, bringing down data access times
for calculation nodes.
1.3. Parallelization of Applications Using a Pipeline

Next to the above parallel programming techniques, which can be used when writing parallelized software from the ground up, the
often seen parallelization strategy in bioinformatics is to start with an existing nonparallel (legacy) application and to run it in parallel by dividing data into units of work, or jobs. Unlike OpenMP, MPI, and Actors, jobs are run independently as separate processes. With this strategy, jobs normally do not communicate with each other, which in effect is a poor man's form of parallel computing. Input, split into jobs, is fed to each process by the user, and job output is collected and collated. In bioinformatics, this leads to designing pipelines, which allow the use of multiple cores and clustered computers with legacy software (25). With the PAML example, each single job can be based on one alignment, potentially allowing linear speed improvements by distributing jobs across multiple CPUs and computers. In other words, the PAML software, by itself, does not allow calculations in parallel, but it is possible to parallelize multiple runs of the standard PAML software with the aid of some other tools. The downside to this approach is the necessary installation and configuration of pipeline software, as well as the management and complexity of splitting inputs and the collecting and collating of output. Also, these pipelines are potentially fragile, as there is no real interprocess communication, i.e., what happens when there is a disk or network error in the middle of a week-long calculation?
Even for multithreaded applications, applications that actually make use of multiple CPUs, such as BLAST and MrBayes, it may be interesting to scale up calculations this way. For example, MrBayes MPI version 3.1.2 does not provide between-machine parallelization and is therefore machine bound, i.e., the machine determines the total running time. Still, if you need to calculate thousands of different phylogenetic trees, discrete jobs can be distributed across multiple machines.
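Such job-level parallelization can be sketched with Python's concurrent.futures. Here, run_job is a placeholder for invoking an external program such as codeml on one alignment (which a real pipeline would do with subprocess); the computation shown is illustrative only:

```python
# "Poor man's" parallelization: independent jobs distributed over a pool
# of workers. run_job stands in for launching an external program (e.g.,
# codeml on one alignment); external processes release the interpreter,
# so a thread pool suffices to keep many of them running concurrently.
from concurrent.futures import ThreadPoolExecutor

def run_job(alignment_id):
    # Placeholder for: subprocess.run(["codeml", ...]) on one alignment.
    return alignment_id, len("acgt") * alignment_id

jobs = range(8)                        # eight independent "alignments"
with ThreadPoolExecutor(max_workers=4) as pool:
    # map splits the input into jobs; dict() collects/collates the output
    results = dict(pool.map(run_job, jobs))
```

Because the jobs never communicate, failures stay isolated per job, but, as noted above, the pipeline itself must still detect and rerun any jobs lost to disk or network errors.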
Note that there are already a number of solutions available for building full bioinformatics pipelines, such as Ensembl on a Beowulf cluster, using OpenPBS for job control (26). In the next section, we discuss Cloud computing. Fifteen years ago, Beowulf, in a way, pioneered today's Cloud computing by proving that networked low-cost commodity hardware with Linux is suitable for scalable scientific computing. One of the strengths of Beowulf is that it comes with full MPI support. Now, it is possible to run Beowulf in the Cloud, i.e., low-cost commodity hardware has been replaced with virtualized hardware.
1.4. A Pipeline in the Cloud

Recent advances in computer operating system (OS) virtualization have made it easy to deploy a full PC operating system, with all of its installed software, inside a so-called virtual machine
(VM). A researcher can flexibly deploy the VM on multiple PCs in a
local network and even in the Cloud. In general, Cloud computing
frees researchers from owning physical infrastructure by renting
usage from a third-party provider. This allows building a parallelized
computing cluster, through a Web interface, without the requirement of investing in a traditional hardware setup, i.e., stacks of
rack-mounted computers, networking gear, and air conditioning.
To create a bioinformatics pipeline, Cloud computing can be used in combination with a local setup. First, virtualize machines on a local network, such as a few office or lab PCs, and use these for calculations. Then, using the same technology, expand into the Cloud when calculations take too long, i.e., for high-peak usage, when advantageous.
Here, we provide tools for creating such a bioinformatics pipeline, introducing BioNode, which can create compute nodes on
local PCs and even in the Cloud. In this chapter, we use this to
parallelize a legacy application, PAML. Such a genome-wide computation, testing clustered gene families for evidence of positive
selection, is also discussed in Chapter 19 of this volume on evolutionary genetical genomics (27). Essentially, a pipeline is created by
dividing data into discrete units of work, one alignment being
one job. These are executed in parallel using cluster management
software.

2. On-Demand Scalability with BioNode

BioNode, a ready-made computing environment that can be downloaded from the Internet and deployed in a virtual machine (VM), was created so that it can run on a single multicore desktop computer, a few networked computers, and in the Cloud. Virtualization, in general, allows running one operating system, e.g., Linux, inside another operating system, e.g., OSX or Windows. BioNode is a free and open source Linux image that can be deployed anywhere. This strategy is widely applicable, as BioNode works with most bioinformatics programs on a single PC.

2.1. Packaging Software for BioNode

BioNode is based on Debian Linux and includes software packages,
and meta-packages, of the Debian Med project (28). Historically, a
large number of bioinformatics software programs are available for
Linux, and this number is growing every day. When the distribution
license of the software is suitable, i.e., it is free and open source
software (FOSS), the software can easily be packaged and
distributed through Debian. For Debian Med and BioNode, the
Debian packaging system is a logical candidate, because it represents
millions of users and targets most platforms, even though it is not
the only packaging system around (RPM being a notable alternative,
with Linux distributions Fedora, OpenSuSE, and CentOS). Debian
Med, part of the Debian Linux initiative, is the de facto largest
software packaging effort for bioinformatics and provides ready
and coherent software packages for both medical informatics and
bioinformatics (28). Debian-derived distributions, such as Ubuntu
Linux, Linux Mint, and the Fink project for OSX, make Debian a
viable and interesting ecosystem for package management, with
projects exchanging information and updates. Programs are
distributed as binary packages ready for use, built on Debian's
network of auto-building machines, from source code that is further
annotated and uploaded as packages by individuals. All packages are
versioned and have set dependencies. Debian Med closes the gap
between bioinformatics developers and users. It provides a simple
method for offering new releases of software and data resources,
thus provisioning a local infrastructure for computational biology.
Debian Med provides a Web portal interface, allowing users
to browse packages of interest and select specific packages and
combinations of packages, or meta-packages, including imaging,
statistics, bio and bio-dev (bioinformatics for software development). Packages with an emphasis on computation have also been
collected under the Cloud meta-package. A regular Debian
meta-package allows the easy installation of a whole set of packages
at one time (28). All this functionality is available with the BioNode
images; in addition, they contain configuration and scripts for
parallelizing bioinformatics software.

For BioNode, we added new Debian Med packages, including the Cloud meta-package and packages for running BioNode tests and examples, e.g., PAML, pal2nal, and rq (see below).
2.2. A Ready-Made BioNode Image for Parallelized Computing

The BioNode images, accompanying this chapter, only use FOSS,
starting with Linux itself. Unlike Windows and OSX, Linux has no
licensing restrictions for copying across VMs. Therefore, a BioNode Linux image can be deployed freely. BioNode works as easily
in a virtual machine, on a single computer, or in the Cloud. Other
free tools, e.g., the cluster management tools, rq and TORQUE,
are included, which allow running a legacy program, here PAML, in
a batched way, so as to maximize the use of available computing
power. The experience gained with running BioNode, and examples, can easily be leveraged into existing cluster and HPC setups, as
many of these use the same, or similar, tools.

2.3. Parallelizing an Application with BioNode on a Desktop PC

For the PC, BioNode comes as an Internet-downloadable VirtualBox image, ready for the desktop. VirtualBox is an x86 virtualization application, with similar functionality to, e.g., VMware or
XEN, that is installed in an existing host operating system (e.g.,
ref. 29). Within this box, additional guest operating systems, each
known as a Guest OS, can be loaded and run with its own environment. This means a researcher can run a BioNode on an existing
installation of Microsoft Windows, Apple OSX, or Linux. While
VirtualBox is a commercial product, there also exists a free and open
source edition (OSE), which can be freely deployed on existing
PCs on a local area network (LAN). VirtualBox uses hardware
virtualization, which gives it close to native performance (e.g.,
ref. 30). On a PC, install the free VirtualBox on Windows, OSX,
or Linux; download our BioNode image and add it to VirtualBox.
For example, see VirtualBox online tutorials or our BioNode tutorial (31). In VirtualBox, specify the number of CPUs to use as well
as computer memory. When the image boots up, it presents a
standard Debian Linux desktop; log in with user guest and password guest. The desktop allows the use of both graphical and
command-line tools.
We have created a number of tests, or examples, which allow
testing the operation and performance of a BioNode. Tests are
available as icons, or can be used from the command line, in the
/home/guest/Springer/Scalability directory. For example, to
run the PAML20 test, click on the icon, which runs the prepared
script:
Shell
cd /home/guest/Springer/Scalability
# run the single CPU version
./scripts/run-CPU1-PAML20.sh
# run the parallel version on four CPUs
./scripts/run-rq-PAML20.sh 4

The PAML20 test takes 20 alignments, in this case putative
gene families of the oomycete P. infestans, and tests them for
evidence of positive selection. P. infestans is a single cell pathogen,
which causes late blight of potato and tomato. Gene families under
positive selection pressure may be involved in protein–protein
interactions and are potentially of interest for fighting late blight
disease. See also Chapter 35 (27). Gene families used here were
generated by BLAST's blastclust (32) at 70% similarity on predicted genes and were amino-acid aligned using Muscle (33),
followed by translation to a nucleotide alignment using pal2nal.
The PAML20 script tests for evidence of positive selection using
PAML's codeml with models M0–M3. Note that the tools and
settings used here are chosen merely for educational purposes and
performance measurement. The approach itself may produce false
positives, as explained by Schneider et al. (34). Also, PAML is not
the only software that can test for evidence of positive selection;
e.g., the HyPhy molecular evolution and statistical sequence analysis package contains similar functionality and uses MPI to
parallelize calculations (35). PAML is used here because it is a
reference implementation and serves as an example of how a legacy
single-threaded bioinformatics application can be parallelized.
The parallelized run makes use of the rq queue manager.
A queue manager is needed, even on a single computer, when
running all jobs at the same time exceeds the available computer
resources, e.g., when there is not enough computer RAM to hold
all jobs. The queue manager makes sure a fixed number of jobs are
running in parallel, starting a new one each time a job finishes. rq is
a zero-admin, zero-configuration, minimalistic tool for creating an
instant computing cluster, providing priority work queues. rq
depends on a few Ruby modules and sqlite, which are included
with Debian Med. After submitting jobs to rq through the desktop
icon or script, progress can be viewed from the command line with
rq ~/queue status. To view the results of a job, use rq ~/queue job,
where job is the job number, or rq ~/queue execute 'select *
from jobs' for all jobs. It only takes a few commands on the
command line to manage and monitor jobs.
The tests result in concrete timing numbers. The script ./
scripts/run-CPU1-PAML20.sh runs all 20 jobs serially in
34.5 min, while ./scripts/run-rq-PAML20.sh 8 uses rq to run
PAML on 8 cores, and takes a total running time of 6.5 min to
complete all jobs (note absolute times will differ on other hardware). Ideally, running jobs in parallel on a single multicore
machine shows linear performance increase for every CPU added,
but in reality it is less than linear. Resource contention on the
machine, e.g., for disk or network IO, can make processes wait for
each other. Also, the last, and perhaps longest-running, job causes
total timing to fall short of linear scaling, as CPUs that have already
finished sit idle.
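These timings can be turned into an observed speedup and a per-core efficiency figure; the arithmetic below simply uses the numbers quoted above (awk is chosen because it ships with BioNode's Debian base):

```shell
# Observed speedup and per-core efficiency for the 8-core PAML20 run.
awk 'BEGIN {
  serial = 34.5; parallel = 6.5; cores = 8    # minutes and CPUs from the text
  speedup = serial / parallel                 # well below the ideal 8x
  efficiency = speedup / cores * 100          # fraction of each core put to work
  printf "speedup %.1fx, efficiency %.0f%%\n", speedup, efficiency
}'
# prints: speedup 5.3x, efficiency 66%
```

The gap between 5.3x and the ideal 8x is exactly the contention and idle-tail effect described above.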

2.4. Parallelizing an Application with BioNode on Multiple Machines

When machines are grouped together in a computing cluster, some additional complexity is introduced, as the job queue needs to be
managed across machines. To scale up our tests, install VirtualBox
on additional PCs and configure VirtualBox to use network bridging so that each VM can acquire its own IP address. For instructions, see
the online tutorial (31).
Again, using rq as a queue manager is the quickest option. The
only requirement is a shared network directory between machines,
because rq uses such a directory to communicate between computing nodes. When several machines share a network directory, rq
runs a job manager on each machine. For setting up such a network
directory, we have included a configuration script which only
requires an IP address.
Shell
cd /home/guest/Springer/Scalability
# On the shared directory (NFS) server
./scripts/run-rq-nfs-server.sh
> BioNode rq /export/data/rq NFS server running on 10.0.0.15
# On each client give the NFS server address:
./scripts/run-rq-nfs-client.sh 10.0.0.15
> BioNode rq mounted /export/data/rq

Adding another 8-core machine, we created a mini-cluster of two
networked PCs totalling 16 CPUs, which reduced the total running
time of the PAML20 test to 4 min. Another lesson here is that
adding CPUs and machines can scale up calculations, but scaling is
never fully linear. The reason is that, in addition to the resource
contention within a single machine, the network introduces latencies
when data goes through the shared network directory on the
network file system (NFS), i.e., more bottlenecks. Also, in this
setup, the additional processes access one single disk resource on
the central NFS server. In some cases, a smarter setup could be to
pull the data files to the local hard disk first, before running the
analysis.
While rq shines in its simplicity, more advanced cluster management tools are available, which handle scheduling, prioritization, pipelining, and job control. These tools are suitable for
creating full bioinformatics pipelines. The two largest open source
clustering tool projects are GridEngine (36) and TORQUE (37);
both are used in HPC setups with over 20,000 CPUs. Both come
prepackaged with Debian Linux, and therefore with BioNode, and
offer similar features, and both can be run in the cloud. For our
demonstration, we opt for TORQUE to effectively emulate a computer cluster using BioNodes.

2.4.1. Using TORQUE

The Tera-scale Open-source Resource and QUEue manager
(TORQUE) is a resource manager providing control over batch
jobs and distributed compute nodes (37). TORQUE optionally
comes with Maui, which optimizes for resources and job allocation.
With the BioNode supplied with this chapter, it is possible to run
TORQUE and the included test examples. To run PAML on
TORQUE, we have created a script which fires up TORQUE
and adds cluster nodes. All that needs to be done is to tell the script
which IP addresses to use.
TORQUE, unlike rq, requires an appointed server node,
though it can still act as a cluster node for computations. This
server node informs the cluster nodes of the jobs that should be
run. A node, in turn, informs the server when jobs have completed.
With TORQUE, nodes can be added and removed. A running
job that loses its compute node, e.g., because of a crash or service
interruption, may not be completed. It will be listed as such in the
error logs and should be run again. TORQUE has more features,
e.g., one can limit the compute time of a job or have several queues
of different priorities. Nodes can also have different properties, e.g.,
some nodes may be equipped for computations on graphics cards
and accept special jobs.
In principle, TORQUE does not require rq's shared network
directory. Instead, all input and output files can be transferred via
http or even the secure shell (ssh) protocol. The latter requires ssh
key management, which adds complexity to the setup.
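As a hedged sketch of the batch-job and walltime features just described (the job name, walltime, and codeml control file are illustrative assumptions; the #PBS directives themselves are standard TORQUE/PBS), a per-alignment TORQUE job script could look like this:

```shell
# Write a TORQUE job script for one alignment.
cat > paml-job.sh <<'EOF'
#!/bin/sh
#PBS -N paml-fam001        # job name shown in qstat
#PBS -l nodes=1:ppn=1      # one core on one node per alignment
#PBS -l walltime=01:00:00  # limit the compute time, as described above
cd "$PBS_O_WORKDIR"        # TORQUE starts jobs in $HOME by default
codeml fam001.ctl
EOF
# Submit and monitor (requires a running TORQUE server):
#   qsub paml-job.sh       # prints the new job id
#   qstat                  # lists queued and running jobs
```

qsub and qstat are TORQUE's standard submission and monitoring commands; submitting one such script per alignment reproduces the rq workflow on a managed cluster.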

2.5. Parallelizing an Application with BioNode in the Cloud

Cloud computing is being taken up, both in industry and science,
for on-demand computing (e.g., ref. 7). The Cloud commoditizes
cluster infrastructure and management. In addition, the Cloud allows
users to run their own operating system, which is usually not the
case with existing cluster and GRID infrastructure (a GRID is a
heterogeneous network of computers that act together). A hypervisor sits between the host operating system and the guest
operating system(s). It makes sure they are clearly separated while
virtualizing the host hardware. This means many guests can share
the same physical machine, while each guest appears to its users as a
single machine on the network. This allows providers to allocate
resources efficiently.
Multiple providers exist, including Google, Microsoft, Rackspace
OpenStack, and Amazon Elastic Compute Cloud (EC2). Amazon
EC2, for one, provides clustering of 64-bit Debian Linux virtual
machines, allocating them physically close together with fast network links, so as to reduce network latencies. Open source alternatives, such as the EC2-compatible Eucalyptus and, more recently, the
OpenStack (38) initiative, allow using, and improving on, these APIs.
OpenStack is a collection of technologies delivering a scalable
cloud operating system using an open API. The software is free
and open source and can be used to create a virtual machine service
provider or a private cloud.

Currently, we provide a BioNode 64-bit image for Amazon
EC2; this image can also be used on any service that is based on
XEN or KVM hypervisors, such as the free and open source
Eucalyptus virtualization manager (39), which allows anyone to
create an EC2 compatible service, i.e., a private cloud service. The
authors currently run Eucalyptus on a Linux cluster to manage
local VMs.
To use BioNode on EC2, create an account on http://aws.
amazon.com, select the BioNode image online, and start it up, following the EC2 tutorials online. In a nutshell, to create a cluster: create
keys and choose Cluster Compute Instances under Web Services to start
instances. Started instances have a public IP address, which you can use to
ssh in, and then run rq or TORQUE, as described above.
In effect, using BioNode in the cloud is almost identical to
using BioNode on a local network, and compute nodes can even be
combined into one computing cluster, mixing a local and remote
setup (see Fig. 1). For rq, a shared network directory is required.

Fig. 1. Schematic diagram of scaling up computations on BioNode, here an example of SNP detection, both on a local area
network (LAN) and in the Cloud. From the PC, BioNodes are started, first virtualizing BioNode with VirtualBox on idle
computers on the LAN, e.g., on office or laboratory computers, and next by running BioNode in the Cloud, e.g., Amazon
EC2, when more calculation power is required. Jobs are distributed across nodes. This way, a virtual computing cluster
is created, where nodes communicate through a shared file storage (FS), which can be located either on the LAN or in the
Cloud. BioNode provides a full Debian Linux environment, with the largest collection of free and open source bioinformatics
software currently available. From the user's perspective, scaling BioNode from a PC, onto the LAN, and into the cloud,
amounts to a single investment. Note that clustering computers in the cloud does not escape the physical bottlenecks
of computing, i.e., computer networks are a bottleneck for big data, see also ref. 8. Public domain graphics courtesy of
http://www.openclipart.org.

It is possible to run either NFS or sshfs (a shared network directory over ssh that can be set up in user space) in the Cloud to manage the job queue.
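As a hedged sketch (the host address and paths are illustrative, mirroring the NFS example earlier), an sshfs mount of the shared queue directory could be scripted as follows; because sshfs runs in user space, no root access is needed on the client:

```shell
# Save a small sshfs mount helper for the shared rq queue directory.
# Host and paths are illustrative assumptions, matching the NFS example.
cat > mount-queue.sh <<'EOF'
#!/bin/sh
mkdir -p "$HOME/queue"
sshfs guest@10.0.0.15:/export/data/rq "$HOME/queue"  # mount the shared queue
# fusermount -u "$HOME/queue"                        # unmount when done
EOF
chmod +x mount-queue.sh
```

Only an ssh login on the server is required, which makes sshfs convenient when NFS cannot be configured, e.g., on rented Cloud nodes.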
With TORQUE, files can be shared over NFS or copied over
ssh. For more information on managing TORQUE on a Debian
installation, we refer to our online tutorial (31). This includes a
series of step-by-step tutorials for anyone interested in building a
Debian cluster from scratch.
Our test scripts show that BioNode has similar performance
and scalability in the cloud, compared to running it on the local
network, after adjusting for differences in hardware and network
speeds. This makes cloud computing an attractive proposition for
occasional scaling up of calculation jobs, especially when data sizes
are not too large. Note, however, that costs increase rapidly when
pushing large data files into the Cloud, because calculation time
increases accordingly (8). Based on early 2011 rates with Amazon EC2,
the largest Cloud computing provider, one CPU hour costs on the
order of $0.10, and moving a gigabyte of data to Amazon's Simple
Storage Service (S3) costs $0.10. Storing the data in S3 costs
$0.10 per gigabyte per month. Running a calculation pushing
500 MB of data through each of 1,000 nodes would cost at least
$3,000, because of nodes waiting for data (8). This waiting time is
caused by the network latencies within the Cloud, not by the
Internet connection to the Cloud, so shipping a hard disk won't
help. Such a calculation shows that the benefits of Cloud computing need to be balanced against the reality of Cloud latencies
introduced by IO bottlenecks.
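A back-of-envelope check of these figures (rates as quoted above; the waiting-time share of the total is taken from ref. 8, not recomputed here) shows that the raw transfer fee is only a small part of the bill:

```shell
# Data-transfer cost alone for the example: 1,000 nodes x 500 MB each
# at $0.10 per gigabyte. The bulk of the ~$3,000 total is CPU hours of
# nodes idling while data trickles through shared storage (ref. 8).
awk 'BEGIN {
  nodes = 1000; gb_each = 0.5; rate_per_gb = 0.10
  transfer = nodes * gb_each * rate_per_gb
  printf "data transfer alone: $%.0f of the $3,000 total\n", transfer
}'
# prints: data transfer alone: $50 of the $3,000 total
```

In other words, it is the metered idle CPU time, not the per-gigabyte fee, that dominates the cost of data-heavy Cloud runs.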

3. Discussion
In this chapter, we discuss the scaling up of computations through
parallelization, a necessary strategy because the rate of data
acquisition in biology is increasing rapidly and outpacing increases in
computer hardware speed. In bioinformatics, the common parallelization strategy is to take an existing nonparallel application and
divide its data into discrete units of work, or jobs, distributed across
multiple CPUs and clustered computers. Ideally, parallelizing processes
shows a linear performance increase for every CPU added, but in
reality it is usually less than linear; resource contention on the
machine, e.g., for disk or network IO, makes processes wait for each other.
We created BioNode, a ready-made Linux image for
parallelized computing, that can be downloaded from the Internet
and deployed as a virtual machine, so that it can run on a single
multicore desktop computer, a few networked computers, and even
in the Cloud. BioNode is based on Debian Linux and includes
software packages, and meta-packages, of the Debian Med team.

Debian Med, part of the Debian project, is the largest bioinformatics
open source software packaging effort and provides hundreds of
ready and coherent software packages for medical informatics and
bioinformatics. Where Debian Med encourages packaging free and
open source bioinformatics software through one central project,
BioNode encourages creating free and open source VM images, for
multiple targets, through one central project. Currently, we provide
BioNode images for VirtualBox, which can be run on local machines,
as well as a similar image for Amazon EC2, which can also be used on
any service that is based on XEN or Linux KVM hypervisors, such as
the free and open source Eucalyptus virtualization manager, and
OpenStack providers.
The BioNode images only use FOSS, including Linux itself, so
there are no licensing restrictions for copying across virtual
machines. Linux is free to deploy and works as easily in a local
setup as in the Cloud. Other free tools, e.g., the cluster management tools rq and TORQUE, are included, as well as configuration scripts for parallelizing bioinformatics software.
For BioNode, we added new Debian Med packages, including
the cloud meta-package, and packages for running BioNode tests
and examples. We think it is important to measure performance
between setups, so as to locate bottlenecks and estimate costs of
calculations in the cloud. Our test scripts show that BioNode has
similar performance and scalability in the cloud, compared to running it on the local network, after adjusting for differences in
hardware and network speeds. This makes cloud computing an
attractive proposition for occasional scaling up of calculation jobs,
especially when data sizes are not too large. If data files are large,
however, the calculation time and costs can increase rapidly (8).
Cloud computing allows the utilization of additional CPUs
over the Internet and is going to play an increasingly important
role in bioinformatics, as it frees researchers from owning large
amounts of hardware. Instead there is a metered charge for CPU
time, memory allocation, and network data transfers. All cloud
services provide online calculation sheets, which allow an estimate
of running costs in advance. BioNode in the cloud is identical to
BioNode on a local network, which means compute nodes can be
combined into one computing cluster, mixing a local and remote
setup. One advantage of BioNode is that it creates a user experience
that is the same, whether the node is running on a desktop, on a
local network, or in the Cloud (Fig. 1).
Cloud computing should not be confused with HPC. Cloud
computing borrows some aspects of HPC, especially parallelization
of computing through the clustering of computers and application
of MapReduce, a method of providing distributed computing on
large data sets on clusters of computers (40). Cloud computing,
however, misses out on some important HPC concepts, such as
large memory applications, shared and distributed memory, and

22 Scalable Computing for Evolutionary Genomics

543

large file systems. One important issue is that Cloud providers offer
hardware that is not necessarily designed for high throughput at
every level. For example, hard disk IO may be a bottleneck. Network speeds in the Cloud can fluctuate and can be low, e.g., when
transferring data between S3 and EC2. Also, multiple VMs may be
competing for resources on a single machine, whether for
disk or network IO. We strongly recommend validating assumptions and running trials first. Cloud computing is currently of interest for
bioinformatics for computational problems that can be
split into jobs that require little computer memory and avoid large
data transfers. For other types of problems, such as sequence
assembly, it is more attractive to use a single large multicore computer with large memory and fast storage (8).
For additional information on downloading, installing, and
using BioNode, see the provided online tutorial and wiki space
(31). We also include online resources that contain build instructions for creating these images and information for running
TORQUE and setting up a Cloud cluster with Amazon EC2 or
Eucalyptus. BioNode can be used as the basis for specialized bioinformatics Linux (cluster) VMs. Finally, BioNode provides a flexible
cluster environment with a low barrier to entry, even for researchers
who normally use a Microsoft Windows desktop. BioNode is not
only useful for scaling computations, but can also be used for
educational purposes, especially as the experience gained with
tools and techniques applies to Unix and HPC setups.

4. Questions
1. Download and install BioNode on a desktop, using the instructions in the tutorial (31). How much time does it take to run
the test script discussed above?
2. Install BioNode on a second machine with a bridged network
interface. Mount NFS or sshfs. How much time does it take to
run the test script now?
3. Using online tutorials, create a free EC2 instance, create keys,
and locate and fire up a BioNode AMI. Log in to BioNode
using ssh and record how much time it takes to run the test
script.
4. Use the Amazon EC2 calculation sheet and calculate how
much it would cost to store 100 GB in S3, and execute a
calculation on 100 large nodes, each reading 20 GB of
data. Do the same for another Cloud provider.


Acknowledgments
The European Commission's Integrated Project BIOEXPLOIT
(FOOD-2005-513959 to G.S. and P.P.); the Netherlands Organization for Scientific Research/TTI Green Genetics (1CC029RP to P.P.).
References
1. Ronquist F & Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574
2. Eddy SR (2008) A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol 4:e1000069
3. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556
4. Doctorow C (2008) Big data: welcome to the petacentre. Nature 455:16–21
5. Durbin RM, Abecasis GR, Altshuler DL et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073
6. Kosiol C & Anisimova M (2012) Selection on the protein coding genome. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
7. Schadt EE, Linderman MD, Sorenson J, Lee L & Nolan GP (2010) Computational solutions to large-scale data management and analysis. Nat Rev Genet 11:647–657
8. Trelles O, Prins P, Snir M & Jansen RC (2012) Big data, but are we ready? Nat Rev Genet 12:224. http://www.ncbi.nlm.nih.gov/pubmed/21301471
9. Patterson DA & Hennessy JL (1998) Computer organization and design (2nd ed.): the hardware/software interface. Morgan Kaufmann Publishers Inc
10. Mattson T, Sanders B & Massingill B (2004) Patterns for parallel programming. Addison-Wesley Professional, 384 pages. http://portal.acm.org/citation.cfm?id=1406956
11. Graham RL, Woodall TS & Squyres JM (2005) Open MPI: a flexible high performance MPI
12. Stamatakis A & Ott M (2008) Exploiting fine-grained parallelism in the phylogenetic likelihood function with MPI, Pthreads, and OpenMP: a performance study. Pattern Recognition in Bioinformatics, Springer Berlin/Heidelberg, 424–435. http://dx.doi.org/10.1007/978-3-540-88436-1_36
13. Tierney L, Rossini A & Li N (2009) Snow: a parallel computing framework for the R system. International Journal of Parallel Programming 37:78–90. http://dx.doi.org/10.1007/s10766-008-0077-2
14. Cesarini F & Thompson S (2009) Erlang programming. 1st edn. O'Reilly Media, Inc.
15. Peyton Jones S (2003) The Haskell 98 language and libraries: the revised report. Journal of Functional Programming 13:0–255
16. Odersky M, Altherr P, Cremet V et al. (2004) An overview of the Scala programming language. LAMP-EPFL
17. Okasaki C (1998) Purely functional data structures. Cambridge University Press. doi:10.2277/0521663504
18. Alexandrescu A (2010) The D programming language. 1st edn. Addison-Wesley Professional, 460p
19. Griesemer R, Pike R & Thompson K (2009) The Go programming language. http://golang.org
20. Hoare CAR (1978) Communicating sequential processes. Commun ACM 21:666–677. doi: http://doi.acm.org/10.1145/359576.359585
21. Welch P, Aldous J & Foster J (2002) CSP networking for Java (JCSP.net). Computational Science–ICCS 2002, 695–708
22. Sufrin B (2008) Communicating Scala objects. Communicating Process Architectures, 35p
23. Dean J & Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Communications of the ACM 51:107–113
24. White T (2009) Hadoop: the definitive guide. 1st edn. O'Reilly. http://oreilly.com/catalog/9780596521981
25. May P, Ehrlich H & Steinke T (2006) ZIB structure prediction pipeline: composing a complex biological workflow through web services. Euro-Par 2006 Parallel Processing, Springer Berlin/Heidelberg, 1148–1158. http://dx.doi.org/10.1007/11823285_121
26. Mungall CJ, Misra S, Berman BP et al. (2002) An integrated computational pipeline and database to support whole-genome sequence annotation. Genome Biol 3:RESEARCH0081. http://www.ncbi.nlm.nih.gov/pubmed/12537570
27. Prins P, Smant G & Jansen R (2012) Genetical genomics for evolutionary studies. In: Anisimova M (ed) Evolutionary genomics: statistical and computational methods (volume 1). Methods in Molecular Biology, Springer Science+Business Media New York
28. Möller S, Krabbenhöft HN, Tille A et al. (2010) Community-driven computational biology with Debian Linux. BMC Bioinformatics 11(Suppl 12):S5. http://www.ncbi.nlm.nih.gov/pubmed/21210984
29. Li P (2009) Exploring virtual environments in a decentralized lab. ACM SIGITE Newsletter 6:4–10
30. Tikotekar A, Ong H, Alam S et al. (2009) Performance comparison of two virtual machine scenarios using an HPC application: a case study using molecular dynamics simulations. Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, ACM, 33–40. doi: http://doi.acm.org/10.1145/1519138.1519143
31. Prins P, Belhachemi D & Möller S (2011) BioNode tutorial. http://biobeat.org/bionode
32. Altschul SF, Madden TL, Schäffer AA et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
33. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. doi:10.1093/nar/gkh340
34. Schneider A, Souvorov A, Sabath N et al. (2009) Estimates of positive Darwinian selection are inflated by errors in sequencing, annotation, and alignment. Genome Biol Evol 1:114–118. doi:10.1093/gbe/evp012
35. Pond SL, Frost SD & Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics 21:676–679. http://www.ncbi.nlm.nih.gov/pubmed/15509596
36. Gentzsch W (2002) Sun Grid Engine: towards creating a compute power grid. Cluster Computing and the Grid, 2001. Proceedings. First IEEE/ACM International Symposium on, IEEE, 35–36
37. Staples G (2006) TORQUE resource manager. Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ACM. doi: http://doi.acm.org/10.1145/1188455.1188464
38. OpenStack open source cloud computing software. http://www.openstack.org
39. Nurmi D, Wolski R, Grzegorczyk C et al. (2009) The Eucalyptus open-source cloud-computing system. Proceedings of the 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, IEEE Computer Society, 124–131
40. Matthews SJ & Williams TL (2010) MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees. BMC Bioinformatics 11(Suppl 1):S15

BioNode ...........................476, 508, 509, 515, 534543
BioPerl .................................................................505, 512
BioPython................................................... 505, 511514
BioRuby............................................. 476, 505, 512, 513
Birth-death
model ......................................................................192
BLAST ....................................... 97, 104, 105, 163, 165,
416420, 422, 476, 477, 491, 506, 514,
534, 537
Bonferroni correction ..................................................479
Boot-split distance (BSD).......................... 5567, 72, 76
Bootstrap/bootstrapping ....................... 5565, 71, 256,
257, 405, 406, 411
Branch-site codon models ...........................................121
BSD. See Boot-split distance (BSD)

Maria Anisimova (ed.), Evolutionary Genomics: Statistical and Computational Methods, Volume 2,
Methods in Molecular Biology, vol. 856, DOI 10.1007/978-1-61779-585-5,
# Springer Science+Business Media, LLC 2012


C
Caenorhabditis elegans ..................... 163, 173, 175, 470,
472, 509, 510
Calibrants............................................................. 388390
Call stack.............................................................. 513515
Causality ............................................ 289, 350, 480481
C/C++ ................................................................. 504507
cDNAs .............................114, 168, 373, 417, 471, 472,
476, 509
Cell
cycle ..........................................................................92
division............................................................... 82, 90
membrane...............................................................122
Cellulose .......................................................................474
Centroided data ..................................................394, 396
CgiHunter ....................................................................439
Chaperone ......................................................................92
Chimeric .............................................. 89, 167, 173, 174
ChIP-seq............................................ 142, 346, 351356
Chromatid ....................................................................166
Chromatography .................................................384, 407
Chromosome
rearrangement ........................................................477
Cis ............................................................... 147, 336, 472
Classification rule ......................................399, 400, 402,
403, 405, 407, 409, 412
Clique ..................................................................367, 369
Cloud computing......................531534, 539, 541543
Clustering ................................ 70, 72, 73, 97, 105, 188,
212, 213, 369, 370, 376, 377, 407, 410, 411,
538540, 542
Clusters of orthologous genes (COG(s))............. 48, 56,
70, 72, 92
Coalescence .................. 5, 8, 9, 14, 17, 18, 21, 42, 228,
299, 300, 305, 328, 329
Coalescent model ........... 8–20, 228, 295–298, 327–329
Coalitions......................................... 8895, 97, 101103
Coarse grained.....................................................531, 533
Codon
translation .................................... 505, 508, 509, 513
usage bias .............................................. 130, 132, 248
Co(-)evolution ....................................................... 89, 91,
94, 96, 103, 121, 248, 253, 255, 256, 259,
264265, 468
Columbia (Col).............................................................477
Command line................................... 257, 515, 536, 537
Common disease
common variant (CDCV)........... 276, 279, 280, 288
rare variant (CDRV) .....................................276, 279
Communicating sequential processes (CSP) ..............532
Communities ............................9, 13, 8789, 9193, 95,
96, 98, 102, 375, 478, 482, 483, 486488,
492, 494, 496, 497, 504, 505, 513

Comparative Data Analysis Ontology
(CDAO) .............................................487, 488
Comparative genomics .......................................... 55, 93,
113, 119, 134, 335338, 343, 347350, 356,
433, 500
Complex diseases ................................................276, 281
Complexosome ............................................................369
Complex trait ............................................. 469, 470, 472
Composite-likelihood (CL) .........................................229
Comprehensive R archive network (CRAN)..............508
Concatenation ....................................................... 7–9, 70
Confusion matrix .........................................................410
Conservation ..................... 69, 127, 134, 188, 354356,
368, 369, 432442, 450, 465, 474
Conserved
non-coding sequences ..................................143, 149
synteny ....................................................................474
Constraint...........................................................8, 22, 48,
82, 91, 98, 114, 117, 118, 125, 128, 131,
142152, 172, 175, 178181, 188, 198,
211, 212, 304, 307, 348, 367, 390, 398,
402, 403, 433, 441, 486, 493
Convergent evolution .......................................12, 14, 38
Copy number variants (CNVs) ........ 173, 177, 287, 289
Correlated............................................. 33, 37, 100, 102,
131, 225, 229, 252, 253, 276, 286, 390, 399,
428, 459, 460, 474, 481
Correlation structure ...................................................118
COSI........................................................... 317, 328, 330
CpG islands (CGI) ............................432443, 447450,
452457, 462465
Cross-hybridization......................................................478
Cross-language adapters ..............................................507
Cross-over event..................................................218, 219
Cross-platform experiments ........................................397
Cross validation .......................................... 409, 410, 449
C-terminal .................................................189, 199, 203
CTL. See Cytotoxic T-lymphocytes (CTL)
Curse of dimensionality ................... 398, 399, 401, 404,
405, 410
CUSUM charts ............................................................389
Cytoscape............................................ 97, 101, 104, 105,
375376, 498
Cytotoxic T-lymphocytes (CTL)............... 242, 257261

D
DAG. See Directed acyclic graph
Darwin ............................................. 3, 55, 82, 90, 96, 99
Darwin Core (DC) .............................................487, 488
Dating ......................................................... 164, 168, 181
Deamination ............................. 433, 443, 448, 450, 457
Debian
Linux.............................................535, 536, 538–541
Med project ............................................................535
deCODE..............................................................223, 224
Deep sequencing technology (RNA(-)Seq) ...............478
Defense mechanism .............................................. 92, 474
Deleterious mutation ........................ 134, 150, 151, 279
Deletion ............................. 31, 145, 165167, 173, 189,
190, 207, 208, 363, 365, 367, 372, 481
Deletions of domains ...................................................206
Democratic vote method.............................................6, 7
de novo ................... 115, 169174, 176, 181, 193, 208,
339, 340
Dependency..................................................................506
Differential
gene expression ......................................................478
Differentially methylated region (DMR)....................434
Digital Object Identifier (DOI) ..................................484
Diploid.......................................................295, 315, 316,
330, 331
Directed acyclic graph (DAG)............................315, 316
Directional selection. See Selection
Direct repeats ...............................................................168
Disease associated genes ..............................................462
Distribution
beta ................................................................122, 123
exponential ............................................. 24, 150, 191
gamma ................................................. 118, 150, 246,
296, 411
Poisson............................................................. 49, 304
Distribution normal ............................................400, 401
Divergence
of sequences............................................................299
of species.................................................................311
DNA
double strand breaks ..............................................165
methylation.........................355, 356, 432434, 443,
449455, 457
repair ..............................................................130, 165
replication ......................................................165, 166
sequencing ..............................................................115
transposons .............................................................189
dN/dS ................................... 13, 16, 119, 152, 172, 179,
180, 255, 262, 476, 530
DOI. See Digital Object Identifier (DOI)
Domain (of life, i.e., Eukarya, Archaea, and Bacteria)....16
Domain architecture ........................................... 187213
Drosophila melanogaster..115, 117, 130, 132, 145, 147,
149151, 153, 163, 164, 170, 172, 173, 348,
470
Drug resistance.................................... 84, 248, 250, 251
Dryad ...................................................................484, 490
Duplication
degeneration complementation model (DDC)....176
dispersed ............................................... 166, 173, 180

gene..................... 11, 13–14, 18, 161–167,
171–176, 180, 181, 200, 208–211, 372, 373,
472, 475
segmental ......................................145, 164166, 477
symmetric ...............................................................178
tandem .................................165, 166, 173, 372, 373
whole-genome (WGD).......162164, 373, 374, 377
Dynamic programming............................ 4345, 47, 207

E
Ecosystems..................9193, 9597, 99, 102, 103, 535
Effective population size.................... 6, 22, 31, 42, 114,
135, 151, 228, 276, 277, 297, 300, 307, 311
Effector ................................................................474, 475
EM algorithm. See Expectation-maximization (EM)
algorithm
EMBOSS. See European Molecular Biology Open
Software Suite (EMBOSS)
Emergent evolutionary properties ................................92
Emission probability ....................................................232
Empirical Bayes....................................................247, 251
codon model(s) .............................................120, 134
Encyclopedia of Life (EOL) ...............................482, 490
Endosymbioses ...............................................................89
Enhancer..................144, 146, 289, 350, 354, 356, 432
Environmental
factors.................................................... 142, 469, 481
Epigenetics ................................................289, 350, 352,
353, 355, 356, 431434, 442, 443, 447450,
457, 463, 464
Epigenomics .....................432435, 442, 444, 458, 464
Episodic selection. See Selection
Epistasis ............................................. 236, 252257, 365
Erlang ...........................................................................532
Error handling and exceptions ....................................506
Escape from adaptive conflict (EAC) model .....176, 178
Eucalyptus ......................................... 539, 540, 542, 543
Euchromatin.................................................................432
Eukaryotes .................................... 1113, 16, 30, 32, 89,
95, 100, 152, 170, 193195, 200, 202, 204,
205, 211, 213, 217
European Molecular Biology Open Software Suite
(EMBOSS)....................... 505, 510, 513, 514
Euryarchaeota.......................................................... 39, 74
Eutherians.....................................................................121
Evolutionary
biology ..............236, 471, 482, 483, 496, 504, 505,
508, 530
distance ........................................ 105, 116, 147, 355
expression QTL (eQTL).............471474, 476478,
480482
genetical genomics ............................... 469482, 534
models........................85, 92, 93, 99, 116, 118, 192,
239, 476, 477

codon ................................ 131 (see also Branch-site
codon models; Empirical, codon model(s);
Parametric codon models; Selection–
mutation models)
F81 ...................................................................... 239
general time reversible (GTR) ...........249, 262
HKY/HKY85............................239, 255, 264
prior ........................................................................477
Evolution ontology (EO) ............................................488
Exonization ................................................ 190, 194, 208
Exon shuffling ...................30, 167, 173, 176, 189, 196,
207, 211
Expectation-maximization (EM) algorithm ...............129
Experimental
design...................................383, 385, 388, 389, 474
population ....................................470, 477479, 481
Expression
pattern................................................... 114, 174, 348
trait.................................................................471, 479
Expression QTL (eQTL).............................................469

F
False discovery rate (FDR) .......................122, 309, 343,
353, 354, 390, 479
False positive (error) ...........................................128, 481
Fine grained....................................... 211, 506, 531, 532
Fixation probability.................................... 130, 173, 175
Fixed effect models ......................................................123
Forest of Life (FOL) ............................................... 5376
Four-Gamete test ....................................... 225227, 235
Frameshift .....................................................................172
Fst ........................................................................... 13, 142
Functional
analysis ................................... 99, 415418, 420422
relationship .................................. 102, 365, 374, 480
Fusion ...................18, 19, 84, 167, 173, 176, 190, 203,
207, 208

G
Gag.............................................................. 253, 257, 259
Galaxy ........................................434439, 442, 462, 464
Gametes ...............................................................217, 373
GC-content ........................................................... 16, 130
Gene
accelerated ..................................................... 117118
cluster......................................................................118
comparison ........................ 13, 14, 20, 33, 122, 164,
337340, 349, 350, 376
conserved ......................................................... 71, 433
conversion................................... 116, 130, 164, 189,
218, 328
duplication..................................................11, 1314,
18, 161167, 171176, 180, 181, 200, 203,
208211, 372, 373, 472, 475
evolution...................................................................88

expression ....................................................... 85, 141,


174, 175, 191, 336347, 349353, 355357,
459, 470472, 476, 478, 481, 505
family .................................2948, 89, 192, 376, 477
fission .............................................................190, 207
flow ................................................................8, 10, 11
fusion ..........................167, 173, 176, 190, 207, 208
loss ..........................................................................163
network...................96, 98, 104105, 471, 472, 480
Omnibus .................................................................459
ontology (GO) ........................... 205, 345, 365, 416,
472, 474, 480, 486
order .....................................96, 170, 181, 309, 336,
347, 351, 352, 420, 436, 437, 465
prediction................................................................134
regions .............................................16, 88, 117, 128,
129, 134, 173, 179, 189, 289, 339, 352
regulation............................131, 142, 335337, 341,
346350, 352353, 355, 356, 457, 463, 480
tree ............................. 5–14, 16–18, 20, 40–47, 169,
207–210, 301, 302
Genealogy ......................................... 83, 94, 98, 99, 228,
294296, 298, 299, 306308, 310, 328
GeneR ......................................................... 509511, 514
Genetical genomics .................................... 469482, 534
Genetics
algorithm (GA) ........................... 126, 237, 240, 242
code.................... 113, 119, 120, 245, 255, 262, 264
drift ..................8, 31, 177, 225, 277, 294, 348, 349
variation ..............................142, 217234, 280, 294,
317, 471, 472, 481
Genic selection .................................................... 148152
Genome
content......................................................................32
evolution.....................................................30, 31, 40,
48, 115118, 132, 142, 192, 309, 431465
function ......................................... 32, 101, 114, 142
segmentation .................................................114, 165
sequencing ............................................ 182, 191, 287
size ................................ 31, 193, 309, 508, 509, 530
structure....................................................................30
Genome-wide association studies (GWAS) ....... 281288
Genomic rearrangements ............................................163
Genomic signature .......................................................158
Genotype ..................................219225, 235, 264, 275,
281, 283289, 470, 473, 479, 481, 482
Genotyping errors......................................................470
Ghost QTL detection (between two QTL
in coupling phase) ......................................479
Global Biodiversity Information
Facility (GBIF)...................................482, 486
Global Names Index ....................................................492
GO. See Gene, ontology (GO)
Grammar for domain combinations ...........................208


Grand most recent common ancestor
(GMRCA) ................318–320, 322, 324, 325
GridEngine ...................................................................538

H
Haploid segregants ......................................................471
Haploimbalance ..................................................373, 374
Hardy–Weinberg model .....................................283, 284
Haskell ..........................................................................532
Heterochromatin .........................................................432
Hidden Markov model (HMM) ............................... 114,
118119, 147, 188, 213, 221, 223, 230, 232,
233, 287, 296, 304307, 530
Hidden paralogy...................................................... 41, 42
High performance computing (HPC)...............479, 531
Histone modification ....................... 352356, 432, 434,
442, 447449
Histones........................................................................448
HIV-1 ..............................237, 242, 248, 251, 253, 258,
259, 261, 264, 265
HMMER .............................................................506, 530
HOGENOM ....................................... 32, 39, 41, 46, 48
Homologous
pairs of chromosomes ............................................166
recombination (HR) .....................................166, 238
Homology (homologous) ......................3032, 38, 104,
166, 201, 210, 339, 473, 475, 477
Horizontal gene transfer (HGT) ......................9, 1113,
33, 42, 54, 6974, 169170, 419
Host-pathogen ................................. 248, 474, 476, 477,
480, 481
HTTP protocol ........................ 482, 488, 489, 497, 506
Hudson–Kreitman–Aguadé test (HKA) .....................133
HyPhy ..............................127, 239, 240, 242, 245, 246,
248, 250, 253, 255, 257, 258, 262265, 537
Hypothesis-driven ........................................................335

I
Identity by descent (IBD) ......................... 224, 253, 278
Illegitimate recombination ..........................................190
Illumina.............................281, 286, 434, 458, 459, 462
Incomplete lineage sorting ................................4, 6, 7, 9,
18, 19, 42, 54, 298, 300302, 305, 306, 312
Incongruence ...................7, 42, 71, 237, 238, 301, 302
Inconsistency score ................................................. 65, 76
Independence .................................... 116, 147, 480, 481
Inhibitors .............................................................122, 476
Initiation of DNA replication.............................165, 166
Innate immune system........................................242, 474
In-paralog .......................................................................38
Insertion of domains...........................................208, 209
Instantaneous rate matrix ................. 116, 129, 134, 303
Interacting genes..........................................................121

Interaction network
clustering ....................................... 97, 376, 377, 540
degree distribution...................... 192, 201, 377, 378
guilt-by-association ....................................... 369370
modularity ............................................ 105, 369370
robustness ...............................................................372
Interoperability........................................... 487, 492, 509
Inter-species differences........... 337, 341, 345, 354, 355
Interspersed repeats .....................................................190
Intrinsic information...........................................147, 531
Intron...............................131, 132, 143, 144, 149151,
168, 189, 190, 207, 208, 340
Inversion ..............................................................165, 167
Ion counter...................................................................392
Isochores.......................................................................130

J
Jaccard coefficient ................................................... 62, 63
Jackknife .................................................................. 17, 46
Java.................. 416, 504, 505, 507, 508, 511, 514, 532
Java Virtual Machine (JVM)...................... 507, 512, 514
Job scheduler................................................................477
JRI........................................................................505, 508
Junk DNA ....................................................................142
Jython ......................................................... 507, 512, 513

K
KEGG pathways ................................ 382, 418, 421, 422

L
Landsberg erecta (Ler).................................................477
Last universal common ancestor (LUCA)..................193
Lateral gene transfer (LGT). See Horizontal gene
transfer (HGT)
Latin square ..................................................................388
Leucine-rich-repeat (LRR) ..........................................475
Likelihood
composite (CL) ......................................................229
function .................................................. 47, 246, 264
ratio test (LRT) ........................... 117, 247, 250, 251
Lineage specific
gene duplications ...................................................173
tests .........................................................................471
Linkage ....................... 7, 104, 105, 218, 220, 288, 365,
470472, 479, 483
Linkage disequilibrium (LD) ................... 142, 225230,
261, 276, 280, 286, 287, 289, 323, 324,
470, 481
Linked data ................................................. 483, 485, 495
Long-branch-attraction (LBA)......................................54
Long interspersed nucleotide element-1 (LINE1) ....190
Long non-coding RNAs ..............................................171


Lower envelopes...........................................................393
LSID ........................................................... 482484, 497

M
Machine learning.............399, 407, 443, 444, 447, 449,
456457
Macro language............................................................507
Mahalanobis distance ...................................................400
Mandel bundle-of-lines................................................398
Mapping power ............................................................479
MapReduce..........................................................533, 542
Marginal trees......................................................324, 326
Marker ....................................... 13, 219, 223, 224, 281,
285, 286, 289, 290, 352, 355, 356, 418, 419,
432, 456, 457, 463, 470, 478, 479
Marker map ................................................ 470, 478, 479
Markov
chain.....................................254, 258, 307, 394, 531
Chain Monte Carlo (MCMC).....................9, 17, 18,
129, 130, 229, 254, 256258, 394
clustering ................................................................369
models................ 114, 118119, 188, 213, 287, 530
(see also Evolutionary, models)
Mass Spectral Library...................................................391
Mass spectrometry ..................................... 384, 394, 407
Mating system ..............................................................153
Maximum
estimate (see Maximum likelihood estimate (MLE))
estimator .....................................................239
likelihood (ML)............................ 39, 46, 47, 56, 69,
116, 118, 124, 237, 239, 250, 253, 256, 262,
309, 476, 479, 530
parsimony (see Parsimony)
Maximum likelihood estimate (MLE) ........................124
McDonald–Kreitman test (MK)................ 133, 149, 154
Measurement equation .......................................396, 397
MEGAN .............................................................. 415428
Meiosis ...................................... 130, 219, 222, 316, 373
Meloidogyne hapla.......................................................474
Message passing interface (MPI) ......................239, 240,
257, 479, 531534, 537
Messenger RNA (mRNA) ...........................................168
Metabolic pathways.............................................104, 472
Metabolite QTL (mQTL) ......................... 471, 478, 480
Metabolites ............... 92, 381384, 386, 387, 389391,
393, 394, 398, 399, 401, 407410, 471
Metabolomics ...................................................... 381411
Metagenomics ..............................................................415
Methyl-DNA immunoprecipitation ............................471
Metropolis-Hastings algorithm...................................254
Microarray ........................................ 182, 192, 337, 338,
343345, 382, 470472, 474, 478, 505, 508
MicroRNAs ..................................................................171
Microsatellite ......................................................9, 16, 165
Minimal descriptor.............................324327, 329331

Mining .......................................363366, 383, 405, 407


Mitosis ................................................................... 92, 373
Mobile genetic elements....................................... 87, 189
Model organism ............................... 338, 341, 348, 349,
351, 371, 432, 470, 478, 499
Molecular
clock ............................................. 163, 164, 296, 308
strict ........................................................................369
Most recent common ancestor (MRCA).............. 8, 230,
295, 297299, 303, 318
Multiple QTL Mapping (MQM) .............. 479, 480, 482
Multiple sequence alignment (MSA) ............. 40, 46, 47,
116, 120, 134, 188
Multispecies coalescent model ..............................7–20
Mus musculus............................................. 163, 437, 470
Mutant alleles ......................................................252, 478
Mutation
accumulation studies..............................................349
rate ................................... 7, 14, 16, 20, 23, 24, 130,
133, 143, 146, 147, 163, 226, 235, 244, 276,
277, 298, 302, 311, 433
MyExperiment.....................................................496, 500

N
Natural population..................................... 167, 470, 479
Natural selection. See Selection
Nearly Universal Trees (NUTs) .......................55, 56, 70
Negating QTL (QTL in repulsion phase) ..................479
Nematode ...........................11, 143, 163, 170, 173, 474
Neofunctionalization .......................................... 175178
Network
analyzer .......................................................... 375378
of domain co-occurrence .......................................202
hubs ..............................................367368, 372374
inference ............................................... 473, 480481
Neutrality test...................................................... 133134
Next(-)generation sequencing (NGS) .....115, 165, 177,
182, 337, 350, 415
NGS. See Next(-)generation sequencing (NGS)
NHEJ. See Non-homologous end-joining (NHEJ)
NIST .............................................................................391
Non-coding .........................................................153, 171
Non-homologous end-joining (NHEJ) .....................166
Nonsynonymous mutation.............................14–16, 143
Nonsynonymous to synonymous rate ratio................135
Nonsynonymous to synonymous rates ratio.
See dN/dS
Normalization ............................................ 343–345, 357
NP-complete ..................................................................43
N-terminus ...................................................................211
Nucleoid ..................................................... 235, 475, 505
Nucleosome.........................................................289, 356
Nucleotide binding site leucine rich repeat domain
(NB-LRR) ...................................................475
NUTs. See Nearly Universal Trees (NUTs)

EVOLUTIONARY GENOMICS | Index | 553

O
Olfactory receptor ........................................................175
Oligomer ........................................................................16
OpenPBS ......................................................................534
Open reading frame (ORF)....................... 163, 171, 172
Open source software ............. 257, 481, 505, 529, 533,
535, 542
OpenStack ...........................................................539, 542
Operon ........................................................... 86, 95, 101
Optimization ...................... 47, 118, 246, 310, 401–403
ORF. See Open reading frame (ORF)
Origins of DNA replication................................165, 166
Ortholog........................ 54, 56, 70, 121, 169, 179, 180
Overlapping reading frames ............................... 128–129
OWL. See Web Ontology Language (OWL)

P
PAML ........................................................127, 134, 135,
245, 476477, 480, 505, 530, 533, 534, 536,
537, 539
Parallelization ............................309, 530–534, 541, 542
Paralog ...................14, 57, 70, 172, 180, 372–374, 377
Paralogy .............................................................41, 42, 54
Parent............................... 165, 166, 179–181, 219–224,
256, 261, 264, 294, 315–318, 320, 322, 323,
330, 474, 477, 478, 482
Parrot native compiler interface ..................................507
Parsimony ..................................164, 200, 205, 207–210
Pathogen................................ 13, 88, 91, 122, 130, 131,
165, 236, 242, 248, 474–477, 480–482, 537
Pattern
discovery ........................................................128, 218
Pedigree analysis.................................................. 218–224
Perl ..................................................... 212, 504, 505, 508
Permutation
strategy....................................................................479
Pfam ......................................... 188, 189, 193, 197, 198,
200–202, 204, 209, 211–213, 514
Phenotype............................ 48, 86, 275, 283, 284, 470,
471, 473, 477, 481, 482
Phybase ...........................................................................22
Phylogenetic hidden Markov models
(phylo-HMMs) ........114, 118–119, 127, 129
Phylogenetic
footprinting ............................................................143
network.....................................................................87
outliers ............................................................... 13–16
shadowing...............................................................143
tree .............................18, 35, 55, 59, 67, 69, 82, 93,
118, 164, 252, 477
phylo-HMMs. See Phylogenetic hidden Markov models
(phylo-HMMs)
Phytophthora infestans .......................................474, 537

Pipeline ............................255, 433, 434, 436, 458, 500,
533–534, 538
Piwi RNAs ....................................................................171
Plant resistance ....................................................474, 476
Plant resistance genes (R-genes) .................................475
Plasmid .................... 30, 82, 84, 86, 87, 92, 95–98, 101
Plasticity........................................................................472
Plastid ...........................................................................170
Plate geometry .............................................................388
Pleiotropic effect .................................................176, 480
Poisson
genetics .......................................................... 133–134
genomics.................................................................133
process ............................................................. 49, 298
random field ...........................................................133
size ................................................. 49, 133, 134, 298
Polymorphism .....................................42, 115, 147–153,
173, 177, 181, 182, 218, 255, 275, 276, 280,
281, 306, 311, 328, 470, 472, 481
Polymorphism frequencies ................147–148, 153–154
Poor man's parallelization ...........................................533
Population simulator .......................................... 326–329
Positive selection. See Selection
Posterior
probability ............................23, 123, 126, 247, 248,
254, 258, 265, 289
Power law distribution....................... 37, 191, 192, 194,
195, 199, 200, 202, 367, 374
Preferential attachment..................... 192, 200, 209, 213
Preterm labor ............................................. 382, 409–411
Primates ............................... 6, 115, 124, 296, 310–312,
339, 342, 349, 354, 355
Primer ..................................................................163, 418
Prior
distribution .............................................................308
Profile HMM................................................................118
Prokaryotes......................11, 30, 38–40, 47, 54, 55, 69,
70, 74, 84, 87–89, 99, 170, 196, 200, 205
Prokaryotic cell...............................................................30
Promoter .........346, 350–352, 431–458, 460, 462–465
Propagation of error ........................................... 395–397
Protein
architecture .................................................... 187–213
combination ..................................................189, 196
complex................................365, 369, 370, 374, 378
databases

ADDA .................................................188, 213


CATH ........................................188, 196, 213
Conserved Domain Database ............188, 213
Gene3D family ...................................188, 213
INTERPRO...............................188, 195, 213
Pfam ........ 188, 198, 201, 204, 209, 212, 213
ProDom ...................................................... 188
SCOP ....................... 188, 195, 196, 205, 213
SMART ......................................188, 204, 213


Protein (continued)
domain .......................... 30, 167, 187–213, 371, 372
neighbor pair ........................................ 198, 204, 206
order ..............................................................189, 199
promiscuity/versatility .................................. 203–206
QTL (pQTL)........................................ 471, 478, 480
sequence ....................................... 11, 104, 116, 120,
123, 142, 143, 145, 179–181, 188, 252, 253,
259, 262, 265, 371, 420, 487
structure................................................ 114, 116, 208
triplets ............................................................120, 189
Protein-coding gene ..........................................113, 117,
120, 128, 132–134, 152, 171
Protein-protein interaction(s) .................... 99, 100, 104,
123, 200, 363–378, 472, 537
Pruning ......................................................... 57, 129, 406
Pseudogene .................................................. 31, 172, 191
Pseudogenization .................................................. 31, 175
Punctuated equilibrium ...............................................192
Purifying selection. See Selection
Python ..............................438, 464, 504, 505, 507–514

Q
Quality control ........ 281, 283–284, 383, 385–391, 394
Quantitative
phenotypes.................. 470, 471, 473, 477, 479–481
trait loci (QTL) ................... 470–474, 476–481, 508

R
R (statistical language).................................................479
Random effect (RE) models........................................123
Random forest.................. 399, 401, 405–408, 410–412
Random variable
continuous ................................................................37
discrete......................................................................37
Rate
heterogeneity.................................................118, 239
shift ................................................................114, 118
RDF. See Resource Description Framework (RDF)
Reactivity .............................................................480, 481
Rearrangement ................................. 166, 167, 241, 308,
433, 477
Reasoning ...............................................6, 367, 480, 532
Reassortment................................................................236
Recessive lethal alleles ..................................................470
Recombinant inbred line (RIL) ..................................470
Recombination ...............................8, 86, 116, 147, 166,
189, 217265, 287, 294, 316, 365, 470
Redundancy............................................16, 43, 152, 316
Regulation .............. 131, 141, 142, 196, 200, 335–337,
341, 342, 346–356, 432–434, 443, 448, 449,
457, 463, 471, 472, 480
Regulator ............................................................. 471–473

Regulatory
element ..........................................................152, 168
genomic regions ............................................352, 353
mechanisms ..................................336, 337, 350–356
Relative rate test ...........................................................179
RELL. See Resampling of estimated log-likelihoods
(RELL)
Remote procedure call (RPC) ............................ 503–515
Repeat ......................... 57, 61, 143, 165, 166, 168, 190,
203, 206, 211, 225, 295, 308, 309, 322, 331,
338, 340, 376, 410, 437, 439, 441, 444, 450,
451, 453, 456, 465, 475
Replication................................. 92, 165, 166, 238, 248,
249, 285–286, 288, 289, 478
Representational State Transfer (REST)...........489–490,
492, 506
Resampling of estimated log-likelihoods (RELL)......241
Residual variance ..........................................................479
Resolution schema .............................................481, 482
Resource Description Framework (RDF)................. 483,
485, 487, 494, 495, 497, 499, 506
Restriction fragment length polymorphism
(RFLP) ................................................. 23, 470
Retrogenes........................167–169, 173, 174, 176, 180
Retroposition..................................... 167–169, 171, 173
Retrotransposons ................................................169, 189
RFLP. See Restriction fragment length polymorphism
(RFLP)
R-gene .................................................................477, 482
Ribosomal RNA (rRNA) .................. 416, 418, 426, 427
Ribosome............................................................... 72, 369
RIL. See Recombinant inbred line (RIL)
RNA-seq ..................................................... 142, 345, 478
RPM1 ...........................................................................475
RPS2 .............................................................................475
RPy............................................ 505, 508, 510, 511, 514
rq ........................................ 477, 536–540, 542
rRNA. See Ribosomal RNA (rRNA)
Rserve ...............................505, 506, 508, 510, 511, 514
RSOAP ................................................................ 508–510
RSPerl ...........................................................................508
RSRuby .........................................................................508

S
Saccharomyces cerevisiae.................... 163, 164, 193–195,
202, 371, 471, 472
16S analysis..........................................................416, 427
Scaffolding ...........................................................306, 491
Scala ................................................... 507, 512, 513, 532
S. cerevisiae. See Saccharomyces cerevisiae
SEED subsystem ..........................................................425
Segmental duplication ...................... 145, 164–166, 477
Segment alignment ......................................................244
Segregating sites.................................. 19, 133, 225, 235


Segregation sites. See Segregating sites


Selection
adaptive ..................................85, 145, 149, 176, 181
balancing................................................... 14, 16, 133
coefficient ......................................................132, 135
directional ............................... 14, 16, 148, 248–252,
262–264, 337, 348, 349, 356
positive....................... 119–126, 128–131, 134–135,
144–147, 149–151, 153, 172, 175–181, 244,
246–248, 250, 253, 255, 262, 277, 279, 450,
473–477, 481, 530, 534, 537
purifying (negative).................... 119–123, 125, 130,
131, 143, 147, 149, 150, 172, 178–181, 278
strength........................................ 134, 175, 177, 277
Selection-mutation models..........................................127
Semantic
Automated Discovery and Integration
(SADI) ............................. 494, 495, 498, 499
Health and Research Environment
(SHARE) ..........................486, 495, 498–499
web......................483, 485–488, 492–496, 498–499
Sequence alignment. See Multiple sequence
alignment (MSA)
Sequence assembly .......................................................543
Sequencing error correction........................................308
Shared libraries ....................................................517, 518
Short
read,
sequence repeat,
Signaling ......... 188, 196, 200, 336, 337, 345, 346, 475
Simple Object Access Protocol (SOAP) ................... 491,
492, 494, 506, 509, 510
Simple Semantic Web Architecture and Protocol
(SSWAP).............................................494, 495
Simplified Wrapper and Interface Generator
(SWIG)...................................... 507, 513, 514
Simulating
populations (see Population simulator)
trees ................................................................. 21–24
Single nucleotide polymorphism (SNP) ......................98,
223, 228, 229, 280289, 320, 322, 330, 470,
477, 478, 540
Sister chromatids ..........................................................166
Site frequency spectrum .....................................134, 288
Site-specific tests for selection .....................................250
SNP. See Single nucleotide polymorphism (SNP)
Speciation ......................10, 12, 40–43, 45, 54, 87, 208,
294–304, 306, 307, 309–312
Species
delimitation ....................................................... 10–11
tree ........................... 1–24, 38, 40–46, 49, 164, 169,
207–209, 301, 302
Specificity ........ 119–127, 142, 200, 203, 259, 352, 475
Sperm typing .............................................. 229, 233, 234
Spirochetes .....................................................................39
Splice site ......................................................................131

Splicing ................................................................131, 370


Split distance (SD) ...........54, 56, 58, 60–62, 64, 67, 76
SSWAP. See Simple Semantic Web Architecture
and Protocol (SSWAP)
Statistical
power ....................................... 7, 152, 226, 470, 476
significance................................... 118, 205, 463, 479
Statistical model(ing) .......................... 40, 115, 221, 230
Stop codon ........................................ 172, 189, 190, 206
Strain.................................236, 348, 386, 471, 475, 477
Structural variation..............................................166, 287
Structure of DNA ...................................... 289, 434, 444
Structure-preserving .................................. 323–324, 329
Study design ............................. 170, 341, 349, 352, 357
Subfunctionalization ........................................... 176–178
Subgraph..................................................... 317, 318, 328
Substitution
matrix .....................................................................137
model (see Evolutionary, models)
scoring ....................................................................240
Supercluster ..................................................................477
Supermatrix ...................................................................... 7
Supertree ................................................6, 66, 68, 74, 75
Support vector machine (SVM) ............... 399, 401–408,
410, 412
Supradomains ......................................................191, 203
Susceptible ...... 9, 17, 20, 345, 419, 470, 474, 475, 477
SVM. See Support vector machine (SVM)
Sweep ................................................... 14, 133, 177, 471
SWIG. See Simplified Wrapper and Interface Generator
(SWIG)
Synonymous mutation/substitution/change ..... 14–16,
131
Synonymous substitution/change ...........115, 127, 244,
246, 249, 468, 520
Synteny ................................................................128, 474

T
Tajima's D ....................................................................133
TAMBIS. See Transparent Access to Multiple
Bioinformatics Services (TAMBIS)
Tandem affinity purification (TAP)...................363–365,
368, 369
TAP. See Tandem affinity purification (TAP)
TAPIR. See TDWG Access Protocol for Information
Retrieval (TAPIR)
Target gene................................................. 464, 470, 472
Taverna .............................................. 493–495, 498, 500
Taxonomic
analysis .................................................. 415–420, 422
Taxonomic database working group (TDWG) ........ 482,
483, 487, 492, 497
TDWG. See Taxonomic database working group
(TDWG)
TDWG Access Protocol for Information Retrieval
(TAPIR) ......................................................492


Telomeres .....................................................................130
TORQUE................................................... 536, 538–543
Trade-off............................................ 385, 390, 504, 507
Training sample ................................. 399–407, 409, 410
Trait ................... 38, 84, 85, 90, 91, 98, 141, 275, 276,
348, 349, 442, 455–457, 469–474, 479–481,
487, 488
Trans ....................................................................147, 475
trans-band ...........................................................471, 472
Transcript.........................131, 167, 169, 338–340, 344,
345, 471, 474, 480
Transcription
factor .................................. 152, 203, 209, 336, 346,
349–352, 354, 364, 432, 474
factor binding sites ..............145, 146, 149, 351, 355
start sites (TSSs) .....................................................439
Transcriptome assembly...............................................340
Transition probability .............. 116, 119, 222, 305, 307
Transition/transversion (rate).....................................264
Translation.................... 70, 71, 92, 101, 131, 142, 172,
193, 265, 432, 505, 508–514, 536
Translocation .......................................................167, 201
Transparent Access to Multiple Bioinformatics Services
(TAMBIS) .......................................... 492–493
Transposition.......................................................173, 450
Tree
of life (TOL)............................. 3, 11, 39, 55, 70–72,
74, 76, 82, 84–87, 89, 93–95, 97, 99, 103,
104, 205, 207
reconciliation ...............................................41, 43, 54
rooted ............................................ 6, 39, 44, 54, 421
search ........................................................................96
topology..................... 4–7, 9, 10, 17, 21, 40, 42–47,
54, 55, 67, 76, 240, 241, 244
ultrametric ................................................................66
unrooted ...................................... 54, 55, 66, 67, 125
Triple...............6, 17, 485–487, 495, 498, 499
tRNA............................................................... 71, 72, 131
Two color cDNA microarray..............................471, 472

U
Unequal crossing over .................................................373
Uniform Resource Identifier (URI) ......... 481–485, 495
Uniform Resource Locator (URL) .......... 213, 481–484,
488–490, 497, 499, 510
Uniparental inheritance ...................................... 315–316
UniProt.............................................. 188, 211, 213, 514
Untranslated regions....................................................147

URI. See Uniform Resource Identifier (URI)


URL. See Uniform Resource Locator (URL)

V
Vertebrates.........................13, 115, 117, 145, 168, 196,
207, 242, 354, 432
VirtualBox ................................ 529, 536, 538, 540, 542
Virtualization..................................... 532–536, 540, 542
Virtual machine (VM) .............507, 508, 514, 534–536,
539–543
Virus..................................... 11, 16, 30, 33, 82, 88, 128,
130, 131, 226, 235–240, 242, 243, 247, 258,
259, 364, 420
VM. See Virtual machine (VM)
VMWare........................................................................536

W
Wald confidence interval,
Warping ...............................................................392, 394
Wassilewskija (Ws).......................................................477
Web Ontology Language (OWL) .... 485–488, 494, 495
Web-services ....................434, 464, 489, 491–496, 498,
506, 514, 540
Web Services Description Language (WSDL) ......... 501,
503, 504
Weight array matrix,
WGD. See Whole genome duplication (WGD)
Whole genome duplication (WGD) .................162–164,
373, 374, 377, 378
Wolbachia............................................................... 11, 170
WormBase............................................................509, 510
Wright-Fisher population ................. 317, 318, 322, 328
Wright-Fisher model ..................295, 316–318, 326, 328, 329

X
XEN ............................................................ 536, 540, 542
Xenologs .................................................35, 38, 169, 194
XML......................... 487, 491–495, 497, 506, 511, 514
xQTL ..................................471–472, 474–482

Y
Yeast ........................100, 147, 150, 153, 163, 164, 365,
371, 377, 378, 389, 471
Yeast-2-Hybrid (Y2H) ..............363–365, 367, 368, 372

Z
Zinc(Zn) finger protein ...............................................234
