
Comparative bacterial genomics

João Carlos Setubal


VBI/Virginia Tech
for EMBO course
Florianopolis, July 2008
Contents
• Tree of Life
• Basic notions of genomics
• Motivation for comparative genomics
• Whole replicon alignment: pairwise and
multiple
• Gene-centric comparisons
• Orthology and Synteny
• Exercises

December 7, 2021 JC Setubal 2


Ciccarelli et al, Science, 2006



α-proteobacteria

Williams, Sobral, and Dickerman, JBAC, 2007
Genomes
• The entire DNA complement of a single cell
• Abstraction
– a string s over the alphabet Σ = {A, C, G, T}
– Example

CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGCGGCCGCCG
GCGCCGCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCC
ATCTTCGCTTC
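
The string abstraction can be made concrete in a few lines of Python. This is a minimal sketch, assuming the genome fits in memory as an ordinary string; the function names are illustrative and do not come from the slides.

```python
from collections import Counter

ALPHABET = set("ACGT")  # the alphabet Σ from the slide


def is_valid_genome(s: str) -> bool:
    """Return True if every character of s belongs to Σ = {A, C, G, T}."""
    return set(s) <= ALPHABET


def base_composition(s: str) -> dict:
    """Count how many times each symbol occurs in s."""
    return dict(Counter(s))


# The example string from the slide, joined into a single string.
s = (
    "CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGCGGCCGCCG"
    "GCGCCGCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCC"
    "ATCTTCGCTTC"
)

print(len(s), is_valid_genome(s), base_composition(s))
```

Here `len(s)` plays the role of |s|, the genome size used on the following slides.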



Genome sizes
• Genomes are measured in
– kb (kilo base pairs), Mb (mega), or Gb (giga)
• Viruses: |s| = [5 – 200] kb
• Bacteria: |s| = [1 – 10] Mb
• Eukaryotes: |s| = [10 Mb – 100 Gb]
• Humans: 3 Gb
• Marbled lungfish: 130 Gb (T. Gregory, www.genomesize.com)



Famous bacteria
• Haemophilus influenzae (1.8 Mb)
– Human pathogen, first free-living organism to have its genome sequenced (1995)
• Escherichia coli (4.6 Mb)
– Human pathogen and model organism (1997)
• Agrobacterium tumefaciens (6 Mb)
– Plant pathogen and biotechnology tool (2001)



What is a gene
• A small substring of s that contains information
• Bacteria generally have about 1 gene per kb
– a 5 Mb genome has roughly 5,000 genes
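
The rule of thumb above is simple arithmetic. A small sketch, assuming the slide's density of one gene per kb; the constant and function names are hypothetical, and real gene densities vary between genomes.

```python
KB = 1_000
MB = 1_000_000


def estimate_gene_count(genome_size_bp: int, bp_per_gene: int = KB) -> int:
    """Rough gene count from genome size, assuming one gene per bp_per_gene bases."""
    return genome_size_bp // bp_per_gene


print(estimate_gene_count(5 * MB))
```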



A bacterial gene
>A small section of a genome
AGCTCGCGCTCCGCATCCATCCAGTAGGGTTCGGTGTCGACGAGCGTGCC
GTCCATATCCCAGAAGACGGCGGCCGGCATCGCGTGCGGAGTCAGTTCGG
TCACGGCTGACAAGTCTATCCCGGCGGCCCCGGGCCTATTCTTGAGGGAC
GGCGTCCTGACCGGTCGCCGGATGAAAGGACCAGAACGCCCCGTGACTGA
CGCGAACAGCATCCTCGGAGGGCGCATCCTCGTGGTGGCCTTCGAAGGGT
GGAACGACGCTGGCGAGGCCGCCAGCGGGGCCGTCAAGACGCTCAAGGAC
CAGCTGGATGTCGTCCCGGTCGCCGAGGTCGATCCCGAGCTGTACTTCGA
CTTCCAGTTCAACCGGCCGGTCGTCGCGGACGACGACGGCCGCCGGCGCC
TCATCTGGCCGTCCGCGGAGATCCTGGGCCCAGCTCGCCCCGGCGACACC
GGCGATGCGCGCCTGGACGCCACCGGCGCCAACGCGGGCAATATCTTCCT
TCTCCTCGGCACCGAGCCGTCGCGCAGCTGGCGCAGCTTCACCGCGGAGA
TCATGGATGCGGCCCTGGCCTCCGACATCGGCGCCATCGTCTTCCTCGGT
GCGATGCTGGCGGACGTACCGCACACCCGCCCCATCTCCATCTTCGCTTC
GAGCGAGAACGCGGCCGTCCGTGCGGAGCTCGGCATCGAACGCTCTTCGT
ACGAGGGGCCGGTCGGTATCCTGAGCGCGCTCGCCGAAGGGGCGGAGGAC
GTGGGCATTCCGACCATCTCCATCTGGGCGTCGGTTCCGCACTATGTCCA
CAATGCGCCCAGCCCGAAGGCGGTGCTCGCACTGATCGACAAGCTCGAAG
AGCTGGTGAATGTCACCATCCCGCGTGGCTCGCTGGTGGAGGAGGCCACG
GCCTGGGAAGCCGGGATCGACGCGCTGGCTCTGGACGACGACGAGATGGC
TACGTACATCCAGCAGCTGGAGCAGGCACGCGACACCGTGGACTCCCCTG
AGGCCAGCGGCGAGGCGATCGCCCAGGAGTTCGAGCGCTACCTCCGCCGC
CGCGACGGCCGCGCCGGCGATGACCCCCGCCGTGGCTGACGTCACCCCCT
CTCTGCGTCCGCCGTCCTCTGTTCCCCCCGCTCGGCCTCCCCTGAGGCCG
AGGAGTCGCGCCCACATGCCGGAAACTCCTCCTTTCCTGACTTTCTGGAG



“Central Dogma” of molecular biology

• gene (DNA) → messenger (RNA) → protein (amino acids)

(first arrow: transcription; second arrow: translation)

Proteins are 3D objects made out of a linear sequence of amino acids
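
The two steps of the gene → messenger → protein flow can be sketched in code. This is an illustration only, assuming the input is the coding strand read 5' to 3' and using the standard genetic code; the names `transcribe` and `translate` are chosen here and are not from the slides.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Transcription: the mRNA matches the coding strand with T replaced by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translation: read codons from position 0 until a stop codon or the end."""
    seq = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))
```

The result is only the linear amino acid sequence; the 3D structure mentioned on the slide is not modeled here.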
A protein

www.berkeley.edu/.../images/ras-rid-protein.gif
Sugar cane pathogen: ratoon stunting disease

Monteiro-Vitorello et al, Molecular Plant-Microbe Interactions, 2004
Comparative genomics
• There are currently more than 300 completely
sequenced microbial genomes publicly
available
• Many are of closely related species
• In a few years there will be thousands
• Why compare?
• How to do it?



Why comparative genomics?
• To understand the genomic basis of the present
– Differences in lifestyle
• pathogen vs. nonpathogen
• Obligate vs. free-living
– Host specificity
• animals vs. plants, plant X vs. plant Y, etc
– In the case of pathogens: this understanding should help
us in fighting disease
• To understand the past
– How organisms evolved to be what they are



Citrus canker: Xanthomonas axonopodis pathovar citri



Black rot: Xanthomonas campestris pathovar campestris



What is comparative genomics
• Assuming input is the sequence and its annotation
• There are many ways that genomes can be compared
– Different resolutions
• Whole genome
– Genome alignments
– Synteny (gene order conservation)
– Anomalous regions
• Gene-centric
– Gene families and unique genes
– Gene clustering by function
• Gene sequence variations
– Codon usage, SNPs, inDels, pseudogenes



Resolution
• Low resolution
– Scope: entire genomes
– Example event: rearrangement
• High resolution
– Scope: nucleotide sequences
– Example event: single mutation



Genome-wide evolutionary events
• Replicon rearrangements
• Gene/region duplication
• Gene/region loss
• Chromosome ↔ plasmid DNA exchange
• Lateral transfer



Fig. 4. Net gene loss or gain throughout the evolution of the α-proteobacterial species

Boussau, Bastien et al. (2004) Proc. Natl. Acad. Sci. USA 101, 9722-9727



Copyright ©2004 by the National Academy of Sciences
Example of a “multipartite genome”

Agrobacterium tumefaciens C58



Replicon structure in all completely sequenced Rhizobiaceae plus M. loti

Columns are genomes; numbers are replicon sizes in Mbp ("-" means no such replicon)

replicon  C58    S4     K84    Retli  Rleg   Sm     Ml
1         2.84   3.73   4.00   4.38   5.06   3.65   7.04
2         2.07   1.28   2.65   0.64   0.87   1.68   0.35
3         0.54   0.63   0.39   0.51   0.68   1.35   0.21
4         0.21   0.26   0.18   0.37   0.49   -      -
5         -      0.21   0.04   0.25   0.35   -      -
6         -      0.13   -      0.19   0.15   -      -
7         -      0.08   -      0.18   0.15   -      -
Whole replicon alignments:
the pairwise case
If the sequences were identical we would see a single unbroken diagonal (figure)


An inversion (figure: replicon segments A, B, C, D, with the B-C region reversed in the second replicon)

Such inversions seem to happen around the origin or terminus of replication

Eisen JA, Heidelberg JF, White O, Salzberg SL. Evidence for symmetric chromosomal
inversions around the replication origin in bacteria. Genome Biol. 2000;1(6):RESEARCH0011
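
At the sequence level, an inversion replaces a segment with its reverse complement. The sketch below assumes a linear string and 0-based, end-exclusive coordinates; both choices, and the function names, are illustrative assumptions rather than anything stated on the slides.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_segment(genome: str, start: int, end: int) -> str:
    """Return genome with genome[start:end] replaced by its reverse complement."""
    return genome[:start] + reverse_complement(genome[start:end]) + genome[end:]


original = "AAACCGTTT"
inverted = invert_segment(original, 3, 6)
print(original, inverted)
```

Applying the same inversion twice restores the original sequence, which is why an inverted region appears as a reverse-strand match in a pairwise alignment.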
Replicon sequence comparisons
• Basic tool: MUMmer
– Delcher AL, Phillippy A, Carlton J, Salzberg SL. Fast
algorithms for large-scale genome alignment and
comparison. Nucleic Acids Res. 2002 Jun
1;30(11):2478-83.
– Kurtz S, Phillippy A, Delcher AL, Smoot M,
Shumway M, Antonescu C, Salzberg SL. Versatile
and open software for comparing large genomes.
Genome Biol. 2004;5(2):R12
• http://mummer.sourceforge.net



Promer alignment: E. coli K12 vs. Xanthomonas axonopodis pv citri

Red: direct; green: reverse

Both are γ-proteobacteria!
Basics of MUMmer
• It finds Maximal Unique Matches
• These are exact matches, longer than a user-specified minimum length,
that occur exactly once in each sequence
• Exact matches found are clustered and extended (using
dynamic programming)
– Result is approximate matches
• Data structure for exact match finding: suffix tree
– Difficult to build but very fast
• Nucmer and promer
– Both very fast
– O(n + #MUMs), n = genome lengths
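
To make the definition of a maximal unique match concrete, here is a brute-force sketch. It is not how MUMmer works: MUMmer uses a suffix tree, whereas this version is quadratic or worse and only suitable for tiny strings. It also ignores the reverse strand. The function names and the `(pos_a, pos_b, length)` output format are assumptions made for illustration.

```python
def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a: str, b: str, min_len: int) -> list:
    """Return (pos_a, pos_b, length) for every maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Not left-maximal: the match could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Unique: the matched string occurs exactly once in each sequence.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("TTTACGTGGG", "CCCACGTAAA", min_len=4))
```

In the real tool, the exact matches found this way are then clustered and extended by dynamic programming to produce approximate matches, as described above.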



sample nucmer output (coords file)
/home/setubal/agro/comp/mummer/../../rhizogenes/v1/ctgs.fasta /home/setubal/agro/comp/mummer/../../vitis/v3/all.fasta
NUCMER

    [S1]     [E1] |     [S2]     [E2] |  [LEN 1]  [LEN 2] |  [% IDY] | [TAGS]
=====================================================================================
   73024    73193 |   242351   242181 |      170      171 |    93.60 | Contig789 Contig608
     220     6244 |    38759    32766 |     6025     5994 |    86.64 | Contig791 Contig604
    2798     6297 |   174039   177532 |     3500     3494 |    83.31 | Contig791 Contig606
    3828     6297 |   124183   126645 |     2470     2463 |    81.80 | Contig791 Contig606
    4767     5392 |   551684   551059 |      626      626 |    82.11 | Contig791 Contig607
    8214     8453 |    30747    30508 |      240      240 |    84.65 | Contig791 Contig604
   15408    15987 |   181050   181624 |      580      575 |    86.23 | Contig791 Contig606
   63864    74254 |   191954   181567 |    10391    10388 |    89.08 | Contig791 Contig604
   77203    79534 |   178882   176555 |     2332     2328 |    84.35 | Contig791 Contig604
  157451   158456 |   139804   140812 |     1006     1009 |    82.09 | Contig791 Contig606
  157483   157800 |    58429    58110 |      318      320 |    89.13 | Contig791 Contig604
  163575   166223 |    62781    60133 |     2649     2649 |    78.80 | Contig791 Contig605
  166754   168442 |    49403    47716 |     1689     1688 |    85.79 | Contig791 Contig604
  171247   173701 |    45005    42556 |     2455     2450 |    88.17 | Contig791 Contig604
  171261   172115 |   157617   158476 |      855      860 |    86.30 | Contig791 Contig606
  181828   184458 |    41748    39140 |     2631     2609 |    93.13 | Contig791 Contig604
  184829   185852 |    38838    37821 |     1024     1018 |    91.61 | Contig791 Contig604
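
A coords file like the sample above is easy to read programmatically. The following is a sketch of a parser for data rows in exactly the layout shown (five `|`-separated groups); the class name, field names, and the interpretation of `S2 > E2` as a reverse-strand match are assumptions for illustration, and other output layouts of the tool are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoordsRow:
    s1: int
    e1: int
    s2: int
    e2: int
    len1: int
    len2: int
    identity: float
    tag1: str
    tag2: str

    @property
    def second_is_reversed(self) -> bool:
        """Assumed convention: descending coordinates mean a reverse match."""
        return self.s2 > self.e2


def parse_coords_line(line: str) -> Optional[CoordsRow]:
    """Parse one data row; return None for headers, rulers, and blank lines."""
    fields = [part.split() for part in line.split("|")]
    if [len(f) for f in fields] != [2, 2, 2, 1, 2]:
        return None
    try:
        return CoordsRow(
            int(fields[0][0]), int(fields[0][1]),
            int(fields[1][0]), int(fields[1][1]),
            int(fields[2][0]), int(fields[2][1]),
            float(fields[3][0]),
            fields[4][0], fields[4][1],
        )
    except ValueError:
        return None


row = parse_coords_line(
    "73024 73193 | 242351 242181 | 170 171 | 93.60 | Contig789 Contig608"
)
print(row, row.second_is_reversed)
```

Returning `None` for non-data lines lets a caller filter a whole file with a single loop, without special-casing the header.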



A suffix tree for BANANAS

www.somethinkodd.com/.../2006/01/suffixtree.png
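
The idea behind the figure can be tried out with an uncompressed suffix trie, a simpler relative of the suffix tree. This sketch uses nested dictionaries and takes quadratic space, so it is an illustration only; a true suffix tree compresses single-child chains and can be built in linear time. All names are chosen here for illustration.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie made of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def find_node(trie: dict, pattern: str):
    """Follow pattern from the root; return the node reached, or None."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return None
        node = node[ch]
    return node


def count_leaves(node: dict) -> int:
    """Number of leaves below node (a node with no children is a leaf)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def occurrences(trie: dict, pattern: str) -> int:
    """Occurrences of pattern, valid when the text ends in a unique symbol."""
    node = find_node(trie, pattern)
    return 0 if node is None else count_leaves(node)


trie = build_suffix_trie("BANANAS")
print(occurrences(trie, "ANA"), occurrences(trie, "NAB"))
```

Because BANANAS ends in a letter that appears nowhere else, every suffix ends at its own leaf, so counting leaves below a node counts occurrences of the corresponding substring. A substring that occurs once is exactly what the unique-match search needs.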
Proteome alignment done with LCS, the longest common subsequence (top: Xcc; bottom: Xac)
Blue: BBHs (bidirectional best hits) that are in the LCS; dark blue: BBHs not in the LCS;
red: Xac specifics; yellow: Xcc specifics
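
The LCS step can be sketched with the textbook dynamic program. The sketch assumes each genome has been reduced to its gene order, with every gene replaced by the identifier of its BBH pair so that both genomes are sequences over the same labels; that encoding and the gene labels below are hypothetical.

```python
def longest_common_subsequence(xs: list, ys: list) -> list:
    """Return one longest common subsequence of xs and ys."""
    n, m = len(xs), len(ys)
    # dp[i][j] = LCS length of xs[i:] and ys[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if xs[i] == ys[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one optimal subsequence.
    result = []
    i = j = 0
    while i < n and j < m:
        if xs[i] == ys[j]:
            result.append(xs[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


order_top = ["g1", "g2", "g3", "g4", "g5"]
order_bottom = ["g1", "g3", "g2", "g4", "g5"]
print(longest_common_subsequence(order_top, order_bottom))
```

Genes in the returned list correspond to the BBHs that are in the LCS; shared genes left out of it are the BBHs not in the LCS, which have changed relative order.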



Whole replicon multiple alignment
• The program MAUVE
• Darling AC, Mau B, Blattner FR, Perna NT.
Mauve: multiple alignment of conserved
genomic sequence with rearrangements.
Genome Res. 2004 Jul;14(7):1394-403.



MAUVE chromosome alignment (figure; genomes shown: Dugway, RSA 493, RSA 331)


MAUVE
Genome Alignments



How MAUVE works
• Seed-and-extend hashing
• Seeds/anchors: Maximal Multiple Unique
Matches of minimum length k
• Result: Local collinear blocks (LCBs)
• O(G²n + Gn log Gn), G = # genomes, n =
average genome length



Alignment algorithm
1. Find Multi-MUMs
2. Use the multi-MUMs to calculate a phylogenetic
guide tree
3. Find LCBs (subset of multi-MUMs; filter out spurious
matches; requires minimum weight)
4. Recursive anchoring to identify additional anchors
(extension of LCBs)
5. Progressive alignment (CLUSTALW) using guide tree



Gene-centric comparisons
• Homologs: genes that have the same ancestor; in
general retain the same function
• Orthologs: homologs from different species (arise
from speciation)
• Paralogs: homologs from the same species (arise
from duplication)
– Duplication before speciation (ancient duplication)
• Out-paralogs; may not have the same function
– Duplication after speciation (recent duplication)
• In-paralogs; likely to have the same function



Orthologs
speciation



Out-paralogs



In-paralogs



Published April 16, 2008

10 genomes

Orthology + Phylogeny

AG: ancestral (bellii [2], canadensis)
TG: typhus (prowazekii, typhi)
TRG: transitional (akari, felis)
SFG: spotted fever (rickettsii, conorii, sibirica)
How to find orthologs
• Desired features of ortholog clustering
– Ability to distinguish between in- and out-paralogs
• In-paralogs should be clustered with their orthologs
– Ability to cluster genes that have the same domain
architecture, rather than simply sharing just one domain
• Methods
– Phylogenetic trees
– BLAST
– MCL
– orthoMCL
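
The simplest BLAST-based approach is the bidirectional best hit (BBH) criterion: two genes are paired when each is the other's best hit. The sketch below assumes the BLAST results have already been reduced to a score per (query, subject) pair; the gene names and scores are made up. Note that this is not OrthoMCL, and plain BBH cannot group in-paralogs with their orthologs, which is one of the desired features listed above.

```python
def best_hits(scores: dict) -> dict:
    """Map each query to its highest-scoring subject.

    scores maps (query, subject) -> similarity score (higher is better).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: dict, b_vs_a: dict) -> list:
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores for genomes A and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 195, ("b2", "a2"): 170, ("b2", "a1"): 140}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy data `a3` has `b1` as its best hit, but `b1` prefers `a1`, so `a3` remains unpaired.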



OrthoMCL
• Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL:
identification of ortholog groups for
eukaryotic genomes. Genome Res. 2003
Sep;13(9):2178-89
• Enright AJ, Van Dongen S, Ouzounis CA. An
efficient algorithm for large-scale detection of
protein families. Nucleic Acids Res. 2002 Apr
1;30(7):1575-84



OrthoMCL
1. BLAST all-against-all
2. weighting scheme
3. MCL algorithm
• Nota bene: orthoMCL is not perfect!
– Two or more families may be wrongly joined
– One family may be wrongly split



orthoMCL pipeline

Li Li et al. Genome Res. 2003; 13: 2178-2189


OrthoMCL weighting scheme for similarity graph

Li Li et al. Genome Res. 2003; 13: 2178-2189


(Tribe)MCL
• Enright, Van Dongen, Ouzounis [2002]
• Adaptation of MCL clustering algorithm of Van Dongen
• Markov cluster
• Simulates random walks in the graph
• Alternates expansion and inflation of a stochastic matrix until
equilibrium is reached
• Expansion: matrix squaring
• Inflation: raise each entry to a power r > 1, then rescale each column
so the matrix is stochastic again
• Has been reasonably validated
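
The expansion and inflation loop can be written in a few lines. This is a bare-bones sketch of the Markov cluster idea in pure Python, assuming a small unweighted graph given as an adjacency matrix; the self-loops, the inflation value of 2, the convergence tolerance, and the way clusters are read off the final matrix are all common choices assumed here, not details taken from the slides or from the TribeMCL software.

```python
def normalize_columns(m: list) -> list:
    """Rescale each column in place so that it sums to 1 (column-stochastic)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                m[i][j] /= total
    return m


def expand(m: list) -> list:
    """Expansion: square the matrix."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m: list, r: float) -> list:
    """Inflation: entrywise power r, then make the columns stochastic again."""
    return normalize_columns([[x ** r for x in row] for row in m])


def mcl(adjacency: list, inflation: float = 2.0,
        max_iter: int = 100, tol: float = 1e-9) -> list:
    """Cluster the nodes of a graph; returns a list of sets of node indices."""
    n = len(adjacency)
    m = [[float(x) for x in row] for row in adjacency]
    for i in range(n):
        m[i][i] = 1.0  # self-loops, a common choice that helps convergence
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max((abs(nxt[i][j] - m[i][j])
                     for i in range(n) for j in range(n)), default=0.0)
        m = nxt
        if delta < tol:
            break
    clusters = []
    for i in range(n):
        # Each row that still carries weight defines one cluster.
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return [set(c) for c in clusters]


# Two disconnected triangles: nodes 0-2 and nodes 3-5.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(mcl(graph))
```

In protein family detection the matrix entries would be similarity weights derived from BLAST results rather than 0/1 edges.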

Gene Set Computations
• Given a set of genomes, represented by their
‘proteomes’ or sets of protein sequences
• Given homology relationships (for example,
as computed by orthoMCL)
– Which genes are shared by genomes X and Y?
– Which genes are unique to genome Z?
– Venn or extended Venn diagrams
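
Once every protein has been assigned to a family, these questions become set operations. A minimal sketch, assuming each genome is represented by the set of family identifiers it contains; the genome names X, Y, Z follow the slide, while the family identifiers are made up.

```python
def shared_families(families_by_genome: dict, genomes: list) -> set:
    """Families present in every one of the listed genomes."""
    sets = [families_by_genome[g] for g in genomes]
    return set.intersection(*sets) if sets else set()


def unique_families(families_by_genome: dict, genome: str) -> set:
    """Families present in the given genome and in no other genome."""
    others = set().union(
        *(fams for g, fams in families_by_genome.items() if g != genome)
    )
    return families_by_genome[genome] - others


# Hypothetical family assignments.
families = {
    "X": {"f1", "f2", "f3"},
    "Y": {"f2", "f3", "f4"},
    "Z": {"f3", "f5"},
}

print(shared_families(families, ["X", "Y"]))
print(unique_families(families, "Z"))
```

The sizes of these sets are the numbers that fill the regions of a Venn diagram.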



3-way genome comparison (figure, panels A and B)


Brucella gene set computations



Joining synteny and homology



OAK: ortholog alignment for prokaryotes

Pipeline figure. Inputs: Genome 1, Genome 2, ..., Genome n. Components: ortholog set
builder (orthoMCL), Script 1, Script 2. Outputs: graph, HTML tables, report (for annotators).
Replicon color key for HTML tables

R/G  S4              C58                K84             R. etli          R. leguminosarum     S. meliloti    M. loti MAFF  M. loti BNC
1    2nd chromosome  linear chromosome  2nd chromosome  -                -                    -              -             -
2    plasmid 630kb   AT plasmid         plasmid 390kb   plasmid F 640kb  plasmid pRL12 870kb  plasmid pSymA  plasmid 1     plasmid 1
3    plasmid 259kb   Ti plasmid         plasmid 179kb   plasmid E 510kb  plasmid pRL11 680kb  plasmid pSymB  plasmid 2     plasmid 2
4    plasmid 210kb   -                  plasmid 44kb    plasmid D 370kb  plasmid pRL10 490kb  -              -             plasmid 3
5    plasmid 130kb   -                  -               plasmid C 250kb  plasmid pRL9 350kb   -              -             -
6    plasmid 79kb    -                  -               plasmid A 190kb  plasmid pRL7 150kb   -              -             -
7    -               -                  -               plasmid B 180kb  plasmid pRL8 150kb   -              -             -


What do the tables show
• conserved blocks (aka “microsyntenic
regions”), and how these blocks appear in
different replicons across the genomes
compared
• some of these blocks are not operons (would
need to show strand)
• possible block losses



Polymorphism detection
• inDels, SNPs
• pseudogenes
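
SNPs and indels can be read directly off a pairwise alignment. The sketch below assumes two already-aligned sequences of equal length with `-` as the gap character; the function name, the 0-based reference coordinates, and the output tuples are illustrative assumptions, not part of any tool named in these slides.

```python
def find_polymorphisms(aligned_ref: str, aligned_qry: str):
    """Return (snps, indels) from two gapped, equal-length aligned sequences.

    snps:   list of (ref_pos, ref_base, qry_base)
    indels: list of (kind, ref_pos, length), kind is 'insertion' or 'deletion'
            relative to the reference; consecutive gap columns are merged.
    """
    if len(aligned_ref) != len(aligned_qry):
        raise ValueError("aligned sequences must have the same length")
    snps, indels = [], []
    ref_pos = 0   # 0-based position in the ungapped reference
    run = None    # indel currently being extended: [kind, ref_pos, length]
    for r, q in zip(aligned_ref, aligned_qry):
        kind = "insertion" if r == "-" else "deletion" if q == "-" else None
        if kind is None:
            if r != q:
                snps.append((ref_pos, r, q))
            run = None
        elif run is not None and run[0] == kind:
            run[2] += 1
        else:
            run = [kind, ref_pos, 1]
            indels.append(run)
        if r != "-":
            ref_pos += 1
    return snps, [tuple(event) for event in indels]


snps, indels = find_polymorphisms("ACGT-ACGTA", "ACCTGAC--A")
print(snps, indels)
```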



Pseudogenes
• Nonfunctional protein coding genes
• Mutations introduce “sequence problems”
(frameshifts, in-frame stop codons, absence of a stop codon)
• Natural mutation or sequencing error?
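
The three sequence problems named above can be checked mechanically on an annotated coding sequence. This sketch assumes the sequence is given in its annotated reading frame and uses the three standard stop codons; the function name and the problem labels are hypothetical. It flags candidates only and cannot tell a natural mutation from a sequencing error.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sequence_problems(cds: str) -> list:
    """List the pseudogene-style problems found in a coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("possible frameshift")
    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("in-frame stop")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop")
    return problems


for cds in ("ATGGCCTAA", "ATGTAAGCCTAA", "ATGGCCGCC", "ATGGCCTA"):
    print(cds, sequence_problems(cds))
```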

Pseudogene cases

Why study pseudogenes?
• In “normal” bacterial genomes, 1-5% of the genes are
pseudogenes [Liu et al]
• Pseudogenes can give interesting clues to
evolutionary pathways

Why study pseudogenes? Cont’d
• High fractions of pseudogenes suggest a “genome
degradation” process
• May be cause or effect of niche restriction
• Examples
– Mycobacterium leprae: 36% (~1,100 genes)
– Leifsonia xyli subsp. xyli: 13% (~300 genes)
• Pseudogenes do not show up in BLAST searches
– Ortholog computations will in general not include them!

Pseudogene Identification by Sequence Similarity: Study of 8 Brucella Genomes

BLASTN: annotated pseudogenes vs. genome sequences

Total alignments   4120   0.98
Gene hits          2627   0.62
Pseudogenes        1493   0.35

Figure legend: previously known pseudogene; known gene (homologous to pseudogene); newly identified pseudogene

Brucella Pseudogene Analysis: Identification of New Pseudogenes by Homology
(bar chart, counts from 0 to 600; series: PG Count: Initial, Tot. Alignments, Known Genes, PG Count: Final;
genomes: Bab9941, BabS19, Bcan23365, Bmel16M, Bab2308, Bovi25840, Bsui1330, Bsui23445)
Genomics is just the beginning

Levels shown, in order of increasing complexity:
• Genomics/proteomics
• Interactions between molecules
• Cell processes
• Whole organisms
• Populations
21st century biology: integration


Acknowledgments

• Nalvo Almeida
• Chris Lasher
• Brett Tyler
• Rebecca Wattam

