Genomes
Milo Thurston
CEH Oxford
Genomes
• What is the genetic capacity of an
organism?
• What does an organism do with this
genetic capacity?
• How fast does this genetic capacity
change over time? How?
Microsatellites
Create a
“Microbial Genomes Microsatellite
Database”
inside the genomemine
Microsatellites are hot-spots of
mutation
(short direct repeats of 1-6 bp)
ATCGATGCATATATATATATATATATTGCCTGG (AT)9
ATCGATGCATATATATATATATATTGCCTGG (AT)8
ATCGATGCATATATATATATATATATTGCCTGG (AT)9
ATCGATGCATATATATATATATATATATATTGCCTGG (AT)11
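A minimal sketch of how perfect microsatellites of 1-6 bp could be located in a sequence. This is an illustration only, not the method used to build the database: the function name `find_microsatellites` and the `min_copies` threshold are assumptions, and the test sequence is built to match the (AT)9 example above.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_copies=5):
    """Return (start, motif, copies) for each perfect tandem repeat found.

    min_copies is an illustrative threshold, not a value from the slides.
    """
    # The lazy {m,n}? tries the shortest unit first, so a run of
    # ATATAT... is reported as (AT)n rather than (ATAT)n.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits

# Flanking sequence and repeat modeled on the (AT)9 example.
example = "ATCGATGC" + "AT" * 9 + "TGCCTGG"
print(find_microsatellites(example))  # [(8, 'AT', 9)]
```

Only perfect repeats are matched here; interrupted or imperfect tracts would need a different approach.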
Microsatellites
• Molecular Markers in Eukaryotes
• Triplet-Repeat Expansion Diseases
• Contingency Loci in Pathogenic Prokaryotes
ATATATATATATATATA
ATATATATATATATATATAT
Haemophilus influenzae
ATATATATATATATATA
lic1
“Phase Variation”
Many pathogenic bacteria have the ability
to rapidly switch the abundances and types
of molecules on their cell surface.
ON - OFF Molecular Switches
(CAAT)40 (translational switch)
(AT)8 (transcriptional switch)
ON - OFF Molecular Switches
(CAAT)39 (translational switch)
(AT)7 (transcriptional switch)
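A small worked example of why losing one CAAT copy can act as a translational switch: a 4-bp unit is not a multiple of the 3-bp codon, so each gain or loss of a copy moves the downstream reading frame. The helper name `frame_offset` is hypothetical, and the slides do not say which copy number corresponds to ON, so the sketch only shows that the two states differ.

```python
def frame_offset(unit_len, copies):
    """Length of the repeat tract modulo the 3-bp codon size."""
    return (unit_len * copies) % 3

for copies in (39, 40):
    tract = 4 * copies
    print(f"(CAAT){copies}: tract = {tract} bp, frame offset = {frame_offset(4, copies)}")
# (CAAT)39: tract = 156 bp, frame offset = 0
# (CAAT)40: tract = 160 bp, frame offset = 1
```

The (AT)n switch is labelled transcriptional, so reading frame is not the relevant quantity there.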
Origin and Maintenance of Loci Involved
in Antigenic Variation, Ecological
Tradeoffs & ‘Mutational Phenotypes’
ATGCAATCAATCAATCAATCAATCAATCAA
TCAATCAATCAATCAATCAATCAACAATCA
ATCAATCAATCAATCAAATTGTAGGATTTG
TTAAAACTTGCTACAAGCCTGAGGAAGTAT
TTCATTTTCTTCATCAGCATTCCATTCCTT
TTTCCTCCATTGGAGGAATGACCAATCAAA
ATGTTCTACTTAATATTTCTGGAGTTAAGT
TTGTATTACGGATCCCTAATGCCGTAAATT
TATCACTTATAAATCGAGA........
Genome sequencing aids in the discovery of
microsatellite “molecular switches” in pathogenic
bacteria (“Contingency Loci”; Moxon, Rainey, Nowak & Lenski, 1994)
Haemophilus influenzae (Hood et al., 1996)
Helicobacter pylori (Tomb et al., 1997; Alm et al., 1999)
Campylobacter jejuni (Parkhill et al., 2000)
Neisseria meningitidis (Tettelin et al., 2000; Parkhill et al., 2000)
[Figure: plots with axes Log Genome Size and G+C Content]
Genomes with the most Microsatellites
Taxa           Species                                  Genome Size (kb)  No.  Freq (one per N bp)
Mitochondrion  Saccharomyces cerevisiae                 86                171  502
Virus          Molluscum contagiosum virus subtype 1    190               79   2,409
Bacteria       Xanthomonas axonopodis pv. citri str     5,176             61   84,845
Nucleomorph    Guillardia theta                         174               44   3,958
Bacteria       Xanthomonas campestris pv. campestris    5,076             43   118,051
Virus          shrimp white spot syndrome virus         305               42   7,264
Chloroplast    Marchantia polymorpha                    121               40   3,026
Nucleomorph    Guillardia theta                         181               40   4,523
Virus          Gallid herpesvirus 3                     164               40   4,107
Virus          Spodoptera exigua nucleopolyhedrovirus   136               38   3,569
Bacteria       Helicobacter pylori 26695                1,668             37   45,077
Chloroplast    Chlorella vulgaris                       151               37   4,071
Bacteria       Xylella fastidiosa                       2,679             35   76,552
Bacteria       Helicobacter pylori J99                  1,644             34   48,348
Footprint Length versus Genome Size
[Figure: Footprint Length against Log Genome Size, axis from 10^3 to 10^7]
Longest repeats are
• extremely long
• hexanucleotides in Herpes viruses, vertebrate
mitochondrial genomes, VNTRs in pathogenic
prokaryotes, contingency loci
• artefact (plasmid dinucleotide)
• viral virulence factor (mononucleotide)
• Include long polymorphic repeats in Baculoviruses
(variety of repeats)
• largely unannotated
Microsatellites in Bacteria
[Figure labels: H. influenzae, Pathogenic Neisseria x 2, H. pylori x 2, C. jejuni, E. coli, M. genitalium, VNTRs]
Observations
• ‘Small genomes’ have significant numbers of long
microsatellites
• A variety of factors, including genome size and G+C content,
contribute to presence/absence
• Taxonomic differences (numbers, motif types, biological
significance)
• Next step requires extensive curation (meta-data, genetic
content, homology, literature on phenotype and mutability)
genomemine/genomebank
“key”=“value”
information
Motivations
• Facilitate new computational studies
• Growth in the number of genomes
• Biological Patterns
• Biases
• Evaluate Prospective Data Sets
• Hypothesis Generation
• Inform ongoing computational/empirical
studies
genomemine/genomebank
• Automated retrieval of genomes (bacteria, plasmids, viruses, and organelles)
– Ecological literature (habitat, extremophile?, host, carbon source, oxygen, shape, motile?,
etc)
• Interactive Plotting
Key = Value
species = Haemophilus influenzae
strain = Rd
Core Genome Features
Genome Size = 1,830,138
ORFs = 1709
G+C = 39%
Orphans at Time of Publication = 389
Ecology
Primary Habitat = Human Host/Respiratory Tract
Interaction = Commensal with ability to cause disease
Computed Features
Microsatellites = 14 (Link in MGMD)
Percent Low Complexity = 3.8
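A minimal sketch of how a “key”=“value” record like the Haemophilus influenzae Rd example could be held in code. The `parse_record` helper, the "General" default section and the text layout are assumptions made for illustration; the actual genomemine/genomebank storage format is not described in these slides.

```python
def parse_record(text):
    """Parse 'key = value' lines, grouped under plain section-heading lines."""
    record = {}
    section = "General"  # hypothetical default section for ungrouped keys
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            record.setdefault(section, {})[key] = value
        else:
            section = line  # a line without '=' starts a new section
    return record

# Values copied from the example record; the layout is illustrative.
example = """
species = Haemophilus influenzae
strain = Rd
Core Genome Features
Genome Size = 1,830,138
ORFs = 1709
G+C = 39%
Computed Features
Microsatellites = 14
Percent Low Complexity = 3.8
"""

record = parse_record(example)
print(record["Core Genome Features"]["ORFs"])   # 1709
print(record["General"]["species"])             # Haemophilus influenzae
```

Values are kept as strings here; converting fields such as Genome Size to numbers would be a natural next step before plotting.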
Proportion Low Complexity and G+C Content
[Figure: Percent Low Complexity (0 to 14) against G+C Content (0 to 100)]
Acquisition of Orphans as genomes are sequenced
[Figure: Orphans against Non-Orphans, separate panels for Eubacteria and Archaea]
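For reference, G+C content as used on these slides is the percentage of bases that are G or C. A minimal sketch, assuming an unambiguous DNA string; the function name `gc_content` is hypothetical and this is not the genomemine code.

```python
def gc_content(seq):
    """Percentage of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)

print(gc_content("ATGC"))  # 50.0
```

Ambiguity codes such as N are counted in the length but not as G or C in this sketch, which would slightly lower the reported value for draft sequences.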
Ecology Field
Habitat
Primary habitat
Extremophile?
Optimal Temperature
Optimal pH
Environmental Breadth
Trophic Status
Interaction
Obligate?
Guild
Oxygen
Energy
Carbon
Growth
Doubling Time in vitro
Morphology
Shape
Gram stain
Median width
Median length
Volume
Surface to volume ratio
Motile?
Phylogeny
“A Tree Viewer” (ATV)
Zmasek C.M. and Eddy S.R. (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17, 383-384.
http://www.genetics.wustl.edu/eddy/atv
Annotated Phylogenies
NCBI Taxonomy
16S ribosomal DNA
(proteome comparisons)
Summary
• Collections of Genomes present new
opportunities
• Should merge genomes with evolutionary and
ecological meta-data to put genomic information
in an ‘organismal context’
• Biological patterns/rules (biases/artefacts)
emerge (microsatellites)
• genomemine/genomebank
http://www.genomics.ceh.ac.uk/GMINE/
Acknowledgements
Dawn Field
Jennifer Hughes
Adrian Tett
Andrew Spiers
Sarah Turner
Mark Bailey
Ali Cody
Chris Bayliss
Derek Hood
Richard Moxon