The material of this lecture is taken from the paper by the Mouse Genome Sequencing
Consortium, "Initial sequencing and comparative analysis of the mouse genome",
Nature 420, 520-562 (5 December 2002). An excellent and very readable paper!
Key findings:
* the mouse genome is about 14% smaller than the human genome. The
difference probably reflects a higher rate of deletion in mouse.
* over 90% of the mouse and human genomes can be partitioned into
corresponding regions of conserved synteny (segments in which the gene order
in the most recent common ancestor has been conserved in both species)
* at the nucleotide level, ca. 40% of the human genome can be aligned to the
mouse genome. These sequences seem to represent most of the orthologous
sequences that remain in both lineages from the common ancestor. The rest
was probably deleted in one or both genomes.
* the neutral substitution rate has been roughly half a nucleotide substitution per
site since the divergence of the species. About twice as many of these
substitutions have occurred in mouse as in human.
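The last finding can be turned into per-lineage numbers with a little arithmetic. The sketch below is only an illustration: it takes the rounded values quoted above (0.5 substitutions per site in total, a mouse-to-human ratio of 2) and splits the total between the two lineages. The function name `split_divergence` is made up for this example.

```python
def split_divergence(total_per_site, mouse_to_human_ratio):
    """Split a total divergence between two lineages with a given ratio.

    If mouse = ratio * human and mouse + human = total, then
    human = total / (1 + ratio).
    """
    human = total_per_site / (1.0 + mouse_to_human_ratio)
    mouse = total_per_site - human
    return mouse, human


# Rounded values from the key findings above (assumed exact here).
mouse, human = split_divergence(0.5, 2.0)
print(f"mouse lineage: {mouse:.2f} substitutions per site")
print(f"human lineage: {human:.2f} substitutions per site")
```

With these rounded inputs, about one third of a substitution per site falls on the mouse lineage and about one sixth on the human lineage.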
6. Lecture WS 2003/04 Bioinformatics III 1
Comparison of mouse and man at genome level
Key findings:
* the proportion of small (50-100 bp) segments in the mammalian genome that is
under (purifying) selection is ca. 5%, i.e. much higher than can be explained by
protein-coding sequences alone.
→ genome contains many additional features (UTRs, regulatory elements, non-
protein-coding genes, chromosomal structural elements) under selection for
biological function!
* the mammalian genome is evolving in a non-uniform manner, various measures
of divergence showing substantial variation across the genome.
* mouse and human genomes each seem to contain ca. 30,000 protein-coding
genes. The proportion of mouse genes with a single identifiable orthologue in the
human genome is ca. 80%. The proportion of mouse genes without any homologue
currently detectable in the human genome (and vice versa) is < 1%.
Sequencing strategy
- first whole genome shotgun to generate draft sequence quickly
- later generate physical map for producing a finished sequence
at fine scale:
align genome to 10 Mb of finished BAC-derived sequence from B6 strain.
39 discrepancies of > 50 bp in length (median size of 320 bp) reflecting
small misassemblies.
Discrepancies typically occur at the ends of contigs in WGS assembly →
incorrect incorporation of a single terminal read.
Indeed, 5.9 million of the 33.6 million reads were not part of anchored sequence.
88% of them are not assembled into sequence contigs, 12% belong to contigs
but are not localized on a particular chromosome.
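The read accounting above can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted on the slide; the function and key names are invented for the example.

```python
TOTAL_READS = 33.6e6       # all reads in the whole-genome shotgun data set
UNANCHORED_READS = 5.9e6   # reads that are not part of anchored sequence


def unanchored_breakdown(total, unanchored, frac_unassembled=0.88):
    """Break the unanchored reads down into the two groups named on the slide."""
    return {
        # share of all reads that could not be anchored
        "fraction_unanchored": unanchored / total,
        # reads not assembled into any sequence contig (88%)
        "unassembled_reads": frac_unassembled * unanchored,
        # reads in contigs that are not placed on a chromosome (12%)
        "unlocalized_contig_reads": (1.0 - frac_unassembled) * unanchored,
    }


for name, value in unanchored_breakdown(TOTAL_READS, UNANCHORED_READS).items():
    print(name, round(value, 3))
```

So roughly 18% of all reads are unanchored, about 5.2 million of them are not assembled at all, and about 0.7 million sit in contigs without a chromosomal position.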
In their pioneering paper, Nadeau and Taylor (1984) estimated that the mouse
and human genomes could be parsed into roughly 180 syntenic regions, a
surprisingly small number.
Today, gene-based synteny maps define about 200 syntenic regions.
About 558,000 landmark pairs were found, with a mean spacing of 4.4 kb and an N50 length of ca. 500 kb.
Together they make up 7.5% of the mouse genome.
But there may be many more that have evolved too quickly to be detected.
Syntenic block: one or more syntenic segments that are all adjacent on the same
chromosome in human and on the same chromosome in mouse; may otherwise
be shuffled with respect to order and orientation.
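The relation between segments and blocks can be illustrated with a small sketch. It is a simplification, not the procedure used in the paper: segments are walked in order along each mouse chromosome, and consecutive segments that map to the same human chromosome are merged into one block. Order and orientation on the human side are ignored, and adjacency on the human chromosome is not checked. The tuple layout and the chromosome labels are hypothetical.

```python
from itertools import groupby


def group_into_blocks(segments):
    """Merge neighbouring syntenic segments into syntenic blocks.

    segments: iterable of (mouse_chr, mouse_start, human_chr) tuples.
    Returns a list of (mouse_chr, human_chr, n_segments) tuples, one per block.
    """
    # Walk along each mouse chromosome in positional order.
    ordered = sorted(segments, key=lambda s: (s[0], s[1]))
    blocks = []
    # groupby only merges *consecutive* items with the same key, which is
    # exactly the "adjacent segments from the same human chromosome" rule.
    for (mouse_chr, human_chr), run in groupby(ordered, key=lambda s: (s[0], s[2])):
        blocks.append((mouse_chr, human_chr, len(list(run))))
    return blocks


# Hypothetical toy data: five segments on two mouse chromosomes.
toy_segments = [
    ("chr1", 0, "hs2"),
    ("chr1", 10, "hs2"),
    ("chr1", 20, "hs6"),
    ("chr1", 30, "hs2"),
    ("chr2", 0, "hs2"),
]
for block in group_into_blocks(toy_segments):
    print(block)
```

In the toy data the first two segments form one block, while the later hs2 segment on chr1 starts a new block because an hs6 segment lies in between.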
Each genome could be parsed into a total of 342 conserved syntenic segments.
On average, each landmark resides in a segment containing 1600 other
landmarks.
Segments vary greatly in length: 303 kb – 64.9 Mb.
About 90.2% of the human and 93.3% of the mouse genome unambiguously reside
within conserved syntenic segments.
Segments and blocks >300 kb in size with conserved synteny in human are
superimposed on the mouse genome. Each colour corresponds to a particular
human chromosome. The 342 segments are separated from each other by thin,
white lines within the 217 blocks of consistent colour.
With only two species, however, it is not yet possible to recover the ancestral
chromosomal order or reconstruct the precise pathway of rearrangements.
This should become possible soon, as more and more mammalian species
are sequenced.
What accounts for the smaller size of the mouse genome? See section on
repeats.
- (G + C) content
- CpG islands
(G+C) content and density of CpG islands show more variability in human (red)
than in mouse (blue) chromosomes. a, The (G+C) content of the mouse
chromosomes is relatively similar, whereas human chromosomes show more
variation; chromosomes 16, 17, 19 and 22 have higher (G+C) content, and
chromosome 13 lower (G+C) content. b, Similarly, the density of CpG islands is
relatively homogeneous for all mouse chromosomes and more variable in human,
with the same exceptions. Note that the mouse and human chromosomes are
matched by chromosome number, not by regions of conserved synteny.
The mouse genome. Nature 420, 520 - 562
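The two quantities compared in the figure can be computed directly from a sequence. The sketch below is an illustration only and is not the pipeline used in the paper. It computes the (G+C) fraction and the observed/expected CpG ratio that is commonly used when calling CpG islands (a frequently cited rule of thumb, not stated on the slide, asks for a length of at least 200 bp, (G+C) of at least 50% and a ratio of at least 0.6). The function names are made up for the example.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    n_cpg = seq.count("CG")
    return n_cpg * len(seq) / (n_c * n_g)


example = "ACGTCGGA"  # toy sequence, far too short to be a real CpG island
print(gc_content(example), cpg_obs_exp(example))
```

Applied in a sliding window along each chromosome, these two numbers give per-chromosome profiles of the kind summarised in panels a and b.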
Repeats
Repeats are the most prevalent feature of mammalian genomes.
Most of them are interspersed repeats representing "fossils" of transposable
elements.
Lineage-specific repeats make up 32.4% of the mouse genome vs. 24.4% of the human genome.
This is an average. Currently, the substitution rate per year in mouse is probably
fivefold higher than in human.
Over time, pseudogenes of either class tend to accumulate mutations that clearly
reveal them to be inactive, such as multiple frameshifts or stop codons.
They acquire a larger ratio of non-synonymous to synonymous substitutions
(KA / KS) than functional genes.
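The idea behind KA / KS can be illustrated by classifying the differences between two aligned coding sequences. The sketch below is deliberately simplified and is not a full KA / KS estimator: it only counts codons that differ at exactly one position and labels each as synonymous or non-synonymous using the standard genetic code. A real estimate would also normalise by the number of synonymous and non-synonymous sites and correct for multiple hits. The function name and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a, seq_b):
    """Count synonymous and non-synonymous single-base codon differences.

    Both sequences must be gap-free, in frame, of equal length and contain
    only the letters A, C, G and T.
    Returns (n_synonymous, n_nonsynonymous).
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length, a multiple of 3")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons and multi-hit codons are skipped
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Toy example: GCT -> GCC keeps alanine, AAA -> AGA changes lysine to arginine.
print(count_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

A functional gene under purifying selection is expected to show relatively few non-synonymous changes, whereas in a pseudogene both kinds of change accumulate freely, which pushes the ratio upwards.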
The authors estimate that 76% of the first class and 30% of the second class are
pseudogenes. They comprise ca. 12,000 of the 213,562 exons in the mouse gene catalogue.
For 96% the homologue lies within a similar conserved syntenic interval in the
human genome.
For 80% of mouse genes, the best match in the human genome in turn has
its best match against that same mouse gene. These are termed 1:1 orthologues.
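The reciprocal-best-match criterion described above can be sketched in a few lines. This is an illustration under simple assumptions, not the pipeline used in the paper: similarity scores are given as dictionaries keyed by (query, target), a higher score means a better match, and ties are broken by first occurrence. All gene names and scores are invented.

```python
def best_hits(scores):
    """Map every query gene to its highest-scoring target gene.

    scores: dict mapping (query, target) -> similarity score.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(mouse_vs_human, human_vs_mouse):
    """Return sorted (mouse_gene, human_gene) pairs that are each other's best hit."""
    mouse_to_human = best_hits(mouse_vs_human)
    human_to_mouse = best_hits(human_vs_mouse)
    return sorted(
        (m, h) for m, h in mouse_to_human.items() if human_to_mouse.get(h) == m
    )


# Hypothetical scores: mC's best human hit is hB, but hB prefers mB.
mouse_vs_human = {("mA", "hA"): 900, ("mA", "hB"): 300,
                  ("mB", "hB"): 500, ("mC", "hB"): 450}
human_vs_mouse = {("hA", "mA"): 880, ("hB", "mB"): 510, ("hB", "mC"): 440}
print(reciprocal_best_hits(mouse_vs_human, human_vs_mouse))
```

In the toy data mA/hA and mB/hB are reported as 1:1 orthologues, while mC has a human homologue but no reciprocal best match.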
For less than 1% of the predicted mouse genes there was no homologous
predicted human gene.
Genes that seem to be mouse-specific may correspond to human genes
that are still missing due to the incompleteness of the human genome sequence.
De novo gene addition in the mouse lineage and gene deletion in the human
lineage have not significantly altered the gene repertoire.
(2)
require the presence of adjacent exons in both species
The authors expect that about 1,000 new gene predictions would be validated by RT-PCR.
two measures:
- percentage of amino acid identity
- KA / KS ratio
Also, domain families with enzymatic activity were found to have a lower KA / KS
ratio than non-enzymatic domains.
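The first of the two measures, percentage of amino acid identity, is easy to compute once two protein sequences have been aligned. The sketch below assumes a pairwise alignment given as two equal-length strings with "-" as the gap character, and it excludes gap columns from the denominator. That choice is an assumption of this example; other conventions divide by the full alignment length or by the length of the shorter sequence.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percentage of identical residues over the gap-free alignment columns."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for a, b in zip(aln_a, aln_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy alignment: 5 gap-free columns, 4 of them identical.
print(percent_identity("MKT-LV", "MRTALV"))
```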