3. Unrooted trees
4. Rooted trees
• Choice of an outgroup
7. Tree reconstruction
7.1 Molecular sequences
7.2 Sequence alignment is the essential preliminary to tree reconstruction
7.3 Converting the alignment data into a phylogenetic tree
7.4 Assessing accuracy of a reconstructed tree
7.5 Molecular clocks enable the time of divergence of ancestral sequences to be
estimated
Figure: An unrooted phylogenetic tree joining 4 taxonomic units. A, B, C and D are the external
nodes; they are connected by branches through the internal nodes.
3. Unrooted trees
An unrooted tree simply represents phylogenetic relationships but does not provide an evolutionary
path. In an unrooted tree, an external node represents a contemporary organism. Internal nodes
represent common ancestors of some of the external nodes. In this case, the tree shows the
relationship between organisms A, B, C and D but does not tell us anything about the series of
evolutionary events that led to these genes (see figure above). There is also no way to tell
whether or not a given internal node is a common ancestor of any 2 external nodes.
4. Rooted trees
In the case of a rooted tree, one of the internal nodes is used as an outgroup and, in essence,
becomes the common ancestor of all the other external nodes. The outgroup therefore enables
the root of a tree to be located and the correct evolutionary pathway to be identified. In the
above case, five different evolutionary pathways are possible using an outgroup, each depicted
by a different rooted tree.
Figure. The five rooted trees that can be drawn from the unrooted tree (box). The positions of
the roots are indicated by the numbers 1 to 5 on the outline of the unrooted tree (box).
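The count of five follows from the shape of the unrooted tree: a root can be placed on any
branch, and a fully resolved unrooted tree with four external nodes has five branches. The
following is a minimal sketch, assuming the tree is stored as a list of branches. The labels
`X` and `Y` for the two internal nodes are hypothetical and do not appear in the figure.

```python
# Unrooted tree joining A, B, C and D, stored as a list of branches.
# "X" and "Y" are hypothetical names for the two internal nodes.
branches = [("A", "X"), ("B", "X"), ("X", "Y"), ("C", "Y"), ("D", "Y")]


def root_positions(tree_branches):
    """Return one candidate root placement per branch of an unrooted tree."""
    return [f"root on branch {u}-{v}" for u, v in tree_branches]


def branch_count(n_taxa):
    """Number of branches in a fully resolved unrooted tree with n_taxa external nodes."""
    return 2 * n_taxa - 3


positions = root_positions(branches)
for position in positions:
    print(position)
```

Each printed line corresponds to one of the five rooted trees in the figure.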
• Choice of an outgroup
The criteria used to choose an outgroup depend very much on the type of analysis that is
carried out. Suppose that 4 homologous (orthologous) genes in a tree come from human,
chimpanzee, gorilla and orangutan. A useful homologous primate outgroup sequence is that
from baboon, as palaeontological evidence suggests that baboons branched away from the
lineage leading to human, chimpanzee, gorilla and orangutan before the time of the common
ancestor of the four species (figure below).
Figure: Rooted tree of human, chimpanzee, gorilla and orangutan, with baboon as the outgroup.
5. Inferred and true trees
We refer to the rooted tree given above as an inferred tree. This is to emphasise that it depicts the
series of evolutionary events that are inferred from the data that were analysed, and may not
necessarily be the same as the true tree, the one that depicts the actual series of events that
occurred. Sometimes we can be fairly confident that the inferred tree is the true tree, but most
phylogenetic data analyses are prone to uncertainties. Degrees of confidence can be assigned to the
branching patterns in an inferred tree using bootstrap analysis (discussed in a later section). Due to
the imprecise nature of phylogenetic analysis, controversies have arisen.
6. Gene trees are not the same as species trees
The above tree is a gene tree, i.e. a tree derived by comparing orthologous sequences (those
derived from the same ancestral sequence). The assumption is that this gene tree is a more accurate
reflection of a species tree than the one that can be inferred from morphological data. This
assumption is generally correct, but it does not mean that the gene tree is the same as a species tree.
Mutation and speciation are not expected to occur at the same time. For example, the mutation
event could precede the speciation event. This would mean that, to begin with, both alleles will
still be present in the same unsplit population of the ancestral species. When the population split
occurs, it is likely that both alleles will be present in each of the resulting groups. After the split,
the new populations evolve independently. One possibility is that, as a result of random genetic
drift, one allele is lost from one population and the other allele is lost from the second population.
This establishes the two separate genetic lineages that were inferred from phylogenetic
analysis of the gene. How do these considerations affect the coincidence between a gene tree and a
species tree?
(a) If a molecular clock is used to date the time at which gene divergence took place, then it
cannot be assumed that this is also the time of the speciation event. A significant difference
between the time of gene divergence and the time of speciation can exist even though the species
tree and gene tree look the same (see left-hand figure (a) below).
(b) If the first speciation event is followed closely by a second speciation event in one of
the two populations, then the branching order of the gene tree might be different from that of the
species tree. This can occur if the genes in the modern species are derived from alleles that had
already appeared before the first of the two speciation events (right-hand figure (b) below).
Figure (a), left: Gene tree and species tree look the same. However, mutation might precede
speciation, giving an incorrect time for the latter if a molecular clock is used.
Figure (b), right: A gene tree can have a different branching order from a species tree.
7. Tree reconstruction
In any molecular phylogenetic reconstruction the following points need to be addressed.
7.1 Molecular sequences
7.2 Sequence alignment is the essential preliminary to tree reconstruction
7.3 Converting the alignment data into a phylogenetic tree
7.4 Assessing accuracy of a reconstructed tree
7.5 Molecular clocks enable the time of divergence of ancestral sequences to be estimated
Protein1  -gly-ala-ile-leu-asp-arg-
DNA1      -gga-gcc-ata-tta-gat-aga-
DNA2      -gga-gca-att-ttt-gat-aga-
Protein2  -gly-ala-ile-phe-asp-arg-
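One way to read the four sequences above is to count the differences at each level. The sketch
below, a minimal illustration with a hypothetical helper `count_differences`, compares the two
DNA sequences nucleotide by nucleotide and the two protein sequences residue by residue.

```python
def count_differences(units_1, units_2):
    """Count positions at which two equal-length sequences differ."""
    if len(units_1) != len(units_2):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(units_1, units_2) if a != b)


# Sequences copied from the figure above; hyphens only separate codons or residues.
dna1 = "gga-gcc-ata-tta-gat-aga".replace("-", "")
dna2 = "gga-gca-att-ttt-gat-aga".replace("-", "")
protein1 = "gly-ala-ile-leu-asp-arg".split("-")
protein2 = "gly-ala-ile-phe-asp-arg".split("-")

dna_differences = count_differences(dna1, dna2)
protein_differences = count_differences(protein1, protein2)
print("nucleotide differences:", dna_differences)
print("amino acid differences:", protein_differences)
```

For these sequences the DNA comparison records three differences while the protein comparison
records one, so the DNA sequences carry more phylogenetic information.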
Sequence 1  GACGACCATAGACCAGCATAG
Sequence 2  GACTACCATAGA-CTGCAAAG
            *** ******** * *** **
Sequence 1  GACGACCATAGACCAGCATAG
Sequence 2  GACTACCATAGACT-GCAAAG
            *** *********  *** **
Two possible positions for the indel
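The two placements of the indel can be compared by counting matches, mismatches and gaps in
each alignment. The following is a minimal sketch; the function name `score_alignment` and the
simple three-way tally are assumptions made for illustration, not a scoring scheme given in the
source.

```python
def score_alignment(row_1, row_2):
    """Return (matches, mismatches, gaps) for two aligned rows of equal length."""
    if len(row_1) != len(row_2):
        raise ValueError("aligned rows must be the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(row_1, row_2):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


sequence_1 = "GACGACCATAGACCAGCATAG"
option_1 = "GACTACCATAGA-CTGCAAAG"  # indel placed at position 13
option_2 = "GACTACCATAGACT-GCAAAG"  # indel placed at position 15

score_1 = score_alignment(sequence_1, option_1)
score_2 = score_alignment(sequence_1, option_2)
print(score_1, score_2)
```

Both placements give 17 matches, 3 mismatches and 1 gap, which is why the position of the indel
cannot be decided from identity counts alone.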
• The dot matrix technique for alignment: Some alignments can easily be done by "eyeballing" the
sequences, yet others may require pen and paper. The simplest approach is known as the dot
matrix method. The two sequences are written out on the x- and y-axes of a sheet of graph paper,
and dots are placed at the positions corresponding to identical nucleotides in the two sequences.
The alignment is indicated by a diagonal series of dots, broken by empty squares where the
sequences have nucleotide differences, and shifting from one column to another where indels occur.
Figure: Dot matrix. An indel is shown by a shift in the column. A discontinued dot indicates a
point mutation.
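The pen-and-paper procedure can be sketched in a few lines. This is an illustrative sketch only:
the function name `dot_matrix`, the use of `*` for a dot and `.` for an empty square, and the two
short example sequences are assumptions, not part of the source.

```python
def dot_matrix(seq_x, seq_y):
    """Build a dot matrix with one row per base of seq_y and one column per base of seq_x.

    A "*" marks identical nucleotides; a "." marks an empty square.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


# Short illustrative sequences that differ by a single point mutation (G vs T).
seq_x = "GACGACC"
seq_y = "GACTACC"

print("  " + seq_x)
for base, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
    print(base + " " + row)
```

In the printed grid the main diagonal is broken at the fourth position, where the two sequences
differ. Dots away from the diagonal are chance matches between identical bases.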
• To date no one has devised a perfect method for tree construction, and several methods are
used. Extensive comparative tests have been conducted with test sequences, yet these tests
have failed to identify any particular method as better than the others.
• The main distinction between the different tree building methods is the way in which the
multiple sequence alignment is converted into numerical data that can be analysed
mathematically in order to construct a tree.
7.3.1.1 Least squares distance matrix (modified Jukes & Cantor algorithm)
Step 1: Generating a similarity matrix.
Given below is an example alignment of 5 sequences with 25 positions in the alignment:
Seq A AGAUUCGUCUGUAGGUUUCCACCAA
Seq B ACAUUCGUGUAUAGGUUUCCACUAA
Seq C ACAUUCGUGUAGAGGUUUCCACUAA
Seq D AAGUUCGCUUGGAGGUUUCCACGAA
Seq E AUCGUGAGAUCCAGGUAUCCACAAU
The first step in the least squares distance matrix method is to generate a similarity matrix. For
this, count the number of identical bases in every pair of sequences in the alignment. For example,
the number of identical bases between Seq A and Seq B is 21 out of a total of 25. Therefore the
similarity between Seq A and Seq B is 21 / 25 = 0.84.
Seq A AGAUUCGUCUGUAGGUUUCCACCAA
      |X||||||X|X|||||||||||X||
Seq B ACAUUCGUGUAUAGGUUUCCACUAA
A similarity matrix is generated using this approach for each pair of sequences and a similarity
table can be generated as shown below.
     A      B      C      D      E
A  -----  -----  -----  -----  -----
B  0.84   -----  -----  -----  -----
C  0.80   0.96   -----  -----  -----
D  0.76   0.72   0.76   -----  -----
E  0.52   0.52   0.52   0.52   -----
From this table, it can be seen that sequences A and B are 0.84 (= 84%) similar, A and C are
0.80 (= 80%) similar, B and C are 0.96 (= 96%) similar, and so on.
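The counting step can be reproduced directly from the five aligned sequences. The following is a
minimal sketch; the names `alignment`, `similarity` and `similarity_matrix` are chosen here for
illustration. Each value is the number of identical positions divided by the 25 aligned positions.

```python
from itertools import combinations

# The example alignment of 5 sequences with 25 positions each.
alignment = {
    "A": "AGAUUCGUCUGUAGGUUUCCACCAA",
    "B": "ACAUUCGUGUAUAGGUUUCCACUAA",
    "C": "ACAUUCGUGUAGAGGUUUCCACUAA",
    "D": "AAGUUCGCUUGGAGGUUUCCACGAA",
    "E": "AUCGUGAGAUCCAGGUAUCCACAAU",
}


def similarity(seq_1, seq_2):
    """Fraction of aligned positions at which the two sequences are identical."""
    identical = sum(1 for a, b in zip(seq_1, seq_2) if a == b)
    return identical / len(seq_1)


# One entry per pair of sequences, rounded to two decimals as in the table.
similarity_matrix = {
    (x, y): round(similarity(alignment[x], alignment[y]), 2)
    for x, y in combinations(alignment, 2)
}

for pair, value in similarity_matrix.items():
    print(pair, value)
```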
There is only one nucleotide difference between the two modern sequences, but two
nucleotide substitutions have actually occurred. If this multiple hit is not recognised, then
the evolutionary distance between the two sequences will be significantly underestimated.
Distance matrices are therefore usually constructed using mathematical methods that
include statistical approaches for estimating the amount of multiple substitutions that have
occurred as explained below.
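The section heading names a modified Jukes & Cantor algorithm but the text does not give its
formula. As an assumption, the sketch below uses the standard Jukes-Cantor correction,
d = -(3/4) ln(1 - (4/3) p), where p is the observed fraction of differing sites. It may differ
from the modified version the heading refers to.

```python
import math


def jukes_cantor_distance(p):
    """Standard Jukes-Cantor distance for an observed fraction p of differing sites.

    Assumption: this is the unmodified textbook formula, used here only to show
    how a correction for multiple substitutions increases the raw distance.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)


observed = 1 - 0.84  # Seq A vs Seq B: 84% similar, so 16% of sites differ
corrected = jukes_cantor_distance(observed)
print(f"observed {observed:.2f}, corrected {corrected:.3f}")
```

The corrected distance is slightly larger than the observed one, reflecting substitutions that
are hidden by multiple hits at the same position.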
7.3.1.2 Neighbor Joining Method approach for building a tree from a distance matrix
The neighbor-joining method is a popular tree-building procedure that uses the distance matrix
generated by distance matrix methods as described above.
This is done by starting with two of the sequences, separated by a line equal in length to the
evolutionary distance between the sequences:
Then the next sequence is added to the tree such that the distances between A, B and C are
approximately equal to the evolutionary distances. Notice that the fit isn't perfect. If we could
determine the evolutionary distances exactly, they would fit the tree exactly, but since we have
to estimate these distances, the numbers are fit to the tree as closely as possible using a least-
squares best fit.
The next step is to add the next sequence, again re-adjusting the tree to fit the distances as well
as possible:
And at last we can add the final sequence and readjust the branch lengths one last time using
least-squares:
Notice that the distance between any two sequences is (approximately) equal to the sum of the
length of the line segments joining those two sequences - in other words, the tree is additive.
Here is another way of looking at the same tree, a dendrogram.
A dendrogram shows evolutionary distances along the horizontal axis and assumes a root
somewhere in the middle of the tree, in this case in the branch connecting sequence E to the rest
of the tree. Some people like this representation because the horizontal axis roughly
approximates time.
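The walkthrough above adds sequences one at a time and refits the branch lengths by least
squares. For comparison, the sketch below shows the first step of the standard neighbor-joining
criterion (the Q matrix of Saitou and Nei), which is a different way of choosing which sequences
to join. Taking the distance as 1 minus the similarity from the table is an assumption made here
for illustration; no correction for multiple substitutions is applied.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D", "E"]

# Similarity values copied from the table in section 7.3.1.1.
similarity = {
    ("A", "B"): 0.84, ("A", "C"): 0.80, ("B", "C"): 0.96,
    ("A", "D"): 0.76, ("B", "D"): 0.72, ("C", "D"): 0.76,
    ("A", "E"): 0.52, ("B", "E"): 0.52, ("C", "E"): 0.52, ("D", "E"): 0.52,
}


def distance(x, y):
    """Uncorrected distance, taken here as 1 - similarity (an assumption)."""
    if x == y:
        return 0.0
    key = (x, y) if (x, y) in similarity else (y, x)
    return 1.0 - similarity[key]


def first_pair_to_join(names):
    """Return the pair with the smallest Q value, Q = (n - 2) d(x, y) - r(x) - r(y)."""
    n = len(names)
    totals = {x: sum(distance(x, y) for y in names) for x in names}

    def q_value(pair):
        x, y = pair
        return (n - 2) * distance(x, y) - totals[x] - totals[y]

    return min(combinations(names, 2), key=q_value)


pair = first_pair_to_join(taxa)
print("first pair joined:", pair)
```

With these distances the first pair joined is B and C, the two most similar sequences in the
table.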
• Maximum parsimony: more rigorous but necessitates more data handling. Adding more sequences
means more trees need to be evaluated. For example, with five sequences there are only 15 possible
unrooted trees, but with 10 sequences there are 2,027,025 and with 50 sequences the number is
approximately 2.8 × 10^74. Not even supercomputers can evaluate all the trees with the Maximum
Parsimony method. This is also true for the more sophisticated methods such as Maximum likelihood
and fastDNAml.
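The tree counts quoted above follow from the formula for the number of distinct bifurcating
unrooted trees joining n sequences, (2n - 5)!! = 3 × 5 × ... × (2n - 5). The following is a
minimal sketch; the function name `unrooted_tree_count` is chosen here for illustration.

```python
def unrooted_tree_count(n_taxa):
    """Number of distinct bifurcating unrooted trees for n_taxa sequences: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("at least 3 sequences are needed")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


for n in (5, 10, 50):
    print(n, unrooted_tree_count(n))
```

The rapid growth of this number is the reason exhaustive evaluation of every tree becomes
impractical beyond a small number of sequences.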
Ancestral character state: A character state possessed by a remote common ancestor of a group
of organisms.
Bootstrapping or Bootstrap analysis: A method of inferring the degree of confidence that can be
assigned to a branch point in a phylogenetic tree.
CAP: The chemical modification at the 5'-end of most eucaryotic mRNA molecules.
CAP binding complex: The complex, also called eIF-4F and comprising the initiation factors
eIF-4A, eIF-4E and eIF-4G, which makes the initial attachment to the CAP structure at the
beginning of the scanning phase of eucaryotic translation.
Chromosome walking: A technique that can be used to construct a clone contig by identifying
overlapping fragments of cloned DNA.
Clone contig approach: A genome sequencing strategy in which the molecules to be sequenced
are broken into manageable segments, each a few hundred kb or few Mb in length, which are
sequenced individually.
Consensus sequence: A nucleotide sequence that represents an "average" of a number of related but
nonidentical sequences.
Contour-clamped homogeneous electric field (CHEF): An electrophoresis method used to separate
large DNA molecules.
Convergent evolution: The situation that occurs when the same character state evolves
independently in two lineages.
Degenerate: Refers to the fact that the genetic code has more than one codon for most amino
acids.
Derived character state: A character state that evolved in a recent ancestor of a subset of
organisms in a group being studied.
Directed evolution: A set of experimental techniques that is used to obtain novel genes with
improved products.
Distance matrix: A table showing the evolutionary distances between all pairs of nucleotide
sequences in a dataset.
Domain shuffling: Rearrangements of segments of one or more genes, each segment coding for
a structural domain in the gene product, to create a new gene.
Exon theory of genes: An "introns early" hypothesis which states that introns were formed when
the first DNA genomes were being formed.
Expressed Sequence Tag (EST): A cDNA that is sequenced in order to gain rapid access to the
genes in a genome.
External node: The end of a branch in a phylogenetic tree, representing one of the organisms or
DNA sequences being studied.
Field Inversion gel electrophoresis (FIGE):
Molecular Clock: A device based on the inferred mutation rate that enables times to be assigned
to the branch points in a gene tree.
Molecular evolution: The gradual changes that occur in genomes over time due to the
accumulation of mutations and structural rearrangements resulting from recombination and
transposition.
Molecular phylogenetics: A set of techniques that enable the evolutionary relationships between
DNA sequences to be inferred by making comparisons between those sequences.
Multigene family: A group of genes, clustered or dispersed, with related nucleotide sequences.
Multiple hit or multiple substitution: The situation that occurs when a single nucleotide in a
DNA sequence undergoes two mutational changes, giving rise to two new alleles, both of which
differ from each other and from the parent at that nucleotide position.
Multiregional evolution: A hypothesis that states that modern humans in the Old world are
descended from Homo erectus populations that left Africa over 1 million years ago.
Natural selection: The preservation of favourable alleles and the rejection of injurious ones.
Nuclear genome: The DNA molecules present in the nucleus of eucaryotic cells.
Open Reading Frame (ORF): A series of codons starting with an initiation codon and ending
with a termination codon. The part of the protein-coding region that is translated into protein.
Orphan family: A group of homologous genes whose functions are unknown.
OFAGE:
Paralogous: Refers to two or more homologous genes located in the same genome.
Selfish DNA: DNA that appears to have no function and apparently contributes nothing to the
cell in which it is found.
Sequence tagged site (STS): A DNA sequence that is unique in the genome.