Maximum Parsimony Analysis: Genetic Code and The Number of Mutations, 1, 2 or 3

Maximum parsimony
1 of 6
http://www.icp.ucl.ac.be/~opperd/private/parsimony.html
Maximum Parsimony analysis

Parsimony implies that simpler hypotheses are preferable to more complicated ones.
Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total
number of evolutionary steps required to explain a given set of data, or in other words by minimizing the
total tree length.
The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for
restriction site data.
Maximum parsimony, when applied to protein sequence data, either considers each site of the sequence as
a multistate unordered character with 20 possible states (the amino acids) (Eck and Dayhoff, 1966), or
takes into account the genetic code and the number of mutations (1, 2 or 3) required to explain
an observed amino-acid substitution. The latter method is implemented in the PROTPARS program
(Felsenstein, 1993).
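To make the idea of "1, 2 or 3 mutations" concrete, the sketch below computes the smallest number of nucleotide differences between any codon of one amino acid and any codon of another, using the standard genetic code. It is only an illustration of the counting idea and is not the PROTPARS algorithm, which applies further rules of its own; the names `CODON_TABLE` and `min_mutations` are made up for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in the conventional TCAG codon order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def min_mutations(aa1: str, aa2: str) -> int:
    """Fewest nucleotide differences between any codon of aa1 and any codon of aa2."""
    codons1 = [c for c, aa in CODON_TABLE.items() if aa == aa1]
    codons2 = [c for c, aa in CODON_TABLE.items() if aa == aa2]
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons1
        for c2 in codons2
    )

print(min_mutations("F", "L"))  # Phe -> Leu: 1
print(min_mutations("M", "W"))  # Met -> Trp: 2
print(min_mutations("M", "Y"))  # Met -> Tyr: 3
```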
The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree.
However, the number of trees that have to be analysed increases rapidly with the number of
OTUs. The number of rooted trees (Nr) for n OTUs is given by:
$N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
The number of unrooted trees (Nu) for n OTUs is given by:
$N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
This is shown in the following table:
Number of OTUs   Number of unrooted trees   Number of rooted trees
      2                          1                        1
      3                          1                        3
      4                          3                       15
      5                         15                      105
      6                        105                      945
      7                        945                   10,395
      8                     10,395                  135,135
      9                    135,135                2,027,025
     10                  2,027,025               34,459,425
     15                    7.91E12                  2.13E14
This rapid increase in the number of trees to be analysed may make it impossible to apply the method
exhaustively to very large datasets: the search becomes very time consuming, even on very fast
computers.
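The two formulas can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function names are chosen for this example, and the unrooted formula is only defined here for n of 3 or more.

```python
from math import factorial

def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 6, 7, 8, 9, 10, 15):
    print(f"{n:>3} {unrooted_trees(n):>20,} {rooted_trees(n):>22,}")
```

Note that the number of unrooted trees for n OTUs equals the number of rooted trees for n - 1 OTUs, which is why the two columns of such a table are shifted copies of each other.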
An example of the maximum parsimony method for a dataset of 4 nucleic-acid sequences is given below.
Consider the following set of homologous sequences:
                  Site
Sequence   1 2 3 4 5 6 7 8 9
_____________________________
    1      A A G A G T G C A
    2      A G C C G T G C G
    3      A G A T A T C C A
    4      A G A G A T C C G
For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the
ancestral sequences and by counting the number of mutations required to explain the respective trees as
shown below:
Tree I (number of mutations: 11)

(1) AAGAGTGCA                   AGATATCCA (3)
             \4               2/
              \       4       /
         AGCCGTGCG ----- AGAGATCCG
              /               \
             /0               0\
(2) AGCCGTGCG                   AGAGATCCG (4)

Tree II (number of mutations: 14)

(1) AAGAGTGCA                   AGCCGTGCG (2)
             \1               3/
              \       5       /
         AGGAGTGCA ----- AGAGGTCCG
              /               \
             /4               1\
(3) AGATATCCA                   AGAGATCCG (4)

Tree III (number of mutations: 16)

(1) AAGAGTGCA                   AGCCGTGCG (2)
             \1               3/
              \       5       /
         AGGAGTGCA ----- AGATGTCCG
              /               \
             /5               2\
(4) AGAGATCCG                   AGATATCCA (3)
Tree I has the topology with the least number of mutations and thus is the most parsimonious tree.
NB: The above analysis is based on all the sites in the sequence alignment. However, a number of the
sites are non-informative and therefore do not have to be included in the analysis. When only
informative sites are included, far fewer sites need to be analysed, which in the case of
large datasets means a considerable gain in CPU time.
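The length of each of the three trees can also be computed mechanically. The sketch below uses Fitch's small-parsimony algorithm, which is not named in the text above and is brought in here as an assumption; the tree encoding (nested tuples of sequence numbers) and the function names are likewise chosen for this example. Fitch's algorithm returns the minimum over all possible ancestral sequences, so its totals can be lower than counts made on one particular set of drawn ancestral sequences, but the ranking of the three trees is the same.

```python
SEQS = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

# Each unrooted four-OTU tree is written as two pairs joined by the internal branch.
TREES = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}

def fitch_site(tree, states):
    """Return (candidate ancestral states, minimum number of changes) for one site."""
    if not isinstance(tree, tuple):          # a leaf: its observed state, no change
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                               # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, seqs):
    """Sum the per-site minimum number of changes over the whole alignment."""
    n_sites = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in seqs.items()})[1]
        for i in range(n_sites)
    )

lengths = {name: tree_length(tree, SEQS) for name, tree in TREES.items()}
print(lengths)  # {'I': 10, 'II': 11, 'III': 12}
```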
Informative sites
The definition of an informative site is as follows. A site is informative only when there are at least two
different kinds of nucleotides at the site, each of which is represented in at least two of the sequences
under study.
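This definition translates directly into a column-by-column test. The following sketch applies it to the four example sequences; the function name `informative_sites` is chosen for this illustration.

```python
from collections import Counter

SEQS = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

def informative_sites(seqs):
    """Return the 1-based positions of parsimony-informative sites."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # At least two different nucleotides must each occur in two or more sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(position)
    return sites

print(informative_sites(SEQS))  # [5, 7, 9]
```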
To illustrate the distinction between informative and non-informative sites, let's look at the same four
hypothetical sequences as above.

                  Site
Sequence   1 2 3 4 5 6 7 8 9
_____________________________
    1      A A G A G T G C A
    2      A G C C G T G C G
    3      A G A T A T C C A
    4      A G A G A T C C G
                   *   *   *
There are three possible unrooted trees for four OTUs (tree I, II and III, see figure below). Site 1 is not
informative because all sequences at this site have A, so that no change is required in any of the three
possible trees. At site 2, sequence 1 has A while all other sequences have G, and so a simple assumption is
that the nucleotide has changed from G to A in the lineage leading to sequence 1. Thus, this site is also not
informative, because each of the three possible trees requires 1 change. As shown in the figure, for site 3
each of the three possible trees requires 2 changes and so it is also not informative. Note that if we assume
that the nucleotide at the node connecting OTUs 1 and 2 in tree I is C (or A) instead of G, the number of
changes required for the tree remains 2. The figure shows that for site 4 each of the three trees requires 3
changes and thus site 4 is also non-informative. For site 5, tree I requires only 1 change, whereas trees II
and III require 2 changes each (Figure c). Therefore, this site is informative.
From these examples, we see that, as far as molecular data are concerned, a site is informative only when
there are at least two different kinds of nucleotides at the site, each of which is represented in at least two
of the sequences under study. In the above example, informative sites are indicated by an asterisk (*).
Below you see the four sequences and their corresponding three possible trees made with only the
informative sites:
Sequence   Sites 5, 7, 9
    1      GGA
    2      GGG
    3      ACA
    4      ACG
           ***

Tree I (number of mutations: 4)

(1) GGA               ACA (3)
       \1           1/
        \     2     /
        GGG ----- ACG
        /           \
       /0           0\
(2) GGG               ACG (4)

Tree II (number of mutations: 5)

(1) GGA               GGG (2)
       \1           1/
        \     1     /
        GCA ----- GCG
        /           \
       /1           1\
(3) ACA               ACG (4)

Tree III (number of mutations: 6)

(1) GGA               GGG (2)
       \2           1/
        \     0     /
        GCG ----- GCG
        /           \
       /1           2\
(4) ACG               ACA (3)
To infer a maximum parsimony tree, for each possible tree we calculate the minimum number of
substitutions at each informative site. In the above example, for sites 5, 7, and 9, tree I requires in total 4
changes, tree II requires 5 changes, and tree III requires 6 changes. In the final step, we sum the number
of changes over all the informative sites for each tree and choose the tree associated with the smallest
number of substitutions. In our case, tree I is chosen because it requires the smallest number of changes
(4) at the informative sites.
In the case of four OTUs, an informative site favours only one of the three possible alternative trees. For
example, site 5 favours tree I over trees II and III, and is said to support tree I. It is easy to see that the
tree supported by the largest number of informative sites is the maximum parsimony tree. For instance, in
the above example, tree I is supported by 2 sites, tree II by one site, and tree III by none.
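For four OTUs this "support" count is easy to automate, because an informative site always splits the sequences into two pairs. The sketch below is an illustration under that four-OTU assumption only; the labels I, II and III follow the numbering used in the text, and the function name is made up for the example.

```python
from collections import Counter

SEQS = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

def supported_tree(column):
    """Return the tree (I, II or III) favoured by one site, or None if uninformative."""
    s1, s2, s3, s4 = column
    if s1 == s2 and s3 == s4 and s1 != s3:
        return "I"      # groups (1,2) and (3,4)
    if s1 == s3 and s2 == s4 and s1 != s2:
        return "II"     # groups (1,3) and (2,4)
    if s1 == s4 and s2 == s3 and s1 != s2:
        return "III"    # groups (1,4) and (2,3)
    return None

support = Counter(
    tree for tree in (supported_tree(col) for col in zip(*SEQS)) if tree is not None
)
print(support)  # Counter({'I': 2, 'II': 1})
```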
Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree
may be found. To guarantee that the best possible tree is found, an exhaustive evaluation of all possible
tree topologies has to be carried out. However, this becomes impractical when there are more than about 12
OTUs in a dataset.
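An exhaustive search can be sketched for small datasets as follows. The enumeration scheme (fix one OTU as outgroup and attach each remaining OTU to every branch in turn) and the use of Fitch's algorithm for scoring are assumptions of this example rather than something described in the text; all names are hypothetical.

```python
SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

def fitch_length(tree, seqs):
    """Minimum number of changes on a tree given as nested tuples of OTU names."""
    def site_cost(node, i):
        if not isinstance(node, tuple):
            return {seqs[node][i]}, 0
        left_set, left_cost = site_cost(node[0], i)
        right_set, right_cost = site_cost(node[1], i)
        if left_set & right_set:
            return left_set & right_set, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    n_sites = len(next(iter(seqs.values())))
    return sum(site_cost(tree, i)[1] for i in range(n_sites))

def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)                      # attach on the branch above this subtree
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def all_rooted_trees(leaves):
    """Yield all rooted bifurcating trees on the given list of leaves."""
    if len(leaves) == 1:
        yield leaves[0]
        return
    for smaller in all_rooted_trees(leaves[:-1]):
        yield from insert_leaf(smaller, leaves[-1])

def exhaustive_search(seqs):
    """Score every unrooted topology, using the first OTU as an outgroup."""
    outgroup, *rest = list(seqs)
    candidates = ((outgroup, subtree) for subtree in all_rooted_trees(rest))
    return min(((t, fitch_length(t, seqs)) for t in candidates), key=lambda x: x[1])

best_tree, best_length = exhaustive_search(SEQS)
print(best_tree, best_length)  # (1, (2, (3, 4))) 10
```

The tree `(1, (2, (3, 4)))` is tree I written with OTU 1 as outgroup. Because the number of candidates grows as shown in the table of tree counts, this approach is only usable for a handful of OTUs.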
Branch and bound is a variation on maximum parsimony that guarantees to find the minimal tree without
having to evaluate all possible trees. In this way a larger number of taxa can be evaluated, but the method is
still limited.
A heuristic search is a method with step-wise addition and rearrangement (branch swapping) of OTUs.
Here it is not guaranteed that the best tree is found.
Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other search
for the best tree, it is advised to change the order of the taxa in the dataset and to repeat the analysis, or to
instruct the program to do this for you by providing a so-called jumble factor.
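The effect of input order can be illustrated with a toy step-wise addition search that is repeated with randomly shuffled ("jumbled") taxon orders. This is a simplified sketch and not the procedure of any particular program: it performs no branch swapping, scores trees with Fitch's algorithm (an assumption of this example), and all function names and the `replicates` and `seed` parameters are hypothetical.

```python
import random

SEQS = {1: "AAGAGTGCA", 2: "AGCCGTGCG", 3: "AGATATCCA", 4: "AGAGATCCG"}

def fitch_length(tree, seqs):
    """Minimum number of changes on a tree given as nested tuples of OTU names."""
    def site_cost(node, i):
        if not isinstance(node, tuple):
            return {seqs[node][i]}, 0
        left_set, left_cost = site_cost(node[0], i)
        right_set, right_cost = site_cost(node[1], i)
        if left_set & right_set:
            return left_set & right_set, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    n_sites = len(next(iter(seqs.values())))
    return sum(site_cost(tree, i)[1] for i in range(n_sites))

def insert_leaf(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insert_leaf(left, leaf):
            yield (new_left, right)
        for new_right in insert_leaf(right, leaf):
            yield (left, new_right)

def stepwise_addition(order, seqs):
    """Greedy search: add OTUs one at a time, each at its currently best position."""
    tree = (order[0], order[1])
    for leaf in order[2:]:
        tree = min(insert_leaf(tree, leaf), key=lambda t: fitch_length(t, seqs))
    return tree, fitch_length(tree, seqs)

def jumbled_search(seqs, replicates=10, seed=0):
    """Repeat step-wise addition with shuffled input orders and keep the shortest tree."""
    rng = random.Random(seed)
    names = list(seqs)
    best = None
    for _ in range(replicates):
        rng.shuffle(names)
        tree, length = stepwise_addition(list(names), seqs)
        if best is None or length < best[1]:
            best = (tree, length)
    return best

tree, length = jumbled_search(SEQS)
print(tree, length)
```

With only four OTUs every addition order reaches a tree of the minimal length; with larger datasets different orders can end in different local optima, which is the reason for repeating the search.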
Consensus tree
Since the Maximum Parsimony method may result in more than one equally parsimonious tree, a
consensus tree should be created. For the creation of a consensus tree see bootstrapping.
Parsimony and branch lengths

Let's assume that we have a set of 3 possible trees for 4 OTUs that relate to only one site and that all
describe the same final state by assuming a total of 3 steps. However, each final state is arrived at via a
different route. It is immediately obvious that each of the three trees is equally valid, but that the number
of steps along the individual branches (or the length of each branch) is not determined. For this reason
branch lengths are not given in parsimony, but only the total number of steps for a tree.
(1) G               A (3)
     \1           0/
      \     1     /
       C ------- A
      /           \
     /0           1\
(2) C               T (4)

(1) G               A (3)
     \0           1/
      \     1     /
       G ------- T
      /           \
     /1           0\
(2) C               T (4)

(1) G               A (3)
     \1           1/
      \     1     /
       C ------- T
      /           \
     /0           0\
(2) C               T (4)
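The ambiguity can be made explicit by listing every assignment of ancestral states for a single site. The sketch below assumes that OTUs 1 to 4 carry the states G, C, A and T and that OTUs 1 and 2, and OTUs 3 and 4, share an ancestor; both are assumptions made for this illustration, as are all names in the code.

```python
from itertools import product

LEAVES = {1: "G", 2: "C", 3: "A", 4: "T"}  # assumed states of the four OTUs at one site

def branch_lengths(x, y):
    """Steps per branch when x is the ancestor of OTUs 1 and 2 and y that of OTUs 3 and 4.

    Order of the returned tuple: branch to (1), to (2), internal branch, to (3), to (4).
    """
    return (
        int(LEAVES[1] != x),
        int(LEAVES[2] != x),
        int(x != y),
        int(LEAVES[3] != y),
        int(LEAVES[4] != y),
    )

reconstructions = {
    (x, y): branch_lengths(x, y) for x, y in product("ACGT", repeat=2)
}
minimum = min(sum(steps) for steps in reconstructions.values())
optimal = {anc: steps for anc, steps in reconstructions.items() if sum(steps) == minimum}

print(minimum)  # 3
for ancestors, steps in sorted(optimal.items()):
    print(ancestors, steps)
```

Several different pairs of ancestral states reach the same total of 3 steps while distributing those steps differently over the branches, which is why only the total tree length is well defined.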
Some final notes on Maximum Parsimony
Maximum Parsimony (positive points):

is based on shared and derived characters. It therefore is a cladistic rather than a phenetic method
does not reduce sequence information to a single number
tries to provide information on the ancestral sequences
evaluates different trees
Maximum Parsimony (negative points):
is slow in comparison with distance methods
does not use all the sequence information (only informative sites are used)
does not correct for multiple mutations (does not imply a model of evolution)
does not provide information on the branch lengths
is notorious for its sensitivity to codon bias
Last updated: 9 September 1997.
Created by: Fred Opperdoes
11/20/2013 6:18 AM
