Measures of genetic
diversity
Copyright: IPGRI and Cornell University, 2003
Measures 1
Contents
• Basic genetic diversity analysis
• Types of variables
• Quantifying genetic diversity:
  Measuring intrapopulation genetic diversity
  Measuring interpopulation genetic diversity
• Displaying relationships:
  Classification or clustering
  Ordination
• Appendices
Measures 2
[Figure: marker data matrix (markers by individuals), scored 1 = present and 0 = absent]
2. Assessment of relationships among individuals, populations, regions, etc.
[Table: pairwise matrix among individuals 01 to 06]
      01    02    03    04    05
02   0.56
03   0.33  0.33
04   0.47  0.26  0.50
05   0.32  0.43  0.37  0.28
06   0.33  0.56  0.56  0.37  0.46
3. Expression of relationships between results obtained from different sets of characters
[Figure: dendrogram of individuals Ind1 to Ind6]
Measures 3
Most of the genetic diversity analysis that we might want to do will involve the
following steps:
1.
2.
Calculating the relationships between the units analysed in step one. This
entails calculating the distances (geometric or genetic) among all pairs of
subjects in the study.
3.
Types of variable
• Qualitative. These refer to characters or qualities, and are either binary or categorical:
  Binary, taking only two values: present (1) or absent (0)
  Categorical, taking a value among many possibilities, and are either ordinal or nominal:
    Ordinal: categories that have an order
    Nominal: categories that are unrelated
Measures 4
Measures 5
$P_j = q \le 0.95$
or
$P_j = q \le 0.99$
Measures 6
Where,
Pj = rate of polymorphism
q = allele frequency
This measure provides criteria to demonstrate that a gene is showing
variation.
It is determined by checking directly whether the locus meets this definition.
It can be used with codominant markers and, with strong restrictions, with dominant
markers, because an estimate based on dominant markers would be biased
downward.
A polymorphic gene is usually one for which the most common allele has a
frequency of less than 0.95. Rare alleles are defined as those with frequencies of
less than 0.005. The limit of allele frequency, which is set at 0.95 (or 0.99) is
arbitrary, its objective being to help identify those genes in which allelic variation is
common.
Reference
Cavalli-Sforza, L.L. and W.F. Bodmer. 1981. Genética de las Poblaciones
Humanas. Ed. Omega, Barcelona.
$P = n_{pj}/n_{total}$
Measures 7
Where,
P = proportion of polymorphic loci
npj = number of polymorphic loci
ntotal = total number of loci
It expresses the percentage of variable loci in a population.
Its calculation is based on directly counting polymorphic and total loci.
It can be used with codominant markers and, very restrictively, with dominant
markers (see previous slide for explanation).
Measures 8
For a given gene in a sample, this measure tells how many allelic variants can
be found.
Although the distribution of alleles does not matter, the maximum number of
alleles does.
Average number of alleles per locus:
$(1/K)\sum_{i=1}^{K} n_i$
Measures 9
Where,
K = the number of loci
ni = the number of alleles detected per locus
This measure provides complementary information to that of polymorphism.
It requires only counting the number of alleles per locus and then calculating
the average.
It is best applied with codominant markers, because dominant markers do not
permit the detection of all alleles.
Measures 10
$A_e = 1/(1 - h) = 1/\sum p_i^2$
Where,
$p_i$ = frequency of the ith allele in a locus
$h = 1 - \sum p_i^2$ = heterozygosity in a locus
The measure tells about the number of alleles that would be expected
in a locus in each population.
It is calculated by inverting the measure of homozygosity in a locus.
It can be used with codominant marker data.
Its calculation may be affected by sample size.
This measure of diversity may be informative for establishing collecting
strategies. For example, we estimate it in a given sample. We then verify it
in a different sample or the entire collection. If the figure obtained the
second time is less than the first estimated number, this could mean that
our collecting strategy needs revising.
                                  Population 1                 Population 2
                                  Locus A  Locus B  Locus C    Locus A  Locus B  Locus C
Individual 1                      A1 A1    B1 B1    C1 C1      A1 A1    B1 B3    C1 C1
Individual 2                      A1 A2    B1 B2    C2 C2      A1 A1    B2 B3    C1 C1
Individual 3                      A1 A1    B1 B1    C1 C3      A2 A2    B1 B4    C1 C1
Individual 4                      A1 A3    B1 B3    C2 C3      A2 A2    B1 B1    C1 C1
Individual 5                      A3 A3    B3 B3    C3 C3      A1 A2    B4 B4    C1 C1
Frequency of allele 1             0.60     0.60     0.30       0.50     0.40     1.00
Frequency of allele 2             0.10     0.10     0.30       0.50     0.10     0.00
Frequency of allele 3             0.30     0.30     0.40       -        0.20     0.00
Frequency of allele 4             -        -        -          -        0.30     -
Heterozygosity (h)                0.54     0.54     0.66       0.50     0.70     0.00
Effective number of alleles (Ae)  2.17     2.17     2.94       2.00     3.33     1.00
Measures 11
The table on the slide shows an example for calculating the effective number of
alleles. The two populations each have 5 individuals. For each individual, 3 loci are
analysed, each with a different number of alleles, depending also on the population
(locus A has 3 alleles in population 1 and only 2 alleles in population 2, and so on).
First, allele frequencies are calculated for each locus and each population; then,
heterozygosity in each locus; and finally, the Ae, according to the formula shown in
the previous slide.
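The three steps described above (allele frequencies, then heterozygosity, then Ae) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and Ae is taken as 1/Σpi², i.e. the inverse of homozygosity, which is consistent with the values in the slide table. The genotypes are those of locus A in population 1 and locus B in population 2.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for a list of diploid genotypes (allele pairs)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def heterozygosity(freqs):
    """h = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())

def effective_alleles(freqs):
    """Ae = 1 / sum(p_i^2), the inverse of homozygosity."""
    return 1.0 / sum(p * p for p in freqs.values())

# Genotypes transcribed from the slide table.
pop1_locus_a = [("A1", "A1"), ("A1", "A2"), ("A1", "A1"), ("A1", "A3"), ("A3", "A3")]
pop2_locus_b = [("B1", "B3"), ("B2", "B3"), ("B1", "B4"), ("B1", "B1"), ("B4", "B4")]

for name, genotypes in [("Pop. 1, locus A", pop1_locus_a),
                        ("Pop. 2, locus B", pop2_locus_b)]:
    freqs = allele_frequencies(genotypes)
    print(name, round(heterozygosity(freqs), 2), round(effective_alleles(freqs), 2))
```

With these inputs the sketch gives h = 0.54 and Ae = 2.17 for locus A in population 1, and h = 0.70 and Ae = 3.33 for locus B in population 2, matching the table.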
$h_j = 1 - p^2 - q^2$
$h_j = 1 - \sum p_i^2$
$H = \sum_{j=1}^{L} h_j / L$
Measures 12
Where,
hj = heterozygosity per locus
p and q = allele frequencies
H = average heterozygosity for several loci
L = total number of loci
The average expected heterozygosity is calculated by subtracting from 1 the
expected frequencies of homozygotes in a locus. The operation is repeated
for all loci and the results are then averaged.
It can be applied to all markers, both codominants and dominants.
The estimated value may be affected by those alleles present at higher
frequencies.
It ranges from 0 to 1.
It is maximized when there are many alleles at equal frequencies.
A minimum of 30 loci in 20 individuals per population should be analysed to
reduce the risk of statistical bias.
[Figure: gel showing 30 individuals scored at loci A to E]
Data scoring (one row per locus, A to E; each pair records the two bands of one individual, 1 = present, 0 = absent):
1,1 0,1 1,1 0,1 0,1 0,1 0,1 0,1 0,1 1,0 0,1 0,1 0,1 0,1 0,1 1,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1
0,1 0,1 0,1 0,1 0,1 1,1 0,1 0,1 0,1 0,1 0,1 0,1 0,1 1,0 1,0 1,0 0,1 0,1 0,1 0,1 0,1 0,1 1,0
1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0
1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0
0,1 1,1 0,1 1,1 0,1 1,0 1,1 1,0 1,1 1,1 1,1 1,0 1,0 1,0 1,0 1,0 1,0 1,0 1,0 0,1 0,1 1,0
Measures 13
Data analysis

Locus  Genotype  Gen. freq. (exp.)  Individuals (no.)  Gen. freq. (obs.)  Allele freq.  hj = (1 - p^2 - q^2)
A      A1 A1     p^2                 2                 P11 = 0.07         p = 0.13      0.23
       A1 A2     2pq                 4                 P12 = 0.13         q = 0.87
       A2 A2     q^2                24                 P22 = 0.80
       Total                        30
B      B1 B1     p^2                 7                 P11 = 0.23         p = 0.28      0.41
       B1 B2     2pq                 3                 P12 = 0.10         q = 0.72
       B2 B2     q^2                20                 P22 = 0.67
       Total                        30
E      E1 E1     p^2                15                 P11 = 0.50         p = 0.63      0.46
       E1 E2     2pq                 8                 P12 = 0.27         q = 0.37
       E2 E2     q^2                 7                 P22 = 0.23
       Total                        30

Hi = 0.22
Measures 14
1. First, we note that loci A, B and E are polymorphic because the frequency of
their most common allele is below 0.99. Loci C and D are
monomorphic. (exp. = expected value; obs. = observed value.)
2. The proportion of polymorphic loci is P = (3/5) = 0.6 or 60%. That is, the number
of polymorphic loci is divided by the total number of loci analysed.
3. To calculate average observed heterozygosity (Ho), we:
a. Count how many loci, out of the total, are heterozygous in each individual.
For instance, Individual 1 has one heterozygous locus (A), Individual 2 also
has one (E), Individual 27 has 2 heterozygous loci (A and E), ... . In all,
16 individuals were homozygous at all five loci (i.e. they had only one band
in each locus), 13 individuals had 1 heterozygous locus and 1 individual had 2
heterozygous loci.
b. Calculate the average observed heterozygosity as:
Ho = [16(0/5) + 13(1/5) + 1(2/5)]/(30) = 0.1
4. The intralocus gene diversity (hj) is calculated for each locus according to the
formula in the top row of the table, giving us locus A = 0.23, locus B = 0.41 and
locus E = 0.46.
5. The average expected gene diversity (Hi) is calculated from the formula in slide
number 12:
Hi = (0.23 + 0.41 + 0.46)/5 = 0.22
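The calculations in steps 1 to 5 can be sketched as follows. This is an illustrative sketch under stated assumptions: the genotype counts for loci A, B and E are the observed frequencies on the slide multiplied by 30 individuals, loci C and D are taken as fixed for their first allele, and all function and variable names are hypothetical.

```python
# Genotype counts per locus: (homozygote 1, heterozygote, homozygote 2).
# Assumption: counts are the observed frequencies on the slide times 30;
# loci C and D are taken as fixed for their first allele.
counts = {
    "A": (2, 4, 24),
    "B": (7, 3, 20),
    "C": (30, 0, 0),
    "D": (30, 0, 0),
    "E": (15, 8, 7),
}
n_individuals = 30

def allele_freq(n11, n12, n22):
    """Frequency p of allele 1 from codominant genotype counts."""
    return (2 * n11 + n12) / (2 * (n11 + n12 + n22))

def gene_diversity(p):
    """hj = 1 - p^2 - q^2 for a two-allele locus."""
    q = 1.0 - p
    return 1.0 - p * p - q * q

p = {locus: allele_freq(*c) for locus, c in counts.items()}
h = {locus: gene_diversity(p[locus]) for locus in counts}

# A locus is polymorphic if its most common allele is below 0.99.
polymorphic = [locus for locus in counts if max(p[locus], 1 - p[locus]) < 0.99]
P = len(polymorphic) / len(counts)

# Average expected gene diversity over all five loci.
H_i = sum(h.values()) / len(h)

# Observed heterozygosity: heterozygous genotypes over (individuals x loci).
H_o = sum(c[1] for c in counts.values()) / (n_individuals * len(counts))

print(polymorphic, P, round(H_o, 2), round(H_i, 2))
```

Under these assumptions the sketch returns P = 0.6, Ho = 0.1, hj of 0.23, 0.41 and 0.46 for loci A, B and E, and Hi = 0.22, in line with the notes.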
[Figure: gel showing 30 individuals at loci A to E analysed with a dominant marker, with the corresponding data scoring]
Measures 15
Data analysis

Locus  Genotypes  Gen. freq. (exp.)  Individuals (no.)  Gen. freq. (obs.)  Allele freq.  hj = (1 - p^2 - q^2)
A      AA, Aa     p^2 + 2pq           6                 P1 = 0.20          p = 0.11      0.19
       aa         q^2                24                 P2 = 0.80          q = 0.89
       Total                         30
B      BB, Bb     p^2 + 2pq          10                 P1 = 0.33          p = 0.18      0.30
       bb         q^2                20                 P2 = 0.67          q = 0.82
       Total                         30
E      EE, Ee     p^2 + 2pq          23                 P1 = 0.77          p = 0.52      0.50
       ee         q^2                 7                 P2 = 0.23          q = 0.48
       Total                         30

Hi = 0.198
Measures 16
1. First, we look at the polymorphism shown by all loci. Loci A, B and E have a
most common allele with a frequency below 0.99 and as such can be said to
be polymorphic. Loci C and D are monomorphic. (exp. = expected value; obs. =
observed value.)
2. The proportion of polymorphic loci (P) is P = (3/5) = 0.6 or 60%. The average
observed heterozygosity (Ho) cannot be estimated because dominant markers do
not allow discrimination between heterozygous and homozygous dominant individuals.
3. Despite the above (2), the intralocus gene diversity (hj) may be calculated for
each locus using the formula that appears in the top row of the table,
as follows: locus A = 0.19; locus B = 0.30; and locus E = 0.50.
4. The average gene diversity (Hi) is calculated from the formula in slide number
12:
Hi = (0.19 + 0.30 + 0.50)/5 = 0.198
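For dominant markers, the allele frequencies in the table are consistent with estimating q as the square root of the band-absent (recessive homozygote) frequency, which assumes Hardy-Weinberg proportions. The sketch below illustrates this; the function name is hypothetical, and the monomorphic loci C and D are assumed to show the band in every individual.

```python
import math

# Band-absent (recessive homozygote) counts out of 30 individuals per locus.
# Assumption: the monomorphic loci C and D show the band in every individual.
n_total = 30
n_absent = {"A": 24, "B": 20, "C": 0, "D": 0, "E": 7}

def dominant_gene_diversity(absent, total):
    """Estimate p, q and hj from a dominant marker, assuming Hardy-Weinberg:
    the band-absent class has expected frequency q^2."""
    q = math.sqrt(absent / total)
    p = 1.0 - q
    return p, q, 1.0 - p * p - q * q

results = {locus: dominant_gene_diversity(n, n_total) for locus, n in n_absent.items()}
H_i = sum(h for _, _, h in results.values()) / len(results)

for locus, (p, q, h) in results.items():
    print(locus, round(p, 2), round(q, 2), round(h, 2))
print("Hi =", round(H_i, 3))
```

With these inputs the sketch gives hj of 0.19, 0.30 and 0.50 for loci A, B and E, and Hi = 0.198, as in the notes.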
Measures 17
$g_{ST} = 1 - (h_S/h_T)$
$h_S$ = population diversity
$h_T$ = total diversity
Measures 18
Where,
$h_S = [\tilde{n}/(\tilde{n} - 1)]\,[1 - (1/s)\sum_j \sum_i x_{ij}^2 - (h_o/2\tilde{n})]$
$h_T = 1 - \sum_i [(1/s)\sum_j x_{ij}]^2 + (h_S/\tilde{n}s) - (h_o/2\tilde{n}s)$
$\tilde{n}$ = harmonic average of population sizes
s = number of populations
$h_o$ = average observed heterozygosity
$x_{ij}$ = estimated frequency of the ith allele in the jth population
The formula in the slide provides a measure of differentiation in terms of
alleles per locus in two or more populations.
It ranges from 0 to 1. A negative value may be obtained if there was a sampling
error or an inappropriate marker system was used.
Because of the complexity of its components, its calculation requires
specialized computer software.
It can be used with codominant markers and, with restrictions, with dominant
markers, because it is a measure of heterozygosity. To obtain a fair
estimate of the real value, several generations are needed.
          A1 A1   A1 A2   A2 A2   p      q      p^2 + q^2
Pop. 1    20      30      50      0.35   0.65   0.545
Pop. 2    10      20      70      0.20   0.80   0.680
Pop. 3    60      10      30      0.65   0.35   0.545

s = 3
Harmonic average of population sizes = 33.33
Measures 19
In this example, we have the number of individuals for each genotype for one locus
(A) in three different populations. Using this number, we want to know the degree of
differentiation in the three populations. In the table, the calculations are followed for
all the necessary elements in the formula shown on the previous slide.
The result (gST = 0.4797) shows significant differentiation between populations in
allele frequencies. We can therefore say that a high percentage of genetic diversity
is distributed among populations.
[Figure: diagram of populations (Pop1, Pop3) showing the intrapopulation component HS and the interpopulation component DST]
Measures 20
Where,
HT = total genic diversity = HS + DST
HS = intrapopulation genic diversity
DST = interpopulation diversity
(HT/HT) = (HS/HT) + (DST/HT) = 1
GST measures the proportion of gene diversity that is distributed among
populations.
A larger number of loci must be sampled.
Equations are complex and should be calculated with specific computer
software.
For example, assuming that:
HT = 0.263
HS = 0.202
DST = 0.263 - 0.202 = 0.061
Then, GST = (DST/HT) x 100 = (0.061/0.263) x 100 = 23.19%. This means that, in
this species, 23% of the gene diversity is distributed among populations.
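The arithmetic of this example can be checked with a short sketch. The function name is hypothetical; it simply applies DST = HT - HS and GST = DST/HT to the values assumed above.

```python
def gene_diversity_partition(h_t, h_s):
    """Return (DST, GST) from total (HT) and intrapopulation (HS) gene diversity."""
    d_st = h_t - h_s      # interpopulation diversity
    g_st = d_st / h_t     # proportion of diversity among populations
    return d_st, g_st

d_st, g_st = gene_diversity_partition(h_t=0.263, h_s=0.202)
print(round(d_st, 3), round(g_st * 100, 2))
```

For HT = 0.263 and HS = 0.202 this gives DST = 0.061 and GST = 23.19%.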
Measures 21
Where,
CT(K) = contribution of K to total diversity
CS(K) = contribution of K to intrapopulation diversity
CST(K) = contribution of K to interpopulation diversity
HT = total gene diversity
HS = intrapopulation genic diversity
DST = interpopulation diversity
HT/K = total gene diversity after removing population K
HS/K = intrapopulation gene diversity after removing population K
DST/K = interpopulation gene diversity after removing population K
The measure quantifies the change in total gene diversity when a
population is introduced to or removed from a site (e.g. when introducing a
new variety into a farmer's field in an in situ conservation programme).
It also serves to measure the impact of losing a population from a given site in
terms of gene diversity.
It can be used only with codominant markers.
F statistics (Wright)
The equation for the genetic structure of populations
is:
$(1 - F_{IT}) = (1 - F_{IS})(1 - F_{ST})$
$F_{IT} = 1 - (H_I/H_T)$
$F_{IS} = 1 - (H_I/H_S)$
$F_{ST} = 1 - (H_S/H_T)$
Measures 22
Where,
HT = total gene diversity or expected heterozygosity in the total population
as estimated from the pooled allele frequencies
HI = intrapopulation gene diversity or average observed heterozygosity in
a group of populations
HS = average expected heterozygosity estimated from each
subpopulation
The F statistics allow analysis of structures of subdivided populations. It may also
be used to measure the genetic distance among subpopulations, a concept that is
based on the idea that those subpopulations that are not intermating will have
different allele frequencies to those of the total population.
Genetic distance also provides a way of measuring the probability that two
identical alleles come together (inbreeding). The statistical indices involved measure:
FIS = the deficiency or excess of average heterozygotes in each
population
FST = the degree of gene differentiation among populations in terms of
allele frequencies
FIT = the deficiency or excess of average heterozygotes in a group of
populations
Measures 23
Calculating F statistics

        Genotype frequency
Pop.    A1 A1   A1 A2   A2 A2   pi     qi     2piqi    1 - (A1 A2)/(2piqi)
1       0.40    0.30    0.30    0.55   0.45   0.4950   0.3939
2       0.60    0.20    0.20    0.70   0.30   0.4200   0.5238

po = 0.625, qo = 0.375
HT = 2(0.625)(0.375) = 0.4688
HI = (0.30 + 0.20)/2 = 0.25
HS = (0.4950 + 0.4200)/2 = 0.4575
Measures 24
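Calculating F statistics

        Genotype frequency
Pop.    A1 A1   A1 A2   A2 A2   pi     qi     2piqi    1 - (A1 A2)/(2piqi)
1       0.25    0.50    0.25    0.50   0.50   0.500    0.0000
2       0.80    0.10    0.10    0.85   0.15   0.255    0.6078

po = 0.675, qo = 0.325
HT = 2(0.675)(0.325) = 0.4388
HI = (0.50 + 0.10)/2 = 0.30
HS = (0.500 + 0.255)/2 = 0.3775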
Measures 25
This is another example for which the procedures used in the previous slide were
followed. Differentiation in allele frequencies between the two populations seems
greater (FST = 0.1397), with only a moderate effect of nonrandom mating within the
populations (FIS = 0.2053).
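The procedure used in both examples can be sketched as below. The function name is hypothetical, and the sketch assumes that populations are weighted equally, that HI is the mean observed heterozygote frequency, that HS is the mean of 2piqi, and that HT is computed from the averaged allele frequencies; these choices reproduce the FST and FIS values quoted above.

```python
def f_statistics(genotype_freqs):
    """genotype_freqs: list of (A1A1, A1A2, A2A2) frequencies, one tuple per
    population. Populations are weighted equally (an assumption)."""
    n = len(genotype_freqs)
    p = [f11 + f12 / 2 for f11, f12, _ in genotype_freqs]   # allele A1 per population
    h_i = sum(f12 for _, f12, _ in genotype_freqs) / n      # mean observed heterozygosity
    h_s = sum(2 * pi * (1 - pi) for pi in p) / n            # mean expected heterozygosity
    p_o = sum(p) / n                                        # pooled allele frequency
    h_t = 2 * p_o * (1 - p_o)                               # total expected heterozygosity
    return {
        "HI": h_i, "HS": h_s, "HT": h_t,
        "FIS": 1 - h_i / h_s,
        "FST": 1 - h_s / h_t,
        "FIT": 1 - h_i / h_t,
    }

example = f_statistics([(0.25, 0.50, 0.25), (0.80, 0.10, 0.10)])
print({name: round(value, 4) for name, value in example.items()})
```

For the second example this gives HI = 0.30, HS = 0.3775, FIS of about 0.2053 and FST of about 0.1396 (0.1397 when HT is first rounded to 0.4388), and the identity (1 - FIT) = (1 - FIS)(1 - FST) holds.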
Measures 26
The different hierarchical levels of gene diversity studied through AMOVA may
include:
1. Continents, which may contain lesser hierarchical levels
2. Geographical regions within a continent
3. Areas within a region in a continent
4. Populations within an area of a region in a continent
5. Individuals within a population in an area of a region in a continent
The mathematical description of the model for situations 3 and 4 can be found in
Appendices 2 and 3, respectively.
The next two slides illustrate how to analyse situation 4.
An example of AMOVA
[Table: presence (A1 = 1) or absence (A1 = 0) of alleles A1 and A2 in individuals 1 to 15 of Pop. 1, Pop. 2 and Pop. 3; the sums are given below]

               Pop. 1   Pop. 2   Pop. 3   Total
X...k          15       21       18       54
(X...k)^2      225      441      324      990
(Xi...k)^2     27       33       28       88
(Xijk)^2       15       21       18       54
(X...)^2                                  2916

SSa = 0.6    MSa = 0.3
SSb = 11     MSb = 0.26190476
SSw = 10     MSw = 0.22222222
Measures 27
In this table, we show data obtained with 15 individuals from each of three
populations in an analysis with a codominant marker. By means of an analysis of
variance, these data will allow us to calculate the F statistics.
The first step is to convert the bands detected in the gels to binary variables with a
value of either 0 or 1. Then, the sums of presences (1) are calculated so we may
proceed with the sum of squares. Calculations are first done for one population and
continued for the others until we have (X...k). We have i = 15 individuals (effect b), j
= 2 alleles (effect w), k = 3 populations (effect a).
Where,
X...k is the sum of all the band presences (1) in the individuals of each
population
(X...k)^2 is the square of the number obtained above
(Xi...k)^2 is the sum of the squared sums of allele presences in each
individual (e.g. Indiv. 1 in Pop. 1 will be (0 + 0)^2 + Indiv. 2
in Pop. 1 (1 + 1)^2 + Indiv. ...)
(Xijk)^2 is the sum of each value squared
SS is the sum of squares for effects a, b and w
An example of calculating SS:
SSa = [sum of (X...k)^2]/ij - (X...)^2/ijk = [990/(15 x 2)] - [2916/(15 x 2 x 3)] = 0.6
MS are the mean squares for effects a, b and w
An example of calculating MS: MSa = SSa/dfa = 0.6/2 = 0.3, where dfa refers to
the degrees of freedom for effect a (populations).
SV              df    SS     MS           EMS
Populations     2     0.6    0.3          $\sigma_w^2 + 2\sigma_b^2 + 2 \times 15\,\sigma_a^2$
Indiv./pop.     42    11     0.26190476   $\sigma_w^2 + 2\sigma_b^2$
Within indiv.   45    10     0.22222222   $\sigma_w^2$

$\sigma_a^2$ = 0.0012698
$\sigma_b^2$ = 0.0198413
$\sigma_w^2$ = 0.2222222
$\sigma^2$ = 0.24333

FIT = 0.086758
FIS = 0.0819672
FST = 0.0052185
(1 - FIT) = 0.91324
(1 - FIS)(1 - FST) = 0.91324
Measures 28
Where,
SV = sources of variation
df = degrees of freedom
SS = sum of squares (see previous slide)
MS = mean squares (see previous slide)
$\sigma^2$ = total estimated variance
EMS = expected mean squares
$\sigma_w^2 = MS_w = 0.2222222$
$\sigma_b^2 = (MS_b - MS_w)/2 = (0.26190476 - 0.22222222)/2 = 0.0198413$
$\sigma_a^2 = (MS_a - MS_b)/(2 \times 15) = (0.3 - 0.26190476)/(2 \times 15) = 0.0012698$
$\sigma^2 = \sigma_w^2 + \sigma_b^2 + \sigma_a^2 = 0.24333$ (total estimated variance)
Calculating the F statistics has already been explained in slide 22. For this particular
example, they would be as follows:
$F_{IT} = (\sigma_a^2 + \sigma_b^2)/\sigma^2 = (0.0012698 + 0.0198413)/0.24333 = 0.086758$
$F_{ST} = \sigma_a^2/\sigma^2 = 0.0012698/0.24333 = 0.0052185$
$F_{IS} = \sigma_b^2/(\sigma_b^2 + \sigma_w^2) = 0.0198413/(0.0198413 + 0.222222) = 0.0819672$
The allele frequency differentiation among the three populations is very low
(FST = 0.0052185) and is probably a result of many random matings. More loci need
to be analysed to make a conclusion.
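The whole calculation, from the sums in the AMOVA table to the F statistics, can be sketched as follows. This is an illustrative sketch: the variable names are hypothetical, and the expressions for SSb and SSw are the usual nested analysis of variance forms, chosen here because they reproduce the slide values (SSb = 11, SSw = 10).

```python
# Design sizes from the slide notes.
i, j, k = 15, 2, 3            # individuals per population, alleles per individual, populations

# Summary sums from the slide table.
pop_totals = [15, 21, 18]     # band presences per population
ind_sq_sums = [27, 33, 28]    # per population: sum of squared individual totals
cell_sq_sum = 54              # sum of every squared 0/1 score

sum_pop_sq = sum(x * x for x in pop_totals)          # 990
correction = sum(pop_totals) ** 2 / (i * j * k)      # 2916 / 90

# Sums of squares for populations (a), individuals within populations (b)
# and alleles within individuals (w).
ss_a = sum_pop_sq / (i * j) - correction
ss_b = sum(ind_sq_sums) / j - sum_pop_sq / (i * j)
ss_w = cell_sq_sum - sum(ind_sq_sums) / j

df_a, df_b, df_w = k - 1, k * (i - 1), k * i * (j - 1)
ms_a, ms_b, ms_w = ss_a / df_a, ss_b / df_b, ss_w / df_w

# Variance components from the expected mean squares.
var_w = ms_w
var_b = (ms_b - ms_w) / j
var_a = (ms_a - ms_b) / (i * j)
var_total = var_w + var_b + var_a

f_it = (var_a + var_b) / var_total
f_st = var_a / var_total
f_is = var_b / (var_b + var_w)

print(round(f_it, 6), round(f_is, 7), round(f_st, 7))
```

The sketch reproduces the values in the notes (FIT = 0.086758, FIS = 0.0819672, FST = 0.0052185).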
Measures 29
For these calculations, the assumption is made that each nucleotide is a locus.
Measures 30
Where,
n = number of sequences under analysis in the individuals of the
population
$X_i$ = estimated frequency of the ith sequence in the population
$X_j$ = estimated frequency of the jth sequence in the population
$\pi_{ij}$ = proportion of different nucleotides between sequences i and j
The measure informs about the degree of nucleotide diversity among several
sequences in a given region of the genome. It is equivalent to the measure of
allelic diversity within a locus.
It ranges from 0 to 1 ($0 \le \pi_X \le 1$).
The factors limiting the use of this analytical tool are:
Partial genomic sequences must be available
The equation can only be applied to haploid data
This parameter informs about nucleotide sequences, and the model assumes
haplotypes (haploid genotypes). Even if the study is based on diploid individuals,
sequencing of each copy of the genome is needed.
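Sequence   Freq. Xi
Seq1       5/10 = 0.5
Seq2       2/10 = 0.2
Seq3       1/10 = 0.1
Seq4       2/10 = 0.2
Total      10

$\pi_{1,2} = 2/30$, $\pi_{1,3} = 4/30$, $\pi_{1,4} = 3/30$, $\pi_{2,3} = 4/30$, $\pi_{2,4} = 3/30$, $\pi_{3,4} = 5/30$
$\pi = [10/(10 - 1)] \sum_{i<j} X_i X_j \pi_{ij}$
$= (10/9)[0.5 \times 0.2 \times (2/30) + 0.5 \times 0.1 \times (4/30) + \dots + 0.1 \times 0.2 \times (5/30)]$
$= 0.037$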
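The worked example can be reproduced with a short sketch. The variable names are hypothetical, and, following the arithmetic on the slide, each pair of sequences is counted once (i < j).

```python
from itertools import combinations

n = 10  # number of sequences sampled

# Sequence frequencies and pairwise proportions of different nucleotides,
# transcribed from the slide.
freq = {"Seq1": 0.5, "Seq2": 0.2, "Seq3": 0.1, "Seq4": 0.2}
pi_ij = {
    ("Seq1", "Seq2"): 2 / 30, ("Seq1", "Seq3"): 4 / 30, ("Seq1", "Seq4"): 3 / 30,
    ("Seq2", "Seq3"): 4 / 30, ("Seq2", "Seq4"): 3 / 30, ("Seq3", "Seq4"): 5 / 30,
}

# Sum Xi * Xj * pi_ij over each pair once, then apply the n/(n - 1) correction.
pair_sum = sum(freq[a] * freq[b] * pi_ij[(a, b)] for a, b in combinations(freq, 2))
nucleotide_diversity = n / (n - 1) * pair_sum

print(round(nucleotide_diversity, 3))
```

The result, rounded to three decimals, is 0.037, as on the slide.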
Measures 31
Measures 32
Where,
$V_{XY}$ = divergence among populations X and Y
$\pi_X$ = nucleotide diversity in population X
$d_{XY}$ = the probability that two random nucleotides in populations X and Y
are different
s = number of populations
The measure informs about the level of differentiation among nucleotide
sequences in populations.
It requires sequence data in a sample of individuals for each population.
It needs specific computer software that includes sequence alignment
features.
Some of these are CLUSTAL W, MALIGN and PAUP*.
Measures 33
Let's say that we have another population Y, in which the nucleotide diversity for the
same sequence analysed in slide 31 is $\pi_Y$ = 0.09.
We also know that the probability that two nucleotides taken at random from
X and Y are different is 0.14 ($d_{XY}$).
In this slide, we find the divergence between populations X and Y ($V_{XY}$), the total
differentiation ($V_b$), the average diversity in each population ($V_w$) and the relative
differentiation ($N_{ST}$).
[Figure: EcoRI digestion of the same DNA region in two individuals; gel lanes I1 and I2 show Fragment 1 and Fragment 2]
DNA, Indiv. 1:
GACTGAATTCCACGGCACTGACGAATTCGAAGTGAATTCTTACTTAAGCTAGCCTGAATTCGATAC
CTGACTTAAGGTGCCGTGACTGCTTAAGCTTCACTTAAGAATGAATTCGATCGGACTTAAGCTATG
DNA, Indiv. 2 (no recognition site for EcoRI at the first position):
GACTGATTTCCACGGCACTGACGAATTCGAAGTGAATTCTTACTTAAGCTAGCCTGAATTCGATAC
CTGACTAAAGGTGCCGTGACTGCTTAAGCTTCACTTAAGAATGAATTCGATCGGACTTAAGCTATG
Measures 34
Measures 35
Where,
r = number of recognition nucleotides of a restriction enzyme
ln G = natural logarithm of the probability that there was no substitution in
the restriction site. Its calculation is:
$G = F(3 - 2G_1)^{1/4}$
$F = [\sum X_i(X_i n - 1)]/[\sum X_i(n - 1)]$
F = proportion of shared fragments
$G_1 = F^{1/4}$
n = number of haploid genotypes in the population
$X_i$ = estimated frequency of the ith fragment in the population
The measure estimates the diversity in restriction sites in a sample, because
it relies on the nucleotide sequence of the recognition sites of a given
restriction enzyme.
It informs about the nucleotide substitution in restriction sites. It varies from 0
to 1 ($0 \le \pi_X \le 1$).
The equations above can be used with haploid samples, mtDNA, cpDNA or
haplotypes.
Reference
Karp, A., P.G. Isaac and D.S. Ingram. 1998. Molecular Tools for Screening
Biodiversity: Plants and Animals. Chapman & Hall, London.
Measures 36
Where,
$V_{XY}$ = divergence or differentiation among populations X and Y
$\pi_X$ = restriction diversity in population X
$d_{XY}$ = fragment diversity among two populations = $-(2/r)\ln(G_{XY})$
$G_{XY} = F_{XY}(3 - 2G_1)^{1/4}$
$G_1 = F_{XY}^{1/4}$
$F_{XY}$ = proportion of shared alleles among populations X and Y
= $(2\sum X_{iX}X_{iY})/(\sum(X_{iX} + X_{iY}))$
$X_{iX}$ = estimated frequency of the ith fragment in population X
It estimates diversity in the restriction sites of a sample of two or more
populations. It informs about the nucleotide substitution in the restriction sites.
Computer software such as BIOSYS and GENEPOP are useful. Data
obtained are considered as belonging to haploid organisms.
If used with RAPDs, the value of r is replaced by the primer length (r = 10).
In addition, some assumptions are taken:
The appropriate primers are used
Polymorphism due to insertion or deletion is rare
Similar size fragments in different populations belong to the same locus
Fragments must be identified without error
Software such as RAPDISTANCE and RAPDIS is typically used.
[Figure: fragment patterns A1, A2 and A3 scored in 20 sequences from each of two populations]

First population
Fragment   Freq. Xi
A1         6/20 = 0.30
A2         5/20 = 0.25
A3         9/20 = 0.45

Second population
Fragment   Freq. Xi
A1         5/20 = 0.25
A2         13/20 = 0.65
A3         2/20 = 0.10
Measures 37
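$F_{XY}$: numerator $2[0.30 \times 0.25 + 0.25 \times 0.65 + 0.45 \times 0.10]$; $F_{XY} = 0.14125$
$0.14125^{1/4} = 0.613052$
$G_{XY} = 0.14125\,(3 - 2 \times 0.613052)^{1/4} = 0.163012$
$d_{XY} = -(2/6)\ln(0.163012) = 0.604643$
$V_W = (1/2)(0.539176 + 0.216633) = 0.377905$
$V_{XY} = 0.604643 - 0.377905 = 0.226739$
$V_b = (1/2)(0.226739) = 0.11337$
$N_{ST} = 0.11337/(0.11337 + 0.377905) = 0.230766$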
Measures 38
[Figure: three panels (Linear, Quadratic and Circular) plotting Distance against Similarity, each on a scale from 0 to 1]
Distance models
Calculation of distance, or dissimilarity, follows one
of two possible models:
Equilibrium model
Disequilibrium model
[Figure: distance d at time t and at time t+1 under the two models (d, d1, d2)]
Measures 39
For our purposes, we will use the disequilibrium model. Two alternatives exist:
Geometric distance
• Does not take into account evolutionary processes
• Based only on allele frequencies
• Complex relationship exists between distance and divergence time
Genetic distance
• Takes into account evolutionary processes
• Distance increases from the time of separation from an ancestral population
• A genetic model of evolution is needed
Geometric distance is used for diversity studies in which comparisons are made
according to morphological or marker data gathered from the operative taxonomic
units (OTUs). OTUs may be individuals, accessions or populations. It can be used
with dominant markers (RAPDs, AFLPs) or codominant markers. Because
evolutionary aspects are not considered, the dendrograms obtained cannot be
interpreted as phylogenetic trees giving information about evolution or divergence
among groups.
In contrast, the genetic distance of any given OTU can be incorporated into phylogeny
studies. The model considers allelic frequencies in OTUs and its mathematical
foundation is different. It can be used with both codominant and dominant markers,
although, with the latter, information is lost because only two alleles can be scored.
Genetic distance with dominant markers, however, requires the examination of two
generations of the same population to measure the segregation of loci (Lynch and
Milligan, 1994).
Reference
Lynch, M. and B.G. Milligan. 1994. Analysis of population genetic structure with RAPD
markers. Mol. Ecol. 3:91-99.
Binary variables
Quantitative variables
Mixed types of variables
P number of variables
Measures 40
Measures 41
When using molecular marker data and transforming them to binary data, the
following should be considered:
A species' ploidy level may mask the presence of allelic series in a locus.
If this happens, genetic diversity will be underestimated when using dominant
markers (presence/absence).
If a marker is codominant, large samples are needed to permit detection of all
possible genotypes, particularly if there are several alleles per locus.
Segregation distortions are common in polyploid species.
Most specialized computer software is designed to analyse diploid species.
Therefore, if it is used with polyploid species, estimates of the
various genetic diversity indices may be biased.
The reproductive system of certain species has not been studied, so their
inheritance type is not sufficiently known.
The largest coverage (coding and non-coding regions) possible of the
genome of the species under study should be sampled and analysed so that
estimates of genetic diversity are reliable.
[Figure: gels of locus A in a diploid (2X) and a tetraploid (4X) species analysed with a dominant marker, each converted to a binary matrix]
Measures 42
This example of 18 individuals from each of a diploid and a tetraploid species was
analysed with a dominant marker. We are assuming that the banding patterns
obtained are alike. Bands are converted to a binary table in both cases. The
calculations of the frequencies are given in the table below. We can see that, in the
tetraploid, genotype 1, for example, can be either AAAA, AAAa, AAaa or Aaaa;
however the band will only be scored as present (1) the same as it will in the diploid
(AA or Aa).
Locus    Genotypes                 Gen. freq. (e.)                   Indiv. number   Gen. freq. (o.)   Allele freq.
A (2X)   AA, Aa                    p^2 + 2pq                         14              P1 = 0.78         p = 0.53
         aa                        q^2                                4              P2 = 0.22         q = 0.47
         Total                                                       18
A (4X)   AAAA, AAAa, AAaa, Aaaa    p^4 + 4p^3q + 6p^2q^2 + 4pq^3     14              P1 = 0.78         p = 0.31
         aaaa                      q^4                                4              P2 = 0.22         q = 0.69
         Total                                                       18
Allele frequencies should be different in both cases; however, the information loss in
the tetraploid individual is significant. Why? This is because, to estimate the
frequency of the recessive allele a, heterozygotes AAAa, AAaa, Aaaa are not taken
into account. This effect is larger when the ploidy number of the species under
study is unknown.
(e. = expected value; o. = observed value.)
[Figure: gels of 18 individuals at locus A in a diploid (2X) and a tetraploid (4X) species analysed with a codominant marker. Genotypes shown for the diploid: A1 A1, A1 A2, A1 A3, A2 A2, A2 A3, A3 A3. Genotypes shown for the tetraploid: A1 A1 A1 A1, A1 A1 A2 A3, A1 A2 A2 A3, A1 A2 A3 A3, A3 A3 A3 A3]
Diploid binary matrix (presence of A1, A2, A3 in individuals 1 to 18):
(1,0,0) (1,0,1) (0,0,1) (1,0,1) (0,1,1) (1,0,0) (1,0,1) (0,0,1) (0,0,1) (0,1,0) (1,1,0) (0,0,1) (0,0,1) (0,0,1) (0,0,1) (0,1,1) (1,0,1) (0,0,1)
Measures 43
In this example, we have 18 individuals from each of a diploid and tetraploid species
and analysed with a codominant marker. One locus is detected (A) with three alleles
in both situations (A1, A2 and A3).
Calculating the allele frequencies in the diploid individual is not difficult (binary
matrix, bottom of slide). For the tetraploid individual, however, conversion to binary
data is hampered by the fact that individuals with alleles A1 A1 A2 A3 cannot be
distinguished from those with other combinations such as A1 A2 A2 A3 or A1 A2 A3
A3. This situation can only be solved by inference based on estimating the DNA
fragment copy number in the gel.
Genotype           A1 A1        A1 A2        A1 A3        A2 A2        A2 A3        A3 A3        Total
Gen. freq. (e.)    p^2          2pq          2pr          q^2          2qr          r^2
Indivs. (no.)      2            1            4            1            2            8            18
Gen. freq. (o.)    P11 = 0.11   P12 = 0.06   P13 = 0.22   P22 = 0.06   P23 = 0.11   P33 = 0.44

Allele freq.: p = 0.25, q = 0.15, r = 0.60
(e. = expected value; o. = observed value.)
Index  Author               Expression                          Value
S1                          a/n                                 0.333
S2     Simpson              a/min[(a + b), (a + c)]             0.750
S3     Braun-Blanquet       a/max[(a + b), (a + c)]             0.500
S4     Nei-Li               a/[a + (b + c)/2]                   0.600
S5     Ochiai (1957)        a/[(a + b)(a + c)]^(1/2)            0.612
S6     Kulczynski 2         (a/2)([1/(a + b)] + [1/(a + c)])    0.625
S7     Jaccard              a/(a + b + c)                       0.429
S8                          a/[a + 2(b + c)]                    0.273
S9     Kulczynski 1 (1928)  a/(b + c)                           0.750
S10    Simple Matching      (a + d)/n                           0.556
S11                         (a + d)/[a + d + 2(b + c)]          0.385
S12                         (a + d)/[a + d + (b + c)/2]         0.714
S13                         (a + d)/(b + c)                     1.250
Measures 44
                    Indiv. j
                    1        0
Indiv. i      1     a        b        a + b
              0     c        d        c + d
                    a + c    b + d    n
Where,
n = a + b + c + d
In the table above we see that:
Indices S1 to S9 give value only to the presence of information
Indices S10 to S13 give value to both presence and absence
Next, we will discuss three indices from the table: Simple Matching (S10),
Jaccard (S7) and Nei-Li (S4).
Measures 45
These three indices differ in their approach for estimating the number of
coincidences and differences.
The Simple Matching Coefficient considers that absence corresponds to
homozygous loci. It can be used with dominant marker data (RAPD and AFLP),
because absences could correspond to homozygous recessives. An example of
application of the Simple Matching Coefficient for categorical variables is found in
Appendix 6.
The Jaccard Coefficient only counts bands present for either individual (i or j).
Double absences are treated as missing data. If false-positive or false-negative data
occur, the index estimate tends to be biased. It can be applied with codominant
marker data.
The Nei-Li Coefficient counts the percentage of shared bands among two
individuals and gives more weight to those bands that are present in both. It
considers that absence has less biological significance, and so this coefficient has
complete meaning in terms of DNA similarity. It can be applied with codominant
marker data (RFLP, SSR).
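The three coefficients can be sketched from two binary band profiles as follows. The function names and the two example profiles are hypothetical; the profiles were chosen so that a = 3, b = 1, c = 3 and d = 2, which are counts consistent with the values listed in the table of indices.

```python
def contingency(x, y):
    """Count a (1,1), b (1,0), c (0,1) and d (0,0) for two binary profiles."""
    pairs = list(zip(x, y))
    a = sum(1 for i, j in pairs if i == 1 and j == 1)
    b = sum(1 for i, j in pairs if i == 1 and j == 0)
    c = sum(1 for i, j in pairs if i == 0 and j == 1)
    d = sum(1 for i, j in pairs if i == 0 and j == 0)
    return a, b, c, d

def simple_matching(x, y):
    a, b, c, d = contingency(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    a, b, c, _ = contingency(x, y)      # double absences (d) are ignored
    return a / (a + b + c)

def nei_li(x, y):
    a, b, c, _ = contingency(x, y)      # shared bands count twice
    return 2 * a / (2 * a + b + c)      # same as a/[a + (b + c)/2]

# Hypothetical band profiles of two individuals over nine bands.
indiv_i = [1, 1, 1, 1, 0, 0, 0, 0, 0]
indiv_j = [1, 1, 1, 0, 1, 1, 1, 0, 0]

print(round(simple_matching(indiv_i, indiv_j), 3),
      round(jaccard(indiv_i, indiv_j), 3),
      round(nei_li(indiv_i, indiv_j), 3))
```

For these profiles the sketch gives 0.556 (Simple Matching), 0.429 (Jaccard) and 0.6 (Nei-Li). Only Simple Matching changes if more double absences are added, which illustrates the difference between the two groups of indices.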
Measures 46
$D_{XY} = -\ln(I_{XY})$
$I_{XY} = J_{XY}/\sqrt{J_X J_Y}$
Measures 47
Alleles                         Allele frequencies
                                Pop. 1    Pop. 2    Pop. 3
A1                              0.80      0.74      0.65
A2                              0.20      0.26      0.35
Locus heterozygosity (hijk)     0.3200    0.3848    0.4550
B1                              0.86      0.81      1.00
B2                              0.01      0.10      0.00
B3                              0.13      0.09      0.00
Locus heterozygosity (hijk)     0.2434    0.3258    0.0000
D1                              0.00      1.00      0.30
D2                              1.00      0.00      0.70
Locus heterozygosity (hijk)     0.0000    0.00      0.4200
Average heterozygosity (Hi)     0.0433    0.0547    0.0673
Average homozygosity (Ji)       0.9567    0.9453    0.9327

Jii'                            J1,2 = 0.8733    J1,3 = 0.9346    J2,3 = 0.8986
Genetic identity (Iii')         I1,2 = 0.9183    I1,3 = 0.9894    I2,3 = 0.9570
Genetic distance (Dii')         D1,2 = 0.0852    D1,3 = 0.0107    D2,3 = 0.0440
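The identities and distances in the table can be reproduced with the sketch below. One assumption is needed and is not stated on the slide: the averages are consistent with 13 loci in total, i.e. the three loci shown plus 10 loci fixed for the same allele in all three populations. The function names are hypothetical.

```python
import math

# Assumption (inferred from the slide averages, not stated there):
# 13 loci in total, of which 10 are fixed for the same allele everywhere.
N_LOCI = 13
N_FIXED = N_LOCI - 3

# Allele frequencies at loci A, B and D, transcribed from the slide.
freqs = {
    1: [[0.80, 0.20], [0.86, 0.01, 0.13], [0.00, 1.00]],
    2: [[0.74, 0.26], [0.81, 0.10, 0.09], [1.00, 0.00]],
    3: [[0.65, 0.35], [1.00, 0.00, 0.00], [0.30, 0.70]],
}

def mean_identity(x, y):
    """Average over loci of sum(x_i * y_i); each fixed shared locus contributes 1."""
    shown = sum(sum(a * b for a, b in zip(lx, ly)) for lx, ly in zip(x, y))
    return (shown + N_FIXED) / N_LOCI

def nei_identity_distance(pop_x, pop_y):
    j_x = mean_identity(freqs[pop_x], freqs[pop_x])
    j_y = mean_identity(freqs[pop_y], freqs[pop_y])
    j_xy = mean_identity(freqs[pop_x], freqs[pop_y])
    identity = j_xy / math.sqrt(j_x * j_y)
    return identity, -math.log(identity)

for pair in [(1, 2), (1, 3), (2, 3)]:
    identity, distance = nei_identity_distance(*pair)
    print(pair, round(identity, 4), round(distance, 4))
```

Under this assumption the sketch gives identities of about 0.9183, 0.9894 and 0.9570, and distances of about 0.0852, 0.0106 and 0.0440; the small difference for D1,3 (0.0107 on the slide) comes from rounding the identity before taking the logarithm.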
Measures 48
$S_{wj} = \frac{2}{2n(2n - 1)} \sum_{i}\sum_{i' < i} (a_{ij} - a_{i'j})^2$
$S_w = (1/d_s) \sum_j S_{wj}$
Measures 49
Where,
$a_{ij}$ = size of the allele of the ith copy (i = 1, 2, ..., 2n) in the jth population
(j = 1, 2, ..., $d_s$)
n = number of individuals in the sample
Two considerations:
The calculation of distance between two alleles is a transformation of the
number of repeats.
One difficulty in using SSRs to estimate genetic distances is their high rate
of mutation.
$S_B = \frac{2}{(2n)^2 d_s(d_s - 1)} \sum_{j}\sum_{j' < j}\sum_{i}\sum_{i'} (a_{ij} - a_{i'j'})^2$
Measures 50
The global distance is the weighted average of the intra- and
interpopulation components:
$\frac{2n - 1}{2nd_s - 1} S_w + \frac{2n(d_s - 1)}{2nd_s - 1} S_B$
These coefficients represent the probability of choosing two different copies of one
locus in the same population and between two populations.
Useful computer software: MICROSAT, BIOSYS, GENEPOP, GDA and POPGENE.
Nonhierarchical
Overlapping
Measures 51
Reference
García, J.A., M.C. Duque, J. Tohme, S. Xu and M. Levy. 1995. SAS for
Classification Analysis; Agrobiotecnology Course, October 1995. Working
document. Centro Internacional de Agricultura Tropical (CIAT), Cali, Colombia.
Phenetic classification
• Shows the relationships among samples by using a similarity index
• A grouping method or distance is selected so that a tree diagram (dendrogram) or a phenogram (if the similarity matrix contains phenotypic data) can be drawn
Measures 52
In this example of hierarchical grouping, all characters are given the same weight in
the grouping process.
Total similarity between two groups is the sum of similarity for each character.
It does not consider genealogy.
Phenetic refers to any character used in the classification procedure, whether
morphological, physiological, ecological, molecular or cytological.
Clustering methods
• Clustering steps:
  - Proximity is defined
  - Each grouping is estimated according to distance
  - The branches of the dendrogram are built in each cycle
Measures 53
Single linkage
• Also called nearest neighbour
• The distance between two groups is taken as the distance between their two most similar members; in each cycle, the groups with the smallest such distance are joined
• It works with regular and compact groups, but is highly influenced by distant individuals. This is a drawback when there are different groups that are not well distributed in space
[Figure: inter-group distance d(1,2) between two groups]
Measures 54
(1) Distance matrix
       A      B      C
B    0.30
C    0.43   0.35
D    0.28   0.60   0.40

(2)
       AD     B
B    0.30
C    0.40   0.35

(3)
       ADB
C    0.35

(4) [Figure: dendrogram joining A and D first, then B, then C]
Measures 55
1. The distance matrix is formed first; then, in a first cycle, the shortest distance is
selected, dAD = 0.28.
2. A new matrix is formed by grouping individuals A and D and calculating the
combined distances:
dB(AD) = min (dBA; dBD) = min (0.30; 0.60) = 0.30
dC(AD) = min (dCA; dCD) = min (0.43; 0.40) = 0.40
3. A new matrix is formed by grouping individual B with group (AD) and calculating
the combined distances
dC(ADB) = min (dAC; dCD; dCB) = min (0.43; 0.40; 0.35) = 0.35
4. The dendrogram is drawn.
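The cycles above can be sketched in a few lines of code. This is an illustrative implementation rather than the routine of any particular package: groups are held as tuples of individuals, the distance between two groups is the minimum over their members, and the names `single_link` and `cluster` are ours.

```python
from itertools import combinations

# Pairwise distances from the worked example (individuals A-D).
d = {
    frozenset("AB"): 0.30, frozenset("AC"): 0.43, frozenset("BC"): 0.35,
    frozenset("AD"): 0.28, frozenset("BD"): 0.60, frozenset("CD"): 0.40,
}


def single_link(g1, g2):
    """Nearest-neighbour distance: smallest distance between a member of g1 and one of g2."""
    return min(d[frozenset((x, y))] for x in g1 for y in g2)


def cluster(items):
    """Join the two closest groups in each cycle; return [(group, joining distance), ...]."""
    groups = [(x,) for x in items]
    merges = []
    while len(groups) > 1:
        g1, g2 = min(combinations(groups, 2), key=lambda pair: single_link(*pair))
        merges.append(("".join(sorted(g1 + g2)), single_link(g1, g2)))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 + g2]
    return merges


merges = cluster("ABCD")
print(merges)  # [('AD', 0.28), ('ABD', 0.3), ('ABCD', 0.35)]
```

The joining distances are the heights at which the branches of the dendrogram meet.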
Complete linkage
• Also called farthest neighbour
• The distance between two groups is taken as the distance between their two least similar members; in each cycle, the groups with the smallest such distance are joined
• It works well with regular and compact groups but, again, it is influenced by distant individuals
[Figure: inter-group distance d(1,2) between two groups]
Measures 56
(1) Distance matrix
       A      B      C
B    0.30
C    0.43   0.35
D    0.28   0.60   0.40

[Figure: reduced distance matrices for cycles (2) and (3), and the resulting dendrogram (4)]
Measures 57
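1. The distance matrix is formed first; then, in a first cycle, the shortest distance is
selected, dAD = 0.28. As in single linkage, each cycle joins the two closest groups;
only the way of calculating the distance to a newly formed group changes.
2. A new matrix is formed by grouping individuals A and D and calculating the
combined distances as the maximum of the individual distances:
dB(AD) = max (dBA; dBD) = max (0.30; 0.60) = 0.60
dC(AD) = max (dCA; dCD) = max (0.43; 0.40) = 0.43
3. The shortest distance in the new matrix is dBC = 0.35, so B and C are grouped.
The new matrix is formed with groups AD and BC, and the combined distance
calculated:
d(AD)(BC) = max (dAB; dAC; dDB; dDC) = max (0.30; 0.43; 0.60; 0.40) = 0.60
4. The dendrogram is drawn.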
Average linkage
• Also called the unweighted pair-group method using the arithmetic average (UPGMA)
• The distance between two groups is taken as the average of the pairwise distances between all individuals of one group and all individuals of the other; in each cycle, the groups with the smallest such distance are joined
• Most used method
[Figure: d(1i, 2j) = average distance between OTUi and OTUj of groups 1 and 2]
Measures 58
(1) Distance matrix
       A      B      C
B    0.30
C    0.43   0.35
D    0.28   0.60   0.40

(2)
       AD      B
B    0.45
C    0.415   0.35

[Figure: reduced matrix for cycle (3) with groups AD and BC, and the resulting dendrogram (4) on a distance axis from 0.0 to 0.5]
Measures 59
1. The distance matrix is formed first; then, in a first cycle, the shortest distance
is selected, dAD = 0.28
2. Next, a matrix is formed by grouping individual A with D and calculating the
combined distances:
dB(AD) = (dBA + dBD)/2 = (0.30 + 0.60)/2 = 0.45
dC(AD) = (dCA + dCD)/2 = (0.43 + 0.40)/2 = 0.415
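3. A new matrix is formed by grouping the individuals with the shortest distance, B
with C, and calculating the combined distance as the average over all pairs formed by one member of (AD) and one member of (BC):
d(AD)(BC) = (dAB + dAC + dDB + dDC)/4 = (0.30 + 0.43 + 0.60 + 0.40)/4 = 0.4325, or 0.43 when rounded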
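Farthest-neighbour and average linkage differ from each other only in how the distance to a newly formed group is calculated, so both can be sketched with one routine that rebuilds the matrix in each cycle. The code below is an illustration under the usual formulation of agglomerative clustering, in which every cycle joins the two closest groups in the current matrix. The names `farthest`, `average` and `agglomerate` are ours.

```python
# Pairwise distances from the worked examples (individuals A-D).
DIST = {("A", "B"): 0.30, ("A", "C"): 0.43, ("B", "C"): 0.35,
        ("A", "D"): 0.28, ("B", "D"): 0.60, ("C", "D"): 0.40}


def farthest(d_xk, d_yk, n_x, n_y):
    """Complete linkage: keep the larger of the two old distances."""
    return max(d_xk, d_yk)


def average(d_xk, d_yk, n_x, n_y):
    """UPGMA: mean of the two old distances, weighted by group size."""
    return (n_x * d_xk + n_y * d_yk) / (n_x + n_y)


def agglomerate(dist, update):
    """Join the two closest groups in each cycle; return [(new group, height), ...]."""
    m = dict(dist)
    size = {label: 1 for pair in m for label in pair}
    history = []
    while m:
        (x, y), height = min(m.items(), key=lambda item: item[1])
        rest = [k for k in size if k not in (x, y)]
        look = lambda a, b: m[(a, b)] if (a, b) in m else m[(b, a)]
        new_rows = {(x + y, k): update(look(x, k), look(y, k), size[x], size[y])
                    for k in rest}
        # Drop every entry involving x or y, then add the merged group's distances.
        m = {pair: v for pair, v in m.items() if x not in pair and y not in pair}
        m.update(new_rows)
        size[x + y] = size.pop(x) + size.pop(y)
        history.append((x + y, round(height, 4)))
    return history


print("complete:", agglomerate(DIST, farthest))
print("UPGMA:   ", agglomerate(DIST, average))
```

For the four individuals A to D, both methods join A with D at 0.28 and B with C at 0.35; the two groups then meet at 0.60 under farthest neighbour and at 0.4325 under UPGMA.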
Measures 60
• External validation
• Internal validation
• Relative validation
• Bootstrapping
Measures 61
External validation:
The distance matrix is compared with other information not used in the
grouping calculations (e.g. genealogy).
Internal validation:
This technique quantifies the distortion due to the grouping method used. It
builds a new similarity or distance matrix, the co-phenetic matrix, directly
from the dendrogram. Validation is calculated by means of a correlation
coefficient between similarity or distance data from the original matrix and
those from the new co-phenetic matrix. Whether the original distances are
maintained is assessed after the grouping exercise (Sokal and Rohlf, 1994).
Relative validation:
Similarity between methods is compared.
Bootstrapping:
This is a method of re-sampling with replacement from the same data matrix. It
allows calculation of standard deviations and variances, and is useful for
those situations in which the number of samples or resources (e.g. time,
budget) is limited.
Examples of applying the co-phenetic correlation and bootstrapping methods are
shown next.
Reference
Sokal, R. and J. Rohlf. 1994. Biometry: The Principles and Practice of Statistics
in Biological Research (3rd edn.). Freeman & Co, NY.
Original distance matrix
       A      B      C
B    0.30
C    0.43   0.35
D    0.28   0.60   0.40

[Figure: dendrogram, with groups joining at 0.28, 0.35 and 0.43]

Co-phenetic matrix
       A      B      C
B    0.43
C    0.43   0.35
D    0.28   0.43   0.43

Co-phenetic correlation = 0.5557
Measures 62
To construct the co-phenetic matrix, we look at the dendrogram previously built with
the original matrix (this example comes from slide 58). We see that the distance
between D and C in the dendrogram is 0.43, so we fill that cell in the co-phenetic
matrix. Distance between B and C is 0.35, and so on.
Calculations for the co-phenetic correlation are based on the correlation coefficient:
\[ r = \frac{\sum X_i Y_i - \sum X_i \sum Y_i / n}{(n - 1)\, S_X S_Y} \]
Where,
X_i and Y_i are the similarity or distance values of the original and co-phenetic matrix, respectively
S_X and S_Y are the sample standard deviations of each variable
n is the number of pairs of individuals compared
If the correlation value is high, we can conclude that the dendrogram does indeed
reflect the distances in the original matrix and that the grouping method has
therefore introduced little distortion. In the above example, we obtained a value
of 0.5557. This is an intermediate value that could indicate that the dendrogram
distances do not reflect the distance data in the original matrix well, and so
that distortion exists because of the method used. However, the example was built
with very few data, which were not the real results of an experiment; this
explains the value obtained.
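The value quoted for this example can be reproduced with a short script. The sketch below reads the co-phenetic distance of each pair as the height of the lowest group in the dendrogram that contains both individuals, then correlates those values with the original distances. The joining heights (0.28, 0.35, 0.43) are the ones used in the co-phenetic matrix of the example; the names `joins`, `cophenetic` and `pearson` are ours.

```python
import math

# Original distances between individuals A-D.
original = {("A", "B"): 0.30, ("A", "C"): 0.43, ("B", "C"): 0.35,
            ("A", "D"): 0.28, ("B", "D"): 0.60, ("C", "D"): 0.40}
# Groups of the dendrogram in the order they form, with the height of each join.
joins = [({"A", "D"}, 0.28), ({"B", "C"}, 0.35), ({"A", "B", "C", "D"}, 0.43)]


def cophenetic(x, y):
    """Height of the lowest group in the dendrogram that contains both x and y."""
    return min(h for members, h in joins if x in members and y in members)


def pearson(xs, ys):
    """Product-moment correlation between two equally long lists of values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


pairs = list(original)
xs = [original[p] for p in pairs]
ys = [cophenetic(*p) for p in pairs]
r = pearson(xs, ys)
print(f"co-phenetic correlation = {r:.4f}")
```

Only the pairs A-B and B-D are strongly displaced by the dendrogram (0.30 and 0.60 both become 0.43), which is enough to pull the correlation down when there are just six pairs.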
[Figure: (1) gel with lanes P1 to P4; (2) data matrix of individuals P1 to P4 scored at loci L1 to L5]

(3) Similarity matrix
       P1      P2      P3
P2   0.400
P3   0.600   0.400
P4   0.400   0.200   0.400
Measures 63
Average similarity ± standard deviation over the re-sampled data sets
       P1              P2              P3
P2   0.267 ± 0.115
P3   0.600 ± 0.000   0.400 ± 0.200
P4   0.533 ± 0.115   0.200 ± 0.000   0.400 ± 0.200

[Figure: dendrograms drawn on similarity axes running up to 1.00]
Measures 64
Loci are drawn one at a time, with replacement, until a sample of the same size as
the number of loci is formed; the values of every individual at the drawn loci make
up the new data matrix. A locus may therefore be selected more than once, or not
at all. For the example:
M1: L1 L1 L2 L3 L5 (locus L4 was not drawn)
M2: L1 L2 L3 L4 L3
M3: L3 L1 L5 L2 L4
In each sample a similarity matrix is calculated.
Average similarities and their standard deviations are estimated for each individual
pair (1 & 2, 1 & 3, 2 & 3, and so on), and the average similarity matrix is created.
A new dendrogram is built, using the average similarity matrix.
For real situations, more than 100 replacement samples should be created.
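The procedure can be sketched as follows. The band data are hypothetical (the data matrix of the example is not reproduced here), and the simple matching coefficient is used only as an assumed similarity index; any other index could be substituted. The function names and the default of 200 replicates are ours.

```python
import random
from itertools import combinations
from statistics import mean, stdev

# HYPOTHETICAL band data (1 = present, 0 = absent) for 4 individuals at 5 loci.
data = {
    "P1": [1, 0, 1, 1, 0],
    "P2": [1, 1, 0, 0, 0],
    "P3": [1, 0, 1, 0, 1],
    "P4": [0, 1, 0, 1, 1],
}


def simple_matching(a, b, loci):
    """Proportion of the sampled loci at which two individuals have the same score."""
    return sum(a[k] == b[k] for k in loci) / len(loci)


def bootstrap(data, replicates=200, seed=1):
    """Return {(x, y): (mean similarity, standard deviation)} over the replicates."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    n_loci = len(next(iter(data.values())))
    pairs = list(combinations(data, 2))
    samples = {p: [] for p in pairs}
    for _ in range(replicates):
        # Draw n_loci locus indices with replacement; a locus may repeat or be absent.
        loci = [rng.randrange(n_loci) for _ in range(n_loci)]
        for x, y in pairs:
            samples[(x, y)].append(simple_matching(data[x], data[y], loci))
    return {p: (mean(v), stdev(v)) for p, v in samples.items()}


summary = bootstrap(data)
for (x, y), (m, s) in summary.items():
    print(f"{x}-{y}: mean similarity {m:.3f}, sd {s:.3f}")
```

The matrix of mean similarities is then clustered in the same way as the original similarity matrix, and the standard deviations indicate how stable each pairwise value is.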
Appendices
Measures 67
In summary
• The analysis of genetic diversity and structure of populations involves:
  - The quantification of diversity and the relationships within and between populations and/or individuals
  - The display of relationships
Measures 68
Measures 69
References
Measures 70
Next
Measures 71