Distance Measures
Figure 6.1. Graphical representation of the data set in Table 6.1. The left-hand graph ("Sample space", with an axis for Sample Unit A) shows species as points in sample space. The right-hand graph ("Species space", with an axis for Species 1) shows sample units as points in species space.
Chapter 6
Table 6.2. Reasonable and acceptable domains of input data, x, and ranges of distance measures, d = f(x).

Name (synonyms)                  Domain of x   Range of d = f(x)    Comments
-------------------------------  ------------  -------------------  ------------------------------------------
Sorensen                         x ≥ 0         0 ≤ d ≤ 1            proportion coefficient in city-block
(Bray & Curtis; Czekanowski)                   (or 0 ≤ d ≤ 100%)    space; semimetric

Relative Sorensen                x ≥ 0         0 ≤ d ≤ 1            proportion coefficient in city-block
(Kulczyński; Quantitative                      (or 0 ≤ d ≤ 100%)    space; same as Sorensen but data points
Symmetric)                                                          relativized by sample unit totals;
                                                                    semimetric

Jaccard                          x ≥ 0         0 ≤ d ≤ 1            proportion coefficient in city-block
                                               (or 0 ≤ d ≤ 100%)    space; metric
Distance measures can be categorized as metric, semimetric, or nonmetric. A metric distance measure must satisfy the following rules:

1. The minimum value is zero when two items are identical.
2. When two items differ, the distance is positive (negative distances are not allowed).
3. Symmetry: the distance from object A to object B is the same as the distance from B to A.
4. Triangle inequality axiom: with three objects, the distance between two of these objects cannot be larger than the sum of the two other distances.

[...] Kulczyński distances. Semimetrics are extremely useful in community ecology but obey a non-Euclidean geometry. Nonmetrics violate one or more of the other rules and are seldom used in ecology.

Distance measures

The equations use the following conventions: Our data matrix A has q rows, which are sample units, and p columns, which are species. Each element of the matrix, $a_{ij}$, is the abundance of species j in sample unit i. Most of the following distance measures can also be used on binary data (1 or 0 for presence or absence). In each of the following equations, we are calculating the distance between sample units i and h.
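The four rules for a metric listed earlier in this section can be checked numerically. The sketch below is illustrative only: the three sample units and the helper names are hypothetical, and the Sorensen (Bray-Curtis) distance is written in its standard absolute-difference form. It shows one triple for which Sorensen distance is symmetric and zero for identical items, yet violates the triangle inequality, which is the behavior that makes it a semimetric.

```python
def sorensen(x, y):
    """Sorensen (Bray-Curtis) distance: sum |x_j - y_j| / sum (x_j + y_j)."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den


def violates_triangle(d, a, b, c, tol=1e-12):
    """True if d(a, b) exceeds d(a, c) + d(c, b) for this triple."""
    return d(a, b) > d(a, c) + d(c, b) + tol


# Hypothetical sample units, each with abundances of two species.
A = [1, 0]
B = [0, 1]
C = [1, 1]

d_ab = sorensen(A, B)  # A and B share no species
d_ac = sorensen(A, C)
d_cb = sorensen(C, B)

print(d_ab, d_ac, d_cb)
print("triangle inequality violated:", violates_triangle(sorensen, A, B, C))
```

Here the direct distance from A to B is larger than the detour through C, so rule 4 fails even though rules 1 to 3 hold.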
Euclidean distance
$$ED_{i,h} = \sqrt{\sum_{j=1}^{p} (a_{ij} - a_{hj})^2}$$

This formula is simply the Pythagorean theorem applied to p dimensions rather than the usual two dimensions (Fig. 6.2).

[Figure 6.2. Euclidean distance; axis label: Species 1.]
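As a minimal sketch of the Euclidean formula applied to a whole data matrix, the code below computes the distance between every pair of sample units. The data matrix and the function name are hypothetical and chosen only so the arithmetic is easy to follow (a 3-4-5 right triangle).

```python
import math


def euclidean(row_i, row_h):
    """Euclidean distance between two sample units (rows of the data matrix)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_i, row_h)))


# Hypothetical data matrix: q = 3 sample units (rows), p = 2 species (columns).
data = [
    [0, 0],
    [3, 4],
    [6, 8],
]

# Full q x q distance matrix; it is symmetric with zeros on the diagonal.
dist = [[euclidean(r_i, r_h) for r_h in data] for r_i in data]
for row in dist:
    print(row)
```

The same function works unchanged for any number of species, because the sum simply runs over however many columns the rows contain.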
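[...]

$$D_{i,h} = \frac{\sum_{j=1}^{p} |a_{ij} - a_{hj}|}{\sum_{j=1}^{p} a_{ij} + \sum_{j=1}^{p} a_{hj}}$$

Another way of writing this, where MIN is the smaller of two values, is:

$$D_{i,h} = 1 - \frac{2 \sum_{j=1}^{p} \mathrm{MIN}(a_{ij}, a_{hj})}{\sum_{j=1}^{p} a_{ij} + \sum_{j=1}^{p} a_{hj}}$$

Written in set notation:

$$\text{Sorensen similarity} = \frac{2(A \cap B)}{(A \cup B) + (A \cap B)}$$

$$\text{Jaccard similarity} = \frac{A \cap B}{A \cup B}$$

[...]

$$JD_{i,h} = \frac{2 \sum_{j=1}^{p} |a_{ij} - a_{hj}|}{\sum_{j=1}^{p} a_{ij} + \sum_{j=1}^{p} a_{hj} + \sum_{j=1}^{p} |a_{ij} - a_{hj}|}$$

[...]

$$D_{i,h} = 1 - \sum_{j=1}^{p} \mathrm{MIN}\left(\frac{a_{ij}}{\sum_{j=1}^{p} a_{ij}}, \frac{a_{hj}}{\sum_{j=1}^{p} a_{hj}}\right)$$

An alternate version, using an absolute value instead of the MIN function, is also mathematically equivalent to the Bray-Curtis coefficient on data relativized by SU total:

$$D_{i,h} = \frac{1}{2} \sum_{j=1}^{p} \left| \frac{a_{ij}}{\sum_{j=1}^{p} a_{ij}} - \frac{a_{hj}}{\sum_{j=1}^{p} a_{hj}} \right|$$

Figure 6.4. Relative Euclidean distance is the chord distance between two points on the surface of a unit hypersphere. (Labels: arc distance; RED (chord); axis: Species 1.)

RED builds in a standardization. It puts differently scaled variables on the same footing, eliminating any signal other than relative abundance. Note that the correlation coefficient also accomplishes this standardization, but arccos(r) gives the arc distance on the quarter hypersphere, not the chord distance. Also, [...]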
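The proportion coefficients and the chord interpretation of RED discussed above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names and the two abundance vectors are hypothetical, and `relative_euclidean` assumes that RED means Euclidean distance after each sample unit is scaled to unit length, which is what the chord description in Figure 6.4 implies.

```python
import math


def sorensen_abs(x, y):
    """Sorensen distance, absolute-difference form."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def sorensen_min(x, y):
    """Sorensen distance, MIN form: 1 - 2 * shared / total."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))


def jaccard(x, y):
    """Quantitative Jaccard distance."""
    diff = sum(abs(a - b) for a, b in zip(x, y))
    return 2 * diff / (sum(x) + sum(y) + diff)


def relativize(x):
    """Divide each abundance by the sample unit total."""
    total = sum(x)
    return [a / total for a in x]


def relative_sorensen(x, y):
    """Relative Sorensen: half the summed absolute difference in proportions."""
    bx, by = relativize(x), relativize(y)
    return 0.5 * sum(abs(a - b) for a, b in zip(bx, by))


def relative_euclidean(x, y):
    """Chord distance: Euclidean distance after scaling rows to unit length (assumed form of RED)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.sqrt(sum((a / nx - b / ny) ** 2 for a, b in zip(x, y)))


# Hypothetical abundances of four species in two sample units.
u = [4, 2, 0, 1]
v = [5, 1, 1, 10]
u_doubled = [2 * a for a in u]  # same proportions as u, twice the total

print("Sorensen (abs form):", sorensen_abs(u, v))
print("Sorensen (MIN form):", sorensen_min(u, v))
print("Jaccard:", jaccard(u, v))
print("Relative Sorensen:", relative_sorensen(u, v))
print("RED between u and 2u:", relative_euclidean(u, u_doubled))
```

The two Sorensen forms agree, relative Sorensen matches Sorensen computed on relativized data, and both relativized measures treat `u` and `u_doubled` as identical because only relative abundance differs between them.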
[...] $a_{i+}$ = total for sample unit i (i.e., $\sum_{j=1}^{p} a_{ij}$)

Note that this distance measure is similar to Euclidean distance, but it is weighted by the inverse of the species totals. If the data are prerelativized by sample unit totals (i.e., $b_{ij} = a_{ij} / a_{i+}$), then the equation simplifies to:

$$\chi^2_{i,h} = \sqrt{\sum_{j=1}^{p} \frac{(b_{ij} - b_{hj})^2}{a_{+j}}}$$

The numerator is the squared difference in relative abundance. It is expressed as a proportion of the species total (the denominator, $a_{+j}$) and summed over all species.

Minchin (1987a) offered the following critique of this distance measure:

The appropriateness of Chi-squared distance as a measure of compositional dissimilarity in ecology may be questioned (Faith et al. 1987). The measure accords high weight to species whose total abundance in the data is low. [Conversely, it de-emphasizes abundant species.] It thus tends to exaggerate the distinctiveness of samples containing several rare species. Unlike the Bray-Curtis coefficient and related measures, Chi-squared distance does not reach a constant, maximal value for sample pairs with no species in common, but fluctuates according to variations in the representation of species with high or low total abundances. These properties of Chi-squared distance may account for some of the distortions observed in DCA ordinations.

Mahalanobis distance (D²)

Mahalanobis distance $D^2_{fh}$ is used as a distance measure between two groups (f and h). It is commonly used in discriminant analysis and in testing for outliers. If $\bar{a}_{if}$ is the mean for the ith variable in group f, and $w_{ij}$ is an element from the inverse of the pooled within-groups covariance matrix, representing variables i and j, then

$$D^2_{fh} = (n - g) \sum_{i=1}^{p} \sum_{j=1}^{p} w_{ij} (\bar{a}_{if} - \bar{a}_{ih})(\bar{a}_{jf} - \bar{a}_{jh})$$

where n is the number of sample units, g is the number of groups, and i ≠ j. Note that differences are weighted more heavily by $w_{ij}$ when variables i and j are uncorrelated. Thus, Mahalanobis distance corrects for the correlation structure of the original variables (the dimensions of the space). The built-in standardization means that it is independent of the measurement units of the original variables.

Figure 6.5. Illustration of the influence of within-group variance on Mahalanobis distance. (Panels: Case A and Case B.)

In which case in Figure 6.5 are groups f and h more distant? Because the Mahalanobis distance inversely weights the distance between centroids by the variance, the two groups are more distant in Case B, even though the centroids are equidistant in the two cases.

Note the conceptual similarity to an F-ratio of between- to within-group variance. Indeed, the Mahalanobis distance can be used to calculate an F-test for multivariate differences between groups. Similarly, it can be used to test for outliers by calculating the distance between each point and the cloud of remaining points.

Performance of distance measures

Loss of sensitivity with heterogeneity

Performance of distance measures can be evaluated by comparing the relationship between environmental distance (distance along an environmental gradient, such as elevation) vs. sociological distance (the difference in communities as reflected by the distance in species space). This method of
evaluating distance measures was used by Beals (1984), Faith et al. (1987), De'ath (1999a), and Boyce and Ellison (2001). If species respond noiselessly to environmental gradients and the environmental gradients are known, then we seek a perfect linear relationship between distances in species space and distances in environmental space. Any departure from that relationship represents a partial failure of our distance measures.

Two examples help clarify the variability in the relationship between distance in species space and environmental space. These examples are based on synthetic data sets with a known underlying structure and noiseless responses of species to two environmental gradients.

The first example is an "easy" data set, consisting of 25 sample units and 16 species. It is easy because the beta diversity is fairly low (average Sorensen distance among SUs = 0.59; 1.3 half changes), the sample unit totals are fairly even (coefficient of variation (CV) of SU totals = 17%), and the species are all similarly abundant (CV of species totals = 37%).

Despite this being an easy data set, all of the distance measures show a curvilinear relationship with environmental distance (Fig. 6.6). Specifically, we see the loss in sensitivity of our distance measures at large environmental distances. The problem is least apparent in the Sorensen, chi-square, and Jaccard distances. The problem is worst with the correlation distance, where the curve not only flattens at high environmental distances, but starts to decline at the highest distances. The drop in the curve for the correlation coefficient (actually (1 - r)/2, which converts r into a distance rather than a similarity measure) is due to interpreting shared zeros (0,0) as positive association.

The second example is a more difficult data set, consisting of 100 sample units and 25 species. The difficulty has nothing to do with the size of the data set. Rather, its beta diversity is higher (average Sorensen distance among SUs = 0.79; 2.3 half changes), the sample unit totals vary more widely than in the previous example (CV of SU totals = 40%), and the species vary realistically in abundance (CV of species totals = 183%).

Again, all of the distance measures lose sensitivity with increasing environmental distance (Fig. 6.7). This loss is greatest for distance based on the correlation coefficient. Euclidean distance not only loses sensitivity at high distances, but introduces considerable error, even at moderate distances. Note also that Euclidean distance shows no fixed upper bound for sample units that have nothing in common.

Sorensen distance loses sensitivity over a distance about half the length of the environmental gradients. The flat top on the Sorensen scatterplot results because it has a fixed maximum for SUs having no species in common. Many ecologists consider this a desirable, intuitive property for species data. Chi-square distance performs reasonably well at small environmental distances but misinterprets many distant SUs as being close in species space.

Transformation of the data to binary (presence-absence) in both examples results in a more linear relationship with most distance measures. This was shown for Euclidean distance and Sorensen distance with real data from an elevation gradient (Beals 1984).

Given the apparently poor performance of all of the distance measures, it is remarkable that multivariate analysis is able to extract clear, sensible patterns (Fig. 6.8). We are rescued by the redundancy in the data: all ordination and classification techniques benefit from this redundancy. Where two species fail to be informative of a difference, another two species are informative. Some ordination techniques, nonmetric multidimensional scaling (NMS) in particular, are able to linearize the relationship between distance in species space and distance in a reduced ordination space. NMS has an advantage over other ordination techniques: it is based on ranked distances, which improves its ability to extract information from the nonlinear relationships illustrated in the two examples.
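The loss of sensitivity described above can be reproduced with a much smaller synthetic example than the ones in the text. The sketch below is an assumption-laden illustration: it uses a single gradient rather than two, Gaussian species response curves, and arbitrary values for the number of species, their optima, widths, and peak abundances. None of these settings come from the source data sets.

```python
import math


def sorensen(x, y):
    """Sorensen (Bray-Curtis) distance between two sample units."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def euclidean(x, y):
    """Euclidean distance between two sample units."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def abundance(position, optimum, width=1.0, peak=10.0):
    """Noiseless Gaussian response of one species to the gradient (assumed shape)."""
    return peak * math.exp(-((position - optimum) ** 2) / (2 * width ** 2))


# Hypothetical design: 11 sample units at positions 0..10 on one gradient,
# and 16 species with optima spaced one unit apart from -2 to 13.
positions = list(range(11))
optima = [-2 + k for k in range(16)]
samples = [[abundance(x, m) for m in optima] for x in positions]

# Distances in species space from the first sample unit to every sample unit.
sor = [sorensen(samples[0], row) for row in samples]
euc = [euclidean(samples[0], row) for row in samples]

for x, s, e in zip(positions, sor, euc):
    print(f"environmental distance {x:2d}: Sorensen {s:.3f}, Euclidean {e:.2f}")
```

Environmental distance keeps increasing by one unit per step, but under these assumptions the Sorensen distance is already close to its maximum of 1 by about half the gradient length and barely changes afterwards, which is the flat top described for the Sorensen scatterplot.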
Figure 6.6. Relationship between distance in species space for an "easy" data set, using various distance measures, and environmental distance. The graphs above are based on a synthetic data set with noiseless species responses to two known underlying environmental gradients. The gradients were sampled with a 5 x 5 grid. This is an "easy" data set because the average distance is reasonably small (Sorensen distance = 0.59; 1.3 half changes), all species are similar in abundance (CV of species totals = 37%), and sample units have similar totals (CV of SU totals = 17%).
Figure 6.7. Relationship between distance in species space for a "more difficult" data set, using various distance measures, and environmental distance (horizontal axis of each panel: Environmental Distance). The graphs above are based on a synthetic data set with noiseless species responses to two known underlying environmental gradients. The gradients were sampled with a 10 x 10 grid. This is a "more difficult" data set because the average distance is rather large (Sorensen distance = 0.79; 2.3 half changes), species vary in abundance (CV of species totals = 183%), and sample units have moderately variable totals (CV of SU totals = 40%).
[...] root-of-5 times as important as gradient Y. Which space matches your intuition?

With Euclidean distance, large differences are weighted more heavily than several small differences (Box 6.2). This results in greater sensitivity to outliers with Euclidean distance than with city-block distance measures. For example, assume we have four species and three sample units, A, B, and C. The data and differences in abundance of each species for each pair of sample units are listed in Box 6.2.

Geodesic distance

Performance of all of the traditional distance measures declines as distances in species space increase (Figs. 6.7 and 6.8). An innovative solution to the problem of measuring long distances in nonlinear structures is the geodesic distance (Tenenbaum et al. 2000). This concept is similar to the "shortest path" adjustments to a distance matrix (Williamson 1978, 1983; Clymo 1980; Bradfield & Kenkel 1987; De'ath 1999a). Williamson summed distances between sample unit pairs representing the shortest path between two distant SUs, but only applied this to SUs with no species in common. Bradfield and Kenkel (1987) added flexibility by varying the threshold for the number of species in common. Bradfield and Kenkel found better results with a lower threshold; i.e., adjusting a larger proportion of the distance matrix. De'ath (1999a) further extended the method by using city-block distance measures, changing the threshold to a quantitative dissimilarity value, and allowing multiple-step paths between very distant SUs.

A geodesic distance between two points is measured by accumulating distances between nearby points. Tenenbaum et al. (2000) used Euclidean distances, but geodesic distances can be built from other distance measures. "Nearby" can be defined as a fixed radius or as the n nearest neighbors. Geodesic distances should be able to find effectively the curvature of compositional gradients in species space. Geodesic distances are one of the most promising new methodological developments. A key issue will be objectively defining "nearby" to optimize the recovery of patterns in ecological communities.

The difference between Tenenbaum's geodesic distance and the ecologists' shortest path (SP) methods can be visualized with an analogy to crossing a stream dotted with stepping stones. We want to find the shortest route from a particular point on one bank to a particular point on the opposite bank. The SP method must find a single stepping stone that gets us across the stream in the two shortest possible leaps (one to the stepping stone and one to the far bank). The geodesic method, however, defines a comfortably small step, then seeks the shortest series of steps without ever exceeding that small step length. The geodesic method thus considers the whole array of stepping stones, while the SP method can consider only one stone at a time.

A problem with the SP method is that if the stream is broader than two leaps, then no single stepping stone will work. This corresponds to two SUs so different that there is no third SU that shares species with both of them. De'ath (1999a) solved this problem by allowing multiple passes of the SP method, in essence allowing multiple stepping stones.

Despite the excellent ordinations in Bradfield and Kenkel (1987), Boyce and Ellison (2001), and De'ath (1999a), the geodesic distances and related methods have not been widely adopted, probably because they have not been included in popular software packages. Whether Tenenbaum et al.'s (2000) geodesic distances offer further improvements over the SP methods used by ecologists remains to be seen.
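The idea of accumulating short steps can be sketched as follows. This is a minimal illustration of a fixed-radius geodesic distance, not Tenenbaum et al.'s implementation and not any of the ecologists' SP variants: the function names, the all-pairs shortest-path routine (Floyd-Warshall), the radius, and the half-circle test data are all assumptions made for the example.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geodesic_distances(points, radius):
    """All-pairs geodesic distances using a fixed-radius neighborhood.

    Pairs within `radius` keep their direct distance; every other pair gets
    the length of the shortest chain of such short steps (Floyd-Warshall).
    Pairs that cannot be connected by short steps stay at infinity.
    """
    n = len(points)
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            direct = euclidean(points[i], points[j])
            if direct <= radius:
                d[i][j] = direct
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


# Hypothetical curved gradient: 21 points evenly spaced on a unit half circle.
points = [(math.cos(math.pi * k / 20), math.sin(math.pi * k / 20)) for k in range(21)]

geo = geodesic_distances(points, radius=0.2)
straight = euclidean(points[0], points[-1])

print("straight-line distance between the ends:", round(straight, 3))
print("geodesic distance between the ends:", round(geo[0][-1], 3))
```

With this radius only adjacent points count as "nearby", so the geodesic distance between the two ends follows the curve (close to half the circumference) instead of cutting straight across it. Choosing the radius too large collapses the result back toward the straight-line distance, and choosing it too small leaves pairs disconnected, which is the "defining nearby" issue raised above.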
Box 6.1. Comparison of Euclidean distance with a proportion coefficient (Sorensen distance). Relative proportions of species 1 and 2 are the same between Plots 1 and 2 and Plots 3 and 4.

The Sorensen distance between Plots 1 and 2 is 0.333 (33.3%), as is the Sorensen distance between Plots 3 and 4, as illustrated below. In both cases the shared abundance is one third of the total abundance. In contrast, the Euclidean distance between Plots 1 and 2 is 1, while the Euclidean distance between Plots 3 and 4 is 10. Thus the Sorensen coefficient expresses the shared abundance as a proportion of the total abundance, while Euclidean distance is unconcerned with proportions.

[Figure in Box 6.1: Plots 1 to 4 plotted with Species 1 on the horizontal axis and Species 2 on the vertical axis (both 0 to 10); the Sorensen distance within each pair of plots is labeled 33.3.]
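The contrast in Box 6.1 can be reproduced with a short calculation. The plot coordinates below are assumed, not read from the figure: they are simply one set of abundances that gives the distances quoted in the box (Sorensen 0.333 for both pairs, Euclidean 1 and 10).

```python
import math


def sorensen(x, y):
    """Sorensen (Bray-Curtis) distance between two plots."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (sum(x) + sum(y))


def euclidean(x, y):
    """Euclidean distance between two plots."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Assumed abundances (species 1, species 2); hypothetical, not taken from the figure.
plot1, plot2 = (1, 0), (1, 1)
plot3, plot4 = (10, 0), (10, 10)

for name, a, b in [("Plots 1-2", plot1, plot2), ("Plots 3-4", plot3, plot4)]:
    print(name, "Sorensen:", round(sorensen(a, b), 3), "Euclidean:", euclidean(a, b))
```

Multiplying every abundance by ten leaves the Sorensen distance unchanged but multiplies the Euclidean distance by ten.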
Box 6.2. Example data set comparing Euclidean and city-block distances, contrasting the effect of squaring differences versus not.

          Sp
SU     1    2    3    4
A      4    2    0    1
B      5    1    1   10
C      7    5    3    4

Sample units A,B: species differences d = 1, 1, 1, 9 for each of the four species.
Sample units A,C: species differences d = 3, 3, 3, 3.

       Distance measure
       Euclidean    City-block
AB     9.165        12
AC     6            12

The simple sum of differences (city-block distance) is the same for AB and AC. Euclidean distance sums the squared differences, so that the difference of 9 is given more emphasis with Euclidean distance than with city-block distance. Thus, the Euclidean distance between A and B is larger than the distance between A and C. The city-block distance between A and B is the same as that between A and C. Which distance measure matches your intuition?
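The arithmetic in Box 6.2 can be checked directly. The abundances below are the ones listed in the box; the function and variable names are illustrative.

```python
import math

# Abundances of species 1-4 in sample units A, B, and C (from Box 6.2).
data = {
    "A": [4, 2, 0, 1],
    "B": [5, 1, 1, 10],
    "C": [7, 5, 3, 4],
}


def city_block(x, y):
    """City-block distance: simple sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for pair in ("AB", "AC"):
    x, y = data[pair[0]], data[pair[1]]
    print(pair, round(euclidean(x, y), 3), city_block(x, y))
# AB 9.165 12
# AC 6.0 12
```

The single difference of 9 contributes 81 of the 84 squared units in the AB Euclidean distance, whereas it contributes only 9 of the 12 units in the city-block distance.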