Alexei S. Kassian*
Linguistic homoplasy and phylogeny
reconstruction. The cases of Lezgian
and Tsezic languages (North Caucasus)
https://doi.org/10.1515/flih-2017-0008
Abstract: This paper deals with the problem of linguistic homoplasy (parallel or
backward development), how it can be detected, what kinds of linguistic homoplasy
can be distinguished and which varieties of the phenomenon are the most deleterious
for the reconstruction of language phylogeny. It is proposed that language phylogeny
reconstruction should consist of two main stages. Firstly, a strict consensus tree
should be built on the basis of high-quality input data elaborated with the help of
the main phylogenetic methods (such as Neighbor-joining, Bayesian MCMC, and
Maximum parsimony), and ancestral character states, allowing us to reveal a certain
number of homoplastic characters. Secondly, after the detected instances of homo-
plasy are eliminated from the input matrix, the consensus tree is to be compiled again.
It is expected that after homoplastic optimization it will be possible to better resolve
individual “problem clades”, and generally the homoplasy-optimized phylogeny
should be more robust than the tree constructed initially. The proposed procedure is
tested on the 110-item Swadesh wordlists of the Lezgian and Tsezic groups. The
Lezgian and Tsezic results generally support theoretical expectations. The MLN
(minimal lateral network) method, currently implemented in the LingPy software, is
a helpful tool for the detection of linguistic homoplasy.
1 Introduction¹
This paper is structured as follows. Section 1 contains a general overview of
homoplasy between languages; main kinds of linguistic homoplasy are
1 This section partially overlaps with List et al. (2014b). List et al. focus on loanword detection,
but borrowings of any kind can be formally treated as a particular case of homoplasy.
[Figure 1: panels (a)–(d), triangles over the lects L1, L2, L3 with percentage distances on the edges.]
Figure 1: Distances between three lects (L1, L2, L3). The triangles (a)–(b) are ultrametric; (c)–(d)
are not. A higher percentage of discrepancies within character states means a more distant
relationship between the two lects in question. Thus (a) L2 & L3 are close to each other and both
are equally remote from L1; (b) the three are equally distant from each other; (c) the three
distances are all unequal; (d) L2 & L3 are remote from each other and both are equally close to L1.
(S. Starostin 2007 [1989], 2000, 2007; Novotná and Blažek 2007; Balanovsky et
al. 2011). On the other hand, a number of scholars prefer to apply the relaxed
molecular clock model to language evolution, implying that the mean rate of
lexical replacement varies among branches (e.g., Gray and Atkinson 2003;
Kitchen et al. 2009). In any case, it is unlikely that the range within which
the mean rate of basic vocabulary replacement varies in practice can be very
large (perhaps except for some rare special cases such as Icelandic). Thus, the
pairs L1-L2 or L2-L3 in Figure 1c and L1-L2 or L1-L3 in Figure 1d are suspected
to contain secondary, i.e., homoplastic matches.
A more difficult task is to detect exactly which characters are homoplastic.
The original linguistic dataset represents a multistate matrix (i.e., a matrix
whose characters can have more than 2 states; for matrix compilation, see,
e.g., Atkinson and Gray 2006: 93–94). If we are dealing with lexical characters
(lexicostatistics), synonymy, i.e., the presence of more than one word in a
single slot, is almost inevitable. To my knowledge, among packages
which reconstruct phylogeny, Starling (S. Starostin 2007 [1993]; Burlak and
Starostin 2005: 271–274) is the only software able to process input matrices
containing synonyms: when the same Swadesh slot is occupied by more
than one word, i.e., by several synonyms, all possible pairs of involved
words between two languages are compared within this slot, and if there is
at least one matching pair, Starling treats the whole slot as a match² (it goes
without saying that synonymy is deleterious for correct phylogeny reconstruction,
so the number of synonyms should be minimized during the compilation of
a dataset).
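The slot-matching rule just described can be sketched as follows. This is a minimal illustration, not Starling's actual code: the function names, the dictionary layout (slot name mapped to a set of cognate indices) and the toy indices are all assumptions made for the example.

```python
def slot_matches(words_a, words_b):
    """True if the two lects share at least one cognate index in this slot."""
    return not set(words_a).isdisjoint(words_b)


def shared_percentage(list_a, list_b):
    """Percentage of matching slots among the slots filled in both wordlists."""
    compared = [s for s in list_a if s in list_b and list_a[s] and list_b[s]]
    if not compared:
        raise ValueError("no comparable slots")
    matches = sum(slot_matches(list_a[s], list_b[s]) for s in compared)
    return 100.0 * matches / len(compared)


# Hypothetical wordlists: the slot 'we' holds two synonyms in lect_a.
lect_a = {"we": {1, 2}, "year": {5}, "foot": {7}}
lect_b = {"we": {2}, "year": {6}, "foot": {7}}

print(round(shared_percentage(lect_a, lect_b), 1))  # two of three slots match
```

Because a single matching pair is enough, every additional synonym in a slot raises the chance of a match, which is one way to see why synonyms inflate apparent closeness.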
In order to make the dataset importable in most popular phylogenetic
packages, it was proposed by Gray and Atkinson (Gray and Atkinson 2003;
Atkinson and Gray 2006) to convert the original multistate matrix into a binary
format. Binarization is coding for the presence (“1”) or absence (“0”) of the
specific proto-root with the given Swadesh meaning in the language in question,
while Swadesh items superseded by loanwords or simply not documented are
marked as “?” (the difference between this procedure, adopted by the Global
Lexicostatistical Database project following S. Starostin’s (2007 [1989])
approach, and the conversion described in Atkinson and Gray (2006), is that
Atkinson and Gray treat loanwords as full-fledged items with unique cognate
indices – the so-called singletons).
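The binarization step can be sketched as below, under stated assumptions: the multistate matrix is held as a dictionary (lect, then slot, then a set of cognate indices), and `None` stands for a slot that is undocumented or filled by a loanword. The function name `binarize` and the toy data are hypothetical; this follows the GLD-style convention described above rather than the singleton treatment of Atkinson and Gray.

```python
def binarize(matrix):
    """Convert a multistate matrix into presence/absence strings.

    matrix: dict lect -> dict slot -> set of cognate indices, or None when
    the slot is undocumented or superseded by a loanword (assumed encoding).
    Returns (columns, binary) where columns lists (slot, root) pairs and
    binary maps each lect to a string over '1', '0', '?'.
    """
    lects = sorted(matrix)
    slots = sorted({slot for row in matrix.values() for slot in row})
    columns = []
    for slot in slots:
        roots = sorted({r for lect in lects
                        for r in (matrix[lect].get(slot) or ())})
        columns.extend((slot, root) for root in roots)

    binary = {}
    for lect in lects:
        cells = []
        for slot, root in columns:
            entry = matrix[lect].get(slot)
            if entry is None:
                cells.append("?")          # loanword or not documented
            else:
                cells.append("1" if root in entry else "0")
        binary[lect] = "".join(cells)
    return columns, binary


# Hypothetical three-lect matrix with two slots.
toy = {
    "L1": {"year": {1}, "foot": {1}},
    "L2": {"year": {2}, "foot": {1}},
    "L3": {"year": None, "foot": {1, 2}},
}
columns, binary = binarize(toy)
print(columns)
print(binary)
```

Each multistate character becomes as many binary characters as it has roots, which is the source of the dependencies among characters discussed in the next paragraph.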
It remains unclear how seriously such a conversion corrupts input data and
causes model misspecification. Evans et al. (2006) argue that binary recoding
creates dependencies among characters that have unknown consequences and
potentially lead to biased results. Pagel and Meade (2006), however, suggest
that dependencies among characters introduced under binary recoding will have
only a minor impact on computational results and merely create scaled versions
of the best topology – trees with shorter branch lengths and higher posterior
probabilities for subgroups, but without major changes to the subgroups themselves.
Up to now, however, almost all available tests (including my own,
based on various Global Lexicostatistical Database wordlists) suggest that the
phylogenetic results of a multistate matrix and those of its binary counterpart
are quite similar if not identical.³
2 The LingPy package (List and Moran 2013), discussed below, can process synonymy in the
same way, but LingPy functions do not include phylogeny reconstruction.
3 One of the known exceptions is the maximum parsimony phylogeny of the Indo-European
family offered in Rexová et al. (2003): the Indo-European tree obtained from the multistate
matrix seriously differs from that obtained from the binary matrix if Hittite is involved as an
outlier. I suspect that the main reason for such discrepancies could be that the input lexical
dataset contains a substantial number of errors (see the Linguistic supplement in Kushniarevich
et al. 2015, where the Indo-European Swadesh database of Dyen et al. 1997 is partially examined).
An example.⁴ The Proto-Slavic paradigm of the word for ‘foot’ was as follows:
nominative *〈nog-a〉, locative *〈nog-oi〉. At a later stage the diphthong 〈oi〉
contracted into a front vowel which caused the fronting g > zʸ in Old Russian,
where the paradigm became nog-a, nozʸ-i͡e. Out of the two daughter languages of
Old Russian, this zʸ is retained in Modern Belarusian: nag-a, nazʸ-e, but lost in
Modern Russian: nag-a, nagʸ-e, where the locative form has undergone leveling
on the basis of the nominative (Kiparsky 1967: 34–35). In the absence of documented
Old Russian evidence, we would have hypothesized that the shift g > zʸ before a
front vowel only took place in Belarusian, rather than in the Russian-Belarusian
protolanguage (i.e., in Old Russian).
Parallel evolution can be depicted as in Figure 3.
An example. The analytic perfect tense-aspect construction participle + ‘to be’
or ‘to have’, which is widespread in modern Germanic languages, arose simulta-
neously in the individual lects (Harbert 2007: 292). However, in the absence of
evidence from ancient Germanic languages, we would instead reconstruct this
morphosyntactic pattern for Proto-Germanic, which would be an error.
From the formal point of view, if we have two characters in a multistate
matrix, each of them has at least two states with equal cost of change between
the states (e.g. one has the states A & B, the second C & D), and they take all four
4 All linguistic data in the present article are encoded in the unified transcription system of the
Global Lexicostatistical Database project, which is generally based on the IPA alphabet, with
just a few specific discrepancies (http://starling.rinet.ru/new100/UTS.htm). Traditional or ortho-
graphic representations are enclosed in 〈angle brackets〉.
[Figure 3: two tree diagrams, (a) vs. (b), over the states A and B.]
Figure 3: A character has two states: A, B. (a) The state A is reconstructed for the intermediate
node, thus the tree demonstrates parallel developments A > B in two terminal nodes. (b) The
state B is reconstructed for the intermediate node, which makes it possible to avoid parallel
developments.
possible pairs of states in the matrix: “AC”, “AD”, “BC”, “BD”, then these
characters are incompatible and at least one of them must be homoplastic
(see, e.g., Semple and Steel 2003: 69–73). In such and some other cases, the
reconstructed tree topology can suggest exactly which character is homoplastic:
see Figure 4.
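The incompatibility condition stated above (two states of each character occurring in all four combinations) can be checked mechanically. The following is a minimal sketch; the function name and the dictionary encoding (lect mapped to state, all states known) are assumptions for illustration, and the test covers only the four-combination configuration described in the text.

```python
from itertools import combinations


def incompatible(char1, char2):
    """True if some two states of each character occur in all four pairings.

    char1, char2: dict lect -> state, with no missing values (assumed).
    """
    pairs = {(char1[lect], char2[lect]) for lect in char1 if lect in char2}
    states1 = sorted({a for a, _ in pairs})
    states2 = sorted({b for _, b in pairs})
    for a1, a2 in combinations(states1, 2):
        for b1, b2 in combinations(states2, 2):
            if {(a1, b1), (a1, b2), (a2, b1), (a2, b2)} <= pairs:
                return True
    return False


# The configuration of Figure 4: states A A B B against C D C D.
first = {"L1": "A", "L2": "A", "L3": "B", "L4": "B"}
second = {"L1": "C", "L2": "D", "L3": "C", "L4": "D"}
# A hypothetical character that agrees with the first one.
agreeing = {"L1": "C", "L2": "C", "L3": "D", "L4": "D"}

print(incompatible(first, second))    # all of AC, AD, BC, BD are attested
print(incompatible(first, agreeing))  # only AC and BD are attested
```

The check only signals that at least one character of the pair is homoplastic; deciding which one requires the tree, as the text goes on to explain.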
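[Figure 4: tree diagram with terminal states A A B B (first character) and C D C D (second character); ancestral states A and C~D.]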
Figure 4: Two incompatible characters. The first character has the states A, B; the second one
has the states C, D. The second character demonstrates homoplasy. The ancestral state can be
safely reconstructed for the first character (A), but not for the second one (C and D are equally
probable).
As can be seen, the reconstructed tree helps to detect homoplasy within one
multistate character (the so-called “criss-crossed” configuration): Figure 5.
However, the maximum number of homoplastic developments in a multi-
state or binary matrix can be revealed if ancestral character states, i.e., character
states for the protolanguage, are reconstructed (see the example above involving
the Germanic perfect tense-aspect construction). Reconstruction of this kind is in
[Figure 5: two tree diagrams, each with terminal states C D C D.]
Figure 5: A character has two states: C, D. Two kinds of the “criss-crossed” configuration are
depicted. In both cases, the ancestral state cannot be reconstructed with certainty (C and D are
equally probable from the topological point of view).
fact a non-trivial theoretical and practical task (Kassian et al. 2015); in particular
it is impossible without the established rooted phylogenetic tree.
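To illustrate why a rooted tree is a prerequisite, here is a minimal sketch of Fitch's small-parsimony pass, a textbook technique that is swapped in for illustration and is not the reconstruction procedure of Kassian et al. (2015). The nested-tuple tree encoding and the function names are assumptions. A character that needs more changes than its number of states minus one cannot be explained on the tree without homoplasy.

```python
def fitch(tree, states):
    """Bottom-up Fitch pass on a binary tree.

    tree: a leaf name (str) or a pair (left, right) of subtrees.
    states: dict leaf name -> observed state.
    Returns (candidate states at this node, minimal number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def homoplasy_excess(tree, states):
    """Changes required beyond the minimum of (number of states - 1)."""
    _, changes = fitch(tree, states)
    return changes - (len(set(states.values())) - 1)


# The topology and states of Figure 4 (lect names are placeholders).
tree = (("L1", "L2"), ("L3", "L4"))
first = {"L1": "A", "L2": "A", "L3": "B", "L4": "B"}
second = {"L1": "C", "L2": "D", "L3": "C", "L4": "D"}

print(homoplasy_excess(tree, first))   # 0: one change suffices
print(homoplasy_excess(tree, second))  # 1: one extra, homoplastic change
print(fitch(tree, second)[0])          # root candidates: both C and D
```

For the second character the root set contains both C and D, which mirrors the observation in Figure 4 that the ancestral state cannot be chosen on topological grounds alone.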
The picture is somewhat different when we are dealing with a binary
lexicostatistical matrix, converted from an original multistate matrix with “1”
denoting the marked state of the character and “0” the unmarked state (i.e.,
“1” = presence and “0” = absence of the specific proto-root with the specific
Swadesh meaning in the given language; the so-called presence/absence
matrix). Even if there are two incompatible characters in the input matrix
which take all four possible pairs of states: “00”, “01”, “10”, “11”, the change
1 > 0 (loss of the root) is not a significant event, since this can occur indepen-
dently in different languages and such loss can hardly be regarded as homo-
plastic. Therefore the known rooted tree topology is unhelpful for the detection
of linguistic homoplasy in this type of binary matrix. To detect the fact of
homoplasy and identify specific homoplastic characters it is necessary to recon-
struct ancestral character states.
Once the phylogenetic rooted tree of the analyzed language group is
obtained and character states for the protolanguage have been reconstructed,
it is reasonable to search the input matrix for homoplastic characters and to
eliminate all parasitic matches caused by these characters. It is expected that a
new phylogenetic tree reconstructed on the basis of the elaborated matrix with
homoplastic matches eliminated will be more robust than the tree reconstructed
initially. Hence I label a phylogenetic tree produced from the standard dataset
non-homoplasy-optimized, or simply non-optimized, and a tree produced from the
examined dataset with homoplastic developments at least partially eliminated
homoplasy-optimized.
Cf. recent attempts to detect lexical homoplasy and loanwords with the help
of formal algorithms, based on the minimal lateral network (MLN) approach,
implemented in the LingPy software: Nelson-Sathi et al. (2011); List and Moran
(2013); List et al. (2014a; 2014b); below this will be tested and discussed in more
detail. For the NeighborNet network approach, see Bryant et al. (2005); Holden
and Gray (2006).
An example. Ukrainian rʸik underwent the shift ‘term, time period’ > ‘year’
(having superseded the more archaic ɦid ‘year’) under the influence of Polish rok
‘year’. Ukrainian rʸik and Polish rok are etymological cognates with regular sound
correspondences, and this common identity is transparent for Ukrainian and Polish
speakers. Two distinct cognate indices should be assigned to the Ukrainian and
Polish forms in the homoplasy-optimized lexicostatistical matrix or, since we are
sure that the direction of influence in this case was Polish > Ukrainian, we can go
further and simply mark Ukrainian rʸik as a loanword in the meaning ‘year’.
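The two recoding options for a detected homoplastic match can be sketched as follows. The matrix layout (lect, then slot, then a set of integer cognate indices, with `None` for a loanword), the function names and the index values are all hypothetical; only the Ukrainian and Polish pairing comes from the example above.

```python
def split_cognate_index(matrix, slot, lect):
    """Return a copy in which `lect` gets a fresh, unique index in `slot`."""
    used = {idx for row in matrix.values() for idx in (row.get(slot) or ())}
    fresh = max(used, default=0) + 1
    recoded = {name: dict(row) for name, row in matrix.items()}
    recoded[lect][slot] = {fresh}
    return recoded


def mark_as_loan(matrix, slot, lect):
    """Return a copy in which the slot of `lect` is treated as a loanword."""
    recoded = {name: dict(row) for name, row in matrix.items()}
    recoded[lect][slot] = None  # None is later binarized as '?'
    return recoded


# Before optimization both forms carry the same (hypothetical) index 1.
original = {"Ukrainian": {"year": {1}}, "Polish": {"year": {1}}}

split = split_cognate_index(original, "year", "Ukrainian")
loan = mark_as_loan(original, "year", "Ukrainian")
print(split)
print(loan)
```

Both functions leave the input untouched, so the non-optimized and the homoplasy-optimized matrices can be kept side by side and compared.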
(2) Loan translation or semantic loan, where a semantic concept is borrowed
without any phonetic similarity or etymological relationship between the
expressions in question.
phrase’ under the influence of the same polysemy seen in English 〈head〉 (the
latter instance is from Haspelmath 2009).
Apparently the first kind of contact-driven semantic shifts, supported by
phonetic similarity and etymological relationship, occurs more frequently
than the second, phonetically unsupported one (cf. similar observations by
Haugen 1950: 220), but often, when closely related lects are involved, phone-
tically and etymologically supported homoplastic developments successfully
imitate natural etymological evolution and are treated by linguists as true
cognates.
Contact-driven homoplasy is the most deleterious kind of homoplastic
development, since its prevalence and correspondingly its impact on the recon-
structed phylogeny can be significant, whereas its detection is often difficult or
impossible.
synonyms. For example, the GLD standard (Kassian et al. 2010; G. Starostin 2010)
recommends filling the slots of the personal pronouns ‘I’, ‘thou’, ‘we’ with both
direct and oblique stem forms if these are suppletive. Simplification of a supple-
tive paradigm can produce incompatible characters (Figure 4) and criss-crossed
configuration (Figure 5).
An example. The Bulgarian personal pronoun ni-e ‘we’, nas ‘us’ corresponds
etymologically to Latin noːs ‘we, us’, whereas Lithuanian mɛːs ‘we’, mus ‘us’
corresponds etymologically to Tocharian B wes ‘we, us’. Since Bulgarian is the
closest relative of Lithuanian among these languages, we are dealing with a
criss-crossed configuration (Figure 5). However, this is not a genuine homo-
plasy, since we can securely reconstruct the suppletive paradigm *wey-s [direct]
( > Lithuanian & Tocharian)/*n(V)s- [oblique] ( > Bulgarian & Latin) for the Proto-
Indo-European language.
Note that only experts can decide whether the reconstruction of synon-
ymous character states for the protolanguage is reasonable or not in the indivi-
dual case. The information required to make such a decision is missing from the
input matrices. Because of this, proto-synonymy can hardly be discriminated
from real homoplasy by any formal algorithms.
3.1 Data
3.1.1 Wordlists
et al. 2015: 304–307; G. Starostin 2013a) for the methodology and basic principles of
protolanguage wordlist reconstruction.
For tree rooting, the 110-item wordlist of the Chechen literary language
(G. Starostin 2011) has been introduced into the comparison as an outgroup.
Chechen was chosen as a language genetically related to the investigated groups
(Lezgian and Tsezic) within the Nakh-Dagestanian linguistic family, on the one
hand, and as a lect which is definitely not a member of the Lezgian or Tsezic
groups on the other. Etymological comparison between Chechen and Lezgian/
Tsezic is based on S. Starostin and Nikolayev (Nikolaev) (1994) with some
corrections drawn from G. Starostin (2011).
Figure 6: Map of the modern Lezgian lects (adapted from Koryakov 2006: map #13).
Figure 7: Map of the modern Tsezic lects (adapted from Koryakov 2006: map #11).
The current version of the GLD Tsezic database (Kassian 2013–2015) features
110-item wordlists of the following 9 languages and dialects:
– Hunzib,
– Bezhta (dialects: Bezhta proper, Khoshar-Khota, Tlyadal),
– Hinukh,
– Tsez (alternative name: Dido; dialects: Kidero, Sagada),
– Khwarshi (dialects: Khwarshi proper, Inkhokwari),
– plus the reconstructed Proto-Tsezic list.
software v.4.13.1 (Huson and Bryant 2006) from the binary lexicostatistical
matrix (NEXUS format) which was generated from the original multistate
matrix by coding the presence (“1”) or absence (“0”) of each proto-root in
each language (Swadesh items superseded by loanwords or simply not docu-
mented are marked as “?”). The non-parametric bootstrap test was performed
(10,000 pseudoreplicates). The trees were rooted by the outgroup (the
Chechen wordlist). The trees are not dated. The trees were visualized in the
FigTree software (v.1.4.0). Furthermore, additional trees were produced by the
BioNJ method (Gascuel 1997); these are topologically identical to the NJ ones
in all cases.
3. Unweighted pair group method with arithmetic mean (hence UPGMA), see
(Sneath and Sokal 1973: 230–234; Makarenkov et al. 2006: 65–66). The trees
were produced in the SplitsTree4 software v.4.13.1 from the binary matrix
described above. The non-parametric bootstrap test was performed (10,000
pseudoreplicates). The trees were rooted by the outgroup (the Chechen
wordlist). The trees are not dated. The trees were visualized in the FigTree
software (v.1.4.0).
4. Markov chain Monte Carlo method under a Bayesian framework (hence
Bayesian MCMC), see Makarenkov et al. (2006: 68–69), as applied to
linguistic data for the first time in Gray and Atkinson (2003). The trees
were produced in the MrBayes software v.3.2.1 (Huelsenbeck and Ronquist
2001) from the binary matrix described above. I used the F81 model with
rates = gamma. The program was run 4 times using 4 concurrent Markov
chains; the Chechen language was marked as an outgroup. Each run
produced 5,000,000 tree generations with samples taken every 500 gen-
erations. For each run, the first 25% tree generations were discarded as a
burn-in. The consensus trees were rooted by the outgroup (the Chechen
wordlist). The trees are not dated. The trees were visualized in the FigTree
software (v.1.4.0).
5. Unweighted maximum parsimony method (hence UMP), see Makarenkov
et al. (2006: 66–67). The trees were produced in the TNT software (Willi
Hennig Society edition of TNT, v.1.1, May 2014, see Goloboff et al. 2008) from
the binary matrix described above by the branch-and-bound (“Implicit
enumeration”) algorithm. Obligatory binarization of nodes was prohibited
(“Collapse trees after the search”); the Chechen language was marked as an
outgroup. When several optimal trees of equal cost were obtained, the strict
consensus tree was produced for which the non-parametric bootstrap test
was performed (1000 pseudoreplicates). The trees were rooted by the out-
group (the Chechen wordlist). The trees are not dated. The trees were
visualized in the FigTree software (v.1.4.0).
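The non-parametric bootstrap used with several of the methods above resamples the characters (columns) of the matrix with replacement and rebuilds a tree from each pseudoreplicate. The resampling step alone can be sketched as below; this is an illustration with assumed names and a toy matrix, not the routine implemented in SplitsTree4 or TNT.

```python
import random


def bootstrap_replicates(binary, n_replicates, seed=0):
    """Yield pseudoreplicate matrices built by resampling columns.

    binary: dict lect -> string over '1', '0', '?' (all of equal length).
    Every lect receives the same resampled columns in the same order.
    """
    rng = random.Random(seed)
    lects = list(binary)
    n_chars = len(binary[lects[0]])
    if any(len(binary[lect]) != n_chars for lect in lects):
        raise ValueError("rows must have equal length")
    for _ in range(n_replicates):
        picked = [rng.randrange(n_chars) for _ in range(n_chars)]
        yield {lect: "".join(binary[lect][c] for c in picked)
               for lect in lects}


# Hypothetical presence/absence matrix with four characters.
example = {"L1": "1010", "L2": "1001", "L3": "11??"}
for replicate in bootstrap_replicates(example, 3):
    print(replicate)
```

The support value of a clade is then the share of pseudoreplicate trees in which that clade appears.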
3.3 Analysis
A lexicostatistical analysis of the Lezgian group, performed with the aid of the
aforementioned phylogenetic methods (StarlingNJ, NJ, UPGMA, Bayesian MCMC,
UMP), was published in Kassian (2015); the resulting trees are also reproduced in the Text
Supplement as Figures S1a–S6a. The strict consensus Lezgian non-homoplasy-optimized
tree (Figure 8a) conforms very well with the traditional expert classification
of the group. In sum, the following trees and networks obtained from
the non-homoplasy-optimized dataset are analyzed below:
– Figure S1a (see Text Supplement), StarlingNJ method with binary nodes
only.
– Figure S2a (see Text Supplement), StarlingNJ method with neighboring
nodes joined.
– Figure S3a (see Text Supplement), NJ method.
– Figure S4a (see Text Supplement), UPGMA method.
– Figure S5a (see Text Supplement), Bayesian MCMC method.
– Figure S6a (see Text Supplement), UMP method.
– Figure 8a, manually constructed strict consensus tree.
– Figure 9a, NeighborNet network.
– Figure 10a, Minimal lateral network.
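The strict consensus trees in this paper were constructed manually. As a rough illustration of the underlying idea only, a strict consensus keeps exactly those clades that appear in every input tree; the nested-tuple encoding, the function names and the four-taxon trees below are assumptions.

```python
def clades(tree):
    """Set of clades (as frozensets of leaf names) found in a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def strict_consensus_clades(trees):
    """Clades shared by all trees; conflicting branchings drop out."""
    return set.intersection(*(clades(tree) for tree in trees))


# Two hypothetical method outputs that disagree inside the clade {A, B, C}.
tree_1 = ((("A", "B"), "C"), "D")
tree_2 = (("A", ("B", "C")), "D")

for clade in sorted(strict_consensus_clades([tree_1, tree_2]), key=len):
    print(sorted(clade))
```

The conflicting groupings {A, B} and {B, C} disappear, leaving a ternary node over A, B and C, which is how method-dependent branchings end up as ternary nodes in a consensus tree.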
Figure 8: Manually constructed strict consensus phylogenetic tree of the Lezgian lects based on the
StarlingNJ, NJ, BioNJ, UPGMA, Bayesian MCMC, UMP methods (adapted from Kassian 2015: Figure 8).
Ternary nodes cover binary branchings that either differ depending on the method employed or are
automatically joined under the assumption of the temporal error of 300 years, or both. Statistical
support values are shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in an
individual method; not shown for nodes with P ≥ 0.95 in all methods). StarlingNJ dates are proposed.
(a) Non-homoplasy-optimized tree. (b) Homoplasy-optimized tree; probability values from the non-
optimized strict consensus tree (a) are quoted in parentheses. No topological discrepancies between
(a) and (b).
[Figure 9: panels (a) and (b), NeighborNet networks of the Lezgian lects with the Chechen outgroup.]
Figure 9: Phylogenetic network of the Lezgian lects produced by the NeighborNet method from
the binary matrix in the SplitsTree4 software. Bootstrap values are shown near the branches
(not shown for stable branches with bootstrap value ≥ 95%). Branch length reflects the relative
rate of cognate replacement as suggested by SplitsTree4. (a) Non-homoplasy-optimized net-
work (b) Homoplasy-optimized network.
[Figure 10: panels (a) and (b), minimal lateral networks of the Lezgian lects with the Chechen outgroup; scale: Inferred Links.]
Figure 10: Minimal lateral network of the Lezgian lects produced in the LingPy software. Node
size reflects the inferred number of cognate sets present in each lect. The solid links illustrate
lateral transfer events suggested by the method. The thickness and color of the links indicate
the inferred number of homoplastic characters between the two nodes in question, as specified
by the scale on the right. (a) Non-homoplasy-optimized network; based on the strict consensus
non-homoplasy-optimized tree (Figure 8a); the gain-loss model 5−2 is best fitting, p = 0.55 (the
next nearest model is 2−1, p = 0.48). (b) Homoplasy-optimized network; based on the strict
consensus homoplasy-optimized tree (Figure 8b); the gain-loss model 3−1 is best fitting,
p = 0.2 (the next nearest model is 5−2, p = 0.17).
According to the traditional expert classification, the group is divided into two
main branches: East Tsezic (Hunzib & Bezhta) and West Tsezic (Tsez &
Khwarshi), see Bokarev (1959: 227) (general lexical evidence); Imnaishvili
(1963: 9) (various evidence); Testelets (1993) (lexicostatistics and phonetic evi-
dence); S. Starostin and Nikolayev (Nikolaev) (1994: 110) (lexicostatistics and
general evidence); Alekseev (1998: 299–300) (lexicostatistics and general evi-
dence); Khalilova (2009: 1) (general evidence). A slightly different classification
is provided in van den Berg (1995: 5), where Khwarshi constitutes a third distinct
branch (Northern Tsezic).
The position of the fifth language, Hinukh, is somewhat more controversial,
since in Bokarev (1959: 227) Hinukh is classified as an “intermediate lect”:
“Hinukh occupies the intermediate position: it shares one part of its [native]
lexicon with the languages of the West subgroup, and another part with the
languages of the East subgroup”. However, according to Lomtadze (1963) (var-
ious evidence), van den Berg (1995: 5) (general evidence) and Testelets (1993)
(lexicostatistics and phonetic evidence), Hinukh is a close relative of Tsez or
simply a dialect of Tsez.
Previous formal classifications of the Tsezic group suggest the same two-
way division into East Tsezic (Hunzib & Bezhta) and West Tsezic (Hinukh, Tsez
& Khwarshi):
As illustrated by the above trees, all the phylogenetic methods used recon-
struct two main clades, namely East Tsezic (Hunzib & Bezhta) and West Tsezic
(Hinukh, Tsez & Khwarshi), in agreement with the traditional and previous
formal classifications. However, the methods contradict each other with respect
to the topology of the West Tsezic clade. Some methods suggest a Hinukh-Tsez
clade distinct from Khwarshi: StarlingNJ (Figure S7a in Text Supplement),
UPGMA (Figure S10a), UMP (Figure S12a). Others suggest a Tsez-Khwarshi
clade distinct from Hinukh: NJ (Figure S9a), Bayesian MCMC (Figure S11a). An
additional and immaterial discrepancy can be seen in the Bezhta clade: the two
archaic methods, StarlingNJ and UPGMA, suggest that Bezhta proper splits off
first, whereas the other methods reconstruct the Tlyadal dialect as the first outlier, although
[Figure 11, panel (a), surviving labels: time axis 500 BC to 2000 AD; East Tsezic (710 AD); Bezhta (1190 AD); Khwarshi (1060 AD); support values x / x / 94; lects Hunzib proper, Bezhta proper, Khoshar-Khota Bezhta, Tlyadal Bezhta, Khwarshi proper, Inkhokwari Khwarshi.]
[Figure 11, panel (b), surviving labels: time axis 500 BC to 2000 AD; East Tsezic (660 AD); Bezhta (1190 AD); support values x/x/x (x / x / 94); lects Hunzib proper, Bezhta proper, Khoshar-Khota Bezhta, Tlyadal Bezhta.]
Figure 11: Manually constructed strict consensus phylogenetic tree of the Tsezic lects based on the StarlingNJ, NJ, BioNJ, UPGMA, Bayesian MCMC, UMP
methods. Ternary nodes cover binary branchings that either differ depending on the method employed or are automatically joined under the assumption of the
temporal error of 300 years, or both. Statistical support values are shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in an individual
method; not shown for nodes with P ≥ 0.95 in all methods). StarlingNJ dates are proposed. (a) Non-homoplasy-optimized phylogenetic tree. (b) Homoplasy-optimized
tree; probability values from the non-optimized strict consensus tree (a) are quoted in parentheses. No topological discrepancies between (a) and (b).
[Figure 12: panels (a) and (b), NeighborNet networks of the Tsezic lects with the Chechen outgroup.]
Figure 12: Phylogenetic network of the Tsezic lects produced by the NeighborNet method from the
binary matrix in the SplitsTree4 software. Bootstrap values are shown near the branches (not shown
for stable branches with bootstrap value ≥ 95%). Branch length reflects the relative rate of cognate
replacement as suggested by SplitsTree4. (a) Non-homoplasy-optimized network (b) Homoplasy-
optimized network.
[Figure 13: panels (a) and (b), minimal lateral networks of the Tsezic lects with the Chechen outgroup; scale: Inferred Links.]
Figure 13: Minimal lateral network of the Tsezic lects produced in the LingPy software. Node
size reflects the inferred number of cognate sets present in each lect. The solid links illustrate
lateral transfer events suggested by the method. The thickness and color of the links indicate
the inferred number of homoplastic characters between the two nodes in question, as specified
by the scale on the right. (a) Non-homoplasy-optimized network; based on the strict consensus
non-homoplasy-optimized tree (Figure 11a); the gain-loss models 5−2 and 2−1 are equally best
fitting, p = 0.69 in both cases. (b) Homoplasy-optimized network of the Tsezic lects; based on
the strict consensus homoplasy-optimized tree (Figure 11b); the gain-loss model 3−1 is best
fitting, p = 0.88.
with low statistical support for the Bezhta proper–Khoshar-Khota node (ca. 0.5)
and very short distance between the two binary nodes.
Formally the strict consensus tree shown in Figure 11a can thus be con-
structed. In this tree, the neighboring binary nodes are joined (1) if the temporal
distance between them is ≤ 300 years as calculated by the StarlingNJ method,
see Figure S7a, S8a in Text Supplement; or (2) if their topology depends on the
individual phylogenetic method applied. Actually all the resulting ternary nodes
in Figure 11a both differ depending on the method and are obtained automati-
cally under the assumption of a temporal error of 300 years.
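Criterion (1) can be sketched as a simple tree transformation. The encoding below (an internal node as a pair of year and list of children, a leaf as a string, negative years for BC) and the dates in the example are hypothetical, and the routine is an illustration of the 300-year rule rather than the Starling implementation.

```python
def join_close_nodes(node, max_gap=300):
    """Merge an internal child into its parent when their dates are close.

    node: a leaf name (str) or a pair (year, [children]).
    Children are processed first, so a chain of closely dated nodes
    can collapse step by step into one multifurcation.
    """
    if isinstance(node, str):
        return node
    year, children = node
    merged = []
    for child in children:
        child = join_close_nodes(child, max_gap)
        if not isinstance(child, str) and abs(child[0] - year) <= max_gap:
            merged.extend(child[1])  # splice the grandchildren in directly
        else:
            merged.append(child)
    return (year, merged)


# Hypothetical dated tree: the nodes at 200 AD and 350 AD are 150 years apart.
dated = (-500, [(200, [(350, ["A", "B"]), "C"]), "D"])
print(join_close_nodes(dated))
```

In the example the node dated 350 AD is absorbed into the node dated 200 AD, producing a ternary node over A, B and C, while the 700-year gap to the root keeps that node distinct.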
As can be seen, the topology of the strict consensus tree (Figure 11a) is
identical to that of the StarlingNJ tree (Figure S8a).
Below I examine the reverse lexicostatistical distances for two East Tsezic
lects (ETs) – Hunzib proper and Bezhta proper – and three West Tsezic lects
(WTs) – Hinukh, Kidero Tsez, and Khwarshi proper – as obtained from the
multistate matrix; a higher percentage of shared basic vocabulary means a
smaller distance between two lects and thus a higher value in Table 1.
5 “Many Hinuq men marry Tsez women, who then move to the village of Hinuq. These women
often do not fully acquire the Hinuq language and sometimes simply continue to speak Tsez, at
least at home” (Forker 2013: 16).
(which is not adjacent to the Hinukh territory) demonstrates the same lexicos-
tatistical closeness to Hinukh as Kidero Tsez does (76–77%), it is more likely that
the usual direction of influence is Kidero Tsez > Hinukh rather than vice versa.
The fact of Tsez-Hinukh linguistic interaction and the Tsez influence on Hinukh
is generally accepted by Caucasologists; see, e.g., Bokarev (1959: 112, 236) and Forker
(2013: 12).
Second, a comparison of Hinukh with East Tsezic lects also demonstrates
irregular ratios. Four sets of three languages can be analyzed.
(1) Hunzib proper (ETs)/Bezhta proper (ETs)/Hinukh (WTs). The configuration
is normal: the two East Tsezic lects are close to each other (87%) and
equally remote from the West Tsezic lect (63%).
(2) Hunzib proper (ETs)/Bezhta proper (ETs)/Kidero Tsez (WTs). The configura-
tion is normal: the two East Tsezic lects are close to each other (87%) and
equally remote from the West Tsezic lect (53–55%).
(3) Hinukh (WTs)/Kidero Tsez (WTs)/Hunzib proper (ETs). The configuration is
not quite normal: the two West Tsezic lects are indeed close to each other
(77%), but not equally remote from the East Tsezic lect: Hinukh/Hunzib =
63%, whereas Kidero Tsez/Hunzib = only 55% (a difference of 8%).
(4) Hinukh (WTs)/Kidero Tsez (WTs)/Bezhta proper (ETs). The configuration is
even more abnormal: the two West Tsezic lects are indeed close to each
other (77%), but not equally remote from the East Tsezic lect: Hinukh/
Bezhta = 63%, whereas Kidero Tsez/Bezhta = only 53% (a difference of
10%).
As follows from the analysis of these four sets, the lexicostatistical distances
between one West Tsezic lect in particular and the two East Tsezic lects are far
from ultrametric: Hinukh demonstrates an abnormal closeness to both Bezhta
and Hunzib. This closeness should be treated as secondary, i.e., a number of
secondary lexical matches between Proto-Hinukh and Proto-East Tsezic are to be
assumed. This can be explained as a result of substantial influence between
Proto-Hinukh and Proto-East Tsezic, although the default direction of influence,
Proto-East Tsezic > Proto-Hinukh or vice versa, cannot be established by means
of a formal analysis of this kind.
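The four comparisons can be restated as a simple check of the ultrametric condition: in an ultrametric configuration, the two lowest percentages within any set of three lects should coincide. The Python sketch below uses only the percentages quoted in sets (1)–(4) above; the function name `triple_gap` and the idea of reporting the gap between the two lowest values are illustrative assumptions rather than part of the method applied in this paper.

```python
from itertools import combinations

# Shared-vocabulary percentages quoted in the text (non-optimized dataset).
SHARED = {
    frozenset(pair): value
    for pair, value in {
        ("Hunzib", "Bezhta"): 87,
        ("Hinukh", "Hunzib"): 63,
        ("Hinukh", "Bezhta"): 63,
        ("Kidero Tsez", "Hunzib"): 55,
        ("Kidero Tsez", "Bezhta"): 53,
        ("Hinukh", "Kidero Tsez"): 77,
    }.items()
}


def triple_gap(a: str, b: str, c: str) -> int:
    """Gap between the two lowest percentages within a set of three lects;
    0 means the set is perfectly ultrametric."""
    values = sorted(SHARED[frozenset(pair)] for pair in combinations((a, b, c), 2))
    return values[1] - values[0]


print(triple_gap("Hunzib", "Bezhta", "Hinukh"))       # 0
print(triple_gap("Hunzib", "Bezhta", "Kidero Tsez"))  # 2
print(triple_gap("Hinukh", "Kidero Tsez", "Hunzib"))  # 8
print(triple_gap("Hinukh", "Kidero Tsez", "Bezhta"))  # 10
```

The gaps of 8 and 10 percentage points correspond to the differences noted in sets (3) and (4).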
The NeighborNet network (Figure 12a) also depicts conflicting signals
between Hinukh and Tsez, on the one hand, and Hinukh and Bezhta, on the
other.
Thus, two stages in the history of Hinukh can be reconstructed, if we accept
that the rate of Swadesh cognate replacement is nearly strict within the Tsezic
group. Initially, Hinukh entered into close contact with Proto-East Tsezic and
subsequently Bezhta (the direction of influence is not entirely clear). Later,
The homoplastic character #7 ‘to bite’ (East Tsezic, Chechen) naturally went
undetected by LingPy, since the fact of homoplasy follows from external data
not included in the input dataset.
In rare cases LingPy reconstructs false homoplasy.
4 Results
split (West (East, South)) in the non-optimized tree; (2) among the Aghul dialects,
Gequn splits off prior to Keren. In both cases, the relevant distances are short and
the bootstrap values are rather low. Bootstrap values are slightly altered.
UPGMA (Figure S4b in Text Supplement). There are two discrepancies in
topology (marked with gray circles) as compared with the non-optimized
UPGMA tree (Figure S4a): (1) in the Rutul node, Ixrek splits off first; (2) in the
Aghul node, Keren and Gequn form a distinct clade (in both cases, the relevant
distances are very short and the bootstrap values are rather low). Bootstrap
values have become somewhat weaker.
Bayesian MCMC (Figure S5b in Text Supplement). There are no discrepancies
in topology as compared with the non-optimized MCMC tree (Figure S5a). Some
bootstrap values have changed (usually becoming stronger).
UMP (Figure S6b in Text Supplement). There are two discrepancies in
topology (marked with gray circles) as compared with the non-optimized
UMP tree (Figure S6a): (1) in the Lezgian clade, Udi splits off first followed
by Archi, as opposed to the sequence Archi then Udi in the non-optimized tree;
(2) Proto-Nuclear Lezgian shows the split (East (West, South)) as opposed to
the split (West (East, South)) seen in the non-optimized tree. Some bootstrap
values have changed (the Aghul clade in particular has become significantly
more stable).
Thus, the distance-based methods (StarlingNJ, NJ, BioNJ, UPGMA), when
applied to the homoplasy-optimized Lezgian dataset, produce trees which differ
topologically from the corresponding non-optimized trees with regard to three
points: (1) the initial Proto-Nuclear Lezgian split, (2) certain Aghul dialects, and
(3) the Rutul dialects. These changes do not seem substantial, since in all three
cases the tested phylogenetic methods vary from each other in a way implying
ternary nodes in the strict consensus trees whether based on the non-optimized
dataset (Figure 8a) or the homoplasy-optimized one (Figure 8b). The main and
unexpected result of the distance-based methods concerns nodes with bootstrap
value < 95%: their bootstrap values in the homoplasy-optimized trees are some-
what weaker as compared with the non-optimized trees (although opposite
instances also occur).
The results of the character-based methods (Bayesian MCMC, UMP) are more
significant. Firstly, bootstrap and probability values become stronger in most
cases. Secondly, the topology of the homoplasy-optimized UMP tree (Figure S6b
in Text Supplement) has changed. In particular this concerns the Proto-Lezgian
node (in the non-optimized UMP tree Archi splits off first, in disagreement with
both the traditional classification and other phylogenetic methods). This topological
change in the new UMP tree meets theoretical expectations, since the
maximum parsimony method depends on homoplasy to a greater extent than
other methods (note that for the non-optimized dataset the UMP method yields
the less likely tree, as discussed in Kassian 2015).
Before the construction of a strict consensus tree (Figure 8b), the difference
between the individual homoplasy-optimized trees obtained (Figure S1b–S6b in
Text Supplement) should be discussed. The discrepancies do not seem substan-
tial, and in fact are even less substantial than in the case of the non-optimized
dataset (Kassian 2015), since the UMP tree has now itself been altered. Let us
examine them.
(1) All distance-based methods, i.e., StarlingNJ, NJ, BioNJ, UPGMA (Figure
S1b, S3b, S4b in Text Supplement), plus the character-based method UMP
(Figure S6b) suggest consecutive Proto-Lezgian bifurcations in which the
Udi branch split off first and the Archi branch split off second. However,
the distance between the two nodes (the separation of Udi and the
separation of Archi) is short in all the distance-based trees, as follows
from the tree visualization and the probabilistic values of the branches,
and under the assumption of a temporal error of 300 years in StarlingNJ
(Figure S2b) the first split in the Lezgian group turns out to be three-way:
Udi, Archi, Nuclear Lezgian. The character-based Bayesian MCMC method
(Figure S5b) immediately suggests a ternary split into Udi, Archi and
Nuclear Lezgian.
(2) All the methods suggest the following three-part division for the Nuclear
Lezgian clade: (1) proto-West Lezgian [Tsakhur, Rutul], (2) proto-South
Lezgian [Kryts, Budukh], (3) proto-East Lezgian [Aghul, Tabasaran, Lezgi].
However, differences emerge in the hierarchy of the splits. The StarlingNJ
(Figure S1b) method suggests that West Lezgian splits off first; NJ (Figure
S3b), Bayesian MCMC (Figure S5b) and UMP (Figure S6b) suggest that East
Lezgian splits off first; UPGMA (Figure S4b) suggests that South Lezgian
splits off first. The distance between the two nodes (i.e., the consecutive
bifurcations between West, South and East protolanguages) is, however,
short in all the trees, as follows from the tree visualization and the prob-
abilistic values of the branches, and under the assumption of a temporal
error of 300 years in StarlingNJ (Figure S2b) it emerges that the Nuclear
Lezgian sub-group shows a three-way split between West, South and East.
(3) The non-Koshan Aghul dialects. All the methods (with the exception of
some UMP variants) reconstruct the distinct Aghul Proper/Fite clade, but
contradict each other as concerns the Keren and Gequn dialects. However,
under the assumption of a temporal error of 300 years in StarlingNJ (Figure
S2b) it emerges that the proto-Aghul language, after the separation of
Koshan from the rest, shows a three-way split between Keren, Gequn and
Aghul Proper/Fite.
(4) The Rutul dialects. StarlingNJ (Figure S1b) suggests that Luchek split off
first; all other methods reconstruct the separation of Ixrek as earliest (note
that formally this is a case where the results of multistate matrix analysis
are opposed to those of binary matrix analysis). At the same time, in Figure
S1b (StarlingNJ), the two Rutul nodes are remote enough chronologically to
avoid being unified even under the assumption of a temporal error of 300
years (Figure S2b). The Rutul problem is discussed in Kassian (2015):
contrary to expectation, the lexicostatistical distances in the Rutul part of
the tree are not ultrametric, implying the existence of unidentified loans
and contact-driven homoplasy between Rutul dialects.
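Point (2) above shows how a ternary node arises in a strict consensus tree: a clade is retained only if every input tree contains it. The Python sketch below illustrates this for the three competing resolutions of the Nuclear Lezgian clade. The nested-tuple tree encoding and the function names are assumptions made for the example, and the three subgroups are treated as single leaves for brevity.

```python
def leaves(tree):
    """Set of leaf labels under a node (leaves are strings, nodes are tuples)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """All clades (leaf sets of internal nodes) of a nested-tuple tree."""
    if isinstance(tree, str):
        return set()
    found = {leaves(tree)}
    for child in tree:
        found |= clades(child)
    return found


def strict_consensus_clades(trees):
    """Clades that occur in every input tree."""
    trees = list(trees)
    common = clades(trees[0])
    for tree in trees[1:]:
        common &= clades(tree)
    return common


# Resolutions reported in point (2): which subgroup splits off first.
starling_nj = (("East", "South"), "West")   # West Lezgian first
nj_mcmc_ump = (("West", "South"), "East")   # East Lezgian first
upgma = (("West", "East"), "South")         # South Lezgian first

consensus = strict_consensus_clades([starling_nj, nj_mcmc_ump, upgma])
print(len(consensus))  # 1: only the Nuclear Lezgian clade itself survives
```

Since no binary grouping of two subgroups is shared by all three trees, the consensus keeps only the node uniting West, South and East, i.e., a three-way split.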
specific proto-root with the specific Swadesh meaning in the given language). On
the other hand, it is reported in Holden and Gray (2006: 25) (where Bantu
languages were investigated) that coding lexical data in a multistate format
makes “very little difference to the results”. More tests making use of various
linguistic data are required.
The results of the MLN module of the LingPy software appear to be more
important. The minimal lateral network based on the homoplasy-optimized
dataset (Figure 10b) is significantly less conflicting than the minimal lateral
network based on the non-homoplasy-optimized dataset (Figure 10a). As follows
from the graphical representation (Figure 10b), the total number of inferred
homoplastic characters is rather modest: the highest number of conflicting
characters (2) was obtained between Archi and Proto-Aghul (namely ‘to lie’
and ‘yellow’; in both cases, however, we are dealing with false responses).
[Figure 14: tree image not reproduced. Labels recoverable from the figure: Tsezic (720 BC); East Tsezic (660 AD); Bezhta (1190 AD); Hunzib proper; Bezhta proper; Khoshar-Khota Bezhta; Tlyadal Bezhta; Hinukh.]
Figure 14: Manually constructed homoplasy-optimized strict consensus phylogenetic tree of the Tsezic lects based on the StarlingNJ, NJ, BioNJ,
UPGMA, Bayesian MCMC, UMP methods. Statistical support values are shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in
an individual method; not shown for nodes with P ≥ 0.95 in all methods). Neighboring nodes with the temporal distance between them ≤ 300 years
have not been joined.
concerns the West Tsezic cluster: Tsez and Khwarshi now form a clade distinct
from Hinukh. Bootstrap values have changed slightly. Some nodes have
acquired slightly different dates.
StarlingNJ, tree with joined nodes (Figure S8b in Text Supplement). There
are no discrepancies in topology as compared with the non-optimized StarlingNJ
tree (Figure S8a). Some nodes have acquired slightly different dates.
NJ (Figure S9b in Text Supplement). There are no discrepancies in topology
as compared with the non-optimized NJ tree (Figure S9a). Bootstrap values
become stronger.
UPGMA (Figure S10b in Text Supplement). As compared with the non-
optimized UPGMA tree (Figure S10a), the only discrepancy in topology concerns
the West Tsezic cluster: Tsez and Khwarshi form a clade distinct from Hinukh.
Bootstrap values show no change.
Bayesian MCMC (Figure S11b in Text Supplement). There are no discrepan-
cies in topology as compared with the non-optimized MCMC tree (Figure S11a).
Bootstrap values become stronger.
UMP (Figure S12b in Text Supplement). As compared with the non-optimized
UMP tree (Figure S12a), the only discrepancy in topology concerns the West
Tsezic cluster: Tsez and Khwarshi form a clade distinct from Hinukh. Bootstrap
values become stronger.
Thus, three methods – StarlingNJ, UPGMA, UMP – applied to the homo-
plasy-optimized Tsezic dataset produced trees which differ topologically from
the corresponding non-optimized trees on one point: Tsez and Khwarshi form a
distinct clade, in opposition to the distinct Hinukh-Tsez clade found in the non-
optimized trees.
All the methods, when applied to the homoplasy-optimized Tsezic dataset,
produced topologically identical trees (except for closely related Bezhta dialects)
which can be summarized in a strict consensus tree (Figure 11b), where neighboring
nodes are additionally joined if the temporal distance between them is ≤
300 years as calculated by the StarlingNJ method. As can be seen, the topology
of the homoplasy-optimized strict consensus tree (Figure 11b) is identical to both
StarlingNJ trees with joined nodes: non-homoplasy-optimized and homoplasy-
optimized (Figure S8a–b in Text Supplement).
Since the exact position of Hinukh within the West Tsezic cluster is important
for general Tsezic studies (Kassian and Testelets 2017), Figure 14 presents
a special version of the strict consensus tree which simply repeats the topology
obtained (Bezhta dialects form a ternary node), without joining neighboring
nodes whose temporal distance is ≤ 300 years (Hinukh is thus
opposed to the Tsez-Khwarshi clade).
Summing up. Homoplastic optimization makes the Tsezic topology the same
across all methods (except for Bezhta dialects) and strengthens statistical sup-
port values (bootstrap and Bayesian posterior probabilities). Some nodes acquire
slightly different dates. The resulting homoplasy-optimized strict consensus tree
is more stable than its non-optimized counterpart.
Now we can reexamine the reverse lexicostatistical distances for two East
Tsezic lects (Hunzib proper, Bezhta proper) and three West Tsezic lects (Hinukh,
Kidero Tsez, Khwarshi proper) as obtained from the multistate matrix; a higher
percentage of shared basic vocabulary means a smaller distance between two
lects and thus a higher value in Table 2.
Table 2: Reverse lexicostatistical distances for some Tsezic lects (homoplasy-optimized dataset);
values for the non-optimized dataset (Table 1) are quoted in parentheses if different.
First, the distances between the three West Tsezic lects become more normal:
Kidero Tsez is close to Khwarshi (75%), whereas Hinukh is almost equally
remote from Kidero Tsez and Khwarshi (69–72%). The difference between the
Hinukh/Kidero Tsez distance (72%) and the Hinukh/Khwarshi distance (69%)
could suggest that not all secondary, i.e., homoplastic matches in the Hinukh/
Kidero Tsez pair have been revealed by the above linguistic analysis.
Second, a comparison of Hinukh with East Tsezic lects still demonstrates
irregular ratios. Four sets of three languages can be analyzed.
(1) Hunzib proper (ETs)/Bezhta proper (ETs)/Hinukh (WTs). The configuration
is normal: the two East Tsezic lects are close to each other (86%) and
equally remote from the West Tsezic lect (60–62%).
(2) Hunzib proper (ETs)/Bezhta proper (ETs)/Kidero Tsez (WTs). The configura-
tion is normal: the two East Tsezic lects are close to each other (86%) and
equally remote from the West Tsezic lect (53–55%).
(3) Hinukh (WTs)/Kidero Tsez (WTs)/Hunzib proper (ETs). The configuration is
not normal: the two West Tsezic lects are indeed close to each other (72%),
but not equally remote from the East Tsezic lect: Hinukh/Hunzib = 62%,
whereas Kidero Tsez/Hunzib = only 55% (a difference of 7%).
(4) Hinukh (WTs)/Kidero Tsez (WTs)/Bezhta proper (ETs). The configuration is
also abnormal: the two West Tsezic lects are indeed close to each other
(72%), but not equally remote from the East Tsezic lect: Hinukh/Bezhta =
60%, whereas Kidero Tsez/Bezhta = only 53% (a difference of 7%).
As follows from the analysis of these four sets, the lexicostatistical distances
between Hinukh and the two East Tsezic lects are not ultrametric: Hinukh still
demonstrates an abnormal closeness to both Bezhta and Hunzib, although the
situation shows a slight improvement in comparison with the lexicostatistical
anomalies between these taxa for the non-optimized dataset. This could imply
that there are several cases of contact-driven parallel developments within the
110-item wordlist between Hinukh and East Tsezic (Bezhta & Hunzib) which
cannot be revealed by the current linguistic analysis.
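The "slight improvement" can be quantified by comparing, for each East Tsezic lect, how much closer it stands to Hinukh than to Kidero Tsez before and after homoplastic optimization. The Python sketch below uses only the percentages quoted in the text for the two datasets; the function name `hinukh_excess` and the choice of Kidero Tsez as the reference West Tsezic lect are illustrative assumptions.

```python
# Shared-vocabulary percentages quoted in the text.
NON_OPTIMIZED = {
    ("Hinukh", "Hunzib"): 63, ("Hinukh", "Bezhta"): 63,
    ("Kidero Tsez", "Hunzib"): 55, ("Kidero Tsez", "Bezhta"): 53,
}
OPTIMIZED = {
    ("Hinukh", "Hunzib"): 62, ("Hinukh", "Bezhta"): 60,
    ("Kidero Tsez", "Hunzib"): 55, ("Kidero Tsez", "Bezhta"): 53,
}


def hinukh_excess(shared):
    """Percentage points by which Hinukh is closer than Kidero Tsez
    to each East Tsezic lect (0 would be the ultrametric expectation)."""
    return {
        east: shared[("Hinukh", east)] - shared[("Kidero Tsez", east)]
        for east in ("Hunzib", "Bezhta")
    }


print(hinukh_excess(NON_OPTIMIZED))  # {'Hunzib': 8, 'Bezhta': 10}
print(hinukh_excess(OPTIMIZED))      # {'Hunzib': 7, 'Bezhta': 7}
```

The excess shrinks from 8 and 10 points to 7 points in both pairs, but it does not disappear.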
The homoplasy-optimized NeighborNet network (Figure 12b) demonstrates
that conflicting signals between Hinukh and Tsez, on the one hand, and
between Hinukh and Bezhta, on the other, become somewhat weaker as com-
pared to the non-optimized network (Figure 12a). On the other hand, as
expected, the minimal lateral network produced by the LingPy software only
detects a couple of (false) conflicts (Figure 13b).
5 Conclusions
It can be concluded that the reconstruction of language phylogeny should
consist of several steps.
1. A high-quality input dataset is elaborated with the help of the main phylo-
genetic methods, and a strict consensus rooted phylogenetic tree is
produced.
2. The ancestral, i.e., protolanguage character states are reconstructed (unfor-
tunately it is commonplace for several formally equal reconstructed items to
compete with each other in a number of characters).
3. Proceeding from the consensus tree and ancestral character states, the
dataset is examined for homoplasy (contact-driven parallel developments
are especially deleterious). The minimal lateral network module of LingPy is
a powerful tool for that purpose. Items identified as constituting secondary
matches should be marked as unrelated or, if we are sure of the direction of
influence, the target item should be treated as a borrowing. Note that
usually not all cases of linguistic homoplasy can be detected.
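The recoding mentioned in step 3 (marking a secondary match as unrelated) can be sketched in Python as follows. This is a hedged illustration: the matrix layout, the function `mark_unrelated`, the convention of assigning a fresh cognate class number, and the toy lects A, B, C are assumptions; the treatment of items recoded as borrowings is not shown.

```python
import copy
from typing import Dict, Optional

# Hypothetical multistate matrix: meaning -> lect -> cognate class.
Matrix = Dict[str, Dict[str, Optional[int]]]


def mark_unrelated(matrix: Matrix, meaning: str, lect: str) -> Matrix:
    """Return a copy of the matrix in which the given item receives a fresh
    cognate class, so that it no longer matches any other lect."""
    recoded = copy.deepcopy(matrix)           # leave the input dataset intact
    used = {s for s in recoded[meaning].values() if s is not None}
    recoded[meaning][lect] = max(used, default=0) + 1
    return recoded


# Toy example: suppose the match between A and B for 'to bite' has been
# identified as a parallel development in B.
toy = {"to bite": {"A": 1, "B": 1, "C": 2}}
optimized = mark_unrelated(toy, "to bite", "B")
print(optimized["to bite"])  # {'A': 1, 'B': 3, 'C': 2}
```

Returning a copy keeps the non-optimized dataset available, so that trees built from both versions can be compared.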
6 Supporting Information
– Text Supplement contains: (1) phylogenetic trees of the Lezgian and Tsezic
language groups obtained by individual methods, (2) linguistic notes on
individual homoplastic developments within the Lezgian and Tsezic word-
lists. PDF format.
6 Despite terminological similarity, the proposed algorithm differs from the “perfect phylogeny”
approach developed by Don Ringe, Tandy Warnow and Luay Nakhleh with colleagues (see Ringe et
al. 2002; Nakhleh et al. 2005; and other publications by the team). Firstly, the Maximum Parsimony
method used by the aforementioned authors is rather noise sensitive, so it is risky to reconstruct the
phylogeny only by means of the MP-method: the obtained MP-tree is expected to contain a number
of false nodes which will merely become more robust after the homoplasy cleaning process.
Secondly, the aforementioned authors proceed from some general prior knowledge about a true
Indo-European tree (e.g., Germanic is not the first outlier) and Indo-European ancestral character
states (e.g., many characters are excluded from the analysis due to their notoriously homoplastic
nature), following previous results of philological work of traditional linguists. As stated by Ringe et
al. themselves: “we are not attempting to construct a method which will allow us to input raw data
to an algorithm and derive a completely mechanical solution, in the belief that that is somehow
more ‘objective’ than using the results of traditional philological work” (Ringe et al. 2002: 79).
Figure 15: Main steps of the reconstruction of language phylogeny as discussed in the present
paper.
Four packs with various files for phylogenetic analysis are available: Lezgian
non-homoplasy-optimized, Lezgian homoplasy-optimized, Tsezic non-homo-
plasy-optimized, Tsezic homoplasy-optimized. Each pack includes:
– Multistate matrix in MS Excel format (incl. the wordlists).
– Binary matrix in NEXUS format.
– Reverse distance matrix, generated from the multistate matrix in the Starling
software, MS Excel format.
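The relation between the multistate and the binary matrix can be illustrated with a common recoding convention: each cognate class attested for a meaning becomes one presence/absence character. The Python sketch below is an assumption-laden example and does not necessarily reproduce how the supplied NEXUS files were generated; the function `to_binary`, the use of "?" for missing entries, and the toy lects are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical multistate matrix: meaning -> lect -> cognate class.
Matrix = Dict[str, Dict[str, Optional[int]]]


def to_binary(matrix: Matrix, lects: List[str]) -> Tuple[List[Tuple[str, int]], Dict[str, str]]:
    """Recode a multistate matrix as binary presence/absence characters.
    Returns the character labels (meaning, cognate class) and one string
    of '1', '0', '?' per lect."""
    columns: List[Tuple[str, int]] = []
    rows: Dict[str, List[str]] = {lect: [] for lect in lects}
    for meaning in sorted(matrix):
        states = matrix[meaning]
        attested = sorted({s for s in states.values() if s is not None})
        for state in attested:
            columns.append((meaning, state))
            for lect in lects:
                value = states.get(lect)
                if value is None:
                    rows[lect].append("?")        # missing entry
                else:
                    rows[lect].append("1" if value == state else "0")
    return columns, {lect: "".join(bits) for lect, bits in rows.items()}


toy = {
    "bite": {"A": 1, "B": 1, "C": 2},
    "yellow": {"A": 1, "B": None, "C": 1},
}
columns, binary = to_binary(toy, ["A", "B", "C"])
print(binary)  # {'A': '101', 'B': '10?', 'C': '011'}
```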
References
Alekseev, Mikhail. 1998. Tsezskie jazyki [Tsezic languages]. In Mikhail Alekseev (ed.), Jazyki
mira: Kavkazskie jazyki [Languages of the world: Caucasian languages], 299–303.
Moscow: Academia.
Atkinson, Quentin D. & Russell D. Gray. 2006. How old is the Indo-European language family?
Progress or more moths to the flame? In Peter Forster & Colin Renfrew (eds.), Phylogenetic
methods and the prehistory of languages, 91–109. Cambridge: McDonald Institute for
Archaeological Research.
Balanovsky, Oleg, Khadizhat Dibirova, Anna Dybo, Oleg Mudrak, Svetlana Frolova et al. 2011.
Parallel evolution of genes and languages in the Caucasus region. Molecular Biology and
Evolution 28(10). 2905–2920.
Berg, Helma van den. 1995. A grammar of Hunzib. With texts and lexicon. Leiden: Proefschrift ter
verkrijging van de graad van Doctor aan de Rijksuniversiteit te Leiden, 25 januari 1995.
Bokarev, Evgeny A. 1959. Tsezskie (didojskie) jazyki Dagestana [Tsezic languages of Dagestan].
Moscow: Izd-vo AN SSSR.
Bryant, David, Flavia Filimon & Russell D. Gray. 2005. Untangling our past: languages, trees,
splits and networks. In Ruth Mace, Clare Holden & Stephen Shennan (eds.), The Evolution
of cultural diversity: a phylogenetic approach, 69–85. London: UCL Press.
Bryant, David & Vincent Moulton. 2004. NeighborNet: an agglomerative algorithm for the
construction of phylogenetic networks. Molecular Biology and Evolution 21. 255–265.
Burlak, Svetlana A. & Sergei A. Starostin. 2005. Sravnitel’no-istoricheskoe jazykoznanie
[Historical linguistics]. 2nd ed. Moscow: Academia.
Chang, Will, Chundra Cathcart, David Hall & Andrew Garrett. 2015. Ancestry-constrained phylo-
genetic analysis supports the Indo-European steppe hypothesis. Language 91(1). 194–244.
Cysouw, Michael & Diana Forker. 2009. Reconstruction of morphosyntactic function: Nonspatial
usage of spatial case marking in Tsezic. Language 85(3). 588–617.
Dyen, Isidore, Joseph Kruskal & Paul Black. 1997. Comparative Indo-European database. The file was
last modified on February 5, 1997. http://www.wordgumbo.com/ie/cmp/ (accessed 7 July 2014).
Evans, Steven N., Donald Ringe & Tandy Warnow. 2006. Inference of divergence times as a
statistical inverse problem. In Peter Forster & Colin Renfrew (eds.), Phylogenetic methods
and the prehistory of languages, 119–130. Cambridge: McDonald Institute for
Archaeological Research.
Forker, Diana. 2013. A grammar of Hinuq. Berlin & Boston: Mouton De Gruyter.
Gascuel, Olivier. 1997. BIONJ: An improved version of the NJ algorithm based on a simple model
of sequence data. Molecular Biology and Evolution 14. 685–695.
Goloboff, Pablo A., James S. Farris & Kevin C. Nixon. 2008. TNT, a free program for phylogenetic
analysis. Cladistics 24(5). 774–786.
Gray, Russell D. & Quentin D. Atkinson. 2003. Language-tree divergence times support the
Anatolian theory of Indo-European origin. Nature 426. 435–439.
Harbert, Wayne. 2007. The Germanic languages. Cambridge: Cambridge University Press.
Haspelmath, Martin. 2009. Lexical borrowing: Concepts and issues. In Martin Haspelmath & Uri
Tadmor (eds.), Loanwords in the world’s languages: A comparative handbook, 35–54.
Berlin & New York: Mouton De Gruyter.
Haugen, Einar. 1950. The analysis of linguistic borrowing. Language 26(2). 210–231.
Holden, Clare J. & Russell D. Gray. 2006. Rapid radiation, borrowing and dialect continua in the
Bantu languages. In Peter Forster & Colin Renfrew (eds.), Phylogenetic methods and the
prehistory of languages, 19–31. Cambridge: McDonald Institute for Archaeological Research.
Huelsenbeck, John P. & Fredrik Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic
trees. Bioinformatics 17(8). 754–755.
Huson, Daniel H. & David Bryant. 2006. Application of phylogenetic networks in evolutionary
studies. Molecular Biology and Evolution 23(2). 254–267.
Imnaishvili, David S. 1963. Didojskij jazyk v sravnenii s ginukhskim i khvarshijskim jazykami [Tsez
language in comparison with the Hinukh and Khwarshi languages]. Tbilisi: Mecniereba.
Kassian, Alexei. 2011–2012. Annotated Swadesh wordlists for the Lezgian group (North
Caucasian family). In George Starostin (ed.), The global lexicostatistical database. Moscow
& Santa Fe: Center for Comparative Studies at the Russian State University for the
Humanities; Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Kassian, Alexei. 2013–2015. Annotated Swadesh wordlists for the Tsezic group (North
Caucasian family). In George Starostin (ed.), The global lexicostatistical database. Moscow
& Santa Fe: Center for Comparative Studies at the Russian State University for the
Humanities; Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Kassian, Alexei. 2015. Towards a formal genealogical classification of the Lezgian languages
(North Caucasus): Testing various phylogenetic methods on lexical data. PLoS ONE 10(2):
e0116950. doi:10.1371/journal.pone.0116950.
Kassian, Alexei, George Starostin, Anna Dybo & Vasiliy Chernov. 2010. The Swadesh wordlist.
An attempt at semantic specification. Journal of Language Relationship 4. 46–89.
Kassian, Alexei & Yakov Testelets. 2017. Classification of the Tsezic languages and the controversy
of Hinukh (North Caucasus). Lingua 196. 98–118. http://dx.doi.org/10.1016/j.lingua.2017.06.011
Kassian, Alexei, Mikhail Zhivlov & George Starostin. 2015. Proto–Indo-European–Uralic com-
parison from the probabilistic point of view. Journal of Indo-European Studies 43(3–4).
301–347.
Khalilova, Zaira. 2009. A grammar of Khwarshi. Leiden: Proefschrift ter verkrijging van de graad
van Doctor aan de Universiteit Leiden, 17 December 2009.
Kiparsky, Valentin. 1967. Russische historische Grammatik, Bd. 2: Die Entwicklung des
Formensystems. Heidelberg: Carl Winter Universitätsverlag.
Kitchen, Andrew, Christopher Ehret, Shiferaw Assefa & Connie J. Mulligan. 2009. Bayesian
phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic
in the Near East. Proceedings of the Royal Society B 276. 2703–2710.
Koryakov, Yuri B. 2006. Atlas kavkazskikh jazykov: s prilozheniem polnogo reestra jazykov
[Atlas of the Caucasian languages with language guide]. Moscow: Institute of Linguistics.
Kushniarevich, Alena, Olga Utevska, Marina Chuhryaeva, Anastasia Agdzhoyan, Khadizhat
Dibirova, Ingrida Uktveryte, Märt Möls, Lejla Mulahasanovic, Andrey Pshenichnov,
Svetlana Frolova, Andrey Shanko, Ene Metspalu, Maere Reidla, Kristiina Tambets, Erika
Tamm, Sergey Koshel, Valery Zaporozhchenko, Lubov Atramentova, Vaidutis Kučinskas,
Oleg Davydenko, Olga Goncharova, Irina Evseeva, Michail Churnosov, Elvira
Pocheshchova, Bayazit Yunusbayev, Elza Khusnutdinova, Damir Marjanović, Pavao Rudan,
Siiri Rootsi, Nick Yankovsky, Phillip Endicott, Alexei Kassian, Anna Dybo, The Genographic
Consortium, Chris Tyler-Smith, Elena Balanovska, Mait Metspalu, Toomas Kivisild, Richard
Villems & Oleg Balanovsky. 2015. Genetic heritage of the Balto-Slavic speaking popula-
tions: A synthesis of autosomal, mitochondrial and Y-chromosomal data. PLoS ONE 10(9):
e0135820. doi:10.1371/journal.pone.0135820.
Lees, Robert B. 1953. The basis of glottochronology. Language 29(2). 113–127.
List, Johann-Mattis & Steven Moran. 2013. An open source toolkit for quantitative historical
linguistics. In Proceedings of the 51st Annual meeting of the association for computational
linguistics: System demonstrations, 13–18. Stroudsburg, PA: Association for
Computational Linguistics.
List, Johann-Mattis, Steven Moran, Peter Bouda & Johannes Dellert. 2014c. LingPy: Python
library for quantitative tasks in historical linguistics. Version 2.4.1.alpha, DOI: 10.5281/zenodo.11886.
Marburg: Forschungszentrum Deutscher Sprachatlas. http://lingpy.org/
(accessed 28 September 2014).
List, Johann-Mattis, Shijulal Nelson-Sathi, Hans Geisler & William Martin. 2014a. Networks of
lexical borrowing and lateral gene transfer in language and genome evolution. BioEssays
36(2). 141–150.
List, Johann-Mattis, Shijulal Nelson-Sathi, William Martin & Hans Geisler. 2014b. Using phylo-
genetic networks to model Chinese dialect history. Language Dynamics and Change 4.
222–252.
Lomtadze, Elizbar. 1963. Ginukhskij dialekt didojskogo jazyka [Hinukh dialect of the Tsez
language]. Tbilisi: Mecniereba.
Makarenkov, Vladimir, Dmytro Kevorkov & Pierre Legendre. 2006. Phylogenetic network con-
struction approaches. In Dilip K. Arora, Randy M. Berka & Gautam B. Singh (eds.), Applied
mycology and biotechnology, Vol. 6: Bioinformatics, 61–98. Amsterdam & Boston:
Elsevier.
Müller, André, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Eric W. Holman et al. 2013. ASJP
world language trees of lexical similarity: Version 4 (October 2013). http://asjp.clld.org/download
(accessed 7 May 2015).
Nakhleh, Luay, Don Ringe & Tandy Warnow. 2005. Perfect phylogenetic networks: A new
methodology for reconstructing the evolutionary history of natural languages. Language
81(2). 382–420.
Nelson-Sathi, Shijulal, Johann-Mattis List, Hans Geisler, Heiner Fangerau, Russell D. Gray,
William Martin & Tal Dagan. 2011. Networks uncover hidden lexical borrowing in Indo-European
language evolution. Proceedings of the Royal Society B 278. 1794–1803.
Nikolayev (Nikolaev), Sergei L. 1978. Rekonstruktsija foneticheskoj sistemy pratsezskogo
jazyka [Reconstruction of the Proto-Tsezic phonological system]. In Victoria N. Yartseva
(ed.), Konferentsiya: Problemy rekonstruktsii (tezisy dokladov), 87–89. Moscow: Institut
jazykoznaniya AN SSSR.
Novotná, Petra & Václav Blažek. 2007. Glottochronology and its application to the Balto-Slavic
languages. Baltistica 42(2). 185–210; Baltistica 42(3). 323–346.
Pagel, Mark & Andrew Meade. 2006. Estimating rates of lexical replacement on phylogenetic trees
of languages. In Peter Forster & Colin Renfrew (eds.), Phylogenetic Methods and the Prehistory
of Languages, 173–182. Cambridge: McDonald Institute for Archaeological Research.
Renfrew, Colin. 2000. At the edge of knowability: Towards a prehistory of languages.
Cambridge Archaeological Journal 10(1). 7–34.
Rexová, Kateřina, Daniel Frynta & Jan Zrzavý. 2003. Cladistic analysis of languages: Indo-European
classification based on lexicostatistical data. Cladistics 19. 120–127.
Ringe, Don, Tandy Warnow & Ann Taylor. 2002. Indo-European and computational cladistics.
Transactions of the Philological Society 100(1). 59–129.
Saitou, Naruya & Masatoshi Nei. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4. 406–425.
Semple, Charles & Mike Steel. 2003. Phylogenetics. Oxford: Oxford University Press.
Sneath, Peter H. & Robert R. Sokal. 1973. Numerical Taxonomy. San Francisco: W. H. Freeman
and Company.
Starostin, George S. 2010. Preliminary lexicostatistics as a basis for language classification:
A new approach. Journal of Language Relationship 3. 79–116.
Starostin, George S. 2011. Annotated Swadesh wordlists for the Nakh group (North Caucasian
family). In George Starostin (ed.), The global lexicostatistical database. Moscow & Santa
Fe: Center for Comparative Studies at the Russian State University for the Humanities;
Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Starostin, George S. (ed.). 2011–2015. The global lexicostatistical database. Moscow & Santa
Fe: Center for Comparative Studies at the Russian State University for the Humanities;
Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Starostin, George S. 2013a. Jazyki Afriki. Opyt postroenija leksikostatisticheskoj klassifikatsii
[Languages of Africa: A new lexicostatistical classification]. Vol. 1: Metod. Kojsanskie jazyki
[Methodology. Khoisan languages]. Moscow: LRC.
Starostin, George S. 2013b. Lexicostatistics as a basis for language classification: Increasing
the pros, reducing the cons. In Heiner Fangerau, Hans Geisler, Thorsten Halling & William
Martin (eds.), Classification and evolution in biology, linguistics and the history of science:
Concepts – methods – visualization, 125–146. Stuttgart: Franz Steiner Verlag.
Starostin, Sergei A. 1994. Lezgian etymological database, computerized version of the Proto-
Lezgian corpus which includes some Proto-Lezgian etymologies (mostly basic lexicon
items) that have not been included in Starostin & Nikolayev 1994 due to their lack of
external cognates in other branches of North Caucasian. http://starling.rinet.ru/cgi-bin/main.cgi
(accessed 10 September 2014).
Starostin, Sergei A. (ed.). 1998–2005. The Tower of Babel: An etymological database project.
http://starling.rinet.ru/ (accessed 7 May 2015).
Starostin, Sergei A. 2000. Comparative-historical linguistics and lexicostatistics. In Colin
Renfrew, April McMahon & Larry Trask (eds.). Time depth in historical linguistics, 223–259.
Cambridge: McDonald Institute for Archaeological Research, 2000. First publ. in Vitaly
Shevoroshkin & Paul J. Sidwell (eds.), 1999, Historical linguistics and lexicostatistics,
3–50. Melbourne: Association for the History of Language.
Starostin, Sergei A. 2007. Opredelenie ustojchivosti bazisnoj leksiki [Defining the stability of
basic lexicon]. In Sergei A. Starostin, Trudy po jazykoznaniju [Works in linguistics],
827–839. Moscow: LRC.
Supplementary Material: The online version of this article offers supplementary material
(https://doi.org/10.1515/flih-2017-0008).