
Folia Linguistica Historica 2017; 38: 217–262

Alexei S. Kassian*
Linguistic homoplasy and phylogeny
reconstruction. The cases of Lezgian
and Tsezic languages (North Caucasus)
https://doi.org/10.1515/flih-2017-0008

Abstract: This paper deals with the problem of linguistic homoplasy (parallel or
backward development), how it can be detected, what kinds of linguistic homoplasy
can be distinguished and which varieties of the phenomenon are the most deleterious
for the reconstruction of language phylogeny. It is proposed that language phylogeny
reconstruction should consist of two main stages. Firstly, a strict consensus tree
should be built on the basis of high-quality input data elaborated with the help of
the main phylogenetic methods (such as Neighbor-joining, Bayesian MCMC, and
Maximum parsimony), and ancestral character states, allowing us to reveal a certain
number of homoplastic characters. Secondly, after the detected instances of homo-
plasy are eliminated from the input matrix, the consensus tree is to be compiled again.
It is expected that after homoplastic optimization it will be possible to better resolve
individual “problem clades”, and generally the homoplasy-optimized phylogeny
should be more robust than the tree constructed initially. The proposed procedure is
tested on the 110-item Swadesh wordlists of the Lezgian and Tsezic groups. The
Lezgian and Tsezic results generally support theoretical expectations. The MLN
(minimal lateral network) method, currently implemented in the LingPy software, is
a helpful tool for the detection of linguistic homoplasy.

Keywords: language classification, lexicostatistics, homoplasy, minimal lateral
network, Lezgian languages, Tsezic languages

1 Introduction1
This paper is structured as follows. Section 1 contains a general overview of
homoplasy between languages; main kinds of linguistic homoplasy are

1 This section partially overlaps with List et al. (2014b). List et al. focus on loanword detection,
but borrowings of any kind can be formally treated as a particular case of homoplasy.

*Corresponding author: Alexei S. Kassian, Institute of Linguistics of the Russian Academy of
Sciences, Moscow, Russia; Russian Presidential Academy of National Economy and Public
Administration, Moscow, Russia, E-mail: kassian@iling-ran.ru

described in Section 2. Section 3 “Material and methods” presents the linguistic
data (3.1), phylogenetic methods (3.2), and phylogenetic analysis (3.3). In
Section 4, the results obtained for the Lezgian (4.1) and the Tsezic (4.2) languages
are discussed. Finally, Section 5 presents the conclusions. In the on-line
supplement, I offer etymological comments on the Lezgian and Tsezic forms
involved and technical phylogenetic data which allow the reader to reproduce
the author’s experiments.
Homoplasy is the occurrence of parallel or backward (reverse) developments
in the course of an evolutionary process. This is a phenomenon which perturbs
input data and makes it difficult to produce a robust phylogenetic tree of the
language family in question. In some cases, intensive homoplasy makes it
impossible to reveal the true phylogeny.
Examples. Front rounded vowels (ö, ü) are found in both German and
French, but emerged independently in the two languages, and thus represent
parallel developments. An instance of a reverse development concerns the
behavior of the vocative case in the history of Russian. In Proto-Indo-
European, the vocative was expressed by a bare stem. These bare forms were
retained in Proto-Slavic and further in Old Russian, but were then lost in favor of
the nominative forms. In modern colloquial Russian, however, new vocative
forms have arisen which synchronically represent bare stems (pap ‘oh dad!’ from
pap-ə ‘father’ etc.).
If lexical characters are used (giving the procedure known as lexicostatis-
tics), a good indicator of the potential presence of secondary, i.e., homoplastic
matches between two lects is a situation in which the lexicostatistical
distances between the lects involved are not ultrametric, i.e., the condition
“d(x,z) ≤ max(d(x,y), d(y,z)) for all x, y, z” is violated. Four possible lexicostatistical
configurations involving three contemporary taxa (lects) are depicted
in Figure 1: the distances in Figure 1a–b are ultrametric, while those seen in
Figure 1c–d are not. In Figure 1a–b, the distances are more normal for natural
language evolution (in Figure 1a, lects L2 & L3 form a distinct clade; in Figure
1b, lects L1, L2 & L3 form a ternary node) than those in Figure 1c–d. As
concerns lexicostatistics and the Swadesh wordlist, there are different views
on the problem of rate of cognate replacement. For example, the original
proposal made by Morris Swadesh (1952, 1955, Lees 1953) was that cognate
replacement within the basic vocabulary can be described by a strict clock
model (according to which evolutionary rates across lineages are constant or
nearly constant). The linguistic data collected by the Moscow school – the
Tower of Babel (S. Starostin 1998–2005) and Global Lexicostatistical Database
(GLD; G. Starostin 2011–2015) projects – generally conform to this approach,
although with certain “relaxatory” improvements proposed by Sergei Starostin

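[Figure 1 diagram: four triangles of pairwise distances between L1, L2 and L3. (a) L1–L2 60%, L1–L3 60%, L2–L3 40%; (b) all three distances 50%; (c) L1–L2 50%, L1–L3 60%, L2–L3 40%; (d) L1–L2 40%, L1–L3 40%, L2–L3 60%.]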

Figure 1: Distances between three lects (L1, L2, L3). The triangles (a)–(b) are ultrametric; (c)–(d)
are not. A higher percentage of discrepancies within character states means a more distant
relationship between the two lects in question. Thus (a) L2 & L3 are close to each other and both
are equally remote from L1; (b) the three are equally distant from each other; (c) the three
distances are all unequal; (d) L2 & L3 are remote from each other and both are equally close to L1.

(S. Starostin 2007 [1989], 2000, 2007; Novotná and Blažek 2007; Balanovsky et
al. 2011). On the other hand, a number of scholars prefer to apply the relaxed
molecular clock model to language evolution, implying that the mean rate of
lexical replacement varies among branches (e.g., Gray and Atkinson 2003;
Kitchen et al. 2009). In any case, it is unlikely that the range within which
the mean rate of basic vocabulary replacement varies in practice can be very
large (perhaps except for some rare special cases such as Icelandic). Thus, the
pairs L1-L2 or L2-L3 in Figure 1c and L1-L2 or L1-L3 in Figure 1d are suspected
to contain secondary, i.e., homoplastic matches.
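
The ultrametric condition can be checked mechanically. The following minimal Python sketch tests it on the four configurations of Figure 1. The function names and the dictionary layout are illustrative assumptions and do not belong to any existing package; the percentages are those read off Figure 1.

```python
from itertools import permutations


def is_ultrametric(dist, tol=1e-9):
    """Check d(x,z) <= max(d(x,y), d(y,z)) for all ordered triples of lects.

    dist maps frozenset({lect_a, lect_b}) to a distance (here: % of discrepancies).
    """
    lects = sorted({lect for pair in dist for lect in pair})

    def d(a, b):
        return dist[frozenset((a, b))]

    return all(
        d(x, z) <= max(d(x, y), d(y, z)) + tol
        for x, y, z in permutations(lects, 3)
    )


def triangle(d12, d13, d23):
    """Hypothetical helper: build the distance table for three lects L1, L2, L3."""
    return {
        frozenset(("L1", "L2")): d12,
        frozenset(("L1", "L3")): d13,
        frozenset(("L2", "L3")): d23,
    }


# Distances as read off Figure 1 (assumed panel-to-value assignment).
FIGURE_1 = {
    "a": triangle(60, 60, 40),
    "b": triangle(50, 50, 50),
    "c": triangle(50, 60, 40),
    "d": triangle(40, 40, 60),
}

results = {panel: is_ultrametric(t) for panel, t in FIGURE_1.items()}
print(results)
```

Under these values, panels (a) and (b) pass the test and panels (c) and (d) fail it, in agreement with the caption of Figure 1.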
A more difficult task is to detect exactly which characters are homoplastic.
The original linguistic dataset represents a multistate matrix (i.e., a matrix
whose characters can have more than 2 states; for matrix compilation, see,
e.g., Atkinson and Gray 2006: 93–94). If we are dealing with lexical characters
(lexicostatistics), synonymy, i.e., the presence of more than one word in a
single slot, is almost inevitable. To my knowledge, among packages
which reconstruct phylogeny, Starling (S. Starostin 2007 [1993]; Burlak and

Starostin 2005: 271–274) is the only software able to process input matrices
containing synonyms: when the same Swadesh slot is occupied by more
than one word, i.e., by several synonyms, all possible pairs of involved
words between two languages are compared within this slot, and if there is
at least one matching pair, Starling treats the whole slot as a match2 (it goes
without saying that synonymy is deleterious for correct phylogeny reconstruction,
so the number of synonyms should be minimized during the compilation of
a dataset).
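
The matching rule for slots with synonyms described above can be illustrated by a short Python sketch. It is not Starling code: the function names, the representation of a slot as a set of cognate indices, the decision to skip slots that are empty in either lect, and the toy data are all assumptions made for illustration.

```python
def slot_matches(words_a, words_b):
    """A slot counts as a match if at least one pair of words is cognate.

    words_a, words_b: sets of cognate indices filling the same Swadesh slot.
    """
    return any(a == b for a in words_a for b in words_b)


def matching_share(list_a, list_b):
    """Share of matching slots among the slots attested in both lects.

    Assumption: a slot that is empty in either lect is left out of the count.
    """
    shared = [s for s in list_a if s in list_b and list_a[s] and list_b[s]]
    matches = sum(slot_matches(list_a[s], list_b[s]) for s in shared)
    return matches / len(shared)


# Hypothetical wordlists: slot -> set of cognate indices (synonyms allowed).
lect_x = {"moon": {1}, "we": {1, 2}, "year": {3}, "bird": set()}
lect_y = {"moon": {2}, "we": {2}, "year": {3}, "bird": {1}}

share = matching_share(lect_x, lect_y)
print(share)
```

In this toy example the slot ‘we’ counts as a match although only one of the two synonyms in the first lect is cognate with the word in the second lect, which shows how synonymy can inflate the number of matches.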
In order to make the dataset importable in most popular phylogenetic
packages, it was proposed by Gray and Atkinson (Gray and Atkinson 2003;
Atkinson and Gray 2006) to convert the original multistate matrix into a binary
format. Binarization is coding for the presence (“1”) or absence (“0”) of the
specific proto-root with the given Swadesh meaning in the language in question,
while Swadesh items superseded by loanwords or simply not documented are
marked as “?” (the difference between this procedure, adopted by the Global
Lexicostatistical Database project following S. Starostin’s (2007 [1989])
approach, and the conversion described in Atkinson and Gray (2006), is that
Atkinson and Gray treat loanwords as full-fledged items with unique cognate
indices – the so-called singletons).
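
The binarization procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each slot holds at most one word (no synonymy), the markers `LOAN` and `MISSING` and the function name `binarize` are hypothetical, and the cognate indices in the toy matrix are invented for the example. Loanwords and undocumented slots are coded as “?”, following the procedure described above rather than the singleton coding of Atkinson and Gray (2006).

```python
LOAN = "loan"    # hypothetical marker: slot filled by a loanword
MISSING = None   # hypothetical marker: slot not documented


def binarize(matrix):
    """Convert {lect: {slot: cognate index}} into presence/absence strings.

    Returns the list of binary characters as (slot, proto-root index) pairs
    and a dict mapping each lect to a string over the alphabet 1, 0, ?.
    """
    slots = sorted({slot for row in matrix.values() for slot in row})
    columns = []
    for slot in slots:
        observed = {row.get(slot, MISSING) for row in matrix.values()}
        roots = sorted(observed - {LOAN, MISSING})
        columns.extend((slot, root) for root in roots)

    binary = {}
    for lect, row in matrix.items():
        chars = []
        for slot, root in columns:
            state = row.get(slot, MISSING)
            if state in (LOAN, MISSING):
                chars.append("?")   # loanword or lacuna: unknown state
            else:
                chars.append("1" if state == root else "0")
        binary[lect] = "".join(chars)
    return columns, binary


# Toy multistate matrix with invented cognate indices.
toy = {
    "English": {"mountain": LOAN, "moon": 1},
    "German": {"mountain": 1, "moon": 1},
    "Russian": {"mountain": 2, "moon": 2},
}

columns, binary = binarize(toy)
print(columns)
print(binary)
```

Each multistate character with n attested proto-roots thus yields n binary characters, which is the source of the dependencies among characters discussed in the literature on binary recoding.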
It remains unclear how seriously such a conversion corrupts input data and
causes model misspecification. Evans et al. (2006) argue that binary recoding
creates dependencies among characters that have unknown consequences and
potentially lead to biased results. Pagel and Meade (2006), however, suggest
that dependencies among characters introduced under binary recoding will have
only a minor impact on computational results and merely create scaled versions
of the best topology – trees with shorter branch lengths and higher posterior
probabilities for subgroups, but without major changes to the subgroups themselves.
Up to now, however, almost all available tests (including my own,
based on various Global Lexicostatistical Database wordlists) suggest that the
phylogenetic results of a multistate matrix and those of its binary counterpart
are quite similar if not identical.3

2 The LingPy package (List and Moran 2013), discussed below, can process synonymy in the
same way, but LingPy functions do not include phylogeny reconstruction.
3 One of the known exceptions is the maximum parsimony phylogeny of the Indo-European
family offered in Rexová et al. (2003): the Indo-European tree obtained from the multistate
matrix seriously differs from that obtained from the binary matrix if Hittite is involved as an
outlier. I suspect that the main reason for such discrepancies could be that the input lexical
dataset contains a substantial number of errors (see the Linguistic supplement in Kushniarevich
et al. 2015, where the Indo-European Swadesh database of Dyen et al. 1997 is partially examined).

Not all homoplastic developments can be identified as such. Firstly, some
cases of backward evolution cannot be detected (at least without extra evidence
such as ancient texts or old borrowings in neighboring languages): see
Figure 2.

[Figure 2 diagram: node labels B, A.]

Figure 2: A character has two states: A, B.

An example.4 The Proto-Slavic paradigm of the word for ‘foot’ was as follows:
nominative *〈nog-a〉, locative *〈nog-oi〉. At a later stage the diphthong 〈oi〉
contracted into a front vowel which caused the fronting g > zʸ in Old Russian,
where the paradigm became nog-a, nozʸ-i͡e. Out of the two daughter languages of
Old Russian, this zʸ is retained in Modern Belarusian: nag-a, nazʸ-e, but lost in
Modern Russian: nag-a, nagʸ-e, where the locative form has undergone leveling
on the basis of the nominative (Kiparsky 1967: 34–35). In the absence of documented
Old Russian evidence, we would have hypothesized that the shift g > zʸ before a
front vowel only took place in Belarusian, rather than in the Russian-Belarusian
protolanguage (i.e., in Old Russian).
Parallel evolution can be depicted as in Figure 3.
An example. The analytic perfect tense-aspect construction participle + ‘to be’
or ‘to have’, which is widespread in modern Germanic languages, arose simulta-
neously in the individual lects (Harbert 2007: 292). However, in the absence of
evidence from ancient Germanic languages, we would instead reconstruct this
morphosyntactic pattern for Proto-Germanic, which would be an error.
From the formal point of view, if we have two characters in a multistate
matrix, each of them has at least two states with equal cost of change between
the states (e.g. one has the states A & B, the second C & D), and they take all four

4 All linguistic data in the present article are encoded in the unified transcription system of the
Global Lexicostatistical Database project, which is generally based on the IPA alphabet, with
just a few specific discrepancies (http://starling.rinet.ru/new100/UTS.htm). Traditional or ortho-
graphic representations are enclosed in 〈angle brackets〉.

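[Figure 3 diagram: (a) root A, intermediate node A, terminal nodes B, B; vs. (b) root A, intermediate node B, terminal nodes B, B.]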

Figure 3: A character has two states: A, B. (a) The state A is reconstructed for the intermediate
node, thus the tree demonstrates parallel developments A > B in two terminal nodes. (b) The
state B is reconstructed for the intermediate node, which makes it possible to avoid parallel
developments.

possible pairs of states in the matrix: “AC”, “AD”, “BC”, “BD”, then these
characters are incompatible and at least one of them must be homoplastic
(see, e.g., Semple and Steel 2003: 69–73). In such and some other cases, the
reconstructed tree topology can suggest exactly which character is homoplastic:
see Figure 4.
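
The pairwise incompatibility criterion for two two-state characters can be expressed as a short Python sketch. The function name and the data layout are assumptions made for illustration; the sketch covers only characters with exactly two states and does not handle missing data.

```python
def are_incompatible(char1, char2):
    """Return True if two two-state characters take all four pairs of states.

    char1, char2: dicts mapping each lect to its state for that character.
    Only lects present in both dicts are compared.
    """
    lects = char1.keys() & char2.keys()
    pairs = {(char1[lect], char2[lect]) for lect in lects}
    states1 = {a for a, _ in pairs}
    states2 = {b for _, b in pairs}
    return len(states1) == 2 and len(states2) == 2 and len(pairs) == 4


# The configuration of Figure 4: four lects with the pairs AC, AD, BC, BD.
first = {"T1": "A", "T2": "A", "T3": "B", "T4": "B"}
second = {"T1": "C", "T2": "D", "T3": "C", "T4": "D"}

# A hypothetical third character that agrees with the first one.
third = {"T1": "C", "T2": "C", "T3": "D", "T4": "D"}

print(are_incompatible(first, second))
print(are_incompatible(first, third))
```

The test only establishes that at least one character of an incompatible pair is homoplastic; deciding which one requires the tree topology, as in Figure 4.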

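[Figure 4 diagram: a tree whose root is labeled A (first character) and C~D (second character); its four terminal nodes carry the state pairs A/C, A/D, B/C and B/D.]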

Figure 4: Two incompatible characters. The first character has the states A, B; the second one
has the states C, D. The second character demonstrates homoplasy. The ancestral state can be
safely reconstructed for the first character (A), but not for the second one (C and D are equally
probable).

As can be seen, the reconstructed tree helps to detect homoplasy within one
multistate character (the so-called “criss-crossed” configuration): Figure 5.
However, the maximum number of homoplastic developments in a multi-
state or binary matrix can be revealed if ancestral character states, i.e., character
states for the protolanguage, are reconstructed (see the example above involving
the Germanic perfect tense-aspect construction). Reconstruction of this kind is in

[Figure 5 diagram: two trees, (a) and (b), each with the ancestral state C~D and four terminal nodes in the states C, D, C, D.]

Figure 5: A character has two states: C, D. Two kinds of the “criss-crossed” configuration are
depicted. In both cases, the ancestral state cannot be reconstructed with certainty (C and D are
equally probable from the topological point of view).

fact a non-trivial theoretical and practical task (Kassian et al. 2015); in particular,
it is impossible without an established rooted phylogenetic tree.
The picture is somewhat different when we are dealing with a binary
lexicostatistical matrix, converted from an original multistate matrix with “1”
denoting the marked state of the character and “0” the unmarked state (i.e.,
“1” = presence and “0” = absence of the specific proto-root with the specific
Swadesh meaning in the given language; the so-called presence/absence
matrix). Even if there are two incompatible characters in the input matrix
which take all four possible pairs of states: “00”, “01”, “10”, “11”, the change
1 > 0 (loss of the root) is not a significant event, since this can occur indepen-
dently in different languages and such loss can hardly be regarded as homo-
plastic. Therefore the known rooted tree topology is unhelpful for the detection
of linguistic homoplasy in this type of binary matrix. To detect the fact of
homoplasy and identify specific homoplastic characters it is necessary to recon-
struct ancestral character states.
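
How reconstructed ancestral states expose homoplasy in a presence/absence matrix can be sketched in Python. The sketch assumes that a state (1 or 0) has already been assigned to every node of a rooted tree, which in practice is the difficult part; the tree, the node names and the decision rule (losses are ignored, and a root may be gained at most once, or not at all if the protolanguage already possessed it) are illustrative assumptions rather than an established algorithm.

```python
def count_gains(parent, state):
    """Count tree edges on which the proto-root is gained (0 in parent, 1 in child).

    parent: dict node -> parent node (the root maps to None).
    state:  dict node -> 1 (root present in the slot) or 0 (absent).
    """
    return sum(
        1
        for node, par in parent.items()
        if par is not None and state[par] == 0 and state[node] == 1
    )


def is_homoplastic(parent, state):
    """Flag a binary character that needs more gains than a single innovation."""
    root = next(node for node, par in parent.items() if par is None)
    allowed_gains = 0 if state[root] == 1 else 1
    return count_gains(parent, state) > allowed_gains


# Hypothetical rooted tree ((L1, L2) N1, (L3, L4) N2) Proto.
PARENT = {
    "Proto": None, "N1": "Proto", "N2": "Proto",
    "L1": "N1", "L2": "N1", "L3": "N2", "L4": "N2",
}

parallel_gain = {"Proto": 0, "N1": 0, "N2": 0, "L1": 1, "L2": 0, "L3": 1, "L4": 0}
losses_only = {"Proto": 1, "N1": 1, "N2": 0, "L1": 1, "L2": 0, "L3": 0, "L4": 0}
reversal = {"Proto": 1, "N1": 0, "N2": 1, "L1": 1, "L2": 0, "L3": 1, "L4": 1}

for name, character in [("parallel gain", parallel_gain),
                        ("losses only", losses_only),
                        ("reversal", reversal)]:
    print(name, is_homoplastic(PARENT, character))
```

The second character shows independent losses in two branches and is not flagged, whereas the first (parallel gain) and the third (backward development) are.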
Once the phylogenetic rooted tree of the analyzed language group is
obtained and character states for the protolanguage have been reconstructed,
it is reasonable to search the input matrix for homoplastic characters and to
eliminate all parasitic matches caused by these characters. It is expected that a
new phylogenetic tree reconstructed on the basis of the elaborated matrix with
homoplastic matches eliminated will be more robust than the tree reconstructed
initially. Hence I label a phylogenetic tree produced from the standard dataset
non-homoplasy-optimized, or simply non-optimized, and a tree produced from the
examined dataset with homoplastic developments at least partially eliminated
homoplasy-optimized.
Cf. recent attempts to detect lexical homoplasy and loanwords with the help
of formal algorithms, based on the minimal lateral network (MLN) approach,

implemented in the LingPy software: Nelson-Sathi et al. (2011); List and Moran
(2013); List et al. (2014a; 2014b); below this will be tested and discussed in more
detail. For the NeighborNet network approach, see Bryant et al. (2005); Holden
and Gray (2006).

2 Main kinds of homoplasy


For practical purposes, several partially overlapping kinds of linguistic homo-
plasy can be distinguished.
(1) Lexical borrowings.
(2) Independent homoplasy.
(3) Contact-driven homoplasy.
(4) Synonymy and suppletion in a protolanguage.

2.1 Lexical borrowings


Lexical borrowings (loanwords) represent the most trivial and routine case of
homoplasy. Normally, loanwords can be identified on the basis of phonetic and
morphological evidence, although in some cases sociolinguistic or historical
information can help to detect a borrowed item. It seems reasonable to
exclude all identified loanwords from the phylogenetic analysis (S. Starostin
2007 [1989]: 416–417; G. Starostin 2013b: 133–136), i.e., to treat them as lexicographic
lacunae, when no expressions for the given semantic concepts are
documented for the given language. Sometimes loanwords are technically ana-
lyzed as full-fledged, albeit etymologically isolated items (singletons): see Ringe
et al. (2002) or the Indo-European Swadesh database Dyen et al. (1997) and some
phylogenetic studies based on Dyen et al.’s data, such as Gray and Atkinson
(2003) and Chang et al. (2015) (in Chang et al. 2015: 205, it is even explicitly stated that
the detected loanwords should not be excluded). This approach does not seem
justified, since lexical replacement by means of borrowed items does not reflect
natural language evolution; rather it reflects contingent, extra-linguistic events
dependent on sociolinguistic, cultural and political factors.
An example. Modern English 〈mountain〉 goes back to the Middle English
form, which was borrowed from Old French 〈montaigne〉 ‘mountain’. The
Swadesh slot ‘mountain’ should be left empty for the Modern English language,
since we do not know what form an inherited Modern English word for this
meaning would take.

Note that it is recommended to treat a loanword as a normal lexicostatistical
item if it has acquired its current meaning in the target language itself
(G. Starostin 2013b: 135).
Examples. Modern Demotic Greek pulˈi 〈πουλί〉 ‘bird’ originates from late
Ancient Greek puːll-í-on 〈πουλλίον〉, a diminutive of pûːll-o-s 〈ποῦλλος〉 ‘chicken’.
The latter was in turn borrowed from Latin pullus ‘young (of animals)/chick,
chicken’, but the meaning shift ‘chicken’ > ‘bird’ is an internal Greek develop-
ment; it is therefore natural to treat Modern Demotic pulˈi ‘bird’ as a full-fledged
Swadesh item. Similarly, Modern German 〈Kopf〉 ‘head’ goes back to Old High
German 〈kopf〉 ‘mug, bowl’, borrowed from Latin 〈cupa, cuppa〉 ‘cask, bowl’, but
the meaning shift ‘bowl’ > ‘head’ took place in Germanic itself, and thus
for lexicostatistical purposes Modern German 〈Kopf〉 should be regarded as a
full-fledged form.

2.2 Independent homoplasy

Independent homoplasy within the Swadesh wordlist arises relatively rarely at a
reasonable time depth and apparently cannot seriously affect the resulting
phylogenetic tree, although independent homoplasy may cause so-called long
branch attraction (i.e., the erroneous grouping of two or more long branches
as a clade) when distant taxa are analysed. Some authors, however (e.g., Ringe
et al. 2002: 69; G. Starostin 2013b: 137–143; Chang et al. 2015: 202–203), draw
special attention to natural and therefore homoplastically recurrent patterns of
semantic shifts (such as ‘leg’ > ‘foot’, ‘to sit down’ > ‘to sit’ and so on, or with
morphological derivations ‘to die’ → ‘to cause to die’ > ‘to kill’, ‘to blow’ →
‘blower’ > ‘wind’ and so forth), which indeed become more abundant when
less stable lexical items are involved, e.g., when the 200-item wordlist is used
instead of the 100-item. It should be noted that the impact of independent
homoplasy is especially insignificant if “step-by-step” reconstruction is applied,
when a protolanguage is reconstructed sequentially on the basis of protolan-
guages at the previous taxonomic level.
An example. Examining the slot ‘moon’ in the Indo-European family, we
find that two Indo-European lexemes are used for this meaning in the Slavic
group: *meːn-(e)s- (the bulk of lects) and *lowk-s-n-aː (Russian, Bulgarian,
Slovene). The same two lexemes occur for ‘moon’ in the Italic group: the former
in Umbrian and the latter in Latin. This is therefore a criss-crossed configuration
(Figure 5), which implies parallel development in these two groups. Proceeding
from various more or less formal criteria (such as tree topology and typology of
semantic shifts), we can safely reconstruct *meːn-(e)s- as underlying both the

Proto-Slavic and Proto-Italic terms for ‘moon’ (this case is methodologically
discussed in Kassian et al. 2015). Thus, the use of *lowk-s-n-aː in the meaning
‘moon’ is an independent, i.e., parallel semantic development within Slavic and
Italic. Note that once such intermediate reconstructions have been carried out,
there is no homoplasy in the slot ‘moon’ when we are dealing with the recon-
structed Proto-Slavic and Proto-Italic data.

2.3 Contact-driven homoplasy


A more critical factor is a phenomenon which could be called contact-driven
homoplasy (or advergence according to Renfrew 2000: 14). Two lects in contact
can display the same phonetic, morphosyntactic or semantic innovations
under mutual influence. This contact-driven effect can be especially strong
when closely genetically related and geographically neighboring lects are
involved (Renfrew 2000: 14; Chang et al. 2015: 204–211, 230). For lexicostatistical
purposes, two main kinds of contact-driven semantic shifts, i.e., contact-driven
homoplasy, should be noted (both are labeled loanshift by Haugen 1950: 214–215,
219–220; specified as loan translation or loan meaning extension by Haspelmath
2009: 39).
(1) Cognate or simply phonetically similar words in two lects may synchro-
nously acquire a new meaning. There is no lexical borrowing per se; all that
is borrowed is a semantic concept, supported by the phonetic similarity of
the words in question.

An example. Ukrainian rʸik underwent the shift ‘term, time period’ > ‘year’
(having superseded the more archaic ɦid ‘year’) under the influence of Polish rok
‘year’. Ukrainian rʸik and Polish rok are etymological cognates with regular sound
correspondences, and this common identity is transparent for Ukrainian and Polish
speakers. Two distinct cognate indices should be assigned to the Ukrainian and
Polish forms in the homoplasy-optimized lexicostatistical matrix or, since we are
sure that the direction of influence in this case was Polish > Ukrainian, we can go
further and simply mark Ukrainian rʸik as a loanword in the meaning ‘year’.
(2) Loan translation or semantic loan, where a semantic concept is borrowed
without any phonetic similarity or etymological relationship between the
expressions in question.

Examples. The Slovenian verb za = stop-i-ti ‘to understand’ is a morpheme-for-morpheme
borrowing from German 〈verstehen〉 ‘id.’. The German word
〈Kopf〉 ‘head’ has acquired the additional meaning ‘main word in a syntactic

phrase’ under the influence of the same polysemy seen in English 〈head〉 (the
latter instance is from Haspelmath 2009).
Apparently the first kind of contact-driven semantic shifts, supported by
phonetic similarity and etymological relationship, occurs more frequently
than the second, phonetically unsupported one (cf. similar observations by
Haugen 1950: 220), but often, when closely related lects are involved, phone-
tically and etymologically supported homoplastic developments successfully
imitate natural etymological evolution and are treated by linguists as true
cognates.
Contact-driven homoplasy is the most deleterious kind of homoplastic
development, since its prevalence and correspondingly its impact on the recon-
structed phylogeny can be significant, whereas its detection is often difficult or
impossible.
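
The two ways of treating a detected contact-driven match that were mentioned in connection with Ukrainian rʸik ‘year’ (assigning a distinct cognate index, or marking the form as a loanword) can be sketched in Python. The function names, the use of `None` for a lacuna and the cognate indices are hypothetical and serve only to illustrate the bookkeeping.

```python
import copy


def assign_distinct_index(matrix, lect, slot):
    """Return a copy of the matrix in which lect gets a fresh cognate index in slot,
    so that it no longer matches any other lect there."""
    optimized = copy.deepcopy(matrix)
    used = [row[slot] for row in matrix.values() if row.get(slot) is not None]
    optimized[lect][slot] = max(used) + 1
    return optimized


def mark_as_loan(matrix, lect, slot):
    """Return a copy of the matrix in which the slot is treated as a lacuna (None)."""
    optimized = copy.deepcopy(matrix)
    optimized[lect][slot] = None
    return optimized


# Non-optimized toy matrix: rʸik and rok share cognate index 1 (invented value).
non_optimized = {
    "Polish": {"year": 1},
    "Ukrainian": {"year": 1},
}

option_1 = assign_distinct_index(non_optimized, "Ukrainian", "year")
option_2 = mark_as_loan(non_optimized, "Ukrainian", "year")
print(option_1)
print(option_2)
```

Either option removes the parasitic Polish-Ukrainian match from the slot ‘year’ while leaving the original matrix untouched, so that trees built from both versions can be compared.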

2.4 Synonymy or suppletion in a protolanguage

In order to explain incompatible or criss-crossed lexical characters, historical
linguists sometimes propose to reconstruct synonymous roots or stems. This
implies that a single meaning was expressed by several equivalent words in a
protolanguage and that in the daughter languages this synonymy was resolved
in different ways, with only one word retaining the original meaning in each
individual lect. Such uncontrolled reconstruction of proto-synonymy makes
particularly little sense in the context of the Swadesh wordlist. The available
typological data from the Global Lexicostatistical Database project suggest that
the normal cases of technical synonymy in the 100- or 110-item wordlist
involve either morphological suppletion (for which see the next paragraph)
or inadequate lexicographical descriptions (when we are not able to choose
between two identically glossed forms and are forced to accommodate both in
the same slot). But when there are indeed two words with the same Swadesh
meaning in a language which are equal in respect of frequency and style, this
should mean that an intermediate stage of lexical replacement is registered for
the given Swadesh slot (this phenomenon can perhaps be illustrated by
Modern English 〈many〉 and 〈a lot of〉). It is expected, however, that a language
should have zero, one or at most two slots containing such “true” synonyms at
any given moment. Thus, even if a protolanguage did possess Swadesh slots
with “true” synonyms, the effect would be negligible, given the minimal
number of slots affected.
A special and more important case of lexicostatistical synonymy is morpho-
logical suppletion, if we conventionally treat suppletive stems as lexicostatistical

synonyms. For example, the GLD standard (Kassian et al. 2010; G. Starostin 2010)
recommends filling the slots of the personal pronouns ‘I’, ‘thou’, ‘we’ with both
direct and oblique stem forms if these are suppletive. Simplification of a suppletive
paradigm can produce incompatible characters (Figure 4) and criss-crossed
configurations (Figure 5).
An example. The Bulgarian personal pronoun ni-e ‘we’, nas ‘us’ corresponds
etymologically to Latin noːs ‘we, us’, whereas Lithuanian mɛːs ‘we’, mus ‘us’
corresponds etymologically to Tocharian B wes ‘we, us’. Since Bulgarian is the
closest relative of Lithuanian among these languages, we are dealing with a
criss-crossed configuration (Figure 5). However, this is not a genuine homo-
plasy, since we can securely reconstruct the suppletive paradigm *wey-s [direct]
( > Lithuanian & Tocharian)/*n(V)s- [oblique] ( > Bulgarian & Latin) for the Proto-
Indo-European language.
Note that only experts can decide whether the reconstruction of synonymous
character states for the protolanguage is reasonable or not in each individual
case. The information required to make such a decision is missing from the
input matrices. Because of this, proto-synonymy can hardly be discriminated
from real homoplasy by any formal algorithms.

3 Material and methods

3.1 Data

3.1.1 Wordlists

Within the framework of the Global Lexicostatistical Database project
(G. Starostin 2011–2015), 110-item high-quality wordlists of basic vocabulary
for 20 Lezgian lects and 9 Tsezic lects have been compiled and annotated by
the author (Kassian 2011–2012; 2013–2015).
Cognation indexes within the multistate matrices were marked with the help
of the traditional comparative method. I use the Proto-Lezgian reconstruction of
the late Sergei Starostin (S. Starostin and Nikolayev 1994: 122–179; S. Starostin
1994; S. Starostin n.d.) and the Proto-Tsezic reconstruction of Sergei Nikolayev
(Nikolayev 1978; S. Starostin and Nikolayev (Nikolaev) 1994: 110–115) with
certain corrections and improvements where necessary.
The GLD databases Kassian (2011–2012; 2013–2015) are supplemented
with reconstructed Proto-Lezgian and Proto-Tsezic 110-item wordlists. See Kassian
et al. (2015: 304–307) and G. Starostin (2013a) for the methodology and basic principles of
protolanguage wordlist reconstruction.
For tree rooting, the 110-item wordlist of the Chechen literary language
(G. Starostin 2011) has been introduced into the comparison as an outgroup.
Chechen was chosen as a language genetically related to the investigated groups
(Lezgian and Tsezic) within the Nakh-Dagestanian linguistic family, on the one
hand, and as a lect which is definitely not a member of the Lezgian or Tsezic
groups on the other. Etymological comparison between Chechen and Lezgian/
Tsezic is based on S. Starostin and Nikolayev (Nikolaev) (1994) with some
corrections drawn from G. Starostin (2011).

3.1.2 Lezgian languages

Lezgian is a relatively deep linguistic group consisting of languages spoken in
South-East Dagestan (Russian Federation) and the adjacent parts of Azerbaijan
(see Figure 6). The Lezgian group is a member of the Nakh-Dagestanian linguistic
family (or, as I believe, of the Nakh-Dagestanian clade of the North Caucasian
linguistic family).
110-item wordlists of the following 20 languages and dialects are included in
the current version of the GLD Lezgian database (Kassian 2011–2012):
– Udi (dialects: Nidzh, Vartashen),
– Archi,
– Kryts (dialects: Kryts Proper, Alyk),
– Budukh,
– Tsakhur (dialects: Mishlesh, Mikik, Gelmets),
– Rutul (dialects: Mukhad, Ixrek, Luchek),
– Aghul (dialects: Koshan, Keren, Gequn, Fite, Aghul Proper),
– Tabasaran (dialects: Northern, Southern),
– Lezgi (dialect: Gyune),
– plus the reconstructed Proto-Lezgian list.

3.1.3 Tsezic languages

Tsezic is a linguistic group comprising several languages spoken in South-West
Dagestan (Russian Federation; see Figure 7). The Tsezic group is a member of the
Nakh-Dagestanian linguistic family (or, as I believe, of the Nakh-Dagestanian
clade of the North Caucasian linguistic family).

Figure 6: Map of the modern Lezgian lects (adapted from Koryakov 2006: map #13).

Figure 7: Map of the modern Tsezic lects (adapted from Koryakov 2006: map #11).

The current version of the GLD Tsezic database (Kassian 2013–2015) features
110-item wordlists of the following 9 languages and dialects:
– Hunzib,
– Bezhta (dialects: Bezhta proper, Khoshar-Khota, Tlyadal),
– Hinukh,
– Tsez (alternative name: Dido; dialects: Kidero, Sagada),
– Khwarshi (dialects: Khwarshi proper, Inkhokwari),
– plus the reconstructed Proto-Tsezic list.

3.2 Phylogenetic methods

Lexicostatistical trees were produced by means of several phylogenetic methods.


1. Modified neighbor-joining method, designed by S. Starostin for lexicostatistical
analysis and implemented in the Starling software (method Starling neighbor-
joining, hence StarlingNJ); see (Burlak and Starostin 2005: 163–167; Kassian 2015).
The StarlingNJ trees were produced in the Starling software v.2.5.3 (see S. Starostin
2007 [1993]; Burlak and Starostin 2005: 271–274) from the lexicostatistical data-
base, which represents a multistate matrix with synonymy allowed. The allowed
synonymy means that when the same Swadesh slot is occupied by more than one
word, i.e., by several synonyms, all possible pairs of involved words between two
languages are compared within this slot: if there is at least one matching pair, the
whole slot is treated as a match. For node dating, the so-called “experimental
method” was applied, according to which each Swadesh item possesses an
individual relative index of stability (S. Starostin 2007; G. Starostin 2010). The
non-parametric bootstrap test was performed (10,000 pseudoreplicates). The
hierarchical agglomerative clustering produces a rooted tree by definition (the
last merger is the root; it coincides with the midpoint under the assumption of a
nearly uniform replacement rate). The dates of the nodes were established by
strict molecular clocks, see (S. Starostin 2007 [1989]; S. Starostin 2000; Novotná
and Václav 2007; Balanovsky et al. 2011) on scale calibration and for further
details. For data elaborated by the StarlingNJ method, two kinds of trees are
offered: a tree with binary nodes only (as produced by the NJ algorithm), and the
same tree, where neighboring nodes are joined in a single node if the temporal
distance between them is 300 years or less (300 years corresponds to the mutation
of ca. 1.5 words in a lect – a reasonable calculation error, although this temporal
interval is essentially arbitrary at the current stage of research). The trees were
visualized in Starling and then manually redrawn for best appearance.
2. Standard neighbor-joining method (hence NJ), see (Saitou and Nei 1987;
Makarenkov et al. 2006: 65–66). The trees were produced in the SplitsTree4
software v.4.13.1 (Huson and Bryant 2006) from the binary lexicostatistical
matrix (NEXUS format) which was generated from the original multistate
matrix by coding the presence (“1”) or absence (“0”) of each proto-root in
each language (Swadesh items superseded by loanwords or simply not docu-
mented are marked as “?”). The non-parametric bootstrap test was performed
(10,000 pseudoreplicates). The trees were rooted by the outgroup (the
Chechen wordlist). The trees are not dated. The trees were visualized in the
FigTree software (v.1.4.0). Furthermore, additional trees were produced by the
BioNJ method (Gascuel 1997); these are topologically identical to the NJ ones
in all cases.
3. Unweighted pair group method with arithmetic mean (hence UPGMA), see
(Sneath and Sokal 1973: 230–234; Makarenkov et al. 2006: 65–66). The trees
were produced in the SplitsTree4 software v.4.13.1 from the binary matrix
described above. The non-parametric bootstrap test was performed (10,000
pseudoreplicates). The trees were rooted by the outgroup (the Chechen
wordlist). The trees are not dated. The trees were visualized in the FigTree
software (v.1.4.0).
4. Markov chain Monte Carlo method under a Bayesian framework (hence
Bayesian MCMC), see Makarenkov et al. (2006: 68–69), as applied to
linguistic data for the first time in Gray and Atkinson (2003). The trees
were produced in the MrBayes software v.3.2.1 (Huelsenbeck and Ronquist
2001) from the binary matrix described above. I used the F81 model with
rates = gamma. The program was run 4 times using 4 concurrent Markov
chains; the Chechen language was marked as an outgroup. Each run
produced 5,000,000 tree generations with samples taken every 500 generations.
For each run, the first 25% of tree generations were discarded as a
burn-in. The consensus trees were rooted by the outgroup (the Chechen
wordlist). The trees are not dated. The trees were visualized in the FigTree
software (v.1.4.0).
5. Unweighted maximum parsimony method (hence UMP), see Makarenkov
et al. (2006: 66–67). The trees were produced in the TNT software (Willi
Hennig Society edition of TNT, v.1.1, May 2014, see Goloboff et al. 2008) from
the binary matrix described above by the branch-and-bound (“Implicit
enumeration”) algorithm. Obligatory binarization of nodes was prohibited
(“Collapse trees after the search”); the Chechen language was marked as an
outgroup. When several optimal trees of equal cost were obtained, the strict
consensus tree was produced for which the non-parametric bootstrap test
was performed (1000 pseudoreplicates). The trees were rooted by the out-
group (the Chechen wordlist). The trees are not dated. The trees were
visualized in the FigTree software (v.1.4.0).
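
The two data encodings described in items 1 and 2 above can be illustrated with a minimal Python sketch. It assumes a hypothetical toy representation in which each lect maps a Swadesh slot to a set of cognate-class labels, with an empty set for slots superseded by loanwords or not documented. The function names, the toy data, and the decision to skip empty slots when counting matches are assumptions made for illustration; the sketch does not reproduce the Starling or SplitsTree4 implementations.

```python
def shared_share(lect_a, lect_b):
    """Share of matching Swadesh slots with synonymy allowed.

    A slot counts as a match if the two lects share at least one
    cognate class among their synonyms. Slots that are empty in either
    lect (loans, gaps) are skipped; this is an assumption of the sketch.
    """
    compared = matched = 0
    for slot in lect_a.keys() & lect_b.keys():
        a, b = lect_a[slot], lect_b[slot]
        if not a or not b:
            continue
        compared += 1
        if a & b:  # at least one matching pair of synonyms
            matched += 1
    return matched / compared if compared else float("nan")


def binary_matrix(lects):
    """Code presence ("1") or absence ("0") of each (slot, root) pair.

    A slot with no inherited form in a lect is coded "?" for every
    root belonging to that slot.
    """
    columns = sorted(
        {(slot, root)
         for words in lects.values()
         for slot, roots in words.items()
         for root in roots}
    )
    rows = {}
    for name, words in lects.items():
        cells = []
        for slot, root in columns:
            roots = words.get(slot, set())
            cells.append("?" if not roots else ("1" if root in roots else "0"))
        rows[name] = "".join(cells)
    return columns, rows


# Hypothetical toy data: two lects, three slots, invented root labels.
toy = {
    "A": {"sand": {"r1"}, "bone": {"r2", "r3"}, "fish": {"r5"}},
    "B": {"sand": set(), "bone": {"r3"}, "fish": {"r6"}},
}
print(shared_share(toy["A"], toy["B"]))
print(binary_matrix(toy)[1])
```

In the toy data, 'bone' matches because one of the two synonyms in lect A shares its root with lect B, while 'sand' is left out of the comparison and coded "?" for lect B.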

Lexicostatistical networks were produced by means of several phylogenetic
methods.
6. NeighborNet method (Bryant and Moulton 2004; Makarenkov et al. 2006:
89–90). The networks were produced in the SplitsTree4 software v.4.13.1
from the binary matrix described above. The non-parametric bootstrap test
was performed (10,000 pseudoreplicates). The networks were visualized in
the SplitsTree4 software.
7. Minimal lateral network method (List et al. 2014a; List et al. 2014b). The
networks were produced and visualized in the LingPy software (List and
Moran 2013), version 2.4.1.alpha (List et al. 2014c) from the special LingPy
matrix converted from the Starling multistate matrix. The analysis was
based on the weighted parsimony approach, which assigns different weights
to gain and loss events and searches for the most parsimonious evolutionary
scenarios to explain how the characters evolved along the reference tree.
The software tests five gain-loss models (3−1, 5−2, 2−1, 3−2, 1−1) and chooses
the best-fitting of these for the given dataset.
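
The weighted parsimony idea behind item 7 can be sketched as a small dynamic-programming routine over a reference tree. The sketch below is not the LingPy implementation: it assumes that a model such as 3−1 is read as (gain weight, loss weight), that the reference tree is given as nested tuples of leaf names, and that the origin of a character above the root counts as one gain. All names and the toy tree are hypothetical.

```python
INF = float("inf")


def _add(x, y):
    """Add two (cost, gains) pairs component-wise."""
    return (x[0] + y[0], x[1] + y[1])


def _costs(node, states, gain, loss):
    """Return [(cost, gains) if absent at node, (cost, gains) if present]."""
    if isinstance(node, str):  # leaf: only the observed state is possible
        return [(INF, 0), (0, 0)] if states[node] else [(0, 0), (INF, 0)]
    # Cost of each parent-state -> child-state transition on a branch.
    step = {(0, 0): (0, 0), (0, 1): (gain, 1), (1, 0): (loss, 0), (1, 1): (0, 0)}
    totals = [(0, 0), (0, 0)]
    for child in node:
        sub = _costs(child, states, gain, loss)
        for s in (0, 1):
            best = min(_add(sub[t], step[(s, t)]) for t in (0, 1))
            totals[s] = _add(totals[s], best)
    return totals


def best_scenario(tree, states, gain, loss):
    """Cheapest scenario as (weighted cost, number of gains).

    Ties in cost are broken in favour of fewer gains.
    """
    absent, present = _costs(tree, states, gain, loss)
    return min(absent, _add(present, (gain, 1)))


# Hypothetical reference tree and one character attested in lects A and C only.
tree = ((("A", "B"), ("C", "D")), "E")
states = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
print(best_scenario(tree, states, gain=1, loss=1))
print(best_scenario(tree, states, gain=3, loss=1))
```

Under equal weights the cheapest scenario uses two independent gains, whereas under the heavier gain weight a single origin followed by two losses is cheaper. A character whose best scenario still requires more than one gain is the kind of case that a lateral-network method would flag for inspection.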

3.3 Analysis

3.3.1 The Lezgian case

A lexicostatistical analysis of the Lezgian group, performed with the aid of the
aforementioned phylogenetic methods (StarlingNJ, NJ, UPGMA, Bayesian MCMC,
UMP), was published in Kassian (2015) (the resulting trees are also reproduced in Text
Supplement as Figures S1a–S6a). The strict consensus Lezgian non-homoplasy-optimized
tree (Figure 8a) conforms very well with the traditional expert classification
of the group. In sum, the following trees and networks obtained from
the non-homoplasy-optimized dataset are analyzed below:
– Figure S1a (see Text Supplement), StarlingNJ method with binary nodes
only.
– Figure S2a (see Text Supplement), StarlingNJ method with neighboring
nodes joined.
– Figure S3a (see Text Supplement), NJ method.
– Figure S4a (see Text Supplement), UPGMA method.
– Figure S5a (see Text Supplement), Bayesian MCMC method.
– Figure S6a (see Text Supplement), UMP method.
– Figure 8a, manually constructed strict consensus tree.
– Figure 9a, NeighborNet network.
– Figure 10a, Minimal lateral network.

Figure 8: Manually constructed strict consensus phylogenetic tree of the Lezgian lects based on the
StarlingNJ, NJ, BioNJ, UPGMA, Bayesian MCMC, UMP methods (adapted from Kassian 2015: Figure 8).
Ternary nodes cover binary branchings that either differ depending on the method employed or are
automatically joined under the assumption of the temporal error of 300 years, or both. Statistical
support values are shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in an
individual method; not shown for nodes with P ≥ 0.95 in all methods). StarlingNJ dates are proposed.
(a) Non-homoplasy-optimized tree. (b) Homoplasy-optimized tree; probability values from the non-
optimized strict consensus tree (a) are quoted in parentheses. No topological discrepancies between
(a) and (b).

Figure 9: Phylogenetic network of the Lezgian lects produced by the NeighborNet method from
the binary matrix in the SplitsTree4 software. Bootstrap values are shown near the branches
(not shown for stable branches with bootstrap value ≥ 95%). Branch length reflects the relative
rate of cognate replacement as suggested by SplitsTree4. (a) Non-homoplasy-optimized
network. (b) Homoplasy-optimized network.

Figure 10: Minimal lateral network of the Lezgian lects produced in the LingPy software. Node
size reflects the inferred number of cognate sets present in each lect. The solid links illustrate
lateral transfer events suggested by the method. The thickness and color of the links indicate
the inferred number of homoplastic characters between the two nodes in question, as specified
by the scale on the right. (a) Non-homoplasy-optimized network; based on the strict consensus
non-homoplasy-optimized tree (Figure 8a); the gain-loss model 5−2 is best fitting, p = 0.55 (the
next nearest model is 2−1, p = 0.48). (b) Homoplasy-optimized network; based on the strict
consensus homoplasy-optimized tree (Figure 8b); the gain-loss model 3−1 is best fitting,
p = 0.2 (the next nearest model is 5−2, p = 0.17).

At least ca. 45 cases of homoplastic developments within the Lezgian 110-item
wordlists (Kassian 2011–2012) can be detected on the basis of the reconstructed
phylogenetic tree (Figure 8a) and the reconstructed Proto-Lezgian wordlist.
Initially I checked the Lezgian non-homoplasy-optimized dataset manually,
and subsequently the MLN module of LingPy was applied to the non-homo-
plasy-optimized dataset; the latter detected the majority (although not all) of the
homoplastic characters previously known from manual analysis. See Text
Supplement for the list of Lezgian homoplastic etymologies with individual
comments.
I now examine how the MLN module of LingPy dealt with the task of
identifying homoplasy in the input data. As was noted above, LingPy correctly
detected the majority of the aforementioned cases of homoplasy, which should
be considered a very good result. Some details, however, require additional
discussion.
Firstly, LingPy reconstructs false homoplasy if the taxa involved are too
distant from each other. For example, the Proto-Lezgian term for ‘sand’, *šːäm, is
only reconstructed on the basis of data from one outlier (Vartashen Udi) and one
Nuclear Lezgian lect (Mukhad Rutul); in other lects, inherited forms are normally
superseded by loanwords. Although there are no topological conflicts in the
character ‘sand’, LingPy treats the Vartashen Udi-Mukhad Rutul match as homo-
plastic. This problem results from binary state models used by LingPy and a
large amount of missing data (undocumented items or loans) and/or singletons
in specific semantic slots.
Another (and inevitable) type of false response concerns synonymy and
suppletion in the protolanguage, for which see §2 above. For the Lezgian
dataset, LingPy infers homoplasy in the pronominal slot ‘thou’. This slot in
fact contains two roots which both go back to the Proto-Lezgian suppletive
paradigm.
Secondly, there are several cases where LingPy quite reasonably detects
formal homoplasy, although I, as a linguist, arbitrarily prefer to leave these
characters without homoplastic optimization. These are Swadesh items which
are unstable in the Lezgian group: ‘skin’, ‘sleep’, ‘what’, ‘year’. Reflexes of the
competing protoforms in these Swadesh slots are rather complicated and their
reconstruction in Proto-Lezgian is either unreliable or simply impossible (see
Kassian 2011–2012 for discussion of individual cases).
Thirdly, among the Lezgian homoplastic etymologies treated above, a num-
ber of cases are missed by the LingPy MLN-algorithm. These can be divided into
two types.
(1) The fact that we are dealing with homoplasy follows from linguistic data
not included in the input dataset. Because of this, such homoplastic
developments cannot be detected by formal algorithms. The following slots
fall within this category: #7 ‘to bite’ (Rutul + Aghul + Tabasaran), #29 ‘fish’
(Aghul + Tabasaran), #56 ‘mouth’ (Tsakhur + Rutul), #105 ‘short’ (West
Lezgian + East Lezgian).
(2) Standard topological conflicts which could theoretically be revealed on the
basis of the input dataset; however, the total number of losses that the
algorithm is forced to infer turns out to be too expensive for a scenario in
which the character is already present at the tree root to be the most
parsimonious one. Instead, a scenario in which the character is gained
independently in several lects is inferred as the most parsimonious. The following slots fall
within this category: #10 ‘bone’ (Kryts + Budukh + Rutul + Tabasaran + Lezgi),
#13 ‘nail’ (Aghul + Lezgi), #19 ‘to drink’ (Aghul + Tabasaran), #26
‘fat’ (Aghul + Tabasaran), #38 ‘head’ (Archi + Chechen), #39 ‘to hear’
(Rutul + Aghul + Lezgi), #50 ‘louse’ (Kryts + Budukh + Tsakhur + Rutul), #72 ‘to
see’ (Kryts + Budukh + Aghul + Tabasaran), #102 ‘heavy’ (Aghul + Tabasaran),
#109 ‘worm’ (Tabasaran + Lezgi).

3.3.2 The Tsezic case

According to the traditional expert classification, the group is divided into two
main branches: East Tsezic (Hunzib & Bezhta) and West Tsezic (Tsez &
Khwarshi), see Bokarev (1959: 227) (general lexical evidence); Imnaishvili
(1963: 9) (various evidence); Testelets (1993) (lexicostatistics and phonetic evi-
dence); S. Starostin and Nikolayev (Nikolaev) (1994: 110) (lexicostatistics and
general evidence); Alekseev (1998: 299–300) (lexicostatistics and general evi-
dence); Khalilova (2009: 1) (general evidence). A slightly different classification
is provided in van den Berg (1995: 5), where Khwarshi constitutes a third distinct
branch (Northern Tsezic).
The position of the fifth language, Hinukh, is somewhat more controversial,
since in Bokarev (1959: 227) Hinukh is classified as an “intermediate lect”:
“Hinukh occupies the intermediate position: it shares one part of its [native]
lexicon with the languages of the West subgroup, and another part with the
languages of the East subgroup”. However, according to Lomtadze (1963) (var-
ious evidence), van den Berg (1995: 5) (general evidence) and Testelets (1993)
(lexicostatistics and phonetic evidence), Hinukh is a close relative of Tsez or
simply a dialect of Tsez.
Previous formal classifications of the Tsezic group suggest the same two-
way division into East Tsezic (Hunzib & Bezhta) and West Tsezic (Hinukh, Tsez
& Khwarshi):

(1) lexicostatistical calculations by Testelets (1993) which are based on the
etymologized 100-item wordlists, elaborated by means of the StarlingNJ
method: Hinukh and Tsez form a distinct clade;
(2) lexicostatistical calculations by Alekseev (1998: 300) and Koryakov (2006:
21) which are based on the etymologized 100-item wordlists, elaborated by
means of the StarlingNJ method: West Tsezic is a ternary node which splits
into Hinukh, Tsez and Khwarshi;
(3) phylogeny produced by the Automated Similarity Judgment Program project
(Müller et al. 2013), based on the Levenshtein distances between non-
etymologized 40-item wordlists: Hinukh and Tsez form a distinct clade;
(4) phylogeny by Cysouw and Forker (2009: Figures 1, 2), based on the
Levenshtein distances between non-etymologized 1300-item wordlists:
Hinukh and Tsez form a distinct clade.

Lexicostatistical analysis, based on the 110-item non-homoplasy-optimized
wordlists of the 9 Tsezic lects (Kassian 2013–2015), yielded the following
phylogenetic trees and networks:
– Figure S7a (see Text Supplement), StarlingNJ method with binary nodes
only.
– Figure S8a (see Text Supplement), StarlingNJ method with neighboring
nodes joined.
– Figure S9a (see Text Supplement), NJ method.
– Figure S10a (see Text Supplement), UPGMA method.
– Figure S11a (see Text Supplement), Bayesian MCMC method.
– Figure S12a (see Text Supplement), UMP method.
– Figure 11a, manually constructed strict consensus tree.
– Figure 12a, NeighborNet network.
– Figure 13a, Minimal lateral network.

As illustrated by the above trees, all the phylogenetic methods used recon-
struct two main clades, namely East Tsezic (Hunzib & Bezhta) and West Tsezic
(Hinukh, Tsez & Khwarshi), in agreement with the traditional and previous
formal classifications. However, the methods contradict each other with respect
to the topology of the West Tsezic clade. Some methods suggest a Hinukh-Tsez
clade distinct from Khwarshi: StarlingNJ (Figure S7a in Text Supplement),
UPGMA (Figure S10a), UMP (Figure S12a). Others suggest a Tsez-Khwarshi
clade distinct from Hinukh: NJ (Figure S9a), Bayesian MCMC (Figure S11a). An
additional and immaterial discrepancy can be seen in the Bezhta clade: the two
archaic methods, StarlingNJ and UPGMA, suggest that Bezhta proper splits off
first, other methods reconstruct the Tlyadal dialect as the first outlier, although

[Figure 11, panels (a) and (b): dated trees on a time axis from 500 BC to 2000 AD. Taxa:
Hunzib proper, Bezhta proper, Khoshar-Khota Bezhta, Tlyadal Bezhta, Hinukh, Kidero Tsez
(Dido), Sagada Tsez (Dido), Khwarshi proper, Inkhokwari Khwarshi. Node dates in (a):
Tsezic 770 BC, East Tsezic 710 AD, Bezhta 1190 AD, West Tsezic 70 AD, Tsez 820 AD,
Khwarshi 1060 AD. Node dates in (b): Tsezic 720 BC, East Tsezic 660 AD, Bezhta 1190 AD,
West Tsezic 40 BC, Tsez 820 AD, Khwarshi 1060 AD.]

Figure 11: Manually constructed strict consensus phylogenetic tree of the Tsezic lects based on
the StarlingNJ, NJ, BioNJ, UPGMA, Bayesian MCMC, UMP methods. Ternary nodes cover binary
branchings that either differ depending on the method employed or are automatically joined
under the assumption of the temporal error of 300 years, or both. Statistical support values are
shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in an individual
method; not shown for nodes with P ≥ 0.95 in all methods). StarlingNJ dates are proposed.
(a) Non-homoplasy-optimized phylogenetic tree. (b) Homoplasy-optimized tree; probability
values from the non-optimized strict consensus tree (a) are quoted in parentheses. No
topological discrepancies between (a) and (b).

Figure 12: Phylogenetic network of the Tsezic lects produced by the NeighborNet method from the
binary matrix in the SplitsTree4 software. Bootstrap values are shown near the branches (not shown
for stable branches with bootstrap value ≥ 95%). Branch length reflects the relative rate of cognate
replacement as suggested by SplitsTree4. (a) Non-homoplasy-optimized network.
(b) Homoplasy-optimized network.

Figure 13: Minimal lateral network of the Tsezic lects produced in the LingPy software. Node
size reflects the inferred number of cognate sets present in each lect. The solid links illustrate
lateral transfer events suggested by the method. The thickness and color of the links indicate
the inferred number of homoplastic characters between the two nodes in question, as specified
by the scale on the right. (a) Non-homoplasy-optimized network; based on the strict consensus
non-homoplasy-optimized tree (Figure 11a); the gain-loss models 5−2 and 2−1 are equally best
fitting, p = 0.69 in both cases. (b) Homoplasy-optimized network of the Tsezic lects; based on
the strict consensus homoplasy-optimized tree (Figure 11b); the gain-loss model 3−1 is best
fitting, p = 0.88.

with low statistical support for the Bezhta proper–Khoshar-Khota node (ca. 0.5)
and very short distance between the two binary nodes.
Formally the strict consensus tree shown in Figure 11a can thus be con-
structed. In this tree, the neighboring binary nodes are joined (1) if the temporal
distance between them is ≤ 300 years as calculated by the StarlingNJ method,
see Figures S7a, S8a in Text Supplement; or (2) if their topology depends on the
individual phylogenetic method applied. Actually all the resulting ternary nodes
in Figure 11a both differ depending on the method and are obtained automati-
cally under the assumption of a temporal error of 300 years.
As can be seen, the topology of the strict consensus tree (Figure 11a) is
identical to that of the StarlingNJ tree (Figure S8a).
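
Criterion (2) of the consensus construction, joining nodes whose topology depends on the method, can be sketched in Python as an intersection of clade sets. The sketch is an illustration only: trees are written as nested tuples, the two input topologies are simplified five-taxon versions of the conflicting West Tsezic results described above, and the function names are hypothetical. Criterion (1), the 300-year temporal threshold, is not modelled here.

```python
def clades(tree):
    """Return all clades of a nested-tuple tree as frozensets of leaf names."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def strict_consensus_clades(trees):
    """Keep only the clades that occur in every input tree."""
    return set.intersection(*(clades(tree) for tree in trees))


# Simplified topologies: Hinukh-Tsez clade vs. Tsez-Khwarshi clade.
tree_hinukh_tsez = (("Hunzib", "Bezhta"), (("Hinukh", "Tsez"), "Khwarshi"))
tree_tsez_khwarshi = (("Hunzib", "Bezhta"), ("Hinukh", ("Tsez", "Khwarshi")))

consensus = strict_consensus_clades([tree_hinukh_tsez, tree_tsez_khwarshi])
for clade in sorted(consensus, key=len):
    print(sorted(clade))
```

Neither the Hinukh-Tsez nor the Tsez-Khwarshi grouping survives the intersection, so the West Tsezic clade is left as a ternary node with three daughters.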
Below I examine the reverse lexicostatistical distances for two East Tsezic
lects (ETs) – Hunzib proper and Bezhta proper – and three West Tsezic lects
(WTs) – Hinukh, Kidero Tsez, and Khwarshi proper – as obtained from the
multistate matrix; a higher percentage of shared basic vocabulary means a
smaller distance between two lects and thus a higher value in Table 1.

Table 1: Reverse lexicostatistical distances for some Tsezic lects (non-homoplasy-optimized
dataset).

                       Hunzib         Bezhta         Hinukh   Kidero Tsez   Khwarshi
                       proper (ETs)   proper (ETs)   (WTs)    (WTs)         proper (WTs)

Hunzib proper (ETs)    –              0.87           0.63     0.55          .
Bezhta proper (ETs)                   –              0.63     0.53          .
Hinukh (WTs)                                         –        0.77          0.70
Kidero Tsez (WTs)                                             –             0.76

If we exclude Hinukh, the lexicostatistical distances between the four remaining
lects become almost ultrametric: the two East Tsezic lects are close to each other
(87% shared basic vocabulary), the two West Tsezic lects are close to each other
(76%), whereas any East Tsezic lect is almost equally remote from any West
Tsezic lect (53–55%).
When Hinukh is introduced, however, the resulting configuration is abnor-
mal. First, the distances between the three West Tsezic lects are not ultrametric,
but far from it: Kidero Tsez is equally close to Khwarshi and Hinukh (76–77%),
whereas Khwarshi and Hinukh are remote from each other (70%). This implies
that there must be a number of parasitic, i.e., secondary matches either between
Kidero Tsez and Khwarshi or between Kidero Tsez and Hinukh. Geographical
distribution (Figure 7) and ethnographic evidence5 suggest that it is the latter
pair that might be expected to show secondary contact. Since Sagada Tsez

5 “Many Hinuq men marry Tsez women, who then move to the village of Hinuq. These women
often do not fully acquire the Hinuq language and sometimes simply continue to speak Tsez, at
least at home” (Forker 2013: 16).
(which is not adjacent to the Hinukh territory) demonstrates the same lexicos-
tatistical closeness to Hinukh as Kidero Tsez does (76–77%), it is more likely that
the usual direction of influence is Kidero Tsez > Hinukh rather than vice versa.
The fact of Tsez-Hinukh linguistic interaction and the Tsez influence on Hinukh
is generally accepted by Caucasologists; see, e.g., Bokarev (1959: 112, 236) and Forker
(2013: 12).
Second, a comparison of Hinukh with East Tsezic lects also demonstrates
irregular ratios. Four sets of three languages can be analyzed.
(1) Hunzib proper (ETs)/Bezhta proper (ETs)/Hinukh (WTs). The configuration
is normal: the two East Tsezic lects are close to each other (87%) and
equally remote from the West Tsezic lect (63%).
(2) Hunzib proper (ETs)/Bezhta proper (ETs)/Kidero Tsez (WTs). The configura-
tion is normal: the two East Tsezic lects are close to each other (87%) and
equally remote from the West Tsezic lect (53–55%).
(3) Hinukh (WTs)/Kidero Tsez (WTs)/Hunzib proper (ETs). The configuration is
not quite normal: the two West Tsezic lects are indeed close to each other
(77%), but not equally remote from the East Tsezic lect: Hinukh/Hunzib =
63%, whereas Kidero Tsez/Hunzib = only 55% (a difference of 8%).
(4) Hinukh (WTs)/Kidero Tsez (WTs)/Bezhta proper (ETs). The configuration is
even more abnormal: the two West Tsezic lects are indeed close to each
other (77%), but not equally remote from the East Tsezic lect: Hinukh/
Bezhta = 63%, whereas Kidero Tsez/Bezhta = only 53% (a difference of
10%).

As follows from the analysis of these four sets, the lexicostatistical distances
between one West Tsezic lect in particular and the two East Tsezic lects are far
from ultrametric: Hinukh demonstrates an abnormal closeness to both Bezhta
and Hunzib. This closeness should be treated as secondary, i.e., a number of
secondary lexical matches between Proto-Hinukh and Proto-East Tsezic are to be
assumed. This can be explained as a result of substantial contact between
Proto-Hinukh and Proto-East Tsezic, although the prevailing direction of influence,
Proto-East Tsezic > Proto-Hinukh or vice versa, cannot be established by means
of a formal analysis of this kind.
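
The comparison of three-lect sets above amounts to a three-point check: under ultrametric distances, the two lowest similarity values in any triple should coincide. The following Python sketch applies this check to the percentages quoted in the running text. The function name is hypothetical, and deciding how large a gap counts as abnormal remains a matter of judgment rather than something the sketch determines.

```python
from itertools import combinations

# Shared basic vocabulary (%) as quoted in the running text.
SHARED = {
    frozenset(("Hunzib proper", "Bezhta proper")): 87,
    frozenset(("Hunzib proper", "Hinukh")): 63,
    frozenset(("Hunzib proper", "Kidero Tsez")): 55,
    frozenset(("Bezhta proper", "Hinukh")): 63,
    frozenset(("Bezhta proper", "Kidero Tsez")): 53,
    frozenset(("Hinukh", "Kidero Tsez")): 77,
    frozenset(("Hinukh", "Khwarshi proper")): 70,
    frozenset(("Kidero Tsez", "Khwarshi proper")): 76,
}


def three_point_gap(a, b, c, shared=SHARED):
    """Gap between the two lowest similarities in a triple.

    A gap of 0 satisfies the ultrametric three-point condition exactly;
    larger gaps indicate a stronger departure from it.
    """
    values = sorted(shared[frozenset(pair)] for pair in combinations((a, b, c), 2))
    return values[1] - values[0]


triples = [
    ("Hunzib proper", "Bezhta proper", "Hinukh"),
    ("Hunzib proper", "Bezhta proper", "Kidero Tsez"),
    ("Hinukh", "Kidero Tsez", "Hunzib proper"),
    ("Hinukh", "Kidero Tsez", "Bezhta proper"),
    ("Hinukh", "Kidero Tsez", "Khwarshi proper"),
]
for triple in triples:
    print(triple, three_point_gap(*triple))
```

The first two triples give gaps of 0 and 2 percentage points, whereas the triples that combine Hinukh and Kidero Tsez with an East Tsezic lect give 8 and 10, matching the differences reported in sets (3) and (4).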
The NeighborNet network (Figure 12a) also depicts conflicting signals
between Hinukh and Tsez, on the one hand, and Hinukh and Bezhta, on the
other.
Thus, two stages in the history of Hinukh can be reconstructed, if we accept
that the rate of Swadesh cognate replacement is nearly constant within the Tsezic
group. Initially, Hinukh entered into close contact with Proto-East Tsezic and
subsequently with Bezhta (the direction of influence is not entirely clear). Later,
Hinukh was influenced by the neighboring Tsez varieties (especially Kidero
Tsez). Cf. a similar statement by Forker, but unfortunately without any further
detail: “there has been and there still is extensive contact between Hinuq speak-
ers and speakers of two other Tsezic languages, Bezhta and Tsez” (Forker 2013:
12). Forker (2013: 16) also points to the continuation of Hinukh-Tsez contact up
to the present day.
In other words, a formal analysis of lexicostatistical distances suggests that
Tsez and Khwarshi form a clade distinct from Hinukh. There is undeniably a
certain similarity in phonology and morphosyntax between Tsez and Hinukh
that may even induce some linguists to treat Hinukh as a dialect of Tsez
(Lomtadze 1963), but in fact, according to Kassian and Testelets (2017), Tsez
and Hinukh lack any reliable shared innovations in phonology or morphosyntax
which could go back to a hypothetical distinct Tsez-Hinukh protolanguage.
The Tsezic non-homoplasy-optimized minimal lateral network (Figure 13a),
based on the strict consensus tree with ternary nodes (Figure 11a), also detects
several conflicting characters, although the number of these observed is rela-
tively modest, constituting no more than 1 character between two nodes, except
for the Hinukh-Bezhta pair which has 4 conflicting characters.
At least ca. 15 cases of homoplastic developments within the Tsezic 110-item
wordlists (Kassian 2013–2015) can be detected on the basis of the reconstructed
phylogenetic tree (Figure 11a) and the reconstructed Proto-Tsezic wordlist.
Initially I checked the Tsezic non-homoplasy-optimized dataset manually, and
subsequently the MLN module of LingPy was applied to the non-homoplasy-
optimized dataset. LingPy revealed about half of the homoplastic characters
previously known from manual analysis, a good result for such a small dataset.
See Text Supplement for the list of Tsezic homoplastic etymologies with indivi-
dual comments.
I now examine how the MLN module of LingPy dealt with the task of
identifying homoplasy in the input data. Unlike the Lezgian case treated
above, LingPy correctly detected only about half of the above cases of Tsezic
homoplasy (#1 ‘all’, #11 ‘breast’, #43 ‘to kill’, #44 ‘knee’, #70 ‘sand’, #73 ‘seed’,
#83 ‘to swim’, #92 ‘to go’). This modest outcome is due to the small number of
lects involved: 10 taxa in the Tsezic dataset vs. 21 taxa in the Lezgian dataset
(one feature of the MLN algorithm is that the chance of finding topological
conflicts decreases rapidly with the decreasing number of compared lects).
The following homoplastic developments with topological conflicts were
overlooked by the MLN algorithm: #5 ‘big’ (Hinukh, Tsez), #36 ‘hair’ (Hunzib,
Bezhta proper), #52 ‘many’ (Hinukh, Sagada Tsez), #53 ‘meat’ (East Tsezic,
Hinukh), #65 ‘rain’ (Hinukh, Tsez, Khwarshi), #71 ‘say’ (East Tsezic, Hinukh),
#89 ‘tooth’ (Hinukh, Tsez).
The homoplastic character #7 ‘to bite’ (East Tsezic, Chechen) naturally went
undetected by LingPy, since the fact of homoplasy follows from external data
not included in the input dataset.
In rare cases LingPy reconstructs false homoplasy.

4 Results

4.1 The Lezgian case

Manually modifying the etymology-based non-optimized Lezgian dataset
(Kassian 2011–2012; 2015) in accordance with the above discussion (Section
3.3.1), we obtain the homoplasy-optimized dataset. The following trees and
networks from the homoplasy-optimized dataset were produced:
– Figure S1b (see Text Supplement), StarlingNJ method with binary nodes
only.
– Figure S2b (see Text Supplement), StarlingNJ method with neighboring
nodes joined.
– Figure S3b (see Text Supplement), NJ method.
– Figure S4b (see Text Supplement), UPGMA method.
– Figure S5b (see Text Supplement), Bayesian MCMC method.
– Figure S6b (see Text Supplement), UMP method.
– Figure 8b, manually constructed strict consensus tree.
– Figure 9b, NeighborNet network.
– Figure 10b, Minimal lateral network.

The resulting homoplasy-optimized phylogenetic trees of the Lezgian lects
require some comments.
StarlingNJ, binary tree (Figure S1b in Text Supplement). As compared with
the non-optimized StarlingNJ tree (Figure S1a), the only discrepancy in topology
concerns the Aghul dialects: Keren and Gequn now form a distinct clade
(marked with a gray circle). Some internal nodes have acquired different and
usually somewhat deeper dates. Bootstrap values become somewhat weaker.
StarlingNJ, tree with joined nodes (Figure S2b in Text Supplement). There
are no discrepancies in topology as compared with the non-optimized StarlingNJ
tree (Figure S2a). Some internal nodes have acquired slightly different dates.
NJ (Figure S3b in Text Supplement). There are two discrepancies in topology
(marked with gray circles) as compared with the non-optimized NJ tree (Figure S3a):
(1) Proto-Nuclear Lezgian shows the split (East (West, South)) as opposed to the
split (West (East, South)) in the non-optimized tree; (2) among the Aghul dialects,
Gequn splits off prior to Keren. In both cases, the relevant distances are short and
the bootstrap values are rather low. Bootstrap values are slightly altered.
UPGMA (Figure S4b in Text Supplement). There are two discrepancies in
topology (marked with gray circles) as compared with the non-optimized
UPGMA tree (Figure S4a): (1) in the Rutul node, Ixrek splits off first; (2) in the
Aghul node, Keren and Gequn form a distinct clade (in both cases, the relevant
distances are very short and the bootstrap values are rather low). Bootstrap
values have become somewhat weaker.
Bayesian MCMC (Figure S5b in Text Supplement). There are no discrepancies
in topology as compared with the non-optimized MCMC tree (Figure S5a). Some
posterior probability values have changed (usually becoming stronger).
UMP (Figure S6b in Text Supplement). There are two discrepancies in
topology (marked with gray circles) as compared with the non-optimized
UMP tree (Figure S6a): (1) in the Lezgian clade, Udi splits off first followed
by Archi, as opposed to the sequence Archi then Udi in the non-optimized tree;
(2) Proto-Nuclear Lezgian shows the split (East (West, South)) as opposed to
the split (West (East, South)) seen in the non-optimized tree. Some bootstrap
values have changed (the Aghul clade in particular has become significantly
more stable).
Thus, the distance-based methods (StarlingNJ, NJ, BioNJ, UPGMA), when
applied to the homoplasy-optimized Lezgian dataset, produce trees which differ
topologically from the corresponding non-optimized trees with regard to three
points: (1) the initial Proto-Nuclear Lezgian split, (2) certain Aghul dialects, and
(3) the Rutul dialects. These changes do not seem substantial, since in all three
cases the tested phylogenetic methods vary from each other in a way implying
ternary nodes in the strict consensus trees whether based on the non-optimized
dataset (Figure 8a) or the homoplasy-optimized one (Figure 8b). The main and
unexpected result of the distance-based methods concerns nodes with bootstrap
value < 95%: their bootstrap values in the homoplasy-optimized trees are some-
what weaker as compared with the non-optimized trees (although opposite
instances also occur).
The results of the character-based methods (Bayesian MCMC, UMP) are more
significant. Firstly, bootstrap and probability values become stronger in most
cases. Secondly, the topology of the homoplasy-optimized UMP tree (Figure S6b
in Text Supplement) has changed. In particular this concerns the Proto-Lezgian
node (in the non-optimized UMP tree Archi splits off first, in disagreement with
both the traditional classification and other phylogenetic methods). Topological
change of the new UMP tree allows it to meet theoretical expectations, since the
maximum parsimony method depends on homoplasy to a greater extent than
other methods (note that for the non-optimized dataset the UMP method yields
the less likely tree, as discussed in Kassian 2015).
Before the construction of a strict consensus tree (Figure 8b), the difference
between the individual homoplasy-optimized trees obtained (Figure S1b–S6b in
Text Supplement) should be discussed. The discrepancies do not seem substan-
tial, and in fact are even less substantial than in the case of the non-optimized
dataset (Kassian 2015), since the UMP tree has now itself been altered. Let us
examine them.
(1) All distance-based methods, i.e., StarlingNJ, NJ, BioNJ, UPGMA (Figure
S1b, S3b, S4b in Text Supplement), plus the character-based method UMP
(Figure S6b) suggest consecutive Proto-Lezgian bifurcations in which the
Udi branch split off first and the Archi branch split off second. However,
the distance between the two nodes (the separation of Udi and the
separation of Archi) is short in all the distance-based trees, as follows
from the tree visualization and the probabilistic values of the branches,
and under the assumption of a temporal error of 300 years in StarlingNJ
(Figure S2b) the first split in the Lezgian group turns out to be three-way:
Udi, Archi, Nuclear Lezgian. The character-based Bayesian MCMC method
(Figure S5b) immediately suggests a ternary split into Udi, Archi and
Nuclear Lezgian.
(2) All the methods suggest the following three-part division for the Nuclear
Lezgian clade: (1) proto-West Lezgian [Tsakhur, Rutul], (2) proto-South
Lezgian [Kryts, Budukh], (3) proto-East Lezgian [Aghul, Tabasaran, Lezgi].
However, differences emerge in the hierarchy of the splits. The StarlingNJ
(Figure S1b) method suggests that West Lezgian splits off first; NJ (Figure
S3b), Bayesian MCMC (Figure S5b) and UMP (Figure S6b) suggest that East
Lezgian splits off first; UPGMA (Figure S4b) suggests that South Lezgian
splits off first. The distance between the two nodes (i.e., the consecutive
bifurcations between West, South and East protolanguages) is, however,
short in all the trees, as follows from the tree visualization and the prob-
abilistic values of the branches, and under the assumption of a temporal
error of 300 years in StarlingNJ (Figure S2b) it emerges that the Nuclear
Lezgian sub-group shows a three-way split between West, South and East.
(3) The non-Koshan Aghul dialects. All the methods (with the exception of
some UMP variants) reconstruct the distinct Aghul Proper/Fite clade, but
contradict each other as concerns the Keren and Gequn dialects. However,
under the assumption of a temporal error of 300 years in StarlingNJ (Figure
S2b) it emerges that the proto-Aghul language, after the separation of
Koshan from the rest, shows a three-way split between Keren, Gequn and
Aghul Proper/Fite.
(4) The Rutul dialects. StarlingNJ (Figure S1b) suggests that Luchek split off
first; all other methods reconstruct the separation of Ixrek as earliest (note
that formally this is a case where the results of multistate matrix analysis
are opposed to those of binary matrix analysis). At the same time, in Figure
S1b (StarlingNJ), the two Rutul nodes are remote enough chronologically to
avoid being unified even under the assumption of a temporal error of 300
years (Figure S2b). The Rutul problem is discussed in Kassian (2015):
contrary to expectation, the lexicostatistical distances in the Rutul part of
the tree are not ultrametric, implying the existence of unidentified loans
and contact-driven homoplasy between Rutul dialects.
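
The disagreement described in point (2) above can be illustrated with a minimal Python sketch of a strict consensus: only clades present in every input tree survive, and everything else collapses into a multifurcation. The nested-tuple tree encoding and the function names are assumptions made for this illustration, and the three Nuclear Lezgian branches are treated as single leaves; this is not the procedure by which Figure 8b was actually drawn.

```python
from itertools import chain


def clades(tree):
    """Return the non-trivial clades of a nested-tuple tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is a plain string
            return frozenset([node])
        members = frozenset(chain.from_iterable(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the clade containing every leaf is trivial
    return found


def strict_consensus_clades(trees):
    """Keep only the clades that occur in every input tree."""
    clade_sets = [clades(tree) for tree in trees]
    return set.intersection(*clade_sets)


# First Nuclear Lezgian split as reported for each method in point (2).
nuclear_lezgian = {
    "StarlingNJ": ("West", ("South", "East")),
    "NJ": ("East", ("West", "South")),
    "Bayesian MCMC": ("East", ("West", "South")),
    "UMP": ("East", ("West", "South")),
    "UPGMA": ("South", ("West", "East")),
}

shared = strict_consensus_clades(nuclear_lezgian.values())
print(shared)  # no two-branch clade is shared by all methods
```

Since no pairing of the three branches is supported by every method, the strict consensus leaves a ternary West/South/East node.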

Taking into account the aforementioned discrepancies, the following strict
consensus phylogenetic tree of the Lezgian lects can be manually constructed,
see Figure 8b. In this tree, neighboring nodes are joined (1) if the temporal
distance between them is ≤ 300 years as calculated by the StarlingNJ method,
see Figure S1b, S2b; or (2) if their topology depends on the individual
phylogenetic methods (the only exception is the Aghul Proper/Fite clade, which is
missing from some of the UMP trees, Figure S6b). Ternary nodes cover five
binary branchings that either differ depending on the method employed or are
automatically joined under the assumption of the temporal error of 300 years, or
both (actually all ternary nodes are obtained under the 300 years assumption
except for the node that joins together three Rutul dialects as discussed above in
this section). As can be seen, the topology of the strict consensus tree (Figure 8b)
is identical to the StarlingNJ tree (Figure S2b) except for the additional joining of
the Rutul dialects into one ternary node.
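
The joining rule under the 300-year temporal error can be sketched as follows. This is a simplified illustration, not the StarlingNJ implementation: the dictionary-based tree, the helper names and all dates are hypothetical, and the sketch assumes that collapsing is applied bottom-up, so that a chain of closely dated nodes may merge into a single node.

```python
def leaf(name):
    """A terminal lect: no date, no children."""
    return {"name": name, "date": None, "children": []}


def node(name, date, children):
    """An internal node; dates are calendar years (negative = BC)."""
    return {"name": name, "date": date, "children": children}


def join_close_nodes(tree, max_gap=300):
    """Collapse an internal child into its parent when their dates
    differ by at most max_gap years, yielding a multifurcating node."""
    if not tree["children"]:
        return tree
    merged = []
    for child in tree["children"]:
        child = join_close_nodes(child, max_gap)
        is_internal = bool(child["children"])
        if is_internal and abs(child["date"] - tree["date"]) <= max_gap:
            merged.extend(child["children"])  # the child node dissolves
        else:
            merged.append(child)
    return node(tree["name"], tree["date"], merged)


# Hypothetical dates, chosen only to illustrate the rule.
binary_tree = node("ABCD", -1000, [
    leaf("A"),
    node("BCD", -800, [
        leaf("B"),
        node("CD", 500, [leaf("C"), leaf("D")]),
    ]),
])

joined = join_close_nodes(binary_tree)
print([child["name"] for child in joined["children"]])
```

In this toy tree the node BCD lies 200 years below the root and is therefore dissolved, so the root becomes a ternary node, whereas CD is far enough from its parent to remain a distinct clade.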
Summing up. The topology of the homoplasy-optimized strict consensus
Lezgian tree (Figure 8b) is the same as that of the non-optimized strict consensus
tree (Figure 8a); some internal nodes have acquired slightly different dates.
Statistical support values (bootstrap and Bayesian posterior probabilities) are
altered in both directions, but generally the homoplasy-optimized consensus
tree is more stable, especially as concerns the initial splits of Proto-Lezgian and
the Aghul clade (this is primarily thanks to the homoplasy-optimized UMP tree,
which is now in agreement with other methods).
It is interesting that even after the extensive homoplastic optimization process
described above, it cannot be said that the Lezgian NeighborNet network based on
the homoplasy-optimized dataset (Figure 9b) seriously differs from the NeighborNet
network based on the non-homoplasy-optimized dataset (Figure 9a). This absence
of substantial difference probably comes about because NeighborNet uses binary
matrices with an equal cost of change between the states, although the input data
in fact represent presence/absence matrices (“1” = presence, “0” = absence of the
specific proto-root with the specific Swadesh meaning in the given language). On
the other hand, it is reported in Holden and Gray (2006: 25) (where Bantu
languages were investigated) that coding lexical data in a multistate format
makes “very little difference to the results”. More tests making use of various
linguistic data are required.
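
The difference between the two codings can be made concrete with a short sketch that converts a multistate matrix into a presence/absence matrix. The lect names, the cognate class numbers and the function name are hypothetical, and the sketch assumes one cognate class per meaning (no synonyms).

```python
def multistate_to_binary(matrix):
    """Convert {lect: {meaning: cognate_class}} into presence/absence rows.

    Each (meaning, cognate_class) pair becomes one binary character:
    1 = the lect has this root in this meaning, 0 = it does not.
    """
    columns = sorted({(meaning, cls)
                      for row in matrix.values()
                      for meaning, cls in row.items()})
    binary = {lect: [1 if row.get(meaning) == cls else 0
                     for meaning, cls in columns]
              for lect, row in matrix.items()}
    return columns, binary


# Hypothetical cognate class assignments for two Swadesh meanings.
multistate = {
    "lect_A": {"yellow": 1, "to lie": 1},
    "lect_B": {"yellow": 1, "to lie": 2},
    "lect_C": {"yellow": 2, "to lie": 2},
}

columns, binary = multistate_to_binary(multistate)
for lect, row in binary.items():
    print(lect, row)
```

A single lexical replacement (lect_A vs. lect_B in 'to lie') counts as one change in the multistate matrix but as two changed cells in the binary matrix, one loss and one gain, which is why equal costs for both directions of change do not match what the data represent.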
The results of the MLN module of the LingPy software appear to be more
important. The minimal lateral network based on the homoplasy-optimized
dataset (Figure 10b) is significantly less conflicting than the minimal lateral
network based on the non-homoplasy-optimized dataset (Figure 10a). As follows
from the graphical representation (Figure 10b), the total number of inferred
homoplastic characters is rather modest: the highest number of conflicting
characters (2) was obtained between Archi and Proto-Aghul (namely ‘to lie’
and ‘yellow’; in both cases, however, we are dealing with false responses).

4.2 The Tsezic case

Manually modifying the etymology-based non-optimized Tsezic dataset (Kassian
2013–2015) in accordance with the above discussion (Section 3.3.2), we obtain
the homoplasy-optimized dataset. The following trees and networks were pro-
duced from the homoplasy-optimized dataset:
– Figure S7b (see Text Supplement), StarlingNJ method with binary nodes
only.
– Figure S8b (see Text Supplement), StarlingNJ method with neighboring
nodes joined.
– Figure S9b (see Text Supplement), NJ method.
– Figure S10b (see Text Supplement), UPGMA method.
– Figure S11b (see Text Supplement), Bayesian MCMC method.
– Figure S12b (see Text Supplement), UMP method.
– Figure 11b, manually constructed strict consensus tree with neighboring
nodes joined.
– Figure 12b, NeighborNet network.
– Figure 13b, Minimal lateral network.
– Figure 14, manually constructed strict consensus tree without the temporal
error assumption.

The resulting homoplasy-optimized phylogenetic trees of the Tsezic lects
require some comments.
StarlingNJ, binary tree (Figure S7b in Text Supplement). As compared with
the non-optimized StarlingNJ tree (Figure S7a), the only discrepancy in topology
concerns the West Tsezic cluster: Tsez and Khwarshi now form a clade distinct
from Hinukh. Bootstrap values have changed slightly. Some nodes have
acquired slightly different dates.

Figure 14: Manually constructed homoplasy-optimized strict consensus phylogenetic tree of the Tsezic lects based on the StarlingNJ, NJ, BioNJ,
UPGMA, Bayesian MCMC, UMP methods. Statistical support values are shown in the following sequence: NJ/MCMC/UMP (“x” means that P ≥ 0.95 in
an individual method; not shown for nodes with P ≥ 0.95 in all methods). Neighboring nodes with the temporal distance between them ≤ 300 years
have not been joined.
[Tree diagram; time axis from 500 BC to 2000 AD. Dated nodes: Tsezic (720 BC), East Tsezic (660 AD), Bezhta (1190 AD), West Tsezic (130 BC), Tsez (820 AD), Khwarshi (1060 AD). Support values 82 / x / 67 are shown for one node.]

StarlingNJ, tree with joined nodes (Figure S8b in Text Supplement). There
are no discrepancies in topology as compared with the non-optimized StarlingNJ
tree (Figure S8a). Some nodes have acquired slightly different dates.
NJ (Figure S9b in Text Supplement). There are no discrepancies in topology
as compared with the non-optimized NJ tree (Figure S9a). Bootstrap values
become stronger.
UPGMA (Figure S10b in Text Supplement). As compared with the non-
optimized UPGMA tree (Figure S10a), the only discrepancy in topology concerns
the West Tsezic cluster: Tsez and Khwarshi form a clade distinct from Hinukh.
Bootstrap values show no change.
Bayesian MCMC (Figure S11b in Text Supplement). There are no discrepancies
in topology as compared with the non-optimized MCMC tree (Figure S11a).
Posterior probabilities become stronger.
UMP (Figure S12b in Text Supplement). As compared with the non-optimized
UMP tree (Figure S12a), the only discrepancy in topology concerns the West
Tsezic cluster: Tsez and Khwarshi form a clade distinct from Hinukh. Bootstrap
values become stronger.
Thus, three methods – StarlingNJ, UPGMA, UMP – applied to the homo-
plasy-optimized Tsezic dataset produced trees which differ topologically from
the corresponding non-optimized trees on one point: Tsez and Khwarshi form a
distinct clade, in opposition to the distinct Hinukh-Tsez clade found in the non-
optimized trees.
All the methods, when applied to the homoplasy-optimized Tsezic dataset,
produced topologically identical trees (except for closely related Bezhta dialects),
which can be summarized in a strict consensus tree (Figure 11b) in which
neighboring nodes are additionally joined if the temporal distance between them is
≤ 300 years as calculated by the StarlingNJ method. As can be seen, the topology
of the homoplasy-optimized strict consensus tree (Figure 11b) is identical to both
StarlingNJ trees with joined nodes: non-homoplasy-optimized and homoplasy-
optimized (Figure S8a–b in Text Supplement).
Since the exact position of Hinukh within the West Tsezic cluster is important
for general Tsezic studies (Kassian and Testelets 2017), Figure 14 presents
a special version of the strict consensus tree which simply repeats the topology
obtained (Bezhta dialects form a ternary node) without the joining of neighboring
nodes with the temporal distance between them ≤ 300 years (Hinukh is thus
opposed to the Tsez-Khwarshi clade).
Summing up. Homoplastic optimization makes the Tsezic topology the same
across all methods (except for Bezhta dialects) and strengthens statistical sup-
port values (bootstrap and Bayesian posterior probabilities). Some nodes acquire
slightly different dates. The resulting homoplasy-optimized strict consensus tree
is more stable than its non-optimized counterpart.
Now we can reexamine the reverse lexicostatistical distances for two East
Tsezic lects (Hunzib proper, Bezhta proper) and three West Tsezic lects (Hinukh,
Kidero Tsez, Khwarshi proper) as obtained from the multistate matrix; a higher
percentage of shared basic vocabulary means a smaller distance between two
lects and thus a higher value in Table 2.

Table 2: Reverse lexicostatistical distances for some Tsezic lects (homoplasy-optimized
dataset); values for the non-optimized dataset (Table 1) are quoted in parentheses if different.

                       Hunzib        Bezhta        Hinukh   Kidero Tsez  Khwarshi
                       proper (ETs)  proper (ETs)  (WTs)    (WTs)        proper (WTs)
Hunzib proper (ETs)    –             . (.)         . (.)    .            . (.)
Bezhta proper (ETs)                  –             . (.)    .            . (.)
Hinukh (WTs)                                       –        . (.)        . (.)
Kidero Tsez (WTs)                                           –            . (.)
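
How a value of this kind arises from a multistate matrix can be shown with a small sketch. It is a simplification and not the Starling algorithm: the function name and the toy cognate classes are hypothetical, and the treatment of loans and synonyms is ignored.

```python
def shared_percentage(lect_a, lect_b):
    """Percentage of meanings, attested in both lects, that are filled
    by the same cognate class (higher value = smaller distance)."""
    meanings = lect_a.keys() & lect_b.keys()
    if not meanings:
        raise ValueError("the two wordlists have no meanings in common")
    shared = sum(1 for meaning in meanings if lect_a[meaning] == lect_b[meaning])
    return 100.0 * shared / len(meanings)


# Hypothetical cognate classes for four Swadesh meanings.
lect_x = {"yellow": 1, "to lie": 1, "water": 1, "hand": 2}
lect_y = {"yellow": 1, "to lie": 2, "water": 1, "hand": 2}

print(shared_percentage(lect_x, lect_y))  # 3 of 4 meanings match
```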

First, the distances between the three West Tsezic lects become more normal:
Kidero Tsez is close to Khwarshi (75%), whereas Hinukh is almost equally
remote from Kidero Tsez and Khwarshi (69–72%). The difference between the
Hinukh/Kidero Tsez distance (72%) and the Hinukh/Khwarshi distance (69%)
could suggest that not all secondary, i.e., homoplastic matches in the Hinukh/
Kidero Tsez pair have been revealed by the above linguistic analysis.
Second, a comparison of Hinukh with East Tsezic lects still demonstrates
irregular ratios. Four sets of three languages can be analyzed.
(1) Hunzib proper (ETs)/Bezhta proper (ETs)/Hinukh (WTs). The configuration
is normal: the two East Tsezic lects are close to each other (86%) and
equally remote from the West Tsezic lect (60–62%).
(2) Hunzib proper (ETs)/Bezhta proper (ETs)/Kidero Tsez (WTs). The configura-
tion is normal: the two East Tsezic lects are close to each other (86%) and
equally remote from the West Tsezic lect (53–55%).
(3) Hinukh (WTs)/Kidero Tsez (WTs)/Hunzib proper (ETs). The configuration is
not normal: the two West Tsezic lects are indeed close to each other (72%),
but not equally remote from the East Tsezic lect: Hinukh/Hunzib = 62%,
whereas Kidero Tsez/Hunzib = only 55% (a difference of 7%).
(4) Hinukh (WTs)/Kidero Tsez (WTs)/Bezhta proper (ETs). The configuration is
also abnormal: the two West Tsezic lects are indeed close to each other
(72%), but not equally remote from the East Tsezic lect: Hinukh/Bezhta =
60%, whereas Kidero Tsez/Bezhta = only 53% (a difference of 7%).

As follows from the analysis of these four sets, the lexicostatistical distances
between Hinukh and the two East Tsezic lects are not ultrametric: Hinukh still
demonstrates an abnormal closeness to both Bezhta and Hunzib, although the
situation shows a slight improvement in comparison with the lexicostatistical
anomalies between these taxa for the non-optimized dataset. This could imply
that there are several cases of contact-driven parallel developments within the
110-item wordlist between Hinukh and East Tsezic (Bezhta & Hunzib) which
cannot be revealed by the current linguistic analysis.
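
The four comparisons above amount to a three-point test of ultrametricity: within any triple of lects, the two lowest similarity values should be (nearly) equal. The sketch below applies this test to the percentages quoted in the text. The tolerance of 3 percentage points is an assumption introduced here only to separate the "normal" differences of 2 points from the "abnormal" ones of 7; it is not a threshold used in the analysis itself.

```python
from itertools import combinations

# Shared-vocabulary percentages as quoted in the four comparisons above.
similarity = {
    frozenset(["Hunzib", "Bezhta"]): 86,
    frozenset(["Hinukh", "Kidero Tsez"]): 72,
    frozenset(["Hinukh", "Hunzib"]): 62,
    frozenset(["Hinukh", "Bezhta"]): 60,
    frozenset(["Kidero Tsez", "Hunzib"]): 55,
    frozenset(["Kidero Tsez", "Bezhta"]): 53,
}


def triple_gap(triple, sim):
    """Difference between the two lowest similarities within a triple;
    it is zero when the triple is perfectly ultrametric."""
    values = sorted(sim[frozenset(pair)] for pair in combinations(triple, 2))
    return values[1] - values[0]


triples = [
    ("Hunzib", "Bezhta", "Hinukh"),
    ("Hunzib", "Bezhta", "Kidero Tsez"),
    ("Hinukh", "Kidero Tsez", "Hunzib"),
    ("Hinukh", "Kidero Tsez", "Bezhta"),
]

TOLERANCE = 3  # assumed threshold, in percentage points

for triple in triples:
    gap = triple_gap(triple, similarity)
    verdict = "normal" if gap <= TOLERANCE else "not ultrametric"
    print(triple, gap, verdict)
```

Under this assumed tolerance, sets (1) and (2) pass with a gap of 2 points, while sets (3) and (4) are flagged with a gap of 7 points, in line with the assessment given in the text.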
The homoplasy-optimized NeighborNet network (Figure 12b) demonstrates
that conflicting signals between Hinukh and Tsez, on the one hand, and
between Hinukh and Bezhta, on the other, become somewhat weaker as compared
to the non-optimized network (Figure 12a). On the other hand, as
expected, the minimal lateral network produced by the LingPy software
detects only a couple of (false) conflicts (Figure 13b).

5 Conclusions
It can be concluded that the reconstruction of language phylogeny should
consist of several steps.
1. A high-quality input dataset is elaborated with the help of the main phylo-
genetic methods, and a strict consensus rooted phylogenetic tree is
produced.
2. The ancestral, i.e., protolanguage character states are reconstructed (unfor-
tunately it is commonplace for several formally equal reconstructed items to
compete with each other in a number of characters).
3. Proceeding from the consensus tree and ancestral character states, the
dataset is examined for homoplasy (contact-driven parallel developments
are especially deleterious). The minimal lateral network module of LingPy is
a powerful tool for that purpose. Items identified as constituting secondary
matches should be marked as unrelated or, if we are sure of the direction of
influence, the target item should be treated as a borrowing. Note that
usually not all cases of linguistic homoplasy can be detected.
4. The homoplasy-optimized dataset is elaborated with the help of the main
phylogenetic methods, and the strict consensus phylogenetic tree is produced.6
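
Step 3 can be illustrated with a minimal sketch that flags characters requiring more changes on a given tree than the number of their states demands. The sketch uses Fitch parsimony counting on a strictly binary tree, which is a stand-in chosen for brevity and not the MLN algorithm of LingPy. The topology follows Figure 14 with one lect per language; the function names and both characters are hypothetical.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, minimum number of changes)
    for one character on a strictly binary nested-tuple tree."""
    if isinstance(tree, str):  # leaf: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_steps(left, states)
    right_set, right_cost = fitch_steps(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def homoplasy_excess(tree, states):
    """Changes beyond the minimum (number of observed states minus 1)."""
    _, steps = fitch_steps(tree, states)
    return steps - (len(set(states.values())) - 1)


# Topology as in Figure 14, simplified to one lect per language.
tsezic = (("Hunzib", "Bezhta"), ("Hinukh", ("Tsez", "Khwarshi")))

# Hypothetical presence/absence characters.
characters = {
    "clean": {"Hunzib": 1, "Bezhta": 1, "Hinukh": 0, "Tsez": 0, "Khwarshi": 0},
    "conflicting": {"Hunzib": 0, "Bezhta": 1, "Hinukh": 1, "Tsez": 0, "Khwarshi": 0},
}

flagged = [name for name, states in characters.items()
           if homoplasy_excess(tsezic, states) > 0]
print(flagged)
```

The character shared only by Bezhta and Hinukh needs two changes on this tree instead of one, so it is flagged as a candidate for linguistic inspection. Whether it reflects a borrowing, a parallel development or a coding error still has to be decided manually.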

The whole process can be graphically represented as a flowchart; see
Figure 15.
After homoplastic optimization, individual clades can be more accurately
resolved, and in general the homoplasy-optimized phylogeny should be more
robust than the tree reconstructed initially. The Lezgian and Tsezic datasets
tested above confirm these expectations. It must also be noted that, in both
the Lezgian and Tsezic cases, statistical support values (bootstrap and Bayesian
posterior probabilities) are primarily increased for the character-based methods
(Bayesian MCMC and Maximum parsimony); the distance-based methods, such
as NJ and UPGMA, turned out to be less noise-sensitive.
It should be borne in mind, however, that the proposed step-by-step proce-
dure is designed to make the initially obtained phylogenetic tree more robust
and more highly resolved. But if the language contacts have been too strong and
have disturbed the input data enough, a situation can result in which all
methods reconstruct an incorrect topology based on the non-homoplasy-opti-
mized data. In such a case, homoplastic optimization only serves to exacerbate
initial topological errors instead of helping to uncover the true phylogeny.

6 Supporting Information
– Text Supplement contains: (1) phylogenetic trees of the Lezgian and Tsezic
language groups obtained by individual methods, (2) linguistic notes on
individual homoplastic developments within the Lezgian and Tsezic word-
lists. PDF format.

6 Despite terminological similarity, the proposed algorithm differs from the “perfect phylogeny”
approach developed by Don Ringe, Tandy Warnow and Luay Nakhleh with colleagues (see Ringe et
al. 2002; Nakhleh et al. 2005; and other publications by the team). Firstly, the Maximum Parsimony
method used by the aforementioned authors is rather noise sensitive, so it is risky to reconstruct the
phylogeny only by means of the MP-method: the obtained MP-tree is expected to contain a number
of false nodes which will merely become more robust after the homoplasy cleaning process.
Secondly, the aforementioned authors proceed from some general prior knowledge about a true
Indo-European tree (e.g., Germanic is not the first outlier) and Indo-European ancestral character
states (e.g., many characters are excluded from the analysis due to their notoriously homoplastic
nature), following previous results of philological work of traditional linguists. As stated by Ringe et
al. themselves: “we are not attempting to construct a method which will allow us to input raw data
to an algorithm and derive a completely mechanical solution, in the belief that that is somehow
more ‘objective’ than using the results of traditional philological work” (Ringe et al. 2002: 79).

Figure 15: Main steps of the reconstruction of language phylogeny as discussed in the present
paper.

Four packs with various files for phylogenetic analysis are available: Lezgian
non-homoplasy-optimized, Lezgian homoplasy-optimized, Tsezic non-homo-
plasy-optimized, Tsezic homoplasy-optimized. Each pack includes:
– Multistate matrix in MS Excel format (incl. the wordlists).
– Binary matrix in NEXUS format.
– Reverse distance matrix, generated from the multistate matrix in the Starling
software, MS Excel format.
– Distance matrix, generated from the binary matrix in the SplitsTree4
software.
– Files for minimal lateral network (MLN) analysis in the LingPy software,
various formats.

Acknowledgements: I express my sincere thanks to Michael Cysouw (Marburg),
Johann-Mattis List (Paris), George Starostin, Yakov Testelets, and Mikhail
Zhivlov (all Moscow) for the discussion and valuable comments. I remain
responsible for all possible errors of fact or interpretation.

References
Alekseev, Mikhail. 1998. Tsezskie jazyki [Tsezic languages]. In Mikhail Alekseev (ed.), Jazyki
mira: Kavkazskie jazyki [Languages of the world: Caucasian languages], 299–303.
Moscow: Academia.
Atkinson, Quentin D. & Russell D. Gray. 2006. How old is the Indo-European language family?
Progress or more moths to the flame? In Peter Forster & Colin Renfrew (eds.), Phylogenetic
methods and the prehistory of languages, 91–109. Cambridge: McDonald Institute for
Archaeological Research.
Balanovsky, Oleg, Khadizhat Dibirova, Anna Dybo, Oleg Mudrak, Svetlana Frolova et al. 2011.
Parallel evolution of genes and languages in the Caucasus region. Molecular Biology and
Evolution 28(10). 2905–2920.
Berg, Helma van den. 1995. A grammar of Hunzib. With texts and lexicon. Leiden: Proefschrift ter
verkrijging van de graad van Doctor aan de Rijksuniversiteit te Leiden, 25 januari 1995.
Bokarev, Evgeny A. 1959. Tsezskie (didojskie) jazyki Dagestana [Tsezic languages of Dagestan].
Moscow: Izd-vo AN SSSR.
Bryant, David, Flavia Filimon & Russell D. Gray. 2005. Untangling our past: languages, trees,
splits and networks. In Ruth Mace, Clare Holden & Stephen Shennan (eds.), The Evolution
of cultural diversity: a phylogenetic approach, 69–85. London: UCL Press.
Bryant, David & Vincent Moulton. 2004. NeighborNet: an agglomerative algorithm for the
construction of phylogenetic networks. Molecular Biology and Evolution 21. 255–265.
Burlak, Svetlana A. & Sergei A. Starostin. 2005. Sravnitel’no-istoricheskoe jazykoznanie
[Historical linguistics]. 2nd ed. Moscow: Academia.
Chang, Will, Chundra Cathcart, David Hall & Andrew Garrett. 2015. Ancestry-constrained phylo-
genetic analysis supports the Indo-European steppe hypothesis. Language 91(1). 194–244.
Cysouw, Michael & Diana Forker. 2009. Reconstruction of morphosyntactic function: Nonspatial
usage of spatial case marking in Tsezic. Language 85(3). 588–617.
Dyen, Isidore, Joseph Kruskal & Paul Black. 1997. Comparative Indo-European database. The file was
last modified on February 5, 1997. http://www.wordgumbo.com/ie/cmp/ (accessed 7 July 2014).
Evans, Steven N., Donald Ringe & Tandy Warnow. 2006. Inference of divergence times as a
statistical inverse problem. In Peter Forster & Colin Renfrew (eds.), Phylogenetic methods
and the prehistory of languages, 119–130. Cambridge: McDonald Institute for
Archaeological Research.
Forker, Diana. 2013. A grammar of Hinuq. Berlin & Boston: Mouton De Gruyter.
Gascuel, Olivier. 1997. BIONJ: An improved version of the NJ algorithm based on a simple model
of sequence data. Molecular Biology and Evolution 14. 685–695.
Goloboff, Pablo A., James S. Farris & Kevin C. Nixon. 2008. TNT, a free program for phylogenetic
analysis. Cladistics 24(5). 774–786.
Gray, Russell D. & Quentin D. Atkinson. 2003. Language-tree divergence times support the
Anatolian theory of Indo-European origin. Nature 426. 435–439.
Harbert, Wayne. 2007. The Germanic languages. Cambridge: Cambridge University Press.
Haspelmath, Martin. 2009. Lexical borrowing: Concepts and issues. In Martin Haspelmath & Uri
Tadmor (eds.), Loanwords in the world’s languages: A comparative handbook, 35–54.
Berlin & New York: Mouton De Gruyter.
Haugen, Einar. 1950. The analysis of linguistic borrowing. Language 26(2). 210–231.
Holden, Clare J. & Russell D. Gray. 2006. Rapid radiation, borrowing and dialect continua in the
Bantu languages. In Peter Forster & Colin Renfrew (eds.), Phylogenetic methods and the
prehistory of languages, 19–31. Cambridge: McDonald Institute for Archaeological Research.
Huelsenbeck, John P. & Fredrik Ronquist. 2001. MrBayes: Bayesian inference of phylogenetic
trees. Bioinformatics 17(8). 754–755.
Huson, Daniel H. & David Bryant. 2006. Application of phylogenetic networks in evolutionary
studies. Molecular Biology and Evolution 23(2). 254–267.
Imnaishvili, David S. 1963. Didojskij jazyk v sravnenii s ginukhskim i khvarshijskim jazykami [Tsez
language in comparison with the Hinukh and Khwarshi languages]. Tbilisi: Mecniereba.
Kassian, Alexei. 2011–2012. Annotated Swadesh wordlists for the Lezgian group (North
Caucasian family). In George Starostin (ed.), The global lexicostatistical database. Moscow
& Santa Fe: Center for Comparative Studies at the Russian State University for the
Humanities; Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Kassian, Alexei. 2013–2015. Annotated Swadesh wordlists for the Tsezic group (North
Caucasian family). In George Starostin (ed.), The global lexicostatistical database. Moscow
& Santa Fe: Center for Comparative Studies at the Russian State University for the
Humanities; Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Kassian, Alexei. 2015. Towards a formal genealogical classification of the Lezgian languages
(North Caucasus): Testing various phylogenetic methods on lexical data. PLoS ONE 10(2):
e0116950. doi:10.1371/journal.pone.0116950.
Kassian, Alexei, George Starostin, Anna Dybo & Vasiliy Chernov. 2010. The Swadesh wordlist.
An attempt at semantic specification. Journal of Language Relationship 4. 46–89.
Kassian, Alexei & Yakov Testelets. 2017. Classification of the Tsezic languages and the
controversy of Hinukh (North Caucasus). Lingua 196. 98–118. http://dx.doi.org/10.1016/j.lingua.2017.06.011
Kassian, Alexei, Mikhail Zhivlov & George Starostin. 2015. Proto–Indo-European–Uralic com-
parison from the probabilistic point of view. Journal of Indo-European Studies 43(3–4).
301–347.
Khalilova, Zaira. 2009. A grammar of Khwarshi. Leiden: Proefschrift ter verkrijging van de graad
van Doctor aan de Universiteit Leiden, 17 December 2009.
Kiparsky, Valentin. 1967. Russische historische Grammatik, Bd. 2: Die Entwicklung des
Formensystems. Heidelberg: Carl Winter Universitätsverlag.
Kitchen, Andrew, Christopher Ehret, Shiferaw Assefa & Connie J. Mulligan. 2009. Bayesian
phylogenetic analysis of Semitic languages identifies an Early Bronze Age origin of Semitic
in the Near East. Proceedings of the Royal Society B 276. 2703–2710.
Koryakov, Yuri B. 2006. Atlas kavkazskikh jazykov: s prilozheniem polnogo reestra jazykov
[Atlas of the Caucasian languages with language guide]. Moscow: Institute of Linguistics.
Kushniarevich, Alena, Olga Utevska, Marina Chuhryaeva, Anastasia Agdzhoyan, Khadizhat
Dibirova, Ingrida Uktveryte, Märt Möls, Lejla Mulahasanovic, Andrey Pshenichnov,
Svetlana Frolova, Andrey Shanko, Ene Metspalu, Maere Reidla, Kristiina Tambets, Erika
Tamm, Sergey Koshel, Valery Zaporozhchenko, Lubov Atramentova, Vaidutis Kučinskas,
Oleg Davydenko, Olga Goncharova, Irina Evseeva, Michail Churnosov, Elvira
Pocheshchova, Bayazit Yunusbayev, Elza Khusnutdinova, Damir Marjanović, Pavao Rudan,
Siiri Rootsi, Nick Yankovsky, Phillip Endicott, Alexei Kassian, Anna Dybo, The Genographic
Consortium, Chris Tyler-Smith, Elena Balanovska, Mait Metspalu, Toomas Kivisild, Richard
Villems & Oleg Balanovsky. 2015. Genetic heritage of the Balto-Slavic speaking popula-
tions: A synthesis of autosomal, mitochondrial and Y-chromosomal data. PLoS ONE 10(9):
e0135820. doi:10.1371/journal.pone.0135820.
Lees, Robert B. 1953. The basis of glottochronology. Language 29(2). 113–127.
List, Johann-Mattis & Steven Moran. 2013. An open source toolkit for quantitative historical
linguistics. In Proceedings of the 51st Annual meeting of the association for computational
linguistics: System demonstrations, 13–18. Stroudsburg, PA: Association for
Computational Linguistics.
List, Johann-Mattis, Steven Moran, Peter Bouda & Johannes Dellert. 2014c. LingPy: Python
library for quantitative tasks in historical linguistics. Version 2.4.1.alpha, DOI: 10.5281/zenodo.11886.
Marburg: Forschungszentrum Deutscher Sprachatlas. http://lingpy.org/
(accessed 28 September 2014).
List, Johann-Mattis, Shijulal Nelson-Sathi, Hans Geisler & William Martin. 2014a. Networks of
lexical borrowing and lateral gene transfer in language and genome evolution. BioEssays
36(2). 141–150.
List, Johann-Mattis, Shijulal Nelson-Sathi, William Martin & Hans Geisler. 2014b. Using phylo-
genetic networks to model Chinese dialect history. Language Dynamics and Change 4.
222–252.
Lomtadze, Elizbar. 1963. Ginukhskij dialekt didojskogo jazyka [Hinukh dialect of the Tsez
language]. Tbilisi: Mecniereba.
Makarenkov, Vladimir, Dmytro Kevorkov & Pierre Legendre. 2006. Phylogenetic network con-
struction approaches. In Dilip K. Arora, Randy M. Berka & Gautam B. Singh (eds.), Applied
mycology and biotechnology, Vol. 6: Bioinformatics, 61–98. Amsterdam & Boston:
Elsevier.
Müller, André, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Eric W. Holman et al. 2013. ASJP
world language trees of lexical similarity: Version 4 (October 2013). http://asjp.clld.org/download
(accessed 7 May 2015).
Nakhleh, Luay, Don Ringe & Tandy Warnow. 2005. Perfect phylogenetic networks: A new
methodology for reconstructing the evolutionary history of natural languages. Language
81(2). 382–420.
Nelson-Sathi, Shijulal, Johann-Mattis List, Hans Geisler, Heiner Fangerau, Russell D. Gray,
William Martin & Tal Dagan. 2011. Networks uncover hidden lexical borrowing in
Indo-European language evolution. Proceedings of the Royal Society B 278. 1794–1803.
Nikolayev (Nikolaev), Sergei L. 1978. Rekonstruktsija foneticheskoj sistemy pratsezskogo
jazyka [Reconstruction of the Proto-Tsezic phonological system]. In Victoria N. Yartseva
(ed.), Konferentsiya: Problemy rekonstruktsii (tezisy dokladov), 87–89. Moscow: Institut
jazykoznaniya AN SSSR.
Novotná, Petra & Václav Blažek. 2007. Glottochronology and its application to the Balto-Slavic
languages. Baltistica 42(2). 185–210; Baltistica 42(3). 323–346.
Pagel, Mark & Andrew Meade. 2006. Estimating rates of lexical replacement on phylogenetic trees
of languages. In Peter Forster & Colin Renfrew (eds.), Phylogenetic Methods and the Prehistory
of Languages, 173–182. Cambridge: McDonald Institute for Archaeological Research.
Renfrew, Colin. 2000. At the edge of knowability: Towards a prehistory of languages.
Cambridge Archaeological Journal 10(1). 7–34.
Rexová, Kateřina, Daniel Frynta & Jan Zrzavý. 2003. Cladistic analysis of languages: Indo-
European classification based on lexicostatistical data. Cladistics 19. 120–127.
Ringe, Don, Tandy Warnow & Ann Taylor. 2002. Indo-European and computational cladistics.
Transactions of the Philological Society 100(1). 59–129.
Saitou, Naruya & Masatoshi Nei. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4. 406–425.
Semple, Charles & Mike Steel. 2003. Phylogenetics. Oxford: Oxford University Press.
Sneath, Peter H. & Robert R. Sokal. 1973. Numerical Taxonomy. San Francisco: W. H. Freeman
and Company.
Starostin, George S. 2010. Preliminary lexicostatistics as a basis for language classification:
A new approach. Journal of Language Relationship 3. 79–116.
Starostin, George S. 2011. Annotated Swadesh wordlists for the Nakh group (North Caucasian
family). In George Starostin (ed.), The global lexicostatistical database. Moscow & Santa
Fe: Center for Comparative Studies at the Russian State University for the Humanities;
Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Starostin, George S. (ed.). 2011–2015. The global lexicostatistical database. Moscow & Santa
Fe: Center for Comparative Studies at the Russian State University for the Humanities;
Santa Fe Institute. http://starling.rinet.ru/new100 (accessed 7 May 2015).
Starostin, George S. 2013a. Jazyki Afriki. Opyt postroenija leksikostatisticheskoj klassifikatsii
[Languages of Africa: A new lexicostatistical classification]. Vol. 1: Metod. Kojsanskie jazyki
[Methodology. Khoisan languages]. Moscow: LRC.
Starostin, George S. 2013b. Lexicostatistics as a basis for language classification: Increasing
the pros, reducing the cons. In Heiner Fangerau, Hans Geisler, Thorsten Halling & William
Martin (eds.), Classification and evolution in biology, linguistics and the history of science:
Concepts – methods – visualization, 125–146. Stuttgart: Franz Steiner Verlag.
Starostin, Sergei A. 1994. Lezgian etymological database, computerized version of the Proto-
Lezgian corpus which includes some Proto-Lezgian etymologies (mostly basic lexicon
items) that have not been included in Starostin & Nikolayev 1994 due to their lack of
external cognates in other branches of North Caucasian. http://starling.rinet.ru/cgi-bin/
main.cgi (accessed 10 September 2014).
Starostin, Sergei A. (ed.). 1998–2005. The Tower of Babel: An etymological database project.
http://starling.rinet.ru/ (accessed 7 May 2015).
Starostin, Sergei A. 2000. Comparative-historical linguistics and lexicostatistics. In Colin
Renfrew, April McMahon & Larry Trask (eds.), Time depth in historical linguistics, 223–259.
Cambridge: McDonald Institute for Archaeological Research. First publ. in Vitaly
Shevoroshkin & Paul J. Sidwell (eds.), 1999, Historical linguistics and lexicostatistics,
3–50. Melbourne: Association for the History of Language.
Starostin, Sergei A. 2007. Opredelenie ustojchivosti bazisnoj leksiki [Defining the stability of
basic lexicon]. In Sergei A. Starostin, Trudy po jazykoznaniju [Works in linguistics],
827–839. Moscow: LRC.
Starostin, Sergei A. 2007 [1989]. Sravnitel’no-istoricheskoe jazykoznanie i leksikostatistika
[Historical linguistics and lexicostatistics]. In Sergei A. Starostin, Trudy po jazykoznaniju
[Works in linguistics], 407–447. Moscow: LRC. First publ. in Ilia Peiros (ed.),
Lingvisticheskaja rekonstruktsija i drevnejshaja istorija Vostoka, 3–39. Moscow, 1989.
English version: S. Starostin 2000.
Starostin, Sergei A. 2007 [1993]. Rabochaja sreda dlya lingvista [Linguist’s workspace].
In Sergei A. Starostin, Trudy po jazykoznaniju [Works in linguistics], 481–496. Moscow:
LRC. First publ. in Bazy dannykh po istorii Evrazii v srednie veka 2, 50–64, Moscow: Institut
vostokovedenija RAN, 1993.
Starostin, Sergei A. n.d. Istoricheskaja fonetika lezginskikh jazykov [Lezgian historical
phonology]. Unpubl. ms, 1980s.
Starostin, Sergei A. & Sergei L. Nikolayev (Nikolaev). 1994. A North Caucasian etymological
dictionary. Moscow: Asterisk [reprinted: 3 vols. Ann Arbor: Caravan Books, 2007].
Available online at the Tower of Babel project: http://starling.rinet.ru/cgi-bin/main.cgi
(accessed 10 September 2014).
Swadesh, Morris. 1952. Lexico-statistic dating of prehistoric ethnic contacts. Proceedings of the
American Philosophical Society 96. 453–463.
Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International Journal
of American Linguistics 21. 121–137.
Testelets, Yakov G. 1993. K sravnitelno-istoricheskoj fonetike tsezskix jazykov (rekonstruktsija
vokalizma) [Towards a historical phonology of the Tsezic languages: vowels]. In Tatiana M.
Nikolaeva (ed.), Problemy fonetiki 1, 126–134. Moscow: Prometej.

Supplementary Material: The online version of this article offers supplementary material
(https://doi.org/10.1515/flih-2017-0008).
