
Stuck in the forest:
Trees, networks and Chinese dialects
Mahé BEN HAMED (1, 2), Feng WANG (3)
(1) Laboratoire Dynamique du Langage, UMR 5596, Lyon, France
(2) Department of Psychology, University of Auckland, Auckland, New Zealand
email: Mahe.Ben-Hamed@ish-lyon.cnrs.fr, fax: (+64-9) 373-7450
(3) Department of Chinese Language & Literature, Peking University, Beijing, China
email: wfwf@pku.edu.cn, fax: (+86-10) 6275-3016
Correspondence to: Mahé BEN HAMED

Summary
This paper discusses the validity of the tree model of evolution for the particular case of Sinitic
languages (or Chinese dialects). Our approach is lexically based, using standardized wordlists. First,
these lists were tested for their congruence, as they are supposed to have evolved at different rates.
Then, the phylogenetic analysis could proceed, using both a distance-based lexicostatistical method
and a character-based maximum parsimony method. The traditional classification of Chinese dialects
is recovered to various extents depending on the method and on the wordlist used, but the character-
based analysis of the 200 Swadesh wordlist outperforms all other analyses. Finally, the validity of the
branching patterns obtained was tested through a variety of techniques. Although the data fits the
inferred trees well, the topology of these trees is collapsed to a star-like pattern when investigated
through resampling methods. The application of a network method confirms that the development of
these Sinitic languages is not tree-like, highlighting the fact that in cases like this tree-reconstruction
methods can be misleading.
Résumé
This article assesses the validity of the tree model in the particular case of the Sinitic
languages (or Chinese dialects). It is based on the analysis of lexical data in the form of
standardized wordlists. These lists are assumed to have evolved at different rates, which makes it
necessary to check their congruence before proceeding to the phylogenetic analyses proper. The
latter consist of a lexicostatistical (that is, distance-based) approach and a character-based
method, maximum parsimony. The traditional classification is recovered with varying precision
depending on the method and the list used, but the cladistic analysis of the 200 Swadesh words
remains the best-performing. However, although the data fit the inferred trees well, these trees
are reduced to star-like diagrams when resampling procedures are used to assess the statistical
support for their topologies. The application of a network method confirms that the development
of these languages, as studied through their lexicons, is not tree-like, and calls for caution in
the use of tree reconstruction methods, which can lead, as in the present case, to erroneous
inferences.
Zusammenfassung
This article tests the validity of the tree model as a representation of the
development of the Chinese dialects. It relies on lexical data corresponding to the
standard lists of Swadesh and Yakhontov. The analysis proceeds in three
stages. After making sure that these nested wordlists are mutually congruent,
we carry out phylogenetic reconstructions, using either a distance method
based on the lexicostatistical distance or a maximum parsimony
method. Both methods recover, to some extent, the traditional
and generally accepted classification of these dialects. When permutation tests
are carried out to evaluate the statistical support for the subgroups and their topology,
the traditional classification breaks apart. Furthermore,
applying the Neighbor-Net method, which does not prejudge how the
relationships between dialects are represented, leads us to conclude that
no tree-like structure is present. Several explanations for this lack of
structure can be proposed. In any case, tree-like
reconstructions should be handled with caution, because they rest on unverifiable
assumptions and can lead to false conclusions.
I. Introduction
The first tree diagram used to represent species evolution can be found in Darwin’s first
notebook on transmutation of species (1837). Independently, the German philologist August
Schleicher also used the tree metaphor for Indo-European languages. His family-tree theory
(Stammbaumtheorie, 1853) was soon challenged by his own student, Johannes
Schmidt, who argued that innovations could spread from multiple centres, a distribution for
which a strictly hierarchical tree model could not account. Schmidt proposed an alternative
model, called the wave theory (Wellentheorie, 1872), which he argued was more realistic:
it shows the continuing contact between languages depending on the features analyzed,
whereas the tree model requires sudden and sharp splits of the language as a whole. On the
other hand, the wave model does not give a synthetic figure of how languages are related and
fails to represent earlier and later stages of languages simultaneously (Trask, 1996).
Trees are a dominant paradigm in evolutionary biology. In the past fifty years, this field
experienced substantial theoretical and computational developments, making tree
reconstruction automatic and fast through a variety of computerized methods. Such
developments appealed to other fields dealing with evolutionary issues, such as historical
linguistics. Since Schleicher, this discipline has been familiar with the tree model, and the use of
tree reconstruction methods has progressively spread through historical linguistic practice,
even in cases where the tree model may have seemed less appropriate. There are several
reasons for this. First, the tree iconography is a particularly convenient synthetic figure,
straightforward to interpret and intuitive to understand. Second, the most popular tree
reconstruction software is developed by phylogeneticians and only generates trees. Third,
multidimensional analyses are essentially distance-based methods and are particularly
interesting when the distance used proceeds from an explicit evolutionary model. This may be
the case in population genetics, but not in historical linguistics, where no models are available
(yet) which could be used to incorporate evolutionary assumptions in the computation of
linguistic distance. This being said, there persists a rather fundamental question: if we
constrain languages to fit onto a tree when their evolution is not tree-like, what does that tree
actually tell us about how these languages evolved?
The adequacy of the tree model in representing the evolution of a set of languages is a
recurring debate in historical linguistics. In the ideal case of tree-like evolution, the mother
language differentiates into some daughter languages which then diverge progressively.
Although this model applies in the case of the most extensively studied families (Indo-
European, Semitic, Uralic, Algonquian), it cannot be generalized (Dixon, 1997) and has to be
integrated into a perspective which simultaneously includes a dimension of genetic descent
and a dimension of areal diffusion.
The tree model is also a matter of debate in evolutionary biology. In population genetics,
for instance, it is challenged by multidimensional analyses of variance. In fact, at the infra-
specific level, historical relationships are the sum of vertical transmission (descent with
modification) and of a horizontal transmission between lineages (migration and admixture).
As such, each approach (trees vs. multidimensional analyses) can capture history only
partially and both are still needed to account for it as a whole. The problem is that
multidimensional analyses (and the wave theory) cannot account for a hierarchical structure
whilst trees cannot account for spread of innovations from multiple centres. Consequently, if
the whole scheme is constrained to fit the tree model, the non-hierarchical information may
distort the hierarchical figure.
The rate of change in languages is higher than in biological data, and horizontal transfers
between languages are not as rare as they are between species. Consequently, stratified events of
borrowing can rapidly dilute the phylogenetic signal of linguistic descent and create
reticulations between languages. Borrowing is more or less prevalent and the patterns of
descent are more or less salient, depending on the language group. Therefore, depending on
the amount of borrowing, all cases on the continuum between the pure tree model and the
completely tangled network are plausible. It is thus unavoidable to question the
adequacy of the tree as a representation of how languages are related, but in this general form the
question depends too much on the particular case to admit a systematic answer. There is indeed a
tree-like signal, but it can be overlaid
by conflicting signals which cause the hierarchical signal to fade and become ambiguous.
Therefore, one should rather question how to justify that the tree model is adequate for a
particular set of languages, and what alternative representations may be used.
This paper specifically addresses these issues in the case of Sinitic languages (or Chinese
dialects). This language group is known for its complex and intertwined evolutionary history
(Wang, 1991) which challenges the tree model. To picture the complexity that must be dealt
with, we can compare this group to Indo-European. Both families have been extensively
studied and have considerable historical records. However, Indo-European, which has
presumably evolved over the past 6000 to 8000 years, is better resolved in terms of its
phylogeny than Sinitic, which has evolved over only half that time span.
Chinese speaking populations have undergone successive waves of migrations and have
experienced a situation of diglossia, probably since ancient times (Wang, 1991; Lee, 1991).
Consequently, Sinitic languages are not expected to have evolved in a steady tree pattern, but
rather as a network. In spite of this intuition, several studies in Chinese dialectology (Cheng,
1991; Ogura, 1994; Wang, 1998; Wang and Wang, 2004) use classical tree reconstruction
methods, with no further assessment of the robustness of the trees obtained. These
reconstructions have relied on global similarity, but such distance-based methods are
intrinsically unable to diagnose deviation from the tree model. Hinnebusch (1996) noted
however that a skewing of the tree co-occurred with borrowing events, an observation which
motivated a method for detecting borrowing based on the evaluation of the skewing in
lexicostatistical trees (Minett and Wang, 2004).
In this paper, we argue that tree reconstruction should be complemented by a number of
procedures which can measure the confidence in the pattern of subgrouping given by the tree
obtained. We also argue for the use of alternative representations which can help visualize
how good an approximation a tree provides for the signal contained in the data. Section 2
introduces the data and the methods used for tree reconstruction. Section 3 presents the
measures used to assess the quality of a tree and Section 4 discusses the results of these
approaches on Sinitic languages, with a particular focus on the issue of tree-likeness. In this
last section, we also explore alternative representations to the tree.
II. Wordlists and tree reconstruction
1. Swadesh lists and derived lists
Swadesh (1952) introduced a lexicon-based model for language change relying on test-
lists of meanings likely to be found in all cultures. There are two such lists and they are
referred to as the Swadesh 100 and 200 wordlists, depending on the number of meanings they
contain. These meanings are putative cultural universals such as body parts, lower numerals,
topographical terms, kinship terms, personal, demonstrative and interrogative pronouns,
naturally-occurring phenomena and basic activities. In addition to their assumed cultural
universality, these meanings belong to the basic core vocabulary. This subset of the lexicon is
supposedly more immune to change, being both more retentive and less prone to borrowing
during language contact events than the rest of the lexicon. In a phylogenetic framework,
these lists correspond to a controlled sampling of lexicon which maximizes the signal to noise
ratio of the information about genetic descent.
The original 200 Swadesh wordlist actually consisted of 207 items (Swadesh, 1952). Chen
(1996: 297) argues that 7 items (‘at’, ‘other’, ‘some’, ‘when’, ‘wipe’, ‘with’ and ‘ye’) should
be discarded as they appear irrelevant for Chinese, which brings the largest wordlist to exactly
200 items. This list was compiled for Old Chinese and 22 modern Chinese dialects, whose
locations are given in Figure 1. The Shanghai dialect is given under two descriptions, bringing the
number of analysed taxa to 24. The affiliation of each dialect is given according to Yuan’s
classification (1980), which is largely uncontroversial among Chinese dialectologists. The data
are accessible on the Peking University website (see footnote 1). They were recoded in accordance with the
intended phylogenetic processing, as described in subsections 2 and 3 of this section.
Figure 1: Map of the Chinese linguistic domain, showing the boundaries of the 7 major
dialectal groups and the locations of the dialects used in this study.
Mandarin: (1) Beijing, (2) Taiyuan, (3) Wuhan, (4) Ningxia, (5) Chengdu, (6) Yuci, (7) Yingshan. Xiang: (8) Changsha, (9)
Shuangfeng. Gan: (10) Anyi, (11) Nanchang. Wu: (12) Ningbo, (13) Shanghai, (14) Suzhou, (15) Wenzhou. Hakka: (16)
Liancheng, (17) Meixian. Min: (18) Fuzhou, (19) Taiwan, (20) Xiamen, (21) Zhangping. Yue: (22) Guangzhou.
Footnote 1: http://chinese.pku.edu.cn/wangf/wangf.htm
In addition to Swadesh’s subdivisions (Swadesh, 1955), shorter versions have been
proposed, namely the Yakhontov 35 wordlist (quoted in Starostin, 1991: 59-60), and the even
more restrictive 15 wordlist of Dolgopolsky (1964). The latter was not used as it contains too
few items with respect to the number of dialects studied. These lists are not uncontroversial
(for a detailed review of the question, see Embleton, 1986), and neither is their use to reconstruct
linguistic phylogenies. It is often argued that regular sound changes and innovations in
inflectional morphology are better informants of linguistic descent than vocabulary, which in
comparison, is particularly sensitive to borrowing (Ringe et al., 2002; Balter, 2003). In
practice however, much importance can be given to lexical evidence. Ogura (1994), for
instance, argues that within Chinese historical linguistics, lexicon is the only linguistic
structure which is “rich enough in detail to permit quantitative analysis” (1994: 350). In fact,
numerous studies in Chinese dialectology rely on lexical data (Chen, 1996; Cheng, 1991;
Ogura, 1994; Sagart, 1993), which is viewed as more informative than other linguistic
features for the time span covering the formation and evolution of Chinese dialects (Ogura,
1994). Lexicon would be informative for the span of several centuries/millennia, whereas
vowels, consonants, tones, morphosyntax and semantics exhibit changes at shallower time
depths.
These basic wordlists are an attempt to minimize the residual borrowing, convergence and
chance resemblance (generally subsumed under the term homoplasy) which would otherwise escape
the linguist’s vigilance. A critical step in phylogenetic analyses is indeed to distinguish
between the (phylo)genetic signal and non-vertical transmission. When homoplasy is
prevalent, the reconstructed tree can be well supported by the dataset but still be a poor
estimate of the true phylogeny. In such a case, shorter wordlists such as Yakhontov’s or
Dolgopolsky’s can be useful to rule out as much homoplasy as possible from the analyses. In
fact, the shorter the list, the more conservative it is assumed to be and thus the more
informative about the ‘true phylogeny’. Caution would thus advise the use of the less
homoplasy-sensitive data sample. At the same time, reducing the number of characters used
for the phylogenetic inference also reduces the power of resolution of the dataset. The
problem of homoplasy gives way to that of insufficient sampling. This problem gets harder as
the number of languages increases. Although a large number of taxa2 can be grouped perfectly
well with a single binary character, the expected resolution is not high. Although there must
be a critical number of characters necessary for tree-reconstruction methods to discriminate
between different topologies (Erdös et al. 1997; Kim, 1998), the few studies addressing this
question have reached different conclusions (Wheeler, 1992; Graybeal, 1998; Russo, 1996).
Following Felsenstein (1985), we argue that the fundamental issue is not the number of
characters used but how well a given set of characters supports a clade3 in a phylogeny, a
concern that will be developed in Section 3.
2. Tree reconstruction methods
We have mentioned in section I different paradigms of phylogenetic analysis, which can
be categorized into character-based and distance-based methods.
a. Character-based methods: A character is a discrete attribute observed under a specific
state for each language, and is potentially informative about phylogenetic relationships
between languages.
For any set of taxa, there are multiple possible trees to connect them, the number of trees
growing faster than exponentially with the number of taxa. For instance, for the 24 dialects of our
dataset, there are more than $5 \times 10^{26}$ possible unrooted bifurcating trees connecting them.
Therefore, one needs a criterion to select a subset of all possible trees judged as optimal. This
criterion can be probabilistic as in maximum likelihood and Bayesian inference, provided that
we have an acceptable probabilistic model of character change. Given that we argue for
multiple-state coding of the data, and as no available phylogenetic software can cope with the
high character polymorphism in our data, we were unable to perform such probabilistic
approaches.
Footnote 2: taxa = objects under study (singular: taxon).
Footnote 3: clade = (Gr. κλάδος) putative group defined by an ancestor and all its descendants in a phylogeny.
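The size of the search space mentioned above can be checked with a short calculation. The sketch below assumes the standard count of unrooted, fully bifurcating trees on n labelled taxa, (2n - 5)!!, which follows from the fact that the k-th taxon can be attached to any of the 2k - 5 branches of a tree on k - 1 taxa. The function name is ours and purely illustrative.

```python
def count_unrooted_bifurcating_trees(n_taxa: int) -> int:
    """Number of unrooted, fully bifurcating trees on n labelled taxa: (2n - 5)!!."""
    if n_taxa < 3:
        return 1  # with fewer than three taxa there is a single (trivial) tree
    count = 1
    # The k-th taxon (k = 4..n) can be grafted onto any of the 2k - 5 existing branches.
    for k in range(4, n_taxa + 1):
        count *= 2 * k - 5
    return count


# Growth of the search space for a few dataset sizes, including the 24 taxa used here.
for n in (4, 5, 10, 24):
    print(n, count_unrooted_bifurcating_trees(n))
```

Even at 24 taxa the count is far beyond exhaustive enumeration, which is why the tree search has to be heuristic.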
Outside this probabilistic framework, one could use a maximum compatibility criterion,
which reconstructs trees from the largest subset of characters that are all compatible with the
same hierarchical structure. However, compatibility methods do not deal with homoplasy.
Therefore, in the case of Chinese dialects where such a huge amount of borrowing has
occurred, compatibility approaches are not recommended.
We preferred the maximum parsimony criterion, where the optimal trees are those which
minimize the overall number of character changes. Note that, given the large number of
possible trees for our dataset, it is not possible to compute and evaluate each and every tree,
and the search for optimal trees has to be heuristic. We therefore have no guarantee of finding the
optimal tree subset, but an estimate of it can at least be obtained in a reasonable time. In
this paper, we have used a cladistic approach (sensu Hennig, 1950), which proceeds as
follows:
− For each character, we formulate a hypothesis of homology: if two languages share the
same state, then the most parsimonious explanation is that it was inherited from
the common ancestor of those languages.
− Once the subset of optimal trees is selected, this primary hypothesis can be tested. The
distribution of character states on the selected topologies determines whether the
observed shared character states result from common ancestry or not. Therefore, it is
the tree topology, and more precisely the optimal tree topologies, which explains the
mode of transmission for each character. When character states appear to be shared by
unrelated languages, they are said to be homoplastic.
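The parsimony criterion and the homology test described above can be illustrated with a minimal sketch. It assumes Fitch's algorithm for unordered characters on a fixed binary tree, which is one standard way of counting character changes and is not necessarily the routine used for the analyses in this paper. The taxon names A to D and the two toy characters are hypothetical, and missing data are not handled.

```python
def fitch_changes(tree, character):
    """Return (possible ancestral states, number of changes) for one character.

    tree: nested 2-tuples whose leaves are taxon names.
    character: dict mapping each taxon name to its (unordered) state.
    """
    if not isinstance(tree, tuple):
        return {character[tree]}, 0
    left_states, left_n = fitch_changes(tree[0], character)
    right_states, right_n = fitch_changes(tree[1], character)
    shared = left_states & right_states
    if shared:  # the two subtrees agree: no extra change is needed at this node
        return shared, left_n + right_n
    return left_states | right_states, left_n + right_n + 1


def tree_length(tree, characters):
    """Overall number of character changes required by the tree."""
    return sum(fitch_changes(tree, c)[1] for c in characters)


def is_homoplastic(tree, character):
    """True if the character needs more changes than its minimum (states - 1)."""
    minimum = len(set(character.values())) - 1
    return fitch_changes(tree, character)[1] > minimum


tree = (("A", "B"), ("C", "D"))
char1 = {"A": 1, "B": 1, "C": 2, "D": 2}  # fits the tree with a single change
char2 = {"A": 1, "B": 2, "C": 1, "D": 2}  # conflicts with the tree: two changes
print(tree_length(tree, [char1, char2]))
print(is_homoplastic(tree, char1), is_homoplastic(tree, char2))
```

On this tree the shared states of the second character cannot all be explained by common ancestry, which is the situation the text calls homoplasy.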
b. Distance-based methods: These methods are based on global similarity. First a measure
of pairwise similarity between taxa is computed, which is then transformed into a measure of
dissimilarity with specific mathematical properties, called a distance. The resulting pairwise
distance matrix is in turn transformed into a tree, according to a specific reconstruction
method. In historical linguistics, lexicostatistical methods are one family of distance-based
methods.
As such, distance methods imply information loss with respect to character-based
methods, since the original character matrix cannot be recovered from a distance matrix
(Penny, 1982). Herein, we have used the popular Neighbour-Joining method (Saitou and Nei,
1987), an agglomerative procedure that builds a single tree by successively joining pairs of taxa,
with the aim of keeping the difference between the observed distance matrix and the tree distance
matrix (reconstructed as the pairwise distance between taxa on the estimated tree) small.
We have used the same lexicostatistical distance as Minett and Wang (2004). The distance
between languages A and B is a function of their lexical similarity C, which is equal to the
proportion of cognates shared by A and B over the whole wordlist. The distance is equal to
minus the logarithm of C, $d(L_A, L_B) = -\log C$; since C cannot exceed 1, the minus sign
ensures that the distance is non-negative.
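As an illustration of this distance, the following sketch computes it for two languages whose wordlists have already been coded as cognate classes, one class label per meaning. The function name, the toy data and the choice of the natural logarithm are our assumptions (the base of the logarithm is not specified above), and missing data are not handled here.

```python
import math


def lexicostatistical_distance(cognates_a, cognates_b):
    """d(L_A, L_B) = -log C, with C the proportion of shared cognates.

    cognates_a, cognates_b: sequences of cognate-class labels, aligned by meaning.
    """
    if len(cognates_a) != len(cognates_b) or len(cognates_a) == 0:
        raise ValueError("wordlists must be non-empty and of equal length")
    shared = sum(a == b for a, b in zip(cognates_a, cognates_b))
    if shared == 0:
        return math.inf  # no shared cognate: the distance is unbounded
    return -math.log(shared / len(cognates_a))


# Hypothetical four-meaning wordlists: two of the four meanings share a cognate class.
list_a = [1, 2, 3, 4]
list_b = [1, 2, 9, 9]
print(lexicostatistical_distance(list_a, list_b))  # -log(0.5)
```

Identical lists give C = 1 and therefore a distance of zero, while the distance grows without bound as the proportion of shared cognates approaches zero.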
3. Defining the characters
Previous analyses of lexical lists (Gray and Jordan, 2000; Holden, 2002; Ringe et al. 2002;
Gray and Atkinson, 2003; Rexova et al. 2003) use two types of character coding:
− Each basic meaning constitutes a character, and the different cognate sets identified
for that meaning constitute the character’s states (Table 1). This definition results in
multistate characters with more than 2 states per character.
− Each cognate set from each meaning constitutes a character where the states are either
present, if the item belongs to that cognate set, or absent otherwise. This definition
results in binary characters (2 states per character), with multiple characters per
meaning.
Meaning      Fuzhou (Min)    Guangzhou (Yue)
“dog”        quan2 犬        gou2 狗
Character    State 1         State 2
Table 1: Example of character coding for meaning ‘dog’
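The relation between the two codings can be made concrete with a small sketch that splits one multistate character into the corresponding presence/absence characters. The function name and the data layout (a dictionary from dialect to state) are hypothetical; the states are those of Table 1.

```python
def to_binary_characters(meaning, states_by_dialect):
    """Split one multistate character into one binary character per cognate set.

    Returns a dict keyed by (meaning, cognate set) whose values map each
    dialect to 1 (item belongs to that cognate set) or 0 (it does not).
    """
    cognate_sets = sorted(set(states_by_dialect.values()))
    return {
        (meaning, cognate_set): {
            dialect: int(state == cognate_set)
            for dialect, state in states_by_dialect.items()
        }
        for cognate_set in cognate_sets
    }


# Multistate coding of Table 1: one character ("dog") with two states.
dog = {"Fuzhou": 1, "Guangzhou": 2}
binary = to_binary_characters("dog", dog)
for key, values in binary.items():
    print(key, values)
```

A meaning with k cognate sets thus yields k binary characters, which is the source of the weighting concern raised against binary coding.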
Advocates of the binary coding argue that the meanings composing the wordlists are not
the ‘objects’ of language change, but are artificial constructs (Atkinson and Gray, 2004). In
contrast, cognate sets are unambiguously defined as heritable units and therefore the binary
coding would be at least as principled as the multistate coding. Conversely, advocates of
multistate coding argue that binary coding violates the necessary requirement of character
independence, as the cognate sets related to a given meaning are semantically dependent
(Balter, 2003). In fact, cognacy judgments are made on both semantic and phonological
grounds. Since these two dimensions are non-independent in circumscribing cognate sets,
they should not be separated for the analyses. For example, if, for a given meaning, a
language has a word in a given cognate set, it is not likely to have a word for the same
meaning in the other cognate sets for that semantic slot. Consequently, binary coding bears
the risk of over-weighting the meanings with a large number of cognate sets to the detriment
of meanings with few cognate sets. In fact, almost all cladistic analyses treat data as
independent and therefore, non-independent characters are over-weighted in the analyses. The
violation of this theoretical requirement is problematic as character independence is difficult
to test, evaluate and correct. Consequently, the bias introduced by non-independence on tree
reconstruction cannot be assessed.
For these reasons, we preferred multistate coding. By doing so however, we find ourselves
hindered in co-opting probabilistic methods and models developed in biology, as our
characters appear far more polymorphic than the characters these methods and models were
designed for. We treated characters as unordered: for a given character, any state is allowed to
change into any other state with the same probability. This is far from the reality of linguistic
change, but in the absence of a plausible model of evolution of lexicon, this corresponds to
the least constrained approach.
Wordlists are rarely as simple to code as the example given in Table 1. Variant roots, for example,
can be associated with slight variations of the same basic meaning. For instance,
the meaning “bird” is represented by 鸟 (niao3) in Nanchang (Gan) and by 雀 (que4) in
Chengdu (Mandarin). By contrast, Ningxia (Mandarin) has both roots 鸟 and 雀 for this
same meaning, but these roots differ in their etymology in Old Chinese, referring to “big bird”
and “small bird”, respectively. In such a case, we considered semantic variation itself as a
new state: Nanchang is therefore in state 1 (鸟), Chengdu is in state 2 (雀) and Ningxia is in a
third state, noted 1/2 (鸟/雀). The transformation between states 1 and 3 can be due
to the loss of the variant 雀 (from the Ningxia state to the Nanchang state), to its emergence
(from the Nanchang state to the Ningxia state), or even to borrowing from another source.
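This coding rule can be sketched in a few lines: each distinct set of roots attested for a meaning receives its own state, so that a dialect with both roots ends up in a third state. The function name and the data layout are hypothetical, and states are numbered in order of first appearance, which is an arbitrary convention.

```python
def code_states(roots_by_dialect):
    """Assign one state per distinct set of roots attested for a meaning."""
    state_of = {}  # frozenset of roots -> state number
    coded = {}
    for dialect, roots in roots_by_dialect.items():
        key = frozenset(roots)
        if key not in state_of:
            state_of[key] = len(state_of) + 1
        coded[dialect] = state_of[key]
    return coded


# The "bird" example: Ningxia has both roots and is therefore given a third state.
bird = {
    "Nanchang": {"鸟"},
    "Chengdu": {"雀"},
    "Ningxia": {"鸟", "雀"},
}
print(code_states(bird))
```

Because the characters are treated as unordered, the numbering of the states carries no information beyond their identity.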
Some meanings are represented by a root with appended affixes and modifiers. The root
can stay constant while the affix/modifier can change, or both can vary. The first case can be
simply resolved by coding only the variable part, but the latter case is more problematic. In
fact, the root and the affix/modifier cannot be split into two characters, since they have a tied
evolution within a semantic slot. We thus coded each combination root/affix or modifier as a
new state. There are also cases of optional affixation, which we coded as distinct states from
the stable form. For instance, the meaning “head” is represented in Nanchang (Gan) by the
bare root 头 (tou2), state one; the root can also appear systematically suffixed, as in Zhangping
(Min) 头壳 (ke4), state two, or optionally suffixed, as in Guangzhou (Yue) 头[壳], state three.
The duplication of a root or the combination of roots is also treated as a suffixation. In the
latter case, if each root and their combined form are semantic variants of the same meaning,
then they constitute three different states.
Missing data are coded as a specific state. Note that the absence of a concept and truly
missing data (lack of information) are coded and treated differently. The absence of a concept
is considered a proper state and analyzed like any other informative state, whereas missing
data receive special treatment. In distance methods, they are discarded from the computation
of the distance matrix, whereas they are optimized in the cladistic approach. In the latter case,
missing values are assigned the states which allow the best (most parsimonious) optimization.
III. Trees and ‘tree-likeness’ of Chinese dialect development
1. Incrementing wordlists: homoplasy and congruence
Increasing the number of characters does not guarantee a better phylogenetic estimate. For
instance, character sets may be evolving at different rates. Constraining such characters to rate
equality in a phylogenetic analysis is likely to result in a flawed topological estimate (Cao,
1998). In other words, datasets assumed to have evolved at different rates should not be
combined into a single phylogenetic analysis without a prior assessment of their congruence.
Two datasets are said to be congruent when the hierarchical signals they contain are
compatible, or in agreement. Note that congruence does not imply that the additional signal
contributes no new information, but one can be assured of the coherence of
the signal contained in the combined dataset, in a given analytical framework. In contrast,
incongruence implies a conflict between the initial dataset and the added items that
would result in a poor estimate of each of the conflicting evolutionary trajectories. Although this is
subject to debate among phylogeneticians (Kluge, 1989; de Queiroz, 1993), we advocate
a prior assessment of congruence between datasets in order to justify their
combination in a single analysis (de Queiroz, 1993). If datasets prove incongruent,
they should be analyzed separately, and the differences in the optimal trees resulting from
each analysis should be evaluated.
The rationale behind the 100 and the 35 wordlists is that the rates of word evolution
differ from one list to the other. According to Swadesh’s subdivisions, the items of the 100
wordlist evolve more slowly than the second 100 items of the 200 wordlist. According to
Yakhontov, the 35 items of his list evolve even more slowly than the other 165 words of
Swadesh’s 200 word list. Therefore, we find ourselves in the situation described above with
character sets evolving at different rates. Consequently, we need to investigate the congruence
within the 100 and 200 wordlists. Intriguingly, whereas one would expect Yakhontov’s list to
be included in Swadesh’s 100 list, three items of the Yakhontov list belong to the less
conservative half of Swadesh’s 200 list. Given this apparent contradiction between
Yakhontov’s and Swadesh’s views on what subsets of the basic lexicon are the most
conservative, we have to test not only the congruence between the first and the second halves
of the 200 wordlist, but also, the congruence within each subset with respect to Yakhontov’s
subdivision. In other words, we have to test 5 partitions of the basic lexicon (Figure 2):
Partition 1. Within Swadesh’s 200 wordlist: Yakhontov’s 35 words against the
remaining 165 items
Partition 2. Within Swadesh’s 100 wordlist: Yakhontov’s 32 words against the
remaining 68 items
Partition 3. Yakhontov’s 35 items against the 68 subset (within the 100 wordlist)
and against the 97 subset (within the second half of the 200 wordlist)
Partition 4. The 68 subset (within the 100 wordlist) against the 97 subset (within the
second half of the 200 wordlist)
Partition 5. The 100 wordlist against the second half of the 200 wordlist
Figure 2: Structure of the wordlists and definition of the subsets used in the ILD procedure.
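As a quick consistency check on these partitions, the subset sizes can be derived from the four elementary blocks of the 200 wordlist. The variable names below are ours; the sizes are the ones stated in the text (32 and 3 Yakhontov items in the first and second halves, and 68 and 97 remaining items).

```python
# Elementary blocks of the 200 wordlist, sizes as stated in the text.
yakhontov_first_half = 32   # Yakhontov items within Swadesh's 100 wordlist
yakhontov_second_half = 3   # Yakhontov items within the second half of the 200 wordlist
rest_first_half = 68        # remaining items of the 100 wordlist
rest_second_half = 97       # remaining items of the second half

yakhontov = yakhontov_first_half + yakhontov_second_half

# Sizes of the character sets compared in each partition.
partitions = {
    1: [yakhontov, rest_first_half + rest_second_half],
    2: [yakhontov_first_half, rest_first_half],
    3: [yakhontov, rest_first_half, rest_second_half],
    4: [rest_first_half, rest_second_half],
    5: [yakhontov_first_half + rest_first_half,
        yakhontov_second_half + rest_second_half],
}
for number, sizes in partitions.items():
    print(f"Partition {number}: {sizes} (total {sum(sizes)})")
```

Partitions 1, 3 and 5 cover all 200 items, whereas partition 2 is restricted to the 100 wordlist and partition 4 to the 165 items outside Yakhontov's list.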
In the maximum parsimony framework we are using for tree reconstruction, one can
use the Incongruence Length Difference (ILD) test, proposed by Farris et al. (1994), and
independently by Swofford (1995) under the name of Partition Homogeneity test. This
procedure is character-based and non-parametric, and tests the null hypothesis of congruence
between datasets. In fact, one cannot directly test the null hypothesis of incongruence, but one
can measure the extent to which the incongruence between datasets is significantly higher
than would be expected by chance alone. This is precisely how the ILD test proceeds. It
measures the extent to which the original datasets result in different trees and compares it to
the distribution of this same measure for randomly generated datasets of the same size as the
original ones. Therefore, this test can be used to justify the combination of the datasets in a
parsimony framework, provided that the combined dataset results in solutions no less optimal than the
separate original datasets.
The test requires the computation of tree lengths. The length of a tree is equal to the
overall number of character changes required to map the character matrix onto that tree.
Given two datasets X and Y, the measure of their incongruence is the difference $D_{XY}$ between
the length $L_{X+Y}$ of the most parsimonious tree obtained by combining them into a single
analysis (X+Y), and the sum $L_X + L_Y$ of the lengths of the most parsimonious trees obtained
on each character set analyzed separately: $D_{XY} = L_{X+Y} - (L_X + L_Y)$. The null distribution of D
is given by a procedure of random partition of the combined matrix (X+Y) into two matrices P
and Q of the same sizes as X and Y, respectively (note that this procedure can be extended to
more than 2 datasets). The null hypothesis of congruence between X and Y is rejected if $D_{XY}$
appears to be greater than 95% of the $D_{PQ}$ values computed. In other words, we take a 5% risk of
falsely rejecting the null hypothesis of congruence between datasets. If this null hypothesis is
not rejected, datasets X and Y can be combined into a single analysis.
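The procedure can be illustrated with a minimal Python sketch. It is not the PAUP* implementation used in this study: the most parsimonious length is found here by exhaustive enumeration of all rooted binary trees with Fitch's algorithm, which is only feasible for a handful of taxa, and the names `mp_length`, `ild` and `ild_test`, the toy matrix and the (hits + 1)/(trials + 1) convention for the tail probability are all assumptions made for the illustration.

```python
import random

def fitch(tree, states):
    """Return (state set, number of changes) for one unordered character."""
    if not isinstance(tree, tuple):              # a leaf is a taxon name
        return {states[tree]}, 0
    (left, n_left), (right, n_right) = fitch(tree[0], states), fitch(tree[1], states)
    shared = left & right
    if shared:
        return shared, n_left + n_right
    return left | right, n_left + n_right + 1

def insert_leaf(tree, leaf):
    """Yield every rooted binary tree obtained by attaching `leaf` to `tree`."""
    yield (leaf, tree)
    if isinstance(tree, tuple):
        for sub in insert_leaf(tree[0], leaf):
            yield (sub, tree[1])
        for sub in insert_leaf(tree[1], leaf):
            yield (tree[0], sub)

def all_trees(taxa):
    """Enumerate all rooted binary trees on `taxa` (very few taxa only)."""
    if len(taxa) == 1:
        yield taxa[0]
        return
    for tree in all_trees(taxa[1:]):
        yield from insert_leaf(tree, taxa[0])

def mp_length(matrix, columns):
    """Length of the most parsimonious tree for the chosen character columns."""
    taxa = sorted(matrix)
    return min(
        sum(fitch(tree, {t: matrix[t][c] for t in taxa})[1] for c in columns)
        for tree in all_trees(taxa)
    )

def ild(matrix, cols_x, cols_y):
    """Incongruence length difference D = L(X+Y) - (L_X + L_Y)."""
    return (mp_length(matrix, cols_x + cols_y)
            - mp_length(matrix, cols_x) - mp_length(matrix, cols_y))

def ild_test(matrix, n_x, trials=200, seed=0):
    """Compare the observed D with D for random partitions of the same sizes."""
    rng = random.Random(seed)
    columns = list(range(len(next(iter(matrix.values())))))
    observed = ild(matrix, columns[:n_x], columns[n_x:])
    hits = 0
    for _ in range(trials):
        rng.shuffle(columns)                     # random partition P | Q
        if ild(matrix, columns[:n_x], columns[n_x:]) >= observed:
            hits += 1
    return observed, (hits + 1) / (trials + 1)

# Hypothetical toy matrix: characters 0-2 group (A,B), characters 3-5 group (A,C).
toy = {"A": "000000", "B": "000111", "C": "111000", "D": "111111"}
observed_d, p_value = ild_test(toy, n_x=3)
```

In this toy case each subset fits its own tree with three changes, whereas the combined matrix requires nine, so the observed difference is 3. With real wordlists the exhaustive search would have to be replaced by a heuristic tree search.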
2. Goodness of fit of the data onto a tree
In a cladistic framework, it is possible to evaluate the goodness of fit of the character
matrix to each of the selected trees. The amount of shared innovations, and by correlation, the
amount of homoplasy, is traditionally estimated by the consistency index (CI, Kluge and
Farris, 1969). This index is computed as the minimal number (R) of transformations required
to explain the states of all characters divided by the number (L) of changes observed on the
tree. L is equal to the length of the tree in the case of optimal trees. The larger the CI, the
smaller the amount of homoplasy. A tree for which there is no homoplasy has a CI equal to 1.
The homoplasy index is thus calculated as HI = 1-CI.
The consistency index can also be computed only on the fraction of characters which are
informative for the subgrouping. This excludes parsimony uninformative characters, such as
characters which are constant, or where unique states are not shared. The CI excluding
uninformative characters can differ significantly from the uncorrected CI when a large
fraction of the characters have states present in only one dialect.
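The distinction between constant, uninformative and informative characters can be made explicit with a short Python sketch. It assumes the usual parsimony criterion, under which a character is informative only if at least two of its states each occur in at least two taxa; the function name and the use of `None` for missing data are assumptions of the illustration.

```python
from collections import Counter

def classify_character(states):
    """Classify one character from the list of its states across taxa."""
    # None marks a missing entry (an assumption of this sketch).
    counts = Counter(s for s in states if s is not None)
    if len(counts) <= 1:
        return "constant"
    # Count the states that are shared by at least two taxa.
    shared_states = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared_states >= 2 else "uninformative"

examples = {
    "same state everywhere": ["a", "a", "a", "a"],
    "one unique state": ["a", "a", "a", "b"],
    "two shared states": ["a", "a", "b", "b"],
}
labels = {name: classify_character(states) for name, states in examples.items()}
```

A character whose states are each found in a single dialect is thus uninformative for subgrouping, even though it still contributes changes to the tree length and therefore to the uncorrected CI.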
Although a very convenient measure of homoplasy, the consistency index is difficult to
interpret. In fact, whereas its maximal value is equal to 1 for a homoplasy-free dataset, its
minimal value is not null, and depends on the number of taxa considered (Archie, 1989).
Other measures of homoplasy have been proposed which do not have this flaw. The retention
index (RI, Archie, 1989; Farris, 1989), for instance, ranges from 1 (no homoplasy) to 0 (no
phylogenetic information). This index is computed in a similar fashion to the CI, but relates
both the minimal number of changes (R) and the observed number of changes (L) to the maximal
number (G) of transformations that the dataset could require on any tree. RI is therefore equal
to (G - L) divided by (G - R), so that it reaches 1 when L = R and 0 when L = G. Finally, a third
index has been proposed by Farris (1989), called the rescaled consistency index (RC), which is
equal to the product of the consistency index and the retention index, and which rescales the CI
between 0 and 1.
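As a worked illustration, the four indices can be computed from the three quantities defined above. The sketch below uses the standard definitions CI = R/L, HI = 1 - CI, RI = (G - L)/(G - R) and RC = CI x RI; the function name and the example values are hypothetical.

```python
def homoplasy_indices(r_min, l_obs, g_max):
    """Return CI, HI, RI and RC from minimal, observed and maximal step counts."""
    if g_max == r_min:
        raise ValueError("RI is undefined when no character can vary in length")
    ci = r_min / l_obs
    ri = (g_max - l_obs) / (g_max - r_min)
    return {"CI": ci, "HI": 1 - ci, "RI": ri, "RC": ci * ri}

# Hypothetical values: at least 10 changes required, 12 observed, at most 20.
indices = homoplasy_indices(r_min=10, l_obs=12, g_max=20)

# Consistency check on the values reported for the 35 wordlist: CI * RI.
rc_35 = round(0.8849 * 0.766, 3)
```

With the hypothetical values, CI is 10/12 and RI is 8/10, so RC is two thirds. The last line simply verifies that the product of the CI and RI reported for the 35 wordlist reproduces the reported RC of 0.678.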
The significance of the CI can however be tested. This can be done by calculating the CI
distribution for a sample of random trees (Cavalli-Sforza et al. 1992). Sanderson and
Donoghue (1989) showed that the CI is highly correlated with the number of taxa but does not
show a significant relationship to the number of characters used for the analysis. They suggest
that the logarithm of the CI is linearly correlated with the number of taxa and provide a
method for calculating the threshold CI value over which the observed CI value for a tree can
be considered significant. In our case, where 24 taxa are considered, this threshold CI is equal
to 0.372.
Being measures of homoplasy, these indices (CI, RI and RC) are therefore also indices of
the ‘tree-likeness’ of the dataset: the better the fit of the characters to the optimal trees, the
more adequate is the tree diagram to account for the observed dataset. If these indices are
significantly different from what would be expected by chance only, then the trees for which
they are computed are also significant.
3. Assessment of clade support
Although the 100 and the 200 wordlists appear consistent as a whole, it is unlikely that
they are homoplasy-free. The presence of homoplasy implies that multiple evolutionary paths
have been involved in generating the characters. Therefore, it can be of interest to test the
robustness of the selected trees towards a random resampling of the character matrix, in other
words, to estimate empirically the variability of the phylogenetic estimate. Each clade can be
associated with a statistical measure of support, and consequently, it becomes possible to
evaluate the confidence one can have in the interpretation of the subgrouping patterns.
The resampling methods we used for assessing the robustness of a tree are bootstrapping
and jackknifing:
− Bootstrap (Efron, 1979; Felsenstein, 1985): This procedure involves resampling with
replacement of N characters from the N characters of the initial dataset. This produces
a fictional sample of the same size, called a replicate. As sampling is done with
replacement, some of the original characters may have been sampled several times,
while others may be left out entirely. This procedure mimics repeated sampling of the
data from the initial population of characters.
− Jackknife (Mueller and Ayala, 1982): In its original formulation, this procedure
consisted in dropping one observation at a time from one’s sample and calculating the
phylogenetic estimate each time. The variability of the estimate was then calculated on
the small variations that were caused. Later variants of this method involve a sub-
sampling of a given fraction p of the initial set of characters at a time and estimating
the variance on the phylogenies reconstructed on each sub-sample of the data
(Felsenstein, 1985; Wu, 1986). The half-delete jackknife is the most widely used
variant of jackknife, and involves multiple sub-sampling of half the characters.
Bootstrap and jackknife both consist in a random re-weighting of the data: bootstrap
proceeds to a completely random re-weighting, whereas jackknife assigns a weight of 0 to all
deleted characters, and a weight of 1 to all included characters. It is not clear, however, whether
either of these methods is substantially more advantageous than the other (Felsenstein, 2004), but
one could favour jackknife over bootstrap as the latter creates artificial data sets while the
former uses different sub-samples of the real data. Moreover, bootstrap, as implemented in
most phylogenetic software (following Felsenstein 1985), tends to overestimate bootstrap
values when multiple equally parsimonious trees are selected for a given replicate (Farris,
2004).
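The re-weighting view of both procedures can be sketched in a few lines of Python. This is only an illustration of how one replicate is drawn, not the routine used by PAUP*; the function names and the `keep_fraction` parameter are assumptions.

```python
import random
from collections import Counter

def bootstrap_weights(n_chars, rng):
    """Draw n_chars characters with replacement; weight = times drawn."""
    draws = Counter(rng.randrange(n_chars) for _ in range(n_chars))
    return [draws[i] for i in range(n_chars)]

def jackknife_weights(n_chars, rng, keep_fraction=0.5):
    """Keep a random fraction of the characters without replacement."""
    kept = set(rng.sample(range(n_chars), int(n_chars * keep_fraction)))
    return [1 if i in kept else 0 for i in range(n_chars)]

rng = random.Random(1)
boot = bootstrap_weights(200, rng)   # weights sum to 200, some are 0, some > 1
jack = jackknife_weights(200, rng)   # half-delete: 100 weights of 1, 100 of 0
```

Each weight vector defines one replicate: a character with weight 0 is left out of the analysis, and a character with weight 2 counts twice in the tree length.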
Whether using bootstrap or jackknife, we generate a large number of replicates, for each
of which we infer a phylogeny using a given method of phylogenetic reconstruction. If
distance methods are used, the resampling has to occur on the original character data before
the distance matrix is computed. We end up with a cloud of trees corresponding to a
collection of phylogenetic estimates for the languages under study. These can be summarized
into a majority-rule consensus tree, called a bootstrap (or a jackknife) tree. This tree shows the
subgroups that are recovered in more than 50% of the replicates, and associates the exact
percentage of replicates supporting this subgroup. The higher this value, the more robust the
subgroup, and the more confident one can be in interpreting it. The subgroups that are
supported by less than half of the replicates are not represented and appear unresolved on the
consensus tree.
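The summary step can be sketched as follows, assuming that each replicate tree is rooted and written as nested tuples of taxon names. The function names and the four-taxon example are hypothetical, and the sketch only returns the supported clades with their percentages rather than drawing the consensus tree.

```python
from collections import Counter

def leaf_sets(tree, found):
    """Collect the leaf set of every internal node of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, found) for child in tree))
    found.add(leaves)
    return leaves

def clade_support(trees):
    """Percentage of replicate trees in which each clade appears."""
    counts = Counter()
    for tree in trees:
        found = set()
        everything = leaf_sets(tree, found)
        found.discard(everything)            # the full taxon set is trivial
        counts.update(found)
    return {clade: 100 * n / len(trees) for clade, n in counts.items()}

def majority_rule(trees):
    """Keep only the clades recovered in more than 50% of the replicates."""
    return {c: s for c, s in clade_support(trees).items() if s > 50}

# Hypothetical replicates: three trees group (A,B) and (C,D), one groups (A,C) and (B,D).
replicates = [(("A", "B"), ("C", "D"))] * 3 + [(("A", "C"), ("B", "D"))]
consensus = majority_rule(replicates)
```

Here the clades (A,B) and (C,D) are each retained with a support of 75%, while (A,C) and (B,D), found in a single replicate, are dropped and would appear unresolved.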
In either case, however, there is no absolute threshold over which the certainty of a group
can be acknowledged. One has to be aware that the lower the threshold of acceptance of
subgroups, the less support one can have in their interpretation. For instance, a subgroup that
is supported by a value of 50% is a subgroup which is actually rejected by half the replicates.
In such a case, there is as much support for the subgroup as against it. The problem can be
construed as a hypothesis test where we test whether a clade is actually supported by the data. A
clade can be considered as valid if no more than 5% of the random sample argues against it,
which would bring the threshold up to 95%.
III. Results
Tree reconstruction was performed using PAUP*4b10 (Swofford, 2002) for both
character and distance based methods. PAUP* was also used for the computation of the
indices of homoplasy and for the resampling procedures.
1. Congruence within wordlists
Each of the partitions defined in Figure 2 has been tested for congruence using the ILD
test over 1000 trials. Table 2 shows the tail probabilities obtained for each partition. The null
hypothesis of congruence is rejected when the tail probability is no larger than 5%, in other
words, when the observed incongruence within the partition is larger than 95% of the
randomly generated partitions of the same size. None of the five partitions reaches this
threshold (the smallest tail probability is 0.085, for Partition 1). When the 35 word list is
extended first to 100 then to 200 items, the newly added items therefore introduce a signal
which is congruent with the one contained in the initial word list. We can therefore follow
Swadesh’s and Yakhontov’s subdivisions of the basic lexicon when performing our
phylogenetic analysis.
Partition    Tail probability (p)
1            0.085
2            0.212
3            0.162
4            0.330
5            0.176
Table 2: Results of the ILD test. The null hypothesis of congruence is rejected if the p-value is no larger than 5%.
2. Sinitic Trees
Trees were rooted using Old Chinese, which is both related to all the languages under
study and of greater time depth. Such a polarization of the trees will allow the discussion on
the pattern of dialect subgrouping. Concerning the measures of homoplasy, values will be
given for all indices (consistency, retention and rescaled consistency), but only the
consistency indices will be discussed, since their significance can be easily assessed using
Sanderson and Donoghue’s (1989) regression method. As mentioned in section II.2, the
critical value of the CI for 24 taxa is 0.372.
a. Maximum-parsimony analyses:
35 wordlist: 25 characters are informative as 3 characters are constant and 7 are parsimony
uninformative. The parsimony analysis of this list results in 77 equally parsimonious trees
which are 252 steps long and show a very significant fit of the data (CI=0.8849, CI excluding
uninformative characters = 0.8797, RI = 0.766, RC = 0.678). The majority-rule consensus tree
(Figure 3) shows the support received by each clade from the cloud of optimal trees. It shows
a good topological agreement for the deep nodes, in spite of the numerous trees selected. A
first division roughly clusters the southern (Yue, Min) dialects against the northern and
central dialects (Mandarin, Wu, Xiang and Gan). Hakka dialects, however, do not fit this
geographical pattern, as they cluster with the northern (and central) dialects when one would
expect them to cluster within the southern group. Mandarin, Wu and Gan fail to cluster as
consistent groups, whereas Xiang is supported by about 73% of the cloud of optimal trees. In
contrast, Min and Hakka are supported by all the trees selected (discussing Yue is
meaningless here as this group is represented by a single dialect). As previously mentioned,
Hakka does not cluster within a southern dialectal group as would be expected from
geography and in accordance with Norman (1988). Instead, this group clusters at the
tip of the northern subdivision, close to the Gan dialects and to the Wenzhou southern Wu
dialect. It has been argued by Yuan (1980) that Gan is closer in its lexicon to Xiang and Wu
than to Hakka, a claim which is supported by only 30% of the tree cloud (23 trees out of 77),
whereas 70% of the tree cloud supports the claim by Sagart (2002) that there is a privileged
relationship between Gan and Hakka.
Figure 3: Majority-rule consensus tree for the 35 wordlist parsimony analysis.
100 wordlist: the extension of the analysis to the 100 wordlist results in a dramatic reduction
of the number of optimal trees from 77 to 6. This indicates that the characters that have been
added to the 35 wordlist have helped resolve the tree topologies. This is further supported by
the strong topological agreement between the equally parsimonious trees as suggested by their
consensus tree (Figure 4). The addition of characters dramatically changes the tree topologies
but only slightly affects the value of the consistency index which remains highly significant
(CI = 0.84, CI excluding uninformative characters = 0.8296, RI = 0.6771, RC = 0.5685).
Mandarin, Wu and Gan remain unsupported by the data, whereas Min, Hakka and Xiang are
supported by all optimal trees. Hakka is brought closer to the root, in accordance with the
claim by Norman (1988) that this group has a close relationship with the southern dialects.
200 wordlist: the 200 wordlist also results in 6 equally optimal trees, whose majority-rule
consensus is shown in Figure 5. The majority consensus tree shows a 100% support for each
of the expected dialect groups, and the consistency index remains in the range of those
observed for the two shorter lists (CI = 0.8605, CI excluding uninformative characters =
0.851, RI = 0.6494, RC = 0.559).
The consensus topology shows a clear bipartition between the southern dialects (Min, Hakka
and Yue) and the northern (Mandarin, Wu) and central (Xiang, Gan) dialects. No
hierarchical ordering of the southern groups is supported, whereas the structure within the
northern-central group is constant across the cloud of optimal trees. Mandarin appears as a
sister-group to Xiang; the group defined by (Mandarin, Xiang) is in turn a sister-clade to Gan;
and ((Mandarin, Xiang), Gan) is in turn a sister-clade to Wu. A summary topology is given in
Figure 6.
Figure 6: Tree summarizing the relationship between dialect groups as inferred from the parsimony analysis of the 200 wordlist.
Norman (1988) argues that Mandarin and Min are the most neatly characterized groups,
with sharp boundaries distinguishing them from the neighbouring dialects. Using lexical data,
this claim is verified for Min which clusters in a coherent subgroup with a constant topology
for all wordlists. In contrast, Mandarin clusters as a stable group only for the 200 wordlist,
that is, only with the least conservative dataset. Furthermore, using this list, Mandarin displays
a geographical structure, with distinct Northern (Beijing, (Taiyuan, Yuci)) and Eastern
(Yingshan, Wuhan) subdivisions.
b. Distance analyses:
Figure 7 (a), (b) and (c) show the trees obtained for the 35, 100 and 200 wordlists,
respectively. None of the wordlists results in the accepted classification: whereas Gan,
Xiang, Hakka and Min cluster correctly, Mandarin and Wu are more difficult to recover.
Mandarin clusters correctly when the 200 wordlist is used, but even then, Wu fails to cluster
correctly, as the Wenzhou southern Wu dialect appears closer to southern or central dialects than
to the group it should belong to.
Figure 7: Neighbour-Joining trees from the (a) 35, (b) 100 and (c) 200 wordlists.
In this respect, distance analyses seem to perform less well than parsimony analyses. This
may be because the distance used rests on assumptions which are inappropriate
to account for the underlying evolutionary mechanisms at work here. In fact, the distance used
assumes that all characters evolve at the same rate, thus imposing an evolutionary clock
(Ruvolo, 1987). The implicit assumption in the definition of wordlists, however, is that the
items of the 35 wordlist do not evolve at the same rate as the items added to constitute first the
100 then the 200 wordlist. The fact that the three word subsets do not evolve at equal rates
does not imply that the signals they contain are hierarchically incompatible (and therefore
incongruent). However, distance analyses are based on global similarity and therefore they
average the behaviour of the three word subsets, whereas character based methods, which
consider each character independently, can account for the specific behaviour of each of them.
The other dialect groups – when they cluster as expected – have variable relative positions
depending on the wordlist used. However, the lexicostatistical trees always show a robust
geographic bipartition between the Northern (Wu, Mandarin) and Central (Gan and Xiang)
dialects on the one hand and the Southern dialects (Min, Hakka and Yue) on the other, as do
the parsimony analyses on the 100 and 200 wordlists.
In the parsimony analyses, all wordlists supported a basal position for Yue and Min
dialects, which are known to have retained numerous archaisms from Old Chinese (Norman,
1988). In the distance analyses, the variation in the position of Old Chinese depending on the
wordlist considered is particularly interesting: whereas the parsimony trees distinguish clearly
Old Chinese from the Min group on the 35 wordlist, the distance tree groups Old Chinese
within Min using that same list. The 100 wordlist still groups Old Chinese with Min against
all other dialects, but clearly distinguishes a Min group. Only when the 200 wordlist is used is
the privileged relationship between Min and Old Chinese lost. If Yuan’s classification (1980)
is accepted, then this behaviour of Old Chinese with respect to the southern groups is in
agreement with the claim that the 35 wordlist is more conservative than the 100 wordlist,
which is in turn more conservative than the 200 wordlist. In fact, the archaisms conserved in
Min are shared with Old Chinese thus increasing their global similarity. The addition of less
conservative items has the opposite effect of decreasing the similarity between Min and Old
Chinese while increasing the number of traits specific to Min, allowing Min to cluster as a
distinct group.
c. Clade support:
The only configuration which allowed Yuan’s (1980) dialectal groups to be recovered is
the parsimony analysis of the 200 wordlist. Consequently, it is the only configuration which
will be discussed concerning clade support. The re-sampling techniques of bootstrap and
jackknife (Figure 8) result in the collapse of the seemingly stable classification obtained in
Figure 5 into a star-like (unresolved) pattern. Acceptable support scores (~75%) are
maintained for the bipartition into southern (Min, Hakka and Yue) and northern-central
(Mandarin, Wu, Xiang and Gan) dialects. The southern dialects are more basal, in agreement
with genetic evidence of their early differentiation from Old Chinese (Yuan, 1980; Norman,
1988). Min, Hakka and Xiang are well supported (~85%), but the northern-central group loses
its resolution and the hierarchical structure shown in Figure 6 becomes star-like.
Resampling techniques proceed through character re-weighting. The fact that they
collapse the original tree structure into a nearly star-like diagram suggests that various
evolutionary signals are present in the dataset, and that the one shaping the tree in Figure 6 is
not very robust. Still, the low bootstrap and jackknife support values contradict the high
consistency index recorded for the 200 wordlist. Whereas this latter supports the hierarchical
structures selected as optimal, the former invalidate the subgrouping pattern these topologies
define.
Figure 8: Bootstrap (a) and jackknife (b) trees for the 200 word list in a parsimony
framework.
This contradiction can be explained by the high polymorphism of the characters
considered, which hinders the application of resampling techniques. In fact, many characters
are represented by a large number of states (Figure 9) and therefore are very informative.
Consequently, excluding (or over-weighting) a character implies the exclusion (or
amplification) of a large amount of signal and dramatically threatens the stability of the
optimal topology.
Figure 9: Frequency of the characters depending on their number of states.
Given the massive migratory activity and the diglossic situation that have prevailed
throughout the history of China, departure from the tree model would not be surprising and
one could interpret the bootstrap values as the consequence of the misspecification of the
model of representation of these dialects’ relationships. At the same time, the data show a
good fit on the trees selected which, moreover, are in agreement with the traditional
classification for Chinese dialects. But if we choose to favour the consistency indices over
the bootstrap values, then we have to justify why the topology obtained for the
200 wordlist should prevail over the ones obtained for the 100 and 35 wordlists. The absence
of cross validation between homoplasy indices and resampling techniques is problematic
because we have no satisfying measure of the variance of the inferred topological estimate,
which seems, however, to be very sensitive to character sampling and weighting. The only
reference we have for accepting the tree given in Figure 5, Yuan’s classification, is external to
the analytical framework, and we are still unable to test the assumption of tree-likeness.
3. Sinitic networks
The subgrouping methods described up to here presuppose that we have to reconstruct a
tree, but there are other methods which do not require this assumption. Instead of producing a
tree by default, they generate reticulated graphs, or networks, which visualize the degree of
departure from the tree-like model, and consequently, assess its validity for the data. These
graphs also show patterns of subgrouping which can be directly interpreted in terms of
language evolution.
Different network reconstruction methods are available, and are either character-based or
distance-based. Median networks (Bandelt et al. 1995; Bandelt et al. 1999) are the only
character-based methods, but they have trouble dealing with such highly polymorphic data as
ours (Forster and Toth, 2003). We are therefore left with distance-based approaches to
network reconstruction, the best of which appears to be, at the moment, the Neighbour Net
method (Bryant and Moulton, 2002, 2004) implemented in the SPLITSTREE software (Huson,
1998; Huson and Bryant, 2004). This method generates splits graphs from a pairwise
distance matrix between taxa, a split being a partition of the set of taxa into two non-empty subsets. It is an
agglomerative procedure, similar to the tree method of Neighbour-Joining, which proceeds to
a heuristic search for bipartitions in a set of taxa but without necessarily producing a tree. A
step-by-step description of the method can be found in Bryant et al. (2004), illustrated on the
case of Indo-European languages. This method seems to be the most efficient network
generating approach when the number of taxa increases (Bryant et al. 2004). Neighbour Nets
display edges and boxes, where boxes represent conflicting signals while edges represent an
unambiguous signal. The support for a given split is given by the length of the edge (or the
box’s edge) defining that split, while the amount of conflict is represented by the width of the
box representing it.
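The notion of conflicting signal can be made concrete with a short sketch. It relies on the standard result that two splits of the same taxon set are compatible, and can therefore sit on the same tree, if and only if at least one of the four intersections of their sides is empty; a collection of splits is tree-like exactly when all pairs are compatible. The function names and the example taxa are hypothetical, and the sketch does not reproduce the Neighbour Net algorithm itself.

```python
from itertools import combinations

def compatible(side_a, side_b, taxa):
    """True if the splits side_a | rest and side_b | rest can share a tree."""
    a1, b1 = set(side_a), set(side_b)
    a2, b2 = set(taxa) - a1, set(taxa) - b1
    # Compatible when at least one of the four intersections is empty.
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))

def is_tree_like(splits, taxa):
    """True if every pair of splits is compatible (no box would be needed)."""
    return all(compatible(s, t, taxa) for s, t in combinations(splits, 2))

taxa = {"A", "B", "C", "D", "E"}
nested = [{"A", "B"}, {"A", "B", "C"}]     # AB | CDE and ABC | DE
conflicting = [{"A", "B"}, {"A", "C"}]     # AB | CDE and AC | BDE
tree_like = is_tree_like(nested, taxa)
has_conflict = not is_tree_like(conflicting, taxa)
```

The two nested splits fit on a single tree, whereas the pair AB | CDE and AC | BDE cannot, and would be displayed as a box in a splits graph.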
We have reconstructed Neighbour Nets from our data using both a lexicostatistical
(discussed earlier) and a Hamming distance. These two distances produce similar graphs,
which was expected due to their strong correlation. The graphs obtained from the different
wordlists displayed a strong web of conflicting signals at the base of clearly distinguishable
subgroups. This picture contrasts with what was obtained for Indo-European (Bryant et al.
2004), where the acknowledged subfamilies were supported by long edges and connected in a
fairly conflict-free pattern.
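For reference, a Hamming distance between two coded wordlists can be sketched as follows. The sketch assumes that each dialect is a sequence of character states (one per wordlist item), that the distance is normalised by the number of items compared, and that items missing in either dialect (coded `None`) are skipped; these conventions, like the function names and toy data, are assumptions and may differ from the settings used for the analyses reported here.

```python
from itertools import combinations

def hamming_distance(row_a, row_b):
    """Proportion of compared characters for which two dialects differ."""
    compared = [(a, b) for a, b in zip(row_a, row_b)
                if a is not None and b is not None]
    if not compared:
        raise ValueError("no character is coded in both dialects")
    return sum(a != b for a, b in compared) / len(compared)

def distance_matrix(matrix):
    """Pairwise distances for a dict mapping dialect name to coded states."""
    return {(x, y): hamming_distance(matrix[x], matrix[y])
            for x, y in combinations(sorted(matrix), 2)}

# Hypothetical coding of four items in three dialects.
toy = {"X": [1, 1, 2, 3], "Y": [1, 2, 2, 3], "Z": [2, 2, None, 3]}
distances = distance_matrix(toy)
```

The resulting pairwise matrix is the kind of input from which a distance-based tree or network is built.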
Figure 10 shows the Neighbour Net obtained for the 200 wordlist using Hamming
distance. The acknowledged dialect groups are recovered, as well as the partition between
southern and northern-central dialects. The lack of resolution of the northern-central group is explained
by the heterogeneity of the Wu group (Wenzhou dialect diverging from the northern
Shanghai, Suzhou and Ningbo dialects) and the poor support of Mandarin against the central
Gan and Xiang dialects. Interestingly, the trees obtained by parsimony analyses are salient in
the Neighbour Nets, whatever the wordlist considered. These Neighbour Nets manage to
summarize the apparent contradiction discussed previously between the good fit of the data on
the most parsimonious trees as shown by the good CI values and the instability of their
topology as represented by the poor bootstrap and jackknife values.
Although dialect subgroups are clearly distinguished, they appear to be related in a rather
star-like pattern, with a basal web of conflicts linking all Sinitic languages. This pattern fits
the notion of a dialect continuum better than that of a steadily increasing differentiation and
isolation of dialect subgroups implied by the tree model. It visually summarizes the two
opposite forces that have supposedly shaped the Sinitic linguistic domain (Ogura, 1994): a force
of differentiation, related to the spread of people throughout China, created distinguishable
dialect subgroups, while a homogenization force due to diglossia and heavy borrowing
created the continuum. Thus, even though the differentiation force would have proceeded in a
tree-like pattern, it was always counterbalanced by the homogenization force which diluted
the hierarchical structure, rendering the tree model inappropriate to account for these dialects’
relationships.
Figure 10: Neighbour Net obtained from the 200 wordlist using Hamming distance.
IV. Conclusion
Should we then conclude that the tree model is inappropriate in the case of Sinitic
languages? The Neighbour Net analyses show that parsimony trees are relevant since they
underlie the networks. These trees prove nevertheless incapable of solving the contradiction
between a good fit of the data and poor bootstrap support and require external evidence to
decide if they should be considered as relevant. In contrast, the Neighbour Nets reconcile
these apparently contradicting observations in a synthetic and coherent framework.
Furthermore, they recover the traditional classification and explain why this classification is
difficult to recover when trees are used.
It is necessary that tree reconstruction methods be backed up by measures of homoplasy
and by re-sampling techniques to ascertain that the figures obtained are not an artefact of their
assumption of tree-likeness. However, even if these complementary procedures may detect
weaknesses of the tree model, they do not evaluate explicit alternatives to the pure tree model.
Methods reconstructing networks do, and we argue that networks should be more broadly
used in historical linguistics in conjunction with classical tree reconstruction methods.
In fact, although networks are useful to visualize how languages are related without a
prejudice of tree-likeness, they may be troublesome to interpret, especially when both the
number of taxa and the number of characters increase. In the Neighbour Net approach, for
instance, it is possible to list the characters which are incompatible with a given split and
which are therefore involved in a particular ensemble of conflicting trajectories. However, in
cases like Sinitic where such a large amount of conflict is detected, the task quickly becomes
unmanageable.
Various reasons can explain the lack of tree-likeness in the Sinitic group. We believe that
the particular mode of development that these languages have experienced is responsible for
it. Diglossia and population contact have probably permanently counterbalanced the effect of
a tree-like divergence scheme, resulting in the continuum depicted by the Neighbour Nets.
However, we should not neglect alternative explanations linked to our processing of the data.
One possible explanation is that wordlists are not reliable in a phylogenetic framework, either
because the lexicon is inappropriate (Ringe, 2002; Balter, 2003), or because a larger amount of
data is necessary. Another possible explanation is that the way we coded our data is flawed. It
is indeed unconstrained, and corresponds to a neutral observation of character polymorphism
with no specification of any privileged transitions between character states. This means
however that the probability of change in the suffix without a change in the root is the same as
the probability of changing from a cognate class to another. It also means that there are no
transitional states, such as variant roots, and no ordered transformations. Such approximations
are unrealistic and are probably responsible for the loss of a large amount of information
during the coding procedure. The evolution could indeed be tree-like, but the signal is too
weakened during the recoding process to allow the recovery of a robust hierarchy. The next
step in the analysis of Sinitic wordlists will necessarily have to be the implementation of
complex transformation series between states. However, even if such an acceptable coding is
provided, constraints on transformations cannot be incorporated in distance calculus and this
coding is therefore useless in a distance-based framework, for both tree and network
reconstruction. Unfortunately, up to now, no satisfactory methods exist to bridge the gap
between distance-based methods which consider alternatives to the pure tree model
and character-based methods which allow the diagnosis of homoplasy. In any case, if tree
reconstruction methods are used, it is vital that they be complemented by approaches for
testing the validity of the tree model before any interpretation of the subgrouping patterns.
References
Archie, James W. 1989. "Homoplasy excess ratios: New indices for measuring levels of
homoplasy in phylogenetic systematics and a critique of the consistency index". Systematic
Zoology, 38: 253-269.
Balter, Michael 2003. "Early Date for the Birth of Indo-European Languages". Science
302(5650): 1490-1491.
Bandelt, Hans-Juergen & Dress, Andreas W. M. 1992. "Split Decomposition: A new and useful
approach to phylogenetic analysis of distance data". Molecular Phylogenetics and Evolution 1:
242-252.
Bandelt, Hans-Juergen, Forster, Peter, Sykes, Bryan & Richards, Martin B., 1995. "Mitochondrial
portraits of human population using median networks". Genetics 141: 743–753.
Bandelt, Hans-Juergen, Forster, Peter & Rohl, Arne 1999. "Median-joining networks for inferring
intraspecific phylogenies". Molecular Biology and Evolution 16: 37-48.
Bryant, David & Moulton, Vincent 2002. "Neighbor-Net, an agglomerative algorithm for the
construction of phylogenetic networks". Workshop on Algorithms in Bioinformatics (WABI)
ed. by R. Guigo and D. Gusfield, vol. LNCS 2452, 375-391. Springer-Verlag.
Bryant, David & Moulton, Vincent 2004. "NeighborNet: an agglomerative algorithm for the
construction of phylogenetic networks". Molecular Biology and Evolution 21(2): 255-265.
Bryant, David, Filimon, F. and Gray, Russell D., in press "Untangling our past: Languages, trees,
splits and networks". The Evolution of Cultural Diversity: Phylogenetic Approaches ed. by
Ruth Mace, Clare J. Holden & S. Shennan. UCL Press.
Cao, Ying, Janke, Axel, Waddell, Peter J., Westerman, Michael, Takenaka, Osamu, Murata,
Shigenori, Okada, Norihiro, Paabo, Svante & Hasegawa, Masami 1998. "Conflict Among
Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders". Journal
of Molecular Evolution 47(3): 307-22.
Cavalli-Sforza, Luigi. L., Minch, Edward and Mountain, Johanna L., 1992. "Coevolution of genes
and languages revisited". Proceedings of the National Academy of Sciences USA 89(12):
5620-4.
Chen, Baoya 1996. Lun yuyanjiechu yu yuyanlianmeng [Language contact and language union].
Beijing: Yuwen chubanshe.
Cheng, Chin-Chuan 1991. "Quantifying Affinity among Chinese Dialects". Languages and
Dialects of China, ed. by William S.-Y. Wang, 78-112. Berkeley: Journal of Chinese
Linguistics.
De Queiroz, Alvaro A. A. 1993. "For consensus (sometimes)". Systematic Biology 42(3): 368-
372.
Dixon, Robert M. W. 1997. The Rise and Fall of Languages. Cambridge: Cambridge University
Press.
Dolgopolsky, Aaron. B. 1964. "A Probabilistic Hypothesis Concerning the Oldest Relationships
Among the Language Families in Northern Eurasia" (Original Russian translated into
English). Typology, Relationship and Time ed. by Vitalij V. Shevoroshkin & Thomas L.
Markey. Ann Arbor: Karoma Publishers.
Efron, Bradley 1979. "Bootstrap methods: Another look at the jackknife". The Annals of Statistics
7: 1-26.
Embleton, Sheila 1986. Statistics in Historical Linguistics. Bochum: Brockmeyer.
Erdos, Peter L., Steel, Michael A., Szekely, Laszlo A. & Warnow, Tandy 1997. "Local quartet
splits of a binary tree infer all quartet splits via one dyadic inference rule". Computers and
Artificial Intelligence 16(2): 217-227.
Farris, James S. 1989. "The retention index and the rescaled consistency index". Cladistics 5:
417-419.
Farris, James S., Kallersjo, Mari, Kluge, Arnold G. & Bult, Carol J. 1994. "Testing significance
of incongruence". Cladistics 10: 315-319.
Felsenstein, Joseph 1985. "Confidence limits on phylogenies: An approach using the bootstrap".
Evolution 39: 783-791.
Felsenstein, Joseph 2004. Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates,
Inc.
Forster, Peter & Toth, Alfred 2003. "Towards a phylogenetic chronology of ancient Gaulish,
Celtic, and Indo-European". Proceedings of the National Academy of Sciences USA 100:
9079-9084.
Gray, Russell D. & Jordan, Fiona M. 2000. "Language trees support the express-train sequence of
Austronesian expansion". Nature 405: 1052-1055.
Gray, Russell D. & Atkinson, Quentin D. 2003. "Language-tree divergence times support the
Anatolian theory of Indo-European origin". Nature 426: 435-439.
Graybeal, Anna 1998. "Is it better to add taxa or characters to a difficult phylogenetic problem?".
Systematic Biology 47(1): 9-17.
Hennig, Willi 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Deutscher
Zentralverlag.
Hinnebusch, Thomas J. 1996. "Skewing in lexicostatistic tables as an indicator of contact". Paper
presented at the Round Table on Bantu Historical Linguistics, Université Lumière 2, Lyon,
France, May 30–June 1, 1996.
Holden, Clare J. 2002. "Bantu language trees reflect the spread of farming across sub-Saharan
Africa: a maximum-parsimony analysis". Proceedings of the Royal Society of London B 269:
793-799.
Huson, Daniel H. 1998. "SplitsTree: Analyzing and visualizing evolutionary data".
Bioinformatics 14: 68-73.
Huson, Daniel & Bryant, David 2004. SplitsTree. http://www-ab.informatik.uni-tuebingen.de/software/splits/welcome.html
Kim, Junhyong 1998. "Large-scale phylogenies and measuring the performance of phylogenetic
estimators". Systematic Biology 47(1): 43-60.
Kluge, Arnold G. & Farris, James S. 1969. "Quantitative phyletics and the evolution of anurans".
Systematic Zoology 18: 1-32.
Lee, James & Wong, Bin R. 1991. "Population Movements in Qing China and their Linguistic
Legacy". Languages and Dialects of China, ed. by William S.-Y. Wang, 52-77. Berkeley:
Journal of Chinese Linguistics.
Minett, James W. & Wang, William S.-Y. 2003. "On detecting Borrowing: Distance-based and
Character-based approaches". Diachronica 20(2): 289-330.
Mueller, Laurence D. & Ayala, Francisco J. 1982. "Estimation and interpretation of genetic
distance in empirical studies". Genetical Research 40: 127-137.
Norman, Jerry 1988. Chinese. Cambridge: Cambridge University Press.
Ogura, Mieko 1994. "Dialect Formation in China: Linguistic, Genetic and Historical
Perspectives". In honour of William S-Y. Wang: Interdisciplinary Studies on Language and
Language Change ed. by Matthew Y. Chen & Ovid J.L. Tzeng, 349-372. Taipei: Pyramid
Press.
Rexova, Katerina, Frynta, Daniel & Zrzavy, Jan 2003. "Cladistic analysis of languages: Indo-
European classification based on lexicostatistical data". Cladistics 19: 120-127.
Ringe, Donald, Taylor, Anne & Warnow, Tandy 2002. "Indo-European and computational
cladistics". Transactions of the Philological Society 100: 59-129.
Russo, Claudia A. M., Takezaki, Naoko & Nei, Masatoshi 1996. "Efficiencies of different genes
and different tree-building methods in recovering a known vertebrate phylogeny". Molecular
Biology and Evolution 13: 525-536.
Ruvolo, Maryellen 1987. "Reconstructing genetic and linguistic trees: phenetic and cladistic
approaches". Biological Metaphor and Cladistic Classification, ed. by Henry M. Hoenigswald
& Linda F. Wiener, 193-216. Philadelphia: University of Pennsylvania Press.
Sagart, Laurent 1993. "Chinese and Austronesian: Evidence for a genetic relationship". Journal of
Chinese Linguistics 21(1): 1-63.
Saitou, Naruya & Nei, Masatoshi 1987. "Neighbor-joining Method: a new method for
reconstructing phylogenetic trees". Molecular Biology and Evolution 4(4): 406-425.
Sanderson, Michael J. & Donoghue, Michael J. 1989. "Patterns of variation in levels of
homoplasy". Evolution 43: 1781-1795.
Schleicher, August 1853. "Die ersten Spaltungen des indogermanischen Urvolkes". Allgemeine
Zeitung fuer Wissenschaft und Literatur. Böhlau, Weimar.
Schmidt, Johannes 1872. Die Verwandtschaftsverhältnisse der Indogermanischen Sprachen.
Weimar: H. Böhlau.
Starostin, Sergei 1991. Altajskaja Problema i Proisxoždenie Japonskogo Jazyka [The Altaic
problem and the origin of the Japanese language]. Moscow: Nauka.
Swadesh, Morris 1952. "Lexico-statistic dating of prehistoric ethnic contacts". Proceedings of the
American Philosophical Society 96: 453-463.
Swadesh, Morris 1955. "Towards greater accuracy in lexicostatistic dating". International
Journal of American Linguistics 21: 121-137.
Swofford, David L. 1995. PAUP*. Phylogenetic Analysis Using Parsimony (and other methods).
Sunderland, Massachusetts: Sinauer Associates.
Trask, Larry 1996. Historical Linguistics. London: Arnold.
Wang, William S.-Y. 1991. "Introduction". Languages and Dialects of China ed. by William S.-
Y. Wang, 1-3. Berkeley: Journal of Chinese Linguistics.
Wang, William S.-Y. 1998. "Language and the Evolution of Modern Humans". The Origins and
Past of Modern Humans ed. by K. Omoto & P. V. Tobias, 247-262. World Scientific.
Wang, Feng 2004. Basic-words of Chinese Dialects. http://chinese.pku.edu.cn/wangf/wangf.htm
Wang, Feng & Wang, William S.-Y. 2004. "Basic words and language evolution". Language and
Linguistics 5(3): 643-662.
Wheeler, Ward C. 1992. "Extinction, sampling, and molecular phylogenetics". Extinction and
phylogeny, ed. by M. J. Novacek & Q. D. Wheeler, 205-215. New York: University Press.
Wu, C. F. J. 1986. "Jackknife, bootstrap and other resampling plans in regression analysis".
Annals of Statistics 14: 1261-1350.
Yuan, Jiahua 1980. Hanyu fangyan gaiyao [Conspectus on Chinese dialects]. Second edition.
Beijing: Wenzi Gaige Chubanshe.