
Stuck in the forest:

Trees, networks and Chinese dialects

Mahé BEN HAMED (1, 2), Feng WANG (3)

(1) Laboratoire Dynamique du Langage, UMR 5596, Lyon, France
(2) Department of Psychology, University of Auckland, Auckland, New Zealand

email: Mahe.Ben-Hamed@ish-lyon.cnrs.fr, fax: (+64-9)373-7450

(3) Department of Chinese Language & Literature, Peking University, Beijing, China

email: wfwf@pku.edu.cn, fax: (+86-10) 6275-3016

Correspondence to: Mahé BEN HAMED


Summary

This paper discusses the validity of the tree model of evolution for the particular case of Sinitic

languages (or Chinese dialects). Our approach is lexically based, using standardized wordlists. First,

these lists were tested for their congruence, as they are supposed to have evolved at different rates.

Then, the phylogenetic analysis could proceed, using both a distance-based lexicostatistical method

and a character-based maximum parsimony method. The traditional classification of Chinese dialects

is recovered to various extents depending on the method and on the wordlist used, but the character-

based analysis of the 200 Swadesh wordlist outperforms all other analyses. Finally, the validity of the

branching patterns obtained was tested through a variety of techniques. Although the data fits the

inferred trees well, the topology of these trees is collapsed to a star-like pattern when investigated

through resampling methods. The application of a network method confirms that the development of

these Sinitic languages is not tree-like, highlighting the fact that in cases like this tree-reconstruction

methods can be misleading.

Résumé

This article assesses the validity of the tree model in the particular case of the Sinitic
languages (or Chinese dialects). It relies on the analysis of lexical data in the form of
standardized wordlists. These lists are assumed to have evolved at different rates, so their
congruence must be checked before the phylogenetic analyses proper. The latter consist of a
lexicostatistical (that is, distance-based) approach and a character-based method known as
parsimony. The traditional classification is recovered with varying precision depending on the
method and the list used, but the cladistic approach applied to the 200 Swadesh words remains
the best performing. However, although the data fit the inferred trees well, these trees collapse
into star-like diagrams when resampling procedures are used to assess the statistical support for
their topologies. The application of a network method confirms that the development of these
languages, as studied through their lexicons, is not tree-like, and calls for caution in handling
tree-reconstruction methods, which can lead, as in the present case, to erroneous inferences.

Zusammenfassung

This article tests the validity of the tree model meant to represent the development of the
Chinese dialects. It relies on lexical data corresponding to the standard lists of Swadesh and
Yakhontov. The evaluation proceeds in three stages. Having made sure that these nested
wordlists are mutually congruent, we carry out phylogenetic reconstructions, using either a
distance method based on the lexicostatistical distance or a maximum parsimony method. Both
methods lead back, to some extent, to the traditional and generally accepted classification of
these dialects. When permutation tests are carried out to evaluate the statistical support for the
subgroups and their topology, the traditional classification fragments. Furthermore, applying
the Neighbor-Net method, which is not restrictive as to how the relationships between dialects
are represented, leads us to conclude that any tree-like structure is absent. Several explanations
can be proposed for this lack of structure. In any case, tree-like reconstructions should be
handled with caution, since they rest on unverifiable assumptions and can lead to false
conclusions.

I. Introduction

The first tree diagram used to represent species evolution can be found in Darwin’s first

notebook on transmutation of species (1837). Independently, the German philologist August

Schleicher also used the tree metaphor for Indo-European languages. His family-tree theory

(Stammbaumtheorie, 1853) was soon challenged by Schleicher’s own student, Johannes

Schmidt, who argued that innovations could spread from multiple centres, a distribution for

which a strictly hierarchical tree model could not account. Schmidt proposed an alternative

model, called the wave theory (Wellentheorie, 1872), which he argued was more realistic. In

fact, it shows the continuing contact between languages depending on the features analyzed,

whereas the tree model requires sudden and sharp splits of the language as a whole. In

contrast, the wave model does not give a synthetic figure of how languages are related and

fails to represent earlier and later stages of languages simultaneously (Trask, 1996).

Trees are a dominant paradigm in evolutionary biology. In the past fifty years, this field

experienced substantial theoretical and computational developments, making tree

reconstruction automatic and fast through a variety of computerized methods. Such

developments appealed to other fields dealing with evolutionary issues, such as historical

linguistics. Since Schleicher, this discipline has been familiar with the tree model, and the use of

tree reconstruction methods has progressively spread through historical linguistic practice,

even in cases where the tree model may have seemed less appropriate. There are several

reasons for this. First, the tree iconography is a particularly convenient synthetic figure,

straightforward to interpret and intuitive to understand. Second, the most popular tree

reconstruction software is developed by phylogeneticians and only generates trees. Third,

multidimensional analyses are essentially distance-based methods and are particularly

interesting when the distance used proceeds from an explicit evolutionary model. This may be

the case in population genetics, but not in historical linguistics, where no models are available

(yet) which could be used to incorporate evolutionary assumptions in the computation of

linguistic distance. This being said, there persists a rather fundamental question: if we

constrain languages to fit onto a tree when their evolution is not tree-like, what does that tree

actually tell us about how these languages evolved?

The adequacy of the tree model in representing the evolution of a set of languages is a

recurring debate in historical linguistics. In the ideal case of tree-like evolution, the mother

language differentiates into several daughter languages which then diverge progressively.

Although this model applies in the case of the most extensively studied families (Indo-

European, Semitic, Uralic, Algonquian), it cannot be generalized (Dixon, 1997) and has to be

integrated into a perspective which simultaneously includes a dimension of genetic descent

and a dimension of areal diffusion.

The tree model is also a matter of debate in evolutionary biology. In population genetics,

for instance, it is challenged by multidimensional analyses of variance. In fact, at the infra-

specific level, historical relationships are the sum of vertical transmission (descent with

modification) and of a horizontal transmission between lineages (migration and admixture).

As such, each approach (trees vs. multidimensional analyses) can capture history only

partially and both are still needed to account for it as a whole. The problem is that

multidimensional analyses (and the wave theory) cannot account for a hierarchical structure

whilst trees cannot account for spread of innovations from multiple centres. Consequently, if

the whole scheme is constrained to fit the tree model, the non-hierarchical information may

distort the hierarchical figure.

The rate of change in language is higher than in biological data, and horizontal transfers

between languages are not as rare as they are between species. Consequently, stratified events of

borrowing can rapidly dilute the phylogenetic signal of linguistic descent and create

reticulations between languages. Borrowing is more or less prevalent and the patterns of

descent are more or less salient, depending on the language group. Therefore, depending on

the amount of borrowing, all cases on the continuum between the pure tree model and the

completely tangled network are plausible. It thus becomes inescapable to question the

adequacy of the tree to represent how languages are related, but this formulation is too

case-sensitive to bear a systematic answer. There is indeed a tree-like signal, but it can be overlaid

by conflicting signals which cause the hierarchical signal to fade and become ambiguous.

Therefore, one should rather question how to justify that the tree model is adequate for a

particular set of languages, and what alternative representations may be used.

This paper specifically addresses these issues in the case of Sinitic languages (or Chinese

dialects). This language group is known for its complex and intertwined evolutionary history

(Wang, 1991) which challenges the tree model. To picture the complexity that must be dealt

with, we can compare this group to Indo-European. Both families have been extensively

studied and have considerable historical records. However, Indo-European, which has

presumably evolved over the past 6000 to 8000 years, is better resolved in terms of its

phylogeny than Sinitic, which has evolved over only half that time span.

Chinese speaking populations have undergone successive waves of migrations and have

experienced a situation of diglossia, probably since ancient times (Wang, 1991; Lee, 1991).

Consequently, Sinitic languages are not expected to have evolved in a steady tree pattern, but

rather as a network. In spite of this intuition, several studies in Chinese dialectology (Cheng,

1991; Ogura, 1994; Wang, 1998; Wang and Wang, 2004) use classical tree reconstruction

methods, with no further assessment of the robustness of the trees obtained. These

reconstructions have relied on global similarity, but such distance-based methods are

intrinsically unable to diagnose deviation from the tree model. Hinnebusch (1996) noted

however that a skewing of the tree co-occurred with borrowing events, an observation which

motivated a method for detecting borrowing based on the evaluation of the skewing in

lexicostatistical trees (Minett and Wang, 2004).

In this paper, we argue that tree reconstruction should be complemented by a number of

procedures which can measure the confidence in the pattern of subgrouping given by the tree

obtained. We also argue for the use of alternative representations which can help visualize

how good an approximation a tree provides for the signal contained in the data. Section 2

introduces the data and the methods used for tree reconstruction. Section 3 presents the

measures used to assess the quality of a tree and Section 4 discusses the results of these

approaches on Sinitic languages, with a particular focus on the issue of tree-likeness. In this

last section, we also explore alternative representations to the tree.

II. Wordlists and tree reconstruction

1. Swadesh lists and derived lists

Swadesh (1952) introduced a lexicon-based model for language change relying on test-

lists of meanings likely to be found in all cultures. There are two such lists and they are

referred to as the Swadesh 100 and 200 wordlists, depending on the number of meanings they

contain. These meanings are putative cultural universals such as body parts, lower numerals,

topographical terms, kinship terms, personal, demonstrative and interrogative pronouns,

naturally-occurring phenomena and basic activities. In addition to their assumed cultural

universality, these meanings belong to the basic core vocabulary. This subset of the lexicon is

supposedly more immune to change, being both more retentive and less prone to borrowing

during language contact events than the rest of the lexicon. In a phylogenetic framework,

these lists correspond to a controlled sampling of lexicon which maximizes the signal to noise

ratio of the information about genetic descent.

The original 200 Swadesh wordlist actually consisted of 207 items (Swadesh, 1952). Chen

(1996: 297) argues that 7 items (‘at’, ‘other’, ‘some’, ‘when’, ‘wipe’, ‘with’ and ‘ye’) should

be discarded as they appear irrelevant for Chinese, which brings the largest wordlist to exactly

200 items. This list was compiled for Old Chinese and 22 modern Chinese dialects, whose

locations are given in Figure 1. The Shanghai dialect is given under two descriptions, bringing the

number of analysed taxa to 24. The affiliation of each dialect is given according to Yuan’s

classification (1980), which is quite uncontroversial among Chinese dialectologists. The data

is accessible on the Peking University website1. The data was recoded in accordance with the

intended phylogenetic processing, as described in paragraphs 2 and 3 of this section.

Figure 1: Map of the Chinese linguistic domain, showing the boundaries of the 7 major
dialectal groups and the locations of the dialects used in this study.
Mandarin: (1) Beijing, (2) Taiyuan, (3) Wuhan, (4) Ningxia, (5) Chengdu, (6) Yuci, (7) Yingshan. Xiang: (8) Changsha, (9)
Shuangfeng. Gan: (10) Anyi, (11) Nanchang. Wu: (12) Ningbo, (13) Shanghai, (14) Suzhou, (15) Wenzhou. Hakka: (16)
Liancheng, (17) Meixian. Min: (18) Fuzhou, (19) Taiwan, (20) Xiamen, (21) Zhangping. Yue: (22) Guangzhou.

In addition to Swadesh’s subdivisions (Swadesh, 1955), shorter versions have been

proposed, namely the Yakhontov 35 wordlist (quoted in Starostin, 1991: 59-60), and the even
1 http://chinese.pku.edu.cn/wangf/wangf.htm

more restrictive 15 wordlist of Dolgopolsky (1964). This latter was not used as it contains too

few items with respect to the number of dialects studied. These lists are not uncontroversial

(for a detailed review of the question, see Embleton, 1986), nor is their use to reconstruct

linguistic phylogenies. It is often argued that regular sound changes and innovations in

inflectional morphology are better informants of linguistic descent than vocabulary, which in

comparison, is particularly sensitive to borrowing (Ringe et al., 2002; Balter, 2003). In

practice however, much importance can be given to lexical evidence. Ogura (1994), for

instance, argues that within Chinese historical linguistics, lexicon is the only linguistic

structure which is “rich enough in detail to permit quantitative analysis” (1994: 350). In fact,

numerous studies in Chinese dialectology rely on lexical data (Chen, 1996; Cheng, 1991;

Ogura, 1994; Sagart, 1993), which is viewed as more informative than other linguistic

features for the time span covering the formation and evolution of Chinese dialects (Ogura,

1994). Lexicon would be informative for the span of several centuries/millennia, whereas

vowels, consonants, tones, morphosyntax and semantics exhibit changes at shallower time

depths.

These basic wordlists attempt to minimize the residual borrowing, convergence and

chance resemblance (generally subsumed under the term homoplasy) which would still escape

the linguist’s vigilance. In fact, a critical step in phylogenetic analyses is to distinguish

between the (phylo)genetic signal and non-vertical transmission. When homoplasy is

prevalent, the reconstructed tree can be well supported by the dataset but still be a poor

estimate of the true phylogeny. In such a case, shorter wordlists such as Yakhontov’s or

Dolgopolsky’s can be useful to rule out as much homoplasy as possible from the analyses. In

fact, the shorter the list, the more conservative it is assumed to be and thus the more

informative about the ‘true phylogeny’. Caution would thus advise the use of the less

homoplasy-sensitive data sample. At the same time, reducing the number of characters used

for the phylogenetic inference also reduces the power of resolution of the dataset. The

problem of homoplasy gives way to that of insufficient sampling. This problem gets harder as

the number of languages increases. Although a large number of taxa2 can be grouped perfectly

well with a single binary character, the expected resolution is not high. Although there must

be a critical number of characters necessary for tree-reconstruction methods to discriminate

between different topologies (Erdös et al. 1997; Kim, 1998), the few studies addressing this

question have reached different conclusions (Wheeler, 1992; Graybeal, 1998; Russo, 1996).

Following Felsenstein (1985), we argue that the fundamental issue is not the number of

characters used but how well a given set of characters supports a clade3 in a phylogeny, a

concern that will be developed in Section 3.

2. Tree reconstruction methods

We have mentioned in section I different paradigms of phylogenetic analysis, which can

be categorized into character-based and distance-based methods.

a. Character-based methods: A character is a discrete attribute observed under a specific

state for each language, and is potentially informative about phylogenetic relationships

between languages.

For any set of taxa, there are multiple possible trees to connect them, the number of trees

growing exponentially with the number of taxa. For instance, for the 24 dialects of our

dataset, there are more than $10^{26}$ possible unrooted bifurcating trees connecting them ($(2n-5)!! \approx 5.6 \times 10^{26}$ for $n = 24$).

Therefore, one needs a criterion to select a subset of all possible trees judged as optimal. This

criterion can be probabilistic as in maximum likelihood and Bayesian inference, provided that

we have an acceptable probabilistic model about character change. Given that we argue for

multiple-state coding of the data, and as no available phylogenetic software can cope with the

2 taxa = objects under study; singular: taxon
3 clade = (Gr. κλάδος) putative group defined by an ancestor and all its descendants in a phylogeny

high character polymorphism in our data, we were unable to perform such probabilistic

approaches.
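
To give a sense of the size of the search space mentioned above, the number of unrooted bifurcating trees on n labelled taxa is the double factorial (2n - 5)!!, a standard result in phylogenetics. The short Python sketch below evaluates it; the function name is an illustrative choice and is not taken from any software used in this study. For 24 taxa it returns a value of roughly 5.6 × 10^26.

```python
def count_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labelled tips: (2n - 5)!!."""
    if n_taxa < 3:
        raise ValueError("at least 3 taxa are needed for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5.
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 24):
        print(n, count_unrooted_trees(n))
```

Because the count grows this fast, exhaustive evaluation is already out of reach for a few dozen taxa, which is why a heuristic search is needed.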

Outside this probabilistic framework, one could use a maximum compatibility criterion

which reconstructs trees from the largest subset of characters which are all compatible on the

same hierarchical structure. However, compatibility methods do not deal with homoplasy.

Therefore, in the case of Chinese dialects where such a huge amount of borrowing has

occurred, compatibility approaches are not recommended.

We preferred the maximum parsimony criterion, where the optimal trees are those which

minimize the overall number of character changes. Note that, given the large number of

possible trees for our dataset, it is not possible to compute and evaluate each and every tree,

and the search for optimal trees has to be heuristic. We therefore have no guarantee of finding the

optimal tree subset, but an estimate of it can at least be obtained in a reasonable time. In

this paper, we have used a cladistic approach, sensu Hennig (1950), which proceeds as

follows:

− For each character, we formulate a hypothesis of homology: if two languages share the

same state, then the most parsimonious explanation is that it was inherited from

the common ancestor of those languages.

− Once the subset of optimal trees is selected, this primary hypothesis can be tested. The

distribution of character states on the selected topologies determines whether the

observed shared character states result from common ancestry or not. Therefore, it is

the tree topology, and more precisely the optimal tree topologies, which explains the

mode of transmission for each character. When character states appear to be shared by

unrelated languages, they are said to be homoplastic.
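
For a single character on a single fixed tree, the number of changes counted by the parsimony criterion can be obtained with Fitch's algorithm for unordered states. The sketch below is an illustration only: the text does not say which implementation was used, and the tree encoding (nested tuples, rooted arbitrarily) and the four taxa A to D are hypothetical.

```python
from typing import Dict, Hashable, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]  # a leaf name or a pair of subtrees


def fitch_length(tree: Tree, states: Dict[str, Hashable]) -> int:
    """Minimum number of state changes for one unordered character on `tree`."""
    changes = 0

    def visit(node: Tree) -> set:
        nonlocal changes
        if isinstance(node, str):          # leaf: its observed state
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree: no change needed here
            return common
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


# Hypothetical character: A and B share state 1, C and D share state 2.
character = {"A": 1, "B": 1, "C": 2, "D": 2}
print(fitch_length((("A", "B"), ("C", "D")), character))  # tree grouping A with B
print(fitch_length((("A", "C"), ("B", "D")), character))  # tree grouping A with C
```

On the first tree the shared states are explained by common ancestry (one change); on the second they require two changes, so the extra change is homoplasy under that topology. Summing this count over all characters gives the tree length that the heuristic search tries to minimize.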

b. Distance-based methods: These methods are based on global similarity. First a measure

of pairwise similarity between taxa is computed, which is then transformed into a measure of

dissimilarity with specific mathematical properties, called a distance. The resulting pairwise

distance matrix is in turn transformed into a tree, according to a specific reconstruction

method. In historical linguistics, lexicostatistical methods are a type of distance-based

method.

As such, distance methods imply information loss with respect to character-based

methods, since the original character matrix cannot be recovered from a distance matrix

(Penny, 1982). Herein, we have used the popular Neighbour-Joining method (Saitou and Nei,

1987), which is an agglomerative procedure that estimates the unique tree minimizing the

difference between the observed distance matrix and the tree distance matrix (reconstructed as

the pairwise distance between taxa on the estimated tree).

We have used the same lexicostatistical distance as Minett and Wang (2004). The distance

between languages A and B is a function of their lexical similarity C, which is equal to the

proportion of cognates shared by A and B over the whole wordlist. The distance is equal to

minus the logarithm of C, $d(L_A, L_B) = -\log C$; since C cannot exceed 1, the minus sign ensures that the distance is

non-negative.
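
The distance just defined can be sketched in a few lines of Python. Several details below are assumptions made for illustration: the function name, the use of the natural logarithm (the base is not stated), the encoding of each dialect as a list of cognate-set labels with `None` for missing data, and the choice to return infinity when no cognates are shared.

```python
import math
from typing import Optional, Sequence


def lexicostatistic_distance(a: Sequence[Optional[int]],
                             b: Sequence[Optional[int]]) -> float:
    """Return -log(C), with C the share of compared meanings that are cognate."""
    if len(a) != len(b):
        raise ValueError("both dialects must be coded on the same wordlist")
    # Keep only meanings attested in both dialects (None marks missing data).
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no meaning is attested in both dialects")
    shared = sum(1 for x, y in pairs if x == y)
    if shared == len(pairs):
        return 0.0                 # identical lists: C = 1, distance 0
    if shared == 0:
        return math.inf            # C = 0: the logarithm is undefined
    return -math.log(shared / len(pairs))


# Hypothetical four-meaning lists; the last meaning is missing in the first dialect.
print(lexicostatistic_distance([1, 1, 2, None], [1, 2, 2, 3]))  # C = 2/3
```

Applying the function to every pair of dialects yields the pairwise distance matrix that a method such as Neighbour-Joining takes as input.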

3. Defining the characters

Previous analyses of lexical lists (Gray and Jordan, 2000; Holden, 2002; Ringe et al. 2002;

Gray and Atkinson, 2003; Rexova et al. 2003) use two types of character coding:

− Each basic meaning constitutes a character, and the different cognate sets identified

for that meaning constitute the character’s states (Table 1). This definition results in

multistate characters with more than 2 states per character.

− Each cognate set from each meaning constitutes a character where the states are either

present, if the item belongs to that cognate set, or absent otherwise. This definition

results in binary characters (2 states per character), with multiple characters per

meaning.

Meaning      Fuzhou (Min)    Guangzhou (Yue)
“dog”        quan2 犬        gou2 狗
Character    State 1         State 2

Table 1: Example of character coding for meaning ‘dog’
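
The difference between the two codings can be made concrete with the example of Table 1. The Python sketch below stores a multistate matrix as nested dictionaries and derives the binary (presence/absence) coding from it; the data layout and the function name are illustrative assumptions, not the format of the published dataset.

```python
from typing import Dict, Hashable, Tuple

# meaning -> dialect -> cognate-set label (one multistate character per meaning)
Multistate = Dict[str, Dict[str, Hashable]]
# (meaning, cognate set) -> dialect -> 0/1 (one binary character per cognate set)
Binary = Dict[Tuple[str, Hashable], Dict[str, int]]


def multistate_to_binary(matrix: Multistate) -> Binary:
    """Expand each multistate character into one binary character per state."""
    binary: Binary = {}
    for meaning, states in matrix.items():
        for cognate_set in sorted(set(states.values()), key=str):
            binary[(meaning, cognate_set)] = {
                dialect: int(state == cognate_set)
                for dialect, state in states.items()
            }
    return binary


multistate = {"dog": {"Fuzhou": 1, "Guangzhou": 2}}   # Table 1
binary = multistate_to_binary(multistate)
for character, column in binary.items():
    print(character, column)
```

One meaning with k cognate sets becomes k binary characters, which is the source of the weighting and non-independence concerns discussed next.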

Advocates of the binary coding argue that the meanings composing the wordlists are not

the ‘objects’ of language change, but are artificial constructs (Atkinson and Gray, 2004). In

contrast, cognate sets are unambiguously defined as heritable units and therefore the binary

coding would be at least as principled as the multistate coding. Conversely, advocates of

multistate coding argue that binary coding violates the necessary requirement of character

independence as the cognate sets related to a given meaning are semantically dependent

(Balter, 2003). In fact, cognacy judgments are made on both semantic and phonological

grounds. Since these two dimensions are non-independent in circumscribing cognate sets,

they should not be separated for the analyses. For example, if, for a given meaning, a

language has a word in a given cognate set, it is not likely to have a word for the same

meaning in the other cognate sets for that semantic slot. Consequently, binary coding bears

the risk of over-weighting the meanings with a large number of cognate sets to the detriment

of meanings with few cognate sets. In fact, almost all cladistic analyses treat data as

independent and therefore, non-independent characters are over-weighted in the analyses. The

violation of this theoretical requirement is problematic as character independence is difficult

to test, evaluate and correct. Consequently, the bias introduced by non-independence on tree

reconstruction is not assessable.

For these reasons, we preferred multistate coding. By doing so however, we find ourselves

hindered in co-opting probabilistic methods and models developed in biology, as our

characters appear far more polymorphic than the characters these methods and models were

designed for. We treated characters as unordered: for a given character, any state is allowed to

change into any other state with the same probability. This is far from the reality of linguistic

change, but in the absence of a plausible model of evolution of lexicon, this corresponds to

the least constrained approach.

Wordlists rarely appear as simple to code as the example given in Table 1. For example,

variant roots can be associated with slight variations of the same basic meaning. For instance,

the meaning “bird” is represented by 鸟 (niao3) in Nanchang (Gan) and by 雀 (que4) in

Chengdu (Mandarin). Contrastively, Ningxia (Mandarin) has both roots 鸟 and 雀 for this

same meaning, but these roots differ in their etymology in Old Chinese, referring to “big bird”

and “small bird”, respectively. In such a case, we considered semantic variation itself as a

new state: Nanchang is therefore in state 1 (鸟), Chengdu is in state 2 (雀) and Ningxia is in a

third state noted 1/2 (鸟/雀). The transformation between states 1 and 3 can be due

to the loss of the variant 雀 from Ningxia to Nanchang, to its emergence in Ningxia

from Nanchang, or even to borrowing from another source.
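
The treatment of variant roots in the “bird” example can be sketched as follows: each distinct set of roots attested for a meaning receives its own state label. The function name and the numbering of states by order of first appearance are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable


def code_variant_states(roots_by_dialect: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Give each distinct combination of roots its own state number."""
    labels: Dict[FrozenSet[str], int] = {}
    coded: Dict[str, int] = {}
    for dialect, roots in roots_by_dialect.items():
        combination = frozenset(roots)
        if combination not in labels:
            labels[combination] = len(labels) + 1   # next unused state number
        coded[dialect] = labels[combination]
    return coded


bird = {
    "Nanchang": ["鸟"],
    "Chengdu": ["雀"],
    "Ningxia": ["鸟", "雀"],   # both roots: coded as a third state (1/2)
}
print(code_variant_states(bird))
```

Since the characters are treated as unordered, the third state is not modelled as intermediate between the other two; it is simply a distinct state.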

Some meanings are represented by a root with appended affixes and modifiers. The root

can stay constant while the affix/modifier can change, or both can vary. The first case can be

simply resolved by coding only the variable part, but the latter case is more problematic. In

fact, the root and the affix/modifier cannot be split into two characters, since they have a tied

evolution within a semantic slot. We thus coded each combination root/affix or modifier as a

new state. There are also cases of optional affixation, which we coded as distinct states from

the stable form. For instance, the meaning “head” is represented in Nanchang (Gan) by the

root 头 (tou2), state one, which can also appear either systematically suffixed, as in Zhangping

(Min) 头壳 (ke4), state two, or optionally suffixed, as in Guangzhou (Yue) 头[壳], state three.

The duplication of a root or the combination of roots is also treated as a suffixation. In the

latter case, if each root and their combined form are semantic variants of the same meaning,

then they constitute three different states.

The missing data are coded as a specific state. Note that the absence of a concept and truly

missing data (lack of information) are coded and treated differently. The absence of a concept

is considered as a proper state and analyzed as any other informative state, whereas missing

data have a special treatment. In distance methods, they are discarded from the computation

of the distance matrix, whereas they are optimized in the cladistic approach. In this latter case,

missing values are assigned the states which allow the best optimization.

III. Trees and ‘tree-likeness’ of Chinese dialects development

1. Incrementing wordlists: homoplasy and congruence

Increasing the number of characters does not guarantee a better phylogenetic estimate. For

instance, character sets may be evolving at different rates. Constraining such characters to rate

equality in a phylogenetic analysis is likely to result in a flawed topological estimate (Cao,

1998). In other words, datasets assumed to have evolved at different rates should not be

combined into a single phylogenetic analysis without a prior assessment of their congruence.

Two datasets are said to be congruent when the hierarchical signals they contain are

compatible, or in agreement. Note that congruence does not imply that the additional data

contribute no new information; rather, it guarantees the coherence of

the signal contained in the combined dataset, in a given analytical framework. In contrast,

incongruence implies a conflict between the initial dataset and the added items that

would result in a poor estimate of either of the conflicting evolutionary trajectories. Although this is

subject to debate between phylogeneticians (Kluge, 1989; de Queiroz, 1993), we are ourselves

advocates of a prior assessment of congruence between datasets in order to legitimate their

combination in a single analysis (de Queiroz, 1993). However, if datasets prove incongruent,

they should be analyzed separately, and the differences in the optimal trees resulting from

each analysis should be evaluated.

The rationale behind the 100 and the 35 wordlists is that the rates of word evolution

differ from one list to the other. According to Swadesh’s subdivisions, the items of the 100

wordlist evolve more slowly than the second 100 list of the 200 wordlist. According to

Yakhontov, the 35 items of his list evolve even more slowly than the other 165 words of

Swadesh’s 200 word list. Therefore, we find ourselves in the situation described above with

character sets evolving at different rates. Consequently, we need to investigate the congruence

within the 100 and 200 wordlists. Intriguingly, whereas one would expect Yakhontov’s list to

be included in Swadesh’s 100 list, three items of the Yakhontov list belong to the less

conservative half of Swadesh’s 200 list. Given this apparent contradiction between

Yakhontov’s and Swadesh’s views on what subsets of the basic lexicon are the most

conservative, we have to test not only the congruence between the first and the second halves

of the 200 wordlist, but also, the congruence within each subset with respect to Yakhontov’s

subdivision. In other words, we have to test 5 partitions of the basic lexicon (Figure 2):

Partition 1. within Swadesh’s 200 wordlist: Yakhontov’s 35 words against the remaining 165 items

Partition 2. within Swadesh’s 100 wordlist: Yakhontov’s 32 words against the remaining 68 items

Partition 3. Yakhontov’s 35 items against the 68 subset (within the 100 wordlist), against the 97 subset (within the second half of the 200 wordlist)

Partition 4. the 68 subset (within the 100 wordlist) against the 97 subset (within the second half of the 200 wordlist)

Partition 5. the 100 wordlist against the second half of the 200 wordlist

Figure 2: Structure of the wordlists and definition of the subsets used in the ILD procedure.

In the maximum parsimony framework we are using for tree reconstruction, one can

use the Incongruence Length Difference (ILD) test, proposed by Farris et al. (1994), and

independently by Swofford (1995) under the name of Partition Homogeneity test. This

procedure is character-based and non-parametric, and tests the null hypothesis of congruence

between datasets. In fact, one cannot directly test the null hypothesis of incongruence, but one

can measure the extent to which the incongruence between datasets is significantly higher

than would be expected by chance alone. This is precisely how the ILD test proceeds. It

measures the extent to which the original datasets result in different trees and compares it to

a distribution of this same measure for randomly generated datasets of the same size as the

original ones. Therefore, this test can be used to justify the combination of the datasets in a

parsimony framework, as the combined dataset results in no less optimal solutions than the

separate original datasets.

The test requires the computation of tree lengths. The length of a tree is equal to the

overall number of character changes required to map the character matrix onto that tree.

Given two datasets X and Y, the measure of their incongruence is the difference $D_{XY}$ between

the length $L_{(X+Y)}$ of the most parsimonious tree obtained by combining them into a single

analysis (X+Y), and the sum ($L_X + L_Y$) of the lengths of the most parsimonious trees obtained

on each character set analyzed separately: $D_{XY} = L_{(X+Y)} - (L_X + L_Y)$. The null distribution of D

is given by a procedure of random partition of the combined matrix (X+Y) into two matrices P

and Q of the same sizes as X and Y, respectively (note that this procedure can be extended to

more than 2 datasets). The null hypothesis of congruence between X and Y is rejected if $D_{XY}$

appears to be greater than 95% of the $D_{PQ}$ values computed. In other words, we take a 5% risk of

falsely rejecting the null hypothesis of congruence between datasets. If this null hypothesis is

not rejected, datasets X and Y can be combined into a single analysis.
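
The procedure can be illustrated on a toy case. The following Python sketch is not the PAUP* implementation used in this study: it assumes only four taxa, so that the three possible unrooted topologies can be searched exhaustively, and it counts changes for unordered characters with Fitch's algorithm. The function names and the convention used for the tail probability (the proportion of random partitions whose D is at least as large as the observed one) are illustrative assumptions.

```python
import random

# Each character is a tuple of four states, one per taxon (a, b, c, d).
# Assumption: four taxa only, so the three unrooted topologies can be listed.
TOPOLOGIES = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def fitch_join(s, t):
    """Fitch step: return (state set, cost) for the node joining s and t."""
    common = s & t
    return (common, 0) if common else (s | t, 1)

def char_length(char, topology):
    """Number of changes one unordered character requires on a quartet tree."""
    (i, j), (k, l) = topology
    left, c1 = fitch_join({char[i]}, {char[j]})
    right, c2 = fitch_join({char[k]}, {char[l]})
    _, c3 = fitch_join(left, right)
    return c1 + c2 + c3

def mp_length(chars):
    """Length of the most parsimonious tree (exhaustive over 3 topologies)."""
    return min(sum(char_length(c, t) for c in chars) for t in TOPOLOGIES)

def ild_statistic(x, y):
    """D_XY = L(X+Y) - (L_X + L_Y)."""
    return mp_length(x + y) - (mp_length(x) + mp_length(y))

def ild_test(x, y, n_trials=1000, seed=0):
    """Return (observed D, proportion of random partitions with D >= observed)."""
    rng = random.Random(seed)
    observed = ild_statistic(x, y)
    pooled = x + y
    hits = 0
    for _ in range(n_trials):
        shuffled = pooled[:]
        rng.shuffle(shuffled)
        # Random partition into P and Q with the sizes of X and Y.
        p, q = shuffled[:len(x)], shuffled[len(x):]
        if ild_statistic(p, q) >= observed:
            hits += 1
    return observed, hits / n_trials

# Hypothetical toy data: X supports (a,b)|(c,d), Y supports (a,c)|(b,d).
x = [(0, 0, 1, 1)] * 3
y = [(0, 1, 0, 1)] * 3
observed, tail_probability = ild_test(x, y, n_trials=200)
```

With real data the exhaustive search in `mp_length` would have to be replaced by a heuristic tree search, which is what makes the test computationally demanding.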

2. Goodness of fit of the data onto a tree

In a cladistic framework, it is possible to evaluate the goodness of fit of the character

matrix to each of the selected trees. The amount of shared innovations, and by correlation, the

amount of homoplasy, is traditionally estimated by the consistency index (CI, Kluge and

Farris, 1969). This index is computed as the minimal number (R) of transformations required

to explain the states of all characters divided by the number (L) of changes observed on the

tree. L is equal to the length of the tree in the case of optimal trees. The larger the CI, the

smaller the amount of homoplasy. A tree for which there is no homoplasy has a CI equal to 1.

The homoplasy index is thus calculated as HI = 1-CI.

The consistency index can also be computed only on the fraction of characters which are

informative for the subgrouping. This excludes parsimony uninformative characters, such as

characters which are constant, or where unique states are not shared. The CI excluding

uninformative characters can differ significantly from the uncorrected CI when a large

fraction of the characters have states present in only one dialect.
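
As an illustration of this distinction, the sketch below classifies a single character (one state per dialect) using the usual criterion that a character is parsimony-informative when at least two of its states are each shared by at least two taxa. The function name is hypothetical, and missing data are not handled.

```python
from collections import Counter

def classify_character(states):
    """Classify one character from its list of states (one per taxon)."""
    counts = Counter(states)
    if len(counts) == 1:
        return "constant"
    # States shared by at least two taxa can support a subgroup.
    shared = sum(1 for n in counts.values() if n >= 2)
    return "informative" if shared >= 2 else "uninformative"

# Hypothetical cognate-class codings for four and five dialects.
examples = {
    "constant": classify_character(["A", "A", "A", "A"]),
    "unique states only": classify_character(["A", "A", "B", "C"]),
    "two shared states": classify_character(["A", "A", "B", "B", "C"]),
}
```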

Although a very convenient measure of homoplasy, the consistency index is difficult to

interpret. In fact, whereas its maximal value is equal to 1 for a homoplasy-free dataset, its

minimal value is not null, and depends on the number of taxa considered (Archie, 1989).

Other measures of homoplasy have been proposed which do not have this flaw. The retention

index (RI, Archie, 1989; Farris, 1989), for instance, ranges from 1 (no homoplasy) to 0 (no

phylogenetic information). This index is computed in a similar fashion to the CI, but with a

correction of both the numerator (R) and the denominator (L) by the maximal number (G) of

transformations required to reconstruct a tree from the dataset. RI is therefore equal to G-L

divided by G-R. Finally, a third index has been proposed by Farris (1989), called the rescaled

consistency index (RC), which is equal to the product of the consistency index by the

retention index, which rescales the CI between 0 and 1.
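
The relations between these indices can be summarized in a few lines of Python. This is a minimal sketch using the standard definitions CI = R/L, HI = 1 - CI, RI = (G - L)/(G - R) and RC = CI x RI; the function and argument names are illustrative, and the numerical values are invented for the example.

```python
def homoplasy_indices(r_min, length, g_max):
    """Compute CI, HI, RI and RC.

    r_min  -- minimal number of changes required by the characters (R)
    length -- number of changes observed on the tree (L)
    g_max  -- maximal number of changes the characters could require (G)
    Assumes g_max > r_min, i.e. at least one informative character.
    """
    ci = r_min / length
    hi = 1 - ci
    ri = (g_max - length) / (g_max - r_min)
    rc = ci * ri
    return {"CI": ci, "HI": hi, "RI": ri, "RC": rc}

# Invented values: R = 10, L = 20, G = 40.
indices = homoplasy_indices(10, 20, 40)
```

When L = R (no homoplasy) all of CI, RI and RC equal 1, and when L = G the retention index and the rescaled consistency index fall to 0.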

The significance of the CI can however be tested. This can be done by calculating the CI

distribution for a sample of random trees (Cavalli-Sforza et al. 1992). Sanderson and

Donoghue (1989) showed that the CI is highly correlated with the number of taxa but does not

show a significant relationship to the number of characters used for the analysis. They suggest

that the logarithm of the CI is linearly correlated with the number of taxa and provide a

method for calculating the threshold CI value over which the observed CI value for a tree can

be considered significant. In our case, where 24 taxa are considered, this threshold CI is equal

to 0.372.

Being measures of homoplasy, these indices (CI, RI and RC) are therefore also indices of

the ‘tree-likeness’ of the dataset: the better the fit of the characters to the optimal trees, the

more adequate is the tree diagram to account for the observed dataset. If these indices are

significantly different from what would be expected by chance only, then the trees for which

they are computed are also significant.

3. Assessment of clade support

Although the 100 and the 200 wordlists appear consistent as a whole, it is unlikely that

they are homoplasy-free. The presence of homoplasy implies that multiple evolutionary paths

have been involved in generating the characters. Therefore, it can be of interest to test the

robustness of the selected trees towards a random resampling of the character matrix, in other

words, to estimate empirically the variability of the phylogenetic estimate. Each clade can be

associated with a statistical measure of support, and consequently, it becomes possible to

evaluate the confidence one can have in the interpretation of the subgrouping patterns.

The resampling methods we used for assessing the robustness of a tree are bootstrapping

and jackknifing:

− Bootstrap (Efron, 1979; Felsenstein, 1985): This procedure involves resampling with

replacement of N characters from the N characters of the initial dataset. This produces

a fictional sample of the same size, called a replicate. As sampling is done with

replacement, some of the original characters may have been sampled several times,

while others may be left out entirely. This procedure mimics repeated sampling of the

data from the initial population of characters.

− Jackknife (Mueller and Ayala, 1982): In its original formulation, this procedure

consisted in dropping one observation at a time from one’s sample and calculating the

phylogenetic estimate each time. The variability of the estimate was then calculated on

the small variations that were caused. Later variants of this method involve a sub-

sampling of a given fraction p of the initial set of characters at a time and estimating

the variance on the phylogenies reconstructed on each sub-sample of the data

(Felsenstein, 1985 ; Wu, 1986). The half-delete jackknife is the most widely used

variant of jackknife, and involves multiple sub-sampling of half the characters.
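
The two resampling schemes can be sketched as follows. This is an illustration only, not the PAUP* implementation: the character matrix is assumed to be a list of characters (columns), and the function names and the `keep_fraction` parameter are hypothetical.

```python
import random

def bootstrap_replicate(chars, rng):
    """Draw N characters with replacement from the N original characters."""
    n = len(chars)
    return [chars[rng.randrange(n)] for _ in range(n)]

def jackknife_replicate(chars, rng, keep_fraction=0.5):
    """Keep a random subset of the characters, drawn without replacement.

    keep_fraction=0.5 corresponds to the half-delete jackknife.
    """
    k = round(len(chars) * keep_fraction)
    kept = sorted(rng.sample(range(len(chars)), k))
    return [chars[i] for i in kept]

rng = random.Random(1)
characters = list(range(10))          # stand-ins for 10 character columns
boot = bootstrap_replicate(characters, rng)
jack = jackknife_replicate(characters, rng)
```

A phylogeny is then inferred from each replicate with the chosen reconstruction method.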

Bootstrap and jackknife both amount to a random re-weighting of the data: bootstrap

performs a completely random re-weighting, whereas jackknife assigns a weight of 0 to all

deleted characters, and a weight of 1 to all included characters. It is not clear, however, whether either

of these methods is substantially more advantageous than the other (Felsenstein, 2004), but

one could favour jackknife over bootstrap as this latter creates artificial data sets while the

former uses different sub-samples of the real data. Moreover, bootstrap, as implemented in

most phylogenetic software (following Felsenstein 1985), tends to overestimate bootstrap

values when multiple equally parsimonious trees are selected for a given replicate (Farris,

2004).

Whether using bootstrap or jackknife, we generate a large number of replicates, for each

of which we infer a phylogeny using a given method of phylogenetic reconstruction. If

distance methods are used, the resampling has to occur on the original character data before

the distance matrix is computed. We end up with a cloud of trees corresponding to a

collection of phylogenetic estimates for the languages under study. These can be summarized

into a majority-rule consensus tree, called a bootstrap (or a jackknife) tree. This tree shows the

subgroups that are recovered in more than 50% of the replicates, and associates the exact

percentage of replicates supporting this subgroup. The higher this value, the more robust the

subgroup, and the more confident one can be in interpreting it. The subgroups that are

supported by less than half of the replicates are not represented and appear unresolved on the

consensus tree.
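
The computation of support values can be sketched as follows, under the simplifying assumptions that each replicate yields a single tree and that a tree is represented by the set of its clades, each clade being a set of taxon labels. The function names are illustrative.

```python
from collections import Counter

def clade_support(trees):
    """Proportion of replicate trees in which each clade appears."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}

def majority_rule(trees, threshold=0.5):
    """Clades recovered in more than `threshold` of the replicates."""
    return {c: s for c, s in clade_support(trees).items() if s > threshold}

# Three hypothetical replicate trees on taxa A, B, C (plus an outgroup).
replicates = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AC"), frozenset("ABC")},
]
support = clade_support(replicates)
consensus = majority_rule(replicates)
strict = majority_rule(replicates, threshold=0.95)
```

Here the clade (A, B) is kept in the majority-rule consensus with a support of 2/3, but it would not survive a 95% threshold.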

In either case, however, there is no absolute threshold over which the certainty of a group

can be acknowledged. One has to be aware that the lower the threshold of acceptance of

subgroups, the less support one can have in their interpretation. For instance, a subgroup that

is supported by a value of 50% is a subgroup which is actually rejected by half the replicates.

In such a case, there is as much support for the subgroup as against it. The problem can be

construed as a hypothesis test where we test whether a clade is actually supported by the data. A

clade can be considered as valid if no more than 5% of the random sample argues against it,

which would bring the threshold up to 95%.

III. Results

Tree reconstruction was performed using PAUP*4b10 (Swofford, 2002) for both

character and distance based methods. PAUP* was also used for the computation of the

indices of homoplasy and for the resampling procedures.

1. Congruence within wordlists

Each of the partitions defined in Figure 2 has been tested for congruence using the ILD

test over 1000 trials. Table 2 shows the tail probabilities obtained for each partition. The null

hypothesis of congruence is rejected when the tail probability is no larger than 5%, in other

words, when the observed incongruence within the partition is larger than 95% of the

randomly generated partitions of the same size. When the 35 word list is extended first to 100

then to 200 items, the newly added items introduce a signal which is congruent to the one

contained in the initial word list. We can therefore follow Swadesh’s and Yakhontov’s

subdivisions of the basic lexicon when performing our phylogenetic analysis.

Partition    Tail probability (p)

1            0.085
2            0.212
3            0.162
4            0.330
5            0.176

Table 2: Results of the ILD test.


The null hypothesis of congruence is rejected if the p-value is no larger than 5%.
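
The decision rule can be applied mechanically to the five tail probabilities of Table 2. In this small sketch the variable names are illustrative; the values are those reported in the table.

```python
ALPHA = 0.05  # significance level used for the ILD test

# Tail probabilities reported in Table 2, indexed by partition number.
tail_probabilities = {1: 0.085, 2: 0.212, 3: 0.162, 4: 0.330, 5: 0.176}

# Congruence is rejected when the tail probability is no larger than ALPHA.
rejected = [k for k, p in tail_probabilities.items() if p <= ALPHA]
closest = min(tail_probabilities, key=tail_probabilities.get)
```

No partition is rejected at the 5% level; Partition 1 comes closest to the threshold.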

2. Sinitic Trees

Trees were rooted using Old Chinese, which is both related to all the languages under

study and of greater time depth. Such a polarization of the trees will allow the discussion on

the pattern of dialect subgrouping. Concerning the measures of homoplasy, values will be

given for all indices (consistency, retention and rescaled consistency), but only the

consistency indices will be discussed, since their significance can be easily assessed using

Sanderson and Donoghue’s (1989) regression method. As mentioned in section II.2, the

critical value of the CI for 24 taxa is 0.372.

a. Maximum-parsimony analyses:

35 wordlist: 25 characters are informative as 3 characters are constant and 7 are parsimony

uninformative. The parsimony analysis of this list results in 77 equally parsimonious trees

which are 252 steps long and show a very significant fit of the data (CI=0.8849, CI excluding

uninformative characters = 0.8797, RI = 0.766, RC = 0.678). The majority-rule consensus tree

(Figure 3) shows the support received by each clade from the cloud of optimal trees. It shows

a good topological agreement for the deep nodes, in spite of the numerous trees selected. A

first division roughly clusters the southern (Yue, Min) dialects against the northern and

central dialects (Mandarin, Wu, Xiang, and Gan). Hakka dialects, however, do not fit this

geographical pattern, as they cluster with the northern (and central) dialects when one would

expect them to cluster within the southern group. Mandarin, Wu and Gan fail to cluster as

consistent groups, whereas Xiang is supported by about 73% of the cloud of optimal trees. On

the contrary, Min and Hakka are supported by all the trees selected (discussing the Yue is

meaningless here as this group is represented by a single dialect). As previously mentioned,

Hakka does not cluster within a southern dialectal group as would be expected from

geography and in accordance with Norman (1988). On the contrary, this group clusters at the

tip of the northern subdivision, close to the Gan dialects and to the Wenzhou southern Wu

dialect. It has been argued by Yuan (1980) that Gan is closer in its lexicon to Xiang and Wu

than to Hakka, a claim which is supported by only 30% of the tree cloud (23 trees over 77),

whereas 70% of the tree cloud supports the claim by Sagart (2002) that there is a privileged

relationship between Gan and Hakka.

Figure 3: Majority-rule consensus tree for the 35 wordlist parsimony analysis.

100 wordlist: the extension of the analysis to the 100 wordlist results in a dramatic reduction

of the number of optimal trees from 77 to 6. This indicates that the characters that have been

added to the 35 wordlist have helped resolve the tree topologies. This is further supported by

the strong topological agreement between the equally parsimonious trees as suggested by their

consensus tree (Figure 4). The addition of characters dramatically changes the tree topologies

but only slightly affects the value of the consistency index which remains highly significant

(CI = 0.84, CI excluding uninformative characters = 0.8296, RI = 0.6771, RC = 0.5685).

Mandarin, Wu and Gan remain unsupported by the data, whereas Min, Hakka and Xiang are

supported by all optimal trees. Hakka is brought closer to the root, in accordance with the

claim by Norman (1988) that this group has a close relationship with the southern dialects.

200 wordlist: the 200 wordlist also results in 6 equally optimal trees, whose majority-rule

consensus is shown in Figure 5. The majority consensus tree shows a 100% support for each

of the expected dialect groups, and the consistency index remains in the range of those

observed for the two shorter lists (CI = 0.8605, CI excluding uninformative characters =

0.851, RI = 0.6494, RC = 0.559).

Figure 4: Majority-rule consensus tree for the 100 wordlist parsimony analysis.

Figure 5: Majority-rule consensus tree for the 200 wordlist parsimony analysis.

The consensus topology shows a clear bipartition between the southern dialects (Min, Hakka

and Yue) versus the northern (Mandarin, Wu) and central (Xiang, Gan) dialects. No

hierarchical ordering of the southern groups is supported, whereas the structure within the

northern-central group is constant across the cloud of optimal trees. Mandarin appears as a

sister-group to Xiang, the group defined by (Mandarin, Xiang) is in turn a sister-clade to Gan

and ((Mandarin, Xiang), Gan) is in turn a sister-clade to Wu. A summary topology is given in

Figure 6.

Figure 6: Tree summarizing the relationship between dialect groups as inferred from the parsimony analysis of the 200 wordlist.

Norman (1988) argues that Mandarin and Min are the most neatly characterized groups,

with sharp boundaries distinguishing them from the neighbouring dialects. Using lexical data,

this claim is verified for Min which clusters in a coherent subgroup with a constant topology

for all wordlists. In contrast, Mandarin clusters as a stable group only for the 200 wordlist,

therefore using the least conservative dataset. Furthermore, using this list, Mandarin displays

a geographical structure, with distinct Northern (Beijing, (Taiyuan, Yuci)) and Eastern

(Yingshan, Wuhan) subdivisions.

b. Distance analyses:

Figure 7 (a), (b) and (c) show the trees obtained for the 35, 100 and 200 wordlists,

respectively. None of the wordlists results in the accepted classification: whereas Gan,

Xiang, Hakka and Min cluster correctly, Mandarin and Wu are more difficult to recover.

Mandarin clusters correctly when the 200 wordlist is used, but even then, Wu fails to cluster

correctly as Wenzhou southern Wu dialect appears closer to southern or central dialects than

to the group it should belong to.

Figure 7: Neighbour-Joining tree from the (a) 35, (b) 100 and (c) 200 wordlists.

In this respect, distance analyses seem to perform less well than parsimony analyses. This

can be due to the fact that the distance used is based on assumptions which are inappropriate

to account for the underlying evolutionary mechanisms at work here. In fact, the distance used

assumes that all characters evolve at the same rate, thus imposing an evolutionary clock

(Ruvolo, 1987). The implicit assumption in the definition of wordlists however, is that the

items of the 35 wordlist do not evolve at the same rate as the items added to constitute first the

100 then the 200 wordlist. The fact that the three word subsets do not evolve at equal rates

does not imply that the signals they contain are not hierarchically compatible (and therefore

incongruent). However, distance analyses are based on global similarity and therefore they

average the behaviour of the three word subsets, whereas character based methods, which

consider each character independently, can account for the specific behaviour of each of them.

The other dialect groups – when they cluster as expected – have variable relative positions

depending on the wordlist used. However, the lexicostatistical trees always show a robust

geographic bipartition between the Northern (Wu, Mandarin) and Central (Gan and Xiang)

dialects and Southern dialects (Min, Hakka and Yue), as do the parsimony analyses on the 100

and 200 wordlists.

In the parsimony analyses, all wordlists supported a basal position for Yue and Min

dialects, which are known to have retained numerous archaisms from Old Chinese (Norman,

1988). In the distance analyses, the variation in the position of Old Chinese depending on the

wordlist considered is particularly interesting: whereas the parsimony trees distinguish clearly

Old Chinese from the Min group on the 35 wordlist, the distance tree groups Old Chinese

within Min using that same list. The 100 wordlist still groups Old Chinese with Min against

all other dialects, but clearly distinguishes a Min group. Only when the 200 wordlist is used is

the privileged relationship between Min and Old Chinese lost. If Yuan’s classification (1980)

is accepted, then this behaviour of Old Chinese with respect to the southern groups is in

agreement with the claim that the 35 wordlist is more conservative than the 100 wordlist,

which is in turn more conservative than the 200 wordlist. In fact, the archaisms conserved in

Min are shared with Old Chinese thus increasing their global similarity. The addition of less

conservative items has the opposite effect of decreasing the similarity between Min and Old

Chinese while increasing the number of traits specific to Min, allowing Min to cluster as a

distinct group.

c. Clade support :

The only configuration which allowed Yuan’s (1980) dialectal groups to be recovered is

the parsimony analysis of the 200 wordlist. Consequently, it is the only configuration which

will be discussed concerning clade support. The re-sampling techniques of bootstrap and

jackknife (Figure 8) result in the collapse of the seemingly stable classification obtained in

Figure 5 into a star-like (unresolved) pattern. Acceptable support scores (~75%) are

maintained for the bipartition into southern (Min, Hakka and Yue) and northern-central

(Mandarin, Wu, Xiang and Gan) dialects. The southern dialects are more basal, in agreement

with genetic evidence of their early differentiation from Old Chinese (Yuan, 1980; Norman,

1988). Min, Hakka and Xiang are well supported (~85%), but the northern-central group loses

its resolution and the hierarchical structure shown in Figure 6 becomes star-like.

Resampling techniques proceed through character re-weighting. The fact that they

collapse the original tree structure into a nearly star-like diagram suggests that various

evolutionary signals are present in the dataset, and that the one shaping the tree in Figure 6 is

not very robust. Still, the low bootstrap and jackknife support values contradict the high

consistency index recorded for the 200 wordlist. Whereas this latter supports the hierarchical

structures selected as optimal, the former invalidate the subgrouping pattern these topologies

define.

Figure 8: Bootstrap (a) and jackknife (b) trees for the 200 wordlist in a parsimony framework.

This contradiction can be explained by the high polymorphism of the characters

considered, which hinders the application of resampling techniques. In fact, many characters

are represented by a large number of states (Figure 9) and are therefore very informative.

Consequently, excluding (or over-weighting) a character implies the exclusion (or

amplification) of a large amount of signal and dramatically threatens the stability of the

optimal topology.

Figure 9: Frequency of the characters depending on their number of states.

Given the massive migratory activity and the diglossic situation that have prevailed

throughout the history of China, departure from the tree model would not be surprising and

one could interpret the bootstrap values as the consequence of the misspecification of the

model of representation of these dialects’ relationships. At the same time, the data show a

good fit on the trees selected which, moreover, are in agreement with the traditional

classification for Chinese dialects. But if we choose to consider the consistency indices to the

detriment of the bootstrap values, then we have to justify why the topology obtained for the

200 wordlist should prevail upon the ones obtained for the 100 and 35 wordlists. The absence

of cross validation between homoplasy indices and resampling techniques is problematic

because we have no satisfying measure of the variance of the inferred topological estimate,

which seems, however, to be very sensitive to character sampling and weighting. The only

reference we have for accepting the tree given in Figure 5, Yuan’s classification, is external to

the analytical framework, and we are still unable to test the assumption of tree-likeness.

3. Sinitic networks

The subgrouping methods described up to here presuppose that we have to reconstruct a

tree, but there are other methods which do not require this assumption. Instead of producing a

tree by default, they generate reticulated graphs, or networks, which visualize the degree of

departure from the tree-like model, and consequently, assess its validity for the data. These

graphs also show patterns of subgrouping which can be directly interpreted in terms of

language evolution.

Different network reconstruction methods are available, and are either character-based or

distance-based. Median networks (Bandelt et al. 1995; Bandelt et al. 1999) are the only

character-based methods, but they have trouble dealing with such highly polymorphic data as

ours (Forster and Toth, 2003). We are therefore left with distance-based approaches to

network reconstruction, the best of which appears to be, at the moment, the Neighbour Net

method (Bryant and Moulton, 2002, 2004) implemented in the SPLITSTREE software (Huson,

1998; Huson and Bryant, 2004). This method generates splits graphs from a taxa pairwise

distance matrix, a split being a partition of the set of taxa into two non empty subsets. It is an

agglomerative procedure, similar to the tree method of Neighbour-Joining, which proceeds to

a heuristic search for bipartitions in a set of taxa but without necessarily producing a tree. A

step-by-step description of the method can be found in Bryant et al. (2004), illustrated on the

case of Indo-European languages. This method seems to be the most efficient network

generating approach when the number of taxa increases (Bryant et al. 2004). Neighbour Nets

display edges and boxes, where boxes represent conflicting signals while edges represent an

unambiguous signal. The support for a given split is given by the length of the edge (or the

box’s edge) defining that split, while the amount of conflict is represented by the width of the

box representing it.
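
The notion of conflicting signal can be made concrete with the classical compatibility criterion for splits: two splits A|B and C|D of the same taxa can sit on one tree only if at least one of the four intersections A∩C, A∩D, B∩C, B∩D is empty, and a collection of splits is tree-like when all its pairs are compatible. The sketch below illustrates this criterion only; it is not the Neighbour Net algorithm, and the function names are hypothetical.

```python
def compatible(split1, split2):
    """True if the two splits (pairs of taxon sets) can share a tree."""
    a, b = split1
    c, d = split2
    return any(not (x & y) for x in (a, b) for y in (c, d))

def is_tree_like(splits):
    """True if every pair of splits in the collection is compatible."""
    return all(
        compatible(s, t)
        for i, s in enumerate(splits)
        for t in splits[i + 1:]
    )

# Hypothetical splits on four taxa A, B, C, D.
s1 = ({"A", "B"}, {"C", "D"})
s2 = ({"A"}, {"B", "C", "D"})
s3 = ({"A", "C"}, {"B", "D"})

no_conflict = is_tree_like([s1, s2])   # both splits fit one tree
conflict = is_tree_like([s1, s3])      # these two would form a box
```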

We have reconstructed Neighbour Nets from our data using both a lexicostatistical

(discussed earlier) and a Hamming distance. These two distances produce similar graphs,

which was expected due to their strong correlation. The graphs obtained from the different

wordlists displayed a strong web of conflicting signals at the base of clearly distinguishable

subgroups. This figure contrasts with what was obtained for Indo-European (Bryant et al.

2004), where the acknowledged subfamilies were supported by long edges and connected in a

fairly conflict-free pattern.
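
For illustration, a pairwise Hamming distance matrix of the kind supplied to a distance-based network method can be computed as below. The normalization by the number of compared characters and the treatment of missing entries (coded as None and skipped) are assumptions of this sketch rather than a description of the exact computation performed here; the taxon labels and codings are invented.

```python
from itertools import combinations

def hamming_distance(u, v, missing=None):
    """Proportion of characters, coded in both taxa, whose states differ."""
    pairs = [(a, b) for a, b in zip(u, v) if a != missing and b != missing]
    if not pairs:
        raise ValueError("no character is coded in both taxa")
    return sum(a != b for a, b in pairs) / len(pairs)

def distance_matrix(data):
    """Pairwise distances for a dict mapping taxon name -> list of states."""
    names = sorted(data)
    return {
        (s, t): hamming_distance(data[s], data[t])
        for s, t in combinations(names, 2)
    }

# Invented codings: four characters, three taxa, one missing entry.
data = {
    "X": [1, 2, 3, 4],
    "Y": [1, 2, 0, 0],
    "Z": [1, None, 3, 4],
}
distances = distance_matrix(data)
```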

Figure 10 shows the Neighbour Net obtained for the 200 wordlist using Hamming

distance. The acknowledged dialect groups are recovered, as well as the partition between

southern and north-central dialects. The irresolution of the northern-central group is explained

by the heterogeneity of the Wu group (Wenzhou dialect diverging from the northern

Shanghai, Suzhou and Ningbo dialects) and the poor support of Mandarin against the central

Gan and Xiang dialects. Interestingly, the trees obtained by parsimony analyses are salient in

the Neighbour Nets, whatever the wordlist considered. These Neighbour Nets manage to

summarize the apparent contradiction discussed previously between the good fit of the data on

the most parsimonious trees as shown by the good CI values and the instability of their

topology as represented by the poor bootstrap and jackknife values.

Although dialect subgroups are clearly distinguished, they appear to be related in a rather

star-like pattern, with a basal web of conflicts linking all Sinitic languages. This pattern fits

more the notion of dialectal continuum than that of a steadily increasing differentiation and

isolation of dialect subgroups implied by the tree model. It visually summarizes the two

opposite forces that have supposedly shaped the Sinitic linguistic domain (Ogura, 1994): a force

of differentiation, related to the spread of people throughout China, created distinguishable

dialect subgroups, while a homogenization force due to diglossia and heavy borrowing

created the continuum. Thus, even though the differentiation force would have proceeded in a

tree-like pattern, it was always counterbalanced by the homogenization force which diluted

the hierarchical structure, rendering the tree model inappropriate to account for these dialects’

relationships.

Figure 10: Neighbour Net obtained from the 200 wordlist using Hamming distance.

IV. Conclusion

Should we then conclude that the tree model is inappropriate in the case of Sinitic

languages? The Neighbour Net analyses show that parsimony trees are relevant since they

underlie the networks. These trees prove nevertheless incapable of solving the contradiction

between a good fit of the data and poor bootstrap support and require external evidence to

decide if they should be considered as relevant. On the contrary, the Neighbour Nets reconcile

these apparently contradicting observations in a synthetic and coherent framework.

Furthermore, they recover the traditional classification and explain why this classification is

difficult to recover when trees are used.

It is necessary that tree reconstruction methods be backed up by measures of homoplasy

and by re-sampling techniques to ascertain that the figures obtained are not an artefact of their

assumption of tree-likeness. However, even if these complementary procedures may detect

weaknesses of the tree model, they do not evaluate explicit alternatives to the pure tree model.

Methods reconstructing networks do, and we argue that networks should be more broadly

used in historical linguistics in conjunction with classical tree reconstruction methods.

In fact, although networks are useful to visualize how languages are related without a

prejudice of tree-likeness, they may be troublesome to interpret, especially when both the

number of taxa and the number of characters increase. In the Neighbour Net approach, for

instance, it is possible to list the characters which are incompatible with a given split and

which are therefore involved in a particular ensemble of conflicting trajectories. However, in

cases like Sinitic where such a large amount of conflict is detected, the task quickly becomes

unmanageable.

Various reasons can explain the lack of tree-likeness in the Sinitic group. We believe that

the particular mode of development that these languages have experienced is responsible for

it. Diglossia and population contact have probably permanently counterbalanced the effect of

a tree-like divergence scheme, resulting in the continuum depicted by the Neighbour Nets.

However, we should not neglect alternative explanations linked to our processing of the data.

One possible explanation is that wordlists are not reliable in a phylogenetic framework, either

because lexicon is inappropriate (Ringe, 2002, Balter, 2003), or because a larger amount of

data is necessary. Another possible explanation is that the way we coded our data is flawed. It

is indeed unconstrained, and corresponds to a neutral observation of character polymorphism

with no specification of any privileged transitions between character states. This means

however that the probability of change in the suffix without a change in the root is the same as

the probability of changing from a cognate class to another. It also means that there are no

transitional states, such as variant roots, and no ordered transformations. Such approximations

are unrealistic and are probably responsible for the loss of a large amount of information

during the coding procedure. The evolution could indeed be tree-like, but the signal is too

weakened during the recoding process to allow the recovery of a robust hierarchy. The next

step in the analysis of Sinitic wordlists will necessarily have to be the implementation of

complex transformation series between states. However, even if such an acceptable coding is

provided, constraints on transformations cannot be incorporated in distance calculus and this

coding is therefore useless in a distance-based framework, for both tree and network

reconstruction. Unfortunately, up to now, no satisfactory methods exist to bridge the gap

between distance-based methods which consider alternatives to the pure tree model

and character-based methods which allow the diagnosis of homoplasy. In any case, if tree

reconstruction methods are used, it is vital that they be complemented by approaches for

testing the validity of the tree model before any interpretation of the subgrouping patterns.

References

Archie, James W. 1989. "Homoplasy excess ratios: New indices for measuring levels of
homoplasy in phylogenetic systematics and a critique of the consistency index". Systematic
Zoology, 38: 253-269.
Balter, Michael 2003. "Early Date for the Birth of Indo-European Languages". Science
302(5650): 1490-1491.
Bandelt, Hans-Juergen & Dress, Andreas W. M. 1992. "Split Decomposition: A new and useful
approach to phylogenetic analysis of distance data". Molecular Phylogenetics and Evolution 1:
242-252.
Bandelt, Hans-Juergen, Forster, Peter, Sykes, Bryan & Richards, Martin B., 1995. "Mitochondrial
portraits of human population using median networks". Genetics 141: 743–753.
Bandelt, Hans-Juergen, Forster, Peter & Rohl, Arne 1999. "Median-joining networks for inferring
intraspecific phylogenies". Molecular Biology and Evolution 16: 37-48.
Bryant, David & Moulton, Vincent 2002. "Neighbor-Net, an agglomerative algorithm for the
construction of phylogenetic networks". Workshop on Algorithms in Bioinformatics (WABI)
ed. by R. Guigo and D. Gusfield, vol. LNCS 2452, 375-391. Springer-Verlag.
Bryant, David & Moulton, Vincent 2004. "NeighborNet: an agglomerative algorithm for the
construction of phylogenetic networks". Molecular Biology and Evolution 21(2): 255-265.
Bryant, David, Filimon, F. and Gray, Russell D., in press "Untangling our past: Languages, trees,
splits and networks". The Evolution of Cultural Diversity: Phylogenetic Approaches ed. by
Ruth Mace, Clare J. Holden & S. Shennan. UCL Press.
Cao, Ying, Janke, Axel, Waddell, Peter J., Westerman, Michael, Takenaka, Osamu, Murata,
Shigenori, Okada, Norihiro, Paabo, Svante & Hasegawa, Masami 1998. "Conflict Among
Individual Mitochondrial Proteins in Resolving the Phylogeny of Eutherian Orders". Journal
of Molecular Evolution 47(3): 307-22.
Cavalli-Sforza, Luigi L., Minch, Edward & Mountain, Johanna L. 1992. "Coevolution of genes
and languages revisited". Proceedings of the National Academy of Sciences USA 89(12):
5620-4.
Chen, Baoya 1996. Lun yuyanjiechu yu yuyanlianmeng [Language contact and language union].
Beijing: Yuwen chubanshe.
Cheng, Chin-Chuan 1991. "Quantifying Affinity among Chinese Dialects". Languages and
Dialects of China, ed. by William S.-Y. Wang, 78-112. Berkeley: Journal of Chinese
Linguistics.
De Queiroz, Alvaro A. A. 1993. "For consensus (sometimes)". Systematic Biology 42(3): 368-
372.
Dixon, Robert M. W. 1997. The Rise and Fall of Languages. Cambridge: Cambridge University
Press.
Dolgopolsky, Aaron B. 1964. "A Probabilistic Hypothesis Concerning the Oldest Relationships
Among the Language Families in Northern Eurasia" (Original Russian translated into
English). Typology, Relationship and Time ed. by Vitalij V. Shevoroshkin & Thomas L.
Markey. Ann Arbor: Karoma Publishers.
Efron, Bradley 1979. "Bootstrap methods: Another look at the jackknife". The Annals of Statistics
7: 1-26.
Embleton, Sheila 1986. Statistics in Historical Linguistics. Bochum: Brockmeyer.
Erdos, Peter L., Steel, Michael A., Szekely, Laszlo A. & Warnow, Tandy 1997. "Local quartet
splits of a binary tree infer all quartet splits via one dyadic inference rule". Computers and
Artificial Intelligence 16(2): 217-227.
Farris, James S. 1989. "The retention index and the rescaled consistency index". Cladistics 5:
417-419.

Farris, James S., Kallersjo, Mari, Kluge, Arnold G. & Bult, Carol J. 1994. "Testing significance
of incongruence". Cladistics 10: 315-319.
Felsenstein, Joseph 1985. "Confidence limits on phylogenies: An approach using the bootstrap".
Evolution 39: 783-791.
Felsenstein, Joseph 2004. Inferring Phylogenies. Sunderland, Massachusetts: Sinauer Associates,
Inc.
Forster, Peter & Toth, Alfred 2003. "Towards a phylogenetic chronology of ancient Gaulish,
Celtic, and Indo-European". Proceedings of the National Academy of Sciences USA 100:
9079-9084.
Gray, Russell D. & Jordan, Fiona M. 2000. "Language trees support the express-train sequence of
Austronesian expansion". Nature 405: 1052-1055.
Gray, Russell D. & Atkinson, Quentin D. 2003. "Language-tree divergence times support the
Anatolian theory of Indo-European origin". Nature 426: 435-439.
Graybeal, Anna 1998. "Is it better to add taxa or characters to a difficult phylogenetic problem?".
Systematic Biology 47(1): 9-17.
Hennig, Willi 1950. Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Deutscher
Zentralverlag.
Hinnebusch, Thomas J. 1996. "Skewing in lexicostatistic tables as an indicator of contact". Paper
presented at the Round Table on Bantu Historical Linguistics, Université Lumière 2, Lyon,
France, May 30–June 1, 1996.
Holden, Clare J. 2002. "Bantu language trees reflect the spread of farming across sub-Saharan
Africa: a maximum-parsimony analysis". Proceedings of the Royal Society of London B 269:
793-799.
Huson, Daniel H. 1998. "SplitsTree: Analyzing and visualizing evolutionary data".
Bioinformatics 14: 68-73.
Huson, Daniel & Bryant, David 2004. SplitsTree.
http://www-ab.informatik.uni-tuebingen.de/software/splits/welcome.html
Kim, Junhyong 1998. "Large-scale phylogenies and measuring the performance of phylogenetic
estimators". Systematic Biology 47(1): 43-60.
Kluge, Arnold G. & Farris, James S. 1969. "Quantitative phyletics and the evolution of anurans".
Systematic Zoology 18: 1-32.
Lee, James & Wong, Bin R. 1991. "Population Movements in Qing China and their Linguistic
Legacy". Languages and Dialects of China, ed. by William S.-Y. Wang, 52-77. Berkeley:
Journal of Chinese Linguistics.
Minett, James W. & Wang, William S.-Y. 2003. "On detecting Borrowing: Distance-based and
Character-based approaches". Diachronica 20(2): 289-330.
Mueller, Laurence D. & Ayala, Francisco J. 1982. "Estimation and interpretation of genetic
distance in empirical studies". Genetical Research 40: 127-137.
Norman, Jerry 1988. Chinese. Cambridge: Cambridge University Press.
Ogura, Mieko 1994. "Dialect Formation in China: Linguistics, Genetic and Historical
Perspectives". In honour of William S-Y. Wang: Interdisciplinary Studies on Language and
Language Change ed. by Matthew Y. Chen & Ovid J.L. Tzeng, 349-372. Taipei: Pyramid
Press.
Rexova, Katerina, Frynta, Daniel & Zrzavy, Jan 2003. "Cladistic analysis of languages: Indo-
European classification based on lexicostatistical data". Cladistics 19: 120-127.
Ringe, Donald, Taylor, Anne & Warnow, Tandy 2002. "Indo-European and computational
cladistics". Transactions of the Philological Society 100: 59-129.
Russo, Claudia A. M., Takezaki, Naoko & Nei, Masatoshi 1996. "Efficiencies of different genes
and different tree-building methods in recovering a known vertebrate phylogeny". Molecular
Biology and Evolution 13: 525-536.

Ruvolo, Maryellen 1987. "Reconstructing genetic and linguistic trees: phenetic and cladistic
approaches". Biological Metaphor and Cladistic Classification, ed. by Henry M. Hoenigswald
& Linda F. Wiener, 193-216. Philadelphia: University of Pennsylvania Press.
Sagart, Laurent 1993. "Chinese and Austronesian: Evidence for a genetic relationship". Journal of
Chinese Linguistics 21(1): 1-63.
Saitou, Naruya & Nei, Masatoshi 1987. "Neighbor-joining Method: a new method for
reconstructing phylogenetic trees". Molecular Biology and Evolution 4(4): 406-425.
Sanderson, Michael J. & Donoghue, Michael J. 1989. "Patterns of variation in levels of
homoplasy". Evolution 43: 1781-1795.
Schleicher, August 1853. "Die ersten Spaltungen des indogermanischen Urvolkes". Allgemeine
Zeitung fuer Wissenschaft und Literatur. Böhlau, Weimar.
Schmidt, Johannes 1972. Die Verwandtschaftsverhältnisse der Indogermanischen Sprachen.
Weimar: H. Böhlau.
Starostin, Sergei 1991. Altajskaja Problema i Proisxoždenie Japonskogo Jazyka [The Altaic
problem and the origin of the Japanese language]. Moscow: Nauka.
Swadesh, Morris 1952. "Lexico-statistic dating of prehistoric ethnic contacts". Proceedings of the
American Philosophical Society 96: 453-463.
Swadesh, Morris 1955. "Towards greater accuracy in lexicostatistic dating". International
Journal of American Linguistics 21: 121-137.
Swofford, David L. 1995. PAUP*. Phylogenetic Analysis Using Parsimony (and other methods).
Sunderland, Massachusetts: Sinauer Associates.
Trask, Larry 1996. Historical Linguistics. London: Arnold.
Wang, William S.-Y. 1991. "Introduction". Languages and Dialects of China ed. by William S.-
Y. Wang, 1-3. Berkeley: Journal of Chinese Linguistics.
Wang, William S.-Y. 1998. "Language and the Evolution of Modern Humans". The Origins and
Past of Modern Humans ed. by K. Omoto & P. V. Tobias, 247-262. World Scientific.
Wang, Feng 2004. Basic-words of Chinese Dialects.
http://chinese.pku.edu.cn/wangf/wangf.htm
Wang, Feng & Wang, William S.-Y. 2004. "Basic words and language evolution". Language and
Linguistics 5(3): 643-662.
Wheeler, Ward C. 1992. "Extinction, sampling, and molecular phylogenetics". Extinction and
phylogeny, ed. by M. J. Novacek & Q. D. Wheeler, 205-215. New York: University Press.
Wu, C. F. J. 1986. "Jackknife, bootstrap and other resampling plans in regression analysis".
Annals of Statistics 14: 1261-1350.
Yuan, Jiahua 1980. Hanyu fangyan gaiyao [Conspectus on Chinese dialects]. Second edition.
Beijing: Wenzi Gaige Chubanshe.
