Corpus Linguistics
A Guide to the Methodology
Anatol Stefanowitsch

Anatol Stefanowitsch
Freie Universität Berlin
Institut für Englische Philologie
Habelschwerdter Allee 45
14195 Berlin

Phone (030) 838-723 56


Fax (030) 838-723 00
Email anatol.stefanowitsch@fu-berlin.de
clmbook@stefanowitsch.net
Preface to the draft version
I wrote this book, with interruptions, over a period of 12 years, and though I have
extensively rewritten the earlier parts for terminological and stylistic consistency,
the long time span has left an impact on the data and calculations: Over the years,
I have used slightly different versions of some of the corpora and different ways
of accessing them and computing statistics over the results. I am in the process of
redoing all case studies using the most widely accessible version of the respective
corpora, to ensure maximal reproducibility.
I believe that reproducibility is an important issue not just in corpus
linguistics in general, but in the context of a textbook in particular, and I am
planning to make this a topic with its own chapter in some future edition of the
book (but not in the first one). However, the studies in the book should be as
faithfully reproducible as possible, so I am planning to replace data drawn from
expensive commercially available corpora such as the ICE-GB by data from freely
or at least cheaply available corpora such as the BNC, the BNC Baby, the COW
corpora or corpora from the ICAME collection even in the first edition of the
book.
Finally, some of the data in the case studies is cited in the form given in the
original studies on which my case studies are based. These too are not always
reproducible with the freely available versions of the corpora in question, or they
may use corpora that are not available at all. I might redo these studies too.
Since this is a time-consuming and laborious task and the currently remaining
inconsistencies (for example, concerning the number of hits for particular
phenomena or the total number of tokens in the corpora) should not have an
impact on the determination of the book's quality in the reviewing process, I have
decided to submit the manuscript as is and to complete the recalculations during
the reviewing process.
I am looking forward to any questions, suggestions and constructive criticism
that the reviewers or other interested parties using this draft version have for me.
I have set up a dedicated email address for this purpose:


<clmbook@stefanowitsch.net>

A note on copyright: once this book is published with Language Science Press,
it will become available under a Creative Commons BY-SA 4.0 licence, but for the
moment, I retain full standard copyright over this draft. This is to ensure that I
cede control over this work only once it has been reviewed and a final version
has been found worthy of publication. This draft may be distributed only in
academic contexts (schools, universities and research institutes) and only in the
present form (as a PDF file or as a printout of this file). Please do not host it on
openly accessible servers; it will remain downloadable from the URL
http://stefanowitsch.net/clm/clmbook-draft.pdf (and from my academics.edu and
researchgate.net pages) until (and, for historical purposes, after) the final version
is published. Storing it in just a few places under my control is meant to prevent
outdated draft copies being used and distributed after the final version has been
published.

Anatol Stefanowitsch
Berlin, February 2017

Part I: Methodological Foundations



1 The need for corpus data


Broadly speaking, science is the study of some aspect of the (physical, natural or
social) world by means of systematic observation and experiment, and linguistics
is the scientific study of those aspects of the world that we summarize under the
label language. Thus, both the systematic observation and the experimental
study of linguistic behavior should have a role to play in linguistics.
Let us define a corpus somewhat crudely as a large collection of authentic text
(i.e., samples of language produced in genuine communicative situations), and
corpus linguistics as any form of linguistic inquiry based on data derived from
such a corpus (we will refine these definitions in the next chapter to a point
where they can serve as the foundation for a methodological framework, but they
will suffice for now).
Defined in this way, corpora clearly constitute recorded observations of
language behavior, so their place in linguistic research seems so obvious that
anyone unfamiliar with the last fifty years of mainstream linguistic theorizing
will wonder why their use would have to be justified at all. I cannot think of any
other scientific discipline whose textbook authors would feel compelled to begin
their exposition by defending the use of observational data, and yet corpus
linguistics textbooks often do exactly that.


The reasons for this defensive stance can be found in the recent history of the
field. The role of corpus data, and the observation of linguistic behavior more
generally, has been, and continues to be, highly controversial, especially with
respect to core areas of grammatical theory such as morphology, syntax, and
semantics. In much of the formalist linguistic literature, corpus linguistics is
routinely attacked, ridiculed, and dismissed not just as having no practical use,
but as having no conceivable use at all in linguistic inquiry.
In this literature, the method proposed instead is that of intuiting linguistic
data. Put simply, intuiting data means inventing sentences exemplifying the
phenomenon under investigation and then judging their grammaticality
(roughly, whether the sentence is a possible sentence of the language in


question). To put it mildly, inventing one's own data would appear to be a rather
subjective procedure, so, again, anyone unfamiliar with the last fifty years of
linguistic theorizing might wonder why such a procedure was proposed in the
first place and why it is considered superior to the use of corpus data.
Readers familiar with this discussion or readers already convinced of the need
for corpus data may skip this chapter, as it will not be referenced extensively in
the remainder of this book. For all others, a discussion of both issues (the alleged
uselessness of corpus data and the alleged superiority of intuited data) seems
indispensable, if only to put them to rest in order to concentrate, throughout the
rest of this book, on the vast potential of corpus linguistics and the exciting
avenues of research that it opens up.
Section 1.1 will discuss four major points of criticism leveled at corpus data.
As arguments against corpus data, they are easily defused, but they do point to
aspects of corpora and corpus-linguistic methods that must be kept in mind when
designing linguistic research projects. Section 1.2 will discuss intuited data in
more detail and show that it does not solve any of the problems associated
(rightly or wrongly) with corpus data. Instead, as Section 1.3 will show, intuited
data actually creates a number of additional problems. Still, the intuitions we have
about our native language (or other languages we speak well) can nevertheless
be useful in linguistic research as long as we do not confuse them with data.

1.1 Arguments against corpus data


The four major points of criticism leveled at the use of corpus data in linguistic
research are the following:

i. corpus data are usage data, and thus of no use in studying linguistic
knowledge;
ii. corpora, and the data derived from them, are necessarily incomplete;
iii. corpora contain only linguistic forms (represented as orthographic
strings), but no information about the semantics, pragmatics, etc. of these
forms; and
iv. corpora do not contain negative evidence, i.e. they can only tell us what is
possible in a given language, but not what is not possible.

I will discuss the first three points in the remainder of this section. A fruitful
discussion of the fourth point requires a basic understanding of statistics, which
will be provided in Chapters 6–9, so I will postpone it until Chapter 10.


1.1.1 Corpus data as usage data


The first point of criticism is the most fundamental one: if corpus data cannot tell
us anything about our object of study, there is no reason to use them at all. It is
no coincidence that this argument is typically made by proponents of generative
syntactic theories, who place much importance on the distinction between the
use of language (performance) and the knowledge of language (competence).
Noam Chomsky, one of the early proponents of generative linguistics, argued
early on that the exclusive goal of linguistics is to model competence, and that
therefore, corpora have no place in serious linguistic analysis:

The speaker has represented in his brain a grammar that gives an ideal
account of the structure of the sentences of his language, but, when
actually faced with the task of speaking or understanding, many other
factors act upon his underlying linguistic competence to produce actual
performance. He may be confused or have several things in mind, change
his plans in midstream, etc. Since this is obviously the condition of most
actual linguistic performance, a direct record (an actual corpus) is almost
useless, as it stands, for linguistic analysis of any but the most superficial
kind (Chomsky 1964[1971]: 130f., emphasis added).
This argument may seem plausible at first glance, but it requires at least one of
two assumptions that do not hold up to closer scrutiny: first, that there is an
impenetrable bi-directional barrier between competence and performance, and
second, that the influence of confounding factors on linguistic performance
cannot be identified in the data.


The assumption of a barrier between competence and performance is, of
course, a central axiom in generative linguistics, which famously assumes even
language acquisition to depend on input only minimally. This assumption has
been called into question by a wealth of recent research on language acquisition
(see Tomasello 2003 for an overview). But even if we accept the claim that
linguistic competence is not derived from linguistic usage, it would seem
implausible to accept the converse claim that linguistic usage does not reflect
linguistic competence (if it did not, this would raise the question of what we need
linguistic competence for at all).
This is where the second assumption comes into play. If we believe that
linguistic competence is at least broadly reflected in linguistic performance, as I
assume even hardcore generativist theoreticians do, then it should be possible to


model linguistic knowledge based on observations of language use, unless there
are unidentifiable confounding factors distorting performance, making it
impossible to determine which aspects of performance are reflections of
competence and which are not. Obviously, such confounding factors exist: the
confusion and the plan-changes that Chomsky mentions, but also others like
tiredness, drunkenness and all the other external influences that potentially
interfere with speech production. However, it is doubtful that these factors and
their distorting influence cannot be identified and taken into account when
drawing conclusions from linguistic corpora.1
Corpus linguistics is in the same situation as any other empirical science with
respect to the task of deducing underlying principles from specific
manifestations, influenced by other factors. For example, Chomsky has repeatedly
likened linguistics to physics, but physicists searching for gravitational waves do
not reject the idea of observational data on the basis of the argument that there
are many other factors acting upon fluctuations in gravity and that therefore a
direct record of such fluctuations is almost useless. Instead, they attempt to
identify these factors and subtract them from their measurements.
In any case, the gap between linguistic usage and linguistic knowledge would
be an argument against corpus data only if there were a way of accessing
linguistic knowledge directly and without the interference of other factors.
Sometimes, intuited data is claimed to fit this description, but as I will discuss in
Section 1.2.1, not even Chomsky himself subscribes to this position.

1.1.2 The incompleteness of corpora



Next, let us look at the argument that corpora are necessarily incomplete, also a
long-standing argument in Chomskyan linguistics:
[I]t is obvious that the set of grammatical sentences cannot be identified
with any particular corpus of utterances obtained by the linguist in field
work. Any grammar of a language will project the finite and somewhat
accidental corpus of observed utterances to a set (presumably infinite) of
grammatical utterances (Chomsky 1957: 15).

1 In fact, there is research that not only takes such factors into account but that actually
treats them as objects of study in their own right. There is so much corpus-based and
experimental research literature on disfluencies, hesitation phenomena, repairs, and
similar phenomena that it makes little sense to even begin citing it here (cf. Kjellmer
2003, Corley and Stewart 2008, Gilquin and De Cock 2011).


Let us set aside for now the problems associated with the idea of grammaticality
and simply replace the word grammatical with conventionally occurring (an
equation that Chomsky explicitly rejects). Even the resulting, somewhat weaker
statement is quite clearly true, and will remain true no matter how large a corpus
we are dealing with. Corpora are incomplete in at least two ways.
First, corpora, no matter how large, are obviously finite, and thus they can
never contain examples of every linguistic phenomenon. As an example, consider
the construction [it doesn't matter the N] (as in the lines It doesn't matter the
colour of the car / But what goes on beneath the bonnet from the Billy Bragg song A
Lover Sings).2 There is ample evidence that this is a construction of British
English. First, Bragg uses it in his song; second, a web search of .uk websites will
turn up a number of examples (e.g. [1.1]); and third, most native speakers will
readily provide examples of the construction:
(1.1a) It doesn't matter the reasons people go and see a film as long as they go
and see it. [thenorthernecho.co.uk]
(1.1b) Remember, it doesn't matter the size of your garden, or if you live in a flat,
there are still lots of small changes you can make that will benefit
wildlife. [avonwildlifetrust.org.uk]
(1.1c) It doesn't matter the context. In the end, trust is about the person
extending it. [clocurto.us]
(1.1d) It doesn't matter the color of the uniform, we all work for the greater good.
[fw.ky.gov]

However, the largest available corpus of British English, the one-hundred-million-
word British National Corpus (see Appendix A.1), does not contain a single
instance of this construction. This is unlikely to be due to the fact that the
construction is limited to informal registers, as the BNC contains a reasonable
amount of informal language. Instead, it seems more likely that the construction
is simply too infrequent to have a reasonable likelihood of occurrence in one
2 Note that this really is a grammatical construction in its own right, i.e., it is not a case
of right-dislocation (as in It doesn't matter, the color or It is not important, the color). In
cases of right-dislocation, the pronoun and the dislocated noun phrase are co-referential
and there is an intonation break before the NP (in standard English
orthographies, there is a comma before the NP). In the construction in question, the
pronoun and the NP are not co-referential (it functions as a dummy subject) and there
is no intonation break (cf. Michaelis and Lambrecht [1996] for a detailed (non-corpus-based)
analysis of the very similar [it BE amazing the N]).


hundred million words of text. Thus, someone studying the construction will
wrongly conclude that it does not exist in British English on the basis of the BNC.
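The likelihood argument can be made concrete with a little arithmetic: under a simple binomial model, the probability of finding at least one token of a construction in a corpus of n words is 1 - (1 - p)^n, where p is the construction's rate of occurrence per word. The sketch below is purely illustrative; the rate of one token per billion words is an invented assumption, not an estimate from this book:

```python
# Chance of observing a rare construction at least once in a finite corpus,
# under a simple binomial model. The occurrence rate used here (one token
# per billion words) is a hypothetical figure chosen for illustration.

def p_at_least_one(rate_per_word: float, corpus_size: int) -> float:
    """P(at least one occurrence) = 1 - (1 - p)^n."""
    return 1 - (1 - rate_per_word) ** corpus_size

rate = 1 / 1_000_000_000   # assumed rate: one token per billion words
bnc_size = 100_000_000     # ca. one hundred million words (BNC-sized)

print(f"{p_at_least_one(rate, bnc_size):.3f}")  # prints 0.095
```

On these invented numbers, even a BNC-sized corpus has less than a ten percent chance of containing a single example, so absence from a corpus is at best weak evidence of absence from the language.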
Second, linguistic usage is not homogeneous but varies across situations (think of
the kind of variation referred to by terms such as dialect, sociolect, genre, register,
etc.); clearly, it is, for all intents and purposes, impossible to include this variation
in its entirety in a given corpus. This is a problem not only for studies that are
interested in linguistic variation but also for studies in core areas such as lexis
and grammar: many linguistic patterns are limited to certain varieties, and a
corpus that does not contain a particular variety cannot contain examples of a
pattern limited to that variety. For example, the verb croak in the sense 'die' is
usually used intransitively, but there is one register in which it also occurs
transitively. Consider the following representative examples:

(1.2a) Because he was a skunk and a stool pigeon I croaked him just as he was
goin' to call the bulls with a police whistle [Veiller, Within the Law]
(1.2b) [U]se your bean. If I had croaked the guy and frisked his wallet, would I
have left my signature all over it? [Stout, Some Buried Caesar]
(1.2c) I recall pointing to the loaded double-barreled shotgun on my wall and
replying, with a smile, that I would croak at least two of them before they
got away. [Thompson, Hell's Angels]
Very roughly, we might characterize this register as (American) gangster talk, or
perhaps (American) gangster talk as portrayed in crime fiction (I have never
come across an example outside of this genre). Neither of these registers is among
the text categories represented in the BNC, and therefore the transitive use of
croak 'die' does not occur in this corpus.3


The incompleteness of linguistic corpora must therefore be accepted and kept
in mind when designing and using corpora (cf. the discussion in the next
chapter). However, it is not an argument against the use of corpora, as all
collections of data are necessarily incomplete. One important aspect of scientific
work is to build general models from incomplete data and refine them as more
data becomes available. The incompleteness of observational data is not seen as
an argument against its use in other disciplines, and the argument gained
currency in linguistics only because it was largely accepted that intuited data are

3 A kind of pseudo-transitive use with a dummy object does occur, however: He
croaked it meaning 'he died', and of course the major use of croak ('to speak with a
creaky voice') occurs transitively.


more complete. I will argue in Section 1.2.2, however, that this is not the case.

1.1.3 The absence of meaning in corpora


Finally, let us turn to the argument that corpora do not contain information about
the semantics, pragmatics, etc. of the linguistic expressions they contain. Lest
anyone get the impression that it is only Chomskyan linguists who reject corpus
data, consider the following statement of this argument by an avowed anti-
Chomskyan:

Corpus linguistics can only provide you with utterances (or written letter
sequences or character sequences or sign assemblages). To do cognitive
linguistics with corpus data, you need to interpret the data to give it
meaning. The meaning doesn't occur in the corpus data. Thus,
introspection is always used in any cognitive analysis of language [...] (G.
Lakoff 2004).

G. Lakoff (and others putting forward this argument) are certainly right: if the
corpus itself was all we had, corpus linguistics would be reduced to the detection
of formal patterns (such as recurring combinations) in otherwise meaningless
strings of symbols.
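This kind of purely formal pattern detection is easy to demonstrate: the sketch below counts recurring symbol combinations (bigrams) in a string whose meaning is treated as unknown. The symbol sequence is invented for illustration; it is not data from any real document:

```python
from collections import Counter

# Count adjacent symbol pairs (bigrams) in a meaningless symbol string.
# The sequence below is invented; it does not come from any real artifact.
text = "ABCABDABCACDABCABD"

bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))

print(bigrams.most_common(1))  # prints [('AB', 5)]
```

Recurring combinations like AB can be found without knowing what, if anything, the symbols mean; interpreting them is a separate step entirely.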

There are cases where this is the best we can do, namely, when dealing with
documents in an unknown or unidentifiable language. An example is the Phaistos
disc, a clay disk discovered in 1908 in Crete. The disc contains a series of symbols
that appear to be pictographs (but may, of course, have purely phonological
value), arranged in an inward spiral. These pictographs may or may not present a
writing system, and no one knows what language, if any, they may represent (in
fact, it is not even clear whether the disc is genuine or a fake). However, this has
not stopped a number of scholars from linguistics and related fields from identifying
a number of intriguing patterns in the series of pictographs and some
suggestive parallels to known writing systems (see Robinson 2002, Ch. 11, for a
fairly in-depth popular account). Some of the results of this research are certainly
suggestive and may one day enable us to identify the underlying language and
even decipher the message, but until someone does so, there is no way of
knowing if the theories are even on the right track.
It hardly seems desirable to put ourselves in the position of a Phaistos disc
scholar artificially, by excluding from our research designs our knowledge of
English (or whatever other language our corpus contains); it is quite obvious that
we should, as G. Lakoff says, interpret the data in the course of our analysis. But
does this mean that we are using introspection in the same way as someone
inventing sentences and judging their grammaticality?
I think not. We need to distinguish two different types of introspection: (i)
intuiting, i.e. the practice of introspectively accessing one's linguistic experience in
order to create sentences and assign grammaticality judgments to them; and (ii)
interpreting, i.e. the practice of assigning an interpretation (in semantic and
pragmatic terms) to an utterance. These are two very different activities, and
there is good reason to believe that speakers are better at the second activity than
at the first: interpreting linguistic utterances is a natural activity (speakers must
interpret everything they hear or read in order to understand it); inventing
sentences and judging their grammaticality is not a natural activity (speakers
never do it outside of papers on grammatical theory). Thus, one can believe that
interpretation has a place in linguistic research but intuition does not.
Nevertheless, interpretation is a subjective activity and there are strict procedures
that must be followed when including its results in a research design. This issue
will be discussed in detail in Chapter 5.
As with the two points of criticism discussed in the preceding subsections, the
problem of interpretation would be an argument against the use of corpus data
only if there were a method that avoids interpretation completely or that at least
allows for interpretation to be made objective.

1.2 Intuition
Intuited data would not be the only alternative to corpus data, but it is the one
proposed and used by critics of the latter, so let us look more closely at this
practice. Given the importance of grammaticality judgments, one might expect
them to have been studied extensively to determine exactly what it is that people
are doing when they are making such judgments. Surprisingly, this is not the
case, and the few studies that do exist are hardly ever acknowledged as
potentially problematic by those linguists that routinely rely on them, let alone
discussed with respect to their place in scientific methodology.
One of the few such discussions is found in Jackendoff (1994). Jackendoff
introduces the practice of intuiting grammaticality judgments as follows:

[A]mong the kinds of experiments that can be done on language, one kind
is very simple, reliable, and cheap: simply present native speakers of a
language with a sentence or phrase, and ask them to judge whether or not it is


grammatical in their language or whether it can have some particular
meaning. [...] The idea is that although we can't observe the mental
grammar of English itself, we can observe the judgments of grammaticality
and meaning that are produced by using it (Jackendoff 1994: 47; emphasis
added).

This statement is representative of the general attitude towards grammaticality
judgments in generative linguistics in two ways: first, in that it views intuitive
judgments as one kind of scientific experiment among others without discussing
the methodological implications of this assumption; second, in that it treats the
process of eliciting such judgments as so straightforward that it does not require
in-depth justification or discussion.
Jackendoff does not deal with either of these two aspects in any detail,
although he briefly touches upon the first issue in the following passage:

Ideally, we might want to check these experiments out by asking large
numbers of people under controlled circumstances, and so forth. But in fact
the method is so reliable that, for a very good first approximation, linguists
tend to trust their own judgments and those of their colleagues (Jackendoff
1994: 48).
The only thing about this statement that is true is the observation that linguists
trust their own judgments. However, the claim that these judgments are reliable
is completely unfounded. In the linguistic literature, grammaticality judgments of
the same sentences by different authors often differ consistently, and the few
studies that have investigated the reliability of grammaticality judgments have
consistently shown that such judgments display too much variation across and
within individual speakers to take seriously the idea that isolated grammaticality
judgments can be used as linguistic data.4 What is especially problematic is the
use of isolated judgments by the researcher themselves; first, they are language
experts, whose judgments will hardly be representative of the average native
speaker, and second, they will usually know what it is that they want to prove,
and this will distort their judgments. It seems obvious, then, that expert

4 There is a substantial number of studies that deal with various aspects of
grammaticality judgments; suffice it here to mention two relatively recent book-length
treatments, Schütze 1996 (reissued under a Creative Commons license by Language
Science Press in 2016), esp. Ch. 3 on factors influencing grammaticality judgments,
and Cowart 1997.


judgments should be used with extreme caution (cf. Labov 1996) if at all (Schütze
1996/2016), instead of serving as the main methodology in linguistics.
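The variation at issue can be quantified by measuring how well two speakers agree when judging the same sentences, for instance with Cohen's kappa, a standard chance-corrected agreement measure. The judgments below are made up for illustration; they are not data from the studies cited above:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum((freq_a[label] / n) * (freq_b[label] / n)
                   for label in freq_a | freq_b)
    return (observed - expected) / (1 - expected)

# Invented grammaticality judgments ("ok" vs. "*") for ten sentences:
a = ["ok", "ok", "*", "ok", "*", "ok", "ok", "*", "ok", "ok"]
b = ["ok", "*", "*", "ok", "ok", "ok", "ok", "*", "*", "ok"]

print(round(cohens_kappa(a, b), 2))  # prints 0.35
```

The two invented raters agree on seven of ten sentences, but after correcting for chance agreement kappa is only about 0.35, far below the reliability usually demanded of a measuring instrument.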
The real reason that data intuited by the researcher themselves is so popular is
not that it is reliable, but that it is easy. Jackendoff essentially admits this when he
says that

other kinds of experiments can be used to explore properties of the mental
grammar. Their disadvantage is their relative inefficiency: it takes a great
deal of time to set up the experiment. By contrast, when the experiment
consists of making judgments of grammaticality, there is nothing simpler
than devising and judging some more sentences (Jackendoff 1994: 49).

However, the fact that something can be done quickly and effortlessly does not
make it a good scientific method. If one is serious about using grammaticality
judgments, these must be made as reliable as possible; among other things, this
involves the two aspects that Jackendoff dismisses so lightly: asking large
numbers of speakers and controlling the circumstances under which they are
asked (cf. Schütze 1996/2016 and Cowart 1997 for detailed suggestions as to how
this is to be done and Bender 2005 for an interesting alternative). In order to
distinguish such empirically collected introspective data from data intuited by the
researcher, I will refer to the former as elicitation data and continue to reserve for
the latter the term intuition or intuited data.
The reliability problems of linguistic intuitions should be obvious and I will
return to them briefly in Section 1.3, but first, let us discuss whether intuited
data fares better than corpus data in terms of the three major points of criticism
discussed in the preceding section:


i. are intuited data a more direct reflection of linguistic knowledge
(competence) than corpus data;
ii. are intuited data more complete than corpus data; and
iii. do intuited data contain information about the semantics, pragmatics,
etc. of these forms.

1.2.1 Intuition as performance


The most fundamental point of criticism leveled against corpus data concerns the
claim that since corpora are samples of language use (performance), they are
useless in the study of linguistic knowledge (competence). I argued in Section
1.1.1 above that this claim makes sense only in the context of rather implausible
assumptions concerning linguistic knowledge and linguistic usage, but even if we
accept these assumptions, the question remains whether intuited judgments are
different from corpus data in this respect.
It seems a priori obvious that both inventing sentences and judging their
grammaticality are types of behavior, and as such, performance in the
generative linguistics sense. And in fact, Chomsky himself admits this:

[W]hen we study competence (the speaker-hearer's knowledge of his
language) we may make use of his reports and his behavior as evidence,
but we must be careful not to confuse evidence with the abstract
constructs that we develop on the basis of evidence and try to justify in
terms of evidence. Since performance (in particular, judgments about
sentences) obviously involves many factors apart from competence, one
cannot accept as an absolute principle that the speaker's judgments will
give an accurate account of his knowledge. (Chomsky 1972: 187, emphasis
added).

There is little to add to this statement, other than to emphasize that if it is
possible to construct a model of linguistic competence on the basis of intuited
judgments that involve factors other than competence, it should also be possible
to do so on the basis of corpus data that involve factors other than competence,
and the competence/performance argument against corpus data collapses.

1.2.2 The incompleteness of intuition


Next, let us turn to the issue of incompleteness. As discussed in Section 1.1.2,
corpus data are necessarily incomplete, both in a quantitative sense (since every
corpus is finite in size) and in a qualitative sense (since even the most carefully
constructed corpus is skewed with respect to the types of language it contains).
This incompleteness is not an argument against using corpora as such, but it
might be an argument in favor of intuited judgments if there were reason to
believe that they are more complete.
To my knowledge, this issue has never been empirically addressed, and it
would be difficult to do so, since there is no a priori complete data set against
which intuited judgments could be compared. However, it seems implausible to
assume that such judgments are more complete than corpus data. First, just like a
corpus, the linguistic experience of a speaker is finite, and any mental
generalizations based on this experience will be partial in the same way that


generalizations based on corpus data must be partial (although it must be
admitted that the linguistic experience a native speaker gathers over a lifetime
exceeds even a large corpus like the BNC in terms of quantity). Second, just like a
corpus, a speaker's linguistic experience is limited to certain text types: most
English speakers have never been to confession or planned an illegal activity, for
example, which means they will lack knowledge of certain linguistic structures.
To exemplify this point, consider that many speakers of English are unaware of the fact that there is a use of the verb bring that has the valency pattern (or subcategorization frame) [__ NP_LIQUID PP[to the boil]] (in British English) or [__ NP_LIQUID PP[to a boil]] (in American English). This use is essentially limited to a single text type, recipes: of the 145 matches in the BNC, 142 occur in recipes and the remaining three in narrative descriptions of someone following a recipe. Thus, a native speaker of English who never reads cookbooks or cooking-related journals and websites and never watches cooking shows on television can go through their whole life without encountering the verb bring used in this way. When describing the grammatical behavior of the verb bring based on their intuition, this use would not occur to them, and if they were asked to judge the grammaticality of a sentence like Half-fill a large pan with water and bring to the boil [BNC A7D], they would judge it ungrammatical. Thus, this valency pattern would be absent from their description in the same way that transitive croak 'die' or [it doesn't matter the N] would be absent from a grammatical description based on the BNC (where, as we saw in Section 1.1.2, these patterns do not occur).

If this example seems too speculative, consider Culicover's (1999: 106f.) analysis of the phrase no matter. Culicover is an excellent linguist by any standard, but he bases his intricate argument concerning the unpredictable nature of the phrase no matter on the claim that the construction [it doesn't matter the N] is ungrammatical. If he had consulted the BNC, he might be excused for coming to this wrong conclusion, but he reaches it without consulting a corpus at all, based solely on his native-speaker intuition.5

5 Culicover is a speaker of American English, so if he were writing his book today, he might check the 450-million-word Corpus of Contemporary American English (COCA), first released in 2008, instead of the BNC. If he did, he would find nine instances of the construction (for example It doesn't matter the number of zeros they attach to it, from a 1997 transcript of ABC Nightline) and would not have to rely on his incomplete native-speaker intuition.


1.2.3 Intuitions about form and meaning


Finally, let us turn to the question of whether intuited data contain information about meaning. At first glance, the answer to this question would appear to be an obvious yes: if I make up a sentence, of course I know what that sentence means. However, a closer look shows that matters are more complex and the answer is less obvious.

Constructing a sentence and interpreting a sentence are two separate activities. As a consequence, I do not actually know what my constructed sentence means, but only what I think it means. While I may rightly consider myself the final authority on the intended meaning of a sentence that I myself have produced, my interpretation ceases to be privileged in this way once the issue is no longer my intention, but the interpretation that my constructed sentence would conventionally receive in a particular speech community. In other words, the interpretation of a constructed sentence is subjective in the same way that the interpretation of a sentence found in a corpus is subjective. In fact, interpreting other people's utterances, as we must do in corpus-linguistic research, may actually lead to more intersubjectively stable results, as interpreting other people's utterances is a more natural activity than interpreting our own: the former is what we routinely engage in in communicative situations; the latter, while not exactly unnatural, is a rather exceptional activity.


On the other hand, it is very difficult not to interpret a sentence, but that is exactly what I would have to do in intuiting grammaticality judgments: judging a sentence to be grammatical or ungrammatical is supposed to be a judgment purely about form, dependent on meaning only insofar as that meaning is relevant to the grammatical structure. Consider the example in (1.3a):

(1.3a) When she'd first moved in she hadn't cared about anything, certainly not her surroundings – they had been the least of her problems – and if the villagers hadn't so kindly donated her furnishings she'd probably still be existing in empty rooms. [BNC H9V]

The clause [T]he villagers [...] donated her furnishings can be judged for its grammaticality only after disambiguating between the meanings associated with the structures in (1.3b) and (1.3c):

(1.3b) VP[donated NP[her] NP[furnishings]]

(1.3c) VP[donated NP[DET[her] N[furnishings]]]


The structure in (1.3b) is a ditransitive, which is widely agreed to be impossible with donate (but see Stefanowitsch 2007), so the sentence would be judged ungrammatical under this reading by the vast majority of English speakers. The structure in (1.3c), in contrast, is a simple transitive, which is one of the two most frequent valency patterns for donate, so the sentence would be judged grammatical by all English speakers. The same would obviously be true if the sentence were constructed rather than taken from a corpus.

But the semantic considerations that increase or decrease our willingness to judge an utterance as grammatical are frequently more subtle than the difference between the readings in (1.3b) and (1.3c). Consider the example in (1.4), which contains a clear example of donate with the supposedly ungrammatical ditransitive valency pattern:

(1.4) Please have a look at our wish-list and see if you can donate us a plant we need. [headway-cambs.org.uk]

Since this is an authentic example, we cannot simply declare it ungrammatical; instead, we must look for properties that distinguish this example from more typical uses of donate and try to arrive at an explanation for such exceptional, but possible, uses. In Stefanowitsch (2007), looking at a number of such exceptional uses, I suggest that they may be made possible by the highly untypical sense in which the verb donate is used here. In (1.4) and other ditransitive uses, donate refers to a direct transfer of something relatively valueless from one individual to another in a situation of personal contact. This is very different from the typical use, where a sum of money is transferred from an individual to an organization without personal contact. If this were an intuited example, I might judge it grammatical (at least marginally so) for similar reasons, while another researcher, unaware of my subtle reconceptualization, would judge it ungrammatical, leading to no insights whatsoever into the semantics of the verb donate or the valency patterns it occurs in.
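Candidate examples for this kind of analysis can be gathered with a rough lexical query before any manual interpretation. The sketch below is a heuristic of my own (it is not the retrieval method used in Stefanowitsch 2007): it flags a form of donate immediately followed by an object pronoun and then a determiner, the surface signature of a ditransitive use like donate us a plant:

```python
import re

# A form of DONATE followed by an object pronoun and then a determiner
# suggests a ditransitive use ("donate us a plant"); simple transitives
# ("donate money") and to-datives ("donate it to us") are not matched.
DITRANSITIVE_DONATE = re.compile(
    r"\b(?:donate|donates|donated|donating)\s+"
    r"(?:me|you|him|her|us|them)\s+"
    r"(?:a|an|the|some)\b",
    re.IGNORECASE,
)

def find_ditransitive_donate(text):
    """Return candidate ditransitive uses of 'donate' found in a text."""
    return [m.group(0) for m in DITRANSITIVE_DONATE.finditer(text)]

sample = (
    "Please have a look at our wish-list and see if you can donate us "
    "a plant we need. Most people donate money to the charity directly."
)
print(find_ditransitive_donate(sample))  # → ['donate us a']
```

Every hit still needs to be inspected by hand, and ditransitives with full-NP recipients (donate the school a minibus) would require a part-of-speech-tagged corpus and a more elaborate query.
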

1.3 Intuition data vs. corpus data


As the preceding section has shown, intuited judgments are just as vulnerable as corpus data concerning the major points of criticism leveled at the latter. In fact, I have tried to argue that they are, in some respects, more vulnerable to these criticisms. In case this has not yet convinced you of the need for corpus data, let me compare the quality of intuited data and corpus data in terms of two aspects that are considered much more crucial in methodological discussions outside of linguistics than those discussed above:

i. data reliability (roughly, how sure we can be that other people will arrive at the same set of data using the same method);
ii. data validity or epistemological status of the data (roughly, how well we understand what real-world phenomenon the data correspond to).6

As to the first criterion, note that the problem is not that intuition data are necessarily wrong. Very often, intuitive judgments turn out to agree very well with more objective kinds of evidence, and this should not come as a surprise. After all, as native speakers of a language, or even as advanced foreign-language speakers, we have considerable experience with using that language actively (speaking and writing) and passively (listening and reading). It would thus be surprising if we were categorically unable to make statements about the likelihood of occurrence of a particular expression.

Instead, the problem is that we have no way of determining introspectively whether a particular piece of intuited data is correct or not. To decide this, we need objective evidence, obtained either by serious experiments (including elicitation experiments) or by corpus-linguistic methods. But if that is the case, the question is why we need intuition data in the first place. In other words, intuition data are simply not reliable.

The second criterion provides an even more important argument, perhaps the most important argument, against the practice of intuiting. Note that even if we manage to solve the problem of reliability (as systematic elicitation from a representative sample of speakers does to some extent), the epistemological status of intuitive data remains completely unclear. This is particularly evident in the case of grammaticality judgments: we simply do not know what it means to say that a sentence is 'grammatical' or 'ungrammatical', i.e., whether 'grammaticality' is a property of natural languages or their mental representations in the first place. It is not entirely implausible to doubt this (cf. e.g. Sampson 1987), and even if one does not, one would have to offer a theoretically well-founded definition of what 'grammaticality' is and one would have to show how it is measured by grammaticality judgments. Neither task has been satisfactorily undertaken.

6 Readers who are well-versed in methodological issues are asked to excuse this somewhat abbreviated use of the term validity; there are, of course, a range of uses in the philosophy of science and methodological theory for the term validity (we will encounter a different use from the one here in Chapters 3 and 4).

In contrast, the epistemological status of a corpus datum is crystal clear: it is (an orthographic representation of) something that a specific speaker has said or written on a specific occasion in a specific situation. Statements that go beyond a specific speaker, a specific occasion or a specific situation must, of course, be inferred from these data; this is difficult, and there is a constant risk that we get it wrong. However, inferring general principles from specific cases is one of the central tasks of all scientific research, and the history of any discipline is full of inferences that turned out to be wrong. Intuited data may create the illusion that we can jump to generalizations directly and without the risk of errors. The fact that corpus data do not allow us to maintain this illusion does not make them inferior to intuition; it makes them superior. More importantly, it makes them normal observational data, no different from observational data in any other discipline.

To put it bluntly, then: intuition data are less reliable and less valid than corpus data, and they are just as incomplete and in need of interpretation. Does this mean that intuition data should be banned completely from linguistics? The answer is no, but not straightforwardly so.

On the one hand, we would deprive ourselves of a potentially very rich source of information by dogmatically abandoning the use of our linguistic intuition (native-speaker or not). On the other hand, given the unreliability and questionable epistemological status of intuition data, we cannot simply use them, as some corpus linguists suggest (e.g. McEnery and Wilson 2001: 19), to augment our corpus data. The problem is that any mixed data set (i.e. any set containing both corpus data and intuition data) will only be as valid, reliable, and complete as the weakest subset of data it contains. We have already established that intuition data and corpus data are both incomplete; thus, a mixed set will still be incomplete (albeit perhaps less incomplete than a pure set), so nothing much is gained. Instead, the mixed set will simply inherit the lack of validity and reliability from the intuition data, and thus its quality will actually be lowered by their inclusion.

The solution to this problem, I believe, is quite simple. While intuited information about linguistic patterns fails to meet even the most basic requirements for scientific data, it meets every requirement for scientific hypotheses. A hypothesis needs to be neither reliable, nor valid (in the sense of the term used here), nor complete. In fact, these words do not have any meaning if we apply them to hypotheses: the only requirement a hypothesis must meet is that of testability (see further Chapter 3). There is nothing wrong with introspectively accessing our experience as a native speaker of a language (or a non-native one at that), provided we treat the results of our introspection as hypotheses about the meaning or likelihood of occurrence rather than as facts.

Since there are no standards of purity for hypotheses, it is also unproblematic to mix intuition and corpus data in order to come up with more fine-grained hypotheses (cf. in this context Aston and Burnard [1998: 143]), as long as we then test our hypotheses on a pure data set that does not include the corpus data used in generating them.

1.4 Corpus data in other disciplines
Before we conclude our discussion of the supposed weaknesses of corpus data and the supposed strengths of intuited judgments, it should be pointed out that this discussion is limited largely to the field of grammatical theory. This in itself would be surprising if intuited judgments were indeed superior to corpus evidence: after all, the distinction between linguistic behavior and linguistic knowledge is potentially relevant in other areas of linguistic inquiry, too. Yet no other sub-discipline of linguistics has attempted to make a strong case against observation and for intuited data.

In some cases, we could argue that this is due to the fact that intuited judgments are simply not available. In language acquisition or in historical linguistics, for example, researchers could not use their intuition even if they wanted to, since not even the most fervent defenders of intuited judgments would want to argue that speakers have meaningful intuitions about earlier stages of their own linguistic competence or their native language as a whole. For language acquisition research, corpus data and, to a certain extent, psycholinguistic experiments are the only sources of data available, and historical linguists must rely completely on textual evidence.

In dialectology and sociolinguistics, however, the situation is slightly different: at least those researchers whose linguistic repertoire encompasses more than one dialect or sociolect (which is not at all unusual) could, in principle, attempt to use intuition data to investigate regional or social variation. To my knowledge, however, nobody has attempted to do this. There are, of course, descriptions of individual dialects that are based on introspective data (Green's (2002) description of the grammar of African-American English is an impressive example). But in the study of actual variation, systematically collected survey data (e.g. Labov, Ash and Boberg 2008) and corpus data in conjunction with multivariate statistics (e.g. Tagliamonte 2006) were considered the natural choice of data long before their potential was recognized in other areas of linguistics.

The same is true of conversation and discourse analysis. One could theoretically argue that our knowledge of our native language encompasses knowledge about the structure of discourse and that this knowledge should be accessible to introspection in the same way as our knowledge of grammar. However, again, no conversation or discourse analyst has ever actually taken this line of argumentation, relying instead on authentic usage data.7

Even lexicographers, who could theoretically base their description of the meaning and grammatical behavior of words entirely on introspectively accessed knowledge of their language, have not generally done so. Beginning with the Oxford English Dictionary, dictionary entries have been based at least in part on citations: authentic usage examples of the word in question (see next chapter).
v0
If the incompleteness of linguistic corpora or the fact that corpus data have to be interpreted were serious arguments against their use, these sub-disciplines of linguistics should not exist, or at least, they should not have yielded any useful insights into the nature of language change, language acquisition, language variation, the structure of linguistic interactions or the lexicon. Yet all of these disciplines have, in fact, yielded insightful descriptive and explanatory models of their respective research objects.

The question remains, then, why grammatical theory is the only sub-discipline of linguistics whose practitioners have rejected the common practice of building models of underlying principles on careful analyses of observable phenomena. It seems to me that the rejection of corpora and corpus-linguistic methods in (some schools of) grammatical theorizing is based mostly on a desire to avoid having to deal with actual data, which are messy, incomplete and often frustrating, and that the arguments against the use of such data are, essentially, post-hoc rationalizations. But whatever the case may be, we will, at this point, simply stop worrying about the wholesale rejection of corpus linguistics by some researchers until such time as they come up with a convincing argument for this rejection, and turn to the question of what exactly constitutes corpus linguistics.

7 Perhaps Speech Act Theory could be seen as an attempt at discourse analysis on the basis of intuition data: its claims are often based on short snippets of invented conversations. The difference between intuition data and authentic usage data is nicely demonstrated by contrasting the relatively broad but superficial view of linguistic interaction found in philosophical pragmatics with the rich and detailed view of linguistic interaction found in Conversation Analysis (e.g. Sacks, Schegloff and Jefferson 1974; Sacks 1992) and other discourse-analytic traditions.


2. What is corpus linguistics?


Although corpus-based studies of language structure can look back at a tradition of at least a hundred years, there is no general agreement as to what exactly constitutes corpus linguistics. This is due in part to the fact that the hundred-year tradition is not an unbroken one. As we saw in the preceding chapter, corpora fell out of favor just as linguistics grew into an academic discipline in its own right, and corpus linguistics in its modern incarnation has made a comeback only recently. It should therefore not come as a surprise that it has not, so far, consolidated into a homogeneous methodological framework. More generally, though, linguistics itself, with a tradition that reaches back to antiquity, has remained a notoriously heterogeneous discipline, with little agreement among researchers even with respect to fundamental questions such as what aspects of language constitute their object of study. It is unsurprising, then, that they do not agree on how their object of study should be approached methodologically and how it might be modeled theoretically. Given this lack of agreement, it is highly unlikely that a unified methodology will emerge in the field any time soon.

On the one hand, this heterogeneity is a good thing. The dogmatism that comes with monolithic theoretical and methodological frameworks can be stifling to the curiosity that drives scientific progress, especially in the humanities and social sciences, which are, by and large, less mature descriptively and theoretically than the natural sciences. On the other hand, after more than a century of scientific inquiry in the modern sense, there should no longer be any serious disagreement as to its fundamental procedures, and there is no reason not to apply these procedures within the language sciences. Thus, I will attempt in this chapter to sketch out a broad and, I believe, ultimately uncontroversial characterization of corpus linguistics as an instance of the scientific method. I will develop this proposal by successively considering and dismissing alternative characterizations of corpus linguistics. My aim in doing so is not to delegitimize these alternative characterizations, but to point out ways in which they are incomplete.


Let us begin by considering a characterization of corpus linguistics from a leading textbook:

Corpus linguistics is perhaps best described for the moment in simple terms as the study of language based on examples of 'real life' language use. (McEnery and Wilson 2001: 1)

This definition is uncontroversial in that any research method that does not fall under it would not be regarded as corpus linguistics. However, it is also very broad, covering many methodological approaches that would not be described as corpus linguistics even by their own practitioners (such as discourse analysis or citation-based lexicography). Some otherwise similar definitions of corpus linguistics attempt to be more specific in that they define corpus linguistics as 'the compilation and analysis of corpora' (Cheng 2012: 6; cf. also Meyer 2002: xi), suggesting that there is a particular form of recording real-life language use called a corpus.

The first chapter of this book started with a similar definition, characterizing corpus linguistics as any form of linguistic inquiry based on data derived from a corpus, where corpus was defined as a large collection of authentic text. In order to distinguish corpus linguistics proper from other observational methods in linguistics, we must first refine this definition of a linguistic corpus; this will be our concern in the remainder of Section 2.1. We must then take a closer look at what it means to study language on the basis of a corpus; this will be our concern in Section 2.2.

2.1 The linguistic corpus

The term corpus has slightly different meanings in different academic disciplines. It generally refers to a collection of texts; in literature studies, this collection may consist of the works of a particular author (e.g. all plays by William Shakespeare) or a particular genre and period (e.g. all 18th-century novels); in theology, it may be (a particular translation of) the Bible. In field linguistics, it refers to any collection of data (whether narrative texts or individual sentences) elicited for the purpose of linguistic research, frequently with a particular research question in mind (cf. Sebba and Fligelstone 1994: 769).

In corpus linguistics, the term is used differently: it refers to a collection of samples of language use with the following properties:

- the instances of language use contained in it are authentic;
- the collection is representative of the language or linguistic variety under investigation;
- the collection is large.

To distinguish this type of corpus from other corpora, we will refer to it as a linguistic corpus, although the term corpus should always be understood to refer to a linguistic corpus in this book unless specified otherwise.

Let us now discuss each of these criteria in turn, beginning with authenticity.

.1
2.1.1 Authenticity
The word authenticity has a range of meanings that could be applied to language: it can mean that a speaker or writer speaks true to their character (He has found his authentic voice) or to the character of the group they belong to (She is the authentic voice of her generation), that a particular piece of language is correctly attributed (This is not an authentic Lincoln quote), or that speech is direct and truthful (the authentic language of ordinary people).

In the context of corpus linguistics (and often of linguistics in general), authenticity refers much more broadly to what McEnery and Wilson call 'real life' language use. As Sinclair puts it, an authentic corpus is one in which

[a]ll the material is gathered from the genuine communications of people going about their normal business. Anything which involves the linguist beyond the minimum disruption required to acquire the data is reason for declaring a special corpus. (Sinclair 1996)

In other words, authentic language is language produced for the purpose of communication, not for linguistic analysis or even with the knowledge that it might be used for such a purpose. It is language that is not, as it were, performed for the linguist based on what speakers believe constitutes 'good' or 'proper' language. This is a very broad view of authenticity, since people may be performing inauthentic language for reasons other than the presence of a linguist, but such performances would be regarded by linguists as something people will do naturally from time to time and that can and must be studied as an aspect of language use. In contrast, performances for the linguist are assumed to distort language behavior in ways that make them unsuitable for linguistic analysis.


In the case of written language, the criterion of authenticity is easy to satisfy. Writing samples can be collected after the fact, so that there is no way for the speakers to know that their language will come under scientific observation. In the case of spoken language, the 'minimum disruption' that Sinclair mentions becomes relevant. We will return to this issue and its consequences for authenticity presently, but first let us discuss some general problems with the corpus linguist's broad notion of authenticity.

Widdowson (2000), in the context of discussing the use of corpora in the language classroom, casts doubt on the notion of authenticity for what seems, at first, to be a rather philosophical reason:

.1
The texts which are collected in a corpus have a reflected reality: they are only real because of the presupposed reality of the discourses of which they are a trace. This is decontextualized language, which is why it is only partially real. If the language is to be realized as use, it has to be recontextualized. (Widdowson 2000: 7)

In some sense, it is obvious that the texts in a corpus (in fact, all texts) are only fully authentic as long as they are part of an authentic communicative situation. A sample of spoken language is only authentic as part of the larger conversation it is part of; a sample of newspaper language is only authentic as long as it is produced in a newsroom and processed by a reader in the natural context of a newspaper or news site for the purposes of informing themselves about the news; and so on. Thus, the very act of taking a sample of language and including it in a corpus removes its authenticity.

This rather abstract point has very practical consequences, however. First, any text, spoken or written, will lose not only its communicative context (the discourse of which it was originally a part), but also some of its linguistic and paralinguistic properties when it becomes part of a corpus. This is most obvious in the case of transcribed spoken data, where the very act of transcription means that aspects like tone of voice, intonation, subtle aspects of pronunciation, facial expressions, gestures etc. are removed (or, at best, replaced by simplified descriptions). It is also true, however, of written texts, where, for example, visual information about the font, its color and size, the position of the text on the page, and the tactile properties of the paper are removed or replaced by descriptions.

The corpus linguist can attempt to supply this missing information introspectively, 'recontextualizing' the text, as Widdowson puts it. But since they are not in an authentic setting (and often not a member of the same cultural and demographic group as the original or originally intended hearer/reader), this recontextualization can approximate authenticity at best.

Second, texts, whether written or spoken, may contain errors that were present in the original production or that were introduced by editing before publication or by the process of preparing them for inclusion in the corpus (cf. also Emons 1997). As long as the errors are present in the language sample before it is included in the corpus, they are not, in themselves, problematic: errors are part of language use and must be studied as such (in fact, the study of errors has yielded crucial insights into language processing, cf., for example, Fromkin 1973, 1980). The problem is that the decision as to whether some bit of language contains an error is one that the researcher must make by reconceptualizing the speaker and their intentions in the original context, a reconceptualization that makes authenticity impossible to determine.

This does not mean that corpora cannot be used. It simply means that the limits of authenticity have to be kept in mind. With respect to spoken language, however, there is a more serious problem: Sinclair's 'minimum disruption'.

The problem is that in observational studies no disruption is ever minimal: as soon as the investigator is present in person or in the minds of the observed, we get what is known as the observer's paradox. We want to observe people (or other animate beings) behaving as they would if they were not observed; in the case of gathering spoken language data, we want to observe speakers interacting linguistically as they would if no linguist were in sight.

In some areas, it is possible to circumvent this problem by hiding (or by installing hidden recording devices), but in the case of human language users this is impossible: it is unethical as well as illegal in most jurisdictions to record people without their knowledge. Speakers must give written consent before the data collection can begin, and there is typically a recording device in plain view that will constantly remind them that they are being recorded.

This knowledge will invariably introduce a degree of inauthenticity into the data. Take the following excerpts from the Bergen Corpus of London Teenage Language (COLT). In the excerpt in (2.1), the speakers are talking about the recording device itself, something they would obviously not do in other circumstances:

(2.1) A: Josie?
B: Yeah. [laughs] I'm not filming you, I'm just taping you.
[...]
A: Yeah, I'll take your little toy and smash it to pieces!
C: Mm. Take these back to your class. [COLT B132611]

In the excerpt in (2.2), speaker A explains to their interlocutor the fact that the conversation they are having will be used for linguistic research:

(2.2) A: Were you here when I got that?
B: No what is it?
A: It's for the erm, [...] language course. Language, survey.
[...]
B: Who gave it to you?
A: Erm this lady from the, University of Bergen.
B: So how d'ya how does it work?
A: Erm you you speak into it and erm, records, gotta record conversations between people. [COLT B141708]

A speaker's knowledge that they are being recorded for the purposes of linguistic analysis is bound to distort the data even further. In example (2.3), there is evidence for such a distortion: the speakers are performing explicitly for the recording device:

(2.3) C: Ooh look, there's Nick!
A: Is there any music on that?
B: A few things I taped off the radio.
A: Alright then. Right. I wa, I just want true things. He told me he dumped you is that true?
C: [laughs]
B: No it is not true. I protest. [COLT B132611]

Speaker A asks for 'true things' and then imitates an interview situation, which speaker B takes up by using the somewhat formal phrase 'I protest', which they presumably would not use in an authentic conversation about their love life.

Obviously, such distortions will be more or less problematic depending on our
research question. Level of formality (register) may be easier to manipulate in
performing for the linguist than pronunciation, which is easier to manipulate
than morphological or syntactic behavior. However, the fact remains that spoken
data in corpora is hardly ever authentic in the corpus linguistic sense (unless it is
based on recordings of public language use, for example, from television or the
radio), and the researcher must rely, again, on an aempt to recontextualize the

28
Anatol Stefanowitsch

data based on their own experience as a language user in order to identify


possible distortions. There is no objective way of judging the degree of distortion
introduced by the presence of an observer, since we do not have a sufficiently
broad range of surreptitiously recorded data for comparison.
There is one famous exception to the observer's paradox in spoken language
data: the so-called Nixon Tapes. These (illegal) surreptitious recordings of
conversation in the executive offices of the White House and the headquarters of
the opposing Democratic Party were produced at the request of the Republican
President Richard Nixon between February 1971 and July 1973. Many of these
tapes are now available as digitized sound files and/or transcripts (see, for

example, Nichter 2007). In addition to the interest they hold for historians, they
form the largest available corpus of truly authentic spoken language.
However, even these recordings are too limited in size as well as in the
diversity of speakers recorded (mainly older white American males), to serve as a
standard against which to compare other collections of spoken data.
The ethical and legal problems in recording unobserved spoken language
cannot be circumvented, but their impact on the authenticity of the recorded
language may be lessened in various ways – for example, by getting general
consent for recording from speakers, but not telling them when precisely they
will be recorded.

However, there are reasons why researchers may want to depart from
authenticity in the corpus-linguistic sense on purpose, because their research
design or the phenomenon under investigation requires it.
A researcher may be interested in a phenomenon that is so rare in most
situations that even the largest available corpora do not contain a sufficient
number of cases. These may be structural phenomena (like the pattern [It doesn't
matter the N] or transitive croak, discussed in the previous chapter), or unusual
communicative situations (for example, human-machine interaction). In these


situations, it may be necessary to switch methods and use grammaticality
judgments or similar methods after all, but it may also be possible to elicit these
phenomena in what we could call semi-authentic settings.
For example, researchers interested in motion verbs often do not have the
means (or the patience) to collect these verbs from general corpora, or corpora
may not contain a sufficiently broad range of descriptions of motion events with
particular properties. Such descriptions are sometimes elicited by asking speakers
to describe movie snippets or narrate a story from a picture book (cf. e.g. Berman
and Slobin 1994, Strömqvist and Verhoeven 2003). Human-machine interaction is
sometimes elicited in so-called ‘Wizard of Oz’ experiments, where people believe


they are talking to a robot, but the robot is actually controlled by one of the
researchers (cf. e.g. Georgila et al. 2010).
Such semi-structured elicitation techniques may also be used with phenomena
that are frequent enough in a typical corpus, but where the researcher wants to
vary certain aspects systematically (for example, the nature of the entity that is
moving along a path), or where the researcher wants to achieve comparability
across speakers or even across languages (cf. again the picture-elicited narratives
just mentioned).
These are valid reasons for eliciting a special-purpose corpus rather than
collecting naturally occurring text. Still, the stimulus-response design of this type

of corpus collection is very obviously influenced by experimental paradigms used
in psychology. Thus, studies based on such corpora must be regarded as falling
somewhere between corpus linguistics and psycholinguistics and they must
therefore meet the design criteria of both corpus linguistic and psycholinguistic
research designs.

2.1.2 Representativeness
Put simply, a representative sample is a subset of a population that is
(near-)identical to the population as a whole with respect to the distribution of
the phenomenon under investigation. Thus, for a corpus (a sample of language

use) to be representative of a particular language, the distribution of linguistic


phenomena (words, grammatical structures, etc.) would have to be identical to
their distribution in the language as a whole (or in the variety under
investigation, see further below).
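The statistical idea behind this definition can be made concrete with a small simulation: in a random sample, the relative frequency of a variant approximates its relative frequency in the population, while a non-random sample offers no such guarantee. The sketch below uses an invented two-variant "population"; the variants and proportions are not taken from any real corpus.

```python
import random
from collections import Counter

random.seed(1)

# Toy "population" of 100,000 tokens in which the variant "gonna" competes
# with "going to" at an invented rate of 8% (purely illustrative figures).
population = ["gonna"] * 8_000 + ["going to"] * 92_000

def rel_freq(tokens, item):
    """Relative frequency of one variant in a list of tokens."""
    return Counter(tokens)[item] / len(tokens)

# A random sample of 5,000 tokens approximates the population distribution;
# a non-random sample (say, the first 5,000 tokens of one text) need not.
sample = random.sample(population, 5_000)

print(rel_freq(population, "gonna"))        # 0.08
print(round(rel_freq(sample, "gonna"), 2))  # close to 0.08, not exact
```

With the fixed seed, the sample's relative frequency comes out close to the population value; a sample drawn from only one text type could systematically miss it, which is the skew problem discussed in the following paragraphs.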

Ostensibly, the way that corpus builders typically aim to achieve this is by
including in the corpus different text types – characterized by channel
(spoken/written), setting, function, demographic background of speakers etc. – in
a similar proportion to their occurrence in the speech community in question.


This is sometimes referred to as balance, the idea being that any given linguistic
phenomenon should be accurately represented in a balanced corpus.
It is obvious right away that this is an ideal that can never be attained in reality
for at least the following four reasons.
First, for most potentially relevant parameters we simply do not know how
they are distributed in the population. We may know the distribution of some of
the most important demographic variables (e.g. sex, age, education), but we
simply do not know the overall distribution of spoken vs. written language, press
language vs. literary language, texts and conversations about particular topics etc.


Second, even if we did know, it is not clear that all types of language use shape
and/or represent the linguistic system in the same way, simply because we do not
know how widely they are received. For example, emails may be responsible for a
larger share of wrien language produced daily than news sites, but each email is
typically read by a handful of people at the most, while some news texts may be
read by millions of people (and others not at all).
Third, in a related point, speech communities are not homogeneous, so
defining balance based on the proportion of text types in the speech community
may not yield a realistic representation of the language even if it were possible:
every member of the speech community takes part in different communicative
situations involving different text types. Some people read more than others,
among these some read mostly newspapers, others mostly novels; some people
watch parliamentary debates on TV all day, others mainly talk to customers in
the bakery where they work. In other words, the proportion of text types
speakers encounter varies, requiring a notion of balance based on the incidence of
text types in the linguistic experience of a typical speaker. This, in turn, requires a
definition of what constitutes a typical speaker in a given speech community.
Such a definition may be possible, but to my knowledge, does not exist so far.
Finally, there are text types for which, for practical reasons, it is impossible to
acquire samples – for example, pillow talk (which speakers will be unwilling to
share because they consider them too private), religious confessions or lawyer-
client conversations (which speakers are prevented from sharing because they are
privileged), and planning of illegal activities (which speakers will want to keep
secret in order to avoid lengthy prison terms).
Representativity or balancedness also plays a role if we do not aim at

investigating a language as a whole, but are instead interested in a particular


variety. In this case, the corpus will be deliberately skewed so as to contain only

samples of the variety under investigation. However, if we plan to generalize our


results to that variety as a whole, the corpus must be representative of that
variety. is is an important fact that is sometimes overlooked. For example, there
are studies of political rhetoric that are based on speeches by just a handful of
political leaders (e.g. Charteris-Black 2005, 2006) or studies of romantic metaphor
based on a single Shakespeare play (Barcelona Sánchez 1995). While such studies
may be extremely insightful with respect to the language of the individuals under
investigation, their results are unlikely to be generalizable even within the
narrow variety under investigation (political speeches, romantic tragedies). Thus,
they belong to the eld of literary criticism or stylistics much more than to the
eld of linguistics.


Table 2-1: Composition of the BROWN corpus


Genre Subgenre/Topic Area Samples
Non-Fiction Religion Books 7
Periodicals 6
Tracts 4
Skills and Hobbies Books 2
Periodicals 34
Popular Lore Books 23
Periodicals 25
Belles Lettres, Biography, Memoirs, etc. Books 38
Periodicals 37
Miscellaneous Government Documents 24

Foundation Reports 2
Industry Reports 2
College Catalog 1
Industry House organ 1
Learned Natural Sciences 12
Medicine 5
Mathematics 4
Social and Behavioral Sciences 14
Political Science, Law, Education 15
Humanities 18
Technology and Engineering 12
Fiction General Novels 20

Short Stories 9
Mystery and Detective Novels 20
Short Stories 4
Science Fiction Novels 3
Short Stories 3
Adventure and Western Novels 15

Short Stories 14
Romance and Love Story Novels 14
Short Stories 15

Humor Novels 3
Essays, etc. 6
Press Reportage Political 14
Sports 7
Society 3
Spot News 9
Financial 4
Cultural 7
Editorial Institutional 10
Personal 10
Letters to the Editor 7
Reviews (theatre, books, music, dance) 17


Given the problems discussed above, it seems impossible to create a linguistic


corpus meeting the criteria of representativeness and/or balance. In fact, while
there are very well-thought-out approaches to approximating representativeness
(cf. e.g. Biber 1993), it is fair to say that most corpus builders never really try. Let
us see what they do instead.
The first linguistic corpus in our sense was the Brown University Standard
Corpus of Present-Day American English (generally referred to as BROWN). It
contains exclusively edited prose published in the year 1961, so it clearly does not
attempt to be representative of American English in general, but only of a
particular type of written American English. This is legitimate if the goal is to

investigate this particular variety, but if the corpus were meant to represent the
standard language in general (which the corpus creators explicitly deny), it forces
us to accept a very narrow understanding of standard.
The BROWN corpus consists of 500 samples of approximately 2000 words
each, drawn from a number of different text types, as shown in Table 2.1.
The first level of sampling is by genre: there are 286 samples of non-fiction,
126 samples of fiction and 88 samples of press texts. There is no reason to believe
that this corresponds proportionally to the total number of words produced in
these text types in the USA in 1961. There is also no reason to believe that the
distribution corresponds proportionally to the incidence of these text types in the

linguistic experience of a typical speaker. This is true all the more so when we


take into account the subcategories within these genres, which are a mixture of
sub-genres (such as reportage or editorial in the press category or novels and
short stories in the fiction category), and topics (such as Romance, Natural
Science or Sports). Clearly the number of samples included for these categories is

not based on statistics of their proportion in the language as a whole. Intuitively,


there may be a rough correlation in some cases: newspapers publish more

reportage than editorials, people (or at least academics of the type that built the
corpus) read more mystery fiction than science fiction, etc.
The creators of the BROWN corpus are quite open about the fact that their
corpus design is not a representative sample of (written) American English. They
describe the collection procedure as follows:

The selection procedure was in two phases: an initial subjective
classification and decision as to how many samples of each category would
be used, followed by a random selection of the actual samples within each
category. In most categories the holdings of the Brown University Library
and the Providence Athenaeum were treated as the universe from which


the random selections were made. But for certain categories it was
necessary to go beyond these two collections. For the daily press, for
example, the list of American newspapers of which the New York Public
Library keeps microfilm files was used (with the addition of the
Providence Journal). Certain categories of chiefly ephemeral material
necessitated rather arbitrary decisions; some periodical materials in the
categories Skills and Hobbies and Popular Lore were chosen from the
contents of one of the largest second-hand magazine stores in New York
City. (Francis/Kucera 1979)
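The two-phase procedure described in this quote, a set of quotas fixed by judgment followed by random selection within each category, amounts to stratified sampling with subjective strata sizes. Here is a minimal sketch; the category names, quota figures and library holdings are all invented for illustration.

```python
import random

random.seed(42)

# Phase 1: quotas per category, fixed by judgment rather than by any
# statistic of the population (numbers invented for illustration).
quotas = {"press_reportage": 44, "fiction_mystery": 24, "learned_science": 12}

# The "universe" to sample from: hypothetical library holdings, one list
# of text identifiers per category.
holdings = {cat: [f"{cat}_{i:03d}" for i in range(500)] for cat in quotas}

# Phase 2: random selection of the actual samples within each category.
corpus = {cat: random.sample(holdings[cat], n) for cat, n in quotas.items()}

for cat, texts in corpus.items():
    print(cat, len(texts), texts[0])
```

Note that randomness enters only in the second phase; whatever bias the first-phase quotas encode is passed on to the corpus unchanged.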

If anything, the BROWN corpus represents the holdings of the libraries
mentioned, although even this representativeness is limited in two ways. First, by
the unsystematic additions mentioned in the quote, and second, by the sampling
procedure applied.
Although this sampling procedure is explicitly acknowledged to be
subjective by the creators of the BROWN corpus, their description suggests that
their design was guided by a general desire for balance:
The list of main categories and their subdivisions was drawn up at a
conference held at Brown University in February 1963. The participants in

the conference also independently gave their opinions as to the number of


samples there should be in each category. These figures were averaged to
obtain the preliminary set of figures used. A few changes were later made
on the basis of experience gained in making the selections. Finer
subdivision was based on proportional amounts of actual publication

during 1961. (Francis/Kucera 1979)

This procedure combines elements from both interpretations of balance



discussed above. First, it involves the opinions (i.e., intuitions) of a number of


people concerning the proportional relevance of certain subgenres and/or topic
areas. The fact that these opinions were averaged suggests that the corpus
creators wanted to achieve a certain degree of intersubjectivity. This idea is not
completely wrongheaded, although it is doubtful that speakers have reliable
intuitions in this area. In addition, the participants of the conference mentioned
did not exactly constitute a group of typical speakers or a cross-section of the
American English speech community: they consisted of six academics with
backgrounds in linguistics, education and psychology, five men and one woman,
four Americans, one Brit and one Czech, all of them white and middle-aged (the


youngest was 36, the oldest 59). No doubt, a different group of researchers – let
alone a random sample of speakers – following the procedure described would
arrive at a very different corpus design.
Second, the procedure involves an attempt to capture the proportion of text
types in actual publication – this proportion was determined on the basis of the
American Book Publishing Record, a reference work containing publication
information on all books published in the USA in a given year. Whether this is, in
fact, a comprehensive source is unclear, and anyway, it can only be used in the
selection of excerpts from books. Basing the estimation of the proportion of text
types on a dierent source would, again, have yielded a very dierent corpus

design. For example, the copyright registrations for 1961 suggest that the
category of periodicals is severely underrepresented relative to the category of
books – there are roughly the same number of copyright registrations for the two
text types, but there are one-and-a-half times as many excerpts from books as
from periodicals in the BROWN corpus.
Despite these shortcomings, the BROWN corpus was well-received, spawning
a host of corpora of different varieties of English using the same design, for
example, the Lancaster-Oslo/Bergen Corpus (LOB) containing British English
from 1961, the Freiburg Brown (FROWN) and Freiburg LOB (FLOB) corpora of
American and British English respectively from 1991, the Wellington Corpus of
Written New Zealand English, and the Kolhapur Corpus (Indian English). The
success of the BROWN design was partly due to the fact that being able to study
strictly comparable corpora of different varieties is useful regardless of their
design. However, if the design had been widely felt to be completely off-target,
researchers would not have used it as a basis for the significant effort involved in

corpus creation.
More recent corpora at first glance appear to take a more principled approach

to balance. Most importantly, they typically include not just written language, but
also spoken language. However, a closer look reveals that this is the only real
change. For example, the BNC BABY, a four-million-word subset of the 100-
million-word British National Corpus (BNC), includes approximately one million
words each from the registers spoken conversation, written academic language,
written prose fiction and written newspaper language (Table 2.2 shows the exact
design). Obviously, this design does not correspond to the linguistic experience of
a typical speaker, who is unlikely to be exposed to academic writing and whose
exposure to written language is not three times as large as their exposure to
spoken language. The design also does not correspond in any obvious way to the
actual amount of language produced on average in the four categories or the


subcategories of academic and newspaper language. Despite this, the BNC BABY,
and the BNC itself, which is even more drastically skewed towards edited written
language, are extremely successful corpora that are still widely used a quarter-
century after the first release of the BNC.

Table 2-2: Composition of the BNC BABY corpus


Channel Genre Subgenre Topic area Samples Words
Spoken Conversation 30 1017025
Written Academic Humanities/Arts 7 224872
Medicine 2 89821

Nat. Science 6 215549
Politics/Law/Education 6 195836
Soc. Science 7 209645
Technology/Engineering 2 77533
Fiction Prose 25 1010279
Newspapers Nat. Broadsheet Arts 9 36603
Commerce 7 64162
Editorial 1 8821
Miscellaneous 25 121194
Report 3 48190

Science 5 18245
Social 13 34516
Sports 3 36796
Other Arts 3 43687

Commerce 5 89170
Report 7 232739

Science 7 13616
Social 8 94676
Tabloid 1 121252

Even what I would consider the most serious approach to date to creating a
balanced corpus design, the sampling schema of the International Corpus of
English (ICE), is unlikely to be significantly closer to constituting a representative
sample of English language use (see Table 2.3).


Table 2-3: Composition of the ICE corpora


Channel Situation/Genre Samples
Spoken Dialogues Private Face-to-face conversations 90
Phone calls 10
Public Classroom Lessons 20
Broadcast Discussions 20
Broadcast Interviews 10
Parliamentary Debates 10
Legal cross-examinations 10
Business Transactions 10

Monologues Unscripted Spontaneous commentaries 20
Unscripted Speeches 30
Demonstrations 10

Scripted Legal Presentations 10
Broadcast News 20
Broadcast Talks 20
Non-broadcast Talks 10
Written Non-printed Student Writing Student Essays 10
Exam Scripts 10
Letters Social Letters 15


Business Letters 15
Printed Academic writing Humanities 10
Social Sciences 10
Natural Sciences 10

Technology 10
Popular writing Humanities 10

Social Sciences 10
Natural Sciences 10
Technology 10
Reportage Press news reports 20
Instructional writing Administrative Writing 10
Skills/hobbies 10
Persuasive writing Press editorials 10
Creative writing Novels and short stories 10

It puts a stronger emphasis on spoken language – sixty percent of the corpus are
spoken text types, although two thirds of these are public language use, while for


most of us private language use is likely to account for more of our linguistic
experience. It also includes a much broader range of written text types than
previous corpora, including not just edited writing but also student writing and
letters.
However, as before, there is little reason to believe that this design
corresponds closely to a random sample taken from the entirety of language
produced in a particular English speech community.
It is clear, then, that none of these (or any other actually existing) corpora are
representative and/or balanced in the sense discussed at the beginning of this
section.

The impossibility of creating a truly balanced corpus is sometimes used to
argue against corpus linguistics in general – recall the criticism, discussed in the
previous chapter, that linguistic corpora are necessarily skewed. This point of
criticism is widely acknowledged in corpus linguistics, and yet corpus linguists
are not only happy to continue using available corpora, there would also be a
wide agreement that although none of the corpora discussed above are
representative or balanced, the design of the ICE is beer than that of the BNC
BABY, which is in turn beer than that of the BROWN corpus and its ospring.
T
The rationale behind this (presumed) agreement is not based on
representativeness or balance, but on the related but distinct criterion of diversity.

While corpora will always be skewed relative to the overall population of texts
and text types in a speech community, the undesirable effects of this skew can be
minimized by including in the corpus as broad a range of varieties as is realistic,
either in general or in the context of a given research project.
Unless language structure and language use are infinitely variable (which, at

least at a given point in time, they are not), increasing the diversity of the sample
will increase representativeness even if the corpus design is not strictly speaking

balanced. It seems important to acknowledge that this does not mean that
diversity and balance are the same thing, but given that balanced corpora are
practically (and perhaps theoretically) impossible to create, diversity is a
workable and justifiable proxy.

2.1.3 Size
Like diversity, corpus size is also assumed, more or less explicitly, to contribute to
representativeness (e.g. by McEnery/Wilson 1991). It is difficult to say whether
the relationship between size and representativeness is strictly speaking relevant
in the context of corpora: sample size does correlate with representativeness to


some extent in that an exhaustive sample, i.e., one that encompasses the entire
population, will necessarily be representative and that this representativeness
does not drop to zero immediately if we decrease sample size. However, it does
drop rather rapidly – if we exclude one percent of the totality of texts produced in
a given language, entire text types may already be missing. For example, the
Library of Congress holds around 38 million print materials, roughly half of them
in English. A search for ‘cooking’ in the main catalogue yields 7,638 items that
presumably include all cookbooks in the collection. This means that cookbooks
make up no more than 0.04 percent of printed English (7,638/19,000,000 =
0.000402). Thus, they would be easily lost in their entirety once the sample size

drops substantially below the size of the population as a whole.
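The arithmetic can be spelled out, together with its consequence for sampling: at the rate just quoted, a 500-text sample of the kind used for the BROWN corpus discussed above would contain, on average, a fifth of a cookbook, so most such samples would contain none at all. (The catalogue figures are those quoted in the text.)

```python
# Figures quoted in the text: 7,638 'cooking' items among roughly
# 19 million English print materials in the Library of Congress.
cookbooks = 7_638
english_print_items = 19_000_000

proportion = cookbooks / english_print_items
print(f"{proportion:.6f}")         # 0.000402

# Expected number of cookbook texts in a 500-sample corpus (BROWN-sized):
print(round(500 * proportion, 2))  # 0.2
```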
Since even the largest corpora comprise only tiny fractions of the language as
a whole, size certainly does not guarantee balance in any way – even the text
types that are theoretically accessible will be represented disproportionally and
many will be missing altogether. And with these text types missing, at least some
linguistic phenomena will disappear from the corpus, such as the expression
[__ NPLIQUID PP[to the/a boil]], which, as discussed in Chapter 1, is exclusive to
cookbooks. However, assuming, again, that language structure and use are not
infinitely variable, size will correlate with the representativeness of particular
phenomena at least to some extent.

There is no principled answer to the question how large a linguistic corpus
must be, except, perhaps, an honest ‘It is impossible to say’ (Renouf 1987: 130).
However, there are two practical answers. The more modest answer is that it
must be large enough to contain a sample of instances of the phenomenon under
investigation that is large enough for analysis (we will discuss what this means in
Chapters 6 and 7). The less modest answer is that it must be large enough to
contain sufficiently large samples of every grammatical structure, vocabulary

item etc. Given that an ever increasing number of texts from a broad range of text
types is becoming accessible via the web, the second answer may not actually be
as immodest as it sounds – it is now possible to create corpora that contain
billions of words, making the BNC look tiny in comparison. Still, for most issues
that corpus linguists are likely to want to investigate, a size of one million to one-
hundred million words is sufficient. Let us take this broad range as characterizing
a linguistic corpus for practical purposes.
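The more modest answer can be given a rough quantitative form: the expected number of hits grows linearly with corpus size, and under a simple Poisson idealization one can even estimate the chance of finding no hits at all. The rate used below is invented, and the Poisson model's independence assumption is a strong one (occurrences cluster in real texts), so this is only a back-of-the-envelope sketch.

```python
import math

def expected_hits(per_million, corpus_size):
    """Expected hits for a phenomenon with a given per-million-word rate."""
    return per_million * corpus_size / 1_000_000

def p_no_hits(per_million, corpus_size):
    """Chance of zero hits, under a Poisson idealization of token frequency."""
    return math.exp(-expected_hits(per_million, corpus_size))

# A rare pattern at an invented rate of 0.1 per million words:
for size in (1_000_000, 100_000_000):
    print(size, round(expected_hits(0.1, size), 2), round(p_no_hits(0.1, size), 4))
```

At that rate, a one-million-word corpus will most likely contain no hits at all, while a hundred-million-word corpus can be expected to yield a small but usable sample.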

2.2 Corpus linguistics


Having characterized the linguistic corpus in its ideal form, we can now


reformulate the definition of corpus linguistics cited at the beginning of this


chapter as follows:

Definition (First attempt)


Corpus linguistics is the investigation of linguistic phenomena on the basis
of linguistic corpora.

This definition is more specific with respect to the data used in corpus linguistics
and will exclude discourse analysis, text linguistics, variationist sociolinguistics
and other fields working with authentic language data (whether such a strict
exclusion is a good thing is a question we will briefly return to at the end of this
chapter).
However, the definition says nothing about the way in which these data are to
be investigated. Crucially, it would cover a procedure in which the linguistic
corpus essentially serves as a giant citation file that the researcher scours, more
or less systematically, for examples of a given linguistic phenomenon.
This procedure of basing linguistic analyses on citations has a long tradition in
descriptive English linguistics, going back at least to Otto Jespersen's seven-
volume Modern English Grammar on Historical Principles (Jespersen 1909–1940).
It played a particularly important role in the context of dictionary making. The

Oxford English Dictionary (Simpson and Weiner 1989) is the first and probably
still the most famous example of a citation-based dictionary. For the first two
editions, it relied on citations sent in by volunteers (cf. Winchester 2003 for a
popular account). In its current third edition, its editors actively search corpora
and other text collections (including the Google Books index) for citations.

A fairly stringent implementation of this method is described in the following


passage from the FAQ web page of the Merriam-Webster Online Dictionary:

Each day most Merriam-Webster editors devote an hour or two to reading a


cross section of published material, including books, newspapers,
magazines, and electronic publications; in our office this activity is called
‘reading and marking’. The editors scour the texts in search of […]
anything that might help in deciding if a word belongs in the dictionary,
understanding what it means, and determining typical usage. Any word of
interest is marked, along with surrounding context that offers insight into
its form and use. […] The marked passages are then input into a computer
system and stored both in machine-readable form and on 3 × 5 slips of
paper to create citations. (Merriam-Webster 2014)


The cross-section of published material referred to in this passage is heavily


skewed towards particular varieties of formal written language. Given that people
will typically consult dictionaries to look up unfamiliar words they encounter in
writing, this may be a reasonable choice to make, although it should be pointed
out that modern dictionaries are more commonly based on linguistic corpora of
the type described in the preceding section.
But let us assume, for the moment, that the cross-section of published material
read by the editors of Merriam Webster's dictionary counts as a linguistic corpus.
Under this assumption, the procedure described here clearly falls under our
definition of corpus linguistics. Interestingly, the publishers of Merriam Webster's

even refer to their procedure as ‘study[ing] the language as it's used’ (Merriam-
Webster 2014), a characterization that is very close to McEnery and Wilson's
definition of corpus linguistics as the study of language based on examples of
real life language use.
There are legitimate reasons for collecting citations. It may serve to show that
a particular linguistic phenomenon existed at a particular point in time; one
reason for basing the OED on citations was and is to identify the first recorded
use of each word. It may also serve to show that a particular linguistic
phenomenon exists at all, for example, if that phenomenon is considered
ungrammatical (as in the case of [it doesn't matter the N], discussed in the

previous chapter).
However, there is an important reason why the method of collecting citations
cannot be regarded as an instance of the scientific method beyond the purpose of
proving existence, and hence should not be taken to represent corpus linguistics
proper. While the procedure described by the makers of Merriam Webster's

sounds relatively methodical and organized, it is obvious that the editors will be
guided in their selection by many factors that would be hard to control even if

they were fully aware of them, such as their personal interests, their sense of
esthetics, the intensity with which they have thought about some uses of a word
as opposed to others, etc.
This can result in a significant bias in the resulting data base even if the
method is applied systematically, a bias that will be reflected in the results of the
linguistic analysis, i.e. the definitions and example sentences in the dictionary. To
pick a random example: The ‘word of the day’ on the Merriam-Webster
dictionary's website at the time of writing is implacable, defined as ‘not capable
of being appeased, significantly changed, or mitigated’ (Merriam-Webster, s.v.
implacable). The entry gives two examples for the use of this word (cf. [2.4a, b]),
and the word-of-the-day posting gives two more (shown in [2.4c, d] in


abbreviated form):

(2.4a) He has an implacable hatred for his political opponents.


(2.4b) an implacable judge who knew in his bones that the cover-up extended to
the highest levels of government
(2.4c) the implacable laws of the universe are of interest to me.
(2.4d) Through his audacity, his vision, and his implacable faith in his future
success

Except for hatred, the nouns modified by implacable in these examples are not
at all representative of actual usage. The lemmas most frequently modified by
implacable in the 450-million-word Corpus of Contemporary American English (COCA),
are enemy and foe, followed at some distance by force, hostility, opposition, will,
and the hatred found in (2.4a). Thus, it seems that implacable is used most
frequently in contexts describing adversarial human relationships, while the
examples that the editors of the Merriam-Webster's selected as typical deal mostly
with adversarial abstract forces. Perhaps this distortion is due to the materials the
editors searched, perhaps the examples struck the editors as citation-worthy
T
precisely because they are slightly unusual, or because they appealed to them
esthetically (they all have a certain kind of rhetorical flourish).8
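Frequency evidence of the kind just cited from COCA is, in principle, simple to compute: extract the nouns the adjective modifies and count them. The sketch below does this for a handful of invented example lines (the actual COCA counts are not reproduced here); a serious version would operate on lemmatized, part-of-speech-tagged text rather than simply taking the word after the adjective.

```python
import re
from collections import Counter

# Invented mini-corpus lines; real evidence would come from a corpus
# such as COCA, after lemmatization and POS tagging.
lines = [
    "an implacable enemy of reform",
    "his implacable foe returned",
    "implacable opposition to the plan",
    "an implacable enemy once more",
    "her implacable hatred of all delay",
]

# Crude collocate extraction: count the word directly following the
# adjective (a stand-in for proper identification of the modified noun).
pattern = re.compile(r"\bimplacable (\w+)")
counts = Counter(m.group(1) for line in lines for m in pattern.finditer(line))

print(counts.most_common(3))
```

Even this toy version makes the relevant point: rankings of modified nouns come from exhaustive counting over a corpus, not from whichever examples happen to strike an editor as citation-worthy.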

Contrast the performance of the citation-based method with the more strictly
corpus-based method used by the Longman Dictionary of Contemporary English, which
illustrates the adjective implacable with the representative examples in (2.5a,b):

(2.5a) implacable enemies



(2.5b) The government faces implacable opposition on the issue of nuclear


waste. (LDCE, s.v. implacable)
D

Obviously, the method of citation collection becomes worse the more
opportunistically the examples are collected: the researcher will not only focus on
examples that they happen to notice, they may also selectively focus on examples
that they intuitively deem particularly relevant or representative. In the worst
case, they will consciously perform an introspection-based analysis of a

8 Incidentally, this type of distortion means that it is dangerous to base analyses on
examples included in citation-based dictionaries; but cf. Mair (2004), who shows
convincingly that, given an appropriately constrained research design, the dangers of
an unsystematically collected citation base are

Anatol Stefanowitsch

phenomenon and then scour the corpus for examples that support this analysis;
we could call this method "corpus-illustrated linguistics" (cf. Tummers et al. 2005).
In the case of spoken examples that are overheard and then recorded after the
fact, there is an additional problem: researchers will write down what they
thought they heard, not what they actually heard.9
The use of corpus examples for illustrative purposes has become somewhat
fashionable among researchers who largely depend on introspective data
otherwise. While it is probably an improvement over the practice of simply
inventing data, it has a fundamental weakness: it does not ensure that the data
selected by the researcher are actually representative of the phenomenon under
investigation. In other words, corpus-illustrated linguistics simply replaces
introspectively invented data with introspectively selected data and thus inherits
the fallibility of the introspective method discussed in the previous chapter.
Since overcoming the fallibility of introspective data is one of the central
motivations for using corpora in the first place, the definition must be revised as
follows (a similar definition is found in Biber, Conrad and Reppen 1998: 4):

Definition (Second attempt)
Corpus linguistics is the exhaustive and systematic investigation of
linguistic phenomena on the basis of linguistic corpora.

Here, "exhaustive" roughly means taking into account all examples of the
phenomenon in question and "systematic" means looking at each example based
on the same criteria. As was mentioned in the preceding section, linguistic
corpora are currently typically between one million and one-hundred million
words in size (and growing). As a consequence, it is almost impossible to
exhaustively extract data manually, and this has led to a widespread use of
computers and corpus-linguistic software applications in the field.10

9 As anyone who has ever tried to transcribe spoken data knows, this implicit distortion of data
is a problem even where the data is available as a recording: transcribers of spoken
data are forever struggling with it. Just record a minute of spoken language and try to
transcribe it exactly; you will be surprised how frequently you transcribe something
that is similar, but by no means identical, to what is on the tape.
10 Note, however, that sometimes manual extraction is the only option; cf. Colleman
(2006, 2009), who manually searched a 1-million-word corpus of Dutch in order to
extract all ditransitive clauses. To give you a rough idea of the workload involved in
this type of manual extraction, it took Colleman ten full work days to go through the
entire corpus (Colleman, pers. comm.).


In fact, corpus technology has become so central that it is sometimes seen as a
defining aspect of corpus linguistics. One corpus linguistics textbook opens with
the sentence "The main part of this book consists of a series of case studies which
involve the use of corpora and corpus analysis technology" (Partington 1998: 1),
and another observes that "[c]orpus linguistics is [...] now inextricably linked to
the computer" (Kennedy 1998: 5); a third textbook explicitly includes the
"extensive use of computers for analysis, using both automatic and interactive
techniques" as one of four defining criteria of corpus linguistics (Biber, Conrad
and Reppen 1998: 4). This perspective is summarized in the following definition:
Definition (Third attempt, Version 1)
Corpus linguistics is the investigation of linguistic phenomena on the basis
of computer-readable linguistic corpora using corpus analysis software.
However, this is not a useful approach. It is true that there are scientific
disciplines that are so heavily dependent upon a particular technology that they
could not exist without it: for example, radio astronomy (which requires a radio
telescope) or radiology (which requires an x-ray machine). However, even in such
cases we would hardly want to claim that the technology in question can serve as
a defining criterion: one can use the same technology in ways that do not qualify as
belonging to the respective discipline. For example, a spy might use a radio
telescope to intercept enemy transmissions, and an engineer may use an x-ray
machine to detect fractures in a steel girder, but that does not make the spy a
radio astronomer or the engineer a radiologist.
Clearly, even a discipline that relies crucially on a particular technology
cannot be defined by the technology itself but by the uses to which it puts that
technology. If anything, we must thus replace the reference to corpus analysis
software by a reference to what that software typically does.

There is a wide range of software packages for corpus analysis that vary in
their capabilities, but most of them have three functions in common:

i. they produce KWIC (Key Word In Context) concordances, i.e. they display
all occurrences of a particular search pattern (often a word form),
together with their immediate context, defined in terms of a particular
number of words or characters to the left and the right (see Table 2-4 for a
KWIC concordance of the noun time);
ii. they identify collocates of a given search word, i.e. words that occur in a
certain position relative to it; these words are typically listed in the order
of frequency with which they occur in the position in question (see Table
2-5 for a list of collocates of the noun time in a span of three words to the
left and right);
iii. they produce frequency lists, i.e. lists of all word forms in a given corpus,
listed in the order of their frequency of occurrence (see Table 2-6 for the
forty most frequent strings (word forms and punctuation marks) in the
BNC BABY).

Table 2-4: KWIC concordance (random sample) of the noun time (BNC BABY)
st sight to take an unconscionably long [time] . A common fallacy is the attempt to as
s with Arsenal . Graham reckons it 's [time] his side went gunning for trophies agai
ough I did n't , he he , I did n't have [time] to ask him what the hell he 'd been up
was really impressed . I think the last [time] I came I had erm a chicken thing . A ch
away . No Ann would have him the whole [time] . Yeah well [unclear] Your mum would n'
arch 1921 . He had been unwell for some [time] and had now gone into a state of collap
te the planned population in five years [time] . So what are you gon na multiply that
tempt to make a coding time and content [time] the same ) . 10 Conclusion I have stres
hearer and for the analyst most of the [time] . Most of the time , things will indeed
in good faith and was reasonable at the [time] it was prepared . The bank gave conside
he had something on his mind the whole [time] . Perhaps he was thinking of his wr
ctices of commercial architects because [time] and time again they come up with the go
nyway . From then on Augusto , at the [time] an economist with the World Bank , and
This may be my last free night for some [time] . I do n't think they 'd be in any
two reasons . Firstly , the passage of [time] provides more and more experience and t
go . The horse was racing for the first [time] for Epsom trainer Roger Ingram having p
imes better and you would do it all the [time] , right Mm I mean basically you say the
ther , we 'll see you in a fortnight 's [time] . Perhaps then , said Viola , who
granny does it and she 's got loads of [time] . She sits there and does them twice as
pig , fattened up in woods in half the [time] and costing well under an eighth of the
ike to be [unclear] like that , all the [time] ! Yeah . I said they wo n't bloody lock
es in various biological groups through [time] are most usefully analysed in terms of
? Er do you want your dinner at dinner [time] or [unclear] No I do n't know what I 'v
But they always around about Christmas [time] . My mam reckons that the You can put t
inversion , i.e. of one descriptor at a [time] , but they are generally provided and e

Let us briefly look at why these functions might be useful in corpus linguistic
research (we will discuss them in much more detail in later chapters).
A concordance provides a quick overview of the typical usage of a particular
word (or more complex linguistic expression). The hits are presented in random
order in Table 2-4, but corpus-linguistic software packages typically allow the
researcher to sort concordances in various ways, for example, by the first word to
the left or to the right; this will give us an even better idea as to what the typical
usage contexts for the expression under investigation are.
Collocate lists are a useful way of summarizing the contexts of a linguistic
expression. For example, the collocate list in the column marked L1 in Table 2-5
will show us at a glance what words typically directly precede time. The
determiners the and this are presumably due to the fact that we are dealing with a
noun, but the adjectives first, same, long, some, last, every and next are related
specifically to the meaning of the noun time; the high frequency of the
prepositions at, by, for and in in the column marked L2 (two words to the left of
the node word time) not only gives us additional information about the meaning
and phraseology associated with the word time, it also tells us that time
frequently occurs in prepositional phrases in general.
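Position-sensitive collocate lists of this kind can likewise be computed with a few lines of code. Again, this is only a sketch; the function name and the example phrase are invented for illustration:

```python
from collections import Counter

def collocates(tokens, node, span=3):
    """Count collocates of the node word by position: L1 is the word
    directly to the left, R1 directly to the right, and so on."""
    counts = {f"{side}{d}": Counter()
              for side in "LR" for d in range(1, span + 1)}
    for i, token in enumerate(tokens):
        if token.lower() != node:
            continue
        for d in range(1, span + 1):
            if i - d >= 0:                       # d words to the left
                counts[f"L{d}"][tokens[i - d].lower()] += 1
            if i + d < len(tokens):              # d words to the right
                counts[f"R{d}"][tokens[i + d].lower()] += 1
    return counts

tokens = "at the same time and for a long time the clock".split()
c = collocates(tokens, "time")
print(c["L1"].most_common())  # words directly preceding "time"
```

Sorting each positional counter by frequency yields exactly the column layout of Table 2-5.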

Table 2-5: Collocates of time in a span of three words to the left and to the right
L3            L2            L1             R1            R2            R3
for    335    the    851    the   1032    .      950    the    427    .      294
.      322    at     572    this   380    ,      661           168    ,      255
at     292    a      361    first  320    to     351    i      141    the    212
,      227    all    226    of     242    of     258    .      137    to     120
a      170    .      196    same   240    and    223    and    118    a      118
the    130    by     192    a      239    for    190    you    104    it     112
it     121    ,      162    long   224    in     184    it     102    was    107
to     100    of     154    some   200    i      177    a       96    and     92
and     89    for    148    last   180    he     136    he      92    i       86
in      89    in      93    every  134    ?      122    ,       91    you     76
was     85    in      93    in     113    you    120    was     87    in      75
is      78    's      68    that   111    when   118    had     80            71
's      68    and     65    what   108    the     90    but     70    of      64
              have    59    next    83    we      88    to      69    's      59
              that    58    any     72    is      85            64    is      59
              had     55    one     65    as      78    she     58    ?       58
                      52    's      64    it      70    they    57    he      58
                            no      63    they    70    that    56    had     53
                            from    57    she     69                  in      55
                                          that    64
                                          was     50

Finally, frequency lists provide useful information about the distribution of words
(and, in the case of written language, punctuation marks) in a particular corpus.
This can be useful, for example, in comparing the structural properties or typical
contents of different text types (see further Chapter 12). It is also useful in
assessing which collocates of a particular word are frequent only because they
are frequent in the corpus in general, and which collocates actually tell us
something interesting about a particular word.
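A frequency list is the simplest of the three functions to implement; in Python, for instance, it amounts to little more than a single call to a counting routine (a sketch with an invented toy text; note that, as in Table 2-6, punctuation tokens are counted like any other token):

```python
from collections import Counter

def frequency_list(tokens):
    """List all word forms (and punctuation tokens) in a corpus,
    ordered by their frequency of occurrence."""
    return Counter(token.lower() for token in tokens).most_common()

tokens = "the cat sat on the mat . the mat was flat .".split()
print(frequency_list(tokens))
```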


Table 2-6: The forty most frequent tokens in the BNC BABY


. 226990 that 51976 on 29258 do 20433
, 212502 you 49346 n't 27672 at 20164
the 211148 's 48063 be 24865 not 19983
of 100874 is 40508 with 24533 had 19453
to 94772 ? 38422 as 24171 we 18834
and 94469 was 37087 have 23093 are 18474
a 88277 36831 [unclear] 21879 this 18393
in 69121 he 36217 but 21209 there 17585
it 60647 34994 they 21177 his 17447
i 59827 for 31784 she 21121 by 17201

Note, for example, that the collocate frequency lists on the right side of the word
time are more similar to the general frequency list than those on the left side,
suggesting that the noun time has a stronger influence on the words preceding it
than on the words following it (see further Chapter 9).
Given the widespread implementation of these three techniques, they are
obviously central to corpus-linguistic research, so we might amend the definition
above as follows (a similar definition is implied by Kennedy 1998: 244ff.):

Definition (Third attempt, Version 2)
Corpus linguistics is the investigation of linguistic phenomena on the basis
of concordances, collocations, and frequency lists.

Two problems remain with this definition. The first problem is that the
requirements of systematicity and exhaustiveness that were introduced in the
second definition are missing. This can be remedied by combining the second and
third definition as follows:

Definition (Combined second and third attempt)
Corpus linguistics is the exhaustive and systematic investigation of
linguistic phenomena on the basis of linguistic corpora using concordances,
collocations, and frequency lists.

The second problem is that including a list of specific techniques in the definition
of a discipline seems undesirable, no matter how central these techniques are.
First, such a list will necessarily be finite and will thus limit the imagination of
future researchers. Second, and more importantly, it presents the techniques in
question as an arbitrary set, while it would clearly be desirable to characterize
them in terms that capture the reasons for their central role in the discipline.
What concordances, collocate lists and frequency lists have in common is that
they are all ways of studying the distribution of linguistic elements in a corpus.


Thus, we could define corpus linguistics as follows:

Definition (Fourth attempt)
Corpus linguistics is the exhaustive and systematic investigation of the
distribution of linguistic phenomena in a linguistic corpus.

On the one hand, this definition subsumes the previous two definitions: if we
assume that corpus linguistics is essentially the study of the distribution of
linguistic phenomena in a linguistic corpus, we immediately understand the
central role of the techniques described above: (i) KWIC concordances are
essentially a way of displaying the distribution of a lexical item across different
syntagmatic contexts; (ii) collocations summarize the distribution of lexical items
with respect to other lexical items in quantitative terms; and (iii) frequency lists
summarize the overall quantitative distribution of lexical items in a given corpus.
On the other hand, the definition is not limited to these techniques but can be
applied open-endedly on all levels of language and to all kinds of distributions.
This definition is close to the understanding of corpus linguistics that this book
will advance, but it must still be narrowed down somewhat.
First, it must not be misunderstood to suggest that studying the distribution of
linguistic phenomena is an end in itself in corpus linguistics. Fillmore (1992: 35)
presents a caricature of a corpus linguist who is busy determining the relative
frequencies of the eleven parts of speech as the first word of a sentence versus as
the second word of a sentence. Of course, there is nothing intrinsically wrong
with such a research project: when large electronically readable corpora and the
computing power to access them became available in the late 1950s, linguists
became aware of a vast range of stochastic regularities of natural languages that
had previously been difficult or impossible to detect and that are certainly worthy
of study. Narrowing our definition to this stochastic perspective would give us
the following:

Definition (Fourth attempt, stochastic interpretation)
Corpus linguistics is the investigation of the statistical properties of
language.

However, while the statistical properties of language are a worthwhile and
actively researched area, they are not the primary object of research in corpus
linguistics. Instead, the definition just given captures an important aspect of a
discipline referred to as statistical or stochastic natural language processing (a good
introduction to this field can be found in Manning and Schütze 2000, especially
the overview in Chapter 1).
Stochastic natural language processing and corpus linguistics are closely
related fields that have frequently profited from each other (see, e.g., Kennedy
1998: 5); it is understandable, therefore, that they are sometimes conflated (see,
e.g., Sebba and Fligelstone 1994: 769). However, the two disciplines are best
regarded as overlapping but separate research programs with very different
research interests.
Corpus linguistics, as its name suggests, is part of linguistics and thus focuses
on linguistic research questions that may include, but are in no way limited to, the
stochastic properties of language. Adding this perspective to our definition, we
get the following:
Definition (Fourth attempt, linguistic interpretation)
Corpus linguistics is the investigation of linguistic research questions based
on the systematic and exhaustive analysis of the distribution of linguistic
phenomena in a linguistic corpus.
To arrive at our final definition, it remains for us to specify what it actually
means to investigate the distribution of linguistic phenomena. Much of the next
chapter will be concerned with this issue, so the following short discussion will
only set the stage for the more detailed discussion to follow.
Let us assume we have noticed that English speakers use two different words
for the front window of cars: some say windscreen, some say windshield. It is a
genuinely linguistic question what factor or factors explain this variation. In line
with the definition above, we would now try to determine their distribution in a
corpus. Since the words are not very frequent, assume that we combine four
corpora that we happen to have available, namely the BROWN, FROWN, LOB
and FLOB corpora mentioned in Section 2.1.2 above. We find that windscreen
occurs 12 times and windshield occurs 13 times.
That the two words have roughly the same frequency in our corpus, while
undeniably a fact about their distribution, is not very enlightening. If our
combined corpus were balanced, we could at least conclude that neither of the
two words is dominant.
Looking at the grammatical contexts also does not tell us much: both words
are almost always preceded by the definite article the, sometimes by a possessive
pronoun or the indefinite article a. Both words occur frequently in the PP
[through NP], sometimes preceded by a verb of seeing, which is not surprising


given that they refer to a type of window. The distributional fact that the two
words occur in the same types of grammatical contexts is more enlightening: it
suggests that we are dealing with exact synonyms. However, it does not provide
an answer to the question why there should be two words for the same thing.
It is only when we look at the distribution in the four corpora that we find
that answer: windscreen occurs exclusively in the LOB and FLOB corpora, while
windshield occurs exclusively in the BROWN and FROWN corpora. The first two
are corpora of British English, the second two are corpora of American English;
thus, we can conclude that we are probably dealing with dialectal variation. In
other words: we had to investigate differences in the distribution of linguistic
phenomena under different conditions in order to answer our research question.
Taking this into account, we can now posit the following final definition of
corpus linguistics:

Definition (Final Version)
Corpus linguistics is the investigation of linguistic research questions that
have been framed in terms of the conditional distribution of linguistic
phenomena in a linguistic corpus.
Note that in the simple example presented here, the conditional distribution is a
matter of all-or-nothing: all instances of windscreen occur in the British part of
the corpus and all instances of windshield occur in the American part. There is a
categorical difference between the two words with respect to the conditions
under which they occur (at least in our corpus). In contrast, the two words do not
differ at all with respect to the grammatical contexts in which they occur.
These two distributions are, of course, only the limiting cases: two (or more)
words (or other linguistic phenomena) may also show relative differences in their
distribution across conditions. For example, the words railway and railroad show
clear differences in their distribution across the combined corpus used above:
railway occurs 118 times in the British part compared to only 16 times in the
American part, while railroad occurs 96 times in the American part but only 3
times in the British part. Clearly, this tells us something very similar about the
words in question: they also seem to be dialectal variants, even though the
difference between the dialects is gradual rather than absolute in this case. Given
that very little is absolute when it comes to human behavior, it will come as no
surprise that gradual differences in distribution will turn out to be much more
common in language (and thus, more important to linguistic research) than
absolute differences, and later chapters will discuss this issue at length. For now,


note that both kinds of conditional distributions are covered by the final version
of our definition.
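In computational terms, checking for such gradual differences amounts to cross-tabulating the observations and comparing proportions within each condition. A minimal sketch (the counts below are simply the railway/railroad figures reported above, not freshly retrieved from the corpora):

```python
# Cross table: (variety, word) -> frequency, using the counts given above.
counts = {
    ("British", "railway"): 118, ("British", "railroad"): 3,
    ("American", "railway"): 16, ("American", "railroad"): 96,
}

# Conditional distribution: the share of each word within each variety.
for variety in ("British", "American"):
    total = sum(n for (v, _), n in counts.items() if v == variety)
    for word in ("railway", "railroad"):
        print(f"{variety:>9} {word:<9} {counts[(variety, word)] / total:.2f}")
```

The resulting proportions (0.98 vs. 0.02 within British English, 0.14 vs. 0.86 within American English) show a gradual rather than categorical difference, in contrast to the all-or-nothing windscreen/windshield case, where the corresponding proportions would be 1.00 and 0.00.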
Note also that many of the aspects that were proposed as defining criteria in
previous definitions need no longer be included once we adopt our final version,
since they are presupposed by this definition: conditional distributions (whether
they differ in relative or absolute terms) are only meaningful if they are based on
the complete data base (hence the criterion of exhaustivity); conditional
distributions can only be assessed if the data are carefully categorized according
to the relevant conditions (hence the criterion of systematicity); distributions
(especially relative ones) are more reliable if they are based on a large data set
(hence the preference for large electronically stored corpora that are accessed via
appropriate software applications); and often, but not always, the standard
procedures for accessing corpora (concordances, collocate lists, frequency lists) are
a natural step towards identifying the relevant distributions in the first place.
However, these preconditions are not self-serving, and hence they cannot
themselves form the defining basis of a methodological framework: they are only
motivated by the definition just given.
In conclusion, note that our final definition does distinguish corpus linguistics
from other types of observational methods, such as text linguistics, discourse
analysis, variationist sociolinguistics, etc.; however, it does so in a way that allows
us to recognize the overlaps between these methods, which seems highly
desirable given that these methods are fundamentally based on the same
assumptions as to how language can and should be studied (namely on the basis
of authentic instances of language use), and that they are likely to face similar
methodological problems.
Further Reading

Wynne (2005) is an introduction to corpus development, Xiao (2008) is an
overview of well-known English corpora.

Wynne, Martin (ed.). 2005. Developing linguistic corpora: a guide to good
practice. Oxford & Oakville: Oxbow Books. Available online:
http://www.ahds.ac.uk/creating/guides/linguistic-corpora/index.htm.
Xiao, Richard. 2008. Well-known and influential corpora. In Anke
Lüdeling & Merja Kytö (eds.), Corpus Linguistics, vol. 1, 383-483. Berlin &
New York: Walter de Gruyter.
http://www.lancaster.ac.uk/staff/xiaoz/papers/corpus%20survey.htm

3 Corpus Linguistics as a Scientific Method

Having defined corpus linguistics as the investigation of linguistic research
questions that have been framed in terms of the conditional distribution of
linguistic phenomena in a linguistic corpus, the remainder of this book will
essentially spell out this definition in detail. In this chapter, we will take a closer
look at the notion of research (in Section 3.1), and discuss what it means to frame a
research question in terms of a conditional distribution (in Sections 3.2 and 3.3).

3.1 The Scientific Research Cycle
Research may refer to a variety of activities aimed at uncovering facts of some
kind or another, but since linguistics is the scientific study of language, it is to be
understood in the sense of scientific research here. If we understand corpus
linguistics as a scientific method, corpus-linguistic research projects must meet
certain standards expected of scientific research in general. This entails a certain
terminology and a certain way of thinking about research designs that may be
unfamiliar to at least some linguists and that is often not made explicit even in
the work of those linguists who work empirically in the scientific sense.

Research in the natural and social sciences broadly follows a cyclic process of
hypothesis formulation and testing, as described by the Austrian-British
philosopher of science Karl Popper (a readable overview is found in Popper 1963,
esp. 33ff.). Put simply, the researcher first identifies and delineates some object of
research (some phenomenon or set of related phenomena in the natural or social
world). Second, the researcher formulates a hypothesis about this object of
research, typically by postulating a relationship between at least two phenomena
(at least in the classic Popperian paradigm, see further below). Third, the
researcher specifies procedures for the identification and classification of the
relevant phenomena in such a way that the hypothesis can be tested (this is called
operationalization), and formulates predictions. Finally, these predictions are

subjected to a test. If the test is negative (i.e., does not match the prediction), the
hypothesis must be modified or discarded and replaced by a new hypothesis; if
the test is positive (i.e., does match the prediction), it may lead to refinements or
additional hypotheses. This research cycle is shown schematically in Figure 3-1.

Figure 3-1. The research cycle
[Diagram: Object of Research -> Hypothesis -> (operationalization and formulation of predictions) -> Test -> (formulation of new/modified/additional hypotheses) -> back to Hypothesis]
Any research project must, of course, begin with the choice of some object of
research: a fragment of reality that we wish to investigate. This fragment of
reality must then be described in terms of constructs, i.e. theoretical concepts
corresponding to the entities involved. These theoretical concepts will (hopefully)
be provided in part by whichever theoretical framework we have chosen to work
in. However, a framework will typically not provide fully explicated constructs
for the description of every aspect of the object of research; this is certainly the
case in linguistics, which is still very much an evolving discipline. In this case, it
is necessary to provide such an explication.
In corpus linguistics, the object of research will usually involve one or more
aspects of language structure or language use, but it may also involve aspects of
our psychological, social or cultural reality that are merely reflected in language, a
point that we will return to in the second part of this book. In addition, the object
of research may involve one or more aspects of extralinguistic reality, most
importantly demographic properties of the speaker(s) such as geographical
location, sex, age, ethnicity, social status, financial background, education,
knowledge of other languages, etc.
Once the object of research is properly delineated, the researcher identifies a
research issue, i.e. the question that they will attempt to answer. This always

involves a relationship between at least two theoretical constructs: one that is to
be explained, and one that is to provide (part of) the explanation. In other words,
the research issue must be formulated in terms of an explicit relationship
between two (or more) constructs.
There are, broadly speaking, two ways to state a research issue. In a pure
Popperian framework, the research issue is stated in terms of a specific hypothesis
about the way in which two constructs X and Y are related, such as "all X are Y"
or "X leads to Y". The task of the researcher is then to determine whether this
hypothesis is true or false (strictly speaking, their task is to attempt to disprove
the hypothesis, a point we will return to in Section 3.3 below). This approach,
sometimes referred to as the deductive approach, is generally seen as the standard
way of conducting scientific research (although it needs to be restated in a way
that allows gradual distributions, see, again, Section 3.3 below).
Alternatively, the research issue can be stated in terms of a general research
question of the type "What (if any) is the relationship between X and Y?" or even
just "What relationships exist between the constructs in my data?". This approach,
referred to as an inductive or exploratory approach, is not traditionally accepted as
scientific, but it is becoming increasingly common, especially where
researchers have access to large amounts of data and tools that can search and
correlate these data in a reasonable time frame ("big data"). In corpus linguistics,
this is certainly the case, and thus both the deductive and the inductive approach
have their place in the discipline. In this chapter, we will assume a deductive
perspective, but we will return to the issue of inductive/exploratory research in
the second part of this book, in particular in Chapter 9.
3.2 Stating hypotheses

As mentioned above, a theoretical construct can play two different roles in the
statement of a research issue (regardless of whether it is stated as a hypothesis or
as a question). It can represent the phenomenon that we want to explain (the
explicandum); in linguistics, this is typically some aspect of language use or
language structure. Or it can be the factor in terms of which we want to explain
that phenomenon (the explicans); in linguistics, this may be some other aspect of
language use (such as the presence or absence of a particular linguistic structure,
a particular position in a discourse, etc.), or some language-external factor (such
as the speaker's sex or age, the relationship between speaker and hearer, etc.).
In empirical research, the explicandum is referred to as the dependent variable
and the explicans as the independent variable (note that these terms are actually


quite transparent: if we want to explain X in terms of Y, then X must somehow be
dependent on Y). Each of the variables must have at least two possible values. In
the simplest case, these values could be the presence vs. the absence of instances
of the construct; in more complex cases, the values would correspond to different
classes of instances of the construct.
For example, the variable INTERRUPTION has two values, PRESENCE (an interruption
occurs) vs. ABSENCE (no interruption occurs). The variable SEX in our everyday
understanding also has two values (MALE vs. FEMALE). In contrast, the number of
values of the variable GENDER is language-dependent: in French or Spanish it has two
values (MASCULINE vs. FEMININE), in German or Russian it has three (MASCULINE vs. FEMININE vs.
NEUTER), and there are languages with even more values for this variable. The
variable VOICE has two to four values in English, depending on the way that this
construct is defined in a given model (most models of English would see ACTIVE and
PASSIVE as values of the variable VOICE, some models would also include the MIDDLE
construction, and a few models might even include the ANTIPASSIVE).
A research issue that is stated in terms of two variables and their relationship
can be visualized as a cross table (also referred to as a contingency table) with each
dimension of the table as one of the variables and each cell as one of the possible
intersections (i.e., combinations) of the values of the two variables. Such a table is
shown in Table 3-1 (note that there is a useful convention, albeit not one very
strictly enforced in linguistics, to show the values of the independent variable in
the table rows and the values of the dependent variable in the table columns).

Table 3-1. A contingency table

                                  DEPENDENT VARIABLE (DV)
                                  VALUE 1          VALUE 2

INDEPENDENT         VALUE 1       Intersection     Intersection
VARIABLE (IV)                     IV1 & DV1        IV1 & DV2

                    VALUE 2       Intersection     Intersection
                                  IV2 & DV1        IV2 & DV2
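In code, a cross table of this kind is just a tally over pairs of variable values. Here is a minimal Python sketch (the coded observations are invented toy data for illustration, not counts from any real corpus):

```python
from collections import Counter

# Toy observations, each coded for the independent variable (VARIETY)
# and the dependent variable (WORD); invented data, not real corpus hits.
observations = [
    ("BrE", "windscreen"), ("BrE", "windscreen"), ("BrE", "windshield"),
    ("AmE", "windshield"), ("AmE", "windshield"), ("AmE", "windscreen"),
]

counts = Counter(observations)

# Print the cross table: IV values as rows, DV values as columns.
dv_values = ["windscreen", "windshield"]
print(f"{'':6}{dv_values[0]:>12}{dv_values[1]:>12}")
for iv in ["BrE", "AmE"]:
    row = "".join(f"{counts[(iv, dv)]:>12}" for dv in dv_values)
    print(f"{iv:<6}{row}")
```

Each cell of the printed table corresponds to one intersection of an IV value and a DV value.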

Since hypotheses play such a central role in the Popperian framework of doing
science, we might expect that there is a large number of criteria that a hypothesis
has to meet in order to qualify as the basis for a scientific research project.
Actually, there is just a single criterion dictated by the logic of the Popperian
research paradigm: a scientific hypothesis must be testable. Of course, in the real
world there will be additional criteria that will often be relevant. For example, a
hypothesis may have to be seen as addressing some relevant aspect of the current
state of knowledge in a given research tradition, or it may even have to conform
to discipline-external criteria such as the requirements of funding agencies. But
from the viewpoint of scientific methodology, such criteria are irrelevant.

3.3 Testing hypotheses

Crucially, in order to count as testable in the scientific sense, a hypothesis must
be falsifiable, i.e., it must be stated in a way that allows us to show that it is false.
As researchers, it is then our task to try and falsify the hypothesis.
To a novice in scientific methodology, this may come as a surprise.
Spontaneously, one might assume the exact opposite: that a hypothesis must be
formulated in such a way that it is obvious how it can be shown to be true, and
that it is the task of a researcher to verify their hypotheses.
However, there are at least two reasons why no amount of positive evidence
could ever show a hypothesis to be true, and thus, why this procedure would not
be feasible. First, for any given piece of evidence that seems to support our
hypothesis, there are always several alternative hypotheses that explain the
evidence just as well as our research hypothesis. Second, and even more
importantly, no matter how much evidence we adduce in favor of our hypothesis,
the next piece of evidence we come across may falsify it. Worse, if we aim our
test at producing positive evidence, we may actually make it impossible to find
negative evidence at all, even where such evidence is abundant.
Take the example of the words windscreen and windshield, discussed in the
previous chapter. Let us assume that we have already formulated the hypothesis
that the first of these words is British English and the second one is American
English. First, note that this is a perfect example of a scientific hypothesis: it
identifies two variables (VARIETY and WORD FOR FRONT WINDOW OF CAR) with two
values each (BRITISH ENGLISH vs. AMERICAN ENGLISH, and WINDSCREEN vs. WINDSHIELD) and it
posits a relationship between them such that which word is used depends on
which variety is spoken (specifically, that WORD FOR FRONT WINDOW OF CAR will take
the value WINDSCREEN if VARIETY is BRITISH ENGLISH and the value WINDSHIELD if VARIETY is
AMERICAN ENGLISH). Table 3-2 is a visual representation.

Table 3-2. A contingency table with binary values for the intersections

                              DV: FRONT WINDOW OF CAR
                              WINDSCREEN      WINDSHIELD

IV: VARIETY        BRE        ✓               ✗

                   AME        ✗               ✓
How do we test this hypothesis empirically? It predicts that we should find
examples of windscreen in British English and examples of windshield in
American English (the intersections marked by checkmarks in Table 3-2), so we
might be tempted to confirm it by querying a corpus of American English (such
as the BROWN corpus) for the string <windshield> and a corpus of British
English (such as the LOB corpus) for the string <windscreen> (I will enclose
corpus queries in angled brackets from now on as a reminder that we are never,
strictly speaking, searching for words in a corpus, but for strings of characters of
varying degrees of complexity).
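In programming terms, a query of this kind is simply a pattern search over the text of the corpus. A minimal Python sketch (the two one-line "corpora" below are invented stand-ins; the real LOB and BROWN corpora are of course vastly larger):

```python
import re

def query(corpus_text, search_string):
    """Return all matches for a query such as <windscreen> in a corpus text."""
    return re.findall(r"\b" + re.escape(search_string) + r"\b",
                      corpus_text, flags=re.IGNORECASE)

# Invented one-sentence stand-ins for the LOB and BROWN corpora:
lob = "He wiped the windscreen. The windscreen wipers had failed again."
brown = "A long crack spread slowly across the windshield."

print(len(query(lob, "windscreen")))    # 2
print(len(query(brown, "windshield")))  # 1
print(len(query(brown, "windscreen")))  # 0
```

Note that, exactly as the text warns, the function matches character strings, not words: it knows nothing about the lexemes behind them.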


But obviously, this does not constitute a test of our hypothesis: no matter how
many hits our queries return, they would not show our hypothesis to be true,
because we would get the same result if both words occurred in both varieties.
Instead, we must try to falsify our hypothesis by looking for cases that should
not exist according to our hypothesis. In the present case, we have to look for
occurrences of the string <windscreen> in American English and <windshield>
in British English (the intersections marked by crosses in Table 3-2).


If we don't find instances, then this constitutes evidence in favor of our
hypothesis. It does not prove our hypothesis, however, because we have not
looked at the entirety of British and American English, but only at relatively
small samples. Thus, we can never exclude the possibility that in a larger sample,
the next instance we find would be one of windshield in American or windscreen
in British English. This would be true even if we looked at the entirety of British
and American English at any given point in time, because new instances of the
two varieties are being created all the time.
We will return to this problem in more detail, but for now, let us assume that
in order to test a hypothesis we have to specify cases that should not occur if the
hypothesis were true, and then do our best to find such cases. The harder we try
to find such cases but fail to do so, the more certain we will be that our
hypothesis is true. But no matter how hard we look, we must learn to accept that
we can never be absolutely certain: in science, a fact is simply a hypothesis that
has not yet been falsified. This may seem disappointing, but science has made
significant advances despite (or perhaps because of) the fact that scientists accept
that there is no certainty when it comes to truth. In contrast, a single
counterexample will give us the certainty that our hypothesis is false.
In the classical Popperian framework, counterexamples (i.e., negative
outcomes of an experiment) play a crucial role, and traditional linguistic
argumentation on the basis of intuited examples also accords a central role to the
counterexample (or at least claims to do so). In corpus linguistics,
counterexamples have not played a major role, but this is due not so much to the
fact that they cannot play a useful role (cf. for example, Meurers 2005 and
Meurers and Müller 2009 for excellent examples of how they can inform syntactic
argumentation), but rather to the fact that corpus-linguistic research questions do
not lend themselves to the kinds of categorical hypotheses that make
counterexamples meaningful.
Nevertheless, a brief discussion of counterexamples will be useful both in its
own right and in setting the stage for the next section. For expository reasons, I
will continue to use the case of dialectal variation as an example, but the issues
discussed apply to all corpus-linguistic research questions.
In the case of windscreen and windshield, there are no counterexamples in the
corpora used, but what if there had been one? Take another well-known case of a
lexical difference between British and American English: the distilled petroleum
used to fuel cars is referred to as petrol in British English and gasoline in
American English. A search in the four corpora used above yields the frequencies
of occurrence shown in Table 3-3.

Table 3-3. Petrol vs. gasoline

                              DV: DISTILLED PETROLEUM
                              PETROL          GASOLINE

IV: VARIETY        BRE        21              0

                   AME        1               20
In other words, the distribution is almost identical to that for the words
windscreen and windshield, except for one counterexample, where petrol occurs
in the American part of the corpus (specifically, in the FROWN corpus). It
appears, then, that our hypothesis is falsified, at least with respect to the word
petrol. Of course, this is true only if we are genuinely dealing with a
counterexample, so let us take a closer look at the example in question, which
turns out to be from the novel Eye of the Storm by Jack Higgins:

(3.1) He was in Dorking within half an hour. He passed straight through and
      continued toward Horsham, finally pulling into a petrol station about five
      miles outside. [Higgins, Eye of the Storm]

Now, Jack Higgins is a pseudonym used by the novelist Harry Patterson for some
of his novels, and Patterson is British (he was born in Newcastle upon Tyne, and
grew up in Belfast and Leeds). In other words, his novel was erroneously included
in the FROWN corpus, presumably because it was published by an American
publisher. Thus, we can discount the counterexample and maintain our original
hypothesis. Misclassified data are only one reason to discount a counterexample;
other reasons include intentionally deviant linguistic behavior (for example, an
American speaker may imitate a British speaker, or a British speaker may have
picked up some American vocabulary on a visit to the United States); a more
complex reason is discussed below.
Note that there are two problems with the strategy of checking
counterexamples individually to determine whether they are genuine
counterexamples or not. First, we only checked the example that looked like a
counterexample; we did not check all the examples that fit our hypothesis.
However, these examples could, of course, also contain cases of misclassified data,
which would lead to additional counterexamples. Of course, we could
theoretically check all examples, as there are only 42 examples overall. However,
the larger our corpus is (and most corpus-linguistic research requires corpora that
are much larger than the four million words used here), the less feasible it
becomes to do so. The second problem is that we were lucky, in this case, that the
counterexample came from a novel by a well-known author, whose biographical
information is easily available. But linguistic corpora do not (and cannot) contain
only well-known authors, and so checking the individual demographic data for
every speaker in a corpus may be difficult or even impossible. Finally, some types
of language cannot be attributed to a single speaker at all: speeches are often
written by a team of speech writers that may or may not include the person
delivering the speech, newspaper articles may include text from a number of
journalists and press agencies, published texts in general are typically proofread
by people other than the author, and so forth.
Let us look at a more complex example: the words for the (typically elevated)
paved strip at the side of a road provided for pedestrians. Dictionaries typically
tell us that this is called pavement in British English and sidewalk in American
English; for example, the OALD:

(3.2a) pavement noun [...]
       1 [countable] (British English) (North American English sidewalk) a flat
       part at the side of a road for people to walk on [OALD]

(3.2b) sidewalk noun [...]
       (North American English) (British English pavement) a flat part at the side
       of a road for people to walk on [OALD]
A search of the four corpora used above shows a distribution of the two words (in
all their potential orthographic variants) across the two varieties as shown in
Table 3-4a.

Table 3-4a. Pavement vs. sidewalk

                              DV: ROADSIDE FOOTPATH
                              PAVEMENT        SIDEWALK

IV: VARIETY        BRE        37              4

                   AME        22              43
In this case, we are not dealing with a single counterexample. Instead, there are
four apparent counterexamples where sidewalk occurs in British English, and 22
apparent counterexamples where pavement occurs in American English.
In the case of sidewalk, it seems at least possible that a closer inspection of the
four cases in British English would show them to be only apparent
counterexamples, due, for example, to misclassified texts. In the case of the 22
cases of pavement in American English, this is less likely. Let us look at both cases
in turn.
Here are the four examples of sidewalk in British English, along with the
author and title of the original source as quoted in the manuals of the
corresponding corpora:

(3.3a) One persistent taxi follows him through the street, crawling by the
       sidewalk
       [LOB E09: Wilfrid T. F. Castle, Stamps of Lebanon's Dog River]
(3.3b) "Keep that black devil away from Rusty or you'll have a sick horse on
       your hands," he warned, and leaped to the wooden sidewalk.
       [LOB N07: Bert Cloos, Drury]
(3.3c) There was a small boy on the sidewalk selling melons.
       [FLOB K24: Linda Waterman, Bad Connection.]
(3.3d) Joe, my love, the snowflakes fell on the sidewalk.
       [FLOB K25: Christine McNeill, The Lesson.]

Not much can be found about Wilfrid Castle, other than that he also wrote
several books about English parish churches, making it seem likely that he was
British. Bert Cloos is a pseudonym for Bertram P. Cloos, and there is a US army
record11 from 1940 for a Bertram Cloos whose civilian occupation is listed as
"authors, editors and reporters". If this is the same person, then (3.3b) would be
due to a misclassified text by an American author in the British corpus, but there
is no way of determining this.
For Linda Waterman and Christine McNeill, no biographical information can
be found at all. Waterman's story was published in a British student magazine,
but this in itself is no evidence of anything. The story is set in Latin America, so
there may be a conscious effort to evoke American English. In McNeill's case
there is some evidence that she is British: she uses some words that are typically
British, such as dressing gown (AmE (bath)robe) and breadbin (AmE breadbox), so
it is plausible that she is British. Like Waterman's story, hers was published in a
British magazine. Interestingly, however, the scene in which the word is used is
set in the United States, so she, too, might be consciously evoking American
English. To sum up, we have one example that was likely produced by an
American speaker, two that were likely produced by British speakers (although
one was probably evoking American English), and one about whose author we
cannot say anything, but which may also have been evoking American English.
Which of these, if any, we may safely discount is difficult to say.
Turning to pavement in American English, it would be possible to check the
origin of the speakers of all 22 cases with the same attention to detail, but it is
questionable whether the results would be worth the time invested: as pointed
out, it is unlikely that there are so many misclassified examples in the American
corpora.
On closer inspection, however, it becomes apparent that we may be dealing
with a different type of exception here: the word pavement has additional senses
to the one cited in (3.2a) above, one of which does exist in American English.
Here is the remainder of the dictionary entry cited in (3.2a):

(3.4) 2 [countable, uncountable] (British English) any area of flat stones on the
      ground
      3 [uncountable] (North American English) the surface of a road [OALD]

Since neither of these meanings is relevant for the issue of British and American
words for pedestrian paths next to a road, they cannot be treated as
counterexamples in our context. In other words, we have to look at all hits for
pavement and annotate them for their appropriate meaning. This is in itself a
non-trivial task, which we will discuss in more detail in Chapters 4 and 5. Take
the example in (3.5):

11 See http://wwii-army.findthedata.org/l/7914772/Bertram-P-Cloos (Last Access: April 2014).

(3.5) [H]e could see the police radio car as he rounded the corner and slammed
      on the brakes. He did not bother with his radio (there would be time for
      that later) but as he scrambled out on the pavement he saw the filling
      station and the public telephone booth [BROWN L18]

Even with quite a large context, this example is compatible with a reading of
pavement as 'road surface' or as 'pedestrian path'. If it came from a British text,
we would not hesitate to assign the latter reading, but since it comes from an
American text (the novel Error of Judgment by the American author George
Harmon Coxe), we might lean towards erring on the side of caution and annotate
it as 'road surface'. Alas, the side of caution here is the side suggested by the
very hypothesis we are trying to falsify: we would be basing our categorization
circularly on what we are expecting to find in the data.
A more intensive search of novels by American authors in the Google Books
archive (which is vastly larger than the BROWN corpus) turns up clear cases of
the word pavement with the meaning of sidewalk, for example, this passage from
a novel by American author Mary Roberts Rinehart:

(3.6) He had fallen asleep in his buggy, and had wakened to find old Nettie
      drawing him slowly down the main street of the town, pursuing an
      erratic but homeward course, while the people on the pavements watched
      and smiled. [Mary Roberts Rinehart, The Breaking Point, Ch. 10]


Since this reading exists, then, we have found a counterexample to our
hypothesis and can reject it.
But what does this mean for our sample from the BROWN corpus? Is there
really nothing to be learned from it concerning our hypothesis? Let us say we
truly wanted to err on the side of caution, i.e. on the side that goes against our
hypothesis, and assign the meaning of sidewalk to Coxe's novel too. Let us further
assume that we can assign all other uses of pavement in the sample to the reading
'paved surface', and that two of the four examples of sidewalk in the British
English corpus are genuine counterexamples, too. This would give us the
distribution shown in Table 3-4b:

Table 3-4b. Pavement vs. sidewalk (corrected)

                              DV: ROADSIDE FOOTPATH
                              PAVEMENT        SIDEWALK

IV: VARIETY        BRE        37              2

                   AME        1               43
Given this distribution, would we really want to claim that it is wrong to assign
pavement to British and sidewalk to American English on the basis that there are
a few possible counterexamples? More generally, is falsification by
counterexample a plausible research strategy for corpus linguistics?
There are several reasons why the answer to this question must be no. First,
we can rarely say with any certainty whether we are dealing with true
counterexamples or whether the apparent counterexamples are due to errors in
the construction of the corpus or in our classification. From a theoretical
perspective, this may not count as a valid objection to the idea of falsification by
counterexample. We may argue that we simply have to make sure that there are
no errors in the construction of our corpus and that we have to classify all hits
correctly. However, in actual practice this is impossible. We can (and must) try to
minimize errors in our data and our classification, but we can never get rid of
them completely (this is true not only in corpus linguistics but in any discipline).
Second, even if our data and our classification were error-free, human
behavior is less deterministic than the physical processes Popper had in mind
when postulating his model of scientific research. Even in a simple case like word
choice, there may be many reasons why a speaker may produce an exceptional
utterance: evoking a variety other than their own (as in the examples above),
unintentionally or intentionally using a word that they would not normally use
because their interlocutor has used it, temporarily slipping into a variety that
they used to speak as a child but no longer do, etc. With more complex linguistic
behavior, such as producing particular grammatical structures, there will be
additional reasons for exceptional behavior: planning errors, choosing a different
formulation in mid-sentence, tiredness, etc.: all the kinds of things classified as
performance errors in traditional grammatical theory.

To sum up, our measurements will never be perfect and speakers will never
behave perfectly consistently. This means that we cannot use a single
counterexample (or even a handful of counterexamples) as a basis for rejecting a
hypothesis, even if that hypothesis is stated in absolute terms in the first place,
i.e., even if our hypothesis predicts that some intersections of our variables will
occur in the data while others will not.
There is a third reason why falsification by counterexample is not a plausible
research strategy: as mentioned in Chapter 2 in connection with our final
definition of corpus linguistics, hypotheses in linguistics are frequently not stated
in absolute terms ("All Xs are Y", "Zs always do Y", etc.), but in relative terms, i.e.,
as tendencies or preferences ("Xs tend to be Y", "Zs prefer Y", etc.). For example,
there are a number of prepositions and/or adverbs in English that contain the
morpheme -ward or -wards, such as afterward(s), backward(s), downward(s),
inward(s), outward(s) and toward(s). These two morphemes are essentially
allomorphs of a single suffix that are in free variation: they have the same
etymology (-wards simply includes a lexicalized genitive ending), they have both
existed throughout the recorded history of English, and there is no discernible
difference in meaning between them. However, many dictionaries point out that
the forms ending in -s are preferred in British English and the ones without the -s
are preferred in American English.
Clearly, counterexamples are irrelevant in this case. Finding an example like
(3.7a) in a corpus of American English does not disprove the hypothesis that the
use in (3.7b) would be preferred or more typical:

(3.7a) [T]he tall young buffalo hunter pushed open the swing doors and walked
       towards the bar. [BROWN N]
(3.7b) Then Angelina turned and with an easy grace walked toward the kitchen.
       [BROWN K]

In other words, our prediction cannot be that we will find -ward in American
English and -wards in British English. Instead, we have to state our prediction in
quantitative terms. Generally speaking, we should expect to find more cases of
-wards than of -ward in British English and more of -ward than of -wards in
American English, as visualized in Table 3-5 (where the circles of different sizes
represent different frequencies of occurrence).

Table 3-5. A contingency table showing relative differences

                              SUFFIX VARIANT
                              -WARD           -WARDS

VARIETY            BRE        •               ●

                   AME        ●               •

(● = more frequent, • = less frequent)
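A preference of this kind can be counted directly from text. A minimal Python sketch (the two sample sentences are invented, loosely modeled on (3.7a/b), and merely stand in for real corpus samples):

```python
import re

def count_ward_variants(text):
    """Count tokens ending in -ward vs. -wards (toward/towards, backward/backwards...)."""
    endings = re.findall(r"\b\w+ward(s?)\b", text.lower())
    return endings.count(""), endings.count("s")

# Invented samples standing in for American and British corpus text:
ame = "He walked toward the bar, glanced backward, and drove onward."
bre = "She walked towards the bar, glanced backwards, and drove onwards."

print(count_ward_variants(ame))  # (3, 0)
print(count_ward_variants(bre))  # (0, 3)
```

On real corpus data the two counts would of course not be categorical like this, which is precisely why the prediction has to be stated as a relative difference.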

We will return to the issue of how to phrase predictions in quantitative terms in
Chapter 6. Of course, phrasing predictions in quantitative terms raises additional
questions: How large must a difference in quantity be in order to count as
evidence in favor of a hypothesis that is stated in terms of preferences or
tendencies? And, given that our task is to try to falsify our hypothesis, how can
this be done if counterexamples cannot do the trick? In order to answer such
questions, we need a different approach to hypothesis testing, namely statistical
hypothesis testing. This approach will be discussed in detail in Chapters 7 and 8.
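As a very rough preview of that statistical logic: for a 2×2 table, Pearson's chi-square statistic measures how strongly the observed frequencies deviate from what independence of the two variables would lead us to expect. A minimal Python sketch, applied here to the corrected pavement/sidewalk frequencies (3.841 is the standard critical value at the 5% level for one degree of freedom; the details are left to Chapters 7 and 8):

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Observed frequencies as in Table 3-4b: BrE 37 vs. 2, AmE 1 vs. 43.
chi2 = chi_square_2x2(37, 2, 1, 43)
print(round(chi2, 2))  # 71.42
print(chi2 > 3.841)    # True: far beyond the 5% critical value for df = 1
```

The point of such a test is exactly what the question above asks for: it tells us whether an observed difference in quantity is large enough to count as evidence.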
There is another issue that we must turn to first, though: that of defining our
variables and their values in such a way that we can identify them in our data.
We saw even in the simple cases discussed above that this is not a trivial matter.
For example, we defined American English as the language occurring in the
BROWN and FROWN corpora, but we saw that the FROWN corpus contains at
least one misclassified text by a British author. And even if this were not the case,
nobody would want to claim that our definition was accurate: clearly, the corpora
contain only a tiny fraction of the language produced by American speakers of
English, and in fact it is even questionable whether "language produced by
American speakers of English" is the same as "American English". We also saw
that there are at least two meanings of the word pavement that were relevant to
our research question. We assumed that it was possible, at least in the majority of
cases, simply to recognize which meaning we are dealing with by looking at the
corpus citation in question, but we saw that there are at least some cases where
this is impossible. These are just two examples of the larger problem of
operationalization, to which we will turn in the next chapter.

Further Reading

A readable exposition of Popper's ideas about falsification is Popper's essay
"Science as falsification", in Popper (1963).

Popper, Karl R. (1963). Conjectures and refutations: The growth of scientific
knowledge. London & New York: Routledge and Kegan Paul.

4 Operationalization

The examples discussed in the preceding chapter all dealt with the association
between seemingly simple and uncontroversial aspects of language structure and
use (lexical items and broadly defined language varieties), and yet we saw that
they were not trivial to deal with even in the context of a simple research
question such as which words are used in which variety. More often than not,
corpus-linguistic research issues will be considerably more complex, requiring a
much more sophisticated approach even to such seemingly simple notions, and
additionally also including more abstract, theory-dependent features of language.
While there are cases where even abstract features manifest themselves relatively
directly as classes or configurations of concrete elements, in many cases they are
reflected in the corpus data only indirectly or not at all.
In either case, most features of language structure or use cannot be
straightforwardly read off the data. Thus, getting from a linguistic hypothesis to a
set of data that can be used to test this hypothesis requires the following three
broad steps:

1. operationalization: the variables referred to by our hypothesis and all
   values of these variables must be defined in such a way that they can be
   reliably measured (i.e., identified in a corpus);
2. retrieval: the corpus must be searched in such a way that all data points
   manifesting the variables in question are found;
3. coding: the data must be coded for the values that our variables may take.

For example, in order to answer the research question of how windscreen and
windshield are distributed across British and American English, the variable
VARIETY was operationally defined such that anything found in the LOB or FLOB
corpus counted as BRITISH ENGLISH and anything found in the BROWN and FROWN
corpora counted as AMERICAN ENGLISH; the variable WORD FOR THE FRONT WINDOW OF A
CAR was operationally defined in terms of two sets of orthographic strings
containing all ways in which the words windscreen and windshield could
conceivably be spelled in the corpus (for example, WINDSCREEN, WIND SCREEN,
WIND-SCREEN, Windscreen, Wind screen, Wind Screen, Wind-screen, Wind-Screen,
windscreen, wind screen, wind-screen for WINDSCREEN). These orthographic strings
were then retrieved from all four corpora using a computer program that went
through each corpus word by word, checking for each word whether it
corresponded to one of these orthographic strings and saving it to a file if it did.
Each hit was then coded as BRITISH ENGLISH if it occurred in the LOB or FLOB
corpus and as AMERICAN ENGLISH if it occurred in the BROWN or FROWN corpus,
and each hit was coded as WINDSHIELD or WINDSCREEN depending on which of these
two words the orthographic string represented.
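A program along these lines can be sketched in a few lines of Python. The variant generator and the sample sentence below are my own illustration (the actual case study used its own program and the full corpora, and the generator below produces a slightly different variant set than the one listed above):

```python
import re

def variants(parts):
    """Spelling variants of a compound: joined, spaced and hyphenated,
    each in lower-case, sentence-case, title-case and all-caps forms."""
    forms = set()
    for sep in ("", " ", "-"):
        joined = sep.join(parts)
        forms.update({joined, joined.capitalize(), joined.upper(),
                      sep.join(p.capitalize() for p in parts)})
    return forms

def find_hits(corpus_text, parts):
    """Return every occurrence of any orthographic variant in the text."""
    alternatives = sorted(variants(parts), key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in alternatives) + r")\b"
    return re.findall(pattern, corpus_text)

# An invented sentence standing in for corpus text:
text = "The wind-screen cracked. A new Windscreen was fitted; the wind screen held."
print(find_hits(text, ["wind", "screen"]))  # ['wind-screen', 'Windscreen', 'wind screen']
```

Each hit returned this way could then be written to a file together with the name of the corpus it came from, which yields the coding for VARIETY described above.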
If this seems like an overly complex way of describing the steps necessary to
answer the research question, this is because we tend to take much of this
implicitly for granted. Indeed, even in the corpus-linguistic research literature,
the research design is often described in much less detail, leaving it to the reader
to fill in the gaps. While this may be possible in a (seemingly) straightforward
research design such as the one described here, it becomes impossibly complex in
the case of more advanced research designs. In this chapter, we will therefore
take a close look at the first of these steps; in the next chapter we will deal with
the other two.
take a close look the rst of these steps, in the next chapter we will deal with the
other two.
AF

4.1 Operational definitions

Put simply, an operational definition of a construct is an explicit and
unambiguous description of a set of operations that are performed to identify and
measure that construct. This makes operational definitions fundamentally
different from our everyday understanding of what a definition is.
Take an example from physics, the property HARDNESS. A typical dictionary
definition of the word hard is the following (the abbreviations refer to
dictionaries; see further Appendix A.2):

(4.1) 1 FIRM TO TOUCH firm, stiff, and difficult to press down, break, or cut [≠ soft]
      (LDCE, s.v. hard; cf. also the virtually identical definitions in CALD, MW
      and OALD).

This definition corresponds quite closely to our experiential understanding of
what it means to be hard. However, for a physicist or an engineer interested in
the hardness of different materials, it is not immediately useful: firm and stiff are
simply loose synonyms of hard, and soft is an antonym; they do not help in
understanding what hardness is. The remainder of the definition is more
promising: it should be possible to determine the hardness of a material on this
basis simply by pressing it down, breaking or cutting it and noting how difficult
this is. However, before, say, "pressing down" can be used as an operational
definition, at least three questions need to be asked: first, what type of object is to
be used for pressing (what material it is made of and what shape it has); second,
how much pressure is to be applied; and third, how the difficulty of pressing
down is to be determined.
There are a number of hardness tests that differ mainly in the answers they
provide to these questions (cf. Konrad 2011 and Wiederhorn et al. 2011 for a
discussion of hardness tests). One commonly used group of tests is based on the
size of the indentation that an object (the indenter) leaves when pressed into
some material. One such test is the Vickers Hardness Test, which specifies the
indenter as a diamond with the shape of a square-based pyramid with an angle of
136°, and the test force as ranging from 49.3 to 980.7 newtons (the exact
force must be specified every time a measurement is reported). Hardness is then
defined as follows (cf. Konrad 2011: 43):

(4.2)  HV = 0.102 × F / A

where F is the load in newtons and A is the surface area of the indentation (0.102
is a constant that converts newtons into kiloponds; this is necessary because the
Vickers Hardness Test used to measure the test force in kiloponds before the
newton became the internationally recognized unit of force).
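Formula (4.2) translates directly into code. A minimal sketch (the force and area values below are invented for illustration, and A is assumed to be given in square millimeters, as is usual for HV):

```python
def vickers_hardness(force_n, area_mm2):
    """HV = 0.102 * F / A, with F in newtons and A in mm^2.
    The constant 0.102 converts newtons to kiloponds (1 kp = 9.80665 N)."""
    return 0.102 * force_n / area_mm2

# A maximal test force of 980.7 N leaving an indentation of 0.5 mm^2:
print(round(vickers_hardness(980.7, 0.5), 1))  # 200.1
```

The function is, in effect, the operational definition itself: given the specified measurement operations, it deterministically returns a number representing hardness.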
Unlike the dictionary denition quoted above, Vickers Hardness (HV) is an
D

operational denition of hardness: it species a procedure that leads to a number


representing the hardness of a material. is operational denition is partly
motivated by our experiential understanding of hardness in the same way as (part
o) the dictionary denition (dicult to press down), but in other aspects, it is
arbitrary. For example, one could use indenters that dier in shape or material,
and indeed there are other widely used tests that do this: the Brinell Hardness Test
uses a hardmetal ball with a diameter that may dier depending on the material
to be tested, and the Knoop Hardness Test uses a diamond indenter with a
rhombic-based pyramid shape. One could also use a dierent measure of
diculty of pressing down: for example, some tests use the rebound energy of

70
Anatol Stefanowitsch
an object dropped onto the material from a particular height.
Obviously, each of these tests will give a different result when applied to the
same material, and some of them cannot be applied to particular materials (for
example, materials that are too flexible for the indenter to leave an indentation, or
materials that are so brittle that they will fracture during testing). More crucially,
none of them attempts to capture the nature of hardness; instead, they are meant
to turn hardness into something that is close enough to our understanding of
what it means to be hard, yet at the same time reliably measurable.
Take another example: in psychiatry, it is necessary to identify mental
disorders in order to determine what (if any) treatment may be necessary for a
given patient. But clearly, just like the hardness of materials, mental disorders are
not directly accessible. Consider, for example, the following dictionary definition
of schizophrenia:
(4.3) a serious mental illness in which someone cannot understand what is real
and what is imaginary (CALD, s.v. schizophrenia, see again the very
similar definitions in LDCE, MW and OALD).
Although matters are actually significantly more complicated, let us assume that
this definition captures the essence of schizophrenia. As a basis for diagnosis, it is
useless. The main problem is that understanding what is real is a mental process
that cannot be observed or measured directly (a second problem is that everyone
may be momentarily confused on occasion with regard to whether something is
real or not, for example, when we are tired or drunk).
In psychiatry, mental disorders are therefore operationally defined in terms of
certain behaviors. For example, the fourth edition of the Diagnostic and Statistical
Manual of Mental Disorders (DSM-IV), used by psychiatrists and psychologists in
the U.S. to diagnose schizophrenia, classifies an individual as schizophrenic if
they (i) display at least two of the following symptoms: delusions,
hallucinations, disorganized speech, grossly disorganized or catatonic
behavior and affective flattening, poverty of speech or lack of motivation;
and if they (ii) function markedly below the level achieved prior to the onset in
areas such as work, interpersonal relations, or self-care; and if (iii) these
symptoms can be observed over a period of at least one month and show effects
over a period of at least six months; and if (iv) similar diagnoses (such as
schizoaffective disorder) and substance abuse and medication can be ruled out
(American Psychiatric Association 2000).
This definition of schizophrenia is much less objective than that of physical
4 Operationalization
hardness, which is partly due to the fact that human behavior is more complex
and less comprehensively understood than the mechanical properties of
materials, and partly due to the fact that psychology and psychiatry are less
developed disciplines than physics. However, it is an operational definition in the
sense that it effectively presents a check-list of observable phenomena that is
used to determine the presence of an unobservable phenomenon. As in the case
of hardness tests, there is no single operational definition: the International
Statistical Classification of Diseases and Related Health Problems offers a different
definition that overlaps with that of the DSM-IV but places more emphasis on
(and is more specific with respect to) mental symptoms and less emphasis on
social behaviors.
As should have become clear, operational definitions do not (and do not
attempt to) capture the essence of the things or phenomena they define. We
cannot say that the Vickers Hardness number is hardness or that the DSM-IV
list of symptoms is schizophrenia. They are simply ways of measuring or
diagnosing these phenomena. Consequently, it is pointless to ask whether
operational definitions are correct or incorrect; they are simply useful in a
particular context. However, this does not mean that any operational definition is
as good as any other. A good operational definition must have two properties: it
must be reliable and valid.
A definition is reliable to the degree that different researchers can use it at
different times and all get the same results; this objectivity (or at least
intersubjectivity) is one of the primary motivations for operationalization in the
first place. Obviously, the reliability of operational definitions will vary
depending on the degree of subjective judgment involved: while Vickers Hardness
is extremely reliable, depending only on whether the apparatus is in good
working order and the procedure is followed correctly, the DSM-IV definition of
schizophrenia is much less reliable, depending, to some extent irreducibly, on the
opinions and experience of the person applying it. Especially in the latter case it
is important to test the reliability of an operational definition empirically, i.e. to
let different people apply it and see to what extent they get the same results (see
further Chapter 5).
A definition is valid to the degree that it actually measures what it is supposed
to measure. Thus, we assume that there are such phenomena as hardness or
schizophrenia and that they may be more or less accurately captured by an
operational definition. Validity is clearly a very problematic concept: since
phenomena can only be measured by operational definitions, it would be circular
to assess the quality of the same definitions on the basis of these measures. One
indirect indication of validity is consistency (e.g., the phenomena identified by the
definition share a number of additional properties not mentioned in the
definition), but to a large extent, the validity of operationalizations is likely to be
assessed on the basis of plausibility arguments. The more complex and the less
directly accessible a construct is, the more problematic the concept of validity
becomes: While everyone would agree that there is such a thing as hardness, this
is much less clear in the case of schizophrenia: it is not unusual for psychiatric
diagnoses to be reclassified (for example, what was Asperger's syndrome in the
DSM-IV became part of autism spectrum disorder in the DSM-V) or to be dropped
altogether (as was the case with homosexuality, which was treated as a mental
disorder by the DSM-II until 1974). Thus, operational definitions may create the
construct they are merely meant to measure; it is therefore important to keep in
mind that even a construct that has been operationally defined is still just a
construct, i.e. part of a theory of reality rather than part of reality itself.

4.2 Corpus-linguistic operationalizations of linguistic phenomena

Corpus linguistics is no different from other scientific disciplines: it is impossible
to conduct any corpus-based research without operational definitions. However,
this does not mean that researchers are necessarily aware that this is what they
are doing. In corpus-based research, we find roughly three different situations:

1. operational definitions remain completely implicit, i.e. the researcher
simply identifies and categorizes phenomena on the basis of their
(professional but unspoken) understanding of the subject matter;
2. operational definitions may already be part of the data and be accepted
(more or less implicitly) by the researcher, for example, when a corpus
has been designated as containing a particular variety of a language, or
when a corpus has been tokenized (i.e., someone has decided what
constitutes a single word and what does not) or tagged for word classes
(i.e. someone has decided which and how many parts of speech there are
and has assigned each word to one of these);
3. operational definitions and the procedure by which they have been
applied may be explicitly stated.

There may be linguistic phenomena whose definition is so uncontroversial that it
seems justified to simply apply it without any discussion at all: for example,
when identifying occurrences of a particular word like sidewalk. But even here, it
is important to state explicitly which orthographic strings were searched for and
why. As soon as matters get a little more complex, implicitly applied definitions
are unacceptable because unless we state exactly how we identified and
categorized a particular phenomenon, nobody will be able to interpret our results
correctly, let alone replicate them or test them on a different set of data. For
example, the English possessive construction is a fairly simple and
uncontroversial grammatical structure. In written English it consists either of the
sequence [noun + apostrophe + (one or more adjectives) + noun], or [possessive
pronoun + (one or more adjectives) + noun] (the parentheses denote optionality).
This sequence seems easy enough to identify in a corpus, so a researcher studying
the possessive may not even mention how they defined this construction. The
following examples show that matters are more complex, however:

(4.4a) We are a women's college, one of only 46 women's colleges in the United
States and Canada [womenscollege.du.edu]
(4.4b) An indictment was handed up Monday in connection with weekend raids
at Danny's Family Car Wash locations.
(4.4c) Our restaurant is situated in St. Paul's Square, Birmingham.
[pastadipiazza.com]
(4.4d) Such a student may arrange with his/her Registrar to submit a letter of
good standing to Jacksonville University [FROWN H]
(4.4e) My Opera was the virtual community for Opera web browser users.
[Wikipedia, s.v. My Opera]
While all of these cases have the form of the possessive construction and match
the strings above, opinions may differ on whether they should be included in a
sample of English possessive constructions. Example (4.4a) is a so-called
possessive compound, a lexicalized possessive construction that functions like a
conventional compound and could be treated as a single word. Example (4.4b) is a
proper name, just like (4.4e). Concerning the latter: if we want to include it, we
would have to decide whether also to include proper names where possessive
pronoun and noun are spelled as a single word, as in MySpace (the name of an
online social network). Example (4.4c) is a geographical name; here, the problem
is that such names are increasingly spelled without an apostrophe, often by
conscious decisions by government institutions (see Swaine 2009, Newman 2013).
If we want to include them, we have to decide whether also to include spellings
without the apostrophe (such as We have been established for over 20 years in the
prestigious St. Pauls Square area of Birmingham [henrysrestaurant.co.uk]) and
how to find them in the corpus. Finally, (4.4d) is a regular possessive construction,
but it contains two pronouns separated by a slash; we would have to decide
whether to count these as one or as two cases of the construction.
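Such decisions only become binding once they are written into the search procedure itself. The following sketch shows what a deliberately naive string-based query for the possessive might look like as a regular expression; the pattern and the example sentence are mine, not taken from any of the studies cited here:

```python
import re

# A naive string-based operationalization of the written possessive:
# a word followed by an apostrophe (plus optional s), or a possessive
# pronoun, then an optional adjective slot, then a noun-like word.
POSSESSIVE = re.compile(
    r"\b(?:\w+'s?|his|her|its|my|our|your|their)(?:\s+\w+)?\s+\w+\b",
    re.IGNORECASE,
)

text = "We visited Danny's Family Car Wash near St. Paul's Square."
for match in POSSESSIVE.finditer(text):
    print(match.group())
```

Run on this sentence, the pattern happily retrieves the proper name and the geographical name discussed above, while an apostrophe-less spelling like St Pauls Square would be missed entirely; both behaviors are exactly the kind of thing an explicit operational definition has to rule in or out.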
These are just some of the problems we face even with a very simple
grammatical structure. Thus, if we were to study the possessive construction (or
any other structure), we would have to state precisely which potential instances
of a structure we include. In other words, our operational definition needs to
include a list of cases that may occur in the data together with a statement of
whether and why to include them or not. We will return to this issue in
Section 4.3.
Likewise, there may be situations where operational definitions already
present in the data can be plausibly used without further discussion. If we accept
orthographic representations of language (which corpus linguists do, most of the
time), then we also accept some of the definitions that come along with
orthography, for example concerning the question what constitutes a word. For
many research questions, it may be irrelevant whether the orthographic word
correlates with a linguistic word in all cases (whether it does depends to a large
extent on the specific linguistic model we adopt), so we may simply accept this
correspondence as a pre-theoretical fact. But there are research questions, for
example concerning the average length of clauses, utterances, etc., where this
becomes relevant and we may have to define the notion of word in a different
way. At the very least, we should acknowledge that we are accepting an
orthographic definition despite the fact that it may not have a linguistic basis.
R

Similarly, there may be situations where we simply accept the part-of-speech


tagging or the syntactic annotation in a corpus, but given that there is no agreed-
D

upon theory of word classes, let alone syntactic structures, this will be
problematic in many situations. It is crucial to understand that tagging and other
types of annotation are already the result of applying operational denitions by
other researchers and if we use tags or other forms of annotation, we must
familiarize ourselves with these denitions, which are usually found in manuals
accompanying the corpus. ese manuals and other literature provided by corpus
creators must be read and cited like all other literature, and we must clarify in the
description of our research design why and to what extent we rely on the
operationalizations described in these materials.
In other words, we must always explicitly state the operational denitions we
use. Fortunately, this is becoming more common as corpus linguistics is growing
into a mature methodological tradition. We will come back to the issue of
operational definitions throughout the remainder of this book; in particular, we
will discuss the type of situation described above for the possessive construction
in more detail in Section 4.3.
Let us conclude this section by looking at four examples of frequently used
corpus-linguistic operationalizations that demonstrate various aspects of the
issues sketched out above and which we must keep in mind in designing studies.

a) Length. There is a wide range of phenomena that have been claimed and/or
shown to be related to the weight of linguistic units (syllables, words or
phrases): word-order phenomena following the principle "light before heavy",
such as the dative alternation (Thompson and Koide 1987), particle placement
(Chen 1986), s-genitive and of-construction (Deane 1987), frozen binominals
(Sobkowiak 1993), to name just a few. In the context of such claims, weight is
sometimes understood to refer to structural complexity, sometimes to length, and
sometimes to both. Since complexity is often difficult to define, it is, in fact,
frequently operationalized in terms of length, but let us first look at the difficulty
of defining length in its own right and briefly return to complexity below.
Let us begin with words. Clearly, words differ in length: everyone would
agree that the word stun is shorter than the word flabbergast. There are a number
of ways in which we could operationalize WORD LENGTH, all of which would allow
us to confirm this difference in length:
- as number of letters (cf., e.g., Wulff 2003), in which case flabbergast has
a length of 11 and stun has a length of 4;
- as number of phonemes (cf., e.g., Sobkowiak 1993), in which case
flabbergast has a length of 9 (BrE /ˈflæbəɡɑːst/ and AmE /ˈflæbɚɡæst/), and
stun has a length of 4 (BrE and AmE /stʌn/);
- as number of syllables (cf., e.g., Sobkowiak 1993, Stefanowitsch 2003), in
which case flabbergast has a length of 3 and stun has a length of 1.
While all three operationalizations give us comparable results in the case of these
two words, they will diverge in other cases. Take disconcert, which has the same
length as flabbergast when measured in terms of phonemes (it has nine; BrE
/ˌdɪskənˈsɜːt/ and AmE /ˌdɪskənˈsɝːt/) or syllables (three), but it is shorter when
measured in terms of letters (ten). Or take shock, which has the same length as
stun when measured in syllables (one), but is longer when measured in letters (5
vs. 4) and shorter when measured in phonemes (3 vs. 4; BrE /ʃɒk/ and AmE
/ʃɑːk/). Or take amaze, which has the same length as shock in terms of letters
(five), but is longer in terms of phonemes (4 or 5, depending on how we analyze
the diphthong in /əˈmeɪz/) and syllables (2 vs. 1).12
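All three measures amount to simple counting procedures. In the sketch below, the phonemic transcription has to be supplied by the analyst (here as a hand-coded list of phoneme symbols), since it cannot be read off the orthography; the syllable count is approximated as the number of vocalic nuclei:

```python
def length_in_letters(word):
    """WORD LENGTH as number of letters (orthographic length)."""
    return len([c for c in word if c.isalpha()])

def length_in_phonemes(phonemes):
    """WORD LENGTH as number of phonemes, given a transcription
    represented as a list of phoneme symbols."""
    return len(phonemes)

def length_in_syllables(phonemes, vowels):
    """WORD LENGTH as number of syllables, approximated as the
    number of syllabic nuclei (vowels) in the transcription."""
    return len([p for p in phonemes if p in vowels])

# Hand-coded BrE transcription of "flabbergast".
FLABBERGAST = ["f", "l", "æ", "b", "ə", "ɡ", "ɑː", "s", "t"]
VOWELS = {"æ", "ə", "ɑː"}  # only the vowels needed for this example

print(length_in_letters("flabbergast"))          # 11
print(length_in_phonemes(FLABBERGAST))           # 9
print(length_in_syllables(FLABBERGAST, VOWELS))  # 3
```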
Clearly, none of these three definitions is correct: they simply measure
different ways in which a word may have (phonological or orthographic) length.
Which one to use depends on a number of factors, including, first, what aspect of
word length is relevant in the context of a particular research project (this is the
question of validity), and, second, to what extent they are practical to apply (this
is the question of reliability). The question of reliability is a simple one: number
of letters is the most reliably measurable factor, assuming that we are dealing
with written language or with spoken language transcribed using standard
orthography; number of phonemes can be measured less reliably, as it requires
that data be transcribed phonemically (which leaves more room for interpretation
than orthography) or, in the case of written data, converted from an orthographic
to a phonemic representation (which requires assumptions about which language
variety and level of formality the writer in question would have used if
they had been speaking the text); number of syllables also requires these
assumptions.
The question of validity is less easy to answer: if we are dealing with language
that was produced in the written medium, number of letters may seem like a
valid measure, but writers may be speaking internally as they write, in which
12 Note that I have limited the discussion here to definitions of length that make sense in
the domain of traditional linguistic corpora; there are other definitions, such as
phonetic length (time it took to pronounce a token of the word in a specific situation),
or average phonetic length (time it takes to pronounce the word on average). For
example, the pronunciation samples of the CALD, as measured by playing them in the
browser Chrome (Version 32.0.1700.102 for Mac OSX) and recording them using the
software Audacity (Version 2.0.3 for Mac OSX) on a MacBook Air with a 1.8 GHz Intel
Core i5 processor and running OS X version 10.8.5 and then using Audacity's timing
function, have the following lengths: flabbergast BrE 0.929s, AmE 0.906s and stun BrE
0.534s, AmE 0.482s. The reason I described the hardware and software I used in so
much detail is that they are likely to have influenced the measured length, in addition
to the fact that different speakers will produce words at different lengths on different
occasions; thus, calculating meaningful average pronunciation lengths would be a
very time- and resource-intensive procedure even if we decided that it was the most
valid measure of WORD LENGTH in the context of a particular research project. I am not
aware of any corpus-linguistic study that has used this definition of word length;
however, there are versions of the SWITCHBOARD corpus (a corpus of transcribed
telephone conversations) that contain information about phonetic length, and these
have been used to study properties of spoken language (e.g. Greenberg et al. 1996,
Greenberg et al. 2003).
case orthographic length would play a marginal role in stylistic and/or
processing-based choices. Whether phonemic length or syllabic length is the
more valid measure may depend on particular research questions (if rhythmic
considerations are potentially relevant, syllables are the more valid measure), but
also on particular languages: for example, Cutler et al. 1986 have shown that
speakers of French (and other so-called syllable-timed languages) process words
syllabically, in which case phonemic length would never play a role, while
speakers of English (and other stress-timed languages) process them phonemically,
in which case it depends on the phenomenon which of the measures is more valid.13
Finally, note that of course phonemic and/or syllabic length correlate with
orthographic length to some extent (in languages with phonemic and syllabic
scripts), so we might use the easily and reliably measured orthographic length as
an operational definition of phonemic and/or syllabic length and assume that
mismatches will be infrequent enough to be lost in the statistical noise (cf. Wulff
2003).
When we want to measure the length of linguistic units above word level, e.g.
phrases, we can choose all of the above methods, but additionally or instead we
can (and more typically do) count the number of words and/or constituents (cf.
e.g. Gries 2003 for a comparison of syllables and words as a measure of length).
Here, we have to decide whether to count orthographic words (which is very
reliable but may or may not be valid), or phonological words (which is less
reliable, as it depends on our theory of what constitutes a phonological word).
As mentioned at the beginning of this subsection, weight is sometimes
understood to mean structural complexity rather than length. The question of how
to measure structural complexity has been addressed in some detail in the case of
phrases, where it has been suggested that COMPLEXITY could be operationalized as

phrases, where it has been suggested that COMPLEXITY could be operationalized as


number of nodes in the tree diagram modeling the structure of the phrase (cf.
D

Wasow and Arnold 2003). Such a denition has a high validity, as number of
nodes directly corresponds to a central aspect of what it means for a phrase to be
syntactically complex, but as tree diagrams are highly theory-dependent, the
reliability across linguistic frameworks is low.
13 The difference between these two language types is that in stress-timed languages, the
time between two stressed syllables tends to be constant regardless of the number of
unstressed syllables in between, while in syllable-timed languages every syllable takes
about the same time to pronounce. This suggests an additional possibility for
measuring length in stress-timed languages, namely the number of stressed syllables.
Again, I am not aware of any study that has discussed the operationalization of word
length at this level of detail.

Structural complexity can also be operationalized at various levels for words.
The number of nodes could be counted in a phonological description of a word.
For example, two words with the same number of syllables may differ in the
complexity of those syllables: amaze and astound both have two syllables, but the
second syllable of amaze follows a simple CVC pattern, while that of astound has
the much more complex CCVCC pattern. The number of nodes could also be
counted in the morphological structure of a word. In this case, all of the words
mentioned above would have a length of 1, except disconcert, which has a
length of 2 (dis + concert).
Due to the practical and theoretical difficulties of defining and measuring
complexity, the vast majority of corpus-based studies operationalize weight in
terms of some measure of WORD LENGTH even if they theoretically conceptualize it
in terms of complexity. Since complexity and length correlate to some extent, this
is a justifiable simplification in most cases. In any case, it is a good example of
how a phenomenon and its operational definition may be more or less closely
related.

b) Discourse status. The notion of "topical", "old", or "given" information plays an
important role in many areas of grammar, such as pronominal reference, voice,
and constituent order in general. Definitions of this construct vary quite
drastically across researchers and frameworks, but there is a simple basis for
operational definitions of TOPICALITY in terms of referential distance, proposed by
Talmy Givón:

(4.5) Referential Distance […] assesses the gap between the previous
occurrence in the discourse of a referent/topic and its current occurrence
in a clause, where it is marked by a particular grammatical coding device.
The gap is thus expressed in terms of number of clauses to the left. The
minimal value that can be assigned is thus 1 clause […] (Givón 1983: 13)

This is not quite an operational definition yet, as it cannot be applied reliably
without a specification of the notions clause and coding device. Both notions are
to some extent theory-dependent, and even within a particular theory they have
to be defined in the context of the above definition of referential distance in a
way that makes them identifiable.
With respect to coding devices, it has to be specified whether only overt
references (by lexical nouns, proper names and pronouns) are counted, or
whether covert references (by structural and/or semantic positions in the clause
that are not phonologically realized) are included, and if so, which types of covert
references. With respect to clauses, it has to be specified what counts as a clause,
and it has to be specified how complex clauses are to be counted.
A concrete example may demonstrate the complexity of these decisions. Let us
assume that we are interested in determining the referential distance of the
pronouns in the following example, all of which refer to the person named Joan
(verbs and other elements potentially forming the core of a clause have been
indexed with numbers for ease of reference in the subsequent discussion):

(4.6) Joan, though Anne's junior1 by a year and not yet fully accustomed2 to the
ways of the nobility, was3 by far the more worldly-wise of the two. She
watched4, listened5, learned6 and assessed7, speaking8 only when spoken9
to in general whilst all the while making10 her plans and looking11 to the
future … Enchanted12 at first by her good fortune in becoming13 Anne
Mowbray's companion, grateful14 for the benefits showered15 upon her,
Joan rapidly became16 accustomed to her new role. [BNC CCD]

Let us assume the traditional definition of a clause as a finite verb and its
dependents and let us assume that only overt references are counted. If we apply
these definitions very narrowly, we would put the referential distance between
the initial mention of Joan and the first pronominal reference at 1, as Joan is a
dependent of was in clause 4.63 and there are no other finite verbs between this
mention and the pronoun she. A broader definition of clause along the lines of a
unit expressing a complete proposition, however, might include the structures
(4.61) (though Anne's junior by a year) and (4.62) (not yet fully accustomed to the
ways of the nobility), in which case the referential distance would be 3 (a similar
problem is posed by the potential clauses (4.612) and (4.614), which do not contain
finite verbs but do express complete propositions). Note that if we also count the
NP the two as including reference to the person named Joan, the distance to she
would be 1, regardless of how the clauses are counted.
In fact, the structures (4.61) and (4.62) pose an additional problem: they are
dependent clauses whose logical subject, although it is not expressed, is clearly
coreferential to Joan. It depends on our theory whether these covert logical
subjects are treated as elements of grammatical and/or semantic structure; if they
are, we would have to include them in the count.
The differences that decisions about covert mentions can make are even more
obvious when calculating the referential distance of the second pronoun, her (in
her plans). Again, assuming that every finite verb and its dependents form a
clause, the distance between her and the previous use she is six clauses (4.64 to
4.69). However, in all six clauses, the logical subject is also Joan. If we include
these as mentions, the referential distance is 1 again (her good fortune is part of
the clause [4.612] and the previous mention would be the covert reference by the
logical subject of clause [4.611]).
Finally, note that I have assumed a very flat, sequential understanding of
number of clauses, counting every finite verb separately. However, one could
argue that the sequence She watched4, listened5, learned6 and assessed7 is actually a
single clause with four coordinated verb phrases sharing the subject she, that
speaking8 only when spoken9 to in general is a single clause consisting of a matrix
clause and an embedded adverbial clause, and that this clause itself is dependent
on the clause with the four verb phrases. Thus, the sequence from (4.64) to (4.69)
can be seen as consisting of six, two or even just one clause, depending on how
we decide to count clauses in the context of referential distance.
Obviously, there is no right or wrong way to count clauses; what matters is
that we specify a way of counting clauses that can be reliably applied and that is
valid with respect to what we are trying to measure. With respect to reliability,
obviously the simpler our specification, the better (simply counting every verb,
whether finite or not, might be a good compromise between the two definitions
mentioned above). With respect to validity, things are more complicated:
referential distance is meant to measure the degree of activation of a referent, and
different assumptions about the hierarchical structure of the clauses in question
are going to have an impact on our assumptions concerning the activation of the
entities referred to by them.
Since specifying what counts as a clause and what does not is fairly complex,
it might be worth thinking about more objective, less theory-dependent measures
of distance, such as the number of (orthographic) words between two mentions (I
am not aware of studies that do this, but finding out to what extent the results
correlate with clause-based measures of various kinds seems worthwhile).
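A word-based distance measure of this kind would also be trivial to implement, which is itself an argument for its reliability. The sketch below uses a hypothetical mini-text, tokenized on whitespace, and a naively assembled list of mention positions (in a real study, deciding what counts as a mention would of course itself require an operational definition):

```python
def word_distances(mention_positions):
    """Number of orthographic words intervening between consecutive
    mentions of a referent, given their token indices in a text."""
    return [current - previous - 1
            for previous, current in zip(mention_positions,
                                         mention_positions[1:])]

# Hypothetical mini-text: "Joan" is mentioned at token 0 and picked
# up again by "she" at token 8.
tokens = "Joan was by far the more worldly-wise ; she watched".split()
mentions = [i for i, token in enumerate(tokens) if token in ("Joan", "she")]
print(word_distances(mentions))  # [7]
```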
For practical as well as for theoretical reasons, it is plausible to introduce a
cut-off point for the number of clauses we search for a previous mention of a
referent: practically, it will become too time-consuming to search beyond a
certain point; theoretically, it is arguable to what extent a distant previous
occurrence of a referent contributes to the current information status. Givón
(1983) originally set this cut-off point at 20 clauses, but there are also studies
setting it at ten or even at three clauses. Clearly, there is no correct number of
clauses, but there is empirical evidence that the relevant distinctions are those
between a referential distance of 1, between 2 and 3, and >3 (cf. Givón 1992).
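Once referential distance has been counted in some explicitly specified way, these empirically relevant distinctions can be expressed as a simple binning scheme (a sketch; the category labels are mine):

```python
def distance_category(referential_distance):
    """Bin a referential distance (minimum value: 1) into the three
    empirically relevant categories: 1, 2-3, and >3 (cf. Givón 1992)."""
    if referential_distance <= 1:
        return "1"
    if referential_distance <= 3:
        return "2-3"
    return ">3"

print([distance_category(d) for d in [1, 2, 3, 4, 20]])
# ['1', '2-3', '2-3', '>3', '>3']
```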
Note that, as an operational definition of topicality or givenness, it will
miss a range of referents that are topical or given. For example, there are
referents that are present in the minds of speakers because they are physically
present in the speech situation, or because they constitute salient shared
knowledge for them, or because they talked about them on a previous occasion, or
because they were mentioned prior to the cut-off point. Such referents may
already be given at the point that they are first mentioned in the discourse.
Conversely, the definition may wrongly classify referents as discourse-active.
For example, in conversational data an entity may be referred to by one speaker
but be missed or misunderstood by the hearer, in which case it will not constitute
given information to the hearer (Givón originally intended the measure for
narrative data only, where this problem will not occur).
Both WORD LENGTH and DISCOURSE STATUS are phenomena that can be defined in
relatively objective, quantifiable terms: not quite as objectively as physical
HARDNESS, perhaps, but with a comparable degree of rigor. Like HARDNESS measures,
they do not access reality directly and are dependent on a number of assumptions
and decisions, but providing that these are stated sufficiently explicitly, they can
be applied almost automatically. While WORD LENGTH and DISCOURSE STATUS are not
the only such phenomena, they are not typical either. Most phenomena that are
of interest to linguists (and thus, to corpus linguists) require operational
definitions that are more heavily dependent on interpretation. Let us look at two
such phenomena, WORD SENSE and ANIMACY.

c) Word senses. Although we often pretend that corpora contain words, they
actually contain orthographic strings. Sometimes, such a string is in a relatively
unique relationship with a particular word. For example, sidewalk is normally
spelled as an uninterrupted sequence of the character S or s followed by the
characters i, d, e, w, a, l and k, or as an uninterrupted sequence of the characters
S, I, D, E, W, A, L and K, so (assuming that the corpus does not contain hyphens
inserted at the end of a line when breaking the word across lines), there are just
three orthographic forms; also, the word always has the same meaning. This is
not the case for pavement, which, as we saw, has several meanings that (while
clearly etymologically related) must be distinguished.
In these cases, the most common operationalization strategy found in corpus
linguistics is reference to a dictionary or lexical database. In other words, the
researcher will go through the concordance and assign every instance of the
orthographic string in question to one word-sense category posited in the
corresponding lexical entry. A resource frequently used in such cases is the
WordNet database (cf. Fellbaum 1998, see also Appendix A.2). This is a sort of
electronic dictionary that includes not just definitions of different word senses
but also information about lexical relationships etc.; but let us focus on the word
senses. For pavement, the entry looks like this:

(4.7) S: (n) pavement#1, paving#2 (the paved surface of a thoroughfare)


S: (n) paving#1, pavement#2, paving material#1 (material used to pave an
area)
S: (n) sidewalk#1, pavement#3 (walk consisting of a paved area for
pedestrians; usually beside a street or roadway)

There are three senses of pavement, as shown by the numbers attached, and in
each case there are synonyms. Of course, in order to turn this into an operational
definition, we need to specify a procedure that allows us to assign the hits in our
corpus to these categories. For example, we could try to replace the word
pavement by a unique synonym and see whether this changes the meaning. But
even this, as we saw in Chapter 3, may be quite difficult.
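Whatever procedure we choose, the sense inventory itself can at least be stated explicitly. The following sketch (in Python; the names PAVEMENT_SENSES and code_hit and the hit ID are invented for illustration) records the three WordNet senses from (4.7) as a fixed inventory, so that a coding decision can only refer to categories the inventory actually contains; which sense a given hit receives remains a manual judgment.

```python
# The three WordNet senses of "pavement" from (4.7), used as a fixed
# coding inventory (the hit ID below is invented for illustration).
PAVEMENT_SENSES = {
    "pavement#1": "the paved surface of a thoroughfare",
    "pavement#2": "material used to pave an area",
    "pavement#3": "walk consisting of a paved area for pedestrians",
}

def code_hit(hit_id, sense, inventory=PAVEMENT_SENSES):
    """Record a manual coding decision, rejecting unknown senses."""
    if sense not in inventory:
        raise ValueError("sense not in inventory: " + sense)
    return (hit_id, sense)

print(code_hit("hit-001", "pavement#3"))
```

Rejecting unknown labels at this stage catches typos and undocumented ad-hoc categories before they enter the data set.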

There is an additional problem: We are relying on someone else's decisions
about which uses of a word constitute different senses. In the case of pavement,
this is fairly uncontroversial, but consider the entry for the noun bank:

(4.8a) bank#1 (sloping land (especially the slope beside a body of water))
(4.8b) depository financial institution#1, bank#2, banking concern#1,
banking company#1 (a financial institution that accepts deposits and
channels the money into lending activities)


(4.8c) bank#3 (a long ridge or pile)
(4.8d) bank#4 (an arrangement of similar objects in a row or in tiers)


(4.8e) bank#5 (a supply or stock held in reserve for future use (especially in
emergencies))
(4.8f) bank#6 (the funds held by a gambling house or the dealer in some
gambling games)
(4.8g) bank#7, cant#2, camber#2 (a slope in the turn of a road or track; the
outside is higher than the inside in order to reduce the effects of
centrifugal force)
(4.8h) savings bank#2, coin bank#1, money box#1, bank#8 (a container (usually
with a slot in the top) for keeping money at home)
(4.8i) bank#9, bank building#1 (a building in which the business of banking
transacted)
(4.8j) bank#10 (a flight maneuver; aircraft tips laterally about its longitudinal
axis (especially in turning))

While everyone will presumably agree that (4.8a) and (4.8b) are separate senses
(or even separate words, i.e. homonyms), it is less clear whether everyone would
distinguish (4.8b) from (4.8i) and/or (4.8f); or (4.8e) and (4.8f), or even (4.8a) and
(4.8g). In these cases, one could argue that we are just dealing with contextual
variants of a single underlying meaning.
Thus, we have the choice of coming up with our own set of senses (which has
the advantage that it will fit more precisely into the general theoretical
framework we are working in and that we might find it easier to apply), or we
can stick with an established set of senses such as that proposed by WordNet,
which has the advantage that it is maximally transparent to other researchers and
that we cannot subconsciously make it fit our own preconceptions, thus
distorting our results in the direction of our hypothesis. In either case, we must
make the set of senses and the criteria for applying them transparent, and in
either case we are dealing with an operational definition that does not correspond
directly with reality (if only because word senses tend to form a continuum
rather than a set of discrete categories in actual language use).

d) Animacy. The animacy of the referents of noun phrases plays a role in a range
of grammatical processes in many languages. In English, for example, it has been
argued (and shown) to be involved in the grammatical alternations already
discussed above; in other languages it is involved in grammatical gender, in
alignment systems etc.


The simplest distinction in the domain of ANIMACY would be the following:

(4.9a) ANIMATE vs. INANIMATE

Dictionary definitions typically treat animate as a rough synonym of alive (OALD
and CALD define it as 'having life'), and inanimate as a rough synonym of not
alive, normally in the sense of not being capable of having life, like, for example, a
rock ('having none of the characteristics of life that an animal or plant has',
CALD, see also OALD), but sometimes additionally in the sense of being no
longer alive ('dead or appearing to be dead', OALD).
The basic distinction in (4.9a) looks simple, so that any competent speaker of a
language should be able to categorize the referents of nouns in a text accordingly.

On second thought, however, it is more complex than it seems. For example, what
about dead bodies or carcasses? The fact that dictionaries disagree as to whether
these are inanimate shows that this is not a straightforward question, but one that
calls for a decision before the nouns in a given corpus can be categorized reliably.
Let us assume for the moment that animate is defined as 'potentially having
life' and thus includes dead bodies and carcasses. This does not solve all
problems: For example, how should body parts, organs or individual cells be
categorized? They 'have life' in the sense that they are part of something alive,
but they are not, in themselves, living beings. In fact, in order to count as an
animate being in a communicatively relevant sense, an entity has to display some
degree of intentional agency. This raises the question of whether, for example,
plants, jellyfish, bacteria, viruses or prions should be categorized as animate.
Sometimes, the dimension of intentionality/agency is implicitly recognized as
playing a crucial role, leading to a three-way categorization such as that in (4.9b):

(4.9b) HUMAN vs. OTHER ANIMATE vs. INANIMATE

If ANIMACY is treated as a matter of degree, we might want to introduce further
distinctions in the domain of animates, such as HIGHER ANIMALS, LOWER ANIMALS,
PLANTS, MICRO-ORGANISMS. However, the distinction between HUMANS and OTHER
ANIMATES introduces additional problems. For example, how should we categorize


animals that are linguistically represented as quasi-human, like the bonobo Kanzi,
or a dog or a cat that is treated by their owner as though it has human
intelligence? If we categorize them as OTHER ANIMATE, what about fictional talking
animals like the Big Bad Wolf and the Three Little Pigs? And what about fully
fictional entities, such as gods, ghosts, dwarves, dragons or unicorns? Are they,
respectively, humans and animals, even though they do not, in fact, exist? Clearly,
we treat them conceptually as such, so unless we follow an extremely objectivist
semantics, they should be categorized accordingly, but this is not something we
can simply assume implicitly.
A slightly different problem is posed by robots (fictional ones that have
quasi-human or quasi-animal capacities and real ones that do not). Should these be
treated as HUMANS/ANIMATE? If so, what about other kinds of intelligent machines
(again, fictional ones with quasi-human capacities, like HAL 9000 from Arthur C.
Clarke's Space Odyssey series, or real ones without such capacities, like the laptop
on which I am writing this book)? And what about organizations (when they are
metonymically treated as agents, and when they are not)? We might want to
categorize robots, machines and organizations as human/animate in contexts

where they are treated as having human or animal intelligence and agency, and
as inanimate where they are not. In other words, our categorization of a referent
may change depending on context.
Sometimes studies involving animacy introduce additional categories in the
INANIMATE domain. One distinction that is often made is that between concrete and
abstract, yielding the four-way categorization in (4.9c):

(4.9c) HUMAN vs. ANIMATE vs. CONCRETE INANIMATE vs. ABSTRACT INANIMATE

The distinction between concrete and abstract raises the practical issue of where to
draw the line (for example, is electricity concrete?). It also raises a deeper issue
that we will return to in Chapter 5: are we still dealing with a single dimension?
Are abstract inanimate entities (say, marriage or Wednesday) really less animate
than concrete entities like a wedding ring or a calendar? And are animate and
abstract incompatible, or would it not make sense to treat the referents of words
like god, demon, unicorn etc. as abstract animate?
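One way of keeping such decisions transparent is to state the coding scheme, including the treatment of borderline cases, in an explicit, machine-readable form. The following Python sketch is purely illustrative: the category set follows (4.9c), the names (CATEGORIES, code_animacy etc.) are invented, and the borderline decisions listed are examples of the kind of choices that have to be recorded, not recommendations.

```python
# A coding scheme along the lines of (4.9c), with borderline decisions
# recorded explicitly (the decisions listed are illustrative only).
CATEGORIES = ("HUMAN", "ANIMATE", "CONCRETE INANIMATE", "ABSTRACT INANIMATE")

BORDERLINE_DECISIONS = {
    "dead body": "ANIMATE",            # defined as 'potentially having life'
    "talking animal": "HUMAN",         # conceptually treated as human
    "organization as agent": "HUMAN",  # only in agentive contexts
    "electricity": "CONCRETE INANIMATE",
}

def code_animacy(referent, category):
    """Record a coding decision, rejecting categories outside the scheme."""
    if category not in CATEGORIES:
        raise ValueError("not in coding scheme: " + category)
    return (referent, category)

print(code_animacy("unicorn", "ABSTRACT INANIMATE"))
```

Writing the scheme down in this form forces the researcher to make every context-dependent decision once, explicitly, rather than case by case during coding.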

We have seen that operational definitions in corpus linguistics may differ
substantially in terms of their objectivity. Some operational definitions, like those
for length and discourse status, are almost comparable to physical measures like
Vickers Hardness in terms of objectivity and quantitativeness. Others, like those
for word senses or animacy, are more like the definitions in the DSM or the ICD
in that they leave room for interpretation, and thus for subjective choices, no
matter how precise the instructions for the identification of individual categories
are. Unfortunately, the latter type of operational definition is more common in
linguistics (and the social sciences in general), but there are procedures to deal
with the problem of subjectiveness at least to some extent. We will return to
these procedures in detail in the next chapter.

Further reading
A discussion of the role of operationalization in the context of corpus-based
semantics is found in Stefanowitsch (2010). Wulff (2003) is a study of adjective
order in English that operationalizes a variety of linguistic constructs in an
exemplary way. Zaenen et al. (2004) is an example of a detailed and extensive
coding scheme for animacy.
Stefanowitsch, Anatol. 2010. Empirical cognitive semantics: Some
thoughts. In Dylan Glynn & Kerstin Fischer (eds.), Quantitative methods
in cognitive semantics: Corpus-driven approaches, 355-380. Berlin, New
York: Mouton de Gruyter.
Wulff, Stefanie. 2003. A multifactorial corpus analysis of adjective order
in English. International Journal of Corpus Linguistics 8(2). 245-282.
Zaenen, Annie, et al. 2004. Animacy encoding in English: Why and how.
Proceedings of the 2004 ACL Workshop on Discourse Annotation, 118-125.
Stroudsburg, PA, USA: Association for Computational Linguistics.

5 Retrieval and Coding
Traditionally, many corpus-linguistic studies use the (orthographic) word form as
their starting point. This is at least in part due to the fact that corpora consist of
text that is represented as a sequence of word forms, and that, consequently,
word forms are easy to retrieve. As briefly discussed in Chapter 2, concordancing
programs offer at least the option of searching for a string of characters and
displaying the results as a list of hits in context. As we saw when discussing the
case of pavement in Chapter 3, searching for a string of characters will typically
give us both more and less than we need. A query for the string pavement will
give us less than we need, because a study involving the word pavement in the
sense of 'pedestrian footpath' requires us to look at all morphological variants (in
this case, the singular pavement, the plural pavements and, depending on how the
corpus is prepared, the possessive form pavement's) and orthographic and
typographic variants (for example, the variants with an uppercase P as it would
occur at the beginning of a sentence or in certain kinds of titles). A query for
pavement also gives us more than we need, because it will return not only hits
meaning 'pedestrian footpath', but also those meaning 'hard surface'.
In Chapter 3, we implicitly treated the first issue as a problem of retrieval,
making sure to define our corpus query in such a way as to capture all variants of
the word pavement; we treated the second issue as a problem of coding, going
through the search results one by one and determining from the context which
of the senses of pavement given by dictionaries like CALD and OALD we were
likely dealing with.
For research questions where one of the variables is a word (or a set of words),
this is probably the best strategy. However, as soon as we want to investigate
phenomena other than words, the issue of retrieval becomes very complex,
requiring careful thought and a number of decisions concerning an almost
inevitable trade-off between the quality of the results and the time needed to
retrieve them. This issue will be dealt with in Section 5.1. We already saw that the
issue of coding the data is complex even in the case of simple words, and the

88
Anatol Stefanowitsch

preceding chapter discussed some more complicated examples. is issue will be


dealt with in more detail in Section 5.2.

5.1 Retrieval
5.1.1 Corpus queries
Roughly speaking, there are two ways of searching a corpus for a particular
linguistic phenomenon: manually (i.e., by reading the texts contained in it, noting
down each instance of the phenomenon in question) or automatically (i.e., by

using a computer program to run a query on a machine-readable version of the
texts).
As discussed in Chapter 2, there may be cases where there is no alternative to
a manual search, but software-aided queries are the default in modern corpus
linguistics. There is a range of more or less specialized commercial and non-
commercial concordancing programs designed specifically for corpus linguistic
research, and there are many other software packages that may be used to search
text corpora even though they are not specifically designed for corpus-linguistic
research (some non-commercial software solutions are listed in Appendix A.3).
Finally, there are scripting languages like Perl, Python and R that can be used to
write programs capable of searching text corpora (ranging from very simple
two-liners to very complex solutions; see also Appendix A.3 for further information).
The power of software-aided searches depends on two things: on the
information contained in the corpus itself and on the pattern-matching capacities
of the software used to access them. In the simplest case (which we assumed to
hold in the examples discussed in Chapters 3 and 4), a corpus will contain plain
text in a standard orthography and the software will be able to find passages
matching a specific string of characters. Essentially, this is something every word
processor is capable of.
Most concordancing programs (and many other types of computer programs)
can do more than this, however. They typically allow the researcher to formulate
search queries that match not just one string, but a class of strings. One fairly
standardized way of achieving this is by using so-called regular expressions:
strings that may contain not just simple characters, but also symbols referring to
classes of characters or symbols affecting the interpretation of characters. For
example, the lexeme sidewalk has (at least) eight possible orthographic
representations: sidewalk, side-walk, Sidewalk, Side-walk, sidewalks, side-walks,
Sidewalks and Side-walks (in older texts, it is sometimes spelled as two separate

words, which means that we have to add at least side walk, side walks, Side walk
and Side walks when investigating such texts). In order to retrieve all occurrences
of the lexeme in Chapters 2 and 3, we could perform a separate query for each of
these strings, but I actually queried the string in (5.1a); a second example of
regular expressions is (5.1b), which represents one way of searching for all
inflected forms and spelling variants of the verb synthesize:

(5.1a) [Ss]ide-?walks?
(5.1b) synthesi[sz]e?[ds]?(ing)?

Any group of characters in square brackets is treated as a class (meaning that any
one of them will be treated as a match), and the question mark means 'zero or one'
of the preceding character. This means that the pattern in (5.1a) will match an
upper- or lower-case S, followed by i, d, and e, followed by zero or one occurrence
of a hyphen, followed by w, a, l, and k, followed by zero or one occurrence of s.
This matches all the variants of the word. For (5.1b), the [sz] ensures that both
the British spelling (with an s) and the American spelling (with a z) are found.
The question mark after e ensures that both the forms with an e (synthesize,
synthesizes, synthesized) and that without one (synthesizing) are matched. Next,
the string matches zero or one occurrence of a d or an s, followed by zero or one
occurrence of the string ing (because it is enclosed in parentheses, it is treated as
a unit for the purposes of the following question mark).
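The patterns in (5.1a-b) can be tried out in any regular expression engine. The following sketch uses Python's re module, requiring each candidate form to match in its entirety; the test forms are chosen for illustration.

```python
import re

# The patterns from (5.1a) and (5.1b); fullmatch requires that each
# candidate form match the pattern as a whole.
sidewalk = re.compile(r"[Ss]ide-?walks?")
synthesize = re.compile(r"synthesi[sz]e?[ds]?(ing)?")

for form in ["sidewalk", "Side-walks", "sideswalk"]:
    print(form, bool(sidewalk.fullmatch(form)))

# 'synthesis' and 'synthesizding' both match: the pattern overgeneralizes.
for form in ["synthesises", "synthesizing", "synthesis", "synthesizding"]:
    print(form, bool(synthesize.fullmatch(form)))
```

In a real corpus query, the pattern would of course be matched inside running text rather than against isolated forms, but the matching behavior is the same.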
Regular expressions allow us to formulate the kind of complex and abstract
search queries that we are likely to need when searching for words (rather than
individual word forms) and even more so when searching for more complex
R

expressions. But even the simple example in (5.1b) demonstrates a problem with
such search queries: they quickly overgeneralize. For example, the paern in
D

(5.1b) would also match some non-existing forms, like synthesizding, and, more
crucially, it will match existing forms that we may not want to include in our
search results, like the noun synthesis (see further Section 5.1.2).
The benefits of being able to define complex queries become even more
obvious if our corpus contains annotations in addition to the original text. If the
corpus contains part-of-speech tags, as many corpora do, this will allow us to
search (within limits) for grammatical structures. For example, assume that there
is a part-of-speech tag attached to the end of every word by an underscore (as it
is in many older corpora, including the BROWN corpus used in several
examples in previous chapters) and that the tags are as shown in (5.2) (following
the sometimes rather nontransparent BROWN naming conventions). We could

then search for prepositional phrases using a pattern like the one in (5.3):

(5.2) prepositions  _IN
      articles      _AT
      adjectives    _JJ (uninflected)
                    _JJR (comparative)
                    _JJT (superlative)
      nouns         _NN (common singular nouns)
                    _NNS (common plural nouns)
                    _NN$ (common nouns with possessive clitic)
                    _NP (proper names)
                    _NP$ (proper names with possessive clitic)

(5.3) \S+_IN( \S+_AT)?( \S+_JJ[RT]?)*( \S+_N[PN][S$]?)+
        1      2           3               4

The star means 'zero or more', the plus means 'one or more', and \S means any
non-whitespace character; the meaning of the other symbols is as before. This
pattern will match the following sequence (the numbers correspond to the indices
in (5.3)):

1. any word (i.e., sequence of non-whitespace characters) tagged as a
   preposition,
   followed by
2. zero or one occurrence of a word tagged as an article that is preceded by
   a whitespace,
   followed by
3. zero or more occurrences of a word tagged as an adjective (again
   preceded by a whitespace), including comparatives and superlatives
   (note that the JJ-tag may be followed by zero or one occurrence of a T or
   an R),
   followed by
4. one or more words (again preceded by a whitespace) that are tagged as a
   noun or proper name (note the square bracket containing the N for
   common nouns and the P for proper nouns), including plural forms and
   possessive forms (note that NN or NP tags may be followed by zero or
   one occurrence of an S or a $).
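Applied with a regular expression engine, the query in (5.3) extracts such tag sequences directly. The following Python sketch runs it over a short BROWN-style tagged sentence (the sentence and its tagging are invented for illustration):

```python
import re

# The pattern from (5.3); \S+ matches one tagged word, i.e. any
# sequence of non-whitespace characters.
pp = re.compile(r"\S+_IN( \S+_AT)?( \S+_JJ[RT]?)*( \S+_N[PN][S$]?)+")

tagged = "He_PPS sat_VBD on_IN the_AT old_JJ wooden_JJ benches_NNS ._."
match = pp.search(tagged)
print(match.group(0))  # prints: on_IN the_AT old_JJ wooden_JJ benches_NNS
```

Note that the match stops at the end of the noun sequence: the period token ._. is not tagged as a noun and therefore does not extend the match.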

Even though this query is already quite complex, it will not return all
prepositional phrases; for example, it will not match cases where the
adjective is preceded by one or more quantifiers (tagged _QT in the BROWN
corpus), adverbs (tagged _RB), or combinations of the two. It will also not return
cases with pronouns instead of nouns. These and other issues can be fixed by
augmenting the search pattern accordingly, although the increasing complexity
will bring problems of its own, to which we return in the next subsection.
Other problems are impossible to fix; for example, if the noun phrase inside
the prepositional phrase contains another PP, the pattern will not recognize it as
belonging to the NP but will treat it as a new match, and there is nothing we can
do about this, since there is no difference between the sequence of POS tags in
structures like (5.4a), where the PP off the kitchen is a complement of the noun
room and as such is part of the NP inside the first PP, and (5.4b), where the PP at
a party is an adjunct of the verb standing and as such is not part of the NP
preceding it:

(5.4a) A mill stands in a room off the kitchen. [BROWN F04]
(5.4b) He remembered the first time he saw her, standing across the room at a
party. [BROWN P28]
In order to distinguish between these cases in a query, we would need a corpus
annotated not just for parts of speech but also for phrase structures (a so-called
parsed corpus, such as the Penn Treebank, the British Component of the
International Corpus of English (ICE-GB) or the SUSANNE corpus).
5.1.2 Precision and recall


In arriving at the definition of corpus linguistics adopted in this book, we stressed
the need to investigate linguistic phenomena exhaustively, which we took to
mean taking into account all examples of the phenomenon in question (cf.
Chapter 2). In order to take into account all examples of a phenomenon, we have
to retrieve them first. However, as we saw in the preceding section and in
Chapter 3, it is not always possible to define a corpus query in a way that will
retrieve all and only the occurrences of a particular phenomenon. Instead, a
query can have four possible outcomes: it may

1. include hits that are instances of the phenomenon we are looking for
   (these are referred to as true positives);
2. include hits that are not instances of our phenomenon (these are
   referred to as false positives);
3. fail to include instances of our phenomenon (these are referred to as
   false negatives); or
4. fail to include strings that are not instances of our phenomenon (this is
   referred to as a true negative).

Table 5-1 summarizes these outcomes.

.1
Table 5-1: Four possible outcomes of a corpus query for a phenomenon X
SEARCH RESULT
INCLUDED NOT INCLUDED Total

X
v0
True Positive (TP) False Negative (FN) P
CORPUS

(hit) (miss)

X
False Positive (FP) True Negative (TN) N
(incorrectly included) (correctly rejected)
T

Obviously, the rst case (TP) and the fourth case (TN) are desirable outcomes:
AF

we want our search results to include all instances of the phenomenon under
investigation and exclude everything that is not such an instance. e second
case (FN) and the third case (FP) are undesirable outcomes: we do not want our
search to overlook instances of our phenomenon and we do not want our search
R

results to include strings that are not instances of it.


We can thus describe the quality of a data set that we have retrieved from a
corpus in terms of two measures. First, the proportion of positives (i.e., strings
found by our search) that are true positives; this is referred to as precision (or as
the positive predictive value, cf. 5.5a). Second, the proportion of all instances of
our phenomenon that are true positives (i.e., that were actually found by our
search); this is referred to as recall (cf. 5.5b):

14 There are two additional measures that are important in other areas of empirical
research but do not play a central role in data retrieval. First, the specificity or true
negative rate: the proportion of negatives that are correctly rejected (i.e., true
negatives); second, the negative predictive value: the proportion of negatives
(i.e., cases not included in our search) that are true negatives (i.e., that are correctly
rejected). These measures play a role in situations where a negative outcome of a test
is relevant (for example, with medical diagnoses); in corpus linguistics, this is
generally not the case.

(5.5a) Precision = True Positives / (True Positives + False Positives)

(5.5b) Recall = True Positives / (True Positives + False Negatives)

Ideally, the value of both measures should be 1, i.e., our data should include all
cases of the phenomenon under investigation (a recall rate of 100 percent) and it
should include nothing that is not a case of this phenomenon (a precision of 100
percent). However, unless we carefully search our corpus manually (a possibility I
will return to below), there is typically a trade-off between the two. Either we
devise a query that matches only clear cases of the phenomenon we are
interested in (high precision) but that will miss all less clear cases (low recall). Or
we devise a query that matches as many potential cases as possible (high recall),
but that will include many cases that are not instances of our phenomenon (low
precision).
Let us look at a specific example, the English ditransitive construction, and let
us assume that we have an untagged and unparsed corpus. How could we retrieve
instances of the ditransitive? As the first object of a ditransitive is usually a
pronoun (in the objective case) and the second a lexical NP (see, for example,
Thompson and Koide 1987), one possibility would be to search for a pronoun
followed by a determiner (i.e., for any member of the set of strings in (5.6a)
followed by any member of the set of strings in (5.6b)). This gives us the query in
(5.6c), which is long, but not very complex:

(5.6a) me, you, him, her, it, us, them



(5.6b) the, a, an, this, that, these, those, some, many, lots, my, your, his, her, its,
our, their, something, anything
(5.6c) (me|you|him|her|it|us|them) (the|a|an|this|that|these|those|
some|many|lots|my|your|his|her|its|our|their|something|
anything)
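In a concordancer that supports regular expressions, this query can be entered more or less as it stands. The following Python sketch assembles the same query from the sets in (5.6a-b) and applies it to two invented clauses; the word boundaries (\b) are an addition of this sketch, preventing, for example, her from matching inside other.

```python
import re

# The query in (5.6c), assembled from the sets in (5.6a) and (5.6b);
# the example clauses below are invented.
pronouns = "me|you|him|her|it|us|them"
determiners = ("the|a|an|this|that|these|those|some|many|lots|my|your|"
               "his|her|its|our|their|something|anything")
query = re.compile(r"\b(%s) (%s)\b" % (pronouns, determiners))

print(bool(query.search("She gave him a book .")))     # pronoun + determiner
print(bool(query.search("She gave a book to him .")))  # no such sequence
```

The second clause is a prepositional dative and is correctly not matched; as discussed below, however, many genuine ditransitives are missed as well.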

Let us apply this query (which is actually used in Colleman and De Clerck 2011) to
a freely available sample from the ICE-GB mentioned above (see Appendix A.1).


This corpus has been manually annotated, amongst other things, for argument
structure, so that we can check the results of our search against this annotation.
There are 36 ditransitive clauses in the sample, 13 of which are returned
by our query. There are also 2,838 clauses that are not ditransitive, 14 of which
are also returned by our query. Table 5-2 shows the results of the query in terms
of true and false positives and negatives:

Table 5-2: Comparison of the search results

                           SEARCH RESULT
                  INCLUDED            NOT INCLUDED        Total
CORPUS    DITR.   13                  23                  36
                  true positives      false negatives
          OTHER   14                  2824                2838
                  false positives     true negatives

          Total   27                  2847                2874


We can now calculate the precision and recall rate of our search pattern:

(5.7a) Precision = 13/27 = 0.48

(5.7b) Recall = 13/36 = 0.36
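The calculation in (5.7a-b) can be expressed as a few lines of Python; the counts are those from Table 5-2 (13 true positives, 14 false positives, 23 false negatives):

```python
# Precision and recall as defined in (5.5a-b), computed for the counts
# in Table 5-2.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

print(round(precision(13, 14), 2))  # 0.48
print(round(recall(13, 23), 2))     # 0.36
```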

Clearly, neither precision nor recall are particularly good. Let us look at the
reasons for this, beginning with precision.
While the sequence of a pronoun and a determiner is typical for (one type of)
ditransitive clause, it is not unique to the ditransitive, as the following false
positives of our corpus search show:

(5.8a) one of the experiences that went towards making me a Christian


(5.8b) I still ring her a lot.
(5.8c) I told her that I 'd had to take these tablets
(5.8d) It seems to me that they they tend to come from
(5.8e) Do you need your caffeine fix before you this


Two other typical structures characterized by the sequence pronoun-determiner
are object-complement clauses (cf. 5.8a) and clauses with quantifying noun
phrases (cf. 5.8b). In addition, some of the strings in (5.6b) are ambiguous, i.e.,
they can represent parts of speech other than determiner; for example, that can
be a conjunction, as in (5.8c), which otherwise fits the description of a
ditransitive, and in (5.8d), which does not. Finally, especially in spoken corpora,
there may be fragments that match particular search criteria only accidentally (cf.
5.8e). Obviously, a corpus tagged for parts of speech could improve the precision
of our search results somewhat, by excluding cases like (5.8c-d), but others, like
(5.8a), could never be excluded, since they are identical to the ditransitive as far
as the sequence of parts-of-speech is concerned.
Of course, it is relatively trivial to increase the precision of our search results:
we can manually discard all false positives, which would increase precision to the
maximum value of 1. Typically, our data will have to be manually coded for
various criteria anyway, allowing us to discard false positives in the process.
However, the larger our data set, the more time-consuming this will become, so
that precision should always be a consideration even at the stage of data retrieval.
Let us now look at the reasons for the recall rate, which is even worse than the
precision. There are, roughly speaking, four types of ditransitive structures that
our search pattern misses, exemplified in (5.9a-e):

(5.9a) How much money have they given you?
(5.9b) The centre [...] has also been granted a three-year repayment freeze.
(5.9c) He gave the young couple his blessing.
(5.9d) They have just given me enough to last this year.
(5.9e) He finds himself [...] offering Gradiva flowers.

The first group of cases are those where the second object does not appear in its
canonical position, for example in interrogatives and other cases of left-
dislocation (cf. 5.9a), or passives (5.9b). The second group of cases are those where
word order is canonical, but either the first object (5.9c) or the second object
(5.9d) or both (5.9e) do not correspond to the search pattern.
Note that, unlike precision, the recall rate of a search pattern cannot be
increased after the data have been extracted from the corpus. Thus, an important
aspect in constructing a query is to annotate a random sample of our corpus
manually for the phenomenon we are interested in, and then to check our query
against this manual annotation. This will not only tell us how good or bad the
recall of our query is, it will also provide information about the most frequent
cases we are missing. Once we know this, we can try to revise our query to take
these cases into account. In a POS-tagged corpus, we could, for example, search
for a sequence of a pronoun and a noun in addition to the sequence pronoun-
determiner that we used above, which would give us cases like (5.9d), or we could
search for forms of be followed by a past participle followed by a determiner or
noun, which would give us passives like those in (5.9b).
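Such a check can be sketched in a few lines of Python; the clause IDs below are invented, with gold standing for the ditransitives identified by hand in the annotated sample and found for the hits returned by the query:

```python
# Comparing query results against a manually annotated gold sample;
# the misses are the cases to inspect when revising the query.
gold = {3, 7, 12, 19, 25, 31}   # ditransitives identified by hand
found = {3, 7, 9, 19}           # clauses returned by the query

recall = len(gold & found) / len(gold)
missed = sorted(gold - found)

print(recall)   # 0.5
print(missed)   # [12, 25, 31]
```

Precision on the sample can be computed in the same way from the hits that are not in the gold set.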
In some cases, however, there may not be any additional patterns that we can
reasonably search for. In the present example with an untagged corpus, for
example, there is no additional pattern that seems in any way promising. In such
cases, we have two options for dealing with low recall: First, we can check (in our
manually annotated subcorpus) whether the data recalled differ from the data not
recalled in any way significant for our research question. If this is not the case,
we might decide to continue working with a low recall and hope that our results
are still generalizable (Colleman and De Clerck 2011, for example, are mainly
interested in the question which classes of verbs were used ditransitively at what
time in the history of English, a question that they were able to discuss
insightfully based just on the subset of ditransitives matching their search
pattern).
If our data do differ significantly along one or more of the dimensions relevant to our research project, we might have to increase the recall at the expense of precision and spend relatively more time weeding out false positives. In the most extreme case, this might mean extracting the data manually, i.e. checking the entire corpus word by word or clause by clause for the phenomenon we are interested in (as Colleman 2006 did when he manually searched an untagged one-million-word corpus of Dutch for ditransitives).



5.1.3 Manual, semi-manual and automatic searches


In Chapter 2, we touched on the issue of manual vs. automatic corpus searches.
Let us briefly return to it in light of the issues raised by the example discussed in
the preceding subsection.
In theory, the highest quality search results would be achieved by a kind of
close reading, i.e. a careful word-by-word (or phrase by phrase, clause by clause)
inspection of the corpus. In some cases, this may be the only feasible option,
either because an automatic retrieval is dicult (as in the example above), or
because an automatic retrieval is impossible.
In the case of words and in at least some cases of grammatical structures, the quality of automatic search results may be increased by automatically preprocessing the corpus, adding part-of-speech tags, phrase tags or even a full-fledged syntactic annotation. This strategy brings its own set of problems, as automatic tagging and grammatical parsing are far from perfect. For example, it has been estimated that 1.5 percent of all words in the British National Corpus are tagged incorrectly, and for a further 3.5 percent the automatic tagger was not able to make a decision, assigning two possible tags (Leech et al. 1994). In other words, 95 percent of the words in the corpus are tagged correctly and unambiguously.
While this still sounds impressive at first, a closer look reveals two problems: first, an accuracy of 95 percent means that roughly one word in 20 is tagged incorrectly. Assuming a mean sentence length of 20 words (actual estimates range from 16 to 22), every sentence contains on average one incorrectly or ambiguously tagged word. Second, the incorrectly assigned tags are unlikely to be distributed randomly across parts of speech. For example, the word the is unlikely to be tagged incorrectly (there is only one possible tag); instead, errors will cluster around certain difficult cases (like ing-forms of verbs, which may be participles, adjectives or nouns). They are also unlikely to be distributed randomly across grammatical constructions. For example, the word form regard is systematically tagged incorrectly as a verb in the complex prepositions with regard to and in regard to, but is correctly tagged as a noun in most instances of the phrase in high regard. Thus, particular linguistic phenomena will be severely misrepresented in corpus searches based on automatic tagging or parsing. Still, an automatically preprocessed corpus will certainly allow us to define searches whose precision and recall are higher than in the example above.
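A quick back-of-the-envelope calculation makes the first point concrete. It assumes, counterfactually (as just discussed), that tagging problems are spread evenly over words:

```python
# Rough arithmetic behind the "one problematic word per sentence" claim,
# assuming tagging errors are independent and evenly distributed.

accuracy = 0.95        # proportion of correctly and unambiguously tagged words
sentence_length = 20   # assumed mean sentence length in words

# Expected number of problematic tags per sentence: about 1.
errors_per_sentence = sentence_length * (1 - accuracy)

# Probability that a 20-word sentence contains no problematic tag at all:
# 0.95 ** 20, i.e. only about 36 percent of sentences are fully correct.
error_free_sentences = accuracy ** sentence_length
```

In other words, even under the most charitable assumptions, almost two thirds of all sentences contain at least one tagging problem.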

In the case of many other phenomena, however, automatic preprocessing is simply not possible, or will yield a quality so low that it simply does not make sense to base queries on it. For example, linguistic metaphors are almost impossible to identify automatically, as they have few or no properties that systematically set them apart from literal language. Consider the following examples of the metaphors ANGER IS HEAT/A (HOT) LIQUID (from G. Lakoff and Kövecses 1987):

(5.10a) Boy, am I burned up.


(5.10b) He's just letting off steam.
(5.10c) I had reached the boiling point.

The first problem is that while the expressions in (5.10a-c) may refer to feelings of anger or rage, they can also occur in their literal meaning, as the corresponding authentic examples in (5.11a-c) show:

(5.11a) "Now, after I am burned up," he said, snatching my wrist, "and the fire is out, you must scatter the ashes." [Anne Rice, The Vampire Lestat]
(5.11b) As soon as the driver saw the train which had been hidden by the curve, he let off steam and checked the engine. [Galignani, Accident on the Paris and Orleans Railway]
(5.11c) Heat water in saucepan on highest setting until you reach the boiling point and it starts to boil gently. [www.sugarfreestevia.net]

Clearly, there can be no search pattern that would find the examples in (5.10) but not those in (5.11). In contrast, it is very easy for a human to recognize the examples in (5.11) as literal. If we are explicitly interested in metaphors involving liquids and/or heat, we could choose a semi-manual approach, first extracting all instances of words from the field of liquids and/or heat and then discarding all cases that are not metaphorical. This type of approach is used quite fruitfully, for example, by Deignan (2005), amongst others.
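The semi-manual strategy just described can be sketched as follows. The word list is a deliberately tiny, hypothetical stand-in for a real operationalization of the semantic field(s), and the function does no more than produce hits with context for a human coder to sort into literal and metaphorical uses:

```python
import re

# Semi-manual retrieval sketch: extract every occurrence of a word from the
# target field(s) with some context; the literal/metaphorical decision is
# left to a human coder. FIELD_TERMS is illustrative, not exhaustive.

FIELD_TERMS = ["boil", "boiling", "steam", "burn", "burned", "simmer"]
PATTERN = re.compile(r"\b(?:" + "|".join(FIELD_TERMS) + r")\b", re.IGNORECASE)

def candidates(text, window=30):
    """Return (hit, left context, right context) triples for manual coding."""
    hits = []
    for m in PATTERN.finditer(text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        hits.append((m.group(0), left, right))
    return hits
```

Applied to (5.10a) and (5.11c), this would return both the metaphorical and the literal hit; only the human pass separates them.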
If we are interested in metaphors of anger in general, however, this approach will not work, since we have no way of knowing beforehand which semantic fields to include in our search. This is precisely the situation where an exhaustive retrieval can only be achieved by a manual corpus search, i.e., by reading the entire corpus and deciding for each word, phrase or clause whether it constitutes an example of the phenomenon we are looking for. Thus, it is not surprising that many corpus-linguistic studies on metaphor are based on manual searches (see, for example, Semino and Masci 1996 or Jäkel 1997 for very thorough early studies of this type).
However, as mentioned in Chapter 2, manual searches are very time-consuming and this limits their practical applicability: either we search large corpora, in which case manual searching is going to take more time and human resources than are realistically available, or we perform the search in a realistic time-frame and with the human resources realistically available, in which case we have to limit the size of our corpus so severely that the search results can no longer be considered representative of the language as a whole. Thus, manual searches are useful mainly in the context of research projects looking at a linguistic phenomenon in some clearly defined subtype of language (for example, metaphor in political speeches, Charteris-Black 2005).
When searching corpora for such hard-to-retrieve phenomena, it may sometimes be possible to limit the analysis usefully to a subset of the available data. As mentioned in the preceding subsection, limiting ourselves only to cases of the ditransitive that are declarative, in the active voice and that display canonical word order may still give us valuable data about lexical preferences (provided our corpus is large enough). To take up the example of metaphors introduced above, consider the following examples, which are quite close in meaning to the corresponding examples in (5.10a-c) above (also from G. Lakoff and Kövecses 1987):

(5.12a) He was consumed by his anger.

(5.12b) He was filled with anger.
(5.12c) She was brimming with rage.
v0
In these cases, the PPs by/with anger/rage make it clear that consume, (be) filled and brimming are not used literally. If we limit ourselves just to metaphorical expressions of this type, i.e. expressions that explicitly mention the two semantic fields involved in the metaphorical expression, then it becomes possible to search semi-automatically for all metaphors of anger by retrieving all instances of the words anger and rage (as well as fury, ire, and annoyance, and perhaps irritation and outrage, depending on how broadly we operationalize the concept ANGER). Such a semi-automatic approach has been pursued, for example, by Tissari (2003) and Stefanowitsch (2004, 2006; see further Chapter 13, Section 13.2.2.1). If we are interested in specific metaphors, the approach can even be automatized by searching for phrases or clauses that contain vocabulary from the two semantic fields involved (see Martin 2006). As in the case of the ditransitive, this retrieval strategy is only useful if we can show that the results are comparable to the results we would get if we extracted the phenomenon exhaustively (in the case of metaphor, Stefanowitsch 2006 suggests that they are, indeed, comparable).
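The automatized variant can be sketched as a simple co-occurrence filter: keep only those stretches of text that contain vocabulary from both semantic fields. The two word sets below are illustrative stand-ins for a proper operationalization of the fields:

```python
import re

# Keep only clauses containing vocabulary from BOTH semantic fields.
# The word sets are hypothetical, deliberately small illustrations.

ANGER = {"anger", "rage", "fury", "ire", "annoyance"}
HEAT_LIQUID = {"boiling", "brimming", "filled", "consumed", "simmering"}

def both_fields(clause):
    """True if the clause contains at least one word from each field."""
    words = {w.lower() for w in re.findall(r"[a-zA-Z]+", clause)}
    return bool(words & ANGER) and bool(words & HEAT_LIQUID)
```

A clause like "The glass was filled with water" is correctly excluded, while "He was filled with anger" is retained; literal co-occurrences of both fields would, of course, still have to be weeded out manually.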
To sum up, it depends on the phenomenon under investigation and on the research question whether we can take an automatic or at least a semi-automatic approach or whether we have to resort to manual data extraction. Obviously, the more exhaustively we can extract our object of research from the corpus, the better.

5.2 Coding
Once the data have been extracted from the corpus (and, if necessary, false positives have been removed), they typically have to be coded in terms of the variables relevant for the research question. In some cases, the variables and their values will be provided externally; they may, for example, follow from the structure of the corpus itself (as in the case of BRITISH ENGLISH vs. AMERICAN ENGLISH, defined as occurring in the LOB corpus and in the BROWN corpus, respectively). In other cases, the variables and their values may have been operationalized in terms of criteria that can be applied objectively (as in the case of LENGTH defined as number of letters). In most cases, however, some degree of interpretation will be involved (as in the case of ANIMACY or the metaphors discussed above). Whatever the case, we need a coding scheme: an explicit statement of the operational definitions applied. Of course, such a coding scheme is especially important in cases where interpretative judgments are involved in categorizing the data. In this case, the coding scheme should contain not just operational definitions, but also explicit guidelines as to how these definitions should be applied to the corpus data. These guidelines must be explicit enough to ensure a high degree of agreement if different coders apply them to the same data. Let us look at each of these aspects in some detail.
T
5.2.1. Coding as interpretation
First of all, it is necessary to understand that the categorization of corpus data is an interpretative process in the first place. This is true regardless of the type of category.
Even externally given categories are typically given an interpretation in the context of a specific research project. In the simplest case, this consists in accepting the operational definitions used by the makers of a particular corpus (as well as the interpretative judgments made in applying them). Take the example of BRITISH ENGLISH and AMERICAN ENGLISH used in Chapters 3 and 4: if we accept the idea that the LOB corpus contains British English, we are accepting an interpretation of language varieties that is based on geographical criteria: British English means the English spoken by people who live (perhaps also: who were born and grew up) in the British Isles.
Or take the example of SEX, one of the demographic speaker variables included in many modern corpora: by accepting the values of this variable that the corpus provides (typically MALE and FEMALE), we are accepting a specific interpretation of what it means to be male or female. In some corpora, this may be the interpretation of the speakers themselves (i.e., the corpus creators may have asked the speakers to specify their sex), in other cases it may be the interpretation of the corpus creators (based, for example, on the first names of the speakers or on the assessment of whoever collected the data). For many speakers in a corpus, these different interpretations will presumably match, so that we can accept whatever interpretation was used as an approximation of our own operational definition of SEX. But in research projects that are based on a specific understanding of SEX (for example, as a purely biological, a purely social or a purely psychological category), simply accepting the (often unstated) operational definition used by the corpus creators may distort our results substantially. The same is true of other demographic variables, like education, income, social class etc., which are often defined on a common-sense basis that does not hold up to the current state of sociological research.
Interpretation also plays a role in the case of seemingly objective criteria. Even though a criterion such as "number of letters" is largely self-explanatory, there are cases requiring interpretative judgments that may vary across researchers. In the absence of clear instructions, they may not know, among other things, whether to treat ligatures as one or two letters, whether apostrophes or word-internal hyphens are supposed to count as letters, or how to deal with spelling variants (for example, in the BNC the noun programme also occurs in the variant program, which is shorter by two letters). This type of orthographic variation is very typical of older stages of English (before there was a somewhat standardized orthography), which causes problems not only for retrieval (cf. the discussion in Section 5.1.1 above, cf. also Barnbrook 1996, Ch. 8.2 for more detailed discussion), but also for a reasonable application of the criterion "number of letters".
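To make the point concrete, here is one way such decisions could be written down as code. The choices embodied in it (apostrophes and hyphens do not count, and a ligature character such as "æ" counts as a single letter) are exactly that, choices, which a coding scheme would have to state and justify rather than take for granted:

```python
# One possible operationalization of the "number of letters" criterion.
# Decisions made here: apostrophes and hyphens are not letters; a ligature
# character like "æ" counts as one letter. A coding scheme must state these.

def letter_count(word):
    """Count alphabetic characters only, ignoring apostrophes and hyphens."""
    return sum(1 for ch in word if ch.isalpha())
```

Under these decisions, don't has four letters, vice-president has thirteen, and the programme/program variants differ by two, as noted above.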
In such cases, the role of interpretation can be reduced by including explicit instructions for dealing with potentially unclear cases. However, we may not have thought of all potentially unclear cases before we start coding our data, which means that we may have to amend our coding scheme as we go along. In this case, it is important to check whether our amendments have an effect on the data we have already coded, and to recode them if necessary.
In cases of less objective criteria (such as ANIMACY, discussed in Chapter 4 above), the role of interpretation is obvious. No matter how explicit our coding scheme, we will come across cases that are not covered and will require individual decisions; and even the clear cases are always based on an interpretative judgment. As mentioned in Chapter 1, this is not necessarily undesirable in the same way that intuitive judgements about acceptability are undesirable; interpreting linguistic utterances is a natural activity in the context of using language. Thus, if our operational definitions of the relevant variables and values are close to the definitions speakers implicitly apply in everyday linguistic interactions, we may get a high degree of agreement even in the absence of an explicit coding scheme,15 and certainly, operational definitions should strive to retain some degree of linguistic naturalness in the sense of being anchored in interpretation processes that plausibly occur in language processing.

5.2.2. Coding schemes


We can think of a linguistic coding scheme as a comprehensive operational definition for a particular variable, with detailed instructions as to how the values of this variable should be assigned to linguistic data (in our case, corpus data, but of course coding schemes are also needed to categorize experimentally elicited linguistic data). In order to keep different research projects in a particular area comparable, it is of course desirable to create coding schemes independently of a particular research project. However, the field of corpus linguistics is not yet well-established and methodologically mature enough to have yielded uncontroversial and widely applicable coding schemes for most linguistic phenomena. There are some exceptions, such as the part-of-speech tagsets and the parsing schemes used by various widespread automatic taggers and parsers, which have become de facto standards by virtue of being easily applied to new data; there are also some substantial attempts to create coding schemes for manual coding of phenomena like topicality (cf. Givón 1983), animacy (cf. Zaenen et al. 2004), and the grammatical description of English sentences (e.g., Sampson 1995).
Whenever it is feasible, we should use existing coding schemes instead of creating our own; searching the literature for such schemes should be a routine step in the planning of a research project. Often, however, such a search will come up empty, or existing coding schemes will not be suitable for the specific data we plan to use, or they may be incompatible with our theoretical assumptions. In these cases, we have to create our own coding schemes.


The first step in creating a coding scheme for a particular variable consists in deciding on a set of values that this variable may take. As the example of ANIMACY in Chapter 4 shows, this decision is loosely constrained by our general operational definition, but the ultimate decision is up to us and must be justified within the context of our theoretical assumptions and our specific research question.

15 In fact, it may be worth exploring, within a corpus-linguistic framework, ways of coding data that are based entirely on implicit decisions by untrained speakers; specifically, I am thinking here of the kinds of association tasks and sorting tasks often used in psycholinguistic studies of word meaning.
There are, in addition, several general criteria that the set of values for any variable must meet. First, they must be non-overlapping. This may seem obvious, but it is not at all unusual, for example, to find continuous dimensions split up into overlapping categories, as in the following quotation:

Hunters aged 15-25 years old participated more in non-consumptive activities than those aged 25-35 and 45-65 (P<0.05), as were those aged 35-45 compared to those 55-65 years old (P<0.05). (Ericsson and Heberlein 2002: 304)

Here, the authors obviously summarized the ages of their subjects into the following four classes: (I) 25-35, (II) 35-45, (III) 45-55 and (IV) 55-65: thus, subjects aged 35 could be assigned to class I or class II, subjects aged 45 to class II or class III, and subjects aged 55 to class III or class IV. This must be avoided, as different coders might make different decisions, and as other researchers attempting to replicate the research will not know how we categorized such cases.
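The overlap can be avoided by defining the classes as half-open intervals, so that every age falls into exactly one class. A minimal sketch (the upper bound of 66 is chosen so that 65-year-olds are still included in the last class):

```python
# Non-overlapping age classes as half-open intervals [lower, upper): the
# boundary cases (35, 45, 55) that the overlapping scheme leaves undecided
# are settled in advance, so different coders cannot disagree about them.

AGE_CLASSES = [(25, 35, "I"), (35, 45, "II"), (45, 55, "III"), (55, 66, "IV")]

def age_class(age):
    for lower, upper, label in AGE_CLASSES:
        if lower <= age < upper:
            return label
    return None  # outside the range covered by the coding scheme
```

With this definition, a 35-year-old is unambiguously in class II, a 45-year-old in class III, and a 55-year-old in class IV.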
Second, the variable should be defined such that it does not conflate properties that are potentially independent of each other, as this will lead to a set of values that do not fall along a single dimension. As an example, consider the so-called Silverstein Hierarchy used to categorize nouns for (inherent) Topicality (after Deane 1987: 67):

(5.13) 1st person pronoun
2nd person pronoun
3rd person pronoun
3rd person demonstrative
Proper name
Kin-Term
Human and animate NP
Concrete object
Container
Location
Perceivable
Abstract


Note, first, that there is a lot of overlap in this coding scheme. For example, a first or second person pronoun will always refer to a human or animate NP and a third person pronoun will frequently do so, as will a proper name or a kin term. Similarly, a container is a concrete object and can also be a location, and everything above the category Perceivable is also perceivable. This overlap can only be dealt with by an instruction of the kind that every nominal expression should be put into the topmost applicable category; in other words, we need to add an "except for expressions that also fit into one of the categories above" to every category label.
Secondly, although the Silverstein Hierarchy may superficially give the impression of providing values of a single variable that could be called TOPICALITY, it is actually a mixture of several quite different variables and their possible values. One attempt at disentangling these variables and giving them each a set of plausible values is the following:

(5.14a) TYPE OF NOMINAL EXPRESSION
PRONOUN > PROPER NAME > KINSHIP TERMS > LEXICAL NP
(5.14b) DISCOURSE ROLE
SPEAKER > HEARER > OTHER (NEAR > FAR)
(5.14c) ANIMACY/AGENCY
HUMAN > ANIMATE > INANIMATE
(5.14d) CONCRETENESS
TOUCHABLE > NON-TOUCHABLE CONCRETE > ABSTRACT
(5.14e) GESTALT STATUS
OBJECT > CONTAINER > LOCATION

Given this set of variables, it is possible to describe all categories of the



Silverstein Hierarchy as a combination of values of these variables, for example:

(5.15a) 1st Person Pronoun: PRONOUN + SPEAKER + HUMAN + TOUCHABLE + OBJECT


(5.15b) Concrete Object: LEXICAL NP + OTHER + INANIMATE + TOUCHABLE + OBJECT

The set of variables in (5.14) also allows us to differentiate between expressions that the Silverstein Hierarchy lumps together; for example, a 3rd person pronoun could be categorized as (5.16a), (5.16b), (5.16c) or (5.16d), depending on whether it referred to a mouse, a rock, air or democracy:

(5.16a) PRONOUN + OTHER + ANIMATE + TOUCHABLE + OBJECT
(5.16b) PRONOUN + OTHER + INANIMATE + TOUCHABLE + OBJECT
(5.16c) PRONOUN + OTHER + INANIMATE + NON-TOUCHABLE + OBJECT (or perhaps LOCATION, cf. in the air)
(5.16d) PRONOUN + OTHER + INANIMATE + ABSTRACT + OBJECT (or perhaps LOCATION, cf. in a democracy)

There are two advantages of this more complex coding scheme. First, it allows a more principled categorization of individual expressions: the variables and their values are easier to define and there are fewer unclear cases. Second, it would allow us to determine empirically which of the variables are actually relevant in the context of a given research question, as irrelevant variables will not show a significant distribution across different conditions. Originally, the Silverstein Hierarchy was meant to allow for a principled description of split ergative systems; it is possible that this specific conflation of variables is suitable to that task. However, it is an open question whether the same conflation of variables is also suitable to the analysis of other phenomena. If we were to apply it as is, we would not be able to tell whether this is the case. Thus, we should always define our variables in terms of a single dimension and deal with complex concepts (like TOPICALITY) by analyzing the data in terms of a set of such variables.
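One way of implementing this decomposition is to code each nominal expression as a record with one field per dimension. The field and value names below simply transcribe (5.14); they are not an established standard:

```python
from dataclasses import dataclass

# Each nominal expression is coded as a record with one field per dimension
# of (5.14), instead of a single position on a one-dimensional hierarchy.

@dataclass
class NominalCoding:
    nominal_type: str    # PRONOUN > PROPER NAME > KINSHIP TERMS > LEXICAL NP
    discourse_role: str  # SPEAKER > HEARER > OTHER
    animacy: str         # HUMAN > ANIMATE > INANIMATE
    concreteness: str    # TOUCHABLE > NON-TOUCHABLE CONCRETE > ABSTRACT
    gestalt_status: str  # OBJECT > CONTAINER > LOCATION

# "it" referring to a mouse (5.16a) vs. "it" referring to democracy (5.16d):
it_mouse = NominalCoding("PRONOUN", "OTHER", "ANIMATE", "TOUCHABLE", "OBJECT")
it_democracy = NominalCoding("PRONOUN", "OTHER", "INANIMATE", "ABSTRACT", "OBJECT")
```

Because each dimension is stored separately, the relevance of each one for a given research question can later be tested on its own.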
After defining a variable (or set of variables) and deciding on the type and number of values, the second step in creating a coding scheme consists in defining what belongs in each category. Where necessary, this should be done in the form of a decision procedure.
For example, the coding scheme for ANIMACY mentioned in the preceding chapter (Garretson 2004, Zaenen et al. 2004) has the categories HUMAN and ORGANIZATION (among others). The category HUMAN is relatively self-explanatory, as we tend to have a good intuition about what constitutes a human. Nevertheless, the coding scheme spells out that it does not matter by what linguistic means humans are referred to (e.g., proper names, common nouns including kinship terms, and pronouns) and that dead, fictional or potential future humans are included, as well as humanoid entities like gods, elves, ghosts, and androids.
The category ORGANIZATION is much more complex to apply consistently, since there is no intuitively accessible and generally accepted understanding of what constitutes an organization. In particular, it needs to be specified what distinguishes an ORGANIZATION from other groups of human beings (which are to be categorized as HUMAN according to the coding scheme). The coding scheme defines an ORGANIZATION as a referent involving more than one human with some degree of group identity. It then provides the following hierarchy of properties that a group of humans may have (where each property implies the presence of all properties below its position in the hierarchy):

(5.17) +/- chartered/official
+/- temporally stable
+/- collective voice/purpose
+/- collective action
+/- collective

It then states that any group of humans at "+ collective voice" or higher should be categorized as ORGANIZATION, while those below should simply be coded as HUMAN. By listing properties that a group must have to count as an organization in the sense of the coding scheme, the decision is simplified considerably, and by providing a decision procedure, the number of unclear cases is reduced. The coding scheme also illustrates the use of the hierarchy:

Thus, while the posse would be an ORG, the mob might not be, depending on whether we see the mob as having a collective purpose. The crowd would not be considered ORG, but rather simply HUMAN.
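Because the properties in (5.17) form an implicational hierarchy, the decision procedure reduces to a single comparison. A sketch, assuming each group has already been coded for the highest property it reaches:

```python
# The ORGANIZATION decision procedure as code, assuming the implicational
# hierarchy in (5.17): each property entails all properties below it, so a
# group can be represented by the highest property it reaches.

PROPERTIES = [                   # ordered from highest to lowest
    "chartered/official",
    "temporally stable",
    "collective voice/purpose",
    "collective action",
    "collective",
]

CUTOFF = PROPERTIES.index("collective voice/purpose")

def code_group(highest_property):
    """ORGANIZATION at '+ collective voice/purpose' or higher, else HUMAN."""
    return "ORGANIZATION" if PROPERTIES.index(highest_property) <= CUTOFF else "HUMAN"
```

On this sketch, a posse (collective purpose and more) comes out as ORGANIZATION, while a crowd (merely collective) comes out as HUMAN, matching the examples in the quotation.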
Whether or not to include such specific examples is a question that must be answered in the context of particular research projects. One advantage is that examples may help the coders understand the coding scheme. A disadvantage is that examples may be understood as prototypical cases against which the referents in the data are to be matched, which may lead coders to ignore the definitions and decision procedures.
The third step consists in testing the reliability of our coding scheme, a point I will return to in the next and final subsection of this chapter. Before I do so, I will briefly address a practical question: that of where and how the coding decisions should be recorded.

5.2.3 Data storage


There are, broadly, two ways of storing our data and coding decisions: in the corpus itself or in a database containing the extracted results of our query. The first option is routinely chosen in the case of automatically coded variables like PART OF SPEECH, as in the following excerpt from the BROWN corpus:


(5.18) the_AT fact_NN that_CS Jess's_NP$ horse_NN had_HVD not_*
been_BEN returned_VBN to_IN its_PP$ stall_NN could_MD indicate_VB
that_CS Diane's_NP$ information_NN had_HVD been_BEN wrong_JJ ,_,
but_CC Curt_NP didn't_DOD* interpret_VB it_PPO this_DT way_NN ._.
[BROWN N]

Other variables that are sometimes recorded in the corpus itself are LEMMA and SYNTACTIC STRUCTURE. When more than one variable is annotated in a corpus, the corpus is traditionally structured as shown in Figure 5-1, with one word per line and different columns for the different types of annotation (alternatively, complex XML tags are used).

Fig. 5-1. Example of a corpus with complex annotation (SUSANNE, Sampson 1995)
ID           POS     Word        Lemma       Grammar
N12:0280.42  AT      The         the         [O[S[Ns:s.
N12:0290.03  NN1n    fact        fact        .
N12:0290.06  CST     that        that        [Fn.
N12:0290.09  NP1f    Jess        Jess        [Ns:S[G[Nns.Nns]
N12:0290.12  GG      +<apos>s    -           .G]
N12:0290.15  NN1c    horse       horse       .Ns:S]
N12:0290.18  VHD     had         have        [Vdefp.
N12:0290.21  XX      not         not         .
N12:0290.24  VBN     been        be          .
N12:0290.27  VVNv    returned    return      .Vdefp]
N12:0290.30  IIt     to          to          [P:q.
N12:0290.33  APPGh1  its         its         [Ns.
N12:0290.39  NN1c    stall       stall       .Ns]P:q]Fn]Ns:s]
N12:0290.42  VMd     could       can         [Vdc.
N12:0290.48  VV0t    indicate    indicate    .Vdc]
N12:0300.03  CST     that        that        [Fn:o.
N12:0300.06  NP1f    Diane       Diane       [Ns:s[G[Nns.Nns]
N12:0300.09  GG      +<apos>s    -           .G]
N12:0300.12  NN1u    information information .Ns:s]
N12:0300.15  VHD     had         have        [Vdfb.
N12:0300.18  VBN     been        be          .Vdfb]
N12:0300.21  JJ      wrong       wrong       [J:e.J:e]Fn:o]
N12:0300.24  YC      +,          -           .
N12:0300.27  CCB     but         but         [S+.
N12:0300.30  NP1m    Curt        Curt        [Nns:s.Nns:s]
N12:0300.33  VDD     did         do          [Vde.
N12:0300.39  XX      +n<apos>t   not         .
N12:0300.42  VV0v    interpret   interpret   .Vde]
N12:0310.03  PPH1    it          it          [Ni:o.Ni:o]
N12:0310.06  DD1i    this        this        [Ns:h.
N12:0310.09  NNL1n   way         way         .Ns:h]S+]S]
N12:0310.12  YF      +.          -           .
Note that this type of corpus structure requires concordancers that are specifically geared towards processing column-based input; many of the commercially available software packages cannot handle this.
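Reading such a column-based file does not, however, require special-purpose software; a few lines of code suffice. A sketch assuming whitespace-separated columns in the order shown in Figure 5-1 (real SUSANNE files follow stricter, fixed column conventions):

```python
# Minimal reader for a column-based corpus file in the style of Figure 5-1:
# one token per line, whitespace-separated columns. The field names are just
# labels for the columns shown in the figure.

def read_columns(lines, fields=("id", "pos", "word", "lemma", "grammar")):
    tokens = []
    for line in lines:
        parts = line.split()
        if len(parts) == len(fields):  # skip malformed or empty lines
            tokens.append(dict(zip(fields, parts)))
    return tokens
```

Each token then becomes a small record whose annotation columns can be queried by name.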

In the case of annotations added in the context of specific research projects (especially if they are added manually), the second option mentioned above is typically preferred: the data are extracted, stored in a separate file, and then annotated. Frequently, spreadsheet applications are used to store the corpus data and annotation decisions, as in the example in Fig. 5-2, where possessive pronouns and nouns are coded for NOMINAL TYPE, ANIMACY and CONCRETENESS:

Fig. 5-2. Example of a raw data table

      A                    B           C      D            E
1     Example              SourceFile  Word   Nom. Type    Animacy
2     Jess's horse         N12         Jess   proper name  human
3     its stall            N12         it     pronoun      animate
4     Diane's information  N12         Diane  proper name  human
The first line contains labels that tell us what information is found in each column. This should include the example itself (either as shown in Fig. 5-2, or in the form of a KWIC concordance line) and meta-information such as what corpus and/or corpus file the example was extracted from. Crucially, it will include the relevant variables. Each subsequent line contains one observation (i.e., one hit and the appropriate values of each variable). This format, with one column for each variable and one line for each example, is referred to as a raw data table. It is the standard way of recording measurements in all empirical sciences and we should adhere to it strictly, as it ensures that the structure of the data is retained. In particular, one should never store one's data in summarized form, for example, as shown in Figure 5-3.

Fig. 5-3. Data stored in summarized form

      A          B        C            D
1     Noun Type  pronoun  proper name  noun
2                1        2            0
3     Animacy    human    animate      inanimate
4                2        1            0

Such a summary can be created automatically from a raw data table like the one in Figure 5-2 when needed, but if we record our coded data like this in the first place, we have no way of reconstructing the original cases and thus no way of telling which combinations of variables actually occurred in the data. In this particular case, for example, we cannot tell whether one of the human referents or the animate referent was referred to by the pronoun. In addition, statistics software packages require a raw data table as input.
Since any kind of quantitative analysis requires a raw data table, we might conclude that it is the only useful way of recording our coding decisions. However, there are cases where it may be more useful to record them in the form of annotations in (a copy of) the original corpus. For example, the information in Fig. 5-2 could be recorded in the corpus itself in the same way that part-of-speech tags are, i.e., we could add an ANIMACY label to every nominal element in our corpus, either in the column format shown in Figure 5-1, or in the format used for POS tags by the original version of the BROWN corpus, as shown in (5.19):

(5.19) the_AT fact_NN_abstract that_CS Jess's_NP$_human
horse_NN_animate had_HVD not_* been_BEN returned_VBN to_IN
its_PP$_animate stall_NN_inanimate could_MD indicate_VB that_CS
Diane's_NP$_human information_NN_abstract had_HVD been_BEN
wrong_JJ ,_, but_CC Curt_NP_human didn't_DOD* interpret_VB
it_PPO_inanimate this_DT way_NN_inanimate ._.

From a corpus encoded in this way, we could easily create the raw data table in Figure 5-2 by searching for possessives and separating the hits into the word itself, the PART-OF-SPEECH label and the ANIMACY annotation (this can be done manually, or we could write a simple computer program to do it for us). The advantage would be that we, or other researchers, could also use our coded data for research projects concerned with completely different research questions. Thus, if we are dealing with a variable that is likely to be of general interest, we should consider the possibility of annotating the corpus itself instead of first extracting the relevant data to a raw data table and coding them afterwards.16
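The "simple computer program" alluded to above might look like the following sketch. It assumes the word_TAG_VALUE format of (5.19) and identifies possessives by the "$" in their Brown-style tag:

```python
# Extract the possessives from a corpus annotated as in (5.19) and split
# each hit into word, part-of-speech tag and ANIMACY value, yielding one
# raw-data-table row per hit. Possessive tags (NP$, PP$) contain a "$".

def possessive_rows(tagged_text):
    rows = []
    for token in tagged_text.split():
        parts = token.split("_")
        if len(parts) == 3 and "$" in parts[1]:
            rows.append({"word": parts[0], "pos": parts[1], "animacy": parts[2]})
    return rows
```

The resulting list of records has exactly the one-row-per-observation structure of Figure 5-2 and can be written out to a spreadsheet or statistics package.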

5.2.4 The reliability of coding schemes


In some cases, we may be able to define our variables in such a way that they can be coded automatically. For example, it is trivial to write a program that will count the LENGTH of each word in our corpus in terms of number of letters and attach the value as a tag. We could also, for example, create a list of the 2500 most frequent nouns in English and their ANIMACY values, and write a program that goes through a tagged corpus and, whenever it encounters a word tagged as a noun, looks up this value and attaches it to the word. In such cases, we need to assess the quality of our automatized coding scheme in the same way in which we would assess the quality of the results returned by a particular query (cf. Section 5.1.2, esp. Table 5-1). In the context of coding data, a true positive result for a particular value would be a case where that value has been assigned to a corpus example correctly, a false positive would be a case where that value has been assigned incorrectly, a false negative would be a case where the value has not been assigned although it should have been, and a true negative would be a case where the value has not been assigned and should not have been assigned.
T
case where the value has not been assigned and should not have been assigned.
is assumes, of course, that we can determine with sucient certainty
whether a particular value should be assigned to a particular corpus example.
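The first of these programs, coding WORD LENGTH automatically, really is trivial; the sketch below assumes BROWN-style word_POS tokens (the sample fragment is invented) and appends the letter count as a further tag:

```python
# Sketch: attaching a LENGTH value (number of letters) to each
# word_POS token; the input fragment is invented for illustration.
tokens = "the_AT fact_NN that_CS horse_NN had_HVD".split()

coded = []
for token in tokens:
    word, _, pos = token.partition("_")
    length = sum(ch.isalpha() for ch in word)  # count letters only
    coded.append(f"{word}_{pos}_{length}")

print(" ".join(coded))  # the_AT_3 fact_NN_4 that_CS_4 horse_NN_5 had_HVD_3
```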
Note that this problem also exists with coding schemes for the manual coding of
data: as pointed out above, coding usually involves a certain degree of subjective
interpretation, so we need to ensure that the influence of subjective judgments is
as small as possible. The best way of doing this is to have (at least) two different
coders apply the coding scheme to the data: if our measurements cannot be
made objective (and, as should be clear by now, they rarely can in linguistics), this
will at least allow us to ensure that they are intersubjectively reliable.
16 Such a direct annotation of corpus files is rarely done in corpus linguistics, but it has
become the preferred strategy in various fields concerned with qualitative analysis of
textual data. There are open-source and commercial software packages dedicated to
this task (see Appendix A.3). They typically allow the user to define a set of codes,
import a text file, and then assign the codes to a word or larger textual unit by
selecting it with the mouse and then clicking a button for the appropriate code that is
then added (often in XML format) to the imported text. This strategy has the
additional advantage that one can view one's coded examples in their original context
(which may be necessary when annotating additional variables later). However, the
available software packages are geared towards the analysis of individual texts and do
not let the user work comfortably with large corpora.


One approach would be to have the entire data set coded by two coders
independently on the basis of the same coding scheme. We could then identify all
cases in which the two coders did not assign the same value and determine
where the disagreement came from. Obvious possibilities include cases that are
not covered by the coding scheme at all, cases where the definitions in the coding
scheme are too vague to apply or too ambiguous to make a principled decision,
and cases where one of the coders has misunderstood the corpus example or
made a mistake due to inattention. Where the coding scheme is to blame, it could
be revised accordingly and re-applied to all unclear cases. Where a coder is at
fault, they could correct their coding decision. At the end of this process we
would have a carefully coded data set with no (or very few) unclear cases left.
However, in practice there are two problems with this procedure. First, it is
extremely time-consuming, which will often make it difficult to impossible to find
a second coder. Second, discussing all unclear cases but not the apparently clear
cases holds the danger that the former will be coded according to different
criteria than the latter. Both problems can be solved (or at least alleviated) by
testing the coding scheme on a smaller dataset using two coders and calculating
its reliability across coders. If this so-called inter-rater reliability (see Carletta
1996) is sufficiently high, the coding scheme can be applied to the actual data set
by a single coder. If not, it needs to be made more explicit and applied to another
set of test data. The inter-rater reliability is typically expressed in terms of a
measure referred to as Cohen's Kappa (κ) that can range from 0 (no agreement)
to 1 (complete agreement) (see Appendix B.1 for a description of how to
calculate this measure). By general agreement, κ ≥ 0.7 is considered to be an
indication that inter-rater reliability is satisfactory while κ < 0.7 is taken to show
that inter-rater reliability is not satisfactory. Obviously, 0.7 cannot be taken as an
absolute value; what we consider to be satisfactory will depend on our research
project and there are many situations where a higher reliability is necessary.
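The calculation behind this measure (described in Appendix B.1) is the observed proportion of agreement po corrected for the proportion of agreement pe expected by chance: κ = (po - pe) / (1 - pe). A minimal sketch in Python, with invented coding decisions purely for illustration:

```python
# Sketch of Cohen's Kappa for two coders' decisions on the same examples.
# The coding values below are invented purely for illustration.
from collections import Counter

def cohens_kappa(coder1, coder2):
    n = len(coder1)
    # observed proportion of agreement
    p_o = sum(a == b for a, b in zip(coder1, coder2)) / n
    # chance agreement: product of the two coders' marginal proportions,
    # summed over all categories
    c1, c2 = Counter(coder1), Counter(coder2)
    p_e = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (p_o - p_e) / (1 - p_e)

coder_a = ["animate", "animate", "inanimate", "abstract", "animate", "abstract"]
coder_b = ["animate", "inanimate", "inanimate", "abstract", "animate", "abstract"]
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.75
```

With these invented codings the coders agree on five of the six cases (po ≈ 0.83), but correcting for chance agreement yields κ = 0.75, just above the conventional 0.7 threshold.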

Further reading
No matter what corpora and concordancing software you work with, you will
need regular expressions at some point. High-quality information is easy to
find online, but if you insist on buying a book, Friedl (2006) is as good as any.
Barnbrook (1996, Ch. 8.1) is an interesting case study on retrieving data from
corpora with non-standardized orthography. Leech (2005) is a good
introduction to the issues involved in annotating corpora.


Barnbrook, Geoff. 1996. Language and computers: A practical introduction
to the computer analysis of language. Edinburgh: Edinburgh University
Press.
Friedl, Jeffrey E. F. 2006. Mastering regular expressions. 3rd ed.
Sebastopol, CA: O'Reilly.
Leech, Geoffrey. 2005. Adding linguistic annotation. In Wynne, Martin
(ed.), Developing linguistic corpora: A guide to good practice. Oxford:
Oxbow. ahds.ac.uk/creating/guides/linguistic-corpora/chapter2.htm

6. Quantifying Research Questions
Recall, once again, that at the end of Chapter 2, we presented the following
definition of corpus linguistics:

Corpus linguistics is the investigation of linguistic research questions that
have been framed in terms of the conditional distribution of linguistic
phenomena in a linguistic corpus.

We discussed the fact that this definition covers cases of hypotheses phrased in
absolute terms, i.e. cases where the distribution of a phenomenon across different
conditions is a matter of all or nothing (for example, "All speakers of American
English refer to the front window of a car as windshield, all speakers of British
English as windscreen"). We pointed out that it also covers cases where the
distribution is a matter of more-or-less (for example, "British English speakers
tend to refer to networks of train tracks as railway rather than railroad and
American English speakers tend to refer to them as railroad rather than railway"
or, and this is not the same, "More British English speakers refer to networks of
train tracks as railway instead of railroad and more American English speakers
refer to them as railroad instead of railway").


In the case of hypotheses stated in terms of more-or-less, predictions must be
stated in quantitative terms, which in turn means that our data have to be
quantified in some way so that we can compare them to our predictions. In this
chapter, we will discuss in more detail how this is done when dealing with
different types of data.
Given that, as mentioned in Chapter 3, hypotheses stated in relative terms are
actually more typical in corpus linguistics than hypotheses stated in absolute
terms, one might expect quantitative methods to be a well-established aspect of
corpus-linguistic methodology. However, this is not the case. Many early corpus-
linguistic studies treated quantification impressionistically at best (stating that
one thing seemed to be more common in a given corpus or under a given
condition than another), and although this has changed over the last twenty years
(slowly at first, but increasingly rapidly in the past decade), the impressionistic
school of corpus linguistics is still strong. In order to distinguish the more
principled quantitative approach taken here from this impressionistic approach, it
is now often referred to as Quantitative Corpus Linguistics (a distinction first
suggested in Stefanowitsch and Gries 2005, although the term is, of course, older).

In this chapter, we will discuss the types of data we may be confronted with
when quantifying the (annotated) results of a corpus query (Section 6.1). We will
then discuss the way in which these types of data can be quantified (Sections 6.2,
6.3 and 6.4), laying the groundwork for a brief introduction to statistical
hypothesis testing in the following chapters.
6.1 Types of Data
Let us turn to an example that is more complex and much closer to the kind of
phenomenon actually of interest to corpus linguists than the distribution of
words across varieties: that of the two English possessive constructions. In
English, certain types of semantic relations, possession being very prominent
among them, can be expressed in two alternative ways, either by what is
traditionally referred to as the genitive (cf. 6.1a) or by the preposition of
connecting two noun phrases (cf. 6.1b):

(6.1a) The city's museums are treasure houses of inspiring objects from all eras
and cultures. [www.res.org.uk]
(6.1b) Today one can find the monuments and artifacts from all of these eras in
the museums of the city. [www.travelhouseuk.co.uk]
The two constructions cannot always be used interchangeably. First, there are a
number of relations that are exclusively encoded by of, such as quantities (both
generic, as in a couple/bit/lot of, and in terms of measures, as in six
miles/years/gallons of), type relations (a kind/type/sort/class of) and substance (a
mixture of water and whisky, a dress of silk, etc.) (cf. Stefanowitsch 2003).
Second, and more interestingly, even a relation that can be expressed by both
constructions in principle can often not be expressed by both constructions
equally well in a specific situation. A number of explanations have been proposed
(and investigated using quantitative corpus linguistics), for example the
following, which we will also investigate here (since these explanations are
treated as hypotheses here, I will take the liberty of illustrating them with
constructed examples rather than corpus data).


a) Givenness. Following the principle of Functional Sentence Perspective, if the
modifier (the phrase marked by 's or of) refers to given information, the genitive
will be preferred; if the modifier is new, the construction with of will be preferred
(Standwell 1983). Thus, (6.2a) and (6.3a) sound more natural than (6.2b) and (6.3b)
respectively:

(6.2a) In London, we visited the capital's many museums.
(6.2b) In London, we visited the many museums of the capital.

(6.3a) The Guggenheim is much larger than the museums of other major cities.
(6.3b) ??The Guggenheim is much larger than other major cities' museums.
b) Animacy. Since animate referents tend to be more topical than inanimate
ones and more topical elements tend to precede less topical ones, if the modifier
is animate, the genitive will be preferred; if it is inanimate, the construction with
of will be preferred (Quirk et al. 1972: 192-203, Deane 1987):

(6.4a) Solomon R. Guggenheim's collection contains some fine paintings.
(6.4b) The collection of Solomon R. Guggenheim contains some fine paintings.

(6.5a) The collection of the Guggenheim museum contains some fine paintings.
(6.5b) The Guggenheim museum's collection contains some fine paintings.

c) Length. Since short constituents generally precede long constituents, if the
modifier is short, the genitive will be preferred; if it is long, the construction with
of will be preferred (Altenberg 1980):

(6.6a) The museum's collection is stunning.
(6.6b) The collection of the museum is stunning.

(6.7a) The collection of the most famous museum in New York is stunning.
(6.7b) The most famous museum in New York's collection is stunning.

In all three cases, we are dealing with hypotheses concerning preferences rather
than absolute differences. None of the examples with question marks are
ungrammatical and all of them could conceivably occur; they just sound a little
bit odd or even just a little less good than the alternatives. Thus, the predictions
we can derive from each hypothesis must be stated, and tested, in terms of relative
rather than absolute differences: they all involve predictions stated in terms of
more-or-less rather than all-or-nothing. Relative quantitative differences are
expressed and dealt with in different ways depending on the type of data they
involve. There are three types of data, exemplified respectively by the variables in
each of the hypotheses discussed above: nominal (or categorical) data, ordinal
data and cardinal data. Let us discuss each of these in turn.

6.1.1 Nominal data

A nominal variable is a variable whose values are labels for categories that have
no intrinsic order with respect to each other (i.e., there is no aspect of their
definition that would allow us to put them in a natural order). If we categorize
data in terms of such a nominal variable (either because there is no other way of
categorizing them or simply because we choose to categorize them in this way),
the only way to quantify them is to count the number of observations in each
category and express the result in absolute frequencies (i.e., raw numbers) or
relative frequencies (i.e. percentages). We cannot rank the data based on intrinsic
criteria, and we cannot calculate mean values.

Some typical examples of nominal variables are demographic variables like
SEX, NATIONALITY or LANGUAGE. In a sample of speakers, we could count how many of
them are speakers of GERMAN and how many are speakers of FRENCH, but we cannot
rank the German language higher than the French language on intrinsic criteria.
This does not mean that we can never rank the two languages: for example,
we could rank them by number of native speakers worldwide (in which case
German ranks above French) or by the number of countries in which they are an
official language (in which case French ranks above German). But the number of
speakers or the number of countries where a language has an official status is not
part of the definition of that language (German would still be German if its
number of speakers was reduced by half because the others decided to speak only
English from this day on, and French would still be French if it lost its official
status in every single country).
In other words, we would be ranking these languages by extrinsic criteria,
which means that really we are ranking something completely different: in the
first case, we are ranking the variable SIZE OF SPEECH COMMUNITY, in the second case
we are ranking the variable NUMBER OF COUNTRIES WITH OFFICIAL LANGUAGE X by size.
We also cannot calculate a mean value between the two languages (for example,
claiming that English is the mean of German and French because its vocabulary is
derived in equal proportions from the Germanic and the Romance language
families). We can calculate the mean number of speakers of a set of languages or
the mean number of countries they are spoken in, but again, we would not be
calculating means between languages but between sets of speakers or countries.
With respect to the examples above, it should be obvious that they all involve
a nominal variable we could call TYPE OF POSSESSIVE CONSTRUCTION (with the values
GENITIVE and OF-CONSTRUCTION). We can categorize all possessive expressions in a
corpus in terms of these two categories and we can then report the frequency of
each: for example, by my count, the genitive occurs 22,193 times in the BROWN
corpus (excluding proper names and instances of the double genitive), and the
of-construction 17,800 times (this latter value is actually an estimate; it would take
too long to go through all 36,406 occurrences of of and identify those that occur
in the structure relevant here, so I categorized a random subsample of 500 hits of
of and generalized the proportion of of-constructions vs. other uses of of to the
total number of hits for of).
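The estimate described in the parentheses is a simple proportion calculation. In the sketch below, the figure of 244 of-constructions in the sample is hypothetical, chosen only to illustrate the arithmetic; the actual sample count underlying the 17,800 estimate is not reported here:

```python
# Hypothetical illustration: estimating the total number of
# of-constructions from a categorized random subsample.
total_hits = 36406   # all occurrences of "of" in the BROWN corpus
sample_size = 500    # randomly sampled hits, categorized manually
of_constructions_in_sample = 244  # invented count for illustration

estimate = round(total_hits * of_constructions_in_sample / sample_size)
print(estimate)  # 17766
```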
We can rank the constructions in terms of their frequency (the genitive is
more frequent), but in this case we are not ranking the constructions based on an
intrinsic criterion, but on an extrinsic one: their corpus frequency in one
particular corpus. We can also calculate their mean frequency (19,996.5), but
again, this is not a mean of the two constructions, but of their frequencies in one
particular corpus. Again, the two constructions are in no way defined by their
frequency, so their frequency is an extrinsic criterion.

6.1.2 Ordinal data

An ordinal variable is a variable whose values are labels for categories that do
have an intrinsic order with respect to each other but that cannot be expressed in
terms of natural numbers. In other words, ordinal variables are variables that are
defined in such a way that some aspect of their definition allows us to order
them without reference to an extrinsic criterion, but that does not give us any
information about the distance (or degree of difference) between one category
and the next. If we categorize data in terms of such an ordinal variable, we can
treat them accordingly (i.e., we can rank them), or we can treat them like nominal
data by simply ignoring their inherent order (i.e., we can still count the number of
observations for each value and report absolute or relative frequencies). We
cannot calculate mean values.

Some typical examples of ordinal variables are demographic variables like
EDUCATION or (in the appropriate sub-demographic) MILITARY RANK, but also SCHOOL
GRADES and the kind of ratings often found in questionnaires (both of which are,
however, often treated as though they were cardinal data, see below). For
example, academic degrees are intrinsically ordered: it is part of the definition of
a PhD degree that it ranks higher than a master's degree, which in turn ranks
higher than a bachelor's degree. Thus, we can easily rank speakers in a sample of
university graduates based on the highest degree they have completed. We can
also simply count the number of PhDs, MAs, and BAs and ignore the ordering of
the degrees. But we cannot calculate a mean: if five speakers in our sample have a
PhD and five have a BA, this does not allow us to claim that all of them have an
MA degree on average. The first important reason for this is that the size of the
difference in terms of skills and knowledge that separates a BA from an MA is not
the same as that separating an MA from a PhD: in Europe, one typically studies
two years for an MA, but it typically takes four to five years to complete a PhD.
The second important reason is that the values of ordinal variables typically differ
along more than one dimension: while it is true that a PhD is a higher degree
than an MA, which is a higher degree than a BA, the three degrees also differ in
terms of specialization (from a relatively broad BA to a very narrow PhD), and
the PhD degree differs from the two other degrees qualitatively: a BA and an MA
primarily show that one has acquired knowledge and (more or less practical)
skills, but a PhD primarily shows that one has acquired research skills.
With respect to the examples above, clearly ANIMACY is an ordinal variable, at
least if we think of it in terms of a scale, as we did in Chapter 4. A simple
animacy scale might look like this:

(6.8) ANIMATE vs. INANIMATE vs. ABSTRACT

On this scale, ANIMATE ranks higher than INANIMATE, which ranks higher than
ABSTRACT in terms of the property we are calling animacy, and this ranking is
determined by the scale itself, not by any extrinsic criteria. This means that we
could categorize and rank all nouns in a corpus according to their animacy. But
again, we cannot calculate a mean. If we have 50 HUMAN nouns and 50 ABSTRACT
nouns, we cannot say that we have a mean of 100 INANIMATE nouns. Again, this is
because we have no way of knowing whether, in terms of animacy, the difference
between ANIMATE and INANIMATE is the same size as that between INANIMATE and
ABSTRACT, but also because we are, again, dealing with qualitative as well as
quantitative differences: the difference between animate and inanimate on the
one hand and abstract on the other is that the first two have physical existence,
and the difference between animate on the one hand and inanimate and abstract
on the other is that animates are potentially alive and the other two are not. In
other words, our scale is really a combination of at least two dimensions. Again,
we could also ignore the intrinsic order of the values on our ANIMACY scale and
simply treat them as nominal data, i.e., count them and report the frequency with
which each value occurs in our data. Potentially ordinal data are actually
frequently treated like nominal data in corpus linguistics, and with complex
scales combining a range of different dimensions, this is probably a good idea;
but ordinal data also have a useful place in quantitative corpus linguistics.
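Treating the values of such a scale as nominal data simply means counting how often each one occurs; a minimal sketch with invented ANIMACY codings:

```python
# Counting the frequency of each ANIMACY value in an invented sample
# of coded nouns, ignoring the ordinal structure of the scale.
from collections import Counter

codings = ["animate", "abstract", "animate", "inanimate",
           "abstract", "animate", "abstract"]
freqs = Counter(codings)
print(freqs.most_common())  # [('animate', 3), ('abstract', 3), ('inanimate', 1)]
```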

6.1.3 Cardinal data

Cardinal variables are variables whose values are numerical measurements along
a particular dimension. In other words, they are intrinsically ordered (like ordinal
data), not because some aspect of their definition allows us to order them, but
because of their nature as numbers. Also, the distance between any two
measurements is precisely known and can directly be expressed as a number
itself. This means that we can perform any arithmetic operation on cardinal data.
Crucially, we can calculate means. Of course, we can also treat cardinal data like
rank data by ignoring all of their mathematical properties other than their order,
and we could also treat them as nominal data (see further below).
Typical cases of cardinal variables are demographic variables like AGE or
INCOME. For example, we can categorize a sample of speakers by their age and then
calculate the mean age of our sample. If our sample contains 5 50-year-olds and 5
30-year-olds, it makes perfect sense to say that the mean age in our sample is 40;
we might need additional information to distinguish between this sample and
another sample that consists of 5 41-year-olds and 5 39-year-olds, which would also
have a mean age of 40 (cf. Section 6.2.3 below), but the mean itself is meaningful,
because the distance between 30 and 40 is the same as that between 40 and 50 and
all measurements involve just a single dimension (age).
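The two samples just described can be checked with a quick calculation:

```python
# The two hypothetical samples from the text: identical means,
# different spreads.
sample_1 = 5 * [50] + 5 * [30]
sample_2 = 5 * [41] + 5 * [39]

mean_1 = sum(sample_1) / len(sample_1)
mean_2 = sum(sample_2) / len(sample_2)
print(mean_1, mean_2)  # 40.0 40.0
```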


With respect to the two possessives, the variables LENGTH and DISCOURSE STATUS
are cardinal variables. It should be obvious that we can calculate the mean length
of words or other constituents in a corpus, a particular sample, a particular
position in a grammatical construction, etc.
As mentioned above, we can also treat cardinal data like ordinal data. This
may sometimes actually be necessary for mathematical reasons (see Chapter 7
below); in other cases, we may want to transform cardinal data to ordinal data
based on theoretical considerations. For example, the measure of Referential
Distance discussed in Chapter 4, Section 4.2.b yields cardinal data ranging from 0
to whatever maximum distance we decide on, and it would be possible, and
reasonable, to calculate the mean referential distance of a particular type of
referring expression. However, Givón (1992: 20) argues that we should actually
think of referential distance as ordinal data: as most referring expressions
consistently have a referential distance of either 0-1, or 2-3, or larger than 3, he
suggests converting measures of REFERENTIAL DISTANCE into just three categories:
MINIMAL GAP (0-1), SMALL GAP (2-3) and LONG GAP (>3). Once we have done this, we can
no longer calculate a meaningful mean, because the categories are no longer
equivalent in size or distance, but we can still rank them, and, of course, we can
also treat them as nominal data.
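Givón's conversion amounts to a simple threshold function; the distance values in the sketch below are invented for illustration:

```python
# Sketch: converting cardinal REFERENTIAL DISTANCE measurements into
# Givón's three ordinal categories.
def gap_category(distance):
    if distance <= 1:
        return "MINIMAL GAP"   # referential distance 0-1
    elif distance <= 3:
        return "SMALL GAP"     # referential distance 2-3
    else:
        return "LONG GAP"      # referential distance > 3

distances = [0, 1, 2, 3, 4, 10]  # invented measurements
print([gap_category(d) for d in distances])
```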

6.1.4 Interim summary

In the preceding three subsections, we have repeatedly mentioned concepts like
"frequency", "percentage", "rank" and "mean". In the following three sections, we
will introduce these concepts in more detail, providing a solid foundation of
descriptive statistical measures for nominal, ordinal and cardinal data.
Note, however, that most research designs, including those useful for
investigating the hypotheses about the two possessive constructions, involve (at
least) two variables: (at least) one independent one and (at least) one dependent
one. Even our definition of corpus linguistics makes reference to this fact when it
states that research questions should be framed in such a way as to enable us to
answer them by looking at the distribution of linguistic phenomena across
different conditions.

Since such conditions are most likely to be nominal in character (a set of
varieties, groups of speakers, grammatical constructions, text types, etc.), we will
limit the discussion to combinations of variables where at least one variable is
nominal, i.e., (a) designs with two nominal variables, (b) designs with one
nominal and one ordinal variable, and (c) designs with one nominal and one
cardinal variable. Logically, there are three additional designs, namely designs
with (d) two ordinal variables, (e) two cardinal variables or (f) one ordinal and
one cardinal variable. For such cases, we would need different types of correlation
analysis, which we will not discuss in this book in any detail (there are pointers
to the relevant literature at the end of Chapter 7).

6.2 Descriptive statistics for nominal data

All examples we have looked at so far in this book have involved two nominal
variables: VARIETY (British vs. American English) and some linguistic alternation
(mostly regional synonyms of some lexicalized concept). Thus, this type of
research design should already be somewhat familiar.
For a closer look, we will apply it to the first of the three hypotheses
introduced in the preceding section, which is restated here with the background
assumption from which it is derived:

(6.9) Assumption: Discourse-old items occur before discourse-new items.
Hypothesis: The S-POSSESSIVE will be used when the modifier is DISCOURSE-OLD,
the OF-POSSESSIVE will be used when the modifier is DISCOURSE-NEW.

Before we can turn to the quantitative prediction, there are some aspects of the
research design that need to be discussed. First, the two constructions have been
renamed to show that they are values of a variable in a specific research design.
As such, they must, of course, be given operational definitions. The definitions I
used were the following:

(6.10a) S-POSSESSIVE: A construction consisting of a possessive pronoun or a noun
phrase marked by the clitics 's or ' modifying a noun following it, where the
construction as a whole is not a proper name.

(6.10b) OF-POSSESSIVE: A construction consisting of a noun modified by a
prepositional phrase with of, where the construction as a whole encodes a
relation that could theoretically also be encoded by the S-POSSESSIVE and is
not a proper name.

Proper names are excluded in both cases because they are fixed and cannot
vary. Therefore, they will not be subject to any restrictions concerning givenness,
animacy or length. The OF-POSSESSIVE was defined in such a way that only those
cases count that are in a semantic relationship with the S-POSSESSIVE. This, in a
sense, is what construes them theoretically as values of a single variable
POSSESSIVE.
Next, DISCOURSE-OLD and DISCOURSE-NEW have to be operationalized. This could be
done using the measure of referential distance discussed in Chapter 4, but instead
I chose an indirect operationalization: it is well established that pronouns refer to
old information, whereas new information must be introduced in lexical NPs.
Thus, we will define DISCOURSE-OLD as "encoded by a pronoun" and DISCOURSE-NEW as
"encoded by a lexical NP". The advantage is that this is a very easily retrievable
and codable property; the disadvantage is that lexical NPs may also refer to
DISCOURSE-OLD information, adding noise to the data set.


I then retrieved from the BROWN corpus a 1 percent sample of potential
occurrences of the S-POSSESSIVE (i.e. all possessive pronouns and all occurrences of
the clitics 's and ', the latter used with nouns that already end in s, all of which
are tagged with a $ in the original tagged version of the corpus used here). I
then manually discarded all examples that contained proper names. Next, I
retrieved a 1 percent sample of the word of preceded by a noun (this already
excluded many of the quantitative uses, which are preceded by quantitative
adverbs like most). I then manually checked each example for whether it could be
replaced by an S-POSSESSIVE in the right circumstances, namely, if the modifier was a
pronoun. Consider (6.11a, b):

(6.11a) a lack of unity
(6.11b) the concept of unity

At first glance, neither of them seems to be rephraseable, but in the right
syntactic context, (6.11a) can actually be expressed by the s-possessive, while
(6.11b) cannot:

(6.12a) Unity is important, and its lack can be a problem.
(6.12b) * Unity is important, so its concept must be taught early.

Obviously, inventing a plausible context of the form [NP1 is ADJ, and/so
PRONPOSS N2 VP], where NP1 corresponds to the modifier of the corpus
example and N2 to its head, and then checking its acceptability, is a complex task.
Ideally, it should be repeated by a second rater, to make sure nothing was missed,
but as pointed out in the preceding chapter, this is a very time-consuming
process, so it is not always possible.


is procedure le me with 222 s-possessives and 178 of-possessives ). ese
were categorized according to whether their modier NP is (or is headed by) a
pronoun, a proper name or a common noun (based on the PoS tags in the corpus).
Cases with proper names were ignored for the purposes of the present study, as
they play no role in the operational denition of DISCOURSE STATUS adopted here,
which le 200 s-possessives and 156 of-possessives (see Appendix D.1, Study 1 for
the data).
We can now state the following quantitative prediction based on our
hypothesis:


(6.13) Prediction: There will be more cases of the S-POSSESSIVE with DISCOURSE-OLD
modifiers than with DISCOURSE-NEW modifiers, and more cases of the
OF-POSSESSIVE with DISCOURSE-NEW modifiers than with DISCOURSE-OLD modifiers.

Table 6-1 shows the absolute frequencies of the parts of speech of the
modifier in both constructions.

Table 6-1. Part of Speech of the Modifier in the s-possessive and the of-possessive

                              POSSESSIVE
DISC. STATUS    S-POSSESSIVE    OF-POSSESSIVE    Total
OLD                      180                3      183
NEW                       20              153      173
Total                    200              156      356

Such a table, examples of which we have already seen in Chapter 3, is referred
to as a contingency table. In this case, the contingency table consists of four cells
showing the frequencies of the four intersections of the variables DISCOURSE STATUS
(with the values OLD, i.e. pronoun, and NEW, i.e. lexical noun) and POSSESSIVE (with
the values S and OF); in other words, it is a 2×2 (two-by-two) table. POSSESSIVE is
presented as the dependent variable here, since logically the hypothesis is that
the discourse status of the modifier influences the choice of construction, but
mathematically it does not matter in contingency tables what we treat as the
dependent or independent variable.

In addition, there are two cells showing the row totals (the sum of all cells in a
given row) and the column totals (the sum of all cells in a given column), and one
cell showing the table total (the sum of all four intersections). The row and column
totals for a given cell are referred to as the marginal frequencies for that cell.

6.2.1 Percentages

The frequencies in Table 6-1 are fairly easy to interpret in this case, because the
differences in frequency are very clear. However, we should be wary of basing
our assessment of corpus data directly on raw frequencies in a contingency table.
These can be very misleading, especially if the marginal frequencies of the
variables differ substantially, which in this case they do: the s-possessive is more
frequent overall than the of-possessive, and discourse-old modifiers (i.e.
pronouns) are slightly more frequent overall than discourse-new ones (i.e.,
lexical nouns).
Thus, it is generally useful to convert the absolute frequencies to relative
frequencies, abstracting away from the dierences in marginal frequencies. In
order to convert an absolute frequency n into a percentage, we simply divide it by
the total number of cases N of which it is a part and multiply the result by 100.

(6.14) (n / N) × 100

For example, if we have a group of 31 students studying some foreign
language and six of them study German, the percentage of students studying
German is

(6.15) (6 / 31) × 100 = 0.1935 × 100 = 19.35%
In other words, a percentage is just another way of expressing a decimal
fraction, which is just another way of expressing a fraction (note that in academic
papers, it is common to report relative frequencies as decimal fractions rather
than as percentages).
If we want to calculate percentages for the frequencies in Table 6-1, we first
have to decide what the relevant total frequency is. There are three possibilities,
all of which are useful in some way: we can divide each cell by its column total,
by its row total or by the table total. Table 6-2 shows the results for all three
possibilities.
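The three kinds of percentages can be computed mechanically from the observed frequencies. The following Python sketch (the counts are those of Table 6-1; the function and variable names are mine) reproduces three of the values reported in Table 6-2:

```python
# Observed frequencies from Table 6-1, keyed by (row, column).
obs = {("old", "s"): 180, ("old", "of"): 3,
       ("new", "s"): 20, ("new", "of"): 153}

def pct(n, total):
    """Convert an absolute frequency n into a percentage of total, cf. (6.14)."""
    return round(n / total * 100, 2)

row_totals = {r: obs[(r, "s")] + obs[(r, "of")] for r in ("old", "new")}
col_totals = {c: obs[("old", c)] + obs[("new", c)] for c in ("s", "of")}
N = sum(obs.values())

# The three percentages for the cell OLD / S-POSSESSIVE (cf. Table 6-2):
print(pct(obs[("old", "s")], col_totals["s"]))    # column percentage: 90.0
print(pct(obs[("old", "s")], row_totals["old"]))  # row percentage: 98.36
print(pct(obs[("old", "s")], N))                  # table percentage: 50.56
```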
The column percentages can be related to our prediction most
straightforwardly: based on our hypothesis, we predicted that in our sample a
majority of s-possessives should have modifiers that refer to discourse-old
information and, conversely, a majority of of-possessives should have modifiers
that refer to discourse-new information.
The relevance of the row percentages is less clear in this case. We might
predict, based on our hypothesis, that the majority of modifiers referring to old
information should occur in s-possessives and the majority of modifiers referring
to new information should occur in of-possessives.

Table 6-2. Absolute and relative frequencies of the modifier's POS in the English
possessive constructions

DISCOURSE STATUS               S-POSSESSIVE   OF-POSSESSIVE    Total
OLD         Abs. Freq.             180              3            183
            Col. Pct.            90.00%          1.92%         51.40%
            Row Pct.             98.36%          1.64%          100%
            Tab. Pct.            50.56%          0.84%
NEW         Abs. Freq.              20            153            173
            Col. Pct.            10.00%         98.08%         48.60%
            Row Pct.             11.56%         88.44%          100%
            Tab. Pct.             5.62%         42.98%
Total       Abs. Freq.             200            156            356
            Col. Pct.              100%           100%
            Row Pct.             56.18%         43.82%          100%
            Tab. Pct.                                           100%
This is the case in Table 6-2, and this is certainly compatible with our hypothesis.
However, if it were not the case, this could also be compatible with our
hypothesis. Note that the constructions differ in frequency, with the of-possessive
being only about three-quarters as frequent as the s-possessive. Now imagine the
difference was ten to one instead of four to three. In this case, we might well find
that the majority of both old and new modifiers occurs in the s-possessives,
simply because there are so many more s-possessives than of-possessives. We
would, however, expect the majority to be larger in the case of old modifiers than
in the case of new modifiers. In other words, even if we are looking at row
percentages, the relevant comparisons are across rows, not within rows.
Whether column percentages or row percentages are more relevant to a
hypothesis depends, of course, on the way variables are arranged in the table: if
we rotate the table such that the variable POSSESSIVE ends up in the rows, then the
row percentages would be more relevant. When interpreting percentages in a
contingency table, we have to find those that actually relate to our hypothesis. In
any case, the interpretation of both row and column percentages requires us to
choose one value of one of our variables and compare it across the two values of
the other variable, and then compare this comparison to a comparison of the
other value of that variable. If that sounds complicated, this is because it is
complicated.
It would be less confusing if we had a way of taking into account both values
of both variables at the same time. The table percentages allow this to some
extent. The way our hypothesis is phrased, we would expect a majority of cases
to instantiate the intersections S-POSSESSIVE ∩ DISCOURSE-OLD and OF-POSSESSIVE ∩
DISCOURSE-NEW, with a minority of cases instantiating the other two intersections.
In Table 6-2, this is clearly the case: the intersection S-POSSESSIVE ∩ DISCOURSE-OLD
contains more than fifty percent of all cases, the intersection OF-POSSESSIVE ∩
DISCOURSE-NEW well over 40 percent. Again, if the marginal frequencies differ more
extremely, so may the table percentages in the relevant intersections. We could
imagine a situation, for example, where 90 percent of the cases fell into the
intersection S-POSSESSIVE ∩ DISCOURSE-OLD and 10 percent into the intersection OF-
POSSESSIVE ∩ DISCOURSE-NEW; this would still be a confirmation of our hypothesis.
While percentages are, with due care, more easily interpretable than absolute
frequencies, they have two disadvantages. First, by abstracting away from the
absolute frequencies, we lose valuable information: we would interpret a
distribution such as that in Table 6-2 differently if we knew that it was based on a
sample of just 35 instead of 356 corpus hits. Second, they provide no sense of how
different our observed distribution is from the distribution that we would expect
if there was no relation between our two variables, i.e., if the values were
distributed according to chance. Thus, instead of (or in addition to) percentages,
we should compare the observed frequencies of the intersections of our variables
with the frequencies we would expect if there was no relationship between the
two variables, i.e., if the intersections occurred at chance level. This also provides
a foundation for inferential statistics, discussed in Chapter 7.
6.2.2 Observed and expected frequencies
But how can we calculate the expected frequencies of the intersections of our
variables? Consider the textbook example of a random process: flipping a coin
onto a hard surface. There are two possible outcomes, heads and tails, and we
know that their probability is fifty percent (or 0.5) each (ignoring the remote
possibility that the coin will land, and remain standing, on its edge). From these
probabilities, we can calculate the expected frequency of heads and tails in a
series of coin flips. If we flip the coin ten times, the expected frequency is 5 (i.e.,
fifty percent of ten, or 0.5 × 10) for heads and tails each; if we flip the coin 42
times, the expected frequency would be 21 (0.5 × 42) for each, and so on. In the
real world, we would of course expect some variation (more on this in Chapter 7),
so "expected frequency" refers to a theoretical expectation derived from
multiplying the probability of an event by the total number of observations.
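This calculation, and the random variation around it, can be illustrated in a few lines of Python (a sketch for illustration; the seed and all names are arbitrary choices of mine):

```python
import random

def expected_frequency(p, n):
    """Expected frequency: probability of the event times number of observations."""
    return p * n

print(expected_frequency(0.5, 10))  # 5.0
print(expected_frequency(0.5, 42))  # 21.0

# In the real world, observed frequencies vary around this expectation:
random.seed(42)  # arbitrary seed, for a reproducible illustration
flips = [random.choice("HT") for _ in range(42)]
print(flips.count("H"))  # some number near, but usually not exactly, 21
```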
So how do we transfer this logic to a contingency table like Table 6-1? Naively,
we might assume that the chance frequencies for each cell can be determined by
taking the total number of observations and dividing it by four: if the data were
distributed randomly, each intersection of values should have about the same
frequency (just like, when tossing a coin, each side should come up roughly the
same number of times). However, this would only be the case if all marginal
frequencies were the same, for example, if our sample contained fifty S-POSSESSIVES
and fifty OF-POSSESSIVES and fifty of the modifiers were discourse-old (i.e. pronouns)
and fifty of them were discourse-new (i.e. lexical NPs). But this is not the case:
there are more discourse-old modifiers than discourse-new ones (183 vs. 173) and
there are more s-possessives than of-possessives (200 vs. 156).
These marginal frequencies of our variables and their values are a fact about
our data that must be taken as a given when calculating the expected frequencies:
our hypothesis says nothing about the overall frequency of the two constructions
or the overall frequency of discourse-old and discourse-new modifiers, but only
about the frequencies with which these values should co-occur. In other words,
the question we must answer is the following: Given that the s- and the of-
possessive occur 200 and 156 times respectively, and given that there are 183
discourse-old modifiers and 173 discourse-new modifiers, how frequently would
each combination of these values occur by chance?
Put like this, the answer is conceptually quite simple: the marginal frequencies
should be distributed across the intersections of our variables such that the
relative frequencies in each row should be the same as those of the row total and
the relative frequencies in each column should be the same as those of the
column total.
For example, 56.18 percent of all possessive constructions in our sample are s-
possessives and 43.82 percent are of-possessives; if there were only a chance
relationship between type of construction and discourse status of the modifier,
we should find the same proportions for the 183 constructions with old modifiers,
i.e. 183 × 0.5618 = 102.81 s-possessives and 183 × 0.4382 = 80.19 of-possessives.
Likewise, there are 173 constructions with new modifiers, so 173 × 0.5618 = 97.19
of them should be s-possessives and 173 × 0.4382 = 75.81 of them should be of-
possessives. The same goes for the columns: 51.4 percent of all constructions have
old modifiers and 48.6 percent have new modifiers. If there were only a chance
relationship between type of construction and discourse status of the modifier,
we should find the same proportions for both types of possessive construction:
there should be 200 × 0.514 = 102.8 s-possessives with old modifiers and
200 × 0.486 = 97.2 with new modifiers, as well as 156 × 0.514 = 80.18 of-possessives
with old modifiers and 156 × 0.486 = 75.82 of-possessives with new modifiers.
Note that the expected frequencies for each intersection are the same whether
we use the total row percentages or the total column percentages: the small
differences are due to rounding errors.
To avoid rounding errors, we should not actually convert the row and column
totals to percentages at all, but use the following much simpler way of calculating
the expected frequencies: for each cell, we simply multiply its marginal
frequencies and divide the result by the table total, as shown in Table 6-3; note
that we are using the standard convention of using O to refer to observed
frequencies, E to refer to expected frequencies, and subscripts to refer to rows and
columns. The convention for these subscripts is as follows: use 1 for the first row
or column, 2 for the second row or column, and T for the row or column total,
and give the index for the row before that of the column. For example, E21 refers
to the expected frequency of the cell in the second row and the first column, O1T
refers to the total of the first row, and so on.
Table 6-3. Calculating expected frequencies from observed frequencies

                                        DEPENDENT VARIABLE
                                 VALUE 1                   VALUE 2             Total
INDEPENDENT   VALUE 1   E11 = (O1T × OT1) / OTT   E12 = (O1T × OT2) / OTT       O1T
VARIABLE      VALUE 2   E21 = (O2T × OT1) / OTT   E22 = (O2T × OT2) / OTT       O2T
              Total              OT1                       OT2                  OTT

Applying this procedure to our observed frequencies yields the results shown in
Table 6-4. One should always report nominal data in this way, i.e., giving both the
observed and the expected frequencies in the form of a contingency table.

Table 6-4. Observed and expected frequencies of old and new modifiers in the s- and
the of-possessive

                               S-POSSESSIVE   OF-POSSESSIVE   Total
DISCOURSE   OLD    Obs.            180              3           183
STATUS             Exp.          102.81          80.19
            NEW    Obs.             20            153           173
                   Exp.           97.19          75.81
            Total  Obs.            200            156           356

We can now compare the observed and expected frequencies of each intersection
to see whether the difference conforms to our quantitative prediction. This is
clearly the case: for the intersections S-POSSESSIVE ∩ DISCOURSE-OLD and OF-POSSESSIVE
∩ DISCOURSE-NEW, the observed frequencies are higher than the expected ones; for
the intersections S-POSSESSIVE ∩ DISCOURSE-NEW and OF-POSSESSIVE ∩ DISCOURSE-OLD, the
observed frequencies are lower than the expected ones.
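The calculation summarized in Table 6-3 can be sketched in Python as follows (the counts are those of Table 6-1; the nested-list representation and all names are my own choices for illustration):

```python
# Observed frequencies from Table 6-1 as rows of a nested list.
obs = [[180, 3],    # DISCOURSE-OLD: s-possessive, of-possessive
       [20, 153]]   # DISCOURSE-NEW

row_totals = [sum(row) for row in obs]        # [183, 173]
col_totals = [sum(col) for col in zip(*obs)]  # [200, 156]
N = sum(row_totals)                           # 356

# Expected frequency of each cell: (row total x column total) / table total.
exp = [[r * c / N for c in col_totals] for r in row_totals]

for row in exp:
    print([round(e, 2) for e in row])
# [102.81, 80.19]
# [97.19, 75.81]
```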
T
is conditional distribution seems to conrm our hypothesis. However, note
that it does not yet prove or disprove anything, since, as mentioned above, we
AF

would never expect a real-world distribution of events to match the expected


distribution perfectly. We will return to this issue in Chapter 7.

6.3 Descriptive Statistics for Ordinal Data
Let us turn, next, to a design with one nominal and one ordinal variable: a test of
the second of the three hypotheses introduced at the beginning of this chapter.
Again, it is restated here together with the background assumption from which it
is derived:

(6.16) Assumption: Animate items occur before inanimate items.
Hypothesis: The S-POSSESSIVE will be used when the modifier is high in
ANIMACY, the OF-POSSESSIVE will be used when the modifier is low in ANIMACY.

The constructions are operationalized as before. The data used are based on the
same data set, except that cases with proper names are now included. For
expository reasons, we are going to look at a ten-percent subsample of the full
sample, giving us 23 s-possessives and 18 of-possessives (see Appendix D.1 for
the data).
ANIMACY was operationally defined in terms of the coding scheme shown in
Table 6-5 (based on Zaenen et al. 2003).

Table 6-5. A simple coding scheme for ANIMACY

ANIMACY CATEGORY        DEFINITION                                                           RANK
HUMAN                   Real or fictional humans and human-like beings.                        1
ORGANIZATION            Groups of HUMANS acting with a common purpose.                         2
OTHER ANIMATE           Real or fictional animals, animal-like beings and plants.              3
HUMAN ATTRIBUTE         Body parts, organs, etc. of HUMANS.                                    4
CONCRETE TOUCHABLE      Physical entities that are incapable of life and can be touched.       5
CONCRETE NONTOUCHABLE   Physical entities that are incapable of life and cannot be touched.    6
LOCATION                Physical places and regions.                                           7
TIME                    Points in and periods of time.                                         8
EVENT                   Events.                                                                9
ABSTRACT                Other abstract entities.                                              10
As pointed out above, this type of ANIMACY hierarchy is a classic example of
ordinal data, as the categories can be ordered (although there may be some
disagreement about the exact order), but we cannot say anything about the
distance between one category and the next, and there is more than one
conceptual dimension involved (I ordered them according to dimensions like
potential for life, touchability and conceptual independence).
We can now formulate the following prediction:

(6.17) Prediction: The modifiers of the S-POSSESSIVE will tend to occur high on the
ANIMACY scale, the modifiers of the OF-POSSESSIVE will tend to occur low on the
ANIMACY scale.

Note that phrased like this it is not yet a quantitative prediction, since tend to is
not a mathematical concept. In contrast to frequency (for nominal data) and
average or mean (for cardinal data), which are used in everyday language
with something close to their mathematical meaning, we do not have an
everyday word for dealing with dierences in ordinal data. We will return to this

131
6. antifying Research estions

point presently, but rst, let us look at the data impressionistically. Table 6-6
shows the coded sample (cases are listed in the order in which they occurred in
the corpus, the example IDs can be used to refer to the actual hits listed in
Appendix D.1, Study 2).

Table 6-6. A sample of s- and of-possessives coded for ANIMACY (BROWN corpus)

(a) S-POSSESSIVE                     (b) OF-POSSESSIVE
Example ID   Animacy   Rank          Example ID   Animacy   Rank
6-6a1        ORG        2            6-6b1        LOC        8
6-6a2        HUM        1            6-6b2        ORG        2
6-6a3        HUM        1            6-6b3        EVT        7
6-6a4        CCN        6            6-6b4        HUM        1
6-6a5        ORG        2            6-6b5        CCT        5
6-6a6        ORG        2            6-6b6        ABS       10
6-6a7        HUM        1            6-6b7        HUM        1
6-6a8        HUM        1            6-6b8        ABS       10
6-6a9        HUM        1            6-6b9        CCN        6
6-6a10       HUM        1            6-6b10       ABS       10
6-6a11       CCT        5            6-6b11       CCT        5
6-6a12       HUM        1            6-6b12       TIM        9
6-6a13       ORG        2            6-6b13       HAT        4
6-6a14       ORG        2            6-6b14       EVT        7
6-6a15       HUM        1            6-6b15       ABS       10
6-6a16       ANI        3            6-6b16       CCT        5
6-6a17       HUM        1            6-6b17       CCT        5
6-6a18       HUM        1            6-6b18       CCT        5
6-6a19       HUM        1
6-6a20       HUM        1
6-6a21       HUM        1
6-6a22       ANI        3
6-6a23       HUM        1

A simple way of finding out whether the data conform to our prediction would be
to sort the entire data set by the rank assigned to the examples and check
whether the s-possessives cluster near the top of the list and the of-possessives
near the bottom. Table 6-7 shows this ranking.

Table 6-7. The coded sample from Table 6-6, ordered by assigned rank

Rank   Example                     Rank   Example (contd.)
1      s-possessive (6-6a10)       3      s-possessive (6-6a16)
1      s-possessive (6-6a12)       3      s-possessive (6-6a22)
1      s-possessive (6-6a15)       4      of-possessive (6-6b13)
1      s-possessive (6-6a17)       5      s-possessive (6-6a11)
1      s-possessive (6-6a18)       5      of-possessive (6-6b11)
1      s-possessive (6-6a19)       5      of-possessive (6-6b16)
1      s-possessive (6-6a2)        5      of-possessive (6-6b17)
1      s-possessive (6-6a20)       5      of-possessive (6-6b18)
1      s-possessive (6-6a21)       5      of-possessive (6-6b5)
1      s-possessive (6-6a23)       6      s-possessive (6-6a4)
1      s-possessive (6-6a3)        6      of-possessive (6-6b9)
1      s-possessive (6-6a7)        7      of-possessive (6-6b14)
1      s-possessive (6-6a8)        7      of-possessive (6-6b3)
1      s-possessive (6-6a9)        8      of-possessive (6-6b1)
1      of-possessive (6-6b4)       9      of-possessive (6-6b12)
1      of-possessive (6-6b7)       10     of-possessive (6-6b10)
2      s-possessive (6-6a1)        10     of-possessive (6-6b15)
2      s-possessive (6-6a13)       10     of-possessive (6-6b6)
2      s-possessive (6-6a14)       10     of-possessive (6-6b8)
2      s-possessive (6-6a5)
2      s-possessive (6-6a6)
2      of-possessive (6-6b2)

Table 6-7 shows that the data conform to our hypothesis: among the cases whose
modifiers have an animacy of rank 1 to 3, s-possessives dominate; among those
with a modifier of rank 4 to 10, of-possessives make up an overwhelming
majority.
However, we need a less impressionistic way of summarizing data sets coded
as ordinal variables, since not all data sets will be as straightforwardly
interpretable as this one. So let us turn to the question of an appropriate
descriptive statistic for ordinal data.

6.3.1 Medians
As explained above, we cannot calculate a mean for a set of ordinal values, but we
can do something similar. The idea behind calculating a mean value is, essentially,
to provide a kind of mid-point around which a set of values is distributed: it is a
so-called measure of central tendency. Thus, if we cannot calculate a mean, the
next best thing is to simply list our data ordered from highest to lowest and find
the value in the middle of that list. This value is known as the median: a value
that splits a sample or population into a higher and a lower portion of equal sizes.
For example, the rank values for the ANIMACY of our sample of s-possessives
are shown in (6.18):

(6.18) 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 5 6

There are 23 values, thus the median is the twelfth value in the series, because
there are eleven values above it and eleven below it. The twelfth value in the
series is a 1, so the median value of s-possessive modifiers in our sample is 1 (or
HUMAN).
If the sample consists of an even number of data points, we simply calculate
the mean between the two values that lie in the middle of the ordered data set.
For example, the rank values for the ANIMACY of our sample of of-possessives are
shown in (6.19):

(6.19) 1 1 2 4 5 5 5 5 5 6 7 7 8 9 10 10 10 10

There are 18 values, so the median falls between the ninth and the tenth value.
The ninth and tenth values are 5 and 6 respectively, so the median for the
of-possessive modifiers is (5+6)/2 = 5.5 (i.e., it falls between CONCRETE TOUCHABLE
and CONCRETE NONTOUCHABLE).
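Both cases of the median calculation can be sketched in a few lines of Python (the function is a plain reimplementation for illustration; Python's standard `statistics.median` follows the same convention of averaging the two middle values):

```python
def median(values):
    """Return the middle value of a data set; with an even number of
    data points, the mean of the two middle values."""
    s = sorted(values)
    n = len(s)
    if n % 2 == 1:
        return s[n // 2]
    return (s[n // 2 - 1] + s[n // 2]) / 2

s_ranks = [1] * 14 + [2] * 5 + [3, 3, 5, 6]    # the 23 values in (6.18)
of_ranks = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6,
            7, 7, 8, 9, 10, 10, 10, 10]        # the 18 values in (6.19)

print(median(s_ranks))   # 1
print(median(of_ranks))  # 5.5
```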
Using the idea of a median, we can now rephrase our prediction in
quantitative terms:

(6.20) Prediction: The modifiers of the S-POSSESSIVE will have a higher median on
the ANIMACY scale than the modifiers of the OF-POSSESSIVE.

Our data conform to this prediction, as 1 is higher on the scale than 5.5. As
before, this does not prove or disprove anything, as, again, we would expect some
random variation. Again, we will return to this issue in Chapter 7.

6.3.2 Frequency lists and mode

Recall that I mentioned above the possibility of treating ordinal data like
nominal data. Table 6-8 shows the relative frequencies for each animacy category
(alternatively, we could also calculate expected frequencies in the way described
in Section 6.2.2 above).

Table 6-8. Relative frequencies of the ANIMACY values of the modifiers of the s-
and of-possessives

RANK   ANIMACY CATEGORY        S-POSSESSIVE     OF-POSSESSIVE
 1     HUMAN                    14   60.9%        2   11.1%
 2     ORGANIZATION              5   21.7%        1    5.6%
 3     OTHER ANIMATE             2    8.7%        0
 4     HUMAN ATTRIBUTE           0                1    5.6%
 5     CONCRETE TOUCHABLE        1    4.3%        5   27.8%
 6     CONCRETE NONTOUCHABLE     1    4.3%        1    5.6%
 7     LOCATION                  0                1    5.6%
 8     TIME                      0                1    5.6%
 9     EVENT                     0                2   11.1%
10     ABSTRACT                  0                4   22.2%
       Total                    23    100%       18    100%

As we can see, this table also nicely shows the preference of the s-possessive for
animate modifiers (HUMAN, ORGANIZATION, OTHER ANIMATE) and the preference of the
of-possessive for the categories lower on the hierarchy. The table also shows,
however, that the modifiers of the of-possessive are much more evenly distributed
across the entire ANIMACY scale than those of the s-possessive.
For completeness' sake, let me point out that there is a third measure of
central tendency that is especially suited to nominal data (but can also be applied
to ordinal and cardinal data): the mode. The mode is simply the most frequent
value in a sample, so with respect to animacy, the modifiers of the of-possessive
have a mode of 5 (or CONCRETE TOUCHABLE) and those of the s-possessive have a
mode of 1 (or HUMAN) (similarly, we could have said that the mode of s-possessive
modifiers is DISCOURSE-OLD and the mode of of-possessive modifiers is DISCOURSE-NEW).
There may be more than one mode in a given sample. For example, if we had
found just a single additional modifier of the type ABSTRACT in the sample above
(which could easily have happened), its frequency would also be five; in this case,
the of-possessive modifiers would have two modes (CONCRETE TOUCHABLE and
ABSTRACT).
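The mode calculation, allowing for multiple modes, can be sketched in Python as follows (the category abbreviations follow Table 6-6, with counts as in Table 6-8; the function name is mine):

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency in the sample."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, n in counts.items() if n == top)

# Animacy categories of the 18 of-possessive modifiers (cf. Tables 6-6 and 6-8).
of_categories = (["HUM"] * 2 + ["ORG"] + ["HAT"] + ["CCT"] * 5 + ["CCN"]
                 + ["LOC"] + ["TIM"] + ["EVT"] * 2 + ["ABS"] * 4)

print(modes(of_categories))            # ['CCT']
print(modes(of_categories + ["ABS"]))  # ['ABS', 'CCT']: two modes
```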
The concept of mode may seem useful in cases where we are looking for a
single value by which to characterize a set of nominal data, but on closer
inspection it turns out that it does not actually tell us very much: it tells us what
the most frequent value is, but not how much more frequent that value is than
the next most frequent one, how many other values occur in the data at all, etc.
Thus, it is always preferable to report the frequencies of all values, and, in fact, I
have never come across a corpus-linguistic study reporting modes.

6.4 Descriptive Statistics for Cardinal Data
Let us turn, finally, to a design with one nominal and one cardinal variable: a test
of the third of the three hypotheses introduced at the beginning of this chapter.
Again, it is restated here together with the background assumption from which it
is derived:

(6.21) Assumption: Short items tend to occur toward the beginning of a
constituent, long items tend to occur at the end.
Hypothesis: The S-POSSESSIVE will be used with short modifiers, the OF-
POSSESSIVE will be used with long modifiers.

The constructions are operationalized as before. The data used are based on the
same data set as before, except that cases with proper names and pronouns are
excluded. The reason for this is that we already know from the first case study
that pronouns, which we used as an operational definition of old information,
prefer the s-possessive. Since all pronouns are very short (regardless of whether
we measure their length in terms of words, syllables or letters), including them
would bias our data in favor of the hypothesis. This left 20 cases of the s-
possessive and 154 cases of the of-possessive. To get samples of roughly equal
size for expository clarity, I selected every sixth case of the of-possessive, giving
me 25 cases (note that in a real study, there would be no good reason to create
such roughly equal sample sizes; we would simply use all the data we have).
The variable LENGTH was defined operationally as the number of orthographic
words. We can now state the following prediction:

(6.22) Prediction: The mean length of the modifiers of the S-POSSESSIVE should be
smaller than that of the modifiers of the OF-POSSESSIVE.

Table 6-9 shows the length of head and modifier for all cases in our sample
(again, see Appendix D.1, Study 3 for the actual examples).

Table 6-9. Length of heads and modifiers in a sample of s- and of-possessives
(BROWN corpus)

(a) S-POSSESSIVE                      (b) OF-POSSESSIVE
Example ID   Modifier   Head          Example ID   Modifier   Head
6-9a1           2        14           6-9b1           3         4
6-9a2           2         6           6-9b2           5         2
6-9a3           2         1           6-9b3           4         2
6-9a4           2         3           6-9b4           8         2
6-9a5           3         1           6-9b5           7         3
6-9a6           1         2           6-9b6           1         1
6-9a7           2         2           6-9b7           9         5
6-9a8           2         1           6-9b8           2         3
6-9a9           2         1           6-9b9           5         2
6-9a10          2         1           6-9b10          6         2
6-9a11          1         1           6-9b11          2         3
6-9a12          2         4           6-9b12          2         2
6-9a13          1         7           6-9b13          1         2
6-9a14          3         1           6-9b14          8         2
6-9a15          2         1           6-9b15          5         4
6-9a16          1         1           6-9b16          2         2
6-9a17          2         3           6-9b17          2         2
6-9a18          2         1           6-9b18         20         2
6-9a19          2         4           6-9b19          2         2
6-9a20          2         2           6-9b20          1         1
                                      6-9b21          2         2
                                      6-9b22          8         2
                                      6-9b23          3         2
                                      6-9b24          2         3
                                      6-9b25          2         2

6.4.1 Means
How to calculate a mean (more precisely, an arithmetic mean) is common
knowledge, but for completeness' sake, here is the formula:

(6.23) x̄_arithm = (1/n) Σ(i=1..n) x_i = (x_1 + x_2 + … + x_n) / n

In other words, in order to calculate the mean of a set of values {x_1, x_2, … x_n} of
size n, we add up all values and divide them by n (or multiply them by 1/n, which
is the same thing).
Since we have stated our hypothesis and the corresponding prediction only in
terms of the modifier, we should first make sure that the heads of the two
possessives do not differ greatly in length: if they did, any differences we find for
the modifiers could simply be related to the fact that one of the constructions
may be longer in general than the other. Adding up all 20 values for the s-
possessive heads gives us a total of 57, so the mean is 57/20 = 2.85. Adding up all
25 values of the of-possessive heads gives us a total of 59, so the mean is 59/25 =
2.36. We have, as yet, no way of telling whether this difference could be due to
chance, but the two values are so close together that we will assume so for now.
In fact, note that there is one obvious outlier (a value that is much bigger than the
others): Example 6-9a1 has a head that is 14 words long. If we assume that this is
somehow exceptional and remove this value, we get a mean length of 43/19 =
2.26, which is almost identical to the mean length of the of-possessive heads.
If we apply the same formula to the modifiers, however, we find that they
differ substantially: the mean length of the s-possessive modifiers is 38/20 = 1.9,
while the mean length of the of-possessive modifiers is more than twice as much,
namely 112/25 = 4.48. Even if we remove the obvious outlier, example (6-9b18),
the of-possessive modifiers are twice as long as those of the s-possessive,
namely 92/24 = 3.83.
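These calculations can be replicated with a few lines of Python (the length values are copied from Table 6-9; the function and variable names are mine):

```python
def mean(values):
    """Arithmetic mean as defined in (6.23): sum of all values divided by n."""
    return sum(values) / len(values)

# Modifier lengths from Table 6-9 (in orthographic words).
s_mod = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
of_mod = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1,
          8, 5, 2, 2, 20, 2, 1, 2, 8, 3, 2, 2]

print(mean(s_mod))   # 1.9
print(mean(of_mod))  # 4.48

# Removing the outlier 6-9b18 (length 20) from the of-possessive modifiers:
print(round(mean([x for x in of_mod if x != 20]), 2))  # 3.83
```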
6.5 Summary
We have looked at three case studies, one involving nominal, one ordinal and one
cardinal data. In each case, we were able to state a hypothesis and derive a
quantitative prediction from it. Using appropriate descriptive statistics
(percentages, observed and expected frequencies, modes, medians and means), we
were able to determine that the data conform to these predictions, i.e., that the
quantitative distribution of the values of the variables PART OF SPEECH, ANIMACY and
LENGTH across the conditions S-POSSESSIVE and OF-POSSESSIVE fits the predictions
formulated.

However, these distributions by themselves do not prove (or, more precisely,
fail to disprove) the hypotheses, for two related reasons. First, the predictions are
stated in relative terms, i.e. in terms of more-or-less, but they do not tell us how
much more or less we should expect to observe. Second, we do not know, and
currently have no way of determining, whether the more-or-less that we observe
reflects real differences in distribution, or whether it falls within the range of
random variation that we always expect when observing tendencies. More
generally, we do not know how to apply the Popperian all-or-nothing research
logic to quantitative predictions. All this will be the topic of the next chapter.

Further reading
See next chapter.
7. Significance Testing
In Chapter 3, we discussed the fact that we can never prove a hypothesis to be
true, but we can prove it to be false. This insight leads to the Popperian research
cycle that roughly goes as follows: the researcher formulates a hypothesis and
then does their best to falsify it. If they manage to falsify it, the hypothesis has to
be revised; if they don't manage to falsify it, they may continue assuming that it is
correct. If the hypothesis can be formulated in such a way that it could be
falsified by a counterexample, this procedure can be applied fairly
straightforwardly (although there may be difficulties in deciding what counts as a
counterexample).
But, as also discussed in Chapter 3, if the hypothesis is formulated in relative
terms, counterexamples are irrelevant. In the last chapter, we formulated
hypotheses and derived different types of quantitative predictions. We then coded
samples of corpus data and summarized them quantitatively (in terms of frequencies
and percentages, rankings and medians, and means). Finally, we compared the
results to the predictions to see whether they conformed to these predictions.
However, we also pointed out that the fact that our results conform to our
predictions does not prove the corresponding hypothesis to be true, because we
can never prove a hypothesis to be true.


In statistical hypothesis testing, this problem is solved by formulating not one,
but two hypotheses. First, as always, we formulate our research hypothesis,
which states that there is a relationship between two (or more) variables, and
specifies the nature of this relationship. This hypothesis is referred to as H1 (or
"alternative hypothesis") in statistical research designs. Second, we formulate a
"null hypothesis" (or H0), which states that there is no relationship between our
variables. Crucially, we then attempt to falsify the null hypothesis and to show
that the data conform to the alternative hypothesis.
In the first step, the null hypothesis and the alternative hypothesis are turned
into quantitative predictions concerning the intersections of the variables, as
schematically shown in (7.1a, b):

(7.1a) Null hypothesis (H0): There is no relationship between Variable A and
Variable B.
Prediction: The data should be distributed randomly across the
intersections of A and B; i.e., the frequencies/medians/means of the
intersections should not differ from those expected by chance.

(7.1b) Alternative hypothesis (H1): There is a relationship between Variable A
and Variable B such that some value(s) of A tend to co-occur with some
value(s) of B.
Prediction: The data should be distributed non-randomly across the
intersections of A and B; i.e., the frequencies/medians/means of some of the
intersections should be higher and/or lower than those expected by
chance.

Once we have formulated our research hypothesis and the corresponding null
hypothesis in this way (and once we have operationalized the constructs used in
formulating them), we collect the relevant data.
In a second step, we then have to show that the data we have collected differ from the prediction of the null hypothesis in the direction predicted by the alternative hypothesis. For example, if Variable A has the values X and Y and Variable B has the values P and Q, and our prediction is that X and P tend to co-occur, we need to show that the combination of X and P is more frequent, has a higher median, or has a higher mean than it would have if the relationship between the two variables were random. If this is not the case, i.e., if the frequency, the median or the mean of the combination X & P is equal to what we would expect by chance, or if it is actually lower than that, then we cannot falsify the null hypothesis (at least not in a way that is compatible with our alternative hypothesis) and have to assume instead that our research hypothesis is false. If it is the case, i.e., if the frequency, the median or the mean of the combination X & P is higher than what we would expect by chance, we know that we can potentially reject the null hypothesis in a way that conforms to our research hypothesis; this was the case in the case studies in the preceding chapter. However, we cannot actually reject the null hypothesis yet, because we must always expect a certain amount of random variation even if there is a relationship between two variables.
Before we can reject the null hypothesis, we have to show in a third step that the difference between the observed distribution and the chance distribution is so large that it is likely not due to random variation. This is the task of inferential statistics, which we will deal with in this chapter.

7.1 Probabilities and significance testing

Recall the example of a coin that is flipped onto a hard surface: every time we flip it, there is a fifty percent chance that it will come down heads, and a fifty percent chance that it will come down tails. From this it follows, for example, that if we flip a coin ten times, the expected outcome is five heads and five tails. However, as pointed out in the last chapter, this is only a theoretical expectation derived from the probabilities of each individual outcome. In reality, every outcome, from ten heads to ten tails, is possible, as each flip of the coin is an independent event.
Intuitively, we know this: if we flip a coin ten times, we do not really expect it to come down heads and tails exactly five times each, but we accept a certain amount of variation. However, the greater the imbalance between heads and tails, the less willing we will be to accept it as a result of chance. In other words, we would not be surprised if the coin came down heads six times and tails four times, or even heads seven times and tails three times, but we might already be slightly surprised if it came down heads eight times and tails only twice, and we would certainly be surprised to get a series of ten heads and no tails.
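The role of chance variation described here can be illustrated with a quick simulation (a sketch, not part of the original text; the exact counts will differ from run to run unless the random seed is fixed):

```python
import random

random.seed(1)  # fix the seed so that the simulation is reproducible

# Flip a fair coin ten times, repeat the experiment 10,000 times,
# and tally how often each possible number of heads occurs.
counts = [0] * 11
for _ in range(10_000):
    heads = sum(random.random() < 0.5 for _ in range(10))
    counts[heads] += 1

for heads, count in enumerate(counts):
    print(f"{heads:2d} heads: {count}")
```

Five heads will typically be the most frequent outcome, with four and six heads close behind, while runs of ten heads or ten tails occur only a handful of times in ten thousand experiments.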


Let us look at the reasons for this surprise, beginning with a much shorter series of just two coin flips. There are four possible outcomes of such a series:

(7.2) heads - heads
      heads - tails
      tails - heads
      tails - tails

Obviously, none of these outcomes is more or less likely than the others: since there are four possible outcomes, they each have a probability of 1/4 = 0.25 (i.e., 25 percent; we will be using the decimal notation for percentages from here on). Alternatively, we can calculate the probability of each series by multiplying the probability of the individual events in each series, i.e. 0.5 × 0.5 = 0.25.
Crucially, however, there are differences in the probability of getting a particular set of results (i.e., a particular number of heads and tails, regardless of the order they occur in): There is only one possibility of getting two heads and one of getting two tails, but there are two possibilities of getting one head and one tail.

We calculate the probability of a particular set by adding up the probabilities of all possible series that will lead to this set. Thus, the probabilities for the sets {heads, heads} and {tails, tails} are 0.25 each, while the probability for the set {heads, tails}, corresponding to the series heads-tails and tails-heads, is 0.25 + 0.25 = 0.5.
This kind of coin-flip logic (also known as probability theory) can be utilized in evaluating hypotheses that have been stated in quantitative terms. Take the larger set of ten coin flips mentioned at the beginning of this section: now, there are eleven potential outcomes, shown in Table 7-1.

Table 7-1. Sets of ten coin flips.

     Unordered Set          No. of Series   Probability
 1   {0 heads, 10 tails}                1      0.000977
 2   {1 heads, 9 tails}                10      0.009766
 3   {2 heads, 8 tails}                45      0.043945
 4   {3 heads, 7 tails}               120      0.117188
 5   {4 heads, 6 tails}               210      0.205078
 6   {5 heads, 5 tails}               252      0.246094
 7   {6 heads, 4 tails}               210      0.205078
 8   {7 heads, 3 tails}               120      0.117188
 9   {8 heads, 2 tails}                45      0.043945
10   {9 heads, 1 tails}                10      0.009766
11   {10 heads, 0 tails}                1      0.000977
Again, these outcomes differ with respect to their probability. The third column of Table 7-1 gives us the number of different series corresponding to each set.17 For example, there is only one way to get a set consisting of heads only: the coin must come down showing heads every single time. There are ten different ways of getting one heads and nine tails: The coin must come down heads the first or second or third or fourth or fifth or sixth or seventh or eighth or ninth or tenth time, and tails the rest of the time. Next, there are forty-five different ways of getting two heads and eight tails, which I am not going to list here (but you may want to, as an exercise), and so on. The fourth column contains the same information, expressed in terms of relative frequencies: there are 1024 different series of ten coin flips, so the probability of getting, for example, two heads and eight tails is 45/1024 = 0.043945.

17 You may remember having heard of Pascal's triangle, which, among more sophisticated things, lets us calculate the number of different ways in which we can get a particular combination of heads and tails for a given number of coin flips: the third column of Table 7-1 corresponds to line 11 of this triangle. If you don't remember, no worries, we will not need it.
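The third and fourth columns of Table 7-1 can be recomputed with a few lines of Python (a sketch using only the standard library; `math.comb` returns the binomial coefficients, i.e. the entries of the relevant line of Pascal's triangle):

```python
from math import comb

flips = 10
total_series = 2 ** flips  # 1024 possible ordered series of ten coin flips

for heads in range(flips + 1):
    series = comb(flips, heads)          # number of series yielding this set
    probability = series / total_series  # relative frequency of the set
    print(f"{{{heads} heads, {flips - heads} tails}}: "
          f"{series} series, p = {probability:.6f}")
```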
The basic idea behind statistical hypothesis testing is simple: calculate the probability of the result that we have observed. The lower this probability is, the less likely it is to be due to chance and the more likely we are correct if we reject the null hypothesis. For example, if we observed a series of ten heads and zero tails, we know that the likelihood that the deviation from the expected result of five heads and five tails is due to chance is 0.000977 (i.e. roughly a tenth of a percent). This tenth of a percent is also the probability that we are wrong if we reject the null hypothesis and claim that the coin is not behaving randomly (for example, that it is manipulated in some way).
If we observed one heads and nine tails, we would know that the likelihood that this deviation from the expected result is due to chance is 0.009766 (i.e. almost one percent). Thus we might think that, again, if we reject the null hypothesis, this is the likelihood that we are wrong. However, we must add to this the likelihood of getting zero heads and ten tails. The reason for this is that if we accept a result of 1:9 as evidence for a non-random distribution, we would also accept the even more extreme result of 0:10. So the likelihood that we are wrong in rejecting the null hypothesis is 0.000977 + 0.009766 = 0.010743. In other words: the probability that we are wrong in rejecting the null hypothesis is always the probability of the observed result plus the probabilities of all results that deviate from the null hypothesis even further in the direction of the observed frequency. This is called the probability of error (or simply p) in statistics.
By convention, a five percent chance of being wrong is considered to be the limit as far as acceptable risks are concerned in statistics: if p < 0.05 (i.e., if p is smaller than five percent), the result is said to be statistically significant (i.e., not due to chance); if it is larger, the result is said to be non-significant (i.e., probably due to chance). Two other values of p that have a conventional importance are 0.01 (one percent), at which a result is said to be very significant, and 0.001 (a tenth of a percent), at which a result is considered to be highly significant.
Note that the probability of error depends not just on the proportion of the deviation, but also on the overall size of the sample. For example, if we observe a series of two heads and eight tails (i.e., twenty percent heads), the probability of error in rejecting the null hypothesis is 0.000977 + 0.009766 + 0.043945 = 0.054688. However, if we observe a series of four heads and sixteen tails (again, twenty percent heads), the probability of error would be roughly ten times lower, namely 0.005909. The reason is the following: There are 1,048,576 possible series of twenty coin flips. There is still only one way of getting zero heads and twenty tails, so the probability of getting zero heads and twenty tails is 1/1,048,576 = 0.0000009536743; however, there are already 20 ways of getting one head and nineteen tails (so the probability is 20/1,048,576 = 0.000019), 190 ways of getting two heads and eighteen tails (p = 190/1,048,576 = 0.000181), 1140 ways of getting three heads and seventeen tails (p = 1140/1,048,576 = 0.001087) and 4845 ways of getting four heads and sixteen tails (p = 4845/1,048,576 = 0.004621). And adding up these probabilities gives us 0.005909.
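The probabilities of error just given can be reproduced by summing the relevant tail of the distribution (a sketch; the function name is mine and not standard terminology):

```python
from math import comb

def prob_of_error(max_heads, flips):
    """Probability of observing max_heads heads or fewer in a series of
    fair-coin flips, i.e. the observed result plus all results deviating
    even further from the expected fifty-fifty split in the same direction."""
    return sum(comb(flips, k) for k in range(max_heads + 1)) / 2 ** flips

print(prob_of_error(1, 10))  # one head in ten flips:   ~0.010743
print(prob_of_error(2, 10))  # two heads in ten flips:  ~0.054688 (> 0.05)
print(prob_of_error(4, 20))  # four heads in twenty:    ~0.005909 (< 0.01)
```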
Obviously, most research designs in any discipline are more complicated than coin flipping, which involves just a single variable with two values. However, it is theoretically possible to generalize the coin-flipping logic to any research design, i.e., calculate the probabilities of all possible outcomes and add up the probabilities of the observed outcome and all outcomes that deviate from the expected outcome even further in the same direction. Simply put, though, this is only a theoretical possibility, as the computations would be too long and complex even for the most powerful computers in existence, let alone for manual calculation or a standard-issue home computer. Therefore, many statistical methods use a kind of mathematical trick: they derive from a given sample a single value whose probability distribution is known, a so-called test statistic. Instead of calculating the probability of our observed outcome directly, we can then assess its probability by comparing the test statistic against its known distribution. Mathematically, this involves identifying its position on the respective distribution and, as we did above, adding up the probability of this position and all positions deviating further from the chance distribution. In practice, we just have to look up the test statistic in a chart that will give us the corresponding probability of error (or p-value, as we will call it from now on).
In the following three sections, I will introduce three widely-used tests involving test statistics for the three types of data discussed in the previous section: the chi-square test for nominal data, the Wilcoxon-Mann-Whitney test (also known as Mann-Whitney U test or Wilcoxon rank sum test) for ordinal data, and Welch's t-test for cardinal data.
Given the vast range of corpus-linguistic research designs, these three tests will not always be the ideal choice. In many cases, there are more sophisticated statistical procedures which are better suited to the task at hand, be it for theoretical (mathematical or linguistic) or for practical reasons. However, the statistical tests introduced here have some advantages that make them ideal procedures for an initial statistical evaluation of results. For example, they are easy to perform: we don't need more than paper and a pencil, or a calculator if we want to speed things up, and they are also included as standard functions in widely-used spreadsheet applications. They are also relatively robust even in situations where we should not really use them (a point I will return to below). They are also ideal procedures for introducing statistics to novices. Again, they are easy to perform and do not require statistical software packages that are typically expensive and/or have a steep learning curve. They are also relatively transparent with respect to their underlying logic and the steps required to perform them. Thus, my purpose in introducing them in some detail here is at least as much to introduce the logic and the challenges of statistical analysis as it is to provide basic tools for actual research. I will not introduce the mathematical underpinnings of these tests and I will mention alternative procedures only in passing. For those who are interested in a deeper discussion of those issues, there are a large number of relatively readable introductory textbooks on probability theory and theoretical and applied statistics (see the Further Reading section at the end of this chapter). I will also not make any reference to statistical software, although such software packages are useful for anyone planning to use statistics in their actual research.

7.2 Nominal Data: The chi-square test

As mentioned in the preceding chapter, nominal data (or data that are best treated like nominal data) are the type of data most frequently encountered in corpus linguistics. I will therefore treat them in slightly more detail than the other two types, introducing different versions and (in the next chapter) extensions of the most widely used statistical test for nominal data, the chi-square (χ²) test. This test in all its variants is extremely flexible, and is thus more useful across different research designs than many of the more specific and more sophisticated procedures (much like a Swiss army knife is an excellent all-purpose tool despite the fact that there is usually a better tool dedicated to a specific task at hand).
Despite its flexibility, there are two requirements that must be met in order for the chi-square test to be applicable: first, no intersection of variables must have a frequency of zero in the data, and second, no more than twenty-five percent of the intersections must have frequencies lower than five. When these conditions are not met, an alternative test must be used instead (or we need to collect additional data).


7.2.1. Two-by-two designs


Let us begin with a two-by-two design and return to the case of discourse-old and discourse-new modifiers in the two English possessive constructions. Here is the research hypothesis again, paraphrased from (6.6) and (6.9):

(7.3a) H1: There is a relationship between DISCOURSE STATUS and TYPE OF POSSESSIVE such that the S-POSSESSIVE is preferred when the modifier is DISCOURSE-OLD, and the OF-POSSESSIVE is preferred when the modifier is DISCOURSE-NEW.
Prediction: There will be more cases of the S-POSSESSIVE with DISCOURSE-OLD modifiers than with DISCOURSE-NEW modifiers, and more cases of the OF-POSSESSIVE with DISCOURSE-NEW modifiers than with DISCOURSE-OLD modifiers.

The corresponding null hypothesis is stated in (7.3b):

(7.3b) H0: There is no relationship between DISCOURSE STATUS and TYPE OF POSSESSIVE.
Prediction: DISCOURSE-OLD and DISCOURSE-NEW modifiers will be distributed randomly across the two POSSESSIVE constructions.
We already reported the observed and expected frequencies, but let us repeat them here for convenience:

Table 7-2. Observed and expected frequencies of old and new modifiers in the s- and the of-possessive (= Table 6-4b)

                                          POSSESSIVE
                               S-POSSESSIVE   OF-POSSESSIVE   Total
DISCOURSE    OLD     Obs.          180              3           183
STATUS               Exp.          102.81          80.19
             NEW     Obs.           20            153           173
                     Exp.           97.19          75.81
             Total   Obs.          200            156           356

In order to test our research hypothesis, we must show that the observed frequencies differ from the null hypothesis in the direction of our prediction. We already saw in Chapter 6 that this is the case: The null hypothesis predicts the expected frequencies, but there are more cases of s-possessives with old modifiers and of-possessives with new modifiers than expected. Next, we must apply the coin-flip logic and ask the question: Given the sample size, how surprising is the difference between the expected frequencies (i.e., a perfect chance distribution) and the observed frequencies?
As mentioned above, the conceptually simplest way of doing this would be to compute all possible ways in which the marginal frequencies (the sums of the columns and rows) could be distributed across the four cells of our table and then check what proportion of these tables deviates from the perfect chance distribution at least as much as the table we have actually observed. For two-by-two tables, there is, in fact, a test that does this, the exact test after Fisher and Yates, and where the conditions for using the chi-square test are not met, we should use it. But, as mentioned above, this test is difficult to perform without statistical software, and it is not available for tables larger than 2-by-2 anyway, so instead we will derive the chi-square test statistic from the table.
First, we need to assess the magnitude of the differences between observed and expected frequencies. The simplest way of doing this would be to subtract the expected frequencies from the observed ones, giving us numbers that show for each cell the size of the deviation as well as its direction (i.e., whether the observed frequencies are higher or lower than the expected ones). For example, the values for Table 7-2 would be 77.19 for cell C11 (S-POSSESSIVE & OLD), -77.19 for C21 (OF-POSSESSIVE & OLD), -77.19 for C12 (S-POSSESSIVE & NEW) and 77.19 for C22 (OF-POSSESSIVE & NEW). However, we want to derive a single measure from the table, so we need a measure of the overall deviation of the observed frequencies from the expected ones, not just a measure for the individual intersections. Obviously, adding up the differences of all intersections does not give us such a measure, as it would always be zero (since the marginal frequencies are fixed, any positive deviance in one cell will have a corresponding negative deviance in its neighboring cells). Second, subtracting the expected from the observed frequencies gives us the same number for each cell, when it is obvious that the actual magnitude of the deviation depends on the expected frequency. For example, a deviation of 77.19 is more substantial if the expected frequency is 75.81 than if the expected frequency is 102.81. In the first case, the observed frequency is more than a hundred percent higher than expected; in the second case, it is only 75 percent higher.
The first problem is solved by squaring the differences. This converts all deviations into positive numbers, and thus their sum will no longer be zero, and it has the additional effect of weighing larger deviations more strongly than smaller ones. The second problem is solved by dividing the squared differences by the expected frequencies. This will ensure that a deviation of a particular size will be weighed more heavily for a small expected frequency than for a large expected frequency. The values arrived at in this way are referred to as the cell components of chi-square (or simply chi-square components); the formulas for calculating the cell components in this way are shown in Table 7-3a.

Table 7-3a. Calculating chi-square components for individual cells


DEPENDENT VARIABLE
VALUE 1 VALUE 2

2 2
INDEP. VARIABLE

(O 11 E 11 ) (O 12 E 12 )

.1
VALUE 1
E 11 E 12
2 2
(O 21 E 21) (O 22 E 22)
VALUE 2 v0
E 21 E 22

If we apply this procedure to Table 7-2, we get the components shown in Table 7-3b.
Table 7-3b. Chi-square components for Table 7-2

                                     POSSESSIVE
                        S-POSSESSIVE                    OF-POSSESSIVE
DISC.      OLD   (180 - 102.81)² / 102.81 = 57.96    (3 - 80.19)² / 80.19 = 74.30
STATUS     NEW   (20 - 97.19)² / 97.19 = 61.31       (153 - 75.81)² / 75.81 = 78.60

The degree of deviance from the expected frequencies for the entire table can then be calculated by adding up the chi-square components. For Table 7-3b, the chi-square value (χ²) is 272.16. This value can now be used to determine the probability of error by checking it against a table like that in Appendix C.1.
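The whole computation, from the observed frequencies to the chi-square value, can be sketched in plain Python (the function name is mine; no statistics library is needed):

```python
def chi_square(observed):
    """Chi-square value for a contingency table (a list of rows), with the
    expected frequencies derived from the marginal sums as in Table 7-2."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected frequency
            chi2 += (obs - exp) ** 2 / exp           # cell component
    return chi2

# Table 7-2: rows are DISCOURSE STATUS (old, new),
# columns are POSSESSIVE (s-possessive, of-possessive)
print(round(chi_square([[180, 3], [20, 153]]), 2))  # 272.16
```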
Before we can do so, there is a final technical point to make. Note that the degree of variation in a given table that is expected to occur by chance depends quite heavily on the size of the table. The bigger the table, the higher the number of cells that can vary independently of other cells without changing the marginal sums (i.e., without changing the overall distribution). The number of such cells that a table contains is referred to as the number of degrees of freedom of the table. In the case of a two-by-two table, there is just one such cell: if we change any single cell, we must automatically adjust the other three cells in order to keep the marginal sums constant. Thus, a two-by-two table has one degree of freedom. Significance levels of chi-square values differ depending on how many degrees of freedom a table has.
Now, we can turn to the table in Appendix C.1. In the row for one degree of freedom (the first row), we must check whether our χ²-value is larger than that required for the level of significance that we are after. In our case, the value of 272.16 is much higher than the chi-square value required for a significance level of 0.001 at one degree of freedom, which is 10.83. Thus, we can say that the differences in Table 7-2 are statistically highly significant. The results of a chi-square test are conventionally reported in the following format:

(7.4) Format for reporting the results of a chi-square test
(χ² = [CHI-SQUARE VALUE], df = [DEG. OF FREEDOM], p < (or >) [SIG. LEVEL]).

Thus, in the present case, the analysis might be summarized along the following lines: "This study has shown that s-possessives are preferred when the modifier is discourse-old while of-possessives are preferred when the modifier is discourse-new. The differences between the constructions are highly significant (χ² = 272.16, df = 1, p < 0.001)."
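Readers without the printed chart can compute the p-value directly in the one-degree-of-freedom case, where p = erfc(√(χ²/2)) (a sketch; for tables with more degrees of freedom, a statistics library or the chart in Appendix C.1 is needed):

```python
from math import erfc, sqrt

def chi2_p_df1(chi2):
    # p-value for a chi-square value at one degree of freedom
    return erfc(sqrt(chi2 / 2))

print(chi2_p_df1(3.84))            # ~0.05, the conventional threshold
print(chi2_p_df1(10.83))           # ~0.001, the 'highly significant' threshold
print(chi2_p_df1(272.16) < 0.001)  # True
```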
A potential danger of this way of formulating the results is the meaning of the word significant. In statistical terminology, this word simply means that the results obtained in a study based on one particular sample are unlikely to be due to chance and can therefore be generalized, with some degree of certainty, to the entire population. In contrast, in every-day usage the word means something along the lines of "having an important effect or influence" (LDCE, s.v. significant). Because of this every-day use, it is easy to equate statistical significance with theoretical importance. However, there are at least three reasons why this equation must be avoided.
First, and perhaps most obviously, statistical significance has nothing to do with the validity of the operational definitions used in our research design. In our case, this validity is reasonably high, provided that we limit our conclusions to written English. As a related point, statistical significance has nothing to do with the quality of our data. If we have chosen unrepresentative data or if we have extracted or coded our data sloppily, the statistical significance of the results is meaningless.
Second, statistical significance has nothing to do with theoretical relevance. Put simply, if we have no theoretical model in which the results can be interpreted meaningfully, statistical significance does not add to our understanding of the object of research. If, for example, we had shown that the preference for the two possessives differed significantly depending on the font in which a modifier is printed, rather than on the discourse status of the modifier, there is not much that we could conclude from our findings.18
18 This problem cannot be dismissed as lightly as this example may suggest: it points to a fundamental difficulty in doing science. Note that if we did find that the font has an influence on the choice of possessive, we would most likely dismiss this finding as a random fluke despite its statistical significance. And we may well be right, since even a level of significance of p < 0.001 does not preclude the possibility that the observed frequencies are due to chance. In contrast, an influence of the discourse status of the modifier makes sense because discourse status has been shown to have effects in many areas of grammar, and thus we are unlikely to question such an influence. In other words, our judgment of what is and is not plausible will influence our interpretation of our empirical results even if they are statistically significant. Alternatively, we could take every result seriously and look for a possible explanation, which will then typically require further investigation. For example, we might hypothesize that there is a relationship between font and level of formality, and the latter has been shown to have an influence on the choice of possessive constructions (Jucker 1993).

Third, and perhaps least obviously but most importantly, statistical significance does not actually tell us anything about the quantitative importance of the relationship we have observed. A relationship may be highly significant (i.e., generalizable with a high degree of certainty) and still be extremely weak. Put differently, statistical significance is not typically an indicator of the strength of the association.19

19 This statement must be qualified to a certain degree: given the right research design, statistical significance may actually be a very reasonable indicator of association strength (cf. e.g. Stefanowitsch and Gries 2003, Gries and Stefanowitsch 2004 for discussion). However, most of the time we are well advised to keep statistical significance and association strength completely separate.

To solve the last problem, we can calculate a so-called measure of effect size, which, as its name suggests, indicates the size of the effect that our independent variable has on the dependent variable. For two-by-two contingency tables with categorical data, there is a widely-used measure referred to as phi (φ) that is calculated as follows:

(7.5) φ = √(χ² / N), where N is the total number of observations in the table

In our example, this formula gives us the following value:

(7.6) φ = √(272.16 / 356) = 0.8744

The φ-value is a so-called correlation coefficient, whose interpretation can be very subtle (especially when it comes to comparing two or more of them), but we will content ourselves with two relatively simple ways of interpreting them. First, there are generally agreed-upon verbal descriptions for different ranges that the value of a correlation coefficient may have (similarly to the verbal descriptions of p-values discussed above). These descriptions are shown in Table 7.4.

Table 7.4. Conventional interpretation of correlation coefficients

ABSOLUTE VALUE    INTERPRETATION
0                 No relationship
.01-.10           Very weak
.11-.25           Weak
.26-.50           Moderate
.51-.75           Strong
.76-.99           Very strong
1                 Perfect association

Our value of 0.8744 falls into the "very strong" category, which is unusual in uncontrolled observational research, and which suggests that DISCOURSE STATUS is indeed a very important factor in the choice of POSSESSIVE constructions in English.
Exactly how much of the variance in the use of the two possessives is accounted for by the discourse status of the modifier can be determined by looking at the square of the coefficient: the square of a correlation coefficient generally tells us what proportion of the distribution of the dependent variable we can account for on the basis of the independent variable (or, more generally, what proportion of the variance our design has captured). In our case, φ² = (0.8744 × 0.8744) ≈ 0.7645. In other words, the variable DISCOURSE STATUS explains roughly three-quarters of the variance in the use of the POSSESSIVE constructions, if, that is, our operational definition actually captures the discourse status of the modifier, and nothing else. A more precise way of reporting the results from our study would be something like the following: "This study has shown a strong and statistically highly significant influence of DISCOURSE STATUS on the choice of possessive construction: s-possessives are preferred when the modifier is discourse-old (defined in this study as being realized by a pronoun) while of-possessives are preferred when the modifier is discourse-new (defined in this study as being realized by a lexical NP) (χ² = 272.16, df = 1, p < 0.001, φ² = 0.7645)."
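The effect size calculation in (7.5) and (7.6) in code form (a sketch; the values are those reported above):

```python
from math import sqrt

chi2, n = 272.16, 356      # chi-square value and table total from Table 7-2
phi = sqrt(chi2 / n)       # effect size for a two-by-two table, as in (7.5)
print(round(phi, 4))       # 0.8744, the value given in (7.6)
print(round(phi ** 2, 4))  # 0.7645, the proportion of variance accounted for
```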
Unfortunately, studies in corpus linguistics (and in the social sciences in general) often fail to report effect sizes, but we can usually calculate them from the data provided, and one should make a habit of doing so. Many effects reported in the literature are actually somewhat weaker than the significance levels might lead us to believe.


7.2.2. One-by-n designs


In the vast majority of corpus linguistic research issues, we will be dealing with designs that are at least bivariate (i.e., that involve the intersection of at least two variables), like the one discussed in the preceding section. However, once in a while we may need to test a univariate distribution for significance (i.e., a distribution of values of a single variable regardless of any specific condition). We may, for instance, have coded an entire corpus for a particular speaker variable (such as sex), and we may now want to know whether the corpus is actually balanced with respect to this variable.
Consider the following example: the ICE-GB contains language produced by 443 FEMALE speakers and 932 MALE speakers. In order to determine whether the ICE-GB can be considered a balanced corpus with respect to SPEAKER SEX, we can compare this "observed"20 distribution of speakers to the expected one, more or less exactly in the way described in the previous sections, except that we have two alternative ways of calculating the expected frequencies.
First, we could simply take the total number of elements and divide it by the number of categories (values), on the assumption that a random distribution means that every category should occur with the same frequency. In this case, the expected number of MALE and FEMALE speakers would be [Total Number of Speakers / Sex Categories], i.e. 1,375 / 2 = 687.5. We can now calculate the chi-square components just as we did in the preceding sections, using the formula [(O - E)² / E]. Table 7-5 shows the results.

20 I put the word "observed" in scare quotes because it is unclear whether this difference was in fact observed after the fact or whether the texts were sampled in this way intentionally.

Table 7-5. Observed and expected frequencies of Speaker Sex in the ICE-GB (based on the assumption of equal proportions).


               Observed    Expected    Chi-Square Component
SEX   MALE        932        687.5     (932 - 687.5)² / 687.5 = 86.95
      FEMALE      443        687.5     (443 - 687.5)² / 687.5 = 86.95
      Total     1,375

Adding up the components gives us a chi-square value of 173.91. A one-by-two table has one degree of freedom (if we vary one cell, we have to adjust the other one automatically to keep the marginal sum constant). Checking the appropriate row in the table in Appendix C.1, we can see that this value is much higher than the 10.83 required for a significance level of 0.001. Thus, we can say that the ICE-GB corpus contains a significantly higher proportion of male speakers than expected by chance (χ² = 173.91, df = 1, p < 0.001) (note that since this is a test of proportions rather than correlations, we cannot calculate a phi value here).
The second way of deriving expected frequencies for a univariate distribution is from prior knowledge concerning the distribution of the values in general. In our case, we could find out the proportion of men and women in the relevant population (Great Britain) and then derive the expected frequencies for our table by assuming that they follow this proportion. A population estimate from around the time that the ICE-GB was released puts the male population of England and Wales in 2003 at roughly 25.9 million and the female population at roughly 27 million (Office for National Statistics 2005). In other words, 49 percent of the population are men and 51 percent are women. Assuming that this is representative of Great Britain as a whole and assuming that these proportions have not changed significantly since the ICE-GB was released five years earlier, we can now estimate the expected frequencies as shown in Table 7-6.

Table 7-6. Observed and expected frequencies of Speaker Sex in the ICE-GB (based on the proportions in the general population)

              Observed    Expected                  Chi-Square Component
SEX  MALE          932    1,375 · 0.49 = 673.75     (932 - 673.75)² / 673.75 = 98.99
     FEMALE        443    1,375 · 0.51 = 701.25     (443 - 701.25)² / 701.25 = 95.11

155
7. Signicance Testing

     Total       1,375

Clearly, the empirically derived expected frequencies in this case closely resemble those derived from our hypothesized equal distribution, and thus the results are very similar. The chi-square value of 194.09 is even slightly higher than above, and thus the result is again significant at the 0.001 level.
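The population-based variant of the test can be checked the same way; the sketch below assumes Python with scipy, supplying the expected frequencies explicitly:

```python
from scipy.stats import chisquare

observed = [932, 443]                    # MALE, FEMALE speakers
total = sum(observed)
# Expected frequencies from the population proportions (49% men, 51% women)
expected = [total * 0.49, total * 0.51]
chi2, p = chisquare(observed, f_exp=expected)
```

This reproduces the chi-square value of roughly 194.09 reported above.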
In this case, it does not make much of a difference how we derive the expected frequencies. However, this is, of course, only because men and women account for roughly half the population each. For variables where such an even distribution of values does not exist, the differences between these two procedures can be quite drastic.

Table 7-8. Observed and expected frequencies of Speaker Age in the ICE-GB

                           Equal Proportions          Population Proportions
 SPEAKER AGE   Observed   Expected   Chi-Square     Pop. %   Expected   Chi-Square
 0-18                 0      251.8       251.80       0.23     295.67       295.67
 18-25              275      251.8         2.14       0.09     111.66       238.93
 26-45              506      251.8       256.62       0.29     363.83        55.56
 46-65              428      251.8       123.30       0.24     297.57        57.17
 over 65             50      251.8       161.73       0.15     190.27       103.41
 Total            1,259                               1.00

(The expected frequencies under equal proportions are 1,259 / 5 = 251.8 for every age group; under population proportions they are 1,259 multiplied by the unrounded population share of each age group, of which the Pop. % column shows rounded values.)

As an example, consider Table 7-8, which lists the observed distribution of the speakers in the ICE-GB across age groups, together with the expected frequencies on the assumption of equal proportions, and the expected frequencies based on the distribution of speakers across age groups in the real world.21
Adding up the chi-square components gives us 795.59 in the first case and 750.74 in the second case. For univariate tables, df = [Number of Values − 1], so
21 Note that this example demonstrates the procedure for univariate distributions with more than two values. The distribution in the real world is once again taken from the ONS (2005) population estimate. The percentages had to be recalculated to fit the age groups coded in the ICE-GB. Alternatively, we could have used the age groups from the population estimate and recalculated the ICE-GB categories (as the actual age is given for most speakers), but this would have given us 17 different age categories, which is not ideal for expository purposes.


Table 7-8 has four degrees of freedom (we can vary four cells independently and then simply adjust the fifth to keep the marginal sum constant). The required chi-square value for a 0.001 level of significance at four degrees of freedom is 18.47: clearly, whichever way we calculate the expected frequencies, the differences between observed and expected are highly significant. However, the conclusions we will draw concerning the over- or underrepresentation of individual categories will be very different. In the first case, for example, we might be led to believe that the age group 18-25 is relatively fairly represented while the age group 26-45 is drastically overrepresented. In the second case, we see that in fact both age groups are overrepresented, the first one slightly more so than the second one. In this case, there is a clear argument for using empirically derived expected frequencies: the categories differ in terms of the age span each of them covers, so even if we thought that the distribution of ages in the population is homogeneous, we would not expect all categories to have the same size.
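The same software check extends to more than two categories. The sketch below (assuming Python with scipy) runs both versions of the test on the age data; note that it uses the rounded percentages shown in Table 7-8, so the population-based result deviates slightly from the one derived from the unrounded proportions:

```python
from scipy.stats import chisquare

observed = [0, 275, 506, 428, 50]   # age groups 0-18, 18-25, 26-45, 46-65, over 65

# Equal proportions (default): every group expected 1,259 / 5 = 251.8 times
chi2_equal, p_equal = chisquare(observed)

# Population proportions, rounded to two decimals as displayed in Table 7-8
shares = [0.23, 0.09, 0.29, 0.24, 0.15]
expected = [sum(observed) * s for s in shares]
chi2_pop, p_pop = chisquare(observed, f_exp=expected)
```

Both chi-square values are far above the critical value of 18.47 for four degrees of freedom at the 0.001 level.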
The exact alternative to the univariate chi-square test with a two-level variable is the binomial test, which we used (without calling it that) in our coin-flip example in Section 7.1 above and which is included as a predefined function in many major spreadsheet applications and in R; for one-by-n tables, there is a multinomial test typically not available outside large statistics packages, so we have to content ourselves with the chi-square test.
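For illustration, the exact binomial test is also available in Python's scipy library (version 1.7 or later; an assumption of mine, not a tool used in the book):

```python
from scipy.stats import binomtest

# Exact binomial test: are 932 MALE speakers out of 1,375 compatible
# with an expected proportion of 0.5?
result = binomtest(932, n=1375, p=0.5)
```

As with the chi-square test, the resulting p-value is far below 0.001.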

7.2 Ordinal data: The Mann-Whitney U-test


Where one variable is nominal (more precisely, nominal with two values) and one
is ordinal, the most widely used test statistic is the Mann-Whitney U-test (also
called Wilcoxon rank sum test).


Let us return to the case study of the animacy of modifiers in the two English possessive constructions. Here is the research hypothesis again, from (6.12) and (6.16) above:

(7.7a) H1: The S-POSSESSIVE will be used when the modifier is high in ANIMACY, the OF-POSSESSIVE will be used when the modifier is low in ANIMACY.
Prediction: The modifiers of the S-POSSESSIVE will have a higher median on the ANIMACY scale than the modifiers of the OF-POSSESSIVE.

The corresponding null hypothesis is stated in (7.7b):

(7.7b) H0: There is no relationship between ANIMACY and TYPE OF POSSESSIVE.


Prediction: There will be no difference between the medians of the modifiers of the S-POSSESSIVE and the OF-POSSESSIVE on the ANIMACY scale.

The median animacy of all modifiers in our sample taken together is 2,22 so the H0 predicts that the medians of the s-possessive and the of-possessive should also be 2. Recall that the observed median animacy in our sample was 1 for the s-possessive and 5 for the of-possessive, which deviates from the prediction of the H0 in the direction of our H1. However, as in the case of nominal data, a certain amount of deviation from the null hypothesis will occur due to chance, so we need a test statistic that will tell us how likely our observed result is. For ordinal data, this test statistic is the U value, which is calculated as follows.
In a first step, we have to determine the rank order of the data points in our sample. For expository reasons, let us distinguish between the rank value and the rank position of a data point: the rank value is the ordinal value it received during coding (in our case, its value on the ANIMACY scale), its rank position is the position it occupies in an ordered list of all data points. If every rank value occurred only once in our sample, rank value and rank position would be the same. However, there are 41 data points in our sample, so the rank positions will range from 1 to 41, and there are only 10 rank values in our coding scheme for ANIMACY. This means that at least some rank values will occur more than once, which is a typical situation for corpus-linguistic research involving ordinal data.


Table 7-9 shows all data points in our sample together with their rank
position.

Table 7-9. Coded sample from Table 6-6 ordered by assigned rank (cf. Table 6-7)

RANKING                                       RANKING (CONT'D.)
Value  Example                 Position       Value  Example                  Position
1      s-possessive (6-6a2)    8.5            3      s-possessive (6-6a16)    23.5
1      s-possessive (6-6a3)    8.5            3      s-possessive (6-6a22)    23.5
1      s-possessive (6-6a7)    8.5            4      of-possessive (6-6b13)   25
1      s-possessive (6-6a8)    8.5            5      s-possessive (6-6a11)    28.5
1      s-possessive (6-6a9)    8.5            5      of-possessive (6-6b11)   28.5
1      s-possessive (6-6a10)   8.5            5      of-possessive (6-6b16)   28.5
1      s-possessive (6-6a12)   8.5            5      of-possessive (6-6b17)   28.5
1      s-possessive (6-6a15)   8.5            5      of-possessive (6-6b18)   28.5

22 There are 41 data points in our sample, whose ranks are the following: 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 10, 10, 10. The twenty-first item on the list is a 2, so this is the median.


1      s-possessive (6-6a17)   8.5            5      of-possessive (6-6b5)    28.5
1      s-possessive (6-6a18)   8.5            6      s-possessive (6-6a4)     32.5
1      s-possessive (6-6a19)   8.5            6      of-possessive (6-6b9)    32.5
1      s-possessive (6-6a20)   8.5            7      of-possessive (6-6b14)   34.5
1      s-possessive (6-6a21)   8.5            7      of-possessive (6-6b3)    34.5
1      s-possessive (6-6a23)   8.5            8      of-possessive (6-6b1)    36
1      of-possessive (6-6b4)   8.5            9      of-possessive (6-6b12)   37
1      of-possessive (6-6b7)   8.5            10     of-possessive (6-6b10)   39.5
2      s-possessive (6-6a1)    19.5           10     of-possessive (6-6b15)   39.5
2      s-possessive (6-6a13)   19.5           10     of-possessive (6-6b6)    39.5
2      s-possessive (6-6a14)   19.5           10     of-possessive (6-6b8)    39.5
2      s-possessive (6-6a5)    19.5
2      s-possessive (6-6a6)    19.5
2      of-possessive (6-6b2)   19.5
Every rank value except 4, 8 and 9 occurs more than once; for example, there are sixteen cases that have an ANIMACY rank value of 1, six cases that have a rank value of 2, two cases that have a rank value of 3, and so on. This means we cannot simply assign rank positions from 1 to 41 to our examples, as there is no way of deciding which of the sixteen examples with the rank value 1 should receive the rank position 1, 2, 3, etc. Instead, these 16 examples as a group share the range of ranks from 1 to 16, so each example gets the average rank position of this range. There are sixteen cases with rank value 1, so their average rank is

(7.8) (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 12 + 13 + 14 + 15 + 16) / 16 = 136 / 16 = 8.5

The first example with the rank value 2 occurs in line 17 of the table, so it would receive the rank position 17. However, there are five more examples with the same rank value, so again we calculate the average rank position of the range from rank 17 to 22, which is

(7.9) (17 + 18 + 19 + 20 + 21 + 22) / 6 = 117 / 6 = 19.5

Repeating this process for all examples yields the rank positions shown in the third column in Table 7-9 above. Once we have determined the rank position of each data point, we separate them into two subsamples corresponding to the


values of the nominal variable TYPE OF POSSESSIVE again, as shown in Table 7-10a.

Table 7-10a. Rank positions for the coded sample from Table 7-9.

S-POSSESSIVE                                  OF-POSSESSIVE
Value  Example                 Position       Value  Example                  Position
2      s-possessive (6-6a1)    19.5           8      of-possessive (6-6b1)    36
1      s-possessive (6-6a2)    8.5            2      of-possessive (6-6b2)    19.5
1      s-possessive (6-6a3)    8.5            7      of-possessive (6-6b3)    34.5
6      s-possessive (6-6a4)    32.5           1      of-possessive (6-6b4)    8.5
2      s-possessive (6-6a5)    19.5           5      of-possessive (6-6b5)    28.5
2      s-possessive (6-6a6)    19.5           10     of-possessive (6-6b6)    39.5
1      s-possessive (6-6a7)    8.5            1      of-possessive (6-6b7)    8.5
1      s-possessive (6-6a8)    8.5            10     of-possessive (6-6b8)    39.5
1      s-possessive (6-6a9)    8.5            6      of-possessive (6-6b9)    32.5
1      s-possessive (6-6a10)   8.5            10     of-possessive (6-6b10)   39.5
5      s-possessive (6-6a11)   28.5           5      of-possessive (6-6b11)   28.5
1      s-possessive (6-6a12)   8.5            9      of-possessive (6-6b12)   37
2      s-possessive (6-6a13)   19.5           4      of-possessive (6-6b13)   25
2      s-possessive (6-6a14)   19.5           7      of-possessive (6-6b14)   34.5
1      s-possessive (6-6a15)   8.5            10     of-possessive (6-6b15)   39.5
3      s-possessive (6-6a16)   23.5           5      of-possessive (6-6b16)   28.5
1      s-possessive (6-6a17)   8.5            5      of-possessive (6-6b17)   28.5
1      s-possessive (6-6a18)   8.5            5      of-possessive (6-6b18)   28.5
1      s-possessive (6-6a19)   8.5
1      s-possessive (6-6a20)   8.5
1      s-possessive (6-6a21)   8.5
3      s-possessive (6-6a22)   23.5
1      s-possessive (6-6a23)   8.5

Next, we calculate the so-called rank sum for each group, which is simply the sum of their rank positions, and we count the number of data points in each group. These measures are shown in Table 7-10b.

Table 7-10b: Rank Sums

                      S-POSSESSIVE     OF-POSSESSIVE
Rank Sum                     324.5             536.5
No. of Data Points              23                18
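The rank positions and rank sums can be checked with software; the sketch below assumes Python with scipy, whose rankdata function assigns average rank positions to ties in exactly the way described above:

```python
from scipy.stats import rankdata

# Animacy rank values for the two subsamples (Table 7-10a)
s_poss  = [2, 1, 1, 6, 2, 2, 1, 1, 1, 1, 5, 1, 2, 2, 1, 3, 1, 1, 1, 1, 1, 3, 1]
of_poss = [8, 2, 7, 1, 5, 10, 1, 10, 6, 10, 5, 9, 4, 7, 10, 5, 5, 5]

# Rank all 41 data points together; ties receive their average rank position
positions = rankdata(s_poss + of_poss)
rank_sum_s  = sum(positions[:len(s_poss)])
rank_sum_of = sum(positions[len(s_poss):])
```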

Using these two measures, we can now calculate the U values for both groups


using the following simple formulas, where N stands for the number of data
points in the sample and R for the rank sum:

(7.10a) U1 = (N1 · N2) + N1(N1 + 1)/2 − R1

(7.10b) U2 = (N1 · N2) + N2(N2 + 1)/2 − R2

Applying these formulas to the measures for the s-possessive (7.10a) and the of-possessive (7.10b) respectively, we get the following U values:

(7.11a) U1 = (23 · 18) + 23(23 + 1)/2 − 324.5 = 365.5

(7.11b) U2 = (23 · 18) + 18(18 + 1)/2 − 536.5 = 48.5
The U value for the entire data set is always the smaller of the two U values. In our case this is U2, so our U value is 48.5. This value can now be compared against its known distribution in the same way as the chi-square value for nominal data. In our case, this means looking it up in the appropriate table in Appendix C.3, which tells us that the p-value for this U value is smaller than 0.001: the difference between the s- and the of-possessive is, again, highly significant. The Mann-Whitney test may be reported as follows:

(7.12) Format for reporting the results of a Mann-Whitney test

(U = [U VALUE], N1 = [N1], N2 = [N2], p < (or >) [SIG. LEVEL])

Thus, we could report the results of this case study as follows: “This study has shown that s-possessives are preferred when the modifier is high in animacy, while of-possessives are preferred when the modifier is low in animacy. A Mann-Whitney test shows that the differences between the constructions are highly significant (U = 48.5, N1 = 23, N2 = 18, p < 0.001).”
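The whole test can be run in a single step with software. A sketch assuming Python with scipy (note that scipy reports the U statistic of the first sample, so we take the smaller of the two U values ourselves):

```python
from scipy.stats import mannwhitneyu

# Animacy rank values from Table 7-10a
s_poss  = [2, 1, 1, 6, 2, 2, 1, 1, 1, 1, 5, 1, 2, 2, 1, 3, 1, 1, 1, 1, 1, 3, 1]
of_poss = [8, 2, 7, 1, 5, 10, 1, 10, 6, 10, 5, 9, 4, 7, 10, 5, 5, 5]

res = mannwhitneyu(s_poss, of_poss, alternative="two-sided")
# Conventional test statistic: the smaller of U1 and U2 = N1*N2 - U1
u = min(res.statistic, len(s_poss) * len(of_poss) - res.statistic)
```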


7.3 Inferential Statistics for Cardinal Data


Where one variable is nominal (more precisely, nominal with two values) and one is cardinal, a widely used test statistic is the t-test, of which there are two well-known versions, Welch's t-test and Student's t-test, that differ in terms of the requirements that the data must meet in order for them to be applicable. In the following, I will introduce Welch's t-test, which can be applied more broadly, although it still has some requirements that I will return to below.

7.3.1 Welch's t-test

Let us return to the case study of the length of modifiers in the two English possessive constructions. Here is the research hypothesis again, paraphrased slightly from (6.17) and (6.18) above:
v0
(7.13a) H1: The S-POSSESSIVE will be used with short modifiers, the OF-POSSESSIVE will be used with long modifiers.
Prediction: The mean LENGTH (in number of words) of modifiers of the S-GENITIVE should be smaller than that of the modifiers of the OF-GENITIVE.

The corresponding null hypothesis is stated in (7.13b):



(7.13b) H0: There is no relationship between LENGTH and TYPE OF POSSESSIVE.
Prediction: There will be no difference between the mean length of the modifiers of the S-POSSESSIVE and the OF-POSSESSIVE.
Table 7-11 shows the length in number of words for the modifiers of the s- and of-possessives (as already reported in Table 6-9), together with a number of additional pieces of information that we will turn to next.

Table 7-11. Length of the modifier in a sample of s- and of-possessives.

S-POSSESSIVE                                       OF-POSSESSIVE
Example   Length   Diff. from    Squared           Example   Length   Diff. from    Squared
ID        x        mean x − X̄    diff. (x − X̄)²    ID        x        mean x − X̄    diff. (x − X̄)²
6-9a1     2         0.1          0.0100            6-9b1     3        -0.8333        0.6944
6-9a2     2         0.1          0.0100            6-9b2     5         1.1667        1.3611
6-9a3     2         0.1          0.0100            6-9b3     4         0.1667        0.0278
6-9a4     2         0.1          0.0100            6-9b4     8         4.1667       17.3611
6-9a5     3         1.1          1.2100            6-9b5     7         3.1667       10.0278
6-9a6     1        -0.9          0.8100            6-9b6     1        -2.8333        8.0278
6-9a7     2         0.1          0.0100            6-9b7     9         5.1667       26.6944
6-9a8     2         0.1          0.0100            6-9b8     2        -1.8333        3.3611
6-9a9     2         0.1          0.0100            6-9b9     5         1.1667        1.3611
6-9a10    2         0.1          0.0100            6-9b10    6         2.1667        4.6944
6-9a11    1        -0.9          0.8100            6-9b11    2        -1.8333        3.3611
6-9a12    2         0.1          0.0100            6-9b12    2        -1.8333        3.3611
6-9a13    1        -0.9          0.8100            6-9b13    1        -2.8333        8.0278
6-9a14    3         1.1          1.2100            6-9b14    8         4.1667       17.3611
6-9a15    2         0.1          0.0100            6-9b15    5         1.1667        1.3611
6-9a16    1        -0.9          0.8100            6-9b16    2        -1.8333        3.3611
6-9a17    2         0.1          0.0100            6-9b17    2        -1.8333        3.3611
6-9a18    2         0.1          0.0100            6-9b19    2        -1.8333        3.3611
6-9a19    2         0.1          0.0100            6-9b20    1        -2.8333        8.0278
6-9a20    2         0.1          0.0100            6-9b21    2        -1.8333        3.3611
                                                   6-9b22    8         4.1667       17.3611
                                                   6-9b23    3        -0.8333        0.6944
                                                   6-9b24    2        -1.8333        3.3611
                                                   6-9b25    2        -1.8333        3.3611
N                  20                              N                  24
TOTAL              38         0                    TOTAL              92         0
Mean              1.9                              Mean          3.8333
Sample variance       0.3053                       Sample variance       6.6667

First, note that one case that was still included in Table 6-9 is missing: Example (6-9b19), which had a modifier of length 20. This is treated here as a so-called outlier, i.e., a value that is so far away from the mean that it can be considered an exception. There are different opinions on if and when outliers should be removed that we will not discuss here, but for expository reasons alone it is reasonable here to remove it (and for our results, it would not have made a difference if we had kept it).
In order to calculate Welch's t-test, we determine three values on the basis of our measurements of LENGTH: the number of measurements, the mean length for each group, and a value called sample variance. The number of measurements is easy to determine: we just count the cases in each group: 20 s-possessives and 24 of-possessives. We already calculated the mean lengths in Chapter 6: for the s-possessive, the mean length is 1.9 words, for the of-possessive it is 3.83 words. As


we already discussed in Chapter 6, this difference conforms to our hypothesis: s-possessives are, on average, shorter than of-possessives.
The question is, again, how likely it is that this difference is due to chance. When comparing group means, the crucial question we must ask in order to determine this is how large the variation is within each group of measurements: put simply, the more widely the measurements within each group vary, the more likely it is that the differences across groups could have come about by chance.
The first step in assessing the variation consists in determining for each measurement how far away it is from its group mean. Thus, we simply subtract the group mean of 1.9 from each measurement for the s-possessive, and the group mean of 3.83 from each measurement for the of-possessive. The results are shown in the third column of each sub-table in Table 7-11. However, we do not want to know how much each single measurement deviates from the mean, but how far the group S-POSSESSIVE or OF-POSSESSIVE as a whole varies around the mean. Obviously, adding up all individual values is not going to be helpful: as in the case of observed and expected frequencies of nominal data, the result would always be zero. So we use the same trick we used there, and calculate the square of each value, making them all positive and weighting larger deviations more heavily. The results of this are shown in the fourth column of each sub-table. We then calculate the mean of these values for each group, but instead of adding up all values and dividing them by the number of cases, we add them up and divide them by the total number of cases minus one. This is referred to as the sample variance:
(7.14) s² = Σi=1…n (xi − X̄)² / (n − 1)

The sample variances themselves cannot be very easily interpreted (see further below), but we can use them to calculate our test statistic, the t-value, using the following formula (the barred X stands for the group mean, s² stands for the sample variance, and N stands for the number of cases; the subscripts 1 and 2 indicate the two sub-samples):


(7.15) tWelch = (X̄1 − X̄2) / √( s1²/N1 + s2²/N2 )

Note that this formula assumes that the measures with the subscript 1 are from the larger of the two samples (if we don't pay attention to this, however, all that happens is that we get a negative t-value, whose negative sign we can simply ignore). In our case, the sample of of-possessives is the larger one, giving us:

(7.16) tWelch = (3.8333 − 1.9) / √( 6.6667/24 + 0.3053/20 ) = 3.5714
As should be familiar by now, we compare this t-value against its distribution to determine the probability of error (i.e., we look it up in the appropriate table in Appendix C.4). Before we can do so, however, we need to determine the degrees of freedom of our sample. This is done using the following formula:
(7.17) df ≈ ( s1²/N1 + s2²/N2 )² / ( s1⁴/(N1² · df1) + s2⁴/(N2² · df2) )

Again, the subscripts indicate the sub-samples, s² is the sample variance, and N is the number of items; the degrees of freedom for the two groups (df1 and df2) are defined as N − 1. If we apply the formula to our data, we get the following:

(7.18) df ≈ ( 6.6667/24 + 0.3053/20 )² / ( 6.6667²/(24² · (24 − 1)) + 0.3053²/(20² · (20 − 1)) ) = 25.5038

As we can see in the table in Appendix C.4, the p-value for this t-value is smaller than 0.01. A t-test should be reported in the following format:


(7.19) Format for reporting the results of a t-test

(t([DEG. FREEDOM]) = [t VALUE], p < (or >) [SIG. LEVEL])

Thus, a straightforward way of reporting our results would be something like this: “This study has shown that for modifiers that are realized by lexical NPs, s-possessives are preferred when the modifier is short, while of-possessives are preferred when the modifier is long. The difference between the constructions is very significant (t(25.50) = 3.5714, p < 0.01).”
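The whole calculation can be reproduced with software; the sketch below assumes Python with scipy, where the parameter equal_var=False selects Welch's version of the t-test:

```python
from scipy.stats import ttest_ind

# Modifier lengths from Table 7-11
s_poss  = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
of_poss = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1, 8, 5, 2, 2, 2, 1, 2, 8, 3, 2, 2]

# Larger sample first, so the t-value comes out positive; equal_var=False
# requests Welch's t-test rather than Student's
t, p = ttest_ind(of_poss, s_poss, equal_var=False)
```

This yields t ≈ 3.5714 with p < 0.01, as calculated by hand above.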
As pointed out above, the value for the sample variance does not, in itself, tell us very much. We can convert it into something called the sample standard deviation, however, by taking its square root. The standard deviation is an indicator of the amount of variation in a sample (or sub-sample) that is frequently reported; it is good practice to report standard deviations whenever we report means.
Finally, note that, again, the significance level does not tell us anything about the size of the effect, so we should calculate an effect size separately. The most widely used effect size for data analyzed with a t-test is Cohen's d, also referred to as the standardized mean difference. There are several ways to calculate it; the simplest one is the following, where σ is the standard deviation of the entire sample:

(7.20) d = (X̄1 − X̄2) / σ

For our case study, this gives us (7.21):

(7.21) d = (3.8333 − 1.9) / 2.1562 = 0.8966

This standardized mean difference can be converted to a correlation coefficient by the formula in (7.22):

(7.22) r = d / √( d² + (N1 + N2)² / (N1 · N2) )

For our case study, this gives us (7.23):

(7.23) r = 0.8966 / √( 0.8966² + (24 + 20)² / (24 · 20) ) = 0.4077

Since this is a correlation coefficient, it can be interpreted as described in Table 7-4 above. It falls into the moderate range, so a more comprehensive way of summarizing the results of this case study would be the following: “This study has shown that length has a moderate, statistically significant influence on the choice of possessive constructions with lexical NPs in modifier position: s-possessives are preferred when the modifier is short, while of-possessives are preferred when the modifier is long (t(25.50) = 3.5714, p < 0.01, r = 0.41).”
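Both effect-size measures can be computed in a few lines of code; the following sketch uses only Python's standard library:

```python
import statistics

# Modifier lengths from Table 7-11
s_poss  = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
of_poss = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1, 8, 5, 2, 2, 2, 1, 2, 8, 3, 2, 2]

# Cohen's d: difference of group means over the standard deviation
# of the entire sample (formula 7.20)
sigma = statistics.stdev(of_poss + s_poss)
d = (statistics.mean(of_poss) - statistics.mean(s_poss)) / sigma

# Conversion of d to a correlation coefficient (formula 7.22)
n1, n2 = len(of_poss), len(s_poss)
r = d / (d**2 + (n1 + n2)**2 / (n1 * n2)) ** 0.5
```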

7.3.2 Normal distribution requirement


In the context of corpus linguistics, there is one fundamental problem with the t-test in any of its variants: it requires data that follow what is called the normal distribution. Briefly, the normal distribution is a probability distribution that many natural phenomena follow and that forms a symmetrical bell curve, with most measurements falling in the middle and frequencies decreasing symmetrically on either side until they reach zero. There are many ways of determining whether a sample is distributed normally, but we will not discuss these here for the simple reason that, unfortunately, cardinal measurements derived from corpus data (such as the length of words, constituents, sentences, etc., or the distance to the last mention of a referent) are usually not distributed normally (see, e.g., McEnery and Hardie 2012: 51), and so the discussion would hardly be worth the trouble at this point.

There are two ways of dealing with this problem: first, ignore it and hope that the t-test is robust enough to yield meaningful results despite this violation of the normality requirement. Although statisticians warn against it categorically, this is what many linguists (and other social scientists) regularly do. In many cases, this may be less of a problem than one might assume since, as mentioned in the introduction to this chapter, the tests introduced here are fairly robust against violations of their prerequisites.
Still, ignoring the prerequisites of a statistical procedure is not exactly good scientific practice, so the second way of dealing with the problem is the more recommendable one: find a way of getting around having to use a t-test in the first place. One way would be to find an alternative test that does not assume a


normal distribution. In some cases, so-called exact tests are available (such as the Fisher-Yates exact test mentioned in Section 7.2.1), and some corpus linguists argue that we should restrict ourselves to such exact tests (e.g. Gries 2006). In other cases, we may have to rethink the nature of our variables. For example, as pointed out in Chapter 6, we can treat our cardinal data as ordinal data. We can then use the Mann-Whitney U-test, which does not require a normal distribution of the data. I leave it as an exercise to the reader to apply this test to the data in Table 7-11 (you know you have succeeded if your result for U is 137, p < 0.01).
Alternatively, we could try to find a less problematic operationalization of the phenomenon we are trying to measure and use the corresponding statistical procedures. Since nominal variables are often less troublesome than cardinal ones, we could, for example, code the data in Table 7-11 in terms of a very simple nominal variable: LONGER CONSTITUENT (with the values HEAD and MODIFIER). For each case, we simply determine whether the head is longer than the modifier (in which case we assign the value HEAD) or whether the modifier is longer than the head (in which case we assign the value MODIFIER); we discard all cases where the two have the same length. This gives us Table 7-12.
T
Table 7-12. The influence of length on the choice between the two possessives.

                                          POSSESSIVE
                                S-POSSESSIVE   OF-POSSESSIVE   Total
 LONGER        HEAD     Obs.          8               5           13
 CONSTITUENT            Exp.          6.3             6.7
               MOD      Obs.          8              12           20
                        Exp.          9.7            10.3
               Total    Obs.         16              17           33

The chi-square value for this table is 0.7281 (with continuity correction), which at one degree of freedom means that the p-value is larger than 0.05, so we would have to conclude that there is no influence of length on the choice of possessive construction. However, the deviations of the observed from the expected frequencies go in the right direction, so this may simply be due to the fact that our sample is too small (obviously, a serious corpus-linguistic study would not be based on just 33 cases).
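This calculation, too, can be checked with software. A sketch assuming Python with scipy, whose chi2_contingency function applies the continuity correction for two-by-two tables by default:

```python
from scipy.stats import chi2_contingency

# Table 7-12: LONGER CONSTITUENT (rows) by POSSESSIVE (columns)
table = [[8, 5],    # HEAD
         [8, 12]]   # MODIFIER
chi2, p, df, expected = chi2_contingency(table)
```

The function also returns the expected frequencies, which match those given in Table 7-12 after rounding.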

7.4 Other statistics


The short introduction to some important statistical procedures in this and the previous chapter was primarily meant to introduce the most fundamental concepts needed to think about linguistic data quantitatively. We did not do much more than scratch the surface, however, so for anyone who is serious about using statistics in their research, following the recommendations for further reading at the end of this chapter is very important, more so than in any other chapter.
One aspect of statistical analysis that we only mentioned in passing is that of general quantitative characteristics that a sample may have. For example, we mentioned that data must follow a normal distribution for some statistical procedures but not for others. There are additional restrictions for some tests. For example, many procedures for comparing group means can only be applied if the two groups have the same variance (e.g., the more widely used Student's t-test), and there are tests to tell us this (the F-test). Also, it makes a difference whether the two groups that we are comparing are independent of each other (as in the case studies presented here), or whether they are dependent in that there is a correspondence between measures in the two groups. For example, if we wanted to compare the length of heads and modifiers in the s-possessive, we would have two groups that are dependent in that for any data point in one of the groups there is a corresponding data point in the other group that comes from the same corpus example. In this case, we would use a paired test (for example, the matched-pairs Wilcoxon test for ordinal data and Student's paired t-test for cardinal data).


As mentioned at the beginning of the chapter, we also omitted an entire class of research designs, namely those where neither variable is nominal. In these cases you can use measures of correlation, such as Pearson's product-moment correlation (if we are dealing with two cardinal variables) and Spearman's rank correlation coefficient or Kendall's tau rank correlation coefficient (if one or both of our variables are ordinal).
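As a sketch of how such measures are computed in practice, the following uses Python with scipy on a small set of invented paired measurements (the data are hypothetical, purely for illustration):

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical paired measurements for two variables
x = [1, 2, 2, 3, 4, 5, 6, 8]
y = [2, 1, 3, 3, 5, 4, 7, 9]

r, p_r = pearsonr(x, y)        # two cardinal variables
rho, p_rho = spearmanr(x, y)   # one or both variables ordinal
tau, p_tau = kendalltau(x, y)  # alternative rank correlation
```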
Finally, we restricted ourselves to bivariate research designs (designs with two variables). Increasingly, research designs in corpus linguistics are multivariate, investigating the influence of more than one independent variable on a dependent variable. We will discuss why multivariate designs are frequently necessary in the next chapter, limiting ourselves to designs where all variables are nominal. However, in multivariate designs many combinations of variable types are possible, and there is a large number of multivariate statistical procedures, such as the well-known Analysis of Variance (ANOVA) for designs with several independent nominal variables and a cardinal dependent variable.

Further reading
T
Anyone serious about using statistics in their research should start with a basic introduction to statistics, and then proceed to an introduction to more advanced methods, preferably one that introduces a statistical software package at the same time. For the first step, I recommend Butler (1985), a very solid introduction to statistical concepts and pitfalls specifically aimed at linguists. It is out of print, but the author has made it available for free download. For the second step, I recommend Gries (2013) as a package deal geared specifically towards linguistic research questions, but I also encourage you to explore the wide range of free or commercially available books introducing statistics with R.
D

Butler, C. (1985). Statistics in linguistics. Oxford: Blackwell. [Available at http://www.uwe.ac.uk/hlss/llas/statistics-in-linguistics/bkindex.shtml]
Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction. 2nd revised edition. Berlin: De Gruyter Mouton.
If you are interested in the English possessive alternation discussed in this and the preceding chapter, there are several corpus-based studies that deal with it, Rosenbach (2002) being the most comprehensive. Stefanowitsch (2003), briefly mentioned in Chapter 6, attempts to relate corpus data to a construction-grammar model of the two possessives.


Rosenbach, Anette (2002). Genitive variation in English: Conceptual factors in synchronic and diachronic studies. Berlin and New York: Mouton de Gruyter.
Stefanowitsch, Anatol (2003). Constructional semantics as a limit to grammatical alternation: The two genitives of English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of grammatical variation in English, 413–444. Berlin: Mouton de Gruyter.


8. Complex Quantitative Research Designs

In this chapter, we will finish our discussion of quantitative corpus linguistic research designs by looking at some issues that become relevant with more complex designs: first, designs with two variables where at least one variable has more than two values, and second, designs with more than two variables. We will focus on designs with nominal variables, but the issues discussed here are relevant to research designs with all kinds of variables.

8.1 Variables with more than two values


In Chapters 6 and 7, we restricted our discussion to cases where the
independent variable was binary: only two constructions were involved.
The dependent variables were more complex: the cardinal variable LENGTH
obviously had a potentially infinite number of values and the ordinal variable
ANIMACY had ten values; the nominal variable DISCOURSE STATUS was treated like a
binary variable too. Frequently, perhaps even typically, corpus linguistic research
questions will be more complex, and we will be confronted with designs where
independent variables and/or nominal dependent variables will have more than
two values.
For example, we might be intrigued by the fact that transitive clauses are
explicitly or implicitly assumed to be the most typical clause type by
syntacticians, while in actual usage, intransitive clauses are much more frequent
(cf. Thompson and Hopper 2001). We might then speculate that the frequency of
different subcategorization frames may differ across text types and that the
perceived typicality of transitive clauses may be due to the salience of particular
text types. Assuming we limit ourselves to the major subcategorization patterns,
our dependent variable will have at least four values (intransitive, transitive,
ditransitive, and copular). The independent variable will, of course, have
indefinitely many values (since no-one knows how many text types there are).


Let us assume that we have chosen five text types as representative of two major
distinctions often made: formal vs. informal and spoken vs. written. The
intersection of these two variables will then yield a four-by-five table. Table 8-1
shows the observed and expected frequencies for each of these intersections.23


23 Note that, at least under one interpretation, this design is actually flawed: if we
assume that the dimensions just mentioned can directly influence the frequency of
subcategorization patterns, then the values of our independent variable overlap: two
of the values are spoken, two are written, and one is both; likewise, two are informal
and three are formal. Thus, a better design would be one with FORMALITY and CHANNEL as
independent variables; since we do not know how to deal with more than one
independent variable at a time yet, let us ignore this flaw. Of course, if you do not
assume that the dimensions FORMALITY and CHANNEL can have a direct influence, but that
they are simply dimensions along which holistic and wholly independent text types
happen to be similar to each other, then there is no problem in the first place. This is a
good example of how the constructs in our research design are always anchored in a
particular model of reality, rather than reality itself.


Table 8-1. Subcategorization patterns by text type (ICE-GB)

                                 SUBCATEGORIZATION FRAME
TEXT TYPE           INTRANSITIVE   COPULAR      TRANSITIVE   DITRANSITIVE   Total
SPOKEN INFORMAL     6,333          6,913        9,095        339            22,680
                    (5,913.51)     (6,500.03)   (9,889.77)   (376.69)
SPOKEN FORMAL       3,219          3,520        5,519        266            12,524
                    (3,265.47)     (3,589.34)   (5,461.18)   (208.01)
SCRIPTED            1,331          1,471        2,620        84             5,506
                    (1,435.62)     (1,578.00)   (2,400.93)   (91.45)
WRITTEN INFORMAL    978            963          1,476        103            3,520
                    (917.79)       (1,008.82)   (1,534.92)   (58.46)
WRITTEN FORMAL      1,216          1,507        3,160        41             5,924
                    (1,544.61)     (1,697.80)   (2,583.20)   (98.39)
Total               13,077         14,374       21,870       833            50,154

Note. Expected frequencies are given in parentheses. Spoken informal: private direct
conversation; Spoken formal: broadcast discussions/interviews, parliamentary debates,
legal cross-examinations, business transactions; Scripted: scripted broadcast and
non-broadcast monologues; Written informal: social letters; Written formal: academic
writing.

The expected frequencies are arrived at in the way demonstrated in Chapter 6: for
each cell, the sum of the column in which the cell is located is multiplied by the
sum of the row in which it is located, and the result is divided by the table sum.
For example, for the top left cell, we get (13,077 × 22,680) ÷ 50,154 = 5,913.51.
Likewise, the chi-square cell components are calculated in the usual way, i.e.
(O − E)² ÷ E; for the top left cell, for example, this gives us
(6,333 − 5,913.51)² ÷ 5,913.51 = 29.76.
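Readers who want to retrace these calculations may find it useful to see them in code. The following Python sketch is my own illustration and not part of the book (the function name and data layout are invented); it applies the procedure just described to the observed frequencies of Table 8-1:

```python
# Chi-square computation for an r-by-c contingency table: expected
# frequency = (row sum * column sum) / table sum, and each cell
# contributes (observed - expected)^2 / expected to the chi-square value.

def chi_square(observed):
    """observed is a list of rows; returns (chi2, df, expected, components)."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    n = sum(row_sums)
    expected = [[r * c / n for c in col_sums] for r in row_sums]
    components = [[(o - e) ** 2 / e for o, e in zip(os, es)]
                  for os, es in zip(observed, expected)]
    chi2 = sum(map(sum, components))
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df, expected, components

# Observed frequencies from Table 8-1 (rows: text types; columns:
# intransitive, copular, transitive, ditransitive)
table_8_1 = [
    [6333, 6913, 9095, 339],    # spoken informal
    [3219, 3520, 5519, 266],    # spoken formal
    [1331, 1471, 2620, 84],     # scripted
    [978, 963, 1476, 103],      # written informal
    [1216, 1507, 3160, 41],     # written formal
]

chi2, df, expected, components = chi_square(table_8_1)
print(round(expected[0][0], 2), round(components[0][0], 2))
print(round(chi2, 2), df)
```

Running this reproduces the expected frequency and chi-square component of the top left cell as calculated above, as well as the overall chi-square value and degrees of freedom discussed below.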

Table 8-2. Subcategorization patterns by text type (ICE-GB): chi-square components

                                 SUBCATEGORIZATION FRAME
TEXT TYPE           INTRANSITIVE   COPULAR   TRANSITIVE   DITRANSITIVE
SPOKEN INFORMAL     29.76          26.24     63.87        3.77
SPOKEN FORMAL       0.66           1.34      0.61         16.17
SCRIPTED            7.62           7.26      19.99        0.61
WRITTEN INFORMAL    3.95           2.08      2.26         33.93
WRITTEN FORMAL      69.91          21.44     128.79       33.48
Adding up the individual cell components gives us a chi-square value of 473.73.
The general formula for determining the degrees of freedom of a table is the
following (recall that rows and columns containing labels or marginal frequencies
do not count):

(8.1) df = (Number of Rows − 1) × (Number of Columns − 1)

Our table thus has 4 × 3 = 12 degrees of freedom. As the chi-square table in
Appendix C.1 shows, the required value for a significance level of 0.001 at 12
degrees of freedom is 32.90; our chi-square value for Table 8-1 is much higher
than this, thus, our results are highly significant. We could summarize our
findings as follows: "The frequency of verb subcategorization patterns differs
significantly across registers (χ² = 473.73, df = 12, p < 0.001)."
Recall that the mere fact that there is a significant association between register
and subcategorization does not tell us anything about the strength of that
association (the effect size). For tables larger than two-by-two, there is a
correlation coefficient referred to as Cramér's V (occasionally also referred to as
Cramér's φ or φc), which is calculated as follows (N is the table sum, k is the
number of rows or columns, whichever is smaller):

(8.2) Cramér's V = √(χ² ÷ (N × (k − 1)))


For our table, this gives us the following:

(8.3) Cramér's V = √(473.73 ÷ (50,154 × (4 − 1))) = 0.06
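The same conversion can be sketched in Python (again, my own illustration rather than anything prescribed by the book; the function name is invented):

```python
import math

def cramers_v(chi2, n, k):
    """Cramér's V; n is the table sum, k the smaller of the number
    of rows and columns."""
    return math.sqrt(chi2 / (n * (k - 1)))

v = cramers_v(473.73, 50154, 4)  # k = 4: our table has 5 rows and 4 columns
print(round(v, 2))  # 0.06
```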

Recall that the square of a correlation coefficient tells us the proportion of the
variance captured by our design, which, in this case, is 0.0031. In other words,
TEXT TYPE explains a mere third of a percent of the distribution of
subcategorization patterns (at least if we operationalize it as we have done here);
or: "This study has shown a very weak but highly significant influence of text
type on the frequency of subcategorization patterns (χ² = 473.73, df = 12,
p < 0.001, V² = 0.0031)."
This result in itself will probably not satisfy the curiosity arising from our
initial hypothesis concerning the overrepresentation of transitive clauses in
syntactic argumentation: we need to know whether particular subcategorization
patterns are particularly frequent in a particular register. In order to answer this
question, we need to know precisely which intersections of values are more
frequent or less frequent than expected. We can easily find out by comparing the
observed and expected frequencies in each cell. For example, we can look at the
top left cell and see that the observed frequency is 6,333 against an expected
5,913.51, i.e., that intransitives are more frequent than expected in informal spoken
language.
However, while we know that the distribution in the table as a whole is
significant, we do not know whether any particular difference by itself is
significant (say, that of intransitives in informal spoken language or that of
transitives in formal written language). In other words, we do not know which
particular intersections of variables contribute significantly to the overall
significance. In order to determine this, imagine that our 5 × 4 table consists of
twenty tables with a single cell each. In the case of the intersection between
intransitive verbs and informal spoken language, this table contains the chi-square
component 29.76. We can now simply treat this component as a chi-square
value in its own right and look it up in Appendix C.1 to see whether it exceeds
the value required for a particular level of significance. In order to do so, we
assume that a single cell always has one degree of freedom.

Clearly, 29.76 is much higher than the 10.83 required for a significance level of
0.001 at one degree of freedom.24 We can now check every cell's chi-square value
against the required corrected values and find out exactly which of the
intersections contribute significantly to our result. By comparing the observed
against the expected frequencies, we can determine for each intersection of
values whether it is more or less frequent than expected. For twenty tests, this is
quite a complex task that we would not want to leave to the reader. Some way of
summarizing the information in a compact way is required. There is no standard
way of doing this, but the representation in Table 8-3 seems a reasonable
suggestion: in each cell, the first line contains either a plus (for "more frequent
than expected") or a minus (for "less frequent than expected"); the second line
contains the actual contribution to chi-square for the cell, and the third line
contains the level of significance (using the standard convention of representing
them by asterisks: three for a level of 0.001, two for a level of 0.01, and one for a
level of 0.05; non-significance is represented by "n.s." and marginal significance
by "marg.").

24 In fact, matters are slightly more complex: if we treat each cell as an independent
mini-table, then we have effectively performed 20 chi-square tests on the same data
set. But recall that the level of significance is nothing but a probability of error: for
example, p = 0.05 means that there is a five percent likelihood that the results are due
to chance. Of course, this means that if we perform twenty tests, we should expect, on
average, one of our results to have come about by chance (20 × 0.05 = 1). In order to
avoid this danger, the chi-square values must be adjusted. Appendix C.2 lists the
corrected values for multiple tests with one degree of freedom. Checking the adjusted
chi-square values for twenty tests, we find that the required value for a 0.001 level of
significance is 16.45: our value of 29.76 is higher than even this corrected value, and
thus the association between intransitives and spoken informal language is highly
significant.
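The adjustment described in footnote 24 can be made concrete in code. The following Python sketch is my own (the function names are invented, and multiplying the p-value by the number of tests is a Bonferroni-style correction, equivalent to the adjusted critical values in Appendix C.2); it relies on the fact that a chi-square variable with one degree of freedom is a squared standard normal variable, so its p-value can be obtained from the complementary error function:

```python
import math

def p_chi2_df1(chi2):
    """p-value for a chi-square value at one degree of freedom
    (survival function; equals erfc(sqrt(chi2 / 2)) because a
    chi-square variable with df = 1 is a squared standard normal)."""
    return math.erfc(math.sqrt(chi2 / 2))

def bonferroni_significant(chi2, n_tests, alpha):
    """Check a single cell component against a significance level
    corrected for n_tests simultaneous tests."""
    return p_chi2_df1(chi2) * n_tests < alpha

# The intransitive/spoken-informal component from Table 8-2, checked
# at the 0.001 level corrected for twenty tests:
print(bonferroni_significant(29.76, 20, 0.001))  # True

# The ditransitive/spoken-formal component just misses the corrected
# 0.001 level, which is why it is marked ** rather than *** in Table 8-3:
print(bonferroni_significant(16.17, 20, 0.001))  # False
```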


Table 8-3. Subcategorization Frames by Text Type: Contributions to Chi-Square

                                 SUBCATEGORIZATION FRAME
TEXT TYPE           INTRANSITIVE   COPULAR   TRANSITIVE   DITRANSITIVE
SPOKEN INFORMAL     +              +         −            −
                    29.76          26.24     63.87        3.77
                    ***            ***       ***          n.s.
SPOKEN FORMAL       −              −         +            +
                    0.66           1.34      0.61         16.17
                    n.s.           n.s.      n.s.         **
SCRIPTED            −              −         +            −
                    7.62           7.26      19.99        0.61
                    marg.          marg.     ***          n.s.
WRITTEN INFORMAL    +              −         −            +
                    3.95           2.08      2.26         33.93
                    n.s.           n.s.      n.s.         ***
WRITTEN FORMAL      −              −         +            −
                    69.91          21.44     128.79       33.48
                    ***            ***       ***          ***

This table presents the complex results at a single glance; they can now be
interpreted. Three patterns are quite obvious in the data: First, there is a clear
difference between the two extreme ends of the register scale: in spoken
informal language, intransitive and copular verbs are more frequent and
transitives are less frequent; in written formal language, it is the other way
around. Second, scripted language is relatively similar to written formal language
(although the deviance of intransitives and copular verbs is only marginally
significant). Third, spoken formal and written informal are similar in that they
contain more ditransitives than expected (which must somehow be related to the
two most frequent ditransitive verbs in the ICE-GB, give and tell). In light of our
initial hypothesis, we could now conclude, for example, that syntacticians were
fooled by the relative importance of transitive clauses in formal written language
into thinking that this subcategorization pattern is in some way typical.

8.2. Multivariate contingency analysis


At least impressionistically, the majority of studies using corpus-linguistic
methods are concerned with bivariate research designs (research designs that
investigate the relationship between two variables), such as those discussed in the
previous chapters and also in the previous section. In many cases, bivariate
research designs are sufficient to test a given research hypothesis, and thus, they
will continue to have a place in corpus linguistics.
However, there is an obvious reason why bivariate designs might not always
be useful: a fragment of reality that we wish to investigate will very often be too
complex to capture in terms of just two variables. This may be obvious to the
researcher from the outset or it may become obvious to them when a given
variable turns out to have a very small effect size (suggesting that there may be
additional variables that explain the deviation of the observed from the expected
frequencies). In other words, our research hypothesis may often posit a
relationship between more than two variables, and even where it does not, there
is an inherent danger in bivariate research designs. The next subsection will spell
out this danger, and the subsection following it will present a solution. Note that
this solution is considerably more complex than the statistical procedures we
have looked at so far, and while it will be presented in sufficient detail to enable
the reader in principle to apply it themselves, some additional reading will be
highly advisable.
8.2.1. A Danger of Bivariate Designs


Consider the following research hypothesis: "The word lovely is becoming more
frequent in English" (there is no inherent plausibility to this hypothesis, but
hypotheses do not have to be plausible, they have to be testable). From this
hypothesis, we can derive the prediction "The word lovely will be used more
frequently by young people and less frequently by old people." The null
hypothesis, obviously, is "The word lovely is not changing in frequency", and the
corresponding prediction is "There is no relationship between age and the
frequency of the word lovely."


Note, incidentally, that these seemingly obvious predictions involve an
operational definition: there is an implicit construct PREVIOUS STATE OF THE LANGUAGE,
which is operationalized as "current language use by older speakers", and another
implicit construct CURRENT STATE OF THE LANGUAGE, which is operationalized as
"current language use by younger speakers". In psychology/psycholinguistics and
sociology/sociolinguistics, this is a well-known way of operationalizing temporal
development referred to as "apparent time design", the idea being that someone's
language use (and their cultural values, their understanding of social
relationships, etc.) reflects those that were current when they were young. Let us
further operationalize YOUNG as "45 years of age or younger" and OLD as "above 45
years of age" (obviously an arbitrary division). Table 8-4 shows the observed and
expected frequencies and the chi-square values for the individual intersections of
the variables AGE and UTTERANCES WITH/OUT LOVELY (based on the International
Corpus of English, British Component).

Table 8-4. Utterances containing the word lovely by speaker age (ICE-GB)

                         UTTERANCES
AGE     WITH LOVELY             WITHOUT LOVELY           Total
YOUNG   Obs.  74                Obs.  44,478             44,552
        Exp.  59.20             Exp.  44,492.80
        χ²    3.70              χ²    0.005
OLD     Obs.  18                Obs.  24,667             24,685
        Exp.  32.80             Exp.  24,652.20
        χ²    6.68              χ²    0.01
Total         92                      69,145             69,237
T
e null hypothesis is clearly rejected, so we can assume our research hypothesis
to be correct: younger people use the word lovely more frequently than expected
AF

and old people use it less frequently, although the eect is very weak (2=10.39
(df=1), p<0.01; 2 =0.0001).
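For a two-by-two design like this one, the whole procedure, including the effect size φ² = χ² ÷ N, fits in a few lines of code. The following Python sketch is my own illustration (the function name and layout are invented), applied to the frequencies of Table 8-4:

```python
# Chi-square test and effect size for a two-by-two table: expected
# frequency = (row sum * column sum) / table sum; phi-squared = chi2 / N.

def chi_square_2x2(a, b, c, d):
    """Observed frequencies laid out as [[a, b], [c, d]];
    returns (chi2, phi2)."""
    n = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    chi2 = 0.0
    for obs, i, j in ((a, 0, 0), (b, 0, 1), (c, 1, 0), (d, 1, 1)):
        exp = row_sums[i] * col_sums[j] / n
        chi2 += (obs - exp) ** 2 / exp
    return chi2, chi2 / n

# Table 8-4: utterances with and without lovely by YOUNG vs. OLD speakers
chi2, phi2 = chi_square_2x2(74, 44478, 18, 24667)
print(round(chi2, 2))   # chi-square value
print(round(phi2, 4))   # phi-squared (effect size)
```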
Next, consider a different hypothesis: "The word lovely is typical of women's
language" (cf. R. Lakoff 1973), from which we can straightforwardly derive the
prediction that it is used more frequently by women than by men. Let us accept
the operationalization of SEX that the creators of the ICE-GB chose (a bad idea,
since we have no idea what it is based on). Table 8-5 shows the relevant data.
Again, the hypothesis is confirmed, although, again, the effect is very weak
(χ² = 20.29, df = 1, p < 0.001; φ² = 0.0003). This raises the following question: what
is the relationship between AGE and SEX when it comes to the use of the word
lovely? Consider now Table 8-6, which shows the overall frequency of utterances
in the ICE-GB by AGE and SEX.


Table 8-5. Utterances containing the word lovely by speaker sex (ICE-GB)

                         UTTERANCES
SEX      WITH LOVELY            WITHOUT LOVELY           Total
MALE     Obs.  36               Obs.  42,835             42,871
         Exp.  56.97            Exp.  42,814.03
         χ²    7.72             χ²    0.01
FEMALE   Obs.  56               Obs.  26,310             26,366
         Exp.  35.03            Exp.  26,330.97
         χ²    12.55            χ²    0.02
Total          92                     69,145             69,237

As is immediately obvious, the intersections of these variables are not distributed
evenly in the corpus: utterances by young male speakers and by old female
speakers are significantly less frequent than expected and those by old male and
young female speakers are significantly more frequent, and the effect size of this
interaction is much higher than those of lovely/AGE and lovely/SEX (χ² = 4,905.24,
df = 1, p < 0.001; φ² = 0.07). In other words, there is a possibility that this uneven
distribution of AGE and SEX has a distorting influence on one of the distributions
in Tables 8-4 and 8-5.

Table 8-6. Utterances in the ICE-GB by AGE and SEX

                         AGE
SEX      YOUNG                  OLD                      Total
MALE     Obs.  23,300           Obs.  19,571             42,871
         Exp.  27,586.24        Exp.  15,284.76
         χ²    665.98           χ²    1,201.97
FEMALE   Obs.  21,252           Obs.  5,114              26,366
         Exp.  16,965.76        Exp.  9,400.24
         χ²    1,082.88         χ²    1,954.41
Total          44,552                 24,685             69,237

The danger of bivariate analyses in general, then, is that the variable we have
chosen for investigation is correlated with one or more variables ignored in our
research design, whose influence thus remains hidden. A very general precaution
against this possibility is to make sure that the corpus is balanced with respect to
all potentially confounding variables. In reality, this is almost impossible to
achieve (and often undesirable, since we might, for example, want our corpus
to reflect the real-world correlation of speaker variables). Therefore we need a
way of including multiple independent variables in our research designs.

8.2.2. Configural Frequency Analysis


There is a conceptually straightforward extension of the chi-square test to designs
with more than two variables, a version of a set of multivariate models referred to
as Configural Frequency Analysis. This method is more typically found in
psychology than in corpus linguistics, but it is certainly a suitable method for the
analysis of linguistic research questions. An early language-related application
involving language disorders is Lautsch et al. (1988), and it is found occasionally
in psycholinguistic and sociolinguistic research (e.g. Hsu et al. 2000, Adli 2004);
the first corpus-linguistic application I am aware of is Gries (2002); it has since
established itself as a corpus-linguistic research tool to some extent (see, e.g.,
Stefanowitsch and Gries 2005, 2008).
In its simplest version, configural frequency analysis is simply a chi-square
test on a contingency table with more than two dimensions. There is no logical
limit to the number of dimensions, but if we want to calculate this statistic
manually (rather than letting a specialized software package do it, cf. Appendix
A.3), then a three-dimensional table is already quite complex to deal with. Thus,
we will not go beyond three dimensions here.
A three-dimensional contingency table would have the form of a cube, as
shown in Figure 8-1 (the smaller cube represents the cells on the far side of the
big cube seen from the same perspective, and the smallest cube represents the cell
in the middle of the whole cube). As before, cells are labeled by subscripts: the
first subscript stands for the values and totals of the dependent variable, the
second for those of the first independent variable, and the third for those of the
second independent variable.


Figure 8-1. A three-dimensional contingency table
[Figure not reproduced: a cube whose three dimensions are Indep. Variable A,
Indep. Variable B and the Dep. Variable; its cells are labeled O111 through O222,
with the subscript T marking totals (e.g. O1TT, OT1T, OTT1) and OTTT the table total.]
While this kind of visualization is quite useful in grasping the notion of a
three-dimensional contingency table, it would be awkward to use it as a basis for
recording observed frequencies or calculating the expected frequencies. Thus, a
possible two-dimensional representation is shown in Table 8-7. In this table, the
first independent variable is shown in the rows, the second independent
variable is shown in the three blocks of three columns (these may be thought of
as three slices of the cube in Figure 8-1), and the dependent variable is shown in
the columns themselves.


Table 8-7. A two-dimensional representation of a three-dimensional contingency table

               ivB 1                  ivB 2                  ivB Total
        dv 1   dv 2   Total    dv 1   dv 2   Total    dv 1   dv 2   Total
ivA 1   O111   O112   O11T     O121   O122   O12T     O1T1   O1T2   O1TT
ivA 2   O211   O212   O21T     O221   O222   O22T     O2T1   O2T2   O2TT
Total   OT11   OT12   OT1T     OT21   OT22   OT2T     OTT1   OTT2   OTTT

Given this representation, the expected frequencies for each intersection of the
three variables can now be calculated in a way that is similar (but not identical) to
v0
that used for two-dimensional tables. Table 8-8 shows the formulas for each cell
as well as those marginal sums needed for the calculation.

Table 8-8. Calculating expected frequencies in a three-dimensional contingency


T
table
ivB 1 ivB 2 ivB Total
AF

dv 1 dv 2 dv dv 1 dv 2 dv dv 1 dv 2 dv Total
Total Total
OT1T O1TT OTT1 OT1T O1TT OTT2 OT2T O1TT OTT1 OT2T O1TT OTT2
ivA 1 O TTT2 O TTT2 O TTT2 O TTT2 O1TT
OT1T O2TT OTT1 OT1T O2TT OTT2 OT2T O2TT OTT1 OT2T O2TT OTT2
ivA 2 O2TT
R

O TTT2 O TTT2 O TTT2 O TTT2

ivA OT1T OT2T OTT1 OTT2 OTTT


Total
D

Once the expected frequencies have been calculated, we proceed exactly as
before: we derive each cell's chi-square component by the standard formula
(O − E)² ÷ E, and we can then add up these components to give us an overall
chi-square value for the table, which can then be checked for significance. The
degrees of freedom of a three-dimensional contingency table are calculated by the
following formula (where k is the number of values of each variable and the
subscripts refer to the variables themselves):

(8.4) df = (k₁ × k₂ × k₃) − (k₁ + k₂ + k₃) + 2


In our case, each variable has two values, thus we get (2 × 2 × 2) − (2 + 2 + 2) + 2 = 4.
More interestingly, we can also look at the individual cells to determine whether
their contribution to the overall value is significant. In this case, as before, each
cell has one degree of freedom and the significance levels have to be adjusted for
multiple testing. In CFA, an intersection of variables whose observed frequency is
significantly higher than expected is referred to as a type, and one whose observed
frequency is significantly lower is referred to as an antitype.
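The computation of expected frequencies and chi-square components for a three-dimensional table can be sketched in code as follows. This Python function is my own minimal illustration (names invented), not a full CFA implementation (it omits, in particular, the corrected significance testing discussed above); the data are the lovely frequencies from Table 8-9 below:

```python
def cfa_3d(counts):
    """Simple three-dimensional configural frequency analysis.
    counts maps (a, b, c) configurations to observed frequencies.
    The expected frequency of a configuration is the product of the
    three one-way marginal totals divided by the squared table total."""
    n = sum(counts.values())
    marg = [{}, {}, {}]          # one-way marginal totals per variable
    for config, freq in counts.items():
        for i, value in enumerate(config):
            marg[i][value] = marg[i].get(value, 0) + freq
    expected = {(a, b, c): marg[0][a] * marg[1][b] * marg[2][c] / n ** 2
                for (a, b, c) in counts}
    chi2 = sum((counts[cfg] - expected[cfg]) ** 2 / expected[cfg]
               for cfg in counts)
    # degrees of freedom: (k1 * k2 * k3) - (k1 + k2 + k3) + 2
    k1, k2, k3 = (len(m) for m in marg)
    df = k1 * k2 * k3 - (k1 + k2 + k3) + 2
    return chi2, df, expected

# Observed frequencies of utterances with/without lovely by sex and age
counts = {
    ("lovely", "male", "young"): 23, ("other", "male", "young"): 23277,
    ("lovely", "male", "old"): 13, ("other", "male", "old"): 19558,
    ("lovely", "female", "young"): 51, ("other", "female", "young"): 21201,
    ("lovely", "female", "old"): 5, ("other", "female", "old"): 5109,
}

chi2, df, expected = cfa_3d(counts)
print(round(expected[("lovely", "male", "young")], 2))
print(round(chi2, 2), df)
```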
Let us apply this method to the case described in the previous subsection.
Table 8-9 shows the observed and expected frequencies of utterances containing
the word lovely by SPEAKER AGE and SPEAKER SEX, as well as the corresponding
chi-square components. Before we deal with these, note one important fact about
multi-dimensional contingency tables that may be confusing: if we add up the
expected frequencies of a given column, their sum will not usually correspond to
the sum of the observed frequencies in that column (in contrast to two-dimensional
tables, where this is necessarily the case). Instead, the sum of the observed and
expected frequencies in each slice is identical.
Table 8-9. Use of lovely by SEX and AGE

                  YOUNG                             OLD                               Total
         LOVELY      OTHER       Total      LOVELY     OTHER       Total      LOVELY   OTHER    Total
MALE     Obs. 23     23,277      23,300     13         19,558      19,571     36       42,835   42,871
         Exp. 36.66  27,549.59              20.31      15,264.45
         χ²   5.09   662.62                 2.63       1,207.68
FEMALE   Obs. 51     21,201      21,252     5          5,109       5,114      56       26,310   26,366
         Exp. 22.54  16,943.21              12.49      9,387.75
         χ²   35.92  1,069.97               4.49       1,950.17
Total         74     44,478      44,552     18         24,667      24,685     92       69,145   69,237

Adding up the chi-square components yields a chi-square value of χ² = 4,938.58,
which at df = 4 is highly significant (p < 0.001). This tells us something we already
expected from the individual pairwise analyses of the three variables in the
preceding section: there is a significant relationship among them. Of course, what
we are interested in is what this relationship is, and to answer this question, we
need to look at the contributions to chi-square. Table 8-10 shows the chi-square
components for each combination of the three variables and the direction in
which the observed frequencies diverge from the expected (this information can
be read off straightforwardly from Table 8-9; it has merely been rearranged for
ease of exposition). Each of the chi-square components can be checked against
the table for multiple chi-square tests in Appendix C.2. Since we are dealing with
eight individual tests, the required chi-square values are 7.48 for p < 0.05, 10.41
for p < 0.01, and 14.71 for p < 0.001. The last column shows whether the
chi-square value for a given combination is significant.

Table 8-10. Types and antitypes with lovely

Combination                     Chi-Square   Direction   Corr. Sig. Level
WORD     SEX      AGE
LOVELY   MALE     YOUNG         5.09         −           n.s.
LOVELY   MALE     OLD           2.63         −           n.s.
LOVELY   FEMALE   YOUNG         35.92        +           ***
LOVELY   FEMALE   OLD           4.49         −           n.s.
OTHER    MALE     YOUNG         662.62       −           ***
OTHER    MALE     OLD           1,207.68     +           ***
OTHER    FEMALE   YOUNG         1,069.97     +           ***
OTHER    FEMALE   OLD           1,950.17     −           ***

The result is very interesting: first, there is no general influence of age or sex on
the use of the word lovely. Neither old female speakers nor young male speakers
use the word more frequently than expected; we might have expected them to do
so given the results of the bivariate analyses presented in the preceding section. It
is only speakers who are both young and female who use the word lovely more
frequently than expected. Second, the lower half of the table quite clearly reflects
the over- and under-representations of speaker groups in the corpus as a whole.
They are exactly those identified by the bivariate analysis of AGE and SEX above.
The interesting result, then, is that young female speakers also use other words
more frequently than expected; in other words, they are generally
overrepresented in the corpus. What the multivariate analysis tells us, though, is
that despite this general overrepresentation, young female speakers use the word
lovely more frequently than expected; our three individual bivariate analyses
could never have shown us this.
Let us look at a second example of two bivariate designs that yield a more
comprehensive picture when combined into a multivariate design. Consider the
following research hypothesis: "Women favor a communication style that signals
insecurity." Let us, once again, operationalize the construct SEX as whatever the
makers of the ICE-GB think sex is. Let us further operationalize COMMUNICATIVE
INSECURITY as frequent use of hedged utterances. Since there are too many
different ways of hedging an utterance to identify them all easily, let us narrow
down the construct HEDGED UTTERANCE to "an utterance that contains an adverbial
use of kind of or sort of". We can then derive the following prediction from our
hypothesis: "Women's speech will contain a higher proportion of utterances
containing kind of/sort of than men's speech." The null hypothesis, of course,
would be that there is no relationship between SEX and HEDGING (we will return to
the plausibility of the hypothesis and our operationalizations below; they are
drastically simplified adaptations of claims in R. Lakoff 1973).
v0
Table 8-11 shows the observed and expected frequencies of hedged and
unhedged uerances by men and women in the International Corpus of English -
British Component (or rather, those les from that corpus that are annotated for
speaker sex and for reasons that will become relevant in a moment
education).
T

Table 8-11. Observed and expected frequencies of hedged and unhedged utterances
by sex (ICE-GB, sub-corpus annotated for sex and education)

                        COMMUNICATIVE INSECURITY
SEX        HEDGED UTTERANCES       UNHEDGED UTTERANCES       Total
MALE       Obs.  341               Obs.  42,693              43,034
           (Exp. 412.94)           (Exp. 42,621.06)
FEMALE     Obs.  313               Obs.  24,808              25,121
           (Exp. 241.06)           (Exp. 24,879.94)
Total            654                     67,501              68,155

The results clearly show that (British) men's utterances contain the hedges sort
of/kind of less frequently than expected, while (British) women's utterances
contain them more frequently than expected (χ² = 33.86, df = 1, p < 0.001). This
confirms our research hypothesis, if our operationalizations make sense. A good
way of testing this would be to find another variable that should be associated
with COMMUNICATIVE INSECURITY.


If the use of hedging is related to a general insecurity concerning the content
of one's utterances or one's status in the communicative situation, we might
expect the use of hedges to correlate with EDUCATION, such that speakers with a
low level of education will hedge their utterances more frequently than those
with a high level of education. In order to test this hypothesis in a bivariate
design, let us operationalize LEVEL OF EDUCATION as either SECONDARY school (i.e. high
school) or UNIVERSITY education (since this is the distinction that the ICE-GB
happens to make). Table 8-12 shows the relevant information.

Table 8-12. Observed and expected frequencies of hedged and unhedged utterances
by EDUCATION (ICE-GB, sub-corpus annotated for sex and education)

                          COMMUNICATIVE INSECURITY
EDUCATION     HEDGED UTTERANCES      UNHEDGED UTTERANCES      Total
SECONDARY     Obs.  203              Obs.  17,069             17,272
              Exp.  165.74           Exp.  17,106.26
UNIVERSITY    Obs.  451              Obs.  50,432             50,883
              Exp.  488.26           Exp.  50,394.74
Total               654                    67,501             68,155
Again, the hypothesis is clearly confirmed: educated speakers use less hedging
than uneducated ones (χ² = 11.43, df = 1, p < 0.001).
We are now, again, faced with two possible explanations: either both SEX and
EDUCATION have an influence on the use of hedging, or SEX and EDUCATION are
correlated in the corpus (like AGE and SEX were). The latter would be expected,
given that access to higher education was difficult for women until fairly
recently. A multivariate analysis will help us distinguish between these two
possibilities. Table 8-13 shows the relevant frequencies for a HEDGING × SEX ×
EDUCATION design.


Table 8-13. Use of hedging by SEX and EDUCATION

                  SECONDARY                         UNIVERSITY                         Total
         HEDGE      NO HEDGE    Total       HEDGE     NO HEDGE     Total       HEDGE   NO HEDGE   Total
MALE     Obs. 85    8,749       8,834       256       33,944       34,200      341     42,693     43,034
         Exp. 104.65  10,801.13             308.30    31,819.93
         χ²   3.69    389.89                8.87      141.79
FEMALE   Obs. 118   8,320       8,438       195       16,488       16,683      313     24,808     25,121
         Exp. 61.09   6,305.13              179.97    18,574.81
         χ²   53.02   643.87                1.26      234.45
Total         203   17,069      17,272      451       50,432       50,883      654     67,501     68,155

The overall chi-square value shows that there is a significant relationship
between the three variables (χ² = 1,476.83, df = 4, p < 0.001), but we are interested
in exactly which intersections are responsible for this result. Table 8-14 shows the
types and antitypes.

Table 8-14. Types and antitypes with hedging

Combination                           Chi-Square   Direction   Corr. Sig. Level
HEDGING    SEX      EDUCATION
HEDGED     MALE     SECONDARY         3.69         −           n.s.
HEDGED     MALE     UNIVERSITY        8.87         −           *
HEDGED     FEMALE   SECONDARY         53.02        +           ***
HEDGED     FEMALE   UNIVERSITY        1.26         +           n.s.
UNHEDGED   MALE     SECONDARY         389.89       −           ***
UNHEDGED   MALE     UNIVERSITY        141.79       +           ***
UNHEDGED   FEMALE   SECONDARY         643.87       +           ***
UNHEDGED   FEMALE   UNIVERSITY        234.45       −           ***

The lower half of the table reflects a general bias in the corpus in the direction we expected (or feared): there do indeed seem to be more utterances by educated males and uneducated females and fewer by uneducated males and educated females in the corpus. However, the upper half of the table shows that, despite this imbalance, there is a significant influence of both variables: educated males use hedging significantly less frequently than expected and uneducated females use it highly significantly more frequently. This suggests that educated males feel most secure in what they are saying and their status in saying so, and uneducated females feel least secure. Uneducated males and educated females fall somewhere in the middle.
This may seem a plausible basis for a theory of communicative insecurity, but we should be careful not to jump to conclusions over our joy of having discovered relationships between variables in our corpus data. All that our data have shown us is that people who were defined as male by the makers of the ICE-GB and who have a university education are least likely to use the hedges kind of and sort of, while people who were defined as female by the makers of the ICE-GB and who do not have a university education are most likely to do so, and that this is statistically significant.

The data do not tell us whether our research design is valid. In this particular case, this validity is doubtful (to say the least): First, there is no reason to trust the judgment of the ICE-GB makers with respect to the categorization of speakers according to SEX; presumably, they went by their intuition, which is based on outward appearance, proper names etc., rather than asking for a gene test and using the chromosome pair XX as a definition for FEMALE and XY as a definition for MALE (even if they had, there are researchers in the humanities who question the link between genetics and sex in general, and there are medical conditions that can lead to a divergence of genetics and physiology that make it difficult even in the context of biology to determine sex).

In addition, it is questionable whether communication styles are determined by genetic sex at all; more likely, they are the result of the socialization into particular sex/gender roles. These, of course, were not assessed at all by the corpus makers, so we have to hope that they are associated sufficiently strongly with the outward signals of biological SEX that the corpus makers were paying attention to. Second, there is no a priori logical link between hedging and social subordination. Intuitively, hedging is related to the degree to which a speaker wants to commit to the contents of an utterance. While a low degree of commitment to an utterance may of course reflect an attempt to signal insecurity, it may also have a range of other functions, for example, a desire to soften one's statements so as not to impose on one's addressee, a signal that one wants to construct statements cooperatively etc. (cf. Holmes 1995: 73-74; cf. Coates 1996: 162 for additional possibilities).
This chapter was intended to impress on the reader one thing: that looking at one potential variable influencing some phenomenon that you are interested in may not be enough. Multivariate research designs are becoming the norm rather than the exception, and rightly so. However, the more complex our design, the more likely we are to forget that it is just that: a design, whose usefulness depends entirely on how we have operationalized our constructs. Regardless of the sophistication of our statistical models, our results may tell us very little, or they may tell us something completely different from what we think we are seeing in them. Empirical research is never finished; every interpretation of your data should lead you to come up with additional ways of testing it, distinguishing it from other plausible interpretations, refining your operational definitions, and so on.
Further reading v0
Note that Configural Frequency Analysis is much more flexible and complex than might be suggested by the above exposition. A readable introduction (by the standards of statistical textbooks) to the method is von Eye (1990). A good introduction to multivariate methods and designs in linguistics in general is Gries (2013). Some examples of multivariate research designs can be found, for example, in Glynn and Robinson (2014).
Eye, Alexander von (1990). Introduction to configural frequency analysis: The search for types and antitypes in cross-classifications. Cambridge and New York: Cambridge University Press.

Gries, S. T. (2013). Statistics for linguistics with R: A practical introduction (2nd revised edition). Berlin: De Gruyter Mouton.

Glynn, Dylan & Justyna A. Robinson (eds.) (2014). Corpus methods for semantics: Quantitative studies in polysemy and synonymy (Human Cognitive Processing 43). Amsterdam and Philadelphia: John Benjamins.
Part II: Applications
9. Collocation
The (orthographic) word plays a central role in corpus linguistics. As suggested in Chapter 5, this is in no small part due to the fact that all corpora, whatever additional annotations may have been added, consist of orthographically represented language, which makes it easy to retrieve word forms. Every concordancing program offers the possibility to search for a string of orthographic characters; in fact, some are limited to this type of search. However, the focus on words is also due to the fact that the results of corpus-linguistic research quickly showed that words (individually and in groups) are more interesting and show a more complex behavior than traditional, grammar-focused theories of language assumed. An area in which this is very obvious, and which has therefore become one of the most heavily researched areas in corpus linguistics, is the way in which words combine to form so-called collocations.

is chapter is dedicated entirely to the discussion of collocation. At rst, this


will seem like a somewhat abrupt shi from the topics and phenomena we have
discussed so far it may not even be immediately obvious how they t into the
denition of corpus linguistics as the investigation of linguistic research
questions that have been framed in terms of the conditional distribution of
R

linguistic phenomena in a linguistic corpus, which was presented at the end of


Chapter 2. However, a closer look will show that they are simply a special case of
D

precisely this type of research question.

9.1. Collocates
Trivially, texts are not random collections of words. The occurrence of words is restricted by several factors.

First, the grammatical context limits our choices. For example, a determiner cannot be followed by another determiner or a verb, but only by a noun, by an adjective modifying a noun, or an adverb modifying such an adjective; likewise, a transitive verb requires a direct object in the form of a noun phrase, so we expect it to be followed by a word that can occur at the beginning of a noun phrase (such as a pronoun, a determiner, an adjective or a noun).

Second, semantic considerations limit our choices. For example, the transitive verb drink requires a direct object referring to a liquid, so we would expect it to be followed by words like water, beer, coffee, poison, etc., but not by words like bread, guitar, stone, democracy, etc. Such restrictions are treated as a grammatical property of words (called selection restrictions) in some theories, but they may also be an expression of our world knowledge concerning the activity of drinking.

Finally, topical considerations will obviously limit our choices: we will choose words that refer to whatever it is that we are currently talking or writing about.
However, it has long been noted that words are not distributed randomly even within the confines of grammar, lexical or world knowledge, and communicative intent. Instead, a given word will have affinities to some words (and disaffinities to others) that we could not predict given a set of grammatical rules, a dictionary and a thought that needs to be expressed. One of the first principled discussions of this phenomenon is found in Firth (1957). Using the example of the word ass (in the sense of "donkey"), he discusses the way in which what he calls habitual collocations contribute to the meaning of words:

One of the meanings of ass is its habitual collocation with an immediately preceding you silly, and with other phrases of address or of personal reference. There are only limited possibilities of collocation with preceding adjectives, among which the commonest are silly, obstinate, stupid, awful, occasionally egregious. Young is much more frequently found than old. (Firth 1957: 194)
Note that Firth, although writing well before the advent of corpus linguistics, refers explicitly to frequency as a characteristic of collocations. The possibility of using frequency as part of the definition of (and thus as a way of identifying) collocations was quickly taken up. Halliday (1961) provides what is probably the first strictly quantitative definition:

Collocation is the syntagmatic association of lexical items, quantifiable, textually, as the probability that there will occur, at n removes (a distance of n lexical items) from an item x, the items a, b, c ... Any given item thus enters into a range of collocation, the items with which it is collocated being ranged from more to less probable. (Halliday 1961: 276; cf. Church and Hanks 1989 for a more recent comprehensive quantitative discussion)
9.1.1 Collocation as a quantitative phenomenon

Essentially, then, collocation is just a special case of the quantitative corpus-linguistic research design adopted in this book: to ask whether two words form a collocation (or: are collocates of each other) is to ask whether one of these words occurs in a given position more frequently than expected by chance under the condition that the other word occurs in a structurally or sequentially related position. In other words, we can decide whether two words a and b can be regarded as collocates on the basis of a contingency table like that in 9-1.
Table 9-1: Collocation

                                   POSITION 2
                         WORD B         OTHER WORDS            Total

POSITION 1   WORD A      a + b          a + other word         word a
             OTHER       other + b      other + other words    other words
             WORDS
Total                    word b         other words            corpus size
On the basis of such a table, we can determine the collocation status of a given word pair. For example, we can ask whether Firth was right with respect to the claim that silly ass is a collocation. The necessary data are shown in Table 9-2.
Table 9-2: Co-occurrence of silly and ass in the BNC

                               POSITION 2
                        ASS                OTHER WORDS            Total

POSITION 1   SILLY      Obs: 9             Obs: 2,670             2,679
                        Exp: 0.01          Exp: 2,678.99
             OTHER      Obs: 350           Obs: 98,493,136        98,493,486
             WORDS      Exp: 358.99        Exp: 98,493,127.01
Total                   359                98,495,806             98,496,165

Note: Data are based on the version of the BNC distributed by the Oxford Text Archive (see Appendix A.1). The queries are for the lemmata silly and ass and the lemma sequence silly ass. The total number of words in the BNC is based on a query for all tokens in the BNC that are not tagged as punctuation marks.
The combination silly ass is very rare in English, occurring just nine times in the 98,496,165-word BNC, but Table 9-2 shows that it is vastly more frequent than expected: by chance, it should have occurred only 0.01 times (i.e., not at all). The difference between the observed and the expected frequencies is highly significant (χ² = 8277.66, df = 1, p < 0.001). Note that we are using the chi-square test here because we are already familiar with it; it is questionable, however, whether this test is the most useful one for the purpose of identifying collocations (a point we will return to in detail below).
Generally speaking, the goal of a quantitative collocation analysis is to identify, for a given word, those other words that are characteristic for its context of usage. Tables 9-1 and 9-2 present the most straightforward way of doing so: we simply compare the frequency with which two words co-occur with the frequencies with which they occur in the corpus in general. In other words, the two conditions across which we are investigating the distribution of a word are "next to a given other word" and "everywhere else". The corpus itself thus functions as a kind of neutral control condition, albeit a somewhat indiscriminate one (comparing the frequency of a word next to some other word with its frequency in the entire rest of the corpus is a bit like comparing an experimental group of subjects that have been given a particular treatment with a control group consisting of all other people who happen to live in the same city).
Often, we will be interested in the distribution of a word across two specific conditions; in the case of collocation, the distribution across the immediate contexts of two semantically related words. It may be more insightful to compare adjectives occurring next to ass with those occurring next to the rough synonym donkey or the superordinate term animal, because it is more interesting that silly occurs more frequently with ass than with donkey or animal than that it occurs more frequently with ass than with stone or democracy. Likewise, it is more interesting that silly occurs with ass more frequently than childish than that silly occurs with ass more frequently than precious or parliamentary.

In such cases, we can modify Table 9-1 as shown in Table 9-3 to identify the collocates that differ significantly between two words. There is no established term for such collocates, so we will call them differential collocates here25 (the method is based on Church et al. 1991).
Table 9-3: Identifying differential collocates

                                   POSITION 2
                         WORD B               WORD C               Total

POSITION 1   WORD A      word a + word b      word a + word c      word a
             OTHER       other words +        other words +        other words
             WORDS       word b               word c               occ. with b or c
Total                    word b               word c               sample size
Since the collocation silly ass (and the word ass in general) is so infrequent in the BNC, let us use a different noun to demonstrate the usefulness of this method, the word game. We can speak of silly game(s) or childish game(s), but we may feel that the latter is more typical than the former. The relevant lemma frequencies to put this feeling to the test are shown in Table 9-4.
25 Gries (2003) and Gries and Stefanowitsch (2004) use the term "distinctive collocate", which has been taken up by some authors; however, many other authors use the term "distinctive collocate" much more broadly to refer to characteristic collocates of a word.

Table 9-4: Childish game vs. silly game (lemmas) in the BNC

                               POSITION 1
                        CHILDISH           SILLY                  Total

POSITION 2   GAME       Obs: 12            Obs: 31                43
                        Exp: 6.1           Exp: 36.9
             OTHER      Obs: 431           Obs: 2,648             3,079
             WORDS      Exp: 436.9         Exp: 2,642.1
Total                   443                2,679                  3,122
Childish game(s) and silly game(s) both occur in the BNC. Both combinations taken individually are significantly more frequent than expected (you may check this yourself using the frequencies from Table 9-4, the total lemma frequency of game in the BNC (20,625), and the total number of words in the BNC given in Table 9-2 above). The lemma sequence silly game is more frequent, which may suggest that it is the stronger collocation. However, the direct comparison shows that this is due to the fact that silly is more frequent in general than childish, so that the chance likelihood of silly games is higher than that of childish games. The difference between the observed and the expected frequencies suggests that childish is more strongly associated with game(s) than silly. The difference is very significant (χ² = 6.74, df = 1, p < 0.01).
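The chi-square value just cited can be recomputed from the four observed frequencies of Table 9-4 alone. The following sketch is my own illustration (not part of the original text); it derives the expected frequencies from the marginal totals in the usual way:

```python
# Differential-collocate table (Table 9-4): rows GAME / OTHER WORDS,
# columns CHILDISH / SILLY
observed = [[12, 31],
            [431, 2648]]

row_totals = [sum(row) for row in observed]        # [43, 3079]
col_totals = [sum(col) for col in zip(*observed)]  # [443, 2679]
n = sum(row_totals)                                # 3122

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed[i][j] - expected) ** 2 / expected

print(round(chi_square, 2))  # 6.74, as reported above
```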


Researchers dier with respect to what types of co-occurrence they accept
when identifying collocations. Some treat co-occurrence as a purely sequential
phenomenon dening collocates as words that co-occur more frequently than
expected within a given span. Some researchers require a span of 1 (i.e., the
R

words must occur directly next to each other), but many allow spans larger spans
(ve words being a relatively typical span size).
Other researchers treat co-occurrence as a structural phenomenon, i.e., they
D

dene collocates as words that co-occur more frequently than expected in two
related positions in a particular grammatical structure, for example, the adjective
and noun positions in noun phrases of the form [Det Adj N] or the verb and noun
position in transitive verb phrases of the form [V [NP (Det) (Adj) N]].26 However,
instead of limiting the denition to one of these possibilities, it seems more
plausible to dene the term appropriately in the context of a specic research
question. In the examples above, we used a purely sequential denition that
26 Note that such word-class specic collocations are sometimes referred to as
colligations, although the term colligation usually refers to the co-occurrence of a
word in the context of particular word classes, which is not the same.

198
Anatol Stefanowitsch

simply required words to occur next to each other, paying no aention to their
word-class or structural relationship; given that we were looking at adjective-
noun combinations, it would certainly have been reasonable to restrict our search
parameters to adjectives modifying the noun ass, regardless of whether other
adjectives intervened, for example in expressions like silly old ass, which our
search would have missed if they occurred in the BNC (which they do not).
Note that the designs in Tables 9-1 and 9-3 are essentially variants of the general research design introduced in previous chapters and used as the foundation of defining corpus linguistics: it has two variables, POSITION 1 and POSITION 2, both of which have two values, namely WORD X vs. OTHER WORDS (or, in the case of differential collocates, WORD X vs. WORD Y). The aim is to determine whether the value WORD A is more frequent for POSITION 1 under the condition that WORD B occurs in POSITION 2 than under the condition that other words (or a particular other word) occur in POSITION 2.
9.1.2 Methodological issues in collocation research

While there are research projects involving individual collocations (or reasonably small sets of collocations, for example, all collocations involving a particular word), in many cases we are more likely to be interested in large sets of collocations, perhaps even in all collocations in a given corpus. This has a number of methodological consequences concerning the practicability, the statistical evaluation and the epistemological status of collocation research.
a) Practicability. In practical terms, the analysis of large numbers of potential collocations requires creating a large number of contingency tables and subjecting them to the chi-square test or some other appropriate statistical test. This becomes implausibly time-consuming very quickly and thus needs to be automated in some way.

There are concordancing programs that offer some built-in statistical tests, but they typically restrict our options quite severely, both in terms of the tests they allow us to perform and in terms of the data on which the tests are performed. Anyone who decides to become involved in collocation research (or some of the large-scale lexical research areas described in the next chapter) should get acquainted at least with the simple options of automating statistical testing offered by spreadsheet applications. Better yet, they should invest a few weeks (or, in the worst case, months) to learn a scripting language like Perl, Python or R (the latter being a combination of statistical software and programming environment that is ideal for almost any task that we are likely to come across as corpus linguists).
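As a minimal sketch of what such automation might look like (the function name and structure are my own, not a prescribed implementation), the chi-square computation for a 2-by-2 collocation table can be wrapped in a function and then applied to any number of candidate word pairs in a loop:

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2-by-2 co-occurrence table."""
    observed = [[o11, o12], [o21, o22]]
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    n = sum(row_totals)
    result = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            result += (observed[i][j] - expected) ** 2 / expected
    return result

# Applied to the silly/ass frequencies from Table 9-2; thousands of candidate
# pairs could be run through the same function in a loop
print(chi_square(9, 2670, 350, 98493136))  # ≈ 8277.66, the value reported for Table 9-2
```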

b) Statistical evaluation. In statistical terms, the analysis of large numbers of potential collocations requires us to keep in mind that we are now performing multiple significance tests on the same set of data. This means that we must adjust our significance levels. Think back to the example of coin-flipping: the likelihood of getting a series of one head and nine tails is 0.009765. If we flip a coin ten times and get this result, we could thus reject the null hypothesis with a probability of error of 0.010742, i.e., around 1 per cent (because we would have to add the probability of getting ten tails, 0.000976). This is well below the level required to claim statistical significance. However, if we performed a hundred series of ten coin-flips, there is a very strong likelihood that one series of this type (or of the more extreme type zero heads and ten tails) occurred; after all, there is a one percent chance, i.e. a chance of 1 in 100. This is not a problem as long as we did not accord this one result out of a hundred any special importance. However, if we were to identify a set of 100 collocations with p-values of 0.001 in a corpus, we are potentially treating all of them as important, even though it is very likely that at least one of them reached this level of significance by chance.
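The binomial arithmetic behind these figures is easy to verify (a quick check, using only the standard library):

```python
from math import comb

# Probability of exactly one head (in any position) in ten fair coin flips
p_one_head = comb(10, 1) * 0.5 ** 10
# Add the more extreme outcome (ten tails, i.e. zero heads) for the error probability
p_at_most_one_head = p_one_head + comb(10, 0) * 0.5 ** 10

print(p_one_head)          # 0.009765625
print(p_at_most_one_head)  # 0.0107421875 (i.e., around 1 per cent)
```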
To avoid this, we have to correct our levels of significance when performing multiple tests on the same set of data. The simplest way to do this is the so-called Bonferroni correction, which consists in dividing the conventionally agreed-upon significance levels by the number of tests we are performing. For example, if we performed significance tests on 100 potential collocations, the significance levels would be 0.0005 (significant), 0.0001 (very significant) and 0.00001 (highly significant).

It should be noted that this is an extremely conservative correction that might make it quite difficult for any given collocation to reach significance. For example, if we were to calculate the significance of all token pairs in the LOB corpus, we would have to perform a statistical test on 422,764 contingency tables. This means that the most generous level of significance is now 0.05/422,764 ≈ 0.00000012. This still leaves more than 15,000 statistically significant collocations in the LOB corpus, but it will remove many more that could also be significant. On the other hand, the Bonferroni correction has the advantage of being very simple to apply (see Shaffer 1995 for an overview of corrections for multiple testing, including many that are less conservative than the Bonferroni correction).
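Applying the correction is a one-liner; the helper below (a hypothetical name of my own) reproduces both sets of corrected levels mentioned in this paragraph:

```python
def bonferroni(levels, number_of_tests):
    """Divide each conventional significance level by the number of tests."""
    return [level / number_of_tests for level in levels]

conventional = [0.05, 0.01, 0.001]  # significant, very significant, highly significant

# 100 potential collocations
print(bonferroni(conventional, 100))        # ≈ [0.0005, 0.0001, 0.00001]

# 422,764 candidate token pairs in the LOB: the most generous corrected level
print(bonferroni(conventional, 422764)[0])  # ≈ 0.00000012
```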
Of course, the question is how important the role of p-values is in a design where our main aim is to identify collocates and order them in terms of their collocation strength. I will turn to this point presently, but before I do so, let us discuss the third of the three consequences of large-scale testing for collocation, the epistemological one.
c) Epistemological considerations. We have, up to this point, presented a very narrow view of the scientific process, based (in a general way) on the Popperian research cycle where we formulate a research hypothesis and then test it (either directly, by looking for counterexamples, or, more commonly, by attempting to reject the corresponding null hypothesis). This is called the deductive method. However, as briefly mentioned in Chapter 3, there is an alternative approach to scientific research that does not start with a hypothesis, but rather with general questions like "Do relationships exist between the constructs in my data?" and "If so, what are those relationships?". The research then consists in applying statistical procedures to large amounts of data and examining the results for interesting patterns. As electronic storage and computing power have become cheaper and more widely accessible, this approach, the exploratory or inductive approach, has become increasingly popular in all branches of science, particularly the social sciences. It would be surprising if corpus linguistics were an exception, and indeed, it is not. Especially the area of collocational research is typically exploratory.

In principle, there is nothing wrong with exploratory research on the


contrary, it would be unreasonable not to make use of the large amounts of
language data and the vast computing power that has become available and
accessible over the last twenty years. In fact, it is sometimes dicult to imagine a
plausible hypothesis for collocational research projects. What hypothesis would
R

we formulate before identifying all collocations in the LOB or some specialized


corpus (e.g., a corpus of business correspondence, a corpus of ight-control
D

communication or a corpus of learner language)?27 Despite this, it is clear that the


results of such a collocation analysis yield interesting data, both for practical
purposes (building dictionaries or teaching materials for business English or
aviation English, extracting terminology for the purpose of standardization,
training natural-language processing systems) and for theoretical purposes
(insights into the nature of situational language variation or even the nature of
27 Of course we are making the implicit assumption that there will be collocates in a
sense, this is a hypothesis, since we could conceive of models of language that would
not predict their existence (in fact, we might argue that at least some versions of
generative grammar constitute such models). However, even if we accept this as a
hypothesis, it is typically not the one we are interested in this type of study.

201
9. Collocation

language in general).
But there is a danger, too: Most statistical procedures will produce some statistically significant result if we apply them to a large enough data set, and collocational methods certainly will. Unless we are interested exclusively in description, the crucial question is whether these results are meaningful. If we start with a hypothesis, we are restricted in our interpretation of the data by the need to relate our data to this hypothesis. If we do not start with a hypothesis, we can interpret our results without any restrictions, which, given the human propensity to see patterns everywhere, may lead to somewhat arbitrary post-hoc interpretations that could easily be changed, even reversed, if the results had been different, and that therefore tell us very little about the phenomenon under investigation or language in general. Thus, it is probably a good idea to formulate at least some general expectations before doing a large-scale collocation analysis.
Even if we do start out with general expectations or even with a specific hypothesis, we will often discover additional facts about our phenomenon that go beyond what is relevant in the context of our original research question. For example, checking in the BNC Firth's claim that the most frequent collocates of ass are silly, obstinate, stupid, awful and egregious and that young is much more frequent than old, we find that silly is indeed the most frequent adjectival collocate, but that obstinate, stupid and egregious do not occur at all, that awful occurs only once, and that young and old both occur twice. Instead, frequent adjectival collocates (ignoring second-placed wild, referring to actual donkeys) are pompous and bad. Pompous does not really fit with the semantics that Firth's adjectives suggest and indicates that a semantic shift from stupidity to self-importance may have taken place between 1957 and 1991 (when the BNC was assembled).
This is, of course, a new hypothesis that can (and must) be investigated by comparing data from the 1950s and the 1990s. It has some initial plausibility in that the adjectives blithering, hypocritical, monocled and opinionated also co-occur with ass in the BNC but were not mentioned by Firth. However, it is crucial to treat this as a hypothesis rather than a result. The same goes for bad ass, which suggests that the American sense of ass ("bottom") and/or the American adjective badass (which is often spelled as two separate words) may have begun to enter British English. In order to be tested, these ideas, and any ideas derived from an exploratory data analysis, have to be turned into testable hypotheses and the constructs involved have to be operationalized. Crucially, they must be tested on a new data set; if we were to circularly test them on the same data that they were derived from, we would obviously find them confirmed.
9.1.3 Effect sizes for collocations

As mentioned above, significance testing (while not without its uses) may not be the primary concern when investigating collocations. Instead, researchers frequently need a way of assessing the strength of the association between two (or more) words, or, put differently, the effect size of their co-occurrence (recall from Chapter 7 that significance and effect size are not the same). A wide range of such association measures has been proposed and investigated. They are typically calculated on the basis of (some or all of) the information contained in contingency tables like those in 9-1 and 9-3 above.

Let us look at some of the most popular and/or most useful of these measures. I will represent the formulas with reference to the table in 9-5a, i.e., O11 means the observed frequency of the top left cell, E11 its expected frequency, R1 the first row total, C2 the second column total, and so on.
Table 9-5a: A generic 2-by-2 table for collocation research

                                POSITION 2
                      WORD B         OTHER WORDS         Total
                                     or WORD C

POSITION 1   WORD A   O11            O12                 R1
             OTHER    O21            O22                 R2
Total                 C1             C2                  N

Note: The second column would be labeled OTHER WORDS in the case of normal collocations, and WORD C in the case of differential collocations. The association measures can be applied to both types of design.
Now all we need is a good example to demonstrate the calculations. Let us use the adjective-noun sequence good example from the LOB corpus (but do not fear, we will return to equine animals and their properties further below).
Table 9-5b: Co-occurrence of good and example in the LOB

                                POSITION 2
                        EXAMPLE              OTHER WORDS              Total

POSITION 1   GOOD       Obs: 9               Obs: 870                 879
                        Exp: 0.1846          Exp: 878.8154
             OTHER      Obs: 235             Obs: 1,160,704           1,160,939
             WORDS      Exp: 243.8154        Exp: 1,160,695.1846
Total                   244                  1,161,574                1,161,818
The main reason for the existence of different measures is, of course, that they give different results. In particular, many measures have a problem with rare collocations, especially if the individual words of which they consist are also rare. After we have introduced the measures, we will therefore compare their performance with a particular focus on the way in which they deal (or fail to deal) with such rare events.

a) Chi-square. e rst association measure is an old acquaintance: the chi-square


T
statistic, which we used extensively in Chapters 7 and 8 and in Section 9.1.1
above. I will not demonstrate it again, but the chi-square value for Table 9-5b
AF

would be 421.37 (at 1 degree of freedom this means that p < 0.001, but we are not
concerned with p-values here).
Recall that the chi-square test statistic is not an effect size, but that it needs to be divided by the table total to turn it into one; but as long as we are deriving all our collocation data from the same corpus, this will not make a difference, since the table total will always be the same. However, if we are calculating differential collocates, this is no longer the case, so we might want to use the phi statistic instead. I am not aware of any research using the phi statistic as an association measure, and in fact the chi-square statistic itself is not used very widely either. This is because it has a serious problem: recall that it cannot be applied if more than 20 percent of the cells of the contingency table contain expected frequencies smaller than 5 (in the case of collocates, this means not even one out of the four cells of the 2-by-2 table). One reason for this is that it dramatically overestimates the effect size and significance of such events, and of rare events in general. Since collocations are often relatively rare events, this makes the chi-square statistic a bad choice as an association measure.
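To make the computation concrete, here is a minimal sketch in Python (not part of the original text; the function name is mine) that reproduces the chi-square value of 421.37 for Table 9-5b from its four observed frequencies:

```python
def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2-by-2 contingency table."""
    n = o11 + o12 + o21 + o22
    # expected frequency of each cell = row total * column total / table total
    expected = [(o11 + o12) * (o11 + o21) / n, (o11 + o12) * (o12 + o22) / n,
                (o21 + o22) * (o11 + o21) / n, (o21 + o22) * (o12 + o22) / n]
    observed = [o11, o12, o21, o22]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Table 9-5b: good/example in the LOB
print(round(chi_square(9, 870, 235, 1160704), 2))  # 421.37
```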

Anatol Stefanowitsch

b) Mutual Information. This is one of the oldest collocation measures, frequently used in computational linguistics and often implemented in collocation software. It is given in (9.1a) in a version based on Church and Hanks 1989:28

(9.1a)  MI = log2(O11 / E11)
Applying the formula to our table, we get the following:

(9.1b)  MI = log2(9 / 0.1846038) = log2(48.7531) = 5.6074

In our case, we are looking at cases where WORD A and WORD B occur directly next to each other, i.e., the span size is 1. When looking at a larger span (which is often done in collocation research), the likelihood of encountering a particular collocate increases, because there are more slots that it could potentially occur in. The MI statistic can be adjusted for larger span sizes as follows (where S is the span size):

(9.1c)  MI = log2(O11 / (E11 × S))
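A short Python sketch (illustrative, not from the original text) that implements MI with the span-size adjustment; note that dividing by a span of 2 inside the binary logarithm lowers the score by exactly one bit:

```python
from math import log2

def mutual_information(o11, e11, span=1):
    """Pointwise mutual information, with the span-size adjustment of (9.1c)."""
    return log2(o11 / (e11 * span))

# Table 9-5b: good + example observed 9 times, expected 0.1846 times (span 1)
print(round(mutual_information(9, 0.1846038), 4))  # 5.6074
```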
The mutual information measure suffers from the same problem as the chi-square statistic, overestimating the importance of rare events, but since it is still fairly widespread in collocational research, we may need it for situations where we are comparing our own data to the results of published studies. However, note that there are versions of the MI measure that will give different results, so we need to make sure we are using the same version as the study we are comparing our results to. Or better yet, we should not use mutual information at all (one of the case studies presented below uses it, see Section 9.2.1.1).

28 A logarithm with base b of a given number x is the power to which b must be raised to produce x; so, for example, log10(2) = 0.30103, because 10^0.30103 = 2. Most calculators give us at least a choice between the natural logarithm, where the base is the number e (approx. 2.7183), and the common logarithm, where the base is the number 10; many calculators and all major spreadsheet programs allow us to calculate logarithms with any base. In the formula in (9.1a), we need the logarithm with base 2; if this is not available, we can use the natural logarithm and divide the result by the natural logarithm of 2:

(i)  MI = loge(O11 / E11) / loge(2)

c) Log-likelihood. The log-likelihood test statistic G2 is one of the most popular (perhaps the most popular) association measures in collocational research, found in many of the central studies in the field and often implemented in collocation software. The following is a frequently found form (Read and Cressie 1988: 134):

(9.2a)  G2 = 2 × Σ(i=1..n) Oi × loge(Oi / Ei)
In order to calculate the log-likelihood measure, we calculate for each cell the natural logarithm of the observed frequency divided by the expected frequency and multiply it by the observed frequency. We then add up the results for all four cells and multiply the result by two. Note that if the observed frequency of a given cell is zero, the expression Oi/Ei will, of course, also be zero. Since the logarithm of zero is undefined, this would result in an error in the calculation. Thus, log(0) is simply defined as zero when applying the formula in (9.2a).29 Applying the formula to the data in Table 9-5b, we get the following:
29 If you plan to automatize the calculation of log-likelihood using a spreadsheet or a simple program, you can also use the following version of the formula, which will give the same result. It is much longer and unwieldier, but it has the advantage that you do not need to calculate expected frequencies (again, make sure you define log(0) as 0).

(i)  G2 = 2 × ( (O11 log O11) + (O12 log O12) + (O21 log O21) + (O22 log O22)
              - (O11 + O12) log (O11 + O12) - (O11 + O21) log (O11 + O21)
              - (O12 + O22) log (O12 + O22) - (O21 + O22) log (O21 + O22)
              + (O11 + O12 + O21 + O22) log (O11 + O12 + O21 + O22) )

(9.2b)  G2 = 2 × ( 9 × loge(9/0.1846) + 870 × loge(870/878.8154)
               + 235 × loge(235/243.8154) + 1160704 × loge(1160704/1160695.1846) )
            = 2 × ( 34.9809 + (-8.7710) + (-8.6541) + 8.8154 ) = 52.7424
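Both forms of the statistic can be sketched in Python (illustrative helper names, not from the original text); the second function is the footnote variant that needs no expected frequencies, and the two agree to within rounding:

```python
from math import log

def g2(o11, o12, o21, o22):
    """G2 = 2 * sum of O * ln(O/E) over the four cells; cells with O = 0 contribute 0."""
    n = o11 + o12 + o21 + o22
    cells = [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]
    # E = row total * column total / n, so O/E = O * n / (row total * column total)
    return 2 * sum(o * log(o * n / (r * c)) for o, r, c in cells if o > 0)

def g2_no_expected(o11, o12, o21, o22):
    """The footnote-29 variant: no expected frequencies; x*log(x) is taken as 0 for x = 0."""
    def xlogx(x):
        return x * log(x) if x > 0 else 0.0
    n = o11 + o12 + o21 + o22
    return 2 * (xlogx(o11) + xlogx(o12) + xlogx(o21) + xlogx(o22)
                - xlogx(o11 + o12) - xlogx(o11 + o21)
                - xlogx(o12 + o22) - xlogx(o21 + o22)
                + xlogx(n))

print(round(g2(9, 870, 235, 1160704), 2))  # 52.74
```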

The log-likelihood test statistic has long been known to be more reliable than the chi-square test when dealing with small samples and small expected frequencies (Read and Cressie 1988: 134). This led Dunning (1993) to propose it as an association measure specifically to avoid the overestimation of rare events that plagues the chi-square test, mutual information and other measures.

d) Minimum Sensitivity. This measure was proposed by Pedersen (1998) as a potentially useful measure especially for the identification of associations between content words:

(9.3a)  MS = min(O11/R1, O11/C1)
We simply divide the observed frequency of a collocation by the frequency of the first word (R1) and of the second word (C1) and use the smaller of the two as the association measure. For the data in Table 9-5b, this gives us the following:

(9.3b)  MS = min(9/879, 9/244) = min(0.0102, 0.0369) = 0.0102


Apart from being extremely simple to calculate, it has the advantage of ranging
from zero (words never occur together) to 1 (words always occur together); it was
also argued by Wiechmann (2008) to correlate best with reading time data when
applied to combinations of words and grammatical constructions (see Chapter
10). However, it also tends to overestimate the importance of rare collocations.
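The calculation really is trivial, as a two-line Python sketch shows (illustrative function name, not from the original text):

```python
def minimum_sensitivity(o11, r1, c1):
    """Minimum sensitivity: the smaller of O11/R1 and O11/C1."""
    return min(o11 / r1, o11 / c1)

# Table 9-5b: good + example occurs 9 times; good occurs 879 times, example 244 times
print(round(minimum_sensitivity(9, 879, 244), 4))  # 0.0102
```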

e) Fisher's Exact test (p-value). This test was already mentioned in passing in Chapter 7 as an alternative to the chi-square test that calculates the probability of error directly by adding up the probability of the observed distribution and all distributions that deviate from the null hypothesis further in the same direction. Pedersen (1996) suggests using this p-value as a measure of association because it does not make any assumptions about normality (cf. Chapter 7) and is even better at dealing with rare events than log-likelihood. Stefanowitsch and Gries (2003: 238-239) add that it has the advantage of taking into account both the magnitude of the deviation from the expected frequencies and the sample size.
There are some practical disadvantages to Fisher's exact test. First, it is computationally expensive: it cannot be calculated manually, except for very small tables, because it involves computing factorials, which become very large very quickly. For completeness' sake, here is (one version of) the formula:

(9.4)  pFisher = (R1! × R2! × C1! × C2!) / (O11! × O12! × O21! × O22! × N!)

Obviously, it is not feasible to apply this formula directly to the data in Table 9-5b, because we cannot realistically calculate the factorials for 235 or 870, let alone 1,160,704. But if we could, we would find that the p-value for Table 9-5b is 0.0000000000004841512.
Spreadsheet applications do not offer Fisher's exact test, but all major statistics applications implement it. However, typically, the exact p-value is not reported beyond the limit of a certain number of decimal places. This means that there is often no way of ranking the most strongly associated collocates, because their p-values are smaller than this limit. For example, there are 125 collocates in the LOB corpus with a Fisher's exact p-value that is smaller than the smallest value that a standard-issue computer chip is capable of calculating, and 5343 collocates that have p-values that are smaller than what the standard implementation of Fisher's exact test in the statistical software package R will deliver.
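In practice, the factorial problem is usually sidestepped by working with log-factorials (via the log-gamma function) and summing the point probabilities of the observed table and all more extreme ones. The sketch below (Python; helper names mine, not from the original text) computes the one-tailed p for Table 9-5b this way:

```python
from math import lgamma, exp

def log_choose(n, k):
    # log of the binomial coefficient n-over-k, via log-gamma (avoids huge factorials)
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def fisher_p(o11, o12, o21, o22):
    """One-tailed Fisher p: probability of the observed table or a more extreme one."""
    r1, c1 = o11 + o12, o11 + o21
    n = o11 + o12 + o21 + o22
    log_denom = log_choose(n, c1)
    # hypergeometric point probability for each k from the observed O11 upwards
    return sum(exp(log_choose(r1, k) + log_choose(n - r1, c1 - k) - log_denom)
               for k in range(o11, min(r1, c1) + 1))

print(fisher_p(9, 870, 235, 1160704))  # about 4.84e-13, the value reported above
```

Staying in log space like this is also one way around the underflow limit just mentioned: for ranking purposes, one could report the logarithm of the leading term rather than the (possibly underflowing) p-value itself.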

To conclude, let us compare how the association measures fare using a data set of 20 potential collocations. Inspired by Firth's silly ass, they are all combinations of adjectives with equine animals. Table 9-6 shows the combinations and their frequencies in the BNC (sorted by their raw frequency of occurrence).

Table 9-6. Some collocates of the form [ADJ N-equine] (data from the BNC)

WORD A        WORD B     A WITH B   A WITHOUT B   B WITHOUT A   NEITHER A NOR B
                         (O11)      (O12)         (O21)         (O22)
rocking       horse      35         423           1083          96985166
Trojan        horse      21         96            1097          96985493
new           horse      13         124021        1105          96861568
silly         ass        8          2436          294           96983969
galloping     horse      8          199           1110          96985390
prancing      horse      8          71            1110          96985518
pompous       ass        4          251           298           96986154
old           donkey     3          52645         428           96933631
common        zebra      3          19839         170           96966695
old           mule       3          52645         27            96934032
old           ass        2          52646         300           96933759
braying       donkey     2          29            429           96986247
young         zebra      1          32292         172           96954242
large         mule       1          34230         29            96952447
female        hinny      1          1451          7             96985249
extinct       quagga     1          421           4             96986281
monocled      ass        1          2             301           96986403
caparisoned   mule       1          19            29            96986658
dumb-fuck     donkey     1          0             430           96986276
jumped-up     jackass    1          20            7             96986679

All combinations are perfectly normal, grammatical adjective-noun pairs, meaningful not only in the specific context of their actual occurrence. However, I have selected them in such a way that they differ with respect to their status as potential collocations (in the sense of typical combinations of words). Some are compounds or compound-like combinations (rocking horse, Trojan horse, and, in specialist discourse, common zebra). Some are the kind of semi-idiomatic combinations that Firth had in mind (silly ass, pompous ass). Some are very conventional combinations of nouns with an adjective denoting a property specific to that noun (prancing horse, braying donkey, galloping horse, the first of these being a conventional way of referring to the Ferrari brand mark logo). Some only give the appearance of semi-idiomatic combinations (jumped-up jackass, actually an unconventional variant of jumped-up jack-in-office; dumb-fuck donkey, actually an extremely rare phrase that occurs in the book Trail of the Octopus: From Beirut to Lockerbie - Inside the DIA and which sounds like an idiom because of the alliteration and the semantic relationship to silly ass; and monocled ass, which brings to mind pompous ass but is actually not a very conventional combination). Finally, there are a number of fully compositional combinations that make sense but do not have any special status (caparisoned mule, new horse, old donkey, young zebra, large mule, female hinny, extinct quagga).

In addition, I have selected them to represent different types of frequency relations: some of them are (relatively) frequent, some of them very rare; for some of them either the adjective or the noun is generally quite frequent, and for some of them neither of the two is frequent.

Table 9-7 shows the ranking of these twenty collocations by the five association measures discussed above. Simplifying somewhat, a good association measure should rank the conventionalized combinations highest (rocking horse, Trojan horse, silly ass, pompous ass, prancing horse, braying donkey, galloping horse), the distinctive-sounding but non-conventionalized combinations somewhere in the middle (jumped-up jackass, dumb-fuck donkey, old ass, monocled ass) and the compositional combinations lowest (caparisoned mule, new horse, old donkey, young zebra, large mule, female hinny, extinct quagga). Common zebra is difficult to predict: it is a conventionalized expression, but not in the general language.

All of the association measures fare quite well, generally speaking, with respect to the compositional expressions; these tend to occur in the lower third of all lists. Where there are exceptions, the chi-square statistic, mutual information and minimum sensitivity rank rare cases higher than they should (e.g. caparisoned mule, extinct quagga, female hinny), while the log-likelihood test statistic and the p-value of Fisher's exact test rank frequent cases higher (new horse).
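Every entry of Table 9-7 can be re-derived from the frequencies in Table 9-6. The Python sketch below (illustrative helper name, not from the original text) reproduces two of them: the minimum-sensitivity value for rocking horse and the MI value for jumped-up jackass:

```python
from math import log2

def measures(o11, o12, o21, o22):
    # recompute MI and MS for one row of Table 9-6 from its four frequencies
    n = o11 + o12 + o21 + o22
    r1, c1 = o11 + o12, o11 + o21      # frequencies of WORD A and WORD B
    e11 = r1 * c1 / n                  # expected co-occurrence frequency
    return {"MI": log2(o11 / e11), "MS": min(o11 / r1, o11 / c1)}

rocking = measures(35, 423, 1083, 96985166)
jackass = measures(1, 20, 7, 96986679)
print(round(rocking["MS"], 6))  # 0.031306, as in Table 9-7
print(round(jackass["MI"], 4))  # 19.139, as in Table 9-7
```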
Table 9-7. Comparison of selected association measures for collocates of the form [ADJ N-equine] (data from the BNC)

CHI-SQUARE                   MI                           MS                             G2                           FISHER'S P
jumped-up jackass  577300    jumped-up jackass  19.1390   jumped-up jackass  0.047619   rocking horse      549.811    rocking horse      2.90E-121
Trojan horse       326944    dumb-fuck donkey   17.7797   caparisoned mule   0.033333   Trojan horse       367.8483   Trojan horse       1.28E-81
rocking horse      231962    caparisoned mule   17.3025   rocking horse      0.031306   prancing horse     130.1905   prancing horse     7.93E-30
dumb-fuck donkey   225026    monocled ass       16.7079   Trojan horse       0.018784   galloping horse    114.2553   galloping horse    2.21E-26
caparisoned mule   161643    extinct quagga     15.4883   pompous ass        0.013245   silly ass           95.5781   silly ass          2.50E-22
monocled ass       107048    Trojan horse       13.9265   prancing horse     0.007156   pompous ass         60.3153   pompous ass        1.58E-14
prancing horse      70263    braying donkey     13.8255   galloping horse    0.007156   new horse           34.4740   new horse          4.22E-09
extinct quagga      45963    prancing horse     13.1008   braying donkey     0.004640   braying donkey      34.3740   braying donkey     9.16E-09
braying donkey      29032    female hinny       13.0275   monocled ass       0.003311   old mule            25.6367   old mule           6.42E-07
galloping horse     26806    rocking horse      12.6947   silly ass          0.003273   jumped-up jackass   24.7112   jumped-up jackass  1.73E-06
pompous ass         20143    pompous ass        12.2985   extinct quagga     0.002370   dumb-fuck donkey    24.6503   dumb-fuck donkey   4.44E-06
silly ass            8394    galloping horse    11.7111   dumb-fuck donkey   0.002320   caparisoned mule    22.0709   caparisoned mule   6.19E-06
female hinny         8348    silly ass          10.0379   female hinny       0.000689   monocled ass        21.5436   common zebra       7.07E-06
old mule              547    old mule            7.5253   common zebra       0.000151   common zebra        20.7614   monocled ass       9.34E-06
common zebra          248    large mule          6.5614   new horse          0.000105   extinct quagga      19.6885   extinct quagga     2.18E-05
new horse              94    common zebra        6.4053   old mule           0.000057   female hinny        16.1915   female hinny       1.20E-04
large mule             92    young zebra         4.1177   old donkey         0.000057   old donkey           9.7931   old donkey         1.78E-03
old donkey             33    old donkey          3.6806   old ass            0.000038   large mule           7.1502   large mule         1.05E-02
old ass                21    old ass             3.6088   young zebra        0.000031   old ass              6.3448   old ass            1.20E-02
young zebra            15    new horse           3.1846   large mule         0.000029   young zebra          3.8288   young zebra        5.60E-02
With respect to the non-compositional cases, chi-square and mutual information are quite bad, overestimating rare combinations like jumped-up jackass, dumb-fuck donkey and monocled ass, while listing some of the clear cases of collocations much further down the list (silly ass, and, in the case of MI, Trojan horse and rocking horse). Minimum sensitivity is much better, ranking most of the conventionalized cases in the top half of the list and the non-conventionalized ones further down (with the exception of jumped-up jackass, where the combination and the individual words are all quite rare). The log-likelihood test statistic and the Fisher p-value fare best (with no strong differences between them), listing the conventionalized cases at the top and the distinctive but non-conventionalized cases in the middle.
To demonstrate the problems that very rare events can cause (especially those where both the combination and each of the two words in isolation are very rare), imagine someone had used the phrase tomfool onager once in the BNC: neither the adjective tomfool (a synonym of silly) nor the noun onager (the name of the donkey sub-genus Equus hemionus, also known as Asiatic or Asian wild ass) occur in the BNC, which would give us the distribution in Table 9-8.
Table 9-8. Fictive occurrence of tomfool onager in the BNC

                          POSITION 2
                 ONAGER      OTHER WORDS       Total
POS. 1
  TOMFOOL        1           0                          1
  OTHER          0           96,986,706        96,986,706
  Total          1           96,986,706        96,986,707


D

Applying the formulas discussed above to this table will give us a chi-square
value of 96.986,707, an MI value of 26.53 and a minimum sensitivity value of 1,
placing this (hypothetical) one-o combination at the top of the respective
rankings by a wide margin. Again, the log-likelihood test statistic and Fishers
exact test are much beer, relegating it to the seventh place (G 2 = 38.78) and ninth
place (p= 1,03E-08) respectively.
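The effect is easy to verify. The sketch below (Python, not from the original text) computes all four directly calculable measures for the fictive table and reproduces the values just cited:

```python
from math import log, log2

o11, o12, o21, o22 = 1, 0, 0, 96986706  # Table 9-8
n = o11 + o12 + o21 + o22
r1, c1 = o11 + o12, o11 + o21
e11 = r1 * c1 / n
cells = [(o11, r1, c1), (o12, r1, n - c1),
         (o21, n - r1, c1), (o22, n - r1, n - c1)]

chi2 = sum((o - r * c / n) ** 2 / (r * c / n) for o, r, c in cells)
mi = log2(o11 / e11)
ms = min(o11 / r1, o11 / c1)
# G2: cells with an observed frequency of zero contribute nothing
g2 = 2 * sum(o * log(o / (r * c / n)) for o, r, c in cells if o > 0)

print(round(chi2), round(mi, 2), ms, round(g2, 2))  # 96986707 26.53 1.0 38.78
```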
Although the example is hypothetical, the problem is not. It uncovers a
mathematical weakness of many commonly used association measures. From an
empirical perspective, this would not necessarily be a problem, if cases like that
in Table 9-8 were rare in linguistic corpora. However, they are not. The LOB
corpus, for example, contains almost one thousand such cases, including some

legitimate collocation candidates (like herbal brews, casus belli or sub-tropical climates), but mostly compositional combinations (ungraceful typography, turbaned headdress, songs-of-Britain medley), snippets of foreign languages (freie Blicke, l'arbre rouge, palomita blanca) and other things that are quite clearly not what we are looking for in collocation research. All of these would occur at the top of any collocate list created using statistics like chi-square, mutual information and minimum sensitivity. In large corpora, which are impossible to check for orthographical errors and/or errors introduced by tokenization, this list will also include hundreds of such errors (which are likely to be infrequent precisely because they are errors).

To sum up, when doing collocational research, we should use the best association measures available. For the time being, this is the p-value of Fisher's exact test (if we have the means to calculate it), or the log-likelihood test statistic G2 (if we don't, or if we prefer using a widely accepted association measure). We will use G2 through much of the remainder of this book whenever dealing with collocations or collocation-like phenomena.

9.2 Case studies

In the following, we will look at some typical examples of collocation research, i.e. cases where both variables consist of (some part of) the lexicon and the values are individual words.

9.2.1 Collocation for its own sake

Research that is concerned exclusively with the collocates of individual words or the extraction of all collocations from a corpus falls into three broad types. First, there is a large body of research on the explorative extraction of collocations from corpora. This research is not usually interested in any particular collocation (or set of collocations), or in genuinely linguistic research questions; instead, the focus is on methods (ways of preprocessing corpora, which association measures to use, etc.). Second, there is an equally large body of applied research that results in lexical resources (dictionaries, teaching materials, etc.) rather than scientific studies on specific research questions. Third, there is a much smaller body of research that simply investigates the collocates of individual words or small sets of words. The perspective of these studies tends to be descriptive, often with the aim of showing the usefulness of collocation research for some application area.
The (relative) absence of theoretically more ambitious studies of the collocates of individual words may partly be due to the fact that words tend to be too idiosyncratic in their behavior to make their study theoretically attractive. However, this idiosyncrasy itself is, of course, theoretically interesting, and so such studies hold an unrealized potential at least for areas like lexical semantics.

9.2.1.1 Case study: Degree adverbs. A typical example of a thorough descriptive study of the collocates of individual words is Kennedy (2003), which investigates the collocates of degree adverbs like very, considerably, absolutely, heavily and terribly. Noting that, while some of these adverbs appear to be relatively interchangeable with respect to the adjectives and verbs they modify, others are highly idiosyncratic, Kennedy identifies the adjectival and verbal collocates of 24 frequent degree adverbs in the BNC, extracting all words occurring in a span of two words to their left or right, and using mutual information to determine which of them are associated with each degree adverb.
Thus, Kennedy's study adopts an exploratory perspective (as is typical for this kind of study). It involves two nominal variables: DEGREE ADVERB (with 24 values corresponding to the 24 specific adverbs he selects) and VERB OR ADJECTIVE (with as many different potential values as there are different verbs and adjectives in the BNC). (In exploratory studies, it is often the case that we do not know the exact values of at least one of the two variables in advance.) Which of the two variables is the dependent one and which the independent is always difficult to say. Statistically, it does not matter, since our statistical tests for nominal data do not distinguish between dependent and independent variables. Conceptually, our interpretation may depend on which variable we treat as independent. Let us assume that when planning an utterance, we first choose a verb or adjective corresponding to our communicative intention, and then select the appropriate adverb. Thus, VERB OR ADJECTIVE is plausibly regarded as the independent variable determining the distribution of the dependent variable DEGREE ADVERB.


Kennedy finds, first, that there are some degree adverbs that do not appear to have restrictions concerning the verbs/adjectives they occur with (for example, very, really and particularly). However, most degree adverbs are clearly associated with semantically restricted sets of verbs and adjectives. The restrictions are of three broad types. First, there are connotational restrictions: some adverbs are associated primarily with positive words (e.g. perfectly) or negative words (e.g. utterly, totally; on connotation cf. also Section 9.2.4). Second, there are specific semantic restrictions (for example, incredibly, which is associated with subjective judgments), sometimes relating transparently to the meaning of the adverb (for example, badly, which is associated with words denoting damage, or clearly, which is associated with words denoting sensory perception). Finally, there are morphological restrictions: some adverbs are used frequently with words derived by particular suffixes, for example, perfectly, which is frequently found with words derived by -able/-ible, or totally, whose collocates often contain the prefix un-. Table 9-9 illustrates these findings for 5 of the 24 degree adverbs and their top 15 collocates (note that these findings are not precisely replicable using the version of the BNC available via the OTA).

Table 9-9. Selected degree adverbs and their collocates (Kennedy 2003)

INCREDIBLY          PERFECTLY              TOTALLY                  COMPLETELY            BADLY
sexy        4.70    contestable     7.56   unsuited          6.23   refitted       6.09   mauled       6.23
naïve       4.53    proportioned    6.75   unprepared        5.89   inelastic      5.85   sprained     5.85
handsome    3.75    manicured       6.12   illegible         5.65   outclassed     5.69   bruised      5.68
boring      3.73    spherical       5.84   unsuitable        5.61   redesigned     5.51   corroded     5.48
brave       3.67    groomed         5.68   impractical       5.60   refurbished    5.42   damaged      5.34
exciting    3.65    timed           5.40   uncharacteristic  5.58   overhauled     5.35   behaved      5.19
lucky       3.55    understandable  5.29   illogical         5.57   eradicated     5.33   decomposed   5.05
stupid      3.55    symmetrical     5.15   unacceptable      5.55   disorientated  5.27   unstuck      5.05
efficient   3.47    acceptable      5.07   unconnected       5.51   renovated      5.18   shaken       4.79
clever      3.42    complemented    4.97   devoid            5.49   mystified      5.16   wounded      4.63
complicated 3.34    intelligible    4.93   unintelligible    5.38   sequenced      5.08   injured      4.59
dangerous   3.21    feasible        4.88   symmetric         5.35   gutted         4.91   ventilated   4.58
thin        3.20    balanced        4.87   unfounded         5.34   revamped       4.86   charred      4.57
fast        3.12    legitimate      4.74   untrue            5.33   uninterested   4.76   mutilated    4.55
beautiful   2.99    elastic         4.71   unmoved           5.27   untrue         4.76   hurt         4.48
9.2.2 Lexical relations

One area of lexical semantics where collocation data is used quite intensively is the study of lexical relations, most notably (near) synonymy (Taylor 2001, cf. below), but also polysemy (e.g. Yarowsky 1993, investigating the idea that associations exist not between words but between particular senses of words) and antonymy (Justeson and Katz 1991, see below).

9.2.2.1 Case study: Near synonyms. Natural languages typically contain pairs (or larger sets) of words with very similar meanings, such as big and large, begin and start or high and tall. In isolation, it is often difficult to tell what the difference in meaning is, especially since they are often interchangeable at least in some contexts. Obviously, the distribution of such pairs or sets with respect to other words in a corpus can provide insights into their similarities and differences.
Table 9-10. Objects described as tall or high in the LOB corpus (Taylor 2001)

CATEGORY                TALL            HIGH            TOTAL
Humans                  Obs: 45         Obs: 2             47
                        Exp: 22.91      Exp: 24.09
                        χ2: 21.31       χ2: 20.26
Animals                 Obs: 0          Obs: 1              1
                        Exp: 0.49       Exp: 0.51
                        χ2: 0.49        χ2: 0.46
Plants, trees           Obs: 7          Obs: 3             10
                        Exp: 4.87       Exp: 5.13
                        χ2: 0.93        χ2: 0.88
Buildings               Obs: 3          Obs: 10            13
                        Exp: 6.34       Exp: 6.66
                        χ2: 1.76        χ2: 1.67
Walls, fences, etc.     Obs: 0          Obs: 5              5
                        Exp: 2.44       Exp: 2.56
                        χ2: 2.44        χ2: 2.32
Towers, statues,        Obs: 0          Obs: 7              7
pillars, sticks         Exp: 3.41       Exp: 3.59
                        χ2: 3.41        χ2: 3.24
Articles of clothing    Obs: 0          Obs: 7              7
                        Exp: 3.41       Exp: 3.59
                        χ2: 3.41        χ2: 3.24
Miscellaneous           Obs: 2          Obs: 13            15
artifacts               Exp: 7.31       Exp: 7.69
                        χ2: 3.86        χ2: 3.67
Topographical           Obs: 0          Obs: 5              5
features                Exp: 2.44       Exp: 2.56
                        χ2: 2.44        χ2: 2.32
Other natural           Obs: 0          Obs: 5              5
phenomena               Exp: 2.44       Exp: 2.56
                        χ2: 2.44        χ2: 2.32
Uncertain reference     Obs: 1          Obs: 3              4
                        Exp: 1.95       Exp: 2.05
                        χ2: 0.46        χ2: 0.44
Total                   58              61                119

One example of such a study is Taylor (2001), which investigates the synonym pair high and tall by identifying all instances of the two words in their subsense 'large vertical extent' in the LOB corpus and categorizing the words they modify into eleven semantic categories. These categories are based on semantic distinctions such as human vs. inanimate, buildings vs. other artifacts vs. natural entities etc., which are expected a priori to play a role.

The study, while not strictly hypothesis-testing, is thus somewhat deductive. It involves two nominal variables: the independent variable TYPE OF ENTITY with the eleven values shown in Table 9-10 above and the dependent variable VERTICAL EXTENT ADJECTIVE with the values HIGH and TALL (assuming that people first choose something to talk about and then choose the appropriate adjective to describe it). Table 9-10 shows Taylor's results (he reports absolute and relative frequencies, which I have used to calculate expected frequencies and chi-square components).
As we can see, there is little we can learn from this table, since the frequencies in the individual cells are simply too small to apply the chi-square test to the table as a whole. The only chi-square components that reach significance individually are those for the category HUMAN, which show that tall is preferred and high avoided with human referents. The sparse data in the table are due, first, to the fact that the sample is too small, but second, to the fact that there are too many categories. The category labels are not well chosen either: they overlap substantially in several places (e.g., towers and walls are buildings, pieces of clothing are artifacts, etc.) and not all of them seem relevant to any expectation we might have about the words high and tall.
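The HUMAN components can be checked with a few lines of Python (not from the original text; note that the grand total of Table 9-10 is 119, i.e. 58 + 61):

```python
# chi-square components for the HUMAN row of Table 9-10: (O - E)^2 / E per cell
n_tall, n_high, n_total = 58, 61, 119   # column and grand totals
obs_tall, obs_high, row_total = 45, 2, 47

exp_tall = row_total * n_tall / n_total
exp_high = row_total * n_high / n_total
comp_tall = (obs_tall - exp_tall) ** 2 / exp_tall
comp_high = (obs_high - exp_high) ** 2 / exp_high
print(round(comp_tall, 2), round(comp_high, 2))  # 21.31 20.26
```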
Taylor later cites earlier psycholinguistic research indicating that tall is used when the vertical dimension is prominent, is an acquired property and is a property of an individuated entity. It would thus have been better to categorize the corpus data according to these properties; in other words, a more strictly deductive approach would have been more promising given the small data set. Alternatively, Taylor could have taken an exploratory approach and looked for differential collocates as described in Section 9.1.1 above. This would have involved constructing a table like Table 9-11a for every noun occurring with the adjectives tall or high (in a larger corpus, such as the BNC) and then using an association measure to identify words more strongly attracted to one or the other.
association measure to identify words more strongly aracted to one or the other.
Table 9-11a. Tall and high as differential collocates for window (in the BNC)

                              POSITION 1
                    TALL               HIGH               Total
POSITION 2
  WINDOW            Obs: 34            Obs: 41                75
                    Exp: 3.12          Exp: 71.88
  OTHER             Obs: 1686          Obs: 39633          41319
  WORDS             Exp: 1716.88       Exp: 39602.12
  Total             1720               39674               41394

Once we have calculated the association strengths of all nouns in this way, we can sort the collocates first, according to which of the two words they occur with more frequently, and second, according to their association strength. Table 9-11b shows the top 15 differential collocates of the two words in the BNC.
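For a single noun, the association strength is just the G2 statistic applied to its 2-by-2 table. A small Python sketch (helper name mine, not from the original text) reproduces the value for window from Table 9-11a:

```python
from math import log

def g2(o11, o12, o21, o22):
    # log-likelihood G2 for a 2-by-2 table; cells with O = 0 contribute nothing
    n = o11 + o12 + o21 + o22
    cells = [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]
    return 2 * sum(o * log(o * n / (r * c)) for o, r, c in cells if o > 0)

# Table 9-11a: window with tall (34) and with high (41)
print(round(g2(34, 41, 1686, 39633), 2))  # 117.04
```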

Table 9-11b. Differential collocates for tall and high in the BNC

COLLOCATE     CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS        G2
              WITH TALL       WITH HIGH       WITH TALL     WITH HIGH

MOST STRONGLY ASSOCIATED WITH TALL
man           182             3               1538          39671         1146.54
building      82              26              1638          39648          408.35
tree          73              26              1647          39648          355.52
boy           40              0               1680          39674          255.36
glass         39              2               1681          39672          233.14
woman         38              3               1682          39671          221.34
ship          33              0               1687          39674          210.54
girl          32              0               1688          39674          204.15
figure        62              93              1658          39581          195.58
chimney       28              8               1692          39666          141.09
order         62              176             1658          39498          138.01
grass         24              5               1696          39669          126.76
tale          20              1               1700          39673          119.50
window        34              41              1686          39633          117.04
story         18              0               1702          39674          114.69

MOST STRONGLY ASSOCIATED WITH HIGH
level         0               2741            1720          36933          240.90
education     0               2499            1720          37175          218.94
court         0               1863            1720          37811          161.88
quality       0               1079            1720          38595           92.83
standard      1               1163            1719          38511           90.35
rate          0               922             1720          38752           79.16
proportion    0               875             1720          38799           75.08
street        1               810             1719          38864           60.38
school        0               676             1720          38998           57.86
price         0               642             1720          39032           54.93
degree        0               638             1720          39036           54.58
speed         0               547             1720          39127           46.75
interest      0               493             1720          39181           42.10
risk          0               431             1720          39243           36.78

Note: The corpus was queried for the lemmas TALL and HIGH followed by a word tagged as a NOUN (including ambiguous tags). The nouns were lemmatized before counting.

The results for tall clearly support Taylor's ideas about the salience of the vertical dimension. The results for high show something Taylor could not have found, since he restricted his analysis to the subsense 'vertical dimension': when compared with tall, high is most strongly associated with quantities or positions in hierarchies and rankings. There are no spatial uses at all among its top differential collocates. This does not answer the question why we can use it spatially and in competition with tall, but it shows what general sense we would have to assume: one concerned not with the vertical extent as such, but with the magnitude of that extent (which, incidentally, Taylor notes in his conclusion).

This case study shows how the same question can be approached by a deductive or an inductive (exploratory) approach. The deductive approach can be more precise, but this depends on the appropriateness of the categories chosen a priori for coding the data; it is also time-consuming and therefore limited to relatively small data sets. In contrast, the inductive approach can be applied to a large data set because it requires no a priori coding. It also does not require any choices concerning coding categories; however, there may be a danger of projecting patterns into the data post hoc.
9.2.2.2 Case study: Antonymy. At first glance, we expect the relationship between antonyms to be a paradigmatic one, where only one or the other will occur in a given utterance. However, Charles and Miller (1989) suggest, based on the results of sorting tasks and on theoretical considerations, that, on the contrary, antonym pairs are frequently found in syntagmatic relationships, occurring together in the same clause or sentence. A number of corpus-linguistic studies have shown this to be the case (e.g. Justeson and Katz 1991, 1992, Fellbaum 1995; cf. also Gries and Otani for a study identifying antonym pairs based on their similarity in lexico-syntactic behavior).
ere are dierences in detail in these studies, but broadly speaking, they take
a deductive approach: ey choose a set of test words for which there is
agreement as to what their antonyms are, search for these words in a corpus, and
check whether their antonyms occur in the same sentence signicantly more
frequently than expected. e studies thus involve two nominal variables:
SENTENCE (with the values CONTAINS TEST WORD and DOES NOT CONTAIN TEST WORD) and
ANTONYM OF TEST WORD (with the values OCCURS IN SENTENCE and DOES NOT OCCUR IN
SENTENCE). is seems like an unnecessarily complicated way of representing the
type of co-occurrence design used in the examples above, but I have chosen it to

219
9. Collocation

show that in this case sentences containing a particular word are used as the condition under which the occurrence of another word is investigated: a straightforward application of the general research design that defines quantitative corpus linguistics. Table 9-12 demonstrates the design using the adjectives good and bad.

Table 9-12. Sentential co-occurrence of good and bad in the BROWN corpus (based on Justeson and Katz 1991: 5).

                                         GOOD
                           OCCURS           DOES NOT OCCUR      Total
SENTENCE    WITH BAD       Obs:  16         Obs:    109            125
                           Exp:   1.56      Exp:    123.44
            WITHOUT BAD    Obs: 666         Obs: 53,926         54,592
                           Exp: 680.44      Exp: 53,911.56
Total                           682             54,035          54,717
Good occurs significantly more frequently in sentences also containing bad than in sentences not containing bad (χ² = 135.87, df = 1, p < 0.001). Justeson and Katz (1991) apply this procedure to 36 adjectives and get significant results for 25 of them (19 of which remain significant after a Bonferroni correction for multiple tests). They also report that in a larger corpus, the frequency of co-occurrence for all adjective pairs is significantly higher than expected (but do not give any figures). Fellbaum (1995) uses a very similar procedure with words from other word classes, with very similar results.
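The chi-square test used in this design can be recomputed directly from the four observed frequencies of a contingency table like Table 9-12. The following sketch is mine (the book does not use Python); it derives the expected frequencies from the marginal totals and sums the usual (O − E)²/E terms:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)
    chisq = 0.0
    for obs, r, c in ((o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1)):
        exp = row_totals[r] * col_totals[c] / n  # expected frequency under independence
        chisq += (obs - exp) ** 2 / exp
    return chisq

# Table 9-12: sentential co-occurrence of good and bad in BROWN
print(round(chi_square_2x2(16, 109, 666, 53926), 2))  # prints 135.87
```

The same function applies to any of the 2-by-2 designs in this chapter; the degrees of freedom for such a table are always 1.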


These studies only look at the co-occurrence of antonyms; they do not apply the same method to word pairs related by other lexical relations (synonymy, taxonomy, etc.). Thus, there is no way of telling whether co-occurrence within the same sentence is something that is typical specifically of antonyms, or whether it is something that characterizes word pairs in other lexical relations, too.
An obvious approach to testing this would be to repeat the study with other types of lexical relations. Alternatively, we can take an exploratory approach that does not start out from specific word pairs at all. Justeson and Katz (1991) investigate the specific grammatical contexts in which antonyms tend to co-occur, identifying, among others, coordination of the type [ADJ and ADJ] or [ADJ or ADJ]. We can use these specific contexts to determine the role of co-occurrence for different types of lexical relations by simply extracting all word pairs


occurring in the adjective slots of these patterns, calculating their association strength within this pattern as shown in Table 9-13a for the adjectives good and bad in the BNC, and then categorizing the most strongly associated collocates in terms of the lexical relationships between them.

Table 9-13a. Co-occurrence of good and bad in the first and second slot of [ADJ and/or ADJ] sequences.

                                    SECOND ADJECTIVE
                           BAD              OTHER WORDS       Total
FIRST ADJ.    GOOD         Obs:  304        Obs:    542          846
                           Exp:   17.77     Exp:    828.23
              OTHER        Obs:   44        Obs: 15,674       15,718
              WORDS        Exp:  330.23     Exp: 15,387.77
Total                           348             16,216        16,564

Note that this is a slightly different procedure from what we have seen before: instead of comparing the frequency of co-occurrence of two words with their individual occurrence in the rest of the corpus, we are comparing it to their individual occurrence in a given position of a given structure, in this case [ADJ and/or ADJ] (Stefanowitsch and Gries (2005) call this type of design 'covarying collexeme analysis').
Table 9-13b shows the thirty most strongly associated adjective pairs coordinated with and or or in the BNC.
Clearly, antonymy is the dominant relation among these word pairs, which are mostly opposites (black/white, male/female, public/private, etc.), and sometimes relational antonyms (primary/secondary, economic/social, economic/political, social/political, lesbian/gay, etc.). The only cases of non-antonymic pairs are economic/monetary, which is more like a synonym than an antonym, and the fixed expressions deaf/dumb and hon(ourable)/learned (as in honorable and learned gentleman/member/friend). The pattern does not just hold for the top 30 collocates but continues as we go down the list. There are additional cases of relational antonyms, like British/American and Czech/Slovak, and additional examples of fixed expressions (alive and well, far and wide, true and fair, null and void, noble and learned), but most cases are clear antonyms (for example, syntactic/semantic, spoken/written, mental/physical, right/left, rich/poor, young/old, good/evil, etc.). The one systematic exception is cases like worse and worse (a special construction

with comparatives indicating incremental change; cf. Stefanowitsch 2007).

Table 9-13b. Co-occurrence of adjectives in the first and second slot of [ADJ and/or ADJ] sequences (BNC).

WORD A      WORD B            A&B    A&!B   !A&B    !A&!B        G²
black       white            1015     608    722    154797   7759.74
male        female            545      27     31    157479   6809.46
economic    social           1065    1326   1412    153239   6112.06
public      private           448     171    186    157471   4655.26
social      economic          779    1848    928    154059   4301.73
primary     secondary         295      69     34    158184   3726.64
deaf        dumb              276      46      8    158290   3721.86
good        bad               304     542     44    157674   3042.80
internal    external          223      41     27    158435   2975.78
positive    negative          258     184     57    158157   2931.04
hon.        learned           232      92    119    158265   2656.65
lesbian     gay               185       8     26    158583   2645.03
greater     lesser            189      50     14    158541   2576.04
political   economic          492    1323   1215    155158   2512.32
left        right             183      41     41    158541   2415.68
physical    mental            245     565     66    157806   2347.59
social      political         528    2099   1230    154259   2323.53
national    international     280     529    264    157539   2314.06
upper       lower             169      32    151    158482   2032.92
hot         cold              224     328    267    157905   1966.72
old         new               225     508    190    157799   1926.01
top         bottom            131      29      6    158744   1921.61
formal      informal          168     140     75    158453   1913.33
large       small             238     722    195    157541   1901.35
direct      indirect          157     195     22    158484   1869.35
local       national          215     396    228    157903   1864.40
economic    monetary          266    2125     96    156153   1845.83
high        low               186     316    114    158184   1822.80
economic    political         436    1955   1322    154587   1803.38
past        present           139      83     31    158641   1780.50

Note: The corpus was queried for a word tagged as an adjective (including comparative and superlative forms), followed by the string and or or (case insensitive), followed by another word tagged as an adjective (again including comparative and superlative forms).

This case study shows how deductive and inductive approaches may complement each other: while the deductive studies cited show that antonyms tend to co-occur syntagmatically, the inductive study presented here shows that words that co-occur syntagmatically (at least in certain syntactic contexts) tend to be antonyms. These two findings are not equivalent; the second finding shows that the first finding may indeed be typical for antonymy as opposed to other lexical relations. Of course, the exploratory study was limited to a particular syntactic context, which leaves open the question whether there are other types of contexts that would allow the identification of word pairs connected by other types of lexical relations.

9.2.3 Grammaticalization
While lexical semantics (including lexicography) is one of the most typical areas of study in collocation research, it is not the only one. A very different, but also highly interesting area is language change. Language usage and the (conditional) frequency of linguistic items is clearly relevant, for example, to phenomena like grammaticalization and lexicalization (corpus linguists who are interested in this area of research might start by looking at the contributions in Bybee and Hopper 2001, Lindquist and Mair 2004 and Bybee 2007).

9.2.3.1 Case study: Cliticization. An interesting example of collocation-based grammaticalization research is the cliticization of the negative particle not (Krug 2003). The particle typically follows modals and is realized either as an independent word (e.g. do not) or as a clitic (don't). In spoken language, realization as a clitic is usually preferred, but there are differences between modals with respect to the magnitude of this preference. For example, more than 99 percent of all occurrences of the sequence [do NEG] are realized as don't in the spoken demographic part of the BNC, but only 92.6 percent of all occurrences of [are NEG] are realized as aren't. For some modals, the clitic is not the preferred form at all: for example, [ought NEG] is only realized as oughtn't in 40 percent of all cases, and [may NEG] is realized as mayn't in a mere 1.5 percent.
Krug (2003) tests the hypothesis that this is related to the overall co-occurrence of the respective form of a modal with a negative particle, such that the more frequently a modal is negated overall, the more likely it is to be negated by a clitic rather than the full form of the negative particle. More specifically, he is interested in whether it is the raw frequency with which a modal is negated that allows us to predict the percentage of cliticized cases, or whether it is the probability with which a given modal is negated.


This is a deductive study that attempts to test a hypothesis (that frequency in some form is relevant at all) and then attempts to distinguish between two specific versions of the hypothesis. It has two variables, both of which are cardinal data (a case that we did not discuss in this book): the variable DEGREE OF CLITICIZATION (with values ranging from 0 percent to 100 percent) and the variable FREQUENCY/PROBABILITY OF NEGATION (with values ranging from 0 to total corpus size for frequency, and 0 to 100 percent for probability). An obvious third version of the hypothesis would be that the STRENGTH OF COLLOCATION (for example, measured in terms of G²) predicts the percentage of cliticized cases. We will look at all three variants.
The necessary numbers for all three versions of the hypothesis can be derived from a contingency table like that in Table 9-14 (where NEGATIVE combines all cases of not and n't).
Table 9-14. Collocation of the modal word form does and the negative morpheme.

                                       POSITION 2
                           NEGATIVE            OTHER WORDS          Total
POSITION 1    DOES         Obs:   3,502        Obs:     4,299           7,801
                           Exp:     159.27     Exp:     7,641.73
              OTHER        Obs:  98,880        Obs: 4,907,974       5,006,854
              WORDS        Exp: 102,222.7      Exp: 4,904,631.3
Total                           102,382             4,912,273       5,014,655
The frequency of does+NEG is easy to determine: it is 3,502; we could also express it as a relative frequency, if we wanted to compare data from corpora with different total sizes, in which case it would be 3,502/5,014,655 = 0.000698, but as long as we are taking all our data from the same corpus, it does not make a difference. The probability that does occurs as a negated form is simply the percentage of negated cases of does out of all cases of does, i.e. 3,502/7,801 = 0.4489. Finally, the G² statistic for this table is 16,812.67 (calculated according to the formula in (9.2b) above).
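All three quantities can be checked computationally from the four cell frequencies of Table 9-14. The sketch below is mine (the book does not use Python); G² is implemented as 2 Σ O·ln(O/E), which reproduces the value just given:

```python
import math

def g2_2x2(o11, o12, o21, o22):
    """Log-likelihood statistic G2 = 2 * sum(O * ln(O / E)) for a 2x2 table."""
    n = o11 + o12 + o21 + o22
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)
    g = 0.0
    for obs, r, c in ((o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1)):
        exp = row_totals[r] * col_totals[c] / n
        if obs > 0:  # a zero cell contributes nothing to the sum
            g += obs * math.log(obs / exp)
    return 2 * g

# Table 9-14: does followed by the negative morpheme vs. all other bigrams
o11, o12, o21, o22 = 3502, 4299, 98880, 4907974
n = o11 + o12 + o21 + o22

rel_frequency = o11 / n           # relative frequency: 3502/5014655 = 0.000698
probability = o11 / (o11 + o12)   # probability of negation: 3502/7801 = 0.4489
association = g2_2x2(o11, o12, o21, o22)  # association strength: roughly 16,812.7
```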
Once we have derived these three measures for each modal, we have to determine, again for each modal, the percentage of negated cases where the negative morpheme occurs as a clitic. Table 9-15 provides the relevant information for all [MODAL NEG] combinations in the demographic part of the BNC (note that Krug (2003) is based on the entire spoken part of the BNC, about

10 million words, so his numbers differ from the ones reported here).

Table 9-15. Modals, their likelihood of negation and degree of cliticization of the negative morpheme.

                                                    CLITICIZATION    FREQ.    PROBABILITY     LOG-LIKELIHOOD
MODAL    WITH     WITHOUT    OTHER     OTHER        RANK   PERCENT    RANK    RANK   PERCENT   RANK       G²
         NEG      NEG        NEG       BIGRAMS
do       26639    27694       75743    4884579        1    0.9943       1       1    0.4903      1   140359.02
did       8879    11333       93503    4900940        2    0.9923       3       2    0.4393      2    42587.99
shall      117     1554      102265    4910719        3    0.9915      17      18    0.0700     18      126.95
have      4809    29230       97573    4883043        4    0.9894      13       8    0.1413      9    11078.96
has       1186     3437      101196    4908836        5    0.9890       9      12    0.2565     10     4119.39
does      3502     4299       98880    4907974        6    0.9889       2       5    0.4489      5    16812.67
should     795     3601      101587    4908672        7    0.9887      10      14    0.1808     14     2185.13
could     2314     5761      100068    4906512        8    0.9857       8       9    0.2866      7     8618.97
will      3953     5487       98429    4906786        9    0.9836       4       4    0.4188      4    18298.59
can       8234    15494       94148    4896779       10    0.9832       5       3    0.3470      3    34703.48
need        58     2981      102324    4909292       11    0.9828      22      23    0.0191     23        0.28
had        700    12258      101682    4900015       12    0.9814      20      15    0.0540     16      508.35
were      1763     8816      100619    4903457       13    0.9722      11      10    0.1667     11     4576.41
was       2734    32039       99648    4880234       14    0.9674      16      13    0.0786     13     3488.71
would     3359     6635       99023    4905638       15    0.9589       6       6    0.3361      6    13756.03
is        5049    30128       97333    4882145       16    0.9487      12       7    0.1435      8    11791.24
dare        62      125      102320    4912148       17    0.9355       7      16    0.3316     15      250.13
must       165     2851      102217    4909422       18    0.9333      19      19    0.0547     19      122.19
are       2136    14375      100246    4897898       19    0.9265      14      11    0.1294     12     4530.16
ought        5      449      102377    4911824       20    0.4000      23      22    0.0110     22        2.41
might      218     3387      102164    4908886       21    0.2018      18      17    0.0605     17      190.82
may         64      742      102318    4911531       22    0.0156      15      20    0.0794     20       81.71
am          56     1152      102326    4911121       23    0.0000      21      21    0.0464     21       30.02
In his study, Krug concludes that raw frequency is the best predictor of cliticization, but he also notes that frequency and probability do not differ by much. For the subset of the data reported in Table 9-15, we can say that all three measures of collocation correlate with the degree of cliticization significantly, with G² performing best (r = 0.61, p < 0.01), followed closely by probability (r = 0.59, p < 0.01) and frequency (r = 0.57, p < 0.01) (calculated using Spearman's rank correlation).
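Two of these correlations can be recomputed from the rank columns of Table 9-15. Since none of the rankings contains ties, the simple formula ρ = 1 − 6Σd²/(n(n² − 1)) applies; the sketch below is mine, with the cliticization, probability and log-likelihood rank columns typed in from the table:

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank correlation for two untied rankings of equal length."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Rank columns from Table 9-15 (modals in table order, i.e. by degree of cliticization)
cliticization_rank = list(range(1, 24))
probability_rank = [1, 2, 18, 8, 12, 5, 14, 9, 4, 3, 23, 15, 10, 13, 6, 7, 16, 19, 11, 22, 17, 20, 21]
g2_rank = [1, 2, 18, 9, 10, 5, 14, 7, 4, 3, 23, 16, 11, 13, 6, 8, 15, 19, 12, 22, 17, 20, 21]

print(round(spearman_rho(cliticization_rank, g2_rank), 2))           # prints 0.61
print(round(spearman_rho(cliticization_rank, probability_rank), 2))  # prints 0.59
```

For rankings with ties, this shortcut formula is no longer exact; a Pearson correlation over (tie-corrected) ranks would be needed instead.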
This case study shows that collocation (more specifically, association strength) may itself be used as a variable, rather than serving to show the relationship between the variables in a design. There are many phonological and/or morphological reduction phenomena that have been studied in similar ways, and many others that have not, so this remains an interesting area of research (see, for example, Bybee and Scheibman 1999 on the reduction of don't, Bush 2001 on the palatalization of word boundaries and Lorenz 2012 on the contraction of semi-modals).

9.2.4 Semantic prosody
Sometimes, the collocates of a node word (or larger expressions) fall into a more or less clearly recognizable semantic class that is difficult to characterize in terms of denotational properties of the node word. Louw (1993: 157) refers to this phenomenon as semantic prosody, defined, somewhat impressionistically, as the 'consistent aura of meaning with which a form is imbued by its collocates'.
This definition has been understood by collocation researchers in two different (but related) ways. Much of the subsequent research on semantic prosody is based on the understanding that this 'aura' consists of connotational meaning (cf. e.g. Partington 1998: 68), so that words can have a positive, neutral or negative semantic prosody. However, Sinclair, who according to Louw invented the term,30 seems to have in mind attitudinal or pragmatic meanings that are much more specific than 'positive, neutral or negative'. There are insightful terminological discussions concerning this issue (cf. e.g. Hunston 2007), but since the term is widely used in (at least) these two different ways, and since positive and negative connotations are very general types of attitudinal meaning, it seems more realistic to accept a certain vagueness of the term. If necessary, we could differentiate between the general semantic prosody of a word (its positive or negative connotation as reflected in its collocates) and its specific semantic prosody (the word-specific attitudinal meaning reflected in its collocates).
9.2.4.1 Case study: True feelings. A good example of Sinclair's approach to semantic prosody, both methodologically and theoretically, is his short case study of the expression true feelings. Sinclair presents a selection of concordance lines from the COBUILD corpus (Table 9-16 shows a sample from the BNC instead, but Sinclair's findings are well replicated by this sample). He then discusses a number of observations about the use of the phrase true feelings, quantifying them informally.

30 Louw attributes the term to John Sinclair, but Louw (1993) is the earliest appearance of the term in writing. However, Sinclair is clearly the first to discuss the phenomenon itself systematically, without giving it a label (e.g. Sinclair 1991: 74-75).

Table 9-16. Concordance of true feelings (BNC)


1 tragedy . He only reveals his [true feelings] when he throws his flower int
2 , and the fight to conceal his [true feelings] was well and truly lost . The
3 ou 're absolutely sure of your [true feelings] . I had a similar experience
4 hat many people took to be the [true feelings] of Democratic Unionists , oft
5 ell not reflect my employer 's [true feelings] on the matter , but once havi
6 rewriting the history of their [true feelings] , and it seems clear , at lea
7 action : acting only from our [true feelings] , not governed by the distort
8 ession is a way of hiding your [true feelings] so that they remain an intern
9 County did not disguise their [true feelings] or their forthright acknowled
10 conceal even from herself her [true feelings] the quickening of her pul
11 the problem of reading the [true feelings] of the individual can be made

12 I only taught him to keep his [true feelings] from me . But he remained
13 othing stand in the way of his [true feelings] . He felt humiliated apar
14 between false surface and his [true feelings] , and in its affected , force
15 be reluctant to offer their [true feelings] or observations . The Bir
16 d in the art of disguising his [true feelings] . Let him not be frightened o
17 n something of these mothers ' [true feelings] from the women who suffered t
18 o whom he could reveal his new [true feelings] . As for his own talk tha
19 that she had been denying her [true feelings] for years . She had loved Con
20 controlled voice disguised his [true feelings] , but Christina sensed his je
21 helpful to show each side the [true feelings] of the other , the need to ac
22 ys felt quite at odds with her [true feelings] . In fact , there were lots o
23 that he had first realised his [true feelings] for her . Then that he had fi
24 nd , but you like to hide your [true feelings] . Oh , do n't be so s
25 as n't actually dealt with the [true feelings] that he had towards his fathe
26 away so little . What were his [true feelings] , his intentions ? All I
27 lf the opportunity to learn my [true feelings] , although He fell si


28 nts will often not admit their [true feelings] about the child and the incid
29 ong at last gave vent to their [true feelings] . The match had been billed i
30 enticity , emotional depth and [true feelings] . Postmodern sociology , as o
31 lly forced her to confront her [true feelings] for Arnie . Or rather , her l
32 th hands , and told him of her [true feelings] , they might have had a chanc
33 cts full rein and indulged her [true feelings] at this stage might n't she b
34 which she hoped disguised her [true feelings] , and carried on inking in co
35 he humiliation of exposing her [true feelings] . That 's not how it is
36 sly , turning blindly from her [true feelings] to hate him more . What m
Sinclair notes three things: first, the phrase is almost always part of a possessive (realized by a pronoun, a possessive noun phrase or an of-construction). This is also true of our sample, with the exception of lines 25 (where there is a possessive relation, but it is realized by the verb have) and 30 (where the chosen span does not contain a possessive). Second, the expression collocates with verbs of expression (perhaps unsurprising for an expression relating to emotions); this, too, is true for our sample. Third, and crucially in our context, the majority of the examples express a reluctance to express emotions: in many cases this reluctance is communicated as part of the verb (the most frequent verbs are disguise, conceal and hide), in other cases it is communicated separately (e.g. line 15, reluctant to offer; line 29, at last gave vent (suggesting a period of holding back); line 35, humiliation of exposing). Sinclair ascribes the denotational meaning 'genuine emotions' to the phrase; the sense of reluctance to express them is the 'semantic prosody' of the phrase (Sinclair 2004: 36), an attitudinal meaning much more specific than a general positive or negative connotation.
Sinclair's methodological approach is quantitative only in a very informal sense: he rarely reports exact frequencies for a given semantic feature in his sample. It is an interesting question how to quantify his approach more strictly. Of course, we could count and report the frequency of a given prosody with a lexical item or phrase, but since we do not know (and have no plausible way of determining) how frequent this prosody is in the corpus in general, we cannot tell whether that frequency is higher or lower than expected.
One way around this problem could be to use denotationally synonymous phrases for comparison, phrases like genuine emotion(s) (which Sinclair uses to paraphrase the denotational meaning), genuine feeling(s), real emotion(s) and real feeling(s). In a sample of these phrases drawn from the BNC in the same way as that in Table 9-16 (see Appendix D.2), only 8 out of 65 examples had a context of reluctance or inability to express genuine emotions; in Table 9-16, I would argue that 19 of the 36 examples do so (namely those in lines 1, 2, 8, 9, 10, 11, 12, 14, 15, 16, 19, 20, 22, 24, 28, 29, 34, 35, and 36). We can now compare the phrase true feelings to the other expressions in the usual way (cf. Table 9-17).

Table 9-17. The meaning 'reluctance to express' with the phrase true feelings and its synonyms.

                              TRUE FEELINGS     SYNONYMS OF        Total
                                                TRUE FEELINGS
RELUCTANCE      YES           Obs: 19           Obs:  8               27
                              Exp:  9.92        Exp: 17.08
                NO            Obs: 17           Obs: 53               70
                              Exp: 26.08        Exp: 44.92
Total                              36                65               97

The difference between true feelings and its synonyms is highly significant (χ² = 18.14, df = 1, p < 0.001), suggesting that 'reluctance to express emotion' is indeed a specific semantic property of the phrase true feelings.

9.2.4.2 Case study: Cause. A second way in which semantic prosody can be studied quantitatively is implicit in Kennedy's study of collocates of degree adverbs discussed in Section 9.2.1 above. Recall that Kennedy discusses for each degree adverb whether a majority of its collocates has a positive or a negative connotation. This, of course, is a statement about the (broad) semantic prosody of the respective adverb, based not on an inspection and categorization of usage contexts, but on inductively discovered strongly associated collocates.
This method was first proposed by Stubbs (1995a), who studies, among other things, the noun and verb cause. He first presents the result of a manual extraction of all nouns (sometimes with adjectives qualifying them, as in the case of wholesale slaughter) that occur as subject or object of the verb cause or as a prepositional object of the noun cause in the LOB. He codes them in their context of occurrence for their connotation, finding that approximately 80 percent are negative, 18 percent are neutral and 2 percent are positive. This procedure is still very close to the one suggested in the preceding case study.
He then notes that this manual extraction becomes unfeasible as the number of corpus hits grows and discusses a procedure for identifying collocates that involves a combination of association measures.31 As his focus is on methodological issues, Stubbs does not present the results of this procedure exhaustively, and the corpus he uses is not available, so I have replicated his procedure using the BNC. The results are shown in Table 9-18 (following Stubbs, I have limited it to nouns).
have limited it to nouns).
R

31 Stubbs uses a combination of mutual information (cf. Chapter 10) and a statistic called t, first proposed by Church and Hanks (1989) as an association measure. The latter is not discussed in this book since it is not widely used and does not offer any advantages over the more widely-used association measures. I have used it here in the following form, suggested by Stubbs (1995a):

(i)  t = (O11 − (R1 × C1 / N)) / √O11

Stubbs' procedure consists in calculating MI and t for all words occurring within a span of three words to the right or left of the node word, and then discarding all words that only occur once (his way of getting around the fact that MI is overly sensitive to rare events), have MI values of less than 3 and t-values of less than 2. I have replicated the method here as a methodological exercise; for actual collocation research, I recommend, of course, using a better association measure in the first place.
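Formula (i) and the filtering thresholds can be sketched as follows. The implementation and the illustrative frequencies are mine, not Stubbs'; MI is computed as log₂(O11/E11) with the expected frequency E11 = R1C1/N:

```python
import math

def mi_and_t(o11, r1, c1, n):
    """Mutual information and Church & Hanks' t-score for a node/collocate pair.
    o11: co-occurrence frequency, r1: collocate frequency,
    c1: node frequency, n: corpus size."""
    e11 = r1 * c1 / n                 # expected co-occurrence frequency
    mi = math.log2(o11 / e11)
    t = (o11 - e11) / math.sqrt(o11)  # formula (i)
    return mi, t

def passes_stubbs_filter(o11, r1, c1, n):
    """Keep a collocate only if it occurs more than once, MI >= 3 and t >= 2."""
    mi, t = mi_and_t(o11, r1, c1, n)
    return o11 > 1 and mi >= 3 and t >= 2

# Invented example: a collocate occurring 30 times with a node word
mi, t = mi_and_t(30, 100, 200, 10000)
print(round(mi, 2), round(t, 2))                  # prints 3.91 5.11
print(passes_stubbs_filter(30, 100, 200, 10000))  # prints True
```

Note that published MI scores can differ depending on how the expected frequency is scaled for the span size, so values computed this way need not match any particular published table exactly.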


Table 9-18. Noun collocates in a six-word span around cause (BNC).

COLLOCATE          A&B     A&!B    !A&B      !A&!B      MI       t
damage            1101     7113    28722    96226463   6.17   12.12
problems          1016    25883    28807    96207693   4.35    6.14
death              712    18889    29111    96214687   4.29    5.04
concern            475     9645    29348    96223931   4.66    4.69
trouble            326     8375    29497    96225201   4.33    3.47
harm               299     2555    29524    96231021   5.82    5.58
injury             255     4301    29568    96229275   4.91    3.76
loss               233    11122    29590    96222454   3.47    2.15
disease            220     8579    29603    96224997   3.75    2.32
pain               192     6793    29631    96226783   3.89    2.27
difficulties       188     6530    29635    96227046   3.91    2.27
confusion          161     2610    29662    96230966   4.97    3.04
distress           158     1282    29665    96232294   5.88    4.15
pollution          149     3907    29674    96229669   4.31    2.32
root               128     2016    29695    96231560   5.01    2.75
deaths             127     2232    29696    96231344   4.86    2.60
delay              117     3010    29706    96230566   4.33    2.07
disruption         115      754    29708    96232822   6.15    3.89
injuries           108     2399    29715    96231177   4.54    2.14
alarm              104     2085    29719    96231491   4.68    2.21
anxiety            104     2463    29719    96231113   4.45    2.04
accidents          103     1845    29720    96231731   4.83    2.32
chaos               95     1464    29728    96232112   5.04    2.39
delays              87      987    29736    96232589   5.45    2.64
embarrassment       81     1160    29742    96232416   5.14    2.29
havoc               75      256    29748    96233320   6.93    4.12
inconvenience       63      358    29760    96233218   6.33    3.06
friction            63      415    29760    96233161   6.15    2.87
anaemia             42      309    29781    96233267   6.01    2.24
consternation       39      150    29784    96233426   6.80    2.83
devastation         38      225    29785    96233351   6.28    2.34
uproar              35      189    29788    96233387   6.39    2.33
furore              29      129    29794    96233447   6.63    2.30

It is clear that all of the nominal collocates (which are very similar to the ones Stubbs cites from his results), with the exception of root (from the compound root cause), have negative connotations. Incidentally, this is also true of most adjectival collocates identified by Stubbs' procedure in the BNC, which are serious, severe, bodily, underlying, grievous, reckless and proximate.
While Stubbs' (1995a) study, like many collocational studies, looks at all words occurring in a given span around a particular word, Stefanowitsch and Gries (2003) present data for nominal collocates of the verb cause in the object position of different subcategorization patterns. While their results confirm the negative connotation of cause also found by Stubbs, they add an interesting dimension: while objects of cause in the transitive construction (cause a problem) and the prepositional dative (cause a problem to someone) refer to negatively perceived external and objective states, the objects of cause in the ditransitive refer to negatively experienced internal and/or subjective states (see Table 9-19).

Table 9-19. Nominal collocates as objects of the verb cause in different constructions (from Stefanowitsch and Gries 2003: 222, data from the ICE-GB).

TRANSITIVE                 PREPOSITIONAL DATIVE          DITRANSITIVE
COLLOCATE       P          COLLOCATE        P            COLLOCATE        P
problem      3.30E-18      harm             4.37E-10     distress      4.54E-04
damage       2.52E-10      damage           5.47E-05     hardship      4.54E-04
havoc        8.74E-09      modification     6.56E-04     discomfort    5.19E-04
cancer       4.39E-07      inconvenience    8.43E-04     inconvenience 5.84E-04
injury       7.12E-07      famine           9.37E-04     problem       8.57E-04
injustice    9.84E-07      delight          1.59E-03     pain          3.24E-03
stampede     5.08E-06      problem          1.83E-03     difficulty    7.83E-03
congestion   1.01E-05      disruption       2.06E-03     night up      1.89E-02
extrusion    1.01E-05      accident         1.66E-02
change       1.43E-05

The strategy of identifying collocates first and coding them for connotational properties afterwards can be very effective, but note that it limits the perspective to what I have called here broad semantic prosody: while it is possible, in many cases, to ascribe a positive or negative connotation to words isolated from their context of usage, this is not possible for the more complex, nuanced attitudinal meanings relevant for the narrow semantic prosody of words.

9.2.5 Cultural analysis
In collocation research, a word (or other element of linguistic structure) typically stands for itself; the aim of the researcher is to uncover the linguistic properties of a word (or set of words). However, texts are not just manifestations of a language system, but also of the cultural conditions under which they were produced. This allows corpus-linguistic methods to be used in uncovering at least some properties of that culture. Specifically, we can take lexical items to represent culturally defined concepts and investigate their distribution in linguistic corpora in order to uncover these cultural definitions. Of course, this adds complexity to the question of operationalization: we must ensure that the words we choose are indeed valid representatives of the cultural concept in question.

9.2.5.1 Case study: Small boys, little girls. Obviously, lexical items used conventionally to refer to some culturally relevant group of people are plausible representatives of the cultural concept of that group. For example, some very general lexical items referring to people (or higher animals) exist in male and female versions: man/woman, boy/girl, lad/lass, husband/wife, father/mother, king/queen, etc. If such word pairs differ in their collocates, this could tell us something about the cultural concepts behind them. For example, Stubbs (1995b) cites a finding by Baker and Freebody (1989) that in children's literature, the word girl collocates with little much more strongly than the word boy, and vice versa for small. Stubbs shows that this is also true for balanced corpora (see Table 9-20; again, since Stubbs' corpora are not available, I show frequencies from the BNC instead, but the proportions are within a few percentage points of his).
Table 9-20. Small and little girls and boys (BNC).

                                    POSITION 2
                           BOY              GIRL              Total
POSITION 1    LITTLE       Obs: 791         Obs: 1,148        1,939
                           Exp: 927.53      Exp: 1,011.47
              SMALL        Obs: 336         Obs:    81          417
                           Exp: 199.47      Exp:   217.53
Total                          1,127            1,229         2,356

Note: The query was for adjective lemmas that directly precede the respective noun lemma (χ² = 217.66, df = 1, p < 0.001).

This part of Stubbs' study is clearly deductive: he starts with a hypothesis taken from the literature and tests it against a larger, more representative corpus. The variables involved are, as is typical for collocation studies, nominal variables whose values are words.

Stubbs argues that this difference is due to different connotations of small and little, which he investigates on the basis of the noun collocates to their right and the adjectival and adverbial collocates to the left. Again, instead of Stubbs' original data (which he identifies on the basis of raw frequency of occurrence and only cites selectively), I show data from the BNC identified using the log-likelihood test statistic. Table 9-21a shows the ten most strongly associated noun collocates to the right of the node word and Table 9-21b shows the ten most strongly associated adjectival collocates to the left.

Table 9-21a. Nominal collocates of little and small at R1 (BNC)

COLLOCATE    CO-OCCURRENCE  CO-OCCURRENCE  OTHER WORDS  OTHER WORDS       G²
             WITH LITTLE    WITH SMALL     WITH LITTLE  WITH SMALL
MOST STRONGLY ASSOCIATED WITH LITTLE
bit                  2838             40         33606        36388   3678.84
girl                 1148             81         35296        36347   1122.07
doubt                 546              5         35898        36423    710.68
time                  595             23         35849        36405    664.49
while                 435              0         36009        36428    605.46
evidence              324              0         36120        36428    450.46
attention             253              1         36191        36427    339.81
chance                273             24         36171        36404    245.72
money                 194              6         36250        36422    223.77
interest              213             13         36231        36415    214.28
MOST STRONGLY ASSOCIATED WITH SMALL
number                 23           1265         36421        35163   1576.90
group                 123           1188         36321        35240   1017.55
amount                  7            735         36437        35693    957.08
business               36            821         36408        35607    898.26
firm                   15            570         36429        35858    675.97
company                15            507         36429        35921    591.16
proportion              0            417         36444        36011    580.67
scale                   1            392         36443        36036    533.15
area                   15            352         36429        36076    385.19
quantity                0            220         36444        36208    305.75
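The G² values in Tables 9-21a-c can be recomputed from the four frequencies given in each row. The function below is a plain-Python sketch of the standard log-likelihood ratio statistic, applied to the girl row of Table 9-21a:

```python
import math

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2 x 2 table of observed frequencies."""
    observed = [[o11, o12], [o21, o22]]
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    n = sum(row_totals)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n
            if o > 0:  # a zero cell contributes nothing in the limit
                g2 += o * math.log(o / e)
    return 2 * g2

# girl as a collocate of little vs. small (row "girl" of Table 9-21a)
g2_girl = g_squared(1148, 81, 35296, 36347)
print(round(g2_girl))  # 1122, i.e. the 1122.07 reported in the table
```

The same call reproduces the other rows; since G² only compares observed with expected frequencies, it does not matter which variable is treated as rows and which as columns.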

This part of the study is more inductive. Stubbs may have expectations about
what he will find, but he essentially identifies collocates exploratively and then
interprets the findings. The nominal collocates show, according to Stubbs, that
small tends to mean "small in physical size" or "low in quantity", while little is
more clearly restricted to quantities, including informal quantifying phrases like
little bit. This is generally true for the BNC data, too (note, however, the one
exception among the top ten collocates: girl).
The connotational difference between the two adjectives becomes clear when
we look at the adjectives they combine with. The word little has strong
associations to evaluative adjectives that may be positive or negative, and that are
often patronizing. Small, in contrast, does not collocate with evaluative adjectives.

Table 9-21b. Adjectival collocates of little and small at L1 (BNC)

COLLOCATE    CO-OCCURRENCE  CO-OCCURRENCE  OTHER WORDS  OTHER WORDS       G²
             WITH LITTLE    WITH SMALL     WITH LITTLE  WITH SMALL
MOST STRONGLY ASSOCIATED WITH LITTLE
nice                  354              4          4643         1055    110.24
poor                  246              0          4751         1059     96.75
pretty                119              0          4878         1059     46.25
tiny                   95              0          4902         1059     36.84
nasty                  59              0          4938         1059     22.80
funny                  67              1          4930         1058     18.96
dear                   47              0          4950         1059     18.15
sweet                  42              0          4955         1059     16.21
silly                  59              1          4938         1058     16.10
lovely                 92              5          4905         1054     13.58
MOST STRONGLY ASSOCIATED WITH SMALL
other                  59            141          4938          918    285.47
only                   36            119          4961          940    270.99
proximal                0             28          4997         1031     98.27
numerous                4             30          4993         1029     82.20
far                     3             19          4994         1040     50.15
wee                     2             15          4995         1044     40.93
existing                0             11          4997         1048     38.46
various                 6             18          4991         1041     38.31
occasional              1             12          4996         1047     35.29
new                    22             25          4975         1034     31.01

Stubbs sums up his analysis by pointing out that small is a neutral word for
describing size, while little is sometimes used neutrally, but is more often
nonliteral and "convey[s] connotative and attitudinal meanings, which are often
patronizing, critical, or both" (1995b: 386). These differences in distribution
relative to the words boy and girl are evidence for him that "[c]ulture is encoded
not just in words which are obviously ideologically loaded, but also in
combinations of very common words" (Stubbs 1995b: 387).
Stubbs remains unspecific as to what that ideology is: presumably, one that
treats boys as neutral human beings and girls as targets for patronizing
evaluation. In order to be more specific, it would be necessary to turn around the
perspective and study all adjectival collocates of boy and girl. Stubbs does not do
this, but Caldas-Coulthard and Moon (2010) look at adjectives collocating with
man, woman, boy and girl in broadsheet and yellow-press newspapers. In order to
keep the results comparable with those reported above, let us stick with the BNC
instead. Table 9-21c shows the top ten adjectival collocates of boy and girl.

Table 9-21c. Adjectival collocates of boy and girl at L1 (BNC)

COLLOCATE    CO-OCCURRENCE  CO-OCCURRENCE  OTHER WORDS  OTHER WORDS       G²
             WITH BOY       WITH GIRL      WITH BOY     WITH GIRL
MOST STRONGLY ASSOCIATED WITH BOY
old                   634            257          5385         7296    279.98
small                 336             81          5683         7472    237.78
dear                  126             45          5893         7508     61.30
lost                   41              1          5978         7552     58.54
big                   167             89          5852         7464     46.02
naughty                71             22          5948         7531     39.75
new                   124             69          5895         7484     31.31
rude                   19              0          6000         7553     30.93
toy                    16              0          6003         7553     26.04
whipping               14              0          6005         7553     22.78
MOST STRONGLY ASSOCIATED WITH GIRL
young                 351            820          5668         6733    111.04
pretty                 23            132          5996         7421     62.59
other                 194            444          5825         7109     54.56
beautiful              13             87          6006         7466     46.13
attractive              1             35          6018         7518     33.58
blonde                  1             29          6018         7524     26.89
single                  1             27          6018         7526     24.68
dead                   12             57          6007         7496     22.67
unmarried               0             17          6019         7536     19.94
lovely                 18             66          6001         7487     19.45

235
9. Collocation

The results are broadly similar in kind to those in Caldas-Coulthard and Moon
(2010): boy collocates mainly with neutral descriptive terms (small, lost, big, new),
or with terms with which it forms a fixed expression (old, dear, toy, whipping).
There are the evaluative adjectives rude (which in Caldas-Coulthard and Moon's
data is often applied to young men of Jamaican descent) and its positively
connoted equivalent naughty. The collocates of girl are overwhelmingly
evaluative, related to physical appearance. There are just two neutral adjectives
(other and dead, the latter tying in with a general observation that women are
more often spoken of as victims of crimes and other activities than men). Finally,
there is one adjective signaling marital status. These results also generally reflect
Caldas-Coulthard and Moon's findings (in the yellow press, the evaluations are
often heavily sexualized in addition).
To put it mildly, then, the collocates of boy and girl confirm a general attitude
that the latter are up for constant evaluation while the former are mainly seen as
a neutral default. That the adjectives dead and unmarried are among the top ten
collocates in a representative, relatively balanced corpus hints at something
darker: a patriarchal world view that sees girls as victims and sexual partners
and not much else (other studies investigating gender stereotypes on the basis of
collocates of man and woman are Gesuato 2003 and Pearce 2008).

Further Reading
If you want to learn more about association measures, Evert (2005) and the
companion website (Evert 2004–2010) are very comprehensive and relatively
accessible places to start. Stefanowitsch and Flach (2016) discuss corpus-based
association measures in the context of psycholinguistics.

Evert, Stefan (2004–2010). Association Measures. Computational
Approaches to Collocations. http://www.collocations.de/AM/, accessed
June 10, 2014.
Stefanowitsch, Anatol & Susanne Flach (2016). The corpus-based
perspective on entrenchment. In Hans-Jörg Schmid (ed.), Entrenchment
and the psychology of language learning. How we reorganize and adapt
linguistic knowledge. Berlin & New York: De Gruyter Mouton.


10. Grammar
The fact that corpora are most easily accessed via words (or word forms) is also
reflected in many corpus studies focusing on various aspects of grammatical
structure. Many such studies either take (sets of) words as a starting point for
studying various aspects of grammatical structure, or they take easily identifiable
aspects of grammatical structure as a starting point for studying the distribution
of words. However, as the case studies of the English possessive constructions in
Chapters 6 and 7 showed, grammatical structures can be (and are) also studied in
their own right, for example with respect to semantic, information-structural and
other restrictions they place on particular slots or sequences of slots, or with
respect to their distribution across texts, text types, demographic groups or
varieties.

10.1 Grammar in corpora

There are two major problems to be solved when searching corpora for
grammatical structures. We discussed both of them to some extent in Chapter 5,
but let us briefly recapitulate and elaborate some aspects of the discussion before
turning to the case studies.
First, we must operationally define the structure itself in such a way that we
(and other researchers) can reliably categorize potential instances as manifesting
the structure or not. This may be relatively straightforward in the case of simple
grammatical structures that can be characterized based on tangible and stable
characteristics, such as particular configurations of grammatical morphemes
and/or categories occurring in sequences that reflect hierarchical relationships
relatively directly. It becomes difficult, if not impossible, with complex structures,
especially in frameworks that characterize such structures with recourse to
abstract, non-tangible and theory-dependent constructs (see Sampson 1995 for an
attempt at a comprehensive coding scheme for the grammar of English).
Second, we must define a query that will allow us to retrieve potential
candidates from our corpus in the first place (a problem we discussed in some
detail in Chapter 5). Again, this is simpler in the case of morphologically marked
and relatively simple grammatical structures: for example, the s-possessive (as
defined above) is typically characterized by the sequence [NOUN 's ADJECTIVE*
NOUN] in corpora containing texts in standard orthography; it can thus be
retrieved from a POS-tagged corpus with a fairly high degree of precision and
reliability. However, even this simple case is more complex than it seems: in the
sequence just given, 's may also stand for the verb be (Sam's head of marketing);
and of course, the modified nominal may not always be directly adjacent to the 's
(for example in this office is Sam's or in Sam's friends and family).
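Such a query can be sketched as a simple scan over POS-tagged text. The function below is an illustrative toy implementation written for this section (it is not taken from any corpus tool); the tag names follow CLAWS-style conventions (NN*/NP* for nouns, POS for the possessive clitic, AJ* for adjectives), but any tagset that makes these distinctions would work:

```python
def find_s_possessives(tagged):
    """Scan a list of (word, tag) pairs for NOUN + 's + ADJ* + NOUN sequences.

    Tags are assumed CLAWS-style: nouns start with "NN" (or "NP" for
    proper nouns), the possessive clitic is tagged "POS", adjectives "AJ".
    Returns the matching token spans as lists of words.
    """
    def is_noun(tag):
        return tag.startswith("NN") or tag.startswith("NP")

    matches = []
    for i, (word, tag) in enumerate(tagged):
        if not is_noun(tag):
            continue
        j = i + 1
        if j < len(tagged) and tagged[j][1] == "POS":
            j += 1
            while j < len(tagged) and tagged[j][1].startswith("AJ"):
                j += 1  # skip any number of adjectives
            if j < len(tagged) and is_noun(tagged[j][1]):
                matches.append([w for w, _ in tagged[i:j + 1]])
    return matches

tokens = [("Sam", "NP0"), ("'s", "POS"), ("old", "AJ0"),
          ("friend", "NN1"), ("slept", "VVD")]
print(find_s_possessives(tokens))  # [['Sam', "'s", 'old', 'friend']]
```

Note that the precision of such a query stands and falls with the tagger: if a reduced form of be has been mislabeled as POS, the query inherits the error.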
Other structures may be difficult to retrieve even though they can be
characterized straightforwardly: most linguists would agree, for example, that
transitive verbs are verbs that take a direct object. However, this is of very little
help in retrieving transitive verbs even from a POS-tagged corpus, since many
noun phrases following a verb will not be direct objects (Sam slept the whole day)
and direct objects do not necessarily follow their verb; in addition, noun phrases
themselves are not trivial to retrieve.
Yet other structures may be easy to retrieve, but not without retrieving many
false hits at the same time. This is the case with ambiguous structures like the
of-possessive, which can be retrieved by a query along the lines of [NOUN of ART?
ADJ* NOUN], which will also retrieve partitive and quantitative uses, etc.
Finally, structures characterized with reference to invisible theoretical
constructs are so difficult to retrieve that this, in itself, may be a good reason to
avoid such invisible constructs whenever possible when characterizing linguistic
phenomena that we plan to investigate empirically.
These difficulties do not keep corpus linguists from investigating grammatical
structures, even very abstract ones, retrieving the relevant data by mind-numbing
and time-consuming manual analysis of the results of very broad searches or
even of the corpus itself, if necessary. But it is probably one reason why so much
grammatical research in corpus linguistics takes a word-centered approach.
A second reason is that it allows us to transfer well-established collocational
methods to the study of grammar. In the preceding chapter we saw that while
collocation research often takes a sequential approach to co-occurrence, counting
as potential collocates of a node word any word within a given span around it, it
is not uncommon to see a structure-sensitive approach that considers only those
potential collocates that occur in a particular grammatical position relative to
each other (for example, adjectives relative to the nouns they modify or vice
versa). In this approach, grammatical structure is already present in the design,
even though it remains in the background. We can move these types of
grammatical structure into the focus of our investigation, giving us a range of
research designs where one variable consists of (part of) the lexicon (with values
that are individual words) and one variable consists of some aspect of
grammatical structure. In these studies, the retrieval becomes somewhat less of a
problem, as we can search for lexical items and identify the grammatical
structures in our search results afterwards, though identifying these structures
reliably remains non-trivial. We will begin with word-centered case studies and
then move towards more genuinely grammatical research designs.

10.2 Case studies

10.2.1 Collocational frameworks and grammar patterns
An early extension of collocation research to the association between words and
grammatical structure is Renouf and Sinclair's notion of collocational frameworks,
which they define as a discontinuous sequence of two words, positioned at one
word remove from each other, where the two words in question are always
function words (examples are [a(n) + X + of], [too + X + to] or [many + N + of]).
They are particularly interested in classes of items that fill the position of such
frameworks and see the fact that these items tend to be semantically coherent as
evidence that collocational frameworks are relevant items of linguistic structure.
The notion of collocational frameworks was subsequently extended by
Hunston and Francis (2002) to more traditional units of linguistic structure and
their combination, such as [it + Linking Verb + Adjective + to/that]. Their
essential insight is similar to Renouf and Sinclair's: that such structures (which
they call grammar patterns) are meaningful and that their meaning is manifest
in the collocates in central slots:

The patterns of a word can be defined as all the words and structures which
are regularly associated with the word and which contribute to its
meaning. A pattern can be identified if a combination of words occurs
relatively frequently, if it is dependent on a particular word choice, and if
there is a clear meaning associated with it (Hunston and Francis 2000: 37).

Collocational frameworks and especially grammar patterns have an immediate
applied relevance: the COBUILD dictionaries included the most frequently found
patterns for each word in their entries from 1995 onward, and there is a
two-volume descriptive grammar of the patterns of verbs (Francis et al. 1996) and
nouns and adjectives (Francis et al. 1998); there were also attempts to identify
grammar patterns automatically (cf. Mason and Hunston 2004). Research on
collocational frameworks and grammar patterns is mainly descriptive and takes
place in applied contexts, but Hunston and Francis (2002) argue very explicitly for
a usage-based theory of grammar on the basis of their descriptions (note that the
definition quoted above is strongly reminiscent of the way constructions were
later defined in Construction Grammar (Goldberg 1995: 4), a point I will return to
in the Epilogue of this book).

10.2.1.1 Case study: [a(n) + X + of]. As an example of a collocational framework,
consider [a(n) + X + of]. Table 10-1 shows the most strongly associated lexical
items for this framework in the BNC.

Table 10-1. Top collocates of the collocational framework [a(n) + X + of] (BNC)

WORD       a(n) X of          X    a(n) _ of        Rest        LL-G²   a/(a+b)
lot            14632      13329       271461    96687283    132624.95    0.5233
number         15137      33669       270956    96666943    116935.11    0.3101
couple          7042       4838       279051    96695774     66197.79    0.5928
series          5985       8237       280108    96692375     50552.96    0.4208
result          5610      16300       280483    96684312     40644.29    0.2560
variety         4284       4368       281809    96696244     38013.57    0.4951
bit             5026      21444       281067    96679168     33045.05    0.1899
matter          4600      17145       281493    96683467     31332.57    0.2115
member          3950      13435       282143    96687177     27525.87    0.2272
range           3359      16907       282734    96683705     21075.42    0.1657
pair            2331       3593       283762    96697019     19259.42    0.3935
piece           2535       6477       283558    96694135     18888.97    0.2813
group           3540      37611       282553    96663001     17377.09    0.0860
kind            2599      20950       283494    96679662     14073.75    0.1104
sense           2503      18846       283590    96681766     13866.52    0.1172

The results are comparable to those Renouf and Sinclair present, although they
use the COBUILD corpus, which is not publicly available, and they do not
calculate association strength, but they provide the percentages of all occurrences
of each word that occur in a given pattern (shown here in the last column).
The most strongly associated words are related mainly to quantities, parts, or
types (here). This type of semantic coherence is evidence for Renouf and Sinclair
that collocational frameworks are relevant units of language.
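Both derived columns of Table 10-1 can be recomputed from the four frequencies in each row: the last column is a/(a+b), the share of a word's occurrences that fall inside the framework, and LL-G² is the log-likelihood ratio statistic. A plain-Python sketch using the row for lot (the function is written for this example):

```python
import math

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2 x 2 table of observed frequencies."""
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    n = sum(rows)
    cells = [(o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1)]
    return 2 * sum(o * math.log(o / (rows[i] * cols[j] / n))
                   for o, i, j in cells if o > 0)

# Row "lot": inside a(n) _ of, elsewhere, framework with other words, rest
a, b, framework_other, rest = 14632, 13329, 271461, 96687283

print(round(a / (a + b), 4))  # 0.5233, the a/(a+b) column for lot
g2_lot = g_squared(a, b, framework_other, rest)
print(round(g2_lot, 2))  # within rounding of the 132624.95 reported in Table 10-1
```

The contrast between lot (over half of its occurrences inside the framework) and group (under nine percent) shows why a proportion column is a useful complement to the raw association score.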

10.2.1.2 Case study: [it be ADJ that/to]. As an example of a grammar pattern,
consider [it be ADJ that/to]. Table 10-3 shows the most frequent adjectives in this
pattern (I report frequencies rather than association strengths, because this is
common practice in Pattern Grammar; see further Section 10.2.2 below).

Table 10-3. Most frequent words in the grammar pattern [it be ADJ that/to] (BNC)

RANK  COLLOCATION  FREQUENCY    RANK  COLLOCATION  FREQUENCY
  1.  possible          3475     11.  essential          804
  2.  important         3030     12.  unlikely           727
  3.  difficult         2281     13.  interesting        679
  4.  clear             1897     14.  obvious            524
  5.  necessary         1488     15.  better             505
  6.  hard              1372     16.  good               461
  7.  likely            1261     17.  best               382
  8.  easy              1129     18.  vital              381
  9.  impossible        1121     19.  nice               367
 10.  true               970     20.  easier             359

Hunston and Francis (2002: 29) note that the adjectives fall into a broad class they
describe as "modality, ability, importance, predictability, obviousness, value and
appropriacy, rationality, truth". Like Renouf and Sinclair, they see this semantic
coherence as evidence for the relevance of the patterns they identify.
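A rough way of extracting such a pattern from a POS-tagged corpus can be sketched as a scan for it followed by a form of be, an adjective, and that or to. Everything in the sketch below (function name, simplified tag handling, toy sentences) is made up for illustration; a realistic query would also have to allow for intervening adverbs (it is quite possible that ...) and negation:

```python
from collections import Counter

BE_FORMS = {"is", "was", "are", "were", "be", "been", "being", "'s", "'re"}

def pattern_adjectives(tagged_sentences):
    """Count adjectives in [it + BE + ADJ + that/to] in (word, tag) sequences."""
    counts = Counter()
    for sentence in tagged_sentences:
        for i in range(len(sentence) - 3):
            (w1, _), (w2, _), (w3, t3), (w4, _) = sentence[i:i + 4]
            if (w1.lower() == "it" and w2.lower() in BE_FORMS
                    and t3.startswith("AJ")        # CLAWS-style adjective tag
                    and w4.lower() in ("that", "to")):
                counts[w3.lower()] += 1
    return counts

sents = [[("It", "PNP"), ("is", "VBZ"), ("possible", "AJ0"), ("that", "CJT"),
          ("he", "PNP"), ("left", "VVD")],
         [("it", "PNP"), ("was", "VBD"), ("hard", "AJ0"), ("to", "TO0"),
          ("say", "VVI")]]
print(pattern_adjectives(sents))  # Counter({'possible': 1, 'hard': 1})
```

Applied to a full corpus, a scan of this kind yields a frequency list like the one in Table 10-3.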

10.2.2 Collostructional analysis

Simply put, collostructional analysis is the application of quantitative collocation
analysis to associations between lexical items and (configurations of) elements of
grammatical structure. It overlaps with pattern grammar where configurations of
complements and adjuncts are concerned (subcategorization, valency, etc.),
and with modern approaches to colligation where more general features of
grammatical structure (such as tense, aspect, negation, etc.) are concerned (cf.
Section 10.2.4 below), but it is more rigorously quantitative. It has been applied
mainly in the domain of argument structure, with a perspective that is anchored
in construction grammar (see Stefanowitsch and Gries 2009 and Stefanowitsch
2012 for overviews). Like collocation analysis, it is generally inductive, but it can
also be used to test certain general hypotheses about the meaning of grammatical
constructions (in this, too, it has much in common with Pattern Grammar).

10.2.2.1 Case study: The ditransitive. Stefanowitsch and Gries (2003) investigate,
among other things, which verbs are strongly associated with the ditransitive
construction (or subcategorization frame). This is a very direct application of the
basic research design for collocates introduced in the preceding chapter.
Their design is broadly deductive, as their hypothesis is that constructions
have meaning and, specifically, that the ditransitive has a transfer meaning. The
design has two nominal variables: ARGUMENT STRUCTURE (with the values
DITRANSITIVE and OTHER) and VERB (with values corresponding to all verbs
occurring in the construction). The prediction is that the most strongly associated
verbs will be verbs of literal or metaphorical transfer. Table 10-4a gives the
information needed to calculate the association strength for give (although the
procedure should be familiar by now), and Table 10-4b shows the ten most
strongly associated verbs.

Table 10-4a: Give and the ditransitive (ICE-GB)

                              ARGUMENT STRUCTURE
                        DITRANSITIVE        OTHER           Total
        GIVE    Obs:         461              687           1,148
                Exp:           8.57         1,139.43
 VERB
        OTHER   Obs:         574          136,942         137,516
                Exp:       1,026.43       136,489.57
        Total              1,035          137,629         138,664

The hypothesis is confirmed: the top ten collexemes (and most of the other
significant collexemes not shown here) refer to literal or metaphorical transfer.
However, note that on the basis of lists like that in Table 10-4b we cannot reject
a null hypothesis along the lines of "There is no relationship between the
ditransitive and the encoding of transfer events", since we did not test this. All we
can say is that we can reject null hypotheses stating that there is no relationship
between the ditransitive and each individual verb on the list. In practice, this may
amount to the same thing, but if we wanted to reject the more general null
hypothesis, we would have to code all verbs in the corpus according to whether
they are transfer verbs or not, and then show that transfer verbs are significantly
more frequent in the ditransitive construction than in the corpus as a whole.
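The expected frequency in Table 10-4a, and the order of magnitude of the Fisher p-value reported as 0 in Table 10-4b, can both be recomputed from the four observed frequencies. The sketch below (plain Python, written for this example) stays in log-space via lgamma, because the probabilities involved underflow ordinary floating-point numbers:

```python
from math import lgamma, log

def log10_choose(n, k):
    """Base-10 logarithm of the binomial coefficient C(n, k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)

# give and the ditransitive (Table 10-4a, ICE-GB)
o11, o12, o21, o22 = 461, 687, 574, 136942
n = o11 + o12 + o21 + o22

# Expected frequency of give in the ditransitive
e11 = (o11 + o12) * (o11 + o21) / n
print(round(e11, 2))  # 8.57

# Base-10 log of the hypergeometric probability of the observed table;
# this is the largest term of the one-tailed Fisher p-value, so the
# p-value itself is of the same order of magnitude.
log10_p = (log10_choose(o11 + o12, o11) + log10_choose(o21 + o22, o21)
           - log10_choose(n, o11 + o21))
print(log10_p < -100)  # True: far below any conventional threshold
```

A p-value this small is smaller than the smallest positive floating-point number, which is why it is reported as 0 in Table 10-4b.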


Table 10-4b: The verbs in the ditransitive construction (ICE-GB, Stefanowitsch
and Gries 2003: 229)

COLLEXEME     O11      O12      O21       O22    p (Fisher)
give          461      687      574    136942    0
tell          128      660      907    136969    1.6E-127
send           64      280      971    137349    7.26E-68
offer          43      152      992    137477    3.31E-49
show           49      578      986    137051    2.23E-33
cost           20       82     1015    137547    1.12E-22
teach          15       76     1020    137553    4.32E-16
award           7        9     1028    137620    1.36E-11
allow          18      313     1017    137316    1.12E-10
lend            7       24     1028    137605    2.85E-09
v0
10.2.2.2 Case study: Ditransitive and prepositional dative. Collostructional analysis
can also be applied in the direct comparison of two grammatical constructions (or
other grammatical features), analogous to the differential-collocate design
discussed in Chapter 9. For example, Gries and Stefanowitsch (2004) compare
verbs in the ditransitive and the so-called to-dative: both constructions express
transfer meanings, but it has been claimed that the ditransitive encodes a
spatially and temporally more direct transfer than the to-dative (Thompson and
Koide 1987). If this is the case, it should be reflected in the verbs most strongly
associated to one or the other construction in a direct comparison. Table 10-5a
shows the data needed to determine, for the verb give, which of the two
constructions it is more strongly attracted to.

Table 10-5a: Give in the ditransitive and the prepositional dative (ICE-GB)

                              ARGUMENT STRUCTURE
                        DITRANSITIVE     TO-DATIVE       Total
        GIVE    Obs:         461            146            607
                Exp:         212.68         394.32
 VERB
        OTHER   Obs:         574          1,773          2,347
                Exp:         822.32       1,524.68
        Total              1,035          1,919          2,954


Give is significantly more frequent than expected in the ditransitive and less
frequent than expected in the to-dative (p = 1.84E-120). It can therefore be said to
be a significant distinctive collexeme of the ditransitive (again, I will use the term
differential instead of distinctive here). Table 10-5b shows the top ten differential
collexemes for each construction.

Table 10-5b: Verbs in the ditransitive and the prepositional dative (ICE-GB, Gries
and Stefanowitsch 2004: 106)

COLLEXEME     O11      O12      O21      O22    p (Fisher)
MOST STRONGLY ASSOCIATED WITH THE DITRANSITIVE
give          461      146      574     1773    1.84E-120
tell          128        2      907     1917    8.77E-58
show           49       15      986     1904    8.32E-12
offer          43       15      992     1904    9.95E-10
cost           20        1     1015     1918    9.71E-09
teach          15        1     1020     1918    1.49E-06
wish            9        1     1026     1918    0.0005
ask            12        4     1023     1915    0.0013
promise         7        1     1028     1918    0.0036
deny            8        3     1027     1916    0.0122
MOST STRONGLY ASSOCIATED WITH THE TO-DATIVE
bring           7       82     1028     1837    1.47E-09
play            1       37     1034     1882    1.46E-06
take           12       63     1023     1856    0.0002
pass            2       29     1033     1890    0.0002
make            3       23     1032     1896    0.0068
sell            1       14     1034     1905    0.0139
do             10       40     1025     1879    0.0151
supply          1       12     1034     1907    0.0291
read            1       10     1034     1909    0.0599
hand            5       21     1030     1898    0.0636

Generally speaking, the list for the ditransitive is very similar to the one we get if
we calculate the simple collexemes of the construction; crucially, many of the
differential collexemes of the to-dative highlight the spatial distance covered by
the transfer, which is in line with what the hypothesis predicts.


10.2.2.3 Case study: Negative evidence. Recall from the introductory chapter that
one of the arguments routinely made against corpus linguistics is that corpora do
not contain negative evidence. Even corpus linguists occasionally agree with this
claim. For example, McEnery and Wilson (2001: 11), in their otherwise excellent
introduction to corpus-linguistic thinking, cite the sentence in (10.1):

(10.1) *He shines Tony books.

They point out that this sentence will not occur in any given finite corpus, but
that this does not allow us to declare it ungrammatical, since it could simply be
one of infinitely many sentences that simply haven't occurred yet. They then
offer the same solution Chomsky has repeatedly offered:
It is only by asking a native or expert speaker of a language for their
opinion of the grammaticality of a sentence that we can hope to
differentiate unseen but grammatical constructions from those which are
simply ungrammatical but unseen (McEnery and Wilson 2001: 12).
However, as Stefanowitsch (2006, 2008) points out, zero is just a number, no
different in quality from 1, or 461, or any other frequency of occurrence. This
means that we can run the same statistical tests on a combination of a verb and a
grammatical construction (or any other elements of linguistic structure) that do
not occur together as on combinations that do. Table 10-6a shows this for the verb
say and the ditransitive construction in the ICE-GB.

Table 10-6a: The non-occurrence of say in the ditransitive (ICE-GB)

                              ARGUMENT STRUCTURE
                        DITRANSITIVE        OTHER           Total
        SAY     Obs:           0            3,333           3,333
                Exp:          44.52         3,288.48
 VERB
        OTHER   Obs:       1,824          131,394         133,218
                Exp:       1,779.48       131,438.52
        Total              1,824          134,727         136,551

Fisher's exact test shows that the observed frequency of zero differs significantly
from that expected by chance (p = 1.96E-20). Thus, it is very unlikely that
sentences like Alex said Joe the answer simply haven't occurred yet in the
corpus. Instead, we can be fairly certain that say cannot be used with ditransitive
complementation in English. Of course, the corpus data do not tell us why this is
so, but neither would an acceptability judgment from a native speaker.
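The numbers in Table 10-6a can be recomputed directly. With an observed frequency of zero, the one-tailed Fisher p-value reduces to the probability of the single most extreme table (the chance that none of the ditransitive slots is filled by say), which makes the computation particularly transparent. A plain-Python sketch, written for this example:

```python
from math import lgamma, exp

def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

# say and the ditransitive (Table 10-6a, ICE-GB): observed co-occurrence is 0
say_total, other_total, ditr_total = 3333, 133218, 1824
n = say_total + other_total  # 136,551 verbs in total

# Expected frequency of say in the ditransitive
expected = say_total * ditr_total / n
print(round(expected, 2))  # 44.52

# One-tailed Fisher p-value: the probability that all 1824 ditransitive
# slots are filled by verbs other than say, C(133218, 1824) / C(136551, 1824)
p = exp(log_choose(other_total, ditr_total) - log_choose(n, ditr_total))
print("%.2e" % p)  # 1.96e-20, the value reported for say in Table 10-6b
```

Because the computation happens in log-space, the same sketch also handles the far smaller p-values of verbs like be without underflow problems.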
Table 10-6b shows the twenty verbs whose non-occurrence in the ditransitive
is statistically most significant in the ICE-GB (see Stefanowitsch 2006: 67). Since
the frequency of co-occurrence is always zero and the construction is constant,
the order of association strength corresponds to the order of the corpus
frequency of the words, so the point of statistical testing in this case really is to
determine whether the absence of a particular word is significant or not.

Table 10-6b: Non-occurring verbs in the ditransitive (ICE-GB)

COLLEXEME     O11      O12      O21       O22    p (Fisher)
be              0    25416     1824    109311    4.29E-165
be|have         0     6261     1824    128466    3.66E-38
have            0     4303     1824    130424    2.90E-26
think           0     3335     1824    131392    1.90E-20
say             0     3333     1824    131394    1.96E-20
know            0     2120     1824    132607    3.32E-13
see             0     1971     1824    132756    2.54E-12
go              0     1900     1824    132827    6.69E-12
want            0     1256     1824    133471    4.27E-08
use             0     1222     1824    133505    6.77E-08
come            0     1140     1824    133587    2.06E-07
look            0     1099     1824    133628    3.59E-07
try             0      749     1824    133978    4.11E-05
mean            0      669     1824    134058    1.21E-04
work            0      646     1824    134081    1.65E-04
like            0      600     1824    134127    3.08E-04
feel            0      593     1824    134134    3.38E-04
become          0      577     1824    134150    4.20E-04
happen          0      523     1824    134204    8.70E-04
put             0      513     1824    134214    9.96E-04

Note that since zero is no different from any other frequency of occurrence, this
procedure does not tell us anything about the difference between an intersection
of variables that did not occur at all and an intersection that occurred with any
other frequency less than the expected one. All the method tells us is whether an
occurrence of zero is significantly less than expected.
In other words, the method makes no distinction between zero occurrence and
other less-frequent-than-expected occurrences. However, Stefanowitsch (2006:
70f.) argues that this is actually an advantage: if we were to treat an occurrence of
zero as special as opposed to, say, an occurrence of 1, then a single counterexample
to an intersection of variables hypothesized to be impossible will appear to
disprove our hypothesis. The knowledge that a particular intersection is
significantly less frequent than expected, in contrast, remains relevant even when
faced with apparent counterexamples. And, as anyone who has ever elicited
acceptability judgments (whether from someone else or introspectively from
themselves) knows, the same is true of acceptability judgments: We may feel
something to be unacceptable even though we know of counterexamples (or can
even think of such examples ourselves) that seem possible but highly unusual.
Of course, applying significance testing to zero occurrences of some
intersection of variables is not always going to provide a significant result: if one
(or both) of the values of the intersection are rare in general, an occurrence of
zero may not be significantly less than expected. In this case, we still do not know
whether the absence of the intersection is due to chance or to its impossibility;
but with such rare combinations, acceptability judgments are also going to be
variable.

10.2.3 Words and their grammatical properties

Studies that take a collocational perspective on associations between words and
grammatical structure tend to start from one or more grammatical structures
and inductively identify the words associated with these structures. Moving
away from this perspective, we find research designs that begin to resemble more
closely the type illustrated in Chapters 6–8, but that nevertheless include lexical
items as one of their variables. These will typically start from a word (or a small
set of words) and identify associated grammatical (structural and semantic)
properties. This is of interest for many reasons, for example in the context of how
much idiosyncrasy there is in the grammatical behavior of lexical items in general
(for example, what are the grammatical differences between near-synonyms), or
how much of a particular grammatical alternation is lexically determined.

10.2.3.1 Case study: Frame elements of verbs of seeing. As an example of a study
looking at the grammatical behavior of items in general, consider Atkins (1994),
who looks at differences between verbs of seeing in terms of the semantic roles
that are grammatically expressed. She assumes that all these verbs share a
common frame that includes an experiencer (the person seeing something), a
stimulus (the entity that is seen), the location of the experiencer, the location of
the stimulus and potential obstructions in the line of sight. Not all of these
elements need to be expressed grammatically, and Atkins investigates which
elements are more or less likely to be expressed with a number of different verbs
of seeing. Her study is inductive, in that she does not formulate any hypotheses
concerning the differences between the verbs. The study has two nominal
variables, VERB OF SEEING (with values corresponding to the individual verbs) and
FRAME ELEMENTS EXPRESSED (with values corresponding to the frame elements).
Table 10-7 shows one of the ways in which Atkins summarizes the data.
Table 10-7. Distribution of frame elements across verbs of seeing (Atkins 1994: 50)

FRAME ELEMENT EXPRESSED:
EXPERIENCER / PERCEPT / LOCATION OF PERCEPT / LOCATION OF EXPERIENCER /
BARRIER / TIME

VERB OF SEEING
see                86%    95%    14%
catch sight of    100%   100%     7%    2%
glimpse            67%    98%    43%    8%   15%
sight              45%    92%    43%   35%
spot               70%    95%    39%    5%
behold             81%   100%     3%    6%
descry            100%   100%
spy                95%   100%    25%    5%
espy              100%   100%     8%

Atkins discusses various ways in which the paern of the grammatical


expression of frame elements corresponds to semantic dierences between the
verbs (unfortunately, she does not provide absolute frequencies, so it is
impossible to determine which of the observed frequencies actually dier
signicantly from the expected ones). For example, she notes that sight, spot, spy
and glimpse dier from the other verbs of seeing in focusing on the location of
the percept; this is clearly related to the fact that sight, spot and spy are used in
situations where the experiencer is actively on the look out for the percept;
Atkins notes that the place of the percept is oen described with adjuncts giving

248
Anatol Stefanowitsch

precise locations, oen including measurements. Sight additionally focuses on the


time at which the experiencer saw the percept. Atkins also points out that
glimpse seems to have a temporal dimension (it suggests very short seeing
events), but the fact that temporal information is never grammatically expressed
suggests that this is due to the frequent presence of a barrier in the line of sight,
that obscures the percept except for very short moments).
Not all authors who study differences between the grammatical behavior of
near synonyms agree that these differences can be explained fully by the
semantics of the words in question. For example, Faulhaber (2011) looks at the
specific formal patterns that verbs are used in as well as the semantic roles they
express. Table 10-8 shows the result of one of her case studies using verbs of
permission (note that for expository reasons the table is simplified and the labels
for semantic roles are more generic than the ones used by Faulhaber).
Table 10-8. Argument structures possible with allow, permit, authorize and entitle in
the sense 'permit' (Faulhaber 2011: 101)

Argument Structure                      allow   permit   authorize   entitle
NP_AGT V NP_THM [for_N/V-ing]_PURP
NP_AGT V NP_THM [to_NP]_REC
NP_AGT V [NP(]+[)to_INF]_THM
NP_AGT V NP_THM
NP_AGT V [V-ing]_THM
NP_AGT V [that_CL]_THM
NP_AGT V [for_NP to_INF]_THM
NP_AGT V [NP]_REC + [NP]_THM
NP_AGT V [0]
NP_AGT V [of_NP]_THM
NP_AGT V [NP]_THM + [into_NP]_GOAL
NP_AGT V [for_NP/V-ing]_THM
NP_AGT V [for_NP V-ing]_THM
NP_AGT V [to_INF]_THM

Instead of giving percentages, Faulhaber takes an all-or-nothing approach (and in
fact, she supplements corpus data by native-speaker judgments, something that,
as mentioned in the introductory chapter, I strongly recommend against). Still,
her approach of taking into consideration not only semantic roles and their
grammatical expression, but also the specific configuration of constituents used
to express them uncovers an amount of idiosyncrasy that is very difficult (in
Faulhaber's opinion, impossible) to account for based on verb semantics.
Finally, note that in the context of the grammatical and semantic behavior of
lexical items, the idea of behavioral profiles as developed by Gries on the basis of
Atkins (1987) is worth following up: Gries suggests coding lexical items for a
wide range of aspects of their semantic and syntactic behavior and then running
cluster analyses to see if, and in what ways, they will be grouped together on the
basis of their colligational, collocational and semantic associations (the technique
has been applied with extremely interesting results to word senses, near-
synonyms and antonyms, cf. Gries 2010 for an overview).
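The similarity judgment at the heart of such a cluster analysis can be sketched in a few lines. The following is only an illustration, not Gries's actual procedure: the verbs and the feature proportions are hypothetical (loosely modeled on Table 10-7), and instead of a full hierarchical cluster analysis it merely finds each verb's nearest neighbor by Euclidean distance, the comparison that such an analysis builds on.

```python
import math

# Hypothetical behavioral profiles: each verb is a vector of proportions with
# which it occurs with a small set of grammatical/semantic features (the
# numbers are illustrative, loosely based on Table 10-7, not Gries's data)
profiles = {
    "glimpse": [0.67, 0.98, 0.43, 0.15],
    "spot":    [0.70, 0.95, 0.39, 0.00],
    "behold":  [0.81, 1.00, 0.03, 0.00],
    "descry":  [1.00, 1.00, 0.00, 0.00],
}

# Nearest neighbor by Euclidean distance between profile vectors
nearest = {
    verb: min((other for other in profiles if other != verb),
              key=lambda other: math.dist(vec, profiles[other]))
    for verb, vec in profiles.items()
}
print(nearest)
```

With these toy vectors, glimpse and spot pair up, as do behold and descry, which is the kind of grouping a full cluster analysis would then represent as a dendrogram.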

10.2.3.2 Case study: Complementation of begin and start. There are a number of
verbs in English that display variation with respect to their complementation
patterns, and the factors influencing this variation have provided (and continue to
provide) an interesting area of research for corpus linguists. In an early example
of such a study, Schmid (1996) investigates the near-synonyms begin and start,
both of which can occur with to-clauses or ing-clauses. He starts out from the
hypothesis (derived from discussions in the literature) that begin signals gradual
onsets and start signals sudden ones, and that ing-clauses are typical of dynamic
situations while to-clauses are typical of stative situations. He then identifies all
occurrences of both verbs with both complementation patterns in the LOB and
codes them for whether the embedded verb (the verb in the complement clause)
refers to an activity, a process or a state.
His study is a deductive one, as he starts out from a hypothesis to be tested on
the corpus data. It involves three nominal variables: VERB (with the values BEGIN
and START), COMPLEMENTATION (with the values ING and TO) and AKTIONSART (with the
values ACTIVITY, PROCESS and STATE). Again, we are dealing with a multivariate
design.
Schmid (1996) reports frequencies and percentages and bases his discussion on
an informal inspection of these. Let us be more rigorous and submit his data to a
Configural Frequency Analysis as introduced in Chapter 8. Table 10-9a shows the
observed frequencies (from Schmid 1996) and the expected frequencies derived
from these, and Table 10-9b shows the results of the CFA.
There are three types with corresponding antitypes that reach significance (in
one case, only marginally so, at corrected levels of significance): activity verbs are
associated with the verb start and the complementation pattern ing and avoid the
verb start with the complementation pattern to. This confirms the hypothesis:
activities have clear (sudden) onsets and they are dynamic. For state verbs, the
pattern is reversed, again in line with the hypothesis: states do not typically
have clear (sudden) onsets and they are, of course, stative. Processes seem to be
closer to states than to activities, but perhaps one of the reasons that the type
PROCESS + BEGIN + TO is only marginally significant is that they are somewhere
in between states and activities.

Table 10-9a. Aktionsart, matrix verb and complementation type (LOB, Schmid 1996:
229)

                      BEGIN                         START
AKTIONSART    ING        TO         Total    ING       TO        Total    Total
ACTIVITY      O: 21      O: 96      117      O: 50     O: 26     76       193
              E: 30.85   E: 114.42           E: 10.14  E: 37.59
PROCESS       O: 2       O: 59      61       O: 2      O: 10     12       73
              E: 11.67   E: 43.28            E: 3.83   E: 14.22
STATE         O: 1       O: 101     102      O: 3      O: 1      4        106
              E: 16.94   E: 62.84            E: 5.57   E: 20.65
Total         24         256        280      55        37        92       372
Table 10-9b. Configural frequency analysis for Table 10-9a.

Intersection             Observed  Expected  Type  Chi-sq.  p             Sig.
ACTIVITY  BEGIN  ING        21       30.85    -      3.15   2.075228e-01
                 TO         96      114.42    -      2.96   2.270716e-01
          START  ING        50       10.14    +    156.77   0.000000e+00  ***
                 TO         26       37.59    -      3.58   1.672970e-01
PROCESS   BEGIN  ING         2       11.67    -      8.01   1.821074e-02  ***
                 TO         59       43.28    +      5.71   5.750215e-02  marg.
          START  ING         2        3.83    -      0.88   6.449079e-01
                 TO         10       14.22    -      1.25   5.346649e-01
STATE     BEGIN  ING         1       16.94    -     15.00   5.523667e-04  ***
                 TO        101       62.84    +     23.17   9.301166e-06  ***
          START  ING         3        5.57    -      1.18   5.532767e-01
                 TO          1       20.65    -     18.70   8.712604e-05  ***
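The values in Table 10-9b can be recomputed directly from the observed frequencies in Table 10-9a. The sketch below assumes the simplest CFA model, in which expected cell frequencies are derived from the marginal totals under complete independence of the three variables; the p-values printed in Table 10-9b match a chi-square distribution with two degrees of freedom (whose survival function is exp(-x/2)), so that is what is used here.

```python
import math

# Observed frequencies from Table 10-9a (Schmid 1996), indexed as
# (aktionsart, verb, complement)
observed = {
    ("activity", "begin", "ing"): 21,  ("activity", "begin", "to"): 96,
    ("activity", "start", "ing"): 50,  ("activity", "start", "to"): 26,
    ("process",  "begin", "ing"):  2,  ("process",  "begin", "to"): 59,
    ("process",  "start", "ing"):  2,  ("process",  "start", "to"): 10,
    ("state",    "begin", "ing"):  1,  ("state",    "begin", "to"): 101,
    ("state",    "start", "ing"):  3,  ("state",    "start", "to"):  1,
}
n = sum(observed.values())

def marginal(axis):
    """Marginal frequency totals for one of the three variables."""
    totals = {}
    for cell, freq in observed.items():
        totals[cell[axis]] = totals.get(cell[axis], 0) + freq
    return totals

margins = [marginal(axis) for axis in range(3)]

results = {}
for cell, obs in observed.items():
    # expected frequency under complete independence of the three variables
    exp = n * math.prod(margins[axis][value] / n for axis, value in enumerate(cell))
    chisq = (obs - exp) ** 2 / exp
    p = math.exp(-chisq / 2)  # chi-square p-value with df = 2, as in Table 10-9b
    results[cell] = (exp, chisq, p)

for cell, (exp, chisq, p) in results.items():
    print(cell, round(exp, 2), round(chisq, 2), f"{p:.6e}")
```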

10.2.4 Grammar and context

There is a wide range of contextual factors that are hypothesized or known to
influence grammatical variation. These include information status, animacy and
length, which we already discussed in the case studies of the possessive
constructions in Chapters 6 and 7. Since we have dealt with them in some detail,
they will not be discussed further here, but they have been extensively studied for
a variety of phenomena (cf. e.g. Thompson/Koide 1987 for the dative alternation;
Chen 1986, Gries 2003 for particle placement; Rosenbach 2002, Stefanowitsch
2003 for the possessives). Instead, we will discuss some less frequently
investigated contextual factors here, namely word frequency, phonology, the
horror aequi and priming.

10.2.4.1 Case study: Adjective order and frequency. In a comprehensive study on
adjective order already mentioned in Chapter 5 above, Wulff (2003) studies,
among other things, the hypothesis that in noun phrases with two or more
adjectives, more frequent adjectives precede less frequent ones. There are two
straightforward ways of testing this hypothesis. First (as Wulff does), based on
average frequencies: if we assume that the hypothesis is correct, then the average
frequency of the first adjective of two-adjective pairs should be higher than that
of the second. Second, based on the number of cases in which the first and second
adjective, respectively, are more frequent: if we assume that the hypothesis is
correct, there should be more cases where the first adjective is the more frequent
one than cases where the second adjective is the more frequent one. Obviously,
the two ways of investigating the hypothesis could give us different results.
Consider Table 10-10, which lists ten randomly chosen adjective pairs from the
spoken part of the BNC (Wulff uses the entire BNC in her study).

Table 10-10. Frequencies of selected adjectives in ADJ+ADJ sequences.

FIRST WORD     SECOND WORD     f(FIRST WORD)   f(SECOND WORD)   MORE FREQUENT
necessary      financial           734             808          SECOND
other          dumb              13099              66          FIRST
future         reproductive        106             150          SECOND
new            allied             6337              33          FIRST
conservative   young               148            1856          SECOND
other          married           13099             436          FIRST
whole          bloody             2361            2110          FIRST
frail          elderly              10             276          SECOND
delighted      united              175             515          SECOND
new            posh               6337             139          FIRST
Mean/Total                        4240.6           638.9        5/5
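Both ways of evaluating the hypothesis can be computed mechanically from the table. The following sketch recalculates, for the ten pairs in Table 10-10, the mean frequencies of the two positions and the number of cases in which the first or the second adjective is the more frequent one:

```python
# (first adjective, second adjective, f(first), f(second)) from Table 10-10
pairs = [
    ("necessary", "financial", 734, 808),
    ("other", "dumb", 13099, 66),
    ("future", "reproductive", 106, 150),
    ("new", "allied", 6337, 33),
    ("conservative", "young", 148, 1856),
    ("other", "married", 13099, 436),
    ("whole", "bloody", 2361, 2110),
    ("frail", "elderly", 10, 276),
    ("delighted", "united", 175, 515),
    ("new", "posh", 6337, 139),
]

# evaluation 1: average frequency of first vs. second position
mean_first = sum(p[2] for p in pairs) / len(pairs)
mean_second = sum(p[3] for p in pairs) / len(pairs)

# evaluation 2: number of pairs won by the first vs. the second adjective
first_wins = sum(p[2] > p[3] for p in pairs)
second_wins = sum(p[3] > p[2] for p in pairs)

print(mean_first, mean_second)   # → 4240.6 638.9
print(first_wins, second_wins)   # → 5 5
```

The two evaluations disagree for this small sample, which is exactly the point made in the discussion that follows.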


If we go by average frequency, the hypothesis would be confirmed: the average
frequency of the adjectives in first position is much higher than that of the
second. In contrast, if we go by number of cases, the hypothesis would not be
confirmed: there are five cases where the first word is more frequent and five
cases where the second word is more frequent.
Both ways of looking at the issue have disadvantages. If we go by average
frequency, then individual cases might inflate the average of either of the two
positions: in our random sample, the relatively high-frequency adjectives other
and new, which each occur twice, inflate the average frequency of the first
adjectives. In contrast, if we go by number of cases, then cases with very little
difference in frequency (like necessary/financial in the first line of Table 10-10)
count just as much as cases with a vast difference in frequency (like other/dumb
in line 2). In cases like this, it may be advantageous to apply both methods and
reject the null hypothesis only if both of them give the same result.
Searching for sequences of exactly two adjectives (excluding comparatives and
superlatives) followed by a noun in the spoken part of the BNC yields 7769 hits. I
determined the frequency of each adjective in the spoken part of the BNC (again
ignoring comparatives and superlatives), and calculated means and sample
variance as described in Chapter 7, Section 7.3.1. Table 10-11 shows the relevant
information.
Table 10-11. Word frequency and word order in ADJ+ADJ combinations, variant 1
(BNC Spoken)

                     FIRST WORD     SECOND WORD
Mean Frequency       2 891.04       2 025.67
Sample Variance      15 846 791     10 037 595
Number of cases      7 769          7 769
Using the formulas given in Chapter 7, we get a t-value of 14.99, which at 14 791
degrees of freedom is highly significant (t(14 791) = 14.99, p < 0.01). Going by
average frequency, then, frequency seems to have an influence on word order.
Next, let us look at the number of cases that match the prediction. There are
274 cases where both adjectives have exactly the same frequency (in most cases,
because an adjective is repeated, e.g. in long, long time; in some cases because two
different adjectives happen to have the same frequency, e.g. antiseptic soapy
water). We can discard these cases, as our hypothesis does not make any
prediction about them. This leaves 4 279 cases where the first adjective is more
frequent than the second, and 3 216 cases where the second is more frequent than
the first. If there was no relation between order and frequency, we would expect
the cases to be evenly distributed. This gives us the observed and expected
frequencies in Table 10-12, which show that the difference is, again, statistically
highly significant (χ² = 150.76, df = 1, p < 0.001).

Table 10-12. Word frequency and word order in ADJ+ADJ combinations, variant 2
(BNC spoken)

                               MORE FREQUENT
                               FIRST WORD       SECOND WORD      Total
ADJ+ADJ COMBINATIONS   Obs:    4 279            3 216            7 495
                       Exp:    3 747.5          3 747.5
                       χ²-Cp.: 75.38            75.38
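Both test statistics can be recomputed from the summary values alone. The sketch below derives the t-value from Table 10-11, assuming the Welch version of the two-sample t-test for unequal variances (which reproduces the reported 14 791 degrees of freedom), and the chi-square value from the case counts in Table 10-12:

```python
import math

# Summary statistics from Table 10-11 (BNC Spoken, first vs. second adjective)
m1, m2 = 2891.04, 2025.67          # mean frequencies
s1, s2 = 15_846_791, 10_037_595    # sample variances
n1 = n2 = 7769                     # number of cases

# Welch's two-sample t-test computed from the summary statistics alone
se1, se2 = s1 / n1, s2 / n2
t = (m1 - m2) / math.sqrt(se1 + se2)
# Welch-Satterthwaite approximation of the degrees of freedom
df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
print(round(t, 2), round(df))       # → 14.99 14791

# Chi-square goodness-of-fit test for the case counts in Table 10-12
observed = [4279, 3216]
expected = sum(observed) / 2
chisq = sum((o - expected) ** 2 / expected for o in observed)
print(round(chisq, 2))              # → 150.76
```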

This case study was meant to demonstrate that sometimes we can derive different
types of quantitative predictions from a hypothesis with no way to decide which
of them more accurately reflects the hypothesis; in this case, it is a good idea to
test both predictions. The case study also shows that word frequency may have
effects on grammatical variation, which is interesting from a methodological
perspective because not only is corpus linguistics the only way to test for such
effects; corpora are also the only source from which the relevant values for the
independent variable can be extracted.

10.2.4.2 Case study: Binomials and sonority. Frozen binomials (i.e. phrases of the
type flesh and blood, day and night, size and shape) have inspired a substantial
body of research attempting to uncover principles determining the order of the
constituents. A number of semantic, syntactic and phonological factors have been
proposed and investigated using psycholinguistic and corpus-based methods (see
Lohmann 2013 for an overview and a comprehensive empirical study). A
phonological factor that we will focus on here concerns the sonority of the final
consonants of the two constituents: it has been proposed that words with less
sonorant final consonants occur before words with more sonorant ones (e.g.
Cooper and Ross 1975).
In order to test this, we first need a sample of binomials. In the literature, these
are typically taken from dictionaries or other lists assembled for other purposes,
but there are two problems with this method. First, these lists contain no
information about the frequency (and thus, importance) of the individual
examples. Second, these lists do not tell us exactly how frozen the phrases are;
while there are cases that seem truly non-reversible (like flesh and blood), others
simply have a strong preference (day and night is three times as frequent as night
and day in the BNC) or even a relatively weak one (size and shape is only one-
and-a-half times as frequent as shape and size).
We can avoid these problems by drawing our sample from the corpus itself.
For this case study, I selected all instances of [NOUN and NOUN] that occurred at
least 30 times in the BNC; because it is known that length and stress are strong
determining factors, I only included cases where both nouns are monosyllabic. I
then determined for all cases their frequency in both orders. I then calculated a
"frozenness" rank by calculating for each pair the percentage of the more
frequent order out of all uses of the pair.
Next, I coded the final consonants of all items and determined whether the
first or the last one was more sonorant. For this, I used the following (hopefully
uncontroversial) sonority hierarchy:

(10.3) [vowels] > [semivowels] > [liquids] > [h] > [nasals] >
[voiced fricatives] > [voiceless fricatives] > [voiced affricates] >
[voiceless affricates] > [voiced stops] > [voiceless stops]
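The coding step can be made explicit with a small helper function: assign each class its position in the hierarchy in (10.3) and compare the final consonants of the two nouns. This is only a sketch (it takes sonority class labels rather than actual consonants as input):

```python
# Sonority hierarchy from (10.3), from most to least sonorant
hierarchy = ["vowel", "semivowel", "liquid", "h", "nasal",
             "voiced fricative", "voiceless fricative", "voiced affricate",
             "voiceless affricate", "voiced stop", "voiceless stop"]
rank = {cls: i for i, cls in enumerate(hierarchy)}  # higher rank = less sonorant

def less_sonorant(final1, final2):
    """Return '1st', '2nd' or None (equal), indicating which noun's final
    consonant is less sonorant; inputs are sonority classes from (10.3)."""
    if rank[final1] == rank[final2]:
        return None
    return "1st" if rank[final1] > rank[final2] else "2nd"

# flesh (voiceless fricative) vs. blood (voiced stop): blood's is less sonorant
print(less_sonorant("voiceless fricative", "voiced stop"))  # → 2nd
```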
The data are shown in Table 10-13. The first column shows the phrase, the second
column gives the frequency of the phrase in the more frequent order (the one
shown), the third column gives the frequency of the less frequent order (the
reversal of the order shown), the fourth column gives the percentage of the more
frequent order, and the fifth column shows whether the final consonant of the
first or of the second noun is less sonorant.


Table 10-13. Sample of monosyllabic binomials and their sonority

EXAMPLE                More freq.  Less freq.  Frozen-  Less
                       order       order       ness     sonorant
arts and crafts            120         0       1.0000
rise and fall              113         0       1.0000   1st
fish and chip              102         0       1.0000   2nd
flesh and blood             99         0       1.0000   2nd
hands and knees             98         0       1.0000
length and breadth          74         0       1.0000
ways and means              66         0       1.0000
pots and pans               64         0       1.0000   1st
kings and queens            62         0       1.0000   1st
pride and joy               61         0       1.0000   1st
rights and wrongs           60         0       1.0000   1st
man and wife                58         0       1.0000   1st
heart and soul              55         0       1.0000   1st
son and heir                55         0       1.0000   1st
ebb and flow                51         0       1.0000   1st
cat and mouse               36         0       1.0000   1st
egg and chips               33         0       1.0000   1st
hip and thigh               32         0       1.0000   1st
head and tail               32         0       1.0000   1st
months and years            30         0       1.0000   1st
fire and life               30         0       1.0000   2nd
drink and drugs             29         0       1.0000   1st
knees and toes              29         0       1.0000
flesh and bone              28         0       1.0000   1st
mums and dads               28         0       1.0000
cut and thrust              28         0       1.0000
day and age                 26         0       1.0000   2nd
births and deaths           26         0       1.0000
king and queen              25         0       1.0000   1st
mum and dad                489        11       0.9780   2nd
north and south            398         9       0.9779
black and white            127         4       0.9695
nuts and bolts              63         2       0.9692
life and death             241         8       0.9679
fish and chips             216         8       0.9643
bread and cheese            80         3       0.9639   1st
wife and child              53         2       0.9636   1st
knife and fork              86         4       0.9556   2nd
care and skill              36         2       0.9474   2nd
youth and sports            52         3       0.9455
wife and son                34         2       0.9444   1st
oil and gas                373        23       0.9419   2nd
hips and thighs             28         2       0.9333   1st
news and views              28         2       0.9333
age and sex                118         9       0.9291   2nd
days and nights             84         7       0.9231   2nd
fruit and veg               34         3       0.9189   1st
life and work               49         5       0.9074   2nd
days and weeks              29         3       0.9063   1st
home and school             38         4       0.9048   1st
face and hands              56         6       0.9032   1st
food and fuel               28         3       0.9032   2nd
sheep and goats             58         7       0.8923   1st
hands and feet             124        15       0.8921   2nd
heart and lungs             37         5       0.8810   1st
land and sea                37         5       0.8810   1st
hearts and minds            64         9       0.8767   1st
size and weight             28         4       0.8750   2nd
east and west              282        43       0.8677
arms and legs              195        30       0.8667
heart and lung              30         5       0.8571   1st
eyes and ears               77        13       0.8556
rind and juice              27         5       0.8438   2nd
food and wine               74        14       0.8409   1st
height and weight           37         7       0.8409
date and time               59        13       0.8194   1st
meat and fish               29         7       0.8056   1st
maths and science           33         8       0.8049
road and rail               65        16       0.8025   1st
boys and girls             337        84       0.8005
trees and shrubs           131        33       0.7988
coal and steel              49        13       0.7903
men and boys                29         8       0.7838   2nd
war and peace               69        21       0.7667   2nd
arm and leg                 29         9       0.7632   2nd
north and west             111        35       0.7603   2nd
peas and beans              25         8       0.7576
day and night              309       101       0.7537   2nd
hair and eyes               28        10       0.7368   2nd
south and east             118        43       0.7329   2nd
north and east              67        26       0.7204   2nd
space and time             158        64       0.7117   1st
eyes and nose               27        11       0.7105
rats and mice               34        15       0.6939
shoes and socks             36        16       0.6923   2nd
south and west              72        37       0.6606   2nd
doom and gloom              58        30       0.6591
eyes and mouth              27        15       0.6429   2nd
plants and trees            26        17       0.6047   1st
legs and feet               27        18       0.6000   2nd
size and shape              61        42       0.5922   2nd
nose and mouth              32        25       0.5614   2nd
sea and air                 27        27       0.5000   2nd

Let us first simply look at the number of cases for which the claim is true or false.
First, there are 27 cases where both words end in consonants of the same
sonority; these can be discarded, as the hypothesis says nothing about them. This
leaves 36 cases where the first word's final consonant is less sonorant (as
predicted), and 30 cases where the second word's final consonant is less sonorant
(counter to the prediction). As Table 10-14a shows, this difference is not
significant (χ² = 0.54, df = 1, p > 0.05).

Table 10-14a. Sonority of the final consonant and word order in binomials

                        LESS SONOROUS FINAL CONSONANT
                        FIRST WORD       SECOND WORD      Total
BINOMIALS      Obs:     36               30               66
               Exp:     33               33
               χ²-Cp.:  0.27             0.27

However, note that we are including both cases with a very high degree of
frozenness (like arts and crafts, rise and fall, heart and soul) and cases with a
relatively low degree of frozenness (like nose and mouth, sea and air): this will
dilute our results, as the cases with low frozenness are not actually predicted to
adhere to the less-before-more-sonorant hypothesis. We could, of course, limit our
analysis to cases with a high degree of frozenness, say, above 90 per cent (the
data is available, so you might want to try). However, it would be even better to
keep all our data and make use of the rank order that the frozenness measure
provides: the prediction would be that cases with a high frozenness rank would
be more likely to adhere to the sonority constraint than those with a low
frozenness rank. Table 10-13 contains all the data we need to determine the
median rank of words adhering or not adhering to the constraint, as well as the
rank sums and number of cases, which we need to calculate a U-test. We will not
go through the test step by step (but you can try for yourself if you want to).
Table 10-14b provides the necessary values derived from Table 10-13 (again,
binomials with equal sonority values for the first and second word were ignored).

Table 10-14b. Sonority of the final consonant and word order in binomials

                       FINAL CONSONANT OF        FINAL CONSONANT OF
                       1ST WORD LESS SONOROUS    2ND WORD LESS SONOROUS
Median                 23.5                      47.5
Rank Sum               942.5                     1268.5
No. of Data Points     36                        30

The binomials adhering to the less-before-more-sonorant constraint have a much
higher median frozenness rank than those not adhering to the constraint: in
other words, binomials with a high degree of frozenness tend to adhere to the
constraint, while binomials with a low degree of frozenness do not. The U-test
shows that the difference is highly significant (U = 276.5, N1 = 36, N2 = 30,
p < 0.001).
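The U statistic and its significance can be recomputed from the rank sums and group sizes reported above. The sketch below uses the normal approximation to the distribution of U, without a correction for ties, so the p-value is approximate:

```python
import math

# Rank sums and group sizes for the two groups of binomials
r1, n1 = 942.5, 36     # first word's final consonant less sonorous (adhering)
r2, n2 = 1268.5, 30    # second word's final consonant less sonorous

u1 = r1 - n1 * (n1 + 1) / 2           # U statistic for the first group
u = min(u1, n1 * n2 - u1)             # the smaller of the two U values
mu = n1 * n2 / 2                      # mean of U under the null hypothesis
sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
z = (u - mu) / sigma                  # normal approximation (no tie correction)
p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
print(u, round(z, 2), p)              # U = 276.5, z ≈ -3.39, p < 0.001
```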
This case study was intended to show that sometimes we can derive different
kinds of quantitative predictions from a hypothesis, but that one of them more
accurately reflects the hypothesis; it was also meant to provide an example of a
corpus-based design where it is more useful to operationalize one of the
constructs (Frozenness) as an ordinal, rather than a nominal variable. More
generally, it was meant to demonstrate how phonology can interact with
grammatical variation (or, in this case, the absence of variation); cf. Schlüter 2003
for another example, and cf. Lohmann (2013) for a comprehensive study of
binomials.

10.2.4.3 Case study: Horror aequi. In a number of studies, Rohdenburg (e.g. 1995,
2003) has studied the influence of contextual (and, consequently, conceptual)
complexity on grammatical variation. The general idea is that, in contexts that are
already complex, speakers will try to choose a variant that reduces (or at least
does not contribute to) this complexity. A particularly striking example of this is
what Rohdenburg (adapting a term from Karl Brugmann) calls the horror aequi
principle: the widespread (and presumably universal) tendency to avoid the
repetition of identical and adjacent grammatical elements or structures
(Rohdenburg 2003: 206).


For example, Rohdenburg shows (on the basis of the text of an 18th century
novel) that verbs which normally occur alternatively with a to-clause or an ing-
clause may prefer an ing-clause in contexts where they occur as to-infinitives
themselves. Take the verb start: in the past tense, it may take either a to-infinitive
(as in 10.4a) or an ing-form (as in 10.4b); but as a to-infinitive it would avoid the
to-infinitive (although not completely, as 10.4c shows), and strongly prefer the
ing-form (as in 10.4d):

(10.4a) I started to think about my childhood again [BNC A0F]
(10.4b) So I started thinking about alternative ways to earn a living [BNC C9H]
(10.4c) in the future they will actually have to start to think about a fairer
electoral system [BNC JSG]
(10.4d) He will also have to start thinking about a partnership with Skoda before
[BNC A6W]

Impressionistically, this seems to be true: in the BNC there are 11 cases of started
to think about, 18 of started thinking about, only one of to start to think about but
35 of to start thinking about.


Let us attempt a more comprehensive analysis and look at all cases of the verb
start with a clausal complement in the BNC. Since we are interested in the
influence of the tense/aspect form of the verb start on complementation choice,
let us distinguish the inflectional forms start (base form), starts (3rd person),
started (past tense/past participle) and starting (present participle); let us further
distinguish cases of the base form start with and without the infinitive marker to.
Table 10-15 lists the frequencies of to- and ing-complements for each of these
forms, together with the expected frequencies and the chi-square components.

Table 10-15. Complementation of start and the horror aequi principle

                                  TYPE OF COMPLEMENT CLAUSE
FORM OF THE MATRIX VERB           TO-CLAUSE      ING-CLAUSE     Total
start (without to)     Obs:       1326           2503           3829
                       Exp:       1751.13        2077.87
                       χ²:        103.21         86.98
to start               Obs:       65             1050           1115
                       Exp:       509.93         605.07
                       χ²:        388.21         327.17
starts                 Obs:       539            470            1009
                       Exp:       461.45         547.55
                       χ²:        13.03          10.98
started                Obs:       3257           3164           6421
                       Exp:       2936.54        3484.46
                       χ²:        34.97          29.47
starting               Obs:       896            31             927
                       Exp:       423.95         503.05
                       χ²:        525.61         442.96
Total                             6083           7218           13301
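The expected frequencies and chi-square components in Table 10-15 follow mechanically from the row and column totals. A sketch of the calculation:

```python
# Observed to- vs. ing-complement frequencies for each form of start (Table 10-15)
observed = {
    "start (without to)": (1326, 2503),
    "to start":           (65, 1050),
    "starts":             (539, 470),
    "started":            (3257, 3164),
    "starting":           (896, 31),
}

col_totals = [sum(row[i] for row in observed.values()) for i in (0, 1)]
n = sum(col_totals)

components = {}
for form, row in observed.items():
    row_total = sum(row)
    for comp_type, obs, col_total in zip(("to", "ing"), row, col_totals):
        exp = row_total * col_total / n      # expected frequency under independence
        components[form, comp_type] = (exp, (obs - exp) ** 2 / exp)

for (form, comp_type), (exp, chi) in components.items():
    print(f"{form:20s} {comp_type:3s} E = {exp:8.2f}  chi-sq component = {chi:7.2f}")
```

The two largest components are exactly the horror aequi cells: to start with a to-clause and starting with an ing-clause.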

The most obvious and most significant deviation from the expected frequencies is
indeed in the case where the matrix verb start has the same form as the
complement clause: there are far fewer cases of to start to VERB and starting
VERBing than expected. Interestingly, if the base form of start does not occur with
an infinitive particle, the to-complement is still fairly strongly avoided in favor of
the ing-complement. It seems as though horror aequi is a graded principle: the
stronger the similarity, the stronger the avoidance.
This case study is intended to introduce the notion of horror aequi, which has
been shown to influence a number of grammatical and morphological variation
phenomena (cf. e.g. Rohdenburg 2003, Vosberg 2003, Rudanko 2003, Gries and
Hilpert 2010). Methodologically, it is a straightforward application of the chi-
square test, but one where the individual chi-square components are more
interesting than the question whether the observed distribution as a whole differs
significantly from the expected one.

10.2.4.4 Case study: Synthetic and analytic comparatives and persistence. The horror
aequi principle has been plausibly demonstrated to have an effect on certain
types of variation and it can plausibly be explained as a way of reducing
complexity (if the same structure occurs twice in a row, this might cause
problems for language processing). However, there is another well-known
principle that is, in a way, the opposite of horror aequi: priming. We know from
psycholinguistic experiments that it is easier to activate the linguistic knowledge
associated with a particular linguistic unit if it was already active shortly before.
This has been argued to lead to so-called persistence effects, where a word or
structure that has been introduced in the discourse is reused within a short span.
For example, Szmrecsanyi (2005) argues that persistence is one of the factors
influencing the choice of adjectival comparison. While most adjectives require
either synthetic comparison (further, but not *more far) or analytic comparison
(more distant, but not *distanter), some vary quite freely between the two
(friendlier/more friendly). Szmrecsanyi hypothesizes that one of many factors
influencing the choice is the presence of an analytic comparative in the directly
preceding context: if speakers have used an analytic comparative (for example,
with an adjective that does not allow anything else), they are more likely to use it
with an adjective that would theoretically also allow a synthetic comparison.
Szmrecsanyi shows that this is the case, but that it depends crucially on the
distance between the two instances of comparison: the persistence is rather
short-lived. Let us replicate his findings in a small analysis focusing only on
persistence and disregarding other factors.
As a sample, I decided to use adjectives with a relatively even distribution of
the two strategies: the less frequent one had to account for at least forty percent
in the BNC. Also, the adjective had to occur in the comparative at least twenty
times. Finally, because length and stress are known to have an effect, the adjective
had to be bisyllabic with the stress on the first syllable (this is the case for most
adjectives that vary between the two strategies). This left just six adjectives:
angry, empty, friendly, lively, risky and sorry. I extracted all comparative forms
and checked the preceding context (20 words) for the presence of an additional
analytic comparative.
Table 10-16a shows the distribution of analytic comparatives in this context
across the two conditions (the 28 cases that do contain an analytic comparative
are listed in Appendix D.3).

Table 10-16a. Analytic comparatives in a context of 20 words preceding analytic
and synthetic comparatives (BNC).

                                          COMPARATIVE STRATEGY
PRECEDING CONTEXT                         ANALYTIC       SYNTHETIC      Total
CONTAINS ANALYTIC COMP.        Obs:       16             12             28
                               Exp:       13.81          14.19
DOES NOT CONTAIN AN. COMP.     Obs:       205            215            420
                               Exp:       207.19         212.81
Total                                     221            227            448


There is only a very small difference between the two conditions, which is not
significant (χ² = 0.73, df = 1, p > 0.05).
However, the context within which an analytic comparative has to occur to be
counted is quite large, which means that in some cases, the comparative is very
far away from the critical item (cf. 10.5a), while in other cases it is very close (cf.
10.5b):
(10.5a) But the statistics for the second quarter, announced just before the
October Conference of the Conservative Party, were even more damaging
to the Government showing a rise of 17 per cent on 1989. Indeed these
figures made even sorrier reading for the Conservatives when one
realised [BNC G1J]
(10.5b) Over the next ten years, China will become economically more liberal,
internationally more friendly [BNC ABD]

Obviously, we would expect a much stronger priming effect in a situation like
that in (10.5b), where one word intervenes between the two comparatives, than in
a situation like (10.5a), where 17 words (and a sentence boundary) intervene. So it
is not surprising to see persistence of the analytic comparative in (10.5b) but not
in (10.5a).
Let us therefore restrict the context in which we count analytic comparatives
to a size more likely to show persistence. I have chosen 7 words (based on the
factoid that short-term memory can hold up to seven units). Table 10-16b shows
the distribution of analytic comparatives across the two conditions in this smaller
window.

Table 10-16b. Analytic comparatives in a context of 7 words preceding analytic
and synthetic comparatives (BNC).

                                          COMPARATIVE STRATEGY
PRECEDING CONTEXT                         ANALYTIC       SYNTHETIC      Total
CONTAINS ANALYTIC COMP.        Obs:       13             0              13
                               Exp:       6.41           6.59
DOES NOT CONTAIN AN. COMP.     Obs:       208            227            435
                               Exp:       214.59         220.41
Total                                     221            227            448

We see that all of the analytic comparatives preceding synthetic ones disappear
when we restrict the window in this way, while the number of analytic
comparatives preceding analytic comparatives remains roughly the same. The
difference between the two conditions is now statistically highly significant (χ² =
13.75, df = 1, p < 0.001).
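The chi-square values for both window sizes can be recomputed from the observed frequencies in Tables 10-16a and 10-16b with a small helper function:

```python
def chisq_2x2(table):
    """Chi-square test statistic of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chisq = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # expected under independence
            chisq += (obs - exp) ** 2 / exp
    return chisq

# Table 10-16a: 20-word window; Table 10-16b: 7-word window
print(round(chisq_2x2([[16, 12], [205, 215]]), 2))   # → 0.73
print(round(chisq_2x2([[13, 0], [208, 227]]), 2))    # → 13.75
```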


This case study is intended to introduce the corpus-based study of
grammatical priming or persistence. Szmrecsanyi (2006) is a comprehensive
treatment of the topic, but see also Gries (2005) and Weiner and Labov (1983); see
also Ferreira and Bock (2006) for a theoretical discussion. It also demonstrates
that details of the research design, such as the size of the search window, may
have substantial consequences for the results.

10.2.5 Variation and change

10.2.5.1 Case study: Dialects. Obviously, varieties of English may differ in aspects
of their grammar just as they differ in their vocabulary, and these differences can
be studied in the same way. For example, Rohdenburg (2009) investigates the
hypothesis that American English tends to prefer transitive verbs where British
English uses the same verb as a complex intransitive (for example to protest
something vs. to protest against something). He tests this hypothesis for a number
of verbs from different semantic fields in a corpus of British and American
newspapers and finds the tendency confirmed.
Let us replicate the study for two selected verbs using the much smaller, but
more diverse LOB/FLOB and BROWN/FROWN corpora: fight and protest. Each of
these verbs allows both simple transitive and complex intransitive uses (in
addition to intransitive uses, which we will ignore here but which do not differ
significantly across varieties). Consider fight in (10.6):

.1
(10.6a) These people knocked each other about for a while but united to fight the Huns
(10.6b) soldiers who had gone to fight against Germans had turned into gardeners in uniform.
Table 10-17a shows the distribution of these two variants in British and American English (not counting uses with a cognate direct object, such as fight a war). There is no significant difference between the two varieties (χ² = 0.83, df = 1, p > 0.05), and the very small difference that does exist goes against the hypothesis.
Table 10-17a. Transitive and complex intransitive uses of fight in British and American English

                                    COMPLEMENTATION
                                    TRANSITIVE    COMP. INTRANS.   Total
VARIETY   BRITISH ENGLISH           Obs: 47       Obs: 6           53
                                    Exp: 45.19    Exp: 7.81
          AMERICAN ENGLISH          Obs: 63       Obs: 13          76
                                    Exp: 64.81    Exp: 11.19
          Total                     110           19               129
Next, consider protest in (10.7):

(10.7a) I fantasized about being at Berkeley, protesting the war.
(10.7b) We decided to protest against the violent actions here in the past week
Table 10-17b shows the distribution of the two variants across British and

263
10. Grammar

American English. In this case, the distribution is significant (χ² = 6.56, df = 1, p < 0.05).
Table 10-17b. Transitive and complex intransitive uses of protest in British and American English

                                    COMPLEMENTATION
                                    TRANSITIVE    COMP. INTRANS.   Total
VARIETY   BRITISH ENGLISH           Obs: 2        Obs: 10          12
                                    Exp: 5.38     Exp: 6.62
          AMERICAN ENGLISH          Obs: 11       Obs: 6           17
                                    Exp: 7.62     Exp: 9.38
          Total                     13            16               29
In order to show that the preference for transitive and complex intransitive verbs respectively is a difference that characterizes American and British English, it would be necessary to look at a large number of verbs (which Rohdenburg does), using a representative corpus (which Rohdenburg does not, for lack of large comparable corpora). If and when the American National Corpus (see Appendix A.1) becomes available, the corpus-based study of transitivity and other grammatical differences could become a very dynamic field of research. Even without large comparable corpora, a number of studies on such differences have been conducted (cf., e.g., Leech 1999, Mair 2003, Tottie and Hoffmann 2006, Collins 2007, and the references at the end of Case Study 10.2.5.4 below).

10.2.5.2 Case study: Sex. Grammatical differences may also exist between varieties spoken by subgroups of speakers defined by demographic variables, for example, when the speech of younger speakers reflects recent changes in the language, or when speakers from different educational or economic backgrounds speak different established sociolects. Even more likely are differences in usage preference. For example, R. Lakoff (1973) claims that women make more intensive use of tag questions than men. Mondorf (2004) investigates this claim in detail on the basis of the London-Lund Corpus, which is annotated for intonation among other things. Mondorf's analysis not only confirms the claim that women use tag questions more frequently than men, but also shows qualitative differences in terms of their form and function.
This kind of analysis requires very careful, largely manual data extraction and


annotation, so it is limited to relatively small corpora, but let us see what we can do in terms of a larger-scale analysis. Let us focus on tag questions with negative polarity containing the auxiliary be (e.g. isn't it, wasn't she, am I not, was it not). These can be extracted relatively straightforwardly even from an untagged corpus using the following queries:
(10.8a) (am|are|is|was|were) n't (I|you|he|she|it|we|they) [.,]

(10.8b) (am|are|is|was|were) (I|you|he|she|it|we|they) not [.,]
The query in (10.8a) will find all finite forms of the verb to be (as non-finite forms cannot occur in tag questions), followed by the negative clitic n't, followed by a pronoun; the query in (10.8b) will do the same thing for the full form of the particle not, which then follows rather than precedes the pronoun. Both queries will only find those cases that occur before a punctuation mark signaling a clause boundary (what to include here will depend on the transcription conventions of the corpus, if it is a spoken one).
The queries are meant to work for the spoken portion of the BNC, which uses the comma for all kinds of things, including hesitation or incomplete phrases, so we have to make a choice whether to exclude the comma and increase the precision or to include it and increase the recall (I will choose the latter option). The queries are not perfect yet: British English also has the form ain't it, so we might want to include the query ai n't (I|you|he|she|it|we|they); however, ain't can stand for be or for have, which lowers the precision somewhat. Finally, there is also the form innit in (some varieties of) British English, so we might want to include in n it; however, this is an invariant form that can occur with any verb or auxiliary in the main clause, so it will decrease the precision even further. We will ignore ain't and innit here (they are not particularly frequent and hardly change the results reported below).
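The queries in (10.8a, b) translate straightforwardly into regular expressions. The following Python sketch is merely illustrative: the sample string is invented, and it assumes a corpus format in which clitics are tokenized as separate tokens (as in the BNC, where isn't appears as is n't):

```python
import re

FORMS = r"(?:am|are|is|was|were)"        # finite forms of *be*
PRONOUNS = r"(?:I|you|he|she|it|we|they)"

query_a = re.compile(rf"{FORMS} n't {PRONOUNS} [.,]")   # (10.8a)
query_b = re.compile(rf"{FORMS} {PRONOUNS} not [.,]")   # (10.8b)

sample = "that 's nice , is n't it . you were there , were you not ."
print(query_a.findall(sample))  # one hit: is n't it .
print(query_b.findall(sample))  # one hit: were you not .
```

Note that the patterns use non-capturing groups, so findall returns the full match rather than the last group; on a real corpus one would apply them line by line to the tokenized text.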
In the part of the spoken BNC annotated for speaker sex, there are 3 801 hits for the patterns in (10.8a, b) for female speakers (only 13 of which were for 10.8b), and 3 109 hits for male speakers (only 27 of which were for 10.8b). Of course, we cannot assume that there is an equal amount of male and female speech in the corpus, so the question is what to compare these frequencies against. Obviously, such tag questions will normally occur in declarative sentences with positive polarity containing a finite form of be. Of course, such sentences cannot be counted easily, but we can estimate their number. First, we can search for finite forms of be that are not followed by a negative clitic (is n't) or particle (is not), or by an adverb and a negative particle (is just/obviously/… not). There are 66 849

such occurrences for female speakers and 113 650 for male speakers. Next, we can count the number of interrogative sentences in a sample of 100 hits. In three separate samples, I found eight to ten questions, so let us take the highest value to be sure, and adjust the number of hits by subtracting ten percent; this gives us estimates of 60 164 finite declarative positive-polarity clauses with be for female speakers and 102 285 for male speakers.
These numbers can now be used to compare the number of tag questions against, as shown in Table 10-18. Since the tag questions that we found using our queries have negative polarity, they are not included in the sample, but must occur as tags to a subset of the sentences. This means that by subtracting the number of tag questions from the total for each group of speakers, we get the number of sentences without tag questions.
v0
Table 10-18. Negative polarity tag questions in male and female speech in the spoken BNC.

                              TAG QUESTION
                              PRESENT          ABSENT            Total
SPEAKER SEX   FEMALE          Obs: 3 801       Obs: 56 363       60 164
                              Exp: 2 559.16    Exp: 57 604.84
              MALE            Obs: 3 109       Obs: 99 176       102 285
                              Exp: 4 350.84    Exp: 97 934.16
              Total           6 910            155 539           162 449
The difference between male and female speakers is highly significant, with female speakers using substantially more tag questions than expected, and male speakers using substantially fewer (χ² = 999.57, df = 1, p < 0.001).
This case study was intended to introduce the study of sex-related differences in grammar (or grammatical usage); cf. Mondorf (2004) for additional studies and an overview of the literature. It was also intended to demonstrate the kinds of steps necessary to extract the required frequencies for a grammatical research question from an untagged corpus, and the ways in which they might be estimated if they cannot be determined precisely. Of course, these steps and considerations depend to a large extent on the specific phenomenon under investigation; one reason for choosing tag questions with be is that they, and the sentences against which to compare their frequency, are much easier to extract from an untagged corpus than is the case for tag questions with have or, worst of all, do (think about all the


problems these would cause).

10.2.5.3 Case study: Language change. Grammatical differences between varieties of a language will generally change over time: they may increase, as speech communities develop separate linguistic identities or even lose contact with each other, or they may decrease, e.g. through mutual influence. For example, Berlage (2009) studies word-order differences in the placement of adpositions in British and American English, focusing on notwithstanding as a salient member of a group of adpositions that can occur as both pre- and postpositions in both varieties. Two larger questions that she attempts to answer are, first, the diachronic development and, second, the interaction of word order and grammatical complexity. She finds that the prepositional use is preferred in British English (around two thirds of all uses are prepositions in present-day British newspapers) while American English favors the postpositional use (more than two thirds of occurrences in American English are postpositions). She shows that the postpositional use initially accounted for around a quarter of all uses but then almost disappeared in both varieties; its re-emergence in American English is a recent development (the convergence and/or divergence of British and American English has been intensively studied by Hundt, e.g. 1997, 2009).
The basic design with which to test the convergence or divergence of two varieties with respect to a particular feature is a multivariate one with VARIETY and PERIOD as independent variables and the frequency of the feature as a dependent one. Let us try to apply such a design to notwithstanding using the LOB, BROWN, FLOB and FROWN corpora (one British and one American corpus each from the 1960s and the 1990s). Note that these corpora are rather small and 30 years is not a long period of time, so we would not necessarily expect results even if the hypothesis were true that American English reintroduced postpositional notwithstanding in the 20th century (it probably is true, as Berlage shows on a much larger data sample from different sources).
Notwithstanding is a relatively infrequent adposition: there are only 36 cases in the four corpora combined. Table 10-19a shows their distribution across the eight conditions.


Table 10-19a. Notwithstanding as a preposition and postposition in British and American English

                          1961                         1991
                  BRIT.     AMER.    Total     BRIT.     AMER.    Total    Total
PREPOSITION       O: 12     O: 7     19        O: 3      O: 4     7        26
                  E: 6.74   E: 8.43            E: 4.81   E: 6.02
POSTPOSITION      O: 0      O: 2     2         O: 1      O: 7     8        10
                  E: 2.59   E: 3.24            E: 1.85   E: 2.31
Total             12        9        21        4         11       15       36

.1
While the prepositional use is more frequent in both corpora from 1961, the
postpositional use is the more frequent one in the American English corpus from
1991. A CFA shows, that the intersection 1991 AM.ENGL POSTPOSITION is the only
v0
one whose observed frequencies dier signicantly from the expected (cf. Table
10-19b).

Table 10-19b. Configural frequency analysis for Table 10-19a.

Intersection                 Observed   Expected   Type   Chi-sq.   p        Sig.
1961  UK   PREPOSITION       12         6.74       +      4.10      0.0428   n.s.
           POSTPOSITION      0          2.59       −      2.59      0.1074   n.s.
      US   PREPOSITION       7          8.43       −      0.24      0.6233   n.s.
           POSTPOSITION      2          3.24       −      0.48      0.4907   n.s.
1991  UK   PREPOSITION       3          4.81       −      0.68      0.4082   n.s.
           POSTPOSITION      1          1.85       −      0.39      0.5313   n.s.
      US   PREPOSITION       4          6.02       −      0.68      0.4106   n.s.
           POSTPOSITION      7          2.31       +      9.48      0.0021   *
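The expected frequencies and component chi-square values in Table 10-19b can be recomputed from the observed frequencies alone, since a CFA of this kind derives its expected frequencies from the marginal totals under the assumption that the three factors are mutually independent. A Python sketch (the data layout is my own; the arithmetic is the standard one):

```python
from itertools import product

# Observed frequencies from Table 10-19a: (period, variety, position) -> count
obs = {
    ("1961", "UK", "prep"): 12, ("1961", "UK", "post"): 0,
    ("1961", "US", "prep"): 7,  ("1961", "US", "post"): 2,
    ("1991", "UK", "prep"): 3,  ("1991", "UK", "post"): 1,
    ("1991", "US", "prep"): 4,  ("1991", "US", "post"): 7,
}
n = sum(obs.values())

def margin(axis, value):
    """Marginal total of one factor level, e.g. margin(0, "1961") = 21."""
    return sum(freq for key, freq in obs.items() if key[axis] == value)

results = {}
for cell in product(("1961", "1991"), ("UK", "US"), ("prep", "post")):
    period, variety, position = cell
    # expected frequency under mutual independence of the three factors
    expected = n * (margin(0, period) / n) * (margin(1, variety) / n) \
                 * (margin(2, position) / n)
    chisq = (obs[cell] - expected) ** 2 / expected
    results[cell] = (round(expected, 2), round(chisq, 2))

print(results[("1991", "US", "post")])  # → (2.31, 9.48), the only significant cell
```

Note that the expected frequencies here are not the same as those of a simple two-dimensional chi-square test: each one is the product of three marginal proportions, which is what makes this a three-factor CFA.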
Due to the small number of cases, we would be well advised not to place too much confidence in our results, but note that they fully confirm Berlage's claims that British English prefers the prepositional use and American English has recently begun to prefer the postpositional use.
This case study is intended to provide a further example of a multivariate design and to show that even small data sets may provide evidence for or against a hypothesis. It is also intended to introduce the study of the convergence and/or divergence of varieties and the basic design required. This field of study is of interest especially in the case of pluricentric languages like, for example, English, Spanish or Arabic (cf. Rohdenburg and Schlüter 2009, from which Berlage's study


is taken, for a broad, empirically founded introduction to the contrastive study of British and American English grammar; see also Leech and Kehoe 2006).
10.2.5.4 Case study: Grammaticalization. One of the central issues in grammaticalization theory is the relationship between grammaticalization and discourse frequency. Very broadly, the question is whether a rise in discourse frequency is a precondition for (or at least a crucial driving force in) the grammaticalization of a structure or whether it is a consequence.
Since corpora are the only source for the identification of changes in discourse frequency, this is a question that can only be answered using corpus-linguistic methodology. An excellent example is Mair (2004), which looks at a number of grammaticalization phenomena to answer this and other questions.
He uses the OED's citation database as a corpus (not just the citations given in the OED's entry for going to, but all citations used in the entire OED). It is an interesting question to what extent such a citation database can be regarded as a corpus. One argument against doing so is that it is an intentional selection of certain examples over others and thus may not yield an authentic picture of a given phenomenon. However, as Mair points out, the vast majority of examples of a given phenomenon X will occur in citations that were collected to illustrate other phenomena, so they should constitute random samples with respect to X. The advantage of citation databases for historical research is that the sources for citations will have been carefully checked and very precise information will be available as to their year of publication, the author, etc.
As an example of this method, consider the going-to future. It is relatively easy to determine the point by which, at the latest, the sequence [going to Vinf] was established as a future marker. In the literature on going to, the following example from the 1482 Revelation to the Monk of Evesham is considered the first documented use with a future meaning (it is also the first citation in the OED):

(10.9) Therefore while thys onhappy sowle by the vyctoryse pompys of her enmyes was goyng to be broughte into helle for the synne and onleful lustys of her body.

Mair also notes that it is mentioned in grammars from 1646 onward; at the very latest, then, it was established at the end of the 17th century.
However, as Figure 10-1 shows, there was only a small rise in frequency during the time that the construction became established, but a substantial jump in frequency afterwards (the dashed line shows Mair's conservative estimate for


the point at which the construction was firmly established as a way to express future tense).

Figure 10-1. Grammaticalization and discourse frequency of going to (redrawn from Mair 2004). [figure not reproduced]
Interestingly, around the time of that jump, we also find the first documented instances of the contracted form gonna. These results suggest that semantic reanalysis is the first step in grammaticalization, followed by a rise in discourse frequency accompanied by phonological reduction.
This case study is intended to show that very large collections of citations may be used as a corpus under certain circumstances. It also demonstrates the importance of corpora in diachronic research, which was, of course, always text-based, but which deals with many questions that require not just texts, but large corpora.


10.2.6 Grammar and cultural analysis

Like words, grammatical structures usually represent themselves in corpus-linguistic studies: they are either investigated as part of a description of the syntactic behavior of lexical items or they are investigated in their own right in order to learn something about their semantic, formal or functional restrictions. However, like words, they can also be used as representatives of some aspect of the speech community's culture, specifically, a particular culturally defined scenario. To take a simple example: if we want to know what kinds of things are transferred between people in a given culture, we may look at the theme arguments of ditransitive constructions in a large corpus; if we want to know how particular things are transferred, we may look for collocates in the verb and theme positions of the ditransitive (cf. Stefanowitsch and Gries 2009). In this way, grammatical structures can become diagnostics of culture. Again, care must be taken to ensure that the link between a grammatical structure and a putative scenario is plausible.

10.2.6.1 Case study: He said, she said. In a paper on the media representation of men and women, Caldas-Coulthard (1993) finds that men are quoted vastly more frequently than women in the COBUILD corpus (cf. also Chapter 11). She also notes in passing that the verbs of communication used to introduce or attribute the quotes differ: both men's and women's speech is introduced using general verbs of communication, such as say or tell, but with respect to more descriptive verbs, there are differences: men shout and groan, while women (and children) scream and yell (Caldas-Coulthard 1993: 204).
The construction [QUOTE + Subj + V] is a perfect example of a diagnostic for a cultural frame: it is routinely used (in written language) to describe a speech event. Crucially, the verb slot offers an opportunity to introduce additional information, such as the manner of speaking, as in the manner verbs just mentioned (which often contain evaluations), but also the type of speech act being performed (ask, order), etc. It is also easy to find even in an untagged corpus, since it includes (by definition) a passage of direct speech surrounded by quotation marks, a subject that is, in an overwhelming number of cases, a pronoun, and a verb (or verb group), typically in that order. In a written corpus, we can thus query the sequence [quotation mark + pronoun + verb] to find the majority of examples of the construction. In order to study differences in the representation of men and women, we can query the pronouns he and she separately to obtain representative samples of male and female speech act events without any coding.


This design can be applied deductively, if we have hypotheses about the gender-specific usage of particular (sets of) verbs, or inductively, if we simply calculate the association strength of all verbs to one pronoun as compared to the other. In either case we have two nominal variables: SUBJECT OF QUOTED SPEECH, with the values HE and SHE, and SPEECH ACTIVITY VERB, with all occurring verbs as its values. Table 10-20 shows the results of an inductive application of the design to the BNC.
Table 10-20. Verbal collexemes of [QUOTE + Pron + V] (BNC)

COLLOCATE   CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS   G²
            WITH HE         WITH SHE        WITH HE       WITH SHE
MOST STRONGLY ASSOCIATED WITH HE
say         17794           10936           16687         13235         230.32
growl       153             4               34328         24167         132.65
drawl       170             10              34311         24161         121.39
write       209             43              34272         24128         68.27
grate       80              4               34401         24167         59.99
rasp        77              7               34404         24164         46.08
snarl       79              13              34402         24158         32.08
continue    311             123             34170         24048         31.25
roar        48              5               34433         24166         26.76
murmur      617             306             33864         23865         25.75
MOST STRONGLY ASSOCIATED WITH SHE
whisper     361             625             34120         23546         199.72
cry         203             379             34278         23792         136.24
manage      18              88              34463         24083         78.70
snap        161             260             34320         23911         72.41
retort      73              153             34408         24018         64.71
protest     60              132             34421         24039         59.47
ask         2007            1757            32474         22414         49.13
exclaim     128             194             34353         23977         47.44
wail        7               45              34474         24126         46.17
deny        17              58              34464         24113         40.66
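The G² values in Table 10-20 are log-likelihood ratios computed over a two-by-two table for each verb: its frequency with he and with she against the frequencies of all other verbs in the pattern. A Python sketch of the standard formula, shown here with the frequencies for growl from the table:

```python
from math import log

def g_squared(a, b, c, d):
    """Log-likelihood ratio (G2) for the 2-by-2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    total = 0.0
    for observed, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                               (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / n
        if observed > 0:  # zero cells contribute nothing to the sum
            total += observed * log(observed / expected)
    return 2 * total

# growl: 153 with he, 4 with she; other verbs: 34328 with he, 24167 with she
print(round(g_squared(153, 4, 34328, 24167), 2))  # → 132.65
```

Applied to each verb in turn, this reproduces the G² column of the table.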
There is a clear difference that confirms Caldas-Coulthard's casual observation: the top ten verbs of communication associated with men contain five verbs conveying a rough, unpleasant and/or aggressive manner of speaking (growl, grate, rasp, snarl, roar), while those for women only include one (snap, related to


irritability rather than outright aggression). Interestingly, two very general communication verbs, say and write, are also typical for men's reported speech. Women's speech is introduced by verbs conveying weakness or communicative subordination (whisper, cry, manage, protest, wail and deny).

10.2.7 Grammar and counterexamples

While this book focuses on quantitative designs, non-quantitative designs are possible within the general framework adopted. Chapter 3, Section 3.2.2 included a discussion of counterexamples and their place in a scientific framework for corpus linguistics.

.1
10.2.7.1 Case study: to- vs. that-complements. A good case study for English that is based largely on counterexamples is Noël (2003), who looks at a number of claims made about the semantics of infinitival complements as compared to that-clauses. He takes claims made by other authors based on their intuition and treats them like Popperian hypotheses, searching the BNC for counterexamples. He mentions more or less informal impressions about frequencies, but only to clarify that the counterexamples are not just isolated occurrences that could be explained away. For example, he takes the well-known claim that with verbs of knowing, infinitival complements present knowledge as subjective/personal, while that-clauses present knowledge as objective/impersonal/public. This is supposed to explain acceptability judgments like the following (Borkin 1973: 45–46; Wierzbicka 1988: 50, 136):
(10.10a) He found her to be intelligent.
(10.10b) *I bet that if you look in the files, you'll find her to be Mexican.
(10.10c) I bet that if you look in the files, you'll find that she is Mexican.
The crucial counterexample here would be one like (10.10b), with an infinitival complement that expresses knowledge that is public rather than personal/experiential; also of interest would be examples with that-clauses that express personal/experiential knowledge. The corresponding queries are easy enough to define:
(10.11a) (find|finds|finding|found) (me|you|him|her|it|us|them) to be

(10.11b) (find|finds|finding|found) that (I|you|he|she|it|we|they) (is|are|was|were)

This query follows the specific example in (10.10b) very narrowly; we could of course define a broader one that would capture, for example, proper names and noun phrases in addition to pronouns, but remember that we are looking for counterexamples: if we can find these with a query following the structure of supposedly non-acceptable sentences very closely, they will be all the more convincing.
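Like the tag-question queries above, (10.11a, b) can be run as regular expressions. The sketch below uses invented sentences merely to show what each query does and does not match:

```python
import re

q_to = re.compile(r"\b(?:find|finds|finding|found) "
                  r"(?:me|you|him|her|it|us|them) to be\b")                # (10.11a)
q_that = re.compile(r"\b(?:find|finds|finding|found) that "
                    r"(?:I|you|he|she|it|we|they) (?:is|are|was|were)\b")  # (10.11b)

sentences = [
    "He found her to be intelligent.",     # matched by (10.11a)
    "You'll find that she is Mexican.",    # matched by (10.11b)
    "We found the results convincing.",    # matched by neither query
]
for sentence in sentences:
    print(bool(q_to.search(sentence)), bool(q_that.search(sentence)))
```

On a real corpus, the hits would then be inspected manually for the subjective/objective distinction at issue.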
The BNC contains not just one, but many counterexamples. Here are some examples with that-complements expressing subjective, personal knowledge:

(10.12a) Erika was surprised to find that she was beginning to like Bach (BNC A7A)
(10.12b) [A]che of loneliness apart, I found that I was stimulated by the challenge of finding my way about this great and beautiful city. (BNC AMC)
And here are some with to-complements expressing objective, impersonal knowledge:

(10.13a) Li and her coworkers have been able to locate these sequence variations in the three-dimensional structure of the toxin, and found them to be concentrated in the sheets of domain II. (BNC ALV)
(10.13b) The visiting party, who were the first and last ever to get a good look at the crater of Perboewetan, found it to be about 1,000 metres in diameter and about fifty metres deep (BNC ASR)
These counterexamples (and others not cited here) in fact give us a new hypothesis as to what the specific semantic contribution of the to-complement may be: if used to refer to objective knowledge, it overwhelmingly refers to situations where this objective knowledge was not previously known to the participants of the situation described. In fact, if we extend our search for counterexamples beyond the BNC to the world-wide web, we find examples that are even more parallel to (10.10b), such as (10.13c):

(10.13c) Our more recent ancestors (since 1800) were named Burnaman and told
they were from Ireland. We met much resistance from the now living
older generation when we found them to be German and spelled
Bornemann in Germany and their early years in America. (female speaker
from Texas, 2008)


Again, what is different from the (supposedly unacceptable) example (10.10b) is that the knowledge in question was new (and surprising) to the participants of the situation. This observation could now be used as the basis for a new hypothesis concerning the difference between the two constructions, but even if it is not, or if this hypothesis turned out to be false, the counterexamples clearly disprove the claim by Wierzbicka and others concerning the subjective/objective distinction (Noël 2003 actually goes on to propose an information-structural account of the difference between the to- and the that-complement).
This case study was intended to show how counterexamples may play a role in disproving hypotheses based on introspection and constructed examples (see Meurers' (2005) and Meurers and Müller's (2009) work on German syntax as good examples of the theoretically informed search for counterexamples).

Further reading

Grammar is a complex phenomenon investigated from very different perspectives. This makes general suggestions for further reading difficult. It may be best to start with collections focusing on the corpus-based analysis of grammar, such as Rohdenburg and Mondorf (2003), Gries and Stefanowitsch (2006), Rohdenburg and Schlüter (2009) on differences between British and American English, and Lindquist and Mair (2004) on grammaticalization.

Rohdenburg, Günter & Britta Mondorf (eds.). 2003. Determinants of grammatical variation in English. (Topics in English Linguistics 43). Berlin & New York: Mouton de Gruyter.
Gries, Stefan Th. & Anatol Stefanowitsch (eds.). 2006. Corpora in Cognitive Linguistics: Corpus-based approaches to syntax and lexis. Berlin & New York: Mouton de Gruyter.
Rohdenburg, Günter & Julia Schlüter (eds.). 2009. One language, two grammars? Differences between British and American English. Cambridge & New York: Cambridge University Press.
Lindquist, Hans & Christian Mair (eds.). 2004. Corpus approaches to grammaticalization in English. (Studies in Corpus Linguistics 13). Amsterdam & Philadelphia: John Benjamins.

11. Morphology

The wordform-centeredness of most corpora and corpus-access tools, which requires a certain degree of ingenuity when studying structures larger than the word, is not a huge problem for corpus-based morphology, which studies structures smaller than the word. Corpus morphology is mostly concerned with the distribution of affixes (derivational ones, like the nominalizing suffixes -ness and -ity, and inflectional ones, like participial -ing or the plural -s). Retrieving all occurrences of an affix plausibly starts with the retrieval of all strings potentially containing this affix. We could retrieve all occurrences of -ness, for example, with a query like ".+ness". This will retrieve all occurrences of the suffix, so the recall will be 100 percent. The precision will not be 100 percent in most cases, as such a search will also retrieve words that just happen to end with the search string: in the case of -ness, words like witness, governess or place names like Inverness. The degree of precision will depend on how unique the search string is for the affix in question; for -ness and -ity it is fairly good, as there are only a few words that share the same string accidentally (examples like those just mentioned for -ness, and words like city and pity for -ity).
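In practice, this means pairing a maximally inclusive query with a manually compiled stop list of known false positives. A minimal Python sketch (the token list and stop list are invented for illustration):

```python
import re

query = re.compile(r".+ness$")                     # the query ".+ness"
stop_list = {"witness", "governess", "inverness"}  # known false positives

tokens = ["happiness", "witness", "darkness", "governess", "sadness"]
hits = [t for t in tokens if query.match(t) and t.lower() not in stop_list]
print(hits)  # → ['happiness', 'darkness', 'sadness']
```

The stop list does the work of the manual clean-up described above; on real data it would have to be built up iteratively by inspecting the hits.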
However, once we have extracted and manually cleaned up our data set, we are faced with a problem that does not present itself when studying lexis or grammar: the very fact that affixes do not occur independently but always as parts of words, some of which (like wordform-centeredness in the first sentence of this chapter) have been created on the fly for a specific purpose, while others (like ingenuity in the same sentence) are conventionalized lexical items that we expect to find listed in a dictionary, even though they are theoretically the result of attaching an affix to a known stem (like ingen-, also found in ingenious and, confusingly, its almost-antonym ingenuous). We have to keep the difference between these two types of words in mind when constructing morphological research designs; since the two types are not always clearly distinguishable, this is more difficult than it sounds. Also, the fact that affixes always occur as parts of words has consequences for the way we can, and should, count them; in


quantitative corpus linguistics, this is a crucial point, so I will discuss it in quite some detail before we turn to our case studies.

11.1 Quantifying morphological phenomena

11.1.1 Counting morphemes: types, tokens and hapax legomena

Determining the frequency of a linguistic phenomenon under a particular condition seems a straightforward task: we simply count the number of times that the phenomenon occurs under that condition. However, this sounds straightforward (in fact, tautological) only because we have made tacit assumptions about what it means for instances of a particular phenomenon to occur (generally, or under different conditions).
v0
When we are interested in the frequency of occurrence of a particular word (as we were, for example, in Chapter 3), there seems to be only one meaningful way of counting it: search for all occurrences and add them up. For example, in order to determine how frequent the definite article the is in the BNC, we query the string the in all combinations of upper and lower case (i.e. at least the, The, and THE, but perhaps also tHe, ThE, THe, tHE and thE, to be sure) and count the hits (since this string corresponds uniquely to the word the, we don't even have to clean up the results manually). The query will yield 6,041,234 hits, so the frequency of occurrence of the word the in the BNC is 6,041,234.
When searching for grammatical structures (for example in Chapters 6 and 7), we have simply transferred this way of counting occurrences. For example, in order to determine the frequency of the s-possessive in the BNC, we would define a reasonable set of queries (which, as discussed in the preceding chapter, can be tricky) and again simply count the hits. Let us assume that the POS-tag sequences [poss. pronoun - (adjective) - noun] and [noun - 's - (adjective) - noun] are such a reasonable set; together, the four patterns occur 1,492,349 times in the BNC, so it seems obvious that the frequency of the s-possessive must be 1,492,349.
However, there is a crucial difference between the two situations: in the case of the word the, every instance is identical to all others (if we ignore upper and lower case). This is not the case for the s-possessive. Of course, here too many instances are identical to other instances: there are exact repetitions of proper names, like King's Cross (322 hits) or People's Revolutionary Party (47), of (parts of) idiomatic expressions, like arm's length (216) or heaven's sake (187), or of non-idiomatic but nevertheless fixed phrases like its present form (107) or child's best interest (26), and also of many free combinations of words that recur because they


are simply communicatively useful in many situations, like her head (4,919), his younger brother (112), people's lives (224) and body's immune system (29).
This means that there are two different ways to count occurrences of the s-possessive. First, we could simply count all instances without paying any attention to whether they recur in identical form or not. When looking at occurrences of a linguistic item or structure in this way, they are referred to as tokens, so 1,492,349 is the token frequency of the s-possessive. Second, we could exclude repetitions and count only the number of instances that are different from each other; for example, we would count King's Cross only the first time we encounter it, disregarding the other 321 occurrences. When looking at occurrences of linguistic items in this way, they are referred to as types; the type frequency of the s-possessive in the BNC is 378,893. The type frequency of the, of course, is 1.
Let us look at one more example of the type/token distinction before we move
on. Consider the following famous line from Gertrude Stein's poem Sacred
Emily:

(11.1) Rose is a rose is a rose is a rose.

At the word level, it consists of ten tokens (rose, is, a, rose, is, a, rose, is, a, and
rose), but only of three types (rose, is and a). At the level of sentence structure it
consists of three (overlapping) instances of the copular construction with a
predicate noun (i.e., [NP be NP]), so it consists of one type and three tokens.
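The word-level count is easy to reproduce programmatically. The sketch below uses a deliberately naive tokenizer (lowercase, split on whitespace, strip final punctuation), which is sufficient for this one line but not for real corpus data:

```python
# Token and type counts for Stein's line (11.1), using a naive tokenizer.
line = "Rose is a rose is a rose is a rose."
tokens = [word.strip(".").lower() for word in line.split()]
types = set(tokens)

print(len(tokens))  # token frequency: 10
print(len(types))   # type frequency: 3
```

Note that the type count depends on our decision to ignore capitalization: without lowercasing, Rose and rose would count as two different types.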
Which of the two frequencies we consider relevant in the context of a research
design depends both on the kind of phenomenon we are counting and on our
research question. When studying words, we will normally be interested in how
often they are used under a particular condition, so it is their token frequency
that is relevant to us; but we could imagine designs where we are mainly
interested in whether a word occurs at all, in which case all that is relevant is
whether its type frequency is one or zero. When studying grammatical structures,
we will also mainly be interested in how frequently a particular grammatical
structure is used under a certain condition, regardless of the words that fill this
structure. Again, it is the token frequency that is relevant to us. However, note
that we can (to some extent) ignore the specific words filling our structure only
because we are assuming that the structure and the words are, in some
meaningful sense, independent of each other; i.e., that the same words could have
been used in a different structure (e.g. an of-possessive instead of an s-possessive)
or that the same structure could have been used with different words (e.g. John's
spouse instead of his wife). Recall that in our case studies in Chapters 7 and 8 we

Anatol Stefanowitsch

excluded all instances where this assumption does not hold (such as proper
names and fixed expressions); since there is no (or very little) choice with these
cases, including them, let alone counting repeated occurrences of them, would
have added nothing (we did, of course, include repetitions of free combinations,
of which there were four in our sample: his staff, his mouth, his work and his head
occurred twice each).
Obviously, morphemes (whether inflectional or derivational) can be counted in
the same two ways as words or grammatical structures. Take the following
passage from William Shakespeare's play Julius Caesar:

(11.2) CINNA: Am I a married man, or a bachelor? Then, to answer every
man directly and briefly, wisely and truly: wisely I say, I am a bachelor.

Let us count the occurrences of the adverbial suffix -ly. There are five word
tokens that contain this suffix (directly, briefly, wisely, truly, wisely), so its token
frequency is 5; however, there are only four types, since wisely occurs twice, so its
type frequency in this passage is 4.
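The same counts can be made mechanically. In the sketch below, the suffix is operationalized naively as "the word ends in -ly", which happens to identify exactly the five relevant tokens in this passage (in real data, this query would also match monomorphemic words like only or fly):

```python
# Token and type counts for the adverbial suffix -ly in passage (11.2).
passage = ("Am I a married man, or a bachelor? Then, to answer every man "
           "directly and briefly, wisely and truly: wisely I say, I am a bachelor.")

words = [word.strip(".,:?").lower() for word in passage.split()]
ly_tokens = [word for word in words if word.endswith("ly")]

print(len(ly_tokens))       # token frequency: 5
print(len(set(ly_tokens)))  # type frequency: 4
```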
Again, whether type or token frequency is the more relevant or useful
measure depends on the research design, but the issue is more complicated than
in the case of words and grammatical structures. Let us begin to address this
problem by looking at the diminutive affixes -icle (as in cubicle, icicle) and mini-
(as in minivan, mini-cassette).
a) Token frequency. First, let us count the tokens of both affixes in the BNC.
We will find that -icle has a token frequency of 20,857, almost ten times that of
mini-, which occurs only 2,161 times. We might thus be tempted to conclude that
we will be able to learn much more about the suffix -icle from our data than about
the prefix mini-. We might also conclude that -icle is much more central to the
domain of English diminutives than mini-. However, both conclusions would be
misleading, or at least premature, for reasons related to the problems introduced
above. Recall that affixes do not occur by themselves, but always as parts of
words (this is what makes them affixes in the first place). This means, however,
that their token frequency can reflect situations that are both quantitatively and
qualitatively very different.
Specifically, the high frequency of an affix may be due to the fact that it is
used in a small number of very frequent words, or in a large number of very
infrequent words (or something in between). The first case holds for -icle: just the
three most frequent words it occurs in (article, vehicle and particle) account for


19,206 hits (i.e., 92.08 percent of all occurrences). In contrast, the three most
frequent words with mini- (mini-bus, mini-computer and mini-bar) account for
only 45.94 percent of all occurrences (617 hits); to get to 92 percent, we have to
include the twenty-one next-most-frequent words (mini-skirt, mini-cab, mini-
gram, mini-series, mini-golf, mini-enterprise, mini-step, mini-tab, mini-raise, mini-
van, mini-dress, mini-budget, mini-tel, mini-ramp, mini-league, mini-boom, mini-
roundabout, mini-break, mini-clinic, mini-LP, and mini-tower; note that most of
these words occur both with and without hyphens; I have standardized their
spelling for expository ease).
In other words, the high token frequency of -icle tells us nothing (or at least
very little) about the importance of the affix, but rather about the importance of
some of the words containing it. This is true regardless of whether we are looking
at its token frequency in the corpus as a whole or under specific conditions; if its
token frequency turned out to be higher under one condition than under the
other, this could point to the association between that condition and one or more
of the words containing the affix, rather than between the condition and the affix
itself. For example, the token frequency of the suffix -icle is higher in the BROWN
corpus (270 tokens) than in the LOB corpus (226 tokens). However, as Table 11-1
shows, this is exclusively due to the fact that the word vehicle is more frequent
than expected in American English and less frequent than expected in British
English (the corresponding chi-square components being the only ones reaching
significance).
Even if all words containing a particular affix were more frequent under one
condition (e.g. in one variety) than under another, this would not by itself tell us
anything about the affix, as this distribution could be due to the affix itself (the
adverbial suffix -ly, for example, is disappearing from American English, but not
from British English), or to the words containing it.

This is not to say that the token frequencies of affixes can never play a useful
role; they may be of interest, for example, in cases of morphological alternation
(i.e. two suffixes competing for the same stems, such as -ic and -ical in words like
electric/al); here, we may be interested in the quantitative association between
particular stems and one or the other of the affix variants, essentially giving us a
collocation-type research design based on token frequencies. But for most
research questions, designs based (exclusively) on the distribution of token
frequencies under different conditions will give us meaningless results.


Table 11-1. Words containing -icle in two corpora

                 BROWN           LOB             Total
article          Obs: 99         Obs: 116        215
                 Exp: 117.04     Exp: 97.96
                 χ²-Cmp: 2.78    χ²-Cmp: 3.32
vehicle          Obs: 88         Obs: 39         127
                 Exp: 69.13      Exp: 57.87
                 χ²-Cmp: 5.15    χ²-Cmp: 6.15
particle         Obs: 64         Obs: 49         113
                 Exp: 61.51      Exp: 51.49
                 χ²-Cmp: 0.10    χ²-Cmp: 0.12
chronicle        Obs: 8          Obs: 7          15
                 Exp: 8.17       Exp: 6.83
                 χ²-Cmp: 0.00    χ²-Cmp: 0.00
fascicles        Obs: 3          Obs: 0          3
                 Exp: 1.63       Exp: 1.37
                 χ²-Cmp: 1.14    χ²-Cmp: 1.37
ventricle        Obs: 4          Obs: 8          12
                 Exp: 6.53       Exp: 5.47
                 χ²-Cmp: 0.98    χ²-Cmp: 1.17
testicle         Obs: 2          Obs: 0          2
                 Exp: 1.09       Exp: 0.91
                 χ²-Cmp: 0.76    χ²-Cmp: 0.91
canticle         Obs: 1          Obs: 0          1
                 Exp: 0.54       Exp: 0.46
                 χ²-Cmp: 0.38    χ²-Cmp: 0.46
icicle           Obs: 1          Obs: 0          1
                 Exp: 0.54       Exp: 0.46
                 χ²-Cmp: 0.38    χ²-Cmp: 0.46
auricle          Obs: 0          Obs: 5          5
                 Exp: 2.72       Exp: 2.28
                 χ²-Cmp: 2.72    χ²-Cmp: 3.25
conventicle      Obs: 0          Obs: 1          1
                 Exp: 0.54       Exp: 0.46
                 χ²-Cmp: 0.54    χ²-Cmp: 0.65
cuticle          Obs: 0          Obs: 1          1
                 Exp: 0.54       Exp: 0.46
                 χ²-Cmp: 0.54    χ²-Cmp: 0.65
Total            270             226             496

b) Type frequency. In contrast, the type frequency of an affix is a fairly direct
reflection of the importance of the affix for the lexicon of a language: obviously
an affix that occurs in many different words is more important than one that
occurs only in a few words. Note that in order to compare type frequencies, we
have to correct for the size of the sample: all else being equal, a larger sample will
contain more types than a smaller one simply because it offers more
opportunities for different types to occur (a point we will return to in more detail
in the next subsection). A simple way of doing this is to divide the number of
types by the number of tokens; the resulting measure is referred to very
transparently as the type/token ratio (or TTR):

(11.3) TTR = n(types) / n(tokens)

The TTR is the proportion of tokens in a sample that are different from each
other; or, put differently, it is the average likelihood that we will encounter a new
type if we go through the sample item by item.
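Computationally, (11.3) is a one-liner once we have a list of tokens. The sketch below defines a generic ttr function; the five-element sample list is invented for illustration, and the two raw-count calculations simply reproduce BNC figures cited in this section.

```python
def ttr(tokens):
    """Type/token ratio: number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

# An invented toy sample (3 types among 5 tokens):
sample = ["article", "vehicle", "article", "particle", "article"]
print(ttr(sample))  # 0.6

# From raw counts (the BNC figures for -icle and mini- used in this chapter):
print(round(35 / 20857, 4))  # 0.0017
print(round(555 / 2161, 4))  # 0.2568
```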
For example, the affix -icle occurs in just 35 different words in the BNC, so its
TTR is 35/20,857 = 0.0017. In other words, 0.17 percent of its tokens in the BNC
are different from each other; the vast remainder consists of repetitions of these
0.17 percent. Put differently, if we went through the occurrences of -icle in the
BNC item by item, the likelihood that the next item instantiating this suffix will
be a type we have not seen before is 0.17 percent, so we will encounter a new
type about once every six hundred tokens. For mini-, the type/token ratio is much
higher: it occurs in 555 different words, so its TTR is 555/2,161 = 0.2568. In other
words, more than twenty-five percent of all occurrences of mini- are different
from each other; put differently, if we went through our sample word by word,
the likelihood that the next instance of mini- is a new type would be 25 percent,
so it will happen once every four hits on average. The difference in their TTRs
suggests that mini- is, in its own right, much more central in the English lexicon
than -icle, even though the latter has a much higher token frequency. Note that
this is a statement only about the affixes; it does not mean that the words
containing mini- are individually or collectively more important than those
containing -icle (on the contrary: words like vehicle, article and particle are
arguably much more important than words like minibus, minicomputer and
minibar).
Likewise, observing the type frequency (i.e., the TTR) of an affix under
different conditions provides information about the relationship between these
conditions and the affix itself, albeit one that is mediated by the lexicon: it tells us
how important the suffix in question is for the subparts of the lexicon that are
relevant under those conditions. For example, there are 7 types and 9 tokens for
mini- in the 1991 British FLOB corpus (two tokens each for mini-bus and mini-
series and one each for mini-charter, mini-disc, mini-maestro, mini-roll and mini-
submarine), so the TTR is 7/9 = 0.7778; in the 1991 US-American FROWN
corpus, there are 10 types and 11 tokens, so the TTR is 10/11 = 0.9091. This
suggests that the prefix mini- was more important to the US-English lexicon than
to the British English lexicon in the 1990s, although, of course, the samples and
the difference between them are both rather small, so we would not want to draw
that conclusion without testing for significance first, a point I will return to in the
next subsection.
c) Hapax legomena. While type frequency is a useful (and, in my view,
insufficiently valued) way of measuring the importance of affixes in general or
under specific conditions, it has one drawback: it does not tell us whether the
affix plays a productive role in a language at the time from which we take our
samples (i.e., whether speakers at that time made use of it when coining new
words). An affix may have a high TTR because it was productively used at the
time of the sample, or because it was productively used at some earlier period in
the history of the language in question. In fact, an affix can have a high TTR even
if it was never productively used, for example, because speakers at some point
borrowed a large number of words containing it; this is the case for a number of
Romance affixes in English, occurring in words borrowed from Norman French
but never (or very rarely) used to coin new words. An example is the suffix
-ence/-ance occurring in many Latin and French loanwords (such as appearance,
difference, existence, influence, nuisance, providence, resistance, significance,
vigilance, etc.), but only in a handful of words formed in English (e.g. abidance,
forbearance, furtherance, hindrance, and riddance).
In order to determine the productivity (and thus the current importance) of
affixes at a particular point in time, Baayen (cf. e.g. 2009 for an overview) has
suggested that we should focus on types that only occur once in the corpus, so-
called hapax legomena (from Greek ἅπαξ λεγόμενα 'said once'). The assumption
is that productive uses of an affix (or other linguistic rule) should result in one-off
coinages (some of which may subsequently spread through the speech
community while others will not).
Of course, not all hapax legomena are the result of productive rule-application:
the words wordform-centeredness and ingenuity that I used in the first sentence of
this chapter are both hapax legomena in this book (or would be, if I had not
referred to them twice subsequently), but wordform-centeredness is a
word I coined productively (in fact, I coined it in order to have a good example of
a hapax legomenon), whereas ingenuity has been part of the English language for
more than four hundred years (the OED first records it in 1598); it occurs only
once in this book for the simple reason that I only needed it once. So a word may
be a hapax legomenon because it is a productive coinage, or because it is
infrequently needed (in larger corpora, the category of hapaxes typically also
contains misspelled or incorrectly tokenized words which will have to be cleaned
up manually). The idea is simply to use the notion hapax legomenon as an
operationalization of the notion productive application of a rule in the hope
that the correlation between the two notions (in a large enough corpus) will be
substantial.32
Like the number of types, the number of hapax legomena is dependent on
sample size (although the relationship is not as straightforward as in the case of
types, see next subsection); it is useful, therefore, to divide the number of hapax
legomena by the number of tokens to correct for sample size:

(11.4) HTR = n(hapax legomena) / n(tokens)
We will refer to this measure as the hapax-token ratio (or HTR) by analogy with
the term type-token ratio. Note, however, that in the literature this measure is
referred to as P for Productivity (following Baayen, who first suggested the
measure); I depart from this nomenclature here to avoid confusion with p for
probability of error.
Let us apply this measure to our two diminutive affixes. The suffix -icle has
just five hapax legomena in the BNC (auricle, denticle, pedicle, pellicle and tunicle).
This means that its HTR is 5/20,857 = 0.0002 (i.e., 0.02 percent of its tokens are
hapax legomena). In contrast, there are 370 hapax legomena for mini- in the BNC
(including, for example, mini-10p-piece, mini-brain, mini-gasometer, mini-textbook
and mini-wurlitzer). This means that its HTR is 370/2,161 = 0.1712 (i.e. 17 percent
of its tokens are hapax legomena). Thus, mini- is much more productive than -icle
(or was, at the time that the BNC was assembled); this presumably matches the
intuition of most speakers of English.

32 Note also that the productive application of a suffix does not necessarily result in a
hapax legomenon: two or more speakers may arrive at the same coinage, or a single
speaker may like their own coinage so much that they use it again; some researchers
therefore suggest that we should also pay attention to dis legomena (words occurring
twice) or even tris legomena (words occurring three times). We will stick with the
mainstream here and use only hapax legomena.
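Given a token list, the HTR of (11.4) can be computed by counting the types that occur exactly once. A minimal sketch (the five-element token list is invented; the two raw-count calculations reproduce the BNC figures just discussed):

```python
from collections import Counter

def htr(tokens):
    """Hapax/token ratio: proportion of tokens whose type occurs exactly once."""
    counts = Counter(tokens)
    n_hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return n_hapaxes / len(tokens)

# An invented toy sample: mini-bus occurs twice, the other three types once.
sample = ["mini-bus", "mini-bus", "mini-brain", "mini-bar", "mini-golf"]
print(htr(sample))  # 3 hapaxes / 5 tokens = 0.6

# From raw counts (the BNC figures for -icle and mini-):
print(round(5 / 20857, 4))   # 0.0002
print(round(370 / 2161, 4))  # 0.1712
```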

11.1.2 Statistical evaluation


As pointed out in connection with the comparison of the TTRs for mini- in the
FLOB and the FROWN corpus, we would like to be able to test differences
between two (or more) TTRs (and, of course, also two or more HTRs) for
statistical significance. Theoretically, this could be done very easily. Take the
TTR: if we interpret it as the probability of encountering a new type as we move
through our samples, we are treating it like a nominal variable TYPE, with the
values NEW and SEEN BEFORE. One appropriate statistical test for a distribution of
nominal values under different conditions is the chi-square test, which we are
already more than familiar with. For example, if we wanted to test whether the
TTRs of -icle and mini- in the BNC differ significantly, we might construct a table
like that in 11-1. The chi-square test would tell us that the difference is highly
significant (χ² = 5104.04, df = 1, p < 0.001, φ = 0.4709).
Table 11-1. Type/token ratios of -icle and mini- in the BNC

                                 AFFIX
                      -ICLE            MINI-            Total
TYPE   NEW            Obs. 35          Obs. 555         590
                      Exp. 534.61      Exp. 55.39

       SEEN BEFORE    Obs. 20,822      Obs. 1,606       22,428
                      Exp. 20,322.39   Exp. 2,105.61

Total                 20,857           2,161            23,018
For HTRs, we could follow the same procedure: in this case we are dealing with a
nominal variable TYPE with the values OCCURS ONLY ONCE and OCCURS MORE THAN
ONCE, so we could construct the corresponding table and perform the chi-square
test.
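The chi-square computation itself can be sketched as follows. The function below implements the ordinary Pearson chi-square (without continuity correction) and the phi coefficient for a 2×2 table; the call reproduces, up to rounding, the values reported for Table 11-1 above.

```python
import math

def chisq_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and phi coefficient
    for a 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        exp = row * col / n  # expected frequency: row total * column total / n
        chi2 += (obs - exp) ** 2 / exp
    return chi2, math.sqrt(chi2 / n)

# NEW vs. SEEN BEFORE types for -icle and mini- in the BNC (Table 11-1):
chi2, phi = chisq_2x2(35, 555, 20822, 1606)
print(round(chi2, 2), round(phi, 4))  # approximately 5104 and 0.4709
```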
However, while the logic behind this procedure is plausible in theory both for
HTRs and for TTRs, in practice, matters are much more complicated. The reason
for this is that type-token ratios and hapax-token ratios are also dependent on
sample size.

In order to understand why and how this is the case and how to deal with it,


let us leave the domain of morphology for a moment and look at the relationship
between tokens and types or hapax legomena in texts. Consider the opening
sentences of Jane Austen's novel Pride and Prejudice (the novel is freely available
from Gutenberg.org, cf. Appendix A.1):

(11.5) It is a truth universally acknowledged, that a2/-1 single man in possession
of a3 good fortune, must be in2/-1 want of2/-1 a4 wife. However little known
the feelings or views of3 such a5 man2/-1 may be2/-1 on his first entering a6
neighbourhood, this truth2/-1 is2/-1 so well fixed in3 the2/-1 minds of4 the
surrounding families, that2/-1 he is3 considered the rightful property of5
some one or2/-1 other of6 their daughters.

All words without a subscript are new types and hapax legomena at the point at
v0
which they appear in the text; if they have one subscript, this is their token
frequency, i.e., they do not constitute new types but a repetition; their rst
repetition (i.e., the point when their token frequency becomes 2) is additionally
marked by a subscript reading -1, indicating that they cease to be hapax legomena
at this point, decreasing the overall count of hapaxes by one.
T
As we move through the text word by word, initially all words are new types
and hapaxes, so the type- and hapax-counts rise at the same rate as the token
AF

counts. However, it only takes eight tokens before we reach the rst repetition
(the word a), so while the token frequency rises to 8, the type count remains
constant at seven and the hapax count falls to six. Six words later, there is
another occurrence of a, so type and hapax counts remain at 12 and 11
respectively as the token count rises to 14, and so on. In other words, while the
R

numbers of types and hapaxes generally increases as the number of tokens in a


sample increases, they do not increase at a steady rate. e more types have
D

already occurred, the more types are there to be re-used (put simply, speakers will
encounter fewer and fewer communicative situations that require a new type),
which makes it less and less likely for new types (including new hapaxes) to
occur. Figure 11-1a shows how type and hapax counts fall substantially below
token counts in the rst 100 words of the novel, Figure 11-1b shows the same
thing for the entire novel.
As Figure 11-1a shows, the type-token and hapax-token ratios shrink fairly
quickly: after 20 tokens, the TTR is 18/20 = 0.9 and the HTR is 17/20 = 0.85, after
40 tokens the TTR is 31/40 = 0.775 and the HTR is 26/40 = 0.65, after 60 tokens
the TTR is 42/60 = 0.7 and the HTR is 33/60 = 0.55, and so on (note also how the
hapax-token ratio sometimes drops before it rises again, as words that were
hapaxes up to a particular point in the text re-occur and cease to be counted as
hapaxes).
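These growth curves can be traced by updating the counts token by token. The sketch below does this for the first 23 words of the novel (re-tokenized here without punctuation) and records, for each prefix of the text, the number of tokens, types and hapaxes seen so far:

```python
from collections import Counter

def growth_curve(tokens):
    """For each prefix of the token sequence, record (tokens, types, hapaxes)."""
    counts = Counter()
    curve = []
    for i, token in enumerate(tokens, start=1):
        counts[token] += 1
        n_types = len(counts)
        n_hapaxes = sum(1 for freq in counts.values() if freq == 1)
        curve.append((i, n_types, n_hapaxes))
    return curve

opening = ("it is a truth universally acknowledged that a single man "
           "in possession of a good fortune must be in want of a wife").split()
curve = growth_curve(opening)

print(curve[7])   # after 8 tokens (the second a): (8, 7, 6)
print(curve[-1])  # after all 23 tokens: (23, 18, 15)
```

Dividing the second and third elements by the first at each step yields the running TTR and HTR plotted in Figures 11-1a and 11-1b.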

Fig. 11-1a: TTR and HTR for the first 100 words of Pride and Prejudice
Fig. 11-1b: TTR and HTR for the entire text of Pride and Prejudice
If we zoom out and look at the entire novel, shown in Figure 11-1b, we see
that the growth in hapaxes slows considerably, to the extent that it has almost
stopped by the time we reach the end of the novel. The growth in types also
slows, although not as much as in the case of the hapaxes. In both cases this
means that the ratios will continue to fall as the number of tokens increases. Now
imagine we wanted to use the TTR and the HTR as measures of Jane Austen's
overall lexical productivity (referred to as lexical richness in computational
stylistics and in second-language teaching): if we chose a small sample of her
writing, the TTR and the HTR would be larger than if we chose a large sample, to
the extent that the scores derived from the two samples would differ significantly.
Table 11-2 shows what would happen if we compared the TTR of the first chapter
with the TTR of the entire rest of the novel.
The TTR for the first chapter is an impressive 0.3781, that for the rest of the
novel is a measly 0.0566, and the difference is highly significant (χ² = 1688.7, df =
1, p < 0.001, φ = 0.1147). And this is not because there is anything special about
the first chapter; the TTR for chapter 2 is 0.391, that for chapter 3 is 0.3457, that
for chapter 4 is 0.3943, and so on. It is simply because the TTR does not remain
constant as we increase the sample size.

Table 11-2. Type/token ratios in Pride and Prejudice

                                 TEXT SAMPLE
                      FIRST CHAPTER    REST OF NOVEL    Total
TYPE   NEW            Obs. 321         Obs. 6,829       7,150
                      Exp. 47.29       Exp. 7,102.71

       SEEN BEFORE    Obs. 528         Obs. 120,679     121,207
                      Exp. 801.71      Exp. 120,405.29

Total                 849              127,508          128,357
This is why we must never compare TTRs derived from different sample sizes,
let alone evaluate the differences statistically. The same is true for HTRs, with the
added problem that, under certain circumstances, it will decrease at some point as
we keep increasing the sample size: at some point, all possible words will have
been used, so unless new words are added to the language, the number of
hapaxes will shrink again and finally drop to zero when all existing types have
been used at least twice.
The same is true if we do not compare the TTR or HTR of entire texts but of
particular affixes (or other linguistic rules). Consider Figures 11-2a and 11-2b,
which show the TTR and the HTR of the verb suffixes -ize (occurring in words
like realize, maximize or liquidize) and -ify (occurring in words like identify,
intensify or liquify).

As we can see, the TTR and HTR of both affixes behave roughly like those of
Jane Austen's vocabulary as a whole as we increase sample size: both of them
grow fairly quickly at first before their growth slows down; the latter happens
more quickly in the case of the HTR than in the case of the TTR, and, again, we
observe that the HTR sometimes decreases as types that were hapaxes up to a
particular point in the sample re-occur and cease to be hapaxes.

Fig. 11-2a: TTRs of -ize and -ify in the BROWN corpus
Fig. 11-2b: HTRs of -ize and -ify in the BROWN corpus
Taking into account the entire sample, the TTR for -ize is 151/1057 = 0.1429
and that for -ify is 52/498 = 0.1044; it seems that -ize is slightly more important to
the lexicon of English than -ify. A chi-square test suggests that the difference
is significant, but only relatively weakly and with a very small effect size (cf.
Table 11-3a; χ² = 4.41, df = 1, p = 0.0358, φ = 0.0532).

Table 11-3a. Type/token ratios of -ise/-ize and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY             Total
TYPE   NEW            Obs. 151         Obs. 52          203
                      Exp. 137.99      Exp. 65.01

       SEEN BEFORE    Obs. 906         Obs. 446         1352
                      Exp. 919.01      Exp. 432.99

Total                 1057             498              1555

Likewise, taking into account the entire sample, the HTR for -ize is 53/1057 =
0.0501 and that for -ify is 15/498 = 0.0311; it seems that -ize is slightly more
productive than -ify. However, the dierence is very small, and a chi-square test

289
11. Morphology

shows that it is not signicant and that the eect size would be very small even if
it were (cf. Table 11-3b; 2 = 3.24, df = 1, p = 0.0716, =0.0021).

Table 11-3b. Hapax/token ratios of -ise/-ize and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY             Total
TYPE   HAPAX          Obs. 53          Obs. 15          68
                      Exp. 46.22       Exp. 21.78

       NOT HAPAX      Obs. 1004        Obs. 483         1487
                      Exp. 1010.78     Exp. 476.22

Total                 1057             498              1555
However, note that -ify has a token frequency that is less than half of that of -ize,
so the sample is much smaller: as in the example of lexical richness in Pride and
Prejudice, this means that the TTR and the HTR of this smaller sample are
exaggerated and our comparisons are, in fact, completely meaningless.

The simplest way of solving the problem of different sample sizes is, of course,
to create samples of equal size for the purposes of comparison. We simply take
the size of the smaller of our two samples and draw a random sample of the same
size from the larger of the two samples (if our data sets are large enough, it would
be even better to draw random samples for both affixes). This means that we lose
some data, but there is nothing we can do about this (note that we can still
include the discarded data in a qualitative description of the affix in question).
Figures 11-3a and 11-3b show the growth rates of the TTR and the HTR of a
random sub-sample of 498 tokens of -ize in comparison with the total sample of
the same size for -ify.
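Such an equal-size comparison can be sketched as follows. The token lists here are invented stand-ins for the real -ize and -ify data (which would have to be extracted from BROWN); the point is the down-sampling step, with a fixed random seed so that the result is reproducible:

```python
import random

def ttr(tokens):
    return len(set(tokens)) / len(tokens)

# Invented stand-in data: a larger and a smaller token sample.
ize_tokens = [f"word{i % 150}ize" for i in range(1057)]
ify_tokens = [f"word{i % 50}ify" for i in range(498)]

# Down-sample the larger sample to the size of the smaller one.
random.seed(42)  # fixed seed for reproducibility
ize_subsample = random.sample(ize_tokens, len(ify_tokens))

print(len(ize_subsample))            # 498, the same size as the -ify sample
print(round(ttr(ize_subsample), 4))  # now comparable to ttr(ify_tokens)
```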

Fig. 11-3a: TTRs of -ize (sample) and -ify in the BROWN corpus
Fig. 11-3b: HTRs of -ize (sample) and -ify in the BROWN corpus
The TTR of -ize based on the random sub-sample is 113/498 = 0.2269, that of -ify
is still 0.1044; the difference between the two suffixes is much clearer now, and a
chi-square test shows that it is highly significant, with an effect size considerably
higher than in our inappropriate comparison above (cf. Table 11-4a; χ² =
27.0293, df = 1, p = 2.004e-07, φ = 0.1647).

Table 11-4a. Type/token ratios of -ise/-ize (sample) and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY             Total
TYPE   NEW            Obs. 113         Obs. 52          165
                      Exp. 82.5        Exp. 82.5

       SEEN BEFORE    Obs. 385         Obs. 446         831
                      Exp. 415.5       Exp. 415.5

Total                 498              498              996

Likewise, the HTR of -ize based on our sub-sample is 113/498 = 0.2269, the HTR
of -ify remains 0.1044. Again, the dierence is much clearer, and it, too, is now
highly signicant with a noticeable eect size (cf. Table 11-4b; 2 = 18.4529, df = 1,

291
11. Morphology

p = 1.742e-05, = 0.1361).

Table 11-4b. Hapax/token ratios of -ise/-ize (sample) and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY             Total
TYPE   HAPAX          Obs. 48          Obs. 15          63
                      Exp. 31.5        Exp. 31.5

       NOT HAPAX      Obs. 450         Obs. 483         933
                      Exp. 466.5       Exp. 466.5

Total                 498              498              996


v0
In the case of the HTR, decreasing the sample size is slightly more problematic
than in the case of the TTR. The proportion of hapax legomena actually resulting
from productive rule application becomes smaller as sample size decreases. Take
example (11.2) from Shakespeares Julius Caesar above: the words directly, briey
and truly are all hapaxes in the passage cited, but they are clearly not the result of
T
a productively applied rule-application (all of them have their own entries in the
OALD, for example). As we increase the sample, they cease to be hapaxes
AF

(directly occurs 9 times in the entire play, briey occur 4 times and truly 8 times).
is means that while we must draw random samples of equal size in order to
compare HTRs, we should make sure that these samples are as large as possible.
11.2 Case studies

11.2.1 Morphemes and stems
One general question in (derivational) morphology concerns the category of the
stem which an affix may be attached to. This is obviously a descriptive issue that
can be investigated on the basis of corpora very straightforwardly simply by
identifying all types containing the affix in question and describing their internal
structure. In the case of affixes with a low productivity, this will typically add
little insight over studies based on dictionaries, but for productive affixes, a
corpus analysis will yield more detailed and comprehensive results, since corpora
will contain spontaneously produced or at least recently created items not (yet)
found in dictionaries. Such newly created words will often offer particularly clear
insights into constraints that an affix places on its stems (and obviously, corpus-
based approaches are without an alternative in diachronic studies, and they yield
particularly interesting results when used to study changes in the quality or
degree of productivity, cf. for example Dalton-Puffer 1996).

In the study of constraints placed by derivational affixes on the stems that
they combine with, the combinability of derivational morphemes (in an absolute
sense or in terms of preferences) is of particular interest. Again, corpus linguistics
is a uniquely useful tool to investigate this.
Finally, there are cases where two derivational morphemes are in direct
competition because they are functionally roughly equivalent (e.g. -ness and -ity,
both of which form abstract nouns from typically adjectival bases, -ize and -ify,
which form process verbs from nominal and adjectival bases, or -ic and -ical,
which form adjectives from typically nominal bases). Here, too, corpus linguistics
provides useful tools, for example to determine whether the choice between
affixes is influenced by syntactic, semantic or phonological properties of stems.

a) Case study: Phonological constraints of -ify. As part of a larger argument that
-ize and -ify should be considered phonologically conditioned allomorphs, Plag
(1999) investigates the phonological constraints that -ify places on its stems. First,
he summarizes the properties of stems in established words with -ify as observed
in the literature. Second, he checks these observations against a sample of
twenty-three recent (20th-century) coinages from a corpus of neologisms to
ensure that the constraints also apply to productive uses of the affix. This is a
reasonable question. The affix was first borrowed into English as part of a large
number of French loanwords beginning in the late 13th century; two thirds of all
non-hapax types and 19 of the twenty most frequent types found in the BNC are
older than the 19th century. Thus, it is possible that the constraints observed in
the literature are historical remnants not relevant to new coinages.


The most obvious constraint is that the syllable directly preceding -ify must carry the main stress of the word. This has a number of consequences, of which we will focus on two: First, monosyllabic stems (as in falsify) are preferred, since they always meet this criterion. Second, if a polysyllabic stem ends in an unstressed syllable, the stress must be shifted to that syllable (as in persónify from pérson);33 since this reduces the transparency of the stem, there should be a preference for those polysyllabic stems which already have the stress on the final syllable.

33 This is a simplification: if an unstressed final syllable ends in a vowel, it is simply deleted (as in símple → símplify); stress shift only occurs with unstressed closed syllables or sequences of two unstressed syllables (sýllable → syllábify). Occasionally stem-final consonants are deleted (as in liquid → liquify); cf. Plag 1999 for a more detailed discussion.

11. Morphology

Plag simply checks his neologisms against the literature, but we will evaluate the claims from the literature quantitatively. Our main hypothesis will be that neologisms with -ify do not differ from established types with respect to the requirement that the syllable directly preceding the suffix must carry primary stress, with the consequences that (i) they prefer monosyllabic stems, and (ii) if the stem is polysyllabic, they prefer stems that already have the primary stress on the last syllable. Our independent variable is therefore LEXICAL STATUS with the values ESTABLISHED WORD vs. NEOLOGISM (which will be operationalized presently). Our dependent variables are SYLLABICITY with the values MONOSYLLABIC and POLYSYLLABIC, and STRESS SHIFT with the values REQUIRED vs. NOT REQUIRED (both of which should be self-explanatory).

Our design compares two predefined groups of types with respect to the distribution that particular properties have in these groups; this means that we do not need to calculate TTRs or HTRs, but that we need operational definitions of the values ESTABLISHED WORD and NEOLOGISM. Following Plag, let us define NEOLOGISM as 'coined in the 20th century', but let us use a large historical dictionary (the Oxford English Dictionary, 3rd edition) and a large corpus (the BNC) in order to identify words matching this definition; this will give us the opportunity to evaluate the idea that hapax legomena are a good way of operationalizing productivity.

Excluding cases with prefixed stems, the OED contains 456 entries or sub-entries for verbs with -ify, 31 of which are first documented in the 20th century. Of the latter, 21 do not occur in the BNC at all, and 10 do occur in the BNC, but are not hapaxes (see Table 11-5a below). The BNC contains 30 hapaxes, of which 13 are spelling errors and 7 are first documented in the OED before the 20th century (carbonify, churchify, hornify, preachify, saponify, solemnify, townify). This leaves 10 hapaxes that are plausibly regarded as neologisms, none of which are listed in the OED (again, see Table 11-5a). In addition, there are four types in the BNC that are not hapax legomena, but that are not listed in the OED; careful cross-checks show that these are also neologisms. Combining all sources, this gives us 45 neologisms.

Before we turn to the definition and sampling of established types, let us determine the precision and recall of the operational definition of neologism as hapax legomenon in the BNC, using the formulas introduced in Chapter 5 (cf. 5.1a, b). Precision is defined as the number of true positives (items that were found and that actually are what they are supposed to be) divided by the number of all positives (all items found); 10 of the 30 hapaxes in the BNC are actually neologisms, so the precision is 10/30 = 0.3333. Recall is defined as the number of true positives divided by the number of true positives and false negatives (i.e., all items that should have been found); 10 of the 45 neologisms were actually found by using the hapax definition, so the recall is 10/45 = 0.2222. In other words, neither precision nor recall of the method is very good, at least for moderately productive affixes like -ify (the method will give better results with highly productive affixes). Let us also determine the recall of neologisms from the OED (using the definition 'first documented in the 20th century according to the OED'): the OED lists 31 of the 45 neologisms, so the recall is 31/45 = 0.6889; this is much better than the recall of the corpus-based hapax definition, but it also shows that if we combine corpus data and dictionary data, we can increase coverage substantially even for moderately productive affixes.
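These figures are easy to retrace. The following Python sketch is not part of the original study; it simply re-derives the three values from the counts just given (30 BNC hapaxes, 10 of them genuine neologisms; 45 neologisms in total, 31 of them listed in the OED). The function names are ours.

```python
# Precision and recall of two operationalizations of 'neologism'.
# All counts are taken from the text above.

def precision(true_positives, all_positives):
    # items correctly retrieved / all items retrieved
    return true_positives / all_positives

def recall(true_positives, all_relevant):
    # items correctly retrieved / all items that should have been retrieved
    return true_positives / all_relevant

# Hapax-based definition: 30 BNC hapaxes, 10 of them actual neologisms
print(round(precision(10, 30), 4))  # 0.3333
print(round(recall(10, 45), 4))     # 0.2222

# OED-based definition: 31 of the 45 neologisms are listed in the OED
print(round(recall(31, 45), 4))     # 0.6889
```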

Table 11-5a. 20th-century neologisms with -ify

First documented in the OED in the 20th century
(i) also occur in the BNC, but not as hapaxes:
bourgeoisify, esterify, gentrify, karstify, massify, Nazify, syllabify, vinify, yuppify, zombify
(ii) do not occur in the BNC:
ammonify, aridify, electronify, glassify, humify, iconify, jazzify, mattify, metrify, mucify, nannify, passivify, plastify, probabilify, Prussify, rancidify, sinify, trendify, trustify, tubify, youthify

Hapax legomena in the BNC
(iii) not listed in the OED at all:
faintify, fuzzify, lewisify, rawify, rockify, sickify, sonify, validify, yankify, yukkify

Non-hapax legomena in the BNC that are not listed in the OED
(iv) commodify, desertify, extensify, geriatrify

Let us now turn to the definition of ESTABLISHED TYPES. Given our definition of NEOLOGISMS, established types would first have to be documented before the 20th century, so we could use the 420 types in the OED that meet this criterion (again, excluding prefixed forms). However, these 420 types contain many very rare or even obsolete forms, like duplify 'to make double', eaglify 'to make into an eagle' or naucify 'to hold in low esteem'. Clearly, these are not established in any meaningful sense, so let us add the requirement that a type must occur in the BNC at least twice to count as established. Let us further limit the category to verbs first documented before the 19th century, in order to leave a clear diachronic gap between the established types and the productive types.34

Table 11-5b. Control sample of established types with the suffix -ify

acidify, amplify, beatify, beautify, certify, clarify, classify, crucify, damnify, deify, dignify, diversify, edify, electrify, exemplify, falsify, fortify, Frenchify, fructify, glorify, gratify, identify, indemnify, justify, liquify/liquefy, magnify, modify, mollify, mortify, mummify, notify, nullify, ossify, pacify, personify, petrify, prettify, purify, qualify, quantify, ramify, rarify, ratify, rectify, sacrify, sanctify, satisfy, scarify, signify, simplify, solidify, specify, stratify, stultify, terrify, testify, transmogrify, typify, uglify, unify, verify, versify, vilify, vitrify, vivify
v0
Let us now evaluate the hypotheses. Table 11-6a shows the type frequencies for monosyllabic and polysyllabic stems in the two samples. In both cases, there is a preference for monosyllabic stems (as expected), but interestingly, this preference is less strong among the neologisms than among the established types, and this difference is very significant (χ² = 7.37, df = 1, p < 0.01, φ = 0.2577).
T
Table 11-6a. Monosyllabic and polysyllabic stems with -ify

                                  NO. OF SYLLABLES
LEXICAL STATUS           MONOSYLLABIC     POLYSYLLABIC     Total
ESTABLISHED              Obs. 57          Obs.  9             66
                         Exp. 51.14       Exp. 14.86
NEOLOGISM                Obs. 29          Obs. 16             45
                         Exp. 34.86       Exp. 10.14
Total                         86               25            111
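The chi-square value and the φ-coefficient for Table 11-6a can be recomputed from the observed frequencies alone; the following stdlib-only Python sketch (our own generic implementation, not the script used for the book) derives the expected frequencies from the marginal totals. Note that it prints φ = 0.2576 rather than 0.2577, because the text derives φ from the already rounded chi-square value; the difference is immaterial.

```python
import math

def chi_squared(table):
    """Pearson chi-square for a contingency table given as a list of rows;
    expected frequencies are derived from the marginal totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi2 += (obs - exp) ** 2 / exp
    return chi2, n

# Table 11-6a: rows = ESTABLISHED, NEOLOGISM; columns = MONOSYLLABIC, POLYSYLLABIC
chi2, n = chi_squared([[57, 9], [29, 16]])
phi = math.sqrt(chi2 / n)
print(round(chi2, 2), round(phi, 4))  # 7.37 0.2576
# 7.37 exceeds 6.63, the critical chi-square value for df = 1 at p = 0.01
```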

34 Interestingly, leaving out words coined in the 19th century does not make much of a difference: although the 19th century saw a large number of coinages (with 138 new types it was the most productive century in the history of the suffix), few of these are frequent enough today to occur in the BNC; if anything, we should actually extend our definition of neologisms to include the 19th century.


Given that there is a significantly higher number of neologisms with polysyllabic stems than expected on the basis of established types, the second hypothesis becomes more interesting: does this higher number of polysyllabic stems correspond to a greater willingness to apply the suffix to stems that then have to undergo stress shift (which would be contrary to our hypothesis, which assumes that there will be no difference between established types and neologisms)?
Table 11-6b shows the relevant data: it seems that there might indeed be such a greater willingness, as the number of neologisms with polysyllabic stems requiring stress shift is higher than expected; however, the difference is not statistically significant (χ² = 1.96, df = 1, p > 0.05, φ = 0.28) (strictly speaking, we cannot use the chi-square test here, since half of the expected frequencies are below 5, but Fisher's exact test confirms that the difference is not significant).

Table 11-6b. Stress shift with polysyllabic stems with -ify

                                    STRESS SHIFT
LEXICAL STATUS           NOT REQUIRED     REQUIRED     Total
ESTABLISHED              Obs.  3          Obs. 6           9
                         Exp.  4.68       Exp. 4.32
NEOLOGISM                Obs. 10          Obs. 6          16
                         Exp.  8.32       Exp. 7.68
Total                         13               12         25
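The Fisher's exact test mentioned above can be retraced with a short, self-contained Python sketch. The implementation is our own and follows the common two-sided convention (summing the hypergeometric probabilities of all tables that are at most as probable as the observed one); the resulting p-value of about 0.23 confirms that the difference is not significant.

```python
from math import comb

def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test for a 2x2 table: sums the hypergeometric
    probabilities of all tables at most as probable as the observed one."""
    (a, b), (c, d) = table
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def p_table(x):
        # probability of a table with x in the top-left cell, margins fixed
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = p_table(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Table 11-6b: ESTABLISHED (3 / 6), NEOLOGISM (10 / 6)
print(round(fisher_exact_two_sided([[3, 6], [10, 6]]), 3))  # 0.226
```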

This demonstrates some of the problems and advantages of using corpora to identify neologisms in addition to existing dictionaries. It also constitutes an example of a purely type-based research design; note, again, that this is possible here because we are not interested in the type frequency of a particular affix under different conditions (in which case we would have to calculate a TTR to adjust for different sample sizes), but in the distribution of the variables SYLLABICITY and STRESS SHIFT in two qualitatively different categories of types. Finally, note that the study comes to different conclusions than Plag's (1999) impressionistic analysis of his sample, so it demonstrates the advantages of strictly quantified designs.

b) Case Study: Semantic differences between -ic and -ical. Affixes, like words, can be related to other affixes by lexical relations like synonymy, antonymy etc. In the case of (roughly) synonymous affixes, an obvious research question is what determines the choice between them, for example, whether there are more fine-grained semantic differences that are not immediately apparent.

One way of approaching this question is to focus on stems that occur with both affixes (such as liqui(d)- in liquidize and liquify/liquefy, scarce in scarceness and scarcity or electr- in electric and electrical) and to investigate the semantic contexts in which they occur, for example, by categorizing their collocates, analogous to the way Taylor (2001) categorizes collocates of high and tall (cf. Chapter 9, Section 9.2.2).
A good example of this approach is found in Kaunisto (1999), who investigates the pairs electric/electrical and classic/classical on the basis of the British newspaper Daily Telegraph. Since his corpus is not accessible, we will use the BROWN corpus instead to replicate his study for electric/electrical. It is a study with two nominal variables: AFFIX VARIANT (with the values -IC and -ICAL), and SEMANTIC CATEGORY (with a set of values to be discussed presently). Note that this design can be based straightforwardly on token frequency, as we are not concerned with the relationship between the stem and the affix, but with the relationship between the stem-affix combination and the nouns modified by it. Put differently, we are not using the token frequency of a stem-affix combination, but of the collocates of words derived by a particular affix.

Kaunisto uses a mixture of dictionaries and existing literature to identify potentially interesting values for the variable SEMANTIC CATEGORY; we will restrict ourselves to dictionaries here:

(11.6) electric
a) 'connected with electricity; using, produced by or producing electricity' [OALD]
b) 'of or relating to electricity; operated by electricity' [MW]
c) 'working by electricity; used for carrying electricity; relating to electricity' [MD]
d) 'of, produced by, or worked by electricity' [CALD]
e) 'needing electricity to work, produced by electricity, or used for carrying electricity' [LDCE]
f) 'work[ing] by means of electricity; produced by electricity; designed to carry electricity; refer[ring] to the supply of electricity' [Cobuild]

(11.7) electrical
a) 'connected with electricity; using or producing electricity' [OALD]
b) 'of or relating to electricity; operated by electricity' [MW; mentioned as a synonym under the corresponding sense of electric]
c) 'working by electricity; relating to electricity' [MD]
d) 'related to electricity' [CALD]
e) 'relating to electricity' [LDCE]
f) 'work[ing] by means of electricity; supply[ing] or us[ing] electricity; energy in the form of electricity; involved in the production and supply of electricity or electrical goods' [Collins]

MW treats the two words as largely synonymous, and OALD distinguishes them only insofar as it mentions for electric, but not electrical, that it may refer to phenomena produced by electricity (this is meant to cover cases like electric current/charge); however, since both words are also defined as referring to anything connected with electricity, this is not much of a differentiation (the entry for electrical also mentions electrical power/energy). Macmillan's Dictionary also treats them as largely synonymous, although it points out specifically that electric refers to entities carrying electricity (citing electric outlet/plug/cord). CALD and LDCE present electrical as a more general word for anything related to electricity, whereas they mention specifically that electric is used for things worked by electricity (e.g. electric light/appliance) or carrying electricity (presumably cords, outlets etc.) and phenomena produced by electricity (presumably current, charge, etc.). Collins presents both words as referring to electric(al) appliances, with electric additionally referring to things produced by electricity, designed to carry electricity or being related to the supply of electricity, and electrical additionally referring to energy or entities involved in the production and supply of electricity (presumably energy companies, engineers, etc.).
Summarizing, we can posit the following five broad values for our variable SEMANTIC CATEGORY:

- DEVICES and appliances working by electricity
- ENERGY in the form of electricity
- COMPANIES producing or supplying energy
- CIRCUITS carrying electricity (including their components)
- ENGINEERS and their work with electricity, circuits, devices

Table 11-7 shows the token frequency with which words from these categories
are referred to as electric or electrical, and it also lists all types found for each
category.


Table 11-7. Entities described as electric or electrical in the BROWN corpus.

DEVICES (Total 37)
  ELECTRIC:   Obs. 34   Exp. 21.25   χ²-comp. 7.65
              amplifier, blanket, bug, chair, computer, device, drive, flatiron,
              gadget, globe, grill, hand-blower, heater, horn, icebox, lantern,
              model, range, razor, refrigerator, sensor, sign, spit, tool, toothbrush
  ELECTRICAL: Obs. 3    Exp. 15.75   χ²-comp. 10.32
              display, equipment, wonder

ENERGY (Total 34)
  ELECTRIC:   Obs. 14   Exp. 19.52   χ²-comp. 1.56
              arc, current, discharge, effect, power, shock, universe
  ELECTRICAL: Obs. 20   Exp. 14.48   χ²-comp. 2.11
              attraction, body, characteristic, charge, distribution, energy,
              explosion, force, form, input, power, shock, signal, stimulation

COMPANIES (Total 6)
  ELECTRIC:   Obs. 5    Exp. 3.45    χ²-comp. 0.70
              company, plant
  ELECTRICAL: Obs. 1    Exp. 2.55    χ²-comp. 0.95
              manufacturer

CIRCUITS (Total 15)
  ELECTRIC:   Obs. 5    Exp. 8.61    χ²-comp. 1.52
              circuit, component, line, material
  ELECTRICAL: Obs. 10   Exp. 6.39    χ²-comp. 2.05
              component, contact, control, line, outlet, pickoff, torquer, wiring

ENGINEERS (Total 9)
  ELECTRIC:   Obs. 0    Exp. 5.17    χ²-comp. 5.17
  ELECTRICAL: Obs. 9    Exp. 3.83    χ²-comp. 6.97
              engineer, literature, requirement, work

Total: ELECTRIC 58, ELECTRICAL 43, overall 101

The difference between electric and electrical is significant overall (χ² = 39, df = 4, p < 0.001). Since we are interested in the nature of this difference, it is much more insightful to look at the chi-square components individually. This gives us a better idea where the overall significant difference comes from: devices are much more frequently called electric and less frequently electrical than expected, and engineers and their work are more frequently referred to as electrical and less frequently as electric than expected. This is in line with the dictionary definitions cited above. The other differences do not reach significance individually, but they go against the direction expected on the basis of the dictionary entries in all cases. In the case of companies, proper names were excluded from the analysis, but these, too, overwhelmingly favor electric: there were seven company names with electric (Electric and Musical Industries, General Electric, General Electric Company, Industrial Electric, Narragansett Electric, Vermont Hydro-Electric Corporation, and Westinghouse Electric Corp.) and only one with electrical (Metropolitan Vickers Electrical Co.), which corroborates the trend for companies to be referred to by the adjective electric.
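The chi-square components discussed here can be recomputed from the observed frequencies in Table 11-7 alone. The following stdlib-only Python sketch (our own, not the script used for the table) derives the expected frequency for each cell from the marginal totals and prints the per-cell components along with the overall chi-square value:

```python
# Expected frequencies and chi-square components for Table 11-7,
# derived from the observed frequencies alone.
labels = ["DEVICES", "ENERGY", "COMPANIES", "CIRCUITS", "ENGINEERS"]
table = [[34, 3], [14, 20], [5, 1], [5, 10], [0, 9]]  # electric, electrical

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

total = 0.0
for label, row, rt in zip(labels, table, row_totals):
    # (obs - exp)^2 / exp for the electric and the electrical cell of this row
    comps = [(obs - rt * ct / n) ** 2 / (rt * ct / n)
             for obs, ct in zip(row, col_totals)]
    total += sum(comps)
    print(label, [round(c, 2) for c in comps])
print("chi-square:", round(total, 1))  # chi-square: 39.0
```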
In sum, there is a significant tendency for electric to be used with devices (and, if we were to include proper names, companies), and electrical for people and their work with electricity. In addition, there are (non-significant) tendencies for electrical to be used with circuits and energy. This is broadly similar to Kaunisto's (1999) findings (he goes on to look at the categories in more detail, revealing even more fine-grained differences).

Of course, this type of investigation can also be designed as an inductive study of differential collexemes (again, like the study of synonyms such as high and tall). Table 11-8 shows the results of such an analysis, calculated on the basis of occurrences of electric/al in the BNC that are directly followed by a noun.
The results largely agree with the preferences also uncovered by the more careful (and more time-consuming) categorization of a complete data set, with one crucial difference: there are members of the category DEVICE among the significant differential collocates of both variants. A closer look reveals a systematic difference within this category: the DEVICE collocates of electric refer to specific devices (such as light, guitar, kettle etc.); in contrast, the DEVICE collocates of electrical refer to general classes of devices (equipment, appliance, system). This difference was not discernible in the BROWN data set (presumably because it is too small), but it is discernible in Kaunisto's (1999) data set, and he posits corresponding subcategories.


Table 11-8. Differential nominal collocates of electric and electrical in the BNC

KEY WORD        O11    O12    O21    O22      G²

MOST STRONGLY ASSOCIATED WITH ELECTRIC
shock           140      2  2,692  2,027  136.60
light           122      2  2,710  2,027  116.99
field           191     23  2,641  2,006  104.41
guitar           81      0  2,751  2,029   88.50
fire            109      7  2,723  2,022   78.62
car              59      2  2,773  2,027   50.11
motor            63      3  2,769  2,026   49.42
blanket          46      1  2,786  2,028   42.07
window           38      0  2,794  2,029   41.27
kettle           37      0  2,795  2,029   40.18
cooker           34      0  2,798  2,029   36.91
drill            39      1  2,793  2,028   34.75
train            32      0  2,800  2,029   34.73
Co.              27      0  2,805  2,029   29.28
vehicle          24      0  2,808  2,029   26.02
lighting         21      0  2,811  2,029   22.76
fan              21      0  2,811  2,029   22.76
tramway          20      0  2,812  2,029   21.67
fence            20      0  2,812  2,029   21.67
traction         19      0  2,813  2,029   20.58

MOST STRONGLY ASSOCIATED WITH ELECTRICAL
engineering       0    108  2,832  1,921  192.16
engineer          0     89  2,832  1,940  157.84
equipment         6    106  2,826  1,923  147.96
goods             1     88  2,831  1,941  146.12
activity          1     85  2,831  1,944  140.79
appliance         3     69  2,829  1,960  100.18
conductivity      0     35  2,832  1,994   61.51
fault             0     34  2,832  1,995   59.75
signal            8     53  2,824  1,976   54.50
stimulation       0     26  2,832  2,003   45.63
union             0     26  2,832  2,003   45.63
energy            2     33  2,830  1,996   44.78
impulse           2     32  2,830  1,997   43.14
retailer          0     21  2,832  2,008   36.82
property          0     20  2,832  2,009   35.06
work              0     19  2,832  2,010   33.30
control           1     23  2,831  2,006   33.10
system            6     31  2,826  1,998   28.06
circuit          10     37  2,822  1,992   27.06
recording         1     19  2,831  2,010   26.44
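The G² scores in the table are log-likelihood ratio values of the form G² = 2 Σ O·ln(O/E), with the expected frequencies derived from the marginal totals of each 2×2 table. The following Python sketch (a generic implementation of ours, not the exact script behind the table) recomputes the value for the first row, shock:

```python
import math

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O / E)) for a 2x2 table;
    cells with an observed frequency of zero contribute nothing."""
    obs = [[o11, o12], [o21, o22]]
    rows = [o11 + o12, o21 + o22]
    cols = [o11 + o21, o12 + o22]
    n = sum(rows)
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            if obs[i][j] > 0:
                exp = rows[i] * cols[j] / n
                g2 += obs[i][j] * math.log(obs[i][j] / exp)
    return 2 * g2

# First row of Table 11-8: 'shock' after electric vs. electrical
print(round(g_squared(140, 2, 2692, 2027), 1))  # 136.6
```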


There is an additional pattern that would warrant further investigation: there are collocates for both variants that correspond to what some of the dictionaries refer to as 'produced by' electricity (cf. the entries in 11.6 above): shock, field and fire for electric and signal, energy, impulse for electrical. It is possible that electric more specifically characterizes phenomena that are caused by electricity, while electrical characterizes phenomena that manifest electricity. It seems, then, that a differential-collocate analysis is a good alternative to the manual categorization and category-wise comparison of all collocates; it allows us to process very large data sets very quickly and then focus on the semantic properties of those collocates we already know to distinguish significantly between the variants.
In conclusion, let me stress once more that this kind of study does not primarily uncover differences between affixes, but differences between specific word pairs containing these affixes. They are, as pointed out above, essentially lexical studies of near-synonymy. Of course, it is possible that by performing such analyses for a large number of word pairs containing a particular affix pair, general semantic differences may emerge, but since we are frequently dealing with highly lexicalized forms, there is no guarantee of this. Gries (2003, cf. also 2001) has shown that -ic/-ical pairs differ substantially in the extent to which they are synonymous; for example, he finds substantial differences in meaning for politic/political or poetic/poetical, but much smaller differences, for example, for bibliographic/bibliographical, with electric/electrical somewhere in the middle. Obviously, the two variants have lexicalized independently in many cases, and the specific differences in meaning resulting from this lexicalization process are unlikely to fall into clear general categories.

c) Case Study: Phonological differences between -ic and -ical. In an interesting but rarely cited paper, Or (1994) collects a number of hypotheses about semantic and, in particular, phonological factors influencing the distribution of -ic and -ical, for which she provides impressionistic corpus evidence but which she does not investigate systematically. A simple example is the factor LENGTH: Or hypothesizes that speakers will tend to avoid long words and choose the shorter variant -ic for long stems (in terms of number of syllables). She reports that a survey of a general vocabulary list confirms this hypothesis but does not present any systematic data.

The hypothesis can be tested relatively straightforwardly using a t-test (if the syllable lengths of stems occurring with -ic(al) are normally distributed) or with a Mann-Whitney U test (if they are not). There are 673 stem types occurring with -ic in the BROWN corpus, and 240 with -ical (since the point is to show the influence of length on suffix choice, prefixed stems, compound stems etc. are included in this figure). The mean length is 2.86 (sample variance: 1.35) for stems occurring with -ic and 2.72 (sample variance: 1.59) for stems occurring with -ical. Applying the formula in (7.15) from Chapter 7, we get a t-value of 1.54, which, at 392.82 degrees of freedom (which we determine using the formula in 7.17), is not significant (p = 0.12).35
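The calculation can be retraced with the following Python sketch, which implements a t-test for unequal variances from summary statistics (the degrees-of-freedom formula appears to correspond to the Welch-Satterthwaite approximation used in (7.17), since it reproduces the reported 392.82 almost exactly). Note that with the rounded means and variances given above, the t-value comes out at about 1.51 rather than the 1.54 computed from the raw data; the conclusion is unaffected.

```python
import math

def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Welch's t-test from summary statistics; returns the t-value and the
    Welch-Satterthwaite approximation of the degrees of freedom."""
    se1, se2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Stem length in syllables: -ic (n = 673) vs. -ical (n = 240)
t, df = welch_t(2.86, 1.35, 673, 2.72, 1.59, 240)
print(round(t, 2), round(df, 1))  # 1.51 392.8
```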
Note that in this kind of design we have used samples of types rather than tokens again, because we are investigating a claim about an influence of stem length on affix choice; in this context, it is important to determine how many stems of a given length occur with a particular affix variant, but it does not matter how often a particular stem does so.
v0
d) Case Study: Affix combinations. It has often been observed that certain derivational affixes show a preference for stems that are already derived by a particular affix. This observation is potentially relevant to cases of alternation between suffixes, as such preferences may be one of the factors influencing the choice between the variants.

For example, Lindsay (2011) and Lindsay and Aronoff (2013) show that while -ic is more productive in general than -ical (see the preceding case study), -ical is preferred with stems that contain the affix -olog- (for example, morphological or methodological). They show this by comparing the ratio of the two suffixes for stems that occur with both suffixes; I will return to their approach presently. First, though, let us see what we can find out by looking at the overall distribution of types. The BROWN and LOB corpora taken together contain 737 different stems occurring with either -ic or -ical or both (spelling differences were ignored, and since we are interested in suffix combinations, all prefixes were removed before counting stem types). There is a clear overall preference for -ic (559 types) over -ical (178 types). This general preference can now be used to determine expected frequencies for stems containing a particular suffix in a one-by-n design (as described in Chapter 7, Section 7.2.2 above): 75.85 percent of all -ic(al) stem types occur with -ic, 24.15 percent occur with -ical. If other suffixes have no influence
35 As mentioned in Chapter 7, word length (however we measure it) rarely follows a normal distribution, so the Mann-Whitney U test would be the better test in this case. Here are the rank sums, if you want to calculate it: 313061.5 for stems occurring with -ic and 104179.5 for stems occurring with -ical. You will find that the difference is not significant according to the U test either (which is unsurprising, since the median length for both suffixes is 3 syllables).


on this preference, stems containing a particular suffix (like -olog-) should show the same proportion. There are 46 such stems (5 with -ic, 41 with -ical), so the expected frequency for -ic is 46 × 0.7585 = 34.89, and for -ical it is 46 × 0.2415 = 11.11. This is shown in Table 11-9a.

Table 11-9a. Preference of stems with -olog- for -ic or -ical

                              STEM TYPES
                    STEMS WITH -IC    STEMS WITH -ICAL    Total
STEMS WITH -OLOG-   Obs.  5           Obs. 41              46
                    Exp. 34.89        Exp. 11.11
                    χ²-comp. 25.61    χ²-comp. 80.42
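The one-by-n design can be retraced with a few lines of code. The following stdlib-only Python sketch (our own illustration) derives the expected frequencies from the overall 559-to-178 split and computes the goodness-of-fit chi-square for the three suffixes discussed in this section:

```python
# One-by-n (goodness-of-fit) chi-square: do stems containing a given suffix
# follow the overall distribution of -ic vs. -ical stem types?

def gof_chi_squared(observed, proportions):
    # compare observed counts against expected counts n * p
    n = sum(observed)
    return sum((obs - n * p) ** 2 / (n * p)
               for obs, p in zip(observed, proportions))

overall = [559 / 737, 178 / 737]  # overall split of -ic vs. -ical stem types

print(round(gof_chi_squared([5, 41], overall), 2))   # -olog-: 106.02
print(round(gof_chi_squared([18, 5], overall), 2))   # -graph: 0.07
print(round(gof_chi_squared([71, 4], overall), 2))   # -ist:   14.5
```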
v0
The difference between the observed distribution of -ic and -ical with -olog- stems and that expected on the basis of the overall distribution of the suffixes in the lexicon is highly significant (χ² = 106.02, df = 1, p < 0.001): stems with -olog- do, indeed, favor -ical against the general trend.

This could be due specifically to the suffix -olog-, but it could also be a general preference of derived stems for -ical. There are two more suffixes that occur frequently enough among the types with -ic(al) to test this: -ist (as in altruistic or statistical) and -graph (as in biographic(al), cryptographic or bibliographical), so let us apply the same test to them.

Table 11-9b. Preference of stems with -graph for -ic or -ical

                              STEM TYPES
                    STEMS WITH -IC    STEMS WITH -ICAL    Total
STEMS WITH -GRAPH   Obs. 18           Obs.  5              23
                    Exp. 17.45        Exp.  5.55
                    χ²-comp. 0.02     χ²-comp. 0.05


Table 11-9c. Preference of stems with -ist for -ic or -ical

                              STEM TYPES
                    STEMS WITH -IC    STEMS WITH -ICAL    Total
STEMS WITH -IST     Obs. 71           Obs.  4              75
                    Exp. 56.89        Exp. 18.11
                    χ²-comp. 3.5      χ²-comp. 11

There is no significant difference between the observed and expected frequencies of stem types with -graph (χ² = 0.07, df = 1, p > 0.05); in fact, there is almost no difference at all. In contrast, there is a significant difference for stems with the suffix -ist (χ² = 14.5, df = 1, p < 0.001), but it goes in the opposite direction: the preference of stems with -ist for -ic is even stronger than the overall preference. Thus, individual suffixes may differ in their preference for other suffixes.

Note that we have been talking of a preference of particular stems for one or the other suffix, but this is somewhat imprecise: we looked at the total number of stem types with -ic and -ical and the three additional suffixes. While differences in number are plausibly attributed to preferences, they may also be purely historical leftovers due to the specific history of the two suffixes (which is rather complex, involving borrowing from Latin, Greek and, in the case of -ical, French). More convincing evidence for a productive difference in preferences would come from stems that take both -ic and -ical (such as electric/al, symmetric/al or numeric/al, to take three examples that display a relatively even distribution between the two). This is why Lindsay (2011) and Lindsay and Aronoff (2013) focus on these stems: for each stem, they check whether it occurs more frequently with -ic or with -ical and calculate a preference ratio. They then compare the ratio of all stems to that of stems with -olog- (see Table 11-10).
D

Table 11-10. Stems favoring -ic or -ical in the COCA (Lindsay 2011: 194)
TOTAL STEMS RATIO -OLOG- STEMS RATIO
FAVORING -IC 1197 4.5 13 1
FAVORING -ICAL 268 1 73 5.6
Total 1465 86

The ratios themselves are difficult to compare statistically, but of course, we could use the proportions for all stems to derive expected frequencies for the -olog- stems in the same way we did for the types above (you might want to try this as an exercise; the chi-square value should be 255.13). However, we could take Lindsay and Aronoff's approach one step further: instead of calculating the overall preference of a particular type of stem and comparing it to the overall preference of all stems, we could calculate the preference for each stem individually. This will give us values for each stem that are, at the very least, ordinal data, i.e., we can rank the stems by preference. We could then use the Mann-Whitney U test to determine whether the stems with -olog- tend to occur towards the -ical end of the ranking. That way, we would treat preference as the matter of degree that it actually is, rather than as an absolute property of stems.36
The BROWN/LOB corpora do not contain enough stems that occur with both suffixes, so we will use the BNC instead. The BNC contains 196 stems that take both suffixes, which, since I want to show all relevant information so that you can retrace the necessary steps for yourself, is too much. Therefore, let us include only the stems with -olog- and -graph, as the latter were shown by our previous analysis to be typical of -ic(al) stems in general. Table 11-11 shows the relevant data.
The preference of stems with -olog- for the variant -ical is very obvious even on purely visual inspection of the table. Stems with -graph are found with preferences ranging from a 99.89 percent preference for -ic to a 90 percent preference for -ical (i.e., a 10 percent preference for -ic); that is, they cover almost the entire range of possible preferences. In contrast, stems with -olog- are found within a very narrow range of preferences for -ical, namely from 99.88 percent (a 0.12 percent preference for -ic) to 80 percent (a 20 percent preference for -ic). Accordingly, there is very little overlap, and the median ranks of the two suffixes differ substantially: 9.5 for -graph and 28.5 for -olog-. A Mann-Whitney U test shows this difference to be highly significant (U = 3, N1 = 18, N2 = 20, p < 0.001).
36 In fact, they are cardinal data, as the value can range from 0 to 1 with every possible value in between; it is safer to treat them as ordinal data, however, because we don't know whether such preference values are normally distributed (in fact, since they are based on word frequency data, which we know not to be normally distributed, it is a fair guess that the preference data are not normally distributed).


Table 11-11. Preference of stems containing -graph- and -olog- for the suffix
variants -ic and -ical (BNC)

STEM            SUFFIX   F(-IC)   F(-ICAL)   P(-IC)   RANK
photograph-     graph       890          1   0.9989      1
radiograph-     graph        44          1   0.9778      2
cartograph-     graph        79          2   0.9753      3
orthograph-     graph        77          3   0.9625      4
ethnograph-     graph       200          8   0.9615      5
calligraph-     graph        13          1   0.9286      6
physiograph-    graph        10          2   0.8333      7
petrograph-     graph        34          7   0.8293      8
thermograph-    graph         3          1   0.75        9
lexicograph-    graph        21         10   0.6774     10
iconograph-     graph        16         10   0.6154     11
bibliograph-    graph       203        130   0.6096     12
hagiograph-     graph         4          6   0.4        13
stratigraph-    graph        67        120   0.3583     14
typograph-      graph        42         82   0.3387     15
topograph-      graph        48        139   0.2567     16
pedolog-        olog          2          8   0.2        17
geograph-       graph       299       1607   0.1569     18
hydrolog-       olog          7         56   0.1111     19
historiograph-  graph         2         18   0.1        20
pharmacolog-    olog          5         68   0.0685     21
serolog-        olog          3         54   0.0526     22
haematolog-     olog          2         39   0.0488     23
immunolog-      olog          3        107   0.0273     24
morpholog-      olog          9        340   0.0258     25
philolog-       olog          1         39   0.025      26
aetiolog-       olog          1         40   0.0244     27
patholog-       olog          4        306   0.0129     28
histolog-       olog          5        387   0.0128     29
radiolog-       olog          2        170   0.0116     30
epidemiolog-    olog          2        179   0.011      31
epistemolog-    olog          2        206   0.0096     32
geolog-         olog          7        979   0.0071     33
physiolog-      olog          4        660   0.006      34
ontolog-        olog          1        279   0.0036     35
biolog-         olog          6       2031   0.0029     36
ecolog-         olog          1        726   0.0014     37
technolog-      olog          2       1632   0.0012     38
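For readers who want to retrace the calculation, the U-test can be reproduced directly from the frequencies in Table 11-11; the following is a minimal sketch in Python, assuming the scipy package is available:

```python
# Recomputing the Mann-Whitney U-test for Table 11-11 (a sketch).
# (f_ic, f_ical) pairs are the frequencies from the table;
# the preference P(-ic) is f_ic / (f_ic + f_ical).
from scipy.stats import mannwhitneyu

graph = {  # stems containing -graph-
    "photograph": (890, 1), "radiograph": (44, 1), "cartograph": (79, 2),
    "orthograph": (77, 3), "ethnograph": (200, 8), "calligraph": (13, 1),
    "physiograph": (10, 2), "petrograph": (34, 7), "thermograph": (3, 1),
    "lexicograph": (21, 10), "iconograph": (16, 10), "bibliograph": (203, 130),
    "hagiograph": (4, 6), "stratigraph": (67, 120), "typograph": (42, 82),
    "topograph": (48, 139), "geograph": (299, 1607), "historiograph": (2, 18),
}
olog = {  # stems containing -olog-
    "pedolog": (2, 8), "hydrolog": (7, 56), "pharmacolog": (5, 68),
    "serolog": (3, 54), "haematolog": (2, 39), "immunolog": (3, 107),
    "morpholog": (9, 340), "philolog": (1, 39), "aetiolog": (1, 40),
    "patholog": (4, 306), "histolog": (5, 387), "radiolog": (2, 170),
    "epidemiolog": (2, 179), "epistemolog": (2, 206), "geolog": (7, 979),
    "physiolog": (4, 660), "ontolog": (1, 279), "biolog": (6, 2031),
    "ecolog": (1, 726), "technolog": (2, 1632),
}

def p_ic(stems):
    """Preference for -ic over -ical for each stem."""
    return [ic / (ic + ical) for ic, ical in stems.values()]

u, p = mannwhitneyu(p_ic(olog), p_ic(graph), alternative="two-sided")
print(u, p)  # U = 3.0, p well below 0.001
```

Note that scipy reports the U statistic of the first sample passed to it, which is why the -olog- preferences are given first here.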


11.2.2 Morphemes and demographic variables


There are a few studies investigating the productivity of derivational morphemes
across text types (comparing, e.g., written and spoken language or genres), across
groups defined by sex, education and/or class, or across varieties. This is an
extremely interesting area of research that may offer valuable insights into the
very nature of morphological richness and productivity, allowing us, for example,
to study potential differences between regular, presumably subconscious
applications of derivational rules and the deliberate coining of words. Despite
this, it is an area that has not been studied very intensively, so there is much that
remains to be discovered.

a) Case study: Productivity and genre. Guz (2009) studies the prevalence of
different kinds of nominalization across genres. The question whether the
productivity of derivational morphemes differs across genres is a very interesting
one, and Guz presents a potentially more detailed analysis than previous studies
in that he looks at the prevalence of different stem types for each affix, so that
qualitative as well as quantitative differences in productivity could, in theory, be
studied. In practice, unfortunately, the study offers preliminary insights at best, as
it is based entirely on token frequencies, which, as discussed in Section 11.1
above, do not tell us anything at all about productivity.
We will therefore look at a question inspired by Guz's study, and use the TTR
and the HTR to study the relative importance and productivity of the
nominalizing suffix -ship (as in friendship, lordship, etc.) in newspaper language
and in prose fiction. The suffix -ship is known to have a very limited productivity,
and our hypothesis (for the sake of the argument) will be that it is more
productive in prose fiction, since authors of fiction are under pressure to use
language creatively (this is not Guz's hypothesis; his study is entirely
explorative).
The suffix has a relatively high token frequency: there are 2,751 tokens in the
prose fiction section of the BNC, and 7,173 tokens in the newspaper section
(including all sub-genres of newspaper language, such as reportage, editorial,
etc.). This difference is not due to the respective sample sizes: the prose-fiction
section in the BNC is much larger than the newspaper section; thus, token
frequency would suggest that the suffix is more important in newspaper language
than in fiction, if token frequency meant anything.
In order to compare the two genres in terms of the type-token and hapax-token
ratios, they need to have the same size; the following is based on random
sub-samples of 2,750 tokens from each genre.
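The sampling step can be sketched as follows (a toy illustration in Python; the token lists are hypothetical stand-ins with the same overall token counts as in the case study):

```python
# Sketch: drawing equal-sized random sub-samples of affix tokens from two
# genres so that their TTRs and HTRs can be compared. The token lists are
# hypothetical stand-ins for the actual -ship tokens from the BNC.
import random

def subsample(tokens, n, seed=42):
    """Draw a random sub-sample of n tokens (without replacement)."""
    return random.Random(seed).sample(tokens, n)  # fixed seed: reproducible

prose_tokens = ["friendship"] * 1500 + ["relationship"] * 1200 + ["queenship"] * 51
news_tokens = ["championship"] * 5000 + ["partnership"] * 2100 + ["parentship"] * 73

n = min(len(prose_tokens), len(news_tokens))  # 2,751 tokens in this toy example
prose_sample = subsample(prose_tokens, n)
news_sample = subsample(news_tokens, n)
print(len(prose_sample), len(news_sample))  # 2751 2751
```

Fixing the random seed is what makes such a sub-sampling step reproducible by other researchers.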


Let us begin by looking at the types. Overall, there are 96 different types, of
which 46 occur in both samples (some examples of types that are frequent in both
samples are relationship (the most frequent word in the prose sample),
championship (the most frequent word in the news sample), friendship,
partnership, lordship, ownership and membership). In addition, there are 36 types
that occur only in the prose sample (for example, churchmanship, honourship,
queenship and swordsmanship) and 14 that occur only in the newspaper sample
(for example, associateship, distributorship, parentship and sportsmanship). This
means that the TTR of the prose sample is 82/2,750 = 0.0298, and that of the
newspaper sample is 60/2,750 = 0.0218. The fact that both TTRs are very small
confirms that the suffix -ship is not very central to the English lexicon. Crucially,
the difference is also very small and, as Table 11-12 shows, it is not statistically
significant, although it only just misses the 5% level (χ² = 3.5, df = 1, p = 0.0614). If
the difference turned out to be significant in a study based on a larger sample,
this could be due to a higher importance of the suffix in prose fiction, but it is also
possible that fiction has a higher overall lexical richness.
Table 11-12. Types with -ship in prose fiction and newspapers

                                GENRE
                    PROSE FICTION    NEWSPAPERS    Total
TYPE: NEW           Obs.    82       Obs.    60      142
                    Exp.    71       Exp.    71
TYPE: SEEN BEFORE   Obs. 2,668       Obs. 2,690    5,358
                    Exp. 2,679       Exp. 2,679
Total                    2,750            2,750    5,500
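The significance test reported for this table can be retraced, for instance, with scipy (a sketch; passing correction=False disables Yates' continuity correction, which reproduces the χ² value given in the text):

```python
# Sketch: chi-square test for Table 11-12 (new vs. previously seen types
# in two samples of 2,750 -ship tokens each).
from scipy.stats import chi2_contingency

table = [[82, 60],       # new types: prose fiction, newspapers
         [2668, 2690]]   # tokens of previously seen types
chi2, p, df, expected = chi2_contingency(table, correction=False)
print(round(chi2, 1), df, round(p, 4))  # 3.5 1 0.0614
```

The function also returns the expected frequencies, which correspond to the "Exp." cells of the table.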
Let us now turn to the hapax legomena, which are listed in Table 11-13a. At first
glance, there seem to be twenty-six hapax legomena in the prose-fiction sample,
and seventeen in the newspaper sample. But here, too, there is substantial
overlap. For example, craftsmanship, managership and ministership occur as hapax
legomena in both samples; other words that are hapaxes in one subsample occur
several times in the other. Obviously, such words cannot be considered hapaxes:
in designs comparing an affix across two subsamples, a word can only be
regarded as a hapax legomenon if its overall frequency in the combined sub-samples
is still 1 (this may seem obvious, but it is sometimes overlooked in the
literature).
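This principle can be made concrete in a few lines of Python (a sketch; the two short token lists are hypothetical stand-ins for the actual sub-samples):

```python
# Sketch: identifying hapax legomena against the combined sample, and
# computing the TTR for a sub-sample. Token lists are hypothetical
# stand-ins for the two BNC sub-samples.
from collections import Counter

def ttr(tokens):
    """Type-token ratio of a sample."""
    return len(set(tokens)) / len(tokens)

def hapaxes(sample, combined_counts):
    """Words of `sample` whose frequency in the combined sample is 1."""
    return {w for w in sample if combined_counts[w] == 1}

prose = ["friendship", "queenship", "craftsmanship", "friendship"]
news = ["championship", "craftsmanship", "parentship", "championship"]

combined = Counter(prose) + Counter(news)
prose_hapaxes = hapaxes(prose, combined)
news_hapaxes = hapaxes(news, combined)
# craftsmanship occurs once in each sub-sample, hence twice in the
# combined sample, and is therefore not counted as a hapax legomenon
print(prose_hapaxes, news_hapaxes)  # {'queenship'} {'parentship'}
print(ttr(prose), len(prose_hapaxes) / len(prose))  # TTR and HTR: 0.75 0.25
```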

Table 11-13a. Hapax legomena with -ship (BNC)

PROSE SUBSAMPLE: brinkmanship, chiefship, chieftainship, clerkship, cloudship,
connoisseurship, convenership, craftsmanship, executorship, gamesmanship,
generalship, headship, highwaymanship, honourship, impress-ship, mageship,
managership, mediumship, ministership, penmanship, professorship, queenship,
rulership, studentship, throneship, wardership

NEWS SUBSAMPLE: associateship, authorship, conductorship, councillorship,
cousinship, craftsmanship, distributorship, entrepreneurship, guardianship,
kingship, listenership, managership, ministership, parentship, salesmanship,
statesmanship, traineeship

Number of hapax legomena      PROSE   NEWS
  in the subsample               26     17
  in the combined sample         22      8
  in the whole BNC                8      4
The column "combined sample" shows those words that are still hapax legomena
if the two samples are combined: in the prose subsample, 22 hapaxes remain, so
the HTR is 22/2,750 = 0.008; in the newspaper sample, only 8 hapaxes remain, so the HTR
is 8/2,750 = 0.0029. Like the TTRs, then, the HTRs are extremely small,
confirming our assumption that the suffix is not particularly productive in either
of the two genres. Again, the difference between the HTRs is also very small, but
as Table 11-13b shows, it is statistically significant (χ² = 5.86, df = 1, p < 0.05,
φ = 0.0326).

Table 11-13b. Hapax legomena with -ship in fiction and newspapers

                                GENRE
                    PROSE            NEWSPAPERS      Total
TYPE: HAPAX         Obs.    21       Obs.     8         29
LEGOMENON           Exp.  14.5       Exp.  14.5
TYPE: OTHER         Obs. 2,729       Obs. 2,742      5,471
                    Exp. 2,735.5     Exp. 2,735.5
Total                    2,750            2,750      5,500
v0 2,750 5,500

We might be tempted to use an even stricter definition of hapax legomenon, one that
would seem to get us even closer to what hapax legomena are supposed to
measure: since the fiction and newspaper samples were both drawn from the
much larger BNC, we could count only those words as hapaxes whose frequency
in the entire corpus is 1. The column "whole BNC" shows those words that would
remain hapaxes if the entire corpus were taken into consideration: 8 in the prose
sample and 4 in the newspaper sample. This difference is no longer statistically
significant (χ² = 1.34, df = 1, p > 0.05).
However, while this may seem a reasonable idea, it is not: by increasing the
size of the sample used to determine hapax-hood without increasing the size of
the sample in which we identify potential hapax legomena, we are distorting the
data. Our assumption when counting hapaxes in a sample is not that all instances
are actually hapaxes in the language as a whole, but that the hapax-token relation
in the sample is comparable to the hapax-token relation in the language as a
whole (which does not mean that it is identical: recall that the HTR falls as we
increase the size of our sample).
By taking the entire BNC into account when determining what counts as a
hapax legomenon, but not when identifying potential hapax legomena, we are
destroying the relationship between the HTR measured in our sample and the
HTR in the language as a whole. Ultimately, the number of hapax legomena will
fall to zero if we continue to define them against an ever larger sample. For
example, a single word from the list in Table 11-13a retains its status as a hapax
legomenon if we search the Google Books collection: impress-ship. At the same
time, the Google Books collection contains hundreds (if not thousands) of hapax
legomena that we never even notice (such as Johnship, 'the state of being the
individual referred to as John').
This case study has demonstrated the potential of using the TTR and the HTR
not as a means of assessing morphological richness and productivity as such, but
as a means of assessing genres with respect to their richness and productivity. It
has also demonstrated some of the problems of identifying hapax legomena in the
context of such cross-text-type comparisons. As mentioned initially, there are few
studies of this type, but cf. Plag et al.'s (1999) study of productivity across written
and spoken language.

b) Case study: Productivity and speaker sex. Morphological productivity has not
traditionally been investigated from a sociolinguistic perspective, but a study by
Säily (2009) suggests that this may be a promising field of research. Säily
investigates differences in the productivity of the suffixes -ness and -ity in the
language produced by men and women in the BNC. She finds no difference in
productivity for -ness, but a higher productivity of -ity in the language produced
by men (cf. also Säily and Suomela 2009 for a diachronic study with very similar
results). She uses a sophisticated method involving the comparison of the
suffixes' type and hapax growth rates, but let us replicate her study using the
simple method used in the preceding case study, beginning with a comparison of
type-token ratios.
Like the ICE-GB (cf. Chapter 7, Section 7.2.2), the BNC contains substantially
more speech and writing by male speakers than by female speakers, which is
reflected in differences in the number of affix tokens produced by men and
women: for -ity, there are 2,562 tokens produced by women and 8,916 tokens
produced by men; for -ness, there are 616 tokens produced by women and 1,154
tokens produced by men (note that unlike Säily, I excluded the words business and
witness, since they did not seem to me to be synchronically transparent instances
of the affix). To get samples of equal size for each affix, random sub-samples were
drawn from the tokens produced by men.
Based on these subsamples, the type-token ratios for -ity are 0.0652 for women
and 0.0777 for men; as Table 11-15a shows, this difference is not statistically
significant (χ² = 3.01, df = 1, p > 0.05, φ = 0.0242).

Table 11-15a. Types with -ity in male and female speech (BNC)

                              SPEAKER SEX
                    FEMALE           MALE            Total
TYPE: NEW           Obs.   167       Obs.   199        366
                    Exp.   183       Exp.   183
TYPE: SEEN BEFORE   Obs. 2,395       Obs. 2,363      4,758
                    Exp. 2,379       Exp. 2,379
Total                    2,562            2,562      5,124
The type-token ratios for -ness are much higher, namely 0.1981 for women and
0.2597 for men. As Table 11-15b shows, the difference is statistically significant,
although the effect size is weak (χ² = 5.37, df = 1, p < 0.05, φ = 0.066).

Table 11-15b. Types with -ness in male and female speech (BNC)

                              SPEAKER SEX
                    FEMALE         MALE          Total
TYPE: NEW           Obs. 122       Obs. 156        278
                    Exp. 139       Exp. 139
TYPE: SEEN BEFORE   Obs. 494       Obs. 460        954
                    Exp. 477       Exp. 477
Total                    616            616      1,232
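The φ values reported in this section are the effect size φ = √(χ²/N); a sketch for the figures of Table 11-15b, assuming scipy is available:

```python
# Sketch: chi-square test plus the phi effect size, phi = sqrt(chi2 / N),
# for Table 11-15b (types with -ness in male and female speech).
from math import sqrt
from scipy.stats import chi2_contingency

table = [[122, 156],   # new types: female, male
         [494, 460]]   # tokens of previously seen types
chi2, p, df, _ = chi2_contingency(table, correction=False)
n = sum(map(sum, table))        # total number of tokens (N = 1,232)
phi = sqrt(chi2 / n)
print(round(chi2, 2), round(phi, 3))  # 5.37 0.066
```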
Note that Säily investigates spoken and written language separately and she also
includes social class in her analysis, so her results differ from the ones presented
here; she finds a significantly lower HTR for -ness in lower-class women's speech
in the spoken sub-corpus, but not in the written one, and a significantly lower
HTR for -ity in both sub-corpora. This might be due to the different methods
used, or to the fact that I excluded business, which is disproportionately frequent in
male speech and writing in the BNC and would thus reduce the diversity in the
male sample substantially. However, the type-based differences do not have a
very impressive effect size in our design and they are unstable across conditions
in Säily's, so perhaps they are simply not very substantial.
Let us turn to the HTR next. As before, we are defining what counts as a
hapax legomenon not with reference to the individual subsamples of male and
female speech, but with respect to the combined sample. Table 11-16a shows the
hapaxes for -ity in the male and female samples. The HTRs are very low,
suggesting that -ity is not a very productive suffix: 0.0099 in female speech and
0.016 in male speech.

Table 11-16a. Hapaxes with -ity in samples of male and female speech (BNC)

MALE: abnormality, antiquity, applicability, brutality, civility, criminality,
deliverability, divinity, duplicity, eccentricity, eventuality, falsity, femininity,
fixity, frivolity, illegality, impurity, inexorability, infallibility, infirmity, levity,
longevity, mediocrity, obesity, perversity, predictability, rationality, regularity,
reliability, scarcity, seniority, serendipity, solidity, subsidiarity, susceptibility,
tangibility, verity, versatility, virtuality, vitality, voracity

FEMALE: absurdity, adjustability, admissibility, centrality, complicity,
effemininity, enormity, exclusivity, gratuity, hilarity, humility, impunity,
inquisity, morbidity, municipality, originality, progility, respectability, sanity,
scaleability, sincerity, spontaneity, sterility, totality, virginity
Although the difference in HTR is relatively small, Table 11-16b shows that it is
statistically significant, albeit again with a very weak effect size (χ² = 3.93, df = 1,
p < 0.05, φ = 0.0277).
Table 11-16b. Hapax legomena with -ity in male and female speech (BNC)

                              SPEAKER SEX
                    FEMALE           MALE            Total
TYPE: HAPAX         Obs.    25       Obs.    41         66
LEGOMENON           Exp.    33       Exp.    33
TYPE: OTHER         Obs. 2,537       Obs. 2,521      5,058
                    Exp. 2,529       Exp. 2,529
Total                    2,562            2,562      5,124
Table 11-17a shows the hapaxes for -ness in the male and female samples. The
HTRs are low, but much higher than for -ity: 0.0795 for women and 0.1023 for
men.

Table 11-17a. Hapaxes with -ness in samples of male and female speech (BNC)

FEMALE: ancientness, appropriateness, badness, bolshiness, chasifness,
childishness, chubbiness, clumsiness, conciseness, eagerness, easiness,
faithfulness, falseness, feverishness, fizziness, freshness, ghostliness, greyness,
grossness, grotesqueness, heaviness, laziness, likeness, mysteriousness, nastiness,
outspokenness, pinkness, plainness, politeness, prettiness, priggishness, primness,
randomness, responsiveness, scratchiness, sloppiness, smoothness, stiffness,
stretchiness, tenderness, tightness, timelessness, timidness, ugliness,
uncomfortableness, unpredictableness, untidiness, wetness, zombieness

MALE: abjectness, adroitness, aloneness, anxiousness, awfulness, barrenness,
blackness, blandness, bluntness, carefulness, centredness, cleansiness, clearness,
cowardliness, crispness, delightfulness, differentness, dizziness, drowsiness,
dullness, eyewitnesses, fondness, fullness, genuineness, godliness, graciousness,
headedness, heartlessness, heinousness, keenness, lateness, likeliness,
limitedness, loudness, mentalness, messiness, narrowness, nearness,
neighbourliness, niceness, numbness, pettiness, pleasantness, plumpness,
positiveness, quickness, reasonableness, rightness, riseness, rudeness, sameness,
sameyness, separateness, shortness, smugness, softness, soreness, springiness,
steadiness, stubbornness, timorousness, toughness, uxoriousness
As Table 11-17b shows, the difference in HTRs is not statistically significant, and
the effect size would be very weak anyway (χ² = 1.93, df = 1, p > 0.05, φ = 0.0395).
Table 11-17b. Hapax legomena with -ness in male and female speech (BNC)

                              SPEAKER SEX
                    FEMALE         MALE          Total
TYPE: HAPAX         Obs.  49       Obs.  63        112
LEGOMENON           Exp.  56       Exp.  56
TYPE: OTHER         Obs. 567       Obs. 549      1,120
                    Exp. 560       Exp. 560
Total                    616            616      1,232
In this case, the results correspond to Säily's, who also finds a significant
difference in productivity for -ity, but not for -ness.
This case study was meant to demonstrate, once again, the method of
comparing TTRs and HTRs based on samples of equal size. It was also meant to
draw attention to the fact that morphological productivity may be an interesting
area of research for variationist sociolinguistics; however, it must be pointed out
that it would be premature to conclude that men and women differ in their
productive use of particular affixes; as Säily herself points out, men and women
are not only represented unevenly in quantitative terms (with a much larger
proportion of male language included in the BNC), but also in qualitative terms
(the text types with which they are represented differ quite strikingly). Thus, this
may actually be another case of different degrees of productivity in different text
types (which we investigated in the preceding case study).

Further Reading

For a different proposal of how to evaluate TTRs statistically, see Baayen
(2008, Section 6.5); for a very interesting method of comparison for TTRs and
HTRs based on permutation testing instead of classical inferential statistics, see
Säily and Suomela (2009).

Baayen, Harald. 2008. Analyzing linguistic data. A practical introduction.
Cambridge & New York: Cambridge University Press.
Säily, Tanja & Jukka Suomela. 2009. Comparing type counts: The case of
women, men and -ity in early English letters. Language and Computers:
Studies in Practical Linguistics 69. 87–109.
12 Text
As mentioned repeatedly, linguistic corpora, by their nature, consist of word
forms, while other levels of linguistic representation are not represented unless
the corresponding annotations are added. In written corpora, there is one level
other than the lexical that is (or can be) directly represented: the text. Well-constructed
linguistic corpora typically consist of (samples from) individual texts,
whose meta-information (author, title, original place and context of publication,
etc.) is known. There is a substantial body of corpus-linguistic research based on
designs that combine the two inherently represented variables WORD (FORM) and
TEXT; such designs may be concerned with the occurrence of words in individual
texts, or, more typically, with the occurrence of words in clusters of texts
belonging to the same text type (defined by topic, genre, function, etc.).
Texts are, of course, produced by speakers, and depending on how much and
what type of information about these speakers is available, we can also cluster
texts according to demographic variables such as dialect, socioeconomic status,
gender, age, political or religious affiliation, etc. (as we have done in many of the
examples in earlier chapters). In these cases, quantitative corpus linguistics is
essentially a variant of sociolinguistics, one that differs mainly in that the linguistic
phenomena it pays most attention to are not necessarily those most central to
sociolinguistic research in general.
12.1 Key-word analysis


In the investigation of relationships between words (or other units of language
structure) and texts (or clusters of texts), researchers frequently use a method
referred to as key-word analysis. The term was originally used in contexts where
cultural values and practices were studied through particular lexical items (cf.
Williams 1976, Wierzbicka 2003); in corpus linguistics, it is used in a related, but
slightly broader sense of words that are characteristic of a particular text, text
type or demographic in the sense that they occur with unusual frequency in a
given text or set of texts, where unusual means high "by comparison with a
reference corpus of some kind" (Scott 1997: 236).
In other words, the corpus-linguistic identification of key words is analogous
to the identification of differential collocates, except that it analyses the
association of a word W to a particular text (or collection of texts) T in
comparison to the language as a whole (as represented by the reference corpus,
which is typically a large, balanced corpus). Table 12-1 shows this schematically.

Table 12-1: Key Word Analysis

                             TEXT
                 TEXT/CORPUS T    REFERENCE CORPUS    Total
WORD: WORD W     O11              O12                 R1
WORD: OTHER      O21              O22                 R2
WORDS
Total            C1               C2                  N
Just like collocation analysis, key word analysis is typically applied inductively,
but there is nothing that precludes a deductive design if we have hypotheses
about the over- or underrepresentation of particular lexical items in a particular
text or collection of texts. In either case, we have two nominal variables: KEY
WORDS (with the words as values) and TEXT (with the values TEXT and REFERENCE
CORPUS).
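The association measure used in the keyword tables in this chapter is the log-likelihood ratio statistic G². It can be computed from the four cell frequencies of the schema in Table 12-1; the following sketch in Python reproduces the G² value of NEOSHO in Table 12-3a below:

```python
# Sketch: log-likelihood ratio G2 = 2 * sum(O * ln(O / E)) for a 2x2 table,
# with the convention that a cell with O = 0 contributes nothing.
from math import log

def g_squared(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22   # row totals
    c1, c2 = o11 + o21, o12 + o22   # column totals
    g2 = 0.0
    for o, r, c in ((o11, r1, c1), (o12, r1, c2), (o21, r2, c1), (o22, r2, c2)):
        if o > 0:
            g2 += o * log(o / (r * c / n))  # E = row total * column total / n
    return 2 * g2

# NEOSHO in the fish-population report vs. the reference corpus
print(round(g_squared(244, 0, 22176, 1165321), 1))  # 1939.9, as in Table 12-3a
```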
If key word analysis is applied to a single text, the aim is typically to identify
either the topic area or some stylistic property of that text. When applied to text
types (registers, genres, etc.), the aim is typically to identify general lexical and/or
grammatical properties of the respective text type. As a first example of the kind
of results that key-word analysis yields, consider Table 12-2, which shows the 20
most frequent tokens (including punctuation marks) in the BROWN corpus and
two individual texts.
As we can see, the differences are relatively small, as all lists are dominated by
frequent function words and punctuation marks. Ten of these occur on all three
lists (a, and, in, of, that, the, to, was, the comma and the period), and another
seven occur on two of them (as, he, his, it, on, and opening and closing quotation
marks). Even the types that occur only once are mostly uninformative with
respect to the type of text we may be dealing with (1959, at, by, for, had, is, with,
the hyphen and opening and closing parentheses). The only exceptions are four
content words in Text A: Neosho, river, species, station; these suggest that the
text is about rivers (Neosho being the name of a river) and perhaps biology (as
suggested by the word species).

Table 12-2: Most frequent words in three texts

BROWN          TEXT A             TEXT B
THE    6.00    THE      6.45     THE    5.85
,      4.97    ,        5.41     .      5.62
.      4.20    IN       4.02     ,      4.52
OF     3.12    .        3.97     OF     2.58
AND    2.48    OF       3.33     A      2.31
TO     2.24    AND      2.62     TO     2.08
A      1.99    (        1.48     HE     1.92
IN     1.83    )        1.48     “      1.50
THAT   0.93    AT       1.27     ”      1.49
IS     0.88    WAS      1.27     AND    1.48
WAS    0.86    TO       1.27     HIS    1.40
HE     0.84    A        1.27     WAS    1.38
FOR    0.81    WERE     1.21     IN     1.12
IT     0.78    NEOSHO   1.09     THAT   1.10
“      0.76    SPECIES  0.85     HAD    0.99
”      0.75    THAT     0.70     HUME   0.81
WITH   0.63    STATION  0.69     ON     0.78
AS     0.62    BY       0.65     IT     0.68
HIS    0.60    RIVER    0.62     -      0.63
ON     0.58    1959     0.58     AS     0.63
Applying key word analysis to each text or collection of texts identifies the
words that differ most significantly in frequency from the reference corpus and
thereby tells us how the text in question differs lexically from the (written)
language of its time as a whole. Table 12-3a lists the key words for Text A.
The key words now convey a very specific idea of what the text is about: there
are two proper names of rivers (the Neosho, already seen on the frequency list, and
the Marais des Cygnes, represented by its constituents Cygnes, Marais and des),
there are a number of words for specific species of fish as well as the words river
and channel.

Table 12-3a: Key words in a report on fish populations

KEY WORD     O11    O12     O21       O22        G²
NEOSHO       244      0   22176   1165321   1939.90
SPECIES      191     37   22229   1165284   1317.28
STATION      155    109   22265   1165212    877.91
1957         122     46   22298   1165275    773.81
1959         130     98   22290   1165223    725.05
(            331   2392   22089   1162929    707.68
)            331   2426   22089   1162895    700.21
RIVER        139    166   22281   1165155    690.33
FISH         104     35   22316   1165286    670.66
CYGNES        83      0   22337   1165321    659.30
MARAIS        83      0   22337   1165321    659.30
CATFISH       80      2   22340   1165319    616.73
DES           84      9   22336   1165312    608.45
SHINER        75      0   22345   1165321    595.72
CHANNEL       81     16   22339   1165305    557.14
ABUNDANCE     78     13   22342   1165308    545.42
MINNOW        63      0   22357   1165321    500.38
LOWER        100    123   22320   1165198    492.31
TAKEN        122    281   22298   1165040    485.74
UPPER         88     75   22332   1165246    476.95
The text is clearly about fish in the two rivers. The occurrence of the words
station and abundance suggests a research context, which is supported by the
occurrence of two dates and opening and closing parentheses (which are often
used in scientific texts to introduce references). The text in question is indeed a
scientific report on fish populations: Fish Populations, Following a Drought, in the
Neosho and Marais des Cygnes Rivers of Kansas (available via Project Gutenberg, see
Appendix A.1). Note that the occurrence of some tokens (such as the dates and the
parentheses) may be characteristic of a text type rather than an individual text, a
point we will return to below.
Next, consider Table 12-3b, which lists the key words for Text B. Three things
are noticeable: the keyness of a number of words that are most likely proper
names (Hume, Vye, Rynch, Wass, Brodie and Jumala), pronouns (he, his) and
punctuation marks indicative of direct speech (the quotation marks and the
exclamation mark).

Table 12-3b: Key words in a science fiction novel

KEY WORD     O11    O12     O21       O22        G²
HUME         325      4   39904   1165317   2169.65
VYE          228      0   40001   1165321   1551.70
RYNCH        134      0   40095   1165321    911.66
WASS         100      0   40129   1165321    680.26
HE           772   9796   39457   1155525    393.06
HIS          565   6996   39664   1158325    301.70
FLITTER       41      0   40188   1165321    278.85
BRODIE        44     10   40185   1165311    248.18
“            604   8808   39625   1156513    221.63
”            601   8771   39628   1156550    220.08
HUNTER        42     19   40187   1165302    211.27
L-B           31      0   40198   1165321    210.83
NEEDLER       30      0   40199   1165321    204.03
GUILD         33      7   40196   1165314    187.81
HAD          397   5230   39832   1160091    185.30
JUMALA        27      0   40202   1165321    183.62
SAFARI        29      2   40200   1165319    182.53
.           2261  48958   37968   1116363    175.98
!            125    796   40104   1164525    172.79
This does not tell us anything about this particular text, but taken together, these
pieces of evidence point to a particular genre: narrative text (novels, short stories,
etc.). The few potential content words suggest particular sub-genres: hunter and
guild suggest a historical or a fantasy novel, while the unusual words flitter and needler
support the latter hypothesis. If we were to include the next ten most strongly
associated nouns, we would find camp, patrol, out-hunter, globes, barrier, beast,
tube, spacer and starfall, which strongly suggest that we are, in fact, dealing with
a science-fiction novel. And indeed, the text in question is the science-fiction
novel Starhunter by Andre Alice Norton (available via Project Gutenberg, see
Appendix A.1).
Again, the keywords identified are a mixture of topical markers and markers
for the text type (in this case, genre) of the text, so even a study of the key words
of single texts provides information about more general linguistic properties of
the text in question as well as its specific topic.
12.2 Keywords: Case studies

12.2.1 Text type
Keyword analysis has been applied to a wide range of text types defined by topic
(e.g. travel writing), genre (e.g. news reportage) or both (e.g. history textbooks)
(see the contributions in Bondi and Scott 2010 for recent examples). Here, we will
look at two case studies of scientific language.

12.2.1.1 Case study: Keywords in scientific writing. There are a number of keyword-based
analyses of academic writing (cf., for example, Scott and Tribble 2006 on
literary criticism, Römer and Wulff 2010 on academic student essays). Instead of
replicating one of these studies in detail, let us look more generally at the
Learned section of the BROWN corpus, using the rest of BROWN (all sections
except the Learned section) as a reference corpus.
Table 12-4 shows the key words for this section.

Table 12-4: Key words in the Learned section of BROWN

KEY WORD       O11     O12      O21      O22       G²
OF            7452   28958   173672   955248   644.42
(              853    1539   180271   982667   580.83
)              856    1570   180268   982636   569.08
IS            2410    7789   178714   976417   455.69
THE          12537   57433   168587   926773   308.32
T               90       5   181034   984201   297.63
ANODE           77       0   181047   984206   286.71
C              104      46   181020   984160   217.87
DATA           113      60   181011   984146   217.71
INDEX           72       9   181052   984197   214.62
IN            4097   17240   177027   966966   209.21
[?]             76      16   181048   984190   203.38
BY            1216    4089   179908   980117   198.15
SURFACE        118      82   181006   984124   196.34
CELLS           69      12   181055   984194   193.02
SYSTEM         185     234   180939   983972   192.82
STRESS          80      27   181044   984179   186.11
FUNCTION        81      32   181043   984174   177.73
1              161     194   180963   984012   175.98
DICTIONARY      54       4   181070   984202   173.30
In this case, there is little we can say about the specific content of the collection
of texts: there is a dominance of academic terms, but they are either very general
(data, index), or relate to different areas of study (anode from the domain of
electricity, cells from biology, dictionary from the language sciences). This reflects
the fact that the Learned section of the BROWN corpus is a collection of 80
different texts from the natural sciences, medicine, mathematics, the social and
behavioral sciences, political science, law and education.
This case study demonstrates a typical application of keyword analysis to a
collection of texts of a particular type, which allows us to make statements about
the relevant text type in general. For example, we can see that academic writing is
characterized by scientific terminology: not a surprising result, but one which
demonstrates that the method works. We also see that certain types of
punctuation are typical of academic writing in general (such as the parentheses,
which we already suspected based on the analysis of the fish population report in
Section 12.1 above). Finally, and perhaps most interestingly, this type of analysis
can reveal function words that are characteristic for a particular text type and
thus give us potential insights into grammatical structures that may be typical for
it; for example, is, the and of are among the most significant key words of
academic writing. The latter two are presumably related to the "nominal"
style that is known to characterize academic texts, while the higher-than-normal
frequency of is may be due to the prevalence of definitions, statements of
equivalence, etc. This (and other observations made on the basis of key-word
analysis) would of course have to be followed up by more detailed analyses of the
function these words serve, but key-word analysis tells us what words are likely
to be interesting to investigate.

12.2.1.2 Case study: [a(n) + N + of] in medical research papers. Of course, key word
analysis is not the only way to study lexical characteristics of text types. In
principle, any design studying the interaction of lexical items with other units of
linguistic structure can also be applied to specific text types.
For example, Marco (2000) investigates collocational frameworks (see Chapter
10, Section 10.2.1) in medical research papers. While this may not sound
particularly interesting at first glance, it turns out that even highly frequent
frameworks like [a(n) _ of] are filled by completely different items from those
found in the language as a whole, which is important for many applied purposes
(such as language teaching or machine processing of language), but which also
shows just how different text types can actually be. Since Marco's corpus is not
publicly available, let us instead use the Learned subsection of the BROWN
corpus again. Table 12-5 shows the 15 most strongly associated words in the
framework [a(n) __ of], i.e. the words whose frequency of occurrence inside this
framework differs most significantly from their frequency of occurrence outside
of this framework in the same corpus.

Table 12-5: Most strongly associated words in the framework [a(n) + _ + of] in the
Learned subsection of the BROWN corpus
FRAMEWORK O11 O12 O21 O22 G2
NUMBER 54 119 518 180440 412.98
MATTER 18 27 554 180532 147.45
SERIES 13 14 559 180545 112.69
VARIETY 13 16 559 180543 110.21
FUNCTION 13 68 559 180491 79.06
MEANS 11 54 561 180505 68.11
KIND 8 22 564 180537 57.58
RESULT 9 48 563 180511 54.36
COMBINATION 5 8 567 180551 40.35
PRODUCT 6 26 566 180533 38.43
SET 7 64 565 180495 35.37
PIECE 4 4 568 180555 35.03
I.Q. 3 1 569 180558 30.07
PERIOD 6 62 566 180497 28.96
STUDY 6 85 566 180474 25.46

If we compare the result in Table 12-5 to that in Table 10-2 in Chapter 10, we
notice clear differences between the use of this framework in academic texts and
the language as a whole; for example, lot, which is most strongly associated with
the framework in the general language, does not occur at all; instead, there are
many typically academic terms like function, I.Q. and study. However, it is
actually difficult to assess the magnitude of the differences, as both tables were
derived independently.
A better way of assessing the specific characteristics of a collocational
framework in a particular text type is to combine collocational framework
analysis and keyword analysis by applying the latter to the items occurring in a
particular framework (i.e., instead of statistically evaluating the frequencies of all
words in the two corpora, we evaluate the frequencies of those words occurring
in a particular collocational framework (or collocation, collostruction, etc.) in the
two corpora).
For example, Table 12-6 shows the results of a comparison of words occurring
in the framework [a(n) + ___ + of] in the LEARNED section of the Brown corpus
with those occurring in this framework in ALL OTHER SECTIONS of the Brown
corpus.

Table 12-6. Key-words in the framework [a(n) + _ + of] in the LEARNED section of
BROWN compared to ALL OTHER SECTIONS of BROWN
KEY WORD O11 O12 O21 O22 G2
KEY FOR LEARNED
FUNCTION 13 1 504 2353 38.03
NUMBER 46 64 471 2290 35.31
PRODUCT 6 2 511 2352 12.42
MEANS 11 11 506 2343 11.70
VARIETY 11 12 506 2342 10.75
SET 6 3 511 2351 10.35
DISTRIBUTION 3 0 514 2354 10.30
CURVE 3 0 514 2354 10.30
REFUND 3 0 514 2354 10.30
PHENOMENON 3 0 514 2354 10.30
KEY FOR THE REST OF BROWN
LOT 1 50 516 2304 13.60
COUPLE 2 58 515 2296 12.54
GROUP 2 52 515 2302 10.54
MEMBER 2 44 515 2310 7.97
BIT 0 19 517 2335 7.57
COLLECTION 0 16 517 2338 6.37
CUP 0 11 517 2343 4.38
WORD 0 10 517 2344 3.98
MAN 1 22 516 2332 3.96
SORT 2 30 515 2324 3.84

Perhaps not surprisingly, scientific vocabulary dominates the keywords in
academic language (even more clearly than in the simple collocational framework
analysis above), while general (particularly informal) quantifying and
categorizing terms dominate in the general language. Interestingly, however, we
now see that the word number, which occurs frequently in this framework both
in the general language and in academic language, is much more typical of the
latter than of the former.
This case study shows the variability that even seemingly simple grammatical
patterns may display across text types. It is also meant to demonstrate how
simple techniques like collocational-framework analysis can be combined with
more sophisticated techniques to yield more insightful results.
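The combined procedure just described can be sketched in a few lines of code. The following is a minimal illustration, not the script actually used to produce the tables in this section; the function names and the toy input format are my own. Given the words filling a collocational framework in two subcorpora, it builds a 2x2 frequency table for each word and ranks the words by the log-likelihood statistic G2:

```python
import math
from collections import Counter

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 frequency table."""
    n = o11 + o12 + o21 + o22
    cells = [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]
    # 2 * sum of O * ln(O/E) over all four cells; empty cells contribute nothing
    return 2 * sum(o * math.log(o / (row * col / n))
                   for o, row, col in cells if o > 0)

def framework_keywords(fillers_a, fillers_b):
    """Rank the words filling a framework in subcorpus A against the
    fillers of the same framework in subcorpus B (highest G2 first)."""
    freq_a, freq_b = Counter(fillers_a), Counter(fillers_b)
    n_a, n_b = sum(freq_a.values()), sum(freq_b.values())
    scores = {w: g2(freq_a[w], freq_b[w], n_a - freq_a[w], n_b - freq_b[w])
              for w in set(freq_a) | set(freq_b)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Applied to the row FUNCTION of Table 12-6, g2(13, 1, 504, 2353) yields approximately 38.03, matching the value in the table.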

12.2.2 Comparing speech communities


As pointed out at the beginning of this chapter, a keyword analysis of corpora
that are defined by demographic variables is essentially a variant of variationist
sociolinguistics. The basic method remains the same, the only difference being
that the corpora under investigation either have to be constructed based on the
variables in question, or, more typically, that existing corpora have to be
separated into sub-corpora accordingly. This is true for inductive key word
analyses as well as for the type of deductive analysis of individual words or
constructions that we used in some of the examples in earlier chapters. The
dependent variable will, as in all examples in this and the preceding chapter,
always be nominal, consisting of (some part of) the lexicon (with words as
values).
Language variety (dialect, sociolect etc.) is an obvious demographic category
to investigate using corpus-linguistic methods. Language varieties differ from
each other along a number of dimensions, one of which is the lexicon. While
lexical differences have not tended to play a major role in mainstream
sociolinguistics, they do play a role in corpus-based sociolinguistics: first, because
they are relatively easy to extract from appropriately constructed corpora, but
also because they have traditionally been an important defining criterion of
dialects (and continue to be so, especially in applied contexts).
Many of the examples in earlier chapters of this book demonstrate how, in
principle, lexical differences between varieties can be investigated: take two
sufficiently large corpora representing two different varieties, and study the
distribution of a particular word across these two corpora. Alternatively, we can
study the distribution of all words across the two corpora in the same way as we
have studied their distribution across texts or text types in the preceding section.
This was actually done fairly early, long before the invention of key word
analysis, by Johansson and Hofland (1989). They compared all word forms in the
LOB and BROWN corpora using a coefficient of difference calculated by the
following formula:

(12.1)    ( f(word, LOB) − f(word, BROWN) ) / ( f(word, LOB) + f(word, BROWN) )

This formula will give us the percentage of uses of the word in the LOB or the
BROWN corpus (whichever frequency is larger), with a negative sign if the word
is more frequent in the BROWN corpus.37 In addition, Johansson and Hofland test
each difference for significance using the chi-square test. As discussed in Chapter
9, it is more advisable to use an association measure like Log Likelihood or the
p-value of Fisher's exact test, because percentages will massively overestimate
infrequent events (a word that occurs only a single time will be seen as 100 per
cent typical of whichever corpus it happens to occur in); also, the chi-square test
cannot be applied to infrequent words. Still, Johansson and Hofland's basic idea is
of course highly interesting and their work constitutes the first example of a
keyword analysis that I am aware of.
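Johansson and Hofland's coefficient is straightforward to compute. The sketch below is my own implementation (the function name is mine), covering both the equal-size case in (12.1) and the size-normalized variant mentioned in footnote 37:

```python
def difference_coefficient(freq_a, freq_b, size_a=None, size_b=None):
    """Johansson and Hofland's difference coefficient for a single word.

    For corpora of (roughly) equal size, raw frequencies can be used
    directly; for corpora of different sizes, pass size_a and size_b
    to normalize the frequencies by the total corpus sizes first."""
    if size_a is not None and size_b is not None:
        freq_a, freq_b = freq_a / size_a, freq_b / size_b
    # positive values lean towards corpus A, negative values towards corpus B
    return (freq_a - freq_b) / (freq_a + freq_b)
```

For towards (318 occurrences in LOB, 64 in BROWN), the coefficient is roughly 0.66, i.e. the word leans towards LOB; a word with a single occurrence in only one of the corpora receives the maximal value 1 (or -1), which is the overestimation problem mentioned above.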
Comparing two (large) corpora representing two varieties will not, however,
straightforwardly result in a list of dialect differences. Instead, there are at least
five types of differences that must be selectively discarded in any analysis; these
are illustrated by Table 12-7, which shows the 15 most strongly differential key
words for the LOB and BROWN corpora respectively, based on a direct
comparison of the two corpora.
37 Note that this formula only works because the two corpora are roughly equal in size;
if you want to apply it to two corpora A and B that have different sizes, the
frequencies have to be normalized by dividing them by the respective total corpus size
(this is the case, for example, in Schmid's study reported in Case Study 12.2.3.1 below):

(i)    ( f(word, A)/size(A) − f(word, B)/size(B) ) / ( f(word, A)/size(A) + f(word, B)/size(B) )


Table 12-7. Key words of British and American English based on a comparison of
LOB and BROWN.
WORD FREQUENCY IN LOB FREQUENCY IN BROWN OTHER WORDS IN LOB OTHER WORDS IN BROWN G2
MOST STRONGLY ASSOCIATED WITH LOB
 3064 187 1154620 1165143 3098.38
 3048 185 1154636 1165145 3086.63
MR 266 0 1157418 1165330 370.54
LABOUR 285 4 1157399 1165326 360.34
LONDON 502 92 1157182 1165238 314.11
SIR 456 95 1157228 1165235 259.71
I 7617 5841 1150067 1159489 248.33
COLOUR 139 0 1157545 1165330 193.62
TOWARDS 318 64 1157366 1165266 185.97
CENTRE 144 2 1157540 1165328 182.21
SHE 4090 2984 1153594 1162346 181.54
ROUND 336 76 1157348 1165254 178.95
BRITAIN 301 61 1157383 1165269 175.10
COMMONWEALTH 158 7 1157526 1165323 171.81
CENT. 129 1 1157555 1165329 169.34
MOST STRONGLY ASSOCIATED WITH BROWN
@ 0 440 1157684 1164890 607.16
PROGRAM 0 395 1157684 1164935 545.06
TOWARD 14 386 1157670 1164944 430.76
STATES 126 613 1157558 1164717 346.40
CENTER 0 226 1157684 1165104 311.84
STATE 273 845 1157411 1164485 303.37
DEFENSE 0 167 1157684 1165163 230.43
 6834 8771 1150850 1156559 230.01
LABOR 0 150 1157684 1165180 206.97
COLOR 0 141 1157684 1165189 194.55
PROGRAMS 0 139 1157684 1165191 191.79
 7051 8808 1150633 1156522 184.92
FEDERAL 33 246 1157651 1165084 182.57
YORK 67 309 1157617 1165021 167.27
AMERICAN 226 570 1157458 1164760 151.47

First, there are differences that are due to the way the corpora are constructed.
The @ sign is no more frequent in American English than in British English; it
was simply used as a stand-in for a range of non-ASCII characters in the
BROWN, but not in the LOB corpus. This seems trivial, but since it is impossible
to find two corpora of different varieties (of any language) that are constructed in
the same way in every detail, we will get odd results like the @ sign in any key
word analysis across corpora unless we normalize the corpora before the analysis
(which is extremely time consuming even with a 1-million-word corpus).
Second, there are differences in spelling. For example, labour is spelled with ou
in Britain, but with o in the USA, and the US-American defense is spelled defence
in Britain. Less obviously, abbreviated terms of address like Mr are mostly spelled
without a period in American English but both with and without a period in
British English. These differences are, in some sense, dialectal and may be of
interest in applied contexts, but they are not of primary interest to most linguists.
In fact, they are often irritating, since of course we would like to know whether
words like labo(u)r or Mr(.) are more typical for British or for American English
aside from the spelling differences. To find out, we have to normalize spellings in
the corpora before comparing them (which is possible, but labo(u)r-intensive).
Third, there are proper nouns that differ in frequency across corpora; for
example, geographical names like London, Britain, Commonwealth, and (New)
York will differ in frequency because their referents are of different degrees of
interest to the speakers of the two varieties. There are also personal names that
differ across corpora; for example, the name Macmillan occurs 62 times in the
LOB corpus but only once in BROWN; this is because in 1961, Harold Macmillan
was the British Prime Minister and thus Brits had more reason to mention the
name. But there are also names that differ in frequency because they differ in
popularity in the speech communities: for example, Mike is a keyword for
BROWN, Michael for LOB. Thus, proper names may differ in frequency for purely
cultural or for linguistic reasons; the same is true of common nouns.
Fourth, nouns may differ in frequency not because they are dialectal, but
because the things they refer to play a different role in the respective culture.
State, for example, is a word found in both varieties, but it is more frequent in US-
American English because the USA is organized into 50 states that play an
important cultural and political role.
Fifth, nouns may differ in frequency due to dialectal differences (as we saw in
many of the examples in previous chapters). Take toward and towards, which
mean the same thing, but for which the first variant is preferred in US-American
and the second in British English. Or take round, which is an adjective meaning
'shaped like a circle or a ball' in both varieties, but also an adverb with a range of
related meanings that corresponds to American English around.
Perhaps because there appears, superficially, little left to discover in terms of
dialectal differences between those variants of English for which there are
sufficiently large corpora, most keyword studies comparing varieties focus on
cultural differences. These studies are typically inductive (at least partially) and
they have two nominal variables: CULTURE (operationalized as corpus containing
language produced by members of the culture) and AREA OF LIFE (operationalized
as semantic field). They then investigate the importance of different areas of life
for the cultures involved (where the importance of an area is operationalized as
having a large number of words from the corresponding semantic field among
the differential keywords).

12.2.2.1 Case study: British vs. American culture. The earliest study of this type is
Leech and Fallon (1992), which is based on Johansson and Hofland's (1989)
keyword list of British and American English.
The authors inductively identify words pointing to cultural contrasts by
discarding all words whose distribution across the two corpora is not significant,
all proper names, and all words whose significant differences in distribution are
due to dialectal variation (including spelling variation). Next, they look at
concordances of the remaining words to determine, first, which senses are most
frequent and thus most relevant for the observed differences, and second,
whether the words are actually distributed across the respective corpus as a
whole, discarding those whose overall frequency is simply due to their frequent
occurrence in a single file (since those words would not tell us anything about
cultural differences). Finally, they sort the words into semantic fields such as
sport, travel and transport, business, mass media, military etc., discussing the
quantitative and qualitative differences for each semantic field.
For example, they note that there are obvious differences between the types of
sports whose vocabulary differentiates between the two corpora (baseball is
associated with the BROWN corpus, cricket and rugby with the LOB corpus),
reflecting the importance of these sports in the two cultures, but also that general
sports vocabulary (athletic, ball, playing, victory) is more often associated with the
BROWN corpus, suggesting a greater overall importance of sports in 1961 US-
American culture. Except for one case, they do not present the results
systematically. They list lexical items they found to differentiate between the
corpora, but it is unclear whether these lists are exhaustive or merely illustrative
(the only drawback of this otherwise methodologically excellent study).
The one case where they do present a table is the semantic field MILITARY. Their
results are shown in Table 12-8a/b (I have recalculated them using the log-
likelihood test statistic we have used throughout this and the preceding section).


Table 12-8a Military keywords in BROWN (cf. Leech and Fallon 1992: 49)
KEYWORD LOB BROWN G2 KEYWORD LOB BROWN G2
AIRCRAFT 31 71 15.85 MARINE 12 59 33.61
ARMED 22 60 18.05 MERCENARIES 0 12 16.56
ARMIES 5 15 5.17 MILITARY 133 212 17.74
ARMS 87 121 5.36 MILITIA 0 11 15.18
ASSAULT 4 15 6.71 MISSILE 5 49 41.25
BALLISTIC 1 17 17.12 MISSILES 11 32 10.57
BATTERY 6 18 6.20 MISSION 50 78 5.99
BATTLE 54 87 7.58 MISSIONS 3 16 9.68
BOMBERS 7 23 8.89 MOBILE 6 44 32.37
BOMBS 16 35 7.13 PATRIOT 0 10 13.80
BULLET 13 28 5.52 PATROL 6 25 12.39
BULLETS 4 21 12.56 PEACE 159 198 4.02
CAMPAIGNS 5 17 6.84 PENTAGON 3 14 7.65
CAVALRY 3 26 20.76 PIRATES 3 13 6.67
CIVILIAN 11 24 4.86 PISTOL 13 27 4.91
CODE 14 40 12.88 PLANE 51 114 24.26
CODES 4 17 8.58 RIFLE 20 64 23.95
COLUMN 24 71 24.00 RIFLES 6 23 10.52
COMBAT 8 27 10.77 SHERMAN 0 36 49.67
COMMAND 51 73 3.78 SHOT 65 112 12.32
COMMANDS 5 15 5.17 SIGNAL 40 63 5.03
CORPS 9 109 99.31 SIGNALS 11 29 8.28
DESTROY 25 48 7.22 SLUG 1 10 8.49
DIVISION 65 109 10.96 SQUAD 2 18 14.62
ENEMY 39 96 24.47 STRATEGIC 9 23 6.25
ENLISTED 3 11 4.81 STRATEGY 4 22 13.60
FALLOUT 1 31 35.26 SUBMARINE 7 27 12.43
FIGHTERS 4 16 7.63 TACTICS 8 20 5.23
FIRE 125 189 12.72 TARGETS 9 22 5.54
FORCE 168 231 9.58 TERRITORIAL 5 14 4.38
FORT 11 55 31.73 TROOP 3 16 9.68
FOUGHT 28 46 4.31 VETERAN 11 28 7.55
GUN 56 119 22.79 VETERANS 1 18 18.39
GUNS 8 42 25.12 VICTOR 8 24 8.27
HEADQUARTERS 28 65 14.89 VICTORY 32 61 9.01
INFANTRY 6 16 4.65 VIET 3 16 9.68
LIEUTENANT 12 32 9.30 VOLUNTEERS 8 29 12.52
LOSSES 11 46 22.87 WAR 395 467 5.56
MAJOR 146 247 25.59 WARFARE 14 43 15.28
MANNED 2 12 7.86 WEAPON 25 42 4.25
MARCH 85 120 5.78 WINCHESTER 4 12 4.13
MARCHING 5 15 5.17


Table 12-8b Military keywords in LOB (cf. Leech and Fallon 1992: 50)
KEYWORD LOB BROWN G2 KEYWORD LOB BROWN G2
CONQUEST 25 9 7.94 RANK 43 24 5.59
DISARMAMENT 27 11 7.06 TANKS 35 18 5.66
MEDAL 37 7 22.64 TRENCH 15 2 11.34
Note: To save space, the remaining number of tokens for other words in the corpora are not shown
in Tables 12-8a/b. If necessary, they can be calculated by subtracting the frequencies shown from
the total number of tokens in each corpus, which, in the version used here, are 1157684 (LOB) and
1165330 (BROWN).

Even this list is not exhaustive. For example, the word soldier is missing, although
it is associated with the LOB corpus (60 vs. 40 occurrences, G2 = 4.61); but overall,
it seems that Leech and Fallon checked the keywords thoroughly (soldier was one
of the few missing significant keywords I was able to find). Thus, their conclusion
that the concept of war played a central role in US culture in 1961 seems reliable.
This case study is an example of a very carefully constructed and executed
contrastive cultural analysis based on keywords. Note, especially, that Leech and
Fallon do not just look for semantic fields that are strongly represented among
the statistically significant keywords of one corpus, but that they check whether
the semantic field is also represented (with different vocabulary) among the
statistically significant keywords of the other corpus. This search for counter-
evidence is often lacking in inductive studies.
One caveat, perhaps, is that the study of cultural importance cannot be
separated completely from the study of dialectal preferences. For example, the
word armistice occurs 15 times in the LOB corpus but only 4 times in BROWN,
making it a significant keyword for British English (G2 = 6.86). However, the words
ceasefire and truce, which are roughly synonymous, are not significantly
distributed (cease(-)fire 7:7, truce 0:5). Taking all three words together, there is no
significant difference between the two varieties (22:16, G2 = 0.99). It seems that
there is simply a preference for the word armistice in British English and for truce
in American English. Without taking into account synonymy, we might have
concluded that armistices are more central to British thinking about military
conflicts, and similar distortions might underlie some of the other keywords.
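The figures just cited can be checked against the corpus totals given in the note to Tables 12-8a/b (1,157,684 tokens for LOB, 1,165,330 for BROWN). The following sketch (my own code, reusing the log-likelihood statistic G2 employed throughout this chapter) pools the synonym frequencies before testing:

```python
import math

LOB_TOKENS, BROWN_TOKENS = 1157684, 1165330

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 frequency table."""
    n = o11 + o12 + o21 + o22
    cells = [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]
    return 2 * sum(o * math.log(o / (row * col / n))
                   for o, row, col in cells if o > 0)

def keyness(f_lob, f_brown):
    """G2 for a (possibly pooled) word frequency in LOB vs. BROWN."""
    return g2(f_lob, f_brown, LOB_TOKENS - f_lob, BROWN_TOKENS - f_brown)

print(round(keyness(15, 4), 2))                  # armistice alone: 6.86 (significant)
print(round(keyness(15 + 7 + 0, 4 + 7 + 5), 2))  # pooled with cease(-)fire, truce: 0.99
```

Pooling the three near-synonyms thus moves the statistic below the conventional 3.84 significance threshold, as stated above.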

12.2.2.2 Case study: African key words. Wolf and Polzenhagen (2007) present an
analysis of African keywords arrived at by comparing a corpus of Cameroon
English to the combined FLOB/FROWN corpora (jointly meant to represent
"Western culture"). Their study is partially deductive, in that they start out from a
pre-conceived African "model of community" which they claim holds for all of
sub-Saharan Africa and accords the extended family and community a central
role in a holistic cosmology involving hierarchical structures of authority and
respect within the community, extending into the spiritual world of the
ancestors, and also involving a focus on gods and witchcraft. The model seems
somewhat stereotypical and simplistic, to say the least, but a judgment of its
accuracy is beyond the scope of this book. What matters here is that this model
guides the authors in their search for keywords that might differentiate between
their two corpora. Unlike Leech and Fallon in the study described above, it seems
that they do not create a complete list of differential keywords and then
categorize them into semantic fields, but they focus on words for kinship
relations, spiritual entities and witchcraft straight away. They also do not check
the distribution of words across files in the two corpora, but, like Leech and
Fallon, they check concordance lines for ambiguous words to determine what
senses they are used in.
This procedure yields seemingly convincing word lists like that in Table 12-9.

Table 12-9. Keywords relating to authority and respect in a corpus of Cameroon
English (from Wolf and Polzenhagen 2007: 421).
KEYWORD CEC FLOB/FROWN LL P-VALUE
AUTHORITY 301 556 9.1 0.002563
RESPECT 232 226 82.2 0.000000
OBEDIENCE 23 8 25.3 0.000000
OBEY 80 24 95.9 0.000000
DISOBEDIENCE 30 4 49.9 0.000000
CHIEF 373 219 267.9 0.000000
CHIEFDOM 28 2 53.5 0.000000
DIGNITARIES 11 4 11.7 0.000610
LEADER 312 402 56.7 0.000000
LEADERSHIP 86 128 9.4 0.002212
RULER 59 28 51.7 0.000000
FATHER 365 793 22.5 0.000002
ELDER 59 68 22.5 0.000002
TEACHER 293 254 127.9 0.000000
PRIEST 82 132 6.2 0.012755

However, due to their method, we cannot tell whether this is a real result or
whether it is simply due to the specific selection of words they look at. Some
obvious items from the semantic field AUTHORITY are missing from their list, for
example, command, rule, disobey, disrespect, and power. As long as we do not
know whether they failed to check the keyness of these (and other) AUTHORITY
words or whether they did check them but found them not to be significant, we
cannot claim that authority and respect play a special role in African culture(s),
for the simple reason that we cannot exclude the possibility that these words may
actually be significant keywords for the combined FROWN/FLOB corpus.
Finally, it is questionable whether one can simply combine a British and an
American corpus to represent "Western culture". First, it assumes that the two
cultures individually belong to such a larger culture and can jointly represent it;
why not take, say, Irish English and Canadian English instead? Second, it glosses
over differences between the two speech communities in exactly those areas in
which they are being compared to Cameroon English. For example, authority,
ruler and even father are very strongly associated with British English (using
LOB) as compared to American English (using BROWN); the reverse is true for
leadership, dignitaries, obedience and respect. Thus, it seems better to compare
specific cultures (for example in the sense of speech communities within a
nation state) directly than to construct a combined Western (or other) culture
(cf. Oakes and Farrow 2007 for a method that can be used to compare multiple
varieties against each other in a single analysis).
T

12.2.3 Co-Occurrence of Lexical Items and Demographic Categories

The potential overlap between keyword analysis and sociolinguistics becomes
most obvious when using individual demographic variables such as sex, age,
education, income, etc. as independent variables. Note that such variables may be
nominal (sex) or ordinal (age, income, education); however, even potentially
ordinal variables are treated as nominal in keyword-based studies, since keyword
analysis cannot straightforwardly deal with ordinal data (although it could, in
principle, be adapted to do so).
D

12.2.3.1 Case study: A deductive approach to sex differences. A thorough study of
lexical differences between male and female speech is Schmid (2003) (inspired by
an earlier, less detailed study by Rayson et al. 1997). Schmid uses the parts of the
BNC that contain information about speaker sex (which means that he simply
follows the definition of SEX used by the corpus creators); he uses Johansson and
Hofland's (1989) difference coefficient (discussed in Section 12.2.2 above). His
procedure is at least partially deductive in that he focuses on particular semantic
fields in which differences between male and female language are expected
according to authors like R. Lakoff (1975), for example, color, clothing, body
and health, car and traffic and public affairs. As far as one can tell from the
methodological description, he only looks at selected words from each category,
so the reliability of the results depends on the plausibility of his selection.
One area in which this is unproblematic is color: Schmid finds that all basic
color terms (black, white, red, yellow, blue, green, orange, pink, purple, grey) are
more frequent in women's language than in men's language. Since basic color
terms are (synchronically) a closed set and Schmid looks at the entire set, the
results may be taken to suggest that women talk about color more often than men
do. Similarly, for the domain of temporal deixis Schmid looks at the expressions
yesterday, tomorrow, last week, tonight, this morning, today, next week, last year
and next year and finds that all but the last three are significantly more frequently
used by women. While this is not a complete list of temporal deictic expressions,
it seems representative enough to suggest that the results reflect a real difference.
Schmid's analysis of the semantic field personal reference, shown in Table
12-10, is somewhat more problematic. Personal reference is a large, lexically
diverse and somewhat diffuse semantic field, and while the sample Schmid
presents certainly seems to contain many items that can be considered central to
this domain, the question remains whether a more exhaustive sample would yield
the same results. For example, the plurals of girl and boy are missing, which is
odd, given that the plurals of all other nouns are included.
Table 12-10. Male and female keywords for the semantic field personal reference
(Schmid 2003)
WORD FREQ. MEN FREQ. WOMEN COEFFICIENT WORD FREQ. MEN FREQ. WOMEN COEFFICIENT
she 2266.94 7842.65 -0.55 *** (continued)
girl 91.91 274.92 -0.50 *** they 9134.26 10563.21 -0.07 ***
boy 127.49 232.22 -0.29 *** person 236.27 229.46 0.01
woman 108.17 180.92 -0.25 *** man 425.57 403.01 0.03
he 6437.68 9820.51 -0.21 *** we 11549.23 9032.93 0.12 ***
I 26298.91 36701.21 -0.17 *** men 232 179.39 0.13 ***
you 25124.76 30555.67 -0.10 *** people 2116.07 1600.05 0.14 ***
women 173.65 198.74 -0.07 persons 13.62 5.53 0.42 ***

Despite these problems, we can probably agree that the selection in Table 12-10 is
relatively representative of the field of personal reference. For many other fields
that Schmid investigates, the representativity of his lexical sample is much more
difficult to assess. For example, Table 12-11 shows Schmid's results for the
semantic field body and health (again, negative coefficients indicate that a word
is associated with female language, positive coefficients with male language).

Table 12-11. Male and female keywords for the semantic field body and health
(Schmid 2003)
WORD FREQ. MEN FREQ. WOMEN COEFFICIENT WORD FREQ. MEN FREQ. WOMEN COEFFICIENT
breast 4.47 15.97 -0.56 *** (continued)
hair 55.71 195.67 -0.56 *** eyes 49.21 79.56 -0.24 ***
headache 4.47 14.13 -0.52 *** finger 30.91 44.85 -0.18 **
legs 32.94 79.86 -0.42 *** fingers 26.03 29.49 -0.06
sore throat 2.03 4.91 -0.41 eye 53.27 58.67 -0.05
doctor 71.17 139.76 -0.33 *** body 101 103.52 -0.01
sick 48.39 90.31 -0.30 *** hands 96.99 98.29 -0.01
ill 30.3 55.9 -0.30 *** hand 231.39 214.4 0.04
leg 40.06 65.12 -0.24 ***
v0
In this case, it is difficult to conclude, as Schmid does, that the field body and
health is associated with women's language, because the selection of words is
too small and eclectic to be considered representative of the entire field (some
words that come to mind immediately are pain, ache/aching, unwell, nurse,
medicine, (in)flu(enza) for the domain HEALTH and nose, arm(s), foot, feet, stomach,
mouth, teeth for the domain BODY).
Schmid's study shows the difficulty of applying keyword analysis deductively:
it is difficult to assess the validity and reliability of an analysis based on small
selections of words from a given semantic field. However, if we manage to come
up with a justifiable selection of lexical items, a deductive analysis can be used to
test a particular hypothesis in a more efficient and principled way than an
inductive design.
12.2.3.2 Case study: An inductive approach to sex differences. An inductive design
will give us a more complete and less biased picture, but also one that is much
less focused. For example, Rayson et al. (1997) apply inductive keyword analysis
to the utterances with FEMALE vs. MALE speaker information in the spoken-
conversation subcorpus of the BNC (like Schmid, they simply follow the corpus
creators' implicit definition of SEX). Table 12-12 shows the 15 most significant key
words in women's speech and men's speech (the results differ minimally from
those in Rayson et al. 1997, p. 136-137, since they use the chi-square statistic
while I have used the log-likelihood G2 statistic).


Table 12-12. Key words in women's speech and men's speech in the BNC
conversation subcorpus
KEY WORD O11 O12 O21 O22 G2
MOST STRONGLY ASSOCIATED WITH FEMALE SPEAKERS
SHE 7037 22807 1479484 2285127 3291.17
HER 2313 7306 1484208 2300628 990.39
SAID 4911 12375 1481610 2295559 881.36
N'T 24221 44380 1462300 2263554 444.52
I 54825 93330 1431696 2214604 306.99
AND 29109 50467 1457412 2257467 231.78
COS 3314 6864 1483207 2301070 191.93
TO 23693 40934 1462828 2267000 175.92
CHRISTMAS 285 1005 1486236 2306929 171.10
CHARLOTTE 24 298 1486497 2307636 170.52
THOUGHT 1545 3523 1484976 2304411 166.21
OH 13236 23472 1473285 2284462 152.83
LOVELY 406 1217 1486115 2306717 145.25
MM 7067 13039 1479454 2294895 139.46
BECAUSE 1830 3901 1484691 2304033 129.79
MOST STRONGLY ASSOCIATED WITH MALE SPEAKERS
FUCKING 1383 326 1485138 2307608 1251.11
ER 9415 9337 1477106 2298597 939.42
THE 43385 57367 1443136 2250567 648.87
YEAH 21888 28793 1464633 2279141 343.21
[UNCLEAR] 30659 41710 1455862 2266224 312.08
MINUS 257 35 1486264 2307899 302.37
RIGHT 6081 7092 1480440 2300842 266.11
AYE 1164 876 1485357 2307058 265.55
HUNDRED 1473 1233 1485048 2306701 256.96
FUCK 331 106 1486190 2307828 241.58
IS 13277 17337 1473244 2290597 225.16
TWO 4282 5019 1482239 2302915 181.11
A 28415 39787 1458106 2268147 179.01
JESUS 177 36 1486344 2307898 174.00
NO 14836 19976 1471685 2287958 172.98

Two differences are obvious immediately. First, there are pronouns among the
most significant keywords of women's speech, but not of men's speech,
confirming Schmid's finding concerning personal reference. This is further
supported if we include the next thirty-five most significant keywords, which
yields three additional pronouns (him, he, and me) and eight proper names for
women's speech, but no pronouns and only a single proper name for men's
speech (although there are two terms of address, mate and sir). Second, there are
three instances of cursing/taboo language among the most significant male
keywords (fucking, fuck and Jesus), but not among the female key words,
confirming findings of a number of studies focusing on cursing (e.g. Murphy 2009).
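The remark above Table 12-12 that the chi-square and log-likelihood statistics yield minimally different results can be illustrated directly. The sketch below (my own code, not taken from Rayson et al. 1997) computes both statistics for the table underlying the row LOVELY:

```python
import math

def _cells(o11, o12, o21, o22):
    # each cell paired with its row and column marginal
    return [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
            (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]

def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 frequency table."""
    n = o11 + o12 + o21 + o22
    return sum((o - row * col / n) ** 2 / (row * col / n)
               for o, row, col in _cells(o11, o12, o21, o22))

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 frequency table."""
    n = o11 + o12 + o21 + o22
    return 2 * sum(o * math.log(o / (row * col / n))
                   for o, row, col in _cells(o11, o12, o21, o22) if o > 0)

# LOVELY in female vs. male speech (row from Table 12-12)
print(round(log_likelihood(406, 1217, 1486115, 2306717), 2))  # → 145.25
print(round(chi_square(406, 1217, 1486115, 2306717), 1))      # → 136.6
```

The log-likelihood value reproduces the figure in Table 12-12, and the chi-square value is close to it; for tables this well-populated, the two statistics lead to the same conclusions.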
In order to find significant differences in other domains, we would now have
to sort the entire list into semantic categories (as Leech and Fallon did for British
and American English). This is clearly much more time consuming than Schmid's
analysis of preselected items; for example, looking at the first fifty keywords in
male and female speech will reveal no clear additional differences, although they
point to a number of potentially interesting semantic fields (for example, the
occurrence of lovely as a female keyword points to the possibility that there
might be differences in the use of evaluative adverbs).

The two case studies presented in this subsection demonstrate the use of
keyword analyses with demographic variables traditionally of interest to
sociolinguistics (see Rayson et al. (1997) for additional case studies involving age
and social class, as well as interactions between sex, age and class). They are also
intended to show the respective advantages and disadvantages of deductive and
inductive approaches in this area.
Note that one difficulty with sociolinguistic research focusing on lexical items
is that topical differences in the corpora may distort the picture. For example,
among the female keywords we find words like kitchen, baby, biscuits, husband,
bedroom, and cooking, which could be used to construct a stereotype of women's
language as being home- and family-oriented. In contrast, among the male
keywords we find words like minus, plus, per cent, equals, squared, decimal as well
as many number words, which could be used to construct a stereotype of male
language as being concerned with abstract domains like mathematics. However,
these differences very obviously depend on the topics of the conversations
included in the corpus. It is not inconceivable, for example, that male linguists
constructing a spoken corpus will record their male colleagues in a university
setting and their female spouses in a home setting. Thus, we must take care to
distinguish stable, topic-independent differences from those that are due to the
content of the corpora investigated. This should be no surprise, of course, since
keyword analysis was originally invented to uncover precisely such differences in
content.

339
12 Text

12.2.4 Ideology
Just as we can choose texts to stand for demographic variables, we can choose
them to stand for the world views or ideologies of the speakers who produced
them. Note that in this case, the texts serve as an operational definition of the
corresponding ideology, an operationalization that must be plausibly justified.

12.2.4.1 Case study: Political ideologies. As an example, consider Rayson (2008),
who compares the election manifestos of the Labour Party and the Liberal
Democrats for the 2001 general election in Great Britain. This study is interesting
in that it compares two texts directly with each other, rather than comparing a
text to a reference corpus. This is frequently done when investigating political
ideologies, which is reasonable in contexts where two ideologies are clearly set
against each other. In Rayson's study, one wonders what the results would have
been if he had compared the Labour Manifesto against that of the Conservative
Party instead, or if he had compared the manifestos of all major parties against a
general reference corpus. Which specific design we choose depends on what
exactly we are investigating.
Table 12-13 shows the result of the direct comparison, derived from my own
analysis of the two party manifestos (which can easily be found online). The
results differ from Rayson's in a few details, due to slightly different decisions
about tokenization, but they are identical with respect to all major observations.
Obviously, the names of each party are overrepresented in the respective
manifesto as compared to that of the other party. More interesting is the fact that
would is a key word for the Liberal Democrats' program; this is because their
program mentions hypothetical events more frequently, which Rayson takes to
mean that they did not expect to win the election. It is also interesting that the
Labour Manifesto does not have any words relating to specific policies among the
ten strongest keywords; presumably, because they were already in power and felt
less of a need to mention specific policies that they were planning to implement.
The Liberal Democrats, in contrast, have green and environmental, pointing to a
strong environmental focus, as well as powers, which suggests that they are very
concerned with the distribution of decision-making powers (both speculations
turn out to be correct if we read the actual manifesto).


Table 12-13. Differential keywords of the 2001 Labour and Liberal Democrat
election manifestos

KEY WORD         O11    O12    O21     O22     G2
MOST STRONGLY ASSOCIATED WITH THE LABOUR PARTY ELECTION MANIFESTO
PER CENT         92     0      30330   21248   97.58
OUR              286    76     30136   21172   66.44
LABOUR           177    36     30245   21212   58.17
IS               336    119    30086   21129   44.89
NOW              77     8      30345   21240   42.82
MILLION          68     6      30354   21242   41.10
1997             60     4      30362   21244   40.79
NEXT             55     4      30367   21244   36.16
SINCE            39     2      30383   21246   28.91
NEW              174    57     30248   21191   27.62
MOST STRONGLY ASSOCIATED WITH THE LIBERAL DEMOCRATIC PARTY ELECTION MANIFESTO
LIBERAL          0      46     30422   21202   81.81
WOULD            11     70     30411   21178   71.81
DEMOCRATS        0      39     30422   21209   69.35
WHICH            37     92     30385   21156   48.21
SUPPORT          14     57     30408   21191   45.70
ENVIRONMENTAL    15     47     30407   21201   30.85
ESTABLISH        7      34     30415   21214   30.39
GREEN            3      26     30419   21222   30.11
ALSO             50     88     30372   21160   28.74
POWERS           6      29     30416   21219   25.84
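The G2 values in tables like this one are log-likelihood keyness scores computed from each word's 2×2 frequency table (the word vs. all other words, in corpus A vs. corpus B). As a minimal sketch (in Python, purely for illustration), the score for per cent in Table 12-13 can be recomputed from the four cell frequencies:

```python
import math

def g2_keyness(o11, o12, o21, o22):
    """Log-likelihood (G2) keyness score for a 2x2 frequency table:
    o11/o12 = frequency of the word in corpus A / corpus B,
    o21/o22 = frequency of all other words in corpus A / corpus B."""
    n = o11 + o12 + o21 + o22
    cells = [(o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22)]
    g2 = 0.0
    for obs, row_total, col_total in cells:
        expected = row_total * col_total / n
        if obs > 0:                      # a zero cell contributes nothing
            g2 += obs * math.log(obs / expected)
    return 2 * g2

# The row for PER CENT in Table 12-13:
print(round(g2_keyness(92, 0, 30330, 21248), 2))
```

Depending on intermediate rounding, the result may differ from the table's 97.58 in the second decimal.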
Of course, identifying key words is only the first step in an analysis of ideologies
(or of key word analyses in general). Once the key words are identified, they are
typically investigated in their context (cf. Rayson 2008, who presents KWIC
concordances of some important key words, and Scott 1997, who identifies
collocates of the key words in a sophisticated procedure that leads to highly
insightful clusters of key words).
In fact, key word analysis is sometimes used not to identify the specific key
words as such, but to identify a set of lexical items to investigate in the context of
a larger research question. For example, Partington (1997) suggests using a key
word analysis to identify potential metaphors in a particular domain of discourse.

12.2.4.2 Case study: The importance of men and women. Just as text may stand for
something other than text, words may stand for something other than words in a
given research design. Perhaps most obviously, they may stand for their referents
(or classes of referents). If we are careful with our operational definitions, then,
we may actually use corpus-linguistic methods to investigate not (only) the role
of words in texts, but the role of their referents in a particular community.
In perhaps the first study attempting this, Kjellmer (1986) uses the frequency
of masculine and feminine pronouns in the topically defined sub-corpora of the
LOB and BROWN corpora as an indicator of the importance accorded to women
in the respective discourse domain. His research design is essentially deductive,
since he starts from the hypothesis that women will be mentioned less frequently
than men. The design has two nominal variables: SEX (with the values MAN and
WOMAN, operationalized as male pronoun and female pronoun) and TEXT
CATEGORY (with the values provided by the text categories of the LOB/BROWN
corpora).
First, Kjellmer notes that men are referred to much more frequently than
women overall: There are 17,965 male pronouns in the LOB corpus compared to
only 8,261 female ones (Kjellmer's figures differ very slightly from the ones given
here and below, perhaps because he used an earlier version of the corpus). This
difference between male and female pronouns is significant: using the single-
variable version of the chi-square test introduced in Chapter 7 (Section 7.2.2), and
assuming that the population in 1961 consisted of 50 percent men and 50 percent
women, we get the expected frequencies shown in Table 12-14a (χ² = 3590.62 (df =
1), p < 0.001).

Table 12-14a. Observed and expected frequencies of male and female pronouns in
the LOB corpus (based on the assumption of equal proportions).

PRONOUN    Observed    Expected    Chi-Square Component
MALE       17,965      13,113      1,795.31
FEMALE     8,261       13,113      1,795.31
Total      26,226      26,226      3,590.62
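The chi-square value reported above can be recomputed directly from the observed counts; a minimal sketch in Python:

```python
# One-variable chi-square test for the pronoun counts in Table 12-14a,
# under the null hypothesis of equal proportions of men and women.
observed = {"male": 17965, "female": 8261}
n = sum(observed.values())            # 26,226 pronouns in total
expected = n / 2                      # 13,113 for each category
chi_sq = sum((obs - expected) ** 2 / expected for obs in observed.values())
print(round(chi_sq, 2))  # → 3590.62
```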

In other words, women are drastically underrepresented among the people
mentioned in the LOB corpus (in the BROWN corpus, Kjellmer finds, things are
even worse). We might want to blame this either on the fact that the corpus is
from 1961, or on the possibility that many of the occurrences of male pronouns
might actually be generic uses, referring to mixed groups or abstract (categories
of) people of any sex. However, Kjellmer shows that only 4 percent of the male
pronouns are used generically, which does not change the imbalance perceptibly;
also, the FLOB corpus from 1991 shows almost the same distribution (16,278 male
pronouns, 8,669 female ones).
Kjellmer's main question is whether, given this overall imbalance, there are
differences in the individual text categories, and as Table 12-14b shows, this is
indeed the case.

Table 12-14b. Male and female pronouns in the different text categories of the
LOB corpus.

TEXT CATEGORY                 SEX OF PRONOUN REFERENTS
                              MALE               FEMALE             Total
A. Press: Reportage           O: 1584            O: 299             1883
                              E: 1289.8          E: 593.2
                              χ²: 67.11 ***      χ²: 145.91 ***
B. Press: Editorial           O: 567             O: 102             669
                              E: 458.25          E: 210.75
                              χ²: 25.81 ***      χ²: 56.12 ***
C. Press: Reviews             O: 671             O: 212             883
                              E: 604.83          E: 278.17
                              χ²: 7.24 (n.s.)    χ²: 15.74 **
D. Religion                   O: 458             O: 32              490
                              E: 335.64          E: 154.36
                              χ²: 44.61 ***      χ²: 97.00 ***
E. Skills and Hobbies         O: 322             O: 93              415
                              E: 284.26          E: 130.74
                              χ²: 5.01 (n.s.)    χ²: 10.89 *
F. Popular Lore               O: 1150            O: 666             1816
                              E: 1243.91         E: 572.09
                              χ²: 7.09 (n.s.)    χ²: 15.41 **
G. Belles Lettres,            O: 2861            O: 780             3641
   Biography, Memoirs         E: 2493.98         E: 1147.02
                              χ²: 54.01 ***      χ²: 117.44 ***
H. Miscellaneous              O: 227             O: 69              296
                              E: 202.75          E: 93.25
                              χ²: 2.90 (n.s.)    χ²: 6.31 (n.s.)
J. Learned                    O: 1065            O: 100             1165
                              E: 797.99          E: 367.01
                              χ²: 89.34 ***      χ²: 194.26 ***
K. General Fiction            O: 2045            O: 1434            3479
                              E: 2383.01         E: 1095.99
                              χ²: 47.95 ***      χ²: 104.25 ***
L. Mystery and Detective      O: 1893            O: 829             2722
   Fiction                    E: 1864.49         E: 857.51
                              χ²: 0.44 (n.s.)    χ²: 0.95 (n.s.)
M. Science Fiction            O: 333             O: 78              411
                              E: 281.52          E: 129.48
                              χ²: 9.41 (n.s.)    χ²: 20.47 ***
N. Adventure and Western      O: 2362            O: 1250            3612
   Fiction                    E: 2474.12         E: 1137.88
                              χ²: 5.08 (n.s.)    χ²: 11.05 *
P. Romance and Love Story     O: 2095            O: 2186            4281
                              E: 2932.36         E: 1348.64
                              χ²: 239.12 ***     χ²: 519.91 ***
R. Humor                      O: 329             O: 131             460
                              E: 315.09          E: 144.91
                              χ²: 0.61 (n.s.)    χ²: 1.34 (n.s.)
Total                         17962              8261               26223
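The expected frequencies and chi-square components in Table 12-14b follow from the corpus-wide marginal totals; a minimal sketch in Python, shown here for the first row (Press: Reportage):

```python
# Expected frequencies and chi-square components for a row of Table 12-14b,
# computed from the corpus-wide totals of male and female pronouns.
MALE_TOTAL, FEMALE_TOTAL = 17962, 8261
N = MALE_TOTAL + FEMALE_TOTAL

def row_components(obs_male, obs_female):
    """Return the male and female chi-square components for one text category."""
    row_total = obs_male + obs_female
    exp_male = row_total * MALE_TOTAL / N
    exp_female = row_total * FEMALE_TOTAL / N
    return ((obs_male - exp_male) ** 2 / exp_male,
            (obs_female - exp_female) ** 2 / exp_female)

# A. Press: Reportage
male_comp, female_comp = row_components(1584, 299)
print(round(male_comp, 2), round(female_comp, 2))  # → 67.11 145.91
```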

Even taking into consideration the general overrepresentation of men in the
corpus, they are overrepresented strongly in reportage and editorials, religious
writing and Belles Lettres/Biographies, all factual genres, suggesting that
actual, existing men are simply thought of as more worthy topics of discussion
than actual, existing women. Women, in contrast, are overrepresented in popular
lore, general fiction, adventure and western, and romance and love stories
(overrepresented, that is, compared to their general underrepresentation; in
absolute numbers, they are mentioned less frequently in every single category
except romance and love stories). In other words, fictive women are slightly less
strongly discriminated against in terms of their worthiness for discussion than
are real women.
Other researchers have taken up and expanded Kjellmer's method of using the
distribution of male and female pronouns (and other gendered words) in corpora
to assess the role of women in society (see, e.g. Romaine 2001; Baker 2010b;
Twenge et al. 2012 and Liberman's 2012 discussion of their method; Subtirelu
2014).

12.2.5 Culture across time


As has become clear, a comparison of speech communities often results in a
comparison of cultures, but of course culture can also be studied on the basis of a
corpus without such a comparison. Referents that are important in a culture are
more likely to be talked and written about than those that are not; thus, in a
sufficiently large and representative corpus, the frequency of a linguistic item
may be taken to represent the importance of its referent in the culture. This is the
basic logic behind a research tradition referred to as culturomics, a word that is
intended to mean something like the extension of rigorous quantitative inquiry to
a wide array of new phenomena spanning the social sciences and the humanities
(cf. Michel et al. 2011). In practice, culturomics is simply the application of
standard corpus-linguistic techniques (word frequencies, tracked across time),
and it has yielded some interesting results (if applied carefully, which is not
always the case).
12.2.5.1 Case Study: God. Michel et al. (2011) present a number of small case
studies intended to demonstrate the potential of the method. They use the Google
Books corpus (see Appendix A.3). The use of Google Books may be criticized
because it is not a balanced corpus, but the authors point out that, first, it is the
largest corpus available and, second, books constitute cultural products and thus
may not be such a bad choice for studying culture after all.

Figure 12-1. God in US-American novels.

As a simple example, consider the search for the word God in US-American
novels (I used the 2012 version of the corpus, so the results differ slightly from
theirs). The authors present this as an example of the history of religion; they
conclude from their result somewhat flippantly that "God is not dead but needs
a new publicist". This flippancy, incidentally, signals an unwillingness to engage
with their own results in any depth that is not entirely untypical of researchers in
culturomics. Broadly speaking, the result certainly suggests a waning dominance
of religion on topic selection in book publishing, but not necessarily on the
culture as such.
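The core culturomics computation is simple: the frequency of a word in each year, normalized against the total number of words published that year. A minimal sketch in Python; the counts below are invented placeholders for illustration, not actual Google Books figures:

```python
# Relative frequency (per million words) of a word across time, the basic
# culturomics measure. All counts below are invented illustration values.
word_counts = {1900: 120_000, 1950: 95_000, 2000: 60_000}
corpus_totals = {1900: 80_000_000, 1950: 110_000_000, 2000: 190_000_000}

per_million = {year: word_counts[year] / corpus_totals[year] * 1_000_000
               for year in word_counts}
for year in sorted(per_million):
    print(year, round(per_million[year], 1))
```

Note that the normalization step is essential: raw frequencies rise across the twentieth century simply because more books were published.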

12.2.5.2 Example: Censorship. While it is not implausible to analyze culture in
general on the basis of a literary corpus, any analysis that involves the area of
publishing itself will be particularly convincing. One such example is Michel et
al.'s (2011) use of frequencies to identify periods of censorship. For example, they
search for the name of the Jewish artist Marc Chagall in the German and the US-
English corpus. As Figure 12-2 shows, there is a first peak in the German corpus
around 1920, but during the time of the Nazi government, the name drops almost
to zero while it continues to rise in the US-English corpus.
T

Figure 12-2. The name Marc Chagall in the US-English and the German part of the
Google Books corpus.

The authors rightly take this drastic drop in frequency to be evidence of political
censorship: Chagall's works, like those of other Jewish artists, were declared to
be degenerate and confiscated from museums, and it is not surprising that his
name would not be mentioned in books written in Nazi Germany. Such drops in
frequency can, in principle, be identified inductively for a large number of names
of public figures, which may uncover periods of explicit or implicit censorship
that we do not know about a priori.

Further reading

This chapter has focused on very simple aspects of variation across text types and
a very simple notion of text type. See Biber (1988, 1989) for a more
comprehensive corpus-based perspective on text types. As seen in some of
the case studies in this chapter, text is frequently a proxy for demographic
properties of the speakers who have produced it, making corpus linguistics a
variant of sociolinguistics; see Baker (2010).

Biber, Douglas. 1988. Variation across speech and writing. Cambridge &
New York: Cambridge University Press.
Biber, Douglas. 1989. A typology of English texts. Linguistics 27(1). 3–44.
Baker, Paul. 2010. Sociolinguistics and corpus linguistics. Edinburgh:
Edinburgh University Press.
13 Metaphor and Metonymy
The ease with which corpora are accessed via word forms and the difficulty of
accessing them at other levels of linguistic representation is an advantage as long
as it is our aim to investigate words, for example with respect to their
relationship to other words, to their internal structure or to their distribution
across grammatical structures and across texts and text types. As we saw in
Chapter 10, it is more problematic where our aim is to investigate grammar in its
own right, but since grammatical structures tend to be associated with particular
words and/or morphemes, these difficulties can be overcome to some extent.
When it comes to investigating phenomena that are not lexical in nature, the
word-based nature of corpora is clearly a disadvantage and it may seem as
though there is no alternative to a careful manual search and/or a sophisticated
annotation (manual, semi-manual or based on advanced natural-language
technology). However, corpus linguists have actually uncovered a number of
relationships between words and linguistic phenomena beyond lexicon and
grammar without making use of such annotations. In the final chapter of this
book, we will discuss a number of case studies of two such phenomena: metaphor
and metonymy.

13.1 Studying metaphor in corpora

Metaphor is traditionally defined as the transfer of a word from one referent
(variously called vehicle, figure or source) to another (the tenor, ground or target)
(cf. e.g. Aristotle, Poetics, XXI). If metaphor were indeed located at the word level,
it should be straightforwardly amenable to corpus-linguistic analysis.
Unfortunately, things are slightly more complicated. First, the transfer does not
typically concern individual words but entire semantic fields (or even conceptual
domains, according to some theories). Second, as discussed in some detail in
Chapter 5 (Section 5.1.3), there is nothing in the word itself that distinguishes its
literal and metaphorical uses. One way around this problem is manual
annotation, and there are very detailed and sophisticated proposals for
annotation procedures (most notably the Pragglejaz Metaphor Identification
Procedure, cf., for example, Steen et al. 2010).
However, as stressed in various places throughout this book, the manual
annotation of corpora severely limits the amount of data that can be included in a
research design; this does not invalidate manual annotation, but it makes
alternatives highly desirable. Two broad alternatives have been proposed in
corpus linguistics. Since these were discussed in some detail in Chapter 5, we will
only repeat them briefly here before illustrating them in more detail in the case
studies presented in the next section. The first approach starts from a source
domain, searching for individual words or sets of words (synonym sets, semantic
fields, discourse domains) and then identifying the metaphorical uses and the
respective targets and underlying metaphors manually. This approach is
extensively demonstrated, for example, in Deignan (1999, 2005). The second
approach starts from a target domain, searching for abstract words describing
(parts of) the target domain and then identifying those that occur in a
grammatical pattern together with items from a different semantic domain (which
will normally be a source domain). This approach has been taken by
Stefanowitsch (2004, 2006) and others (see Martin 2006 for a semi-automatic
variant of the method). A third approach has been suggested by Wallington et al.
(2003): they attempt to identify words that are not themselves part of a
metaphorical transfer but that point to a metaphorical transfer in the immediate
context (the expression figuratively speaking would be an obvious candidate). This
approach has not been taken up widely, but it is very promising at least for the
identification of certain types of metaphor, and of course the expressions in
question are worthy of study in their own right.



13.2 Metaphor: Case studies


13.2.1 Source domains
13.2.1.1 Case Study: flame vs. flames. Among other things, the corpus-based study
of (a small set of) source domain words yields insights into the intricate ways in
which metaphor interacts with lexical items and even word forms. For example,
Deignan (2006) studies the metaphors associated with the source words flame and
flames in terms of whether they occur in positive or negative contexts. Her design
is generally deductive in that she starts with an expectation (but not quite a full-
fledged hypothesis) that there is a difference between the two word forms.


This impression is presumably based on examples like the following:

(13.1a) the flame of hope burns brightly here. [BNC AJD]
(13.1b) Emilio Estevez, sitting on the sofa next to old flame Demi Moore [BNC CH1]

(13.2a) the flames of civil war engulfed the central Yugoslav republic. [BNC AHX]
(13.2b) The game was going OK and then it went up in flames. [BNC CBG]

Deignan's design has two nominal variables: WORD FORM OF FLAME (with the
values SINGULAR and PLURAL) and CONNOTATION OF METAPHOR (with the values POSITIVE
and NEGATIVE; she does not say how exactly the metaphors were categorized).
Table 13-1a shows her results (χ² = 53.98 (df = 1), p < 0.001).
Table 13-1a. Positive and negative metaphors with singular and plural forms of
flame (Deignan 2006: 117)

EVALUATION       WORD FORM OF FLAME
                 SINGULAR       PLURAL         Total
POSITIVE         Obs: 90        Obs: 24        114
                 Exp: 70.78     Exp: 43.22
NEGATIVE         Obs: 5         Obs: 34        39
                 Exp: 24.22     Exp: 14.78
Total            95             58             153
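The chi-square statistic reported for Table 13-1a can be reproduced from the four observed frequencies alone; a minimal sketch in Python:

```python
def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 frequency table given as two rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_sq = 0.0
    for i in (0, 1):
        for j in (0, 1):
            expected = row_totals[i] * col_totals[j] / n
            chi_sq += (table[i][j] - expected) ** 2 / expected
    return chi_sq

# Table 13-1a: positive/negative metaphors with singular/plural flame(s)
print(round(chi_square_2x2([[90, 24], [5, 34]]), 2))  # → 53.98
```

Applied to the literal uses discussed below, chi_square_2x2([[29, 17], [34, 66]]) yields 10.83.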
Clearly, metaphorical singular flame is used in positive metaphorical contexts
much more frequently than metaphorical plural flames. Deignan tentatively
explains this in relation to the literal uses of flame(s): a single flame is usually
under control, and it may be of use, as a candle or a burning match. If there is
more than one flame, we are essentially dealing with a fire; flames are often
undesired, out of control and very dangerous (Deignan 2006: 117).
This explanation itself is of course a hypothesis about the literal use of
singular and plural flame that must be tested separately. Deignan does not
provide such a test, but it is easy enough to do. I selected 100 random
concordance lines each for flame and flames from the BNC and removed all
figurative uses. The remaining literal uses (see Appendix D.4) were sorted
according to whether they described a flame/fire that was safe and under control
(as in 13.3a and 13.4a), or dangerous and out of control (as in 13.3b and 13.4b):

(13.3a) She watched the kiln's flames flicker on the walls [BNC HH3]
(13.3b) Malekith was caught within the flames, his body terribly scarred and
burned [BNC CM1]

(13.4a) At her elbow the candle burned with a steady flame [BNC G1X]
(13.4b) [He] felt the wild pain of the flame on his arm as he dived for safety. [BNC
HA3]

The result is shown in Table 13-1b (χ² = 10.83 (df = 1), p < 0.001).

Table 13-1b. Positive and negative contexts for literal uses of singular and plural
forms of flame

SITUATION        WORD FORM OF FLAME
                 SINGULAR       PLURAL         Total
UNDER CONTROL    Obs: 29        Obs: 17        46
                 Exp: 19.85     Exp: 26.15
OUT OF CONTROL   Obs: 34        Obs: 66        100
                 Exp: 43.15     Exp: 56.85
Total            63             83             146

Obviously, fire is inherently dangerous and so both word forms are frequently
used to refer to situations where a fire is out of control (in the case of flame,
about a third of these cases involve a construction such as sheet/tongues of flame,
as in [A] sheet of flame came down and blocked me [BNC CLV]). However, the
singular flame is used more frequently than expected for situations where the fire
is under control, while the plural flames is used more frequently than expected
for situations where it is out of control. The quality of being out of control is also
different: 14 of the 83 cases refer to a situation where someone is trying and
failing to put out the fire (compared to 0 cases for singular flame) and 7 cases
refer to people being trapped by fire (compared to 1 case for singular flame).
Thus, Deignan's explanation appears to be fundamentally correct, providing
evidence for a substantial degree of isomorphism between literal and figurative
uses of (at least some) words. An analysis of more such cases could show whether
this isomorphism between literal and metaphorical uses is a general principle (as
G. Lakoff's 1993 conceptual theory of metaphor suggests it should be).


This case study demonstrates, first, how to approach the study of metaphor
starting from source-domain words, and, second, that such an approach may be
applied not just descriptively, but in the context of answering fundamental
questions about the nature of metaphor.

13.2.1.2 Case study: Identifying potential source domain expressions. If we are
interested in a phenomenon instantiated in a particular source domain (as in the
preceding case study) or in the source domain itself, our main task is to define a
representative set of source domain items on which to base our study; this can be
done most straightforwardly on the basis of thesauri. If, however, we are
interested in a particular target domain and the source domains associated with
it, there is no lexical resource to draw from. Partington (1998) suggests an
interesting solution: he applies keyword analysis (see Chapter 12, Section 12.1) to
a thematic corpus dealing with the target domain in question and then inspects
the results for items that are not from the domain in question. Among these items
are many that are likely to be used metaphorically in the target domain.
To demonstrate this method, let us create a small subcorpus of economic texts
from the BROWN corpus and perform a keyword analysis comparing this
subcorpus against the rest of the files in BROWN. There are eight files that, based
on their description in the manual, deal with economics topics (specifically, files
A26, A27, A28, H05, H11, H24, J40, J41). Table 13-2a shows selected results of a
keyword analysis of these files.
As expected, most of the strongest keywords are directly related to the domain
of economics. Among the Top 10 (shown in Part I of Table 13-2a), only the items
roleplaying and gin do not seem to fit this pattern, but closer inspection reveals
that they are related to economics indirectly: roleplaying occurs in a text about
the use of roleplaying in sales training (BROWN J30) and gin in a text about the
production of gin (BROWN A26).


However, there are also many keywords that are not related to the domain of
economics directly or indirectly. Part II of Table 13-2a shows the strongest non-
economic keywords, most of which come from the spatial domain: movable,
return/returns, extension, rise/rising and location (other domains are concrete
objects (tangible), health (recovery) and containment (pressure)). These
domains recur among the keywords for economics texts (see Part III of Table 13-
2a for additional examples from the Top 250).


Table 13-2a. Selected keywords from an economics subcorpus from BROWN

Rank   Keyword        a     b      c      d        G2
(I) TOP 10
1      WAGE           44    12     18231  1005763  296.63
2      INDUSTRY       58    128    18217  1005647  240.94
3      PROPERTY       54    102    18221  1005673  237.38
4      PRICE          41    67     18234  1005708  189.23
5      INCOME         40    69     18235  1005706  181.35
6      SALES          42    91     18233  1005684  175.66
7      RATE           47    162    18228  1005613  161.58
8      ASSESSORS      19    1      18256  1005774  145.10
9      ROLEPLAYING    16    0      18259  1005775  128.85
10     GIN            18    5      18257  1005770  121.04
(II) STRONGEST NON-ECONOMIC KEYWORDS
20     MOVABLE        14    5      18261  1005770  91.02
23     RETURN         30    150    18245  1005625  84.80
35     TANGIBLE       11    8      18264  1005767  63.00
51     EXTENSION      12    24     18263  1005751  51.67
54     RECOVERY       11    18     18264  1005757  50.73
57     PRESSURE       22    163    18253  1005612  48.07
67     RISE           16    86     18259  1005689  43.32
69     RISING         13    49     18262  1005726  42.77
79     RETURNS        10    24     18265  1005751  40.19
126    LOCATION       10    53     18265  1005722  27.30
(III) SELECTED ADDITIONAL NON-ECONOMIC KEYWORDS
138    HIGHER         15    146    18260  1005629  26.29
143    LEVEL          17    197    18258  1005578  25.26
148    CALIBRATION    4     2      18271  1005773  24.64
151    IN             479   20862  17796  984913   24.38
185    RAISE          8     44     18267  1005731  21.35
186    WITHHOLDING    4     4      18271  1005771  21.26
205    BELOW          12    133    18263  1005642  18.64
210    RISES          5     15     18270  1005760  18.31
211    EXPANSION      7     40     18268  1005735  18.25
243    UPSWING        2     0      18273  1005775  16.10

Among the spatial terms, words referring to the vertical dimension are very
frequent (rise, rising, higher, level, raise, below, rises, upswing), and among these,
the word rise is dominantly represented by three different word forms among the
top 250 significant keywords. Table 13-2b is a concordance of all instances of all
word forms of rise in the economics subcorpus.

Table 13-2b. A concordance of RISE

1  ir advantage to allow the wage [[rise]] . Thus , for non-negative c
2  prices enjoyed a fairly solid [[rise]] but here also trading dwind
3  y business and industry should [[rise]] during the decade . ( 5 )
4  e of household formations will [[rise]] from about 883,000 in the l
5  ign economic aid are likely to [[rise]] further in the next several
6  some good news . A substantial [[rise]] in new orders and sales of
7  the latter part of 1959 . The [[rise]] in sales last winter was ch
8  ual speed to stimulate a sharp [[rise]] in residential construction
9  f the country and the moderate [[rise]] in vacancy rates for apartm
10 he spring of 1961 before a new [[rise]] in economic activity gets u
11 ties , a continued substantial [[rise]] in expenditures by state an
12 enditures will not continue to [[rise]] in the Sixties . I am confi
13 as . Paradoxically , the sales [[rise]] is due in large measure to
14 o allow the basic wage rate to [[rise]] obviously depends upon the
15 ections show a more pronounced [[rise]] to an annual rate of 1,338,
16 ge rate . Since marginal costs [[rise]] when the wage rate rises ,
17 costs rise when the wage rate [[rises]] , the profit-maximizing pr
18 only rises as the wage rate [[rises]] . In such circumstances ,
19 he public-limit price only [[rises]] as the wage rate rises . I
20 o which the public-limit price [[rises]] in response to a basic wag
21 e profit-maximizing price also [[rises]] when the public-limit pric
22 is manifestly justified by [[rising]] costs ( due to rising wag
23 uld receive some stimulus from [[rising]] Federal spending , and th
24 ter than theirs , will lead to [[rising]] Federal spending in certa
25 involved in this trend toward [[rising]] Federal expenditures , of
26 ding edged down in April after [[rising]] for two consecutive month
27 said the odds appear to favor [[rising]] interest rates in coming
28 they have been drained by the [[rising]] price level , and we have
29 ew legislation , together with [[rising]] prices for farm products
30 will probably be sparked by a [[rising]] rate of housing starts ne
31 . However , the impact of a [[rising]] rate of household formati
32 of pockets of unsold homes and [[rising]] vacancy rates in apartmen
33 , which should contribute to a [[rising]] volume of consumer expend
34 by rising costs ( due to [[rising]] wages , etc. ) . Thus ,
35 of prices received by farmers [[rose]] in the month ended at mid-S
36 of Toronto in July and August [[rose]] to 2,418 units from 869 in
37 's short-term borrowing costs [[rose]] with Tuesday weekly offerin
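Concordances like Table 13-2b can be produced with a few lines of code; a minimal KWIC sketch in Python, run here on a constructed sample sentence rather than the actual BROWN subcorpus:

```python
import re

# Minimal KWIC concordance: find all word forms of RISE in a tokenized
# text and print each hit with four words of context on either side.
# The sample text is a constructed stand-in, not the BROWN subcorpus.
text = ("since marginal costs rise when the wage rate rises , "
        "rising costs lead to a rise in prices")
tokens = text.split()
rise_forms = re.compile(r"^(rise|rises|rising|rose)$", re.IGNORECASE)

for i, token in enumerate(tokens):
    if rise_forms.match(token):
        left = " ".join(tokens[max(0, i - 4):i])
        right = " ".join(tokens[i + 1:i + 5])
        print(f"{left:>28} [[{token}]] {right}")
```

A real application would tokenize corpus files instead of a string literal, but the alignment of hits in a fixed-width context window works the same way.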

The concordance shows that all instances of the word rise in the economics
subcorpus are metaphorical uses, based on the metaphor increase in quantity is
upward motion. It also provides information about the words with which this
metaphor is used most frequently: there are 9 cases of rate(s), 6 of price(s), 4
of expenditure(s), 3 each of costs, sales and spending, and 2 of wages.
This case study demonstrates that central metaphors for a given target domain
can indeed be identified by applying a keyword analysis to a specialized corpus of
texts from that domain. The case study does not discuss a particular research
question, but obviously, the method is useful in the context of many different
research designs. It can also be used to generate hypotheses in the first place: for
example, the words just listed suggest that the metaphor increase in quantity is
upward motion is associated more strongly with spending money (price,
expenditure, cost, spending) than with making money (sales). This would of course
have to be tested more rigorously in a separate study, for example, by comparing
how frequently increases are referred to by the metaphorical rise as opposed to
the literal increase with cost-related words as opposed to profit-related words.38

13.2.1.3 The impact of metaphorical expressions. As a final example of a source-
domain oriented study, consider Stefanowitsch (2005), which investigates the
relationship between metaphorical and literal expressions hinted at at the end of
the preceding case study. The aim of the study is to uncover evidence for the
function of metaphorical expressions that have literal paraphrases, such as [dawn
of NP] in examples like (13.5a), which is seemingly equivalent to the literal
[beginning of NP] in (13.5b):

(13.5a) [I]t has taken until the dawn of the 21st century to realise that the best methods of utilising . . . our woodlands are those employed a millennium ago. [BNC AHD]
(13.5b) Communal life survived until the beginning of the nineteenth century and traditions peculiar to that way of life had lingered into the present. [BNC AEA]

The study utilizes an inductive design with METAPHORICITY OF PATTERN as the independent variable (whose values are pairs of patterns like the one illustrated in examples (13.5a,b)), and the nouns in the NP slot as the dependent variable. This corresponds to a differential-collexeme analysis (Chapter 10, Section 10.2.2.2).
The study aims to test the hypothesis that metaphorical language serves a
38 For example, a query of the BNC for (rising|increasing) (cost|profit)s? results in 84 hits for rising cost(s) as opposed to 34 hits for increasing cost(s) and 7 hits for rising profit(s) as opposed to 12 hits for increasing profit(s). It is left as an exercise for the reader to test this distribution for significance using the chi-square test or a similar test (but use a separate sheet of paper, as the margin of this page will be too small to contain your calculations).
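For readers who prefer to skip the sheet of paper, the exercise can be sketched in a few lines of Python (this sketch is not part of the original footnote; it uses scipy's chi2_contingency without continuity correction):

```python
# Chi-square test for the footnote's counts: rising vs. increasing
# crossed with cost(s) vs. profit(s).
from scipy.stats import chi2_contingency

observed = [[84, 34],   # rising cost(s), increasing cost(s)
            [7, 12]]    # rising profit(s), increasing profit(s)

chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.3f}, df = {df}, p = {p:.4f}")
```

The distribution turns out to be significant at the one-percent level (Yates' continuity correction gives a somewhat lower chi-square value but the same conclusion), confirming the impression that rising is preferred with cost words and increasing with profit words.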

355
13 Metaphor and Metonymy

cognitive function and that for each pair of patterns investigated, the metaphorical variant should be used with nouns referring to more complex entities. The construct COMPLEXITY is operationalized in the form of axioms derived from gestalt psychology, such as the following:

Concepts representing entities that have a simple shape and/or have a clear boundary are less complex than those representing entities with complex shapes or fuzzy boundaries (because they are more easily delineable). This follows from the gestalt principles of closure and simplicity (Stefanowitsch 2005: 170).

For each pair of expressions, the differential collexemes are identified and the resulting lists are compared against these axiomatic assumptions. Take the case of the two expressions introduced above, whose differential collexemes are shown in Table 13-3.

Table 13-3. Differential collexemes of beginning of NP and dawn of NP

beginning of NP (n = 4,232)          dawn of NP (n = 117)
COLLEXEME        DIFFERENTIALITY     COLLEXEME            DIFFERENTIALITY
year (315/1)     1.35E-03            civilisation (1/19)  6.36E-30
century (385/4)  1.65E-02            time (36/13)         2.30E-10
                                     history (14/9)       3.23E-09
                                     age (10/7)           1.33E-07
                                     dream (1/4)          2.44E-06
                                     mankind (0/3)        1.90E-05
                                     era (37/8)           2.02E-05
                                     day (29/7)           3.73E-05
                                     consciousness (0/2)  7.18E-04
                                     enlightenment (0/2)  7.18E-04
                                     regime (0/2)         7.18E-04
                                     culture (1/2)        2.12E-03
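The differentiality values in tables like Table 13-3 are p-values from the Fisher-Yates exact test used in differential-collexeme analysis (cf. Chapter 10). As an illustration (not the computation used in the original study, whose software is not specified here), the value for civilisation can be recovered in Python from the frequencies given in the table:

```python
# One-tailed Fisher-Yates exact test for civilisation, which occurs
# 1 time in beginning of NP (total n = 4,232) and 19 times in
# dawn of NP (total n = 117).
from scipy.stats import fisher_exact

table = [[1, 19],                  # civilisation in each pattern
         [4232 - 1, 117 - 19]]     # all other nouns in each pattern

odds_ratio, p = fisher_exact(table, alternative="less")
print(f"p = {p:.2e}")
```

This yields a value on the order of the 6.36E-30 reported for civilisation in Table 13-3; the smaller the value, the more strongly the noun is attracted to dawn of rather than beginning of.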

Obviously, both expressions are associated with words referring to time, but those associated with the literal beginning of refer to time spans with clear boundaries and a clearly defined duration (year and century), while those associated with the metaphorical dawn of refer to variable time spans without such boundaries (civilization, time, history, age, era, culture). The one apparent exception is day, but this occurs exclusively in literal uses of dawn of, such as It was the dawn of the fourth day since the murder (BNC CAM). This (and similar


results for other pairs of expressions) are taken by Stefanowitsch (2005) as evidence for a cognitive function of metaphor.
Liberman (2005) notes in passing that even individual decades and centuries may differ in the degree to which they prefer beginning of or dawn of: using different internet search engines, he shows that dawn of the 1960s is more likely than dawn of the 1980s compared to beginning of the 1960s/1980s, and that dawn of the 21st century is more likely than dawn of the 18th century compared to beginning of the 18th/21st century. In his view, this calls into question the role of boundedness, since obviously all decades/centuries are equally bounded.
Since search engine frequency data are notoriously unreliable, let us check these observations in two large corpora, the 450-million-word Corpus of Contemporary American English (COCA) and the 1.9-billion-word Corpus of Global Web-based English (GloWbE) (see Appendix A.1). The names of decades (such as 1960s or sixties) occur too infrequently with dawn of in these corpora to say anything useful about them, but the names of centuries are frequent enough.
Table 13-4 shows the percentage of dawn of for the past ten centuries (all spelling variants of the respective centuries were included, e.g. 19th century, nineteenth century, etc.).

Table 13-4. Dawn of vs. beginning of with the century in two large corpora

                     COCA                        GloWbE
               DAWN  BEGINNING  % DAWN     DAWN  BEGINNING  % DAWN
12TH CENTURY      0          4  0.0000        0         26  0.0000
13TH CENTURY      0          5  0.0000        3         37  0.0750
14TH CENTURY      0          8  0.0000        3         43  0.0652
15TH CENTURY      0          6  0.0000        1         37  0.0263
16TH CENTURY      0         16  0.0000        3         85  0.0341
17TH CENTURY      0         19  0.0000        4        107  0.0360
18TH CENTURY      1         24  0.0400        5        128  0.0376
19TH CENTURY      2         74  0.0263        4        339  0.0117
20TH CENTURY     22        243  0.0830       62        934  0.0622
21ST CENTURY     36         95  0.2748       80        304  0.2083

There are clear differences between the centuries, and these differences are roughly similar in the two corpora: in both corpora, the 19th century occurs significantly less frequently and the 21st century occurs significantly more frequently than expected with dawn of (you can test this for yourself by calculating the chi-square components). Why this should be the case requires


further study, but one could speculate that the nineteenth century feels more
bounded than the 21st because it is actually over, and we can imagine it in its
entirety, while none of the speakers in the respective corpora will live to see the
end of the 21st century, making it conceptually less bounded.
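The suggested calculation can be sketched as follows (a Python illustration, not part of the original text; the counts are the GloWbE columns of Table 13-4):

```python
# Chi-square components for dawn of vs. beginning of per century
# (GloWbE counts from Table 13-4). A large component shows which
# cells contribute most to the overall chi-square value.
centuries = ["12th", "13th", "14th", "15th", "16th",
             "17th", "18th", "19th", "20th", "21st"]
dawn      = [0, 3, 3, 1, 3, 4, 5, 4, 62, 80]
beginning = [26, 37, 43, 37, 85, 107, 128, 339, 934, 304]

n = sum(dawn) + sum(beginning)
p_dawn = sum(dawn) / n          # overall proportion of dawn of

components = {}
for c, d, b in zip(centuries, dawn, beginning):
    row_total = d + b
    exp_dawn = row_total * p_dawn
    exp_beg = row_total * (1 - p_dawn)
    components[c] = ((d - exp_dawn) ** 2 / exp_dawn
                     + (b - exp_beg) ** 2 / exp_beg)

for c in sorted(components, key=components.get, reverse=True):
    print(f"{c} century: {components[c]:.2f}")
```

Running this sketch shows that the 21st century contributes by far the largest component, with the 19th century a distant second.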
This case study demonstrates a use of differential collexeme analysis (and thus of collocational methods in general) that goes beyond associations between words and other elements of structure and instead uses words and grammatical patterns as ways of investigating semantic associations. Direct comparisons of literal and metaphorical language are rare in the research literature, so this remains a potentially interesting field of research.

13.2.2 Target domains
13.2.2.1 Case study: happiness and joy. As discussed in Chapter 5 (Section 5.1.3), there are two types of metaphorical utterances: those that could be interpreted literally in their entirety (like Lakoff and Kövecses' example I am burned up, cf. Examples 5.6a and 5.7a), and those that contain vocabulary from both the source and the target domain (like He was filled with anger, cf. 5.8a). Stefanowitsch (2004, 2006) refers to the latter as metaphorical patterns, defined as follows:

A metaphorical pattern is a multi-word expression from a given source domain (SD) into which one or more specific lexical item from a given target domain (TD) have been inserted (Stefanowitsch 2006: 66).

In the example just cited, the multi-word source-domain expression would be [NP_container be filled with NP_substance], the source domain would be that of substances in containers. The target domain word that has been inserted in this expression is anger, yielding the metaphorical pattern [NP_container be filled with NP_emotion]. The metaphors instantiated by this pattern include an emotion is a substance and experiencing an emotion is being filled with a substance.
A metaphorical pattern analysis of a given target domain (like anger) thus proceeds by selecting one or more words that refer to (or are inherently connected with) this domain (for example, the word anger, or the set irritation, annoyance, anger, rage, fury, etc.) and retrieving all instances of this word or set of words from a corpus. The next step consists in identifying all cases where the search term(s) occur in a multi-word expression referring to some domain other than emotions. Finally, the source domains of these expressions are identified, giving us the metaphor instantiated by each metaphorical pattern. The patterns


can then be grouped into larger sets corresponding to metaphors like emotions are substances.
Of course, the metaphors identified in this way are (or can be) specifically related to the specific target-domain word occurring in the corresponding pattern, rather than the emotion referred to. For example, the BNC contains the metaphorical patterns filled with anger and filled with rage, but not filled with irritation/annoyance/fury. In other words, metaphorical pattern analysis identifies metaphors associated with a particular lexicalization of a concept, rather than the concept itself. In this way, it can be used for comparisons of lexicalizations across languages (cf. e.g. Stefanowitsch 2004, Díaz-Vera 2013), of alternative lexicalizations within a single language (e.g. Stefanowitsch 2006, Turkkila 2014), or both (e.g. Rojo Lopez 2013), or of changes in the metaphors associated with a lexical item (Tissari 2003, 2010).
For example, Stefanowitsch (2006) investigates the metaphorical patterns associated with the near synonyms happiness and joy in an inductive study of metaphors associated with various words for basic emotions in English. One of the findings is that while liquid/substance metaphors are found with both words, they differ in the number of metaphors suggesting that the person experiencing the corresponding emotion is completely full or even bursting. Table 13-5a shows the respective metaphors and their frequency of occurrence with each word.


This is a study with two nominal variables, HAPPINESS WORD (with the values HAPPINESS and JOY) and TYPE OF METAPHOR (with the values FULLNESS and OTHER). The original study does not give an operational definition for the construct FULLNESS, but since the patterns instantiating it are exhaustively listed, researchers can check to what extent they agree with the categorization.
As Table 13-5b shows, FULLNESS metaphors are significantly more frequent than expected with joy and less frequent than expected with happiness (χ² = 9.4 (df = 1), p < 0.01). Stefanowitsch (2006) suggests that this reflects the fact that joy describes a much more intense emotion than happiness. This hypothesis would have to be tested with other pairs of near synonyms where one describes a more intense emotion than the other, such as grief/sadness, disgust/revulsion, lust/desire, rage/anger, etc. (Turkkila 2014 finds more SUBSTANCE-IN-A-CONTAINER metaphors for rage (20/500) than for anger (9/500), with fury in the middle but closer to anger (13/500); however, the study does not list FULLNESS metaphors separately).


Table 13-5a. LIQUID metaphors for happiness and joy (Stefanowitsch 2006: 100)
happiness joy
source of NP 12 6
NP spring from X 1 1
X open self to NP 1
NP pour into heart 1
inner NP 1 2
X contain/include/hold NP 7 1
NP be in X 2
distillation of NP 1
X drink NP 1
NP evaporate 1
X leave X empty of NP 1 1
TOTAL 24 16

FULLNESS, PRESSURE, and BURSTING metaphors happiness joy
effervescent/seething NP 2
pressure of NP 1
swell of NP 1
heave of NP 1
rush of NP 1
surge of NP 2 1
river be NP 1
flood of NP 1
NP subside 1
filled/loaded with/full of NP 8 15
heart (be) full to bursting with NP 1
heart fill/swell with NP 2
X fill/swell Y('s heart) with NP 1 6
NP brim in heart 1
burst/explosion of NP 1 1
cold void run over with NP 1
NP burst in/through X('s) heart 2
NP overflow 1
X brim over with NP 1
X burst/erupt/explode in/with NP 6
NP surge/sweep/wash over/through X 1 3
X be swept away by NP 1
X pour NP 1
flow of NP emanate from X 1
NP seep from X 1
TOTAL 20 47


Table 13-5b. LIQUID metaphors for happiness and joy

                               HAPPINESS WORD
                       HAPPINESS       JOY          Total
LIQUID     FULLNESS    Obs: 24         Obs: 16         40
METAPHOR               Exp: 16.45      Exp: 23.55
           OTHER       Obs: 20         Obs: 47         67
                       Exp: 27.55      Exp: 39.45
Total                       44              63         107
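The expected frequencies in Table 13-5b follow directly from the marginal totals, as the following Python sketch (not part of the original text) shows; it also recovers the chi-square value of 9.4 reported in the text:

```python
# Expected frequencies and chi-square value for Table 13-5b,
# computed from the observed counts alone.
obs = {("FULLNESS", "happiness"): 24, ("FULLNESS", "joy"): 16,
       ("OTHER", "happiness"): 20, ("OTHER", "joy"): 47}

n = sum(obs.values())
row_totals = {r: sum(v for (ri, _), v in obs.items() if ri == r)
              for r in ("FULLNESS", "OTHER")}
col_totals = {c: sum(v for (_, ci), v in obs.items() if ci == c)
              for c in ("happiness", "joy")}

chi2 = 0.0
for (r, c), o in obs.items():
    exp = row_totals[r] * col_totals[c] / n   # expected frequency
    chi2 += (o - exp) ** 2 / exp              # chi-square component
    print(f"{r}/{c}: obs {o}, exp {exp:.2f}")

print(f"chi-square = {chi2:.2f}")
```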

This case study demonstrates how central metaphors for a given target domain can be identified by searching for general words describing (aspects of) the domain in question. It also shows that these metaphors can be associated to different degrees with different words within a given target domain. It is unclear to what extent this lexeme specificity of metaphorical mappings supports or contradicts current cognitive theories of metaphor, so it is potentially a highly productive area of research.
T

13.2.3 Metaphor and text



13.2.3.2 Case Study: Metaphoricity signals. Although metaphorical expressions are pervasive in all registers and speakers do not generally seem to draw special attention to their occurrence, there is a wide range of devices that mark non-literal language more or less explicitly (as in metaphorically/figuratively speaking, picture NP as NP, so to speak/say):

(13.6a) the princess held a gun to Charles 's head, figuratively speaking [BNC CBF]
(13.6b) He pictures eternity as a filthy Russian bathhouse [BNC A18]
(13.6c) the only way they can deal with crime is to fight fire, so to speak, with fire. [BNC ABJ]
[BNC ABJ]

Wallington et al. (2003) investigate the extent to which these devices, which they call metaphoricity signals, correlate systematically with the occurrence of metaphorical expressions in language use. They find no strong correlation, but as they note, this may well be due to various aspects of their design. First, they adopt a very broad view of what constitutes a metaphoricity signal, including


expressions like a kind/type/sort of, not so much NP as NP and even prefixes like super-, mini-, etc. While some or all of these signals may have an affinity to certain types of non-literal language, one would not really consider them to be metaphoricity signals in the same way as those in (13.6a-c). Second, they investigate a carefully annotated, but very small corpus. Third, they do not distinguish between strongly conventionalized metaphors, which are found in almost every utterance and are thus unlikely to be explicitly signaled, and weakly conventionalized metaphors, which seem more likely to be signaled explicitly a priori.
A more restricted case study could determine whether the idea of metaphoricity signals is, in principle, plausible. Let us look at what is intuitively the clearest case of such a signal on Wallington et al.'s list: the sentence adverbials metaphorically speaking and figuratively speaking. As a control, let us use the roughly equally frequent sentence adverbial technically speaking, which does not signal metaphoricity but which can, of course, co-occur with (conventionalized) metaphors and which can thus serve as a baseline.
There are 22 cases of technically speaking in the BNC. Only four of these are part of a clause that arguably contains a metaphor:
T

(13.7a) Technically speaking, of course, she was off duty. [BNC JYB]
(13.7b) [D]irectors can, technically speaking, be held liable for negligence and consequently sued. [BNC CBU]
(13.7c) [T]echnically speaking, you are no longer in a position to provide him with employment. [BNC HWN]
(13.7d) As novelists, however, Orwell and Waugh evolve not towards each other but, technically speaking, in opposite directions. [BNC CKN]



Note that we are taking a very generous view on what counts as a metaphor: (13.7a) contains the spatial preposition off as part of the phrase off duty, which could be said to instantiate the metaphor a situation is a location; (13.7b) uses hold as part of the phrase hold liable, instantiating a metaphor like believing something about someone is holding them (cf. also hold s.o. responsible/accountable, hold in high esteem); (13.7c) uses provide as part of the phrase provide employment, which instantiates a metaphor like causing someone to be in a state is transferring an object to them (cf. also provide s.o. with an opportunity/insight/power). All three cases are highly lexicalized. Even (13.7d), which uses the verb evolve metonymically to refer to a non-evolutionary development and then uses the spatial expressions towards and opposite direction


metaphorically to describe the quality of the development, is a relatively conventionalized expression.
There are 19 cases of the sentence adverbial figuratively/metaphorically speaking in the BNC. In stark contrast to technically speaking, all but two of these are part of a clause that clearly contains a metaphor or metonymy. The metaphorical expressions are listed in Table 13-6, the two exceptions are shown in (13.8a, b):

(13.8a) Figuratively speaking, he declared, in case of need, Soviet artillerymen can support the Cuban people with their rocket fire. [BNC G1R]
(13.8b) Let's pick someone completely at random, now we've had Tracey figuratively speaking! [BNC F7U]

Table 13-6. Metaphorical expressions signaled by metaphorically/figuratively speaking (BNC)

EXPRESSION                      MEANING                             METAPHOR                          SOURCE
bloody the nose of sb           be successful in court against sb   legal fight is physical fight     A3G
take it on the chin             endure being criticized             argument is physical fight        ASA
put the boot in                 treat sb cruelly                    sports is physical fight          K25
hold gun to sb's head           coerce sb to act                    power is physical force           CBF
a taste for excitement          experience                          experience is taste               AC9
balance the books               set things right                    life is commercial transaction    KAE
cash in                         be successful                       life is commercial transaction    C9U
make a killing                  be financially successful           life is a hunt                    BMR
transport sb across the world   make sb think of a distant location imagination is motion             ACX
bulwark of society              defender of society                 defense is a wall                 B0Y
make sth serve one's aims       put sth to use in achieving sth     to be used is to serve            BMA
a frozen moment in time         documentation of a particular state time is a river                   HPN
superego be soluble in alcohol  self-control disappears when drunk  character is a substance          KM7
burst into song                 take one's turn speaking            singing for speaking              B21
give one's right arm to do sth  want sth very much                  body part for personal value      B2D
be within arm's length          be in close proximity               arm's length for short distance   HTP
you tell me                     your co-employee tell me            employee for company              HGY

We can now compare the literal and metaphorical contexts in which the expressions technically speaking and metaphorically/figuratively speaking occur. If the latter are a metaphoricity signal, they should occur significantly more frequently in metaphorical contexts than the former. As Table 13-7 shows, this is indeed the case (χ² = 20.74 (df = 1), p < 0.001).


Table 13-7. Literal and figurative utterances containing the sentence adverbials metaphorically/figuratively speaking and technically speaking in the BNC

                             SENTENCE ADVERBIAL
                      METAPHORICALLY/         TECHNICALLY    Total
                      FIGURATIVELY SPEAKING   SPEAKING
FIGURATIVE    YES     Obs: 17                 Obs: 4            21
                      Exp: 9.73               Exp: 11.27
              NO      Obs: 2                  Obs: 18           20
                      Exp: 9.27               Exp: 10.73
Total                 19                      22                41
Of course, the question remains why some metaphors should be explicitly signaled while the majority is not. For example, we might suspect that metaphorical expressions are more likely to be explicitly signaled in contexts in which they might be interpreted literally. This may be the case for put the boot in, which occurs in a description of a rugby game where one could potentially misread it as a statement that someone was actually kicked:

(13.9a) Guy Gregory, who was later to prove the match winner, easily converted. Gloucester needed a kick start to get into the game. Three penalties helped put them in front. But not for long. Gregory put the boot in metaphorically speaking!

Alternatively (or additionally), a metaphor may be signaled explicitly if its specific phrasing is more likely to be used in literal contexts. This may be the case with hold a gun to sb's head, of which there are ten examples in the BNC, only one of which is metaphorical:

(13.9b) I'm not sure if the princess held a gun to Charles's head, figuratively speaking, but it seems if she wanted something said.

Again, which of these hypotheses (if any of them) is correct would have to be studied more systematically.
This case study found a clear effect where the authors of the study it is based on did not. This demonstrates the need to formulate specific predictions concerning the behavior of specific linguistic items in such a way that they can


be tested systematically and the results be evaluated statistically. The study also shows that the area of metaphoricity signals is worthy of further investigation.

13.2.3.3 Metaphor and ideology. Regardless of whether metaphor is a rhetorical device (as has traditionally been assumed) or a cognitive device (as seems to be the majority view today), it is clear that it can serve an ideological function, allowing authors to suggest a particular perspective on a given topic. Thus, an analysis of the metaphors used in texts manifesting a particular ideology should allow us to uncover those perspectives.
For example, Charteris-Black (2005) investigates a corpus of right-wing communication and media reporting on immigration, containing speeches, political manifestos and articles from the conservative newspapers Daily Mail and Daily Telegraph. He finds, among other things, that the metaphor immigration is a flood is used heavily, arguing that this allows the right to portray immigration as a disaster that must be contained, citing examples like a flood of refugees, the tide of immigration, and the trickle of applicants has become a flood.
Charteris-Black's findings are intriguing, but since he does not compare the findings from his corpus of right-wing materials to a neutral or a corresponding left-wing corpus, it remains an open question whether the use of these metaphors indicates a specifically right-wing perspective on immigration. Let us therefore replicate his analysis more systematically. The BNC contains 1 232 966 words from the Daily Telegraph (all files whose names begin with AH, AJ and AK), which will serve as our right-wing corpus, and 918 159 words from the Guardian (all files whose names begin with A8, A9 or AA, except file AAY), which will serve as our corresponding left-wing (or at least left-leaning) corpus. Since Charteris-Black's examples all involve reference to target-domain items such as refugee and immigration, a metaphorical-pattern analysis (cf. Section 13.2.2.1 above) suggests itself. Table 13-8a shows all concordance lines for the words migrant(s), immigrant(s) and refugee(s) containing metaphorical patterns instantiating the metaphor immigration is a mass of water.
In terms of absolute frequencies, there is no great difference between the two subcorpora (10 vs. 11), but the overall number of hits for the words in question differs drastically: there are 136 instances of these words in the Guardian subcorpus but only half as many (68) in the Telegraph subcorpus. This means that, relatively speaking, in the domain of migration liquid metaphors are more frequent than expected in the Telegraph and less frequent than expected in the Guardian (see Table 13-8b), which suggests that such metaphors are indeed typical of right-wing discourse. The difference is only marginally significant,

however, so a larger corpus would be required to confirm the result (χ² = 3.822, df = 1, p = 0.0506).

Table 13-8a. Selected LIQUID metaphors with migrant(s), immigrant(s), refugee(s)

TELEGRAPH (n=68)
Britain would be swamped with [[immigrants]] under a Labour Government
s country would be swamped with [[immigrants]] of every colour and race ,
support was due to the flood of [[migrants]] and would-be asylum seekers .
t increasing levels of economic [[migrants]] and asylum seekers entering B
nges to the constitution if the [[refugee]] influx was to be curbed . Herr
d open up Britain to a flood of [[immigrants]] and permit the rise of fasc
 the Gulf war and the influx of [[refugees]] from Afghanistan and Iraq . T
 for help to deal with flood of [[refugees]] By Philip Sherwell in Tuzla T
 Sherwell in Tuzla THE FLOOD of [[refugees]] fleeing the escalating confli
an militia groups . Most of the [[refugees]] flowing into Tuzla are escapi
gling to cope with the flood of [[refugees]] and have appealed to the inte

GUARDIAN (n=136)
mirroring this year 's flood of [[refugees]] . Watched by a demonstration
litary security , migration and [[refugee]] flows on an vast scale . As We
 cope with the current surge of [[refugees]] , her Foreign Secretary , Mr
nd other aspects of large-scale [[immigrant]] absorption . The bureaucracy
he need to control an influx of [[immigrants]] . Rebel troops end siege of
control and manage the flows of [[migrants]] that wars , disasters , and ,
greed that the flow of economic [[migrants]] from Vietnam should be stoppe
e form of an influx of Romanian [[refugees]] . In one case in 1988 a Roman
control and manage the flows of [[migrants]] that wars , disasters , and ,
greed that the flow of economic [[migrants]] from Vietnam should be stoppe

Table 13-8b. LIQUID patterns with the words migrant(s), immigrant(s), refugee(s)

                                 NEWSPAPER
                     GUARDIAN          TELEGRAPH        Total
                     (PROGRESSIVE)     (CONSERVATIVE)
LIQUID MET.          Obs: 10           Obs: 11             21
PATTERN              Exp: 14           Exp: 7
OTHER                Obs: 126          Obs: 57            183
                     Exp: 122          Exp: 61
Total                136               68                 204
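The marginal significance reported in the text can be verified from Table 13-8b with a few lines of Python (a sketch, not part of the original text; the p-value is obtained from the chi-square distribution with one degree of freedom):

```python
# Chi-square test for Table 13-8b; the resulting p-value sits just
# above the conventional 0.05 threshold.
from scipy.stats import chi2 as chi2_dist

obs = [[10, 11],    # LIQUID metaphorical patterns: Guardian, Telegraph
       [126, 57]]   # other uses of the search words

row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]
n = sum(row_totals)

# Sum the chi-square components over all four cells.
chi2 = sum((obs[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))
p = chi2_dist.sf(chi2, df=1)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}")
```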

This case study demonstrates that even general metaphors such as immigration is a mass of water may be associated with particular political ideologies. There is a large research literature on the role of metaphor in political discourse (see, for example, Koller 2004, Charteris-Black 2004, cf. also Musolff 2012), although at least part of this literature is not as systematic and quantitative as it should be, so this remains a promising area of research. The case study also demonstrates the need to include a control sample in corpus-linguistic designs (in case this still needed to be demonstrated at this point).

13.2.4 Metonymy
Following G. Lakoff and Johnson (1980: 35), metonymy is defined in a broad sense here as using one entity to refer to another that is related to it (this includes what is often called synecdoche, see Seto 1999 for critical discussion). Textbook examples are the following from Lakoff and Johnson (1980: 35 and 39 respectively):

(13.10a) The ham sandwich is waiting for his check.
(13.10b) Nixon bombed Hanoi.

In (13.10a), the metonym ham sandwich stands for the target expression the person who ordered the ham sandwich, in (13.10b) the metonym Nixon stands for the target expression the air-force pilots controlled by Nixon (at least at first glance, cf. Case Study 13.2.4.3 below).

Thus, metonymy differs from metaphor in that it does not mix vocabulary from two domains, which has consequences for a transfer of the methods introduced for the study of metaphor in Section 13.1.
The source-domain oriented approach can be transferred relatively straightforwardly: we can query an item (or set of items) that we suspect may be used as metonyms and then identify the actual metonymic uses (see Case Study 13.2.4.1). The main difficulty with this approach is choosing promising items for investigation. For example, the word sandwich occurs almost 900 times in the BNC, but unless I have overlooked one, it is not used as a metonym even once.
A straightforward analogue to the target-domain oriented approach (i.e., metaphorical pattern analysis) is more difficult to devise, as metonymies do not combine vocabulary from different semantic domains. One possibility would be to search for verbs that we know or suspect to be used with metonymic subjects and/or objects. For example, a Google search for "is waiting for (his|her|their|the) check" turns up about 20 unique hits; most of these have people as subjects and none of them have meals as subjects, but there are three cases that have table as subject, as in (13.11):


(13.11) Table 12 is waiting for their check. [articles.baltimoresun.com]
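Such queries can of course also be run over locally stored text; the following minimal Python sketch shows the same search pattern as a regular expression (the example sentences are invented for illustration):

```python
# A regular-expression version of the query
# "is waiting for (his|her|their|the) check".
import re

pattern = re.compile(r"\bis waiting for (?:his|her|their|the) check\b")

lines = [                       # invented example sentences
    "The gentleman at the corner is waiting for his check.",
    "Table 12 is waiting for their check.",
    "She was waiting for a check from the insurance company.",
]

hits = [line for line in lines if pattern.search(line)]
print(hits)  # the first two lines match, the third does not
```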

A more systematic but still somewhat informal investigation of this type will be presented in Case Study 13.2.4.2.
Finally, note that analyses of potential metonyms or target frames are not the only way in which corpus data can shed light on metonymical relationships; a particularly interesting, much more complex study is discussed in Case Study 13.2.4.3.

13.2.4.2 Case study: Metonymy and grammar. As mentioned above, a quasi-source-domain-oriented approach, i.e. one starting from the potential metonym, is straightforward to implement, although it depends on the choice of a promising candidate for metonymic usage. Body parts are known to be such promising candidates (see, e.g., Deignan 1999 on shoulder, Levin and Lindquist 2007 on nose, Lindquist and Levin 2008 on foot and mouth and Niemeier 2008 on head and heart).
Hilpert (2006) looks at metonymic uses of eye with the aim of determining the extent to which they represent conventionalized patterns (fixed or semi-fixed phrases like those in [13.11a,b]) as opposed to productive uses (of which Hilpert provides no examples, but [13.11c], a variant of the conventionalized have an eye for, might be a candidate):

(13.11a) [H]er brother-in-law had kept an eye on her finances from the beginning [BNC FPM]
(13.11b) A detail on the screen had caught his eye. [BNC FR0]
(13.11c) Joan had an eye which knew what to do with apparently unpromising houses [BNC A68]

In a ten-million-word sample from the BNC, Hilpert finds 443 metonymic uses of eye. Of these, 323 (i.e. 72.9 percent) represent highly conventionalized patterns, which are listed in Table 13-9 (with some minor differences from Hilpert's original classification).


Table 13-9. Metonymic patterns with eye in a 10 percent sample of the BNC

Pattern                                   Meaning                          Metonymic Links            Freq.
i.    eye contact                         visual contact                   eye for watching              46
ii.   catch POSS eye                                                                                    34
      catch the eye (of NP)                                                                              7
      eye-catching                                                                                       2
iii.  V DET/POSS eye over NP              scan NP                                                        5
iv.   keep an eye on NP                   pay attention to NP              eye for watching +            56
      keep a ADJ eye on NP                                                 watching for attention        10
      keep POSS eye on NP                                                                               11
v.    one eye on NP                       pay some attention to                                          5
vi.   PREP the public eye                 PREP the public attention                                     12
      Public Eye (TV series)                                                                             7
vii.  with an eye on NP (cf. xiii)        pay attention to NP                                            2
viii. turn a blind eye (to NP)            disregard something/NP                                        17
ix.   under the eye of NP                 under observation of NP          eye for watching +             6
      under POSS eye                                                       watching for supervision       1
x.    with an eye to NP                   with concern for NP              eye for watching +             5
                                                                           watching for concern
xi.   with an eye to V-ing NP             with the intention of V-ing NP   eye for watching +             3
                                                                           watching for intention
xii.  have an eye for NP (cf. xvii)       have interest in NP              eye for watching +             3
                                                                           watching for interest
xiii. with an eye on NP (cf. vii)         want NP                          eye for watching +             2
xiv.  have POSS eye on NP                                                  watching for wanting           6
xv.   in POSS/the mind's eye              in POSS/the imagination          eye for vision                18
xvi.  there BE more to NP than            there BE more to NP than is      eye for vision +               3
      meets the eye                       readily perceivable              vision for perception
xvii. have an eye for NP (cf. xii)        have good perception of NP       eye for vision +               5
      POSS ADJ eye for NP                 good perception of NP            vision for good perception     2
      with an eye for NP                  with good perception of NP                                      2
xviii. private eye                        private investigator             eye for vision +               1
      Private Eye (Magazine)                                               vision for investigation      11
      the Eye (Magazine)                                                                                  6
xix.  NP in POSS eye                      NP in POSS expression            eye for expression             8
      NP COME into POSS eye               NP enter POSS expression                                        3
xx.   N to the eye                        N to the beholder                                               3
xxi.  ADJ to the eye                      ADJ to the beholder              eye for beholder               2
xxii. to the ADJ eye                      to the ADJ beholder                                             4
xxiii. see eye to eye                     agree                            understanding is seeing        7
xxiv. the apple of DET/POSS eye           cherished object                 eye for thing seen             5
xxv.  black eye                           discolored eye region            part for part                  5


This clear dominance of conventionalized metonymical expressions is interesting, as it suggests that metonymies (or at least body-part metonymies) have a much more limited productivity than one would expect if they were a genuinely cognitive process. This is not to suggest, of course, that metonymy is a purely linguistic phenomenon; in addition to the 323 conventionalized expressions, there are 120 expressions (27.1 percent) that only occur once in Hilpert's sample and that he therefore classifies as productive (although this estimate is probably too generous, as a larger corpus would show that many of these patterns do reoccur frequently enough to be classified as (semi-)fixed phrases).

.1
It would be interesting to look more closely at the nature of these productive uses, for example, to see whether they are largely minimal extensions of conventionalized patterns (like example 13.11c), or whether there are also truly creative uses, as are commonly found in the domain of metaphor. Interestingly, the patterns Hilpert identifies as productive uses overwhelmingly represent the most frequently instantiated metonymies: eye for attention, eye for watching and eye for perception. This may appear to suggest that these are indeed productive metonymies, while mappings like eye for expression or eye for beholder are dead metonymies limited to a few fixed phrases; however, it may also be due to the fact that the expressions instantiating them are more entrenched and thus more amenable to extension.

This case study demonstrates the metonym-oriented approach to the corpus-based analysis of metonymy. It also shows that such an approach need not be limited to description but can be used to answer questions about the nature of metonymy. Metonymy has been studied far less intensively than metaphor from a corpus-based perspective (and in general), so this remains an exciting area of study.
13.2.4.3 Case Study: Subjects of bomb. As suggested in the introduction to this section, it may be interesting to explore the possibility of target-domain-oriented studies of metonymy, i.e., to start from a verb (or entire predicate) that we expect might have metonymic subjects (or objects). One such study is Stefanowitsch (2015), which studies the verb bomb (as in (13.10b), Nixon bombed Hanoi). According to Lakoff and Johnson, this sentence instantiates what they call the controller for controlled metonymy, i.e. Nixon would be a metonym for the air force pilots controlled by Nixon.39 Thus, a search for [pos: NOUN] [lemma: bomb] should allow us to assess, for example, the importance of this metonymy in relation to other metonymies and literal uses.

39 Alternatively, as argued by Stallard (1993), it is the predicate rather than the subject that is used metonymically in this sentence, which would make this a metonym-oriented case study.
Querying the BNC for [pos: NOUN] [lemma: bomb, pos: VERB] yields 31 hits referring to the dropping of bombs. Of these, only a single one has the ultimate decision maker as a subject (cf. 13.12a). Somewhat more frequent in subject position are countries or inhabitants of countries (5 cases) (cf. 13.12b, c). Even more frequently, the organization responsible for carrying out the bombing (e.g. an air force, or part of an air force) is chosen as the subject (9 cases) (cf. 13.12d, e). The most frequent case (14 hits) mentions the aircraft carrying the bombs in subject position, often accompanied by an adjective referring to the country whose military operates the planes (cf. 13.12f) or some other responsible group (cf. 13.12g). Finally, there are two cases where the bombs themselves occupy the subject position (cf. 13.12h).
v0
(13.12a) Mussolini bombed and gassed the Abyssinians into subjection.
(13.12b) On the day on which Iraq bombed Larak
(13.12c) Seven years after the Americans bombed Libya
(13.12d) [T]he school was blasted by an explosion, louder than anything heard there since the Luftwaffe bombed it in 1944.
(13.12e) Germany, whose Condor Legion bombed the working classes in Guernica
(13.12f) on Jan. 24 French aircraft bombed Iraq for the first time
(13.12g) Rebel jets bombed the Miraflores presidential palace
(13.12h) Watching our bombs bomb your people
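The corpus query in the running text can be approximated with a regular expression over word_TAG-annotated text. This is only an illustrative sketch: the toy fragment, the CLAWS-style tags and the pattern are assumptions made for demonstration, not the actual BNC query mechanism.

```python
import re

# Toy fragment in word_TAG format (CLAWS-style tags assumed for illustration).
tagged = ("Seven_CRD years_NN2 after_PRP the_AT0 Americans_NN2 "
          "bombed_VVD Libya_NP0 and_CJC rebel_AJ0 jets_NN2 bombed_VVD "
          "the_AT0 palace_NN1")

# A noun token (tag starting with N) directly followed by a form of BOMB
# tagged as a verb -- a rough stand-in for [pos: NOUN] [lemma: "bomb", pos: VERB].
pattern = re.compile(r"(\w+)_N\w*\s+(bomb(?:s|ed|ing)?)_V\w*")

hits = pattern.findall(tagged)
print(hits)  # -> [('Americans', 'bombed'), ('jets', 'bombed')]
```

In a real study, the matched subjects would then be coded by hand into the categories discussed above (decision makers, countries, military units, aircraft, bombs).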

Cases with pronouns in subject position have a similar distribution; again, there is only one hit with a human controller in subject position. All hits (whether with pronouns, common nouns or proper names), interestingly, have metonymic subjects, i.e., not a single example has the bomber pilot in the subject position. This is unexpected, since literal uses should be more frequent than figurative uses (it leads Stefanowitsch 2015 to reject an analysis of such sentences as metonymies altogether). On the other hand, there are cases that are plausibly analyzed as metonymies here, such as examples (13.12d, e), which seem to instantiate a metonymy like military unit for member of unit (i.e. whole for part) and (13.12f-h), which instantiate plane for pilot, i.e. instrument for controller.
More systematic study of such metonymies by target domain could uncover more such facts as well as contribute to a general picture of how important particular metonymies are in a particular language.
This case study sketches out a potential target-oriented approach to the corpus-based study of metonymy, along with some general questions that we might investigate using it (most obviously, the question of how central a given metonymy is in the language under investigation). Again, metonymy is a vastly under-researched area of linguistics, so much work remains to be done.

13.2.4.4 Case Study: Logical metonymies. Lapata and Lascarides (2003) present an innovative approach to what they call logical metonymy: clauses where the object of a verb is logically the object of some other verb that does not occur in the utterance. For example, we would typically interpret the utterance Mary enjoyed the book as 'Mary enjoyed reading the book', or perhaps 'Mary enjoyed writing the book', but definitely not as 'Mary enjoyed buying the book', even though buying a book is a perfectly normal thing to do.

Lapata and Lascarides argue that the interpretation of the missing verb depends both on the verb that is actually present (e.g. enjoy) and on the object (e.g. book): it is determined, in our examples, on the basis of our knowledge of what people typically enjoy and our knowledge of what people typically do with books. By identifying verbal collocates shared by the verb (when used as a matrix verb with a dependent clause) and the object, it should be possible to recover the most plausible interpretations for any given logical metonymy.
For example, the ten most frequent dependent verbs of enjoy as a matrix verb in the BNC are (in that order) be, work, read, play, have, go, watch, meet, make, and see; the verbs most frequently taking book as an object (or predicate noun) are be, read, write, have, get, publish, buy, find, produce and make. Taking into account the overall frequency of these verbs and their frequencies with enjoy and book, it should be possible to find the best match. Lapata and Lascarides use the following formula, where Vmet is the metonymically used verb (e.g. enjoy), Vimp is the potentially implied verb (e.g. read, write, buy) and Obj is the object (e.g. book):

(13.13a)  P(Vimp, Vmet, Obj) = ( f(Vmet, Vimp) × f(Obj, Vimp) ) / ( N × f(Vimp) )

Buy occurs 25,827 times in the BNC in general, and 125 times with the object [(Det) (Adj) book]; enjoy buying occurs 3 times, and there are 11,292,844 verbs in the version of the BNC used here. Thus, we can calculate the hypothetical probability of the co-occurrence of buy, enjoy and book as follows:

(13.13b)  P(Vimp, Vmet, Obj) = (3 × 125) / (11292844 × 25827) = 0.000000001286

If we repeat this calculation for all verbs occurring with enjoy and/or book and order the results by probability of co-occurrence, we find that the most likely interpretation is, indeed, read, followed by write (see Table 13-10).

Table 13-10. Most probable interpretations for enjoy & book.

Vimp        f(Obj, Vimp)   f(Vmet, Vimp)   f(Vimp)     P(Vimp, Vmet, Obj)
read             615             68           22702        1.63E-07
write            604             20           39230        2.73E-08
cook              18             13            3933        5.27E-09
be              1147            150         4122614        3.70E-09
get              191             19          213378        1.51E-09
travel            18              9            9890        1.45E-09
have             399             52         1317706        1.39E-09
buy              125              3           25827        1.29E-09
make              83             35          210554        1.22E-09
compile            9              2            1366        1.17E-09
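The ranking in Table 13-10 can be reproduced in a few lines of Python. The counts and the total N = 11,292,844 are taken from the table and the running text; this is a sketch of the calculation, not Lapata and Lascarides' actual implementation:

```python
# Counts from Table 13-10: verb -> (f(Obj, Vimp), f(Vmet, Vimp), f(Vimp))
counts = {
    "read":  (615, 68, 22702),
    "write": (604, 20, 39230),
    "cook":  (18, 13, 3933),
    "buy":   (125, 3, 25827),
}
N = 11292844  # verb tokens in the version of the BNC used here

def prob(f_obj_vimp, f_vmet_vimp, f_vimp):
    """P(Vimp, Vmet, Obj) as defined in (13.13a)."""
    return (f_vmet_vimp * f_obj_vimp) / (N * f_vimp)

ranking = sorted(counts, key=lambda v: prob(*counts[v]), reverse=True)
print(ranking)  # -> ['read', 'write', 'cook', 'buy']
```

The value for buy, prob(125, 3, 25827), comes out at roughly 1.286E-09, matching the hand calculation in (13.13b).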
Of course, the method is not flawless. First, it obviously depends on the size and quality of the corpus: note the occurrence in third place of the verb cook, which is entirely due to the questionable tagging of cook as a verb in the compound cooking book (British English for cookbook).

However, while it generally yields excellent results, there are cases where it ranks plausible interpretations at the top of the list that are simply not the most expected, suggesting that patterns like [enjoy + Object] at least in some cases receive conventionalized interpretations that do not reflect an intersection of the use of [enjoy + Verb] and [Verb + Object]. Compare, for example, the results for food and meal in Table 13-11.

Obviously, I enjoyed the meal/food can both be interpreted to mean 'I enjoyed cooking the meal/food' in a specific context, but in both cases the default interpretation would be 'I enjoyed eating the meal/food'. However, based on the frequencies of [enjoy + Verb] and [Verb + meal], the most likely interpretation for enjoy the meal should be 'enjoy cooking the meal'. It is possible that a more fine-grained corpus analysis (for example, one that distinguishes between transitive and intransitive uses of cook and other verbs in [enjoy + Verb]) would yield more accurate results, but it is also possible that usage does not fully reflect our assumptions about what people enjoy and/or typically do with meals.

Table 13-11. Most probable interpretations for enjoy & food and enjoy & meal.

Vimp        f(Obj, Vimp)   f(Vmet, Vimp)   f(Vimp)     P(Vimp, Vmet, Obj)
FOOD
eat              210             14           13791        1.89E-08
cook              39             13            3933        1.14E-08
buy              129              3           25827        1.33E-09
share             20              9           12803        1.24E-09
be               354            150         4122614        1.14E-09
MEAL
cook              83             13            3933        2.43E-08
eat              112             14           13791        1.01E-08
make              93             35          210554        1.37E-09
share             15              9           12803        9.34E-10
have             216             52         1317706        7.55E-10
Finally, as a demonstration that the method does not simply pick the verb most frequently used with the object, consider Table 13-12, which shows the most probable interpretations for [plan (Det) (Adj) book]. Even though read is the most frequent verb with book, the method correctly picks write, followed by the almost equally likely publish, as the best interpretation (read is the eighth-best interpretation).
Table 13-12. Most probable interpretations for plan & book.

Vimp        f(Obj, Vimp)   f(Vmet, Vimp)   f(Vimp)     P(Vimp, Vmet, Obj)
write            604             27           39230        3.68E-08
publish          151             19           12697        2.00E-08
buy              125             45           25827        1.93E-08
sell              59             62           21088        1.54E-08
close             52             37           12983        1.31E-08
open              62             51           22345        1.25E-08
use               69            129          110838        7.11E-09
read             615              2           22702        4.80E-09

This study demonstrates a complex innovative research design for a highly interesting research question. It is difficult to place in terms of inductive or deductive and in terms of what exactly the variables are. This is because it does not use corpus data primarily as a way of testing a hypothesis or answering a research question, but as a model of how humans (or other language-learning systems) may acquire the correct interpretations of logical metonymies. This model itself may then be tested, for example (as Lapata and Lascarides do), by comparing its results with experimentally collected native-speaker judgments. Here, the hypothesis would be that the model correctly predicts the majority of interpretations, and the variable would be PREDICTION OF THE MODEL with the values CORRECT and INCORRECT.

Further Reading

Deignan 2005 is a comprehensive attempt to apply corpus-linguistic methods to a range of theoretically informed research questions concerning metaphor. The contributions in Stefanowitsch & Gries (2006) demonstrate a range of methodological approaches by many leading researchers applying corpus methods to the investigation of metaphor.

Deignan, Alice. 2005. Metaphor and corpus linguistics. Amsterdam & Philadelphia: John Benjamins.

Stefanowitsch, Anatol & Stefan Thomas Gries (eds.). 2006. Corpus-based approaches to metaphor and metonymy. Berlin & New York: Mouton de Gruyter.
Epilogue

In this book, I have focused on corpus linguistics as a methodology, more precisely, as an application of a general observational scientific procedure to large samples of linguistic usage. I have refrained from placing this method in a particular theoretical framework for two reasons.

The first reason is that I'm not convinced that linguistics should be focusing quite as much on theoretical frameworks, but rather on linguistic description based on data. Edward Sapir famously said that "unfortunately, or luckily, no language is tyrannically consistent. All grammars leak" (1921: 39). This is all the more true of formal models that attempt to achieve tyrannical consistency by pretending those leaks do not exist or, if they do exist, are someone else's problem. To me, and to many others whose studies I discussed in this book, the ways grammars leak are simply more interesting than the formalisms that help us ignore this.
The second reason is that I believe that corpus linguistics has a place in any theoretical linguistic framework, as long as that framework has some commitment to modeling linguistic reality. Obviously, the precise place, or rather, the distance from the data analyzed using this method and the consequences of this analysis for the model depend on the type of linguistic reality that is being modeled. If it is language use, as, for example, in historically or sociolinguistically oriented studies, the distance is relatively short, requiring the researcher to discover the systematicity behind the usage patterns observed in the data. If it is the mental representation of language, the length of the distance depends on your assumptions about those representations.

Traditionally, those representations have been argued to be something fundamentally different from linguistic usage: an ephemeral competence based on a universal grammar that may be a "mental organ" (Chomsky 1980) or an evolved biological "instinct" (Pinker 1994), but that is dependent on and responsible for linguistic usage only in the most indirect ways imaginable. As I have argued in Chapters 1 and 2, even those frameworks have no alternative to corpus data that does not suffer from the same drawbacks, without offering any of the advantages.
However, more recent models do not draw as strict a line between usage and mental representations. The Usage-Based Model (Langacker 1990) is a model of linguistic knowledge based on the assumption that speakers initially learn language as a set of unanalyzed chunks of various sizes (established units), from which they derive linguistic representations of varying degrees of abstractness and complexity based on formal and semantic correspondences across these units (cf. Langacker 1991: 266f). Hopper's Emergent Grammar is based on similar assumptions but is skeptical even of abstractness, viewing language, instead, as built up out of combinations of prefabricated parts. Language is, in other words, "to be viewed as a kind of pastiche, pasted together in an improvised way out of ready-made elements" (Hopper 1987: 144).

In these models, the corpus becomes more than just a research tool; it becomes part of a model of linguistic competence itself (see further Stefanowitsch 2011). In fact, in the most radical version, Hoey's (2005) notion of lexical priming, the corpus essentially is the model of linguistic competence:

The notion of priming as here outlined assumes that the mind has a mental concordance of every word it has encountered, a concordance that has been richly glossed for social, physical, discoursal, generic and interpersonal context. This mental concordance is accessible and can be processed in much the same way that a computer concordance is, so that all kinds of patterns, including collocational patterns, are available for use. It simultaneously serves as a part, at least, of our knowledge base. (Hoey 2005: 11)
Obviously, this mental concordance would not correspond exactly to any concordance derived from an actual linguistic corpus: first, because, as discussed in Chapters 1 and 2, no linguistic corpus captures the linguistic experience of a given individual speaker or the average speaker in a speech community; second, because the concordance that Hoey envisions is not a concordance of linguistic forms, but of contextualized linguistic signs, i.e., it contains all the semantic and pragmatic information that corpus linguists have to reconstruct laboriously in their analyses. Still, an appropriately annotated concordance from a balanced corpus would be a reasonable operationalization of this mental concordance (cf. also Taylor 2012).

In less radical usage-based models of language, such as Langacker's, the corpus is not a model of linguistic competence, which is seen as a consequence of linguistic input perceived and organized by human minds with a particular structure (for example, the capacity for figure-ground categorization). It is, however, a reasonable model (or at least an operationalization) of this input. Many of the properties of language that guide the storage of units and the abstraction of schemas over these stored units can be derived from corpora (frequencies, associations between units of linguistic structure, distributions of these units across grammatical and textual contexts, the internal variability of these units, etc.; cf. Stefanowitsch and Flach 2016 for discussion).
This view is explicitly taken in language acquisition research conducted within the Usage-Based Model (e.g. Tomasello 2003, Dabrowska 2001, Diessel 2004), where children's expanding grammatical abilities (as reflected in acquisition corpora) are investigated against the input that they get from their caretakers. It is not always shared by the major theoretical proponents of the Usage-Based Model, who connect the notion of usage to the notion of linguistic corpora only in theory. However, it is a view that offers a tremendous potential to bring together two broad strands of research: cognitive-functional linguistics (including some versions of construction grammar) and corpus linguistics (including attempts to build theoretical models on corpus data, such as Pattern Grammar [Hunston and Francis 2000] and Lexical Priming [Hoey 2005]). These strands have developed more or less independently and their proponents are sometimes mildly hostile toward each other over minor, but fundamental differences in perspective (see McEnery and Hardie 2012, Section 8.3 for discussion), but they could complement each other in many ways, cognitive linguistics providing a more explicitly psychological framework than most corpus linguists adopt, and corpus linguistics providing a methodology that cognitive linguists serious about usage urgently need.


Finally, in such usage-based models, as in models of language in general, corpora can also be seen as models (or operationalizations) of the typical linguistic output of the members of a speech community, i.e. the language produced based on their internalized linguistic knowledge. This is the least controversial view, and the one that I have essentially adopted throughout this book. Even under this view, corpus data remain one of the best sources of linguistic data we have, one that can only keep growing, providing us with ever deeper insights into the leaky, intricate, ever-changing signature activity of our species.

I hope this book has inspired you and I hope it enables you to produce research that inspires all of us.
Appendix
A: Resources
A.1 Corpora and Text Archives
ANC American National Corpus. An ongoing project to build a large corpus of written and spoken American English from the 1990s onward. A 15-million-word slice is available for free download.
http://www.anc.org/

BNC British National Corpus. A 100-million-word corpus of written and spoken British English from the 1990s. Available for free download after registration.
http://www.natcorp.ox.ac.uk/

BNCb British National Corpus Baby. A four-million-word subset of the BNC. Available for free download after registration.
http://www.natcorp.ox.ac.uk/
R

BROWN The Brown Standard Corpus of Present-Day Edited American English. A one-million-word corpus of written American English from 1961. Available for free download as part of the Natural Language Toolkit and commercially on CD ROM from ICAME.
http://www.nltk.org/nltk_data/

CHILDES A collection of language acquisition corpora of English and other languages. Available for free download.
http://childes.psy.cmu.edu/

CLMET The Corpus of Late Modern English Texts. A corpus of written British English from 1710 to 1920, drawn from Project Gutenberg and other online archives. Available for free download after registration.
https://perswww.kuleuven.be/~u0044428/clmet3_0.htm

COCA Corpus of Contemporary American English. A 520-million-word corpus of written and spoken American English from 1990 onwards. Searchable online after free registration. A smaller, heavily redacted version is available commercially.
http://corpus.byu.edu/coca/

COLT The Bergen Corpus of London Teenage Usage. Available commercially on CD ROM from ICAME.

FLOB The Freiburg LOB Corpus. A corpus of written British English from 1991, constructed in parallel to the BROWN corpus. Available commercially on CD ROM from ICAME.

FROWN The Freiburg BROWN Corpus. A corpus of written American English from 1991, constructed in parallel to the BROWN corpus. Available commercially on CD ROM from ICAME.

GloWbE Corpus of Global Web-based English. A 1.9-billion-word corpus of internet texts. Searchable online after free registration. A smaller, heavily redacted version is available commercially.
http://corpus.byu.edu/glowbe/
ICAME A collection of early written corpora. Available commercially on CD ROM from ICAME.
http://clu.uni.no/icame/newcd.htm

ICE-GB The International Corpus of English, British Component. A 1-million-word corpus of spoken and written British English. Available commercially on CD ROM.
http://www.ucl.ac.uk/english-usage/projects/ice.htm
A sample (used in this book) is available for free download.
http://www.ucl.ac.uk/english-usage/projects/ice-gb/sampler/
Note: This corpus is part of a larger effort to collect comparable 1-million-word corpora for different first- and second-language varieties, some available commercially, some for a small handling fee, and some for free after registration. Unfortunately, the project no longer has a central web page, so you will have to perform web searches to find the current web pages for each corpus.

LOB The Lancaster-Oslo/Bergen Corpus. A corpus of written British English from 1961, constructed in parallel to the BROWN corpus. Available for free download from the OTA.
http://ota.ox.ac.uk/desc/0167

OTA The Oxford Text Archive. A vast collection of freely available corpora, including the BNC and the LOB corpus (in some cases, free registration is required).
http://ota.ox.ac.uk

Penn The Penn Treebank. A syntactically parsed corpus. Commercially available on CD ROM. A 10 percent sample is available for free download as part of the NLTK.
http://www.nltk.org/nltk_data/

PG Project Gutenberg. A growing online collection of out-of-copyright books in many languages that has served as a basis for ad-hoc corpora as well as carefully constructed and annotated corpora like the CLMET.
https://www.gutenberg.org
Note: In this book, three texts from this collection are used at various points:
Fish Populations, Following a Drought, in the Neosho and Marais des Cygnes Rivers of Kansas
Pride and Prejudice (Jane Austen)
Starhunter (Andre Alice Norton)

SBCAE Santa Barbara Corpus of American English. 250,000 words of transcribed conversations in American English. Available for free download from the Department of Linguistics at UC Santa Barbara or from TALKBANK.
http://www.linguistics.ucsb.edu/research/santa-barbara-corpus
SUSANNE The SUSANNE corpus. A 130,000-word subset of the BROWN corpus with detailed annotation of grammatical structure. Available for free download from Geoffrey Sampson's web site.
http://www.grsampson.net/resources.html

TALKBANK A collection of freely available spoken corpora of various sizes, text types and languages.
https://talkbank.org/

A.2 Dictionaries

CALD Cambridge Advanced Learner's Dictionary & Thesaurus
Online: http://dictionary.cambridge.org/

LDCE Longman Dictionary of Contemporary English
Online: http://www.ldoceonline.com/

MW Merriam-Webster
Online: http://www.merriam-webster.com/

OALD Oxford Advanced Learner's Dictionary
Online: http://oald8.oxfordlearnersdictionaries.com/

WN WordNet
Online: http://wordnet.princeton.edu/
A.3 Software

Concordancing

TextSTAT A simple concordancer for Windows, Linux or MacOS X with a graphical user interface. Best for working with small corpora up to 1 million words. Does concordances and frequency lists, allows regular expressions in queries. Uses a free software licence. Recommended for beginners, or for working with small, ad-hoc corpora.
http://neon.niederlandistik.fu-berlin.de/en/textstat/
AntConc A sophisticated concordancer for Windows, Linux or MacOS X with a graphical user interface. Best for working with small to medium-size corpora. Does concordances and frequency lists, has a range of advanced functions like the identification of collocates and word clusters. Available for free but uses a proprietary licence. Recommended for beginners and for users with a more applied interest in corpora, such as language teachers.
http://www.laurenceanthony.net/software/antconc/

CWB The Corpus Work Bench. A sophisticated concordancer with a command-line interface, running in any Unix-type environment, including Linux, MacOS X or, with appropriate virtualization, Windows. Can be installed locally or on a server. Requires corpora to be transformed into a standard format and indexed by the program. Installation and corpus preparation is fairly complex, but once it is done, the program is extremely powerful and flexible. Recommended for professional users working with large corpora.
http://cwb.sourceforge.net/
Spreadsheets, statistics, programming

Calc A spreadsheet application that is part of the LibreOffice/OpenOffice software suite for Windows, Linux or Mac OS X, available for free download under Free software licences. Useful for storing and annotating data, and for simple statistics.
LibreOffice: www.libreoffice.org/
OpenOffice: https://www.openoffice.org/

EtherCalc If you want to annotate data collaboratively, it may be useful to store them online in a way that allows simultaneous editing by several people. Google Docs is one of several (free-of-charge) commercial services offering collaborative spreadsheets, but if you prefer a non-commercial hoster or if you want to host your spreadsheet yourself, EtherCalc is one such hosting service and provider of downloadable Free open-source software.
https://ethercalc.net/
R A powerful command-line statistical programming environment for Windows, Linux or MacOS X. Available for free download under a Free software licence. Has an initially steep learning curve, but is a must for any corpus linguist serious about using statistics. A version with a more comfortable user interface is available under the name R Studio (recommended especially to Windows users).
R: https://www.r-project.org/
R Studio: https://www.rstudio.com/
Note: R can be augmented by installing libraries for advanced functions. The following libraries are useful for advanced procedures mentioned at various points in this book:
The cfa package for Configural Frequency Analysis
The irr package for calculating inter-rater agreement
The ngramr package for accessing the Google Ngram data

Perl/Python Although the concordancers listed above, especially CWB, are likely to suffice for most queries, and R and LibreOffice will serve your needs for storing and quantitatively analyzing data, there will come a time for any serious corpus linguist when programming skills are needed to search, transform or annotate data, prepare corpora, etc. I suggest that you start learning a scripting language like Perl or Python early on in your career. These languages are installed by default on MacOS X and Linux, and they are freely available for Windows.
Perl/Python for Windows: http://www.activestate.com/
B: Statistical techniques

B.1: Cohen's Kappa (κ)

The agreement between the two raters can then be expressed in terms of a measure referred to as Cohen's Kappa (κ). This measure can be calculated in six steps.

Step 1: Determine for all coding categories the number of cases where the raters have assigned the same category and the number of cases where they have assigned different categories. Record the results in a table like that shown in Table 1 (this table assumes that there are two categories, but it should be obvious how to generalize this to larger tables).

Table 1. Contingency Table for Two Raters and Two Categories

                                     RATER 1
                      CATEGORY X                   CATEGORY Y                   Total
RATER 2   CAT. X      C11: No. of items            C12: No. of items
                      categorized as X by          categorized as Y by R1
                      both raters                  and X by R2
          CAT. Y      C21: No. of items            C22: No. of items
                      categorized as X by R1       categorized as Y by
                      and Y by R2                  both raters
          Total                                                                 N

Step 2: Calculate the expected frequencies from the observed frequencies, following the standard procedure for each cell of multiplying the column total and the row total and dividing the product by the table total N.
Step 3: Calculate the sum of observed agreements (AObs) by adding up the observed frequencies of all cells that represent intersections of the same category for both raters. No matter what size the table, these will always be the cells in the diagonal from the top left cell to the bottom right cell of the table; if you number them according to the standard scheme, they are all cells whose row and column index are identical (C11, C22, C33, C44, etc.).

Step 4: Calculate the sum of expected agreements (AExp) by repeating the procedure for the expected frequencies.

Step 5: You now have all the necessary information for calculating Cohen's κ:

κ = (AObs - AExp) / (N - AExp)
Step 6: Interpret and draw the appropriate conclusions. Generally, κ ≥ 0.7 is considered to be an indication that inter-rater reliability is satisfactory, while κ < 0.7 is taken to show that inter-rater reliability is not satisfactory. Obviously, 0.7 cannot be taken as an absolute level; what you consider to be satisfactory will depend on your research project, and there are many situations where a higher reliability is necessary. If you are satisfied with your inter-rater reliability, you can go on to the next step of your research project (presumably the statistical evaluation of your data). If you are not satisfied, you must try to determine whether there is a problem with the coding scheme or the way it was applied. Fix this problem, and then repeat the entire process.
C: Statistical Tables
e statistical tables in this section are provided for readers who want to practice
manually computing the tests introduced in this book. For serious research
projects, it is highly recommendable to use a statistical soware package such as
R (see Appendix A.3 above), which will output the p-values directly, making such
tables superuous.

C.1 Critical values for the chi-square test


1. Find the appropriate row for the degrees of freedom of your data.
2. Find the rightmost column listing a value smaller than the chi-square
value you have calculated. At the top, it will tell you the corresponding
probability of error.
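For one and two degrees of freedom, the p-values behind this table can also be computed directly, since the chi-square survival function has a simple closed form in these two cases (a stdlib-only Python sketch; for other degrees of freedom you would use R's pchisq() or a comparable function):

```python
from math import erfc, exp, sqrt

def chisq_p_df1(x):
    """P(X > x) for chi-square with 1 df: a chi-square variable with
    one degree of freedom is a squared standard normal variable."""
    return erfc(sqrt(x / 2))

def chisq_p_df2(x):
    """P(X > x) for chi-square with 2 df (exponential with mean 2)."""
    return exp(-x / 2)

# Reproduce two critical values from Table C-1:
print(round(chisq_p_df1(3.8415), 4))  # 0.05 (df = 1, the * level)
print(round(chisq_p_df2(9.2103), 4))  # 0.01 (df = 2, the ** level)
```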

Table C-1. Critical values for the chi-square test


SIGNIFICANCE LEVELS (PROBABILITIES OF ERROR)
0.25 0.1 0.09 0.08 0.07 0.06 0.05 (*) 0.04 0.03 0.02 0.01 (**) 0.001 (***)
df not sig. marginally significant sig. very sig. highly sig.
1 1.3233 2.7055 2.8744 3.0649 3.2830 3.5374 3.8415 4.2179 4.7093 5.4119 6.6349 10.8276
2 2.7726 4.6052 4.8159 5.0515 5.3185 5.6268 5.9915 6.4378 7.0131 7.8240 9.2103 13.8155

3 4.1083 6.2514 6.4915 6.7587 7.0603 7.4069 7.8147 8.3112 8.9473 9.8374 11.3449 16.2662
4 5.3853 7.7794 8.0434 8.3365 8.6664 9.0444 9.4877 10.0255 10.7119 11.6678 13.2767 18.4668
5 6.6257 9.2364 9.5211 9.8366 10.1910 10.5962 11.0705 11.6443 12.3746 13.3882 15.0863 20.5150
6 7.8408 10.6446 10.9479 11.2835 11.6599 12.0896 12.5916 13.1978 13.9676 15.0332 16.8119 22.4577
7 9.0371 12.0170 12.3372 12.6912 13.0877 13.5397 14.0671 14.7030 15.5091 16.6224 18.4753 24.3219
8 10.2189 13.3616 13.6975 14.0684 14.4836 14.9563 15.5073 16.1708 17.0105 18.1682 20.0902 26.1245
9 11.3888 14.6837 15.0342 15.4211 15.8537 16.3459 16.9190 17.6083 18.4796 19.6790 21.6660 27.8772
10 12.5489 15.9872 16.3516 16.7535 17.2026 17.7131 18.3070 19.0207 19.9219 21.1608 23.2093 29.5883
11 13.7007 17.2750 17.6526 18.0687 18.5334 19.0614 19.6751 20.4120 21.3416 22.6179 24.7250 31.2641
12 14.8454 18.5493 18.9395 19.3692 19.8488 20.3934 21.0261 21.7851 22.7418 24.0540 26.2170 32.9095
13 15.9839 19.8119 20.2140 20.6568 21.1507 21.7113 22.3620 23.1423 24.1249 25.4715 27.6882 34.5282
14 17.1169 21.0641 21.4778 21.9331 22.4408 23.0166 23.6848 24.4855 25.4931 26.8728 29.1412 36.1233
15 18.2451 22.3071 22.7319 23.1993 23.7202 24.3108 24.9958 25.8162 26.8479 28.2595 30.5779 37.6973

16 19.3689 23.5418 23.9774 24.4564 24.9901 25.5950 26.2962 27.1356 28.1907 29.6332 31.9999 39.2524
17 20.4887 24.7690 25.2150 25.7053 26.2514 26.8701 27.5871 28.4450 29.5227 30.9950 33.4087 40.7902
18 21.6049 25.9894 26.4455 26.9467 27.5049 28.1370 28.8693 29.7451 30.8447 32.3462 34.8053 42.3124
19 22.7178 27.2036 27.6694 28.1814 28.7512 29.3964 30.1435 31.0367 32.1577 33.6874 36.1909 43.8202
20 23.8277 28.4120 28.8874 29.4097 29.9910 30.6489 31.4104 32.3206 33.4624 35.0196 37.5662 45.3147

Note that such tables typically have only three columns, one each for the standard significance
levels 0.05, 0.01, and 0.001. However, it should be kept in mind that this is a relatively arbitrary
convention (as two psychologists put it: "Surely, God loves the .06 nearly as much as the .05"
(Rosnow and Rosenthal 1989: 1277)). Clearly, what probability of error one is willing to accept for
any given study depends very much on the nature of the study, the nature of the research design,
and a general disposition to take or avoid risk. Thus, significance levels between 0.05 and 0.1 (or
even higher) are often referred to as "marginally significant", reflecting a willingness on the part of
the researcher to live with a six, seven, or even ten percent chance of being mistaken in assuming a
non-random distribution. Our table lists levels up to ten percent, and it lists the values in between
five percent and one percent, in order to allow a more fine-grained assessment of the chi-square
values.


C.2 Chi-Square Table for Multiple Tests with One Degree of Freedom

1. Find the appropriate row for the number of tests you have performed (e.g., the
number of cells in your table if you are checking individual chi-square
components post hoc).
2. Find the rightmost column listing a value smaller than the chi-square
value you have calculated. At the top, it will tell you the corresponding
probability of error.
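The entries of Table C-2 appear to be Bonferroni-corrected critical values: for k tests at an overall level α, each individual test is held to the level α/k, so the entry for k tests equals the single-test critical value at α/k (for instance, the value for 10 tests at 0.05 equals the single-test value at 0.005 in Table C-1). Under that reading, the table can be checked with the closed-form survival function for one degree of freedom (stdlib-only Python):

```python
from math import erfc, sqrt

def chisq_p_df1(x):
    """P(X > x) for chi-square with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

# Table C-2 entry for 10 tests at the 0.05 level:
print(round(chisq_p_df1(7.8794), 4))   # 0.005, i.e. 0.05 / 10
# Table C-2 entry for 2 tests at the 0.05 level:
print(round(chisq_p_df1(5.0239), 4))   # 0.025, i.e. 0.05 / 2
```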

Table C-2. Critical values for multiple chi-square tests at one degree of freedom.
No. of 0.1 0.05 0.01 0.001 No. of 0.1 0.05 0.01 0.001
Tests Tests
1 2.7055 3.8415 6.6349 10.8276
v0 36 8.9480 10.2205 13.2146 17.5641
2 3.8415 5.0239 7.8794 12.1157 37 8.9980 10.2710 13.2660 17.6162
3 4.5286 5.7311 8.6154 12.8731 38 9.0468 10.3202 13.3160 17.6670
4 5.0239 6.2385 9.1406 13.4121 39 9.0943 10.3682 13.3647 17.7164
5 5.4119 6.6349 9.5495 13.8311 40 9.1406 10.4149 13.4121 17.7645
6 5.7311 6.9604 9.8846 14.1739 41 9.1858 10.4605 13.4585 17.8115
7 6.0025 7.2367 10.1685 14.4641 42 9.2299 10.5051 13.5037 17.8574
8 6.2385 7.4768 10.4149 14.7157 43 9.2730 10.5486 13.5478 17.9022
T
9 6.4475 7.6891 10.6326 14.9379 44 9.3151 10.5911 13.5910 17.9459
10 6.6349 7.8794 10.8276 15.1367 45 9.3563 10.6326 13.6332 17.9887
11 6.8049 8.0520 11.0041 15.3167 46 9.3966 10.6733 13.6745 18.0305

12 6.9604 8.2097 11.1655 15.4811 47 9.4360 10.7130 13.7148 18.0715


13 7.1037 8.3551 11.3140 15.6325 48 9.4746 10.7520 13.7544 18.1116
14 7.2367 8.4898 11.4517 15.7726 49 9.5125 10.7902 13.7931 18.1508
15 7.3607 8.6154 11.5800 15.9032 50 9.5495 10.8276 13.8311 18.1893
16 7.4768 8.7330 11.7000 16.0254 51 9.5859 10.8642 13.8683 18.2270
17 7.5860 8.8436 11.8128 16.1402 52 9.6215 10.9002 13.9048 18.2640

18 7.6891 8.9480 11.9193 16.2484 53 9.6565 10.9355 13.9406 18.3003


19 7.7867 9.0468 12.0200 16.3509 54 9.6909 10.9701 13.9757 18.3359
20 7.8794 9.1406 12.1157 16.4481 55 9.7246 11.0041 14.0102 18.3709
21 7.9677 9.2299 12.2067 16.5406 56 9.7577 11.0375 14.0441 18.4052

22 8.0520 9.3151 12.2935 16.6288 57 9.7903 11.0703 14.0774 18.4389


23 8.1325 9.3966 12.3765 16.7132 58 9.8222 11.1026 14.1101 18.4721
24 8.2097 9.4746 12.4559 16.7939 59 9.8537 11.1343 14.1423 18.5046
25 8.2838 9.5495 12.5322 16.8714 60 9.8846 11.1655 14.1739 18.5367
26 8.3551 9.6215 12.6055 16.9458 61 9.9150 11.1962 14.2050 18.5682
27 8.4237 9.6909 12.6760 17.0175 62 9.9450 11.2263 14.2356 18.5992
28 8.4898 9.7577 12.7441 17.0866 63 9.9744 11.2560 14.2657 18.6297
29 8.5537 9.8222 12.8097 17.1532 64 10.0034 11.2853 14.2954 18.6597
30 8.6154 9.8846 12.8731 17.2176 65 10.0320 11.3140 14.3245 18.6893
31 8.6751 9.9450 12.9345 17.2799 66 10.0601 11.3424 14.3533 18.7184
32 8.7330 10.0034 12.9939 17.3402 67 10.0878 11.3703 14.3816 18.7471
33 8.7891 10.0601 13.0516 17.3987 68 10.1151 11.3978 14.4095 18.7753
34 8.8436 10.1151 13.1075 17.4555 69 10.1420 11.4250 14.4370 18.8032
35 8.8965 10.1685 13.1618 17.5106 70 10.1685 11.4517 14.4641 18.8306


C.3 Critical values for the U-test


There are three tables, one for p < 0.05, one for p < 0.01 and one for p < 0.001 (all
for two-tailed tests).
1. Starting with the first table, perform the following steps:
1. Find the smaller one of your sample sizes in the rows labelled m and
the larger of your sample sizes in the columns labelled n.
2. Find the cell at the intersection of the row and the column.
3. If your U value is smaller than or equal to the value in this cell, your
result is significant at the level given above the table.
2. Repeat the three steps with the second table. If your U value is larger
than the value in the appropriate cell, stop and report a significance level
of 0.05. If it is smaller or equal to the value in the appropriate cell, go to 3.
3. Repeat the three steps with the third table. If your U value is larger than
the value in the appropriate cell, report a significance level of 0.01. If it is
smaller or equal to the value in the appropriate cell, report a significance
level of 0.001.
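The U value that these steps presuppose can be computed directly from two samples (a minimal Python sketch; it assumes the usual convention of counting ties as one half, and the sample values are invented):

```python
def mann_whitney_u(xs, ys):
    """Two-sample Mann-Whitney U: count, over all pairs, how often a
    value from xs exceeds one from ys (ties count 0.5); U is the
    smaller of this count and its mirror image."""
    u1 = sum(1.0 if x > y else 0.5 if x == y else 0.0
             for x in xs for y in ys)
    u2 = len(xs) * len(ys) - u1
    return min(u1, u2)

# Invented samples (e.g. word lengths in two text types):
a = [3, 4, 2, 6, 2, 5]
b = [9, 7, 5, 10, 6, 8]
print(mann_whitney_u(a, b))  # 2.0, to be compared at m = 6, n = 6
```

For these samples U = 2.0, which is smaller than the critical value 5 at m = 6, n = 6 in the first table and equal to the critical value 2 in the second, but larger than 0 in the third, so a significance level of 0.01 would be reported.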
Table C-3. Critical values for the two-sided Mann-Whitney test (p < 0.05)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

1
2 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 3 3
3 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
4 0 1 2 3 4 4 5 6 7 8 9 10 11 11 12 13 14 15 16 17 17 18
5 2 3 5 6 7 8 9 11 12 13 14 15 17 18 19 20 22 23 24 25 27
6 5 6 8 10 11 13 14 16 17 19 21 22 24 25 27 29 30 32 33 35
7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44

8 13 15 17 19 22 24 26 29 31 34 36 38 41 43 45 48 50 53
9 17 20 23 26 28 31 34 37 39 42 45 48 50 53 56 59 62
10 23 26 29 33 36 39 42 45 48 52 55 58 61 64 67 71
11 30 33 37 40 44 47 51 55 58 62 65 69 73 76 80

12 37 41 45 49 53 57 61 65 69 73 77 81 85 89
m 13 45 50 54 59 63 67 72 76 80 85 89 94 98
14 55 59 64 69 74 78 83 88 93 98 102 107
15 64 70 75 80 85 90 96 101 106 111 117
16 75 81 86 92 98 103 109 115 120 126
17 87 93 99 105 111 117 123 129 135
18 99 106 112 119 125 132 138 145
19 113 119 126 133 140 147 154
20 127 134 141 149 156 163
21 142 150 157 165 173
22 158 166 174 182
23 175 183 192
24 192 201
25 211


Table C-4. Critical values for the two-sided Mann-Whitney test (p < 0.01)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2 0 0 0 0 0 0 0
3 0 0 0 1 1 1 2 2 2 2 3 3 3 4 4 4 5
4 0 0 1 1 2 2 3 3 4 5 5 6 6 7 8 8 9 9 10 10
5 0 1 1 2 3 4 5 6 7 7 8 9 10 11 12 13 14 14 15 16 17
6 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 19 21 22 23 24
7 4 6 7 9 10 12 13 15 16 18 19 21 22 24 25 27 29 30 32
8 7 9 11 13 15 17 18 20 22 24 26 28 30 32 34 35 37 39
9 11 13 16 18 20 22 24 27 29 31 33 36 38 40 43 45 47
10 16 18 21 24 26 29 31 34 37 39 42 44 47 50 52 55
11 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63

12 27 31 34 37 41 44 47 51 54 58 61 64 68 71
m 13 34 38 42 45 49 53 57 60 64 68 72 75 79
14 42 46 50 54 58 63 67 71 75 79 83 87
15 51 55 60 64 69 73 78 82 87 91 96
16 60 65 70 74 79 84 89 94 99 104
17 70 75 81 86 91 96 102 107 112
18 81 87 92 98 104 109 115 121
19 93 99 105 111 117 123 129
20 105 112 118 125 131 138
21 118 125 132 139 146
22 133 140 147 155
23 148 155 163
T
24 164 172
25 180

Table C-5. Critical values for the two-sided Mann-Whitney test (p < 0.001)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2
3 0 0 0 0 0
4 0 0 0 1 1 1 2 2 2 3 3 3 3

5 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8
6 0 1 2 2 3 4 5 5 6 7 8 8 9 10 11 12 12 13
7 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19
8 2 4 5 6 7 9 10 11 13 14 15 17 18 20 21 22 24 25

9 5 7 8 10 11 13 15 16 18 20 21 23 25 26 28 30 32
10 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
11 12 15 17 19 21 24 26 28 31 33 35 38 40 42 45
12 17 20 22 25 27 30 33 35 38 41 44 46 49 52
m 13 23 25 28 31 34 37 40 43 46 49 52 56 59
14 29 32 35 39 42 45 49 52 55 59 62 66
15 36 39 43 46 50 54 58 61 65 69 73
16 43 47 51 55 59 63 67 72 76 80
17 51 56 60 65 69 74 78 83 87
18 61 65 70 75 80 85 89 94
19 70 76 81 86 91 96 102
20 81 87 92 98 103 109
21 93 98 104 110 116
22 105 111 117 124
23 118 124 131
24 132 139
25 146


C.4 Critical values for the t-test


These are the critical values for a two-tailed t-test. For a one-tailed t-test, divide
the significance levels by two (i.e., the values in the column for a level of 0.1 in a
two-tailed test correspond to the level of 0.05 in a one-tailed test, etc.).

1. Find the appropriate row for the degrees of freedom of your test.
2. Find the rightmost column whose value is smaller than your t-value. At
the top of the column you will find the p-value you should report.
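As a worked example of how the table is read, the following Python sketch computes an independent two-sample t statistic with pooled variance (the data are invented for illustration):

```python
from math import sqrt

def t_statistic(xs, ys):
    """Independent two-sample t-test with pooled variance.
    Returns the t value and its degrees of freedom."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    ssx = sum((x - mx) ** 2 for x in xs)   # sums of squared deviations
    ssy = sum((y - my) ** 2 for y in ys)
    pooled = (ssx + ssy) / (nx + ny - 2)   # pooled variance estimate
    t = (mx - my) / sqrt(pooled * (1 / nx + 1 / ny))
    return t, nx + ny - 2

t, df = t_statistic([18, 21, 19, 24, 23], [14, 17, 13, 16, 15])
print(round(t, 2), df)  # 4.47 8: exceeds 3.36 (p < .01) but not 5.04 (p < .001)
```

At 8 degrees of freedom, the rightmost column with a value smaller than 4.47 is the .01 column, so p < 0.01 would be reported.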

Table C-6. Critical values for the t-test.

df .10 .05 .01 .001 df .10 .05 .01 .001
1 6.31 12.71 63.66 636.62 35 1.69 2.03 2.72 3.59
2 2.92 4.30 9.22 31.60 40 1.68 2.02 2.70 3.55
3 2.35 3.18 5.84 12.92 45 1.68 2.01 2.69 3.52
4 2.13 2.78 4.60 8.61 50 1.68 2.01 2.68 3.50
5 2.02 2.57 4.03 6.87 55 1.67 2.00 2.67 3.48
6 1.94 2.45 3.71 5.96 60 1.67 2.00 2.66 3.46
7 1.90 2.37 3.50 5.41 65 1.67 2.00 2.65 3.45
8 1.86 2.31 3.36 5.04 70 1.67 1.99 2.65 3.44
9 1.83 2.26 3.25 4.78 75 1.67 1.99 2.64 3.43
10 1.81 2.23 3.17 4.59 80 1.66 1.99 2.64 3.42
11 1.80 2.20 3.11 4.44 85 1.66 1.99 2.64 3.41
12 1.78 2.18 3.06 4.32 90 1.66 1.99 2.63 3.40
13 1.77 2.16 3.01 4.22 95 1.66 1.99 2.63 3.40
14 1.76 2.15 2.98 4.14 100 1.66 1.98 2.63 3.39
15 1.75 2.13 2.95 4.07 200 1.65 1.97 2.60 3.34
16 1.75 2.12 2.92 4.02 500 1.65 1.97 2.59 3.31
17 1.74 2.11 2.90 3.97 1000 1.65 1.96 2.58 3.30
18 1.73 2.10 2.88 3.92 Inf 1.65 1.96 2.58 3.30
19 1.73 2.09 2.86 3.88
20 1.73 2.09 2.85 3.85
21 1.72 2.08 2.83 3.82
22 1.72 2.07 2.82 3.79
23 1.71 2.07 2.81 3.77
24 1.71 2.06 2.80 3.75
25 1.71 2.06 2.79 3.73
26 1.71 2.06 2.78 3.71
27 1.70 2.05 2.77 3.69
28 1.70 2.05 2.76 3.67
29 1.70 2.05 2.76 3.66
30 1.70 2.04 2.75 3.65


D: Data
D1. Possessives Data

Study 1 (Section 6.2)


Table D-1: Sample of s-possessives from BROWN, with part-of-speech of the
modifier (pronouns and nouns only, proper names were discarded)
EXAMPLE POS EXAMPLE (CONTD.) POS

its policy PRON their glasses PRON
his waterfront property PRON their own goals PRON
his offices PRON our major parties PRON
its conference victory last Saturday night PRON their new fiscal year PRON
his wife, Nori PRON its passengers PRON
their big week ends of the year PRON whose stock PRON
their clients' guilt PRON the Department's recommendation NOUN
its budgeted $66,000 in the category PRON our rules PRON
his wife, Sue PRON its members' duties PRON
her professional roles PRON their appropriate participation in the PRON
determination of high policy
T
the government's special ceremonies at NOUN our laboratory PRON
Memorial University honoring distinguished
sons and daughters of the island province
AF

their work PRON its synthesis by systems with intact thyroid cells PRON
in vitro
his research PRON their economic development programs PRON
the year's grist of nearly 15,000 book titles NOUN its environment PRON
their party PRON his distress PRON
their actions PRON her partially exposed breast PRON
R

his staff PRON our national economy PRON


their burden PRON our decision PRON
his advisers in the Politburo (White House) PRON their conclusions and recommendations PRON
its extremely partisan leadership PRON our original questions 1 and 2 PRON

our only obligation for this day PRON the posse's approach NOUN
our anatomy PRON her leer to John Brown PRON
his apologetic best PRON ladies' fashions NOUN
his work PRON your present history PRON
his readers PRON her honor PRON
his stage settings PRON his art PRON
its musical frame PRON the convict's climactic reappearance in London NOUN
his horn PRON industry's main criticism of the Navy's NOUN
antisubmarine effort
a burgomaster's Beethoven NOUN his gaunt height PRON
the world's finest fall coloring NOUN his position PRON
a standard internist's text NOUN her sharp tongue and erce initiative PRON
their head PRON his own salvation PRON
my salvation PRON his work PRON
its program PRON his sister Mary PRON

392
Anatol Stefanowitsch

EXAMPLE POS EXAMPLE (CONTD.) POS


its local boards PRON my heart PRON
its metaphysic from Mahayana PRON her two children PRON
my father PRON its wing PRON
its ripening PRON its groin PRON
your feet PRON my own wife PRON
its operating weight PRON his gentle manners PRON
your outdoor table PRON his weight PRON
our eorts PRON his money PRON
his Classical Symphony PRON his head PRON
his continued hunting and fishing expeditions PRON his shoe PRON
your management climate PRON whose professional interests PRON
his working life PRON his face PRON
its high impact strength PRON her father PRON

mom's apple pie NOUN his mouth PRON
his problems PRON his glass PRON
his mind PRON his things PRON
his on-the-job problems PRON v0 her dish PRON
our health PRON her mother PRON
her hits PRON his helplessness PRON
her captain PRON her cats PRON
their families PRON his jacket pockets PRON
his permanently shortened leg PRON her fear PRON
his own task force of three stragglers PRON his voice PRON
their athletic pursuits PRON your landlord's habits and movements PRON
his instructions PRON her crew PRON
the Square's historic value NOUN their clothes PRON
her testimony PRON my God PRON
their home range PRON his baldness PRON
their relation to issues of the most desperate PRON her brain PRON
urgency for the life of mankind
their work and responsibilities PRON the town marshal's oce NOUN
its periodic bloody engagements with the Senate PRON his bullet-riddled hat PRON
their effectiveness in reaching desired goals PRON his glance at Gyp Carmer PRON

his staff PRON his white teeth PRON


our meeting PRON the pool's edge NOUN
its new way of life PRON my Aunt PRON

their lives PRON his mouth PRON


my writing PRON her blue-black hair PRON
his forearms across his knees PRON his brown face PRON
his arts or culture PRON his due for this dive PRON
his neighbors PRON his head PRON
our society PRON his mother PRON
his thinking PRON man's tongue NOUN
our thoughts PRON whose habitual climate PRON
her conquerors PRON his remaining daughter PRON
his individuality PRON her cold rage PRON
her story of Sara PRON an egotist's rage for fame NOUN
her constant and wonderfully tragic posture PRON his expansiveness PRON
her life to religion PRON her guests PRON
our finding PRON his heart PRON
their efforts PRON their grotesque cry of war PRON


EXAMPLE POS EXAMPLE (CONTD.) POS


his mother's urging NOUN her way to Paris PRON
his foe PRON her life PRON
his attention PRON a women's floor NOUN
their own trees PRON these shores' peculiar powers of stimulation NOUN
their horses PRON his age PRON
her Navy PRON its black, forked tongue PRON
its reputation PRON his sports coat PRON
our peace PRON your cars PRON
one's daily work PRON his power PRON
our impression that these evaluations are PRON her logic PRON
naively uncritical resultants of blissful
ignorance

whose arms were too long PRON his line of vision PRON
his role PRON his back PRON
his reading public PRON our attention PRON
his early poems PRON her Midwestern lineage PRON
their impulses and desires PRON whose only fault PRON
our need PRON the novelist's carping phrase NOUN

Table D-2: Sample of of-possessives from BROWN, with part-of-speech of the


modifier (pronouns and nouns only, proper names were discarded)
T
EXAMPLE POS EXAMPLE (CONTD.) POS
a re-enactment of the strange but successful NOUN the findings of a completely impartial NOUN

honeymoon he had in the 1960 legislative investigator


session
the announcement of a special achievement NOUN the head of the table NOUN
award to William A. (Bill) Shea
the death of her daughter NOUN the exaltations of combat NOUN
the crew of a trawler they found drifting on a NOUN a truly great leader of the Nation NOUN
life raft after they had abandoned a sinking


ship
the manufacturing and distribution center of NOUN dispersement of personnel NOUN
cotton gin machinery and supplies

the announcement last week of the forthcoming NOUN the number of new plants and construction NOUN
encounter projects in Rhode Island
a joint session of Congress NOUN the advancement of all people NOUN
the opening of a new bypass NOUN the magnitude of effort and costs of each of NOUN
these possible phases
the side of right NOUN the existence of Prandtl numbers reaching NOUN
values of more than unity
support of private, state, or municipal ETV NOUN subsection (A) of this section NOUN
efforts
a conference of the California Democratic NOUN the passage of Public Law 236 NOUN
Council directorate
the necessity of interpretation by a Biblical NOUN violation of the provisions of the Universal NOUN
scholar Military Training and Service Act

394
Anatol Stefanowitsch

EXAMPLE POS EXAMPLE (CONTD.) POS


any loss of revenue NOUN for administrative expenses of the Export- NOUN
Import Bank of Washington in India
the lack of consciousness of destiny NOUN servants of government NOUN
these guardians of the nation's security NOUN the outstanding standard bearer of Mr. Brown's NOUN
tradition for accuracy
foreign enemies of peaceful coexistence NOUN the expanding but competitive economy of the NOUN
1960's
a transcript of these press conferences NOUN the feeding of the fluid into the manometer NOUN
his portrayal of an edgy head-in-the-clouds NOUN the influence of mechanical action NOUN
artist
an extraordinary knowledge of Russian NOUN the production of BW weapons systems NOUN
language, history and literature
the works of composers such as Mendelssohn, NOUN the blood group of the donor NOUN

Dvorak, Canteloube, Copland and Britten
the expense of everything else NOUN the growth of senile individuals NOUN
the vision and conscience of the candidate the NOUN two indicators of developmental level NOUN
party chooses to lead it v0
the deep friezes of his architectonic music NOUN the anterior lobe of the pituitary NOUN
a lack of unity of purpose and respect for heroic NOUN hyalinization of afferent glomerular arterioles NOUN
leadership
the golden throne of the merchant NOUN the study of the actions of drugs in this respect NOUN
the third floor of her house which was behind NOUN the space of N times continuously differentiable NOUN
the wall functions
the revival of native ones, some of which had PRON the totality of singular lines NOUN
been lying in slumber for centuries
the danger that an effect of it PRON the image regulus of the pencil NOUN
the use of his particular skills NOUN the place of religion NOUN
the ideals of the country NOUN follow-up visits of the nurse and social worker NOUN
the Mahayana metaphysic of mystical union NOUN the ranking of the various families NOUN
for salvation
the death throes of men who were shot before NOUN the number of registered persons NOUN
the paredon
a result of the multi-purpose resources control NOUN a consequence of the severe condition of NOUN

program of the government perceived threat that persists unabated for the
anxious child in an ambiguous sort of school
environment

the volume of the cylinder opening in the head NOUN the possible forms of nonverbal expression NOUN
gasket
the passengers of the tallyho, from which there NOUN a grammatical description of each form NOUN
issued clouds of smoke
the right of them PRON the design of written languages NOUN
thickness of glaze NOUN the conflicting expectations of East and West NOUN
the price of swimming NOUN the value of the election NOUN
lack of rainfall NOUN the lead of the Russians NOUN
the design and production of engraved seals NOUN terms of contract NOUN
the early stages of shipping fever NOUN the preparation of the questionnaire NOUN
the commercial realities of the market NOUN the first part of the Regulation that is currently NOUN
at issue
the habit of many schools NOUN children of military personnel NOUN
the depths of the fourth dimension NOUN the maintenance of social stratication in the NOUN
schools


EXAMPLE POS EXAMPLE (CONTD.) POS


the amazing variety and power of reactions, NOUN costs of service NOUN
attitudes, and emotions precipitated by the nude
form
the cost of transportation and utilities NOUN possible integration of supposedly disparate NOUN
sociological investigations
the persecution of the Jews NOUN the signicance of Christian Humanism NOUN
the use of birth-control devices NOUN the novelty of such a gathering NOUN
honor of its commander, Colonel Josiah Snelling NOUN the vestige of a picture plane NOUN
true products of our American oral tradition NOUN no traces of former restoring NOUN
the wet end of the cork NOUN ineffective dispersion of stock ownership NOUN
the end of one complete sequence NOUN the soul of the Tsarevich NOUN
the nature of the Protestant development NOUN the symbolic name of an index word or NOUN
electronic switch
the many mistakes of the new group of traders NOUN knowledge of the environment NOUN
certain sections of the country NOUN the rst section of this publication NOUN
the gardens of the square NOUN the potentially useful sources of ionizing NOUN
radiations, gamma sources, cobalt-60, cesium-
137, fission products, or a reactor irradiation
loop system using a material such as an indium
salt
the constitution of his home state of NOUN the value of a for the major portion of the knife NOUN
Massachusetts
T
leaders of the nobility of Lincoln and Lee NOUN the bottom of Kate's steps NOUN
all the details of the pattern NOUN the old members of the church NOUN
the odor of decay NOUN the composer of a successful opera NOUN

the image of man NOUN the shelter of the tent NOUN


the wintry homeland of his fathers NOUN the end of the aernoon NOUN
the spirit of the mad genius from Baker Street NOUN the eyes of the Lord's servants NOUN
popular approval of this questionable attitude NOUN the bow of the nearest skiff NOUN
toward the highest office in the land
the terminus of the change NOUN a knowledge of profound sorrow NOUN
the executive agency of personality NOUN the cuffs of his trousers NOUN
the makers of constitutions NOUN the spirit of a thing NOUN
another source of intellectual stimulus NOUN the precepts of her conditioning NOUN
Ann's own description of the scene NOUN the high ridge of the mountains NOUN

the sound of jazz and the blues NOUN the rigid fixture of her jaws NOUN
the image of the facts NOUN the side of the stall NOUN
this aspect of the literary process NOUN the muscles of his jaws NOUN
the carvings of cavemen NOUN the corner of the car NOUN
the fossilized, formalized, precedent-based NOUN the end of a long, miswritten chapter in the NOUN
thinking of the legendary military brain social life of the community
considerable criticism of its length NOUN the pirouette of his arms NOUN
the extent of ethical robotism NOUN the existence of AZ's useless Wisconsin set-up NOUN
the draft of soldiers NOUN other complications of the hex you've aroused NOUN
the incarnation of the new rational man NOUN the doors of the D train NOUN


Study 2 (Section 6.3)


Table D-3: Sample of s- and of-possessives from BROWN coded for animacy (for
the pronoun it, the antecedent from the original context is given in square
brackets).
N O. EXAMPLE CONSTRUCTION ANIMACY
6-6a1 its [administration] policy S-GENITIVE ORG
6-6a2 her professional roles S-GENITIVE HUM
6-6a3 their burden S-GENITIVE HUM
6-6a4 its [word] musical frame S-GENITIVE CCT
6-6a5 its [sect] metaphysic from Mahayana S-GENITIVE ORG
6-6a6 your management climate S-GENITIVE ORG

6-6a7 their families S-GENITIVE HUM
6-6a8 Lumumba's death S-GENITIVE HUM
6-6a9 his arts or culture S-GENITIVE HUM
6-6a10 her life to religion v0 S-GENITIVE HUM
6-6a11 its [monument] reputation S-GENITIVE CCT
6-6a12 their impulses and desires S-GENITIVE HUM
6-6a13 its [advisory board] members' duties S-GENITIVE ORG
6-6a14 our national economy S-GENITIVE ORG
6-6a15 the convict's climactic reappearance in London S-GENITIVE HUM
6-6a16 its [bird] wing S-GENITIVE ANI
6-6a17 her father S-GENITIVE HUM
6-6a18 his voice S-GENITIVE HUM
6-6a19 her brain S-GENITIVE HUM
6-6a20 his brown face S-GENITIVE HUM

6-6a21 his expansiveness S-GENITIVE HUM


6-6a22 its [snake] black, forked tongue S-GENITIVE ANI
6-6a23 the novelist's carping phrase S-GENITIVE HUM
6-6b1 the invasion of Cuba OF-GENITIVE LOC
6-6b2 a joint session of Congress OF-GENITIVE ORG
6-6b3 foreign enemies of peaceful coexistence OF-GENITIVE EVT

6-6b4 the word of God OF-GENITIVE HUM


6-6b5 the volume of the cylinder opening in the head gasket OF-GENITIVE CCT
6-6b6 the depths of the fourth dimension OF-GENITIVE ABS
6-6b7 the views of George Washington OF-GENITIVE HUM

6-6b8 all the details of the pattern OF-GENITIVE ABS


6-6b9 the makers of constitutions OF-GENITIVE CCT
6-6b10 the extent of ethical robotism OF-GENITIVE ABS
6-6b11 the number of new plants and construction projects in Rhode Island OF-GENITIVE CCT
6-6b12 the expanding but competitive economy of the 1960's OF-GENITIVE TIM
6-6b13 hyalinization of afferent glomerular arterioles OF-GENITIVE HAT
6-6b14 the possible forms of nonverbal expression OF-GENITIVE EVT
6-6b15 the maintenance of social stratification in the schools OF-GENITIVE ABS
6-6b16 knowledge of the environment OF-GENITIVE CCT
6-6b17 the bow of the nearest skiff OF-GENITIVE CCT
6-6b18 the corner of the car OF-GENITIVE CCT


Study 3 (Section 6.4)


Table D-4. Sample of s- and of-possessives from the BROWN corpus coded for
length of the modifier and the head.
N O. EXAMPLE CONSTRUCTION NO. OF WORDS
MOD. HEAD
6-9a1 the government's special ceremonies at Memorial University honoring S-POSSESSIVE 2 14
distinguished sons and daughters of the island province
6-9a2 the year's grist of nearly 15,000 book titles S-POSSESSIVE 2 6
6-9a3 a burgomaster's Beethoven S-POSSESSIVE 2 1
6-9a4 the world's finest fall coloring S-POSSESSIVE 2 3

6-9a5 a standard internist's text S-POSSESSIVE 3 1
6-9a6 mom's apple pie S-POSSESSIVE 1 2
6-9a7 the Square's historic value S-POSSESSIVE 2 2
6-9a8 his mother's urging v0 S-POSSESSIVE 2 1
6-9a9 the Department's recommendation S-POSSESSIVE 2 1
6-9a10 the posse's approach S-POSSESSIVE 2 1
6-9a11 ladies' fashions S-POSSESSIVE 1 1
6-9a12 the convict's climactic reappearance in London S-POSSESSIVE 2 4
6-9a13 industry's main criticism of the Navy's antisubmarine effort S-POSSESSIVE 1 7
6-9a14 the town marshal's office S-POSSESSIVE 3 1
6-9a15 the pool's edge S-POSSESSIVE 2 1
T
6-9a16 man's tongue S-POSSESSIVE 1 1
6-9a17 an egotist's rage for fame S-POSSESSIVE 2 3
6-9a18 a women's oor S-POSSESSIVE 2 1

6-9a19 these shores' peculiar powers of stimulation S-POSSESSIVE 2 4


6-9a20 the novelist's carping phrase S-POSSESSIVE 2 2
6-9b1 the announcement last week of the forthcoming encounter OF-POSSESSIVE 3 4
6-9b2 the necessity of interpretation by a Biblical scholar OF-POSSESSIVE 5 2
6-9b3 his portrayal of an edgy head-in-the-clouds artist OF-POSSESSIVE 4 2
6-9b4 a lack of unity of purpose and respect for heroic leadership OF-POSSESSIVE 8 2

6-9b5 the death throes of men who were shot before the paredon OF-POSSESSIVE 7 3
6-9b6 lack of rainfall OF-POSSESSIVE 1 1
6-9b7 the amazing variety and power of reactions, attitudes, and emotions OF-POSSESSIVE 9 5
precipitated by the nude form

6-9b8 the wet end of the cork OF-POSSESSIVE 2 3


6-9b9 the constitution of his home state of Massachusetts OF-POSSESSIVE 5 2
6-9b10 the spirit of the mad genius from Baker Street OF-POSSESSIVE 6 2
6-9b11 Ann's own description of the scene OF-POSSESSIVE 2 3
6-9b12 considerable criticism of its length OF-POSSESSIVE 2 2
6-9b13 the exaltations of combat OF-POSSESSIVE 1 2
6-9b14 the existence in the nitrogen dissociation region of Prandtl numbers reaching OF-POSSESSIVE 8 2
values of more than unity
6-9b15 the outstanding standard bearer of Mr. Brown's tradition for accuracy OF-POSSESSIVE 5 4
6-9b16 the growth of senile individuals OF-POSSESSIVE 2 2
6-9b17 the totality of singular lines OF-POSSESSIVE 2 2
6-9b18 a consequence of the severe condition of perceived threat that persists unabated OF-POSSESSIVE 20 2
for the anxious child in an ambiguous sort of school environment
6-9b19 the lead in space of the Russians OF-POSSESSIVE 2 2
6-9b20 costs of service OF-POSSESSIVE 1 1


N O. EXAMPLE CONSTRUCTION NO. OF WORDS


6-9b21 ineffective dispersion of stock ownership OF-POSSESSIVE 2 2
6-9b22 the value of a for the major portion of the knife OF-POSSESSIVE 8 2
6-9b23 the eyes of the Lord's servants OF-POSSESSIVE 3 2
6-9b24 the high ridge of the mountains OF-POSSESSIVE 2 3
6-9b25 the pirouette of his arms OF-POSSESSIVE 2 2

D2. True feelings (Case Study 9.2.4.1)


Table D-6. Sample concordance for the query (real|actual|genuine)
(feeling|emotion)s?
(BNC)

1 tyle is only a front for his [real feelings] . It is probably best to
2 imbledon . I think she has a [genuine feeling] for the fans and for a
3 ng with passion . Yes , with [real feeling] . With she searched f
4 night Bob had displayed any [real emotion] . His eyes sparkled . T
5 comes nowhere . There was [real feeling] in this judgement , McLei
6 nfamiliar , and I had had no [real feeling] of kinship . Now , while
7 f the title bloom , brings a [real feeling] of tragic suffering to th
8 avonarola are described with [real feeling] . There is a fascinating
9 st : I do my best to hide my [real feelings] from others I always try
10 ) but with the intrusion of [real feelings] love , obsession , jea
11 even today . So when we have [real emotions] about someone , we lift
12 . His daughter summed up the [real emotions] of herself and her mothe
13 the sadness she felt . When [genuine feelings] are denied and pushe
14 s not theatricality but very [genuine feeling] . It is my way of show
15 the first time and said with [genuine feeling] : I 'm sorry . No il
16 ot be interpreted as lack of [real feeling] or a sign that the end of
17 s avoiding the pain of their [real feelings] . This includes managers
18 of a cousin . Disguising his [real feelings] he wrote cheerfully , te
19 they do not represent their [real emotions] . Indeed , words can be
20 the counsellor must seek the [real feelings] of the counsellee throug
21 are fully discussed and that [real feelings] are expressed rather tha
22 l need , and to discover the [real feelings] and needs of the older i
23 of a community . It was the [actual feeling] that you could walk dow
24 pressionists and there was a [real feeling] of disquiet and apocalyps
25 n writes : But I do have a [real feeling] of joy ! Yes , when you
26 of mystification that denies [real feelings] and experiences is a nec
27 , if it corresponded to some [genuine feeling] both amongst the commo
28 ing is realistic and gives a [real feeling] of being there , and the
29 t be able to verbalize their [real feelings] , or to admit they had h
30 presence which communicates [real emotion] where the programme offer
31 theless appears to have been [genuine emotion] on both sides . Yet th
32 h sooner if he had known her [real feelings] towards him , but she ha
33 ppen . " I do n't recall any [real feelings] of resentment , and I do
34 ds describing a situation of [real emotion] for working people : a ri
35 was quite clear they had no [real feeling] for their mother , and we
36 llington , said Julia with [real feeling] . But I 'm quite all ri
37 . But Tod looked at it with [real feeling] , with the dull heat of-I
38 n neither can say what their [real feelings] are . A true conversatio
39 xtraordinarily difficult for [genuine feelings] of distress or even s
40 ade her see more clearly her [real feelings] for me . Whatever it was
41 y spoke to me . Told me your [real feeling] about oh , about life a
42 vely , and could still evoke [genuine emotions] ; the way a Woolworth
43 sts , she said , with more [real feeling] than she intended . Now
44 sted it But , whatever his [real feelings] , at least he was n't an

399
Table C-2. Critical values for multiple chi-square tests at one degree of freedom.

45 ct with little connection to [real emotion] . Nevertheless , it awoke
46 smother the awakening of her [real feelings] for him ? He 'd been imp
47 rnal view ! Also there was a [genuine feeling] of anticipation when s
48 ce of work which regards the [real feelings] and fears of one who is
49 r honeymoon ? If Ace had any [real feelings] for her he would have ta
50 pening . He was using a very [real emotion] , cold-bloodedly , delibe
51 990 ) it seems unlikely that [actual feelings] of emotional arousal a
52 conditions ; this means that [actual feelings] of risk may not be sol
53 ere are to give the design a [real feeling] of space and movement tha
54 ur main task is to provide a [real feeling] of space and movement as
55 to kill herself . Mariana 's [real emotion] is shown through her repe
56 to blend in but what 's the [real feeling] what 's the thought in yo
57 us ! she had enthused with [real feeling] . It 's the most wonder
58 h , no ! Lindsey stifled a [genuine feeling] of sorrow . I 'm del
59 hordes who came without any [real feeling] , who listened without un
60 lito 's . He sings with such [genuine feeling] . It 's all an act
61 m but his control over his [real feelings] had remained even then .
62 hat she had not betrayed her [real feelings] , Sophie concentrated on
63 member . They 're taking the [real feeling] of Liverpool away , she
64 nsight into the Mr. Darcy 's [real feelings] during particular parts
65 lly but you have n't got the [actual feelings] . Yeah . No I 'm not d
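The query given in the caption of Table D-6 can be read as a regular expression. The following sketch shows how such a pattern could be applied to plain text to produce simple KWIC lines like the ones above. It is purely illustrative: the BNC itself is normally searched with dedicated query tools rather than with Python, and the function name and context width are my own choices.

```python
import re

# The query underlying Table D-6, as a Python regular expression:
# one of real/actual/genuine, followed by feeling(s) or emotion(s).
QUERY = re.compile(r"\b(?:real|actual|genuine) (?:feeling|emotion)s?\b")

def kwic(text, pattern=QUERY, width=30):
    """Return simple KWIC lines: left context, [match], right context."""
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines
```

Note that the optional plural `s?` lets a single pattern retrieve both *real feeling* and *real feelings*, which is why singular and plural hits appear side by side in the concordance.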

D3. Synthetic and analytic comparatives (Case Study 10.2.4.4)


Table D-7. Comparative forms of angry, empty, friendly, lively, risky and sorry that
have an analytic comparative in their immediate preceding context. (BNC)
L20–L8 L7–L1 HIT TYPE WITHIN 7 TOKENS

she should feel relieved or even more depressed and on the latter . He sounded even more angry analytic no
decided after a moment
Quite often a series of progressively unpleasant child becoming more obstinate and more angry analytic yes
interchanges will take place with the the parent
is dismissed as domestic violence . In these of feminism has become more more angry analytic yes
circumstances the voice insistent and
it even worse because you go round and round in circles , you know and the more [unclear] more angry analytic yes
and the
to the more amorphous boundaries of suburban to the countryside in search of a more friendly analytic yes
neighbourhoods , many people have migrated
The colour the what the lay out of the foyer we would look at again to make it more inviting more friendly analytic yes
of them . Seeing how closely Weis was watching him , he an effort to be more warm , more friendly analytic yes
made
are welcome to visit China . Over the next ten years , will become economically more more friendly analytic yes
China liberal , internationally
School , a mixed High School in Ohio where the different , much more relaxed , more friendly analytic yes
atmosphere was totally
be significant differences in the way they perform their be more active than others , some more friendly analytic yes
jobs . Some will
pond and began the same operation along its full length . to be more interesting and was more lively analytic yes
This proved certainly
significant average correlation between risk and recall . junctions are both intrinsically more more risky analytic yes
One possibility is that certain memorable and
seem an ever more clear indication of personal turn makes openness about more risky analytic no
inadequacy which in its problems appear yet
effect and purpose of any given transition . As a system the logic becomes more obscure , more risky analytic yes
gets larger modification

vagrancy in 1714 , 1740 and 1744 , meant that many of lower orders found it more difficult more risky analytic yes
and
. I mistrust a literature that finds suicide more significant and a man 's inability to more sorry analytic no
than death , communicate
in advance that she was more interested in Travis 's Travis himself . And that made her angrier synthetic no
financial status than
you get older you 're supposed to drift into becoming more conservative but I find myself angrier synthetic no
and more getting politically
from being discouraged it stood all the more firmly and his great aims , sounded even emptier synthetic no
unerringly behind
like work . Life down here is conducted at a far more pace of life and the people are friendlier synthetic no
relaxed
responsive to rhythm , more creative , independent and They always go home happier and friendlier synthetic no
sociable . Lucy says
the British people have a far more reserved nature than 'm used to people being a lot friendlier synthetic no
Australians . I
of the several alternatives along this coast ; it is more Bayonne , less pompous than livelier synthetic no
intimate than Biarritz and
present in writing of every kind , is more obtrusive than it in the other tale , is the livelier synthetic no
is
time when to be new seemed more than usually
important ,
as exemplified by the titles of its livelier synthetic no
These imperfections make it all the more important for BIS minimums and set higher riskier synthetic no
regulators to enforce the standards for
with staff . It means more imaginative deployment of preparedness to experiment with riskier synthetic no
staff . It means new and possibly
more damaging to the Government showing a rise of 17 1989 . Indeed these figures made sorrier synthetic no
per cent on even
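The WITHIN 7 TOKENS column of Table D-7 records whether the preceding analytic comparative falls inside a seven-token window before the hit. A minimal sketch of that criterion is given below; the function name and the simple whitespace tokenization are my own, and a full implementation would also verify that *more* actually heads an analytic comparative rather than merely occurring in the window.

```python
def analytic_in_preceding(tokens, hit_index, window=7):
    """Crude check for the WITHIN-7-TOKENS criterion: does the analytic
    comparative marker "more" occur among the `window` tokens that
    immediately precede the hit?"""
    start = max(0, hit_index - window)
    context = [t.lower() for t in tokens[start:hit_index]]
    return "more" in context
```

Applied to a tokenized concordance line, the function returns True for a hit like *angrier* preceded by *more conservative* within seven tokens, and False when no *more* occurs in the window.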

D4. Literal uses of flame and flames (Case Study 13.2.1.1)


Table D-8. 136 random concordance lines for literal uses of flame(s) (BNC)
SINGULAR FLAME
UNDER CONTROL
nds , smiling above the soft yellow [flame] whose elvish reflection danced in
l in charge of my own destiny . The [flame] of the candle on this table is by
nside with a satisfactory whoosh of [flame] . His point had been proved with s
brief space of time I turned up the [flame] . The X-rays that had proved so of
y through a high temperature plasma [flame] ( ca 10 000C ) . AS the powder pa


which he replenished as soon as the [flame] began to dwindle from a bowl on th
is surprisingly easy over a candle [flame] or similar , but be careful not to
She had set down the candle . Its [flame] showed up her hollow cheeks , the
y level , as you raise or lower the [flame] . At a glance you know what 's on
lor Heartbeat gas fire gives living [flame] effect at little more cost than a
s roaring away with a powerful blue [flame] and already the water in the sauce
edroom . In the light of its little [flame] , Dolly saw the face of Sergeant J
m like thinnest threads of stronger [flame] within a dully glowing oven . In v
e strokes on the pump stabilise the [flame] . This is much easier than it soun
pill and lit the cigarette with the [flame] I had obtained from the gas-fire .
eplaced it by a single candle . The [flame] rose untrembling in the still air
bow the candle burned with a steady [flame] , which was more than that perspir
n the stone erupted into satisfying [flame] . I slid a slim vibro-dagger from
t blazed and he stood in its yellow [flame] . " Big hand , friends ! " cried H
Time and again he wafted the candle [flame] and , for a while , the writing be
t . Cupped by a circle of hands the [flame] gathers . The man jerks back as an


ot the fire there was very little [flame] . He flashed a sudden stunning s
anuka that burns with a hot , clean [flame] is best He was interrupted by
e resin column ; ( iv ) Potassium : [flame] photometry ( Corning 405 ) . Tissu
ling to the breeze . A tiny lick of [flame] flickered round the mouth of the
gain at those dim shapes around the [flame] , they did appear to be rigs , vag
would be sufficient to relight the [flame] and so keep their fires burning ,
the points represent the sparks of [flame] . It has the backing of the farmer
mbard standing impassive beside the [flame] , de Guichet at his shoulder . Haz
NOT UNDER CONTROL
ed with Class 1 Surface Spread of [flame] ratings . CONDITIONING In common
ire officers normally do , to put a [flame] to the scenery to ensure it is pro
s he threw his head back and spewed [flame] a couple of yards into the air . O
h a few to blowing the gases into a [flame] . The resulting explosion loosened
& thermostats A Baxi Bermuda living [flame] fire-front , compatible with back
ng his cigarette from a twelve-inch [flame] shooting out of a gas jet , who wa
nt susceptibility . For example , a [flame] detector on a North Sea drilling p
too late do they discover that the [flame] burns and is lethal . But which is
spark from her tinder , a sheet of [flame] could envelop the Genoese . And ad
out of one door but then a sheet of [flame] came down and blocked me , so I ha
seemed to be in a haze or a glowing [flame] . The two men continued walking ,
back in time to avoid the spear of [flame] from the torch , and flung a handf
ething that ignited enough to get a [flame] onto his seat . After that , panic
hair re-grew . He emerged from the [flame] unscathed , transformed by the cle
burned . Unable to pass through the [flame] , he managed to cast himself back
ch exploded in small dirty gouts of [flame] and smoke . The first French skirm
. Major cations were determined by [flame] atomic absorption , and P and Si b
you do n't even need to use a naked [flame] , you just need a hot surface like
he hotel complex burned into autumn [flame] and the air smelt of bonfires that
snaked towards it to snuff out the [flame] . The audience clapped . The black
turned to see the first tongues of [flame] licking up the walls of the cainca
kness underground because the small [flame] of the oil-lamp was n't showing en
sh that was not flesh , dousing the [flame] , runnelling in scarred troughs
m , as far as detection ( and hence [flame] damage ) is concerned . Once detec
been billeted began to blossom with [flame] . Two figures were at its eaves wi
ook out , she cried as tongues of [flame] blow-torched from the crevices aro
pstairs windows , a sudden spurt of [flame] , and a part of the roof begin to
ide , and felt the wild pain of the [flame] on his arm as he dived for safety
ver , disappear in a white sheet of [flame] . He just kept right on kicking Pi
jetties , or the waiting batons of [flame] and black smoke that fenced the na
les an hour . There was a tongue of [flame] and Asa pulled back the column and
windows and roof . The inferno spat [flame] 20 feet into the air . Station Off
uality has always been a priority . [flame] Retardants gained BS5750 Pt 2 accr
re teddy bears by the looks of it . [flame] retardant oh ! They 're , they 're
PLURAL FLAMES
UNDER CONTROL
yrotechnic display of rose and gold [flames] that burnt up the whole western s
s fixed thoughtfully on the dancing [flames] . Roger sat down in the high-back
heat serves physical needs and its [flames] give inspiration for the mind . T
999 ambulance sped past them with [flames] roaring from its back . Drivers f
intest it was a face imagined among [flames] . The story paints an idyllic pic
With practice I learned that those [flames] instantly settled down to a prope
d . The smoke billows and spreads , [flames] crackle . Peasants and shepherds
er mind 's eye rude naked women and [flames] as upright and stiff as those on
sible place lay a deep disquiet . [flames] danced in the glass of each eye ,
owing coals and the cosy flicker of [flames] around them that was , she soon r
buses . Young men danced around the [flames] and slapped onto those cars that


th were huge . They shone with blue [flames] . There were rings of blue fire r
nd a huge log fire , the flickering [flames] casting long shadows against the
h at dawn . She watched the kiln 's [flames] flicker on the walls . Li Lu stok
My master just sat staring into the [flames] of the fire . What is the matte
ch they eventually committed to the [flames] . The government did its best to
s , she said flatly , eyes on the [flames] in the hearth . I think , for u
NOT UNDER CONTROL
A) ATTEMPT TO EXTINGUISH
anger , but he does not put out the [flames] . as if to prove that he is in ea
ound near and began to beat out the [flames] with it . But she soon realised t
ic oxygen supply and extinguish the [flames] . Closed windows and doors help t
up the bucket for him to quench the [flames] . But there were no flames . All
dioxide ? Mm . Erm it smothers the [flames] they ca n't get oxygen if there '
e walls , and throwing water on the [flames] , but the fire was burning more s
was full , then tried to douse the [flames] . One curtain went out , extingui
d him about the pavement to put the [flames] out . Nobody else was hurt , bu
ual struggle to contain the leaping [flames] . Shutting down their pumps , the
fire brigade squirting water on the [flames] . He must be feeling pretty dis
where terrified workers doused the [flames] . Moments before , Mr Chittenden
and I used my shirt to put out the [flames] and Hazel gave him first aid .
ound by Mr Nellis , who put out the [flames] . He could have been much more se
e fire brigade went in to quell the [flames] . Only once firemen had arrived d
B) ATTEMPT TO ESCAPE
ough windows in a bid to escape the [flames] licking around them after an IRA
of them were engulfed in a rush of [flames] before she could throw the baby t
ay . Malekith was caught within the [flames] , his body terribly scarred and b
bring to life would perish in the [flames] , together with all Frankenstein
een Colbert , 55 , was enveloped in [flames] after her dressing gown caught fi
fety as the vehicle was engulfed in [flames] . The incident , which happened w
s attempt to save his wife from the [flames] . She marries him , and in the la
nt happened late Friday afternoon . [flames] set light to his jacket and Mr Wi
C) OTHER
wanner go up in a pile a smoke an' [flames] an' eye shadder an' levver shoes
e in Acton , Richard Baxter saw the [flames] and the huge pall of smoke which
the Hindenburg crashed in a ball of [flames] just a few miles from Nicholson '
now collapsed . Sparks , smoke and [flames] poured into the air , and the hea
, as from my window I could see the [flames] from the burning laundry , less t
ommunicate amongst the glare of the [flames] and through dense smoke ! The two
the spectators finally bursts into [flames] as the commentators announce that
flame it and then shoot it into the [flames] . A convenient hole appears and t
e wolf suddenly disappeared and the [flames] on the lances were extinguished .
ge was double : the exposure to the [flames] had been long enough to cause sev
ver again A ROYAL palace erupted in [flames] early yesterday , exactly a week
nutes earlier , however , and those [flames] would have been funeral pyres s
-holocaust Los Angeles circa 2029 . [flames] belch from the wreckage , degener
smelling the smoke . Or seeing the [flames] . I feel distant , removed and co
artment buildings spewing smoke and [flames] and chaotic scenes of paramedics
ir Montego hit a van and burst into [flames] on the M18 near Doncaster . Colin
the spiral column of smoke and the [flames] that played at its heels . Only t
ce smoke kills far more people than [flames] , and most fire deaths occur at n
soon have burst uncontrollably into [flames] . And if it be thought that Mr Hu
all but drowned in the hiss of the [flames] , sounded fluttery and frail .
ople trapped in the ruble Houses in [flames] all around The loud ringing of Fi
rate the loaded vehicle through the [flames] out to safety . The blue flames w
aughter . Jezrael cried . Faces and [flames] . Heavy breathing in shadows . Tr
temperature it will burst into the [flames] , likewise if it 's not balanced
re of the goods stored is such that [flames] are unlikely to damage them withi


ends with San Fransisco going up in [flames] , then succumbing to a monstrous
only spreads the flames . When the [flames] are out , remove hot clothes but
then it began to roll very slowly . [flames] licked over the straw , which cra
ollow , sickening whoomph . And the [flames] all round the pontoon . And , lat
arsome sermon , abruptly burst into [flames] . I have never seen a church empt
Rincewind turned and stared at the [flames] racing towards them , and wondere
ers were driven out by the heat and [flames] . When part of the roof collapsed
brown at the back , nearest the gas [flames] not exactly burnt , but very cr
d to a vast network of subterranean [flames] , lakes of bitumen and burning co
rn when a twisting frame burst into [flames] in the early hours of the morning
be flames in one of its branches ; [flames] that licked high , then guttered
g picture of an ancient courtyard . [flames] were pouring out of a well . Figu
e of the fat had splashed on to the [flames] and the blue smoke increased in d
re . 150 firemen were drafted in as [flames] swept through and destroyed numbe
en prove to be more lethal than the [flames] themselves . The old-style foam c
eir lives almost ended in a ball of [flames] . So Norman White , Sheila Stroud
to discover what happens within the [flames] Male speaker We 've mounted a who
rt Gallery when a passer-by spotted [flames] coming from the roof . The galler
, flicked a lighter and exploded in [flames] in London 's Parliament Square as
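Table D-8 rests on a random sample of 136 concordance lines drawn from all BNC hits for *flame(s)*. A minimal sketch of such reproducible sampling is given below; the function name and the fixed seed are my own choices for illustration, not the procedure actually used for the table.

```python
import random

def sample_concordance(lines, n=136, seed=42):
    """Draw a reproducible random sample of n concordance lines
    for manual annotation (as in Table D-8)."""
    return random.Random(seed).sample(lines, n)
```

Fixing the seed means the same sample can be redrawn from the same list of hits, which matters for the reproducibility concerns raised in the preface.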
References
Adli, Aria. 2004. Grammatische Variation und Sozialstruktur. De Gruyter.
Altenberg, Bengt. 1980. Binominal NPs in a thematic perspective: Genitive vs.

.1
ofconstructions in 17th century English. In Sven Jacobsen (ed.), Papers from
the Scandinavian Symposium on Syntactic Variation, 149172. (Stockholm
Studies in English 52). Stockholm: Almqvist and Wiksell.
v0
American Psychiatric Association. 2000. Diagnostic and statistical manual of
mental disorders: DSM-IV-TR. 4th ed., text revision. Washington, DC:
American Psychiatric Association.
Atkins, Beryl T.S. 1994. Analyzing the verbs of seeing: a frame semantics
approach to corpus lexicography. In Susanne Gahl, Andy Dolbey &
T
Christopher Johnson (eds.), Proceedings of the Twentieth Annual Meeting of
the Berkeley Linguistics Society: General Session Dedicated to the
AF

Contributions of Charles J. Fillmore, 4256. Berkeley: Berkeley Linguistics


Society.
Baayen, Harald. 2009. 41. Corpus linguistics in morphology: Morphological
productivity. In Anke Ldeling & Merja Kyt (eds.), Corpus linguistics: An
international handbook, vol. 2, 899919. (Handbooks of Linguistics and
R

Communication Science (HSK) 29). Berlin, New York: Mouton de Gruyter.


Baker, Carolyn D. & Peter Freebody. 1989. Childrens rst school books:
D

introductions to the culture of literacy. (e Language Library). Oxford, UK;


Cambridge, Mass., USA: B. Blackwell.
Baker, Paul. 2010a. Sociolinguistics and corpus linguistics. (Edinburgh
Sociolinguistics). Edinburgh: Edinburgh University Press.
Baker, Paul. 2010b. Will Ms ever be as frequent as Mr? A corpus-based
comparison of gendered terms across four diachronic corpora of British
English. Gender and Language 4(1).
Barcelona Snchez, Antonio. 1995. Metaphorical models of romantic love in
Romeo and Juliet. Journal of Pragmatics 24(6). 667688.
Barnbrook, Geo. 1996. Language and computers: a practical introduction to the

405
References

computer analysis of language. (Edinburgh Textbooks in Empirical


Linguistics). Edinburgh: Edinburgh University Press.
Berlage, Eva. 2009. Prepositions and postpositions. In Gnter Rohdenburg & Julia
Schlter (eds.), One Language, Two Grammars?, 130148. Cambridge:
Cambridge University Press.
Berman, Ruth Aronson & Dan Isaac Slobin (eds.). 1994. Relating events in
narrative:: A crosslinguistic developmental study. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Biber, D. 1993. Representativeness in Corpus Design. Literary and Linguistic
Computing 8(4). 243257.

.1
Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus linguistics
investigating language structure and use. Cambridge; New York: Cambridge
University Press. v0
Bondi, Marina & Mike Sco (eds.). 2010. Keyness in texts. (Studies in Corpus
Linguistics v. 41). Amsterdam; Philadelphia: John Benjamins.
Borkin, Ann. 1973. To be and not to be. In Claudia Corum, T. Cedrik Smith-Stark
& Ann Weiser (eds.), Papers from the Ninth Regional Meeting of the Chicago
Linguistics Society, vol. 9, 4456. Chicago: Chicago Linguistic Society.
T
Bush, Nathan. 2001. Frequency eects and word-boundary palatalization in
English. In Joan Bybee & Hopper, Paul J. (eds.), Frequency and the emergence
AF

of linguistic structure, 255280. (Typological Studies in Language 45).


Amsterdam; Philadelphia: John Benjamins.
Bybee, Joan L. 2007. Frequency of use and the organization of language. Oxford;
New York: Oxford University Press.
Bybee, Joan L & Paul J Hopper. 2001. Frequency and the emergence of linguistic
R

structure. (Typological Studies in Language 45). Amsterdam; Philadelphia:


John Benjamins.
D

Bybee, Joan & Joanne Scheibman. 1999. e eect of usage on degrees of


constituency: the reduction of dont in English. Linguistics 37(4). 575596.
Caldas-Coulthard, Carmen Rosa. 1993. From discourse analysis to Critical
Discourse Analysis: e dierential re-presentation of women and men
speaking in wrien news. In John McHardy Sinclair, Michael Hoey &
Gwyneth Fox (eds.), Techniques of Description: Spoken and Wrien
Discourse., 196208. London: Routledge.
Caldas-Coulthard, Carmen Rosa & Rosemary Moon. 2010. Curvy, hunky, kinky:
Using corpora as tools for critical analysis. Discourse & Society 21(2). 99133.
Carlea, Jean. 1996. Assessing agreement on classication tasks: e Kappa
statistic. Computational Linguistics 22(2). 249254.

406
Anatol Stefanowitsch

Charles, Walter G. & George A. Miller. 1989. Contexts of antonymous adjectives.


Applied Psycholinguistics 10(03). 357.
Charteris-Black, J. 2006. Britain as a container: immigration metaphors in the
2005 election campaign. Discourse & Society 17(5). 563581.
Charteris-Black, Jonathan. 2004. Corpus approaches to critical metaphor analysis.
Houndmills, Basingstoke, Hampshire; New York: Palgrave Macmillan.
Charteris-Black, Jonathan. 2005. Politicians and rhetoric: the persuasive power of
metaphor. Houndmills, Basingstoke, Hampshire; New York: Palgrave
Macmillan.
Cheng, Winnie. 2012. Exploring corpus linguistics: language in action. London;

.1
New York, NY: Routledge.
Chen, Ping. 1986. Discourse and particle movement in English. Studies in
Language 10(1). 7995. v0
Chomsky, Noam. 1957. Syntactic structures. e Hague: Mouton.
Chomsky, Noam. 1964. Formal discussion of W. Miller and Susan Ervin, e
Development of Grammar in Child Language. In Ursula Bellugi & Roger
Brown (eds.), e acquisition of language, 3539. (Monographs of the Society
for Research in Child Development 92). Lafayee, IN: Purdue University Press.
T
Chomsky, Noam. 1972. Language and mind. Second edition. New York: Harcourt
Brace Jovanovich.
AF

Chomsky, Noam. 1980. Rules and representations. New York: Columbia


University Press.
Church, Kenneth Ward, William Gale, Patrick Hanks & Donald Hindle. 1991.
Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical acquisition:
Exploiting on-line resources to build a lexicon, 115164. Hillsdale, NJ:
R

Lawrence Erlbaum Associates.


Church, Kenneth Ward & Patrick Hanks. 1990. Word association norms, mutual
D

information, and lexicography. Computational Linguistics 16(1). 2229.


Coates, Jennifer. 1996. Women talk: conversation between women friends. Oxford
[England]; Cambridge, Mass: Blackwell Publishers.
Colleman, Timothy & Bernard De Clerck. 2011. Constructional semantics on the
move: On semantic specialization in the English double object construction.
Cognitive Linguistics 22(1).
Collins, Peter. 2007. Can/could and may/might in British, American and
Australian English: a corpus-based account. World Englishes 26(4). 474491.
Cooper, William E. & John R. Ross. 1975. Word order. In Robin Grossman, L.J. San
& Timothy J. Vance (eds.), Papers from the Parasession on Functionalism, 63
111. Chicago, IL: Chicago Linguistic Society.

407
References

Corley, Martin & Oliver W. Stewart. 2008. Hesitation Disuencies in Spontaneous


Speech: e Meaning of um. Language and Linguistics Compass 2(4). 589602.
Cowart, Wayne. 1997. Experimental syntax: applying objective methods to
sentence judgements. ousand Oaks, CA: Sage Publications.
Culicover, Peter W. 1999. Syntactic nuts: hard cases, syntactic theory, and
language acquisition. (Foundations of Syntax 1). Oxford; New York: Oxford
University Press.
Cutler, Anne, Jacques Mehler, Dennis Norris & Juan Segui. 1986. e syllables
diering role in the segmentation of French and English. Journal of Memory
and Language 25(4). 385400.

.1
Dabrowska, Ewa. 2001. From formula to schema: e acquisition of English
questions. Cognitive Linguistics 11(1-2).
Deane, Paul. 1987. English possessives, topicality, and the Silverstein Hierarchy.
v0
In Jon Aske, Natasha Beery, Laura Michaelis & Hana Filip (eds.), Proceedings
of the Annual Meeting of the Berkeley Linguistics Society: General Session
and Parasession on Grammar and Cognition, vol. 13, 6576. Berkeley:
Berkeley Linguistics Society.
Deignan, Alice. 1999. Corpus-based research into metaphor. In Graham Low &
T
Lynne Cameron (eds.), Researching and applying metaphor, 177199.
Cambridge: Cambridge University Press.
AF

Deignan, Alice. 2005. Metaphor and corpus linguistics. (Converging Evidence in


Language and Communication Research 6). Amsterdam; Philadelphia: John
Benjamins.
Daz-Vera, Javier E. 2012. Infected aances: Metaphors of the word JEALOUSY in
Shakespeares plays. metaphorik.de 22. 2343.
R

Diessel, Holger. 2004. e acquisition of complex sentences. (Cambridge Studies


in Linguistics 105). Cambridge, U.K.; New York: Cambridge University Press.
D

Emons, Rudolf. Corpus linguistics: some basic problems. Studia Anglica


Posnaniensia XXXIL. 6168.
Ericsson, G. & T. A. Heberlein. 2002. Jgare talar naturens sprk (Hunters speak
natures language): A comparison of outdoor activities and aitudes toward
wildlife among Swedish hunters and the general public. Zeitschri r
Jagdwissenscha 48(S1). 301308.
Eye, Alexander Von. 1988. e General Linear Model as a Framework for Models
in Congural Frequency Analysis. Biometrical Journal 30(1). 5967.
Faulhaber, Susen. 2011. Idiosyncrasy in Verb Valency Paerns. Zeitschri r
Anglistik und Amerikanistik 59(4).
Fellbaum, Christiane. 1995. Co-Occurrence and Antonymy. International Journal

408
Anatol Stefanowitsch

of Lexicography 8(4). 281303.


Fellbaum, Christiane. 1998. WordNet: an electronic lexical database. Cambridge,
Mass., [etc.]: e MIT Press.
Fillmore, Charles. 1992. Corpus linguistics or Computer-aided armchair
linguistics. In Jan Svartvik (ed.), Directions in corpus linguistics. Proceedings
of Nobel Symposium 82, Stockholm, 4-8 August 1991, 3560. (Trends in
Linguistics. Studies and Monographs 65). Berlin; New York: Mouton de
Gruyter.
Firth, John Rupert. 1957. Papers in Linguistics 19341951. London: Oxford
University Press.

.1
Francis, Gill, Susan Hunston & Elisabeth Manning. 1996. Collins COBUILD
Grammar Paerns 1: Verbs. London: HarperCollins.
Francis, Gill, Susan Hunston & Elizabeth Manning. 1998. Collins Cobuild
v0
Grammar Paerns 2. Nouns and Adjectives. London: HarperCollins.
Fromkin, Victoria. 1973. Speech errors as linguistic evidence. e Hague: Mouton.
Fromkin, Victoria (ed.). 1980. Errors in linguistic performance: slips of the tongue,
ear, pen, and hand. San Francisco: Academic Press.
Garretson, Gregory. 2004. Coding practices used in the project Optimal Typology of Determiner Phrases. Unpublished manuscript. Boston, MA.
Georgila, Kallirroi, Maria Wolters, Johanna D. Moore & Robert H. Logie. 2010. The MATCH corpus: a corpus of older and younger users' interactions with spoken dialogue systems. Language Resources and Evaluation 44(3). 221–261.
Gesuato, Sara. 2003. The company women and men keep: what collocations can reveal about culture. Proceedings of the 2002 Corpus Linguistics Conference, 253–262. (UCREL Technical Paper 16). Lancaster: UCREL.
Gilquin, Gaëtanelle & Sylvie De Cock. 2011. Errors and disfluencies in spoken corpora: Setting the scene. International Journal of Corpus Linguistics 16(2). 141–172.
Givón, Talmy. 1983. Topic continuity in discourse: a quantitative cross-language study. Amsterdam; Philadelphia: John Benjamins.
Givón, Talmy. 1992. The grammar of referential coherence as mental processing instructions. Linguistics 30(1).
Goldberg, Adele E. 1995. Constructions: a construction grammar approach to argument structure. (Cognitive Theory of Language and Culture). Chicago: University of Chicago Press.
Greenberg, Steven, Hannah Carvey, Leah Hitchcock & Shuangyu Chang. 2003. Temporal properties of spontaneous speech – a syllable-centric perspective. Journal of Phonetics 31(3–4). 465–485.
Greenberg, Steven, Joy Hollenback & Dan Ellis. 1996. Insights into spoken language gleaned from phonetic transcription of the Switchboard corpus. Proceedings of the Fourth International Conference on Spoken Language Processing, vol. 1, 32–35. Philadelphia.
Gries, Stefan Th. 2001. A corpus-linguistic analysis of English -ic vs -ical adjectives. ICAME Journal 25. 65–108.
Gries, Stefan Th. 2003. Testing the sub-test: An analysis of English -ic and -ical adjectives. International Journal of Corpus Linguistics 8(1). 31–61.
Gries, Stefan Th. 2010. Behavioral profiles: A fine-grained and quantitative approach in corpus-based lexical semantics. The Mental Lexicon 5(3). 323–346.
Gries, Stefan Thomas. 2002. Evidence in Linguistics: three approaches to genitives in English. In Ruth M. Brend & William J. Sullivan (eds.), LACUS Forum XXVIII: what constitutes evidence in linguistics?, 17–31. Fullerton, CA: LACUS.
Gries, Stefan Thomas. 2003. Multifactorial analysis in corpus linguistics: a study of particle placement. (Open Linguistics Series). New York: Continuum.
Gries, Stefan Thomas & Naoki Otani. 2010. Behavioral profiles: A corpus-based perspective on synonymy and antonymy. ICAME Journal 34. 121–150.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on 'alternations'. International Journal of Corpus Linguistics 9(1). 97–129.
Guz, Wojciech. 2009. English Affixal Nominalizations Across Language Registers. Poznań Studies in Contemporary Linguistics 45(4).
Halliday, M. A. K. 1961. Categories of the theory of grammar. Word 17. 241–292.
Herrmann, Konrad. 2011. Hardness testing: principles and applications. Materials Park, Ohio: ASM International.
Hilpert, Martin. 2006. Keeping an eye on the data: Metonymies and their patterns. In Anatol Stefanowitsch & Stefan Th. Gries (eds.), Corpus-based approaches to metaphor. (Trends in Linguistics. Studies and Monographs [TiLSM]). Berlin; New York: Mouton de Gruyter.
Hoey, Michael. 2005. Lexical priming: a new theory of words and language.
London; New York: Routledge.
Holmes, Janet. 1995. Women, men, and politeness. (Real Language Series).
London; New York: Longman.
Hsu, Hui-Chin, Alan Fogel & Rebecca B. Cooper. 2000. Infant vocal development during the first 6 months: speech quality and melodic complexity. Infant and Child Development 9(1). 1–16.
Hundt, Marianne. 1997. Has British English been catching up with American English over the past thirty years? In Magnus Ljung (ed.), Corpus-Based Studies in English: Papers from the Seventeenth International Conference on English-Language Research Based on Computerized Corpora, 135–151. Amsterdam: Rodopi.
Hundt, Marianne. 2009. Colonial lag, colonial innovation or simply language change? In Günter Rohdenburg & Julia Schlüter (eds.), One Language, Two Grammars?, 13–37. Cambridge: Cambridge University Press.
Hunston, Susan. 2007. Semantic prosody revisited. International Journal of Corpus Linguistics 12(2). 249–268.
Hunston, Susan & Gill Francis. 2000. Pattern grammar: a corpus-driven approach to the lexical grammar of English. (Studies in Corpus Linguistics v. 4). Amsterdam; Philadelphia: John Benjamins.
Jackendoff, Ray. 1994. Patterns in the mind: language and human nature. New York: BasicBooks.
Jäkel, Olaf. 1997. Metaphern in abstrakten Diskurs-Domänen: eine kognitiv-linguistische Untersuchung anhand der Bereiche Geistestätigkeit, Wirtschaft und Wissenschaft. (Duisburger Arbeiten zur Sprach- und Kulturwissenschaft / Duisburg Papers on Research in Language and Culture 30). Frankfurt am Main; New York: P. Lang.
Jespersen, Otto. 1909. A modern English grammar on historical principles (volumes 1–7). 7 vols. Heidelberg: C. Winter.
Johansson, Stig & Knut Hofland. 1989. Frequency analysis of English vocabulary and grammar: Tag frequencies and word frequencies. Vol. 1. 2 vols. Oxford: Clarendon Press.
Justeson, John S. & Slava M. Katz. 1991. Co-occurrences of antonymous adjectives and their contexts. Computational Linguistics 17(1). 1–19.
Justeson, John S. & Slava M. Katz. 1992. Redefining antonymy: The textual structure of a semantic relation. Literary and Linguistic Computing 7(3). 176–184.
Kaunisto, Mark. 1999. Electric/electrical and classic/classical: Variation between the suffixes -ic and -ical. English Studies 80(4). 343–370.
Kennedy, Graeme. 2003. Amplifier Collocations in the British National Corpus: Implications for English Language Teaching. TESOL Quarterly 37(3). 467.
Kennedy, Graeme D. 1998. An introduction to corpus linguistics. London; New York: Longman.
Kjellmer, Göran. 1986. The lesser man: Observations on the role of women in modern English writing. In Jan Aarts & Willem Meijs (eds.), Corpus linguistics II: New studies in the analysis and exploitation of computer corpora, 163–176. Amsterdam: Rodopi.
Kjellmer, Göran. 2003. Hesitation. In Defence of ER and ERM. English Studies 84(2). 170–198.
Koller, Veronika. 2004. Metaphor and gender in business media discourse: a
critical cognitive study. New York: Palgrave Macmillan.
Labov, William. 1996. When intuitions fail. In Lisa McNair, Kora Singer, Lise Dobrin & Michelle AuCoin (eds.), Papers from the parasession on theory and data in linguistics, vol. 32, 77–106. Chicago, IL: Chicago Linguistic Society.
Labov, William, Sharon Ash & Charles Boberg. 2006. The atlas of North American English: phonetics, phonology, and sound change: a multimedia reference tool. Berlin; New York: Mouton de Gruyter.
Lakoff, George. 1993. The contemporary theory of metaphor. In Andrew Ortony (ed.), Metaphor and thought, 2nd ed. Cambridge: Cambridge University Press.
Lakoff, George. 2004. Re: Empirical methods in Cognitive Linguistics.
Lakoff, George & Mark Johnson. 1980. Metaphors we live by. Chicago: University of Chicago Press.
Lakoff, Robin. 1973. Language and woman's place. Language in Society 2(1). 45–80.
Langacker, Ronald W. 1991. Concept, image, and symbol: the cognitive basis of grammar. (Cognitive Linguistics Research 1). Berlin: Mouton de Gruyter.
Lapata, Maria & Alex Lascarides. 2003. A Probabilistic Account of Logical Metonymy. Computational Linguistics 29(2). 261–315.
Lautsch, Erwin, Gustav A. Lienert & Alexander von Eye. 1988. Strategische Überlegungen zur Anwendung der Konfigurationsfrequenzanalyse. EDV in Medizin und Biologie 19(1). 26–30.
Leech, Geoffrey N. 1999. The distribution and function of vocatives in American and British English conversation. In Hilde Hasselgård & Signe Oksefjell (eds.), Out of Corpora: Studies in honor of Stig Johansson, 107–118. Amsterdam: Rodopi.
Leech, Geoffrey N., Roger Garside & Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. Proceedings of the 15th International Conference on Computational Linguistics, 622–628. Kyoto.
Leech, Geoffrey N. & Nicholas Smith. 2006. Recent grammatical change in written English 1961–1992: some preliminary findings of a comparison of American with British English. In Antoinette Renouf & Andrew Kehoe (eds.), The Changing Face of Corpus Linguistics, 185–204. (Language and Computers 55). Amsterdam: Rodopi.
Levin, Magnus & Hans Lindquist. 2007. Sticking one's nose in the data: Evaluation in phraseological sequences with nose. ICAME Journal 31. 87–110.
Liberman, Mark. 2005. What happened to the 1940s? Blog. Language Log.
itre.cis.upenn.edu/~myl/languagelog/archives/002397.html (1 March, 2015).
Liberman, Mark. 2012. Historical culturomics of pronoun frequencies. Blog.
Language Log.
Lindquist, Hans & Magnus Levin. 2008. Foot and Mouth: The phrasal patterns of two frequent nouns. In Sylviane Granger & Fanny Meunier (eds.), Phraseology: An interdisciplinary perspective, 143–158. Amsterdam: John Benjamins. https://benjamins.com/catalog/z.139.15lin (25 February, 2015).
Lindquist, Hans & Christian Mair (eds.). 2004. Corpus approaches to grammaticalization in English. (Studies in Corpus Linguistics v. 13). Amsterdam; Philadelphia: John Benjamins.
Lindsay, Mark. 2011. Rival suffixes: synonymy, competition, and the emergence of productivity. In Angela Ralli, Geert Booij, Sergio Scalise & Athanasios Karasimos (eds.), Morphology and the architecture of grammar: On-line proceedings of the Eighth Mediterranean Morphology Meeting, 192–203. Patras: University of Patras.
Lindsay, Mark & Mark Aronoff. 2013. Natural selection in self-organizing morphological systems. In Nabil Hathout, Fabio Montermini & Jesse Tseng (eds.), Morphology in Toulouse: Selected proceedings of Décembrettes 7, 133–153. München: Lincom Europa.
Lohmann, Arne. 2013. Constituent order in coordinate constructions: a processing perspective. Hamburg: Universität Hamburg dissertation. urn:nbn:de:gbv:18-64094.
Lorenz, David. 2012. Contractions of English Semi-Modals: The Emancipating Effect of Frequency. Freiburg: Albert-Ludwigs-Universität Freiburg Ph.D. thesis.
Louw, William E. 1993. Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies. In Mona Baker, Gill Francis & Elena Tognini-Bonelli (eds.), Text and technology: in honour of John Sinclair, 157–176. Amsterdam; Philadelphia: John Benjamins.
Mair, Christian. 2003. Gerundial complements after begin and start: Grammatical and sociolinguistic factors, and how they work against each other. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: Mouton de Gruyter.
Mair, Christian. 2004. Corpus linguistics and grammaticalisation theory: Statistics, frequencies, and beyond. In Hans Lindquist & Christian Mair (eds.), Corpus approaches to grammaticalization in English, 121–150. (Studies in Corpus Linguistics v. 13). Amsterdam: John Benjamins. https://benjamins.com/catalog/scl.13.07mai (15 November, 2016).
Manning, Christopher D. & Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, Mass: MIT Press.
Marco, María José Luzón. 2000. Collocational frameworks in medical research papers: a genre-based study. English for Specific Purposes 19(1). 63–86.
Martin, James H. 2006. A corpus-based analysis of context effects on metaphor comprehension. In Anatol Stefanowitsch & Stefan Th. Gries (eds.), Corpus-based approaches to metaphor, 214–236. Berlin; New York: Mouton de Gruyter.
Mason, Oliver & Susan Hunston. 2004. The automatic recognition of verb patterns: A feasibility study. International Journal of Corpus Linguistics 9(2). 253–270.
McEnery, Tony & Andrew Hardie. 2012. Corpus linguistics: method, theory and practice. (Cambridge Textbooks in Linguistics). Cambridge; New York: Cambridge University Press.
McEnery, Tony & Andrew Wilson. 2001. Corpus linguistics: an introduction.
Edinburgh: Edinburgh University Press.
Merriam-Webster. 2014. How does a word get into a Merriam-Webster
dictionary? FAQ. Merriam-Webster Online.
Meurers, W. Detmar. 2005. On the use of electronic corpora for theoretical linguistics. Lingua 115(11). 1619–1639.
Meurers, W. Detmar & Stefan Müller. 2009. Corpora and syntax. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics, vol. 2, 920–933. (Handbooks of Linguistics and Communication Science 29). Berlin; New York: De Gruyter Mouton.
Meyer, Charles F. 2002. English corpus linguistics: an introduction. (Studies in English Language). Cambridge, UK; New York: Cambridge University Press.
Michaelis, Laura A. & Knud Lambrecht. 1996. Toward a construction-based theory of language function: the case of nominal extraposition. Language 72(2). 215–247.
Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, The Google Books Team, J. P. Pickett, et al. 2011. Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331(6014). 176–182.
Mondorf, Britta. 2004. Gender differences in English syntax. Tübingen: Max Niemeyer.
Murphy, Bróna. 2009. 'She's a fucking ticket': the pragmatics of fuck in Irish English – an age and gender perspective. Corpora 4(1). 85–106.
Musolff, Andreas. 2012. The study of metaphor as part of critical discourse analysis. Critical Discourse Studies 9(3). 301–310.
Newman, Barry. 2013. In U.S., Apostrophes in Place Names Are Practically Against the Law. News. The Wall Street Journal.
Nichter, Luke A. 2007. Nixon tapes and transcripts. Archive. nixontapes.org.
Niemeier, Susanne. 2008. To be in control: kind-hearted and cool-headed: the head-heart dichotomy in English. In Farzad Sharifian & Susanne Niemeier (eds.), Culture, body, and language: conceptualizations of internal body organs across cultures and languages, 349–372. (Applications of Cognitive Linguistics 7). Berlin; New York: Mouton de Gruyter.
Noël, Dirk. 2003. Is there semantics in all syntax? The case of accusative and infinitive constructions vs. that-clauses. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: Mouton de Gruyter.
Oakes, M. P. & M. Farrow. 2007. Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries. Literary and Linguistic Computing 22(1). 85–99.
Office for National Statistics. 2005. T 01: Quarterly Population Estimates for England and Wales by Quinary Age Groups and Sex, Sept 03 – June 05. London: Office for National Statistics.
Or, Winnie Wing-fung. 1994. A corpus-based study of features of adjectival suffixation in English. Proceedings, Joint Seminar on Corpus Linguistics and Lexicology, Guangzhou and Hong Kong, 72–81. Hong Kong: Language Centre, Hong Kong University of Science and Technology.
Oxford University Press. 1989. The Oxford English dictionary. (Ed.) J. A. Simpson & E. S. C. Weiner. 2nd ed. Oxford: Clarendon Press; New York: Oxford University Press.
Partington, Alan. 1998. Patterns and meanings: using corpora for English language research and teaching. Amsterdam; Philadelphia: John Benjamins.
Pearce, Michael. 2008. Investigating the collocational behaviour of MAN and WOMAN in the British National Corpus using Sketch Engine. Corpora 3(1). 1–29.
Pinker, Steven. 1994. The language instinct. 1st ed. New York: W. Morrow and Co.
Plag, Ingo. 1999. Morphological productivity: Structural constraints in English derivation. Berlin; New York: Mouton de Gruyter.
Rayson, Paul. 2008. From key words to key semantic domains. International Journal of Corpus Linguistics 13(4). 519–549.
Rayson, Paul, Geoffrey N. Leech & Mary Hodges. 1997. Social Differentiation in the Use of English Vocabulary: Some Analyses of the Conversational Component of the British National Corpus. International Journal of Corpus Linguistics 2(1). 133–152.
Renouf, Antoinette. 1987. Lexical resolution. In Willem Meijs (ed.), Proceedings of the Seventh International Conference on English Language Research on Computerised Corpora, 121–131. Amsterdam: Rodopi.
Renouf, Antoinette & John Sinclair. 1991. Collocational frameworks in English. In Karin Aijmer & Bengt Altenberg (eds.), English corpus linguistics: studies in honour of Jan Svartvik, 128–143. London: Longman.
Robinson, Andrew. 2002. Lost languages: the enigma of the world's undeciphered scripts. New York: McGraw-Hill.
Rohdenburg, Günter. 1995. On the replacement of finite complement clauses by infinitives in English. English Studies 76(4). 367–388.
Rohdenburg, Günter. 2003. Cognitive complexity and horror aequi as factors determining the use of interrogative clause linkers in English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: Mouton de Gruyter.
Rohdenburg, Günter. 2009. Nominal complements. In Günter Rohdenburg & Julia Schlüter (eds.), One Language, Two Grammars?, 194–211. Cambridge: Cambridge University Press.
Rohdenburg, Günter & Julia Schlüter (eds.). 2009. One language, two grammars? Differences between British and American English. Cambridge, UK; New York: Cambridge University Press.
Rojo López, Ana María. 2013. Distinguishing near-synonyms and translation equivalents in metaphorical terms: Crisis vs. recession in English and Spanish. In Francisco Gonzálvez-García, M. Sandra Peña Cervel & Lorena Pérez Hernández (eds.), Metaphor and metonymy revisited beyond the contemporary theory of metaphor: recent developments and applications, 283–316. Amsterdam; Philadelphia: John Benjamins.
Romaine, Suzanne. 2001. A corpus-based view of gender in British and American English. In Marlis Hellinger & Hadumod Bußmann (eds.), Gender across Languages, vol. 1, 153–175. (IMPACT: Studies in Language and Society 9). Amsterdam; Philadelphia: John Benjamins.
Römer, Ute & Stefanie Wulff. 2010. Applying corpus methods to writing research: Explorations of MICUSP. Journal of Writing Research 2(2). 99–127.
Rosenbach, Anette. 2002. Genitive variation in English: conceptual factors in synchronic and diachronic studies. (Topics in English Linguistics 42). Berlin; New York: Mouton de Gruyter.
Rudanko, Juhani. 2003. More on horror aequi: evidence from large corpora. In Dawn Archer, Paul Rayson, Andrew Wilson & Tony McEnery (eds.), Proceedings of the Corpus Linguistics 2003 conference, 662–668. Lancaster: UCREL, Computing Dept., University of Lancaster.
Sacks, Harvey. 1992. Lectures on conversation. Oxford, UK; Cambridge, Mass:
Blackwell.
Sacks, Harvey, Emanuel A. Schegloff & Gail Jefferson. 1974. A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language 50(4). 696–735.
Säily, Tanja. 2011. Variation in morphological productivity in the BNC: Sociolinguistic and methodological considerations. Corpus Linguistics and Linguistic Theory 7(1).
Säily, Tanja & Jukka Suomela. 2009. Comparing type counts: The case of women, men and -ity in early English letters. Language and Computers: Studies in Practical Linguistics 69. 87–109.
Sampson, Geoffrey. 1995. English for the computer: the SUSANNE corpus and analytic scheme. Oxford: Clarendon Press; New York: Oxford University Press.
Sapir, Edward. 1921. Language: An introduction to the study of speech. New York: Harcourt, Brace & Co.
Schlüter, Julia. 2003. Phonological determinants of grammatical variation in English: Chomsky's worst possible case. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: Mouton de Gruyter.
Schmid, Hans-Jörg. 1996. Introspection and computer corpora: the meaning and complementation of start and begin. In Arne Zettersten & Viggo Hjørnager Pedersen (eds.), Symposium on Lexicography VII: Proceedings of the Seventh Symposium on Lexicography, May 5–6, 1994 at the University of Copenhagen, 223–239. (Lexicographica, Series Maior 76). Tübingen: Niemeyer.
Schmid, Hans-Jörg. 2003. Do women and men really live in different cultures? Evidence from the BNC. In Andrew Wilson, Paul Rayson & Tony McEnery (eds.), Corpus linguistics by the Lune: a Festschrift for Geoffrey Leech, 185–221. Frankfurt: Peter Lang.
Schütze, Carson T. 1996. The empirical base of linguistics: Grammaticality judgments and linguistic methodology. Chicago, IL: The University of Chicago Press.
Schütze, Carson T. 2016. The empirical base of linguistics: Grammaticality judgments and linguistic methodology. Reissue. (Classics in Linguistics 2). Berlin: Language Science Press.
Scott, Mike. 1997. PC analysis of key words – and key key words. System 25(2). 233–245.
Scott, Mike & Chris Tribble. 2006. Textual patterns: key words and corpus analysis in language education. (Studies in Corpus Linguistics v. 22). Amsterdam; Philadelphia: John Benjamins.
Searle, John R. 1969. Speech acts: an essay in the philosophy of language. London: Cambridge University Press.
Sebba, Mark & Steven D. Fligelstone. 1994. Corpora. In Ronald E. Asher & James M. Y. Simpson (eds.), The encyclopedia of language and linguistics. Oxford: Pergamon.
Semino, E. & M. Masci. 1996. Politics is Football: Metaphor in the Discourse of Silvio Berlusconi in Italy. Discourse & Society 7(2). 243–269.
Seto, Ken-ichi. 1999. Distinguishing metonymy from synecdoche. In Klaus-Uwe Panther & Günter Radden (eds.), Metonymy in language and thought, 91–120. (Human Cognitive Processing 4). Amsterdam; Philadelphia: John Benjamins. https://benjamins.com/catalog/hcp.4.06set (20 February, 2015).
Shaffer, J. P. 1995. Multiple Hypothesis Testing. Annual Review of Psychology 46(1). 561–584.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, John. 1996. EAGLES Preliminary recommendations on corpus typology. Pisa: Expert Advisory Group on Language Engineering Standards.
Sobkowiak, Włodzimierz. 1993. Unmarked-Before-Marked as a Freezing Principle. Language and Speech 36(4). 393–414.
Stallard, David. 1993. Two kinds of metonymy. Proceedings of the 31st annual meeting on Association for Computational Linguistics, 87–94. Stroudsburg: Association for Computational Linguistics.
Standwell, G. J. B. 1982. Genitive constructions and functional sentence perspective. IRAL – International Review of Applied Linguistics in Language Teaching 20(1–4).
Stefanowitsch, Anatol. 2003. Constructional semantics as a limit to grammatical alternation: The two genitives of English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of grammatical variation in English, 413–444. (Topics in English Linguistics 43). Berlin; New York: Mouton de Gruyter.
Stefanowitsch, Anatol. 2004. HAPPINESS in English and German: A metaphorical-pattern analysis. In Michel Achard & Suzanne Kemmer (eds.), Language, culture, and mind, 137–149. Stanford, CA: CSLI.
Stefanowitsch, Anatol. 2005. The function of metaphor: Developing a corpus-based perspective. International Journal of Corpus Linguistics 10(2). 161–198.
Stefanowitsch, Anatol. 2006a. Negative evidence and the raw frequency fallacy. Corpus Linguistics and Linguistic Theory 2(1).
Stefanowitsch, Anatol. 2006b. Words and their metaphors: A corpus-based approach. In Anatol Stefanowitsch & Stefan Th. Gries (eds.), Corpus-based approaches to metaphor, 61–105. (Trends in Linguistics). Berlin; New York: Mouton de Gruyter.
Stefanowitsch, Anatol. 2007a. Wortwiederholungen im Englischen und Deutschen: eine korpuslinguistische Annäherung. In Andreas Ammann & Aina Urdze (eds.), Wiederholung, Parallelismus, Reduplikation: Strategien der multiplen Strukturanwendung, 29–45. (Diversitas Linguarum). Bochum: Brockmeyer.
Stefanowitsch, Anatol. 2007b. Linguistics beyond grammaticality. Corpus Linguistics and Linguistic Theory 3(1).
Stefanowitsch, Anatol. 2008. Negative entrenchment: A usage-based approach to
negative evidence. Cognitive Linguistics 19(3).
Stefanowitsch, Anatol. 2010. Empirical cognitive semantics: Some thoughts. In Dylan Glynn & Kerstin Fischer (eds.), Quantitative Methods in Cognitive Semantics: Corpus-Driven Approaches. Berlin; New York: De Gruyter Mouton.
Stefanowitsch, Anatol. 2011. Cognitive linguistics meets the corpus. In Mario Brdar, Stefan Th. Gries & Milena Žic Fuchs (eds.), Cognitive Linguistics: Convergence and Expansion, 257–290. (Human Cognitive Processing 32). Amsterdam: John Benjamins.
Stefanowitsch, Anatol. 2015. Metonymies don't bomb people, people bomb people. Yearbook of the German Cognitive Linguistics Association 3(1).
Stefanowitsch, Anatol & Susanne Flach. 2016. The corpus-based perspective on entrenchment. In Hans-Jörg Schmid (ed.), Entrenchment and the psychology of language learning: How we reorganize and adapt linguistic knowledge. (Language and the Human Lifespan). Berlin; New York: De Gruyter Mouton.
Stefanowitsch, Anatol & Stefan Th. Gries. 2003. Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209–243.
Stefanowitsch, Anatol & Stefan Th. Gries. 2005. Covarying collexemes. Corpus Linguistics and Linguistic Theory 1(1). 1–43.
Stefanowitsch, Anatol & Stefan Th. Gries. 2008. Channel and constructional meaning: A collostructional case study. In Gitte Kristiansen & René Dirven (eds.), Cognitive Sociolinguistics, vol. 39, 129–152. Berlin; New York: Mouton de Gruyter.
Stubbs, Michael. 1995a. Collocations and cultural connotations of common words. Linguistics and Education 7(4). 379–390.
Stubbs, Michael. 1995b. Collocations and semantic profiles: On the cause of the trouble with quantitative studies. Functions of Language 2(1). 23–55.
Subtirelu, Nicholas. 2014. Do we talk and write about men more than women? Blog. linguistic pulse.
Swaine, Jon. 2009. Apostrophes abolished by council. Newspaper. The Telegraph.
Szmrecsanyi, Benedikt. 2005. Language users as creatures of habit: A corpus-based analysis of persistence in spoken English. Corpus Linguistics and Linguistic Theory 1(1).
Szmrecsanyi, Benedikt. 2006. Morphosyntactic persistence in spoken English: a corpus study at the intersection of variationist sociolinguistics, psycholinguistics, and discourse analysis. (Trends in Linguistics 177). Berlin; New York: Mouton de Gruyter.
Tagliamonte, Sali. 2006. Analysing sociolinguistic variation. (Key Topics in
Sociolinguistics). Cambridge, UK; New York: Cambridge University Press.
Taylor, John R. 2012. The mental corpus: how language is represented in the mind. Oxford; New York: Oxford University Press.
Thompson, Sandra A. & Paul J. Hopper. 2001. Transitivity, clause structure, and argument structure: evidence from conversation. In Joan Bybee & Paul J. Hopper (eds.), Frequency and the emergence of linguistic structure, 27–60. Amsterdam; Philadelphia: John Benjamins.
Thompson, Sandra A. & Yuka Koide. 1987. Iconicity and indirect objects in English. Journal of Pragmatics 11(3). 399–406.
Tissari, Heli. 2003. Lovescapes: Changes in prototypical senses and cognitive metaphors since 1500. (Mémoires de la Société Néophilologique de Helsinki LXII). Helsinki: Société Néophilologique.
Tissari, Heli. 2010. English words for emotions and their metaphors. In Margaret E. Winters, Heli Tissari & Kathryn Allan (eds.), Historical cognitive linguistics, 298–330. (Cognitive Linguistics Research 47). Berlin; New York: De Gruyter Mouton.
Tottie, G. & S. Hoffmann. 2006. Tag Questions in British and American English. Journal of English Linguistics 34(4). 283–311.
Tummers, Jose, Kris Heylen & Dirk Geeraerts. 2005. Usage-based approaches in Cognitive Linguistics: A technical state of the art. Corpus Linguistics and Linguistic Theory 1(2).
Turkkila, Kaisa. 2014. Do near-synonyms occur with the same metaphors: A comparison of anger terms in American English. metaphorik.de 25. 129–154.
Twenge, Jean M., W. Keith Campbell & Brittany Gentile. 2012. Male and Female Pronoun Use in U.S. Books Reflects Women's Status, 1900–2008. Sex Roles 67(9–10). 488–493.
Vosberg, Uwe. 2003. The role of extractions and horror aequi in the evolution of -ing complements in Modern English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: Mouton de Gruyter.
Wallington, Alan, John A. Barnden, Marina A. Barnden, Fiona J. Ferguson & Sheila R. Glasbey. 2003. Metaphoricity signals: A corpus-based investigation. Technical Report. Birmingham: The University of Birmingham, School of Computer Science.
Wasow, Tom & Jennifer Arnold. 2003. Post-verbal constituent ordering in English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of grammatical variation in English, 119–154. (Topics in English Linguistics 43). Berlin; New York: Mouton de Gruyter.
Weiner, E. Judith & William Labov. 1983. Constraints on the agentless passive. Journal of Linguistics 19(1). 29–58.
Widdowson, Henry G. 2000. On the limitations of linguistics applied. Applied Linguistics 21(1). 3–25.
Wiechmann, Daniel. 2008. On the computation of collostruction strength: Testing measures of association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory 4(2). 253–290.
Wiederhorn, Sheldon M., Richard J. Fields, Samuel Low, Gun-Woong Bahng, Alois Wehrstedt, Junhee Hahn, Yo Tomota, et al. 2011. Mechanical Properties. In Horst Czichos, Tetsuya Saito & Leslie Smith (eds.), Springer Handbook of Metrology and Testing, 339–452. Berlin; Heidelberg: Springer.
Wierzbicka, Anna. 1988. The semantics of grammar. (Studies in Language Companion Series v. 18). Amsterdam; Philadelphia: John Benjamins.
Wierzbicka, Anna. 2003. Cross-cultural pragmatics: the semantics of human interaction. Berlin; New York: Mouton de Gruyter.
Williams, Raymond. 1976. Keywords: a vocabulary of culture and society. New York: Oxford University Press.
Winchester, Simon. 2003. The meaning of everything: the story of the Oxford English dictionary. Oxford; New York: Oxford University Press.
Wolf, Hans-Georg & Frank Polzenhagen. 2007. Fixed expressions as manifestations of cultural conceptualizations: Examples from African varieties of English. In Paul Skandera (ed.), Phraseology and culture in English, 399–435. (Topics in English Linguistics 54). Berlin; New York: Mouton de Gruyter.
World Health Organization. 2010. The ICD-10 classification of mental and behavioural disorders: Clinical descriptions and diagnostic guidelines. 4th ed. Geneva: World Health Organization.
Wulff, Stefanie. 2003. A multifactorial corpus analysis of adjective order in English. International Journal of Corpus Linguistics 8(2). 245–282.
Yarowsky, David. 1993. One sense per collocation. Human language technology: proceedings of a workshop held at Plainsboro, New Jersey, March 21–24, 1993, 266–271. Association for Computational Linguistics.
Zaenen, Annie, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O'Connor & Tom Wasow. 2004. Animacy encoding in English: Why and how. Proceedings of the 2004 ACL Workshop on Discourse Annotation, 118–125. (DiscAnnotation '04). Stroudsburg, PA, USA: Association for Computational Linguistics.