Corpus Linguistics
A Guide to the Methodology
Anatol Stefanowitsch
Anatol Stefanowitsch
Freie Universität Berlin
Institut für Englische Philologie
Habelschwerdter Allee 45
14195 Berlin
the long time span has left an impact on the data and calculations: Over the years, I have used slightly different versions of some of the corpora and different ways of accessing them and computing statistics over the results. I am in the process of redoing all case studies using the most widely accessible version of the respective corpora, to ensure maximal reproducibility.
I believe that reproducibility is an important issue not just in corpus linguistics in general, but in the context of a textbook in particular, and I am planning to make this a topic with its own chapter in some future edition of the book (but not in the first one). However, the studies in the book should be as faithfully reproducible as possible, so I am planning to replace data drawn from expensive commercially available corpora such as the ICE-GB by data from freely or at least cheaply available corpora such as the BNC, the BNC Baby, the COW corpora or corpora from the ICAME collection even in the first edition of the book.
Finally, some of the data in the case studies is cited in the form given in the original studies on which my case studies are based. These too are not always reproducible with the freely available versions of the corpora in question, or they may use corpora that are not available at all. I might redo these studies too.
Since this is a time-consuming and laborious task and the currently remaining
inconsistencies (for example, concerning the number of hits for particular
phenomena or the total number of tokens in the corpora) should not have an
impact on the determination of the book's quality in the reviewing process, I have
decided to submit the manuscript as is and to complete the recalculations during
the reviewing process.
I am looking forward to any questions, suggestions and constructive criticism
that the reviewers or other interested parties using this draft version have for me.
I have set up a dedicated email address for this purpose:
<clmbook@stefanowitsch.net>
A note on copyright: once this book is published with Language Science Press, it will become available under a Creative Commons BY-SA 4.0 licence, but for the moment, I retain full standard copyright over this draft. This is to ensure that I cede control over this work only once it has been reviewed and a final version has been found worthy of publication. This draft may be distributed only in academic contexts (schools, universities and research institutes) and only in the present form (as a PDF file or as a printout of this file). Please do not host it on
openly accessible servers; it will remain downloadable from the URL http://stefanowitsch.net/clm/clmbook-draft.pdf (and from my academia.edu and researchgate.net pages) until (and, for historical purposes, after) the final version is published. Storing it in just a few places under my control is meant to prevent outdated draft copies being used and distributed after the final version has been published.
Anatol Stefanowitsch
Berlin, February 2017
Part I: Methodological Foundations
1 The need for corpus data
Linguistics is the scientific study of those aspects of the world that we summarize under the label language. Thus, both the systematic observation and the experimental study of linguistic behavior should have a role to play in linguistics.
Let us define a corpus somewhat crudely as a large collection of authentic text (i.e., samples of language produced in genuine communicative situations), and corpus linguistics as any form of linguistic inquiry based on data derived from such a corpus (we will refine these definitions in the next chapter to a point where they can serve as the foundation for a methodological framework, but they will suffice for now).
Defined in this way, corpora clearly constitute recorded observations of linguistic behavior. The role of corpus data, and the observation of linguistic behavior more generally, has been, and continues to be, highly controversial, especially with respect to core areas of grammatical theory such as morphology, syntax, and semantics. In much of the formalist linguistic literature, corpus linguistics is routinely attacked, ridiculed, and dismissed not just as having no practical use, but as having no conceivable use at all in linguistic inquiry.
In this literature, the method proposed instead is that of intuiting linguistic
data. Put simply, intuiting data means inventing sentences exemplifying the
phenomenon under investigation and then judging their grammaticality
(roughly, whether the sentence is a possible sentence of the language in
question). To put it mildly, inventing one's own data would appear to be a rather subjective procedure, so, again, anyone unfamiliar with the last fifty years of linguistic theorizing might wonder why such a procedure was proposed in the first place and why it is considered superior to the use of corpus data.
Readers familiar with this discussion or readers already convinced of the need for corpus data may skip this chapter, as it will not be referenced extensively in the remainder of this book. For all others, a discussion of both issues (the alleged uselessness of corpus data and the alleged superiority of intuited data) seems indispensable, if only to put them to rest in order to concentrate, throughout the rest of this book, on the vast potential of corpus linguistics and the exciting avenues of research that it opens up.
Section 1.1 will discuss four major points of criticism leveled at corpus data. As arguments against corpus data, they are easily defused, but they do point to aspects of corpora and corpus-linguistic methods that must be kept in mind when designing linguistic research projects. Section 1.2 will discuss intuited data in more detail and show that it does not solve any of the problems associated (rightly or wrongly) with corpus data. Instead, as Section 1.3 will show, intuited data actually creates a number of additional problems. Still, the intuitions we have about our native language (or other languages we speak well) can be useful in linguistic research as long as we do not confuse them with data.
The four major points of criticism leveled at corpus data are the following:
i. corpus data are usage data, and thus of no use in studying linguistic
knowledge;
ii. corpora, and the data derived from them, are necessarily incomplete;
iii. corpora contain only linguistic forms (represented as orthographic
strings), but no information about the semantics, pragmatics, etc. of these
forms; and
iv. corpora do not contain negative evidence, i.e., they can only tell us what is possible in a given language, but not what is not possible.
I will discuss the first three points in the remainder of this section. A fruitful discussion of the fourth point requires a basic understanding of statistics, which will be provided in Chapters 6 to 9, so I will postpone it until Chapter 10.
The speaker has represented in his brain a grammar that gives an ideal account of the structure of the sentences of his language, but, when actually faced with the task of speaking or understanding, many other factors act upon his underlying linguistic competence to produce actual performance. He may be confused or have several things in mind, change his plans in midstream, etc. Since this is obviously the condition of most actual linguistic performance, a direct record (an actual corpus) is almost useless, as it stands, for linguistic analysis of any but the most superficial kind (Chomsky 1964[1971]: 130f., emphasis added).
This argument may seem plausible at first glance, but it requires at least one of two assumptions that do not hold up to closer scrutiny: first, that there is an impenetrable bi-directional barrier between competence and performance, and second, that the influence of confounding factors on linguistic performance cannot be identified and taken into account. Linguistics is not unique with respect to the task of deducing underlying principles from specific manifestations influenced by other factors. For example, Chomsky has repeatedly likened linguistics to physics, but physicists searching for gravitational waves do not reject the idea of observational data on the basis of the argument that there are many other factors acting upon fluctuations in gravity and that therefore a direct record of such fluctuations is almost useless. Instead, they attempt to identify these factors and subtract them from their measurements.
In any case, the gap between linguistic usage and linguistic knowledge would be an argument against corpus data only if there were a way of accessing linguistic knowledge directly and without the interference of other factors.
Next, let us look at the argument that corpora are necessarily incomplete, also a long-standing argument in Chomskyan linguistics:

Any grammar of a language will project the finite and somewhat accidental corpus of observed utterances to a set (presumably infinite) of grammatical utterances (Chomsky 1957: 15).
1 In fact, there is research that not only takes such factors into account but actually treats them as objects of study in their own right. There is so much corpus-based and experimental research literature on disfluencies, hesitation phenomena, repairs, and similar phenomena that it makes little sense to even begin citing it here (cf. Kjellmer 2003, Corley and Stewart 2008, Gilquin and De Cock 2011).
Let us set aside for now the problems associated with the idea of grammaticality and simply replace the word grammatical with conventionally occurring (an equation that Chomsky explicitly rejects). Even the resulting, somewhat weaker statement is quite clearly true, and will remain true no matter how large a corpus we are dealing with. Corpora are incomplete in at least two ways.
First, corpora, no matter how large, are obviously finite, and thus they can never contain examples of every linguistic phenomenon. As an example, consider the construction [it doesn't matter the N] (as in the lines It doesn't matter the colour of the car/But what goes on beneath the bonnet from the Billy Bragg song A Lover Sings).2 There is ample evidence that this is a construction of British English. First, Bragg uses it in his song; second, a web search of .uk websites will turn up a number of examples (e.g. [1.1]); and third, most native speakers will readily provide examples of the construction:
(1.1a) It doesn't matter the reasons people go and see a film as long as they go and see it. [thenorthernecho.co.uk]
(1.1b) Remember, it doesn't matter the size of your garden, or if you live in a flat, there are still lots of small changes you can make that will benefit wildlife. [avonwildlifetrust.org.uk]
(1.1c) It doesn't matter the context. In the end, trust is about the person
However, the hundred-million-word British National Corpus (see Appendix A.1) does not contain a single instance of this construction. This is unlikely to be due to the fact that the construction is not part of British English; instead, the construction is presumably too rare to occur even in a hundred million words of text. Thus, someone studying the construction will wrongly conclude that it does not exist in British English on the basis of the BNC.
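A web search like the one described above can be sketched in a few lines. The following is a toy illustration (the function name, pattern and sample strings are my own, not taken from any actual corpus tool); the regular expression tolerates the apostrophe variants commonly found in web text:

```python
import re

# A naive pattern for [it doesn't matter the N]: the apostrophe is
# optional and may be straight or curly, and the slot after "the"
# is approximated by a single word-like token.
PATTERN = re.compile(r"\bit\s+doesn['\u2019]?t\s+matter\s+the\s+(\w+)",
                     re.IGNORECASE)

def find_construction(text):
    """Return the words filling the N slot of [it doesn't matter the N]."""
    return [m.group(1) for m in PATTERN.finditer(text)]

sample = ("It doesn't matter the size of your garden. "
          "It doesnt matter the context, trust is about the person.")
print(find_construction(sample))  # ['size', 'context']
```

In a real study one would of course run such a search over corpus files rather than a string, and inspect the hits manually, since a purely formal pattern will also match superficially similar sequences that do not instantiate the construction.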
Second, linguistic usage is not homogeneous but varies across situations (think of the kind of variation referred to by terms such as dialect, sociolect, genre, register, etc.); clearly, it is, for all intents and purposes, impossible to include this variation in its entirety in a given corpus. This is a problem not only for studies that are interested in linguistic variation but also for studies in core areas such as lexis and grammar: many linguistic patterns are limited to certain varieties, and a corpus that does not contain a particular variety cannot contain examples of a pattern limited to that variety. For example, the verb croak in the sense 'die' is usually used intransitively, but there is one register in which it also occurs transitively. Consider the following representative examples:
(1.2a) Because he was a skunk and a stool pigeon I croaked him just as he was goin to call the bulls with a police whistle [Veiller, Within the Law]
(1.2b) [U]se your bean. If I had croaked the guy and frisked his wallet, would I have left my signature all over it? [Stout, Some Buried Caesar]
(1.2c) I recall pointing to the loaded double-barreled shotgun on my wall and replying, with a smile, that I would croak at least two of them before they got away. [Thompson, Hell's Angels]
This incompleteness must be kept in mind when designing and using corpora (cf. the discussion in the next chapter). However, it is not an argument against the use of corpora, as all collections of data are necessarily incomplete. One important aspect of scientific work is to build general models from incomplete data and refine them as more data becomes available. The incompleteness of observational data is not seen as an argument against its use in other disciplines, and the argument gained currency in linguistics only because it was largely accepted that intuited data are
more complete. I will argue in Section 1.2.2, however, that this is not the case.
Corpus linguistics can only provide you with utterances (or written letter sequences or character sequences or sign assemblages). To do cognitive linguistics with corpus data, you need to interpret the data to give it meaning. The meaning doesn't occur in the corpus data. Thus, introspection is always used in any cognitive analysis of language [...] (G. Lakoff 2004).
G. Lakoff (and others putting forward this argument) is certainly right: if the corpus itself were all we had, corpus linguistics would be reduced to the detection of formal patterns (such as recurring combinations) in otherwise meaningless strings of symbols.
There are cases where this is the best we can do, namely, when dealing with documents in an unknown or unidentifiable language. An example is the Phaistos disc, a clay disk discovered in 1908 in Crete. The disc contains a series of symbols that appear to be pictographs (but may, of course, have purely phonological value), arranged in an inward spiral. These pictographs may or may not represent a writing system, and no one knows what language, if any, they may represent (in fact, it is not even clear whether the disc is genuine or a fake). However, this has not stopped a number of scholars from linguistics and related fields from identifying a number of intriguing patterns in the series of pictographs and some suggestive parallels to known writing systems (see Robinson 2002, Ch. 11, for a fairly in-depth popular account). Some of the results of this research are certainly suggestive and may one day enable us to identify the underlying language and even decipher the message, but until someone does so, there is no way of knowing if the theories are even on the right track.
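The purely formal pattern detection available in such cases, i.e., finding recurring combinations in an uninterpreted symbol sequence, can be illustrated as simple n-gram counting. The sketch below uses an invented letter sequence that merely stands in for a sequence of undeciphered signs:

```python
from collections import Counter

def ngram_counts(symbols, n=2):
    """Count recurring length-n combinations in a sequence of symbols."""
    return Counter(tuple(symbols[i:i + n])
                   for i in range(len(symbols) - n + 1))

# Invented sequence standing in for an undeciphered sign sequence.
signs = list("ABCABDABCEAB")
print(ngram_counts(signs).most_common(2))
# [(('A', 'B'), 4), (('B', 'C'), 2)]
```

Note that this is exactly as far as such an analysis can go without interpretation: it tells us which combinations recur, but nothing about what, if anything, they mean.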
It hardly seems desirable to put ourselves in the position of a Phaistos disc scholar artificially, by excluding from our research designs our knowledge of English (or whatever other language our corpus contains); it is quite obvious that we should, as G. Lakoff says, interpret the data in the course of our analysis. But does this mean that we are using introspection in the same way as someone inventing sentences and judging their grammaticality?
I think not. We need to distinguish two different types of introspection: (i) intuiting, i.e., the practice of introspectively accessing one's linguistic experience in order to create sentences and assign grammaticality judgments to them; and (ii) interpreting, i.e., the practice of assigning an interpretation (in semantic and pragmatic terms) to an utterance. These are two very different activities, and there is good reason to believe that speakers are better at the second activity than at the first: interpreting linguistic utterances is a natural activity (speakers must interpret everything they hear or read in order to understand it); inventing sentences and judging their grammaticality is not (speakers never do it outside of papers on grammatical theory). Thus, one can believe that interpretation has a place in linguistic research but intuition does not. Nevertheless, interpretation is a subjective activity and there are strict procedures that must be followed when including its results in a research design. This issue will be discussed in detail in Chapter 5.
As with the two points of criticism discussed in the preceding subsections, the problem of interpretation would be an argument against the use of corpus data only if there were a method that avoids interpretation completely or that at least renders it more objective; as the next section will show, intuited data is not such a method.
1.2 Intuition
Intuited data would not be the only alternative to corpus data, but it is the one proposed and used by critics of the latter, so let us look more closely at this practice. Given the importance of grammaticality judgments, one might expect them to have been studied extensively to determine exactly what it is that people are doing when they are making such judgments. Surprisingly, this is not the case, and the few studies that do exist are hardly ever acknowledged as potentially problematic by those linguists that routinely rely on such judgments, let alone discussed with respect to their place in scientific methodology.
One of the few such discussions is found in Jackendoff (1994). Jackendoff introduces the practice of intuiting grammaticality judgments as follows:
[A]mong the kinds of experiments that can be done on language, one kind is very simple, reliable, and cheap: simply present native speakers of a language with a sentence or phrase, and ask them to judge whether or not it is grammatical (Jackendoff 1994: 47).
This statement presents the process of eliciting such judgments as so straightforward that it does not require in-depth justification or discussion.
Jackendoff does not deal with either of these two aspects in any detail, although he briefly touches upon the first issue, claiming that linguists can usually trust their own judgments. The only thing about this statement that is true is the observation that linguists trust their own judgments. However, the claim that these judgments are reliable is completely unfounded. In the linguistic literature, grammaticality judgments of the same sentences by different authors often differ considerably, and the few systematic studies of such judgments suggest that they should be used with extreme caution (cf. Labov 1996), if at all (Schütze 1996/2016), instead of serving as the main methodology in linguistics.
The real reason that data intuited by the researchers themselves is so popular is not that it is reliable, but that it is easy. Jackendoff essentially admits this when he remarks that there is little that is easier than devising and judging some more sentences (Jackendoff 1994: 49).
However, the fact that something can be done quickly and effortlessly does not make it a good scientific method. If one is serious about using grammaticality judgments, these must be made as reliable as possible; among other things, this involves the two aspects that Jackendoff dismisses so lightly: asking large numbers of speakers and controlling the circumstances under which they are asked (cf. Schütze 1996/2016 and Cowart 1997 for detailed suggestions as to how this is to be done and Bender 2005 for an interesting alternative). In order to distinguish such empirically collected introspective data from data intuited by the researcher, I will refer to the former as elicitation data and continue to reserve for the latter the term intuition or intuited data.
The reliability problems of linguistic intuitions should be obvious and I will return to them briefly in Section 1.3, but first, let us discuss whether intuited data fares better than corpus data in terms of the three major points of criticism discussed above:
i. do intuited data give us access to linguistic knowledge rather than mere usage;
ii. are intuited data more complete than corpus data; and
iii. do intuited data contain information about the semantics, pragmatics, etc. of linguistic forms?
constructs that we develop on the basis of evidence and try to justify in terms of evidence. Since performance (in particular, judgments about sentences) obviously involves many factors apart from competence, one cannot accept as an absolute principle that the speaker's judgments will give an accurate account of his knowledge. (Chomsky 1972: 187, emphasis added)
If grammaticality judgments themselves involve factors other than competence, there is no reason not to attempt to model that competence on the basis of corpus data that involve factors other than competence, and the competence/performance argument against corpus data collapses.
Consider the pattern [bring NP to the boil], which is restricted almost entirely to a single text type, recipes: of the 145 matches in the BNC, 142 occur in recipes and the remaining three in narrative descriptions of someone following a recipe. Thus, a native speaker of English who never reads cookbooks or cooking-related journals and websites and never watches cooking shows on television can go through their whole life without encountering the verb bring used in this way. When describing the grammatical behavior of the verb bring based on their intuition, this use would not occur to them, and if they were asked to judge the grammaticality of a sentence like Half-fill a large pan with water and bring to the boil [BNC A7D], they would judge it ungrammatical. Thus, this valency pattern would be absent from their description in the same way that transitive croak 'die' or [it doesn't matter the N] would be absent from a grammatical description based on the BNC (where, as we saw in Section 1.1.2, these patterns do not occur).
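A breakdown like the one just cited (142 of 145 hits in recipes) is, computationally, just a tally of hits over the text-category metadata attached to each match. A minimal sketch, with an invented function name and an invented miniature data set:

```python
from collections import Counter

def hits_by_category(hits):
    """Tally corpus hits by the text category they occur in.

    `hits` is a list of (match, category) pairs, as one might derive
    from a corpus whose documents carry genre or register metadata.
    """
    return Counter(category for _, category in hits)

# Invented miniature data set mirroring the kind of breakdown
# reported for [bring NP to the boil] in the BNC.
hits = ([("bring to the boil", "recipe")] * 5
        + [("bring to the boil", "narrative")])
print(hits_by_category(hits))  # Counter({'recipe': 5, 'narrative': 1})
```

The substantive work, of course, lies not in the tally but in the corpus design: a corpus without genre metadata, or without the relevant genre at all, cannot yield such a breakdown in the first place.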
If this example seems too speculative, consider Culicover's (1999: 106f.) analysis of the phrase no matter. Culicover is an excellent linguist by any standard, but he bases his intricate argument concerning the unpredictable nature of the phrase no matter on the claim that the construction [it doesn't matter the N] is ungrammatical. If he had consulted the BNC, he might be excused for coming to this conclusion, since, as we saw above, the construction does not occur there.
While I may be an authority on the intended meaning of a sentence I myself have produced, my interpretation ceases to be privileged in this way once the
issue is no longer my intention, but the interpretation that my constructed sentence would conventionally receive in a particular speech community. In other words, the interpretation of a constructed sentence is subjective in the same way that the interpretation of a sentence found in a corpus is subjective. In fact, interpreting other people's utterances, as we must do in corpus-linguistic research, may actually lead to more intersubjectively stable results, as interpreting other people's utterances is a more natural activity than interpreting our own: the former is what we routinely engage in in communicative situations, the latter is not. Consider the corpus example in (1.3a):
(1.3a) When she'd first moved in she hadn't cared about anything, certainly not her surroundings (they had been the least of her problems) and if the villagers hadn't so kindly donated her furnishings she'd probably still be existing in empty rooms. [BNC H9V]
It is not easy to decide between the readings in (1.3b) and (1.3c). Consider the example in (1.4), which contains a clear example of donate with the supposedly ungrammatical ditransitive valency pattern:
(1.4) Please have a look at our wish-list and see if you can donate us a plant we need. [headway-cambs.org.uk]
The situation described in (1.4) is very different from the typical use, where a sum of money is transferred from an individual to an organization without personal contact. If this were an intuited example, I might judge it grammatical (at least marginally so) for similar reasons, while another researcher, unaware of my subtle reconceptualization, would judge it ungrammatical, leading to no insights whatsoever into the semantics of the verb donate or the valency patterns it occurs in.
Let me compare the quality of intuited data and corpus data in terms of two aspects that are considered much more crucial in methodological discussions outside of linguistics than those discussed above:
i. data reliability (roughly, how sure can we be that other people will arrive
at the same set of data using the same method);
ii. data validity or epistemological status of the data (roughly, how well do we understand what real-world phenomenon the data correspond to).6
As to the first criterion, note that the problem is not that intuition data are necessarily wrong. Very often, intuitive judgments turn out to agree very well with more objective kinds of evidence, and this should not come as a surprise. After all, as native speakers of a language, or even as advanced foreign-language speakers, we have considerable experience with using that language actively (speaking and writing) and passively (listening and reading). It would thus be surprising if we were categorically unable to make statements about the likelihood of occurrence of a particular expression.
Instead, the problem is that we have no way of determining introspectively whether a particular piece of intuited data is correct or not. To decide this, we need objective evidence, obtained either by serious experiments (including systematically collected elicitation data) or by observation (including corpus data).
6 Readers who are well-versed in methodological issues are asked to excuse this somewhat abbreviated use of the term validity; there are, of course, a range of uses in the philosophy of science and methodological theory for the term validity (we will encounter a different use from the one here in Chapters 3 and 4).
Intuited data creates the illusion that we can jump to generalizations directly and without the risk of errors. The fact that corpus data do not allow us to maintain this illusion does not make them inferior to intuition, it makes them superior. More importantly, it makes them normal observational data, no different from observational data in any other discipline.
To put it bluntly, then: intuition data are less reliable and less valid than corpus data, and they are just as incomplete and in need of interpretation. Does this mean that intuition data should be banned completely from linguistics? The answer is no, but not straightforwardly.
On the one hand, we would deprive ourselves of a potentially very rich source of insight if we banned intuition completely. On the other hand, a mixed data set (containing both corpus data and intuition data) will only be as valid, reliable, and complete as the weakest subset of data it contains. We have already established that intuition data and corpus data are both incomplete, thus a mixed set will still be incomplete (albeit perhaps less incomplete than a pure set), so nothing much is gained. Instead, the mixed set will simply inherit the lack of validity and reliability from the intuition data, and thus its quality will actually be lowered by the inclusion of these.
The solution to this problem, I believe, is quite simple. While intuited information about linguistic patterns fails to meet even the most basic requirements for scientific data, it meets every requirement for scientific hypotheses. A hypothesis has to be neither reliable, nor valid (in the sense of the term used here), nor complete. In fact, these words do not have any meaning if we apply them to hypotheses; the only requirement a hypothesis must meet is that of
testability (see further Chapter 3). There is nothing wrong with introspectively accessing our experience as a native speaker of a language (or a non-native one at that), provided we treat the results of our introspection as hypotheses about meaning or likelihood of occurrence rather than as facts.
Since there are no standards of purity for hypotheses, it is also unproblematic to mix intuition and corpus data in order to come up with more fine-grained hypotheses (cf. in this context Aston and Burnard [1998: 143]), as long as we then test our hypotheses on a pure data set that does not include the corpus data used in generating them.
1.4 Corpus data in other disciplines
Before we conclude our discussion of the supposed weaknesses of corpus data and the supposed strengths of intuited judgments, it should be pointed out that this discussion is limited largely to the field of grammatical theory. This in itself would be surprising if intuited judgments were indeed superior to corpus evidence: after all, the distinction between linguistic behavior and linguistic knowledge is potentially relevant in other areas of linguistic inquiry, too. Yet no other sub-discipline of linguistics has attempted to make a strong case against observation and for intuited data.
In some cases, we could argue that this is due to the fact that intuited judgments are simply not available. In language acquisition or in historical linguistics, for example, researchers could not use their intuition even if they wanted to, since not even the most fervent defenders of intuited judgments would want to argue that speakers have meaningful intuitions about earlier stages of their own linguistic competence or their native language as a whole. For language acquisition research, corpus data and, to a certain extent, psycholinguistic experiments are the only sources of data available, and historical linguists must rely completely on textual evidence.
In dialectology and sociolinguistics, however, the situation is slightly different: at least those researchers whose linguistic repertoire encompasses more than one dialect or sociolect (which is not at all unusual) could, in principle, attempt to use intuition data to investigate regional or social variation. To my knowledge, however, nobody has attempted to do this. There are, of course, descriptions of individual dialects that are based on introspective data (Green's (2002) description of the grammar of African-American English is an impressive example). But in the study of actual variation, systematically collected survey data (e.g. Labov, Ash and Boberg 2008) and corpus data in conjunction with multivariate statistics (e.g. Tagliamonte 2006) were considered the natural choice of data long before their potential was recognized in other areas of linguistics.
The same is true of conversation and discourse analysis. One could theoretically argue that our knowledge of our native language encompasses knowledge about the structure of discourse and that this knowledge should be accessible to introspection in the same way as our knowledge of grammar. However, again, no conversation or discourse analyst has ever actually taken this line of argumentation, relying instead on authentic usage data.7
Even lexicographers, who could theoretically base their description of the meaning and grammatical behavior of words entirely on introspectively accessed knowledge of their language, have not generally done so. Beginning with the Oxford English Dictionary, dictionary entries have been based at least in part on citations, i.e., authentic usage examples of the word in question (see next chapter).
If the incompleteness of linguistic corpora or the fact that corpus data have to be interpreted were serious arguments against their use, these sub-disciplines of linguistics should not exist, or at least, they should not have yielded any useful insights into the nature of language change, language acquisition, language variation, the structure of linguistic interactions or the lexicon. Yet all of these disciplines have, in fact, yielded insightful descriptive and explanatory models of their respective research objects.
One may suspect that the rejection of corpus data in grammatical theory is, at least in part, due to an unwillingness to deal with actual data, which are messy, incomplete and often frustrating, and that the arguments against the use of such data are, essentially, post-hoc rationalizations. But whatever the case may be, we will, at this point, simply stop worrying about the wholesale rejection of corpus linguistics by some researchers until the time that they come up with a convincing argument for this rejection, and turn to the question what exactly constitutes corpus linguistics.
7 Perhaps Speech Act Theory could be seen as an attempt at discourse analysis on the basis of intuition data: its claims are often based on short snippets of invented conversations. The difference between intuition data and authentic usage data is nicely demonstrated by contrasting the relatively broad but superficial view of linguistic interaction found in philosophical pragmatics with the rich and detailed view of linguistic interaction found in Conversation Analysis (e.g. Sacks, Schegloff, Jefferson 1974, Sacks 1992) and other discourse-analytic traditions.
constitutes corpus linguistics. This is due in part to the fact that the hundred-year
tradition is not an unbroken one. As we saw in the preceding chapter, corpora fell
out of favor just as linguistics grew into an academic discipline in its own right,
and corpus linguistics in its modern incarnation has made a comeback only
recently. It should therefore not come as a surprise that it has not, so far,
consolidated into a homogeneous methodological framework. More generally,
though, linguistics itself, with a tradition that reaches back to antiquity, has
remained a notoriously heterogeneous discipline with little agreement among
researchers even with respect to fundamental questions such as what aspects of
language constitute their object of study. It is unsurprising, then, that they do not
agree on how their object of study should be approached methodologically and how
it might be modeled theoretically. Given this lack of agreement, it is highly
unlikely that a unified methodology will emerge in the field any time soon.
On the one hand, this heterogeneity is a good thing. The dogmatism that
comes with monolithic theoretical and methodological frameworks can be stifling
to the curiosity that drives scientific progress, especially in the humanities and
social sciences, which are, by and large, less mature descriptively and theoretically
than the natural sciences. On the other hand, after more than a century of
scientific inquiry in the modern sense, there should no longer be any serious
disagreement as to its fundamental procedures, and there is no reason not to
apply these procedures within the language sciences. Thus, I will attempt in this
chapter to sketch out a broad and, I believe, ultimately uncontroversial
characterization of corpus linguistics as an instance of the scientific method. I
will develop this proposal by successively considering and dismissing alternative
characterizations of corpus linguistics. My aim in doing so is not to delegitimize
these alternative characterizations, but to point out ways in which they are
incomplete.
2. What is corpus linguistics?
This definition is uncontroversial in that any research method that does not fall
under it would not be regarded as corpus linguistics. However, it is also very
broad, covering many methodological approaches that would not be described as
corpus linguistics even by their own practitioners (such as discourse analysis or
citation-based lexicography). Some otherwise similar definitions of corpus
linguistics attempt to be more specific in that they define corpus linguistics as
"the compilation and analysis of corpora" (Cheng 2012: 6, cf. also Meyer 2002: xi),
suggesting that there is a particular form of recording real-life language use
called a corpus.
The first chapter of this book started with a similar definition, characterizing
corpus linguistics as any form of linguistic inquiry based on data derived from
a corpus, where a corpus was defined as a large collection of authentic text. In
order to distinguish corpus linguistics proper from other observational methods
in linguistics, we must first refine this definition of a linguistic corpus; this will be
our concern in the remainder of Section 2.1. We must then take a closer look at
what it means to study language on the basis of a corpus; this will be our concern
in Section 2.2.
2.1.1 Authenticity
The word authenticity has a range of meanings that could be applied to language:
it can mean that a speaker or writer speaks true to their character (He has
found his authentic voice) or to the character of the group they belong to (She is
the authentic voice of her generation), that a particular piece of language is
correctly attributed (This is not an authentic Lincoln quote), or that speech is direct
and truthful (the authentic language of ordinary people).
In the context of corpus linguistics (and often of linguistics in general),
authenticity refers much more broadly to what McEnery and Wilson call "real life"
communication: language produced for the sake of communication, not for
linguistic analysis or even with the knowledge that it might be used for such a
purpose. It is language that is not, as it were, performed
for the linguist based on what speakers believe constitutes good or proper
language. This is a very broad view of authenticity, since people may be
performing inauthentic language for reasons other than the presence of a
linguist, but such performances would be regarded by linguists as something
people will do naturally from time to time and that can and must be studied as an
aspect of language use. In contrast, performances for the linguist are assumed to
distort language behavior in ways that make them unsuitable for linguistic
analysis.
The texts which are collected in a corpus have a reflected reality: they are
only real because of the presupposed reality of the discourses of which
they are a trace. This is decontextualized language, which is why it is only
partially real. If the language is to be realized as use, it has to be
recontextualized. (Widdowson 2000: 7)
In some sense, it is obvious that the texts in a corpus (in fact, all texts) are only
fully authentic as long as they are part of an authentic communicative situation.
A sample of spoken language is only authentic as part of the larger conversation
in which it occurred.
This rather abstract point has very practical consequences, however. First, any
text, spoken or written, will lose not only its communicative context (the
discourse of which it was originally a part), but also some of its linguistic and
contains an error is one that the researcher must make by reconceptualizing the
speaker and their intentions in the original context, a reconceptualization that
makes authenticity impossible to determine.
This does not mean that corpora cannot be used. It simply means that the limits of
authenticity have to be kept in mind. With respect to spoken language, however,
there is a more serious problem: Sinclair's "minimum disruption".
The problem is that in observational studies no disruption is ever minimal; as
soon as the investigator is present in person or in the minds of the observed, we
get what is known as the observer's paradox: we want to observe people (or
other animate beings) behaving as they would if they were not observed, but in
the case of spoken corpora, speakers cannot simply be recorded
without their knowledge. Speakers must give written consent before the data
collection can begin, and there is typically a recording device in plain view that
draws attention to the recording situation, as in the following excerpt:
(2.1) A: Josie?
B: Yeah. [laughs] I'm not filming you, I'm just taping you.
[...]
A: Yeah, I'll take your little toy and smash it to pieces!
In the excerpt in (2.2), speaker A explains to their interlocutor the fact that the
conversation they are having will be used for linguistic research:
(2.2) A: Erm this lady from the, University of Bergen.
B: So how d'ya how does it work?
A: Erm you you speak into it and erm, records, gotta record conversations
between people. [COLT B141708]
A speaker's knowledge that they are being recorded for the purposes of linguistic
analysis is bound to distort the data even further. In example (2.3), there is
evidence for such a distortion: the speakers are performing explicitly for the
recording device:
C: [laughs]
B: No it is not true. I protest. [COLT B132611]
Speaker A asks for "true things" and then imitates an interview situation, which
speaker B takes up by using the somewhat formal phrase I protest, which they
presumably would not use in an authentic conversation about their love life.
Obviously, such distortions will be more or less problematic depending on our
research question. Level of formality (register) may be easier to manipulate in
performing for the linguist than pronunciation, which is easier to manipulate
than morphological or syntactic behavior. However, the fact remains that spoken
data in corpora are hardly ever authentic in the corpus-linguistic sense (unless they
are based on recordings of public language use, for example, from television or the
radio), and the researcher must rely, again, on an attempt to recontextualize the
example, Nichter 2007). In addition to the interest they hold for historians, they
form the largest available corpus of truly authentic spoken language.
However, even these recordings are too limited in size, as well as in the
diversity of speakers recorded (mainly older white American males), to serve as a
standard against which to compare other collections of spoken data.
The ethical and legal problems in recording unobserved spoken language
cannot be circumvented, but their impact on the authenticity of the recorded
language may be lessened in various ways, for example, by getting general
consent for recording from speakers but not telling them when precisely they
will be recorded.
However, there are reasons why researchers may want to depart from
authenticity in the corpus-linguistic sense on purpose, because their research
design or the phenomenon under investigation requires it.
A researcher may be interested in a phenomenon that is so rare in most
situations that even the largest available corpora do not contain a sufficient
number of cases. These may be structural phenomena (like the pattern [it doesn't
matter the N] or transitive croak, discussed in the previous chapter), or unusual
they are talking to a robot, but the robot is actually controlled by one of the
researchers, cf. e.g. Georgila et al. 2010).
Such semi-structured elicitation techniques may also be used with phenomena
that are frequent enough in a typical corpus, but where the researcher wants to
vary certain aspects systematically (for example, the nature of the entity that is
moving along a path), or where the researcher wants to achieve comparability
across speakers or even across languages (cf. again the picture-elicited narratives
just mentioned).
These are valid reasons for eliciting a special-purpose corpus rather than
collecting naturally occurring text. Still, the stimulus-response design of this type
of corpus collection is very obviously influenced by experimental paradigms used
in psychology. Thus, studies based on such corpora must be regarded as falling
somewhere between corpus linguistics and psycholinguistics, and they must
therefore meet the design criteria of both corpus-linguistic and psycholinguistic
research designs.
2.1.2 Representativeness
Put simply, a representative sample is a subset of a population that is
(near-)identical to the population as a whole with respect to the distribution of
the phenomenon under investigation. Thus, for a corpus (a sample of language
Ostensibly, the way that corpus builders typically aim to achieve this is by
including in the corpus different text types characterized by channel
(spoken/written), setting, function, demographic background of speakers, etc., in
Second, even if we did know, it is not clear that all types of language use shape
and/or represent the linguistic system in the same way, simply because we do not
know how widely they are received. For example, emails may be responsible for a
larger share of written language produced daily than news sites, but each email is
typically read by a handful of people at the most, while some news texts may be
read by millions of people (and others not at all).
Third, and relatedly, speech communities are not homogeneous, so
defining balance based on the proportion of text types in the speech community
may not yield a realistic representation of the language even if it were possible:
every member of the speech community takes part in different communicative
situations involving different text types. Some people read more than others;
among these, some read mostly newspapers, others mostly novels; some people
watch parliamentary debates on TV all day, others mainly talk to customers in
the bakery where they work. In other words, the proportion of text types
speakers encounter varies, requiring a notion of balance based on the incidence of
text types in the linguistic experience of a typical speaker. This, in turn, requires a
definition of what constitutes a typical speaker in a given speech community.
Such a definition may be possible but, to my knowledge, does not exist so far.
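If such a definition did exist, translating a speaker's exposure profile into a corpus design would be mechanically trivial. The following sketch illustrates the idea; the proportions are invented purely for illustration, since, as just noted, no empirical estimates exist:

```python
# Hypothetical exposure proportions for a "typical speaker"; these figures
# are invented for illustration, not empirical estimates.
exposure = {
    "conversation": 0.60,
    "broadcast media": 0.20,
    "newspapers": 0.10,
    "fiction": 0.07,
    "academic writing": 0.03,
}

def balanced_design(total_words, proportions):
    """Translate exposure proportions into per-text-type word targets."""
    return {text_type: round(total_words * share)
            for text_type, share in proportions.items()}

print(balanced_design(1_000_000, exposure))
# A one-million-word corpus balanced against these (invented) proportions
# would contain 600,000 words of conversation, 200,000 of broadcast media, etc.
```

The hard part, of course, is not the arithmetic but justifying the proportions in the first place.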
Finally, there are text types for which, for practical reasons, it is impossible to
acquire samples: for example, pillow talk (which speakers will be unwilling to
share because they consider it too private), religious confessions or lawyer-client
conversations (which speakers are prevented from sharing because they are
privileged), and planning of illegal activities (which speakers will want to keep
secret in order to avoid lengthy prison terms).
Representativity or balancedness also plays a role if we do not aim at
Text category                                        Samples
  Foundation Reports                                       2
  Industry Reports                                         2
  College Catalog                                          1
  Industry House Organ                                     1
Learned
  Natural Sciences                                        12
  Medicine                                                 5
  Mathematics                                              4
  Social and Behavioral Sciences                          14
  Political Science, Law, Education                       15
  Humanities                                              18
  Technology and Engineering                              12
Fiction
  General: Novels                                         20
  General: Short Stories                                   9
  Mystery and Detective: Novels                           20
  Mystery and Detective: Short Stories                     4
  Science Fiction: Novels                                  3
  Science Fiction: Short Stories                           3
  Adventure and Western: Novels                           15
  Adventure and Western: Short Stories                    14
  Romance and Love Story: Novels                          14
  Romance and Love Story: Short Stories                   15
  Humor: Novels                                            3
  Humor: Essays, etc.                                      6
Press
  Reportage: Political                                    14
  Reportage: Sports                                        7
  Reportage: Society                                       3
  Reportage: Spot News                                     9
  Reportage: Financial                                     4
  Reportage: Cultural                                      7
  Editorial: Institutional                                10
  Editorial: Personal                                     10
  Editorial: Letters to the Editor                         7
  Reviews (theatre, books, music, dance)                  17
investigate this particular variety, but if the corpus were meant to represent the
standard language in general (which the corpus creators explicitly deny), it forces
us to accept a very narrow understanding of "standard".
The BROWN corpus consists of 500 samples of approximately 2000 words
each, drawn from a number of different text types, as shown in Table 2.1.
The first level of sampling is by genre: there are 286 samples of non-fiction,
126 samples of fiction and 88 samples of press texts. There is no reason to believe
that this corresponds proportionally to the total number of words produced in
these text types in the USA in 1961. There is also no reason to believe that the
distribution corresponds proportionally to the incidence of these text types in the
reportage than editorials, people (or at least academics of the type that built the
corpus) read more mystery fiction than science fiction, etc. The creators of the
BROWN corpus are quite open about the fact that their corpus design is not a
representative sample of (written) American English. They describe the collection
procedure as follows:
the random selections were made. But for certain categories it was
necessary to go beyond these two collections. For the daily press, for
example, the list of American newspapers of which the New York Public
Library keeps microfilm files was used (with the addition of the
Providence Journal). Certain categories of chiefly ephemeral material
necessitated rather arbitrary decisions; some periodical materials in the
categories Skills and Hobbies and Popular Lore were chosen from the
contents of one of the largest second-hand magazine stores in New York
City. (Francis/Kucera 1979)
If anything, the BROWN corpus represents the holdings of the libraries
mentioned, although even this representativeness is limited in two ways: first, by
the unsystematic additions mentioned in the quote, and second, by the sampling
procedure applied.
Although this sampling procedure is explicitly acknowledged to be
subjective by the creators of the BROWN corpus, their description suggests that
their design was guided by a general desire for balance:
The list of main categories and their subdivisions was drawn up at a
conference held at Brown University in February 1963. The participants in
youngest was 36, the oldest 59). No doubt, a different group of researchers (let
alone a random sample of speakers following the procedure described) would
arrive at a very different corpus design.
Second, the procedure involves an attempt to capture the proportion of text
types in actual publication: this proportion was determined on the basis of the
American Book Publishing Record, a reference work containing publication
information on all books published in the USA in a given year. Whether this is, in
fact, a comprehensive source is unclear, and anyway, it can only be used in the
selection of excerpts from books. Basing the estimation of the proportion of text
types on a different source would, again, have yielded a very different corpus
design. For example, the copyright registrations for 1961 suggest that the
category of periodicals is severely underrepresented relative to the category of
books: there are roughly the same number of copyright registrations for the two
text types, but there are one-and-a-half times as many excerpts from books as
from periodicals in the BROWN corpus.
Despite these shortcomings, the BROWN corpus was well received, spawning
a host of corpora of different varieties of English using the same design, for
example, the Lancaster-Oslo/Bergen Corpus (LOB) containing British English
from 1961, the Freiburg Brown (FROWN) and Freiburg LOB (FLOB) corpora of
American and British English respectively from 1991, the Wellington Corpus of
Written New Zealand English, and the Kolhapur Corpus (Indian English). The
success of the BROWN design was partly due to the fact that being able to study
strictly comparable corpora of different varieties is useful regardless of their
design. However, if the design had been widely felt to be completely off-target,
researchers would not have used it as a basis for the significant effort involved in
corpus creation.
More recent corpora at first glance appear to take a more principled approach
to balance. Most importantly, they typically include not just written language but
also spoken language. However, a closer look reveals that this is the only real
change. For example, the BNC BABY, a four-million-word subset of the 100-
million-word British National Corpus (BNC), includes approximately one million
words each from the registers spoken conversation, written academic language,
written prose fiction and written newspaper language (Table 2.2 shows the exact
design). Obviously, this design does not correspond to the linguistic experience of
a typical speaker, who is unlikely to be exposed to academic writing and whose
exposure to written language is not three times as large as their exposure to
spoken language. The design also does not correspond in any obvious way to the
actual amount of language produced on average in the four categories or the
subcategories of academic and newspaper language. Despite this, the BNC BABY,
and the BNC itself, which is even more drastically skewed towards edited written
language, are extremely successful corpora that are still widely used a quarter-
century after the first release of the BNC.
Text type                                 Texts      Words
  Nat. Science                                6     215549
  Politics/Law/Education                      6     195836
  Soc. Science                                7     209645
  Technology/Engineering                      2      77533
Fiction
  Prose                                      25    1010279
Newspapers
  Nat. Broadsheet: Arts                       9      36603
  Nat. Broadsheet: Commerce                   7      64162
  Nat. Broadsheet: Editorial                  1       8821
  Nat. Broadsheet: Miscellaneous             25     121194
  Nat. Broadsheet: Report                     3      48190
  Nat. Broadsheet: Science                    5      18245
  Nat. Broadsheet: Social                    13      34516
  Nat. Broadsheet: Sports                     3      36796
  Other: Arts                                 3      43687
  Other: Commerce                             5      89170
  Other: Report                               7     232739
  Other: Science                              7      13616
  Other: Social                               8      94676
  Tabloid                                     1     121252
Even what I would consider the most serious approach to date to creating a
balanced corpus design, the sampling schema of the International Corpus of
English (ICE), is unlikely to be significantly closer to constituting a representative
sample of English language use (see Table 2.3).
Text category                                             Texts
Spoken
  Monologues  Unscripted  Spontaneous commentaries           20
                          Unscripted Speeches                30
                          Demonstrations                     10
                          Legal Presentations                10
              Scripted    Broadcast News                     20
                          Broadcast Talks                    20
                          Non-broadcast Talks                10
Written
  Non-printed  Student Writing  Student Essays               10
                                Exam Scripts                 10
  Printed      Technology                                    10
               Popular writing  Humanities                   10
                                Social Sciences              10
                                Natural Sciences             10
                                Technology                   10
               Reportage        Press news reports           20
               Instructional writing  Administrative Writing 10
                                Skills/hobbies               10
               Persuasive writing  Press editorials          10
               Creative writing  Novels and short stories    10
It puts a stronger emphasis on spoken language: sixty percent of the corpus are
spoken text types, although two thirds of these are public language use, while for
most of us private language use is likely to account for more of our linguistic
experience. It also includes a much broader range of written text types than
previous corpora, including not just edited writing but also student writing and
letters.
However, as before, there is little reason to believe that this design
corresponds closely to a random sample taken from the entirety of language
produced in a particular English speech community.
It is clear, then, that none of these (or any other actually existing) corpora are
representative and/or balanced in the sense discussed at the beginning of this
section.
The impossibility of creating a truly balanced corpus is sometimes used to
argue against corpus linguistics in general; recall the criticism, discussed in the
previous chapter, that linguistic corpora are necessarily skewed. This point of
criticism is widely acknowledged in corpus linguistics, and yet corpus linguists
are not only happy to continue using available corpora; there would also be
wide agreement that although none of the corpora discussed above are
representative or balanced, the design of the ICE is better than that of the BNC
BABY, which is in turn better than that of the BROWN corpus and its offspring.
The rationale behind this (presumed) agreement is not based on
representativeness or balance, but on the related but distinct criterion of diversity.
While corpora will always be skewed relative to the overall population of texts
and text types in a speech community, the undesirable effects of this skew can be
minimized by including in the corpus as broad a range of varieties as is realistic,
either in general or in the context of a given research project.
Unless language structure and language use are infinitely variable (which, at
least at a given point in time, they are not), increasing the diversity of the sample
will increase representativeness even if the corpus design is not, strictly speaking,
balanced. It seems important to acknowledge that this does not mean that
diversity and balance are the same thing, but given that balanced corpora are
practically (and perhaps theoretically) impossible to create, diversity is a
workable and justifiable proxy.
2.1.3 Size
Like diversity, corpus size is also assumed, more or less explicitly, to contribute to
representativeness (e.g. by McEnery/Wilson 1991). It is difficult to say whether
the relationship between size and representativeness is strictly speaking relevant
in the context of corpora: sample size does correlate with representativeness to
some extent, in that an exhaustive sample, i.e. one that encompasses the entire
population, will necessarily be representative, and this representativeness does
not drop to zero immediately if we decrease sample size. However, it does
drop rather rapidly: if we exclude just one percent of the totality of texts produced in
a given language, entire text types may already be missing. For example, the
Library of Congress holds around 38 million print materials, roughly half of them
in English. A search for "cooking" in the main catalogue yields 7,638 items that
presumably include all cookbooks in the collection. This means that cookbooks
make up no more than 0.04 percent of printed English (7,638/19,000,000 =
0.000402). Thus, they would be easily lost in their entirety once the sample size
drops substantially below the size of the population as a whole.
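The effect can be made concrete with a back-of-the-envelope calculation based on the figures just cited; the sampling model (random selection of items from the holdings) is, of course, a simplification:

```python
# Figures cited above: cookbook items vs. English print holdings.
cookbooks = 7_638
english_items = 19_000_000
share = cookbooks / english_items  # roughly 0.0004, i.e. 0.04 percent

# Expected number of cookbook items in a random sample of a given size.
for sample_size in (19_000_000, 1_000_000, 500):
    expected = share * sample_size
    print(f"sample of {sample_size:>10,}: {expected:8.1f} cookbooks expected")
```

In a 500-text sample of the BROWN type, cookbooks would thus be expected to turn up 0.2 times on average, i.e. in most such samples not at all.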
Since even the largest corpora comprise only tiny fractions of the language as
a whole, size certainly does not guarantee balance in any way: even the text
types that are theoretically accessible will be represented disproportionately, and
many will be missing altogether. And with these text types missing, at least some
linguistic phenomena will disappear from the corpus, such as the expression
[__ NP(liquid) PP[to the/a boil]], which, as discussed in Chapter 1, is exclusive to
cookbooks. However, assuming, again, that language structure and use are not
infinitely variable, size will correlate with the representativeness of particular
phenomena at least to some extent.
Chapters 6 and 7). The less modest answer is that it must be large enough to
contain sufficiently large samples of every grammatical structure, vocabulary
item, etc. Given that an ever increasing number of texts from a broad range of text
types is becoming accessible via the web, the second answer may not actually be
as immodest as it sounds: it is now possible to create corpora that contain
billions of words, making the BNC look tiny in comparison. Still, for most issues
that corpus linguists are likely to want to investigate, a size of one million to one
hundred million words is sufficient. Let us take this broad range as characterizing
a linguistic corpus for practical purposes.
This definition is more specific with respect to the data used in corpus linguistics
and will exclude discourse analysis, text linguistics, variationist sociolinguistics
and other fields working with authentic language data (whether such a strict
exclusion is a good thing is a question we will briefly return to at the end of this
chapter).
However, the definition says nothing about the way in which these data are to
be investigated. Crucially, it would cover a procedure in which the linguistic
corpus essentially serves as a giant citation file that the researcher scours, more
or less systematically, for examples of a given linguistic phenomenon.
This procedure of basing linguistic analyses on citations has a long tradition in
descriptive English linguistics, going back at least to Otto Jespersen's seven-
volume Modern English Grammar on Historical Principles (Jespersen 1909–1940).
It played a particularly important role in the context of dictionary making. The
Oxford English Dictionary (Simpson and Weiner 1989) is the first and probably
still the most famous example of a citation-based dictionary. For the first two
editions, it relied on citations sent in by volunteers (cf. Winchester 2003 for a
popular account). In its current third edition, its editors actively search corpora
and other text collections (including the Google Books index) for citations.
even refer to their procedure as "study[ing] the language as it's used" (Merriam-
Webster 2014), a characterization that is very close to McEnery and Wilson's
definition of corpus linguistics as the study of language based on examples of
"real life" language use.
There are legitimate reasons for collecting citations. It may serve to show that
a particular linguistic phenomenon existed at a particular point in time; one
reason for basing the OED on citations was and is to identify the first recorded
use of each word. It may also serve to show that a particular linguistic
phenomenon exists at all, for example, if that phenomenon is considered
ungrammatical (as in the case of [it doesn't matter the N], discussed in the
previous chapter).
However, there is an important reason why the method of collecting citations
cannot be regarded as an instance of the scientific method beyond the purpose of
proving existence, and hence should not be taken to represent corpus linguistics
proper. While the procedure described by the makers of Merriam-Webster's
sounds relatively methodical and organized, it is obvious that the editors will be
guided in their selection by many factors that would be hard to control even if
they were fully aware of them, such as their personal interests, their sense of
esthetics, the intensity with which they have thought about some uses of a word
as opposed to others, etc.
This can result in a significant bias in the resulting database even if the
method is applied systematically, a bias that will be reflected in the results of the
linguistic analysis, i.e. the definitions and example sentences in the dictionary. To
pick a random example: the word of the day on the Merriam-Webster
dictionaries' website at the time of writing is implacable, defined as "not capable
of being appeased, significantly changed, or mitigated" (Merriam-Webster, s.v.
implacable). The entry gives two examples for the use of this word (cf. [2.4a, b]),
and the word-of-the-day posting gives two more (shown in [2.4c, d] in
abbreviated form):
Except for hatred, the nouns modified by implacable in these examples are not
at all representative of actual usage. The lemmas most frequently modified by
implacable in the 450-million-word Corpus of Contemporary American English (COCA)
are enemy and foe, followed at some distance by force, hostility, opposition, will,
and the hatred found in (2.4a). Thus, it seems that implacable is used most
frequently in contexts describing adversarial human relationships, while the
examples that the editors of the Merriam-Webster's selected as typical deal mostly
with adversarial abstract forces. Perhaps this distortion is due to the materials the
editors searched, perhaps the examples struck the editors as citation-worthy
precisely because they are slightly unusual, or because they appealed to them
esthetically (they all have a certain kind of rhetorical flourish).8
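Collocate counts of this kind are easy to produce mechanically. The sketch below uses a tiny invented corpus as a stand-in for COCA (which cannot be redistributed), and naive adjacency instead of proper lemmatization and part-of-speech tagging:

```python
import re
from collections import Counter

# Invented mini-corpus standing in for COCA.
corpus = (
    "an implacable enemy of reform ... his implacable foe ... "
    "the implacable hatred between the families ... "
    "an implacable enemy once more ... implacable opposition to the plan"
)

# Count the word immediately following "implacable"; a real study would
# restrict this to nouns using a part-of-speech-tagged corpus.
collocates = Counter(
    m.group(1).lower() for m in re.finditer(r"\bimplacable\s+(\w+)", corpus)
)
print(collocates.most_common(3))
# In this toy corpus, "enemy" comes out on top, mirroring the COCA finding.
```

On a real corpus, the same logic (with tagging and lemmatization added) yields exactly the kind of frequency ranking cited above.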
Contrast the performance of the citation-based method with the more strictly
corpus-based method used by the Longman Dictionary of English, which
illustrates the adjective implacable with the representative examples in (2.5a,b):
phenomenon and then scour the corpus for examples that support this analysis;
we could call this method corpus-illustrated linguistics (cf. Tummers et al. 2005).
In the case of spoken examples that are overheard and then recorded after the
fact, there is an additional problem: researchers will write down what they
thought they heard, not what they actually heard.9
The use of corpus examples for illustrative purposes has become somewhat
fashionable among researchers who otherwise largely depend on introspective
data. While it is probably an improvement over the practice of simply
inventing data, it has a fundamental weakness: it does not ensure that the data
selected by the researcher are actually representative of the phenomenon under
investigation. In other words, corpus-illustrated linguistics simply replaces
introspectively invented data with introspectively selected data and thus inherits
the fallibility of the introspective method discussed in the previous chapter.
Since overcoming the fallibility of introspective data is one of the central
motivations for using corpora in the first place, the definition must be revised as
follows (a similar definition is found in Biber, Conrad and Reppen 1998: 4):
Here, "exhaustive" roughly means taking into account all examples of the
phenomenon in question, and "systematic" means looking at each example based
on the same criteria. As was mentioned in the preceding section, linguistic
corpora are currently typically between one million and one hundred million
9 As anyone who has ever tried to transcribe spoken data will know, this implicit
distortion of data is a problem even where the data is available as a recording:
transcribers of spoken data are forever struggling with it. Just record a minute of
spoken language and try to transcribe it exactly: you will be surprised how frequently
you transcribe something that is similar, but by no means identical, to what is on the tape.
10 Note, however, that sometimes manual extraction is the only option; cf. Colleman
(2006, 2009), who manually searched a one-million-word corpus of Dutch in order to
extract all ditransitive clauses. To give you a rough idea of the workload involved in
this type of manual extraction, it took Colleman ten full work days to go through the
entire corpus (Colleman, pers. comm.).
2. What is corpus linguistics?
Definition (Third attempt, Version 1)
Corpus linguistics is the investigation of linguistic phenomena on the basis
of computer-readable linguistic corpora using corpus analysis software.
However, this is not a useful approach. It is true that there are scientific
disciplines that are so heavily dependent upon a particular technology that they
could not exist without it: for example, radio astronomy (which requires a radio
telescope) or radiology (which requires an x-ray machine). However, even in such
cases we would hardly want to claim that the technology in question can serve as
a defining criterion: one can use the same technology in ways that do not qualify
as belonging to the respective discipline. For example, a spy might use a radio
telescope to intercept enemy transmissions, and an engineer may use an x-ray
machine to detect fractures in a steel girder, but that does not make the spy a
radio astronomer or the engineer a radiologist.
Clearly, even a discipline that relies crucially on a particular technology
cannot be defined by the technology itself but by the uses to which it puts that
technology. If anything, we must thus replace the reference to corpus analysis
software by a reference to what that software typically does.
There is a wide range of software packages for corpus analysis that vary in
their capabilities, but most of them have three functions in common:
i. they produce KWIC (Key Word In Context) concordances, i.e. they display
all occurrences of a particular search pattern (often a word form),
together with their immediate context, defined in terms of a particular
number of words or characters to the left and the right (see Table 2-4 for a
KWIC concordance of the noun time);
ii. they identify collocates of a given search word, i.e. words that occur in a
certain position relative to it; these words are typically listed in the order
of frequency with which they occur in the position in question (see Table
2-5 for a list of collocates of the noun time in a span of three words to the
left and right);
iii. they produce frequency lists, i.e. lists of all word forms in a given corpus,
listed in the order of their frequency of occurrence (see Table 2-6 for the
forty most frequent strings (word forms and punctuation marks) in the
BNC BABY).
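The three functions just listed are easy to grasp in procedural terms. The following sketch (a toy illustration in Python, not the code of any actual corpus analysis package) assumes a corpus that has already been tokenized into a list of word-form strings:

```python
from collections import Counter

def kwic(tokens, node, span=4):
    """Key Word In Context: every occurrence of the node word,
    with `span` tokens of context to the left and to the right."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{tok}] {right}")
    return lines

def collocates(tokens, node, position=-1):
    """Collocates of the node word at a fixed offset (e.g. -1 for the
    word directly to its left), sorted by frequency."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node and 0 <= i + position < len(tokens):
            counts[tokens[i + position].lower()] += 1
    return counts.most_common()

def frequency_list(tokens):
    """All word forms in the corpus, ordered by frequency of occurrence."""
    return Counter(t.lower() for t in tokens).most_common()

corpus = "it is time to go home , but there is no time to lose".split()
print(kwic(corpus, "time", span=2))   # ['it is [time] to go', 'is no [time] to lose']
print(collocates(corpus, "time"))     # [('is', 1), ('no', 1)]
print(frequency_list(corpus)[:3])
```

Real concordancers add sorting, more flexible search patterns and character-based context windows, but the underlying logic is essentially the one shown here.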
Table 2-4: KWIC concordance (random sample) of the noun time (BNC BABY)
st sight to take an unconscionably long [time] . A common fallacy is the attempt to as
s with Arsenal . Graham reckons it 's [time] his side went gunning for trophies agai
ough I did n't , he he , I did n't have [time] to ask him what the hell he 'd been up
was really impressed . I think the last [time] I came I had erm a chicken thing . A ch
away . No Ann would have him the whole [time] . Yeah well [unclear] Your mum would n'
arch 1921 . He had been unwell for some [time] and had now gone into a state of collap
te the planned population in five years [time] . So what are you gon na multiply that
tempt to make a coding time and content [time] the same ) . 10 Conclusion I have stres
hearer and for the analyst most of the [time] . Most of the time , things will indeed
in good faith and was reasonable at the [time] it was prepared . The bank gave conside
he had something on his mind the whole [time] . Perhaps he was thinking of his wr
ctices of commercial architects because [time] and time again they come up with the go
nyway . From then on Augusto , at the [time] an economist with the World Bank , and
This may be my last free night for some [time] . I do n't think they 'd be in any
two reasons . Firstly , the passage of [time] provides more and more experience and t
go . The horse was racing for the first [time] for Epsom trainer Roger Ingram having p
imes better and you would do it all the [time] , right Mm I mean basically you say the
ther , we 'll see you in a fortnight 's [time] . Perhaps then , said Viola , who
granny does it and she 's got loads of [time] . She sits there and does them twice as
pig , fattened up in woods in half the [time] and costing well under an eighth of the
ike to be [unclear] like that , all the [time] ! Yeah . I said they wo n't bloody lock
es in various biological groups through [time] are most usefully analysed in terms of
? Er do you want your dinner at dinner [time] or [unclear] No I do n't know what I 'v
But they always around about Christmas [time] . My mam reckons that the You can put t
inversion , i.e. of one descriptor at a [time] , but they are generally provided and e
Let us briefly look at why these functions might be useful in corpus linguistic
research (we will discuss them in much more detail in later chapters).
A concordance provides a quick overview of the typical usage of a particular
word (or more complex linguistic expression). The hits are presented in random
order in Table 2-4, but corpus-linguistic software packages typically allow the
researcher to sort concordances in various ways, for example, by the first word to
the left or to the right; this will give us an even better idea as to what the typical
usage contexts for the expression under investigation are.
Collocate lists are a useful way of summarizing the contexts of a linguistic
expression. For example, the collocate list in the column marked L1 in Table 2-5
will show us at a glance what words typically directly precede time. The
determiners the and this are presumably due to the fact that we are dealing with a
noun, but the adjectives first, same, long, some, last, every and next are related
specifically to the meaning of the noun time; the high frequency of the
prepositions at, by, for and in in the column marked L2 (two words to the left of
the node word time) not only gives us additional information about the meaning
and phraseology associated with the word time, it also tells us that time
frequently occurs in prepositional phrases in general.
Table 2-5: Collocates of time in a span of three words to the left and to the right

L3        L2        L1           R1        R2        R3
for  335  the  851  the   1032   .    950  the  427  .    294
.    322  at   572  this   380   ,    661       168  ,    255
at   292  a    361  first  320   to   351  i    141  the  212
,    227  all  226  of     242   of   258  .    137  to   120
a    170  .    196  same   240   and  223  and  118  a    118
the  130  by   192  a      239   for  190  you  104  it   112
it   121  ,    162  long   224   in   184  it   102  was  107
to   100  of   154  some   200   i    177  a     96  and   92
and   89  for  148  last   180   he   136  he    92  i     86
in    89  it   117  every  134   ?    122  ,     91  you   76
was   85  in    93  in     113   you  120  was   87  in    75
was   50
Finally, frequency lists provide useful information about the distribution of words
(and, in the case of written language, punctuation marks) in a particular corpus.
This can be useful, for example, in comparing the structural properties or typical
contents of different text types (see further Chapter 12). It is also useful in
assessing which collocates of a particular word are frequent only because they
are frequent in the corpus in general, and which collocates actually tell us
something interesting about a particular word.
Note, for example, that the collocate frequency lists on the right side of the word
time are more similar to the general frequency list than those on the left side,
suggesting that the noun time has a stronger influence on the words preceding it
than on the words following it (see further Chapter 9).
Given the widespread implementation of these three techniques, they are
obviously central to corpus linguistics research, so we might amend the definition
above as follows (a similar definition is implied by Kennedy 1998: 244ff.):
Two problems remain with this definition. The first problem is that the
requirements of systematicity and exhaustiveness that were introduced in the
second definition are missing. This can be remedied by combining the second and
third definition as follows:
On the one hand, this definition subsumes the previous two definitions: If we
assume that corpus linguistics is essentially the study of the distribution of
linguistic phenomena in a linguistic corpus, we immediately understand the
central role of the techniques described above: (i) KWIC concordances are
essentially a way of displaying the distribution of a lexical item across different
syntagmatic contexts; (ii) collocations summarize the distribution of lexical items
with respect to other lexical items in quantitative terms; and (iii) frequency lists
summarize the overall quantitative distribution of lexical items in a given corpus.
On the other hand, the definition is not limited to these techniques but can be
applied open-endedly at all levels of language and to all kinds of distributions.
This definition is close to the understanding of corpus linguistics that this book
will advance, but it must still be narrowed down somewhat.
First, it must not be misunderstood to suggest that studying the distribution of
linguistic phenomena is an end in itself in corpus linguistics. Fillmore (1992: 35)
notes the following:

introduction to this field can be found in Manning and Schütze 2000, especially
the overview in Chapter 1).
Stochastic natural language processing and corpus linguistics are closely
related fields that have frequently profited from each other (see, e.g., Kennedy
1998: 5); it is understandable, therefore, that they are sometimes conflated (see,
e.g., Sebba and Fligelstone 1994: 769). However, the two disciplines are best
regarded as overlapping but separate research programs with very different
research interests.
Corpus linguistics, as its name suggests, is part of linguistics and thus focuses
on linguistic research questions that may include, but are in no way limited to,
the stochastic properties of language. Adding this perspective to our definition,
we get the following:
Definition (Fourth attempt, linguistic interpretation)
Corpus linguistics is the investigation of linguistic research questions based
on the systematic and exhaustive analysis of the distribution of linguistic
phenomena in a linguistic corpus.
To arrive at our final definition, it remains for us to specify what it actually
means to investigate the distribution of linguistic phenomena. Much of the next
chapter will be concerned with this issue, so the following short discussion will
only set the stage for the more detailed discussion to follow.
Let us assume we have noticed that English speakers use two different words
for the front window of cars: some say windscreen, some say windshield. It is a
genuinely linguistic question what factor or factors explain this variation. In line
with the definition above, we would now try to determine their distribution in a
corpus. Since the words are not very frequent, assume that we combine four
corpora that we happen to have available, namely the BROWN, FROWN, LOB
and FLOB corpora.
given that they refer to a type of window. The distributional fact that the two
words occur in the same types of grammatical contexts is more enlightening: it
suggests that we are dealing with exact synonyms. However, it does not provide
an answer to the question why there should be two words for the same thing.
It is only when we look at the distribution in the four corpora that we find
that answer: windscreen occurs exclusively in the LOB and FLOB corpora, while
windshield occurs exclusively in the BROWN and FROWN corpora. The first two
are corpora of British English, the second two are corpora of American English;
thus, we can conclude that we are probably dealing with dialectal variation. In
other words: we had to investigate differences in the distribution of linguistic
phenomena under different conditions in order to answer our research question.
Taking this into account, we can now posit the following final definition of
corpus linguistics:
Definition (Final Version)
Corpus linguistics is the investigation of linguistic research questions that
have been framed in terms of the conditional distribution of linguistic
phenomena in a linguistic corpus.
Note that in the simple example presented here, the conditional distribution is an
absolute one: each word occurs exclusively under one of the two conditions.
These two distributions are, of course, only the limiting cases: Two (or more)
words (or other linguistic phenomena) may also show relative differences in their
distribution across conditions. For example, the words railway and railroad show
clear differences in their distribution across the combined corpus used above:
railway occurs 118 times in the British part compared to only 16 times in the
American part, while railroad occurs 96 times in the American part but only 3
times in the British part. Clearly, this tells us something very similar about the
words in question: they also seem to be dialectal variants, even though the
difference between the dialects is gradual rather than absolute in this case. Given
that very little is absolute when it comes to human behavior, it will come as no
surprise that gradual differences in distribution will turn out to be much more
common in language (and thus, more important to linguistic research) than
absolute differences; later chapters will discuss this issue at length. For now,
note that both kinds of conditional distributions are covered by the final version
of our definition.
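In computational terms, a conditional distribution of this kind is simply a frequency count of the relevant forms, broken down by condition. The sketch below uses invented toy token lists; the labels and counts are illustrative only and do not reflect the actual LOB/FLOB and BROWN/FROWN data:

```python
from collections import Counter

def conditional_distribution(subcorpora, words):
    """Cross-tabulate the frequency of each word under each condition
    (here: each variety, represented by a token list)."""
    table = {}
    for condition, tokens in subcorpora.items():
        counts = Counter(t.lower() for t in tokens)
        table[condition] = {w: counts[w] for w in words}
    return table

# Invented toy data standing in for the British and American subcorpora.
subcorpora = {
    "BrE": ["the", "railway", "station", "railway", "line"],
    "AmE": ["the", "railroad", "crossing", "railway"],
}
print(conditional_distribution(subcorpora, ["railway", "railroad"]))
# {'BrE': {'railway': 2, 'railroad': 0}, 'AmE': {'railway': 1, 'railroad': 1}}
```

Whether the resulting distribution is absolute (zeros off the diagonal) or merely gradual can then be read directly off the table.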
Note also that many of the aspects that were proposed as defining criteria in
previous definitions need no longer be included once we adopt our final version,
since they are presupposed by this definition: conditional distributions (whether
they differ in relative or absolute terms) are only meaningful if they are based on
the complete data base (hence the criterion of exhaustiveness); conditional
distributions can only be assessed if the data are carefully categorized according
to the relevant conditions (hence the criterion of systematicity); distributions
(especially relative ones) are more reliable if they are based on a large data set
(hence the preference for large electronically stored corpora that are accessed via
appropriate software applications); and often, but not always, the standard
procedures for accessing corpora (concordances, collocate lists, frequency lists) are
a natural step towards identifying the relevant distributions in the first place.
However, these preconditions are not ends in themselves, and hence they cannot
themselves form the defining basis of a methodological framework: they are
motivated only by the definition just given.
In conclusion, note that our final definition does distinguish corpus linguistics
from other types of observational methods, such as text linguistics, discourse
analysis, variationist sociolinguistics, etc.; however, it does so in a way that allows
Further Reading
Further Reading
3 Corpus Linguistics as a Scientific Method
Having defined corpus linguistics as the investigation of linguistic research
questions that have been framed in terms of the conditional distribution of
linguistic phenomena in a linguistic corpus, the remainder of this book will
essentially spell out this definition in detail. In this chapter, we will take a closer
look at the notion of research (in Section 3.1), and discuss what it means to frame a
research question in terms of a conditional distribution (in Sections 3.2 and 3.3).
This requires an understanding of scientific methodology that may be
unfamiliar to at least some linguists and that is often not made explicit even in
the work of those linguists who work empirically in the scientific sense.
Research in the natural and social sciences broadly follows a cyclic process of
hypothesis formulation and testing, as described by the Austrian-British
philosopher of science Karl Popper (a readable overview is found in Popper 1963,
esp. 33ff.). Put simply, the researcher first identifies and delineates some object of
research (some phenomenon or set of related phenomena in the natural or social
world). Second, the researcher formulates a hypothesis about this object of
research, typically by postulating a relationship between at least two phenomena
(at least in the classic Popperian paradigm, see further below). Third, the
researcher specifies procedures for the identification and classification of the
relevant phenomena in such a way that the hypothesis can be tested (this is called
operationalization), and formulates predictions. Finally, these predictions are
subjected to a test. If the test is negative (i.e., does not match the prediction), the
hypothesis must be modified or discarded and replaced by a new hypothesis; if
the test is positive (i.e., does match the prediction), it may lead to refinements or
additional hypotheses. This research cycle is shown schematically in Figure 3-1.
Figure 3-1: The research cycle (object of research → hypothesis → test → formulation of new/modified/additional hypotheses)
Any research project must, of course, begin with the choice of some object of
research
for the description of every aspect of the object of research; this is certainly the
case in linguistics, which is still very much an evolving discipline. In this case, it
sometimes referred to as the deductive approach, is generally seen as the standard
way of conducting scientific research (although it needs to be restated in a way
that allows gradual distributions; see, again, Section 3.3 below).
Alternatively, the research issue can be stated in terms of a general research
question of the type What (if any) is the relationship between X and Y? or even
just What relationships exist between the constructs in my data?. This approach,
referred to as an inductive or exploratory approach, is not traditionally accepted as
scientific, but it is becoming increasingly common, especially where
researchers have access to large amounts of data and tools that can search and
correlate these data in a reasonable time frame (big data). In corpus linguistics,
this is certainly the case, and thus both the deductive and the inductive approach
have their place in the discipline. In this chapter, we will assume a deductive
perspective, but we will return to the issue of inductive/exploratory research in
the second part of this book, in particular in Chapter 9.
As mentioned above, a theoretical construct can play two different roles in the
statement of a research issue (regardless of whether it is stated as a hypothesis or
as a question). It can represent the phenomenon that we want to explain (the
explicandum); in linguistics, this is typically some aspect of language use or
language structure. Or it can be the factor in terms of which we want to explain
that phenomenon (the explicans); in linguistics, this may be some other aspect of
language use (such as the presence or absence of a particular linguistic structure,
a particular position in a discourse, etc.), or some language-external factor (such
as the speaker's sex or age, the relationship between speaker and hearer, etc.).
In empirical research, the explicandum is referred to as the dependent variable
and the explicans as the independent variable (note that these terms are actually
NEUTER), and there are languages with even more values for this variable. The
variable VOICE has two to four values in English, depending on the way that this
construct is defined in a given model (most models of English would see ACTIVE and
PASSIVE as values of the variable VOICE, some models would also include the MIDDLE
construction, and a few models might even include the ANTIPASSIVE).
A research issue that is stated in terms of two variables and their relationship
can be visualized as a cross table (also referred to as a contingency table), with each
dimension of the table as one of the variables and each cell as one of the possible
intersections (i.e., combinations) of the values of the two variables. Such a table is
shown in Table 3-1 (note that there is a useful convention, albeit not one very

                               DEPENDENT VARIABLE (DV)
                               VALUE 1          VALUE 2
INDEPENDENT     VALUE 1        Intersection     Intersection
VARIABLE (IV)                  IV1 & DV1        IV1 & DV2
                VALUE 2        Intersection     Intersection
                               IV2 & DV1        IV2 & DV2
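Such a cross table can be built mechanically from data: if every observation extracted from a corpus is recorded as a pair of values (one for the independent, one for the dependent variable), each cell of the table is simply the count of one combination of values. A minimal sketch, using invented observations for illustration:

```python
from collections import Counter

def contingency_table(observations):
    """Build a cross table from (independent, dependent) value pairs."""
    cells = Counter(observations)
    iv_values = sorted({iv for iv, _ in observations})
    dv_values = sorted({dv for _, dv in observations})
    return {iv: {dv: cells[(iv, dv)] for dv in dv_values} for iv in iv_values}

# Invented (VARIETY, WORD) observations for illustration only.
obs = ([("BrE", "windscreen")] * 3
       + [("AmE", "windshield")] * 4
       + [("AmE", "windscreen")])
print(contingency_table(obs))
```

Each inner dictionary corresponds to one row of the cross table, so absolute and gradual distributions alike can be read off directly.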
Since hypotheses play such a central role in the Popperian framework of doing
science, we might expect that there is a large number of criteria that a hypothesis
has to meet in order to qualify as the basis for a scientific research project.
Actually, there is just a single criterion, dictated by the logic of the Popperian
research paradigm: a scientific hypothesis must be testable. Of course, in the real
world there will be additional criteria that will often be relevant. For example, a
hypothesis may have to be seen as addressing some relevant aspect of the current
state of knowledge in a given research tradition, or it may even have to conform
to discipline-external criteria such as the requirements of funding agencies. But
from the viewpoint of scientific methodology, such criteria are irrelevant.
Crucially, in order to count as testable in the scientific sense, a hypothesis must
be falsifiable, i.e., it must be stated in a way that allows us to show that it is false.
As researchers, it is then our task to try and falsify the hypothesis.
To a novice to scientific methodology, this may come as a surprise.
Spontaneously, one might assume the exact opposite: that a hypothesis must be
formulated in such a way that it is obvious how it can be shown to be true, and
that it is the task of a researcher to verify their hypotheses.
However, there are at least two reasons why no amount of positive evidence
could ever show a hypothesis to be true, and thus, why this procedure would not
be feasible. First, for any given piece of evidence that seems to support our
hypothesis, there are always several alternative hypotheses that explain the
evidence just as well as our research hypothesis. Second, and even more
importantly, no matter how much evidence we adduce in favor of our hypothesis,
the next piece of evidence we come across may falsify it. Worse, if we aim our
Take the example of the words windscreen and windshield, discussed in the
previous chapter. Let us assume that we have already formulated the hypothesis
that the first of these words is British English and the second one is American
English. First, note that this is a perfect example of a scientific hypothesis: It
identifies two variables (VARIETY and WORD FOR FRONT WINDOW OF CAR) with two
values each (BRITISH ENGLISH vs. AMERICAN ENGLISH, and WINDSCREEN vs. WINDSHIELD) and it
posits a relationship between them such that which word is used depends on
which variety is spoken (specifically, that WORD FOR FRONT WINDOW OF CAR will take
the value WINDSCREEN if VARIETY is BRITISH ENGLISH and the value WINDSHIELD if VARIETY is
AMERICAN ENGLISH). Table 3-2 is a visual representation.
Table 3-2. A contingency table with binary values for the intersections

                      DV: FRONT WINDOW OF CAR
                      WINDSCREEN     WINDSHIELD
IV: VARIETY    BRE    ✓
               AME                   ✓
How do we test this hypothesis empirically? It predicts that we should find
examples of windscreen in British English and examples of windshield in
American English (the intersections marked by checkmarks in Table 3-2), so we
might be tempted to confirm it by querying a corpus of American English (such
as the BROWN corpus) for the string ⟨windshield⟩ and a corpus of British
English (such as the LOB corpus) for the string ⟨windscreen⟩ (I will enclose
corpus queries in angled brackets from now on as a reminder that we are never,
strictly speaking, searching for words in a corpus, but for strings of characters of
Instead, we must try to falsify our hypothesis by looking for cases that should
not exist according to our hypothesis. In the present case, we have to look for
occurrences of the string ⟨windscreen⟩ in American English and ⟨windshield⟩
in British English.
In order to test a hypothesis, we have to specify cases that should not occur if the
hypothesis were true, and then do our best to find such cases. The harder we try
to find such cases but fail to do so, the more certain we will be that our
hypothesis is true. But no matter how hard we look, we must learn to accept that
we can never be absolutely certain: in science, a fact is simply a hypothesis that
has not yet been falsified. This may seem disappointing, but science has made
significant advances despite (or perhaps because of) the fact that scientists accept
that there is no certainty when it comes to truth. In contrast, a single
counterexample will give us the certainty that our hypothesis is false.
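The falsification strategy just described can be stated procedurally: instead of querying for the hits the hypothesis predicts, we query for the hits it forbids. The sketch below uses invented toy data (the token lists do not reflect any actual corpus), and any hit it returns is only a potential counterexample that must still be inspected manually:

```python
def counterexamples(subcorpora, hypothesis):
    """Return hits that should not exist if the hypothesis were true.
    `hypothesis` maps each condition to the word it predicts there;
    a hit for one of these words under the *wrong* condition is a
    potential counterexample."""
    predicted = set(hypothesis.items())
    hits = []
    for condition, tokens in subcorpora.items():
        for i, tok in enumerate(tokens):
            if tok in hypothesis.values() and (condition, tok) not in predicted:
                hits.append((condition, i, tok))
    return hits

# Invented toy data: one stray occurrence of petrol on the American side.
subcorpora = {
    "BrE": ["the", "petrol", "station"],
    "AmE": ["a", "gasoline", "pump", "a", "petrol", "station"],
}
hypothesis = {"BrE": "petrol", "AmE": "gasoline"}
print(counterexamples(subcorpora, hypothesis))  # [('AmE', 4, 'petrol')]
```

As the discussion below shows, every hit returned by such a search must then be checked by hand before it counts as a genuine counterexample.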
In the classical Popperian framework, counterexamples (i.e., negative
outcomes of an experiment) play a crucial role, and traditional linguistic
argumentation on the basis of intuited examples also accords a central role to the
counterexample (or at least claims to do so). In corpus linguistics,
counterexamples have not played a major role, but this is due not so much to the
fact that they cannot play a useful role (cf., for example, Meurers 2005 and
Meurers and Müller 2009 for excellent examples of how they can inform syntactic
argumentation), but rather to the fact that corpus-linguistic research questions do
not lend themselves to the kinds of categorical hypotheses that make
counterexamples meaningful.
Nevertheless, a brief discussion of counterexamples will be useful both in its
own right and in setting the stage for the next section. For expository reasons, I
will continue to use the case of dialectal variation as an example, but the issues
discussed apply to all corpus-linguistic research questions.
In the case of windscreen and windshield, there are no counterexamples in the
corpora used, but what if there had been one? Take another well-known case of a
lexical difference between British and American English: the distilled petroleum
used to fuel cars is referred to as petrol in British English and gasoline in
American English. A search in the four corpora used above yields the frequencies
of occurrence shown in Table 3-3.
Table 3-3.

                      PETROL    GASOLINE
IV: VARIETY    BRE    21        0
               AME    1         20
In other words, the distribution is almost identical to that for the words
windscreen and windshield, except for one counterexample, where petrol occurs
in the American part of the corpus (specifically, in the FROWN corpus). That is,
it seems that our hypothesis is falsified, at least with respect to the word
petrol. Of course, this is true only if we are genuinely dealing with a
counterexample, so let us take a closer look at the example in question, which
turns out to be from the novel Eye of the Storm by Jack Higgins:
(3.1) He was in Dorking within half an hour. He passed straight through and
continued toward Horsham, finally pulling into a petrol station about five
Now, Jack Higgins is a pseudonym used by the novelist Harry Patterson for some
of his novels, and Patterson is British (he was born in Newcastle upon Tyne, and
grew up in Belfast and Leeds). In other words, his novel was erroneously included
in the American corpus.
theoretically check all examples, as there are only 42 examples overall. However,
the larger our corpus is (and most corpus-linguistic research requires corpora that
are much larger than the four million words used here), the less feasible it
becomes to do so. The second problem is that we were lucky, in this case, that the
counterexample came from a novel by a well-known author, whose biographical
information is easily available. But linguistic corpora do not (and cannot) contain
only well-known authors, and so checking the individual demographic data for
every speaker in a corpus may be difficult to impossible. Finally, some types of
language cannot be attributed to a single speaker at all: speeches are often
written by a team of speech writers that may or may not include the person
delivering the speech, newspaper articles may include text from a number of
journalists and press agencies, published texts in general are typically proofread
by people other than the author, and so forth.
Let us look at a more complex example, the words for the (typically elevated)
paved strip at the side of a road provided for pedestrians. Dictionaries typically
tell us that this is called pavement in British English and sidewalk in American
English, for example, the OALD:
(3.2a) pavement noun [...]
1 [countable] (British English) (North American English sidewalk) a flat
A search of the four corpora used above shows a distribution of the two words (in
all their potential orthographic variants) across the two varieties, as shown in
Table 3-4a.
Table 3-4a.

                      PAVEMENT    SIDEWALK
IV: VARIETY    BRE    37          4
               AME    22          43
In this case, we are not dealing with a single counterexample. Instead, there are
four apparent counterexamples where sidewalk occurs in British English, and 22
apparent counterexamples where pavement occurs in American English.
In the case of sidewalk, it seems at least possible that a closer inspection of the
four cases in British English would show them to be only apparent
counterexamples, due, for example, to misclassified texts. In the case of the 22
cases of pavement in American English, this is less likely. Let us look at both cases
in turn.
Here are the four examples of sidewalk in British English, along with the
author and title of the original source as quoted in the manuals of the
corresponding corpora:
(3.3a) One persistent taxi follows him through the street, crawling by the
sidewalk
[LOB E09: Wilfrid T. F. Castle, Stamps of Lebanon's Dog River]
(3.3b) Keep that black devil away from Rusty or you'll have a sick horse on
Not much can be found about Wilfrid Castle, other than that he also wrote
several books about English parish churches, making it seem likely that he was
British. Bert Cloos is a pseudonym for Bertram P. Cloos, and there is a US army
record11 from 1940 for a Bertram Cloos whose civilian occupation is listed as
authors, editors and reporters. If this is the same person, then (3.3b) would be
due to a misclassified text by an American author in the British corpus, but there
is no way of determining this.
For Linda Waterman and Christine McNeill, no biographical information can
be found at all. Waterman's story was published in a British student magazine,
but this in itself is no evidence of anything. The story is set in Latin America, so
there may be a conscious effort to evoke American English. In McNeill's case
there is some evidence that she is British: she uses some words that are typically
British, such as dressing gown (AmE (bath)robe) and breadbin (AmE breadbox), so
it is plausible that she is British. Like Waterman's story, hers was published in a
British magazine. Interestingly, however, the scene in which the word is used is
set in the United States, so she, too, might be consciously evoking American
English. To sum up, we have one example that was likely produced by an
American speaker, two that were likely produced by British speakers (although
one was probably evoking American English), and one about whose author we
cannot say anything, but which may also have been evoking American English.
Which of these, if any, we may safely discount is difficult to say.
Turning to pavement in American English, it would be possible to check the
origin of the speakers of all 22 cases with the same attention to detail, but it is
questionable whether the results would be worth the time invested: as pointed out,
it is unlikely that there are so many misclassified examples in the American
corpora.
On closer inspection, however, it becomes apparent that we may be dealing
with a different type of exception here: the word pavement has additional senses
to the one cited in (3.2a) above, one of which does exist in American English.
Here is the remainder of the dictionary entry cited in (3.2a):
Since neither of these meanings is relevant for the issue of British and American
words for pedestrian paths next to a road, they cannot be treated as
counterexamples in our context. In other words, we have to look at all hits for
pavement and annotate them for their appropriate meaning. This in itself is a
non-trivial task, which we will discuss in more detail in Chapters 4 and 5. Take
the example in (3.5):
(3.5) [H]e could see the police radio car as he rounded the corner and slammed
on the brakes. He did not bother with his radio – there would be time for
that later – but as he scrambled out on the pavement he saw the filling
station and the public telephone booth [BROWN L 18]
Even with quite a large context, this example is compatible with a reading of
pavement as 'road surface' or as 'pedestrian path'. If it came from a British text,
we would not hesitate to assign the latter reading, but since it comes from an
American text (the novel Error of Judgment by the American author George
Harmon Coxe), we might lean towards erring on the side of caution and annotate
it as 'road surface'. Alas, the side of caution here is the side suggested by the
very hypothesis we are trying to falsify: we would be basing our categorization
circularly on what we are expecting to find in the data.
A more intensive search of novels by American authors in the Google Books
archive (which is vastly larger than the BROWN corpus) turns up clear cases of
the word pavement with the meaning of 'sidewalk', for example, this passage from
(3.6) He had fallen asleep in his buggy, and had wakened to find old Nettie
drawing him slowly down the main street of the town, pursuing an
erratic but homeward course, while the people on the pavements watched
3 Corpus Linguistics as a Scientific Method
VARIETY      PAVEMENT   SIDEWALK
BRE              37          2
AME               1         43
Given this distribution, would we really want to claim that it is wrong to assign
pavement to British and sidewalk to American English on the basis that there are
a few possible counterexamples? More generally, is falsification by
counterexample a plausible research strategy for corpus linguistics?
There are several reasons why the answer to this question must be no. First,
we can rarely say with any certainty whether we are dealing with true
counterexamples or whether the apparent counterexamples are due to errors in
the construction of the corpus or in our classification. From a theoretical
perspective, this may not count as a valid objection to the idea of falsification by
counterexample. We may argue that we simply have to make sure that there are
no errors in the construction of our corpus and that we have to classify all hits
correctly. However, in actual practice this is impossible. We can (and must) try to
minimize errors in our data and our classification, but we can never get rid of
them completely (this is true not only in corpus linguistics but in any discipline).
Second, even if our data and our classification were error-free, human
behavior is less deterministic than the physical processes Popper had in mind
when postulating his model of scientific research. Even in a simple case like word
choice, there may be many reasons why a speaker may produce an exceptional
utterance: evoking a variety other than their own (as in the examples above),
unintentionally or intentionally using a word that they would not normally use
because their interlocutor has used it, temporarily slipping into a variety that
they used to speak as a child but no longer do, etc. With more complex linguistic
behavior, such as producing particular grammatical structures, there will be
additional reasons for exceptional behavior: planning errors, choosing a different
formulation in mid-sentence, tiredness, etc. – all the kinds of things classified as
'performance errors' in traditional grammatical theory.
To sum up, our measurements will never be perfect and speakers will never
behave perfectly consistently. This means that we cannot use a single
counterexample (or even a handful of counterexamples) as a basis for rejecting a
hypothesis, even if that hypothesis is stated in absolute terms in the first place,
i.e., even if our hypothesis predicts that some intersections of our variables will
occur in the data while others will not.
There is a third reason why falsification by counterexample is not a plausible
research strategy: As mentioned in Chapter 2 in connection with our final
definition of corpus linguistics, hypotheses in linguistics are frequently not stated
in absolute terms ('All Xs are Y', 'Zs always do Y', etc.), but in relative terms, i.e.,
tendencies or preferences ('Xs tend to be Y', 'Zs prefer Y', etc.). For example,
there are a number of prepositions and/or adverbs in English that contain the
morpheme -ward or -wards, such as afterward(s), backward(s), downward(s),
inward(s), outward(s) and toward(s). These two morphemes are essentially
allomorphs of a single suffix that are in free variation: they have the same
etymology (-wards simply includes a lexicalized genitive ending), they have both
existed throughout the recorded history of English and there is no discernible
difference in meaning between them. However, many dictionaries point out that
the forms ending in -s are preferred in British English and the ones without the -s
are preferred in American English.
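A preference claim like this can be checked by simply counting the two variants in each corpus. The sketch below tallies -ward against -wards forms in a token list; the token sequence is an invented illustration, not corpus data.

```python
import re

# Matches the adverbs/prepositions under discussion; group 2 captures
# the optional final -s that distinguishes the two allomorphs.
WARD = re.compile(r"(after|back|down|in|out|to)ward(s?)")

def count_ward_forms(tokens):
    """Tally the -ward and -wards variants in a list of tokens."""
    counts = {"-ward": 0, "-wards": 0}
    for token in tokens:
        m = WARD.fullmatch(token.lower())
        if m:
            counts["-wards" if m.group(2) else "-ward"] += 1
    return counts

tokens = "He looked backwards and then ran toward the door".split()
print(count_ward_forms(tokens))  # {'-ward': 1, '-wards': 1}
```

Run over the British and the American corpora separately, such counts yield exactly the kind of two-by-two frequency table shown above for pavement and sidewalk.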
(3.7a) [T]he tall young buffalo hunter pushed open the swing doors and walked
[BROWN K]
VARIETY      -WARD   -WARDS
BRE
AME
We will return to the issue of how to phrase predictions in quantitative terms in
Chapter 6. Of course, phrasing predictions in quantitative terms raises additional
questions: How large must a difference in quantity be in order to count as
evidence in favor of a hypothesis that is stated in terms of preferences or
tendencies? And, given that our task is to try to falsify our hypothesis, how can
this be done if counterexamples cannot do the trick? In order to answer such
questions, we need a different approach to hypothesis testing, namely statistical
hypothesis testing. This approach will be discussed in detail in Chapters 7 and 8.
There is another issue that we must turn to first, though: that of defining our
variables and their values in such a way that we can identify them in our data.
We saw even in the simple cases discussed above that this is not a trivial matter.
For example, we defined American English as the language occurring in the
BROWN and FROWN corpora, but we saw that the FROWN corpus contains at
least one misclassified text by a British author. And even if this was not the case,
nobody would want to claim that our definition was accurate: clearly, the corpora
contain only a tiny fraction of the language produced by American speakers of
Further Reading
A readable exposition of Popper's ideas about falsification is Popper's essay
'Science as falsification', in Popper (1963).
Popper, Karl R. (1963). Conjectures and refutations: the growth of scientific
knowledge. London & New York: Routledge and Kegan Paul.
4 Operationalization
The examples discussed in the preceding chapter all dealt with the association
between seemingly simple and uncontroversial aspects of language structure and
use (lexical items and broadly defined language varieties), and yet we saw that
they were not trivial to deal with even in the context of a simple research
question such as which words are used in which variety. More often than not,
corpus-linguistic research issues will be considerably more complex, requiring a
much more sophisticated approach even to such seemingly simple notions, and
additionally also including more abstract, theory-dependent features of language.
While there are cases where even abstract features manifest themselves relatively
directly as classes or configurations of concrete elements, in many cases they are
reflected in the corpus data only indirectly or not at all.
In either case, most features of language structure or use cannot be
1. operationalization: the values of these variables must be defined in such a
way that they can be reliably measured (i.e., identified in a corpus);
2. retrieval: the corpus must be searched in such a way that all data points
manifesting the variables in question are found;
3. coding: the data must be coded for the values that our variable may take.
For example, in order to answer the research question of how windscreen and
windshield are distributed across British and American English, the variable
VARIETY was operationally defined such that anything found in the LOB or FLOB
corpus counted as BRITISH ENGLISH and anything found in the BROWN and FROWN
corpora counted as AMERICAN ENGLISH; the variable WORD FOR THE FRONT WINDOW OF A
CAR was operationally defined in terms of two sets of orthographic strings
containing all ways in which the words windscreen and windshield could
conceivably be spelled in the corpus (for example, WINDSCREEN, WIND SCREEN,
WIND-SCREEN, Windscreen, Wind screen, Wind Screen, Wind-screen, Wind-Screen,
windscreen, wind screen, wind-screen for WINDSCREEN). These orthographic strings
were then retrieved from all four corpora using a computer program that went
through each corpus word by word, checking for each word whether it
corresponded to one of these orthographic strings and saving it to a file if it did.
Each hit was then coded as BRITISH ENGLISH if it occurred in the LOB or FLOB
corpus and as AMERICAN ENGLISH if it occurred in the BROWN or FROWN corpus,
and each hit was coded as WINDSHIELD or WINDSCREEN depending on which of these
two words the orthographic string represented.
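The retrieval-and-coding procedure just described can be sketched in a few lines of Python. The corpus contents here are invented placeholders; a real study would read the tokenized LOB, FLOB, BROWN and FROWN files instead.

```python
import re

# Hypothetical mini-corpora standing in for the tokenized LOB/FLOB
# (British) and BROWN/FROWN (American) files. Note that a real search
# would also have to handle two-token spellings like "wind screen",
# which a token-by-token match such as this one will miss.
corpora = {
    "LOB":   ["the", "wind-screen", "wipers", "failed"],
    "BROWN": ["the", "windshield", "was", "cracked"],
}

# One pattern per lexeme, covering the one-token spelling variants.
patterns = {
    "WINDSCREEN": re.compile(r"wind-?screen", re.IGNORECASE),
    "WINDSHIELD": re.compile(r"wind-?shield", re.IGNORECASE),
}

BRITISH = {"LOB", "FLOB"}

def retrieve_and_code(corpora):
    """Go word by word through each corpus, save every hit, and code it
    for VARIETY (by source corpus) and WORD (by matched pattern)."""
    hits = []
    for corpus, words in corpora.items():
        variety = "BRITISH ENGLISH" if corpus in BRITISH else "AMERICAN ENGLISH"
        for word in words:
            for lexeme, pattern in patterns.items():
                if pattern.fullmatch(word):
                    hits.append({"form": word, "word": lexeme, "variety": variety})
    return hits

for hit in retrieve_and_code(corpora):
    print(hit)
```

The coded hits can then be cross-tabulated by VARIETY and WORD to produce the kind of frequency table used in Chapter 3.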
If this seems like an overly complex way of describing the steps necessary to
answer the research question, this is because we tend to take much of this
implicitly for granted. Indeed, even in the corpus-linguistic research literature,
the research design is often described in much less detail, leaving it to the reader
to fill in the gaps. While this may be possible in a (seemingly) straightforward
research design such as the one described here, it becomes impossibly complex in
the case of more advanced research designs. In this chapter, we will therefore
take a close look at the first of these steps; in the next chapter we will deal with
the other two.
(4.1) 1 FIRM TO TOUCH firm, stiff, and difficult to press down, break, or cut [≠ soft]
(LDCE, s.v. hard, cf. also the virtually identical definitions in CALD, MW
and OALD).
provide to these questions (cf. Konrad 2011 and Wiederhorn et al. 2011 for a
discussion of hardness tests). One commonly used group of tests is based on the
size of the indentation that an object (the indenter) leaves when pressed into
some material. One such test is the Vickers Hardness Test, which specifies the
indenter as a diamond with the shape of a square-based pyramid with an angle of
136°, and the test force as ranging from 49.3 to 980.7 newtons (the exact
force must be specified every time a measurement is reported). Hardness is then
defined as follows (cf. Konrad 2011: 43):
(4.2) HV = (0.102 × F) / A

where F is the load in newtons and A is the surface of the indentation (0.102 is a
constant that converts newtons into kiloponds; this is necessary because the
Vickers Hardness Test used to measure the test force in kiloponds before the
newton became the internationally recognized unit of force).
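Stated as code, the definition in (4.2) is a one-line calculation. The sample force and indentation area below are made up for illustration, not taken from the text.

```python
def vickers_hardness(force_newtons, area_mm2):
    """HV = 0.102 * F / A: F is the test force in newtons, A the
    surface area of the indentation in square millimetres; 0.102
    converts newtons into kiloponds."""
    return 0.102 * force_newtons / area_mm2

# Illustrative values: a 98.07 N test force leaving an indentation
# with a surface area of 0.05 square millimetres:
print(round(vickers_hardness(98.07, 0.05), 1))  # 200.1
```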
Unlike the dictionary definition quoted above, Vickers Hardness (HV) is an
given patient. But clearly, just like the hardness of materials, mental disorders are
not directly accessible. Consider, for example, the following dictionary definition
of schizophrenia:
(4.3) a serious mental illness in which someone cannot understand what is real
and what is imaginary (CALD, s.v. schizophrenia, see again the very
similar definitions in LDCE, MW and OALD).
Although matters are actually significantly more complicated, let us assume that
this definition captures the essence of schizophrenia. As a basis for diagnosis, it is
certain behaviors. For example, the fourth edition of the Diagnostic and Statistical
Manual of Mental Disorders (DSM-IV), used by psychiatrists and psychologists in
hardness, which is partly due to the fact that human behavior is more complex
and less comprehensively understood than the mechanical properties of
materials, and partly due to the fact that psychology and psychiatry are less
developed disciplines than physics. However, it is an operational definition in the
sense that it effectively presents a check-list of observable phenomena that is
used to determine the presence of an unobservable phenomenon. As in the case
of hardness tests, there is no single operational definition: the International
Statistical Classification of Diseases and Related Health Problems offers a different
definition that overlaps with that of the DSM-IV but places more emphasis on
(and is more specific with respect to) mental symptoms and less emphasis on
social behaviors.
As should have become clear, operational definitions do not (and do not
attempt to) capture the essence of the things or phenomena they define. We
cannot say that the Vickers Hardness number is hardness or that the DSM-IV
list of symptoms is schizophrenia. They are simply ways of measuring or
diagnosing these phenomena. Consequently, it is pointless to ask whether
operational definitions are correct or incorrect: they are simply useful in a
particular context. However, this does not mean that any operational definition is
as good as any other. A good operational definition must have two properties: it
must be reliable and valid.
disorder by the DSM-II until 1974). Thus, operational definitions may create the
construct they are merely meant to measure; it is therefore important to keep in
mind that even a construct that has been operationally defined is still just a
construct, i.e. part of a theory of reality rather than part of reality itself.
this does not mean that researchers are necessarily aware that this is what they
are doing. In corpus-based research, we find roughly three different situations:
seems justified to simply apply it without any discussion at all, for example,
when identifying occurrences of a particular word like sidewalk. But even here, it
is important to state explicitly which orthographic strings were searched for and
why. As soon as matters get a little more complex, implicitly applied definitions
are unacceptable, because unless we state exactly how we identified and
categorized a particular phenomenon, nobody will be able to interpret our results
correctly, let alone replicate them or test them on a different set of data. For
example, the English possessive construction is a fairly simple and
uncontroversial grammatical structure. In written English it consists either of the
sequence [noun + apostrophe + (one or more adjectives) + noun], or [possessive
pronoun + (one or more adjectives) + noun] (the parentheses denote optionality).
This sequence seems easy enough to identify in a corpus, so a researcher studying
the possessive may not even mention how they defined this construction. The
following examples show that matters are more complex, however:
(4.4a) We are a women's college, one of only 46 women's colleges in the United
States and Canada [womenscollege.du.edu]
(4.4b) An indictment was handed up Monday in connection with weekend raids
at Danny's Family Car Wash locations.
(4.4c) Our restaurant is situated in St. Paul's Square, Birmingham.
[pastadipiazza.com]
(4.4d) Such a student may arrange with his/her Registrar to submit a letter of
good standing to Jacksonville University [FROWN H]
(4.4e) My Opera was the virtual community for Opera web browser users.
[Wikipedia, s.v. My Opera]
While all of these cases have the form of the possessive construction and match
the strings above, opinions may differ on whether they should be included in a
sample of English possessive constructions. Example (4.4a) is a so-called
possessive compound, a lexicalized possessive construction that functions like a
conventional compound and could be treated as a single word. Example (4.4b) is a
proper name, just like (4.4e). Concerning the latter: if we want to include it, we
would have to decide whether also to include proper names where possessive
pronoun and noun are spelled as a single word, as in MySpace (the name of an
online social network). Example (4.4c) is a geographical name; here, the problem
is that such names are increasingly spelled without an apostrophe, often by
conscious decisions by government institutions (see Swaine 2009, Newman 2013).
If we want to include them, we have to decide whether also to include spellings
without the apostrophe (such as We have been established for over 20 years in the
prestigious St. Pauls Square area of Birmingham [henrysrestaurant.co.uk]) and
how to find them in the corpus. Finally, (4.4d) is a regular possessive construction,
but it contains two pronouns separated by a slash; we would have to decide
whether to count these as one or as two cases of the construction.
These are just some of the problems we face even with a very simple
grammatical structure. Thus, if we were to study the possessive construction (or
any other structure), we would have to state precisely which potential instances
of a structure we include. In other words, our operational definition needs to
include a list of cases that may occur in the data together with a statement of
whether and why to include them or not. We will return to this issue in
Section 4.3.
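To make the retrieval side of such a definition concrete, here is a deliberately crude candidate search for written possessives; the pattern and sentences are illustrative assumptions, and every match would still have to be coded manually against the inclusion criteria just discussed (possessive compounds, proper and geographical names, slash-coordinated pronouns, and so on).

```python
import re

# A crude candidate pattern: a word ending in 's, or a possessive
# pronoun (including the slash-coordinated his/her), followed by one
# more word standing in for the head noun. It overgenerates by design:
# every match is only a candidate that still needs manual coding.
POSS = re.compile(
    r"\b(?:\w+'s|his/her|his|her|its|my|our|your|their)\s+\w+",
    re.IGNORECASE,
)

# Invented example sentences echoing (4.4a), (4.4c) and (4.4d):
text = ("We are a women's college. Such a student may arrange with "
        "his/her Registrar. Our restaurant is situated in St. Paul's Square.")

for match in POSS.finditer(text):
    print(match.group())
```

Note that the pattern will also pick up possessive compounds like women's college and geographical names like St. Paul's Square: exactly the cases for which the operational definition must state whether they count.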
Likewise, there may be situations where operational definitions already
present in the data can be plausibly used without further discussion. If we accept
orthographic representations of language (which corpus linguists do, most of the
time), then we also accept some of the definitions that come along with
orthography, for example concerning the question what constitutes a word. For
many research questions, it may be irrelevant whether the orthographic word
correlates with a linguistic word in all cases (whether it does depends to a large
extent on the specific linguistic model we adopt), so we may simply accept this
upon theory of word classes, let alone syntactic structures, this will be
problematic in many situations. It is crucial to understand that tagging and other
types of annotation are already the result of applying operational definitions by
other researchers, and if we use tags or other forms of annotation, we must
familiarize ourselves with these definitions, which are usually found in manuals
accompanying the corpus. These manuals and other literature provided by corpus
creators must be read and cited like all other literature, and we must clarify in the
description of our research design why and to what extent we rely on the
operationalizations described in these materials.
In other words, we must always explicitly state the operational definitions we
use. Fortunately, this is becoming more common as corpus linguistics is growing
a) Length. There is a wide range of phenomena that has been claimed and/or
shown to be related to the weight of linguistic units (syllables, words or
phrases): word-order phenomena following the principle 'light before heavy',
such as the dative alternation (Thompson and Koide 1987), particle placement
(Chen 1986), s-genitive and of-construction (Deane 1987), frozen binominals
(Sobkowiak 1993), to name just a few. In the context of such claims, weight is
sometimes understood to refer to structural complexity, sometimes to length, and
sometimes to both. Since complexity is often difficult to define, it is, in fact,
frequently operationalized in terms of length, but let us first look at the difficulty
of defining length in its own right and briefly return to complexity below.
Let us begin with words. Clearly, words differ in length: everyone would
agree that the word stun is shorter than the word flabbergast. There are a number
of ways in which we could operationalize WORD LENGTH, all of which would allow
us to confirm this difference in length:
as number of letters (cf., e.g., Wulff 2003), in which case flabbergast has
a length of 11 and stun has a length of 4;
as number of phonemes (cf., e.g., Sobkowiak 1993), in which case
flabbergast has a length of 9 (BrE /flæbəɡɑːst/ and AmE /flæbɚɡæst/), and
stun has a length of 4 (BrE and AmE /stʌn/);
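Two of these operationalizations are easy to compute automatically. The sketch below counts letters exactly and approximates a syllable count from vowel-letter groups; the latter is an assumption of my own that breaks down for words with silent vowels, so it is only a stand-in for a proper phoneme- or syllable-based count.

```python
def length_in_letters(word):
    """WORD LENGTH operationalized as number of letters."""
    return sum(1 for c in word if c.isalpha())

VOWEL_LETTERS = set("aeiouy")

def length_in_syllables(word):
    """WORD LENGTH approximated as the number of vowel-letter groups.
    A crude stand-in for a phonological syllable count; it will
    miscount words with silent vowels (e.g. a final -e)."""
    count, prev_vowel = 0, False
    for c in word.lower():
        vowel = c in VOWEL_LETTERS
        if vowel and not prev_vowel:
            count += 1
        prev_vowel = vowel
    return count

print(length_in_letters("flabbergast"), length_in_letters("stun"))      # 11 4
print(length_in_syllables("flabbergast"), length_in_syllables("stun"))  # 3 1
```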
that data be transcribed phonemically (which leaves more room for interpretation
than orthography) or, in the case of written data, converted from an orthographic
to a phonemic representation (which requires assumptions about which
language variety and level of formality the writer in question would have used if
they had been speaking the text); number of syllables also requires these
assumptions.
The question of validity is less easy to answer: if we are dealing with language
that was produced in the written medium, number of letters may seem like a
valid measure, but writers may be speaking internally as they write, in which
12 Note that I have limited the discussion here to definitions of length that make sense in
the domain of traditional linguistic corpora; there are other definitions, such as
phonetic length (time it took to pronounce a token of the word in a specific situation),
or average phonetic length (time it takes to pronounce the word on average). For
example, the pronunciation samples of the CALD, as measured by playing them in the
browser Chrome (Version 32.0.1700.102 for Mac OSX) and recording them using the
software Audacity (Version 2.0.3 for Mac OSX) on a MacBook Air with a 1.8 GHz Intel
Core i5 processor and running OS X version 10.8.5 and then using Audacity's timing
function, have the following lengths: flabbergast BrE 0.929s, AmE 0.906s and stun BrE
0.534s, AmE 0.482s. The reason I described the hardware and software I used in so
much detail is that they are likely to have influenced the measured length, in addition
to the fact that different speakers will produce words at different lengths on different
occasions; thus, calculating meaningful average pronunciation lengths would be a
very time- and resource-intensive procedure even if we decided that it was the most
valid measure of WORD LENGTH in the context of a particular research project. I am not
aware of any corpus-linguistic study that has used this definition of word length;
however, there are versions of the SWITCHBOARD corpus (a corpus of transcribed
telephone conversations) that contain information about phonetic length, and these
have been used to study properties of spoken language (e.g. Greenberg et al. 1996,
Greenberg et al. 2003).
orthographic length to some extent (in languages with phonemic and syllabic
scripts), so we might use the easily and reliably measured orthographic length as
an operational definition of phonemic and/or syllabic length and assume that
mismatches will be infrequent enough to be lost in the statistical noise (cf. Wulff
2003).
When we want to measure the length of linguistic units above word level, e.g.
phrases, we can choose all of the above methods, but additionally or instead we
can (and more typically do) count the number of words and/or constituents (cf.
e.g. Gries 2003 for a comparison of syllables and words as a measure of length).
Here, we have to decide whether to count orthographic words (which is very
reliable but may or may not be valid), or phonological words (which is less
reliable, as it depends on our theory of what constitutes a phonological word).
As mentioned at the beginning of this subsection, weight is sometimes
understood to mean structural complexity rather than length. The question of how
to measure structural complexity has been addressed in some detail in the case of
Wasow and Arnold 2003). Such a definition has a high validity, as number of
nodes directly corresponds to a central aspect of what it means for a phrase to be
syntactically complex, but as tree diagrams are highly theory-dependent, the
reliability across linguistic frameworks is low.
13 The difference between these two language types is that in stress-timed languages, the
time between two stressed syllables tends to be constant regardless of the number of
unstressed syllables in between, while in syllable-timed languages every syllable takes
about the same time to pronounce. This suggests an additional possibility for
measuring length in stress-timed languages, namely the number of stressed syllables.
Again, I am not aware of any study that has discussed the operationalization of word
length at this level of detail.
complexity, the vast majority of corpus-based studies operationalize weight in
terms of some measure of WORD LENGTH even if they theoretically conceptualize it
in terms of complexity. Since complexity and length correlate to some extent, this
is a justifiable simplification in most cases. In any case, it is a good example of
how a phenomenon and its operational definition may be more or less closely
related.
drastically across researchers and frameworks, but there is a simple basis for
operational definitions of TOPICALITY in terms of referential distance, proposed by
Talmy Givón:
that are not phonologically realized) are included, and if so, which types of covert
references. With respect to clauses, it has to be specified what counts as a clause,
and it has to be specified how complex clauses are to be counted.
A concrete example may demonstrate the complexity of these decisions. Let us
assume that we are interested in determining the referential distance of the
pronouns in the following example, all of which refer to the person named Joan
(verbs and other elements potentially forming the core of a clause have been
indexed with numbers for ease of reference in the subsequent discussion):
(4.6) Joan, though Anne's junior1 by a year and not yet fully accustomed2 to the
ways of the nobility, was3 by far the more worldly-wise of the two. She
watched4, listened5, learned6 and assessed7, speaking8 only when spoken9
to in general whilst all the while making10 her plans and looking11 to the
future … Enchanted12 at first by her good fortune in becoming13 Anne
Mowbray's companion, grateful14 for the benefits showered15 upon her,
Joan rapidly became16 accustomed to her new role. [BNC CCD]
Let us assume the traditional definition of a clause as a finite verb and its
dependents and let us assume that only overt references are counted. If we apply
these definitions very narrowly, we would put the referential distance between
the initial mention of Joan and the first pronominal reference at 1, as Joan is a
dependent of was in clause 4.63 and there are no other finite verbs between this
mention and the pronoun she. A broader definition of clause along the lines of 'a
unit expressing a complete proposition', however, might include the structures
(4.61) (though Anne's junior by a year) and (4.62) (not yet fully accustomed to the
ways of the nobility), in which case the referential distance would be 3 (a similar
problem is posed by the potential clauses (4.612) and (4.614), which do not contain
finite verbs but do express complete propositions). Note that if we also count the
NP the two as including reference to the person named Joan, the distance to she
would be 1, regardless of how the clauses are counted.
In fact, the structures (4.61) and (4.62) pose an additional problem: they are
dependent clauses whose logical subject, although it is not expressed, is clearly
coreferential with Joan. It depends on our theory whether these covert logical
subjects are treated as elements of grammatical and/or semantic structure; if they
are, we would have to include them in the count.
The differences that decisions about covert mentions can make are even more
obvious when calculating the referential distance of the second pronoun, her (in
her plans). Again, assuming that every finite verb and its dependents form a
clause, the distance between her and the previous use she is six clauses (4.64 to
4.69). However, in all six clauses, the logical subject is also Joan. If we include
these as mentions, the referential distance is 1 again (her good fortune is part of
the clause [4.612] and the previous mention would be the covert reference by the
logical subject of clause [4.611]).
Finally, note that I have assumed a very flat, sequential understanding of
number of clauses, counting every finite verb separately. However, one could
argue that the sequence She watched4, listened5, learned6 and assessed7 is actually a
single clause with four coordinated verb phrases sharing the subject she, that
speaking8 only when spoken9 to in general is a single clause consisting of a matrix
clause and an embedded adverbial clause, and that this clause itself is dependent
on the clause with the four verb phrases. Thus, the sequence from (4.64) to (4.69)
can be seen as consisting of six, two or even just one clause, depending on how
we decide to count clauses in the context of referential distance.
Obviously, there is no right or wrong way to count clauses; what matters is
that we specify a way of counting clauses that can be reliably applied and that is
valid with respect to what we are trying to measure. With respect to reliability,
obviously the simpler our specification, the better (simply counting every verb,
whether finite or not, might be a good compromise between the two definitions
mentioned above). With respect to validity, things are more complicated:
am not aware of studies that do this, but finding out to what extent the results
correlate with clause-based measures of various kinds seems worthwhile).
For practical as well as for theoretical reasons, it is plausible to introduce a
cut-off point for the number of clauses we search for a previous mention of a
referent: practically, it will become too time-consuming to search beyond a
certain point; theoretically, it is arguable to what extent a distant previous
occurrence of a referent contributes to the current information status. Givón
(1983) originally set this cut-off point at 20 clauses, but there are also studies
setting it at ten or even at three clauses. Clearly, there is no correct number of
clauses, but there is empirical evidence that the relevant distinctions are those
between a referential distance of 1, between 2 and 3, and >3 (cf. Givón 1992).
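Once a clause-counting scheme has been fixed and each clause has been annotated for the referents it mentions, referential distance with a cut-off reduces to a simple backward search. The clause coding below is a toy illustration of the start of (4.6) under the narrow, overt-mentions-only definition, not an actual annotation of the passage.

```python
def referential_distance(clauses, index, referent, cutoff=20):
    """Count clauses back from clause `index` to the nearest earlier
    clause mentioning `referent`; if no mention is found within
    `cutoff` clauses, return the cutoff value itself."""
    for distance in range(1, cutoff + 1):
        i = index - distance
        if i < 0:
            break
        if referent in clauses[i]:
            return distance
    return cutoff

# Overt mentions per clause for the start of (4.6), under the narrow
# "finite verb and its dependents" definition (a toy annotation):
clauses = [
    {"Joan", "Anne"},  # was3 ... (Joan, Anne's)
    {"Joan"},          # She watched4
    {"Joan"},          # listened5
]
print(referential_distance(clauses, 1, "Joan"))  # 1
```

Changing the clause segmentation or adding covert mentions to the sets changes the distances, which is precisely why these decisions must be stated explicitly.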
given information to the hearer (Givón originally intended the measure for
narrative data only, where this problem will not occur).
Both WORD LENGTH and DISCOURSE STATUS are phenomena that can be defined in
relatively objective, quantifiable terms: not quite as objectively as physical
HARDNESS, perhaps, but with a comparable degree of rigor. Like HARDNESS measures,
they do not access reality directly and are dependent on a number of assumptions
and decisions, but provided that these are stated sufficiently explicitly, they can
be applied almost automatically. While WORD LENGTH and DISCOURSE STATUS are not
the only such phenomena, they are not typical either. Most phenomena that are
c) Word senses. Although we often pretend that corpora contain words, they
pedestrians; usually beside a street or roadway)
There are three senses of pavement, as shown by the numbers attached, and in
each case there are synonyms. Of course, in order to turn this into an operational
definition, we need to specify a procedure that allows us to assign the hits in our
corpus to these categories. For example, we could try to replace the word
pavement by a unique synonym and see whether this changes the meaning. But
even this, as we saw in Chapter 3, may be quite difficult.
There is an additional problem: We are relying on someone else's decisions
about which uses of a word constitute different senses. In the case of pavement,
this is fairly uncontroversial, but consider the entry for the noun bank:
(4.8a) bank#1 (sloping land (especially the slope beside a body of water))
(4.8b) bank#2, depository financial institution#1, bank#2, banking concern#1,
banking company#1 (a financial institution that accepts deposits and
83
4 Operationalization
transacted)
(4.8j) bank#10 (a ight maneuver; aircra tips laterally about its longitudinal
axis (especially in turning))
While everyone will presumably agree that (4.8a) and (4.8b) are separate senses
(or even separate words, i.e. homonyms), it is less clear whether everyone would
distinguish (4.8b) from (4.8i) and/or (4.8f), or (4.8e) and (4.8f), or even (4.8a) and
(4.8g). In these cases, one could argue that we are just dealing with contextual
variants of a single underlying meaning.
Thus, we have the choice of coming up with our own set of senses (which has
the advantage that it will fit more precisely into the general theoretical
framework we are working in and that we might find it easier to apply), or we
can stick with an established set of senses such as that proposed by WordNet,
which has the advantage that it is maximally transparent to other researchers and
that we cannot subconsciously make it fit our own preconceptions, thus
distorting our results in the direction of our hypothesis. In either case, we must
make the set of senses and the criteria for applying them transparent, and in
either case we are dealing with an operational definition that does not correspond
directly with reality (if only because word senses tend to form a continuum
rather than a set of discrete categories in actual language use).
On second thought, however, it is more complex than it seems. For example, what
about dead bodies or carcasses? The fact that dictionaries disagree as to whether
these are inanimate shows that this is not a straightforward question, but one
that calls for a decision before the nouns in a given corpus can be categorized
reliably.
Let us assume for the moment that animate is defined as potentially having
life and thus includes dead bodies and carcasses. This does not solve all
problems: For example, how should body parts, organs or individual cells be
categorized? They have life in the sense that they are part of something alive,
but they are not, in themselves, living beings. In fact, in order to count as an
animate being in a communicatively relevant sense, an entity has to display some
degree of intentional agency. This raises the question of whether, for example,
plants, jellyfish, bacteria, viruses or prions should be categorized as animate.
Sometimes, the dimension of intentionality/agency is implicitly recognized as
playing a crucial role, leading to a three-way categorization such as that in (4.9b):

fictional entities, such as gods, ghosts, dwarves, dragons or unicorns? Are they,
respectively, humans and animals, even though they do not, in fact, exist? Clearly,
where they are treated as having human or animal intelligence and agency, and
as inanimate where they are not. In other words, our categorization of a referent
may change depending on context.
Sometimes studies involving animacy introduce additional categories in the
INANIMATE domain. One distinction that is often made is that between concrete and
abstract, yielding the four-way categorization in (4.9c):
(4.9c) HUMAN vs. ANIMATE vs. CONCRETE INANIMATE vs. ABSTRACT INANIMATE
The distinction between concrete and abstract raises the practical issue of where
to draw the line (for example, is electricity concrete?). It also raises a deeper issue
that we will return to in Chapter 5: are we still dealing with a single dimension?
Are abstract inanimate entities (say, marriage or Wednesday) really less animate
than concrete entities like a wedding ring or a calendar? And are animate and
abstract incompatible, or would it not make sense to treat the referents of words
like god, demon, unicorn etc. as abstract animate?
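The four-way categorization in (4.9c) can be turned into a first approximation of a coding procedure. The following sketch is purely illustrative (the mini-lexicon and the UNCLEAR fallback are my own assumptions, not part of any published coding scheme); its main point is that unlisted or borderline nouns must be flagged for a coder's decision rather than silently guessed:

```python
# A sketch of a four-way ANIMACY coding as in (4.9c), driven by a small
# hand-made lexicon (illustrative only).
ANIMACY = {
    "woman": "HUMAN", "driver": "HUMAN",
    "dog": "ANIMATE", "jellyfish": "ANIMATE",
    "ring": "CONCRETE INANIMATE", "calendar": "CONCRETE INANIMATE",
    "marriage": "ABSTRACT INANIMATE", "wednesday": "ABSTRACT INANIMATE",
}

def code_animacy(noun: str) -> str:
    # nouns not covered by the scheme are flagged for the coder,
    # not silently assigned to a category
    return ANIMACY.get(noun.lower(), "UNCLEAR")
```

Borderline referents like prions or fictional beings would come back as UNCLEAR, which is exactly where the coding scheme needs explicit guidelines.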
linguistics (and the social sciences in general), but there are procedures to deal
with the problem of subjectiveness at least to some extent. We will return to
Further reading
A discussion of the role of operationalization in the context of corpus-based
semantics is found in Stefanowitsch (2010). Wulff (2003) is a study of adjective
order in English that operationalizes a variety of linguistic constructs in an
exemplary way. Zaenen et al. (2004) is an example of a detailed and extensive
coding scheme for animacy.
Stefanowitsch, Anatol. 2010. Empirical cognitive semantics: Some
5 Retrieval and Coding
Traditionally, many corpus-linguistic studies use the (orthographic) word form as
their starting point. This is at least in part due to the fact that corpora consist of
text that is represented as a sequence of word forms, and that, consequently,
word forms are easy to retrieve. As briefly discussed in Chapter 2, concordancing
programs offer at least the option of searching for a string of characters and
displaying the results as a list of hits in context. As we saw when discussing the
case of pavement in Chapter 3, searching for a string of characters will typically
give us both more and less than we need. A query for the string pavement will
give us less than we need, because a study involving the word pavement in the
sense of pedestrian footpath requires us to look at all morphological variants (in
this case, the singular pavement, the plural pavements and, depending on how the
corpus is prepared, the possessive form pavement's) and orthographic and

making sure to define our corpus query in such a way as to capture all variants of
the word pavement; we treated the second issue as a problem of coding, going
through the search results one by one and determining from the context which
of the senses of pavement given by dictionaries like CALD and OALD we were
likely dealing with.
For research questions where one of the variables is a word (or a set of words),
this is probably the best strategy. However, as soon as we want to investigate
phenomena other than words, the issue of retrieval becomes very complex,
requiring careful thought and a number of decisions concerning an almost
inevitable trade-off between the quality of the results and the time needed to
retrieve them. This issue will be dealt with in Section 5.1. We already saw that the
issue of coding the data is complex even in the case of simple words, and the
5.1 Retrieval
5.1.1 Corpus queries
Roughly speaking, there are two ways of searching a corpus for a particular
linguistic phenomenon: manually (i.e., by reading the texts contained in it, noting
down each instance of the phenomenon in question) or automatically (i.e., by
using a computer program to run a query on a machine-readable version of the
texts).
As discussed in Chapter 2, there may be cases where there is no alternative to
a manual search, but software-aided queries are the default in modern corpus
linguistics. There is a range of more or less specialized commercial and non-
commercial concordancing programs designed specifically for corpus linguistic
research, and there are many other software packages that may be used to search
text corpora even though they are not specifically designed for corpus-linguistic
research (some non-commercial software solutions are listed in Appendix A.3).
Finally, there are scripting languages like Perl, Python and R that can be used to
write programs capable of searching text corpora, ranging from very simple two-
liners to very complex solutions (see also Appendix A.3 for further information).
The power of software-aided searches depends on two things: on the
information contained in the corpus itself and on the pattern-matching capacities
of the software used to access it. In the simplest case (which we assumed to
hold in the examples discussed in Chapters 3 and 4), a corpus will contain plain
text in a standard orthography and the software will be able to find passages
words, which means that we have to add at least side walk, side walks, Side walk
and Side walks when investigating such texts). In order to retrieve all occurrences
of the lexeme in Chapters 2 and 3, we could perform a separate query for each of
these strings, but I actually queried the string in (5.1a); a second example of
regular expressions is (5.1b), which represents one way of searching for all
inflected forms and spelling variants of the verb synthesize:
(5.1a) [Ss]ide-?walks?
(5.1b) synthesi[sz]e?[ds]?(ing)?
Any group of characters in square brackets is treated as a class, meaning that any
one of them will be treated as a match, and the question mark means zero or one
occurrence of the preceding character. This means that the pattern in (5.1a) will
match an upper- or lower-case S, followed by i, d, and e, followed by zero or one
occurrence of a hyphen, followed by w, a, l, and k, followed by zero or one
occurrence of s. This matches all the variants of the word. For (5.1b), the [sz]
ensures that both the British spelling (with an s) and the American spelling (with
a z) are found. The question mark after e ensures that both the forms with an e
(synthesize, synthesizes, synthesized) and that without one (synthesizing) are
matched. Next, the string matches zero or one occurrence of a d or an s followed
by zero or one

expressions. But even the simple example in (5.1b) demonstrates a problem with
such search queries: they quickly overgeneralize. For example, the pattern in
(5.1b) would also match some non-existing forms, like synthesizding, and, more
crucially, it will match existing forms that we may not want to include in our
search results, like the noun synthesis (see further Section 5.1.2).
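The behavior of the two queries can be checked with any regular-expression engine; here is a minimal sketch in Python (the word list is invented for illustration):

```python
import re

# the queries from (5.1a) and (5.1b), applied as whole-word matches
sidewalk = re.compile(r"[Ss]ide-?walks?")
synthesize = re.compile(r"synthesi[sz]e?[ds]?(ing)?")

words = ["sidewalk", "Side-walks", "synthesise", "synthesizing",
         "synthesis", "synthesizding"]

# every one of these strings is matched, including the false positive
# 'synthesis' and the non-existing form 'synthesizding'
hits = [w for w in words if sidewalk.fullmatch(w) or synthesize.fullmatch(w)]
```

Running this confirms the overgeneralization problem discussed above: the noun synthesis and the impossible form synthesizding both survive the query.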
The benefits of being able to define complex queries become even more
obvious if our corpus contains annotations in addition to the original text. If the
corpus contains part-of-speech tags, as many corpora do, this will allow us to
search (within limits) for grammatical structures. For example, assume that there
is a part-of-speech tag attached to the end of every word by an underscore (as it
is in many older corpora, including the BROWN corpus used in several
examples in previous chapters) and that the tags are as shown in (5.2) (following
the sometimes rather nontransparent BROWN naming conventions). We could
then search for prepositional phrases using a pattern like the one in (5.3):

_NP (proper names)
_NN$ (common nouns with possessive clitic)

(5.3) \S+_IN ( \S+_AT)?( \S+_JJ[RT]?)*( \S+_N[PN][S$]?)+
         1       2           3               4
The star means zero or more, the plus means one or more, and \S means any
non-whitespace character; the meaning of the other symbols is as before. This
pattern will match the following sequence (the numbers correspond to the indices
in 5.3):

followed by
3. zero or more occurrences of a word tagged as an adjective (again
preceded by a whitespace), including comparatives and superlatives
(note that the JJ tag may be followed by zero or one occurrence of a T or
an R),
followed by
4. one or more words (again preceded by a whitespace) that are tagged as a
noun or proper name (note the square bracket containing the N for
common nouns and the P for proper nouns), including plural forms and
possessive forms (note that NN or NP tags may be followed by zero or
one occurrence of an S or a $).
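Applied with a regular-expression engine to a BROWN-style tagged string, pattern (5.3) behaves as just described; the tagged sentence below is an invented example:

```python
import re

# pattern (5.3); each optional group carries its own leading space
pp = re.compile(r"\S+_IN( \S+_AT)?( \S+_JJ[RT]?)*( \S+_N[PN][S$]?)+")

tagged = "He_PPS sat_VBD in_IN the_AT old_JJ garden_NN chair_NN ._."
match = pp.search(tagged)
# the match covers the preposition, the determiner, the adjective
# and both common nouns, but stops at the sentence-final punctuation
```

Here match.group(0) is "in_IN the_AT old_JJ garden_NN chair_NN", i.e. exactly the prepositional phrase.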
Even though this query is already quite complex, it will not return all
prepositional phrases: for example, it will not match cases where the
adjective is preceded by one or more quantifiers (tagged _QT in the BROWN
corpus), adverbs (tagged _RB), or combinations of the two. It will also not return
cases with pronouns instead of nouns. These and other issues can be fixed by
augmenting the search pattern accordingly, although the increasing complexity
will bring problems of its own, to which we return in the next subsection.
Other problems are impossible to fix; for example, if the noun phrase inside
the prepositional phrase contains another PP, the pattern will not recognize it as
belonging to the NP but will treat it as a new match, and there is nothing we can
do about this, since there is no difference between the sequence of POS tags in
structures like (5.4a), where the PP off the kitchen is a complement of the noun
room and as such is part of the NP inside the first PP, and (5.4b), where the PP at
a party is an adjunct of the verb standing and as such is not part of the NP
preceding it:
1. include hits that are instances of the phenomenon we are looking for
Table 5.1: Four possible outcomes of a corpus query for a phenomenon X

                             SEARCH RESULT
                     INCLUDED                NOT INCLUDED           Total
  CORPUS     X       True Positive (TP)      False Negative (FN)    P
                     (hit)                   (miss)
            ¬X       False Positive (FP)     True Negative (TN)     N
                     (incorrectly included)  (correctly rejected)
Obviously, the first case (TP) and the fourth case (TN) are desirable outcomes:
we want our search results to include all instances of the phenomenon under
investigation and exclude everything that is not such an instance. The second
case (FN) and the third case (FP) are undesirable outcomes: we do not want our
search to overlook instances of our phenomenon and we do not want our search

found by our search) that are true positives; this is referred to as precision (or as
the positive predictive value, cf. 5.5a). Second, the proportion of all instances of
our phenomenon that are true positives (i.e., that were actually found by our
search); this is referred to as recall (cf. 5.5b):14
14 There are two additional measures that are important in other areas of empirical
research but do not play a central role in data retrieval. First, the specificity or true
negative rate: the proportion of all negatives (i.e., cases that are not instances of the
phenomenon) that are correctly rejected by our search; second, the negative predictive
value: the proportion of negatives (i.e., cases not included in our search) that are true
negatives (i.e., that are correctly rejected). These measures play a role in situations
where a negative outcome of a test
(5.5a) Precision = True Positives / (True Positives + False Positives)

(5.5b) Recall = True Positives / (True Positives + False Negatives)
Ideally, the value of both measures should be 1, i.e., our data should include all
cases of the phenomenon under investigation (a recall rate of 100 percent) and it
should include nothing that is not a case of this phenomenon (a precision of 100
percent). However, unless we carefully search our corpus manually (a possibility I
will return to below), there is typically a trade-off between the two. Either we
devise a query that matches only clear cases of the phenomenon we are
interested in (high precision) but that will miss all less clear cases (low recall), or
we devise a query that matches as many potential cases as possible (high recall)
but that will include many cases that are not instances of our phenomenon (low
precision).
Let us look at a specific example, the English ditransitive construction, and let
us assume that we have an untagged and unparsed corpus. How could we retrieve
instances of the ditransitive? As the first object of a ditransitive is usually a
pronoun (in the objective case) and the second a lexical NP (see, for example,
Thompson and Koide 1987), one possibility would be to search for a pronoun
followed by a determiner (i.e., for any member of the set of strings in (5.6a)
followed by any member of the set of strings in (5.6b)). This gives us the query in

(5.6b) the, a, an, this, that, these, those, some, many, lots, my, your, his, her, its,
our, their, something, anything
(5.6c) (me|you|him|her|it|us|them) (the|a|an|this|that|these|those|
some|many|lots|my|your|his|her|its|our|their|something|
anything)
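Expressed as a Python regular expression (with word boundaries added so that, for example, them is not found inside another word), the query and its precision and recall problems can be illustrated on invented sentences:

```python
import re

pron = "me|you|him|her|it|us|them"
det = ("the|a|an|this|that|these|those|some|many|lots|my|your|his|her|its|"
       "our|their|something|anything")
# the query in (5.6c), wrapped in word boundaries
ditr = re.compile(rf"\b({pron}) ({det})\b")

true_positive = ditr.search("She gave him the book")    # a ditransitive
false_positive = ditr.search("I saw him this morning")  # not a ditransitive
false_negative = ditr.search("She gave John the book")  # ditransitive, missed
```

The query finds the first sentence, wrongly includes the second, and misses the third, whose first object is a lexical NP rather than a pronoun.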
Let us apply this query (which is actually used in Colleman and De Clerck 2011) to
a freely available sample from the ICE-GB mentioned above (see Appendix A.1).
This corpus has been manually annotated, amongst other things, for argument
structure, so that we can check the results of our search against this annotation.
There are 36 ditransitive clauses in the sample, 13 of which are returned
by our query. There are also 2,838 clauses that are not ditransitive, 14 of which
are also returned by our query. Table 5.2 shows the results of the query in terms
of true and false positives and negatives:

                          SEARCH RESULT
                  INCLUDED            NOT INCLUDED        Total
  CORPUS   DITR.  13                  23                  36
                  (true positives)    (false negatives)
           OTHER  14                  2824                2838
                  (false positives)   (true negatives)
(5.7a) Precision = 13 / (13 + 14) = 0.48

(5.7b) Recall = 13 / (13 + 23) = 0.36
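The two measures are straightforward to compute; the following sketch simply restates (5.5a) and (5.5b) in code and checks them against the counts from the ICE-GB sample (TP = 13, FP = 14, FN = 23):

```python
def precision(tp: int, fp: int) -> float:
    # (5.5a): proportion of returned hits that are genuine instances
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # (5.5b): proportion of genuine instances that were returned
    return tp / (tp + fn)

p = precision(13, 14)  # 13/27, about 0.48
r = recall(13, 23)     # 13/36, about 0.36
```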
Clearly, neither precision nor recall is particularly good. Let us look at the
reasons for this, beginning with precision.
While the sequence of a pronoun and a determiner is typical for (one type of)
ditransitive clause, it is not unique to the ditransitive, as the following false
positives of our corpus search show:
(5.4a), could never be excluded, since they are identical to the ditransitive as far
as the sequence of parts of speech is concerned.
Of course, it is relatively trivial to increase the precision of our search results:
we can manually discard all false positives, which would increase precision to the
maximum value of 1. Typically, our data will have to be manually coded for
various criteria anyway, allowing us to discard false positives in the process.
However, the larger our data set, the more time-consuming this will become, so
that precision should always be a consideration even at the stage of data retrieval.
Let us now look at the reasons for the recall rate, which is even worse than the
precision. There are, roughly speaking, four types of ditransitive structures that
The first group of cases are those where the second object does not appear in its
canonical position, for example in interrogatives and other cases of left-
dislocation (cf. 5.9a), or passives (5.9b). The second group of cases are those where
word order is canonical, but either the first object (5.9c) or the second object
(5.9d) or both (5.9e) do not correspond to the search pattern.
Note that, unlike precision, the recall rate of a search pattern cannot be
increased after the data have been extracted from the corpus. Thus, an important
aspect in constructing a query is to annotate a random sample of our corpus
manually for the phenomenon we are interested in, and then to check our query
against this manual annotation. This will not only tell us how good or bad the
recall of our query is, it will also provide information about the most frequent
cases we are missing. Once we know this, we can try to revise our query to take
these cases into account. In a POS-tagged corpus, we could, for example, search
for a sequence of a pronoun and a noun in addition to the sequence pronoun-
determiner that we used above, which would give us cases like (5.9d), or we could
search for forms of be followed by a past participle followed by a determiner or
noun, which would give us passives like those in (5.9b).
In some cases, however, there may not be any additional patterns that we can
reasonably search for. In the present example with an untagged corpus, for
example, there is no additional pattern that seems in any way promising. In such
cases, we have two options for dealing with low recall: First, we can check (in our
manually annotated subcorpus) whether the data recalled differ from the data not
recalled in any way significant for our research question. If this is not the case,
we might decide to continue working with a low recall and hope that our results
are still generalizable (Colleman and De Clerck 2011, for example, are mainly
interested in the question of which classes of verbs were used ditransitively at
what time in the history of English, a question that they were able to discuss
insightfully based just on the subset of ditransitives matching their search
pattern).
If our data do differ significantly along one or more of the dimensions relevant
to our research project, we might have to increase the recall at the expense of
precision and spend relatively more time weeding out false positives. In the most
extreme case, this might mean extracting the data manually, i.e. checking the
entire corpus word by word or clause by clause for the phenomenon we are
interested in (as Colleman 2006 did when he manually searched an untagged one-
first, an accuracy of 95 percent means that roughly one word in 20 is tagged
incorrectly. Assuming a mean sentence length of 20 words (actual estimates range
from 16 to 22), every sentence contains on average one incorrectly or
ambiguously tagged word. Second, the incorrectly assigned tags are unlikely to be
distributed randomly across parts of speech. For example, the is unlikely to be
tagged incorrectly (as there is only one possible tag); instead, errors will cluster
around certain difficult cases (like ing-forms of verbs, which may be participles,
adjectives or nouns). They are also unlikely to be distributed randomly across
grammatical constructions. For example, the word form regard is systematically
tagged incorrectly as a verb in the complex prepositions with regard to and in
regard to, but is correctly tagged as a noun in most instances of the phrase in high
regard. Thus, particular linguistic phenomena will be severely misrepresented in
corpus searches based on automatic tagging or parsing. Still, an automatically
preprocessed corpus will certainly allow us to define searches whose precision
and recall are higher than in the example above.
sense to base queries on it. For example, linguistic metaphors are almost
impossible to identify automatically, as they have few or no properties that
systematically set them apart from literal language. Consider the following
examples of the metaphors ANGER IS HEAT/A (HOT) LIQUID (from G. Lakoff and
Kövecses 1987):
feelings of anger or rage, they can also occur in their literal meaning, as the
corresponding authentic examples in (5.11a–c) show:
(5.11a) "Now, after I am burned up," he said, snatching my wrist, "and the fire is
out, you must scatter the ashes." [Anne Rice, The Vampire Lestat]
(5.11b) As soon as the driver saw the train which had been hidden by the curve,
he let off steam and checked the engine [Galignani, Accident on the
Paris and Orleans Railway]
(5.11c) Heat water in saucepan on highest setting until you reach the boiling
point and it starts to boil gently. [www.sugarfreestevia.net]
Clearly, there can be no search pattern that would find the examples in (5.10)
but not those in (5.11). In contrast, it is very easy for a human to recognize the
examples in (5.11) as literal. If we are explicitly interested in metaphors involving
liquids and/or heat, we could choose a semi-manual approach, first extracting all
instances of words from the field of liquids and/or heat and then discarding all
cases that are not metaphorical. This type of approach is used quite fruitfully, for
example, by Deignan (2005), amongst others.
If we are interested in metaphors of anger in general, however, this approach
will not work, since we have no way of knowing beforehand which semantic

for example, Semino and Masci 1996 or Jäkel 1997 for very thorough early studies
of this type).
(5.12b) He was filled with anger.
(5.12c) She was brimming with rage.
In these cases, the PPs by/with anger/rage make it clear that consume, (be)
filled and brimming are not used literally. If we limit ourselves just to
metaphorical expressions of this type, i.e. expressions that explicitly mention the
two semantic fields involved in the metaphorical expression, then it becomes
possible to search semi-automatically for all metaphors of anger by retrieving all
instances of the words anger and rage (as well as fury, ire, and annoyance, and
perhaps irritation and outrage, depending on how broadly we operationalize
the concept ANGER). Such a semi-automatic approach has been pursued, for
example, by Tissari (2003) and Stefanowitsch (2004, 2006; see further Chapter 13,
Section 13.2.2.1). If we are interested in specific metaphors, the approach can even
be automated by searching for phrases or clauses that contain vocabulary from
the two semantic fields involved (see Martin 2006). As in the case of the
ditransitive, this retrieval strategy is only useful if we can show that the results
are comparable to the results we would get if we extracted the phenomenon
exhaustively (in the case of metaphor, Stefanowitsch 2006 suggests that they are,
indeed, comparable).
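A minimal sketch of this kind of retrieval, assuming a toy corpus of sentences; the term lists and the 40-character window are my own illustrative choices, not those of the studies cited:

```python
import re

# hypothetical term lists for the two semantic fields involved
heat_liquid = r"(?:boil\w*|brimming|filled|seeth\w*|simmer\w*)"
anger = r"(?:anger|rage|fury)"

# candidate metaphors: a heat/liquid term closely followed by an anger term
pattern = re.compile(rf"\b{heat_liquid}\b.{{0,40}}\b{anger}\b", re.IGNORECASE)

sentences = [
    "He was filled with anger.",
    "She was brimming with rage.",
    "Heat water in a saucepan until it starts to boil gently.",
]
candidates = [s for s in sentences if pattern.search(s)]
# only the first two sentences are retained as candidates; the literal
# cooking instruction contains no anger vocabulary and is excluded
```

As in the text, the purely literal use (the cooking instruction, analogous to 5.11c) falls out automatically, while uses like (5.12b) and (5.12c) are retained for manual inspection.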
To sum up, it depends on the phenomenon under investigation and on the
research question whether we can take an automatic or at least a semi-automatic
approach or whether we have to resort to manual data extraction. Obviously, the
more exhaustively we can extract our object of research from the corpus, the
better.
5.2 Coding
Once the data have been extracted from the corpus (and, if necessary, false
positives have been removed), they typically have to be coded in terms of the
variables relevant for the research question. In some cases, the variables and their
values will be provided externally; they may, for example, follow from the
structure of the corpus itself (as in the case of BRITISH ENGLISH vs. AMERICAN ENGLISH
defined as occurring in the LOB corpus and occurring in the BROWN corpus
respectively). In other cases, the variables and their values may have been
operationalized in terms of criteria that can be applied objectively (as in the case
of LENGTH defined as number of letters). In most cases, however, some degree of
interpretation will be involved (as in the case of ANIMACY or the metaphors
discussed above). Whatever the case, we need a coding scheme: an explicit
statement of the operational definitions applied. Of course, such a coding scheme
is especially important in cases where interpretative judgments are involved in
categorizing the data. In this case, the coding scheme should contain not just
operational definitions, but also explicit guidelines as to how these definitions
should be applied to the corpus data. These guidelines must be explicit enough to
ensure a high degree of agreement if different coders apply them to the same data.
Let us look at each of these aspects in some detail.
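The degree of agreement between coders can itself be quantified; the simplest measure is raw agreement (more refined measures, such as Cohen's kappa, additionally correct for chance agreement). A sketch, with invented codings:

```python
def raw_agreement(coder_a, coder_b):
    # proportion of items to which two coders assigned the same value
    if len(coder_a) != len(coder_b):
        raise ValueError("coders must rate the same items")
    same = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return same / len(coder_a)

# two coders categorizing four nouns for ANIMACY (invented data)
a = ["ANIMATE", "INANIMATE", "ANIMATE", "INANIMATE"]
b = ["ANIMATE", "INANIMATE", "INANIMATE", "INANIMATE"]
score = raw_agreement(a, b)  # 3 of 4 items agree
```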
5.2.1 Coding as interpretation
First of all, it is necessary to understand that the categorization of corpus data is

accepting the operational definitions used by the makers of a particular corpus (as
well as the interpretative judgments made in applying them). Take the example of
BRITISH ENGLISH and AMERICAN ENGLISH used in Chapters 3 and 4: If we accept the idea
that the LOB corpus contains British English, we are accepting an interpretation
of language varieties that is based on geographical criteria: British English means
the English spoken by people who live (perhaps also: who were born and grew
up) in the British Isles.
Or take the example of SEX, one of the demographic speaker variables included
in many modern corpora: By accepting the values of this variable that the corpus
provides (typically MALE and FEMALE), we are accepting a specific interpretation of
what it means to be male or female. In some corpora, this may be the
interpretation of the speakers themselves (i.e., the corpus creators may have
asked the speakers to specify their sex), in other cases this may be the
interpretation of the corpus creators (based, for example, on the first names of the
speakers or on the assessment of whoever collected the data). For many speakers
in a corpus, these different interpretations will presumably match, so that we can
accept whatever interpretation was used as an approximation of our own
operational definition of SEX. But in research projects that are based on a specific
understanding of SEX (for example, as a purely biological, a purely social or a
purely psychological category), simply accepting the (often unstated) operational
definition used by the corpus creators may distort our results substantially. The
same is true of other demographic variables, like education, income, social class
etc., which are often defined on a common-sense basis that does not hold up to
the current state of sociological research.
Interpretation also plays a role in the case of seemingly objective criteria. Even
though a criterion such as number of letters is largely self-explanatory, there
are cases requiring interpretative judgments that may vary across researchers. In
the absence of clear instructions they may not know, among other things,
whether to treat ligatures as one or two letters, whether apostrophes or word-
internal hyphens are supposed to count as letters, or how to deal with spelling
variants (for example, in the BNC the noun programme also occurs in the variant
program that is shorter by two letters). This type of orthographic variation is very
typical of older stages of English (before there was a somewhat standardized
orthography), which causes problems not only for retrieval (cf. the discussion in
Section 5.1.1 above; cf. also Barnbrook 1996, Ch. 8.2 for more detailed discussion),
but also for a reasonable application of the criterion number of letters.
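An explicit instruction for the criterion number of letters might, for example, count alphabetic characters only, so that apostrophes and word-internal hyphens are excluded by definition. A sketch of such a rule (the decision itself is only one of several defensible options):

```python
def letter_count(word: str) -> int:
    # count alphabetic characters only; apostrophes, hyphens and other
    # punctuation do not count as letters under this operationalization
    return sum(1 for ch in word if ch.isalpha())
```

Under this rule programme counts 9 letters, program 7, and side-walk 8; a different but equally explicit rule might count the hyphen. What matters is only that the choice is stated in the coding scheme.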
In such cases, the role of interpretation can be reduced by including explicit
instructions for dealing with potentially unclear cases. However, we may not
have thought of all potentially unclear cases before we start coding our data,
which means that we may have to amend our coding scheme as we go along. In
this case, it is important to check whether our amendments have an effect on the
data we have already coded, and to recode them if necessary.
In cases of less objective criteria (such as ANIMACY discussed in Chapter 4
above), the role of interpretation is obvious. No matter how explicit our coding
scheme, we will come across cases that are not covered and will require
individual decisions; and even the clear cases are always based on an
interpretative judgment. As mentioned in Chapter 1, this is not necessarily
undesirable in the same way that intuitive judgements about acceptability are
undesirable; interpreting linguistic utterances is a natural activity in the context
of using language. Thus, if our operational definitions of the relevant variables
and values are close to the definitions speakers implicitly apply in everyday
linguistic data). In order to keep different research projects in a particular area
comparable, it is of course desirable to create coding schemes independently of a
particular research project. However, the field of corpus linguistics is not well-
established and methodologically mature enough yet to have yielded
uncontroversial and widely applicable coding schemes for most linguistic
phenomena. There are some exceptions, such as the part-of-speech tagsets and
the parsing schemes used by various widespread automatic taggers and parsers,
which have become de facto standards by virtue of being easily applied to new
data; there are also some substantial attempts to create coding schemes for
manual coding of phenomena like topicality (cf. Givón 1983), animacy (cf. Zaenen
et al. 2004), and the grammatical description of English sentences (e.g., Sampson
1995).
Whenever it is feasible, we should use existing coding schemes instead of
creating our own; searching the literature for such schemes should be a routine
step in the planning of a research project. Often, however, such a search will
come up empty, or existing coding schemes will not be suitable for the specific
data we plan to use, or they may be incompatible with our theoretical
question.
There are, in addition, several general criteria that the set of values for any
variable must meet. First, they must be non-overlapping. This may seem obvious,
but it is not at all unusual, for example, to find continuous dimensions split up
into overlapping categories, as in the following quotation:

2002: 304).
Here, the authors obviously summarized the ages of their subjects into the
following four classes: (I) 25–35, (II) 35–45, (III) 45–55 and (IV) 55–65: thus,
subjects aged 35 could be assigned to class I or class II, subjects aged 45 to class II
or class III, and subjects aged 55 to class III or class IV. This must be avoided, as
different coders might make different decisions, and as other researchers
attempting to replicate the research will not know how we categorized such
cases.
Second, the variable should be defined such that it does not conflate properties that are potentially independent of each other, as this will lead to a set of values that do not fall along a single dimension. As an example, consider the so-called Silverstein Hierarchy used to categorize nouns for (inherent) Topicality (after Deane 1987: 67):
Note, first, that there is a lot of overlap in this coding scheme. For example, a first or second person pronoun will always refer to a human or animate NP and a third person pronoun will frequently do so, as will a proper name or a kin term. Similarly, a container is a concrete object and can also be a location, and everything above the category Perceivable is also perceivable. This overlap can only be dealt with by an instruction of the kind that every nominal expression should be put into the topmost applicable category; in other words, we need to add an "except for expressions that also fit into one of the categories above" to every category label.
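The "topmost applicable category" instruction amounts to a simple decision procedure: walk down the hierarchy from the top and assign the first category whose definition applies. The following Python sketch illustrates the idea; the category labels and feature tests are simplified placeholders, not the full Silverstein Hierarchy.

```python
# Each category is a (label, test) pair, ordered from the top of the
# hierarchy downwards; an expression is represented as a set of features.
def assign_topmost(features, hierarchy):
    """Return the label of the topmost category whose test applies."""
    for label, test in hierarchy:
        if test(features):
            return label
    return None  # no category applies

# Hypothetical mini-hierarchy for illustration only
hierarchy = [
    ("1st/2nd person pronoun",
     lambda f: "pronoun" in f and "speech-act participant" in f),
    ("3rd person pronoun", lambda f: "pronoun" in f),
    ("proper name",        lambda f: "proper name" in f),
    ("human",              lambda f: "human" in f),
    ("animate",            lambda f: "animate" in f),
]

# 'my' is a pronoun referring to a speech-act participant: topmost wins,
# even though 'human' also applies further down
print(assign_topmost({"pronoun", "speech-act participant", "human"}, hierarchy))
# 'the dog' is animate, and nothing higher applies
print(assign_topmost({"animate"}, hierarchy))
```

Because the loop stops at the first applicable category, the "except for expressions that also fit into one of the categories above" proviso is built in automatically.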
Secondly, although the Silverstein Hierarchy may superficially give the impression of providing values of a single variable that could be called TOPICALITY, it is actually a mixture of several quite different variables and their possible values. One attempt at disentangling these variables and giving them each a set of plausible values is the following:
There are two advantages of this more complex coding scheme. First, it allows a more principled categorization of individual expressions: the variables and their values are easier to define and there are fewer unclear cases. Second, it would allow us to determine empirically which of the variables are actually relevant in the context of a given research question, as irrelevant variables will not show a significant distribution across different conditions. Originally, the Silverstein Hierarchy was meant to allow for a principled description of split ergative systems; it is possible that the specific conflation of variables is suitable to this task. However, it is an open question whether the same conflation of variables is also suitable to the analysis of other phenomena. If we were to apply it as is, we would not be able to tell whether this is the case. Thus, we should always define our variables in terms of a single dimension and deal with complex concepts (like TOPICALITY) by analyzing the data in terms of a set of such variables.
After defining a variable (or set of variables) and deciding on the type and

the coding scheme spells out that it does not matter by what linguistic means humans are referred to (e.g., proper names, common nouns including kinship terms, and pronouns) and that dead, fictional or potential future humans are included as well as humanoid entities like gods, elves, ghosts, and androids.
The category ORGANIZATION is much more complex to apply consistently, since there is no intuitively accessible and generally accepted understanding of what constitutes an organization. In particular, it needs to be specified what distinguishes an ORGANIZATION from other groups of human beings (that are to be categorized as HUMAN according to the coding scheme). The coding scheme defines an ORGANIZATION as a referent involving more than one human with some degree of group identity. It then provides the following hierarchy of properties that a group of humans may have (where each property implies the presence of all properties below its position in the hierarchy):
It then states that any group of humans at "+ collective voice" or higher should be categorized as ORGANIZATION, while those below should simply be coded as HUMAN. By listing properties that a group must have to count as an organization in the sense of the coding scheme, the decision is simplified considerably, and by providing a decision procedure, the number of unclear cases is reduced. The coding scheme also illustrates the use of the hierarchy:
Thus, while the posse would be an ORG, the mob might not be, depending on whether we see the mob as having a collective purpose. The crowd would not be considered ORG, but rather simply HUMAN.
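Such a cut-off rule is easy to state as a decision procedure. In the sketch below, only the cut-off point ("+ collective voice") is taken from the coding scheme as described above; the other property names in the hierarchy are invented placeholders.

```python
# Hierarchy of group properties, highest first. The coding scheme's rule:
# a group at "+ collective voice" or higher counts as ORGANIZATION.
# All property names except "collective voice" are hypothetical here.
GROUP_PROPERTIES = [
    "institutionalized structure",  # hypothetical
    "collective voice",             # cut-off named by the coding scheme
    "shared activity",              # hypothetical
    "group identity",               # hypothetical
]
CUTOFF = GROUP_PROPERTIES.index("collective voice")

def code_group(highest_property):
    """Code a group of humans as ORG or HUMAN, given the highest
    property in the hierarchy that the group possesses."""
    rank = GROUP_PROPERTIES.index(highest_property)
    return "ORG" if rank <= CUTOFF else "HUMAN"

print(code_group("collective voice"))  # ORG
print(code_group("group identity"))   # HUMAN
```

The coder's remaining task is the genuinely linguistic one: deciding which property a given group referent actually has.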
will return to in the next and final subsection of this chapter. Before I do so, I will briefly address a practical question: that of where and how the coding decisions should be recorded.
Other variables that are sometimes recorded in the corpus itself are LEMMA and SYNTACTIC STRUCTURE. When more than one variable is annotated in a corpus, the corpus is traditionally structured as shown in Figure 5-1, with one word per line and different columns for the different types of annotation (alternatively, complex XML tags are used).
N12:0300.24  YC     +,         -          .
N12:0300.27  CCB    but        but        [S+.
N12:0300.30  NP1m   Curt       Curt       [Nns:s.Nns:s]
N12:0300.33  VDD    did        do         [Vde.
N12:0300.39  XX     +n't       not        .
N12:0300.42  VV0v   interpret  interpret  .Vde]
N12:0310.03  PPH1   it         it         [Ni:o.Ni:o]
N12:0310.06  DD1i   this       this       [Ns:h.
N12:0310.09  NNL1n  way        way        .Ns:h]S+]S]
N12:0310.12  YF     +.         -          .
Note that this type of corpus structure requires concordancers that are specifically geared towards processing column-based input; many of the commercially available software packages cannot handle this.
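For simple tasks, however, such a corpus can be processed with a few lines of code, since each line only needs to be split into its columns. A minimal Python sketch, assuming the field order of the sample above (reference, tag, word, lemma, parse):

```python
def parse_line(line):
    """Split one line of a column-based corpus into named fields.
    Assumes the five-column layout of the sample shown above."""
    ref, tag, word, lemma, parse = line.split(None, 4)
    return {"ref": ref, "tag": tag, "word": word,
            "lemma": lemma, "parse": parse}

line = "N12:0300.30 NP1m Curt Curt [Nns:s.Nns:s]"
token = parse_line(line)
print(token["word"], token["tag"], token["lemma"])  # Curt NP1m Curt
```

The `maxsplit` argument (the `4` in `split(None, 4)`) keeps any whitespace inside the final parse field from being split further.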
In the case of annotations added in the context of specific research projects (especially if they are added manually), the second option mentioned above is typically preferred: the data are extracted, stored in a separate file, and then annotated. Frequently, spreadsheet applications are used to store the corpus data and annotation decisions, as in the example in Fig. 5-2, where possessive pronouns and nouns are coded for NOMINAL TYPE, ANIMACY and CONCRETENESS:

The first line contains labels that tell us what information is found in each column. This should include the example itself (either as shown in Fig. 5-2, or in the form of a KWIC concordance line) and meta-information such as what corpus and/or corpus file the example was extracted from. Crucially, it will include the relevant variables. Each subsequent line contains one observation (i.e., one hit and the appropriate values of each variable). This format (one column for each variable and one line for each example) is referred to as a raw data table. It is the standard way of recording measurements in all empirical sciences and we should adhere to it strictly, as it ensures that the structure of the data is retained. In particular, one should never store one's data in summarized form, for example, as shown in Figure 5-3.
    A          B        C            D
1   Noun Type  pronoun  proper name  noun
2              1        2            0
3   Animacy    human    animate      inanimate
4              2        1            0
Such a summary can be created automatically from a raw data table like that in Figure 5-2 when needed, but if we record our coded data like this in the first place, we would have no way of reconstructing the original cases and thus no way of telling which combinations of variables actually occurred in the data. In this particular case, for example, we cannot tell whether one of the human referents or the animate referent was referred to by the pronoun. In addition, statistics software packages require a raw data table as input.
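The relationship between the two formats can be illustrated with a short Python sketch: per-variable summaries like Figure 5-3 can always be computed from a raw data table, while the reverse is impossible. The rows below are modelled loosely on the coding described for Fig. 5-2.

```python
from collections import Counter

# A raw data table: one row per observation, one column per variable
raw_data = [
    {"NOUN TYPE": "pronoun",     "ANIMACY": "human"},
    {"NOUN TYPE": "proper name", "ANIMACY": "human"},
    {"NOUN TYPE": "proper name", "ANIMACY": "animate"},
]

# Summarize each variable separately, as in a Figure 5-3-style summary
for variable in ("NOUN TYPE", "ANIMACY"):
    counts = Counter(row[variable] for row in raw_data)
    print(variable, dict(counts))

# Only the raw table still tells us which combinations occurred,
# e.g. how many pronouns had a human referent:
print(sum(1 for r in raw_data
          if r["NOUN TYPE"] == "pronoun" and r["ANIMACY"] == "human"))
```

Going in the other direction (from the two summaries back to the combinations) is impossible, which is exactly the information loss described above.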
Since any kind of quantitative analysis requires a raw data table, we might conclude that it is the only useful way of recording our coding decisions. However, there are cases where it may be more useful to record them in the form of annotations in (a copy of) the original corpus. For example, the information in Fig. 5-1 could be recorded in the corpus itself in the same way that part-of-speech tags are, i.e., we could add an ANIMACY label to every nominal element in our corpus, either in the column format shown in Figure 5-1, or in the format used for POS tags by the original version of the BROWN corpus, as shown in (5.19):

From a corpus encoded in this way, we could easily create the raw data list in Figure 5-1 by searching for possessives and separating the hits into the word itself, the PART-OF-SPEECH label and the ANIMACY annotation (this can be done manually, or we could write a simple computer program to do it for us). The advantage would be that we, or other researchers, could also use our coded data for research projects concerned with completely different research questions. Thus, if we are dealing with a variable that is likely to be of general interest, we should consider the possibility of annotating the corpus itself instead of first
extracting the relevant data to a raw data table and coding them afterwards.16
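Such a simple program is indeed simple: the original BROWN corpus attaches POS tags to words with a slash (word/TAG), and if we assume, purely for illustration, a second slash-separated field holding the ANIMACY annotation, extracting a raw data table reduces to splitting tokens. (PP$ is the BROWN tag for possessive pronouns; the three-field token format is an assumption, not the actual BROWN format.)

```python
def extract(tagged_text, pos_prefix="PP$"):
    """Collect (word, pos, animacy) triples for tokens whose POS tag
    starts with pos_prefix. Assumes hypothetical word/POS/ANIMACY
    tokens; the BROWN corpus itself uses only word/POS."""
    rows = []
    for token in tagged_text.split():
        word, pos, animacy = token.split("/")
        if pos.startswith(pos_prefix):
            rows.append((word, pos, animacy))
    return rows

text = "her/PP$/HUMAN new/JJ/NONE book/NN/CONCRETE"
print(extract(text))  # [('her', 'PP$', 'HUMAN')]
```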
assess the quality of our automatized coding scheme in the same way in which we would assess the quality of the results returned by a particular query (cf. Section 5.1.2, esp. Table 5-1). In the context of coding data, a true positive result for a particular value would be a case where that value has been assigned to a corpus example correctly, a false positive would be a case where that value has been assigned incorrectly, a false negative would be a case where the value has not been assigned although it should have been, and a true negative would be a case where the value has not been assigned and should not have been assigned.
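Given a sample for which the correct codes are known, these four counts, and measures derived from them such as precision and recall, can be computed mechanically. A minimal sketch for a single value of interest (the codes below are invented for illustration):

```python
def evaluate(automatic, gold, value):
    """Count TP/FP/FN/TN for one value by comparing automatic codes
    against a gold standard, and derive precision and recall."""
    tp = sum(a == value and g == value for a, g in zip(automatic, gold))
    fp = sum(a == value and g != value for a, g in zip(automatic, gold))
    fn = sum(a != value and g == value for a, g in zip(automatic, gold))
    tn = sum(a != value and g != value for a, g in zip(automatic, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return tp, fp, fn, tn, precision, recall

auto = ["HUMAN", "HUMAN", "ORG", "ORG", "HUMAN"]
gold = ["HUMAN", "ORG",   "ORG", "ORG", "HUMAN"]
print(evaluate(auto, gold, "ORG"))  # tp=2, fp=0, fn=1, tn=2
```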
This assumes, of course, that we can determine with sufficient certainty whether a particular value should be assigned to a particular corpus example. Note that this problem also exists with coding schemes for the manual coding of data: as pointed out above, coding usually involves a certain degree of subjective interpretation, so we need to ensure that the influence of subjective judgments is as small as possible. The best way of doing this is to have (at least) two different coders apply the coding scheme to the data; if our measurements cannot be made objective (and, as should be clear by now, they rarely can in linguistics), this will at least allow us to ensure that they are intersubjectively reliable.
16 Such a direct annotation of corpus files is rarely done in corpus linguistics, but it has become the preferred strategy in various fields concerned with qualitative analysis of textual data. There are open-source and commercial software packages dedicated to this task (see Appendix A.3). They typically allow the user to define a set of codes, import a text file, and then assign the codes to a word or larger textual unit by selecting it with the mouse and then clicking a button for the appropriate code that is then added (often in XML format) to the imported text. This strategy has the additional advantage that one can view one's coded examples in their original context (which may be necessary when annotating additional variables later). However, the available software packages are geared towards the analysis of individual texts and do not let the user work comfortably with large corpora.
One approach would be to have the entire data set coded by two coders independently on the basis of the same coding scheme. We could then identify all cases in which the two coders did not assign the same value and determine where the disagreement came from. Obvious possibilities include cases that are not covered by the coding scheme at all, cases where the definitions in the coding scheme are too vague to apply or too ambiguous to make a principled decision, and cases where one of the coders has misunderstood the corpus example or made a mistake due to inattention. Where the coding scheme is to blame, it could be revised accordingly and re-applied to all unclear cases. Where a coder is at fault, they could correct their coding decision. At the end of this process we would have a carefully coded data set with no (or very few) unclear cases left.
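If both codings are stored as parallel columns of a raw data table, identifying the disagreements can be automated. A minimal sketch (the examples and codes below are invented for illustration):

```python
def disagreements(coder_a, coder_b, examples):
    """Return the examples on which two coders assigned different values,
    together with both codes, for later discussion."""
    return [(ex, a, b)
            for ex, a, b in zip(examples, coder_a, coder_b)
            if a != b]

examples = ["the city's museums", "the mob", "her book"]
coder_a  = ["S-POSS", "HUMAN", "S-POSS"]
coder_b  = ["S-POSS", "ORG",   "S-POSS"]
print(disagreements(coder_a, coder_b, examples))
# [('the mob', 'HUMAN', 'ORG')]
```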
However, in practice there are two problems with this procedure. First, it is extremely time-consuming, which will often make it difficult or even impossible to find a second coder. Second, discussing all unclear cases but not the apparently clear cases holds the danger that the former will be coded according to different criteria than the latter. Both problems can be solved (or at least alleviated) by testing the coding scheme on a smaller dataset using two coders and calculating its reliability across coders. If this so-called inter-rater reliability (see Carletta 1996) is sufficiently high, the coding scheme can be applied to the actual data set by a single coder. If not, it needs to be made more explicit and applied to another
project and there are many situations where a higher reliability is necessary.
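A standard measure of inter-rater reliability for nominal codings is the kappa statistic discussed by Carletta (1996), which corrects the observed agreement for the agreement expected by chance. A sketch of Cohen's kappa for two coders (the codings below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' nominal codings of the same items."""
    n = len(coder_a)
    # proportion of items on which the coders actually agree
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    # chance agreement: probability that both coders independently
    # pick the same value, given their individual value frequencies
    expected = sum(freq_a[v] / n * freq_b[v] / n for v in freq_a)
    return (observed - expected) / (1 - expected)

a = ["OLD", "OLD", "NEW", "NEW", "OLD", "NEW", "OLD", "NEW"]
b = ["OLD", "OLD", "NEW", "OLD", "OLD", "NEW", "OLD", "NEW"]
print(round(cohens_kappa(a, b), 3))  # 0.75
```

Here the coders agree on 7 of 8 items (87.5%), but since chance alone would produce 50% agreement with these value frequencies, kappa is only 0.75.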
Further reading

No matter what corpora and concordancing software you work with, you will need regular expressions at some point. High-quality information is easy to find online, but if you insist on buying a book, Friedl (2006) is as good as any. Barnbrook (1996, Ch. 8.1) is an interesting case study on retrieving data from corpora with non-standardized orthography. Leech (2005) is a good introduction to the issues involved in annotating corpora.
6 Quantifying Research Questions
Recall, once again, that at the end of Chapter 2, we presented the following definition of corpus linguistics:

Corpus linguistics is the investigation of linguistic research questions that have been framed in terms of the conditional distribution of linguistic phenomena in a linguistic corpus.

We discussed the fact that this definition covers cases of hypotheses phrased in absolute terms, i.e. cases where the distribution of a phenomenon across different conditions is a matter of all or nothing (for example, "All speakers of American English refer to the front window of a car as windshield, all speakers of British English as windscreen"). We pointed out that it also covers cases where the
stated in quantitative terms, which in turn means that our data have to be quantified in some way so that we can compare them to our predictions. In this chapter, we will discuss in more detail how this is done when dealing with different types of data.
Given that, as mentioned in Chapter 3, hypotheses stated in relative terms are actually more typical in corpus linguistics than hypotheses stated in absolute terms, one might expect quantitative methods to be a well-established aspect of corpus-linguistic methodology. However, this is not the case. Many early corpus-linguistic studies treated quantification impressionistically at best (stating that one thing seemed to be more common in a given corpus or under a given condition than another), and although this has changed over the last twenty years
(slowly at first, but increasingly rapidly in the past decade), the impressionistic school of corpus linguistics is still strong. In order to distinguish the more principled quantitative approach taken here from this impressionistic approach, it is now often referred to as Quantitative Corpus Linguistics (a distinction first suggested in Stefanowitsch and Gries 2005, although the term is, of course, older).

In this chapter, we will discuss the types of data we may be confronted with when quantifying the (annotated) results of a corpus query (Section 6.1). We will then discuss the way in which these types of data can be quantified (Sections 6.2, 6.3 and 6.4), laying the groundwork for a brief introduction to statistical hypothesis testing in the following chapters.
6.1 Types of Data
Let us turn to an example that is more complex and much closer to the kind of phenomenon actually of interest to corpus linguists than the distribution of words across varieties: that of the two English possessive constructions. In English, certain types of semantic relations, possession being very prominent among them, can be expressed in two alternative ways; either by what is traditionally referred to as the genitive (cf. 6.1a) or by the preposition of connecting two noun phrases (cf. 6.1b):

(6.1a) The city's museums are treasure houses of inspiring objects from all eras and cultures. [www.res.org.uk]
(6.1b) Today one can find the monuments and artifacts from all of these eras in the museums of the city. [www.travelhouseuk.co.uk]
number of relations that are exclusively encoded by of, such as quantities (both generic, as in a couple/bit/lot of, and in terms of measures, as in six miles/years/gallons of), type relations (a kind/type/sort/class of) and substance (a mixture of water and whisky, a dress of silk, etc.) (cf. Stefanowitsch 2003). Second, and more interestingly, even a relation that can in principle be expressed by both constructions often cannot be expressed equally well by both constructions in a specific situation. A number of explanations have been proposed (and investigated using quantitative corpus linguistics), for example the following, which we will also investigate here (since these explanations are treated as hypotheses here, I will take the liberty of illustrating them with constructed examples rather than corpus data).
(6.3a) The Guggenheim is much larger than the museums of other major cities.
(6.3b) ??The Guggenheim is much larger than other major cities' museums.

b) Animacy. Since animate referents tend to be more topical than inanimate ones and more topical elements tend to precede less topical ones, if the modifier is animate, the genitive will be preferred; if it is inanimate, the construction with of will be preferred (Quirk et al. 1972: 192-203, Deane 1987):

(6.4a) Solomon R. Guggenheim's collection contains some fine paintings.
(6.4b) The collection of Solomon R. Guggenheim contains some fine paintings.

modifier is short, the genitive will be preferred; if it is long, the construction with of will be preferred (Altenberg 1980):
In all three cases, we are dealing with hypotheses concerning preferences rather than absolute differences. None of the examples with question marks are ungrammatical and all of them could conceivably occur; they just sound a little bit odd, or even just a little less good than the alternatives. Thus, the predictions we can derive from each hypothesis must be stated, and tested, in terms of relative rather than absolute differences; they all involve predictions stated in terms of more-or-less rather than all-or-nothing. Relative quantitative differences are expressed and dealt with in different ways depending on the type of data they involve. There are three types of data, exemplified respectively by the variables in each of the hypotheses discussed above: nominal (or categorical) data, ordinal data and cardinal data. Let us discuss each of these in turn.
no intrinsic order with respect to each other (i.e., there is no aspect of their definition that would allow us to put them in a natural order). If we categorize data in terms of such a nominal variable (either because there is no other way of categorizing them or simply because we choose to categorize them in this way), the only way to quantify them is to count the number of observations in each category and express the result in absolute frequencies (i.e., raw numbers) or relative frequencies (i.e., percentages). We cannot rank the data based on intrinsic criteria, and we cannot calculate mean values.
Some typical examples of nominal variables are demographic variables like SEX, NATIONALITY or LANGUAGE. In a sample of speakers, we could count how many of them are speakers of GERMAN and how many are speakers of FRENCH, but we cannot rank the German language higher than the French language on intrinsic criteria. This does not mean that we can never rank the two languages; for example, we could rank them by number of native speakers worldwide (in which case German ranks above French) or by the number of countries in which they are an official language (in which case French ranks above German). But the number of speakers or the number of countries where a language has an official status is not part of the definition of that language (German would still be German if its number of speakers was reduced by half because the others decided to speak only English from this day on, and French would still be French if it lost its official status in every single country).
In other words, we would be ranking these languages by extrinsic criteria, which means that we are really ranking something completely different: in the first case, we are ranking the variable SIZE OF SPEECH COMMUNITY, in the second case we are ranking the variable NUMBER OF COUNTRIES WITH OFFICIAL LANGUAGE X by size. We also cannot calculate a mean value between the two languages (for example, claiming that English is the mean of German and French because its vocabulary is derived in equal proportions from the Germanic and the Romance language families). We can calculate the mean number of speakers of a set of languages or the mean number of countries they are spoken in, but again, we would not be calculating means between languages but between sets of speakers or countries.
With respect to the examples above, it should be obvious that they all involve a nominal variable we could call TYPE OF POSSESSIVE CONSTRUCTION (with the values GENITIVE and OF-CONSTRUCTION). We can categorize all possessive expressions in a corpus in terms of these two categories and we can then report the frequency of each: for example, by my count, the genitive occurs 22,193 times in the BROWN corpus (excluding proper names and instances of the double genitive), and the of-construction 17,800 times (this latter value is actually an estimate; it would take too long to go through all 36,406 occurrences of of and identify those that occur in the structure relevant here, so I categorized a random subsample of 500 hits of of and generalized the proportion of of-constructions vs. other uses of of to the total number of hits for of).
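The estimation procedure described in parentheses is simple proportional projection: the share of relevant hits in the coded subsample is applied to the total number of hits. A sketch (the subsample count of 244 is a hypothetical figure chosen for illustration; the text above reports only the rounded estimate):

```python
def estimate_total(relevant_in_sample, sample_size, total_hits):
    """Project the proportion of relevant hits in a coded subsample
    onto the total number of hits for the query."""
    proportion = relevant_in_sample / sample_size
    return round(proportion * total_hits)

# Hypothetical: 244 of 500 randomly sampled hits of "of" turned out
# to be of-constructions; 36,406 hits of "of" in total
print(estimate_total(244, 500, 36406))
```

Such an estimate inherits the sampling error of the subsample, which is why a figure like 17,800 should be read as "roughly 17,800".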
We can rank the constructions in terms of their frequency (the genitive is more frequent), but in this case we are not ranking the constructions based on an intrinsic criterion, but on an extrinsic one: their corpus frequency in one particular corpus. We can also calculate their mean frequency (19,996.5), but again, this is not a mean of the two constructions, but of their frequencies in one particular corpus. Again, the two constructions are in no way defined by their frequency, so their frequency is an extrinsic criterion.
An ordinal variable is a variable whose values are labels for categories that do have an intrinsic order with respect to each other but that cannot be expressed in terms of natural numbers. In other words, ordinal variables are variables that are defined in such a way that some aspect of their definition allows us to order them without reference to an extrinsic criterion, but that does not give us any information about the distance (or degree of difference) between one category and the next. If we categorize data in terms of such an ordinal variable, we can treat them accordingly (i.e., we can rank them), or we can treat them like nominal data by simply ignoring their inherent order (i.e., we can still count the number of observations for each value and report absolute or relative frequencies). We cannot calculate mean values.
Some typical examples of ordinal variables are demographic variables like EDUCATION or (in the appropriate sub-demographic) MILITARY RANK, but also SCHOOL GRADES and the kind of ratings often found in questionnaires (both of which are, however, often treated as though they were cardinal data, see below). For example, academic degrees are intrinsically ordered: it is part of the definition of a PhD degree that it ranks higher than a master's degree, which in turn ranks higher than a bachelor's degree. Thus, we can easily rank speakers in a sample of university graduates based on the highest degree they have completed. We can also simply count the number of PhDs, MAs, and BAs and ignore the ordering of the degrees. But we cannot calculate a mean: if five speakers in our sample have a PhD and five have a BA, this does not allow us to claim that all of them have an MA degree on average. The first important reason for this is that the size of the difference in terms of skills and knowledge that separates a BA from an MA is not the same as that separating an MA from a PhD: in Europe, one typically studies two years for an MA, but it typically takes four to five years to complete a PhD. The second important reason is that the values of ordinal variables typically differ along more than one dimension: while it is true that a PhD is a higher degree than an MA, which is a higher degree than a BA, the three degrees also differ in terms of specialization (from a relatively broad BA to a very narrow PhD), and the PhD degree differs from the two other degrees qualitatively: a BA and an MA primarily show that one has acquired knowledge and (more or less practical) skills, but a PhD primarily shows that one has acquired research skills.
ANIMATE > INANIMATE > ABSTRACT

On this scale, ANIMATE ranks higher than INANIMATE, which ranks higher than ABSTRACT in terms of the property we are calling animacy, and this ranking is determined by the scale itself, not by any extrinsic criteria. This means that we could categorize and rank all nouns in a corpus according to their animacy. But again, we cannot calculate a mean. If we have 50 HUMAN nouns and 50 ABSTRACT nouns, we cannot say that our 100 nouns have a mean animacy of INANIMATE. Again, this is because we have no way of knowing whether, in terms of animacy, the difference between ANIMATE and INANIMATE is the same size as that between INANIMATE and ABSTRACT, but also because we are, again, dealing with qualitative as well as quantitative differences: the difference between animate and inanimate on the one hand and abstract on the other is that the first two have physical existence; and the difference between animate on the one hand and inanimate and abstract
on the other is that animates are potentially alive and the other two are not. In other words, our scale is really a combination of at least two dimensions. Again, we could also ignore the intrinsic order of the values on our ANIMACY scale and simply treat them as nominal data, i.e., count them and report the frequency with which each value occurs in our data. Potentially ordinal data are actually frequently treated like nominal data in corpus linguistics, and with complex scales combining a range of different dimensions, this is probably a good idea; but ordinal data also have a useful place in quantitative corpus linguistics.
Cardinal variables are variables whose values are numerical measurements along a particular dimension. In other words, they are intrinsically ordered (like ordinal data), but not because some aspect of their definition allows us to order them, but because of their nature as numbers. Also, the distance between any two measurements is precisely known and can itself be directly expressed as a number. This means that we can perform any arithmetic operation on cardinal data. Crucially, we can calculate means. Of course, we can also treat cardinal data like rank data by ignoring all of their mathematical properties other than their order, and we could also treat them as nominal data (see further below).
Typical cases of cardinal variables are demographic variables like AGE or INCOME. For example, we can categorize a sample of speakers by their age and then calculate the mean age of our sample. If our sample contains 5 50-year-olds and 5 30-year-olds, it makes perfect sense to say that the mean age in our sample is 40; we might need additional information to distinguish between this sample and another sample that consists of 5 41-year-olds and 5 39-year-olds, which would also have a mean age of 40 (cf. Section 6.2.3 below), but the mean itself is meaningful, because the distance between 30 and 40 is the same as that between 40 and 50 and
6.1.4 Interim summary
In the preceding three subsections, we have repeatedly mentioned concepts like frequency, percentage, rank and mean. In the following three sections, we will introduce these concepts in more detail, providing a solid foundation of descriptive statistical measures for nominal, ordinal and cardinal data.

Note, however, that most research designs, including those useful for investigating the hypotheses about the two possessive constructions, involve (at least) two variables: (at least) one independent one and (at least) one dependent one. Even our definition of corpus linguistics makes reference to this fact when it states that research questions should be framed such that we are able to answer them by looking at the distribution of linguistic phenomena across different conditions.
Since such conditions are most likely to be nominal in character (a set of
nominal and one ordinal variable, and (c) designs with one nominal and one cardinal variable. Logically, there are three additional designs, namely designs with (d) two ordinal variables, (e) two cardinal variables or (f) one ordinal and one cardinal variable. For such cases, we would need different types of correlation analysis, which we will not discuss in this book in any detail (there are pointers to the relevant literature at the end of Chapter 7).
variables: VARIETY (British vs. American English) and some linguistic alternation (mostly regional synonyms of some lexicalized concept). Thus, this type of research design should already be somewhat familiar.

For a closer look, we will apply it to the first of the three hypotheses introduced in the preceding section, which is restated here with the background assumption from which it is derived:
Before we can turn to the quantitative prediction, there are some aspects of the research design that need to be discussed. First, the two constructions have been renamed to show that they are values of a variable in a specific research design. As such, they must, of course, be given operational definitions. The definitions I used were the following:

Proper names are excluded in both cases because they are fixed and could not vary; therefore, they will not be subject to any restrictions concerning givenness, animacy or length. The OF-POSSESSIVE was defined in such a way that only those cases count that are in a semantic relationship with the S-POSSESSIVE. This, in a sense, is what construes them theoretically as values of a single variable POSSESSIVE.
Next, DISCOURSE-OLD and DISCOURSE-NEW have to be operationalized. This could be done using the measure of referential distance discussed in Chapter 4, but instead I chose an indirect operationalization: it is well established that pronouns refer to old information, whereas new information must be introduced in lexical NPs. Thus, we will define DISCOURSE-OLD as "encoded by a pronoun" and DISCOURSE-NEW as "encoded by a lexical NP". The advantage is that this is a very easily retrievable and codable property; the disadvantage is that lexical NPs may also refer to
pronoun. Consider (6.11a, b):
Ideally, it should be repeated by a second rater, to make sure nothing was missed,
but as pointed out in the preceding chapter, this is a very time-consuming
(6.13) Prediction: There will be more cases of the S-POSSESSIVE with DISCOURSE-OLD modifiers than with DISCOURSE-NEW modifiers, and more cases of the OF-POSSESSIVE with DISCOURSE-NEW modifiers than with DISCOURSE-OLD modifiers.
Table 6-1 shows the absolute frequencies of the parts of speech of the modifier in both constructions.

Table 6-1. Part of speech of the modifier in the s-possessive and the of-possessive

                              POSSESSIVE
DISC. STATUS      S-POSSESSIVE    OF-POSSESSIVE    Total
DISCOURSE-OLD          180                3          183
DISCOURSE-NEW           20              153          173
Total                  200              156          356
cell showing the table total (the sum of all four intersections). The row and column totals for a given cell are referred to as the marginal frequencies for that cell.
6.2.1 Percentages
The frequencies in Table 6-1 are fairly easy to interpret in this case, because the differences in frequency are very clear. However, we should be wary of basing our assessment of corpus data directly on raw frequencies in a contingency table. These can be very misleading, especially if the marginal frequencies of the variables differ substantially, which, in this case, they do: the s-possessive is more frequent overall than the of-possessive, and discourse-old modifiers (i.e., pronouns) are slightly more frequent overall than discourse-new ones (i.e., lexical nouns).
Thus, it is generally useful to convert the absolute frequencies to relative frequencies, abstracting away from the differences in marginal frequencies. In order to convert an absolute frequency n into a percentage, we simply divide it by the total number of cases N of which it is a part and multiply the result by 100:

(6.14)  (n / N) × 100
For example, if we have a group of 31 students studying some foreign language and six of them study German, the percentage of students studying German is

(6.15)  6 / 31 = 0.1935, and 0.1935 × 100 = 19.35%
In other words, a percentage is just another way of expressing a decimal fraction, which is just another way of expressing a fraction (note that in academic

by its column total, by its row total or by the table total. Table 6-2 shows the results for all three possibilities.
The column percentages can be related to our prediction most
Table 6-2. Absolute and relative frequencies of the modifiers' POS in the English possessive constructions

                                        POSSESSIVE
                             S-POSSESSIVE        OF-POSSESSIVE        Total
DISCOURSE  OLD    Abs. Freq.     180      Abs. Freq.      3     Abs. Freq.   183
STATUS            Col. Pct.    90.00%     Col. Pct.    1.92%    Col. Pct.  51.40%
                  Row Pct.     98.36%     Row Pct.     1.64%    Row Pct.     100%
                  Tab. Pct.    50.56%     Tab. Pct.    0.84%
           NEW    Abs. Freq.      20      Abs. Freq.    153     Abs. Freq.   173
                  Col. Pct.    10.00%     Col. Pct.   98.08%    Col. Pct.  48.60%
                  Row Pct.     11.56%     Row Pct.    88.44%    Row Pct.     100%
                  Tab. Pct.     5.62%     Tab. Pct.   42.98%
           Total  Abs. Freq.     200      Abs. Freq.    156     Abs. Freq.   356
                  Col. Pct.      100%     Col. Pct.     100%    Col. Pct.    100%
                  Row Pct.     56.18%     Row Pct.    43.82%    Row Pct.     100%
                                                                Tab. Pct.    100%
This is the case in Table 6-2, and this is certainly compatible with our hypothesis. However, if it were not the case, this could also be compatible with our hypothesis. Note that the constructions differ in frequency, with the of-possessive being only three-quarters as frequent as the s-possessive. Now imagine the difference was ten to one instead of four to three. In this case, we might well find that the majority of both old and new modifiers occurs in the s-possessives, albeit less so in the case of new modifiers. In other words, even if we are looking at row percentages, the relevant comparisons are across rows, not within rows.
Whether column percentages or row percentages are more relevant to a hypothesis depends, of course, on the way the variables are arranged in the table: if we rotate the table such that the variable POSSESSIVE ends up in the rows, then the row percentages would be more relevant. When interpreting percentages in a contingency table, we have to find those that actually relate to our hypothesis. In any case, the interpretation of both row and column percentages requires us to choose one value of one of our variables and compare it across the two values of the other variable, and then compare this comparison to a comparison of the other value of that variable. If that sounds complicated, this is because it is complicated.
It would be less confusing if we had a way of taking into account both values of both variables at the same time. The table percentages allow this to some extent. The way our hypothesis is phrased, we would expect a majority of cases to instantiate the intersections S-POSSESSIVE ∩ DISCOURSE-OLD and OF-POSSESSIVE ∩ DISCOURSE-NEW, with a minority of cases instantiating the other two intersections. In Table 6-2, this is clearly the case: the intersection S-POSSESSIVE ∩ DISCOURSE-OLD contains more than fifty percent of all cases, the intersection OF-POSSESSIVE ∩ DISCOURSE-NEW well over 40 percent. Again, if the marginal frequencies differ more extremely, so may the table percentages in the relevant intersections. We could imagine a situation, for example, where 90 percent of the cases fell into the intersection S-POSSESSIVE ∩ DISCOURSE-OLD and 10 percent in the intersection OF-POSSESSIVE ∩ DISCOURSE-NEW; this would still be a confirmation of our hypothesis.
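In code, all three kinds of percentages fall out of the same three divisions. The following sketch (plain Python; the labels and variable names are my own) computes the column, row, and table percentages of Table 6-2 from the observed frequencies:

```python
# Observed frequencies from Table 6-2
# (rows: discourse status of the modifier, columns: construction)
observed = {
    ("old", "s-poss"): 180, ("old", "of-poss"): 3,
    ("new", "s-poss"): 20,  ("new", "of-poss"): 153,
}

row_totals = {r: sum(n for (r2, _), n in observed.items() if r2 == r)
              for r in ("old", "new")}
col_totals = {c: sum(n for (_, c2), n in observed.items() if c2 == c)
              for c in ("s-poss", "of-poss")}
table_total = sum(observed.values())

# For each cell: column, row, and table percentage
percentages = {
    (r, c): (n / col_totals[c] * 100,   # Col. Pct.
             n / row_totals[r] * 100,   # Row Pct.
             n / table_total * 100)     # Tab. Pct.
    for (r, c), n in observed.items()
}

for (r, c), (col, row, tab) in percentages.items():
    print(f"{r} / {c}: col {col:6.2f}%  row {row:6.2f}%  tab {tab:6.2f}%")
```

Dividing each cell by its column total, its row total, and the table total reproduces the three percentage rows of Table 6-2.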
While percentages are, with due care, more easily interpretable than absolute frequencies, they have two disadvantages. First, by abstracting away from the absolute frequencies, we lose valuable information: we would interpret a distribution such as that in Table 6-3 differently if we knew that it was based on a sample of just 35 instead of 356 corpus hits. Second, they provide no sense of how different our observed distribution is from the distribution that we would expect if there was no relation between our two variables, i.e., if the values were
But how can we calculate the expected frequencies of the intersections of our variables? Consider the textbook example of a random process: flipping a coin onto a hard surface. There are two possible outcomes, heads and tails, and we know that their probability is fifty percent (or 0.5) each (ignoring the remote possibility that the coin will land, and remain standing, on its edge). From these probabilities, we can calculate the expected frequency of heads and tails in a series of coin flips. If we flip the coin ten times, the expected frequency is 5 (i.e., fifty percent of ten, or 0.5 × 10) for heads and tails each; if we flip the coin 42 times, the expected frequency would be 21 (0.5 × 42) for each, and so on. In the real world, we would of course expect some variation (more on this
and fifty of them were discourse-new (i.e., lexical NPs). But this is not the case: there are more discourse-old modifiers than discourse-new ones (183 vs. 173) and there are more s-possessives than of-possessives (200 vs. 156).
These marginal frequencies of our variables and their values are a fact about our data that must be taken as a given when calculating the expected frequencies: our hypothesis says nothing about the overall frequency of the two constructions or the overall frequency of discourse-old and discourse-new modifiers, but only about the frequencies with which these values should co-occur. In other words, the question we must answer is the following: Given that the s- and the of-possessive occur 200 and 156 times respectively, and given that there are 183

the relative frequencies in each column should be the same as those of the column total.
For example, 56.18 percent of all possessive constructions in our sample are s-possessives and 43.82 percent are of-possessives; if there were only a chance relationship between type of construction and discourse status of the modifier, we should find the same proportions for the 183 constructions with old modifiers, i.e. 183 × 0.5618 = 102.81 s-possessives and 183 × 0.4382 = 80.19 of-possessives. Likewise, there are 173 constructions with new modifiers, so 173 × 0.5618 = 97.19 of them should be s-possessives and 173 × 0.4382 = 75.81 of them should be of-possessives. The same goes for the columns: 51.4 percent of all constructions have old modifiers and 48.6 percent have new modifiers. If there were only a chance relationship between type of construction and discourse status of the modifier, we should find the same proportions for both types of possessive construction: there should be 200 × 0.514 = 102.8 s-possessives with old modifiers and 97.2 with new modifiers, as well as 156 × 0.514 = 80.18 of-possessives with old modifiers and 156 × 0.486 = 75.82 of-possessives with new modifiers. Note that the expected frequencies for each intersection are the same whether we use the total row percentages or the total column percentages: the small differences are due to rounding errors.
To avoid rounding errors, we should not actually convert the row and column totals to percentages at all, but use the following much simpler way of calculating the expected frequencies: for each cell, we simply multiply its marginal frequencies and divide the result by the table total, as shown in Table 6-3. Note that we are using the standard convention of using O to refer to observed frequencies, E to refer to expected frequencies, and subscripts to refer to rows and columns. The convention for these subscripts is as follows: use 1 for the first row or column, 2 for the second row or column, and T for the row or column total, and give the index for the row before that of the column. For example, E21 refers to the expected frequency of the cell in the second row and the first column, O1T refers to the total of the first row, and so on.
Table 6-3. Calculating expected frequencies from observed frequencies

                                      DEPENDENT VARIABLE
                          VALUE 1                  VALUE 2                Total
INDEPENDENT  VALUE 1  E11 = (O1T × OT1) / OTT  E12 = (O1T × OT2) / OTT   O1T
VARIABLE     VALUE 2  E21 = (O2T × OT1) / OTT  E22 = (O2T × OT2) / OTT   O2T
             Total            OT1                      OT2               OTT
Applying this procedure to our observed frequencies yields the results shown in
Table 6-4. One should always report nominal data in this way, i.e., giving both the
observed and the expected frequencies in the form of a contingency table.
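This cell-wise recipe translates directly into code. The sketch below (plain Python; the structure and variable names are my own) computes the expected frequencies for the possessive data and reproduces the values reported in Table 6-4:

```python
# Observed frequencies from Table 6-1
# (rows: DISCOURSE-OLD, DISCOURSE-NEW; columns: S-POSSESSIVE, OF-POSSESSIVE)
observed = [[180, 3],
            [20, 153]]

row_totals = [sum(row) for row in observed]          # O1T, O2T
col_totals = [sum(col) for col in zip(*observed)]    # OT1, OT2
table_total = sum(row_totals)                        # OTT

# E_ij = (O_iT * O_Tj) / O_TT for every cell
expected = [[row_totals[i] * col_totals[j] / table_total
             for j in range(len(col_totals))]
            for i in range(len(row_totals))]

for row in expected:
    print(["{:.2f}".format(e) for e in row])   # ['102.81', '80.19'] / ['97.19', '75.81']
```

Because the marginal totals are used directly, no rounding errors creep in along the way.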
Table 6-4. Observed and expected frequencies of old and new modifiers in the s- and the of-possessive

                               POSSESSIVE
                     S-POSSESSIVE   OF-POSSESSIVE    Total
DISCOURSE  OLD    Obs.    180      Obs.     3      Obs. 183
STATUS            Exp.    102.81   Exp.    80.19
           NEW    Obs.     20      Obs.   153      Obs. 173
                  Exp.     97.19   Exp.    75.81
           Total  Obs.    200      Obs.   156      Obs. 356
We can now compare the observed and expected frequencies of each intersection to see whether the difference conforms to our quantitative prediction. This is clearly the case: for the intersections S-POSSESSIVE ∩ DISCOURSE-OLD and OF-POSSESSIVE ∩ DISCOURSE-NEW, the observed frequencies are higher than the expected ones; for the intersections S-POSSESSIVE ∩ DISCOURSE-NEW and OF-POSSESSIVE ∩ DISCOURSE-OLD, the observed frequencies are lower than the expected ones.

This conditional distribution seems to confirm our hypothesis. However, note that it does not yet prove or disprove anything, since, as mentioned above, we
Let us turn, next, to a design with one nominal and one ordinal variable: a test of the second of the three hypotheses introduced at the beginning of this chapter. Again, it is restated here together with the background assumption from which it is derived:
the data).
ANIMACY was operationally defined in terms of the coding scheme shown in Table 6-5 (based on Zaenen et al. 2003).
Table 6-5. The ANIMACY coding scheme (based on Zaenen et al. 2003)

HUMAN ATTRIBUTE         Body parts, organs, etc. of HUMANS.                                  4
CONCRETE TOUCHABLE      Physical entities that are incapable of life and can be touched.     5
CONCRETE NONTOUCHABLE   Physical entities that are incapable of life and cannot be touched.  6
LOCATION                Physical places and regions.                                         7
TIME                    Points in and periods of time.                                       8
EVENT                   Events.                                                              9
ABSTRACT                Other abstract entities.                                            10
(6.17) Prediction: The modifiers of the S-POSSESSIVE will tend to occur high on the ANIMACY scale, the modifiers of the OF-POSSESSIVE will tend to occur low on the ANIMACY scale.

Note that, phrased like this, it is not yet a quantitative prediction, since 'tend to' is not a mathematical concept. In contrast to 'frequency' (for nominal data) and 'average' or 'mean' (for cardinal data), which are used in everyday language with something close to their mathematical meaning, we do not have an everyday word for dealing with differences in ordinal data. We will return to this
point presently, but first, let us look at the data impressionistically. Table 6-6 shows the coded sample (cases are listed in the order in which they occurred in the corpus; the example IDs can be used to refer to the actual hits listed in Appendix D.1, Study 2).
Table 6-6. A sample of s- and of-possessives coded for Animacy (BROWN corpus)

(a) S-POSSESSIVE                        (b) OF-POSSESSIVE
Example ID   Animacy   Animacy Rank     Example ID   Animacy   Animacy Rank
6-6a1        ORG            2           6-6b1        LOC            8
6-6a2        HUM            1           6-6b2        ORG            2
6-6a3        HUM            1           6-6b3        EVT            7
6-6a4        CCN            6           6-6b4        HUM            1
6-6a5        ORG            2           6-6b5        CCT            5
6-6a6        ORG            2           6-6b6        ABS           10
6-6a7        HUM            1           6-6b7        HUM            1
6-6a8        HUM            1           6-6b8        ABS           10
6-6a9        HUM            1           6-6b9        CCN            6
6-6a10       HUM            1           6-6b10       ABS           10
A simple way of finding out whether the data conform to our prediction would be to sort the entire data set by the rank assigned to the examples and check whether the s-possessives cluster near the top of the list and the of-possessives near the bottom. Table 6-7 shows this ranking.
Table 6-7. The coded sample from Table 6-6 ordered by assigned rank

RANKING                             RANKING (CONT'D.)
Rank   Example                      Rank   Example
1      s-possessive (6-6a10)        3      s-possessive (6-6a16)
1      s-possessive (6-6a12)        3      s-possessive (6-6a22)
1      s-possessive (6-6a15)        4      of-possessive (6-6b13)
1      s-possessive (6-6a17)        5      s-possessive (6-6a11)
1      s-possessive (6-6a18)        5      of-possessive (6-6b11)
1      s-possessive (6-6a19)        5      of-possessive (6-6b16)
1      s-possessive (6-6a2)         5      of-possessive (6-6b17)
1      s-possessive (6-6a20)        5      of-possessive (6-6b18)
1      s-possessive (6-6a21)        5      of-possessive (6-6b5)
1      s-possessive (6-6a23)        6      s-possessive (6-6a4)
1      s-possessive (6-6a3)         6      of-possessive (6-6b9)
1      s-possessive (6-6a7)         7      of-possessive (6-6b14)
1      s-possessive (6-6a8)         7      of-possessive (6-6b3)
1      s-possessive (6-6a9)         8      of-possessive (6-6b1)
2      s-possessive (6-6a5)
2      s-possessive (6-6a6)
2      of-possessive (6-6b2)
Table 6-7 shows that the data conform to our hypothesis: among the cases whose modifiers have an animacy rank of 1 to 3, s-possessives dominate; among those with a modifier of rank 4 to 10, of-possessives make up an overwhelming majority.

However, we need a less impressionistic way of summarizing data sets coded as ordinal variables, since not all data sets will be as straightforwardly interpretable as this one. So let us turn to the question of an appropriate descriptive statistic for ordinal data.
6.3.1 Medians
As explained above, we cannot calculate a mean for a set of ordinal values, but we can do something similar. The idea behind calculating a mean value is, essentially, to provide a kind of mid-point around which a set of values is distributed; it is a so-called measure of central tendency. Thus, if we cannot calculate a mean, the next best thing is to simply list our data ordered from highest to lowest and find the value in the middle of that list. This value is known as the median: a value that splits a sample or population into a higher and a lower portion of equal sizes.

For example, the rank values for the Animacy of our sample of s-possessives are shown in (6.18):

(6.18)  1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 3 3 5 6
There are 23 values, thus the median is the twelfth value in the series, because there are eleven values above it and eleven below it. The twelfth value in the series is a 1, so the median value of s-possessive modifiers in our sample is 1 (or HUMAN).
If the sample consists of an even number of data points, we simply calculate the mean of the two values that lie in the middle of the ordered data set. For example, the rank values for the Animacy of our sample of of-possessives are shown in (6.19):

(6.19)  1 1 2 4 5 5 5 5 5 6 7 7 8 9 10 10 10 10
There are 18 values, so the median falls between the ninth and the tenth value. The ninth and tenth values are 5 and 6 respectively, so the median for the of-possessive modifiers is (5+6)/2 = 5.5 (i.e., it falls between CONCRETE TOUCHABLE and CONCRETE NONTOUCHABLE).
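The two cases, odd and even sample size, can be captured in a single small function. The following sketch (plain Python; the function name is mine) applies it to the rank lists in (6.18) and (6.19):

```python
def median(values):
    """Return the median of a list of ordinal ranks."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                       # odd: the single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: mean of the two middle values

s_poss_ranks = [1] * 14 + [2] * 5 + [3] * 2 + [5, 6]       # (6.18), 23 values
of_poss_ranks = [1, 1, 2, 4, 5, 5, 5, 5, 5, 6,
                 7, 7, 8, 9, 10, 10, 10, 10]               # (6.19), 18 values

print(median(s_poss_ranks))    # 1
print(median(of_poss_ranks))   # 5.5
```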
Using the idea of a median, we can now rephrase our prediction in quantitative terms:

Our data conform to this prediction, as 1 is higher on the scale than 5.5. As before, this does not prove or disprove anything, as, again, we would expect some random variation. Again, we will return to this issue in Chapter 7.
Table 6-8. Relative frequencies for the Animacy values of the modifiers of the s- and of-possessives

RANK   ANIMACY CATEGORY          S-POSSESSIVE        OF-POSSESSIVE
 1     HUMAN                     14     60.9%         2     11.1%
 2     ORGANIZATION               5     21.7%         1      5.6%
 3     OTHER ANIMATE              2      8.7%         0
 4     HUMAN ATTRIBUTE            0                   1      5.6%
 5     CONCRETE TOUCHABLE         1      4.3%         5     27.8%
 6     CONCRETE NONTOUCHABLE      1      4.3%         1      5.6%
 7     LOCATION                   0                   1      5.6%
 8     TIME                       0                   1      5.6%
 9     EVENT                      0                   2     11.1%
10     ABSTRACT                   0                   4     22.2%
       Total                     23      100%        18      100%
As we can see, this table also nicely shows the preference of the s-possessive for animate modifiers (HUMAN, ORGANIZATION, OTHER ANIMATE) and the preference of the of-possessive for the categories lower on the hierarchy. The table also shows, however, that the modifiers of the of-possessive are much more evenly distributed
(which could easily have happened), its frequency would also be five; in this case, the of-possessive modifiers would have two modes (CONCRETE TOUCHABLE and ABSTRACT).

The concept of mode may seem useful in cases where we are looking for a single value by which to characterize a set of nominal data, but on closer inspection it turns out that it does not actually tell us very much: it tells us what the most frequent value is, but not how much more frequent that value is than the next most frequent one, how many other values occur in the data at all, etc. Thus, it is always preferable to report the frequencies of all values, and, in fact, I have never come across a corpus-linguistic study reporting modes.
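For completeness, a mode (or several tied modes) is nevertheless trivial to compute. The sketch below (plain Python; the category abbreviations are those of Table 6-6) finds the mode(s) of the of-possessive Animacy categories:

```python
from collections import Counter

# Animacy categories of the 18 of-possessive modifiers (counts as in Table 6-8)
of_poss = (["HUM"] * 2 + ["ORG"] + ["ATTR"] + ["CCT"] * 5 + ["CCN"] +
           ["LOC"] + ["TIME"] + ["EVT"] * 2 + ["ABS"] * 4)

counts = Counter(of_poss)
top = max(counts.values())
modes = sorted(v for v, n in counts.items() if n == top)  # all values tied for most frequent
print(modes)   # ['CCT'] -- with a fifth ABS token, this would be ['ABS', 'CCT']
```

Note that the code returns a list precisely because, as discussed above, a data set may have more than one mode.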
6.5 Descriptive Statistics for Cardinal Data
Let us turn, finally, to a design with one nominal and one cardinal variable: a test of the third of the three hypotheses introduced at the beginning of this chapter. Again, it is restated here together with the background assumption from which it is derived:

(6.21) Assumption: Short items tend to occur toward the beginning of a constituent, long items tend to occur at the end.
excluded. The reason for this is that we already know from the first case study that pronouns, which we used as an operational definition of old information, prefer the s-possessive. Since all pronouns are very short (regardless of whether we measure their length in terms of words, syllables or letters), including them would bias our data in favor of the hypothesis. This left 20 cases of the s-possessive and 154 cases of the of-possessive. To get samples of roughly equal size for expository clarity, I selected every sixth case of the of-possessive, giving me 25 cases (note that in a real study, there would be no good reason to create such roughly equal sample sizes; we would simply use all the data we have).

The variable LENGTH was defined operationally as number of orthographic words. We can now state the following prediction:
Table 6-9 shows the length of head and modifier for all cases in our sample (again, see Appendix D.1, Study 3 for the actual examples).
Table 6-9. Length of modifier and head (in orthographic words) for the cases in our sample

(a) S-POSSESSIVE                      (b) OF-POSSESSIVE
Example ID   Modifier   Head          Example ID   Modifier   Head
6-9a1            2       14           6-9b1            3        4
6-9a2            2        6           6-9b2            5        2
6-9a3            2        1           6-9b3            4        2
6-9a4            2        3           6-9b4            8        2
6-9a5            3        1           6-9b5            7        3
6-9a6            1        2           6-9b6            1        1
6-9a7            2        2           6-9b7            9        5
6-9a8            2        1           6-9b8            2        3
6-9a9            2        1           6-9b9            5        2
6-9a10           2        1           6-9b10           6        2
6-9a11           1        1           6-9b11           2        3
6-9a12           2        4           6-9b12           2        2
6-9a13           1        7           6-9b13           1        2
6-9a14           3        1           6-9b14           8        2
6-9a15           2        1           6-9b15           5        4
6-9a16           1        1           6-9b16           2        2
6-9a17           2        3           6-9b17           2        2
6-9a18           2        1           6-9b18          20        2
6-9a19           2        4           6-9b19           2        2
6-9a20           2        2           6-9b20           1        1
                                      6-9b21           2        2
                                      6-9b22           8        2
                                      6-9b23           3        2
                                      6-9b24           2        3
                                      6-9b25           2        2
6.5.1 Means
How to calculate a mean (more precisely, an arithmetic mean) is common knowledge, but for completeness' sake, here is the formula:

(6.23)  x̄ = (1/n) Σᵢ₌₁ⁿ xᵢ = (x₁ + x₂ + … + xₙ) / n

In other words, in order to calculate the mean of a set of values {x₁, x₂, … xₙ} of size n, we add up all values and divide the sum by n (or multiply it by 1/n, which is the same thing).
Since we have stated our hypothesis and the corresponding prediction only in terms of the modifier, we should first make sure that the heads of the two possessives do not differ greatly in length: if they did, any differences we find for the modifiers could simply be related to the fact that one of the constructions may be longer in general than the other. Adding up all 20 values for the s-possessive heads gives us a total of 57, so the mean is 57/20 = 2.85. Adding up all 25 values of the of-possessive heads gives us a total of 59, so the mean is 59/25 = 2.36. We have, as yet, no way of telling whether this difference could be due to chance, but the two values are so close together that we will assume so for now. In fact, note that there is one obvious outlier (a value that is much bigger than the others): Example 6-9a1 has a head that is 14 words long. If we assume that this is somehow exceptional and remove this value, we get a mean length of 43/19 = 2.26, which is almost identical to the mean length of the of-possessive heads.

If we apply the same formula to the modifiers, however, we find that they differ substantially: the mean length of the s-possessive modifiers is 38/20 = 1.9, while the mean length of the of-possessive modifiers is more than twice as much, namely 112/25 = 4.48. Even if we remove the obvious outlier, example (6-9b18), the of-possessive modifiers are still twice as long as those of the s-possessive, namely 92/24 = 3.83.
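These calculations can be sketched in a few lines of plain Python (the function name is mine; the length lists follow Table 6-9, with the values for examples 6-9a1 and 6-9b1 reconstructed from the totals stated in the text):

```python
def mean(values):
    return sum(values) / len(values)

# Modifier and head lengths in orthographic words (examples a1-a20 and b1-b25)
s_poss_mods = [2, 2, 2, 2, 3, 1, 2, 2, 2, 2, 1, 2, 1, 3, 2, 1, 2, 2, 2, 2]
s_poss_heads = [14, 6, 1, 3, 1, 2, 2, 1, 1, 1, 1, 4, 7, 1, 1, 1, 3, 1, 4, 2]
of_poss_mods = [3, 5, 4, 8, 7, 1, 9, 2, 5, 6, 2, 2, 1, 8, 5, 2, 2, 20, 2, 1, 2, 8, 3, 2, 2]
of_poss_heads = [4, 2, 2, 2, 3, 1, 5, 3, 2, 2, 3, 2, 2, 2, 4, 2, 2, 2, 2, 1, 2, 2, 2, 3, 2]

print(mean(s_poss_heads), mean(of_poss_heads))   # 2.85 2.36 -- heads barely differ
print(mean(s_poss_mods), mean(of_poss_mods))     # 1.9 4.48  -- modifiers differ substantially

# Dropping the outliers discussed above (the 14-word head of 6-9a1,
# the 20-word modifier of 6-9b18):
print(round(mean([x for x in s_poss_heads if x != 14]), 2))   # 2.26 (= 43/19)
print(round(mean([x for x in of_poss_mods if x != 20]), 2))   # 3.83 (= 92/24)
```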
6.6 Summary
We have looked at three case studies, one involving nominal, one ordinal and one cardinal data. In each case, we were able to state a hypothesis and derive a quantitative prediction from it. Using appropriate descriptive statistics (percentages, observed and expected frequencies, modes, medians and means), we were able to determine that the data conform to these predictions, i.e., that the quantitative distribution of the values of the variables PART OF SPEECH, ANIMACY and LENGTH across the conditions S-POSSESSIVE and OF-POSSESSIVE fits the predictions formulated.
Further reading
See next chapter.
7. Significance Testing
In Chapter 3, we discussed the fact that we can never prove a hypothesis to be true, but we can prove it to be false. This insight leads to the Popperian research cycle, which roughly goes as follows: The researcher formulates a hypothesis and then does their best to falsify it. If they manage to falsify it, the hypothesis has to be revised; if they don't manage to falsify it, they may continue assuming that it is correct. If the hypothesis can be formulated in such a way that it could be falsified by a counterexample, this procedure can be applied fairly straightforwardly (although there may be difficulties in deciding what counts as a counterexample).

But, as also discussed in Chapter 3, if the hypothesis is formulated in relative terms, counterexamples are irrelevant. In the last chapter, we formulated hypotheses and derived different types of quantitative predictions. We then coded
Prediction: The data should be distributed non-randomly across the intersections of A and B; i.e., the frequencies/medians/means of some of the intersections should be higher and/or lower than those expected by chance.
Once we have formulated our research hypothesis and the corresponding null hypothesis in this way (and once we have operationalized the constructs used in formulating them), we collect the relevant data.

In a second step, we then have to show that the data we have collected differ from the prediction of the null hypothesis in the direction predicted by the alternative hypothesis. For example, if Variable A has the values X and Y and Variable B has the values P and Q, and our prediction is that X and P tend to co-occur, we need to show that the combination of X and P is more frequent, has a higher median, or has a higher mean than it would have if the relationship between the two variables were random. If this is not the case, i.e., if the frequency, the median or the mean of the combination X & P is equal to what we would expect by chance, or if it is actually lower than that, then we cannot falsify the null hypothesis (at least not in a way that is compatible with our alternative hypothesis) and have to assume instead that our research hypothesis is false. If it is the case, i.e., if the frequency, the median or the mean of the combination X & P is higher than what we would expect by chance, we know that we can potentially reject the null hypothesis in a way that conforms to our research hypothesis; this was the case in the case studies in the preceding chapter. However, we cannot actually reject the null hypothesis yet, because we must always expect a certain amount of random variation even if there is a relationship between two variables.
Before we can reject the null hypothesis, we have to show, in a third step, that the difference between the observed distribution and the chance distribution is so large that it is likely not due to random variation. This is the task of inferential statistics, which we will deal with in this chapter.
from the probabilities of each individual outcome. In reality, every outcome from ten heads to ten tails is possible, as each flip of the coin is an independent event.
Intuitively, we know this: if we flip a coin ten times, we do not really expect it to come down heads and tails exactly five times each, but we accept a certain amount of variation. However, the greater the imbalance between heads and tails, the less willing we will be to accept it as a result of chance. In other words, we would not be surprised if the coin came down heads six times and tails four times, or even heads seven times and tails three times, but we might already be slightly surprised if it came down heads eight times and tails only twice, and we

heads - heads
heads - tails
tails - heads
tails - tails
Obviously, none of these outcomes is more or less likely than the others: since there are four possible outcomes, they each have a probability of 1/4 = 0.25 (i.e., 25 percent; we will be using the decimal notation for percentages from here on). Alternatively, we can calculate the probability of each series by multiplying the probabilities of the individual events in each series, i.e. 0.5 × 0.5 = 0.25.

Crucially, however, there are differences in the probability of getting a particular set of results (i.e., a particular number of heads and tails regardless of the order they occur in): there is only one possibility of getting two heads and one of getting two tails, but there are two possibilities of getting one head and one tail.
Table 7-1. Sets of ten coin flips

     Unordered Set          No. of Series   Probability
 1   {0 heads, 10 tails}           1         0.000977
 2   {1 heads, 9 tails}           10         0.009766
 3   {2 heads, 8 tails}           45         0.043945
 4   {3 heads, 7 tails}          120         0.117188
 5   {4 heads, 6 tails}          210         0.205078
 6   {5 heads, 5 tails}          252         0.246094
 7   {6 heads, 4 tails}          210         0.205078
 8   {7 heads, 3 tails}          120         0.117188
 9   {8 heads, 2 tails}           45         0.043945
10   {9 heads, 1 tails}           10         0.009766
11   {10 heads, 0 tails}           1         0.000977
Again, these outcomes differ with respect to their probability. The third column of Table 7-1 gives us the number of different series corresponding to each set.17 For example, there is only one way to get a set consisting of heads only: the coin must come down showing heads every single time. There are ten different ways of getting one heads and nine tails: the coin must come down heads the first or second or third or fourth or fifth or sixth or seventh or eighth or ninth or tenth time, and tails the rest of the time. Next, there are forty-five different ways of getting two heads and eight tails, which I am not going to list here (but you may

17 You may remember having heard of Pascal's triangle, which, among more sophisticated things, lets us calculate the number of different ways in which we can get a particular combination of heads and tails for a given number of coin flips: the third column of Table 7-1 corresponds to line 11 of this triangle. If you don't remember, no worries, we will not need it.
want to, as an exercise), and so on. The fourth column contains the same information, expressed in terms of relative frequencies: there are 1024 different series of ten coin flips, so the probability of getting, for example, two heads and eight tails is 45/1024 = 0.043945.
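The second and third columns of Table 7-1 are exactly what the binomial coefficient C(n, k) counts (this is also what the relevant line of Pascal's triangle contains). A sketch in plain Python that regenerates the table:

```python
from math import comb

flips = 10
total_series = 2 ** flips            # 1024 possible ordered series of ten flips

table = []                           # (heads, number of series, probability)
for heads in range(flips + 1):
    n_series = comb(flips, heads)    # ways of getting exactly this many heads
    table.append((heads, n_series, n_series / total_series))

for heads, n_series, p in table:
    print(f"{{{heads} heads, {flips - heads} tails}}: {n_series:3d} series, p = {p:.6f}")
```

Note that the counts sum to 1024 and the probabilities to 1, as they must.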
The basic idea behind statistical hypothesis testing is simple: calculate the probability of the result that we have observed. The lower this probability is, the less likely it is to be due to chance, and the more likely we are correct if we reject the null hypothesis. For example, if we observed a series of ten heads and zero tails, we know that the likelihood that the deviation from the expected result of five heads and five tails is due to chance is 0.000977 (i.e., roughly a tenth of a percent). This tenth of a percent is also the probability that we are wrong if we reject the null hypothesis and claim that the coin is not behaving randomly (for example, that it is manipulated in some way).
If we observed one heads and nine tails, we would know that the likelihood that this deviation from the expected result is due to chance is 0.009766 (i.e., almost one percent). Thus we might think that, again, if we reject the null hypothesis, this is the likelihood that we are wrong. However, we must add to this the likelihood of getting ten heads and zero tails. The reason for this is that if we accept a result of 1:9 as evidence for a non-random distribution, we would also accept the even more extreme result of 0:10. So the likelihood that we are wrong in rejecting the

smaller than five percent), the result is said to be statistically significant (i.e., not due to chance); if it is larger, the result is said to be non-significant (i.e., probably due to chance). Two other values of p that have a conventional importance are 0.01 (one percent), at which a result is said to be very significant, and 0.001 (a tenth of a percent), at which a result is considered to be highly significant.
Note that the probability of error depends not just on the proportion of the deviation, but also on the overall size of the sample. For example, if we observe a series of two heads and eight tails (i.e., twenty percent heads), the probability of error in rejecting the null hypothesis is 0.000977 + 0.009766 + 0.043945 = 0.054688. However, if we observe a series of four heads and sixteen tails (again, twenty percent heads), the probability of error would be roughly ten times lower, namely 0.0059.
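The logic of summing the probability of the observed outcome and of all more extreme outcomes in the same direction can be written down directly. The sketch below (plain Python; the function name is mine) computes this one-tailed probability of getting at most k heads in n fair coin flips:

```python
from math import comb

def p_at_most(k, n):
    """One-tailed probability of observing k or fewer heads in n fair flips."""
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

print(p_at_most(2, 10))   # ≈ 0.054688  (two heads out of ten)
print(p_at_most(4, 20))   # ≈ 0.0059    (four heads out of twenty)
```

Doubling the sample size while keeping the proportion of heads constant thus makes the same relative imbalance far less likely to be due to chance.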
coin flipping, which involves just a single variable with two values. However, it is theoretically possible to generalize the coin-flipping logic to any research design, i.e., to calculate the probabilities of all possible outcomes and add up the probabilities of the observed outcome and all outcomes that deviate from the expected outcome even further in the same direction. Simply put, though, this is only a theoretical possibility, as the computations would be too long and complex even for the most powerful computers in existence, let alone for manual calculations (or even a standard-issue home computer). Therefore, many statistical methods use a kind of mathematical trick: they derive from a given sample a single value whose probability distribution is known, a so-called test statistic. In

practice, we just have to look up the test statistic in a chart that will give us the corresponding probability of error (or p-value, as we will call it from now on).
easy to perform: we don't need more than paper and a pencil, or a calculator if we want to speed things up, and they are also included as standard functions in widely-used spreadsheet applications. They are also relatively robust even in situations where we should not really use them (a point I will return to below). They are also ideal procedures for introducing statistics to novices. Again, they are easy to perform and do not require statistical software packages that are typically expensive and/or have a steep learning curve. They are also relatively transparent with respect to their underlying logic and the steps required to perform them. Thus, my purpose in introducing them in some detail here is at least as much to introduce the logic and the challenges of statistical analysis as it is to provide basic tools for actual research. I will not introduce the mathematical underpinnings of these tests, and I will mention alternative procedures only in passing. For those who are interested in a deeper discussion of those issues, there are a large number of relatively readable introductory textbooks on probability theory, theoretical and applied statistics (see the Further Reading section at the end of this chapter). I will also not make any reference to statistical software, although such software packages are useful for anyone planning to use statistics in their actual research.
As mentioned in the preceding chapter, nominal data (or data that are best
treated like nominal data) are the type of data most frequently encountered in
corpus linguistics. I will therefore treat them in slightly more detail than the
other two types, introducing different versions and (in the next chapter)
extensions of the most widely used statistical test for nominal data, the chi-square
(χ²) test. This test in all its variants is extremely flexible, and is thus more useful
across different research designs than many of the more specific and more
sophisticated procedures (much like a Swiss army knife is an excellent all-
purpose tool despite the fact that there is usually a better tool dedicated to the
specific task at hand).
Despite its flexibility, there are two requirements that must be met in order for
the chi-square test to be applicable: first, no intersection of variables must have a
frequency of zero in the data, and second, no more than twenty-five percent of
the intersections must have frequencies lower than five. When these conditions
are not met, an alternative test must be used instead (or we need to collect
additional data).
(7.3a) H1: There is a relationship between DISCOURSE STATUS and TYPE OF POSSESSIVE
such that the S-POSSESSIVE is preferred when the modifier is DISCOURSE-OLD, and
the OF-POSSESSIVE is preferred when the modifier is DISCOURSE-NEW.
Prediction: There will be more cases of the S-POSSESSIVE with DISCOURSE-OLD
modifiers than with DISCOURSE-NEW modifiers, and more cases of the OF-
POSSESSIVE with DISCOURSE-NEW modifiers than with DISCOURSE-OLD modifiers.
Table 7-2. Observed and expected frequencies of old and new modifiers in the s- and
the of-possessive (= Table 6-4b)

                            POSSESSIVE
                  S-POSSESSIVE      OF-POSSESSIVE     Total
DISC.    OLD      Obs. 180          Obs.   3          Obs. 183
STATUS            Exp. 102.81       Exp.  80.19
         NEW      Obs.  20          Obs. 153          Obs. 173
                  Exp.  97.19       Exp.  75.81
         Total    Obs. 200          Obs. 156          Obs. 356
In order to test our research hypothesis, we must show that the observed
frequencies differ from the null hypothesis in the direction of our prediction. We
already saw in Chapter 6 that this is the case: The null hypothesis predicts the
expected frequencies, but there are more cases of s-possessives with old modifiers
and of-possessives with new modifiers than expected. Next, we must apply the
coin-flip logic and ask the question: Given the sample size, how surprising is the
difference between the expected frequencies (i.e., a perfect chance distribution)
and the observed frequencies?
As mentioned above, the conceptually simplest way of doing this would be to
compute all possible ways in which the marginal frequencies (the sums of the
columns and rows) could be distributed across the four cells of our table and then
check what proportion of these tables deviates from the perfect chance
distribution at least as much as the table we have actually observed. For two-by-
two tables, there is, in fact, a test that does this, the exact test after Fisher and
Yates, and where the conditions for using the chi-square test are not met, we
should use it. But, as mentioned above, this test is difficult to perform without
statistical software, and it is not available for tables larger than two-by-two anyway, so
instead we will derive the chi-square test statistic from the table.
First, we need to assess the magnitude of the differences between observed
and expected frequencies. The simplest way of doing this would be to subtract the
expected frequencies from the observed ones, giving us numbers that show for
each cell the size of the deviation as well as its direction (i.e., whether the observed
frequencies are higher or lower than the expected ones). For example, the values for
Table 7-2 would be 77.19 for cell C11 (S-POSSESSIVE & OLD), −77.19 for C21 (OF-POSSESSIVE
& OLD), −77.19 for C12 (S-POSSESSIVE & NEW) and 77.19 for C22 (OF-POSSESSIVE & NEW).
However, we want to derive a single measure from the table, so we need a
measure of the overall deviation of the observed frequencies from the expected ones,
not just a measure for the individual intersections. Obviously, adding up the
differences of all intersections does not give us such a measure, for two reasons.
First, the sum would always be zero (since the marginal frequencies are fixed, any positive deviance in
one cell will have a corresponding negative deviance in its neighboring cells).
Second, subtracting the expected from the observed frequencies gives us the same
number for each cell, when it is obvious that the actual magnitude of the
deviation depends on the expected frequency. For example, a deviation of 77.19 is
more substantial if the expected frequency is 75.81 than if the expected frequency
is 102.81. In the first case, the observed frequency is more than a hundred percent
higher than expected; in the second case, it is only 75 percent higher.
The first problem is solved by squaring the differences. This converts all
deviations into positive numbers, so that their sum will no longer be zero, and it
has the additional effect of weighing larger deviations more heavily than smaller
ones. The second problem is solved by dividing the squared differences by the
expected frequencies. This will ensure that a deviation of a particular size will be
weighed more heavily for a small expected frequency than for a large expected
frequency. The values arrived at in this way are referred to as the cell components
of chi-square (or simply chi-square components); the formulas for calculating the
cell components in this way are shown in Table 7-3a.
Table 7-3a. Formulas for the chi-square cell components

                           DEP. VARIABLE
                     VALUE 1             VALUE 2
INDEP.    VALUE 1    (O11 − E11)²/E11    (O12 − E12)²/E12
VARIABLE  VALUE 2    (O21 − E21)²/E21    (O22 − E22)²/E22
If we apply this procedure to Table 7-2, we get the components shown in Table
7-3b.

Table 7-3b. Chi-square components for Table 7-2

                               POSSESSIVE
                   S-POSSESSIVE                     OF-POSSESSIVE
DISC.    OLD    (180 − 102.81)²/102.81 = 57.95   (3 − 80.19)²/80.19 = 74.30
STATUS   NEW    (20 − 97.19)²/97.19 = 61.31      (153 − 75.81)²/75.81 = 78.60
The degree of deviance from the expected frequencies for the entire table can
then be calculated by adding up the chi-square components. For Table 7-3b, the
chi-square value (χ²) is 272.16. This value can now be used to determine the
probability of error by checking it against a table like that in Appendix C.1.
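The whole procedure can be sketched in a few lines of code (Python is my choice here, not the book's; variable names are mine). The expected frequencies are derived from the marginal sums, and the four chi-square components are summed:

```python
# Chi-square value for Table 7-2: observed frequencies of s-/of-possessive
# crossed with discourse-old/discourse-new modifiers.
obs = [[180, 3],    # OLD
       [20, 153]]   # NEW

row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]
n = sum(row_totals)

chi_sq = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / n
        chi_sq += (obs[i][j] - expected) ** 2 / expected

print(round(chi_sq, 2))  # 272.16
```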
Before we can do so, there is a final technical point to make. Note that the
degree of variation in a given table that is expected to occur by chance depends
quite heavily on the size of the table. The bigger the table, the higher the number
of cells that can vary independently of other cells without changing the marginal
sums (i.e., without changing the overall distribution). The number of such cells
that a table contains is referred to as the number of degrees of freedom of the
table. In the case of a two-by-two table, there is just one such cell: if we change
any single cell, we must automatically adjust the other three cells in order to keep
the marginal sums constant. Thus, a two-by-two table has one degree of freedom.
Significance levels of chi-square values differ depending on how many degrees of
freedom a table has.
Now, we can turn to the table in Appendix C.1. In the row for one degree of
freedom (the first row), we must check whether our χ²-value is larger than that
required for the level of significance that we are after. In our case, the value of
272.16 is much higher than the chi-square value required for a significance level
of 0.001 at one degree of freedom, which is 10.83. Thus, we can say that the
differences in Table 7-2 are statistically highly significant. The results of a chi-
square test are conventionally reported in the format (χ² = 272.16, df = 1, p < 0.001).
Thus, in the present case, the analysis might be summarized along the following
lines: This study has shown that s-possessives are preferred when the modifier is
discourse-old, while of-possessives are preferred when the modifier is
discourse-new (χ² = 272.16, df = 1, p < 0.001).
It may be tempting to equate statistical significance with theoretical importance. However, there are at least three
reasons why this equation must be avoided.
First, and perhaps most obviously, statistical significance has nothing to do
with the validity of the operational definitions used in our research design. In our
case, this validity is reasonably high, provided that we limit our conclusions to
written English. As a related point, statistical significance has nothing to do with
the quality of our data. If we have chosen unrepresentative data or if we have
extracted or coded our data sloppily, the statistical significance of the results is
meaningless.
Second, statistical significance has nothing to do with theoretical relevance.
If the choice of possessive construction depended, for example, on the font in
which a modifier is printed, rather than on the discourse status of the modifier,
there is not much that we could conclude from our findings.¹⁸
To assess the strength of the association, we can calculate the phi coefficient (φ), which for a two-by-two table is
calculated as follows:

(7.5)  φ = √(χ² / OTotal)

(7.6)  φ = √(272.16 / 356) = 0.8744
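The calculation in (7.6) can be sketched as follows (again in Python, my choice of language; the names are mine):

```python
from math import sqrt

# Phi coefficient from the chi-square value and the total number of
# observations, as in (7.5)/(7.6).
chi_sq, n = 272.16, 356
phi = sqrt(chi_sq / n)
print(round(phi, 4))       # 0.8744
print(round(phi ** 2, 4))  # 0.7645  (proportion of variance accounted for)
```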
There are conventional verbal descriptions for the ranges of values that a correlation coefficient may have
(similar to the verbal descriptions of p-values discussed above). These
descriptions are shown in Table 7-4.
18 Unless, that is, we hypothesize that there is a relationship between font and level of formality, and the
latter has been shown to have an influence on the choice of possessive constructions
(Jucker 1993).
19 This statement must be qualified to a certain degree: given the right research design,
statistical significance may actually be a very reasonable indicator of association
strength (cf. e.g. Stefanowitsch and Gries 2003, Gries and Stefanowitsch 2004 for
discussion). However, most of the time we are well advised to keep statistical
significance and association strength completely separate.
1 Perfect association
Our value of 0.8744 falls into the "very strong" category, which is unusual in
uncontrolled observational research, and which suggests that DISCOURSE STATUS is
indeed a very important factor in the choice of POSSESSIVE constructions in English.
Exactly how much of the variance in the use of the two possessives is
accounted for by the discourse status of the modifier can be determined by
looking at the square of the coefficient: the square of a correlation coefficient
generally tells us what proportion of the distribution of the dependent variable
we can account for on the basis of the independent variable (or, more generally,
what proportion of the variance our design has captured). In our case, φ² = (0.8744
× 0.8744) = 0.7645. In other words, the variable DISCOURSE STATUS explains roughly
three-quarters of the variance in the use of the POSSESSIVE constructions, if, that
is, our operational definition actually captures the discourse status of the
modifier, and nothing else. A more precise way of reporting the results from our
study would be something like the following: This study has shown a strong and
statistically highly significant influence of DISCOURSE STATUS on the choice of
POSSESSIVE construction (χ² = 272.16, df = 1, p < 0.001, φ = 0.8744).
The ICE-GB contains 443 FEMALE speakers and 932 MALE speakers. In order to determine whether the ICE-
GB can be considered a balanced corpus with respect to SPEAKER SEX, we can
compare this "observed"²⁰ distribution of speakers to the expected one more or
less exactly in the way described in the previous sections, except that we have two
alternative ways of calculating the expected frequencies.
First, we could simply take the total number of elements and divide it by the
number of categories (values), on the assumption that random distribution
means that every category should occur with the same frequency. In this case, the
expected number of MALE and FEMALE speakers would be [Total Number of Speakers
/ Sex Categories], i.e. 1,375 / 2 = 687.5. We can now calculate the chi-square
components just as we did in the preceding sections, using the formula [(O − E)²/E].
Table 7-5 shows the results.
Table 7-5. Observed and expected frequencies of SPEAKER SEX in the ICE-GB (based
on the assumption of equal proportions)

            Observed   Expected   Chi-Square Component
SEX MALE       932      687.5     (932 − 687.5)²/687.5 = 86.95
    FEMALE     443      687.5     (443 − 687.5)²/687.5 = 86.95
    Total    1,375

20 I put the word "observed" in scare quotes because it is unclear whether this difference
was in fact observed after the fact or whether the texts were sampled in this way
intentionally.
Adding up the components gives us a chi-square value of 173.91. Checking the first
row in the table in Appendix C.1, we can see that this value is much higher than
the 10.83 required for a significance level of 0.001. Thus, we can say that the ICE-
GB corpus contains a significantly higher proportion of male speakers than
expected by chance (χ² = 173.91, df = 1, p < 0.001) (note that since this is a test of
proportions rather than correlations, we cannot calculate a phi value here).
The second way of deriving expected frequencies for a univariate distribution
is from prior knowledge concerning the distribution of the values in general. In
our case, we could find out the proportion of men and women in the relevant
population (Great Britain) and then derive the expected frequencies for our table
by assuming that they follow this proportion. A population estimate from around
the time that the ICE-GB was released puts the male population of England and
Wales in 2003 at roughly 25.9 million and the female population at roughly 27
million (Office for National Statistics 2005). In other words, 49 percent of the
population are men and 51 percent are women. Assuming that this is
true of the population sampled for the ICE-GB, we get the expected frequencies
and chi-square components shown in Table 7-6.
Table 7-6. Observed and expected frequencies of SPEAKER SEX in the ICE-GB (based
on the proportions in the general population)

            Observed   Expected                  Chi-Square Component
SEX MALE       932     1,375 × 0.49 = 673.75     (932 − 673.75)²/673.75 = 98.99
    FEMALE     443     1,375 × 0.51 = 701.25     (443 − 701.25)²/701.25 = 95.11
    Total    1,375
Clearly, the empirical distribution in this case closely resembles our hypothesized
equal distribution, and thus the results are very similar. The chi-square value is
194.09, slightly higher even than above, and thus the result is again
significant at the 0.001 level.
In this case, it does not make much of a difference how we derive the expected
frequencies. However, this is, of course, only because men and women account
for roughly half the population each. For variables where such an even
distribution of values does not exist, the differences between these two
procedures can be quite drastic.
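Both ways of deriving the expected frequencies can be sketched side by side (in Python, my choice of language; all names are mine):

```python
# Univariate chi-square for SPEAKER SEX in the ICE-GB, with both ways of
# deriving the expected frequencies described above.
observed = {"MALE": 932, "FEMALE": 443}
n = sum(observed.values())

def chi_square(expected):
    return sum((observed[k] - expected[k]) ** 2 / expected[k]
               for k in observed)

equal = {k: n / len(observed) for k in observed}       # 687.5 each
population = {"MALE": n * 0.49, "FEMALE": n * 0.51}    # 673.75 / 701.25

print(round(chi_square(equal), 2))       # 173.91
print(round(chi_square(population), 2))  # 194.09
```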
Table 7-8. Observed and expected frequencies of SPEAKER AGE in the ICE-GB

                       Equal Proportions            Population Proportions
          Observed   Expected           χ²          Pop. %   Expected   χ²
AGE 0-18      0      1,259/5 = 251.8   251.8        0.23     295.67     295.67
    …
As an example, consider Table 7-8, which lists the observed distribution of the
speakers in the ICE-GB across age groups, together with the expected frequencies
and chi-square components derived in both of the ways just described.
Table 7-8 has four degrees of freedom (we can vary four cells independently and
then simply adjust the fifth to keep the marginal sum constant). The required chi-
square value for a 0.001 level of significance at four degrees of freedom is 18.47:
clearly, whichever way we calculate the expected frequencies, the differences
between observed and expected are highly significant. However, the conclusions
we will draw concerning the over- or underrepresentation of individual
categories will be very different. In the first case, for example, we might be led to
believe that the age group 18-25 is relatively fairly represented while the age
group 26-45 is drastically overrepresented. In the second case, we see that in fact
both age groups are overrepresented, the first one slightly more so than the
second one. In this case, there is a clear argument for using empirically derived
expected frequencies: the categories differ in terms of the age span each of them
covers, so even if we thought that the distribution of ages in the population was
homogeneous, we would not expect all categories to have the same size.
The exact alternative to the univariate chi-square test with a two-level
variable is the binomial test, which we used (without calling it that) in our coin-
flip example in Section 7.1 above, and which is included as a predefined function
in many major spreadsheet applications and in R; for one-by-n tables, there is a
multinomial test, but it is typically not available outside large statistics packages, so we
have to content ourselves with the chi-square test.
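To illustrate what such an exact test does, here is a sketch (in Python, my choice of language; the function name is mine) of a two-sided binomial test applied to the speaker-sex distribution from the previous section. Exact integer arithmetic is used because the individual outcome probabilities are far too small for ordinary floating-point numbers:

```python
from fractions import Fraction
from math import comb

def binom_two_sided(k, n):
    """Two-sided exact binomial test against p = 0.5: fold the observed
    count onto the smaller tail and double the tail probability."""
    k = min(k, n - k)
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, float(Fraction(2 * tail, 2 ** n)))

# The coin-flip example from Section 7.1 (two heads in ten tosses):
print(round(binom_two_sided(2, 10), 6))  # 0.109375
# 932 male speakers out of 1,375 is far beyond chance fluctuation:
print(binom_two_sided(932, 1375) < 0.001)  # True
```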
Let us return to the case study of ANIMACY in the two English
possessive constructions. Here is the research hypothesis again, from (6.12) and
(6.16) above:
(7.7a) H1: The S-POSSESSIVE will be used when the modifier is high in ANIMACY, the
OF-POSSESSIVE will be used when the modifier is low in ANIMACY.
Prediction: The modifiers of the S-POSSESSIVE will have a higher median on
the ANIMACY scale than the modifiers of the OF-POSSESSIVE.
The median animacy of all modifiers in our sample taken together is 2,²² so the H0
predicts that the medians of the s-possessive and the of-possessive should also be 2.
Recall that the observed median animacy in our sample was 1 for the s-possessive
and 5 for the of-possessive, which deviates from the prediction of the H0 in the
direction of our H1. However, as in the case of nominal data, a certain amount of
deviation from the null hypothesis will occur due to chance, so we need a test
statistic that will tell us how likely our observed result is. For ordinal data, this
test statistic is the U value, which is calculated as follows.
In a first step, we have to determine the rank order of the data points in our
sample. For expository reasons, let us distinguish between the rank value and the
rank position of a data point: the rank value is the ordinal value it received
during coding (in our case, its value on the ANIMACY scale), its rank position is the
position it occupies in an ordered list of all data points. If every rank value
occurred only once in our sample, rank value and rank position would be the
same. However, there are 41 data points in our sample, so the rank positions will
range from 1 to 41, and there are only 10 rank values in our coding scheme for
ANIMACY. This means that at least some rank values will occur more than once.
Table 7-9. Coded sample from Table 6-6 ordered by assigned rank (= Table 6-7)
22 There are 41 data points in our sample, whose rank values are the following: 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 7, 7, 8, 9, 10, 10, 10, 10. The
twenty-first item on the list is a 2, so this is the median.
2   s-possessive (6-6a5)    19.5
2   s-possessive (6-6a6)    19.5
2   of-possessive (6-6b2)   19.5
Every rank value except 4, 8 and 9 occurs more than once; for example, there are
sixteen cases that have an ANIMACY rank value of 1, six cases that have a rank
value of 2, two cases that have a rank value of 3, and so on. This means we cannot
simply assign rank positions from 1 to 41 to our examples, as there is no way of
deciding which of the sixteen examples with the rank value 1 should receive the
rank position 1, 2, 3, etc. Instead, these 16 examples as a group share the range of
ranks from 1 to 16, so each example gets the average rank position of this range.
There are sixteen cases with rank value 1, so their average rank is

(7.8)  (1 + 2 + 3 + … + 16) / 16 = 136 / 16 = 8.5

The first example with the rank value 2 occurs in line 17 of the table, so it would
receive the rank position 17. However, there are five more examples with the
same rank value, so again we calculate the average rank position of the range
from rank 17 to 22, which is

(7.9)  (17 + 18 + 19 + 20 + 21 + 22) / 6 = 117 / 6 = 19.5
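The tied-rank assignment just described can be sketched as a small function (in Python, my choice of language; the function name is mine):

```python
def average_ranks(values):
    """Assign each value the mean of the rank positions its ties span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Find the end of the run of tied values.
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        avg = (i + 1 + j) / 2          # mean of positions i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    return ranks

# Sixteen 1s share positions 1-16, so each gets (1+16)/2 = 8.5;
# the six 2s share positions 17-22, so each gets 19.5.
sample = [1] * 16 + [2] * 6 + [3] * 2
print(average_ranks(sample)[0], average_ranks(sample)[16])  # 8.5 19.5
```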
Repeating this process for all examples yields the rank positions shown in the
third column in Table 7-9 above. Once we have determined the rank position of
each data point, we separate the data points into two subsamples corresponding to the
values of the nominal variable TYPE OF POSSESSIVE again, as shown in Table 7-10a.
Table 7-10a. Rank positions for the coded sample from Table 7-9

          S-POSSESSIVE                          OF-POSSESSIVE
2   s-possessive (6-6a6)    19.5     10   of-possessive (6-6b6)    39.5
1   s-possessive (6-6a7)     8.5      1   of-possessive (6-6b7)     8.5
1   s-possessive (6-6a8)     8.5     10   of-possessive (6-6b8)    39.5
1   s-possessive (6-6a9)     8.5      6   of-possessive (6-6b9)    32.5
1   s-possessive (6-6a10)    8.5     10   of-possessive (6-6b10)   39.5
5   s-possessive (6-6a11)   28.5      5   of-possessive (6-6b11)   28.5
1   s-possessive (6-6a12)    8.5      9   of-possessive (6-6b12)   37
2   s-possessive (6-6a13)   19.5      4   of-possessive (6-6b13)   25
2   s-possessive (6-6a14)   19.5      7   of-possessive (6-6b14)   32.5
1   s-possessive (6-6a15)    8.5     10   of-possessive (6-6b15)   39.5
3   s-possessive (6-6a16)   23.5      5   of-possessive (6-6b16)   28.5
Next, we calculate the so-called rank sum for each group, which is simply the
sum of its rank positions, and we count the number of data points in each
group. These measures are shown in Table 7-10b.
Using these two measures, we can now calculate the U values for both groups
using the following simple formulas, where N stands for the number of data
points in the sample and R for the rank sum:
(7.10a)  U1 = (N1 × N2) + N1(N1 + 1)/2 − R1

(7.10b)  U2 = (N1 × N2) + N2(N2 + 1)/2 − R2
Applying these formulas to the measures for the s-possessive (7.10a) and the of-
possessive (7.10b) respectively, we get the following U values:

(7.11a)  U1 = (23 × 18) + 23(23 + 1)/2 − 324.5 = 365.5

(7.11b)  U2 = (23 × 18) + 18(18 + 1)/2 − 532.5 = 52.5
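The calculation in (7.10a/b) and (7.11a/b) can be sketched as follows (Python again, my choice; the function name is mine):

```python
def mann_whitney_u(n1, n2, r1, r2):
    """U values from the two group sizes and rank sums, as in (7.10a/b);
    the test statistic is the smaller of the two."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)

# 23 s-possessives (rank sum 324.5), 18 of-possessives (rank sum 532.5)
print(mann_whitney_u(23, 18, 324.5, 532.5))  # 52.5
```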
The U value for the entire data set is always the smaller of the two U values. In
our case this is U2, so our U value is 52.5. This value can now be compared against
its known distribution in the same way as the chi-square value for nominal data.
In our case, this means looking it up in the appropriate table in Appendix C.3,
which tells us that the p-value for this U value is smaller than 0.001: the
difference between the s- and the of-possessive is, again, highly significant.
Thus, we could report the results of this case study as follows: This study has
shown that s-possessives are preferred when the modifier is high in animacy,
while of-possessives are preferred when the modifier is low in animacy. A Mann-
Whitney test shows that the differences between the constructions are highly
significant (U = 52.5, N1 = 23, N2 = 18, p < 0.001).
Let us return to the case study of the length of modifiers in the two English
possessive constructions. Here is the research hypothesis again, paraphrased
slightly from (6.17) and (6.18) above:
(7.13a) H1: The S-POSSESSIVE will be used with short modifiers, the OF-POSSESSIVE will
be used with long modifiers.
Prediction: The mean LENGTH (in number of words) of the modifiers of the S-
POSSESSIVE should be smaller than that of the modifiers of the OF-POSSESSIVE.
Table 7-13 shows the length in number of words for the modifiers of the s- and of-
possessives (as already reported in Table 6-9), together with a number of
additional values.
Table 7-13 (excerpt)

S-POSSESSIVE                          OF-POSSESSIVE
6-9a16   1   −0.9   0.8100            6-9b16   2   −1.8333    3.3611
6-9a17   2    0.1   0.0100            6-9b17   2   −1.8333    3.3611
6-9a18   2    0.1   0.0100            6-9b19   2   −1.8333    3.3611
6-9a19   2    0.1   0.0100            6-9b20   1   −2.8333    8.0278
6-9a20   2    0.1   0.0100            6-9b21   2   −1.8333    3.3611
                                      6-9b22   8    4.1667   17.3611
                                      6-9b23   3   −0.8333    0.6944
                                      6-9b24   2   −1.8333    3.3611
                                      6-9b25   2   −1.8333    3.3611
N                 20                                   24
Total             38          0                        92          0
Mean               1.9                                  3.8333
Sample variance    0.3053                               6.6667
First, note that one case that was still included in Table 6-9 is missing: example
(6-9b18), which had a modifier of length 20. This is treated here as a so-called
outlier, i.e., a value that is so far away from the mean that it can be considered an
exception. There are different opinions on whether and when outliers should be removed
that we will not discuss here, but for expository reasons alone it is reasonable
here to remove it (and it would not have made a difference for our results if we
had kept it).
In order to calculate Welch's t-test, we determine three values on the basis of
our measurements of LENGTH: the number of measurements, the mean length for
each group, and a value called the sample variance. The number of measurements is
easy to determine: we just count the cases in each group: 20 s-possessives and
24 of-possessives. We already calculated the mean lengths in Chapter 6: for the s-
possessive, the mean length is 1.9 words, for the of-possessive it is 3.83 words.
As a third value, we need a measure of how broadly the individual measurements
are dispersed around the mean. As a first step, we subtract each individual
measurement for the s-possessive from the group mean of 1.9, and each
measurement for the of-possessive from the group mean of 3.83. The results are
shown in the third column of each sub-table in 7-13. However, we do not want to
know how much each single measurement deviates from the mean, but how far
the group S-POSSESSIVE or OF-POSSESSIVE as a whole varies around the mean.
Obviously, adding up all individual values is not going to be helpful: as in the case
of observed and expected frequencies of nominal data, the result would always be
zero. So we use the same trick we used there, and calculate the square of each
value, making them all positive and weighting larger deviations more heavily.
The results of this are shown in the fourth column of each sub-table. We then
calculate the mean of these values for each group, but instead of adding up all
values and dividing them by the number of cases, we add them up and divide
them by the total number of cases minus one. This is referred to as the sample
variance:
(7.14)  s² = Σᵢ₌₁..n (xᵢ − X̄)² / (n − 1)
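Formula (7.14) can be sketched as follows (in Python, my choice of language). The list of lengths below is a reconstruction that matches the mean and sample variance reported for the s-possessive group, not necessarily the attested sequence of values:

```python
def sample_variance(xs):
    """Sum of squared deviations from the group mean, divided by n - 1,
    as in (7.14)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

# Twenty modifier lengths with the reported mean (1.9) and sample
# variance (0.3053); the exact sequence is a reconstruction.
lengths = [1] * 4 + [2] * 14 + [3] * 2
print(round(sum(lengths) / len(lengths), 4))  # 1.9
print(round(sample_variance(lengths), 4))     # 0.3053
```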
The sample variances themselves cannot be interpreted very easily (see further below),
but we can use them to calculate our test statistic, the t-value, using the following
formula (the barred X stands for the group mean, s² stands for the sample
variance, and N stands for the number of cases; the subscripts 1 and 2 indicate the
two sub-samples):
(7.15)  tWelch = (X̄1 − X̄2) / √(s1²/N1 + s2²/N2)
Note that this formula assumes that the measures with the subscript 1 are from
the larger of the two samples (if we don't pay attention to this, however, all that
happens is that we get a negative t-value, whose negative sign we can simply
ignore). In our case, the sample of of-possessives is the larger one, giving us:

(7.16)  tWelch = (3.8333 − 1.9) / √(6.6667/24 + 0.3053/20) = 3.5714
As should be familiar by now, we compare this t-value against its distribution to
determine the probability of error (i.e., we look it up in the appropriate table in
Appendix C.4). Before we can do so, however, we need to determine the degrees of
freedom of our sample. This is done using the following formula:
(7.17)  df = (s1²/N1 + s2²/N2)² / [ s1⁴/(N1² × df1) + s2⁴/(N2² × df2) ]

Again, the subscripts indicate the sub-samples, s² is the sample variance, and N is
the number of items; the degrees of freedom for the two groups (df1 and df2) are
defined as N − 1. If we apply the formula to our data, we get the following:
(7.18)  df = (6.6667/24 + 0.3053/20)² / [ 6.6667²/(24² × 23) + 0.3053²/(20² × 19) ] = 25.5038
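Formulas (7.15) through (7.18) can be sketched together (in Python, my choice of language; the function name is mine). The function is written symmetrically, so feeding the groups in the "wrong" order simply flips the sign of t, as noted above:

```python
from math import sqrt

def welch(mean1, var1, n1, mean2, var2, n2):
    """Welch's t statistic (7.15) and degrees of freedom (7.17)."""
    se_sq = var1 / n1 + var2 / n2
    t = (mean1 - mean2) / sqrt(se_sq)
    df = se_sq**2 / (var1**2 / (n1**2 * (n1 - 1)) +
                     var2**2 / (n2**2 * (n2 - 1)))
    return t, df

# Of-possessives (the larger sample) first, then s-possessives:
t, df = welch(3.8333, 6.6667, 24, 1.9, 0.3053, 20)
print(round(t, 2), round(df, 1))  # 3.57 25.5
```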
As we can see in the table in Appendix C.4, the p-value for our t-value at 25.5
degrees of freedom is smaller than 0.01. A t-test should be reported in the format
(t(25.50) = 3.5714, p < 0.01).
The sample variance itself does not tell
us very much. We can convert it into something called the sample standard
deviation, however, by taking its square root. The standard deviation is an
indicator of the amount of variation in a sample (or sub-sample) that is frequently
reported; it is good practice to report standard deviations whenever we report
means.
Finally, note that, again, the significance level does not tell us anything about
the size of the effect, so we should calculate an effect size separately. The most
widely-used effect size for data analyzed with a t-test is Cohen's d, also referred
to as the standardized mean difference. There are several ways to calculate it, the
simplest one being the following, where s is the standard deviation of the entire
sample:
(7.20)  d = (X̄1 − X̄2) / s

(7.21)  d = (3.8333 − 1.9) / 2.1562 = 0.8966

Cohen's d can in turn be converted into a correlation coefficient r:

(7.22)  r = d / √( d² + (N1 + N2)² / (N1 × N2) )
(7.23)  r = 0.8966 / √( 0.8966² + (24 + 20)² / (24 × 20) ) = 0.4077
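The conversion in (7.22)/(7.23) can be sketched as follows (Python, my choice of language; the function name is mine):

```python
from math import sqrt

def cohens_d_to_r(d, n1, n2):
    """Convert Cohen's d to a correlation coefficient r, as in (7.22)."""
    return d / sqrt(d**2 + (n1 + n2)**2 / (n1 * n2))

d = (3.8333 - 1.9) / 2.1562       # standardized mean difference (7.21)
print(round(d, 4))                     # 0.8966
print(round(cohens_d_to_r(d, 24, 20), 4))  # 0.4077
```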
The study could thus be summarized as follows: This study has shown that s-
possessives are preferred when the modifier is short, while of-possessives are
preferred when the modifier is long (t(25.50) = 3.5714, p < 0.01, r = 0.41).
Note, finally, that the t-test assumes that the data follow a normal distribution,
i.e., that the values cluster symmetrically around the mean, becoming less frequent towards
either side until they reach zero. There are many ways of determining whether a
sample is distributed normally, but we will not discuss these here for the simple
reason that, unfortunately, cardinal measurements derived from corpus data
(such as the length of words, constituents, sentences etc., or the distance to the
last mention of a referent) are usually not distributed normally (see, e.g., McEnery
and Hardie 2012: 51), and so the discussion would hardly be worth the trouble at
this point.
There are two ways of dealing with this problem. The first is to ignore it and hope that
the t-test is robust enough to yield meaningful results despite this violation of the
normality requirement. Although statisticians warn against it categorically, this is
what many linguists (and other social scientists) regularly do. In many cases, this
may be less of a problem than one might assume, since, as mentioned in the
introduction to this chapter, the tests introduced here are fairly robust against
violations of their prerequisites.
Still, ignoring the prerequisites of a statistical procedure is not exactly good
scientific practice, so the second way of dealing with the problem is the more
recommendable one: find a way of getting around having to use a t-test in the
first place. One way would be to find an alternative test that does not assume a
normal distribution. In some cases, so-called exact tests are available (such as the
Fisher-Yates exact test mentioned in Section 7.2.1), and some corpus linguists
argue that we should restrict ourselves to such exact tests (e.g. Gries 2006). In
other cases, we may have to rethink the nature of our variables. For example, as
pointed out in Chapter 6, we can treat our cardinal data as ordinal data. We can
then use the Mann-Whitney U-test, which does not require a normal distribution
of the data. I leave it as an exercise to the reader to apply this test to the data in
Table 7-11 (you know you have succeeded if your result for U is 137, p < 0.01).
Alternatively, we could try to find a less problematic operationalization of the
phenomenon we are trying to measure and use the corresponding statistical
procedures. Since nominal variables are often less troublesome than cardinal
ones, we could, for example, code the data in Table 7-11 in terms of a very simple
nominal variable: LONGER CONSTITUENT (with the values HEAD and MODIFIER). For
each case, we simply determine whether the head is longer than the modifier (in
which case we assign the value HEAD) or whether the modifier is longer than the
head (in which case we assign the value MODIFIER); we discard all cases where the
two have the same length. This gives us Table 7-12.
Table 7-12. The influence of LENGTH on the choice between the two possessives

                               POSSESSIVE
                      S-POSSESSIVE    OF-POSSESSIVE
LONGER       HEAD          …               …
CONSTITUENT  MODIFIER      …               …
The chi-square value for this table is 0.7281, which at one degree of freedom
means that the p-value is larger than 0.05, so we would have to conclude that
there is no influence of LENGTH on the choice of possessive construction. However,
the deviations of the observed from the expected frequencies go in the right
direction, so this may simply be due to the fact that our sample is too small
(obviously, a serious corpus-linguistic study would not be based on just 33 cases).
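The recoding step described above can be sketched as follows (in Python, my choice of language; the length pairs below are hypothetical, not taken from Table 7-11):

```python
# For each (head, modifier) length pair, keep only which constituent is
# longer, discarding ties, as described in the text.
pairs = [(2, 1), (1, 3), (2, 2), (4, 1)]   # hypothetical lengths
coded = ["HEAD" if h > m else "MODIFIER" for h, m in pairs if h != m]
print(coded)  # ['HEAD', 'MODIFIER', 'HEAD']
```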
In this chapter, we could do no more than scratch the surface, however, so for anyone who is serious about using
statistics in their research, following the recommendations for further reading at
the end of this chapter is very important, more so than for any other chapter.
One aspect of statistical analysis that we only mentioned in passing was that
of general quantitative characteristics that a sample may have. For example, we
mentioned that data must follow a normal distribution for some statistical
procedures but not for others. There are additional restrictions for some tests. For
example, many procedures for comparing group means can only be applied if the
two groups have the same variance (e.g., the more widely used Student's t-test), and
there are tests to tell us whether this is the case (the F-test). Also, it makes a
difference whether the two groups that we are comparing are independent of each
other (as in the case studies presented here), or if they are dependent in that there
is a correspondence between measures in the two groups. For example, if we
wanted to compare the length of heads and modifiers in the s-possessive, we
would have two groups that are dependent in that for any data point in one of the
groups there is a corresponding data point in the other group that comes from the
same corpus example. In this case, we would use a paired test (for example, the
matched-pairs Wilcoxon test for ordinal data and Student's paired t-test for
cardinal data).
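The difference matters in practice: a paired test operates on the per-item differences rather than on the two groups separately. The sketch below computes a paired t statistic by hand; the head and modifier lengths are invented for illustration, not drawn from a corpus.

```python
# Paired t statistic computed from per-item differences; the lengths are
# invented for illustration, not drawn from a corpus.
from math import sqrt
from statistics import mean, stdev

head_lengths = [3, 5, 2, 6, 4, 3, 5]
modifier_lengths = [2, 3, 2, 4, 3, 1, 4]

# Each pair comes from the same corpus example, so we test the differences.
differences = [h - m for h, m in zip(head_lengths, modifier_lengths)]
t = mean(differences) / (stdev(differences) / sqrt(len(differences)))
print(round(t, 2))  # 4.5
```

An independent-samples test applied to the same numbers would ignore the pairing and typically have less power to detect the difference.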
7. Significance Testing
necessary in the next chapter, limiting ourselves to designs where all variables
are nominal. However, in multivariate designs many combinations of variable
types are possible and there is a large number of multivariate statistical
procedures, such as the well-known Analysis of Variance (ANOVA) for designs
with several independent nominal variables and a cardinal dependent variable.
Further reading
Anyone serious about using statistics in their research should start with a
basic introduction to statistics, and then proceed to an introduction geared more
specifically towards linguistic research questions, but I also encourage you to
explore the wide range of free or commercially available books introducing
statistics with R.
8. Complex Quantitative Research Designs
In this chapter, we will finish our discussion of quantitative corpus linguistic
research designs by looking at some issues that become relevant with more
complex designs: first, designs with two variables where at least one variable has
more than two values, and second, designs with more than two variables. We will
focus on designs with nominal variables, but the issues discussed here are
relevant to research designs with all kinds of variables.
independent variable was binary: only two constructions were involved.
The dependent variables were more complex: the cardinal variable LENGTH
obviously had a potentially infinite number of values and the ordinal variable
ANIMACY had ten values; the nominal variable DISCOURSE STATUS was treated like a
binary variable too. Frequently, perhaps even typically, corpus linguistic research
questions will be more complex, and we will be confronted with designs where
independent variables and/or nominal dependent variables will have more than
two values.
For example, we might be intrigued by the fact that transitive clauses are
explicitly or implicitly assumed to be the most typical clause type by
syntacticians, while in actual usage, intransitive clauses are much more frequent
(cf. Thompson and Hopper 2001). We might then speculate that the frequency of
different subcategorization frames may differ across text types and that the
perceived typicality of transitive clauses may be due to the salience of particular
text types. Assuming we limit ourselves to the major subcategorization patterns,
our dependent variable will have at least four values (intransitive, transitive,
ditransitive, and copular). The independent variable will, of course, have
indefinitely many values (since no-one knows how many text types there are).
Let us assume that we have chosen five text types as representative of two major
distinctions often made: formal vs. informal and spoken vs. written. The
intersection of these two variables will then yield a four-by-five table. Table 8-1
shows the observed and expected frequencies for each of these intersections.23
23 Note that, at least under one interpretation, this design is actually flawed: if we
assume that the dimensions just mentioned can directly influence the frequency of
subcategorization patterns, then the values of our independent variable overlap: two
of the values are spoken, two are written, and one is both; likewise, two are informal
and three are formal. Thus, a better design would be one with FORMALITY and CHANNEL as
independent variables; since we do not know how to deal with more than one
independent variable at a time yet, let us ignore this flaw. Of course, if you do not
assume that the dimensions FORMALITY and CHANNEL can have a direct influence, but that
they are simply dimensions along which holistic and wholly independent text types
happen to be similar to each other, then there is no problem in the first place. This is a
good example of how the constructs in our research design are always anchored in a
particular model of reality, rather than reality itself.
[Only the final row of Table 8-1 (SCRIPTED: expected frequencies 1,435.62, 1,578.00, 2,400.93 and 91.45; row total 5,506) is legible in this copy.]
For example, for the top left cell, we get (13,077 × 22,680) ÷ 50,154 = 5,913.51.
Likewise, the chi-square cell components are calculated in the usual way, i.e.
(O - E)² ÷ E; for the top left cell, for example, this gives us (6,333 - 5,913.51)² ÷
5,913.51 = 29.76.
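Both calculations can be checked with a few lines of Python, using the marginal totals and the table total as given in the text:

```python
# Expected frequency and chi-square component for the top left cell of
# Table 8-1 (marginal totals and table total as given in the text).
marginal_product = 13_077 * 22_680  # row total times column total
table_total = 50_154
observed = 6_333

expected = marginal_product / table_total
component = (observed - expected) ** 2 / expected
print(round(expected, 2), round(component, 2))  # 5913.51 29.76
```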
                         SUBCATEGORIZATION FRAME
TEXT TYPE           INTRANSITIVE  TRANSITIVE  DITRANSITIVE  COPULAR
SPOKEN INFORMAL         29.76        26.24        63.87        3.77
SPOKEN FORMAL            0.66         1.34         0.61       16.17
WRITTEN INFORMAL         3.95         2.08         2.26       33.93
WRITTEN FORMAL          69.91        21.44       128.79       33.48
[the SCRIPTED row is not legible in this copy]
Adding up the individual cell components gives us a chi-square value of 473.73.
The general formula for determining the degrees of freedom of a table is the
following (recall that rows and columns containing labels or marginal frequencies
do not count):

df = (number of rows - 1) × (number of columns - 1)

Our table thus has 4 × 3 = 12 degrees of freedom. As the chi-square table in
Appendix C.1 shows, the required value for a significance level of 0.001 at 12
degrees of freedom is 32.90; our chi-square value for Table 6-6 is much higher
than this; thus, our results are highly significant. We could summarize our
Recall that the mere fact that there is a significant association between register
and subcategorization does not tell us anything about the strength of that
association (the effect size). For tables larger than two-by-two, there is a
correlation coefficient referred to as Cramér's V (occasionally also referred to as
Cramér's φ or φc), which is calculated as follows (N is the table sum, k is the
number of rows or columns, whichever is smaller):

(8.2) Cramér's V = √(χ² ÷ (N × (k - 1)))
(8.3) Cramér's V = √(473.73 ÷ (50,154 × (4 - 1))) = 0.06
Recall that the square of a correlation coefficient tells us the proportion of the
variance captured by our design, which, in this case, is 0.0031. In other words,
TEXT TYPE explains a mere third of a percent of the distribution of
subcategorization patterns (at least if we operationalize it as we have done here);
or: "This study has shown a very weak but highly significant influence of text
type on the frequency of subcategorization patterns (χ² = 473.73, df = 12,
p < 0.001, V = 0.06)."
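Formula (8.3) can be verified directly, using the values given in the text:

```python
# Cramér's V for the text-type by subcategorization-frame table,
# using the chi-square value and table total given in the text.
from math import sqrt

chi_square = 473.73
table_total = 50_154
k = 4  # the smaller of the number of rows and columns

v = sqrt(chi_square / (table_total * (k - 1)))
print(round(v, 2), round(v ** 2, 4))  # 0.06 0.0031
```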
This result in itself will probably not satisfy the curiosity arising from our
initial hypothesis concerning the overrepresentation of transitive clauses in
syntactic argumentation: we need to know whether particular subcategorization
patterns are particularly frequent in a particular register. In order to answer this
question, we need to know precisely which intersections of values are more
frequent or less frequent than expected. We can easily find out by comparing the
observed and expected frequencies in each cell. For example, we can look at the
top left cell and see that the observed frequency is 6,333 against an expected
5,913, i.e., that intransitives are more frequent than expected in informal spoken
language.
However, while we know that the distribution in the table as a whole is
significant, we do not know whether any particular difference by itself is
significant.24 To find out, we can check the individual cell components
against the required corrected values and determine exactly which of the
intersections contribute significantly to our result. By comparing the observed
against the expected frequencies, we can determine for each intersection of
values whether it is more or less frequent than expected. For twenty tests, this is
quite a complex task that we would not want to leave to the reader. Some way of
summarizing the information in a compact way is required. There is no standard
way of doing this, but the representation in Table 8-3 seems a reasonable
suggestion: in each cell, the first line contains either a plus (for "more frequent
than expected") or a minus (for "less frequent than expected"); the second line
contains the actual contribution to chi-square for the cell, and the third line
contains the level of significance (using the standard convention of representing
levels by asterisks: three for a level of 0.001, two for a level of 0.01, and one for a
level of 0.05; non-significance is represented by "n.s." and marginal significance
by "marg.").

24 In fact, matters are slightly more complex: if we treat each cell as an independent
mini-table, then we have effectively performed 20 chi-square tests on the same data
set. But recall that the level of significance is nothing but a probability of error: for
example, p = 0.05 means that there is a five percent likelihood that the results are due
to chance. Of course, this means that if we perform twenty tests, we should expect, on
average, one of our results to have come about by chance (20 × 0.05 = 1). In order to
avoid this danger, the chi-square values must be adjusted. Appendix C.2 lists the
corrected values for multiple tests with one degree of freedom. Checking the adjusted
chi-square values for twenty tests, we find that the required value for a 0.001 level of
significance is 16.45: our value of 29.76 is higher than even this corrected value, and
thus the association between intransitives and spoken informal language is highly
significant.
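The corrected value of 16.45 quoted from Appendix C.2 can be approximated with a Bonferroni-style adjustment: divide the significance level by the number of tests and convert the adjusted level into a chi-square critical value for one degree of freedom (a chi-square variable with df = 1 is a squared standard normal variable). This is a sketch of the reasoning, not necessarily the exact procedure used to compute the appendix table:

```python
# Bonferroni-style adjustment sketch: the chi-square (df = 1) critical
# value for a significance level of 0.001 shared across 20 tests.
from statistics import NormalDist

alpha = 0.001
number_of_tests = 20
adjusted_alpha = alpha / number_of_tests
critical_value = NormalDist().inv_cdf(1 - adjusted_alpha / 2) ** 2
print(round(critical_value, 2))  # approximately 16.45
```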
[Table 8-3; the caption and the plus/minus line of each cell are only partially legible in this copy]

                         SUBCATEGORIZATION FRAME
TEXT TYPE           INTRANSITIVE  TRANSITIVE  DITRANSITIVE  COPULAR
SPOKEN INFORMAL         29.76        26.24        63.87        3.77
                        ***          ***          ***          n.s.
SPOKEN FORMAL            0.66         1.34         0.61       16.17
                        n.s.         n.s.         n.s.         **
WRITTEN INFORMAL         3.95         2.08         2.26       33.93
                        n.s.         n.s.         n.s.         ***
WRITTEN FORMAL          69.91        21.44       128.79       33.48
                        ***          ***          ***          ***
This table presents the complex results at a single glance; they can now be
interpreted. Three patterns are quite obvious in the data: First, there is a clear
difference between the two extreme ends of the register scale: in spoken
informal language, intransitive and copular verbs are more frequent and
transitives are less frequent; in written formal language, it is the other way
around.
contain more ditransitives than expected (which must somehow be related to the
two most frequent ditransitive verbs in the ICE-GB, give and tell). In light of our
initial hypothesis, we could now conclude, for example, that syntacticians were
fooled by the relative importance of transitive clauses in formal written language
into thinking that this subcategorization pattern is in some way typical.
investigate the relationship between two variables), such as those discussed in the
previous chapters and also in the previous section. In many cases, bivariate
research designs are sufficient to test a given research hypothesis, and thus, they
will continue to have a place in corpus linguistics.
However, there is an obvious reason why bivariate designs might not always
be useful: a fragment of reality that we wish to investigate will very often be too
complex to capture in terms of just two variables. This may be obvious to the
researcher from the outset or it may become obvious to them when a given
variable turns out to have a very small effect size (suggesting that there may be
additional variables that explain the deviation of the observed from the expected
frequencies). In other words, our research hypothesis may often posit a
relationship between more than two variables, and even where it does not, there
is an inherent danger in bivariate research designs. The next subsection will spell
out this danger, and the subsection following it will present a solution. Note that
this solution is considerably more complex than the statistical procedures we
have looked at so far and while it will be presented in sufficient detail to enable
the reader in principle to apply it themselves, some additional reading will be
highly advisable.
Table 8-4. Utterances containing the word lovely by speaker age (ICE-GB)

                          U T T E R A N C E S
                     LOVELY               OTHER                Total
A   YOUNG       Obs. 74              Obs. 44,478              44,552
G               Exp. 59.20           Exp. 44,492.80
E               χ² 3.70              χ² 0.005
    OLD         Obs. 18              Obs. 24,667              24,685
                Exp. 32.80           Exp. 24,652.20
                χ² 6.68              χ² 0.009
    Total            92                   69,145              69,237

Young people use lovely more frequently than expected and old people use it
less frequently, although the effect is very weak (χ² = 10.39, df = 1, p < 0.01;
φ² = 0.0001).
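The reported chi-square value can be reproduced from the observed frequencies of Table 8-4, with the expected frequencies computed from the marginal totals:

```python
# Chi-square for Table 8-4 (lovely by speaker age) from the observed
# frequencies; expected frequencies come from the marginal totals.
observed = [[74, 44_478],   # young: lovely, other
            [18, 24_667]]   # old:   lovely, other

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
table_total = sum(row_totals)

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * column_totals[j] / table_total
        chi_square += (observed[i][j] - expected) ** 2 / expected
print(round(chi_square, 2))  # 10.39
```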
Next, consider a different hypothesis: The word lovely is typical of women's
language (cf. R. Lakoff 1973), from which we can straightforwardly derive the
prediction that it is used more frequently by women than by men. Let us accept
the operationalization of SEX that the creators of the ICE-GB chose (a bad idea,
since we have no idea what it is based on). Table 8-5 shows the relevant data.
Again, the hypothesis is confirmed, although, again, the effect is very weak
(χ² = 20.29, df = 1, p < 0.001; φ² = 0.0003). This raises the following question: what
is the relationship between AGE and SEX when it comes to the use of the word
lovely? Consider now Table 8-6, which shows the overall frequency of utterances
in the ICE-GB by AGE and SEX.
Table 8-5. Utterances containing the word lovely by speaker sex (ICE-GB)

[Only the marginal totals of Table 8-5 (92 utterances containing lovely, 69,145 others, 69,237 in all) and two chi-square components of Table 8-6 (YOUNG: 665.98; OLD: 1,201.97) are legible in this copy.]
analysis of linguistic research questions. An early language-related application
involving language disorders is Lautsch et al. (1988), and it is found occasionally
in psycholinguistic and sociolinguistic research (e.g. Hsu et al. 2000, Adli 2004);
the first corpus-linguistic application I am aware of is Gries (2002); it has since
established itself as a corpus-linguistic research tool to some extent (see, e.g.,
Stefanowitsch and Gries 2005, 2008).
In its simplest version, configural frequency analysis is simply a chi-square
test on a contingency table with more than two dimensions. There is no logical
limit to the number of dimensions, but if we want to calculate this statistic
manually (rather than letting a specialized software package do it, cf. Appendix
A.3), then a three-dimensional table is already quite complex to deal with. Thus,
we will not go beyond three dimensions here.
A three-dimensional contingency table would have the form of a cube, as
shown in Figure 8-1 (the smaller cube represents the cells on the far side of the
big cube seen from the same perspective, and the smallest cube represents the cell
in the middle of the whole cube). As before, cells are labeled by subscripts: the
first subscript stands for the values and totals of the dependent variable, the
second for those of the first independent variable, and the third for those of the
second independent variable.
[Figure 8-1: a three-dimensional contingency table drawn as a cube. The cells are labeled O111 through O222, and marginal totals carry the subscript T (e.g. O1TT, OTT1, OTTT); the axes are labeled Dep. Variable, Indep. Variable A and Indep. Variable B.]
While this kind of visualization is quite useful in grasping the notion of a three-
Given this representation, the expected frequencies for each intersection of the
three variables can now be calculated in a way that is similar (but not identical) to
that used for two-dimensional tables. Table 8-8 shows the formulas for each cell
as well as those marginal sums needed for the calculation: the expected frequency
of each cell is the product of its three marginal totals, divided by the square of the
table total. For the cell O111, for example:

E111 = (O1TT × OT1T × OTT1) ÷ (OTTT)²

and analogously for the remaining cells (E112 = (O1TT × OT1T × OTT2) ÷ (OTTT)²,
and so on).
In our case, each variable has two values, thus we get (2 × 2 × 2) - (2 + 2 + 2) + 2 = 4.
More interestingly, we can also look at the individual cells to determine whether
their contribution to the overall value is significant. In this case, as before, each
cell has one degree of freedom and the significance levels have to be adjusted for
multiple testing. In CFA, an intersection of variables whose observed frequency is
significantly higher than expected is referred to as a type, and one whose observed
frequency is significantly lower is referred to as an antitype.
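The expected frequencies of a 2 × 2 × 2 configural frequency analysis follow the formula just given: multiply the three one-way marginal totals of a cell and divide by the squared table total. The sketch below uses invented frequencies; the labels merely echo the lovely example and are not the ICE-GB counts.

```python
# Expected frequencies for a 2x2x2 configural frequency analysis:
# E(cell) = product of the cell's three marginal totals / N**2.
# The frequencies are invented for illustration (not the ICE-GB data).
observed = {
    ("lovely", "female", "young"): 40, ("lovely", "female", "old"): 10,
    ("lovely", "male", "young"): 15,   ("lovely", "male", "old"): 5,
    ("other", "female", "young"): 300, ("other", "female", "old"): 200,
    ("other", "male", "young"): 250,   ("other", "male", "old"): 180,
}
n = sum(observed.values())

def marginal(axis, value):
    # One-way marginal total for a single value of a single variable.
    return sum(f for cell, f in observed.items() if cell[axis] == value)

def expected(cell):
    product = 1
    for axis, value in enumerate(cell):
        product *= marginal(axis, value)
    return product / n ** 2

chi_square = sum((f - expected(cell)) ** 2 / expected(cell)
                 for cell, f in observed.items())
degrees_of_freedom = 2 * 2 * 2 - (2 + 2 + 2) + 2
print(degrees_of_freedom)  # 4
```

Each cell's component can then be checked against the adjusted critical values for eight tests, and significant cells classified as types or antitypes.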
Let us apply this method to the case described in the previous sub-section.
Table 8-9 shows the observed and expected frequencies of utterances containing
the word lovely by SPEAKER AGE and SPEAKER SEX, as well as the corresponding chi-
square components. Before we deal with these, note one important fact about
multi-dimensional contingency tables that may be confusing: if we add up the
expected frequencies of a given column, their sum will not usually correspond to
the sum of the observed frequencies in that column (in contrast to two-
dimensional tables, where this is necessarily the case). Instead, the sum of the
observed and expected frequencies in each slice is identical.
Table 8-9. Use of lovely by SEX and AGE
[The body of Table 8-9 (YOUNG/OLD/Total columns) is not legible in this copy.]
be read off straightforwardly from Table 8-10; the information has merely been
rearranged for ease of exposition. Each of the chi-square components can be
checked against the table for multiple chi-square tests in Appendix C.2. Since we
are dealing with eight individual tests, the required chi-square values are 7.48 for
p < 0.05, 10.41 for p < 0.01, and 14.71 for p < 0.001. The last column shows whether
the chi-square value for a given combination is significant.
                               χ²     Significance
LOVELY   MALE     YOUNG       5.09       n.s.
LOVELY   MALE     OLD         2.63       n.s.
LOVELY   FEMALE   YOUNG      35.92     + ***
LOVELY   FEMALE   OLD         4.49       n.s.
OTHER    MALE     YOUNG     662.62       ***
OTHER    MALE     OLD      1,207.68    + ***
[the two OTHER FEMALE rows and some plus/minus signs are not legible in this copy]
use the word more frequently than expected; we might have expected them to do
so given the results of the bivariate analyses presented in the preceding section. It
is only speakers who are both young and female who use the word lovely more
frequently than expected. Second, the lower half of the table quite clearly reflects
the over- and under-representations of speaker groups in the corpus as a whole.
They are exactly those identified by the bivariate analysis of AGE and SEX above.
The interesting result, then, is that young female speakers also use other words
more frequently than expected; in other words, they are generally
overrepresented in the corpus. What the multivariate analysis tells us, though, is
that despite this general overrepresentation, young female speakers use the word
lovely more frequently than expected; our three individual bivariate analyses
could never have shown us this.
Let us look at a second example of two bivariate designs that yield a more
would be that there is no relationship between SEX and HEDGING (we will return to
the plausibility of the hypothesis and our operationalizations below; they are
drastically simplified adaptations of claims in R. Lakoff 1973).
Table 8-11 shows the observed and expected frequencies of hedged and
unhedged utterances by men and women in the International Corpus of English,
British Component (or rather, those files from that corpus that are annotated for
speaker sex and, for reasons that will become relevant in a moment, education).
Table 8-11. Observed and expected frequencies of hedged and unhedged utterances
by SEX (ICE-GB, sub-corpus annotated for sex and education)

                  HEDGED                UNHEDGED                Total
S   MALE     Obs. 341              Obs. 42,693               43,034
E            (Exp. 412.94)         (Exp. 42,621.06)
X   FEMALE   Obs. 313              Obs. 24,808               25,121
             (Exp. 241.06)         (Exp. 24,879.94)
    Total         654                   67,501               68,155
The results clearly show that (British) men's utterances contain the hedges sort
of/kind of less frequently than expected, while (British) women's utterances
contain them more frequently than expected (χ² = 33.86, df = 1, p < 0.001). This
confirms our research hypothesis, if our operationalizations make sense. A good
way of testing this would be to find another variable that should be associated
with COMMUNICATIVE INSECURITY.
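The reported value can be reproduced from the cell frequencies: the female hedged total (313) appears in Table 8-13, and the remaining cells follow from the marginal totals (654 hedged, 67,501 unhedged, 68,155 in all). Note that reproducing 33.86 exactly requires Yates' continuity correction:

```python
# Chi-square with Yates' continuity correction for Table 8-11. The female
# hedged total (313) is given in Table 8-13; the other cells follow from
# the marginal totals.
observed = [[341, 42_693],   # male: hedged, unhedged
            [313, 24_808]]   # female: hedged, unhedged

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
table_total = sum(row_totals)

chi_square = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * column_totals[j] / table_total
        chi_square += (abs(observed[i][j] - expected) - 0.5) ** 2 / expected
print(round(chi_square, 2))  # 33.86
```

Without the continuity correction the same cells give roughly 34.3, so the 33.86 reported in the text appears to be the corrected value.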
Table 8-12. Observed and expected frequencies of hedged and unhedged utterances
by EDUCATION (ICE-GB, sub-corpus annotated for sex and education)

                             HEDGED UTTERANCES    UNHEDGED UTTERANCES     Total
COMMUNICATIVE   SECONDARY    Obs. 203             Obs. 17,069            17,272
SUBORDINATION                Exp. 165.74          Exp. 17,106.26
                UNIVERSITY   Obs. 451             Obs. 50,432            50,883
                             Exp. 488.26          Exp. 50,394.74
                Total             654                  67,501            68,155
Again, the hypothesis is clearly confirmed: educated speakers use less hedging
than uneducated ones (χ² = 11.43, df = 1, p < 0.001).
We are now, again, faced with two possible explanations: either both SEX and
EDUCATION have an influence on the use of hedging, or SEX and EDUCATION are
correlated in the corpus (like AGE and SEX were). The latter would be expected,
given that access to higher education was difficult for women until fairly
recently. A multivariate analysis will help us distinguish between these two
possibilities. Table 8-13 shows the relevant frequencies for a HEDGING × SEX ×
EDUCATION design.
[Table 8-13; the MALE row is only partially legible in this copy, but its values follow arithmetically from the marginal totals]

                 SECONDARY                        UNIVERSITY                       Total
              HEDGED        UNHEDGED           HEDGED        UNHEDGED          HEDGED  UNHEDGED    Total
MALE     Obs. 85       Obs. 8,749     8,834   Obs. 256      Obs. 33,944  34,200   341   42,693   43,034
         Exp. 104.65   Exp. 10,801.13         Exp. 308.30   Exp. 31,819.93
         χ² 3.69       χ² 389.89              χ² 8.87       χ² 141.79
FEMALE   Obs. 118      Obs. 8,320     8,438   Obs. 195      Obs. 16,488  16,683   313   24,808   25,121
         Exp. 61.09    Exp. 6,305.13          Exp. 179.97   Exp. 18,574.81
         χ² 53.02      χ² 643.87              χ² 1.26       χ² 234.45
Total         203          17,069    17,272        451          50,432   50,883   654   67,501   68,155
The lower half of the table reflects a general bias in the corpus in the direction we
expected (or feared): there do indeed seem to be more utterances by educated
males and uneducated females and fewer by uneducated males and educated
females in the corpus. However, the upper half of the table shows that, despite
this imbalance, there is a significant influence of both variables: educated males
use hedging significantly less frequently than expected and uneducated females
use it highly significantly more frequently. This suggests that educated males feel
most secure in what they are saying and their status in saying so, and uneducated
females feel least secure. Uneducated males and educated females fall somewhere
in the middle.
This may seem a plausible basis for a theory of communicative insecurity, but
we should be careful not to jump to conclusions in our joy of having
discovered relationships between variables in our corpus data. All that our data
have shown us is that people who were defined as male by the makers of the ICE-
GB and who have a university education are least likely to use the hedges kind of
and sort of, while people who were defined as female by the makers of the ICE-
GB and who do not have a university education are most likely to do so, and
that this is statistically significant.
The data do not tell us whether our research design is valid. In this particular
case, this validity is doubtful (to say the least). First, there is no reason to trust the
judgment of the ICE-GB makers with respect to the categorization of speakers
according to SEX: presumably, they went by their intuition, which is based on
outward appearance, proper names etc., rather than asking for a gene test and
using the chromosome pair XX as a definition for FEMALE and XY as a definition
for MALE (even if they had, there are researchers in the humanities who question
the link between genetics and sex in general, and there are medical conditions
that can lead to a divergence of genetics and physiology that make it difficult
even in the context of biology to determine sex).
In addition, it is questionable whether communication styles are determined
by genetic sex at all; more likely, they are the result of socialization into
particular sex/gender roles. These, of course, were not assessed at all by the
corpus makers, so we have to hope that they are associated sufficiently strongly
with the outward signals of biological SEX that the corpus makers were paying
attention to. Second, there is no a priori logical link between hedging and social
subordination. Intuitively, hedging is related to the degree to which a speaker
wants to commit to the contents of an utterance. While a low degree of
commitment to an utterance may of course reflect an attempt to signal insecurity,
it may also have a range of other functions: for example, a desire to soften one's
statements so as not to impose on one's addressee, a signal that one wants to
construct statements cooperatively, etc. (cf. Holmes 1995: 73-74; cf. Coates 1996:
162 for additional possibilities).
This chapter was intended to impress on the reader one thing: that looking at
one potential variable influencing some phenomenon that you are interested in
may not be enough. Multivariate research designs are becoming the norm rather
than the exception, and rightly so. However, the more complex our design, the
more likely we are to forget that it is just that: a design, whose usefulness
depends entirely on how we have operationalized our constructs. Regardless of
the sophistication of our statistical models, our results may tell us very little, or
they may tell us something completely different from what we think we are
seeing in them. Empirical research is never finished; every interpretation of
your data should lead you to come up with additional ways of testing it,
distinguishing it from other plausible interpretations, refining your operational
definitions, and so on.
Further reading

Note that Configural Frequency Analysis is much more flexible and complex
than might be suggested by the above exposition. A readable introduction (by the
standards of statistical textbooks) to the method is von Eye (1990). A good
introduction to multivariate methods and designs in linguistics in general is
Gries (2013). Some examples of multivariate research designs can be found,
for example, in Glynn and Robinson (2014).
Gries, Stefan Th. 2013. Statistics for linguistics with R (2nd revised edition).
Berlin: De Gruyter Mouton.
Glynn, Dylan & Justyna A. Robinson (eds.). 2014. Corpus methods for semantics:
Quantitative studies in polysemy and synonymy. Amsterdam: John Benjamins.
Part II: Applications
9. Collocation
The (orthographic) word plays a central role in corpus linguistics. As suggested in
Chapter 5, this is in no small part due to the fact that all corpora, whatever
additional annotations may have been added, consist of orthographically
represented language, which makes it easy to retrieve word forms. Every
concordancing program offers the possibility to search for a string of
orthographic characters; in fact, some are limited to this type of search.
However, the focus on words is also due to the fact that the results of corpus-
linguistic research quickly showed that words (individually and in groups) are
more interesting and show a more complex behavior than traditional, grammar-
focused theories of language assumed. An area in which this is very obvious, and
which has therefore become one of the most heavily researched areas in corpus
linguistics, is the way in which words combine to form so-called collocations.
9.1. Collocates

Trivially, texts are not random collections of words. The occurrence of words is
restricted by several factors.
First, the grammatical context limits our choices. For example, a determiner
cannot be followed by another determiner or a verb, but only by a noun, by an
adjective modifying a noun, or an adverb modifying such an adjective; likewise, a
transitive verb requires a direct object in the form of a noun phrase, so we expect
However, it has long been noted that words are not distributed randomly even
within the confines of grammar, lexical or world knowledge, and communicative
intent. Instead, a given word will have affinities to some words (and disaffinities
to others) that we could not predict given a set of grammatical rules, a dictionary
and a thought that needs to be expressed. One of the first principled discussions
of this phenomenon is found in Firth (1957). Using the example of the word ass
(in the sense of 'donkey'), he discusses the way in which what he calls habitual
collocations contribute to the meaning of words:

One of the meanings of ass is its habitual collocation with an immediately

Note that Firth, although writing well before the advent of corpus linguistics,
refers explicitly to frequency as a characteristic of collocations. The possibility of
                                   POSITION 2
                       WORD B                 OTHER WORDS             Total
POSITION 1  WORD A     word a + word b        word a + other word     word a
            OTHER      other words +          other words +           other words
            WORDS      word b                 other word
            Total      word b                 other words             sample size

On the basis of such a table, we can determine the collocation status of a given
word pair. For example, we can ask whether Firth was right with respect to the
claim that silly ass is a collocation. The necessary data are shown in Table 9-2.
                                   POSITION 2
                       ASS                  OTHER WORDS               Total
POSITION 1  SILLY      Obs: 9               Obs: 2,670                2,679
                       Exp: 0.01            Exp: 2,678.99
            OTHER      Obs: 350             Obs: 98,493,136           98,493,486
            WORDS      Exp: 358.99          Exp: 98,493,127.01
            Total      359                  98,495,806                98,496,165

Note: Data are based on the version of the BNC distributed by the Oxford Text
Archive (see Appendix A.1). The queries are for the lemmata silly and ass and the
lemma sequence silly ass. The total number of words in the BNC is based on a query
for all tokens in the BNC that are not tagged as punctuation marks.
The combination silly ass is very rare in English, occurring just nine times in the
98,496,165-word BNC, but Table 9-2 shows that it is vastly more frequent than
expected: by chance, it should have occurred only 0.01 times (i.e., not at all). The
difference between the observed and the expected frequencies is highly
significant (χ² = 8277.66, df = 1, p < 0.001). Note that we are using the chi-square
test here because we are already familiar with it; it is questionable, however,
whether this test is the most useful one for the purpose of identifying

frequencies with which they occur in the corpus in general. In other words, the
two conditions across which we are investigating the distribution of a word are
'next to a given other word' and 'everywhere else'. In other words, the corpus
occurs more frequently with ass than with donkey or animal than that it occurs
more frequently with ass than with stone or democracy. Likewise, it is more
interesting that silly occurs with ass more frequently than childish than that silly
occurs with ass more frequently than precious or parliamentary.
In such cases, we can modify Table 9-1 as shown in Table 9-3 to identify the
collocates that differ significantly between two words. There is no established
term for such collocates, so we will call them differential collocates here25 (the
method is based on Church et al. 1991).
                                   POSITION 2
                       WORD B               WORD C                Total
POSITION 1  WORD A     word a +             word a +              word a
                       word b               word c
            OTHER      other words +        other words +         other words
            WORDS      word b               word c                occ. with b or c
            Total      word b               word c                sample size
Since the collocation silly ass (and the word ass in general) is so infrequent in
the BNC, let us use a different noun to demonstrate the usefulness of this method,
the word game. We can speak of silly game(s) or childish game(s), but we may feel
that the latter is more typical than the former. The relevant lemma frequencies to
put this feeling to the test are shown in Table 9-4.

Table 9-4: Childish game vs. silly game (lemmas) in the BNC

                                   POSITION 1
                       CHILDISH             SILLY                 Total
POSITION 2  GAME       Obs: 12              Obs: 31               43
                       Exp: 6.1             Exp: 36.9
            OTHER      Obs: 431             Obs: 2,648            3,079
            WORDS      Exp: 436.9           Exp: 2,642.1
            Total      443                  2,679                 3,122

25 Gries (2003) and Gries and Stefanowitsch (2004) use the term distinctive collocate,
which has been taken up by some authors; however, many other authors use the term
distinctive collocate much more broadly to refer to characteristic collocates of a word.
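The expected frequencies in Table 9-4 follow from the marginal totals in the usual way (row total × column total ÷ table total):

```python
# Expected frequencies for Table 9-4 from its marginal totals.
observed = [[12, 31],       # game: childish, silly
            [431, 2_648]]   # other words: childish, silly

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
table_total = sum(row_totals)

expected = [[row_totals[i] * column_totals[j] / table_total for j in range(2)]
            for i in range(2)]
print([[round(e, 1) for e in row] for row in expected])
# [[6.1, 36.9], [436.9, 2642.1]]
```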
Childish game(s) and silly game(s) both occur in the BNC. Both combinations
taken individually are significantly more frequent than expected (you may check
this yourself using the frequencies from Table 9-4, the total lemma frequency of
game in the BNC (20,625), and the total number of words in the BNC given in
Table 9-2 above). The lemma sequence silly game is more frequent, which may
suggest that it is the stronger collocation. However, the direct comparison shows
that this is due to the fact that silly is more frequent in general than childish, so
that the chance likelihood of silly games is higher than that of childish games. The
difference between the observed and the expected frequencies suggests that
childish is more strongly associated with game(s) than silly. The difference is very
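The expected frequencies in Table 9-4 can be recalculated from the marginal totals alone; the following is a minimal Python sketch (the function name is mine, not part of any established package) for an arbitrary 2-by-2 table:

```python
def expected_frequencies(table):
    """Compute expected frequencies for a 2-by-2 contingency table.

    Each cell's expected frequency is (row total * column total) / table total.
    `table` is a list of two rows: [[o11, o12], [o21, o22]].
    """
    (o11, o12), (o21, o22) = table
    n = o11 + o12 + o21 + o22
    row_totals = (o11 + o12, o21 + o22)
    col_totals = (o11 + o21, o12 + o22)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Table 9-4: rows GAME / OTHER WORDS, columns CHILDISH / SILLY
expected = expected_frequencies([[12, 31], [431, 2648]])
```

With these frequencies, childish game is expected about 6.1 times but observed 12 times, while silly game is expected 36.9 times but observed only 31 times, which is what suggests that childish is the more strongly attracted adjective.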
words must occur directly next to each other), but many allow larger spans
(five words being a relatively typical span size).
Other researchers treat co-occurrence as a structural phenomenon, i.e., they
define collocates as words that co-occur more frequently than expected in two
related positions in a particular grammatical structure, for example, the adjective
and noun positions in noun phrases of the form [Det Adj N] or the verb and noun
position in transitive verb phrases of the form [V [NP (Det) (Adj) N]].26 However,
instead of limiting the definition to one of these possibilities, it seems more
plausible to define the term appropriately in the context of a specific research
question. In the examples above, we used a purely sequential definition that
26 Note that such word-class specific collocations are sometimes referred to as
colligations, although the term colligation usually refers to the co-occurrence of a
word in the context of particular word classes, which is not the same.
simply required words to occur next to each other, paying no attention to their
word-class or structural relationship; given that we were looking at adjective-noun
combinations, it would certainly have been reasonable to restrict our search
parameters to adjectives modifying the noun ass, regardless of whether other
adjectives intervened, for example in expressions like silly old ass, which our
search would have missed if they occurred in the BNC (which they do not).
Note that the designs in Tables 9-1 and 9-3 are essentially variants of the
general research design introduced in previous chapters and used as the
foundation of defining corpus linguistics: it has two variables, POSITION 1 and
POSITION 2, both of which have two values, namely WORD X vs. OTHER WORDS (or, in
the case of differential collocates, WORD X vs. WORD Y). The aim is to determine
whether the value WORD A is more frequent for POSITION 1 under the condition that
WORD B occurs in POSITION 2 than under the condition that other words (or a
particular other word) occur in POSITION 2.
environment that is ideal for almost any task that we are likely to come across as
corpus linguists).
add the probability of getting ten tails, 0.000976). This is well below the level
required to claim statistical significance. However, if we performed a hundred
series of ten coin-flips, there is a very strong likelihood that one series of this
type (or of the more extreme type zero heads and ten tails) occurred; after all,
there is a one percent chance, i.e. a chance of 1 in 100. This is not a problem as
long as we did not accord this one result out of a hundred any special importance.
However, if we were to identify a set of 100 collocations with p-values of 0.001 in
a corpus, we are potentially treating all of them as important, even though it is
very likely that at least one of them reached this level of significance by chance.
To avoid this, we have to correct our levels of significance when performing
multiple tests on the same set of data. The simplest way to do this is the so-called
Bonferroni correction, which consists in dividing the conventionally agreed-upon
significance levels by the number of tests we are performing. For example, if we
performed significance tests on 100 potential collocations, the significance levels
would be 0.0005 (significant), 0.0001 (very significant) and 0.00001 (highly
significant).
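The Bonferroni correction is trivial to apply in practice; here is a minimal Python sketch (the function name is mine), using the conventional levels and the 100-test example from above:

```python
def bonferroni(alpha_levels, n_tests):
    """Divide each conventional significance level by the number of tests performed."""
    return [alpha / n_tests for alpha in alpha_levels]

# conventional levels: significant, very significant, highly significant
corrected = bonferroni([0.05, 0.01, 0.001], 100)
```

The same call with 422,764 tests (all token pairs in the LOB corpus) yields the much stricter level discussed in the next paragraph.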
It should be noted that this is an extremely conservative correction that might
make it quite difficult for any given collocation to reach significance. For example,
if we were to calculate the significance of all token pairs in the LOB corpus, we
would have to perform a statistical test on 422,764 contingency tables. This means
that the most generous level of significance is now 0.05/422,764 = 0.00000012.
This still leaves more than 15,000 statistically significant collocations in the LOB
corpus, but it will remove many more that could also be significant. On the other
hand, the Bonferroni correction has the advantage of being very simple to apply
(see Shaffer 1995 for an overview of corrections for multiple testing, including many
that are less conservative than the Bonferroni correction).
Of course, the question is how important the role of p-values is in a design
where our main aim is to identify collocates and order them in terms of their
collocation strength. I will turn to this point presently, but before I do so, let us
discuss the third of the three consequences of large-scale testing for collocation,
the methodological one.
scientific research that does not start with a hypothesis, but rather with general
questions like "Do relationships exist between the constructs in my data?" and "If
so, what are those relationships?". The research then consists in applying
statistical procedures to large amounts of data and examining the results for
interesting patterns. As electronic storage and computing power have become
cheaper and more widely accessible, this approach (the exploratory or inductive
approach) has become increasingly popular in all branches of science,
particularly the social sciences. It would be surprising if corpus linguistics were an
exception, and indeed, it is not. The area of collocational research, especially, is
typically exploratory.
language in general).
But there is a danger, too: Most statistical procedures will produce some
statistically significant result if we apply them to a large enough data set, and
collocational methods certainly will. Unless we are interested exclusively in
description, the crucial question is whether these results are meaningful. If we
start with a hypothesis, we are restricted in our interpretation of the data by the
need to relate our data to this hypothesis. If we do not start with a hypothesis, we
can interpret our results without any restrictions, which, given the human
propensity to see patterns everywhere, may lead to somewhat arbitrary post-hoc
interpretations that could easily be changed, even reversed, if the results had been
different, and that therefore tell us very little about the phenomenon under
investigation or language in general. Thus, it is probably a good idea to formulate
at least some general expectations before doing a large-scale collocation analysis.
Even if we do start out with general expectations or even with a specific
hypothesis, we will often discover additional facts about our phenomenon that go
beyond what is relevant in the context of our original research question. For
example, checking in the BNC Firth's claim that the most frequent collocates of
ass are silly, obstinate, stupid, awful and egregious and that young is much more
frequent than old, we find that silly is indeed the most frequent adjectival
collocate, but that obstinate, stupid and egregious do not occur at all, that awful
occurs only once, and that young and old both occur twice. Instead, frequent
adjectival collocates (ignoring second-placed wild, referring to actual donkeys)
are pompous and bad. Pompous does not really fit with the semantics that Firth's
adjectives suggest and indicates that a semantic shift from stupidity to self-importance
may have taken place between 1957 and 1991 (when the BNC was
assembled).
This is, of course, a new hypothesis that can (and must) be investigated by
comparing data from the 1950s and the 1990s. It has some initial plausibility in
that the adjectives blithering, hypocritical, monocled and opinionated also co-occur
with ass in the BNC but were not mentioned by Firth. However, it is crucial to
treat this as a hypothesis rather than a result. The same goes for bad ass, which
suggests that the American sense of ass (bottom) and/or the American adjective
badass (which is often spelled as two separate words) may have begun to enter
British English. In order to be tested, these ideas and any ideas derived from an
exploratory data analysis have to be turned into testable hypotheses and the
constructs involved have to be operationalized. Crucially, they must be tested on
a new data set; if we were to circularly test them on the same data that they were
derived from, we would obviously find them confirmed.
I will represent the formulas with reference to Table 9-5a, i.e., O11 means the
observed frequency of the top left cell, E11 its expected frequency, R1 the first row
total, C2 the second column total, and so on.
Table 9-5a: A generic 2-by-2 table for collocation research

                          POSITION 2
                  WORD B                OTHER WORDS         Total
                  (or WORD C)
POSITION 1
WORD A            O11                   O12                 R1
OTHER WORDS       O21                   O22                 R2
Total             C1                    C2                  N
Now all we need is a good example to demonstrate the calculations. Let us use
the adjective-noun sequence good example from the LOB corpus (but do not fear,
we will return to equine animals and their properties further below).
Table 9-5b: The collocation good example in the LOB corpus

                      POSITION 2
                EXAMPLE             OTHER WORDS            Total
POSITION 1
GOOD            Obs: 9              Obs: 870               879
                Exp: 0.1846         Exp: 878.8154
OTHER           Obs: 235            Obs: 1160704           1160939
WORDS           Exp: 243.8154       Exp: 1160695.1846
Total           244                 1161574                1161818
The main reason for the existence of different measures is, of course, that they
give different results. In particular, many measures have a problem with rare
collocations, especially if the individual words of which they consist are also rare.
After we have introduced the measures, we will therefore compare their
performance with a particular focus on the way in which they deal (or fail to
deal) with such rare events.
Applied to Table 9-5b, the chi-square statistic would be 421.37 (at 1 degree of
freedom this means that p < 0.001, but we are not concerned with p-values here).
Recall that the chi-square test statistic is not an effect size, but that it needs to
be divided by the table total to turn it into one; as long as we are deriving all
our collocation data from the same corpus, this will not make a difference, since
the table total will always be the same. However, if we are calculating differential
collocates, this is no longer the case, so we might want to use the phi statistic
instead. I am not aware of any research using the phi statistic as an association
measure, and in fact the chi-square statistic itself is not used very widely either.
This is because it has a serious problem: recall that it cannot be applied if more
than 20 percent of the cells of the contingency table contain expected frequencies
smaller than 5 (in the case of collocates, this means not even one out of the four
cells of the 2-by-2 table). One reason for this is that it dramatically overestimates
the effect size and significance of such events, and of rare events in general. Since
collocations are often relatively rare events, this makes the chi-square statistic a
bad choice as an association measure.
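For readers who want to check the figures, the chi-square statistic and the derived phi value for a 2-by-2 table can be computed as follows; a minimal sketch (function names mine), applied to Table 9-5b:

```python
from math import sqrt

def chi_square(o11, o12, o21, o22):
    """Pearson chi-square for a 2-by-2 table of observed frequencies."""
    n = o11 + o12 + o21 + o22
    observed = (o11, o12, o21, o22)
    # expected frequency of each cell: row total times column total, over n
    expected = ((o11 + o12) * (o11 + o21) / n,
                (o11 + o12) * (o12 + o22) / n,
                (o21 + o22) * (o11 + o21) / n,
                (o21 + o22) * (o12 + o22) / n)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def phi(o11, o12, o21, o22):
    """Phi: the chi-square value normalized by the table total, i.e. an effect size."""
    n = o11 + o12 + o21 + o22
    return sqrt(chi_square(o11, o12, o21, o22) / n)

# Table 9-5b: good example in the LOB corpus
chi = chi_square(9, 870, 235, 1160704)
```

The chi-square value reproduces the 421.37 mentioned above; the phi value (about 0.019) is what one would compare across corpora of different sizes.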
(9.1a) MI = log2 (O11 / E11)
In our case, we are looking at cases where WORD A and WORD B occur directly next
to each other, i.e., the span size is 1. When looking at a larger span (which is often
done in collocation research), the likelihood of encountering a particular collocate
increases, because there are more slots that it could potentially occur in. The MI
statistic can be adjusted for larger span sizes as follows (where S is the span size):
(9.1c) MI = log2 (O11 / (E11 · S))
The mutual information measure suffers from the same problem as the chi-square
statistic, overestimating the importance of rare events, but since it is still fairly
widespread in collocational research, we may need it for situations where we are
comparing our own data to the results of published studies. However, note that
28 A logarithm with a base b of a given number x is the power to which b must be raised
to produce x, so, for example, log10(2) = 0.30103, because 10^0.30103 = 2. Most calculators
give us at least a choice between the natural logarithm, where the base is the number
e (approx. 2.7183), and the common logarithm, where the base is the number 10; many
calculators and all major spreadsheet programs allow us to calculate logarithms with
any base. In the formula in (9.1a), we need the logarithm with base 2; if this is not
available, we can use the natural logarithm and divide the result by the natural
logarithm of 2:

(i) MI = loge (O11 / E11) / loge (2)
there are versions of the MI measure that will give different results, so we need to
make sure we are using the same version as the study we are comparing our
results to. Or better yet, we should not use mutual information at all (one of the
case studies presented below uses it, see Section 9.2.1.1).
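As a quick illustration of (9.1a) and its span-adjusted variant (9.1c), here is a minimal Python sketch (function name mine), applied to the good example data from Table 9-5b:

```python
from math import log2

def mutual_information(o11, e11, span=1):
    """MI = log2(O11 / (E11 * S)); with the default span of 1 this is formula (9.1a)."""
    return log2(o11 / (e11 * span))

# good example in the LOB corpus (Table 9-5b): O11 = 9, E11 = 0.1846
mi = mutual_information(9, 0.1846)
```

Doubling the span size halves the ratio inside the logarithm and therefore lowers the MI value by exactly one.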
(9.2a) G² = 2 · Σ (i = 1 to n) Oi · loge (Oi / Ei)
In order to calculate the log-likelihood measure, we calculate for each cell the
natural logarithm of the observed frequency divided by the expected frequency
and multiply it by the observed frequency. We then add up the results for all four
cells and multiply the result by two. Note that if the observed frequency of a
given cell is zero, the expression Oi/Ei will, of course, also be zero. Since the
logarithm of zero is undefined, this would result in an error in the calculation.
Thus, log(0) is simply defined as zero when applying the formula in (9.2b).29
Applying the formula to the data in Table 9-5b, we get the following:
(9.2b) G² = 2 · (9 · loge(9/0.1846) + 870 · loge(870/878.8154)
             + 235 · loge(235/243.8154) + 1160704 · loge(1160704/1160695.1846))
         = 2 · (34.9809 + (-8.771) + (-8.6541) + 8.8154) = 52.7424
The log-likelihood test statistic has long been known to be more reliable than the
chi-square test when dealing with small samples and small expected frequencies
(Read and Cressie 1988: 134). This led Dunning (1993) to propose it as an
association measure specifically to avoid the overestimation of rare events that
plagues the chi-square test, mutual information and other measures.
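The calculation in (9.2a/9.2b) can be packaged as a small function; a minimal Python sketch (function name mine), with the log(0) convention mentioned above built in by skipping zero cells:

```python
from math import log

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio G2 = 2 * sum(O * ln(O/E)); cells with O = 0 contribute 0."""
    n = o11 + o12 + o21 + o22
    observed = (o11, o12, o21, o22)
    # expected frequency of each cell: row total times column total, over n
    expected = ((o11 + o12) * (o11 + o21) / n,
                (o11 + o12) * (o12 + o22) / n,
                (o21 + o22) * (o11 + o21) / n,
                (o21 + o22) * (o12 + o22) / n)
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# Table 9-5b: good example in the LOB corpus
g2 = g_squared(9, 870, 235, 1160704)
```

This reproduces the 52.7424 of (9.2b), and applied to the rocking horse frequencies from Table 9-6 below it yields the 549.811 reported in Table 9-7.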
(9.3a) MS = min (O11 / R1, O11 / C1)
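Formula (9.3a) is simple enough to compute directly; a minimal Python sketch (function name mine), applied to Table 9-5b, where O11 = 9, R1 = 879 and C1 = 244:

```python
def minimum_sensitivity(o11, r1, c1):
    """MS = min(O11/R1, O11/C1): the smaller of the two conditional proportions."""
    return min(o11 / r1, o11 / c1)

# good example: O11 = 9, R1 = 879 (row total for good), C1 = 244 (column total for example)
ms = minimum_sensitivity(9, 879, 244)
```

Here the smaller proportion is 9/879, i.e. the proportion of good tokens followed by example.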
computationally expensive: it cannot be calculated manually, except for very
small tables, because it involves computing factorials, which become very large
very quickly. For completeness' sake, here is (one version of) the formula:
(9.4) pFisher = (R1! · R2! · C1! · C2!) / (O11! · O12! · O21! · O22! · N!)
Obviously, it is not feasible to apply this formula directly to the data in Table 9-5b,
because we cannot realistically calculate the factorials for 235 or 870, let alone
1,160,704. But if we could, we would find that the p-value for Table 9-5b is
0.0000000000004841512.
Spreadsheet applications do not offer Fisher's exact test, but all major statistics
applications implement it. However, typically, the exact p-value is not reported
beyond the limit of a certain number of decimal places. This means that there is
often no way of ranking the most strongly associated collocates, because their p-values
are smaller than this limit. For example, there are 125 collocates in the
LOB corpus with a Fisher's exact p-value that is smaller than the smallest value
To conclude, let us compare how the association measures perform using a data
set of 20 potential collocations. Inspired by Firth's silly ass, they are all
combinations of adjectives with equine animals. Table 9-6 shows the
combinations and their frequencies in the BNC (sorted by their raw frequency of
occurrence).
Table 9-6: Some collocates of the form [ADJ Nequine] (data from the BNC)

WORD A      WORD B   A WITH B   A WITHOUT B   B WITHOUT A   NEITHER A NOR B
                     (O11)      (O12)         (O21)         (O22)
rocking     horse    35         423           1083          96985166
Trojan      horse    21         96            1097          96985493
new         horse    13         124021        1105          96861568
silly       ass      8          2436          294           96983969
galloping   horse    8          199           1110          96985390
prancing    horse    8          71            1110          96985518
pompous     ass      4          251           298           96986154
old         donkey   3          52645         428           96933631
common      zebra    3          19839         170           96966695
old         mule     3          52645         27            96934032
old         ass      2          52646         300           96933759
braying     donkey   2          29            429           96986247
young       zebra    1          32292         172           96954242
large       mule     1          34230         29            96952447
female      hinny    1          1451          7             96985249
extinct     quagga   1          421           4             96986281
I have selected them in such a way that they differ with respect to their status as
potential collocations (in the sense of typical combinations of words). Some are
compounds or compound-like combinations (rocking horse, Trojan horse, and, in
specialist discourse, common zebra). Some are the kind of semi-idiomatic
combinations that Firth had in mind (silly ass, pompous ass). Some are very
conventional combinations of nouns with an adjective denoting a property
specific to that noun (prancing horse, braying donkey, galloping horse, the first of
these being a conventional way of referring to the Ferrari brand mark logo). Some
only give the appearance of semi-idiomatic combinations (jumped-up jackass,
actually an unconventional variant of jumped-up jack-in-office; dumb-fuck donkey,
actually an extremely rare phrase that occurs in the book Trail of the Octopus:
From Beirut to Lockerbie - Inside the DIA and which sounds like an idiom because
of the alliteration and the semantic relationship to silly ass; and monocled ass,
which brings to mind pompous ass but is actually not a very conventional
combination). Finally, there are a number of fully compositional combinations
that make sense but do not have any special status (caparisoned mule, new horse,
old donkey, young zebra, large mule, female hinny, extinct quagga).
In addition, I have selected them to represent different types of frequency
relations: some of them are (relatively) frequent, some of them very rare; for some
of them either the adjective or the noun is generally quite frequent, and for
some of them neither of the two is frequent.
Table 9-7 shows the ranking of these twenty collocations by the five
association measures discussed above. Simplifying somewhat, a good association
measure should rank the conventionalized combinations highest (rocking horse,
Trojan horse, silly ass, pompous ass, prancing horse, braying donkey, galloping
horse), the distinctive-sounding but non-conventionalized combinations
somewhere in the middle (jumped-up jackass, dumb-fuck donkey, old ass, monocled
ass) and the compositional combinations lowest (caparisoned mule, new horse, old
donkey, young zebra, large mule, female hinny, extinct quagga). Common zebra is
difficult to predict: it is a conventionalized expression, but not in the general
language.
All of the association measures fare quite well, generally speaking, with
respect to the compositional expressions: these tend to occur in the lower third
of all lists. Where there are exceptions, the chi-square statistic, mutual
information and minimum sensitivity rank rare cases higher than they should
(e.g. caparisoned mule, extinct quagga, female hinny), while the log-likelihood test
statistic and the p-value of Fisher's exact test rank frequent cases higher (new
horse).
Table 9-7: Comparison of selected association measures for collocates of the form [ADJ Nequine] (data from the BNC)

CHI-SQ.                      MI                           MS                            G²                           FISHER'S P
jumped-up jackass 577300     jumped-up jackass 19.1390    jumped-up jackass 0.047619    rocking horse     549.811    rocking horse     2.90E-121
Trojan horse      326944     dumb-fuck donkey  17.7797    caparisoned mule  0.033333    Trojan horse      367.85     Trojan horse      1.28E-81
rocking horse     231962     caparisoned mule  17.3025    rocking horse     0.031306    prancing horse    130.19     prancing horse    7.93E-30
dumb-fuck donkey  225026     monocled ass      16.7079    Trojan horse      0.018784    galloping horse   114.25     galloping horse   2.21E-26
caparisoned mule  161643     extinct quagga    15.4883    pompous ass       0.013245    silly ass         95.58      silly ass         2.50E-22
monocled ass      107048     Trojan horse      13.9265    prancing horse    0.007156    pompous ass       60.32      pompous ass       1.58E-14
prancing horse    70263      braying donkey    13.8255    galloping horse   0.007156    braying donkey    34.47      braying donkey    4.22E-09
extinct quagga    45963      prancing horse    13.1008    braying donkey    0.004640    new horse         34.37      new horse         9.16E-09
braying donkey    29032      female hinny      13.0275    monocled ass      0.003311    old mule          25.64      old mule          6.42E-07
galloping horse   26806      rocking horse     12.6947    silly ass         0.003273    jumped-up jackass 24.7112    jumped-up jackass 1.73E-06
pompous ass       20143      pompous ass       12.2985    extinct quagga    0.002370    dumb-fuck donkey  24.6503    dumb-fuck donkey  4.44E-06
silly ass         8394       galloping horse   11.7111    dumb-fuck donkey  0.002320    caparisoned mule  22.0709    caparisoned mule  6.19E-06
female hinny      8348       silly ass         10.0379    female hinny      0.000689    monocled ass      21.5436    common zebra      7.07E-06
old mule          547        old mule          7.5253     common zebra      0.000151    common zebra      20.7614    monocled ass      9.34E-06
common zebra      248        large mule        6.5614     new horse         0.000105    extinct quagga    19.6885    extinct quagga    2.18E-05
new horse         94         common zebra      6.4053     old mule          0.000057    female hinny      16.1915    female hinny      1.20E-04
large mule        92         young zebra       4.1177     old donkey        0.000057    old donkey        9.7931     old donkey        1.78E-03
old donkey        33         old donkey        3.6806     old ass           0.000038    large mule        7.1502     large mule        1.05E-02
old ass           21         old ass           3.6088     young zebra       0.000031    old ass           6.3448     old ass           1.20E-02
young zebra       15         new horse         3.1846     large mule        0.000029    young zebra       3.8288     young zebra       5.60E-02
conventionalized cases in the middle.
To demonstrate the problems that very rare events can cause (especially those
where both the combination and each of the two words in isolation are very rare),
imagine someone had used the phrase tomfool onager once in the BNC: neither
the adjective tomfool (a synonym of silly) nor the noun onager (the name of the
donkey sub-genus Equus hemionus, also known as Asiatic or Asian wild ass)
otherwise occurs in the BNC, which would give us the distribution in Table 9-8.
Table 9-8: Fictive occurrence of tomfool onager in the BNC

                      POSITION 2
                ONAGER            OTHER WORDS           Total
POSITION 1
TOMFOOL         Obs: 1            Obs: 0                1
OTHER           Obs: 0            Obs: 96986706         96986706
WORDS
Total           1                 96986706              96986707
Applying the formulas discussed above to this table will give us a chi-square
value of 96,986,707, an MI value of 26.53 and a minimum sensitivity value of 1,
placing this (hypothetical) one-off combination at the top of the respective
rankings by a wide margin. Again, the log-likelihood test statistic and Fisher's
exact test are much better, relegating it to the seventh place (G² = 38.78) and ninth
place (p = 1.03E-08) respectively.
Although the example is hypothetical, the problem is not. It uncovers a
mathematical weakness of many commonly used association measures. From an
empirical perspective, this would not necessarily be a problem if cases like that
in Table 9-8 were rare in linguistic corpora. However, they are not. The LOB
corpus, for example, contains almost one thousand such cases, including some
To sum up, when doing collocational research, we should use the best
association measures available. For the time being, this is the p-value of Fisher's
exact test (if we have the means to calculate it), or the log-likelihood test statistic
G² (if we don't, or if we prefer using a widely-accepted association measure). We
will use G² through much of the remainder of this book whenever dealing with
collocations or collocation-like phenomena.
cases, where both variables consist of (some part of) the lexicon and the values
are individual words.
of individual words may partly be due to the fact that words tend to be too
idiosyncratic in their behavior to make their study theoretically attractive.
However, this idiosyncrasy itself is, of course, theoretically interesting, and so
such studies hold an unrealized potential at least for areas like lexical semantics.
idiosyncratic, Kennedy identifies the adjectival and verbal collocates of 24
frequent degree adverbs in the BNC, extracting all words occurring in a span of
two words to their left or right, and using mutual information to determine which
of them are associated with each degree adverb.
Thus, Kennedy's study adopts an exploratory perspective (as is typical for this
kind of study). It involves two nominal variables: DEGREE ADVERB (with 24 values
corresponding to the 24 specific adverbs he selects) and VERB OR ADJECTIVE (with as
many different potential values as there are different verbs and adjectives in the
BNC; in exploratory studies, it is often the case that we do not know the exact
values of at least one of the two variables in advance). Which of the two variables is
the dependent one and which the independent is always difficult to say.
Statistically, it does not matter, since our statistical tests for nominal data do not
distinguish between dependent and independent variables. Conceptually, our
interpretation may depend on which variable we treat as independent. Let us
assume that when planning an utterance, we first choose a verb or adjective
which is associated with words denoting sensory perception). Finally, there are
morphological restrictions (some adverbs are used frequently with words derived
by particular suffixes, for example, perfectly, which is frequently found with
words derived by -able/-ible, or totally, whose collocates often contain the prefix
un-). Table 9-9 illustrates these findings for 5 of the 24 degree adverbs and their
top 15 collocates (note that these findings are not precisely replicable using the
version of the BNC available via the OTA).
Table 9-9: Selected degree adverbs and their collocates (Kennedy 2003)

INCREDIBLY          PERFECTLY               TOTALLY                   COMPLETELY           BADLY
sexy         4.70   contestable      7.56   unsuited           6.23   refitted      6.09   mauled       6.23
naïve        4.53   proportioned     6.75   unprepared         5.89   inelastic     5.85   sprained     5.85
handsome     3.75   manicured        6.12   illegible          5.65   outclassed    5.69   bruised      5.68
boring       3.73   spherical        5.84   unsuitable         5.61   redesigned    5.51   corroded     5.48
brave        3.67   groomed          5.68   impractical        5.60   refurbished   5.42   damaged      5.34
exciting     3.65   timed            5.40   uncharacteristic   5.58   overhauled    5.35   behaved      5.19
lucky        3.55   understandable   5.29   illogical          5.57   eradicated    5.33   decomposed   5.05
stupid       3.55   symmetrical      5.15   unacceptable       5.55   disorientated 5.27   unstuck      5.05
efficient    3.47   acceptable       5.07   unconnected        5.51   renovated     5.18   shaken       4.79
clever       3.42   complemented     4.97   devoid             5.49   mystified     5.16   wounded      4.63
complicated  3.34   intelligible     4.93   unintelligible     5.38   sequenced     5.08   injured      4.59
dangerous    3.21   feasible         4.88   symmetric          5.35   gutted        4.91   ventilated   4.58
thin         3.20   balanced         4.87   unfounded          5.34   revamped      4.86   charred      4.57
fast         3.12   legitimate       4.74   untrue             5.33   uninterested  4.76   mutilated    4.55
beautiful    2.99   elastic          4.71   unmoved            5.27   untrue        4.76   hurt         4.48
the study of lexical relations, most notably (near-)synonymy (Taylor 2001, cf.
below), but also polysemy (e.g. Yarowsky 1993, investigating the idea that
associations exist not between words but between particular senses of words) and
antonymy (Justeson and Katz 1991, see below).
9.2.2.1 Case study: Near synonyms. Natural languages typically contain pairs (or
larger sets) of words with very similar meanings, such as big and large, begin and
start or high and tall. In isolation, it is often difficult to tell what the difference in
meaning is, especially since they are often interchangeable at least in some
contexts. Obviously, the distribution of such pairs or sets with respect to other
words in a corpus can provide insights into their similarities and differences.
Table 9-10: Objects described as tall or high in the LOB corpus (Taylor 2001)

CATEGORY                TALL           HIGH           TOTAL
Humans                  Obs: 45        Obs: 2         47
                        Exp: 22.91     Exp: 24.09
                        χ²: 21.31      χ²: 20.26
Animals                 Obs: 0         Obs: 1         1
                        Exp: 0.49      Exp: 0.51
                        χ²: 0.49       χ²: 0.46
Plants, trees           Obs: 7         Obs: 3         10
                        Exp: 4.87      Exp: 5.13
                        χ²: 0.93       χ²: 0.88
Buildings               Obs: 3         Obs: 10        13
                        Exp: 6.34      Exp: 6.66
                        χ²: 1.76       χ²: 1.67
Walls, fences, etc.     Obs: 0         Obs: 5         5
                        Exp: 2.44      Exp: 2.56
                        χ²: 2.44       χ²: 2.32
Towers, statues,        Obs: 0         Obs: 7         7
pillars, sticks         Exp: 3.41      Exp: 3.59
                        χ²: 3.41       χ²: 3.24
Articles of clothing    Obs: 0         Obs: 7         7
                        Exp: 3.41      Exp: 3.59
                        χ²: 3.41       χ²: 3.24
One example of such a study is Taylor (2001), which investigates the synonym
pair high and tall by identifying all instances of the two words in their subsense
'large vertical extent' in the LOB corpus and categorizing the words they modify
into eleven semantic categories. These categories are based on semantic
distinctions such as human vs. inanimate, buildings vs. other artifacts vs. natural
entities etc., which are expected a priori to play a role.
individually are those for the category HUMAN, which show that tall is preferred
and high avoided with human referents. The sparse data in the table are due, first,
to the fact that the sample is too small, but second, to the fact that there are too
many categories. The category labels are not well chosen either: they overlap
substantially in several places (e.g., towers and walls are buildings, pieces of
clothing are artifacts, etc.) and not all of them seem relevant to any expectation
we might have about the words high and tall.
Taylor later cites earlier psycholinguistic research indicating that tall is used
when the vertical dimension is prominent, is an acquired property and is a
property of an individuated entity. It would thus have been better to categorize
the corpus data according to these properties; in other words, a more strictly
deductive approach would have been more promising given the small data set.
Alternatively, Taylor could have taken an exploratory approach and looked for
differential collocates as described in Section 9.1.1 above. This would have
involved constructing a table like Table 9-11a for every noun occurring with the
adjectives tall or high (in a larger corpus, such as the BNC) and then using an
association measure to identify words more strongly attracted to one or the other.
Table 9-11a: Tall and high as differential collocates for window (in the BNC)

                      POSITION 1
                TALL              HIGH              Total
POSITION 2
WINDOW          Obs: 34           Obs: 41           75
                Exp: 3.12         Exp: 71.88
OTHER           Obs: 1686         Obs: 39633        41319
WORDS           Exp: 1716.88      Exp: 39602.12
Total           1720              39674             41394
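To sketch how such a table translates into an association score, the following minimal Python function (the name is mine) applies the log-likelihood statistic recommended earlier in this chapter to the frequencies in Table 9-11a:

```python
from math import log

def g_squared(o11, o12, o21, o22):
    """Log-likelihood ratio for a 2-by-2 table; cells with an observed 0 contribute nothing."""
    n = o11 + o12 + o21 + o22
    observed = (o11, o12, o21, o22)
    expected = ((o11 + o12) * (o11 + o21) / n,
                (o11 + o12) * (o12 + o22) / n,
                (o21 + o22) * (o11 + o21) / n,
                (o21 + o22) * (o12 + o22) / n)
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# Table 9-11a: window with tall (obs. 34, exp. 3.12) vs. with high (obs. 41, exp. 71.88)
g2_window = g_squared(34, 41, 1686, 39633)
```

Since window co-occurs with tall far more often than expected, it comes out strongly associated with tall; repeating the calculation for every noun and sorting by the score yields a ranking like that in Table 9-11b.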
Once we have calculated the association strengths of all nouns in this way, we
can sort the collocates first, according to which of the two words they occur with
more frequently, and second, according to their association strength. Table 9-11b
shows the top 15 differential collocates of the two words in the BNC.
Table 9-11b: Differential collocates for tall and high in the BNC

COLLOCATE    CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS   G²
             WITH TALL       WITH HIGH       WITH TALL     WITH HIGH

MOST STRONGLY ASSOCIATED WITH TALL
man          182             3               1538          39671         1146.54
building     82              26              1638          39648         408.35
tree         73              26              1647          39648         355.52
boy          40              0               1680          39674         255.36
glass        39              2               1681          39672         233.14
woman        38              3               1682          39671         221.34
ship         33              0               1687          39674         210.54
girl         32              0               1688          39674         204.15
figure       62              93              1658          39581         195.58
chimney      28              8               1692          39666         141.09
order        62              176             1658          39498         138.01
grass        24              5               1696          39669         126.76
The results for tall clearly support Taylor's ideas about the salience of the vertical
dimension. The results for high show something Taylor could not have found,
since he restricted his analysis to the subsense 'vertical dimension': when
compared with tall, high is most strongly associated with quantities or positions
in hierarchies and rankings. There are no spatial uses at all among its top
differential collocates. This does not answer the question of why we can use it
spatially and in competition with tall, but it shows what general sense we would
have to assume: one concerned not with the vertical extent as such, but with the
magnitude of that extent (which, incidentally, Taylor notes in his conclusion).
This case study shows how the same question can be approached by a
deductive or an inductive (exploratory) approach. The deductive approach can be
more precise, but this depends on the appropriateness of the categories chosen a
priori for coding the data; it is also time-consuming and therefore limited to
relatively small data sets. In contrast, the inductive approach can be applied to a
large data set because it requires no a priori coding. It also does not require any
choices concerning coding categories; however, there may be a danger of
projecting patterns into the data post hoc.
AF
9.2.2.2 Case study: Antonymy. At first glance, we expect the relationship between
antonyms to be a paradigmatic one, where only one or the other will occur in a
given utterance. However, Charles and Miller (1989) suggest, based on the results
of sorting tasks and on theoretical considerations, that, on the contrary, antonym
pairs are frequently found in syntagmatic relationships, occurring together in the
syntactic behavior).
There are differences in detail in these studies, but broadly speaking, they take
a deductive approach: They choose a set of test words for which there is
agreement as to what their antonyms are, search for these words in a corpus, and
check whether their antonyms occur in the same sentence significantly more
frequently than expected. The studies thus involve two nominal variables:
SENTENCE (with the values CONTAINS TEST WORD and DOES NOT CONTAIN TEST WORD) and
ANTONYM OF TEST WORD (with the values OCCURS IN SENTENCE and DOES NOT OCCUR IN
SENTENCE). This seems like an unnecessarily complicated way of representing the
type of co-occurrence design used in the examples above, but I have chosen it to
show that in this case sentences containing a particular word are used as the
condition under which the occurrence of another word is investigated: a
straightforward application of the general research design that defines
quantitative corpus linguistics. Table 9-12 demonstrates the design using the
adjectives good and bad.
Table 9-12. Sentential co-occurrence of good and bad in the BROWN corpus
(based on Justeson and Katz 1991: 5).

                             GOOD
               OCCURS     DOES NOT OCCUR     Total

Justeson and Katz (1991) apply this procedure to 36 adjectives and get significant results for 25 of
them (19 of which remain significant after a Bonferroni correction for multiple
tests). They also report that in a larger corpus, the frequency of co-occurrence for
all adjective pairs is significantly higher than expected (but do not give any
figures). Fellbaum (1995) uses a very similar procedure with words from other
taxonomy, etc.). Thus, there is no way of telling whether co-occurrence within the
same sentence is something that is typical specifically of antonyms, or whether it
is something that characterizes word pairs in other lexical relations, too.
An obvious approach to testing this would be to repeat the study with other
types of lexical relations. Alternatively, we can take an exploratory approach that
does not start out from specific word pairs at all. Justeson and Katz (1991)
investigate the specific grammatical contexts in which antonyms tend to co-occur,
identifying, among others, coordination of the type [ADJ and ADJ] or [ADJ or
ADJ]. We can use these specific contexts to determine the role of co-occurrence
for different types of lexical relations by simply extracting all word pairs
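Extracting such pairs is straightforward once the corpus is available in a tagged plain-text form. The sketch below assumes a simplified word_TAG encoding with CLAWS5-style tags (AJ0 = adjective, CJC = coordinating conjunction); the actual BNC is distributed as XML, so the tokenization step would differ:

```python
import re
from collections import Counter

# Toy input in an assumed word_TAG format (for illustration only)
tagged = ("good_AJ0 and_CJC bad_AJ0 weather_NN1 , "
          "black_AJ0 or_CJC white_AJ0 , good_AJ0 and_CJC bad_AJ0")

# Count every [ADJ and/or ADJ] sequence, keeping the two adjectives as a pair
pairs = Counter(
    (adj1, adj2)
    for adj1, _, adj2 in re.findall(
        r"(\S+)_AJ0\s+(and|or)_CJC\s+(\S+)_AJ0", tagged)
)
print(pairs.most_common(2))  # [(('good', 'bad'), 2), (('black', 'white'), 1)]
```

Each pair type can then be cross-tabulated against all other first- and second-slot fillers, as in Table 9-13a.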
Table 9-13a. Co-occurrence of good and bad in the first and second slot of [ADJ +
and/or + ADJ] sequences.

                              SECOND ADJECTIVE
                       BAD             OTHER WORDS      Total
FIRST ADJ.
  GOOD            Obs: 304         Obs:   542             846
                  Exp: 17.77       Exp: 828.23
  OTHER           Obs:  44         Obs: 15674           15718
  WORDS           Exp: 330.23      Exp: 15387.77
  Total                348             16216            16564
Note that this is a slightly different procedure from what we have seen before:
instead of comparing the frequency of co-occurrence of two words with their
individual occurrence in the rest of the corpus, we are comparing it to their
occurrence with other words in the two slots of the pattern.
Clearly, antonymy is the dominant relation among these word pairs, which are
mostly opposites (black/white, male/female, public/private, etc.), and sometimes
relational antonyms (primary/secondary, economic/social, economic/political,
Table 9-13b. Co-occurrence of adjectives in the first and second slot of [ADJ
and/or ADJ] sequences (BNC).

WORD A      WORD B        A&B     A&!B     !A&B     !A&!B        G2
black       white        1015      608       722     154797    7759.74
male        female        545       27        31     157479    6809.46
economic    social       1065     1326      1412     153239    6112.06
public      private       448      171       186     157471    4655.26
social      economic      779     1848       928     154059    4301.73
primary     secondary     295       69        34     158184    3726.64
deaf        dumb          276       46         8     158290    3721.86
good        bad           304      542        44     157674    3042.80
internal    external      223       41        27     158435    2975.78
positive    negative      258      184        57     158157    2931.04
hon.        learned       232       92       119     158265    2656.65
lesbian     gay           185        8        26     158583    2645.03
greater     lesser        189       50        14     158541    2576.04
political   economic      492     1323      1215     155158    2512.32
left        right         183       41        41     158541    2415.68
physical    mental        245      565        66     157806    2347.59
This case study shows how deductive and inductive approaches may complement
each other: while the deductive studies cited show that antonyms tend to co-
occur syntagmatically, the inductive study presented here shows that words that
co-occur syntagmatically (at least in certain syntactic contexts) tend to be
antonyms. These two findings are not equivalent; the second finding shows that
the first finding may indeed be typical for antonymy as opposed to other lexical
relations. Of course, the exploratory study was limited to a particular syntactic
context, which leaves open the question whether there are other types of contexts
that would allow the identification of word pairs connected by other types of
lexical relations.
9.2.3 Grammaticalization

While lexical semantics (including lexicography) is one of the most typical areas
of study in collocation research, it is not the only one. A very different, but also
highly interesting area is language change. Language usage and the (conditional)
frequency of linguistic items is clearly relevant, for example, to phenomena like
grammaticalization and lexicalization (corpus linguists who are interested in this
area of research might start by looking at the contributions in Bybee and Hopper
2001, Lindquist and Mair 2004, and Bybee 2007).
modals with respect to the magnitude of this preference. For example, more than
99 percent of all occurrences of the sequence [do NEG] are realized as don't in the
spoken demographic part of the BNC, but only 92.6 percent of all occurrences of
[are NEG] are realized as aren't. For some modals, the clitic is not the preferred
form at all: for example, [ought NEG] is only realized as oughtn't in 40 percent of
all cases, and [may NEG] is realized as mayn't in a mere 1.5 percent.
Krug (2003) tests the hypothesis that this is related to the overall co-
occurrence of the respective form of a modal with a negative particle, such that
the more frequently a modal is negated overall, the more likely it is to be negated
by a clitic rather than the full form of the negative particle. More specifically, he
is interested in whether it is the raw frequency with which a modal is negated
that allows us to predict the percentage of cliticized cases, or whether it is the
probability with which a given modal is negated.
The necessary numbers for all three versions of the hypothesis can be derived
from a contingency table like that in Table 9-14 (where NEGATIVE combines all cases
of not and n't).
Table 9-14. Collocation of the modal word form does and the negative morpheme.

                            POSITION 2
                     NEGATIVE        OTHER WORDS     Total
POSITION 1
  DOES            Obs: 3502       Obs: 4299           7801
                  Exp: 159.27     Exp: 7,641.73
As long as we are taking all our data from the same corpus, it does not make a
difference. The probability that does occurs as a negated form is simply the
percentage of negated cases of does out of all cases of does, i.e. 3,502/7,801 =
0.4489. Finally, the G2 statistic for this table is 16,812.67 (calculated according to
the formula in (9.2b) above).
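Both numbers can be checked directly from the four cells of the table (the two cells missing from Table 9-14 as printed are taken from the does row of Table 9-15; the helper function is my own sketch of the standard log-likelihood computation):

```python
import math

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    total = 0.0
    for obs, row, col in [
        (o11, o11 + o12, o11 + o21),
        (o12, o11 + o12, o12 + o22),
        (o21, o21 + o22, o11 + o21),
        (o22, o21 + o22, o12 + o22),
    ]:
        if obs > 0:
            total += obs * math.log(obs / (row * col / n))
    return 2 * total

does_neg, does_other = 3502, 4299          # does + NEG, does + other word
other_neg, other_other = 98880, 4907974    # remaining bigrams (Table 9-15)

prob = does_neg / (does_neg + does_other)  # probability that does is negated
print(round(prob, 4))                      # 0.4489
print(round(g2(does_neg, does_other, other_neg, other_other), 2))
```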
Once we have derived these three measures for each modal, we have to
determine, again for each modal, the percentage of negated cases where the
negative morpheme occurs as a clitic. Table 9-15 provides the relevant
information for all [MODAL NEG] combinations in the demographic part of the
BNC (note that Krug (2003) is based on the entire spoken part of the BNC, about
10 million words, so his numbers differ from the ones reported here).
Table 9-15. Modals, their likelihood of negation and degree of cliticization of the
negative morpheme.

                                                 CLITICIZATION    FREQ.   PROBABILITY     LOG-LIKELIHOOD
MODAL     WITH    WITHOUT   OTHER     OTHER      RANK  PERCENT    RANK    RANK  PERCENT   RANK  G2
          NEG     NEG       NEG       BIGRAMS

has       1186     3437     101196    4908836     5    0.9890      9      12    0.2565    10      4119.39
does      3502     4299      98880    4907974     6    0.9889      2       5    0.4489     5     16812.67
should     795     3601     101587    4908672     7    0.9887     10      14    0.1808    14      2185.13
could     2314     5761     100068    4906512     8    0.9857      8       9    0.2866     7      8618.97
will      3953     5487      98429    4906786     9    0.9836      4       4    0.4188     4     18298.59
can       8234    15494      94148    4896779    10    0.9832      5       3    0.3470     3     34703.48
need        58     2981     102324    4909292    11    0.9828     22      23    0.0191    23         0.28
had        700    12258     101682    4900015    12    0.9814     20      15    0.0540    16       508.35
were      1763     8816     100619    4903457    13    0.9722     11      10    0.1667    11      4576.41
was       2734    32039      99648    4880234    14    0.9674     16      13    0.0786    13      3488.71
would     3359     6635      99023    4905638    15    0.9589      6       6    0.3361     6     13756.03
In his study, Krug concludes that raw frequency is the best predictor of
cliticization, but he also notes that frequency and probability do not differ by
much. For the subset of the data reported in Table 9-15, we can say that all three
measures of collocation correlate with the degree of cliticization significantly,
with G2 performing best (r = 0.61, p < 0.01), followed closely by probability (r =
0.59, p < 0.01) and frequency (r = 0.57, p < 0.01) (calculated using Spearman's rank
correlation).
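Spearman's rank correlation is easy to compute without a statistics library. The sketch below correlates the cliticization percentages with the G2 values for the eleven modals that survive in Table 9-15 as reproduced here; since the full data set contains more modals (the printed ranks run up to 23), the value obtained on this fragment necessarily differs from the r = 0.61 reported in the text. The simple d² formula used here assumes there are no ties:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation via the d^2 formula (assumes no ties)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

clit = [0.9890, 0.9889, 0.9887, 0.9857, 0.9836, 0.9832,
        0.9828, 0.9814, 0.9722, 0.9674, 0.9589]        # cliticization %
g2 = [4119.39, 16812.67, 2185.13, 8618.97, 18298.59, 34703.48,
      0.28, 508.35, 4576.41, 3488.71, 13756.03]        # G2, same row order
print(round(spearman_rho(clit, g2), 3))
```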
This case study shows that collocation (more specifically, association strength)
may itself be used as a variable, rather than showing the relationship between the
of denotational properties of the node word. Louw (1993: 157) refers to this
phenomenon as semantic prosody, defined, somewhat impressionistically, as the
'consistent aura of meaning with which a form is imbued by its collocates'.
This definition has been understood by collocation researchers in two different
(but related) ways. Much of the subsequent research on semantic prosody is
based on the understanding that this 'aura' consists of connotational meaning
(cf. e.g. Partington 1998: 68), so that words can have a positive, neutral or
negative semantic prosody. However, Sinclair, who according to Louw invented
the term,30 seems to have in mind attitudinal or pragmatic meanings that are
much more specific than 'positive', 'neutral' or 'negative'. There are insightful
terminological discussions concerning this issue (cf. e.g. Hunston 2007), but since
the term is widely used in (at least) these two different ways, and since positive
and negative connotations are very general types of attitudinal meaning, it
seems more realistic to accept a certain vagueness of the term. If necessary, we
could differentiate between the general semantic prosody of a word (its positive
or negative connotation as reflected in its collocates) and its specific semantic
prosody (the word-specific attitudinal meaning reflected in its collocates).
30 Louw attributes the term to John Sinclair, but Louw (1993) is the earliest appearance of
the term in writing. However, Sinclair is clearly the first to discuss the phenomenon
itself systematically, without giving it a label (e.g. Sinclair 1991: 74-75).
informally.
Table 9-16. Concordance of [true feelings] in the BNC (sample)
12 I only taught him to keep his [true feelings] from me . But he remained
13 othing stand in the way of his [true feelings] . He felt humiliated apar
14 between false surface and his [true feelings] , and in its affected , force
15 be reluctant to offer their [true feelings] or observations . The Bir
16 d in the art of disguising his [true feelings] . Let him not be frightened o
17 n something of these mothers ' [true feelings] from the women who suffered t
18 o whom he could reveal his new [true feelings] . As for his own talk tha
19 that she had been denying her [true feelings] for years . She had loved Con
20 controlled voice disguised his [true feelings] , but Christina sensed his je
21 helpful to show each side the [true feelings] of the other , the need to ac
22 ys felt quite at odds with her [true feelings] . In fact , there were lots o
23 that he had first realised his [true feelings] for her . Then that he had fi
24 nd , but you like to hide your [true feelings] . Oh , do n't be so s
25 as n't actually dealt with the [true feelings] that he had towards his fathe
26 away so little . What were his [true feelings] , his intentions ? All I
34 which she hoped disguised her [true feelings] , and carried on inking in co
35 he humiliation of exposing her [true feelings] . That 's not how it is
36 sly , turning blindly from her [true feelings] to hate him more . What m
Sinclair notes three things: first, the phrase is almost always part of a possessive
(realized by pronoun, possessive noun phrase or of-construction). This is also true
of our sample, with the exception of lines 25 (where there is a possessive relation,
but it is realized by the verb have) and 30 (where the chosen span does not
contain a possessive). Second, the expression collocates with verbs of expression
(perhaps unsurprising for an expression relating to emotions); this, too, is true for
our sample. Third, and crucially in our context, the majority of the examples
express a reluctance to express emotions: in many cases this reluctance is
communicated as part of the verb (the most frequent verbs are disguise, conceal
and hide), in other cases it is communicated separately (e.g. line 15, reluctant to
offer; line 29, at last gave vent (suggesting a period of holding back); line 35,
humiliation of exposing). Sinclair ascribes the denotational meaning 'genuine
emotions' to the phrase; the sense of reluctance to express them is the semantic
prosody of the phrase (Sinclair 2004: 36), an attitudinal meaning much more
specific than a general positive or negative connotation.
Sinclair's methodological approach is quantitative only in a very informal sense:
he rarely reports exact frequencies for a given semantic feature in his sample. It
is an interesting question how to quantify his approach more strictly. Of course,
we could count and report the frequency of a given prosody with a lexical item or
phrase, but since we do not know, and have no plausible way of determining,
how frequent this prosody is in the corpus in general, we cannot tell whether that
frequency is higher or lower than expected.
One way around this problem could be to use denotationally synonymous
phrases for comparison: phrases like genuine emotion(s) (which Sinclair uses to
paraphrase the denotational meaning), genuine feeling(s), real emotion(s) and real
feeling(s). In a sample of these phrases drawn from the BNC in the same way as
that in Table 9-16 (see Appendix D.2), only 8 out of 65 examples had a context of
reluctance or inability to express genuine emotions; in Table 9-16, I would argue
that 19 of the 36 examples do so (namely those in lines 1, 2, 8, 9, 10, 11, 12, 14, 15,
16, 19, 20, 22, 24, 28, 29, 34, 35, and 36). We can now compare the phrase true
feelings to the other expressions in the usual way (cf. Table 9-17).
Table 9-17. The meaning 'reluctance to express' with the phrase true feelings and
its synonyms.

                          TRUE FEELINGS    SYNONYMS OF       Total
                                           TRUE FEELINGS
CONTEXT: RELUCTANCE
  YES                     Obs: 19          Obs:  8              27
                          Exp: 9.92        Exp: 17.08
  NO                      Obs: 17          Obs: 53              70
                          Exp: 26.08       Exp: 44.92
  Total                       36               65               97
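Whether the difference is significant can be checked with the same G2 statistic used throughout this chapter. In the sketch below (the function is my own), the expected frequencies are recomputed from the marginals of the observed counts, so they may deviate slightly from the printed ones:

```python
import math

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table."""
    n = o11 + o12 + o21 + o22
    total = 0.0
    for obs, row, col in [
        (o11, o11 + o12, o11 + o21),
        (o12, o11 + o12, o12 + o22),
        (o21, o21 + o22, o11 + o21),
        (o22, o21 + o22, o12 + o22),
    ]:
        if obs > 0:
            total += obs * math.log(obs / (row * col / n))
    return 2 * total

# reluctance-yes / reluctance-no for true feelings vs. its synonyms
value = g2(19, 8, 17, 53)
print(round(value, 2), value > 3.84)  # 3.84: critical chi-square value, df = 1, p < .05
```

The value comfortably exceeds the critical value, so the association between the phrase and the attitudinal meaning is unlikely to be due to chance.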
9.2.4.2 Case study: Cause. A second way in which semantic prosody can be studied
quantitatively is implicit in Kennedy's study of collocates of degree adverbs
discussed in Section 9.2.1 above. Recall that Kennedy discusses for each degree
adverb whether a majority of its collocates has a positive or a negative
connotation. This, of course, is a statement about the (broad) semantic prosody of
the respective adverb, based not on an inspection and categorization of usage
contexts, but on inductively discovered strongly associated collocates.
This method was first proposed by Stubbs (1995a), who studies, among other
things, the noun and verb cause. He first presents the result of a manual
extraction of all nouns (sometimes with adjectives qualifying them, as in the case
of wholesale slaughter) that occur as subject or object of the verb cause or as a
prepositional object of the noun cause in the LOB. He codes them in their context
of occurrence for their connotation, finding that approximately 80 percent are
negative, 18 percent are neutral and 2 percent are positive. This procedure is still
very close to the one suggested in the preceding case study.
He then notes that this manual extraction becomes unfeasible as the number
of corpus hits grows and discusses a procedure for identifying collocates that
involves a combination of association measures.31 As his focus is on
methodological issues, Stubbs does not present the results of this procedure
exhaustively and the corpus he uses is not available, so I have replicated his
procedure using the BNC. The results are shown in Table 9-18 (following Stubbs, I
have limited it to nouns).
31 Stubbs uses a combination of mutual information (cf. Chapter 10) and a statistic called
t, first proposed by Church and Hanks (1989) as an association measure. The latter is
not discussed in this book since it is not widely used and does not offer any
advantages over the more widely-used association measures. I have used it here in the
following form, suggested by Stubbs (1995a):

(i)  t = (O11 - R1C1/N) / √O11

Stubbs' procedure consists in calculating MI and t for all words occurring within a
span of three words to the right or left of the node word, and then discarding all
words that only occur once (his way of getting around the fact that MI is overly
sensitive to rare events), have MI values of less than 3 and t-values of less than 2. I
have replicated the method here as a methodological exercise; for actual collocation
research, I recommend, of course, using a better association measure in the first place.
pain            192     6793    29631    96226783    3.89    2.27
difficulties    188     6530    29635    96227046    3.91    2.27
confusion       161     2610    29662    96230966    4.97    3.04
distress        158     1282    29665    96232294    5.88    4.15
pollution       149     3907    29674    96229669    4.31    2.32
root            128     2016    29695    96231560    5.01    2.75
deaths          127     2232    29696    96231344    4.86    2.6
delay           117     3010    29706    96230566    4.33    2.07
disruption      115      754    29708    96232822    6.15    3.89
injuries        108     2399    29715    96231177    4.54    2.14
It is clear that all of the nominal collocates (which are very similar to the ones
Stubbs cites from his results) with the exception of root (from the compound root
cause) have negative connotations. Incidentally, this is also true of most adjectival
collocates identified by Stubbs' procedure in the BNC, which are serious, severe,
bodily, underlying, grievous, reckless and proximate.
While Stubbs' (1995a) study, like many collocational studies, looks at all words
occurring in a given span around a particular word, Stefanowitsch and Gries
(2003) present data for nominal collocates of the verb cause in the object position
of different subcategorization patterns. While their results confirm the negative
connotation of cause also found by Stubbs, their results add an interesting
dimension: while objects of cause in the transitive construction (cause a problem)
and the prepositional dative (cause a problem to someone) refer to negatively
perceived external and objective states, the objects of cause in the ditransitive
refer to negatively experienced internal and/or subjective states (see Table 9-19).
properties afterwards can be very effective, but note that it limits the perspective
to what I have called here broad semantic prosody; while it is possible, in many
cases, to ascribe a positive or negative connotation to words isolated from
their context of usage, this is not possible for the more complex, nuanced
attitudinal meanings relevant for the narrow semantic prosody of words.
language system, but also of the cultural conditions under which they were
produced. This allows corpus linguistic methods to be used in uncovering at least
some properties of that culture. Specifically, we can take lexical items to represent
culturally defined concepts and investigate their distribution in linguistic corpora
in order to uncover these cultural definitions. Of course, this adds complexity to
the question of operationalization: we must ensure that the words we choose are
indeed valid representatives of the cultural concept in question.
9.2.5.1 Case study: Small boys, little girls. Obviously, lexical items used
conventionally to refer to some culturally relevant group of people are plausible
representatives of the cultural concept of that group. For example, some very
general lexical items referring to people (or higher animals) exist in male and
female versions: man/woman, boy/girl, lad/lass, husband/wife, father/mother,
king/queen, etc. If such word pairs differ in their collocates, this could tell us
something about the cultural concepts behind them. For example, Stubbs (1995b)
cites a finding by Baker and Freebody (1989) that in children's literature, the
word girl collocates with little much more strongly than the word boy, and vice
versa for small. Stubbs shows that this is also true for balanced corpora (see Table
9-20; again, since Stubbs' corpora are not available, I show frequencies from the
BNC instead, but the proportions are within a few percentage points of his).
LITTLE
Exp: 927.53 Exp: 1011.47
232
Anatol Stefanowitsch
Stubbs argues that this difference is due to different connotations of small and
little, which he investigates on the basis of the noun collocates to their right and
the adjectival and adverbial collocates to their left. Again, instead of Stubbs'
original data (which he identifies on the basis of raw frequency of occurrence and
only cites selectively) I show data from the BNC, identified using the log-
likelihood test statistic. Table 9-21a shows the ten most strongly associated noun
collocates to the right of the node word and Table 9-21b shows the ten most
strongly associated adjectival collocates to the left.
Table 9-21a. Noun collocates of little and small at R1 (BNC)

COLLOCATE   CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS   G2
            WITH LITTLE     WITH SMALL      WITH LITTLE   WITH SMALL

MOST STRONGLY ASSOCIATED WITH LITTLE
bit              2838     40      33606      36388     3678.84
girl             1148     81      35296      36347     1122.07
doubt             546      5      35898      36423      710.68
time              595     23      35849      36405      664.49
while             435      0      36009      36428      605.46
evidence          324      0      36120      36428      450.46
attention         253      1      36191      36427      339.81
This part of the study is more inductive. Stubbs may have expectations about
what he will find, but he essentially identifies collocates exploratively and then
interprets the findings. The nominal collocates show, according to Stubbs, that
small tends to mean 'small in physical size' or 'low in quantity', while little is
more clearly restricted to quantities, including informal quantifying phrases like
little bit. This is generally true for the BNC data, too (note, however, the one
exception among the top ten collocates: girl).
The connotational difference between the two adjectives becomes clear when
we look at the adjectives they combine with. The word little has strong
associations to evaluative adjectives that may be positive or negative, and that are
often patronizing. Small, in contrast, does not collocate with evaluative adjectives.
Table 9-21b. Adjectival collocates of little and small at L1 (BNC)

COLLOCATE   CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS   G2
            WITH LITTLE     WITH SMALL      WITH LITTLE   WITH SMALL

MOST STRONGLY ASSOCIATED WITH LITTLE
nice              354      4       4643       1055      110.24
poor              246      0       4751       1059       96.75
pretty            119      0       4878       1059       46.25
tiny               95      0       4902       1059       36.84
nasty              59      0       4938       1059       22.80
funny              67      1       4930       1058       18.96
dear               47      0       4950       1059       18.15
Stubbs sums up his analysis by pointing out that small is a neutral word for
describing size, while little is sometimes used neutrally, but is more often
nonliteral and 'convey[s] connotative and attitudinal meanings', which are often
keep the results comparable with those reported above, let us stick with the BNC
instead. Table 9-21c shows the top ten adjectival collocates of boy and girl.
Table 9-21c. Adjectival collocates of boy and girl (BNC)

COLLOCATE   CO-OCCURRENCE   CO-OCCURRENCE   OTHER WORDS   OTHER WORDS   G2
            WITH BOY        WITH GIRL       WITH BOY      WITH GIRL

MOST STRONGLY ASSOCIATED WITH BOY
old               634     257      5385       7296      279.98
small             336      81      5683       7472      237.78
dear              126      45      5893       7508       61.30
there is one adjective signaling marital status. These results also generally reflect
Caldas-Coulthard and Moon's findings (in the yellow press, the evaluations are
often heavily sexualized in addition).
To put it mildly, then, the collocates of boy and girl confirm a general attitude
that the latter are up for constant evaluation while the former are mainly seen as
a neutral default. That the adjectives dead and unmarried are among the top ten
collocates in a representative, relatively balanced corpus hints at something
darker: a patriarchal world view that sees girls as victims and sexual partners
and not much else (other studies investigating gender stereotypes on the basis of
collocates of man and woman are Gesuato 2003 and Pearce 2008).
Further Reading

If you want to learn more about association measures, Evert (2005) and the
companion website (Evert 2004-2010) are very comprehensive and relatively
10. Grammar
The fact that corpora are most easily accessed via words (or word forms) is also
reflected in many corpus studies focusing on various aspects of grammatical
structure. Many such studies either take (sets of) words as a starting point for
studying various aspects of grammatical structure, or they take easily identifiable
aspects of grammatical structure as a starting point for studying the distribution
of words. However, as the case studies of the English possessive constructions in
Chapters 6 and 7 showed, grammatical structures can be (and are) also studied in
their own right, for example with respect to semantic, information-structural and
other restrictions they place on particular slots or sequences of slots, or with
respect to their distribution across texts, text types, demographic groups or varieties.
There are two major problems to be solved when searching corpora for
grammatical structures. We discussed both of them to some extent in Chapter 5,
but let us briefly recapitulate and elaborate some aspects of the discussion before
detail in Chapter 5). Again, this is simpler in the case of morphologically marked
and relatively simple grammatical structures: for example, the s-possessive (as
defined above) is typically characterized by the sequence NOUN 's ADJECTIVE*
NOUN in corpora containing texts in standard orthography; it can thus be
retrieved from a POS-tagged corpus with a fairly high degree of precision and
reliability. However, even this simple case is more complex than it seems: in the
sequence just given, 's may also stand for the verb be (Sam's head of marketing);
and of course, the modified nominal may not always be directly adjacent to the 's
(for example in this office is Sam's or in Sam's friends and family).
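For a corpus in a simplified one-line word_TAG format (an assumption for illustration; the BNC itself is XML with CLAWS5 tags), the sequence NOUN 's ADJECTIVE* NOUN can be approximated with a regular expression. Note that the POS tag on 's automatically excludes the cliticized verb be, which carries a verb tag:

```python
import re

# Noun tags start with N (NN1, NN2, NP0, ...), adjectives with AJ (AJ0, ...)
S_POSSESSIVE = re.compile(
    r"\S+_N\w+"           # possessor noun
    r"\s+'s_POS"          # genitive 's (the verb 's would be tagged VBZ)
    r"(?:\s+\S+_AJ\w*)*"  # any number of adjectives
    r"\s+\S+_N\w+"        # modified noun
)

tagged = "I_PNP saw_VVD Sam_NP0 's_POS new_AJ0 car_NN1 yesterday_AV0"
print(S_POSSESSIVE.findall(tagged))  # ["Sam_NP0 's_POS new_AJ0 car_NN1"]
```

As the main text notes, such a query still misses non-adjacent cases like this office is Sam's, which is why both precision and recall of a query need to be checked.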
Other structures may be difficult to retrieve even though they can be
characterized straightforwardly: most linguists would agree, for example, that
transitive verbs are verbs that take a direct object. However, this is of very little
help in retrieving transitive verbs even from a POS-tagged corpus, since many
noun phrases following a verb will not be direct objects (Sam slept the whole day)
and direct objects do not necessarily follow their verb; in addition, noun phrases
themselves are not trivial to retrieve.
Yet other structures may be easy to retrieve, but not without retrieving many
false hits at the same time. This is the case with ambiguous structures like the of-
possessive, which can be retrieved by a query along the lines of NOUN of ART?
ADJ* NOUN, which will also retrieve partitive and quantitative uses, etc.
structures, even very abstract ones, retrieving the relevant data by mind-numbing
and time-consuming manual analysis of the results of very broad searches or
even of the corpus itself, if necessary. But it is probably one reason why so much
grammatical research in corpus linguistics takes a word-centered approach.
A second reason is that it allows us to transfer well-established collocational
methods to the study of grammar. In the preceding chapter we saw that while
collocation research often takes a sequential approach to co-occurrence, counting
as potential collocates of a node word any word within a given span around it, it
is not uncommon to see a structure-sensitive approach that considers only those
potential collocates that occur in a particular grammatical position relative to
each other (for example, adjectives relative to the nouns they modify or vice
versa). In this approach, grammatical structure is already present in the design,
even though it remains in the background. We can move these types of
10.2.1 Collocational frameworks and grammar patterns

An early extension of collocation research to the association between words and
grammatical structure is Renouf and Sinclair's notion of collocational frameworks,
which they define as 'a discontinuous sequence of two words, positioned at one
word remove from each other', where the two words in question are always
function words (examples are [a(n) + X + of], [too + X + to] or [many + N + of]). They
are particularly interested in classes of items that fill the position of such
frameworks and see the fact that these items tend to be semantically coherent as
they call grammar patterns), are meaningful and that their meaning is manifest
in the collocates in central slots:

The patterns of a word can be defined as all the words and structures which
are regularly associated with the word and which contribute to its
meaning. A pattern can be identified if a combination of words occurs
relatively frequently, if it is dependent on a particular word choice, and if
there is a clear meaning associated with it (Hunston and Francis 2000: 37).
239
10. Grammar
nouns and adjectives (Francis et al. 1998); there were also attempts to identify
grammar patterns automatically (cf. Mason and Hunston 2004). Research on
collocational frameworks and grammar patterns is mainly descriptive and takes
place in applied contexts, but Hunston and Francis (2002) argue very explicitly for
a usage-based theory of grammar on the basis of their descriptions (note that the
definition quoted above is strongly reminiscent of the way constructions were
later defined in Construction Grammar (Goldberg 1995: 4), a point I will return to
in the Epilogue of this book).
consider [a(n) + X + of]. Table 10-1 shows the most strongly associated lexical
items for this framework in the BNC.
Table 10-1. Top collocates of the collocational framework [a(n) + X + of] (BNC)

WORD       a(n) _ of    X ELSEWHERE    OTHER WORDS     REST          LL-G2       a/(a+b)
                                       IN a(n) _ of
lot           14632        13329          271461       96687283     132624.95    0.5233
number        15137        33669          270956       96666943     116935.11    0.3101
couple         7042         4838          279051       96695774      66197.79    0.5928
series         5985         8237          280108       96692375      50552.96    0.4208
result         5610        16300          280483       96684312      40644.29    0.2560
The results are comparable to those Renouf and Sinclair present, although they
use the COBUILD corpus, which is not publicly available, and they do not
calculate association strength but provide the percentages of all occurrences
of each word that occur in a given pattern (shown here in the last column).
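The last column is simple to reproduce: it is the share of a word's total occurrences that fall inside the framework, e.g. for lot (counts taken from Table 10-1):

```python
in_framework = 14632    # "lot" inside [a(n) _ of]
elsewhere = 13329       # "lot" everywhere else in the BNC
share = in_framework / (in_framework + elsewhere)
print(round(share, 4))  # 0.5233, i.e. more than half of all uses of "lot"
```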
The most strongly associated words are related mainly to quantities, parts, or
types (here). This type of semantic coherence is evidence for Renouf and Sinclair
Table 10-3. Most frequent words in the grammar pattern [it be ADJ that/to] (BNC)

RANK   COLLOCATION   FREQUENCY     RANK   COLLOCATION   FREQUENCY
1.     possible      3475          11.    essential     804
2.     important     3030          12.    unlikely      727
3.     difficult     2281          13.    interesting   679
4.     clear         1897          14.    obvious       524
5.     necessary     1488          15.    better        505
6.     hard          1372          16.    good          461
7.     likely        1261          17.    best          382
8.     easy          1129          18.    vital         381
9.     impossible    1121          19.    nice          367
10.    true          970           20.    easier        359
Hunston and Francis (2002: 29) note that the adjectives fall into a broad class they describe as "modality, ability, importance, predictability, obviousness, value and appropriacy, rationality, truth". Like Renouf and Sinclair, they see this semantic coherence as evidence for the relevance of the patterns they identify.
constructions (in this, too, it has much in common with Pattern Grammar).
construction). The prediction is that the most strongly associated verbs will be verbs of literal or metaphorical transfer. Table 10-4a gives the information needed to calculate the association strength for give (although the procedure should be familiar by now), and Table 10-4b shows the ten most strongly associated verbs.

                        DITRANSITIVE    OTHER
VERB      GIVE          Exp: 8.57       Exp: 1,139.43
The hypothesis is confirmed: the top ten collexemes (and most of the other significant collexemes not shown here) refer to literal or metaphorical transfer.
However, note that on the basis of lists like that in Table 10-3 we cannot reject a null hypothesis along the lines of "There is no relationship between the ditransitive and the encoding of transfer events", since we did not test this. All we can say is that we can reject null hypotheses stating that there is no relationship between the ditransitive and each individual verb on the list. In practice, this may amount to the same thing, but if we wanted to reject the more general null hypothesis, we would have to code all verbs in the corpus according to whether they are transfer verbs or not, and then show that transfer verbs are significantly more frequent in the ditransitive construction than in the corpus as a whole.
COLLEXEME   O11   O12   O21    O22      p (Fisher)
award       7     9     1028   137620   1.36E-11
allow       18    313   1017   137316   1.12E-10
lend        7     24    1028   137605   2.85E-09
10.2.2.2 Case study: Ditransitive and prepositional dative. Collostructional analysis can also be applied in the direct comparison of two grammatical constructions (or other grammatical features), analogous to the differential collocate design discussed in Chapter 9. For example, Gries and Stefanowitsch (2004) compare verbs in the ditransitive and the so-called to-dative: both constructions express transfer meanings, but it has been claimed that the ditransitive encodes a spatially and temporally more direct transfer than the to-dative (Thompson and Koide 1987). If this is the case, it should be reflected in the verbs most strongly associated with one or the other construction in a direct comparison. Table 10-5a shows the data needed to determine for the verb give which of the two constructions it is more strongly associated with.
Table 10-5a: Give in the ditransitive and the prepositional dative (ICE-GB)

                       ARGUMENT STRUCTURE
                       DITRANSITIVE     TO-DATIVE       Total
VERB      GIVE         Obs: 461         Obs: 146        607
                       Exp: 212.68      Exp: 394.32
Give is significantly more frequent than expected in the ditransitive and less frequent than expected in the to-dative (p = 1.84E-120). It can therefore be said to be a significant distinctive collexeme of the ditransitive (again, I will use the term differential instead of distinctive here). Table 10-5b shows the top ten differential collexemes for each construction.
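The Fisher-Yates p-values can be recomputed exactly with nothing but the Python standard library; the sketch below sums the hypergeometric tail (the function name is mine, and I assume the one-tailed test that is standard in collostructional analysis):

```python
from math import comb
from fractions import Fraction

def fisher_greater(o11, o12, o21, o22):
    """One-tailed Fisher-Yates test: probability of observing o11 or
    more in the top-left cell, with all marginal totals held fixed."""
    row1, col1 = o11 + o12, o11 + o21
    n = o11 + o12 + o21 + o22
    denom = comb(n, row1)
    p = Fraction(0)
    for k in range(o11, min(row1, col1) + 1):
        p += Fraction(comb(col1, k) * comb(n - col1, row1 - k), denom)
    return float(p)

# give in the ditransitive vs. the to-dative (first row of Table 10-5b)
p_give = fisher_greater(461, 146, 574, 1773)
```

For the much smaller table for promise (7, 1, 1028, 1918), the same function returns approximately 0.0036, the value reported in Table 10-5b.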
Table 10-5b: Verbs in the ditransitive and the prepositional dative (ICE-GB, Gries and Stefanowitsch 2004: 106).
COLLEXEME   O11   O12   O21    O22    p (Fisher)
MOST STRONGLY ASSOCIATED WITH THE DITRANSITIVE
give        461   146   574    1773   1.84E-120
tell        128   2     907    1917   8.77E-58
show        49    15    986    1904   8.32E-12
offer       43    15    992    1904   9.95E-10
cost        20    1     1015   1918   9.71E-09
teach       15    1     1020   1918   1.49E-06
wish        9     1     1026   1918   0.0005
ask         12    4     1023   1915   0.0013
promise     7     1     1028   1918   0.0036
Generally speaking, the list for the ditransitive is very similar to the one we get if we calculate the simple collexemes of the construction; crucially, many of the differential collexemes of the to-dative highlight the spatial distance covered by the transfer, which is in line with what the hypothesis predicts.
10.2.2.3 Case study: Negative evidence. Recall from the introductory chapter that one of the arguments routinely made against corpus linguistics is that corpora do not contain negative evidence. Even corpus linguists occasionally agree with this claim. For example, McEnery and Wilson (2001: 11), in their otherwise excellent introduction to corpus-linguistic thinking, cite the sentence in (10.1):
They point out that this sentence will not occur in any given finite corpus, but that this does not allow us to declare it ungrammatical, since it could simply be one of infinitely many sentences that simply haven't occurred yet. They then offer the same solution Chomsky has repeatedly offered:
It is only by asking a native or expert speaker of a language for their opinion of the grammaticality of a sentence that we can hope to differentiate unseen but grammatical constructions from those which are simply ungrammatical but unseen (McEnery and Wilson 2001: 12).
However, as Stefanowitsch (2006, 2008) points out, zero is just a number, no different from any other frequency of occurrence.

                       ARGUMENT STRUCTURE
                       DITRANSITIVE     OTHER            Total
VERB      SAY          Obs: 0           Obs: 3 333       3 333
                       Exp: 44.52       Exp: 3 288.48
Fisher's exact test shows that the observed frequency of zero differs significantly from the expected frequency.
determine whether the absence of a particular word is significant or not.
Note that since zero is no different from any other frequency of occurrence, this procedure does not tell us anything about the difference between an intersection of variables that did not occur at all and an intersection that occurred with any other frequency less than the expected one. All the method tells us is whether an occurrence of zero is significantly less than expected.
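Since zero is the most extreme outcome possible, the one-tailed Fisher-Yates p-value for a zero cell reduces to a single hypergeometric point probability. A sketch with purely illustrative figures (the function name and the counts are mine, not the ICE-GB values used above):

```python
from math import comb
from fractions import Fraction

def p_of_zero(verb_total, cxn_total, corpus_total):
    """Probability of a verb occurring zero times in a construction under
    the null hypothesis of no association: the hypergeometric probability
    of hitting none of the cxn_total slots in verb_total occurrences."""
    return float(Fraction(comb(corpus_total - cxn_total, verb_total),
                          comb(corpus_total, verb_total)))

# Illustrative: a verb with 3333 occurrences, a construction with 1800
# occurrences, in a corpus of 138664 clauses (expected co-occurrence ~43)
p = p_of_zero(3333, 1800, 138664)
```

With figures like these, the probability of observing zero by chance is vanishingly small, so the absence would count as significant; with rarer values, the same calculation may well fail to reach significance.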
In other words, the method makes no distinction between zero occurrence and other less-frequent-than-expected occurrences. However, Stefanowitsch (2006: 70f.) argues that this is actually an advantage: if we were to treat an occurrence of zero as special as opposed to, say, an occurrence of 1, then a single counterexample to an intersection of variables hypothesized to be impossible would appear to disprove our hypothesis. The knowledge that a particular intersection is significantly less frequent than expected, in contrast, remains relevant even when faced with apparent counterexamples. And, as anyone who has ever elicited acceptability judgments (whether from someone else or introspectively from themselves) knows, the same is true of acceptability judgments: We may feel something to be unacceptable even though we know of counterexamples (or can even think of such examples ourselves) that seem possible but highly unusual.
Of course, applying significance testing to zero occurrences of some intersection of variables is not always going to provide a significant result: if one (or both) of the values of the intersection are rare in general, an occurrence of zero may not be significantly less than expected. In this case, we still do not know whether the absence of the intersection is due to chance or to its impossibility; but with such rare combinations, acceptability judgments are also going to be variable.
away from this perspective, we find research designs that begin to resemble more closely the type illustrated in Chapters 6–8, but that nevertheless include lexical items as one of their variables. These will typically start from a word (or a small set of words) and identify associated grammatical (structural and semantic) properties. This is of interest for many reasons, for example in the context of how much idiosyncrasy there is in the grammatical behavior of lexical items in general (for example, what are the grammatical differences between near synonyms), or how much of a particular grammatical alternation is lexically determined.
variables, Verb of Seeing (with values corresponding to the individual verbs) and
Frame Elements Expressed (with values corresponding to the frame elements).
Table 10-7 shows one of the ways in which Atkins summarizes the data.
Table 10-7. Distribution of frame elements across verbs of seeing (Atkins 1994: 50)

                  FRAME ELEMENT EXPRESSED
                  EXPERIENCER   PERCEPT   LOCATION OF   LOCATION OF    BARRIER   TIME
                                          PERCEPT       EXPERIENCER
see               86%           95%       14%
catch sight of    100%          100%      7%            2%
express. Table 10-8 shows the result of one of her case studies using verbs of permission (note that for expository reasons the table is simplified and the labels for semantic roles are more generic than the ones used by Faulhaber).
Table 10-8. Argument structures possible with allow, permit, authorize and entitle in the sense 'permit' (Faulhaber 2011: 101)

Argument Structure                      allow   permit   authorize   entitle
NPAGT V NPTHM [for_N/V-ing]PURP
NPAGT V NPTHM [to_NP]REC
NPAGT V [NP(]+[)to_INF]THM
NPAGT V NPTHM
NPAGT V [V-ing]THM
NPAGT V [that_CL]THM
NPAGT V [for_NP to_INF]THM
10.2.3.2 Case study: Complementation of begin and start. There are a number of verbs in English that display variation with respect to their complementation patterns, and the factors influencing this variation have provided (and continue to provide) an interesting area of research for corpus linguists. In an early example of such a study, Schmid (1996) investigates the near-synonyms begin and start, both of which can occur with to-clauses or ing-clauses. He starts out from the hypothesis (derived from discussions in the literature) that begin signals gradual onsets and start signals sudden ones, and that ing-clauses are typical of dynamic situations while to-clauses are typical of stative situations. He then identifies all occurrences of both verbs with both complementation patterns in the LOB and codes them for whether the embedded verb (the verb in the complement clause) refers to an activity, a process or a state.
His study is a deductive one, as he starts out from a hypothesis to be tested on the corpus data. It involves three nominal variables: VERB (with the values BEGIN and START), COMPLEMENTATION (with the values ING and TO) and AKTIONSART (with the values ACTIVITY, PROCESS and STATE). Again, we are dealing with a multivariate design.
Schmid (1996) reports frequencies and percentages and bases his discussion on an informal inspection of these. Let us be more rigorous and submit his data to a Configural Frequency Analysis as introduced in Chapter 8. Table 10-9a shows the observed frequencies (from Schmid 1996) and the expected frequencies derived from these, and Table 10-9b shows the results of the CFA.
There are three types with corresponding antitypes that reach significance (in one case, only marginally so, at corrected levels of significance): activity verbs are associated with the verb start and the complementation pattern ing, and avoid the verb start with the complementation pattern to. This confirms the hypothesis: activities have clear (sudden) onsets and they are dynamic. For state verbs, the pattern is reversed, again in line with the hypothesis: states do not typically have clear (sudden) onsets and they are, of course, stative. Processes seem to be closer to states than to activities, but perhaps one of the reasons that the type PROCESS + BEGIN + TO is only marginally significant is that processes are somewhere in between states and activities.
Table 10-9a. Aktionsart, matrix verb and complementation type (LOB, Schmid 1996: 229)

                        BEGIN                        START
AKTIONSART   ING         TO          Total    ING        TO         Total    Total
ACTIVITY     O: 21       O: 96       117      O: 50      O: 26      76       193
             E: 30.85    E: 114.42            E: 10.14   E: 37.59
PROCESS      O: 2        O: 59       61       O: 2       O: 10      12       73
             E: 11.67    E: 43.28             E: 3.83    E: 14.22
STATE        O: 1        O: 101      102      O: 3       O: 1       4        106
             E: 16.94    E: 62.84             E: 5.57    E: 20.65
Total        24          256         280      55         37         92       372
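The expected frequencies in Table 10-9a follow from the marginal totals under the assumption of complete independence of the three variables: for a three-way configuration, N × p(A) × p(V) × p(C). As a sketch (the variable names are mine):

```python
# Marginal totals from Table 10-9a
N = 372
aktionsart = {"ACTIVITY": 193, "PROCESS": 73, "STATE": 106}
verb = {"BEGIN": 280, "START": 92}
comp = {"ING": 79, "TO": 293}

def expected(a, v, c):
    """Expected frequency of one configuration under complete
    independence (the null hypothesis of a first-order CFA)."""
    return aktionsart[a] * verb[v] * comp[c] / N ** 2

print(round(expected("ACTIVITY", "BEGIN", "ING"), 2))  # 30.85
print(round(expected("STATE", "START", "TO"), 2))      # 20.65
```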
Table 10-9b. Configural frequency analysis for Table 10-9a.

Intersection   Observed   Expected   Type   Chi-sq.   p   Sig.
they will not be discussed further here, but they have been extensively studied for a variety of phenomena (cf. e.g. Thompson/Koide 1987 for the dative alternation; Chen 1986, Gries 2003 for particle placement; Rosenbach 2002, Stefanowitsch 2003 for the possessives). Instead, we will discuss some less frequently investigated contextual factors here, namely word frequency, phonology, the horror aequi and priming.
adjectives, more frequent adjectives precede less frequent ones. There are two straightforward ways of testing this hypothesis. First (as Wulff does), based on average frequencies: if we assume that the hypothesis is correct, then the average frequency of the first adjective of two-adjective pairs should be higher than that of the second. Second, based on the number of cases in which the first and second adjective, respectively, are more frequent: if we assume that the hypothesis is correct, there should be more cases where the first adjective is the more frequent one than cases where the second adjective is the more frequent one. Obviously, the two ways of investigating the hypothesis could give us different results. Consider Table 10-10, which lists ten randomly chosen adjective pairs from the spoken part of the BNC (Wulff uses the entire BNC in her study).
PAIR                  FREQ 1st   FREQ 2nd   MORE FREQUENT
new allied            6337       33         FIRST
conservative young    148        1856       SECOND
other married         13099      436        FIRST
whole bloody          2361       2110       FIRST
frail elderly         10         276        SECOND
delighted united      175        515        SECOND
new posh              6337       139        FIRST
Mean/Total            4240.6     638.9      5/5
difference in frequency (like necessary/financial in the first line of Table 10-10) count just as much as cases with a vast difference in frequency (like other/dumb in line 2). In cases like this, it may be advantageous to apply both methods and reject the null hypothesis only if both of them give the same result.
Searching for sequences of exactly two adjectives (excluding comparatives and superlatives) followed by a noun in the spoken part of the BNC yields 7769 hits. I determined the frequency of each adjective in the spoken part of the BNC (again ignoring comparatives and superlatives), and calculated means and sample variance as described in Chapter 7, Section 7.3.1. Table 10-11 shows the relevant information.
Table 10-11. Word frequency and word order in ADJ+ADJ combinations, variant 1 (BNC Spoken)

             FIRST WORD   SECOND WORD
Using the formulas given in Chapter 7, we get a t-value of 14.99, which at 14 791 degrees of freedom is highly significant (t(14 791) = 14.99, p < 0.01). Going by average frequency, then, frequency seems to have an influence on word order.
Next, let us look at the number of cases that match the prediction. There are 274 cases where both adjectives have exactly the same frequency (in most cases, because an adjective is repeated, e.g. in long, long time; in some cases because two different adjectives happen to have the same frequency, e.g. antiseptic soapy water). We can discard these cases, as our hypothesis does not make any prediction about them. This leaves 4 279 cases where the first adjective is more
frequent than the second, and 3 216 cases where the second is more frequent than the first. If there were no relation between order and frequency, we would expect the cases to be evenly distributed. This gives us the observed and expected frequencies in Table 10-12, which show that the difference is, again, statistically highly significant (χ² = 150.76, df = 1, p < 0.001).
Table 10-12. Word frequency and word order in ADJ+ADJ combinations, variant 2 (BNC spoken)

                        MORE FREQUENT
                        FIRST WORD      SECOND WORD     Total
ADJ+ADJ                 Obs. 4 279      Obs. 3 216      7 495
COMBINATIONS            Exp. 3 747.5    Exp. 3 747.5
                        χ²-Cp. 75.38    χ²-Cp. 75.38
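The chi-squared value reported for Table 10-12 can be checked in a few lines (the function name is mine):

```python
def chi_squared(observed, expected):
    """Pearson chi-squared statistic from parallel lists of observed
    and expected cell frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# first vs. second adjective more frequent (Table 10-12)
print(round(chi_squared([4279, 3216], [3747.5, 3747.5]), 2))  # 150.76
```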
This case study was meant to demonstrate that sometimes we can derive different types of quantitative predictions from a hypothesis with no way to decide which of them more accurately reflects the hypothesis; in this case, it is a good idea to test both predictions. The case study also shows that word frequency may have effects on grammatical variation, which is interesting from a methodological perspective because not only is corpus linguistics the only way to test for such effects; corpora are also the only source from which the relevant values for the independent variable can be extracted.
10.2.4.2 Case study: Binomials and sonority. Frozen binomials (i.e. phrases of the type flesh and blood, day and night, size and shape) have inspired a substantial body of research attempting to uncover principles determining the order of the
examples. Second, these lists do not tell us exactly how frozen the phrases are; while there are cases that seem truly non-reversible (like flesh and blood), others simply have a strong preference (day and night is three times as frequent as night and day in the BNC) or even a relatively weak one (size and shape is only one-and-a-half times as frequent as shape and size).
We can avoid these problems by drawing our sample from the corpus itself. For this case study, I selected all instances of [NOUN and NOUN] that occurred at least 30 times in the BNC; because it is known that length and stress are strong determining factors, I only included cases where both nouns are monosyllabic. I then determined for all cases their frequency in both orders. I then calculated a "frozenness rank" by calculating for each pair the percentage of the more frequent order out of all uses of the pair.
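The frozenness measure is simple enough to state as code; a minimal sketch (the function name is mine):

```python
def frozenness(freq_preferred, freq_reversed):
    """Share of the more frequent order among all uses of a binomial."""
    return freq_preferred / (freq_preferred + freq_reversed)

# day and night (309) vs. night and day (101) in the BNC, cf. Table 10-13
print(round(frozenness(309, 101), 4))  # 0.7537
```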
Next, I coded the final consonants of all items and determined whether the first or the last one was more sonorant. For this, I used the following (hopefully uncontroversial) sonority hierarchy:
(10.3) [vowels] > [semivowels] > [liquids] > [h] > [nasals] > [voiced fricatives] > [voiceless fricatives] > [voiced affricates] > [voiceless affricates] > [voiced stops] > [voiceless stops]
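The hierarchy in (10.3) can be operationalized as an ordered list; a sketch of the coding step (the class labels are my own shorthand for the categories in (10.3)):

```python
HIERARCHY = ["vowel", "semivowel", "liquid", "h", "nasal",
             "voiced fricative", "voiceless fricative",
             "voiced affricate", "voiceless affricate",
             "voiced stop", "voiceless stop"]
RANK = {cls: i for i, cls in enumerate(HIERARCHY)}

def less_sonorous(final1, final2):
    """Which word's final segment is less sonorous: '1st', '2nd', or
    None for ties (ties are discarded in the case study)."""
    if RANK[final1] == RANK[final2]:
        return None
    return "1st" if RANK[final1] > RANK[final2] else "2nd"

# pride (/d/, a voiced stop) vs. joy (final vowel)
print(less_sonorous("voiced stop", "vowel"))  # 1st
```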
The data are shown in Table 10-13. The first column shows the phrase, the second column gives the frequency of the phrase in the more frequent order (the one shown), the third column gives the frequency of the less frequent order (the reversal of the order shown), the fourth column gives the percentage of the more frequent order, and the fifth column shows whether the final consonant of the
Table 10-13. Frozenness of [NOUN and NOUN] binomials and sonority of the final consonants (BNC)

pride and joy      61   0    1.0000  1st     land and sea       37   5    0.8810  1st
rights and wrongs  60   0    1.0000  1st     hearts and minds   64   9    0.8767  1st
man and wife       58   0    1.0000  1st     size and weight    28   4    0.8750  2nd
heart and soul     55   0    1.0000  1st     east and west      282  43   0.8677
son and heir       55   0    1.0000  1st     arms and legs      195  30   0.8667
ebb and flow       51   0    1.0000  1st     heart and lung     30   5    0.8571  1st
cat and mouse      36   0    1.0000  1st     eyes and ears      77   13   0.8556
egg and chips      33   0    1.0000  1st     rind and juice     27   5    0.8438  2nd
hip and thigh      32   0    1.0000  1st     food and wine      74   14   0.8409  1st
head and tail      32   0    1.0000  1st     height and weight  37   7    0.8409
months and years   30   0    1.0000  1st     date and time      59   13   0.8194  1st
fire and life      30   0    1.0000  2nd     meat and fish      29   7    0.8056  1st
drink and drugs    29   0    1.0000  1st     maths and science  33   8    0.8049
knees and toes     29   0    1.0000          road and rail      65   16   0.8025  1st
flesh and bone     28   0    1.0000  1st     boys and girls     337  84   0.8005
mums and dads      28   0    1.0000          trees and shrubs   131  33   0.7988
cut and thrust     28   0    1.0000          coal and steel     49   13   0.7903
day and age        26   0    1.0000  2nd     men and boys       29   8    0.7838  2nd
births and deaths  26   0    1.0000          war and peace      69   21   0.7667  2nd
king and queen     25   0    1.0000  1st     arm and leg        29   9    0.7632  2nd
mum and dad        489  11   0.9780  2nd     north and west     111  35   0.7603  2nd
north and south    398  9    0.9779          peas and beans     25   8    0.7576
black and white    127  4    0.9695          day and night      309  101  0.7537  2nd
care and skill     36   2    0.9474  2nd     shoes and socks    36   16   0.6923  2nd
youth and sports   52   3    0.9455          south and west     72   37   0.6606  2nd
wife and son       34   2    0.9444  1st     doom and gloom     58   30   0.6591
oil and gas        373  23   0.9419  2nd     eyes and mouth     27   15   0.6429  2nd
hips and thighs    28   2    0.9333  1st     plants and trees   26   17   0.6047  1st
news and views     28   2    0.9333          legs and feet      27   18   0.6000  2nd
age and sex        118  9    0.9291  2nd     size and shape     61   42   0.5922  2nd
days and nights    84   7    0.9231  2nd     nose and mouth     32   25   0.5614  2nd
fruit and veg      34   3    0.9189  1st     sea and air        27   27   0.5000  2nd
Let us first simply look at the number of cases for which the claim is true or false. First, there are 27 cases where both words end in consonants of the same sonority; these can be discarded, as the hypothesis says nothing about them. This leaves 36 cases where the first word's final consonant is less sonorant (as
predicted), and 30 cases where the second word's final consonant is less sonorant (counter to the prediction). As Table 10-14a shows, this difference is not significant (χ² = 0.54, df = 1, p > 0.05).
Table 10-14a. Sonority of the final consonant and word order in binomials

                  LESS SONOROUS FINAL CONSONANT
                  FIRST WORD     SECOND WORD     Total
BINOMIALS         Obs. 36        Obs. 30         66
                  Exp. 33        Exp. 33
                  χ²-Cp. 0.27    χ²-Cp. 0.27
However, note that we are including both cases with a very high degree of frozenness (like arts and crafts, rise and fall, heart and soul) and cases with a relatively low degree of frozenness (like nose and mouth, sea and air): this will dilute our results, as the cases with low frozenness are not actually predicted to adhere to the less-before-more-sonorant hypothesis. We could, of course, limit our analysis to cases with a high degree of frozenness, say, above 90 per cent (the data is available, so you might want to try). However, it would be even better to keep all our data and make use of the rank order that the frozenness measure provides: the prediction would be that cases with a high frozenness rank would be more likely to adhere to the sonority constraint than those with a low frozenness rank. Table 10-13 contains all the data we need to determine the medians of words adhering or not adhering to the constraint, as well as the rank sums and number of cases, which we need to calculate a U-test. We will not go through the test step by step (but you can try for yourself if you want to). Table 10-14b provides the necessary values derived from Table 10-13 (again, binomials with equal sonority values for the first and second word were ignored).
Table 10-14b. Sonority of the final consonant and word order in binomials

                       FINAL CONSONANT OF          FINAL CONSONANT OF
                       1ST WORD MORE SONOROUS      2ND WORD MORE SONOROUS
Median                 23.5                        47.5
Rank Sum               942.5                       1268.5
No. of Data Points     36                          30
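From the rank sums and group sizes in Table 10-14b, the U statistics follow directly via U = R − n(n+1)/2; a sketch (the function name is mine):

```python
def u_statistics(r1, n1, r2, n2):
    """Mann-Whitney U statistics from rank sums and group sizes;
    as a sanity check, the two U values must add up to n1 * n2."""
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = r2 - n2 * (n2 + 1) / 2
    assert u1 + u2 == n1 * n2
    return u1, u2

u1, u2 = u_statistics(942.5, 36, 1268.5, 30)
print(min(u1, u2))  # 276.5
```

The smaller U would then be checked against a table of critical values, or converted to a z-score for samples of this size.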
binomials.
10.2.4.3 Case study: Horror aequi. In a number of studies, Rohdenburg (e.g. 1995, 2003) has studied the influence of contextual (and, consequently, conceptual) complexity on grammatical variation. The general idea is that, in contexts that are already complex, speakers will try to choose a variant that reduces (or at least does not contribute to) this complexity. A particularly striking example of this is what Rohdenburg (adapting a term from Karl Brugmann) calls the horror aequi principle: the widespread (and presumably universal) tendency to avoid the repetition of identical and adjacent grammatical elements or structures
(as in 10.4a) or an ing-form (as in 10.4b); but as a to-infinitive it would avoid the to-infinitive (although not completely, as 10.4c shows), and strongly prefer the
Impressionistically, this seems to be true: in the BNC there are 11 cases of started
to think about, 18 of started thinking about, only one of to start to think about but
Table 10-15. Complementation of start and the horror aequi principle

                                      TYPE OF COMPLEMENT CLAUSE
                                      TO-CLAUSE       ING-CLAUSE      Total
FORM OF THE     start (without to)    Obs: 1326       Obs: 2503       3829
MATRIX VERB                           Exp: 1751.13    Exp: 2077.87
                                      χ²: 103.21      χ²: 86.98
                to start              Obs: 65         Obs: 1050       1115
                                      χ²: 525.61      χ²: 442.96
Total                                 6083            7218            13301
The most obvious and most significant deviation from the expected frequencies is indeed in the case where the matrix verb start has the same form as the complement clause: there are far fewer cases of to start to VERB and starting VERBing than expected. Interestingly, if the base form of start does not occur with an infinitive particle, the to-complement is still fairly strongly avoided in favor of the ing-complement. It seems as though horror aequi is a graded principle: the stronger the similarity, the stronger the avoidance.
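The chi-squared components in Table 10-15 can be verified cell by cell (the function name is mine):

```python
def component(obs, exp):
    """Contribution of a single cell to the chi-squared statistic."""
    return (obs - exp) ** 2 / exp

# start (without to) with to-clause and ing-clause complements
print(round(component(1326, 1751.13), 2))  # 103.21
print(round(component(2503, 2077.87), 2))  # 86.98
```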
This case study is intended to introduce the notion of horror aequi, which has
10.2.4.4 Case study: Synthetic and analytic comparatives and persistence. The horror aequi principle has been plausibly demonstrated to have an effect on certain types of variation and it can plausibly be explained as a way of reducing complexity (if the same structure occurs twice in a row, this might cause problems for language processing). However, there is another well-known principle that is, in a way, the opposite of horror aequi: priming. We know from psycholinguistic experiments that it is easier to activate the linguistic knowledge associated with a particular linguistic unit if it was already active shortly before. This has been argued to lead to so-called persistence effects, where a word or structure that has been introduced in the discourse is reused within a short span.
For example, Szmrecsanyi (2005) argues that persistence is one of the factors influencing the choice of adjectival comparison. While most adjectives require either synthetic comparison (further, but not *more far) or analytic comparison (more distant, but not *distanter), some vary quite freely between the two (friendlier/more friendly). Szmrecsanyi hypothesizes that one of many factors influencing the choice is the presence of an analytic comparative in the directly preceding context: if speakers have used an analytic comparative (for example, with an adjective that does not allow anything else), they are more likely to use it
analytic comparative.
Table 10-16a shows the distribution of analytic comparatives in this context
across the two conditions (the 28 cases that do contain an analytic comparative
are listed in Appendix D.3).
Table 10-16a. Analytic and synthetic comparatives by preceding context

                                          ANALYTIC        SYNTHETIC       Total
PRECEDING    CONTAINS ANALYTIC COMP.      Obs: 16         Obs: 12         28
CONTEXT                                   Exp: 13.81      Exp: 14.19
             DOES NOT CONTAIN AN. COMP.   Obs: 205        Obs: 215        420
                                          Exp: 207.19     Exp: 212.81
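For the record, the chi-squared statistic for Table 10-16a comes out far below the critical value of 3.841 (df = 1, p = 0.05), i.e. this window shows no significant persistence effect. A sketch of the check (the function name is mine):

```python
def chi_squared(observed, expected):
    """Pearson chi-squared over parallel lists of cell frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Table 10-16a: observed and expected values for the four cells
chi2 = chi_squared([16, 12, 205, 215], [13.81, 14.19, 207.19, 212.81])
print(chi2 < 3.841)  # True: not significant at df = 1, p = 0.05
```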
(10.5a) But the statistics for the second quarter, announced just before the October Conference of the Conservative Party, were even more damaging
in (10.5a).
Let us therefore restrict the context in which we count analytic comparatives to a size more likely to show persistence. I have chosen 7 words (based on the factoid that short-term memory can hold up to seven units). Table 10-16b shows the distribution of analytic comparatives across the two conditions in this smaller window.
Table 10-16b. Analytic and synthetic comparatives by preceding context (7-word window)

                                          ANALYTIC        SYNTHETIC       Total
PRECEDING    CONTAINS ANALYTIC COMP.      Obs: 13         Obs: 0          13
CONTEXT                                   Exp: 6.41       Exp: 6.59
We see that all of the analytic comparatives preceding synthetic ones disappear when we restrict the window in this way, while the number of analytic comparatives preceding analytic comparatives remains roughly the same. The difference between the two conditions is now statistically highly significant (χ² =
treatment of the topic, but see also Gries (2005) and Weiner and Labov (1983); see also Ferreira and Bock (2006) for a theoretical discussion. It also demonstrates that details of the research design, such as the size of the search window, may have substantial consequences for the results.
English uses the same verb as a complex intransitive (for example to protest something vs. to protest against something). He tests this hypothesis for a number of verbs from different semantic fields in a corpus of British and American newspapers and finds the tendency confirmed.
Let us replicate the study for two selected verbs using the much smaller, but more diverse LOB/FLOB and BROWN/FROWN corpora: fight and protest. Each of these verbs allows both simple transitive and complex intransitive uses (in addition to intransitive uses, which we will ignore here but which do not differ significantly across varieties). Consider fight in (10.6):
(10.6a) These people knocked each other about for a while but united to fight the Huns
(10.6b) soldiers who had gone to fight against Germans had turned into gardeners in uniform.
Table 10-17a shows the distribution of these two variants in British and American English (not counting uses with a cognate direct object, such as fight a war). There is no significant difference between the two varieties (χ² = 0.83, df = 1, p > 0.05), and the very small difference that does exist goes against the hypothesis.
Table 10-17a. Transitive and complex intransitive uses of fight in British and American English

                    COMPLEMENTATION
                    TRANSITIVE     COMP. INTRANS.     Total
Table 10-17b shows the distribution of the two variants across British and American English for the verb protest.
Table 10-17b. Transitive and complex intransitive uses of protest in British and American English

                               COMPLEMENTATION
                               TRANSITIVE     COMP. INTRANS.    Total
VARIETY    BRITISH ENGLISH     Obs: 2         Obs: 10           12
                               Exp: 5.38      Exp: 6.62
           AMERICAN ENGLISH    Obs: 11        Obs: 6            17
                               Exp: 7.62      Exp: 9.38
Total                          13             16                29
In order to show that the preference for transitive and complex intransitive verbs respectively is a difference that characterizes American and British English, it would be necessary to look at a large number of verbs (which Rohdenburg does), using a representative corpus (which Rohdenburg does not, for lack of large comparable corpora). If and when the American National Corpus (see Appendix
10.2.5.2 Case study: Sex. Grammatical differences may also exist between varieties
annotation, so it is limited to relatively small corpora, but let us see what we can do in terms of a larger-scale analysis. Let us focus on tag questions with negative polarity containing the auxiliary be (e.g. isn't it, wasn't she, am I not, was it not). These can be extracted relatively straightforwardly even from an untagged corpus using the following queries:
The query in (10.8a) will find all finite forms of the verb to be (as non-finite forms cannot occur in tag questions), followed by the negative clitic n't, followed by a pronoun; the query in (10.8b) will do the same thing for the full form of the particle not, which then follows rather than precedes the pronoun. Both queries will only find those cases that occur before a punctuation mark signaling a clause boundary (what to include here will depend on the transcription conventions of the corpus, if it is a spoken one).
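The exact query syntax depends on the corpus tool; as a rough stand-in for the logic of (10.8a/b) over plain transcription text, it can be approximated with regular expressions (the patterns below are my approximation, not the original queries):

```python
import re

FINITE_BE = r"(?:am|is|are|was|were)"
PRONOUN = r"(?:I|you|he|she|it|we|they)"

# (10.8a)-style: finite be + clitic n't + pronoun + clause boundary
CLITIC_TAG = re.compile(rf"\b{FINITE_BE}n't {PRONOUN}\s*[.,?!]", re.IGNORECASE)
# (10.8b)-style: finite be + pronoun + full form not + clause boundary
FULL_TAG = re.compile(rf"\b{FINITE_BE} {PRONOUN} not\s*[.,?!]", re.IGNORECASE)

print(bool(CLITIC_TAG.search("It's nice, isn't it?")))     # True
print(bool(FULL_TAG.search("That was odd, was it not?")))  # True
```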
The queries are meant to work for the spoken portion of the BNC, which uses the comma for all kinds of things, including hesitation or incomplete phrases, so we have to make a choice whether to exclude it and increase the precision or to include it and increase the recall (I will choose the latter option). The queries are not perfect yet: British English also has the form ain't it, so we might want to include the query ai n't (I|you|he|she|it|we|they); however, ain't can stand for be or for have, which lowers the precision somewhat. Finally, there is also the form innit in (some varieties of) British English, so we might want to include in n it; however, this is an invariant form that can occur with any verb or auxiliary in the main clause, so it will decrease the precision even further. We will ignore ain't and innit here (they are not particularly frequent and hardly change the
such occurrences for female speakers and 113 650 for male speakers. Next, we can count the number of interrogative sentences in a sample of 100 hits. In three separate samples, I found eight to ten questions, so let us take the highest value to be sure, and adjust the number of hits by subtracting ten percent; this gives us estimates of 60 164 finite declarative positive-polarity clauses with be for female speakers and 102 285 for male speakers.
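The adjustment is simple arithmetic; for the male speakers, for instance:

```python
male_hits = 113650          # raw hits for male speakers
share_declarative = 0.9     # ca. 10% of the hits were interrogatives
print(round(male_hits * share_declarative))  # 102285
```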
ese numbers can now be used to compare the number of tag questions
against, as shown in Table 10.18. Since the tag questions that we found using our
queries have negative polarity, they are not included in the sample, but must
occur as tags to a subset of the sentences. is means that by subtracting the
.1
number of tag questions from the total for each group of speakers, we get the
number of sentences without tag questions.
Table 10-18. Negative polarity tag questions in male and female speech in the spoken BNC.

                                 TAG QUESTION
                                 PRESENT           ABSENT             Total
SPEAKER SEX    FEMALE            Obs: 3,801        Obs: 56,363        60,164
                                 Exp: 2,559.16     Exp: 57,604.84
               MALE              Obs: 3,109        Obs: 99,176       102,285
                                 Exp: 4,350.84     Exp: 97,934.16
               Total                   6,910            155,539      162,449
Anatol Stefanowitsch
diachronic development and, second, the interaction of word order and grammatical complexity. She finds that the prepositional use is preferred in British English (around two thirds of all uses are prepositions in present-day British newspapers) while American English favors the postpositional use (more than two thirds of occurrences in American English are postpositions). She shows that the postpositional use initially accounted for around a quarter of all uses but then almost disappeared in both varieties; its re-emergence in American English is a recent development (the convergence and/or divergence of British and American English has been intensively studied by Hundt, e.g. 1997, 2009).
The basic design with which to test the convergence or divergence of two varieties with respect to a particular feature is a multivariate one with VARIETY and PERIOD as independent variables and the frequency of the feature as a dependent one. Let us try to apply such a design to notwithstanding using the LOB, BROWN, FLOB and FROWN corpora (one British and one American corpus each from the 1960s and the 1990s). Note that these corpora are rather small and 30 years are not a long period of time, so we would not necessarily expect results even if the hypothesis were true that American English reintroduced postpositional
While the prepositional use is more frequent in both corpora from 1961, the postpositional use is the more frequent one in the American English corpus from 1991. A CFA shows that the intersection 1991 AM.ENGL POSTPOSITION is the only one whose observed frequencies differ significantly from the expected (cf. Table 10-19b).
Due to the small number of cases, we would be well advised not to place too much confidence in our results, but note that they fully confirm Berlage's claims that British English prefers the prepositional use and American English has recently begun to prefer the postpositional use.
This case study is intended to provide a further example of a multivariate design and to show that even small data sets may provide evidence for or against a hypothesis. It is also intended to introduce the study of the convergence and/or divergence of varieties and the basic design required. This field of studies is of interest especially in the case of pluricentric languages like, for example, English, Spanish or Arabic (cf. Rohdenburg and Schlüter 2009, from which Berlage's study
methodology. An excellent example is Mair (2004), which looks at a number of grammaticalization phenomena to answer this and other questions.
He uses the OED's citation database as a corpus (not just the citations given in the OED's entry for going to, but all citations used in the entire OED). It is an interesting question to what extent such a citation database can be regarded as a corpus. One argument against doing so is that it is an intentional selection of certain examples over others and thus may not yield an authentic picture of a given phenomenon. However, as Mair points out, the vast majority of examples of a given phenomenon X will occur in citations that were collected to illustrate other phenomena, so they should constitute random samples with respect to X. The advantage of citation databases for historical research is that the sources for citations will have been carefully checked and very precise information will be available as to their year of publication, the author, etc.
As an example of this method, consider the going-to future. It is relatively easy to determine at what point at the latest the sequence [going to Vinf] was established as a future marker. In the literature on going to, the following example from the 1482 Revelation to the Monk of Evesham is considered the first documented use with a future meaning (it is also the first citation in the OED):
(10.9) Therefore while thys onhappy sowle by the vyctoryse pompys of her enmyes was goyng to be broughte into helle for the synne and onleful lustys of her body.
Mair also notes that it is mentioned in grammars from 1646 onward; at the very latest, then, it was established at the end of the 17th century.
However, as Figure 10-1 shows, there was only a small rise in frequency during the time that the construction became established, but a substantial jump in frequency afterwards (the dashed line shows Mair's conservative estimate for
the point at which the construction was firmly established as a way to express future tense).

[Figure 10-1]
Interestingly, around the time of that jump, we also find the first documented instances of the contracted form gonna. These results suggest that semantic reanalysis is the first step in grammaticalization, followed by a rise in discourse frequency accompanied by phonological reduction.
This case study is intended to show that very large collections of citations may be used as a corpus under certain circumstances. It also demonstrates the importance of corpora in diachronic research, which was, of course, always text-based, but which deals with many questions that require not just texts, but large corpora.
collocates in the verb and theme positions of the ditransitive if we want to know how particular things are transferred, cf. Stefanowitsch and Gries 2009). In this way, grammatical structures can become diagnostics of culture. Again, care must be taken to ensure that the link between a grammatical structure and a putative scenario is plausible.
10.2.6.1 Case study: He said, she said. In a paper on the media representation of men and women, Caldas-Coulthard (1993) finds that men are quoted vastly more frequently than women in the COBUILD corpus (cf. also Chapter 11). She also
MOST STRONGLY ASSOCIATED WITH HE

VERB         WITH HE    WITH SHE    OTHER VERBS    OTHER VERBS
                                    WITH HE        WITH SHE
say           17,794     10,936      16,687         13,235       230.32
growl            153          4      34,328         24,167       132.65
drawl            170         10      34,311         24,161       121.39
write            209         43      34,272         24,128        68.27
grate             80          4      34,401         24,167        59.99
rasp              77          7      34,404         24,164        46.08
snarl             79         13      34,402         24,158        32.08
continue         311        123      34,170         24,048        31.25
.1
10.2.7.1 Case study: to- vs. that-complements. A good case study for English that is
based largely on counterexamples is Noel (2003), who looks at a number of claims
v0
made about the semantics of innitival complements as compared to that-clauses.
He takes claims made by other authors based on their intuition and treats them
like Popperian hypotheses, searching the BNC for counterexamples. He mentions
more or less informal impressions about frequencies, but only to clarify that the
T
counterexamples are not just isolated occurrences that could be explained away.
For example, he takes the well-known claim that with verbs of knowing,
innitival complements present knowledge as subjective/personal, while that-
AF
273
10. Grammar
(10.12a) Erika was surprised to find that she was beginning to like Bach (BNC A7A)
(10.12b) [A]che of loneliness apart, I found that I was stimulated by the challenge of finding my way about this great and beautiful city. (BNC AMC)
And here are some with to-complements expressing objective, impersonal knowledge:
(10.13a) Li and her coworkers have been able to locate these sequence variations in the three-dimensional structure of the toxin, and found them to be concentrated in the sheets of domain II. (BNC ALV)
(10.13b) The visiting party, who were the first and last ever to get a good look at the crater of Perboewetan, found it to be about 1,000 metres in diameter and about fifty metres deep (BNC ASR)
These counterexamples (and others not cited here) in fact give us a new
situations where this objective knowledge was not previously known to the participants of the situation described. In fact, if we extend our search for counterexamples beyond the BNC to the world-wide web, we find examples that are even more parallel to (10.10b), such as (10.13c):
(10.13c) Our more recent ancestors (since 1800) were named Burnaman and told they were from Ireland. We met much resistance from the now living older generation when we found them to be German and spelled Bornemann in Germany and their early years in America. (female speaker from Texas, 2008)
Meurers's (2005) and Meurers and Müller's (2009) work on German syntax as good examples for the theoretically informed search for counterexamples).
Further reading
Grammar is a complex phenomenon investigated from very different perspectives. This makes general suggestions for further reading difficult. It may be best to start with collections focusing on the corpus-based analysis of grammar, such as Rohdenburg and Mondorf (2003), Gries and Stefanowitsch (2006), Rohdenburg and Schlüter (2009) on differences between British and
11. Morphology
The wordform-centeredness of most corpora and corpus-access tools that requires a certain degree of ingenuity when studying structures larger than the word is not a huge problem for corpus-based morphology, which studies structures smaller than the word. Corpus morphology is mostly concerned with the distribution of affixes (derivational ones, like the nominalizing suffixes -ness and -ity, and inflectional ones, like participial -ing or the plural -s). Retrieving all occurrences of an affix plausibly starts with the retrieval of all strings potentially containing this affix. We could retrieve all occurrences of -ness, for example, with a query like .+ness. This will retrieve all occurrences of the suffix: the recall will be 100 percent. The precision will not be 100 percent in most cases, as such a search will also retrieve words that just happen to end with the search string (in the case of -ness, words like witness, governess or place names like Inverness). The degree of precision will depend on how unique the search string is for the affix in question; for -ness and -ity it is fairly good, as there are only a few words that share the same string accidentally (examples like those just mentioned for -ness and words like city and pity for -ity).
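The retrieval-plus-clean-up step can be sketched as follows; the stop list of false positives is illustrative (built from the examples just mentioned) and would be extended during manual clean-up.

```python
import re

# Query ".+ness": any word ending in the search string (recall = 100%)
NESS = re.compile(r".+ness$")

# Known false positives, to be extended during manual clean-up
# (illustrative, based on the examples in the text)
FALSE_POSITIVES = {"witness", "governess", "inverness"}

def retrieve_ness(tokens):
    """Retrieve candidate -ness tokens, then filter out known false hits."""
    hits = [t for t in tokens if NESS.match(t.lower())]
    return [h for h in hits if h.lower() not in FALSE_POSITIVES]
```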
However, once we have extracted and manually cleaned up our data set, we are faced with a problem that does not present itself when studying lexis or grammar: the very fact that affixes do not occur independently but always as

straightforward (in fact, tautological) only because we have made tacit assumptions about what it means for instances of a particular phenomenon to occur (generally, or under different conditions).
When we are interested in the frequency of occurrence of a particular word (as we were, for example, in Chapter 3), there seems to be only one meaningful way of counting it: search for all occurrences and add them up. For example, in order to determine how frequent the definite article the is in the BNC, we query the string the in all combinations of upper and lower case (i.e. at least the, The, and THE, but perhaps also tHe, ThE, THe, tHE and thE, to be sure) and count the hits (since this string corresponds uniquely to the word the, we don't even have to clean up the results manually). The query will yield 6,041,234 hits, so the frequency of occurrence of the word the in the BNC is 6,041,234.
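In script form, the case-insensitive count amounts to lower-casing each token before comparison (a sketch, independent of any particular corpus interface):

```python
def token_frequency(tokens, word):
    """Count all occurrences of a word regardless of capitalization
    (the, The, THE, tHe, ...)."""
    return sum(1 for t in tokens if t.lower() == word.lower())
```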
When searching for grammatical structures (for example in Chapters 6 and 7), we have simply transferred this way of counting occurrences. For example, in order to determine the frequency of the s-possessive in the BNC, we would define
are simply communicatively useful in many situations, like her head (4,919), his younger brother (112), people's lives (224) and body's immune system (29).
This means that there are two different ways to count occurrences of the s-possessive. First, we could simply count all instances without paying any attention to whether they recur in identical form or not. When looking at occurrences of a linguistic item or structure in this way, they are referred to as tokens, so 1,492,349 is the token frequency of the s-possessive. Second, we could exclude repetitions and count only the number of instances that are different from each other; for example, we would count King's Cross only the first time we encounter it, disregarding the other 321 occurrences. When looking at occurrences of linguistic items in this way, they are referred to as types; the type frequency of the s-possessive in the BNC is 378,893. The type frequency of the, of course, is 1.
Let us look at one more example of the type/token distinction before we move on. Consider the following famous line from Gertrude Stein's poem Sacred Emily:
often they are used under a particular condition, so it is their token frequency that is relevant to us; but we could imagine designs where we are mainly interested in whether a word occurs at all, in which case all that is relevant is whether its type frequency is one or zero. When studying grammatical structures, we will also mainly be interested in how frequently a particular grammatical structure is used under a certain condition, regardless of the words that fill this structure. Again, it is the token frequency that is relevant to us. However, note that we can (to some extent) ignore the specific words filling our structure only because we are assuming that the structure and the words are, in some meaningful sense, independent of each other; i.e., that the same words could have been used in a different structure (e.g. an of-possessive instead of an s-possessive) or that the same structure could have been used with different words (e.g. John's spouse instead of his wife). Recall that in our case studies in Chapters 7 and 8 we excluded all instances where this assumption does not hold (such as proper names and fixed expressions); since there is no (or very little) choice with these cases, including them, let alone counting repeated occurrences of them, would have added nothing (we did, of course, include repetitions of free combinations, of which there were four in our sample: his staff, his mouth, his work and his head occurred twice each).
Obviously, morphemes (whether inflectional or derivational) can be counted in the same two ways as words or grammatical structures. Take the following passage from William Shakespeare's play Julius Caesar:
(11.2) CINNA: Am I a married man, or a bachelor? Then, to answer every man directly and briefly, wisely and truly: wisely I say, I am a bachelor.
Let us count the occurrences of the adverbial suffix -ly. There are five word tokens that contain this suffix (directly, briefly, wisely, truly, wisely), so its token frequency is 5; however, there are only four types, since wisely occurs twice, so its type frequency in this passage is 4.
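The two counts for (11.2) can be replicated mechanically; the crude tokenization and the string match on -ly are stand-ins for proper morphological analysis:

```python
import re

passage = ("Am I a married man, or a bachelor? Then, to answer every man "
           "directly and briefly, wisely and truly: wisely I say, I am a bachelor.")

# Crude tokenization; matching on the string "ly" stands in for
# identifying the adverbial suffix
words = re.findall(r"[a-z]+", passage.lower())
ly_words = [w for w in words if w.endswith("ly")]

token_frequency = len(ly_words)       # 5: directly, briefly, wisely, truly, wisely
type_frequency = len(set(ly_words))   # 4: wisely occurs twice
```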
Again, whether type or token frequency is the more relevant or useful measure depends on the research design, but the issue is more complicated than in the case of words and grammatical structures. Let us begin to address this problem by looking at the diminutive affixes -icle (as in cubicle, icicle) and mini- (as in minivan, mini-cassette).
a) Token frequency. First, let us count the tokens of both affixes in the BNC. We will find that -icle has a token frequency of 20,857, almost ten times that of mini-, which occurs only 2,161 times. We might thus be tempted to conclude that we will be able to learn much more about the suffix -icle from our data than about the prefix mini-. We might also conclude that -icle is much more central to the domain of English diminutives than mini-. However, both conclusions would be misleading, or at least premature, for reasons related to the problems introduced above. Recall that affixes do not occur by themselves, but always as parts of words (this is what makes them affixes in the first place). This means, however, that their token frequency can reflect situations that are both quantitatively and qualitatively very different.
Specifically, the high frequency of an affix may be due to the fact that it is used in a small number of very frequent words, or in a large number of very infrequent words (or something in between). The first case holds for -icle: just the three most frequent words it occurs in (article, vehicle and particle) account for
19,206 hits (i.e., 92.08 percent of all occurrences). In contrast, the three most frequent words with mini- (mini-bus, mini-computer and mini-bar) account for only 45.94 percent of all occurrences (617 hits); to get to 92 percent, we have to include the twenty-one next-most-frequent words (mini-skirt, mini-cab, mini-gram, mini-series, mini-golf, mini-enterprise, mini-step, mini-tab, mini-raise, mini-van, mini-dress, mini-budget, mini-tel, mini-ramp, mini-league, mini-boom, mini-roundabout, mini-break, mini-clinic, mini-LP, and mini-tower; note that most of these words occur both with and without hyphens; I have standardized their spelling for expository ease).
In other words, the high token frequency of -icle tells us nothing (or at least very little) about the importance of the affix, but rather about the importance of some of the words containing it. This is true regardless of whether we are looking at its token frequency in the corpus as a whole or under specific conditions; if its token frequency turned out to be higher under one condition than under the other, this could point to an association between that condition and one or more of the words containing the affix, rather than between the condition and the affix itself. For example, the token frequency of the suffix -icle is higher in the BROWN corpus (270 tokens) than in the LOB corpus (226 tokens). However, as Table 11-1 shows, this is exclusively due to the fact that the word vehicle is more frequent than expected in American English and less frequent than expected in British English (the corresponding chi-square components being the only ones reaching significance).
Even if all words containing a particular affix were more frequent under one condition (e.g. in one variety) than under another, this would not in itself tell us anything about the affix, as this distribution could be due to the affix itself (the adverbial suffix -ly, for example, is disappearing from American English, but not from British English), or to the words containing it.
This is not to say that the token frequencies of affixes can never play a useful role; they may be of interest, for example, in cases of morphological alternation (i.e. two suffixes competing for the same stems, such as -ic and -ical in words like electric/al); here, we may be interested in the quantitative association between particular stems and one or the other of the affix variants, essentially giving us a collocation-type research design based on token frequencies. But for most research questions, designs based (exclusively) on the distribution of token frequencies under different conditions will give us meaningless results.
             BROWN            LOB              Total
chronicle    Obs: 8           Obs: 7              15
             Exp: 8.17        Exp: 6.83
             χ²-Cmp: 0.00     χ²-Cmp: 0.00
fascicles    Obs: 3           Obs: 0               3
             Exp: 1.63        Exp: 1.37
             χ²-Cmp: 1.14     χ²-Cmp: 1.37
ventricle    Obs: 4           Obs: 8              12
             Exp: 6.53        Exp: 5.47
             χ²-Cmp: 0.98     χ²-Cmp: 1.17
testicle     Obs: 2           Obs: 0               2
             Exp: 1.09        Exp: 0.91
             χ²-Cmp: 0.76     χ²-Cmp: 0.91
occurs only in a few words. Note that in order to compare type frequencies, we have to correct for the size of the sample: all else being equal, a larger sample will contain more types than a smaller one simply because it offers more opportunities for different types to occur (a point we will return to in more detail in the next subsection). A simple way of doing this is to divide the number of types by the number of tokens; the resulting measure is referred to, very transparently, as the type/token ratio (or TTR):

(11.3) TTR = n(types) / n(tokens)
The TTR is the percentage of tokens in a sample that are different from each other; or, put differently, it is the average likelihood that we will encounter a new type if we go through the sample item by item.
For example, the affix -icle occurs in just 35 different words in the BNC, so its TTR is 35/20,857 = 0.0017. In other words, 0.17 percent of its tokens in the BNC are different from each other; the vast remainder consists of repetitions of these 0.17 percent. Put differently, if we went through the occurrences of -icle in the BNC item by item, the likelihood that the next item instantiating this suffix will be a type we have not seen before is 0.17 percent, so we will encounter a new type about once every six hundred words. For mini-, the type/token ratio is much higher: it occurs in 555 different words, so its TTR is 555/2,161 = 0.2568. In other words, more than twenty-five percent of all occurrences of mini- are different from each other; put differently, if we went through our sample word by word, the likelihood that the next instance of mini- is a new type would be 25 percent, so it will happen once every four hits on average. The difference in their TTRs suggests that mini-, in its own right, is much more central in the English lexicon than -icle, even though the latter has a much higher token frequency. Note that this is a statement only about the affixes; it does not mean that the words containing mini- are individually or collectively more important than those containing -icle (on the contrary: words like vehicle, article and particle are arguably much more important than words like minibus, minicomputer and minibar).
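The computation in (11.3) is a one-liner; the BNC figures just cited are plugged in below as plain numbers:

```python
def ttr(tokens):
    """Type/token ratio: number of distinct types divided by sample size."""
    return len(set(tokens)) / len(tokens)

# The BNC figures cited above, computed directly from type and token counts
ttr_icle = 35 / 20857    # ~ 0.0017
ttr_mini = 555 / 2161    # ~ 0.2568
```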
Likewise, observing the type frequency (i.e., the TTR) of an affix under different conditions provides information about the relationship between these conditions and the affix itself, albeit one that is mediated by the lexicon: it tells us how important the suffix in question is for the subparts of the lexicon that are relevant under those conditions. For example, there are 7 types and 9 tokens for mini- in the 1991 British FLOB corpus (two tokens each for mini-bus and mini-series and one each for mini-charter, mini-disc, mini-maestro, mini-roll and mini-submarine), so the TTR is 7/9 = 0.7778; in the 1991 US-American FROWN corpus, there are 10 types and 11 tokens, so the TTR is 10/11 = 0.9091. This suggests that the prefix mini- was more important to the US-English lexicon than to the British English lexicon in the 1990s, although, of course, the samples and the difference between them are both rather small, so we would not want to draw that conclusion without testing for significance first, a point I will return to in the next subsection.
c) Hapax legomena. While type frequency is a useful (and, in my view, insufficiently valued) way of measuring the importance of affixes in general or under specific conditions, it has one drawback: it does not tell us whether the affix plays a productive role in a language at the time from which we take our samples (i.e., whether speakers at that time made use of it when coining new words). An affix may have a high TTR because it was productively used at the time of the sample, or because it was productively used at some earlier period in the history of the language in question. In fact, an affix can have a high TTR even if it was never productively used, for example, because speakers at some point borrowed a large number of words containing it; this is the case for a number of Romance affixes in English, occurring in words borrowed from Norman French but never (or very rarely) used to coin new words. An example is the suffix -ence/-ance occurring in many Latin and French loanwords (such as appearance, difference, existence, influence, nuisance, providence, resistance, significance, vigilance, etc.), but only in a handful of words formed in English (e.g. abidance, forbearance, furtherance, hinderance, and riddance).
word I coined productively (in fact, I coined it in order to have a good example of a hapax legomenon), whereas ingenuity has been part of the English language for more than four hundred years (the OED first records it in 1598); it occurs only once in this book for the simple reason that I only needed it once. So a word may be a hapax legomenon because it is a productive coinage, or because it is infrequently needed (in larger corpora, the category of hapaxes typically also contains misspelled or incorrectly tokenized words, which will have to be cleaned up manually). The idea is simply to use the notion hapax legomenon as an operationalization of the notion productive application of a rule, in the hope that the correlation between the two notions (in a large enough corpus) will be substantial.32
Like the number of types, the number of hapax legomena is dependent on sample size (although the relationship is not as straightforward as in the case of types, see next subsection); it is useful, therefore, to divide the number of hapax legomena by the number of tokens to correct for sample size:

(11.4) HTR = n(hapax legomena) / n(tokens)

We will refer to this measure as the hapax-token ratio (or HTR) by analogy with the term type-token ratio. Note, however, that in the literature this measure is referred to as P (for Productivity, following Baayen, who first suggested the measure); I depart from this nomenclature here to avoid confusion with p for probability of error.
Let us apply this measure to our two diminutive affixes. The suffix -icle has just five hapax legomena in the BNC (auricle, denticle, pedicle, pellicle and tunicle). This means that its HTR is 5/20,857 = 0.0002 (i.e., 0.02 percent of its tokens are hapax legomena). In contrast, there are 370 hapax legomena for mini- in the BNC (including, for example, mini-10p-piece, mini-brain, mini-gasometer, mini-textbook and mini-wurlitzer). This means that its HTR is 370/2,161 = 0.1712 (i.e. 17 percent of its tokens are hapax legomena). Thus, mini- is much more productive than -icle (or was, at the time that the BNC was assembled); this presumably matches the intuition of most speakers of English.

32 Note also that the productive application of a suffix does not necessarily result in a hapax legomenon: two or more speakers may arrive at the same coinage, or a single speaker may like their own coinage so much that they use it again; some researchers therefore suggest that we should also pay attention to dis legomena (words occurring twice) or even tris legomena (words occurring three times). We will stick with the mainstream here and use only hapax legomena.
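The HTR in (11.4) is computed analogously, by counting the types that occur exactly once:

```python
from collections import Counter

def htr(tokens):
    """Hapax/token ratio: number of types occurring exactly once,
    divided by sample size."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)

# The BNC figures cited above
htr_icle = 5 / 20857     # ~ 0.0002
htr_mini = 370 / 2161    # ~ 0.1712
```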
values NEW and SEEN BEFORE. One appropriate statistical test for a distribution of nominal values under different conditions is the chi-square test, which we are already more than familiar with. For example, if we wanted to test whether the TTRs of -icle and mini- in the BNC differ significantly, we might construct a table like Table 11-1. The chi-square test would tell us that the difference is highly significant (χ² = 5104.04, df = 1, p < 0.001, φ = 0.4709).

Table 11-1. Type/token ratios of -icle and mini- in the BNC

                           AFFIX
                           -ICLE               MINI-              Total
        NEW                Obs.       35       Obs.      555         590
                           Exp.   534.61       Exp.    55.39
        SEEN               Obs.   20,822       Obs.    1,606      22,428
        BEFORE             Exp. 20,322.39      Exp. 2,105.61
        Total                     20,857               2,161      23,018

For HTRs, we could follow the same procedure: in this case we are dealing with a nominal variable TYPE with the values OCCURS ONLY ONCE and OCCURS MORE THAN ONCE, so we could construct the corresponding table and perform the chi-square test.
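The test can be replicated with a minimal hand-rolled chi-square function for a 2-by-2 table; the frequencies below are those for -icle and mini- in the BNC, with the SEEN BEFORE cells computed as tokens minus types:

```python
def chi_square_2x2(table):
    """Pearson chi-square and phi coefficient for a 2x2 frequency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    expected = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
                [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    chi2 = sum((obs - exp) ** 2 / exp
               for row, erow in zip(table, expected)
               for obs, exp in zip(row, erow))
    phi = (chi2 / n) ** 0.5
    return chi2, phi, expected

# Rows: NEW vs. SEEN BEFORE; columns: -icle vs. mini-
chi2, phi, expected = chi_square_2x2([[35, 555],
                                      [20822, 1606]])
```

A dedicated statistics library would add the p-value; the chi-square and phi values here reproduce those reported in the text.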
However, while the logic behind this procedure is plausible in theory both for HTRs and for TTRs, in practice matters are much more complicated. The reason for this is that type-token ratios and hapax-token ratios are also dependent on sample size.
In order to understand why and how this is the case and how to deal with it,
let us leave the domain of morphology for a moment and look at the relationship between tokens and types or hapax legomena in texts. Consider the opening sentences of Jane Austen's novel Pride and Prejudice (the novel is freely available from Gutenberg.org, cf. Appendix A.1):
some one or2/-1 other of6 their daughters.
All words without a subscript are new types and hapax legomena at the point at which they appear in the text; if they have one subscript, this is their token frequency, i.e., they do not constitute new types but a repetition; their first repetition (i.e., the point when their token frequency becomes 2) is additionally marked by a subscript reading -1, indicating that they cease to be hapax legomena at this point, decreasing the overall count of hapaxes by one.
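This token-by-token bookkeeping is easy to mechanize (a sketch):

```python
def growth_curve(tokens):
    """Cumulative (token, type, hapax) counts, item by item."""
    counts = {}
    curve = []
    types = hapaxes = 0
    for i, tok in enumerate(tokens, start=1):
        counts[tok] = counts.get(tok, 0) + 1
        if counts[tok] == 1:        # a new type, and (for now) a hapax
            types += 1
            hapaxes += 1
        elif counts[tok] == 2:      # first repetition: no longer a hapax
            hapaxes -= 1
        curve.append((i, types, hapaxes))
    return curve
```

Applied to the first eight tokens of the novel (it is a truth universally acknowledged that a), the final tuple is (8, 7, 6): eight tokens, seven types, six hapaxes.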
As we move through the text word by word, initially all words are new types and hapaxes, so the type and hapax counts rise at the same rate as the token count. However, it only takes eight tokens before we reach the first repetition (the word a), so while the token frequency rises to 8, the type count remains constant at seven and the hapax count falls to six. Six words later, there is another occurrence of a, so type and hapax counts remain at 12 and 11 respectively as the token count rises to 14, and so on. In other words, while the
already occurred, the more types there are to be re-used (put simply, speakers will encounter fewer and fewer communicative situations that require a new type), which makes it less and less likely for new types (including new hapaxes) to occur. Figure 11-1a shows how type and hapax counts fall substantially below token counts in the first 100 words of the novel; Figure 11-1b shows the same thing for the entire novel.
As Figure 11-1a shows, the type-token and hapax-token ratios shrink fairly quickly: after 20 tokens, the TTR is 18/20 = 0.9 and the HTR is 17/20 = 0.85; after 40 tokens, the TTR is 31/40 = 0.775 and the HTR is 26/40 = 0.65; after 60 tokens, the TTR is 42/60 = 0.7 and the HTR is 33/60 = 0.55, and so on (note also how the hapax-token ratio sometimes drops before it rises again, as words that were
Fig. 11-1a: TTR and HTR for the first 100 words of Pride and Prejudice
Fig. 11-1b: TTR and HTR for the entire text of Pride and Prejudice
If we zoom out and look at the entire novel, shown in Figure 11-1b, we see that the growth in hapaxes slows considerably, to the extent that it has almost stopped by the time we reach the end of the novel. The growth in types also slows, although not as much as in the case of the hapaxes. In both cases this means that the ratios will continue to fall as the number of tokens increases. Now imagine we wanted to use the TTR and the HTR as measures of Jane Austen's overall lexical productivity (referred to as lexical richness in computational
The same is true if we do not compare the TTR or HTR of entire texts but of particular affixes (or other linguistic rules). Consider Figures 11-2a and 11-2b, which show the TTR and the HTR of the verb suffixes -ize (occurring in words like realize, maximize or liquidize) and -ify (occurring in words like identify, intensify or liquify).
As we can see, the TTR and HTR of both affixes behave roughly like those of Jane Austen's vocabulary as a whole as we increase sample size: both of them grow fairly quickly at first before their growth slows down; the latter happens more quickly in the case of the HTR than in the case of the TTR, and, again, we observe that the HTR sometimes decreases as types that were hapaxes up to a particular point in the sample re-occur and cease to be hapaxes.
Fig. 11-2a: TTRs of -ize and -ify in the BROWN corpus
Fig. 11-2b: HTRs of -ize and -ify in the BROWN corpus
Taking into account the entire sample, the TTR for -ize is 151/1057 = 0.1429
and that for -ify is 52/498 = 0.1044; it seems that -ize is slightly more important to
the lexicon of English than -ify. A chi-square test suggests that the difference
is significant, but only relatively weakly and with a very small effect size (cf.
Table 11-3a; χ² = 4.41, df = 1, p = 0.0358, φ = 0.0532).
Table 11-3a. Type/token ratios of -ize/-ise and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY        Total
TYPE   NEW       Obs. 151         Obs. 52           203
                 Exp. 137.99      Exp. 65.01
       NOT NEW   Obs. 906         Obs. 446        1,352
                 Exp. 919.01      Exp. 432.99
       Total        1,057            498          1,555
Likewise, taking into account the entire sample, the HTR for -ize is 53/1057 =
0.0501 and that for -ify is 15/498 = 0.0301; it seems that -ize is slightly more
productive than -ify. However, the difference is very small, and a chi-square test
shows that it is not significant and that the effect size would be very small even if
it were (cf. Table 11-3b; χ² = 3.24, df = 1, p = 0.0716, φ = 0.0021).
Table 11-3b. Hapax/token ratios of -ize/-ise and -ify in the BROWN corpus

                                  AFFIX
                      -IZE/-ISE         -IFY        Total
TYPE   HAPAX      Obs. 53           Obs. 15            68
                  Exp. 46.22        Exp. 21.78
       NOT HAPAX  Obs. 1,004        Obs. 483        1,487
                  Exp. 1,010.78     Exp. 476.22
       Total         1,057             498          1,555
However, note that -ify has a token frequency that is less than half of that of -ize,
so the sample is much smaller: as in the example of lexical richness in Pride and
Prejudice, this means that the TTR and the HTR of this smaller sample are
exaggerated and our comparisons are, in fact, completely meaningless.
The simplest way of solving the problem of different sample sizes is, of course,
to create samples of equal size for the purposes of comparison. We simply take
the size of the smaller of our two samples and draw a random sample of the same
size from the larger of the two samples (if our data sets are large enough, it would
be even better to draw random samples for both affixes). This means that we lose
some data, but there is nothing we can do about this (note that we can still
include the discarded data in a qualitative description of the affix in question).
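As a minimal sketch of this procedure (the function names and the toy data are my own, not taken from the book), the type-token ratio, the hapax-token ratio, and the drawing of an equal-sized random subsample can be computed as follows:

```python
import random
from collections import Counter

def ttr_htr(tokens):
    """Type/token ratio and hapax/token ratio of a list of tokens."""
    counts = Counter(tokens)
    types = len(counts)                                   # distinct forms
    hapaxes = sum(1 for f in counts.values() if f == 1)   # forms occurring once
    return types / len(tokens), hapaxes / len(tokens)

def equal_size_sample(tokens, n, seed=0):
    """Random subsample of n tokens, for comparing samples of equal size."""
    return random.Random(seed).sample(tokens, n)
```

For example, ttr_htr(["a", "b", "a", "c"]) returns (0.75, 0.5): three types and two hapaxes over four tokens. Fixing the random seed makes the subsample reproducible, which matters if readers are to retrace the calculations.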
Figures 11-3a and 11-3b show the growth rates of the TTR and the HTR of a
random sub-sample of 498 tokens of -ize in comparison with the total sample of
-ify.
[Fig. 11-3a: TTRs of -ize (sample) and -ify in the BROWN corpus]
[Fig. 11-3b: HTRs of -ize (sample) and -ify in the BROWN corpus]
The TTR of -ize based on the random sub-sample is 113/498 = 0.2269, that of -ify
is still 0.1044; the difference between the two suffixes is much clearer now, and a
chi-square test shows that it is highly significant and the effect size is more than
80 times higher than in our inappropriate comparison above (cf. Table 11-4a; χ² =
27.0293, df = 1, p = 2.004e-07, φ = 0.1647).
Table 11-4a. Type/token ratios of -ize/-ise (sample) and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY        Total
TYPE   NEW       Obs. 113         Obs. 52           165
                 Exp. 82.5        Exp. 82.5
       NOT NEW   Obs. 385         Obs. 446          831
                 Exp. 415.5       Exp. 415.5
       Total          498            498            996
Likewise, the HTR of -ize based on our sub-sample is 48/498 = 0.0964, the HTR
of -ify remains 0.0301. Again, the difference is much clearer, and it, too, is now
highly significant with a noticeable effect size (cf. Table 11-4b; χ² = 18.4529, df = 1,
p = 1.742e-05, φ = 0.1361).
Table 11-4b. Hapax/token ratios of -ize/-ise (sample) and -ify in the BROWN corpus

                                 AFFIX
                      -IZE/-ISE        -IFY        Total
TYPE   HAPAX     Obs. 48          Obs. 15            63
                 Exp. 31.5        Exp. 31.5
       NOT HAPAX Obs. 450         Obs. 483          933
                 Exp. 466.5       Exp. 466.5
       Total          498            498            996
(directly occurs 9 times in the entire play, briefly occurs 4 times and truly 8 times).
This means that while we must draw random samples of equal size in order to
compare HTRs, we should make sure that these samples are as large as possible.
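The chi-square comparisons in this section all follow the standard 2×2 pattern. As a sketch (the function name is my own), the statistic and the phi effect size can be computed directly from a 2×2 table of observed frequencies; the observed frequencies from Table 11-4b reproduce the reported χ² = 18.4529 and φ = 0.1361:

```python
from math import sqrt

def chisq_2x2(table):
    """Pearson chi-square and phi for a 2x2 table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = row_totals[i] * col_totals[j] / n   # expected frequency
            chi2 += (obs - exp) ** 2 / exp
    return chi2, sqrt(chi2 / n)                       # phi = sqrt(chi2 / N)
```

Calling chisq_2x2([[48, 15], [450, 483]]) returns approximately (18.45, 0.136), matching the values reported for Table 11-4b above.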
insights into constraints that an affix places on its stems (and obviously, corpus-
based approaches are without an alternative in diachronic studies and they yield
particularly interesting results when used to study changes in the quality or
degree of productivity, cf. for example Dalton-Puffer 1996).
In the study of constraints placed by derivational affixes on the stems that
they combine with, the combinability of derivational morphemes (in an absolute
sense or in terms of preferences) is of particular interest. Again, corpus linguistics
is a uniquely useful tool to investigate this.
Finally, there are cases where two derivational morphemes are in direct
competition because they are functionally roughly equivalent (e.g. -ness and -ity,
both of which form abstract nouns from typically adjectival bases, -ize and -ify,
which form process verbs from nominal and adjectival bases, or -ic and -ical,
which form adjectives from typically nominal bases). Here, too, corpus linguistics
provides useful tools, for example to determine whether the choice between
affixes is influenced by syntactic, semantic or phonological properties of stems.
non-hapax types and 19 of the twenty most frequent types found in the BNC are
older than the 19th century. Thus, it is possible that the constraints observed in
preference for those polysyllabic stems which already have the stress on the final
syllable.
Plag simply checks his neologisms against the literature, but we will evaluate
the claims from the literature quantitatively. Our main hypotheses will be that
neologisms with -ify do not differ from established types with respect to the fact
that the syllable directly preceding the suffix must carry primary stress, with the
consequences that (i) they prefer monosyllabic stems, and (ii) if the stem is
polysyllabic, they prefer stems that already have the primary stress on the last
syllable. Our independent variable is therefore LEXICAL STATUS with the values
ESTABLISHED WORD vs. NEOLOGISM (which will be operationalized presently). Our
dependent variables are SYLLABICITY with the values MONOSYLLABIC and POLYSYLLABIC,
and STRESS SHIFT with the values REQUIRED vs. NOT REQUIRED (both of which should be
self-explanatory).
Our design compares two predefined groups of types with respect to the
distribution that particular properties have in these groups; this means that we do
not need to calculate TTRs or HTRs, but that we need operational definitions of
the values ESTABLISHED WORD and NEOLOGISM. Following Plag, let us define NEOLOGISM
as "coined in the 20th century", but let us use a large historical dictionary (the
Oxford English Dictionary, 3rd edition) and a large corpus (the BNC) in order to
identify words matching this definition; this will give us the opportunity to
evaluate the idea that hapax legomena are a good way of operationalizing
productivity.
Excluding cases with prefixed stems, the OED contains 456 entries or sub-
entries for verbs with -ify, 31 of which are first documented in the 20th century.
Of the latter, 21 do not occur in the BNC at all, and 10 do occur in the BNC, but
are not hapaxes (see Table 11-5a below). The BNC contains 30 hapaxes, of which
13 are spelling errors and 7 are first documented in the OED before the 20th
stem-final consonants are deleted (as in liquid liquify); cf. Plag 1999 for a more
detailed discussion.
hapax legomenon in the BNC, using the formulas introduced in Chapter 5 (cf.
5.1a, b). Precision is defined as the number of true positives (items that were
found and that actually are what they are supposed to be) divided by the number
of all positives (all items found); 10 of the 30 hapaxes in the BNC are actually
neologisms, so the precision is 10/30 = 0.3333. Recall is defined as the number of
true positives divided by the number of true positives and false negatives (i.e., all
items that should have been found); 10 of the 45 neologisms were actually found
by using the hapax definition, so the recall is 10/45 = 0.2222. In other words,
neither precision nor recall of the method are very good, at least for moderately
productive affixes like -ify (the method will give better results with highly
productive affixes). Let us also determine the recall of neologisms from the OED
(using the definition "first documented in the 20th century according to the
OED"): the OED lists 31 of the 45 neologisms, so the recall is 31/45 = 0.6889; this
is much better than the recall of the corpus-based hapax-definition, but it also
shows that if we combine corpus data and dictionary data, we can increase
coverage substantially even for moderately productive affixes.
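Precision and recall as used here can be sketched in a few lines; the function name and the sets passed in are my own illustration, with set sizes chosen to mirror the figures above (30 items found, 45 relevant items, 10 in the overlap):

```python
def precision_recall(found, relevant):
    """Precision and recall of a retrieval method, given the set of items
    it found and the set of items it should have found."""
    true_positives = len(found & relevant)   # items found that are relevant
    return true_positives / len(found), true_positives / len(relevant)

# Toy sets with the same sizes as in the -ify case study: 30 hapaxes found,
# 45 actual neologisms, 10 items in common.
found = set(range(30))
relevant = set(range(20, 65))
precision, recall = precision_recall(found, relevant)   # (0.3333..., 0.2222...)
```

With real data, found would be the set of hapax legomena extracted from the corpus and relevant the set of neologisms established independently (e.g. via the OED).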
faintify, fuzzify, lewisify, rawify, rockify, sickify, sonify, validify, yankify, yukkify
Non-Hapax Legomena in the BNC that are not listed in the OED
(iv) commodify, desertify, extensify, geriatrify
Let us now turn to the definition of ESTABLISHED TYPES. Given our definition of
NEOLOGISMS, established types would first have to be documented before the 20th
century, so we could use the 420 types in the OED that meet this criterion (again,
excluding prefixed forms). However, these 420 types contain many very rare or
even obsolete forms, like duplify 'to make double', eaglify 'to make into an eagle'
or naucify 'to hold in low esteem'. Clearly, these are not established in any
meaningful sense, so let us add the requirement that a type must occur in the
BNC at least twice to count as established. Let us further limit the category to
verbs first documented before the 19th century, in order to leave a clear
diachronic gap between the established types and the productive types.34
stratify, stultify, terrify, testify, transmogrify, typify, uglify, unify, verify, versify, vilify,
vitrify, vivify
Let us now evaluate the hypotheses. Table 11-6a shows the type frequencies for
monosyllabic and polysyllabic stems in the two samples. In both cases, there is a
preference for monosyllabic stems (as expected), but interestingly, this preference
is less strong among the neologisms than among the established types, and this
difference is significant (χ² = 7.37, df = 1, p < 0.01, φ = 0.2577).
Table 11-6a. Monosyllabic and polysyllabic stems with -ify

                                   NO. OF SYLLABLES
                          MONOSYLLABIC     POLYSYLLABIC    Total
LEXICAL  ESTABLISHED      Obs. 57          Obs. 9             66
STATUS                    Exp. 51.14       Exp. 14.86
         NEOLOGISM        Obs. 29          Obs. 16            45
                          Exp. 34.86       Exp. 10.14
         Total               86               25             111
34 Interestingly, leaving out words coined in the 19th century does not make much of a
difference: although the 19th century saw a large number of coinages (with 138 new
types it was the most productive century in the history of the suffix), few of these are
frequent enough today to occur in the BNC; if anything, we should actually extend
our definition of neologisms to include the 19th century.
cannot use the chi-square test here, since half of the expected frequencies are
below 5, but Fisher's exact test confirms that the difference is not significant).

Table 11-6b. Stress shift with polysyllabic -ify stems

                                     STRESS SHIFT
                          REQUIRED         NOT REQUIRED    Total
LEXICAL  ESTABLISHED      Obs. 3           Obs. 6              9
STATUS                    Exp. 4.68        Exp. 4.32
         NEOLOGISM        Obs. 10          Obs. 6             16
                          Exp. 8.32        Exp. 7.68
         Total               13               12              25
b) Case Study: Semantic differences between -ic and -ical. Affixes, like words, can be
related to other affixes by lexical relations like synonymy, antonymy etc. In the
case of (roughly) synonymous affixes, an obvious research question is what
determines the choice between them, for example, whether there are more fine-
grained semantic differences that are not immediately apparent.
One way of approaching this question is to focus on stems that occur with
both affixes (such as liqui(d)- in liquidize and liquify/liquefy, scarce in scarceness
and scarcity or electr- in electric and electrical) and to investigate the semantic
contexts in which they occur, for example by categorizing their collocates,
analogous to the way Taylor (2001) categorizes collocates of high and tall (cf.
Chapter 9, Section 9.2.2).
A good example of this approach is found in Kaunisto (1999), who investigates
the pairs electric/electrical and classic/classical on the basis of the British
newspaper Daily Telegraph. Since his corpus is not accessible, we will use the
BROWN corpus instead to replicate his study for electric/electrical. It is a study
with two nominal variables: AFFIX VARIANT (with the values -IC and -ICAL), and
SEMANTIC CATEGORY (with a set of values to be discussed presently). Note that this
design can be based straightforwardly on token frequency, as we are not
concerned with the relationship between the stem and the affix, but with the
relationship between the stem-affix combination and the nouns modified by it.
Put differently, we are not using the token frequency of a stem-affix combination,
but of the collocates of words derived by a particular affix.
Kaunisto uses a mixture of dictionaries and existing literature to identify
potentially interesting values for the variable SEMANTIC CATEGORY; we will restrict
ourselves to dictionaries here:
(11.6) electric
a) connected with electricity; using, produced by or producing electricity
[OALD]
b) of or relating to electricity; operated by electricity [MW]
(11.7) electrical
a) connected with electricity; using or producing electricity [OALD]
b) of or relating to electricity; operated by electricity [MW] [mentioned as
MW treats the two words as largely synonymous and OALD distinguishes them
only insofar as mentioning for electric, but not electrical, that it may refer to
phenomena produced by electricity (this is meant to cover cases like electric
current/charge); however, since both words are also defined as referring to anything
connected with electricity, this is not much of a differentiation (the entry for
electrical also mentions electrical power/energy). Macmillan's Dictionary also
treats them as largely synonymous, although it is pointed out specifically that
electric refers to entities carrying electricity (citing electric outlet/plug/cord).
CALD and LDCE present electrical as a more general word for anything related
to electricity, whereas they mention specifically that electric is used for things
worked by electricity (e.g. electric light/appliance) or carrying electricity
(presumably cords, outlets etc.) and phenomena produced by electricity
engineers, etc.).
Summarizing, we can posit the following five broad values for our variable
SEMANTIC CATEGORY:
DEVICES and appliances working by electricity
ENERGY in the form of electricity
COMPANIES producing or supplying energy
CIRCUITS carrying electricity (including their components)
ENGINEERS and their work with electricity, circuits, devices
Table 11-7 shows the token frequency with which words from these categories
are referred to as electric or electrical, and it also lists all types found for each
category.
[Table 11-7. Token frequencies and types of electric and electrical for each
semantic category; only fragments survive here, among them the types "spit,
tool, toothbrush", "circuit, component, line, material" and "component, contact,
control, line, outlet, pickoff, torquer, wiring", and the totals 58 (electric), 43
(electrical), 101 (overall).]
Electric Company, Industrial Electric, Narragansett Electric, Vermont Hydro-Electric
Corporation, and Westinghouse Electric Corp.) and only one with electrical
(Metropolitan Vickers Electrical Co.), which corroborates the trend for companies to
be referred to by the adjective electric.
In sum, there is a significant tendency for electric to be used with devices (and,
if we were to include proper names, companies), and electrical for people and
their work with electricity. In addition, there are (non-significant) tendencies for
electrical to be used with circuits and energy. This is broadly similar to Kaunisto's
(1999) findings (he goes on to look at the categories in more detail, revealing even
more fine-grained differences).
Table 11-8. Differential nominal collocates of electric and electrical in the BNC

KEY WORD     O11    O12     O21     O22      G²
MOST STRONGLY ASSOCIATED WITH ELECTRIC
window        38      0   2,794   2,029   41.27
kettle        37      0   2,795   2,029   40.18
cooker        34      0   2,798   2,029   36.91
drill         39      1   2,793   2,028   34.75
train         32      0   2,800   2,029   34.73
Co.           27      0   2,805   2,029   29.28
vehicle       24      0   2,808   2,029   26.02
lighting      21      0   2,811   2,029   22.76
fan           21      0   2,811   2,029   22.76
tramway       20      0   2,812   2,029   21.67
fence         20      0   2,812   2,029   21.67
traction      19      0   2,813   2,029   20.58
collocates we already know to distinguish significantly between the variants.
In conclusion, let me stress once more that this kind of study does not
primarily uncover differences between affixes, but differences between specific
word pairs containing these affixes. They are, as pointed out above, essentially
lexical studies of near-synonymy. Of course, it is possible that by performing such
analyses for a large number of word pairs containing a particular affix pair,
general semantic differences may emerge, but since we are frequently dealing
with highly lexicalized forms, there is no guarantee for this. Gries (2003, cf. also
2001) has shown that -ic/-ical pairs differ substantially in the extent to which they
are synonymous; for example, he finds substantial difference in meaning for
c) Case Study: Phonological differences between -ic and -ical. In an interesting but
in the BROWN corpus, and 240 with -ical (since the point is to show the influence
of length on suffix choice, prefixed stems, compound stems etc. are included in
this figure). The mean length is 2.86 (sample variance: 1.35) for stems occurring
with -ic and 2.72 (sample variance: 1.59) for stems occurring with -ical. Applying
the formula in (7.15) from Chapter 7, we get a t-value of 1.54, which, at 392.82
degrees of freedom (which we determine using the formula in 7.17), is not
significant (p = 0.12).35
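The t-test for independent samples with unequal variances (Welch's procedure, which is what a non-integer degrees-of-freedom value like 392.82 suggests) can be sketched from summary statistics alone. The function name is my own, and since the number of -ic stems does not survive in this excerpt, no attempt is made to reproduce the t = 1.54 reported above; the sketch simply shows the shape of the calculation:

```python
from math import sqrt

def welch_t(m1, v1, n1, m2, v2, n2):
    """Welch's t statistic and degrees of freedom for two independent samples,
    from their means (m), sample variances (v) and sample sizes (n)."""
    se1, se2 = v1 / n1, v2 / n2                    # squared standard errors
    t = (m1 - m2) / sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df
```

With the values from the case study one would call, e.g., welch_t(2.86, 1.35, n_ic, 2.72, 1.59, 240), where n_ic stands in for the -ic stem count given earlier in the chapter.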
Note that in this kind of design we have used samples of types rather than
tokens again, because we are investigating a claim about an influence of stem
length on affix choice; in this context, it is important to determine how many
stems of a given length occur with a particular affix variant, but it does not
matter how often a particular stem does so.
d) Case Study: Affix combinations. It has often been observed that certain
derivational affixes show a preference for stems that are already derived by a
particular affix. This observation is potentially relevant to cases of alternation
between suffixes, as such preferences may be one of the factors influencing the
choice between the variants.
For example, Lindsay (2011) and Lindsay and Aronoff (2013) show that while
-ic is more productive in general than -ical (see preceding case study), -ical is
preferred with stems that contain the affix -olog- (for example, morphological or
methodological). They show this by comparing the ratio of the two suffixes for
stems that occur with both suffixes; I will return to their approach presently. First,
though, let us see what we can find out by looking at the overall distribution of
types. The BROWN and LOB corpora taken together contain 737 different stems
occurring with either -ic or -ical or both (spelling differences were ignored and,
since we are interested in suffix combinations, all prefixes were removed before
counting stem types). There is a clear overall preference for -ic (559 types) over
-ical (178 types). This general preference can now be used to determine expected
frequencies for stems containing a particular suffix in a one-by-n design (as
described in Chapter 7, Section 7.2.2 above): 75.85 percent of all -ic(al) stem types
occur with -ic, 24.15 percent occur with -ical. If other suffixes have no influence
35 As mentioned in Chapter 7, word length (however we measure it) rarely follows a
normal distribution, so the Mann-Whitney U test would be the better test in this case.
Here are the rank sums, if you want to calculate it: 313061.5 for stems occurring with
-ic and 104179.5 for stems occurring with -ical. You will find that the difference is not
significant according to the U test either (which is unsurprising, since the median
length for both suffixes is 3 syllables).
on this preference, stems containing a particular suffix (like -olog-) should show
the same proportion. There are 46 such stems (5 with -ic, 41 with -ical), so the
expected frequency for -ic is 46 × 0.7585 = 34.89, for -ical it is 46 × 0.2415 = 11.11.
This is shown in Table 11-9a.
Table 11-9a. Observed and expected frequencies of -ic and -ical with stems containing -olog-

                  STEMS WITH -IC    STEMS WITH -ICAL    Total
-OLOG-   Obs.     5                 41                     46
         Exp.     34.89             11.11
         χ²-Cp.   25.61             80.42
The difference between the observed distribution of -ic and -ical with -olog- stems
and that expected on the basis of the overall distribution of the suffixes in the
lexicon is highly significant (χ² = 106.02, df = 1, p < 0.001): stems with -olog- do,
indeed, favor -ical against the general trend.
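The one-by-n computation just described can be sketched as follows (the function name is mine); feeding it the observed counts for -olog- stems and the proportions derived from the overall type distribution (559/737 and 178/737) reproduces the expected frequencies and χ² components of Table 11-9a:

```python
def goodness_of_fit(observed, proportions):
    """Chi-square goodness-of-fit test (one-by-n design): observed counts
    against expected proportions. Returns the statistic and its components."""
    n = sum(observed)
    components = [(obs - n * p) ** 2 / (n * p)          # (O - E)^2 / E
                  for obs, p in zip(observed, proportions)]
    return sum(components), components

chi2, comps = goodness_of_fit([5, 41], [559 / 737, 178 / 737])
# chi2 is approximately 106.02, the components approximately 25.61 and 80.42
```

Summing the per-cell components rather than returning only the total makes it easy to see which cell contributes most to the deviation, as in the χ²-Cp. row of Table 11-9a.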
This could be due specifically to the suffix -olog-, but it could also be a general
preference of derived stems for -ical. There are two more suffixes that occur
frequently enough among the types with -ic(al) to test this: -ist (as in altruistic or
STEM TYPES
STEMS WITH -IC STEMS WITH -ICAL Total
difference at all. In contrast, there is a significant difference for stems with the
suffix -ist (χ² = 14.5, df = 1, p < 0.001), but it goes in the opposite direction: the
preference of stems with -ist for -ic is even stronger than the overall preference.
Thus, individual suffixes may differ in their preference for other suffixes.
Note that we have been talking of a preference of particular stems for one or
the other suffix, but this is somewhat imprecise: we looked at the total number of
stem types with -ic and -ical and the three additional suffixes. While differences
in number are plausibly attributed to preferences, they may also be purely
historical leftovers due to the specific history of the two suffixes (which is rather
complex, involving borrowing from Latin, Greek and, in the case of -ical, French).
frequently with -ic or with -ical and calculate a preference ratio. They then
compare the ratio of all stems to that of stems with -olog- (see Table 11-10).
Table 11-10. Stems favoring -ic or -ical in the COCA (Lindsay 2011: 194)

                   TOTAL STEMS   RATIO   -OLOG- STEMS   RATIO
FAVORING -IC             1,197     4.5             13     1
FAVORING -ICAL             268     1               73     5.6
Total                    1,465                     86
suffixes, so we will use the BNC instead. The BNC contains 196 stems that take
both suffixes, which is too many to show in full; since I want to show all relevant
information so that you can retrace the necessary steps for yourself, let us include
only the stems with -olog- and -graph-, as the latter were shown by our previous
analysis to be typical of -ic(al) stems in general. Table 11-11 shows the relevant
data.
The preference of stems with -olog- for the variant -ical is very obvious even
on purely visual inspection of the table. Stems with -graph- are found with preferences
ranging from a 99.89 percent preference for -ic to a 90 percent preference for -ical
(i.e., a 10 percent preference for -ic), i.e., they cover almost the entire range of
possible preferences. In contrast, stems with -olog- are found within a
very narrow range of preferences for -ical, namely 99.88 percent (a 0.12 percent
preference for -ic) to 80 percent (a 20 percent preference for -ic). Accordingly, there is
very little overlap, and the medians of the two suffixes differ substantially: 9.5 for
-graph- and 28.5 for -olog-. A Mann-Whitney U-test shows this difference to be
highly significant (U =
36 In fact, they are cardinal data, as the value can range from 0 to 1 with every possible
value in between; it is safer to treat them as ordinal data, however, because we don't
know whether such preference values are normally distributed (in fact, since they are
based on word frequency data, which we know not to be normally distributed, it is a
fair guess that the preference data are not normally distributed).
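The preference value P(-IC) in Table 11-11, and the ranking it induces, can be sketched like this (the function name and the two-stem toy input are my own; the frequencies are taken from the table):

```python
def rank_by_ic_preference(freqs):
    """Rank stems by their preference for -ic, highest first.
    freqs maps a stem to a pair (frequency with -ic, frequency with -ical);
    the preference is f(-ic) / (f(-ic) + f(-ical)), i.e. P(-IC) in Table 11-11."""
    preference = {stem: f_ic / (f_ic + f_ical)
                  for stem, (f_ic, f_ical) in freqs.items()}
    return sorted(preference.items(), key=lambda item: item[1], reverse=True)

ranked = rank_by_ic_preference({"photograph-": (890, 1), "hydrolog-": (7, 56)})
# photograph- comes first with P(-IC) of about 0.9989; hydrolog- has about 0.1111
```

Once each stem has a rank of this kind, the ranks for the two suffix groups can be compared with the Mann-Whitney U test, as the main text does.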
Table 11-11. Preference of stems containing -graph- and -olog- for the suffix
variants -ic and -ical (BNC)

STEM           SUFFIX     F(-IC)   F(-ICAL)   P(-IC)   RANK
photograph-    -graph-       890          1   0.9989      1
radiograph-    -graph-        44          1   0.9778      2
cartograph-    -graph-        79          2   0.9753      3
orthograph-    -graph-        77          3   0.9625      4
ethnograph-    -graph-       200          8   0.9615      5
calligraph-    -graph-        13          1   0.9286      6
physiograph-   -graph-        10          2   0.8333      7
petrograph-    -graph-        34          7   0.8293      8
thermograph-   -graph-         3          1   0.75        9
lexicograph-   -graph-        21         10   0.6774     10
iconograph-    -graph-        16         10   0.6154     11
bibliograph-   -graph-       203        130   0.6096     12
hagiograph-    -graph-         4          6   0.4        13
stratigraph-   -graph-        67        120   0.3583     14
typograph-     -graph-        42         82   0.3387     15
topograph-     -graph-        48        139   0.2567     16
pedolog-       -olog-          2          8   0.2        17
geograph-      -graph-       299      1,607   0.1569     18
hydrolog-      -olog-          7         56   0.1111     19
a) Case study: Productivity and genre. Guz (2009) studies the prevalence of
different kinds of nominalization across genres. The question whether the
productivity of derivational morphemes differs across genres is a very interesting
one, and Guz presents a potentially more detailed analysis than previous studies
in that he looks at the prevalence of different stem types for each affix, so that
qualitative as well as quantitative differences in productivity could, in theory, be
studied. In practice, unfortunately, the study offers preliminary insights at best, as
it is based entirely on token frequencies, which, as discussed in Section 11.1
and our hypothesis (for the sake of the argument) will be that it is more
productive in prose fiction, since authors of fiction are under pressure to use
language creatively (this is not Guz's hypothesis; his study is entirely
explorative).
The suffix has a relatively high token frequency: there are 2751 tokens in the
prose fiction section of the BNC, and 7173 tokens in the newspaper section
(including all sub-genres of newspaper language, such as reportage, editorial,
etc.). This difference is not due to the respective sample sizes: the prose-fiction
section in the BNC is much larger than the newspaper section; thus, token
frequency would suggest that the suffix is more important in newspaper language
than in fiction, if token frequency meant anything.
In order to compare the two genres in terms of the type-token and hapax-
token ratios, they need to have the same size; the following is based on random
newspaper sample is 62/2750 = 0.0218. The fact that both TTRs are very small
confirms that the suffix -ship is not very central to the English lexicon. Crucially,
the difference is also very small and, as Table 11-12 shows, it is not statistically
significant, although it only just misses the 5% level (χ² = 3.5, df = 1, p = 0.0614). If
the difference turned out to be significant in a study based on a larger sample,
this could be due to a higher importance of the suffix in prose fiction, but it is also
possible that fiction has a higher overall lexical richness.
Table 11-12. Types with -ship in prose fiction and newspapers

GENRE
Let us now turn to the hapax legomena, which are listed in Table 11-13a. At first
glance, there seem to be twenty-six hapax legomena in the prose-fiction sample,
and seventeen in the newspaper sample. But here, too, there is substantial
overlap. For example, craftsmanship, managership and ministership occur as hapax
legomena in both samples; other words that are hapaxes in one subsample occur
several times in the other. Obviously, such words cannot be considered hapaxes:
in designs comparing an affix across two subsamples, a word can only be
regarded as a hapax legomenon if its overall frequency in the combined sub-
samples is still 1 (this may seem obvious, but it is sometimes overlooked in the
literature).
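The rule just stated, that a word counts as a hapax only if its frequency in the combined subsamples is 1, can be sketched as follows (the function name and the toy word lists are my own):

```python
from collections import Counter

def hapaxes_against_combined(sample_a, sample_b):
    """Hapax legomena of each subsample, counted against the combined sample:
    a word only counts as a hapax if its total frequency in A + B is 1."""
    combined = Counter(sample_a) + Counter(sample_b)
    hapaxes_a = sorted(w for w in set(sample_a) if combined[w] == 1)
    hapaxes_b = sorted(w for w in set(sample_b) if combined[w] == 1)
    return hapaxes_a, hapaxes_b

# Toy illustration: managership and kingship occur in both lists, so neither
# counts as a hapax; only headship and wardership survive.
a = ["kingship", "headship", "managership"]
b = ["managership", "kingship", "wardership"]
```

Counting against the combined sample rather than each subsample separately is exactly what reduces the twenty-six and seventeen apparent hapaxes above to the 22 and 8 used in the HTR calculations that follow.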
Table 11-13a. Hapax legomena with -ship in the prose fiction and newspaper samples (BNC)

PROSE FICTION:
connoisseurship, convenership, craftsmanship, executorship, gamesmanship,
generalship, headship, highwaymanship, honourship, impress-ship, mageship,
managership, mediumship, ministership, penmanship, professorship, queenship,
rulership, studentship, throneship, wardership, studentship

NEWSPAPERS:
craftsmanship, distributorship, entrepreneurship, guardianship, kingship,
listenership, managership, ministership, parentship, salesmanship,
statesmanship, traineeship

Totals: 26 / 22 / 8 (prose fiction: sample / combined sample / whole BNC),
17 / 8 / 4 (newspapers)
The column "combined sample" shows those words that are still hapax legomena
if the two samples are combined: in the prose subsample, 22 hapaxes remain, so
the HTR is 22/2,750 = 0.008; in the newspaper subsample, only 8 hapaxes remain, so
the HTR is 8/2,750 = 0.0029. Like the TTRs, then, the HTRs are extremely small,
confirming our assumption that the suffix is not particularly productive in either
of the two genres. Again, the difference between the HTRs is also very small, but
as Table 11-13b shows, it is statistically significant (χ² = 5.86, df = 1, p < 0.05,
φ = 0.10).
[Table 11-13b. Hapax legomena with -ship in prose fiction and newspapers;
only a fragment survives here: HAPAX LEGOMENON Exp. 14.5 Exp. 14.5]
in the entire corpus is 1. The column "whole BNC" shows those words that would
remain hapaxes if the entire corpus were taken into consideration: 8 in the prose
sample and 4 in the newspaper sample. This difference is no longer statistically
significant (χ² = 1.34, df = 1, p > 0.05).
However, while this may seem a reasonable idea, it is not: by increasing the
size of the sample used to determine hapax-hood without increasing the size of
the sample in which we identify potential hapax legomena, we are distorting the
data. Our assumption when counting hapaxes in a sample is not that all instances
are actually hapaxes in the language as a whole, but that the hapax-token relation
in the sample is comparable to the hapax-token relation in the language as a
whole (which does not mean that it is identical; recall that the HTR falls as we
increase the size of our sample).
By taking the entire BNC into account when determining what counts as a
hapax legomenon, but not when identifying potential hapax legomena, we are
destroying the relationship between the HTR measured in our sample and the
HTR in the language as a whole. Ultimately, the number of hapax legomena will
fall to zero if we continue to define them against an ever larger sample. For
example, a single word from the list in Table 11-13a retains its status as a hapax
and spoken language.
b) Case study: Productivity and speaker sex. Morphological productivity has not
traditionally been investigated from a sociolinguistic perspective, but a study by
Säily (2009) suggests that this may be a promising field of research. Säily
investigates differences in the productivity of the suffixes -ness and -ity in the
language produced by men and women in the BNC. She finds no difference in
productivity for -ness, but a higher productivity of -ity in the language produced
by men (cf. also Säily and Suomela 2009 for a diachronic study with very similar
results). She uses a sophisticated method involving the comparison of the
suffixes' type and hapax growth rates, but let us replicate her study using the
simple method used in the preceding case study, beginning with a comparison of
type-token ratios.
Like the ICE-GB (cf. Chapter 7, Section 7.2.2.), the BNC contains substantially
more speech and writing by male speakers than by female speakers, which is
produced by men; for -ness, there are 616 tokens produced by women and 1,154
tokens produced by men (note that unlike Säily, I excluded the words business and
witness, since they did not seem to me to be synchronically transparent instances
of the affix). To get samples of equal size for each affix, random sub-samples were
drawn from the tokens produced by men.
Based on these subsamples, the type-token ratios for -ity are 0.0652 for men
and 0.0777 for women; as Table 11-15a shows, this difference is not statistically
significant (χ² = 3.01, df = 1, p > 0.05, φ = 0.0242).
Table 11-15a. Types with -ity in male and female speech (BNC)

                          SPEAKER SEX
                   FEMALE         MALE       Total
TYPE   NEW    Obs. 167       Obs. 199         366
              Exp. 183       Exp. 183
The type-token ratios for -ness are much higher, namely 0.1981 for women and
0.2597 for men. As Table 11-15b shows, the difference is statistically significant,
although the effect size is weak (χ² = 5.37, df = 1, p < 0.05, φ = 0.066).
Table 11-15b. Types with -ness in male and female speech (BNC)

                          SPEAKER SEX
                   FEMALE         MALE       Total
TYPE   NEW    Obs. 122       Obs. 156         278
              Exp. 139       Exp. 139
Note that Säily investigates spoken and written language separately and she also
includes social class in her analysis, so her results differ from the ones presented
here; she finds a significantly lower HTR for -ness in lower-class women's speech
in the spoken sub-corpus, but not in the written one, and a significantly lower
HTR for -ity in both sub-corpora. This might be due to the different methods
used, or to the fact that I excluded business, which is disproportionally frequent in
male speech and writing in the BNC and would thus reduce the diversity in the
male sample substantially. However, the type-based differences do not have a
very impressive effect size in our design and they are unstable across conditions
in Säily's, so perhaps they are simply not very substantial.
Let us turn to the HTR next. As before, we are defining what counts as a
hapax legomenon not with reference to the individual subsamples of male and
female speech, but with respect to the combined sample. Table 11-16a shows the
hapaxes for -ity in the male and female samples. The HTRs are very low,
suggesting that -ity is not a very productive suffix: 0.0099 in female speech and
0.016 in male speech.
Table 11-16a. Hapaxes with -ity in samples of male and female speech (BNC)
MALE
abnormality, antiquity, applicability, brutality, civility, criminality, deliverability, divinity, duplicity,
eccentricity, eventuality, falsity, femininity, fixity, frivolity, illegality, impurity, inexorability,
infallibility, infirmity, levity, longevity, mediocrity, obesity, perversity, predictability, rationality,
regularity, reliability, scarcity, seniority, serendipity, solidity, subsidiarity, susceptibility, tangibility,
verity, versatility, virtuality, vitality, voracity
FEMALE
absurdity, adjustability, admissibility, centrality, complicity, effemininity, enormity, exclusivity,
gratuity, hilarity, humility, impunity, inquisity, morbidity, municipality, originality, progility,
respectability, sanity, scaleability, sincerity, spontaneity, sterility, totality, virginity
Although the difference in HTR is relatively small, Table 11-16b shows that it is
statistically significant, albeit again with a very weak effect size (χ² = 3.93, df = 1,
p < 0.05, φ = 0.0277).
Table 11-16b. Hapax legomena with -ity in male and female speech (BNC)
              SPEAKER SEX
              FEMALE     MALE       Total
HAPAX         Obs. 25    Obs. 41    66
Table 11-17a shows the hapaxes for -ness in the male and female samples. The
HTRs are low, but much higher than for -ity: 0.0795 for women and 0.1023 for
men.
Table 11-17a. Hapaxes with -ness in samples of male and female speech (BNC)
FEMALE
ancientness, appropriateness, badness, bolshiness, chasifness, childishness, chubbiness, clumsiness,
conciseness, eagerness, easiness, faithfulness, falseness, feverishness, fizziness, freshness, ghostliness,
greyness, grossness, grotesqueness, heaviness, laziness, likeness, mysteriousness, nastiness,
outspokenness, pinkness, plainness, politeness, prettiness, priggishness, primness, randomness,
responsiveness, scratchiness, sloppiness, smoothness, stiffness, stretchiness, tenderness, tightness,
timelessness, timidness, ugliness, uncomfortableness, unpredictableness, untidiness, wetness, zombieness
MALE
abjectness, adroitness, aloneness, anxiousness, awfulness, barrenness, blackness, blandness, bluntness,
carefulness, centredness, cleansiness, clearness, cowardliness, crispness, delightfulness, differentness,
dizziness, drowsiness, dullness, eyewitnesses, fondness, fullness, genuineness, godliness, graciousness,
headedness, heartlessness, heinousness, keenness, lateness, likeliness, limitedness, loudness, mentalness,
messiness, narrowness, nearness, neighbourliness, niceness, numbness, pettiness, pleasantness,
plumpness, positiveness, quickness, reasonableness, rightness, riseness, rudeness, sameness, sameyness,
separateness, shortness, smugness, softness, soreness, springiness, steadiness, stubbornness,
timorousness, toughness, uxoriousness
As Table 11-17b shows, the difference in HTRs is not statistically significant, and
the effect size would be very weak anyway (χ² = 1.93, df = 1, p > 0.05, φ = 0.0395).
Table 11-17b. Hapax legomena with -ness in male and female speech (BNC)
                     SPEAKER SEX
                     FEMALE     MALE       Total
HAPAX LEGOMENON      Obs. 49    Obs. 63    112
TYPE                 Exp. 56    Exp. 56
In this case, the results correspond to Säily's, who also finds a significant
difference in productivity for -ity, but not for -ness.
This case study was meant to demonstrate, once again, the method of
comparing TTRs and HTRs based on samples of equal size. It was also meant to
draw attention to the fact that morphological productivity may be an interesting
area of research for variationist sociolinguistics; however, it must be pointed out
that it would be premature to conclude that men and women differ in their
productive use of particular affixes; as Säily herself points out, men and women
are not only represented unevenly in quantitative terms (with a much larger
proportion of male language included in the BNC), but also in qualitative terms
(the text types with which they are represented differ quite strikingly). Thus, this
may actually be another case of different degrees of productivity in different text
types (which we investigated in the preceding case study).
Further Reading
For a different proposal of how to evaluate TTRs statistically, see Baayen
(2008, Section 6.5); for a very interesting method of comparison for TTRs and
HTRs based on permutation testing instead of classical inferential statistics see
Säily and Suomela (2009).
12 Text
As mentioned repeatedly, linguistic corpora, by their nature, consist of word
forms, while other levels of linguistic representation are not represented unless
the corresponding annotations are added. In written corpora, there is one level
other than the lexical that is (or can be) directly represented: the text. Well-
constructed linguistic corpora typically consist of (samples from) individual texts,
whose meta-information (author, title, original place and context of publication,
etc.) is known. There is a substantial body of corpus-linguistic research based on
designs that combine the two inherently represented variables WORD (FORM) and
TEXT; such designs may be concerned with the occurrence of words in individual
texts, or, more typically, with the occurrence of words in clusters of texts
belonging to the same text type (defined by topic, genre, function, etc.).
Texts are, of course, produced by speakers, and depending on how much and
what type of information about these speakers is available, we can also cluster
texts according to demographic variables such as dialect, socioeconomic status,
gender, age, political or religious affiliation, etc. (as we have done in many of the
examples in earlier chapters). In these cases, quantitative corpus linguistics is
essentially a variant of sociolinguistics that differs mainly in that the linguistic
phenomena it pays most attention to are not necessarily those most central to
sociolinguistic research in general.
given text or set of texts, where 'unusual' means high by comparison with a
reference corpus of some kind (Scott 1997: 236).
In other words, the corpus-linguistic identification of key words is analogous
to the identification of differential collocates, except that it analyses the
association of a word W to a particular text (or collection of texts) T in
comparison to the language as a whole (as represented by the reference corpus,
which is typically a large, balanced corpus). Table 12-1 shows this schematically.
                TEXT/CORPUS T    REFERENCE CORPUS    Total
WORD W          O11              O12                 R1
OTHER WORDS     O21              O22                 R2
Total           C1               C2                  N
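The association measure used for keyness throughout this chapter, the log-likelihood statistic G2, can be computed directly from the four cells of such a table. The following is a sketch, not the implementation used for the book (the function name is my own); as a sanity check, it reproduces, within rounding, the G2 value reported for number in Table 12-5:

```python
from math import log

def g2_keyness(o11, o12, o21, o22):
    """Log-likelihood (G2) for a 2x2 frequency table as in Table 12-1:
    o11/o12 = frequency of the word in the text and the reference corpus,
    o21/o22 = frequencies of all other words."""
    n = o11 + o12 + o21 + o22
    observed = [o11, o12, o21, o22]
    expected = [(o11 + o12) * (o11 + o21) / n, (o11 + o12) * (o12 + o22) / n,
                (o21 + o22) * (o11 + o21) / n, (o21 + o22) * (o12 + o22) / n]
    # Cells with an observed frequency of 0 contribute nothing to the sum.
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)

# The 'number' row of Table 12-5: G2 comes out at roughly 413.
print(round(g2_keyness(54, 119, 518, 180440), 2))
```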
Just like collocation analysis, key word analysis is typically applied inductively,
but there is nothing that precludes a deductive design if we have hypotheses
concerning either the topic area or some stylistic property of a given text. When
applied to text types (registers, genres, etc.), the aim is typically to identify
general lexical and/or grammatical properties of the respective text type. As a
first example of the kind of results that key-word analysis yields, consider Table
12-2, which shows the 20 most frequent tokens (including punctuation marks) in
the BROWN corpus and two individual texts.
As we can see, the differences are relatively small, as all lists are dominated by
frequent function words and punctuation marks. Ten of these occur on all three
lists (a, and, in, of, that, the, to, was, the comma and the period), and another
seven occur on two of them (as, he, his, it, on, and opening and closing quotation
marks). Even the types that occur only once are mostly uninformative with
respect to the type of text we may be dealing with (1959, at, by, for, had, is, with,
the hyphen and opening and closing parentheses). The only exceptions are four
content words in Text A: Neosho, river, species, station; these suggest that the
text is about rivers (Neosho being the name of a river) and perhaps biology (as
suggested by the word species).
Table 12-2. The most frequent tokens in the BROWN corpus and two individual texts
BROWN               TEXT A               TEXT B
AND      2.48       OF        3.33       A        2.31
TO       2.24       AND       2.62       TO       2.08
A        1.99       (         1.48       HE       1.92
IN       1.83       )         1.48       AND      1.50
THAT     0.93       AT        1.27                1.49
IS       0.88       WAS       1.27                1.48
WAS      0.86       TO        1.27       HIS      1.40
HE       0.84       A         1.27       WAS      1.38
FOR      0.81       WERE      1.21       IN       1.12
IT       0.78       NEOSHO    1.09       THAT     1.10
         0.76       SPECIES   0.85       HAD      0.99
Applying key word analysis to each text or collection of texts identifies the
words that differ most significantly in frequency from the reference corpus and
thereby tells us how the text in question differs lexically from the (written)
language of its time as a whole. Table 12-3a lists the key words for Text A.
The key words now convey a very specific idea of what the text is about: there
are two proper names of rivers (the Neosho already seen on the frequency list and
the Marais des Cygnes, represented by its constituents Cygnes, Marais and des),
there are a number of words for specific species of fish as well as the words river
and channel.
Table 12-3a. Key words for Text A
WORD         O11    O12    O21      O22        G2
FISH         104    35     22316    1165286    670.66
CYGNES       83     0      22337    1165321    659.30
MARAIS       83     0      22337    1165321    659.30
CATFISH      80     2      22340    1165319    616.73
DES          84     9      22336    1165312    608.45
SHINER       75     0      22345    1165321    595.72
CHANNEL      81     16     22339    1165305    557.14
ABUNDANCE    78     13     22342    1165308    545.42
MINNOW       63     0      22357    1165321    500.38
LOWER        100    123    22320    1165198    492.31
TAKEN        122    281    22298    1165040    485.74
Neosho and Marais des Cygnes Rivers of Kansas (available via Project Gutenberg, see Appendix A.1).
Note that the occurrence of some tokens (such as the dates and the parentheses)
may be characteristic of a text type rather than an individual text, a point we will
return to below.
Next, consider Table 12-3b, which lists the key words for Text B. Three things
are noticeable: the keyness of a number of words that are most likely proper
names (Hume, Vye, Rynch, Wass, Brodie and Jumala), pronouns (he, his) and
punctuation marks indicative of direct speech (the quotation marks and the
exclamation mark).
Table 12-3b. Key words for Text B
WORD       O11     O12      O21      O22        G2
           604     8808     39625    1156513    221.63
           601     8771     39628    1156550    220.08
HUNTER     42      19       40187    1165302    211.27
L-B        31      0        40198    1165321    210.83
NEEDLER    30      0        40199    1165321    204.03
GUILD      33      7        40196    1165314    187.81
HAD        397     5230     39832    1160091    185.30
JUMALA     27      0        40202    1165321    183.62
SAFARI     29      2        40200    1165319    182.53
.          2261    48958    37968    1116363    175.98
!          125     796      40104    1164525    172.79
This does not tell us anything about this particular text, but taken together, these
pieces of evidence point to a particular genre: narrative text (novels, short stories,
etc.). The few potential content words suggest particular sub-genres: hunter and
guild suggest a historical or a fantasy novel; the unusual words flier and needler
support the latter hypothesis. If we were to include the next ten most strongly
associated nouns, we would find camp, patrol, out-hunter, globes, barrier, beast,
tube, spacer and starfall, which strongly suggest that we are, in fact, dealing with
a science-fiction novel. And indeed, the text in question is the science-fiction
novel Starhunter by Andre Alice Norton (available via Project Gutenberg, see
Appendix A.1).
Again, the keywords identified are a mixture of topical markers and markers
for the text type (in this case, genre) of the text, so even a study of the key words
of single texts provides information about more general linguistic properties of
the text in question as well as its specific topic.
12.2.1.1 Case study: Keywords in scientific writing. There are a number of keyword-
based analyses of academic writing (cf., for example, Scott and Tribble 2006 on
literary criticism, Römer and Wulff 2010 on academic student essays). Instead of
replicating one of these studies in detail, let us look more generally at the
Learned section of the BROWN corpus, using the rest of BROWN (all sections
except the Learned section) as a reference corpus.
Table 12-4 shows the key words for this section.
In this case, there is little we can say about the specific content of the collection
of texts: there is a dominance of academic terms, but they are either very general
(data, index), or relate to different areas of study (anode from the domain of
electricity, cells from biology, dictionary from the language sciences). This reflects
the fact that the Learned section of the BROWN corpus is a collection of 80
different texts from the natural sciences, medicine, mathematics, the social and
behavioral sciences, political science, law and education.
This case study demonstrates a typical application of keyword analysis to a
collection of texts of a particular type, which allows us to make statements about
the relevant text type in general. For example, we can see that academic writing is
characterized by scientific terminology, not a surprising result, but one which
demonstrates that the method works. We also see that certain types of
punctuation are typical of academic writing in general (such as the parentheses,
which we already suspected based on the analysis of the fish population report in
Section 12.1 above). Finally, and perhaps most interestingly, this type of analysis
can reveal function words that are characteristic for a particular text type and
thus give us potential insights into grammatical structures that may be typical for
it; for example, is, the and of are among the most significant key words of
academic writing. The latter two are presumably related to the nominal
to be interesting to investigate.
12.2.1.2 Case study: [a(n) + N + of] in medical research papers. Of course, key word
analysis is not the only way to study lexical characteristics of text types. In
principle, any design studying the interaction of lexical items with other units of
linguistic structure can also be applied to specific text types.
For example, Marco (2000) investigates collocational frameworks (see Chapter
10, Section 10.2.1) in medical research papers. While this may not sound
particularly interesting at first glance, it turns out that even highly frequent
frameworks like [a(n) _ of] are filled by completely different items from those
found in the language as a whole, which is important for many applied purposes
(such as language teaching or machine processing of language), but which also
shows just how different text types can actually be. Since Marco's corpus is not
publicly available, let us instead use the Learned subsection of the BROWN
corpus again. Table 12-5 shows the 15 most strongly associated words in the
framework [a(n) __ of], i.e. the words whose frequency of occurrence inside this
framework differs most significantly from their frequency of occurrence outside
of this framework in the same corpus.
Table 12-5: Most strongly associated words in the framework [a(n) + _ + of] in the
Learned subsection of the BROWN corpus
FRAMEWORK O11 O12 O21 O22 G2
NUMBER 54 119 518 180440 412.98
.1
MATTER 18 27 554 180532 147.45
SERIES 13 14 559 180545 112.69
VARIETY 13 16 559 180543 110.21
FUNCTION 13 68 559 180491 79.06
MEANS 11 54 561 180505 68.11
KIND 8 22 564 180537 57.58
RESULT 9 48 563 180511 54.36
COMBINATION 5 8 567 180551 40.35
T
PRODUCT 6 26 566 180533 38.43
SET 7 64 565 180495 35.37
PIECE 4 4 568 180555 35.03
If we compare the result in Table 12-5 to that in Table 10-2 in Chapter 10, we
notice clear differences between the use of this framework in academic texts and
the language as a whole; for example, lot, which is most strongly associated with
the framework in the general language, does not occur at all; instead, there are
many typically academic terms like function, I.Q. and study. However, it is
actually difficult to assess the magnitude of the differences, as both tables were
derived independently.
A better way of assessing the specific characteristics of a collocational
framework in a particular text type is to combine collocational framework
analysis and keyword analysis by applying the latter to the items occurring in a
particular framework (i.e., instead of statistically evaluating the frequencies of all
words in the two corpora, we evaluate the frequencies of those words occurring
in a particular collocational framework (or collocation, collostruction, etc.) in the
two corpora).
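Extracting the fillers of such a framework is straightforward with a regular expression. The following is an illustrative sketch over a toy string; the input text, the pattern, and the variable names are my own assumptions, and a real study would run this over the tokenized corpus files:

```python
import re
from collections import Counter

# Toy input; in the case study the fillers would come from the
# Learned section of BROWN instead.
text = "a number of cells and a matter of degree suggest a number of problems"

# [a(n) + _ + of]: 'a' or 'an', exactly one intervening word, then 'of'.
framework = re.compile(r"\ban? (\w+) of\b")
fillers = Counter(m.group(1) for m in framework.finditer(text))
print(fillers.most_common())  # [('number', 2), ('matter', 1)]
```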
For example, Table 12-6 shows the results of a comparison of words occurring
in the framework [a(n) + ___ + of] in the LEARNED section of the Brown corpus
with those occurring in this framework in ALL OTHER SECTIONS of the Brown
corpus.
WORD            O11    O12    O21    O22     G2
FUNCTION        13     1      504    2353    38.03
NUMBER          46     64     471    2290    35.31
PRODUCT         6      2      511    2352    12.42
MEANS           11     11     506    2343    11.70
VARIETY         11     12     506    2342    10.75
SET             6      3      511    2351    10.35
DISTRIBUTION    3      0      514    2354    10.30
CURVE           3      0      514    2354    10.30
REFUND          3      0      514    2354    10.30
PHENOMENON      3      0      514    2354    10.30
KEY FOR THE REST OF BROWN
This case study shows the variability that even seemingly simple grammatical
patterns may display across text types. It is also meant to demonstrate how
simple techniques like collocational-framework analysis can be combined with
more sophisticated techniques to yield more insightful results.
variables in question, or, more typically, that existing corpora have to be
separated into sub-corpora accordingly. This is true for inductive key word
analyses as well as for the type of deductive analysis of individual words or
constructions that we used in some of the examples in earlier chapters. The
dependent variable will, as in all examples in this and the preceding chapter,
always be nominal, consisting of (some part of) the lexicon (with words as
values).
Language variety (dialect, sociolect etc.) is an obvious demographic category
to investigate using corpus-linguistic methods. Language varieties differ from
each other along a number of dimensions, one of which is the lexicon. While
sufficiently large corpora representing two different varieties, and study the
distribution of a particular word across these two corpora. Alternatively, we can
study the distribution of all words across the two corpora in the same way as we
have studied their distribution across texts or text types in the preceding section.
This was actually done fairly early, long before the invention of key word
analysis, by Johansson and Hofland (1989). They compared all word forms in the
LOB and BROWN corpora using a coefficient of difference calculated by the
following formula:

( f(word_LOB) − f(word_BROWN) ) / ( f(word_LOB) + f(word_BROWN) )
This formula will give us the percentage of uses of the word in the LOB or the
BROWN corpus (whichever is larger), with a negative sign if it occurs in the
BROWN corpus.37 In addition, Johansson and Hofland test each difference for
significance using the chi-square test. As discussed in Chapter 9, it is more
advisable to use an association measure like Log Likelihood or the p-value
of Fisher's exact test, because percentages will massively overestimate infrequent
events (a word that occurs only a single time will be seen as 100 per cent typical
of whichever corpus it happens to occur in); also, the chi-square test cannot be
applied to infrequent words. Still, Johansson and Hofland's basic idea is of course
highly interesting and their work constitutes the first example of a keyword
analysis that I am aware of.
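The coefficient of difference, in the size-normalized variant given in footnote 37, can be sketched as follows; the example frequencies are invented, and the function name is my own:

```python
def difference_coefficient(freq_a, freq_b, size_a, size_b):
    """Size-normalized difference coefficient: ranges from +1 (the word
    occurs only in corpus A) to -1 (it occurs only in corpus B)."""
    rel_a, rel_b = freq_a / size_a, freq_b / size_b
    return (rel_a - rel_b) / (rel_a + rel_b)

# Invented frequencies in two equally sized corpora:
print(round(difference_coefficient(60, 40, 1000000, 1000000), 2))  # 0.2
```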
Comparing two (large) corpora representing two varieties will not, however,
straightforwardly result in a list of dialect differences. Instead, there are at least
five types of differences that must be selectively discarded in any analysis.
Consider Table 12-7, which shows the 10 most strongly differential key words for
the LOB and BROWN corpus respectively, based on a direct comparison of the
two corpora.
37 Note that this formula only works because the two corpora are roughly equal in size;
if you want to apply it to two corpora A and B that have different sizes, the
frequencies have to be normalized by dividing them by the respective total corpus size
(this is the case, for example, in Schmid's study reported in Case Study 12.2.3.1 below):

(i)  ( f(word_A)/size_A − f(word_B)/size_B ) / ( f(word_A)/size_A + f(word_B)/size_B )
Table 12-7. Key words of British and American English based on a comparison of
LOB and BROWN.
WORD FREQUENCY IN FREQUENCY IN OTHER WORDS IN OTHER WORDS IN G2
LOB BROWN LOB BROWN
MOST STRONGLY ASSOCIATED WITH LOB
                3064    187     1154620    1165143    3098.38
                3048    185     1154636    1165145    3086.63
MR              266     0       1157418    1165330    370.54
LABOUR          285     4       1157399    1165326    360.34
LONDON          502     92      1157182    1165238    314.11
SIR             456     95      1157228    1165235    259.71
I               7617    5841    1150067    1159489    248.33
COLOUR          139     0       1157545    1165330    193.62
TOWARDS         318     64      1157366    1165266    185.97
CENTRE          144     2       1157540    1165328    182.21
SHE             4090    2984    1153594    1162346    181.54
ROUND           336     76      1157348    1165254    178.95
BRITAIN         301     61      1157383    1165269    175.10
COMMONWEALTH    158     7       1157526    1165323    171.81
CENT.           129     1       1157555    1165329    169.34
MOST STRONGLY ASSOCIATED WITH BROWN
First, there are differences that are due to the way the corpora are constructed.
The @ sign is no more frequent in American English than in British English; it
was simply used as a stand-in for a range of non-ASCII characters in the
BROWN, but not in the LOB corpus. This seems trivial, but since it is impossible
In fact, they are often irritating, since of course we would like to know whether
words like labo(u)r or Mr(.) are more typical for British or for American English
aside from the spelling differences. To find out, we have to normalize spellings in
the corpora before comparing them (which is possible, but labo(u)r-intensive).
Third, there are proper nouns that differ in frequency across corpora; for
example, geographical names like London, Britain, Commonwealth, and (New)
York will differ in frequency because their referents are of different degrees of
interest to the speakers of the two varieties. There are also personal names that
differ across corpora; for example, the name Macmillan occurs 62 times in the
LOB corpus but only once in BROWN; this is because in 1961, Harold Macmillan
was the British Prime Minister and thus Brits had more reason to mention the
name. But there are also names that differ in frequency because they differ in
popularity in the speech communities: for example, Mike is a keyword for
BROWN, Michael for LOB. Thus, proper names may differ in frequency for purely
cultural or for linguistic reasons; the same is true of common nouns.
Fourth, nouns may differ in frequency not because they are dialectal, but
because the things they refer to play a different role in the respective culture.
State, for example, is a word found in both varieties, but it is more frequent in US-
American English because the USA is organized into 50 states that play an
important cultural and political role.
Fifth, nouns may differ in frequency due to dialectal differences (as we saw in
many of the examples in previous chapters). Take toward and towards, which
mean the same thing, but for which the first variant is preferred in US-American
and the second in British English. Or take round, which is an adjective meaning
'shaped like a circle or a ball' in both varieties, but also an adverb with a range of
related meanings that corresponds to American English around.
Perhaps because there appears, superficially, little left to discover in terms of
dialectal differences between those variants of English for which there are
12.2.2.1 Case study: British vs. American Culture. The earliest study of this type is
Leech and Fallon (1992), which is based on Johansson and Hofland's (1989)
keyword list of British and American English.
The authors inductively identify words pointing to cultural contrasts by
discarding all words whose distribution across the two corpora is not significant,
all proper names, and all words whose significant differences in distribution are
due to dialectal variation (including spelling variation). Next, they look at
concordances of the remaining words to determine, first, which senses are most
frequent and thus most relevant for the observed differences, and, second,
whether the words are actually spread across the respective corpus as a whole,
discarding those whose overall frequency is simply due to their frequent
occurrence in a single file (since those words would not tell us anything about
cultural differences). Finally, they sort the words into semantic fields such as
sport, travel and transport, business, mass media, military etc., discussing the
quantitative and qualitative differences for each semantic field.
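The dispersion check, discarding words whose overall frequency is an artifact of a single file, can be sketched like this; the mini-corpus and the 90 per cent threshold are my own illustrative assumptions, not Leech and Fallon's actual criterion:

```python
from collections import Counter

def concentrated_in_one_file(word, files, threshold=0.9):
    """True if more than `threshold` of the occurrences of `word`
    come from a single file; `files` maps file names to token lists."""
    counts = Counter({name: tokens.count(word) for name, tokens in files.items()})
    total = sum(counts.values())
    return total > 0 and max(counts.values()) / total > threshold

# Invented mini-corpus of two files:
files = {"A01": ["treaty"] * 19 + ["state"],
         "A02": ["state", "state"]}
print(concentrated_in_one_file("treaty", files))  # True: all hits in one file
print(concentrated_in_one_file("state", files))   # False: spread over files
```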
For example, they note that there are obvious differences between the types of
sports referred to in the two corpora,
reflecting the importance of these sports in the two cultures, but also, that general
sports vocabulary (athletic, ball, playing, victory) is more often associated with the
BROWN corpus, suggesting a greater overall importance of sports in 1961 US-
American culture. Except for one case, they do not present the results
systematically. They list lexical items they found to differentiate between the
corpora, but it is unclear whether these lists are exhaustive or merely illustrative
(the only drawback of this otherwise methodologically excellent study).
The one case where they do present a table is the semantic field MILITARY. Their
results are shown in Table 12-8a/b (I have recalculated them using the log-
likelihood test statistic we have used throughout this and the preceding section).
Table 12-8a Military keywords in BROWN (cf. Leech and Fallon 1992: 49)
KEYWORD    LOB  BROWN  G2     KEYWORD      LOB  BROWN  G2
AIRCRAFT   31   71     15.85  MARINE       12   59     33.61
ARMED      22   60     18.05  MERCENARIES  0    12     16.56
ARMIES     5    15     5.17   MILITARY     133  212    17.74
ARMS       87   121    5.36   MILITIA      0    11     15.18
ASSAULT    4    15     6.71   MISSILE      5    49     41.25
BALLISTIC  1    17     17.12  MISSILES     11   32     10.57
BATTERY    6    18     6.20   MISSION      50   78     5.99
BATTLE     54   87     7.58   MISSIONS     3    16     9.68
BOMBERS    7    23     8.89   MOBILE       6    44     32.37
BOMBS      16   35     7.13   PATRIOT      0    10     13.80
BULLET     13   28     5.52   PATROL       6    25     12.39
BULLETS    4    21     12.56  PEACE        159  198    4.02
CAMPAIGNS  5    17     6.84   PENTAGON     3    14     7.65
CAVALRY    3    26     20.76  PIRATES      3    13     6.67
CIVILIAN   11   24     4.86   PISTOL       13   27     4.91
CODE       14   40     12.88  PLANE        51   114    24.26
CODES      4    17     8.58   RIFLE        20   64     23.95
COLUMN     24   71     24.00  RIFLES       6    23     10.52
COMBAT     8    27     10.77  SHERMAN      0    36     49.67
COMMAND    51   73     3.78   SHOT         65   112    12.32
COMMANDS   5    15     5.17   SIGNAL       40   63     5.03
Table 12-8b Military keywords in LOB (cf. Leech and Fallon 1992: 50)
KEYWORD      LOB  BROWN  G2     KEYWORD  LOB  BROWN  G2
CONQUEST     25   9      7.94   RANK     43   24     5.59
DISARMAMENT  27   11     7.06   TANKS    35   18     5.66
MEDAL        37   7      22.64  TRENCH   15   2      11.34
Note: To save space, the remaining numbers of tokens for other words in the corpora are not shown
in Tables 12-8a/b. If necessary, they can be calculated by subtracting the frequencies shown from the total
number of tokens in each corpus, which, in the version used here, are 1157684 (LOB) and 1165330 (BROWN).
Even this list is not exhaustive. For example, the word soldier is missing, although
it is associated with the LOB corpus (60 vs. 40 occurrences, G2 = 4.61); but overall,
it seems that Leech and Fallon checked the keywords thoroughly (soldier was one
of the few missing significant keywords I was able to find). Thus, their conclusion
that the concept of war played a central role in US culture in 1961 seems reliable.
This case study is an example of a very carefully constructed and executed
contrastive cultural analysis based on keywords. Note, especially, that Leech and
Fallon do not just look for semantic fields that are strongly represented among
the statistically significant keywords of one corpus, but that they check whether
the semantic field is also represented (with different vocabulary) among the
statistically significant keywords of the other corpus. This search for
counterevidence is often lacking in inductive studies.
distributed (cease(-)fire 7:7, truce 0:5). Taking all three words together, there is no
significant difference between the two varieties (22:16, G2 = 0.99). It seems that
there is simply a preference for the word armistice in British English and for truce
in American English. Without taking into account synonymy, we might have
concluded that armistices are more central to British thinking about military
conflicts, and similar distortions might underlie some of the other keywords.
12.2.2.2 Case study: African key words. Wolf and Polzenhagen (2007) present an
analysis of African keywords arrived at by comparing a corpus of Cameroon
English to the combined FLOB/FROWN corpora (jointly meant to represent
Western culture). Their study is partially deductive, in that they start out from a
pre-conceived 'African model of community' which they claim holds for all of
sub-Saharan Africa and accords the extended family and community a central
the distribution of words across files in the two corpora, but, like Leech and
Fallon, they check concordance lines for ambiguous words to determine what
senses they are used in.
This procedure yields seemingly convincing word lists like that in Table 12-9.
However, due to their method, we cannot tell whether this is a real result or
whether it is simply due to the specific selection of words they look at. Some
obvious items from the semantic field AUTHORITY are missing from their list, for
example, command, rule, disobey, disrespect, and power. As long as we do not
know whether they failed to check the keyness of these (and other) AUTHORITY
words or whether they did check them but found them not to be significant, we
cannot claim that authority and respect play a special role in African culture(s),
for the simple reason that we cannot exclude the possibility that these words may
actually be significant keywords for the combined FROWN/FLOB corpus.
Finally, it is questionable whether one can simply combine a British and an
American corpus to represent Western culture. First, it assumes that the two
cultures individually belong to such a larger culture and can jointly represent it;
why not take, say, Irish English and Canadian English instead? Second, it glosses
over differences between the two speech communities in exactly those areas in
which they are being compared to Cameroon English. For example, authority,
ruler and even father are very strongly associated with British English (using
LOB) as compared to American English (using BROWN); the reverse is true for
leadership, dignitaries, obedience and respect. Thus, it seems better to compare
specific cultures (for example in the sense of speech communities within a
nation state) directly than to construct a combined Western (or other) culture
(cf. Oakes and Farrow 2007 for a method that can be used to compare multiple
varieties against each other in a single analysis).
and health, car and traffic and public affairs. As far as one can tell from the
methodological description, he only looks at selected words from each category,
so the reliability of the results depends on the plausibility of his selection.
One area in which this is unproblematic is color: Schmid finds that all basic
color terms (black, white, red, yellow, blue, green, orange, pink, purple, grey) are
more frequent in women's language than in men's language. Since basic color
terms are (synchronically) a closed set and Schmid looks at the entire set, the
results may be taken to suggest that women talk about color more often than men
do. Similarly, for the domain of temporal deixis Schmid looks at the expressions
yesterday, tomorrow, last week, tonight, this morning, today, next week, last year
and next year and finds that all but the last three are significantly more frequently
used by women. While this is not a complete list of temporal deictic expressions,
it seems representative enough to suggest that the results reflect a real difference.
Schmid's analysis of the semantic field personal reference, shown in Table
12-10, is somewhat more problematic. Personal reference is a large, lexically
diverse and somewhat diffuse semantic field, and while the sample Schmid
presents certainly seems to contain many items that can be considered central to
this domain, the question remains whether a more exhaustive sample would yield
the same results. For example, the plurals of girl and boy are missing, which is
odd, given that the plurals of all other nouns are included.
Table 12-10. Male and female keywords for the semantic field personal reference
(Schmid 2003)
WORD     FREQ. MEN   FREQ. WOMEN   COEFFICIENT
she        2266.94       7842.65   -0.55 ***
girl         91.91        274.92   -0.50 ***
boy         127.49        232.22   -0.29 ***
they       9134.26      10563.21   -0.07 ***
person      236.27        229.46    0.01
Despite these problems, we can probably agree that the selection in Table 12-10 is
relatively representative for the field of personal reference. For many other fields
that Schmid investigates, the representativity of his lexical sample is much more
difficult to assess. For example, Table 12-11 shows Schmid's results for the
semantic field body and health (again, negative coefficients indicate that a word
Anatol Stefanowitsch
Table 12-11. Male and female keywords for the semantic field body and health
(Schmid 2003)
WORD          FREQ. MEN   FREQ. WOMEN   COEFFICIENT
breast             4.47         15.97   -0.56 ***
hair              55.71        195.67   -0.56 ***
headache           4.47         14.13   -0.52 ***
legs              32.94         79.86   -0.42 ***
sore throat        2.03          4.91   -0.41
doctor            71.17        139.76   -0.33 ***
sick              48.39         90.31   -0.30 ***
ill               30.3          55.9    -0.30 ***
leg               40.06         65.12   -0.24 ***
eyes              49.21         79.56   -0.24 ***
finger            30.91         44.85   -0.18 **
fingers           26.03         29.49   -0.06
eye               53.27         58.67   -0.05
body             101           103.52   -0.01
hands             96.99         98.29   -0.01
hand             231.39        214.4     0.04
In this case, it is difficult to conclude, as Schmid does, that the field body and
health is associated with women's language, because the selection of words is
too small and eclectic to be considered representative for the entire field (some
words that come to mind immediately are pain, ache/aching, unwell, nurse,
medicine, (in)flu(enza) for the domain HEALTH and nose, arm(s), foot, feet, stomach,
mouth, teeth for the domain BODY).
Table 12-12. Key words in women's speech and men's speech in the BNC
conversation subcorpus
KEY WORD O11 O12 O21 O22 G2
MOST STRONGLY ASSOCIATED WITH FEMALE SPEAKERS
SHE 7037 22807 1479484 2285127 3291.17
HER 2313 7306 1484208 2300628 990.39
SAID 4911 12375 1481610 2295559 881.36
N'T 24221 44380 1462300 2263554 444.52
I 54825 93330 1431696 2214604 306.99
AND 29109 50467 1457412 2257467 231.78
COS 3314 6864 1483207 2301070 191.93
TO 23693 40934 1462828 2267000 175.92
CHRISTMAS 285 1005 1486236 2306929 171.10
CHARLOTTE 24 298 1486497 2307636 170.52
THOUGHT 1545 3523 1484976 2304411 166.21
OH 13236 23472 1473285 2284462 152.83
LOVELY 406 1217 1486115 2306717 145.25
MM 7067 13039 1479454 2294895 139.46
BECAUSE 1830 3901 1484691 2304033 129.79
MOST STRONGLY ASSOCIATED WITH MALE SPEAKERS
FUCKING 1383 326 1485138 2307608 1251.11
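The G² values in tables like this one can be recomputed from the four observed frequencies alone. The following sketch shows the standard log-likelihood calculation (the code and the function name are mine, for illustration; any statistics package provides an equivalent):

```python
from math import log

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 keyword table:
    o11/o12 = hits for the word in corpus A/B,
    o21/o22 = all remaining tokens in corpus A/B."""
    n = o11 + o12 + o21 + o22
    cells = ((o11, o11 + o12, o11 + o21),
             (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21),
             (o22, o21 + o22, o12 + o22))
    # expected frequency of a cell = row total * column total / n;
    # a cell with an observed frequency of 0 contributes nothing
    return 2 * sum(o * log(o / (row * col / n)) for o, row, col in cells if o > 0)

# the four observed frequencies for SHE in Table 12-12
print(round(g2(7037, 22807, 1479484, 2285127), 2))  # close to the reported 3291.17
```

The same function reproduces the other G² values in the table up to rounding.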
Two differences are obvious immediately. First, there are pronouns among the
most significant keywords of women's speech, but not of men's speech,
confirming Schmid's finding concerning personal reference. This is further
analysis of preselected items: for example, looking at the first fifty keywords in
male and female speech will reveal no clear additional differences, although they
point to a number of potentially interesting semantic fields (for example, the
occurrence of lovely as a female keyword points to the possibility that there
might be differences in the use of evaluative adverbs).
Among the male keywords we find words like minus, plus, percent, equals,
squared, decimal as well as many number words, which could be used to construct
a stereotype of male language as being concerned with abstract domains like
mathematics. However, these differences very obviously depend on the topics of
the conversations included in the corpus. It is not inconceivable, for example,
that male linguists constructing a spoken corpus will record their male colleagues
in a university setting and their female spouses in a home setting. Thus, we must
take care to distinguish stable, topic-independent differences from those that are
due to the content of the corpora investigated. This should be no surprise, of
course, since keyword analysis was originally invented to uncover precisely such
differences in content.
12.2.4 Ideology
Just as we can choose texts to stand for demographic variables, we can choose
them to stand for the world views or ideologies of the speakers who produced
them. Note that in this case, the texts serve as an operational definition of the
corresponding ideology, an operationalization that must be plausibly justified.
text to a reference corpus. This is frequently done when investigating political
ideologies, which is reasonable in contexts where two ideologies are clearly set
against each other. In Rayson's study, one wonders what the results would have
been if he had compared the Labour Manifesto against that of the Conservative
Party instead; or if he had compared the manifestos of all major parties against a
general reference corpus. Which specific design we choose depends on what
exactly we are investigating.
Table 12-13 shows the result of the direct comparison, derived from my own
analysis of the two party manifestos (which can easily be found online). The
results differ from Rayson's in a few details, due to slightly different decisions
about tokenization, but they are identical with respect to all major observations.
Obviously, the names of each party are overrepresented in the respective
manifesto as compared to that of the other party. More interesting is the fact that
would is a key word for the Liberal Democrats' program; this is because their
ten strongest keywords; presumably, because they were already in power and felt
less of a need to mention specific policies that they were planning to implement.
The Liberal Democrats, in contrast, have green and environmental, pointing to a
strong environmental focus, as well as powers, which suggests that they are very
concerned with the distribution of decision-making powers (both speculations
turn out to be correct if we read the actual manifesto).
Table 12-13. Differential keywords of the 2001 Labour and Liberal Democrat
election manifestos
KEY WORD O11 O12 O21 O22 G2
MOST STRONGLY ASSOCIATED WITH THE LABOUR PARTY ELECTION MANIFESTO
PER CENT 92 0 30330 21248 97.58
OUR 286 76 30136 21172 66.44
LABOUR 177 36 30245 21212 58.17
IS 336 119 30086 21129 44.89
NOW 77 8 30345 21240 42.82
MILLION 68 6 30354 21242 41.10
1997 60 4 30362 21244 40.79
NEXT 55 4 30367 21244 36.16
SINCE 39 2 30383 21246 28.91
NEW 174 57 30248 21191 27.62
MOST STRONGLY ASSOCIATED WITH THE LIBERAL DEMOCRATIC PARTY ELECTION MANIFESTO
LIBERAL 0 46 30422 21202 81.81
WOULD 11 70 30411 21178 71.81
DEMOCRATS 0 39 30422 21209 69.35
WHICH 37 92 30385 21156 48.21
SUPPORT 14 57 30408 21191 45.70
ENVIRONMENTAL 15 47 30407 21201 30.85
Of course, identifying key words is only the first step in an analysis of ideologies
(or of key word analyses in general). Once the key words are identified, they are
typically investigated in their context (cf. Rayson 2008, who presents KWIC
concordances of some important key words, and Scott 1997, who identifies
collocates of the key words in a sophisticated procedure that leads to highly
insightful clusters of key words).
In fact, key word analysis is sometimes used not to identify the specific key
words as such, but to identify a set of lexical items to investigate in the context of
a larger research question. For example, Partington (1997) suggests using a key
word analysis to identify potential metaphors in a particular domain of discourse.
12.2.4.2 Case study: The importance of men and women. Just as text may stand for
something other than text, words may stand for something other than words in a
given research design. Perhaps most obviously, they may stand for their referents
(or classes of referents). If we are careful with our operational definitions, then,
we may actually use corpus-linguistic methods to investigate not (only) the role
of words in texts, but the role of their referents in a particular community.
In perhaps the first study attempting this, Kjellmer (1986) uses the frequency
of masculine and feminine pronouns in the topically defined sub-corpora of the
LOB and BROWN corpora as an indicator of the importance accorded to women
in the respective discourse domain. His research design is essentially deductive,
since he starts from the hypothesis that women will be mentioned less frequently
than men. The design has two nominal variables: SEX (with the values MAN and
WOMAN, operationalized as male pronoun and female pronoun) and TEXT
CATEGORY (with the values provided by the text categories of the LOB/BROWN
corpora).
First, Kjellmer notes that men are referred to much more frequently than
women overall: There are 17,965 male pronouns in the LOB corpus compared to
only 8,261 female ones (Kjellmer's figures differ very slightly from the ones given
here and below, perhaps because he used an earlier version of the corpus). This
difference between male and female pronouns is significant: using the single-
variable version of the chi-square test introduced in Chapter 7 (Section 7.2.2), and
assuming that the population in 1961 consisted of 50 percent men and 50 percent
women, we get the expected frequencies shown in Table 12-14a (χ² = 3590.62 (df =
1), p < 0.001).
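The computation behind this single-variable test is simple enough to sketch in a few lines (illustrative code, not taken from Kjellmer):

```python
# male and female pronoun counts in the LOB corpus
observed = {"male": 17965, "female": 8261}
n = sum(observed.values())
expected = n / 2          # under the assumption of equal proportions
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
print(round(chi2, 2))  # → 3590.62
```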
Table 12-14a. Observed and expected frequencies of male and female pronouns in
the LOB corpus (based on the assumption of equal proportions).
PRONOUN   Observed   Expected   Chi-Square Component
MALE        17,965     13,113   1795.31
FEMALE       8,261     13,113   1795.31
of) people of any sex. However, Kjellmer shows that only 4 percent of the male
pronouns are used generically, which does not change the imbalance perceptibly;
also, the FLOB corpus from 1991 shows almost the same distribution (16,278 male
pronouns, 8,669 female ones).
Kjellmer's main question is whether, given this overall imbalance, there are
differences in the individual text categories, and as Table 12-14b shows, this is
indeed the case.
Table 12-14b. Male and female pronouns in the different text categories of the
LOB corpus.
                                    SEX OF PRONOUN REFERENT
TEXT CATEGORY                       MALE                  FEMALE                Total
[rows for the earlier text categories are not recoverable in this draft]
N. Adventure and Western Fiction    O: 2362               O: 1250               3612
                                    E: 2474.12            E: 1137.88
                                    χ²: 5.08 (n.s.)       χ²: 11.05 *
P. Romance and Love Story           O: 2095               O: 2186               4281
                                    E: 2932.36            E: 1348.64
                                    χ²: 239.12 ***        χ²: 519.91 ***
R. Humor                            O: 329                O: 131                460
                                    E: 315.09             E: 144.91
                                    χ²: 0.61 (n.s.)       χ²: 1.34 (n.s.)
actual, existing men are simply thought of as more worthy topics of discussion
than actual, existing women. Women, in contrast, are overrepresented in popular
lore, general fiction, adventure and western, and romance and love stories
(overrepresented, that is, compared to their general underrepresentation; in
absolute numbers, they are mentioned less frequently in every single category
except romance and love stories). In other words, fictive women are slightly less
strongly discriminated against in terms of their worthiness for discussion than
are real women.
Other researchers have taken up and expanded Kjellmer's method of using the
distribution of male and female pronouns (and other gendered words) in corpora
to assess the role of women in society (see, e.g., Romaine 2001; Baker 2010b;
Twenge et al. 2012 and Liberman's 2012 discussion of their method; Subtirelu
2014).
al. 2011). In practice, culturomics is simply the application of standard corpus-
linguistic techniques (word frequencies, tracked across time), and it has yielded
some interesting results (if applied carefully, which is not always the case).
12.2.5.1 Case Study: God. Michel et al. (2011) present a number of small case
studies intended to demonstrate the potential of the method. They use the Google
Books corpus (see Appendix A.3). The use of Google Books may be criticized
because it is not a balanced corpus, but the authors point out that first, it is the
largest corpus available and second, books constitute cultural products and thus
may not be such a bad choice for studying culture after all.
As a simple example, consider the search for the word God in US novels (I used
the 2012 version of the corpus, so the results differ slightly from theirs).
The authors present this as an example of the history of religion; they
conclude from their result somewhat flippantly that God is not dead but needs
a new publicist. This flippancy, incidentally, signals an unwillingness to engage
with their own results in any depth that is not entirely untypical of researchers in
culturomics. Broadly speaking, the result certainly suggests a waning dominance
of religion on topic selection in book publishing, but not necessarily on the
culture as such.
12.2.5.2 Example: Censorship. While it is not implausible to analyze culture in
general on the basis of a literary corpus, any analysis that involves the area of
publishing itself will be particularly convincing. One such example is Michel et
al.'s (2011) use of frequencies to identify periods of censorship. For example, they
search for the name of the Jewish artist Marc Chagall in the German and the US-
English corpus. As Figure 12-2 shows, there is a first peak in the German corpus
around 1920, but during the time of the Nazi government, the name drops almost
to zero while it continues to rise in the US-English corpus.
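Tracking a name across time in this way simply means computing its relative frequency (for example, per million tokens) year by year. A minimal sketch; the counts below are invented for illustration, not the actual Google Books figures:

```python
def per_million(counts, totals):
    """Relative frequency per million tokens, year by year.
    counts: year -> hits for the word; totals: year -> corpus size."""
    return {year: 1_000_000 * counts[year] / totals[year] for year in counts}

# invented illustrative numbers, NOT the real Google Books counts
chagall = {1920: 12, 1938: 2, 1960: 90}
tokens = {1920: 40_000_000, 1938: 55_000_000, 1960: 120_000_000}
print(per_million(chagall, tokens))
```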
Figure 12-2. The name Marc Chagall in the US-English and the German part of the
Google Books corpus.
Further reading
This chapter has focused on very simple aspects of variation across text types and
a very simple notion of text type. See Biber (1988, 1989) for a more
comprehensive corpus-based perspective on text types. As seen in some of
the case studies in this chapter, text is frequently a proxy for demographic
properties of the speakers who have produced it, making corpus linguistics a
variant of sociolinguistics; see Baker (2010).
Biber, Douglas. 1988. Variation across speech and writing. Cambridge &
New York: Cambridge University Press.
Biber, Douglas. 1989. A typology of English texts. Linguistics 27(1). 3–43.
13 Metaphor and Metonymy
The ease with which corpora are accessed via word forms and the difficulty of
accessing them at other levels of linguistic representation is an advantage as long
as it is our aim to investigate words, for example with respect to their
relationship to other words, to their internal structure or to their distribution
across grammatical structures and across texts and text types. As we saw in
Chapter 10, it is more problematic where our aim is to investigate grammar in its
own right, but since grammatical structures tend to be associated with particular
words and/or morphemes, these difficulties can be overcome to some extent.
When it comes to investigating phenomena that are not lexical in nature, the
word-based nature of corpora is clearly a disadvantage and it may seem as
though there is no alternative to a careful manual search and/or a sophisticated
annotation (manual, semi-manual or based on advanced natural-language
annotation, and there are very detailed and sophisticated proposals for
annotation procedures (most notably the Pragglejaz Metaphor Identification
Procedure, cf., for example, Steen et al. 2010).
However, as stressed in various places throughout this book, the manual
annotation of corpora severely limits the amount of data that can be included in a
research design; this does not invalidate manual annotation, but it makes
alternatives highly desirable. Two broad alternatives have been proposed in
corpus linguistics. Since these were discussed in some detail in Chapter 5, we will
only repeat them briefly here before illustrating them in more detail in the case
studies presented in the next section. The first approach starts from a source
domain, searching for individual words or sets of words (synonym sets, semantic
fields, discourse domains) and then identifying the metaphorical uses and the
respective targets and underlying metaphors manually. This approach is
extensively demonstrated, for example, in Deignan (1999, 2005). The second
approach starts from a target domain, searching for abstract words describing
(parts of) the target domain and then identifying those that occur in a
grammatical pattern together with items from a different semantic domain (which
will normally be a source domain). This approach has been taken by
Stefanowitsch (2004, 2006) and others (see Martin 2006 for a semi-automatic
variant of the method). A third approach has been suggested by Wallington et al.
(2003): they attempt to identify words that are not themselves part of a
metaphorical transfer but that point to a metaphorical transfer in the immediate
context (the expression figuratively speaking would be an obvious candidate). This
approach has not been taken up widely, but it is very promising at least for the
identification of certain types of metaphor, and of course the expressions in
(13.2a) the flames of civil war engulfed the central Yugoslav republic. [BNC AHX]
(13.2b) The game was going OK and then it went up in flames. [BNC CBG]
Deignan's design has two nominal variables: WORD FORM OF FLAME (with the
values SINGULAR and PLURAL) and CONNOTATION OF METAPHOR (with the values POSITIVE
and NEGATIVE; she does not say how exactly the metaphors were categorized).
Table 13-1a shows her results (χ² = 53.98 (df = 1), p < 0.001).
Table 13-1a. Positive and negative metaphors with singular and plural forms of
flame (Deignan 2006: 117)
                           WORD FORM OF FLAME
                           SINGULAR       PLURAL        Total
EVALUATION   POSITIVE      Obs: 90        Obs: 24       114
                           Exp: 70.78     Exp: 43.22
             NEGATIVE      Obs: 5         Obs: 34       39
                           Exp: 24.22     Exp: 14.78
             Total         95             58            153
Deignan explains this in relation to the literal uses of flame(s): a single flame is usually
under control, and it may be of use, as a candle or a burning match. If there is
more than one flame, we are essentially dealing with a fire; flames are often
undesired, out of control and very dangerous (Deignan 2006: 117).
This explanation itself is of course a hypothesis about the literal use of
singular and plural flame that must be tested separately. Deignan does not
provide such a test, but it is easy enough to do. I selected 100 random
concordance lines each for flame and flames from the BNC and removed all
figurative uses. The remaining literal uses (see Appendix D.4) were sorted
according to whether they described a flame/fire that was safe and under control
(as in 13.3a and 13.4a), or dangerous and out of control (as in 13.3b and 13.4b):
(13.3a) She watched the kiln's flames flicker on the walls [BNC HH3]
(13.3b) Malekith was caught within the flames, his body terribly scarred and
burned [BNC CM1]
(13.4a) At her elbow the candle burned with a steady flame [BNC G1X]
(13.4b) [He] felt the wild pain of the flame on his arm as he dived for safety. [BNC
HA3]
Table 13-1b. Positive and negative contexts for literal uses of singular and plural
forms of flame
                             WORD FORM OF FLAME
                             SINGULAR      PLURAL      Total
SITUATION   UNDER CONTROL    Obs: 29       Obs: 17     36
            [remaining rows not fully recoverable in this draft]
            Total            63            83          136
singular flame is used more frequently than expected for situations where the fire
is under control, while the plural flames is used more frequently than expected
for situations where it is out of control. The quality of being out-of-control is also
different: 14 of the 83 cases refer to a situation where someone is trying and
failing to put out the fire (compared to 0 cases for singular flame) and 7 cases
refer to people being trapped by fire (compared to 1 case for singular flame).
Thus, Deignan's explanation appears to be fundamentally correct, providing
evidence for a substantial degree of isomorphism between literal and figurative
uses of (at least some) words. An analysis of more such cases could show whether
this isomorphism between literal and metaphorical uses is a general principle (as
G. Lakoff's 1993 conceptual theory of metaphor suggests it should be).
If we are interested in a particular target domain and the source domains associated with
it, there is no lexical resource to draw from. Partington (1998) suggests an
interesting solution: He applies keyword analysis (see Chapter 12, Section 12.1) to
a thematic corpus dealing with the target domain in question and then inspects
the results for items that are not from the domain in question. Among these items
are many that are likely to be used metaphorically in the target domain.
To demonstrate this method, let us create a small subcorpus of economic texts
from the BROWN corpus and perform a keyword analysis comparing this
subcorpus against the rest of the files in BROWN. There are eight files that, based
on their description in the manual, deal with economics topics (specifically, files
A26, A27, A28, H05, H11, H24, J40, J41). Table 13-2a shows selected results of a
keyword analysis of these files.
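The procedure itself can be sketched as follows: count word frequencies in the thematic subcorpus and in the reference corpus, score every word of the subcorpus with G², and rank. The two "corpora" below are invented toy data, not the BROWN files:

```python
from collections import Counter
from math import log

def g2(o11, o12, o21, o22):
    """Log-likelihood ratio for a 2x2 frequency table."""
    n = o11 + o12 + o21 + o22
    cells = ((o11, o11 + o12, o11 + o21), (o12, o11 + o12, o12 + o22),
             (o21, o21 + o22, o11 + o21), (o22, o21 + o22, o12 + o22))
    return 2 * sum(o * log(o / (row * col / n)) for o, row, col in cells if o > 0)

def keywords(target_tokens, reference_tokens, top=5):
    """Rank the words of the target corpus by their keyness (G2)."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = sum(t.values()), sum(r.values())
    scored = [(g2(t[w], r[w], nt - t[w], nr - r[w]), w) for w in t]
    return sorted(scored, reverse=True)[:top]

# toy data: 'market' should come out as the strongest keyword
target = "the market rose and the market fell".split()
reference = "the cat sat on the mat and the dog sat too".split()
print(keywords(target, reference, top=3))
```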
As expected, most of the strongest keywords are directly related to the domain
of economics. Among the Top 10 (shown in Part I of Table 13-2a), only the items
roleplaying and gin do not seem to fit this pattern, but closer inspection reveals
that they are related to economics indirectly: roleplaying occurs in a text about
the use of roleplaying in sales training (BROWN J30) and gin in a text about the
9 ROLEPLAYING 16 0 18259 1005775 128.85
10 GIN 18 5 18257 1005770 121.04
(II) STRONGEST NON-ECONOMIC KEYWORDS
20 MOVABLE 14 5 18261 1005770 91.02
23 RETURN 30 150 18245 1005625 84.80
35 TANGIBLE 11 8 18264 1005767 63.00
51 EXTENSION 12 24 18263 1005751 51.67
54 RECOVERY 11 18 18264 1005757 50.73
57 PRESSURE 22 163 18253 1005612 48.07
67 RISE 16 86 18259 1005689 43.32
Among the spatial terms, words referring to the vertical dimension are very
frequent (rise, rising, higher, level, raise, below, rises, upswing), and among these,
the word rise is dominantly represented by three different word forms among the
top 250 significant keywords. Table 13-2b is a concordance of all instances of all
word forms of rise in the economics subcorpus.
9 f the country and the moderate [[rise]] in vacancy rates for apartm
10 he spring of 1961 before a new [[rise]] in economic activity gets u
11 ties , a continued substantial [[rise]] in expenditures by state an
12 enditures will not continue to [[rise]] in the Sixties . I am confi
13 as . Paradoxically , the sales [[rise]] is due in large measure to
14 o allow the basic wage rate to [[rise]] obviously depends upon the
15 ections show a more pronounced [[rise]] to an annual rate of 1,338,
16 ge rate . Since marginal costs [[rise]] when the wage rate rises ,
17 costs rise when the wage rate [[rises]] , the profit-maximizing pr
18 only rises as the wage rate [[rises]] . In such circumstances ,
19 he public-limit price only [[rises]] as the wage rate rises . I
20 o which the public-limit price [[rises]] in response to a basic wag
21 e profit-maximizing price also [[rises]] when the public-limit pric
22 is manifestly justified by [[rising]] costs ( due to rising wag
The concordance shows that all instances of the word rise in the economics sub-
corpus are metaphorical uses, based on the metaphor increase in quantity is
upward motion. It also provides information about the words with which this
metaphor is used most frequently: there are 9 cases of rate(s), 6 of price(s), 4
of expenditure(s), 3 each of costs, sales and spending, and 2 of wages.
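A concordance like Table 13-2b can be produced with a simple key-word-in-context (KWIC) routine. The sketch below is regular-expression based and runs over an invented sample sentence; a real study would apply it to the corpus files:

```python
import re

def kwic(text, pattern, width=30):
    """One Key Word In Context line per regex match in `text`."""
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [[{m.group(0)}]] {right}")
    return lines

sample = ("Expenditures will continue to rise in the Sixties, and marginal "
          "costs rise when the wage rate rises, due to rising wages.")
for line in kwic(sample, r"\bris(?:e|es|ing)\b"):
    print(line)
```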
This case study demonstrates that central metaphors for a given target domain
13.2.1.3 The impact of metaphorical expressions. As a final example of a source-
domain oriented study consider Stefanowitsch (2005), which investigates the
relationship between metaphorical and literal expressions hinted at at the end of
the preceding case study. The aim of the study is to uncover evidence for the
function of metaphorical expressions that have literal paraphrases, such as [dawn
of NP] in examples like (13.5a), which is seemingly equivalent to the literal
[beginning of NP] in (13.5b):
(13.5a) [I]t has taken until the dawn of the 21st century to realise that the best
independent variable (whose values are pairs of patterns like the one illustrated
in examples (13.5a,b)), and the nouns in the NP slot as the dependent variable.
This corresponds to a differential-collexeme analysis (Chapter 10, Section
10.2.2.2).
The study aims to test the hypothesis that metaphorical language serves a
38 For example, a query of the BNC for (rising|increasing) (cost|profit)s? results in 84
hits for rising cost(s) as opposed to 34 hits for increasing cost(s) and 7 hits for rising
profit(s) as opposed to 12 hits for increasing profit(s). It is left as an exercise for the
reader to test this distribution for significance using the chi-square test or a similar
test (but use a separate sheet of paper, as the margin of this page will be too small to
contain your calculations).
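For readers who prefer code to a separate sheet of paper: a sketch of the chi-square test on the four frequencies given in the footnote (the p-value formula is the standard one for one degree of freedom):

```python
from math import erfc, sqrt

# observed frequencies from the footnote:
# rows = rising/increasing, columns = cost(s)/profit(s)
obs = [[84, 7], [34, 12]]
rows = [sum(r) for r in obs]
cols = [sum(c) for c in zip(*obs)]
n = sum(rows)
chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))
p = erfc(sqrt(chi2 / 2))   # survival function of chi-square with df = 1
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # significant at p < 0.01
```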
cognitive function and that for each pair of patterns investigated, the
metaphorical variant should be used with nouns referring to more complex
entities. The construct COMPLEXITY is operationalized in the form of axioms derived
from gestalt psychology, such as the following:
Concepts representing entities that have a simple shape and/or have a clear
boundary are less complex than those representing entities with complex
shapes or fuzzy boundaries (because they are more easily delineable). This
follows from the gestalt principles of closure and simplicity (Stefanowitsch
2005: 170).
For each pair of expressions, the differential collexemes are identified and the
resulting lists are compared against these axiomatic assumptions. Take the case of
the two expressions introduced above, whose differential collexemes are shown in
Table 13-3.
Obviously, both expressions are associated with words referring to time, but
those associated with the literal beginning of refer to time spans with clear
boundaries and a clearly defined duration (year and century), while those
associated with the metaphorical dawn of refer to variable time spans without
such boundaries (civilization, time, history, age, era, culture). The one apparent
exception is day, but this occurs exclusively in literal uses of dawn of, such as It
was the dawn of the fourth day since the murder (BNC CAM). This (and similar
these observations in two large corpora, the 450-million-word Corpus of
Contemporary American English (COCA) and the 1.9-billion-word Corpus of
Global Web-based English (GloWbE) (see Appendix A.1). The names of decades
(such as 1960s or sixties) occur too infrequently with dawn of in these corpora to
say anything useful about them, but the names of centuries are frequent enough.
Table 13-4 shows the percentage of dawn of for the past ten centuries (all
spelling variants of the respective centuries were included, e.g. 19th century,
nineteenth century, etc.).
Table 13-4. Dawn of vs. beginning of with the century in two large corpora
              COCA                          GloWbE
              DAWN  BEGINNING  % DAWN      DAWN  BEGINNING  % DAWN
12TH CENTURY     0      4      0.0000         0     26      0.0000
13TH CENTURY     0      5      0.0000         3     37      0.0750
14TH CENTURY     0      8      0.0000         3     43      0.0652
There are clear differences between the centuries, and these differences are
roughly similar in the two corpora: in both corpora, the 19th century occurs
significantly less frequently and the 21st century occurs significantly more
frequently than expected with dawn of (you can test this for yourself by
calculating the chi-square components). Why this should be the case requires
further study, but one could speculate that the nineteenth century feels more
bounded than the 21st because it is actually over, and we can imagine it in its
entirety, while none of the speakers in the respective corpora will live to see the
end of the 21st century, making it conceptually less bounded.
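The chi-square components mentioned above are the per-cell values (O − E)²/E. A sketch of the computation; for reasons of space it uses only the three GloWbE rows printed at the top of Table 13-4, whereas the actual test would use all ten centuries:

```python
def chi2_components(table):
    """Per-cell (O-E)^2/E components for an r x c frequency table."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    return [[(table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
             for j in range(len(cols))] for i in range(len(rows))]

# (dawn of, beginning of) counts in GloWbE, from Table 13-4
table = [[0, 26],   # 12th century
         [3, 37],   # 13th century
         [3, 43]]   # 14th century
for century, comps in zip(("12th", "13th", "14th"), chi2_components(table)):
    print(century, [round(c, 2) for c in comps])
```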
This case study demonstrates a use of the differential collexeme analysis (and
thus of collocational methods in general) that goes beyond associations between
words and other elements of structure and instead uses words and grammatical
patterns as ways of investigating semantic associations. Direct comparisons of
literal and metaphorical language are rare in the research literature, so this
remains a potentially interesting field of research.
13.2.2 Target domains
13.2.2.1 Case study: happiness and joy. As discussed in Chapter 5 (Section 5.1.3),
there are two types of metaphorical utterances: those that could be interpreted
literally in their entirety (like Lakoff and Kövecses' example I am burned up, cf.
Examples 5.6a and 5.7a), and those that contain vocabulary from both the source
and the target domain (like He was filled with anger, cf. 5.8a). Stefanowitsch (2004,
2006) refers to the latter as metaphorical patterns, defined as follows:
domain (SD) into which one or more specific lexical item from a given
target domain (TD) have been inserted (Stefanowitsch 2006: 66).
[NPcontainer be filled with NPsubstance], the source domain would be that of substances
in containers. The target domain word that has been inserted in this expression is
anger, yielding the metaphorical pattern [NPcontainer be filled with NPemotion]. The
can then be grouped into larger sets corresponding to metaphors like EMOTIONS
ARE SUBSTANCES.
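Retrieving candidate hits for such a pattern can be sketched with a simple string search. A real metaphorical pattern analysis would query a part-of-speech-tagged corpus, so the regular expression and the small emotion-noun list below are merely stand-ins:

```python
import re

def filled_with(text, target_nouns):
    """Return the nouns N of 'filled with N' hits that belong to a
    target-domain word list (a stand-in for a tagged-corpus query)."""
    hits = re.findall(r"\bfilled with (\w+)", text, flags=re.IGNORECASE)
    return [n.lower() for n in hits if n.lower() in target_nouns]

emotion_nouns = {"anger", "rage", "joy", "happiness"}   # target-domain sample
sample = ("He was filled with anger. The tank was filled with water, "
          "and she was filled with joy.")
print(filled_with(sample, emotion_nouns))  # → ['anger', 'joy']
```

Note that the literal hit filled with water is excluded precisely because water is not on the target-domain list.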
Of course, the metaphors identified in this way are (or can be) specifically
related to the specific target-domain word occurring in the corresponding
pattern, rather than the emotion referred to. For example, the BNC contains the
metaphorical patterns filled with anger and filled with rage, but not filled with
irritation/annoyance/fury. In other words, metaphorical pattern analysis identifies
metaphors associated with a particular lexicalization of a concept, rather than the
concept itself. In this way, it can be used for comparisons of lexicalizations across
languages (cf. e.g. Stefanowitsch 2004, Díaz-Vera 2013), of alternative
lexicalizations within a single language (e.g. Stefanowitsch 2006, Turkkila 2014),
or both (e.g. Rojo López 2013), or of changes in the metaphors associated with a
lexical item (Tissari 2003, 2010).
For example, Stefanowitsch (2006) investigates the metaphorical patterns
associated with the near synonyms happiness and joy in an inductive study of
metaphors associated with various words for basic emotions in English. One of
the findings is that while LIQUID/SUBSTANCE metaphors are found with both
words, they differ in the number of metaphors suggesting that the person
experiencing the corresponding emotion is completely full or even bursting.
Table 13-5a shows the respective metaphors and their frequency of occurrence
expected with joy and less frequent than expected with happiness (χ² = 9.4 (df = 1),
p < 0.01). Stefanowitsch (2006) suggests that this reflects the fact that joy
describes a much more intense emotion than happiness. This hypothesis would
have to be tested with other pairs of near synonyms where one describes a more
intense emotion than the other, such as grief/sadness, disgust/revulsion, lust/desire,
rage/anger, etc. (Turkkila 2014 finds more SUBSTANCE-IN-A-CONTAINER metaphors for
rage (20/500) than for anger (9/500), with fury in the middle but closer to anger
(13/500); however, the study does not list FULLNESS metaphors separately).
Table 13-5a. LIQUID metaphors for happiness and joy (Stefanowitsch 2006: 100)

(frequencies: happiness / joy)
source of NP 12 / 6
NP spring from X 1 / 1
X open self to NP 1
NP pour into heart 1
inner NP 1 / 2
X contain/include/hold NP 7 / 1
NP be in X 2
distillation of NP 1
X drink NP 1
NP evaporate 1
NP subside 1
X leave X empty of NP 1 / 1
TOTAL 24 / 16

filled/loaded with/full of NP 8 / 15
heart (be) full to bursting with NP 1
heart fill/swell with NP 2
X fill/swell Y('s heart) with NP 1 / 6
NP brim in heart 1
burst/explosion of NP 1 / 1
cold void run over with NP 1
NP burst in/through X('s) heart 2
NP overflow 1
X brim over with NP 1
X burst/erupt/explode in/with NP 6
NP surge/sweep/wash over/through X 1 / 3
X be swept away by NP 1
X pour NP 1
flow of NP emanate from X 1
NP seep from X 1
TOTAL 20 / 47
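The reported χ² = 9.4 can be checked directly from the two TOTAL rows of the table (24 vs. 16 for the first block of patterns, 20 vs. 47 for the FULLNESS-type block). The following is a minimal sketch of the standard Pearson chi-square computation; the helper function name is mine, not part of the original study.

```python
# Minimal Pearson chi-square test of independence on the 2x2 totals
# from Table 13-5a: pattern type (rows) by emotion word (columns).

def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Rows: first block vs. FULLNESS block; columns: happiness, joy
observed = [[24, 16],
            [20, 47]]
print(round(chi_square(observed), 2))  # → 9.4
```

With df = 1, this value is significant at p < 0.01, as stated in the text.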
Anatol Stefanowitsch
This case study demonstrates how central metaphors for a given target domain can be identified by searching for general words describing (aspects of) the domain in question. It also shows that these metaphors can be associated to different degrees with different words within a given target domain. It is unclear to what extent this lexeme specificity of metaphorical mappings supports or contradicts current cognitive theories of metaphor, so it is potentially a highly productive area of research.
(13.6a) the princess held a gun to Charles's head, figuratively speaking [BNC CBF]
(13.6b) He pictures eternity as a filthy Russian bathhouse [BNC A18]
(13.6c) the only way they can deal with crime is to fight fire, so to speak, with fire. [BNC ABJ]
Wallington et al. (2003) investigate the extent to which these devices, which they call metaphoricity signals, correlate systematically with the occurrence of metaphorical expressions in language use. They find no strong correlation, but as they note, this may well be due to various aspects of their design. First, they adopt a very broad view of what constitutes a metaphoricity signal, including
expressions like a kind/type/sort of, not so much NP as NP and even prefixes like super-, mini-, etc. While some or all of these signals may have an affinity to certain types of non-literal language, one would not really consider them to be metaphoricity signals in the same way as those in (13.6a-c). Second, they investigate a carefully annotated, but very small corpus. Third, they do not distinguish between strongly conventionalized metaphors, which are found in almost every utterance and are thus unlikely to be explicitly signaled, and weakly conventionalized metaphors, which seem more likely to be signaled explicitly a priori.
A more restricted case study could determine whether the idea of metaphoricity signals is, in principle, plausible. Let us look at what is intuitively the clearest case of such a signal on Wallington et al.'s list: the sentence adverbials metaphorically speaking and figuratively speaking. As a control, let us use the roughly equally frequent sentence adverbial technically speaking, which does not signal metaphoricity but which can, of course, co-occur with (conventionalized) metaphors and which can thus serve as a baseline.
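On a plain-text version of a corpus, a query for these adverbials can be approximated with a regular expression. The sketch below is an illustration only: the sample string stands in for real corpus files (e.g. the BNC), and the function name is mine.

```python
import re

# Find the sentence adverbials used in this case study and print a simple
# KWIC (keyword-in-context) line for each hit.
PATTERN = re.compile(
    r"\b(metaphorically|figuratively|technically)\s+speaking\b",
    re.IGNORECASE,
)

def kwic(text, width=30):
    """Return one keyword-in-context line per match in the text."""
    lines = []
    for m in PATTERN.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("Gregory put the boot in metaphorically speaking! "
          "Technically speaking, you are no longer employed.")
for line in kwic(sample):
    print(line)
```

In a real study, the hits would then be coded by hand as literal or metaphorical contexts, as done below.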
There are 22 cases of technically speaking in the BNC. Only four of these are part of a clause that arguably contains a metaphor:
(13.7b) [D]irectors can, technically speaking, be held liable for negligence and
consequently sued. [BNC CBU]
(13.7c) [T]echnically speaking, you are no longer in a position to provide him with
employment. [BNC HWN]
(13.7d) As novelists, however, Orwell and Waugh evolve not towards each other but,
Note that we are taking a very generous view on what counts as a metaphor: (13.7a) contains the spatial preposition off as part of the phrase off duty, which could be said to instantiate the metaphor a situation is a location; (13.7b) uses hold as part of the phrase hold liable, instantiating a metaphor like believing something about someone is holding them (cf. also hold s.o. responsible/accountable, hold in high esteem); (13.7c) uses provide as part of the phrase provide employment, which instantiates a metaphor like causing someone to be in a state is transferring an object to them (cf. also provide s.o. with an opportunity/insight/power). All three cases are highly lexicalized. Even (13.7d), which uses the verb evolve metonymically to refer to a non-evolutionary development and then uses the spatial expressions towards and opposite direction
(13.8b) Let's pick someone completely at random, now we've had Tracey figuratively speaking! [BNC F7U]
Table 13-6. Metaphorical expressions signaled by metaphorically/figuratively speaking (BNC)

EXPRESSION                      MEANING                              METAPHOR                          SOURCE
bloody the nose of sb           be successful in court against sb    legal fight is physical fight     A3G
take it on the chin             endure being criticized              argument is physical fight        ASA
put the boot in                 treat sb cruelly                     sports is physical fight          K25
hold gun to sb's head           coerce sb to act                     power is physical force           CBF
a taste for excitement          experience                           experience is taste               AC9
balance the books               set things right                     life is commercial transaction    KAE
cash in                         be successful                        life is commercial transaction    C9U
make a killing                  be financially successful            life is a hunt                    BMR
transport sb across the world   make sb think of a distant location  imagination is motion             ACX
bulwark of society              defender of society                  defense is a wall                 B0Y
make sth serve one's aims       put sth to use in achieving sth      to be used is to serve            BMA
a frozen moment in time         documentation of a particular state  time is a river                   HPN
superego be soluble in alcohol  self-control disappears when drunk   character is a substance          KM7
burst into song                 take one's turn speaking             singing for speaking              B21
give one's right arm to do sth  want sth very much                   body part for personal value      B2D
be within arm's length          be in close proximity                arm's length for short distance   HTP
you tell me                     your co-employee tell me             employee for company              HGY
We can now compare the literal and metaphorical contexts in which the expressions technically speaking and metaphorically/figuratively speaking occur. If the latter are a metaphoricity signal, they should occur significantly more frequently in metaphorical contexts than the former. As Table 13-7 shows, this is indeed the case (χ² = 20.74 (df = 1), p < 0.001).
Table 13-7. Literal and figurative utterances containing the sentence adverbials metaphorically/figuratively speaking and technically speaking in the BNC

                              SENTENCE ADVERBIAL
                       METAPHORICALLY/          TECHNICALLY
                       FIGURATIVELY SPEAKING    SPEAKING              Total
FIGURATIVE   YES       Obs: 17   Exp: 9.73      Obs: 4    Exp: 11.27    21
             NO        Obs: 2    Exp: 9.27      Obs: 18   Exp: 10.73    20
Total                  19                       22                      41
Of course, the question remains why some metaphors should be explicitly signaled while the majority is not. For example, we might suspect that metaphorical expressions are more likely to be explicitly signaled in contexts in which they might be interpreted literally. This may be the case for put the boot in, which occurs in a description of a rugby game where one could potentially misread it as a statement that someone was actually kicked:
(13.9a) Guy Gregory, who was later to prove the match winner, easily converted. Gloucester needed a kick start to get into the game. Three penalties helped put them in front. But not for long. Gregory put the boot in metaphorically speaking!
The same may be true of hold a gun to sb's head, of which there are ten examples in the BNC, only one of which is metaphorical:
(13.9b) I'm not sure if the princess held a gun to Charles's head, figuratively speaking, but it seems if she wanted something said.
Again, which of these hypotheses (if any of them) is correct would have to be studied more systematically.
This case study found a clear effect where the authors of the study it is based on did not. This demonstrates the need to formulate specific predictions concerning the behavior of specific linguistic items in such a way that they can be tested.
communication and media reporting on immigration, containing speeches, political manifestos and articles from the conservative newspapers Daily Mail and Daily Telegraph. He finds, among other things, that the metaphor immigration is a flood is used heavily, arguing that this allows the right to portray immigration as a disaster that must be contained, citing examples like a flood of refugees, the tide of immigration, and the trickle of applicants has become a flood.
Charteris-Black's findings are intriguing, but since he does not compare the findings from his corpus of right-wing materials to a neutral or a corresponding left-wing corpus, it remains an open question whether the use of these metaphors indicates a specifically right-wing perspective on immigration. Let us therefore replicate his analysis more systematically. The BNC contains 1 232 966 words from the Daily Telegraph (all files whose names begin with AH, AJ and AK), which will serve as our right-wing corpus, and 918 159 words from the Guardian (all files whose names begin with A8, A9 or AA, except file AAY), which will serve as our corresponding left-wing (or at least left-leaning) corpus. Since we are interested in a particular target domain, a metaphorical pattern analysis (as introduced above) suggests itself. Table 13-8a shows all concordance lines for the words migrant(s), immigrant(s) and refugee(s) containing metaphorical patterns instantiating the metaphor immigration is a mass of water.
In terms of absolute frequencies, there is no great difference between the two subcorpora (10 vs. 11), but the overall number of hits for the words in question differs drastically: there are 136 instances of these words in the Guardian subcorpus but only half as many (68) in the Telegraph subcorpus. This means that, relatively speaking, in the domain of migration liquid metaphors are more frequent than expected in the Telegraph and less frequent than expected in the Guardian (see Table 13-8b), which suggests that such metaphors are indeed typical of right-wing discourse. The difference is only marginally significant,
Table 13-8a. Concordance lines with LIQUID patterns for migrant(s), immigrant(s) and refugee(s) (BNC)

TELEGRAPH (n=68)
 for help to deal with flood of [[refugees]] By Philip Sherwell in Tuzla
 Sherwell in Tuzla THE FLOOD of [[refugees]] fleeing the escalating confli
an militia groups . Most of the [[refugees]] flowing into Tuzla are escapi
 gling to cope with the flood of [[refugees]] and have appealed to the inte

GUARDIAN (n=136)
mirroring this year 's flood of [[refugees]] . Watched by a demonstration
litary security , migration and [[refugee]] flows on an vast scale . As We
 cope with the current surge of [[refugees]] , her Foreign Secretary , Mr
nd other aspects of large-scale [[immigrant]] absorption . The bureaucracy
he need to control an influx of [[immigrants]] . Rebel troops end siege of
control and manage the flows of [[migrants]] that wars , disasters , and ,
greed that the flow of economic [[migrants]] from Vietnam should be stoppe
e form of an influx of Romanian [[refugees]] . In one case in 1988 a Roman
Table 13-8b. LIQUID patterns with the words migrant(s), immigrant(s), refugee(s)
NEWSPAPER
example, Koller 2004, Charteris-Black 2004, cf. also Musolff 2012), although at least part of this literature is not as systematic and quantitative as it should be, so this remains a promising area of research). The case study also demonstrates the need to include a control sample in corpus-linguistic designs (in case this still needed to be demonstrated at this point).
13.2.4 Metonymy
Following G. Lakoff and Johnson (1980: 35), metonymy is defined in a broad sense here as using one entity to refer to another that is related to it (this includes what is often called synecdoche, see Seto 1999 for critical discussion). Textbook examples are the following from Lakoff and Johnson (1980: 35 and 39, respectively):
(13.10a) The ham sandwich is waiting for his check.
(13.10b) Nixon bombed Hanoi.
In (13.10a), the metonym ham sandwich stands for the target expression the person who ordered the ham sandwich; in (13.10b) the metonym Nixon stands for the target expression the air-force pilots controlled by Nixon (at least at first glance, cf. Case Study 13.2.4.3 below).
Thus, metonymy differs from metaphor in that it does not mix vocabulary from two domains, which has consequences for a transfer of the methods introduced for the study of metaphor in Section 13.1.
The source-domain oriented approach can be transferred relatively straightforwardly: we can query an item (or set of items) that we suspect may be used as metonyms and then identify the actual metonymic uses (see Case Study 13.2.4.1). The main difficulty with this approach is choosing promising items for investigation. For example, the word sandwich occurs almost 900 times in the BNC, but unless I have overlooked one, it is not used as a metonym even once.
A straightforward analogue to the target-domain oriented approach (i.e., metaphorical pattern analysis) is more difficult to devise, as metonymies do not combine vocabulary from different semantic domains. One possibility would be to search for verbs that we know or suspect to be used with metonymic subjects and/or objects. For example, a Google search for "is waiting for (his|her|their|the) check" turns up about 20 unique hits; most of these have people as subjects and none of them have meals as subjects, but there are three cases that have table as subject, as in (13.11):
A more systematic but still somewhat informal investigation of this type will be presented in Case Study 13.2.4.2.
Finally, note that analyses of potential metonyms or target frames are not the only way in which corpus data can shed light on metonymical relationships; a particularly interesting, much more complex study is discussed in Case Study 13.2.4.3.
13.2.4.2 Case study: Metonymy and grammar. As mentioned above, a quasi-source-
domain-oriented approach, i.e. one starting from the potential metonym, is
straightforward to implement, although it depends on the choice of a promising
candidate for metonymic usage. Body parts are known to be such promising
candidates (see, e.g., Deignan 1999 on shoulder, Levin and Lindquist 2007 on nose,
Lindquist and Levin 2008 on foot and mouth and Niemeier 2008 on head and
heart).
Hilpert (2006) looks at metonymic uses of eye with the aim of determining the extent to which they represent conventionalized patterns (fixed or semi-fixed phrases like those in [13.11a, b]) as opposed to productive uses (of which [13.11c] is an example):
(13.11a) [H]er brother-in-law had kept an eye on her finances from the beginning [BNC FPM]
(13.11b) A detail on the screen had caught his eye. [BNC FR0]
(13.11c) Joan had an eye which knew what to do with apparently unpromising
In a ten-million-word sample from the BNC, Hilpert finds 443 metonymic uses of eye. Of these, 323 (i.e. 72.9 percent) represent highly conventionalized patterns, which are listed in Table 13-9 (with some minor differences from Hilpert's original classification).
Table 13-9. Metonymic patterns with eye in a 10 percent sample of the BNC

Pattern                               Meaning                          Metonymic Links             Freq.
i.    eye contact                                                                                   46
ii.   catch POSS eye                  visual contact                   eye for watching             34
      catch the eye (of NP)                                                                          7
      eye-catching                                                                                   2
iii.  V DET/POSS eye over NP          scan NP                                                        5
iv.   keep an eye on NP               pay attention to NP                                           56
      keep a ADJ eye on NP                                                                          10
      keep POSS eye on NP                                                                           11
v.    one eye on NP                   pay some attention to NP         eye for watching +            5
vi.   PREP the public eye             PREP the public attention        watching for attention       12
      Public Eye (TV series)                                                                         7
vii.  with an eye on NP (cf. xiii)    pay attention to NP                                            2
viii. turn a blind eye (to NP)        disregard something/NP                                        17
ix.   under the eye of NP             under observation of NP          eye for watching +            6
      under POSS eye                                                   watching for supervision      1
x.    with an eye to NP               with concern for NP              eye for watching +            5
                                                                       watching for concern
xi.   with an eye to V-ing NP         with the intention of V-ing NP   eye for watching +            3
                                                                       watching for intention
xii.  have an eye for NP (cf. xvii)   have interest in NP              eye for watching +            3
                                                                       watching for interest
.1
It would be interesting to look more closely at the nature of these productive
uses, for example, to see whether they are largely minimal extensions of
conventionalized paerns (like example 13.11c), or whether there are also truly
v0
creative uses, as are commonly found in the domain of metaphor. Interestingly,
these paerns Hilpert identies as productive uses overwhelmingly represent the
most frequently instantiated metonymies eye for aention, eye for watching
and eye for perception. is may appear to suggest that these are indeed
productive metonymies, while mappings like eye for expression or eye for
T
beholder are dead metonymies limited to a few xed phrases; however, it may
also be due to the fact that the expressions instantiating them are more
AF
39 Alternatively, as argued by Stallard (1993), it is the predicate rather than the subject that is used metonymically in this sentence, which would make this a metonym-oriented case study.
bomb] should allow us to assess, for example, the importance of this metonymy in relation to other metonymies and literal uses.
Querying the BNC for [pos: NOUN] [lemma: bomb, pos: VERB] yields 31 hits referring to the dropping of bombs. Of these, only a single one has the ultimate decision maker as a subject (cf. 13.12a). Somewhat more frequent in subject position are countries or inhabitants of countries (5 cases) (cf. 13.12b, c). Even more frequently, the organization responsible for carrying out the bombing, e.g. an air force or part of an air force, is chosen as the subject (9 cases) (cf. 13.12d, e). The most frequent case (14 hits) mentions the aircraft carrying the bombs in subject position, often accompanied by an adjective referring to the country whose military operates the planes (cf. 13.12f) or some other responsible group (cf. 13.12g). Finally, there are two cases where the bombs themselves occupy the subject position (cf. 13.12h).
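The corpus query can be approximated in plain Python over a POS-tagged, lemmatized token stream. The mini-corpus below is invented for illustration and stands in for the BNC; the function name is mine.

```python
# Rough analogue of the query "[pos: NOUN] [lemma: bomb, pos: VERB]":
# scan (word, pos, lemma) triples for a noun directly followed by a
# verbal form of "bomb".

tagged = [
    ("Iraq", "NOUN", "iraq"), ("bombed", "VERB", "bomb"), ("Larak", "NOUN", "larak"),
    ("the", "DET", "the"), ("jets", "NOUN", "jet"), ("bombed", "VERB", "bomb"),
    ("the", "DET", "the"), ("palace", "NOUN", "palace"), (".", "PUNCT", "."),
    ("a", "DET", "a"), ("bomb", "NOUN", "bomb"), ("exploded", "VERB", "explode"),
]

def noun_plus_bomb(tokens):
    """Return (subject_noun, verb) pairs matching the query pattern."""
    pairs = []
    for (w1, p1, _), (w2, p2, l2) in zip(tokens, tokens[1:]):
        if p1 == "NOUN" and p2 == "VERB" and l2 == "bomb":
            pairs.append((w1, w2))
    return pairs

print(noun_plus_bomb(tagged))  # → [('Iraq', 'bombed'), ('jets', 'bombed')]
```

Note that the noun "bomb" followed by a different verb is correctly ignored; the hits would then be classified by subject type as in the text.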
(13.12a) Mussolini bombed and gassed the Abyssinians into subjection.
(13.12b) On the day on which Iraq bombed Larak
(13.12c) Seven years after the Americans bombed Libya
(13.12d) [T]he school was blasted by an explosion, louder than anything heard there since the Luftwaffe bombed it in 1944.
(13.12e) Germany, whose Condor Legion bombed the working classes in Guernica
(13.12f) on Jan. 24 French aircraft bombed Iraq for the first time
(13.12g) Rebel jets bombed the Miraflores presidential palace
(13.12h) Watching our bombs bomb your people
Cases with pronouns in subject position have a similar distribution; again, there is only one hit with a human controller in subject position. All hits (whether with pronouns, common nouns or proper names), interestingly, have metonymic subjects, i.e., not a single example has the bomber pilot in the subject position. This is unexpected, since literal uses should be more frequent than figurative uses (it leads Stefanowitsch 2015 to reject an analysis of such sentences as metonymies altogether). On the other hand, there are cases that are plausibly analyzed as metonymies here, such as examples (13.12d, e), which seem to instantiate a metonymy like military unit for member of unit (i.e. whole for part) and (13.12f-h), which instantiate plane for pilot, i.e. instrument for controller.
More systematic study of such metonymies by target domain could uncover
13.2.4.3 Case Study: Logical metonymies. Lapata and Lascarides (2003) present an innovative approach to what they call logical metonymy: clauses where the object of a verb is logically the object of some other verb that does not occur in the utterance. For example, we would typically interpret the utterance Mary enjoyed the book as Mary enjoyed reading the book, or perhaps Mary enjoyed writing the book, but definitely not as Mary enjoyed buying the book, even though buying a book is a perfectly normal thing to do.
Lapata and Lascarides argue that the interpretation of the missing verb depends both on the verb that is actually present (e.g. enjoy) and on the object (e.g. book): it is determined, in our examples, on the basis of our knowledge of what people typically enjoy and our knowledge of what people do with books. By identifying verbal collocates shared by the verb (when used as a matrix verb with a dependent clause) and the object, it should be possible to recover the most plausible interpretations for any given logical metonymy.
For example, the ten most frequent dependent verbs of enjoy as a matrix verb in the BNC are (in that order) be, work, read, play, have, go, watch, meet, make, and see; the verbs most frequently taking book as an object (or predicate noun) are be, read, write, have, get, publish, buy, find, produce and make. Taking into account the overall frequency of these verbs and their frequencies with enjoy and book, it should be possible to find the best match. Lascarides and Lapata use the following formula, where Vmet is the metonymically used verb (e.g. enjoy), Vimp is the potentially implied verb (e.g. read, write, buy) and Obj is the object (e.g. book):

P(Vimp, Vmet, Obj) = f(Vmet & Vimp) / N × f(Obj & Vimp) / f(Vimp)
Buy occurs 25,827 times in the BNC in general, and 125 times with the object [(Det) (Adj) book]; enjoy buying occurs 3 times, and there are 11,292,844 verbs in the version of the BNC used here. Thus, we can calculate the hypothetical probability:
(13.13b) P(Vimp, Vmet, Obj) = 3/11,292,844 × 125/25,827 = 0.000000001286
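The calculation in (13.13b) can be reproduced directly from the counts given in the text; the variable names below are mine.

```python
# Plugging the counts from the text into the formula
# P(Vimp, Vmet, Obj) = f(Vmet & Vimp)/N * f(Obj & Vimp)/f(Vimp)
# for Vmet = enjoy, Vimp = buy, Obj = book.

N = 11_292_844       # verb tokens in the version of the BNC used here
f_enjoy_buy = 3      # "enjoy buying"
f_buy_book = 125     # buy + [(Det) (Adj) book]
f_buy = 25_827       # all occurrences of buy

p = (f_enjoy_buy / N) * (f_buy_book / f_buy)
print(f"{p:.12f}")  # → 0.000000001286
```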
If we repeat this calculation for all verbs occurring with enjoy and/or book and order the results by probability of co-occurrence, we find that the most likely interpretation is, indeed, read, followed by write (see Table 13-10).
Table 13-10. Most probable interpretations for enjoy & book
VIMP     f(Obj & VIMP)    f(VMET & VIMP)    f(VIMP)    P(VIMP, VMET, Obj)
Of course, the method is not flawless. First, it obviously depends on the size and quality of the corpus (note the occurrence in third place of the verb cook; this is entirely due to the questionable tagging of cook as a verb in the compound cook book). Second, the results do not always match what is intuitively expected, suggesting that patterns like enjoy + Object at least in some cases receive conventionalized interpretations that do not reflect an intersection of the use of [enjoy + Verb] and [Verb + Object]. Compare, for example, the results for food and meal in Table 13-11.
Obviously, I enjoyed the meal/food can both be interpreted to mean I enjoyed cooking the meal/food in a specific context, but in both cases the default interpretation would be I enjoyed eating the meal/food. However, based on the frequencies of [enjoy + Verb] and [Verb + meal], the most likely interpretation for enjoy the meal should be enjoy cooking the meal. It is possible that a more fine-grained corpus analysis (for example, one that distinguishes between transitive
and intransitive uses of cook (and other verbs) in [enjoy + Verb]) would yield more accurate results, but it is also possible that usage does not fully reflect our assumptions about what people enjoy and/or typically do with meals.
Table 13-11. Most probable interpretations for enjoy & food and enjoy & meal
VIMP     f(Obj & VIMP)    f(VMET & VIMP)    f(VIMP)    P(VIMP, VMET, Obj)
FOOD
eat            210              14            13791        1.89E-08
cook            39              13             3933        1.14E-08
buy            129               3            25827        1.33E-09
share           20               9            12803        1.24E-09
be             354             150          4122614        1.14E-09
MEAL
cook            83              13             3933        2.43E-08
eat            112              14            13791        1.01E-08
make            93              35           210554        1.37E-09
share           15               9            12803        9.34E-10
have           216              52          1317706        7.55E-10
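The ranking for enjoy the meal can be reproduced from the counts in Table 13-11; as discussed above, cook comes out on top even though eat would be the intuitive default. A minimal sketch (function and variable names are mine):

```python
# Rank candidate interpretations for "enjoy the meal" using the counts
# from Table 13-11: (f(Obj & Vimp), f(Vmet & Vimp), f(Vimp)) per verb.

N = 11_292_844  # verb tokens in the corpus, as given in the text

meal = {
    "cook":  (83, 13, 3933),
    "eat":   (112, 14, 13791),
    "make":  (93, 35, 210554),
    "share": (15, 9, 12803),
    "have":  (216, 52, 1317706),
}

def prob(f_obj, f_met, f_v):
    """P(Vimp, Vmet, Obj) = f(Vmet & Vimp)/N * f(Obj & Vimp)/f(Vimp)."""
    return (f_met / N) * (f_obj / f_v)

ranking = sorted(meal, key=lambda v: prob(*meal[v]), reverse=True)
print(ranking)  # → ['cook', 'eat', 'make', 'share', 'have']
```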
Finally, as a demonstration that the method does not simply pick the verb most frequently used with the object, consider Table 13-12, which shows the most probable interpretations for [plan (Det) (Adj) book]. Even though read is the most frequent verb with book, the method correctly picks write, followed by the almost equally likely publish, as the best interpretation (read is the eighth-best interpretation).
deductive and in terms of what exactly the variables are. This is because it does not use corpus data primarily as a way of testing a hypothesis or answering a research question, but as a model of how humans (or other language-learning systems) may acquire the correct interpretations of logical metonymies. This model itself may then be tested, for example (as Lapata and Lascarides do), by comparing its results with experimentally collected native-speaker judgments. Here, the hypothesis would be that the model correctly predicts the majority of interpretations, and the variable would be PREDICTION OF THE MODEL with the values CORRECT and INCORRECT.
Further Reading
Deignan 2005 is a comprehensive attempt to apply corpus-linguistic methods to a range of theoretically informed research questions concerning metaphor. The contributions in Stefanowitsch & Gries (2006) demonstrate a range of methodological approaches by many leading researchers applying corpus methods to the investigation of metaphor.
Deignan, Alice. 2005. Metaphor and corpus linguistics. Amsterdam & Philadelphia: John Benjamins.
Epilogue
In this book, I have focused on corpus linguistics as a methodology, more precisely, as an application of a general observational scientific procedure to large samples of linguistic usage. I have refrained from placing this method in a particular theoretical framework for two reasons.
The first reason is that I'm not convinced that linguistics should be focusing quite as much on theoretical frameworks, but rather on linguistic description based on data. Edward Sapir famously said that "unfortunately, or luckily, no language is tyrannically consistent. All grammars leak" (1921: 39). This is all the more true of formal models that attempt to achieve tyrannical consistency by pretending those leaks do not exist or, if they do exist, are someone else's problem. To me, and to many others whose studies I discussed in this book, the ways grammars leak are simply more interesting than the formalisms that help us ignore this.
this analysis for the model depend on the type of linguistic reality that is being modeled. If it is language use, as, for example, in historically or sociolinguistically oriented studies, the distance is relatively short, requiring the researcher to discover the systematicity behind the usage patterns observed in the data. If it is the mental representation of language, the length of the distance depends on your assumptions about those representations.
Traditionally, those representations have been argued to be something fundamentally different from linguistic usage: that they are an ephemeral competence based on a universal grammar that may be a mental organ (Chomsky 1980) or an evolved biological instinct (Pinker 1994), but that is dependent on and responsible for linguistic usage only in the most indirect ways imaginable. As I have argued in Chapters 1 and 2, even those frameworks have no alternative to corpus data that does not suffer from the same drawbacks, without
words, to be viewed as a kind of pastiche, pasted together in an improvised way out of ready-made elements (Hopper 1987: 144).
In these models, the corpus becomes more than just a research tool; it becomes part of a model of linguistic competence itself (see further Stefanowitsch 2011). In fact, in the most radical version, Hoey's (2005) notion of lexical priming, the corpus essentially is the model of linguistic competence:
The notion of priming as here outlined assumes that the mind has a mental concordance of every word it has encountered, a concordance that has been richly glossed for social, physical, discoursal, generic and
where children's expanding grammatical abilities (as reflected in acquisition corpora) are investigated against the input that they get from their caretakers. It is not always shared by the major theoretical proponents of the Usage-Based Model, who connect the notion of usage to the notion of linguistic corpora only in theory. However, it is a view that offers a tremendous potential to bring together two broad strands of research: cognitive-functional linguistics (including some versions of construction grammar) and corpus linguistics (including attempts to build theoretical models on corpus data, such as Pattern Grammar [Hunston and Francis 2000] and Lexical Priming [Hoey 2005]). These strands have developed more or less independently and their proponents are sometimes mildly hostile toward each other over minor, but fundamental differences in perspective (see McEnery and Hardie 2012, Section 8.3 for discussion), but they could complement each other in many ways, cognitive linguistics providing a more explicitly psychological framework than most corpus linguists adopt, and corpus linguistics providing a methodology that cognitive
Appendix
A: Resources
A.1 Corpora and Text Archives
ANC American National Corpus. An ongoing project to build a large corpus of written and spoken American English from the 1990s onward. A 15 million word slice is available for free download.
http://www.anc.org/
http://www.natcorp.ox.ac.uk/
COLT The Bergen Corpus of London Teenage Language. Available commercially on CD-ROM from ICAME.
FLOB The Freiburg LOB Corpus. A corpus of written British English from 1991, constructed in parallel to the LOB corpus. Available commercially on CD-ROM from ICAME.
OTA The Oxford Text Archive. A vast collection of freely available corpora, including the BNC and the LOB corpus (in some cases, free registration is required).
http://ota.ox.ac.uk
Note. In this book, three texts from this collection are used at various points:
A.2 Dictionaries
CALD Cambridge Advanced Learner's Dictionary & Thesaurus
Online: http://dictionary.cambridge.org/
LDCE Longman Dictionary of Contemporary English
Online: http://www.ldoceonline.com/
MW Merriam-Webster
Online: http://www.merriam-webster.com/
WN WordNet
Online: http://wordnet.princeton.edu/
A.3 Software
Concordancing
TextSTAT A simple concordancer for Windows, Linux or MacOS X with a graphical user interface. Best for working with small corpora up to 1 million words. Does concordances and frequency lists, allows regular expressions in queries. Uses a free software licence. Recommended for beginners, or for working with small, ad-hoc corpora.
http://neon.niederlandistik.fu-berlin.de/en/textstat/
CWB The Corpus Workbench. A sophisticated concordancer with a command-line interface, running in any Unix-type environment, including Linux, MacOS X or, with appropriate virtualization, Windows. Can be installed locally or on a server. Requires corpora to be transformed into a standard format and indexed by the program. Installation and corpus preparation is fairly complex, but once it is done, the program is extremely powerful and flexible. Recommended for professional users working with large corpora.
http://cwb.sourceforge.net/
procedures mentioned at various points in this book:
The cfa package for Configural Frequency Analysis
The irr package for calculating inter-rater agreement
The ngramr package for accessing the Google Ngram data
B: Statistical techniques
B.1: Cohen's Kappa (κ)
The agreement between the two raters can then be expressed in terms of a measure referred to as Cohen's Kappa (κ). This measure can be calculated in six steps.
Step 1: Determine for all coding categories the number of cases where the raters have assigned the same category and the number of cases where they have assigned different categories. Record the results in a table like that shown in Table 1 (this table assumes that there are two categories, but it should be obvious how to generalize this to larger tables).
                        RATER 1
                CATEGORY X                  CATEGORY Y                  Total
RATER 2
  CAT. X        C11: No. of items           C12: No. of items
                categorized as X            categorized as Y by R1
                by both raters              and X by R2
  CAT. Y        C21: No. of items           C22: No. of items
                categorized as X by R1      categorized as Y
                and Y by R2                 by both raters
  Total                                                                 N
Step 2: Calculate the expected frequencies from the observed frequencies, following the standard procedure for each cell of multiplying the column total and the row total and dividing the product by the table total N.
Step 3: Calculate the sum of observed agreements (AObs) by adding up the observed frequencies of all cells that represent intersections of the same category for both raters. No matter what size the table, these will always be the cells in the diagonal from the top left cell to the bottom right cell of the table (if you number them according to the standard scheme, they are all cells whose row and column index are identical: C11, C22, C33, C44, etc.).
Step 4: Calculate the sum of expected agreements (A_Exp) by repeating the
procedure for the expected frequencies.
Step 5: You now have all the necessary information for calculating Cohen's κ:

    κ = (A_Obs − A_Exp) / (N − A_Exp)
Step 6: Interpret κ and draw the appropriate conclusions. Generally, κ ≥ 0.7 is
considered to be an indication that inter-rater reliability is satisfactory, while κ <
0.7 is taken to show that inter-rater reliability is not satisfactory. Obviously, 0.7
cannot be taken as an absolute level; what you consider to be satisfactory will
depend on your research project, and there are many situations where a higher
reliability is necessary. If you are satisfied with your inter-rater reliability, you can
go on to the next step of your research project (presumably the statistical
evaluation of your data). If you are not satisfied, you must try to determine
whether there is a problem with the coding scheme or the way it was applied, and
fix it before proceeding.
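Steps 1 to 5 are mechanical enough to automate. For work in R, the irr package listed in Appendix A implements this measure; purely as an illustration, here is a minimal Python sketch. The function name and the example counts are invented for this example, not taken from the book's data.

```python
# Cohen's kappa for two raters, computed from an agreement table as in
# Steps 1-5 above. Function name and example counts are hypothetical.

def cohens_kappa(table):
    """table[i][j] = number of items assigned category i by rater 2
    and category j by rater 1 (cf. Table 1)."""
    n = sum(sum(row) for row in table)          # table total N
    k = len(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Step 3: observed agreements = sum of the diagonal cells
    a_obs = sum(table[i][i] for i in range(k))
    # Steps 2 and 4: expected frequency of each diagonal cell is
    # (row total * column total) / N; summing these gives A_Exp
    a_exp = sum(row_totals[i] * col_totals[i] / n for i in range(k))
    # Step 5: kappa = (A_Obs - A_Exp) / (N - A_Exp)
    return (a_obs - a_exp) / (n - a_exp)

# hypothetical counts: 45 items rated X by both raters, 35 rated Y by
# both raters, 20 disagreements
print(cohens_kappa([[45, 5], [15, 35]]))        # 0.6
```

Note that the same code handles larger tables, since it sums over the whole diagonal (C11, C22, C33, etc.).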
C: Statistical Tables
The statistical tables in this section are provided for readers who want to practice
manually computing the tests introduced in this book. For serious research
projects, it is highly recommended to use a statistical software package such as
R (see Appendix A.3 above), which will output the p-values directly, making such
tables superfluous.
2. Find the rightmost column listing a value smaller than the chi-square
value you have calculated. The value at the top of that column is the
corresponding probability of error.
3 4.1083 6.2514 6.4915 6.7587 7.0603 7.4069 7.8147 8.3112 8.9473 9.8374 11.3449 16.2662
4 5.3853 7.7794 8.0434 8.3365 8.6664 9.0444 9.4877 10.0255 10.7119 11.6678 13.2767 18.4668
5 6.6257 9.2364 9.5211 9.8366 10.1910 10.5962 11.0705 11.6443 12.3746 13.3882 15.0863 20.5150
6 7.8408 10.6446 10.9479 11.2835 11.6599 12.0896 12.5916 13.1978 13.9676 15.0332 16.8119 22.4577
7 9.0371 12.0170 12.3372 12.6912 13.0877 13.5397 14.0671 14.7030 15.5091 16.6224 18.4753 24.3219
8 10.2189 13.3616 13.6975 14.0684 14.4836 14.9563 15.5073 16.1708 17.0105 18.1682 20.0902 26.1245
9 11.3888 14.6837 15.0342 15.4211 15.8537 16.3459 16.9190 17.6083 18.4796 19.6790 21.6660 27.8772
10 12.5489 15.9872 16.3516 16.7535 17.2026 17.7131 18.3070 19.0207 19.9219 21.1608 23.2093 29.5883
11 13.7007 17.2750 17.6526 18.0687 18.5334 19.0614 19.6751 20.4120 21.3416 22.6179 24.7250 31.2641
12 14.8454 18.5493 18.9395 19.3692 19.8488 20.3934 21.0261 21.7851 22.7418 24.0540 26.2170 32.9095
13 15.9839 19.8119 20.2140 20.6568 21.1507 21.7113 22.3620 23.1423 24.1249 25.4715 27.6882 34.5282
14 17.1169 21.0641 21.4778 21.9331 22.4408 23.0166 23.6848 24.4855 25.4931 26.8728 29.1412 36.1233
15 18.2451 22.3071 22.7319 23.1993 23.7202 24.3108 24.9958 25.8162 26.8479 28.2595 30.5779 37.6973
16 19.3689 23.5418 23.9774 24.4564 24.9901 25.5950 26.2962 27.1356 28.1907 29.6332 31.9999 39.2524
17 20.4887 24.7690 25.2150 25.7053 26.2514 26.8701 27.5871 28.4450 29.5227 30.9950 33.4087 40.7902
18 21.6049 25.9894 26.4455 26.9467 27.5049 28.1370 28.8693 29.7451 30.8447 32.3462 34.8053 42.3124
19 22.7178 27.2036 27.6694 28.1814 28.7512 29.3964 30.1435 31.0367 32.1577 33.6874 36.1909 43.8202
20 23.8277 28.4120 28.8874 29.4097 29.9910 30.6489 31.4104 32.3206 33.4624 35.0196 37.5662 45.3147
Note that such tables typically have only three columns, one each for the standard significance
levels 0.05, 0.01, and 0.001. However, it should be kept in mind that this is a relatively arbitrary
convention (as two psychologists put it: "Surely, God loves the .06 nearly as much as the .05"
(Rosnow and Rosenthal 1989: 1277)). Clearly, what probability of error one is willing to accept for
any given study depends very much on the nature of the study, the nature of the research design,
and a general disposition to take or avoid risk. Thus, significance levels between 0.05 and 0.1 (or
even higher) are often referred to as marginally significant, reflecting a willingness on the part of
the researcher to live with a six, seven, or even ten percent chance of being mistaken in assuming a
non-random distribution. Our table lists levels up to ten percent, and it lists the values in between
five percent and one percent in order to allow a more fine-grained assessment of the chi-square
values.
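The lookup procedure described above can also be expressed in a few lines of code. The following Python sketch is not from the book (which recommends R); the function name is invented, but the critical values are the standard ones for one degree of freedom, which also appear in the first row of Table C-2 below.

```python
# Manual table lookup, as in the two steps above: scan the row of
# critical values and keep the rightmost level whose critical value is
# still exceeded by the observed chi-square value.

DF1_CRITICAL = [(0.10, 2.7055), (0.05, 3.8415),
                (0.01, 6.6349), (0.001, 10.8276)]

def lookup_p(chisq, critical=DF1_CRITICAL):
    """Return the probability of error for chisq, or None if chisq does
    not exceed even the critical value for the largest listed level."""
    p = None
    for level, value in critical:
        if chisq > value:
            p = level
    return p

print(lookup_p(4.2))   # exceeds 3.8415 but not 6.6349, so 0.05
print(lookup_p(1.5))   # below all critical values: None
```

For real analyses, a statistics package will of course report the exact p-value directly.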
Table C-2. Critical values for multiple chi-square tests at one degree of freedom.
No. of Tests  0.1     0.05    0.01    0.001   No. of Tests  0.1     0.05    0.01    0.001
1 2.7055 3.8415 6.6349 10.8276 36 8.9480 10.2205 13.2146 17.5641
2 3.8415 5.0239 7.8794 12.1157 37 8.9980 10.2710 13.2660 17.6162
3 4.5286 5.7311 8.6154 12.8731 38 9.0468 10.3202 13.3160 17.6670
4 5.0239 6.2385 9.1406 13.4121 39 9.0943 10.3682 13.3647 17.7164
5 5.4119 6.6349 9.5495 13.8311 40 9.1406 10.4149 13.4121 17.7645
6 5.7311 6.9604 9.8846 14.1739 41 9.1858 10.4605 13.4585 17.8115
7 6.0025 7.2367 10.1685 14.4641 42 9.2299 10.5051 13.5037 17.8574
8 6.2385 7.4768 10.4149 14.7157 43 9.2730 10.5486 13.5478 17.9022
9 6.4475 7.6891 10.6326 14.9379 44 9.3151 10.5911 13.5910 17.9459
10 6.6349 7.8794 10.8276 15.1367 45 9.3563 10.6326 13.6332 17.9887
11 6.8049 8.0520 11.0041 15.3167 46 9.3966 10.6733 13.6745 18.0305
than the value in the appropriate cell, stop and report a significance level
of 0.05. If it is smaller than or equal to the value in the appropriate cell, go to 3.
3. Repeat the three steps with the third table. If your U value is larger than
the value in the appropriate cell, report a significance level of 0.01. If it is
smaller than or equal to the value in the appropriate cell, report a significance
level of 0.001.
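This sequential lookup can be sketched as a small function. The Python code below is illustrative only; the function name is invented, and the critical values 23, 16 and 8 are those for m = n = 10 read off Tables C-3, C-4 and C-5. Unlike the chi-square test, a Mann-Whitney U value is significant when it is at or below the critical value.

```python
# The sequential three-table lookup above as a function. Critical values
# default to those for m = n = 10 (Tables C-3, C-4, C-5).

def mann_whitney_level(u, crit_05=23, crit_01=16, crit_001=8):
    """Return the strongest significance level reached by u, or None
    if u is not significant at the 0.05 level."""
    if u > crit_05:
        return None       # stop: not significant
    if u > crit_01:
        return 0.05       # significant at 0.05 but not at 0.01
    if u > crit_001:
        return 0.01       # significant at 0.01 but not at 0.001
    return 0.001

print(mann_whitney_level(20))  # 0.05
print(mann_whitney_level(5))   # 0.001
```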
Table C-3. Critical values for the two-sided Mann-Whitney test (p < 0.05)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2 0 0 0 0 1 1 1 1 1 2 2 2 2 3 3 3 3 3
3 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
4 0 1 2 3 4 4 5 6 7 8 9 10 11 11 12 13 14 15 16 17 17 18
5 2 3 5 6 7 8 9 11 12 13 14 15 17 18 19 20 22 23 24 25 27
6 5 6 8 10 11 13 14 16 17 19 21 22 24 25 27 29 30 32 33 35
7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44
8 13 15 17 19 22 24 26 29 31 34 36 38 41 43 45 48 50 53
9 17 20 23 26 28 31 34 37 39 42 45 48 50 53 56 59 62
10 23 26 29 33 36 39 42 45 48 52 55 58 61 64 67 71
11 30 33 37 40 44 47 51 55 58 62 65 69 73 76 80
12 37 41 45 49 53 57 61 65 69 73 77 81 85 89
m 13 45 50 54 59 63 67 72 76 80 85 89 94 98
14 55 59 64 69 74 78 83 88 93 98 102 107
15 64 70 75 80 85 90 96 101 106 111 117
16 75 81 86 92 98 103 109 115 120 126
17 87 93 99 105 111 117 123 129 135
18 99 106 112 119 125 132 138 145
19 113 119 126 133 140 147 154
20 127 134 141 149 156 163
21 142 150 157 165 173
22 158 166 174 182
23 175 183 192
24 192 201
25 211
Table C-4. Critical values for the two-sided Mann-Whitney test (p < 0.01)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2 0 0 0 0 0 0 0
3 0 0 0 1 1 1 2 2 2 2 3 3 3 4 4 4 5
4 0 0 1 1 2 2 3 3 4 5 5 6 6 7 8 8 9 9 10 10
5 0 1 1 2 3 4 5 6 7 7 8 9 10 11 12 13 14 14 15 16 17
6 2 3 4 5 6 7 9 10 11 12 13 15 16 17 18 19 21 22 23 24
7 4 6 7 9 10 12 13 15 16 18 19 21 22 24 25 27 29 30 32
8 7 9 11 13 15 17 18 20 22 24 26 28 30 32 34 35 37 39
9 11 13 16 18 20 22 24 27 29 31 33 36 38 40 43 45 47
10 16 18 21 24 26 29 31 34 37 39 42 44 47 50 52 55
11 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63
12 27 31 34 37 41 44 47 51 54 58 61 64 68 71
m 13 34 38 42 45 49 53 57 60 64 68 72 75 79
14 42 46 50 54 58 63 67 71 75 79 83 87
15 51 55 60 64 69 73 78 82 87 91 96
16 60 65 70 74 79 84 89 94 99 104
17 70 75 81 86 91 96 102 107 112
18 81 87 92 98 104 109 115 121
19 93 99 105 111 117 123 129
20 105 112 118 125 131 138
21 118 125 132 139 146
22 133 140 147 155
23 148 155 163
24 164 172
25 180
Table C-5. Critical values for the two-sided Mann-Whitney test (p < 0.001)
n
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
1
2
3 0 0 0 0 0
4 0 0 0 1 1 1 2 2 2 3 3 3 3
5 0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8
6 0 1 2 2 3 4 5 5 6 7 8 8 9 10 11 12 12 13
7 0 1 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19
8 2 4 5 6 7 9 10 11 13 14 15 17 18 20 21 22 24 25
9 5 7 8 10 11 13 15 16 18 20 21 23 25 26 28 30 32
10 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
11 12 15 17 19 21 24 26 28 31 33 35 38 40 42 45
12 17 20 22 25 27 30 33 35 38 41 44 46 49 52
m 13 23 25 28 31 34 37 40 43 46 49 52 56 59
14 29 32 35 39 42 45 49 52 55 59 62 66
15 36 39 43 46 50 54 58 61 65 69 73
16 43 47 51 55 59 63 67 72 76 80
17 51 56 60 65 69 74 78 83 87
18 61 65 70 75 80 85 89 94
19 70 76 81 86 91 96 102
20 81 87 92 98 103 109
21 93 98 104 110 116
22 105 111 117 124
23 118 124 131
24 132 139
25 146
1. Find the appropriate row for the degrees of freedom of your test.
2. Find the rightmost column whose value is smaller than your t-value. At
the top of the column you will find the p-value you should report.
df .10 .05 .01 .001 df .10 .05 .01 .001
1 6.31 12.71 63.66 636.62 21 1.72 2.08 2.83 3.82
2 2.92 4.30 9.92 31.60 35 1.69 2.03 2.72 3.59
3 2.35 3.18 5.84 12.92 40 1.68 2.02 2.70 3.55
4 2.13 2.78 4.60 8.61 45 1.68 2.01 2.69 3.52
5 2.02 2.57 4.03 6.87 50 1.68 2.01 2.68 3.50
6 1.94 2.45 3.71 5.96 55 1.67 2.00 2.67 3.48
7 1.90 2.37 3.50 5.41 60 1.67 2.00 2.66 3.46
8 1.86 2.31 3.36 5.04 65 1.67 2.00 2.65 3.45
9 1.83 2.26 3.25 4.78 70 1.67 1.99 2.65 3.44
10 1.81 2.23 3.17 4.59 75 1.67 1.99 2.64 3.43
11 1.80 2.20 3.11 4.44 80 1.66 1.99 2.64 3.42
12 1.78 2.18 3.06 4.32 85 1.66 1.99 2.64 3.41
D: Data
D1. Possessives Data
its policy PRON their glasses PRON
his waterfront property PRON their own goals PRON
his offices PRON our major parties PRON
its conference victory last Saturday night PRON their new fiscal year PRON
his wife, Nori PRON its passengers PRON
their big week ends of the year PRON whose stock PRON
their clients' guilt PRON the Department's recommendation NOUN
its budgeted $66,000 in the category PRON our rules PRON
his wife, Sue PRON its members' duties PRON
her professional roles PRON their appropriate participation in the PRON
determination of high policy
the government's special ceremonies at NOUN our laboratory PRON
Memorial University honoring distinguished
sons and daughters of the island province
their work PRON its synthesis by systems with intact thyroid cells PRON
in vitro
his research PRON their economic development programs PRON
the year's grist of nearly 15,000 book titles NOUN its environment PRON
their party PRON his distress PRON
their actions PRON her partially exposed breast PRON
our only obligation for this day PRON the posse's approach NOUN
our anatomy PRON her letter to John Brown PRON
his apologetic best PRON ladies' fashions NOUN
his work PRON your present history PRON
his readers PRON her honor PRON
his stage settings PRON his art PRON
its musical frame PRON the convict's climactic reappearance in London NOUN
his horn PRON industry's main criticism of the Navy's NOUN
antisubmarine effort
a burgomaster's Beethoven NOUN his gaunt height PRON
the world's finest fall coloring NOUN his position PRON
a standard internist's text NOUN her sharp tongue and fierce initiative PRON
their head PRON his own salvation PRON
my salvation PRON his work PRON
its program PRON his sister Mary PRON
mom's apple pie NOUN his mouth PRON
his problems PRON his glass PRON
his mind PRON his things PRON
his on-the-job problems PRON her dish PRON
our health PRON her mother PRON
her hits PRON his helplessness PRON
her captain PRON her cats PRON
their families PRON his jacket pockets PRON
his permanently shortened leg PRON her fear PRON
his own task force of three stragglers PRON his voice PRON
their athletic pursuits PRON your landlord's habits and movements PRON
his instructions PRON her crew PRON
the Square's historic value NOUN their clothes PRON
her testimony PRON my God PRON
their home range PRON his baldness PRON
their relation to issues of the most desperate PRON her brain PRON
urgency for the life of mankind
their work and responsibilities PRON the town marshal's office NOUN
its periodic bloody engagements with the Senate PRON his bullet-riddled hat PRON
their effectiveness in reaching desired goals PRON his glance at Gyp Carmer PRON
whose arms were too long PRON his line of vision PRON
his role PRON his back PRON
his reading public PRON our attention PRON
his early poems PRON her Midwestern lineage PRON
their impulses and desires PRON whose only fault PRON
our need PRON the novelist's carping phrase NOUN
the announcement last week of the forthcoming NOUN the number of new plants and construction NOUN
encounter projects in Rhode Island
a joint session of Congress NOUN the advancement of all people NOUN
the opening of a new bypass NOUN the magnitude of effort and costs of each of NOUN
these possible phases
the side of right NOUN the existence of Prandtl numbers reaching NOUN
values of more than unity
support of private, state, or municipal ETV NOUN subsection (A) of this section NOUN
efforts
a conference of the California Democratic NOUN the passage of Public Law 236 NOUN
Council directorate
the necessity of interpretation by a Biblical NOUN violation of the provisions of the Universal NOUN
scholar Military Training and Service Act
Dvorak, Canteloube, Copland and Britten
the expense of everything else NOUN the growth of senile individuals NOUN
the vision and conscience of the candidate the NOUN two indicators of developmental level NOUN
party chooses to lead it
the deep friezes of his architectonic music NOUN the anterior lobe of the pituitary NOUN
a lack of unity of purpose and respect for heroic NOUN hyalinization of afferent glomerular arterioles NOUN
leadership
the golden throne of the merchant NOUN the study of the actions of drugs in this respect NOUN
the third floor of her house which was behind NOUN the space of N times continuously differentiable NOUN
the wall functions
the revival of native ones, some of which had PRON the totality of singular lines NOUN
been lying in slumber for centuries
the danger that an effect of it PRON the image regulus of the pencil NOUN
the use of his particular skills NOUN the place of religion NOUN
the ideals of the country NOUN follow-up visits of the nurse and social worker NOUN
the Mahayana metaphysic of mystical union NOUN the ranking of the various families NOUN
for salvation
the death throes of men who were shot before NOUN the number of registered persons NOUN
the paredon
a result of the multi-purpose resources control NOUN a consequence of the severe condition of NOUN
program of the government perceived threat that persists unabated for the
anxious child in an ambiguous sort of school
environment
the volume of the cylinder opening in the head NOUN the possible forms of nonverbal expression NOUN
gasket
the passengers of the tallyho, from which there NOUN a grammatical description of each form NOUN
issued clouds of smoke
the right of them PRON the design of written languages NOUN
thickness of glaze NOUN the conflicting expectations of East and West NOUN
the price of swimming NOUN the value of the election NOUN
lack of rainfall NOUN the lead of the Russians NOUN
the design and production of engraved seals NOUN terms of contract NOUN
the early stages of shipping fever NOUN the preparation of the questionnaire NOUN
the commercial realities of the market NOUN the first part of the Regulation that is currently NOUN
at issue
the habit of many schools NOUN children of military personnel NOUN
the depths of the fourth dimension NOUN the maintenance of social stratication in the NOUN
schools
NOUN NOUN
electronic switch
the many mistakes of the new group of traders NOUN knowledge of the environment NOUN
certain sections of the country NOUN the first section of this publication NOUN
the gardens of the square NOUN the potentially useful sources of ionizing NOUN
radiations, gamma sources, cobalt-60, cesium-
137, fission products, or a reactor irradiation
loop system using a material such as an indium
salt
the constitution of his home state of NOUN the value of a for the major portion of the knife NOUN
Massachusetts
leaders of the nobility of Lincoln and Lee NOUN the bottom of Kate's steps NOUN
all the details of the pattern NOUN the old members of the church NOUN
the odor of decay NOUN the composer of a successful opera NOUN
NOUN NOUN
the makers of constitutions NOUN the spirit of a thing NOUN
another source of intellectual stimulus NOUN the precepts of her conditioning NOUN
Ann's own description of the scene NOUN the high ridge of the mountains NOUN
the sound of jazz and the blues NOUN the rigid fixture of her jaws NOUN
the image of the facts NOUN the side of the stall NOUN
this aspect of the literary process NOUN the muscles of his jaws NOUN
the carvings of cavemen NOUN the corner of the car NOUN
the fossilized, formalized, precedent-based NOUN the end of a long, miswritten chapter in the NOUN
thinking of the legendary military brain social life of the community
considerable criticism of its length NOUN the pirouette of his arms NOUN
the extent of ethical robotism NOUN the existence of AZ's useless Wisconsin set-up NOUN
the draft of soldiers NOUN other complications of the hex you've aroused NOUN
the incarnation of the new rational man NOUN the doors of the D train NOUN
6-6a7 their families S-GENITIVE HUM
6-6a8 Lumumba's death S-GENITIVE HUM
6-6a9 his arts or culture S-GENITIVE HUM
6-6a10 her life to religion S-GENITIVE HUM
6-6a11 its [monument] reputation S-GENITIVE CCT
6-6a12 their impulses and desires S-GENITIVE HUM
6-6a13 its [advisory board] members' duties S-GENITIVE ORG
6-6a14 our national economy S-GENITIVE ORG
6-6a15 the convict's climactic reappearance in London S-GENITIVE HUM
6-6a16 its [bird] wing S-GENITIVE ANI
6-6a17 her father S-GENITIVE HUM
6-6a18 his voice S-GENITIVE HUM
6-6a19 her brain S-GENITIVE HUM
6-6a20 his brown face S-GENITIVE HUM
6-9a5 a standard internist's text S-POSSESSIVE 3 1
6-9a6 mom's apple pie S-POSSESSIVE 1 2
6-9a7 the Square's historic value S-POSSESSIVE 2 2
6-9a8 his mother's urging S-POSSESSIVE 2 1
6-9a9 the Department's recommendation S-POSSESSIVE 2 1
6-9a10 the posse's approach S-POSSESSIVE 2 1
6-9a11 ladies' fashions S-POSSESSIVE 1 1
6-9a12 the convict's climactic reappearance in London S-POSSESSIVE 2 4
6-9a13 industry's main criticism of the Navy's antisubmarine effort S-POSSESSIVE 1 7
6-9a14 the town marshal's office S-POSSESSIVE 3 1
6-9a15 the pool's edge S-POSSESSIVE 2 1
6-9a16 man's tongue S-POSSESSIVE 1 1
6-9a17 an egotist's rage for fame S-POSSESSIVE 2 3
6-9a18 a women's oor S-POSSESSIVE 2 1
6-9b5 the death throes of men who were shot before the paredon OF-POSSESSIVE 7 3
6-9b6 lack of rainfall OF-POSSESSIVE 1 1
6-9b7 the amazing variety and power of reactions, attitudes, and emotions OF-POSSESSIVE 9 5
precipitated by the nude form
1 tyle is only a front for his [real feelings] . It is probably best to
2 imbledon . I think she has a [genuine feeling] for the fans and for a
3 ng with passion . Yes , with [real feeling] . With she searched f
4 night Bob had displayed any [real emotion] . His eyes sparkled . T
5 comes nowhere . There was [real feeling] in this judgement , McLei
6 nfamiliar , and I had had no [real feeling] of kinship . Now , while
7 f the title bloom , brings a [real feeling] of tragic suffering to th
8 avonarola are described with [real feeling] . There is a fascinating
9 st : I do my best to hide my [real feelings] from others I always try
10 ) but with the intrusion of [real feelings] love , obsession , jea
11 even today . So when we have [real emotions] about someone , we lift
12 . His daughter summed up the [real emotions] of herself and her mothe
13 the sadness she felt . When [genuine feelings] are denied and pushe
14 s not theatricality but very [genuine feeling] . It is my way of show
15 the first time and said with [genuine feeling] : I 'm sorry . No il
16 ot be interpreted as lack of [real feeling] or a sign that the end of
28 ing is realistic and gives a [real feeling] of being there , and the
29 t be able to verbalize their [real feelings] , or to admit they had h
30 presence which communicates [real emotion] where the programme offer
31 theless appears to have been [genuine emotion] on both sides . Yet th
32 h sooner if he had known her [real feelings] towards him , but she ha
33 ppen . " I do n't recall any [real feelings] of resentment , and I do
34 ds describing a situation of [real emotion] for working people : a ri
35 was quite clear they had no [real feeling] for their mother , and we
36 llington , said Julia with [real feeling] . But I 'm quite all ri
37 . But Tod looked at it with [real feeling] , with the dull heat of-I
38 n neither can say what their [real feelings] are . A true conversatio
39 xtraordinarily difficult for [genuine feelings] of distress or even s
40 ade her see more clearly her [real feelings] for me . Whatever it was
41 y spoke to me . Told me your [real feeling] about oh , about life a
42 vely , and could still evoke [genuine emotions] ; the way a Woolworth
43 sts , she said , with more [real feeling] than she intended . Now
44 sted it But , whatever his [real feelings] , at least he was n't an
59 hordes who came without any [real feeling] , who listened without un
60 lito 's . He sings with such [genuine feeling] . It 's all an act
61 m but his control over his [real feelings] had remained even then .
62 hat she had not betrayed her [real feelings] , Sophie concentrated on
63 member . They 're taking the [real feeling] of Liverpool away , she
64 nsight into the Mr. Darcy 's [real feelings] during particular parts
65 lly but you have n't got the [actual feelings] . Yeah . No I 'm not d
she should feel relieved or even more depressed and on the latter . He sounded even more angry analytic no
decided after a moment
ite often a series of progressively unpleasant child becoming more obstinate and more angry analytic yes
interchanges will take place with the the parent
is dismissed as domestic violence . In these of feminism has become more more angry analytic yes
circumstances the voice insistent and
it even worse because you go round and round in circles , you know and the more [unclear] more angry analytic yes
and e
to the more amorphous boundaries of suburban to the countryside in search of a more friendly analytic yes
neighbourhoods , many people have migrated
e colour the what the lay out of the foyer we would look at again to make it more inviting more friendly analytic yes
of them . Seeing how closely Weis was watching him , he an effort to be more warm , more friendly analytic yes
made
are welcome to visit China . Over the next ten years , will become economically more more friendly analytic yes
China liberal , internationally
School , a mixed High School in Ohio where the different , much more relaxed , more friendly analytic yes
atmosphere was totally
be significant differences in the way they perform their be more active than others , some more friendly analytic yes
jobs . Some will
pond and began the same operation along its full length . to be more interesting and was more lively analytic yes
is proved certainly
significant average correlation between risk and recall . junctions are both intrinsically more more risky analytic yes
One possibility is that certain memorable and
seem an ever more clear indication of personal turn makes openness about more risky analytic no
inadequacy which in its problems appear yet
effect and purpose of any given transition . As a system the logic becomes more obscure , more risky analytic yes
gets larger modification
the British people have a far more reserved nature than 'm used to people being a lot friendlier synthetic no
Australians . I
of the several alternatives along this coast ; it is more Bayonne , less pompous than livelier synthetic no
intimate than Biarritz and
present in writing of every kind , is more obtrusive than it in the other tale , is the livelier synthetic no
is
time when to be new seemed more than usually
important ,
as exemplified by the titles of its livelier synthetic no
These imperfections make it all the more important for BIS minimums and set higher riskier synthetic no
regulators to enforce the standards for
with sta . It means more imaginative deployment of preparedness to experiment with riskier synthetic no
sta . It means new and possibly
more damaging to the Government showing a rise of 17 1989 . Indeed these figures made sorrier synthetic no
per cent on even
nds , smiling above the soft yellow [flame] whose elvish reflection danced in
l in charge of my own destiny . The [flame] of the candle on this table is by
nside with a satisfactory whoosh of [flame] . His point had been proved with s
brief space of time I turned up the [flame] . The X-rays that had proved so of
ot the fire there was very little [flame] . He flashed a sudden stunning s
anuka that burns with a hot , clean [flame] is best He was interrupted by
e resin column ; ( iv ) Potassium : [flame] photometry ( Corning 405 ) . Tissu
ling to the breeze . A tiny lick of [flame] flickered round the mouth of the
gain at those dim shapes around the [flame] , they did appear to be rigs , vag
would be sufficient to relight the [flame] and so keep their fires burning ,
the points represent the sparks of [flame] . It has the backing of the farmer
mbard standing impassive beside the [flame] , de Guichet at his shoulder . Haz
NOT UNDER CONTROL
ed with Class 1 Surface Spread of [flame] ratings . CONDITIONING In common
ire officers normally do , to put a [flame] to the scenery to ensure it is pro
s he threw his head back and spewed [flame] a couple of yards into the air . O
h a few to blowing the gases into a [flame] . The resulting explosion loosened
& thermostats A Baxi Bermuda living [flame] fire-front , compatible with back
ng his cigarette from a twelve-inch [flame] shooting out of a gas jet , who wa
nt susceptibility . For example , a [flame] detector on a North Sea drilling p
too late do they discover that the [flame] burns and is lethal . But which is
spark from her tinder , a sheet of [flame] could envelop the Genoese . And ad
out of one door but then a sheet of [flame] came down and blocked me , so I ha
seemed to be in a haze or a glowing [flame] . The two men continued walking ,
back in time to avoid the spear of [flame] from the torch , and flung a handf
ething that ignited enough to get a [flame] onto his seat . After that , panic
hair re-grew . He emerged from the [flame] unscathed , transformed by the cle
burned . Unable to pass through the [flame] , he managed to cast himself back
ch exploded in small dirty gouts of [flame] and smoke . The first French skirm
. Major cations were determined by [flame] atomic absorption , and P and Si b
you do n't even need to use a naked [flame] , you just need a hot surface like
he hotel complex burned into autumn [flame] and the air smelt of bonfires that
snaked towards it to snuff out the [flame] . The audience clapped . The black
turned to see the first tongues of [flame] licking up the walls of the cainca
kness underground because the small [flame] of the oil-lamp was n't showing en
sh that was not flesh , dousing the [flame] , runnelling in scarred troughs
m , as far as detection ( and hence [flame] damage ) is concerned . Once detec
been billeted began to blossom with [flame] . Two figures were at its eaves wi
ook out , she cried as tongues of [flame] blow-torched from the crevices aro
pstairs windows , a sudden spurt of [flame] , and a part of the roof begin to
ide , and felt the wild pain of the [flame] on his arm as he dived for safety
ver , disappear in a white sheet of [flame] . He just kept right on kicking Pi
jetties , or the waiting batons of [flame] and black smoke that fenced the na
les an hour . There was a tongue of [flame] and Asa pulled back the column and
windows and roof . The inferno spat [flame] 20 feet into the air . Station Off
uality has always been a priority . [flame] Retardants gained BS5750 Pt 2 accr
re teddy bears by the looks of it . [flame] retardant oh ! They 're , they 're
PLURAL FLAMES
UNDER CONTROL
yrotechnic display of rose and gold [flames] that burnt up the whole western s
s fixed thoughtfully on the dancing [flames] . Roger sat down in the high-back
heat serves physical needs and its [flames] give inspiration for the mind . T
999 ambulance sped past them with [flames] roaring from its back . Drivers f
intest it was a face imagined among [flames] . The story paints an idyllic pic
With practice I learned that those [flames] instantly settled down to a prope
d . The smoke billows and spreads , [flames] crackle . Peasants and shepherds
er mind 's eye rude naked women and [flames] as upright and stiff as those on
sible place lay a deep disquiet . [flames] danced in the glass of each eye ,
owing coals and the cosy flicker of [flames] around them that was , she soon r
buses . Young men danced around the [flames] and slapped onto those cars that
th were huge . They shone with blue [flames] . There were rings of blue fire r
nd a huge log fire , the flickering [flames] casting long shadows against the
h at dawn . She watched the kiln 's [flames] flicker on the walls . Li Lu stok
My master just sat staring into the [flames] of the fire . What is the matte
ch they eventually committed to the [flames] . The government did its best to
s , she said flatly , eyes on the [flames] in the hearth . I think , for u
NOT UNDER CONTROL
A) ATTEMPT TO EXTINGUISH
anger , but he does not put out the [flames] . as if to prove that he is in ea
ound near and began to beat out the [flames] with it . But she soon realised t
ic oxygen supply and extinguish the [flames] . Closed windows and doors help t
up the bucket for him to quench the [flames] . But there were no flames . All
dioxide ? Mm . Erm it smothers the [flames] they ca n't get oxygen if there '
e walls , and throwing water on the [flames] , but the fire was burning more s
was full , then tried to douse the [flames] . One curtain went out , extingui
d him about the pavement to put the [flames] out . Nobody else was hurt , bu
ual struggle to contain the leaping [flames] . Shutting down their pumps , the
fire brigade squirting water on the [flames] . He must be feeling pretty dis
where terrified workers doused the [flames] . Moments before , Mr Chittenden
and I used my shirt to put out the [flames] and Hazel gave him first aid .
ound by Mr Nellis , who put out the [flames] . He could have been much more se
e fire brigade went in to quell the [flames] . Only once firemen had arrived d
B) ATTEMPT TO ESCAPE
ough windows in a bid to escape the [flames] licking around them after an IRA
of them were engulfed in a rush of [flames] before she could throw the baby t
ay . Malekith was caught within the [flames] , his body terribly scarred and b
bring to life would perish in the [flames] , together with all Frankenstein
een Colbert , 55 , was enveloped in [flames] after her dressing gown caught fi
fety as the vehicle was engulfed in [flames] . The incident , which happened w
s attempt to save his wife from the [flames] . She marries him , and in the la
nt happened late Friday afternoon . [flames] set light to his jacket and Mr Wi
C) OTHER
wanner go up in a pile a smoke an' [flames] an' eye shadder an' levver shoes
e in Acton , Richard Baxter saw the [flames] and the huge pall of smoke which
the Hindenburg crashed in a ball of [flames] just a few miles from Nicholson '
now collapsed . Sparks , smoke and [flames] poured into the air , and the hea
, as from my window I could see the [flames] from the burning laundry , less t
ommunicate amongst the glare of the [flames] and through dense smoke ! The two
the spectators finally bursts into [flames] as the commentators announce that
flame it and then shoot it into the [flames] . A convenient hole appears and t
e wolf suddenly disappeared and the [flames] on the lances were extinguished .
ge was double : the exposure to the [flames] had been long enough to cause sev
ver again A ROYAL palace erupted in [flames] early yesterday , exactly a week
nutes earlier , however , and those [flames] would have been funeral pyres s
-holocaust Los Angeles circa 2029 . [flames] belch from the wreckage , degener
smelling the smoke . Or seeing the [flames] . I feel distant , removed and co
artment buildings spewing smoke and [flames] and chaotic scenes of paramedics
ir Montego hit a van and burst into [flames] on the M18 near Doncaster . Colin
the spiral column of smoke and the [flames] that played at its heels . Only t
ce smoke kills far more people than [flames] , and most fire deaths occur at n
soon have burst uncontrollably into [flames] . And if it be thought that Mr Hu
all but drowned in the hiss of the [flames] , sounded fluttery and frail .
ople trapped in the ruble Houses in [flames] all around The loud ringing of Fi
rate the loaded vehicle through the [flames] out to safety . The blue flames w
aughter . Jezrael cried . Faces and [flames] . Heavy breathing in shadows . Tr
temperature it will burst into the [flames] , likewise if it 's not balanced
re of the goods stored is such that [flames] are unlikely to damage them withi
Table C-2. Critical values for multiple chi-square tests at one degree of freedom.
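The values of Table C-2 did not survive extraction, but they are straightforward to recompute: for a family of multiple chi-square tests at one degree of freedom, the significance level is adjusted for the number of tests and the corresponding critical value is looked up. A minimal sketch, assuming a Bonferroni-style correction (alpha divided by the number of tests); the function name is illustrative, not the book's:

```python
# Recompute corrected chi-square critical values (df = 1), as presumably
# tabulated in Table C-2, using the inverse CDF (percent-point function).
from scipy.stats import chi2

def corrected_critical_value(alpha=0.05, n_tests=1, df=1):
    """Value a chi-square statistic must exceed to remain significant at
    level alpha after a Bonferroni correction for n_tests tests."""
    return chi2.ppf(1 - alpha / n_tests, df)

for n in (1, 5, 10):
    print(f"{n:2d} tests: {corrected_critical_value(0.05, n):.3f}")
```

With alpha = 0.05 this yields the familiar single-test threshold of 3.841 for one test, rising to 6.635 for five tests and 7.879 for ten.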
en prove to be more lethal than the [flames] themselves . The old-style foam c
eir lives almost ended in a ball of [flames] . So Norman White , Sheila Stroud
to discover what happens within the [flames] Male speaker We 've mounted a who
rt Gallery when a passer-by spotted [flames] coming from the roof . The galler
, flicked a lighter and exploded in [flames] in London 's Parliament Square as
Anatol Stefanowitsch
References
Adli, Aria. 2004. Grammatische Variation und Sozialstruktur. De Gruyter.
Altenberg, Bengt. 1980. Binominal NPs in a thematic perspective: Genitive vs. of-constructions in 17th century English. In Sven Jacobsen (ed.), Papers from the Scandinavian Symposium on Syntactic Variation, 149–172. (Stockholm Studies in English 52). Stockholm: Almqvist and Wiksell.
American Psychiatric Association. 2000. Diagnostic and statistical manual of
mental disorders: DSM-IV-TR. 4th ed., text revision. Washington, DC:
American Psychiatric Association.
Atkins, Beryl T.S. 1994. Analyzing the verbs of seeing: a frame semantics
approach to corpus lexicography. In Susanne Gahl, Andy Dolbey &
Christopher Johnson (eds.), Proceedings of the Twentieth Annual Meeting of
the Berkeley Linguistics Society: General Session Dedicated to the
Biber, Douglas, Susan Conrad & Randi Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge; New York: Cambridge University Press.
Bondi, Marina & Mike Scott (eds.). 2010. Keyness in texts. (Studies in Corpus
Linguistics v. 41). Amsterdam; Philadelphia: John Benjamins.
Borkin, Ann. 1973. To be and not to be. In Claudia Corum, T. Cedrik Smith-Stark
& Ann Weiser (eds.), Papers from the Ninth Regional Meeting of the Chicago
Linguistics Society, vol. 9, 44–56. Chicago: Chicago Linguistic Society.
Bush, Nathan. 2001. Frequency effects and word-boundary palatalization in English. In Joan Bybee & Paul J. Hopper (eds.), Frequency and the emergence
New York, NY: Routledge.
Chen, Ping. 1986. Discourse and particle movement in English. Studies in Language 10(1). 79–95.
Chomsky, Noam. 1957. Syntactic structures. The Hague: Mouton.
Chomsky, Noam. 1964. Formal discussion of W. Miller and Susan Ervin, The Development of Grammar in Child Language. In Ursula Bellugi & Roger Brown (eds.), The acquisition of language, 35–39. (Monographs of the Society for Research in Child Development 92). Lafayette, IN: Purdue University Press.
Chomsky, Noam. 1972. Language and mind. Second edition. New York: Harcourt
Brace Jovanovich.
Dąbrowska, Ewa. 2001. From formula to schema: The acquisition of English questions. Cognitive Linguistics 11(1–2).
Deane, Paul. 1987. English possessives, topicality, and the Silverstein Hierarchy. In Jon Aske, Natasha Beery, Laura Michaelis & Hana Filip (eds.), Proceedings of the Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on Grammar and Cognition, vol. 13, 65–76. Berkeley: Berkeley Linguistics Society.
Deignan, Alice. 1999. Corpus-based research into metaphor. In Graham Low & Lynne Cameron (eds.), Researching and applying metaphor, 177–199. Cambridge: Cambridge University Press.
Francis, Gill, Susan Hunston & Elizabeth Manning. 1996. Collins COBUILD Grammar Patterns 1: Verbs. London: HarperCollins.
Francis, Gill, Susan Hunston & Elizabeth Manning. 1998. Collins COBUILD Grammar Patterns 2: Nouns and Adjectives. London: HarperCollins.
Fromkin, Victoria. 1973. Speech errors as linguistic evidence. The Hague: Mouton.
Fromkin, Victoria (ed.). 1980. Errors in linguistic performance: slips of the tongue,
ear, pen, and hand. San Francisco: Academic Press.
Garretson, Gregory. 2004. Coding practices used in the project Optimal Typology of Determiner Phrases. Unpublished manuscript. Boston, MA.
Georgila, Kallirroi, Maria Wolters, Johanna D. Moore & Robert H. Logie. 2010. The
Gilquin, Gaëtanelle & Sylvie De Cock. 2011. Errors and disfluencies in spoken corpora: Setting the scene. International Journal of Corpus Linguistics 16(2). 141–172.
Givón, Talmy. 1983. Topic continuity in discourse: A quantitative cross-language study. Amsterdam; Philadelphia: John Benjamins.
Givón, Talmy. 1992. The grammar of referential coherence as mental processing instructions. Linguistics 30(1).
Goldberg, Adele E. 1995. Constructions: a construction grammar approach to
argument structure. (Cognitive Theory of Language and Culture). Chicago:
University of Chicago Press.
Greenberg, Steven, Hannah Carvey, Leah Hitchcock & Shuangyu Chang. 2003. Temporal properties of spontaneous speech: A syllable-centric perspective. Journal of Phonetics 31(3–4). 465–485.
Greenberg, Steven, Joy Hollenback & Dan Ellis. 1996. Insights into spoken
language gleaned from phonetic transcription of the Switchboard corpus.
Proceedings of the Fourth International Conference on Spoken Language
Processing, vol. 1, 32–35. Philadelphia.
Gries, Stefan Th. 2001. A corpus-linguistic analysis of English -ic vs -ical adjectives. ICAME Journal 25. 65–108.
Gries, Stefan Th. 2003. Testing the sub-test: An analysis of English -ic and -ical adjectives. International Journal of Corpus Linguistics 8(1). 31–61.
Gries, Stefan Th. 2010. Behavioral profiles: A fine-grained and quantitative approach in corpus-based lexical semantics. The Mental Lexicon 5(3). 323–346.
Gries, Stefan Thomas. 2002. Evidence in Linguistics: three approaches to genitives in English. In Ruth M. Brend & William J. Sullivan (eds.), LACUS Forum XXVIII: what constitutes evidence in linguistics?, 17–31. Fullerton, CA: LACUS.
Gries, Stefan Thomas. 2003. Multifactorial analysis in corpus linguistics: a study
of particle placement. (Open Linguistics Series). New York: Continuum.
Gries, Stefan Thomas & Naoki Otani. 2010. Behavioral profiles: A corpus-based perspective on synonymy and antonymy. ICAME Journal 34. 121–150.
Gries, Stefan Th. & Anatol Stefanowitsch. 2004. Extending collostructional analysis: A corpus-based perspective on 'alternations'. International Journal of
Amsterdam; Philadelphia: John Benjamins.
Jackendoff, Ray. 1994. Patterns in the mind: language and human nature. New York: BasicBooks.
Jäkel, Olaf. 1997. Metaphern in abstrakten Diskurs-Domänen: eine kognitiv-linguistische Untersuchung anhand der Bereiche Geistestätigkeit, Wirtschaft und Wissenschaft. (Duisburger Arbeiten zur Sprach- und Kulturwissenschaft / Duisburg Papers on Research in Language and Culture 30). Frankfurt am Main; New York: P. Lang.
Jespersen, Otto. 1909. A modern English grammar on historical principles (volumes 1–7). 7 vols. Heidelberg: C. Winter.
Johansson, Stig & Knut Hofland. 1989. Frequency analysis of English vocabulary and grammar: Tag frequencies and word frequencies. Vol. 1. 2 vols. Oxford:
Clarendon Press.
Justeson, John S. & Slava M. Katz. 1991. Co-occurrences of antonymous adjectives
and their contexts. Computational Linguistics 17(1). 1–19.
184.
Kaunisto, Mark. 1999. Electric/electrical and classic/classical: Variation between the suffixes -ic and -ical. English Studies 80(4). 343–370.
Kennedy, Graeme. 2003. Amplifier Collocations in the British National Corpus: Implications for English Language Teaching. TESOL Quarterly 37(3). 467.
Kennedy, Graeme D. 1998. An introduction to corpus linguistics. London; New
York: Longman.
Kjellmer, Göran. 1986. The lesser man: Observations on the role of women in modern English writing. In Jan Aarts & Willem Meijs (eds.), Corpus linguistics II: New studies in the analysis and exploitation of computer corpora, 163–176.
Amsterdam: Rodopi.
Lakoff, George. 1993. The contemporary theory of metaphor. In Andrew Ortony (ed.), Metaphor and thought, 2nd ed. Cambridge: Cambridge University Press.
Lakoff, George. 2004. Re: Empirical methods in Cognitive Linguistics.
Lakoff, George & Mark Johnson. 1980. Metaphors we live by. Chicago: University
of Chicago Press.
Lakoff, Robin. 1973. Language and woman's place. Language in Society 2(1). 45–80. (4 February, 2013).
Langacker, Ronald W. 1991. Concept, image, and symbol: the cognitive basis of grammar. (Cognitive Linguistics Research 1). Berlin: Mouton de Gruyter.
Lapata, Maria & Alex Lascarides. 2003. A Probabilistic Account of Logical
and British English conversation. In Hilde Hasselgård & Signe Oksefjell (eds.), Out of Corpora: Studies in honor of Stig Johansson, 107–118. Amsterdam: Rodopi.
Leech, Geoffrey N., Roger Garside & Michael Bryant. 1994. CLAWS4: The tagging of the British National Corpus. Proceedings of the 15th International Conference on Computational Linguistics, 622–628. Kyoto.
Leech, Geoffrey N. & Andrew Kehoe. 2006. Recent grammatical change in written English 1961–1992: some preliminary findings of a comparison of American with British English. In Antoinette Renouf & Andrew Kehoe (eds.), The Changing Face of Corpus Linguistics, 185–204. (Language and Computers 55). Amsterdam: Rodopi.
Levin, Magnus & Hans Lindquist. 2007. Sticking one's nose in the data: Evaluation in phraseological sequences with nose. ICAME Journal 31. 87–110.
Liberman, Mark. 2005. What happened to the 1940s? Blog. Language Log.
itre.cis.upenn.edu/~myl/languagelog/archives/002397.html (1 March, 2015).
Liberman, Mark. 2012. Historical culturomics of pronoun frequencies. Blog.
Language Log.
Lindquist, Hans & Magnus Levin. 2008. Foot and Mouth: The phrasal patterns of two frequent nouns. In Sylviane Granger & Fanny Meunier (eds.), Phraseology: An interdisciplinary perspective, 143–158. Amsterdam: John Benjamins. https://benjamins.com/catalog/z.139.15lin (25 February, 2015).
Lindquist, Hans & Christian Mair (eds.). 2004. Corpus approaches to grammaticalization in English. (Studies in Corpus Linguistics v. 13). Amsterdam; Philadelphia: John Benjamins.
Lindsay, Mark. 2011. Rival suffixes: synonymy, competition, and the emergence of productivity. In Angela Ralli, Geert Booij, Sergio Scalise & Athanasios Karasimos (eds.), Morphology and the architecture of grammar: On-line proceedings of the Eighth Mediterranean Morphology Meeting, 192–203. Patras: University of Patras.
Lindsay, Mark & Mark Aronoff. 2013. Natural selection in self-organizing morphological systems. In Nabil Hathout, Fabio Montermini & Jesse Tseng (eds.), Morphology in Toulouse: Selected proceedings of Décembrettes 7, 133–153. München: Lincom Europa.
thesis.
Louw, William E. 1993. Irony in the text or insincerity in the writer? The
253–270.
McEnery, Tony & Andrew Hardie. 2012. Corpus linguistics: method, theory and practice. (Cambridge Textbooks in Linguistics). Cambridge; New York: Cambridge University Press.
McEnery, Tony & Andrew Wilson. 2001. Corpus linguistics: an introduction.
Edinburgh: Edinburgh University Press.
Merriam-Webster. 2014. How does a word get into a Merriam-Webster
dictionary? FAQ. Merriam-Webster Online.
Meurers, W. Detmar. 2005. On the use of electronic corpora for theoretical
linguistics. Lingua 115(11). 1619–1639.
Meurers, W. Detmar & Stefan Müller. 2009. Corpora and syntax. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics, vol. 2, 920–933. (Handbooks of
Linguistics and Communication Science 29). Berlin; New York: De Gruyter
Mouton.
Meyer, Charles F. 2002. English corpus linguistics: an introduction. (Studies in
Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: De Gruyter Mouton.
Oakes, M. P. & M. Farrow. 2007. Use of the Chi-Squared Test to Examine Vocabulary Differences in English Language Corpora Representing Seven Different Countries. Literary and Linguistic Computing 22(1). 85–99.
Office for National Statistics. 2005. T 01: Quarterly Population Estimates for England and Wales by Quinary Age Groups and Sex, Sept 03 – June 05. London: Office for National Statistics.
Or, Winnie Wing-fung. 1994. A corpus-based study of features of adjectival suffixation in English. Proceedings, Joint Seminar on Corpus Linguistics and Lexicology, Guangzhou and Hong Kong, 72–81. Hong Kong: Language Centre, Hong Kong University of Science and Technology.
Oxford University Press. 1989. The Oxford English dictionary. (Ed.) J. A. Simpson & E. S. C. Weiner. 2nd ed. Oxford: Clarendon Press; New York: Oxford University Press.
Partington, Alan. 1998. Patterns and meanings: using corpora for English language research and teaching. Amsterdam; Philadelphia: John Benjamins.
infinitives in English. English Studies 76(4). 367–388.
Rohdenburg, Günter. 2003. Cognitive complexity and horror aequi as factors determining the use of interrogative clause linkers in English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of Grammatical Variation in English. Berlin; New York: De Gruyter Mouton.
Rohdenburg, Günter. 2009. Nominal complements. In Günter Rohdenburg & Julia Schlüter (eds.), One Language, Two Grammars?, 194–211. Cambridge:
Cambridge University Press.
Rohdenburg, Günter & Julia Schlüter. 2009. One language, two grammars? Differences between British and American English. Cambridge, UK; New York:
Wilson & Tony McEnery (eds.), Proceedings of the Corpus Linguistics 2003 conference, 662–668. Lancaster: UCREL, Computing Dept., University of Lancaster.
Sacks, Harvey. 1992. Lectures on conversation. Oxford, UK; Cambridge, Mass:
Blackwell.
Sacks, Harvey, Emanuel A. Schegloff & Gail Jefferson. 1974. A Simplest
Systematics for the Organization of Turn-Taking for Conversation. Language
50(4). 696.
Säily, Tanja. 2011. Variation in morphological productivity in the BNC: Sociolinguistic and methodological considerations. Corpus Linguistics and Linguistic Theory 7(1).
Säily, Tanja & Jukka Suomela. 2009. Comparing type counts: The case of women, men and -ity in early English letters. Language and Computers: Studies in Practical Linguistics 69. 87–109.
Sampson, Geoffrey. 1995. English for the computer: the SUSANNE corpus and analytic scheme. Oxford: Clarendon Press; New York: Oxford University Press.
Sapir, Edward. 1921. Language: An introduction to the study of speech. New York: Harcourt, Brace & Co.
Schlüter, Julia. 2003. Phonological determinants of grammatical variation in
Scott, Mike. 1997. PC analysis of key words and key key words. System 25(2). 233–245.
Scott, Mike & Chris Tribble. 2006. Textual patterns: key words and corpus analysis in language education. (Studies in Corpus Linguistics v. 22). Amsterdam; Philadelphia: John Benjamins.
Searle, John R. 1969. Speech acts: an essay in the philosophy of language. London:
Cambridge U.P.
Sebba, Mark & Steven D. Fligelstone. 1994. Corpora. (Ed.) Ronald E. Asher & James M.Y. Simpson. The encyclopedia of language and linguistics. Oxford:
Pergamon.
Semino, E. & M. Masci. 1996. Politics is Football: Metaphor in the Discourse of
Silvio Berlusconi in Italy. Discourse & Society 7(2). 243269.
Seto, Ken-ichi. 1999. Distinguishing metonymy from synecdoche. In Klaus-Uwe Panther & Günter Radden (eds.), Metonymy in language and thought, vol. 4, 91–120. (Human Cognitive Processing). Amsterdam; Philadelphia: John Benjamins. https://benjamins.com/catalog/hcp.4.06set (20 February, 2015).
Shaffer, J. P. 1995. Multiple Hypothesis Testing. Annual Review of Psychology 46(1). 561–584.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford: Oxford University
Press.
multiplen Strukturanwendung, 29–45. (Diversitas Linguarum). Bochum:
Brockmeyer.
Stefanowitsch, Anatol. 2007b. Linguistics beyond grammaticality. Corpus Linguistics and Linguistic Theory 3(1).
Stefanowitsch, Anatol. 2008. Negative entrenchment: A usage-based approach to
negative evidence. Cognitive Linguistics 19(3).
Stefanowitsch, Anatol. 2010. Empirical cognitive semantics: Some thoughts. In Dylan Glynn & Kerstin Fischer (eds.), Quantitative Methods in Cognitive Semantics: Corpus-Driven Approaches. Berlin; New York: De Gruyter Mouton.
Stefanowitsch, Anatol. 2011. Cognitive linguistics meets the corpus. In Mario
Linguistic Theory 1(1).
Szmrecsanyi, Benedikt. 2006. Morphosyntactic persistence in spoken English: a corpus study at the intersection of variationist sociolinguistics, psycholinguistics, and discourse analysis. (Trends in Linguistics 177). Berlin; New York: Mouton de Gruyter.
Tagliamonte, Sali. 2006. Analysing sociolinguistic variation. (Key Topics in
Sociolinguistics). Cambridge, UK; New York: Cambridge University Press.
Taylor, John R. 2012. The mental corpus: how language is represented in the mind. Oxford; New York: Oxford University Press.
Thompson, Sandra A. & Paul J. Hopper. 2001. Transitivity, clause structure, and
Technical Report. Birmingham: The University of Birmingham, School of
Computer Science.
Wasow, Tom & Jennifer Arnold. 2003. Post-verbal constituent ordering in English. In Günter Rohdenburg & Britta Mondorf (eds.), Determinants of grammatical variation in English, 119–154. (Topics in English Linguistics 43). Berlin; New York: Mouton de Gruyter.
Weiner, E. Judith & William Labov. 1983. Constraints on the agentless passive.
Journal of Linguistics 19(1). 29.
Widdowson, Henry G. 2000. On the limitations of linguistics applied. Applied
Linguistics 21(1). 3–25.
Horst Czichos, Tetsuya Saito & Leslie Smith (eds.), Springer Handbook of Metrology and Testing, 339–452. Berlin; Heidelberg: Springer.
Wierzbicka, Anna. 1988. The semantics of grammar. (Studies in Language
Companion Series v. 18). Amsterdam; Philadelphia: John Benjamins.
Wierzbicka, Anna. 2003. Cross-cultural pragmatics: the semantics of human
interaction. Berlin; New York: Mouton de Gruyter.
Williams, Raymond. 1976. Keywords: a vocabulary of culture and society. New
York: Oxford University Press.
Winchester, Simon. 2003. The meaning of everything: the story of the Oxford
English dictionary. Oxford; New York: Oxford University Press.
Wolf, Hans-Georg & Frank Polzenhagen. 2007. Fixed expressions as
manifestations of cultural conceptualizations: Examples from African varieties
Zaenen, Annie, Jean Carletta, Gregory Garretson, Joan Bresnan, Andrew Koontz-Garboden, Tatiana Nikitina, M. Catherine O'Connor & Tom Wasow. 2004. Animacy encoding in English: Why and how. Proceedings of the 2004 ACL Workshop on Discourse Annotation, 118–125. Stroudsburg, PA: Association for Computational Linguistics.