Vous êtes sur la page 1sur 395


Karin Aijmer and Bengt Altenberg

University of Gteborg and University of Lund

Corpus linguistics has made spectacular advances since the early 1960s when
computer corpora were first made available for research. The use of corpora has
spread to practically every branch of linguistics and has become indispensable in
many practical applications of linguistic research, from lexicography and
terminology extraction to information retrieval and computer-assisted translation.
Corpora have become bigger and more diversified: apart from large general-
purpose corpora, a number of specialised corpora are now being used for research
in such areas as historical linguistics, sociolinguistics, dialectology, LSP,
interlanguage research, contrastive linguistics and translation studies. In addition,
CD-ROM newspaper collections and the Internet have become increasingly
important resources for language study. Hand in hand with these developments a
variety of research tools have been created for exploring, annotating and
processing language data in various ways.
However, the most important achievement of corpus linguistics is
undoubtedly that it has put the use of language at the centre of linguistics. In
theoretical as well as practical approaches to language, computer corpora have
placed linguistics on a firm empirical footing, emphasising the functional and
communicative basis of language.
This volume contains twenty-two papers presented at the 23rd
International Conference on English Language Research on Computerized
Corpora of Modern and Medieval English (ICAME) held at Gteborg, Sweden, in
May 2002. They cover a wide range of topics and, though few of them represent
the technical or computational side of the discipline, they illustrate clearly the
diversity of research that is characteristic of corpus linguistics today. The
contributions have been divided into six broad and inevitably overlapping
categories, under the following headings:

The role of corpora in linguistic research

Exploring lexis, grammar and semantics
Discourse and pragmatics
Language change and language development
Cross-linguistic studies
Software development

We have chosen to call the volume Advances in Corpus Linguistics. This may
seem a bold title, as it suggests a systematic account of recent developments in
the field. However, advances in linguistics seldom take the form of big leaps.
2 Karin Aijmer and Bengt Altenberg

This is particularly true of corpus linguistics where each study can be seen as a
small step in the expansion of a vast and complex discipline, whether the focus is
theoretical, descriptive or methodological. Corpus linguistics is a constantly
changing field and ICAME conferences generally provide a good reflection of
this. The theme of the 2002 conference was The Theory and Use of Corpora. In
what ways can the present volume be said to represent advances in these
respects? Rather than presenting a summary of the individual contributions, we
will try to point out some issues and tendencies that we think are characteristic of
the volume as a whole.
The role of corpus linguistics and the relationship between data and theory
have been debated ever since the rise of corpus linguistics. The debate is also
clearly reflected in the present volume. That there is a need for such a debate may
suggest that corpus linguistics has not advanced in the past decades, but it can
also be regarded as a sign of the vitality of the field. A constant re-examination of
the goals of corpus linguistics and a critical discussion of theoretical and
methodological questions are necessary if corpus linguistics is to make significant
progress in the future.
The following issues are brought up for discussion in the first three
programmatic articles of the volume:

the problems of transcription and annotation

the role of intuition in corpus linguistics
corpus-based vs corpus-driven approaches
the relationship between data, description and theory
the conflict between lexical access and the need for research on grammar
and spoken language

These are old questions in corpus linguistics. Although they are largely
methodological in character, they all have theoretical relevance. They are also
closely related.
The transcription of speech and the grammatical annotation of corpora
both involve imposing an analysis of the corpus data. This also means that they
allow the researchers intuition and, in the case of annotation, a preconceived
theoretical model to play a role at an early stage in the research process. But as
Michael Halliday points out in his contribution, transcription and annotation are
different in nature: while transcription of prosodic features provides an essential
part of the meaning of spoken discourse, grammatical annotation adds a
received linguistic description to the data, a description that may be incomplete,
obsolete or incorrect and therefore bound to distort the analysis before it has
started. Halliday recognises the problems involved in prosodic transcription but
also emphasises the desirability of marking such meaningful features as
intonation and rhythm. John Sinclair makes a distinction between mark-up and
annotation and argues that both should be kept separate from the raw text.
According to Sinclair, annotation should be avoided except in corpora used for
Introduction 3

practical applications since it prevents the development of language theory and

description. In this respect, corpus linguistics has still to mature a little (p 55).
Intuition has been discredited in corpus linguistics. Does it have a place at
all and, if so, when is it allowed to play a role? Two contributors, John Sinclair
and Geoffrey Leech, touch on this issue. Both make a distinction between two
senses of intuition: (a) the knowledge of the language of the native speaker and
(b) the analytical expertise of the linguist. Intuition in the former sense is fallible
and unreliable and therefore to be distrusted, except possibly as a hunch to be
tested out in corpora. But in the latter sense intuition is indispensable. An
important task of the corpus linguist is to interpret the patterns of the data and
transform them into theoretical statements.
The distinction between corpus-driven and corpus-based approaches in
language research has been brought into focus and debated in recent years.
Briefly, the approaches can be said to differ in the role given to a theoretical
model in the course of a study. To many linguists the opposition is artificial or
irrelevant as long as a theoretical stance is introduced at some point in the
research process. Halliday accepts the distinction in principle, but cannot see a
clear boundary between theory and data; the borderline is fuzzy and corpus-
driven approaches are normally not entirely theory-free. He also rejects the idea
that corpus-driven linguistics is about parole (as has been maintained); all usage-
based linguistic research is concerned with both parole and langue, i.e. both
usage and system. Sinclair, on the other hand, strongly advocates the corpus-
driven approach on the grounds that corpus-based methods are at best concerned
with testing established theories, though generally no serious testing is done. In
contrast, the corpus-driven approach allows the data to control the analysis and
consequently to create or modify linguistic theories.
Geoffrey Leech, looking at language research in a similar perspective but
in slightly different terms, recognises three levels of investigation:

data (collection) - description - theory

Although corpus studies have a natural starting point in data, Leech objects to the
common assumption that corpus linguistics is concerned with mere data
collection and description. Explaining usage or changes in usage in Leechs
case inevitably involves theoretical considerations. The explanation of usage
may be language-internal or language-external, i.e. motivated by social factors.
As Leech demonstrates, corpus linguistics is naturally suited to usage-based
conceptions of linguistics which (unlike the Chomskyan paradigm) assume that
there is a bridge between the study of naturally-occurring data and the cognitive
and social workings of language (p 78).
Another problem that is no doubt familiar to most corpus linguists but
seldom discussed is the fact that corpus research is by necessity biassed in the
direction of lexis. Corpora are organised lexically and accessed via the
orthographic word. As a result, phenomena at the lexical end of the
lexicogrammatical continuum are more accessible than those at the grammatical

4 Karin Aijmer and Bengt Altenberg

end. As Halliday points out, this problem is especially acute in the study of the
spoken language where meaning is more highly grammaticalized and more
covert. What is needed, according to Halliday, are ways of designing a corpus for
the study of phenomena at the grammatical end of the continuum. This need is
especially great in the area of spoken language where, prototypically, meaning is
made and the frontiers of meaning potential are extended (p 11).
To judge from the present volume, Hallidays appeal for research on the
grammar of spoken discourse is warranted. The great majority of the studies
represented in the volume either focus on the lexical end of the continuum or
explore grammar or text via lexis. Moreover, few of them are specifically devoted
to spoken discourse as such. Two exceptions are Bernard De Clercks
examination of the pragmatic function of lets in the spoken part of ICE-GB and
Clive Souters study of childrens vocabulary in the Polytechnic of Wales (PoW)
Corpus. However, the focus De Clercks investigation is on the functional
variation of lets utterances in different speech categories and it is rather an
example of another important use of corpora, viz. the exploration of language
variation. Similarly, Souters aim is to demonstrate the usefulness of a small but
richly annotated corpus for studies of childrens vocabulary development and, in
particular, how this is affected by such extra-linguistic factors as sex and age.
Several other contributors explore register or regional variation. Jonathan
Charteris-Black uses corpora to compare metaphors in British and American
political discourse and Peter Tan, Vincent Ooi and Andy Chiang investigate the
spoken character of personal advertisements placed on the Web by ESL
speakers in South East Asia, using spoken and written portions of the Singapore
component of ICE as a standard of comparison.
More directly concerned with the structure of discourse are the papers by
Michael Hoey and Hilde Hasselgrd. Both argue that corpus-linguistic techniques
can be used to study patterns in text. However, their starting points are different.
Hoey claims that every lexical item is primed for use in textual organisation (p
174) and consequently examines textual patterns via lexis. Hasselgrd, on the
other hand, starts with a grammatical construction. Her paper investigates the
discourse and information structural functions of it-cleft constructions with an
adverbial in focus position.
As mentioned, the corpus linguist now has access to a wide variety of
corpora, ranging from very large corpora (the Cobuild Bank of English, the
British National Corpus) and carefully designed and annotated million-word
corpora in the tradition of the Brown and LOB corpora (e.g. Frown, FLOB and
the regional variants of ICE) to various smaller corpora collected for specific
purposes (e.g. the Helsinki Corpus, the ICLE corpus, the PoW Corpus). Many of
these are tagged and parsed, permitting the user easy retrieval of specific
grammatical categories. In addition, there is a rapidly growing number of
multilingual corpora with English as one of the languages compared. The
usefulness of all these types of corpora is amply illustrated in the present volume.
Yet, for certain purposes in particular the study of specific domains or
genres that are absent from, or insufficiently represented in, the general-purpose
Introduction 5

corpora the researcher has to collect his/her own corpus. Here material available
on the Web has proved to be a useful additional resource. No less than six of the
contributions to the present volume make use of such material (Charteris-Black,
Kbler, Renouf et al., Tan et al., Hoey, Tognini Bonelli and Manca). However,
using the Web as an unrestricted language resource presents several problems. As
Antoinette Renouf, Andrew Kehoe and David Mezquiriz point out in their
contribution, the nature of the Web as a random accumulation of heterogeneous
texts, many being less conventionally text-like, poses problems for the corpus
linguist who tries to access it through existing search engines (p. 403). Reporting
on a project designed to develop a user-friendly and more selective search tool for
the Web (WebCorp), they discuss some of the difficulties involved and how these
might be overcome or reduced. Their report is the only contribution representing
software development in the volume.
The volume also illustrates a variety of methodological approaches and, in
particular, that the choice of method is to a large extent determined by the
purpose of the study. One well-established method is the use of concordances
where syntagmatic lexicogrammatical patterns are revealed and make it possible
for the researcher to classify and describe the data in general theoretical terms.
This approach is of course especially useful in studies focusing on the lexical end
of the continuum and when the researcher knows which word or expression to
start from. Sometimes, however, there is no obvious lexical starting point. A case
in point is the study of metaphor. In this case the researcher first has to make an
educated guess about which lexical items are likely to serve as vehicles of
metaphors of a certain type (e.g. body parts, terms of war, etc), make a tentative
list of potentially rewarding items, and adjust the list after pilot searches in the
selected corpus material. An example of a study based on such intuitive
sampling is Charteris-Blacks comparison of metaphors in British and American
political discourse mentioned above.
Another example of the methodological problems facing the corpus
linguist is Thomas Kohnens investigation of the history of English directive
speech acts. With speech acts there is no predictable link between form and
function and consequently no systematic and reliable way of retrieving relevant
forms. Kohnen gives a summary of some of the methodological problems
involved and advocates a procedure called structured eclecticism. The method
implies the deliberate selection of typical patterns, such as the use of the
imperative or a performative clause, which are then traced throughout the history
of English. Kohnens diachronic study is also a good illustration of how several
corpora can be combined to throw light on linguistic change (in Kohnens case
the Helsinki Corpus, the electronic version of the Middle English Dictionary and
the Brown and LOB corpora). Another illustration is Geoffrey Leechs
examination of recent changes in English grammar on the basis of data from six
corpora spanning the last four decades of the 20th century.
However, diachronic change can also be demonstrated on the basis of
synchronic variation in recent corpora. Liselotte Brems investigates signs of
delexicalization and synchronic grammaticalization revealed by patterns in the

6 Karin Aijmer and Bengt Altenberg

use of measure nouns in the Cobuild Corpus. In a similar fashion, Gran Kjellmer
combines information from the OED with indications of synchronic variation in
recent corpora (the Cobuild Corpus and the BNC) to explain referential changes
of reflexive pronouns through the centuries.
Contrastive studies based on multilingual corpora require special
methodologies of their own. Here the languages compared serve as mirror images
of each other, highlighting cross-linguistic differences and similarities. For those
concerned with contrastive lexicology, such as Helge Dyvik and ke Viberg,
translation corpora clearly reveal such phenomena as overlapping polysemy,
diverging meaning extensions and language-specific lexical relations (synonymy,
hyponymy, etc). The procedure used in these studies is truly corpus-driven,
although theoretical frameworks guide the analysis at different stages. For Anna-
Lena Fredriksson, who investigates the notion of clausal theme in an English-
Swedish perspective, parallel corpus data help to define a tertium comparationis
and to identify a cross-linguistic theoretical model.
In contrastive research based on corpora of comparable texts from
different languages (rather than translations) the method has to be different. Here
a comparison must be made between typical expressions of concepts and
functions used in comparable situations in the compared languages. This is well
illustrated in Elena Tognini Bonellis and Elena Mancas comparison of
meanings encoded in English and Italian descriptions of farmhouse holidays on
the Web.
Natalie Kblers contribution is also cross-linguistic in character but has a
more clearly defined applied purpose. It reports on an experiment in corpus-
driven learning in the area of cross-linguistic lexicography. Trawling the Web by
means of the WebCorp tool (described by Renouf et al) and comparing the results
with data from multilingual corpora, students are taught to evaluate different
methods and sources for the purpose of building customised dictionaries for
machine translation.
Interlanguage studies on the basis of learner corpora such as the International
Corpus of Learner Language (ICLE) also require a special contrastive methodology.
Patterns of usage in the learners production that deviate from those of native
English writers may be due to contrastive differences between the learners L1 and
the target language. Conversely, contrastive differences can be used to formulate
hypotheses about interlanguage problems that can be checked against data in learner
corpora. As a result, research on learner corpora generally require comparisons with
corpora representing both the learners native language and the target language. This
is well illustrated in Roumania Blagoevas study of the use of demonstrative
pronouns by advanced Bulgarian learners of English.
Corpus linguistics can be combined with different theoretical approaches.
Whether corpus-driven or corpus-based, most of the contributions make some
link with theory. The aim of Joybrato Mukherjees paper on the verb give, for
example, is to bridge the gap between corpus-based research into actual language
use and cognitive grammar. Caroline David attempts to refine existing syntactico-
semantic classifications of putting verbs on the basis of corpus-data. Similarly,
Introduction 7

Peter Willemses study of pseudo-definite NPs in existential constructions is a

usage-informed attempt to create a more exhaustive and refined classification of
different types of pseudo-definiteness than has previously been achieved.
Although the present volume can only give a limited picture of the
advances of corpus linguistics in recent years, the contributions give clear
evidence of the variety and vitality of the field. Electronic corpora are now
exploited for a wide range of purposes. New types of corpora are being created
and new techniques developed to serve the demands of an expanding circle of
scholars who may have different interests and theoretical backgrounds but who
have a common desire to explore the nature of language by studying its use in
authentic texts. The theoretical, methodological and pedagogical issues addressed
in the present volume demonstrate clearly the steady advance of an expanding
discipline inspired by an empirical, usage-based approach to the study of

The spoken language corpus: a foundation for grammatical

M.A.K. Halliday

University of Sydney

1. Introductory
I felt rather daunted when Professor Karin Aijmer invited me to talk at this
Conference, because it is fifteen years since I retired from my academic
appointment and, although I continue to follow new developments with interest, I
would certainly not pretend to keep up to date especially since I belong to that
previous era when one could hope to be a generalist in the field of language
study, something that is hardly any longer possible today. But I confess that I was
also rather delighted, because if there is one topic that is particularly close to my
heart it is that of the vast potential that resides in a corpus of spoken language.
This is probably the main source from which new insights can now be expected
to flow.
I have always had greater interest in the spoken language, because that in
my view is the mainspring of semogenesis: where, prototypically, meaning is
made and the frontiers of meaning potential are extended. But until the coming of
the tape recorder we had no means of capturing spoken language and pinning it
down. Since my own career as a language teacher began before tape recorders
were invented (or at least before the record companies could no longer stop them
being produced), I worked hard to train myself in storing and writing down
conversation as it occurred; but there are obviously severe limits on the size of
corpus you can compile like that. Of course, to accumulate enough spoken
language in a form in which it could be managed in very large quantities, we
needed a second great technical innovation, the computer; but in celebrating the
computerized corpus we should not forget that it was the tape recorder that broke
through the sound barrier (the barrier to arresting speech sound, that is) and made
the enterprise of spoken language research possible. It is ironical, I think, that
now that the technology of speech recording is so good that we can eavesdrop on
almost any occasion and kind of spoken discourse, we have ethics committees
and privacy protection agencies denying us access, or preventing us from making
use of what we record. (Hence my homage to Svartvik and Quirk, which I still
continue to plunder as a source of open-ended spontaneous dialogue.)
So my general question, in this paper, is this: what can we actually learn,
about spoken language and, more significantly, about language, by using a
computerized corpus on a scale such as can now be obtained? What I was
suggesting by my title, of course (and the original title had the phrase at the
foundation of grammatics, which perhaps makes the point more forcefully), was
that we can learn a great deal: that a spoken language corpus does lie at the
12 M.A.K. Halliday

foundation of grammatics, using grammatics to mean the theoretical study of

lexicogrammar this being located, in turn, in the context of a general theory of
language. (I had found it necessary to introduce this term because of the
confusion that constantly arose between grammar as one component of a
language and grammar as the systematic description of that component.) In this
sense, the spoken language corpus is a primary resource for enabling us to
theorize about the lexicogrammatical stratum in language and thereby about
language as a whole.
I can see no place for an opposition between theory and data, in the sense
of a clear boundary between data-gathering and theory construction. I
remember wondering, when I was reading Isaac Newtons Optics, what would
have happened to physics if Newton, observing light passing through different
media and measuring the refraction, had said of himself Im just a data-gatherer;
I leave the theorizing to others. What was new, of course, was that earlier
physicists had not been able to observe and measure very much because the
technology wasnt available; so they were forced to theorize without having
adequate data. Galileo and Newton were able to observe experimentally; but this
did not lead them to set up an opposition between observation and theory
between the different stages in a single enterprise of extending the boundaries of
knowledge. Now, until the arrival of the tape recorder and the computer, linguists
were in much the same state as pre-Renaissance physicists: they had to invent, to
construct their database without access to the phenomena on which they most
depended. Linguistics can now hope to advance beyond its pre-scientific age; but
it will be greatly hindered if we think of data and theory as realms apart, or divide
the world of scholarship into those who dig and those who spin.
It is not the case, of course, that linguists have had no data at all. They
have always had plenty of written language text, starting with texts of high
cultural value, the authors whose works survived from classical times. This
already provoked disputation, in Europe, between text-based scholars and
theoreticians; we find this satirized in the late medieval fable of the Battle of the
Seven Arts, fought out between the Auctores and the Artes. But the auctores
embodied the notion of the text as a model (author as authority); this was written
language as object with value, rather than just as specimen to be used as evidence.
And this in turn reflects the nature of written language: it is language produced
under attention, discourse that is self-conscious and self-monitored. This does
not, of course, invalidate it as data; it means merely that written texts tell us about
written language, and we have to be cautious in arguing from this to the
potentiality of language as a whole. After all, speech evolved first, in the species;
speech develops first, in the individual; and, at least until the electronic age,
people did far more talking than writing throughout their lives.

2. Spoken and written

Throughout most of the history of linguistics, therefore, there has been no choice.
To study text, as data, meant studying written text; and written text had to serve
The spoken language corpus 13

as the window, not just into written language but into language. Now, thanks to
the new technology, things have changed; we might want to say: well, now, we
can study written texts, which will tell us about written language, and we can
study spoken texts, which will tell us about spoken language.
But where, then, do we find out about language? One view might be:
theres no such thing as language, only language as spoken and language as
written; so we describe the two separately, with a different grammar for each, and
the two descriptions together will tell us all we need to know. The issue of same
or different grammars has been much discussed, for example by David Brazil,
Geoffrey Leech and Michael Stubbs; there is obviously no one right answer it
depends on the context and the purpose, on what you are writing the grammar for.
The notion there is no such thing as language; there are only , whether only
dialects, only registers, only individual speakers or even only speech
events is a familiar one; it represents a backing away from theory, in the name of
a resistance to totalizing, but it is itself an ideological and indeed theoretical
stance (cf. Martins 1993 observations on ethnomethodology). And while of all
such attempts to narrow down the ultimate domain of a linguistic theory the
separation into spoken language and written language is the most plausible, it still
leaves language out of account, and hence renders our conception of semantics
particularly impoverished it is the understanding of the meaning-making power
of language that suffers most from such a move.
It was perhaps in the so-called modern era that the idea of spoken
language and written language as distinct semiotic systems made most sense,
because that was the age of print, when the two were relatively insulated one
from the other although the spoken standard language of the nation state was
already a bit of a hybrid. Now, however, when text is written electronically, and
is presented in temporal sequence on the screen (and, on the other hand, more and
more of speech is prepared for being addressed to people unknown to the
speaker), the two are tending to get mixed up, and the spoken/written distinction
is increasingly blurred. But even without this mixing, there is reason for
postulating a language, such as English, as a more abstract entity encompassing
both spoken and written varieties. There is nothing strange about the existence of
such varieties; a language is an inherently variable system, and the spoken/written
variable is simply one among many, unique only in that it involves distinct
modalities. But it is just this difference of modality, between the visual-synoptic
of writing and aural-dynamic of speech, that gives the spoken corpus its special
value not to mention, of course, its own very special problems!
I think it is not necessary, in the present context, to spend time and energy
disposing of a myth, one that has done so much to impede, and then to distract,
the study of spoken language: namely the myth that spoken language is lacking in
structure. The spoken language is every bit as highly organized as the written it
couldnt function if it wasnt. But whereas in writing you can cross out all the
mistakes and discard the preliminary drafts, leaving only the finished product to
offer to the reader, in speaking you cannot do this; so those who first transcribed
spoken dialogue triumphantly pointed to all the hesitations, the false starts and the
14 M.A.K. Halliday

backtrackings that they had included in their transcription (under the pretext of
faithfulness to the data), and cited these as evidence for the inferiority of the
spoken word a view to which they were already ideologically committed. It
was, in fact, a severe distortion of the essential nature of speech; a much more
faithful transcription is a rendering in ordinary orthography, including ordinary
punctuation. The kind of false exoticism which is imposed on speech in the act of
reducing it to writing, under the illusion of being objective, still sometimes gets in
the way, foregrounding all the trivia and preventing the serious study of language
in its spoken form. (But not, I think, in the corridors of corpus linguistics!)

3. Spoken language and the corpus

Now what the spoken corpus does for the spoken language is, in the first instance,
the same as what it does for the written: it amasses large quantities of text and
processes it to make it accessible for study. Some kinds of spoken language can
be fairly easily obtained: radio and television interviews, for example, or
proceedings in courts of law, and these figured already in the earliest COBUILD
corpus of twenty million words (eighteen million written and two million
spoken). The London-Lund corpus (alone, I think, at that time) included a
considerable amount of spontaneous conversation, much of it being then
published in the Corpus of English Conversation I referred to earlier (see Svartvik
and Quirk 1980). Ronald Carter and Mike McCarthy, in their CANCODE corpus at
Nottingham, work with five million words of natural speech; on a comparable
scale is the UTS -Macquarie corpus in Sydney, which includes a component of
spoken language in the workplace that formed the basis of Suzanne Eggins and
Diana Slades (1997) Analysing Casual Conversation. Already in the 1960s there
was a valuable corpus of childrens speech, some of it in the form of interview
with an adult but some of children talking amongst themselves, at the Nuffield
Foreign Language Teaching Materials Project under the direction of Sam Spicer
in Leeds; and in the 1980s Robin Fawcett assembled a database of primary school
childrens language in the early years of his Computational Linguistics Unit at the
(then) Polytechnic of Wales.
These are, I am well aware, just the exemplars that are known to me, in a
worldwide enterprise of spoken language corpus research, in English and no
doubt in many other languages besides. What all these projects have in common,
as far as I know, is that the spoken text, as well as being stored as speech, is also
always transcribed into written form. There are numerous different conventions
of transcribing spoken English; I remember a workshop on the grammar of casual
conversation, about twenty years ago, in which we looked into eight systems then
in current use (Hasan 1985), and there must be many more in circulation now.
What I have not seen, though such a thing may exist, is any systematic discussion
of what all these different systems imply about the nature of spoken language,
what sort of order (or lack of order) they impose on it or, in general terms, of
what it means to transcribe spoken discourse into writing. And this is in fact an
extraordinarily complex question.
The spoken language corpus 15

In English we talk about reducing spoken language to writing, in a

metaphor which suggests that something is lost; and so of course it is. We know
that the melody and rhythm of speech, which are highly meaningful features of
the spoken language, are largely absent; and it is ironical that many of the
transcription systems the majority at the time when I looked into them
abandoned the one feature of writing that gives some indication of those
prosodies, namely punctuation. Of course punctuation is not a direct marker of
prosody, because in the evolution of written language it has taken on a life of its
own, and now usually (again referring to English) embodies a compromise
between the prosodic and the compositional (constituent) dimensions of
grammatical structure; but it does give a significant amount of prosodic
information, as anyone is aware who reads aloud from a written text, and it is
perverse to refuse to use it under the pretext of not imposing patterns on the data
rather as if one insisted on using only black and white reproductions of
representational art, so as not to impose colours on the flowers, or on the clothing
of the ladies at court. The absence of punctuation merely exaggerates the dogs
dinner image that is being projected on to spoken language.
There are transcriptions which include prosodic information; and these are
of two kinds: those, like Svartvik and Quirk (deriving from the work of Quirk and
Crystal in the 1960s), which give a detailed account of the prosodic movement in
terms of pitch, loudness and tempo, and those (like my own) which mark just
those systemic features of intonation and rhythm which have been shown to be
functional in carrying meaning as realizations of selections in the grammar, in
the same way that, in a tone language, they would be realizations of selections in
vocabulary. I use this kind of transcription because I want to bring out how
systems which occur only in the spoken language not only are regularly and
predictably meaningful but also are integrated with other, recognized grammatical
systems (those marked by morphology or ordering or class selection) in a manner
no different from the way these latter are integrated with each other. (Texts 14
illustrate some different conventions of transcription: Text 1 from a tape
recording made and transcribed about 1960; Text 2 from Svartvik and Quirk
1980; Text 3 an orthographic (and somewhat reduced) version of Text 2; Text 4
from Grimshaw 1994.)
Thus there is a gap in the information about spoken discourse that is
embodied in our standard orthographies; and since one major function of a
spoken language corpus is to show these prosodically-realized systems at work, it
seems to me that any mode of transcription used with such a corpus should at
least incorporate prosodic features in some systematic way. They are not optional
extras; in some languages at least, but probably in all, intonation and rhythm are
meaningful in an entirely systematic fashion.
But while it is fairly obvious what an orthographic transcription leaves
out, it is perhaps less obvious what it puts in. Orthographies impose their own
kind of determinacy, of a kind that belongs to the written language: a constituent-
like organization which is not really a feature of speech. Words are given clear
boundaries, with beginnings and endings often somewhat arbitrarily assigned;
16 M.A.K. Halliday

and punctuation, while in origin marking patterns of prosodic movement, has

been preempted to mark off larger grammatical units (there is considerable
variation in practice: some writers do still use it more as a prosodic device). It is
true that spoken language is also compositional: the written sentence, for
example, is derived from the clause complex of natural speech; but its
components are not so much constituents in a constituent hierarchy as movements
in a choreographic sequence. The written sentence knows where its going when
it starts; the spoken clause complex does not. (Text 3 illustrates this second
But writing imposes determinacy also on the paradigmatic axis, by its
decisions about what are, or are not, tokens of the same type. Here the effect of
reducing speech to writing depends largely on the nature of the script. There is
already variation here on the syntagmatic axis, because different scripts impose
different forms of constituency: in Chinese, and also in Vietnamese, the unit
bounded by spaces is the morpheme; in European languages it is the word,
though with room for considerable variation regarding what a word is; in
Japanese it is a mixture of the morpheme and the syllable, though you can
generally tell which morpheme begins a new word. On the paradigmatic axis,
Chinese, as a morphemic script, is the most determinate: it leaves no room for
doubt about what are and what are not regarded as tokens of the same type. But
even English and French, though in principle having a phonological script, have
strong morphemic tendencies; they have numerous homonyms at the morpho-
syllabic interface, which the writing system typically keeps apart. Such writing
systems mask the indeterminacy in the spoken language, so that (for example)
pairs like mysticism / misty schism, or icicle / eye sickle, which in speech are
separated only by minor rhythmic differences, come to be quite unrelated in their
written forms James Joyce made brilliant use of this as a semogenic resource
(but as a resource for the written language). But even in languages with a more
purely phonological script, such as Russian or Italian, the writing system enforces
regularities, policing the text to protect it from all the forms of meaningful
variation which contribute so much to the richness and potency of speech.
So transcribing spoken discourse especially spontaneous conversation
into written form in order to observe it, and to use the observations as a basis for
theorizing language, is a little bit problematic. Transcribing is translating, and
translating is transforming; I think to compile and interpret an extensive spoken
corpus inevitably raises questions about the real nature of this transformation.

4. Some features of the spoken language

I would like to refer briefly to a number of features which have been investigated
in corpus studies, with reference to what they suggest about the properties of
language as a whole. I will group these under seven headings; but they are not in
any systematic order just the order in which I found it easiest to move along
from each one to the next.
The spoken language corpus 17

4.1 Patterns in casual conversation

Eggins and Slade, in their book Analysing Casual Conversation (1997), studied
patterns at four strata: lexicogrammatical, semantic, discoursal and generic. The
first two showed up as highly patterned in the interpersonal domain (interpersonal
metafunction), particularly in mood and modality. At the level of genre they
recognized a cline from story-telling to chat, with opinion and gossip in between;
of the ten genres of conversation that they ranged along this cline, they were able
to assign generic structures to seven of them: these were narrative, anecdote,
exemplum, recount, observation/comment, opinion and gossip. Of the other three,
joke-telling they had not enough data to explore; the other two, sending up and
chat, they said cannot be characterized in generic terms. Their analysis, based
on a spoken corpus, suggests that casual conversation is far from lacking in
structural order.

4.2 Pattern forming and re-forming

Ronald Carter, in a recent paper Language and creativity: the evidence from
spoken English (2002), was highlighting, as the title makes clear, the creative
potential of the spoken language, especially casual speech. He referred to its
pattern forming and re-forming, emphasizing particularly the re-forming that
takes place in the course of dialogue: one speaker sets up some kind of
lexicogrammatical pattern, perhaps involving a regular collocation, an idiom or
clich, or some proverbial echo; the interlocutor builds on it but then deflects,
re-forms it into something new, with a different pattern of lexicogrammatical
wording. This will usually not all happen in one dyadic exchange; it may be
spread across long passages of dialogue, with several speakers involved; but it
can happen very quickly, as illustrated in one or two of Carters examples from
the CANCODE corpus:

[Two students are talking about the landlord of a mutual friend]

A: Yes, he must have a bob or two.
B: Whatever he does he makes money out of it just like that.
A: Bobs your uncle.
B: Hes quite a lot of money, erm, tied up in property and things. Hes
got a finger in all kinds of pies and houses and stuff.
[Two colleagues, who are social workers, are discussing a third
colleague who has a tendency to become too involved in individual
A: I dont know but she seems to have picked up all kinds of lame
ducks and traumas along the way.
B: That -- thats her vocation.
A: Perhaps it is. She should have been a counsellor.
B: Yeah but the trouble with her is she puts all her socialist carts before
the horses.
18 M.A.K. Halliday

4.3 Patterns in words and phrases

There might seem to be some contradiction between this and Michael Stubbs
observation, in his Words and Phrases: corpus studies of lexical semantics
(2000), that a high proportion of language use is routinized, conventional and
idiomatic, at least when this is applied to spoken language. Of course, one way
in which both could be true would be if speech was found to consist largely of
routinized stuff with occasional flashes of creativity in between; but I dont think
this is how the two features are to be reconciled. Rather, it seems to me that it is
often precisely in the use of routinized, conventional and idiomatic features that
speakers creativity is displayed. (I shall come back to this point later.) But, as
Stubbs anticipated in his earlier work (1996), and has demonstrated in his more
recent study (of extended lexical units), it is only through amassing a corpus of
speech that we gain access to the essential regularities that must be present if they
can be played with in this fashion. There can be no meaning in departing from a
norm unless there is a norm already in place to be departed from.

4.4 Patterns in grammar

Michael Stubbs book is subtitled corpus studies in lexical semantics; Susan
Hunston and Gill Francis (1999) is Pattern Grammar: a corpus-driven approach
to the lexical grammar of English: one lexical semantics, the other lexical
grammar. I have written about Hunston and Francis book elsewhere (2001);
what they are doing, in my view, is very successfully extending the grammar in
greater detail (greater delicacy) across the middle ground where lexis and
grammar meet. There is no conflict here with theoretical grammar, at least in my
own understanding of the nature of theory; indeed they make considerable use of
established grammatical categories. But this region of the grammar, with its
highly complex network of microcategories, could not be penetrated without
benefit of a corpus and again, it has to include a spoken corpus, because it is in
speech that these patterns are most likely to be evolving and being ongoingly

4.5 The grammar of appraisal

Eggins and Slade referred to, and also demonstrated in the course of their
analysis, the centrality, in many types of casual conversation, of the interpersonal
component in meaning. Our understanding of the interpersonal metafunction
derives particularly from the work of Jim Martin: his book English Text: system
and structure (1992), several articles (e.g. 1998), and a new book co-authored
with Peter White (forthcoming). Martin focussed especially on the area of
appraisal, comprising appreciation, affect, judgment and amplification all
those systems whereby speakers organize their personal opinions, their likes and
dislikes, and their degree and kind of involvement in what they are saying. These
features have always been difficult to investigate: partly for ideological reasons
The spoken language corpus 19

they werent recognized as a systematic component of meaning; but also because

they are realized by a bewildering mixture of lexicogrammatical resources:
morphology, prosody (intonation and rhythm), words of all classes, closed and
open, and the ordering of elements in a structure. Martin has shown how these
meanings are in fact grammaticalized that is, they are systemic in their
operation; but to demonstrate this you need access to a large amount of data, and
this needs to be largely spoken discourse. Not that appraisal does not figure in
written language it does, even if often more disguised (see Hunston 1993); but
it is in speech that its systemic potential is more richly exploited.

4.6 Non-standard patterns

There is a long tradition of stigmatizing grammatical patterns that do not conform
to the canons of written language. This arose, naturally enough, because
grammatics evolved mainly in the study of written language (non-written cultures
often developed theories of rhetoric, but never theories of grammar), and then
because grammarians, like lexicographers, were seen as guardians of a nations
linguistic morals. I dont think I need take up time arguing this point here. But,
precisely because there are patterns which dont occur in writing, we need a
corpus of spoken language to reveal them. I dont mean the highly publicized
grammatical errors beloved of correspondents to the newspapers; these are
easily manufactured, without benefit of a corpus, and I suspect that that kind of
attention to linguistic table manners is a peculiarly English phenomenon
perhaps shared by the French, Ive heard it said. I mean the more interesting and
productive innovations which pass unnoticed in speech but have not (yet) found
their way into the written language and are often hard to construct with
conscious thought; for example, from my own observations:

Its been going tove been being taken out for a long time. [of a package
left on the back seat of the car]
All the system was somewhat disorganized, because of not being sitting in
the front of the screen. [cf. because I wasnt sitting ]
Drrr is the noise which when you say it to a horse the horse goes faster.
Excuse me is that one of those rubby-outy things? [pointing to an object
on a high shelf in a shop]
And then at the end I had one left over, which youre bound to have at
least one that doesnt go.
Thats because I prefer small boats, which other people dont necessarily
like them.
This court wont serve. [cf. its impossible to serve from this court]
20 M.A.K. Halliday

4.7 Grammatical intricacy

Many years ago I started measuring lexical density, which I defined as the
number of lexical items (content words) per ranking (non-embedded) clause. I
found a significant difference between speech and writing: in my written
language samples the mean value was around six lexical words per clause, while
in the samples of spoken language it was around two. There was of course a great
deal of variation among different registers, and Jean Ure (1971) showed that the
values for a range of text types were located along a continuum. She however
counted lexical words as a proportion of total running words, which gives a
somewhat different result, because spoken language is more clausal (more and
shorter clauses) whereas written language is more nominal (clauses longer and
fewer). Michael Stubbs, using a computerized corpus, followed Jean Ures model,
reasonably enough since mine makes it necessary to identify clauses, and hence
requires a sophisticated parsing programme. But the clause-based comparison is
more meaningful in relation to the contrast between spoken and written discourse.
What turned out to be no less interesting was what I called grammatical
intricacy, quantified as the number of ranking clauses in the clause complex. A
clause complex is any sequence of structurally related ranking clauses; it is the
spoken analogue of (and of course the underlying origin of) what we recognize in
written language as a sentence. In spontaneous spoken language the clause
complex often became extraordinarily long and intricate (see Texts 3 and 5). If
we analyse one of these in terms of its hypotactic and paratactic nexuses, we get a
sense of its complexity. Now, it is very seldom that we find anything like these in
writing. In speech, they tend to appear in the longer monologic turns that occur
within a dialogue (that is, they are triggered dialogically, but constructed by a
single speaker, rather than across turns). Since dialogue also usually has a lot of
very short turns, of just one clause, which is often a minor clause which doesnt
enter into complex structures in any case, there is no sense in calculating a mean
value for this kind of intricacy. What one can say is, that the more intricate a
given clause complex is, the more likely it is that it happened in speech rather
than in writing. But the fuller picture will only emerge from more corpus studies
of naturally occurring spoken language (cf. Matthiessen 2002: 295 ff.).

5. Some problems with a spoken corpus

So let me turn now to some of the problems faced by corpus linguists when they
want to probe more deeply into the mysteries of spoken language. One
problematic area Ive mentioned already: that of representing spoken language in
writing; I would like to add some more observations under this heading. As I
remarked, there are many different conventions used in transcribing, and all of
them distort in some way or other.
The lack of prosodic markers is an obvious and serious omission, but
one that can be rectified in one way or another. In another few decades it may be
possible to devise speech recognition systems that can actually assign prosodic
The spoken language corpus 21

features patterns of intonation and rhythm at the phonological level (that is,
identifying them as meaningful options); meanwhile we might explore the value
of something which is technically possible already but less useful for
lexicogrammar and semantics, namely annotation of speech at the phonetic level
based on analysis of the fundamental parameters of frequency, amplitude and
But, as I suggested, a more serious problem is that of over-transcribing,
especially of a kind which brings with it a false flavour of the exotic: speech is
made to look quaint, with all its repetitions, false starts, clearings of the throat and
the like solemnly incorporated into the text. This practice, which is regrettably
widespread, not only imparts a spurious quaintness to the discourse one can
perhaps teach oneself to disregard that but, more worryingly, obscures, by
burying them in the clutter, the really meaningful sleights of tongue on which
spoken language often relies: swift changes of direction, structures which Eggins
and Slade call abandoned clauses, phonological and morphological play and
other moments of semiotic inventiveness. Of course, the line between these and
simple mistakes is hard to draw; but that doesnt mean we neednt try. Try getting
yourself recorded surreptitiously, if you can, in some sustained but very casual
encounter, and see which of the funny bits you would cut out and which you
would leave in as a faithful record of your own discourse.
But even with the best will, and the best skill, in the world, a fundamental
problem remains. Spoken language isnt meant to be written down, and any
visual representation distorts it in some way or other. The problem is analogous,
in a way, to that of choreographers trying to develop notations for the dance: they
work as aids to memory, when you want to teach complex routines, or to preserve
a particular choreographers version of a ballet for future generations of dancers.
But you wouldnt analyse a dance by working on its transcription into written
symbols. Naturally, many of the patterns of spoken language are recognizable in
orthographic form; but many others are not types of continuity and
discontinuity, variations in tempo, paralinguistic features of tamber (voice
quality), degrees of (un)certainty and (dis)approval and for these one needs to
work directly with the spoken text. And we are still some way off from being able
to deal with such things automatically.
The other major problem lies in the nature of language itself; it is a
problem for all corpus research, although more acute with the spoken language:
this is what we might call the lexicogrammatical bind. Looking along the
lexicogrammatical continuum (and I shall assume this unified view, well set out
by Michael Stubbs (1996) among the principles of Sinclairs and my approach,
as opposed to the bricks-&-mortar view of a lexicon plus rules of syntax) if we
look along the continuum from grammar to lexis, it is the phenomena at the
lexical end that are the most accessible; so the corpus has evolved to be organized
lexically, accessed via the word, the written form of a lexicogrammatical item.
Hence corpuses have been used primarily as tools for lexicologists rather than for
22 M.A.K. Halliday

In principle, as I think is generally accepted, the corpus is just as useful,

and just as essential, for the study of grammar as it is for the study of lexis. Only,
the grammar is very much harder to get at. In a language like English, where
words may operate all the way along the continuum, there are grammatical items
like the and and and to just as there are lexical items like sun and moon and stars,
as well as those like behind and already and therefore which fall somewhere in
the middle; occurrences of any of these are easily retrieved, counted, and
contextualized. But whereas sun and moon and stars carry most of their meaning
on their sleeves, as it were, the and and and to tell us very little about what is
going on underneath; and what they do tell us, if we just observe them directly,
tends to be comparatively trivial. It is an exasperating feature of patterns at the
grammatical end of the continuum, that the easier they are to recognize the less
they matter.
And it is here that the spoken language presents special problems for a
word-based observation system: by comparison with written language, it tends to
be more highly grammaticalized. In the way it organizes its meaning potential the
spoken language, relative to the written, tends to favour grammatical systems. We
have seen this already in the contrast between lexical density and grammatical
intricacy as complementary ways of managing semantic complexity: the written
language tends to put more of its information in the lexis, and hence it is easier to
retrieve by means of lexical searching. Consider pairs of examples such as the
following (and cf. those cited as Text 6 below):

Sydneys latitudinal position of 33 south ensures warm summer

Sydney is at latitude 33 south, so it is warm in summer.

The goal of evolution is to optimize the mutual adaption of species.

Species evolve in order to adapt to each other as well as possible.

If you are researching the forms of expression of the meaning cause, you can
identify a set of verbs which commonly lexify this meaning in written English
verbs like cause, lead to, bring about, ensure, effect, result in, provoke and
retrieve occurrences of these together with the (typically nominalized) cause and
effect on either side; likewise the related nouns and adjectives in be the cause of,
be responsible for, be due to and so on. It takes much more corpus energy to
retrieve the (mainly spoken) instances where this relationship is realized as a
clause nexus, with cause realized as a paratactic or hypotactic conjunction like
so, because or as, for at least three reasons: (i) these items tend to be polysemous
(and to collocate freely with everything in the language); (ii) the cause and effect
are now clauses, and therefore much more diffuse; (iii) in the spoken language
not only semantic relations but participants also are more often grammaticalized,
in the form of cohesive reference items like it, them, this, that, and you may have
to search a long way to find their sources. Thus it will take rather longer to derive
a corpus grammar of causal relations from spoken discourse than from written;
The spoken language corpus 23

and likewise with many other semantic categories. Note that this is not because
they are not present in speech; on the contrary, there is usually more explicit
rendering of semantic relationships in the spoken variants; you discover how
relatively ambiguous the written versions are when you come to transpose them
into spoken language. It is the form of their realization more grammaticalized,
and so more covert that causes most of the problems.
Another aspect of the same phenomenon, but one that is specific to
English, is the way that material processes tend to be delexicalized: this is the
effect whereby gash slash hew chop pare slice fell sever mow cleave shear and so
on all get replaced by cut. This is related to the preference for phrasal verbs,
which has gained momentum over a similar period and is also a move towards the
grammaticalizing of the process element in the clause. Ogden and Richards, when
they devised their Basic English in the 1930s, were able to dispense with all but
eighteen verbs, by relying on the phrasal verb constructions (they would have
required me to say were able to do away with all but eighteen verbs); they
were able to support their case by rewording a variety of different texts, including
biblical texts, using just the high frequency verbs they had selected. These are, as
I said, particular features of English; but I suspect there is a general tendency for
the written varieties of a language to favour a more lexicalized construal of
So I feel that, in corpus linguistics in general but more especially in
relation to a spoken language corpus, there is work to be done to discover ways of
designing a corpus for the use of grammarians or rather, since none of us is
confined to a single role, for use in the study of phenomena towards the
grammatical end of the continuum. Hunston and Francis, in their work on
pattern grammar (1999), have shown beyond doubt that the corpus is an
essential resource for extending our knowledge of the grammar. But a corpus-
driven grammar needs a grammar-driven corpus; and that is something I think we
have not yet got.

6. Corpus-based and corpus-driven

Elena Tognini-Bonelli, in her book Corpus Linguistics at Work (2001), defines
corpus linguistics as a pre-application methodology, comprising an empirical
approach to the description of language use, within a contextual-functional theory
of meaning, and making use of new technologies. Within this framework, she
sees new facts leading to new methodologies leading to new theories. Given that
she has such a forward-looking vision, I find it strange that she finds it strange
that more data and better counting can trigger philosophical repositioning; after
all, thats what it did in physics, where more data and better measuring
transformed the whole conception of knowledge and understanding. How much
the more might we expect this to be the case in linguistics, since knowing and
understanding are themselves processes of meaning. The spoken corpus might
well lead to some repositioning on issues of this kind.
24 M.A.K. Halliday

Like Hunston and Francis, Tognini-Bonelli stresses the difference between

corpus-based and corpus-driven descriptions; I accept this distinction in
principle, though with two reservations, or perhaps caveats. One, that the
distinction itself is fuzzy; there are various ways of using a corpus in grammatical
research that I would not be able to locate squarely on either side of the boundary
where, for example, one starts out with a grammatical category as a heuristic
device but then uses the results of the corpus analysis to refine it further or
replace it by something else. (If I may refer here to my own work, I would locate
both my study of the grammar of pain (1998), and the quantitative study of
polarity and primary tense carried out by Zoe James and myself (1993),
somewhere along that rather fuzzy borderline.) And that leads to the second
caveat: a corpus-driven grammar is not one that is theory-free (cf. Matthiessen
and Nesbitts On the idea of theory-neutral descriptions 1996). As I have
remarked elsewhere (2001), there is considerable recourse to grammatical theory
in Hunston and Francis book. I am not suggesting that they deny this they are
not at all anti-theoretical; but it is important, I think, to remove any such
implication from the notion of corpus-driven which is itself a notably
theoretical concept.
I dont think Tognini-Bonelli believes this either, though there is perhaps a
slight flavour in one of her formulations (p. 184): If the paradigm is not
excluded from this [corpus-driven] view of language, it is seen as secondary with
respect to the syntagm. Corpus-driven linguistics is thus above all a linguistics of
parole. I wonder. Paradigm and syntagm are the two axes of description, for
both of which we have underlying theoretical categories: structure as theory of
the syntagm, system as theory of the paradigm. It is true that, in systemic theory,
we set up the most abstract theoretical representations on the paradigmatic axis;
there were specific reasons for doing this (critically, it is easier to map into the
semantics by that route, since your view of regularity is not limited by structural
constraints), but that is not to imply that structure is not a theoretical construct.
(Firth, who first developed system-structure theory, did not assign any theoretical
priority to the system; but he developed it in the context of phonology, where
considerations are rather different.) So I dont think corpus-driven linguistics is a
linguistics of parole but in any case, isnt that notion rather self-contradictory?
Once you are doing linguistics, you have already moved above the instantial
I can see a possible interpretation for a linguistics of parole: it would be a
theory about why some instances some actes de parole are more highly
valued that others: in other words, a stylistics. But the principle behind corpus
linguistics is that every instance carries equal weight. The instance is valued as a
window on to the system: the potential that is being manifested in the text. What
the corpus does is to enable us to see more closely, and more accurately, into that
underlying system into the langue, if you like. The corpus-driven grammar is
a form of, and so also a major contributor to, grammatics.
The spoken language corpus 25

7. Aspects of speech: a final note

I am assuming that the spoken language corpus includes a significant amount of
authentic data: unsolicited, spontaneous, natural speech which is likely to
mean dialogue, though there may be lengthy passages of monologue embodied
within it. Not because there is anything intrinsically superior about such discourse
as text if anything, it tends to carry a rather low value in the culture; but because
the essential nature of language, its semogenic or meaning-creating potential, is
most clearly revealed in the unselfconscious activity of speaking. This is where
systemic patterns are established and maintained; where new, instantial patterns
are all the time being created; and where the instantial can become systemic, not
(as is more typical of written language) by way of single instances that carry
exceptional value (what I have called the Hamlet factor) but through the
quantitative effects of large numbers of unnoticed and unremembered sayings.
For this reason, I would put a high priority on quantitative research into
spoken language, establishing the large-scale frequency patterns that give a
language its characteristic profile its characterology, as the Prague linguists
used to call it. This is significant in that it provides the scaffolding whereby
children come to learn their mother tongue, and sets the parameters for systematic
variation in register: what speakers recognize as functional varieties of their
language are re-settings of the probabilities in lexicogrammatical choice. The
classic study here was Jan Svartviks study of variation in the English voice
system (1966). It also brings out the important feature of partial association
between systems, as demonstrated in their quantitative study of the English clause
complex by Nesbitt and Plum (1988). My own hypothesis is that the very general
grammatical systems of a language tend towards one or the other of two
probability profiles: either roughly equal, or else skew to a value of about one
order of magnitude; and I have suggested why I think that this would make good
sense (1993). But it can only be put to the test by large-scale quantitative studies
of naturally occurring speech. Let me say clearly that I do not think this kind of
analysis replaces qualitative studies of patterns of wording in individual texts. But
it does add further insight into how those patterns work.
It is usually said that human language, as it evolved and as it is developed
by children, is essentially dialogic. I see no reason to question this; the fact that
other primates (like ourselves!) send out warnings or braggings or other
emotional signals, without expecting a response, is not an objection that need be
taken seriously. Dialogue, in turn, provides the setting for monologic acts; and
this is true not only instantially but also systemically: monologue occurs as
extended turns in the course of dialogic interaction, as a good-sized corpus of
casual conversation will show. Clearly monologue is also the default condition of
many systemic varieties: people give sermons, make speeches, write books,
broadcast talks and so on; but they do so, even if it is largely for their own
satisfaction, only because there are others who listen to them (or at least hear
them) and who read them.
26 M.A.K. Halliday

Any piece of spoken monologue can be thought of as an extended turn:

either given to the speaker by the (contextual) system, as it were, like a
conference paper, or else having to be established, and perhaps struggled for, as
happens in casual conversation. Speakers have many techniques for holding the
floor, prolonging their speaking turn. Some of these techniques are, in Eggins and
Slades terms, generic: you switch into telling a joke, or embark on a personal
narrative. But one very effective strategy is grammatical: the clause complex. The
trick is to make the listeners aware another clause is coming. How you do this, of
course, varies according to the language; but the two main resources, in many
languages, are intonation and conjunction. These are, in effect, two mechanisms
for construing logical-semantic relationships in lexicogrammatical form in
wording. The highly intricate clause complexes that I referred to earlier as a
phenomenon of informal speech embroil the listener in a shifting pattern of
phono-syntactic connections. This is not to suggest that their only function is to
hold the floor; but they help, because listeners do, in general, wait for the end of a
sequence it takes positive energy to interrupt.
What the clause complex really does, or allows the speaker to do, is to
navigate through and around the multidimensional semantic space that defines the
meaning potential of a language, often with what seem bewildering changes of
direction, for example (Text 3) from the doctors expectations to corridors lined
with washing to the danger of knocking out expectant mothers, all the while
keeping up an unbroken logical relationship with whatever has gone before. It is
grammatical logic, not formal logic; formal logic is the designed offspring of
grammatical logic, just as the written sentence is the designed offspring of the
clause complex of speech. This kind of spontaneous semantic choreography is
something we seldom find other than in unselfmonitored spoken discourse,
typically in those monological interludes in a dialogue; but it represents a
significant aspect of the power of language as such.
I have been trying to suggest, in this paper, why I think that the spoken
language corpus is a crucial resource for theoretical research: research not just
into the spoken language, but into language in general. Because the gap between
what we can recover by introspection and what people actually say is greatest of
all in sustained, unselfmonitored speaking, the spoken language corpus adds a
new dimension to our understanding of language as semiotic system-&-process.
That there is such a gap is not only because spontaneous speech is the mode of
discourse that is processed at furthest remove from conscious attention, but also
because it is the most complexly intertwined with the ongoing socio-semiotic
context. Tognini-Bonellis observation that all corpus studies imply a contextual
theory of meaning is nowhere more cogent than in the contexts of informal
conversation. Hasan and Clorans work on their corpus of naturally occurring
dialogue between mothers and their three-to-four-year-old children showed how
necessary it was not merely to note the situations in which meanings were
exchanged but to develop the theoretical model of the contextual stratum as a
component in the overall descriptive strategy (Hasan and Cloran 1990; Hasan
1991, 1992, 1999; Cloran 1994). Peoples meaning potential is activated and
The spoken language corpus 27

hence ongoingly modified and extended when the semogenic energy of their
lexicogrammar is brought to bear on the material and semiotic environment,
construing it, and reconstruing it, into meaning. In this process, written language,
being the more designed, tends to be relatively more focussed in its demands on
the meaning-making powers of the lexicogrammar; whereas spoken language is
typically more diffuse, roaming widelier around the different regions of the
network. So spoken language is likely to reveal more evidence for the kind of
middle range grammar patterns and extended lexical units that corpus studies
are now bringing into relief; and this in turn should enrich the analysis of
discourse by overcoming the present disjunction between the lexical and the
grammatical approaches to the study of text.
Already in 1935 Firth had recognized the value of investigating
conversation, remarking it is here we shall find the key to a better understanding
of what language really is and how it works (1957: 32). He was particularly
interested in its interaction with the context of situation, the way each moment
both narrows down and opens up the options available at the next. My own
analysis of English conversation began in 1959, when I first recorded spoken
dialogue in order to study rhythm and intonation. But it was Sinclair, taking up
another of Firths suggestions the study of collocation (see Sinclair 1966) who
first set up a computerized corpus of speech. Much later, looking back from the
experience with COBUILD , Sinclair wrote (1991: 16): a decision I took in
1961 to assemble a corpus of conversation is one of the luckiest I ever made. It
would be hard now to justify leaving out conversation from any corpus designed
for general lexicogrammatical description of a language. Christian Matthiessen,
using a corpus of both spoken and written varieties, has developed text-based
profiles: quantitative studies of different features in the grammar which show up
the shifts in probabilities that characterize variation in register. One part of his
strategy is to compile a sub-corpus of partially analysed texts, which serve as a
basis for comparison and also as a test site for the analysis, allowing it to be
modified in the light of ongoing observation and interpretation. I have always felt
that such grammatical probabilities, both global and local, are an essential aspect
of what language really is and how it works. For these, above all, we depend on
spoken language as the foundation.

Baker, M., G. Francis and E. Tognini-Bonelli (eds) (1993), Text and technology:
in honour of John Sinclair. Amsterdam: John Benjamins.
Brazil, D. (1995), A grammar of speech. Oxford: Oxford University Press.
Carter, R. (2002), Language and creativity: the evidence from spoken English.
[The Second Sinclair Open Lecture, Department of English, University of
Carter, R., and M. McCarthy (1995), Grammar and the spoken language.
Applied Linguistics 16: 141-158.
28 M.A.K. Halliday

Cloran, C. (1994), Rhetorical units and decontextualization: an enquiry into some

relations of meaning, context and grammar. Monographs in Systemic
Linguistics 6. Department of English, University of Nottingham.
Eggins, S., and D. Slade (1997), Analysing casual conversation. London: Cassell.
Fawcett, Robin, and Michael Perkins (1981), Project report: language
development in 6- to 12-year-old children. First Language 2: 75-79.
Firth, J.R. (1935), The technique of semantics. Transactions of the Philological
Society. Reprinted in J.R. Firth, Papers in linguistics 1934-1951. London:
Oxford University Press, 1957. 7-33.
Grimshaw, A. D. (ed.) (1994), Whats going on here. Complementary studies of
professional talk. Norwood, N.J.: Ablex.
Halliday, M.A.K. (1993), Quantitative studies and probabilities in grammar. In
Michael Hoey (ed.), Data, description, discourse. Papers on the English
language in honour of John McH. Sinclair. London: Harper Collins. 1-25.
Halliday, M.A.K. (1998), On the grammar of pain. Functions of Language 5: 1-
Halliday, M.A.K. (2002), Judge takes no cap in mid-sentence: on the
complementarity of grammar and lexis. [The First Sinclair Open Lecture,
Department of English, University of Birmingham]
Halliday, M.A.K. and Z.L. James (1993), A quantitative study of polarity and
primary tense in the English finite clause. In John M. Sinclair, Michael
Hoey and Gwyneth Fox (eds), Techniques of description: spoken and
written discourse. London & New York: Routledge. 32-66.
Hasan, R. (ed.) (1985), Discourse on discourse. Applied Linguistics Association
of Australia: Occasional Papers 7.
Hasan, R. (1991), Questions as a mode of learning in everyday talk. In Thao L
and Mike McCausland (eds), Language education: interaction and
development. Launceston: University of Tasmania. 70-119.
Hasan, R. (1992), Rationality in everyday talk: from process to system. In Jan
Svartvik (ed.), Directions in corpus linguistics. Berlin: Mouton de
Gruyter. 257-307.
Hasan, R. (1999), Speaking with reference to context. In Mohsen Ghadessy
(ed.), Text and context in functional linguistics. Amsterdam &
Philadelphia: John Benjamins. 219-328.
Hasan, R., and C. Cloran (1990), A sociolinguistic interpretation of everyday
talk between mothers and children. In M.A.K. Halliday, John Gibbons
and Howard Nicholas (eds), Learning, keeping and using language.
Selected papers from the Eighth World Congress of Applied Linguistics.
Amsterdam & Philadelphia: John Benjamins. Vol. 1: 67-99.
Hunston, S. (1993), Evaluation and ideology in scientific English. In Mohsen
Ghadessy (ed.), Register analysis: theory and practice. London: Pinter.
Hunston, S., and G. Francis (2000), Pattern grammar. A corpus-driven approach
to the lexical grammar of English. Amsterdam & Philadelphia: John
The spoken language corpus 29

Leech, G. (2000), Same grammar or different grammar? Contrasting approaches

to the grammar of spoken English discourse. In Srikant Sarangi and
Malcolm Coulthard (eds), Discourse and social life. Harlow: Longman.
Martin, J.R. (1992), English text: system and structure. Amsterdam: John
Martin, J.R. (1993), Life as a noun: arresting the universe in science and
humanities. In M.A.K. Halliday and J.R. Martin, Writing science: literacy
and discursive power. London & Washington, D.C.: Falmer. 221-267.
Martin, J.R. (1998), Beyond exchange: appraisal systems in English. In Susan
Hunston and Geoff Thompson (eds), Evaluation in text. Oxford: Oxford
University Press.
Matthiessen, C. M.I.M. (1999), The system of TRANSITIVITY: an exploratory
study of text-based profiles. Functions of Language 6: 1-51.
Matthiessen, C. M.I.M. (2002), Combining clauses into clause complexes: a
multi-faceted view. In Joan Bybee and Michael Noonan (eds), Complex
sentences in grammar and discourse. Essays in honour of Sandra A.
Thompson. Amsterdam & Philadelphia: John Benjamins.235-319.
Matthiessen, C. M.I.M., and Christopher Nesbitt (1996), On the idea of theory-
neutral descriptions. In Ruqaiya Hasan, Carmel Cloran and David G. Butt
(eds), Functional descriptions: theory and practice. Amsterdam &
Philadelphia: John Benjamins. 39-85.
Quirk, R., and D. Crystal (1964), Systems of prosodic and paralinguistic features
in English. The Hague: Mouton.
Sinclair, J. (1966), Beginning the study of lexis. In C.E. Bazell et al. (eds), In
memory of J.R. Firth. London: Longmans. 410-430.
Sinclair, J. (1991), Corpus, concordance, collocation. Oxford: Oxford University
Stubbs, M. (1996), Text and corpus analysis: computer-assisted studies of
language and culture. Oxford: Blackwell.
Stubbs, M. (2000), Words and phrases: corpus studies of lexical semantics.
Oxford: Blackwell.
Svartvik, J. (1966), On voice in the English verb. The Hague: Mouton.
Svartvik, J., and R. Quirk (eds) (1980), A corpus of English conversation. Lund:
C.W.K. Gleerup.
Tognini-Bonelli, E. (2001), Corpus linguistics at work. Amsterdam &
Philadelphia: John Benjamins.
Ure, J. (1971), Lexical density and register differentiation. In G.E. Perren and
J.L.M. Trim (eds), Applications of linguistics. Selected papers of the
Second International Congress of Applied Linguistics. London:
Cambridge University Press. 443-452.
30 M.A.K. Halliday

Appendix: Transcripts of recorded conversations

Text 1: Passage from tape recording transcribed about 1960

Indented lines represent the contributions of the interviewer, the asterisks
in the informants speech indicating the points at which such
contributions began, or during which they lasted.
The hyphens (-, --, ---) indicate relative lengths of pauses.
Proper names are fictitious substitutes for those actually used.
The informant is a graduate, speaking RP with a normal delivery.

i is this true I heard on the radio last night that er pay has gone net pay
but er -- retirement age has gone up - *for you chaps*
*yes but er*
to seventy*
5 *yes I think thats scandalous*
*but is it right is it true*
*yes it is true yes it is true*
*well its a good thing*
yes *but the thing is that er -* everybody wants more money --
10 *I mean youve got your future secure*
but er the thing is you know -- er I mean of course er the whole thing is
absolutely an absolute farce because -- really with this grammar school
business its perfectly true that - that youre drawing all your your brains of
the country are going to come increasingly from those schools - therefore
15 youve got to have able men - and women to teach in them - but you want
fewer and better ** - thats the thing they want
- fewer grammar schools and better ones --- *because at the
*Mrs Johnson was saying*
20 moment* its no good having I mean weve got some very good men where I
am which is a bit of a glory hole -- but er theres some theres some good
men there theres one or two millionaires nearly theres Ramsden who
cornered the - English text book market -- *and er* - yes hes got a net
income of
25 *hm*
about two thousand five hundred a year and er theres some good chaps there
I mean you know first class men but its no good having first class men -
dealing with the tripe that we get *--* you see thats the trouble that youre
wasting its
30 *hm*
a waste of energy -- um an absolute waste of energy - your - your er method
of selection there is all wrong -- *um
*but do you think its better to have -- er teachers whove had a lot of
experience - having an extra five years to help solve this - problem of
of fewer teachers -- er or would you say - well no cut them off at at
sixty-five and lets get younger*
The spoken language corpus 31

35 of fewer teachers -- er or would you say - well no cut them off at at

sixty-five and lets get younger*
*its no good having I would if I were a head Id and you know and I know
well Id chuck everyone out who taught more than ten years on principle *--
40 *ha ha ha why*
*because after that time as a boy said they either become too strict or too
laxative* --
*ha ha ha ha ha ha - hm*
*yes - but ha ha ha no they get absolutely stuck you know after ten years * * -
45 - they just go absolutely dead - we all
do - bound to you know you you churn out the same old stuff you see - but
um - the thing is I mean its no good having frightfully - well anyway they
they if they paid fifteen hundred a year I mean - if you could expect to get
50 that within -- ten years er er for graduates er you you still wouldnt get the
first class honours - scientists - theyd still go into industry because its a
present er a pleasanter sort of life * * youre living in an adult world and
living in a world which is in the main stream -- I mean school mastering is
55 bound to be a backwater youre bound to you want some sort of sacrifice
sacrificial type of people you know **
no matter what you pay them youve got to pay them more but youve got to
give -- theres got to be some reason you know some - youre always giving
60 out and you get nothing back **
and --- I mean they dont particularly want to learn even the bright ones
theyd much rather -- fire paper pellets out of the window or something or --
no they dont do that but they they -- you know youve got to drive them all
65 the time --- theyve got to have some sort of exterior reason apart from your
own -- personal satisfaction in doing it you know
32 M.A.K. Halliday

Text 2: Passage from Svartvik and Quirk (1980: 215-218)

The spoken language corpus 33
34 M.A.K. Halliday

Text 3: Orthographic (and somewhat reduced) version of Text 2

A: Yes; thats very good. I wouldnt be able to have that one for some reason
you see: this checker board effect I recoil badly from this. I find I hadnt
looked at it, and I think its probably because it probably reminds me you
know of nursing Walter through his throat, when you play checker boards or
something. I think its it reminds me of the ludo board that we had, and I
just recoiled straight away and thought [mm] not not that one, and I didnt
look inside; but thats very fine, [mm mm] isnt it? very fine, yes.
B: Its very interesting to try and analyse why one like abstract paintings, cause
I like those checks; just the very fact that theyre not all at right angles means
that my eyes dont go out of focus chasing the lines [yes] they can actually
follow the lines without sort of getting out of focus.
A: Yes Ive got it now: its those exact two colours you see, together. He had
he had a blue and orange crane, I remember it very well, and you know one
of those things that wind up, and thats it.
B: It does remind me of meccano boxes [yes well] the box that contains
meccano, actually.
A. Yes. Well, we had a bad do you know; we had oh we had six or eight
weeks when he had a throat which was [mhm] well at the beginning it was
lethal if anyone else caught it. [yeah] It was lethal to expectant mothers with
small children, and I had to do barrier nursing; it was pretty horrible, and the
whole corridor was full of pails of disinfectant you know [mm], and you went
in, and of course with barrier nursing I didnt go in in a mask I couldnt
with a child that small, and I didnt care if I caught it, but I mean it was
ours emptied outside you see [mm] and you had to come out and you brought
all these things on to a prepared surgical board [mm mm] and you stripped
your gloves off before you touched anything [mm] and you disinfected oh it
was really appalling [mm]. I dont think the doctor had expected that I would
do barrier nursing you see [mm] I think she said something about she
wished that everybody would take the thing seriously you know, when they
were told, as I did, cause she came in and the whole corridor was lined [mm]
with various forms of washing and so on, but after all I mean you cant go
down and shop if you know that youre going to knock out an expectant
mother. It was some violent streptococcus that hed got and he could have
gone to an isolation hospital but I think she just deemed that he was too small
[yes mm mm] for the experience, and then after wed had him, you know,
had him for a few days at home this couldnt be done. [mhm] She made the
decision for me really, which at the time I thought was very impressive, but
she didnt know me very well: I think she thought I was a career woman who
would be only too glad and would say oh well hes got to go into a hospital,
you know, so she made the decision for me and then said its too late now to
put him into an isolation hospital; I would have had to do that a few days
ago which, I thought, I didnt want her to do!
The spoken language corpus 35

B: Do nurses tend to be aggressive, or does one just think that nurses are
A: Well, that was my doctor [oh], and she didnt at that time understand me very
well. I think she does now.

Text 4: Passage from Grimshaw (ed.) 1994

. . . ) and I / think shes a/ware of this and I / think you / know she . . . // 4
) I / think one / thing thatll / happen I / think that . . . // 1 ) that / Mike may
en/courage her // 1 ) and I / think thatll be / all to the / good //
P. // 4 ) to / what ex/tent are / these / ) the / three / theories that she se/lected // 1
truly repre/sentative of / theories in this / area //
A. // 1 thats / it / ) // 1 thats / it //
P. // 1 ) they / are in/deed //
S. // 1 yeah //
P. // 1 oh // 2 they are / the / theories //
A. // 1 thats about / it //
P. // 1 they are / not / really repre/sentative / then //
S. // 1 well there are // 1 ) there are / vari/ations // 1 ) there are / vari/ations // 1
on / themes but . . . // 4 ) but / I dont / know of any / major con/tender ) there
/ may be // 1 ) well / I dont / know of / anything that / looks much / different
from the / things shes . . . ) she has / looked at in the spe/cific / time //
A. // 4 ) ex/cept for the / sense that
P. // 1 ) so / nobody / nobody would at/tack her on / that ground / then if she
A. // 1 oh no / I dont / think so // 4 ) I think the / only / thing that would be
sub/stantially / different would be a // 1 real / social / structuralist who would
/ say // 4 ) you / dont have to / worry about cog/nitions // 1 what you have to
/ do is / find the lo/cation of these / people in the / social / structure // 1- ) and
/ then youll / find out how theyre / going to be/have with/out having to / get
into their / heads at / all // 4 ) and / that // 1 hasnt been / tested // 1- ) ex/cept
in / very / gross / kinds of / ways with // 1 macro / data which has / generally /
not been / very satis/factory // 1 yeah / ) // 1 ) so I can / tell her that // 3 )
you / know I
S. // 1 ) shes / won //
36 M.A.K. Halliday

Text 5. Choreographic notation for the clause complex of spoken language (cf.
forms of notation in Martin 1992). Clause complex from Text 3 above.
The doctor prob-ably

I would say

That he had to
Go into hospital


Instead of
Asking me

she made the

decision for me

Which at the time

Seemed very impressive

=3 But she didnt

Know me very well

She said

Its too late now to
Put him into a hospital
I should have had to do
That a few days ago
And I thought
To myself
I didnt want to

You to do that
The spoken language corpus 37

Text 6: Spoken translations of some sentences of written


Note: Written originals are those lettered (a) in Set 1 and those in the left hand
column of Set 2.

1. (1a) Strength was needed to meet driver safety requirements in the event
of missile impact.
(1b) The material needed to be strong enough for the driver to be safe if it
got impacted by a missile.
(2a) Fire intensity has a profound effect on smoke injection.
(2b) The more intense the fire, the more smoke it injects (into the
(3a) The goal of evolution is to optimize the mutual adaption of species.
(3b) Species evolve in order to adapt to each other as well as possible.
(4a) Failure to reconfirm will result in the cancellation of your
(4b) If you fail to reconfirm your reservations will be cancelled.
(5a) We did not translate respectable revenue growth into earnings
(5b) Although our revenues grew respectably we were not able to
improve our earnings.

2. Sydneys latitudinal position of 33 Sydney is at latitude 33 south, so it is

south ensures warm summer warm in summer.
Investment in a rail facility implies If you invest in a facility for the
a long term commitment. railways you will be committing
[funds] for a long term.
[The atomic nucleus absorbs [] Each time it absorbs energy it
energy in quanta, or discrete units.] (moves to a state of higher energy = )
Each absorption marks its becomes more energetic.
transition to a state of higher
[Evolutionary biologists have [] when [species] suddenly [start
always assumed that] rapid to] evolve more quickly this is
changes in the rate of evolution are because something has happened
caused by external events [which is outside [] they want to explain that
why ] they have sought an the dinosaurs dies out because a
explanation for the demise of the meteorite impacted.
dinosaurs in a meteorite impact.
38 M.A.K. Halliday

[It will be seen that] a [] it is possible both to replace

successful blending of asset assets and to remanufacture [current
replacement with remanufacture is equipment] successfully. We must
possible. Careful studies are to be study [the matter] carefully to ensure
undertaken to ensure that viability that ([the plan] is viable = ) we will
exists. be able to do what we plan.
The theoretical program of As well as working theoretically by
devising models of atomic nuclei devising models of atomic nuclei we
has been complemented by have also investigated [the topic] by
experimental investigations. experimenting.
Increased responsiveness may be [The child] is becoming more
reflected in feeding behaviour. responsive, so s/he may feed better.
Equation (3) provided a When we used equation (3) we could
satisfactory explanation of the explain satisfactorily (the different
observed variation in seepage rates. rates at which we have observed that
seepage occurs = ) why, as we have
observed, [water] seeps out more
quickly or more slowly.
The growth of attachment between Because / if / when
infant and mother signals the first the mother and her infant grow
step in the childs capacity to (more) attached to one another //
discriminate among people. the infant grows / is growing (more)
attached to its mother
we know that / she knows that / [what
is happening is that]
the child has begun / is beginning / is
going to begin to be able
to tell one person from another /
prefer one person over another.
Intuition and annotation the discussion continues

John Sinclair

The Tuscan Word Centre


Some corpus linguists prefer to research using plain text, while others first
prepare the texts by adding various analytic annotations. The former group
express reservations about the reliability of intuitive data, whereas the latter
group, if obliged to choose, will reject corpus evidence in favour of their intuitive
responses. This paper attempts to move from the broad differences expressed
above to a small number of specific points of contrast between the two

1. Introduction
As the study of language in corpora continues to grow and diversify, differences
of methodology emerge, and there is room for misunderstanding. Aarts (1991,
2002a, 2002b) has monitored the development of the relationship between the
management of corpora and the theory of language, and Tognini Bonelli (2001)
has described contrasting conceptualisations of the relation between theory and
The key concept here is -driven, in the phrase corpus-driven linguistics.
-driven has several characteristic usages, among which we may focus on two,
which might be paraphrased as motivated and controlled. Its use in relation
to corpus linguistics can be traced back to Johns (e.g. 1990) and his data-driven
learning. Here the matter of motivation is on top, as it was found that learners
have unbounded curiosity when they are allowed to interrogate corpora, and
apparently natural learning mechanisms to profit from the curiosity. Francis
(1993) shifted the focus to corpus-driven grammar, where controlled is
perhaps the more appropriate gloss. The grammar should follow the corpus,
accounting for as much as possible of the patterning, and being cautious in
ascribing to the language a pattern that is not attested in the corpus.
Tognini Bonelli (op. cit.) noted that in much corpus research the
theoretical and descriptive positions were carefully insulated from the findings of
the corpus investigations. Though researchers acknowledged that one legitimate
use of a corpus was to test hypotheses, there was no serious testing of the
governing theories. These, it was held, had been forged over many years, and
thoroughly tested against intuitive responses, and they were extremely abstract.
The myriad details of actual usage could provide some helpful reflections of the
40 John Sinclair

theory, but there was no question of threatening the theory with evidence from
usage, however compelling that might be.
Tognini Bonelli called this position corpus-based linguistics, and
contrasted it with corpus-driven linguistics, which specifically places the theory
in a vulnerable position, to be justified or modified according to the results of
investigations the classic posture of the empirical scientist.1 There may be
intermediate positions between these poles, but I cannot imagine any. Either
ones whole cathedral of linguistic structures is ready to receive the scaffolding,
or it is not.
Aarts (op. cit.) offers a penetrating discussion of this dichotomy, and
suggests that the two approaches contrast in their methodologies in two
important places the role they see for a persons intuition, and the place, value
and legitimacy of annotating corpora.
Regarding intuition, he anticipates quite opposed positions; the corpus-
based linguist allows his intuition to overrule his corpus data and hence gives
primacy to the former (2002a: 8) and Aarts expects the corpus-driven linguist to
do the opposite.
There are two observations to be made here. One is that Aarts moves
smoothly from considering the use of intuitive data to the more general point
of the role of the intuition in the process of making linguistic descriptions. But it
is quite reasonable to differentiate between these two positions, to reject the
former and keep an open mind on the latter as I do, and as I think most corpus-
driven linguists would also do. The other point is that I wonder if corpus-based
linguists have ever thought seriously about the priority they assign to intuitive
data. Would they really just set aside a mass of information about how people
use a language when their still, small voice tells them something different?
Leave aside the details, the one-offs, the peculiarities corpus linguistics is
about generalities if anything. Would they really feel secure in preferring their
intuition against measurable, incontrovertible objective evidence? Their only
hope would be to find an explanation for the apparent conflict, and although that
is a laudable aim, it is rarely resorted to because of the low prestige of empirical
data in the last half-centurys linguistics.
In no way do I intend this argument to devalue the importance of
intuition, as will become apparent in a little while. But I urge caution. When
cheap pitch meters became available in phonetics, it was possible to discover
exactly what the pitch contours of an utterance were. It was discovered that
people believed things that were at variance with the facts. They believed, for
example, that questions were spoken on a rising intonation, although in British
English they usually are not, and they would hear the pitch going up when in
fact it was going down. Intuition is not some kind of gut reaction to events, it is
educated in various ways, and sophisticated.
On the topic of annotation, Aarts considers that the contrast between the
two approaches to corpus linguistics is at its most marked in this area. He
deduces that corpus-driven linguists are bound to reject annotation, because it
could hamper their wish to be as close to the plain text as possible, whereas
Intuition and annotation 41

corpus-based linguists, who do not share their concerns, rely on annotation as the
main means by which they express their analysis and make it available to others.
It is, in Aarts uncompromising phrase, an indispensable tool for them.
Certainly there are contrasting attitudes to annotation among corpus
linguists of different styles, but not perhaps as extreme as suggested by Aarts. I
would like to continue this valuable discussion with an examination of the roles
of intuition and annotation in corpus linguistics, because I think that some
misunderstandings have arisen. These are quite understandable in their context,
and I am certain that Aarts is striving to be completely fair in his representation
of all points of view, and particularly those that he does not share. I can, of
course, talk only from my own perspective as following the corpus-driven
approach to research.

2. Intuition
In considering the role of intuition, for example, two issues have in recent years
tended to undermine confidence in the reliability of this elusive faculty. Let us
examine each of them briefly.

1. I have no longer any confidence in the ability of a human being to invent

sentences which display the same patterns of meaning that are to be found in
naturally occurring sentences. This has not always been my position; thirty or
more years ago I published an English grammar which illustrated its points with
almost entirely made-up sentences. I would not do that today. What is more, I
believe that most linguists share my misgivings, and it is easy to find subjective
evidence in support of this position. On the other hand, objective evidence of
what is natural and unnatural is not yet available, and this points up the primitive
nature of even our best descriptions.2
Both our productive ability in making up sentences and our critical
faculty in evaluating those sentences for naturalness are within the skills domain
that is usually held to be informed by intuition; it is clear that they do not match
up, and that even in the behaviour of the same person over time a sentence
can be approved as natural and condemned as unnatural, both positions ascribed
to intuition.
However, invented sentences are not always condemned; the general
agreement that ordinary language users can detect phoney sentences does not
lead to everyone behaving consistently with respect to them. It is a tenable
position to accept that they are different from natural ones but to prefer to study
them because of the insights they are said to give to mental processes. Or in the
business of language teaching to accept that they have no role in mature
discourse but that they are valuable stepping stones towards this. Or to maintain
that the differences between actual and invented sentences are not structurally
important. Or to dismiss the whole point by saying that the circumstances of
actual usage are of no concern to the theoretician, and so the differences are of
no account.
42 John Sinclair

My own position among these alternatives is perhaps over-cautious, but it

is shaped by many years of exposure to both academic and commercial attitudes
and arguments about the use of actual examples in presenting the language. I
simply do not trust my intuition in this matter; and now, when there is an
overabundance of used language available, it is as easy normally to find an
appropriate example from a corpus as to make one up.3 Whats more, if I have a
problem with a big corpus in finding an example, this makes me pause for
thought perhaps I have not specified the example adequately, or perhaps I am
on a wild goose chase.
No doubt in time we will contrive better descriptions, and check our
invented examples against such descriptions; but experience suggests that we
will have a long wait, because descriptions of this quality would enable us to
construct accurate examples by rule. Some fifty years ago in the science of
phonetics researchers found that there was a large gap between their ability to
reproduce by machine the actual speech sounds of an individual subject, which
was so good that the original speaker could be easily identified, and their ability
to synthesise speech by rule using the same machine, which was lamentably
poor. This kind of gap is also showing between our ability to recognise normal
English and our ability to construct it without the benefit of an interactive

2. In the early days of corpus linguistics, when researchers were trying to

interpret the results of probes into the mass of data, it quickly became apparent
that the information that they expected differed substantially from the
information that they received. To recall just one of hundreds of examples, it was
found that the common verbs in English did not occur very frequently in the
meanings that were intuitively associated with them. Everyone knows that give
has to do with a free and generous passing of ownership, take concerns grasping
and holding something, keep essentially is to do with maintenance, and put to do
with placement. However, these meanings were found to be of only minor
significance in a large corpus, besides the meanings of the same verbs in familiar
There are some ready partial explanations for this set of observations.
First of all, three very common verbs, be, have and do have a fully grammatical
role as auxiliary verbs that is the reason for their great frequency, and the
occurrence of, say have meaning possess is far less common, although
recognised as the core meaning of the word. Secondly, there is an exotic
feature of English called the phrasal verb, where a verb usually a very
common one combines with a preposition or an adverbial particle to form a
unit of meaning. Give up, meaning abandon, take over, meaning assume
control of, keep off, meaning stay away from, and put off, meaning
postpone are among the thousands of examples.
But even if we put to one side the auxiliary and phrasal verb uses of these
common verbs, we are by no means down to the core meaning. What we now
find is a host of frequent collocations which make up idiomatic structures,
Intuition and annotation 43

idiomatic in the sense that their meaning does not simply combine the meanings
of the individual words. Examples are take place, take a photograph, take
control, take time.
While these latter phrases were well known to people dealing with
English, their prominence in texts was not, and they had clearly been under-
assessed in reference books; their frequency was overwhelming, and the
intuitively-favoured meaning was insignificant in comparison.4 This was the
beginning of distrust of intuition why did ones intuition fail to come up with
the massively common uses of a word, but instead reported a rather rare one?
The term delexicalisation was used of the process whereby the original
meaning of the verbs which appeared in these patterns was watered down or lost
completely, overlaid with a new meaning that arose from regular collocation. At
that time no-one was questioning the ideas (a) that each word had one or more
meanings, and (b) that one of the meanings had special status as the original or
core meaning. But gradually confidence in these ideas was eroded as it was
realised that a model based on these ideas only fitted the facts marginally, and
left most of the meaningful patterning unresolved in layers of ambiguity.
Estimates of the proportion of text that consists of multi-word lexical units rose
to as high as 80% in some circumstances. The link between the word and the
meaning gradually crumbled.
Delexicalisation is thus an unfortunate term. The word only appears to
lose meaning when the model has no higher unit to show where the meaning of a
multi-word unit is actually created. With the higher unit the lexical item
established we can return to the role of intuition, and take a different view of its
accuracy and relevance.
The lemma TAKE contributes to many lexical items in coselection with
e.g. prepositions and particles to make phrasal verbs, and nouns to maintain the
preference of English for simple verbs such as take a risk instead of risk. These
are not strictly meanings of TAKE , but uses of the word in combinations. If
these and other coselections are removed from the concordance of TAKE then it
might well be that the main remaining meaning of this lemma is as reported by
the intuition.5 The intuition was probably right after all, but if so this has been
obscured by an inadequate model of interpretation.
Problems of the intuition are not always resolved so easily, but the fairly
objective evidence from corpora allows us to study intuitive positions and
reactions with greater clarity, and there is less chance that the intuition will be
dismissed as irrelevant on future occasions.
From these brushes between the corpus and the intuition, it is easy to see
how word could get around that intuition was not to be trusted, and that it tended
to take up a position that could not be supported by corpus evidence. However,
the failing, as so often, is more likely to be in the model, or theory, of language
through which we perceive linguistic events and with which we interpret them.
Take the case of inventing sentences what are you asking your intuition to do?
Any utterance which is part of a communicative event is heavily dependent on
the events preceding it, to the extent that many contextual settings are already
44 John Sinclair

established before the utterance takes place, and the utterance is interpreted with
reference to those settings. If a user of English is asked to produce a sentence of
English in the absence of these settings, it is a most unnatural request, and it is
unlikely that the subject will be able to imagine a suitable communicative event,
master all the relevant settings, mentally construct enough of the preceding
utterances to provide an adequate cotext, and then think up a sensible
contribution although he or she is not involved in the hypothetical event. No
wonder we usually make a hash of it!
Because our basic models of language structure concentrate on sub-
sentential matters, and do not assign a central importance to interaction, we
formulate requests that appear simple enough from the perspective of our model,
but involve processes which are almost impossible to control, as we see when we
look through a richer model.
In the second instance, where a unit of meaning can spread over several
words, the intuition was delivering a perfectly reasonable answer to the question
as asked, but our resident models misinterpreted it, and so we blamed the
intuition and lost confidence in it. This kind of confusion is likely to characterise
future encounters with the intuition as well, until our models are rich enough to
cope with the information they are receiving, both from corpus and from the
intuitive reactions of people with command of the language.
So we both trust our intuitions and keep a wary eye on the strong
possibility of misunderstanding what we are observing. To hint at another area
where this could arise, it has been noticed informally that people recall phrases
that are frequent exceptions rather than normal constructions. Grammar deals
with the regular, and so is resonant with frequency; however many words appear
commonly in phrases which are uncharacteristic of the normal usage of the
words. The intuition will tend thus to retrieve the non-standard structure. For
example the corpus tells us that there are a number of adjectives that do not
usually appear in front of nouns, in what is called the attributive position. Instead
they normally occur after the verb be in the predicative position. However, in
certain collocations, fixed phrases and idioms this restriction is lifted; so users of
English have to remember both the rule and the exceptions. It seems that the
intuition, if queried about a particular adjective, tends to report on the
participation of the adjective in lexical patterns rather than grammatical ones in
phrases with particular collocates that are uncharacteristic of the grammar, rather
than the regular structures. From a grammatical point of view this seems
perverse it is the exceptions that come to the surface first but from a lexical
point of view it is a sensible response, since if an established multi-word
expression has a structure that is exceptional compared with the normal usage of
one of the words in it, this point has to be remembered in connection with the
individual word.
This intuitive response first came to our notice in Cobuild with the
adjective glad, and I have commented before about it (Sinclair 1991).
Overwhelmingly glad is used predicatively, and in some complex constructions.
However, many English speakers, when asked about the usage of this word, cite
Intuition and annotation 45

a phrase from the translation of the bible published in 1611 glad tidings of
great joy that is still alive and well in the speech community. Apart from this
relic, and a few minor phrases, glad will be found, on thousands and thousands
of occasions, in predicative position. Without one of the tiny number of
collocations like tidings, it will sound very odd indeed as an attributive adjective.
There are good reasons for this which I will not go into now, because my point is
that here is another place where our intuitions may appear to report falsely about
the facts of the language. Following a grammar-predominant model, such as we
have, glad will be classified as predicative without question on the basis of
corpus evidence; the intuition may hold on fiercely to the few phrases that
contravene this convention. A model which is more balanced between grammar
and lexis should mediate successfully between these apparently opposed
It seems that glad is not alone in presenting a different pattern to the
grammar and the lexis. The adjectives ill, safe and likely are all found
predominantly in the predicative position, but it is easy to think up phrases where
they are used attributively an ill wind, ill effects; safe haven, safe sex; a likely
story, a likely lad. So in each of these cases we can interpret the role of intuition
as preserving memory of those phrasings which are characteristic of the lexical
patterning, especially when the more general and freer usage of the adjective is
in a contrasting grammatical structure. When what is exceptional in grammar is
typical in lexis, the phrasings are stored as individual items.
While there is at least an interpretive problem in the cases that we have
discussed, there is one process where the intuition can be safely trusted. In the
evaluation of corpus evidence the researcher has virtually no option but to yield
to the organising influence of his or her intuition. Complex patterns of
coselection are immediately interpreted semantically and classified broadly with
respect to each other. The same mental resource that we have seen is unable to
manage coselections outside participation in a genuine communicative context is
apparently razor sharp and completely reliable in a receptive mode.
To illustrate this I present the results of asking The Bank of English what
the principal collocates were of the pattern on the of.6 Any single word form
might occur in the gap, but it should be noted that the collocates are not restricted
to occurrence in that position, but might be anywhere in a ten-word window
around the phrase.
The leading collocates, according to their t-scores, are listed in Table 1. I
believe that anyone with normal fluency in English would find it difficult to scan
this table without making tentative groupings of the words along various
46 John Sinclair

Table 1: Collocates of the phrase on the of (Bank of English May 2002)

1-12 13-24 25-36 37-48

basis depends isle sale
edge depending number day
eve streets effect corner
back future surface cover
part issue banks focused
verge grounds floor comment
based strength island report
side depend focus evidence
brink impact site outcome
outskirts heels night amount
subject morning stroke restrictions
face question advice emphasis

One grouping could go like this:

1. Timing expressions, such as eve, night, morning, day. Here we note that eve is
an unusual word, usually found in poetry and oratory. This is a clue to the
meaning of these expressions, which are used in the timing of important events.
On the stroke of is also a somewhat dramatic timing expression, which needs a
particular time after it, the kind of time that is likely to be signalled by a clock
striking or something similar. As well as the hours, especially midnight, and half-
time, full-time, those unfamiliar with the game of cricket might be surprised to
find on the stroke of lunch/tea in there as well.

2. Spatial indicators, such as back, side, surface, and floor, corner. Site attracts
collocates to do with buildings. Outskirts, streets, banks are more specific spatial
references. Isle and island are parts of place names. Some uses of edge, verge,
brink are also spatial, but on the brink of and on the verge of are commonly used
as complex prepositions introducing mainly dreadful things.

3. The phoric nouns subject, issue, question, whose referents are to be found in
the surrounding cotext; in this phrasing probably just after the of.

4. The complex prepositions on the basis of, on the grounds of, on the strength
of, indicating the reason for a decision.

5. In some cases the lexical item extends beyond the designated phrase; for
example most of the occurrences of face in this phrasing is in the phrase on the
face of it, with a variant on the face of things. On the heels of is usually preceded
by hot or hard, or one of a few variants like close.
Intuition and annotation 47

6. More generally, part fits into a phrasing X on the part of Y, where X is some
action, usually described in derogatory terms, and Y is the actor. Future attracts
talks and similar events on its left, and political problems on its right. Effect is
part of an item which can be represented as X has a Y effect on Z, where X is
some event, Y is an adjective like adverse, dramatic, and Z is something like a
political programme. Cover is usually preceded by the name of a celebrity, and
followed by the name of a journal. Sale predictably attracts the vocabulary of
financial dealings. On the evidence of has a remarkable tendency to come at the
beginning of its clause, introducing the reason for an action which is reported

7. The remaining nouns that occur between o n and of are frequent but
unremarkable collocationally, like number, amount, advice.

8. Depends, depending, depend are verb forms which are likely to come in front
of the expression, as also focus, focused, based, comment, report. Evidence is
typically preceded by one of these verbs. Impact, emphasis and restrictions are
much more likely to precede this phrase than to be the missing noun.

The account of the patterning associated with this phrasal framework is

presented here artificially, because normally any instance of it would come with
a word selected to fill the gap. A fluent user of English, encountering an actual
instance of the phrase, performs an instant interpretation which involves all the
relevant categorisation and a lot more detail besides. There is no escape from
intuition if you have command of the language you are investigating. Even if a
researcher wanted to view the data directly and without the accompaniment of
intuition, it would be almost impossible. It is instructive to examine a
concordance in a language unknown to you to get an idea of what it is like to see
pattern only. But techniques exist for keeping the intuition temporarily at bay,
and these are worth cultivating.
The format of a KWIC concordance is a great help in itself, because the
vertical patterns which are not meaning-bearing are prominent, and can
provide a neutral framework within which the researcher can see patterns
without immediately ascribing meaning to them and therefore establishing
meaning-bearing relationships among them. In reading through Table 1, and
imagining each word in the frame on the of, there are probably instances
where at first the meaningful order was not clear especially if the whole of the
lexical item was not present. On the stroke of, for example, clearly needs to be
followed by a specific time to be intelligible. On the strength of is normally a
complex preposition as noted in point 4 above, where strength has little to do
with strong; however, if preceded by a form of D E P E N D , the preposition
disappears and strength reverts to its independent meaning.
The examination of Table 1, trying each word in it to see if it fits in the
gap and if so how the meaning is organised around it is a kind of alienation
process that I have called degeneralisation (Sinclair et. al. 1996: 177). Since the
48 John Sinclair

essence of finding the meaning-creating mechanisms in corpora is the

comparison of the patterns as physical objects and quasi-linguistic units with
the meanings, it is valuable to be able at times to study the one without the other.
This takes a little skill and practice, but to my mind should be an essential part of
the training of a corpus linguist.

3. Annotation
Aarts (2002a: 10) went so far as to say that annotation is anathema to
corpus-driven linguists. This is a fairly serious misunderstanding, and to clarify
my own position it is necessary to define terms carefully.
Let us first distinguish between mark-up and annotation. They are not
always kept distinct in usage, and their domains may overlap, but they are worth
distinguishing. Both of them are processes which provide additional information
to what is called plain text. Plain text is a straightforward concept, but there
are some who claim not to understand it, so we will start there.
Imagine that you had a long thin reel of paper to write on rather than a
rectangular sheet like a reel of sticky tape but made of paper. You have in front
of you a piece of writing that you want to record onto this reel of paper just a
paragraph. How would you do it? I expect that you would ignore line ends,
remove hyphens that marked words split at line-ends, and otherwise produce a
continuous stream of letters, numbers and punctuation marks in the same
sequence as the original. That is plain text, and it consists of an alphanumeric
If you continue transferring written text in this way, however, you will
soon encounter problems bold face, italic, underlinings for example, and
headings, large fonts and other layout matters. Mark-up is the process of
recording these additional pieces of information by making notes interspersed in
the alphanumeric string. So just before a section of bold face will be a tag that
says from here on there is bold face and just after the section there will be a tag
that says from here on we return to normal face. The tags are coded in a mark-
up language, of which the most widely used has been SGML, now giving way to
XML. So for each note there will be two tags.
In marking text up, then, the aim is to preserve information that would
otherwise be lost in the transfer of text to electronic form. Annotation, which we
will come to later, uses the same conventions as mark-up but has no limits on the
kind of information that is provided. Specifically, it encodes information which
is not directly recoverable from the original text, but is added by a researcher.
Returning to mark-up, now imagine that instead of a written text to be
transferred to the reel of paper, you were faced with a recording of a
conversation. Here there are many more decisions to be taken, because the sound
wave has to be interpreted as an alphanumeric stream. Let us say that you do not
attempt a phonetic transcription, but you adopt the mode of transcription called
orthographic, using ordinary spelling wherever possible, but noting all the false
starts, laughs, coughs and stutters.
Intuition and annotation 49

If you do this conscientiously you will end up with a legible text, but one
which has lost a lot of the original information in the sound wave. Intonation is
poorly represented in punctuation, stress is usually not marked in writing, and all
sorts of emotional and attitudinal meanings will not transfer. You may want to
mark up the transcription to record some of these important items, again using a
tag coding.
There is good motivation for preserving this information, and there are
various ways of preserving it. However, it is important to note that a simple
orthographic transcription has a definite status in and for itself, even though it
may be enhanced by good mark-up. It is legible, and a fluent speaker is usually
able to infer enough of the missing information to understand the transcript with
only occasional difficulty, much as he or she would adjust to a speaker of an
unfamiliar variety of English. You will have included word spaces in your
transcription without difficulty, though you did not hear most of them, and
perhaps speaker change, with some attempt to recognise the various speakers.
You will have made a stab at sentence and paragraph boundaries, and used full
stops and capital letters with confidence.
In the very first corpus of spoken language in electronic form (Sinclair et.
al. 1970 and forthcoming) there was no difference made between capital letters
and small ones because in the early sixties computers could only cope with one
alphabet. There was no punctuation and no indication of speaker change because
transcribers were asked not to include these. Word spaces were present, and this
led to criticism from some purists, but I find no problem in using the
transcribers ability to detect word spaces to improve legibility.7
Conventions such as SGML originated when computers were not nearly
as powerful and flexible as they are today. We have reached a stage where a
recorded conversation can be digitised and all the features of the sound wave
which are relevant to language can be retained in the computer and presented to a
researcher as required so, for example, an orthographic transcription can be
aligned with the sound wave from which it was transcribed, and segments of the
recording can be played back to order, so there is no further need for mark-up.
Similarly, documents can be digitised, retaining all relevant aspects of their
format, layout and typography, and again this information, kept separate from the
alphanumeric stream, can be aligned as and when required.
So the mark-up languages represent a stage of development of computer
text processing which is now obsolete. The updating of existing corpora will be
slow because a lot of material has been tagged (and often re-tagged to keep up
with changes in best practice), and there are some contingent problems which
will be mentioned below.
There is an issue here of the integrity of texts. While it is conceded that no
electronic representation of a text is identical with the original, the object of
making an electronic copy is surely to preserve at least the alphanumeric stream
in its original sequence. Any disturbance to that will lead to difficulties later on,
particularly now that many corpora are much too large for human inspection.
The principal problem is that it is not possible to be sure that all the tags have
50 John Sinclair

been removed, without the accidental removal of some genuine text. There are
two sources of error here one is the accuracy with which the tags have been
inserted, and despite the availability in recent years of SGML parsing and
checking programs there are all sorts of opportunities for error. The other is that
strict adherence to the rules is laborious, and there are a number of short-cuts that
are commonplace, and not necessarily retrievable. The situation was summed up
by Vlado Keselj in a message to the Corpora List in April 2002: Actually,
writing a correct and general SGML detagger would be a *very* difficult task.
Thankfully, there is an easy way of avoiding this problem. The
alphanumeric stream, the plain text file, can be just one of several parallel data
streams, and mark-up tags can be another. When required, these two streams can
be merged, and a single string alternating text and tag can be made. This does not
affect the integrity of the plain text file, and the process can be repeated and
elaborated as required. This system has been in everyday use for some fifteen
years now, but it is still common to find tagged corpora that are not available in
plain text form, and can only be separated by a laborious process of doubtful
We can summarise the arguments around mark-up as follows:

1. The information captured in mark-up is valuable and worth preserving.

2. Mark-up is not the only way of preserving this information.
3. Mark-up is now obsolete as a way of storing text.
4. Marked-up text can be prepared by merging plain text and tags.
5. Corpus material should always be kept in plain text format.

With this in mind, we can now turn to annotation. Annotation uses the same
conventions as mark-up, but is not restricted to features of the original text or
recording. The classic annotation is POS-tagging, which means inserting after
each word in a corpus a code denoting its part of speech, but there are now many
others, some quite unusual and informal, and many corpora are very heavily
I would certainly not condemn all annotation, and I make judicious use of
it myself; but I have reservations about some practices, and about the wisdom of
relying on a platform of annotated text in our present state of knowledge. The
idea that annotation is anathema to people who share my views no doubt arises
because of these stated reservations.
In order to clarify my position, I would like now to make a distinction
between a corpus which is prepared for general use by a community of
researchers, students and workers in the language industries, and one which is
put together for a particular application. My comments and particularly my
reservations largely concern the former type of corpus, often known as a
generic corpus, where I take the simple view that all the information apart from
the plain text should be optional, because (a) some important groups of users
require only that, and (b) most researchers will only require a small subset of the
annotations that might be available. Researchers using statistical methods usually
Intuition and annotation 51

need a large amount of plain text, as do those searching for lexical patterns.
Information from mark-up and annotation would only be of interest in problem
cases, and statistical studies rarely get down to that level of detail. Also, as
annotations become more varied and verbose, no-one will want to make use of
all of them, and if the corpus is only available in fully annotated form, they will
be carrying a lot of baggage around with them.
The other type of corpus, one that is designed and built for a pre-
determined application, will give top priority to the needs of the job, quite
rightly. The type and level of mark-up and annotation will depend on the kind of
queries that the investigation requires, most of which will be knowable in
advance. In such circumstances, which are common in commercial applications,
the best that one can do is appeal for the researchers to observe good practice so
that their corpus may be reusable for other purposes.8 The same situation is
found in the growing practice of putting together quick, highly specialised
corpora, perhaps from the internet, in order to carry out a limited set of tasks,
with no intention of retaining the corpus after the disposable corpus. In such
cases any short-cut is justified and it is irrelevant to suggest that researchers
conform to good practice (see Pearson and Bowker 2002).
So I am only concerned with generic corpus resources. Many prospective
users of such corpora expect to be offered POS tagging and sometimes full
parsing and semantic and pragmatic tagging as well, and there is no reason why
such annotations should not be available with generic plain text corpora, but they
should be optional, and they should conform to the conditions set out above, for
mark-up, and below. Many projects start out with a request to the corpus
linguistics community for a corpus already tagged in a particular way. In the best
scientific tradition, researchers use previous research as their platform, and probe
beyond their predecessors.
Here is where my reservations start. In the first place, all the annotation
systems that I know of that code linguistic information have an element of
human input, of which the smallest-scale intervention is the human correction
of the computers mistakes. In many procedures the computer plays a fairly
minor role in the decision-making and is used just to manage the data; in others
there is a preliminary stage where the input text is manually edited and then
processed automatically.
I have argued for some years that annotation which is not fully automatic
has no place in the toolboxes of generic corpora. It is unavoidable in many
applications because of their need for practical outcomes, and because there are
no suitable tools which are fully automatic. While it is claimed that better and
better analyses are made by researchers working in partnership with the
computers, in Aarts words, at some moment the descriptive model and the
annotation tool derived from it must be frozen if the desired result is to be
achieved (Aarts 2002b). That is a fact of life in applications.
Unfortunately, too many researchers nowadays expect, and accept, off-
the-shelf tools that they do not examine too closely; the tools may be of some
antiquity, but they are not carefully evaluated. There is thus no incentive to work
52 John Sinclair

towards a new generation of fully automatic tools which derive from a corpus-
sensitive analysis, and which may present a rather different picture of the
language from the present ones. The whole procedure of annotation is pretty
frozen at present, and has moved very little in the last decade, because the
theories are not accessible for modification by the data.
There are two compelling reasons why annotation of this kind should not
be offered as part of a generic package.
One is that the models of language used in todays taggers date from a
time before evidence from a corpus was available, and some of them derive from
models which ignored empirical evidence entirely. A corpus can certainly be
used to evaluate and correct the descriptions that come from these models, and
eventually the models themselves, and this does happen in a very small way
concerning some of the details of classification. But, as Tognini Bonelli points
out in the quote early in this paper, for many scholars there is no impetus to
expose the theory to such scrutiny. Overwhelmingly the consensus view of
researchers is that the models are basically correct, and while they can be tidied
up by corpus evidence there is no need to open up the whole complexity of
language theory and description for the sake of some minor blemishes. Better to
get on with the job.
In the view of corpus-driven linguists, the picture is quite different. Their
perception is that corpus study provides a constant, subtle undermining of the
received models of language. The evidence is piling up all the time, but it is
invisible to anyone who looks only through the categories of the received model.
Claims of a high-per-cent accuracy of tagging are misleading, because the
decisions about what is correct and what is wrong are not supported with
linguistic evidence. Also most wrong assignments are systematically wrong,
because the machines are consistent at least, and the researcher is left with two
misgivings: (a) perhaps the computer is offering valuable new information rather
than making mistakes, and (b) the places where the computer is unreliable are
probably just the places where the researcher would like to rely on it.
The other argument against conventional tagging causes some problems
when put, as I sometimes do, in the form Annotation loses information. It
would seem at first sight that annotations add to the information in the corpus,
and indeed terms like enrichment are sometimes rather rashly used to promote
annotated text. Let us start with a simple case and follow it through. Let us agree
that boy, bicycle and brat are all nouns. They each are given the tag N. Once
this is done, they are all identical from the point of view of the tags; their
individuality is lost.
The proponents of annotation argue at this point (a) that there is a gain in
generality in the recognition of what is shared among members of the class N,
and (b) that the individuality of the word is not lost, because the word itself is
still there in the linear stream. These points need to be explored carefully.
First, the gain in generalisation, which is certainly a valid point as long as
generalisation can be demonstrated, but here the informality of the received parts
of speech weakens the argument considerably. No formal definition exists of the
Intuition and annotation 53

class N; computer grammarians rely on an uneasy mix of received

grammatical categories that cannot be represented in a computer, and
discriminatory routines whose only virtue is that they come fairly close in
practice to the received categories, thus reducing the amount of manual labour in
matching them precisely. The painstaking efforts and academic honesty of Biber
et al (1999) is worth noting here, because they doggedly follow the model of
Quirk et al (1985) and so they do not have a chance of aligning their received
categories with the evidence from their corpus. So they resort to talk of the
nouny- ness of nouns (p. 59) and are most unconvincing in their attempt (pp.
255-8) to hold onto species nouns like sort, loads as still in the same class as
nouns like boy and bicycle.
The second argument is that the annotated text gets the best of both
worlds, because the individual word is retained, and the researcher has the choice
of word or tag. But the fact that word and tag alternate in a single linear string
should not deceive us; the text and the tags form two mutually exclusive versions
of the corpus, as Aarts is careful to point out (2002a: 9). While it is possible to
search for a mixed string, for example boy with the tag N, that is essentially a
lexical query and is not likely to be characteristic of the searches.
The replacement of a word like boy by a tag like N loses information in
a more subtle way also, because, having been designated N, the word cannot
be reclassified or seen, even temporarily, as anything else. If the grammar has
failed to note that Oh, boy! is an enthusiastic expression of approval, then that
boy will be replaced by N like any other. A particular view of language is
imposed on the corpus, down to the finest detail, and it is non-negotiable. There
are many areas even in POS tagging where experts differ for good reasons just
what is negative and what is not and what, if anything, lies in between these
polar opposites, for example. Or just what is a modal expression. Each tagger
will put into practice a policy for these categories that is more likely to be the
result of expediency than the elaboration of a theory, and these decisions will
affect a decade or more of research, without the users even being aware of them.
Most researchers are content that someone has tagged the corpus, and they are
not inquisitive as to how this was done, or what the shortcomings are.
One major structural feature of English, often commented on, is the large
number of forms which function as either verbs or nouns, so that a conventional
tagger has a huge job to distinguish them. Promise and promises, for example,
are interchangeable between the two word classes. Also this is a productive area,
so that new crossovers occur daily,9 and even nouns formed by suffix from verbs
can become verbs again, e.g. gift, gifted. This prominent, almost defining feature
of English word classes is completely ignored in normal POS tagging, and all
sorts of tricks and dodges are used to obliterate it. If discussed it is called
portmanteau tagging, which shows the all-pervading grip of the received
models. Why should a grammar of English not recognise a word class that covers
both verb and noun, as well as having one for just verbs and one for just nouns?
This second kind of information loss is the loss of the potential to be
classified in all sorts of different ways according to different criteria; such
54 John Sinclair

flexibility is vital to any theory-based research. This is another way of seeing the
individuality of words, which is denied them as soon as they are given a tag.
From this discussion it is clear that non-automatic annotations are best
confined to applications, where they can expect to remain in use for some time.
Their inclusion among generic resources, however, is misplaced and hazardous,
and it holds back progress substantially. Instead of research projects pushing
ahead with the improvement of fully automatic annotation, a considerable
proportion of the available funding goes into this very flawed activity.
Any unavoidable human role in the process of analysing corpora holds
back progress along many dimensions, but none so obvious as in the size of
corpus to be managed. Generic corpora are now measured in the hundreds of
millions of words, and this figure will rise and rise because each rise in the order
of magnitude shows the need for the next one, and there is no reason why this
should stop at some arbitrary size. Any human input, no matter how tiny, that
grows with the size of the corpus adds so much to the cost and time, as well as
opening an opportunity for inaccuracy, that either the size of the corpus has to be
kept down or costs will soar.
To summarise this complex area, my reservations about annotation are
quite specific, and concern only their inclusion in the resources around generic
corpora. Because they impose one particular model of language on the corpus,
they restrict the kind of research that can be done; because the practice of
annotation normally requires human intervention, it is not a replicable process
and therefore fails the first test of scientific method. Because the models imposed
by current conventions of annotation are unlikely to be informed by corpus
evidence, I believe researchers who use them are likely to make unnecessary
problems for themselves.
None of these reservations are relevant when researchers are concerned
with an application and considering matters such as cost-effectiveness, and are
not interested in any factors outside the application. Annotation as an
exploitation of the mark-up facility is typical of the kind of tool that emerged in
the early days of computing simple, extremely flexible and useful. The other
side of the coin is that it can be uncontrolled, invasive and overwhelming; I
believe that most of the research projects in corpus linguistics that are in progress
at the present time are not examining their languages at all, but are examining the
tags. The particular choices of word combinations that corpora uniquely offer us
are impossible to retrieve using tags.
As a matter of personal practice, I have very little need for non-automatic
annotation, and I use plain-text corpora whenever possible. This is because I am
primarily interested in the implications of corpus study for the development of
language theory and description. If I was obliged to use only annotated corpora
to work with which is the settled policy of, for example, the Arts and
Humanities Research Board in UK, which funds most of the relevant research
then my work would be hampered if not rendered impossible.
This is where we come to the crunch about annotation, where I think I
part company not only with Jan Aarts but with quite a proportion of the ICAME
Intuition and annotation 55

community. This is because I do not regard the description of languages as

application, and therefore I would advise against using annotations of the kind
we have available at present in the practice of language description.
I must define what I mean by application as carefully as possible, because
the word can be used to describe many relationships between theory and
practice; a description of a language is often seen as an application of a theory,
for example, but that is not the sense in which I want to use the term. For me an
application in linguistics is the use of language tools in order to achieve a result
that is relevant outside the world of linguistics. If you are building a machine that
will hold a telephone conversation, for example, that is an application, or a
translating machine or even writing a dictionary; the end users are not
necessarily nor even primarily linguists, and so these projects are applications of
But research that tries to produce a better description of English grammar,
for example, is not an application; it is only directly relevant to other
grammarians. My contention is that whereas there is justification in applications
(in my sense) for using any tools that may further the work, this is not so in
language description for its own sake. In the former case the judgement is by
results, and the end justifies the means, so if the translation machine works well
it matters little what is inside it.10 But how do we tell if a description is of good
Descriptions are evaluated from above and below, so to speak. Lying
between the theory and the data, a good description is one that shows few
discrepancies in either direction. Its categories will be consistent with the theory,
and they will account comprehensively for the patterns observed in the data. But
if the data is preprocessed by annotation which is not automatic and is avowedly
an elaboration of the theory, then there is clearly a vicious circle in operation.
The theory cannot come under attack because the only available view of the
corpus is one viewed via the theory.
Corpus linguistics has still to mature a little, to shake off the last traces of
the days when a corpus was a major problem for a fledgling computer, and
where Mr Fixit attitudes were welcome because they led to quick, if perhaps
wobbly, results. The demands of todays researchers are ever more sophisticated,
and the software facilities they are offered are often built on shaky foundations.
The results of applications using annotated corpora are uniformly unimpressive
when they concern the appreciation of meaning in open text, and as a
consequence workers in Information Technology do not trust the structure of
language, and talk of it in degenerate terms reminiscent of Chomskys dismissal
of performance (Chomsky 1965: 3 f.).
My unease about the over-use of annotation an annotated corpus can
reach a condition where over 80% of the bulk consists of the annotations,
compared with less than 20% the texts always ends up with concern that the
models underlying the annotation are neither adequate for nor even relevant to
the description of the language in a corpus.11 Most researchers are not language
theorists, and they take on trust the software that is offered, and apply it
56 John Sinclair

uncritically. Corpus-driven linguistics aims at developing the models so that they

become more reliable; it is reasonable to suppose that as the models improve, the
descriptive categories become more amenable to automation, and annotations
always optionally could become associated with generic corpora.
There is no space for me to illustrate these points with reference to actual
cases, but I can refer the interested reader to my review of Biber et al (1999)
published in IJCL 6/2 (Sinclair 2001). This grammar explicitly applies a pre-
corpus model of language to a small corpus and annotates the corpus as a first
step. Despite what must have been an enormous effort of silent editing, the
evidence that surfaces in the book consistently fails to validate the categories of
the imposed description.

4. Conclusion
This has been an exercise in clarification, because for many linguists working
with corpora it might seem bizarre that one group distinguish themselves by
denying any role for the intuition, and condemning the normal practice of
annotation. I cannot, of course, speak for all researchers who might see
themselves as sharing a corpus-driven perspective, but I hope that I reflect their
general position fairly. They have a great respect for intuition, and cannot work
without it. The cannot applies in two meanings they are constantly guided by
it, and they could not get rid of it if they wanted to. As part of their professional
stance they cultivate the skills of degeneralisation, allowing them to stand back a
little from participating in the language events they observe as researchers, and
to defer momentarily the intuitive response; this gives them a small amount of
independence from their intuitions.
They appreciate, moreover, that intuitive responses need careful
interpretation, and they respect the limits of intuitive competence; in particular
they do not expect that if they invent a sentence their intuitions will ensure that it
has all the features of a naturally-occurring one.
At present corpus-driven linguists are not likely to have much use for
annotation, because most of the available systems suffer from the twin
drawbacks that their underlying model of language is pre-corpus, and that they
fit the corpus so badly that human intervention is necessary. Annotation,
however, even of the limited kind we have, has its place in applications, where
quick results are needed and rough-and-ready ones will suffice.
Perhaps the main difference between the two methodological stances in
corpus linguistics is their attitude to the use of annotations, of the present-day
variety, in purely descriptive studies. To the corpus-based linguist they are
indispensable, whereas to the corpus-driven linguist they are obfuscating.
But provided that the various safeguards discussed above are respected
(including those raised in connection with mark-up) there is no objection to the
practice of annotation in itself; used without understanding of its limitations it is
a hazardous practice. Perhaps newcomers to the growing profession of corpus
linguist should be given a few warnings that annotation is a coding convention
Intuition and annotation 57

that has no controls beyond the grammar of the code, that the appearance of an
annotated corpus belies the fact that it is an alternation of two separate and
incompatible codes (in the sense that plain text is also a code), that the two
coding streams should always be maintained separately, and that non-automatic
annotation is essentially subjective.


1. The pattern grammars mark a first step in following the corpus evidence
with little or no grammatical preconceptions, and Hunston and Francis (1999)
give a thorough explication of this approach.
2. See the discussion in Sinclair (1984).
3. The phrase used language is from Brazil (1995); while a little whimsical to
be a regular term, it allows us to avoid the issue of authenticity that is such a
humbug in this kind of discussion.
4. It is always conceded that frequency is a crude measure of importance,
and more an indication of a criterion than a criterion in itself. But where two
uses of a word show massive discrepancies in frequency, and the less common
one is the one that first comes to mind, then there is some explaining to be
5. There are 755784 instances of TAKE in the Bank of English, so it would be a
considerable though worthy labour to check this. I have looked at several
small samples, and I have not so far found any convincing examples of the
core meaning, but I would expect them to be few and far between.
6. The Bank of English stood at a little less than 500 million words when this
data were retrieved. Details of the corpus can be found at
http://www.cobuild.collins.co.uk. I am grateful to The University of
Birmingham, co-owners of the corpus, for access to it.
7. An example of this kind of text can be found in the file LEXIS at
http://ota.ahds.ac.uk/, being transcripts of recordings made at the University of
Edinburgh in the early 1960s.
8. See Wynne (ed) (forthcoming) for an example of such guidance.
9. Todays example: I badged my way into the lobby. said by a police
inspector arriving at a crime scene (Patterson 2002: 23).
10. Some might say that if the description is inaccurate then the machine will
never work properly, and that there is evidence in the performance of such
devices that support this position. But it is an empirical question.
11. Attitudes change quickly in this area of study, and I can only be sure that in
the few years up to the composition of this paper in 2002, SGML format was
regarded as the standard among the advisers to AHRB. The advisers have
changed, thankfully, and there may now be a greater understanding of the
58 John Sinclair

numbing effect of having to view ones data through the imperfect vision of


Aarts, J. (1991), Intuition-based and observation-based grammars, in K. Aijmer

and B. Altenberg (eds), Corpus linguistics. Studies in honour of Jan
Svartvik. London: Longman. 44-62.
Aarts, J. (2002a), Does corpus linguistics exist? Some old and new issues, in L.
Breivik and A. Hasselgren, From the COLTs mouthand others.
Amsterdam: Rodopi. 1-17.
Aarts, J. (2002b, forthcoming), Review of E. Tognini Bonelli, Corpus linguistics
at work. International Journal of Corpus Linguistics 7 (1).
Biber, Douglas, S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999),
Longman grammar of spoken and written English. London: Longman.
Brazil, D. (1995), A grammar of speech. Oxford: OUP.
Chomsky, N. (1965), Aspects of the theory of syntax. Cambridge, Mass.: MIT
Francis, G. (1993), A Corpus-Driven Approach to Grammar, in M. Baker, G.
Francis and E. Tognini Bonelli, Text and technology. Amsterdam: John
Hunston, S. and G. Francis (1999), Pattern grammar. Amsterdam: John
Benjamins. 137-156.
Johns, T. (1990), From printout to handout: Grammar and vocabulary teaching in
the context of data-driven learning. CALL Austria 10: 14-34.
Patterson, J. (2002), 1st to Die. London: Headline.
Pearson, J. and L. Bowker (2002), Working with specialised language: a
practical guide to using corpora. London: Routledge.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Sinclair, J. (1984), Naturalness in language, in J. Aarts and W. Meijs (eds),
Corpus linguistics: Recent developments in the use of computer corpora in
English language research. Amsterdam: Rodopi. 203-210.
Sinclair, J. (1991), Shared knowledge, in J. E. Alatis (ed), Linguistics and
language pedagogy. The state of the art. Georgetown University Round
Table on Languages and Linguistics: Georgetown University Press,
Washington D.C. 489-500.
Sinclair, J. (2001), Review of Biber et al., The Longman grammar of spoken and
written English, in IJCL 6 (2): 339-359.
Sinclair, J., S. Jones and R. Daley (1970), English lexical studies. Report to OSTI
on Project C/LP/08. Revised edition forthcoming 2003: E n g l i s h
Collocation Studies, ed. by R. Krishnamurthy, Introduction by W. Teubert.
Birmingham: Birmingham University Press.
Intuition and annotation 59

Sinclair, J., J. Payne and C. Hernandez (eds) (1996), Corpus to corpus A study
of translation equivalence. International Journal of Lexicography 9 (3)
(Special Issue): 172-196.
Tognini Bonelli, E. (2001), Corpus linguistics at work. Amsterdam and
Philadelphia: John Benjamins.
Wynne, M. (ed) (forthcoming), Developing linguistic corpora a guide to good
practice (provisional title). Web and Print versions. Oxford Text Archive.
Recent grammatical change in English: data, description,

Geoffrey Leech

Lancaster University


This chapter begins by considering the contrast between the data-driven

paradigm characteristic of corpus linguistics and the theory-oriented paradigm
characteristic of some other schools of linguistics, particularly those espousing a
generative framework. To illustrate the corpus linguistics paradigm in detail, I
present a case study of grammatical differences observed in the LOB and FLOB
corpora and also other corpora of the early 1960s and the early 1990s. By
abductive or inductive inference from the observed data, (fallible) descriptive
generalizations can be made, and tentative conclusions of theoretical interest can
be drawn. In conclusion, I argue that corpus linguistics is not purely
observational or descriptive in its goals, but also has theoretical implications.
However, like a theory-driven inquiry in the classic formulation of Poppers
hypothetico-deductive method (1972: 297), a corpus linguistic investigation can
only lay claim to provisional truths, and therefore requires confirmation or
refutation by further research findings.

Table 1. Summary of the contents of this article

A. Metatheoretical preamble
B. Case study:
Recent grammatical changes in (mainly) written (mainly) British
English viz. frequency changes between 1961 and 1991-2:
(a) modal auxiliaries and semi-modals
(b) other grammatical categories relating to colloquialization
C. Conclusions

1. Introduction

In the 1960s, one of the widely-accepted fundamentals of linguistics was to be

found in Chomskys hierarchy of three levels of adequacy (1964: 62-3):

(1) Explanatory adequacy is achieved when the associated linguistic theory

provides a general basis for selecting a grammar that achieves [descriptive]
adequacy over others that do not.
62 Geoffrey Leech

Descriptive adequacy is achieved when the grammar gives a correct account

of the linguistic intuition of the native speaker, and presents the observed
data (in particular) in terms of significant generalisations that express the
underlying regularities of the language.
Observational adequacy is achieved if the grammar presents the observed
data correctly.

One of the implications of this formulation was a downgrading of the importance

of empirical observation: as Chomsky himself pointed out, observation adequacy
could be achieved by a mere listing of the data. Another implication, as I saw it,
was a confusion between two notions of intuition: Chomskys concept of
descriptive adequacy confused the knowledge of the language of a native speaker
with the analytic knowledge or expertise of the linguistic scientist, able to make
significant generalizations about the language. In Leech (1968) I argued this case,
and suggested a different hierarchy of three levels, which would be a more
realistic account of the main strata of investigation in linguistics:

(2) Theory: formal [and functional] characterization or explanation of language

as a phenomenon of the human mind and of society.
Description: formal [and functional] characterization of a given language, in
terms of the theory.
Data collection: collection of observations which a description, and
ultimately a theory, has to account for [e.g. corpora]

Since that time, the more empiricist and more rationalist trends in linguistics have
diverged so far as to be almost irreconcilable. However, I still find the
formulation in (2) useful, although I would now prefer to insert the words in
square brackets [and functional], showing my preference for a combination of
formal and functional explanation which corpus linguistics is characteristically
attracted to. The other words in brackets [e.g. corpora] are of course a
reminder that corpus linguistics finds its raison dtre at the observational or
data-collection stratum of these three, the one that Chomsky found to be of such
little importance. However, my overarching goal in the present chapter is to
explore the relation between these three interrelated levels, and to argue against
the common assumption that corpus linguistics is concerned with mere data
collection or mere description.
Recent grammatical change in English 63

2. A case-study: recent changes in English grammar

Alongside this, I also have a more practical goal, which is to exhibit as a case
study a particular area of linguistic description: recent quantitative change in
English grammar, as observed through the comparison of the LOB and FLOB
corpora. Although the main study has been focused on the LOB and FLOB
corpora, and therefore on written British English, it has been supplemented where
practicable by work on other corpora permitting a similar comparison between
English in the early 1960s and in the early 1990s. I will use this case study as a
means of illustrating the relation between the three levels of theory, description
and data collection or, to put them in the order which would more naturally
occur to a corpus linguist data collection, description and theory.

2.1 Data collection: using the LOB, FLOB, and other corpora

To begin with the level of observation: we began with a study of the two
matching corpora LOB and FLOB, which had already been part-of-speech
tagged, through the combined processing of two taggers: CLAWS4 and Template
Tagger (see Smith 1997 on the tagging techniques).1 By using the powerful
annotation-aware search and retrieval tool Xkwic (Christ 1994), we found it
possible to extract occurrences of a whole range of grammatical categories that
have been suspected, with varying degrees of empirical backing, to have become
more frequent or less frequent in the recent past. The main areas of grammar we
focus on in this chapter are (a) the modal auxiliaries, together with the mixed
array of verbal constructions conveniently termed semi-modals, and (b) a range
of grammatical phenomena associated with a suspected trend of
Although we began with the LOB and FLOB corpora, we extended our
study to a selective use of some other comparable corpora spanning
approximately the same period of 30 years, as shown in Table 2.
The family of four matching corpora Brown, LOB, Frown and FLOB
(henceforward termed the Brown family) is well placed to provide evidence of
frequency changes in British and American English over the period between 1961
and 1991-2. Unfortunately no comparable corpora for spoken English exist, but
we were reluctant to confine our attention to written (printed) language,
especially considering that much grammatical innovation is likely to originate in
the spoken language. With the permission and help of Bas Aarts and Gerry
Nelson at University College London, we were able to identify small comparable
spoken subsets from two other million-word corpora developed at UCL with data
from around the early 1960s and the early 1990s.3 These were the corpus of the
Survey of English Usage (SEU), of which a large spoken part was computerized
and distributed as the London-Lund Corpus, and the International Corpus of
English (the British variant known as ICE-GB). Because of difficulties of
matching samples, the spoken mini-corpora from SEU and ICE-GB were even
smaller, indeed much smaller, and were moreover less closely matched than the
64 Geoffrey Leech

Table 2. The corpora of English used in the study

Name of corpus American Date of Spoken Corpus size and design
or British data or
English collected written
LOB Corpus BrE 1961 Written Each corpus contains
Brown Corpus AmE 1961 Written approx. a million words, in
500 text samples from 15
FLOB Corpus BrE 1991 Written different genres. The four
Frown Corpus AmE 1992 Written corpora are built according
to the same design and
sampling method.
SEU-mini-sp BrE 1959- Spoken Each (sub)corpus contains
1965 approx. 80,000 words from
ICE-GB-mini-sp BrE 1990- Spoken a comparable and balanced
1992 range of spoken genres.

Brown family of corpora. One difficulty was that, although the SEU corpus had
been collected over a period of about 30 years, comparability with LOB and
Brown dictated that we rejected any material not contemporaneous with the
written corpora, a constraint we interpreted rather liberally to exclude any
material outside the time frame 1959-1965. Another problem was that the SEU
corpus was subdivided into texts of 5000 words each, whereas the ICE-GB texts
were of 2000 words each. Hence a one-by-one matching of texts between the two
spoken mini-corpora was not feasible, and partial and overlapping matchings had
to be allowed.
Because of these drawbacks, particularly the restriction of the mini-corpora
of speech to a mere 80,000 words each, our findings from the spoken corpora
could only be seen as highly tentative indicators of what was happening to spoken
English over this period. Nevertheless, we felt that such a study, however
inadequate and provisional, would be preferable to a survey of recent
grammatical change which took no account of the spoken language. In fact,
differences observed between the mini-corpora in the frequency of modals and
semi-modals were tantalizingly even greater than those observed between LOB
and FLOB. A summary of the contents of the two spoken mini-corpora is given in
Table 3.
The sophisticated ICECUP software available for searching the ICE-GB
could not be used with SEU-mini-sp, and so to ensure comparability we decided
to use the WordSmith retrieval package and XKwic for both mini-corpora.
Recent grammatical change in English 65

Table 3. Mini-corpora for studying language change in recent British spoken

Name of corpus: Survey of English Usage International Corpus of English
spoken Mini-corpus (Great Britain) spoken Mini-
Abbreviation: SEU-mini-sp ICE-GB-mini-sp
Period of texts: 1959-1965 1990-1992
Size: 80,000 words each
Texts from these (in each corpus:) conversation, broadcast discussions, sports
categories: commentaries, other commentaries, broadcast news, broadcast talks

This section of the chapter has been called Data collection, and under this
heading we can bring together the basic evidence-providing tools of the corpus
linguists stock in trade. Obviously, these include the corpora used for this
particular study, and the software used to extract the relevant grammatical
phenomena in this case the search and retrieval tools XKwic and WordSmith.
Basic retrieval products such as concordances and frequency lists, especially
when they incorporate the results of simple grammatical analysis such as POS
tagging, might be considered to take us beyond mere data collection, and to bring
us to the threshold of the descriptive level of analysis. However, the scale of
abstraction represented by the three levels of data collection, description, and
theory is best assumed to consist of many small steps, rather than three giant
strides. I return to the matter of data collection versus description in 2.2 below.
Although so far my presentation of the three levels has worked from the
bottom up, this is of course by no means inevitable in the methodology of corpus
linguists. Some studies are problem-driven where the need to investigate a
particular theoretical or descriptive hypothesis may determine the collection or
selection of a suitable corpus, and the selection of particular corpus data to be
studied. But in the present case, the bottom-up methodology prevailed. We did
not start with a particular theoretical claim (say about the process of historical
change) or a particular descriptive hypothesis (say about the English modals),
although our study led to these. It was the existence of the LOB and FLOB
corpora, and the particular equivalence relation between them (found also
between Brown and Frown) which enticed us to follow the example already set
by Hundt, Mair and others, and to use these corpora to investigate recent changes
in grammar.4

2.2 Description: the modals and semi-modals

The descriptive level of linguistic investigation attempts to determine what can be

truly said about some aspect or level of the language, in this case English
grammar. On the face of it, an example of linguistic description is provided by
Table 4, showing changes in the frequency of modal auxiliaries over the 30-year
period as reflected by the paired corpora.5 However, at this stage, statements are
66 Geoffrey Leech

being made about a particular set of corpora, rather than about the language that
they exemplify. We could call this level of statement data description: an
intermediate step between data collection and linguistic description.

Table 4. Frequencies of modals in the four written corpora (including negative


British English Log Diff % American English Log Diff

likhd likhd %
LOB FLOB Brown Frown

would 3028 2694 20.4 -11.0 would 3053 2868 5.6 -6.1
will 2798 2723 1.2 -2.7 will 2702 2402 17.3 -11.1
can 1997 2041 0.4 +2.2 can 2193 2160 0.2 -1.5
could 1740 1782 2.4 +2.4 could 1776 1655 4.1 -6.8
may 1333 1101 22.8 -17.4 may 1298 878 81.1 -32.4
should 1301 1147 10.1 -11.8 should 910 787 8.8 -13.5
must 1147 814 57.7 -29.0 must 1018 668 72.8 -34.4
might 777 660 9.9 -15.1 might 635 635 0.7 -4.5
shall 355 200 44.3 -43.7 shall 267 150 33.1 -43.8
ought 104 58 13.4 -44.2 ought 70 49 3.7 -30.0
need 78 44 9.8 -43.6 need 40 35 0.3 -12.5
Total 14667 13272 73.6 -9.5 Total 13962 12287 68.0 -12.2

In this chapter we will be almost entirely concerned with description in

terms of relative frequency, or relative likelihood, of occurrence.6 Table 4 records
the frequency of each modal auxiliary of the canonical set of modals in each of
the Brown family of corpora. In the absence of other explanations (such as the
corpora being importantly different in other ways than in the dates of their
composition) we can tentatively conclude that these differences reflect different
states of the language: that between 1961 and 1991, the modals declined very
significantly in frequency in written English in both American and British usage.
(The overall percentage losses are 9.5% in BrE and 12.2% in AmE). The fourth
and ninth columns in Table 4 tell us how much the frequencies of the modals
have declined, as a percentage of the 1961 figures. The fifth and tenth columns
provide a second measure of the degree of decline, this time using the log
likelihood ratio (G2) as a measure of significance (Dunning 1993). In these
columns, any score of 3.8% or over is calculated to be significant at the chi-
square level of p <0.05, and any score of 6.6% or over is significant at the level of
p <0.01. The larger the log likelihood ratio, the greater the significance.
The individual modals show a decline varying between can (which actually
increases its frequency in FLOB, and declines only 1.5% in Frown) and shall
(which declines over 40% in both FLOB and Frown). In Table 4, the modals are
Recent grammatical change in English 67

listed in order of frequency in LOB, and exactly the same order of frequency,
with the exception of should and must, applies to Brown. It will also be seen that
a roughly similar pattern of falling frequency is observed in both BrE and AmE
corpora. Broadly, the most frequent modals decline least, and the least frequent
modals decline most in percentage terms, the rare modals shall, ought (to) and
need (+ bare infinitive) having become much rarer. Some middle-order modals
(especially must and may) also show very significant falls in frequency.
The most interesting observation from Table 4, however, is that the overall
frequency of modals is highest in LOB and lowest in Frown, with FLOB and
Brown in intermediate positions. Alongside the decline between 1961 and 1991-
2, there is an equally important difference between AmE and BrE, which invites
interpretation as a time lag. It is as if BrE is following rather reluctantly in the
wake of a change in AmE, with something like a generation gap. This is shown
graphically (though not strictly to scale) in Figure 1.

more frequent ---------------------------------------------------------------> less frequent

13,962 12,287
Brown (1961) Frown (1992)

LOB (1961) FLOB (1991)

14,667 13,272

Figure 1: British English following an apparent American English trend

It might be proposed that the apparent decline in modal usage is due to the
rise, in recent centuries, of the so-called semi-modals, such as be going to and
have to, which are presumed to be still increasingly used. Perhaps these are
gradually encroaching on the territory of the canonical modals. Such a hypothesis
can be tested, up to a point, by noting the differences of frequency of semi-
modals in the four corpora, as shown in Table 5. Although the class of semi-
modals is not a well-defined set, those in Table 5 may be taken as fairly
Ostensibly, there is no strong connection between the patterns shown by the
modals and the semi-modals.7 Altogether, the semi-modals are very much less
frequent (in written English) than the modals, and their changes in frequency
show a mixed picture. Some of them seem to have increased their usage
massively in the period 1961-1991/2, but others have declined. One of the
differences at first glance lending credence to the encroachment hypothesis is that
AmE shows a greater increase in the semi-modals (+18.6%) in comparison with
BrE (+10.0%) a mirror image of what is happening with the modals.
Unexpectedly, however, the overall frequency of semi-modals is found to be
greater in the BrE than in the AmE corpora in both periods!
68 Geoffrey Leech

Table 5. Frequencies of some semi-modals in the four written corpora

BrE LOB FLOB Log Diff AmE Brown Frown Log Diff
likhd (%) likhd (%)
BE g oing 248 245 0.0 -1.2 BE going 219 332 23.5 +51.6
to* to*
BE to 454 376 7.6 -17.2 BE to 349 209 35.3 -40.1
(had) 50 37 2.0 -26.0 (had) 41 34 0.7 -17.1
better better
(HAVE) 41 27 2.9 -34.1 (HAVE) 45 52 0.5 +15.6
got to* got to*
HAVE to 757 825 2.7 +9.0 HAVE to 627 643 0.1 +1.1
NEED to 54 198 83.0 +249.1 NEED to 69 154 33.3 +123.2
BE sup- 22 47 9.2 +113.6 BE sup- 48 51 0.1 +6.3
posed to posed to
used to 86 97 0.6 +12.8 used to 51 74 4.3 +45.1
WANT to* 357 423 5.4 +18.5 WANT to* 323 552 60.9 +5.2
TOTAL 2069 2275 9.2 +10.0 TOTAL 1772 2101 28.4 +18.6
*Forms spelt gonna, gotta and wanna are counted under be going to, have got to, and want to

Table 6. Comparison of SEU-mini-sp and ICE-GB-mini-sp: modals in spoken

BrE (provisional figures)
SEU-mini-sp ICE-GB-mini-sp Log likhd Difference
would 415 (5188) 271 (3388) 30.5 -34.7
will 248 (3100) 307 (3838) 6.3 +23.8
can 252 (3150) 295 (3688) 3.4 +17.1
could 145 (1813) 83 (1038) 17.1 -42.8
may 86 (1075) 36 (450) 17.5 -54.1
should 100 (1250) 84 (1050) 1.6 -17.3
must 87 (1088) 35 (438) 24.3 -60.7
might 56 (700) 50 (625) 0.3 -10.7
shall 26 (325) 17 (213) 1.9 -34.6
ought 20 (250) 9 (113) 4.3 -55.0
need 0 (0) 0 (0) 0.0 0.0
Total 1435 (17938) 1187 (14838) 23.5 -17.3

Note: The figures in parenthesis show frequency per million words, and are therefore comparable to
the figures for the written corpora given in Table 4.
Recent grammatical change in English 69

At this point, it is an attractive idea to look at the patterns of change

observable in the spoken mini-corpora, small as they are. Surely the innovatively
increasing use of semi-modals, and perhaps the corresponding fall in modals, are
likely to show up far more in the spoken language than in the written. The
differences in frequency (in spoken BrE only) between SEU-mini and ICE-GB-
mini are shown in Table 6.
In general, the patterns of frequency shown in Table 6 suggest that trends in
spoken English are similar to those in written English, but somewhat more
exaggerated. The modals are more frequent in the written than in the spoken
corpora, for both periods, but the decline in frequency is also greater a loss of
17.3%. May and must are particular heavy losers, whereas will and can, in
contrast, show a surprising increase from the 1961 to the 1991 corpus. This
picture may be contrasted with the apparent considerable increase in semi-modals
in spoken English between the early sixties and the early nineties, as observed in
the spoken corpora, and as shown in Table 7.

Table 7. Comparison of SEU-mini and ICE-GB-mini: some semi-modals in

spoken BrE
SEU-mini-sp ICE-GB-mini-sp Log Difference (%)
(BE) going to 88 120 4.9 +36.4
BE to 5 10 1.7 +100.0
(HAVE) got to 35 26 1.3 -25.7
HAVE to 79 104 3.4 +31.6
NEED to 2 15 11.3 +650.0
BE supposed to 8 12 0.8 +50.0
Total 217 287 9.8 +32.3

These numbers, of course, are ridiculously small only three can be

counted as significant in log likelihood terms. However, overall they suggest, as
many would suspect, that the general increase of semi-modals is even greater in
spoken than in written English.

2.3 Descriptive conclusions and further discussion on modals and semi-


The following overall findings can be presented by way of summary of the

preceding section on modals and semi-modals. On the basis of the evidence from
the corpora:

(i) In general terms, there is clearly an appreciable decline of frequency in the

use of modal auxiliaries between 1961 and 1991-2.
70 Geoffrey Leech

(ii) During this period, individual modals have been declining at different rates,
but there is a tendency for very common modals to hold their own (e.g. will,
can), and for infrequent modals (e.g. shall, ought to, need) to decline
sharply and to appear almost moribund. Some middle-ranking modals (e.g.
may and must) have also declined sharply.
(iii) Alongside the decline of modals, there is no clear overall picture regarding
semi-modals: although in general, semi-modal usage is increasing, some
semi-modals are declining, and semi-modals as a whole are much less
frequent than true modals.

If we ignore the italicised phrase above (On the basis of the evidence from the
corpora) these statements are descriptive: they claim to tell us something that is
true about the language, English. But rather than accept them uncritically, we
have to bear in mind some hazardous assumptions which can be made in moving
from data description to language description:

Hazardous Assumptions: from Data Description to Language Description

1. That the corpora are large enough and varied/balanced enough to allow us to
extrapolate from corpus findings to what is happening in (relevant varieties of) the
language in general.
2. That the corpora are sufficiently comparable in terms of samples of the varieties
represented, and in using the same sampling methods.
3. That statistically significant results can be attributed to real linguistic differences,
rather than to extraneous factors such as cultural shifts or faulty sampling.
4. That the grammatical categories are defined and used in a way that other
grammarians or linguists find reasonable.
5. That the extraction of data from the corpora has been acceptably (if not totally) free
from error.

The first of these assumptions the well-known issue of representativeness is

perhaps the biggest hazard. In the lack of any practical, general measure of
representativeness,9 the statements (i)-(iii) must be regarded as hypotheses, well
evidenced, it is true, but needing to be supported by further corpus studies as and
when opportunities arise. The second assumption underlies the whole enterprise
of comparing the Brown family of corpora. The third raises the thorny question of
how to relate statistical significance to certain causative factors. For example, we
might attempt to explain changes in the direction of spoken style as part of a
general socially-driven trend of colloquialization (see sections 2.4 and 3) when it
is possible that these changes can be more directly explained by an increase in the
amount of quoted speech included in the 1991-2 corpora (see below). The fourth
hazardous assumption reminds us that linguistic categories even consensual
ones like modal auxiliary verb, are not Gods truth but capable of being
challenged. The fifth has already been discussed in Note 4.
Recent grammatical change in English 71

In my view, none of these hazards justifies a response of extreme scepticism

which says if one cannot prove the truth of these descriptions, one should not
make them at all. Rather, they lead to the recognition that such results should be
regarded as provisional and that there is a need to seek further corroborating
evidence as well as means of increasing accuracy and reliability. This striving
for perfection can be a slow, gradual and time-consuming process, which might
include further manual checking or even collecting and analysing fresh corpora.
It is, though, reassuring to bear in mind that even if an objection is raised to
a hazardous assumption, this often fails to undermine the results in more than a
minor way. For example, the discovery of occasional errors or differences of
categorization in identifying modals is unlikely to cause more than a minor
change in the frequency counts in Table 6, and hence in the statistical significance
of the results. Thus if someone insists that ought to is not a modal but a semi-
modal, this would change the overall findings only marginally. Or to take another
example, on checking the examples of may in Frown, I found two examples of
non-modal may lurking in the database, thus reducing the count of the modal may
from 878 to 876: this makes almost no difference to the significance of the
decline, and in fact increases it.
Returning to the colloquialization trend mentioned above (and to be taken
up again in 3 below), the claim that this phenomenon is an illusion because of an
increase in quoted speech in the later corpora can be checked by actually
undertaking a measurement of quoted material in LOB and FLOB. This has been
done by Nick Smith for LOB and FLOB, and shows that there is an increase of c.
9.5% in the incidence of quoted material in FLOB as compared with LOB.10
However, this could account only in part for most of the changes that might be
attributed to colloquialization (see Table 8 below), so there still remains
something linguistically interesting to be explained here.
To counterbalance the hazardous assumptions, observations such as the
following can have a compensating effect in increasing the plausibility if not
authority of the results, and suggesting that they are not just a matter of chance or

(a) Many results are highly significant as measured by log likelihood ratio.
(b) Trends are consistent across different items e.g. the general frequency
decline of the modals is replicated in almost every single modal auxiliary.
(c) Trends are often consistent across different subcorpora e.g. if we
subdivide each of the Brown family into genre categories Press (A-C),
General Prose (D-H), Learned (J), and Fiction (K-R), often similar trends
are observed in all these four subcorpora. An instance of this is the decline
of the passive from LOB to FLOB (see Table 8). The passive is less
frequent in FLOB as a whole by 12.4%, a trend repeated in a similar way
for each subcorpus: Press 12.5%; Gen Prose 12.4%; Learned 16.6%;
Fiction 3.6%).
72 Geoffrey Leech

I find it useful to use an analogy of scaffolding in the confirmation and extension

of descriptive findings. If we think of the corpus-based methodology as the
constructing of a building by erection of scaffolding, the superstructure of
description of a language can be supported in three ways:

(i) Data observation: from below, struts or buttresses can be used to

strengthen the grounding of data description (e.g. seeking confirmation from new
(ii) Description: at the same descriptive level, findings can be extended and
deepened. For example, we can probe into the crude frequency changes of modals
in Table 4 by analysing subcorpora as already noted in (c) above, or by
undertaking a semantic analysis of examples. This we did for may, must and
should, and noted a trend in may and should towards monosemy viz. the
dominant senses of may (epistemic) and should (deontic) increased their
dominance in spite of loss of frequency (see Leech, 2003). Must, on the other
hand, showed a decline of both its major senses, the epistemic and deontic
meanings. Such further descriptive investigations help to pinpoint what is
happening more precisely, in terms of how and where the modals are becoming
less used.
(iii) Theory: pointing up to the theoretical level, further descriptive
investigations, for example by taking contextual factors into account, can help to
identify appropriate theoretical explanations as to why the modals are declining.
This is where broad explanatory concepts such as colloquialization come into
play, and help to direct investigation into particular channels.

2.4 Continuing the case study: grammatical changes relating to


Taking further the descriptive study of the LOB and FLOB corpora, we now turn
to a wider-ranging set of grammatical categories, mostly belonging either to the
verb phrase or to the noun phrase. What brings all these categories together is that
they can all be associated with a trend towards colloquialization, that is a
tendency for the written language gradually to acquire norms and characteristics
associated with the spoken conversational language. Quantitatively,
colloquialization can be shown in two ways: (a) by an increasing frequency of
phenomena associated with spoken language, and (b) by a decreasing frequency
of phenomena associated with the written language. Type (a) changes
predominate in Table 8 below, but Type (b) changes are also seen, in the
decreasing frequency of the passive, of the of-construction, and of the relative
pied-piping construction.
Recent grammatical change in English 73

Table 8. Changes apparently indicative of colloquialization (tokens per million

LOB FLOB Log lkhd Difference (%)
Categories within the verb phrase
a. Present progressive (active) 980 1263 36.0 +28.9
b. Progressive passive 198 260 8.4 +31.3
c. Verb contractions (e.g. its) 3126 3867 79.1 +23.7
d. Negative contractions (-nt) 1940 2462 62.6 +26.9
e. Passive forms (all) 13260 11614 109.8 -12.4

Miscellaneous colloquialization
features outside the verb phrase
f. Questions (all) 2572 2816 11.1 +9.5
g. Verbless questions 310 424 17.7 +36.6
h. Tag questions 63 65 0.1 +4.5
j. Genitives 4935 6122 128.5 +24.1
k. Of-phrases 33715 32139 37.9 -4.7
l. Of-phrases competing with the 124 95 3.9 -23.6
genitive (2% sample only)

Relative clauses
m. Wh-relative pronouns 6971 6376 26.7 -8.5
n. Zero relative with stranding 18 73 36.4 +310.0
p. Pied-piping relatives 1394 1158 21.9 -16.9

Of the categories within the verb phrase, the first four (a.-d.) all show very
convincing increases between LOB and FLOB. Previous corpus studies (e.g.
Biber et al. 1999: 461-463) have shown the progressive to be more common in
conversation than in written genres, and this is a justification for treating
colloquialization as a possible explanation for a. and b. (However, the growing
use of the progressive aspect can also be linked with grammaticalization, going
back over 500 years.) The passive (e.), on the other hand, is strongly associated
with the written medium (see for example Biber et al. 1999: 476-477), and so its
decline in frequency can count as a negative manifestation of colloquialization.
The next set of categories in Table 8 (f.-h.) is more mixed. In fact f. and h.
(questions) should arguably be excluded from the list of colloquialization
phenomena, as the increase of quoted speech in FLOB compared with LOB (see
Note 9) provides a readier explanation for the increasing occurrence of questions
(+9.5%) and tag questions (+4.5%).
We have begun to investigate two further colloquialization themes in the
noun phrase (see j.-p. in Table 8): the s-genitive vs. the of-phrase; and zero or
that-relative clauses vs. wh- relative clauses. Results so far point in the direction
of (a) a rise in the genitive with a corresponding decline in of-phrases; and (b) a
rise in zero relative clauses ending with a stranded preposition and a
corresponding decline in wh- relative clauses. The rise in stranding accords with
an unsurprising and significant fall in the use of pied-piping constructions in
which the wh-relative pronoun is preceded by a preposition (in which, of whom,
74 Geoffrey Leech

Summary of descriptive conclusions relating to colloquialization

(a) The use of the present progressive construction has increased overall by c.
30% between LOB and FLOB. This seems part and parcel of the spread
of the progressive aspect usage over the past 500 years.
(b) In practice, this increase has been chiefly in the present progressive the
past progressive has actually shown a slight decline.
(c) As part of a general colloquialization trend, the use of negative and verb
contractions has increased by approximately a quarter (25%). Part of this,
though, can be attributed to the increase in the proportion of quoted
speech in the written corpora.
(d) Conversely there has been an appreciable decline in the use of the passive
a verbal category strongly associated with formal written language.
(e) The written corpora show an increase in 9.5% in the use of questions.
(f) This actually increases to approximately 36.6% if we confine our
attention to questions which lack a finite verb this fragmentary
interrogative type is particularly strongly associated with conversational
English (see Biber et al. 1999: 211-212). Tag questions, on the other
hand, have not increased much. Perhaps this is because they are
essentially dialogic in a way that other questions are not. (In Biber et al.
ibid, tag questions are shown to be of particularly low frequency in the
written language.)
(g) In the noun phrase, historically, the competition between s genitives and
of-constructions has been interpreted as a competition between more and
less oral styles of expression.11 Genitives have increased by about 25%
from LOB to FLOB, whereas of-phrases have declined by about 5%.
However, if we confine our attention to of-phrases which could be
replaced semantically by genitives, the decline of the of-construction
(based on a 2% sample) goes up to 24%. This intriguing provisional
result, which almost exactly balances the gain in the genitive, needs
further corroboration with a larger sample.
(h) There is a general tendency for wh-relative clauses to decline. This
applies not only to whom but also to who, whose, and which. The decline
is not unexpectedly magnified if we confine our attention to pied-piping
relatives (beginning with a preposition e.g. of which, to whom).
(j) Conversely, there appears to have been an increase in the use of zero
relatives, i.e. relative clauses with a zero relativizer (the book I read)
especially when combined with a stranded final preposition (someone I
spoke to). This is a provisional finding based on a small sample, and again
needs further research.

As a conclusion to the descriptive sections of this chapter, I reiterate two caveats

already mentioned. First, the results presented are provisional (particularly those
based on a small sample, such as (j) above) since the research presented here is
Recent grammatical change in English 75

still work in progress. (In fact I have gone so far as to suggest that it is in the
nature of corpus research to be provisional.) Second, the hazardous assumptions
listed in section 2.3 have to be kept in mind throughout, and opportunities found
to probe them further. I have yielded above to the temptation to talk in terms of
the language change between LOB and FLOB: a kind of dynamic metaphor used
to explain what are actually sets of synchronic observations about a 1961 corpus
and a 1991 corpus. But the claims that these observations represent changes in the
(use of the) language ultimately remain hypotheses, in need of further probing
and confirmation.

3. Back to theory: conclusions

There is a great deal more to be done in terms of short term diachronic

investigation of the Brown family of corpora. Once the gross frequency changes
have been plotted, the next step is to investigate factors internal to the corpora
that might help to explain these changes (e.g. differential results in different
subsections of the corpus). Much more research also needs to be done and some
is being done on the changing frequency of semantic categories such as
epistemic modals and pragmatic uses of the progressive. We are also making
further comparisons between the British corpora and their American counterparts
Brown and Frown. And of course, there is room for much more work on spoken
language the spoken mini-corpora used for this study are likely to reflect more
fascinating indications of language change, but are obviously of inadequate size.
Explaining the changes in a deeper sense means finding historical reasons
investigating both language-internal and language-external (especially socially
motivated) explanations of why these changes of frequency are taking place. In
part the changes noted e.g. in the increase of semi-modal use may be related
to well-known grammaticallization processes:

Grammaticalization the process whereby lexical items and constructions

come in certain linguistic contexts to serve grammatical functions, and,
once grammaticalized, continue to develop new grammatical functions.
(Hopper and Traugott 1993: xv)

This is a linguistically-oriented explanation, invoking a whole theory of language

change, applicable particularly to the growth of the semi-modals and the
progressive aspect. But frequency studies such as the present one are less
concerned with linguistic innovation than with diffusion and attenuation of
aspects of language use, and invite social explanations in terms of such trends as:

Colloquialization a tendency for features of the conversational spoken

language to infiltrate and spread in the written language.
Democratization speakers and writers tendency to avoid unequal and face-
threatening modes of interaction (this may account in part for the decline of
76 Geoffrey Leech

deontic must and the rise of deontic should, have to and need to). For this
kind of explanation in the realm of modality, see Myhill (1995).
Americanization the influence of north American habits of expression and
behaviour on the UK (and other nations). This shows up apparently in the
loss of frequency of the modals, as depicted in Figure 1.12

However, these izations manifest themselves patchily. For example, in contrast

to the Americanization effect noted with the decline of modals, the growth of the
present progressive shows very little difference between AmE (in the Brown and
Frown corpora) and BrE, as demonstrated in Table 9.

Table 9. Comparison of increase of present progressive in LOB-FLOB and

Brown-Frown (active only)
1961 corpora 1991-2 corpora Log likelihood Difference
British 980 (LOB) 1263 (FLOB) 36.0 +28.9%
American 996 (Brown) 1316 (Frown) 43.6 +31.8%

So Americanization can be only tentatively invoked here, although it might be

applied to other changes touched on earlier, such as the decline of the relative
pronoun which. Another example of patchiness is the virtual stasis of the get-
passive in LOB and FLOB (101 instances in LOB; 104 in FLOB): this obviously
colloquial construction does not seem to follow the pattern observed elsewhere.
One explanation for the selectivity of these ization trends is that the trends
can be in conflict with one another. What happens, for example, to a formal
(uncolloquial) construction characteristic of AmE? Does it increase in BrE
because of American influence, or does it decline in BrE because of its negative
association with colloquialization? An apparent example of this kind of conflict is
the mandative subjunctive as in:

the Secretary of Labor requires that he be willing to risk his reputation

(Example from the Brown Corpus)
a construction which (in a study by Serpollet 2001) increases from 14 in LOB
to 33 in FLOB, while in AmE it is far more frequent, though declining: 91 in
Brown and 78 in Frown. What seems to happen here is that in BrE, the
Americanism of the construction outweighs its non-colloquialism. But different
kinds of explanations might be applicable to other cases.
As we move from the level of description to that of explanation, it is
appropriate to ask what kind or kinds of theory would be best able to explain the
descriptive findings of corpus linguistics. Terms like colloquialization do
represent some rather general attempt to explain change, but they do not amount
to well-developed theories. As for grammaticalization, Croft (2000), like Krug
(2000) is one of those who see grammaticalization taking place within a usage-
based, communication-based, utterance-oriented theory of language change. Croft
emphasises the important diachronic collaboration between innovation or
Recent grammatical change in English 77

actuation the creation of novel forms of language and propagation or diffusion

the way the use of these forms expands into more general language use. The
converse mechanisms of change contraction and loss also need to be given
fuller consideration: we need a theory to explain the decline of the modals as well
as the growth of the semi-modals.
In diachronic corpus comparisons we can observe the results of propagation
and contraction. (It is unlikely that we will find true grammatical innovation or
that we would recognize it as such in a corpus even if we came across it.) This
means that we need explanations which take full account of socio-cultural factors
inducing language change. Croft argues (2000: 166) that the basic mechanism for
propagation is the speakers self-identification with a social group, and he cites in
this connection a maxim put forward by Keller (1990/1994), Talk like others
talk. Here, the social-psychological theory of accommodation as a linguistic
process comes into play.
This seems to place propagation of change firmly in the sphere of
sociolinguistics, but it might be pointed out that the Brown family of corpora are
not sociolinguistically sensitive in the normal sense: by definition, they contain
published, i.e. public, language. So where does this leave the explanation of
increase and decrease of frequency in the LOB and FLOB corpora? It is
reasonable to suggest that the spread or shrinkage of linguistic usage in recent
modern society has been influenced considerably by language use in the public
media. So it can be helpful to complement the sociolinguistic perspective by
perspectives oriented towards mass communication.

Table 10. Some principles of usage-based models of language (after Barlow and
Kemmer 2000)
1. The intimate relation between linguistic structures and instances of the
use of language.
2. The importance of frequency.
3. Comprehension and production are integral, rather than peripheral to the
language system.
4. Focus on the role of learning and experience in language acquisition.
5. Importance of usage data in theory construction and description.
6. The intimate relation between usage, synchronic variation, and diachronic
7. The interconnectedness of the linguistic system with non-linguistic
cognitive systems.
8. The crucial role of context in the operation of the linguistic system.

For example, with reference to colloquialization, Fairclough (1992)

discusses the apparent democratization of discourse in present-day English-
speaking society. Conversational discourse, he goes on, has been, and is being,
projected from its primary domain into the public sphere (p.98). Social theories
focusing on public discourse, like Faircloughs, here provide a valuable
78 Geoffrey Leech

supplement to the more established frameworks of historical linguistics and

sociolinguistics. But there would be much benefit in investing in the support such
theories may gain from the empirical findings of corpus research.
To conclude, I return to the opening theme of metatheory. Although I have
not gone far towards suggesting theoretical solutions, I have worked my way
around to suggesting the kind of theoretical approach that is better suited to
corpus linguistics than is the Chomskyan paradigm. Corpus linguistics finds a
good ally in the usage-based frameworks championed by Barlow and Kemmer
(2000: viii-xxii), who, among other principles of this approach, list those in Table
The usage-based conception of linguistics is not a monolithic theory, or a
single school of thought, but is more like a confederation of linguists with similar
goals, priorities and methods. Their tenets are the opposite of the generative
paradigm in nearly every respect. Corpus linguistics finds a natural place in this
body of linguists who believe that there is not a gulf, but on the contrary a natural
bridge, between the study of naturally-occurring data and the cognitive and social
workings of language.


1. In this chapter, we refers to Nicholas Smith and myself. I am grateful to Nick

for much of the corpus processing, quantitative and analytic work that resulted
in the findings reported here, as well as for discussion of broader issues and
specific comments on this chapter. The project on Recent Grammatical
Change in British English was supported by a research grant from the Arts
and Humanities Research Board (UK) and a British Academy Larger
Research Grant. In this research, we have benefited from collaboration with
Christian Mair and Marianne Hundt at Freiburg University, to whom we owe
support and inspiration, as well as the more practical benefit of the post-
editing of most of the automatically-tagged FLOB corpus.
2. The colloquialization tendency for written style to drift towards more oral
styles over time for some genres between and 17th and the 20th centuries is
demonstrated statistically by Biber and Finegan (1989).
3. We are very grateful to Bas Aarts and Gerry Nelson for their help both in
allowing use of these corpora, and extracting the data for the mini-corpora.
4. There has been a growing range of publications on the comparison of the
LOB and FLOB corpora. Particularly relevant to the present study are Hundt
(1997) and Mair (1997).
5. The findings on the modals in this chapter are presented and discussed more
extensively in Leech (forthcoming 2003) and Smith (forthcoming 2003).
Some of the counts in the tables are slightly different from those in these cited
papers, owing to further research and further accuracy checks (see Note 6).
Recent grammatical change in English 79

6. A caveat about frequency: most of the frequency figures in this study are very
close approximations rather than guaranteed 100% accurate. Both manual
procedures and automatic procedures can give rise to error, although the
incidence of error is likely to be totally insignificant. The one exception to this
is the margin of error arising from POS tagging (about 2% in the present
context). Although we were able to use the results of manual correction for
the LOB Corpus and most of the FLOB corpus, for the fictional genres (K-R)
of FLOB and for the Frown Corpus we had to rely on automatic tagging only.
A method of approximation was devised on the basis of comparing automatic
tagging and manual tagging outcomes in cases where they were both
available, and hence calculating an error coefficient for each tag. The
procedure is described in the Appendix to Mair et al. (2003).
7. However, the decline of must may have some connection with the increase in
use of have to and need to see Smith (2003). In general, the varied
behaviour of the semi-modals in this corpus confirm the impression that they
comprise a miscellaneous category. In Quirk et al. (1985:136-148), where it is
argued that they form a gradient between auxiliary and full verbs, four
intermediate categories are distinguished: marginal modals, modal idioms,
semi-auxiliaries, and catenative verbs.
8. However, the decline of must may have some connection with the increase in
use of have to and need to see Smith (forthcoming 2003). In general, the
varied behaviour of the semi-modals in this corpus confirm the impression
that they comprise a miscellaneous category. In Quirk et al. (1985:136-148),
where it is argued that they form a gradient between auxiliary and full verbs,
four intermediate categories are distinguished: marginal modals, modal
idioms, semi-auxiliaries, and catenative verbs.
9. On representativeness, Biber (1993) is the classic reference; but Bibers
position has also been criticised (e.g. by Vradi 2001). There is no test that
could be used to ensure that statements about the LOB and FLOB corpora are
representative of the varieties of English of which they are samples, except to
collect independent samples of data of the same text types in effect, to
replicate the LOB and FLOB corpora but with different text samples.
10. Nick Smith has undertaken a count of quoted material in the LOB and FLOB
corpora, helped by a program written by Izumi Tanaka. He found that the
number of words within quotation marks in FLOB was c.127,000, compared
with c.116,000 words in LOB an increase of c. 9.5%. This figure of +9.5%
is a reasonably close approximation, but needs to be followed up by further
checks and edits.
11. Actually genitives are not so frequent in conversation as in some varieties of
written English, especially news writing (see Biber et al. 1999: 302). This can
be largely explained by the fact that nouns are notably infrequent in the
spoken language: a construction which is rich in nouns (a description that
applies both to the genitive construction and the of-construction) is therefore
comparatively rare. However, if we consider the likelihood of choosing a
80 Geoffrey Leech

genitive as contrasted with a semantically equivalent of-phrase, the odds in

favour of the genitive are higher in spoken English than in a range of written
varieties (see Leech et al. 1997).
12. Colloquialization and Americanization are discussed, with reference to the
LOB and FLOB corpora, by Mair (1997, 1998). See also Hundt (1997).


Biber, D. (1993), Representativeness in corpus design, Literary and Linguistic

Computing 8: 243-257.
Biber, D. and E. Finegan (1989), Drift and the evolution of English style: a
history of three genres, Language 65.3: 487-517.
Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman
grammar of spoken and written English. London: Longman.
Barlow, M. and S. Kemmer (eds) (2000), Usage-based models of language.
Stanford: CLSI.
Christ, O. (1994), A modular and flexible architecture for an integrated corpus
query system, Proceedings of C O M P L E X 94: 3rd Conference on
Computational Lexicography and Text Research (Budapest, July 7-10,
1994). Budapest, 23-32.
Croft, W. (2000), Explaining language change: an evolutionary approach.
London: Longman.
Chomsky, N. (1964), Current issues in linguistic theory, in: J.A. Fodor and
J.J.Katz, The structure of language. Englewood Cliffs, New Jersey, 50-
Dunning, T. (1993), Accurate methods for the statistics of surprise and
coincidence, Computational Linguistics 19.1: 61-74.
Fairclough, N. (1992), Discourse and social change. Cambridge: Polity Press.
Hopper, P. and E. Traugott (1993), Grammaticalization. Cambridge: Cambridge
University Press.
Hundt, M. (1997), Has BrE been catching up with AmE over the past 30 years?,
in: M. Ljung (ed.), Corpus-based studies in English: Papers from the 17th
International Conference on English Language Research on Computerized
Corpora (ICAME 17). Amsterdam, Rodopi, 135-151.
Keller, R. (1990/1994), On language change: the invisible hand in language.
London: Routledge. (Translation and expansion of Sprachwandel: von der
unsichtbaren Hand in der Sprache. Tbingen: Francke.)
Krug, M. (2000), Emerging English modals: A corpus-based study of
grammaticalization. Berlin & New York: Mouton de Gruyter.
Leech, G. (1968), Some assumptions in the metatheory of linguistics,
Linguistics 39: 87-102.
Recent grammatical change in English 81

Leech, G. (2003). Modality on the move: the English modal auxiliaries 1961-
1992, in: R. Facchinetti, M. Krug and F. R. Palmer (eds), Modality in
contemporary English. Berlin & New York: Mouton de Gruyter, 223-240.
Leech, G., B. Francis and X. Xu (1997), The odds in favour of the genitive: a
study of gradience in English, in: K. Yamanaka and T. Ohori, The locus
of meaning: Papers in honor of Yoshihiko Ikegami. Tokyo: Kuroshio, 187-
Mair, C., M. Hundt, G. Leech and N. Smith (2002), Short term diachronic shifts
in part-of-speech frequencies: a comparison of the tagged LOB and FLOB
corpora, International Journal of Corpus Linguistics, 245-264.
Mair, C. (1997), Parallel corpora: a real-time approach to language change in
progress, in: M. Ljung (ed.), Corpus-based studies in English: Papers
from the 17th International Conference on English Language Research on
Computerized Corpora (ICAME 17). Amsterdam: Rodopi, 195-209.
Mair, C. (1998), Corpora and the study of the major varieties of English: issues
and results, in: H. Lindqvist et al. (eds), The major varieties of English.
Vxj: Vxj University Press, 139-157.
Myhill, J. (1995), Change and continuity in the functions of the American
English modals, Linguistics 33: 157-211.
Popper, K. (1972), Objective knowledge (revised edition). Oxford: Oxford
University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rayson, P., A. Wilson, T. McEnery, A. Hardie and S. Khoja (eds) (2001),
Proceedings of the Corpus Linguistics 2001 Conference. Lancaster
University: UCREL Technical Papers 13.
Serpollet, N. (2001), The mandative subjunctive in British English seems to be
alive and kicking Is this due to the influence of American English?, in:
Rayson et al. (2001), 531-542.
Smith, N. (1997), Improving a tagger, in: R. Garside, G. Leech and A. McEnery
(eds), Corpus annotation: Linguistic information from text corpora.
London: Longman, 137-150.
Smith, N. (2003), Changes in the modals and semi-modals of strong obligation
and epistemic necessity in recent British English, in: R. Facchinetti, M.
Krug and F. R. Palmer (eds), Modality in contemporary English. Berlin &
New York: Mouton de Gruyter, 241-266.
Vradi, T. (2001), The linguistic relevance of corpus linguistics, in: Rayson et
al. (2001), 587-593.
Corpus data in a usage-based cognitive grammar

Joybrato Mukherjee

University of Giessen


The present paper is intended to bridge the long-established gap between corpus-
based research into actual language use on the one hand and cognitive models of
the abstract language system (in terms of speakers competence) on the other.
For this purpose, a very useful, non-generative framework is provided by
Langackers usage-based cognitive grammar. In general, the consideration of
corpus data in cognitive grammar leads to an innovative and realistic model of
speakers linguistic knowledge, i.e. a model which is data-oriented and
frequency-based, functionalist and lexicogrammatical in nature. This theoretical
from-corpus-to-cognition approach will be illustrated by discussing corpus data
on the use of the ditransitive verb GIVE and by sketching out how the data may
be included in a truly usage-based model of the lexicogrammar of GIVE.

1. Introduction: cognitive grammar and corpus data

In principle, generative models of language cognition have always been based on

what Langacker (1987, 1999, 2000) has repeatedly called the rule/list fallacy,
that is a clear distinction between a set of syntactic rules on the one hand and a
list of lexical entries on the other. This is particularly true of the recent version of
generative grammar, the Minimalist Program, which is guided by strict economy
conditions (cf. Chomsky 1995). Langacker (2000), on the other hand, suggests a
fundamentally different approach to language cognition:

There is a viable alternative: to include in the grammar both the rules

and instantiating expressions. This option allows any valid generali-
zations to be captured (by means of rules), and while the descriptions
it affords may not be maximally economical, they have to be preferred
on grounds of psychological accuracy to the extent that specific
expressions do in fact become established as well-rehearsed units.
Such units are cognitive entities in their own right whose existence is
not reducible to that of the general patterns they instantiate.
(Langacker 2000: 2)

Such well-rehearsed units, comprising routinised patterns of specific instantiat-

ing expressions, cut across the lexicon-syntax boundary. What is more, they are
established due to the recurrent use of specific lexical items in a given
86 Joybrato Mukherjee

construction or, from a complementary perspective, the frequent use of specific

constructions with a given lexical item. In Figure 1, one of the examples given by
Langacker (1999) is shown. It visualises how combinations of specific construct-
ions, e.g. the basic ditransitive pattern [[V][NP][NP]], and specific ditransitive
verbs such as GIVE and SEND are entrenched as cognitive entities in their own
right. The left-hand circle refers to the constructional network of the construct-
ional schema [[V][NP][NP]], while the right-hand circle depicts the lexical net-
work of the verb SEND. At the intersection of the two circles, the resulting
pattern can be found, i.e. [[send][NP][NP]].

Figure 1. Lexical and constructional networks in cognitive grammar (Langacker

1999: 123)

In Figure 1, the conceptual similarities between Langackers cognitive grammar

and corpus-linguistic approaches are obvious, even though the objects of inquiry,
namely language cognition and language use respectively, are no doubt different.
Specifically, the concept of lexical and constructional networks (representing
lexicogrammatical entities) could be easily mapped onto the notion of lexico-
grammatical pattern as it is described by Hunston and Francis (2000):

The patterns of a word can be defined as all the words and structures
which are regularly associated with the word and which contribute to
its meaning. [...] as a word can have several different patterns, so a
Corpus data in a usage-based cognitive grammar 87

pattern can be seen to be associated with a variety of different words.

This is the opposite side of the coin.
(Hunston and Francis 2000: 37, 43)

In effect, such lexicogrammatical patterns are at the basis of cognitive grammar.1

Another cross-correspondence between cognitive grammar and corpus-
based pattern grammar is related to the fact that Langacker (1987) considers his
model to be usage-based, which is defined as follows:

Substantial importance is given to the actual use of the linguistic

system and a speakers knowledge of this use; the grammar is held
responsible for a speakers knowledge of the full range of linguistic
conventions, regardless of whether these conventions can be sub-
sumed under more general statements. [It is a] nonreductive approach
to linguistic structure that employs fully articulated schematic net-
works and emphasizes the importance of low-level schemas.
(Langacker 1987: 494)

Special emphasis is placed here on the actual use of the linguistic system. In
general, this clearly mirrors the Hallidayan assumption that system and use are
inseparable because language use instantiates the system (cf. Halliday 1991: 31).
More specifically, a model of language cognition should be able to account for
actual usage, so that the model has to be based on actual use in the first place. It is
exactly here that corpus data may play a major role in refining cognitive grammar
and increasing its usage-basedness: corpora are samples of actual use of the
linguistic system; the schematic networks, low-level schemas and linguistic
conventions correspond largely to the lexicogrammatical patterns and routines
that can be identified by drawing on corpus data.

Table 1. Corpus-based insights into actual language use and their implications
for a usage-based cognitive grammar
some typical features of language use implications for a usage-based
as attested in corpora cognitive grammar
linguistic forms differ with regard to knowledge about these frequencies
frequency and distribution and distributions should be part of
the model
language use is to a large extent based the model should account not only
on recurrent patterns of different kinds for linguistic creativity but also for
linguistic routine
quantitative findings can often be ex- these principles/factors are part of
plained by considering functional and speakers linguistic knowledge and
context-dependent principles/factors should be included in the model
lexical and grammatical choices are lexicogrammatical patterns should
interdependent be at the basis of the model
88 Joybrato Mukherjee

Table 1 summarises four typical and general features of actual language

use as attested in corpora. In the right-hand column of Table 1, the implications of
these corpus-based findings for a truly usage-based cognitive grammar are
indicated. While, in a sense, lexicogrammatical patterns have always been at the
basis of cognitive grammar (cf. Figure 1), it seems to me that the first three
aspects in Table 1 have so far been neglected by proponents of a usage-based
cognitive grammar. In particular, existing models based on cognitive grammar
include neither actual frequencies of linguistic forms nor the principles and
factors that may lead language users to choose from a variety of options a specific
form in a given context. This kind of information, however, can be easily
obtained from corpus data. I would contend that the incorporation of this corpus-
based information in cognitive grammar would certainly increase the usage-based
quality of cognitive models. This theoretical approach will be exemplified in the
following section by delving more closely into the patterns of the ditransitive
verb GIVE in the British component of the International Corpus of English (ICE-
GB, cf. Nelson et al. 2002) and by deriving from the data a genuinely usage-
based cognitive model of the lexicogrammar of GIVE.

2. The relevance of corpus data to a usage-based cognitive grammar: the

case of GIVE

Table 2 provides an overview of the frequency of all GIVE-patterns in ICE-GB.2

In the following, I will be concerned with the eight most frequent patterns only;
they are given in boldface in Table 2. These eight patterns alone account for more
than 91% of all occurrences of GIVE in ICE-GB. In a sense, then, it is these eight
patterns in particular that should be taken into consideration in a model of routin-
ised patterns in language use, because all the other patterns are only sporadically
used. Picking up on Aartss (1991) distinction between performance and
language use, this section is thus intended to abstract away from the entirety of
performance data a model of language use that accounts for frequent lexico-
grammatical routines in using GIVE.
Generally speaking, type I represents the basic ditransitive pattern with
both objects realised as noun phrases. I have little to say about this pattern since it
can be regarded as the default case both quantitatively and structurally. Thus, the
focus here should be on the reasons why language users opt for other patterns
than this default pattern in specific contexts, i.e. on significant principles of
pattern selection (cf. Mukherjee 2001).
For type I b, one specific factor can be easily identified. This type tends to
be used whenever the direct object has already been activated in the preceding
text because it is part of a previous pattern. As shown in (1), this explanation
accounts for some 83% of all cases of type I b. The examples in (2) to (4) nicely
illustrate the fact that, generally speaking, a preceding pattern in the text (e.g.
the... the..., grateful for sth., thank sb. for sth.) predetermines to a large extent the
following GIVE-pattern by providing the initial slot (and element) for the next
Corpus data in a usage-based cognitive grammar 89

pattern.4 In the examples, the preceding pattern is given in italics, and the over-
lapping GIVE-pattern is underlined.

Table 2. Frequency of GIVE-patterns in ICE-GB3

Type Pattern Sum Freq.

I (S) GIVE [Oi:NP] [Od: NP] 404 38.0%

Ia (S) GIVE [Od: NP] [Oi:NP] 1 0.1%
Ib [Od: NP (antecedent)] (rel. pron.) [S] GIVE [Oi:NP] 23 2.2%
Ic [Oi:NP (antecedent)] (rel. pron.) [S] GIVE [Od: NP] 2 0.2%
Id [Od: NP (fronted)] [S] GIVE [Oi:NP] 1 0.1%
Miscellaneous 10 0.9%
IP [S < Oi active] BE given [Od:NP] (by-agent) 84 7.9%
IP b IP with [Od:NP (antecedent)]+ rel. clause/past participle 12 1.1%
II (S) GIVE [Od:NP] [Oi:PP (to...)] 123 11.6%
II a (S) GIVE [Od:NP] [Oi:PP (for...)] 4 0.4%
II b [Od:NP (antecedent)] (rel. pron.) [S] GIVE [Oi:PP (to...)] 7 0.7%
II c (S) GIVE [Oi:PP (to...)] [Od:NP] 2 0.2%
Miscellaneous 6 0.6%
IIP [S < Od active] BE given [Oi:PP (to...)] (by-agent) 23 2.2%
IIP b IIP with [S<Od (antecedent)]+ rel. clause/past participle 17 1.6%
Miscellaneous 2 0.2%
III (S) GIVE [Od:NP] Oi 247 23.2%
III b [Od:NP (antecedent)] (rel. pron.) [S] GIVE 16 1.5%
Miscellaneous 3 0.3%
IIIP [S < Od active] BE given Oi (by-agent) 38 3.6%
IIIP b IIIP with [S<Od (antecedent)]+ rel. clause/past participle 28 2.6%
IV (S) GIVE Oi Od 10 0.9%
Miscellaneous 1 0.1%
Total 1064 100%

(1) I b [Od: NP (antecedent)] (rel. pron.) [S] GIVE [Oi:NP]

part of a previous pattern

(19 of 23 cases = 82.6%)

(2) But it then means that the more things they put on the menu the tinier the
amount they give you <ICE-GB:S1A-018 #24:1:B>
(3) I would anticipate doing one or two units per year and would be grateful
for any financial assistance that the college could give me
<ICE-GB:W1B-022 #152:13>
90 Joybrato Mukherjee

(4) I must thank you, Simon and your parents officially for the slow cooker
and table cloth you gave us for our wedding <ICE-GB:W1B-004 #12:1>

For the passive type IP, there are many factors that seem to play a role in
the process of pattern selection. The cluster of relevant factors is summarised in
(5). It is not at all surprising that in more than 96% of all instances the by-agent is
left out. An important reason for choosing type IP thus lies in the optionality of
the agent. Additionally, two further factors seem to be responsible for the fact that
the recipient (corresponding to the indirect object in the default active type-I
pattern) is placed in the initial slot, thus serving as the grammatical subject. First,
this pattern tends to be chosen whenever the direct object is significantly heavier
than the initial element and is therefore placed in final position according to the
principle of end-weight (cf. Quirk et al. 1985: 1362). The correlation between
weight and pattern selection is illustrated in examples (6) and (7). This factor
alone accounts for 50% of all 84 cases. Second, in some 10% of all cases it is the
recipient that has already been activated before and is thus taken up as the first
element in the type-IP pattern. This is in line with the principle of end-focus (cf.
Quirk et al. 1985: 1357) according to which there is a general tendency to place
given information before new information. In examples (7) to (9), the previously
activated element which is part of (or provides the initial element for) the GIVE-
pattern at hand is italicised.

(5) IP [S < Oi active] BE given [Od:NP] (by-agent)

activated before/ heavy left out

taken up (42 of 84 (81 of 84
(8 of 84 cases cases cases
= 9.5%) = 50.0%) = 96.4%)

(6) [...] Margaret Thatcher cannot be given all the credit for our record levels
of radioactivity both at sea and on land <ICE-GB:W2B-014 #11>
(7) and rather nastily she had been tied to a chair until she was fourteen by her
blind mother and never actually given any form of uhm sound or language
communication <ICE-GB:S1B-003 #102>
(8) After all Saddam Hussein uh led his people they although they were not
given much choice in the matter in an eight year war [...]
<ICE-GB:S1B-035 #66>
(9) The Italian peoples were bound to fight in Romes wars at their own
charge [...] Some peoples were actually given Roman citizienship [...]
<ICE-GB:W2A-001 #006/8>

In type II again, it is a cluster of factors that can be shown to play a role to

different extents in the process of pattern selection. As shown in Table 2, type II
differs from the basic type I in that the indirect object is realised as a pre-
positional phrase (introduced by to) and placed after the direct object. Heaviness
Corpus data in a usage-based cognitive grammar 91

of the final element is again a relevant factor since it is involved in 39 of 123

cases (= 31.7%). But there is another factor that seems to be even more important
for language users choice of this pattern in given contexts, namely the lexical
item in direct-object position. The lexical items that are frequently used as direct
objects in type II can be grouped into three major types. In nearly 25% of all
cases, it is the pronoun it. The second group contains words which, broadly
speaking, are habitually associated with the preposition to according to the pattern
information in the corpus-based Macmillan English Dictionary (cf. Rundell
2002). This group thus includes nouns such as access, answer and reaction which
have a pattern themselves that could be described in COBUILD manner (cf.
Sinclair 1995) as N to n. This group also includes nouns (such as name) which
are part of larger verb-dependent patterns containing the sequence N to (e.g.
give ones name to sth. and put a name to). Whether it is due to small-scale
patterns of the noun itself or due to large-scale patterns of a verb including the
noun-to sequence, the overall effect is the same: the noun at hand and the pre-
position to tend to co-occur fairly frequently in actual usage. The third group
includes words that are so closely associated with this pattern that the resulting
word-pattern combinations may be regarded as lexically stabilised idioms, e.g.
give birth to sb./sth. and give rise to sb./sth.: here the type-I pattern no longer
provides a genuine alternative. These three groups of lexical items in direct-object
position account for some 75% of all occurrences of this pattern. The two factors
that are responsible for the preference of the type-II pattern over others namely
weight of the indirect object and lexis of the direct object are summarised in
(10). The examples given in (11) to (13) are intended to illustrate the second
factor in particular. In all three examples, the lexical items in direct-object
position that seem to trigger off the selection of the type-II pattern are italicised.
Additionally, the relevant small-scale to-pattern of the noun in direct-object
position is in boxes in examples (12) and (13).

(10) II (S) GIVE [Od:NP] [Oi:PP (to...)]

(39 of 123 cases
= 31.7%)

frequent lexical items in Od-position (91 of 123 cases = 73.9%):

1. it (30 of 123 cases = 24.4%)
2. words that are associated with the preposition to in general, e.g. access,
aid, answer, attention, comfort, consideration, credence, (ones) name,
reaction, reply, substance (18 of 123 cases = 14.6%)
3. words bound to type II in lexically stabilised idioms, e.g. give birth /
rise / thought / way to sb./sth. (43 of 123 cases = 34.9%)

(11) so we can have an acid and alcohol and give it to the esterase which is a
useful product <ICE-GB:S2A-034 #39>
92 Joybrato Mukherjee

(12) A clutch of opinion polls gave comfort to both sides in the simmering
civil war yesterday <ICE-GB:W2C-006 #76>
(13) but when you follow that through youve got the means to give rise to a
change in the method of accounting thats adopted in the company
<ICE-GB:S2A-037 #122>

Type IIP is the passive form that can be derived from the type-II pattern.
Note that the systematic correspondence between the two patterns stems from the
fact that in both cases the indirect object is realised as a to-phrase. As shown in
(14), all the kinds of factors that are involved in the choice of the passive pattern
IP are also involved in type IIP: previous activation of the initial element (6 of 23
cases = 26.1%), heaviness of the post-verbal element (8 of 23 cases = 34.8%),
and the frequent omission of the by-agent (22 of 23 cases = 95.6%). In the light of
the 23 cases at hand, we may also assume that two further factors may at times tip
the balance in favour of type IIP: (i) the need to put the indirect object in focus
according to the principle of end-focus; (ii) the use of a lexical item (e.g. thought)
in the passive subject which may be habitually associated with the preposition to.
The cluster of all five factors and their explanatory power in quantitative terms
are summarised in (14). Example (15) illustrates the relevance of the principle of
end-focus (here in order to contrast the two italicised elements at the end of the
two dependent clauses). Example (16) refers to the influence of the lexical item in
subject position on the selection of the type-IIP pattern.

(14) IIP [S < Od active] BE given [Oi:PP (to...)] (by-agent)

activated before/ heavy left out

taken up (8 of 23 (22 of 23
(6 of 23 cases cases cases
= 26.1%) = 34.8%) = 95.6%)

words that are associated with the deliberately placed in

preposition to [> Macmillan English final focus position
Dictionary] (9 of 23 cases = 39.1%) (7 of 23 cases = 30.4%)

(15) At the start of the conflict you said more time should have been given to
sanctions but now youre saying that more time should have been given to
pursue those diplomatic initiatives <ICE-GB:S2B-018 #92-94:2:D>
(16) It is not clear that enough thought has been given to the consequences of
these proposals for the movement of traffic outside the areas immediately
affected <ICE-GB:W1B-027 #41:4>

What all type-III patterns have in common is the fact that the indirect
object is omitted. Note that in many of these cases, the verb GIVE is not parsed
as ditransitive but as monotransitive in ICE-GB. For various reasons, how-
Corpus data in a usage-based cognitive grammar 93

ever, I regard all instances of GIVE as examples of ditransitivity. Without going

into details about this theoretical issue, it is necessary to point out that my
approach to ditransitivity is inherently lexico-semantic (rather than, say, merely
syntactic) in nature. In other words, the underlying assumption is that the verb
GIVE always triggers what Goldberg (1995) calls the ditransitive construction
at a cognitive level. However, as pointed out by Goldberg (1995) herself, not all
argument roles of the process of giving (i.e. the agent, the recipient and the
patient) need to be explicitised at the level of syntactic surface structure. Among
many others, Matthews (1981), Jackson (1990), Newman (1996) and Biber et al.
in the Longman Grammar of Spoken and Written English (1999) show that
specific elements may be left out because, for example, they can be recovered
from the context or can be inferred from world knowledge. In a sense, then,
GIVE should be regarded as a ditransitive verb in all its occurrences from a
cognitive-semantic point of view because it is bound to evoke an event type
which includes three argument roles, even though some implicit argument roles
may not be explicitised.
Type III is the second most frequent pattern of GIVE in ICE-GB. It does
not come as a surprise that corpus data reveal that this pattern is used whenever
the recipient is indeed recoverable from the context or when its specification is
irrelevant in a given context. In fact, this pertains to all 247 cases at hand.
Furthermore, the pattern tends to be chosen whenever specific lexical items are
used in direct-object position. That is to say, the omission of the indirect object
seems to be linked to lexical items which may imply no need for any specification
of the recipient because it is only the mere existence of a recipient that is relevant
but not the particular kind of recipient.5 In (17), those 21 words are listed that are
used at least three times as direct objects in the type-III pattern of GIVE. Note
that these 21 words alone account for roughly 50% of all cases of this pattern.6 As
in the type-II patterns, it thus seems as though specific lexical items may serve as
pointers to the type-III pattern. Some examples are given in (18) to (20).

(17) III (S) GIVE [Od:NP] Oi

contextually recoverable / specification irrelevant

(all 247 cases = 100.0%)

frequent lexical items (3):

account (9), birth (3), command (3), detail (10), effect (3), evidence (20),
example (9), hint (3), impression (10), indication (7), information (5),
instruction (5), it (4), lecture (8), message (3), (sb.s) name (4), notice (3),
reason (3), signal (3), talk (3), way (6) (124 of 247 cases = 50.2%)

(18) So for instance we can give a very nice account of coarticulation [...]
<ICE-GB:S2A-030 #12>
(19) It helps to clarify the poets ambiguous comments beforehand by giving an
actual example of what he means <ICE-GB:W1A-018 #33>
94 Joybrato Mukherjee

(20) And its that sort of thing that gave the impression which Im sure he was
trying to do <ICE-GB:S1B-038 #103>

From type III, the passive form IIIP can be derived. Again, the optionality
of the by-agent is most important for the process of pattern selection because it is
omitted in 31 out of 38 cases (81.6%). Additionally, specific lexical items in the
subject position (i.e. the subjectivised direct objects of the type-III pattern) tend to
be closely associated with this pattern. That is to say, not only is the type-IIIP
pattern used whenever neither the agent nor the recipient needs to be explicitised
but also when particular words refer to the patient of the action. In (21) those
words are listed that occur at least twice in this pattern in ICE-GB, accounting for
some 45% of all instances. Some of them are exemplified in (22) to (24).7

(21) IIIP [S < Od active] BE given Oi (by-agent)

left out
(31 of 38 cases
= 81.6%)
recurrent lexical items (2):
approval (2), limit (2), information (2), detail (7), time (2), directions (2)
(17 of 38 cases = 44.7%)

(22) Hes called Malachi in the opening verse but no biographical information
is given about him <ICE-GB:S2A-036 #78>
(23) uh directions are given from Ushant uh from the Scillies uh from the
South coast of Ireland down to Cape Ortegal or Finisterre
<ICE-GB:S2B-043 #20>
(24) More specific implementation details are given at the end of the report
<ICE-GB:W1A-005 #5:1>

The last pattern to be mentioned is type IIIP b. This type is similar to

pattern I b in that the patient (i.e. the subjectivised direct object) serves as an
antecedent to which a relative clause or a past participle construction refers back.
As shown in (25), there is again a clear tendency for language users to choose this
pattern with a fronted antecedent whenever this antecedent has already been part
of a preceding pattern in the text at hand. Examples (26) to (28) illustrate this
dependency on the previous pattern (given here in italics: know of sth., consider
sth., trace on to ... sth.) the last element of which provides the starting-point for
the subsequent GIVE-pattern. It should be noted in passing that the by-agent is
not as frequently omitted as in all other passive patterns mentioned so far. In fact,
in more than one third of all cases (10 of 28 cases = 35.7%), the agent is stated
explicitly. Thus, the optionality of the by-agent as such turns out to be less
forceful a factor for this particular passive form.
Corpus data in a usage-based cognitive grammar 95

(25) IIIP b IIIP with [S<Od (antecedent)] + relative clause/past participle

part of a previous pattern with or without by-agent (10 vs.

(16 of 28 cases = 57.1%) 18 cases = 35.7% vs. 64.3%)

(26) and he will also know of the increased uh support given uh in the uh
announcement last week by my right honourable friend the Social Security
Secretary <ICE-GB:S1B-056 #46:1:B>
(27) [...] it also is of relevance when considering the evidence given by Mr Holt
because there is a clear conflict [...] <ICE-GB:S2A-068 #40:1:A>
(28) But what I have simply done is to trace on to a map the directions that are
given which give you some indication [...] <ICE-GB:S2B-043 #19:1:A>

(S) GIVE [Od: NP (antecedent)] [S < Oi active] BE
[Oi:NP] [Od: NP] (rel. pron.) [S] GIVE [Oi:NP] given [Od:NP] (by-agent)

[Od:NP] part of recipient activat-

(default case) previous pattern agent irrelevant ed before/taken up
=> antecedent => by-agent
[Od:NP] heavy

[Oi:PP (to...)] recipient irrelevant/recover-

II heavy able => Oi III
[Oi:PP (to...)] specific lexical specific lexical [Od:NP] Oi
items in [Od:NP]: it; access, items in [Od:NP]: account,
answer...; birth, rise... detail, evidence ...

transferred entity activ-

ated before/taken up [Oi:PP
agent irrelevant (to...)] recipient
=> by-agent heavy recoverable/ (other
IIP irrelevant patterns)
[S < Od active] BE given => Oi
[Oi:PP (to...)] (by-agent)
agent irrelevant [S<Od] part of previous
specific lexical items in [S<Od] => by-agent pattern => antecedent
detail, limit, time...
[S < Od active] BE IIIP with [S<Od antecedent)]
given Oi (by-agent) + relative clause/past participle

Figure 2. A usage-based cognitive model of the lexicogrammar of GIVE

96 Joybrato Mukherjee

The actual use of the eight most frequent GIVE-patterns and the relevant
principles of pattern selection as described above provide an empirically sound
basis for a truly usage-based cognitive model of the lexicogrammar of GIVE.
Such a usage-based model on the basis of ICE-GB is visualised in Figure 2.
In two regards, the tentative model suggested in Figure 2 is more
elaborated and more usage-based, as it were, than traditional lexical networks in
cognitive grammar (as, for example, shown in Figure 1). Firstly, the thickness of
the lines between GIVE and its patterns depends on the frequency of GIVE in
each pattern. Figure 2 thus puts into operation what has been suggested, among
others, by Lamb (2002: 91), namely that different [d]egrees of entrenchment
[can be] accounted for by variability in the strengths of connections. Secondly,
at all lines connecting GIVE and its patterns there is information on why a
particular pattern is used in a given context. Such principles of pattern selection
can be identified only by looking at large amounts of natural data in context and
have so far not been taken into consideration in cognitive grammar. More
specifically, traditional network models in cognitive grammar have focused on
what is structurally possible. Corpus data, however, provide information on what
is likely to occur and why. As I have argued elsewhere (cf. Mukherjee 2002),
both aspects are part of speakers linguistic knowledge and should therefore be
covered by a truly usage-based cognitive grammar.

3. Conclusions and prospects for future research

The present paper is informed by the belief that corpus linguistics and cognitive
linguistics are not at all mutually exclusive but can fruitfully complement each
other in developing a genuinely usage-based model of language cognition, i.e. of
speakers knowledge of the underlying language system. A genuinely usage-
based model defies the rigid Chomskyan dichotomy between competence and
performance.8 In fact, such a model is intended to bridge the gap between
system and use and to mirror speakers linguistic knowledge along the lines of
Hymess (1972, 1992) concept of communicative competence, in which the
ability to use linguistic forms and structures idiomatically (e.g. in terms of
frequently co-occurring forms) and appropriately (e.g. in terms of pragmatic
principles) is integral to speakers knowledge of the language. This view is
closely related to the Hallidayan idea that language use and language system are
intricately interwoven, which makes it possible and reasonable to derive from a
corpus-based analysis of actual language use a usage-based model of the
cognitive entrenchment of the language system. In effect, this approach
capitalises on Schmids (2000: 39) From-Corpus-to-Cognition Principle:
Frequency in text instantiates entrenchment in the cognitive system. In
particular, I hope to have shown that lexical network models in cognitive
grammar can be refined in two regards by taking into account corpus data: not
only is it possible to introduce frequency-based information on different strengths
of linkage between lexical items and constructions but also to introduce in the
Corpus data in a usage-based cognitive grammar 97

model context-dependent principles of pattern selection (such as lexico-

grammatical co-selections, pragmatic principles and activation statuses of
discourse entities). Thus, corpus-linguistic methodology obviously opens up new
and promising perspectives in cognitive linguistics.
By including quantitative trends and context-dependent principles of
pattern selection in usage-based models of language cognition, future research in
this field should try to quantify the influence that each of the relevant factors
exerts on the process of pattern selection and to empirically describe the
prototypicality of a specific pattern in a given context (cf. Griess 2001 model of
a multifactorial analysis). In order to establish more reliable quantitative trends
(in terms of, say, lexical co-selections of a given pattern), it will certainly be
useful to analyse larger corpora such as the British National Corpus. From a more
theoretical perspective, future research into the refinement of the usage-based
model as sketched out in the present paper will have to address the question as to
whether the principles of pattern selection should be integrated with each
individual lexicogrammatical pattern of a given verb or, alternatively, whether
they should best be regarded as a separate subcomponent of a usage-based model.
As shown in Figure 1, constructional networks provide, in a sense, mirror images
of lexical networks, which begs the question as to whether it is necessary and
reasonable to posit separate constructional networks in a usage-based model.
While proponents of construction grammar (e.g. Goldberg 1995) place special
emphasis on the constructional nature of language cognition, other researchers
(e.g. Nemoto 1998) call into question the plausibility of the concept of abstract
and entirely delexicalised constructions.
Finally, brief mention should be made of the issue of genre distinctions. In
the present paper, the influence that specific genres may exert on the frequency of
individual GIVE-patterns has been left out of consideration. Future research into
corpus-based cognitive models should certainly delve more closely into the
correlations between specific genres and the frequency of linguistic forms. It
remains to be seen, though, whether genre-specific factors should best be
regarded as full-fledged principles of pattern selection at the centre of a usage-
based model or as additional factors on the periphery of such a model.9


1. Note that Langacker (1999: 122) himself states that lexicon and grammar
grade into one another so that any specific line of demarcation would be
arbitrary. This description is of course largely reminiscent of the Hallidayan
approach to lexicogrammar as a unified phenomenon, a single level of
wording, of which lexis is the most delicate resolution (Halliday 1991:
2. It should be noted that the data in Table 2 are based on a manual analysis of
all occurrences of GIVE and not on the parsing information included in ICE-
GB. The reason why the data were analysed manually is the fact that many
98 Joybrato Mukherjee

instances of GIVE are not parsed as ditransitive in ICE-GB but, for example,
as monotransitive (especially in the case of type-III patterns) or as complex-
transitive (especially in the case of type-II patterns). In contrast, I regard all
instances of GIVE as examples of ditransitivity on cognitive and semantic
grounds (cf. Goldberg 1995 and Newman 1996). It is for this reason that
phrasal verbs such as GIVE AWAY, GIVE IN and GIVE UP have not been
taken into account, because their semantics tends to be quite different from
GIVE. Note also that not all instances of GIVE can be grouped into any of the
patterns listed in Table 2. However, such miscellaneous cases are rare and
thus of a marginal nature.
3. The pattern formulas are based on the following notational conventions: [...]
obligatory element; [...(...)] obligatory element with a specific form/function;
(...) optional element; Oi/Od clause element which is not part of the
lexicogrammatical pattern at the level of syntactic surface structure (although
the corresponding argument role is taken to be implicitly evoked by GIVE at a
cognitive level).
4. In fact, this is reminiscent of what Hunston and Francis (2000: 211) refer to as
pattern flow: Pattern flow occurs whenever a word that occurs as part of the
pattern of another word has a pattern of its own.
5. Since, from a lexico-semantic point of view, the existence of a recipient is
already inherent in the event type evoked by the ditransitive verb GIVE, there
is no need to explicitise the recipient as an indirect object at the level of
syntactic surface structure in these cases. For example, in phrases such as give
a lecture and give a talk some kind of recipient is always implied (e.g. an
unspecified audience). Accordingly, Newman (1996: 54), in his cognitive
study of GIVE, describes such implicit argument roles as unfilled elaboration
6. As a matter of fact, many of the lexical items could be complemented by other
items of the same semantic field that also occur in this GIVE-pattern in ICE-
GB, e.g. give a lecture/a talk (+ a paper, a speech, a statement...), give
instructions (+ advice, help, orientation...) and give a message (+ an answer,
an outline, a response, a warning...). The important point here is that the lexis
in direct-object position is semantically restricted.
7. Note that the analysis of type IIIP is based on 38 instances only. One could
easily hypothesise that the list of recurrent lexical items would have been
much more similar to the list given for the type-III pattern if some 250 cases
had been scrutinised. Here, larger corpora are needed.
8. It is for this reason that the term competence is not used in the present paper.
A cognitive model that is based on corpus evidence, as suggested in the
present paper, has not much in common with a generative model of
competence. Thus, it is not very useful to take over and extend or redefine the
term competence, which would automatically lead to terminological
confusion (cf. Taylor 1988). Instead, I prefer to speak of a usage-based model
of speakers linguistic knowledge.
Corpus data in a usage-based cognitive grammar 99

9. Note that many issues that have only been mentioned in passing in this
section, including the implications of the concept of communicative
competence, the issue of constructional networks and the place of genre
distinctions in a usage-based model of speakers linguistic knowledge, will be
discussed in much more detail in a book-length study that is underway (cf.
Mukherjee, forthcoming).


Aarts, J. (1991), Intuition-based and observation-based grammars, in: K. Aijmer

and B. Altenberg (eds.) English corpus linguistics: studies in honour of
Jan Svartvik. London: Longman. 44-62.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Pearson Education.
Chomsky, N. (1995), The minimalist program. Cambridge, MA: MIT Press.
Goldberg, A.E. (1995), Constructions: a construction grammar approach to
argument structure. Chicago, IL: The University of Chicago Press.
Gries, S.T. (2001), A multifactorial analysis of syntactic variation: particle
movement revisited, Journal of quantitative linguistics, 8: 33-50.
Halliday, M.A.K. (1991), Corpus studies and probabilistic grammar, in: K.
Aijmer and B. Altenberg (eds.) English corpus linguistics: studies in
honour of Jan Svartvik. London: Longman. 30-43.
Hunston, S. and G. Francis (2000), Pattern grammar: a corpus-driven approach
to the lexical grammar of English. Amsterdam: Benjamins.
Hymes, D.H. (1972), On communicative competence, in: J.B. Pride and J.
Holmes (eds.) Sociolinguistics: selected readings. Harmondsworth:
Penguin. 269-293.
Hymes, D.H. (1992), The concept of communicative competence revisited, in:
M. Ptz (ed.) Thirty years of linguistic evolution: studies in honour of
Ren Dirven on the occasion of his sixtieth birthday. Amsterdam:
Benjamins. 31-57.
Jackson, H. (1990), Grammar and meaning: a semantic approach to English
grammar. London: Longman.
Lamb, S. (2002), Types of evidence for a realistic approach to language, in: R.
Brend, W. Sullivan and A. Lommel (eds.) LACUS forum XXVIII: what
constitutes evidence in linguistics? Houston, TX: LACUS. 89-101.
Langacker, R.W. (1987), Foundations of cognitive grammar, vol. I: theoretical
prerequisites. Stanford, CA: Stanford University Press.
Langacker, R.W. (1999), Grammar and conceptualization. Berlin: Mouton de
Langacker, R.W. (2000), A dynamic usage-based model, in: M. Barlow and S.
Kemmer (eds.) Usage-based models of language. Stanford, CA: CSLI
Publications. 1-63.
Matthews, P.H. (1981), Syntax. Cambridge: Cambridge University Press.
100 Joybrato Mukherjee

Mukherjee, J. (2001), Principles of pattern selection: a corpus-based case study,

Journal of English linguistics, 29: 295-314.
Mukherjee, J. (2002), The scope of corpus evidence, in: R. Brend, W. Sullivan
and A. Lommel (eds.) LACUS forum XXVIII: what constitutes evidence in
linguistics? Houston, TX: LACUS. 103-114.
Mukherjee, J. (forthcoming), English ditransitive verbs: aspects of theory,
description and a usage-based model. Amsterdam: Rodopi
Nelson, G., S. Wallis and B. Aarts (2002), Exploring natural language: working
with the British component of the International Corpus of English.
Amsterdam: Benjamins.
Nemoto, N. (1998), On the polysemy of ditransitive save: the role of frame
semantics in construction grammar, English linguistics, 15: 219-242.
Newman, J. (1996), Give: a cognitive linguistic study. Berlin: Mouton de
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rundell, M. (ed.) (2002), Macmillan English dictionary: school edition for
advanced learners. Hannover: Schroedel.
Schmid, H.-J. (2000), English abstract nouns as conceptual shells: from corpus
to cognition. Berlin: Mouton de Gruyter.
Sinclair, J. (ed.) (1995), Collins COBUILD English dictionary. London: Harper
Taylor, D.S. (1988), The meaning and use of the term competence in linguistics
and applied linguistics, Applied linguistics, 9: 148-168.
Putting putting verbs to the test of corpora

Caroline David

Universit de Poitiers


Constraints on verb-preposition combinations and of variations in the relation

between object and destination point raise questions such as the following:
How is it that two verbs semantically as close as put and place have
different syntactic constraints in sentences such as:
(a) X puts something into Y
(b) ?? X places something into Y
How is it that spray and load (both labelled as 'putting verbs'), which are
often compared and associated with the groups of Coil-verbs and Pour-verbs, in
fact can be closer to Fill-verbs only because they display a quite similar
behaviour in the nature of the link they maintain between their object and the
Based mainly on a quantitative study of the British National Corpus and
the LOB, FLOB, Brown and Frown corpora, the work presented in this paper is
an attempt to highlight how the prototypical verb put functions, in order to better
understand the syntactico-semantic mechanisms which underlie the other verbs of
this class. My ultimate purpose is to show that, beyond the classifications already
proposed by various linguists, a new typology of putting verbs can be outlined.

1. Introduction
Beneath homogeneous semantic and cognitive features, the class of Verbs of
Putting, as Levin (1993) and Dixon (1991) label it, displays significant variations
in their syntactic organisation. On the basis of corpus evidence, I will first tackle
the problem from a semantic point of view, showing that the synonymy of put,
set, lay and place is only superficial, and that the constraints they impose on their
prepositions depend on the semantic content of each (verb and preposition).
Indeed, the general semantic value of these three-place predicates (which
represent most of their uses) is the result of a semantic balance between the
preposition and the verb itself. Then I will examine the verb load whose syntactic
behaviour seems to make it closer either to FILL or COIL-verbs according to the
structure it is in. Finally, the analysis of fill will bring out (i) the complex and
intricate relations that tie up the three arguments of the verb (the subject, the
object, and the goal location), that is to say the syntactico-semantic mechanisms
which underlie it, and (ii) the intrinsic semantic properties of each argument that
are required by fill.
102 Caroline David

2. Put as a hyperOnym: a semantic approach

The postulate inherited from Saussure's (1916) structuralist approach according to
which there exists a structural organisation of the lexicon is now generally
accepted. The various semantic features shared or not by each lexeme of the same
semantic field inevitably give rise to a classification in strata, with hierarchical
relations between the different lexemes, and co-relations among lexemes of the
same stratum. Thus, in contrast to Dixon (1991: 99)1 who considers put, set, place
as belonging to the same group of verbs and Levin (1993: 111-122, see
Appendix) who groups put, set and place into the same class entitled PUT-verbs
(sub-class n1), I will make a slight semantic distinction between the three of
them and between them and the verb lay.

2.1 Polysemy: phraseological abundance

Indeed, put seems more polysemous than the other three, which is notably shown
by the large number of idiomatic structures, phrasals and proverbs including put
(see Table 1). A dictionary like the Cobuild Dictionary of Idioms, to mention but
one, lists more than 94 idiomatic expressions such as put on airs, put your foot in
it, put someone on a pedestal, put the cat among the pigeons, put all your eggs in
one basket etc., whereas set only counts 21 (e.g. set the wheels in motion, set out
your stall, set tongues wagging), lay 12 (e.g. lay your cards on the table, lay an
egg) and place as a verb, none (only as a noun: put someone in their place, fall
into place, not a hair out of place etc.).

Table 1. Number of idioms in the Cobuild Dictionary of Idioms


Number of idioms 94 21 12 0

2.2 Generalness of meaning

This abundance of phraseological expressions and the fact that put combines with
more prepositions and particles than set, lay and place (see Table 2, taken from
Pauwels 2000: 139) can be an indication of generalness of meaning. In other
words, the less meaningful the verb is, the more it can combine with different
prepositions, and the more important the role of the preposition is in the general
meaning of the prepositional verb phrase, and vice-versa. The semantic value of
such three-place predicates stronly depends on the semantic weight of the
prepositional phrase, on the one hand, and on the semantic content conveyed by
the verb itself, on the other.
Putting putting verbs to the test of corpora 103

Table 2. Verb-preposition combinations2

VERBS PUT (%) SET (%) PLACE (%) LAY (%)
on 560 (26.5) 37 (6.9) 85 (35.1) 60 (19.3)
in 428 (20.4) 47 (8.7) 59 (24.4) 7
down 184 (8.7) 14 45 (14.5)
to 116 (5.4) 32 (5.9) 5 (2) 6
out 77 (3.7) 72 (13.4) 42 (13.5)
up 72 (3.6) 78 (14.5) 5
back 61 4 1 2
(a)round 42 2
away 39 1
forward 38
off 26 20 14 (4.5)
at 24 9 15 (6.2) 8 (2.6)
together 22 1 1
over 22 3 2
under 21 3 6 (2.5)
aside 17 14 2
through 12
behind 11
before 7 2 5 3
against 6 9 1 7
by 6 1
across 5 1 2
ahead 4
past 4
about 4 11 3
above 3 2
between 3 1 3
after 2 1
outside 2
beyond 1
from 1
with 1 1 3
within 1
forth 1 3
beneath 1 2 1
beside 1 2 4
towards 1
along 1 1
next to 1 1
inside 2
below 1
among 1
near 2 1
opposite 1
apart 1
104 Caroline David

2.3 High frequency

In addition, if we look at the quantitative results of a search in the British
National Corpus (100 million words) shown in Table 3, we can see that, with
65,194 examples found, put (including all its forms) is one of the most frequent
verbs in the corpus, perhaps only a little less common than get and take. If we do
the same with the other three verbs, we obtain a total of 33,441 instances for set,
which is almost half the frequency of put, 10,260 instances for place, which is
about one sixth of those of put, and finally, 6,993 instances for lay, which is about
ten times less frequent than put. These proportions are not quite the same as those
in the LOB, FLOB, Brown, Frown corpora (which I will refer to as the Brown
set) which contain 4,38 million words (see Table 4 below), but the high frequency
of put compared to the other three verbs is again emphasised.

Table 3. Number of occurrences in the BNC


Total 65 194 33 441 10 260 6 993

Table 4. Number of occurrences in the Brown set of corpora3


Total 2 212 1 432 683 790

The high frequency characteristic of put and its generalness of meaning, that is to
say its polysemy compared with set, lay and place, both lead me to the same
conclusion: put, set, lay and place are not synonymous, and therefore cannot be
classified under the same label of PUT-verbs (sub-class n1), because they do
not seem to represent the same process of putting things. Therefore, what has
been obvious for lay since Levin (1993) described it as a Verb of Putting in
Spatial Configuration, should be the same for set and place.
Defining lay as the way things are put, the way the object is displaced,
adds more information to the process of putting, which is not included in the
light meaning of put, if I can put it this way. Similarly, it seems that set and
place behave like lay in the sense that they also describe the way things are
moved, if we rely on the following definitions given by dictionaries such as the
Collins Cobuild English Dictionary, the Oxford English Dictionary and the
Longman Dictionary of Contemporary English (the highlighting in boldface is

Collins Cobuild English Dictionary

SET 1. If you set something somewhere, you put it there, especially in a careful
or deliberate way. He took the case out of her hand and set it on the
floor.When he set his glass down he spilled a little drink.
Putting putting verbs to the test of corpora 105

PLACE 1. If you place something somewhere, you put it in a particular position,

especially in a careful, firm or deliberate way. Chairs were hastily
placed in rows for the parents.

LAY 1. If you lay something somewhere, you put it there in a careful, gentle, or
neat way. Mothers routinely lay babies on their backs to sleep.

Oxford English Dictionary

SET 1. To put in a definite place (the manner of the action being implied either
in the verb itself or in the context); to put (more or less permanently) in a
definite place.

PLACE 1. To put or set in a particular place, position, or situation; to station; to

posit; fig. to set in some condition, or relation to other things. Often a mere
synonym of put, set.
2. To put or set (a number of things) in the proper relative place, i.e. in
order or position; to arrange, dispose, adjust.

LAY 1. To deposit; to place in a position of rest on the ground or any other

supporting surface; to deposit in some situation specified by means of an
adverb or phrase.
2. To dispose or arrange in proper relative position over a surface.

Longman Dictionary of Contemporary English

SET 1. To carefully put something down somewhere, especially something that is

difficult to carry.

PLACE 1. To put something somewhere, especially with care.

2. To put someone or something in a particular situation.

LAY 1. To put someone or something down carefully into a flat position.

For each verb, we find the notions of deposit, dispose, arrange or place in a
certain, specified or particular position, not to forget the notion of carefulness
which is quite often underlined. The OED even specifies for set that the manner
of the action is implied either in the verb itself or in the context.
Therefore, I will propose a new distribution of the different classes, where
on the one hand put, alone, is considered the prototypical verb of the general
process of putting with little additional information regarding the way things are
displaced and, on the other hand, set, lay, place etc., which are more specific in
meaning than put, are classified together as a kind of manner of putting.
According to Lyons tests (1977: 292), which allow us to establish the position of
a lexeme in a hierarchical lexical field, setting, placing, laying things are all a
kind of putting things, that is to say put is a hyperOnym and the others hyponyms.
This conception of the semantic organisation of the verbs might be represented as
in Figure 1.4
106 Caroline David



Figure 1. New structural organisation of the Putting Verbs

Let us now move on to some other verbs of the large class of Verbs of
Putting, viz. SPRAY/LOAD-verbs.

3. Load and locative alternation: a syntactic approach

The SPRAY /LOAD-class is considered homogeneous by a large number of
linguists (Anderson 1971, Jackendoff and Culicover 1971, Tremblay 1991,
Rivire 1997), in the sense that they accept locative alternation of two different
syntactic structures. Taking LOAD as an example, and starting from an instance in
the LOB corpus (1), we can compare the simplified version in (1a) with the with-
variant in (1b):

(1) Similarly there had been hay bales. Similarly, now there were for us
school trunks. Three times a year I loaded school trunks on to the car
and took them to the station, and three times a year loaded them on the car
and brought them home from the station. (LOB G19 163-164)

(1a) I loaded school trunks on to the car5

(1b) I loaded the car with school trunks

Table 6. Load and locative alternation

Corpora Brown Frown LOB FLOB Total

No. of PP-phrases 6 6 9 5 24
No. of with-phrases 12 7 4 11 34

To judge from the frequencies in the Brown set of corpora (Table 6), load does
not really favour one structure more than the other.6 This syntactic feature
Putting putting verbs to the test of corpora 107

(locative alternation), which can be viewed as the common feature of the whole
class, links them closely to either COIL-verbs or FILL-verbs.

3.1 The PP-structures

The first structure (1a) resembles the syntactic construction of put and can be
glossed as I put school trunks on to the car by loading them. So if load and put
are syntactically and semantically close to each other we have a subject (I) who
triggers the action of moving an object (trunks) to a destination (the car) what
sets them apart is only the manner of putting the trunks on to the car, which ties
up with what I have just said about the manner of movement for set, lay and
place. Thus, in (1a) the object is transferred and located relative only to the
destination, that is pure displacement. What I should add is that the affectedness
of the object tends to take on a holistic interpretation, in other words the default
interpretation is that all the trunks are loaded, irrespective of whether the car is
full or not. As with put, in this structure, we can analyse the process of loading
in two steps (see Figure 2): first and foremost the relation [I - load - trunks] is
established (a strong link between the subject and the object), and secondly the
trunks are located relative to the car, and they are completely loaded. The order of
construction of the two relations in (1a) emphasises one thing: the importance of
the second argument, that is to say the direct object, which is also found in the
construction of put and of course in one of the possible structures of COIL and
POUR-verbs, as illustrated by pour and spill, respectively:7

First step: I loaded school trunks

I trunks
subject object

Second step: the whole quantity of trunks is moved on to the car

I trunks the car

subject object destination

Figure 2. Process of loading: pure displacement (1a)

(2) Spoon the mixture over the potatoes and then pour the cheese sauce over
the top. (FLOB E19 122)
108 Caroline David

(3) Enthusiastically, Thompson and Arbella Lacey spilled the contents of

four hessian sacks onto the kitchen floor. (FLOB M01 219)

3.2 The with-structures

However, the second structure (1b), I loaded the car with school trunks, is quite
remote from that of put, and at the same time far from that of COIL or POUR-
verbs, since we cannot have *I put the car with school trunks (by loading them),
or *John poured the bowl with water. The argument that is now in the position of
object is the car and it is no longer a goal location. It is an affected object and the
quantity is no longer the focus of interest as it is given a holistic interpretation:
the car is entirely loaded. What was a displacement of the object (trunks) in the
first structure is now a change of state of the object (the car). As a result, the
affected object (the car) could be described as being a car full of trunks, as
opposed to any other condition full of any other kind of objects, such as stools,
or desks, for instance. In a sense, this structure has to be linked to the FILL-verbs.
In this case, the focus is put on the direct object (the car) whose properties are
modified by trunks. Accordingly, the process of loading in this second structure
can be divided as shown in Figure 3.

First step: The car is loaded

I the car
subject object

Second step: The car is qualified as being a car full of trunks

I the car school trunks

subject object affected


Figure 3. Process of loading: affected object (1b)

It is interesting to notice that the notions of affectedness and

qualification of the object are reinforced by the overwhelming number of
passive examples of load with-constructions. Actually, 28 of the 34 examples of
load with (82%) in the Brown set of corpora (Table 6) are passive constructions.
This distribution, which contrasts with the active examples of load usually found
in the literature, underlines once again the usefulness of corpora in the study of
the language.
Putting putting verbs to the test of corpora 109

Before moving on to the FILL -verbs, let us examine a trickier and more
unusual example of load, which shows that the with-construction is decisive for
the notion of qualification but not for quantification:

(4) She knew, for Gideon dispatched formal reports whenever he could, that
the ship had made her way safely to Australia and disgorged her
passengers and cargo safely. She had loaded with Mundy wool from the
warehouses in Melbourne and was on her way back home. (FLOB P03

This can be simplified as:

(4') The ship had loaded with wool

If we regard this example as indicating the result of a process of loading which

implies first that (X) had loaded the ship with wool and then that the ship was
loaded with wool, then the goal location, v i z. the container (the ship), is
interpreted holistically. Even if the ship is in a syntactic subject position, it is
indeed also an affected object, in terms of semantic roles, interpreted as being a
ship full of wool. Once again, we have an operation of qualification where a
particular kind of load (wool) modifies the properties of the container (the ship).
Unlike Quirk et al. (1985: 744) who call this structure an intransitive
construction,10 which does not correspond to a transitive construction, my claim
is that this example looks like a kind of middle voice, where the object has been
moved into a syntactic subject position, leaving thus an empty object position and
an obligatory adverbial (or adjunct) required for that type of structure, since it
attributes the characteristics, or qualifications, to the syntactic subject. Note,
finally, that the perfective aspect (had loaded) which indicates a resulting state
can be useful to convey the idea of affectedness.
It may be useful to sum up the three (rather than two) different structures
with load.11 We have a PP-variant (1a), syntactically close to the COIL and POUR-
verbs, and two with-variants (1b) and (4') which can be linked to FILL-verbs:

(5) Norah will pack us up some sandwiches, and I will fill the flasks with tea.
(FLOB P25 146)

(6) Just then, to his delight, the gray fills with snow, as though someone
standing below the window had broken open seedpods and tossed up
fistfuls of white puffs. (Frown N22 17)

4. FILL-verbs
If we now turn to the structure of F ILL-verbs, we find that we can have Norah
filled the flasks with tea, but not *Norah filled the tea into the flasks. The flask,
both locative, container, and the object affected by tea, has changed properties,
110 Caroline David

since it goes from the state of emptiness to the state of fullness. It changes from a
flask (a tea flask) to a flask of tea. The destination (the underlying element in this
pattern) is first any kind of container (any flask) and then it becomes an object
with a specified content (a flask which has the properties of being a flask full of
tea and not a flask of water, or of wine). Therefore, fill (like the FILL -verbs in
general) is more restrictive on the type of object it can take than load: we can
only take into account the [fill-container] relation and then qualify it. Hence, we
have first [fill-flask] and then [tea: process of filling], which is very similar to the
second case of load (1b). This relation [fill-container] is very strong since we can
have a structure with only two nominal arguments (two participants), which is
statistically quite frequent (28% of the transitive uses of fill are two-participant

(7) He fetched the Nescafe and camp stove from his provisions, then went to
fill the bottle. (Frown N21 54)

In fact, load gives less information about the type of locative it accepts and about
the way things are moved than fill. The object constructed with fill must satisfy
all the properties of something which can be filled in, whereas almost anything
can be loaded. As FILL-verbs format, give shape to, and put more restrictions
on the container than load, the only possible constructions are Norah filled the
flasks with tea, and the gray fills with snow (as in example 6).
In short, beyond the locative alternation which shows a certain syntactic
consistency across the members of the class, the properties of the object in one
type of structure and of the destination in the other reveal how the verb behaves
in a more significant way. The choice of one structure over another, e.g. the
choice of pour and coil over fill, serves to highlight the motion of the content,
rather than the change in fullness of the container. This idea is notably supported
by Gropen et al. (1991: 161) in their discussion of verbs like pour and fill:

If a verb specifies how something moves in a main event, it must

specify that it moves; hence we predict that for verbs that are choosy
about manners of motion (but not change of state), the moving entity
should be linked to the direct object role. In contrast, if a verb
specifies how something changes state in a main event, it must specify
that it changes state; this predicts that for verbs that are choosy about
the resultant state of changing entity (but not manner of motion), the
changing entity should be linked to the direct object role.

What is called a change of location can be related to what I have described as a

quantification, and what is said to be a change of state is comparable to a
qualification of the resultant state. Consequently, the locative alternation of a
verb such as load reveals three perspectives:
Putting putting verbs to the test of corpora 111

school trunks on to the car = COIL-verbs (Quantification)

LOAD the car with school trunks = FILL-verbs (Qualification)

the ship with Mundy wool = FILL-verbs (Qualification)

5. Conclusion
All the verbs put, set, lay, place, coil, load, and fill need a PP which provides
different kinds of additional specification of the destination from a semantic point
of view; at the same time, a characterisation is given of the nature of the link
between the verb and its different arguments.
If we return to Levin's classification (see Appendix) and run through her
categories, we have first a light semantic verb (put) which imposes few
restrictions on the type of object, on the destination point and thus on the
preposition. This verb can combine with the largest number of prepositions: 39.
Then, verbs such as set, lay and place add more specification and information on
the process of putting but still without heavy constraints on the object and on the
destination as is shown by the various definitions of the dictionaries and the verb-
preposition combinations: 24 different prepositions for place and only 20 for set
and lay. Next, we come to the POUR -verbs and the COIL-verbs which combine
with a restricted set of prepositions reflecting the semantics of the verb and the
focus on the quantity of the object displaced. At the boundary between the COIL-
verbs and the FILL-verbs, we find the SPRAY/LOAD-verbs which, depending on
the type of structure (a PP-construction or a with-construction), are closer to
either of these classes. LOAD, according to our findings in the Brown set of
corpora, does not really favour any of these two constructions: 24 PP-phrases and
34 with-phrases. Moreover, it must be stressed that 82% of the examples of load
with in the Brown set are passive, a distribution that contrasts with the preference
for active examples in reference books. Next, if we return to SPRAY, the figures
are reversed, since we found 13 occurrences with PP-structures and 5 with-
structures, a tendency which once again could not be induced without a corpus
analysis. Finally, the FILL-verbs specify and severely limit the goal location (the
container) by qualifying it with its object. Let us recall that 28% of the transitive
uses of fill in the Brown set do not have any goal location in their structures.
Large electronic corpora of the kind used here provide a fruitful resource
for new approaches to the study of language, such as the use of data for a more
qualitative analysis and for testing hypotheses about syntactic structures. Corpora
describe and reflect a type of language reality which does not always correspond
to the picture presented in traditional descriptions.
112 Caroline David


1. Dixon (1991: 99) even goes further since he gathers put, set, place, fill and
load in the same class labelled rest verbsput subtype, which refers to
causing something to be at rest at a Locus.
2. Note that deverbal uses have been included in this count.
3. Excluding deverbal cases.
4. In this study, I use some of the concepts in Culioli's Theory of Enunciative
Operations (Culioli 1990), in particular quantification (QNT) and
qualification (QLT) (see below). However, it should be noted (J. Chuquet,
private communication) that the concept of notional domain that one might be
tempted to associate with the put/set/lay/place system cannot be relevant here,
as it bears little or no relation to a prototype theory.
5. The use of the definite or indefinite article with the location and the object
transferred has been discussed by Laffut (1997, 1998).
6. Conversely, spray is more used in a PP-structure (13 examples) than in a with-
structure (5 examples) in the Brown set of corpora.
7. For more details on how COIL-verbs function, see Beatty (1979).
8. Quantification refers to the existence of an entity in Culioli's theory.
9. In rough approximation qualification refers to the properties of an entity.
(for a full treatment, see Culioli 1999).
10. They give the following examples: Her books translate well. The sentence
reads clearly. My shirts have dried very quickly. The sheets washed easily. My
teapot pours without spilling.
11. By contrast, it seems that spray has only got two structures (1a) and (1b).
12. 163 examples out of 583.


Anderson, S.R. (1971), On the role of deep structure in semantic interpretation,

Foundations of Language 6: 387-396.
Beatty, J. (1979), An analysis of some verbs of motion in English, in: J. Fisiak
(ed) Studia Anglica Posnaniensia 1, Pozna_: 127-142.
Collins Cobuild English Dictionary (1995), London: Harper Collins.
Cobuild Dictionary of Idioms. (1995), University of Birmingham: HarperCollins.
Culioli, A. (1990), The concept of notional domain, in: Pour une linguistique de
l'nonciation, Tome 1, Gap: Ophrys: 67-81.
Culioli, A. (1999), Pour une linguistique de l'nonciation. Domaine notionnel.
Tome 3, Gap: Ophrys.
Dixon, R.M.W. (1991), A new approach to English grammar. On semantic
principles. Oxford: Clarendon Press.
Putting putting verbs to the test of corpora 113

Gropen, J., S. Pinker, M. Hollander and R. Goldberg (1991), Affectedness and

direct objects: the role of lexical semantics in the acquisition of verb
argument structure, in: B. Levin and S. Pinker (eds) Lexical and
conceptual semantics. Oxford: Blackwell: 153-195.
Jackendoff, R. S. and P. Culicover (1971), A reconsideration of dative
movements, Foundations of Language 6: 397-412.
Laffut, A. (1997), The Spray/Load Alternation: some remarks on a textual and a
constructionist approach, Leuvense Bijdragen 86: 457-87.
Laffut, A. (1998), The locative alternation: a contrastive study of Dutch vs.
English, Languages in Contrast 3: 127-160.
Levin, B. (1993), English verb classes alternations. A preliminary investigation.
Chicago: University of Chicago Press.
Longman Dictionary of Contemporary English (2000), Web Dictionary. Essex:
Pearson Education-Longman.
Lyons, J. (1977), Semantics. Cambridge: Cambridge University Press.
Oxford English Dictionary (1993). CD-ROM, Release version 1.0b: Oxford
University Press.
Pauwels, P. (2000), Put, set, lay and place. A cognitive linguistic approach to
verbal meaning. Lincom studies in theoretical linguistics: Lincom Europa.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rivire, C. (1997), Une ou plusieurs notions? Les prdicats trois places qui
admettent deux constructions, in: C. Rivire et M.L. Groussier (eds) La
Notion. Paris: Ophrys: 175-184.
Saussure, F. de (1916/1965), Cours de linguistique gnrale. 3rd ed. Paris: Payot.
Tremblay, M. (1991), Alternances d'arguments internes en franais et en
anglais, Revue qubcoise de linguistique 20, Montral: Universit du
Qubec: 39-53.

Electronic Corpora:

British National Corpus Online (BNC) (http://info.ox.ac.uk/bnc/).

Appendix: Levin's classification of Verbs of Putting (1993: 111-122)

114 Caroline David

Appendix: Levin's classification of Verbs of Putting (1993: 111-122)

1. Put Verbs
arrange, immerse, install, lodge, mount, place, position, put, set, situate, sling,
stash, stow
2. Verbs of Putting in Spatial Configuration
dangle, hang, lay, lean, perch, rest, sit, stand, suspend
3. Funnel Verbs
bang, channel, dip, dump, funnel, hammer, ladle, pound, push, rake, ram, scoop,
scrape, shake, shovel, siphon, spoon, squeeze, squish, squash, sweep, tuck, wad,
wedge, wipe, wring
4. Verbs of Putting with a Specified Direction
drop, hoist, lift, lower, raise
5. Pour Verbs
dribble, drip, pour, slop, slosh, spew, spill, spurt
6. Coil Verbs
coil, curt, loop, roll, spin, twirl, twist, whirl, wind
7. Spray/Load Verbs
brush, cram, crowd, cultivate, dab, daub, drape, drizzle, dust, hang, heap, inject,
jam, load, mound, pack, pile, plant, plaster, ?prick, pump, rub, scatter, seed,
settle, sew, shower, slather, smear, smudge, spatter, splash, splatter, spray, spread,
sprinkle, spritz, squirt, stack, stick, stock, strew, string, stuff, swab, ?vest, ?wash,
8. Fill Verbs
adorn, anoint, bandage, bathe, bestrew, bind, blanket, block, blot, bombard,
carpet, choke, cloak, clog, clutter, coat, contaminate, cover, dam, dapple, deck,
decorate, deluge, dirty, douse, dot, drench, edge, embellish, emblazon, encircle,
encrust, endow, enrich, entangle, face, festoon, fill, fleck, flood, frame, garland,
garnish, imbue, impregnate, infect, inlay, interlace, interlard, interleave,
intersperse, interweave, inundate, lard, lash, line, litter, mask, mottle, ornament,
pad, pave, plate, plug, pollute, replenish, repopulate, riddle, ring, ripple, robe,
saturate, season, shroud, smother, soak, soil, speckle, splotch, spot, staff, stain,
stipple, stop up, stud, suffuse, surround, swaddle, swathe, taint, tile, trim, veil,
vein, wreathe
9. Butter Verbs
asphalt, bait, blanket, blindfold, board, bread, brick, bridle, bronze, butter,
buttonhole, cap, carpet, caulk, chrome, cloak, cork, crown, diaper, drug, feather,
fence, flour, forest, frame, fuel, gag, garland, glove, graffiti, gravel, grease,
groove, halter, harness, heel, ink, label, leash, leaven, lipstick, mantle, mulch,
muzzle, nickel, oil, ornament, panel, paper, parquet, patch, pepper, perfume,
pitch, plank, plaster, poison, polish, pomade, poster, postmark, powder, putty,
Putting putting verbs to the test of corpora 115

robe, roof, rosin, rouge, rut, saddle, salt, salve, sand, seed, sequin, shawl, shingle,
shoe, shutter, silver, slate, slipcover, sod, sole, spice, stain, starch, stopper, stress,
string, stucco, sugar, sulphur, tag, tar, tarmac, tassel, thatch, ticket, tile, turf, veil,
veneer, wallpaper, water, wax, whitewash, wreathe, yoke, zipcode
10. Pocket Verbs
archive, bag, bank, beach, bed, bench, berth, billet, bin, bottle, box, cage, can,
case, cellar, cloister, coop, corral, crate, dock, drydock, file, fork, garage, ground,
hangar, house, jail, jar, jug, kennel, land, lodge, pasture, pen, pillory, pocket, pot,
sheathe, shelter, shelve, shoulder, skewer, snare, spindle, spit, spool, stable,
string, tin, trap, tree, warehouse
Esphoric reference and pseudo-definiteness

Peter Willemse

University of Leuven


This paper investigates the referential status and structure of pseudo-definite

NPs occurring in postverbal position in the unmarked type of existential
sentences. These NPs are pseudo-definite in the sense that they are formally
definite but in fact realize presenting rather than presuming reference (Martin
1992); they introduce new entities into the discourse. In many cases, the formal
definiteness of the NP can be explained in terms of an esphoric, i.e. a forward
phoric relationship between elements within the same NP (Du Bois 1980, Martin
1992). The aim of this paper is to propose a classification of the different types of
pseudo-definites in terms of the distinct nominal constructions they realize. This
classification aims at being more exhaustive than the ones already available in
the literature (e.g. Lumsden 1988, Ward and Birner 1995). To this end, a corpus
of unmarked, or cardinal, existentials containing a formally definite postverbal
NP has been analyzed. The classification obtained on the basis of the analysis of
this corpus was checked against an additional corpus extraction containing
attributive clauses in which a formally definite NP occurs as Attribute.

1. Introduction

Esphoric reference is a type of cataphoric reference, i.e. a type of reference that

manifests itself as a forward phoric relationship. It is a phoric relationship
because esphora is associated with presuming rather than presenting reference: it
occurs in NPs whose grammar signals that the identity of the discourse
participant they realize is in some way recoverable. It is a forward relationship
since the identity of the referent of an esphoric NP is retrievable through
information located further on in the discourse. More specifically, Martin (1992:
123) defines esphora as forward reference within the same nominal group.
Whereas Halliday and Hasan (1976) do not distinguish esphora within the broad
category of cataphoric reference, Martin (1992) sets esphora apart as a phoric
relation within one and the same NP, from cataphora, which he defines as a
forward phoric relationship between different NPs. Martin further remarks, with
reference to Du Bois (1980: 224-225), that in general, what follows in the NP
justifies the use of a definite determiner; more specifically, it is usually the
Postmodifier element that gives the information necessary to identify the
participant in question. Halliday and Hasan (1976: 72) make a similar remark in
saying that the definite article the often refers cataphorically to a modifying
element within the same nominal group as itself.
118 Peter Willemse

It seems that the category of esphora, as defined by Martin (1992), covers several
phenomena that are still quite different in nature. It is a broad category within
which further subcategories can be distinguished. Most importantly, real
definite esphoric NPs should be distinguished from what Davidse (1999: 231) has
referred to as pseudo-definite NPs. An example of a truly definite esphoric type
of NP is an NP containing a restrictive relative clause functioning as a
postmodifier, for instance the man who spoke first at the meeting. Davidse (2000:
1112) gives a convincing argument for this building on Langackers (1991)
interpretation of the restrictive relative clause (RRC) as part of the type
specification evoked by the nominal. The RRC makes the type specification more
specific and thereby restricts it. Davidse points out that in NPs which contain a
definite determiner, a reference mass, viz. the set of all instances corresponding
to that particular type in the discourse context, is defined by the type
specification. For an NP containing a definite article and a singular count noun to
be unambiguous, it is necessary that only one contextually relevant instance
corresponds to the specified type (viz. the one referred to). When an NP contains
a RRC, the latter may narrow down the reference mass (by making the type
specification more specific) to only one contextually relevant instance, thus
making the NP truly definite and justifying the use of a definite determiner. In the
example quoted above, for instance, the reference mass defined by the type
specification man presumably contains several instances; the added RRC who
spoke first at the meeting narrows down this reference mass to just one instance
and in this way makes the NP definite.
The focus of the present paper is on the other type of esphoric NPs, viz.
the pseudo-definite ones. Consider the following example:

(1) Tomorrow afternoon, there will be the usual Christmas concert.

In this example, the underlined NP is formally definite in that it contains a

definite determiner (viz. the definite article). However, with regard to its
referential status it is not definite; it introduces a new instance of the type
Christmas concert into the discourse rather than referring to an instance known
to the hearer or a contextually unique instance. The definite article seems to
signal something else than the referential status in this example. Similarly, non-
referential NPs in certain contexts where one would normally expect a formally
indefinite NP sometimes take a definite form. Consider the following example of
an attributive clause in which the Attribute is realized by a formally definite NP:1

(2) He is the son of a slave.

If one imagines that this sentence occurs in a context from which it is clear that
the slave in question has more than one son, the definite form of the attribute NP
is unexpected. The definite article seems to be motivated here by some
relationship within the NP in which it occurs.
Esphoric reference and pseudo-definiteness 119

In order to learn more about these pseudo-definite NPs, I will zoom in on

two grammatical contexts which normally exclude definite NPs, yet in which
formally definite NPs do occur occasionally. I will primarily focus on the
unmarked, or cardinal type of existential constructions. In addition, I will look
at some types of pseudo-definite NPs occurring in attributive clauses, as
illustrated by example (2).
The unmarked type of existential is particularly interesting in view of the
topic of the present paper because a so-called definiteness restriction applies to
its postverbal, or Existent (Halliday 1994) NP. Traditionally, existential
constructions were analyzed as basically attributing a certain location to the
central entities they refer to (see Davidse 1999). However, it has been remarked
(Milsark 1976, 1977, Lumsden 1988, Davidse 1999) that two rather different
types of existentials should be distinguished. On the one hand, there is the
unmarked, or cardinal, type of existential. On the other hand, there is the
enumerative existential, which is the marked type of existential construction.
The semantic differences between the two types of construction are reflected in
the systematic distributional differences which they display. Consider the
following examples:

(3) There were two usherettes in the foyer. [cardinal existential]

(4) Even before his triumph at Seoul, Lewis had toyed with the idea of
returning to England. There was the British title to go for, and also the
Commonwealth, the European. [enumerative existential]

In (3), the existential states how much instantiation of the type usherettes there is
in the contextually specified situation. A cardinal existential indicates the
cardinality of the instantiation of the type expressed by the type specification in
the Existent NP (Davidse 1999: 238). Typically, cardinal existentials express
cardinal quantification (hence the name), i.e. they measure the intrinsic
magnitude of the designated mass in terms of a quantitative scale (see Langacker
1991: 84). Importantly, moreover, the existent (postverbal) NP is typically
indefinite: the designated instance(s) is/are being introduced into the discourse
and are not presumed known to the hearer. An enumerative existential, on the
other hand, enumerates in ordinal fashion, with implied reference to a
contextually specified type, instances sharing a superordinate type which
corresponds to that contextual type (Davidse 1999: 240-241). Let us look at
example (4) to make this more clear. In this example, three Existent NPs (the
British title, the Commonwealth, the European) are ordinally enumerated. They
all share a superordinate type, viz. something like major athletics competition.
This general type is further specified and situated contextually as roughly major
competitions for which Carl Lewis considered returning to England. The
instances of this type, designated by the three Existent NPs, are then held up
for consideration one by one. It is also possible that only one instance
corresponding to a contextually defined type is mentioned in the enumerative
existential. An important distributional difference with the cardinal existential is
120 Peter Willemse

that the enumerated instances in the existent NP are typically realized as

presumed known to the hearer, i.e. with definite grounding. Types of NP that
often occur as the existent NP in enumerative existentials are NPs with definite
grounding such as proper names, pronouns, NPs containing a definite article, a
demonstrative or a possessive and definite genitive NPs. However, it is also
possible that the enumerated instances are coded as not presumed known
through an indefinite NP.
From the above description it may be clear why the unmarked, or cardinal,
type of existential has been chosen as the environment to study types of pseudo-
definite NPs. The semantics of this type of construction entail that the Existent
NP is indefinite. Consequently, when a formally definite NP occurs in postverbal
position, its referential status can only be pseudo-definite, and not truly definite.
Enumerative existentials were not included in the analysis, since no definiteness
restriction applies to the postverbal NP in this type of construction.
The other type of construction which has been looked at for this paper is
the attributive clause. In contrast with the postverbal NP in existential sentences,
the NP realizing the attribute in an attributive clause is non-referential. Typically,
the Attribute NP has indefinite reference (see Halliday 1994: 120). Consequently,
attributive clauses are another suitable environment to study formally definite
NPs with pseudo-definite status.

2. The corpus study

The corpus analyzed for this paper is an extraction from COBUILDs The Bank
of English, a corpus of 450 million words containing both spoken and written
material.2 A concordance was extracted on any third-person finite form of to be
(i.e. is, are, was, were) preceded by there and followed by a complex noun phrase
consisting of a definite noun phrase and another noun phrase, connected by the
preposition of. This query was considered to be the most practical way of tracing
many, if not most, of the cases of postverbal NPs in cardinal existentials having
pseudo-definite status and involving some form of esphoric retrieval. Indeed,
pseudo-definite reference requires a definite determiner with its first noun for its
apparent definiteness and an indefinite or zero determiner with its second
noun to realize its true indefinite status. These are necessary but not sufficient
conditions for pseudo-definiteness, as many NP complexes of this form are truly
definite. Naturally, this extraction covers only those pseudo-definite NPs
occurring in unmarked existentials. At least some additional subtypes which were
not attested in the existential corpus will be included in the overview for the sake
of completeness. Some of these additional subtypes were found in the additional
corpus containing 50 attributive clauses (which was consulted only for types that
did not occur in the existential corpus). For the existential corpus, in total 200
examples were extracted. This relatively low number of examples is due to the
marked nature of esphora as such, and to its highly marked occurrence in the
Esphoric reference and pseudo-definiteness 121

intrinsically indefinite postverbal complement-slot of the unmarked existential

and as the Attribute in the attributive clause.
The corpus data were classified according to the construction types which
are realized by the pseudo-definite NPs. Three primary categories were
distinguished; within some of these categories, I also propose more delicate
subcategories. For each category, the main question that will be dealt with is the
motivation of the use of the definite article despite the indefinite referential status
of the NP.

2.1 Type/Subtype-constructions

The first category contains the complex Existent NPs in which the first NP
indicates a subtype of the second NP.

(5) Iona McLeish's vast concrete set, a wire mesh-gated compound within the
ravaged Troy, is a happening in itself. <p> And all around, and above,
there is the sort of action reminiscent of war movies: the clatter of
helicopter rotors, whistling jet airstrikes and, when the city burns, a fire so
realistic you could toast bread on it.
(6) They could still be our friends. But now they never will be. Now there is
the sort of hatred I've described, the sort of cruelty, savagery, barbarity.

In both examples, the grammatical structure of the postverbal NP may provide an

explanation for the use of the definite determiner the in a context which does not
normally allow for definite NPs. In both cases, the noun referring to the type
(sort) functions as the Head of the NP. It is followed by a Postmodifier (of action,
of hatred). This complex Head is in its turn modified by a Postmodifier
(reminiscent of war, Ive described). It seems that the definite determiner, then,
has its normal function in the NP as such: it signals that the identity of the
referent of the head noun it modifies is somehow recoverable. In other words, the
hearer is supposed to be able to identify the specific sort of thing that is being
talked about. More specifically, the information needed to identify the referent of
N1 is present within the same NP in the form of the postmodifier: the definite
determiner is, therefore, esphorically motivated. A possible paraphrase of
example (5) may illustrate this: there is an instance of action of the sort
reminiscent of war. This paraphrase also makes clear why this type of NP
containing the definite article the is acceptable in a cardinal existential: although
the hearer is supposed to identify the type of thing talked about (through the
information given by the postmodifier), the actual instance of this type of thing
referred to is a new instance, which is being introduced into the discourse. The
reference to a new instance is a pragmatic inference; the NP as a whole refers
primarily to an identifiable type. However, the pragmatic inference to instances is
one that must be made due to the specific nature of the existential context.3
Therefore, the reference of the complex NP in this type of examples can be said
122 Peter Willemse

to be pseudo-definite rather than truly definite (or truly indefinite, for that
matter): although on one level of interpretation, the identity of the referent is
recoverable (viz. the type of thing), on another level, an instance is still being
introduced into the discourse as a new entity, which is, consequently, not
presumed known to the hearer. We can briefly refer here to another type of
pseudo-definite NP which involves a similar reference mechanism, viz. NPs
containing specific types of postdeterminers. Postdeterminers are elements which
occupy the slot following the determiner slot in the NP and which fulfill a
function ancillary to the functions of (definite/indefinite) identification and/or
(absolute/relative) quantification realized primarily by the determiner (see
Davidse 2001 for a more comprehensive discussion of postdeterminers). Consider
the following examples:

(7) The Woody Allen-Mia Farrow breakup, and Woodys declaration of love
for one of Mias adopted daughters, seems to have everyones attention.
There are the usual sleazy reasons for that, of coursethe visceral thrill of
seeing the extremely private couples dirt in the street, etc. [San Francisco
Chronicle, 24/8/92; cited in Ward and Birner 1995:732]
(8) There was the usual crop of letters to the Member of Parliament
concerned, and about once every twelve months a really abusive one from
the tortured victim himself.

Ward and Birner (1995: 732) describe this kind of postverbal NPs as having dual
reference, both to a type and a token. They point out that, although the type has
hearer-old status, which justifies the use of the definite article, the token or
instance referred to is hearer-new, which accounts for the acceptability in
existential contexts. It is clear that the underlined NPs in examples (7) and (8)
introduce new entities into the discourse. At the same time, a definite determiner
is used to signal the identifiability, not of the instances, but of the type. In (7), for
instance, although the specific reasons for the public fascination with the break-
up are being introduced into the discourse and thus not presumed known to the
hearer, the speaker does assume that the hearer knows the type of reasons that are
usually the basis for such public attention. The postdeterminer fulfills a secondary
identifying function, in that it helps the hearer to make mental contact with the
right type: it provides a clue for the hearer to identify the correct type.
It may be clear that a very similar explanation holds for type-subtype
constructions, although in the latter type of construction the reference to the type
is given an explicit linguistic realization in the form of the head noun the sort of,
whereas the notion of type is not lexicalized in the case of a postdeterminer with
dual reference.
Esphoric reference and pseudo-definiteness 123

2.2 Possessive constructions

This second category contains different types of pseudo-definite NP which are

structurally and semantically very similar, in that they all express some kind of
possessive relation (in the broad sense) between NP1 and NP2. More specifically,
it concerns two types of constructions which express types of relationships which
can be regarded as prototypical for possessive constructions (Langacker 1991:
169), viz. part/whole and kinship relations.
The first type is formed by NP complexes in which NP1 specifies a part or
component of the referent of NP2. This category can be further subdivided
according to the degree of abstractness of the part-whole relation which holds
between the two NPs.
The part-whole relation can be situated on a concrete, spatio-temporal
level (meronymy), as in the following example, which was also the only attested
example of this type in my existential corpus:

(9) In a room outside the court he talked with the French prosecuting counsel,
who showed him some of the evidence he was going to submit. There was
the shrunken head of a Polish boy whose crime had been that he had fallen
in love with a German girl. The head, mounted on a plaque like some
trophy of the hunt, had been found in a German official's house, used as an

The use of the definite article in an example like this can be explained
straightforwardly in terms of an esphoric bridging relationship. Bridging is
defined by Martin (1992: 124) as a type of indirect reference in which the identity
of a part is recovered through an experiential connection which exists between
that part and another part of the same whole or between that part and the whole it
belongs to, or vice-versa. In the type of construction under discussion here the
bridging relationship is an esphoric one because it holds between the first and the
second NP within the same complex NP. NP2 introduces the entity Polish boy
into the discourse; it realizes presenting reference and consequently uses an
indefinite form (the indefinite article a). The first NP has presuming reference
and consequently uses a definite form (the definite article the); the identity of its
referent is recoverable by virtue of an experiential connection with the entity
introduced by the second NP: a head is part of (the body of) a boy.
Examples like these shed an interesting light on Martins (1992) taxonomy
of retrieval types. Bridging and esphora are more intertwined than it would seem
at first sight. In fact, bridging is often the motivating factor for the use of a
definite form in a pseudo-definite, esphoric NP. In such cases, the esphoric nature
of the NP lies in the fact that the information necessary to identify its referent,
and thus to justify the use of a definite determiner, is to be found further on in the
same NP; the actual information is then retrievable through bridging from the
second NP.
124 Peter Willemse

In the majority of the cases attested in the corpus, the part-whole relation
was of a more abstract nature. A first group contains examples such as the

(10) there was a limp wad of lettuce whose leaves glistened with a fine film of
oil; there was a clean piece of wood jutting out with a shining nail bent at
the end of it; there were several eggshells showing bits of yellow yolk;
there was the stump of a cigar bearing the marks of a man's teeth; and
there was a clump of fluffy dust freshly gathered from some floor
(11) It comes from a pleasant er beach in Cornwall I won't want to say exactly
where it is in case it affects the tourist potential of the beach but er on the
slope where that photograph was taken from there is the remains of an
<ZF1> old tin mine <ZF0> old tin mine <ZGY> and it so happens that
that particular tin mine had quite a lot of uranium in the ore <ZF1> as a
<ZF0> as a sort of by-product.

In these part-whole constructions, a certain spatio-temporal element is still

present. They are different from the type represented by example (9), however, in
that the part-whole relation which they symbolize implies a process. In all of the
examples of this type found in my corpus, the head noun of the first NP is
situated in the lexical field of remains. It is clear that remains are not an intrinsic
part of a whole; rather, they are the result of a process of change, often more
specifically a process of destruction or demolition. It is this process which is
implicitly evoked in this type of examples. In (10), for instance, a stump is what
is left of a cigar after it has been smoked; thus the process of smoking is implied.
In (11), the remains are what is left of the tin mine which has presumably been
abandoned and fallen into disrepair; thus the process of change which the mine
has undergone is activated. In order to explain the use of the definite determiner
in examples of this kind, it can be remarked that the process which is implicitly
activated is one that is strongly associated with, and possibly even collocationally
linked to (e.g. in the case of cigar and smoking), the referent introduced by the
formally indefinite second NP. Consequently, it can be assumed that the process
itself is evoked fairly automatically and easily in the hearers mind. The first NP,
then, refers to the result or end product of this process and can do so with a
definite form since the process which has been evoked in the hearers mind in its
turn implies a specific result. Instead of a relation between a material part and a
material whole, the more abstract part-whole relationship which is present in this
type of examples can be said to be one between a process and one of its phases.
A process is equally implied in the following examples, which are still
further removed from the concrete, spatio-temporal level:

(12) He added: `We still don't know how many are required. I just wish there
were 20 games left. There's the making of a good team here. There is real
ability. In the first half we definitely suffered from tension after their goal.
But in the second half we looked good."
Esphoric reference and pseudo-definiteness 125

(13) In this sample, he uses only a few, such as the s for plural, although there
is the beginning of a negative construction in `no book" and the beginning
of a question form in `where going?" although he does not yet have the
auxiliary verb added to the question.

In (12), a process such as creating a team is implied; in (13), the process which
is implicitly activated is that of the child acquiring (linguistic constructions).
The use of the definite determiner in the first NP in these examples can be
explained in the same way as for examples (10) and (11): the first NP refers to a
phase of the process associated with, and thus implied by, the referent of the
second NP.
In a number of similar examples, finally, the process is explicitly realized
in the form of a gerundive or a nominalization:

(14) There is the birth of healing and that may be a silly thing to say but I think
if I may be allowed to develop the theme I think that there is a p <ZF1> a
<ZF0> a feeling of healing and time passing in nineteen-ninety-six that
isn't again just locked into
(15) And I think if Mary Matalin wants to go out and raise those questions, she
could be doing the president an enormous disservice because there--there's
the beginning of discussion of--of his side of this too.

Another similar type of pseudo-definite NP complex is the one where a kinship or

family relation is the link between the referent of the first and the second NP.
Examples of this were only found in the corpus containing attributive clauses:

(16) Next month the Network is bringing Dr Kenneth Kaunda, the former
president of Zambia, to Scotland (he is the son of a Church of Scotland
minister) as part of the crusade to fight the `cancer of debt" in the world's
poorest countries.
(17) The 7ft 1in centre, who has a home in West Hampstead, is the son of a
Nigerian diplomat but has lived in England since he was two.

In these cases, the definite article is again motivated by forward bridging and
again, a conceptual-associative link rather than a hyponymy or meronymy
relationship is the basis for the bridging: kinship nouns evoke in their conceptual
structure other people who fulfill certain roles in relation to the person they refer
to. For instance, a son is always someones son; the concept son evokes in its
structure the concept of parents, or at least of one specific parent (mother or
father). The referent of the second NP corresponds to this concept.
126 Peter Willemse

1.3 General-Specific constructions

This third category contains a number of types of pseudo-definite NPs in which

the first NP is in some way more general than the second NP. A further
distinction has to be made between appositive and non-appositive cases of
general-specific pseudo-definite NPs. When the construction is appositive in the
sense of Van Langendonck (1999), the first NP categorizes the second NP. When
the construction is non-appositive, the relation between NP1 and NP2 is one of

1.3.1 Appositive

In by far the majority of my corpus examples, the relationship between NP1 and
NP2 is one of apposition, and more specifically of restrictive or close apposition.
Van Langendonck (1999: 113) points out that in close appositional structures,
there are two appositives which together form an intonational unit and cannot
always be interchanged. Several subtypes can be distinguished, according to the
sort of noun that functions as the head noun of NP1.

nouns of modality
(18) By several accounts, there is the possibility of an Iraqi attack, either by
missile or by bomb on the air base--the allied air base where the US forces
are in Saudi Arabia at Dhahran.
(19) He reckons the beans cause wind, and he feels there's a danger the eggs
can be underdone and there's the chance of sickness.

phenomenal nouns
(20) At that moment there was the sound of a door opening.
(21) The crowd surged as the musicians pounded and whined. There was the
scent of sweat, and the stench of arrack on hot breath.

nouns denoting subject matter

(22) Fitness became a problem and, of course, there was the matter of personal
discipline - more off the park than on it.
(23) I think we certainly on the <ZZ1> place name <ZZ0> side felt that we had
to do something to stop this it you know wonderful police with <ZGY> I
<ZF1> don't <ZF0> don't disagree with that <M01> Mm. <M02> but
there is the question of sector policing er which would replace <ZZ1>
place name <ZZ0> when it goes <M01> Mm. <

(24) And for the two overall leaders after the semi-final stage of the
competition there is the bonus of a day out at the FA Cup final at
Wembley on May 11.
Esphoric reference and pseudo-definiteness 127

(25) Noosa is a great course it's fast and the climate is good. <p> And there's
the added motivation of a $25,000 car, so I'm giving it my best shot."

It is interesting to consider the semantic relationship between NP1 and NP2 in

this type of construction. Van Langendoncks (1999) remarks on close
appositions involving proper names (PNs) shed an interesting light on this. Van
Langendonck uses close apposition as a formal criterion for proper-namehood,
since he regards PNs as a semantic-syntactic class rather than as a word-class and
opposes them to proprial lemmas, lexemes which prototypically function as PNs
but can also be used in other ways. The element with the potentially most
specific reference (Van Langendonck 1999: 116) in a close apposition in which
two minimal nominal units are juxtaposed, sometimes with the help of the
apposition marker of (116) is the proper name. Interestingly, Van Langendonck
argues that the common noun element in a close apposition containing a PN
indicates the basic level category to which the proper name in question belongs
(Van Langendonck 1999: 113, 120f). For instance, in a close apposition like the
city of Antwerp, Antwerp is the unit with the most specific reference and therefore
functions as a PN. The other unit, the city, functions as a common noun and
indicates the basic level category which Antwerp belongs to. However, the
category indicated by the common noun in the appositional construction is not
necessarily the basic level category; especially in the case of personal name
appositions, it can also be a more specific category (e.g. Prime Minister Blair, the
Parisian Chirac, etc.).
Although Van Langendonck mentions minimal definite determination
(Van Langendonck 1999: 116) of the two units as a prerequisite for the close
apposition test for proper-namehood to work, the constructions under discussion
here seem to display similar features while the second unit has indefinite
determination. In all of the different subcategories of appositional pseudo-definite
complex NPs, the first NP categorizes the referent of the second NP in some way.
In example (20), for instance, the phenomenon referred to by NP2 is categorized
as a sound in NP1. In example (24), the day out referred to by NP2 is classified
as a bonus. These two examples also make clear that although NP2 normally
restricts the number of possibilities regarding possible categorizations, the
speaker has a certain amount of freedom and thus creative or rhetorical
categorization is possible to a greater or lesser extent. The categorization of a day
out as a bonus in (24) is an example of a fairly creative categorization on the
part of the speaker, whereas the categorization of the phenomenon referred to as a
sound in (20) seems to be more determined by the nature of the referent that is
being categorized. In the cases where the head noun of the first NP is a noun of
modality, the categorization also very often has a creative or rhetorical character.
The fact that the categorizing NP has definite determination not only when the
categorized NP is definite (for instance in the case of PNs) but even when the
second NP is indefinite (in the constructions under discussion here) allows us to
conclude that in English, category indications appear to be typically definite,
irrespective of the definiteness status of the NP which is being categorized.
128 Peter Willemse

Another important question regarding this type of pseudo-definite

contruction pertains to the motivation of the use of the definite determiner in the
first (categorizing) appositive even though the second (categorized) appositive is
indefinite. There are still two different possibilities here. In a first type of
construction, N1 functions as the Head of the construction while the second NP is
a Postmodifier indicating a subtype of the type designated by N1:

(26) Then the low whine of the vacuum cleaner came to his ears, and when it
stopped there was the musical flow of water in the bathroom.
(27) There was the smell of pot all over the apartment. [quoted in
Woisetschlaeger 1983: 142]

In these cases, the pseudo-definite status of the whole NP complex can be

explained in terms of dual reference. On the one hand, there is (definite)
reference to a generic concept. Woisetschlaeger (1983: 142) observes that in
examples such as these, some generic concept having narrow enough
specifications to qualify for prior identification accounts for the definite form of
the NP. Using different terminology, what is identifiable in these examples is the
type of thing talked about. On the other hand, there is (implied) reference to a
new instance of this known type: definiteness, and the attendant existential
presupposition, attaches to the concept referred to by the generic, while the
existence claim introduced by existential there attaches to some instantiation of
the generic concept (Woisetschlaeger 1983: 143). It should be noted that dual
reference is no longer present and that only a reading in terms of instantial
reference is possible in case the definite article in the first NP is replaced by an
indefinite article:

(27) There was a smell of pot all over the apartment.

The second type of appositive construction has a different internal structure. In

this type, N2 rather than N1 functions as the Head of the construction and the first
NP is a Premodifier to this Head. In many of these cases, NP2 contains a

(28) Suddenly, from nowhere, there was the sound of a very fast Forbes saying
that all Americans need only pay a single income tax rate of less than 20
per cent. The middle classes should be given a big break.

In these cases, the definite determiner has to be explained in terms of forward

bridging rather than dual reference. This type of pseudo-definite is thus, again,
esphorically motivated; the first NP points forward to the second NP, in the
sense that the information needed to identify the referent of the first NP is given
in the second NP, which introduces an instance into the discourse. Note that an
alternative in which NP1 has an indefinite article is not possible here:
Esphoric reference and pseudo-definiteness 129

(28) *There was a sound of a very fast Forbes saying that all Americans need
only pay a single income tax rate of less than 20 percent.

It seems, then, that pseudo-definite appositive constructions can be put on a cline

regarding their internal organization and their referential status. At the far ends of
the cline, there are the two different readings, one in terms of dual reference
(where N1 is the Head and NP2 functions as a Postmodifier) and the other in
terms of instantial reference with forward bridging from the Premodifier NP1 to
the Head NP2. Some examples clearly allow for only one of these two possible
readings, while others are more or less ambiguous. It seems that the more specific
and instance-like NP2 is, the more fixed the definiteness of NP1 (recall the
impossibility of a paraphrase with the indefinite article) and the more natural a
reading in terms of forward bridging becomes.
Finally, a special subclass of appositive pseudo-definite NPs is formed by
the cases in which the first appositive indicates a measure or degree:

(29) The title tells it all, and there's the flavour of Whiskey Galore and The
Titfield Thunderbolt about the movie, which offers chuckles and beautiful
Welsh locations as a group of villagers insist on their hill" officially
recorded as a mountain the very first mountain in Wales.
(30) Every time there was a lull, every time there was the hint of an opportunity
for any Tory to giggle at her personally, in came the trolleys again rank
upon rank of them, as patients queued while wicked Conservatives `tore
the National Health Service limb from limb

As is clear from examples such as (29) and (30), lexical extension mechanisms
such as metaphor play a role in these cases: a phenomenal noun occurs as the
head noun of NP1, but a literal categorization of the referent of the second NP
in terms of this type of phenomenon is not intended.

1.3.2 Non-Appositive

The last type of pseudo-definite NPs occurring in my corpus includes the ones in
which a symbolic relation, and more specifically a relation of representing or
depicting, exists between the referents of NP1 and NP2.

(31) Turn your back on the rock and follow the coastal path the other side of
the church to the covered fontaine de St They. There is the statue of a saint
in one niche and, until a few years ago, the other contained a stone,
apparently also revered, showing that the old practices of Morgan's people
have not wholly faded away.
(32) There was the wedding picture of a young black couple among his papers.
[quoted in Woisetschlaeger 1983, example 15f]
130 Peter Willemse

This type of pseudo-definite NP allows for alternation with a formally indefinite

NP, containing an indefinite article:

(31) There is a statue of a saint in one niche.

(32) There was a wedding picture of a young black couple among his papers.

Moreover, in Dutch, a language which is typologically closely related to English,

similar constructions allow only for the indefinite article; a pseudo-definite
variant is not possible in examples like:

(33) Er staat een beeld van een heilige in de ene nis.

There stands a statue of a saint in the one niche.

The motivation for the use of the definite article in the first NP is, again, a
forward bridging relation between the two NPs in the NP complex. As has
already been remarked earlier, a relationship of collocation or association is a
possible basis for bridging. In these cases, the concepts evoked by NP1 and NP2
are strongly associated with each other; we can therefore assume that one concept
evokes the other fairly automatically. Note that in this type of construction, the
second, and not the first noun, is the node of the collocation. However, these are
still cases of forward bridging, because the definite article is motivated by, and
thus points forward to, the second NP.

3. Conclusion

The category of esphoric reference as it has been defined and discussed in the
literature (Martin 1992) covers a number of constructions which are still quite
different in nature. First of all, truly definite NPs may involve esphora, in that
the information needed to identify the referent of the NP is present in the NP
itself, for instance in the form of a restrictive relative clause.
Besides these real definites, there are a number of pseudo-definite
types of esphoric NPs, which, although they show formal signs of definiteness,
realize indefinite reference to instances that are being introduced into the
discourse. This paper has zoomed in on these constructions and has studied them
in an environment which specifically excludes truly definite NPs, viz. the
postverbal position in the unmarked type of existential sentences. On the basis of
a corpus analysis, a classification of the different types of pseudo-definite NPs
was made. The construction types realized by the pseudo-definite NPs and the
semantic relation between NP1 and NP2 formed the basis of the classification. An
important question that was asked for each type concerned the motivation of the
use of the definite article in the first NP of the pseudo-definite NP complexes:
why does the first NP contain a definite determiner even though the unit it is part
of really realizes indefinite reference? It turns out that there are two basic
explanations for this.
Esphoric reference and pseudo-definiteness 131

The first possible motivation is what I have called, using a term from
Ward and Birner (1995), dual reference. The definite article can in that case be
explained by definite reference to a generic concept, to a known type. The NP
complex is, however, only pseudo-definite and can hence take the postverbal
position in an unmarked existential, because at the same time there is reference to
new instances, which are being introduced into the discourse. The reference to
new instances is a pragmatic inference which must be made due to the particular
nature of the grammatical environment of the unmarked existential, which does
not allow true definites.
The second explanation is a relation of what I have termed forward
bridging within the NP. The first NP in the NP complex takes a definite article
because its referent is identifiable by virtue of a bridging relationship to the
information supplied in the second NP. The definite article is thus esphorically
motivated, with bridging as the ultimate foundation for the esphoric relationship.
The basis for the bridging relation may in its turn be hyponymy or meronymy or,
alternatively, a collocational or associative link.


1. The term attributive is used by Halliday (1967, 1985, 1994); an equivalent

term, used by Declerck (1988), among others, is predicative clause.
2. All the examples quoted in this paper, except when otherwise indicated, are
extracted from the COBUILD corpus via remote log-in and are reproduced
here with the kind permission of HarperCollins publishers.
3. Unlike in other types of constructions, where the pragmatic inference to (new)
instances is optional and therefore creates vagueness. For instance, in They
decided to bomb various sorts of targets, there is certainly and primarily
reference to different (sub)types of targets (e.g. military target, civilian target,
etc.). On a second level of interpretation, then, specific instances of targets
may be evoked as well (e.g. a military base in Kabul, a residential area in
Bagdad, etc.). However, the sentence is perfectly acceptable without this
inference; in existential contexts, on the other hand, ungrammaticality arises
when the interpretation in terms of instances is not activated.


Davidse, K. (1999), The semantics of cardinal versus enumerative existential

constructions, Cognitive Linguistics 10(3), 203-250.
Davidse, K. (2000), A constructional approach to clefts, Linguistics 38-6, 1101-
132 Peter Willemse

Davidse, K. (2001), Postdeterminers: their secondary identifying and quantifying

functions. Preprint n177, Linguistics Dept., K.U. Leuven.
Declerck, R. (1988), Studies on Copular Sentences: Clefts and Pseudo-Clefts.
Leuven: Leuven University Press and Foris Publications.
Du Bois, J. (1980), Beyond Definiteness: the trace of identity in discourse, in
W. Chafe (ed.), The Pear Stories: cognitive, cultural and linguistic aspects
of narrative production, 203-274. Norwood: Ablex.
Halliday, M.A.K. (1967), Notes on Transitivity and Theme in English, Journal
of Linguistics 3 (1), 37-81.
Halliday, M.A.K. (1994), Introduction to functional grammar. 2nd Ed. London:
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London: Longman.
Langacker, R.W. (1991), Foundations of cognitive grammar. Vol. II: Descriptive
Application. Stanford: Stanford University Press.
Lumsden, M. (1988), Existential Sentences. Their Structure and Meaning.
London: Croom Helm.
Martin, J. (1992), English text: System and Structure. Amsterdam: Benjamins.
Milsark, G (1976). Existential Sentences in English. Bloomington: Indiana
University Linguistics
Milsark, G. (1977). Toward an explanation of certain peculiarities of the
existential construction in English, Linguistic Analysis 3: 1-29.
Van Langendonck, W. (1999), Neurolinguistic and syntactic evidence for basic
level meaning in proper names, Functions of Language 6.1: 95-138.
Ward, G. and B. Birner (1995), Definiteness and the English existential,
Language 71, 4: 722-742.
Woisetschlaeger, E. (1983), On the question of definiteness in an old mans
book, Linguistic Inquiry 14, 1: 137-154.
Why an angel rides in the whirlwind and directs the storm: A
corpus-based comparative study of metaphor in British and
American political discourse.

Jonathan Charteris-Black

University of Surrey


This paper compares choice of metaphor in two political corpora: the Inaugural
speeches of American Presidents and party political manifestos of two British
political parties during 1974-1997. Initially metaphors are classified according
to their source domain; they are then analysed from a cognitive semantic
approach. The major findings are that metaphors from the domains of conflict,
journeys and building are common to both corpora. However, the British corpus
includes metaphors that draw on the source domain of plants whereas the
American corpus contains metaphors that draw on source domains such as fire
and light and the physical environment that do not occur in the British corpus.
These variations suggest differences in metaphors between British and American
political discourse and provide insight into cultural differences.
The cognitive analysis reveals the importance of the conceptual metaphors
WEATHER CONDITION occur only in the American corpus. There is some evidence
that British political discourse has borrowed metaphors based on the concept
POLITICS IS RELIGION from American political discourse.

1. Introduction

In this paper I am interested in exploring variation in metaphor choice within the

domain of politics by comparing the metaphors found in a corpus of American
presidential speeches with those found in a corpus of British party political
manifestos. Speeches and political manifestos are both types of political discourse
with the shared function of persuasion, therefore we may anticipate some overlap
in metaphor use. However, it may also be expected that differences in the
historical traditions and cultures of Britain and the United States may lead to
differences in the types of metaphors that are selected to attain similar rhetorical
objectives within political discourse.
134 Jonathan Charteris-Black

2. Political speeches, manifestos and metaphor

Metaphor plays an important rhetorical role in persuasive language because it has

the potential to exploit the associative power of language in order to provoke an
emotional response on the part of the hearer. In the domain of politics metaphors
have a crucial role since they combine a discourse function of communicating
policy with an expressive function of persuasion both as regards evaluating
policy and evaluating the reliability and integrity of politicians. Metaphors are
valuable because they facilitate the exploration of possible political objectives
while not committing the speaker to any of these and because they encourage
affective involvement of the type sought by successful political leaders. Using
Hallidays terms they therefore combine ideational and interpersonal functions of
language (Halliday 1994).
One of the main difficulties in comparing British and American political
discourse is that there are not identical text types in the conventions of the two
political systems. In the British system the policy statements that lay out the
intentions of a political party are communicated in written manifestos published
prior to an election. In the American system one means of communicating policy
statements are the Inaugural addresses that occur in the January following the
election of a new President; they receive wide media coverage because they set
out policy for the duration of the administration. The rationale for comparing
these two text types is that they broadly share a common communicative purpose
of persuading the voting public of the value of the governments intended policy.
Manifestos are documents stating the intentions and policies of political
parties that have the communicative function of persuading the electorate to vote
for a political party. They are usually generated through collaborative processes
of drafting and redrafting and entail multiple authorship. They provide phrases
that may subsequently be used as slogans in speeches. Political speeches differ
from many other types of spoken discourse because there is usually some use of
pre-prepared written script; they are characterised by a higher degree of planning
than is normal in spontaneous speech and scripts are often prepared by a team of
ghost writers. Therefore, manifestos and political speeches are similar in that
they both involve collaborative planning. They also share the communicative
purpose of persuasion both as regards the ideology of a political party and as
regards the integrity and values of politicians.
My view of metaphor originates in Richards (1936) tensile view of
metaphor in which metaphor refers to the semantic tension arising from a shift
in the use of a word from one domain or context to another. Metaphors are not
inherent in word forms but arise from the relationship between words and their
contexts. Excellent summaries of metaphor are available in Black (1962), Gibbs
(1994), Goatly (1997), Ortony (1979) and Lakoff and Johnson (1980, 1999).
Some of the background issues relating to defining metaphor are discussed
Charteris-Black (2000), Charteris-Black and Ennis (2001), Charteris-Black and
Musolff (2003), Charteris-Black (2004).
Metaphor in British and American political discourse 135

3. Research method

The aims may be summarised by the following research question:

What are the similarities and differences between the metaphors employed in
American inaugural speeches and British election manifestos?

Two corpora one American and one British were used to assist in answering
this question; the first is a corpus comprising the 51 Inaugural addresses of
American Presidents spanning approximately 200 years from George Washington
to Bill Clinton and was 98,237 words in length. The second was a corpus
comprising the party political manifestos for the Labour and Conservative Party
in the period 1945-1997 inclusive and was 132,775 words in length.1 For
convenience I will refer to the corpus of Inaugural speeches as the American
corpus and the corpus of political manifestos as the British corpus.
While diachronic variation was not the main focus of this study, it is
possible that some of the differences observed between the two corpora may be
partially attributed to the difference in the time period they cover. However, since
Inaugural speeches often refer intertextually to the speeches of earlier presidents
it was considered acceptable to treat them as a coherent and homogeneous body
of texts. In the same way, the structure of the British party manifestos has
remained relatively unchanged during the period covered and so this was also
considered a coherent body of texts. In both cases the function of language
combines communication of ideas with persuasion and this similarity of
communicative purpose makes them comparable genres. However, diachronic
factors could be taken into account in interpreting the findings and may indeed
form the central focus of future research in this area.
The methodology combines qualitative with quantitative approaches.
Initially, qualitative analysis of a sample of each corpus revealed a set of words
that have the potential to be used as metaphors. Identification of metaphor was
based on the definitions discussed above. The procedure was to analyse
potentially metaphorical linguistic forms in the two corpora to establish whether
on each occasion of use they should be classified as metaphor.
For example, windfall typically refers to apples blown down by the
wind; however, it is also found in New Labour discourse in expressions such as
windfall tax; this innovative use in a political context is the basis for identifying
all uses in this corpus as metaphor. Admittedly, for some speakers windfall may
more commonly be used in the context of taxation than that of fruit or the weather
however, there are invariably subjective issues of variation between language
users that influence metaphor identification. A further example from the
American corpus is that words such as path and step may be used as
metaphors that draw on the domain of journeys (see example 9); however, the
President sometimes refers to the white steps on which he is standing while
speaking. Evidently, such a use of step refers literally to the steps of the White
House. It is necessary to examine each of the contexts of words and phrases that
136 Jonathan Charteris-Black

have the potential to be used metaphorically to establish whether there is the

presence or absence of the semantic tension that is the basis for their
classification as actual metaphors.
Metaphors were classified according to the lexical fields of their linguistic
forms; these are generally referred to as source domains. It was then possible to
undertake further qualitative analysis to propose conceptual bases for metaphor
clusters using a cognitive semantic framework (cf. Lakoff and Johnson 1980,
1999). So, for example, a conceptual metaphor LIFE IS A JOURNEY could be said to
motivate uses of step, path etc when occurring in a context that does not refer
to physical movements in space.
I established the resonance of source domains for metaphors using a
simple statistical measure: first the types and tokens of the metaphors are
calculated. Types are separate unique linguistic forms while tokens are the
number of times each form occurs irrespective of whether it has already occurred
tokens include repetitions of identical linguistic forms whereas types do not.
Then the total number of types for each lexical field, or source domain, is
multiplied by the total number of tokens in that source domain; this provides a
measure of its resonance. This can then be converted to a percentage by dividing
the resonance of each source domain by the total of the resonances for all the
source domains. This calculation overcomes the problem of the difference in the
size of the two corpora; it also facilitates comparison of source domains in a way
that takes into account their productivity in terms of the types and tokens of the
metaphors that they produce. It permits identification of similarities and
differences between the productivity of metaphor source domains within a single
corpus and between different corpora.

4. Findings

I will first present an overview of the findings then address each part of the
research question. The findings as regards the resonance of metaphor source
domains are shown in Table 1.
From the bottom row we can see that more than twice as many types of
metaphor were identified in the American corpus as compared with the British
one. However, the larger British corpus contained many more tokens of metaphor
indicating that in the British corpus there is a tendency for metaphors to repeat the
same linguistic forms. There are areas of similarity in metaphor domains between
the two corpora since conflict, journeys and buildings are the three most resonant
lexical fields in each corpus. These domains account for 66% of total resonance
in the American corpus and 89% of total resonance in the British corpus;
however, there are also variations in metaphor use.
Metaphor in British and American political discourse 137

Table 1. Comparison of metaphor types in British and American political corpora

classified by source domain
Source domain American corpus British corpus
Types Tokens Reson- % of Types Tokens Reson- % of
ance total ance total
(types x (types x
tokens) tokens)
Conflict 18 116 2,088 36 9 494 4,446 54
Journeys 12 76 912 16 5 187 935 11
Buildings 12 66 792 14 7 287 2009 24
Fire & light 15 51 765 13 - - - -
Physical environment 16 35 560 9 - - - -
Plants - - - - 5 150 750 9
Religion 6 72 432 7 4 46 184 2
Body part 4 76 304 5 - - - -
Total 83 492 5,853 100 30 1,164 8,324 100

First, we may notice that while conflict is the most common lexical field
for metaphor in both corpora it is more resonant in the British than in the
American Corpus. Perhaps this may be explained by the combative discourse
function of the pre-election party political manifesto as compared with the
postelection inaugural speeches where there is less need to combat a defeated
opposition party.
Another interesting distinction is that while journey metaphors are more
common in the American Corpus, building metaphors are more common in the
British Corpus where they account for nearly a quarter of all metaphors. This is
an interesting difference in resonance that can be explained with reference to
different cultural experiences new experiences arising from journeys are salient
for Americans while the sense of security and solidity arising from buildings are
salient for the British.
A few lexical fields occurred in only one of the corpora; for example,
metaphors based on fire and light were only found in the American corpus. Fire
and light metaphors often convey idealism in the American corpus, while in the
British corpus religious metaphors such as vision are used for this purpose.
Another major difference between the two political corpora is that the
American corpus tends to employ physical environment metaphors in situations
where plant metaphors are employed in the British corpus (cf. example 10
below); in each case these two lexical fields constitute 9% of the total resonance.
This may have a cultural explanation in that gardening is a major pastime in
British society; the garden is a domain of private but external space. Conversely,
American cultural and historical experience draws on undomesticated space and
this reflects in the use of words such as valley, horizon, jungle, mountain
138 Jonathan Charteris-Black

or desert in a political context. For the majority of English people nature is

conceived as something to be controlled by physical intervention, whereas for
Americans nature is conceived as larger and more elemental and to be controlled
by travel. Such metaphor preferences demonstrate the influence of cultural
practice and the physical environment on metaphor use.
In answering the first part of the research question that refers to
similarities I will consider the three lexical fields for metaphor that were common
to both corpora.

5. Metaphors common to both American and British political discourse

Conflict metaphors

Metaphors from the lexical field of conflict were originally identified in relation
to spoken language in terms of debate and represented as ARGUMENT IS WAR (cf.
Lakoff and Johnson 1980). Conflict is the most common lexical field in both
corpora and provides evidence of a conceptual metaphor POLITICS IS CONFLICT. I
suggest that metaphors of conflict are chosen to emphasise the personal sacrifice
and physical struggle that is necessary to achieve social goals while subliminally
creating the opportunity for positive evaluation of actual military conflict. Table 2
shows some examples from the American corpus.
Nearly all the conflict metaphors have a very similar rhetorical pattern: in
pragmatic terms the choice of a conflict metaphor determines the nature of the
speakers evaluation. The conflict is either for abstract social goals that are
positively evaluated such as rights, freedom, faith etc. or against social
phenomena that are negatively evaluated such as poverty disease, injustice etc.;
these social ills are conceptualised as enemies. In addition, the stages through
which social progress is to be made are conceptualised in terms of the stages of a
military action: the trumpet that calls to action, attack, retreat, truce and eventual
victory or surrender. POLITICS IS CONFLICT implies an isomorphic relationship
between the domains of politics and war.
Similarly, in the British corpus both parties defend abstract social goals
that are positively evaluated by their own party but perhaps because of the
combative function of the election manifest imply that such goals are under
threat from political opponents:

(1) We will defend the fundamental right of parents to spend their money on
their childrens education should they wish to do so. (Conservative)

(2) While continuing to defend and respect the absolute right of individual
conscience (Labour)
Metaphor in British and American political discourse 139

Table 2. POLITICS IS CONFLICT The American Corpus

Conflict Tokens Examples

enemies 10 we will fight our wars against poverty, ignorance,
and injustice for those are the enemies against
which our forces can be honorably marshaled.
(Jimmy Carter)
destroy 8 wise and correct course to follow in taxation and
all other economic legislation is not to destroy those
who have already secured success but to create
conditions (Calvin Coolidge)
victory 6 Every victory for human freedom will be a victory for
world peace. (Ronald Reagan)
struggle 6 as a call to battle, though embattled we arebut a
call to bear the burden of a long twilight struggle,
year in and year out, "rejoicing in hope, patient in
tribulation" (Jimmy Carter)
fight 3 We will be ever vigilant and never vulnerable, and
we will fight our wars against poverty, ignorance, and
injustice (Jimmy Carter)
trumpet 3 We have heard the trumpets. We have changed the
guard. (Bill Clinton)
battle 3 We know the race is not to the swift nor the battle to
the strong. Do you not think an angel rides in the
whirlwind and directs the storm? (George W. Bush)

Parties may also defend social institutions or groups in society that are positively

(3) Labour created the National Health Service and is determined to defend it.

(4) We will continue to defend farmers and consumers. (Conservative)

However, while defence metaphors are used in similar ways by both the major
British parties, attack metaphors are used rather differently; The following
examples show that the Labour party fights against or attacks a general range of
social ills while the Conservative party defends social virtues:

(5) Economic success is not an end in itself. For the Labour party, prosperity
and fairness march hand in hand on the road to a better Britain. During the
next Parliament, we intend to continue our fight against all form of social
injustice. (Labour)
140 Jonathan Charteris-Black

(6) We will fight against crime and violence which affects all Western
societiesAt the same time, we shall attack the social deprivation which
allows crime to flourish. (Labour)

(7) We have to compete to win. That means a constant fight to keep tight
control over public spending and enable Britain to remain the lowest taxed
major economy in Europe. It means a continuing fight to keep burdens off
business. (Conservative)

(8) We will continue to fight for free and fair trade in international
negotiations. (Conservative)

When fight is used in a Conservative manifesto it signifies defending something

that is represented as under attack from Labour. Conversely, fight is used in
Labour manifestos to represent its policies as an attack on negatively evaluated
social ills, the cause of which is not usually identified.
One explanation of the extensive use of metaphors related to the
conceptual metaphor POLITICS IS CONFLICT in both corpora is that decisions about
whether or not to engage in political conflict are perhaps the most important
decisions that politicians have to make. This is particularly the case in the USA
where the President has the sole right of veto over whether or not nuclear missiles
should be launched. This reflects in the use of the word defend in the
presidential oath: clearly the American constitution emerged out of conflict both
against external political enemies (Britain and France) and internal enemies (the
native inhabitants, later the Civil War). Interestingly, terrorism can be defined
either as an external or as internal threat. With the continuing importance of
American military involvement as a basis for international power, and the role of
Britain in legitimising post-colonial dominance, it is of little surprise that conflict
remains a highly potent source domain in both American and British political
discourse. When metaphors based on POLITICS IS CONFLICT are used by
democratic politicians, they treat the struggle against social ills in the language
usually reserved for military conflict. In subliminal terms this creates the potential
for military combat to be represented as socially beneficial (cf. Lakoff 1991,
Jansen and Sabo 1994).

Journey metaphors

Journey metaphors have quite a long history in cognitive linguistic research.

Originally, Lakoff and Johnson (1980: 44) proposed LOVE IS A JOURNEY; Lakoff
and Turner (1989) then proposed LIFE IS A JOURNEY. A more generic
representation is PURPOSES ARE DESTINATIONS (Lakoff and Johnson 1999: 52-53).
In these representations a journey is taken as a prototype purposeful activity
involving movement in physical space from a starting point to an end point or
destination. Since politicians are concerned with goal-oriented social activity, I
propose a similar representation: PURPOSEFUL SOCIAL ACTIVITY IS TRAVELLING
Metaphor in British and American political discourse 141

ALONG A PATH TOWARD A DESTINATION. Normally, journey metaphors evaluate

policies positively because the ends are socially valued ones.
A single conceptual metaphor can account for the semantic coherence of a
whole speech; this is evident from an analysis of Lyndon Johnsons inaugural
address (January 20th 1965). The theme of this speech is change; the sentences in
example 4 are chosen from a total of thirty-four sentences and their positions are
shown in parentheses:

(9) Even now, a rocket moves toward Mars. (4)

They came here the exile and the stranger, brave but frightened to
find a place where a man could be his own man. (7)

First, justice was the promise that all who made the journey would share in
the fruits of the land. (8)

Think of our world as it looks from the rocket that is heading toward Mars.
It is like a child's globe, hanging in space, the continents stuck to its side
like colored maps. We are all fellow passengers on a dot of earth. (16)

For this is what America is all about. It is the uncrossed desert and the
unclimbed ridge. It is the star that is not reached and the harvest sleeping
in the unplowed ground. Is our world gone? We say "Farewell." Is a new
world coming? (31)

To these trusted public servants and to my family and those close friends
of mine who have followed me down a long, winding road, (32)

The journey metaphors attempt to evoke the original historical experience of the
Pilgrim Fathers (7 and 8); the opening up of the American west (31) and the
space programme (4 and 16). These are integrated with the more general use of
journey metaphors to describe human relationships as implied by Lakoff and
Johnsons LIFE IS A JOURNEY metaphor, as in (32). We can see from the
distribution of these metaphors that they form a path through the text inviting the
listener to participate in a journey. In metaphorical terms the President is
represented as a guide; since only the guide knows the destination, the speech
provides a type of map towards it. Since journeys may be to unknown
destinations, the choice of this conceptual basis has the rhetorical goal of
persuading the American people to accept innovation and social change.
However, travel can be slow and arduous because of impediments to
movement and hence there will be barriers to overcome and burdens to bear.
Lakoff and Johnson (1999: 188) represent metaphoric use of words such as
burden as DIFFICULTIES ARE IMPEDIMENTS TO MOVEMENT. In a political context
these metaphors express the need for patience since it takes time and effort to
reach a destination. This is rhetorically effective because it implies that the
142 Jonathan Charteris-Black

electorate should not expect instant results and that, at times, they may need to
suffer to achieve goals; it also implies that hardships are to be tolerated because
these goals are worthwhile. The extracts shown in the following examples all
share the notion of a burden whose weight should be endured (or ignored
altogether) because of the value placed on the destination:

(10) indeed all free men, remember that in the final choice a soldiers pack
is not so heavy a burden as a prisoners chains. (Dwight Eisenhower)

(11) Let us accept that high responsibility not as a burden, but gladly gladly
because the chance to build such a peace is the noblest (Richard Nixon)

Building metaphors

Metaphors from the source domain of building are typically evaluative, carry a
strong positive connotation and are employed to express aspiration towards
desired social goals such as peace, democracy and progress towards a better
future. They emphasise social cohesion, social purpose and control of ones
environment. These metaphors can be divided into two types. First, there are
those that refer to the parts of a building foundations, threshold, doors, etc. and
others that refer to types of building such as house, or bridge.
The most frequent part of a building that is used metaphorically is the
foundations. In such metaphors an abstract phenomenon is positively evaluated
which permits us to infer a conceptual representation A WORTHWHILE ACTIVITY IS
A BUILDING. Laying foundations is a conventional metaphor for a solid and
valuable policy although it may not in fact be taken through to completion. We
know that any building which is to be durable must first have foundations and
that these may take a long time to construct; however, we also know that the
laying of foundations does not necessarily imply the completion of a building. If
the money to buy materials or to pay builders runs out then the building will not
be built. So in reality it is very difficult to predict the extent to which laying
foundations will guarantee the successful completion of a construction.
Building metaphors make an interesting comparison with journey
metaphors. Building and travelling are conceptually related, as they are both
activities in which progress takes place in stages towards a predetermined goal.
Topographically, both involve increase in the surface that is covered; in the case
of journeys this is linear movement along a horizontal path whereas for buildings
there is three-dimensional increase along a vertical path. Both activities highlight
the need for patience since they require time and effort. Difficulties entail a need
to make sacrifices and not to expect instant outcomes. Since we think of
achieving goals as inherently good, in pragmatic terms, both journey and building
metaphors imply a positive evaluation of political policy. They require a plan or
map and an architect or guide, and it may be this conceptual proximity accounts
for their resonance in the corpus.
Metaphor in British and American political discourse 143

I will now address the second part of the research question by considering
metaphors that only occurred in one of the corpora.

6. American corpus: light and fire metaphors

The analysis suggests that light & fire metaphors are particular to American
political discourse. The lexical field of light has traditionally been linked with the
target domain of understanding and metaphors that draw on it are motivated by a
conceptual metaphor KNOWING IS SEEING (cf. Lakoff and Johnson 1999: 53-54).
However, for this political data I suggest a conceptual metaphor HOPE IS LIGHT
that invariably implies a positive evaluation. It is likely that spiritual notions will
be evoked because of the importance of hope in religious discourse. As we can
see from Table 3, light is contrasted with darkness that is associated with
ignorance, failure to understand and evil.
Light is always positive because of its polarity with darkness. In other
circumstances fire metaphors can also be used for positive evaluation. This is
because George Washington first used the fire metaphor in an inaugural address
and the metaphorical link between fire and liberty has become a source of
intertextual reference in presidential addresses as we can see from the following

(12) since the preservation of the sacred fire of liberty and the destiny of the
republican model of government (George Washington)

(13) The preservation of the sacred fire of liberty and the destiny of the
republican model of government are justly (Theodore Roosevelt)

(14) He would extinguish the fire of liberty, which warms and animates the
hearts of happy millions (James Polk)

Fire is represented as the guarantor of liberty. This may be because it implies that
some form of burning or destruction will be necessary: this is in keeping with
Americas revolutionary wars, struggle for independence and Civil War. In these
metaphors conflict can be represented as a means to peace. Consider the
following examples:
144 Jonathan Charteris-Black

Table 3. Light and fire metaphors (American Corpus)

Light & Fire Tokens Examples

light 15 I have spoken of a thousand points of light, of all the
community organizations that are spread like stars
throughout the Nation, doing good.
(George Bush)
dark 8 Finally, to those nations who would make themselves
our adversary, we offer not a pledge but a request:
that both sides begin anew the quest for peace, before
the dark powers of destruction unleashed by science
engulf all humanity in planned or accidental self-
(John F. Kennedy)
fire(s) 7 and since the preservation of the sacred fire of liberty
and the destiny of the republican model of
government are justly considered, perhaps, as deeply,
as finally, staked on the experiment entrusted to the
hands of the American people.
(George Washington)
bright 4 These principles form the bright constellation which
has gone before us and guided our steps through
(Thomas Jefferson)
dawn 3 so that together, we can see the dawn of a new age of
progress for America, and together, as we celebrate
our 200th anniversary as a nation,
(Richard Nixon)
beacon 2 We will again be the exemplar of freedom and a
beacon of hope for those who do not now have
freedom. (Ronald Reagan)

(15) And it is imperative that we should stand together. We are being forged
into a new unity amidst the fires that now blaze throughout the world. In
their ardent heat we shall, in God's Providence, let us hope, be purged of
faction and division (Woodrow Wilson)

(16) Mill fires were lighted at the funeral pile of slavery. (Benjamin Harrison)

In such cases fire originates in Washingtons fires of liberty and provides

evidence of the conceptual metaphor SOCIAL PURIFICATION IS HEAT. Therefore,
different aspects of the source domain are highlighted in particular choices of
Metaphor in British and American political discourse 145

metaphor. For example, it seems that when words such as kindled or flames
are used metaphorically to convey notions of anger, it is the speed and rate of
burning that are important rather than heat. In this corpus heat is a positive rather
than a negative attribute of fire because it is associated in scientific senses with
the notions of purification (as when impure metals are converted to pure ones by
the application of heat). Similarly, fire metaphors are also positive when they
highlight the quality of fire to produce light as in metaphorical uses of beacon.
In this respect it depends on which aspect of the source domain is highlighted
whether a President conveys a positive or negative evaluation. Such malleability
makes fire a useful and potent cognitive domain as it can combine different
aspects of our knowledge of an element to convey an evaluation that is
appropriate to a specific discourse context. Similarly, light and darkness provide
prototype poles for creating contrasts between spiritual or moral notions of
goodness and evil. Light and fire metaphors therefore share both a cognitive and
pragmatic role in American political discourse.

7. British corpus: plant metaphors

Metaphors from the domain of plants are an important group comprising 9% of

all metaphors in the British Corpus. Many of these were accounted for by a
conventional metaphor for 'growth' in the context of describing economic
expansion. We also find a similar use of flourish to imply a strong positive

(17) As we want small businesses to flourish, we will go even further.


(18) To build a responsible society which protects the weak but also allows the
family and the individual to flourish. (Conservative)

In these cases flourish identifies those social entities that are highly valued. In
some cases these are the same for both parties, for example 'families', but in
others they are specific to parties, for example 'business' is claimed to 'flourish'
under the Conservatives and 'democracy' is claimed to 'flourish' under Labour.
There is also evidence of effective use of plant metaphors for the purpose
of political persuasion. Let us consider the use of the term windfall. As in the
Labour Party 1997 manifesto, it is always used in a nominal compound form
windfall levy. The use of this metaphor is important in that it conceals agency:
it is not clear that this is in fact a tax imposed by the government of the day. The
Bank of English corpus shows that the other familiar collocations of this word are
windfall tax, cash windfall, and windfall profits. Here public revenue is
conceptualised as being obtained without any effort because it is through the
natural process of the wind blowing. There is no victim and no effort involved in
obtaining a social benefit. This is an example of a creative use of metaphor that
146 Jonathan Charteris-Black

deliberately construes an event as effortless because there is no animate agent and

positive because it is seen as a gift of nature. In the 'windfall' metaphor the agency
of government is concealed.
Many plant metaphors imply a strong positive evaluation because of the
connotation formed by the association of fertility with life, as in the following

(19) We will nurture investment in industry, skills, infrastructure and new

technology (Labour)

(20) More realistic attitudes to profit and investment take root. (Conservative)

Here the expansion of investment is represented as a natural process in which

there is an analogy between the roots that are the pre-requisite of a healthy plant
and the investment that is conceptualised as the pre-requisite of a healthy
economy. This is based on the fact that both are invisible causes of visible effects
they create a semantic association between consumer wealth and fertility.
Metaphors such as nurture and took root are extensions of the highly
conventionalised use of growth to refer to economic expansion. As with the
building and journey metaphors, there is, then, an isomorphic correspondence
between the sequence of events that led to a successful outcome in the natural
world and in the world of business.

8. American corpus: physical environment metaphors

I decided to combine two sub-domains that are both related to the physical
environment; these are weather metaphors and metaphors for natural
geographical features. Such metaphors may appeal particularly to that significant
minority of the North American population that inhabits rural and semi-rural
areas such as the vast Midwest.
Weather metaphors are a conventional source domain for conveying
abstract notions of change and associated ideas; they have been related in the
cognitive linguistic literature to a conceptual key CIRCUMSTANCES ARE WEATHER
(e.g. Grady et al. 1997: 109). For example, our knowledge that wind brings about
a change in the weather provides a useful metaphorical representation of cause
and effect.

(21) Thus across all the globe there harshly blow the winds of change. (Dwight

(22) in the shadows of the Cold War assumes new responsibilities in a world
warmed by the sunshine of freedom but threatened still by ancient hatreds
and new plagues. (Bill Clinton)
Metaphor in British and American political discourse 147

It is significant that metaphors associated with changing conditions are much

more common than those associated with stable ones. The more intense the
weather condition, the more intense the change implied. Weather metaphors
evoke either a positive or a negative evaluation. I propose that in the domain of
politics, therefore, a specific conceptual metaphor is A SOCIAL CONDITION IS A
WEATHER CONDITION. This is related to the more generic conceptual metaphor
Geographical metaphors highlight a particular aspect of a physical
geographical feature of the landscape; typically, this is either vertical (e.g. valley,
mountain) or horizontal (e.g. desert, horizon).

(23) Together let us explore the stars, conquer the deserts, eradicate disease, tap
the ocean depths, (Dwight Eisenhower)

(24) Vitality has been preserved. Courage and confidence have been restored.
Mental and moral horizons have been extended. (Theodore Roosevelt)

Physical environment metaphors have the pragmatic effect of evaluating social

conditions as if they were physical ones and are specific realisations of the
generic level conceptual representation STATES ARE LOCATIONS (cf. Lakoff and
Johnson 1999: 180).

9. Metaphor borrowing: religious metaphors

The lexical field of religion comprised 7% of the resonance in the American

corpus and 2% in the British corpus; most of the uses in the latter occurred in the
more recent section of the corpus which implies a degree of borrowing. It would
be interesting to compare this with an earlier corpus of British political speeches,
although it was not possible to locate one on this occasion. It should come as no
surprise that religious metaphors are commonly used in American political
speeches: religion has played an important part in the evolution of the USA and
Christian evangelism has been an important source of inter-racial and inter-ethnic
harmony. Religion serves as a source domain for invoking spiritual aspirations
into the political domain and links the President with a commitment to Christian
religious belief. This suggests further evidence for a conceptual metaphor
POLITICS IS RELIGION. Example (25) contains extracts from Bill Clintons first
inaugural speech:

(25) A spring reborn in the world's oldest democracy, that brings forth the
vision and courage to reinvent America (3)

Though we march to the music of our time, our mission is timeless. (5)
148 Jonathan Charteris-Black

We must bring to our task today the vision and will of those who came
before us. (16)

Our democracy must be not only the envy of the world but the engine of
our own renewal. (19)

The brave Americans serving our nation today in the Persian Gulf, in
Somalia, and wherever else they stand are testament to our resolve. (35)

An idea ennobled by the faith that our nation can summon from its myriad
diversity the deepest measure of unity. (40)

And so, my fellow Americans, at the edge of the 21st century, let us begin
with energy and hope, with faith and discipline, and let us work until our
work is done. The scripture says, "And let us not be weary in well-doing,
for in due season, we shall reap, if we faint not." (41)

From this joyful mountaintop of celebration, we hear a call to service in

the valley. We have heard the trumpets. We have changed the guard. And
now, each in our way, and with God's help, we must answer the call. (42)

Thank you and God bless you all. (End)

Clearly, the references to vision, faith, mission etc. form a cohesive chain
that prepares the way for the strongly religious theme of this coda. This is a
further example of how metaphor can be used systematically to create coherence
in a political text.
I propose that the New Labour Party in Britain has borrowed from
American political discourse to introduce the lexical field of religion into British
political metaphors. It is no secret that Tony Blair had a close social relationship
with Bill Clinton as well as sharing a similar political allegiance to social
democracy. Example (26) shows some typical uses of vision metaphors in the
1997 New Labour manifesto:

(26) But a Government can only ask these efforts from the men and women of
this country if they can confidently see a vision of a fair and just society.
(New Labour)
The vision is one of national renewal, a country with drive, purpose and
energy. A Britain equipped (New Labour)
Our vision for Britain is founded on these values. Guided by them, we will
make our country more (New Labour)
An independent and creative voluntary sector, committed to voluntary
activity as an expression of citizenship, is central to our vision of a
stakeholder society (New Labour)
Metaphor in British and American political discourse 149

The vision metaphor is based on the conceptual metaphor SEEING IS

UNDERSTANDING (Lakoff and Johnson 1980: 48); it implies that there is an
altruistic objective that is understood by the party and towards which its policies
are directed. It is one that is analogous to spiritual progress because it claims that
the objective is to make the world a better place to live in. These metaphors
provide evidence that the conceptual metaphor POLITICS IS RELIGION has entered
British political discourse from American political discourse.

10. Conclusion

In this cognitive semantic and corpus-based comparison of metaphors in British

election manifestos and American Inaugural speeches I have found both
similarities and differences. The three most common lexical fields for metaphor
are shared by both varieties: conflict, journeys and buildings. I have argued that
conflict metaphors are the most common in both varieties because of the salience
of conflict in relation to politics and because they emphasise notions of struggle
and personal sacrifice to attain social objectives. I have also suggested that use of
such metaphors may create the potential for passive acceptance of actual military
conflict because it is subliminally associated with objectives that are evaluated as
being socially beneficial as in the current war on terrorism.
I have also identified some lexical fields that only occur in one of the
varieties: plants in the British corpus and fire & light and the physical
environment in the American corpus. I have suggested some culturally and
historically related explanations such as the British passion for gardening leading
to the positive associations of words such as growth and nurture, and the
American experience of struggling for independence leading to a positive
evaluation of fire metaphors.
I have also suggested that the recent introduction of metaphors from the
lexical field for religion into British political discourse by New Labour is
borrowed from American political discourse where they have more established
Further research is necessary to establish whether a wider range of text
types taken from political discourse confirms or conflicts with these findings. It
would also be interesting to establish whether differences in the use of metaphor
occur between varieties of general English or whether they are restricted to
particular domains of language use, such as politics. It would also be relevant to
study diachronic shifts in the use of metaphor within the domain of politics.
Finally, it would be interesting to find out whether the types of evaluation
that I have suggested motivate the use of metaphor in political discourse achieve
their intended effect by collecting empirical data on reader/hearer response to
metaphor in political contexts.
150 Jonathan Charteris-Black


1. The British party manifestos are available on the web at

http://www.psr.keele.ac.uk/platform.htm and the American Inaugural
addresses at http://www.bartleby.com/124/.


Black, M. (1962), Models and metaphors. Ithaca, N.Y.: Cornell University Press.
Charteris-Black, J. (2000), Metaphor and vocabulary teaching in ESP
economics, English for Specific Purposes 19: 149-165.
Charteris-Black, J. and T. Ennis (2001), A comparative study of metaphor in
English and Spanish financial reporting, English for Specific Purposes
20: 249-266.
Charteris-Black, J., and A. Musolff (2003), Battered hero or innocent victim? A
comparative study of metaphors for euro trading in British and German
financial reporting, English for Specific Purposes. 22:153-176
Charteris-Black, J. (2004) Corpus Approaches to Critical Metaphor Analysis.
Basingstoke: Palgrave-MacMillan
Gibbs, R.W. (1994), The Poetics of the mind: figurative thought, language and
understanding. Cambridge: Cambridge University Press.
Goatly, A. (1997), The Language of metaphors. London & New York: Routledge.
Grady, J.E., T. Oakley and S. Coulson (1997), Blending and metaphor, in: R.W.
Gibbs and G.J. Steen (eds), Metaphor in cognitive linguistics. Amsterdam
& Philadelphia: Benjamins, 101-124.
Halliday, M.A.K. (1985), An introduction to functional grammar. 2nd ed.
London: Edward Arnold.
Jansen, S.C. and D. Sabo (1994), The sport/war metaphor: hegemonic
masculinity, the Persian Gulf war, and the New World order, Sociology of
Sport Journal 11: 1-17.
Lakoff, G. (1991), The Metaphor System used to justify war in the Gulf,
Journal of Urban and Cultural Studies, 2(1): 59-72.
Lakoff, G. and M. Johnson (1980), Metaphors we live by. Chicago: University of
Chicago Press.
Lakoff, G. and M. Johnson (1999), Philosophy in the flesh : the embodied mind
and its challenge to Western thought. New York: Basic Books.
Lakoff, G.and M. Turner (1989), More than cool reason: a field guide to poetic
metaphor. Chicago: University of Chicago Press.
Ortony, A. (1979), Metaphor and thought. Cambridge: Cambridge University
Richards, I.A. (1936), The philosophy of rhetoric. New York and London: Oxford
University Press.
Signalling spokenness in personal advertisements on the Web:
The case of ESL countries in South East Asia

Peter K. W. Tan, Vincent B. Y. Ooi and Andy K. L. Chiang

National University of Singapore


The continuing impact of the World Wide Web (or the Web) on everyday life
focuses our attention on the ways in which the notions of speech community,
culture and language are patterned in this mega corpus of all time. This paper
investigates how people in South East Asia in particular Brunei, the
Philippines, Malaysia and Singapore use English in personal advertisements on
the Web. The study is part of a Web corpus project investigating related questions
in computer-mediated communication (see Herring 1996). The corpus is
currently being built and is derived entirely from the Web.
In ESL (English as a Second Language) nations, or outer circle (Kachru
1992) countries, English is often relegated to the position of a neutral and
transactional (as opposed to interactional) language where affect (emotion)
is played down and less developed in the private and personal (as opposed to
public) domains. We might assume English used for informal purposes to be less
developed. Yet, Web gurus recommend the use of spoken, as opposed to written,
norms when writing for the Web. This paper then focuses on how this tension is
Using a combination of a pen-and-paper and corpus-based approach (see
Ooi 2001), we specifically focus on the use of appraisal, attested by Eggins and
Slade (1997) to characterise spoken language. Specifically, we examine a range
of amplification items. We compare the frequencies of the items found in our
personal advertisement sub-corpus and selected written and spoken portions of
the Singapore component of the International Corpus of English (ICE-SIN) and
attempt to account for the patterns discovered.
The results suggest that although South East Asian netspeak is aligned to
spoken language, this alignment is partial.

1. Preamble

This paper arose out of a project at the National University of Singapore whose
main aim was to investigate E-English (Netspeak, English in cyberspace or
computer-mediated communication).1 We have, at the moment, collected the bulk
of the data, which run into 3.6 million words. The question that we asked was
152 Peter Tan, Vincent Ooi and Andy Chiang
whether we could discern a sense of a speech community through examining the
kind of English used (see Herring 1996).

1.1 The corpus

With the help of a research assistant, we targeted data associated with four South
East Asian nations: Singapore, Malaysia (West and East), the Philippines and
Brunei (see Figure 1).
One of the things that distinguish these nations from others in the region,
like Indonesia, Thailand or the Indo-Chinese nations, is that these nations have
undergone the colonial experience under English-speaking colonial powers
Britain and the US. These nations have therefore had a longer history of having
employed the English language and, it might be surmised, a higher likelihood of
having indigenised forms of English.
At the moment, the corpus consists of four sections: (a) news, (b)
electronic discussion groups, (c) personal advertisements and (d) electronic chat.

Figure 1. Map of South East Asia

1.2 Netspeak and the New Englishes

There has been much discussion about Netspeak and a very popular assumption is
that Netspeak is much closer to spoken language than written language. Many
Signalling spokenness in personal advertisements 153
might very easily be led to believe this particularly as Web gurus and style guides
(e.g. Hale and Scanlon 1999) push for more spoken styles. There has been a lot of
sociological, but hardly any linguistic, investigation into the nature of Netspeak.
Crystal (2001) gives a very useful coverage of the issue and his conclusion is that

it is plain that Netspeak has far more properties linking it to writing

than to speech [] Netspeak is better seen as written language
which has been pulled some way in the direction of speech than as
spoken language which has been written down. (Crystal 2001: 47)

At the end of his book, however, he claims that Netspeak is something

completely new [] From now on we must add a further dimension to
comparative enquiry: spoken language vs. written language vs. sign language vs.
computer-mediated language (Crystal 2001: 238).
Baron, discussing the email component of Netspeak in a chapter entitled
Why the Jurys Still Out on Email, concludes that Email is clearly a language
form in flux (Baron 2000: 252) and describes it, like pidgins and creoles, as a
(bilingual) mixed contact system, which therefore accounts for its seemingly
schizophrenic character (part speech, part writing) (p. 258).
What is obvious from our vantage point is that Netspeak is evolving. It
will continue to adapt the linguistic resources already available, but whether the
development will be more in the direction of the spoken or written norms of the
language remains to be seen. We accept Bibers position that it is not always
helpful to see the spoken-written dimension in absolute terms (1988: 25);
however, in a later study comparing spoken and written registers, he was able to
identify a fundamental distinction between written and spoken registers (Biber
2001: 238) based on the way complexity is exploited. The spoken-written
distinction is therefore not just virtual but real (see Collot and Belmore 1996 and
Yates 1996).
In our corpus, though, a further complication arises. For various historical
as well as linguistic reasons, Singapore, Malaysia, Brunei and the Philippines
have traditionally been labelled ESL (English as a Second Language) nations.
Alternatively, employing Kachrus (1992) labels, these nations represent outer
circle countries (as opposed to the inner circle countries of the UK, US,
Canada, Australia, etc. and the expanding circle countries of China, Japan, etc.).
Firstly, in the linguistic ecology of these nations, English is sometimes
relegated to the position of a neutral and transactional (as opposed to
interactional) language (Brown and Yule 1983) where affect (emotion) is
played down. Writing about Kenya and Nairobi (also outer circle countries),
Hudson-Ettle and Schmied also comment on the use of English (or the lack of it)
154 Peter Tan, Vincent Ooi and Andy Chiang
The man or woman on the street speaks Kiswahili and despite the fact
that English is the medium of secondary and tertiary education and
other public domains, the language preferred for conversations is
Swahili. (Hudson-Ettle and Schmied 1999: 4)

We can also therefore see this in terms of languages associated with the public
and private spheres. This can be regarded as being analogous to the position of
Latin in medieval Europe. Like Latin, English is the language employed largely
only in writing and in situations where one is on ones best behaviour.
Yet there are parts of the Web that deal with situations where affect is
important and which veer more towards the private sphere such as personal
advertisements. We might expect a variety more associated with spoken English
to be employed here.
Secondly, where more informal or colloquial versions of English exist,
these tend to be more divergent from the informal varieties of the inner circle.
This stands to reason: standardisation arose out of the need to minimise variation
(see, for example, Bex and Watts 1999; Milroy and Milroy 1999; Crowley 1989),
and therefore standard, written Englishes tend to be more similar to each other
than informal, spoken Englishes.

1.3 Mode fluidity

Social commentators have already noted that genres and text types are not rigid
and unchanging. For example, Fairclough (1994) and others have commented on
the conversationalisation of public discourse like advertising, so that print
advertisements can take on features associated with spoken conversation.
Conversationalisation is, for him, in part to do with shifting boundaries between
written and spoken discourse practices (Fairclough 1994: 260). This observation
is of course not entirely new. Indeed, years earlier, Leech had already commented
on the use of the public-colloquial style for advertising (Leech 1966: 75). And
we are also aware that language associated with computers is fluid and tolerances
are being tested (Bruthiaux 2001).
The question is whether this could be said of personal advertisements on
the Web as well.

2. The focus

In this paper therefore, we will focus on the personal advertisements sub-corpus.

The genre of personal advertisements has received some previous attention (Ooi
2001; Kadir 2000). These studies, though, do not focus on the main question that
we will try to answer in this paper: to what extent are the resources of spoken
Signalling spokenness in personal advertisements 155
English capitalised on (presumably to convey the notion of affect mentioned
above) in South East Asia?

3. Methodology

3.1 Evaluation and appraisal

Clearly, it is not possible here to try to examine many dimensions of spoken

English in the sub-corpus. For the purpose of this paper, a very small section of
the lexico-semantic system of English has been extracted. The notion of
evaluation (e.g. Hunston and Thompson 2001) or appraisal has been receiving
much attention of late. Eggins and Slade (1997) see appraisal (together with
humour and involvement) as elements that characterise casual conversation.
Figure 2 summarises the appraisal system available (Eggins and Slade 1997:
137). (It should be added that the phenomenon of evaluation in narrative has
already been commented on by Labov and Waletzky (1967) much earlier.)

(of text/process)
Appraisal (dis)satisfaction

Judgement social sanction

(behaviour) social esteem
Amplification augment

Figure 2. The Appraisal System

156 Peter Tan, Vincent Ooi and Andy Chiang
3.2 Augmenters and mitigators

We select from Eggins and Slades (1997) appraisal system the sub-system of
amplification. Within that, we select a range of what we will call augmenters and
mitigators. Elsewhere in the literature other labels are used. Carter and McCarthy
(1995) talk about intensifiers and hedges in relation to the Cambridge and
Nottingham Corpus of Discourse in English (CANCODE). Biber (1986a, 1988),
and Conrad and Biber (2001) subdivide each of our categories into two.
Unmarked augmenters are called amplifiers (completely, greatly), whereas
informal ones are labelled emphatics (for sure, a lot). Similarly, unmarked
mitigators are termed downtoners (almost, merely), whereas informal ones are
christened hedges (more or less, sort of). Emphatics and hedges seem to occur
together and are dominant in informal conversation (Biber 1986b). Even with all
four categories taken together, the total mean frequencies for written genres are
distinct from those for conversational genres: 7.7 for academic prose and 10.9 for
romantic fiction, as opposed to 21.8 for face-to-face conversation and 21.0 for
telephone conversation (Biber 1988: 255, 260, 264, 265). The use of augmenters
can also be a feature of Opinionated (as opposed to Objective) style (Biber
1986: 18), one dimension distinguishing spoken and written texts.
The augmenters that we have chosen to examine are: very, a lot, really,
too, ever, incredibly and lah. These items were selected partly on the basis of
Eggins and Slades examples, and partly with the aim of combining intuitively
common items like very and less common items like incredibly. The inclusion of
lah needs further explanation. We follow Gupta (1992) in her analysis of lah as a
pragmatic particle with an assertive function.2 (Pragmatic particles in her analysis
can serve one of three functions: contradictory, assertive or tentative.) Lah as a
pragmatic particle is available in the informal varieties of English in Singapore,
Malaysia and Brunei and is potentially employable by the majority of the
The mitigators that we will examine are: only, just, a bit and somewhat.
To ensure that the selected items represented a combination of more
frequent and less frequent items, we subjected the sub-corpus to a CLAWS
(Constituent Likelihood Automatic Word-Tagging System) tagging and examined
the rank of the adverbs. Multi-word items could not be examined (which meant
the exclusion of a lot and a bit); and of course lah, being a regionalism, could not
be included; incredibly was also not included. The result showed items from a
range of rankings, including a number of high ranking ones (see Table 1). We
were satisfied with our selection of high- and low-ranking augmenters and
Signalling spokenness in personal advertisements 157
Table 1. Ranking of adverbs using CLAWS tagging
Item Rank
just 1
very 3
really 9
only 16
too 21
ever 26
somewhat 151

3.3 Sub-corpora from ICE-SIN as points of reference

As a point of reference we made use of sections of the Singapore Component of

the International Corpus of English (ICE-SIN) to contrast the spoken and written
tendencies. ICE-SIN was chosen because there was no readily available corpus
that included Malaysian, Bruneian and Filipino English. We felt that corpora that
contained mainly British or American English (which are more readily available)
would not be appropriate for comparison with our sub-corpus. Given that, for
historical reasons, Singaporean English shares many features with Malaysian
English and Bruneian English, we felt that it would not be inappropriate to use
ICE-SIN as a reference point for South East Asian English.
For our purpose, we extracted the private dialogues portion of ICE-SIN
(a section totalling 217,121 words) to represent spoken language, and the
informational printed portion (a section totalling 277,154 words) to represent
written language (see Appendix 1, where these varieties have been highlighted in
bold face). The reason for selecting these particular sub-corpora had partly to do
with some compatibility in terms of size (each contains 100 texts), but more
importantly we wanted to focus on stereotypical spoken and written texts because
this contrast is what is in peoples mind when they are encouraged to write for the
Web as if they were speaking.

4. The personal advertisements

The personal advertisements were taken from three websites:

(a) Lavalife (http://www.all-dating-online.com/lavalife.html),

(b) One and Only (http://www.oneandonly.com/) and
(c) Excite Personals (http://exsiteads.freeservers.com/)

Country-specific advertisements were chosen, and we also chose advertisements

from all four sex and orientation options:
158 Peter Tan, Vincent Ooi and Andy Chiang
(a) MSW (men seeking women),
(b) WSM (women seeking men),
(c) MSM (men seeking men) and
(d) WSW (women seeking women).

The composition and structure of the personal ad sub-corpus is summarised in

Table 2. Table 3 gives the sizes of the files in the sub-corpus laid out in the same
way as in Table 2. The highest proportion of advertisements was from the One &
Only site.
The size of the sub-corpus is 110,247 words, with Malaysian, Philippine
and Singaporean adverts contributing roughly equal proportions of text and the
Bruneian adverts contributing a substantially smaller proportion. This is
appropriate because Brunei has a much smaller Web presence than the other
countries not surprising, if we consider Bruneis population of 350,000 in
contrast to Singapores 4 million, Malaysias 22 million and the Philippines 81
million. Singapores relatively smaller population is compensated for by its
higher computer and Internet penetration.

Table 2. Composition of sub-corpus

Data source Lavalife One & Only Excite Classifieds
Brunei WSM
East Malaysia
West Malaysia WSW
Signalling spokenness in personal advertisements 159
The sub-corpus can also be divided according to the sex and orientation of the
advertisers (Table 4). These proportions also represent the kinds of adverts
available on the Web: heterosexual women and homosexual men seem to be more
highly represented in these personal advertisements. Overall, as well, seen against
the sexual orientation of the population as a whole (popular estimates do not give
a figure of more than 10% homosexuals), we can see that homosexual adverts
(representing over 43% of adverts) have a strong presence.

Table 3. Size of the sub-corpus

Data source Lavalife One & Only Excite Country Country
Region Classifieds totals percentages
Brunei 3,956
1,778 6,042 5.5%
East 352
Malaysia 438
549 1,220 1,764
503 438 2,591 35,181* 31.9%*
West 588 3,813 4,362
Malaysia 5281 493
Philippines 436 3,160 1,058
2,198 5,185 9,574
570 5,898 3,388 34,488 31.3%
86 2,717 218
Singapore 1,400 3,769 1,971
2,856 7,246 2,629
2,323 4,797 3,546 34,525 31.3%
109 3,745 134
Website 11,618 66,890 31,728 110,236
Website % 10.5% 60.7% 28.8% 100.0%
*East and West Malaysia together
160 Peter Tan, Vincent Ooi and Andy Chiang
Table 4. Sex and Orientation Figures
Sex/Orientation Words Percentages
MSW 22,228 20%
WSM 40,279 37%
MSM 35,609 32%
WSW 12,120 11%
Total 110,236 100%

5. Results and discussion

Tables 5 and 7 (and the accompanying Figures 3 and 4) show the number of
occurrences and the normalised frequencies of the augmenters and mitigators
selected. We use the abbreviations PA, SP and WR for the personal
advertisements sub-corpus, the selected spoken sub-corpus from ICE-SIN and the
selected written sub-corpus from ICE-SIN respectively. Normalised figures
represent the number of tokens per 10,000 words.

Table 5. Augmenters
tokens normalised tokens normalised tokens normalised
incredibly 1 0.1 0 0.0 2 0.1
lah 2 0.2 1677 77.2 0 0.0
ever 37 3.4 34 1.6 22 0.8
a lot 56 5.1 331 15.2 18 0.6
really 167 15.1 495 22.8 32 1.2
too 176 16.0 334 15.4 113 4.1
very 327 29.7 1087 50.1 270 9.7
Total 766 69.6 3958 182.3 457 16.5
Signalling spokenness in personal advertisements 161



incredibly lah ever a lot really too very

Figure 3. Normalised frequencies of augmenters

All the items were also checked for statistical significance using chi-square (2,
p>0.05); each item in each sub-corpus was checked against the same item for
each of the other sub-corpora. The results for the augmenters are found in Table

Table 6. Significant differences in the distribution of augmenters

incredibly lah ever a lot really too very
PA SP too small* yes yes yes yes no yes
SP WR too small* too small* yes yes yes yes yes
PA WR too small* yes yes yes yes yes yes
*too small indicates that the frequencies were too low for the computation of
statistical significance

We should probably disregard items where all three sub-corpora register very low
normalised frequencies (say, below 5 per 10,000 words): this would push out
incredibly and ever. All the other augmenters have a significantly higher
frequency in SP than in WR. The most dramatic case is lah, where SP had a
normalised frequency of over 77 and WR had a frequency of 0. Thus far, then, we
can say that the frequencies of the augmenters provide a fairly reliable index to
the spokenness or writtenness of a text and therefore confirm the tendencies seen
in Biber (1988).3
The difference in the PA frequencies from the SP and WR frequencies are
also statistically significant with an exception in the case of too, where the
differences in the PA and SP scores are not statistically significant. In most cases,
PA frequencies fall between those of SP and WR. In fact, in the case of too (and
of ever, which we discarded from consideration earlier), the PA frequency
exceeded that of SP. By and large the figures confirm the expectation: that PA
tends towards SP norms but not quite reaching them, in most cases. However,
162 Peter Tan, Vincent Ooi and Andy Chiang
the situation is not always that clear-cut. We could arguably say that the PA
frequency for a lot tends towards WR norms. However, the outstanding item is
lah, which only occurred twice, once in the Malaysian portion of the sub-corpus
and once in the Bruneian portion. There is certainly strong resistance to the use of
lah in personal advertisements in the region.

Table 7. Mitigators
tokens normalised tokens normalised tokens normalised
somewhat 5 0.5 6 0.3 16 0.6
a bit 45 4.1 132 6.1 10 0.4
only 151 13.7 369 17.0 478 17.2
just 398 36.1 1,094 50.4 167 6.0
Total 599 54.3 1,601 73.7 671 24.2


30 SP
20 WR


somewhat a bit only just

Figure 4. Normalised frequencies of mitigators

Table 8. Significant differences in the distibution of mitigators

somewhat a bit only just
PA SP no yes yes yes
SP WR no yes no yes
PA WR no yes yes yes

Tables 7 and 8, together with Figure 4, show the figures for mitigators. What is
interesting is that mitigators do not seem to display as robust a difference between
the sub-corpora as the augmenters. Figure 4 shows three curves more or less
keeping step with each other until we reach just. The distributions of somewhat
and only in the sub-corpora are not always significantly different. Of those that
are a bit and just the pattern seen in the augmenters is replicated. The PA
Signalling spokenness in personal advertisements 163
frequencies lie between those of SP and WR, and closer to SP rather than WR.
This might suggest that some mitigators are more important than others in
showing up the differences between the three modes. Interestingly, the
importance of a mitigator is not dependent on whether it is a high- or low-
frequency item: only, which is more frequent than a bit and less frequent than
just, is not important as a distinguisher between the three modes.
How then can we account for the less dramatic difference in the case of the
mitigators? We could perhaps hypothesise that it might be more important to
mitigate (as opposed to augment) in written texts. This also makes sense in the
light of the distinction between the Opinionated and Objective style (Biber
1986a): augmenters contribute to the Opinionated style but mitigators do not. The
difference here is also understandable if we consider that in a stereotypical
written genre, academic writing, it is important to not over-generalise and delimit
ones conclusions.

6. Conclusion

So, to what extent are the resources of spoken discourse relied on in PA? On the
basis of the augmenters and mitigators selected, we could say that personal
advertisers tend to make use of features of spokenness. The analysis shows,
corroborating Crystal, that the language of PA represents written language which
has been pulled some way in the direction of speech (Crystal 2001: 47). Given
the focus on the interactional function of language, it is not surprising that
advertisers try to take on board these features. This is the case despite the fact that
the sub-corpus is from outer circle countries where English tends to be used for
more transactional functions.
However, the situation is not entirely cut and dried. There is very strong
resistance to the employment of the pragmatic particle lah in personal
advertisements. It is not entirely clear to us why this should be the case, although
it is not impossible that the notion of a borderless cyberspace might discourage
advertisers from employing items like lah that point towards the local or suggest
an insular or parochial outlook. A non-local spoken model might also be
preferred if advertisers are open to responses to non-local sojourners in the
region. It would therefore be premature to say at this stage that Netspeak in South
East Asia is closely associated with the norms of spoken language although it
seems to be an important contributor to the norms associated with personal
We obviously need to examine other parts of the corpus, e.g. the chat data,
where localisation does not seem to be such a taboo.
164 Peter Tan, Vincent Ooi and Andy Chiang

1. We are grateful for the support of the National University of Singapore,

research project ref. no. R-103-000-019-112, for this paper. We are also
grateful to the Department of English Language and Literature, National
University of Singapore, for the use of the ICE-SIN corpus.
2. Besemeres and Wierzbicka (2003) propose an alternative analysis of lah
based on the claim of common ground (like you know). This analysis does not
contradict our understanding of lah as playing an emphasising function.
3. For a rougher guide, we might also add that the word frequency list (see
Appendix 2) which gives I as the most frequent item seems to suggest
spokenness as well (personal pronouns typically rank very high in spoken


Baron, N. (2000), Alphabet to email: How written English evolved and where its
heading. London: Routledge.
Besemeres, M. and A. Wierzbicka (forthcoming), Pragmatics and cognition: the
meaning of the particle lah in Singapore English, Journal of
Pragmatics and cognition 11(1): 1-36.
Bex, T. and R. J. Watts (eds) (1999), Standard English: the widening debate.
London: Routledge.
Biber, D. (1986a), On the investigation of spoken/written differences, Studia
Linguistica 40(1): 121.
Biber, D. (1986b), Spoken and written textual dimensions in English: resolving
the contradictory findings, Language 62(2): 384414.
Biber, D. (1988), Variation across speech and writing. Cambridge: Cambridge
University Press.
Biber, D. (2001), On the complexity of discourse complexity: a multi-
dimensional analysis, in: S. Conrad and D. Biber (eds), Variation in
English: multi-dimensional studies. London: Longman. 215240.
Brown, G. and G. Yule (1983), Discourse analysis. Cambridge: Cambridge
University Press.
Bruthiaux, P. (2001), Missing in action: verbal metaphor for information
technology, English Today 67 (Vol. 17 No. 3): 2430.
Carter, R.A. and M.J. McCarthy (1995), Grammar and the spoken language,
Applied Linguistics 16(2): 141158.
Collot, M. and N. Belmore (1996), Electronic language: a new variety of
English, in: S. Herring (ed.), 1328.
Signalling spokenness in personal advertisements 165
Conrad, S. and D. Biber (2001), Multi-dimensional methodology and the
dimensions of register variation in English, in: S. Conrad and D. Biber
(eds), Variation in English: multi-dimensional studies. London: Longman.
Crowley, T. (1989), Standard English and the politics of language. London:
Palgrave Macmillan.
Crystal, D. (2001), Language and the Internet. Cambridge: Cambridge University
Eggins, S. and D. Slade (1997), Analysing casual conversation. London: Cassell.
Fairclough, N. (1994), Conversationalisation of public discourse and the
authority of the consumer, in: R. Keat, N. Whiteley and N. Abercrombie
(eds), The authority of the consumer. London: Routledge. 253268.
Gupta, A.F. (1992), The pragmatic particles of Singapore colloquial English,
Journal of Pragmatics 18: 3157.
Hale, C. and J. Scanlon (1999), Wired style: principles of English usage in the
digital age. New York: Broadway Books.
Herring, S. (ed.) (1996), Computer-mediated communication: linguistic, social
and cross-cultural perspectives. Amsterdam: John Benjamins.
Hudson-Ettle, D. and J. Schmied (1999), Manual to accompany the East African
component of the International Corpus of English: Background
information, coding conventions and list of source texts. Chemnitz:
Department of English, Chemnitz University of Technology.
Hunston, S. and G. Thompson (2001), Evaluation in text: authorial stance and
the construction of discourse. Oxford: Oxford University Press.
Kachru, B.B. (ed.) (1992), The other tongue: English across cultures, 2nd edn.
Urbana: University of Illinois Press.
Kadir, M.A. (2000), Love @ cyberspace: A corpus-based study of personal ads
on the Web. Unpublished MA dissertation, National University of
Labov, W. and J. Waletzky (1967), Narrative analysis: Oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts.
Seattle: University of Washington Press. 1244.
Leech, G.N. (1996), English in advertising: a linguistic study of advertising in
Great Britain. London: Longman.
Milroy, J. and L. Milroy (1999), Authority in language: investigating standard
English, 3rd edn. London: Routeldge.
Ooi, V.B.Y. (2001), Investigating and teaching genres on the World Wide Web,
in: M. Ghadessy, A. Henry and R. L. Roseberry (eds), Small corpus
studies and ELT: theory and practice (Studies in Corpus Linguistics, Vol.
5). Amsterdam: John Benjamins. 175204.
Yates, S.J. (1996), Oral and written linguistic aspects of computer conferencing,
in: S. Herring (ed.), 2946.
166 Peter Tan, Vincent Ooi and Andy Chiang

Appendix 1. The structure of an ICE corpus

Spoken Dialogues Private (100)

face-to-face conversations (90)
Texts (180)
telephone conversations (10)
Public (80) classroom lessons (20)
broadcast discussions (20)
broadcast interviews (10)
parliamentary debates (10)
legal cross-examinations (10)
business transactions (10)
Mono- Unscripted (70) spontaneous commentaries (20)
logues unscripted speeches: lectures
(100) (30)
demonstrations (10)
legal presentations (10)
Scripted (30) broadcast talks (20)
non-broadcast speeches (10)
Mixed (20) broadcast news (20)
Written Non- Non-professional untimed student essays (10)
Texts Printed writing (20) student examination scripts
(200) (50) (10)
Correspondence (30) social letters (15)
business letters (15)
Printed Inform- Academic humanities (10)
(150) ational writing social sciences (10)
writing (40) natural sciences (10)
(100) technology (10)
Popular humanities (10)
writing social sciences (10)
(40) natural sciences (10)
technology (10)
Reportage press news reports (20)
Instructional writing administrative/regulatory (10)
(20) skills/hobbies (10)
Persuasive writing (10) press editorials (10)
Creative writing (20) novels/stories (20)
Signalling spokenness in personal advertisements 167

Appendix 2. Word frequency list

The top 10 words in the personal advertisements

Item Frequency
I 3,192
and 2,272
to 1,937
a 1,706
the 1,142
am 820
you 811
in 796
me 770
my 769
Textual colligation: a special kind of lexical priming

Michael Hoey

University of Liverpool


Corpus linguistics has not attended much to text-linguistic issues. This paper
argues that lexical choice has a major effect on features such as cohesion, Theme
choice and paragraph division and that corpus investigation can shed light on the
nature of the lexical choices made. It is argued that some lexis has a bias towards
(or against) certain textual functions and that this is an inherent property of such
lexis. It is also argued that lexical choices interlock, creating what I term
colligational prosody.

1. Introduction

Corpus linguistics has for perfectly understandable reasons focused most of its
attention upon lexical and grammatical matters. Although there are a few text-
linguistic and spoken discourse studies that make use of corpus linguistic
techniques (e.g. Hoey 1997; Partington 2003; Partington and Morley 2002), they
are few and far between and are rooted in no explicitly articulated theory. This
paper attempts to redress this lack by articulating a theoretical relationship
between lexis and text-linguistics.
The paper is divided into three uneven parts. In the first, I focus on current
perceptions about the organisation and nature of written discourse, identifying the
features that might be open to investigation in corpora. In the second, I contrast
two theoretically opposite positions, one of which I take to be essentially
incompatible with corpus investigation and the other of which is amenable to
such investigation. Unsurprisingly I shall favour the latter! In the third and much
the longest part, I want to present sample results from a corpus-linguistic
investigation of textual questions. These are hesitantly presented and are intended
only as hints as to how corpus linguistics might proceed. Despite the hesitancy
with which my findings are presented and the weak evidential base on which they
rest, they point to a theoretical suggestion that, if accepted, would place lexis and
text linguistics on a very different footing vis--vis each other.
172 Michael Hoey

2. The nature of text

The following propositions are accepted by a number of text linguists. None

probably have the support of all, but at least none are idiosyncratically my own

Text is interactively produced and processed (see amongst many others

Bakhtin 1973 [1929]; Goodman 1967, 1973; Winter 1971, 1977; Smith
1978; Widdowson 1979; Hoey 1979, 1983, 2001; Goffman 1981,
Nystrand 1986, 1989).1 In other words it presupposes a writer-reader
interaction in which text is the site or the residue/outcome of the
interaction, depending on whether one takes the readers or writers

Text is linearly developed. By this I mean that each sentence builds upon
what has gone before, from the speaker or writers point of view; from the
listener or readers point of view, each sentence that is reached prospects
the sentence or sentences to follow. There are two aspects to this feature of
text. In the first place, the speaker/writer is seeking to meet the
listener/readers expectations and the listener/reader has expectations on
the basis of what the speaker/writer has already said. This point, though
widely agreed, is differently articulated, depending on whose position one
considers. Amongst linguists in whose work one finds some aspect of this
position are Sinclair (1993), Halliday and Hasan (1985), Winter (1971,
1979, 1982), Crombie (1985), Graustein and Thiele (1979, 1987),
Beekman (1970), Beekman and Callow (1974), Bolivar (2001) and Tadros
(1985, 1993). In the second place, this covers the much-described
phenomenon of Theme-Rheme, both as described in the Prague School
(Firbas 1966, 1986; Dane 1974) and within the Systemic-Functional
tradition (most notably, Halliday 1994).

Text is cohesive. Whether this is a by-product of the need to be coherent

(as Morgan and Sellner 1980 have argued) or a prerequisite of coherence
(as was originally argued in Halliday and Hasan 1976) seems irrelevant.
Almost certainly, the relationship works both ways. On occasion, writers
(and more rarely speakers) consciously produce cohesive devices in order
to clarify or emphasise, i.e. to create coherence; one has only to look at the
way that Dickens, for example, exploits repetition for rhetorical effect to
see that cohesion can be a conscious tool. On other occasions, a writers or
speakers coherence is reflected automatically in the language they use,
i.e. in cohesion. Either way, that it is a feature of text cannot be denied and
one, furthermore, that continues to be the subject of study.

Text is chunked. The nature of the chunking is to some extent disputed.

Some posit a strict hierarchical organisation to text (e.g. Graustein and
Textual colligation 173

Thiele 1979, 1987; Mann and Thompson 1986, 1988; van Dijk and
Kintsch 1978). Some posit a patterning or structuring of some kind or
other without assuming that the chunking thereby created accounts for all
texts or all of any particular text (e.g. Labov 1972; Labov and Waletsky
1967; Longacre 1968, 1979, 1983; Hoey 1979, 1983, 2001; Swales 1981,
1990; Halliday and Hasan 1985; Martin 1992). There is a great deal of
commonality amongst these positions but not much actual agreement.
Still, some kind of chunking is acknowledged to exist and is reflected in
our cultural habit of writing books with chapters, sections (in the case of
academic texts) and paragraphs.

Text is shaped in the service of particular communities of users (e.g.

Halliday and Hasan 1986; Ventola 1987; Swales 1990; Martin 1992)
and/or to the advantage of those with vested power in the communities
(e.g. Fairclough 1989, Chouliaraki and Fairclough 2001; Hodge and Kress
1993). The work of the genre analysts in particular warns us that we must
be wary in text-linguistics of over-generalising; some claims can only be
made of a restricted body of data.

To these points of broad agreement in the text-linguistic community, I would

want to add that text is web-like (e.g. Hoey 1991), but since that is not a widely
held view I will not be developing it further here.

3. Two theoretical positions

There seem to be two possible ways of modelling the relationship between lexis
and the features of text I have just outlined. The first is that the relationships
found in text whether they are interactive, linear, cohesive, hierarchical or
structural are independent of the lexis of the language. According to this view,
each sentence is constructed according to the grammar, collocations and
colligations of the language in response to textual needs but without constraints
broader than those particular needs. In other words, each text imposes its own
demands and has its own unique sentence requirements. If this view is correct,
corpus linguistics cannot offer anything useful for text-linguistics except in so far
as it might offer tools for exploring individual spoken or written texts.
The other way of modelling the relationship between lexis and text is to
see textual relationships (interactive, linear, cohesive, hierarchical and structural)
as dependent upon and created by the lexis of the language in a manner not
exhausted by the demands of the individual text. According to this view, each
sentence of every text is constructed along lines that have been laid down by all
the texts that the speaker/writer has encountered in the course of his or her life,
such that the production of a text is in fact in part a reproduction of previous
texts along strictly controlled lines. If this view is correct, corpus linguistics is the
key to the future of text-linguistics.
174 Michael Hoey

The first view is presumably the default position. The second is however, I
hope you will agree, a much more interesting position. Youll reply that reality
hasnt the least obligation to be interesting. And Ill answer you that reality may
avoid that obligation but that hypotheses may not.2
What I want to claim in this paper is that every lexical item is primed for
use in textual organisation. The notion of priming is taken from psychology and
in this context means that our encounters with a word accustom us to expect it to
be used in certain kinds of ways to such an extent that these potential uses
become part of our knowledge of the word and to some extent constrain the way
we are likely to use the word ourselves.
More specifically I want to make the following claims:

1. Every lexical item (or combination of lexical items) may have a positive
or negative preference for participating in cohesive chains.
2. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of Theme in a Theme-Rheme
3. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring as part of a specific type of semantic
relation, e.g. contrast, time sequence, exemplification.
4. Every lexical item (or combination of lexical items) may have a positive
or negative preference for occurring at the beginning or end of an
independently recognised chunk of text, e.g. the paragraph.
5. If a lexical item (or combination of lexical items) has any of the above
preferences, it may only or especially be operative in texts of a particular
type or genre or designed for a particular community of users, e.g.
academic papers.

The positive and negative preferences of a lexical item with regard to the textual
features just described are what I would term its textual colligations. The claims
of course allow for the possibility of a lexical item having not only a positive or
negative preference but also a neutral preference for each of these features. If,
however, the great majority of lexical items in a language were to prove to be
neutral with regard to one of the features, then the specific claim would be
disconfirmed with regard to the feature in question, and if that were to prove true
of all the features, then the more general claim would fall also.
If on the other hand the claims were to prove correct, we could envisage a
description of the language from a text-linguistic point of view including a giant
matrix of all the words of the language looking something like that in Table 1. I
will flesh out this matrix with some real examples near the end of this paper. For
the moment, though, we need to examine the evidence for the claims made above.
Textual colligation 175

Table 1: A fragment of a hypothetical text-colligational matrix

Lexical Lexical Lexical Lexical item
item 1 item 2 item 3 4
Preference for negative positive neutral positive
participating in
cohesive chains
Preference for positive positive positive neutral
occurring as part of
Preference for neutral negative with positive with neutral
occurring as part of a examplifica- affirmation/
specified semantic tion denial
Preference for negative neutral positive for Positive for
occurring at the paragraph paragraph
beginning or end of a initial in initial when
recognisable chunk certain phrases Theme
Constraints on Popular Biography None observed News
operation of fiction

3.1 Claim 1: Every lexical item (or combination of lexical items) may have
a positive or negative preference for participating in cohesive chains

The first claim was that every lexical item may have a positive or negative
preference for participating in cohesive chains, where a cohesive chain is a set of
at least three lexical (and grammatical) items that either co-refer to a single entity
(identity chains) or cross-connect because of their similarity of meaning
(similarity chains) (Hasan 1984; Hasan in Halliday and Hasan 1985; Parsons
1995). I shall give least attention to this claim, partly for reasons of space and
partly because I have presented detailed evidence elsewhere (Hoey, forthcoming).
Here I will simply note that all the following lexical items occur in my corpus as
members of cohesive chains occurring in a number of different texts:3

army, baby, Blair, gay, lake, music, pit, planet, political, spleen

Preliminary investigations suggest that all the following lexical items show no
tendency in my corpus to occur in cohesive chains, despite their all being fairly
frequent words in my corpus. It is of course much harder to establish that
something does not occur and it is a painfully slow process to move from each
concordance line into the original text to check for possible cohesion, so the
following list must be regarded as provisional:
176 Michael Hoey

afterwards, ago, best, comparison, crossroads, particularly, problem,


Orthography is significant in this matter. The lexical item crossroads did not
form cohesive chains; the lexical item Crossroads, on the other hand, which
refers to a defunct British soap, chains freely.
The following words are neutral in my corpus with regard to cohesion;
they do not occur in chains often and when they do, the chains tend to be short,
e.g. The first reason, The second reason:

option, reason, sixty

These lists may not seem particularly surprising and it is tempting to account for
their inclusion in other ways. My point here is that a lack of cohesive potential is
as much a quality of the word surprising as the fact that it is evaluative (as in the
previous sentence) and therefore unlikely to be a topic. Notice, too, that lexical
items that do not have a preference for appearing in cohesive chains are every bit
as common in the English language as those that do appear in cohesive chains
(and in many cases more so), despite the fact that one might have predicted that
infrequency would make it less likely that a word would participate in chains.
It is possible to express claims about the cohesive potential more subtly
than I have done here. The chains of particular words may favour repetitions, co-
hyponyms or pro-forms, for example; see Hoey (forthcoming b) for examples and
details. A crude check on the chaining potential of repetition-favouring words can
be obtained by examining the plot of distribution of a word as calculated for a
word by WordSmith (Scott 1999).

3.2 Claim 2: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of Theme in a
Theme-Rheme relation

The second claim, that every lexical item may have a positive or negative
preference for occurring as part of Theme in a Theme-Rheme relation, like all the
claims, requires more detailed support than I am able to give it here. The
following statistics suggest however that the claim is not meritless. 250 instances
of years were examined, and it was found that 37% occurred as part of Theme.
This is slightly higher than would be expected on the basis of random distribution
across Theme and Rheme, though it is hardly a striking result; what is more
interesting is that the great majority of the instances of Thematic years occur as
part of a fronted Adjunct rather than as part of Subject.4 In other words, when
years is thematised, it is usually marked.
Another example is the distribution across Theme and Rheme of instances
of consequence. 1615 of these were analysed (excluding instances of the rarer
importance sense), and it was found that the word consequence occurs in Theme
43% of the time, a considerably higher percentage than would occur on a random
Textual colligation 177

distribution. Again the Adjunct use seems significant. Almost half of the
occurrences of thematised consequence occurred as part of an Adjunct. The word
sixty occurs in Theme 75% of the time, on the basis of a sample of 294 instances.
Again, orthography is relevant: 60 shows no such tendency.
The claim I have just made can be made more subtly and more complexly.
Some words, I would argue, have a tendency to appear in marked Theme (e.g.
years and consequence, as we have just seen); others have a propensity for
appearing as unmarked Theme, i.e. as Subject. Furthermore, it is possible to
combine this feature with the previous one. All the cases I have given of
Thematic preference have either negative or neutral cohesive preference. This
means that sixty, consequence and years are not going to participate in Thematic
progression (Dane1974). There may prove to be a correlation between Marked
Theme and a negative preference for cohesive chains; this would need
If a lexical item has a positive preference for both Theme and cohesive
chains, it will inevitably have a positive preference for Thematic Progression;
again, there may be a correlation between a preference for unmarked Theme and
a preference for appearing in cohesive chains, though my claim is not of course
dependent upon such a correlation. It follows that as before one might be subtler
and expect some lexical items to be primed for participation in Simple Thematic
Progression or Linear Thematic Progression, etc.

3.3 Claim 3: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring as part of a specific
type of semantic relation

The third claim was that every lexical item (or combination of lexical items) may
have a positive or negative preference for occurring as part of a specific type of
semantic relation, e.g. contrast, time sequence, exemplification. Such relations
may be the relations between clauses or parts of clauses or between larger chunks
of text; they may also reflect relations between speaker and listener, for example
indicating the relation between a speaker or writers utterance and a listener or
readers utterance. I give here just two examples of what this claim is intended to
cover: ago and reason.
An example of a lexical item associated with a semantic relation is ago.
More specifically ago has an association with contrast when it is part of Theme.
Of 65 Thematised instances of ago examined, 23 (35%) were followed by a
contrast and 5 (8%) preceded by a contrast. If 10 instances of not long ago and as
long ago as are removed from the calculation, the percentage associated with
contrast rises to 51%. These are small figures and not too much can be claimed of
them. But informal examination of larger quantities of data suggests that they are
not misleading. When ago is not part of Theme, there is still an association with
contrast but the manifestations are somewhat different (see Hoey forthcoming a).
The word reason may seem a rather obvious choice of word to illustrate
the association of a lexical item with a particular semantic relation, and of course
178 Michael Hoey

reason is intimately associated with the reason-result relation and other similar
relations, but it is not this association that I wish to draw attention to here.
Consider Table 2, which shows the distribution of the different structures of
postmodification of reason. Five postmodifying options are considered: reason
postmodified by a -clause (as in the reason he can continue to do that), reason
postmodified by a that-clause (as in the reason that this lobbying has had little
effect), reason postmodified by a prepositional phrase headed by for (as in part of
the reason for this), reason postmodified by a why-clause (as in another reason
why pop shows are getting better), and reason postmodified by to + V as in any
reason to celebrate).

Table 2: Distribution of postmodified reason structures and their association with

Subject Subject Complement Complement Object Object
reason reason reason reason reason reason
affirmed denied affirmed denied affirmed denied
reason +
clause 698 17 (38) 210 42 14 4
reason +
that clause 77 - 40 9 - 3
reason +
for X 1091 36 (49) 610 392 305 161
reason +
why clause 7 10 (17) 594 629 61 223
reason +
to V 22 3 286 536 732 426

The table shows how these options distribute across three of the functions
available in the clause Subject, Complement and Object and indicates whether
they are associated with an affirmation or a denial. Affirmation occurs when a
reason is asserted. Denial occurs when a reason is declared to be of no
importance, invalid or not known. Examples of reasons affirmed are:

The councils neglect was the reason the flats were falling apart.
The reason that this lobbying has had little impact is that the industry has
failed to construct a convincing case.

The negatives in the latter sentence are of course not denials of the reason but
denials about lobbying and the making of convincing cases.
Examples of reasons denied are:

I see no reason to change it.

Textual colligation 179

There is no good reason why the publishers cant provide a normal

discount to booksellers.

The ratio of positive to negative clauses in general English is 9:1 (Halliday and
James 1993). Where therefore the ratio of affirmation/denial for a particular
syntactic choice is significantly skewed and where the frequency is high as a
proportion of the total number of cases considered, it has been highlighted in the
table. Where it is unclear whether it is the reason or some other aspect of the
clause that is being denied or any other problem of allocation to the categories
arises, the higher figure that results from inclusion of such cases is included in
brackets; it will be noticed that all such cases occur in the Subject options.
Of 7238 instances examined (excluding the doubtful cases), 4747 are
affirming the reason, 2491 denying it, a ratio of close to 2:1. This points strongly
to reason being associated with denial; that means that when reason is used, it
has a good chance of being part of a pre-emptive move by the writer/speaker to
say that s/he does not want to (or cannot) answer the reader/listeners expected
question Why? or that any counter-arguments that might be offered to his/her
position, whether by the reader/listener or by a third party, cannot be supported
with evidence. Either way, the word provides evidence of association with
affirmation-denial (Winter 1979, Williames 1985), a pivotal feature of
writer/reader and speaker/listener relationships.
Looked at more closely, the table allows us to state such an association
more precisely. In the first place, notice that the Subject function is strongly
associated with affirmation (1895:66, a whopping 29:1 ratio of affirmation to
denial), whereas Complement is associated with denial (1740:1608, close to a 50-
50 ratio). (Object has a less marked association with denial.) So if you want to
affirm your reason, put it in the Subject. If you plan to reject it or say that it is
irrelevant or unknown, use the Complement (or Object). This is a useful example
of a complex textual colligation where it is the operation of reason in a particular
grammatical function that has a particular textual implication.
Looked at another way, the different postmodifying structures with which
reason appears also distribute themselves differently between affirmation and
denial. So reason + -clause is associated with affirmation (in a ratio to denial of
15/1); the only structure to come near this weighting towards affirmation is the
relatively infrequent reason + that-clause, with a ratio of 10/1 in favour of
affirmation. On the other hand, reason + why-clause is associated with denial,
there being an absolute majority of cases of denial in all three grammatical
functions. This is another instance of a complex textual colligation, where the
colligation of reason with one or other kind of clause as postmodifier is the
condition that has to be met for a textual colligation to be observable.
Two instances on their own prove nothing, but it is hoped that the two
examples I have given at least elucidate what is meant by the third claim and
suggest it may be worth further investigation.
180 Michael Hoey

3.4 Claim 4: Every lexical item (or combination of lexical items) may have
a positive or negative preference for occurring at the beginning or end
of an independently recognised chunk of text.

The fourth claim was that lexical items may have a positive or negative
preference for occurring at the beginning or end of an independently recognised
chunk of text, e.g. the paragraph. The problem with verifying this claim is of
course that it is not easy to find independently recognised chunks of text that
have validity.
In writing, of course, there is rough chunking associated with
paragraphing, but it is not difficult to demonstrate that paragraphs have no
internal structure (except in so far that several generations of Freshman English
students in the United States have been taught to write paragraphs with a
particular arbitrarily imposed structure and this structure is becoming a self-
fulfilling prophecy). I have argued elsewhere (Hoey 1985) that paragraphs are a
device used by writers to signal to readers how parts of the text relate to each
other. They are not therefore an ideal starting-point for demonstrating the validity
of my sixth claim. Nevertheless, in the absence of other chunking devices, it is
possible to use paragraph boundaries, section divisions and of course the
beginnings of texts to test the claim.
The hypothesis here is that certain Thematised words or phrases have a
preference for occurring at the beginning or end of a paragraph, or a preference
for avoiding such positions. So a particular sentence-initial word might have a
preference for being paragraph-initial. There is no assumption here that the
sentence-initial word has to have a colligational preference for being sentence-
initial in order to be eligible for consideration with regard to the paragraph-initial
claim. It is perfectly feasible that a particular word or phrase might have no
special preference for being sentence-initial or indeed even have preference for
being in a non-sentence-initial position and still have, when in sentence-initial
position, a preference for being paragraph initial.
In Hoey (1997) I report an experiment carried out to discover whether
there is any relationship between paragraphing and the predilection of certain
lexical items for appearing in paragraph-initial position. The experiment took the
form of asking 67 students to paragraph a short passage from a history textbook
that had been previously deparagraphed.5 The passage in question was the

1 Grant was, judged by modern standards, the greatest general

2 of the Civil War. He was head and shoulders above any general on either
3 side as an over-all strategist, as a master of what in later wars
4 would be called global strategy. His Operation Crusher plan, the
5 product of a mind which had received little formal instruction in the
6 higher area of war, would have done credit to the most finished
7 student of a series of modern staff and command schools. He was a
8 brilliant theatre strategist, as evidenced by the Vicksburg campaign,
9 which was a classic field and siege operation. He was a better
Textual colligation 181

10 than average tactician, although, like even the best generals of

11 both sides, he did not appreciate the destruction that the increasing
12 firepower of modern armies could visit on troops advancing across
13 open spaces. Lee is usually ranked as the greatest
14 Civil War general, but this evaluation has been made without
15 placing Lee and Grant in the perspective of military
16 developments since the war. Lee was interested hardly at all
17 in global strategy, and what few suggestions he did make to
18 his government about operations in other theatres than his own
19 indicate that he had little aptitude for grand planning.
20 As a theatre strategist, Lee often demonstrated more brilliance
21 and apparent originality than Grant, but his most audacious plans were
22 as much the product of the Confederacys inferior military
23 position as of his own fine mind. In war, the weaker side
24 has to improvise brilliantly. It must strike quickly, daringly,
25 and include a dangerous element of risk in its plans. Had Lee
26 been a Northern general with Northern resources behind him he would
27 have improvised less and seemed less bold. Had Grant been
28 a Southern general, he would have fought as Lee did.
29 Fundamentally Grant was superior to Lee because in a modern
30 total war he had a modern mind, and Lee did not. Lee
31 looked to the past in war as the Confederacy did in spirit.
32 The staffs of the two men illustrate their outlooks. It would
33 not be accurate to say that Lees general staff were
34 glorified clerks, but the statement would not be too wide
35 off the mark

The students were not told how many breaks to make; this was left to their
discretion. The number of breaks varied from one to eight, with slightly under
half of the students making three breaks. The choices of paragraph break made by
all informants is given in Table 3, choices being represented in terms of the lines
in which the sentences begin.
Some sentences were clearly seen as strong candidates for beginning a
paragraph (e.g. the sentence beginning on line 13) while others were not chosen
by any informant (e.g. the sentence beginning on line 24). Equally clearly there
was no unanimity as to where to break, with no paragraph boundary finding
universal approval. This undermines any claim that might be made for paragraphs
having a structural status, unless of course my students are thought to have been
deficient in this respect (and since they were not for the most part deficient at the
level of the clause or the group, that would itself be of interest). Instead it points
to there being some non-structural explanation. One such explanation lies in the
textual relations in the passage, and this explanation I have explored fully
elsewhere (Hoey 1985, 1997). However, a more interesting (and not
incompatible) explanation is, as mentioned above, that the students were being
cued by the lexical items that begin the sentences; in other words, it is possible
that the students were choosing to paragraph in one place rather than another
because of the way the sentences began.
182 Michael Hoey

Table 3: The distribution of paragraph break choices across the range of possible
break points
Line on which Number of informants beginning % of informants
sentence starts a paragraph at this point making the choice
2 (He..) 0 -
4 (His..) 11 17%
7 (He..) 22 33%
9 (He..) 0 -
13 (Lee..) 62 94%
16 (Lee..) 7 11%
20 (As..) 32 49%
23 (In..) 32 49%
24 (It..) 0 -
25 (Had..) 2 3%
27 (Had..) 1 2%
29 (Fundamentally..) 42 64%
30 (Lee..) 0 -
32a (The..) 13 20%
32b (It..) 5 8%

With this hypothesis in mind, I set about examining the lexical items that
began each of the sentences that were candidates for beginning a paragraph.
When I undertook that work, my corpus was greatly smaller than it now is, and
the numbers looked at were not large typically between 40 and 100 instances,
and in a couple of cases fewer than this. What I was concerned to do, however,
was focus my hypothesis, not prove by weight of numbers the paragraph priming
of certain words or phrases.
The results of my analysis were provisionally supportive of the hypothesis.
To begin with, my analysis suggested that exactly 50% of single surnames (like
Grant and Lee) in sentence-initial position are also paragraph initial; there are of
course four places where a surname is the first word of a sentence (lines 1, 13, 16,
30) (and a further four where it appears within Theme lines 20, 25, 27, 29).6
There did not however appear to be any tendency for single surnames to be
sentence-initial; rather the opposite. So we have the hypothesis of a negative
priming for surnames in Theme but a positive priming for Thematised surnames
in paragraph-initial position.
The exact opposite appeared for he, which begins three sentences in the
passage (lines 2, 7, 9). The evidence pointed strongly towards he having a strong
priming for Theme. Of 100 instances consulted 30 were sentence-initial, and this
of course discounts instances of he occurring within Theme but not in 1st position
in the sentence. On the other hand there was no tendency for sentence-initial he to
be also paragraph-initial. In fact he occurred in paragraph-initial position in my
corpus two and a half times less often than would have been expected on the basis
Textual colligation 183

of random distribution. So we have the hypothesis of a positive priming for he in

Theme but a negative priming for beginning a paragraph.
The data for his (line 4) illustrated a third possibility. In contrast with he,
his showed a negative priming for being sentence-initial. Those few instances that
were sentence-initial showed an equal negative priming for being paragraph
initial. So we have the hypothesis of a negative priming for both sentence-initial
and paragraph-initial position.
Phrases beginning as a are not all of the same kind. I divided them into
three moderately distinct categories phrases with a non-human nominal group,
e.g. as a legacy of, as a consequence, phrases with a human referent for the
noun acting as head of the group but without implication of function or role, e.g.
as a boy, as a Frenchman, and finally phrases with a human referent for the head
noun that described a function or role, e.g. as a biologist, as a musician. One of
the candidate sentences for beginning a paragraph begins with the third class of
as a(n) X (i.e. As a theatre strategist, line 20). Phrases of this third category were
found, admittedly on the basis of few data, to be positively primed for paragraph-
initial position; no calculation was made of their tendency to be sentence-initial.
It proved very difficult to explore the textual priming of in war (line 23).
In the small corpus I was working with at that time, the phrase in war itself only
occurred once, in second position in a paragraph supporting a generalisation, and
once as in a war in paragraph-initial position. A trawl more recently of the 100
million word supplemented Guardian corpus still only threw up 16 examples of
the phrase. Of these 5 were paragraph-initial, 9 were non-initial and 2 occurred in
quoted speech too short to be subject to paragraphing. The average length of the
paragraphs was five sentences. Trivially sparse though these data are, they
suggest that in war may have a positive bias towards being sentence-initial, a
conclusion I reached when I initially analysed my data on the basis of the
distribution of in plus abstract noun. Again, then, if the hypothesis were to be
supported by better data, we would be looking at a phrase with a strong aversion
to being part of Theme and a strong preference for paragraph-initial position in
the rare circumstances of its being thematised.
Looking next at it functioning as pronoun (line 24), we again have
evidence of a negative colligation. I examined 149 instances of it functioning as
anaphoric pro-form in sentence-initial position. Of these a mere 8% (12) were
also in paragraph-initial position, compared with the 25% that might have been
anticipated if its positioning were the result of random distribution. It is therefore
three times less likely to appear in paragraph-initial position than would be
accountable for in terms of chance.
There were only 29 instances of Had X been (lines 25 and 27) in the
corpus I was using at that time. One in six of these began a paragraph. Again,
though based on few data, this points to a negative colligation with paragraph-
initial position.
With fundamentally (line 29), we again have a lexical item that is
negatively primed for Theme. With my original data I had to use a number of
suspect strategies in order to have enough data to analyse; re-examining the
184 Michael Hoey

lexical item with my current corpus of 100 million words, and with a consequent
786 instances of fundamentally, there were still only 20 instances in sentence-
initial position for investigation. Clearly the word does not like to begin
sentences. On the basis of my original data, I came to the conclusion that
fundamentally had a positive colligational preference for paragraph-initial
position, with 50% more likelihood of beginning a paragraph than was explicable
in terms of random distribution. With the still thin but better data of the later
corpus, I come to the same conclusion. Of the 20 sentence-initial cases, six begin
paragraphs and 13 do not; the final instance begins a one-sentence paragraph, and
this is discounted in the analysis. The average length of the paragraphs is 5
sentences (though this is distorted upwards by one particularly long paragraph).
Again, fundamentally turns out to begin paragraphs 50% more often than one
might expect.7
On the basis of this analysis, it was possible to correlate the positive and
negative priming for paragraph-initial position with the decisions that the students
had made.8 The correlation can be seen in Table 4. I have starred those results
which seem anomalous. It will be seen that for the most part there is a good
match between actual student choice of paragraph boundary and predicted
boundary breaks on the basis of corpus evidence. Where there is a discrepancy,
there are good reasons for it. In terms of the structure of the passage, the sentence
starting at line 4 represents a deviation from the smooth parallelism of the
comparison. Those paragraphing at line 4, despite the negative colligation of his
for paragraph initiation, were doing so to mark this deviation. The rather smaller
number who broke at line 7 were breaking, again in defiance of the negative
colligation of he, in order to mark a return to the parallelism. Those breaking, on
the other hand, at line 16 were doing so with no text-structural grounds for their

Table 4: Informant choices compared with textual colligation

Sentence-initial word or Line Paragraph-initial % of informants making a paragraph
phrase no colligation break at this point (67 informants)
Grant 1 Positive 100% (by default)
He 2 Negative 0%
His 4 Negative 17%
He 7 Negative 33% *
He 9 Negative 0%
Lee 13 Positive 94%
Lee 16 Positive 11%
As a NG (human function) 20 Positive 49%
In NG (generalised noun) 23 Positive 49%
It (pronoun) 24 Negative 0%
Had NG Vn 25 Negative 3%
Had NG Vn 27 Negative 3%
Fundamentally 29 Positive 64%
Lee 30 Positive 0% *
illustrate 32a Neutral 20%
It (anticipatory) 32b Positive 8% *
Textual colligation 185

decision; the only thing going for such a break is the positive colligation of names
with paragraph initiation. The failure to break at line 30 is less interesting; the
fact that the passage is coming to an end at this juncture would have been a
deterrent to some informants, irrespective of the merits of a potential break at this
To test whether these claims were correct and to discover whether the
colligations were having any effect on the judgements of students, I then doctored
the original text slightly as follows; changes are indicated in bold:

1 Grant was, judged by modern standards, the greatest general

2 of the Civil War. He was head and shoulders above any general on either
3 side as an over-all strategist, as a master of what in later wars
4 would be called global strategy. Lees Operation Crusher plan, the
5 product of a mind which had received little formal instruction in the
6 higher area of war, would have done credit to the most finished
7 student of a series of modern staff and command schools. He was a
8 brilliant theatre strategist, as evidenced by the Vicksburg campaign,
9 which was a classic field and siege operation. He was a better
10 than average tactician, although, like even the best generals of
11 both sides, he did not appreciate the destruction that the increasing
12 firepower of modern armies could visit on troops advancing across
13 open spaces. Lee is usually ranked as the greatest
14 Civil War general, but this evaluation has been made without
15 placing Lee and Grant in the perspective of military
16 developments since the war. He was interested hardly at all
17 in global strategy, and what few suggestions he did make to
18 his government about operations in other theatres than his own
19 indicate that he had little aptitude for grand planning.
20 He often demonstrated more brilliance and apparent originality
21 as a theatre strategist than Grant, but his most audacious plans were
22 as much the product of the Confederacys inferior military
23 position as of his own fine mind. The weaker side
24 has to improvise brilliantly in war. It must strike quickly, daringly,
25 and include a dangerous element of risk in its plans. Had Lee
26 been a Northern general with Northern resources behind him he would
27 have improvised less and seemed less bold. Had Grant been
28 a Southern general, he would have fought as Lee did.
29 Fundamentally Grant was superior to Lee because in a modern
30 total war he had a modern mind, and Lee did not. Lee
31 looked to the past in war as the Confederacy did in spirit.
32 The staffs of the two men illustrate their outlooks. It would
33 not be accurate to say that Lees general staff were
34 glorified clerks, but the statement would not be too wide
35 off the mark

The changes were designed to test whether students had been influenced by the
textual colligations. I hypothesised that the change to line 4 would reinforce the
structural pressure to break at this point and that there would be a consequent
increase in the popularity of this sentence as a paragraph boundary. I likewise
186 Michael Hoey

hypothesised that removal of the positive colligation at the beginning of line 16

would render it unattractive as a candidate boundary. I further hypothesised that
the removal of positive colligations from the Themes of the sentences beginning
in lines 20 and 23 would reduce their attractiveness as potential paragraph breaks.
Having made these changes, I gave the task to a set of 32 informants drawn from
the same undergraduate degree programme; needless to say, the second set of
informants did not include any from the first set. Their paragraphing decisions are
recorded in Table 5. The first column represents the potential paragraph breaks
(by line number); I have indicated where the potential break in question has had
its wording altered. The second column indicates the number of informants
choosing to break at this point and the third represents this as a percentage of the
cohort; the final column gives the results from the original experiment for
purposes of comparison.
Table 5 contains broad support for my position. The alteration at line 4
from his with its negative priming for paragraph-initial position to Grant with its
positive priming for such positioning brings with it a surge in the percentage of
informants choosing to paragraph at this point (and a corresponding reduction in
the percentage paragraphing at line 7). The alteration at line 16 in the reverse
direction, i.e. from Proper noun to pronoun, is associated with a reduction in the

Table 5: A comparison of the two cohorts of informants in respect of their

paragraphing decisions on the original and altered de-paragraphed text
Line Number of informants % of informants choosing % of original informants
choosing this point as a this point as a paragraph choosing this point as a
paragraph break break paragraph break
2 0 - -
4 12 38% 17%
7 6 19% 33%
9 1 3% -
13 31 97% 94%
16 1 3% 11%
20 7 22% 49%
23 19 59% 49%
24 0 - -
25 5 16% 3%
27 1 3% 2%
29 19 59% 64%
30 2 6% -
32a 5 16% 20%
32b 1 3% 8%
Textual colligation 187

number of informants paragraphing there. The change at line 20 removing the

positively primed fronted adjunct as a X and replacing the proper noun with a
pronoun sees a halving of the proportion of people choosing to paragraph at this
juncture despite there being good text-linguistic grounds for making such a break.
All these changes in the popularity of the relevant line breaks are in line with
Claim 4.
One result is not as expected. The removal of the fronted adjunct In X sees
an increase rather than a decrease in the percentage of people choosing to break at
line 23. The increase at first sight seems to provide counter-evidence for the claim
advanced in this paper. On closer examination, however, the increase is
supportive of the claim, not challenging to it. The reason is that the structure the
+ adjective + noun turns out to be an even stronger paragraph-initiator than in X.
It is a relatively rare structure. In a concordance of the of 1548 lines created from
seven BNC files, there were only 116 instances with this structure (7.5%). Of
these 116 instances, 51 (44%) were either paragraph initial or text initial,
approximately twice as many as would have been expected on the basis of
random distribution. So the move of In war to non-initial position only had the
effect of bringing to the front of the sentence a structure even more associated
with paragraph initiation than the structure it replaced.
The association of certain words or phrases with paragraph initiation is not the
only kind of chunking that can be attested. Some words or phrases have a strong
tendency to be associated with text initiation. In my corpus, for example, sixty
and today both have strong tendencies to appear in text-initial sentences; in the
case of sixty it also has a tendency to appear at the beginning of its sentence.
Hoey (2000) describes a small-scale experiment to investigate this phenomenon,
making use of a jumbled version a short text by Ingmar Bergman.
Both the text-initiation experiment and the paragraph-initiation experiment
described here (and in Hoey 1997) only hint at the way forward (apart from any
worry one might have about the experimental designs used) because a corpus of
100 million words is quite small when one is counting paragraphs and tiny when
one is counting texts. 100 million words constitutes very approximately only one
million paragraphs and (equally approximately) a mere 200,000 short texts, and a
corpus of one million words would for many purposes be regarded as a modest
corpus while a corpus of 200,000 words would normally be regarded as barely
adequate for all but the most common words in the language. Nevertheless what
evidence there is provides support for the claim that lexical items are primed
(positively or negatively) to appear in paragraph initial or even text initial

3.5 Claim 5: Textual colligations may only or especially be operative in

texts of a particular type or genre or designed for a particular
community of users, e.g. academic papers

The final claim will be given little attention, but it is an important one. This is
that all of the above claims should be regarded as domain-specific. In other
188 Michael Hoey

words, a word is not primed for textual use in all contexts but only under certain
conditions. Thus, for example, ago may, as I have argued here, be associated with
contrast in newspaper text and it might be it has a similar association in academic
articles; a corpus of fictional narrative on the other hand would be unlikely to
throw up this textual colligation. It might well be that both news articles and
fictional narratives favour the word appearing in text-initial position certainly I
have evidence for this in news texts but it is extremely doubtful whether there is
any such association in advertisements. And so on. Textual colligation claims
must be tied to particular genres, text-types, domains, communities of users
(defined temporally as well as in terms of employment and place) and the like. It
is probably this property of domain specificity that has led to textual colligation
being overlooked in corpus studies hitherto, in that large general corpora are ideal
for picking up patterns across a wide range of domains but have to be used
carefully to pick up features true of only certain kinds of text.

4. Colligational prosody

Once we recognise that our generalisations must be bounded, it is possible to

produce a fragment of a matrix of the kind hypothetically posited (as Table 1) at
the outset of this paper. Thus the phrase sixty years ago today can be represented
as shown in Table 6.

Table 6: A fragment of a text-colligational matrix

sixty years ago today
Preference for negative positive negative negative
participating in
cohesive chains
Preference for occurring positive weakly positive neutral
as part of Theme positive
Preference for occurring contrast contrast or contrast contrast
as part of a specified change
semantic relation
Preference for occurring positive for positive for positive for positive for
at the beginning or end paragraph paragraph paragraph paragraph
of a recognisable chunk initial and text initial (when initial in initial (when
initial (when Theme) certain phrases Theme)
Theme) (when Theme)
Constraints on Feature Feature Feature Feature
operation of preferences articles articles articles articles

It will be observed that the individual words that make up the phrase sixty years
ago today share a number of properties. Thus sixty, ago and today share the
property of having negative preference for cohesive chains and that sixty, years
and ago share a preference for being Thematised. We can label this colligational
Textual colligation 189

prosody and its presence is both explained by and helps to explain the power of
Table 7 represents the colligational prosodies of our chosen phrase.

Table 7: Colligational prosody in the phrase sixty years ago today

sixty years ago today
Preference for participating in negative negative negative
cohesive chains
Preference for occurring as positive weakly positive
part of Theme positive

Preference for occurring as contrast contrast or contrast contrast

part of a specified semantic change
Preference for occurring at positive for positive positive for positive for
the beginning or end of a paragraph for paragraph paragraph
recognisable chunk initial paragraph initial initial
and text initial and text initial and text initial
initial in certain (when Theme)
(when Theme) phrases (when
Constraints on operation of Feature articles Feature Feature articles Feature
preferences articles articles

5. Conclusions

I have been arguing in this paper for a new perspective on text-linguistics, one
that is rooted in the lexical item, not, as previously, a perspective that sees lexis as
a network in the text contributing to its cohesion or as contributing to the
signalling of the text organisation (though this perspective is not rendered
obsolete), but a perspective that makes no distinction between the description of
the text and the description of its component lexis. I have tried to show that the
properties of text can all be tackled through the concept of textual colligation.
More specifically I have argued that lexis is primed for textual use, such
that the choice of a lexical item is simultaneously the choice of its primings. Any
lexical item is primed positively or negatively with respect to cohesion, semantic
relations in the text, Theme and textual divisions. This is not to say that the
choice of a lexical item compels certain textual developments but it certainly
makes those developments more likely. The case of paragraphing in particular
suggests that some thorny text-linguistic problems might be amenable to solution,
or at least clarification, if a lexical perspective is adopted. Just as importantly, the
work I have reported here, if supported in subsequent investigations, suggests that
a corpus-centred account of the lexical item that stops at the phrase may be
unnecessarily limited. A full account of the word may be some way off yet.
190 Michael Hoey


1. Dubois (2002) provides powerful evidence of this; one of his informants

comments on an Epistle of St Paul as he reads it without apparent awareness
of an audience.
2. The sharp-eyed and knowledgeable will recognise that the last two sentences
of this paragraph are a direct quotation from Donald A. Yates translation of
Death and the Compass by J.-L. Borges. They therefore perfectly illustrate
the point about writing being reproduction. The speaker of these sentences in
the story is however led to his doom by false hypotheses so we cannot assume
that the sentiments are safe or Borges own.
3. Here and henceforward the evidence is drawn from concordances created out
of a corpus of 100 million words, made up predominantly of Guardian
newspaper data (approximately 96 million) with a topping up from the BNC
and a Liverpool-constructed database of spoken English of approximately _
million words.
4. Given that there are approximately twice as many tokens in Rheme as in
Theme, a lexical item could be said to occur in accordance with our
expectations if it occurs in Theme a third of the time. Random distribution
would suffice as an explanation of the occurrence of a lexical item in Theme
or Rheme if the occurrence of the lexical item in Theme and Rheme was
unaffected by the nature of the lexical item itself.
5. The passage, from Lincoln and His Generals by T. Harry Williams, was
originally selected and deparagraphed by Richard Young and Alton Becker
and their work was reported in a mimeographed paper. A more general
account of their research was reported in Koen et al. (1969). Although the
research reported here used many more informants than did theirs and my
findings are different from (though not unrelated to) their findings, I wish to
pay tribute to their pioneering work without which I would certainly never
have considered exploring these matters. The line numbering and line breaks
are as in the original Young and Becker experiment. A discussion of their
work and this experiment can be found in Hoey (1985).
6. The full details of my original analysis can be found in Hoey (1997).
7. In addition I analysed paragraph-initial cases of more fundamentally, most
fundamentally, and but fundamentally. There were 31 of these and 19 of them
also began paragraphs.
8. I also analysed the paragraph-initial properties of illustrate and anticipatory it,
but since these are not psychologically likely break points because of the
closeness to the end of the passage, I have not dwelt on them here. For what it
is worth, on the basis of few data, illustrate seemed to have a weak preference
for paragraph-initial position and an equally weak preference for paragraph-
final position; it would probably be best characterised as neutral with regard
Textual colligation 191

to paragraph positioning. Anticipatory it showed quite a strong tendency to be

paragraph-initial, as strong as fundamentally.


Bakhtin, M. (1973), Problems of Dostoevskys poetics [1929], translated by R. W.

Rotsel. Ann Arbor, Michigan: Ardis.
Beekman, J. (1970), Propositions and their relations within a discourse, Notes
on Translation. 37: 6-23.
Beekman, J. and J. Callow (1974), Translating the word of God. Michigan:
Zondervan Press.
Bolivar, A. de (2001), The negotiation of evaluation in written texts, in: M.
Scott and G. Thompson (eds), Patterns of text, Amsterdam: John
Benjamins. 129-158.
Chouliaraki, L. and N. Fairclough (2001), Discourse in late modernity: rethinking
critical discourse analysis. Edinburgh: Edinburgh University Press.
Crombie, W. (1985), Process and relation in discourse and language learning.
Oxford: Oxford University Press.
Dane, F. (1974), Functional sentence perspective and the organization of the
text, in: F. Dane (ed.), Papers on functional sentence perspective,
Prague: Academia. 105-28.
Dijk, T. van and W. Kintsch (1978), Cognitive psychology and discourse:
recalling and summarizing stories, in: W.U. Dressler (ed.), Current trends
in textlinguistics, Berlin: Walter de Gruyter. 61-80.
Dubois, J. (2002), What is (natural) discourse? Implications for spoken corpus
research. Paper presented at ICAME 2002 (the 23rd International
Conference on English Language Research on Computerized Corpora of
Modern and Medieval English), Gteborg, 22-26 May 2002.
Fairclough, N. (1989), Language and power. London: Longman.
Firbas, J. (1966), Non-thematic subjects in contemporary English, Travaux
Linguistiques de Prague 2: 239-256.
Firbas, J. (1986), Given and new information and some aspects of the structures,
semantics and pragmatics of written texts, in: C.R. Cooper and S.
Greenbaum (eds), Studying writing: linguistic approaches, Written
Communication Annual, Vol. 1, London/Beverley Hills, Cal.: Sage. 40-71.
Goffman, E. (1981), Forms of talk. Philadelphia: University of Pennsylvania
Goodman, K. (1967), Reading: a psycholinguistic guessing game, Journal of
the Reading Specialist 6: 126-135.
Goodman, K. (1973), On the psycholinguistic method of teaching reading, in: F.
Smith (ed.), Psycholinguistics and reading, New York: Holt, Rinehart and
Winston. 177-182.
192 Michael Hoey

Graustein, G. and W. Thiele (1979), An approach to the analysis of English

texts, Linguistiche Studien A55: 3-15.
Graustein, G. and W. Thiele (1987), Properties of English texts. Leipzig: VEB
Verlag Enzykapadie Leipzig.
Halliday, M.A.K. (1994), An introduction to functional grammar (2nd ed.).
London: Edward Arnold.
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London: Longman.
Halliday, M.A.K. and R. Hasan (1985), Language, context and text: Aspects of
language in a social-semiotic perspective. Geelong: Deakin University
Press (republished in 1989 by Oxford University Press).
Halliday, M.A.K. and Z.L. James (1993), A quantitative study of polarity and
primary tense in the English finite clause, in: J. Sinclair, M. Hoey and G.
Fox (eds), Techniques of description: spoken and written discourse. A
festschrift for Malcolm Coulthard, London: Routledge. 32-66.
Hasan, R. (1984), Coherence and cohesive harmony, in: J. Flood (ed.),
Understanding reading comprehension, Delaware: International Reading
Association. 181-219.
Hodge, R. and G. Kress (1993), Language as ideology (2nd ed.). London:
Hoey, M. (1979), Signalling in discourse. Discourse Analysis Monographs No 6,
Birmingham: ELR, University of Birmingham.
Hoey, M. (1983), On the surface of discourse. London: George Allen and Unwin
(reprinted in Reprints in Systemic Linguistics series, University of
Nottingham, 1991).
Hoey, M. (1985), The paragraph boundary as a marker of relations between the
parts of a discourse, M.A.L.S. Journal 10: 96-107.
Hoey, M. (1991), Patterns of lexis in text. Oxford: Oxford University Press.
Hoey, M. (1997), The interaction of textual and lexical factors in the
identification of paragraph boundaries, in: M. Reinhardt and W. Thiele
(eds), Grammar and text in synchrony and diachrony in honour of
Gottfried Graustein, Frankfurt am Main: Vervuert Verlag & Madrid:
Iberoamericana. 141-67.
Hoey, M. (2000), The hidden lexical clues of textual organisation, in: L.
Burnard and T. McEnery (eds), Rethinking language pedagogy from a
corpus perspective, Frankfurt: Peter Lang, 31-42.
Hoey, M. (2001), Textual interaction. London: Routledge.
Hoey, M. (forthcoming a), The textual priming of lexis, to appear in G. Aston
(ed), Proceedings of TALC, Bertinoro, 2002.
Hoey, M. (forthcoming b), Lexical priming and properties of text, to appear in A.
Partington, J Morley and L Harman (eds), Corpora and Discourse. Bern:
Peter Lang.
Koen, F., R. Young and A. Becker (1969), The psychological reality of the
paragraph, Journal of Verbal Learning & Verbal Behavior, 8.1: 49-53.
Textual colligation 193

Labov, W. (1972), Language of the inner city: Studies in the Black English
vernacular. Philadelphia, Pa: University of Pennsylvania Press.
Labov, W. and J. Waletzky (1967), Narrative analysis: oral versions of personal
experience, in: J. Helm (ed.), Essays on the verbal and visual arts,
Seattle: University of Washington Press. 25-42.
Longacre, R.E. (1968), Discourse, paragraph and sentence structure in selected
Philippine languages. S.I.L. Publications in Linguistics & Related Fields,
No 21, Vols. 1 & 2. Dallas, Texas: Summer Institute of Linguistics
Longacre, R.E. (1979), The paragraph as a grammatical unit, in: T. Givn (ed.),
Discourse and syntax, New York: Academic Press. 115-34.
Longacre, R.E. (1983), The grammar of discourse. New York: Plenum Press.
Mann, W.C. and S.A. Thompson (1986), Relational processes in discourse,
Discourse Processes 9.1: 57-90.
Mann, W.C. and S.A. Thompson (1987), Rhetorical Structure Theory: a theory of
text or g a n i z a t i o n . Monica del Rey, Ca: Information Science
Institute/University of Southern California.
Martin, J.R. (1992), English text: system and structure. Amsterdam: John
Morgan, J.L. and M.B. Sellner (1980), Discourse and linguistic theory, in: R.J.
Spiro (ed.), Theoretical issues in reading comprehension: Perspectives
from cognitive psychology, linguistics, artificial intelligence and
education, Hillside, N.J.: Lawrence Erlbaum Associates. 165-200.
Nystrand, M. (1986), The structure of written communication: Studies in
reciprocity between writers and readers. Orlando: Academic Press.
Nystrand, M. (1989), A social interactive model of writing, Written
Communication 6.1: 66-85.
Parsons, G. (1995), Measuring cohesion in English texts: the relationship
between cohesion and coherence. Ph.D. Thesis, University of
Partington, A. (2003), The linguistics of political argument: The spin-doctor and
the wolf-pack at the White House. London: Routledge.
Partington, A. and J. Morley (2002), From frequency to ideology: comparing
word and cluster frequencies in political debate. Paper given at the 5th
TALC (Teaching and Language Corpora) conference, Bertinoro, 26-31
Scott, M. (1999), WordSmith Tools, Version 3, Oxford: Oxford University Press.
Sinclair, J.McH. (1993), Written discourse analysis, in: J. Sinclair, M. Hoey and
G. Fox (eds), Techniques of description: spoken & written discourse. A
festschrift for Malcolm Coulthard, London: Routledge. 6-31.
Smith, F. (1978), Understanding reading (2nd ed.). New York: Holt, Rinehart &
Swales, J. (1981), Aspects of article introductions. Aston ESP Monographs No 1,
Birmingham: Aston University.
194 Michael Hoey

Swales, J. (1990), Genre analysis: English in academic and research settings.

Cambridge: Cambridge University Press.
Tadros, A. (1985), Prediction in text. Discourse Analysis Monographs,
Birmingham: ELR, University of Birmingham.
Tadros, A. (1993), The pragmatics of text averral and attribution in academic
text, in: M. Hoey (ed.), Data, description, discourse, London:
HarperCollins. 98-114.
Ventola, E. (1987), The structure of social interaction. London: Frances Pinter.
Widdowson, H. (1979), The process and purpose of reading, in: Explorations in
applied linguistics, Oxford: Oxford University Press. 173-84.
Williames, J. (1985), The interactive nature of the newspaper letter, M.A.L.S.
Journal, New Series 10: 108-140.
Winter, E. (1971), Connection in science material: a proposition about the
semantics of clause relations, C.I.L.T Papers and Reports No 7 (London:
Centre for Information on Language Teaching and Research for British
Association for Applied Linguistics). 41-52.
Winter, E. (1977), A clause-relational approach to English texts: a study of some
predictive lexical items in written discourse, Instructional Science 6.1: 1-
Winter, E. (1979), Replacement as a fundamental function of the sentence in
context, Forum Linguisticum 4.2: 95-133.
Winter, E. (1982), Towards a contextual grammar of English. London: George
Allen & Unwin.
Adverbials in IT-cleft constructions

Hilde Hasselgrd

University of Oslo


On the basis of material from the International Corpus of English this paper
presents a study of IT-clefts with an adverbial in cleft position. Most of these IT-
clefts are of the informational-presupposition type (Prince 1978), i.e. the cleft
clause conveys new information. Various textual functions of the IT-clefts are
explored, using the classification of Johansson (2002). Unexpectedly, the function
of giving contrastive focus to the cleft constituent does not seem to be
predominant in this material. An alternative hypothesis is explored: the IT-cleft
construction is seen primarily as a thematizing device, whereby the cleft
constituent receives thematic focus (which may imply contrast) and the theme-
rheme division is made particularly explicit.

1. Introduction

The IT-cleft construction has been studied both as a focusing device (e.g. Prince
1978, Gundel 2002) and as a thematizing device (e.g. Gmez-Gonzlez 2000).
The construction is of interest in studies of information structure because it
allows a speaker/writer to spread the information of a single proposition over two
clauses and, consequently, two information units. It is normally assumed that the
cleft construction is a means of steering the focus towards the clefted constituent
(e.g. Gundel 2002: 118). The IT -cleft can have various types of phrases and
clauses as its focus, as shown below.
IT-cleft = IT + BE + clefted constituent [NP, PP, AdvP, (non-)finite clause] + cleft clause

Adjunct adverbials, in contrast to conjuncts and disjuncts, can be the focus of an

IT-cleft construction (cf. Quirk et al. 1985: 504), as illustrated in (1):

(1) Can we call a special meeting or something?

Maybe just that its this week that uhm there arent enough people around
< S1B-078 #172-173:1:C>2

The aim at hand is to examine such adverbials in IT-cleft constructions in order to

discover the information structural role of the focused adverbial as well as the
function of the whole IT -cleft construction in context. The study is primarily
based on the British component of the International Corpus of English (ICE-GB).
196 Hilde Hasselgrd

2. Types of adjunct in cleft position

Before going further, I shall briefly outline the kinds of adverbials that occur in
cleft focus position in the ICE-GB. Table 1 shows the occurrence of different
semantic types of adjunct in cleft position. The distribution of semantic types of
adjuncts in cleft position is not surprising; time and place adjuncts are the most
common types of adjuncts in most registers irrespective of position (cf. Biber et
al. 1999: 783 f.).

Table 1. Adjuncts in cleft position in the ICE-GB

Semantic type N %
Time 23 45.1
Place 15 29.4
Manner3 7 13.7
Cause/reason 5 9.8
Condition 1 2.0
Total 51 100

The adverbials in cleft position have different realizations, of which the

most common is the prepositional phrase (Table 2). Table 2 includes a column
containing Johanssons (2002:90) figures for the realization of adverbials in cleft
position in the English part of the English-Swedish Parallel Corpus (ESPC). As
shown in the table, the proportions of different realization types are relatively
similar across the two corpora. The frequency of the different realization types
corresponds quite well to the overall realization of adverbials regardless of
position (Hasselgrd: in prep., Biber et al. 1999: 769). One can thus assume that
adjuncts in cleft position are similar to those in other positions as regards their
semantic types as well as their realization.

Table 2. Realization of adjuncts in cleft position in ICE-GB and the English-

Swedish Parallel Corpus (ESPC figures from M. Johansson 2002:90)
N % N %
Prepositional phrase 34 66.7 58 73.4
Adverb phrase 8 15.7 12 15.2
Noun phrase 2 3.9 1 1.3
Clause 7 13.7 8 10.1
Total 51 100 79 100
Adverbials in IT-cleft constructions 197

3. The information structure of IT-clefts

The most common assumption about the information structure of IT-clefts is that
the clefted constituent represents new, often contrastive, information (e.g. Biber
et al. 1999: 959). The subordinate clause typically conveys presupposed
information (e.g. Prince 1978: 896). She calls this stressed focus IT-clefts, thereby
indicating the discourse function of such clefts, namely to give special focus to
the clefted constituent. Gundel (2002: 118) refers to this information structure in
clefts as prototypical. Similarly, Collins (1991: 84) follows Halliday in claiming
that this information structure constitutes the unmarked type of IT-cleft: the
theme/new combination is unmarked: the construction creates, through
predication, a local structure the superordinate clause in which information
focus is in its unmarked place, at the end.4 This is illustrated in Table 3, from
Halliday (1994).

Table 3. Marked and unmarked information focus combined with unpredicated

and predicated Theme (Halliday 1994: 301)
Unmarked Marked
Non-nominalized you were to blame you were to blame
Theme Rheme Theme Rheme
Given New (focus) New Given
Nominalized its you who were to blame its you who were to blame
(predicated Theme) Theme Rheme Theme Rheme
New Given Given New (focus)

The term unmarked here does not, however, reflect quantitative data. In
Collinss comprehensive study of cleft constructions, only 36% of the IT-clefts
have a new clefted constituent and a given cleft clause (1991: 111), although the
clefted constituent is new in a clear majority of cases.
Another type of information structure in IT-clefts is described by Prince
(1978) (and others after her: Collins 1991, Delin and Oberlander 1995, Johansson
2002), namely the informative presupposition cleft, in which the cleft clause
conveys new information. The clefted constituent may contain either given or
new information in an informative presupposition cleft. In (2), both the clefted
constituent and the cleft clause are new, since the sentence occurs text-initially.

(2) It was just about 50 years ago that Henry Ford gave us the weekend.
(Prince 1978: 898)

According to Prince (1978: 898) the information in the cleft clause is encoded as
a (non-negotiable) fact. Although it is new, it is presupposed rather than asserted,
i.e. it is marked as known to some people although not yet known to the intended
hearer (ibid: 899). Prince says that The whole point of these sentences is to
inform the hearer of that very information (ibid: 898). Delin (1992: 296),
198 Hilde Hasselgrd

however, claims that the information within an it-cleft presupposition appears to

remind rather than inform, even though it may not in actual fact be known to the
hearer (ibid: 297).

4. Information structure in IT-clefts with focused adverbial

Descriptions of IT-clefts in grammars and elsewhere are mostly concerned with

focused nominal elements. This may be partly because of the possibility of
comparing IT-clefts and wh-clefts,5 and partly because nominal elements are the
most frequent type of clefted constituent. In ICE-GB they are almost three times
as frequent as focused adverbials, which agrees quite well with Johanssons
findings (2002: 90).
It is possible that there is a correlation between the type of clefted
constituent and the type of cleft construction. Both Collins (1991: 112) and Prince
(1978: 899) note that the informative presupposition type of IT-cleft is quite
common when the clefted constituent is an adverbial. Conversely, it is possible
that the stressed focus cleft is less apt to accommodate adverbials as the clefted
constituent. In this connection we may note that the most frequently quoted
example of an informative-presupposition cleft has an adverbial in cleft position,
namely (2).
In the ICE-GB material it was indeed quite common to find new
information in both the clefted constituent and in the cleft clause, as in (2). It was
also quite common for the clefted constituent to represent given information and
for the information in the cleft clause to be new, as in (3). This is the reverse of
the canonical information structure in IT -clefts. Although this pattern is
discussed as a variant of informative presupposition cleft by e.g. Prince (1978:
899) and Gundel (2002: 118 f), its frequency was unexpected.6

(3) However there are worrying signs for the Republicans in the contests for
state governors. Because of the shift in population to the warmer parts of
the country states like Florida Texas and California are to be given extra
seats in Congress. The governors of those states will have a big say in
redrawing the boundaries. And its here that the Democrats have made
significant headway. They have won the elections for governor in both
Florida and Texas from the Republicans although Mr Bushs party appears
to have held on to the biggest prize of all California. (S2b 006#15-19)

The occurrence of clefted constituents conveying given information is also briefly

noted by Biber et al. (1999: 962): The focused element in an IT -cleft is not
infrequently a pronoun or some other form which expresses given information.
[] The early position of the focused element makes it suitable both for
expressing a connection with the preceding text and for expressing contrast. The
examples given are of the type it was me, it was then, it is these. In the
material examined for the present study, just over half of the clefted adverbials
Adverbials in IT-cleft constructions 199

were anaphoric, cf. examples (1), (3), and (7). The great majority of the IT-clefts,
close to 90%, were of the informative presupposition type.
Based on the distribution of given and new information over the clefted
constituent and the cleft clause, Johansson (2002: 185 ff.) arrives at four patterns
of IT-clefts:

Type A: Clefted constituent is given/inferable; cleft clause is

Type B: Clefted constituent is given/inferable; cleft clause is new
Type C: Clefted constituent is new; cleft clause is new
Type D: Clefted constituent is new; cleft clause is given/inferable

Assessing the information status of adverbials is difficult because they often

contain some given and some new information, e.g. the nominal complement of a
prepositional phrase may be given while the relation expressed by the preposition
may be new. Such phrases have been classified as inferable, and grouped together
with given information, following the practice of Johansson (2002). Needless to
say, it was necessary to study the examples in their wider context in order to
determine the status of the information conveyed by the two parts of the cleft
construction. Information patterns A-D were all found in the ICE-GB material.
Their distribution is shown in Table 4.

Table 4. Information structure of IT-clefts in the material

Pattern Adverbial Cleft clause N Total %
given given 1
Type A
inferable given 1 5 10
(all given)
inferable inferable 3
Type B given new 18
26 51
(given + new) inferable new 8
Type C
new new 18 18 35
(all new)
Type D new given 0
2 4
(new + given) new inferable 2
Total 51 100

There are quite a few differences between the results given in Table 4 and the
corresponding results of Johansson (2002: 188), which embraces all types of
clefted constituent. First and foremost, Johansson finds that Types A and B are
equally frequent (38% each), while Type C is least frequent in his material (10%).
Type D accounts for 14% in Johanssons material of English original texts.
Although the number of IT-clefts with adverbials is rather too low to give
conclusive results, the comparison of Johanssons material (based on 240
examples) and mine suggests that the information structure of clefts with
adverbials differs markedly from that of clefts with noun phrases. The most
200 Hilde Hasselgrd

important difference is that the IT -clefts with adverbials occur by far most
commonly with cleft clauses conveying new information (86%), while the cleft
clauses of IT-clefts in general seem to be divided about equally between given and
new information (Johansson 2002: 188). It may, however, be noted that Collins
(1991: 111) reports that 63% of his IT-clefts have new/contrastive information in
the cleft clause. It is possible that the differences may be due to the fact that
Collinss material as well as my own from ICE-GB contains both spoken and
written material while Johanssons contains only written material.

5. Discourse functions of cleft constructions

Studying the IT-clefts with adverbials in context, I found that they seem to have a
range of textual functions in the organization of the information flow. Collins
(1991: 106) makes a similar observation: The use of adverbials in cleft focus
position seems to have important textual functions, e.g. by acting as a bridge from
one topic to another or launching a (new) discourse topic. Returning to (3), for
example, we can note that the cleft sentence marks a transition between two
sections of the text. The discourse topic before the cleft sentence is demographic
and political features of Florida, Texas and California. After the cleft sentence
the text has moved on to the success of the Democratic Party, a topic that was
introduced in the cleft clause.
Johansson (2002: 193) proposes four main discourse functions of IT-
clefts (irrespective of the type of clefted constituent). I decided to use these
categories in order to facilitate comparison with his results, and was able to
identify all of them in the material for this study. The categories are the

Contrast (the clefted constituent marks a contrast to something previously

Topic Launching (the clefted constituent becomes the topic of the
subsequent discourse.)
Topic Linking (the two parts of the cleft construction clefted constituent
and cleft clause link together two discourse topics.)
Summative (the IT-cleft concludes or rounds off a text or a section of a

Interestingly, the notion of contrast does not seem to be a particularly prominent

feature of the clefted adverbials in the ICE-GB material. On the other hand, the
notion of focus is present in all the examples; after all, an IT -cleft usually
represents marked syntax as compared to its non-cleft counterpart. There is thus
some extra attention associated with the clefted constituent, even though this need
not be contrastive, and according to Delin (1990: 5 f) it need not be associated
with prosodic focus either.
In the following I shall give examples of the discourse functions found in
the material, using Johanssons classification. As will be shown, there are also
Adverbials in IT-cleft constructions 201

cases where it may be argued that the cleft construction represents a merger of
two categories, and additional categories will be suggested.

5.1 Contrast

In the canonical cleft (Gundel 2002: 118) the clefted constituent conveys new
information which is explicitly contrasted with something mentioned in the
preceding context. The cleft clause represents information that is known to the
speaker. In (4) the subject of reading books at an early age has already been
talked about, while the clefted constituent introduces a reader of a different age
from those previously discussed.

(4) I struggled terribly with them in my early teens and had no success at all.
It wasnt till I was perhaps twenty-five or thirty that I read them and
enjoyed them <S1A-013 #237-238:1:E>

It is also possible to express contrast in examples where the cleft clause conveys
new information. In such cases there is usually also another discourse function
associated with the IT-cleft. For instance, the clefted constituent may launch a
new discourse topic at the same time as marking a contrast (see next section), or
there may be a transition between two discourse topics (section 5.3).

5.2 Topic launching

An IT -cleft can introduce a discourse topic in the clefted constituent. This

constituent may be brand new or inferable, but in any case it is made prominent
by means of the clefted constituent and developed as a topic in the subsequent
discourse. In (5) the clefted constituent introduces those men and women serving
our country in the Middle East, a group which is a discourse topic in the section
that follows. It also represents a shift in the speech, introducing a human angle.
Interestingly, the you in the next sentence refers to the same group, in contrast to
the you in the sentence preceding the cleft, which is much wider in its scope.

(5) We must try to work out security arrangements for the future so that these
terrible events are never repeated <,> and we shall I promise you <,> bring
our own forces back home just as soon as it is safe to do so <,> It is to
those men and women serving our country in the Middle East <,> that my
thoughts go out most tonight # and to all of their families here at home <,>
To you I know this is not a distant war. It is a close and ever present
anxiety <,> I was privileged to meet many of our servicemen and women
in the Gulf last week <,> <S2B-030 #63-68:1:A>
202 Hilde Hasselgrd

Topic launching + Contrast

Example (6) represents a combination of two discourse functions in the IT-cleft.
The clefted constituent to Africa represents a contrast to the previous setting, the
Soviet, but at the same time introduces Africa as a new discourse topic in a text
about the worlds population. Clearly, the cleft clause does not convey given or
even presupposed information. Since all the information in the sentence is new,
the IT-cleft construction provides a means of avoiding placing new information
sentence-initially, or to keep focal material out of surface subject position
(Gundel 2002: 126).

(6) Shortages of food have been a repeated feature of recent Soviet experience
<,> with heavy dependence on grain imported from the United States as
the Soviets own production has failed <,> But the spotlight has been on
empty shops in the towns rather than empty larders in the countryside <,>
It is to Africa that the television cameras go to show what happens when
local natural resources are so inadequate for the population living off them
that drought or continuous small-arms war causes famine for the people
counted in millions and death for many of them <,> Apart from such
disasters in succession in the same or in different places infant mortality is
the main counter to the birth rates effect in Africa <,> and in parts of the
continent the heterosexual incidence of Aids may prove to have halted or
even reversed the growth in the population of potential parents and
condemned large numbers of children to early death <,> <S2B-048

5.3 Topic linking: Transition

It is the two-part structure of the IT -cleft that allows it to link together two
discourse topics: the current discourse topic is referred to in the clefted
constituent, while a new discourse topic is introduced in the cleft clause. I would
actually prefer to call this function Transition since it not only links together
two topics but also provides a bridge between two sections of a text. In other
words, the cleft sentence becomes a vehicle for topic shifting. Because the new
topic is presented as known or presupposed according to Prince, it is an
unobtrusive way of introducing new information which can then be the starting
point for the next section of the text. In (3) the cleft clause the Democrats have
made significant headway marked the beginning of a new section of the text. In
example (7) the idea of topic shifting is even clearer, because Cs attempt to shift
the topic is refused by A, who wants to spend more time on the previous topic.
Thus, in (7), the transition does not fulfil its function, while in (3) and (8) the
topic is successfully shifted.
Adverbials in IT-cleft constructions 203

(7) C: But really whats happened with my sort of history is when I met uh did a
little recording with Chandos Records uhm and the Ulster orchestra who
was conducting there came up with enough money to do their first record
and they got Chandos interested. It was then that uh I fell in love with
music like Hamilton Harty and a bit of Stanford <,> and the Arn the
Arnold Bax Saga became something quite uh excellent.
A:Well thats a day we certainly want to come back to a bit later. But if we
could just for a moment concentrate on the latter years of the nineteenth
century. < S1B-032 #22:1:C>

Contrast + transition
Example (8) has a combination of marking a contrast with the clefted constituent
and creating a transition by means of the cleft clause. The contrast is between the
Villa Somalia mentioned earlier and the office building. The introduction of the
letters in the cleft clause starts off a new section of the discourse.

(8) {BEGINNING OF TEXT} The Villa Somalia which was Siad Barres
official residence in Mogadishu still lies abandoned <,> guarded by a
handful of young men from the United Somali Congress the rebel force
which took control of Mogadishu at the end of January <,> But it was in
one of the office buildings that I discovered the letters <,> thousands of
them <,> addressed to His Excellency President Mohammed Siad Barre
but all unopened <,> I picked up one from Britain <,> It had been posted
in September nineteen eighty-eight and was signed by a retired
schoolteacher from Guildford in Surrey <,> writing on behalf of Amnesty
International to plead for the release of a blind Somali preacher whod
been imprisoned for his religious beliefs <,,> < S2B-023 #61:3:A>

5.4 Summative

Summative IT-clefts tend to occur towards the end of a text or a section of a text,
and represent a kind of conclusion or rounding off. Example (9) occurs at the
very end of a speech and contains two cleft constructions. They share a clefted
constituent which is inferable. It is not contrastive, but may have the uniqueness
feature noted by Delin and Oberlander (1995: 469). The cleft clause in the
second cleft is new. Although it is backgrounded by means of subordination
(Delin and Oberlander 1995: 473), it represents a kind of punchline and softens
the war-talk in a clever way.

(9) The purpose of war is to enforce international law. It is to uphold the

rights of nations to be independent and of people to live without fear. It is
in that spirit <,> that the men and women of our forces and our allies are
going to win the war <,> And it is in that spirit that we must build the
peace that follows. <S2B-030 #103-105>
204 Hilde Hasselgrd

Contrast + summative
In example (10) we see both contrast and the summative function, towards the
end of an obituary. The clefted constituent resolves an either-or relationship
(Perzanowski and Gurney 1997: 218) i.e. a writer for children as opposed to
adults - while the summative function is evident from the cleft clause.

(10) Dahls books often portrayed children battling against evil adults. [] As
an adult author Dahls fame was to come much later when his Tales of the
Unexpected were transferred to television. Yet it will be as a childrens
writer hell be remembered. His lasting legacy includes another two books
still to be published. Roald Dahl whos died at the age of seventy-four <,>
{END OF TEXT} < S2B-011 #17:1:B>

5.5 Thematization

In certain cases the main function of the cleft seems to be to make extra clear
what is to be understood as the theme and the rheme of a sentence. A good
example is (11), which represents a complete text. Thus there can be no contrast
involved, nor any topic-linking, topic-launching, or summary. Rather, in this
case, the writer wants to give thematic prominence to the regret he/she feels. It
may be noted that a non-cleft version (11a) cannot easily have the same
constituent in thematic position.

(11) It is with much regret that I find it necessary to send you a copy of the
enclosed letter which is self explanatory. <W1B-026 #121:15> {= entire
(11a) ? With much regret I find it necessary to send you a copy of the enclosed
letter which is self explanatory.

Thematization is an added discourse function as compared to Johanssons,

although he mentions it as a variety of topic linking (2002: 199). One reason why
it seems appropriate to propose it as a separate category is the fact that some
examples simply do not fall neatly into any of the other categories, such as (11)
above. On the other hand, thematization seems to be an accompanying factor in
most of the examples where the cleft can be assigned to one of the functional
categories described above.
Furthermore, SFL tradition views both I T -clefts and WH -clefts as
thematization devices (predicated theme and thematic equative, respectively; see
also Table 3). Here one must bear in mind the basic function of Theme and
Rheme which is a partition of the message into two parts, each of which carries a
type of prominence. Thematic prominence has to do with the functions of Theme
as the point of departure of the clause as message, or the ground from which the
message is taking off (Halliday 1994: 38). Rhematic prominence on the other
hand has to do with the fact that the (end of the) Rheme tends to be the locus of
new information (cf. Fries 1994: 233 f). Gmez-Gnzales (2000: 303 ff.)
Adverbials in IT-cleft constructions 205

describes the IT-cleft as a special theme construction, i.e. one that marks off the
theme of a sentence and gives it extra focus. Similarly, Collins (1991: 171) notes
that the theme in clefts carries a textual form of prominence. If the IT-cleft is
seen as a construction for thematization, it should follow that the theme in this
construction, like other themes, can be given, new, contrastive or non-contrastive.
Perzanowski and Gurney (1997: 214) note that certain types of it-clefts
[] frequently occur in negative contexts. This is also a finding of the present
study, where three examples had not until as the clefted constituent, such as
(12). In such cases, a non-cleft version is not without problems, given that the
speaker wants a particular theme-rheme structure. That is, the corresponding non-
cleft version will require subject-verb inversion (12a). The IT-cleft may thus be a
way of using a marked construction in order to avoid one that is even more

(12) However it wasnt until his fourth album that the instruments capabilities
were more fully explored <,> <S2B-023 #22:1:A>
(12a) Not until his fourth album was the instruments capabilities more fully

Gundel (2002) and Johansson (2002) both document that IT -clefts are more
common in Norwegian/Swedish than in English. Looking for more examples
similar to (12) above, I have, however, found several examples of English IT-
clefts corresponding to other thematization structures in Norwegian, where
fronting of adverbials is more common and less marked than in English
(Hasselgrd 1997: 14). In example (13), from the English-Norwegian Parallel
Corpus (ENPC), the Norwegian original does not have a cleft, but a fronted
adverbial. The translator, struggling to keep the thematic structure intact, opts for
a cleft possibly again to avoid an even more marked structure.

(13) Frst p den tredje dagen hadde Aua vknet. (ENPC: MN1)
Lit: first [=only] on the third day had Aua awakened.
It was not until the third day that Aua awakened. (MN1T)

However, the thematizing function of IT-clefts does not only occur where a non-
cleft alternative would be awkward. In (14), for example, a non-clefted alternative
with a fronted adverbial is quite acceptable (14a), although there may be slightly
less focus on the adverbial. The clefted constituent does not mark any contrast,
nor does it close or launch a topic. According to Collins (1991: 175) the IT-cleft
enables an unambiguous mapping of theme on to new information in the
unmarked instance, as themes in IT-clefts are likely to convey new information
(Collins 1990: 111). In (12)-(14) the clefted constituent is indeed new.

(14) It was in nineteen hundred and six that the Queens great-grandfather King
Edward the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers <ICE-GB:S2A-011 #91:1:A>
206 Hilde Hasselgrd

(14a) In nineteen hundred and six the Queens great-grandfather King Edward
the Seventh decreed that privates in the Household Cavalry should
henceforth to be known as troopers

Delin and Oberlander (1995: passim) suggest yet another discourse

function of IT-clefts, in that the content of the cleft clause is marked as prior in
time to the main story line. An example of this may be (15). However, the content
of the preceding sentence is also prior in time to the main story line, so I am not
convinced this property is contributed by the IT-cleft construction. Instead I have
classified the example as Thematizing. As in (11)-(13) a non-cleft version could
not easily have the same theme-rheme structure. In a sense it is also topic-
launching, in that it occurs early in a section concerned with stages in this
persons military career. However, this can also be seen as a function of Theme.

(15) And the Field Officer Brigade waiting rides up to Her Majesty the Queen.
He was not granted security for officer training when he joined the
regiment in nineteen sixty-eight because of his Polish ancestry. It was as a
G u a r d s m a n that he came to the Second Battalion which now he
commands and eventually became a lance sergeant instructor at the Guards
Depot. When he was finally accepted for Sandhurst he went on to win the
Sword of Honour and has since served as an officer with every company
of each battalion of the regiment. < S2A-011 #122-125:1:A>

It is tempting to propose that the basic function of IT-clefts is thematization, and

that other functions are subsidiary to this. In other words, the marking of contrast
with a preceding topic, the launching of a new topic and the preparation for a new
topic may all be seen simply as functions of Theme.

5.6 Discourse functions and information structure

Johansson (2002: 193) suggests that the discourse functions of clefts are
associated with the different patterns of information structure outlined in section
4 above. According to his findings, type A correlates with the discourse functions
contrast and summative, type B with topic linking, type C with topic launching,
and type D with contrast. Table 5 presents a summary of the occurrence of the
various discourse functions of IT-clefts with adverbials in the present material.
This has been correlated with the type of information structure identified in each
cleft sentence.
Because few of the IT -clefts with adverbials were of the stressed focus
type, there are few examples of contrast. There are thus not enough examples of
this function to make a valid comparison with Johanssons results, although it is
interesting that several of the contrast examples belong to type C (all new). It may
be noted that when a clefted constituent conveying new information expresses
contrast, the implication is contrary to expectation rather than contrary to what
has been claimed. Type B (given + new) seems to be a good indicator of Topic
Adverbials in IT-cleft constructions 207

linking, or transition from one topic to another, as in Johanssons material. Type

C (all new) is a relatively good indicator of Thematization, although this type also
has a range of other functions. Type D is too scarcely represented to provide any
basis for even tentative conclusions.

Table 5. Discourse functions identified ranked according to frequency of

occurrence and correlated with information structure
Type A Type B Type C Type D Total
all given given + new all new new + given
Transition 1 18 4 23
Thematization7 5 7 12
Contrast8 2 3 1 6
Summative 2 2 0 1 5
Topic launching 1 1 2
Contrast + transition 0 2 2
Contrast + topic launching 0 1 1
Total 5 26 18 2 51

6. IT-clefts in different registers

Since the ICE-GB includes a range of different spoken and written registers, it
was possible to check whether the registers differed as to the use of adverbials in
IT -clefts. Collins (1991: 181) reports a slightly higher frequency of IT-clefts in
writing (the LOB Corpus) than in speech (the London-Lund Corpus). In the ICE-
GB the difference between speech and writing was the opposite as regards the
frequency of IT-clefts with adverbials: approximately 0.6 vs. 0.4 occurrences per
10,000 words, respectively. However, when the spoken category was divided into
scripted vs. unscripted, a further difference emerged, as shown in Table 6. The
category of scripted speech, making up only 6% of the corpus, accounts for 24%
of the clefts with adverbials. The unscripted spoken categories and the written
categories are then left with the same frequency of clefted adverbials.

Table 6. Frequency of IT-clefts with adverbials in different genres in the ICE-GB

No of clefted
No of words No of clefted
Genre/medium adverbials per
in ICE-GB adverbials
10,000 words
Spoken (unscripted) 572,464 24 0.4
Scripted speech 65,098 12 1.8
Writing 423,702 15 0.4
Total 1,061,264 51 0.5
208 Hilde Hasselgrd

One may speculate that the discourse functions of clefts are well suited to the
(rhetorical) purposes of the scripted speech categories. These categories typically
belong to expository genres, particularly lectures and broadcast narration, but
there are also some official speeches. Possible reasons why the informative-
presupposition clefts with adverbials are handy may have to do with the
possibility of assigning unambiguous thematic prominence to the clefted
constituent and the possibility of presenting new information in the cleft clause
without asserting it. Further, as Delin (1992: 300) claims, the information in the
(presupposed) cleft clause is presented as a non-negotiable fact, which clearly has
its rhetorical advantages. Further exploration of such rhetorical properties of
clefts is, however, beyond the scope of this paper.

7. Concluding remarks

The starting point for the present study was a hypothesis that IT-clefts with
adverbials behave differently from other IT-clefts in discourse. The background
for this hypothesis was that clefts with adverbials in focus position seemed to
have an unexpected information structure, particularly that there seemed to be
many examples of the given + new pattern. One of the conclusions of the present
study must be that these perceived differences have to do with quantity rather
than with quality, as the information structure and the discourse functions found
with clefted adverbials have also been identified and described with other types of
clefted constituent (e.g. by Collins 1991, Johansson 2002). Presumably, many of
the differences arise from the fact that IT -clefts with adverbials tend to be
informative-presupposition clefts, while other IT-clefts are more likely to be
stressed-focus clefts.
The typical information structure of the IT-clefts with adverbials involves a
clefted constituent carrying given information and a cleft clause carrying new
information or, alternatively, one in which both parts of the cleft construction are
new. In both Collins (1991:11) and Johansson (2002: 188) these two types are
less frequent.
It is clear that clefts with adverbials have a range of textual meanings, or
discourse functions. It is equally clear that one must study the clefts in context in
order to get at these functions. The material offered examples of IT-clefts serving
contrastive, topic-launching, transitional, and summative functions. It was
suggested in section 5.5 that these discourse functions can all be regarded as
somehow ancillary to the function of theme (or to the theme-rheme nexus in the
case of transition).
The use of an IT-cleft enhances the textual prominence of the theme. As a
consequence, the construction is well suited for marking off the theme as new or
contrastive. However, IT -clefts are also used when the clefted constituent is
neither new nor contrastive, in which case the construction may simply serve to
make the theme-rheme division of the message extra clear. This may be the case
Adverbials in IT-cleft constructions 209

in clefts marking transition as well as in clefts where none of the other discourse
functions outlined in section 5 can be identified.
The function of transition seems particularly prominent with clefted
adverbials. Clefts with this function typically have a given, often anaphoric,
adverbial in cleft focus position, while the cleft clause introduces a topic for the
subsequent discourse. The speaker/writer thus achieves a smooth transition
between two topics, juxtaposing them by means of a relational clause, and
launching the new topic unobtrusively in a subordinate clause.
It was also shown that there are cases of IT-clefts being used to place an
adverbial in thematic position that would otherwise be difficult to place clause-
initially. This was seen with negative adverbials (e.g. not until), which would
have required subject-operator inversion in a corresponding non-cleft sentence,
and with adverbials such as with much regret (example 11), which probably could
not have occurred in initial position in a non-cleft sentence. Furthermore, the IT-
cleft can give extra thematic focus to clefted constituents that would have come
across as relatively unmarked themes in non-cleft sentences, particularly time
adverbials (e.g. example 14).
In the present study I have made frequent comparison with other studies of
clefts, particularly Collins (1991) and Johansson (2002). There are some
weaknesses involved in these comparisons. First of all, the three studies are based
on rather different corpora. More importantly, assigning information values and
discourse functions is no exact science, and the subjective element involved in
this work may account for some of the differences between previous studies and
my own. Ideally, the present study should have been extended to include IT-clefts
with nominal constituents in the ICE-GB corpus. This might have made the
comparisons with other IT-clefts more reliable. However, this task must be left to
a later study.
Another possible extension of the study would be to explore further the
different uses of IT-clefts in various genres. The investigation reported in section
5.6 showed a clear difference in frequency of the construction across genres,
cutting across the spoken/written dimension. A further study might look more
closely into such genre differences as well as more specific rhetorical uses of the


1. I follow the terminology of e.g. Gundel (2002) and Delin (1992), describing
the IT-cleft construction in terms of a clefted constituent and a cleft clause. No
discussion of the syntactic status of the relative-like subordinate clause will
be undertaken here.
2. In the corpus examples, the IT-cleft construction has been underlined, with the
clefted constituent in italics.
3. The category of manner adjuncts has been defined quite widely and includes
adjuncts of means and comparison.
210 Hilde Hasselgrd

4. The systemic-functional term corresponding to IT-cleft is predicated Theme

(cf. Halliday 1994: 58).
5. It is usually assumed that wh-clefts do not allow focus on adverbials, though
reversed wh-cleft seem to behave differently (cf. Johansson 2002: 96-97, who
gives examples of (reversed) wh-clefts with where and why, e.g. Here is where
I look like Marilyn Monroe).
6. Delin (1992: 294) suggests that the reason that cleft presuppositions are so
frequently assumed to specify information that is mutually known perhaps lies
in the fact that much of the discussion of it-clefts has centred around
decontextualized examples.
7. As it is argued elsewhere in this paper that Thematization may be the primary
function of IT-clefts, it should be noted that the examples classified as
Thematization in Table 5 are those where none of the other discourse
functions could be clearly identified.
8. It is assumed that the function of contrast is not prominent in the examples not
classified as such in Table 5.


Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman

grammar of spoken and written English. London: Longman.
Collins, P.C. (1991), Cleft and pseudo-cleft constructions in English. London and
New York: Routledge.
Delin, J. (1990), Focus in cleft constructions, Research Series Blue Book Note
No 5, Centre for Cognitive Science, University of Edinburgh.
Delin, J. (1992), Properties of it-cleft presupposition, Journal of Semantics, 9:
Delin, J. and J. Oberlander. (1995), Syntactic constraints on discourse structure:
the case of it-clefts, Linguistics, 33: 465-500.
Fries, P.H. (1994), On Theme, Rheme and discourse goals, in: M. Coulthard
(ed.), Advances in written text analysis. London and New York:
Routledge, 229-249.
Gmez-Gonzlez, M.. (2000), The theme-topic interface. Evidence from
English. Amsterdam and Philadelphia: John Benjamins.
Gundel, J.K. (2002), Information structure and the use of cleft sentences in
English and Norwegian, in: H. Hasselgrd et al. (eds), Information
structure in a cross-linguistic perspective. Amsterdam: Rodopi, 113-128.
Halliday, M.A.K. (1994), An introduction to functional grammar. London:
Edward Arnold.
Hasselgrd, H. (1997), Sentence openings in English and Norwegian, in: M.
Ljung (ed.), Corpus-based studies in English. Papers from the seventeenth
Adverbials in IT-cleft constructions 211

international conference on English language research on computerized

corpora. Amsterdam: Rodopi, 1-14.
Hasselgrd, H. (in preparation), Manner, place, time. A corpus-based study of
adverbials in present-day English.
Johansson, M. (2002), Clefts in English and Swedish: A contrastive study of IT-
clefts and WH -clefts in original texts and translations. PhD Thesis, Lund
Perzanowski, D. and J. Gurney (1997), The functionality of it-clefts in selected
discourses: The message in the medium. Word, 48: 207-236.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Prince, E. (1978), A comparison of w h -clefts and it-clefts in discourse.
Language, 54: 883-906.

Corpus material

The International Corpus of English, British component (ICE-GB); see

<http://www.ucl.ac.uk/english-usage/ice-gb/ >
The English-Norwegian Parallel Corpus (ENPC); see
On the pragmatic functions of lets utterances

Bernard De Clerck

University of Ghent

The difference between a boss and a leader: a boss says, 'Go!' a

leader says, 'Lets go!'
E. M. Kelly, Growing Disciples, 1995


This paper presents the results of research into the pragmatic functions of lets
utterances in the spoken component of the ICE-GB.1 The first part of the paper
gives an overview of the grammatical features and the pragmatic uses of lets
utterances as described in the literature. The second part presents a detailed
analysis of the attested lets utterances in the corpus. Apart from testing the force
and accuracy of the existing descriptions, the paper also examines the
frequencies of occurrence of these functions and possible relationships with the
different text categories they occur in. The goal is to provide an answer to such
questions as who uses lets utterances where, why, and how.

1. Introduction

Constructions with lets are intriguing. When one considers the possible
meanings of the pair
(a) Let us have a drink
(b) Lets have a drink
one can see that (b) is not just an informal variant of (a) with the abbreviated
objective pronoun us. On a semantic and a pragmatic level the picture is clearly
more complex than that. The meaning of example (a) is ambiguous and can be
interpreted in two ways. On the one hand, it can be interpreted as a non-inclusive
request for permission (i.e. the hearer does not belong to the group referred to by
us), which can be paraphrased as Allow us to have a drink. On the other hand, it
can be interpreted as a hearer-inclusive proposal or suggestion for joint action,
involving both the speaker and the hearer. Example (b), however, is restricted in
semantic scope and has lost its non-inclusive interpretation. It no longer has the
meaning allow us to have a drink. In contrast with (a), its illocutionary function
is restricted to a hearer-inclusive proposal for joint action and as such Shall we
214 Bernard De Clerck

have a drink? comes closer as a paraphrase than Allow us to have a drink.2 It

appears then that lets constructions seem to have gone through (and might still
be going through) a process of semantic bleaching (Huddleston and Pullum
2002: 924) which also affects their pragmatic illocutionary functions.
This paper focuses on these pragmatic functions and investigates the
influence of semantic bleaching on the different pragmatic uses of lets
constructions in present-day British English. However, before moving on to a
more fine-grained pragmatic and corpus-based analysis, I shall briefly examine
the influence of the semantic bleaching process on the grammatical properties of
lets constructions.

2. Grammatical properties of lets constructions and the status of let and


There has been a great deal of debate on the syntactic properties of let and lets in
the existing literature, including the way they should be labelled or categorised.
What is interesting about this discussion is that it shows the shortcomings of
traditional grammatical distinctions whenever they are confronted with the hybrid
syntactic nature of a certain language item. The discussions also exemplify the
interconnectedness of the pragmatics, semantics and grammar of a language and
the descriptive and analytical problems that arise when the consequences of
certain changes in a construction affect these three levels at different rates.
Indeed, when reviewing the relevant literature one can see that the syntactic
properties of let and lets in the constructions at hand are actually described as a
mixture of auxiliary and non-auxiliary-like syntactic properties, whose syntactic
behaviour is often explained in terms of idiosyncratic construction-specific
One way of accounting for these properties is found in Seppnen (1977),
who identifies let as a hybrid modal auxiliary with a mixture of features,
characteristic of central and marginal modals. Seppnen points to the fact that,
like a regular modal, let occurs only in combination with a main verb, forming
with it a complex predicate where the semantic contribution of let is the notion of
volition (Seppnen 1977: 517).3 Furthermore, like the modals, let is always
followed by a bare infinitive form of the main verb. According to Seppnen, other
shared characteristics include the absence of non-finite forms, the lack of
inflection in the third person singular and the past tense, its use for negation and
emphatic stress. However, unlike the modals, negation with do is possible (Dont
lets do it vs. Lets not do it), as is sometimes the case with the semi-modals
ought to and used to. Yet, in order to explain differences in use between let in
lets constructions and these semi-modals, Seppnen has to resort to idiosyncratic
properties. This is especially the case with regard to the use of let as the operator
of the sentence in combination with do (Do lets try it again vs. Didnt they ought
to like it?). Furthermore, in his analysis, he regards the NP following let as its
subject, which forces him to conclude that another unique and idiosyncratic
property of let in these constructions is that they require the subject to be in the
On the pragmatic functions of lets utterances 215

accusative (objective) form when it is a pronoun, hence, us, me, him, her, them.
The problem with Seppnens argumentation is that he is forced to attribute these
properties to the idiosyncratic nature of let as a modal verb.4 If let is analysed as
the imperative form of a full verb, the latter properties can be explained more
Treating let as the imperative of a full verb, however, is far from problem-
free either. In one approach, put forward by Costa (1972), no distinction is made
between the full lexical let and let as it occurs in lets constructions. In the latter
case, Costa still regards let as a straightforward imperative of a full lexical verb,
i.e. allow. The effect of let as an imperative is described as exhorting the
second person to allow the desired event () to take place (1972: 142).5 This
view on the meaning of let, however, cannot account for the ambiguous meaning
of Let us have a drink, described above, and seems to ignore the fact that in the
contracted variant Lets have a drink the interpretation request for permission
(similar to allow us) has all but disappeared.6 There are other distinctive features
of let in lets constructions which remain unexplained in Costas approach. One
of the most obvious characteristics of let in these constructions is of course the
fact that the accusative form of we, i.e. the pronoun us, can be, and most of the
time is, contracted to s. In all other types of imperatives (including those with a
full lexical let) us cannot be contracted. Other features that distinguish the lets
construction from the imperative with the full lexical verb include (a) the
occurrence of shall we instead of will you in tag questions, (b) the non-
omissibility of lets in ellipsis, (c) the difference in semantic scope in negative
utterances and (d) the fact that lets cannot occur with a subject. A full discussion
of these features is to be found in Huddleston and Pullum (2002: 934-935). I will
restrict myself to giving a few examples that illustrate these contrastive

(a) Lets have another drink, shall we? Let her have another drink, will you?
(b) Lets go with her. *Yes, do. *No, dont
Yes, lets. No, lets not. (Huddleston
and Pullum 2002: 934)
(c) 1a. Dont lets go with her. 1b. Dont let her go with you.
2a. Lets not go with her. 2b. Let her not go with you.
(Huddleston and Pullum 2002: 935)
(d) *You lets go. You let her go.

In (c) there is a clear difference in meaning between the ordinary imperatives (1b)
and (2b): in (1b) let is inside the scope of the negation, so it is paraphrasable as
Dont allow X to do Y. In (2b) let is outside the scope of the negation: in this
case the utterance can be paraphrased as Allow X (not) to do Y. All these
distinctive properties clearly show that the interpretation of let as an imperative of
the full lexical verb let (allow), as proposed by Costa does not account for these
216 Bernard De Clerck

More recent views on let and lets, set out by Quirk et al. (1985) and Biber
et al. (1999) label lets as a pragmatic particle. Quirk et al. (1985: 148), for
example, say that it is a pragmatic particle with a quasi-modal status, an
unanalysed particle pronounced /lets/. Along the same lines Biber et al. (1999:
1117) say that in present-day English it is for practical purposes an invariant
pragmatic particle introducing independent clauses in which the speaker makes a
proposal for action by the speaker and the hearer. According to Quirk et al., the
particle status of lets is also supported by the existence, in familiar AmE, of the
pleonastic variant lets us, lets dont and of the construction lets you and me in
which the addition of the second person pronoun indicates that s is no longer
associated with us.
Huddleston and Pullum (2002) make similar observations and point out
existing differences between what they generally call dialects A and B of the
English language. Dialect B is characterised as more lenient towards
constructions such as Lets you and I, which are similar to the pleonastic
variants given by Quirk et al. with regard to AmE. In this dialect these uses
would appear to be widely enough used to qualify as acceptable informal style in
standard English (Huddleston and Pullum 2002: 935). According to them, these
constructions indicate that syntactically the specialisation of let has been taken a
significant step further:

the s in these constructions is not replaceable by us (), and also

because of the prosody, it is not plausible to treat the NP you and I
as being in apposition to s. It seems clear rather, that let and s
have fused syntactically as well as phonologically, and are no
longer analysable as verb+object: they form a single word which
functions as a marker of the first person inclusive imperative
construction (Huddleston and Pullum 2002: 935).7

However, when talking about the less lenient dialect A (i.e. a dialect which does
not have the pleonastic variants illustrated above), Huddleston and Pullum say
that there is no compelling reason to suggest that there has been a reanalysis of
the syntactic structure (and hence no reason to regard lets as a pragmatic
particle). In their opinion, the data are compatible with an analysis where let is
still a catenative verb, used with an NP object (us or s) and (except in ellipsis) a
bare infinitival clause as second complement. In their view, then, the analysis of
lets as a pragmatic particle proposed by Quirk et al. (1985) and Biber et al.
(1999) would then only apply to AmE and not (yet) to BrE. Davies (1986) holds a
similar view in regarding let in lets constructions grammatically as the
imperative of the full verb let, with additional and exceptional features, the
possibility of contracting let us to lets being one of them. Syntactically, it still
has the status of an imperative of a main verb, but [t]o provide a plausible
account of both the form and the interpretation of the let-construction, then, it
seems necessary to acknowledge a certain lack of correspondence between the
two (Davies 1986: 250).
On the pragmatic functions of lets utterances 217

None of the pleonastic variants could be attested in my analysis of the

ICE-GB (which only comprises the British variant of the English language).
Therefore, the observations made by Davies (1986) and Huddleston and Pullum
(2002) and the distinctions that are made between AmE and BrE are also
supported in this paper.8 It is indeed so that let in lets constructions is undergoing
a semantic bleaching process, which takes it further away from its original
meaning of allow. At a syntactic level, there is reason to believe that this is
happening at different rates in AmE and BrE. In AmE the loss of semantic
meaning of let has led to a reanalysis of its syntax, thereby allowing the existence
of constructions such as lets us and lets you and I, while in BrE let in lets
constructions still shares a number of syntactic properties with the lexical let in its
fossilised syntax but is semantically moving away from it. It appears then that the
syntactic properties have not kept up with semantic changes in BrE, while in
AmE lets is more and more used as a pragmatic particle introducing or
announcing a joint activity of speaker and addressee. In the next section we will
focus more deeply on the meaning of lets constructions and their different
pragmatic uses in British English. It will appear that this construction is used as a
means to different ends, in which the illocutionary point of joint action
instigator often serves as a starting point to reach other illocutionary forces and
perlocutionary effects.

3. Pragmatic functions of lets utterances

This section comprises a summary and discussion of the different illocutionary

forces and pragmatic functions of lets constructions, as described in the
literature. I will start with the most stereotypical function and move on to less
frequent and more ambiguous functions which have been noticed fairly recently.
This order of describing the different pragmatic functions coincides with a
pragmatic movement from an ideational level of joint action instigation, to a more
textual or interactional level and eventually to the use of lets constructions at an
interpersonal level.

3.1 Lets utterances as proposals for joint action

As indicated above, lets utterances normally have the directive illocutionary

force of a proposal for joint action: the speaker commits herself to an action and
seeks the addressees agreement. For this reason, a verbal response is normally
expected, indicating agreement or refusal (Huddleston and Pullum 2002: 936).
Compliance with the directive (i.e. the perlocutionary effect) generally involves
joint action by the speaker and the hearer, possibly involving others as well.
Recent studies (Hamblin 1985: 60, Davies 1986: 229, Halliday 1994: 87, Swan
1996: 316, Biber et al. 1999: 1117, Huddleston and Pullum 2002: 936) bring this
aspect of joint action instigation to the fore: unlike second person singular
imperative structures, the addressees or intended agents of lets utterances include
both the speaker and the hearer. Because of this joint nature of the proposed
218 Bernard De Clerck

action, they could be regarded as essentially collaborative or even convivial

(Leech 1983: 104) in kind, as their illocutionary goal is indifferent to or
coincides with the social goal (ibid.). Huddleston and Pullum (2002), however,
remark that the speakers attitude towards compliance can range from strongly
wanting it (Come on, lets get going; the bus leaves in five minutes) to merely
accepting it (Okay, lets invite Kim as well, if thats what you want) (Huddleston
and Pullum 2002: 936), which implies that differences in conversational or
institutional power might overrule the convivial aspect of jointness which seems
to be inherent in the meaning of lets. From the analysis of the data in Section 4,
it will also appear that the actual pragmatic effect of the utterance will indeed
greatly depend on the interlocutors themselves and their interpretation or
evaluation of each others power at that specific moment in the conversation.

3.2 Speaker and hearer-oriented uses of lets utterances

The literature also mentions uses of lets which move away from this idea of joint
agency. Quirk et al. (1985: 830), for example, mention that in very colloquial
English, lets is sometimes used for a first person singular imperative as well:
Lets give you a hand.9 Biber et al. (1999: 1117) also refer to this use, saying
there is also a tendency veering towards a first person singular (exclusive)
meaning, which they equate with Let me.10 Similarly, Huddleston and
Pullum (2002: 936) refer to cases where the action is in fact just carried out by
just one (typically the speaker). In an example such as Lets open the window, it
is possible that the actual aim is that of securing your agreement to my opening
it, rather than a proposal to open the window together (ibid.). This shift in
meaning is even more obvious in cases where the hearer cannot (appropriately)
perform the action presented in the verb, but where the agreement and co-
operation of the hearer are needed in order to carry it out successfully. This use of
lets would then explain why it is often found in a medical context, when a doctor
or specialist is talking to a patient:
Lets have a look at your tongue (Biber et al. 1999: 1117)
In these cases, the speaker is trying to find agreement with the hearer for an
action that will be carried out by the speaker only. Although the hearer has a role
to play in the process of having her throat examined, the kind of action that is
expected is not the same as the one presented in the verb following lets. The
actual perlocutionary effect is to get the patient to open her mouth and not to have
a look at her own throat together with the doctor. Rather than a genuine proposal
for joint action, it is more a (self-)exhortative announcement of the next step in
the examining process with the implication that the hearer will have to open her
mouth at one stage.
There is another tendency in the use of lets utterances that moves in the
opposite direction and merges with a second person singular imperative meaning,
i.e. the hearer is the intended agent of the action presented in the lets utterance.
Biber et al. (1999) refer to this use as a second quasi-imperative meaning, as
On the pragmatic functions of lets utterances 219

this use proposes an action which is clearly to be carried out by the hearer. This
crypto-directive style (Biber et al. 1999: 1117), which aims to camouflage an
authoritative speech act as a collaborative one, is used especially by adults when
addressing children:
You all have something to do for Ms. <name>? Lets do it please.
(Biber et al. 1999: 1117)
De Rycker (1990: 311) also notes this use and says that () it has been
frequently observed that they [= lets utterances] also function as thinly disguised
addressee-oriented directive acts. In these cases the lets utterance can be
paraphrased roughly as a second person singular imperative. Used as a crypto-
directive, Lets open our books on page 5 then actually means Open your books
on page 5. The tact involved in this use of lets utterances can be perceived, not
as a face-maintaining strategy, but rather as a signal of insincerity and
condescension (De Rycker 1990: 313). As lets utterances that serve as indirect
directives are largely restricted to these rank-sensitive contexts, where direct
types [i.e. 2nd pers. sg. imperatives] of directive realisation patterns are involved,
they may well be interpreted as an unnecessary display of superiority on the part
of the speaker (ibid.).
In a similar way, one can also find the use of the first person plural
pronoun we to refer to the hearer only. This use often occurs in the interaction
between medical staff and patients, where this aspect of pretended togetherness is
found in questions such as How are we feeling today?, Did we sleep well last
night?, Did we take our pill?, where we is used to refer to the patient only. This
convivial use is often felt to be patronising as it resembles the use of we that is
sometimes found in the discourse of parents or teachers when addressing children
(e.g. Did we do our homework yesterday?). It will become clear from the analysis
below that lets utterances can also be used (deliberately) in a patronising way,
especially when they are seemingly used to smooth out rank differences where
there are none.

3.3 Lets utterances as conversational imperatives

In the literature, the uses of lets utterances as action instigators (whether joint or
not) are mostly illustrated by examples that involve a non-linguistic action (cf. the
typical example Lets go for a drink). It is presented as the means par excellence
to make a proposal for the speaker, the hearer and possibly others to do
something together outside the linguistic boundaries of the ongoing conversation.
De Rycker (1990: 229), however, remarks that lets utterances can also have
specific functions within the linguistic context of the ongoing interaction itself.
He says that, as conversational imperatives (), they frequently serve no other
function than managing part of the topical and structural development of the
interaction itself and perform actions relevant to the talk exchange. Their
illocutionary functions have to do with conversational activity (Levinson 1983:
228): speaking, to stop with speaking, paying attention, listening, and
220 Bernard De Clerck

interactional operations such as turn-taking, holding or yielding the floor,

interrupting, butting in. In other words, the functions that lets utterances
perform cannot only be considered in the light of their purpose as illocutionary
acts but also as conversational moves, i.e. acts that regulate the ongoing talk
exchange itself by initiating, responding, interrupting or redirecting (Stubbs
1984: 149). This use is illustrated in the following corpus example, where the
broadcaster uses the lets utterance to steer the conversation in a new direction:

(1) B: in the Academy they taught them to use more pencil but in the College
more rubber <,>
A: Well lets talk about Arnold Bax because the names already come up in
our conversation uhm and uh hes obviously a very important figure and
both of you have recorded quite a good deal of music by Bax
A: Where did he look for his sources of inspiration?
(ICE-GB/S1B-032#90-92; broadcast discussions)

Other lets utterances that work at the level of ongoing interaction are
those which function as pragmatic formatives of the commentary type (Fraser
1987: 187) or as prospective or retrospective metapragmatic comments
(Thomas 1985: 770). They signal how the primary illocutionary act (performed
by the utterance of which they are part) fits into the ongoing conversational
structure (Fraser 1987: 187). Some examples are Lets say or Lets face it. Their
use as true suggestions for a joint activity of saying or considering something is
secondary to their use as metalinguistic utterances that provide clues to the
interpretation of the utterance as a whole.
In the next section, we will have a closer look at the frequency of the uses
mentioned above in the spoken component of the ICE-GB and investigate
whether their description in the literature can account for all attested uses.
Attention will be paid to the frequency of lets utterances in different text
categories, their different pragmatic functions and the relationships that can be
established between these functions and the speakers who use them. A distinction
will be made between conversational and non-conversational uses and between
truly joint and speaker/hearer-oriented lets utterances, and related to these,
whether they are to be seen as negotiable proposals or as conversational moves.

4. Lets utterances in the spoken ICE-GB

4.1 Frequency of lets utterances in the spoken component of the ICE-GB

In this study, I examined the use of lets utterances in the spoken component of
the ICE-GB corpus. The spoken component consists of 300 texts, hierarchically
organised in different text categories, which are represented in Table 1.
164 instances of lets utterances could be attested, unevenly spread across
the different text categories.11 In the dialogues the frequency of lets is 1/3000
On the pragmatic functions of lets utterances 221

and in the monologues 1/5000 words.12 Moving further down to the text
categories, one can trace more specific differences in frequency.

Table 1. Text categories in the spoken ICE-GB

(Figures in parentheses indicate the number of 2,000-word texts in each category)

dialogue (180) private (100) face-to-face conversations (90)

phone calls (10)
public (80) classroom lessons (40)
broadcast discussions (20)
broadcast interviews (10)
parliamentary debates (10)
legal cross-examinations (10)
business transactions (10)
monologue (100) unscripted (70) spontaneous. commentaries (20)
unscripted speeches (30)
demonstrations (10)
legal presentations (10)
scripted (30) broadcast talks (20)
non-broadcast speeches (10)
mixed (20) broadcast news (20)


tes tions ches calls tions talks iews aries ches tions tions ions sons tions
ba v t s
y de enta spee one mina cast inter men spee ersa nstra scus m les nsac
n ta pr cast eleph -exa road ast com ted conv emo st di sroo s tra
r e s
me legal road t ss b adc us crip ce d ca las nes
rlia cro s
bro aneo un -to-f
a ad c usi
pa n-b al nt bro b
no leg o fa c e

Figure 1. Distribution of lets in the spoken categories of the ICE-GB (tokens per
10,000 words)
222 Bernard De Clerck

Figure 1 shows the distribution of lets utterances in the dialogue and

monologue text categories. The differences in frequency of lets utterances across
the text categories are quite large. Lets is most common in the dialogue text
categories of business transactions, classroom lessons and broadcast discussions.
Within the monologue text categories it is especially frequent in the
demonstrations (in fact more frequent than in the face-to-face conversations) and
in the unscripted speeches.
In an attempt to explain these differences in frequency, I looked at the
different functions of lets utterances in the corpus and tried to establish
relationships between the functions and the text categories in which they

4.2 Pragmatic functions of lets utterances in the ICE-GB

4.2.1 Joint, speaker or hearer-oriented agency

I first investigated the use of lets utterances in the corpus by using the distinction
that is made in the literature between joint, speaker and hearer-oriented agency.
The pie chart in Figure 2 shows the distribution of the intended agents of lets
utterances in the corpus.

joint agency
intended agent ~hearer
54% intended agent ~speaker


Figure 2. Intended agents of lets in the spoken ICE-GB

As we can see, about half of the lets utterances are proposals for joint action in
which the hearer and the speaker are the intended agents of the proposed action.
The following corpus example illustrates this use:
(2) A: Let s have a good uh
A: So let s play Trivial Pursuit as well after or something
B: Mm
A: Shall we
(ICE-GB/S1A-048#123-126; face-to-face conversations)
On the pragmatic functions of lets utterances 223

Both speaker and hearer will be involved and actively participate in the proposed
action. In 38% of the lets utterances the intended agent of the action is the
speaker. In most of these cases, the hearer neither gives nor has the opportunity to
give a verbal response and is undergoing the action performed by the speaker. In
(3) we have an example of this speaker-oriented use in which a woman addresses
herself while trying to solve a slide feed problem. The use of lets is similar to
that of let me in this case:
(3) A: Think I have a slide feed problem <,,>
A: Here lets try the next one <,>
(ICE-GB/S2A-029# 97-98; unscripted speeches)
Hearer-oriented utterances only comprise 8% of the cases. Example (4) illustrates
this use:
(4) A: So you go up on O and come down on Ooo and see if we can get to it that
way <,,>
B: That was a bit That was certainly easier <,>
A: Well lets do it again only this time your little <unclear-words>
B: Yeah
(ICE-GB/S1A-044# 114-117; face-to-face conversations)
The music teacher is using a lets utterance to give instructions, but does not carry
out the proposed action herself.





60% intended agent= speaker

50% intended agent =hearer

40% intended agent= we





Figure 3. Intended agents in dialogue and monologue text categories

Figure 3 shows the distribution of these three types in the monologue and
dialogue text categories. A closer examination of these different sub-categories
provides the picture shown in Table 2.
224 Bernard De Clerck

Table 2. Intended agents of lets in the subcorpora of the ICE-GB

Text categories Intended Intended Intended agent =
agent = we agent = speaker
face-to-face conversations 44 7 12
telephone calls 1 0 0
classroom lessons 8 1 10
broadcast discussions 9 2 8
broadcast interviews 2 0 2
legal cross-examinations 2 0 0
business transactions 7 0 6
spontaneous commentaries 5 0 3
unscripted speeches 7 0 15
demonstrations 1 2 5
broadcast talks 1 1 3

The table shows a higher concentration of speaker-oriented lets utterances in

unscripted speeches, demonstrations, classroom lessons and broadcast talks. Lets
utterances with joint agency are most frequent in face-to-face conversations,
spontaneous commentaries, business transactions and broadcast discussions
(although in these categories there is a fair number of speaker oriented lets
utterances as well). In order to find an explanation for these different frequencies
in the various text categories, I analysed their pragmatic function and the
background information on the speakers. The following sections deal with these
parameters and the influence they have on the use of lets utterances.

4.2.2 Conversational use of lets

Apart from distinguishing between the different kinds of agent, the analysis also
focused on the pragmatic function as conversational and non-conversational
imperatives. The first important observation is the high frequency of
conversational lets utterances in the ICE-GB. No less than 67% of joint and
speaker-oriented lets utterances consisted of process types that were aimed at
influencing the conversational flow of the interaction. An example of this
conversational use is shown in (5), where the lets utterance is used by a
university lecturer to structure the topical organisation of his talk:
(5) A: So we 've got the nerve is having a kind of trophic action on muscle but
ma muscle actually also a a a acts in a trophic way towards the nerve
A: But let s just stick with the nerve affecting the muscle for the moment.
(ICE-GB/S1B-009#159-160; classroom lessons)
The correspondence between speaker-oriented lets utterances and its
conversational uses was most obvious in the monologue text categories of
unscripted speeches, demonstrations and broadcast talks. About 70% of the lets
On the pragmatic functions of lets utterances 225

utterances are speaker-oriented, 70% of which are used as conversational

imperatives. Examples (6) and (7) exemplify their conversational use as
expository directives (Huddleston and Pullum 2002: 931) in these text
(6) A: So I said there was going to be a lot about Ravenna in this lecture
A: And uh lets just start by establishing the idea of the basilica plan uhm
because of course many of the famous
A: churches of Ravenna are built in this style.
(ICE-GB/ S2A-060#44-46; demonstrations)
(7) A: So its to our advantage that the Greenhouse Effect exists
A: So lets just backtrack for a second
A: What Ive talked about is uh the following
(ICE-GB/S2A043#115-117; unscripted speeches)
Rather than being genuine proposals for a joint action to which a verbal response
of agreement or disagreement is expected, they are more like announcements of a
topical shift that round off the present topic and introduce the next step in the talk.
Apart from structuring the speakers own talk, they are also aimed at engaging the
active participation of the addressee in the speakers exposition, but do not really
expect a verbal response from the audience. As conversational imperatives they
are often used in combination with other utterance launchers such as well, right,
OK, so or hesitation markers (uhm, hmm). This was the case in about 70% of the
conversational imperatives. They typically mark the end of one particular topic,
or introduce a kind of conclusive utterance that rounds off the topic and paves the
way for a new one (cf. (6) and (7) above).
A similar use of speaker-oriented lets utterances can be attested in the
dialogue categories. The first important observation that can be gleaned from the
analysis is that only 5% of the speakers using speaker-oriented lets utterances
had less institutional or conversational power (they were interviewees, who used
lets in lets see as a hesitation device). The rest of the speakers using this type
had the same or more institutional or conversational power. Institutionally and/or
conversationally more powerful speakers include teachers, doctors, student
counsellors, careers counsellors and professors, broadcasters, interviewers and
journalists. 80% of the utterances introduced by speaker-oriented lets consisted
of process types that were directly aimed at influencing the conversational flow of
the interaction, both with regard to topical management and turn-taking.
Especially in broadcast discussions, broadcast interviews and classroom lessons,
they were used by the more powerful speaker who creates the illusion of a joint
and convivial interaction, but who is in fact skilfully using lets as a way to steer
the discourse and to decide which actions are to be taken (whether jointly or not)
and when. These lets utterances are actually performative in nature, as utterance
launchers or idiomatic overtures (Biber et al. 1999: 1073); the speaker is
announcing the next step to be taken in the interaction rather than presenting a
genuine proposal for joint activity. Apart from propelling the conversation in a
226 Bernard De Clerck

new direction, or orienting the listener to the following utterance, especially in

relation to what has preceded, their role also consists in providing the speaker
with a planning respite, during which the rest of the utterance can be prepared for
execution (ibid.). In this way, the broadcasters use of lets see as a hesitation
marker in example (8) does not just represent a mental process of cognition,
semantically speaking, but rather a conversational process of the organisational
type: a linguistic means of bridging a possible gap during ones contribution and
hence a means of turn-holding and/or preventing another participant from
claiming the turn.
(8) B: But in fact you 've both recorded the Enigma Variations and both with
the with the London Philharmonic Orchestra
B: But uh let s see
B: Watch Which one are we going to have
B: That 's the problem
A: Have Jack 's it 's
(ICE-GB/S1B-032#61-65; broadcast discussions)
Table 3 shows the distribution of conversational and non-conversational uses of
lets in the dialogue text categories.14

Table 3. distribution of conversational and non-conversational uses of lets in

dialogue text categories
Text categories Conversational lets Non-conversational lets
business transactions 7 6
broadcast interviews 3 1
broadcast discussions 12 7
classroom lessons 16 3
legal cross-examinations 2 0
telephone calls 0 1
face-to-face conversations 16 47
parliamentary debates 0 0

The conversational use of lets prevails in classroom lessons, business

transactions, broadcast discussions, broadcast interviews and legal cross-
examinations. Lets utterances used as conversational imperatives then seem to be
part of the repertoire of chairmen, interviewers and teachers, generally speaking
of interactionally more powerful speakers, who present the conversation as a joint
enterprise, but actually try to control it by restricting the hearers influence to a
minimum. Rather than giving the floor to the hearer and providing her with the
opportunity for verbal agreement or disagreement, the speaker keeps the floor and
starts carrying out the action immediately after announcing it.
Apart from this use at a structural or topical level, conversational lets
utterances also function as pragmatic formatives in 10% of the cases. This
means that they function as metalinguistic utterances, which signal how the
On the pragmatic functions of lets utterances 227

primary illocutionary act (performed by the utterance of which they are part) fits
into the ongoing conversational structure (Fraser 1987: 187). Their use is briefly
illustrated in the following examples:
(9) A: And the bar for a start
A: Right
A: which is unlikely, lets face it.
B: So I said to him
(ICE-GB/S1A-008# 284-287; face-to-face conversations)
(10) A: But I still believe that <,> what is
A: I mean let s say Charles Dickens is communicating through you
A: It 's still <,> got to be through your through the medium of you and
therefore the writer
(ICE-GB/ S1B-026#235-237; broadcast discussion)
(11) A: Well it 's not that wonderful a film really <,>
A: let s be honest
A: I 'm sure we 'll find something
B: No
(ICE-GB/S1A-006# 167-169; face-to-face conversations)
When used as a pragmatic formative, lets face it can be seen in the first place as
an indication or a signal that the speaker feels very strongly about the primary
communicative act. According to De Rycker (1990: 403), it may also have a
defensive side to it: it expresses the speakers awareness that s/he is saying
something that is either controversial or obvious, but which in each case may well
lead to a potentially unfavourable reaction by the listener. Lets is used as a
connective and in this way it still retains some of its prototypical pragmatic
meaning as an act of suggesting a desirable joint activity. Similarly, it can be
argued that lets say is an indication that part of what follows counts as a
hypothetical example, a rough guess or anything else about which the speaker is
not entirely sure (De Rycker 1990: 402). Of course it can also be used as a
convenient hesitation marker, thereby allowing the speaker to take more time to
formulate his or her thoughts. In this way lets say can be both a conversational
imperative and a pragmatic formative. Lets be honest also gives indications as to
how the utterance of which it is part should be interpreted: it seems to indicate
that although the information in the proposition might be controversial, it is still
something that the speaker supports. By using the connective lets and appealing
to the hearers sense of honesty, the speaker tries to repair common ground or at
least indicates that s/he is aware of a possible discrepancy between the
propositional attitudes of the partners in the conversation. On the whole, it
appears that the function of lets utterances is primarily interactional, when used
as conversational imperatives. Especially in the case of stereotyped or formulaic
uses taken from a stock of ready-made utterances, their illocutionary force as a
proposal for joint action is fairly weak and secondary to their use as idiomatic
overtures, hesitation markers and pragmatic formatives.
228 Bernard De Clerck

Lets utterances, however, were not used as conversational imperatives

only. In the next section I will briefly comment on the non-conversational uses
and pay particular attention to a number of special cases which are not really
action-oriented but which seem to present evaluative statements or emotions.

4.2.3 Non-conversational uses of lets utterances

Non-conversational lets utterances account for 33% of all attested utterances.

Most of them were typical proposals for joint agency and occurred in face-to-face
conversations between intimates. They exemplify the convivial nature of the
proposed action often described in the literature (cf. example (2) above). Non-
conversational speaker-oriented utterances were used as self-monitoring devices
during the process of carrying out an action (cf. example (3) above). Used as self-
addressed imperatives, rather than managing the ongoing interaction, they are
aimed at managing the speakers own actions. The attested cases of hearer-
oriented lets utterances (8%) were non-conversational in kind (cf. example (4)
above). They were especially frequent in demonstrations where the audience were
asked to perform a number of activities on a computer. In the text category of
face-to-face conversations, they all occurred in one and the same interaction,
where an institutionally more powerful informant gives instructions to a
colleague who is also trying to solve a computer problem. As mentioned above, it
is possible that the indirectness of the utterance can be perceived, not as a face-
maintaining strategy, but rather as a signal of insincerity and condescension.
However, there was no evidence in the linguistic output of the hearers in the
corpus that shows that they regarded the use of lets in these cases as insincerely
over-polite. In the attested cases this was probably due to the fact that the
instructions were given for the benefit of the hearer.
Interestingly, not all hearer-oriented lets utterances occurred in a rank-
sensitive context, where it was used by the more powerful speaker. There is one
instance in the corpus where lets is used between intimates, which exemplifies
yet another pragmatic use. In the following example the lets utterance is not
primarily used as an action instigator but as an evaluation of the hearers
(12) A: God you really know how to put someone down don't you
B: Oh lets not get touchy touchy <,,>
A: Very difficult Moses when you 're around <,>
(ICE-GB/S1A-038#253-255; face-to-face conversations)

In this example one can see that a negative evaluation of the hearers behaviour
has been poured into a lets utterance. Bearing in mind the fact that hearer-
oriented lets utterances are primarily used in rank-sensitive contexts (cf. above),
one could see this as a conscious use of patronising language by the speaker in
order to be deliberately offensive, ironic or sarcastic in criticising the hearers
behaviour. From this position of assumed authority the speaker reprimands the
On the pragmatic functions of lets utterances 229

hearer and monitors her behaviour in a way similar to parents reprimanding

It appears that lets utterances cannot only be used as typical action
instigators, but also as reactive acts of (dis)agreeing performed in response to
prior claims, affirmations, statements and similar assertive illocutions. Many
formulaic and colloquial imperatives, such as lets be real, come off it, dont be
stupid, apart from being directive utterances, are also and maybe in the first
place retrospective evaluative comments about the addressees behaviour or
statements. In a sense, they are assertions, expressions of the speakers disbelief,
disapproval, accompanied by a directive dimension aimed at redirecting the
addressees behaviour. Next to expressions of evaluation of the hearers
behaviour (such as disapproval in example (12)), lets was also used to express a
more positive attitude. In example (13), the lets utterance is not really being used
as an action instigator, but rather as a genuine expression of positive attitude
towards the hearer:
(13) A: Thanks a lot
B: All right
B: Let s hope it works
A: Yep
(ICE-GB/S1A-078#185-187; face-to-face conversations)
Rather than acting as a request or proposal for joint action (Shall we hope it
works?), it is an expression of (shared) concern, an expression of sympathy and
In this way, then, it seems that lets utterances cannot only be used as
action instigators or as conversational imperatives, but also as expressions of
emotion or approval. In these cases, their directiveness is less prominent.

5. Conclusion

Apart from their typical function as a (genuine) proposal for joint action, lets
utterances have various other pragmatic uses. From the analysis it has become
clear that, in the ICE-GB, their most frequent function is that of a conversational
imperative, aimed at regulating the conversational flow of the interaction or the
structure of the speakers own talk. This use is especially frequent when the lets
utterances are speaker-oriented or in cases of joint agency when they are used by
the interactionally more powerful speaker. In such cases, their illocutionary force
of proposal for joint action is secondary to their use as conversational managers.
Other non-typical functions were attested in non-conversational uses of
lets utterances which were not really oriented towards joint action, but which
aimed at presenting evaluative statements or feelings on the part of the speaker at
an interpersonal level. Rather than being proposals for joint action, they can be
seen as retrospective evaluations.
230 Bernard De Clerck

Further research will require a more detailed pragmatic analysis of lets

utterances and a comparison with other let constructions (e.g. let me) and other
imperative structures, such as look, listen, tell me and the way these relate to lets
utterances as conversational imperatives.


1. The research reported on in this paper was made possible by the Research
Fund of Flanders.
2. There are other more specialised uses of lets utterances, which will be
focused on in Sections 3 and 4. None of these uses, however, is completely
compatible with the strict notion of request, as in allow.
3. Seppnen rightly remarks that in this way, let is both syntactically and
semantically, close to the modal may, as used in wishes: May I (we) never see
that day! (Seppnen 1977: 517).
4. Another argument Seppnen uses to support his subject interpretation of the
following NP, is the existence of utterances such as

(a) Let you and I cry quits

(b) Let all these matters pass and we three sing a song

which have the nominative form and can hence not be explained in the
analysis of let as the imperative of a main verb. As an example of a parallel
development he refers to Dutch, where similarly both the accusative and the
nominative are used with the verb laten: Laat me/ik voorzichtig zijn, Laat
hem/hij maar komen, Laat ons gaan/even Laten we gaan. The parallelism
with Dutch, however, is far from complete. In fact, the only instances where
these rare examples of the nominative form after let are found in English, are
restricted to cases where the NP consists of two co-ordinated NPs and where
the NP is separated from the verb. According to Davies (1986), the
nominative form in (a) could be felt to result from the same sort of
hypercorrection that is responsible for the now frequent use of forms like
between you and I, or Thats for you and I to decide, while (b) could be
considered to result from hypercorrection in reaction to use forms like us
three even as subjects (Davies 1986: 237). Apart from that, no examples of
*Let we see or *Let I see can be attested in the English language, which, all in
all, makes the argument of parallelism with the uses of laten in Dutch rather
5. A similar view is taken by Ukaji (1978). With regard to lets constructions,
the meaning is described as one in which the speaker prays the hearer to
allow a group of persons among whom the speaker and the hearer are
included to carry out a particular action (Ukaji 1978: 120).
On the pragmatic functions of lets utterances 231

6. Davies (1979) errs in the other direction by assigning special status to all
examples involving let, even those examples which are obviously instances of
the imperative form of the lexical verb.
7. To support their argument Huddleston and Pullum refer to the occurrence of a
negative construction used by some speakers of dialect B that provides
evidence for the reanalysis: lets dont bother. Huddleston and Pullum (2002:
935) remark, though, that this is much less common than the construction
with an NP after lets, and cannot be regarded as acceptable in standard
English. Still, according to Huddleston and Pullum (ibid.), its syntactic
interest is that it shows conclusively that let is no longer construed as a verb: a
subjectless dont could not appear in the complement of a catenative verb.
This, in fact corroborates what Quirk et al. say about the lets dont in AmE.
They do not say, however, that it is not acceptable as standard English.
8. It is possible, of course, that some of these constructions do occur in certain
dialects of English. However, from the fact that they do not occur in the ICE-
GB, we might tentatively conclude that they are not used widely enough to
qualify as acceptable informal style in standard British English.
9. There are other uses of us with a first person reference in colloquial English.
Instances such as Give us a hand, Tell us a story, Give us a kiss all feature
uses of us with a first person singular reference. In spoken language us is
often abbreviated to /s/, which makes its resemblance to this particular use of
lets even more striking.
10. Although Biber et al. (1999) say that the meaning of lets is equivalent to let
me in these cases, I tend to believe that this equivalence is not complete. It
seems to me that there is a slight difference in illocutionary force and in the
freedom given to the addressee to reject the proposal. Let me, just like let us,
can still be used or interpreted in two ways: as a true request for permission
from a person or as a (self-addressed) exhortative. As mentioned, the
permissive interpretation of lets has faded and an example like Lets have a
look at your throat can no longer be interpreted as a true request for
permission the way Let me have a look at your throat can. Both utterances
aim at getting the hearers agreement, but do so in a slightly different way. By
playing with the requestive interpretation of the ambiguous let me, one can
give the impression of actually looking for or asking for agreement, whereas
when one uses the convivial lets ranks being equal agreement is taken
for granted and can be taken as a starting-point to proceed with the proposed
action. Consequently, one could say in a more accurate way that the meaning
of first person lets is equivalent to the exhortative meaning of let me only.
11. All in all, the corpus contains 178 lets utterances, 14 of which were found in
the written part of the corpus.
12. The mixed category of broadcast news did not feature any lets utterances.
13. Clearly, the fact that some text categories allow certain specific pragmatic
uses more easily than others, is just one possible way of explaining these
232 Bernard De Clerck

differences in frequency. Of course one should also bear in mind stylistic

reasons, to name but one aspect in explaining differences between the text
categories. In parliamentary debates or legal presentations, for example,
which contained no instances of lets, speakers might have chosen a less
informal construction in order to comply with the formal nature of interaction
that is taking place.
14. In the text categories where only a few lets utterances were attested (e.g.
telephone calls, broadcast talks, legal cross-examinations), it should be clear
that the distribution of conversational and non-conversational uses is not to be
taken as representative of the general use of lets in these categories. Further
research, including the pragmatic analysis of a larger amount of data, will be
needed to corroborate the attested uses.


Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), T h e

Longman grammar of spoken and written English. London: Longman.
Costa, R.M. (1972), Lets solve lets!. Papers in Linguistics 5: 141-144.
Davies, E.C. (1979), On the semantics of syntax: mood and condition in English.
London: Croom Helm.
Davies, E.E. (1986), The English imperative. London: Croom Helm.
De Rycker, T. (1990), Imperative subtypes in conversational British English: an
empirical investigation. Unpublished PhD dissertation. Department of
Linguistics, University of Antwerp.
Fraser, B. (1987), Pragmatic formatives, in: J. Verschueren and M. Bertuccelli-
Papi (eds), The pragmatic perspective: Selected papers from the 1985
International Pragmatics Conference. Amsterdam: Benjamins, 179-194.
Halliday, M.A.K. (1994), An introduction to functional grammar. London:
Edward Arnold.
Hamblin, C. (1987), Imperatives. Oxford: Basil Blackwell.
Huddleston, R. and G.K. Pullum (2002), The Cambridge grammar of the English
language. Cambridge: Cambridge University Press.
Leech, G. (1983), Principles of pragmatics. London: Longman.
Levinson, S. (1983), Pragmatics. Cambridge: Cambridge University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Seppnen, A. (1977), The position of let in the English auxiliary system.
English Studies 58: 515-529.
Stubbs, M. (1984), Discourse analysis: the sociolinguistic analysis of natural
language. Oxford: Blackwell
Swan, M. (1996), Practical English usage. Oxford: Oxford University Press.
On the pragmatic functions of lets utterances 233

Thomas, J. (1985), The language of power: towards a dynamic pragmatics.

Journal of Pragmatics 9: 765-783.
Ukaji, M. (1978), Imperative sentences in early modern English. Tokyo:
Methodological problems in corpus-based historical
pragmatics. The case of English directives

Thomas Kohnen

University of Cologne


This paper gives a summary of some methodological problems a corpus-based

diachronic analysis of speech acts has to face, with particular emphasis on the
case of English directives. Among the issues discussed are the difficulties with a
complete inventory of the different manifestations of directives in the history of
English, the scarcity of historical data, problems of interpretation and the
relationship between the number of the particular manifestations of directives
found in a corpus and the 'underlying' total number of directives. In a second part
the paper presents some illustrative results of corpus-based investigations tracing
the history of English directives.

1. Introduction

The study of the diachronic development of speech acts with the help of
electronic corpora raises serious questions which challenge both the reliability of
existing data collections and the results of the investigations which are based on
them. Any attempt to write a corpus-based illocutionary history is faced with
basic problems involving the methodology of historical pragmatics and the design
and use of historical corpora. This paper aims to give a summary of some
important problems which I encountered in my corpus-based research, illustrating
them with several studies on the history of English directive speech acts. It falls
into two parts. The first part addresses some basic methodological issues; the
second part is devoted to some illustrative results of studies exploring aspects of
the history of English directives.

2. Methodological issues

One of the rather basic methodological problems a corpus-based diachronic study

of directives is faced with is finding all the different manifestations or patterns
associated with directive speech acts in the history of English. The research
carried out in the fields of speech-act theory and pragmatics has made it
sufficiently clear that with speech acts there is no predictable link between form
and function. We may know that a directive speech act is an attempt by a speaker
or writer to get the addressee to carry out an act (Searle 1969: 66; 1976: 11) and
238 Thomas Kohnen

we may assume that this illocutionary function remains stable throughout the
history of English. But we do not know in advance what linguistic form a speaker
or writer may employ for his directive. We can only rely on the fact that people
tend to use more or less fixed phrases or patterns in order to perform certain
speech acts. Since corpus searches must be based on forms rather than functions,
the study of a history of directives has to start with a selection of forms we would
consider typical manifestations of directives in the different periods of the
English language.
What are the most important manifestations of directives in the history of
English? The most straightforward examples which come to mind are explicit
performatives (I order you to carry this message to the king), imperative
sentences, constructions with let (lets do it) and constructions involving the
subjunctive, especially those with inverted word order (go we).1 Clearly,
however, there are many other manifestations of directive speech acts. The
number of possible candidates becomes even greater if we include those
realisations which are sometimes called indirect, because they involve sentence
types different from the imperative format.2 Some typical examples are
declarative sentences with the second person pronoun plus a modal involving
obligation (you must leave, you ought to do this etc.), declarative sentences with a
first person pronoun plus a verb involving volition (I want you to do this, I would
like you to do this etc.) and different kinds of interrogative manifestations (Can
you open the door? Will you do the washing up? Why dont you come in? cf.
Quirk et al. 1985: 1477-78).
Quite clearly, this enumeration could be continued. It is difficult to give a
comprehensive list of all the typical manifestations of directive speech acts for all
periods of the English language. What does this mean for a corpus-based analysis
of speech acts? Basically, there seem to be two kinds of procedure. First, since
we are faced with an open, heterogeneous and highly variable set of forms we can
restrict our analysis to an eclectic illustration of the speech act under
consideration. That is, we look around in the periods of English and see what
typical realisations we find, for example, showing some imperatives and inverted
constructions in Middle English texts and some interrogative constructions in
Early Modern texts, perhaps adding some intuitive judgments about changes
which we assume to be typical. Secondly, we can base our analysis on a
deliberate selection of typical patterns which we trace by way of a representative
analysis throughout the history of English. For example, we could examine the
development of imperatives, constructions with let or interrogative directives
throughout the history of English. I call the first kind of procedure illustrative
eclecticism, the second structured eclecticism. Given the fact that the research is
doomed to be eclectic, I think corpus linguists should opt for structured
Another methodological problem is difficulties of interpretation. Speech
act assignment is in many cases a matter of interpretation which requires careful
consideration of contextual knowledge. For example, we are liable to find
imperative constructions which do not serve as directives, but as imprecations or
Methodological problems in corpus-based historical pragmatics 239

wishes (Quirk et al. 1985: 831-832, Biber et al. 1999: 220). This problem, of
course, becomes more serious with indirect manifestations because their
indirectness is due to their openness to different speech-act assignments. These
are difficulties with functional interpretation, which apply to any linguistic data,
historical or contemporary. But in the history of English directives we also
encounter difficulties which relate to semantic or syntactic changes. This often
results in what I would call pragmatic false friends, constructions which, against
a contemporary background, suggest a wrong pragmatic interpretation. I will
present four examples.
The first example is taken from Chaucer's Canterbury Tales. Here a young
knight urgently needs to know what women desire most.

(1) My leeve mooder, quod this knyght, certain

I nam but deed but if that I kan seyn
What thyng it is that wommen moost desire.
Koude ye me wisse, I wolde wel quite youre hire.
(c 1395, Geoffrey Chaucer, The Wife of Bath's Tale, 1005-1008)

Quite clearly, the meaning of the last line is not Could you please instruct me? I
would certainly reward your efforts, that is, an indirect request, but rather If you
could tell me (knew how to instruct me), I would reward your efforts. The
difficulty of interpretation is here due to the fact that the inverted clause pattern
(koude ye) could be interrogative or conditional in Middle English and that the
verb cunne was still used as a full verb in this period.
The second example is taken from an official letter by Henry V:

(2) we wol and charge you. fiat ye se and ordeyne at hasty restitucion
of e forsaide goodes be maad and at ye do compelle our saide
sougettes to make restitucion abouesaid
(Helsinki Corpus, Letters, 1418/1419, Henry 5, 99)

The difficulty of interpretation is here due to the fact that willan had an additional
speech-act meaning in Old English and Middle English. Thus the expression we
wol must be taken as a performative phrase, where wol has the speech-act
meaning of order, command. The speech-act meaning seems likely since wol is
in a co-ordinate construction with another performative directive (charge). Thus
we are not dealing with some kind of indirect directive along the lines of I would
like you to but with a performative expression.

(3) Ford: Blesse you sir.

Fal.: And you sir: would you speake with me?
(Helsinki Corpus, 1623 [1597], William Shakespeare, The Merry
Wives of Windsor, 46.C1)
240 Thomas Kohnen

In the third example, which is from Shakespeare, Falstaff has already been
informed that Ford wants to talk to him. Thus the utterance Would you speak with
me? is not a request (Would you talk to me?) but rather a real question which
serves to identify the man who wanted to talk to Falstaff (Did you want to talk to
me?). In Modern English this interpretation would not be possible because would
cannot be taken as referring to the past.

(4) But now my good masters since we must be gone

And leaue you behinde vs, here all alone:
Since at our last ending thus mery we bee,
For Gammer Gurtons nedle sake, let vs haue a plaudytie.
(Helsinki Corpus, 1575 [1552-63], William Stevenson, Gammer
Gvrtons Nedle, 70)

The fourth example contains a construction with let us. It is found at the ending
of Stevenson's play Gammer Gvrtons Nedle. Here the phrase let vs haue a
plaudytie clearly is an invitation addressed to the audience to give applause,
which should be paraphrased with allow us / cause us to have some applause.
Thus, although this construction is an imperative, it cannot be considered a
hortative or periphrastic imperative construction (cf. Rissanen 1999: 279). Rather
it is a construction with the full verb let. This is quite disturbing against a
contemporary background since we would like to rely on the assumption that
constructions with let us are always periphrastic constructions.
The discussion of the excerpts has shown that a corpus-based analysis
which selects items on the basis of form must be extremely careful with the
interpretation of the examples. Each individual item which we assume to be a
manifestation of a directive speech act requires careful consideration if we want
to avoid pragmatic false friends.
The third methodological issue involves the relationship between the
examples of particular manifestations of directives found in a corpus and the
underlying total number of directives. If we compare the different frequencies
of selected manifestations in the periods of English, do the increasing or
decreasing numbers only reflect an increase or decrease in the respective
manifestations or do they suggest a general change in the use of directives as
well? For example, if we find more directive performatives in Late Middle
English letters than in Modern English ones, is this because people choose to use
different means for expressing their requests in letters today or is it because they
use fewer requests? In other words, would a decreasing frequency of
performative directives point to an increase in alternative manifestations (e.g.
imperatives, constructions with let, interrogative directives) or to a general,
underlying decrease of directives? I think this problem can only be tackled if we
base our analysis on comparable text types or genres and if we assume a more or
less stable functional profile for these text types or genres. For example, we could
assume that religious instruction requires directive speech acts in the Middle
Ages as well as today. Or we might assume that text types involving spoken
Methodological problems in corpus-based historical pragmatics 241

interaction are liable to contain a stable amount of directives because, when

people talk to each other (especially in an everyday setting), they are likely to
perform requests.
The fourth methodological problem to be mentioned here is how to deal
with the lack of sufficient data. In the early periods of English the number of
relevant examples (especially of indirect manifestations) found in a classic
corpus (like the Helsinki Corpus) tends to be fairly low. An analysis based on
such numbers cannot be held to be valid. Among the options of dealing with this
problem is extending the database by using large dictionary corpora (e.g. MED
or OED) or finding functional patterns in the restricted number of items at hand.
Each approach has advantages as well as disadvantages.
When studying constructions with let me (e.g. let me ask you to do this) I
found that the Helsinki Corpus had only 14 relevant items in the Middle English
section.3 In order to obtain more examples I searched all the quotations in the
electronic version of the Middle English Dictionary. Here my search produced
231 relevant items. On the basis of these data one can show, for example, that in
Middle English there are no combinations with let me plus an illocutionary verb
(let me ask you, let me entreat you). I presume that such a result is much more
reliable if based on the MED data rather than on the 14 items found in the
Helsinki Corpus. On the other hand, with electronic corpora like the MED we
do not know the proportion of the individual text types and we cannot get
regularised frequencies.
When studying interrogative manifestations of directives which mention
the addressee (Can you pass the salt? Would you do the washing up?) I found
only 36 examples in the Helsinki Corpus. This number may seem very small but
in the Early Modern English section these data showed an exceptional degree of
distributional consistency. First, all instances (apart from one item) belong to
spoken interaction, and the large majority of the items (85%) belong to two text
types: plays and trials. If we focus on these two text types (containing 79,000
words out of the total number of 551,000 words), we can see a consistent increase
of the items with a reasonably high frequency reaching 4.15 per 10,000 words at
the end of the 17th century (see Table 1; for a detailed discussion of the data see
Kohnen 2002).

Table 1: Addressee-based directives in plays and trials in the Helsinki Corpus

(tokens per 10,000 words)

1420-1500 1500-1570 1570-1640 1640-1710

0 2.26 3.45 4.15

So the scarcity of data may be balanced if we focus our attention on indvidual

text types and their functional profiles. On the other hand, the concentration on
242 Thomas Kohnen

individual text types or genres may raise doubts about the representativeness of
the analysis.
To sum up this methodological section, I would like to advocate the
procedure which was called structured eclecticism. It makes up for the
heterogeneity of the data by systematic selection, comprehensive diachronic
statistical analysis and careful consideration of each item. In addition, it has been
shown that a diachronic analysis of speech acts should be embedded in a
reasonably stable functional profile of text types. This as well as the notorious
lack of data call for more extensive text-type specific corpora for historical
pragmatic studies.

3. Aspects of a history of English directives

What is the outline of a history of directives in English? It seems best to start with
a general consideration of the speech-act class of directives. Since a directive
aims at an act to be performed by the addressee, it can be seen as a threat against
the addressee's freedom of action and freedom from imposition, that is as a threat
against what is usually called the addressee's negative face (Brown and Levinson
1987). It seems that in the history of English considerations of face have assumed
increasing importance, changing the manifestations of directives more towards
polite and indirect realisations. This tendency can be illustrated by a decrease of
direct realisations of directives, for example performatives and imperatives, and
by an increase of prototypical indirect manifestations, for example interrogative
directives and constructions with let.
With regard to performatives, I found that the frequency of directive
performatives in the Old English section of the Helsinki Corpus is seven times as
high as that found in the LOB Corpus (4 vs. 0.55 per 10,000 words). In addition,
the frequency of performative verbs referring to acts of ordering and
commanding is far higher in the Old English section than in the LOB Corpus (1.5
vs. 0.07). And whereas the LOB Corpus shows a clear predominance of
suggest/advice verbs, the Old English part of the Helsinki Corpus has none (for
a detailed discussion see Kohnen 2000). So it seems that performatives, which at
least today are a rather direct and mostly face-threatening manifestation of
directives, are significantly less common in contemporary (written) English than
they were during the Anglo-Saxon period. If they are used today, they tend to be
employed in rather mild requests, like suggestions or advice. On the assumption
that people perform as many directives in written English today as they did in
Anglo-Saxon times we may infer that the performative option of directives has
fallen out of favour and that other possibly less face-threatening means are
employed instead.
With regard to the imperative manifestation of directives it can be shown
that the number of imperatives decreases from Middle English to Modern
English. I looked at imperatives in the religious treatises in the Penn-Helsinki
Parsed Corpus of Middle English (PPCME2, Kroch and Taylor 2000) and in the
Methodological problems in corpus-based historical pragmatics 243

Brown Corpus and LOB Corpus. Religious treatises, both in their Late Medieval
and their modern form, can be assumed to have a basic instructional function,
which makes it likely for imperatives to be used there. Since I wanted to focus on
instructional imperatives which directly serve the purpose of religious instruction
associated with the text type, I excluded those sentences which appear in direct
speech, that is imperatives contained in narrative sections, quotations from the
Bible, etc. (see Figure 1).

5 4,5


1 0,6

1200-1375 1390-1450


Figure 1: Imperatives (excluding direct speech) in the PPCME2, the LOB Corpus
and the Brown Corpus (tokens per 1000 words).

Figure 1 shows that the frequency of imperatives in the period 1200-1375 is fairly
high (4.5). It goes down significantly in Late Middle English (1.9). But the
frequency in the LOB Corpus (0.6) and in the Brown Corpus (0.83) is less than
half of the Late Middle English figure.4 Since we may assume that all texts share
the same basic function of religious instruction, the decrease in imperatives can
be explained either by the hypothesis that religious instruction uses fewer
directives today (and, for example, more representative speech acts instead) or by
the hypothesis that imperatives no longer enjoy wide currency as a means of
directives in religious instruction but are replaced by other manifestations, for
example indirect directives. It is difficult to determine with certainty which
hypothesis is correct, but there is some evidence which renders the second option
more probable than the first one. Frequent constructions with let us and with
modals point to the fact that directives are still being used in religious instruction.
And, of course, a number of directives employing the imperative can still be
found in the contemporary data. There is, however, a remarkable difference
between most of these imperatives and the typical imperatives in the Middle
244 Thomas Kohnen

English religious instructions. Whereas the Middle English texts involve

straightforward acts which are or should be part of the addressees everyday life
(like doing good deeds or praying, see examples 5 and 6), the imperatives
found in the modern texts typically denote mental acts which help the addressee
to decode the text or to grasp the point of the text (see examples 7 and 8).

(5) ... flerfore do flou fli-silf alle fle gode deedis wifl-oute deuocioun, fle
whiche ou didist bifore with deuocioun.
(a 1396, Walter Hilton, Hilton's Eight Chapters on Perfection,
PPCME2 - 4.23)

(6) ... and preie e hertly for hem, that God of his greet mercy eue to
hem very knowing of scripturis, and meekenesse, and charite.
(a 1397, John Purvey, Purvey's General Prologue to the Bible,
PPCME2 I, 49.2033)

(7) What about religion and politics? They are not in two watertight
compartments. Think of the number of laws that have just as much to
do with a mans soul as with his body.
(LOB, D16, 11-13)

(8) If the people had kept the Lord before them and observed His words
through the former prophets, things would have been far otherwise.
And what was His word now through Zechariah, but just what it had
been through them. Take Isaiahs first chapter as an example. He
accused the people of moral corruption, whilst maintaining
ceremonial exactitude.
(LOB, D11, 39-44)

It is far less face-threatening to guide people through a text using imperatives

than employing imperatives in requests which affect their everyday lives. So it
seems here that considerations of politeness may explain some of the changes

Table 2: Indirect manifestations of directives in the Helsinki Corpus (tokens per

10,000 words)
1420-1500 1500-1570 1570-1640 1640-1710
Speaker volition 0.37 0.79 1.16 1.11
Interrogative 0.19 0.42 0.47 0.82

Since the data indicates that the frequency of more direct and possibly face-
threatening manifestations of directives decreases in the history of English, it
makes sense to ask whether these forms are replaced by other, possibly more
polite directives. To find a possible answer, it is instructive to look at the
Methodological problems in corpus-based historical pragmatics 245

evolution of so-called indirect manifestations of directives. Table 2 shows the

development of directives involving speaker volition (the type I would like you to
do this) and of interrogative realisations referring to the addressee (Can you pass
the salt?) in the Helsinki Corpus (for a detailed analysis, see Kohnen 2002).
Although the frequency of the items found is rather low (but see the
discussion in Section 1), the general picture of the development of the
constructions is quite clear. Both constructions increase noticeably during the
Early Modern English period, with some precursors occurring in Late Middle
English. Interrogative manifestations develop more slowly in the data and one
could claim that they are later in appearing. At least their increase during the
Early Modern English period is slow and the frequency is clearly lower than that
of the other construction. It is only towards the end of the Early Modern period
that the frequencies of the two constructions converge. The result that is most
relevant here is that the indirect manifestations do not become common until the
Early Modern period. It is quite striking that this is just the time (the end of the
Middle English period) when the imperative forms seem to recede.
Another illustration of indirect manifestations of directives are
constructions with let me plus a (mostly directive) illocutionary verb:

(9) Sam. I pray you let mee intreat you: foure or five houres is not so
Dan. Well, I will goe with you.
(Helsinki Corpus, 1593, George Gifford, A Dialogue Concerning
Witches B2R)

(10) Venat. Come my friend, Piscator, let me invite you along with us;
I'll bear you charges this night, and you shall bear mine to morrow;
for my intention is to accompany you a day or two in Fishing.
(Helsinki Corpus, 1676, Izaak Walton, The Compleat Angler 212)

With this construction the speaker, instead of issuing a straightforward directive

performative, asks permission for doing so. However, since this permission is
never actually granted, the construction can be seen as a more or less
conventionalised polite or indirect formula for performing a directive. Table 3
shows the development of this construction in the Helsinki Corpus. Although the
number of items is again rather small, the general picture is quite clear.6 The
construction appears during the 16th century and its frequency increases
significantly during the Early Modern period. Here again it is the Early Modern
period which witnesses the rise of an indirect directive.

Table 3: Let me constructions including illocutionary verbs in the Helsinki

Corpus (tokens per 100,000 words)
1420-1500 1500-1570 1570-1640 1640-1710
0 0.5 5.3 8.2
246 Thomas Kohnen

4. Conclusion

Although the research presented here is the result of what I called structured
eclecticism and although corpus-based historical pragmatics faces several
methodological problems, the general picture of the history of English directives
is quite consistent. During the history of English directives become less explicit,
less direct and less face-threatening. By contrast, the number of indirect
manifestations increases. The important period for the evolution of indirect
manifestations seems to be the Early Modern period, whereas the frequency of
direct manifestations seems to decrease after the Middle English period. The
underlying motivation seems to be the growing importance of considerations of


1. Cf. Fischer (1992: 248) and Rissanen (1999: 228-229, 279-280); for a survey
of directives, see Quirk et al. (1985: 827-833).
2. See, for example, Searle (1976) and Levinson (1983: 263-276).
3. For information on the Helsinki Corpus, see Kyt (1996).
4. Text 9 from the Brown Corpus (Organizing the local church, 2056 words)
was excluded from the analysis, since, as instruction for instructors, it is
focused on organisational matters and does not contain religious instruction in
its proper sense.
5. Biber et al. (1999: 222) say that in contemporary academic prose imperatives
are used as a means of guiding the reader in interpreting the text.


Biber, D., S. Johansson, G. Leech, S. Conrad, and E. Finegan (1999), Longman

grammar of spoken and written English. London: Longman.
Brown, P. and S.C. Levinson (1987), Politeness. Some universals in language
usage. Cambridge: Cambridge University Press.
Fischer, O. (1992), Syntax, in: N. Blake (ed.), The Cambridge history of the
English language, Vol. II 1066-1476. Cambridge: Cambridge University
Press. 207-408.
Kohnen, T. (2000), Explicit performatives in Old English: A corpus-based study
of directives. Journal of Historical Pragmatics 1 (2): 301-321.
Kohnen, T. (2002), Towards a history of English directives, in: A. Fischer, G.
Tottie and P. Schneider (eds), Text types and corpora. Studies in honour of
Udo Fries. Tbingen: Gunter Narr. 165-175.
Kroch, A.S. and A. Taylor (2000), The Penn-Helsinki Parsed Corpus of Middle
English PPCME2.
Methodological problems in corpus-based historical pragmatics 247

Kyt, M. (1996), Manual to the diachronic part of the Helsinki Corpus of English
Texts. 3rd ed. Helsinki: University of Helsinki.
Levinson, S.C. (1983), Pragmatics. Cambridge: Cambridge University Press.
MED: Middle English Dictionary, electronic version, in: F. McSparran et al.
(eds) (1999), The Middle English compendium. Ann Arbor, Mi: University
of Michigan Press.
OED: The Oxford English dictionary second edition on compact disc (1994),
Version 1.13. Oxford: Oxford University Press.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London: Longman.
Rissanen, M. (1999), Syntax, in: R. Lass (ed.), The Cambridge history of the
English language, vol. III: 1476-1776. Cambridge: Cambridge University
Press. 187-331.
Searle, J.R. (1969), Speech acts. Cambridge: Cambridge University Press.
Searle, J.R. (1976), A classification of illocutionary acts. Language in Society 5:
Measure noun constructions: degrees of delexicalization and

Lieselotte Brems

University of Leuven


In a narrow sense, the term Measure Noun (MN) refers to such nouns as acre
and kilo, which typically measure off a well-established and specific portion of
the mass or entity specified in a following of-phrase, e.g. a kilo of apples. When
used like this, the MN is generally considered to constitute the lexical head of the
bi-nominal noun phrase. However, the notion of MN can be extended to include
such expressions as a bunch of and heaps of, which, strictly speaking, do not
designate a measure, but display a more nebulous potential for quantification.
The structural status of MNs in this broader sense, then, is far from
straightforward and most grammatical reference works of English are either
hesitant or silent with regard to the issue. Two main analytical options seem to
suggest themselves. Either the MN is interpreted as constituting the head of the
NP, with the of-phrase as a qualifier of this head, or the MN is analysed as a
modifier, more specifically a quantifier, of the head, which in this case is the
noun in the of-phrase.
Starting from the structural analyses of MN constructions offered by such
linguists as Halliday and Langacker, my paper goes on to discuss a corpus study
aimed at charting and elucidating the structural ambivalence observed in MN
constructions. The framework eventually opted for is that of
grammaticalization. The focus of the corpus study is synchronic
grammaticalization (Lehmann 1985). More specifically, it investigates the
degree of grammaticalization of the various MNs looked at, viz. bunch(es) of,
heap(s) of, pile(s) of and load(s) of.

1. Introduction: Whats in a name?1

Measure Nouns (henceforth MNs) or nouns of measurement in the strict sense

are nouns such as acre, litre, pound, ounce, etc. that measure off a well-defined
standard-like portion of the mass or entities specified in the of-phrase following
the MN, as in an acre of wasteland.2
However, this paper extends the use of the term MN to include nouns
which, strictly speaking, do not designate a measure, but display a more
nebulous potential for quantification. More specifically, the MN expressions in
this broader sense focused on in this study are bunch(es) of, heap(s) of, pile(s) of
250 Lieselotte Brems

and load(s) of. The type of construction in which they are used is a bi-nominal
noun phrase of the kind illustrated in the following set of examples. All examples
in this paper are from the Cobuild corpus.3

(1) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway
(2) A jilted girlfriend got revenge on the boyfriend who dumped her by
dumping a foot-high pile of manure in his bed.
(3) We still have to move loads of furniture and other stuff.
(4) The surrogate mum to princes William and Harry shared heaps of fun
with them at a fair yesterday while father Charles was otherwise engaged.
(5) I would take up a pile of commonplace books like Lord David Cecils
Library Looking Glass, John Julius Norwichs Christmas Crackers, Rupert
Hart-Daviss A Beggar in Purple, etc.
(6) Then I noticed, under a pile of other books on my nightstand, the worn
journal my father had given me those weeks ago.

The central question with regard to MN constructions is the status of the MNs
bunch, pile, loads and heaps within their respective NPs. Does the MN constitute
the head noun, or does it function as a quantifier of the head noun in the of-
phrase? Naturally, assessing the status of the MN within the bi-nominal NP has
repercussions at clause level, most notably on the question of subject-finite
concord whenever the MN-nominal occurs in subject position. The central
question of the study is of a comparative nature and focuses on possible
differences in the extent to which the various MNs have already come to function
as a quantifier (Section 3).
A quick glance at the above set of examples illustrates the specific rub of
MN constructions. Sentences (1) and (2), and (6) are rather unproblematic: in (1),
(2) and (6) the MN is the head noun, displaying the literal and collocationally
restricted meaning of bunch and pile. This analysis is reinforced by the verb
agreement between hangs and bunch in (1), under in (6) and the premodifier foot-
high in (2), which stresses the fact that it is a literal pile taking up a certain space.
In (4), on the other hand, the MN functions as a quantifier of the noun in the of-
phrase (N2).
The lexical constellation specifics of heaps have bleached into a mere
quantifier, which allows the MN to be used with an abstract noun like fun that
surely cannot be made into actual heaps. Examples (3) and (5) are more
problematic: do the furniture and stuff constitute actual loads or does the sentence
simply mean a large quantity of furniture and stuff without it necessarily being
arranged in a literal load? Do the commonplace books in (5) together make up an
actual pile or is it merely implied that the number of books could constitute a
Measure noun constructions 251

Examples (3) and (5) are intermediate between examples such as (1) and
(2), with the original and fully lexical meaning of the MN, on the one hand, and
the grammatical quantifier meaning of (4) on the other. I will suggest that the
developments observed in MN constructions are best looked at as a case of
ongoing delexicalization and grammaticalization in MNs (Section 2.3). The by
now more or less full-blown quantifiers a lot of and lots of can be considered
historical precursors in these developments.
Synchronic variation in verb agreement patterns is an important argument
for claiming that the structural status of the MN is changing from head to
quantifier. Wherever subject-finite concord is observable, analysis of data reveals
consistent patterns. When the MN is head of the bi-nominal group it controls verb
agreement (examples 7 to 9), when it functions as a quantifier, the finite agrees
with N2 (examples 10 to 12).

(7) I can show you a van-load of weapons that was confiscated at the gate.
(8) Three plane-loads of food have been ferried into the town in the past three
(9) The fox, unable to reach a bunch of grapes that hangs too high, decides
that they were sour anyway.
(10) A bunch of drunken, braindead louts seem determined to disgrace our
(11) But then, when I needed one, there were a load of excuses as to why I
couldnt borrow one.
(12) They are threatening to kill off a bunch of select committees that have
been around for a long time.

Pedagogical grammars and mainstream accounts like Kruisinga (1925), Jespersen

(1970-74), Quirk et al. (1985) and Biber et al. (1999) often bring up MN
expressions with regard to the perceived fluctuation in verb agreement. They
rarely incorporate the idea of a diachronic shift in structural status of MNs in
explaining the variation in concord. Most grammars invariably assign head status
to MNs in MN constructions.
When the verb agrees in number with the MN, strict grammatical concord
is said to be satisfied. Sentences in which the finite does not agree in number with
the MN are explained in terms of conflicting concord principles. In those cases
notional concord is said to overrule strict grammatical concord, i.e. in sentences
such as (10) to (12), the MN is still considered to be the head of the nominal
group, but it is the idea of number of the MN that determines the number of the
verb. Alternatively, the principle of proximity or attraction is invoked, which
states that whatever element of structure most closely precedes the finite controls
the number of the verb. Both notional concord and proximity concord are invoked
in an ad hoc fashion, to explain away incongruent verb agreement.
252 Lieselotte Brems

The possibility of N2 actually constituting the head in at least some cases

is not systematically considered, just like the status of the MN and N2 as such is
not systematically questioned (Jespersen 1970-74: 179; Kruisinga 1925: 306;
Quirk et al. 1985: 264; Biber et al. 1999: 184-185). I would like to argue that in
all MN constructions grammatical concord holds between the head and the verb,
depending on whether the MN or N2 is the head.
As hinted at in discussing examples (1) to (6), I will also argue that the
most adequate way of seeking order in the perceived chaos is by bringing in the
perspective of grammaticalization (cf. Lehmann 1985 and Hopper and Traugott
1993). I will argue more extensively for this approach in Section 2.3.
The following sections will first survey two relevant descriptive accounts
of MN constructions. Section 2.1 discusses Hallidays analysis of what he calls
measure nominals (Halliday 1985: 173) and Section 2.2 sums up some of
Langackers (1991) pertinent insights into MNs. Section 2.3 presents the
framework eventually adopted in this paper, viz. grammaticalization. Section 3
then reports on the most important findings of a corpus study of MNs which I
carried out based on data from the Cobuild Corpus, The Bank of English. The
focus of this study is a comparison of the extent to which heap(s) of, pile(s) of,
load(s) of and bunch(es) of have already become grammaticalized.

2. Theoretical-descriptive starting point

2.1 Hallidays account of Measure Noun constructions

Halliday (1985: 173) deals with MN constructions in a section following his

expos in favour of a twofold analysis of the nominal group on the ideational
level, i.e. the level of lexicogrammatical organization concerned with the
representation of experience.
Within the ideational level Halliday distinguishes between two layers of
analysis, one in terms of constituency and the other in terms of dependency. The
constituency layer offers an analysis of the nominal group as a multivariate
structure, i.e. as constituting a constellation of distinct functional slots which in
some way characterize the Thing of the nominal group, which itself designates a
class of entities and establishes the semantic core of the nominal group, as shown
in Table 1.

Table 1. Experiential structure of the nominal group

Deictic Numerative Epithet1 Epithet2 Classifier Thing Qualifier
those two splendid old electric trains with pantographs

The dependency layer, on the other hand, analyses the nominal group as a
univariate structure, viz. in terms of the recursive head-modifier relationship
displayed by the nominal group, as shown in Table 2.
Measure noun constructions 253

Table 2. Logical structure of the nominal group (head-modifier)

Premodifier Head Postmodifier

those electric trains with pantographs

In the default case the Thing of the experiential layer and the logical Head
coincide. However, there are a few types of nominal group where Head and
Thing do not coincide and those involving a measure of something () [i.e.]
measure nominals (Halliday 1985: 173) are an example of such a discrepancy
between Head and Thing. Halliday goes on to analyse so-called measure
expressions (id.: 169) in the following way:

In the logical structure, the measure word (pack, slice, yard) is Head,
with the of phrase as Postmodifier. The Thing, however, is not the
measure word but the thing being measured: here cards, bread, cloth.
The measure expression functions as a complex Numerative.
(Halliday 1985: 173)

This dual analysis can be visualised by the box diagram for a pack of cards
shown in Table 3.

Table 3. Twofold analysis of the nominal group; discrepancy Head/Thing

(Halliday 1985: 173)
A pack of cards
Numerative Thing Experiential structure
Modifier Head Postmodifier Logical structure

Halliday comments further that

[i]t is not that one [analysis] is right and the other wrong; but that in
order to get an adequate account of the nominal group, [] we need
to interpret it from both these points of view at once. [Italics LB]
(Halliday 1985: 172-173).

This comment seems to imply a certain flexibility in the interpretation of MN

constructions. However, it does not allow head status to shift from the measure
noun to the noun designating the matter being measured. This makes it hard to
see how diachronic variation in the status of the MN, which is reflected so clearly
in (synchronically) distinct subject-finite concord, can be captured.
What is unhelpful about the proposed dual analysis is that Numerative and
Head status of the MN are divided over two simultaneous levels of analysis, thus
254 Lieselotte Brems

suggesting that in each use the MN is always both. Against this, the description of
MN constructions proposed in this article will involve two synchronically distinct
analyses, with the second one being treated as a (diachronic) re-analysis of the

A lot of land Lots of paper

A heap of paper Heaps of people

Head Postmodifier Quantifier Head

Langackers discussion of MNs, reviewed in the following section, does bring in

the diachronic perspective and is compatible with the framework that I will adopt
in the present corpus study, viz. grammaticalization studies.

2.2 Langacker: the diachronic angle

Langacker (1991) turns to the issue of MNs in his general discussion of the
function of quantification in the NP. What is interesting about his observations is
that they immediately address the question of MNs from a diachronic angle, i.e.
MNs as an emergent means of quantification.
Langackers observations pertain to bi-nominal MN phrases such as a
bunch of carrots, a bucket of water and a lot of sharks. He observes that the
nouns which appear as heads constitute a diverse and open-ended class. [italics
LB] (Langacker 1991: 88). MNs are by default attributed head status, despite the
ambivalent semantics of appear. He continues by remarking that

[s]ome of these nouns still have an interpretation in which they

designate a physical, spatially-continuous entity that either serves as a
container for some portion of the mass (bucket, cup, barrel, crate, jar,
tub, vat, keg, box) or else is constituted of some such portion (bunch,
pile, heap, loaf, sprig, head, stack, flock, herd). [italics LB]
(Langacker 1991: 88)

In addition, most of these nouns have developed a more figurative sense. Such
metonymic extensions are possible because the above MNs all incorporate a
conception of their typical size, which is part of their encyclopedic
characterization. In the extended senses the physical entity designated by the
MNs has become secondary to the size specification provided by the noun: For
instance, a bathtub may contain a bucket of water without there being any bucket
in it it is only implied that the water would fill a bucket were it placed in one
(Langacker 1991: 88). Or in other words (Langacker 1991: 88-89), The notion
of a discrete physical object has faded, leaving behind the conception of a
schematically characterized mass (the mass that, in the original sense, either fills
Measure noun constructions 255

or constitutes the object). When a noun is interpreted in this way, it can,

according to Langacker, be regarded as a quantifier.
He thus notices a diachronic process of bleaching of lexical meaning in
certain MNs which may eventually lead to a reassessment of the structural status
of the MN, viz. from head to quantifier. As we will see in the following section,
such observations can be easily rephrased by using grammaticalization
terminology. Langacker then concludes his discussion of MNs by stating:

A further step in this evolutionary sequence would be for the second

noun to be reanalyzed as the head, leaving the remainder as a complex
quantifier: [[a lot of]QNT [sharks]N]NML. I leave open the question of
whether this reanalysis has actually occurred. (Id.: 89)

The aim of the corpus study reported on in this paper is to provide some answers
to both proposed re-analyses, viz. has N1 shifted from head to quantifier and N2
from postmodifier to head?
In conclusion to Langackers account, we can say that it is interesting that
he notes a diachronic shift with regard to the structural status of MN from head to
quantifier, instead of the mere synchronic ambivalence proposed in mainstream
grammars or the simultaneous layers in Hallidays analysis. Langacker also
suggests that this grammatical re-analysis is paralleled by lexical extension and
desemanticization of the MN. He therefore does accommodate the dynamic
aspect of MN constructions by working with two distinct diachronic stages in the
structural development of MN constructions.

2.3 Grammaticalization: diachronic and synchronic

The framework which seems most suitable for tackling the specific developments
encountered in MN constructions is that of grammaticalization theory, which not
only does justice to these developments but also explains them.
Grammaticalization itself has been defined in several ways (e.g.
Haspelmath 1989, Fischer 1999 and Bybee 2000), but its essence is captured by
Lehmanns (1985) definition, which is appropriately general and consists of a
number of interesting parameters. Lehmann also distinguishes between
diachronic and synchronic grammaticalization, a distinction that will prove useful
for this corpus study. Lehmann defines both types of grammaticalization as

Under the diachronic aspect, grammaticalization is a process which

turns lexemes into grammatical formatives and makes grammatical
formatives still more grammatical (cf. Kurylowicz, 1965: 52). From a
synchronic point of view, grammaticalization provides a principle
according to which subcategories of a given grammatical category
may be ordered. (Lehmann 1985: 303)
256 Lieselotte Brems

The diachronic interpretation of grammaticalization nicely captures the

developments and fluctuation encountered in MN expressions. Looked at in
synchronic slices, they often appear to hover indecisively between the class of
quantifiers and head nouns as a consequence of a gradual move from lexical head
to quantifier.
The synchronic interpretation of grammaticalization, then, can account for
the fact that not all MNs have come to function as a quantifier to the same extent,
i.e. there are individual differences with respect to the degree of their respective
grammaticalization. It also invites us to draw up a scale of grammaticalization
along which the various MNs looked at are positioned.
Lehmann proposes six parameters which define grammaticalization:
attrition, paradigmaticization, obligatorification, condensation, coalescence and
fixation. The first three pertain to paradigmatic aspects, the last three to
syntagmatic aspects of the grammaticalizing item or string of items. Of these
parameters, two clearly apply to MN constructions, viz. coalescence and semantic
attrition. It is these two I will focus on in this article.
Coalescence is a syntactic criterion and concerns an increase in
bondedness or syntactic cohesion of the elements that are in the process of
grammaticalizing, i.e. what were formerly individually autonomous signs become
more dependent on each other to the extent that they are increasingly interpreted
as together constituting one chunk, which as a whole expresses a (grammatical)
meaning (cf. Bybee 2000: 27).
The other relevant parameter, semantic attrition, is often referred to as
delexicalization or loss of lexical content, and is commonly mentioned in
grammaticalization studies as a symptom of grammaticalization processes.4
However, as Kurtbke (2001) correctly points out, one should be careful not to
use delexicalization as a mere synonym for grammaticalization.
Both concepts will be operationalized for the present corpus study in the
following way. Delexicalization will be identified in terms of a gradual
broadening of collocational scatter or a loosening of the collocational
requirements imposed by the MN via such semantico-pragmatic processes as
metaphorization, metonymization, analogy, etc. Grammaticalization, on the other
hand, will be restricted to the actual grammatical re-analysis of a MN as a
quantifier. In this particular study, delexicalization processes typically precede
the re-interpretation of the MN as a quantifier. Delexicalization semantically
paves the way, so to speak, for grammaticalization (Kurtbke 2001). To this
extent, these grammaticalization processes can be said to be largely semantically-
driven (cf. Traugott 1988). In addition phonetic and pragmatic factors come into
play. Nevertheless, both concepts tend to remain intertwined because, in many
cases, lexical and grammatical status are difficult to tease apart.

3. Corpus study of MN expressions

For this corpus study heap(s) of, pile(s) of, bunch(es) of and load(s) of were
extracted from the Cobuild Corpus, The Bank of English. In each case the plural
Measure noun constructions 257

and singular variants of the MNs were regarded as distinct expressions. The
corpus data were analyzed as either head or quantifier, or vague. The last
category subsumes those MN uses which activate both head and quantifier
features. Typically, it contains expressive stretches of discourse in which both the
lexical meaning and the quantificational potential of the MN are exploited.
On the basis of these quantified data, differences in the degree of
(synchronic) grammaticalization between the various MNs can be studied, which
will be the main focus in the discussion of the corpus results.
The relative frequencies of quantifier, head or vague uses of the various
MNs examined are represented in the following tables, which also contain
examples of the respective categories. Adjectives or nouns premodifying MN or
N2 are underlined, as well as verbs or other elements of structure that serve as
important clues for either head or quantifier status.

Bunch of Tokens % Examples

Head 27 11.4 (13) All stopped a moment when Linda, in clothes
of mourning, bearing a little bunch of roses, comes
through the draped doorway into the kitchen.
Quantifier 209 88.6 (14) The ideologies might be different, but youre
all a bunch of lying, treacherous bastards when it
comes down to it.
(15) Traditional advertising pictures are a bunch of
(16) Russia and America were just a bunch of
enthusiastic and very fit guys who ran around for 80
minutes without much method.
Total 236 100

Bunches of Tokens % Examples

Head 47 97.9 (17) Just after 5pm two bunches of flowers were
delivered with a card saying simply: to the
neighbour next door.
Quantifier 1 2.1 (18) I had to listen to those two bunches of spoiled,
/vague pampered brats whinge their brains out about which
is the hardest done by.
Total 48 100
258 Lieselotte Brems

Heap of Tokens % Examples

Head 58 55.2 (19) My first impression was not that it was an
earthquake, said Heinz Hermanns, standing by a
heap of bricks that had fallen from his 100-year-old
Quantifier 41 39.1 (20) They went through my bags, searched me and
asked me a heap of questions.
Vague 6 5.7 (21) That deadly, winking snuggling chromium-
plated, scent-impregnated, luminous, quivering,
giggling, fruit-flavoured, mincing, ice-covered heap
of motherlove [said about Liberace]
(22) The British have forged a fine tradition of
gardening and cannot afford to sit on their well-
clipped laurels. Striding past the compost heap of
nostalgia, comes Christopher Lloyd.
(23) He test-fired a dozen of Hellfire missiles at a
fleet of old Saudi school buses, reducing the
vehicles to a heap of springs and blackened chassis.
Total 105 100

Heaps of Tokens % Examples

Head 29 32.2 (24) Pulham, scion of the Portland cement family,
experimented and perfected in the 1840s the art of
using liquid cement poured over heaps of clinker to
make rock formations
Quantifier 59 65. 6 (25) Whats interesting is how many sexual
researchers and observers were driven by self-
interest? Heaps of them at least.
(26) The graphics are very polished, with pitch
detail, markings and the like to add heaps of
Vague 2 2.2 (27) Many other viruses are highly malignant
reducing the priceless words of your PhD thesis to
amorphous heaps of molten letters.
Total 90 100
Measure noun constructions 259

Pile of Tokens % Examples

Head 176 88.4 (28) A jilted girlfriend got revenge on the boyfriend
who dumped her by dumping a foot-high pile o f
manure in his bed.
Quantifier 6 3.1 (29) I can just see a whole pile of the boys walking
out after the final and saying bye bye.
(30) It [i.e. a performance] was the biggest pile of
want ever.
Vague 17 8.5 (31) 6 CAMMY MURRAY: Recovered well after a
nervous start and put in a pile of strong defensive
work. ((paper)work-metonymy; here more or less
(32) If you go and have a look next door theres a
great pile of work that builds up. (paperwork
metonymy; leans more towards the Head category
Total 199 100

Piles of Tokens % Examples

Head 159 93 (33) There was no memory of summer but the little
sad piles of hay that rotted in the fields.
Quantifier 8 4.7 (34) Mike Atherton has been warned he must score
piles of runs for Lancashire to keep his England test
Vague 4 2.3 (35) Leshan emphasizes a remark by G.B. Shaw that
Lourdes is the most blasphemous place on the face
of earth: mountains of wheelchairs and piles of
crutches exist, but not a single wooden leg, glass
eye, toupe[e]!.
(36) The real fun begins when you start receiving
piles of property details. (paperwork-metonymy)
Total 171 100

The percentages indicated in the tables immediately point out that there are
differences between the various MNs in terms of their degree of
grammaticalization or quantifier potential. In order to represent visually the
degrees to which the MNs have grammaticalized, they are set out on a scale of
synchronic grammaticalization in Figure 1 (see Section 2.3). The percentages for
lot of and lots of were also obtained through analysis of Cobuild data. They are
included in the cline as precursors in the present MN-developments.
260 Lieselotte Brems

Load of Tokens % Examples

Head 33 19.3 (37) TODAY flew in a helicopter-load of supplies
to the service station.
Quantifier 132 77.2 (38) I just think thats a load of nonsense.
(39) When our image first went out of control, we
played to a load of skinheads.
(40) What are you going to write about when you
get success and a load of money?
Vague 6 3.5 (41) I always go to Sainsburys and I always go to
the Cake Place and buy a load of cakes. (trolley-
load/lots of)
Total 171 100

Loads of Tokens % Examples

Head 15 7.2 (42) Six plane-loads of food are also being
flown today to the city of Baidoa.
Quantifier 193 92.8 (43) Around Christmas time I was in the
British home Stores and I tried on loads and
loads of dresses.
(44) Ive applied for loads of jobs including
one in a flowershop, but they wanted a
(45) Youve got loads of people that you can
have conversation with, but not many people
that you can have communication with.
(46) It has tons of variety, smart graphics and
loads of action.
Total 208 100

bunches of heaps of bunch of lot of

pile of
piles of heap of load of loads of lots of

4.7% 77.2% 92.8%

0% 3.0% 41.0% 50% 65.6% 88.6% 100%

2.1% 99.8%

Figure 1. Scale of synchronic grammaticalization

Measure noun constructions 261

Loads of, bunch of, load of and heaps of have all grammaticalized strongly, while
piles of, pile of and bunches of have hardly grammaticalized at all.
Two main factors seem to play a part in these observed differences in
degree of grammaticalization, viz. dissimilarity in the degree of delexicalization
and collocational broadening, and differences in expressive value of the MNs,
which also involves a phonetic factor.
Considering the fact that the grammaticalization processes of MNs are
largely semantically driven (Section 2.3), it seems only natural that differences in
grammaticalization level can be explained by differences in the preliminary
delexicalization processes. Differences in quantifier potential between the
semantically related heap and pile, for example, have to be put down to
differences in delexicalization potential between the two MNs. These differences
are dependent on certain lexico-semantic properties inherent in the concepts of
pile and heap, which are resistant and conducive to semantic generalization
The blocking factor in pile is the feature of verticality and constructional
solidity it calls up. These semantic features are, so to speak, too specific to bleach
into a mere quantity meaning. The concept of heap on the other hand is in itself
more vague and simply profiles an undifferentiated mass, from which it is much
easier to detach a mere quantifier meaning. The lack of delexicalization potential
in pile is matched by a very restricted collocational extension, mainly limited to
prototypically stackable concrete nouns like rubble, paper, bricks etc. Heap(s) of,
on the other hand, has loosened its collocational requirements systematically. In
addition to the prototypically stackable nouns it combines with when used as
head, it has extended to concrete nouns irrespective of their semantics, to human
nouns (e.g. (25)) and abstract nouns (e.g (26)). Heaps of has hence developed a
systematic quantifier use which is more or less devoid of its original lexical
The non-head/quantifier uses of pile of and piles of distinguished in the
tables above are all restricted to very specific contexts, as in (34), highly
expressive stretches of discourse, as in (30) and (35), or dependent on metonymy,
as in (31) and (36) for example. Compare the following two MN nominals, which
alternatively have heaps and piles combined with the same set of nouns; the
collocational restrictions on the quantifier use of piles of are immediately

piles of stones/paper/people
heaps of stones/paper/people

Piles of evokes a vertical, layered constellation with all three nouns, which
renders the combination piles of people highly marked. By contrast, with heaps a
mere quantifier reading is at least as unmarked as a literal interpretation for all
three nouns; in the case of people the quantifier reading is the most natural one.
Bunch of is another MN of which the high level of grammaticalization can
be explained by a process of extensive collocational broadening. As opposed to
262 Lieselotte Brems

pile or heap, the delexicalization process of bunch has a readily identifiable first
stage in which it designates a very specific cluster-like constellation with an
accordingly restricted set of collocates in the N2 slot, e.g. grapes, flowers,
carrots. Gradually, the specific cluster meaning starts to bleach and the
collocational scatter broadens to include concrete nouns beyond that limited set,
as well as abstract nouns and human nouns, which are in fact the predominant N2
type when bunch functions as a quantifier. The following table represents the
various extensions.

+Inanimate plural (47) Theres now a whole bunch of studies from

count noun different cities that show the same thing.
(48) Traditional advertising pictures are a bunch of lies.
(49) Ned wanted to give me a bunch of suits.
+Uncount noun (50) Trouble was, the funds were able to neatly hide all
but the most conspicuous of their charges in a bunch of
(51) We started in May and did a bunch of practising.
(52) I spent a bunch of time, when I was visiting the
county, talking to his neighbours.
+Human/animate (53) Who said Americans were a jingoistic bunch o f
plural count noun rednecks who know or care nothing about what happens
beyond their shores?
(54) Deng was pictured taking a dip with a bunch of his
beaming buddies at a summer resort in the north of the
(55) Russia and America were just a bunch of
enthusiastic and very fit guys who ran around for 80
minutes without much method.
(56) We guarantee the noble young lord will complain
about having to spend time with such a boring bunch of

Bunch of, just like loads of and heaps of, also illustrates the expressivity and
affective values grammaticalizing MNs often develop, at times leading to new
patterns of collocational consolidation. Especially when used with human nouns,
bunch of goes beyond merely quantifying N2 and also qualifies it, usually
negatively. This qualitative function can be reinforced by the addition of
qualitative adjectives premodifying the MN, as in (53) and (56).
The negative qualification expressed by bunch is best described as
negative semantic prosody, as defined by Sinclair (1992) and Bublitz (1996): a
negative, or occasionally positive, semantic aura spreading from node to
Measure noun constructions 263

collocate. Bunch radiates a specific halo, it prospects ahead and sets the
scene (Sinclair 1992: 8) for a particular type of subsequent item (Bublitz 1996:
11). This strong predictive power with regard to N2 can create new collocational
requirements and idiom-like patterns of collocational consolidation, as in (48),
(50), (53) and (56), both with human nouns and abstract nouns. It is only with
regard to nouns such as guys, lads, kids, etc. that bunch of radiates a positive
semantic prosody. In such expressions as a bunch of guys/lads/etc. there is the
additional suggestion of bondedness, of a close-knit group of amicable people.
This can be seen as a metaphorical revival of the original cluster semantics.
The specific qualitative meaning of bunch of brings us to the second
important factor motivating the grammaticalization of MNs, viz. the expressive
value they can acquire. As a means of quantification heaps of and loads of are
very hyperbolic in nature, which can be stressed by repeating them as in (43).
Differences in expressive value might also explain why the plural versions of
heap, pile and load have grammaticalized more strongly than the singular
variants. The plurality in terms of grammatical number adds to the hyperbolic
meaning it expresses as a quantifier. In addition, the intrinsic mass meaning of
plural nouns (Langacker 1991) likewise enlarges the magnitude already expressed
by the MN. Phonetically, these plurals contain a vowel that can easily be
lengthened, producing a similar effect of exaggerating the quantity of N2. In this
respect, the extensive grammaticalization of loads of might be enhanced by the
graphemic and phonetic resemblance to lots of, with the added bonus of a
strongly prolongable diphthong in front of a voiced consonant.
In the case of bunch, on the other hand, it is precisely the other way round:
the plural form displays a near-exclusive head use, while it is the singular form
which has a prevailing quantifier use. The resistance of bunches to
grammaticalize into a quantifier might be due to prosodic features which do not
lend themselves well to expressive use, such as the extra syllable the plural
morpheme gives rise to (p. c. Halliday). Grammaticalizing MNs, with their
typical blend of lexical and grammatical potential, thus satisfy the language
users needs for a quantifier as well as the desire to be expressive.

4. Conclusion

The assessment of the structural status of MNs in MN constructions is complex

because of the subtle and often intricate interdependence of the MNs lexical and
grammatical status. The observed structural fluctuation involves many more
dimensions than suggested by traditional descriptions. Lehmanns parameters for
grammaticalization proved essential to impose some order on what appears as
intractable material.
Grammaticalization, both the diachronic and synchronic interpretation,
allows one to reveal the patterns in empirical data. Grammaticalization of MNs
seems to involve two main motivating factors, which typically interlock. Firstly,
there is delexicalization and collocational broadening of the MN. In addition
there is a pragmatic factor, viz. the hyperbolic expressiveness MNs cater for as
264 Lieselotte Brems

quantifiers. Differences in the degree of grammaticalization are matched by

differences in the extent to which these two factors have come into play in the
various MNs.
Not only is there a synchronic dissimilarity in the extent to which the
various MNs have grammaticalized; each of the MNs individually displays a
layering (cf. Hopper and Traugott 1993) of lexical head uses and grammatical
quantifier uses, as well as a considerable number of transitional uses. Some
contextualized examples proved to be irreducible blends (Bolinger 1961) of
quantifier and head status. Our main descriptive research question has thus been
confirmed: bunch(es) of, heap(s) of pile(s) of and load(s) of have developed a
quantifier use comparable to that of regular quantifiers. However, they still retain
the possibility of appearing as the lexical head noun of a nominal group.
The MNs looked at thus do constitute an emergent means of quantification
(cf. Langacker 1991). The observed structural fluctuation and layering
phenomena suggest that they are still very much quantifiers on the move. A
certain amount of lexicality is bound to cling to all MN quantifiers to some
extent. For pile in particular such lexical persistence (Hopper and Traugott 1993)
is at present very strong, whereas heaps of has already developed a systematic
quantifier use which is more or less oblivious to its original lexico-semantics.
Still, even when MNs have become highly grammaticalized, their lexical
semantics can still be exploited, alluded to or revived in various ways, e.g. They
employ lorry-loads of insincere flattery. Again the strong interpersonal
motivation behind MNs as a means of quantification comes to the fore, as well as
the importance of casual, informal registers.


1. I would like to thank all people at the 2nd Workshop of the Systemic
Functional Research Community (FWO - Fund for Scientific Research
Flanders grant n WO.018.00N) in Leuven, 21-24 November 2001, as well
as those at ICAME 2002 for their much appreciated comments on earlier
versions of this paper.
2. These are just two of the many names they are commonly labelled with.
Others are quantifying nouns in Biber et al. (1999) and NP-like quantifiers
in Akmajian and Lehrer (1976).
3. All examples are extracted from the Cobuild Corpus, The Bank of English,
and reproduced here with the kind permission of HarperCollins.
4. This concept is alternatively referred to as semantic attrition, desemanti-
cization and demotivation (Lehmann 1985:307).
Measure noun constructions 265


Akmajian, A. and A. Lehrer (1976), NP-like quantifiers and the problem of

determining the head of an NP, Linguistic Analysis 2: 395-413.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. London: Longman.
Bolinger, D.L. (1961), Syntactic blends and other matters, Language 37: 366-
Bublitz, W. (1996), Semantic prosody and cohesive company: somewhat
predictable, Leuvense Bijdragen 85: 1-32.
Bybee, J. (2000), Cognitive processes in grammaticalization, to appear in M.
Tomasello (ed.), The new psychology of language, volume 2. New Jersey:
Lawrence Erlbaum.
Fischer, O. (1999), Grammaticalization: Unidirectional, nonreversible? The case
of to before the infinitive in English, Views 7: 5-24.
Halliday, M.A.K. (1985), An introduction to functional grammar. London:
Haspelmath, M. (1989), From purposive to infinitive - A universal path of
grammaticalization, Folia Linguistica Historica 10: 287-310.
Hopper, P. and E. C. Traugott (1993), Grammaticalization. Cambridge:
Cambridge University Press .
Jespersen, O. (1970-74), A modern English grammar on historical principles. 7
vols. London: Allen and Unwin; Copenhagen: Enjar Minksgaard.
Kruisinga, E. (1925), A handbook of present-day English, 7th ed. Utrecht:
Kemink en Zoon.
Kurylowicz, J. (1965), The evolution of grammatical categories, Diogenes 51:
Kurtbke, P. (2001), YAP- and OL- as delexical nominalising devices in
diaspora Turkish. Paper delivered at the University of Leuven, March 15,
Langacker, R.W. (1991), Foundations of cognitive grammar. Volume 2:
Descriptive application. Stanford: Stanford University Press.
Lehmann, C. (1985), Grammaticalization: synchronic variation and diachronic
change, Lingua e Stile 20: 303-318.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London & New York: Longman.
Sinclair, J. (1992), Corpus, concordance, collocation. Oxford: Oxford University
Traugott, E. C. (1988), Pragmatic strengthening and grammaticalization,
Proceedings of the 14th Annual Meeting of the Berkeley Linguistics
Society. 406-416.
Yourself: a general-purpose emphatic-reflexive?

Gran Kjellmer

Gteborg University


In general, the reference of English personal pronouns has been relatively stable
over the centuries: I (and its forerunners) can normally be taken to refer to the
first person singular, and so on. If this is the general picture, it is necessary to
add some qualifications, most of them of a minor kind. For instance, I is
sometimes used to refer to the second person singular (I shouldnt disturb him
at this time of night), we is sometimes used with reference to the first person
singular, the authorial and the royal we (We are not amused), to the
second person singular (How are we today?) and with general reference (We
should not underestimate the defence of honour), and they can also be used
with general reference (They say that ill weeds grow apace). However, you
and its reflexive-emphatic correspondence yourself stand out in this respect and
differ from their pronominal cousins, both in that their referential changes have
been more generally remarkable over the centuries, and in that such changes are
still in progress. This paper will attempt to chart some of those changes in
modern English with the help of large modern corpora.

1. Introduction

As speakers of English we tend to look upon the pronominal system as something

more or less established and invariable, if indeed we bother to think about it at
all. But there are occasions when this happy mood is broken. When an English
professor addressing his seminar audience says

(1) You can see for yourself that ...

and when an English boy accused of repeating himself replies

(2) I aint repeating yourself

one may well receive a jolt. Our traditional grammars would have us expect the
professor to say see for yourselves and the boy to say repeating myself. A
natural question then is if such linguistic events are due to accidental slips, and
hence of no great interest, or if something is happening to the use of yourself. I
268 Gran Kjellmer

would suggest that among the bewildering mass of uses that yourself can be put
to an interesting pattern can be seen to emerge.

2. Development of you

In order to look into the matter of yourself we shall use material from the
CobuildDirect and the BNC corpora. It is indeed one of the great advantages of
corpora that they can provide us with material that has not yet become
established as mainstream varieties. But before discussing yourself, let us first
consider very briefly the development of the closely related pronoun you. You,
starting out as an object form of the Old English second person plural personal
pronoun , came in late Middle English and Early Modern English times to be
used as a plural subject form and, about the same time, as a singular pronoun,
used both as a subject and an object form (OED You I-II). In Early Modern
English a secondary use developed, Denoting any hearer or reader; hence as an
indef. pers. pron.: One, any one (OED You III:6). From Early Modern English
times onwards you can thus be used in the same way as today, in the plural and
singular and with general reference. Modern examples, from the Cobuild Corpus,

2nd pl. (OED you I):

(3) I can see great things for you, kids, I think your troubles are over.
(Cobuild: usbooks/09. Text: B9000001423)

(4) The great thing about meeting someone through an agency is that you find
a lot about them the first time you meet.
(Cobuild: ukmags/03. Text: N0000000375)

2nd sg. (OED you II):

(5) `Honk all you like, baby. I hope it makes you happy, you little redneck.
(Cobuild: usbooks/09. Text: B9000000418)

Generic (OED you III:6):

(6) Where they lived in it was in near Christchurch as you sort of you had to
go you know maybe five miles before you saw anyone and in winter they
were just cut off anyway. So they never saw a soul.
(Cobuild: ukspok/04. Text: S9000000254)

Out of the generic use there has developed a more specific use where you mostly,
but not always, means I or we (and which is not in the OED; cf. Quirk et al.
Yourself: a general-purpose emphatic-reflexive? 269

Generic > specific:

(7) When I made the booking I explained that the trip was for shopping, but
the tickets arrived with a booklet listing that particular weekend as a
public holiday in France. Now Going Places wants 90 to change the date.
<p> Going Places Direct showed no compassion when you explained your
problem and insisted that you pay a 90 re-booking fee (you = I)
(Cobuild: times/10. Text: N2000951104)

(8) `Theres another one in the back as well Mr Giggins added: `For all the
world it looked as though there were people asleep in the car although
when you looked again you realised they had been shot ) (you = I/we)
(Cobuild: times/10. Text: N2000951208

(9) but I shouldnt think its probably all that much different <F01> Mm.
<F02> except we used to finish off putting chairs on the tables hands
together and eyes <tc text=laughs> closed you know before you went
home every night. (you = we)
(Cobuild: ukspok/04. Text: S9000000758)

(10) Balancing the lust for a story against the demands of self-preservation,
conquering your own fear and crawling that extra exclusive maggot-
infested mile before remembering you were a mother with responsibilities
back home. Home. It was time to call her husband. Her nervousness, for
which she had no explanation - or, at least, none she could remember -
came flooding back. (you = she)
(Cobuild: ukbooks/08. Text: B0000001117)

(10) is probably an example of free indirect speech.

3. History of yourself

As for yourself, its early history is partly dependent on that of you, as could be
expected. The Middle English plural e ou selve(n) became our(e) self(e) in the
early part of the fourteenth century, and like you the latter form came to be used
with singular reference in late Middle English and Early Modern English (OED
yourself II, originally as a honorific plural). And then towards the end of the
fifteenth century the present s-plural ourselves, yourselves came into existence
and eventually became the standard forms (Wright and Wright 1924: 323; see
also Visser 1962-73 I 455). The forms with -selves are [...] the normal plural
usage by the middle of the sixteenth century (Barber 1997: 159). So the form
yourselves gradually becomes the standard one for use in the plural. If yourself,
on the other hand, was thus originally a plural form, as in
270 Gran Kjellmer

(11) All the wise how it was ye wetyn your selfe.

(c1400: OED Yourself I1: obsolete)

its standard modern use is as a singular reflexive form (OED Yourself II: 6), as in

(12) Now you never thought of yourself as a fan. You were a journalist
covering sports.
(Cobuild: npr/07. Text: S2000901019)

or as a singular emphatic form (OED Yourself II: 3), as in

(13) Vu: You used to molest other kids yourself? <p> Mary: Mm-hmm.
(Cobuild: npr/07. Text: S2000911102)1

This, then, is the traditional view of modern you and yourself/yourselves, as

presented in the standard grammars: you is the second person singular and plural
personal pronoun, yourself is the second person singular and yourselves the
second person plural reflexive pronoun (Quirk et al. 1985: 346, Biber et al. 1999:
328). But in order to understand the occurrence of examples like (1) and (2), I
suggest we follow an admittedly hypothetical line of development of modern
yourself. Such a development would imply an ongoing extension of its semantic
range, and consequently an increasing lack of precision.

4. Development of modern yourself

Let us start with the standard use of yourself, where it refers to a singular

(14) its exciting for a young man like yourself ...

(Cobuild: npr/07. Text: S2000911214)

As we saw, you can refer to one or several addressees, and frequently it is

difficult or impossible for the listener or reader to decide which is meant.2 The
same thing then applies to yourself. The number indeterminacy of you spills over
on to yourself by analogy, so that the latter can be used in situations where the
speaker may have a plural addressee in mind. In cases like the following, there
could be one addressee or several:

(15) Treat yourself to a Maltese odyssey

(Cobuild: today/11. Text: N6000940101)
Yourself: a general-purpose emphatic-reflexive? 271

(16) Before buying a single share of stock, force yourself to answer one
question: are you reasonably sure that you can keep your money invested
for 7 to 10 years?
(Cobuild: usbooks/09. Text: B9000000404)

(17) If you have just spent 329,000 on a red Ferrari F50 then why not treat
yourself to the perfect number plate?
(Cobuild: times/10. Text: N2000960217)

How then are we to know whether, and how often, yourself in fact refers to a
number of addressees? It is difficult to answer that question as, just in the case of
you, the speaker or writer may not always have made a distinction between
singular and plural but may be addressing himself indifferently to an audience of
one or several. The context is often of little or no help. However, by an indirect
route we might get an idea of the size of the phenomenon. The reflexives myself,
himself, herself, itself have plural correspondences, ourselves and themselves. If
we assume that the relation between reflexive singulars and plurals is very
approximately constant throughout the system, we can investigate the matter in a
corpus like Cobuild and draw our conclusions. The figures are shown in Table 1.

Table 1. Reflexive singulars and plurals in the CobuildDirect corpus

Formally singular Formally plural % formally plural
myself 7311 ourselves 2798 27.7%
himself 14815 themselves 10636 27.4%
herself 5525
itself 7894
yourself 6758 yourselves 289 4.1%

The discrepancy between 27-28% and 4% suggests that a great number of the
yourself instances have plural reference.
When yourself can be interpreted as referring to plural addressees, as in
(15) - (17), one further step in its development follows naturally, viz. that when
yourself unambiguously refers to plurals, and plurals only. This step constitutes a
break with traditional descriptions of the word; it is not described in our standard
grammars. Sentence (1) is one example, and some further examples follow.

(18) Ladies and gentlemen, Francie announced suddenly appearing brightly.

Our resident antiques expert will be having his break now, for twenty
minutes only. Until resumption, please avail yourself of the fairgrounds
refreshments at reasonable prices ... The queue groaned.
(Cobuild: ukbooks/08. Text: B0000000010)
272 Gran Kjellmer

(19) Well can you sort that out amongst yourself and then after youve done
that then present it to the February sales meeting
(BNC: JN6 142)

(20) If come Valentines Day you girls found yourself still manless after
deploying every known method to hook that rare breed of muscle, there
was only one place to be.
(Cobuild: ukmags/03. Text: N0000000722)

(21) Coffees are ordered. Do you all consider yourself to be Botards?

(Cobuild: ukmags/03. Text: N0000000686)

(22) I have some good news for those of you who didnt manage to pull
yourself together enough to get tickets to Creamfields
(Cobuild: sunnow/17. Text: N9119980502)

(23) Prologue Oedipus:

My children, generations of the living
In the line of Kadmos, nursed at his ancient hearth:
Why have you strewn yourself before these altars
In supplication, with your boughs and garlands?
(Cobuild: usbooks/09. Text: B9000001423)

(24) Make sure youre in different groups. Okay. ---

One, two, three, so we separate yourself into different groups.
(BNC: KPV 514)

One can see the process in operation whereby yourself is supplanting yourselves
in examples like the following, where the speaker is hesitating between the two
forms and deciding on yourself :

(25) So what subjects did you take then at er <ZF1> S<ZF0> School
Certificate? <ZF1> What what <ZF0> what were your pushing yourselves
to yourself towards?
(Cobuild: ukspok/04. Text: S0000000834)

As suggested above, analogy with you is probably at work here. There is also a
slim chance that a few instances of plural yourself, labelled by the OED as
obsolete,3 are a deliberate continuation of the Middle English plural and hence
imitative of Middle English usage. This may be the case in an example like (23),
where the tone is solemn and somewhat archaic.
Examples like (18) - (24) above, where yourself is used with direct
reference to several addressees, are frequent enough in the corpora. (It is hardly
possible to give statistics, because yourself is a very frequent word,4 and evidence
of the number of addressees, if it occurs at all, may occur anywhere in the
Yourself: a general-purpose emphatic-reflexive? 273

context, often at some distance from yourself.) On the other hand, a further step
in the development of the word, where it is still plural but no longer limited to the
second person, is not recorded as frequently. This step could be represented by
cases like

(26) When I went to that stress management course we were told to use
physical resources like deep breathing and actually making yourself sit
down and making yourself go floppy. and let every muscle let it relax.
(BNC: KBF 8025)

(27) Fiona Me and, did you see me and Sarah [at the show] ...
Jessica No. No, cos we were sitting down down by yourself
(BNC: KBL 2998)

(28) We have to think yourself !

(BNC: KE0 859 )

This usage is clearly colloquial and scarcely acceptable in the standard language.
The shifts in the usage of yourself that we have seen so far represent a
widening of its sphere of application, from reference to second person singular to
reference to second person singular and plural, and from there, in addition, to
reference to other plurals. It has, in other words, become more general in its
application. By a slightly different route it concurrently acquires a generic sense,
as we shall now see.
When yourself, in the wake of you, was used to refer to singular and plural
addressees indifferently, the semantic distinction between what might be called
specific addressing, where you means e.g. you, Benjamin (You should avail
yourself of this opportunity) and general addressing, where you means one
(When you are young, without a job, ... it is your passions that often define
you) became blurred, particularly in general contexts. Ever since late Middle
English times English has lacked a distinctive generic pronoun, corresponding to
French on and German man,5 but you (and one) have come to fill that place.
Consequently yourself, too, could be used in a generic sense, as in the following

(29) Knowing how to present yourself # can really make or break you,
Charmaine said.
(Cobuild: oznews/01. Text: N5000950205)

(30) The role demands a lot of things. It demands subjecting yourself to

complete vulnerability.
(Cobuild: today/11. Text: N6000950602)
274 Gran Kjellmer

(31) Janet Parsons knows what it is to find yourself a victim of crime. Her
husband, Leslie was killed at the wheel of his lorry by two joyriders
racing each other.
(BNC: K1K 3765)

(32) The general sense of not being quite yourself

(BNC: BLW 1117)

This very clear step towards generality is also shown by the fact that yourself in
this sense can refer back to generic one:

(33) Theres a danger that in a science course one concentrates purely on how
and why nature works, or in an engineering course one concerns yourself
only with how to apply and harness phenomena, not to understand
sufficiently the nature of the phenomena and what are the inherent
(BNC: KRW 36)

(34) one is to do it yourself

(Cobuild: ukbooks/08. Text: B0000000774)

One step in the development of yourself remains to be discussed. As we

saw in (7)-(10), you is sometimes used in a generic sense although, paradoxically,
it has specific reference. This can at least initially be due to modesty on the part
of the speaker and/or on a wish not to take personal responsibility for the matter
presented, as you mostly stands for I or we. In the same way, yourself can then be
used in a seemingly general way but with clear reference to one or more persons,
mostly I or we:

(35) Id have loosened my tie, but they had taken it away along with my wallet,
gun, belt and shoelaces. I wondered how easy it would be to hang yourself
with your shoelaces.
(BNC: GVL 1718)

The general phrasing refers to the speakers specific problem, but both the
general and the specific meaning of yourself are part of the full meaning of the
sentence. The relevant part means both to hang oneself with ones shoelaces
and to hang myself with my shoelaces. This type of usage can be seen as a
transition to the final stage, that where the reference of yourself is exclusively
specific (and not always I or we, as in (39)). Some examples are:

(36) Peter Look, youve been repeating yourself again.

Kevin Yeah, so are you.
Peter I di-- , I aint repeating yourself.
Yourself: a general-purpose emphatic-reflexive? 275

Kevin Did, you did. You did!

Peter I aint repeating yourself.
(BNC: K SP 256)

(37) I know I, er in the past when Ive felt myself going off to sleep in those
situations, Ive been pinching myself and, and really making yourself do
something rather than just sitting there doing nothing, - - - weve read and
heard about people that have gone to sleep on motorways havent they?
(BNC: KBX 687)

(38) Ten-year-old Trevor Kachel, of Belgrave Road, said: `I like boxing

because it means I can defend yourself if you ever needed to.
(BNC: K52 6141)

(39) Petes gone down to the shop and got yourself a bottle whisky.
(BNC: KCT 7304)

As the contexts make clear, these sentences do not mean ... repeating you, ...
making you, etc., and they could not mean ... repeating oneself , ... making
oneself, etc. yourself is clearly specific here.6
The different types of usage that have been presented above could of
course be described as related in several different ways, none of which is
necessarily the correct one. If they are set out as suggested here, the stages in
the development of yourself can be seen as implicational in Figure 1:
This means, for instance, that those who use yourself to refer to the second
person plural (d) will also use it to refer to the second person singular and plural
indifferently (c), but not necessarily to other plurals (e).

5. Conclusions

As we have seen, yourself has changed a good deal through the ages, with
striking results in some variety or varieties of the language. We need not assume,
however, that the development of yourself in the standard language will
inevitably follow suit. This is one line of development among several, in its later
phases very much a minority option. Nevertheless, it is an interesting option in
that it represents the phenomenon of pattern neatening, to borrow a phrase
from Jean Aitchison (1991). From being distributionally and semantically quite
different from its corresponding personal pronoun you deviating in number as
well as type of reference yourself has become a close reflexive-pronoun copy of
it by getting rid of constraining features in its later stages of development. In
those stages it would appear justifiable to regard yourself as a general-purpose
emphatic-reflexive pronoun.
276 Gran Kjellmer

Reference to 2nd plur

(Ye weten your selfe)

Reference to 2nd sing

(A young man like yourself)

Ref. to 2nd sing/plur

(Treat yourself to a Volvo)

Ref. to 2nd plur Generic

(Separate yourself into (The sense of not being
groups) quite yourself)

Ref. to other plurals Explicit ref. to gen. one

(We have to think (One concerns yourself
yourself) with ...)

Ref. to any subject

(I can defend yourself)

Figure 1. Types of usage with yourself


1. There is occasional ambiguity between the reflexive and the emphatic use, as
You gave yourself to the poor,
meaning either You dedicated yourself to the poor or You yourself gave
to the poor.
2. ... it is not always clear in present-day English whether the second person
pronoun refers to one or more people (Biber et al. 1999: 330).
3. Yourself I. In plural sense: now replaced by yourselves.
4. There are 6758 occurrences of yourself in Cobuild and 10587 in the BNC.
Yourself: a general-purpose emphatic-reflexive? 277

5. Old English man with that meaning developed into Middle English me and
became obsolete in late Middle English times.
6. A case like
I shouldnt worry yourself, Dolly, said Carrie, with apparent innocence
(BNC HHC 240)
is probably different, in that I shouldnt do that is often used to mean You
shouldnt do that; I shouldnt worry yourself then means You shouldnt
worry yourself.

Aitchison, J. (1991), Language change: progress or decay. 2nd ed. Cambridge
University Press.
Aston, G., and L. Burnard (1998), The BNC handbook. Edinburgh: Edinburgh
University Press.
Barber, C. (1996), Early Modern English. 2nd ed. Edinburgh: Edinburgh
University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan (1999), Longman
grammar of spoken and written English. Harlow: Longman.
BNC = British National Corpus, see Aston and Burnard (1998).
CobuildDirect Corpus, cf. Sinclair (1987).
OED = Simpson, J. A., and E. S. C. Weiner (eds) (1989), The Oxford English
dictionary, 2nd ed. Oxford: Clarendon.
Quirk, R., S. Greenbaum, G. Leech and J. Svartvik (1985), A comprehensive
grammar of the English language. London & New York: Longman.
Sinclair, J. M. (ed.) (1987), Looking up. An account of the COBUILD project in
lexical computing. London and Glasgow: Collins.
Visser, F. Th. (1962-73). An historical syntax of the English language I-III.
Leiden: Brill.
Wright, J., and E. M. Wright (1924), An elementary historical new English
grammar. London, etc.: Oxford University Press.
Aspects of spoken vocabulary development in the Polytechnic of
Wales Corpus of Childrens English

Clive Souter

University of Leeds


The Polytechnic of Wales Corpus was collected in the late 1970s for the study of
syntactic and semantic development of native English-speaking children aged
between six and twelve. This paper demonstrates that interesting lexical
information can be gleaned from this corpus for EFL instructors and curriculum
designers, even though the size of the corpus (61,000 words) makes it too small
for dictionary development. The Corpus was organised to permit researchers to
observe changes across age groups, and differences between the sexes and
between children of different socio-economic backgrounds. Five investigations
rate of vocabulary growth with age in this Corpus;
the extent to which vocabulary is sex-specific;
differences between sexes in the use of affirmatives and negatives, and in
the use of male and female personal pronouns;
the extent to which vocabulary size is related to socio-economic class;
persistence of errors in applying regular verb endings to irregular verbs.
The Corpus does show active vocabulary size increasing with age, at a rate of
only around 50 words per year (in the limited activities used to elicit speech from
the children). Surprisingly, around half of the words used by each of the sexes are
limited to that sex. Boys make more use of positive expressions, whereas girls
make greater use of negatives. Both sexes use he far more than she. There is no
clear evidence that social class differences influence vocabulary size. Errors
caused by applying regular verb endings to irregular verbs seem to diminish in
children between ages six and eight, and have disappeared by age ten.
Although it is clear that data sparsity influences these results, they are still
useful (and thought-provoking) to curriculum developers and coursebook
designers in EFL, as well as researchers in sociolinguistics of child language.

1. Introduction

In this paper, I present some investigations into the development of childrens

English spoken vocabulary between the ages of 6 and 12. I focus particularly on
the differences in vocabulary between the ages 6, 8, 10 and 12, between the two
sexes, and between socio-economic classes, since the corpus material has been
organised to permit this.
280 Clive Souter

The motivation for such a study came from my belief that, until recently,
the Polytechnic of Wales (POW) Corpus has never been used for vocabulary
study. (It was originally collected for the study of childrens syntactic and
semantic development.) This omission can perhaps be explained by the small size
of the corpus: only 61,000 words. Lexicographers building dictionaries of adult
vocabulary have had access to far larger English corpora, such as LOB and
Brown, and more recently the British National Corpus and the COBUILD/Bank
of English. For dictionary-building purposes, clearly the POW corpus is nothing
like large enough, and may have been overlooked for this reason alone. However,
it does have great value for researchers into child language development, TEFL
syllabus designers and course-book authors.
The POW Corpus is unique in containing childrens spoken language,
organised clearly by age, sex and class, and in being richly syntactically
annotated. I hope to show that there are some interesting features to be uncovered
even in such a small corpus, by modern standards. Such features should hopefully
catch the attention of the designers of school syllabi for English language
learning. In many EU countries, there is pressure on the education system to
introduce foreign language learning earlier in the curriculum, at primary rather
than secondary school age. This is not without difficulty: there are few primary
school teachers trained to teach foreign languages. Space needs to be found in the
curriculum and working week of primary schools. An appropriate syllabus needs
to be designed to engage younger learners. Finally, the impact on the secondary
curriculum needs to be addressed, particularly if some children have been
introduced to a foreign language already, but others havent. For this reason, a
team at the Freie Universitt Berlin in Germany led by Dieter Mindt has also
recently been using the POW Corpus to assess which vocabulary and
grammatical items should be introduced to younger German learners of English,
and in what order. A paper describing their work was also presented by Norbert
Schlter at the ICAME conference in May 2002 in Gteborg, Sweden.

2. Special value of spoken corpora for learners and teachers

Developers of language teaching materials and courses are increasingly making

use of corpus evidence. Such corpora may typically consist of native speaker
material, which is of course seen as the learners target, but may still contain
errors. Additionally, corpus collections have been made of non-native learners
language, such as for the ICLE project (Granger 1993, 1998) and ISLE project
(Menzel et al 2000, Atwell et al 2003), in which learner errors may be found.
From the aspect of young learners of English, native speaker spoken corpora such
as the POW corpus are particularly useful in that they can provide

pronunciation examples
intonation and prosody examples
awareness of accents
Aspects of vocabulary development 281

indications of lexical range including expressions and colloquialisms

grammar of speech (false starts, ellipsis, repetitions, unfinished elements,
discourse and dialogue patterns
production, lexical and grammatical errors/rarities in speech
relationships between and frequency of these

This paper will deal primarily with lexical variations between types of speaker,
and illustrate some of the lexical errors produced by younger native speakers of

3. The Polytechnic of Wales Corpus of Childrens Spoken English

The POW Corpus was collected by Robin Fawcett and Mick Perkins, between
1978-9, for the purpose of studying development of syntax and semantics in
children aged between 6 and 12. The corpus was carefully balanced for age, sex
and socio-economic class. In total, there were 96 child informants, subdivided by
age (within 3 months of 6, 8, 10 and 12 years old), sex (B, G) and class (A, B, C,
or D). Such a division resulted in 32 homogeneous groups of 3 children. Each
group was recorded in a play session (PS) performing a lego building task, and
each child was interviewed (I) separately by the same adult to discuss favourite
games, TV programmes etc.
The recordings were then transcribed orthographically, and annotated
prosodically and published in four volumes (Fawcett and Perkins 1980). A
machine-readable version of the corpus was produced in 1980 with full syntactic
analysis for each utterance, using Fawcetts Systemic Functional Grammar
(Fawcett 1981), but which omitted the prosodic annotation, and separated the
speech of each individual child into one text file. For example, the file 6ABICJ
contains the speech of a six-year-old, social class A boy in the interview situation,
whose initials are CJ. The corresponding utterances during the play session for
this individual are in the file 6ABPSCJ (but not those of his playmates). This is
beneficial for our present purpose, but does make analysis of dialogue difficult.
The original machine-readable version contains around 65,000 words, but
the corpus is now more commonly distributed as the Edited Polytechnic of Wales
Corpus (EPOW: ODonoghue 1991). EPOW contains only 60,784 word-forms
(3,730 word-types), because the texts have been edited for typographical errors
which led to part-of-speech categories wrongly being counted as words for
example. This total corresponds to around 11,000 utterances.
The corpus was initially collected and used for the study of the linguistic
development of older children (Perkins 1983). It was later used for the machine
learning of probabilistic models of lexis and grammar for computer parsing
programs (ODonoghue 1993, Weerasinghe 1994, Souter 1989, 1996).
282 Clive Souter

4. Investigations

Three investigations are presented here into vocabulary range by age, across the
sexes, and by socio-economic class. We then investigate errors in use of irregular
verbs, and the extent to which speakers develop their use of syntactically
ambiguous words.

a) Vocabulary size and rate of growth

We can use the corpus to investigate how childrens vocabulary expands with
age. Taking the part-of-speech tagged version of the EPOW corpus as our data
source, we can extract the number of unique word + word-tag pairs for each age
group. This is achieved using standard unix operating system commands on the
text files of the corpus, once they have been verticalised with only one word +
word-tag per line. For instance, the unix command

cat 6* | sort +0 -1 | uniq | wc

produces the output
1821 3642 79093
(lines strings characters)

and shows that there are 1,821 unique word + wordtag pairs used by the entire
group of six-year-olds. Extracting the same for the older children gives us an
indicative growth rate over each two year span of around 6% (Table 1). Note that
we are not talking about growth rates and vocabulary sizes for individuals here,
but of the combined vocabulary of 24 children in each age group. It does however
give us some indication of the typical upper bound for word + word-tag pairs
used by children. The number of unique word-forms is somewhat lower: the
number of unique words in the corpus is 3,730, compared with 4,618 unique word
+ word-tag pairs.

Table 1. Tagged EPOW Corpus: types by age

6 8 10 12 All
Types 1821 1938 2006 2162 4618
Growth (%) 6.4 3.5 7.8
Tokens 14120 14718 15368 16528 60784

From intuition, we may expect that vocabulary size should grow with age for
older children. We might also expect that the corpus had been carefully controlled
so that there were equal numbers of word-forms in each age cohort, but this was
not the case. As can be seen from the third row of Table 1, there are more tokens
in each cohort as the ages increase.
Aspects of vocabulary development 283



1500 Age 6
Age 8

Age 10
1000 Age 12
















Figure 1. Unique word-wordtag pairs by age

In order to discover if there is a genuine growth in vocabulary with age, we can

plot a learning curve for each age group, which shows how many unique word +
word-tag pairs are found as we read through the corpus data (Figure 1). This has
the effect of normalising for uneven sample sizes.
Until the data supply for six-year-olds runs out at just over 14,000 word-
forms, we can see that the twelve-year-olds consistently have a greater
vocabulary range than any younger group. The ten-year-olds only show a
markedly higher range once we have seen at least half of the data. The eight- and
six-year-olds appear not to differ greatly in vocabulary range. Rather surprisingly,
for much of the learning curve shown in Figure 1, the six-year-olds exceed the
eight-year-olds slightly in vocabulary range.
284 Clive Souter

These figures for vocabulary range obviously need to be carefully

interpreted. They reflect the limited contexts in which the data were collected
(lego-building and conversations with an adult about games, films and TV), but
they are better than nothing as pointers towards active vocabulary.
For greater detail, Appendix 1 shows the 100 most frequent word + word-
tag pairs for each age group. These data reveal the pronoun I to be the most
common word across all age groups in the corpus, and a fairly consistent ranking
of other personal pronouns across the age ranges. Interestingly, he is around twice
as frequent as she across all age groups. Of the words used to express affirmation
and negation, we see a fairly consistent ranking for the word no. The use of yes is
quite consistent among six-ten year olds, but drops significantly among twelve-
year-olds. The use of yeah instead of yes is a growing trend across all the age
groups, and increases quite sharply among twelve-year-olds, as use of yes

b) Vocabulary differences by sex and age

Using similar unix commands, we can easily separate the data by sex and age.
Table 2 shows the range of word + word-tag pairs used by boys and girls.
Although the overall total for the corpus for each sex is almost the same, this
parity is only maintained in the subcorpus for eight- and ten-year-olds. Six-year-
old boys appear to have a significantly smaller vocabulary than six-year-old girls,
whereas the reverse is the case for twelve-year-olds, at least to judge from the
POW corpus.

Table 2. Tagged EPOW Corpus: word-wordtag types by sex and age

6 8 10 12 Total
Boys 1099 1252 1319 1454 3054
Girls 1265 1250 1319 1342 3044
Total 1821 1938 2006 2162 4618

What is interesting to observe here, and which is made more obvious in Table 3,
is the number of word types being used only by boys, or only by girls.

Table 3. Raw EPOW Corpus: word types

Girls Boys 6 8 10 12 All
2487 2491 1508 1614 1670 1760 3730

There are 3,730 unique words (word types) being used in the corpus as a whole.
Table 3 columns 1 and 2 show how many of these are used specifically by just the
boys or just the girls. Columns 3-6 show how many types are used by the six-
year-olds (of either sex), eight-year-olds, ten-year-olds, and twelve-year-olds,
respectively. Columns 3-6 are indicative of fairly steady vocabulary growth in
children aged between six and twelve.
Aspects of vocabulary development 285

Boys use 2,491 words and girls 2,487, which are remarkably similar totals.
However, only around 1,240 of the words in the corpus are being used by both
sexes, and the other half is specific to the speakers sex. We might perhaps expect
that the overlap between sexes would increase if we had a larger corpus, or if the
speakers were adult, but perhaps this distribution is demonstrating a genuine
socio-linguistic phenomenon as well.
We can explore the words used only by boys or only by girls by deleting
those used by both from an alphabetically sorted lexicon extracted from the
corpus. Appendix 2 contains such words (beginning with A) extracted from the
An obvious area of difference is in the use of proper nouns. Male names
are prominent in the boys only list, and female names in the girls only list. The
corpus also displays stereotypical examples for favourite toys, careers, games etc
for each sex. Beyond this, we have to speculate as to whether the appearance of a
word in one column or the other is due to data sparsity, or whether it really is
indicative of a difference between the sexes.
There is evidence for both, I would argue. Data sparsity is evidenced by
the occurrence of amusement twice in boys speech (but not in girls), and
amusements once in girls speech (but not in boys). Boys talk about aeroplane,
aircraft, air-force and airport, whereas only air stewardess and air hostess
feature on the girls side. Boys talk about antennas and airholes, action men and
astronauts, whereas girls talk about animal magic, all creatures great and small,
and Alice in Wonderland.
Clearly, in a list such as Appendix 2, many of the items occur only once in
the corpus. If we instead consider the most frequent words used by boys and girls,
can we see any differences? Appendix 3 contains the 100 most frequent word +
word-tag pairs in the boys and girls sub-corpus. If we consider the most
common words which express affirmation or negation, we can see a clear
difference between the sexes. In the POW Corpus, words like yes and no are
labelled with the part of speech F (formula). Given that the corpus contains equal
quantities of text spoken by each sex, boys tend overall to use more positives than
girls do, whereas girls use more negative words, as illustrated in Table 4. There
are, of course, other ways of expressing affirmation and negation, but these are
the ones found most frequently in the corpus. (The use of no as a quantifier has
been omitted from the table.) Either this reflects a general trend between the sexes
in childrens spoken language, or it is an artifact of the tasks performed in corpus
collection. Perhaps Lego building elicits more positive responses from boys, and
more negative responses from girls. Perhaps being interviewed by a friendly male
adult has an impact.
286 Clive Souter

Table 4. Occurrence of some affirmatives and negatives by sex

Item (part of speech) Boys Girls
YEAH (F) 561 336
YES (F) 136 214
YEH (F) 52 41
TOTAL 749 591

NO (F) 274 311

NOT (N) 130 174
DONT (ON) 188 223
CANT (OMN) 59 102
HAVENT (OXN) 75 79
TOTAL 726 889

In line with the data for all the children, regardless of sex, the personal pronoun
he occurs far more frequently than she. One might expect this in the boys
language (239 instances of he against only 56 instances of she), but even the girls
use he (178 occurrences) more frequently than she (123 occurrences).

c) Track differences in social background

The corpus also allows us to look for possible differences by socio-economic

class, which is expressed from A (highest) to D (lowest) in the corpus filenames,
and was judged by parental occupation information collected when the corpus
was compiled. Table 5 displays the word + word-form types by class and age.

Table 5. Tagged EPOW Corpus: types by social class and age

6 8 10 12
ClassA 846 806 983 979
ClassB 852 699 923 938
ClassC 761 813 789 786
ClassD 546 871 702 890

Few clear patterns are evident. Vocabulary range is not always highest for the
class A children, although it is for the ten- and twelve-year-olds. For eight-year-
olds, it is the class D children who have the widest vocabulary. Given the
judgmental approach to allocation of socio-economic class labels, it is perhaps
not worth exploring this area any further.

d) Genuine learners errors (not typographical or transcription errors)

Running a spelling checker on the Edited POW Corpus, and ignoring the many
proper nouns, we can find some examples of native learner errors, such as regular
Aspects of vocabulary development 287

past tense forms for irregular verbs. Table 6 shows alphabetically which errors of
this kind are found in the corpus, and the source file in each case. One six-year-
old girl is the source of many of these. There are only 11 such errors among the
six-year-olds. Eight-year-olds have produced only four, and thereafter it appears
that these children have learned to use the irregular forms correctly.

Table 6. Past form errors of irregular verbs in POW

Word Source
amnt 6cg (6cgihb)
blowed 8cb
bringed 6cg x 2
comed 6cg
digged 6cg
drawed 8db
keeped 6cg
rided 6cg
runned 6cg x 2
shooted 6ag
throwed 6bg
weared 8db x 2

e) Lexical ambiguity

One of the reasons for using the tagged POW corpus in these investigations was
to discover whether there was an increase in the range of syntactic uses of a word
with age, between the ages 6-12. Do children of these ages know how to use the
word cut as a noun, verb, and adjective? Table 7 shows the number of lexically
ambiguous word types used by each age group, as a percentage of the total
number of types of word + word-tag pairs. This proportion remains remarkably
static across the four age groups. Perhaps children have already learned all such
syntactic differences before the age of six, but I would think that unlikely. More
probably, the corpus elicitation tasks were too constrained to demonstrate this
feature adequately.

Table 7. Tagged EPOW Corpus: ambiguous type

6 8 10 12
Word types 1508 1614 1670 1760
Ambiguous types 204 214 211 238
(% by age group) (13.52) (13.25) (12.6) (13.52)
288 Clive Souter

5. Conclusions

The five investigations have hopefully illustrated some of the possibilities for
discovery of distinguishing features of childrens vocabulary development.
Whilst in some areas it is clear that the data are too sparse (to inform the
compilation of a childrens dictionary, for example), there are others which are
more promising and perhaps disturbing, from the point of view of syllabus and
course material designers. The POW corpus evidence suggests that many of the
words we use between the ages of 6-12 are not regularly used by the opposite sex
in similar contexts. This feature is worth a good deal more investigation. Growth
in vocabulary with age has also been demonstrated, although perhaps not at a rate
of increase we might expect. It would be interesting to compare the vocabulary of
children aged 6-12 with that of adults in the better known corpora, but the limited
tasks for speech collection used in the POW Corpus would confound a
straightforward comparison.
For syllabus and coursebook designers, there are also some warnings to be
made with respect to the Welsh dialect features of the POW Corpus. Although the
collectors sought to minimise Welsh language influence in the data, there are
some dialectal features which show through quite strongly. Two of these are the
disproportionately high occurrence of tag questions (including the use of isnt it
without person agreement with the main clause verb), and the use of Welsh
dialect locative adverbs by-here and by-there, instead of here and there, which
becomes more prevalent in the older age groups.
Further warnings should be made regarding the domain-based lexis. The
most frequent common nouns in POW are house, door, man, window and car,
because of the Lego-building task which the children were set.
From the point of view of syntactic structures, the POW corpus illustrates
just how ill-behaved speech can be, especially when uttered by children.
Around 30% of the constituents in the parsed corpus are lacking a grammatical
head, mainly because of ellipsis or interruption, so there is a wide range of
grammatical structures not typically found in written corpora.
The POW Corpus is a small corpus for lexical work, but it still reveals
some interesting comparative and quantitative linguistic features of children of
different ages and across the sexes. It is almost unique as a lexico-grammatical
resource for childrens spoken language. I have not tried to show all such
features, by any means, but I hope to have demonstrated that it is worth
exploring, particularly if you have an interest in learning and teaching language.
Aspects of vocabulary development 289


Atwell, E., P. Howarth and C. Souter (2003), The ISLE Corpus: Italian and
German spoken learners English, ICAME Journal 27: 5-18.
Fawcett, R.P. (1981), Some proposals for systemic syntax. Journal of the
Midlands Association for Linguistic Studies (MALS), 1.2, 2.1, 2.2 (1974-
76). Re-issued with light amendments, 1981, Department of Behavioural
and Communication Studies, Polytechnic of Wales.
Fawcett, R.P. and M. Perkins (1980), Child language transcripts 6-12 (with a
preface, in 4 volumes). Department of Behavioural and Communication
Studies, Polytechnic of Wales.
Granger, S. (1993), The International Corpus of Learner English, in: J. Aarts, P.
de Haan and N. Oostdijk (eds), English language corpora: design,
analysis and exploitation. Amsterdam: Rodopi, 57-69.
Granger, S. (ed.) (1998), Learner English on computer. London and New York:
Addison Wesley Longman.
Menzel, W., E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton and C.
Souter (2000), The ISLE Corpus of non-native spoken English, in: M.
Gavrilidou, G. Carrayannis, S. Markantionadou, S. Piperidis and G.
Stainhaouer (eds), Proceedings of LREC2000: Language Resources and
Evaluation Conference, vol. 2, 957-964. European Language Resources
O'Donoghue, T.F. (1991), Taking a parsed corpus to the cleaners: the EPOW
corpus, ICAME Journal 15: 55-62.
O'Donoghue, T.F. (1993), Reversing the process of generation in Systemic
Grammar. Ph.D. thesis. School of Computer Studies, Leeds University.
Perkins, M.R. (1983), Modal expressions in English. London: Frances Pinter.
Souter, C. (1989), The COMMUNAL Project: Extracting a grammar from the
Polytechnic of Wales corpus, ICAME Journal 13: 20-27.
Souter, C. (1996), A corpus-trained parser for systemic-functional syntax. Ph.D.
Thesis. School of Computing, University of Leeds.
Weerasinghe, A.R. (1994), Probabilistic parsing in Systemic Functional
Grammar. Ph.D. thesis. School of Computing Mathematics, University of
Wales College of Cardiff.
290 Clive Souter

Appendix 1: 100 most frequent word-wordtag pairs by age in POW

Age 6 Age 8 Age 10 Age 12

Frq Type Tag Frq Type Tag Frq Type Tag Frq Type Tag
762 I HP 641 I HP 644 I HP 632 I HP
507 THE DD 597 THE DD 556 THE DD 590 THE DD
489 A DQ 451 A DQ 431 A DQ 530 A DQ
389 AND & 411 AND & 426 IT HP 403 IT HP
336 YOU HP 368 IT HP 391 AND & 359 AND &
328 IT HP 348 WE HP 381 YOU HP 342 'S OM
254 'S OM 281 'S OM 296 'S OM 337 WE HP
196 GOT M 262 YOU HP 264 WE HP 333 THAT DD
191 THAT DD 262 THAT DD 234 THAT DD 327 YEAH F
168 WE HP 192 YEAH F 230 YEAH F 319 YOU HP
155 THEY HP 170 NO F 155 THEY HP 191 GOT M
151 IN P 163 GOT M 149 TO I 171 PUT M
148 YEAH F 143 THEY HP 147 NO F 166 IN P
134 MY DD 123 PUT M 141 GOT M 158 NO F
132 TO I 113 TO I 131 THERE STH 157 THEY HP
129 HE HP 113 IN P 124 IN P 145 DON'T ON
110 NO F 110 YES F 122 THIS DD 141 ONE HP
107 CAN OM 108 THIS DD 119 OF VO 129 TO I
100 YES F 104 ON P 109 PUT M 112 OF VO
92 LOOK M 103 MY DD 104 YES F 108 HAVE M
90 'M OX 101 HE HP 104 HE HP 107 ON AX
84 TWO DQ 100 CAN OM 98 DO M 106 THIS DD
84 OF VO 98 'LL OM 96 ONE HP 104 NOT N
83 ONE HP 97 LOOK M 96 LOOK M 102 KNOW M
80 MAKE M 91 DO M 87 ALL DQ 93 BE M
79 PUT M 90 BE M 86 'LL OM 87 CAN OM
79 ON AX 89 'VE OX 84 BE M 85 ON P
77 GO M 87 MAKE M 81 HAVE M 83 HE HP
76 HAVE M 85 OF VO 79 CAN OM 81 GO M
75 KNOW M 81 ONE HP 76 IF B 80 THEM HP
69 GET M 78 GO M 72 KNOW M 76 LIKE P
Aspects of vocabulary development 291

68 'S OX 73 ON AX 71 UP AX 73 GET M
67 IF B 70 WHAT HWH 70 'VE OX 66 WITH P
66 NOT N 68 THEM HP 70 'S OX 66 'VE OX
64 THEM HP 68 IF B 67 NOW AX 64 LOOK M
62 SOME DQ 67 'S OX 67 MAKE M 63 OUT AX
58 DO M 55 WITH P 58 GO M 61 IS OM
56 UP AX 55 'M OX 57 ON AX 59 LITTLE AX
56 DOOR H 51 WAS OM 57 IS OM 58 ROOF H
55 BUT & 50 AND-THEN & 56 HERE AX 58 DO M
53 'VE OX 46 AN' & 54 LIKE P 56 UP AX
51 BE M 45 SHE HP 53 THESE DD 55 FOR P
50 ALL DQ 45 GET M 53 BUT & 54 JUST AI
46 ONE DQ 42 OUT AX 48 THINK M 49 TO P
46 COME M 41 UP AX 47 JUST AI 48 'S OX
45 WANT M 40 CAN'T OMN 46 FOR P 47 BUT &
45 'LL OM 39 TO P 45 ONE DQ 45 MAKE M
42 CAR H 37 DO O 42 NEED M 43 RED AX
36 MINE HP 35 IN AX 37 YEH F 39 OR &
292 Clive Souter

35 NOW AX 34 ROOF H 36 SO & 39 'RE OX

35 AT P 34 JUST AI 34 WENT M 39 'D OM
33 HAVE OX 33 HERE AX 34 BUILD M 37 AN' &
30 COS B 31 GOOD AX 32 'D OM 35 THEN AX
28 WENT M 30 DOWN AX 31 'RE OX 34 SEE M
27 IN AX 29 ME HP 30 THING H 33 DOOR H



Appendix 2: Sex-specific words in POW

Boys only talk Girls only talk

Freq Word Type Freq Word Type
4 A-LEVEL 1 A'
2 ABOVE 1 A...
Aspects of vocabulary development 293

294 Clive Souter


Appendix 3: 100 most frequent word-wordtag pairs by sex in POW

Boys Girls
Frq Type Tag Frq Type Tag
1190 I HP 1489 I HP
1186 THE DD 1064 THE DD
942 A DQ 959 A DQ
800 IT HP 801 AND &
749 AND & 727 YOU HP
571 YOU HP 725 IT HP
571 'S OM 602 'S OM
565 WE HP 552 WE HP
561 YEAH F 477 THAT DD
354 GOT M 337 GOT M
288 IN P 336 YEAH F
274 NO F 311 NO F
249 THEY HP 282 TO I
241 TO I 266 IN P
240 PUT M 242 PUT M
239 HE HP 232 THERE AX
209 THIS DD 223 DON'T ON
193 ON AX 214 YES F
190 ONE HP 211 ONE HP
188 DON'T ON 202 MY DD
179 CAN OM 200 LOOK M
173 ON P 197 HAVE M
167 'LL OM 194 CAN OM
166 THERE AX 193 ON P
151 MY DD 188 OF VO
149 LOOK M 180 KNOW M
149 DO M 178 HE HP
148 BE M 174 NOT N
Aspects of vocabulary development 295

146 MAKE M 170 BE M

144 HAVE M 163 GO M
140 KNOW M 156 DO M
140 'VE OX 150 ALL DQ
138 IF B 147 'LL OM
137 ALL DQ 138 HOUSE H
136 YES F 138 'VE OX
136 THEM HP 133 MAKE M
136 GET M 131 IF B
131 GO M 129 IS OM
130 NOT N 127 NOW AX
127 'S OX 127 LIKE M
118 UP AX 126 'S OX
114 IS OM 124 WHAT HWH
111 NOW AX 123 WAS OM
109 WITH P 123 SHE HP
104 WAS OM 123 ON AX
96 TWO DQ 118 LIKE P
96 FOR P 117 HERE AX
94 HERE AX 114 BUT &
92 LIKE P 109 GET M
92 JUST AI 106 UP AX
91 TO P 106 SOME DQ
91 'M OX 105 'M OX
88 HAVE-TO XM 101 DO O
86 AND-THEN & 95 TO P
84 DOOR H 92 ME HP
84 BUT & 91 THESE DD
78 CAR H 84 PLAY M
296 Clive Souter


61 GOT-TO XM 77 'RE OM
61 DO O 74 WENT M
59 MAN H 65 ARE OM
56 SHE HP 64 'D OM
56 IN AX 62 AN' &
54 AT P 59 OR &
52 YEH F 58 NEED M
Demonstrative reference as a cohesive device in advanced
learner writing: a corpus-based study

Roumiana Blagoeva

Sofia University St. Kliment Ohridski


This paper discusses the under/overuse of different types of demonstrative

reference and their role for the achievement of cohesion in argumentative essays
written by advanced Bulgarian learners of English. The use of pro-forms and
their place within the total framework of text-forming relations are examined in
both native and non-native writing. A comparative approach to the study of
learner language is adopted for the investigation of differences between learner
and native English writing. These differences shed light on L1- induced and
universal features of learner discourse.
The analysis is based on data drawn from the Bulgarian component of the
International Corpus of Learner English (BUCICLE), the LOCNESS corpus of
native learner writing, a sub-corpus of the BNC, and a corpus of Bulgarian non-
learner writing. The frequency of occurrence, the distribution of demonstratives,
and their function as reference items in the four corpora are compared and
examples of their use are discussed.
Explanations of the phenomena observed are sought in several directions:
L1 interference, strategies of teaching/learning, avoidance of certain discourse
patterns, and the nature of the text type.
The differences between learner and native speaker English in the
frequency and distribution of demonstratives might not directly obstruct
communication but it is an indication that there is still much to be done in the
development of language skills even at an advanced level of foreign language
acquisition. The adoption of a corpus-based approach to the study of learner
language can reveal problematic areas in the foreign language and can enable
language researchers and language teaching professionals to diagnose the true
needs of learners and make appropriate choices of teaching materials and

1. Introduction

Interlanguage studies in Bulgaria developed in the early 1980s as a result of the

growing awareness that it was hardly possible to achieve effectiveness in foreign
language acquisition (FLA) and improvement of foreign language teaching (FLT)
without knowledge of the learners needs and the peculiarities of their foreign
298 Roumiana Blagoeva

language production. Course designers, textbook authors and teachers

concentrated their efforts, on the one hand, on cross-language comparisons which
helped to generate predictions about the areas of learning difficulty in the target
language, and, on the other hand, on analysing learners errors and the factors that
cause them.
Such studies placed too much emphasis on errors detectable on the
phrase and sentence levels, and they paid little attention to the inability of learners
to create a unified whole of the sentences that they produced. This led to the
assumption that as long as students stick to the rules of grammar and the
appropriate use of words they would be able to communicate successfully in the
foreign language. Yet, it was perceived by both teachers and learners that even at
a high level of FLA where very few errors occur there is still much difference
between learner and native-speaker production.
In recent years the collection of electronic learner-language corpora has
led to a shift of priorities in the study of learner production mainly in two
directions. First, by providing larger stretches of discourse a corpus enables
language teaching professionals and language researchers to study not only
isolated sentences and their structure but also the ways these sentences are
organised and utilised by text producers in realistic conditions for the purposes of
communication. Second, electronic learner corpora and corpus linguistics have
provided the necessary material and tools to turn the focus of attention from
erroneous structures to language patterns that might consist of acceptable units of
language but used in unnatural combinations. With the help of corpus data it is
now possible to reveal and analyse quantitative as well as qualitative differences
between learner and native speaker production. These differences seem to be a
major cause of the artificiality of learners interlanguage and they indicate the real
areas of difficulties in the acquisition of a foreign language.

2. Aims of the study

This paper is part of a wider study of grammatical cohesive devices in

argumentative essays written by advanced Bulgarian learners of English which
aims at establishing how Bulgarian learners of English use the resources available
in the foreign language to achieve effective communication. It deals with the
under/overuse of the demonstratives this, that and their plural variants these,
those, both in their functions as determiner (modifier) and pronoun (head), and
their use as cohesive ties in written advanced learner discourse.

3. The corpora

A learner corpus is very different from a native corpus because of the nature of
the material collected. A native corpus contains data from a natural language and
can be used on its own for the investigation of characteristic features of this
language. A learner corpus presents evidence of an interlanguage; and an
Demonstrative reference as a cohesive device 299

interlanguage, regardless of its stages of development, can only be an

approximation to the natural language that is the target aimed at in the process of
FLT. Therefore, any learner corpus would be of little value on its own, but it can
be a useful tool for investigating a particular interlanguage when compared to a
relevant native corpus. The choice of the native-speaker corpus is dependent on
the aims of FLT. If the final goal of FLT/FLA is to achieve an ability to use the
target language as it is used by native speakers for the fulfilment of certain real-
life tasks, then a study of interlanguage will, firstly, need a suitable sample of the
foreign language to compare with the learners production. Secondly, a learner
language is always characterised by some degree of L1 interference and, thirdly,
it could be influenced by the nature of the text type that learners have to produce.
Therefore, their language should be evaluated against a target norm representing a
similar text type. For all these reasons, comparisons with relevant data that take
into consideration these aspects of learner production are indispensable for a
comprehensive description and investigation of any feature a learner corpus might
In view of the peculiarities of learner corpora mentioned above, the
present analysis is based on comparisons of data drawn from four electronic
corpora of about 200,000 words each.
Corpus 1 is a learner corpus of argumentative essays written by Bulgarian
university students of English language and literature, compiled within the
framework of the International Corpus of Learner English (ICLE) project, namely
the Bulgarian sub-Corpus of the International Corpus of Learner English
(BUCICLE). The ICLE project was launched at the University of Louvain in
1990. From the very beginning strict design criteria were adopted and variables
such as age, sex, native language background, level of foreign language
education, and the type and length of texts to be included were carefully
controlled. Each of the research teams from the participating countries was to
assemble a computerized collection of 200,000 words of learner English. At
present the ICLE corpus contains approximately 2 million words of
argumentative writing from university students of English from 11 different
language backgrounds and is an important resource for analysing features of
written interlanguage grammar, lexis and discourse (for further details, see
Granger, Dagneaux and Meunier 2002).
Corpus 2 is the British component of the Louvain Corpus of Native
English Essays (LOCNESS) containing argumentative essays by native-speaker
university students.
Corpus 3 is a sub-corpus of the BNC consisting of non-fiction texts from
the domains of Applied Science, Social Science and World Affairs, as this is the
target norm Bulgarian students are expected to master.
Corpus 4 is a collection of texts written in Bulgarian and taken from
domains comparable to those of the BNC sub-corpus.
300 Roumiana Blagoeva

4. Theoretical framework

Before discussing the results it is necessary to mention some similarities and

differences between the demonstratives and their role as cohesive devices in
English and Bulgarian. As far as textual relations are concerned demonstratives in
English and Bulgarian behave in a similar way. First, in both languages
demonstratives can function as determiners in noun phrases, or as pronouns, i.e.
as whole noun phrases. Second, in both languages their basic deictic function is to
indicate definiteness and proximity: near and remote (or not near) from the
point of view of the speaker. Third, in both languages they indicate that
information about their meaning, their referent, is to be retrieved from elsewhere:
either from the communicative situation thus relating exophorically to entities in
the world outside the text, or from the text itself where they refer endophorically
to preceding or following items expressing anaphoric or cataphoric reference
respectively. They refer to the location of some thing (person or object) in space
or time that is participating in the process. Finally, in both languages they have
distinct singular and plural forms (for Bulgarian, see Maslov 1982: 309-310;
Krastev 1992: 77-78; Pashov 1994: 95; Andreichin et al. 1998: 239; for English,
see Quirk and Greenbaum 1973: 107; Halliday 1985: 160, 292; Leech and
Svartvik 1994: 267; Lyons 1977: 647).
Two major dissimilarities, however, exist between demonstratives in
English and Bulgarian. The first one arises from the different expressions of
gender and the inflectional character of Bulgarian. This accounts for the larger
number of Bulgarian forms corresponding to the singular forms this and that.
Another important difference comes from the distinction between registers made
in Bulgarian, which leads to the existence of stylistically marked forms of the
demonstratives. These differences and similarities are summarised in Table 1.

Table 1. The English demonstratives and their Bulgarian equivalents

Gender Formal/Neutral Stylistically

masc. tozi/toja toz


Sing this fem. tazi/taja taz

neuter tova tuj
Pl. these tezi/tija tez
masc. onzi/onja
Sing that fem. onazi/onaja onaz

neuter onova onuj

Pl. those onezi/onija onez

One important feature of the demonstratives in English compared with the

demonstratives in Bulgarian that makes them both similar and different should be
Demonstrative reference as a cohesive device 301

noted here, namely that with extended reference and with reference to a fact
only singular forms can be used.
In English the use of demonstratives to refer to extended text, including
text as fact [] applies only to the singular forms this and that used without
a following noun (Halliday and Hasan 1976: 66). Whereas extended reference
differs from usual instances of reference only in extent the referent is more than
just a person or object, it is a process or sequence of processes (grammatically, a
clause or string of clauses not just a single nominal) text reference differs in
kind: the referent is not being taken at its face-value but is being transmuted into
a fact or report (Halliday and Hasan 1976: 52).
In Bulgarian, as Krastev (1992:78) notes, the singular form tova (near),
but not onova (remote), has a special place in the system and is one of the most
frequent and most economical words in the language. Only the demonstrative
tova can replace any word, combination of words, phrases and even whole
stretches of text. Thus in Bulgarian only one form of the singular demonstratives
performs the functions of extended reference and reference to fact, which in
English are shared between the two singular forms.

5. Comparisons and observations

Using WordSmith Tools (Scott 1997), frequency lists and concordances were
produced for all the investigated items in each of the four corpora. The raw data
were then examined to exclude all examples that were irrelevant to the present
study, namely cases where that was used as a conjunction or relative pronoun,
and whenever it was used as an adverb in front of an adjective to express the
degree of a quality. The total number of tokens that were extracted from the
corpora after these first searches is shown in Table 2.

Table 2. Frequency of occurrence of the demonstratives in the four corpora

Corpus 1 Corpus 2 Corpus 3 Corpus 4
Near singular 1167 1552 656 1600
plural 325 297 146 182
Remote singular 412 160 263 76
plural 209 161 128 28
Total 2113 2170 1193 1886

Most often a first step in a quantitative study of any language feature is to look at
the number of occurrences of the items examined, which can give a preliminary
idea of the spread of the feature through entire collections of texts. So when
examining the cohesive function of demonstratives it seems reasonable to start
with a comparison of the total number of tokens found in the corpora. A first
glance at the figures in Table 2 shows a striking similarity between the
frequencies of this/these and that/those in Corpus 1 and Corpus 2. Moreover, the
302 Roumiana Blagoeva

frequencies are nearly twice as high as that in Corpus 3 (the BNC) and slightly
higher than that in Corpus 4 (the Bulgarian language corpus).
However, these data could be misleading and could bring us to the rash
conclusion that there is no over- or underuse of demonstratives by the Bulgarian
learners of English. Instead, it may be that the use of demonstratives is
determined by the different text types represented in the learner and non-learner
corpora, as their number is greater in the argumentative essays than in the BNC
sub-corpus and the Bulgarian language corpus, both of which consist of other
types of non-fiction texts. However, if we make a distinction between near and
remote types of demonstratives and look at each of these types separately, the
picture changes, as shown in Tables 3 and 4.

Table 3. Near types of demonstratives

Near Corpus 1 Corpus 2 Corpus 3 Corpus 4
Sing. + pl. 1492 1849 802 1782

Table 4. Remote types of demonstratives

Remote Corpus 1 Corpus 2 Corpus 3 Corpus 4
Sing. + pl. 621 321 391 104

The distinction between proximity and non-proximity is expressed differently in

the learner and non-learner material. Demonstratives referring to near persons and
objects are slightly underused by Bulgarian learners when compared to British
students and this is compensated for by a clear overuse of demonstratives
referring to remote persons and objects. This tendency for Bulgarian learners to
use that/those occurs in spite of the very low frequency of occurrence of their
Bulgarian equivalents.
So far mere statistical comparisons of the data suggest that native language
interference as a factor determining learner production plays an insignificant role
in the use of English demonstratives by the Bulgarian learners. However, looking
carefully at the examples extracted from the corpora, we can observe that the
Bulgarian learner writing shows a much wider variety of patterns than the
LOCNESS and the BNC material. The question is how this difference could be
Two very typical patterns that have some relevance to cohesion in that
they determine the use of demonstratives in endophoric (textual) reference were
observed in the BUCICLE. The first involves a demonstrative functioning as
determiner, as in:

(1) I know a little boy, for example, whose father is a scientist. This nine-year
old boy reads only Science Fiction and I can never persuade him to read a
fairy tale or fable or a folk tale. He is not interested even in books about
famous adventurers, about sailors and pirates, books which I read with
Demonstrative reference as a cohesive device 303

great interest and pleasure when I was his age. That boy reads only about
robots, machines, spacecraft, numbers. I agree that Science Fiction
somehow stirs children's imagination but it creates a world controlled by
machines, rather than one controlled by human beings. Probably the
science fiction stories will be the fairy tales of the new era. (BUCICLE)

The other typical group of examples observed involves the use of demonstratives
to refer to extended text, including text as fact. In English this function applies
only to the singular forms this and that used without a following noun (see
Halliday and Hasan 1976: 66) as in:

(2) Sinclair's, at all events, is the work of a Modernist, and is unlikely to be

that of an occultist. This makes it, in a sense, compatible with Hawksmoor.
But Hawksmoor is a different beast. (BNC)

(3) It fulfilled none of my expectations and seemed to be merely trying to

make me laugh at the fact that it had left me standing there grasping at
nothing. And that was all there was to it. By contrast, here is a comment
by an anthropologist who went to see the work of Mark Rothko. (BNC)

In English the choice of this or that to refer to something that has been said before
is clearly related to that of near (the speaker) versus not near; what I have
just mentioned is, textually speaking, near me whereas what you have just
mentioned is not (Halliday and Hasan 1976: 60). At the same time the notion
of proximity has various interpretations; and in such cases there is no very clearly
felt distinction between this and that (Halliday and Hasan 1976: 61).
In Bulgarian the demonstrative tova (singular, neuter, near), which
according to most traditional Bulgarian grammars (Krastev 1992; Pashov 1994;
Andreichin et al. 1998) expresses the idea of near in time and space, has a very
wide spectrum of uses and has a special place in the system of Bulgarian
demonstratives. As mentioned above in Section 4, apart from its use as pronoun
or determiner to refer to any singular neuter object or person, it is the only
demonstrative that can convey extended reference relations in a text. Here the
distinction near/remote is lost and the reference of tova is derived from the
immediate context in or outside the textual world irrespective of the idea of
proximity. Thus in this particular function its use coincides with both this and
that in English and we may expect a great overuse of this by Bulgarian learners.
The functions of onova (singular, neuter, remote) are always either Head
or Modifier so it can never be used in extended reference and reference to fact;
and as the data demonstrate (Table 6) it is rare in Bulgarian. Yet, this infrequent
use of onova does not cause an underuse of its English equivalent that by the
Bulgarian learners. On the contrary, Table 2 shows a clear overuse of that in
Corpus 1 in comparison with Corpora 2 and 3. It is true that the total number of
singular forms is nearly the same in the learner material, the native-speaker
304 Roumiana Blagoeva

student writing and the Bulgarian language corpus, as shown in Table 5 and this
at first glance may blur some differences.

Table 5. Frequency of singular forms

Singular Corpus 1 Corpus 2 Corpus 3 Corpus 4
Remote + Near 1579 1712 918 1676

However, the number of singular demonstratives used by the Bulgarian learners is

unevenly distributed between this and that, with a predominance of near over
remote, with the result that the total frequency of this and that in Corpus 1 (1579)
approaches that of tova in Corpus 4 (Table 6).

Table 6. Frequency of singular forms in BUCICLE and the Bulgarian language

Singular Corpus 1 Corpus 4
Near 1167 1600
Remote 412 76
Total 1579 1676

One possible reason could be the fact that most teaching materials used in
Bulgaria overlook the distinction between the English counterparts of tova and
onova and learners are left with the impression that it is unimportant and that both
this and that, having a very wide range of referents, could be used
indiscriminately to point to any word, phrase or longer stretch of text.
The lower frequency of singular forms in Corpus 3 than in the other
corpora could be attributed to the differences between the text types involved.
One could argue that since the distinction near/remote in the use of the
singular forms is not as clear-cut in English as in Bulgarian, the
interchangeability of this and that is permissible and might not lead to serious
communication breakdowns. Still, it is my view that it could interfere with a
receivers comprehension of a text and could contribute to the production of
unclear textual references by learners of English. In the following example the
choice of this or that would only slightly change the point of view of the writer:

(4) [] no-one is to be thought superior to another despite the differences of

race, social status, nationality and so on and every person is to be treated
objectively by the law and social institutions. And though that is being
continuously officially stated and re-stated often the talk about equality
remains just an euphemism to hide the cruel reality. It is obvious that some
people are more equal than others. [BUCICLE]

That is probably preferred because the fact it refers to in the preceding sentence is
not explicitly linked to the personal feelings of the writer; it is perceived rather as
Demonstrative reference as a cohesive device 305

being officially stated by a third party. In such cases this could easily substitute
for that and make the whole statement more involved.
But sometimes this tendency goes too far and in their desire to vary their
style and avoid repetition learners use this and that as absolute synonyms.
Consider the following examples from BUCICLE:

(5) [] my opinion is that dreaming and imagination are still part of our
society. Even if it werent so, I do not see what the problem is. The world
is changing, developing all the time and if it does not need these, it gets rid
of them as something useless, that is just the way it goes. And if someone
cannot live without dreams they either adapt to the new conditions or keep
dreams in their souls which is a question of personal choice.

In (5) it is unclear why the referents of these (dreaming and imagination) are
perceived as being closer to the writer of the passage than the fact that is referred
to by means of that. The idea of proximity is even more confused in (6) where
one and the same fact is referred to by both this and that in the same sentence:

(6) But is it really so, or it is just another old-dated "fairy tale" we are taught
to believe in and which is so trivial that we have learned it by heart. We
fight for freedom, we strive for equality, we talk about democracy and
having equal rights, but that is just an illusion, with which our minds are
washed away and we are all blind, because we believe in this. Human
beings are not equal. Inequality is determined by history. History is the
reflection of our lives.

6. Conclusions

The observations of the data presented in this paper demonstrate: (1) an overuse
of demonstratives in argumentative writing by both Bulgarian learners of English
and native-speaker students; (2) a tendency for Bulgarian learners to use
that/those in spite of the very low frequency of occurrence of their Bulgarian
equivalents; (3) a similar frequency of this/these in Bulgarian learner writing and
English native-speaker student writing; (4) a similar frequency of this/these and
their Bulgarian equivalents.
These findings shed light on some aspects of Bulgarian learner discourse
that are still unexplored and need further investigation. At this stage of the study
some of the similarities between the production of Bulgarian learners and native
speaker students might point to an influence on learner production by the nature
of the text type. A task-based learner corpus requiring students to produce one
particular text type might not reveal features of other text types. Yet, an academic
essay gives students freedom to write what they want, and more importantly what
they can, on a variety of topics, and in this sense a corpus of this kind can tell the
researcher a lot about learners abilities to produce coherent texts in any real-life
306 Roumiana Blagoeva

context. It can allow us to draw meaningful conclusions about how aware, or

rather unaware, learners are of certain discourse features.
One indisputable reason for the deviations in the use of demonstratives by
Bulgarian learners from the native speaker target norm is native language
interference. The differences that exist between the systems of demonstratives in
English and Bulgarian reflect affect learner production even at an advanced stage
of foreign language acquisition.
It is also my contention that there exists a strategy of communication
common to many advanced second language learners, namely that at a certain
stage of FLA they feel confident enough to communicate in the foreign language
and stop learning in the sense that they tend to stick to language patterns that
have become fossilised at an earlier stage of learning and continue to learn at a
slower pace, mostly by adding vocabulary. The main concern of such learners are
the real errors they make at the level of vocabulary and syntax and it never
occurs to them that there could be other aspects of the foreign language that are to
be mastered. If at a certain stage of FLA learners are made aware that there is a
tendency for them to resort to a restricted range of language patterns, they would
probably be encouraged to learn alternative ways of expression and a more target-
like way of producing coherent texts.
Naturally, further corpus-based research in this area is likely to enhance
our understanding and intuitive evaluation of learner production and point to
effective ways of bringing their interlanguage closer to the kind of language used
by native speakers of English. This can be done through the development of
teaching materials and methods that focus attention not only on grammar rules
but also on discourse features.


Andreichin, L. et al. (1998), Gramatika na sa vremennija balgarski knijoven ezik.

Morfologija. a s t parva. [Grammar of the Contemporary Bulgarian
language. Morphology. Part one]. Abagar Publishing.
BNC World Edition, December 2000, SARA Version 0.98. Published by the
Humanities Computing Unit of Oxford University on behalf of the BNC
Granger, S., E. Dagneaux and F. Meunier (eds) (2002), International Corpus of
Learner English. Version 1.1. Handbook & CD-ROM. Louvain-la-Neuve:
Presses Universitaires de Louvain.
Halliday, M.A.K. (1985), An introduction to functional grammar. London and
New York: Edward Arnold.
Halliday, M.A.K. and R. Hasan (1976), Cohesion in English. London and New
York: Longman.
Krastev, B. (1992), Gramatika za vsichki [Grammar for all]. Sofia: Nauka i
Demonstrative reference as a cohesive device 307

Leech, G. and J. Svartvik (1994), A communicative grammar of English. London

and New York: Longman.
Lyons, J. (1977), Semantics, Vol. 2. Cambridge: Cambridge University Press.
Maslov, J.S (1982), Gramatika na ba lgarskija ezik [Grammar of the Bulgarian
language]. Sofia: Nauka i izkustvo.
Pashov, P. (1994), Prakti eska balgarska gramatika [Practical Bulgarian
grammar]. Sofia: Prosveta.
Quirk, R. and S. Greenbaum (1973) A university grammar of English. Longman.
Scott, M. (1997), Wordsmith tools. version 2. Oxford: Oxford University.
Translations as semantic mirrors: from parallel corpus to

Helge Dyvik

University of Bergen


The paper reports from the project From Parallel Corpus to Wordnet at the
University of Bergen (20012004), which explores a method for deriving wordnet
relations such as synonymy and hyponymy from data extracted from parallel
corpora. Assumptions behind the method are that semantically closely related
words ought to have strongly overlapping sets of translations, and words with
wide meanings ought to have a larger number of translations than words with
narrow meanings. Furthermore, if a word a is a hyponym of a word b (such as
tasty of good, for example), then the possible translations of a ought to be a
subset of the possible translations of b.
Based on assumptions like these a set of definitions are formulated,
defining semantic concepts like, e.g., synonymy, hyponymy, ambiguity and
semantic field in translational terms. The definitions are implemented in a
computer program which takes words with their sets of translations from the
corpus as input and performs the following calculations: (1) On the basis of the
input different senses of each word are identified. (2) The senses are grouped in
semantic fields based on overlapping sets of translations, such overlap being
assumed to indicate semantic relatedness. (3) On the basis of the structure of a
semantic field a set of features is assigned to each individual sense in it, coding
its relations to other senses in the field. (4) Based on intersections and inclusions
among these feature sets a semilattice is calculated with the senses as nodes.
According to our hypothesis, hyponymy/hyperonymy, near-synonymy and other
semantic relations among the senses now appear through dominance and other
relations among the nodes in the semilattice. Thus, the semilattice is supposed to
contain some of the semantic information we want to represent in wordnets. (5)
In accordance with this assumption, thesaurus-like entries for words are
generated from the information in the semilattice.
In the project these assumptions are tested against data from the English-
Norwegian Parallel Corpus ENPC (Johansson 1997).
312 Helge Dyvik

1. Introduction

1.1 Translations as semantic data

Parallel corpora, in which original texts are aligned with their translations into
another language, are a rich source of semantic information. Translations come
about when translators evaluate the degree of interpretational equivalence
between linguistic expressions in specific contexts. In many ways such
evaluations, made without any theoretical concerns in mind, seem more reliable
as sources of semantic information than the careful paraphrases of the semanticist
or the meaning descriptions of the lexicographer. Assuming that this is the case,
can we then retrieve some of the semantic properties of expressions by going
backwards from the network of translational relations in situated texts? Can we
reconstruct semantic properties from the translational properties manifested in a
parallel corpus?
The idea that semantic information can be gleaned from multilingual data
has been explored by others. Resnik and Yarowsky (1997), discussing word sense
disambiguation, suggest that in distinguishing between senses it may be fruitful to
restrict attention to such distinctions as are lexicalised differently in other
languages. Nancy Ide has explored the connections between semantics and
translation in several papers; in Ide et al. (2002) the authors study versions of the
same novel in seven languages and attempt to identify subsenses of words by
considering how the translations of a given word cluster in the six other texts.

1.2 Wordnets and thesauri

The output of the method presented here is a structure containing some of the
information which we find in wordnets. A wordnet is a semantically structured
lexical database. The Princeton WordNet (Fellbaum 1998), which has been built
manually, distinguishes between the senses of words and groups senses across
words into synsets according to near-synonymy. Pointers between such synsets
express semantic relations like hypero- and hyponymy, antonymy, and holo- and
meronymy. Wordnets for various European languages were developed within the
project Eurowordnet (http://www.illc.uva.nl/EuroWordNet/).
Wordnets are important resources for many applications within language
technology. They can be used in meaning-based information retrieval (searching
for concepts rather that specific word forms), in logical inference (if a document
mentions dogs, a wordnet allows the inference that it is about animals), in word
sense disambiguation (providing the search space of alternative meanings), etc.
A related kind of semantic resource is the thesaurus. As an example we
may consider the entry for the adjective conspicuous in the Merriam-Webster
Collegiate Thesaurus (http://www.m-w.com/home.htm), where two senses are
distinguished, each with its own sets of synonyms, antonyms etc.:
Translations as semantic mirrors 313

Entry Word: conspicuous

Function: adjective
Text: 1
Synonyms CLEAR 5, apparent, distinct, evident, manifest,
obvious, open-and-shut, openhanded, patent, plain
Synonyms NOTICEABLE, arresting, arrestive, marked,
outstanding, pointed, prominent, remarkable, salient, striking
Related Word celebrated, eminent, illustrious; showy
Contrasted Words common, everyday, ordinary; covert, secret;
concealed, hidden
Antonyms inconspicuous

We may compare this with the thesaurus-like entry for conspicuous below, which
has been generated automatically from parallel corpus data by the method to be
described in this paper:

Sense 1
(Norwegian: avstikkende.)
Sense 2
Hyperonyms: great, hard, large.
Subsense (i) (Norwegian: synlig, tydelig.)
clear, conclusive, definite, distinct, distinctive, obvious,
plain, substantial, unmistakable, vivid.
Hyponyms: apparent, evident, pervasive, visible.
Subsense (ii) (Norwegian: fremtredende, kraftig,
sterk, stor.)
Near-synonyms: outstanding, primary.
Subsense (iii) (Norwegian: oppsiktsvekkende.)
Near-synonyms: amazing, spectacular, startling,
surprising, unusual.

Antonyms and contrasted words are not included in the latter entry, since the
method only allows the derivation of relations of semantic similarity (synonymy,
hyperonymy and hyponymy) from the parallel corpus data. The entry displays a
major division into two senses (of which the first one in this case has no
information associated with it apart from a Norwegian translation), and
furthermore a division into subsenses within the more informative second sense.
Sense 1 in this example is probably a spurious consequence of sparsity of data
in the corpus. A better example of a major division into senses although even
there we would have liked sense 1 to have been merged with sense 4 is
provided by the following automatically derived entry for the Norwegian noun
rett, which is contrastively ambiguous between a number of senses, among which
we find course in a meal and court of law. Some of the related words listed in
this entry are surprising, while most of them are to the point:

rett N
Sense 1
(English: course.)
Sense 2
(English: court, justification.)
314 Helge Dyvik

argument, begrunnelse, berettigelse, domstolsbehandling,
grd, grdsplass, plass, sak, ting.
Sense 3
Subsense (i) (English: option.)
Hyponyms: tilbud.
Subsense (ii) (English: rightN.)
Hyponyms: adgang, rettighet.
Subsense (iii) (English: order.)
bestemmelse, klasse, krav, lov, lsning, mte, orden, regel,
regelverk, stand, system, vedtak.
Sense 4
(English: dish, food, supper.)
aftens, aftensmat, fat, fde, gryte, kar, kopp, kosthold,
kveldsmat, lunsj, mat, matvare, middag, mltid, nring, skl,

1.3 Semantic lattices

The thesaurus entries above are generated from semantic lattices, which in their
turn are derived automatically from the translational data. Figure 1 below is an
example of such a lattice, representing the semantic field associated with sense 4
of rett in the entry above (labelled rettN2 in the lattice):

Figure 1. A semantic lattice

According to the hypothesis behind the method, senses on dominating nodes are
hyperonyms of senses on dominated nodes. Thus, a sense of mat food
dominates senses of rett dish, middag dinner, mltid meal, lunsj lunch,
kveldsmat supper, aftensmat supper, and aftens supper, all of which are
plausible hyponyms of mat. Less convincingly, lunsj also dominates aftensmat.
Formally the lattice expresses inclusion and overlap relations among sets
of translationally derived features, as described in section 2.3 below.
Translations as semantic mirrors 315

1.4 The parallel corpus

The English-Norwegian Parallel Corpus (ENPC), from which the above results
are derived, comprises approximately 2.6 million words, originals and
translations included. The corpus contains fiction as well as non-fiction and
English originals translated into Norwegian as well as the other way around. The
corpus is aligned at sentence level (Johansson et al. 1996), while it is a part of our
present project to align the ENPC at word level, in order to be able to extract the
sets of translations of a given word automatically. Our present data has been
derived from the sentence-aligned corpus, however, which means that the
translational data for each word in our data set has been extracted manually.
For example, searching for the Norwegian word form bemerkelsesverdig
returns the sentences containing bemerkelsesverdig coupled with the
corresponding English sentences in the parallel text (translation or original).
Based on a set of heuristic criteria to decide whether a word can be said to
correspond to a given word in the translation or not, the set of translations of
bemerkelsesverdig is extracted by the human analyser:

(bemerkelsesverdig (amazing notable remarkable spectacular surprising))

Sets of such lemmas with their associated sets of translations from the corpus
constitute the input to the procedure deriving semantic lattices and thesaurus
entries, by principles which we now proceed to describe.

2. Semantic mirrors

2.1 Separation of senses

We assume that contrastive ambiguity, such as the ambiguity between the two
unrelated senses of the English noun bank money institution and riverside
tends to be a historically accidental and idiosyncratic property of individual
words. That is, we don't expect to find instances of the same contrastive
ambiguity replicated by other words in the language or by words in other
languages. Furthermore, we don't expect words with unrelated meanings to share
translations into another language, except in cases where the shared word is
contrastively ambiguous between the two meanings. By the first assumption there
should then be at most one such shared word.
Given these assumptions contrastive ambiguity should be discoverable in
the patterns of translational relations. We may consider the Norwegian noun tak,
contrastively ambiguous between the meanings roof and grip. Figure 2 shows
the first t-image of tak in the right-hand box, and the first t-images of each of
those English words again in the left-hand box. We refer to the last-mentioned set
of sets as the inverse t-image of tak.
316 Helge Dyvik

Figure 2. The first and inverse t-images of tak.

The point worth noticing is that the images of roof and ceiling overlap in hvelving
in addition to tak, while the images of grip and hold overlap in grep in addition to
tak. This indicates that roof and ceiling are semantically related, and similarly
grip and hold, while no overlap (apart from tak) unites grip/hold and roof/ceiling.
Grip/hold and roof/ceiling hence seem to represent unrelated meanings, and the
conclusion is that tak is ambiguous.

Figure 3. The second t-image of tak

Translations as semantic mirrors 317

The overlap patterns are necessarily preserved within the first t-image of tak
when we make our third movement and find all the first t-images in English of
the words in the inverse t-image, as shown in Figure 3. We refer to this set of sets
as the second t-image of tak.
As shown in Figure 3, the second t-image can be divided into three
clusters or groups of sets, each group being held together by overlap relations (we
only consider overlaps in the restriction of the second t-image to the members of
the first t-image). On the basis of these groups the first t-image of tak can be
partitioned into the three sense partitions shown in Figure 4.

Figure 4. The sense partitions of tak's first t-image

By this method the main senses of lemmas are individuated.

The limited size of the corpus is a source of error: a translation t of a
occurring only once in the corpus, or only occurring translationally related to a,
will give rise to a separate sense partition only containing t, and hence give rise to
a potentially spurious sense of a (cf. the doubtful sense 1 of the examples
conspicuous and rett in Section 1.2). A larger corpus might display more
alternative translations of t, and thereby include t in one of the other sense
partitions. A frequency filter excluding hapax legomena from consideration might
reduce this problem.

2.2 Semantic fields

Once senses are individuated in the manner described, they can be grouped into
semantic fields. Traditionally, a semantic field is a set of senses that are directly
or indirectly related to each other by a relation of semantic closeness.
In our translational approach, the semantic fields are isolated on the basis
of overlaps among the first t-images of the senses. Since we treat translational
correspondence as a symmetric relation (disregarding the direction of translation),
we get paired semantic fields in the two languages involved, each field assigning
a subset structure to the other. Figure 5 gives a rough illustration of the principle
(arrows indicate the t-image of each sense for simplicity, the indicated sets are
just suggested and in no way reflect the corpus data accurately).
318 Helge Dyvik

Figure 5. Paired semantic fields (simplified illustration)

The subset structure of a semantic field, assigned by its partner field in the other
language, contains rich information about the semantic relations among its
members. For example, senses with a wide meaning (such as good) will in
general have a larger number of alternative translations than words with a
narrower meaning (such as tasty). The number of translations is of course directly
reflected in the number of subsets of which the sense is a member. Thus the
senses at the peaks in the semantic fields will have the widest meanings.
We may illustrate this by means of a constructed and artificially simple
example. Assume that we find the translational pattern illustrated in Figure 6,
where hingst stallion is found translated into animal, horse and stallion, while
dyr animal is translated into animal, horse, stallion, mare and dog, etc.
Translations as semantic mirrors 319

Figure 6. A constructed example

Since animal1 is translationally related to every member of the Norwegian field,

animal1 becomes the peak of the English field, being a member of all the
subsets, with horse1 ranked immediately below it, etc. By symmetry, the
Norwegian field gets a corresponding subset structure (cf. Figure 7).

2.3 Feature assignment

The next step is to encode, for each sense, its position within the semantic field,
along with its translational relations to the members of the other field. This is
done by means of feature sets, automatically derived from the set structure. In
accordance with traditional semantic componential analysis, the intention is that
wide senses should have few features, while more specific senses should have
more features, some of which are inherited from wider, superordinate senses. This
is achieved by starting from the tops in two paired fields i.e. the sense pair
which is both translationally interrelated and whose members belong to the
largest number of subsets which in Figure 7 gives us the pair dyr1 and animal1.
A feature [dyr1|animal1] is constructed from this pair and assigned to both its
members dyr1 and animal1. Then the feature is inherited (non-transitively) by
lower senses according to the following principle: all senses in the first t-image
of animal1 and ranked lower than dyr1 (i.e. belonging to fewer subsets than dyr1)
inherit the feature, and conversely, all senses in the first t-image of dyr1 and
ranked lower than animal1 inherit the feature. Then the procedure moves to the
next highest, translationally interrelated, peaks hest1 and horse1, constructs a
feature from that pair, and assigns it according to the same principle. The result is
shown in Figure 7.
320 Helge Dyvik

Figure 7. Feature assignment in semantic fields

The feature sets in Figure 7 define a lattice based on inclusion relations among
them, as shown in Figure 8.

Figure 8. Lattices defined by the feature sets

In Figure 8 the daughters of a node N have supersets of the feature set associated
with N. In this constructed example the lattices evidently also reflect hyperonym /
hyponym relations among the senses.
Translations as semantic mirrors 321

The lattices in Figure 8 are simple trees, while actual derived lattices tend
to be more complex. In the first place, senses may inherit features from more than
one peak in the semantic field, which gives rise to multiple mothers in the
lattice. In the second place, nodes may have intersecting feature sets without
either of the sets including the other, so that there is no mother/daughter
relationship between the nodes in question. When no actual sense is associated
with the intersection, x-nodes (cf. Figure 1) are introduced, carrying the
intersection of the feature sets of their daughters. Thus the x-nodes can intuitively
be seen as virtual hyperonyms of their daughters. It is the presence of x-nodes
which guarantees that the structure is a semilattice (i.e. all nodes with intersecting
feature sets are guaranteed to be dominated by a node carrying the intersection).
In the semilattice, two senses are assumed to be more closely related the more of
their features they share, i.e. the shorter the distance is to their common
dominating node.
Returning now to the actual corpus-based lattice in Figure 1, it is defined
by the feature sets on the nodes according to the principles just described. For
instance, mat2 is associated with the singleton feature set {[mat2|supper3]},
kveldsmat1 with {[mat2|supper3], [kveldsmat1|meal1]}, and aftensmat2 with
{[mat2|supper3], [kveldsmat1|meal1], [lunsj2|meal1], [aftensmat2]}. In Figure 1,
x-nodes with only one feature (such as x1) are displayed with the feature beside

Derivation of thesaurus entries

Derivation of thesaurus entries involves determining subsenses, hyperonyms,

near-synonyms and hyponyms of each sense on the basis of the information in the
semilattices. The semilattices are in some cases extremely complex, showing
intricate networks of connections between the word senses. Much of this
complexity should probably be considered as noise resulting from accidental
biases and gaps in the corpus. In the transition to a wordnet database or a
thesaurus we therefore want to abstract away from much detail in the lattices, and
this can obviously be done in more than one way. We presently use two
parameters to regulate the generation of thesaurus entries: OverlapThreshold and
The value of the parameter OverlapThreshold decides the granularity of
the division into subsenses in the thesaurus entry. This does not concern the
division into main senses described above (tak1, tak2, tak3 etc.) those senses
usually end up in different semantic fields and hence in different lattices. Division
into subsenses is a further subdivision of each sense into related shades of
meaning. We assume that there is no final and universal answer to the question of
how many related subsenses a word sense has (cf. Kilgarriff 1997). By means of
the parameter OverlapThreshold we may attune that kind of semantic granularity
to our purposes.
322 Helge Dyvik

We may illustrate the procedure by means of an example: the adjective

sweet. Figure 9 shows a small sublattice of the large lattice including the sense

Figure 9: A sublattice containing sweet1

Sweet1 is also dominated by several nodes outside this sublattice; size limitations
prevent displaying a more complete graph. The node sweet1 is associated with the
following feature set: {[god3|good1], [fin2|nice2], [pen1|gentle3],
[vakker1|soft2], [snill1|pleasant1], [deilig1|splendid3], [frisk4|sweet1],
[blid3|sweet1]}. Finding hyperonyms, near-synonyms and hyponyms of sweet1
now first involves considering which other senses in the lattice share features
with sweet1. The features in question are assigned to the following senses in the
complete semilattice (we will refer to the sets of senses as the denotations of the

(able1 accurate1 adept1 adequate2 affectionate1 all_right2 amiable2 appropriate5
attractive4 beautiful2 beneficial1 benign3 bright2 burning3 charming2 clean1 clear1 close3
comfortable2 comforting3 competent2 confident2 correct1 cozy2 cute1 decent2 delicious1
delightful2 detailed3 dishy1 easy1 efficient2 elegant3 excellent2 fair2 fancy1 favourable1
fine1 firmA1 first-class3 first-rate2 fit3 fortunate1 fresh3 friendly2 full2 genuine2 good1
handsome2 happy3 healthy2 high3 hot2 joyful2 kind1 kindly1 long3 lovely2 lucky2
magnificent3 marvellous1 neat2 nice2 okay1 peaceful1 perfect3 placid2 pleasant1 pleased2
pleasing1 pleasurable1 plentiful1 plenty1 polite2 positive1 pretty2 proficient1 quite_certain1
real2 reassuring2 respectable3 right2 ripe1 safe2 satisfactory1 satisfying1 secure2 sizeable1
smart2 smooth3 soft2 solid2 sound2 spectacular2 steady1 strong3 successful2 suited1
superb2 superior5 sure1 sweet1 talented2 thorough1 tidy1 well2 whole2 wholesome1
wonderful3 worthy2)
Translations as semantic mirrors 323

(attractive4 beautiful2 breathtaking2 charming2 comfortable2 cute1 delicate3 dishy1 easy1
elegant3 enchanting1 excellent2 fancy1 fine1 first-class3 gentle3 glorious4 graceful2
handsome2 impressive2 lovely2 magnificent3 marvellous1 neat2 nice2 okay1 perfect3
pleasurable1 polite2 pretty2 pure2 slight3 smart2 soft2 splendid3 sweet1 thin2 wonderful3)

(attractive4 beautiful2 charming2 clean1 cute1 dishy1 elegant3 enchanting1 fancy1 fine1
first-class3 formal1 gentle3 graceful2 handsome2 lovely2 neat2 pleasant1 polite2 pretty2
soft2 sweet1 tidy1)

(attractive4 charming2 cute1 delightful2 dishy1 enchanting1 fair2 fancy1 graceful2
handsome2 lovely2 magnificent3 mild2 ornate2 pleasant1 pleasurable1 pretty2 soft2 sweet1)

(all_right2 amiable2 benign3 friendly2 good-humoured1 good-natured3 jolly1 kind1 kindly1
mild3 pleasant1 pleasing1 polite2 smiling2 sweet1)

(beautiful2 charming2 cute1 enchanting1 delicious1 delightful2 pleasureable1 splendid3

(all_right2 brisk5 eager2 fit3 fresh3 healthy2 new1 pert2 sweet1 well2)

(amiable2 blithe3 cheerful4 cheery1 good-humoured1 good-natured3 jolly1 kind1 kindly1
merry1 mild3 smiling2 sweet1)

The most general features, [god3|good1], [fin2|nice2] and [pen1|gentle3], denote

a large number of senses each especially [god3|good1]. This reflects the fact
that they are constructed from wide senses such as god3 and good1. As a result,
many of the senses carrying those features are not sufficiently close to sweet1 to
be called near-synonyms. Therefore we do not want to consider all the senses
sharing such general features as near-synonyms of each other. The value of the
parameter SynsetLimit defines the maximal size which the set denoted by a
feature can have in order to be included among the near-synonyms. With
SynsetLimit = 20, the sets of senses denoted by [god3|good1], [fin2|nice2] and
[pen1|gentle3] are not included among the near-synonyms of sweet1 (unless they
are denoted by other features as well). On the other hand, good1, nice2 and
gentle3 the English senses from which the wide features were constructed are
recorded as hyperonyms of sweet1.
Intuitively, the features represent different aspects of the sense sweet1,
and the question now is whether those aspects are sufficiently different from
each other to be considered different subsenses. Their distinctness can be
measured in terms of the degree of overlap among the sets of senses they denote.
If the set of features denote strongly overlapping sets of senses, the favoured
conclusion is that there is no division into subsenses. On the other hand, the less
the denotations of the features overlap, the more a division into subsenses is
324 Helge Dyvik

motivated. The degree of overlap in a set of sets can be measured as a value

between 0 and 1, with 0 indicating no overlap and 1 full overlap (full overlap
meaning that for each set s, every set either includes s or is included in s). In
calculating the degree of overlap among feature denotations we disregard the
sense sweet1 itself, since it is necessarily a member of all the feature denotations.
The value of the parameter OverlapThreshold is a number between 0 and
1. A feature belongs to subsense n if the overlap between its denotation and the
denotation of at least one other feature in subsense n is equal to or greater than
OverlapThreshold. Hence, the higher the OverlapThreshold, the more subsenses
tend to be distinguished.
The two last features in the set above are constructed from sweet1 itself,
and we assume that senses sharing this feature are hyponyms of sweet1: they have
inherited the feature from sweet1 and must have been ranked lower in the
semantic field.
Setting the parameter values at SynsetLimit = 20 and OverlapThreshold =
0.05, we consequently generate the following entry for sweet:

OverlapThreshold = 0.05:

Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: blid, deilig, fin, god, pen,
snill, st, vakker.)
amiable, amused, attractive, beautiful, benign, blithe, charming,
cheerful, cheery, cute, delicious, delightful, dishy, easygoing,
enchanting, fair, fancy, friendly, good-humoured, good-natured,
graceful, handsome, jolly, kind, kindly, lovely, magnificent, merry,
mild, ornate, picturesque, pleasant, pleasing, pleasurable, polite,
pretty, smiling, soft.
Hyponyms: all_right.

Subsense (ii) includes near-synonyms referring to personal character (e.g.

amiable) as well as synonyms referring to appearance (e.g. beautiful). Raising the
OverlapThreshold to 0.1 leads to the separation of those two kinds of near-

OverlapThreshold = 0.1:

Hyperonyms: gentle, good, nice.
Subsense (i) (Norwegian: frisk.)
Hyponyms: all_right, brisk, crisp, eager, fit, fresh,
healthy, new, pert, well.
Subsense (ii) (Norwegian: deilig, fin, god, pen, st,
attractive, beautiful, charming, cute, delicious, delightful, dishy,
enchanting, fair, fancy, graceful, handsome, lovely, magnificent,
Translations as semantic mirrors 325

ornate, picturesque, pleasant, pleasurable, pretty, soft.

Subsense (iii) (Norwegian: blid, snill.)
amiable, amused, benign, blithe, cheerful, cheery, easygoing,
friendly, good-humoured, good-natured, jolly, kind, kindly, merry,
mild, pleasant, pleasing, polite, smiling.
Hyponyms: all_right.

3. Conclusion

We have given an illustration of the method employed in the project From

Parallel Corpus to Wordnet. The method is implemented in a computer program
taking words with their sets of translations from the parallel corpus as input and
returning semantic lattices and thesaurus entries as output. The presentation has
been based on examples of the results obtained on the basis of manually extracted
data from the parallel corpus ENPC.
The examples have only served as illustrations and have not been
subjected to a critical analysis in this paper. An important task within the project
is the evaluation of the results, part of which involves comparisons with existing
sources like the Princeton Wordnet and Merriam-Webster's Thesaurus. Another
task is the alignment of the corpus ENPC at word level, which will make it
possible to extract lemmas with their sets of translations automatically.
Based on our results so far we feel able to conclude that the method merits
further exploration.


1. The analyses in this paper are based on corpus data resulting from work by
Martha Thunes, Gunn Inger Lyse and the author. The software producing the
semantic analyses has been developed by the author and reimplemented and
improved by Paul Meurer. I am grateful to Martha Thunes for useful
comments on an earlier version of this article.


Aijmer, K., B. Altenberg, and M. Johansson (eds.). 1996. Languages in contrast.

Papers from a symposium on text-based cross-linguistic studies in Lund,
4-5 March 1994, 73-85. Lund: Lund University Press.
Diab, M. and P. Resnik (2002): An Unsupervised Method for Word Sense
Tagging using Parallel Corpora. 40th Anniversary Meeting of the
Association for Computational Linguistics (ACL-02), Philadelphia, July,
Dyvik, H. (1998a): A translational basis for semantics. In: Stig Johansson and
Signe Oksefjell (eds.) 1998. 51-86.
326 Helge Dyvik

Dyvik, H. (1998b): Translations as semantic mirrors. In Proceedings of Workshop

W13: Multilinguality in the lexicon II. 24.44, Brighton, UK. The 13th
biennial European Conference on Artyificial Intelligence ECAI 98.
Fellbaum, C. (ed.) (1998), WordNet. An electronic lexical database. Cambridge:
The MIT Press.
Grefenstette, G. (1994): Explorations in Automatic Thesaurus Discovery,
Boston/Dordrecht/London: Kluwer.
Hearst, M. A. (1998): Automated Discovery of WordNet Relations. In Fellbaum
(1998). 131 - 151.
Ide, N. (1999): Word sense disambiguation using cross-lingual information. In:
Proceedings of ACH-ALLC '99 International Humanities Computing
Conference, Charlottesville, Virginia. http://jefferson.village.virginia.edu
Ide, N. (1999): Parallel translations as sense discriminators. In: SIGLEX99:
Standardizing Lexical Resources, ACL99 Workshop, College Park,
Maryland. 52-61.
Ide, N., T. Erjavec and D. Tufis (2002), Sense discrimination with parallel
corpora. Proceedings of ACL'02 Workshop on Word Sense
Disambiguation: Recent Successes and Future Directions, Philadelphia,
Johansson, S. (1997), Using the English-Norwegian Parallel Corpus a corpus
for contrastive analysis and translation studies, in: B. Lewandowska-
Tomaszczyk and P.J. Melia (eds), Practical applications in language
corpora. Lodz: Lodz University. 282-296.
Johansson, S., J. Ebeling, and K. Hofland (1996), Coding and aligning the
English-Norwegian Parallel Corpus, in: K. Aijmer, B. Altenberg and M.
Johansson (eds), Languages in contrast. Papers from a symposium on text-
based cross-linguistic studies in Lund, 4-5 March 1994. Lund: Lund
University Press. 87-112.
Johansson, S. and S. Oksefjell (eds.) (1998): Corpora and Crosslinguistic
Research: Theory, Method, and Case Studies. Amsterdam: Rodopi.
Kilgarriff, A. (1997), I don't believe in word senses, Computers and the
Humanities 31 (2): 91-113.
Resnik, P.S. and D. Yarowsky (1997), A perspective on word sense
disambiguation methods and their evaluation. Position paper presented at
the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics:
Why, What, and How?, held April 4-5, 1997 in Washington, D.C., USA in
conjunction with ANLP-97.
Turcato, D. (1998): Automatically Creating Bilingual Lexicons for Machine
Translation from Bilingual Text. In: Proceedings of the 17th International
Conference on Computational Linguistics (COLING-98) and of the 36th
Annual Meeting of the Association for Computational Linguistics (ACL-
98), Montreal.
Physical contact verbs in English and Swedish from the
perspective of crosslinguistic lexicology

ke Viberg

Uppsala University


The major English physical contact verbs strike, hit and beat are compared with
their primary Swedish translation equivalent sl on the basis of data from the
English-Swedish Parallel Corpus. The analysis is carried out within two
theoretical frameworks concerning the underlying conceptual representation and
the linguistic cues that can be used for word sense identification. In addition to a
rather detailed account of points of contrast in the fairly extensive patterns of
polysemy that are characteristic of the verbs, an attempt is made to provide a
general characterisation in contrastive terms. In comparison with the English
verbs, the conceptual representation of sl is grounded more firmly in
sensorimotor experience and the fact that hitting prototypically is a hand action.
As in other languages such as Chinese, the main verb of hitting in Swedish has
extended senses that refer to other types of hand actions. With respect to word
sense identification, the semantic classification of the subject and object is a
prominent cue for the distinction between the major meanings of the main
physical contact verbs but to various degrees in English and Swedish. Several
examples are also given of cases where linguistic cues are not sufficient and
disambiguation must be based on topical or pragmatic information.

1. Introduction

This paper will present a contrastive lexical analysis of the major English
physical contact verbs strike, hit and beat in comparison to the Swedish verb sl
which is the closest equivalent to all three English verbs. The semantic analysis is
based on an earlier paper on the verbs of physical contact in Swedish (Viberg
1999). The verb sl has a complex pattern of polysemy and many extended
meanings which require a wide range of translations in English. The rich
polysemy tends to be characteristic of verbs with the same prototypical meaning
across a wide range of languages (for Chinese, see Gao 2001).
The comparison of Swedish and English that will be presented in this
paper is based on the English-Swedish Parallel Corpus, ESPC (Aijmer et al. 1996,
Altenberg and Aijmer 2000), which contains original text samples in English and
Swedish together with their translations. The text samples represent both fiction
and non-fiction and the total number of words from each source language is about
328 ke Viberg

half a million. The corpus will be used for contrastive purposes, whereas matters
such as translation problems or the general characteristics of translated texts will
not be dealt with (see Johansson 1998 on the various uses of parallel corpora).
The aim of the present paper is primarily to present a systematic
contrastive account of the data but the general theoretical significance will be
briefly indicated within two frameworks. The first concerns the conceptual
representation of lexical items accounting for the patterns of polysemy and their
cognitive motivations. This will be oriented towards cognitive semantics and in
particular prototype theory (Taylor 1989). Another important cognitive semantic
idea is the notion of embodiment which implies that our concepts to a large extent
are shaped by our bodies and brains (Lakoff and Johnson 1999). In particular,
bodily movement will be shown to play an important role for the conceptual
representation of the main verbs of physical contact.
The second framework concerns the contextual representation of lexical
items and the process of word sense identification accounting for the interaction
between word meaning and cues in the linguistic context in the disambiguation
process and in the choice of translation equivalents. According to Miller and
Leacock (2000), each meaning of a word must be associated with a contextual
representation, which can be either local or topical. Experimental work has shown
that people can identify various meanings of a polysemous word with a relatively
high degree of success if they are presented with a window of 2 words of
context, but local context is not always enough. Local cues turned out to be very
precise when they occurred but all too often they simply did not occur (op. cit.
p. 156). Miller and Leacock also give an account of the use of topical context
which refers to the general topic of a text or conversation. Topical context has
been tested with various statistical classifiers run on computers. In one such
experiment, only the words occurring in the same sentence as the target word
were presented (in random order). With three or more senses to distinguish of
words such as line and serve the statistical classifiers reached close to 75%
correctness. Human subjects who were presented with lists of words co-occurring
with line in reverse alphabetical order only managed to identify the correct sense
a little better